## Introduction


=================================================

Milestone 3

Name  : Syarief Qayum Suaib

Batch : FTDS-043-RMT

Objective : Milestone 3 evaluates understanding and application of the data engineering and analysis tools and concepts learned in Phase 2, including Apache Airflow, Great Expectations, NoSQL databases (Elasticsearch), and data visualization with Kibana. The main task is to build a data pipeline that extracts data from PostgreSQL, cleans and validates it, and loads it into Elasticsearch for visualization in Kibana.

=================================================

## Import Libraries

In [1]:
import pandas as pd
import great_expectations as gx

## Data Loading

Load the cleaned data from `P2M3_syarief_qayum_cleaned.csv`.

In [2]:
# Load CSV cleaned data
df_clean = pd.read_csv("P2M3_syarief_qayum_cleaned.csv")
print(f"Loaded {len(df_clean)} rows from cleaned CSV.")

# Use the default context and read the dataset
context = gx.get_context()  # Uses the default context 
validator = context.sources.pandas_default.read_dataframe(df_clean)
print("Great Expectations Validator created.")

Loaded 102050 rows from cleaned CSV.
Great Expectations Validator created.


## Great Expectations

We validate the cleaned data with Great Expectations. This step ensures that the data used in this project meets our predefined quality standards before it is loaded into Elasticsearch for further analysis.

### Expectation 1 - `expect_column_values_to_be_unique` (`id`)


The first expectation is `expect_column_values_to_be_unique` on the `id` column. Its purpose is to ensure that every value in `id` is unique, so that each listing has a single identifier.

In [4]:
# Validate and Assert result

results = validator.expect_column_values_to_be_unique(column="id")
print(results)
assert results.success # Check if expectation passed

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "id",
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "element_count": 102050,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
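
For intuition, the uniqueness rule can be approximated in plain pandas. This is a minimal sketch on made-up toy data, not the Great Expectations implementation; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only (not from the real dataset)
toy = pd.DataFrame({"id": [101, 102, 102, 103]})

# keep=False marks every row whose id occurs more than once
dup_mask = toy["id"].duplicated(keep=False)

# Distinct id values that appear on more than one row
duplicate_ids = sorted(toy.loc[dup_mask, "id"].unique().tolist())

print(toy["id"].is_unique)  # False for this toy frame
print(duplicate_ids)        # [102]
```

On the real data, an empty `duplicate_ids` list would be consistent with the `unexpected_count` of 0 reported above.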


### Expectation 2 - `expect_column_values_to_be_between` (`price`)


The second expectation is `expect_column_values_to_be_between` on the `price` column.
Its purpose is to ensure that every value in `price` lies between a minimum of 0 and a maximum of 30000 (USD).

A passing result confirms that all prices fall within the range we consider realistic for this dataset.

In [5]:
# Validate and Assert result

results = validator.expect_column_values_to_be_between(column="price", min_value=0, max_value=30000)
print(results)
assert results.success

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "price",
      "min_value": 0,
      "max_value": 30000,
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "element_count": 102050,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
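
As a cross-check outside Great Expectations, the same range rule can be sketched in plain pandas. The toy `price` values below are made up for illustration and are not taken from the dataset. `Series.between` is inclusive on both ends by default, which is assumed here to match the non-strict bounds used above.

```python
import pandas as pd

# Toy data for illustration only (not from the real dataset)
toy = pd.DataFrame({"price": [50.0, 1200.0, 30000.0, -5.0]})

# True where the price lies in the closed interval [0, 30000]
in_range = toy["price"].between(0, 30000)

# Values that would count as "unexpected" under the range rule
unexpected = toy.loc[~in_range, "price"].tolist()
print(unexpected)  # [-5.0]
```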


### Expectation 3 - `expect_column_values_to_be_in_set` (`room_type`)

The third expectation is `expect_column_values_to_be_in_set` on the `room_type` column.
Its purpose is to ensure that every value in `room_type` is one of **"Entire home/apt", "Private room", "Shared room", "Hotel room"**.

A passing result confirms that the column contains no category outside this set.

In [6]:
# Validate and Assert result

expected_types = ["Entire home/apt", "Private room", "Shared room", "Hotel room"] 
results = validator.expect_column_values_to_be_in_set(column="room_type", value_set=expected_types)
print(results)
assert results.success # Check if expectation passed

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "room_type",
      "value_set": [
        "Entire home/apt",
        "Private room",
        "Shared room",
        "Hotel room"
      ],
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "element_count": 102050,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


### Expectation 4 - `expect_column_values_to_be_in_type_list` (`price`)

The fourth expectation is `expect_column_values_to_be_in_type_list` on the `price` column.
Its purpose is to ensure that the data type of `price` is a **float** (`float` or `float64`).

In [7]:
# Validate and Assert result

results = validator.expect_column_values_to_be_in_type_list(column="price", type_list=["float", "float64"])
print(results)
assert results.success

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_type_list",
    "kwargs": {
      "column": "price",
      "type_list": [
        "float",
        "float64"
      ],
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


### Expectation 5 - Additional: `expect_column_value_lengths_to_be_between` (`house_rules`)

The fifth expectation is `expect_column_value_lengths_to_be_between` on the `house_rules` column.
Its purpose is to ensure that the character length of each `house_rules` entry stays below 2000 (`strict_max=True` makes the upper bound exclusive), which prevents excessively long text from entering the dataset.

In [8]:
# Validate and Assert result

results = validator.expect_column_value_lengths_to_be_between(
    column="house_rules",
    min_value=0,
    max_value=2000,
    strict_max=True,
)
print(results)
assert results.success

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_value_lengths_to_be_between",
    "kwargs": {
      "column": "house_rules",
      "min_value": 0,
      "max_value": 2000,
      "strict_max": true,
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "element_count": 102050,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 51840,
    "missing_percent": 50.798628123468895,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
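
The result above reports about 50.8% missing values in `house_rules` and still succeeds, which suggests that missing entries are skipped rather than counted as failures. A minimal pandas sketch of that behavior, on made-up toy data (the `toy` frame and variable names are assumptions for illustration, not the Great Expectations implementation):

```python
import pandas as pd

# Toy data for illustration only: one normal rule, one missing value,
# one overly long text, and one empty string
toy = pd.DataFrame({"house_rules": ["No smoking", None, "x" * 2500, ""]})

# .str.len() yields a missing value (NaN/NA) wherever the text is missing
lengths = toy["house_rules"].str.len()

missing_count = int(lengths.isna().sum())

# Only non-missing rows are checked; strict max means length must be < 2000
non_missing = lengths.dropna()
too_long_rows = non_missing[non_missing >= 2000].index.tolist()

print(missing_count)   # 1
print(too_long_rows)   # [2]
```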


### Expectation 6 - Additional: `expect_table_column_count_to_be_between`

The sixth expectation is `expect_table_column_count_to_be_between`, which applies to the table as a whole rather than to a single column.
Its purpose is to ensure that the number of columns stays within the bounds we defined (between 1 and 30).

This check is useful when we know roughly how many columns to expect, because it guards against unexpected columns appearing in the dataset.

In [9]:
# Validate and Assert result

results = validator.expect_table_column_count_to_be_between(min_value=1, max_value=30)
print(results)
assert results.success



Calculating Metrics:   0%|          | 0/3 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_table_column_count_to_be_between",
    "kwargs": {
      "min_value": 1,
      "max_value": 30,
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 22
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


### Expectation 7 - Additional: `expect_column_distinct_values_to_contain_set` (`review_rate_number`)

The last expectation is `expect_column_distinct_values_to_contain_set` on the `review_rate_number` column.
Its purpose is to ensure that every rating in **[1, 2, 3, 4, 5]** appears at least once in `review_rate_number`.

A passing result confirms that all five rating levels are present. It does not restrict other values: the result below also reports 0.0 (319 rows), which this expectation allows.

In [10]:
# Validate and Assert result

results = validator.expect_column_distinct_values_to_contain_set(
    column="review_rate_number",
    value_set=[1, 2, 3, 4, 5]
)
print(results)
assert results.success

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_distinct_values_to_contain_set",
    "kwargs": {
      "column": "review_rate_number",
      "value_set": [
        1,
        2,
        3,
        4,
        5
      ],
      "batch_id": "default_pandas_datasource-#ephemeral_pandas_asset"
    },
    "meta": {}
  },
  "result": {
    "observed_value": [
      0.0,
      1.0,
      2.0,
      3.0,
      4.0,
      5.0
    ],
    "details": {
      "value_counts": [
        {
          "value": 0.0,
          "count": 319
        },
        {
          "value": 1.0,
          "count": 9184
        },
        {
          "value": 2.0,
          "count": 22969
        },
        {
          "value": 3.0,
          "count": 23129
        },
        {
          "value": 4.0,
          "count": 23199
        },
        {
          "value": 5.0,
          "count": 23250
        }
      ]
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": 
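
"Contains the set" and "is limited to the set" are different checks. A minimal plain-Python sketch of the difference, using the distinct values reported in the result above (the variable names are illustrative):

```python
# Ratings that must be present
required = {1, 2, 3, 4, 5}

# Distinct values reported in the result above
observed = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0}

# "Contain set": every required rating appears among the observed values.
# Python treats 1 and 1.0 as equal, so ints and floats compare cleanly here.
contains_required = required <= observed

# "In set": every observed value is one of the allowed ratings
only_allowed = observed <= required

# Values present in the data but outside the required set
extras = sorted(observed - required)

print(contains_required)  # True
print(only_allowed)       # False
print(extras)             # [0.0]
```

The second check corresponds in spirit to `expect_column_values_to_be_in_set`, the expectation used for `room_type` earlier, and would be the one to use if ratings outside 1 to 5 should be rejected.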

## Conclusion

By running these 7 expectations, we can verify that the cleaned data meets the criteria we defined. This gives us confidence in the cleanliness and reliability of the data before it is used in the next step, which matters especially when the dataset is large. Validation is a crucial part of the analysis pipeline because it catches data errors early and helps preserve data integrity.

The next step after this quality check is visualization and analysis using Kibana.
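
One way to make this early-prevention step explicit is to collect each expectation's `success` flag and stop before loading when any of them is false. The sketch below uses plain Python; `failed_checks`, `ensure_all_passed`, and the example dictionary are hypothetical names introduced for illustration and are not part of Great Expectations or of the pipeline above.

```python
def failed_checks(outcomes):
    """Return the names of checks whose success flag is False."""
    return [name for name, ok in outcomes.items() if not ok]


def ensure_all_passed(outcomes):
    """Raise ValueError if any check failed; otherwise return None."""
    failed = failed_checks(outcomes)
    if failed:
        raise ValueError("Validation failed: " + ", ".join(failed))


# Hypothetical outcomes, e.g. built from each result's `.success` attribute
example = {"id_unique": True, "price_in_range": False}
print(failed_checks(example))  # ['price_in_range']
```

Compared with a separate `assert` per cell, gathering the flags first reports every failing check at once instead of stopping at the first one.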