### Step 1: Load the dataset

In [7]:
import pandas as pd
from pymongo import MongoClient
import matplotlib.pyplot as plt

file_path = 'Ola_Customer_review.csv' 
data = pd.read_csv(file_path)

print("Dataset loaded successfully.")
print(f"Initial shape of data: {data.shape}")

Dataset loaded successfully.
Initial shape of data: (103817, 13)


### Step 2: Connect to MongoDB

In [9]:
client = MongoClient("mongodb://localhost:27017/") 
db = client["OlaReviews"]
collection = db["customer_reviews"]

collection.insert_many(data.to_dict("records"))

print("Data inserted into MongoDB successfully.")
row_count = collection.count_documents({})
print(f"Number of rows in MongoDB: {row_count}")

Data inserted into MongoDB successfully.
Number of rows in MongoDB: 415268
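
The count above (415268) is exactly four times the 103817 rows loaded in Step 1, which suggests the insert cell was run more than once against the same collection. A minimal sketch of a rerun-safe load follows, assuming it is acceptable to discard whatever the collection already holds. The helper name `reload_collection` is hypothetical; `delete_many`, `insert_many`, and `count_documents` are the standard pymongo collection methods.

```python
def reload_collection(collection, records):
    """Replace the contents of `collection` with `records` and return the new count.

    `collection` is expected to behave like a pymongo Collection.
    """
    # Remove documents left behind by earlier runs of the notebook
    collection.delete_many({})
    # insert_many rejects an empty list, so skip the call when there is nothing to insert
    if records:
        collection.insert_many(records)
    return collection.count_documents({})


# Hypothetical usage with the objects defined in this notebook:
# row_count = reload_collection(db["customer_reviews"], data.to_dict("records"))
# print(f"Number of rows in MongoDB: {row_count}")
```

With this pattern the reported count should match the DataFrame's row count no matter how many times the cell is executed.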


### Step 3: Clean the data (Silver Layer)

In [11]:
# Fill missing values: column mean for numeric columns, most frequent value otherwise.
# Assigning the result back avoids calling fillna(..., inplace=True) on a column
# selection, which pandas warns will stop working in pandas 3.0.
for column in data.columns:
    if pd.api.types.is_numeric_dtype(data[column]):
        data[column] = data[column].fillna(data[column].mean())
    else:
        data[column] = data[column].fillna(data[column].mode()[0])

# Drop exact duplicate rows
data_cleaned = data.drop_duplicates()
print(f"Shape after cleaning: {data_cleaned.shape}")

# Store the cleaned data in its own collection
collection_cleaned = db["customer_reviews_cleaned"]
collection_cleaned.insert_many(data_cleaned.to_dict("records"))
print("Cleaned data inserted into MongoDB.")

Shape after cleaning: (103817, 13)
Cleaned data inserted into MongoDB.
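
To see what this imputation does, here is a small self-contained sketch on a toy frame. The helper name `impute_missing` and the toy values are illustrative assumptions, not part of the dataset. It works on a copy and assigns the result of `fillna` back to each column rather than calling it with `inplace=True` on a column selection.

```python
import pandas as pd


def impute_missing(df):
    """Return a copy of `df` with NaNs filled: mean for numeric columns, mode otherwise."""
    out = df.copy()
    for column in out.columns:
        if pd.api.types.is_numeric_dtype(out[column]):
            out[column] = out[column].fillna(out[column].mean())
        else:
            modes = out[column].mode()
            # mode() is empty when the whole column is missing; leave it untouched then
            if not modes.empty:
                out[column] = out[column].fillna(modes[0])
    return out


# Toy data (made up for illustration): one missing value per column
toy = pd.DataFrame({
    "country_code": ["in", None, "in", "us"],
    "rating": [5.0, 1.0, None, 3.0],
})
filled = impute_missing(toy)
print(filled)
```

In the toy frame the missing `country_code` becomes `"in"` (the most frequent value) and the missing `rating` becomes 3.0 (the mean of 5, 1, and 3). Note that mean imputation turns an integer rating column into floats, which may matter for later grouping or display.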


### Step 4: Create Aggregated Datasets (Gold Layer)

In [30]:
print("Dataset columns available for grouping:", data_cleaned.columns.tolist())

# Aggregate only if both 'country_code' and 'rating' are present
if "country_code" in data_cleaned.columns and "rating" in data_cleaned.columns:
    # Mean rating per country code (the dataset has no 'City' column)
    agg_data = data_cleaned.groupby("country_code")["rating"].mean().reset_index()

    # Store the aggregated result in its own collection, reusing the existing `db` handle
    collection_aggregated = db["customer_reviews_aggregated"]
    collection_aggregated.insert_many(agg_data.to_dict("records"))
    print("Aggregated data inserted into MongoDB.")
else:
    print("Required columns for aggregation ('country_code' and 'rating') not found in dataset.")


Dataset columns available for grouping: ['source', 'review_id', 'user_name', 'review_title', 'review_description', 'rating', 'thumbs_up', 'review_date', 'developer_response', 'developer_response_date', 'appVersion', 'laguage_code', 'country_code']
Aggregated data inserted into MongoDB.
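
As an alternative to aggregating in pandas, the same per-country mean could be computed inside MongoDB from the cleaned collection. The sketch below only builds the pipeline; the helper name `build_avg_pipeline` is hypothetical, while `$group`, `$avg`, `$sort`, and `$out` are standard aggregation stages and operators. Because `$out` replaces the target collection on every run, rerunning the cell would not append duplicate aggregates.

```python
def build_avg_pipeline(group_field, value_field, out_collection):
    """Build an aggregation pipeline that averages `value_field` per `group_field`."""
    return [
        # One output document per distinct group value; the group key lands in `_id`
        {"$group": {"_id": f"${group_field}", value_field: {"$avg": f"${value_field}"}}},
        # Sort by group key for stable, readable output
        {"$sort": {"_id": 1}},
        # Write the result to `out_collection`, replacing any previous contents
        {"$out": out_collection},
    ]


pipeline = build_avg_pipeline("country_code", "rating", "customer_reviews_aggregated")

# Hypothetical usage with the objects defined in this notebook:
# db["customer_reviews_cleaned"].aggregate(pipeline)
```

The resulting documents differ slightly in shape from the pandas version: the country code is stored under `_id` rather than in a `country_code` field.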
