<h1 align="center">Cleaning and Preparing Reviews Data</h1>

*******************************************************************************************************************************

<h2>1. Introduction</h2>

### Data Cleaning Overview

In this project, reviews were collected with the `fetch_reviews.py` script, which extracted reviews for each location listed in the `places_processed.csv` file. Because of limitations in the Google API, at most 5 reviews could be extracted per place. The collected data then went through several cleaning steps to prepare it for further analysis, including:

- **Handling missing values**: Rows with a missing value in the `Response` column were removed.
- **Removing duplicates**: Duplicate reviews were removed, first by Review ID and then by the combination of Place ID, Author, and Text.
- **Text length and word count analysis**: The length (in characters) and word count of each review were calculated and added as new columns in the dataset.
- **Data reordering**: Columns were reordered to ensure clarity and consistency in the dataset.

These steps ensure that the review data is consistently typed, free of duplicate entries, and ready for analysis.


<h2>2. Initialization</h2>

In [None]:
# Library Imports
import csv
import numpy as np
import pandas as pd
import os
import re

In [None]:
# Configures Pandas display settings: shows all columns in DataFrames and suppresses chained assignment warnings  
pd.set_option("display.max_columns", None)  
pd.options.mode.chained_assignment = None  

<h2>3. Load the Dataset</h2>

In [None]:
PATH = os.path.abspath(os.path.join("..", "data", "raw", "reviews_raw.csv"))

In [None]:
reviews_df = pd.read_csv(PATH, sep=";", header=0, encoding="utf-8")

In [None]:
reviews_df.shape

In [None]:
reviews_df.columns

In [None]:
reviews_df.head(3)
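
The loading step can be tried without the raw file. The sketch below reads a small semicolon-delimited sample from memory with the same `sep` and `header` options as above. The two sample rows and the names `sample_csv` and `sample_df` are invented for illustration; the column names are the ones listed in the type-conversion dictionary of section 4.1.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout of reviews_raw.csv (all values are invented)
sample_csv = (
    "Place ID;Place Name;Review ID;Author;Rating;Text;Time;Date;Response\n"
    "p1;Cafe A;r1;Ana;5;Great coffee;1700000000;2023-11-14 22:13:20;Thanks!\n"
    "p1;Cafe A;r2;Ben;3;;1700003600;2023-11-14 23:13:20;\n"
)

# sep=";" matches the delimiter of the raw file; header=0 uses the first row as column names
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";", header=0)

print(sample_df.shape)         # (rows, columns)
print(sample_df.isna().sum())  # empty fields are read as NaN
```

Checking the shape and the per-column count of missing values right after loading makes it easier to spot a wrong delimiter, which typically shows up as a single very wide column.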

<h2>4. Data Cleaning</h2>

<h3>4.1 Convert Columns Data Types</h3>

In [None]:
def data_types(df):
    """
    Convert columns of the given DataFrame to specified data types.

    This function takes a DataFrame and converts its columns to specific data types. The data types 
    for columns are defined in the col_dict dictionary. Additionally, it ensures that the 'Text' column 
    is of string type, with NaN values replaced by an empty string. The 'Date' column is also converted 
    to datetime format.

    Parameters:
    df (pandas.DataFrame): The input DataFrame containing the review data.

    Returns:
    pandas.DataFrame: The DataFrame with the columns converted to specified data types.
    """
    
    # Dictionary defining the desired data types for each column
    col_dict = {
        'Place ID': 'str',     # 'Place ID' should be a string
        'Place Name': 'str',   # 'Place Name' should be a string
        'Review ID': 'str',    # 'Review ID' should be a string
        'Author': 'str',       # 'Author' should be a string
        'Rating': 'int',       # 'Rating' should be an integer
        'Text': 'str',         # 'Text' should be a string
        'Time': 'int',         # 'Time' should be an integer
        'Date': 'str',         # 'Date' should be initially a string for conversion
        'Response': 'str'      # 'Response' should be a string
    }
    
    # Replace missing review texts with an empty string *before* casting to str;
    # otherwise astype(str) can turn NaN into the literal string "nan"
    df["Text"] = df["Text"].fillna("")

    # Convert the columns to the specified data types
    df = df.astype(col_dict)

    # Convert the 'Date' column to datetime format, handling errors as 'coerce' (invalid dates will be converted to NaT)
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    
    # Return the DataFrame with converted data types
    return df

In [None]:
# Apply the data_types function to the reviews DataFrame
reviews_df = data_types(reviews_df)

In [None]:
# Check the new data types of the columns
reviews_df.dtypes
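
Two details of the conversion are easy to check on a toy example. First, missing texts should be filled before casting to string, because in pandas 1.x and 2.x `astype(str)` turns `NaN` into the literal text `"nan"`. Second, `errors="coerce"` turns values that do not match the format into `NaT` instead of raising. The sketch below uses an invented two-row frame (`toy_df` is a hypothetical name), not the real data.

```python
import numpy as np
import pandas as pd

# Invented toy data: one missing text and one malformed date
toy_df = pd.DataFrame({
    "Text": ["Great coffee", np.nan],
    "Date": ["2023-11-14 22:13:20", "not a date"],
})

# Fill missing texts first, then cast, so no NaN is left to be stringified
toy_df["Text"] = toy_df["Text"].fillna("").astype(str)

# Values that do not match the format become NaT instead of raising an error
toy_df["Date"] = pd.to_datetime(toy_df["Date"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(toy_df["Text"].tolist())
print(toy_df["Date"].isna().tolist())
```

Counting the `NaT` values after the conversion is a quick way to see how many dates failed to parse.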

<h3>4.2 Text Length and Word Count Analysis</h3>

In [None]:
# Add a new column 'Review Length' that contains the number of characters in each review text
reviews_df["Review Length"] = reviews_df["Text"].apply(len)

In [None]:
# Add a new column 'Word Count' that contains the number of words in each review text
reviews_df["Word Count"] = reviews_df["Text"].apply(lambda x: len(x.split()))

In [None]:
reviews_df.head(3)
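
The same two features can also be computed with pandas' vectorized string methods. This is an alternative sketch on invented strings, not the notebook's own code; `texts`, `review_length`, and `word_count` are hypothetical names.

```python
import pandas as pd

# dtype=object keeps the values as plain Python strings
texts = pd.Series(["Great coffee", "  too   many   spaces  ", ""], dtype=object)

# Number of characters, including spaces
review_length = texts.str.len()

# Number of whitespace-separated tokens; str.split() with no argument
# collapses runs of whitespace, like Python's built-in str.split()
word_count = texts.str.split().str.len()

print(review_length.tolist())
print(word_count.tolist())
```

An empty review yields a length and a word count of 0, which makes such rows easy to filter out later if needed.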

<h3>4.3 Handling Duplicate Reviews</h3>

In [None]:
# Extract every row whose Review ID appears more than once (keep=False flags all copies)
duplicate_review_ids = reviews_df[reviews_df.duplicated(subset="Review ID", keep=False)]

# Count the rows involved in Review ID duplication (rows, not distinct IDs)
num_duplicate_review_ids = duplicate_review_ids.shape[0]
print(f"Number of rows with a duplicated Review ID: {num_duplicate_review_ids}")

In [None]:
# Extract every row whose combination of Place ID, Author, and Text appears more than once
duplicate_combinations = reviews_df[reviews_df.duplicated(subset=["Place ID", "Author", "Text"], keep=False)]

# Count the rows involved in duplicated combinations (rows, not distinct combinations)
num_duplicate_combinations = duplicate_combinations.shape[0]
print(f"Number of rows with a duplicated combination of Place ID, Author, and Text: {num_duplicate_combinations}")

In [None]:
# Remove duplicates based on Review ID, keeping the first occurrence
reviews_df = reviews_df.drop_duplicates(subset="Review ID", keep="first")

In [None]:
# Remove duplicates based on the combination of Place ID, Author, and Text, keeping the first occurrence
reviews_df = reviews_df.drop_duplicates(subset=["Place ID", "Author", "Text"], keep="first")
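
The difference between counting with `keep=False` and dropping with `keep="first"` is easiest to see on a small example. The frame below is invented (`toy_df` and its values are hypothetical): `duplicated(..., keep=False)` flags every row that shares a Review ID with another row, while `drop_duplicates(..., keep="first")` removes only the later copies.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Review ID": ["r1", "r1", "r2", "r3"],
    "Place ID":  ["p1", "p1", "p1", "p1"],
    "Author":    ["Ana", "Ana", "Ben", "Ben"],
    "Text":      ["Great", "Great", "Okay", "Okay"],
})

# keep=False marks all members of a duplicate group, so this counts rows, not distinct IDs
rows_sharing_id = int(toy_df.duplicated(subset="Review ID", keep=False).sum())

# Step 1: keep the first row per Review ID
step1 = toy_df.drop_duplicates(subset="Review ID", keep="first")

# Step 2: rows with different IDs but the same place, author, and text are also collapsed
step2 = step1.drop_duplicates(subset=["Place ID", "Author", "Text"], keep="first")

print(rows_sharing_id, len(step1), len(step2))
```

The second pass catches reviews that were stored under different IDs but are otherwise identical, which the ID-based pass alone would miss.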

<h3>4.4 Handling Missing Data</h3>

In [None]:
# Check the number of missing values in each column
reviews_df.isnull().sum()

In [None]:
# Drop rows where the 'Response' column has missing values
reviews_df = reviews_df.dropna(subset=["Response"])
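
As a minimal sketch of this step on invented data (`toy_df` is a hypothetical name), `isnull().sum()` reports the missing values per column and `dropna(subset=...)` removes only the rows where the listed column is missing. Both only detect true missing values (`NaN` or `None`); a column that was cast to string earlier may hold placeholder text such as `"nan"` instead, depending on the pandas version.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Review ID": ["r1", "r2", "r3"],
    "Rating": [5, 4, 3],
    "Response": ["Thanks!", np.nan, None],
})

# Missing values per column
missing_per_column = toy_df.isnull().sum()

# Keep only rows with a non-missing 'Response'; other columns are not considered
with_response = toy_df.dropna(subset=["Response"])

print(missing_per_column.to_dict())
print(with_response["Review ID"].tolist())
```

Comparing the row count before and after the call shows how much data the filter removes, which is worth checking before relying on the result.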

<h3>4.5 Reorganizing DataFrame Columns</h3>

In [None]:
# Reorder the columns of the DataFrame to follow a specific sequence
ordered_columns = ["Place ID", "Place Name", "Review ID", "Author", "Rating", "Text", "Review Length", "Word Count", "Time", 
                   "Date", "Response"]

In [None]:
# Apply the new column order to the DataFrame
reviews_df = reviews_df[ordered_columns]
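
Selecting with a list of names raises a `KeyError` if any name is absent and silently drops any column that is not listed. A small guard can make both cases explicit before reordering. The helper `reorder_columns` below is a hypothetical addition, not part of the original notebook.

```python
import pandas as pd


def reorder_columns(df: pd.DataFrame, ordered: list) -> pd.DataFrame:
    """Return df with columns in the given order, failing loudly on mismatches."""
    missing = [c for c in ordered if c not in df.columns]
    extra = [c for c in df.columns if c not in ordered]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}; unlisted columns: {extra}")
    return df[ordered]


# Invented one-row frame to show the helper in use
toy_df = pd.DataFrame({"Text": ["Great"], "Rating": [5], "Author": ["Ana"]})
reordered = reorder_columns(toy_df, ["Author", "Rating", "Text"])
print(list(reordered.columns))
```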

<h2>5. Export Processed Data</h2>

In [None]:
# Save the data to a CSV file
output_path = os.path.abspath(os.path.join("..", "data", "processed", "reviews_processed.csv"))
reviews_df.to_csv(output_path, sep=";", index=False, encoding="utf-8")
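
CSV does not preserve dtypes, so when the processed file is read back, the `Date` column returns as text unless it is parsed again. The following is a minimal round-trip sketch that uses a temporary directory and an invented one-row frame; all names in it (`toy_df`, `out_path`, `restored`) are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({
    "Review ID": ["r1"],
    "Text": ["Great coffee"],
    "Date": pd.to_datetime(["2023-11-14 22:13:20"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "reviews_processed.csv")

    # Same options as the export above: semicolon delimiter, no index column
    toy_df.to_csv(out_path, sep=";", index=False, encoding="utf-8")

    # parse_dates restores the datetime dtype that plain CSV text cannot carry
    restored = pd.read_csv(out_path, sep=";", encoding="utf-8", parse_dates=["Date"])

print(restored.dtypes)
```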