<a href="https://colab.research.google.com/github/srnanda2/CS513_Project/blob/main/CS513_Phase_II_Final_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Report CS 513: Phase II
## Team 46


Soumya Nanda - srnanda2@illinois.edu  
Ayush Ghosh - ayushg7@illinois.edu  
Aditi Ghosh - aditig4@illinois.edu

GitHub Repo: https://github.com/srnanda2/CS513_Project


<b>Use Case (U1):</b> <i>Perform a comprehensive analysis of dish prices over time to identify trends, price ranges, and changes in pricing strategies.</i>



In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# Directory holding the raw NYPL CSV files; each file is loaded into its own DataFrame below
raw_data_dir = '/content/drive/My Drive/CS513 Project/NYPL'

import pandas as pd
import numpy as np

dish_df = pd.read_csv(f'{raw_data_dir}/Dish.csv')
menu_df = pd.read_csv(f'{raw_data_dir}/Menu.csv')
menu_item_df = pd.read_csv(f'{raw_data_dir}/MenuItem.csv')
menu_page_df = pd.read_csv(f'{raw_data_dir}/MenuPage.csv')

## 1. Description of Data Cleaning Performed
### High-Level Data Cleaning Steps and Rationale


#### 1. Missing Value Imputation

We addressed missing values in several critical columns across the four data files. Specifically:


*  Dish.csv: description, lowest_price, highest_price

*  Menu.csv: name, sponsor, event, venue, place, keywords, date, currency_symbol

*  MenuPage.csv: page_number

*  MenuItem.csv: price, high_price, dish_id


<b>Rationale:</b> *Missing values in critical fields such as prices and descriptions can hinder meaningful analysis. Imputing them (the column mean for prices, placeholder labels for text fields, and 0 for missing page numbers and dish IDs) keeps every row usable for our main use case (U1).*

In [8]:
# Fill by assignment: calling fillna(..., inplace=True) on a selected column is a
# chained operation that recent pandas versions warn about and may not apply.

# For Dish.csv: description, lowest_price, highest_price
dish_df['description'] = dish_df['description'].fillna('No description')
dish_df['lowest_price'] = dish_df['lowest_price'].fillna(dish_df['lowest_price'].mean())
dish_df['highest_price'] = dish_df['highest_price'].fillna(dish_df['highest_price'].mean())

# For Menu.csv: name, sponsor, event, venue, place, keywords, date, currency_symbol
menu_df = menu_df.fillna({'name': 'Unknown', 'sponsor': 'Unknown', 'event': 'Unknown', 'venue': 'Unknown', 'place': 'Unknown',
                          'keywords': 'None', 'date': 'Unknown', 'currency_symbol': '$'})

# For MenuPage.csv: page_number (0 marks an unknown page)
menu_page_df['page_number'] = menu_page_df['page_number'].fillna(0)

# For MenuItem.csv: price, high_price, dish_id (0 marks an unknown dish)
menu_item_df['price'] = menu_item_df['price'].fillna(menu_item_df['price'].mean())
menu_item_df['high_price'] = menu_item_df['high_price'].fillna(menu_item_df['high_price'].mean())
menu_item_df['dish_id'] = menu_item_df['dish_id'].fillna(0)
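
The imputation step above can be checked on a small example. The following is a minimal sketch on a made-up three-row frame (`toy_dish` and its values are hypothetical, not taken from Dish.csv). It counts missing values before and after filling, and records which prices were imputed so they could be excluded from a trend analysis if mean-filled values turn out to distort it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Dish.csv; the values are invented for illustration only.
toy_dish = pd.DataFrame({
    "description": ["Soup", None, "Tea"],
    "lowest_price": [0.10, np.nan, 0.30],
    "highest_price": [0.20, 0.50, np.nan],
})

# Count missing values per column before touching anything.
missing_before = toy_dish.isna().sum()

# Remember which prices are about to be imputed.
toy_dish["lowest_price_imputed"] = toy_dish["lowest_price"].isna()

# Fill by assignment, mirroring the strategy used in the cell above.
toy_dish["description"] = toy_dish["description"].fillna("No description")
toy_dish["lowest_price"] = toy_dish["lowest_price"].fillna(toy_dish["lowest_price"].mean())
toy_dish["highest_price"] = toy_dish["highest_price"].fillna(toy_dish["highest_price"].mean())

missing_after = toy_dish.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Keeping a boolean flag column is one possible design choice; it leaves the imputed rows in place while still allowing later steps to filter them out.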


#### 2. Data Type Correction

We corrected data types to ensure consistency and accuracy. Specifically:


*  Converted lowest_price and highest_price in Dish.csv to floats rounded to two decimal places.

*  Converted first_appeared and last_appeared in Dish.csv to numeric year values.

*  Converted keywords, language, and location_type in Menu.csv to strings.


*  Ensured page_number in MenuPage.csv and dish_id in MenuItem.csv are integers.



<b>Rationale:</b> *Correct data types are essential for performing accurate numerical and temporal analyses, which are critical for understanding price trends and making meaningful comparisons over time (U1)*

In [9]:
# Dish.csv: lowest_price and highest_price to float, rounded to two decimal places
dish_df['lowest_price'] = dish_df['lowest_price'].astype(float).round(2)
dish_df['highest_price'] = dish_df['highest_price'].astype(float).round(2)

# Dish.csv: first_appeared and last_appeared to numeric years.
# pd.to_numeric is used instead of pd.to_datetime(...).dt.year: if the column holds
# plain integer years, pd.to_datetime reads them as nanosecond offsets from 1970.
dish_df['first_appeared'] = pd.to_numeric(dish_df['first_appeared'], errors='coerce')
dish_df['last_appeared'] = pd.to_numeric(dish_df['last_appeared'], errors='coerce')

# Menu.csv: keywords, language, and location_type to string
# (note that astype(str) turns any remaining NaN into the literal text 'nan')
menu_df['keywords'] = menu_df['keywords'].astype(str)
menu_df['language'] = menu_df['language'].astype(str)
menu_df['location_type'] = menu_df['location_type'].astype(str)

# MenuPage.csv: page_number to integer (missing values were filled with 0 above)
menu_page_df['page_number'] = menu_page_df['page_number'].astype(int)

# MenuItem.csv: dish_id to integer (missing values were filled with 0 above)
menu_item_df['dish_id'] = menu_item_df['dish_id'].astype(int)
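
As an alternative to filling with 0 before casting to `int`, pandas' nullable `Int64` dtype can hold integers and missing values in the same column. The sketch below illustrates this on a hypothetical two-column frame (`toy` and its values are invented); it is not the conversion used in this report.

```python
import pandas as pd

# Hypothetical sample: years stored as text, IDs stored as floats with a gap.
toy = pd.DataFrame({
    "first_appeared": ["1897", "1900", "unknown"],
    "dish_id": [12.0, None, 7.0],
})

# Non-numeric entries become missing instead of raising an error.
toy["first_appeared"] = pd.to_numeric(toy["first_appeared"], errors="coerce").astype("Int64")

# Nullable integer dtype keeps the missing ID as <NA> rather than forcing a 0.
toy["dish_id"] = toy["dish_id"].astype("Int64")

print(toy.dtypes)
print(toy)
```

The trade-off is that downstream code must then handle `<NA>` explicitly, whereas a 0 sentinel works with plain integer columns.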


#### 3. Standardisation of Values

Standardised values in columns that had inconsistent entries:


*  Replaced missing or unparseable first_appeared and last_appeared values in Dish.csv with 0, so that 0 consistently marks an unknown year.

<b>Rationale:</b> *Consistent values are crucial for temporal analyses and comparisons. Standardising these values helps in accurately tracking the historical trends of dish prices (U1)*

In [10]:
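# Dish.csv: first_appeared and last_appeared
# Use 0 as the sentinel for an unknown year and store the years as integers
dish_df['first_appeared'] = dish_df['first_appeared'].fillna(0).astype(int)
dish_df['last_appeared'] = dish_df['last_appeared'].fillna(0).astype(int)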
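
A range check is one way to confirm that the standardised year columns contain only plausible values. The sketch below is illustrative: `toy_years` is invented, and the `MIN_YEAR` and `MAX_YEAR` bounds are assumptions chosen for the example, not limits derived from the dataset.

```python
import pandas as pd

# Invented sample containing a 0 sentinel and an implausible future year.
toy_years = pd.Series([1897, 0, 1912, 2928], name="first_appeared")

# Hypothetical bounds for a plausible menu year; adjust to the collection at hand.
MIN_YEAR, MAX_YEAR = 1840, 2025

# True where the year falls inside the accepted range (inclusive on both ends).
valid = toy_years.between(MIN_YEAR, MAX_YEAR)

# Keep valid years and blank out everything else.
clean_years = toy_years.where(valid)

print(valid.tolist())
print(clean_years.tolist())
```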


#### 4. Data Enrichment

We enriched the dataset with inflation-adjusted prices and derived columns to add context:


*  Adjusted historical prices to present-day values, assuming a fixed inflation rate of 2% per year.

* Added derived columns such as average_price, price_range, and price_trend (the ratio of high_price to price) to facilitate deeper analysis.


<b>Rationale:</b> *Enriching the dataset with additional contextual information like inflation rates helps in understanding the true value changes over time, which is essential for a comprehensive analysis of dish prices (U1)*

In [11]:
# Adjust historical prices to present-day values
# Simplifying assumption: a fixed inflation rate of 2% per year, compounded annually
inflation_rate = 0.02
current_year = pd.Timestamp.now().year

def adjust_for_inflation(price, year):
    """Compound `price` from `year` up to `current_year`; year 0 marks an unknown year."""
    if year == 0:
        return price
    return price * ((1 + inflation_rate) ** (current_year - year))

dish_df['adjusted_lowest_price'] = dish_df.apply(lambda row: adjust_for_inflation(row['lowest_price'], row['first_appeared']), axis=1).round(2)
dish_df['adjusted_highest_price'] = dish_df.apply(lambda row: adjust_for_inflation(row['highest_price'], row['last_appeared']), axis=1).round(2)

# Add derived columns
menu_item_df['average_price'] = (menu_item_df['price'] + menu_item_df['high_price']) / 2
menu_item_df['price_range'] = menu_item_df['high_price'] - menu_item_df['price']
# price_trend is the high/low ratio; a zero price is mapped to NaN to avoid dividing by zero
menu_item_df['price_trend'] = menu_item_df['high_price'] / menu_item_df['price'].replace(0, np.nan)
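
The compounding used above is `adjusted = price * (1 + rate) ** (target_year - year)`. The following sketch is a vectorised variant that avoids the row-wise `apply`. It is an illustration rather than the code used in this report: the function name `adjust_prices`, the explicit `target_year` parameter (fixed so that results are reproducible) and the sample values are all assumptions.

```python
import pandas as pd

def adjust_prices(prices, years, target_year, rate=0.02):
    """Compound prices from their year to target_year at a fixed annual rate.

    A year of 0 is treated as 'unknown' and leaves the price unchanged.
    """
    prices = pd.Series(prices, dtype="float64")
    years = pd.Series(years, dtype="float64")
    factor = (1 + rate) ** (target_year - years)
    # Where the year is the 0 sentinel, use a neutral factor of 1.
    factor = factor.where(years != 0, 1.0)
    return (prices * factor).round(2)

# Invented sample: a 1900 price, a price with unknown year, and a 2000 price.
adjusted = adjust_prices([0.25, 1.00, 0.40], [1900, 0, 2000], target_year=2000)
print(adjusted.tolist())
```

With a 2% rate, 100 years of compounding multiplies a price by roughly 7.24, so the choice of rate strongly shapes any conclusion drawn from the adjusted columns.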


#### 5. Handling Duplicates

Identified and removed duplicate entries in MenuItem.csv.


<b>Rationale:</b> *Duplicate entries can skew analysis results and lead to inaccurate insights. Removing duplicates ensures data integrity and reliability of the analysis (U1)*

In [12]:
# Handling Duplicates
menu_item_df.drop_duplicates(inplace=True)
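
`drop_duplicates()` with no arguments removes only rows that match in every column. If each row carries a unique identifier, as the toy frame below assumes, no two rows are ever fully identical, and a key-based comparison may be needed instead. This is a minimal sketch: `toy_items` and the choice of `key_cols` are hypothetical and would need to be checked against the real MenuItem.csv columns.

```python
import pandas as pd

# Invented sample: rows 1 and 2 describe the same dish on the same page at the same price.
toy_items = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "menu_page_id": [10, 10, 10, 11],
    "dish_id": [100, 100, 101, 100],
    "price": [0.25, 0.25, 0.40, 0.25],
})

# Full-row comparison finds nothing because the id column differs on every row.
exact_dupes = int(toy_items.duplicated().sum())

# Hypothetical set of columns that together define "the same menu item".
key_cols = ["menu_page_id", "dish_id", "price"]
logical_dupes = int(toy_items.duplicated(subset=key_cols).sum())

# Keep the first occurrence of each key combination.
deduped = toy_items.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)

print(exact_dupes, logical_dupes, len(deduped))
```

Reporting the row count before and after deduplication makes it easy to see how many entries were actually removed.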
