# The More The Merrier (Data Cleaning)

**Description:** This notebook preprocesses and cleans raw data stored in *csv* files using the
*Pandas* library. It covers the three datasets used in this project.

- **Data:** Datasets to clean:
  - `2017_Entry_Exit.csv`
  - `2017_Average_Housing_Prices_in_London.csv`
  - `LondonUnderground_Stations_Boroughs.csv`


In [93]:
# importing the necessary libraries
import pandas as pd
import numpy as np

In [94]:
def load_data(filepath):
    '''
    This function loads raw data from a csv file into a pandas dataframe. The default integer index is kept;
    no column is set as the index.
    Args:
        filepath: path to the raw csv file
    Return:
        The raw data as a pandas dataframe, ready to be preprocessed
    '''
    df = pd.read_csv(filepath)
    return df
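
`load_data` keeps the default integer index. If the primary key should become the index instead, `pd.read_csv` accepts an `index_col` argument. The sketch below is only an illustration: `load_data_indexed` is a hypothetical helper, and the in-memory sample rows are made up rather than taken from the real files.

```python
import io
import pandas as pd

def load_data_indexed(source, index_col=None):
    # hypothetical variant of load_data: index_col names the column to use as the index
    return pd.read_csv(source, index_col=index_col)

# illustrative rows only, not taken from the real dataset
sample_csv = io.StringIO(
    "Station_ID,Station_Name,AnnualEntryExit_Mill\n"
    "1,Station A,1.5\n"
    "2,Station B,2.25\n"
)
indexed_df = load_data_indexed(sample_csv, index_col="Station_ID")
```

With `index_col` set, `Station_ID` no longer appears among the regular columns, so any later column selection would have to leave it out.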

In [95]:
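class data_cleaner:
    '''Thin wrapper around a dataframe; each cleaning method returns a plain pandas dataframe.'''
    def __init__(self, df):
        self.df = df

    def get_data(self):
        return self.df

    def select_cols(self, cols):
        # keep only the relevant columns; copy so later edits do not touch the raw dataframe
        return self.df[cols].copy()

    def adjust_col_dtypes(self, col_dtypes):
        # astype returns a new dataframe, so the result is stored back on the instance
        self.df = self.df.astype(col_dtypes)
        return self.df

    def rename_cols(self, rename):
        # rename lists the new column names in the current column order
        self.df.columns = rename
        return self.df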
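
Because each method returns a plain dataframe, every result has to be wrapped in `data_cleaner` again before the next step. One alternative, sketched here on the assumption that a chainable interface is wanted, is to return the cleaner itself from each step. The class name `ChainedCleaner` and the sample values are hypothetical and not part of this notebook.

```python
import pandas as pd

class ChainedCleaner:
    # hypothetical fluent variant: each step returns self, get_data() returns the dataframe
    def __init__(self, df):
        self.df = df.copy()  # work on a copy so the raw dataframe stays untouched

    def select_cols(self, cols):
        self.df = self.df[cols]
        return self

    def rename_cols(self, rename):
        # set_axis returns a new dataframe with the given column labels
        self.df = self.df.set_axis(rename, axis="columns")
        return self

    def adjust_col_dtypes(self, col_dtypes):
        self.df = self.df.astype(col_dtypes)
        return self

    def get_data(self):
        return self.df

# made-up sample values for illustration
raw = pd.DataFrame({
    "Station_ID": ["1", "2"],
    "Station_Name": ["Station A", "Station B"],
    "AnnualEntryExit_Mill": ["1.5", "2.25"],
    "Extra": [0, 0],
})
sta = (
    ChainedCleaner(raw)
    .select_cols(["Station_ID", "Station_Name", "AnnualEntryExit_Mill"])
    .rename_cols(["id", "name", "frequency(millions)"])
    .adjust_col_dtypes({"id": "int64", "frequency(millions)": "float64"})
    .get_data()
)
```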

In [96]:
def clean_data(df, cols, col_dtypes, rename):
    '''
    This function does the necessary cleaning of data: selecting the relevant columns, adjusting each column's data type
    and simplifying the column names.
    Args:
        df: raw dataframe to be cleaned
        cols: list of relevant columns, in the order they should appear
        col_dtypes: dictionary mapping each selected column (original name) to a data type
        rename: list of simplified column names, in the same order as cols
    Return:
        The cleaned dataframe
    '''
    df = df[cols].copy()
    # astype returns a new dataframe, so the result must be assigned
    df = df.astype(col_dtypes)
    df.columns = rename
    return df
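
`DataFrame.astype` returns a new dataframe rather than changing the one it is called on, so the conversion only takes effect when the result is assigned. A minimal sketch of the difference, using made-up sample values:

```python
import pandas as pd

# made-up sample values; the price column starts out as text
sample = pd.DataFrame({"Area_ID": [1, 2], "average_hse_price": ["100.5", "200.0"]})

sample.astype({"average_hse_price": "float64"})  # result discarded: sample is unchanged
dtype_without_assignment = sample["average_hse_price"].dtype

converted = sample.astype({"average_hse_price": "float64"})  # result kept
dtype_with_assignment = converted["average_hse_price"].dtype
```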

In [97]:
raw_df = load_data(r'C:\Users\pjxph\Documents\Data Science Projects\The More The Merrier\raw data\2017_Entry_Exit.csv')
# list of all London stations with their respective frequency of touch-ins and touch-outs

In [98]:
df = data_cleaner(raw_df)
df = df.select_cols(['Station_ID', 'Station_Name','AnnualEntryExit_Mill'])
df = data_cleaner(df)
df = df.rename_cols(['id','name','frequency(millions)'])
df = data_cleaner(df)
df = df.adjust_col_dtypes({
    'id' : np.int64,
    'name' : str,
    'frequency(millions)' : np.float64
})
sta_freq = df

In [99]:
# 2017 average house price for each London area
raw_df = load_data(r'C:\Users\pjxph\Documents/Data Science Projects/The More The Merrier/raw data/2017_Average_Housing_Prices_in_London.csv')

In [100]:
df = data_cleaner(raw_df)
df = df.select_cols(['Area_ID','Area_Name','average_hse_price'])
df = data_cleaner(df)
df = df.rename_cols(['id','name','avg_hse_price'])
df = data_cleaner(df)
df = df.adjust_col_dtypes({
    'id' : str,
    'name' : str,
    'avg_hse_price' : np.float64
})
ldn_hse_price = df 

In [104]:
raw_df = load_data(r'C:\Users\pjxph\Documents/Data Science Projects/The More The Merrier/raw data/LondonUnderground_Stations_Boroughs.csv')

In [105]:
raw_df.head()

| | OBJECTID | NAME | NETWORK | Zone | area | Unnamed: 5 | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Brent Cross | London Underground | 3 | Barnet | | London Underground |
| 1 | 2 | Colindale | London Underground | 4 | Barnet | | London Underground |
| 2 | 3 | Burnt Oak | London Underground | 4 | Barnet | | London Underground |
| 3 | 4 | Edgware | London Underground | 5 | Barnet | | London Underground |
| 4 | 5 | Mill Hill East | London Underground | 4 | Barnet | | London Underground |
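
pandas assigns names such as `Unnamed: 5` to columns whose header cell is empty. This notebook discards them by selecting the wanted columns explicitly. An alternative sketch, assuming every column whose name starts with `Unnamed` is unwanted, filters them out by name. The in-memory csv below is a stand-in built from the preview rows, not the real file.

```python
import io
import pandas as pd

# small in-memory stand-in for the csv; the two trailing header cells are empty
sample_csv = io.StringIO(
    "OBJECTID,NAME,NETWORK,Zone,area,,\n"
    "1,Brent Cross,London Underground,3,Barnet,,London Underground\n"
    "2,Colindale,London Underground,4,Barnet,,London Underground\n"
)
preview = pd.read_csv(sample_csv)

# keep only the columns whose name does not start with "Unnamed"
named_only = preview.loc[:, ~preview.columns.str.startswith("Unnamed")]
```

Explicit selection, as used in this notebook, is the stricter choice because it also fixes the column order; the name filter is mainly useful for a first look at an unfamiliar file.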


In [111]:
df = data_cleaner(raw_df)
df = df.select_cols(['OBJECTID','NAME','area'])
df = data_cleaner(df)
df = df.rename_cols(['id','station_name','area_name'])
df = data_cleaner(df)
df = df.adjust_col_dtypes({
    'id' : np.int64,
    'station_name' : str,
    'area_name' : str
})
boroughs = df 

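The three cleaned dataframes are:
* `sta_freq`: station `id`, `name` and `frequency(millions)`
* `ldn_hse_price`: area `id`, `name` and `avg_hse_price`
* `boroughs`: station `id`, `station_name` and `area_name`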
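
A quick way to confirm that a cleaned dataframe has the intended shape is to check its column names and dtypes. The sketch below is an assumption about how such a check could look: `check_schema` is a hypothetical helper, and `boroughs_sample` is a small stand-in built from the preview rows rather than the real `boroughs` dataframe.

```python
import pandas as pd

def check_schema(df, expected_cols):
    # hypothetical helper: True when the dataframe has exactly these columns, in order
    return list(df.columns) == list(expected_cols)

# stand-in for the cleaned boroughs dataframe
boroughs_sample = pd.DataFrame({
    "id": [1, 2],
    "station_name": ["Brent Cross", "Colindale"],
    "area_name": ["Barnet", "Barnet"],
})

schema_ok = check_schema(boroughs_sample, ["id", "station_name", "area_name"])
id_is_integer = pd.api.types.is_integer_dtype(boroughs_sample["id"])
```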
