# The More The Merrier (Data Cleaning)

**Description:** This notebook preprocesses and cleans the raw data stored in *csv* files using the
*Pandas* library, focusing on the three key datasets for this project.

- **Data:** Datasets to clean:
  - `2017_Entry_Exit.csv`
  - `2017_Average_Housing_Prices_in_London.csv`
  - `LondonUnderground_Stations_Boroughs.csv`


In [50]:
# importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [51]:
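# The data loading pipeline
def load_csv(filepath, r):
    # r: number of leading rows to skip before the header row
    df = pd.read_csv(filepath, skiprows = r)
    return df

def select_cols(df, cols):
    # copy so that later in-place changes do not touch the raw dataframe
    df = df[cols].copy()
    return df

def rename_cols(df, new_name_cols):
    # new names are applied positionally, in the same order as cols
    df.columns = new_name_cols
    return df

def data_loading_pipeline(filepath, r,
                          cols,
                          new_name_cols):
    # load -> select -> rename
    raw_df = load_csv(filepath,r)
    df = select_cols(raw_df, cols)
    df = rename_cols(df, new_name_cols)
    return df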
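
As a minimal sketch of what the loading pipeline does, the same three steps can be run on a small in-memory CSV. The file contents, the two preamble lines and the source column names below are made up for illustration; only the pandas calls mirror the functions above.

```python
import io
import pandas as pd

# Hypothetical CSV text: two preamble lines, then the header row and the data.
raw_text = (
    "Some report title\n"
    "Generated for illustration only\n"
    "Station,Borough,Note,million\n"
    "Alpha,North,x,1.5\n"
    "Beta,South,y,2.25\n"
)

# Step 1: skip the preamble so that the third line becomes the header.
raw_df = pd.read_csv(io.StringIO(raw_text), skiprows=2)

# Step 2: keep only the columns of interest (copy to avoid working on a view).
df = raw_df[["Station", "Borough", "million"]].copy()

# Step 3: rename the columns positionally.
df.columns = ["station_name", "council_name", "freq(mill)"]

print(df)
```

The value passed as `skiprows` has to match the number of preamble lines in each file, which is why the two datasets in this notebook use different values.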

In [65]:
ldn_sta_freq_loaded = data_loading_pipeline(r'C:\Users\pjxph\Documents\Data Science Projects\The More The Merrier\raw data\2017_Entry_Exit_Frequency.csv', 6,
                           ['Station', 'Borough', 'million'],
                           ['station_name', 'council_name', 'freq(mill)'])
ldn_hse_price_loaded = data_loading_pipeline(r'C:\Users\pjxph\Documents\Data Science Projects\The More The Merrier\raw data\2017 UK Average House Price Index.csv', 0,
                                             ['Local authorities', 'Dec-17'],
                                             ['council_name', 'avg_hse_price(mill)'])

In [96]:
# The data cleaning pipeline
def drop_dups(df, dup_cols):
    df = df.drop_duplicates(subset = dup_cols, ignore_index = True)
    return df
'''
def drop_outs(df, out_cols):
    q1 = df[out_cols].quantile(0.25)
    q3 = df[out_cols].quantile(0.75)
    iqr = q3 - q1
    # remove outliers
    df = df[(df[out_cols] > (q1 - 1.5 * iqr))
            & (df[out_cols] < (q3 + 1.5 * iqr))]
    return df 
'''
def drop_na(df):
    na = df.isnull().sum()
    df = df.dropna()
    print('Removed {} missing values'.format(na.sum()))
    return df

def adjust_cols_dtype(df, cols_dtype):
    df = df.astype(cols_dtype)
    return df

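def remove_spec_char(df, char_cols, char):
    # char: iterable of characters to strip from the string column char_cols
    for i in char:
        # regex = False treats each character literally
        df[char_cols] = df[char_cols].str.replace(i, '', regex = False)
    return df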
    
def data_cleaning_pipeline(df,
                           dup_cols,
                           #out_cols,
                           char_cols, char,
                           cols_dtype):
    df = drop_dups(df, dup_cols)
    #df = drop_outs(df, out_cols)
    df = drop_na(df)
    df = remove_spec_char(df, char_cols, char)
    df = adjust_cols_dtype(df, cols_dtype)
    return df 
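
The commented-out `drop_outs` step is not used in this notebook. As a hedged sketch only, an IQR filter for a single numeric column could look like the following. The function name `drop_outliers_iqr`, the parameter `k` and the toy data are illustrative and not part of the pipeline. Unlike the strict inequalities in the commented-out code, `between` includes the boundaries by default.

```python
import pandas as pd

def drop_outliers_iqr(df, col, k=1.5):
    """Keep rows whose value in `col` lies within [q1 - k*iqr, q3 + k*iqr]."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    # Boolean mask over rows, so whole rows are dropped rather than single cells.
    mask = df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].reset_index(drop=True)

# Toy data: 100.0 lies far outside the bulk of the values.
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 100.0]})
filtered = drop_outliers_iqr(toy, "value")
print(filtered)
```

Filtering on one column at a time keeps the mask one-dimensional, which is what row selection with `df[mask]` expects.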



In [99]:
ldn_sta_freq = data_cleaning_pipeline(
                                      ldn_sta_freq_loaded,
                                      ['station_name', 'council_name', 'freq(mill)'],
                                      #['freq(mills)']
                                      ',', '',  # char_cols, char: an empty char means no characters are stripped
                                      {
                                        'station_name' : str,
                                        'council_name' : str,
                                        'freq(mill)' : np.float64    
                                      }                                      
                                     )
ldn_sta_freq

Removed 4 missing values


| | station_name | council_name | freq(mill) |
|---|---|---|---|
| 0 | Acton Town | Ealing | 6.04 |
| 1 | Aldgate | City of London | 8.85 |
| 2 | Aldgate East | Tower Hamlets | 14.00 |
| 3 | Alperton | Brent | 3.05 |
| 4 | Amersham | Chiltern | 2.32 |
| ... | ... | ... | ... |
| 263 | Wimbledon Park | Merton | 2.18 |
| 264 | Wood Green | Haringey | 12.89 |
| 265 | Wood Lane | Hammersmith and Fulham | 4.00 |
| 266 | Woodford | Redbridge | 5.98 |


In [100]:
ldn_hse_price = data_cleaning_pipeline( 
                                       ldn_hse_price_loaded,
                                       ['council_name', 'avg_hse_price(mill)'],
                                       'avg_hse_price(mill)', ['£',','],
                                       {
                                           'council_name' : str,
                                           'avg_hse_price(mill)' : np.float64
                                       }
                                    )
ldn_hse_price

Removed 0 missing values


| | council_name | avg_hse_price(mill) |
|---|---|---|
| 0 | Adur | 306921.0 |
| 1 | Allerdale | 149657.0 |
| 2 | Amber Valley | 170198.0 |
| 3 | Arun | 288820.0 |
| 4 | Ashfield | 135115.0 |
| ... | ... | ... |
| 348 | Wycombe | 405071.0 |
| 349 | Wyre | 150409.0 |
| 350 | Wyre Forest | 184840.0 |
| 351 | York | 242125.0 |
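
The character stripping and type conversion applied to the price column can be sketched on a toy series. The `£306,921` style of the raw strings is an assumption based on the characters being removed; the two numbers are taken from the outputs in this notebook.

```python
import pandas as pd

# Toy price strings in the format the raw file is assumed to use.
prices = pd.Series(["£306,921", "£1,016,380"])

# Strip each unwanted character literally (regex=False), then cast to float.
for ch in ["£", ","]:
    prices = prices.str.replace(ch, "", regex=False)
prices = prices.astype("float64")

print(prices.tolist())
```

The characters have to be removed before the cast, because `astype` cannot parse strings that still contain a currency symbol or thousands separators.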


In [120]:
# Flag the councils in the house price data that also appear in the station data
a = ldn_hse_price['council_name'].isin(ldn_sta_freq['council_name'].unique())
a.value_counts()  # not echoed: only the last expression in a cell is displayed

ldn_hse_price.loc[a]

| | council_name | avg_hse_price(mill) |
|---|---|---|
| 8 | Barking and Dagenham | 296892.0 |
| 9 | Barnet | 550167.0 |
| 30 | Brent | 484385.0 |
| 44 | Camden | 837707.0 |
| 58 | Chiltern | 556747.0 |
| 64 | City of London | 768751.0 |
| 68 | City of Westminster | 1016380.0 |
| 91 | Ealing | 485602.0 |
| 106 | Enfield | 394603.0 |
| 107 | Epping Forest | 458868.0 |


### Data Wrangling 
**Description:** After the raw data is loaded into a dataframe with the function `load_csv`, the next step is to wrangle it. This involves selecting the columns relevant to the analysis, simplifying the column names and adjusting their data types. Each of these tasks is carried out on a dataframe by its own helper function. The resulting dataframes are then ready to be cleaned.

#### The two **wrangled** dataframes are:
* `ldn_sta_freq`
* `ldn_hse_price`

Time to clean!

### Data Cleaning
**Description:** After the data wrangling process, the wrangled data needs to be cleaned. The following checks were carried out:
- Ensure the consistency of the station name column in ldn_sta_freq and ldn_bor.
  - The total number of stations should be the same.
  - Station names should be consistent.
- Ensure the consistency of the area_name column in ldn_hse_price and ldn_bor.
  - The total number of London boroughs should be the same.
  - Borough names should be consistent.
- Ensure that each area name is consistent in ldn_bor.
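
The name consistency checks listed above can be sketched as a comparison of normalised name sets. This is only an illustration: the helper `normalise` and the variable names are hypothetical, and the two toy series reuse a few station names from this notebook with made-up spelling variations.

```python
import pandas as pd

def normalise(names):
    """Trim whitespace and upper-case names so cosmetic differences are ignored."""
    return names.str.strip().str.upper()

# Toy stand-ins for the station name columns of two dataframes.
left = pd.Series(["Acton Town", " aldgate ", "Alperton"])
right = pd.Series(["ACTON TOWN", "Aldgate", "Amersham"])

left_set = set(normalise(left))
right_set = set(normalise(right))

# Names present on one side only point at missing or inconsistently spelled records.
only_left = sorted(left_set - right_set)
only_right = sorted(right_set - left_set)

print(only_left, only_right)
```

Equal counts alone do not prove consistency, so comparing the sets in both directions catches names that differ even when the totals match.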

Firstly, the number of records in each dataframe, such as London stations and boroughs, was checked against public records.

In 2017, there were 270 London Underground stations and 32 London boroughs (excluding the City of London).

After the initial cleaning process, the ldn_sta_freq dataframe contained 268 underground stations, which is inconsistent with the 270 stations in public records. The records of the 2 missing stations must be found and imported accordingly.

In [17]:
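# Partially cleaned ldn_hse_price
# data_cleaner wraps a dataframe and exposes the cleaning steps as methods
df = data_cleaner(ldn_hse_price)
df = df.drop_na()  # drop rows with missing values
df = data_cleaner(df)
df = df.drop_dups(['id','name'])  # drop rows duplicated on id and name
# upper-case the area names so that they compare consistently
df['name'] = df['name'].str.upper()
ldn_hse_price = df
print(len(ldn_hse_price.index))  # number of areas remaining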

33


After the initial cleaning process, the ldn_hse_price dataframe contained 33 London areas, which is consistent with public records: in 2017 there were 32 London borough councils plus the City of London.

After the initial cleaning process, the ldn_bor dataframe contained 268 London Underground stations, which is consistent with the ldn_sta_freq dataframe but inconsistent with public records.

The test above shows that all stations in the **ldn_bor** dataframe also appear in the **ldn_sta_freq** dataframe.

The next step is to check whether every area in the ldn_hse_price dataframe is also in ldn_bor.

In [21]:
a = ldn_hse_price['name'].isin(ldn_bor['area_name'])
print(a.value_counts())

True     27
False     6
Name: name, dtype: int64


The test showed that 6 areas in the ldn_hse_price dataframe are not in ldn_bor. One hypothesis is that these 6 areas do not contain any London Underground stations.

A check against public records confirmed that these 6 areas do not contain any London Underground stations. Hence, they can be omitted from the analysis, which leaves ldn_hse_price with 27 recorded areas.
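
The membership test and the removal of unmatched areas can be sketched on toy data. The frames below are small stand-ins, and "PLACEHOLDER AREA" is a made-up name standing for an area without an Underground station; the column names follow the ones used in the cell above.

```python
import pandas as pd

# Toy stand-ins for ldn_hse_price and ldn_bor.
hse = pd.DataFrame({"name": ["CAMDEN", "BRENT", "PLACEHOLDER AREA"]})
bor = pd.DataFrame({"area_name": ["CAMDEN", "BRENT", "EALING"]})

# True where the area in the house price data also appears in the borough data.
in_bor = hse["name"].isin(bor["area_name"])

# List the unmatched areas so that they can be checked by hand before being dropped.
unmatched = hse.loc[~in_bor, "name"].tolist()

# Keep only the matched areas for the analysis.
hse_matched = hse.loc[in_bor].reset_index(drop=True)

print(unmatched)
print(hse_matched)
```

Printing the unmatched names, rather than only their count, makes it possible to verify each one against public records.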

In [26]:
# c is assumed to be a boolean mask from an earlier cell; keep the rows where it is False
d = ~c
ldn_borx = ldn_bor.loc[d]

The test above showed that 15 stations and their respective areas in the ldn_bor dataframe are not contained in the ldn_hse_price dataframe. These areas correspond to district councils. They can be added to the ldn_hse_price dataframe using gov.uk house price index data. The dataframe name is changed from ldn_bor to ldn_council.

The average house prices of these district councils were obtained from gov.uk to fill in their avg_hse_price values.
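
Adding the district council rows to the house price data can be sketched with `pd.concat`. This is an assumption about how the step might be done, not the notebook's confirmed method. The frame and column names are illustrative, and the four prices are the December 2017 values shown earlier in this notebook.

```python
import pandas as pd

# Toy stand-in for the London rows already in the house price dataframe.
ldn = pd.DataFrame({
    "name": ["CAMDEN", "BRENT"],
    "avg_hse_price": [837707.0, 484385.0],
})

# District councils that contain Underground stations but are not London boroughs.
district = pd.DataFrame({
    "name": ["CHILTERN", "EPPING FOREST"],
    "avg_hse_price": [556747.0, 458868.0],
})

# Stack the two frames and rebuild a clean 0..n-1 index.
combined = pd.concat([ldn, district], ignore_index=True)

# Guard against adding a council that is already present.
has_duplicates = combined["name"].duplicated().any()

print(combined)
print(has_duplicates)
```

Checking for duplicated names after the concatenation keeps a council from being counted twice in the later analysis.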