# The More The Merrier (Data Cleaning)

**Description:** This notebook automates the loading and cleaning of raw data stored in *csv* files using the *Pandas* library, focusing on the key datasets for this project.

- **Data:** Datasets to clean:
  - `2017_Entry_Exit_Frequency.csv`
  - `2017 UK Average House Price Index.csv`


In [1]:
# importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

### The Data Loading Pipeline
**Description:** The data loading pipeline loads a csv file containing raw data into a dataframe, selects the columns relevant for analysis, and simplifies the column names. A single function named `data_loading_pipeline` chains these tasks, each of which is performed by one of the following functions:

- `load_csv`
- `select_cols`
- `rename_cols`

The resulting dataframes are then ready to be cleaned.


In [2]:
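# The data loading pipeline
def load_csv(filepath, r):
    # r is the number of rows to skip at the top of the file (passed to skiprows)
    df = pd.read_csv(filepath, skiprows = r)
    return df

def select_cols(df, cols):
    # copy the selection so later changes do not affect the raw dataframe
    df = df[cols].copy()
    return df

def rename_cols(df, new_name_cols):
    # new_name_cols must be in the same order as the selected columns
    df.columns = new_name_cols
    return df

def data_loading_pipeline(filepath, r,
                          cols,
                          new_name_cols):
    # load -> select -> rename
    raw_df = load_csv(filepath, r)
    df = select_cols(raw_df, cols)
    df = rename_cols(df, new_name_cols)
    return df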
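The three loading steps can also be expressed with the `skiprows` and `usecols` parameters of `pd.read_csv` followed by `rename`. The sketch below is only an illustration on a toy in-memory csv; the file contents, the `Note` column and the two preamble lines are assumptions, not the real datasets.

```python
import io
import pandas as pd

# Toy csv with two preamble lines before the header (contents are illustrative only).
raw = io.StringIO(
    "Source: example\n"
    "Year: 2017\n"
    "Station,Borough,Note,million\n"
    "Acton Town,Ealing,x,6.04\n"
    "Aldgate,City of London,y,8.85\n"
)

# Skip the preamble, keep only the relevant columns, then simplify their names.
df = (
    pd.read_csv(raw, skiprows=2, usecols=["Station", "Borough", "million"])
      .rename(columns={"Station": "station_name",
                       "Borough": "council_name",
                       "million": "freq(mill)"})
)
print(df)
```

Renaming with a dictionary ties each new name to its old name, so the result does not depend on column order.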

In [3]:
ldn_sta_freq_load = data_loading_pipeline(r'C:\Users\pjxph\Documents\Data Science Projects\The More The Merrier\raw data\2017_Entry_Exit_Frequency.csv', 6,
                           ['Station', 'Borough', 'million'],
                           ['station_name', 'council_name', 'freq(mill)'])
ldn_hse_price_load = data_loading_pipeline(r'C:\Users\pjxph\Documents\Data Science Projects\The More The Merrier\raw data\2017 UK Average House Price Index.csv', 0,
                                             ['Local authorities', 'Dec-17'],
                                             ['council_name', 'avg_hse_price(mill)'])

#### The two **loaded** dataframes are:
* `ldn_sta_freq_load`
* `ldn_hse_price_load`

Both dataframes then proceed to the data cleaning process.

### Data Cleaning
**Description:** After loading, the data needs to be cleaned for analysis. As with the data loading pipeline, this is done with a single function, named `data_cleaning_pipeline`, that chains the tasks listed below, each performed by its corresponding function.

- `drop_dups:` drop any rows whose values in the selected columns duplicate an earlier row.
- `drop_na:` drop any rows that contain a null value in any column of the dataframe.
- `set_cols_dtype:` set the data type of each selected column.
- `remove_spec_char:` remove the given characters from a selected column.


The resulting dataframes are then checked for consistency.

In [4]:
# The data cleaning pipeline
def drop_dups(df, dup_cols):
    df = df.drop_duplicates(subset = dup_cols, ignore_index = True)
    return df
'''
def drop_outs(df, out_cols):
    q1 = df[out_cols].quantile(0.25)
    q3 = df[out_cols].quantile(0.75)
    iqr = q3 - q1
    # remove outliers
    df = df[(df[out_cols] > (q1 - 1.5 * iqr))
            & (df[out_cols] < (q3 + 1.5 * iqr))]
    return df 
'''
def drop_na(df):
    na = df.isnull().sum()
    df = df.dropna()
    print('Removed {} missing values'.format(na.sum()))
    return df

def set_cols_dtype(df, cols_dtype):
    df = df.astype(cols_dtype)
    return df

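def remove_spec_char(df, char_cols, char):
    # remove each character in char from the char_cols column (literal match, not regex)
    for i in char:
        df[char_cols] = df[char_cols].str.replace(i, '', regex = False)
    return df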
    
def data_cleaning_pipeline(df,
                           dup_cols,
                           #out_cols,
                           char_cols, char,
                           cols_dtype):
    df = drop_dups(df, dup_cols)
    #df = drop_outs(df, out_cols)
    df = drop_na(df)
    df = remove_spec_char(df, char_cols, char)
    df = set_cols_dtype(df, cols_dtype)
    return df 
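To make the effect of each cleaning step concrete, the sketch below applies the same sequence of pandas calls to a toy dataframe. The values are made up for illustration, and the `.copy()` after `dropna` is an addition here to avoid chained-assignment warnings.

```python
import numpy as np
import pandas as pd

# Toy data: one duplicated row, one missing council name, prices stored as text.
toy = pd.DataFrame({
    "council_name": ["Adur", "Adur", "Arun", None],
    "avg_hse_price(mill)": ["£306,921", "£306,921", "£288,820", "£1"],
})

# 1. drop rows duplicated across the selected columns
clean = toy.drop_duplicates(subset=["council_name", "avg_hse_price(mill)"],
                            ignore_index=True)
# 2. drop rows with any null value
clean = clean.dropna().copy()
# 3. remove special characters as literal strings
for ch in ["£", ","]:
    clean["avg_hse_price(mill)"] = clean["avg_hse_price(mill)"].str.replace(
        ch, "", regex=False)
# 4. set the column data types
clean = clean.astype({"council_name": str, "avg_hse_price(mill)": np.float64})
print(clean)
```

The special characters have to be removed before the dtype conversion, since a string such as `£306,921` cannot be cast to a float.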



In [5]:
ldn_sta_freq = data_cleaning_pipeline(
                                      ldn_sta_freq_load,
                                      ['station_name', 'council_name', 'freq(mill)'],
                                      #['freq(mills)']
                                      ',', '',  # char_cols, char: char is empty, so no characters are removed
                                      {
                                        'station_name' : str,
                                        'council_name' : str,
                                        'freq(mill)' : np.float64    
                                      }                                      
                                     )
ldn_sta_freq

Removed 4 missing values


| | station_name | council_name | freq(mill) |
|---|---|---|---|
| 0 | Acton Town | Ealing | 6.04 |
| 1 | Aldgate | City of London | 8.85 |
| 2 | Aldgate East | Tower Hamlets | 14.00 |
| 3 | Alperton | Brent | 3.05 |
| 4 | Amersham | Chiltern | 2.32 |
| ... | ... | ... | ... |
| 263 | Wimbledon Park | Merton | 2.18 |
| 264 | Wood Green | Haringey | 12.89 |
| 265 | Wood Lane | Hammersmith and Fulham | 4.00 |
| 266 | Woodford | Redbridge | 5.98 |


In [6]:
ldn_hse_price = data_cleaning_pipeline( 
                                       ldn_hse_price_load,
                                       ['council_name', 'avg_hse_price(mill)'],
                                       'avg_hse_price(mill)', ['£',','],
                                       {
                                           'council_name' : str,
                                           'avg_hse_price(mill)' : np.float64
                                       }
                                    )
ldn_hse_price

Removed 0 missing values


| | council_name | avg_hse_price(mill) |
|---|---|---|
| 0 | Adur | 306921.0 |
| 1 | Allerdale | 149657.0 |
| 2 | Amber Valley | 170198.0 |
| 3 | Arun | 288820.0 |
| 4 | Ashfield | 135115.0 |
| ... | ... | ... |
| 348 | Wycombe | 405071.0 |
| 349 | Wyre | 150409.0 |
| 350 | Wyre Forest | 184840.0 |
| 351 | York | 242125.0 |


#### The two **cleaned** dataframes are:
* `ldn_sta_freq`
* `ldn_hse_price`

Both dataframes are then checked for consistency.

#### Consistency check
##### First, to ensure the consistency of the unique councils in `ldn_sta_freq` with public records.
  - The total number of unique councils in ldn_sta_freq is required to be the same as the total number of councils in public records.
  - Council names are required to be consistent.

In [7]:
u_sta_freq = ldn_sta_freq['council_name'].unique()
print(f'The total number of unique councils in ldn_sta_freq is {len(u_sta_freq)}.')
print(u_sta_freq)

The total number of unique councils in ldn_sta_freq is 31.
['Ealing' 'City of London' 'Tower Hamlets' 'Brent' 'Chiltern' 'Islington'
 'Enfield' 'City of Westminster' 'Wandsworth' 'Barking and Dagenham'
 'Redbridge' 'Hammersmith and Fulham' 'Camden' 'Southwark'
 'Waltham Forest' 'Haringey' 'Barnet' 'Lambeth' 'Epping Forest' 'Newham'
 'Harrow' 'Three Rivers' 'Merton' 'Kensington and Chelsea' 'Hillingdon'
 'Havering' 'Hounslow' 'Richmond' 'Hackney' 'Greenwich' 'Watford']


When this figure was checked against public records, there were 33 London councils in total, including the City of London. Upon further investigation, several factors had to be taken into account.

Six London councils were excluded because they do not possess any London Underground stations. These councils are listed below:
- Bexley
- Bromley
- Croydon
- Kingston Upon Thames
- Lewisham
- Sutton

Four councils outside London were included because they possess at least one London Underground station. These councils are listed below:
- Chiltern
- Epping Forest
- Three Rivers
- Watford

Hence, the figure of 31 councils with London Underground stations is consistent with public records once these factors are taken into account (33 - 6 + 4 = 31).
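The adjustment can be written out as a short check. This is a minimal sketch using only the council lists given above; the variable names are illustrative and not part of the notebook.

```python
# Total London councils per public records, including the City of London.
london_councils_total = 33

# London councils without any London Underground station (from the list above).
no_underground = {"Bexley", "Bromley", "Croydon",
                  "Kingston Upon Thames", "Lewisham", "Sutton"}

# Councils outside London with at least one station (from the list above).
outside_london = {"Chiltern", "Epping Forest", "Three Rivers", "Watford"}

expected_councils = london_councils_total - len(no_underground) + len(outside_london)
print(expected_councils)  # compare with len(u_sta_freq) in the notebook
```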

##### Next, to ensure the consistency of the *council_name* columns in `ldn_sta_freq` and `ldn_hse_price`.
  - Every unique council in ldn_sta_freq is required to have a matching council in ldn_hse_price.
  - Council names are required to be consistent.

In [8]:
a = ldn_hse_price['council_name'].isin(ldn_sta_freq['council_name'])
u_hse_price = ldn_hse_price['council_name'].loc[a]
print(f'The total number of councils in ldn_hse_price that match with councils in ldn_sta_freq is {len(u_hse_price)}.')

The total number of councils in ldn_hse_price that match with councils in ldn_sta_freq is 29.


This figure is inconsistent with the 31 councils found in public records and in ldn_sta_freq.

In [9]:
sf_hp = ldn_sta_freq['council_name'].isin(ldn_hse_price['council_name'])
sf_hp_diff = ldn_sta_freq['council_name'].loc[~sf_hp].unique()
print(sf_hp_diff)

['Kensington and Chelsea' 'Richmond']
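The same comparison could also be made with a left merge and the `indicator` flag, which labels every row by where it was found. The sketch below uses small toy stand-ins for the two dataframes, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the council_name columns of the two cleaned dataframes.
sta = pd.DataFrame({"council_name": ["Ealing", "Kensington and Chelsea", "Richmond"]})
hse = pd.DataFrame({"council_name": ["Ealing", "Kensington And Chelsea",
                                     "Richmond upon Thames"]})

# Left-join the unique station councils onto the house price councils;
# the _merge column records whether a match was found.
merged = sta.drop_duplicates().merge(hse, on="council_name",
                                     how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "council_name"].tolist()
print(unmatched)
```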


Hence, the council_name entries of the two dataframes are not consistent. Kensington and Chelsea and Richmond appear to be missing from ldn_hse_price, so each was searched for in case it is named differently there.

In [10]:
ken = ldn_hse_price['council_name'].str.contains('Kensington')
ldn_hse_price.loc[ken]

| | council_name | avg_hse_price(mill) |
|---|---|---|
| 154 | Kensington And Chelsea | 1212292.0 |


As seen above, the 'A' in 'And' is uppercase in ldn_hse_price.

In [11]:
rich = ldn_hse_price['council_name'].str.contains('Rich')
ldn_hse_price.loc[rich]

| | council_name | avg_hse_price(mill) |
|---|---|---|
| 229 | Richmond upon Thames | 668369.0 |
| 230 | Richmondshire | 201638.0 |


As shown above, Richmond is named Richmond upon Thames in ldn_hse_price.
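One way to reconcile the two spellings is to map the names in ldn_hse_price onto those used in ldn_sta_freq. The sketch below is an assumption about how this could be done, not necessarily the approach taken in the notebook; `name_fixes` is a hypothetical mapping and the dataframe is a toy stand-in built from the outputs above.

```python
import pandas as pd

# Toy stand-in for ldn_hse_price (rows copied from the outputs above).
hse = pd.DataFrame({
    "council_name": ["Kensington And Chelsea", "Richmond upon Thames", "Richmondshire"],
    "avg_hse_price(mill)": [1212292.0, 668369.0, 201638.0],
})

# Hypothetical mapping from the house price spelling to the station spelling.
name_fixes = {
    "Kensington And Chelsea": "Kensington and Chelsea",
    "Richmond upon Thames": "Richmond",
}

# replace() with a dict matches whole values, so Richmondshire is left untouched.
hse["council_name"] = hse["council_name"].replace(name_fixes)
print(hse["council_name"].tolist())
```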

##### Finally, to ensure the consistency of the stations in `ldn_sta_freq` with public records.
  - The total number of stations in ldn_sta_freq is required to be the same as the total number of stations in public records.
  - Station names are required to be consistent.

In [14]:
print(f'The total number of London underground stations in ldn_sta_freq is {len(ldn_sta_freq)}.')

The total number of London underground stations in ldn_sta_freq is 268.


After the initial cleaning process, the ldn_sta_freq dataframe contains 268 underground stations, which is inconsistent with public records: the total number of underground stations in 2017 was 270. The records of the two missing stations must be found and imported accordingly.
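A set difference against a reference list is one way to identify the missing stations. The sketch below assumes such a reference list is available from a public record; `reference_stations` and the placeholder station names are hypothetical.

```python
import pandas as pd

# Toy stand-in for ldn_sta_freq with placeholder station names.
sta = pd.DataFrame({"station_name": ["Station A", "Station B", "Station C"]})

# Hypothetical reference list; in practice it would hold all 270 station names.
reference_stations = ["Station A", "Station B", "Station C", "Station D"]

# Stations in the reference list that are absent from the dataframe.
missing = sorted(set(reference_stations) - set(sta["station_name"]))
print(missing)
```

Reversing the difference would instead reveal stations whose names are spelled differently from the reference.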