# The More The Merrier (Data Cleaning)

**Description:** This notebook preprocesses and cleans the raw data stored in *csv* files using the *Pandas* library, focusing on the three key datasets for this project.

- **Data:** Datasets to clean:
  - `2017_Entry_Exit.csv`
  - `2017_Average_Housing_Prices_in_London.csv`
  - `LondonUnderground_Stations_Boroughs.csv`


In [1]:
# importing the necessary libraries
import pandas as pd
import numpy as np

In [2]:
def load_data(filepath):
    '''
    This function loads raw data from a csv file into a pandas dataframe (the default integer index is kept)
    Args:
        filepath: the raw data's filepath in csv format
    Return:
        The raw data loaded into a pandas dataframe, ready to be preprocessed
    '''
    df = pd.read_csv(filepath)
    return df
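
As an alternative to selecting columns and setting types after loading, `pd.read_csv` can do both while parsing, through its `usecols` and `dtype` parameters. The sketch below assumes a tiny in-memory sample instead of the project's real file; the `Notes` column and the names `sample_csv` and `sample_df` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a raw csv file
sample_csv = io.StringIO(
    "Station_ID,Station_Name,AnnualEntryExit_Mill,Notes\n"
    "1,Brent Cross,304.63,a\n"
    "2,Colindale,849.48,b\n"
)

# usecols keeps only the listed columns; dtype sets types while parsing
sample_df = pd.read_csv(
    sample_csv,
    usecols=["Station_ID", "Station_Name", "AnnualEntryExit_Mill"],
    dtype={"Station_ID": "int64", "AnnualEntryExit_Mill": "float64"},
)
print(sample_df.dtypes)
```

This is only one possible design; the notebook keeps loading and wrangling as separate steps.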

### Data Wrangling 
**Description:** After loading the raw data into a dataframe with the function `load_data`, the next step is to wrangle the data. This involves selecting the columns relevant to the analysis, simplifying the column names and adjusting their data types. A small class, `data_wrangler`, carries out each of these tasks on a dataframe. The resulting dataframes are then ready to be cleaned.
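
The three wrangling steps (select, rename, set data types) can also be written as one chain of plain pandas calls. The following is a minimal sketch, assuming a made-up two-row sample with the same column names as `2017_Entry_Exit.csv`; the `Unused` column and the names `raw` and `wrangled` are hypothetical.

```python
import pandas as pd

# Hypothetical sample; numbers arrive as text, as they might from a raw file
raw = pd.DataFrame({
    "Station_ID": ["1", "2"],
    "Station_Name": ["Brent Cross", "Colindale"],
    "AnnualEntryExit_Mill": ["304.63", "849.48"],
    "Unused": [None, None],
})

wrangled = (
    raw[["Station_ID", "Station_Name", "AnnualEntryExit_Mill"]]  # select
    .rename(columns={                                            # rename
        "Station_ID": "id",
        "Station_Name": "station_name",
        "AnnualEntryExit_Mill": "frequency(mill)",
    })
    .astype({"id": "int64", "frequency(mill)": "float64"})       # set dtypes
)
print(wrangled.dtypes)
```

Renaming with a dictionary ties each new name to an old one, so the result does not depend on column order.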

In [3]:
class data_wrangler:
    '''
    This class does the necessary data wrangling such as selecting the relevant columns, setting each column's data type
    and renaming the columns.
    '''
    def __init__(self, df):
        self.df = df
    
    def get_data(self):
        return self.df
    
    def select_cols(self,cols):
        # keep only the columns needed for the analysis
        self.df = self.df[cols]
        return self.df

    def adjust_col_dtypes(self,col_dtypes):
        # col_dtypes maps each column name to its target data type
        self.df = self.df.astype(col_dtypes)
        return self.df
    
    def rename_cols(self,rename):
        # rename is the full list of new column names, in column order
        self.df.columns = rename
        return self.df

In [4]:
raw_df = load_data(r'C:\Users\pjxph\Documents\Data Science Projects\The More The Merrier\raw data\2017_Entry_Exit.csv')
# List of all London stations, each with its frequency of touch-ins and touch-outs.

In [5]:
df = data_wrangler(raw_df)
df = df.select_cols(['Station_ID','Station_Name','AnnualEntryExit_Mill'])
df = data_wrangler(df)
df = df.rename_cols(['id','station_name','frequency(mill)'])
df = data_wrangler(df)
df = df.adjust_col_dtypes({
    'id' : int,
    'station_name' : str,
    'frequency(mill)' : np.float64
})
ldn_sta_freq = df

In [6]:
# 2017_Average_hse_price
raw_df = load_data(r'C:\Users\pjxph\Documents/Data Science Projects/The More The Merrier/raw data/2017_Average_Housing_Prices_in_London.csv')

In [7]:
df = data_wrangler(raw_df)
df = df.select_cols(['Area_ID','Area_Name','average_hse_price'])
df = data_wrangler(df)
df = df.rename_cols(['id','name','avg_hse_price'])
df = data_wrangler(df)
df = df.adjust_col_dtypes({
    'id' : str,
    'name' : str,
    'avg_hse_price' : np.float64
})
ldn_hse_price = df 

In [8]:
raw_df = load_data(r'C:\Users\pjxph\Documents/Data Science Projects/The More The Merrier/raw data/LondonUnderground_Stations_Boroughs.csv')

In [9]:
raw_df.head()

Unnamed: 0,OBJECTID,NAME,NETWORK,Zone,area,Unnamed: 5,Unnamed: 6
0,1,Brent Cross,London Underground,3,Barnet,,London Underground
1,2,Colindale,London Underground,4,Barnet,,London Underground
2,3,Burnt Oak,London Underground,4,Barnet,,London Underground
3,4,Edgware,London Underground,5,Barnet,,London Underground
4,5,Mill Hill East,London Underground,4,Barnet,,London Underground


In [10]:
df = data_wrangler(raw_df)
df = df.select_cols(['OBJECTID','NAME','area'])
df = data_wrangler(df)
df = df.rename_cols(['id','station_name','area_name'])
df = data_wrangler(df)
df = df.adjust_col_dtypes({
    'id' : np.int64,
    'station_name' : str,
    'area_name' : str
})
ldn_bor = df 

In [11]:
# ldn_sta_freq
#ldn_hse_price
# ldn_bor

#### The three **wrangled** dataframes are:
* `ldn_sta_freq`
* `ldn_hse_price`
* `ldn_bor`

Time to clean!

### Data Cleaning
**Description:** After the data wrangling process, the wrangled data needs to be cleaned. The following checks were carried out:
- Ensure the `station_name` column is consistent between `ldn_sta_freq` and `ldn_bor`.
  - The total number of stations should be the same.
  - Station names should be consistent.
- Ensure the area names are consistent between `ldn_hse_price` (column `name`) and `ldn_bor` (column `area_name`).
  - The total number of London boroughs should be the same.
  - Borough names should be consistent.
- Ensure that each area name is consistent within `ldn_bor`.
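
The consistency checks listed above can be sketched with set differences, which name the values that appear on one side only. This is a minimal illustration on made-up miniature frames (`prices` and `stations` are hypothetical stand-ins for `ldn_hse_price` and `ldn_bor`), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
prices = pd.DataFrame({"name": ["BARNET", "BEXLEY", "CAMDEN"]})
stations = pd.DataFrame({
    "station_name": ["BRENT CROSS", "EPPING", "CAMDEN TOWN"],
    "area_name": ["BARNET", "EPPING FOREST", "CAMDEN"],
})

price_names = set(prices["name"])
station_areas = set(stations["area_name"])

# Names present in one dataframe but absent from the other
only_in_prices = sorted(price_names - station_areas)
only_in_stations = sorted(station_areas - price_names)

print(only_in_prices)    # ['BEXLEY']
print(only_in_stations)  # ['EPPING FOREST']
```

Two empty lists would mean the name columns agree exactly.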

In [12]:
class data_cleaner:
    '''
    This class does the necessary data cleaning such as removing duplicates and NaN values for the chosen column.
    '''
    def __init__(self, df):
        self.df = df
        
    def drop_dups(self,columns):
        self.df = self.df.drop_duplicates(subset = columns, ignore_index = True)
        return self.df 
    
    def drop_na(self):
        self.df = self.df.dropna()
        return self.df
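
One caveat: `drop_duplicates` compares values exactly, so names that differ only in case or surrounding whitespace are not treated as duplicates. The sketch below, on made-up names, compares deduplicating before and after normalising; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the same station spelled two ways, plus one other
names = pd.DataFrame({"station_name": ["Edgware", "EDGWARE ", "Angel"]})

# Deduplicating first keeps both spellings of Edgware
dedup_first = names.drop_duplicates(subset=["station_name"], ignore_index=True)

# Normalising first lets drop_duplicates see them as the same station
normalised = names.assign(
    station_name=names["station_name"].str.strip().str.upper()
)
normalise_first = normalised.drop_duplicates(
    subset=["station_name"], ignore_index=True
)

print(len(dedup_first), len(normalise_first))  # 3 2
```

Whether this matters for the project's files depends on how the raw names are spelled, which this sketch does not check.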

First, the number of rows in each dataframe (London stations and boroughs) was checked against public records.

In [13]:
# raw_data 
print(len(ldn_sta_freq.index))
print(len(ldn_hse_price.index))
print(len(ldn_bor.index))

268
34
270


In [14]:
# Inspect the columns to decide which ones to check for duplicates
ldn_sta_freq.columns

Index(['id', 'station_name', 'frequency(mill)'], dtype='object')

In [15]:
# Partially cleaned ldn_sta_freq
df = data_cleaner(ldn_sta_freq)
df = df.drop_na()
df = data_cleaner(df)
df = df.drop_dups(['station_name'])
df['station_name'] = df['station_name'].str.upper()
ldn_sta_freq = df
print(len(ldn_sta_freq.index))

268


According to public records, there were 270 Underground stations in 2017, which is inconsistent with the 268 stations in the `ldn_sta_freq` dataframe. The records of the missing stations must be found and imported.

In [16]:
# Inspect the columns to decide which ones to check for duplicates
ldn_hse_price.columns

Index(['id', 'name', 'avg_hse_price'], dtype='object')

In [17]:
# Partially cleaned ldn_hse_price
df = data_cleaner(ldn_hse_price)
df = df.drop_na()
df = data_cleaner(df)
df = df.drop_dups(['id','name'])
df['name'] = df['name'].str.upper()
ldn_hse_price = df
print(len(ldn_hse_price.index))

33


According to public records, there were 32 London borough councils plus the City of London in 2017, which is consistent with the 33 areas in the `ldn_hse_price` dataframe.

In [18]:
# Inspect the columns to decide which ones to check for duplicates
ldn_bor.columns

Index(['id', 'station_name', 'area_name'], dtype='object')

In [19]:
# Partially cleaned ldn_bor
df = data_cleaner(ldn_bor)
df = df.drop_dups(['station_name'])
df = data_cleaner(df)
df = df.drop_na()
df['station_name'] = df['station_name'].str.upper()
df['area_name'] = df['area_name'].str.upper()
ldn_bor = df
print(len(ldn_bor.index))
#print(len(ldn_bor['station_name'].unique()))

268
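
The row count of `ldn_bor` fell from 270 to 268 during this step. A sketch of how the affected rows could be inspected before they are dropped is shown below, using `duplicated` and `isna` on a made-up sample; the `sample` frame and its missing value are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one repeated station name and one missing area
sample = pd.DataFrame({
    "station_name": ["Edgware", "Angel", "Edgware", "Wood Lane"],
    "area_name": ["Barnet", "Islington", "Barnet", np.nan],
})

# keep=False flags every copy of a repeated name, not only the later ones
repeated = sample[sample.duplicated(subset=["station_name"], keep=False)]

# Rows with at least one missing value, which dropna() would remove
incomplete = sample[sample.isna().any(axis=1)]

print(repeated.index.tolist())    # [0, 2]
print(incomplete.index.tolist())  # [3]
```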


In [20]:
# Test if all station names in ldn_bor dataframe are in ldn_sta_freq dataframe
ldn_bor['station_name'].isin(ldn_sta_freq['station_name']).value_counts()

True    268
Name: station_name, dtype: int64

In [21]:
(ldn_bor['area_name'].unique())

array(['BARNET', 'EALING', 'BRENT', 'CITY OF WESTMINSTER',
       'KENSINGTON AND CHELSEA', 'CAMDEN', 'ISLINGTON', 'ENFIELD',
       'HACKNEY', 'HAVERING', 'HILLINGDON', 'HAMMERSMITH AND FULHAM',
       'CITY OF LONDON', 'TOWER HAMLETS', 'WALTHAM FOREST', 'REDBRIDGE',
       'EPPING FOREST', 'NEWHAM', 'BARKING AND DAGENHAM', 'MERTON',
       'WANDSWORTH', 'LAMBETH', 'HARINGEY', 'HOUNSLOW', 'HARROW',
       'RICHMOND UPON THAMES', 'SOUTHWARK', 'CHILTERN', 'GREENWICH',
       'THREE RIVERS', 'WATFORD'], dtype=object)

In [22]:
ldn_hse_price['name']

0             CITY OF LONDON
1       BARKING AND DAGENHAM
2                     BARNET
3                     BEXLEY
4                      BRENT
5                    BROMLEY
6                     CAMDEN
7                    CROYDON
8                     EALING
9                    ENFIELD
10                 GREENWICH
11                   HACKNEY
12    HAMMERSMITH AND FULHAM
13                  HARINGEY
14                    HARROW
15                  HAVERING
16                HILLINGDON
17                  HOUNSLOW
18                 ISLINGTON
19    KENSINGTON AND CHELSEA
20      KINGSTON UPON THAMES
21                   LAMBETH
22                  LEWISHAM
23                    MERTON
24                    NEWHAM
25                 REDBRIDGE
26      RICHMOND UPON THAMES
27                 SOUTHWARK
28                    SUTTON
29             TOWER HAMLETS
30            WALTHAM FOREST
31                WANDSWORTH
32       CITY OF WESTMINSTER
Name: name, dtype: object

In [23]:
a = ldn_hse_price['name'].isin(ldn_bor['area_name'])
print(a.value_counts())

True     27
False     6
Name: name, dtype: int64


In [24]:
b = ~a  # True for areas in ldn_hse_price that are absent from ldn_bor
ldn_hse_price.loc[b]

Unnamed: 0,id,name,avg_hse_price
3,E09000004,BEXLEY,330066.0
5,E09000006,BROMLEY,436538.0
7,E09000008,CROYDON,363241.0
20,E09000021,KINGSTON UPON THAMES,487327.0
22,E09000023,LEWISHAM,401025.0
28,E09000029,SUTTON,365567.0


These `False` values correspond to boroughs that do not have any Underground stations, so they are omitted.
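
The filtering step can be sketched on a small example as follows. The frame and the list of areas are made up for illustration, and `reset_index(drop=True)` is a suggested variant: it renumbers the rows without keeping the old index as an extra `index` column.

```python
import pandas as pd

# Hypothetical miniature price table and list of areas that have stations
prices = pd.DataFrame({"name": ["BARNET", "BEXLEY", "BRENT"]})
areas_with_stations = pd.Series(["BARNET", "BRENT"])

# Boolean mask: True where the borough has at least one station
has_station = prices["name"].isin(areas_with_stations)

dropped = prices.loc[~has_station]                        # rows to omit
kept = prices.loc[has_station].reset_index(drop=True)     # rows to keep

print(dropped["name"].tolist())  # ['BEXLEY']
print(kept["name"].tolist())     # ['BARNET', 'BRENT']
```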

In [25]:
ldn_hse_price = ldn_hse_price.loc[a]
ldn_hse_price.reset_index(inplace = True)

In [26]:
ldn_hse_price['name'].isin(ldn_bor['area_name']).value_counts()

True    27
Name: name, dtype: int64

In [27]:
c = ldn_bor['area_name'].isin(ldn_hse_price['name'])
print(c.value_counts())
#ldn_bor['area_name'].unique()

True     253
False     15
Name: area_name, dtype: int64


In [28]:
d = ~c  # True for stations whose area is absent from ldn_hse_price
ldn_bor.loc[d]

Unnamed: 0,id,station_name,area_name
64,65,RODING VALLEY,EPPING FOREST
65,66,CHIGWELL,EPPING FOREST
117,120,BUCKHURST HILL,EPPING FOREST
118,121,LOUGHTON,EPPING FOREST
121,124,DEBDEN,EPPING FOREST
122,125,THEYDON BOIS,EPPING FOREST
123,126,EPPING,EPPING FOREST
216,284,AMERSHAM,CHILTERN
252,320,CHALFONT & LATIMER,CHILTERN
253,321,CHORLEYWOOD,THREE RIVERS


In [29]:
ldn_bor = ldn_bor.loc[c]
ldn_bor.reset_index(inplace = True)

In [30]:
ldn_bor['area_name'].isin(ldn_hse_price['name']).value_counts()

True    253
Name: area_name, dtype: int64

In [31]:
ldn_sta_freq

Unnamed: 0,id,station_name,frequency(mill)
0,1,BRENT CROSS,304.63
1,2,COLINDALE,849.48
2,3,BURNT OAK,528.80
3,4,EDGWARE,628.82
4,5,MILL HILL EAST,158.18
...,...,...,...
263,331,ANGEL,2304.94
264,332,HENDON CENTRAL,900.52
265,333,WIMBLEDON,1574.78
266,334,WOOD LANE,470.25


According to public records, there were 270 Underground stations in 2017, which is consistent with the 270 rows in the raw `ldn_bor` dataframe (268 remained after duplicates and missing values were dropped).

In [32]:
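# Station name finder
# Note: station names were upper-cased during cleaning, so a mixed-case term
# such as 'Brent Cross' matches no rows; search for 'BRENT CROSS' instead.
ldn_bor.loc[ldn_bor['station_name'] == 'Brent Cross']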

Unnamed: 0,index,id,station_name,area_name
