Dissertation
    
# **Notebook 1: Data Collection and Cleaning**
    
This notebook builds a full LSOA profile for Kent. All variables (listed in the table below) are collated into one dataframe and then geoconverted to 2011 standards, allowing for comparison.
***

In [1]:
# Set up notebook 

# Import packages

import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
import urllib
import zipfile
import re
import csv
import gzip
import statsmodels.api as sm
import contextily as ctx
import scipy.stats as stats
from shapely.geometry import Point
from matplotlib_scalebar.scalebar import ScaleBar
from statsmodels.stats.outliers_influence import variance_inflation_factor 
from statsmodels.tools.tools import add_constant
from pyproj import Proj, transform
import matplotlib.patches as mpatches
import glob
import functools
from functools import reduce



In [2]:
# Look at working dir

print("The working directory is " + os.getcwd())

The working directory is /home/jovyan/work/OneDrive/UCL/Dissertation/Notebooks Tidy


In [3]:
# Set directories

data_2001 = os.path.join("data", "2001")
data_2011 = os.path.join("data", "2011")
shapefiles = os.path.join("Shapefiles")

# 2. Read in data

***
## 2.1 2001 data

In [4]:
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(data_2001, '*.csv'))

# loop through the files and read them in with pandas
dataframes_2001 = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile, skiprows=5, encoding= 'unicode_escape')
    df = df.rename(columns={"LSOA code":"LSOA01CD"})
    dataframes_2001.append(df)



In [5]:
# check the column names

for i in range(len(dataframes_2001)):
    print(dataframes_2001[i].columns)

Index(['Local authority code', 'Local authority name', 'LSOA01CD', 'LSOA name',
       'Year ending Mar 2001', 'Year ending Jun 2001', 'Year ending Sep 2001',
       'Year ending Dec 2001', 'Average_2001', 'Unnamed: 9',
       ...
       'Unnamed: 99', 'Unnamed: 100', 'Unnamed: 101', 'Unnamed: 102',
       'Unnamed: 103', 'Unnamed: 104', 'Unnamed: 105', 'Unnamed: 106',
       'Unnamed: 107', 'Unnamed: 108'],
      dtype='object', length=109)
Index(['LSOA01CD', 'Totalweeklyincome(£)'], dtype='object')
Index(['GOR01CD', 'GOR01NM', 'CTY01CD', 'CTY01NM', 'LAD01CD', 'LAD01NM',
       'MSOA01CD', 'MSOA01NM', 'LSOA01CD', 'LSOA01NM', 'RUC01NM', 'RUC01CD',
       'Morphology Name', 'Morphology Code', 'Context Name', 'Context Code'],
      dtype='object')
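
`glob.glob` does not guarantee any particular file order, so the position of each table in `dataframes_2001` can differ between machines. The sketch below is one possible way to make the read step reproducible: it sorts the paths, keys each table by its file name, and drops the `Unnamed: N` columns like those printed above. `read_csv_folder` is a hypothetical helper and the two files are toy stand-ins, not the dissertation data.

```python
import glob
import os
import tempfile

import pandas as pd


def read_csv_folder(folder, **read_kwargs):
    """Read every .csv in `folder` into a dict keyed by file name (no extension)."""
    tables = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(path, **read_kwargs)
        # Drop 'Unnamed: N' columns (pandas' name for empty header cells)
        df = df.loc[:, ~df.columns.str.startswith("Unnamed:")]
        name = os.path.splitext(os.path.basename(path))[0]
        tables[name] = df
    return tables


# Toy demonstration: two small files with a trailing comma on every row
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "income.csv"), "w") as f:
        f.write("LSOA code,income,\nE01000001,500,\n")
    with open(os.path.join(tmp, "age.csv"), "w") as f:
        f.write("LSOA code,age_0_4,\nE01000001,80,\n")
    tables = read_csv_folder(tmp)

print(list(tables))
print(list(tables["income"].columns))
```

Keyword arguments such as `skiprows=5` can be passed straight through, and looking tables up by name avoids relying on list positions later on.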


In [6]:
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(data_2001, 'NOMIS/*.csv'))

# loop through the files and read them in with pandas
dataframes_2001_nomis = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile, skiprows=6)
    #df = df.rename(columns={"LSOA code":"LSOA01CD", "mnemonic":"LSOA01CD", "super output areas - lower layer":"LSOA01NM"})
    dataframes_2001_nomis.append(df)

In [7]:
# check the column names

for i in range(len(dataframes_2001_nomis)):
    print(dataframes_2001_nomis[i].columns)

Index(['super output areas - lower layer', 'mnemonic', 'All usual residents',
       'Age 0 to 4', 'Age 5 to 7', 'Age 8 to 9', 'Age 10 to 14', 'Age 15',
       'Age 16 to 17', 'Age 18 to 19', 'Age 20 to 24', 'Age 25 to 29',
       'Age 30 to 44', 'Age 45 to 59', 'Age 60 to 64', 'Age 65 to 74',
       'Age 75 to 84', 'Age 85 to 89', 'Age 90 and over'],
      dtype='object')
Index(['super output areas - lower layer', 'mnemonic', 'All people',
       'Single (never married)', 'Married (first marriage)', 'Re-married',
       'Separated (but still legally married)', 'Divorced', 'Widowed'],
      dtype='object')
Index(['super output areas - lower layer', 'mnemonic',
       'All categories: Ethnic group', 'White', 'White: British',
       'White: Irish', 'White: Other', 'Mixed',
       'Mixed: White and Black Caribbean', 'Mixed: White and Black African',
       'Mixed: White and Asian', 'Mixed: Other', 'Asian/Asian British',
       'Asian/Asian British: Indian', 'Asian/Asian British: Pakistan

## 2.2 2011 Data

In [8]:
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(data_2011, '*.csv'))

# loop through the files and read them in with pandas
dataframes_2011 = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile, skiprows=5)
    df = df.rename(columns={"LSOA code":"LSOA11CD", "mnemonic":"LSOA11CD", "super output areas - lower layer":"LSOA11NM"})
    dataframes_2011.append(df)

In [9]:
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(data_2011, 'NOMIS/*.csv'))

# loop through the files and read them in with pandas
dataframes_2011_nomis = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile, skiprows=8, skip_blank_lines=True)
    df = df.rename(columns={"LSOA code":"LSOA11CD", "mnemonic":"LSOA11CD", "super output areas - lower layer":"LSOA11NM", "geography":"LSOA11NM"})
    dataframes_2011_nomis.append(df)

We now have five key sets of data:

1. Geographic data containing LSOA codes for just Kent
2. 2001 data - extra
3. 2001 data - NOMIS
4. 2011 data - extra
5. 2011 data - NOMIS

The next step is to merge the 2001 tables into a single dataframe, convert it to 2011 values and codes, and then combine everything into one dataframe holding all the values.

# 2.3 Merge DataFrames
## 2.3.1 2001

In [10]:
# Read in 2001 spatial data frame

kent_2001 = gpd.read_file(os.path.join(shapefiles, "LSOA_KENT_2001.shp"))


In [11]:
kent_2001[kent_2001["LSOA01CD"]=="E01024312"]

Unnamed: 0,OBJECTID,LSOA01CD,LSOA01NM,LSOA01NMW,Shape__Are,Shape__Len,geometry
934,24312,E01024312,Gravesham 008E,Gravesham 008E,1111686.0,6497.565939,"POLYGON ((0.40208 51.42978, 0.40211 51.42977, ..."


In [12]:
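# Merge each dataframe in the lists onto the Kent LSOAs only

all_ok = True

for i in range(len(dataframes_2001)):
    dataframes_2001[i] = kent_2001.merge(dataframes_2001[i], on="LSOA01CD", how="left")
    # The check sits inside the loop so every dataframe is tested
    if len(dataframes_2001[i]) != len(kent_2001):
        print(f"Error! Row count changed in dataframes_2001[{i}]")
        all_ok = False

for i in range(len(dataframes_2001_nomis)):
    dataframes_2001_nomis[i] = kent_2001.merge(dataframes_2001_nomis[i], left_on=["LSOA01CD"], right_on=["mnemonic"], how="left")
    if len(dataframes_2001_nomis[i]) != len(kent_2001):
        print(f"Error! Row count changed in dataframes_2001_nomis[{i}]")
        all_ok = False

# Report success only if no check failed (a for/else would always print)
if all_ok:
    print("Looks like all is ok")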

Looks like all is ok
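
A row-count check on a left merge shows that no Kent LSOA was duplicated, but it cannot show whether an LSOA found no match and was filled with missing values. A minimal sketch of a stricter check, using the `validate` and `indicator` options of pandas' `merge`; the two frames are toy stand-ins for `kent_2001` and one census table.

```python
import pandas as pd

# Toy stand-ins: `kent` plays the role of kent_2001, `table` one census table
kent = pd.DataFrame({"LSOA01CD": ["E01", "E02", "E03"]})
table = pd.DataFrame(
    {"LSOA01CD": ["E01", "E02", "E99"], "residents": [1500, 1600, 1400]}
)

merged = kent.merge(
    table,
    on="LSOA01CD",
    how="left",
    validate="one_to_one",  # raises MergeError if a code repeats on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# LSOAs present in the Kent list but absent from the census table
unmatched = merged.loc[merged["_merge"] == "left_only", "LSOA01CD"].tolist()
print(f"{len(merged)} rows, unmatched LSOAs: {unmatched}")
```

The `_merge` column would need dropping again before the tables are combined.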


In [13]:
for i in range(len(dataframes_2001_nomis)):
    globals()[f'Y_{i}'] = len((dataframes_2001_nomis[i].columns.values))

In [14]:
# Find how many columns are in all data frames

for i in range(len(dataframes_2001)):
    globals()[f'X_{i}'] = len(dataframes_2001[i].columns.values)
a = X_0 + X_1
    
for i in range(len(dataframes_2001_nomis)):
    globals()[f'Y_{i}'] = len((dataframes_2001_nomis[i].columns.values))

b = (Y_0 + Y_1 + Y_2 +  Y_3 + Y_4 +  Y_5 + 
           Y_6 +  Y_7 +  Y_8 + Y_9 + Y_10 + Y_11 +
           Y_12 + Y_13 + Y_14 + Y_15 + Y_16 + Y_17 +
           Y_18 + Y_19 + Y_20 + Y_21+ Y_22)

count_LSOA01CD = len(dataframes_2001) + len(dataframes_2001_nomis) - 1

# Subtract the repeated LSOA01CD columns, as this is the merge key

columns = a+b-count_LSOA01CD
print("We can expect " + str(columns) + " columns in the 2001 dataframe")


We can expect 608 columns in the 2001 dataframe
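
The same kind of count can be produced without `globals()` or hard-coded indices. The sketch below uses toy frames and assumes the aim is to know how many distinct column names should survive once repeats are dropped. Because every table was merged onto `kent_2001` in the previous step, each one also carries the shapefile columns, so subtracting only the repeated `LSOA01CD` columns is likely to give a different number from the merged result.

```python
import pandas as pd

# Toy stand-ins for the entries of dataframes_2001 / dataframes_2001_nomis
frames = [
    pd.DataFrame({"LSOA01CD": ["E01"], "geometry": ["poly"], "income": [500]}),
    pd.DataFrame({"LSOA01CD": ["E01"], "geometry": ["poly"], "Age 0 to 4": [80]}),
    pd.DataFrame({"LSOA01CD": ["E01"], "geometry": ["poly"], "Widowed": [12]}),
]

# Total columns across all frames, counting repeats
total_columns = sum(len(df.columns) for df in frames)

# Distinct column names: what remains once repeated columns are dropped
unique_columns = set().union(*(df.columns for df in frames))

print(total_columns, len(unique_columns))
```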


In [15]:
# Create one dataframe from list of dataframes

dataframes_2001_merged = reduce(lambda  left,right: pd.merge(left,right,on=['LSOA01CD'], how='outer', suffixes=('', '_drop')), dataframes_2001)
dataframes_2001_merged_nomis = reduce(lambda  left,right: pd.merge(left,right,on=['LSOA01CD'], how='outer', suffixes=('', '_drop')), dataframes_2001_nomis)
census_2001 = pd.merge(dataframes_2001_merged, dataframes_2001_merged_nomis, how="outer", on="LSOA01CD", suffixes=('', '_drop'))

In [16]:
# Check all columns are present

if len(census_2001.columns) == columns:
    print("Success!")
else:
    print("Fail")
print("There are " + str(len(census_2001.columns)) + " columns in the merged dataframe, and " + str(columns) + " in the lists")


Fail
There are 651 columns in the merged dataframe, and 608 in the lists


In [17]:
# Drop duplicates

census_2001.drop(census_2001.filter(regex='_drop|_x|_y|Unnamed:').columns, axis=1, inplace=True)
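
The pattern `_drop|_x|_y|Unnamed:` matches anywhere in a column name, so any column that merely contains `_x` or `_y` would be dropped as well. A small sketch of the difference when the suffixes are anchored to the end of the name; `tax_year` is a hypothetical column used only for illustration.

```python
import pandas as pd

# Toy frame; `tax_year` is a hypothetical column that merely contains "_y"
df = pd.DataFrame(
    columns=["LSOA01CD", "tax_year", "geometry_drop", "name_x", "Unnamed: 9"]
)

# Unanchored: matches the substring anywhere in the name
unanchored = df.filter(regex="_drop|_x|_y|Unnamed:").columns.tolist()

# Anchored: suffixes must end the name, 'Unnamed:' must start it
anchored = df.filter(regex=r"(_drop|_x|_y)$|^Unnamed:").columns.tolist()

print(unanchored)
print(anchored)
```

With the anchored pattern `tax_year` is kept, while the suffixed and unnamed columns are still selected for dropping.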


In [18]:
print(census_2001.columns.tolist())
print("Looks good")

['OBJECTID', 'LSOA01CD', 'LSOA01NM', 'LSOA01NMW', 'Shape__Are', 'Shape__Len', 'geometry', 'Local authority code', 'Local authority name', 'LSOA name', 'Year ending Mar 2001', 'Year ending Jun 2001', 'Year ending Sep 2001', 'Year ending Dec 2001', 'Average_2001', 'Totalweeklyincome(£)', 'GOR01CD', 'GOR01NM', 'CTY01CD', 'CTY01NM', 'LAD01CD', 'LAD01NM', 'MSOA01CD', 'MSOA01NM', 'RUC01NM', 'RUC01CD', 'Morphology Name', 'Morphology Code', 'Context Name', 'Context Code', 'super output areas - lower layer', 'mnemonic', 'All usual residents', 'Age 0 to 4', 'Age 5 to 7', 'Age 8 to 9', 'Age 10 to 14', 'Age 15', 'Age 16 to 17', 'Age 18 to 19', 'Age 20 to 24', 'Age 25 to 29', 'Age 30 to 44', 'Age 45 to 59', 'Age 60 to 64', 'Age 65 to 74', 'Age 75 to 84', 'Age 85 to 89', 'Age 90 and over', 'All people', 'Single (never married)', 'Married (first marriage)', 'Re-married', 'Separated (but still legally married)', 'Divorced', 'Widowed', 'All categories: Ethnic group', 'White', 'White: British', 'White: 

In [19]:
# Make it a spatial dataframe again 

kent_2001_filter = kent_2001.filter(items=['LSOA01CD', 'geometry'])

# Merge

census_2001 = pd.merge(kent_2001_filter, census_2001, how="outer", on="LSOA01CD")


In [20]:
# Tidy

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(census_2001.head(5))

    LSOA01CD                                         geometry_x  OBJECTID  \
0  E01000374  MULTIPOLYGON (((0.19815 51.45193, 0.19816 51.4...       374   
1  E01000375  MULTIPOLYGON (((0.18606 51.44590, 0.18601 51.4...       375   
2  E01000378  MULTIPOLYGON (((0.17090 51.44121, 0.17102 51.4...       378   
3  E01000379  MULTIPOLYGON (((0.17470 51.44254, 0.17518 51.4...       379   
4  E01000381  MULTIPOLYGON (((0.15333 51.42546, 0.15333 51.4...       381   

      LSOA01NM    LSOA01NMW    Shape__Are   Shape__Len  \
0  Bexley 019A  Bexley 019A  7.476827e+05  5475.181697   
1  Bexley 019B  Bexley 019B  4.292436e+05  4143.641365   
2  Bexley 019E  Bexley 019E  1.261753e+06  7363.117238   
3  Bexley 019F  Bexley 019F  3.241723e+05  3659.854482   
4  Bexley 028A  Bexley 028A  2.056602e+06  8420.379806   

                                          geometry_y Local authority code  \
0  MULTIPOLYGON (((0.19815 51.45193, 0.19816 51.4...            E09000004   
1  MULTIPOLYGON (((0.18606 51.4459
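
The `geometry_x` and `geometry_y` columns in the output above appear because both tables still hold a `geometry` column when they are merged. A minimal sketch of one way to keep a single copy, assuming the geometry from the Kent shapefile is the one wanted. It uses plain pandas with placeholder strings instead of real geometries.

```python
import pandas as pd

# Toy stand-ins; geometries are placeholder strings, not shapely objects
kent_filter = pd.DataFrame(
    {"LSOA01CD": ["E01", "E02"], "geometry": ["poly1", "poly2"]}
)
census = pd.DataFrame(
    {
        "LSOA01CD": ["E01", "E02"],
        "geometry": ["poly1", "poly2"],
        "All usual residents": [1500, 1600],
    }
)

# Remove the duplicate geometry from the right-hand table before merging
merged = kent_filter.merge(
    census.drop(columns="geometry"), on="LSOA01CD", how="outer"
)

print(merged.columns.tolist())
```

With real data, the geopandas documentation recommends calling `merge` as a method on the GeoDataFrame so that the result stays spatial.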

In [23]:
census_2001_df = pd.DataFrame(census_2001)

In [24]:
# Save

# Create folder 

if not os.path.isdir('LSOA Profiles'):
    print("Creating 'LSOA Profiles' directory...")
    os.mkdir('LSOA Profiles')
    
filename = "census_2001_profile.csv"
path = "LSOA Profiles"
fullpath = os.path.join(path, filename)

census_2001_df.to_csv(fullpath)

#census_2001_df.to_csv(os.path.join("LSOA Profiles", ))
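
The folder check and save can also be written with `os.makedirs(..., exist_ok=True)`, which is safe to run repeatedly. The sketch below writes into a temporary folder so it can be run anywhere. The toy `profile` frame stands in for `census_2001_df`, and `index=False` is an assumption that the row index is not needed in the file.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for census_2001_df
profile = pd.DataFrame(
    {"LSOA01CD": ["E01", "E02"], "All usual residents": [1500, 1600]}
)

with tempfile.TemporaryDirectory() as base:
    out_dir = os.path.join(base, "LSOA Profiles")
    # exist_ok=True means a second call does not raise an error
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)

    fullpath = os.path.join(out_dir, "census_2001_profile.csv")
    profile.to_csv(fullpath, index=False)

    # Read the file back to confirm what was written
    reloaded = pd.read_csv(fullpath)

print(reloaded.columns.tolist())
```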

## 2.3.2 2011

In [25]:
# Read in 2011 spatial data frame

kent_2011 = gpd.read_file(os.path.join(shapefiles, "LSOA_KENT_2011.shp"))

In [26]:
# Merge each dataframe in the lists onto the Kent LSOAs only

all_ok = True

for i in range(len(dataframes_2011)):
    dataframes_2011[i] = kent_2011.merge(dataframes_2011[i], on="LSOA11CD", how="left")
    # The check sits inside the loop so every dataframe is tested
    if len(dataframes_2011[i]) != len(kent_2011):
        print(f"Error! Row count changed in dataframes_2011[{i}]")
        all_ok = False

for i in range(len(dataframes_2011_nomis)):
    dataframes_2011_nomis[i] = kent_2011.merge(dataframes_2011_nomis[i], left_on=["LSOA11CD"], right_on=["LSOA11CD"], how="left")
    if len(dataframes_2011_nomis[i]) != len(kent_2011):
        print(f"Error!{i}")
        all_ok = False

# Report success only if no check failed (a for/else would always print)
if all_ok:
    print("Looks like all is ok")

Looks like all is ok


In [27]:
# Find how many columns are in all data frames

for i in range(len(dataframes_2011)):
    globals()[f'X_{i}'] = len(dataframes_2011[i].columns.values)

# Column count for the third dataframe is set by hand
X_2 = 9
a = X_0 + X_1 + X_2
    
for i in range(len(dataframes_2011_nomis)):
    globals()[f'Y_{i}'] = len((dataframes_2011_nomis[i].columns.values))

# Sum only once every Y_i has been updated for 2011
b = (Y_0 + Y_1 + Y_2 +  Y_3 + Y_4 +  Y_5 + 
           Y_6 +  Y_7 +  Y_8 + Y_9 + Y_10 + Y_11 +
           Y_12 + Y_13 + Y_14 + Y_15)

count_LSOA11CD = len(dataframes_2011) + len(dataframes_2011_nomis) - 1

# Subtract the repeated LSOA11CD columns, as this is the merge key

columns = a+b-count_LSOA11CD
print("We can expect " + str(columns) + " columns in the 2011 dataframe")


We can expect 351 columns in the 2011 dataframe


In [28]:
# Create one dataframe from list of dataframes

dataframes_2011_merged = reduce(lambda  left,right: pd.merge(left,right,on=['LSOA11CD'], how='outer', suffixes=('', '_drop')), dataframes_2011)
dataframes_2011_merged_nomis = reduce(lambda  left,right: pd.merge(left,right,on=['LSOA11CD'], how='outer', suffixes=('', '_drop')), dataframes_2011_nomis)
census_2011 = pd.merge(dataframes_2011_merged, dataframes_2011_merged_nomis, how="outer", on="LSOA11CD", suffixes=('', '_drop'))

In [29]:
# Check all columns are present

if len(census_2011.columns) == columns:
    print("Success!")
else:
    print("Fail")
print("There are " + str(len(census_2011.columns)) + " columns in the merged dataframe, and " + str(columns) + " in the lists")


Fail
There are 579 columns in the merged dataframe, and 351 in the lists


In [30]:
# Drop duplicates

census_2011.drop(census_2011.filter(regex='_drop|_x|_y|Unnamed:').columns, axis=1, inplace=True)


In [31]:
print(census_2011.columns.tolist())
print("Looks good")

['LSOA11CD', 'LSOA11NM', 'geometry', 'Local authority code', 'Local authority name', 'LSOA name', 'Year ending Mar 2011', 'Year ending Jun 2011', 'Year ending Sep 2011', 'Year ending Dec 2011', 'Average_2011', 'Totalweeklyincome(£)', 'Upperconfidencelimit(£)', 'Lowerconfidencelimit(£)', 'Confidenceinterval(£)', 'RUC11CD', 'RUC11', 'FID', '2011 super output area - lower layer', 'All usual residents', 'Age 0 to 4', 'Age 5 to 7', 'Age 8 to 9', 'Age 10 to 14', 'Age 15', 'Age 16 to 17', 'Age 18 to 19', 'Age 20 to 24', 'Age 25 to 29', 'Age 30 to 44', 'Age 45 to 59', 'Age 60 to 64', 'Age 65 to 74', 'Age 75 to 84', 'Age 85 to 89', 'Age 90 and over', 'All usual residents aged 16+', 'Single (never married or never registered a same-sex civil partnership)', 'Married', 'In a registered same-sex civil partnership', 'Separated (but still legally married or still legally in a same-sex civil partnership)', 'Divorced or formerly in a same-sex civil partnership which is now legally dissolved', 'Widowed 

In [32]:
# Make it a spatial dataframe again 

kent_2011_filter = kent_2011.filter(items=['LSOA11CD', 'geometry'])

# Merge

census_2011 = pd.merge(kent_2011_filter, census_2011, how="outer", on="LSOA11CD")


In [33]:
# Tidy

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(census_2011.head(5))

    LSOA11CD                                         geometry_x      LSOA11NM  \
0  E01023972  POLYGON ((602494.344 141509.244, 602498.426 14...  Ashford 006A   
1  E01023973  POLYGON ((601527.620 141293.178, 601527.125 14...  Ashford 005A   
2  E01023974  POLYGON ((599609.242 141534.213, 599612.382 14...  Ashford 007A   
3  E01023975  POLYGON ((599541.509 141383.033, 599540.728 14...  Ashford 007B   
4  E01023976  POLYGON ((600185.589 141361.026, 600187.479 14...  Ashford 008A   

                                          geometry_y Local authority code  \
0  POLYGON ((602494.344 141509.244, 602498.426 14...            E07000105   
1  POLYGON ((601527.620 141293.178, 601527.125 14...            E07000105   
2  POLYGON ((599609.242 141534.213, 599612.382 14...            E07000105   
3  POLYGON ((599541.509 141383.033, 599540.728 14...            E07000105   
4  POLYGON ((600185.589 141361.026, 600187.479 14...            E07000105   

  Local authority name     LSOA name Year ending M

In [34]:
# Save

filename = "census_2011_profile.csv"
path = "LSOA Profiles"
fullpath = os.path.join(path, filename)

census_2011.to_csv(fullpath)

In [35]:
print(f"In 2001, there were {len(census_2001)} LSOAs, with an average population of {round(census_2001['All usual residents'].mean())}")
print(f"In 2011, there were {len(census_2011)} LSOAs, with an average population of {round(census_2011['All usual residents'].mean())}")

In 2001, there were 935 LSOAs, with an average population of 1507
In 2011, there were 902 LSOAs, with an average population of 1623
