# EMA Lab Notebook

__Name:__ Daniel Smith

__PI:__ A7603242


In [1]:
# import the required libraries
import pandas as pd
import scipy.stats
import pymongo
import bson
import collections
from matplotlib import pyplot
import seaborn as sns

## Contents 

- [Data preparation](#importing)
- [Cleaning the dataset](#cleaning)
- [Q1. KS4 Investigation](#q1)
- [Application of Machine Learning](#machine_learning)
- [Q2. KS2 - KS4 Investigation](#q2)


# Data preparation
<a name="importing"></a>

Before we can investigate the data we will need to have a look at it, determine what cleaning (if any) needs doing, carry out the cleaning and store the result in an appropriate form for access.

## Importing the KS2 data into a pandas dataframe

These steps are from the TMA02_Question_2b file.  I will use them to clean the datasets before storing them in MongoDB for easier querying.

### Import the LEA data

In [2]:
leas_df = pd.read_csv('data/2015-2016/la_and_region_codes_meta.csv')
leas_df.head()

Unnamed: 0,LEA,LA Name,REGION,REGION NAME
0,841,Darlington,1,North East A
1,840,County Durham,1,North East A
2,805,Hartlepool,1,North East A
3,806,Middlesbrough,1,North East A
4,807,Redcar and Cleveland,1,North East A


### Import the KS2 data
Most of the field names are given in the ks2_meta file, so we will use that to keep track of the types of the various columns.

In [3]:
ks2cols = pd.read_csv('data/2015-2016/ks2_meta.csv')
ks2cols['Field Name'] = ks2cols['Field Name'].apply(lambda r: r.strip())
ks2cols

Unnamed: 0,Column,Field Name,Label/Description
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,2,ALPHAIND,Alphabetic index
2,3,LEA,Local authority number
3,4,ESTAB,Establishment number
4,5,URN,School unique reference number
5,6,SCHNAME,School/Local authority name
6,7,ADDRESS1,School address (1)
7,8,ADDRESS2,School address (2)
8,9,ADDRESS3,School address (3)
9,10,TOWN,School town


Some columns contain integers, but ___pandas___ will treat any numeric column with `na` values as `float64`, due to NumPy's number hierarchy (`NaN` is a float, so integer columns are upcast).

In [4]:
int_cols = [c for c in ks2cols['Field Name'] 
            if c.startswith('T')
            if c not in ['TOWN', 'TELNUM', 'TKS1AVERAGE']]
int_cols += ['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'URN_AC', 'ICLOSE']
int_cols += ['READ_AVERAGE', 'GPS_AVERAGE', 'MAT_AVERAGE']
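
The upcasting described above can be seen on a toy column. This is a minimal sketch with made-up values (not taken from the KS2 file); the nullable `Int64` dtype is shown only as one possible alternative, not something this notebook relies on.

```python
import pandas as pd

# A small made-up column of pupil counts with one missing value.
counts = pd.Series([30, None, 28])
print(counts.dtype)  # float64, because the missing value is stored as NaN

# The nullable 'Int64' extension dtype keeps integers alongside missing values.
nullable_counts = counts.astype('Int64')
print(nullable_counts.dtype)  # Int64
```
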

Some columns contain percentages. We'll convert these to floating point numbers on import.

Note that we also need to handle the case of `SUPP` and `NEW` in the data.

In [39]:
def p2f(x):
    if x.strip('%').isnumeric():
        return float(x.strip('%'))/100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x
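
To make the behaviour of `p2f` concrete, here is a small sketch that runs it on a few made-up values. Note that `str.isnumeric()` is `False` for strings containing a decimal point, so a value such as `'12.5%'` would be returned unchanged. Whether the files contain decimal percentages has not been checked here; `p2f_decimal` is a hypothetical variant for that case and is not part of the notebook.

```python
def p2f(x):
    # Same logic as the cell above: 'NN%' -> fraction, missing-data codes -> 0.0
    if x.strip('%').isnumeric():
        return float(x.strip('%')) / 100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x


def p2f_decimal(x):
    # Hypothetical variant: try the float conversion directly so that decimal
    # percentages are accepted, and fall back to p2f for everything else.
    try:
        return float(x.strip('%')) / 100
    except ValueError:
        return p2f(x)


samples = ['45%', '12.5%', 'SUPP', 'NW1 8AB']
original = [p2f(s) for s in samples]
tolerant = [p2f_decimal(s) for s in samples]
print(original)  # [0.45, '12.5%', 0.0, 'NW1 8AB']
print(tolerant)  # [0.45, 0.125, 0.0, 'NW1 8AB']
```
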

These are the columns to try to convert from percentages. Note that we can be generous here, as columns like PCODE (postcode) will return the original value if the conversion fails.

In [6]:
percent_cols = [f for f in ks2cols['Field Name'] if f.startswith('P')]
percent_cols += ['WRITCOV', 'MATCOV', 'READCOV'] 
percent_cols += ['PTMAT_HIGH', 'PTREAD_HIGH', 'PSENELSAPK', 'PSENELK', 'PTGPS_HIGH']
percent_converters = {c: p2f for c in percent_cols}

In [7]:
ks2_df = pd.read_csv('data/2015-2016/england_ks2final.csv', 
                   na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''],
                   converters=percent_converters)

Drop the summary rows, keeping just the rows for mainstream and special schools.

In [8]:
ks2_df = ks2_df[(ks2_df['RECTYPE'] == 1) | (ks2_df['RECTYPE'] == 2)]
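
The import-and-filter steps above can be tried on a tiny in-memory CSV. The rows below are made up for illustration, and `isin` is used as an equivalent, slightly shorter way of writing the two-way comparison.

```python
import io

import pandas as pd

# Made-up sample: two school rows and one local-authority summary row.
sample = io.StringIO(
    "RECTYPE,URN,TELIG\n"
    "1,100000,30\n"
    "2,100001,SUPP\n"
    "4,,NA\n"
)

# The listed codes are read as missing values (NaN).
sample_df = pd.read_csv(sample, na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''])

# Keep mainstream (1) and special (2) schools only.
schools = sample_df[sample_df['RECTYPE'].isin([1, 2])]
print(len(schools))  # 2
```
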

Convert everything to numbers, if possible.

In [9]:
ks2_df = ks2_df.apply(pd.to_numeric, errors='ignore')
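
Recent pandas releases deprecate `errors='ignore'` in `pd.to_numeric`. A minimal sketch of the same "convert if possible" idea with an explicit fallback is shown below; `to_numeric_if_possible` is a hypothetical helper name and the dataframe is made up.

```python
import pandas as pd


def to_numeric_if_possible(column):
    # Return the column converted to numbers, or unchanged if any value
    # cannot be parsed as a number.
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column


toy_df = pd.DataFrame({'URN': ['100000', '100028'],
                       'TOWN': ['London', 'Leeds']})
converted = toy_df.apply(to_numeric_if_possible)
print(converted.dtypes)
```
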

Merge the LEA data into the school data

In [10]:
ks2_df = pd.merge(ks2_df, leas_df, on=['LEA'])
ks2_df.head().T

Unnamed: 0,0,1,2,3,4
RECTYPE,1,1,1,1,1
ALPHAIND,53372,11156,11160,11256,16366
LEA,201,202,202,202,202
ESTAB,3614,3323,3327,2842,2184
URN,100000,100028,100029,130342,100013
SCHNAME,Sir John Cass's Foundation Primary School,"Christ Church Primary School, Hampstead",Christ Church School,Christopher Hatton Primary School,Edith Neville Primary School
ADDRESS1,St James's Passage,Christ Church Hill,Redhill Street,38 Laystall Street,174 Ossulston Street
ADDRESS2,Duke's Place,,Camden,,
ADDRESS3,,,,,
TOWN,London,London,London,London,London


## Importing the KS4 data into a pandas dataframe

Now to adapt the steps above to clean and import the KS4 data.  Again these steps are from the TMA02_Question2b-pd file.  I will use them to clean the datasets before storing them in MongoDB for easier querying.

Again most of the field names are given in the `ks4_meta` file, so we'll import that first.

In [11]:
ks4cols = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4cols.head()

Unnamed: 0,Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,Unnamed: 8,Unnamed: 9
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...,,,,,,,
1,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,
2,3,LEA,Local authority code (see separate list of loc...,,,,Yes,Yes,,
3,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,
4,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,


There are some extra columns in comparison to the ks2_meta file.  For our purposes we can drop these from the df.

In [12]:
# note: this can only run once, so it is guarded by a check
if 'Field Name' not in ks4cols.columns:
    # .copy() avoids a SettingWithCopyWarning when a column is modified later
    ks4cols = ks4cols[['Column','Metafile heading', 'Metafile description']].copy()
    # rename the columns to match the ks2cols df
    ks4cols.columns = ['Column','Field Name', 'Label/Description']

In [13]:
ks4cols['Field Name'] = ks4cols['Field Name'].apply(lambda r: r.strip())
ks4cols

Unnamed: 0,Column,Field Name,Label/Description
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,2,ALPHAIND,Alphabetic sorting index
2,3,LEA,Local authority code (see separate list of loc...
3,4,ESTAB,Establishment number
4,5,URN,School Unique Reference Number
5,6,SCHNAME,School name
6,7,SCHNAME_AC,School now known as (used if the school has co...
7,8,ADDRESS1,School address (1)
8,9,ADDRESS2,School address (2)
9,10,ADDRESS3,School address (3)


Many of the fields are similar to those in the KS2 file, so we need to carry out the same processing of the data.

In [42]:
int_cols = [c for c in ks4cols['Field Name'] 
            if c.startswith('T')
            if c not in ['TOWN', 'TELNUM', 'TKS1AVERAGE']]
int_cols += ['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'ICLOSE']


Again we will convert the percentages and handle the missing-data codes.

In [60]:
# reuse the p2f function to handle the % values and missing values
percent_cols = [f for f in ks4cols['Field Name'] if f.startswith('P')]
percent_converters = {c: p2f for c in percent_cols}

In [73]:
ks4_fields = [f for f in ks4cols['Field Name']]
for p in percent_cols:
    if p not in ks4_fields:
        print(p)

In [112]:
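# NOTE: percent_converters was built earlier and still refers to the previous
# p2f; the dict must be rebuilt for this version to take effect
def p2f(x):
    if x.strip('%').isnumeric():
        return float(x.strip('%'))/100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return float('nan')  # the bare name NaN is not defined
    else:
        return x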
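
Because `percent_converters` was built before this redefinition of `p2f`, the dictionary keeps pointing at the earlier function object. The sketch below illustrates that behaviour with hypothetical stand-in names (`convert`, `converters`); it is a general Python point rather than anything specific to pandas.

```python
def convert(x):
    # First version: missing-data codes become 0.0
    return 0.0


converters = {'PBPUP': convert}


def convert(x):
    # Redefinition: missing-data codes become NaN
    return float('nan')


# The dict still holds the first function object.
old_result = converters['PBPUP']('SUPP')

# Rebuilding the dict picks up the new definition.
converters = {'PBPUP': convert}
new_result = converters['PBPUP']('SUPP')
print(old_result, new_result)  # 0.0 nan
```
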

In [113]:
# read in the ks4final data
ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv',
                     na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', 'NP', ''],
                     converters=percent_converters,
                     low_memory=False
                    )

Drop any summary rows, keeping only the mainstream and special schools, after double-checking that the record types are the same as in the KS2 data.

In [114]:
print(ks4cols.iloc[0]['Label/Description'])

Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools))


_note:_  Although the codes for the summary rows differ from those in the KS2 metadata, codes 1 and 2 have the same meaning.

In [115]:
ks4_df = ks4_df[(ks4_df['RECTYPE'] == 1) | (ks4_df['RECTYPE'] == 2)]

Convert everything to numbers, if possible.

In [116]:
ks4_df = ks4_df.apply(pd.to_numeric, errors='ignore')

Merge the LEA data into the school data

In [117]:
ks4_df = pd.merge(ks4_df, leas_df, on=['LEA'])
ks4_df.head().T

Unnamed: 0,0,1,2,3,4
RECTYPE,1,1,1,1,1
ALPHAIND,11828,11830,368,9318,10054
LEA,201,201,202,202,202
ESTAB,6007,6005,4285,4611,6000
URN,100003,100001,100053,100054,137333
SCHNAME,City of London School,City of London School for Girls,Acland Burghley School,The Camden School for Girls,CATS College London
SCHNAME_AC,,,,,
ADDRESS1,Queen Victoria Street,St Giles' Terrace,Burghley Road,Sandall Road,43-45 Bloomsbury Square & 2 Southampton Place
ADDRESS2,,Barbican,,,
ADDRESS3,,,,,


In [106]:
ks4_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5335 entries, 0 to 5334
Columns: 375 entries, RECTYPE to REGION NAME
dtypes: float64(1), int64(11), object(363)
memory usage: 15.3+ MB
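
The `object(363)` count above suggests that most columns still hold strings. One likely contributor (an inference from the code, not verified against the file) is that codes such as `NP` pass through `p2f` unchanged, and a single non-numeric value keeps a whole column as `object`. The toy sketch below uses `errors='coerce'`, which is not what the notebook does, to show how such a column could be forced to numbers.

```python
import pandas as pd

# Made-up column: two numeric strings and one 'NP' (not published) code.
raw = pd.Series(['0.12', 'NP', '-0.3'])

# Values that cannot be parsed become NaN instead of blocking the conversion.
coerced = pd.to_numeric(raw, errors='coerce')
print(coerced.dtype)  # float64
```
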


Again, this is a rather large dataframe.  To save memory on the VM, I will delete the dataframe once the data is in MongoDB.

In [123]:
### Export the cleaned KS4 data into MongoDB

# Open Refine

Cleaning the data is difficult, so I'm going to look at it using OpenRefine.

In [125]:
# TODO

## Exporting the cleaned KS2 and KS4 data into MongoDB


In [23]:
# export the cleaned ks2 data to a csv
ks2_df.to_csv('data/2015-2016/cleaned_ks2.csv')
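
By default `DataFrame.to_csv` also writes the index as an unnamed first column, which would account for the `''` field that appears in the imported documents. A minimal sketch on a made-up one-row dataframe, showing the effect of `index=False` (not used in this notebook):

```python
import io

import pandas as pd

toy_df = pd.DataFrame({'URN': [100000], 'LEA': [201]})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
toy_df.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]

print(repr(header_with_index))     # ',URN,LEA'
print(repr(header_without_index))  # 'URN,LEA'
```
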

In [24]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks2_results \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/cleaned_ks2.csv

2018-05-19T13:45:11.994+0000	connected to: localhost:27351
2018-05-19T13:45:11.994+0000	dropping: schools_db.ks2_results
2018-05-19T13:45:14.994+0000	[###############.........] schools_db.ks2_results	11.6MB/17.5MB (66.0%)
2018-05-19T13:45:16.422+0000	[########################] schools_db.ks2_results	17.5MB/17.5MB (100.0%)
2018-05-19T13:45:16.422+0000	imported 16162 documents
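
`--ignoreBlanks` tells `mongoimport` to omit fields whose CSV value is empty, so documents for different schools can end up with different sets of fields. The effect can be imitated in plain Python; the rows below are made up, and `csv.DictReader` merely stands in for the import tool.

```python
import csv
import io

# Made-up sample: the second school has no ADDRESS2 value.
sample = io.StringIO(
    "URN,SCHNAME,ADDRESS2\n"
    "100000,School A,Duke's Place\n"
    "100028,School B,\n"
)

# Drop empty fields from each row, as --ignoreBlanks does on import.
documents = [
    {field: value for field, value in row.items() if value != ''}
    for row in csv.DictReader(sample)
]
print(sorted(documents[1]))  # ['SCHNAME', 'URN']
```
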


In [25]:
# open a connection to the mongo db
client = pymongo.MongoClient('mongodb://localhost:27351/')

# open the imported database
db = client.schools_db

In [26]:
# open the imported collection
ks2_results = db.ks2_results 

Check that the number of imported documents matches the length of the original dataframe.

In [27]:
# Cursor.count() was removed in PyMongo 4; count_documents({}) needs PyMongo 3.7+
len(ks2_df), ks2_results.count_documents({})

(16162, 16162)

Great, let's quickly look at one of the documents.

In [28]:
ks2_results.find_one()

{'': 0,
 'ADDRESS1': "St James's Passage",
 'ADDRESS2': "Duke's Place",
 'AGERANGE': '3-11',
 'ALPHAIND': 53372.0,
 'BELIG': 16.0,
 'DIFFN_MATPROG': 2.7,
 'DIFFN_READPROG': 0.6,
 'DIFFN_RWM_EXP': 23.0,
 'DIFFN_RWM_HIGH': -7.0,
 'DIFFN_WRITPROG': 0.2,
 'ESTAB': 3614.0,
 'GELIG': 12.0,
 'GPS_AVERAGE': 106.0,
 'GPS_AVERAGE_FSM6CLA1A': 105.0,
 'GPS_AVERAGE_H': 110.0,
 'GPS_AVERAGE_M': 105.0,
 'GPS_AVERAGE_NotFSM6CLA1A': 107.0,
 'ICLOSE': 0.0,
 'LA Name': 'City of London',
 'LEA': 201.0,
 'MATCOV': 1.0,
 'MATPROG': 3.0,
 'MATPROG_B': 2.9,
 'MATPROG_B_LOWER': 0.3,
 'MATPROG_B_UPPER': 5.5,
 'MATPROG_EAL': 3.1,
 'MATPROG_EAL_LOWER': 0.6,
 'MATPROG_EAL_UPPER': 5.6,
 'MATPROG_FSM6CLA1A': 2.9,
 'MATPROG_FSM6CLA1A_LOWER': -0.1,
 'MATPROG_FSM6CLA1A_UPPER': 5.9,
 'MATPROG_G': 3.1,
 'MATPROG_G_LOWER': 0.1,
 'MATPROG_G_UPPER': 6.1,
 'MATPROG_H': 0.4,
 'MATPROG_H_LOWER': -3.9,
 'MATPROG_H_UPPER': 4.7,
 'MATPROG_LOWER': 1.0,
 'MATPROG_M': 3.3,
 'MATPROG_MOBN': 3.0,
 'MATPROG_MOBN_LOWER': 1.0,
 'MATPROG_

We don't need the ks2_df dataframe any longer so I will delete it from memory.

In [29]:
del ks2_df

## Export the KS4 data into MongoDB

In [118]:
# export the cleaned ks4 data to a csv
ks4_df.to_csv('data/2015-2016/cleaned_ks4.csv')

In [119]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4_results \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/cleaned_ks4.csv

2018-05-19T14:17:56.617+0000	connected to: localhost:27351
2018-05-19T14:17:56.617+0000	dropping: schools_db.ks4_results
2018-05-19T14:17:58.691+0000	imported 5335 documents


In [120]:
# open the imported collection
ks4_results = db.ks4_results 

Check that the number of imported documents matches the length of the original dataframe.

In [121]:
# Cursor.count() was removed in PyMongo 4; count_documents({}) needs PyMongo 3.7+
len(ks4_df), ks4_results.count_documents({})

(5335, 5335)

Great, let's quickly look at one of the documents.

In [122]:
ks4_results.find_one()

{'': 0,
 'AC5EM13': '0%',
 'AC5EM14_PTQ': '0%',
 'AC5EM15_PTQ_EE': '0%',
 'AC5EM16_PTQ_EE': '0%',
 'ADDRESS1': 'Queen Victoria Street',
 'AGERANGE': '10-18',
 'ALPHAIND': 11828,
 'ATT8SCR': 42.1,
 'ATT8SCREBAC': 22.2,
 'ATT8SCRENG': 7.3,
 'ATT8SCRMAT': 0,
 'ATT8SCROPEN': 12.6,
 'ATT8SCROPENG': 10.4,
 'ATT8SCROPENNG': 2.2,
 'ATT8SCR_BOYS': 42.1,
 'BPUP': 139,
 'CONTFLAG': 0,
 'EGENDER': 'BOYS',
 'ESTAB': 6007,
 'FEEDER': 0,
 'ICLOSE': 0,
 'LA Name': 'City of London',
 'LEA': 201,
 'NFTYPE': 'IND',
 'NUMBOYS': 918,
 'P8CILOW': 'NP',
 'P8CILOW_15': 0.0,
 'P8CILOW_AV': 'NP',
 'P8CILOW_BOYS': 'NP',
 'P8CILOW_EAL': 'NP',
 'P8CILOW_FSM6CLA1A': 'NP',
 'P8CILOW_GIRLS': 0.0,
 'P8CILOW_HI': 'NP',
 'P8CILOW_LO': 'NP',
 'P8CILOW_NFSM6CLA1A': 'NP',
 'P8CILOW_NMOB': 'NP',
 'P8CIUPP': 'NP',
 'P8CIUPP_15': 0.0,
 'P8CIUPP_AV': 'NP',
 'P8CIUPP_BOYS': 'NP',
 'P8CIUPP_EAL': 'NP',
 'P8CIUPP_FSM6CLA1A': 'NP',
 'P8CIUPP_GIRLS': 0.0,
 'P8CIUPP_HI': 'NP',
 'P8CIUPP_LO': 'NP',
 'P8CIUPP_NFSM6CLA1A': 'NP',
 'P8CI

## Initial look at the KS4 results dataset
Let's have a quick look at the raw data we will be using for the EMA.

In [35]:
!head -5 'data/2015-2016/england_ks4final.csv'

﻿RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,TOWN,PCODE,TELNUM,CONTFLAG,ICLOSE,NFTYPE,RELDENOM,ADMPOL,EGENDER,FEEDER,TABKS2,TAB1618,AGERANGE,CONFEXAM,TOTPUPS,NUMBOYS,NUMGIRLS,TPUP,BPUP,PBPUP,GPUP,PGPUP,KS2APS,TPRIORLO,PTPRIORLO,TPRIORAV,PTPRIORAV,TPRIORHI,PTPRIORHI,TFSM6CLA1A,PTFSM6CLA1A,TNOTFSM6CLA1A,PTNOTFSM6CLA1A,TEALGRP2,PTEALGRP2,TEALGRP1,PTEALGRP1,TEALGRP3,PTEALGRP3,TNMOB,PTNMOB,SENSE4,PSENSE4,SENAPK4,PSENAPK4,TOTATT8,ATT8SCR,TOTATT8ENG,ATT8SCRENG,TOTATT8MAT,ATT8SCRMAT,TOTATT8EBAC,ATT8SCREBAC,TOTATT8OPEN,ATT8SCROPEN,TOTATT8OPENG,ATT8SCROPENG,TOTATT8OPENNG,ATT8SCROPENNG,P8PUP,P8MEACOV,P8MEA,P8CILOW,P8CIUPP,P8MEAENG,P8MEAENG_CILOW,P8MEAENG_CIUPP,P8MEAMAT,P8MEAMAT_CILOW,P8MEAMAT_CIUPP,P8MEAEBAC,P8MEAEBAC_CILOW,P8MEAEBAC_CIUPP,P8MEAOPEN,P8MEAOPEN_CILOW,P8MEAOPEN_CIUPP,PTL2BASICS_LL_PTQ_EE,PTL2BASICS_3YR_PTQ_EE,TEBACC_E_PTQ_EE,PTEBACC_E_PTQ_EE,PTEBACC_PTQ_EE,TEBACENG_E_PTQ_EE,PTEBACENG_E_PTQ_EE,TEBACMAT_E_PTQ_EE,PTEBACMAT_E_PTQ_EE,TEBAC2SCI_E_PTQ_EE,PT

In [36]:
!wc -l 'data/2015-2016/england_ks4final.csv'

5489 data/2015-2016/england_ks4final.csv


The file has 5489 lines (including the header row), a large number of columns and a lot of codes that I'll need to look up.  There are also a number of `NA` and `NP` values that could be missing data.  As well as this results data I will need to find and import the relevant metadata file.

Looking through the data/2015-2016 folder there are a number of files that have information on these codes.

In [37]:
!ls data/2015-2016/

abbreviations.xlsx	    england_spine.csv
abs_meta.csv		    england_swf.csv
census_meta.csv		    england_vaqual.csv
cleaned_ks2.csv		    england_vasubj.csv
cleaned_ks4.csv		    ks2_meta.csv
england_abs.csv		    ks4_meta.csv
england_census.csv	    ks4_meta_methodology.csv
england_cfrfull.xlsx	    ks4-pupdest_meta.csv
england_ks2final.csv	    ks5_meta.csv
england_ks4final.csv	    ks5-studest_meta.csv
england_ks4-pupdest.csv     la_and_region_codes_meta.csv
england_ks4underlying.xlsx  sixth_form_centres_and_consortia_meta.xlsx
england_ks5final.csv	    spine_meta.csv
england_ks5-studest.csv     swf_meta.csv
england_ks5underlying.xlsx


There is an abbreviations file, stored as an xlsx file, which I had a quick glance at in Excel.  Having looked the abbreviations up in this file, we can see that they have the following meanings:

- _NA_: Not Applicable
- _NP_: Not Published

However, before importing the results data I want to look at it in Open Refine and decide what I will do with the missing data.

## Open Refine

Looking at the `england_ks4final.csv` file in OpenRefine I can see that `NA` and `NP` values appear in many places, and that there are also `SUPP` values.

However, with so many columns to facet and edit one by one, cleaning in OpenRefine would become very tedious, and many of the columns may have no bearing on my investigations.  It will be easier and more efficient to handle these values when querying the database, so no changes were made to the file in OpenRefine.
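
Before deciding how to treat these codes at query time, it can help to see where they occur. The sketch below tallies them per column using only the standard library; the sample rows are made up, and on the real file the `io.StringIO` object would be replaced by an open file handle.

```python
import collections
import csv
import io

CODES = {'NA', 'NP', 'SUPP'}

# Made-up sample with a few missing-data codes.
sample = io.StringIO(
    "URN,P8MEA,ATT8SCR\n"
    "100003,NP,42.1\n"
    "100001,NP,SUPP\n"
    "100053,0.1,NA\n"
)

# Count each (column, code) pair.
code_counts = collections.Counter()
for row in csv.DictReader(sample):
    for field, value in row.items():
        if value in CODES:
            code_counts[(field, value)] += 1

print(code_counts[('P8MEA', 'NP')])  # 2
```
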

## Choosing MongoDB

With so many columns to investigate I am leaning towards using a DBMS to make querying the data more efficient than in a pandas dataframe.  Therefore, I will import the data into MongoDB.  I chose a document database because it does not enforce a fixed schema, which makes it more flexible than a relational database for this work.  For example, it may become necessary to add fields to certain documents during the investigation.





# Q1 - KS4 Investigation
<a name="q1"></a>

## Does the type of school impact the overall academic performance results of students at KS4?

# Application of Machine Learning
<a name="machine_learning"></a>

# Q2 - KS2-KS4 Investigation
<a name="q2"></a>

## Do top performing schools at KS2 deliver similarly good or better results at KS4?

# Cleanup: remove the database

Uncomment the lines below to remove the MongoDB database created in the investigation.

In [38]:
# uncomment to remove the database if needed
# client.drop_database('schools_db')

# check which databases are currently in client
# client.list_database_names()  # database_names() was removed in PyMongo 4