# EMA Project Notebook
__Name:__ Daniel Smith

__PI:__ A7603242

In [1]:
# import the required libraries
import pandas as pd
import scipy.stats
import numpy as np
import pymongo
import bson
import collections
from matplotlib import pyplot
import seaborn as sns

In [96]:
# make a folder for storing my working files as I go along.
# not used in the end.
# !mkdir -p data/dcs283

In [3]:
# !ls data/dcs283

# Contents

[TODO](#todo)


Use these links to jump to a section.

[Initial look at the ks4 dataset](#initial_look)

[Choosing MongoDB](#mongo)

[Data preparation](#preparation)
 - [Importing the KS2 data](#importing_ks2)
 - [Importing the KS4 data](#importing_ks4)

[Q1, KS4 Investigation](#q1)

[Q2, KS2 - KS4 Investigation](#q2)

[Application of Machine Learning](#machine_learning)

[Cleanup - remove the database](#cleanup)

<a name="initial_look"></a>

# Initial look at the KS4 results dataset
Let's have a quick look at the data we will be using for the EMA.

In [4]:
!head -5 'data/2015-2016/england_ks4final.csv'

﻿RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,TOWN,PCODE,TELNUM,CONTFLAG,ICLOSE,NFTYPE,RELDENOM,ADMPOL,EGENDER,FEEDER,TABKS2,TAB1618,AGERANGE,CONFEXAM,TOTPUPS,NUMBOYS,NUMGIRLS,TPUP,BPUP,PBPUP,GPUP,PGPUP,KS2APS,TPRIORLO,PTPRIORLO,TPRIORAV,PTPRIORAV,TPRIORHI,PTPRIORHI,TFSM6CLA1A,PTFSM6CLA1A,TNOTFSM6CLA1A,PTNOTFSM6CLA1A,TEALGRP2,PTEALGRP2,TEALGRP1,PTEALGRP1,TEALGRP3,PTEALGRP3,TNMOB,PTNMOB,SENSE4,PSENSE4,SENAPK4,PSENAPK4,TOTATT8,ATT8SCR,TOTATT8ENG,ATT8SCRENG,TOTATT8MAT,ATT8SCRMAT,TOTATT8EBAC,ATT8SCREBAC,TOTATT8OPEN,ATT8SCROPEN,TOTATT8OPENG,ATT8SCROPENG,TOTATT8OPENNG,ATT8SCROPENNG,P8PUP,P8MEACOV,P8MEA,P8CILOW,P8CIUPP,P8MEAENG,P8MEAENG_CILOW,P8MEAENG_CIUPP,P8MEAMAT,P8MEAMAT_CILOW,P8MEAMAT_CIUPP,P8MEAEBAC,P8MEAEBAC_CILOW,P8MEAEBAC_CIUPP,P8MEAOPEN,P8MEAOPEN_CILOW,P8MEAOPEN_CIUPP,PTL2BASICS_LL_PTQ_EE,PTL2BASICS_3YR_PTQ_EE,TEBACC_E_PTQ_EE,PTEBACC_E_PTQ_EE,PTEBACC_PTQ_EE,TEBACENG_E_PTQ_EE,PTEBACENG_E_PTQ_EE,TEBACMAT_E_PTQ_EE,PTEBACMAT_E_PTQ_EE,TEBAC2SCI_E_PTQ_EE,PT

In [5]:
!wc -l 'data/2015-2016/england_ks4final.csv'

5489 data/2015-2016/england_ks4final.csv


The dataset has 5489 rows of data. There appears to be a large number of columns and a lot of codes that I'll need to look up. There are also a number of `NA` and `NP` values that could be missing data. As well as this results data, I will need to find and import the relevant metadata file.

Before importing the dataset I will need to decide which storage method to use.

<a name="mongo"></a>

# Choosing MongoDB

With so many columns to investigate, I am leaning towards using a DBMS to make querying the data more efficient than in a pandas dataframe. I will therefore import the data into MongoDB. I chose a document database because it is more flexible than a relational database: in this investigation it may become necessary to add fields to certain documents, for example.

In [6]:
# set up a connection to mongodb server
client = pymongo.MongoClient('mongodb://localhost:27351')

In [7]:
# drop any existing copy of the database so the notebook starts from a clean state
client.drop_database('schools_db')
client.database_names()

['accidents', 'admin', 'local']

In [8]:
# setup a schools_db database on mongo
db = client.schools_db

<a name="preparation"></a>

# Data preparation

Before we can investigate the data we will need to have a quick look at it, determine what cleaning, if any, is needed, carry out the cleaning, and store it for access in an appropriate form.

However, before doing anything else I will import the KS2 data in the same way as was done in `dcs283_TMA02_Question2b-pd`. I will then store the resulting dataframe in Mongo for analysis later on.

<a name="importing_ks2"></a>

## Importing the KS2 data


All of this section is the same as in the `TMA02_Question2b-pd` notebook.

***
__ ----------- Beginning of TMA02 code -----------  __

### Import the LEA data

In [9]:
leas_df = pd.read_csv('data/2015-2016/la_and_region_codes_meta.csv')
leas_df.head()

Unnamed: 0,LEA,LA Name,REGION,REGION NAME
0,841,Darlington,1,North East A
1,840,County Durham,1,North East A
2,805,Hartlepool,1,North East A
3,806,Middlesbrough,1,North East A
4,807,Redcar and Cleveland,1,North East A


### Import the KS2 data
Most of the field names are given in the `ks2_meta` file, so we'll use that to keep track of the types of various columns.

In [10]:
ks2cols = pd.read_csv('data/2015-2016/ks2_meta.csv')
ks2cols['Field Name'] = ks2cols['Field Name'].apply(lambda r: r.strip(),)
ks2cols

Unnamed: 0,Column,Field Name,Label/Description
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,2,ALPHAIND,Alphabetic index
2,3,LEA,Local authority number
3,4,ESTAB,Establishment number
4,5,URN,School unique reference number
5,6,SCHNAME,School/Local authority name
6,7,ADDRESS1,School address (1)
7,8,ADDRESS2,School address (2)
8,9,ADDRESS3,School address (3)
9,10,TOWN,School town


Some columns contain integers, but _**pandas**_ will treat any numeric column with `na` values as `float64`, due to NumPy's number type hierarchy. 

In [11]:
int_cols = [c for c in ks2cols['Field Name'] 
            if c.startswith('T')
            if c not in ['TOWN', 'TELNUM', 'TKS1AVERAGE']]
int_cols += ['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'URN_AC', 'ICLOSE']
int_cols += ['READ_AVERAGE', 'GPS_AVERAGE', 'MAT_AVERAGE']

Some columns contain percentages. We'll convert these to floating point numbers on import.

Note that we also need to handle the case of `SUPP` and `NEW` in the data.

In [12]:
def p2f(x):
    if x.strip('%').isnumeric():
        return float(x.strip('%'))/100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x
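
As written, `p2f` turns suppressed values such as `SUPP` into `0.0`, and `str.isnumeric` is false for strings containing a decimal point, so a value like `12.5%` would be returned unchanged. A minimal sketch of a variant follows, assuming that suppressed values should be treated as missing rather than zero. The name `p2f_nan` and the `MISSING_CODES` set are my own illustration and are not part of the TMA02 code.

```python
import math

# Assumption: these codes all mean "no usable value" for a percentage column.
MISSING_CODES = {'SUPP', 'NEW', 'LOWCOV', 'NA', 'NP', 'NE', ''}

def p2f_nan(x):
    """Convert a percentage string to a fraction, NaN for missing codes,
    and return anything else (e.g. a postcode) unchanged."""
    text = x.strip()
    if text in MISSING_CODES:
        return float('nan')
    try:
        # float() accepts decimals, so '12.5%' becomes 0.125
        return float(text.rstrip('%')) / 100
    except ValueError:
        return x

half = p2f_nan('50%')
suppressed = p2f_nan('SUPP')
postcode = p2f_nan('NW1 8AB')
```

Whether zero or NaN is the better choice depends on the analysis: a zero will be included in any mean, whereas a NaN will be skipped by pandas.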

These are the columns to try to convert from percentages. Note that we can be generous here, as columns like PCODE (postcode) will return the original value if the conversion fails.

In [13]:
percent_cols = [f for f in ks2cols['Field Name'] if f.startswith('P')]
percent_cols += ['WRITCOV', 'MATCOV', 'READCOV'] 
percent_cols += ['PTMAT_HIGH', 'PTREAD_HIGH', 'PSENELSAPK', 'PSENELK', 'PTGPS_HIGH']
percent_converters = {c: p2f for c in percent_cols}

In [14]:
ks2_df = pd.read_csv('data/2015-2016/england_ks2final.csv', 
                   na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''],
                   converters=percent_converters)

Drop the summary rows, keeping just the rows for mainstream and special schools.

In [15]:
ks2_df = ks2_df[(ks2_df['RECTYPE'] == 1) | (ks2_df['RECTYPE'] == 2)]

Convert everything to numbers, if possible.

In [16]:
ks2_df = ks2_df.apply(pd.to_numeric, errors='ignore')

Merge the LEA data into the school data

In [17]:
ks2_df = pd.merge(ks2_df, leas_df, on=['LEA'])
ks2_df.head().T

Unnamed: 0,0,1,2,3,4
RECTYPE,1,1,1,1,1
ALPHAIND,53372,11156,11160,11256,16366
LEA,201,202,202,202,202
ESTAB,3614,3323,3327,2842,2184
URN,100000,100028,100029,130342,100013
SCHNAME,Sir John Cass's Foundation Primary School,"Christ Church Primary School, Hampstead",Christ Church School,Christopher Hatton Primary School,Edith Neville Primary School
ADDRESS1,St James's Passage,Christ Church Hill,Redhill Street,38 Laystall Street,174 Ossulston Street
ADDRESS2,Duke's Place,,Camden,,
ADDRESS3,,,,,
TOWN,London,London,London,London,London


__ ----------- END of TMA02 code -----------  __
***

### Convert and store the KS2 dataframe into Mongo for use later

In [18]:
# set up a collection for the ks2 results dataframe
ks2 = db.ks2_results

In [19]:
# convert the dataframe into a list of dicts and store in Mongo

# the 'records' argument is needed to get a list of dicts
ks2.insert_many(ks2_df.to_dict('records'))

# snippet reference is from:
# https://stackoverflow.com/questions/33979983/insert-rows-from-pandas-dataframe-into-mongodb-collection-as-individual-document

<pymongo.results.InsertManyResult at 0x7f9dee5309d8>

In [20]:
# check we got them all
ks2.find().count(), len(ks2_df)

(16162, 16162)

In [21]:
ks2.find_one()

{'ADDRESS1': "St James's Passage",
 'ADDRESS2': "Duke's Place",
 'ADDRESS3': nan,
 'AGERANGE': '3-11',
 'ALPHAIND': 53372.0,
 'BELIG': 16.0,
 'CONFEXAM': nan,
 'DIFFN_MATPROG': 2.7,
 'DIFFN_READPROG': 0.6,
 'DIFFN_RWM_EXP': 23.0,
 'DIFFN_RWM_HIGH': -7.0,
 'DIFFN_WRITPROG': 0.2,
 'ESTAB': 3614.0,
 'GELIG': 12.0,
 'GPS_AVERAGE': 106.0,
 'GPS_AVERAGE_FSM6CLA1A': 105.0,
 'GPS_AVERAGE_H': 110.0,
 'GPS_AVERAGE_L': nan,
 'GPS_AVERAGE_M': 105.0,
 'GPS_AVERAGE_NotFSM6CLA1A': 107.0,
 'ICLOSE': 0.0,
 'LA Name': 'City of London',
 'LEA': 201.0,
 'MATCOV': 1.0,
 'MATPROG': 3.0,
 'MATPROG_B': 2.9,
 'MATPROG_B_LOWER': 0.3,
 'MATPROG_B_UPPER': 5.5,
 'MATPROG_EAL': 3.1,
 'MATPROG_EAL_LOWER': 0.6,
 'MATPROG_EAL_UPPER': 5.6,
 'MATPROG_FSM6CLA1A': 2.9,
 'MATPROG_FSM6CLA1A_LOWER': -0.1,
 'MATPROG_FSM6CLA1A_UPPER': 5.9,
 'MATPROG_G': 3.1,
 'MATPROG_G_LOWER': 0.1,
 'MATPROG_G_UPPER': 6.1,
 'MATPROG_H': 0.4,
 'MATPROG_H_LOWER': -3.9,
 'MATPROG_H_UPPER': 4.7,
 'MATPROG_L': nan,
 'MATPROG_LOWER': 1.0,
 'MATPROG

In [23]:
ks2.find_one()['GPS_AVERAGE_L']

nan

In [24]:
ks2.find({'GPS_AVERAGE_L': np.nan}).count()

11940

Looks good.  We will need to watch out for the NaN values though.
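
One way to deal with the NaN values, sketched here on a small made-up dataframe rather than the real one, is to leave the missing keys out of each document before inserting it. The helper name `records_without_nan` is hypothetical, and the sample values are for illustration only.

```python
import numpy as np
import pandas as pd

def records_without_nan(df):
    """Return a list of dicts, one per row, omitting keys whose value is missing."""
    return [
        {key: value for key, value in row.items() if pd.notna(value)}
        for row in df.to_dict('records')
    ]

# made-up sample standing in for ks2_df
sample_df = pd.DataFrame({'URN': [100000, 100028],
                          'GPS_AVERAGE_L': [np.nan, 101.0]})
sample_records = records_without_nan(sample_df)
```

A list built this way could be passed to `insert_many` in place of `ks2_df.to_dict('records')`. Missing values could then be found with an `$exists` query rather than by matching on NaN.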

In [27]:
# to save resources we can now delete the ks2_df
del ks2_df

<a name="importing_ks4"></a>

## Importing the KS4 results dataset


Before we can investigate the data we will need to have a look at it, determine what cleaning if any needs to be done, and store it for access in an appropriate form.

The dataset has 5489 rows of data. There appears to be a large number of columns and a lot of codes that I'll need to look up. There are also a number of `NA` and `NP` values that could be missing data. As well as this results data, I will need to find and import the relevant metadata file.

Looking through the `data/2015-2016` folder, there are a number of files that have information on these codes.

In [30]:
!ls data/2015-2016/

abbreviations.xlsx	    england_swf.csv
abs_meta.csv		    england_vaqual.csv
census_meta.csv		    england_vasubj.csv
england_abs.csv		    keto_setup.xlsx
england_census.csv	    ks2_meta.csv
england_cfrfull.xlsx	    ks4_meta.csv
england_ks2final.csv	    ks4_meta_methodology.csv
england_ks4final.csv	    ks4-pupdest_meta.csv
england_ks4-pupdest.csv     ks5_meta.csv
england_ks4underlying.xlsx  ks5-studest_meta.csv
england_ks5final.csv	    la_and_region_codes_meta.csv
england_ks5-studest.csv     sixth_form_centres_and_consortia_meta.xlsx
england_ks5underlying.xlsx  spine_meta.csv
england_spine.csv	    swf_meta.csv


There is an abbreviations file, stored as an xlsx file. I'll have a quick glance at it in Excel. Having looked the abbreviations up in that file, we can see that they have the following meanings:

- _NA_: Not applicable
- _NP_: Not Published
- _NE_: No entries
- _SUPP_: Suppressed (5 or fewer in cohort)
- _LOWCOV_: Low coverage (less than 50% of the cohort)
- _NEW_: New institution

The abbreviations file also has listings of all the school types (NFTYPE) that I will need.  I'll grab that for use later on.

## Importing the abbreviations file

In [31]:
# read in the abbreviations file
abbr_df = pd.read_excel('data/2015-2016/abbreviations.xlsx')
abbr_df

Unnamed: 0,2016 KS4 and KS5/16-18 Performance Tables,Unnamed: 1,Unnamed: 2
0,Abbreviations used in the csv and excel Downlo...,,
1,,,
2,Institution type (NFTYPE):,,
3,AC,Sponsored academy,
4,ACC,Academy converter - mainstream,
5,AC1619,Academy 16-19 sponsor led,
6,ACC1619,Academy 16-19 converter,
7,ACCS,Academy converter - special school,
8,ACS,Sponsored special academy,
9,CTC,City technology college,


We can see that the school types are in rows 3-25 (row 2 is the heading). I'll store them as a dict for reference later on.

In [32]:
# relabel the columns
abbr_df.columns = ['label', 'expanded', 'not_needed']

In [33]:
nftypes = {}
for index, row in abbr_df[3:26].iterrows():
    nftypes[row['label'].strip()] = row['expanded'].strip()
    
nftypes

{'AC': 'Sponsored academy',
 'AC1619': 'Academy 16-19 sponsor led',
 'ACC': 'Academy converter - mainstream',
 'ACC1619': 'Academy 16-19 converter',
 'ACCS': 'Academy converter - special school',
 'ACS': 'Sponsored special academy',
 'CTC': 'City technology college',
 'CY': 'Community school',
 'CYS': 'Community special school',
 'F': 'Free school - mainstream',
 'F1619': 'Free school - 16-19',
 'FD': 'Foundation school',
 'FDS': 'Foundation special school',
 'FESI': 'Further Education Sector Institution',
 'FS': 'Free school - special',
 'FSS': 'Studio school',
 'FUTC': 'UTC (university technical college)',
 'IND': 'Independent school',
 'INDSPEC': 'Independent special school',
 'MODFC': 'College funded by Ministry of Defence',
 'NMSS': 'Non-maintained special school',
 'VA': 'Voluntary aided school',
 'VC': 'Voluntary controlled school'}

While we have the abbreviations available, I'll also store the missing value types for reference later on if needed.

In [34]:
missing_types = {}
for i, r in abbr_df[45:51].iterrows():
    missing_types[r['label']] = r['expanded']

missing_types

{'NEW': 'New institution',
 nan: 'Not applicable: figures are either not available for the year in question, or the data field is not applicable to this school or college',
 'NE': 'No entries',
 'SUPP': "Indicates that a school or college's figures have been suppressed because there are 5 or fewer pupils in the cohort",
 'LOWCOV': 'Low coverage: indicates that a school’s Progress 8 or value added measures have been suppressed because coverage is less than 50% of the cohort',
 'NP': 'Not published - for example we do not publish Progress 8 data for independent schools and independent special schools, or breakdowns by disadvantaged and non-disadvantaged pupils for independent schools, independent special schools and non-maintained special schools.'}
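
Note that the `NA` label has come through as `nan` in the dictionary above, because pandas treats the string `NA` as a missing value by default. A minimal sketch of how this could be avoided is shown below on a small in-memory CSV (the sample text is made up for illustration). `read_excel` accepts the same `keep_default_na` parameter in recent pandas versions, although I have not re-run the import above with it.

```python
import io
import pandas as pd

# made-up sample standing in for the abbreviations sheet
csv_text = "label,expanded\nNA,Not applicable\nNP,Not published\n"

# default behaviour: the string 'NA' is parsed as a missing value
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps 'NA' as a literal string
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```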

I can now delete the abbr_df as it won't be needed.

In [35]:
del abbr_df

## Importing the KS4 Metadata file

In order to analyse the data we need to be able to reference the columns and the codes they represent. I'll import the `ks4_meta.csv` file into the database and use it to help me understand the data in the KS4 results dataset.

In [36]:
!head -5 data/2015-2016/ks4_meta.csv

Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,,1,RECTYPE,Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools)),,,,,,,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,3,LEA,Local authority code (see separate list of local authorities and their codes),,,,Yes,Yes,,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,6,SCHNAME,School name,,,Yes,Yes,Yes,,7,SCHNAME_AC,School now known as (used if the school has converted to an academy on or after 12 Sept 2015),,,Yes,Yes,Yes,,8,ADDRESS1,School address (1),,,Yes,Yes,Yes,,9,ADDRESS2,School address (2),,,Yes,Yes,Yes,,10,ADDRESS3,School address (3),,,Yes,Yes,Yes,,11,TOWN,School town,,,Yes,Yes,Yes,,12,PCODE,School postcode,,,Yes,Yes

In [37]:
!wc -l data/2015-2016/ks4_meta.csv

0 data/2015-2016/ks4_meta.csv


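0 lines. `wc -l` counts newline (`\n`) characters, so the file presumably uses a different line terminator. I'll try loading it directly into Mongo.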
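
A quick way to check which line terminators a file uses is to count them in the raw bytes. The helper below is a sketch of my own (the name `count_line_endings` is hypothetical) and is shown on made-up byte strings rather than the real file.

```python
def count_line_endings(data):
    """Count CRLF, bare CR and bare LF terminators in a bytes object."""
    crlf = data.count(b'\r\n')
    # every CRLF also contains one CR and one LF, so subtract them
    cr = data.count(b'\r') - crlf
    lf = data.count(b'\n') - crlf
    return {'CRLF': crlf, 'CR': cr, 'LF': lf}

cr_only = count_line_endings(b'a,b\rc,d\re,f\r')
mixed = count_line_endings(b'a\r\nb\n')
```

For the metadata file this could be called as `count_line_endings(open('data/2015-2016/ks4_meta.csv', 'rb').read())`. A file with only bare CR terminators would explain a `wc -l` result of 0.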

In [38]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4_meta \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_meta.csv

2018-06-01T07:16:19.208+0000	Failed: fields cannot be identical: '' and ''
2018-06-01T07:16:19.208+0000	imported 0 documents


Clearly there is an issue with the import.  I'll try importing it into a dataframe.

In [39]:
ks4_meta_df = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4_meta_df.head()

Unnamed: 0,Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,Unnamed: 8,Unnamed: 9
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...,,,,,,,
1,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,
2,3,LEA,Local authority code (see separate list of loc...,,,,Yes,Yes,,
3,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,
4,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,


That imported OK, but it has more columns than I need (I only need it to look up the description for a given term).

In [40]:
# reduce the dataframe to the columns of interest
ks4_meta_df = ks4_meta_df[['Metafile heading', 'Metafile description']]
# relabel them to match my target format
ks4_meta_df.columns = ['label', 'expanded']
# check it looks ok
ks4_meta_df

Unnamed: 0,label,expanded
0,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,ALPHAIND,Alphabetic sorting index
2,LEA,Local authority code (see separate list of loc...
3,ESTAB,Establishment number
4,URN,School Unique Reference Number
5,SCHNAME,School name
6,SCHNAME_AC,School now known as (used if the school has co...
7,ADDRESS1,School address (1)
8,ADDRESS2,School address (2)
9,ADDRESS3,School address (3)


In [41]:
# set up a reference to the db.collection 
ks4_meta = db.ks4_meta

In [42]:
ks4_meta.insert_many(ks4_meta_df.to_dict('records'))
# snippet reference is from:
# https://stackoverflow.com/questions/33979983/insert-rows-from-pandas-dataframe-into-mongodb-collection-as-individual-document

<pymongo.results.InsertManyResult at 0x7f9ded364c60>

In [43]:
ks4_meta.find_one({'label': 'NFTYPE'})

{'_id': ObjectId('5b10f2c30fd01f08fa649335'),
 'expanded': 'School type (see separate list of abbreviations used in the tables)',
 'label': 'NFTYPE'}

I want to add the codes from the abbreviations dictionary to this document, since it is one of the backbones of my investigation.

In [44]:
ks4_meta.update_one({'label': 'NFTYPE'}, 
                    {'$set': {'codes': nftypes}})

ks4_meta.find_one({'label': 'NFTYPE'})

{'_id': ObjectId('5b10f2c30fd01f08fa649335'),
 'codes': {'AC': 'Sponsored academy',
  'AC1619': 'Academy 16-19 sponsor led',
  'ACC': 'Academy converter - mainstream',
  'ACC1619': 'Academy 16-19 converter',
  'ACCS': 'Academy converter - special school',
  'ACS': 'Sponsored special academy',
  'CTC': 'City technology college',
  'CY': 'Community school',
  'CYS': 'Community special school',
  'F': 'Free school - mainstream',
  'F1619': 'Free school - 16-19',
  'FD': 'Foundation school',
  'FDS': 'Foundation special school',
  'FESI': 'Further Education Sector Institution',
  'FS': 'Free school - special',
  'FSS': 'Studio school',
  'FUTC': 'UTC (university technical college)',
  'IND': 'Independent school',
  'INDSPEC': 'Independent special school',
  'MODFC': 'College funded by Ministry of Defence',
  'NMSS': 'Non-maintained special school',
  'VA': 'Voluntary aided school',
  'VC': 'Voluntary controlled school'},
 'expanded': 'School type (see separate list of abbreviations used in

I'll do the same for the `RECTYPE` label by splitting the description.

In [45]:
# select the correct document
r = ks4_meta.find_one({'label': 'RECTYPE'})

# checks that we haven't already updated the document
# then if not splits the description string, adding a code key
# to reference each school type
if 'codes' not in r.keys():
    expanded = r['expanded']
    e = expanded[:11]
    codelist = expanded[13:-1].split('; ')
    keys = [c[:1] for c in codelist]
    values = [c[2:] for c in codelist]
    codes = (dict(list(zip(keys, values))))
    ks4_meta.update_one({'_id': r['_id']},
                        {'$set': {'expanded': e,
                                  'codes': codes}})

# check that it was processed correctly
ks4_meta.find_one({'label': 'RECTYPE'})

{'_id': ObjectId('5b10f2c30fd01f08fa649326'),
 'codes': {'1': 'mainstream school',
  '2': 'special school',
  '4': 'local authority',
  '5': 'National (all schools)',
  '7': 'National (maintained schools)'},
 'expanded': 'Record type',
 'label': 'RECTYPE'}
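
The slicing used above relies on fixed character positions and single-character codes. As a sketch of a less position-dependent alternative (the function name `split_codes` is hypothetical, and this is not the code that was run above), the description could be split with a regular expression:

```python
import re

def split_codes(description):
    """Split 'Name (1=a; 2=b)' into ('Name', {'1': 'a', '2': 'b'}).

    Returns (description, {}) if the text is not in that form.
    """
    match = re.match(r'^(.*?)\s*\((.*)\)$', description)
    if match is None:
        return description, {}
    name, body = match.groups()
    codes = {}
    for part in body.split('; '):
        if '=' not in part:
            # bracketed text that is not a code list, e.g. '(see separate list)'
            return description, {}
        code, _, meaning = part.partition('=')
        codes[code.strip()] = meaning.strip()
    return name, codes

rectype_text = ('Record type (1=mainstream school; 2=special school; '
                '4=local authority; 5=National (all schools); '
                '7=National (maintained schools))')
name, codes = split_codes(rectype_text)
```

The same function could be applied to both the KS4 and KS2 `RECTYPE` descriptions, which would avoid repeating the slicing code.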

Great.  That is most of the cleaning I need to do for the ks4_meta file.  If I were to be doing a different investigation I would consider merging in the LEA data here, but we already have it stored from earlier on (importing ks2) as a dataframe which we can reference if needed.

Now, in the tm351 module materials we had some handy collections provided by the module team that enabled us to quickly look up the labels and codes of a given accident. I'll borrow that idea here for my purposes. Because I will need to do the same for the KS2 dataset, I'll wrap them in a function.

In [46]:
# code adapted from the p14 accidents dataset notebooks

def expanded_label(meta):
    # Load the expanded names of keys and human-readable codes into memory
    expanded_name = collections.defaultdict(str)
    for e in meta.find({'expanded': {"$exists": True}}):
        expanded_name[e['label']] = e['expanded']

    label_of = collections.defaultdict(str)
    for l in meta.find({'codes': {"$exists": True}}):
        for c in l['codes']:
            try:
                label_of[l['label'], int(c)] = l['codes'][c]
            except ValueError: 
                label_of[l['label'], c] = l['codes'][c]
    # return both as a tuple
    return (expanded_name, label_of)

In [47]:
# Set up the expanded_name and label_of for ks4_meta
ks4_expanded_name, ks4_label_of = expanded_label(ks4_meta)

In [48]:
# test it works
[(c, ks4_label_of['RECTYPE', c]) for k, c in ks4_label_of if k == 'RECTYPE']

[(2, 'special school'),
 (4, 'local authority'),
 (1, 'mainstream school'),
 (5, 'National (all schools)'),
 (7, 'National (maintained schools)')]

In [49]:
ks4_expanded_name['NFTYPE']

'School type (see separate list of abbreviations used in the tables)'

In [50]:
ks4_label_of['NFTYPE', 'AC']

'Sponsored academy'

Great, that is all working. I can now delete the `ks4_meta_df`, as the information is stored.

In [51]:
del ks4_meta_df

I'll quickly repeat the same steps for the KS2 metadata to include the codes.

In [52]:
# relabel the columns of ks2cols
ks2cols.columns = ['not_needed', 'label', 'expanded']

# create a collection in the database
ks2_meta = db.ks2_meta

In [53]:
# store them into the database
ks2_meta.insert_many(ks2cols[['label', 'expanded']].to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f9de41efee8>

In [54]:
ks2_meta.find_one()

{'_id': ObjectId('5b10f2c30fd01f08fa64949a'),
 'expanded': 'Record type (1=mainstream school; 2=special school; 3=Local Authority; 4=National (all schools); 5=National (maintained schools))',
 'label': 'RECTYPE'}

In [55]:
# repeat the splitting of the `RECTYPE`
# select the correct document
r = ks2_meta.find_one({'label': 'RECTYPE'})

# checks that we haven't already updated the document
# then if not splits the description string, adding a code key
# to reference each school type
if 'codes' not in r.keys():
    expanded = r['expanded']
    e = expanded[:11]
    codelist = expanded[13:-1].split('; ')
    keys = [c[:1] for c in codelist]
    values = [c[2:] for c in codelist]
    codes = (dict(list(zip(keys, values))))
    ks2_meta.update_one({'_id': r['_id']},
                        {'$set': {'expanded': e,
                                  'codes': codes}})

# check that it was processed correctly
ks2_meta.find_one({'label': 'RECTYPE'})

{'_id': ObjectId('5b10f2c30fd01f08fa64949a'),
 'codes': {'1': 'mainstream school',
  '2': 'special school',
  '3': 'Local Authority',
  '4': 'National (all schools)',
  '5': 'National (maintained schools)'},
 'expanded': 'Record type',
 'label': 'RECTYPE'}

In [56]:
# And add the nftype to the meta collection
ks2_meta.update_one({'label': 'NFTYPE'}, 
                    {'$set': {'codes': nftypes}})

ks2_meta.find_one({'label': 'NFTYPE'})

{'_id': ObjectId('5b10f2c30fd01f08fa6494a9'),
 'codes': {'AC': 'Sponsored academy',
  'AC1619': 'Academy 16-19 sponsor led',
  'ACC': 'Academy converter - mainstream',
  'ACC1619': 'Academy 16-19 converter',
  'ACCS': 'Academy converter - special school',
  'ACS': 'Sponsored special academy',
  'CTC': 'City technology college',
  'CY': 'Community school',
  'CYS': 'Community special school',
  'F': 'Free school - mainstream',
  'F1619': 'Free school - 16-19',
  'FD': 'Foundation school',
  'FDS': 'Foundation special school',
  'FESI': 'Further Education Sector Institution',
  'FS': 'Free school - special',
  'FSS': 'Studio school',
  'FUTC': 'UTC (university technical college)',
  'IND': 'Independent school',
  'INDSPEC': 'Independent special school',
  'MODFC': 'College funded by Ministry of Defence',
  'NMSS': 'Non-maintained special school',
  'VA': 'Voluntary aided school',
  'VC': 'Voluntary controlled school'},
 'expanded': 'School type',
 'label': 'NFTYPE'}

In [57]:
# finally, set up the expanded_name and label_of for ks2_meta
ks2_expanded_name, ks2_label_of = expanded_label(ks2_meta)

Check they work OK.

In [58]:
# test it works, iterating over the KS2 lookup (not the KS4 one)
sorted((c, ks2_label_of['RECTYPE', c]) for k, c in ks2_label_of if k == 'RECTYPE')

[(1, 'mainstream school'),
 (2, 'special school'),
 (3, 'Local Authority'),
 (4, 'National (all schools)'),
 (5, 'National (maintained schools)')]

In [59]:
ks2_label_of['NFTYPE', 'IND']

'Independent school'

In [60]:
ks2_expanded_name['TELIG']

'Published eligible pupil number'

Great, that is all the metadata handled. We can now go about importing the KS4 data into the database and cleaning it.

In [61]:
# delete the ks2cols dataframe as we don't need it anymore
del ks2cols

## Importing the KS4 dataset

Before I import the data I will have another quick look at the file.

In [62]:
! head -5 'data/2015-2016/england_ks4final.csv'

﻿RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,TOWN,PCODE,TELNUM,CONTFLAG,ICLOSE,NFTYPE,RELDENOM,ADMPOL,EGENDER,FEEDER,TABKS2,TAB1618,AGERANGE,CONFEXAM,TOTPUPS,NUMBOYS,NUMGIRLS,TPUP,BPUP,PBPUP,GPUP,PGPUP,KS2APS,TPRIORLO,PTPRIORLO,TPRIORAV,PTPRIORAV,TPRIORHI,PTPRIORHI,TFSM6CLA1A,PTFSM6CLA1A,TNOTFSM6CLA1A,PTNOTFSM6CLA1A,TEALGRP2,PTEALGRP2,TEALGRP1,PTEALGRP1,TEALGRP3,PTEALGRP3,TNMOB,PTNMOB,SENSE4,PSENSE4,SENAPK4,PSENAPK4,TOTATT8,ATT8SCR,TOTATT8ENG,ATT8SCRENG,TOTATT8MAT,ATT8SCRMAT,TOTATT8EBAC,ATT8SCREBAC,TOTATT8OPEN,ATT8SCROPEN,TOTATT8OPENG,ATT8SCROPENG,TOTATT8OPENNG,ATT8SCROPENNG,P8PUP,P8MEACOV,P8MEA,P8CILOW,P8CIUPP,P8MEAENG,P8MEAENG_CILOW,P8MEAENG_CIUPP,P8MEAMAT,P8MEAMAT_CILOW,P8MEAMAT_CIUPP,P8MEAEBAC,P8MEAEBAC_CILOW,P8MEAEBAC_CIUPP,P8MEAOPEN,P8MEAOPEN_CILOW,P8MEAOPEN_CIUPP,PTL2BASICS_LL_PTQ_EE,PTL2BASICS_3YR_PTQ_EE,TEBACC_E_PTQ_EE,PTEBACC_E_PTQ_EE,PTEBACC_PTQ_EE,TEBACENG_E_PTQ_EE,PTEBACENG_E_PTQ_EE,TEBACMAT_E_PTQ_EE,PTEBACMAT_E_PTQ_EE,TEBAC2SCI_E_PTQ_EE,PT

To restate what was noted earlier, there appear to be a great number of columns and a large number of missing values. How many rows are there?

In [63]:
!wc -l 'data/2015-2016/england_ks4final.csv'

5489 data/2015-2016/england_ks4final.csv


Let's carry out similar steps to those we carried out when importing the KS2 data. Again, this is going to be adapted from the TMA02-Q2.

In [64]:
ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv')
ks4_df.head()


Unnamed: 0,RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,...,TAVENT_GAV_PTQ_EE,TAVENT_GHI_PTQ_EE,TAVENT_GFSM6CLA1A_PTQ_EE,TAVENT_GNFSM6CLA1A_PTQ_EE,TAVENT_GFSM_13,TAVENT_GNFSM_13,TAVENT_GFSM_14_PTQ,TAVENT_GNFSM_14_PTQ,TAVENT_GFSM6CLA1A_15_PTQ_EE,TAVENT_GNFSM6CLA1A_15_PTQ_EE
0,1,11828.0,201,6007.0,100003.0,City of London School,,Queen Victoria Street,,,...,NP,NP,NP,NP,,,,,,
1,1,11830.0,201,6005.0,100001.0,City of London School for Girls,,St Giles' Terrace,Barbican,,...,NP,NP,NP,NP,,,,,,
2,4,,201,,,,,,,,...,,,,,,,,,,
3,1,368.0,202,4285.0,100053.0,Acland Burghley School,,Burghley Road,,,...,8.9,9.8,8.4,9.5,8.1,9.8,9.0,10.2,8.4,10.6
4,1,9318.0,202,4611.0,100054.0,The Camden School for Girls,,Sandall Road,,,...,8.2,10.2,7.8,9.8,8.6,9.3,7.3,8.6,7.5,8.9


A straight import gives a warning (`DtypeWarning`), which indicates that some columns contain mixed types. Let's look at the file using the tools learned in p2 of the tm351 materials.
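
One way to sidestep the mixed-type guessing, sketched here on a tiny in-memory CSV rather than the real file, is to read every column as text with `dtype=str` and convert selected columns afterwards. The column names are taken from the KS4 file, but the sample values are made up for illustration.

```python
import io
import pandas as pd

# made-up sample rows using a few of the KS4 column names
sample = "RECTYPE,URN,ATT8SCR\n1,100003,NP\n1,100053,51.2\n4,,48.9\n"

# dtype=str stops pandas from inferring a type per column;
# empty fields are still read as missing values
raw_df = pd.read_csv(io.StringIO(sample), dtype=str)
```

With this approach nothing is converted silently, at the cost of having to convert the numeric columns explicitly later on.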

In [65]:
# let's quickly look at the file using command line
!file 'data/2015-2016/england_ks4final.csv'

data/2015-2016/england_ks4final.csv: UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators


In [66]:
# and check it using chardet
import chardet

# open the file and read the contents in as a byte object
testfile = open('data/2015-2016/england_ks4final.csv', 'rb').read()

# detect the file encoding
chardet.detect(testfile)

{'confidence': 1.0, 'encoding': 'UTF-8-SIG'}

ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv', encoding='UTF-8-SIG')
ks4_df.head()

In [67]:
ks4_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5489 entries, 0 to 5488
Columns: 372 entries, RECTYPE to TAVENT_GNFSM6CLA1A_15_PTQ_EE
dtypes: int64(1), object(371)
memory usage: 15.6+ MB


In [68]:
ks4_df.describe()

Unnamed: 0,RECTYPE
count,5489.0
mean,1.292403
std,0.61707
min,1.0
25%,1.0
50%,1.0
75%,1.0
max,7.0


In [69]:
ks4_df.dtypes

RECTYPE                            int64
ALPHAIND                          object
LEA                               object
ESTAB                             object
URN                               object
SCHNAME                           object
SCHNAME_AC                        object
ADDRESS1                          object
ADDRESS2                          object
ADDRESS3                          object
TOWN                              object
PCODE                             object
TELNUM                            object
CONTFLAG                          object
ICLOSE                            object
NFTYPE                            object
RELDENOM                          object
ADMPOL                            object
EGENDER                           object
FEEDER                            object
TABKS2                            object
TAB1618                           object
AGERANGE                          object
CONFEXAM                          object
TOTPUPS         

Almost every column has been read in with the generic `object` datatype: 371 of the 372 columns, with only `RECTYPE` parsed as an integer.

In [70]:
ks4_dt_df = pd.DataFrame()
for col in ks4_df.columns:
    ks4_dt_df[col] = pd.to_numeric(ks4_df[col], errors='ignore')

ks4_dt_df.head()

Unnamed: 0,RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,...,TAVENT_GAV_PTQ_EE,TAVENT_GHI_PTQ_EE,TAVENT_GFSM6CLA1A_PTQ_EE,TAVENT_GNFSM6CLA1A_PTQ_EE,TAVENT_GFSM_13,TAVENT_GNFSM_13,TAVENT_GFSM_14_PTQ,TAVENT_GNFSM_14_PTQ,TAVENT_GFSM6CLA1A_15_PTQ_EE,TAVENT_GNFSM6CLA1A_15_PTQ_EE
0,1,11828.0,201,6007.0,100003.0,City of London School,,Queen Victoria Street,,,...,NP,NP,NP,NP,,,,,,
1,1,11830.0,201,6005.0,100001.0,City of London School for Girls,,St Giles' Terrace,Barbican,,...,NP,NP,NP,NP,,,,,,
2,4,,201,,,,,,,,...,,,,,,,,,,
3,1,368.0,202,4285.0,100053.0,Acland Burghley School,,Burghley Road,,,...,8.9,9.8,8.4,9.5,8.1,9.8,9.0,10.2,8.4,10.6
4,1,9318.0,202,4611.0,100054.0,The Camden School for Girls,,Sandall Road,,,...,8.2,10.2,7.8,9.8,8.6,9.3,7.3,8.6,7.5,8.9


In [71]:
ks4_dt_df.dtypes

RECTYPE                            int64
ALPHAIND                          object
LEA                               object
ESTAB                             object
URN                               object
SCHNAME                           object
SCHNAME_AC                        object
ADDRESS1                          object
ADDRESS2                          object
ADDRESS3                          object
TOWN                              object
PCODE                             object
TELNUM                            object
CONTFLAG                          object
ICLOSE                            object
NFTYPE                            object
RELDENOM                          object
ADMPOL                            object
EGENDER                           object
FEEDER                            object
TABKS2                            object
TAB1618                           object
AGERANGE                          object
CONFEXAM                          object
TOTPUPS         

I'm not getting very far here.  I'll try following the steps from Q2. If I have still made no progress after that, I think the most efficient way to get to the bottom of it will be to look at the file in OpenRefine, clean the mixed datatypes there and decide what to do with the missing data.
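Before reaching for another tool, one way to see why a column refuses to convert is to list the values that cannot be parsed as numbers. This is only a sketch: the helper name `non_numeric_tokens` and the sample values are my own, although 'NP' and 'SUPP' are codes that do appear in this dataset.

```python
import pandas as pd

def non_numeric_tokens(series):
    """Count the distinct values in a column that cannot be parsed as numbers."""
    as_text = series.dropna().astype(str).str.strip()
    # errors='coerce' turns anything unparseable into NaN instead of giving up
    parsed = pd.to_numeric(as_text, errors='coerce')
    return as_text[parsed.isna()].value_counts()

# Made-up sample column mixing numbers with suppression codes
sample = pd.Series(['8.9', 'NP', '10.2', 'SUPP', 'NP', None])
tokens = non_numeric_tokens(sample)
print(tokens)
```

Run over each column of the real dataframe, this should show which codes need handling before the column can become numeric.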

In [72]:
leas_df

Unnamed: 0,LEA,LA Name,REGION,REGION NAME
0,841,Darlington,1,North East A
1,840,County Durham,1,North East A
2,805,Hartlepool,1,North East A
3,806,Middlesbrough,1,North East A
4,807,Redcar and Cleveland,1,North East A
5,808,Stockton-on-Tees,1,North East A
6,390,Gateshead,3,North East B
7,391,Newcastle upon Tyne,3,North East B
8,392,North Tyneside,3,North East B
9,929,Northumberland,3,North East B


I'll find out which columns have percentages in them.

In [73]:
# Look through the meta file and get the columns that are percentages.
percent_cols_list = [(l, ks4_expanded_name[l]) 
                     for l in ks4_expanded_name 
                     if 'percent' in ks4_expanded_name[l].lower()]
percent_cols_list

[('PTEBACC_ELO_PTQ_EE',
  'Percentage of pupils with low prior attainment with entries in all English Baccalaureate subject areas'),
 ('PTEBACENG_LL_PTQ_EE',
  'Percentage of pupils achieving the English Baccalaureate English subject area '),
 ('P8MEACOV',
  'Percentage of pupils at the end of key stage 4 included in Progress 8 measure'),
 ('PTEBACCAV_PTQ_EE',
  'Percentage of pupils with middle prior attainment achieving the English Baccalaureate'),
 ('PTL2BASICS_13',
  'Percentage of pupils achieving grades A*-C in both English and mathematics GCSEs in 2013'),
 ('PTEBACC_E_13',
  'Percentage of pupils entering all English Baccalaureate subject areas in 2013'),
 ('PTFSM6CLA1ABASICS_15_PTQ_EE',
  'Percentage of disadvantaged pupils achieving grades A*-C in both English and mathematics GCSEs in 2015'),
 ('PBPUP', 'Percentage of pupils at the end of key stage 4 who are boys'),
 ('PTANYQ_PTQ_EE', 'Percentage of pupils achieving any qualifications'),
 ('PTEBACHUM_PTQ_EE',
  'Percentage of 

In [74]:
# Save the column headings to a list
percent_cols = [p[0] for p in percent_cols_list]
percent_cols

['PTEBACC_ELO_PTQ_EE',
 'PTEBACENG_LL_PTQ_EE',
 'P8MEACOV',
 'PTEBACCAV_PTQ_EE',
 'PTL2BASICS_13',
 'PTEBACC_E_13',
 'PTFSM6CLA1ABASICS_15_PTQ_EE',
 'PBPUP',
 'PTANYQ_PTQ_EE',
 'PTEBACHUM_PTQ_EE',
 'PTFSMBASICS_14_PTQ',
 'PTEBACC_NFSM6CLA1A_PTQ_EE',
 'PTL2BASICS_14_PTQ',
 'PTEBACHUMAG_PTQ_EE',
 'PTEBACLAN_E_PTQ_EE',
 'PTEBACC_ENMOB_PTQ_EE',
 'PTEBACC_FSM6CLA1A_15_PTQ_EE',
 'PTAC5EM_PTQ_EE',
 'PTEBACMAT_PTQ_EE',
 'PTFSMCLA_13',
 'PGEBACC_PTQ_EE',
 'PBEBACC_E_PTQ_EE',
 'PTPRIORHI',
 'PTEBACC_NFSM6CLA1A_15_PTQ_EE',
 'PTEBACC_E_PTQ_EE',
 'PTEBACC_ENFSM_13',
 'PTEBACC_13',
 'PGPUP',
 'PTEBACC_E_14_PTQ',
 'PTBASICS_LL_LO_PTQ_EE',
 'PTtripleSci_E',
 'PTPRIORAV',
 'PTFSMBASICS_13',
 'PTEBACCAG_PTQ_EE',
 'PTEBACC_E_15_PTQ_EE',
 'PTEBACC_EFSM_14_PTQ',
 'PTEBACC_ENFSM6CLA1A_PTQ_EE',
 'PTNOTFSM6CLA1A',
 'PTEBACC_ENFSM6CLA1A_15_PTQ_EE',
 'PTNOTFSMBASICS_14_PTQ',
 'PTEBACC_EHI_PTQ_EE',
 'PTEALGRP1',
 'PTNOTFSM6CLA1ABASICS_15_PTQ_EE',
 'PTEBACLANAG_PTQ_EE',
 'PTNOTFSMBASICS_13',
 'PTEBACC_EFSM6CLA1A_

In [75]:
# int columns
int_col_list = [(l, ks4_expanded_name[l])
                for l in ks4_expanded_name 
                if 'number' in ks4_expanded_name[l].lower()]
int_col_list

[('TEBACLAN_E_PTQ_EE',
  'Number of pupils entering the English Baccalaureate Language subject area'),
 ('P8PUP_LO',
  'Number of pupils with low prior attainment included in Progress 8 measure'),
 ('TEALGRP3',
  'Number of pupils at the end of key stage 4 whose first language is unclassified'),
 ('P8PUP_EAL',
  'Number of pupils for whom English is an additional language included in Progress 8 measure'),
 ('TAVENT_GAV_PTQ_EE',
  'Average number of GCSE entries per pupil with middle prior attainment'),
 ('TFSMCLA_14',
  'Number of disadvantaged pupils at the end of key stage 4 in 2014'),
 ('BPUP', 'Number of boys at the end of key stage 4'),
 ('TAVENT_E_3NG_LO_PTQ_EE',
  'Average number of GCSE and equivalents entries per pupil with low prior attainment'),
 ('TEBACMAT_PTQ_EE',
  'Number of pupils achieving the English Baccalaureate Maths subject area'),
 ('TAVENT_ENFSM_14_PTQ',
  'Average number of GCSE and equivalents entries per non-disadvantaged pupil in 2014'),
 ('TAVENT_E_3NG_HI_P

In [76]:
# again, save just the column labels to a list
int_cols = [i[0] for i in int_col_list]
int_cols

['TEBACLAN_E_PTQ_EE',
 'P8PUP_LO',
 'TEALGRP3',
 'P8PUP_EAL',
 'TAVENT_GAV_PTQ_EE',
 'TFSMCLA_14',
 'BPUP',
 'TAVENT_E_3NG_LO_PTQ_EE',
 'TEBACMAT_PTQ_EE',
 'TAVENT_ENFSM_14_PTQ',
 'TAVENT_E_3NG_HI_PTQ_EE',
 'TAVENT_ENFSM_13',
 'TAVENT_GHI_PTQ_EE',
 'TEALGRP1',
 'TAVENT_GNFSM_14_PTQ',
 'TBASICS_LL_LO_PTQ_EE',
 'TEBACC_EAV_PTQ_EE',
 'P8PUP_FSM6CLA1A',
 'TAVENT_E_3NG_AV_PTQ_EE',
 'TAVENT_GLO_PTQ_EE',
 'P8PUP',
 'TEALGRP2',
 'TNOTFSMCLA_14',
 'TEBAC2SCI_PTQ_EE',
 'TBASICS_LL_AV_PTQ_EE',
 'TEBAC2SCI_E_PTQ_EE',
 'TPUP',
 'SENAPK4',
 'TAVENT_G_PTQ_EE',
 'TEBACHUMAG_PTQ_EE',
 'TFSM6CLA1A',
 'TPRIORHI',
 'TEBACCAG_PTQ_EE',
 'ESTAB',
 'TEBACC_E_PTQ_EE',
 'TOTPUPS',
 'TAVENT_GNFSM_13',
 'TAVENT_EFSM_13',
 'P8PUP_NFSM6CLA1A',
 'TELNUM',
 'TPRIORLO',
 'TAVENT_E_3NG_PTQ_EE',
 'TNOTFSM6CLA1A_15',
 'TEBACENGAG_LL_PTQ_EE',
 'TEBACMAT_E_PTQ_EE',
 'TAVENT_GFSM6CLA1A_PTQ_EE',
 'TFSM6CLA1A_15',
 'TPRIORAV',
 'TAVENT_E_3NG_FSM6CLA1A_PTQ_EE',
 'TBASICS_LL_HI_PTQ_EE',
 'URN',
 'TAVENT_GNFSM6CLA1A_15_PTQ_EE',
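Matching on the word 'number' also picks up things that are not counts: the list above includes `TELNUM` and `URN` (identifiers) and the `TAVENT_...` columns (averages, so not integers). A stricter filter could keep only descriptions that start with 'Number of'. The sketch below assumes that rule is good enough and uses a four-entry excerpt of the mapping copied from the outputs in this notebook; `count_columns` is a hypothetical helper name.

```python
# Small excerpt of the label -> description mapping, copied from the notebook outputs
expanded = {
    'BPUP': 'Number of boys at the end of key stage 4',
    'TAVENT_GAV_PTQ_EE': 'Average number of GCSE entries per pupil with middle prior attainment',
    'TELNUM': 'School telephone number',
    'URN': 'School Unique Reference Number',
}

def count_columns(names):
    """Keep labels whose description starts with 'Number of', i.e. whole-number counts."""
    return [label for label, text in names.items()
            if text.lower().startswith('number of')]

strict_int_cols = count_columns(expanded)
print(strict_int_cols)
```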


In [77]:
# remind myself of the missing type codes
missing_types


{'NEW': 'New institution',
 nan: 'Not applicable: figures are either not available for the year in question, or the data field is not applicable to this school or college',
 'NE': 'No entries',
 'SUPP': "Indicates that a school or college's figures have been suppressed because there are 5 or fewer pupils in the cohort",
 'LOWCOV': 'Low coverage: indicates that a school’s Progress 8 or value added measures have been suppressed because coverage is less than 50% of the cohort',
 'NP': 'Not published - for example we do not publish Progress 8 data for independent schools and independent special schools, or breakdowns by disadvantaged and non-disadvantaged pupils for independent schools, independent special schools and non-maintained special schools.'}

Create the percent converter as in the Q2b-pd import method.  It returns -1 for missing data and -2 for 'Not Published' (NP) data, which will make it easier to filter the query results later.  NP gets its own value of -2 because I may want to look at those data points separately in more detail.


In [78]:
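def ks4_p2f(x):
    # strip surrounding whitespace and any percent sign before testing the value
    value = x.strip().strip('%')
    try:
        # float() also accepts decimals such as '45.5', which str.isnumeric() rejects
        return float(value) / 100
    except ValueError:
        pass
    # missing-data codes (note the comma after 'NE'; blank cells are '' once stripped)
    if value in ['SUPP', 'NEW', 'LOWCOV', 'NA', 'NE', '']:
        return -1
    elif value == 'NP':
        return -2
    else:
        return x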

In [79]:
percent_converters = {c: ks4_p2f for c in percent_cols}
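A quick way to check how a converter of this kind behaves is to run it over a few rows of made-up data. The sketch below is not the notebook's own function: `percent_to_fraction` and the sample CSV are hypothetical, and for brevity it maps every unrecognised code to -1 rather than passing it through.

```python
import io
import pandas as pd

def percent_to_fraction(x):
    """Hypothetical converter: '59%' -> 0.59, 'NP' -> -2, any other code -> -1."""
    value = x.strip().rstrip('%')
    try:
        return float(value) / 100
    except ValueError:
        return -2 if value == 'NP' else -1

# Made-up rows covering a whole percentage, two codes and a decimal percentage
csv_text = 'SCHNAME,PTL2BASICS\nA,59%\nB,SUPP\nC,NP\nD,45.5%\n'

# read_csv calls the converter once for every cell in the named column
demo_df = pd.read_csv(io.StringIO(csv_text),
                      converters={'PTL2BASICS': percent_to_fraction})
print(demo_df)
```

The decimal row is there on purpose: it is the case most likely to slip through a check based on `str.isnumeric()`.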

Read the file into a dataframe.

In [80]:
ks4_df = pd.read_csv('data/2015-2016/england_ks4final.csv',
                     na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''],
                     converters=percent_converters)

  interactivity=interactivity, compiler=compiler, result=result)


The data type warning is still showing.  I will continue walking through the cleaning steps from tma02-q2.  Since our questions focus only on mainstream schools, we can drop every row where `RECTYPE` is not 1.

In [81]:
ks4_df = ks4_df[ks4_df['RECTYPE'] == 1]

Convert everything to numbers, if possible.


In [82]:
ks4_df = ks4_df.apply(pd.to_numeric, errors='ignore')

Merge the LEA data into the school data.

In [83]:
ks4_df = pd.merge(ks4_df, leas_df, on=['LEA'])
ks4_df.head().T

Unnamed: 0,0,1,2,3,4
RECTYPE,1,1,1,1,1
ALPHAIND,11828,11830,368,9318,10054
LEA,201,201,202,202,202
ESTAB,6007,6005,4285,4611,6000
URN,100003,100001,100053,100054,137333
SCHNAME,City of London School,City of London School for Girls,Acland Burghley School,The Camden School for Girls,CATS College London
SCHNAME_AC,,,,,
ADDRESS1,Queen Victoria Street,St Giles' Terrace,Burghley Road,Sandall Road,43-45 Bloomsbury Square & 2 Southampton Place
ADDRESS2,,Barbican,,,
ADDRESS3,,,,,
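One thing to keep in mind is that `pd.merge` does an inner join by default, so any school whose `LEA` code is missing from the lookup table would be dropped without any message. The sketch below shows one way to check for that with a left join and the `indicator` flag. The two small frames are made up for illustration (including the unmatched code 999) and stand in for the real `ks4_df` and `leas_df`.

```python
import pandas as pd

# Made-up stand-ins for the school and local authority tables
schools = pd.DataFrame({'URN': [100003, 100053, 999999], 'LEA': [201, 202, 999]})
leas = pd.DataFrame({'LEA': [201, 202], 'LA Name': ['City of London', 'Camden']})

# how='left' keeps every school; indicator=True adds a '_merge' column;
# validate raises an error if an LEA code appears more than once in the lookup
checked = pd.merge(schools, leas, on='LEA', how='left',
                   indicator=True, validate='many_to_one')

unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['URN', 'LEA']])
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.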


That is looking better.  I'll now import these into MongoDB.

In [84]:
# create a collection in the database
ks4 = db.ks4

In [85]:
# insert the cleaned dataframe to the database
ks4.insert_many(ks4_df.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f9de41b7c18>

In [86]:
# check that the correct number of documents were included
len(ks4_df), ks4.find().count()

(4196, 4196)

In [87]:
ks4.find_one()

{'AC5EM13': 0.0,
 'AC5EM14_PTQ': 0.0,
 'AC5EM15_PTQ_EE': 0.0,
 'AC5EM16_PTQ_EE': 0.0,
 'ADDRESS1': 'Queen Victoria Street',
 'ADDRESS2': ' ',
 'ADDRESS3': ' ',
 'ADMPOL': ' ',
 'AGERANGE': '10-18',
 'ALPHAIND': 11828,
 'ATT8SCR': '42.1',
 'ATT8SCREBAC': '22.2',
 'ATT8SCREBAC_FSM6CLA1A': 'NP',
 'ATT8SCREBAC_NFSM6CLA1A': 'NP',
 'ATT8SCRENG': '7.3',
 'ATT8SCRENG_FSM6CLA1A': 'NP',
 'ATT8SCRENG_NFSM6CLA1A': 'NP',
 'ATT8SCRMAT': '0',
 'ATT8SCRMAT_FSM6CLA1A': 'NP',
 'ATT8SCRMAT_NFSM6CLA1A': 'NP',
 'ATT8SCROPEN': '12.6',
 'ATT8SCROPENG': '10.4',
 'ATT8SCROPENG_FSM6CLA1A': 'NP',
 'ATT8SCROPENG_NFSM6CLA1A': 'NP',
 'ATT8SCROPENNG': '2.2',
 'ATT8SCROPENNG_FSM6CLA1A': 'NP',
 'ATT8SCROPENNG_NFSM6CLA1A': 'NP',
 'ATT8SCROPEN_FSM6CLA1A': 'NP',
 'ATT8SCROPEN_NFSM6CLA1A': 'NP',
 'ATT8SCR_15': nan,
 'ATT8SCR_AV': 'NP',
 'ATT8SCR_BOYS': '42.1',
 'ATT8SCR_EAL': 'NP',
 'ATT8SCR_FSM6CLA1A': 'NP',
 'ATT8SCR_GIRLS': nan,
 'ATT8SCR_HI': 'NP',
 'ATT8SCR_LO': 'NP',
 'ATT8SCR_NFSM6CLA1A': 'NP',
 'ATT8SCR_NMOB': 'NP

This is an independent school.  Since independent schools don't need to publish their data, there are a lot of missing values, which is something we will need to be mindful of when carrying out the analysis.  Although the percentages have been handled, a number of other measures are still showing 'NP'.  Since the majority of the measures I will be looking at are percentages, instead of working through every single measure I will decide which ones I want to use in my investigation and then clean those as needed.
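The `nan` entries in the document above come straight from the dataframe. One possible alternative, which I have not used here, is to leave missing fields out of each document altogether so that a query can test for them with `$exists`. A minimal sketch, assuming that behaviour is wanted; `records_without_nan` is a hypothetical helper and the second row's score is made up.

```python
import pandas as pd

def records_without_nan(df):
    """Turn each row into a dict, leaving out any field whose value is missing."""
    return [{key: value for key, value in row.items() if pd.notna(value)}
            for row in df.to_dict('records')]

# Two illustrative rows: the first has no ATT8SCR_15 value
demo = pd.DataFrame({'URN': [100003, 100001],
                     'ATT8SCR_15': [float('nan'), 48.2]})
docs = records_without_nan(demo)
print(docs)
```

The resulting list of dicts could be passed to `insert_many` in the same way as the output of `to_dict('records')`.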

In [88]:
ks4_expanded_name['P8MEA_AV']

'Progress 8 measure - pupils with middle prior attainment'

In [89]:
ks4.find({'NFTYPE': 'IND'}).count()

882

In [90]:
nftypes

{'AC': 'Sponsored academy',
 'AC1619': 'Academy 16-19 sponsor led',
 'ACC': 'Academy converter - mainstream',
 'ACC1619': 'Academy 16-19 converter',
 'ACCS': 'Academy converter - special school',
 'ACS': 'Sponsored special academy',
 'CTC': 'City technology college',
 'CY': 'Community school',
 'CYS': 'Community special school',
 'F': 'Free school - mainstream',
 'F1619': 'Free school - 16-19',
 'FD': 'Foundation school',
 'FDS': 'Foundation special school',
 'FESI': 'Further Education Sector Institution',
 'FS': 'Free school - special',
 'FSS': 'Studio school',
 'FUTC': 'UTC (university technical college)',
 'IND': 'Independent school',
 'INDSPEC': 'Independent special school',
 'MODFC': 'College funded by Ministry of Defence',
 'NMSS': 'Non-maintained special school',
 'VA': 'Voluntary aided school',
 'VC': 'Voluntary controlled school'}

In [91]:
ks4.find({'NFTYPE': 'CY'}).count()

541

Good, things look like they are clean enough to start working on the investigation.

# Q1 - Key stage 4 Investigation.  Does the type of school impact the results students achieve at key stage 4?
<a name="q1"></a>

## Basic stats of the dataset


In [92]:
ks4.find().count()

4196

So there are 4196 documents in the dataset after taking out the non-mainstream schools.

Before I can analyse the data, I need to decide what I mean by 'good performance' and then which of the many data points I will use as measures for comparing school types.

For a long time the standard measure of a successful school was the percentage of pupils achieving grades A*-C in Maths and English.  This has changed recently, with the government introducing the new 'Progress 8' and 'Attainment 8' metrics and the English Baccalaureate, which covers English, Maths, sciences (including computer science), history or geography, and a modern or ancient foreign language.  So I will try to use these as the success measures for a school and, if possible, combine them.

So the first step is to identify the keys for the data I want to query.

In [93]:
# print all the keys and values of the meta data
# to help choose the columns I will use
for d in ks4_meta.find():
    print(d['label'], ':', d['expanded'], '\n')

RECTYPE : Record type 

ALPHAIND : Alphabetic sorting index 

LEA : Local authority code (see separate list of local authorities and their codes) 

ESTAB : Establishment number 

URN : School Unique Reference Number 

SCHNAME : School name 

SCHNAME_AC : School now known as (used if the school has converted to an academy on or after 12 Sept 2015) 

ADDRESS1 : School address (1) 

ADDRESS2 : School address (2) 

ADDRESS3 : School address (3) 

TOWN : School town 

PCODE : School postcode 

TELNUM : School telephone number 

CONTFLAG : Contingency flag - school results 'significantly affected'. This field is zero for all schools. 

ICLOSE : Closed school flag (0=open; 1=closed) 

NFTYPE : School type (see separate list of abbreviations used in the tables) 

RELDENOM : School religious character 

ADMPOL : School admissions policy (self-declared by schools on Edubase) 

EGENDER : School gender of entry 

FEEDER : Indicates whether school is a feeder school for sixth form centre/consortia 

Looking through these, it is clear that I will need to be selective in choosing measures.  There are thousands of ways to subdivide this dataset and investigate it.  I will focus on the whole-school averages, which cover every student.  There will of course be cases where this skews the results.

For instance, at schools with many disadvantaged students the average scores could be affected, and without including measures for those groups the results cannot be fully comprehensive.  That said, it is beyond the scope of this project to examine every possible facet of the dataset.

In [94]:
test = pd.DataFrame(list(ks4.find({}, {'ATT8SCR':1, '_id': 0})))
test.count()

ATT8SCR    4126
dtype: int64
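Note that `count()` only counts non-missing entries, and `ATT8SCR` is still stored as text (the document shown earlier has `'42.1'` and `'NP'` as strings). Before any comparison by school type, the scores will need converting. The sketch below shows one way this might be done with `errors='coerce'`. Apart from `'42.1'`, the values are made up, and `ATT8SCR_num` is a hypothetical column name.

```python
import pandas as pd

# Illustrative rows only: school type codes with Attainment 8 scores stored as text
scores = pd.DataFrame({'NFTYPE': ['IND', 'CY', 'CY', 'AC'],
                       'ATT8SCR': ['42.1', '51.3', 'NP', '47.0']})

# Codes such as 'NP' become NaN, which mean() skips by default
scores['ATT8SCR_num'] = pd.to_numeric(scores['ATT8SCR'], errors='coerce')

by_type = scores.groupby('NFTYPE')['ATT8SCR_num'].mean()
print(by_type)
```

Keeping the original text column alongside the numeric one makes it possible to check later how many schools per type were dropped as 'NP'.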

### Descriptive analysis

<a name='todo'></a>

# todo

### Choosing the measures I will use to compare the schools

### What flaws are there in the data/subset of data that could impact findings

### Carry out some analysis resulting in a few plots that convey findings

### Run some statistical tests

### Summarise findings

#  Q2 - Key stage 2 and 4 Investigation.  Do schools that perform well at KS2 deliver as good or better results at KS4?
<a name="q2"></a>

Before I can analyse the data, I need to decide what I mean by 'good performance' and which measures I will use to compare the schools.

### Descriptive analysis

### Choosing the measures I will use to compare the schools

### What flaws are there in the data/subset of data that could impact findings

### Carry out some analysis resulting in a few plots that convey findings

### Run some statistical tests

### Summarise findings

<a name="machine_learning"></a>

# Machine Learning.

Three main chances to implement it, as I see it:
- it can be used to fill in missing values
- it can be used to predict what a school's results will be at KS4
- it could be used to cluster groups.

# Cleanup/remove the database
<a name="cleanup"></a>

Run the cell below to remove the MongoDB database created in the investigation.

In [95]:
# drop the database created for this notebook, then list what remains
client.drop_database('schools_db')
client.database_names()

['accidents', 'admin', 'local']