# EMA Project Notebook

__Name:__ Daniel Smith

__PI:__ A7603242


In [1]:
# import the required libraries
import pandas as pd
import scipy.stats
import pymongo
import bson
import collections
from matplotlib import pyplot
import seaborn as sns

## Contents

Use these links to jump to a section. 

[Data preparation](#importing)
 - [OpenRefine](#openrefine)
 - [Cleaning the dataset](#cleaning)

[Q1. KS4 Investigation](#q1)

[Application of Machine Learning](#machine_learning)

[Q2. KS2 - KS4 Investigation](#q2)



# Data preparation
<a name="importing"></a>

Before we can investigate the data we need to have a look at it, determine what cleaning (if any) needs doing, carry out the cleaning and store the result in an appropriate form for later access.

## Initial look at the KS4 results dataset
Let's have a quick look at the data we will be looking at for the EMA.

In [2]:
!head -5 'data/2015-2016/england_ks4final.csv'

﻿RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,TOWN,PCODE,TELNUM,CONTFLAG,ICLOSE,NFTYPE,RELDENOM,ADMPOL,EGENDER,FEEDER,TABKS2,TAB1618,AGERANGE,CONFEXAM,TOTPUPS,NUMBOYS,NUMGIRLS,TPUP,BPUP,PBPUP,GPUP,PGPUP,KS2APS,TPRIORLO,PTPRIORLO,TPRIORAV,PTPRIORAV,TPRIORHI,PTPRIORHI,TFSM6CLA1A,PTFSM6CLA1A,TNOTFSM6CLA1A,PTNOTFSM6CLA1A,TEALGRP2,PTEALGRP2,TEALGRP1,PTEALGRP1,TEALGRP3,PTEALGRP3,TNMOB,PTNMOB,SENSE4,PSENSE4,SENAPK4,PSENAPK4,TOTATT8,ATT8SCR,TOTATT8ENG,ATT8SCRENG,TOTATT8MAT,ATT8SCRMAT,TOTATT8EBAC,ATT8SCREBAC,TOTATT8OPEN,ATT8SCROPEN,TOTATT8OPENG,ATT8SCROPENG,TOTATT8OPENNG,ATT8SCROPENNG,P8PUP,P8MEACOV,P8MEA,P8CILOW,P8CIUPP,P8MEAENG,P8MEAENG_CILOW,P8MEAENG_CIUPP,P8MEAMAT,P8MEAMAT_CILOW,P8MEAMAT_CIUPP,P8MEAEBAC,P8MEAEBAC_CILOW,P8MEAEBAC_CIUPP,P8MEAOPEN,P8MEAOPEN_CILOW,P8MEAOPEN_CIUPP,PTL2BASICS_LL_PTQ_EE,PTL2BASICS_3YR_PTQ_EE,TEBACC_E_PTQ_EE,PTEBACC_E_PTQ_EE,PTEBACC_PTQ_EE,TEBACENG_E_PTQ_EE,PTEBACENG_E_PTQ_EE,TEBACMAT_E_PTQ_EE,PTEBACMAT_E_PTQ_EE,TEBAC2SCI_E_PTQ_EE,PT

In [3]:
!wc -l 'data/2015-2016/england_ks4final.csv'

5489 data/2015-2016/england_ks4final.csv


The file has 5489 lines (including the header row). There appears to be a large number of columns and a lot of codes that I'll need to look up. There are also a number of `NA` and `NP` values that could be missing data. As well as this results data I will need to find and import the relevant metadata file.

Looking through the data/2015-2016 folder there are a number of files that have information on these codes.

In [4]:
!ls data/2015-2016/

abbreviations.xlsx	    england_vaqual.csv
abs_meta.csv		    england_vasubj.csv
census_meta.csv		    ks2_meta.csv
england_abs.csv		    ks4_clean_reduced.tsv
england_census.csv	    ks4final_clean.csv
england_cfrfull.xlsx	    ks4_meta.csv
england_ks2final.csv	    ks4_meta_methodology.csv
england_ks4final.csv	    ks4-pupdest_meta.csv
england_ks4-pupdest.csv     ks5_meta.csv
england_ks4underlying.xlsx  ks5-studest_meta.csv
england_ks5final.csv	    la_and_region_codes_meta.csv
england_ks5-studest.csv     sixth_form_centres_and_consortia_meta.xlsx
england_ks5underlying.xlsx  spine_meta.csv
england_spine.csv	    swf_meta.csv
england_swf.csv


There is an abbreviations file, stored as an xlsx file. I'll have a quick glance at it in Excel. Having looked the abbreviations up in this file, we can see that they have the following meanings:

- _NA_: Not applicable
- _NP_: Not Published
- _NE_: No entries
- _SUPP_: Suppressed (5 or fewer in cohort)
- _LOWCOV_: Low coverage (less than 50% of the cohort)
- _NEW_: New institution

However, before importing the results data I want to look at it in Open Refine and decide what I will do with the missing data.

In [5]:
# read in the abbreviations file
abbr_df = pd.read_excel('data/2015-2016/abbreviations.xlsx')
abbr_df

Unnamed: 0,2016 KS4 and KS5/16-18 Performance Tables,Unnamed: 1,Unnamed: 2
0,Abbreviations used in the csv and excel Downlo...,,
1,,,
2,Institution type (NFTYPE):,,
3,AC,Sponsored academy,
4,ACC,Academy converter - mainstream,
5,AC1619,Academy 16-19 sponsor led,
6,ACC1619,Academy 16-19 converter,
7,ACCS,Academy converter - special school,
8,ACS,Sponsored special academy,
9,CTC,City technology college,


We can see from this that the NFTYPE information is in rows 3-25.  I'll store them as a dict for reference later.

The general abbreviations are in rows 45-50. (I noticed that the file was mis-parsed here: the `NA` label became `NaN` in the dataframe.)
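The `NA` to `NaN` conversion happens because pandas treats the string `NA` as a missing-value marker by default. The following is a minimal sketch of how the label could be kept as text, assuming the sheet had been exported to CSV. The two inline rows are illustrative and not copied from the file. `pd.read_excel` accepts the same `keep_default_na` parameter.

```python
import io
import pandas as pd

# Illustrative sample only: two abbreviation rows in CSV form.
sample = io.StringIO("label,expanded\nNA,Not applicable\nNP,Not published\n")

# Default parsing turns the literal string "NA" into NaN.
default_df = pd.read_csv(sample)

# keep_default_na=False (with no na_values given) keeps "NA" as a plain string.
sample.seek(0)
literal_df = pd.read_csv(sample, keep_default_na=False)

print(default_df["label"].isna().tolist())
print(literal_df["label"].tolist())
```

With the second form the `NA` row would appear in the lookup dict under the key `'NA'` rather than `nan`.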

In [6]:
# relabel the columns
abbr_df.columns = ['label', 'expanded', 'not_needed']

In [7]:
nftypes = {}
for index, row in abbr_df[3:26].iterrows():
    nftypes[row['label'].strip()] = row['expanded'].strip()

nftypes

{'AC': 'Sponsored academy',
 'AC1619': 'Academy 16-19 sponsor led',
 'ACC': 'Academy converter - mainstream',
 'ACC1619': 'Academy 16-19 converter',
 'ACCS': 'Academy converter - special school',
 'ACS': 'Sponsored special academy',
 'CTC': 'City technology college',
 'CY': 'Community school',
 'CYS': 'Community special school',
 'F': 'Free school - mainstream',
 'F1619': 'Free school - 16-19',
 'FD': 'Foundation school',
 'FDS': 'Foundation special school',
 'FESI': 'Further Education Sector Institution',
 'FS': 'Free school - special',
 'FSS': 'Studio school',
 'FUTC': 'UTC (university technical college)',
 'IND': 'Independent school',
 'INDSPEC': 'Independent special school',
 'MODFC': 'College funded by Ministry of Defence',
 'NMSS': 'Non-maintained special school',
 'VA': 'Voluntary aided school',
 'VC': 'Voluntary controlled school'}

In [8]:
abbr_dict = {}
for i, r in abbr_df[45:51].iterrows():
#     print(r['label'], r['expanded'])
    abbr_dict[r['label']] = r['expanded']
abbr_dict

{nan: 'Not applicable: figures are either not available for the year in question, or the data field is not applicable to this school or college',
 'NP': 'Not published - for example we do not publish Progress 8 data for independent schools and independent special schools, or breakdowns by disadvantaged and non-disadvantaged pupils for independent schools, independent special schools and non-maintained special schools.',
 'NEW': 'New institution',
 'LOWCOV': 'Low coverage: indicates that a school’s Progress 8 or value added measures have been suppressed because coverage is less than 50% of the cohort',
 'NE': 'No entries',
 'SUPP': "Indicates that a school or college's figures have been suppressed because there are 5 or fewer pupils in the cohort"}

In [1]:
# select the rows that are NFTYPES
nftype_df = abbr_df.iloc[3:26]

nftype_df.columns = ['label', 'expanded', 'x']
nftype_df = nftype_df[['label', 'expanded']]

nftype_df

NameError: name 'abbr_df' is not defined

In [24]:
nftypes = {}
for index, row in nftype_df.iterrows():
    nftypes[row['label'].strip()] = row['expanded']
nftypes

{'AC': 'Sponsored academy  ',
 'AC1619': 'Academy 16-19 sponsor led',
 'ACC': 'Academy converter - mainstream',
 'ACC1619': 'Academy 16-19 converter',
 'ACCS': 'Academy converter - special school',
 'ACS': 'Sponsored special academy ',
 'CTC': 'City technology college',
 'CY': 'Community school',
 'CYS': 'Community special school',
 'F': 'Free school - mainstream',
 'F1619': 'Free school - 16-19',
 'FD': 'Foundation school',
 'FDS': 'Foundation special school',
 'FESI': 'Further Education Sector Institution',
 'FS': 'Free school - special',
 'FSS': 'Studio school',
 'FUTC': 'UTC (university technical college)',
 'IND': 'Independent school',
 'INDSPEC': 'Independent special school',
 'MODFC': 'College funded by Ministry of Defence',
 'NMSS': 'Non-maintained special school',
 'VA': 'Voluntary aided school',
 'VC': 'Voluntary controlled school'}

# Open Refine

## Cleaning england_ks4final.csv

### 1. import parameters
Imported the `england_ks4final.csv` with the following parameters.

![ks4 import parameters](img/or_ks4_001.png)

### 2. Remove the non-mainstream schools.

The questions posed only require the mainstream schools.  So we can reduce the file size by selecting only the rows that match `RECTYPE` == 1

This can be done by running a text facet on the `RECTYPE` column and removing all non-matching rows.
![ks4 remove non-mainstream schools](img/or_ks4_002.png)
This reduces the file size by _1293_ rows.

### 3. Removing schools that are closed.

Any schools with the `ICLOSE` flag set to 1 can also be removed:
    
![ks4 remove closed schools](img/or_ks4_003.png)
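The two facet-and-remove steps above could also be sketched in Python. This is only an illustration of the same filtering logic, not what was actually run. It assumes the values are read as text, so mainstream schools have `RECTYPE` equal to `'1'` and closed schools have `ICLOSE` equal to `'1'`. The school names in the sample are made up.

```python
import csv
import io

# Illustrative rows only; the real file has many more columns.
sample = io.StringIO(
    "RECTYPE,ICLOSE,SCHNAME\n"
    "1,0,School A\n"
    "2,0,School B\n"
    "1,1,School C\n"
)

def keep_open_mainstream(rows):
    """Keep rows with RECTYPE == '1' that are not flagged as closed."""
    return [r for r in rows if r["RECTYPE"] == "1" and r["ICLOSE"] != "1"]

kept = keep_open_mainstream(csv.DictReader(sample))
print([r["SCHNAME"] for r in kept])
```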
    

### 4. Removing columns that will not be used in the investigation

Next we can remove any columns that are definitely not needed. With well over 300 columns, the simplest approach is to export the columns we want to CSV and then open a new OpenRefine project with this subset of data. The file is saved to `data/2015-2016/ks4_clean.csv`.

After exporting the file I then opened a new project from the CSV file just created.  The next step is to clean up the percentages as they are currently in a string format.



![ks4 remove closed schools](img/or_ks4_004.png)

### 5. Converting the % strings to numbers

After creating the new project from the semi-cleaned `ks4_clean.csv` file it is clear there is still some cleaning to do. In particular, the columns that contain percentage values are currently strings. To clean these I ran a text facet on the column. The majority of the values are of the form `57%`; at the bottom are the missing-data values, which for the column `PTEBACC_15_PTQ_EE` were `NA` and `SUPP`. I selected these two and then inverted the selection, leaving just the percentages. I then ran this transform on the cells in the column:
    
    toNumber(value.replace('%',''))/100.0

![ks4 converting to numbers](img/or_ks4_005.png)

I then repeated this process on the remaining percentage columns. I opted to keep the `NA` and `SUPP` values as they are because they may be useful during the investigation, possibly for filling in the data using k-NN or for identifying why the data is missing for a given school.
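For reference, the same conversion can be sketched in Python. This is a hedged equivalent of the OpenRefine transform rather than the step that was actually run. The function name `percent_to_fraction` and the `SENTINELS` set are my own, with the codes taken from the abbreviations list earlier.

```python
# Codes that mark missing or withheld values; these are passed through untouched.
SENTINELS = {"NA", "NP", "NE", "SUPP", "LOWCOV", "NEW"}

def percent_to_fraction(value):
    """Convert a string such as '57%' to 0.57; return sentinel codes unchanged."""
    text = value.strip()
    if text in SENTINELS:
        return text
    return float(text.replace("%", "")) / 100.0

print([percent_to_fraction(v) for v in ["57%", "NA", "SUPP", "0%"]])
```

Any value that is neither a percentage nor a known code raises a `ValueError`, which makes unexpected entries visible instead of silently converting them.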

#### OpenRefine editing steps taken to convert the percentages from strings to numbers
The JSON extract of these steps can be found in the following file:
    
    data/2015-2016/openrefine_ks4_cleaning_2.json

### 6.  Exporting the cleaned file
Having converted these columns I was now ready to export the file for the investigation. It is saved as a tab-separated file located at `data/2015-2016/ks4_clean_reduced.tsv`.

## Choosing MongoDB

With so many columns to investigate I am leaning towards using a DBMS to make querying the data more efficient than in a pandas dataframe. Therefore, I will import the data into MongoDB. I chose a document database because it does not enforce a fixed schema, which makes it more flexible than a relational database for this work. For example, it may become necessary to add fields to only some of the documents.





## Importing KS4 results data into MongoDB

In [13]:
# import the results data into mongo db
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4final \
    --type tsv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_clean_reduced.tsv

2018-05-21T17:52:56.409+0000	connected to: localhost:27351
2018-05-21T17:52:56.409+0000	dropping: schools_db.ks4final
2018-05-21T17:52:56.545+0000	imported 4131 documents


In [14]:
# open a connection to the mongo server
client = pymongo.MongoClient('mongodb://localhost:27351/')

In [15]:
# open the imported database and collection
db = client.schools_db
ks4results = db.ks4final

In [16]:
# check the number of imported documents matches the mongoimport report (4131)
ks4results.count_documents({})

4131
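The expected document count is the number of data rows in the imported file, i.e. its line count minus the header line. A minimal sketch of that check follows. The helper name `count_data_rows` is my own and the inline sample stands in for the real TSV.

```python
import csv
import io

def count_data_rows(handle, delimiter="\t"):
    """Count the records in a delimited file, excluding the header line."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)  # ignore completely blank lines

# Illustrative stand-in for the exported TSV.
sample = io.StringIO("URN\tSCHNAME\n100003\tSchool A\n100053\tSchool B\n")
print(count_data_rows(sample))
```

With the real file, the handle would come from `open('data/2015-2016/ks4_clean_reduced.tsv', newline='')`, and the result could be compared with the count returned by the collection.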

Good, 4131 documents as expected.  Let's have a look at one.

In [21]:
ks4results.find_one()

{'ALPHAIND': 11828,
 'ATT8SCR': 42.1,
 'ATT8SCREBAC': 22.2,
 'ATT8SCRENG': 7.3,
 'ATT8SCRMAT': 0,
 'ATT8SCROPENG': 10.4,
 'ATT8SCR_15': 'NA',
 'ATT8SCR_AV': 'NP',
 'ATT8SCR_HI': 'NP',
 'ATT8SCR_LO': 'NP',
 'ESTAB': 6007,
 'LEA': 201,
 'NFTYPE': 'IND',
 'PTAC5EM_PTQ_EE': 0.0,
 'PTEBACC_15_PTQ_EE': 0.0,
 'PTL2BASICS_3YR_PTQ_EE': 0.0,
 'PTL2BASICS_LL_PTQ_EE': 0.0,
 'PTPRIORAV': 'NP',
 'PTPRIORHI': 'NP',
 'PTPRIORLO': 'NP',
 'SCHNAME': 'City of London School',
 'SCHNAME_AC': ' ',
 'TABKS2': 0,
 'URN': 100003,
 '_id': ObjectId('5b0307780d6f84c0ce756454')}

There is still a little more cleaning to do. The `SCHNAME_AC` field is a single space (`' '`) in this document. Let's handle that first.

In [22]:
r = ks4results.update_many({'SCHNAME_AC': ' '}, {'$unset':{'SCHNAME_AC': ''}})
r.matched_count, r.modified_count

(4067, 4067)

In [24]:
ks4results.find_one({'PTAC5EM_PTQ_EE': {'$gte':0.01}})

{'ALPHAIND': 368,
 'ATT8SCR': 50.1,
 'ATT8SCREBAC': 14.2,
 'ATT8SCRENG': 11,
 'ATT8SCRMAT': 9.9,
 'ATT8SCROPENG': 14.5,
 'ATT8SCR_15': 'NA',
 'ATT8SCR_AV': 46.8,
 'ATT8SCR_HI': 63.1,
 'ATT8SCR_LO': 26,
 'ESTAB': 4285,
 'LEA': 202,
 'NFTYPE': 'CY',
 'PTAC5EM_PTQ_EE': 0.53,
 'PTEBACC_15_PTQ_EE': 0.37,
 'PTL2BASICS_3YR_PTQ_EE': 0.59,
 'PTL2BASICS_LL_PTQ_EE': 0.6,
 'PTPRIORAV': 0.48,
 'PTPRIORHI': 0.39,
 'PTPRIORLO': 0.13,
 'SCHNAME': 'Acland Burghley School',
 'TABKS2': 0,
 'URN': 100053,
 '_id': ObjectId('5b0307780d6f84c0ce756456')}


Looking through the documents we can see a large number of `'NA'` and `'NP'` values. We will need to bear them in mind as we carry out the investigation.
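One way to quantify this is to count how often each code appears per field. The sketch below works on plain dicts. With the database, the same function could be given the cursor from `ks4results.find()`. The function name `sentinel_counts` and the `SENTINELS` set are my own, and the two sample documents are cut-down versions of the output above.

```python
from collections import Counter

# Codes from the abbreviations file that mark missing or withheld values.
SENTINELS = {"NA", "NP", "NE", "SUPP", "LOWCOV", "NEW"}

def sentinel_counts(documents):
    """Count, per (field, code) pair, how often each sentinel code appears."""
    counts = Counter()
    for doc in documents:
        for field, value in doc.items():
            # Only string values can be sentinel codes.
            if isinstance(value, str) and value in SENTINELS:
                counts[field, value] += 1
    return counts

# Two cut-down documents based on the output above.
docs = [
    {"URN": 100003, "ATT8SCR_15": "NA", "ATT8SCR_AV": "NP"},
    {"URN": 100053, "ATT8SCR_15": "NA", "ATT8SCR_AV": 46.8},
]
print(sentinel_counts(docs))
```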

## Importing the KS4 Metadata file

In the accidents dataset we explored in the module materials there were some handy functions for looking up human-readable descriptions of the codes. To help make this investigation easier I will try to do a similar thing here.

In [None]:
!head -5 data/2015-2016/ks4_meta.csv

In [None]:
!wc -l data/2015-2016/ks4_meta.csv

0 lines... strange, I'll try loading it into Mongo

In [None]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_meta.csv

Clearly there is an issue with the import. I'll try importing it into a dataframe.


In [None]:
ks4_meta_df = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4_meta_df.head()

To look up the labels I only require the `Metafile heading` and `Metafile description` columns, so I can drop the others.

In [None]:
ks4_labels_df = ks4_meta_df[['Metafile heading', 'Metafile description']]
# relabel the columns for easier access
ks4_labels_df.columns = ['label', 'expanded']
ks4_labels_df.head()

In [None]:
# access the db ks4_labels collection
ks4_labels = db.ks4_labels

In [None]:
# iterate through each ks4_meta_df row and add it to the database
# I will use the same keys as in
for index, row in ks4_labels_df.iterrows():
    ks4_labels.insert_one({'label': row['label'],
                           'expanded': row['expanded']})
# check it looks ok
ks4_labels.find_one()

In [None]:
# print each label alongside its description so they can be read through
for index, row in ks4_labels_df.iterrows():
    # if 'mat' in row['expanded']:
    print(row['label'], ':', row['expanded'], '\n\n')

In [None]:
ks4_labels_df.T


I need to tidy up a little.  

First, the `RECTYPE` expansion contains a list of codes that should be separated for easier access.

In [None]:
# select the correct document
r = ks4_labels.find_one({'label': 'RECTYPE'})

# After checking that the codes have not already been created
# splits the description string
# adds a codes key to reference each school type
if 'codes' not in r.keys():    
    expanded = r['expanded']
    e = expanded[:11]
    codelist = expanded[13:-1].split('; ')
    keys = [c[:1] for c in codelist]
    values = [c[2:] for c in codelist]
    codes = dict(zip(keys, values))
    ks4_labels.update_one({'_id': r['_id']}, 
                        {'$set': {'expanded': e,
                                  'codes': codes}})

# check that it was processed correctly
ks4_labels.find_one({'label': 'RECTYPE'})
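The slicing above depends on fixed character positions and single-character codes. A slightly more tolerant version of the same idea is sketched below. It assumes the description has the shape `Name (code=label; code=label)`. The function name `split_coded_description` and the example string are hypothetical and are not taken from the metadata file.

```python
def split_coded_description(expanded):
    """Split 'Name (k=v; k=v)' into the name and a {code: label} dict."""
    name, _, rest = expanded.partition(" (")
    # Drop the single closing bracket that ends the code list, if present.
    body = rest[:-1] if rest.endswith(")") else rest
    codes = {}
    for item in body.split("; "):
        key, sep, label = item.partition("=")
        if sep:  # ignore fragments without a 'code=label' shape
            codes[key.strip()] = label.strip()
    return name, codes

# Hypothetical description string, used only to exercise the function.
example = "Record type (1=mainstream school; 2=special school)"
print(split_coded_description(example))
```

A description with no bracketed code list comes back with an empty dict, so the function is safe to call on every label.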

Second, the LEA data is kept in a different file, `la_and_region_codes_meta.csv`. However, for my investigation I don't think I need to import it.

In the TM351 course materials we used some in-memory collections to access the labels information. Because I will need to do the same for the KS2 dataset, I'll wrap them in a function.

In [None]:
def expanded_label(meta):
    # Load the expanded names of keys and human-readable codes into memory
    expanded_name = collections.defaultdict(str)
    for e in meta.find({'expanded': {"$exists": True}}):
        expanded_name[e['label']] = e['expanded']

    label_of = collections.defaultdict(str)
    for l in meta.find({'codes': {"$exists": True}}):
        for c in l['codes']:
            try:
                label_of[l['label'], int(c)] = l['codes'][c]
            except ValueError: 
                label_of[l['label'], c] = l['codes'][c]
    # return both as a tuple
    return (expanded_name, label_of)

In [None]:
# test the function works
ks4_expanded_name, ks4_label_of = expanded_label(ks4_labels)

In [None]:
# test it works
[(c, ks4_label_of['RECTYPE', c]) for k, c in ks4_label_of if k == 'RECTYPE']
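One property of these `defaultdict(str)` collections is worth keeping in mind: looking up a missing key returns an empty string but also stores that key. The short sketch below illustrates this with a stand-in dictionary. The label value is illustrative only.

```python
import collections

label_of = collections.defaultdict(str)
label_of["RECTYPE", 1] = "mainstream school"  # illustrative label

# A known key returns its label; an unknown key returns '' ...
known = label_of["RECTYPE", 1]
missing = label_of["RECTYPE", 99]

# ... but the failed lookup also inserts the key as a side effect.
inserted = ("RECTYPE", 99) in label_of

# .get() reads without inserting anything.
safe = label_of.get(("RECTYPE", 42), "")
still_absent = ("RECTYPE", 42) not in label_of

print(known, repr(missing), inserted, still_absent)
```

This matters for list comprehensions that iterate over the keys, as in the cell above, because earlier failed lookups will show up as codes with empty labels.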

Oh that reminds me - I will need to get the School Types from the abbreviations file as I did in TMA02.  Currently the description is: 

In [None]:
ks4_expanded_name['NFTYPE']

In TMA 02 I looked up this data from the abbreviations file and made a dict to access it conveniently.

In [None]:
school_type_dict = {'VA': 'Voluntary aided school',
             'AC': 'Sponsored Academy',
             'F': 'Free school - mainstream',
             'CY': 'Community school',
             'FS': 'Free school - special',
             'CYS': 'Community special school',
             'FD': 'Foundation school',
             'ACC': 'Academy converter - mainstream',
             'ACCS': 'Academy converter - special school',
             'FDS': 'Foundation special school',
             'ACS': 'Sponsored special academy',
             'VC': 'Voluntary controlled school'}
len(school_type_dict)

I can put this into the `NFTYPE` document.

In [None]:
# update the database document
ks4_labels.find_one_and_update({'label': 'NFTYPE'},
                               {'$set': {'codes': school_type_dict}})
# check it looks ok
ks4_labels.find_one({'label': 'NFTYPE'})

In [None]:
# update the ks4_label_of and expanded name collections
ks4_expanded_name, ks4_label_of = expanded_label(ks4_labels)

In [None]:
# check the codes

In [7]:
[(c, ks4_label_of['NFTYPE', c]) for k, c in ks4_label_of if k =='NFTYPE']

NameError: name 'ks4_label_of' is not defined

In [None]:
ks4_label_of['NFTYPE', 'FD']

Great, now I'll quickly import the KS2 dataset for later on.


## Importing the KS2 data into MongoDB

Now we will follow largely the same steps for the KS2 dataset.

In [None]:
!head -5 data/2015-2016/england_ks2final.csv

In [None]:
!wc -l data/2015-2016/england_ks2final.csv

The file has 16316 lines, so there appear to be far more KS2 school records than KS4. Again there are a large number of columns and a lot of codes to look up. There are also a number of `NA` and `NP` values that could be missing data. As well as this results data I will need to find and import the relevant metadata file.

In [None]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks2final \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/england_ks2final.csv

In [None]:
# open the imported collection
ks2results = db.ks2final

In [None]:
# check the number of imported documents against the line count of the original file (16316)
ks2results.count_documents({})

In [None]:
# great, now have a look at one
ks2results.find_one()

In [None]:
!head -5 data/2015-2016/ks2_meta.csv

This looks far more organised than the KS4 metadata. I should be able to import it directly into MongoDB.

In [None]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks2_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks2_meta.csv
    

In [None]:
ks4_expanded_name

In [None]:
[(k, ks4_expanded_name[k]) for k in ks4_expanded_name if 'AVERAGE' in k]

# Q1 - KS4 Investigation
<a name="q1"></a>

## Does the type of school impact the overall academic performance results of students at KS4?

# Application of Machine Learning
<a name="machine_learning"></a>

# Q2 - KS2-KS4 Investigation
<a name="q2"></a>

## Do top-performing schools at KS2 deliver similarly good or better results at KS4?

# Cleanup: remove the database

Uncomment the lines below to remove the MongoDB database created in the investigation.

In [2]:
# uncomment to remove the database if needed
# client.drop_database('schools_db')

# list the databases stored
# client.list_database_names()