# EMA Lab Notebook

__Name:__ Daniel Smith

__PI:__ A7603242


In [10]:
# import the required libraries
import pandas as pd
import scipy.stats
import pymongo
import bson
import collections
from matplotlib import pyplot
import seaborn as sns

## Contents 

- [Data preparation](#importing)
- [Cleaning the dataset](#cleaning)
- [Q1. KS4 Investigation](#q1)
- [Application of Machine Learning](#machine_learning)
- [Q2. KS2 - KS4 Investigation](#q2)


# Data preparation
<a name="importing"></a>

Before we can investigate the data we need to have a look at it, determine what cleaning (if any) is needed, carry out that cleaning, and store the data in an appropriate form for access.

## Initial look at the KS4 results dataset
Let's have a quick look at the data we will be looking at for the EMA.

In [None]:
!head -5 'data/2015-2016/england_ks4final.csv'

In [None]:
!wc -l 'data/2015-2016/england_ks4final.csv'

The dataset has 5489 rows of data. There appears to be a large number of columns and a lot of codes that I'll need to look up.  There are also a number of `NA` and `NP` values that could be missing data.  As well as this results data, I will need to find and import the relevant metadata file.

Looking through the data/2015-2016 folder there are a number of files that have information on these codes.

In [None]:
!ls data/2015-2016/

There is an abbreviations file, stored as an xlsx file, so I'll have a quick glance at it in Excel.  Having looked the abbreviations up in that file, we can see that they have the following meanings:

- _NA_: Not Applicable
- _NP_: Not Published

However, before importing the results data I want to look at it in Open Refine and decide what I will do with the missing data.

## OpenRefine

Looking at the `england_ks4final.csv` file in OpenRefine I can see that the `NA` and `NP` values appear in many places, and that there are also `SUPP` values.

However, with so many columns to facet and edit one by one this would become very tedious, and many of the columns may have no bearing on my investigations.  It will be easier and more efficient to handle these values when querying the database, so no changes were made to the file in OpenRefine.
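As a rough sketch of what handling these values at query time could look like, the helper below builds a MongoDB filter that skips the placeholder strings for a given field. `published_only` is a hypothetical helper (not part of pymongo), and the field names `ATT8SCR` and `RECTYPE` are used purely as examples. The `$exists` check is included on the assumption that the import leaves out blank fields.

```python
# Placeholder strings seen in the results file (NA, NP, SUPP).
PLACEHOLDERS = ['NA', 'NP', 'SUPP']


def published_only(field, extra=None):
    """Build a MongoDB filter matching documents where `field` holds a real value.

    Hypothetical helper for illustration; it only builds a plain dict.
    """
    query = {field: {'$exists': True, '$nin': PLACEHOLDERS}}
    if extra:
        # Merge in any additional conditions, e.g. restricting the record type.
        query.update(extra)
    return query


example_filter = published_only('ATT8SCR', {'RECTYPE': 1})
print(example_filter)

# With a live collection the filter could be passed straight to find(), e.g.
# ks4results.find(example_filter)
```

Keeping the placeholder list in one place means the same rule can be reused for every field that is queried, rather than editing the source file column by column.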

## Choosing MongoDB

With so many columns to investigate I am leaning towards using a DBMS, which should make querying the data more efficient than working in a pandas dataframe.  Therefore, I will import the data into MongoDB.  I chose a document database because it does not enforce a fixed schema, which makes it more flexible than a relational database here: for example, it may become necessary to add fields to certain documents during the investigation.





## Importing KS4 results data into MongoDB

In [11]:
# import the results data into mongo db
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4final \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/england_ks4final.csv

2018-05-21T14:35:33.332+0000	connected to: localhost:27351
2018-05-21T14:35:33.332+0000	dropping: schools_db.ks4final
2018-05-21T14:35:35.487+0000	imported 5489 documents


In [12]:
# open a connection to the mongo server
client = pymongo.MongoClient('mongodb://localhost:27351/')

In [13]:
# open the imported database and collection
db = client.schools_db
ks4results = db.ks4final

In [14]:
# check the number of imported documents matches the number of rows in the file (5489)
ks4results.count_documents({})

5489
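The expected figure can also be worked out in Python instead of with `wc -l`. The sketch below counts the records after the header row using the standard `csv` module. `count_data_rows` is an illustrative helper, and the in-memory sample stands in for the real file with made-up values.

```python
import csv
import io


def count_data_rows(lines):
    """Count CSV records after the header row, skipping completely blank lines."""
    reader = csv.reader(lines)
    next(reader, None)  # discard the header row
    return sum(1 for row in reader if row)


# Tiny in-memory stand-in for the real file (values are made up).
sample = io.StringIO('URN,SCHNAME\n100001,School A\n100002,School B\n')
sample_count = count_data_rows(sample)
print(sample_count)

# For the real file, something like the following (the file's encoding is
# not known here, so an explicit encoding argument may be needed):
# with open('data/2015-2016/england_ks4final.csv', newline='') as f:
#     print(count_data_rows(f))
```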

Good, 5489 documents as expected.  Let's have a look at one.

In [None]:
ks4results.find_one()

Looking through the document we can see a large number of 'NA' and 'NP' values.  We will need to bear them in mind as we carry out the investigation.

## Importing the KS4 Metadata file

In the accidents dataset we explored in the module materials there were some handy functions for looking up human-readable descriptions of the codes.  To make this investigation easier I will try to do a similar thing here.

In [15]:
!head -5 data/2015-2016/ks4_meta.csv

Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,,1,RECTYPE,Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools)),,,,,,,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,3,LEA,Local authority code (see separate list of local authorities and their codes),,,,Yes,Yes,,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,6,SCHNAME,School name,,,Yes,Yes,Yes,,7,SCHNAME_AC,School now known as (used if the school has converted to an academy on or after 12 Sept 2015),,,Yes,Yes,Yes,,8,ADDRESS1,School address (1),,,Yes,Yes,Yes,,9,ADDRESS2,School address (2),,,Yes,Yes,Yes,,10,ADDRESS3,School address (3),,,Yes,Yes,Yes,,11,TOWN,School town,,,Yes,Yes,Yes,,12,PCODE,School postcode,,,Yes,Yes

In [16]:
!wc -l data/2015-2016/ks4_meta.csv

0 data/2015-2016/ks4_meta.csv


0 lines... strange.  `wc -l` counts newline characters, so the file may be using a different line ending.  I'll try loading it into Mongo anyway.

In [17]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks4_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks4_meta.csv

2018-05-21T14:35:48.027+0000	Failed: fields cannot be identical: '' and ''
2018-05-21T14:35:48.027+0000	imported 0 documents
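The error message says two field names are both empty strings, which would be consistent with the header row ending in unnamed columns. A minimal sketch for spotting repeated header names before attempting an import is shown below. `repeated_header_names` is a hypothetical helper, and the header string is a shortened, illustrative version rather than the full header of the file.

```python
import collections
import csv


def repeated_header_names(header_line):
    """Return the header names that occur more than once, with their counts."""
    names = next(csv.reader([header_line]))
    counts = collections.Counter(name.strip() for name in names)
    return {name: n for name, n in counts.items() if n > 1}


# Shortened, illustrative header ending in two unnamed columns.
header = 'Column,Metafile heading,Metafile description,,'
repeats = repeated_header_names(header)
print(repeats)
```

An empty-string key in the result points at unnamed columns, which would need naming or dropping before `--headerline` can be used.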


Clearly there is an issue with the import.  I'll try importing it into a dataframe instead.


In [18]:
ks4_meta_df = pd.read_csv('data/2015-2016/ks4_meta.csv')
ks4_meta_df.head()

Unnamed: 0,Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,Unnamed: 8,Unnamed: 9
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...,,,,,,,
1,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,
2,3,LEA,Local authority code (see separate list of loc...,,,,Yes,Yes,,
3,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,
4,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,


In [19]:
len(ks4_meta_df)

372
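pandas found 372 rows even though `wc -l` reported 0 lines. One plausible explanation (an assumption, not something checked in this notebook) is that the file separates records with bare carriage returns (`\r`), which `wc -l` does not count. A minimal sketch for checking this by counting each line-ending style in the raw bytes follows. `count_line_endings` and the sample bytes are illustrative.

```python
def count_line_endings(data):
    """Count each line-ending style in a bytes object."""
    crlf = data.count(b'\r\n')
    return {
        'CRLF': crlf,
        'LF': data.count(b'\n') - crlf,  # bare \n only
        'CR': data.count(b'\r') - crlf,  # bare \r only
    }


# Made-up sample using bare carriage returns between records.
sample_bytes = b'Column,Metafile heading\r1,RECTYPE\r2,ALPHAIND\r'
endings = count_line_endings(sample_bytes)
print(endings)

# For the real file, something like:
# with open('data/2015-2016/ks4_meta.csv', 'rb') as f:
#     print(count_line_endings(f.read()))
```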

To look up the labels I only require the `Metafile heading` and `Metafile description` columns, so I can drop the others.

In [20]:
ks4_labels_df = ks4_meta_df[['Metafile heading', 'Metafile description']]
# relabel the columns for easier access
ks4_labels_df.columns = ['label', 'expanded']

for i, r in ks4_labels_df.iterrows():
    print(r['label'], ': ', r['expanded'], '\n')

RECTYPE :  Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools)) 

ALPHAIND :  Alphabetic sorting index 

LEA :  Local authority code (see separate list of local authorities and their codes) 

ESTAB :  Establishment number 

URN :  School Unique Reference Number 

SCHNAME :  School name 

SCHNAME_AC :  School now known as (used if the school has converted to an academy on or after 12 Sept 2015) 

ADDRESS1 :  School address (1) 

ADDRESS2 :  School address (2) 

ADDRESS3 :  School address (3) 

TOWN :  School town 

PCODE :  School postcode 

TELNUM :  School telephone number 

CONTFLAG :  Contingency flag - school results 'significantly affected'. This field is zero for all schools. 

ICLOSE :  Closed school flag (0=open; 1=closed) 

NFTYPE :  School type (see separate list of abbreviations used in the tables) 

RELDENOM :  School religious character 

ADMPOL :  School admissions policy (self-declared by school

In [21]:
# access the db ks4_labels collection
ks4_labels = db.ks4_labels

In [22]:
# iterate through each ks4_labels_df row and add it to the database,
# using 'label' and 'expanded' as the document keys
for index, row in ks4_labels_df.iterrows():
    ks4_labels.insert_one({'label': row['label'],
                           'expanded': row['expanded']})
# check it looks ok
ks4_labels.find_one()

{'_id': ObjectId('5b02d9520fd01f06cf7428c6'),
 'expanded': 'Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools))',
 'label': 'RECTYPE'}

In [23]:
# print each label with its expanded description
# (uncomment the `if` line and indent the print to filter by description)
for index, row in ks4_labels_df.iterrows():
    # if 'mat' in row['expanded']:
    print(row['label'], ':', row['expanded'], '\n\n')

RECTYPE : Record type (1=mainstream school; 2=special school; 4=local authority; 5=National (all schools); 7=National (maintained schools)) 


ALPHAIND : Alphabetic sorting index 


LEA : Local authority code (see separate list of local authorities and their codes) 


ESTAB : Establishment number 


URN : School Unique Reference Number 


SCHNAME : School name 


SCHNAME_AC : School now known as (used if the school has converted to an academy on or after 12 Sept 2015) 


ADDRESS1 : School address (1) 


ADDRESS2 : School address (2) 


ADDRESS3 : School address (3) 


TOWN : School town 


PCODE : School postcode 


TELNUM : School telephone number 


CONTFLAG : Contingency flag - school results 'significantly affected'. This field is zero for all schools. 


ICLOSE : Closed school flag (0=open; 1=closed) 


NFTYPE : School type (see separate list of abbreviations used in the tables) 


RELDENOM : School religious character 


ADMPOL : School admissions policy (self-declared by schools

In [None]:
ks4_labels_df.T


I need to tidy up a little.  

First, the `RECTYPE` expansion contains a list of codes that should be separated for easier access.

In [24]:
# select the correct document
r = ks4_labels.find_one({'label': 'RECTYPE'})

# If the codes have not already been created:
#   split the description string into its name and its list of codes,
#   then add a codes key to reference each record type
if 'codes' not in r:
    expanded = r['expanded']
    e = expanded[:11]                        # 'Record type'
    codelist = expanded[13:-1].split('; ')   # ['1=mainstream school', ...]
    keys = [c[:1] for c in codelist]
    values = [c[2:] for c in codelist]
    codes = dict(zip(keys, values))
    ks4_labels.update_one({'_id': r['_id']},
                          {'$set': {'expanded': e,
                                    'codes': codes}})

# check that it was processed correctly
ks4_labels.find_one({'label': 'RECTYPE'})

{'_id': ObjectId('5b02d9520fd01f06cf7428c6'),
 'codes': {'1': 'mainstream school',
  '2': 'special school',
  '4': 'local authority',
  '5': 'National (all schools)',
  '7': 'National (maintained schools)'},
 'expanded': 'Record type',
 'label': 'RECTYPE'}
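The slicing above relies on fixed character positions, so it only suits the `RECTYPE` description. As a sketch of a more general approach, the hypothetical `split_codes` function below separates a description of the form `Name (k1=v1; k2=v2)` into the name and a dict of codes, and leaves any other description untouched. It assumes codes are always separated by `'; '`, as in the `RECTYPE` text.

```python
def split_codes(expanded):
    """Split 'Description (k1=v1; k2=v2)' into (description, {k1: v1, ...}).

    Descriptions that do not follow this pattern are returned unchanged
    together with an empty dict.
    """
    description, sep, rest = expanded.partition(' (')
    if not sep or not rest.endswith(')') or '=' not in rest:
        return expanded, {}
    codes = {}
    for item in rest[:-1].split('; '):   # drop the closing bracket first
        key, eq, value = item.partition('=')
        if not eq:
            return expanded, {}
        codes[key.strip()] = value.strip()
    return description, codes


rectype_text = ('Record type (1=mainstream school; 2=special school; '
                '4=local authority; 5=National (all schools); '
                '7=National (maintained schools))')
name, rectype_codes = split_codes(rectype_text)
print(name)
print(rectype_codes)
```

Because only the first `' ('` and the final `')'` are treated as delimiters, bracketed text inside a value such as `National (all schools)` is preserved.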

Second, the LEA data is kept in a separate `la_and_region_codes` metadata file.  However, I don't think I need to import it for my investigation.

In the TM351 course materials we used some in-memory collections to access the label information.  Because I will need to do the same for the KS2 dataset, I'll wrap them in a function.

In [25]:
def expanded_label(meta):
    # Load the expanded names of keys and human-readable codes into memory
    expanded_name = collections.defaultdict(str)
    for e in meta.find({'expanded': {"$exists": True}}):
        expanded_name[e['label']] = e['expanded']

    label_of = collections.defaultdict(str)
    for l in meta.find({'codes': {"$exists": True}}):
        for c in l['codes']:
            try:
                label_of[l['label'], int(c)] = l['codes'][c]
            except ValueError: 
                label_of[l['label'], c] = l['codes'][c]
    # return both as a tuple
    return (expanded_name, label_of)

In [26]:
# test the function works
ks4_expanded_name, ks4_label_of = expanded_label(ks4_labels)

In [27]:
# test it works
[(c, ks4_label_of['RECTYPE', c]) for k, c in ks4_label_of if k == 'RECTYPE']

[(1, 'mainstream school'),
 (4, 'local authority'),
 (7, 'National (maintained schools)'),
 (2, 'special school'),
 (5, 'National (all schools)')]
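To show how these lookups behave without a running MongoDB server, here is a small sketch that builds the same two structures from a plain list of dicts. `build_lookups` is a hypothetical stand-in for `expanded_label`, and the documents are illustrative, shaped like those in the `ks4_labels` collection.

```python
import collections


def build_lookups(documents):
    """Build (expanded_name, label_of) from an iterable of label documents."""
    expanded_name = collections.defaultdict(str)
    label_of = collections.defaultdict(str)
    for doc in documents:
        if 'expanded' in doc:
            expanded_name[doc['label']] = doc['expanded']
        for code, meaning in doc.get('codes', {}).items():
            # Numeric codes are stored under int keys, others stay as strings.
            try:
                key = int(code)
            except ValueError:
                key = code
            label_of[doc['label'], key] = meaning
    return expanded_name, label_of


# Illustrative documents shaped like those in the ks4_labels collection.
docs = [
    {'label': 'RECTYPE', 'expanded': 'Record type',
     'codes': {'1': 'mainstream school', '2': 'special school'}},
    {'label': 'URN', 'expanded': 'School Unique Reference Number'},
]
names, labels = build_lookups(docs)
print(labels['RECTYPE', 1])
print(labels.get(('RECTYPE', 9), 'unknown code'))
```

One thing to be aware of: indexing a `defaultdict` with an unknown key silently adds that key with an empty string, whereas `.get()` leaves the dictionary unchanged, so `.get()` is the safer choice when probing for codes that may not exist.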

Oh that reminds me - I will need to get the School Types from the abbreviations file as I did in TMA02.  Currently the description is: 

In [29]:
ks4_expanded_name['NFTYPE']


'School type (see separate list of abbreviations used in the tables)'

In [32]:
cols = ['LEA', 'ESTAB', 'URN', 'SCHNAME', 'SCHNAME_AC', 'NFTYPE',
 'TABKS2', 'PTPRIORLO', 'PTPRIORAV', 'PTPRIORHI', 'ATT8SCR',
 'ATT8SCRENG', 'ATT8SCRMAT', 'ATT8SCREBAC', 'ATT8SCROPENG',
 'PTL2BASICS_LL_PTQ_EE', 'PTL2BASICS_3YR_PTQ_EE', 'ATT8SCR_AV',
 'ATT8SCR_LO', 'ATT8SCR_HI', 'PTEBACC_15_PTQ_EE', 'PTAC5EM_PTQ_EE',
 ]

for c in cols:
    print(c, ':', ks4_expanded_name[c])

LEA : Local authority code (see separate list of local authorities and their codes)
ESTAB : Establishment number
URN : School Unique Reference Number
SCHNAME : School name
SCHNAME_AC : School now known as (used if the school has converted to an academy on or after 12 Sept 2015)
NFTYPE : School type (see separate list of abbreviations used in the tables)
TABKS2 : Indicates whether school is published in the primary school (key stage 2) performance tables (0=No; 1=Yes)
PTPRIORLO : Percentage of pupils at the end of key stage 4 with low prior attainment at the end of key stage 2
PTPRIORAV : Percentage of pupils at the end of key stage 4 with middle prior attainment at the end of key stage 2
PTPRIORHI : Percentage of pupils at the end of key stage 4 with high prior attainment at the end of key stage 2
ATT8SCR : Average Attainment 8 score per pupil
ATT8SCRENG : Average Attainment 8 score per pupil for English element
ATT8SCRMAT : Average Attainment 8 score per pupil for mathematics element


In TMA 02 I looked up this data from the abbreviations file and made a dict to access it conveniently.

In [None]:
school_type_dict = {'VA': 'Voluntary aided school',
             'AC': 'Sponsored Academy',
             'F': 'Free school - mainstream',
             'CY': 'Community school',
             'FS': 'Free school - special',
             'CYS': 'Community special school',
             'FD': 'Foundation school',
             'ACC': 'Academy converter - mainstream',
             'ACCS': 'Academy converter - special school',
             'FDS': 'Foundation special school',
             'ACS': 'Sponsored special academy',
             'VC': 'Voluntary controlled school'}
len(school_type_dict)

I can put this into the `NFTYPE` document.

In [None]:
# update the database document
ks4_labels.find_one_and_update({'label': 'NFTYPE'},
                               {'$set': {'codes': school_type_dict}})
# check it looks ok
ks4_labels.find_one({'label': 'NFTYPE'})

In [None]:
# update the ks4_label_of and expanded name collections
ks4_expanded_name, ks4_label_of = expanded_label(ks4_labels)

In [None]:
# check the codes
[(c, ks4_label_of['NFTYPE', c]) for k, c in ks4_label_of if k == 'NFTYPE']

In [None]:
ks4_label_of['NFTYPE', 'FD']

Great, now I'll quickly import the KS2 dataset for later on.

## Importing the KS2 data into MongoDB

Now we will follow largely the same steps for the KS2 dataset.

In [None]:
!head -5 data/2015-2016/england_ks2final.csv

In [None]:
!wc -l data/2015-2016/england_ks2final.csv

The dataset has 16316 rows of data, so there appear to be far more KS2 school records than KS4.  Again there are a large number of columns and a lot of codes to look up.  There are also a number of `NA` and `NP` values that could be missing data.  As well as this results data, I will need to find and import the relevant metadata file.

In [None]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks2final \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/england_ks2final.csv

In [None]:
# open the imported collection
ks2results = db.ks2final

In [None]:
# check the number of imported documents matches the number of rows in the original file (16316)
ks2results.count_documents({})

In [None]:
# great, now have a look at one
ks2results.find_one()

In [None]:
!head -5 data/2015-2016/ks2_meta.csv

This looks far more organised than the KS4 metadata file, so I should be able to import it directly into MongoDB.

In [None]:
!/usr/bin/mongoimport --port 27351 --drop --db schools_db --collection ks2_labels \
    --type csv --headerline --ignoreBlanks \
    --file data/2015-2016/ks2_meta.csv

In [None]:
ks4_expanded_name

In [None]:
[(k, ks4_expanded_name[k]) for k in ks4_expanded_name if 'AVERAGE' in k]

# Q1 - KS4 Investigation
<a name="q1"></a>

## Does the type of school impact the overall academic performance results of students at KS4?

# Application of Machine Learning
<a name="machine_learning"></a>

# Q2 - KS2-KS4 Investigation
<a name="q2"></a>

## Do top-performing schools at KS2 deliver similarly good or better results at KS4?

# Cleanup: remove the database

Uncomment the lines below to remove the MongoDB database created in the investigation.

In [None]:
# uncomment to remove the database if needed
# client.drop_database('schools_db')
# client.list_database_names()