# EMA Project Diary

# Initial look at the KS4 dataset

Let's have a quick look at the dataset we will be using for the EMA.

In [37]:
!head -5 'data/2015-2016/england_ks4final.csv'

﻿RECTYPE,ALPHAIND,LEA,ESTAB,URN,SCHNAME,SCHNAME_AC,ADDRESS1,ADDRESS2,ADDRESS3,TOWN,PCODE,TELNUM,CONTFLAG,ICLOSE,NFTYPE,RELDENOM,ADMPOL,EGENDER,FEEDER,TABKS2,TAB1618,AGERANGE,CONFEXAM,TOTPUPS,NUMBOYS,NUMGIRLS,TPUP,BPUP,PBPUP,GPUP,PGPUP,KS2APS,TPRIORLO,PTPRIORLO,TPRIORAV,PTPRIORAV,TPRIORHI,PTPRIORHI,TFSM6CLA1A,PTFSM6CLA1A,TNOTFSM6CLA1A,PTNOTFSM6CLA1A,TEALGRP2,PTEALGRP2,TEALGRP1,PTEALGRP1,TEALGRP3,PTEALGRP3,TNMOB,PTNMOB,SENSE4,PSENSE4,SENAPK4,PSENAPK4,TOTATT8,ATT8SCR,TOTATT8ENG,ATT8SCRENG,TOTATT8MAT,ATT8SCRMAT,TOTATT8EBAC,ATT8SCREBAC,TOTATT8OPEN,ATT8SCROPEN,TOTATT8OPENG,ATT8SCROPENG,TOTATT8OPENNG,ATT8SCROPENNG,P8PUP,P8MEACOV,P8MEA,P8CILOW,P8CIUPP,P8MEAENG,P8MEAENG_CILOW,P8MEAENG_CIUPP,P8MEAMAT,P8MEAMAT_CILOW,P8MEAMAT_CIUPP,P8MEAEBAC,P8MEAEBAC_CILOW,P8MEAEBAC_CIUPP,P8MEAOPEN,P8MEAOPEN_CILOW,P8MEAOPEN_CIUPP,PTL2BASICS_LL_PTQ_EE,PTL2BASICS_3YR_PTQ_EE,TEBACC_E_PTQ_EE,PTEBACC_E_PTQ_EE,PTEBACC_PTQ_EE,TEBACENG_E_PTQ_EE,PTEBACENG_E_PTQ_EE,TEBACMAT_E_PTQ_EE,PTEBACMAT_E_PTQ_EE,TEBAC2SCI_E_PTQ_EE,PT

In [38]:
!wc -l 'data/2015-2016/england_ks4final.csv'

5489 data/2015-2016/england_ks4final.csv


The file has 5489 lines (including the header row). There are a large number of columns, and many of them are codes that I'll need to look up. There are also a number of `NA` and `NP` values that could indicate missing data. I want to have a quick look at the dataset to determine which columns will be most relevant to my investigation, so I will import it into MongoDB to explore further.
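
Before importing anything, the column count and the number of `NA`/`NP` markers can be checked directly in pandas. This is only a minimal sketch: the inline sample stands in for `england_ks4final.csv`, and the sample rows and the `count_markers` helper are made up for illustration.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real KS4 file.
sample_csv = io.StringIO(
    "RECTYPE,URN,SCHNAME,ATT8SCR,P8MEA\n"
    "1,100001,School A,52.1,0.12\n"
    "1,100002,School B,NP,NP\n"
    "2,100003,School C,NA,NA\n"
)

def count_markers(df, markers=('NA', 'NP')):
    """Return, per column, how many cells hold one of the marker strings."""
    return df.isin(list(markers)).sum()

# dtype=str and keep_default_na=False keep 'NA' as literal text
# instead of letting pandas turn it into NaN on import.
raw_df = pd.read_csv(sample_csv, dtype=str, keep_default_na=False)
n_rows, n_cols = raw_df.shape
marker_counts = count_markers(raw_df)

print(n_rows, n_cols)
print(marker_counts[marker_counts > 0])
```

Reading everything as strings is deliberate here: it shows the markers exactly as they appear in the file, before any cleaning decisions have been made.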

# Importing the datasets into memory

This section is adapted from the OU team's tma02_question2b-pd file.  I'll reuse it here to start importing the KS4 data and then export it into MongoDB for easy access later on.

In [39]:
# import the required libraries
import pandas as pd
import scipy.stats

## Import the LEA data

In [40]:
leas_df = pd.read_csv('data/2015-2016/la_and_region_codes_meta.csv')
leas_df.head()

Unnamed: 0,LEA,LA Name,REGION,REGION NAME
0,841,Darlington,1,North East A
1,840,County Durham,1,North East A
2,805,Hartlepool,1,North East A
3,806,Middlesbrough,1,North East A
4,807,Redcar and Cleveland,1,North East A


# Import the KS2 data

Again, this section has been adapted from the tma02 file to import the KS2 data.

In [41]:
ks2cols = pd.read_csv('data/2015-2016/ks2_meta.csv')
# clean the strings
ks2cols['Field Name'] = ks2cols['Field Name'].str.strip()
ks2cols.head()

Unnamed: 0,Column,Field Name,Label/Description
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,2,ALPHAIND,Alphabetic index
2,3,LEA,Local authority number
3,4,ESTAB,Establishment number
4,5,URN,School unique reference number


In [42]:
ks2cols.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Data columns (total 3 columns):
Column               259 non-null int64
Field Name           259 non-null object
Label/Description    259 non-null object
dtypes: int64(1), object(2)
memory usage: 6.1+ KB


In [43]:
ks2cols = ks2cols[['Field Name', 'Label/Description']]
ks2cols.head()

Unnamed: 0,Field Name,Label/Description
0,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,ALPHAIND,Alphabetic index
2,LEA,Local authority number
3,ESTAB,Establishment number
4,URN,School unique reference number


# Import the KS4 and KS2 metadata

Most of the field names are given in the `ks4_meta.csv` file, so we will use it to decipher the codes held in the main dataset.

In [44]:
ks4cols = pd.read_csv('data/2015-2016/ks4_meta.csv')
# clean the strings
ks4cols['Metafile heading'] = ks4cols['Metafile heading'].str.strip()
ks4cols.head()

Unnamed: 0,Column,Metafile heading,Metafile description,Methodology changes,Null field for special schools,Null field for local authority records,Null field for National (all schools) records,Null field for National (maintained schools) records,Unnamed: 8,Unnamed: 9
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...,,,,,,,
1,2,ALPHAIND,Alphabetic sorting index,,,Yes,Yes,Yes,,
2,3,LEA,Local authority code (see separate list of loc...,,,,Yes,Yes,,
3,4,ESTAB,Establishment number,,,Yes,Yes,Yes,,
4,5,URN,School Unique Reference Number,,,Yes,Yes,Yes,,


The `ks4_meta.csv` file has more columns than the `ks2_meta.csv` file.  For my needs (expanding codes) the extra columns are not needed, so I can drop them.  I'll also rename the remaining columns to match the KS2 metadata.

In [45]:
ks4cols = ks4cols[['Metafile heading', 'Metafile description']]
ks4cols.columns = ['Field Name', 'Label/Description']
ks4cols.head()

Unnamed: 0,Field Name,Label/Description
0,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,ALPHAIND,Alphabetic sorting index
2,LEA,Local authority code (see separate list of loc...
3,ESTAB,Establishment number
4,URN,School Unique Reference Number


In [46]:
# compare the number of rows in the KS4 and KS2 metadata
len(ks4cols), len(ks2cols)

(372, 259)

In [53]:
ks2cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Data columns (total 2 columns):
Field Name           259 non-null object
Label/Description    259 non-null object
dtypes: object(2)
memory usage: 4.1+ KB


In [48]:
ks4cols.head()

Unnamed: 0,Field Name,Label/Description
0,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,ALPHAIND,Alphabetic sorting index
2,LEA,Local authority code (see separate list of loc...
3,ESTAB,Establishment number
4,URN,School Unique Reference Number


In [49]:
ks4cols.iloc[258]

Field Name                                          SCIVALOW_AV_PTQ_EE
Label/Description    Lower 95% confidence limit for English Baccala...
Name: 258, dtype: object

In [50]:
ks2cols.iloc[258]

Field Name                                                     PSENELN
Label/Description    Percentage of eligible pupils with SEN (Specia...
Name: 258, dtype: object

Merging the two metadata files.


In [96]:
labels_df = pd.concat([ks2cols, ks4cols])
labels_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 632 entries, 0 to -1
Data columns (total 2 columns):
Field Name           632 non-null object
Label/Description    632 non-null object
dtypes: object(2)
memory usage: 14.8+ KB


In [97]:
labels_df.sort_values('Field Name')

Unnamed: 0,Field Name,Label/Description
308,AC5EM13,Percentage of pupils achieving 5+ A*-C or equi...
309,AC5EM14_PTQ,Percentage of pupils achieving 5+ A*-C or equi...
310,AC5EM15_PTQ_EE,Percentage of pupils achieving 5+ A*-C or equi...
311,AC5EM16_PTQ_EE,Percentage of pupils achieving 5+ A*-C or equi...
7,ADDRESS1,School address (1)
6,ADDRESS1,School address (1)
8,ADDRESS2,School address (2)
7,ADDRESS2,School address (2)
9,ADDRESS3,School address (3)
8,ADDRESS3,School address (3)


In [98]:
labels_df['Field Name'].nunique(), len(labels_df['Field Name'])

(603, 632)

There appear to be a number of duplicates in the dataframe: only 603 of the 632 field names are unique, so 29 are repeats.  I'll remove those, then store the result as a labels collection for easy access later on.
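
Before dropping the repeats, it may be worth checking whether the KS2 and KS4 files describe a shared field in the same words, since `drop_duplicates` keeps only the first copy. A minimal sketch on made-up stand-ins for `ks2cols` and `ks4cols` (the `ATT8SCR` description is invented for the example):

```python
import pandas as pd

# Made-up stand-ins for ks2cols and ks4cols.
ks2_labels = pd.DataFrame({
    'Field Name': ['RECTYPE', 'ALPHAIND', 'PSENELN'],
    'Label/Description': ['Record type', 'Alphabetic index',
                          'Percentage of eligible pupils with SEN'],
})
ks4_labels = pd.DataFrame({
    'Field Name': ['RECTYPE', 'ALPHAIND', 'ATT8SCR'],
    'Label/Description': ['Record type', 'Alphabetic sorting index',
                          'Average Attainment 8 score per pupil'],
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([ks2_labels, ks4_labels], ignore_index=True)

# keep=False marks every copy of a repeated field name, not just the later ones.
dupes = combined[combined.duplicated(subset='Field Name', keep=False)]

# Field names whose descriptions differ between the two sources.
n_labels = dupes.groupby('Field Name')['Label/Description'].nunique()
conflicting = sorted(n_labels[n_labels > 1].index)

# Same de-duplication as in the diary: the first (KS2) description wins.
deduped = combined.drop_duplicates(subset='Field Name', keep='first')

print(conflicting)
print(len(deduped))
```

In the real metadata, `ALPHAIND` is one such case: the KS2 file says "Alphabetic index" and the KS4 file says "Alphabetic sorting index".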

In [113]:
labels_df = labels_df.drop_duplicates(subset='Field Name')
labels_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 603 entries, 0 to 371
Data columns (total 2 columns):
Field Name           603 non-null object
Label/Description    603 non-null object
dtypes: object(2)
memory usage: 14.1+ KB


In [114]:
test = db.test

In [116]:
for index, row in labels_df.iterrows():
    test.insert_one({'field': row['Field Name'],
                     'label':row['Label/Description']})
test.find_one()

{'_id': ObjectId('5afdc2df0fd01f3d5eff23f1'),
 'field': 'RECTYPE',
 'label': 'Record type (1=mainstream school; 2=special school; 3=Local Authority; 4=National (all schools); 5=National (maintained schools))'}
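
Inserting one row at a time works, but the same documents could also be built in one go and sent with a single `insert_many` call. The sketch below is one possible approach, not the method used above. `labels_to_docs` and `insert_labels` are hypothetical helpers, and `collection` is assumed to be a pymongo collection such as `db.test`.

```python
import pandas as pd

def labels_to_docs(labels_df):
    """Turn the labels DataFrame into a list of {'field', 'label'} dicts."""
    renamed = labels_df.rename(columns={'Field Name': 'field',
                                        'Label/Description': 'label'})
    return renamed[['field', 'label']].to_dict('records')

def insert_labels(collection, labels_df):
    """Insert every label in one call; `collection` is assumed to be a
    pymongo collection (anything with an insert_many method will do)."""
    docs = labels_to_docs(labels_df)
    if docs:  # insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)

# Small made-up example of the document shape that would be inserted.
example_df = pd.DataFrame({
    'Field Name': ['RECTYPE', 'URN'],
    'Label/Description': ['Record type', 'School unique reference number'],
})
example_docs = labels_to_docs(example_df)
print(example_docs[0])
```

With a live connection the call would be `insert_labels(db.test, labels_df)`.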

Let's tidy up the `RECTYPE` codes.

In [120]:
desc = test.find_one({'field': 'RECTYPE'})
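
One way the tidying could go is to split the `RECTYPE` label into a code-to-meaning dictionary. A minimal sketch, assuming the label text has the form shown in the `find_one()` output above; the `parse_codes` helper is hypothetical.

```python
# Label text copied from the RECTYPE document returned by find_one().
label = ('Record type (1=mainstream school; 2=special school; '
         '3=Local Authority; 4=National (all schools); '
         '5=National (maintained schools))')

def parse_codes(label_text):
    """Split 'Name (1=a; 2=b; ...)' into {1: 'a', 2: 'b', ...}."""
    # Take everything between the first '(' and the last ')'.
    inner = label_text[label_text.index('(') + 1:label_text.rindex(')')]
    codes = {}
    for part in inner.split(';'):
        code, _, meaning = part.strip().partition('=')
        codes[int(code)] = meaning
    return codes

rectype_codes = parse_codes(label)
print(rectype_codes)
```

With the stored document, the call would be `parse_codes(desc['label'])`.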

Great, now I'll save the labels to a CSV file and put them into a MongoDB collection for easy access later on.


In [88]:
!mkdir -p 'data/dcs283'

In [89]:
!ls 'data/'

1279924960.csv
2015-2016
2016-17_Pupil_premium_School_level_allocations.xlsx
codepo_gb.zip
Data
dcs283
Doc
Performancetables_150340.zip
Performancetables_150345.zip
Pupil_Premium_final_allocations_2015_to_2016_School_table.xlsx
SFR27_2016_Main_Tables.xlsx
SR63_2016_Tables.xlsx


In [92]:
labels.to_csv('data/dcs283/labels.csv')

In [106]:
!/usr/bin/mongoimport --port 27351 --drop --db schools --collection labels \
    --type csv --headerline --ignoreBlanks \
    --file data/dcs283/labels.csv

2018-05-17T17:51:34.513+0000	connected to: localhost:27351
2018-05-17T17:51:34.513+0000	dropping: schools.labels
2018-05-17T17:51:34.531+0000	imported 603 documents


In [108]:
import pymongo
import collections

In [109]:
# open a connection to the mongodb
client = pymongo.MongoClient('mongodb://localhost:27351')

In [110]:
db = client.schools
ks4 = db.ks4
labels = db.labels

In [111]:
ks4.find_one()

In [112]:
labels.find_one()

{'': 0,
 'Field Name': 'RECTYPE',
 'Label/Description': 'Record type (1=mainstream school; 2=special school; 3=Local Authority; 4=National (all schools); 5=National (maintained schools))',
 '_id': ObjectId('5afdc126b70b0769c01d8ce1')}

It looks like there is still a little tidying up to do.
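
The empty-string key (`'': 0`) in the imported document most likely comes from the DataFrame index, which `to_csv` writes as an unnamed first column by default. A minimal sketch of the difference `index=False` makes, using a one-row made-up frame and in-memory buffers instead of `data/dcs283/labels.csv`:

```python
import io
import pandas as pd

example_labels = pd.DataFrame(
    {'Field Name': ['RECTYPE'], 'Label/Description': ['Record type']})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
example_labels.to_csv(with_index)

# index=False leaves the index out of the file altogether.
without_index = io.StringIO()
example_labels.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

Re-exporting with `index=False` before running `mongoimport` should therefore give documents without the blank key.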

Clean the values (as in the tma02 file import).

In [None]:
# Some columns contain integers, but pandas will treat any numeric column
# with missing (`NA`) values as `float64`, due to NumPy's number type hierarchy.
# Adapted from the tma02 import.
def get_int_cols(df):
    int_cols = [c for c in df['Field Name'] 
                if c.startswith('T') 
                if c not in ['TOWN', 'TELNUM', 'TKS1AVERAGE']]
    int_cols += ['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'URN_AC', 'ICLOSE']
    int_cols += ['READ_AVERAGE', 'GPS_AVERAGE', 'MAT_AVERAGE']
    return int_cols
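
The float promotion mentioned in the comment above is easy to see on a toy column. The sketch also shows the nullable `Int64` dtype as one possible way to keep integers alongside missing values. It assumes a pandas release that includes the nullable integer type, which may be newer than the one used for this diary.

```python
import pandas as pd

# A column of whole numbers with one missing value.
with_gap = pd.Series([10, None, 30])
print(with_gap.dtype)   # float64: the gap forces a float column

# The nullable integer dtype (capital 'I') keeps the values as integers
# and stores the gap as <NA>.
nullable = with_gap.astype('Int64')
print(nullable.dtype)
print(nullable.isna().sum())
```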

In [None]:
ks2_int_cols = get_int_cols(ks2cols)
ks4_int_cols = get_int_cols(ks4cols)

(len(ks2_int_cols), len(ks4_int_cols))

In [None]:
# Some columns contain percentages. We'll convert these to floating point numbers on import.
#
# Note that we also need to handle markers such as `SUPP`, `NEW`, `LOWCOV` and `NA`
# in the data; p2f maps these (and empty strings) to 0.0.

def p2f(x):
    if x.strip('%').isnumeric():
        return float(x.strip('%'))/100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x
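
Two properties of `p2f` are worth keeping in mind. `str.isnumeric()` is false for any string containing a decimal point, so a value like `45.5%` would come back unchanged. Suppressed values also become 0.0, which cannot be told apart from a genuine 0%. The sketch below is a hypothetical variant, `p2f_nan`, that handles both cases differently; it is not the tma02 function, and whether NaN suits the later analysis better than 0.0 is an assumption.

```python
import math

# Same marker strings that p2f treats as "no value".
MISSING_MARKERS = {'SUPP', 'NEW', 'LOWCOV', 'NA', ''}

def p2f_nan(x):
    """Hypothetical variant of p2f: NaN for markers, decimals accepted."""
    text = x.strip()
    if text in MISSING_MARKERS:
        return math.nan
    try:
        # float() accepts '45' as well as '45.5'
        return float(text.rstrip('%')) / 100
    except ValueError:
        # Not a number (e.g. a postcode): hand the original value back.
        return x

print(p2f_nan('45%'), p2f_nan('12.5%'), p2f_nan('SUPP'), p2f_nan('AB1 2CD'))
```

It can be passed to `converters` in `pd.read_csv` the same way as `p2f`.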

In [None]:
# These are the columns to try to convert from percentages. Note that we can be generous here, as columns like 
# PCODE (postcode) will return the original value if the conversion fails.

percent_cols = [f for f in ks2cols['Field Name'] if f.startswith('P')]
percent_cols += ['WRITCOV', 'MATCOV', 'READCOV'] 
percent_cols += ['PTMAT_HIGH', 'PTREAD_HIGH', 'PSENELSAPK', 'PSENELK', 'PTGPS_HIGH']
percent_converters = {c: p2f for c in percent_cols}

ks2_df = pd.read_csv('data/2015-2016/england_ks2final.csv', 
                   na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''],
                   converters=percent_converters)



In [None]:
# Drop the summary rows, keeping just the rows for mainstream and special schools.

ks2_df = ks2_df[ks2_df['RECTYPE'].isin([1, 2])]


In [None]:
# Convert everything to numbers, if possible.

ks2_df = ks2_df.apply(pd.to_numeric, errors='ignore')
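
As far as I know, newer pandas releases warn that `errors='ignore'` is deprecated in `pd.to_numeric`. A minimal sketch of an equivalent column-by-column conversion that avoids the option; `to_numeric_if_possible` is a hypothetical helper and the frame is made up.

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return the column converted to numbers, or unchanged if that fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

example = pd.DataFrame({
    'URN': ['100001', '100002'],
    'SCHNAME': ['School A', 'School B'],
    'PCT': ['0.5', '0.25'],
})

# apply() calls the helper once per column.
converted = example.apply(to_numeric_if_possible)
print(converted.dtypes)
```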

In [None]:
# Merge the LEA data into the school data
ks2_df = pd.merge(ks2_df, leas_df, on=['LEA'])
ks2_df.head().T
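
`pd.merge` defaults to an inner join, so any school whose `LEA` code is missing from the lookup table would silently drop out of `ks2_df`. A minimal sketch of how this could be checked, using made-up frames (LEA code 999 is invented so that one row fails to match):

```python
import pandas as pd

schools = pd.DataFrame({'URN': [1, 2, 3], 'LEA': [841, 840, 999]})
leas = pd.DataFrame({'LEA': [841, 840],
                     'LA Name': ['Darlington', 'County Durham']})

# Default how='inner': the school with the unknown LEA code disappears.
inner = pd.merge(schools, leas, on=['LEA'])

# A left join keeps every school; indicator=True adds a '_merge' column,
# and validate raises if an LEA code appears twice in the lookup table.
checked = pd.merge(schools, leas, on=['LEA'], how='left',
                   validate='many_to_one', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(inner), len(checked))
print(unmatched['URN'].tolist())
```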