# Sample import of KS2 data into a PostgreSQL database

This notebook creates a new PostgreSQL database and imports the KS2 performance data. We clean and reformat the data during the import, because Postgres is strict about the data it accepts.

In [1]:
# Import the required libraries

import numpy as np
import pandas as pd
import scipy.stats

import psycopg2 as pg
import pandas.io.sql as psqlg

import csv

In [2]:
!sudo -u postgres psql -U postgres -c "drop database if exists ema17j"

DROP DATABASE


In [3]:
!sudo -u postgres psql -U postgres -c "create database ema17j"

CREATE DATABASE


In [4]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/ema17j

'Connected: test@ema17j'

## Import LEA data
This is a list of LEA codes and names, as used in the KS2 performance data.

In [5]:
leas = pd.read_csv('data/2015-2016/la_and_region_codes_meta.csv')
leas.head().T

Unnamed: 0,0,1,2,3,4
LEA,841,840,805,806,807
LA Name,Darlington,County Durham,Hartlepool,Middlesbrough,Redcar and Cleveland
REGION,1,1,1,1,1
REGION NAME,North East A,North East A,North East A,North East A,North East A


In [6]:
table_string = """
drop table if exists leas cascade;

create table leas (
    lea integer,
    la_name varchar,
    region integer,
    region_name varchar,
    primary key (lea)
);
"""
%sql $table_string

Done.
Done.


[]

Resave the data as a tab-separated values (TSV) file, which is easier for Postgres to import.

In [7]:
leas.to_csv('data/leas.tsv', index=False, na_rep='NULL', header=False, sep="\t") 
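To see what these `to_csv` options produce, the sketch below writes a tiny made-up frame to text instead of a file. The rows are illustrative only (shaped like the LEA table), and the missing region is an assumption added to show how `na_rep` behaves.

```python
import pandas as pd

# Made-up rows in the shape of the LEA table; the last region is missing.
demo = pd.DataFrame({
    "LEA": [841, 840],
    "LA Name": ["Darlington", "County Durham"],
    "REGION": [1, None],
})

# With no path argument, to_csv returns the text instead of writing a file.
tsv_text = demo.to_csv(index=False, header=False, sep="\t", na_rep="NULL")

# Split into fields so the tab separators are easy to see.
rows = [line.split("\t") for line in tsv_text.splitlines()]
print(rows)
```

The missing value is written as the literal text `NULL`, which is the same marker later passed to `copy_from` through its `null` argument. The other region comes out as `1.0` because the gap forces pandas to store that column as floats.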

In [8]:
# open a connection to the PostgreSQL database ema17j
conn = pg.connect(dbname='ema17j', host='localhost', user='test', password='test', port=5432)
# create a cursor
cur = conn.cursor()

In [9]:
with open('data/leas.tsv') as io:
    cur.copy_from(io, 'leas', sep='\t', null="NULL")
conn.commit()

In [10]:
# Uncomment and run this cell if the import fails.
# conn.rollback()
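The copy, commit and rollback steps can also be wrapped in one helper so that a failed import never leaves the connection stuck in an aborted transaction. This is a minimal sketch, not part of the original notebook: `copy_tsv` is a hypothetical name, and it assumes `conn` is a psycopg2 connection like the one opened above.

```python
def copy_tsv(conn, stream, table, null="NULL"):
    """Copy tab-separated rows from an open file into a table.

    Commits if the copy succeeds; rolls back and re-raises if it fails,
    so the connection is left usable either way.
    """
    cur = conn.cursor()
    try:
        cur.copy_from(stream, table, sep="\t", null=null)
    except Exception:
        conn.rollback()
        raise
    conn.commit()

# Possible usage, assuming the connection and file from the cells above:
# with open('data/leas.tsv') as tsv_file:
#     copy_tsv(conn, tsv_file, 'leas')
```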

Check we have all the data by counting the rows in the dataframe and the table.

In [11]:
len(leas)

152

In [12]:
%%sql
SELECT count(*)
FROM leas

1 rows affected.


count
152


Clean up the temporary file.

In [13]:
!rm data/leas.tsv

## KS2 school data
This data needs more processing and cleaning, and we need to be aware of the data types of the columns. This is complicated because pandas stores any column that contains both numbers and missing (`NaN`) values as `float64`, not `int64`: `NaN` is a floating-point value, so the whole column is promoted to floats.
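The promotion to `float64` is easy to see on a tiny example. The sketch below uses two made-up rows (the column names are borrowed from the KS2 file, the values are illustrative only). It also shows pandas' nullable `Int64` type as one possible alternative, which this notebook does not use.

```python
import io
import pandas as pd

# Two made-up rows; the second has no TELIG value.
sample = io.StringIO("URN,TELIG\n100000,28\n100028,\n")
df = pd.read_csv(sample)

urn_dtype = str(df["URN"].dtype)      # no gaps, so the column stays integer
telig_dtype = str(df["TELIG"].dtype)  # one gap, so the column becomes float

# Nullable integer type: keeps whole numbers and marks the gap as <NA>.
telig_nullable = df["TELIG"].astype("Int64")

print(urn_dtype, telig_dtype, telig_nullable.dtype)
```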

Luckily, most of the column names and meanings are provided in the `meta` data file.

In [14]:
ks2cols = pd.read_csv('data/2015-2016/ks2_meta.csv')
ks2cols['Field Name'] = ks2cols['Field Name'].str.strip()
ks2cols

Unnamed: 0,Column,Field Name,Label/Description
0,1,RECTYPE,Record type (1=mainstream school; 2=special sc...
1,2,ALPHAIND,Alphabetic index
2,3,LEA,Local authority number
3,4,ESTAB,Establishment number
4,5,URN,School unique reference number
5,6,SCHNAME,School/Local authority name
6,7,ADDRESS1,School address (1)
7,8,ADDRESS2,School address (2)
8,9,ADDRESS3,School address (3)
9,10,TOWN,School town


Explicitly record the field names that store integers.

In [15]:
int_cols = [c for c in ks2cols['Field Name'] 
            if c.startswith('T')
            if c not in ['TOWN', 'TELNUM', 'TKS1AVERAGE']]
int_cols += ['RECTYPE', 'ALPHAIND', 'LEA', 'ESTAB', 'URN', 'URN_AC', 'ICLOSE']
int_cols += ['READ_AVERAGE', 'GPS_AVERAGE', 'MAT_AVERAGE']

Some columns contain percentages. We'll convert these to floating point numbers on import.

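Note that we also need to handle the marker values that appear in place of figures: `SUPP`, where exact figures have been suppressed, and `NEW`, where this is a new school with no previous data. The function below treats `LOWCOV`, `NA` and empty strings the same way, converting all of these markers to `0.0`.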

In [16]:
def p2f(x):
    if x.strip('%').isnumeric():
        return float(x.strip('%'))/100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x

These are the columns to try to convert from percentages. Note that we can be generous here, as columns like `PCODE` (postcode) will return the original value if the conversion fails.
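To make the fall-through behaviour concrete, here is `p2f` applied to a few made-up values, followed by a hypothetical variant, `percent_to_fraction`, that also accepts decimal percentages. The variant is an assumption for illustration and is not used elsewhere in this notebook. It is motivated by the fact that `str.isnumeric()` is false for strings containing a decimal point, so `p2f` leaves a value like `12.5%` unchanged.

```python
def p2f(x):
    # Same function as in the cell above, repeated so this sketch runs alone.
    if x.strip('%').isnumeric():
        return float(x.strip('%')) / 100
    elif x in ['SUPP', 'NEW', 'LOWCOV', 'NA', '']:
        return 0.0
    else:
        return x

def percent_to_fraction(x, markers=('SUPP', 'NEW', 'LOWCOV', 'NA', '')):
    """Hypothetical variant: like p2f, but also parses values such as '12.5%'."""
    if x in markers:
        return 0.0
    try:
        return float(x.strip('%')) / 100
    except ValueError:
        return x  # not a number (e.g. a postcode), so pass it through

examples = ['85%', 'SUPP', 'NW1 1AA', '12.5%']
original_results = [p2f(v) for v in examples]
variant_results = [percent_to_fraction(v) for v in examples]
print(original_results)
print(variant_results)
```

`float()` is more permissive than `isnumeric()` (it also accepts signs and exponents), so before adopting something like the variant it would be worth checking that no text column contains strings of that form.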

In [17]:
percent_cols = [f for f in ks2cols['Field Name'] if f.startswith('P')]
percent_cols += ['WRITCOV', 'MATCOV', 'READCOV'] 
percent_cols += ['PTMAT_HIGH', 'PTREAD_HIGH', 'PSENELSAPK', 'PSENELK', 'PTGPS_HIGH']
percent_converters = {c: p2f for c in percent_cols}

In [18]:
ks2_df = pd.read_csv('data/2015-2016/england_ks2final.csv', 
                   na_values=['SUPP', 'NEW', 'LOWCOV', 'NA', ''],
                   converters=percent_converters)
ks2_df.head().T

Unnamed: 0,0,1,2,3,4
RECTYPE,4,5,1,3,1
ALPHAIND,,,53372,,11156
LEA,,,201,201,202
ESTAB,,,3614,,3323
URN,,,100000,,100028
SCHNAME,,,Sir John Cass's Foundation Primary School,City of London,"Christ Church Primary School, Hampstead"
ADDRESS1,,,St James's Passage,,Christ Church Hill
ADDRESS2,,,Duke's Place,,
ADDRESS3,,,,,
TOWN,,,London,,London


Drop the summary rows, keeping just the rows for mainstream and special schools.

In [19]:
ks2_df = ks2_df[(ks2_df['RECTYPE'] == 1) | (ks2_df['RECTYPE'] == 2)]

Convert everything to numbers, if possible.

In [20]:
ks2_df = ks2_df.apply(pd.to_numeric, errors='ignore')

Export the data to a tab-separated values file.

In [21]:
ks2_df.to_csv('data/ks2.tsv', index=False, na_rep='NULL', header=False, sep="\t") 

Manually fix the floating-point numbers in the file by dropping the trailing `.0` from values of the form `something.0`. We need to do this because pandas stores integer columns that contain missing values as floats, but we know the values are integers and they'll be stored in Postgres as `integer`s.

In [22]:
!sed -i -e 's/\.0\t/\t/g' data/ks2.tsv
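The same clean-up can be sketched in Python with a regular expression. This is an illustrative alternative on made-up rows, not a drop-in replacement: unlike the `sed` expression, it requires a digit before the `.0` and also handles a `.0` at the end of a line. The name `strip_trailing_point_zero` is hypothetical.

```python
import re

# A ".0" that follows a digit and is followed by a tab or the end of a line.
TRAILING_ZERO = re.compile(r"(?<=\d)\.0(?=\t|$)", re.MULTILINE)

def strip_trailing_point_zero(text):
    """Turn fields like '28.0' into '28', leaving '0.85' and '10.05' alone."""
    return TRAILING_ZERO.sub("", text)

# Made-up rows: whole-number floats, real fractions and NULL markers.
sample = "100000\t28.0\t0.85\tNULL\t3.0\n100028\tNULL\t1.0\t10.05\t7.0\n"
cleaned = strip_trailing_point_zero(sample)
print(cleaned)
```

Either approach also rewrites whole-number values in genuinely floating-point columns (`1.0` becomes `1`), which is harmless because Postgres accepts integer text for a `double precision` column.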

Build the list of field names and types from the DataFrame columns.

In [23]:
pg_fields = []
for c in ks2_df.columns:
    col = c.lower()
    if c in int_cols:
        col += ' integer'
    elif ks2_df[c].dtype == np.int64:
        col += ' integer'
    elif ks2_df[c].dtype == np.float64:
        col += ' double precision'
    else:
        col += ' varchar'
    pg_fields += [col]
pg_fields

['rectype integer',
 'alphaind integer',
 'lea integer',
 'estab integer',
 'urn integer',
 'schname varchar',
 'address1 varchar',
 'address2 varchar',
 'address3 varchar',
 'town varchar',
 'pcode varchar',
 'telnum varchar',
 'urn_ac integer',
 'schname_ac varchar',
 'open_ac double precision',
 'nftype varchar',
 'iclose integer',
 'reldenom varchar',
 'agerange varchar',
 'confexam varchar',
 'tab15 integer',
 'tab1618 integer',
 'totpups integer',
 'tpupyear integer',
 'telig integer',
 'belig double precision',
 'gelig double precision',
 'pbelig double precision',
 'pgelig double precision',
 'tks1average double precision',
 'tks1group_l integer',
 'ptks1group_l double precision',
 'tks1group_m integer',
 'ptks1group_m double precision',
 'tks1group_h integer',
 'ptks1group_h double precision',
 'tks1group_na integer',
 'ptks1group_na double precision',
 'tfsm6cla1a integer',
 'ptfsm6cla1a double precision',
 'tnotfsm6cla1a integer',
 'ptnotfsm6cla1a double precision',
 'tealgr

Create the table.

In [24]:
table_string = """
drop table if exists ks2;

create table ks2 (
    {},
    primary key (urn),
    foreign key (lea) references leas(lea)
);
""".format(', '.join(pg_fields))
%sql $table_string

Done.
Done.


[]
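The type mapping and statement building can be tried out on a small made-up frame without touching the database. In this sketch the `pg_type` helper, the `demo` table name and the sample rows are all assumptions for illustration. It uses `pd.api.types.is_integer_dtype` and `is_float_dtype` rather than comparing against `np.int64` and `np.float64` directly.

```python
import pandas as pd

def pg_type(column, series, int_cols):
    """Pick a PostgreSQL type for one DataFrame column."""
    if column in int_cols or pd.api.types.is_integer_dtype(series):
        return "integer"
    if pd.api.types.is_float_dtype(series):
        return "double precision"
    return "varchar"

# Made-up frame: TELIG has a gap, so pandas stores it as float64.
demo = pd.DataFrame({
    "URN": [100000, 100028],
    "SCHNAME": ["School A", "School B"],
    "TELIG": [28, None],
    "PTMAT_HIGH": [0.25, 0.5],
})
int_cols = ["URN", "TELIG"]  # columns we know hold integers

fields = [f"{c.lower()} {pg_type(c, demo[c], int_cols)}" for c in demo.columns]
create_sql = "create table demo (\n    {},\n    primary key (urn)\n);".format(
    ",\n    ".join(fields))
print(create_sql)
```

Listing `TELIG` in `int_cols` is what overrides its `float64` dtype, which mirrors the role `int_cols` plays for the real KS2 columns.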

In [25]:
%%sql
delete FROM ks2

0 rows affected.


[]

Import the data.

In [26]:
with open('data/ks2.tsv') as io:
    cur.copy_from(io, 'ks2', sep='\t', null="NULL")
conn.commit()

In [27]:
# Uncomment and run the line below if the import fails
# conn.rollback()

Check we have all the data by counting the rows in the dataframe and the table.

In [28]:
len(ks2_df)

16162

In [29]:
%%sql
SELECT count(*)
FROM ks2

1 rows affected.


count
16162


Clean up the temporary file.

In [30]:
!rm data/ks2.tsv

## Using the database
You don't need to re-import the data into the database every time you want to use it. Instead, use these commands to connect to the database from a new notebook.

In [31]:
import pandas as pd
import scipy.stats

import psycopg2 as pg
import pandas.io.sql as psqlg

In [32]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/ema17j

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: test@ema17j'

In [33]:
# open a connection to the PostgreSQL database ema17j
conn = pg.connect(dbname='ema17j', host='localhost', user='test', password='test', port=5432)
# create a cursor
cur = conn.cursor()

## Example queries
Find some schools

In [34]:
%%sql
SELECT URN, SCHNAME
FROM ks2
limit 10

10 rows affected.


urn,schname
100000,Sir John Cass's Foundation Primary School
100028,"Christ Church Primary School, Hampstead"
100029,Christ Church School
130342,Christopher Hatton Primary School
100013,Edith Neville Primary School
100027,Eleanor Palmer Primary School
100030,Emmanuel Church of England Primary School
139837,Abacus Belsize Primary School
100026,Fitzjohn's Primary School
100014,Fleet Primary School


How many schools are there?

In [35]:
%%sql
SELECT COUNT(*)
FROM ks2

1 rows affected.


count
16162


How many schools are there with data on eligible pupils?

In [36]:
%%sql
SELECT COUNT(*)
FROM ks2
WHERE TELIG IS NOT NULL

1 rows affected.


count
15615
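Missing values were imported as `NULL`, so the query above counts the rows where `telig` is not `NULL`. The sketch below illustrates how `NULL` behaves in counts and comparisons, using three made-up rows. It uses Python's built-in `sqlite3` purely so that it runs without a server (an assumption of this example; the notebook itself uses PostgreSQL) and sticks to standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table ks2 (urn integer primary key, telig integer)")
conn.executemany("insert into ks2 values (?, ?)",
                 [(100000, 28), (100028, None), (100029, 30)])

total = conn.execute("select count(*) from ks2").fetchone()[0]
with_telig = conn.execute(
    "select count(*) from ks2 where telig is not null").fetchone()[0]
# count(column) skips NULLs, so it gives the same answer.
counted = conn.execute("select count(telig) from ks2").fetchone()[0]
# A comparison against NULL is never true, so this matches no rows.
never = conn.execute(
    "select count(*) from ks2 where telig = null").fetchone()[0]
conn.close()

print(total, with_telig, counted, never)
```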


Show some schools and their LEA

In [37]:
%%sql
select urn, schname, ks2.lea, leas.la_name
from ks2, leas
where ks2.lea = leas.lea
limit 10

10 rows affected.


urn,schname,lea,la_name
100000,Sir John Cass's Foundation Primary School,201,City of London
100028,"Christ Church Primary School, Hampstead",202,Camden
100029,Christ Church School,202,Camden
130342,Christopher Hatton Primary School,202,Camden
100013,Edith Neville Primary School,202,Camden
100027,Eleanor Palmer Primary School,202,Camden
100030,Emmanuel Church of England Primary School,202,Camden
139837,Abacus Belsize Primary School,202,Camden
100026,Fitzjohn's Primary School,202,Camden
100014,Fleet Primary School,202,Camden
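The query above joins the two tables by listing both in `FROM` and matching them in `WHERE`. The same join can be written with an explicit `JOIN ... ON`, as this sketch shows on made-up school rows. As an assumption of the example, it runs against an in-memory `sqlite3` database rather than PostgreSQL so that it needs no server; the SQL itself is standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table leas (lea integer primary key, la_name varchar);
create table ks2 (urn integer primary key, schname varchar,
                  lea integer references leas(lea));
insert into leas values (201, 'City of London'), (202, 'Camden');
insert into ks2 values
    (100000, 'School A', 201),
    (100028, 'School B', 202),
    (100029, 'School C', 202);
""")

# Implicit join, in the style of the query above.
implicit = conn.execute("""
    select urn, schname, ks2.lea, leas.la_name
    from ks2, leas
    where ks2.lea = leas.lea
    order by urn""").fetchall()

# Explicit join: the matching condition moves into the ON clause.
explicit = conn.execute("""
    select urn, schname, ks2.lea, leas.la_name
    from ks2 join leas on ks2.lea = leas.lea
    order by urn""").fetchall()
conn.close()

print(explicit)
```

Both forms return the same rows; the explicit form keeps the join condition separate from any other filters added to `WHERE`.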
