# OpenRefine

## Cleaning england_ks4final.csv

### 1. Import parameters
I imported `england_ks4final.csv` with the following parameters.

![ks4 import parameters](img/or_ks4_001.png)


### 2. Removing the non-mainstream schools

The questions posed only concern mainstream schools, so we can reduce the file size by keeping only the rows where `RECTYPE` is 1.

This can be done by running a text facet on the `RECTYPE` column and removing all non-matching rows.

![ks4 remove non-mainstream schools](img/or_ks4_002.png)

This removes _1293_ rows from the file.

### 3. Removing schools that are closed

Any schools with the `ICLOSE` flag set to 1 can also be removed:
    
![ks4 remove closed schools](img/or_ks4_003.png)
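
For readers who would rather script steps 2 and 3, the two row filters could be sketched in pandas roughly as follows. This is not what was done in OpenRefine above. The function name `keep_open_mainstream` and the sample rows are made up for illustration, and the sketch assumes that a blank `ICLOSE` value means the school is open.

```python
import pandas as pd

def keep_open_mainstream(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where RECTYPE is 1 and ICLOSE is not 1."""
    # to_numeric lets the filter work whether the columns were read
    # as numbers or as text; unparseable values become NaN.
    rectype = pd.to_numeric(df["RECTYPE"], errors="coerce")
    iclose = pd.to_numeric(df["ICLOSE"], errors="coerce")
    # Assumption: a blank ICLOSE (NaN) is treated as "not closed".
    mask = (rectype == 1) & (iclose != 1)
    return df.loc[mask].reset_index(drop=True)

# Tiny made-up frame for illustration only (not real school data).
sample = pd.DataFrame({
    "URN": [100, 101, 102, 103],
    "RECTYPE": [1, 1, 2, 5],
    "ICLOSE": [0, 1, 0, 0],
})
filtered = keep_open_mainstream(sample)
print(filtered)
```

Only the first sample row survives: the second is flagged as closed and the last two have a `RECTYPE` other than 1.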
    

### 4. Removing columns that will not be used in the investigation

Next we can remove any columns that are definitely not needed. With well over 300 columns, the simplest approach is to export the columns we want to CSV and then open a new OpenRefine project with this subset of the data. The file is saved to `data/2015-2016/ks4_clean.csv`.
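
The same column subsetting could also be done in pandas rather than through the OpenRefine export dialog. A minimal sketch, assuming a hypothetical short list `wanted` and an in-memory stand-in for the wide file (the real selection and file are much larger):

```python
import io
import pandas as pd

# Hypothetical short list; the real selection is much longer.
wanted = ["URN", "SCHNAME", "PTPRIORLO"]

# Made-up stand-in for the wide source file.
raw = io.StringIO(
    "URN,SCHNAME,PTPRIORLO,UNUSED_A,UNUSED_B\n"
    "100,Example School,12%,x,y\n"
)

# usecols reads only the listed columns; dtype=str leaves values
# such as "12%" untouched for a later cleaning step.
subset = pd.read_csv(raw, usecols=wanted, dtype=str)

out = io.StringIO()  # a real run would pass a file path instead
subset.to_csv(out, index=False)
print(out.getvalue())
```

Reading everything as text at this stage keeps the subsetting step separate from the later percentage conversion.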

After exporting the file I opened a new project from the CSV file just created. The next step is to clean up the percentages, as they are currently stored as strings.


### 5. Converting the % strings to numbers

In [12]:
import pandas as pd

In [21]:
ks4_df = pd.read_csv('data/2015-2016/ks4_clean_reduced.tsv', delimiter='\t')
ks4_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4131 entries, 0 to 4130
Data columns (total 24 columns):
ALPHAIND                 4131 non-null int64
LEA                      4131 non-null int64
ESTAB                    4131 non-null int64
URN                      4131 non-null int64
SCHNAME                  4131 non-null object
SCHNAME_AC               4131 non-null object
NFTYPE                   4131 non-null object
TABKS2                   4131 non-null int64
PTPRIORLO                3930 non-null object
PTPRIORAV                3930 non-null object
PTPRIORHI                3930 non-null object
PTL2BASICS_LL_PTQ_EE     3930 non-null object
PTEBACC_15_PTQ_EE        3749 non-null object
PTL2BASICS_3YR_PTQ_EE    3930 non-null object
PTAC5EM_PTQ_EE           3930 non-null object
ATT8SCR                  4131 non-null object
ATT8SCRENG               4131 non-null object
ATT8SCRMAT               4131 non-null object
ATT8SCREBAC              4131 non-null object
ATT8SCROPENG            

The columns I will be using are:

In [11]:
cols = ['LEA', 'ESTAB', 'URN', 'SCHNAME', 'SCHNAME_AC', 'NFTYPE',
 'TABKS2', 'PTPRIORLO', 'PTPRIORAV', 'PTPRIORHI', 'ATT8SCR',
 'ATT8SCRENG', 'ATT8SCRMAT', 'ATT8SCREBAC', 'ATT8SCROPENG',
 'PTL2BASICS_LL_PTQ_EE', 'PTL2BASICS_3YR_PTQ_EE', 'ATT8SCR_AV',
 'ATT8SCR_LO', 'ATT8SCR_HI', 'PTEBACC_15_PTQ_EE', 'PTAC5EM_PTQ_EE',
 ]
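
A hand-typed column list is easy to get subtly wrong: Python silently joins two adjacent string literals when the comma between them is missing. One possible guard is to compare the list against the frame before using it. The helper name `missing_columns` and the demo data below are made up for illustration.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, wanted: list) -> list:
    """Return the names in `wanted` that are not columns of `df`."""
    return [name for name in wanted if name not in df.columns]

# Made-up frame and list; "B" "C" lacks a comma, so Python joins
# the two literals into the single name "BC".
demo = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
wanted = ["A", "B" "C"]
print(missing_columns(demo, wanted))  # prints ['BC']
```

An empty result means every requested name exists in the frame.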

### OpenRefine editing steps taken to convert the percentages from strings to numbers
Expand this section to see the JSON extract if needed.

In [None]:
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTPRIORLO using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NP",
                "l": "NP"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTPRIORLO",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTPRIORLO"
        }
      ]
    },
    "columnName": "PTPRIORLO",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTPRIORAV using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NP",
                "l": "NP"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTPRIORAV",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTPRIORAV"
        }
      ]
    },
    "columnName": "PTPRIORAV",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTPRIORHI using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NP",
                "l": "NP"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTPRIORHI",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTPRIORHI"
        }
      ]
    },
    "columnName": "PTPRIORHI",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTL2BASICS_LL_PTQ_EE using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NE",
                "l": "NE"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTL2BASICS_LL_PTQ_EE",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTL2BASICS_LL_PTQ_EE"
        }
      ]
    },
    "columnName": "PTL2BASICS_LL_PTQ_EE",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTL2BASICS_3YR_PTQ_EE using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NE",
                "l": "NE"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTL2BASICS_3YR_PTQ_EE",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTL2BASICS_3YR_PTQ_EE"
        }
      ]
    },
    "columnName": "PTL2BASICS_3YR_PTQ_EE",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-reorder",
    "description": "Reorder columns",
    "columnNames": [
      "ALPHAIND",
      "LEA",
      "ESTAB",
      "URN",
      "SCHNAME",
      "SCHNAME_AC",
      "NFTYPE",
      "TABKS2",
      "PTPRIORLO",
      "PTPRIORAV",
      "PTPRIORHI",
      "PTL2BASICS_LL_PTQ_EE",
      "PTEBACC_15_PTQ_EE",
      "PTL2BASICS_3YR_PTQ_EE",
      "PTAC5EM_PTQ_EE",
      "ATT8SCR",
      "ATT8SCRENG",
      "ATT8SCRMAT",
      "ATT8SCREBAC",
      "ATT8SCROPENG",
      "ATT8SCR_LO",
      "ATT8SCR_AV",
      "ATT8SCR_HI",
      "ATT8SCR_15"
    ]
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTEBACC_15_PTQ_EE using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NA",
                "l": "NA"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTEBACC_15_PTQ_EE",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTEBACC_15_PTQ_EE"
        }
      ]
    },
    "columnName": "PTEBACC_15_PTQ_EE",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PTAC5EM_PTQ_EE using expression grel:toNumber(value.replace('%',''))/100.0",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "value",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": "NE",
                "l": "NE"
              }
            },
            {
              "v": {
                "v": "SUPP",
                "l": "SUPP"
              }
            }
          ],
          "selectError": false,
          "invert": true,
          "name": "PTAC5EM_PTQ_EE",
          "omitBlank": false,
          "type": "list",
          "columnName": "PTAC5EM_PTQ_EE"
        }
      ]
    },
    "columnName": "PTAC5EM_PTQ_EE",
    "expression": "grel:toNumber(value.replace('%',''))/100.0",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
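
The GREL expression `toNumber(value.replace('%',''))/100.0` used in the steps above has a rough pandas counterpart. The sketch below is an assumption about how one might do it, not part of the OpenRefine history; `percent_to_fraction` is a hypothetical helper and the sample values are made up. Unlike the OpenRefine steps, which leave codes such as `NP`, `NE`, `NA` and `SUPP` as text, this version turns them into `NaN` so the column ends up numeric.

```python
import pandas as pd

def percent_to_fraction(col: pd.Series) -> pd.Series:
    """Convert strings like '12%' to 0.12; anything else becomes NaN."""
    text = col.astype(str).str.strip()
    # Strip the literal percent sign (regex=False: plain substring).
    cleaned = text.str.replace("%", "", regex=False)
    # errors="coerce" maps codes such as "SUPP" or "NP" to NaN.
    numbers = pd.to_numeric(cleaned, errors="coerce")
    return numbers / 100.0

# Made-up values for illustration only.
sample = pd.Series(["12%", "45.5%", "SUPP", "NP", None])
result = percent_to_fraction(sample)
print(result)
```

Applied to a frame, this would be called once per percentage column, for example `df[name] = percent_to_fraction(df[name])`.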


- Secondary accountability measures (PDF): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/696998/Secondary_accountability-measures.pdf
- OpenRefine recipes: https://github.com/OpenRefine/OpenRefine/wiki/Recipes