# Step 2: Explore and Assess the Data
As described in the write-up, this step explores, assesses and cleans the raw data so that it can be used in the ETL pipeline later on.

The main focus of this step is unifying the student exchange data, which is structured differently from year to year, into one CSV file.

The JSON file containing the institutions will also be assessed and cleaned.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)

## Student exchange data
### CSV file for 2008-09
Read data for 2008-09 and assess columns.

In [2]:
students_2008 = pd.read_csv('raw_data/student_data_2008.csv', sep=';')
students_2008.info()
# we can see that the loaded dataframe for 2008 has 32 columns


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198523 entries, 0 to 198522
Data columns (total 32 columns):
HOMEINSTITUTION              198523 non-null object
COUNTRYOFHOMEINSTITUTION     198523 non-null object
AGE                          198523 non-null int64
SEX                          198523 non-null object
NATIONALITY                  198523 non-null object
SUBJECTAREA                  198523 non-null int64
LEVELSTUDY                   198523 non-null object
YEARSPRIOR                   198523 non-null int64
MOBILITYTYPE                 198523 non-null object
HOSTINSTITUTION              168193 non-null object
COUNTRYOFHOSTINSTITUTION     168193 non-null object
WORKPLACEMENT                30336 non-null object
COUNTRYOFWORKPLACEMENT       30375 non-null object
ENTERPRISESIZE               30375 non-null object
TYPEWORKSECTOR               30375 non-null object
LENGTHSTUDYPERIOD            198523 non-null float64
LENGTHWORKPLACEMENT          198523 non-null float64
SHORTDURAT

In [3]:
# This will give us an idea of the content of the different columns
students_2008.head(10)

Unnamed: 0,HOMEINSTITUTION,COUNTRYOFHOMEINSTITUTION,AGE,SEX,NATIONALITY,SUBJECTAREA,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,COUNTRYOFHOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,ENTERPRISESIZE,TYPEWORKSECTOR,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,CONSORTIUMAGREEMENTNUMBER,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,TAUGHTHOSTLANG,LANGUAGETAUGHT,LINGPREPARATION,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION,QUALIFICATIONATHOST
0,D KONSTAN01,DE,25,M,DE,310,1,2,S,UK BATH01,UK,,,,,4.0,0.0,,09-2008,,,27,0,27,0.0,Y,EN,NN,720.0,0.0,N,N
1,D KONSTAN01,DE,24,M,DE,461,2,4,S,F PARIS007,FR,,,,,8.0,0.0,,09-2008,,,63,0,63,0.0,Y,FR,NN,1440.0,0.0,N,N
2,D KONSTAN01,DE,23,M,DE,314,1,2,S,F MARSEIL16,FR,,,,,4.0,0.0,,09-2008,,,15,0,15,0.0,Y,FR,NN,720.0,0.0,N,N
3,D KONSTAN01,DE,24,F,DE,222,1,3,S,E CORDOBA01,ES,,,,,4.0,0.0,,09-2008,,,30,0,30,0.0,Y,ES,HS,720.0,0.0,N,N
4,EE TALLINN04,EE,22,M,EE,34,1,2,S,SF LAHTI11,FI,,,,,4.0,0.0,,09-2008,,,45,0,45,0.0,N,EN,NN,1727.2,0.0,N,N
5,D LEIPZIG01,DE,24,F,DE,220,2,3,S,UK CARDIFF01,UK,,,,,5.0,0.0,,09-2008,,,20,0,20,0.0,Y,EN,NN,1185.0,0.0,N,N
6,EE TALLINN01,EE,25,F,EE,211,2,4,P,,,KULTURWERK BBK BERLIN,DE,M,R,0.0,5.0,,,01-2009,,0,6,6,0.0,N,EN,NN,0.0,2031.84,N,N
7,D LEIPZIG01,DE,24,M,DE,222,2,4,S,F STRASBO01,FR,,,,,9.0,0.0,,09-2008,,,60,0,60,0.0,Y,FR,NN,2133.0,0.0,N,N
8,D LEIPZIG01,DE,23,M,DE,222,1,2,S,B BRUXEL87,BE,,,,,5.0,0.0,,09-2008,,,24,0,24,0.0,Y,FR,NN,1185.0,0.0,N,N
9,EE TARTU02,EE,21,F,EE,345,1,1,P,,,"KREEKA, JOB TRUST",GR,M,I,0.0,3.0,,,06-2009,,0,6,6,0.0,N,EN,NN,0.0,1473.0,N,N


#### Short explanation of the different columns we see in this dataset:
* Home institution: The institution the student comes from
* Country of home institution: redundant information
* Age, Sex, Nationality: Some demographic information of the student
* Subject area: Could be used to find out more about the subject field, out of scope for this project
* Level study, Years prior: Details of the student's prior education and study level
* Mobility type: Indicates type of mobility
* Host institution: The institution the student goes to
* Country of host institution: redundant information
* Work placement, Country of work placement: Name and country of company in case of placement
* Enterprise size, Type of work sector: More details about the work placement company
* Length study period, Length work placement: Duration of study/work placement in months
* Short duration: Reason for program durations < 3 months, only for study types
* Study start date, Placement start date: Start month of study/placement
* Consortium agreement number: filled if placement is administered through consortium, irrelevant for this analysis
* ECTS credits study/work, total ECTS credits: number of anticipated ECTS credits per study/work and in total
* SEV Supplement: Grant awarded in Euro for special needs
* Taught host language, language taught, ling preparation: information about language of training, out of scope for this project
* Study/Placement grant: Grant awarded in Euro, not including grants for special needs
* Previous participation: indicates whether the student had received an ERASMUS grant prior to this one
* Qualification at host: information on whether the student will get a double, joint degree or any other qualification at the host institution

#### Columns used to answer the analytical questions
This dataset offers many opportunities to analyze Erasmus student exchanges and work placements.
Although all columns would enable interesting analyses, only a subset of the columns explored above will be used to answer the analytical questions stated in the project scope description:
* Home institution
* Age
* Sex
* Nationality
* Level study
* Years prior
* Mobility type
* Host institution
* Work placement
* Country of work placement
* Length study/work placement
* Short duration
* Study/Placement start date
* ECTS credits study/work/total
* Study/Placement grant
* SEV supplement
* Previous participation

These columns will be further explored and assessed.

In [4]:
students_2008 = students_2008[[
    'HOMEINSTITUTION',
    'AGE',
    'SEX',
    'NATIONALITY',
    'LEVELSTUDY',
    'YEARSPRIOR',
    'MOBILITYTYPE',
    'HOSTINSTITUTION',
    'WORKPLACEMENT',
    'COUNTRYOFWORKPLACEMENT',
    'LENGTHSTUDYPERIOD',
    'LENGTHWORKPLACEMENT',
    'SHORTDURATION',
    'STUDYSTARTDATE',
    'PLACEMENTSTARTDATE',
    'ECTSCREDITSSTUDY',
    'ECTSCREDITSWORK',
    'TOTALECTSCREDITS',
    'SEVSUPPLEMENT',
    'STUDYGRANT',
    'PLACEMENTGRANT',
    'PREVIOUSPARTICIPATION'
]]

In [5]:
# checking distributions of numerical columns
students_2008.describe()

Unnamed: 0,AGE,YEARSPRIOR,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT
count,198523.0,198523.0,198523.0,198523.0,198523.0,198523.0,198523.0,198523.0,198523.0,198523.0
mean,23.488679,2.770581,5.458781,0.665937,26.91063,2.262907,29.173537,3.696198,1380.027413,288.010219
std,2.535529,1.30522,3.272119,1.783168,21.415811,8.343544,20.169371,246.72563,1003.603021,783.931153
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,22.0,2.0,4.0,0.0,0.0,0.0,15.0,0.0,800.0,0.0
50%,23.0,3.0,5.0,0.0,30.0,0.0,30.0,0.0,1271.6,0.0
75%,24.0,4.0,9.0,0.0,36.0,0.0,38.0,0.0,1914.0,0.0
max,90.0,20.0,12.0,12.0,90.0,90.0,90.0,63609.98,10382.0,12150.0


The distributions we see here all make sense. One interesting thing to notice, though, is the maximum age of students.
Let's dig a bit deeper into students older than 50 years.

In [6]:
students_2008[students_2008.AGE > 50]

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
622,A GRAZ23,58,F,AT,2,8,S,D GREIFS01,,,7.00,0.0,,11-2008,,35,0,35,0.0,1827.00,0.00,N
897,A WIEN01,53,M,AT,2,5,S,D TUBINGE01,,,4.00,0.0,,10-2008,,20,0,20,0.0,800.00,0.00,N
1999,A LINZ02,56,F,AT,1,2,S,I PALERMO03,,,4.00,0.0,,02-2009,,20,0,20,0.0,800.00,0.00,N
2063,A WIEN09,53,M,AT,2,2,S,IRLLETTERK01,,,4.00,0.0,,01-2009,,35,0,35,0.0,800.00,0.00,N
2082,A WIEN09,51,F,AT,2,2,S,IRLLETTERK01,,,4.00,0.0,,02-2009,,30,0,30,0.0,800.00,0.00,N
4800,B ANTWERP59,62,M,BE,1,4,S,E VALENCI02,,,3.00,0.0,,09-2008,,0,0,0,0.0,690.00,0.00,N
8403,B GEEL07,53,M,BE,1,2,S,TR ANTALYA01,,,3.00,0.0,,02-2009,,20,0,20,0.0,780.00,0.00,N
13405,B LOUVAIN01,59,F,BE,2,3,S,IRLDUBLIN02,,,3.75,0.0,,02-2009,,30,0,30,0.0,2625.00,0.00,N
15879,D BERLIN01,56,F,DE,2,2,S,F PARIS052,,,4.00,0.0,,09-2008,,35,0,35,0.0,1120.00,0.00,N
20584,D BERLIN06,90,M,DE,1,1,S,UK CHELMSF01,,,9.00,0.0,,09-2008,,60,0,60,0.0,1741.95,0.00,N


Although not that common, we can see students older than 50. The oldest student is even 90 years old. As we don't have the original student data (e.g. date of birth), we cannot double-check if this is really true. So we have to assume that it is.

Now we also want to check the non-numerical columns which are
* Home institution
* Sex
* Nationality
* Level study
* Mobility type
* Host institution
* Work placement
* Country of work placement
* Short duration
* Study/Placement start date
* Previous participation

for missing values and unique values to check for inconsistencies. 

In [7]:
def print_unique_values(df, colname):
    values = df[colname].dropna().unique()
    print(sorted(values))
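
The per-column checks that follow could also be condensed into one overview table. The following is only a minimal sketch: the helper `summarize_columns` and the tiny `sample` dataframe are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_columns(df, colnames):
    """Return one row per column with its missing-value and distinct-value counts."""
    rows = []
    for col in colnames:
        rows.append({
            'column': col,
            'n_missing': int(df[col].isnull().sum()),
            'n_unique': int(df[col].nunique(dropna=True)),
        })
    return pd.DataFrame(rows, columns=['column', 'n_missing', 'n_unique'])

# tiny made-up sample, not taken from the real file
sample = pd.DataFrame({
    'SEX': ['F', 'M', 'F', None],
    'MOBILITYTYPE': ['S', 'P', 'S', 'C'],
})
summary = summarize_columns(sample, ['SEX', 'MOBILITYTYPE'])
print(summary)
```

Such an overview only shows where to look. Listing the unique values, as done in the cells of this notebook, is still needed to spot inconsistent codes.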

In [8]:
print_unique_values(students_2008, 'HOMEINSTITUTION')
students_2008[students_2008.HOMEINSTITUTION.isnull()]

['A  DORNBIR01', 'A  EISENST02', 'A  EISENST05', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU21', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KREMS02', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ17', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SPITTAL01', 'A  ST-POLT03', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIENER01', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  ARLON08', 'B  ARLON09', 'B  BRUGGE11', 'B  BRUSSEL01', 'B  BRUSSEL05', 'B  BRUSSEL37', 'B  BRUSSEL43', 'B  BRUSSEL46', 'B  BRUXEL02', 'B  BRUXEL04', 'B  

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [9]:
print_unique_values(students_2008, 'SEX')
students_2008[students_2008.SEX.isnull()]

['F', 'M']


Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [10]:
print_unique_values(students_2008, 'NATIONALITY')
students_2008[students_2008.NATIONALITY.isnull()]

['AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK', 'XX']


Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [11]:
print_unique_values(students_2008, 'LEVELSTUDY')
students_2008[students_2008.LEVELSTUDY.isnull()]

['1', '2', '3', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [12]:
print_unique_values(students_2008, 'MOBILITYTYPE')
students_2008[students_2008.MOBILITYTYPE.isnull()]

['C', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [13]:
print_unique_values(students_2008, 'HOSTINSTITUTION')
students_2008[students_2008.HOSTINSTITUTION.isnull()].head()

['A  BADEN01', 'A  DORNBIR01', 'A  EISENST01', 'A  EISENST02', 'A  EISENST05', 'A  FELDKIR01', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU20', 'A  INNSBRU21', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KREMS02', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ11', 'A  LINZ17', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SPITTAL01', 'A  ST-POLT03', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIENER01', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  ARLON08', 'B  ARLON09', 'B  BRUGGE11', 'B  BRUSSEL01', 'B  BRUSSEL02', 'B  

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
6,EE TALLINN01,25,F,EE,2,4,P,,KULTURWERK BBK BERLIN,DE,0.0,5.0,,,01-2009,0,6,6,0.0,0.0,2031.84,N
9,EE TARTU02,21,F,EE,1,1,P,,"KREEKA, JOB TRUST",GR,0.0,3.0,,,06-2009,0,6,6,0.0,0.0,1473.0,N
11,EE TALLINN04,20,F,EE,1,2,P,,"CLAUSTHAL UT, LABORATORY OF NON-METALLIC MATER...",DE,0.0,3.0,,,06-2009,0,5,5,0.0,0.0,1336.3,N
13,EE TARTU02,24,F,EE,1,2,P,,ESCOLA SUPERIOR DE TECNOLOGIA DA SALUDE DE LISBOA,PT,0.0,3.0,,,02-2009,0,18,18,0.0,0.0,1570.0,N
15,EE TALLINN12,23,F,EE,1,4,P,,LISBOA HOSPITAL,PT,0.0,3.0,,,02-2009,0,18,18,0.0,0.0,1531.75,N


In [14]:
# host institution should be available for mobility types S & C, check if this is the case
print_unique_values(students_2008[students_2008.HOSTINSTITUTION.isnull()], 'MOBILITYTYPE')

['P']


In [15]:
# check if every P mobility type student has a work placement set
students_2008[(students_2008.MOBILITYTYPE == 'P') & 
              (students_2008.WORKPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
4634,B LEUVEN18,21,M,BE,1,2,P,,,FR,0.0,3.0,,,02-2009,0,0,0,0.0,0.0,1646.43,N
4636,B LEUVEN18,24,F,BE,1,2,P,,,IE,0.0,3.0,,,02-2009,0,0,0,0.0,0.0,1993.51,N
4638,B GENT39,22,F,BE,1,2,P,,,DE,0.0,3.0,,,03-2009,0,0,0,0.0,0.0,1313.46,N
4648,B GENT39,21,F,BE,1,2,P,,,IE,0.0,3.0,,,03-2009,0,0,0,0.0,0.0,1601.23,N
4823,B GENT39,21,F,BE,1,2,P,,,NL,0.0,3.0,,,03-2009,0,0,0,0.0,0.0,1444.92,N
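
The conditional checks above print only the first matching rows. To see how many rows are affected per mobility type, the missing values could be counted in a grouped table. This is a sketch under assumptions: the `sample` dataframe is made up for illustration and only reuses the column names of the real data.

```python
import pandas as pd

# made-up rows; only the column names match the real dataframe
sample = pd.DataFrame({
    'MOBILITYTYPE': ['S', 'P', 'P', 'C', 'P'],
    'HOSTINSTITUTION': ['UK BATH01', None, None, 'F PARIS007', None],
    'WORKPLACEMENT': [None, 'LISBOA HOSPITAL', None, None, 'KULTURWERK BBK BERLIN'],
})

# count missing values per mobility type for the selected columns
missing_by_type = (
    sample[['HOSTINSTITUTION', 'WORKPLACEMENT']]
    .isnull()
    .groupby(sample['MOBILITYTYPE'])
    .sum()
)
print(missing_by_type)
```

A non-zero `WORKPLACEMENT` count in the `P` row would correspond to placement students without a named work placement, like the rows listed above.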


In [16]:
print_unique_values(students_2008, 'COUNTRYOFWORKPLACEMENT')

['AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK']


In [17]:
# check if every P mobility type student has a country of work placement set
students_2008[(students_2008.MOBILITYTYPE == 'P') & 
              (students_2008.COUNTRYOFWORKPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [18]:
print_unique_values(students_2008, 'SHORTDURATION')

['T', 'X']


In [19]:
# check if all mobility students with a duration less than 3 months have a short duration type set
students_2008[(students_2008.MOBILITYTYPE != 'P') & 
              (students_2008.LENGTHSTUDYPERIOD < 3) & 
              (students_2008.SHORTDURATION.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [20]:
print_unique_values(students_2008, 'STUDYSTARTDATE')

['01-2008', '01-2009', '01/2008', '01/2009', '02-1009', '02-2007', '02-2008', '02-2009', '02-2209', '02/2008', '02/2009', '03-1009', '03-2008', '03-2009', '03-3009', '03/2008', '03/2009', '04-2008', '04-2009', '04/2008', '04/2009', '05-2008', '05-2009', '05/2009', '06-2008', '06-2009', '06/2008', '06/2009', '07-2008', '07-2009', '07/2008', '07/2009', '08-2008', '08-2009', '08/2007', '08/2008', '09-1008', '09-2008', '09-2009', '09-2208', '09/2007', '09/2008', '09/2009', '10-2008', '10-2208', '10/2007', '10/2008', '11-2008', '11/2008', '12-2008', '12/2008']


In [21]:
# all mobility student types (S & C) should have a study start date set
students_2008[(students_2008.MOBILITYTYPE != 'P') & 
              (students_2008.STUDYSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [22]:
# check invalid dates
students_2008[(students_2008.STUDYSTARTDATE == '09-2208')].head()

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
154735,PL BIALYST05,22,M,PL,1,1,S,LT VILNIUS06,,,10.0,0.0,,09-2208,,63,0,63,0.0,2300.0,0.0,N


In [23]:
print_unique_values(students_2008, 'PLACEMENTSTARTDATE')

['01-2009', '01/2009', '02-2009', '02/2009', '03-2009', '03/2009', '04-2009', '04/2009', '05-2009', '05/2009', '06-2008', '06-2009', '06/2008', '06/2009', '07-2008', '07-2009', '07/2008', '07/2009', '08-2008', '08-2009', '08/2008', '08/2009', '09-2008', '09-2009', '09/2008', '09/2009', '10-2008', '10-2009', '10/2008', '11-2008', '11/2008', '12-2008', '12/2008']


In [24]:
# all placement students (P) should have a placement start date set
students_2008[(students_2008.MOBILITYTYPE == 'P') & 
              (students_2008.PLACEMENTSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [25]:
print_unique_values(students_2008, 'PREVIOUSPARTICIPATION')
students_2008[students_2008.PREVIOUSPARTICIPATION.isnull()]

['M', 'N', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,SEX,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


#### Outcome of data quality assessment for 2008 data

* Nationality: XX is not a valid country code and will be replaced with NULL
* Work placement / Country of work placement: Work placement itself might be NULL, Country of work placement is always set
* Study start date: inconsistent formats found (mm/yyyy and mm-yyyy); invalid dates also found (e.g. 10-2208). Logically, only dates from 2008 and 2009 make sense, so the invalid dates probably originate from typos
* Placement start date: inconsistencies in format found: mm/yyyy and mm-yyyy
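
The two date findings can be expressed as a small check that normalizes the separator and flags implausible years. The sketch below runs on a made-up series; the variable names are illustrative, and treating only 2008 and 2009 as plausible follows the reasoning in the list above.

```python
import pandas as pd

# made-up sample covering the patterns found above
dates = pd.Series(['09-2008', '09/2008', '10-2208', '02-1009', None])

# unify the separator so every value has the form mm-yyyy
normalized = dates.str.replace('/', '-', regex=False)
# characters 3 to 6 hold the four-digit year
year = normalized.str.slice(3, 7)

# a date is plausible for the 2008-09 file if the year is 2008 or 2009
is_plausible = year.isin(['2008', '2009'])
# missing dates are not counted as invalid
is_invalid = normalized.notnull() & ~is_plausible

print(normalized[is_invalid].value_counts())
```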

Next step is to clean the 2008 data:

In [26]:
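# work on a copy so the explored dataframe stays untouched and a possible
# SettingWithCopyWarning is avoided
students_2008_cleaned = students_2008.copy()
# 'XX' is not a valid country code
students_2008_cleaned['NATIONALITY'] = students_2008_cleaned['NATIONALITY'].replace({'XX': None})
# unify the date separator to mm-yyyy
students_2008_cleaned['STUDYSTARTDATE'] = students_2008_cleaned['STUDYSTARTDATE'].str.replace('/', '-', regex=False)
students_2008_cleaned['PLACEMENTSTARTDATE'] = students_2008_cleaned['PLACEMENTSTARTDATE'].str.replace('/', '-', regex=False)
# how many rows did we have in the beginning?
students_2008_cleaned.shape[0]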

198523

In [27]:
# keep only rows whose study start year is 2008 or 2009, or whose study start date is missing
students_2008_cleaned = students_2008_cleaned[
    (students_2008_cleaned.STUDYSTARTDATE.str.slice(3,7).isin(['2008', '2009'])) |
    (students_2008_cleaned.STUDYSTARTDATE.isnull())]
students_2008_cleaned.shape[0]

198486

In [28]:
# keep only rows whose placement start year is 2008 or 2009, or whose placement start date is missing
students_2008_cleaned = students_2008_cleaned[
    (students_2008_cleaned.PLACEMENTSTARTDATE.str.slice(3,7).isin(['2008', '2009'])) |
    (students_2008_cleaned.PLACEMENTSTARTDATE.isnull())]
students_2008_cleaned.shape[0]

198486
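
Because the same cleaning steps are likely to be needed for other years, they could be wrapped in one function. This is a hedged sketch, not the notebook's own implementation: the function name `clean_student_data`, its `valid_years` parameter and the `sample` dataframe are assumptions made for illustration.

```python
import pandas as pd

def clean_student_data(df, valid_years):
    """Apply the cleaning steps shown above to a copy of df.

    valid_years: list of year strings accepted in both start-date columns.
    """
    cleaned = df.copy()
    # 'XX' is not a valid country code
    cleaned['NATIONALITY'] = cleaned['NATIONALITY'].replace({'XX': None})

    keep = pd.Series(True, index=cleaned.index)
    for col in ['STUDYSTARTDATE', 'PLACEMENTSTARTDATE']:
        # unify the separator to mm-yyyy
        cleaned[col] = cleaned[col].str.replace('/', '-', regex=False)
        # a row is kept if the year is valid or the date is missing
        keep &= cleaned[col].str.slice(3, 7).isin(valid_years) | cleaned[col].isnull()
    return cleaned[keep]

# made-up rows for illustration
sample = pd.DataFrame({
    'NATIONALITY': ['DE', 'XX', 'PL', 'EE'],
    'STUDYSTARTDATE': ['09/2008', '09-2008', '09-2208', None],
    'PLACEMENTSTARTDATE': [None, None, None, '01/2009'],
})
cleaned = clean_student_data(sample, ['2008', '2009'])
print(cleaned)
```

For the 2008-09 file the call would use `['2008', '2009']`. Which years are acceptable for other files is an assumption that has to be checked against each file's unique values first.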

### CSV file for 2009-10
Read data for 2009-10 and assess columns.

In [29]:
students_2009 = pd.read_csv('raw_data/student_data_2009.csv', sep=';')
students_2009.info()
# we can see that the loaded dataframe for 2009 has 32 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213266 entries, 0 to 213265
Data columns (total 32 columns):
HOMEINSTITUTION              213266 non-null object
COUNTRYOFHOMEINSTITUTION     213266 non-null object
AGE                          213266 non-null int64
GENDER                       213266 non-null object
NATIONALITY                  213266 non-null object
SUBJECTAREA                  213266 non-null int64
LEVELSTUDY                   213266 non-null object
YEARSPRIOR                   213266 non-null int64
MOBILITYTYPE                 213266 non-null object
HOSTINSTITUTION              177705 non-null object
COUNTRYOFHOSTINSTITUTION     177705 non-null object
WORKPLACEMENT                35562 non-null object
COUNTRYOFWORKPLACEMENT       35563 non-null object
ENTERPRISESIZE               35563 non-null object
TYPEWORKSECTOR               35563 non-null object
LENGTHSTUDYPERIOD            213266 non-null float64
LENGTHWORKPLACEMENT          213266 non-null float64
SHORTDURAT

In [30]:
students_2009.head(10)

Unnamed: 0,HOMEINSTITUTION,COUNTRYOFHOMEINSTITUTION,AGE,GENDER,NATIONALITY,SUBJECTAREA,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,COUNTRYOFHOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,ENTERPRISESIZE,TYPEWORKSECTOR,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,CONSORTIUMAGREEMENTNUMBER,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,TAUGHTHOSTLANG,LANGUAGETAUGHT,LINGPREPARATION,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION,QUALIFICATIONATHOST
0,B LEUVEN01,BE,20,F,BE,34,1,2,S,F POITIER01,FR,,,,,3.5,0.0,,09-2009,,,30,0,30,0.0,Y,FR,NN,630,0,N,N
1,F TOULOUS23,FR,23,M,FR,22,2,3,S,DK ARHUS03,DK,,,,,5.0,0.0,,01-2010,,,30,0,30,0.0,N,EN,NN,912,0,N,N
2,F TOULOUS23,FR,22,F,FR,22,1,1,S,NL GRONING03,NL,,,,,10.0,0.0,,09-2009,,,60,0,60,0.0,N,EN,NN,1823,0,N,N
3,F TOULOUS23,FR,22,M,FR,22,2,3,S,D OESTRIC01,DE,,,,,4.5,0.0,,01-2010,,,30,0,30,0.0,Y,DE,NN,821,0,N,N
4,F TOULOUS23,FR,22,M,FR,22,2,3,S,SI LJUBLJA01,SI,,,,,4.0,0.0,,02-2010,,,30,0,30,0.0,N,EN,NN,729,0,N,N
5,F TOULOUS23,FR,21,M,FR,22,2,3,S,I CASTELL01,IT,,,,,4.0,0.0,,02-2010,,,30,0,30,0.0,N,EN,NN,729,0,N,N
6,F TOULOUS23,FR,23,F,FR,22,2,3,S,SF HELSINK02,FI,,,,,4.5,0.0,,01-2010,,,30,0,30,0.0,N,EN,NN,821,0,N,N
7,F TOULOUS23,FR,21,F,FR,22,2,3,S,D KOBLENZ03,DE,,,,,4.0,0.0,,01-2010,,,30,0,30,0.0,Y,DE,NN,729,0,N,N
8,F TOULOUS23,FR,19,F,FR,22,1,1,S,UK UXBRIDG01,UK,,,,,10.0,0.0,,09-2009,,,60,0,60,0.0,Y,EN,NN,1823,0,N,N
9,F STRASBO25,FR,22,F,FR,22,2,3,P,,,INTERSECTION MAGAZINE,UK,S,J,0.0,8.0,,,10-2009,,0,60,60,0.0,N,EN,NN,0,2800,N,N


The 2009 data has the same columns as the 2008 data, except that the `SEX` column is named `GENDER`. Let's have a look at the data quality of the selected columns we will use for analytics.
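
Column names of two yearly files can also be compared programmatically instead of by eye. In the `info()` outputs above, the 2008 file lists `SEX` where the 2009 file lists `GENDER`. The sketch below uses empty stand-in dataframes; the helper `compare_columns` and the names `df_2008` and `df_2009` are hypothetical.

```python
import pandas as pd

def compare_columns(left, right):
    """Return the column names found in only one of the two dataframes."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right

# empty stand-ins carrying a few of the real column names
df_2008 = pd.DataFrame(columns=['HOMEINSTITUTION', 'AGE', 'SEX'])
df_2009 = pd.DataFrame(columns=['HOMEINSTITUTION', 'AGE', 'GENDER'])

only_2008, only_2009 = compare_columns(df_2008, df_2009)
print(only_2008, only_2009)

# align the name so both years can later be combined into one file
df_2009 = df_2009.rename(columns={'GENDER': 'SEX'})
```

Renaming to `SEX` is just one possible choice here. What matters for the unified CSV file is that both years end up with the same column name.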

In [31]:
students_2009 = students_2009[[
    'HOMEINSTITUTION',
    'AGE',
    'GENDER',
    'NATIONALITY',
    'LEVELSTUDY',
    'YEARSPRIOR',
    'MOBILITYTYPE',
    'HOSTINSTITUTION',
    'WORKPLACEMENT',
    'COUNTRYOFWORKPLACEMENT',
    'LENGTHSTUDYPERIOD',
    'LENGTHWORKPLACEMENT',
    'SHORTDURATION',
    'STUDYSTARTDATE',
    'PLACEMENTSTARTDATE',
    'ECTSCREDITSSTUDY',
    'ECTSCREDITSWORK',
    'TOTALECTSCREDITS',
    'SEVSUPPLEMENT',
    'STUDYGRANT',
    'PLACEMENTGRANT',
    'PREVIOUSPARTICIPATION'
]]
students_2009.describe()

Unnamed: 0,AGE,YEARSPRIOR,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT
count,213266.0,213266.0,213266.0,213266.0,213266.0,213266.0,213266.0,213266.0,213266.0,213266.0
mean,22.646845,2.762541,5.328737,0.707532,28.498138,2.31237,30.810509,2.726244,1259.128164,273.194902
std,2.596557,1.320145,3.288474,1.807475,20.879139,8.30807,19.317489,118.213993,985.916508,716.278802
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,2.0,4.0,0.0,15.0,0.0,19.0,0.0,690.0,0.0
50%,22.0,3.0,5.0,0.0,30.0,0.0,30.0,0.0,1214.0,0.0
75%,24.0,4.0,9.0,0.0,39.0,0.0,40.0,0.0,1732.0,0.0
max,75.0,20.0,12.0,12.0,90.0,90.0,90.0,16912.5,10860.0,13032.0


In [32]:
print_unique_values(students_2009, 'HOMEINSTITUTION')
students_2009[students_2009.HOMEINSTITUTION.isnull()]

['A  BADEN01', 'A  DORNBIR01', 'A  EISENST02', 'A  FELDKIR01', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU21', 'A  INNSBRU23', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ11', 'A  LINZ17', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SPITTAL01', 'A  ST-POLT03', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIEN66', 'A  WIENER01', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  ARLON08', 'B  ARLON09', 'B  BRUGGE11', 'B  BRUSSEL01', 'B  BRUSSEL02', 'B  BRUSSEL05', 'B  BRUS

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [33]:
print_unique_values(students_2009, 'GENDER')
students_2009[students_2009.GENDER.isnull()]

['F', 'M']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [34]:
print_unique_values(students_2009, 'NATIONALITY')
students_2009[students_2009.NATIONALITY.isnull()]

['AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK', 'XX']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [35]:
print_unique_values(students_2009, 'LEVELSTUDY')
students_2009[students_2009.LEVELSTUDY.isnull()]

['1', '2', '3', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [36]:
print_unique_values(students_2009, 'MOBILITYTYPE')
students_2009[students_2009.MOBILITYTYPE.isnull()]

['C', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [37]:
# host institution should be available for mobility types S & C, check if this is the case
print_unique_values(students_2009[students_2009.HOSTINSTITUTION.isnull()], 'MOBILITYTYPE')

['P']


In [38]:
# check if every P mobility type student has a work placement set
students_2009[(students_2009.MOBILITYTYPE == 'P') & 
              (students_2009.WORKPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
9848,B MECHELE14,21,F,BE,1,2,P,,,SI,0.0,3.0,,,02-2010,0,18,18,0.0,0,1020,N


In [39]:
print_unique_values(students_2009, 'COUNTRYOFWORKPLACEMENT')

['AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK']


In [40]:
# check if every P mobility type student has a country of work placement set
students_2009[(students_2009.MOBILITYTYPE == 'P') & 
              (students_2009.COUNTRYOFWORKPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [41]:
print_unique_values(students_2009, 'SHORTDURATION')

['0', '1', 'I', 'N', 'T', 'X', 'Y', 'Z']


In [42]:
# check if all mobility students with a duration less than 3 months have a short duration type set
students_2009[(students_2009.MOBILITYTYPE != 'P') & 
              (students_2009.LENGTHSTUDYPERIOD < 3) & 
              (students_2009.SHORTDURATION.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [43]:
print_unique_values(students_2009, 'STUDYSTARTDATE')

['01-2009', '01-2010', '01/2010', '02-1010', '02-2009', '02-2010', '02/2010', '03-2009', '03-2010', '03/2010', '04-2009', '04-2010', '04/2010', '05-2009', '05-2010', '05/2010', '06-2009', '06-2010', '06/2009', '06/2010', '07-2009', '07-2010', '07/2009', '08-2009', '08-2010', '08/2009', '09-2009', '09-2010', '09/2009', '09/2010', '10-2009', '10/2009', '11-2009', '11/2009', '12-2009', '12/2009']


In [44]:
# all mobility student types (S & C) should have a study start date set
students_2009[(students_2009.MOBILITYTYPE != 'P') & 
              (students_2009.STUDYSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [45]:
# check invalid dates
students_2009[(students_2009.STUDYSTARTDATE == '02-1010')].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
41742,EE TARTU02,21,F,EE,1,2,S,E CIUDA-R01,,,5.0,0.0,,02-1010,,18,0,18,0.0,2045,0,N
45432,EE TARTU02,20,F,EE,1,1,S,SF OULU01,,,5.0,0.0,,02-1010,,20,0,20,0.0,1740,0,N


In [46]:
print_unique_values(students_2009, 'PLACEMENTSTARTDATE')

['01-2010', '01/2010', '01/2012', '02-2010', '02/2010', '03-2010', '03/2010', '04-2010', '04/2010', '05-2010', '05/2010', '06-2009', '06-2010', '06-2011', '06/2009', '06/2010', '07-2009', '07-2010', '07-2011', '07/2009', '07/2010', '08-2009', '08-2010', '08/2009', '08/2010', '09-2009', '09-2010', '09/2009', '09/2010', '10-2009', '10-2010', '10-2011', '10/2009', '11-2009', '11/2009', '11/2010', '12-2009', '12/2009']


In [47]:
# all placement students (P) should have a placement start date set
students_2009[(students_2009.MOBILITYTYPE == 'P') & 
              (students_2009.PLACEMENTSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [48]:
print_unique_values(students_2009, 'PREVIOUSPARTICIPATION')
students_2009[students_2009.PREVIOUSPARTICIPATION.isnull()]

['M', 'N', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHWORKPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


#### Outcome of data quality assessment for 2009 data

Similar data quality issues to 2008 were also found in the 2009 dataset:
* Nationality: XX is not a valid country code and will be replaced with NULL
* Work placement / Country of work placement: the work placement itself may be NULL, but the country of work placement is always set for placement (P) students
* Study start date: inconsistent formats found (mm/yyyy and mm-yyyy); also invalid dates found (e.g. 02-1010), where logically only dates from 2009 and 2010 make sense, so the invalid dates probably originate from typos
* Placement start date: inconsistent formats found (mm/yyyy and mm-yyyy); also dates outside 2009 and 2010 found (e.g. 06-2011), which will be filtered out as well

Next step is to clean the 2009 data:

In [49]:
# work on a copy so that the assessed dataframe is not modified by the cleaning steps
students_2009_cleaned = students_2009.copy()
students_2009_cleaned.NATIONALITY = students_2009_cleaned.NATIONALITY.replace({'XX': None})
students_2009_cleaned.STUDYSTARTDATE = students_2009_cleaned.STUDYSTARTDATE.str.replace('/', '-')
students_2009_cleaned.PLACEMENTSTARTDATE = students_2009_cleaned.PLACEMENTSTARTDATE.str.replace('/', '-')
# how many rows did we have in the beginning?
students_2009_cleaned.shape[0]

213266

In [50]:
# filtering rows with invalid study start dates
students_2009_cleaned = students_2009_cleaned[
    (students_2009_cleaned.STUDYSTARTDATE.str.slice(3,7).isin(['2009', '2010'])) |
    (students_2009_cleaned.STUDYSTARTDATE.isnull())]
students_2009_cleaned.shape[0]

213264

In [51]:
# filtering rows with invalid placement start dates
students_2009_cleaned = students_2009_cleaned[
    (students_2009_cleaned.PLACEMENTSTARTDATE.str.slice(3,7).isin(['2009', '2010'])) |
    (students_2009_cleaned.PLACEMENTSTARTDATE.isnull())]
students_2009_cleaned.shape[0]

213253
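
Since the same cleaning steps are repeated for the following years, they could be bundled into a small helper. The following is only a sketch under assumptions: the function name `clean_start_dates` and the sample frame are hypothetical and not part of the original notebook; the logic mirrors the separator replacement and the year filter from the cells above.

```python
import pandas as pd


def clean_start_dates(df, valid_years,
                      date_columns=('STUDYSTARTDATE', 'PLACEMENTSTARTDATE')):
    """Normalise mm/yyyy to mm-yyyy and keep rows whose year is valid or missing."""
    cleaned = df.copy()
    keep = pd.Series(True, index=cleaned.index)
    for column in date_columns:
        # unify the separator so that every value has the form mm-yyyy
        cleaned[column] = cleaned[column].str.replace('/', '-', regex=False)
        # characters 3 to 6 hold the year; missing dates remain allowed
        year = cleaned[column].str.slice(3, 7)
        keep &= year.isin(valid_years) | cleaned[column].isnull()
    return cleaned[keep]


# tiny hypothetical sample, not taken from the raw data
sample = pd.DataFrame({
    'STUDYSTARTDATE': ['09/2009', '02-1010', None, '01-2010'],
    'PLACEMENTSTARTDATE': [None, None, '06-2011', '02/2010'],
})
result = clean_start_dates(sample, ['2009', '2010'])
print(result)
```

Building one combined mask and filtering once at the end avoids assigning into an already filtered slice of the dataframe.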

### CSV file for 2010-11
Read data for 2010-11 and assess columns.

In [52]:
students_2010 = pd.read_csv('raw_data/student_data_2010.csv', sep=';')
students_2010.info()
# we can see that the loaded dataframe for 2010 has 32 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231408 entries, 0 to 231407
Data columns (total 32 columns):
HOMEINSTITUTION                 231408 non-null object
COUNTRYCODEOFHOMEINSTITUTION    231408 non-null object
AGE                             231408 non-null int64
GENDER                          231408 non-null object
NATIONALITY                     231408 non-null object
SUBJECTAREA                     231408 non-null int64
LEVELSTUDY                      231408 non-null object
YEARSPRIOR                      231408 non-null int64
MOBILITYTYPE                    231408 non-null object
HOSTINSTITUTION                 190499 non-null object
COUNTRYCODEOFHOSTINSTITUTION    190499 non-null object
WORKPLACEMENT                   40964 non-null object
COUNTRYOFWORKPLACEMENT          40964 non-null object
ENTERPRISESIZE                  40964 non-null object
TYPEWORKSECTOR                  40964 non-null object
LENGTHSTUDYPERIOD               231408 non-null float64
LENGTHPLACEMENT

In [53]:
students_2010.head(10)

Unnamed: 0,HOMEINSTITUTION,COUNTRYCODEOFHOMEINSTITUTION,AGE,GENDER,NATIONALITY,SUBJECTAREA,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,COUNTRYCODEOFHOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,ENTERPRISESIZE,TYPEWORKSECTOR,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,CONSORTIUMAGREEMENTNUMBER,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,TAUGHTHOSTLANG,LANGUAGETAUGHT,LINGPREPARATION,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION,QUALIFICATIONATHOST
0,HU BUDAPES02,HU,25,M,HU,521,3,1,P,,,HOCHSCHULE WISMAR,DE,M,P,0.0,3.0,,,09-2010,,0,0,0,0,N,DE,NN,0.0,900.0,N,N
1,HU BUDAPES02,HU,28,M,HU,522,3,4,P,,,ACMIT,AT,M,M,0.0,9.0,,,09-2010,,0,0,0,0,N,DE,NN,0.0,3600.0,N,N
2,HU BUDAPES02,HU,25,F,HU,581,2,5,P,,,S.C.ESZTÃ¡NY STUDIO S.R.L.,RO,S,F,0.0,5.0,,,09-2010,,0,0,0,0,N,EN,NN,0.0,1500.0,N,N
3,HU BUDAPES02,HU,23,M,SK,582,2,4,P,,,TECHNISCHE UNIVERSITAT DARMSTADT,DE,L,P,0.0,3.0,,,07-2011,,0,0,0,0,N,DE,NN,0.0,900.0,N,N
4,HU BUDAPES02,HU,24,F,HU,581,2,5,P,,,H.E.I.Z. HAUS ARCHITEKTUR STADTPLANUNG,DE,M,F,0.0,6.0,,,02-2011,,0,0,0,0,N,DE,NN,0.0,1800.0,N,N
5,HU BUDAPES02,HU,21,F,HU,581,2,4,P,,,HOCHSTRASSER ARCHITEKTEN,DE,S,F,0.0,3.0,,,07-2011,,0,0,0,0,N,DE,NN,0.0,1200.0,N,N
6,HU BUDAPES02,HU,23,F,HU,581,2,4,P,,,ANTONIO RAVALLI ARCHITETTI,IT,S,F,0.0,3.0,,,04-2011,,0,0,0,0,N,IT,NN,0.0,900.0,N,N
7,HU BUDAPES02,HU,25,M,HU,581,2,5,P,,,PRIEDEMANN,DE,M,F,0.0,6.0,,,08-2010,,0,0,0,0,N,DE,NN,0.0,2100.0,N,N
8,HU BUDAPES02,HU,26,F,HU,581,2,4,P,,,VCENK,NO,M,F,0.0,9.0,,,09-2010,,0,0,0,0,N,EN,NN,0.0,2700.0,N,N
9,HU BUDAPES02,HU,22,F,HU,581,2,3,P,,,SAUNDERS ARCHITECTURE,NO,S,F,0.0,4.5,,,09-2010,,0,0,0,0,N,EN,NN,0.0,1350.0,N,N


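The 2010 data has the same 32 columns as the 2008 data, although some are named differently (e.g. GENDER instead of SEX, LENGTHPLACEMENT instead of LENGTHWORKPLACEMENT). Let's have a look at the data quality of the selected columns we will use for analytics.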

In [54]:
students_2010 = students_2010[[
    'HOMEINSTITUTION',
    'AGE',
    'GENDER',
    'NATIONALITY',
    'LEVELSTUDY',
    'YEARSPRIOR',
    'MOBILITYTYPE',
    'HOSTINSTITUTION',
    'WORKPLACEMENT',
    'COUNTRYOFWORKPLACEMENT',
    'LENGTHSTUDYPERIOD',
    'LENGTHPLACEMENT',
    'SHORTDURATION',
    'STUDYSTARTDATE',
    'PLACEMENTSTARTDATE',
    'ECTSCREDITSSTUDY',
    'ECTSCREDITSWORK',
    'TOTALECTSCREDITS',
    'SEVSUPPLEMENT',
    'STUDYGRANT',
    'PLACEMENTGRANT',
    'PREVIOUSPARTICIPATION'
]]
students_2010.describe()

Unnamed: 0,AGE,YEARSPRIOR,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT
count,231408.0,231408.0,231408.0,231408.0,231408.0,231408.0,231408.0,231408.0,231408.0,231408.0
mean,22.513971,2.73096,5.238561,0.767268,28.491802,2.587936,31.079738,2.139913,1217.70805,280.837386
std,2.703065,1.307554,3.29973,1.861627,20.720936,8.667091,18.894579,103.269691,984.106146,701.147746
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,2.0,4.0,0.0,14.0,0.0,20.0,0.0,600.0,0.0
50%,22.0,2.0,5.0,0.0,30.0,0.0,30.0,0.0,1095.84,0.0
75%,23.0,4.0,9.0,0.0,39.0,0.0,40.0,0.0,1713.0,0.0
max,99.0,20.0,12.0,12.0,90.0,90.0,90.0,20559.0,10970.0,9600.0


In [55]:
print_unique_values(students_2010, 'HOMEINSTITUTION')
students_2010[students_2010.HOMEINSTITUTION.isnull()]

['A  BADEN01', 'A  DORNBIR01', 'A  EISENST01', 'A  EISENST02', 'A  EISENST05', 'A  FELDKIR01', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU20', 'A  INNSBRU21', 'A  INNSBRU23', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KREMS02', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ17', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SPITTAL01', 'A  ST-POLT03', 'A  STEYR05', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIEN66', 'A  WIENER01', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  ARLON08', 'B  ARLON09', 'B  BRUG

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [56]:
print_unique_values(students_2010, 'GENDER')
students_2010[students_2010.GENDER.isnull()]

['F', 'M']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [57]:
print_unique_values(students_2010, 'NATIONALITY')
students_2010[students_2010.NATIONALITY.isnull()]

['AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK', 'XX']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [58]:
students_2010.LEVELSTUDY = students_2010.LEVELSTUDY.astype(str)
print_unique_values(students_2010, 'LEVELSTUDY')
students_2010[students_2010.LEVELSTUDY.isnull()]

['1', '2', '3', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [59]:
print_unique_values(students_2010, 'MOBILITYTYPE')
students_2010[students_2010.MOBILITYTYPE.isnull()]

['C', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [60]:
# host institution should be available for mobility types S & C, check if this is the case
print_unique_values(students_2010[students_2010.HOSTINSTITUTION.isnull()], 'MOBILITYTYPE')

['P']


In [61]:
# check if every P mobility type student has a work placement set
students_2010[(students_2010.MOBILITYTYPE == 'P') & 
              (students_2010.WORKPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [62]:
print_unique_values(students_2010, 'COUNTRYOFWORKPLACEMENT')

['AT', 'BE', 'BENL', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK']


In [63]:
students_2010[(students_2010.COUNTRYOFWORKPLACEMENT == 'BENL')]

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
159514,UK PORTSMO01,22,F,DE,1,2,C,E GRANADA01,FRIEDRICH -EBERT-STIFFUNG EU OFFICE BRUSSELS,BENL,10.0,0.0,,01-2011,09-2010,60,0,60,0,3780.0,0.0,N


In [64]:
# check if every P mobility type student has a country of work placement set
students_2010[(students_2010.MOBILITYTYPE == 'P') & 
              (students_2010.COUNTRYOFWORKPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [65]:
print_unique_values(students_2010, 'SHORTDURATION')

['-', '0', '2', 'T', 'X', 'Y']


In [66]:
# check if all mobility students with a duration less than 3 months have a short duration type set
students_2010[(students_2010.MOBILITYTYPE != 'P') & 
              (students_2010.LENGTHSTUDYPERIOD < 3) & 
              (students_2010.SHORTDURATION.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [67]:
print_unique_values(students_2010, 'STUDYSTARTDATE')

['01-2010', '01-2011', '01/2010', '01/2011', '02-2010', '02-2011', '02/2010', '02/2011', '03-2010', '03-2011', '03/2011', '04-2011', '04/2011', '05-2011', '05/2011', '06-2010', '06-2011', '06/2011', '07-2010', '07-2011', '07/2010', '07/2011', '08-2010', '08-2011', '08/2010', '09-2010', '09-2011', '09/2010', '09/2011', '10-2010', '10/2010', '11-2010', '11/2010', '12-2010', '12/2010']


In [68]:
# all mobility student types (S & C) should have a study start date set
students_2010[(students_2010.MOBILITYTYPE != 'P') & 
              (students_2010.STUDYSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [69]:
print_unique_values(students_2010, 'PLACEMENTSTARTDATE')

['01-2011', '01/2011', '02-2011', '02/2011', '03-2011', '03/2011', '04-2011', '04/2011', '05-2011', '05-2012', '05/2011', '06-2010', '06-2011', '06-2012', '06/2010', '06/2011', '07-2010', '07-2011', '07-2012', '07-2013', '07-2014', '07-2015', '07-2016', '07/2010', '07/2011', '08-2010', '08-2011', '08/2010', '08/2011', '09-2010', '09-2011', '09/2010', '10-2010', '10-2011', '10/2010', '11-2010', '11-2011', '11/2010', '12-2010', '12-2011', '12/2010']


In [70]:
# all placement students (P) should have a placement start date set
students_2010[(students_2010.MOBILITYTYPE == 'P') & 
              (students_2010.PLACEMENTSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [71]:
print_unique_values(students_2010, 'PREVIOUSPARTICIPATION')
students_2010[students_2010.PREVIOUSPARTICIPATION.isnull()]

['M', 'N', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,WORKPLACEMENT,COUNTRYOFWORKPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSWORK,TOTALECTSCREDITS,SEVSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


#### Outcome of data quality assessment for 2010 data

Similar data quality issues to 2008 were also found in the 2010 dataset:
* Nationality: XX is not a valid country code and will be replaced with NULL
* Work placement / Country of work placement: the work placement itself may be NULL (although the check above found no such placement student in 2010), the country of work placement is always set
* Study start date: inconsistent formats found (mm/yyyy and mm-yyyy)
* Placement start date: inconsistent formats found (mm/yyyy and mm-yyyy); also invalid dates found (e.g. 07-2014), where logically only dates from 2010 and 2011 make sense

Additional data quality issues found in the 2010 data:
* Country of work placement: should be an ISO2 code, but there was also one occurrence of a concatenation of two ISO2 codes (BENL); every code that does not have exactly 2 characters will be filtered out

Next step is to clean the 2010 data:

In [72]:
# work on a copy so that the assessed dataframe is not modified by the cleaning steps
students_2010_cleaned = students_2010.copy()
students_2010_cleaned.NATIONALITY = students_2010_cleaned.NATIONALITY.replace({'XX': None})
students_2010_cleaned.STUDYSTARTDATE = students_2010_cleaned.STUDYSTARTDATE.str.replace('/', '-')
students_2010_cleaned.PLACEMENTSTARTDATE = students_2010_cleaned.PLACEMENTSTARTDATE.str.replace('/', '-')
# how many rows did we have in the beginning?
students_2010_cleaned.shape[0]

231408

In [73]:
# filtering rows with invalid study start dates
students_2010_cleaned = students_2010_cleaned[
    (students_2010_cleaned.STUDYSTARTDATE.str.slice(3,7).isin(['2010', '2011'])) |
    (students_2010_cleaned.STUDYSTARTDATE.isnull())]
students_2010_cleaned.shape[0]

231408

In [74]:
# filtering rows with invalid placement start dates
students_2010_cleaned = students_2010_cleaned[
    (students_2010_cleaned.PLACEMENTSTARTDATE.str.slice(3,7).isin(['2010', '2011'])) |
    (students_2010_cleaned.PLACEMENTSTARTDATE.isnull())]
students_2010_cleaned.shape[0]

231401

In [75]:
# filtering rows with invalid country codes
students_2010_cleaned = students_2010_cleaned[
    (students_2010_cleaned.COUNTRYOFWORKPLACEMENT.str.len() == 2) |
    (students_2010_cleaned.COUNTRYOFWORKPLACEMENT.isnull())]
students_2010_cleaned.shape[0]

231400
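
The length check above accepts any two-character value. A slightly stricter variant, shown here only as a sketch on hypothetical sample values, could additionally require two upper-case letters. It still only validates the shape of the code, not whether the code is actually assigned to a country.

```python
import pandas as pd

# hypothetical sample values, including the concatenated code seen above
codes = pd.Series(['DE', 'BENL', None, 'at', 'NO'],
                  name='COUNTRYOFWORKPLACEMENT')

# exactly two upper-case letters; na=False treats missing values as "no match"
is_iso2_shaped = codes.str.match(r'^[A-Z]{2}$', na=False)

# missing values stay allowed, exactly as in the length-based filter
keep = is_iso2_shaped | codes.isnull()
print(codes[keep])
```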

### CSV file for 2011-12
Read data for 2011-12 and assess columns.

In [76]:
students_2011 = pd.read_csv('raw_data/student_1112.csv', sep=';')
students_2011.info()
# we can see that the loaded dataframe for 2011 has 32 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252827 entries, 0 to 252826
Data columns (total 32 columns):
HOMEINSTITUTION                 252827 non-null object
COUNTRYCODEOFHOMEINSTITUTION    252827 non-null object
AGE                             252827 non-null int64
GENDER                          252827 non-null object
NATIONALITY                     252827 non-null object
SUBJECTAREA                     252827 non-null int64
LEVELSTUDY                      252827 non-null object
YEARSPRIOR                      252827 non-null int64
MOBILITYTYPE                    252827 non-null object
HOSTINSTITUTION                 204756 non-null object
COUNTRYCODEOFHOSTINSTITUTION    204756 non-null object
PLACEMENTENTERPRISE             48142 non-null object
COUNTRYOFPLACEMENT              48144 non-null object
ENTERPRISESIZE                  48144 non-null object
TYPEPLACEMENTSECTOR             48144 non-null object
LENGTHSTUDYPERIOD               252827 non-null float64
LENGTHPLACEMENT

In [77]:
students_2011.head(10)

Unnamed: 0,HOMEINSTITUTION,COUNTRYCODEOFHOMEINSTITUTION,AGE,GENDER,NATIONALITY,SUBJECTAREA,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,COUNTRYCODEOFHOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,ENTERPRISESIZE,TYPEPLACEMENTSECTOR,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,CONSORTIUMAGREEMENTNUMBER,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,TAUGHTHOSTLANG,LANGUAGETAUGHT,LINGPREPARATION,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION,QUALIFICATIONATHOST
0,D BERLIN14,DE,27,F,DE,461,1,4,S,S JONKOPI01,SE,,,,,4.0,0.0,,Aug-11,,,15,0,15,0.0,N,EN,HM,520.0,0.0,N,N
1,D BERLIN14,DE,27,F,XX,582,2,4,S,SF HELSINK41,FI,,,,,5.0,0.0,,Oct-11,,,30,0,30,0.0,N,EN,HM,650.0,0.0,N,J
2,D BERLIN14,DE,24,F,DE,342,2,4,S,S JONKOPI01,SE,,,,,5.0,0.0,,Aug-11,,,23,0,23,0.0,N,EN,HM,650.0,0.0,N,N
3,D BERLIN14,DE,28,M,DE,314,1,3,S,UK LONDON062,UK,,,,,3.5,0.0,,Sep-11,,,27,0,27,0.0,Y,EN,HM,455.0,0.0,N,N
4,D BERLIN14,DE,22,F,DE,314,1,2,S,E TENERIF01,ES,,,,,10.0,0.0,,Sep-11,,,47,0,47,0.0,Y,ES,HM,1300.0,0.0,N,N
5,D BERLIN14,DE,24,M,DE,314,2,4,S,IRLDUBLIN27,IE,,,,,5.0,0.0,,Jan-12,,,20,0,20,0.0,Y,EN,HM,650.0,0.0,N,N
6,D BERLIN14,DE,28,F,DE,314,2,5,S,S GOTEBOR01,SE,,,,,9.5,0.0,,Sep-11,,,58,0,58,0.0,N,EN,HM,1235.0,0.0,N,N
7,D BERLIN14,DE,23,M,DE,422,1,2,S,IRLDUNDALK01,IE,,,,,4.5,0.0,,Sep-11,,,15,0,15,0.0,Y,EN,HM,585.0,0.0,N,N
8,D BERLIN14,DE,28,M,DE,481,2,5,S,TR ISTANBU16,TR,,,,,4.0,0.0,,Jan-12,,,24,0,24,0.0,N,EN,HM,520.0,0.0,N,N
9,D BERLIN18,DE,26,F,DE,214,1,4,S,TR ISTANBU05,TR,,,,,4.0,0.0,,Feb-12,,,30,0,30,0.0,Y,TR,NN,1162.08,0.0,N,N


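The 2011 data has the same 32 columns as the 2008 data, but several are named differently (e.g. PLACEMENTENTERPRISE, COUNTRYOFPLACEMENT, ECTSCREDITSPLACEMENT) and some formats seem different, e.g. for the date columns. Let's have a look at the data quality of the selected columns we will use for analytics.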

In [78]:
students_2011 = students_2011[[
    'HOMEINSTITUTION',
    'AGE',
    'GENDER',
    'NATIONALITY',
    'LEVELSTUDY',
    'YEARSPRIOR',
    'MOBILITYTYPE',
    'HOSTINSTITUTION',
    'PLACEMENTENTERPRISE',
    'COUNTRYOFPLACEMENT',
    'LENGTHSTUDYPERIOD',
    'LENGTHPLACEMENT',
    'SHORTDURATION',
    'STUDYSTARTDATE',
    'PLACEMENTSTARTDATE',
    'ECTSCREDITSSTUDY',
    'ECTSCREDITSPLACEMENT',
    'TOTALECTSCREDITS',
    'SNSUPPLEMENT',
    'STUDYGRANT',
    'PLACEMENTGRANT',
    'PREVIOUSPARTICIPATION'
]]
students_2011.describe()

Unnamed: 0,AGE,YEARSPRIOR,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT
count,252827.0,252827.0,252827.0,252827.0,252827.0,252827.0,252827.0,252827.0,252827.0,252827.0
mean,22.522282,2.727929,5.105015,0.825524,27.925407,2.67386,30.599267,3.127733,1186.403082,294.686539
std,2.629492,1.313078,3.312576,1.914373,20.998558,8.686028,19.158514,155.438652,994.663336,709.921101
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,2.0,4.0,0.0,11.0,0.0,19.0,0.0,533.3,0.0
50%,22.0,2.0,5.0,0.0,30.0,0.0,30.0,0.0,1072.5,0.0
75%,23.0,3.0,8.5,0.0,37.0,0.0,39.0,0.0,1674.0,0.0
max,83.0,20.0,13.25,12.0,90.0,90.0,90.0,41777.74,7730.0,9672.0
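
Because the 2011 file names some of the selected columns differently than the 2010 file, the names have to be aligned before the years can be combined. A minimal sketch of such a mapping follows. The target names are an assumption (taken from the 2010 selection), and treating SNSUPPLEMENT as the counterpart of SEVSUPPLEMENT is assumed only from its position in the file; the one-row frame is hypothetical.

```python
import pandas as pd

# assumed mapping from the 2011 column names to the names of the 2010 selection
RENAME_2011_TO_2010 = {
    'PLACEMENTENTERPRISE': 'WORKPLACEMENT',
    'COUNTRYOFPLACEMENT': 'COUNTRYOFWORKPLACEMENT',
    'ECTSCREDITSPLACEMENT': 'ECTSCREDITSWORK',
    'SNSUPPLEMENT': 'SEVSUPPLEMENT',  # assumption based on column position
}

# hypothetical one-row frame containing only the affected columns
sample_2011 = pd.DataFrame({
    'PLACEMENTENTERPRISE': ['EXAMPLE ENTERPRISE'],
    'COUNTRYOFPLACEMENT': ['FR'],
    'ECTSCREDITSPLACEMENT': [30],
    'SNSUPPLEMENT': [0.0],
})

# rename returns a new frame; columns not listed in the mapping are left as-is
harmonised = sample_2011.rename(columns=RENAME_2011_TO_2010)
print(list(harmonised.columns))
```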


In [79]:
print_unique_values(students_2011, 'HOMEINSTITUTION')
students_2011[students_2011.HOMEINSTITUTION.isnull()]

['A  BADEN01', 'A  DORNBIR01', 'A  EISENST01', 'A  EISENST02', 'A  FELDKIR01', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU21', 'A  INNSBRU23', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KLAGENF06', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ11', 'A  LINZ17', 'A  LINZ19', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SCHLAIN01', 'A  SPITTAL01', 'A  ST-POLT03', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIEN66', 'A  WIENER01', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  ARLON08', 'B  ARLON09', 'B  BRUGGE11', 'B  BRUSSEL

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [80]:
print_unique_values(students_2011, 'GENDER')
students_2011[students_2011.GENDER.isnull()]

['F', 'M']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [81]:
print_unique_values(students_2011, 'NATIONALITY')
students_2011[students_2011.NATIONALITY.isnull()]

['AT', 'BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK', 'XX']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [82]:
students_2011.LEVELSTUDY = students_2011.LEVELSTUDY.astype(str)
print_unique_values(students_2011, 'LEVELSTUDY')
students_2011[students_2011.LEVELSTUDY.isnull()]

['1', '2', '3', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [83]:
print_unique_values(students_2011, 'MOBILITYTYPE')
students_2011[students_2011.MOBILITYTYPE.isnull()]

['C', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [84]:
# host institution should be available for mobility types S & C, check if this is the case
print_unique_values(students_2011[students_2011.HOSTINSTITUTION.isnull()], 'MOBILITYTYPE')

['P']


In [85]:
# check if every P mobility type student has a work placement set
students_2011[(students_2011.MOBILITYTYPE == 'P') & 
              (students_2011.PLACEMENTENTERPRISE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
132695,UK BOGNO-R01,21,F,UK,2,1,P,,,FR,0.0,4.75,,,Dec-11,0,30,30,0.0,0.0,1767.0,N
182764,UK SWANSEA01,22,M,UK,1,2,P,,,FR,0.0,7.75,,,Oct-11,0,60,60,0.0,0.0,2883.0,N


In [86]:
print_unique_values(students_2011, 'COUNTRYOFPLACEMENT')

['AT', 'BE', 'BEFR', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK']


In [87]:
students_2011[(students_2011.COUNTRYOFPLACEMENT == 'BEFR')]

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION
7574,S MALMO01,28,F,XX,2,3,C,NL UTRECHT01,EUROPEAN INSTITUTE FOR ASIAN STUDIES,BEFR,8.0,0.0,,Sep-11,Feb-12,30,0,30,0.0,3842.5,0.0,S


In [88]:
# check if every P mobility type student has a country of work placement set
students_2011[(students_2011.MOBILITYTYPE == 'P') & 
              (students_2011.COUNTRYOFPLACEMENT.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [89]:
print_unique_values(students_2011, 'SHORTDURATION')

['-', '0', 'N', 'S', 'T', 'X']


In [90]:
# check if all mobility students with a duration less than 3 months have a short duration type set
students_2011[(students_2011.MOBILITYTYPE != 'P') & 
              (students_2011.LENGTHSTUDYPERIOD < 3) & 
              (students_2011.SHORTDURATION.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [91]:
print_unique_values(students_2011, 'STUDYSTARTDATE')

['Apr-12', 'Aug-11', 'Aug-12', 'Dec-11', 'Feb-12', 'Jan-12', 'Jul-11', 'Jul-12', 'Jun-11', 'Jun-12', 'Mar-12', 'May-12', 'Nov-11', 'Oct-11', 'Sep-11', 'Sep-12']


In [92]:
# all mobility student types (S & C) should have a study start date set
students_2011[(students_2011.MOBILITYTYPE != 'P') & 
              (students_2011.STUDYSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [93]:
print_unique_values(students_2011, 'PLACEMENTSTARTDATE')

['Apr-12', 'Aug-11', 'Aug-12', 'Dec-11', 'Feb-12', 'Jan-12', 'Jul-11', 'Jul-12', 'Jun-11', 'Jun-12', 'Mar-12', 'May-12', 'Nov-11', 'Oct-11', 'Sep-11', 'Sep-12']


In [94]:
# all placement students (P) should have a placement start date set
students_2011[(students_2011.MOBILITYTYPE == 'P') & 
              (students_2011.PLACEMENTSTARTDATE.isnull())].head()

Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


In [95]:
print_unique_values(students_2011, 'PREVIOUSPARTICIPATION')
students_2011[students_2011.PREVIOUSPARTICIPATION.isnull()]

['M', 'N', 'P', 'S']


Unnamed: 0,HOMEINSTITUTION,AGE,GENDER,NATIONALITY,LEVELSTUDY,YEARSPRIOR,MOBILITYTYPE,HOSTINSTITUTION,PLACEMENTENTERPRISE,COUNTRYOFPLACEMENT,LENGTHSTUDYPERIOD,LENGTHPLACEMENT,SHORTDURATION,STUDYSTARTDATE,PLACEMENTSTARTDATE,ECTSCREDITSSTUDY,ECTSCREDITSPLACEMENT,TOTALECTSCREDITS,SNSUPPLEMENT,STUDYGRANT,PLACEMENTGRANT,PREVIOUSPARTICIPATION


#### Outcome of data quality assessment for 2011 data

Data quality issues similar to those in the 2008 data were also found in the 2011 dataset:
* Nationality: XX is not a valid country code and will be replaced with NULL
* Work placement / Country of work placement: Work placement itself might be NULL, but Country of work placement is always set
* Study start date: consistent format, but different from the previous one, so it needs to be converted
* Placement start date: consistent format, but different from the previous one, so it needs to be converted

Additional data quality issues found in the 2011 data:
* Country of work placement: should be an ISO2 code, but there was also an occurrence of a concatenation of two ISO2 codes; every row whose code does not have exactly 2 characters will be filtered out

Next step is to clean the 2011 data:

In [96]:
students_2011_cleaned = students_2011.copy()  # work on a copy so the raw frame stays untouched
students_2011_cleaned.NATIONALITY = students_2011_cleaned.NATIONALITY.replace({'XX': None})

# how many rows did we have in the beginning?
students_2011_cleaned.shape[0]

252827

In [97]:
# date conversion
def convert_date(date):
    try: 
        return pd.to_datetime(date).strftime('%m-%Y')
    except ValueError:
        return None

students_2011_cleaned.STUDYSTARTDATE = students_2011_cleaned.STUDYSTARTDATE.str.replace('-', '-20')
students_2011_cleaned.STUDYSTARTDATE = students_2011_cleaned.STUDYSTARTDATE.apply(convert_date)
students_2011_cleaned.PLACEMENTSTARTDATE = students_2011_cleaned.PLACEMENTSTARTDATE.str.replace('-', '-20')
students_2011_cleaned.PLACEMENTSTARTDATE = students_2011_cleaned.PLACEMENTSTARTDATE.apply(convert_date)
students_2011_cleaned.shape[0]

252827
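The string replacement followed by `convert_date` can also be written as one vectorized call with an explicit format. The following is only a minimal sketch, assuming every valid date follows the `Mon-yy` pattern seen above (`%b-%y`) and that month abbreviations are parsed in an English locale. The helper name `to_month_year` and the sample values are made up for illustration and are not part of the notebook.

```python
import pandas as pd

def to_month_year(dates: pd.Series) -> pd.Series:
    """Convert 'Sep-11' style strings to '09-2011'; unparseable values become missing."""
    # errors='coerce' turns anything that does not match the format into NaT
    parsed = pd.to_datetime(dates, format='%b-%y', errors='coerce')
    # NaT entries come out of strftime as missing values
    return parsed.dt.strftime('%m-%Y')

sample = pd.Series(['Sep-11', 'Jan-12', '00/0000', None])
converted = to_month_year(sample)
print(converted.tolist())
```

With an explicit format there is no need to patch the year with `str.replace('-', '-20')`, and malformed values are handled without a `try`/`except` per row.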

In [98]:
# filtering rows with invalid study start dates
students_2011_cleaned = students_2011_cleaned[
    (students_2011_cleaned.STUDYSTARTDATE.str.slice(3,7).isin(['2011', '2012'])) |
    (students_2011_cleaned.STUDYSTARTDATE.isnull())]
students_2011_cleaned.shape[0]

252827

In [99]:
# filtering rows with invalid placement start dates
students_2011_cleaned = students_2011_cleaned[
    (students_2011_cleaned.PLACEMENTSTARTDATE.str.slice(3,7).isin(['2011', '2012'])) |
    (students_2011_cleaned.PLACEMENTSTARTDATE.isnull())]
students_2011_cleaned.shape[0]

252827

In [100]:
# filtering rows with invalid country codes
students_2011_cleaned = students_2011_cleaned[
    (students_2011_cleaned.COUNTRYOFPLACEMENT.str.len() == 2) |
    (students_2011_cleaned.COUNTRYOFPLACEMENT.isnull())]
students_2011_cleaned.shape[0]

252826
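Each filtering cell above follows the same pattern: apply a boolean mask, then print the remaining row count. A small helper could make the number of dropped rows explicit. This is a sketch only; `keep_rows` is a hypothetical name and the toy frame is made-up data, not an extract of the 2011 file.

```python
import pandas as pd

def keep_rows(df: pd.DataFrame, mask: pd.Series, reason: str) -> pd.DataFrame:
    """Keep the rows where mask is True and report how many rows were dropped."""
    kept = df[mask]
    print(f'{reason}: dropped {len(df) - len(kept)} of {len(df)} rows')
    return kept

# Toy data: one valid code, one concatenated code, one missing value, one valid code
toy = pd.DataFrame({'COUNTRYOFPLACEMENT': ['DE', 'BENL', None, 'FR']})
valid_code = toy.COUNTRYOFPLACEMENT.str.len().eq(2) | toy.COUNTRYOFPLACEMENT.isnull()
toy_cleaned = keep_rows(toy, valid_code, 'invalid country code')
```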

### CSV file for 2012-13
Read data for 2012-13 and assess columns.

In [101]:
students_2012 = pd.read_csv('raw_data/SM_2012_13_20141103_01.csv', sep=';', encoding = 'ISO-8859-1')
students_2012.info()
# we can see that the loaded dataframe for 2012 has 34 columns


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267547 entries, 0 to 267546
Data columns (total 34 columns):
STUDENT_ID                        267547 non-null object
ID_MOBILITY_CDE                   267547 non-null object
HOME_INSTITUTION_CDE              267547 non-null object
HOME_INSTITUTION_CTRY_CDE         267547 non-null object
STUDENT_AGE_VALUE                 267547 non-null int64
STUDENT_GENDER_CDE                267547 non-null object
STUDENT_NATIONALITY_CDE           267547 non-null object
STUDENT_SUBJECT_AREA_VALUE        267547 non-null int64
STUDENT_STUDY_LEVEL_CDE           267547 non-null object
NUMB_YRS_HIGHER_EDUCAT_VALUE      267547 non-null int64
MOBILITY_TYPE_CDE                 267547 non-null object
HOST_INSTITUTION_CDE              267547 non-null object
HOST_INSTITUTION_COUNTRY_CDE      267547 non-null object
PLACEMENT_ENTERPRISE_VALUE        267547 non-null object
PLACEMENT_ENTERPRISE_CTRY_CDE     267544 non-null object
PLACEMENT_ENTERPRISE_SIZE_CDE     267

In [102]:
students_2012.head(10)

Unnamed: 0,STUDENT_ID,ID_MOBILITY_CDE,HOME_INSTITUTION_CDE,HOME_INSTITUTION_CTRY_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_SUBJECT_AREA_VALUE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,HOST_INSTITUTION_COUNTRY_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,PLACEMENT_ENTERPRISE_SIZE_CDE,TYPE_PLACEMENT_SECTOR_VALUE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,CONSORTIUM_AGREEMENT_NUMBER,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,TAUGHT_HOST_LANGUAGE_CDE,LANGUAGE_TAUGHT_CDE,LINGUISTIC_PREPARATION_CDE,STUDY_GRANT_AMT,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE,QUALIFICATION_AT_HOST_CDE
0,47099714S,ES/SM/07/000456,E ALICANT01,ES,21,F,ES,222,1,3,S,UK BATH01,UK,? Unknown ?,???,?,? Unknown,2.75,0.0,T,Sep-12,00/0000,? Unknown ?,24.0,0.0,24.0,0.0,Y,EN,HM,355.84,0,N,N
1,47397349F,ES/SM/07/000459,E ALICANT01,ES,22,M,ES,58,1,4,S,P LISBOA04,PT,? Unknown ?,???,?,? Unknown,10.75,0.0,?,Sep-12,00/0000,? Unknown ?,58.0,0.0,58.0,0.0,Y,PT,HM,1386.43,0,N,N
2,47397741P,ES/SM/07/000462,E ALICANT01,ES,20,F,ES,222,1,2,S,F XX,FR,? Unknown ?,???,?,? Unknown,10.0,0.0,?,Sep-12,00/0000,? Unknown ?,54.0,0.0,54.0,0.0,Y,FR,HM,1289.7,0,N,N
3,47447533M,ES/SM/07/000465,E ALICANT01,ES,20,F,ES,223,1,2,S,F PARIS004,FR,? Unknown ?,???,?,? Unknown,9.25,0.0,?,Sep-12,00/0000,? Unknown ?,57.0,0.0,57.0,0.0,Y,FR,HM,1192.97,0,N,N
4,47627846K,ES/SM/07/000468,E ALICANT01,ES,22,F,ES,34,1,4,S,UK LONDON062,UK,? Unknown ?,???,?,? Unknown,5.0,0.0,?,Jan-13,00/0000,? Unknown ?,30.0,0.0,30.0,0.0,Y,EN,HM,644.85,0,N,N
5,48290208Y,ES/SM/07/000471,E ALICANT01,ES,32,M,ES,222,1,2,S,P LISBOA02,PT,? Unknown ?,???,?,? Unknown,6.0,0.0,?,Feb-13,00/0000,? Unknown ?,30.0,0.0,30.0,0.0,Y,PT,HM,773.82,0,N,N
6,48300804E,ES/SM/07/000474,E ALICANT01,ES,34,F,ES,314,1,10,S,I VENEZIA01,IT,? Unknown ?,???,?,? Unknown,13.0,0.0,?,Aug-12,00/0000,? Unknown ?,30.0,0.0,30.0,0.0,Y,IT,EC,1847.64,0,N,N
7,48328761B,ES/SM/07/000477,E ALICANT01,ES,23,M,ES,34,1,4,S,F PARIS105,FR,? Unknown ?,???,?,? Unknown,4.25,0.0,?,Jan-13,00/0000,? Unknown ?,27.0,0.0,27.0,0.0,Y,FR,HM,548.12,0,N,N
8,48330650Z,ES/SM/07/000480,E ALICANT01,ES,24,M,ES,38,1,5,S,PL BIALYST04,PL,? Unknown ?,???,?,? Unknown,9.5,0.0,?,Sep-12,00/0000,? Unknown ?,48.0,0.0,48.0,0.0,Y,PL,HM,1225.22,0,N,N
9,48333028T,ES/SM/07/000483,E ALICANT01,ES,25,M,ES,345,2,1,S,S GOTEBOR01,SE,? Unknown ?,???,?,? Unknown,4.0,0.0,?,Aug-12,00/0000,? Unknown ?,23.0,0.0,23.0,0.0,Y,SV,HM,515.88,0,N,N


The 2012 data uses different column names from the other years. Some of the columns also seem to use different formats, and '?' often seems to be used in place of NULL values.
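One way to deal with such placeholders is to declare them as missing values while reading the file, using the `na_values` parameter of `pd.read_csv`. The sketch below uses a made-up two-row extract, and the placeholder list is an assumption based only on the strings visible in the output above. Other variants, such as `? Unknown` without the trailing question mark, would need their own entries.

```python
import io
import pandas as pd

# Made-up extract that mimics the placeholder strings seen in the 2012 output
raw = io.StringIO(
    'HOST_INSTITUTION_CDE;PLACEMENT_ENTERPRISE_VALUE;SHORT_DURATION_CDE;PLACEMENT_START_DATE\n'
    'UK BATH01;? Unknown ?;T;00/0000\n'
    '???;Some Enterprise;?;Jan-13\n'
)

# Each string is matched against the whole field, in every column
placeholders = ['?', '???', '? Unknown ?', '00/0000']
toy_2012 = pd.read_csv(raw, sep=';', na_values=placeholders)
print(toy_2012.isnull().sum())
```

Because `na_values` is applied to all columns here, it only suits placeholders that can never be a legitimate value anywhere in the file.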

Let's look at the data quality of the columns selected for analytics to find out how they need to be converted before they can be merged with the data from the other years.

In [103]:
students_2012 = students_2012[[
    'HOME_INSTITUTION_CDE',
    'STUDENT_AGE_VALUE',
    'STUDENT_GENDER_CDE',
    'STUDENT_NATIONALITY_CDE',
    'STUDENT_STUDY_LEVEL_CDE',
    'NUMB_YRS_HIGHER_EDUCAT_VALUE',
    'MOBILITY_TYPE_CDE',
    'HOST_INSTITUTION_CDE',
    'PLACEMENT_ENTERPRISE_VALUE',
    'PLACEMENT_ENTERPRISE_CTRY_CDE',
    'LENGTH_STUDY_PERIOD_VALUE',
    'LENGTH_PLACEMENT_VALUE',
    'SHORT_DURATION_CDE',
    'STUDY_START_DATE',
    'PLACEMENT_START_DATE',
    'ECTS_CREDITS_STUDY_AMT',
    'ECTS_CREDITS_PLACEMENT_AMT',
    'TOTAL_ECTS_CREDITS_AMT',
    'SPECIAL_NEEDS_SUPPLEMENT_VALUE',
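    'SPECIAL_NEEDS_SUPPLEMENT_VALUE',  # listed twice (the '.1' column in the outputs); this position is later renamed to study_grant, so 'STUDY_GRANT_AMT' was probably intended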
    'PLACEMENT_GRANT_AMT',
    'PREVIOUS_PARTICIPATION_CDE'
]]
students_2012.describe()

Unnamed: 0,STUDENT_AGE_VALUE,NUMB_YRS_HIGHER_EDUCAT_VALUE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1
count,267547.0,267547.0,267511.0,267544.0,267544.0,267544.0,267544.0,267544.0,267544.0
mean,22.513058,2.696842,4.91642,0.897189,27.165128,2.916432,30.081486,2.522737,2.522737
std,2.702744,1.306263,3.312465,1.976879,20.727723,8.988954,18.761374,92.182059,92.182059
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,2.0,3.5,0.0,8.0,0.0,19.0,0.0,0.0
50%,22.0,2.0,5.0,0.0,30.0,0.0,30.0,0.0,0.0
75%,24.0,3.0,8.0,0.0,35.0,0.0,36.0,0.0,0.0
max,93.0,20.0,13.0,13.0,90.0,90.0,90.0,12952.88,12952.88


In [104]:
print_unique_values(students_2012, 'HOME_INSTITUTION_CDE')
students_2012[students_2012.HOME_INSTITUTION_CDE.isnull()]

['A  BADEN01', 'A  DORNBIR01', 'A  EISENST01', 'A  EISENST02', 'A  EISENST05', 'A  FELDKIR01', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU20', 'A  INNSBRU21', 'A  INNSBRU23', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KLAGENF06', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ11', 'A  LINZ17', 'A  LINZ23', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SCHLAIN01', 'A  SPITTAL01', 'A  ST-POLT03', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIEN66', 'A  WIENER01', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  AR

Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE


In [105]:
print_unique_values(students_2012, 'STUDENT_GENDER_CDE')
students_2012[students_2012.STUDENT_GENDER_CDE.isnull()]

['F', 'M']


Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE


In [106]:
print_unique_values(students_2012, 'STUDENT_NATIONALITY_CDE')
students_2012[students_2012.STUDENT_NATIONALITY_CDE.isnull()]

['AT', 'BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'De', 'Dk', 'EE', 'ES', 'Es', 'FI', 'FR', 'GR', 'HR', 'HU', 'Hu', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MK', 'MT', 'NL', 'NO', 'Nl', 'No', 'PL', 'PT', 'Pl', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK', 'XX', 'de', 'es', 'pt', 'xx']


Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE


In [107]:
students_2012.STUDENT_STUDY_LEVEL_CDE = students_2012.STUDENT_STUDY_LEVEL_CDE.astype(str)
print_unique_values(students_2012, 'STUDENT_STUDY_LEVEL_CDE')
students_2012[students_2012.STUDENT_STUDY_LEVEL_CDE.isnull()]

['1', '2', '3', 'S']


Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE


In [108]:
print_unique_values(students_2012, 'MOBILITY_TYPE_CDE')
students_2012[students_2012.MOBILITY_TYPE_CDE.isnull()]

['C', 'P', 'S']


Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE


In [109]:
print_unique_values(students_2012, 'HOST_INSTITUTION_CDE')

['???', 'A  BADEN01', 'A  DORNBIR01', 'A  EISENST01', 'A  EISENST02', 'A  EISENST05', 'A  FELDKIR01', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  Graz09', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU21', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KLAGENF06', 'A  KREMS02', 'A  KREMS03', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ17', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SPITTAL01', 'A  ST-POLT03', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIEN66', 'A  WIEN68', 'A  WIENER01', 'A  WIENER04', 'A  XX', 'B  ANTWERP01', 'B  ANTWERP57', 'B  ANTWERP58', 'B  ANTWERP59', 'B  ANTWERP60', 'B  ANTWERP61', 'B  ARLON

In [110]:
print_unique_values(students_2012, 'PLACEMENT_ENTERPRISE_CTRY_CDE')

['???', 'AT', 'BE', 'BENL', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK']


In [111]:
students_2012[(students_2012.PLACEMENT_ENTERPRISE_CTRY_CDE == 'BENL')]

Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE
19970,E MURCIA05,22,M,ES,1,3,P,???,Widar Dorpsgemeenschap,BENL,0.0,4.25,?,00/0000,Jan-13,0.0,30.0,30.0,0.0,0.0,1275,N


In [112]:
print_unique_values(students_2012, 'SHORT_DURATION_CDE')

['-', '0', '?', 'N', 'S', 'T', 'X', 'm', 'x']


In [113]:
print_unique_values(students_2012, 'STUDY_START_DATE')

['00/0000', '09/1012', 'Apr-12', 'Apr-13', 'Aug-12', 'Aug-13', 'Dec-12', 'Feb-12', 'Feb-13', 'Jan-12', 'Jan-13', 'Jul-12', 'Jul-13', 'Jun-12', 'Jun-13', 'Mar-12', 'Mar-13', 'May-12', 'May-13', 'Nov-12', 'Oct-12', 'Sep-12', 'Sep-13']


In [114]:
students_2012[(students_2012.STUDY_START_DATE == '09/1012')]

Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE
190993,NL MAASTRI01,23,F,DE,1,2,S,IRLGALWAY01,? Unknown ?,???,3.0,0.0,?,09/1012,00/0000,20.0,0.0,20.0,0.0,0.0,0,N


In [115]:
print_unique_values(students_2012, 'PLACEMENT_START_DATE')

['00/0000', 'Apr-12', 'Apr-13', 'Aug-12', 'Aug-13', 'Dec-12', 'Feb-13', 'Jan-12', 'Jan-13', 'Jul-12', 'Jul-13', 'Jun-12', 'Jun-13', 'Mar-12', 'Mar-13', 'May-13', 'Nov-12', 'Oct-12', 'Oct-13', 'Sep-12', 'Sep-13']


In [116]:
print_unique_values(students_2012, 'PREVIOUS_PARTICIPATION_CDE')
students_2012[students_2012.PREVIOUS_PARTICIPATION_CDE.isnull()]

['M', 'N', 'P', 'S']


Unnamed: 0,HOME_INSTITUTION_CDE,STUDENT_AGE_VALUE,STUDENT_GENDER_CDE,STUDENT_NATIONALITY_CDE,STUDENT_STUDY_LEVEL_CDE,NUMB_YRS_HIGHER_EDUCAT_VALUE,MOBILITY_TYPE_CDE,HOST_INSTITUTION_CDE,PLACEMENT_ENTERPRISE_VALUE,PLACEMENT_ENTERPRISE_CTRY_CDE,LENGTH_STUDY_PERIOD_VALUE,LENGTH_PLACEMENT_VALUE,SHORT_DURATION_CDE,STUDY_START_DATE,PLACEMENT_START_DATE,ECTS_CREDITS_STUDY_AMT,ECTS_CREDITS_PLACEMENT_AMT,TOTAL_ECTS_CREDITS_AMT,SPECIAL_NEEDS_SUPPLEMENT_VALUE,SPECIAL_NEEDS_SUPPLEMENT_VALUE.1,PLACEMENT_GRANT_AMT,PREVIOUS_PARTICIPATION_CDE
141021,PL SUWALKI03,21,M,LT,1,1,P,???,OLERTROS LOGISTIKA;LT;M;H;0;3;T;00/0000;07/201...,,,,,,,,,,,,,
203632,RO PETROSA01,20,M,RO,2,5,P,???,"K'' Line Espana Servicios Maritimos S.A., Vale...",,,,,,,,,,,,,
205047,RO PETROSA01,19,F,RO,1,1,P,???,"K'' Line Espana Servicios Maritimos S.A., Vale...",,,,,,,,,,,,,


#### Outcome of data quality assessment for 2012 data

Data quality issues found in the 2012 dataset:
* Nationality: not all country codes are upper case; XX is not a valid country code and will be replaced with NULL
* Home institution code: upper case is not used consistently
* Host institution code: upper case is not used consistently; '???' should be replaced with NULL
* Placement enterprise value: trim whitespace, replace '? Unknown ?' with NULL
* Placement enterprise country code: replace '???' with NULL, discard rows with country codes longer than 2 letters
* Short duration code: replace '?' with NULL
* Study start date: consistent format, but different from the previous one, so it needs to be converted; the default 00/0000 should be set to NULL
* Placement start date: consistent format, but different from the previous one, so it needs to be converted; the default 00/0000 should be set to NULL
* Previous participation code: should not be empty; drop these rows
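Some of the rules in this list can be turned into counts that should all be zero once the cleaning is done. The sketch below is only an illustration: `count_violations` is a hypothetical helper, the toy rows are made up, and only three of the columns are covered.

```python
import pandas as pd

def count_violations(df: pd.DataFrame) -> dict:
    """Count values that still break some of the rules listed above."""
    nationality = df.STUDENT_NATIONALITY_CDE.dropna()
    country = df.PLACEMENT_ENTERPRISE_CTRY_CDE.dropna()
    return {
        # codes that are not fully upper case
        'nationality_not_upper': int((nationality != nationality.str.upper()).sum()),
        # the invalid code XX in any casing
        'nationality_xx': int(nationality.str.upper().eq('XX').sum()),
        # country codes that do not have exactly two characters
        'country_not_iso2': int((country.str.len() != 2).sum()),
        # rows without a previous participation code
        'participation_missing': int(df.PREVIOUS_PARTICIPATION_CDE.isnull().sum()),
    }

toy = pd.DataFrame({
    'STUDENT_NATIONALITY_CDE': ['ES', 'de', 'xx', 'PT'],
    'PLACEMENT_ENTERPRISE_CTRY_CDE': ['???', 'BENL', 'FR', None],
    'PREVIOUS_PARTICIPATION_CDE': ['N', 'N', None, 'S'],
})
violations = count_violations(toy)
print(violations)
```

Running the same function before and after the cleaning cells would show which rules were actually enforced.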

Next step is to clean the 2012 data:

In [117]:
students_2012_cleaned = students_2012.copy()  # work on a copy so the raw frame stays untouched
students_2012_cleaned.STUDENT_NATIONALITY_CDE = students_2012_cleaned.STUDENT_NATIONALITY_CDE.str.upper().replace({'XX': None})
students_2012_cleaned.HOME_INSTITUTION_CDE = students_2012_cleaned.HOME_INSTITUTION_CDE.str.upper()
students_2012_cleaned.HOST_INSTITUTION_CDE = students_2012_cleaned.HOST_INSTITUTION_CDE.str.upper().replace({'???': None})
students_2012_cleaned.PLACEMENT_ENTERPRISE_VALUE = students_2012_cleaned.PLACEMENT_ENTERPRISE_VALUE.str.strip().replace({'? Unknown ?': None})
students_2012_cleaned.PLACEMENT_ENTERPRISE_CTRY_CDE = students_2012_cleaned.PLACEMENT_ENTERPRISE_CTRY_CDE.str.upper().replace({'???': None})
students_2012_cleaned.SHORT_DURATION_CDE = students_2012_cleaned.SHORT_DURATION_CDE.str.upper().replace({'?': None})
# how many rows did we have in the beginning?
students_2012_cleaned.shape[0]

267547

In [118]:
# date conversion
def convert_date(date):
    try: 
        return pd.to_datetime(date).strftime('%m-%Y')
    except ValueError:
        return None

students_2012_cleaned.STUDY_START_DATE = students_2012_cleaned.STUDY_START_DATE.str.replace('-', '-20')
students_2012_cleaned.STUDY_START_DATE = students_2012_cleaned.STUDY_START_DATE.apply(convert_date)
students_2012_cleaned.PLACEMENT_START_DATE = students_2012_cleaned.PLACEMENT_START_DATE.str.replace('-', '-20')
students_2012_cleaned.PLACEMENT_START_DATE = students_2012_cleaned.PLACEMENT_START_DATE.apply(convert_date)
students_2012_cleaned.shape[0]

267547

In [119]:
# filtering rows with invalid study start dates
students_2012_cleaned = students_2012_cleaned[
    (students_2012_cleaned.STUDY_START_DATE.str.slice(3,7).isin(['2012', '2013'])) |
    (students_2012_cleaned.STUDY_START_DATE.isnull())]
students_2012_cleaned.shape[0]

267547

In [120]:
# filtering rows with invalid placement start dates
students_2012_cleaned = students_2012_cleaned[
    (students_2012_cleaned.PLACEMENT_START_DATE.str.slice(3,7).isin(['2012', '2013'])) |
    (students_2012_cleaned.PLACEMENT_START_DATE.isnull())]
students_2012_cleaned.shape[0]

267547

In [121]:
# filtering rows with invalid country codes
students_2012_cleaned = students_2012_cleaned[
    (students_2012_cleaned.PLACEMENT_ENTERPRISE_CTRY_CDE.str.len() == 2) |
    (students_2012_cleaned.PLACEMENT_ENTERPRISE_CTRY_CDE.isnull())]
students_2012_cleaned.shape[0]

267546

In [122]:
# filtering rows with invalid previous participation codes
students_2012_cleaned = students_2012_cleaned[students_2012_cleaned.PREVIOUS_PARTICIPATION_CDE.notnull()]
students_2012_cleaned.shape[0]

267543

### Combine all years and write back to CSV

In [123]:
# rename column headers
column_names = ['home_institution',
           'age', 
           'gender',
           'nationality',
           'level_study',
           'years_prior',
           'mobility_type',
           'host_institution',
           'work_placement',
           'country_work_placement',
           'length_study',
           'length_work',
           'short_duration_reason',
           'study_start_date',
           'work_start_date',
           'ects_study',
           'ects_work',
           'ects_total',
           'sn_supplement',
           'study_grant',
           'work_grant',
           'previous_participation']
students_2008_cleaned.columns = column_names
students_2009_cleaned.columns = column_names
students_2010_cleaned.columns = column_names
students_2011_cleaned.columns = column_names
students_2012_cleaned.columns = column_names
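Assigning `.columns` positionally only works if every yearly frame lists its columns in exactly the same order. An explicit mapping per year is more verbose, but it fails loudly when a source column is missing. The sketch below shows the idea for four of the 2012 columns only; the dictionary `rename_2012` and the toy frame are illustrative and would have to be completed for real use.

```python
import pandas as pd

# Partial, illustrative mapping for the 2012 file
rename_2012 = {
    'HOME_INSTITUTION_CDE': 'home_institution',
    'STUDENT_AGE_VALUE': 'age',
    'STUDENT_GENDER_CDE': 'gender',
    'PLACEMENT_GRANT_AMT': 'work_grant',
}

# Toy frame with the source columns in a deliberately different order
toy = pd.DataFrame({
    'STUDENT_AGE_VALUE': [21],
    'HOME_INSTITUTION_CDE': ['E ALICANT01'],
    'PLACEMENT_GRANT_AMT': [0],
    'STUDENT_GENDER_CDE': ['F'],
})

# Fail early if a mapped source column does not exist
missing = set(rename_2012) - set(toy.columns)
assert not missing, f'columns missing from source: {missing}'

# Rename by name, then order the columns as given by the mapping
renamed = toy.rename(columns=rename_2012)[list(rename_2012.values())]
print(renamed.columns.tolist())
```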

In [124]:
# combine all data
student_df = pd.concat([
    students_2008_cleaned,
    students_2009_cleaned,
    students_2010_cleaned,
    students_2011_cleaned,
    students_2012_cleaned
], axis=0)

student_df.head()

Unnamed: 0,home_institution,age,gender,nationality,level_study,years_prior,mobility_type,host_institution,work_placement,country_work_placement,length_study,length_work,short_duration_reason,study_start_date,work_start_date,ects_study,ects_work,ects_total,sn_supplement,study_grant,work_grant,previous_participation
0,D KONSTAN01,25,M,DE,1,2,S,UK BATH01,,,4.0,0.0,,09-2008,,27.0,0.0,27.0,0.0,720.0,0,N
1,D KONSTAN01,24,M,DE,2,4,S,F PARIS007,,,8.0,0.0,,09-2008,,63.0,0.0,63.0,0.0,1440.0,0,N
2,D KONSTAN01,23,M,DE,1,2,S,F MARSEIL16,,,4.0,0.0,,09-2008,,15.0,0.0,15.0,0.0,720.0,0,N
3,D KONSTAN01,24,F,DE,1,3,S,E CORDOBA01,,,4.0,0.0,,09-2008,,30.0,0.0,30.0,0.0,720.0,0,N
4,EE TALLINN04,22,M,EE,1,2,S,SF LAHTI11,,,4.0,0.0,,09-2008,,45.0,0.0,45.0,0.0,1727.2,0,N
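Each yearly frame keeps its own row labels, so after `pd.concat` the same label can occur once per year. This does not affect the CSV written without an index, but it matters for any label-based selection on the combined frame. A minimal sketch with two made-up one-row frames, using the `ignore_index` parameter of `pd.concat`:

```python
import pandas as pd

toy_2008 = pd.DataFrame({'home_institution': ['D KONSTAN01'], 'age': [25]})
toy_2009 = pd.DataFrame({'home_institution': ['EE TALLINN04'], 'age': [22]})

# Without ignore_index both rows would carry the label 0
combined = pd.concat([toy_2008, toy_2009], axis=0, ignore_index=True)
print(combined.index.tolist())
```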


In [125]:
# write to csv
student_df.to_csv('cleaned_data/students_2008_2012.csv', sep=';', encoding='utf-8', index=False, header=True)
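A quick round trip can confirm that the written file is read back as expected. The sketch below writes a made-up three-column frame to an in-memory buffer instead of `cleaned_data/`, so it does not touch the real output. Reading with `dtype=str` is an assumption about what the later ETL step needs.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'home_institution': ['D KONSTAN01', 'EE TALLINN04'],
    'level_study': ['1', '2'],
    'work_start_date': [None, None],
})

buffer = io.StringIO()
toy.to_csv(buffer, sep=';', index=False, header=True)
buffer.seek(0)

# dtype=str reads every column as text, so codes such as '1' are not turned into numbers
restored = pd.read_csv(buffer, sep=';', dtype=str)
print(restored)
```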

## Assess JSON file containing institutions

In [126]:
euc = pd.read_json('raw_data/euc.json')

In [127]:
euc.describe()

Unnamed: 0,institutional_code,application_reference_number,organisation_name,country,city,code
count,4919,4919,4919,4919,4919,4919
unique,4919,4919,4892,34,2791,3
top,E ZAMORA06,29289-IC-1-2007-1-AT-ERASMUS-EUC-1,ACCADEMIA DI BELLE ARTI,ES,PARIS,EUCX
freq,1,1,4,1248,84,3183


In [128]:
euc.shape

(4919, 6)

In [129]:
print_unique_values(euc, 'institutional_code')

['A  BADEN01', 'A  DORNBIR01', 'A  EISENST01', 'A  EISENST02', 'A  EISENST05', 'A  FELDKIR01', 'A  FELDKIR03', 'A  GRAZ01', 'A  GRAZ02', 'A  GRAZ03', 'A  GRAZ04', 'A  GRAZ08', 'A  GRAZ09', 'A  GRAZ10', 'A  GRAZ23', 'A  INNSBRU01', 'A  INNSBRU03', 'A  INNSBRU08', 'A  INNSBRU09', 'A  INNSBRU20', 'A  INNSBRU21', 'A  INNSBRU23', 'A  KLAGENF01', 'A  KLAGENF02', 'A  KLAGENF06', 'A  KREMS02', 'A  KREMS03', 'A  KREMS05', 'A  KUFSTEI01', 'A  LEOBEN01', 'A  LINZ01', 'A  LINZ02', 'A  LINZ03', 'A  LINZ04', 'A  LINZ11', 'A  LINZ17', 'A  LINZ23', 'A  SALZBUR01', 'A  SALZBUR02', 'A  SALZBUR03', 'A  SALZBUR08', 'A  SALZBUR18', 'A  SCHLAIN01', 'A  SPITTAL01', 'A  ST-POLT03', 'A  ST-POLT10', 'A  WELS01', 'A  WIEN01', 'A  WIEN02', 'A  WIEN03', 'A  WIEN04', 'A  WIEN05', 'A  WIEN06', 'A  WIEN07', 'A  WIEN08', 'A  WIEN09', 'A  WIEN10', 'A  WIEN15', 'A  WIEN20', 'A  WIEN21', 'A  WIEN36', 'A  WIEN38', 'A  WIEN52', 'A  WIEN63', 'A  WIEN64', 'A  WIEN65', 'A  WIEN66', 'A  WIEN67', 'A  WIEN68', 'A  WIEN69', 'A  W

In [130]:
print_unique_values(euc, 'country')

['AT', 'BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LI', 'LT', 'LU', 'LV', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'TR', 'UK']


In [131]:
print_unique_values(euc, 'city')

['(TURON) MIERES', '0107 OSLO', '08207 SEDAN CEDEX', '97157-POINTE- A -PITRE CEDEX', 'A CORUNA', 'A CORUÑA', 'A GUARDA', 'AACHEN', 'AALBORG', 'AALBORG Ø.', 'AALTO (TOWN, ESPOO)', 'AARHUS C', 'AAS', 'ABANTO-ZIERBENA', 'ABARÁN (MURCIA)', 'ABERYSTWYTH', 'ADANA', 'ADEJE- LOS OLIVOS', 'ADIYAMAN', 'ADRA (ALMERIA)', 'ADRA (ALMERÍA)', 'AFRAGOLA', 'AGEN CEDEX 9', 'AGEN CÉDEX 09', 'AGRINIO', 'AGUAS NUEVAS (ALBACETE)', 'AGUILAR DE CAMPOO - PALENCIA', 'AGÜIMES', 'AIX EN PROVENCE', 'AIX-LES-BAINS  CEDEX', 'AKSARAY', 'ALACANT', 'ALBACETE', 'ALBAL', 'ALBATERA  (ALICANTE)', 'ALBERIC', 'ALBERTVILLE', 'ALBI', 'ALBI CEDEX 09', 'ALBI CEDEX 9', 'ALBOX (ALMERÍA)', 'ALCALA DE HENARES', 'ALCALÁ DE GUADAIRA', 'ALCALÁ DE HENARES', 'ALCALÁ LA REAL. JAÉN', 'ALCANTARILLA (MURCIA)', 'ALCAUDETE - JAÉN', 'ALCAÑIZ (TERUEL)', 'ALCOBENDAS', 'ALCOBENDAS (MADRID)', 'ALCOI', 'ALCORA (CASTELLÓN)', 'ALCORCON (MADRID)', 'ALCORCÓN', 'ALCOY', 'ALCUDIA', 'ALCÁZAR DE SAN JUAN (CIUDAD REAL)', 'ALDAIA (VALENCIA)', 'ALENCON', 'ALES'

In [132]:
print_unique_values(euc, 'code')

['EUC', 'EUCP', 'EUCX']


### Clean data
Make sure that upper case is used consistently for institutional codes and cities.

In [133]:
euc.institutional_code = euc.institutional_code.str.upper()
euc.city = euc.city.str.upper()

### Save cleaned data as JSON file

In [134]:
euc.to_json('cleaned_data/euc.json')
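By default, `DataFrame.to_json` uses `orient='columns'`, which nests the values under each column name, keyed by row label. If a later step prefers one JSON object per institution, `orient='records'` is an alternative. The sketch below illustrates this on two made-up rows; whether the ETL pipeline expects that layout is an assumption.

```python
import io
import pandas as pd

toy_euc = pd.DataFrame({
    'institutional_code': ['A  Graz09', 'E ZAMORA06'],
    'city': ['Graz', 'ZAMORA'],
})
toy_euc.institutional_code = toy_euc.institutional_code.str.upper()
toy_euc.city = toy_euc.city.str.upper()

# One JSON object per row instead of one object per column
as_json = toy_euc.to_json(orient='records')
restored = pd.read_json(io.StringIO(as_json), orient='records')
print(as_json)
```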