First, we import the `zipfile` module so that we can unpack the archive containing the Stata `dta` file.

In [43]:
import zipfile

The variable reference is available here: http://www.europeansocialsurvey.org/docs/cumulative/ESS1-6_variable_list_e01.pdf

We used the [ESS Cumulative Data Wizard](http://www.europeansocialsurvey.org/downloadwizard) to obtain a file containing all of the variables within the following categories:
* Subjective well-being, social exclusion; religion; perceived discrimination; national and ethnic identity (only the ESS standard variables)
* Gender, age and household composition
* Socio-demographic profile, including: type of area, education and occupation, union membership, income, marital status (only the ESS standard variables)

We did **not** use the country-specific variables because we relied on the recoding done by the ESS staff. For example, we did not need the responses to a question asked in German, because those answers are already incorporated into the ESS standard variables in a uniform manner.

In [44]:
with zipfile.ZipFile("output4909048052568705413.zip","r") as zip_ref:
    zip_ref.extractall("ESS_Cumulative")
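
Before extracting a downloaded archive, it can help to list its members and pull out only the Stata file. The following is a minimal sketch of that idea; the archive and member names (`example.zip`, `example_data.dta`, `codebook.txt`) are hypothetical stand-ins created on the fly, not the contents of the actual ESS download.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive_path = tmp / "example.zip"  # hypothetical archive name

    # Build a tiny stand-in archive so the sketch runs without the real download.
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("example_data.dta", b"placeholder bytes")
        zf.writestr("codebook.txt", "placeholder text")

    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        # List every member of the archive before touching the disk.
        member_names = zip_ref.namelist()
        # Keep only the Stata files and extract just those.
        dta_members = [name for name in member_names if name.endswith(".dta")]
        zip_ref.extractall(tmp / "extracted", members=dta_members)

    extracted_files = sorted(p.name for p in (tmp / "extracted").iterdir())

print(member_names)
print(extracted_files)
```

Listing the members first also reveals the exact file name to pass to `pd.read_stata` afterwards.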

In [45]:
import pandas as pd

In [46]:
ESS_data_frame = pd.read_stata("ESS_Cumulative/ESS1-6e01_1_F1.dta",
                               convert_categoricals=False, 
                               convert_missing=False)
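
With `convert_categoricals=False`, labelled Stata variables come back as their numeric codes rather than as pandas categoricals. The sketch below illustrates the difference on a toy file written with `DataFrame.to_stata`; the data frame, its column names, and the file name `toy.dta` are invented for illustration and are not part of the ESS data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data: a categorical column is written to Stata with value labels.
toy = pd.DataFrame({
    "sex": pd.Categorical(["male", "female", "male"]),
    "age": [34, 51, 27],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.dta"  # hypothetical file name
    toy.to_stata(path, write_index=False)

    # Numeric codes only, as in the cell above.
    codes = pd.read_stata(path, convert_categoricals=False)
    # Value labels applied, giving a pandas categorical column.
    labels = pd.read_stata(path, convert_categoricals=True)

print(codes["sex"].tolist())
print(labels["sex"].astype(str).tolist())
```

Keeping the numeric codes is convenient for modelling, but the codebook is then needed to interpret each value.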

In [47]:
ESS_data_frame.head()

Unnamed: 0,cntry,cname,cedition,cproddat,cseqno,name,essround,edition,idno,dweight,...,emprm14,emplnom,jbspvm,occm14,occm14a,occm14b,atncrse,fxltph,mbltph,inttph
0,BE,ESS1-6e01_1,1.1,09.03.2016,12394,ESS4e04_3,4,4.3,10202.0,1.0074,...,3.0,,,,,,1.0,1.0,1.0,2.0
1,BE,ESS1-6e01_1,1.1,09.03.2016,12395,ESS4e04_3,4,4.3,10203.0,1.0074,...,1.0,,2.0,,,5.0,1.0,2.0,1.0,2.0
2,BE,ESS1-6e01_1,1.1,09.03.2016,12396,ESS4e04_3,4,4.3,10207.0,1.0074,...,2.0,3.0,,,,4.0,1.0,1.0,1.0,1.0
3,BE,ESS1-6e01_1,1.1,09.03.2016,12397,ESS4e04_3,4,4.3,10208.0,1.0074,...,3.0,,,,,,1.0,1.0,1.0,2.0
4,BE,ESS1-6e01_1,1.1,09.03.2016,12398,ESS4e04_3,4,4.3,10302.0,1.0074,...,3.0,,,,,,1.0,1.0,1.0,1.0


We used the [IPUMS website for the CPS](https://cps.ipums.org/) to obtain data from the Annual Social and Economic Supplement (ASEC) for 2010, 2012, and 2014. We then used Stata to apply the data definitions provided by IPUMS and saved the result as a `dta` file.


In [48]:
with zipfile.ZipFile("CPS_data_even_years.zip","r") as zip_ref:
    zip_ref.extractall("CPS_data_even_years")

In [49]:
CPS_data_frame = pd.read_stata("CPS_data_even_years/CPS_data.dta",
                               convert_categoricals=False, 
                               convert_missing=False)

In [50]:
CPS_data_frame.head()

Unnamed: 0,year,serial,numprec,hwtsupp,hhtenure,hhintype,region,statefip,statecensus,asecflag,...,educ99_mom,educ99_mom2,educ99_pop,educ99_pop2,educ99_sp,schlcoll_mom,schlcoll_mom2,schlcoll_pop,schlcoll_pop2,schlcoll_sp
0,2010,1,1,485.98999,2,1,11,23,11,1,...,,,,,,,,,,
1,2010,2,1,531.710022,1,1,11,23,11,1,...,,,,,,,,,,
2,2010,3,2,474.399994,1,1,11,23,11,1,...,,,,,7.0,,,,,0.0
3,2010,3,2,474.399994,1,1,11,23,11,1,...,,,,,9.0,,,,,0.0
4,2010,4,2,486.649994,1,1,11,23,11,1,...,,,,,10.0,,,,,0.0


In [51]:
CPS_data_frame.columns.values.tolist()

['year',
 'serial',
 'numprec',
 'hwtsupp',
 'hhtenure',
 'hhintype',
 'region',
 'statefip',
 'statecensus',
 'asecflag',
 'hseq',
 'metro',
 'metarea',
 'county',
 'metfips',
 'ownershp',
 'hhincome',
 'housret',
 'cpi99',
 'month',
 'pernum',
 'wtsupp',
 'earnwt',
 'relate',
 'age',
 'sex',
 'race',
 'marst',
 'popstat',
 'bpl',
 'yrimmig',
 'citizen',
 'mbpl',
 'fbpl',
 'nativity',
 'hispan',
 'educ',
 'educ99',
 'schlcoll',
 'empstat',
 'labforce',
 'occ',
 'occ2010',
 'occ1990',
 'ind1990',
 'occ1950',
 'ind',
 'ind1950',
 'classwkr',
 'occly',
 'occ50ly',
 'indly',
 'ind50ly',
 'classwly',
 'wkswork1',
 'wkswork2',
 'uhrsworkly',
 'uhrsworkt',
 'uhrswork1',
 'ahrsworkt',
 'wksunem1',
 'wksunem2',
 'absent',
 'durunem2',
 'durunemp',
 'fullpart',
 'nwlookwk',
 'pension',
 'whyunemp',
 'firmsize',
 'whyabsnt',
 'wantjob',
 'whyptly',
 'whyptlwk',
 'usftptlw',
 'payifabs',
 'numemps',
 'wnftlook',
 'wnlwnilf',
 'strechlk',
 'whynwly',
 'actnlfly',
 'ptweeks',
 'ftotval',
 'inctot',

In [52]:
# TODO: update this column list once our variable selection is final
new_df = CPS_data_frame[['educ99', 'sex']]

In [53]:
new_df.head()  # sample of education level and sex

Unnamed: 0,educ99,sex
0,11,2
1,9,1
2,9,2
3,7,1
4,10,1


In [57]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()  # the model to evaluate

# Arguments: the model, the set of predictors, and the outcome variable we are trying to predict
scores = cross_val_score(linreg, new_df[['sex']], new_df[['educ99']])
print("Cross-validation scores: {}".format(scores))

Cross-validation scores: [-0.00077251  0.0011375  -0.00018396]


“Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a split into a training and a test set. In cross-validation, the data is instead split repeatedly and multiple models are trained. The most commonly used version of cross-validation is k-fold cross-validation, where k is a user-specified number, usually 5 or 10. 

When performing five-fold cross-validation, the data is first partitioned into five parts of (approximately) equal size, called folds. Next, a sequence of models is trained. The first model is trained using the first fold as the test set, and the remaining folds (2–5) are used as the training set. The model is built using the data in folds 2–5, and then the accuracy is evaluated on fold 1. Then another model is built, this time using fold 2 as the test set and the data in folds 1, 3, 4, and 5 as the training set. This process is repeated using folds 3, 4, and 5 as test sets. For each of these five splits of the data into training and test sets, we compute the accuracy. In the end[…]”

"accuracy scores (cross-validation scores)"

In [59]:
print("Average cross-validation score: {:.9f}".format(scores.mean()))

Average cross-validation score: 0.000060343
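
The five-fold procedure described in the quoted passage can be sketched by hand and compared with `cross_val_score`. This is only an illustration under stated assumptions: the data are synthetic (not the CPS variables), and the number of folds is set explicitly to five. Note that for a regressor such as `LinearRegression`, the per-fold score is R² rather than classification accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration): one predictor, noisy linear outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

kfold = KFold(n_splits=5)  # five contiguous folds, no shuffling

# Manual loop: each fold serves once as the test set,
# and the remaining four folds form the training set.
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # For regressors, .score() returns R^2 on the held-out fold.
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The same splits evaluated by scikit-learn's helper.
library_scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

print("Manual scores:  {}".format(manual_scores))
print("Library scores: {}".format(library_scores))
print("Mean score: {:.3f}".format(library_scores.mean()))
```

Passing the same `KFold` object to both approaches guarantees identical splits, so the two score arrays should agree.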
