# Lab 4: Random numbers, splitting data, evaluating model performance

- **Author:** Niall Keleher ([nkeleher@uw.edu](mailto:nkeleher@uw.edu))
- **Date:** 18 April 2016
- **Course:** INFO 371: Core Methods in Data Science

### Learning Objectives:
By the end of the lab, you will be able to:
* create dummy variables for use in regressions
* generate random numbers for use in randomization and train-test splits
* identify measures for evaluating regression performance

### Topics:
1. Qualitative/Categorical predictors
2. Generating random numbers 
3. Splitting data into training and test sets
4. Running regressions & generating predictions
5. Model performance

### References: 
* [Pandas - get_dummies()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
* [random library](https://docs.python.org/2/library/random.html)
* [scikit-learn train_test_split (cross-validation module)](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)
* [Introduction to Statistical Learning, Lab #5](http://www-bcf.usc.edu/~gareth/ISL/Chapter%205%20Lab.txt)

In [1]:
import numpy as np
import pandas as pd

In [5]:
auto_df = pd.read_csv('data/Auto.csv')

In [6]:
auto_df.head()

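| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |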


### 1. Qualitative/Categorical predictors - generating dummy variables in Python

In [7]:
auto_df.cylinders.value_counts()

4    203
8    103
6     84
3      4
5      3
Name: cylinders, dtype: int64

In [47]:
pd.get_dummies(auto_df.cylinders).head()

| | 3 | 4 | 5 | 6 | 8 |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 |


In [9]:
cyl_dummies = pd.get_dummies(auto_df.cylinders, prefix='cyl')

In [10]:
auto_df2 = pd.concat([auto_df, cyl_dummies], axis=1)

In [11]:
auto_df2.head()

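| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | cyl_3 | cyl_4 | cyl_5 | cyl_6 | cyl_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 0 | 0 | 0 | 0 | 1 |
| 1 | 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 0 | 0 | 0 | 0 | 1 |
| 2 | 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 0 | 0 | 0 | 0 | 1 |
| 3 | 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 0 | 0 | 0 | 0 | 1 |
| 4 | 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino | 0 | 0 | 0 | 0 | 1 |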
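When dummy columns are used in a regression that includes an intercept, keeping every level makes the dummies sum to one in each row, so they are perfectly collinear with the intercept. A common remedy is to drop one level and treat it as the baseline. The sketch below illustrates this with the `drop_first` option of `pd.get_dummies`. The `toy` DataFrame and its values are hypothetical stand-ins, not rows taken from `Auto.csv`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the mpg and cylinders columns
toy = pd.DataFrame({'mpg': [18, 15, 24, 22, 31],
                    'cylinders': [8, 8, 4, 6, 4]})

# drop_first=True omits the first level (4 cylinders here),
# which then acts as the baseline category in a regression
cyl_dummies = pd.get_dummies(toy['cylinders'], prefix='cyl', drop_first=True)

# Attach the remaining dummy columns to the original data
toy2 = pd.concat([toy, cyl_dummies], axis=1)
print(toy2)
```

Coefficients on the remaining dummies are then read as differences relative to the omitted level.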


### 2. Generating random numbers - randomizing treatment assignment

In [12]:
import random

In [55]:
random.random()  # Random float x, 0.0 <= x < 1.0

0.00998124146937207

In [14]:
random.uniform(1, 100)  # Random float x, 1.0 <= x <= 100.0

94.87689570903558

In [15]:
random.randint(1, 10)  # Integer from 1 to 10, endpoints included

4

In [16]:
random.sample([1, 2, 3, 4, 5],  3)

[3, 4, 1]

In [54]:
random.seed(47653)
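Seeding makes the sequence of pseudo-random draws repeatable, which matters when a treatment assignment or a train-test split has to be reproduced later. A minimal sketch of that behaviour, reusing the seed value from the cell above (the variable names `first` and `second` are only illustrative):

```python
import random

# Draw three numbers after seeding
random.seed(47653)
first = [random.random() for _ in range(3)]

# Re-seed with the same value and draw again
random.seed(47653)
second = [random.random() for _ in range(3)]

# The two sequences are identical because the generator state was reset
print(first == second)
```

Note that re-running cells out of order changes which draws each cell receives, so the seed cell should be run immediately before the cells whose results need to be reproduced.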

In [18]:
raw_data = {'first_name': ['Niall', 'Josh', 'Li', 'Lavi', 'Jevin', 'Emma'],
            'sex': ['male', 'male', 'female', 'male', 'male', 'female']}
df = pd.DataFrame(raw_data, columns=['first_name', 'sex'])

In [19]:
df

| | first_name | sex |
|---|---|---|
| 0 | Niall | male |
| 1 | Josh | male |
| 2 | Li | female |
| 3 | Lavi | male |
| 4 | Jevin | male |
| 5 | Emma | female |


In [20]:
df['rand'] = df.apply(lambda row: random.random(), axis=1)

In [21]:
df

| | first_name | sex | rand |
|---|---|---|---|
| 0 | Niall | male | 0.009981 |
| 1 | Josh | male | 0.897681 |
| 2 | Li | female | 0.804464 |
| 3 | Lavi | male | 0.147438 |
| 4 | Jevin | male | 0.942135 |
| 5 | Emma | female | 0.426891 |


In [22]:
df['treat'] = (df['rand']<.5)

In [23]:
df

| | first_name | sex | rand | treat |
|---|---|---|---|---|
| 0 | Niall | male | 0.009981 | True |
| 1 | Josh | male | 0.897681 | False |
| 2 | Li | female | 0.804464 | False |
| 3 | Lavi | male | 0.147438 | True |
| 4 | Jevin | male | 0.942135 | False |
| 5 | Emma | female | 0.426891 | True |
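Thresholding independent draws at 0.5 gives each person a 50% chance of treatment, but it does not guarantee equal group sizes; the 3/3 split above happened by chance. One possible alternative, sketched here under the assumption that exactly half of the units should be treated, is to sample the treated group without replacement. The names `rng`, `treated`, and `assignment` are illustrative.

```python
import random

names = ['Niall', 'Josh', 'Li', 'Lavi', 'Jevin', 'Emma']

# A private generator keeps this example independent of the global seed
rng = random.Random(47653)

# Draw exactly half of the names, without replacement, as the treated group
treated = set(rng.sample(names, len(names) // 2))

# Map every name to True (treated) or False (control)
assignment = {name: (name in treated) for name in names}
print(assignment)
```

With this approach the group sizes are fixed in advance and only the membership is random.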


### 3. Splitting data into training and test sets

In [24]:
auto_df['rand'] = auto_df.apply(lambda row: random.random(), axis=1)

In [25]:
auto_df['train'] = (auto_df['rand']>.33)

In [26]:
len(auto_df)

397

In [27]:
len(auto_df[auto_df['train']])

281

In [28]:
auto_train = auto_df[auto_df['train']]
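The cell above keeps only the training rows; the held-out rows are the complement of the same boolean flag. A minimal sketch of building both pieces, using a hypothetical `toy` DataFrame rather than `auto_df` and drawing the random column in one vectorised call (an assumption, not the approach used in this lab):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 made-up vehicle weights
rng = np.random.RandomState(0)
toy = pd.DataFrame({'weight': rng.randint(1500, 5000, size=20)})

# One uniform draw per row, then flag roughly two thirds as training rows
toy['rand'] = rng.uniform(size=len(toy))
toy['train'] = toy['rand'] > .33

# ~ negates the boolean flag, so the test set is everything not in training
toy_train = toy[toy['train']]
toy_test = toy[~toy['train']]
print(len(toy_train), len(toy_test))
```

Because each row is flagged independently, the realised training share varies around the target rather than matching it exactly.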

#### Using scikit-learn

In [29]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

In [30]:
X = auto_df['weight']

In [31]:
y = auto_df['mpg']

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [33]:
len(X_train)

265

In [34]:
len(y_train)

265

In [35]:
len(X_test)

132

In [36]:
len(y_test)

132
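The lengths reported above (265 training rows and 132 test rows out of 397) are consistent with the test share being rounded up to a whole number of rows and the training set receiving the remainder. The arithmetic can be checked directly. This is a sketch of the calculation only, not of scikit-learn's internals.

```python
import math

n_samples = 397
test_size = 0.33

# 0.33 * 397 is about 131.01, which rounds up to 132 test rows
n_test = int(math.ceil(test_size * n_samples))

# The training set gets whatever is left
n_train = n_samples - n_test
print(n_train, n_test)
```

Unlike the manual split with a random column, the sizes here are fixed by `test_size` and only the choice of rows depends on `random_state`.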

### 4. Running regressions & generating predictions

In [37]:
auto_df.head(1)

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | rand | train |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 8 | 307 | 130 | 3504 | 12 | 70 | 1 | chevrolet chevelle malibu | 0.880276 | True |


In [38]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [39]:
overfit_mod = smf.ols(formula='mpg ~ weight', data = auto_df)
overfit_result = overfit_mod.fit()
print(overfit_result.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.692
Model:                            OLS   Adj. R-squared:                  0.691
Method:                 Least Squares   F-statistic:                     886.6
Date:                Mon, 18 Apr 2016   Prob (F-statistic):          5.37e-103
Time:                        18:40:21   Log-Likelihood:                -1146.0
No. Observations:                 397   AIC:                             2296.
Df Residuals:                     395   BIC:                             2304.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     46.3174      0.796     58.166      0.0

In [41]:
train_mod = smf.ols(formula='mpg ~ weight', data = auto_train)
train_result = train_mod.fit()
print(train_result.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     645.4
Date:                Mon, 18 Apr 2016   Prob (F-statistic):           1.51e-74
Time:                        18:40:32   Log-Likelihood:                -799.77
No. Observations:                 281   AIC:                             1604.
Df Residuals:                     279   BIC:                             1611.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     45.6676      0.924     49.413      0.0
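The R-squared values in these summaries are computed on the same rows used to fit each model. To judge how well a model predicts, the usual approach is to compute error measures on held-out rows. The sketch below shows a few common measures (MSE, RMSE, MAE, and out-of-sample R-squared) on hypothetical toy numbers. It fits the line with `np.polyfit` for brevity rather than with statsmodels, and none of the values come from `Auto.csv`.

```python
import numpy as np

# Hypothetical toy data (weight in lbs, mpg); not taken from Auto.csv
w_train = np.array([3504., 3693., 3436., 2130., 1835., 2672.])
mpg_train = np.array([18., 15., 18., 27., 26., 25.])
w_test = np.array([3433., 2375., 4341.])
mpg_test = np.array([16., 25., 15.])

# Fit mpg = intercept + slope * weight on the training rows only
slope, intercept = np.polyfit(w_train, mpg_train, 1)

# Predict for the held-out rows and compute the residuals
mpg_pred = intercept + slope * w_test
resid = mpg_test - mpg_pred

mse = np.mean(resid ** 2)        # mean squared error
rmse = np.sqrt(mse)              # same units as mpg
mae = np.mean(np.abs(resid))     # mean absolute error
# Out-of-sample R-squared: 1 minus residual over total sum of squares
r2 = 1 - np.sum(resid ** 2) / np.sum((mpg_test - mpg_test.mean()) ** 2)
print(mse, rmse, mae, r2)
```

RMSE and MAE are expressed in the units of the outcome, which often makes them easier to interpret than MSE.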

### Exercise

#### Use scikit-learn to train a model to predict mpg using weight, horsepower, cylinders, displacement, acceleration, origin, and year

Reference: http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

In [42]:
from sklearn import linear_model

In [43]:
lin_mod = linear_model.LinearRegression()

In [60]:
lin_mod.fit(X_train, y_train)



ValueError: Found arrays with inconsistent numbers of samples: [  1 265]
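The error arises because `X_train` was built from a single column (`auto_df['weight']`), which is one-dimensional, while scikit-learn estimators expect a two-dimensional feature array of shape `(n_samples, n_features)`. A minimal sketch of one possible fix, using hypothetical toy numbers rather than the lab data:

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data (weight in lbs, mpg); not taken from Auto.csv
weight = np.array([3504., 3693., 3436., 2130., 1835., 2672.])
mpg = np.array([18., 15., 18., 27., 26., 25.])

# Reshape the 1-D feature into a single-column 2-D array: shape (6, 1)
X_2d = weight.reshape(-1, 1)

lin_mod = linear_model.LinearRegression()
lin_mod.fit(X_2d, mpg)

# One coefficient per feature column, plus the intercept
print(lin_mod.coef_, lin_mod.intercept_)
```

With a pandas DataFrame, selecting columns with a list (for example `auto_df[['weight']]`, or a list of several column names for the exercise) returns a two-dimensional object, so no reshape is needed.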