# Linear Regression - Exercise

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this exercise, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

The datasets 'pisa2009train.csv' and 'pisa2009test.csv' contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES).

Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

* grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)

* male: Whether the student is male (1/0)

* raceeth: The race/ethnicity composite of the student

* preschool: Whether the student attended preschool (1/0)

* expectBachelors: Whether the student expects to obtain a bachelor's degree (1/0)

* motherHS: Whether the student's mother completed high school (1/0)

* motherBachelors: Whether the student's mother obtained a bachelor's degree (1/0)

* motherWork: Whether the student's mother has part-time or full-time work (1/0)

* fatherHS: Whether the student's father completed high school (1/0)

* fatherBachelors: Whether the student's father obtained a bachelor's degree (1/0)

* fatherWork: Whether the student's father has part-time or full-time work (1/0)

* selfBornUS: Whether the student was born in the United States of America (1/0)

* motherBornUS: Whether the student's mother was born in the United States of America (1/0)

* fatherBornUS: Whether the student's father was born in the United States of America (1/0)

* englishAtHome: Whether the student speaks English at home (1/0)

* computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)

* read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)

* minutesPerWeekEnglish: The number of minutes per week the student spends in English class

* studentsInEnglish: The number of students in this student's English class at school

* schoolHasLibrary: Whether this student's school has a library (1/0)

* publicSchool: Whether this student attends a public school (1/0)

* urban: Whether this student's school is in an urban area (1/0)

* schoolSize: The number of students in this student's school

* readingScore: The student's reading score, on a 1000-point scale

Import the pandas and NumPy packages, along with matplotlib and seaborn for plotting.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Dataset size

Load the training and testing sets using the pd.read_csv() function, and save them as the variables train and test (referred to as pisaTrain and pisaTest in the questions below).

**How many students are there in the training set?**

In [225]:
test = pd.read_csv('pisa2009test.csv')
train = pd.read_csv('pisa2009train.csv')
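
The number of students is the number of rows in the loaded DataFrame. A minimal sketch, assuming a tiny in-memory CSV as a stand-in for 'pisa2009train.csv' (the three rows and the names `toy_csv` and `toy_train` are made up for illustration):

```python
import io
import pandas as pd

# Toy stand-in for pisa2009train.csv; the rows are invented for illustration.
toy_csv = io.StringIO(
    "grade,male,readingScore\n"
    "11,1,476.00\n"
    "11,1,575.01\n"
    "9,1,554.81\n"
)
toy_train = pd.read_csv(toy_csv)

# Each row is one student, so the row count is the number of students.
n_students = len(toy_train)  # same value as toy_train.shape[0]
print(n_students)
```

On the real training set, the same call on `train` should agree with the 3663 entries reported by `train.info()` and `train.shape`.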

## Summarizing the dataset

**Using pisaTrain, what is the average reading test score of males? Of females?**

In [73]:
train.describe()

Unnamed: 0,grade,male,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,fatherWork,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
count,3663.0,3663.0,3607.0,3601.0,3566.0,3266.0,3570.0,3418.0,3094.0,3430.0,...,3592.0,3598.0,3629.0,3477.0,3414.0,3520.0,3663.0,3663.0,3501.0,3663.0
mean,10.089817,0.511057,0.722761,0.785893,0.879978,0.348132,0.734454,0.859274,0.331933,0.853061,...,0.871659,0.899389,0.289887,266.208225,24.499414,0.967614,0.933934,0.38493,1369.316767,497.911403
std,0.554375,0.499946,0.447697,0.410259,0.325033,0.476451,0.441685,0.347789,0.470983,0.354096,...,0.334515,0.300855,0.453772,148.403525,7.184348,0.177049,0.248431,0.486645,869.983618,95.515153
min,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,100.0,168.55
25%,10.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,1.0,0.0,225.0,20.0,1.0,1.0,0.0,712.0,431.705
50%,10.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,1.0,1.0,0.0,250.0,25.0,1.0,1.0,0.0,1212.0,499.66
75%,10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,300.0,30.0,1.0,1.0,1.0,1900.0,566.23
max,12.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,2400.0,75.0,1.0,1.0,1.0,6694.0,746.0


In [79]:
train

Unnamed: 0,grade,male,raceeth,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,11,1,,,0.0,,,1.0,,,...,0.0,1.0,0.0,225.0,,1.0,1,1,673.0,476.00
1,11,1,White,0.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
2,9,1,White,1.0,1.0,1.0,1.0,1.0,1.0,,...,1.0,1.0,0.0,250.0,28.0,1.0,1,0,1233.0,554.81
3,10,0,Black,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
4,10,1,Hispanic,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3658,9,1,White,0.0,1.0,1.0,,0.0,1.0,1.0,...,1.0,1.0,0.0,250.0,20.0,1.0,1,0,421.0,509.99
3659,9,1,White,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,450.0,16.0,1.0,1,0,1317.0,444.90
3660,10,1,Hispanic,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,225.0,16.0,1.0,1,1,539.0,476.89
3661,11,1,Black,0.0,0.0,1.0,0.0,,,0.0,...,1.0,1.0,0.0,54.0,36.0,1.0,1,1,,363.61


In [84]:
train[train['male']==1]['readingScore'].mean()

483.53247863247805

In [86]:
train[train['male']==0]['readingScore'].mean()

512.94063093244
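
The two filtered means above can also be obtained in one grouped call. A small sketch on invented data (the `toy` frame is hypothetical; only the column names come from the dataset):

```python
import pandas as pd

# Toy data, invented for illustration; 1 = male, 0 = female.
toy = pd.DataFrame({
    "male": [1, 1, 0, 0],
    "readingScore": [480.0, 500.0, 510.0, 530.0],
})

# One grouped call returns the mean score for each value of `male`.
mean_by_sex = toy.groupby("male")["readingScore"].mean()
print(mean_by_sex)
```

Applied to `train`, this would return both averages shown above, indexed by 0 (female) and 1 (male).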

## Locating and removing missing values

**Which variables do not have any missing values in the training set?**

In [87]:
train.isnull().any(axis=0)

grade                    False
male                     False
raceeth                   True
preschool                 True
expectBachelors           True
motherHS                  True
motherBachelors           True
motherWork                True
fatherHS                  True
fatherBachelors           True
fatherWork                True
selfBornUS                True
motherBornUS              True
fatherBornUS              True
englishAtHome             True
computerForSchoolwork     True
read30MinsADay            True
minutesPerWeekEnglish     True
studentsInEnglish         True
schoolHasLibrary          True
publicSchool             False
urban                    False
schoolSize                True
readingScore             False
dtype: bool

In [88]:
train.isnull().sum()

grade                      0
male                       0
raceeth                   35
preschool                 56
expectBachelors           62
motherHS                  97
motherBachelors          397
motherWork                93
fatherHS                 245
fatherBachelors          569
fatherWork               233
selfBornUS                69
motherBornUS              71
fatherBornUS             113
englishAtHome             71
computerForSchoolwork     65
read30MinsADay            34
minutesPerWeekEnglish    186
studentsInEnglish        249
schoolHasLibrary         143
publicSchool               0
urban                      0
schoolSize               162
readingScore               0
dtype: int64

Linear regression cannot be fit on observations with missing data (scikit-learn's LinearRegression raises an error when the inputs contain NaN), so we will remove all such observations from the training and testing sets. **How many observations are now in the training and test sets?**

In [92]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3663 entries, 0 to 3662
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   grade                  3663 non-null   int64  
 1   male                   3663 non-null   int64  
 2   raceeth                3628 non-null   object 
 3   preschool              3607 non-null   float64
 4   expectBachelors        3601 non-null   float64
 5   motherHS               3566 non-null   float64
 6   motherBachelors        3266 non-null   float64
 7   motherWork             3570 non-null   float64
 8   fatherHS               3418 non-null   float64
 9   fatherBachelors        3094 non-null   float64
 10  fatherWork             3430 non-null   float64
 11  selfBornUS             3594 non-null   float64
 12  motherBornUS           3592 non-null   float64
 13  fatherBornUS           3550 non-null   float64
 14  englishAtHome          3592 non-null   float64
 15  comp

In [93]:
train.shape

(3663, 24)

In [94]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1570 entries, 0 to 1569
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   grade                  1570 non-null   int64  
 1   male                   1570 non-null   int64  
 2   raceeth                1557 non-null   object 
 3   preschool              1549 non-null   float64
 4   expectBachelors        1547 non-null   float64
 5   motherHS               1525 non-null   float64
 6   motherBachelors        1382 non-null   float64
 7   motherWork             1534 non-null   float64
 8   fatherHS               1445 non-null   float64
 9   fatherBachelors        1282 non-null   float64
 10  fatherWork             1457 non-null   float64
 11  selfBornUS             1546 non-null   float64
 12  motherBornUS           1547 non-null   float64
 13  fatherBornUS           1512 non-null   float64
 14  englishAtHome          1543 non-null   float64
 15  comp

In [133]:
test.shape

(1570, 24)

In [134]:
train_temp=train.dropna()

In [135]:
test_temp=test.dropna()

In [136]:
train_temp.shape

(2414, 24)

In [137]:
test_temp.shape

(990, 24)
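
The missing-value checks and the row removal can be combined into a short sequence. The sketch below uses an invented four-row frame (`toy` is hypothetical) to show which columns are complete and how many rows `dropna()` removes:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "grade": [10, 11, 9, 10],
    "preschool": [1.0, np.nan, 0.0, 1.0],
    "schoolSize": [673.0, 1173.0, np.nan, 1233.0],
})

# Columns with no missing values at all.
complete_cols = toy.columns[toy.notna().all()].tolist()

# dropna() removes every row that has at least one missing value.
toy_complete = toy.dropna()
rows_dropped = len(toy) - len(toy_complete)
print(complete_cols, len(toy_complete), rows_dropped)
```

In the notebook, the same step takes the training set from 3663 to 2414 rows and the test set from 1570 to 990 rows.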

## Handling categorical variables

Categorical variables are variables that take on a discrete set of values. A 'color' variable with the levels 'red', 'green', and 'blue' is an example. To include a categorical variable in a linear regression model, we define one level as the "reference level" and add a binary variable for each of the remaining levels. In this way, a categorical variable with n levels is replaced by n-1 binary variables.

As an example, consider the 'color' variable. If "green" were the reference level, then we would add binary variables "colorred" and "colorblue" to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.
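
The color example can be reproduced with `pd.get_dummies`. This is a sketch on a hypothetical four-row frame; it assumes the reference level is listed first in the category order, so that `drop_first=True` removes it:

```python
import pandas as pd

# Hypothetical 'color' variable from the example above.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# List the reference level first: drop_first=True drops the first category.
colors["color"] = pd.Categorical(
    colors["color"], categories=["green", "red", "blue"]
)
dummies = pd.get_dummies(
    colors, columns=["color"], prefix="color", prefix_sep="",
    drop_first=True, dtype=int,
)
print(dummies)
```

Green rows end up with zeros in both columns. By the same rule, the seven levels of raceeth yield six binary variables once White is taken as the reference level.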

Now, consider the variable "raceeth" in our problem, which has levels "American Indian/Alaska Native", "Asian", "Black", "Hispanic", "More than one race", "Native Hawaiian/Other Pacific Islander", and "White". Because it is the most common in our population, we will select White as the reference level. **How many binary variables should we include in a linear regression model?**

In [142]:
train_temp['raceeth'].nunique()

7

In [143]:
train_temp=pd.get_dummies(train_temp, prefix=['raceeth'], columns = ['raceeth'])

In [144]:
train_temp.head()

Unnamed: 0,grade,male,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,fatherWork,...,urban,schoolSize,readingScore,raceeth_American Indian/Alaska Native,raceeth_Asian,raceeth_Black,raceeth_Hispanic,raceeth_More than one race,raceeth_Native Hawaiian/Other Pacific Islander,raceeth_White
1,11,1,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0,1173.0,575.01,0,0,0,0,0,0,1
3,10,0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1,2640.0,458.11,0,0,1,0,0,0,0
4,10,1,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,1,1095.0,613.89,0,0,0,1,0,0,0
7,10,0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,0,1913.0,439.36,0,0,0,0,0,0,1
9,10,1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,0,899.0,465.9,0,0,0,0,1,0,0


In [145]:
train_white= train_temp.drop('raceeth_White', axis=1)

In [146]:
train_white.head()

Unnamed: 0,grade,male,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,fatherWork,...,publicSchool,urban,schoolSize,readingScore,raceeth_American Indian/Alaska Native,raceeth_Asian,raceeth_Black,raceeth_Hispanic,raceeth_More than one race,raceeth_Native Hawaiian/Other Pacific Islander
1,11,1,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,1,0,1173.0,575.01,0,0,0,0,0,0
3,10,0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1,1,2640.0,458.11,0,0,1,0,0,0
4,10,1,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,1,1,1095.0,613.89,0,0,0,1,0,0
7,10,0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1,0,1913.0,439.36,0,0,0,0,0,0
9,10,1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,1,0,899.0,465.9,0,0,0,0,1,0


## Build a model

Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

In [147]:
train_white.columns

Index(['grade', 'male', 'preschool', 'expectBachelors', 'motherHS',
       'motherBachelors', 'motherWork', 'fatherHS', 'fatherBachelors',
       'fatherWork', 'selfBornUS', 'motherBornUS', 'fatherBornUS',
       'englishAtHome', 'computerForSchoolwork', 'read30MinsADay',
       'minutesPerWeekEnglish', 'studentsInEnglish', 'schoolHasLibrary',
       'publicSchool', 'urban', 'schoolSize', 'readingScore',
       'raceeth_American Indian/Alaska Native', 'raceeth_Asian',
       'raceeth_Black', 'raceeth_Hispanic', 'raceeth_More than one race',
       'raceeth_Native Hawaiian/Other Pacific Islander'],
      dtype='object')

In [148]:
x=train_white[['grade', 'male', 'preschool', 'expectBachelors', 'motherHS',
       'motherBachelors', 'motherWork', 'fatherHS', 'fatherBachelors',
       'fatherWork', 'selfBornUS', 'motherBornUS', 'fatherBornUS',
       'englishAtHome', 'computerForSchoolwork', 'read30MinsADay',
       'minutesPerWeekEnglish', 'studentsInEnglish', 'schoolHasLibrary',
       'publicSchool', 'urban', 'schoolSize', 'raceeth_American Indian/Alaska Native', 'raceeth_Asian',
       'raceeth_Black', 'raceeth_Hispanic', 'raceeth_More than one race',
       'raceeth_Native Hawaiian/Other Pacific Islander']]

In [149]:
y=train_white['readingScore']

In [150]:
x_train = x
y_train = y

In [151]:
x_train

Unnamed: 0,grade,male,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,fatherWork,...,schoolHasLibrary,publicSchool,urban,schoolSize,raceeth_American Indian/Alaska Native,raceeth_Asian,raceeth_Black,raceeth_Hispanic,raceeth_More than one race,raceeth_Native Hawaiian/Other Pacific Islander
1,11,1,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,1.0,1,0,1173.0,0,0,0,0,0,0
3,10,0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,1,1,2640.0,0,0,1,0,0,0
4,10,1,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,1.0,1,1,1095.0,0,0,0,1,0,0
7,10,0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,1,0,1913.0,0,0,0,0,0,0
9,10,1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,1.0,1,0,899.0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3655,10,0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,1.0,1,0,149.0,0,0,0,0,0,0
3657,10,1,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,1.0,1,0,1471.0,0,0,0,0,1,0
3659,9,1,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,1.0,1,0,1317.0,0,0,0,0,0,0
3660,10,1,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,1.0,1,1,539.0,0,0,0,1,0,0


In [152]:
y_train

1       575.01
3       458.11
4       613.89
7       439.36
9       465.90
         ...  
3655    604.76
3657    492.76
3659    444.90
3660    476.89
3662    551.85
Name: readingScore, Length: 2414, dtype: float64

In [153]:
from sklearn.linear_model import LinearRegression

In [154]:
lmScore = LinearRegression()

In [155]:
lmScore.fit(x_train,y_train)

LinearRegression()

**What is the RMSE of lmScore on the training set? How good do you think the current model is?**

In [156]:
predictions=lmScore.predict(x_train)

In [165]:
predictions

array([538.69212789, 495.62605148, 436.8781249 , ..., 447.98990547,
       468.91289474, 557.36336867])

In [157]:
from sklearn import metrics

In [158]:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, predictions)))

RMSE: 73.3655514329845
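
To judge whether an RMSE is good, it helps to compare it with a simple baseline that predicts the mean score for every student. A minimal sketch on invented numbers (the helper `rmse` and the arrays are hypothetical, not part of the notebook):

```python
import numpy as np

# Toy values, invented for illustration.
y_true = np.array([500.0, 450.0, 550.0, 600.0])
y_pred = np.array([510.0, 470.0, 540.0, 580.0])

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

model_rmse = rmse(y_true, y_pred)
# Baseline: predict the mean score for every student.
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))
print(model_rmse, baseline_rmse)
```

For the notebook's model, the training RMSE of about 73.4 can be set against the standard deviation of readingScore in the cleaned training data, about 89.3 (see the output of `train_temp.describe()`), which is roughly what the mean-only baseline would score. The model improves on that baseline, but only modestly.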


## Comparing predictions for similar students

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. **What is the predicted reading score of student A minus the predicted reading score of student B?**

In [159]:
coeff_df = pd.DataFrame(lmScore.coef_,x.columns,columns=['Coefficient'])
coeff_df

Unnamed: 0,Coefficient
grade,29.542707
male,-14.521653
preschool,-4.46367
expectBachelors,55.26708
motherHS,6.058774
motherBachelors,12.638068
motherWork,-2.809101
fatherHS,4.018214
fatherBachelors,16.929755
fatherWork,5.842798


In [160]:
coeff_df.loc['grade']*2

Coefficient    59.085414
Name: grade, dtype: float64
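
The factor of 2 follows from the linear form of the model. Writing $\beta_{\text{grade}}$ for the fitted grade coefficient (notation introduced here for illustration), all other terms cancel because the two students are otherwise identical:

$$
\hat{y}_A - \hat{y}_B = \beta_{\text{grade}}\,(11 - 9) = 2 \times 29.542707 = 59.085414
$$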

## Interpreting coefficients

What is the meaning of the coefficient associated with variable raceeth_Asian?

1. Predicted average reading score of an Asian student

2. Difference between the average reading score of an Asian student and the average reading score of a white student

3. Difference between the average reading score of an Asian student and the average reading score of all the students in the dataset

4. Predicted difference in the reading score between an Asian student and a white student who is otherwise identical

In [None]:
# Answer: 4. Predicted difference in the reading score between an Asian student and a white student who is otherwise identical

In [164]:
train_temp.describe()

Unnamed: 0,grade,male,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,fatherWork,...,urban,schoolSize,readingScore,raceeth_American Indian/Alaska Native,raceeth_Asian,raceeth_Black,raceeth_Hispanic,raceeth_More than one race,raceeth_Native Hawaiian/Other Pacific Islander,raceeth_White
count,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,...,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0,2414.0
mean,10.126346,0.501243,0.727423,0.8343,0.896023,0.363712,0.735708,0.874068,0.348384,0.857084,...,0.362883,1371.649544,517.962887,0.008285,0.039354,0.094449,0.207125,0.033554,0.008285,0.608948
std,0.523174,0.500102,0.445377,0.371888,0.305294,0.481167,0.441047,0.331842,0.476557,0.35006,...,0.480931,847.800992,89.325693,0.090663,0.194475,0.292513,0.40533,0.180116,0.090663,0.488087
min,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,100.0,244.48,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,0.0,712.0,455.8475,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,10.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,0.0,1233.0,520.205,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1900.0,581.395,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,12.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,6694.0,746.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [203]:
train.dropna().groupby(['raceeth'])['readingScore'].mean()

raceeth
American Indian/Alaska Native             443.082000
Asian                                     549.155579
Black                                     462.617325
Hispanic                                  484.542180
More than one race                        517.176790
Native Hawaiian/Other Pacific Islander    534.711000
White                                     536.733068
Name: readingScore, dtype: float64

In [204]:
coeff_df

Unnamed: 0,Coefficient
grade,29.542707
male,-14.521653
preschool,-4.46367
expectBachelors,55.26708
motherHS,6.058774
motherBachelors,12.638068
motherWork,-2.809101
fatherHS,4.018214
fatherBachelors,16.929755
fatherWork,5.842798


In [None]:
# Holding all other features fixed, a 1-unit increase in raceeth_Asian (an Asian student rather than a White student) is associated with a decrease of 4.110325 in the predicted reading score.

## Predicting on unseen data

Use the lmScore model to predict the reading scores of students in pisaTest. 

What is the range between the maximum and minimum predicted reading score on the test set?

In [205]:
predictions.max()

636.3154881755961

In [206]:
predictions.min()

332.67182814055184

In [None]:
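# range (max - min of the test-set predictions) = 284.46831179513725
# Note: `predictions` in the two cells above still holds the training-set predictions (from x_train); the max and min shown there differ by about 303.64.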
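
The range asked for here should come from predictions made on the test rows, not from the `predictions` array computed earlier on `x_train`. A minimal sketch of the pattern, assuming a toy one-feature model (all arrays and the name `model` are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration: y = 2*x + 1 exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
X_test = np.array([[0.0], [5.0], [10.0]])

model = LinearRegression().fit(X_train, y_train)

# Predict on the held-out rows, then take max - min of those predictions.
test_predictions = model.predict(X_test)
prediction_range = test_predictions.max() - test_predictions.min()
print(prediction_range)
```

In the notebook, the equivalent would be to call `lmScore.predict` on the prepared test features and subtract the minimum from the maximum of that result.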

## Test set MSE and RMSE

What is the RMSE of lmScore on the test set? 

In [226]:
test

Unnamed: 0,grade,male,raceeth,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,10,0,White,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,0.0,240.0,30.0,1.0,1,0,808.0,355.24
1,10,1,White,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,255.0,,1.0,1,0,808.0,385.57
2,10,0,White,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,0.0,,30.0,1.0,1,0,808.0,522.62
3,10,0,White,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,0.0,160.0,30.0,,1,0,808.0,406.24
4,10,0,White,1.0,1.0,1.0,0.0,0.0,1.0,1.0,...,1.0,1.0,0.0,240.0,30.0,1.0,1,0,808.0,453.50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1565,9,1,White,1.0,1.0,1.0,0.0,1.0,0.0,0.0,...,1.0,1.0,0.0,300.0,20.0,1.0,1,0,987.0,465.58
1566,11,0,White,1.0,0.0,1.0,0.0,,1.0,0.0,...,1.0,1.0,0.0,450.0,25.0,1.0,1,0,987.0,380.18
1567,10,0,Hispanic,1.0,1.0,1.0,,1.0,,,...,1.0,1.0,0.0,,,1.0,1,0,987.0,324.10
1568,10,0,White,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,450.0,20.0,1.0,1,0,987.0,596.34


In [227]:
test_temp=test.dropna()

In [229]:
test_temp = pd.get_dummies(test_temp, prefix=['raceeth'], columns = ['raceeth'])

In [230]:
test_temp.columns

Index(['grade', 'male', 'preschool', 'expectBachelors', 'motherHS',
       'motherBachelors', 'motherWork', 'fatherHS', 'fatherBachelors',
       'fatherWork', 'selfBornUS', 'motherBornUS', 'fatherBornUS',
       'englishAtHome', 'computerForSchoolwork', 'read30MinsADay',
       'minutesPerWeekEnglish', 'studentsInEnglish', 'schoolHasLibrary',
       'publicSchool', 'urban', 'schoolSize', 'readingScore',
       'raceeth_American Indian/Alaska Native', 'raceeth_Asian',
       'raceeth_Black', 'raceeth_Hispanic', 'raceeth_More than one race',
       'raceeth_Native Hawaiian/Other Pacific Islander', 'raceeth_White'],
      dtype='object')

In [231]:
x = test_temp[['grade', 'male', 'preschool', 'expectBachelors', 'motherHS',
       'motherBachelors', 'motherWork', 'fatherHS', 'fatherBachelors',
       'fatherWork', 'selfBornUS', 'motherBornUS', 'fatherBornUS',
       'englishAtHome', 'computerForSchoolwork', 'read30MinsADay',
       'minutesPerWeekEnglish', 'studentsInEnglish', 'schoolHasLibrary',
       'publicSchool', 'urban', 'schoolSize','raceeth_American Indian/Alaska Native', 'raceeth_Asian',
       'raceeth_Black', 'raceeth_Hispanic', 'raceeth_More than one race',
       'raceeth_Native Hawaiian/Other Pacific Islander', 'raceeth_White']]
y = test_temp['readingScore']

In [233]:
x_test = x
y_test = y

In [234]:
from sklearn.linear_model import LinearRegression

In [235]:
lm = LinearRegression()

In [236]:
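# NOTE: this fits a separate model (lm) on the test set itself, so the error
# reported below is an in-sample error for lm, not the test-set RMSE of lmScore.
lm.fit(x_test,y_test)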

LinearRegression()

In [237]:
predictions = lm.predict(x_test)

In [238]:
from sklearn import metrics

In [239]:
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

MSE: 5580.73950286333
RMSE: 74.70434728222536
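
The cells above fit a second model, `lm`, on the test set (and keep the raceeth_White column), so the printed values are that model's in-sample error. To measure how the trained model generalizes, the test features need the same columns as the training features, and the model must not be refit. A hedged sketch on invented data (the frames, the helper `encode`, and the name `model` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration.
toy_train = pd.DataFrame({
    "grade": [9, 10, 11, 10, 9, 11],
    "raceeth": ["White", "Asian", "Black", "White", "Black", "Asian"],
    "readingScore": [450.0, 520.0, 530.0, 500.0, 440.0, 560.0],
})
toy_test = pd.DataFrame({
    "grade": [10, 11],
    "raceeth": ["White", "Black"],  # no Asian rows in this toy test set
    "readingScore": [495.0, 525.0],
})

def encode(df):
    # One-hot encode raceeth, then drop the reference level and the target.
    out = pd.get_dummies(df, columns=["raceeth"], dtype=int)
    return out.drop(columns=["raceeth_White", "readingScore"], errors="ignore")

X_train = encode(toy_train)
# Reuse the training columns; levels absent from the test set become 0.
X_test = encode(toy_test).reindex(columns=X_train.columns, fill_value=0)

model = LinearRegression().fit(X_train, toy_train["readingScore"])
test_pred = model.predict(X_test)  # no refitting on the test set
test_rmse = float(np.sqrt(np.mean((toy_test["readingScore"] - test_pred) ** 2)))
print(list(X_test.columns), test_rmse)
```

Applied to the notebook, this means passing the prepared test features (without raceeth_White) to `lmScore.predict` and computing the RMSE from those predictions.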


## How can we improve the model performance on the test set?

In [None]:
# 1. Add more independent variables.
# 2. Winsorize the data to limit the influence of outliers (extreme values are capped rather than dropped).
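
Winsorizing replaces values beyond chosen percentiles with the percentile values themselves, so no rows are lost. A minimal sketch using pandas only, assuming 5th and 95th percentile cutoffs (the cutoffs and the toy values are illustrative choices, not taken from the notebook):

```python
import pandas as pd

# Toy values, invented for illustration; 2400 is an extreme value.
minutes = pd.Series([200.0, 225.0, 250.0, 250.0, 300.0, 2400.0])

# Cap values at the 5th and 95th percentiles instead of deleting rows.
lower, upper = minutes.quantile([0.05, 0.95])
minutes_winsorized = minutes.clip(lower=lower, upper=upper)
print(minutes_winsorized.tolist())
```

A column such as minutesPerWeekEnglish, whose maximum of 2400 sits far above its 75th percentile of 300 in the training summary, is a natural candidate. The cutoffs should be computed on the training set and then reused on the test set.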