# Lab: Even More Linear Regression
## CMSE 381 - Spring 2023
## Lecture 7, Jan 25, 2023

In the last few lectures, we have focused on linear regression, that is, fitting models of the form 
$$
Y =  \beta_0 +  \beta_1 X_1 +  \beta_2 X_2 + \cdots +  \beta_pX_p + \varepsilon
$$
In this lab, we will continue to use two different tools for linear regression. 
- [scikit-learn](https://scikit-learn.org/stable/index.html) is arguably the most widely used tool for machine learning in Python 
- [Statsmodels](https://www.statsmodels.org) provides many of the statistical tests we've been learning in class

This lab will cover two ideas: 
- Categorical variables and how to represent them as dummy variables. 
- How to build interaction terms and pass them into your favorite model.
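
As a quick reminder of how the two tools differ in use, here is a minimal sketch that fits the same two-variable model with both. The data is synthetic and the names `toy_df`, `sk_model`, and `sm_model` are made up for this illustration; they are not part of the lab's data sets.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
toy_df['y'] = 1 + 2 * toy_df.x1 - 3 * toy_df.x2 + rng.normal(scale=0.1, size=200)

# scikit-learn: pass a feature matrix and a target vector
sk_model = LinearRegression().fit(toy_df[['x1', 'x2']].values, toy_df.y.values)

# statsmodels: pass a formula and the data frame
sm_model = smf.ols('y ~ x1 + x2', toy_df).fit()

print('sklearn    :', sk_model.intercept_, sk_model.coef_)
print('statsmodels:', sm_model.params.values)
```

Both fits solve the same least squares problem, so the coefficients should agree up to floating point error. The practical difference is that `statsmodels` also reports standard errors and p-values.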

In [1]:
# As always, we start with our favorite standard imports. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf

##ANSWER##

# Code generating examples on slides
Removed from student version

In [2]:
##ANSWER##
Credit_df = pd.read_csv('Credit.csv', index_col = 0)
Credit_df.head()

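| | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |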


In [3]:
##ANSWER##
est = smf.ols('Balance ~ Student', Credit_df).fit()
est.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 480.3694 | 23.434 | 20.499 | 0.000 | 434.300 | 526.439 |
| Student[T.Yes] | 396.4556 | 74.104 | 5.350 | 0.000 | 250.771 | 542.140 |
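
To make the table concrete, here is the fitted model written out. This assumes, as the label `Student[T.Yes]` suggests, that the dummy variable is 1 for students and 0 otherwise; the symbol $x_{\text{student}}$ is notation introduced only for this illustration.

$$
\widehat{\texttt{Balance}} = 480.3694 + 396.4556 \cdot x_{\text{student}} =
\begin{cases}
480.3694 & \text{if not a student} \\
480.3694 + 396.4556 = 876.8250 & \text{if a student}
\end{cases}
$$

So the intercept is the predicted balance for non-students, and the coefficient is the gap between the two groups.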


# Playing with multi-level variables 

## The wrong way

Ok, we're going to do this incorrectly to start. You pull in the `Auto` data set. You were so proud of yourself for remembering to fix the problems with the `horsepower` column that you conveniently forgot that the column with information about country of origin (`origin`) has a bunch of integers in it, representing:
- 1: `American`
- 2: `European`
- 3: `Japanese`.

In [4]:
Auto_df = pd.read_csv('Auto.csv')
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')


Auto_df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')
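
The `astype('int')` step above is needed because the `'?'` entries make pandas read `horsepower` as strings. A possible shortcut is to tell `read_csv` about the placeholder when loading. The sketch below uses a tiny made-up CSV (`toy_csv`) in place of `Auto.csv`, so the rows are illustrative only.

```python
import io
import pandas as pd

# A tiny stand-in for Auto.csv (made-up rows) with one '?' in horsepower.
toy_csv = io.StringIO(
    "mpg,horsepower,origin\n"
    "18.0,130,1\n"
    "25.0,?,2\n"
    "31.0,65,3\n"
)

# na_values='?' turns the placeholder into NaN at load time, so horsepower
# is parsed as a numeric column instead of a column of strings.
toy_auto = pd.read_csv(toy_csv, na_values='?').dropna()

# After dropping the NaN row, the column can be stored as integers.
toy_auto['horsepower'] = toy_auto['horsepower'].astype(int)
print(toy_auto.dtypes)
```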

In [5]:
Auto_df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


You then go on your merry way building the model 
$$
\texttt{mpg} = \beta_0 + \beta_1 \cdot \texttt{origin}. 
$$

In [7]:
from sklearn.linear_model import LinearRegression

X = Auto_df.origin.values
X = X.reshape(-1, 1)
y = Auto_df.mpg.values

regr = LinearRegression()

regr.fit(X,y)

print('beta_1 = ', regr.coef_[0])
print('beta_0 = ', regr.intercept_)

beta_1 =  5.4765474801914475
beta_0 =  14.811973615412462


In [8]:
##ANSWER##
# This is the statsmodels version of the same fit. The lab uses scikit-learn
# above because statsmodels hides the dummy variable construction.
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 14.8120 | 0.716 | 20.676 | 0.000 | 13.404 | 16.220 |
| origin | 5.4765 | 0.405 | 13.531 | 0.000 | 4.681 | 6.272 |


&#9989; **<font color=red>Q:</font>** What does your model predict for each of the three types of cars? 

In [7]:
# Your code here

In [9]:
##ANSWER##
convertOrigin= {1: 'American', 2:'European', 3:'Japanese'}

# Prediction for origin code n, using the coefficients fitted above
# (regr is the scikit-learn model trained on the integer origin column).
def prediction(n):
    return regr.intercept_ + regr.coef_[0]*n

for i in range(1,4):
    print(str(i), convertOrigin[i], ':', round(prediction(i), 2))


1 American : 20.29
2 European : 25.77
3 Japanese : 31.24


&#9989; **<font color=red>Q:</font>** Is it possible for your model to predict that both American and Japanese cars have `mpg` below European cars? 

Your answer here.

##ANSWER##
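Not possible. The model is linear in the integer code, so its predictions are monotone in that code: they either increase from 1 to 2 to 3, or decrease, or stay constant if the slope is zero. The European prediction (code 2) therefore always lies between the American and Japanese predictions. The point to emphasize is that integer coding is not the right way to train the model. 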
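
A minimal sketch of this limitation, using made-up numbers rather than the Auto data: the middle group has by far the highest mean, yet the integer-coded fit cannot reflect that. The names `codes`, `fitted`, and `group_means` are introduced only for this illustration.

```python
import numpy as np

# Toy data (made up for illustration): the middle group has the highest mean.
codes = np.array([1, 1, 2, 2, 3, 3])
y = np.array([10.0, 12.0, 30.0, 32.0, 14.0, 16.0])

# Fit y = beta_0 + beta_1 * code by least squares.
beta_1, beta_0 = np.polyfit(codes, y, deg=1)

# Predictions of the integer-coded model for each of the three codes.
fitted = beta_0 + beta_1 * np.array([1, 2, 3])

# The actual mean of y within each group.
group_means = np.array([y[codes == k].mean() for k in (1, 2, 3)])

print('group means      :', group_means)
print('integer-coded fit:', fitted)
```

The group means go up and then down, while the fitted values change by the same amount at each step, so the model has no way to put group 2 on top.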

## The right way

Ok, so you figure out your problem and decide to load in your data and fix the `origin` column to have names as entries.

In [10]:
convertOrigin= {1: 'American', 2:'European', 3:'Japanese'}

# This command swaps out each number n for convertOrigin[n], making it one of
# the three strings instead of an integer now.
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | American | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | American | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | American | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | American | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | American | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 392 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | American | ford mustang gl |
| 393 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | European | vw pickup |
| 394 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | American | dodge rampage |
| 395 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | American | ford ranger |
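
The relabeling step can also be written with `Series.map`, which accepts the dictionary directly. This is an alternative sketch on a small made-up series (`toy_codes`), not a change to the lab's code.

```python
import pandas as pd

convertOrigin = {1: 'American', 2: 'European', 3: 'Japanese'}

# A small made-up series of origin codes, standing in for Auto_df.origin.
toy_codes = pd.Series([1, 3, 2, 1], name='origin')

# Series.map looks each value up in the dictionary. A code missing from the
# dictionary would become NaN instead of raising a KeyError.
toy_names = toy_codes.map(convertOrigin)
print(toy_names.tolist())
```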


Below is a quick bit of code that automatically generates our dummy variables. Yay for not having to code that mess ourselves!

In [11]:
origin_dummies_df = pd.get_dummies(Auto_df.origin, prefix='origin')
origin_dummies_df

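| | origin_American | origin_European | origin_Japanese |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 392 | 1 | 0 | 0 |
| 393 | 0 | 1 | 0 |
| 394 | 1 | 0 | 0 |
| 395 | 1 | 0 | 0 |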
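
Two optional arguments of `pd.get_dummies` are worth knowing about here. The sketch below uses a small made-up series (`toy_origin`); depending on your pandas version, the default output may show `True`/`False` rather than the `1`/`0` seen above, and `dtype=int` makes the result explicit.

```python
import pandas as pd

# A small made-up series, standing in for Auto_df.origin.
toy_origin = pd.Series(['American', 'European', 'Japanese', 'American'],
                       name='origin')

# dtype=int forces 0/1 columns regardless of the pandas default.
all_levels = pd.get_dummies(toy_origin, prefix='origin', dtype=int)

# drop_first=True drops the first level (here 'American'), leaving one
# fewer column than there are levels.
two_levels = pd.get_dummies(toy_origin, prefix='origin', dtype=int,
                            drop_first=True)

print(all_levels)
print(two_levels)
```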


&#9989; **<font color=red>Q:</font>** What is the interpretation of each column in the `origin_dummies_df` data frame?

*Your answer here*

I pass these new dummy variables into my `scikit-learn` linear regression model and get the following coefficients:

In [15]:
X = origin_dummies_df.iloc[:, 0:2].values
y = Auto_df.mpg

regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)

Coefs =  [-10.41716352  -2.84769173]
Intercept =  30.4506329113924


&#9989; **<font color=red>Q:</font>** Now what does your model predict for each of the three types of cars? 

In [12]:
# Your code here

In [17]:
##ANSWER## 

print('American:')
print(regr.predict(((1, 0),))[0])


print('European:')
print(regr.predict(((0, 1),))[0])


print('Japanese:')
print(regr.intercept_)


# Note: I was hoping the output would show that the predictions don't have to be
# strictly increasing, but here they still increase from American to Japanese.
# So just emphasize that the order of the levels no longer matters.

American:
20.033469387755105
European:
27.602941176470583
Japanese:
30.4506329113924


### Ooops

&#9989; **<font color=red>Q:</font>** Aw man, I didn't quite do what we said for the dummy variables in class. We talked about having only two dummy variables for a three-level variable. Copy my code from above into the cell below and fix it to use two dummy variables instead of three. 
- Are your coefficients different now?
- Are your predictions for each of the three origins different now? 
- Does it matter which two levels you used for your dummy variables? 

In [14]:
# Your code here

In [14]:
##ANSWER##
# Coefficients are different (in particular, there's a different number of
# them) but the prediction ends up the same.....
X = origin_dummies_df[['origin_American','origin_European']].values
y = Auto_df.mpg

regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)

print('\n')

print('American:')
print(regr.predict(((1, 0),))[0])


print('European:')
print(regr.predict(((0, 1),))[0])


print('Japanese:')
print(regr.predict(((0, 0),))[0])

Coefs =  [-10.41716352  -2.84769173]
Intercept =  30.4506329113924


American:
20.033469387755105
European:
27.602941176470583
Japanese:
30.4506329113924
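
To check the questions above on a small scale, here is a sketch that fits the same toy data with three different choices of dummy columns. The data frame `toy` and its values are made up for illustration; only the pattern carries over to the Auto data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: two cars per origin, with group means 19, 27, and 31.
toy = pd.DataFrame({
    'origin': ['American', 'American', 'European', 'European',
               'Japanese', 'Japanese'],
    'mpg': [18.0, 20.0, 26.0, 28.0, 30.0, 32.0],
})
dummies = pd.get_dummies(toy.origin, prefix='origin', dtype=int)

column_choices = {
    'American + European': ['origin_American', 'origin_European'],
    'European + Japanese': ['origin_European', 'origin_Japanese'],
    'all three': ['origin_American', 'origin_European', 'origin_Japanese'],
}

fitted = {}
for label, cols in column_choices.items():
    model = LinearRegression().fit(dummies[cols].values, toy.mpg.values)
    # Store the fitted value for every row under this choice of columns.
    fitted[label] = model.predict(dummies[cols].values)
    print(label, '| coefs:', model.coef_, '| intercept:', model.intercept_)
    print('   fitted values:', fitted[label])
```

The coefficients and intercept change with the choice of columns, but every version predicts the group mean for each origin. The left-out level simply becomes the baseline that the intercept describes.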


## Another right way

Ok, fine, I'll cave, I made you do it the hard way but you got to see how the innards worked, so maybe it's not all bad ;) 

The following code does the same thing as above, but because `statsmodels` has built-in tools to handle categorical variables in a data frame, it does the hard work for you. 

In [128]:
est = smf.ols('mpg ~ origin', Auto_df).fit()
est.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.0718 | 0.407 | 49.339 | 0.000 | 19.272 | 20.872 |
| origin[T.European] | 7.8197 | 0.867 | 9.018 | 0.000 | 6.115 | 9.524 |
| origin[T.Japanese] | 10.3789 | 0.828 | 12.540 | 0.000 | 8.752 | 12.006 |


&#9989; **<font color=red>Q:</font>** What is the model learned from the above printout? Be specific in terms of your dummy variables. 

Your answer here

In [129]:
##ANSWER##
X = origin_dummies_df[['origin_European', 'origin_Japanese']].values
y = Auto_df.mpg
regr = LinearRegression()

regr.fit(X,y)

print('Coefs = ', regr.coef_)
print('Intercept = ', regr.intercept_)

print('y = ', round(regr.intercept_,2), 
      ' + ', round(regr.coef_[0],2), '*x_Euro + ', 
      round(regr.coef_[1],2), '*x_Japan',)

Coefs =  [ 7.81965438 10.37885872]
Intercept =  20.071774193548386
y =  20.07  +  7.82 *x_Euro +  10.38 *x_Japan
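
By default the formula interface used `American` as the baseline level. If you would rather compare against a different level, the formula language lets you name the reference explicitly. The sketch below shows this on made-up data (`toy`), so the numbers are illustrative only.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: two cars per origin, with group means 19, 27, and 31.
toy = pd.DataFrame({
    'origin': ['American', 'American', 'European', 'European',
               'Japanese', 'Japanese'],
    'mpg': [18.0, 20.0, 26.0, 28.0, 30.0, 32.0],
})

# Default coding: the first level alphabetically ('American') is the baseline.
default_fit = smf.ols('mpg ~ origin', toy).fit()

# Same model, but with 'Japanese' as the baseline level.
japan_fit = smf.ols("mpg ~ C(origin, Treatment(reference='Japanese'))",
                    toy).fit()

print(default_fit.params)
print(japan_fit.params)
```

The intercept is the mean of whichever level is the baseline, and the other coefficients are differences from it. The fitted values are the same under both codings.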


![Stop Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Vienna_Convention_road_sign_B2a.svg/180px-Vienna_Convention_road_sign_B2a.svg.png)

Great, you got to here! Hang out for a bit, there's more lecture before we go on to the next portion. 

# Interaction Terms 

Below is the code I have for generating the tables shown on the slides. 

In [21]:
Advertising_df = pd.read_csv('Advertising.csv', index_col = 0)
Advertising_df.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [22]:
Advertising_df['TVxRadio'] = Advertising_df.TV * Advertising_df.Radio
Advertising_df.head()

| | TV | Radio | Newspaper | Sales | TVxRadio |
|---|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 | 8697.78 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 | 1748.85 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 | 789.48 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 | 6256.95 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 | 1952.64 |


In [23]:
est = smf.ols('Sales ~ TV + Radio + TVxRadio', Advertising_df).fit()
est.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.7502 | 0.248 | 27.233 | 0.000 | 6.261 | 7.239 |
| TV | 0.0191 | 0.002 | 12.699 | 0.000 | 0.016 | 0.022 |
| Radio | 0.0289 | 0.009 | 3.241 | 0.001 | 0.011 | 0.046 |
| TVxRadio | 0.0011 | 5.24e-05 | 20.727 | 0.000 | 0.001 | 0.001 |
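
Building the product column by hand works, but the `statsmodels` formula language can also create the interaction for you. Here is a sketch on synthetic data showing that the two routes give the same fit; the data frame `toy` and the column names `a`, `b`, and `axb` are made up for this illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration) with a true interaction.
rng = np.random.default_rng(1)
toy = pd.DataFrame({'a': rng.uniform(0, 10, 100), 'b': rng.uniform(0, 10, 100)})
toy['y'] = (2 + 0.5 * toy.a + 0.3 * toy.b + 0.1 * toy.a * toy.b
            + rng.normal(scale=0.5, size=100))

# Option 1: build the product column by hand, as in the cell above.
toy['axb'] = toy.a * toy.b
manual = smf.ols('y ~ a + b + axb', toy).fit()

# Option 2: let the formula build it. 'a * b' expands to 'a + b + a:b',
# where 'a:b' is the interaction term.
formula = smf.ols('y ~ a * b', toy).fit()

print(manual.params)
print(formula.params)
```

Either way, the Advertising model in the table above can be regrouped as $\texttt{Sales} \approx 6.7502 + (0.0191 + 0.0011 \cdot \texttt{Radio}) \cdot \texttt{TV} + 0.0289 \cdot \texttt{Radio}$, so the estimated effect of one more unit of `TV` depends on the level of `Radio`.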


&#9989; **<font color=red>Do this:</font>** Using the Auto data set, train the model 
$$
\texttt{mpg} = \beta_0 + \beta_1\cdot \texttt{weight} + \beta_2\cdot \texttt{horsepower} + \beta_3\cdot (\texttt{weight} \times \texttt{horsepower}).
$$
Is the interaction term adding value to the model? 

In [24]:
# Your code here

In [25]:
##ANSWER##
# For me to be able to reload quickly
Auto_df = pd.read_csv('Auto.csv')
convertOrigin= {1: 'American', 2:'European', 3:'Japanese'}
Auto_df.origin = Auto_df.origin.apply(lambda n: convertOrigin[n])
Auto_df = Auto_df.replace('?', np.nan)
Auto_df = Auto_df.dropna()
Auto_df.horsepower = Auto_df.horsepower.astype('int')

Auto_df.head()


| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | American | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | American | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | American | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | American | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | American | ford torino |


In [26]:
##ANSWER##
# Add in the column

Auto_df['wt_x_hp']= Auto_df.weight * Auto_df.horsepower
Auto_df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | wt_x_hp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | American | chevrolet chevelle malibu | 455520 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | American | buick skylark 320 | 609345 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | American | plymouth satellite | 515400 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | American | amc rebel sst | 514950 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | American | ford torino | 482860 |


In [27]:
##ANSWER##
# train the model
est = smf.ols('mpg ~ weight + horsepower + wt_x_hp', Auto_df).fit()
est.summary().tables[1]



| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 63.5579 | 2.343 | 27.127 | 0.000 | 58.951 | 68.164 |
| weight | -0.0108 | 0.001 | -13.921 | 0.000 | -0.012 | -0.009 |
| horsepower | -0.2508 | 0.027 | -9.195 | 0.000 | -0.304 | -0.197 |
| wt_x_hp | 5.355e-05 | 6.65e-06 | 8.054 | 0.000 | 4.05e-05 | 6.66e-05 |
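
In the table above, the p-value on `wt_x_hp` is reported as 0.000, which suggests the interaction term is contributing. Another way to judge this is to compare the model with and without the interaction. The sketch below does that on synthetic data; `toy`, `w`, and `h` are made-up names and the numbers are illustrative, not results for the Auto data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration) with a true interaction.
rng = np.random.default_rng(2)
toy = pd.DataFrame({'w': rng.uniform(1, 5, 200), 'h': rng.uniform(1, 5, 200)})
toy['y'] = (40 - 3 * toy.w - 4 * toy.h + 0.8 * toy.w * toy.h
            + rng.normal(scale=1.0, size=200))

# Fit the model without and with the interaction term.
additive = smf.ols('y ~ w + h', toy).fit()
interaction = smf.ols('y ~ w + h + w:h', toy).fit()

print('adjusted R^2 without interaction:', additive.rsquared_adj)
print('adjusted R^2 with interaction   :', interaction.rsquared_adj)

# F-test of the nested models: a small p-value favors keeping the interaction.
f_value, p_value, df_diff = interaction.compare_f_test(additive)
print('F =', f_value, ' p =', p_value)
```

With a single added term, this F-test and the t-test on the interaction coefficient lead to the same conclusion. The comparison of adjusted $R^2$ gives a sense of how much the fit improves.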




-----
### Congratulations, we're done!

Written by Dr. Liz Munch, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.
