# Subset Selection
## CMSE 381 - Spring 2024




In [45]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [46]:
# First, we're going to do all the data loading we've had for a while for this data set
auto = pd.read_csv('../../DataSets/Auto.csv')
auto = auto.replace('?', np.nan)
auto = auto.dropna()
auto.horsepower = auto.horsepower.astype('int')

# This shuffles the data set in advance so that I don't need to worry about it later
auto = auto.sample(frac=1).reset_index(drop=True)


auto.head()


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,17.5,6,250.0,110,3520,16.4,77,1,chevrolet concours
1,20.8,6,200.0,85,3070,16.7,78,1,mercury zephyr
2,29.5,4,97.0,71,1825,12.2,76,2,volkswagen rabbit
3,31.0,4,71.0,65,1773,19.0,71,3,toyota corolla 1200
4,31.0,4,76.0,52,1649,16.5,74,3,toyota corona


Let's try to run subset selection on the `auto` data set! We're going to use `cylinders`, `horsepower`, `weight`, and `acceleration` to predict `mpg`. 

In [47]:
inputvars = ['cylinders','horsepower','weight', 'acceleration']

The first tool we are going to use is the `itertools` module from the Python standard library, which gives us a way to get subsets of whatever size we want using the `combinations` function.

In [48]:
from itertools import combinations

The weird thing is that `combinations` returns an iterator, so if I just try to print it, I see the iterator object rather than the subsets.

In [49]:
combinations(inputvars,2)

<itertools.combinations at 0x13443ee8e00>

But if I use it in a for loop it does what I want!

In [50]:
for x in combinations(inputvars,2):
    print(x)

('cylinders', 'horsepower')
('cylinders', 'weight')
('cylinders', 'acceleration')
('horsepower', 'weight')
('horsepower', 'acceleration')
('weight', 'acceleration')
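
If you want to see all the subsets at once without writing a loop, one option is to wrap the iterator in `list()`. This is a small sketch using the same `inputvars` list as above; the name `pairs` is just an illustrative choice.

```python
from itertools import combinations

inputvars = ['cylinders', 'horsepower', 'weight', 'acceleration']

# list() consumes the iterator and stores every size-2 subset at once
pairs = list(combinations(inputvars, 2))

print(len(pairs))   # number of ways to choose 2 variables out of 4
print(pairs[0])     # each subset comes back as a tuple of column names
```

Keep in mind that an iterator can only be walked through once, so storing the result in a list is handy if you need the subsets more than one time.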


Here's some code stolen from the last few days to run linear regression on a subset of the input variables. 

In [51]:
def myscore_train(df,listofvars, outputvar = 'mpg'):
    X = df[list(listofvars)]
    y = df[outputvar]
    
    #build linear regression model
    model = LinearRegression()
    model.fit(X,y)
    
    #training MSE: the model is scored on the same data it was fit to
    trainscore = mean_squared_error(y, model.predict(X))
    
    return trainscore
    
myvars = ('cylinders', 'acceleration')
myscore_train(auto,myvars)

23.94244665060135

In [52]:
def myscore_cv(df,listofvars, outputvar = 'mpg'):
    X = df[list(listofvars)]
    y = df[outputvar]
    
    #build linear regression model
    model = LinearRegression()
    

    #use 5-fold CV to evaluate model
    scores = cross_val_score(model, X,y, 
                             scoring='neg_mean_squared_error',
                             cv=5)

    #cross_val_score returns negated MSE values, so flip the sign
    #and average the MSE over the 5 folds
    return np.average(np.absolute(scores))
    

myvars = ('cylinders', 'acceleration')
myscore_cv(auto,myvars)

25.243871676023723


&#9989; **<font color=red>Do this:</font>** Modify the code below as follows: 
- Set up two nested for loops to get every subset of size $p \in \{1,\cdots,4\}$ of the list of variables I want to use.
- For each of these subsets, use the `myscore_train` function to get the training MSE.
- Append it to a data frame as shown.


In [53]:
myvars = []
myscores = []

#-----
# your loop goes in here
#-----
        
myResults = pd.DataFrame({'Vars':myvars, 'TrainScore':myscores})
myResults

Unnamed: 0,Vars,TrainScore
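
If the loop structure is not clear, here is one possible sketch of the nested-loop pattern. It is not the only way to do it. To keep it runnable without `Auto.csv`, it uses a made-up data frame called `toy` (random numbers, not the real `auto` data), and the helper `train_mse` is a hypothetical stand-in that does the same job as `myscore_train`.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assumption: synthetic stand-in for `auto`, used only so the sketch runs
rng = np.random.default_rng(0)
inputvars = ['cylinders', 'horsepower', 'weight', 'acceleration']
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=inputvars)
toy['mpg'] = 3.0 - 2.0 * toy['weight'] + rng.normal(scale=0.5, size=100)

def train_mse(df, listofvars, outputvar='mpg'):
    """Training MSE of a linear regression on the chosen columns."""
    X = df[list(listofvars)]
    y = df[outputvar]
    model = LinearRegression().fit(X, y)
    return mean_squared_error(y, model.predict(X))

subset_vars = []
subset_scores = []

# Outer loop: subset size p = 1, ..., 4
for p in range(1, len(inputvars) + 1):
    # Inner loop: every subset of that size
    for subset in combinations(inputvars, p):
        subset_vars.append(subset)
        subset_scores.append(train_mse(toy, subset))

toy_results = pd.DataFrame({'Vars': subset_vars, 'TrainScore': subset_scores})
print(toy_results)
```

With four input variables there are 4 + 6 + 4 + 1 = 15 non-empty subsets, so the resulting data frame should have 15 rows before the null model is added.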


We have all our main subsets; we're just missing the null model. This is the model that predicts the sample mean `mpg` for any input data.

&#9989; **<font color=red>Do this:</font>** What is the MSE on our data set if we just predict the mean for every data point? Add this entry to your `myResults` data frame.

*Hint: you can get a numpy array with every entry being the same output by using the `np.full` command.*
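
As a small illustration of the hint, here is how `np.full` behaves on a made-up array of four numbers (these values are invented for the example and are not from the `auto` data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Assumption: a tiny made-up response vector, just to show the mechanics
y = np.array([10.0, 20.0, 30.0, 40.0])

# An array with the same shape as y where every entry is the mean of y
y_pred = np.full(y.shape, y.mean())

# MSE of the constant prediction
null_mse = mean_squared_error(y, y_pred)
print(y_pred)
print(null_mse)
```

Predicting the mean everywhere gives an MSE equal to the variance of `y` computed with a denominator of $n$, which is what `np.var(y)` returns by default.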

In [22]:
## Your code here ##

myscore = np.nan #<---- fix this to get your score! Then run
                 #      the cell below to append it to your 
                 #      dataframe. 

In [24]:
newresult = pd.DataFrame({'Vars':['empty'], 'TrainScore':[myscore]})

myResults = pd.concat([newresult,myResults],ignore_index = True)

# If you print it out now, you should have 16 models scored
myResults

Unnamed: 0,Vars,TrainScore
0,empty,60.762738
1,"(cylinders,)",24.02018
2,"(horsepower,)",23.943663
3,"(weight,)",18.676617
4,"(acceleration,)",49.873627
5,"(cylinders, horsepower)",20.84819
6,"(cylinders, weight)",18.382946
7,"(cylinders, acceleration)",23.942447
8,"(horsepower, weight)",17.841442
9,"(horsepower, acceleration)",22.461644


&#9989; **<font color=red>Do this:</font>** For each size $p \in \{1,\cdots,4\}$, what is the minimum score? The `idxmin` or `argmin` command will likely be useful for this.
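
To see how `idxmin` can be combined with a group-by, here is a sketch on a small made-up results table. The variable names `a`, `b`, `c`, the scores, and the extra `Size` column are all invented for illustration.

```python
import pandas as pd

# Assumption: a hypothetical results table with invented scores
toy_results = pd.DataFrame({
    'Vars': [('a',), ('b',), ('a', 'b'), ('a', 'c'), ('b', 'c')],
    'TrainScore': [5.0, 3.0, 2.5, 2.0, 2.8],
})

# Record how many variables each subset uses
toy_results['Size'] = toy_results['Vars'].apply(len)

# For each size, idxmin gives the row label of the lowest score
best_idx = toy_results.groupby('Size')['TrainScore'].idxmin()

# Pull out the winning row for each size
best_per_size = toy_results.loc[best_idx]
print(best_per_size)
```

If your data frame includes the null model stored as the string `'empty'`, its size would need to be set by hand (for example to 0), since `len('empty')` is 5.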

In [44]:
# Your code here #


0

&#9989; **<font color=red>Do this:</font>** Use `myscore_cv` to determine the best subset of variables.
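
One way to think about this step: unlike the training score, the cross-validated score does not automatically improve every time a variable is added, so it can be compared across subsets of different sizes. The sketch below shows the idea on made-up data where only `x1` and `x2` actually matter. The data frame `toy`, its column names, and the helper `cv_mse` are hypothetical stand-ins, with `cv_mse` playing the role of `myscore_cv`.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data where y depends on x1 and x2 but not x3
rng = np.random.default_rng(1)
features = ['x1', 'x2', 'x3']
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=features)
toy['y'] = 2.0 * toy['x1'] - 1.0 * toy['x2'] + rng.normal(scale=0.3, size=200)

def cv_mse(df, listofvars, outputvar='y'):
    """Average 5-fold cross-validated MSE for a linear regression."""
    scores = cross_val_score(LinearRegression(),
                             df[list(listofvars)], df[outputvar],
                             scoring='neg_mean_squared_error', cv=5)
    return -scores.mean()   # scores are negated MSE values

# Every non-empty subset of the features
candidates = [s for p in range(1, len(features) + 1)
              for s in combinations(features, p)]

# Score each subset and keep the one with the lowest CV error
cv_scores = {s: cv_mse(toy, s) for s in candidates}
best_subset = min(cv_scores, key=cv_scores.get)
print(best_subset, cv_scores[best_subset])
```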

In [None]:
# Your code here #


## Homework problem


&#9989; **<font color=red>Please answer this problem in the homework:</font>** Write a function that does forward selection and another function that does backward selection.



In [141]:
# Your code here #