# Subset Selection
## CMSE 381 - Spring 2024




In [1]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [2]:
# First, we're going to do all the data loading we've had for a while for this data set
auto = pd.read_csv('../../DataSets/Auto.csv')
auto = auto.replace('?', np.nan)
auto = auto.dropna()
auto.horsepower = auto.horsepower.astype('int')

# Shuffle the rows once up front so the unshuffled CV folds used later are effectively random
auto = auto.sample(frac=1).reset_index(drop=True)


auto.head()


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,16.0,8,318.0,150,4498,14.5,75,1,plymouth grand fury
1,18.2,8,318.0,135,3830,15.2,79,1,dodge st. regis
2,23.7,3,70.0,100,2420,12.5,80,3,mazda rx-7 gs
3,24.0,4,113.0,95,2278,15.5,72,3,toyota corona hardtop
4,29.0,4,90.0,70,1937,14.2,76,2,vw rabbit


Let's try to run subset selection on the `auto` data set! We're going to use `cylinders`, `horsepower`, `weight`, and `acceleration` to predict `mpg`. 

In [19]:
inputvars = ['cylinders','horsepower','weight', 'acceleration']

The first tool we are going to use is the `itertools` package, which gives us a way to get subsets of whatever size we want using the `combinations` command.  

In [4]:
from itertools import combinations

The weird thing is it's an iterator, so if I just try to print out what I want, it's not helpful to me. 

In [5]:
combinations(inputvars,2)

<itertools.combinations at 0x28805c4ab30>

But if I use it in a for loop it does what I want!

In [6]:
for x in combinations(inputvars,2):
    print(x)

('cylinders', 'horsepower')
('cylinders', 'weight')
('cylinders', 'acceleration')
('horsepower', 'weight')
('horsepower', 'acceleration')
('weight', 'acceleration')
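A small sketch of two related tricks, using only the standard library: wrapping the iterator in `list` to see all of its contents at once, and counting how many non-empty subsets there are in total. The name `n_subsets` is introduced here just for illustration.

```python
from itertools import combinations
from math import comb

inputvars = ['cylinders', 'horsepower', 'weight', 'acceleration']

# Materialize the iterator so the subsets can be inspected directly
pairs = list(combinations(inputvars, 2))
print(pairs)

# Number of non-empty subsets: sum of "4 choose k" for k = 1..4, which is 2**4 - 1
n_subsets = sum(comb(len(inputvars), k) for k in range(1, len(inputvars) + 1))
print(n_subsets)
```

Because the count doubles with every added variable, scoring every subset quickly becomes expensive for larger variable lists.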


Here's some code stolen from the last few days to run linear regression on a subset of the input variables. 

In [7]:
def myscore_train(df,listofvars, outputvar = 'mpg'):
    X = df[list(listofvars)]
    y = df[outputvar]
    
    #build linear regression model
    model = LinearRegression()
    model.fit(X,y)
    
    #MSE on the same data the model was fit to, i.e. the training error
    trainscore = mean_squared_error(y, model.predict(X))
    
    return trainscore
    
myvars = ('cylinders', 'acceleration')
myscore_train(auto,myvars)

23.942446650601344

In [8]:
def myscore_cv(df,listofvars, outputvar = 'mpg'):
    X = df[list(listofvars)]
    y = df[outputvar]
    
    #build linear regression model
    model = LinearRegression()
    

    #use 5-fold CV to evaluate model
    scores = cross_val_score(model, X,y, 
                             scoring='neg_mean_squared_error',
                             cv=5)

    #scores are negated MSE values, so take absolute values and average them
    return np.average(np.absolute(scores))
    

myvars = ('cylinders', 'acceleration')
myscore_cv(auto,myvars)

24.383404336196282
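The `np.absolute` call in `myscore_cv` is there because scikit-learn's `'neg_mean_squared_error'` scorer returns the negative of the MSE, so that larger is always better. Here is a minimal sketch of that sign convention on synthetic data; the data and the names `neg_scores` and `cv_mse` are made up for illustration and are not part of the Auto analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for this sketch, not the Auto data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# An explicit KFold makes the shuffling and the seed visible
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_scores = cross_val_score(LinearRegression(), X, y,
                             scoring='neg_mean_squared_error', cv=cv)

print(neg_scores)            # one non-positive value per fold
cv_mse = -neg_scores.mean()  # flip the sign to get the usual CV estimate of MSE
print(cv_mse)
```

Passing `cv=5` as in `myscore_cv` uses unshuffled folds for a regressor, which is one reason the data frame was shuffled when it was loaded.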


&#9989; **<font color=red>Do this:</font>** Modify the code below as follows: 
- Set up two nested for loops to get every subset of size $p \in \{1,\cdots,4\}$ from the list of variables I want to use
- For each of these subsets, use the `myscore_train` function to get the training MSE.
- Append it into a data frame as shown


In [25]:
myscore_train(auto,inputvars, outputvar = 'mpg')

17.76139996926686

In [64]:
myvars = []
myscores = []

for i in range(1,5):
    for x in combinations(inputvars, i):
        curr_vars = x
        myvars.append(curr_vars)
        curr_score = myscore_train(auto,curr_vars, outputvar = 'mpg')
        myscores.append(curr_score)   



        
myResults = pd.DataFrame({'Vars':myvars, 'Score':myscores})
myResults

Unnamed: 0,Vars,Score
0,"(cylinders,)",24.02018
1,"(horsepower,)",23.943663
2,"(weight,)",18.676617
3,"(acceleration,)",49.873627
4,"(cylinders, horsepower)",20.84819
5,"(cylinders, weight)",18.382946
6,"(cylinders, acceleration)",23.942447
7,"(horsepower, weight)",17.841442
8,"(horsepower, acceleration)",22.461644
9,"(weight, acceleration)",18.247176


We got all our main subsets, we're just missing a null model. This is the model that predicts the sample mean `mpg` for any input data. 

&#9989; **<font color=red>Do this:</font>** What is the MSE on our data set if we just predict the mean for every data point? Add this entry to your `myResults` data frame.

*Hint: you can get a numpy array with every entry being the same output by using the `np.full` command.*
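As a rough illustration of the hint, here is the null-model calculation on a tiny made-up array (the values in `y` are toy numbers, not `mpg` data). The MSE of always predicting the mean is the same thing as the population variance of the response.

```python
import numpy as np

# Toy response values, chosen only to make the arithmetic easy to follow
y = np.array([1.0, 2.0, 3.0, 4.0])

# Null model: the same prediction (the sample mean) for every data point
preds = np.full(y.shape, y.mean())

null_mse = np.mean((y - preds) ** 2)
print(null_mse)

# Same quantity as the variance with the default ddof=0
print(np.var(y))
```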

In [65]:
## Your code here ##
def myscore_avgtrain(df,listofvars, outputvar = 'mpg'):
    y = df[outputvar]
    
    if len(listofvars) == 0:
        #null model: predict the sample mean for every data point
        testscore = mean_squared_error(y, np.full(len(y), y.mean()))
        
    else:
        X = df[list(listofvars)]

        #build linear regression model
        model = LinearRegression()
        model.fit(X,y)
        testscore = mean_squared_error(y, model.predict(X))
    
    #return the training mean squared error
    return testscore
    
myscore = myscore_avgtrain(auto,[])
 

In [66]:
newresult = pd.DataFrame({'Vars':['empty'], 'Score':[myscore]})

myResults = pd.concat([newresult,myResults],ignore_index = True)

# If you print it out now, you should have 16 models scored
myResults

Unnamed: 0,Vars,Score
0,empty,60.762738
1,"(cylinders,)",24.02018
2,"(horsepower,)",23.943663
3,"(weight,)",18.676617
4,"(acceleration,)",49.873627
5,"(cylinders, horsepower)",20.84819
6,"(cylinders, weight)",18.382946
7,"(cylinders, acceleration)",23.942447
8,"(horsepower, weight)",17.841442
9,"(horsepower, acceleration)",22.461644


&#9989; **<font color=red>Do this:</font>** For each size $p \in \{1,\cdots,4\}$, what is the minimum score? The `idxmin` or `argmin` command will likely be useful for this. 

In [69]:
# Your code here #
# Position of the overall minimum score, across all subset sizes
np.argmin(myResults.iloc[:,1])

15
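The single `argmin` above finds the best model overall rather than the best one for each size. One possible way to get a per-size answer is to add a size column and group on it. The sketch below rebuilds the size 1 and size 2 rows from the training-score table shown earlier; the `Size` column and the names `results` and `best_rows` are introduced here for illustration.

```python
import pandas as pd

# Size 1 and size 2 rows copied from the training-score table above
results = pd.DataFrame({
    'Vars': [('cylinders',), ('horsepower',), ('weight',), ('acceleration',),
             ('cylinders', 'horsepower'), ('cylinders', 'weight'),
             ('cylinders', 'acceleration'), ('horsepower', 'weight'),
             ('horsepower', 'acceleration'), ('weight', 'acceleration')],
    'Score': [24.02018, 23.943663, 18.676617, 49.873627, 20.84819,
              18.382946, 23.942447, 17.841442, 22.461644, 18.247176],
})

# Number of variables in each subset
results['Size'] = results['Vars'].apply(len)

# idxmin returns, for each size, the row label of the smallest score
best_rows = results.loc[results.groupby('Size')['Score'].idxmin()]
print(best_rows)
```

If the null model is stored with the string `'empty'` in `Vars`, as in `myResults`, its size would have to be set to 0 by hand, since `len('empty')` is 5.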

&#9989; **<font color=red>Do this:</font>** Use ``myscore_cv`` to determine the best subset of variables

In [1]:
# Your code here #
# combinations must be imported in the current kernel session;
# without this line the cell raises a NameError after a restart
from itertools import combinations

myvars = []
myscores = []

for i in range(1,5):
    for x in combinations(inputvars, i):
        curr_vars = x
        myvars.append(curr_vars)
        curr_score = myscore_cv(auto,curr_vars, outputvar = 'mpg')
        myscores.append(curr_score)   

myResults = pd.DataFrame({'Vars':myvars, 'Score':myscores})
# note: the null model is scored with its training MSE, not a CV estimate
myscore = myscore_avgtrain(auto,[])
newresult = pd.DataFrame({'Vars':['empty'], 'Score':[myscore]})

myResults = pd.concat([newresult,myResults],ignore_index = True)

# the best subset has the smallest CV MSE, so use argmin (not argmax)
min_index = np.argmin(myResults.iloc[:,1])
myResults.iloc[min_index,:]

## Homework problem


&#9989; **<font color=red>Please answer this problem in the homework:</font>** Write a function that does forward selection and another function that does backward selection. 



In [14]:
# Your code here #