In [7]:
#imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

## 5.4.1

Using basic statistical properties of the variance, as well as single-variable calculus, derive (5.6). In other words, prove that $\alpha$ given by (5.6) does indeed minimize Var $(\alpha X + (1-\alpha)Y)$.


We want to minimize Var $(\alpha X + (1- \alpha)Y)$. We have that

Var $(X+Y) =$ Var $(X) + $ Var $(Y) + 2 $ Cov $ (X,Y)$

Var $(aX) = a^2 $ Var $(X)$

Cov $(aX,bY) = ab $ Cov $(X,Y)$

Var $(aX + bY) = a^2$ Var $(X) + b^2$ Var $(Y) + 2ab$ Cov $(X,Y)$

Then, we can write:

Var $(aX + (1- a)Y) = $ Var $ (aX) + $ Var $((1- a)Y) + 2$ Cov $(aX,(1-a)Y)$

$=a^2$ Var $(X) + (1-a)^2$ Var $(Y) + 2a(1-a)$ Cov $(X,Y)$

To minimize, we differentiate this expression with respect to $a$, set the derivative equal to 0, and solve for $a$:

$0 =2a$ Var $(X) - 2(1-a)$ Var $(Y) + 2(1-2a)$ Cov $(X,Y)$

$0 = a $Var$(X) -$ Var$(Y) + a$ Var$(Y) + $Cov$(X,Y) - 2a$ Cov$(X,Y)$

$a $Var$(X) + a $Var$(Y) - 2a $Cov$(X,Y) = $Var$(Y) - $Cov$(X,Y)$

$a = ($Var$(Y) - $Cov$(X,Y))/($Var$(X) + $Var$(Y) - 2 $Cov$(X,Y))$

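$a = (\sigma^2_Y - \sigma_{XY})/(\sigma^2_X + \sigma^2_Y - 2 \sigma_{XY})$

This is (5.6) with $a = \alpha$. The second derivative of the variance with respect to $a$ is $2(\sigma^2_X + \sigma^2_Y - 2 \sigma_{XY}) = 2$ Var $(X - Y) \geq 0$, so the critical point is a minimum (a strict one whenever $X - Y$ is not constant).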
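As a quick numerical sanity check on the derivation, the sketch below compares the closed-form $a$ with a brute-force grid search over the variance expression. The variance and covariance values are hypothetical, chosen only for illustration, and the function names are not from the original assignment.

```python
import numpy as np

def min_variance_alpha(var_x, var_y, cov_xy):
    """Closed-form minimizer of Var(aX + (1 - a)Y), as derived above."""
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

def combined_variance(a, var_x, var_y, cov_xy):
    """Var(aX + (1 - a)Y) expanded with the variance identities."""
    return a**2 * var_x + (1 - a)**2 * var_y + 2 * a * (1 - a) * cov_xy

# Hypothetical values, for illustration only
var_x, var_y, cov_xy = 1.0, 1.25, 0.5

alpha_closed = min_variance_alpha(var_x, var_y, cov_xy)

# Brute-force check: evaluate the variance on a fine grid and take the argmin
grid = np.linspace(-1, 2, 30001)
alpha_grid = grid[np.argmin(combined_variance(grid, var_x, var_y, cov_xy))]

print(alpha_closed, alpha_grid)
```

With these inputs both values should land at approximately 0.6, which is $(1.25 - 0.5)/(1 + 1.25 - 1)$.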


## 5.4.2 
We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.

(a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

Each observation has a 1/n chance of being selected. Therefore there is a 1 - (1/n) chance that the first bootstrap observation is not the jth observation.

(b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

The bootstrap samples with replacement, so all n observations are still available for the second draw and each is equally likely to be selected. The probability is therefore still 1 - (1/n).

(c) Argue that the probability that the jth observation is not in the bootstrap sample is $(1 - 1/n)^n$.

Bootstrap draws are independent, and at each position (first bootstrap observation, second bootstrap observation, etc.) the probability that the jth observation is not chosen is 1 - (1/n). There are n positions in total, so the probabilities multiply: $(1 - 1/n) \cdot (1 - 1/n) \cdots (1 - 1/n) = (1 - 1/n)^n$.

(d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?

$1 - (1 - 1/5)^5 = 0.672$

(e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?

$1 - (1 - 1/100)^{100} = 0.634$
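The two answers above can be checked both from the formula and by simulation. This is a minimal sketch: the helper names, the number of simulated bootstrap samples, and the seed are arbitrary choices for illustration.

```python
import numpy as np

def inclusion_probability(n):
    """Exact probability that a given observation appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_inclusion(n, j=0, n_boot=20000, seed=0):
    """Fraction of simulated bootstrap samples (drawn with replacement) that contain index j."""
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap sample: n indices drawn uniformly from 0..n-1
    samples = rng.integers(0, n, size=(n_boot, n))
    return np.mean((samples == j).any(axis=1))

for n in (5, 100):
    print(n, round(inclusion_probability(n), 3), simulated_inclusion(n))
```

The simulated fractions should sit close to the exact values, within ordinary Monte Carlo noise.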

## 3
Code the forward selection and backward selection functions and apply them to the auto dataset to predict mpg, as instructed at the end of the in-class assignment.



In [3]:
# First, we're going to do all the data loading we've had for a while for this data set
auto = pd.read_csv('Auto.csv')
auto = auto.replace('?', np.nan)
auto = auto.dropna()
auto.horsepower = auto.horsepower.astype('int')

# Shuffle the data set up front so later steps don't need to handle it
auto = auto.sample(frac=1).reset_index(drop=True)


auto.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.0 | 8 | 318.0 | 150 | 3940 | 13.2 | 76 | 1 | plymouth volare premier v8 |
| 1 | 14.0 | 8 | 400.0 | 175 | 4464 | 11.5 | 71 | 1 | pontiac catalina brougham |
| 2 | 23.0 | 4 | 120.0 | 97 | 2506 | 14.5 | 72 | 3 | toyouta corona mark ii (sw) |
| 3 | 30.5 | 4 | 98.0 | 63 | 2051 | 17.0 | 77 | 1 | chevrolet chevette |
| 4 | 24.2 | 6 | 146.0 | 120 | 2930 | 13.8 | 81 | 3 | datsun 810 maxima |


In [11]:
#using this website for guidance https://www.analyticsvidhya.com/blog/2021/04/forward-feature-selection-and-its-implementation/

# Predictors: cylinders, displacement, horsepower, weight, acceleration, year; response: mpg
x = auto.iloc[:,1:7]
y = auto['mpg']
x.shape, y.shape

((392, 6), (392,))

In [12]:
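# Forward selection of 3 features, scored by cross-validated negative MSE (mlxtend's default is cv=5)
model_lr = LinearRegression()
sfs1 = sfs(model_lr, k_features=3, forward=True, verbose=2, scoring='neg_mean_squared_error')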
sfs1 = sfs1.fit(x, y)
feat_names = list(sfs1.k_feature_names_)
print(feat_names)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.0s finished

[2024-03-06 21:47:32] Features: 1/3 -- score: -18.756571530297617
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished

[2024-03-06 21:47:32] Features: 2/3 -- score: -11.749528802632629
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2024-03-06 21:47:32] Features: 3/3 -- score: -11.759220900777297

['weight', 'acceleration', 'year']

In [13]:
#now we do backward selection

model_lr = LinearRegression()
sfs1 = sfs(model_lr, k_features=3, forward=False, verbose=2, scoring='neg_mean_squared_error')
sfs1 = sfs1.fit(x, y)
feat_names = list(sfs1.k_feature_names_)
print(feat_names)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.0s finished

[2024-03-06 21:53:19] Features: 5/3 -- score: -11.831827535564239
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished

[2024-03-06 21:53:19] Features: 4/3 -- score: -11.786705188194437
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s finished

[2024-03-06 21:53:19] Features: 3/3 -- score: -11.759220900777297

['weight', 'acceleration', 'year']

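Forward and backward selection agree: both choose weight, acceleration, and year as the three-feature model, with the same cross-validated score (negative MSE of about -11.76).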
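The cells above rely on mlxtend's `SequentialFeatureSelector`. Since the task asks for the selection functions to be coded, here is a minimal sketch of what hand-written greedy versions might look like. It assumes 5-fold cross-validated MSE as the criterion; the function names `cv_mse`, `forward_selection`, and `backward_selection` are hypothetical, and the demo runs on synthetic data rather than `Auto.csv` so that it stands alone.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(X, y, features, cv=5):
    """Mean cross-validated MSE of a linear regression on the given columns."""
    scores = cross_val_score(LinearRegression(), X[list(features)], y,
                             scoring='neg_mean_squared_error', cv=cv)
    return -scores.mean()

def forward_selection(X, y, k, cv=5):
    """Greedily add the feature that lowers CV MSE the most until k are chosen."""
    selected, remaining = [], list(X.columns)
    while len(selected) < k:
        best = min(remaining, key=lambda f: cv_mse(X, y, selected + [f], cv))
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_selection(X, y, k, cv=5):
    """Start from all features and greedily drop the least useful one until k remain."""
    selected = list(X.columns)
    while len(selected) > k:
        # The feature whose removal gives the lowest CV MSE is the least useful
        least_useful = min(
            selected,
            key=lambda f: cv_mse(X, y, [c for c in selected if c != f], cv))
        selected.remove(least_useful)
    return selected

# Synthetic demo data (an assumption for illustration): only 'a' and 'c' drive y
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list('abcde'))
y_demo = 3 * X_demo['a'] - 2 * X_demo['c'] + rng.normal(scale=0.1, size=200)

forward_result = forward_selection(X_demo, y_demo, 2)
backward_result = backward_selection(X_demo, y_demo, 2)
print(forward_result, backward_result)
```

On the notebook's data the equivalent calls would be `forward_selection(x, y, 3)` and `backward_selection(x, y, 3)`. Whether they reproduce the mlxtend output exactly depends on the cross-validation splits matching.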