In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# PROBLEM 1

## Load the data set *Carseats.csv*. A description of the features characterizing the sales is given.

### (A) One-hot-encode all categorical features. Standardize all features. Use linear regression and all the features together to predict car seat sales. Create and display a dataframe consisting of a column of feature names and a corresponding column of coefficient values. Sort the rows in the dataframe from most positive to most negative coefficient. Also report training R-squared. Which features are most important for predicting car seat sales? Also comment on how good the fit to the training data is based on the value of the training R-squared.

In [3]:
df = pd.read_csv('../Carseats.csv')
df.head()

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No


In [4]:
df.dtypes

Sales          float64
CompPrice        int64
Income           int64
Advertising      int64
Population       int64
Price            int64
ShelveLoc       object
Age              int64
Education        int64
Urban           object
US              object
dtype: object

In [5]:
df.ShelveLoc = df.ShelveLoc.astype('category')
df.Urban = df.Urban.astype('category')
df.US = df.US.astype('category')
df.dtypes

Sales           float64
CompPrice         int64
Income            int64
Advertising       int64
Population        int64
Price             int64
ShelveLoc      category
Age               int64
Education         int64
Urban          category
US             category
dtype: object

In [8]:
features = df.drop('Sales',axis=1)
targets  = df.Sales
features = pd.get_dummies(features)
features.head()

Unnamed: 0,CompPrice,Income,Advertising,Population,Price,Age,Education,ShelveLoc_Bad,ShelveLoc_Good,ShelveLoc_Medium,Urban_No,Urban_Yes,US_No,US_Yes
0,138,73,11,276,120,42,17,1,0,0,0,1,0,1
1,111,48,16,260,83,65,10,0,1,0,0,1,0,1
2,113,35,10,269,80,59,12,0,0,1,0,1,0,1
3,117,100,4,466,97,55,14,0,0,1,0,1,0,1
4,141,64,3,340,128,38,13,1,0,0,0,1,1,0


In [9]:
features = (features - features.mean())/features.std()
lr = LinearRegression()
lr.fit(features,targets)
R2 = lr.score(features,targets)
print('R-squared =',round(R2,3))

R-squared = 0.873


In [10]:
coef = pd.DataFrame()
coef['feature'] = features.columns
coef['coef'] = lr.coef_
coef.sort_values('coef',ascending=False)

Unnamed: 0,feature,coef
0,CompPrice,1.423278
8,ShelveLoc_Good,1.097867
2,Advertising,0.818627
1,Income,0.442259
12,US_No,0.044101
3,Population,0.030636
11,Urban_Yes,0.028056
10,Urban_No,-0.028056
13,US_Yes,-0.044101
6,Education,-0.055298


#### Competitor's price has the largest positive coefficient. Car seat price has the most negative coefficient, which is to be expected: sales fall as the seat's own price rises. R-squared = 0.873 means the features explain 87.3% of the variance in the training targets, so the linear model fits the training data fairly well.
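
To make the interpretation of R-squared concrete, the sketch below computes it by hand as 1 - SS_res / SS_tot and compares it with `LinearRegression.score`. The data are synthetic stand-ins, not the Carseats data, and the variable names (`ss_res`, `ss_tot`, `r2_manual`) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the Carseats data): 200 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in the targets
r2_manual = 1.0 - ss_res / ss_tot      # fraction of variation explained

# The two printed values should agree
print(round(r2_manual, 3), round(model.score(X, y), 3))
```

Because this R-squared is computed on the same rows used for fitting, it measures fit to the training data only, not predictive accuracy on new data.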

### (B) Repeat part (a) except do not standardize the features. Compare the coefficients obtained to those computed in part (a). Is it important to standardize features before ranking them in order of importance? Explain why or why not. Is the training R-squared value affected by scaling features?

In [11]:
features = df.drop('Sales',axis=1)
targets  = df.Sales

features = pd.get_dummies(features)

lr = LinearRegression()
lr.fit(features,targets)
R2 = lr.score(features,targets)
print('R-squared =',round(R2,3))

coef = pd.DataFrame()
coef['feature'] = features.columns
coef['coef'] = lr.coef_
coef.sort_values('coef',ascending=False)

R-squared = 0.873


Unnamed: 0,feature,coef
8,ShelveLoc_Good,2.581217
2,Advertising,0.123095
0,CompPrice,0.092815
12,US_No,0.092046
11,Urban_Yes,0.061443
1,Income,0.015803
3,Population,0.000208
6,Education,-0.021102
5,Age,-0.046045
10,Urban_No,-0.061443


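#### If we don't standardize anything, shelf location (ShelveLoc_Good) has the largest coefficient, ahead of competitor's price, which was the top feature in part (a). An unstandardized coefficient is measured per unit of its feature, so its size depends on the feature's scale: a 0/1 dummy and a price spanning a wide numeric range are not comparable. Without standardizing, we can't give the features an honest comparison, so standardizing matters before ranking them by importance. However, the training R-squared is not affected by standardizing the features: it is 0.873 in both cases.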
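
The effect described above can be sketched on synthetic data (not the Carseats data; the column names `price_like` and `dummy_like` are hypothetical). Each standardized coefficient should equal the raw coefficient times that feature's standard deviation, so the ranking can flip while R-squared stays the same.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: one wide-range feature and one 0/1 dummy
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "price_like": rng.normal(100, 25, size=300),
    "dummy_like": rng.integers(0, 2, size=300).astype(float),
})
y = -0.05 * X["price_like"] + 1.0 * X["dummy_like"] + rng.normal(0, 0.3, size=300)

# Fit on raw features
raw = LinearRegression().fit(X, y)

# Fit on standardized features (same formula as in the notebook)
X_std = (X - X.mean()) / X.std()
std = LinearRegression().fit(X_std, y)

# Standardized coefficient = raw coefficient * feature standard deviation
rescaled = raw.coef_ * X.std().values

print("raw coefficients:         ", raw.coef_.round(3))
print("standardized coefficients:", std.coef_.round(3))
print("R-squared (raw, std):     ", round(raw.score(X, y), 3), round(std.score(X_std, y), 3))
```

In this sketch the dummy has the larger raw coefficient, but the wide-range feature has the larger standardized coefficient, which mirrors the change in ranking between parts (a) and (b).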

# PROBLEM 2

## Load the data set *Auto-cleaned.csv*

### (A) Use linear regression and the features cylinders, displacement, horsepower, weight, and acceleration to predict fuel economy (mpg). Standardize all features. After standardization, add a column of ones to represent the bias as an additional feature. Create and display a dataframe consisting of a column of feature names and a corresponding column of coefficient values. Sort the rows in the dataframe from most positive to most negative coefficient. What is the value of the bias? Also report training R-squared.

In [12]:
from numpy.linalg import solve

In [13]:
df = pd.read_csv('../Auto-cleaned.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


In [14]:
features = df[['cylinders','displacement','horsepower','weight','acceleration']].copy()
targets  = df.mpg

features = (features - features.mean())/features.std()

features['bias'] = 1

lr = LinearRegression(fit_intercept=False)
lr.fit(features,targets)
R2 = lr.score(features,targets)
print('R-squared =',round(R2,3))

R-squared = 0.708


In [15]:
coef = pd.DataFrame()
coef['feature'] = features.columns
coef['coef'] = lr.coef_
coef.sort_values('coef',ascending=False).round(3)

Unnamed: 0,feature,coef
5,bias,23.446
1,displacement,-0.009
4,acceleration,-0.08
0,cylinders,-0.679
2,horsepower,-1.742
3,weight,-4.406


### (B) Repeat part (a) by solving the normal equations Aw = B, where A = X^T X and B = X^T Y. X is the data matrix corresponding to the features (including the column of ones), Y is the vector containing the targets, and w is the vector of coefficient values, including the value of the bias.

In [16]:
X = features.values
y = targets.values
A = np.matmul(X.T,X)
B = np.matmul(X.T,y)
w = solve(A,B)

coef_NormalEq = pd.DataFrame()
coef_NormalEq['feature'] = features.columns
coef_NormalEq['coef'] = w
coef_NormalEq.sort_values('coef',ascending=False).round(3)

Unnamed: 0,feature,coef
5,bias,23.446
1,displacement,-0.009
4,acceleration,-0.08
0,cylinders,-0.679
2,horsepower,-1.742
3,weight,-4.406


#### The coefficients are the same as those obtained with LinearRegression in part (a).
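
As a cross-check of the normal-equations approach, the sketch below uses synthetic data (not the Auto data) to compare `np.linalg.solve` on X^T X with `np.linalg.lstsq`, which solves the same least-squares problem without forming X^T X. It also illustrates why the bias equals the mean of the targets when the other columns are centered. All names here are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in data: 3 standardized features plus a column of ones
rng = np.random.default_rng(2)
n = 100
Z = rng.normal(size=(n, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardize each column
X = np.column_stack([Z, np.ones(n)])               # last column is the bias
y = 5.0 + Z @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Normal equations: (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied directly to X and y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
# With centered features, the bias coefficient matches the target mean
print(round(w_normal[-1], 6), round(y.mean(), 6))
```

The normal equations are fine for a small, well-conditioned problem like this one. For nearly collinear features, a least-squares solver is usually the numerically safer choice.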

# PROBLEM 3

## Load the data set *Wine_red.csv*

### (A) Use multiple linear regression and forward feature selection to train a regressor for predicting red wine quality from chemical features of the wine. Select features using test R-squared. Use cross-validation to compute test R-squared. List all the features used in your most accurate model and the resulting test R-squared.

In [17]:
from sklearn.model_selection import cross_validate

In [18]:
df = pd.read_csv('../Wine_red.csv',sep=';')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [19]:
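def SelectFeature(feature_candidates,features_used,targets,df):
    # Try adding each candidate to the features already in use and return the
    # candidate that gives the highest mean cross-validated test R-squared.
    # Note: relies on the global regressor lr, which must be defined before calling.
    N = len(feature_candidates)
    R2 = np.zeros(N)
    for k in range(N):
        features_current = features_used.copy()
        features_current.append(feature_candidates[k])
        features = df[features_current]
        # with no scoring argument, cross_validate uses lr.score, i.e. R-squared
        results = cross_validate(lr,features,targets,n_jobs=-1)
        R2[k] = results['test_score'].mean()

    R2_max = R2.max()
    feature_selected = feature_candidates[R2.argmax()]
    return (feature_selected,R2_max)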

In [20]:
feature_candidates = list(df.drop('quality',axis=1).columns)
features_used = []
targets = df.quality
lr = LinearRegression()

R2 = []
while (len(feature_candidates) > 0):
    print('number of feature candidate =',len(feature_candidates),end='')
    (feature_selected,R2_max) = SelectFeature(feature_candidates,features_used,targets,df)
    features_used.append(feature_selected)
    feature_candidates.remove(feature_selected)
    R2.append(R2_max)
    print('    feature selected:',feature_selected)
    
results = pd.DataFrame()
results['features'] = features_used
results['test R-squared'] = R2
results

number of feature candidate = 11    feature selected: alcohol
number of feature candidate = 10    feature selected: volatile acidity
number of feature candidate = 9    feature selected: sulphates
number of feature candidate = 8    feature selected: chlorides
number of feature candidate = 7    feature selected: pH
number of feature candidate = 6    feature selected: residual sugar
number of feature candidate = 5    feature selected: citric acid
number of feature candidate = 4    feature selected: total sulfur dioxide
number of feature candidate = 3    feature selected: free sulfur dioxide
number of feature candidate = 2    feature selected: fixed acidity
number of feature candidate = 1    feature selected: density


Unnamed: 0,features,test R-squared
0,alcohol,0.215141
1,volatile acidity,0.305696
2,sulphates,0.316122
3,chlorides,0.323335
4,pH,0.326791
5,residual sugar,0.326663
6,citric acid,0.324723
7,total sulfur dioxide,0.321435
8,free sulfur dioxide,0.32212
9,fixed acidity,0.31407


In [21]:
print()
R2_max = results['test R-squared'].max()
print('maximum test R-squared =',R2_max.round(3))
print()
print('features used:')
ix = results['test R-squared'].idxmax()
results.head(ix+1).round(3)


maximum test R-squared = 0.327

features used:


Unnamed: 0,features,test R-squared
0,alcohol,0.215
1,volatile acidity,0.306
2,sulphates,0.316
3,chlorides,0.323
4,pH,0.327


### (B) Repeat part (a) for white wine. Use the data set *Wine_white.csv*. Comment on similarities and differences between the red and white wine models.

In [22]:
df = pd.read_csv('../Wine_white.csv',sep=';')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [23]:
feature_candidates = list(df.drop('quality',axis=1).columns)
features_used = []
targets = df.quality
lr = LinearRegression()

R2 = []
while (len(feature_candidates) > 0):
    print('number of feature candidate =',len(feature_candidates),end='')
    (feature_selected,R2_max) = SelectFeature(feature_candidates,features_used,targets,df)
    features_used.append(feature_selected)
    feature_candidates.remove(feature_selected)
    R2.append(R2_max)
    print('    feature selected:',feature_selected)
    
results = pd.DataFrame()
results['features'] = features_used
results['test R-squared'] = R2
results

number of feature candidate = 11    feature selected: alcohol
number of feature candidate = 10    feature selected: volatile acidity
number of feature candidate = 9    feature selected: residual sugar
number of feature candidate = 8    feature selected: sulphates
number of feature candidate = 7    feature selected: free sulfur dioxide
number of feature candidate = 6    feature selected: total sulfur dioxide
number of feature candidate = 5    feature selected: chlorides
number of feature candidate = 4    feature selected: citric acid
number of feature candidate = 3    feature selected: fixed acidity
number of feature candidate = 2    feature selected: density
number of feature candidate = 1    feature selected: pH


Unnamed: 0,features,test R-squared
0,alcohol,0.176708
1,volatile acidity,0.229091
2,residual sugar,0.242409
3,sulphates,0.243901
4,free sulfur dioxide,0.244463
5,total sulfur dioxide,0.244025
6,chlorides,0.243524
7,citric acid,0.241612
8,fixed acidity,0.239858
9,density,0.237006


In [24]:
print()
R2_max = results['test R-squared'].max()
print('maximum test R-squared =',R2_max.round(3))
print()
print('features used:')
ix = results['test R-squared'].idxmax()
results.head(ix+1).round(3)


maximum test R-squared = 0.244

features used:


Unnamed: 0,features,test R-squared
0,alcohol,0.177
1,volatile acidity,0.229
2,residual sugar,0.242
3,sulphates,0.244
4,free sulfur dioxide,0.244


#### Alcohol, volatile acidity, and sulphates are used in both models. The white wine model replaces chlorides and pH with residual sugar and free sulfur dioxide. Most of the variation in wine quality is unexplained in both models, but the red wine model explains more of it: its test R-squared is 0.327, compared with 0.244 for white wine.
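
For comparison with the hand-written forward selection loop, scikit-learn provides `SequentialFeatureSelector`, which performs a similar greedy search using cross-validated scores. The sketch below runs it on synthetic data (not the wine data). Unlike the loop above, it stops at a fixed number of features (`n_features_to_select=2` is an assumption chosen for this toy example) rather than picking the subset with the maximum test R-squared.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: only columns 0 and 3 actually drive the target
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=400)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,   # assumed stopping point for this sketch
    direction="forward",      # add one feature at a time
    scoring="r2",             # same metric as the notebook's test R-squared
    cv=5,                     # 5-fold cross-validation
)
selector.fit(X, y)

# Column indices of the selected features
selected = np.flatnonzero(selector.get_support())
print(selected)
```

To reproduce the notebook's "best subset by maximum test R-squared" rule, one would still need to score each subset size separately, as the `SelectFeature` loop does.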