# Part 1: Polynomial Regression

### A) Use the Auto dataset, find the test $R^2$ score of a linear regression model that predicts the miles per gallon (mpg) from the horsepower.

### B) Use polynomial regression to include both the horsepower feature and $(horsepower)^2$ in the regression model. Find the $R^2$ metric. 

Hint: You can use [numpy.concatenate](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.concatenate.html). For example, to append a column vector $W^2$ to an array U, use `X = np.concatenate((U, W**2), axis=1)`.
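As a minimal sketch of the hint, the snippet below appends a squared column to a small array. The three values are made up for illustration and are not read from the dataset file. Both arrays must be two-dimensional column vectors, because `np.concatenate` with `axis=1` raises an error on 1-D inputs.

```python
import numpy as np

# Toy column vector of shape (3, 1); illustrative values only.
U = np.array([[130.0], [165.0], [150.0]])
W = U  # here the squared column is built from the same feature

# Stack the original column and its square side by side.
X = np.concatenate((U, W**2), axis=1)

print(X.shape)  # (3, 2)
print(X)
```

A 1-D array such as `AutoData.horsepower.values` would first need `.reshape(-1, 1)`, as the notebook does when it defines the features.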

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from pandas import read_csv
AutoData=read_csv('Auto_modify.csv') # read the data
print(type(AutoData))
print(AutoData)
X_auto_hp=AutoData.horsepower.values.reshape(-1,1) # define features: horsepower 
Y_auto_mpg=AutoData.mpg.values.reshape(-1,1) # define label: miles per gallon

# add your solution here
X_train,X_test,Y_train,Y_test=train_test_split(X_auto_hp,Y_auto_mpg,random_state=0)
linreg=LinearRegression().fit(X_train,Y_train)
r2=linreg.score(X_test,Y_test)

print("The R2 score is :",r2)


<class 'pandas.core.frame.DataFrame'>
      mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0    18.0          8         307.0         130    3504          12.0    70   
1    15.0          8         350.0         165    3693          11.5    70   
2    18.0          8         318.0         150    3436          11.0    70   
3    16.0          8         304.0         150    3433          12.0    70   
4    17.0          8         302.0         140    3449          10.5    70   
5    15.0          8         429.0         198    4341          10.0    70   
6    14.0          8         454.0         220    4354           9.0    70   
7    14.0          8         440.0         215    4312           8.5    70   
8    14.0          8         455.0         225    4425          10.0    70   
9    15.0          8         390.0         190    3850           8.5    70   
10   15.0          8         383.0         170    3563          10.0    70   
11   14.0          8      

In [4]:
AutoData['Polynomial(horsepower^2)']=AutoData.horsepower.values*AutoData.horsepower.values

X_auto_polynomial=AutoData[['horsepower','Polynomial(horsepower^2)']].values # define features: horsepower ,horsepower^2
Y_auto_mpg=AutoData.mpg.values.reshape(-1,1) # define label: miles per gallon

# add your solution here
X_train,X_test,Y_train,Y_test=train_test_split(X_auto_polynomial,Y_auto_mpg,random_state=0)
linreg=LinearRegression().fit(X_train,Y_train)
r2=linreg.score(X_test,Y_test)

print("The R2 score is :",r2)


The R2 score is : 0.7271031504642004


In [None]:
# The test R2 score rises to about 0.727 once the polynomial term horsepower^2 is included, an improvement over the horsepower-only linear model from part A.
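The `.score` method of scikit-learn regressors returns $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes that quantity by hand and compares a straight-line fit with a quadratic fit. It uses a synthetic, noise-free curve as a stand-in for the mpg/horsepower relationship, so the data, the split, and the helper names are all assumptions for illustration and the numbers will not match the notebook's.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for a curved mpg-vs-horsepower relation (NOT the Auto data).
hp = np.linspace(50.0, 200.0, 151)
mpg = 2000.0 / hp

# Simple interleaved split: even indices for training, odd indices for testing.
train = np.arange(0, 151, 2)
test = np.arange(1, 151, 2)

def fit_and_score(features):
    # Design matrix with an intercept column followed by the given features.
    A = np.column_stack([np.ones_like(hp)] + features)
    coef, *_ = np.linalg.lstsq(A[train], mpg[train], rcond=None)
    return r2_score_manual(mpg[test], A[test] @ coef)

r2_linear = fit_and_score([hp])
r2_quadratic = fit_and_score([hp, hp**2])
print("linear:", r2_linear, "quadratic:", r2_quadratic)
```

On a curved relationship like this one, the quadratic term lets the linear model bend, which is the effect the notebook observes on the real data.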

### C) Use KNN regression to predict the miles per gallon (mpg) with K=7, and find the $R^2$ metric in the following cases:

- One feature: Horsepower only

- Two features: horsepower and $(horsepower)^2$ 

Hint: 

    Create KNN regression object using neighbors.KNeighborsRegressor:

    knnRegression = neighbors.KNeighborsRegressor(n_neighbors=7)

    Use the .fit and .score methods as before
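To make clear what the regressor in the hint computes, here is a hand-rolled sketch of KNN regression for a single feature: the prediction is the mean target of the k closest training points. This mirrors the behavior of `KNeighborsRegressor` with its default uniform weights, but the function name `knn_predict` and the toy data are hypothetical, not part of the assignment.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict by averaging the targets of the k nearest training points."""
    distances = np.abs(x_train - x_query)   # 1-D feature, so |difference|
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    return y_train[nearest].mean()

# Toy data, chosen only for illustration.
x_train = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y_train = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

# The three nearest neighbours of 2.2 are x = 2, 3 and 1.
prediction = knn_predict(x_train, y_train, x_query=2.2, k=3)
print(prediction)  # 20.0
```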



In [5]:
from sklearn import neighbors
# add your solution here
knnRegression = neighbors.KNeighborsRegressor(n_neighbors=7)

X_train,X_test,Y_train,Y_test=train_test_split(X_auto_hp,Y_auto_mpg,random_state=0)
knnRegression.fit(X_train,Y_train)

r2=knnRegression.score(X_test,Y_test)
print("The R2 score is :",r2)

The R2 score is : 0.6674777441714226


In [6]:

X_train,X_test,Y_train,Y_test=train_test_split(X_auto_polynomial,Y_auto_mpg,random_state=0)
knnRegression.fit(X_train,Y_train)

r2=knnRegression.score(X_test,Y_test)
print("The R2 score is :",r2)

The R2 score is : 0.6701084048823853


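The linear model with the quadratic feature performs best of the four options (test $R^2 \approx 0.727$). Without the quadratic feature, the KNN regressor ($R^2 \approx 0.667$) works better than the linear model, probably because KNN is non-parametric and can follow the curved relationship between horsepower and mpg without an explicit squared term.
In both cases (KNN and linear), performance increases when the quadratic feature is introduced, but the gain for KNN is marginal ($0.667 \to 0.670$) compared with the gain for the linear model.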
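One plausible reason the KNN score barely moves when $(horsepower)^2$ is added is feature scale. `KNeighborsRegressor` uses Euclidean distance by default, and the squared column has a much larger range than the original one, so it dominates the distance. The sketch below illustrates this with two made-up horsepower values; it is an illustration of the scale effect, not an analysis of the actual dataset.

```python
import numpy as np

# Two hypothetical cars described by (horsepower, horsepower^2).
a = np.array([100.0, 100.0**2])
b = np.array([110.0, 110.0**2])

# Per-feature contribution to the squared Euclidean distance.
squared_diffs = (a - b) ** 2            # [100.0, 4410000.0]
share_from_square = squared_diffs[1] / squared_diffs.sum()

print(squared_diffs)
print(share_from_square)  # very close to 1: the squared feature dominates
```

Standardizing the features before fitting (for example with `sklearn.preprocessing.StandardScaler`) is a common way to put both columns on a comparable scale, although whether it changes the score here would have to be checked on the data.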

#### COMMENT on your results on (E) and (F): which model performs better? How does performance change when adding the quadratic feature?

# Part 2: Regularization

### A) Using the Boston dataset, fit a Ridge regression model with the tuning parameter set to 100 (alpha=100). Find the $R^2$ score and the number of nonzero coefficients.

### B) Use Lasso regression instead of Ridge regression, again with the tuning parameter set to 100. Find the $R^2$ score and the number of nonzero coefficients.

### C) Change the tuning parameter of the Lasso model to a very low value (alpha=0.001). What is the $R^2$ score?
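For reference, the two penalized objectives can be written as follows, in the form given in the scikit-learn documentation. The symbols are notation introduced here, not taken from the assignment: $X$ is the feature matrix, $y$ the target vector, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the tuning parameter.

$$\text{Ridge:} \quad \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

$$\text{Lasso:} \quad \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

The $\ell_2$ penalty in Ridge shrinks coefficients toward zero but generally leaves them nonzero, whereas the $\ell_1$ penalty in Lasso can set coefficients exactly to zero. This is why counting nonzero coefficients is informative for Lasso in particular.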



In [7]:
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

dataset = load_boston()
print(dataset.DESCR)
X=dataset.data
Y=dataset.target
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,random_state=0)
RidgeModel=Ridge(alpha=100).fit(X_train,Y_train)

print("The R2 score for Ridge is",RidgeModel.score(X_test,Y_test))

print("The number of non-zero coefficients is:", np.sum(RidgeModel.coef_!=0))

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [8]:
LassoModel=Lasso(alpha=100).fit(X_train,Y_train)

print("The R2 score for Lasso is",LassoModel.score(X_test,Y_test))

np.sum(LassoModel.coef_!=0)

The R2 score for Lasso is 0.11866916175527807


2

In [9]:
LassoModel=Lasso(alpha=0.001).fit(X_train,Y_train)

print("The R2 score for Lasso is",LassoModel.score(X_test,Y_test))

np.sum(LassoModel.coef_!=0)

The R2 score for Lasso is 0.6350353125168686


13
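The effect of alpha on sparsity seen in the two Lasso cells (2 nonzero coefficients at alpha=100, 13 at alpha=0.001) can be explored with a small sweep. The sketch below uses synthetic data from `make_regression` rather than the Boston data, so the dataset, the alpha grid, and the resulting counts are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (NOT the Boston data): 10 features, only 3 of which
# actually influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

nonzero_counts = {}
for alpha in [0.001, 1.0, 1000.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count the coefficients that Lasso did not shrink to exactly zero.
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} nonzero coefficients")
```

A very large alpha removes every feature, which leaves only the intercept and explains why the test $R^2$ collapses when the penalty is too strong.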

### D) Comment on your result. In this problem, do all features seem important in making predictions?


The $R^2$ scores suggest that all features are important, but we cannot say this for sure. With alpha=100 the Lasso model keeps only 2 features and the test $R^2$ is about 0.119, while lowering alpha to 0.001 keeps all 13 features and gives a test $R^2$ of about 0.635. Other factors, such as the scaling of the features, may also change which coefficients are set to zero and therefore the $R^2$ score. Hence, we need to select the alpha value that gives the best $R^2$ score, ideally measured on held-out validation data; the features with nonzero coefficients at that alpha are the useful ones.
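One way to carry out the selection described above is to standardize the features and let cross-validation choose alpha. The sketch below does this with `StandardScaler` and `LassoCV` in a pipeline. It runs on synthetic data from `make_regression`, not on the Boston data, and the choice of 5 folds is an assumption, so treat it as an outline of the approach rather than the assignment's solution.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then pick alpha by 5-fold cross-validation on the training set.
pipeline = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipeline.fit(X_train, y_train)

lasso = pipeline.named_steps["lassocv"]
best_alpha = lasso.alpha_
n_selected = int(np.count_nonzero(lasso.coef_))
test_r2 = pipeline.score(X_test, y_test)

print("selected alpha:", best_alpha)
print("nonzero coefficients:", n_selected)
print("test R2:", test_r2)
```

Keeping the scaler inside the pipeline means it is fitted on the training folds only, so the test split does not leak into the choice of alpha.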