# Linear Regression

### Qualitative Features & Interaction Terms

#### 1-A) Using the credit data set, fit an OLS linear regression model to predict credit card balance from all of the following features:
 - Student
 - Income
 - Limit
 - Interaction term: Income*Student
 - Interaction term: Limit*Student

#### Find the p-values of all features. Are they all helpful in predicting the response? Why? 


> They are all helpful in predicting the response: every p-value is below 0.05, so each term, including both interaction terms, is statistically significant.

In [1]:
import statsmodels.formula.api as smf
from pandas import read_csv

credit = read_csv('Credit2.csv', index_col = 0)

model = smf.ols(formula='Balance ~ Student + Income + Limit + Income*Student + Limit*Student', data=credit)
results = model.fit()
print(results.pvalues)
# Add your code here

Intercept                5.549684e-117
Student[T.Yes]            2.330768e-08
Income                   7.366425e-105
Income:Student[T.Yes]     2.680862e-04
Limit                    2.626683e-222
Limit:Student[T.Yes]      6.320226e-08
dtype: float64


#### 1-B) Using the sklearn library, find the test $R^2$ score for estimating the balance from the features (Income, Limit, StudentEncode) with a linear regression model. StudentEncode is a binary feature that maps Student status to a numerical value ('Yes' to 1 and 'No' to 0).
- Set random state to zero in train test split


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

credit["StudentEncode"] = credit["Student"].apply(lambda x: 0 if x == "No" else 1)

X_train, X_test, y_train, y_test = train_test_split(credit[["Income","Limit","StudentEncode"]], credit["Balance"], random_state=0)

model = LinearRegression()
model.fit(X_train,y_train)
r_square = model.score(X_test,y_test)
print("R^2 score is: {}".format(r_square))
# Add your code here

R^2 score is: 0.9492691755287224
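
For a regressor, `model.score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. As a rough illustration of what that number measures, the sketch below computes it by hand with NumPy. The helper name `r2_by_hand` and the toy values are assumptions made for this example; they are not taken from the credit data.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

# Toy values, made up for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])
print(r2_by_hand(y_true, y_pred))  # about 0.975
```

A perfect prediction gives 1, and always predicting the mean of `y_true` gives 0, so a test score near 0.95 means the model explains most of the variance in the held-out balances.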


#### 1-C) Repeat the above question after adding to the model the two interaction terms: (1) (Income x StudentEncode) and (2) (Limit x StudentEncode)

   

In [3]:
# Add your code here
credit["Income_StudentEncode"] = credit["Income"] * credit["StudentEncode"]
credit["Limit_StudentEncode"] = credit["Limit"] * credit["StudentEncode"]

X_train, X_test, y_train, y_test = train_test_split(credit[["Income","Limit","StudentEncode","Income_StudentEncode","Limit_StudentEncode"]], credit["Balance"], random_state=0)

model = LinearRegression()
model.fit(X_train,y_train)
r_square = model.score(X_test,y_test)
print("R^2 score is: {}".format(r_square))

R^2 score is: 0.9525853236314719
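
An interaction column such as Income x StudentEncode lets the fitted slope on Income differ between students and non-students. The sketch below shows this on synthetic, noise-free data with an ordinary least-squares solve. The coefficients 1, 2, 4 and 3 are assumed values chosen for the example, not estimates from the credit data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.uniform(10, 100, size=n)   # synthetic, not the credit data
student = rng.integers(0, 2, size=n)    # 0 = non-student, 1 = student

# Assumed ground truth: Income slope is 2 for non-students and 2 + 3 = 5 for students
y = 1.0 + 2.0 * income + 4.0 * student + 3.0 * income * student

# Design matrix columns: intercept, income, student, income x student
X = np.column_stack([np.ones(n), income, student, income * student])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_non_student = coef[1]
slope_student = coef[1] + coef[3]       # the interaction coefficient shifts the slope
print(slope_non_student, slope_student)  # close to 2 and 5
```

Without the interaction column, a single Income slope would have to serve both groups, which is why adding the two interaction terms can raise the test $R^2$.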


### Polynomial Regression

Set random_state=0 in train_test_split for all the questions below.

#### 2-A) Use the Auto dataset, 
  - (i) Find the test $R^2$ metric of a linear regression model that predicts the miles per gallon (mpg) from the horsepower.

  - (ii) Use polynomial regression to include both the horsepower feature and $(horsepower)^2$ in the regression model to predict the mpg. Find the test $R^2$ metric in this case

Hint: You can use [numpy.concatenate](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.concatenate.html). For example, to add a column vector $W^2$ to an array U, we can use X=np.concatenate((U,W**2),axis=1)
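
A small sketch of the hint, using made-up numbers: both inputs to `np.concatenate` must be 2-D with the same number of rows, so a flat 1-D array is first reshaped into a column.

```python
import numpy as np

w_flat = np.array([1.0, 2.0, 3.0])   # a 1-D feature, e.g. a single column of values
W = w_flat.reshape(-1, 1)            # column vector, shape (3, 1)
U = W.copy()                         # existing feature matrix, shape (3, 1)

# Append the squared column to the right of U
X = np.concatenate((U, W ** 2), axis=1)
print(X.shape)   # (3, 2)
print(X)
```

Passing the 1-D `w_flat` directly with `axis=1` would raise an error, because a 1-D array has no second axis to join along.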

In [4]:
#Solution for (A) 
from pandas import read_csv
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

AutoData=read_csv('Auto_modify.csv') # read the data
#You will need AutoData.horsepower, and AutoData.mpg
AutoData["horsepower2"] = AutoData["horsepower"] ** 2

X_train, X_test, y_train, y_test = train_test_split(AutoData[["horsepower"]], AutoData["mpg"], random_state=0)


model1 = LinearRegression()
model1.fit(X_train,y_train)
score1 = model1.score(X_test,y_test)
print("R^2 score for one feature is: {}".format(score1))


X_train, X_test, y_train, y_test = train_test_split(AutoData[["horsepower","horsepower2"]], AutoData["mpg"], random_state=0)

model2 = LinearRegression()
model2.fit(X_train,y_train)
score2 = model2.score(X_test,y_test)

print("R^2 score for two features is: {}".format(score2))

# Add your code here

R^2 score for one feature is: 0.6217658811398382
R^2 score for two features is: 0.7271031504642004


#### 2-B) With the same Auto dataset, use KNN regression with K=7 to fit a model that predicts miles per gallon (mpg) in the following cases:

- One feature: Horsepower only

- Two features: horsepower and $(horsepower)^2$ 

#### Use MinMax feature scaling. Find the $R^2$ metric in each of the above cases. Comparing KNN with linear regression, which model performs better? How does the performance change by adding the quadratic feature?




> With horsepower as the only feature, KNN performs better than linear regression: its $R^2$ is 0.661, compared with 0.622. With both horsepower and horsepower^2, linear regression performs better than KNN: its $R^2$ is 0.727, compared with 0.670.

> Adding the quadratic feature improves both models, since both $R^2$ scores increase. Linear regression improves far more (from 0.622 to 0.727) than KNN does (from 0.661 to 0.670).
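
For reference, KNN regression predicts a target as the average of the targets of the K closest training points. The sketch below is a minimal NumPy version assuming uniform weights and Euclidean distance, which are the defaults of `KNeighborsRegressor`. The helper name `knn_predict` and the toy data are hypothetical, and tie handling may differ from sklearn's.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict each query target as the mean target of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k smallest distances in each row
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Toy data, made up for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
pred = knn_predict(X_toy, y_toy, np.array([[1.1]]), k=3)
print(pred)  # the neighbours are x = 1, 2, 0, so the prediction is their mean target, 1.0
```

Because the prediction depends only on which training points are nearest, KNN relies on distances between feature vectors, which is why the features are scaled before fitting.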

In [5]:
from sklearn import neighbors
from sklearn import preprocessing
from sklearn.metrics import r2_score


AutoData=read_csv('Auto_modify.csv')
AutoData["horsepower2"] = AutoData["horsepower"] ** 2



X_train, X_test, y_train, y_test = train_test_split(AutoData[["horsepower"]], AutoData["mpg"], random_state=0)

# MinMax feature scaling, fitted on the training split only
scalar = preprocessing.MinMaxScaler()
X_train = scalar.fit_transform(X_train)

model1 = neighbors.KNeighborsRegressor(n_neighbors=7)
model1.fit(X_train,y_train)

X_test = scalar.transform(X_test)
y_pred = model1.predict(X_test)
r2_1 = r2_score(y_test, y_pred)
print("R^2 score for one feature is: {}".format(r2_1))


X_train, X_test, y_train, y_test = train_test_split(AutoData[["horsepower","horsepower2"]], AutoData["mpg"], random_state=0)

# MinMax feature scaling, fitted on the training split only
scalar = preprocessing.MinMaxScaler()
X_train = scalar.fit_transform(X_train)

model2 = neighbors.KNeighborsRegressor(n_neighbors=7)
model2.fit(X_train,y_train)

X_test = scalar.transform(X_test)
y_pred = model2.predict(X_test)
r2_2 = r2_score(y_test, y_pred)
print("R^2 score for two features is: {}".format(r2_2))



R^2 score for one feature is: 0.6609249833061157
R^2 score for two features is: 0.6701084048823853
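
The cell above fits the scaler on the training split and reuses it on the test split. A minimal NumPy sketch of that pattern follows. The helper names `minmax_fit` and `minmax_transform` and the toy values are assumptions made for this example, and a constant column (max equal to min) is not handled.

```python
import numpy as np

def minmax_fit(X_train):
    """Learn per-column minimum and maximum from the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_transform(X, col_min, col_max):
    """Map values to (x - min) / (max - min) using the training statistics."""
    return (np.asarray(X, dtype=float) - col_min) / (col_max - col_min)

# Toy values, made up for illustration
X_train_toy = np.array([[50.0], [100.0], [150.0]])
X_test_toy = np.array([[75.0], [200.0]])

col_min, col_max = minmax_fit(X_train_toy)
train_scaled = minmax_transform(X_train_toy, col_min, col_max)
test_scaled = minmax_transform(X_test_toy, col_min, col_max)
print(train_scaled.ravel())  # training values land in [0, 1]
print(test_scaled.ravel())   # test values can fall outside [0, 1]
```

Fitting on the training split alone keeps information about the test set out of the model, at the cost that unseen values beyond the training range are mapped outside the unit interval.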


