# Linear Regression

### Qualitative Features & Interaction Terms

#### 1-A) Using the Credit data set, fit an OLS linear regression model to predict credit card balance from all of the following features:
 - Student
 - Income
 - Limit
 - Interaction term: Income*Student
 - Interaction term: Limit*Student

Find the p-values of all features. Are they all helpful in predicting the response? Why? 


In [2]:
import statsmodels.formula.api as smf
from pandas import read_csv

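# Load the credit data
credit = read_csv('Credit2.csv')

# In formula syntax, A*B already expands to A + B + A:B, so listing the
# main effects explicitly is redundant but harmless.
model = smf.ols('Balance ~ Student + Income + Income*Student + Limit + Limit*Student', credit)
Fitting_results = model.fit()

print(Fitting_results.summary().tables[1])  # coefficient table
print(Fitting_results.pvalues)              # unrounded p-values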


                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept              -415.3863     12.436    -33.401      0.000    -439.836    -390.936
Student[T.Yes]          235.2261     41.256      5.702      0.000     154.117     316.336
Income                   -7.6162      0.252    -30.272      0.000      -8.111      -7.122
Income:Student[T.Yes]    -2.5835      0.702     -3.678      0.000      -3.965      -1.202
Limit                     0.2613      0.004     69.090      0.000       0.254       0.269
Limit:Student[T.Yes]      0.0667      0.012      5.515      0.000       0.043       0.090
Intercept                5.549684e-117
Student[T.Yes]            2.330768e-08
Income                   7.366425e-105
Income:Student[T.Yes]     2.680862e-04
Limit                    2.626683e-222
Limit:Student[T.Yes]      6.320226e-08
dtype: float64


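All six p-values are far below 0.05 (the largest, for Income:Student, is about 2.7e-4), so every term, including both interaction terms, is statistically significant. We would therefore expect each of them to help in predicting the response.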
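One way to read the coefficient table above is that the interaction terms give students and non-students different slopes as well as different intercepts. The following is a minimal sketch of that arithmetic, using the rounded coefficients copied from the table; `group_line` and `predict_balance` are hypothetical helper names introduced only for illustration.

```python
# Rounded coefficients copied from the OLS summary table above.
COEF = {
    "Intercept": -415.3863,
    "Student": 235.2261,
    "Income": -7.6162,
    "Income:Student": -2.5835,
    "Limit": 0.2613,
    "Limit:Student": 0.0667,
}


def group_line(is_student):
    """Return (intercept, income_slope, limit_slope) for one group."""
    s = 1 if is_student else 0  # dummy variable Student[T.Yes]
    intercept = COEF["Intercept"] + s * COEF["Student"]
    income_slope = COEF["Income"] + s * COEF["Income:Student"]
    limit_slope = COEF["Limit"] + s * COEF["Limit:Student"]
    return intercept, income_slope, limit_slope


def predict_balance(income, limit, is_student):
    """Fitted balance for one customer (hypothetical helper)."""
    b0, b_income, b_limit = group_line(is_student)
    return b0 + b_income * income + b_limit * limit


for label, flag in [("non-student", False), ("student", True)]:
    b0, b_inc, b_lim = group_line(flag)
    print(f"{label}: intercept={b0:.4f}, "
          f"income slope={b_inc:.4f}, limit slope={b_lim:.4f}")
```

For students, each interaction coefficient is simply added to the corresponding main-effect slope, which is why a significant interaction term indicates that the effect of Income or Limit on Balance differs between the two groups.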

#### 1-B) Find the test $R^2$ score for estimating the balance from the features (Income, Limit, StudentEncode) using a linear regression model. StudentEncode is a binary feature that maps Student status to a numerical value ('Yes' to 1 and 'No' to 0).


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


credit =read_csv('Credit2.csv')
credit['StudentEncode'] = credit.Student.map({'No':0 , 'Yes':1}) 
print(credit.head(5))

X = credit[[ 'StudentEncode','Income','Limit']]
Y = credit.Balance

X_train, X_test, Y_train, Y_test= train_test_split(X, Y, random_state= 0)
linreg= LinearRegression().fit(X_train, Y_train)

print('Test R^2 score predicting balance from student status, income, and limit:', linreg.score(X_test,Y_test))


   Unnamed: 0   Income  Limit  Rating  Cards  Age  Education  Gender Student  \
0           1   14.891   3606     283      2   34         11    Male      No   
1           2  106.025   6645     483      3   82         15  Female     Yes   
2           3  104.593   7075     514      4   71         11    Male      No   
3           4  148.924   9504     681      3   36         11  Female      No   
4           5   55.882   4897     357      2   68         16    Male      No   

  Married  Ethnicity  Balance  StudentEncode  
0     Yes  Caucasian      333              0  
1     Yes      Asian      903              1  
2      No      Asian      580              0  
3      No      Asian      964              0  
4     Yes  Caucasian      331              0  
Test R^2 score predicting balance from student status, income, and limit: 0.949269175529
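
For reference, the `score` method of `LinearRegression` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, evaluated on the data it is given. The sketch below computes the same quantity by hand in plain Python; `r2_score_manual` is a hypothetical name, and the five targets are just the first Balance values printed above, used as toy data.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets: the first five Balance values shown in the output above.
y_true = [333, 903, 580, 964, 331]

# Two reference predictors that bracket typical behaviour.
y_perfect = list(y_true)                               # predicts every value exactly
y_mean = [sum(y_true) / len(y_true)] * len(y_true)     # always predicts the mean

print(r2_score_manual(y_true, y_perfect))  # 1.0
print(r2_score_manual(y_true, y_mean))     # 0.0
```

A model that does worse than always predicting the mean of the test targets gets a negative score, so a test $R^2$ is not restricted to the interval from 0 to 1.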


#### 1-C) Repeat the above question after adding to the model the two interaction terms: (1) (Income x StudentEncode) and (2) (Limit x StudentEncode)

   

In [4]:
credit['InteractionTerm1']=credit.Income*credit.StudentEncode
credit['InteractionTerm2']=credit.Limit*credit.StudentEncode

X = credit[[ 'StudentEncode','Income','Limit', 'InteractionTerm1','InteractionTerm2']]
Y = credit.Balance

X_train, X_test, Y_train, Y_test= train_test_split(X, Y, random_state= 0)
linreg= LinearRegression().fit(X_train, Y_train)

print('Test R^2 score predicting balance with the two interaction terms added:', linreg.score(X_test,Y_test))

Test R^2 score predicting balance with the two interaction terms added: 0.952585323631


#### Comment: Both interaction terms have low p-values, so they are strongly associated with the response. Including them raised the test $R^2$ from about 0.949 to about 0.953, a small improvement.

### Polynomial Regression

#### 2-A) Use the Auto dataset, find the test $R^2$ score of a linear regression model that predicts the miles per gallon (mpg) from the horsepower.

#### 2-B) Use polynomial regression to include both the horsepower feature and $(horsepower)^2$ in the regression model. Find the $R^2$ metric. 

Hint: You can use [numpy.concatenate](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.concatenate.html). For example, to append a column vector $W^2$ to an array U, use `X = np.concatenate((U, W**2), axis=1)`. Both arrays must be two-dimensional with the same number of rows.
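
As a small illustration of the hint, the sketch below appends a squared column to a made-up single-column array. The values in `U` are invented for the example and are not taken from the Auto data.

```python
import numpy as np

# Made-up feature column, shaped (n_samples, 1) as reshape(-1, 1) would give.
U = np.array([[1.0], [2.0], [3.0]])

# Stack the original column and its square side by side (axis=1 means columns).
X = np.concatenate((U, U ** 2), axis=1)

print(X.shape)  # (3, 2)
print(X)
```

The first column of `X` holds the original values and the second holds their squares, which is the feature matrix a degree-2 polynomial regression on one variable needs.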

In [5]:
#Solution for (A) and (B)
from pandas import read_csv
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

AutoData=read_csv('Auto_modify.csv') # read the data
print(type(AutoData))
print(AutoData.shape)
X_auto_hp=AutoData.horsepower.values.reshape(-1,1) # define features: horsepower 
Y_auto_mpg=AutoData.mpg.values.reshape(-1,1) # define label: miles per gallon

modelAuto2=LinearRegression()

X=X_auto_hp
for power in [1,2]:
    if power>1:
        X=np.concatenate((X,X_auto_hp**power),axis=1)
        
    X_train, X_test, Y_train, Y_test= train_test_split(X, Y_auto_mpg, random_state= 0)
    Auto_fitted_model2=modelAuto2.fit(X_train,Y_train)
    R2_auto_hp=Auto_fitted_model2.score(X_test,Y_test)
    print('With polynomial of degree', power, 'the R squared score of linear regression is:', R2_auto_hp)


<class 'pandas.core.frame.DataFrame'>
(392, 9)
With polynomial of degree 1 the R squared score of linear regression is: 0.62176588114
With polynomial of degree 2 the R squared score of linear regression is: 0.727103150464


#### 2-C) With the same Auto dataset, use KNN regression with K=7 to fit a model that predicts miles per gallon (mpg) in the following cases:

- One feature: Horsepower only

- Two features: horsepower and $(horsepower)^2$ 

#### Use MinMax feature scaling. Find the $R^2$ metric in each of the above cases. Comparing KNN with linear regression, which model performs better? How does the performance change when the quadratic feature is added?




In [9]:
from sklearn import neighbors
from sklearn import preprocessing

# KNN regression (K=7); the MinMax scaler is fitted on the training split only

knnRegression = neighbors.KNeighborsRegressor(n_neighbors=7)
X=X_auto_hp
for power in [1,2]:
    if power>1:
        X=np.concatenate((X,X_auto_hp**power),axis=1)
        
    X_train, X_test, Y_train, Y_test= train_test_split(X, Y_auto_mpg, random_state= 0)
    
    scaler=preprocessing.MinMaxScaler().fit(X_train)
    X_train_transformed=scaler.transform(X_train)
    X_test_transformed=scaler.transform(X_test)
    
    Auto_fitted_model2=knnRegression.fit(X_train_transformed,Y_train)
    R2_auto_hp=knnRegression.score(X_test_transformed,Y_test)
    print("Including feature of horsepower to the power of", power, ", R squared score of KNN regression is", R2_auto_hp)

Including feature of horsepower to the power of 1 , R squared score of KNN regression is 0.660924983306
Including feature of horsepower to the power of 2 , R squared score of KNN regression is 0.670108404882
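
The scaling step in the cell above learns the minimum and maximum from the training split and reuses them for the test split. Here is a minimal plain-Python sketch of that idea for a single feature, assuming the default target range of 0 to 1 and a training column that is not constant; `fit_min_max` and `apply_min_max` are hypothetical names and the horsepower values are made up.

```python
def fit_min_max(train_column):
    """Learn the scaling parameters from the training data only."""
    return min(train_column), max(train_column)


def apply_min_max(column, lo, hi):
    """Map values so that lo -> 0 and hi -> 1 (assumes hi > lo)."""
    return [(x - lo) / (hi - lo) for x in column]


train_hp = [50.0, 100.0, 150.0]   # made-up training horsepower values
test_hp = [75.0, 200.0]           # made-up test horsepower values

lo, hi = fit_min_max(train_hp)
train_scaled = apply_min_max(train_hp, lo, hi)
test_scaled = apply_min_max(test_hp, lo, hi)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.5]
```

Because the parameters come from the training split, a test value above the training maximum is mapped above 1. That is expected and avoids leaking information from the test set into the model.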




#### Comments:

- KNN performs better than linear regression with a single feature (horsepower): test $R^2$ of about 0.661 versus 0.622.
- Linear regression performs better than KNN once the quadratic term is added (about 0.727 versus 0.670). Since the actual relationship is almost quadratic, the parametric method with a quadratic term captures it well, while KNN gains very little from the extra feature.
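
To make the contrast with the parametric model concrete, here is a minimal sketch of what a KNN regressor with uniform weights does for a single scaled feature: it averages the targets of the k closest training points and assumes no functional form. `knn_predict` is a hypothetical helper and the data are made up.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to query."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: abs(pair[0] - query))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Made-up scaled feature values and targets.
train_x = [0.0, 0.1, 0.2, 0.5, 0.9, 1.0]
train_y = [30.0, 28.0, 26.0, 20.0, 14.0, 12.0]

# The two nearest neighbours of 0.15 are 0.1 and 0.2, so the prediction
# is the mean of 28.0 and 26.0.
print(knn_predict(train_x, train_y, query=0.15, k=2))  # 27.0
```

Since the prediction depends only on which training points are nearby, adding a feature that is a function of an existing one mostly reshuffles distances slightly, whereas in linear regression the same feature adds a new coefficient that can bend the fitted curve.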