## Linear Regression

In this exercise we fit a linear regression model using two libraries, statsmodels and scikit-learn (sklearn). statsmodels reports the statistical tests and the significance of each feature. We then use sklearn to fit a linear model and evaluate it on a held-out test set.



#### A) Using the advertising data, use statsmodels to fit an OLS model that predicts sales from the features TV, Radio, and Newspaper. Print the p-values and confidence intervals of the features.



In [6]:
from pandas import read_csv
import statsmodels.formula.api as smf

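# Load the advertising data set
AdvertisingData=read_csv('Advertising.csv')

# Fit an OLS model of Sales on the three advertising budgets
model=smf.ols('Sales ~ TV+Radio+Newspaper', AdvertisingData)
Fitting_results=model.fit()

# Coefficient table: estimates, standard errors, p-values, and 95% confidence intervals
print(Fitting_results.summary().tables[1])
print('p-values are: \n', Fitting_results.pvalues)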

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
Radio          0.1885      0.009     21.893      0.000       0.172       0.206
Newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
p-values are: 
 Intercept    1.267295e-17
TV           1.509960e-81
Radio        1.505339e-54
Newspaper    8.599151e-01
dtype: float64


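Comment: The results imply that Newspaper is not an important feature (its p-value is about 0.86 and its 95% confidence interval, [-0.013, 0.011], contains zero), while TV and Radio have a strong association with the label (Sales).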
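
The comparison of each p-value against a significance level can also be done programmatically. The sketch below hard-codes the p-values printed above and assumes a conventional threshold of 0.05; the threshold is an assumption, since the exercise does not state one.

```python
# p-values copied from the statsmodels output above
p_values = {
    "Intercept": 1.267295e-17,
    "TV": 1.509960e-81,
    "Radio": 1.505339e-54,
    "Newspaper": 8.599151e-01,
}

ALPHA = 0.05  # assumed significance level, not given in the exercise

# Split the terms by whether their p-value falls below the threshold
significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("Significant at alpha =", ALPHA, ":", significant)
print("Not significant:", not_significant)
```

With a fitted model at hand, the same filter can be applied directly to `Fitting_results.pvalues`, which is a pandas Series and therefore supports boolean indexing such as `Fitting_results.pvalues[Fitting_results.pvalues < 0.05]`.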

#### B) Repeat question (A) without the Radio feature. Comment on the results of (A) and (B). What do the results imply?

In [3]:
model=smf.ols('Sales ~ TV+Newspaper', AdvertisingData)
Fitting_results=model.fit()
print(Fitting_results.summary().tables[1])
print('p-values are: \n', Fitting_results.pvalues)

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.7749      0.525     10.993      0.000       4.739       6.811
TV             0.0469      0.003     18.173      0.000       0.042       0.052
Newspaper      0.0442      0.010      4.346      0.000       0.024       0.064
p-values are: 
 Intercept    3.145860e-22
TV           5.507584e-44
Newspaper    2.217084e-05
dtype: float64


Comment: From (A) we can conclude that Newspaper is not an important feature in determining sales. From (B), it appears to be important. This implies that Newspaper is correlated with an important feature that was used in (A) but removed in (B), namely Radio.

In other words, Radio and Newspaper advertising budgets are correlated in the markets where the data was collected, but Newspaper advertising itself has little or no impact on sales. When Radio is left out of the model, Newspaper acts as a proxy for it and picks up part of its effect. The correlation matrix (see the lecture slides) can be used to check these correlations.
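
The effect described above can be reproduced on synthetic data. The following is a minimal sketch, not the real Advertising.csv: the data-generating coefficients, the noise levels, and the helper `ols_coefs` are all made up for illustration. Sales depends only on TV and Radio, while Newspaper is constructed to be correlated with Radio.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the advertising data (all numbers are assumptions)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = 0.6 * radio + rng.normal(0, 10, n)  # correlated with Radio
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)  # no Newspaper term

df = pd.DataFrame({"TV": tv, "Radio": radio, "Newspaper": newspaper, "Sales": sales})

# Pairwise Pearson correlations between all columns
corr = df.corr()
print(corr.round(2))

def ols_coefs(frame, features, target="Sales"):
    """Least-squares coefficients (with intercept) via numpy; hypothetical helper."""
    X = np.column_stack([np.ones(len(frame))] + [frame[f].to_numpy() for f in features])
    beta, *_ = np.linalg.lstsq(X, frame[target].to_numpy(), rcond=None)
    return dict(zip(["Intercept"] + list(features), beta))

with_radio = ols_coefs(df, ["TV", "Radio", "Newspaper"])
without_radio = ols_coefs(df, ["TV", "Newspaper"])

print("Newspaper coefficient with Radio:   ", with_radio["Newspaper"])
print("Newspaper coefficient without Radio:", without_radio["Newspaper"])
```

On the real data, the corresponding check would be `AdvertisingData.corr()`, assuming all columns of the file are numeric.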

#### C) Using the Scikit-Learn library, fit a linear regression model on the advertising training set, then find and print the test mean squared error (MSE) and the $R^2$ score of the fitted model. Use random_state=0 in the train_test_split function.

In [5]:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the data. The exercise template first defined X with
# ['Radio', 'TV', 'Newspaper']; this solution overrides it and uses TV only,
# which is what the printed output below corresponds to.
AdvertisingData = read_csv('Advertising.csv')
X = AdvertisingData[['TV']].values
Y = AdvertisingData.Sales

# Hold out a test set (default test_size=0.25) with a fixed seed
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Fit on the training set only
linreg = LinearRegression().fit(X_train, Y_train)

# Predict on the test features and score against the held-out targets
Target_predicted = linreg.predict(X_test)
MSE = mean_squared_error(Y_test, Target_predicted)
print('Mean square error using TV feature is', MSE)
print('The R-squared score is', linreg.score(X_test, Y_test))


Mean square error using TV feature is 8.73024887295
The R-squared score is 0.6902574858
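
For reference, the two reported metrics are $\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes both by hand with NumPy on a few made-up values (not the advertising data); the function names `mse` and `r_squared` are chosen here for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r_squared(y_true, y_pred))
```

`LinearRegression.score` returns this same $R^2$ quantity. On a test set it can be negative when the model predicts worse than the mean of the test targets.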
