![](https://get.pxhere.com/photo/Ppc-marketing-click-Advertiser-per-advertising-pay-online-collage-target-web-website-payment-money-promotion-commerce-service-communication-cta-megaphone-graphic-design-font-illustration-graphics-logo-gesture-1586373.jpg)

This analysis provides examples of linear regression prediction models for sales earned versus dollars spent on various media types.  

Feature engineering is applied to demonstrate how additional categorical variables can be transformed into a format that a linear regression model can ingest and provide insight as to whether or not a variable is statistically significant.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:

from sklearn.metrics import r2_score,mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv('/kaggle/input/advertising-data/Advertising.csv', index_col = 0)

In [None]:
data.columns = ['TV', 'Radio', 'Newspaper', 'Sales']
data.head()

In [None]:
fig, axs = plt.subplots(1,3,sharey=True)
data.plot(kind='scatter',x='TV',y='Sales',ax=axs[0],figsize=[16,8])
data.plot(kind='scatter',x='Radio',y='Sales',ax=axs[1])
data.plot(kind='scatter',x='Newspaper',y='Sales',ax=axs[2])

Let's see how tightly each variable is coupled to Sales, starting with TV.

In [None]:
feature_cols = ['TV']
x = data[feature_cols]
y = data.Sales

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x,y)

In [None]:
print(lm.intercept_)
print(lm.coef_)

This coefficient (the slope of the fitted line, not a correlation) means that each additional unit of TV advertising spend is associated with an increase of about 0.047 units in Sales.
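To make the slope interpretation concrete, here is a minimal sketch that computes a one-feature least-squares fit by hand. The data is synthetic (it is not the Advertising dataset), and the names `tv`, `sales`, and `predict` are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in data (NOT the Advertising dataset):
# sales = 7 + 0.05 * tv + noise
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.0, size=200)

# Ordinary least squares with one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(tv, sales, ddof=1)[0, 1] / np.var(tv, ddof=1)
intercept = sales.mean() - slope * tv.mean()

def predict(x):
    """Predicted sales for a given TV spend (hypothetical helper)."""
    return intercept + slope * x

# Raising TV spend by one unit raises the prediction by exactly the slope
delta = predict(101.0) - predict(100.0)
print(slope, delta)
```

The difference between the two predictions equals the slope, which is what "an increase per unit of spend" means for a linear model.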

Let's use the fitted model to predict Sales for a TV advertising spend of $50,000 (entered as TV = 50, since spend is recorded in thousands of dollars).

In [None]:
X_new = pd.DataFrame({'TV':[50]})

In [None]:
lm.predict(X_new)

In [None]:
X_new = pd.DataFrame({'TV': [data.TV.min(),data.TV.max()]})
X_new.head()

In [None]:
preds = lm.predict(X_new)
preds

In [None]:
data.plot(kind='scatter', x='TV', y='Sales')
plt.plot(X_new, preds, c='red', linewidth=2)

In [None]:
import statsmodels.formula.api as smf

In [None]:
lm = smf.ols(formula='Sales ~ TV',data=data).fit()

In [None]:
lm.conf_int()

In [None]:
lm.pvalues

If p value > .10 → “not significant”

If p value ≤ .10 → “marginally significant”

If p value ≤ .05 → “significant”

If p value ≤ .01 → “highly significant.”


Such a low p-value (less than 0.05) means that we reject the null hypothesis that TV advertising spend has no effect on Sales. In other words, the data gives strong evidence of a relationship between TV advertising dollars and Sales.
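The thresholds listed above can be written as a small helper. This is only a sketch of that rule of thumb; `significance_label` is a hypothetical name, not part of statsmodels.

```python
def significance_label(p_value):
    """Map a p-value to the rule-of-thumb labels listed above."""
    # Check the strictest threshold first so each p-value gets one label
    if p_value <= 0.01:
        return "highly significant"
    if p_value <= 0.05:
        return "significant"
    if p_value <= 0.10:
        return "marginally significant"
    return "not significant"

for p in (0.003, 0.04, 0.08, 0.5):
    print(p, significance_label(p))
```

Checking from the smallest threshold upward matters because the ranges overlap: a p-value of 0.003 is also below 0.05 and 0.10, but only the strictest label applies.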

In [None]:
lm.rsquared

This R-squared value looks OK, but it's hard to judge on its own, since we're only looking at one model and have nothing to compare it against. A value closer to 1 would mean the data points sit more tightly around the fitted line.
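As a rough illustration of what R-squared measures, the sketch below computes it by hand as one minus the ratio of residual to total sum of squares. The data is synthetic and the variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-in data (NOT the Advertising dataset)
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3.0, size=200)

# np.polyfit returns coefficients from highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

# With a single feature, R-squared equals the squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

If the line explained every point exactly, `ss_res` would be zero and R-squared would be 1.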

Let's look at more variables to build a multiple linear regression model

In [None]:
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales

In [None]:
from sklearn import model_selection
xtrain,xtest,ytrain,ytest = model_selection.train_test_split(X,y,test_size=0.3,random_state=42)

Let's first fit a linear regression model on the full data for the three variables and Sales.

In [None]:
lm = LinearRegression()
lm.fit(X,y)

Let's find out the coefficients of the fitted line

In [None]:
print(lm.intercept_)
print(lm.coef_)

Now let's fit the model on the training split only and compare its coefficients with those from the full data.

In [None]:
lm = LinearRegression()
lm.fit(xtrain,ytrain)
print(lm.intercept_)
print(lm.coef_)

Based on the line fitted to the training data, the coefficients seem fairly close to those from the full data.

Since the two fits look similar, the model seems promising for making predictions. Let's predict on the held-out test data and measure how far the predictions fall from the actual values using the root mean squared error (RMSE).

In [None]:
predictions = lm.predict(xtest)

In [None]:
print(sqrt(mean_squared_error(ytest,predictions)))
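The RMSE printed above is the square root of the average squared prediction error. A minimal hand-rolled version, with made-up numbers and a hypothetical `rmse` helper, might look like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values for illustration only
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 22.0]

# Errors are -1, 0, 2, -2; squared: 1, 0, 4, 4; mean 2.25; root 1.5
print(rmse(actual, predicted))  # 1.5
```

Because the errors are squared before averaging, RMSE is in the same units as Sales and penalizes large misses more heavily than small ones.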

In [None]:
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper',data=data).fit()
lm.rsquared

Let's look at the confidence interval

In [None]:
lm.conf_int()

In [None]:
lm.summary()

We see that TV and Radio have the lowest p-values (well below 0.05), so both are statistically significant. The Newspaper coefficient is slightly negative, but its p-value is far above 0.10, so we can't conclude that newspaper spending has any effect on Sales, positive or negative.

If p value > .10 → “not significant”

If p value ≤ .10 → “marginally significant”

If p value ≤ .05 → “significant”

If p value ≤ .01 → “highly significant.”

The multiple regression model has an R-squared value of 0.897, much higher than the 0.611 of the earlier TV-only model. This means that the model with more variables explains more of the variation in Sales.

However, throwing in more variables to get a higher R-squared value doesn't necessarily produce the best model, because R-squared never decreases when a variable is added. It is best to remove variables that do not make a statistically significant difference to the outcome.

Here we see the difference between using three variables and two. Fewer variables means less to deal with in the model and fewer calculations, so a model that achieves a similar R-squared value with fewer variables would be considered more efficient.

In [None]:
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper',data=data).fit()
lm.rsquared

... is nearly identical to:

In [None]:
lm = smf.ols(formula='Sales ~ TV + Radio',data=data).fit()
lm.rsquared
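The comparison above can be explored further with a small experiment: adding a feature that is pure noise never lowers R-squared. The sketch below uses synthetic data and also computes adjusted R-squared, a metric that is not used elsewhere in this analysis and is brought in here only as one common way to penalize extra variables. The helper `fit_r2` and all variable names are assumptions.

```python
import numpy as np

def fit_r2(X, y):
    """Return (R-squared, adjusted R-squared) for an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

# Synthetic stand-in data (NOT the Advertising dataset)
rng = np.random.default_rng(2)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
noise_feature = rng.normal(size=n)  # unrelated to sales by construction
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

r2_two, adj_two = fit_r2(np.column_stack([tv, radio]), sales)
r2_three, adj_three = fit_r2(np.column_stack([tv, radio, noise_feature]), sales)

print("2 features:", r2_two, adj_two)
print("3 features:", r2_three, adj_three)
```

Plain R-squared for the three-feature fit is at least as high as for the two-feature fit, even though the third feature carries no information, which is why R-squared alone is a weak reason to keep a variable.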

### Feature engineering

Let's create a new variable for the geographical region of advertising. We'll call it "Area", with three categories: "rural", "suburban", and "urban". Let's also add a second variable, "Size", that denotes the fictitious size of the city ("large" or "small").

In [None]:
np.random.seed(12345)
nums = np.random.rand(len(data))

In [None]:
mask_issmall = (nums > 0) & (nums < 0.15)
mask_islarge = (nums > 0.15) & (nums < 0.33)
mask_suburban = (nums > 0.33) & (nums < 0.60)
mask_urban = nums > 0.60


In [None]:
data['Area'] = "rural"
data['Size'] = "large"
data.head()

In [None]:
data.loc[mask_suburban,'Area'] = "suburban"
data.loc[mask_urban,'Area'] = "urban"
data.loc[mask_islarge,'Size'] = "large"
data.loc[mask_issmall,'Size'] = "small"
data.head()
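As an alternative sketch of the same idea, random numbers can be mapped to categories in one step with nested `np.where` calls. This uses NumPy's newer `default_rng` generator, so the draws differ from the `np.random.seed(12345)` values above, and the sample size of 1000 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)
nums = rng.random(1000)  # uniform values in [0, 1)

# Each number falls into exactly one bin:
# [0, 0.33) rural, [0.33, 0.60) suburban, [0.60, 1) urban
area = np.where(
    nums < 0.33,
    "rural",
    np.where(nums < 0.60, "suburban", "urban"),
)

labels, counts = np.unique(area, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Using half-open bins avoids the edge case where a value lands exactly on a boundary and matches none of the strict-inequality masks.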

To do any statistical analysis we need to convert the categorical variable "Area" into binary features.  Let's create some dummy variables out of the category content of Area into a new dataframe.

In [None]:
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:,1:]
area_dummies.head()

In [None]:
size_dummies = pd.get_dummies(data.Size, prefix='Size').iloc[:,0:]
size_dummies.head()
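For a category with k levels, only k-1 dummy columns are needed, because the dropped level is implied when all the others are zero. The sketch below shows this on a tiny made-up Series using the `drop_first=True` option of `pd.get_dummies`, which has the same effect as the `.iloc[:,1:]` slice used for Area.

```python
import pandas as pd

# Tiny made-up example, not the notebook's data
areas = pd.Series(["rural", "urban", "suburban", "urban", "rural"], name="Area")

# All three levels, one column each
all_levels = pd.get_dummies(areas, prefix="Area")

# Drop the first level (alphabetically "rural"); it becomes the baseline
dummies = pd.get_dummies(areas, prefix="Area", drop_first=True)

print(list(all_levels.columns))
print(list(dummies.columns))

# Rows where every remaining dummy is 0 are exactly the baseline rows
is_baseline = dummies.sum(axis=1) == 0
print(is_baseline.tolist())
```

Keeping all k columns alongside an intercept would make the columns perfectly collinear, so the regression coefficients for the dummies are interpreted relative to the dropped baseline level.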


Let's connect these three dataframes using concat

In [None]:
data = pd.concat([data,area_dummies,size_dummies,], axis=1)
data.head(20)

In [None]:
feature_cols = ['TV', 'Radio', 'Newspaper', 'Area_suburban', 'Area_urban','Size_large']
X = data[feature_cols]
y = data.Sales

In [None]:
lm = LinearRegression()
lm.fit(X,y)

In [None]:
print(feature_cols)
print(lm.coef_)

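For the synthetic Area and Size features, the Area coefficients are negative, while the coefficient for a large fictitious city is close to +0.5. These are regression coefficients rather than correlations, and a coefficient's magnitude alone does not establish statistical significance. Because both features were generated at random, their p-values (for example, from a statsmodels summary) should be checked before treating either one as a meaningful predictor.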