## Introduction
In this notebook, we analyse a business problem with multiple linear regression step by step, interpreting the statistical terms at each step to understand how the method works.

#### Dataset
We use the Advertising dataset.
Consider a company that wants to improve the sales of a product. The company spends money on different advertising media (TV, radio, and newspaper) to increase sales. It records the money spent on each medium (in thousands of dollars) and the number of units of the product sold (in thousands of units).
Our task is to help the company find the most effective way to spend money on advertising, so that it can improve sales next year with a smaller advertising budget.

In [1]:
# Importing required libraries

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
%matplotlib inline

import statsmodels.api as sm
from statsmodels.formula.api import ols

In [2]:
# Loading the dataset
data = pd.read_csv('Advertising.csv')

In [3]:
data.head()

,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [4]:
# Removing the unnecessary index column
data.drop('Unnamed: 0',axis=1,inplace=True)

In [5]:
data.head()

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [6]:
data.describe()

,TV,radio,newspaper,sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


In [7]:
# Checking whether there are any null values
data.isnull().sum()

TV           0
radio        0
newspaper    0
sales        0
dtype: int64

# Multiple Linear Regression

## Model 1: Analyse the relation between sales and money spent on TV, radio, and newspaper advertising

### Model Building

The multiple linear regression equation for this problem is:
$$\text{Sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{Radio} + \beta_3 \cdot \text{Newspaper} + \epsilon$$
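
To make the equation concrete, here is a minimal sketch of how the coefficients can be estimated by ordinary least squares with NumPy. The data is synthetic and noise-free (an assumption for illustration, not the Advertising data), and the names `true_beta`, `X`, and `beta_hat` are hypothetical.

```python
import numpy as np

# Synthetic, noise-free data (illustrative assumption, not the Advertising dataset)
rng = np.random.default_rng(0)
n = 50
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Hypothetical "true" coefficients: intercept, TV, radio, newspaper
true_beta = np.array([3.0, 0.05, 0.2, 0.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), tv, radio, newspaper])
y = X @ true_beta  # the error term epsilon is set to zero here

# Least-squares estimate: the beta that minimises the residual sum of squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 4))
```

Because the synthetic data has no noise, the estimate should recover `true_beta` up to floating-point error. The `smf.ols` call in the next cell performs the same kind of fit on the real data and additionally reports standard errors and p-values.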

In [8]:
formula = 'sales ~ TV + radio + newspaper'
model1 = smf.ols(formula=formula, data=data).fit()
print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Tue, 24 Nov 2020   Prob (F-statistic):           1.58e-96
Time:                        14:59:27   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.0

#### Observations:
1) The table above shows the multiple regression coefficient estimates when TV, radio, and newspaper advertising budgets are used to predict product sales in the Advertising data.

2) The coefficient estimate for newspaper is close to zero, and its p-value of about 0.86 is far above the usual 0.05 significance level, so it is not statistically significant. This suggests that, once TV and radio spending are accounted for, money spent on newspaper advertising has no detectable relation to product sales.
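
The `t` and `P>|t|` columns of the summary are linked: each t-statistic is the coefficient divided by its standard error, and the p-value is the two-sided tail probability of that statistic. The sketch below reproduces this for the intercept row using only the standard library. It relies on a normal approximation to the t distribution, an assumption that is reasonable with 196 residual degrees of freedom; statsmodels itself uses the Student's t distribution. The function name `two_sided_p_normal` is hypothetical.

```python
import math

def two_sided_p_normal(t_stat):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Intercept row from the summary: coef = 2.9389, std err = 0.312
coef, std_err = 2.9389, 0.312
t_stat = coef / std_err  # close to the reported 9.422 (std err is rounded)
p_value = two_sided_p_normal(t_stat)

print('t = {0:.3f}, p = {1:.3g}'.format(t_stat, p_value))
```

A statistic near zero gives a p-value near 1, which is the situation described for newspaper in observation 2.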

##### $\text{Sales} = 2.94 + 0.045 \cdot \text{TV} + 0.189 \cdot \text{Radio} - 0.001 \cdot \text{Newspaper}$
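
As a quick illustration of how the fitted equation is used, the sketch below plugs the budgets from the first row of the dataset into it. The function name `predict_sales` is hypothetical, and the coefficients are the rounded values from the equation above, so the result will differ slightly from `model1.predict`.

```python
def predict_sales(tv, radio, newspaper):
    """Predicted sales (thousands of units) from budgets in thousands of dollars."""
    return 2.94 + 0.045 * tv + 0.189 * radio - 0.001 * newspaper

# First row of the dataset: TV = 230.1, radio = 37.8, newspaper = 69.2
prediction = predict_sales(230.1, 37.8, 69.2)
print('Predicted sales = {0:.2f}'.format(prediction))
```

The prediction is roughly 20.4, against observed sales of 22.1 for that row; the gap is that row's residual.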

### Residual Standard Error

In [9]:
predicted_sales_TV_radio_newspaper_Ad = model1.predict(data[['TV','radio','newspaper']])

In [10]:
RSS = np.sum((data['sales'] - predicted_sales_TV_radio_newspaper_Ad)**2)
print('RSS = {0}'.format(RSS))

RSS = 556.8252629021872


In [11]:
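# RSE divides RSS by the residual degrees of freedom: n - p - 1 = 200 - 3 - 1 = 196
p = 3  # number of predictors in the model
RSE = np.sqrt(RSS/(data.shape[0] - p - 1))
print('RSE = {0:.4f}'.format(RSE))

RSE = 1.6855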
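
The residual standard error is commonly defined as the square root of RSS divided by the residual degrees of freedom, n - p - 1, where p is the number of predictors. Dividing by n instead gives the root mean squared error, which is slightly smaller. The sketch below contrasts the two using the RSS printed above with n = 200 and p = 3; the function names are hypothetical.

```python
import math

def residual_standard_error(rss, n, p):
    """sqrt(RSS / (n - p - 1)): uses the residual degrees of freedom."""
    return math.sqrt(rss / (n - p - 1))

def root_mean_squared_error(rss, n):
    """sqrt(RSS / n): no degrees-of-freedom correction."""
    return math.sqrt(rss / n)

rss = 556.8252629021872  # RSS reported above
n, p = 200, 3            # 200 observations, 3 predictors

rse = residual_standard_error(rss, n, p)
rmse = root_mean_squared_error(rss, n)
print('RSE  = {0:.4f}'.format(rse))
print('RMSE = {0:.4f}'.format(rmse))
```

With 200 observations and 3 predictors the two values differ only in the second decimal place, but the gap widens as the number of predictors approaches the sample size.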


### R-Squared Statistic

In [12]:
TSS = np.sum((np.mean(data['sales']) - data['sales'])**2)
R_squared = (TSS-RSS)/TSS

print('R-Squared statistics = {0}'.format(R_squared))

R-Squared statistics = 0.8972106381789521


The R-squared value is 0.90, which shows that about 90% of the variance in sales is explained by the multiple linear regression of sales on TV, radio, and newspaper.
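
R-squared never decreases when a predictor is added, so the model summary also reports an adjusted value that penalises model size. The sketch below recomputes it from the R-squared above using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); the function name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

r_squared = 0.8972106381789521  # value computed above
adj = adjusted_r_squared(r_squared, n=200, p=3)
print('Adjusted R-Squared = {0:.3f}'.format(adj))
```

The result rounds to 0.896, which matches the `Adj. R-squared` entry in the model summary.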

## Correlation Matrix

In [13]:
data.corr()

,TV,radio,newspaper,sales
TV,1.0,0.054809,0.056648,0.782224
radio,0.054809,1.0,0.354104,0.576223
newspaper,0.056648,0.354104,1.0,0.228299
sales,0.782224,0.576223,0.228299,1.0
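
Each entry of the matrix is a Pearson correlation coefficient. As a minimal sketch of what `data.corr()` computes for one pair of columns, the function below implements the formula with the standard library and applies it to the five rows shown by `data.head()`. With so few rows, the value will not match the full-data entry of 0.782224. The function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# TV budgets and sales from the first five rows of the dataset
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

r = pearson_r(tv, sales)
print('Pearson r (TV, sales) on 5 rows = {0:.3f}'.format(r))
```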


### Conclusion:

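1) The correlation between sales and newspaper advertising is weak (0.23), and newspaper spending is itself correlated with radio spending (0.35). Combined with the non-significant regression coefficient, this suggests that newspaper advertising has no direct effect on sales.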

2) In the multiple linear regression of sales on TV, radio, and newspaper, radio advertising has the largest effect per dollar spent. With the other budgets held fixed, each additional 1,000 dollars spent on radio or TV advertising is associated with an increase in sales of approximately 189 or 45 units, respectively.