# P&G Python Linear Regression Model: the marketing channel that has the greatest impact on sales

### Data set Description

Assume that a P&G marketing director in charge of media placement and event promotion wants to find which of the marketing channels currently in use has the greatest impact on sales. With limited resources, the business can then target its spending on that channel to achieve greater sales growth.

The aim of this case study is to express the relationship between revenue and marketing channel investment as a linear regression.

The goal is to obtain an equation of this form:

**y = intercept + coef1\*x1 + coef2\*x2......**

### Terminology Understanding

[MAE](https://www.statisticshowto.com/absolute-error/)\
[RMSE](https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/)\
[OLS](https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols)
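
As a quick illustration of the first two terms, here is a minimal sketch that computes MAE and RMSE by hand. The `actual` and `predicted` values are made up for illustration and do not come from the store data.

```python
import math

# Hypothetical values, for illustration only (not from Store.csv)
actual = [100, 200, 300]
predicted = [110, 190, 330]

# Prediction errors: predicted minus actual
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: the average absolute error
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: the square root of the average squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(round(mae, 2))   # 16.67
print(round(rmse, 2))  # 19.15
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, because squaring gives large errors more weight.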

### Import library and data

In [None]:
import pandas as pd
store = pd.read_csv('../input/store-data/Store.csv')

### Understand the dataframe
* Dataframe info
* Data cleaning and manipulation 
* Basic statistics

In [None]:
store.info()
store.head()

**Some variables description**

**reach:** number of tweets or posts (WeChat or Twitter)\
**local_tv:** local TV advertising investment\
**online:** online advertising investment\
**instore:** in-store investment, for example posters and displays\
**person:** store sales staff input\
**event:** promotional events

From the results above, I found that the 'local_tv' column has some missing values, so I will fill them with the column mean. I also noticed that the column 'Unnamed: 0' isn't needed, so I will delete it.

First, I want to check how many null values there are in each column.

In [None]:
store.isnull().sum()

The result shows there are 5 null values in the 'local_tv' column. Now I will fill them.

In [None]:
store['local_tv'] = store['local_tv'].fillna(store['local_tv'].mean())

I also need to check that the step above worked.

In [None]:
store.info()

Now it's time to delete column 'Unnamed: 0'.

In [None]:
store = store.drop('Unnamed: 0', axis = 1)

Similarly, I am going to check that the drop was successful.

In [None]:
store.head()

Now I have finished all the data cleaning. It's time to do some statistics using describe(). First, I will apply the describe() to the whole dataframe.

In [None]:
store.describe()

At this stage, I can point out some important information:

Only the 'event' column is categorical. I will need to convert it to dummy variables later.

Now I will check the unique values in the 'event' column.

In [None]:
store.event.unique()

I am also interested in the revenue under these four types of event.

In [None]:
store.groupby(['event'])['revenue'].describe()

From the result above, I noticed:

* special events earned the highest revenue on average
* cobranding is the most frequent event

Time to convert 'event' to dummy variables.

In [None]:
store = pd.get_dummies(store)

Let's take a look at how the dataframe has changed.

In [None]:
store.head(10)

Each of the 4 event types has become a new variable, where '1' represents 'yes' and '0' represents 'no'. info() also shows that the data type has changed from object to uint8. In pandas 2.0 and later the default is bool instead, with True for 'yes' and False for 'no'.
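
To make the encoding concrete, here is a minimal sketch on a made-up dataframe. The rows and the 'non_event' label are assumptions for illustration; passing `columns` and `dtype` explicitly is one way to get 0/1 integer columns regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only (not from Store.csv)
toy = pd.DataFrame({
    'revenue': [100.0, 150.0, 120.0],
    'event': ['cobranding', 'special', 'non_event'],
})

# Encode only the 'event' column and request an integer dtype explicitly
encoded = pd.get_dummies(toy, columns=['event'], dtype=np.uint8)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

Each new column is named after the original column and the category it stands for, and every row has a 1 in exactly one of them.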

In [None]:
store.info()

### Correlation and Visualisation

Now that the dataframe is complete, it's time to look at correlations!

In [None]:
store.corr()

To see the correlation between revenue and all the other variables more clearly, I will use the following code:

In [None]:
store.corr()[['revenue']].sort_values('revenue',ascending=False)

Now it is clear that local_tv, person and instore are the variables most strongly correlated with revenue.
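
By default `corr()` returns Pearson correlation coefficients. As a rough sketch of what that number is, the code below computes it by hand on made-up data and compares it with the pandas result. The column `spend` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical numbers, for illustration only (not from Store.csv)
toy = pd.DataFrame({
    'spend': [1.0, 2.0, 3.0, 4.0, 5.0],
    'revenue': [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Deviations of each column from its own mean
dx = toy['spend'] - toy['spend'].mean()
dy = toy['revenue'] - toy['revenue'].mean()

# Pearson r: co-variation divided by the product of the two spreads
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

# The same value from pandas
r_pandas = toy['spend'].corr(toy['revenue'])

print(r_manual, r_pandas)
```

A value close to 1 means the points lie close to a rising straight line. It says nothing about how steep that line is, which is what the regression coefficients measure.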

It will be helpful to see the trends using regression plots here.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.regplot(x = 'local_tv', y = 'revenue', data = store)

In [None]:
sns.regplot(x = 'person', y = 'revenue', data = store)

In [None]:
sns.regplot(x = 'instore', y = 'revenue', data = store)

From the regression plots above, we can get an overall look at how the points are distributed and how closely they follow the regression line. There are many outliers, but overall the points are concentrated in the center.

### Linear Regression modeling

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Build model
model=LinearRegression()

In [None]:
# Create independent variables x and dependent variable y.
x = store[['local_tv','person','instore']]
y = store['revenue']

In [None]:
# Model fitting
model.fit(x,y)

In [None]:
# Model intercept
model.intercept_

In [None]:
# Model independent variables coefficients
model.coef_
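
`coef_` is a plain array whose order follows the columns of x, so it helps to attach the column names. Below is a minimal sketch on synthetic data generated from a known equation; the columns `a` and `b` and all values are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data built from a known equation: y = 5 + 2*a + 3*b
toy_x = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                      'b': [2.0, 1.0, 4.0, 3.0, 6.0]})
toy_y = 5 + 2 * toy_x['a'] + 3 * toy_x['b']

toy_model = LinearRegression().fit(toy_x, toy_y)

# Pair each coefficient with the column it belongs to
coefficients = pd.Series(toy_model.coef_, index=toy_x.columns)

print(round(toy_model.intercept_, 2))
print(coefficients.round(2))
```

Because the toy data has no noise, the fit recovers the intercept 5 and the coefficients 2 and 3. The same `pd.Series(model.coef_, index=x.columns)` pattern should work for the model fitted above.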

### Model Evaluation and Improvement

In [None]:
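# R-squared on the data the model was fitted to
score=model.score(x,y)
predictions=model.predict(x)
error=predictions-y

# Root mean squared error and mean absolute error
rmse=(error**2).mean()**.5
mae=abs(error).mean()

print(score)
print(rmse)
print(mae)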
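
For a linear regression, `model.score(x, y)` returns R-squared, the share of the variation in y that the model explains. Here is a minimal sketch of that calculation on made-up numbers, not on the store data.

```python
import numpy as np

# Hypothetical values, for illustration only (not from Store.csv)
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

# Residual sum of squares: variation the model leaves unexplained
ss_res = ((actual - predicted) ** 2).sum()

# Total sum of squares: variation of the actual values around their mean
ss_tot = ((actual - actual.mean()) ** 2).sum()

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.948
```

Unlike RMSE and MAE, which are in the same units as revenue, R-squared has no units and equals 1 for a perfect fit.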

I have now produced all the parameters needed for the model. To get the best model we can, I propose one extra step: improving the model. How can we improve it? We change one input at a time and compare the results, for example:

* when filling the null values I used the mean; I could use the median instead and see whether MAE and RMSE decrease.
* another way is to add a new variable to x.
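
The first idea can be sketched on a made-up column to show how the two fill values differ. The numbers are hypothetical. In the notebook, the next step would be to refit the model after each fill and compare RMSE and MAE.

```python
import pandas as pd

# Hypothetical column with one missing value and one extreme value
spend = pd.Series([10.0, 12.0, None, 11.0, 100.0])

# mean() and median() skip missing values by default
filled_mean = spend.fillna(spend.mean())      # fills with 33.25
filled_median = spend.fillna(spend.median())  # fills with 11.5

print(filled_mean[2], filled_median[2])
```

The mean is pulled towards the extreme value, while the median is not, so the two choices differ most when a column contains outliers.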

That's the basic idea, but I will not include the process here as it just repeats the work above. The point is to understand the core concept. However, I am going to use another tool to fit the same linear regression: the OLS interface in statsmodels.

In [None]:
from statsmodels.formula.api import ols

In [None]:
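# Same predictors and response as in the scikit-learn model
x=store[['local_tv','person','instore']]
y=store['revenue']

In [None]:
# Reference the columns by name so the summary labels each coefficient
model=ols('revenue ~ local_tv + person + instore', data=store).fit()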

In [None]:
print(model.summary())

Very straightforward, and you can see that the intercept and coefficients in the output are the same as in the scikit-learn model. You also get additional statistics to help you understand the model, for example R-squared, AIC, BIC and other important tests.
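
The match is expected, because both tools fit ordinary least squares: they choose the intercept and coefficients that minimise the sum of squared residuals. Below is a minimal sketch of the same fit using only NumPy, on synthetic data generated from a known equation. The variables `x1`, `x2` and all values are hypothetical.

```python
import numpy as np

# Hypothetical data generated from a known equation: y = 1 + 0.5*x1 + 4*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1 + 0.5 * x1 + 4 * x2

# Design matrix with a leading column of ones for the intercept
design = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution: minimises the sum of squared residuals
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coef1, coef2 = solution

print(np.round(solution, 4))
```

With noise-free data the solution recovers the intercept 1 and the coefficients 0.5 and 4 up to rounding error.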

### Model and Business Interpretation

Now, I am going to produce the final equation.

revenue = -52880 + 1.75*local_tv + 2050*person + 4.09*instore

**The following conclusions can be drawn:**

* Every £1 increase in local TV advertising investment is associated with £1.75 more revenue, holding the other two variables constant.
* Every £1 increase in in-store investment (posters and displays) is associated with £4.09 more revenue.
* Every additional member of sales staff is associated with £2050 more revenue.
* Continuing to collect data and adding new variables can improve control over the overall marketing resource input.
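
To show how the equation is used, here is a minimal sketch that evaluates it for a made-up store. The function name `predict_revenue` and the input values are hypothetical. Only the intercept and coefficients come from the final equation above.

```python
def predict_revenue(local_tv, person, instore):
    """Evaluate the final equation from this section."""
    return -52880 + 1.75 * local_tv + 2050 * person + 4.09 * instore

# Hypothetical inputs, for illustration only
baseline = predict_revenue(local_tv=30000, person=10, instore=3000)

# The same store with 100 more spent on in-store promotion
more_instore = predict_revenue(local_tv=30000, person=10, instore=3100)

print(round(baseline, 2))                 # 32390.0
print(round(more_instore - baseline, 2))  # 409.0
```

The difference of 409 is just 100 times the instore coefficient, which is what "per £1" means in the conclusions above. The negative intercept suggests the equation is best read within the range of the observed data rather than extrapolated to zero investment.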