# Media Company Case Study

## Multiple Linear Regression

### Problem Statement:
A digital media company (similar to Voot, Hotstar, Netflix, etc.) had launched a show. Initially, the show got a good response, but then witnessed a decline in viewership. The company wants to figure out what went wrong.


### Approach:
We want to identify the driver variables for show viewership. This is a case of prediction rather than projection: we are more interested in identifying the key driver variables and their impact than in forecasting the results.

First, we list the potential reasons for the decline in viewership.<br>

The potential reasons could be:
1. Decline in the number of people coming to the platform
2. Fewer people watching the video
3. Decrease in marketing spend
4. Competitive shows, e.g. cricket/ IPL
5. Special holidays
6. Twist in the story


### Data
We have been given data for the period from 1 March 2017 to 19 May 2017.<br>
The columns are:<br>
Views_show         : Number of times the show was viewed<br>
Visitors           : Number of visitors who browsed the platform, but not necessarily watched a video.<br>
Views_platform	   : Number of times a video was viewed on the platform<br>
Ad_impression	   : Proxy for marketing budget. Represents number of impressions generated by ads<br>
Cricket_match_india: If a cricket match was being played. 1 indicates match on a given day, 0 indicates there wasn't<br>
Character_A        : Describes presence of Character A. 1 indicates character A was in the episode, 0 indicates she/he wasn't

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas package

import numpy as np
import pandas as pd

# Data Visualisation

import matplotlib.pyplot as plt 
import seaborn as sns

In [None]:
pwd

## Reading and Understanding the Data

In [None]:
#Importing dataset
media = pd.read_csv("mediacompany.csv")
media.head()

In [None]:
# Checking Duplicates
sum(media.duplicated(subset = 'Date')) == 0
# No duplicate values
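
As a small illustration of what the duplicate check does, the sketch below runs `duplicated(subset=...)` on a made-up frame. The dates and view counts are hypothetical and are not taken from `mediacompany.csv`.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows share the same Date
toy = pd.DataFrame({
    "Date": ["2017-03-01", "2017-03-02", "2017-03-02"],
    "Views_show": [100, 120, 125],
})

# duplicated() marks every repeat of a Date after its first occurrence
dup_mask = toy.duplicated(subset="Date")
n_duplicates = int(dup_mask.sum())
print(n_duplicates)
```

A count of 0 on the real data is what the `== 0` comparison in the cell above tests for.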

In [None]:
# Dropping the unwanted column
media = media.drop('Unnamed: 7',axis = 1)

In [None]:
#Let's explore the top 5 rows
media.head()

## Data Inspection

In [None]:
media.shape

In [None]:
media.info()

In [None]:
media.describe()

## Data Cleaning

In [None]:
# Checking Null values
media.isnull().sum()*100/media.shape[0]
# There are no NULL values in the dataset, hence it is clean.

In [None]:
# Outlier Analysis
fig, axs = plt.subplots(2,2, figsize = (10,5))
# Pass the column explicitly as x so the call works across seaborn versions
plt1 = sns.boxplot(x = media['Views_show'], ax = axs[0,0])
plt2 = sns.boxplot(x = media['Visitors'], ax = axs[0,1])
plt3 = sns.boxplot(x = media['Views_platform'], ax = axs[1,0])
plt4 = sns.boxplot(x = media['Ad_impression'], ax = axs[1,1])

plt.tight_layout()
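
The boxplots flag outliers visually. A numeric counterpart can be sketched with the 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn boxplots. The helper name `iqr_outliers` and the toy values below are assumptions made for illustration only.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

# Hypothetical values; 500 sits far away from the rest
toy = pd.Series([10, 12, 11, 13, 12, 500])
flagged = iqr_outliers(toy)
print(flagged.tolist())
```

On the real data the same helper could be applied column by column, for example `iqr_outliers(media['Views_show'])`.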

In [None]:
# Data preparation

In [None]:
# Converting date to Pandas datetime format
media['Date'] = pd.to_datetime(media['Date'], dayfirst = False )
# Date is in the format YYYY-MM-DD

In [None]:
media.head()

#### Deriving Metrics

In [None]:
# Let's derive day of week column from date 

In [None]:
media['Day_of_week'] = media['Date'].dt.dayofweek

In [None]:
media.head()

## Exploratory Data Analysis

In [None]:
# Target Variable
# Views Show

In [None]:
sns.boxplot(media['Views_show'])

### Univariate analysis

#### Date

In [None]:
# days vs Views_show
media.plot.line(x='Date', y='Views_show')

In [None]:
# Inference
# The plot shows a recurring pattern in Views_show over time.

#### Day of week

In [None]:
sns.barplot(data = media,x='Day_of_week', y='Views_show')

In [None]:
# Inference
# Views are higher on Saturday and Sunday (the weekend) and decline over the following weekdays.

In [None]:
# Hence we can derive another metric, "weekend", that is 1 for weekends and 0 for weekdays.

In [None]:
di = {5:1, 6:1, 0:0, 1:0, 2:0, 3:0, 4:0}
media['weekend'] = media['Day_of_week'].map(di)
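
To make the day numbering explicit, here is a minimal sketch on three dates from the study period. pandas numbers the days Monday=0 through Sunday=6, so 5 and 6 are the weekend. The `isin` form is just an equivalent way of writing the dictionary mapping, not a change to the approach.

```python
import pandas as pd

# Three consecutive dates from the study period: a Friday, a Saturday and a Sunday
dates = pd.to_datetime(pd.Series(["2017-03-03", "2017-03-04", "2017-03-05"]))

day_of_week = dates.dt.dayofweek          # Monday=0 ... Sunday=6
day_name = dates.dt.day_name()            # human-readable cross-check

# Same result as the dictionary mapping, written as a membership test
weekend = day_of_week.isin([5, 6]).astype(int)

print(pd.DataFrame({"day": day_name, "dow": day_of_week, "weekend": weekend}))
```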

In [None]:
media.head()

#### Weekend

In [None]:
sns.barplot(data = media,x='weekend', y='Views_show')

In [None]:
# viewership is higher on weekends.

#### Ad Impressions

In [None]:
# Plot Date vs Views_show and Date vs Ad_impression on twin y-axes
ax = media.plot(x="Date", y="Views_show", legend=False)
ax2 = ax.twinx()
media.plot(x="Date", y="Ad_impression", ax=ax2, legend=False, color="r")
ax.figure.legend()


In [None]:
sns.scatterplot(data = media, x = 'Ad_impression', y = 'Views_show')

In [None]:
# we can see that the views as well as ad impressions show a weekly pattern.

#### Visitors

In [None]:
sns.scatterplot(data = media, x = 'Visitors', y = 'Views_show')

In [None]:
# Inference: Show views are somewhat proportional to Visitors

#### Views Platform

In [None]:
sns.scatterplot(data = media, x = 'Views_platform', y = 'Views_show')

In [None]:
# Inference: Show views are somewhat proportional to Platform views

#### Cricket Match

In [None]:
sns.barplot(data = media,x='Cricket_match_india', y='Views_show')

In [None]:
# Inference: Show views decline slightly when there is a cricket match.

#### Character A

In [None]:
sns.barplot(data = media,x='Character_A', y='Views_show')

In [None]:
# Inference: Presence of Character A improves the show viewership.

## Model building

#### Rescaling the Features

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['Views_show','Visitors','Views_platform','Ad_impression']

media[num_vars] = scaler.fit_transform(media[num_vars])
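
With its default range of [0, 1], `MinMaxScaler` maps each value to $x' = (x - x_{min}) / (x_{max} - x_{min})$. The sketch below reproduces that arithmetic by hand on hypothetical view counts and shows how to get back to raw units, which matters here because the target `Views_show` is scaled as well.

```python
import pandas as pd

# Hypothetical raw view counts
raw = pd.Series([200.0, 350.0, 500.0])

lo, hi = raw.min(), raw.max()
scaled = (raw - lo) / (hi - lo)          # min-max scaling to [0, 1]
restored = scaled * (hi - lo) + lo       # inverse transform back to raw units

print(scaled.tolist())
print(restored.tolist())
```

Coefficients and errors from models fitted after this step are expressed in scaled units, not in raw view counts.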

In [None]:
media.head()

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

In [None]:
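# Restrict to numeric columns so the datetime 'Date' column is left out of the correlation
sns.heatmap(media.select_dtypes('number').corr(),annot = True)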
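
A heatmap can be hard to read when looking for the strongest relationships with the target. One possible complement is to rank the correlations with `Views_show` directly. The frame below is synthetic (the `Noise` column and all values are made up), so only the pattern of the code carries over to the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 80
# Synthetic stand-in for the media frame (column names reused, values made up)
visitors = rng.normal(size=n)
toy = pd.DataFrame({
    "Visitors": visitors,
    "Noise": rng.normal(size=n),
    "Views_show": 2.0 * visitors + rng.normal(scale=0.5, size=n),
})

# Correlation of every column with the target, strongest (in absolute value) first
corr_with_target = (
    toy.corr()["Views_show"]
    .drop("Views_show")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```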

#### Running first model (lm1) Visitors, weekend

In [None]:
# Putting feature variable to X
X = media[['Visitors','weekend']]

# Putting response variable to y
y = media['Views_show']

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Create a LinearRegression object named lm
lm = LinearRegression()

In [None]:
# fit the model to the training data
lm.fit(X,y)

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_1 = sm.OLS(y,X).fit()
print(lm_1.summary())

In [None]:
# Inference:
# Both Visitors and weekend are significant.

#### Running second model (lm2) visitors, weekend & Character_A

In [None]:
# Putting feature variable to X
X = media[['Visitors','weekend','Character_A']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_2 = sm.OLS(y,X).fit()
print(lm_2.summary())

In [None]:
# We have seen that today's views affect tomorrow's views. To take that into account, we create a lag variable.

In [None]:
# Create lag variable
media['Lag_Views'] = np.roll(media['Views_show'], 1)
media.head()

In [None]:
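# np.roll wraps the last value around to the first row, so reset it.
# Use .loc to avoid chained assignment (assumes the default 0-based index).
media.loc[0, 'Lag_Views'] = 0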
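
The first row has to be reset because `np.roll` wraps around. The sketch below shows this on hypothetical values and compares it with `Series.shift`, which is another way to build the same lag column.

```python
import numpy as np
import pandas as pd

views = pd.Series([10, 20, 30, 40])     # hypothetical daily views

rolled = np.roll(views.to_numpy(), 1)   # wraps around: the last value moves to the front
shifted = views.shift(1)                # leaves the first position as NaN instead
lag = shifted.fillna(0)                 # same result as roll followed by resetting row 0

print(rolled.tolist())
print(lag.tolist())
```

Setting the first lag to 0 is a simplification either way, since the views for the day before the first observation are not known.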

In [None]:
media.head()

#### Running third model (lm3) visitors, Character_A, Lag_views & weekend

In [None]:
# Putting feature variable to X
X = media[['Visitors','Character_A','Lag_Views','weekend']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_3 = sm.OLS(y,X).fit()
print(lm_3.summary())

In [None]:
# Inference:
# Adding Lag_Views leaves Visitors insignificant.

#### Running fourth model (lm4) Character_A, weekend & Views_platform

In [None]:
# Putting feature variable to X
X = media[['weekend','Character_A','Views_platform']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_4 = sm.OLS(y,X).fit()
print(lm_4.summary())

#### Running fifth model (lm5) Character_A, weekend & Visitors

In [None]:
# Putting feature variable to X
X = media[['weekend','Character_A','Visitors']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_5 = sm.OLS(y,X).fit()
print(lm_5.summary())

#### Running sixth model (lm6) Character_A, weekend, Visitors & Ad_impressions

In [None]:
# Putting feature variable to X
X = media[['weekend','Character_A','Visitors','Ad_impression']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_6 = sm.OLS(y,X).fit()
print(lm_6.summary())

#### Running seventh model (lm7) Character_A, weekend & Ad_impressions

In [None]:
# Putting feature variable to X
X = media[['weekend','Character_A','Ad_impression']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_7 = sm.OLS(y,X).fit()
print(lm_7.summary())

In [None]:
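# Ad impressions in millions.
# Note: Ad_impression was min-max scaled above, so this divides the scaled
# values; apply the division to the raw column if true millions are needed.
media['ad_impression_million'] = media['Ad_impression']/1000000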

#### Running eighth model (lm8) Character_A, weekend, ad_impression_million & Cricket_match_india

In [None]:
# Putting feature variable to X
X = media[['weekend','Character_A','ad_impression_million','Cricket_match_india']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_8 = sm.OLS(y,X).fit()
print(lm_8.summary())

#### Running ninth model (lm9) Character_A, weekend & ad_impression_million

In [None]:
# Putting feature variable to X
X = media[['weekend','Character_A','ad_impression_million']]

# Putting response variable to y
y = media['Views_show']

In [None]:
import statsmodels.api as sm
# Unlike scikit-learn, statsmodels does not add an intercept automatically,
# so use sm.add_constant(X) to add a constant column.
X = sm.add_constant(X)
# create a fitted model in one line
lm_9 = sm.OLS(y,X).fit()
print(lm_9.summary())

#### Making predictions using lm9

In [None]:
# Making predictions using the model
X = media[['weekend','Character_A','ad_impression_million']]
X = sm.add_constant(X)
Predicted_views = lm_9.predict(X)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(media.Views_show, Predicted_views)
r_squared = r2_score(media.Views_show, Predicted_views)

In [None]:
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
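
For reference, the two metrics printed above can be computed by hand: MSE is the mean of the squared residuals, and $R^2 = 1 - SS_{res} / SS_{tot}$. The numbers in this sketch are hypothetical.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])      # hypothetical actual views
y_pred = np.array([2.5, 5.5, 7.0, 9.0])      # hypothetical predictions

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                        # mean squared error
ss_res = np.sum(residuals ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(mse, r_squared)
```

Because `Views_show` was min-max scaled earlier, the MSE reported by the notebook is in squared scaled units.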

In [None]:
#Actual vs Predicted
c = list(range(1, len(media) + 1))   # one index per row instead of a hard-coded 80
fig = plt.figure()
plt.plot(c,media.Views_show, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,Predicted_views, color="red",  linewidth=2.5, linestyle="-")
fig.suptitle('Actual and Predicted', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Views', fontsize=16)                               # Y-label

In [None]:
# Error terms
c = list(range(1, len(media) + 1))   # one index per row instead of a hard-coded 80
fig = plt.figure()
plt.plot(c,media.Views_show-Predicted_views, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('Views_show-Predicted_views', fontsize=16)                # Y-label
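
The error-term plot is inspected by eye for leftover patterns. A rough numeric companion, sketched here under the assumption that a well-specified model leaves residuals resembling pure noise, is to look at the residual mean and the lag-1 autocorrelation. The residuals below are simulated, not taken from `lm_9`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical residuals: pure noise, roughly what a well-specified model would leave
residuals = pd.Series(rng.normal(scale=0.1, size=80))

summary = {
    "mean": residuals.mean(),                 # should sit close to 0
    "std": residuals.std(),
    "lag1_autocorr": residuals.autocorr(1),   # far from 0 suggests leftover time structure
}
print({k: round(v, 3) for k, v in summary.items()})
```

On the real data the same summary could be applied to `media.Views_show - Predicted_views`.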

#### Making predictions using lm5

In [None]:
# Making predictions using the model
X = media[['weekend','Character_A','Visitors']]
X = sm.add_constant(X)
Predicted_views = lm_5.predict(X)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(media.Views_show, Predicted_views)
r_squared = r2_score(media.Views_show, Predicted_views)

In [None]:
print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)

In [None]:
#Actual vs Predicted
c = list(range(1, len(media) + 1))   # one index per row instead of a hard-coded 80
fig = plt.figure()
plt.plot(c,media.Views_show, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,Predicted_views, color="red",  linewidth=2.5, linestyle="-")
fig.suptitle('Actual and Predicted', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Views', fontsize=16)                               # Y-label

In [None]:
# Error terms
c = list(range(1, len(media) + 1))   # one index per row instead of a hard-coded 80
fig = plt.figure()
plt.plot(c,media.Views_show-Predicted_views, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('Views_show-Predicted_views', fontsize=16)                # Y-label

The analysis identifies Ad Impressions and Character A as the driver variables that could explain the
viewership pattern. Based on industry experience, ad impressions are directly proportional to the
marketing budget, so increasing the marketing budget could achieve better viewership.
Similarly, Character A’s presence or absence produced a significant change in show viewership:
Character A’s presence brings viewers to the show. Thus, these two variables could be acted upon to
improve show viewership.