# Linear Regression on Advertising data

## Objectives

+ Perform exploratory data analysis
+ Deal with outliers
+ Fit a linear regression model using scikit-learn
+ Evaluate the model
+ Perform diagnostics

## Overview of Data

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('../input/advertising-budget/Advertising_Budget.csv')
df.head()

In [None]:
df.info()  # info() prints the summary itself and returns None

In [None]:
print('Shape: \n Rows = {0} \n Columns = {1}'.format(df.shape[0], df.shape[1]))

In [None]:
print(df.describe())

+ There is a wide gap between the 75th percentile and the maximum of the _Budget_ variable. This suggests the presence of outliers.
+ The data has high variability.
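
Both observations can be quantified rather than judged by eye. The sketch below uses toy numbers (hypothetical values, not taken from the advertising data) and assumes a single numeric column named `Budget`. It measures how far the maximum sits above the 75th percentile in units of the interquartile range, and reports the coefficient of variation as a rough measure of variability.

```python
import pandas as pd

# Toy data (hypothetical values, not from the advertising dataset)
toy = pd.DataFrame({'Budget': [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0]})

stats = toy['Budget'].describe()
iqr = stats['75%'] - stats['25%']

# Distance from the 75th percentile to the maximum, in units of the IQR
gap_in_iqrs = (stats['max'] - stats['75%']) / iqr

# Coefficient of variation: standard deviation relative to the mean
cv = stats['std'] / stats['mean']

print('Gap above Q3 = {0:.1f} IQRs'.format(gap_in_iqrs))
print('Coefficient of variation = {0:.2f}'.format(cv))
```

A gap well above 1.5 IQRs means the maximum would be flagged by the usual box plot rule.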

## Search for Missing values

In [None]:
# Missing Data
df.isnull().sum()

## Data Visualizations

Visualizing the data provides insights that are hard to see in the summary statistics alone. It also helps us verify our claims about outliers and high variability.

In [None]:
from matplotlib import pyplot as plt
import seaborn

seaborn.set_style('darkgrid')

In [None]:
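# PairGrid creates its own figure, so a separate plt.figure() call is not needed
g = seaborn.PairGrid(df)
g.map_diag(seaborn.kdeplot)         # density estimate of each variable on the diagonal
g.map_offdiag(seaborn.scatterplot)  # pairwise scatter plots elsewhere
plt.show()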

In [None]:
# Box plot of every numeric column
plt.figure(figsize=(10,10))
seaborn.boxplot(data=df)
plt.show()

## Search & Remove outliers

+ The box plot confirms the presence of outliers. These outliers inflate the variance.
+ In the next step we will remove these outliers.

In [None]:
# Locating outliers with the 1.5 * IQR rule
# (Series.quantile is used because the `interpolation` keyword of
# np.percentile is deprecated in recent NumPy releases)
budget = df['Budget']

q1 = budget.quantile(0.25, interpolation='midpoint')
q3 = budget.quantile(0.75, interpolation='midpoint')
iqr = q3 - q1

ulim = q3 + 1.5*iqr  # upper limit
llim = q1 - 1.5*iqr  # lower limit

# Row labels of the observations that fall outside the limits
index = df.index[(budget > ulim) | (budget < llim)].tolist()

In [None]:
# The outliers
df.loc[index, :]

In [None]:
# Remove outliers and reset the index
df = df.drop(index).reset_index(drop=True)
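
The locate-and-remove steps above can also be wrapped into one reusable helper. This is a minimal sketch, not part of the original notebook: the function name `drop_iqr_outliers`, the parameter `k`, and the toy column `Sales` are assumptions made for illustration. It uses the same midpoint quantiles and the same 1.5 * IQR limits, and keeps observations that lie exactly on a limit.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=1.5):
    """Return a copy of `frame` without rows outside the k * IQR limits of `column`."""
    q1 = frame[column].quantile(0.25, interpolation='midpoint')
    q3 = frame[column].quantile(0.75, interpolation='midpoint')
    iqr = q3 - q1
    # between() is inclusive, so values exactly on a limit are kept
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame.loc[keep].reset_index(drop=True)

# Toy data (hypothetical values and column names)
toy = pd.DataFrame({'Budget': [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0],
                    'Sales':  [1.0, 1.2, 1.1, 1.3, 1.2, 1.1, 1.0, 2.0]})

cleaned = drop_iqr_outliers(toy, 'Budget')
print(cleaned.shape)
```

Returning a new frame instead of modifying in place leaves the raw data available for comparison.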

## Fitting the Model

In [None]:
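# Separating the variables
X = df.iloc[:, 0:1].values  # first column, kept 2-D with shape (n_samples, 1) as scikit-learn expects
y = df.iloc[:, -1].values   # last column as a 1-D response array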

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

In [None]:
# Model parameters
print('Coefficients = {}'.format(model.coef_))
print('Intercept = {0:.2f}'.format(model.intercept_))

In [None]:
y_hat = model.predict(X)
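
With a single predictor, the fitted coefficient and intercept have a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks this on toy numbers (hypothetical values, with names such as `toy_model` chosen only for this illustration) against `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values, not from the advertising dataset)
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# Closed-form least-squares estimates for one predictor
x_dev = x_toy - x_toy.mean()
slope = np.sum(x_dev * (y_toy - y_toy.mean())) / np.sum(x_dev ** 2)
intercept = y_toy.mean() - slope * x_toy.mean()

# The same fit through scikit-learn (X must be 2-D)
toy_model = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)

print('Closed form : slope = {0:.4f}, intercept = {1:.4f}'.format(slope, intercept))
print('scikit-learn: slope = {0:.4f}, intercept = {1:.4f}'.format(
    toy_model.coef_[0], toy_model.intercept_))
```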

## Evaluating the model

### Method 1: Visualization

In [None]:
# Visualize the observed data together with the fitted line
plt.figure()
plt.scatter(X.ravel(), y, color='red', marker='o', alpha=0.35, label='Observed Data')
plt.plot(X.ravel(), y_hat, color='blue', label='Fitted Line')
plt.xlabel(df.columns[0])   # name of the predictor column
plt.ylabel(df.columns[-1])  # name of the response column
plt.title('Scatter Plot with Fitted Line')
plt.legend(loc='upper left')
plt.show()

### Method 2: Metrics

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

# In-sample metrics: computed on the same data that was used to fit the model
print('R-sq = {0:.4f}'.format(r2_score(y, y_hat)))
print('MSE = {0:.3f}'.format(mean_squared_error(y, y_hat)))
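
For reference, both metrics can be computed by hand from the residuals. The sketch below uses toy arrays (hypothetical values) and assumes the standard definitions: MSE is the mean of the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Toy observed and predicted values (hypothetical)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_pred

# Mean squared error: average squared residual
mse_manual = np.mean(residuals ** 2)

# R-squared: share of the total variation explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print('Manual : R-sq = {0:.4f}, MSE = {1:.4f}'.format(r2_manual, mse_manual))
print('sklearn: R-sq = {0:.4f}, MSE = {1:.4f}'.format(
    r2_score(y_true, y_pred), mean_squared_error(y_true, y_pred)))
```

Note that MSE is expressed in squared units of the response, so its size depends on the scale of the data, whereas R-squared is unitless.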