<img src="https://rhyme.com/assets/img/logo-dark.png" align="center"> 

<h2 align="center">Simple Linear Regression</h2>

Linear Regression is a useful tool for predicting a quantitative response.

We have an input vector $X^T = (X_1, X_2,...,X_p)$, and want to predict a real-valued output $Y$. The linear regression model has the form

<h4 align="center"> $f(x) = \beta_0 + \sum_{j=1}^p X_j \beta_j$. </h4>

The linear model either assumes that the regression function $E(Y|X)$ is linear, or that the linear model is a reasonable approximation. Here the $\beta_j$'s are unknown parameters or coefficients, and the variables $X_j$ can come from different sources. No matter the source of $X_j$, the model is linear in the parameters.
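
As a rough illustration of the formula above, the sketch below evaluates $f(x)$ for a single input vector. The coefficient values and the input are hypothetical, chosen only so the arithmetic is easy to follow; they are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted values)
beta_0 = 2.0                       # intercept
beta = np.array([0.5, -1.0, 3.0])  # beta_1 .. beta_p for p = 3 inputs

def f(x):
    """Evaluate f(x) = beta_0 + sum_j x_j * beta_j for one input vector x."""
    return beta_0 + np.dot(x, beta)

x = np.array([4.0, 1.0, 2.0])
prediction = f(x)
print(prediction)  # 2.0 + 0.5*4 - 1.0*1 + 3.0*2 = 9.0
```

The model stays linear in the parameters even if an input $X_j$ is itself a transformation of a raw variable, such as a logarithm or a square.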

### Task 2: Loading the Data and Importing Libraries
---

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

The Advertising dataset captures the sales revenue generated with respect to advertisement spend across multiple channels such as radio, TV, and newspaper. [Source](http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv)

In [None]:
df = pd.read_csv("Advertising.csv")
df.head()

In [None]:
df.info()

### Task 3: Remove the index column

In [None]:
df.drop(["Unnamed: 0"], axis=1, inplace=True)
df.head()

### Task 4: Exploratory Data Analysis

In [None]:
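import seaborn as sb
# histplot(kde=True) stands in for distplot, which is deprecated in recent seaborn releases
sb.histplot(df.sales, kde=True)

In [None]:
sb.histplot(df.newspaper, kde=True)

In [None]:
sb.histplot(df.radio, kde=True)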

### Task 5: Exploring Relationships between Predictors and Response

In [None]:
sb.pairplot(df, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales', height=7,
            aspect=0.7, kind='reg')

In [None]:
df.TV.corr(df.sales)

In [None]:
df.corr()

In [None]:
sb.heatmap(df.corr(),annot=True)

TV is highly correlated with sales.
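
The correlation values above are Pearson coefficients, which is the default for pandas `corr`. As a minimal sketch of what that number measures, the block below computes Pearson's $r$ by hand on two small made-up arrays (`a` and `b` are illustrative only, not taken from the Advertising data) and compares it with `np.corrcoef`.

```python
import numpy as np

# Small made-up arrays, used only to illustrate the formula
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Center each variable around its mean
a_c = a - a.mean()
b_c = b - b.mean()

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
r_manual = np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2))

# Same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(a, b)[0, 1]

print(r_manual, r_numpy)
```

A value near 1 indicates a strong positive linear association, a value near -1 a strong negative one, and a value near 0 little linear association.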

### Task 6: Creating the Simple Linear Regression Model

General linear regression model:
$y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+...+\beta_{n}x_{n}$

- $y$  is the response
- $\beta_{0}$ is the intercept
- $\beta_{1}$ is the coefficient for $x_{1}$ (the first feature)
- $\beta_{n}$ is the coefficient for $x_{n}$ (the nth feature)

In our case: $y=\beta_{0}+\beta_{1} \times TV+\beta_{2} \times Radio+\beta_{3} \times Newspaper$

The $\beta$ values are called the **model coefficients**:

- These values are "learned" during the model fitting step using the "least squares" criterion
- The fitted model is then used to make predictions
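
To make the "least squares" criterion a little more concrete, here is a minimal sketch that recovers an intercept and a slope with `np.linalg.lstsq`. The data are synthetic and noise-free, generated from an assumed line $y = 1 + 2x$, so the recovered coefficients should match those values up to floating-point error. This only illustrates the criterion; it is not a description of scikit-learn's internals.

```python
import numpy as np

# Synthetic, noise-free data from an assumed line y = 1.0 + 2.0 * x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones for the intercept, then the feature
A = np.column_stack([np.ones_like(x), x])

# Find the coefficients that minimize the sum of squared residuals ||A b - y||^2
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
b0, b1 = coeffs

print(b0, b1)  # close to 1.0 and 2.0
```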

In [None]:
X = df[['TV']]
X.head()

In [None]:
y = df.sales
type(y)

Since `y` is a pandas Series, we can pass it directly to scikit-learn functions.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
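
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below checks this on stand-in arrays of 200 rows (`X_demo` and `y_demo` are hypothetical names and do not contain the Advertising data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 200 rows, one feature (illustrative only)
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.arange(200)

# No test_size given, so the default 75% / 25% split applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

print(X_tr.shape, X_te.shape)  # (150, 1) (50, 1)
print(y_tr.shape, y_te.shape)  # (150,) (50,)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the training and test sets on every run.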

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)

### Task 7: Interpreting Model Coefficients

In [None]:
print(lm.intercept_)
print(lm.coef_)

$\beta_0 = 6.91$ is the intercept and $\beta_1 = 0.048$ is the coefficient associated with spending on TV ads. A one-unit increase in TV ad spending is associated with a 0.048-unit increase in sales revenue. Put differently, an additional 1000 USD spent on TV ads is associated with an increase in sales of about 48 items. Because this model uses TV as its only predictor, the estimate does not hold radio and newspaper spending fixed.

This statement describes an association (correlation), not causation. If an increase in TV ad spending were associated with a decrease in sales, $\beta_1$ would be negative.
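
The interpretation of the slope can be checked with a few lines of arithmetic. The sketch below plugs in the rounded values quoted in the text (intercept 6.91, slope 0.048) rather than reading them from the fitted `lm` object, and `predict_sales` is a hypothetical helper introduced only for this illustration.

```python
# Rounded coefficients quoted in the text (assumed here, not recomputed)
b0 = 6.91    # intercept
b1 = 0.048   # TV coefficient

def predict_sales(tv_spend):
    """Predicted sales for a given TV spend, using the simple linear model."""
    return b0 + b1 * tv_spend

# Raising TV spend by one unit changes the prediction by exactly the slope
delta = predict_sales(101.0) - predict_sales(100.0)
print(round(delta, 3))  # 0.048
```

The difference does not depend on the starting spend, which is what it means for the model to be linear in TV.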

### Task 8: Making Predictions with our Model

In [None]:
# Make predictions on the test set (25% of the data under the default split)
y_pred = lm.predict(X_test)
y_pred[:5]  # y_pred is a NumPy array, so slice it to see the first five values

These are the first 5 predicted sales revenue values for the test set.

Now we need to compare the predicted values with the actual values, which is where evaluation metrics come into play.

### Task 9: Model Evaluation Metrics

We will explore the three most common evaluation metrics for models that predict continuous values.

In [None]:
true = [100,50,30, 20]
pred = [90,50,50,30]

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:
$$ \frac{1}{n} \sum_{i=1}^{n} \left |y_i - \hat{y}_i \right |$$

In [None]:
print((10 + 0 + 20 + 10) / 4)  # manual formula on the toy true/pred lists

from sklearn import metrics
print(metrics.mean_absolute_error(y_test, y_pred))  # MAE of the model on the test set

**Mean Squared Error** (MSE) is the mean of the squared errors:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

In [None]:
print((10**2 + 0**2 + 20**2 + 10**2) / 4)  # manual formula on the toy true/pred lists
print(metrics.mean_squared_error(y_test, y_pred))  # MSE of the model on the test set

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

In [None]:
print(np.sqrt((10**2 + 0**2 + 20**2 + 10**2) / 4))  # manual formula on the toy true/pred lists
print(np.sqrt(metrics.mean_squared_error(true, pred)))  # same toy example via scikit-learn

In [None]:
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
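
As a compact recap, the three formulas can be written directly in NumPy. The sketch below reuses the toy `true` and `pred` values from this task, converted to arrays, and reproduces the hand calculations without scikit-learn.

```python
import numpy as np

# Toy values from the example above
true = np.array([100, 50, 30, 20])
pred = np.array([90, 50, 50, 30])

errors = true - pred            # [10, 0, -20, -10]

mae = np.mean(np.abs(errors))   # (10 + 0 + 20 + 10) / 4 = 10.0
mse = np.mean(errors ** 2)      # (100 + 0 + 400 + 100) / 4 = 150.0
rmse = np.sqrt(mse)             # sqrt(150), roughly 12.25

print(mae, mse, rmse)
```

RMSE is expressed in the same units as the response, which makes it easier to interpret than MSE, while MSE and RMSE both penalize large errors more heavily than MAE does.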