In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# What kind of problem do we have to solve?


Let's say you have two independent features, Age and Experience, and you have to predict Salary from those two values.

Linear Regression best-fit line formula => Salary = B0 + B1 * (Age) + B2 * (Experience)

Here, B0 = intercept, and B1 and B2 are the coefficients / slopes.

Now, Age and Experience may themselves be highly correlated with each other (correlation above 0.9). In that case the two features carry almost the same information about the output 'Salary', so the model cannot cleanly separate the effect of one from the effect of the other, and the estimates of B1 and B2 become unreliable.

This is the multicollinearity problem that we have to resolve.
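
To make the idea concrete, here is a small sketch on synthetic data. The numbers are hypothetical (experience is assumed to be roughly age minus 22, plus noise) and are not taken from either dataset used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: experience is roughly (age - 22) plus a little noise
age = rng.uniform(23, 60, size=200)
experience = age - 22 + rng.normal(0, 1.5, size=200)

# Pearson correlation between the two input features
corr = np.corrcoef(age, experience)[0, 1]
print(f"correlation(age, experience) = {corr:.3f}")
```

When the correlation comes out this close to 1, the two columns tell the model almost the same thing.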

In [None]:
import pandas as pd
import statsmodels.api as sm

# Example Dataset 1 => Advertising Dataset

In [None]:
df = pd.read_csv("/kaggle/input/tvradionewspaperadvertising/Advertising.csv")
df.head()

* TV => amount spent on TV advertisements
* Radio => amount spent on radio advertisements
* Newspaper => amount spent on newspaper advertisements

* Sales => the sales achieved with the help of that advertising spend


In this dataset, TV, Radio and Newspaper are the independent features, and Sales is the output feature that we have to predict.

## Splitting the data into independent and dependent features

In [None]:
X = df[['TV', 'Radio', 'Newspaper']]   # independent variables => predict sales value based on these features
y = df[['Sales']]                        # dependent variables

X.head()

In [None]:
y.head()

## In this case, we will use the Multiple Linear Regression technique 'Ordinary Least Squares' (OLS)

* Equation of the Linear Regression best-fit line for this dataset: y = B0 * 1 + B1 * (TV) + B2 * (Radio) + B3 * (Newspaper)

* Whenever we fit Ordinary Least Squares (OLS), we also need to estimate B0 (i.e. the intercept).
* But the data has no column for B0 => so we add a column for it in which every value is equal to 1

* To add this constant column of ones => we will use the statsmodels library
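
The role of the column of ones can be sketched with plain NumPy. This is only an illustration on hypothetical toy data, not the notebook's dataset; `sm.add_constant` does the equivalent job by prepending a column named `const` to the features.

```python
import numpy as np

# Hypothetical toy data generated exactly from y = 2 + 3 * x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a leading column of ones, so B0 can be estimated
X_design = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("with constant   :", beta)      # beta[0] = intercept, beta[1] = slope

# Without the column of ones, the fitted line is forced through the origin
beta_no_const, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print("without constant:", beta_no_const)
```

With the constant column the fit recovers both the intercept and the slope; without it, the slope has to absorb the missing intercept.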

In [None]:
X = sm.add_constant(X)
X.head()

### Fit an Ordinary Least Squares model with intercept on TV, Radio and Newspaper

We will again use the statsmodels library, as it provides the OLS class for creating the model. OLS takes endog (the output feature) and exog (the input features, including the constant column) as parameters.

In [None]:
model = sm.OLS(y, X).fit()
model.summary()

This summary helps us understand whether there is multicollinearity / high correlation between the independent features.

According to the summary,

* B0 = coeff of const = 4.6251

* B1 = coeff of TV = 0.0544
* [this coeff value means that if we change the TV expenditure (i.e. input feature) by 1 unit while keeping Radio and Newspaper fixed, the expected change in Sales (i.e. output) will be 0.0544]

* B2 = coeff of Radio = 0.1070
* B3 = coeff of Newspaper = 0.0003
* B3 is very close to 0 => according to this model, spending on Newspaper has almost no effect on Sales once TV and Radio are accounted for. Thus, while creating the model, we can consider dropping this feature.


* R-squared value = 0.903 => close to 1 => the model explains about 90% of the variance in Sales, so it fits the data well

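* P value of const = 0.000
* P value of TV = 0.000
* P value of Radio = 0.000
* P value of Newspaper = 0.954

=> The summary rounds P values to three decimals, so 0.000 means "smaller than 0.0005". Except for the feature 'Newspaper' (P value = 0.954), all the P values are less than 0.05, i.e. only the Newspaper coefficient is not statistically significant.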

* std error of const = 0.308
* std error of TV = 0.001
* std error of Radio = 0.008
* std error of Newspaper = 0.006

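std error => tends to be large relative to its coefficient when there is multicollinearity among the independent variables, because the model cannot tell the correlated features apart. The absolute size of a std error depends on the units of the feature, so it should be read against the coefficient rather than against a fixed cutoff.
Here, the std errors of TV and Radio are small compared with their coefficients, so there is no sign of multicollinearity among the independent variables. (Newspaper's std error exceeds its coefficient only because that coefficient is practically 0.)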
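
The link between correlated features and large std errors can be sketched on synthetic data. The helper `ols_std_errors` and all the numbers below are hypothetical; the helper uses the standard formula for the default (non-robust) OLS std errors, the square root of the diagonal of sigma^2 * (X'X)^-1.

```python
import numpy as np

def ols_std_errors(X, y):
    """OLS coefficients and std errors; X must already contain the constant column."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    return beta, np.sqrt(sigma2 * np.diag(xtx_inv))

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

x2_indep = rng.normal(size=n)                 # unrelated to x1
x2_corr = x1 + 0.1 * rng.normal(size=n)       # almost a copy of x1

def fit(x2):
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + noise
    return ols_std_errors(X, y)

_, se_indep = fit(x2_indep)
_, se_corr = fit(x2_corr)
print("std errors, independent features:", se_indep.round(3))
print("std errors, correlated features :", se_corr.round(3))
```

The noise level is the same in both fits; only the correlation between the two features changes, and that alone inflates the std errors of their coefficients.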

## Correlation table of the independent features

In [None]:
import matplotlib.pyplot as plt
X.iloc[:, 1:].corr()

Through this table, we can see the correlation values among the various independent features:

* Between TV and Radio => 0.054809
* Between Radio and Newspaper => 0.354104
* Between TV and Newspaper => 0.056648

None of these correlation values exceeds 0.5. The independent features are therefore not strongly correlated with each other, and there is no sign of a multicollinearity issue among them.
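
Reading the table by eye works for three features; for more columns the same check can be automated. The helper `high_corr_pairs` below is a hypothetical sketch, and the 0.5 default simply mirrors the threshold used above. It is shown on made-up toy data.

```python
import pandas as pd

def high_corr_pairs(features: pd.DataFrame, threshold: float = 0.5):
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):     # upper triangle only, no duplicates
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

# Toy data: 'b' closely follows 'a', while 'c' is unrelated to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "c": [2, -1, 0, 3, -2, 1],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

On the notebook's data it could be called as `high_corr_pairs(X.iloc[:, 1:])`, which skips the constant column.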

# Example Dataset 2 => Salary Dataset with age and years of experience (YOE)

In [None]:
df_salary = pd.read_csv("/kaggle/input/salary-data-with-age-and-experience/Salary_Data.csv")
df_salary.head()

### In this case the independent features are 'YearsOfExperience' and 'Age', and we have to predict the dependent variable 'Salary' from these two features

In [None]:
X = df_salary[['YearsExperience', 'Age']]
y = df_salary[['Salary']]

X.head()

In [None]:
y.head()

### Fitting the OLS (Ordinary Least Squares) model, as for the previous dataset

In [None]:
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()

In this scenario, according to the summary,

* B0 = coeff of const = -6661.9872

* B1 = coeff of YearsOfExperience = 6153.3533
* [this coeff value means that if we change YearsOfExperience (i.e. input feature) by 1 unit (1 year) while keeping Age fixed, the expected change in salary (i.e. output) will be 6153.3533]

* B2 = coeff of Age = 1836.0136
* [thus if we change Age (i.e. input feature) by 1 unit (1 year) while keeping YearsOfExperience fixed, the expected change in salary (i.e. output) will be 1836.0136]


* R-squared value = 0.960 => close to 1 => the model explains about 96% of the variance in Salary, so it fits the data well

* P value of const = 0.773
* P value of YearsOfExperience = 0.014
* P value of Age = 0.165

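=> for Age => the P value is > 0.05, i.e. Age is not statistically significant once YearsOfExperience is in the model. This is a hint that Age and YearsOfExperience may be correlated and carry overlapping information.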

* std error of const = 0.308
* std error of YearsOfExperience = 2337.092
* std error of Age = 1285.034

Here the std errors of both YearsOfExperience and Age are large compared with their coefficients (2337.092 against 6153.3533, and 1285.034 against 1836.0136), which indicates strong multicollinearity between them.

### Confirming the multicollinearity between Age and YearsOfExperience by plotting the correlation table

In [None]:
X.iloc[:, 1:].corr()

#### From this correlation table / matrix we can see that Age and YearsOfExperience have a correlation of about 0.98 (very highly correlated). This suggests that one of these features is enough to predict the salary.

#### Now the question is which of the input features (YearsOfExperience or Age) to keep and which one to drop for the final prediction of salary

### Remedy for this Multicollinearity problem:

* Solution 1 : Don't do anything. Keep things as they are, ignore the multicollinearity and use all the input features to create the model. This can be acceptable when only the predictions matter, because multicollinearity mainly makes the individual coefficients hard to interpret.

* Solution 2 : Compare the P values of Age and YearsOfExperience. P value of Age (0.165) > P value of YearsOfExperience (0.014), so drop the 'Age' feature. This should not affect the model much, as the correlation between the two features is about 0.98. The model can then be trained using the feature 'YearsOfExperience' alone.
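
Solution 2 can be checked by comparing the fit before and after dropping the feature. The sketch below uses synthetic data with hypothetical numbers (not the Salary dataset) and a hypothetical helper `r_squared`; on the real data the same comparison can be made by refitting `sm.OLS` without the 'Age' column.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit; X must contain the constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 100

# Hypothetical data: age is almost a shifted copy of experience
experience = rng.uniform(1, 12, size=n)
age = experience + 22 + rng.normal(0, 0.5, size=n)
salary = 25000 + 9000 * experience + rng.normal(0, 5000, size=n)

ones = np.ones(n)
r2_full = r_squared(np.column_stack([ones, experience, age]), salary)
r2_reduced = r_squared(np.column_stack([ones, experience]), salary)

print(f"R-squared with experience and age: {r2_full:.4f}")
print(f"R-squared with experience only   : {r2_reduced:.4f}")
```

If the two R-squared values are nearly identical, the dropped feature was adding almost no information beyond what the remaining feature already provides.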