
<div class="alert alert-info" style="background-color: 	#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Regression Analysis</h2>
</div>


### Table of contents : <br/>
1. [Problem statement]( #1 )
2. [Loading Data]( #2 )
3. [Understand the Data](#3)
4. [Data Preprocessing](#4)
    * Dropping unnecessary columns
    * Missing values
    * Label encoding
    * Feature Selection
5. [Exploratory Data Analysis](#5)
    * Heat map
    * Histograms
    * Scatter plot
6. [Model Building](#6)
    * Splitting data
    * [Linear model](#7)
    * [Polynomial model - degree 2](#8)
    * [Polynomial model - degree 3](#9)
    * [Finding Ideal degree](#10)
7. [Final Model and test](#11)


<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Importing Libraries</h2>
</div>

In [None]:
# Basic libraries required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics 
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

<a id='1'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Convert Business Problem to Data Science Problem</h2>
</div>

We have data about the salaries of various employees in different companies and at different positions. <br/>
The data includes their base pay along with all other benefits. <br/>
Let's see how "overtime pay, other pay and benefits" increase with base pay.

<a id='2'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Load Data</h2>
</div>

In [None]:
# reading the data from csv file to dataframe
data = pd.read_csv('../input/sf-salaries/Salaries.csv')
data.head()

<a id='3'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Understanding Data </h2>
</div>

In [None]:
# info() prints summary of data like dtypes, memory usage
data.info()

In [None]:
# To know how many data points,features we have 
data.shape

In [None]:
# describe() will tell us about statistical information of each Numerical column
data.describe()

<div class="alert alert-info" style="padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'> Inferences:</h3>
</div>

* We can see that we have **13** attributes with **148654** records
* We can observe that the *Notes* attribute doesn't carry any information, so we can drop it


<a id='4'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>4. Data Pre-processing </h2>
</div>

###  4.1 Dropping unnecessary columns
Id and EmployeeName only identify a record and carry no weight for the model.<br/>
So, let's drop them

In [None]:
data.drop(['Id','EmployeeName'],axis=1,inplace=True)

In [None]:
data.head()

### 4.2 Dealing with missing values

In [None]:
data.isnull().sum()

#### 'Notes' 
We can see that the 'Notes' attribute doesn't have any information. Let's drop it

In [None]:
data.drop(['Notes'],inplace=True,axis=1)
data.head()

#### 'Status' 

In [None]:
# value_counts() skips NaN by default, so no explicit null filter is needed
data['Status'].value_counts()

We can observe that very few records have a value for the Status attribute, so imputing the missing values isn't appropriate. <br/>
Therefore let's drop this column too

In [None]:
data.drop(['Status'],axis=1,inplace=True)
data.head()

#### 'Benefits' 

We know that Benefits is simply the difference between **"TotalPayBenefits"** and **"TotalPay"** <br/>
Let's replace NaN values with 0

In [None]:
data['Benefits'] = data['Benefits'].fillna(0)
data.tail()
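The reasoning behind filling with 0 can be checked on a small example. The sketch below uses made-up rows (the `toy` frame and its numbers are assumptions, not values from the dataset) to show the idea: where Benefits is missing and the two totals are equal, the implied benefit is 0.

```python
import pandas as pd

# Toy rows (made-up numbers) that mimic the relevant columns of the dataset.
toy = pd.DataFrame({
    "TotalPay": [100.0, 250.0, 80.0],
    "Benefits": [20.0, None, 5.0],
    "TotalPayBenefits": [120.0, 250.0, 85.0],
})

# Benefits implied by the two totals.
implied = toy["TotalPayBenefits"] - toy["TotalPay"]

# For rows where Benefits is missing, check that the implied value is 0.
missing = toy["Benefits"].isnull()
fill_is_zero = bool((implied[missing] == 0).all())

# After filling with 0, Benefits should agree with the implied difference.
filled = toy["Benefits"].fillna(0)
consistent = bool((filled == implied).all())

print(fill_is_zero, consistent)
```

Running the same comparison on the real data would confirm whether the assumption holds for every row with a missing value.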

#### 'BasePay'
We observe that 605 records are null. Let's look at them

In [None]:
data[data['BasePay'].isnull()]

We can see that even though BasePay isn't provided, that person still receives TotalPay from other components.<br/>
So let's set BasePay to 0 here

In [None]:
data['BasePay'] = data['BasePay'].fillna(0)

#### Columns with "Not Provided" as value

In [None]:
data[data['BasePay']=='Not Provided']

We see that some of the rows don't have the required information at all <br/>
So let's just remove them

In [None]:
data = data[data['BasePay'] != 'Not Provided']
data.shape

<div class="alert alert-info" style="padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'> Inferences:</h3>
</div>

* We dropped 'Notes','Status' columns
* We changed 'Benefits','BasePay' null values to 0
* We dropped rows with unprovided information

In [None]:
data.head()

### 4.3 Encoding Categorical Columns - Label Encoding
Let's observe each column

#### 'JobTitle'

In [None]:
data['JobTitle'].value_counts()

In [None]:
data['JobTitle'] = data['JobTitle'].astype('category').cat.codes
data.head(2)
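To make the encoding step concrete, here is a small sketch on made-up job titles (the sample values are assumptions). It shows that `cat.codes` numbers the categories in sorted order and how the code-to-label mapping can be recovered.

```python
import pandas as pd

# Made-up sample of job titles, only for illustration.
titles = pd.Series(["NURSE", "CLERK", "NURSE", "POLICE OFFICER"])
as_category = titles.astype("category")

# Integer code for each row; categories are sorted alphabetically.
codes = as_category.cat.codes

# Mapping from code back to the original label.
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```

The codes are arbitrary labels: a larger code does not mean a "larger" job title, so they should not be read as an ordered quantity.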

#### "Agency"

In [None]:
data['Agency'].value_counts()

Agency doesn't carry any weight as all values are the same <br/>
Let's drop the column

In [None]:
data.drop(['Agency'],axis=1,inplace=True)

### 4.4 Changing dtype

In [None]:
data.info()

We see that 'BasePay', 'OvertimePay', 'OtherPay' and 'Benefits' have object dtype <br/>
Let's change them to float

In [None]:
data = data.astype('float64')
data.info()
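An alternative to filtering the "Not Provided" rows by hand and then calling `astype` is to let pandas coerce unparseable strings to NaN. This is a sketch on made-up values, not the approach used above; `pd.to_numeric` with `errors="coerce"` is a standard pandas call, while the `raw` frame is an assumption.

```python
import pandas as pd

# Made-up string columns that mimic the object-dtype pay columns.
raw = pd.DataFrame({
    "BasePay": ["1000.5", "Not Provided", "2500"],
    "OvertimePay": ["0", "Not Provided", "120.25"],
})

# Strings that cannot be parsed as numbers become NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Rows without usable information can then be dropped in one step.
clean = numeric.dropna()

print(numeric.dtypes)
print(len(clean))
```

The trade-off is that coercion silently turns any unexpected string into NaN, so it is worth counting the NaN values before dropping them.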

### 4.5 Feature Selection
According to the problem statement, we have to combine 'OvertimePay', 'OtherPay' and 'Benefits' into a single response <br/>

In [None]:
data['Response'] = data['OvertimePay'] + data['OtherPay'] + data['Benefits']

In [None]:
data['Regressor'] = data['BasePay']

Let's drop unnecessary columns

In [None]:
data.drop(['JobTitle','TotalPay','TotalPayBenefits','BasePay','OvertimePay','OtherPay','Benefits'],axis=1,inplace=True)

We were asked to test on the 2014 data. Let's separate it for testing

In [None]:
test = data[data['Year']==2014]
data = data[data['Year'] != 2014]

In [None]:
test.shape

### Final data after preprocessing

In [None]:
data.head()

<a id='5'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>5. Exploratory Data Analysis </h2>
</div>

### 5.1 HeatMap

In [None]:
# To know correlation between attributes
plt.figure(figsize = (10,8))
p=sns.heatmap(data.corr(), annot=True,cmap='RdYlGn',center=0) 
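The heat map displays the matrix returned by `data.corr()`, which uses the Pearson correlation by default. As a small sketch on made-up numbers (the `toy` frame is an assumption), the same coefficient can be computed by hand from the centred columns.

```python
import numpy as np
import pandas as pd

# Made-up, nearly linear data.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.1, 5.9, 8.2]})

# Correlation matrix as used by the heat map.
corr_matrix = toy.corr()

# Pearson correlation by hand: covariance divided by the product of spreads.
x_c = toy["x"] - toy["x"].mean()
y_c = toy["y"] - toy["y"].mean()
manual_r = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

print(corr_matrix.loc["x", "y"], manual_r)
```

Pearson correlation only measures linear association, so a weak value in the heat map does not rule out a curved relationship.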

### 5.2 Histograms

In [None]:
data.hist(figsize=(12,10))

### 5.3 Scatter Plot

In [None]:
plt.scatter(data['Regressor'],data['Response'],alpha=0.7)
plt.xlabel('BasePay')
plt.ylabel('Other Pays')

We have to build a model that fits the data shown in this plot

<a id='6'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>6. Model Building </h2>
</div>

<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'>6.1 Splitting the data </h3>
 </div>

In [None]:
# splitting the data into 2 parts: one to train the model and another to validate the trained model

# 80% is used for training and the remaining 20% for validation
# random_state is fixed (arbitrary value) so that the split is reproducible
X_train, X_val, y_train, y_val = train_test_split(data[['Regressor']], data[['Response']], test_size=0.2, random_state=42)

We will build a basic linear regression model and see how well it fits the data

<a id='7'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'>6.2 Simple Linear regression Analysis</h3></div>

In [None]:
# initializing model
model = LinearRegression()

# fitting data
model.fit(X_train,y_train)

# predict 
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)

# r2
train_R2_LR = r2_score(y_train,pred_train)
test_R2_LR = r2_score(y_val,pred_val)

print('train R2 SCORE:',train_R2_LR)
print('Val R2 SCORE:',test_R2_LR)
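For reference, the R2 score reported by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on made-up numbers (the arrays are assumptions).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left after using the model.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means the model is no better than predicting the mean, and negative values are possible on data the model was not fitted on.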

In [None]:
# let's see how well our model fit into data 
# training data plot
plt.scatter(X_train,y_train, color="red")
plt.plot(X_train, pred_train, color="blue")
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.title("Regression analysis of Basepay vs Other Pays")

In [None]:
# validation data plot
plt.scatter(X_val,y_val, color="red")
plt.plot(X_val, pred_val, color="blue")
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.title("Linear model")

Red coloured data is actual data and blue line is our model <br/>

Let's look at other cases, where we fit polynomial models of degree 2 and 3 to the data

<a id='8'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'>6.3 Simple non-linear regression analysis with degree 2.
</h3></div>

In [None]:
## let's transform data into polynomial form with degree=2
poly_reg=PolynomialFeatures(degree=2)
x_poly=poly_reg.fit_transform(X_train)

# fitting our data into model
model=LinearRegression()
model.fit(x_poly,y_train)

#predictions
pred_train = model.predict(x_poly)
# reuse the transformer fitted on the training data
x_poly_val = poly_reg.transform(X_val)
pred_val = model.predict(x_poly_val)

# r2
train_R2_PR2 = r2_score(y_train,pred_train)
test_R2_PR2 = r2_score(y_val,pred_val)

print('train R2:',train_R2_PR2)
print('test R2:',test_R2_PR2)
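To see what the polynomial transformation produces, here is a small sketch on made-up inputs (the toy arrays are assumptions). With one input feature and degree 2, each row becomes `[1, x, x^2]`. The usual convention is to call `fit_transform` on the training data and plain `transform` on validation or test data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up single-feature inputs.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_val_toy = np.array([[4.0]])

poly = PolynomialFeatures(degree=2)

# Fit on the training data and expand it: columns are 1, x, x^2.
train_features = poly.fit_transform(X_train_toy)

# Reuse the fitted transformer for the validation data.
val_features = poly.transform(X_val_toy)

print(train_features)
print(val_features)
```

For `PolynomialFeatures` the fitted state is only the number of input features, so refitting on validation data gives the same numbers, but using `transform` keeps the habit that matters for transformers that do learn from the data, such as scalers.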

In [None]:
# let's see how well our model fit into data 
# training data plot
plt.scatter(X_train,y_train, color="red")
plt.scatter(X_train, pred_train, color="blue",s=0.5)
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.title("Regression analysis of Basepay vs Other Pays")

In [None]:
# validation data plot
plt.scatter(X_val,y_val, color="red")
plt.scatter(X_val, pred_val, color="blue",s=0.6)
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.title("quadratic model - degree=2")

Red points are the actual data and blue points are our model's predictions <br/>

<a id='9'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'>6.4 Simple non-linear regression analysis with degree 3.
</h3></div>

In [None]:
## let's transform data into polynomial form with degree=3
poly_reg=PolynomialFeatures(degree=3)
x_poly=poly_reg.fit_transform(X_train)

# fitting our data into model
model=LinearRegression()
model.fit(x_poly,y_train)

#predictions
pred_train = model.predict(x_poly)
# reuse the transformer fitted on the training data
x_poly_val = poly_reg.transform(X_val)
pred_val = model.predict(x_poly_val)

# r2
train_R2_PR3 = r2_score(y_train,pred_train)
test_R2_PR3 = r2_score(y_val,pred_val)

print('train R2:',train_R2_PR3)
print('test R2:',test_R2_PR3)

In [None]:
# let's see how well our model fit into data 
# training data plot
plt.scatter(X_train,y_train, color="red")
plt.scatter(X_train, pred_train, color="blue",s=0.5)
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.title("Regression analysis of Basepay vs Other Pays")

In [None]:
# validation data plot
plt.scatter(X_val,y_val, color="red")
plt.scatter(X_val, pred_val, color="blue",s=0.5)
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.title("Cubic - Degree = 3")

<a id='10'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'>6.5 Finding ideal degree </h3></div>

A higher R2 score means the model explains more of the variance in the data <br/>
We have seen the R2 for the linear model and for polynomials of degree 2 and 3, out of which the degree-3 polynomial yields the best results<br/>
Let's try some more polynomial degrees and see which has the highest R2 score

In [None]:
# let's define lists to store R2 scores and degrees
train_R2 = []
test_R2 = []
degree = []

# k ranging from 1 to 9
for k in range(1, 10):
    # storing degree
    degree.append(k)
    
    # initialising model
    poly_reg=PolynomialFeatures(degree=k)
    x_poly=poly_reg.fit_transform(X_train)

    # fitting our data into model
    model=LinearRegression()
    model.fit(x_poly,y_train)
    
    #predictions
    pred_train = model.predict(x_poly)
    # reuse the transformer fitted on the training data
    x_poly_val = poly_reg.transform(X_val)
    pred_val = model.predict(x_poly_val)

    # training data R2
    train_R2.append(r2_score(y_train,pred_train))
    
    #test data R2
    test_R2.append(r2_score(y_val,pred_val))
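The same degree search can be written more compactly with a pipeline. The following is a sketch on synthetic data, not the notebook's dataset: the cubic toy relationship, the `StandardScaler` step (added here because raw pay values raised to high powers become very large) and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a cubic relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(300, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 3 + rng.normal(0, 0.3, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

degrees = list(range(1, 7))
val_scores = []
for k in degrees:
    # Scale, expand to polynomial features, then fit a linear model.
    pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=k), LinearRegression())
    pipe.fit(X_tr, y_tr)
    val_scores.append(r2_score(y_va, pipe.predict(X_va)))

# Degree with the highest validation R2.
best_degree = degrees[int(np.argmax(val_scores))]
print(best_degree, val_scores)
```

Because the pipeline applies every step in `fit` and `predict`, there is no separate transform call to get wrong for the validation data.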

In [None]:
# let's plot training scores , test scores against k values 

plt.figure(figsize=(10,5))
plt.title('Model R2 score vs degree')
plt.xlabel('degree')
plt.ylabel('Model R2 score')
plt.plot(degree, train_R2, color = 'r', label = "training R2")
plt.plot(degree, test_R2, color = 'b', label = 'test R2')
plt.legend(bbox_to_anchor=(1, 1),bbox_transform=plt.gcf().transFigure)

In [None]:
# degree (not list index) with the highest validation R2
degree[test_R2.index(max(test_R2))]

<div class="alert alert-info" style="padding:0px 10px; border-radius:5px;"><h3 style='margin:10px 5px'> Inferences:</h3>
</div>

We see that the polynomial model with degree 3 gives the best R2 score

In [None]:
R = pd.DataFrame(data=[[train_R2[0],test_R2[0]],[train_R2[1],test_R2[1]],[train_R2[2],test_R2[2]]],columns=['Train_R2','Test_R2'],index=['Linear','non-linear degree=2','non-linear degree=3'])

In [None]:
R

<a id='11'></a>
<div class="alert alert-info" style="background-color:#800080; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>7. Final Model and Test </h2>
 </div>

Let's build the final model with degree 3

In [None]:
## let's transform data into polynomial form with degree=3
poly_reg=PolynomialFeatures(degree=3)
x_poly=poly_reg.fit_transform(X_train)

# fitting our data into model
model=LinearRegression()
model.fit(x_poly,y_train)

#predictions
pred_train = model.predict(x_poly)
# reuse the transformer fitted on the training data
x_poly_val = poly_reg.transform(X_val)
pred_val = model.predict(x_poly_val)

# r2
train_R2_PR3 = r2_score(y_train,pred_train)
test_R2_PR3 = r2_score(y_val,pred_val)

print('train R2:',train_R2_PR3)
print('test R2:',test_R2_PR3)

## Test

In [None]:
x_poly_test = poly_reg.transform(test[['Regressor']])
y_test = model.predict(x_poly_test)

In [None]:
# test data plot
plt.scatter(test[['Regressor']],test[['Response']], color="red",label='actual ')
plt.scatter(test[['Regressor']], y_test, color="blue",s=0.5,label='our model')
plt.xlabel("Regressor")
plt.ylabel("Response")
plt.legend()
plt.title("simple non linear regression analysis with degree-3")
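The plot gives a visual impression of the fit on the held-out year. A numeric score can be obtained in the same way as for the validation set. The sketch below illustrates the full hold-out pattern on synthetic data: the `toy` frame, its cubic relationship and all names are assumptions, only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with the same column names as the preprocessed frame.
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "Year": rng.choice([2011, 2012, 2013, 2014], size=n),
    "Regressor": rng.uniform(0, 2, size=n),
})
toy["Response"] = 0.5 + toy["Regressor"] ** 3 + rng.normal(0, 0.1, size=n)

# Hold out one year entirely; train on the rest.
holdout = toy[toy["Year"] == 2014]
train = toy[toy["Year"] != 2014]

# Fit the degree-3 model on the training years only.
poly = PolynomialFeatures(degree=3)
model_toy = LinearRegression().fit(poly.fit_transform(train[["Regressor"]]), train["Response"])

# Score on the held-out year with the already fitted transformer.
pred_holdout = model_toy.predict(poly.transform(holdout[["Regressor"]]))
holdout_r2 = r2_score(holdout["Response"], pred_holdout)
print(holdout_r2)
```

On the real data, the equivalent call would compare `test[['Response']]` with the predictions for the 2014 rows.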

## Thank you