### This is my first attempt at analysing and publishing an application of an ML algorithm. The analysis uses simple linear regression to predict salary (the label) from an employee's years of experience (the feature). It is pitched at beginner level.

#### Importing Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Reading the dataset

In [None]:
df = pd.read_csv("../input/salary-data-simple-linear-regression/Salary_Data.csv")

#### Data Exploration

In [None]:
# Describe the dataset
df.describe()

##### Inferences from the above output
###### The count is 30 for every column, matching the 30 records, so there are no NaN values in any of the fields.
###### The mean row gives the average value of each column.
###### The average salary is 76003.00 and the standard deviation of salary is 27414.429, so salaries in this dataset typically lie about 27414 units above or below the mean; an individual employee can be closer to the mean or further from it.
###### Similarly, the min and max rows give the minimum and maximum values of salary and experience.
###### The 25%, 50% and 75% rows are the quartiles; comparing them with the min, max and mean gives a rough picture of how each column is distributed and whether it is skewed.
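
As a small sketch of what the standard deviation does and does not say, the snippet below measures how far each record sits from the mean. The `salaries` array is made up for illustration and is not taken from Salary_Data.csv; `ddof=1` is used because that is the sample standard deviation pandas reports in `describe()`.

```python
import numpy as np

# Hypothetical salaries, made up for illustration (not from Salary_Data.csv).
salaries = np.array([40000, 52000, 61000, 75000, 88000, 99000, 117000], dtype=float)

mean = salaries.mean()
std = salaries.std(ddof=1)  # sample standard deviation, as pandas describe() reports

# Distance of each record from the mean. Records are not all exactly one
# standard deviation away: some are much closer, some are further.
deviations = np.abs(salaries - mean)
within_one_std = (deviations <= std).mean()

print(f"mean = {mean:.1f}, std = {std:.1f}")
print(f"closest record is {deviations.min():.0f} from the mean, furthest is {deviations.max():.0f}")
print(f"share of records within one std of the mean: {within_one_std:.2f}")
```

The standard deviation summarises the typical spread; it is not the distance of any single record.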


In [None]:
# Sample Data
df.head()

#### Select important Features

In [None]:
df_feature = df[["YearsExperience","Salary"]]

In [None]:
df_feature.head()

#### Identifying missing values

In [None]:
# Identify if any null values
df_feature.isnull().sum()

In [None]:
# Identify if any "?" values
df_feature.isin(["?"]).sum()

In [None]:
# Identify if any "? " values
df_feature.isin(["? "]).sum()

#### There are no missing values
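
The three checks above can be folded into one helper. This is only a sketch: `count_missing`, its `placeholders` parameter and the `toy` frame are hypothetical names and values introduced here, and stripping whitespace is an assumption about how placeholders such as "? " should be treated.

```python
import numpy as np
import pandas as pd

# Hypothetical frame reusing the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "YearsExperience": ["1.1", "?", " 3.2", "? "],
    "Salary": [39343.0, np.nan, 60150.0, 54445.0],
})

def count_missing(frame, placeholders=("?",)):
    """Count NaN cells plus cells equal to a placeholder string (after stripping whitespace)."""
    counts = {}
    for col in frame.columns:
        n_null = int(frame[col].isna().sum())
        # Strip only string cells so that "?" and "? " are treated alike.
        cleaned = frame[col].map(lambda v: v.strip() if isinstance(v, str) else v)
        n_placeholder = int(cleaned.isin(list(placeholders)).sum())
        counts[col] = n_null + n_placeholder
    return pd.Series(counts)

missing = count_missing(toy)
print(missing)
```

On the real dataset the equivalent call would be `count_missing(df_feature)`, which should report zero for both columns if the checks above are correct.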

#### We can plot these features as histograms

In [None]:
viz = df_feature[["Salary","YearsExperience"]]
viz.hist()
plt.show()

#### We can plot YearsExperience against Salary to check for a linear relationship

In [None]:
plt.scatter(df_feature.Salary, df_feature.YearsExperience,  color='red')
plt.xlabel("Salary")
plt.ylabel("YearsExperience")
plt.show()

#### The relationship between Salary and years of experience seems to be linear.

#### Finding the correlation between salary and years of experience

In [None]:
df_feature[['YearsExperience', 'Salary']].corr()['Salary']

#### The correlation coefficient between YearsExperience and Salary is 0.978, which is close to 1, so years of experience has a strong positive linear relationship with the target.
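
For reference, the Pearson coefficient that `corr()` reports by default can be reproduced by hand. The sketch below uses made-up `years` and `salary` arrays (not the real dataset) and checks the manual formula against `np.corrcoef`.

```python
import numpy as np

# Hypothetical values, made up for illustration.
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 47000.0, 61000.0, 68000.0, 80000.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = years - years.mean()
dy = salary - salary.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's built-in correlation matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(years, salary)[0, 1]

print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")
```

A value near 1 says the points lie close to an upward-sloping straight line; it does not by itself say how steep that line is.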

#### Train and Test Split

In [None]:
# import the libraries
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [None]:
# Creating dataframe with independent variables only
X = df_feature[['YearsExperience']]

In [None]:
# Creating dataframe with dependent (or) response variable only
Y = df_feature[['Salary']]

In [None]:
# Split the data into 80% train and 20% test. Fixing random_state to a constant ensures the same
# train and test records are selected on every execution
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 42)
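
A quick way to see what `test_size` and `random_state` do is to split a placeholder array of the same length twice. In this sketch `X_demo` and `y_demo` are hypothetical stand-ins (row indices only) for the 30 records.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 placeholder rows standing in for the 30 records; the values are just row indices.
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("train rows:", len(X_tr), "test rows:", len(X_te))

# With the same random_state, both calls select exactly the same rows.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("same split on both calls:", same_split)
```

With only 6 test rows, any metric computed on the test set will vary noticeably from one `random_state` to another.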

In [None]:
lreg = linear_model.LinearRegression()

In [None]:
lreg.fit(X_train, Y_train)

In [None]:
# Print the fitted slope (coefficient) and intercept
print('Coefficients: ', lreg.coef_)
print('Intercept: ', lreg.intercept_)

In [None]:
type(X_train)

#### The coefficient of YearsExperience (the slope) is 9423.815
#### The intercept (or constant) of the line equation is 25321.583
#### i.e. Y^ = mX + C
#### i.e. Y^ = 9423.8 * X + 25321.6 is the simple linear equation of the best-fit line, the line with the minimum MSE (Mean Squared Error) on the training data
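
For a single feature, the slope and intercept that minimise the MSE have a closed form. The sketch below applies it to made-up `x` and `y` arrays (not the real dataset), compares the result with `np.polyfit`, and checks that nudging either parameter makes the MSE worse. The `mse` helper is a hypothetical name used only here.

```python
import numpy as np

# Hypothetical values, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40000.0, 47000.0, 61000.0, 68000.0, 80000.0])

# Closed-form ordinary least squares for one feature:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   c = mean(y) - m * mean(x)
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
c = y.mean() - m * x.mean()

# np.polyfit with degree 1 solves the same least-squares problem.
m_ref, c_ref = np.polyfit(x, y, 1)

def mse(slope, intercept):
    """Mean squared error of the line slope * x + intercept on the toy data."""
    return np.mean((y - (slope * x + intercept)) ** 2)

print(f"m = {m:.1f}, c = {c:.1f}")
print("polyfit agrees:", np.isclose(m, m_ref) and np.isclose(c, c_ref))
print("nudged lines are worse:", mse(m, c) < mse(m + 100, c) and mse(m, c) < mse(m, c + 100))
```

`LinearRegression` solves the same least-squares problem, so on the same data it should arrive at the same line.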


#### We can plot the derived best-fit line against the actual training data points

In [None]:
plt.scatter(X_train, Y_train, color='blue')
# Draw the fitted line using the model's own predictions for the training inputs
plt.plot(X_train, lreg.predict(X_train), 'r-')
plt.xlabel("Years Of Experience")
plt.ylabel("Salary")
plt.show()

#### Evaluation of the model

#### Predicting the target values for the test data - X_test

In [None]:
from sklearn.metrics import r2_score
# Predict salaries for the held-out test inputs
Y_predict = lreg.predict(X_test)

#### Plotting Y_test and Y_predict values

In [None]:
# Actual test salaries in blue, predicted salaries in red
plt.scatter(X_test, Y_test, color='blue')
plt.plot(X_test, Y_predict, linestyle='dotted', color='red')
plt.scatter(X_test, Y_predict, color='red')
plt.xlabel("Years Of Experience")
plt.ylabel("Salary")
plt.show()

#### Regression metrics

In [None]:
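# Compare actual and predicted values as flat NumPy arrays so each metric is a plain scalar
y_true = Y_test.to_numpy().ravel()
y_pred = np.asarray(Y_predict).ravel()
MAE = round(np.mean(np.absolute(y_true - y_pred)))
MSE = round(np.mean((y_true - y_pred)**2))
print("Mean Absolute Error = {}".format(MAE))
print("Mean Square Error = {}".format(MSE))
# r2_score expects the true values first and the predictions second
print("R Square = ", r2_score(y_true, y_pred))

#### The R2 value is close to 0.9, meaning the model explains roughly 90% of the variance in salary on the test set, so the regression line fits the data well. It also indicates that years of experience is a strong predictor of salary in this dataset. With only 6 test records, the estimate should be treated as approximate.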
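
R² is not symmetric in its arguments: scikit-learn's `r2_score` takes the true values first and the predictions second, and swapping them changes the result. The sketch below computes R² by hand on made-up arrays to show this. `r_squared`, `y_true` and `y_pred` are hypothetical names and values, not the notebook's test data.

```python
import numpy as np

# Hypothetical actual and predicted salaries, made up for illustration.
y_true = np.array([40000.0, 55000.0, 63000.0, 81000.0, 98000.0, 112000.0])
y_pred = np.array([43000.0, 52000.0, 66000.0, 78000.0, 95000.0, 118000.0])

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of `actual`."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2 = r_squared(y_true, y_pred)
r2_swapped = r_squared(y_pred, y_true)  # arguments in the wrong order

# MAE and RMSE are symmetric, so argument order does not matter for them.
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R^2 (true, pred) = {r2:.4f}")
print(f"R^2 (pred, true) = {r2_swapped:.4f}")
print(f"MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```

RMSE is included because it is in the same units as salary, which makes it easier to read than the raw MSE.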