## **Linear Regression Model**

The objective of this problem is to build a model that predicts an employee's salary from their years of experience. Such a model could be useful when a new person joins the firm and the Human Resources staff has no salary policy to rely on.

The solution is organized into the following sections:
* **Section 1** - Data analysis and insights
* **Section 2** - Getting the training set and the test set
* **Section 3** - Getting the Model
* **Section 4** - Conclusion

### **Section 1 - Data analysis and insights**

In [None]:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
#Getting the data
df = pd.read_csv("../input/salary-data-simple-linear-regression/Salary_Data.csv")

In [None]:
#showing the dataframe information
df.info()

As the dataframe information shows, there are no null values, so no processing is needed to handle missing data.
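
The same check can also be made explicitly rather than read off the `df.info()` output. The sketch below is only an illustration: the `sample` dataframe and its values are made up (with one missing salary inserted on purpose) and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data (values are made up,
# one missing salary is inserted on purpose)
sample = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9],
    "Salary": [39000.0, 43500.0, np.nan, 61000.0, 81000.0],
})

# Count the missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Drop incomplete rows only when at least one was found
if null_counts.sum() > 0:
    sample = sample.dropna()
```

On the real dataframe, `df.isnull().sum()` would be expected to report zero for both columns, in which case nothing is dropped.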

In [None]:
#Dataframe first 5 elements
df.head()

In [None]:
#show the main summary statistics of the dataframe
df.describe()

In [None]:
#Checking if there is any outlier
plt.rcParams['figure.figsize'] = [7, 7]
sns.boxplot(data = df.iloc[:, -1], orient = "v", palette = "Set1", whis = 1.5, 
            saturation = 1, width = 0.8)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = "bold")
plt.ylabel("Salary range", fontweight = "bold")
plt.xlabel("Continuous Variable", fontweight = "bold")
df.shape

The boxplot shows no outliers in the salary column, so the next step is to find a suitable model for the data.
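
The boxplot whiskers are set with `whis = 1.5`, which corresponds to the usual 1.5 × IQR rule. A minimal sketch of that rule, computed by hand, is shown below. The `salaries` array is hypothetical and only stands in for the salary column.

```python
import numpy as np

# Hypothetical salary values, not taken from the dataset
salaries = np.array([39000, 43500, 56000, 61000, 81000,
                     98000, 112000, 122000], dtype=float)

# First and third quartiles and the interquartile range
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salaries[(salaries < lower) | (salaries > upper)]

print("Bounds:", lower, upper)
print("Outliers:", outliers)
```

An empty `outliers` array is the numeric counterpart of a boxplot with no points drawn beyond the whiskers.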

In [None]:
#Showing the distribution (histplot replaces the deprecated distplot)
plt.rcParams['figure.figsize'] = [7, 7]
sns.histplot(df['YearsExperience'], kde = True, stat = "density")

In [None]:
#Showing the values and how they are scattered
sns.scatterplot(x = "YearsExperience", y = "Salary", data = df)

In [None]:
#Show the correlation matrix
#the figure size is set first so that it applies to the figure created below
plt.rcParams['figure.figsize'] = [10, 10]
sns.heatmap(df.corr(), annot = True)
plt.title("Correlation Matrix")

There is a strong linear relationship, and therefore a high correlation, between the dependent variable (Salary) and the independent variable (YearsExperience). Thus, linear regression will be the model used to predict future salaries from the years of experience.
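
For reference, the value shown in the heatmap is the Pearson correlation coefficient, which is what `df.corr()` computes by default. The sketch below computes it by hand and compares it with `np.corrcoef`. The `years` and `salary` arrays are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical values, not taken from the dataset
years = np.array([1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 10.3])
salary = np.array([39000, 43500, 56000, 61000, 81000,
                   98000, 112000, 122000], dtype=float)

# Deviations from the mean of each variable
dx = years - years.mean()
dy = salary - salary.mean()

# Pearson r: covariance divided by the product of the standard deviations
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same coefficient from NumPy
r_numpy = np.corrcoef(years, salary)[0, 1]

print(r_manual, r_numpy)
```

A coefficient close to 1 indicates that the points lie close to an increasing straight line, which is the condition under which a simple linear regression is a reasonable choice.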

### **Section 2 - Getting the training set and the test set**

In [None]:
#Splitting the dataset into the dependent and independent variables
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [None]:
#splitting the data into the training set (80%) and the test set (20%)
#no random_state is set, so the split changes on every run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
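
Since the split is random, the scores reported further down depend on which rows end up in the test set. One possible way to make the results repeatable is to pass `random_state` to `train_test_split`. The sketch below uses made-up arrays (`x_demo`, `y_demo`) and an arbitrary seed of 0; both are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 made-up rows with a single feature
x_demo = np.arange(30, dtype=float).reshape(-1, 1)
y_demo = 25000 + 9500 * x_demo.ravel()

# The same seed gives the same split every time (0 is an arbitrary choice)
split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

# Compare the four returned arrays pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
print("Train shape:", split_a[0].shape, "Test shape:", split_a[1].shape)
```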

### **Section 3 - Getting the Model**

In [None]:
regressor = LinearRegression()
regressor.fit(x_train, y_train)

In [None]:
#getting the R2 score of the regression on the training set
regressor.score(x_train, y_train)

An R2 score of 0.951012 on the training set is excellent for a linear regression. Because the split is random, the exact value changes from run to run.

In [None]:
#Getting the coefficients of the linear regression
print("Salary = " + str(regressor.intercept_) + " + YearsExperience*" + 
     str(regressor.coef_[0]))
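
With a single feature, the intercept and slope printed above have a closed-form least-squares solution: the slope is the covariance of the two variables divided by the variance of the feature, and the intercept makes the line pass through the point of means. The sketch below checks this against `LinearRegression` on made-up data; the `years` and `salary` arrays are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical values, not taken from the dataset
years = np.array([1.1, 2.0, 3.2, 4.5, 5.9, 7.1, 8.7, 10.3])
salary = np.array([39000, 43500, 56000, 61000, 81000,
                   98000, 112000, 122000], dtype=float)

# Closed-form ordinary least squares for one feature
dx = years - years.mean()
slope = np.sum(dx * (salary - salary.mean())) / np.sum(dx ** 2)
intercept = salary.mean() - slope * years.mean()

# The same fit with scikit-learn (the feature must be a 2D array)
model = LinearRegression().fit(years.reshape(-1, 1), salary)

print("Manual:  Salary = " + str(intercept) + " + YearsExperience*" + str(slope))
print("sklearn: Salary = " + str(model.intercept_) + " + YearsExperience*" + str(model.coef_[0]))
```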

In [None]:
#Visualizing the training set results
plt.scatter(x_train, y_train, color="purple")
plt.plot(x_train, regressor.predict(x_train), color = "green")
plt.title("Salary vs Experience (Training set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

In [None]:
#Visualizing the test set results
plt.scatter(x_test, y_test, color = "purple")
#the regression line is the one fitted on the training set, so x_train is used to draw it
plt.plot(x_train, regressor.predict(x_train), color = "green")
plt.title("Salary vs Experience (Test Set)")
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.show()

In [None]:
#Show some relevant metrics of the linear regression on the test set
y_pred = regressor.predict(x_test)
print("R2 value is " + str(r2_score(y_test, y_pred)))
print("The Mean Squared Error is " + str(mean_squared_error(y_test, y_pred)))
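
To make the two metrics concrete, the sketch below computes them by hand and compares them with the scikit-learn functions. The `y_true` and `y_hat` arrays are made-up values, not results from this model. The root of the MSE is also shown, since it is expressed in the same unit as the salary and is easier to interpret.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted salaries
y_true = np.array([40000, 55000, 63000, 90000, 110000], dtype=float)
y_hat = np.array([42000, 52000, 66000, 88000, 113000], dtype=float)

# Mean Squared Error: average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Root Mean Squared Error, in the same unit as the salary
rmse = np.sqrt(mse_manual)

print("MSE:", mse_manual, mean_squared_error(y_true, y_hat))
print("R2:", r2_manual, r2_score(y_true, y_hat))
print("RMSE:", rmse)
```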

### **Section 4 - Conclusion**

Linear regression was a suitable machine learning tool for this problem because of the strong linear relationship and correlation between the dependent and independent variables.
The R2 value on the test set was 0.9662, meaning the model explained about 96.62% of the variance in the test salaries, which is adequate for general purposes.
In the case described in the introduction, this model would help the Human Resources staff predict the salary of a new employee based on their years of experience.
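
As an illustration of that use case, the sketch below fits a model and predicts the salary for one new employee. The training data are made up and perfectly linear, and the 5 years of experience is an arbitrary example. With the notebook's own `regressor`, the equivalent call would be `regressor.predict([[5.0]])`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, perfectly linear training data (not the real dataset)
years = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0]).reshape(-1, 1)
salary = 25000 + 9500 * years.ravel()

demo_regressor = LinearRegression().fit(years, salary)

# predict expects a 2D array: one row per employee, one column per feature
new_employee = np.array([[5.0]])
predicted = demo_regressor.predict(new_employee)

print("Predicted salary: " + str(round(predicted[0], 2)))
```

Predictions are most trustworthy for experience values inside the range covered by the training data.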