# Regression Training

In [56]:
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Load and split preprocessed data

In [57]:
# Import dataset
data = pd.read_csv('salaries_processed.csv')
data.dtypes

work_year                              float64
experience_level                         int64
salary_in_usd                            int64
remote_ratio                           float64
employment_type_FL                       int64
employment_type_FT                       int64
employment_type_PT                       int64
job_title_Analytics Engineer             int64
job_title_Applied Scientist              int64
job_title_Data Analyst                   int64
job_title_Data Architect                 int64
job_title_Data Engineer                  int64
job_title_Data Science Manager           int64
job_title_Machine Learning Engineer      int64
job_title_Other                          int64
employee_residence_BR                    int64
employee_residence_DE                    int64
employee_residence_ES                    int64
employee_residence_FR                    int64
employee_residence_GB                    int64
employee_residence_IN                    int64
employee_resi

# Dataset Summary
<p>Salaries of data science related jobs</p>
<p>This dataset contains the features shown above, and its target is the salary (standardized to USD) of a data science worker, predicted from the other factors.</p>

In [58]:
# Splitting the data set
from sklearn.model_selection import train_test_split

# Y is the dependent variable salary in usd.
# X contains the independent variables, which are all the other columns
Y = data['salary_in_usd']
X = data.drop('salary_in_usd', axis=1)

# Using an 80-20 train-test split for the training and testing sets
# Test sizes of 0.1 and 0.3 were also tried and found to be less optimal
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

In [59]:
# Check if split was correct
print('Training data:', X_train.shape, Y_train.shape)
print('Testing data:', X_test.shape, Y_test.shape)

Training data: (2953, 23) (2953,)
Testing data: (739, 23) (739,)
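As a quick sanity check on the shapes above, the split sizes can be reproduced by hand. This is a minimal sketch, assuming scikit-learn rounds the test count up and gives the remaining rows to the training set, which is consistent with the printed shapes. The variable names are only for illustration.

```python
import math

# Total rows, taken from the training and testing shapes printed above
n_samples = 2953 + 739
test_size = 0.2

# Assumption: the test count is rounded up, the remaining rows go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Training rows:", n_train)
print("Testing rows:", n_test)
```

The 23 in each shape is the number of feature columns left in X after dropping salary_in_usd.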


## 2. Choose an algorithm

<p> For the regression dataset I will be using linear regression.</p>
<p> Linear regression is a supervised machine learning algorithm used to predict a continuous dependent variable y from one or more independent variables x1, x2, ..., xn. A trained linear regression model takes the form y = m1x1 + m2x2 + ... + mnxn + c, with one coefficient per independent variable and a single constant c (the intercept). The result is the linear equation that best fits the trend in the data provided. </p>
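To make "best fits the data" concrete, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python rather than scikit-learn. The toy data and the names `m`, `c`, and `predict` are hypothetical and chosen only for illustration.

```python
# Toy data (hypothetical): y is exactly 2 * x + 1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least squares estimates for one feature
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
variance = sum((xi - mean_x) ** 2 for xi in x)
m = covariance / variance   # slope (coefficient)
c = mean_y - m * mean_x     # intercept


def predict(value):
    """Apply the fitted line y = m * x + c to a new x value."""
    return m * value + c


print("slope:", m, "intercept:", c)
print("prediction for x = 6:", predict(6.0))
```

With several features the idea is the same: the fit produces one coefficient per feature and a single shared intercept, chosen to minimize the squared error on the training data.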

In [60]:
# Import linear regression and the evaluation metrics (R2, MSE, MAE)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

## 3. Train and test a model

In [67]:
# initialize and train the model
model = LinearRegression()
model.fit(X_train, Y_train)

# Predict the test set
Y_pred = model.predict(X_test)

# show the mean squared error
mse = mean_squared_error(Y_test, Y_pred)
print(f"Mean Squared Error - {mse}")

# show the fitted coefficients and intercept
print("Linear Regression Coefficients - ", model.coef_)
print("Linear Regression Intercept - ", model.intercept_)

Mean Squared Error - 1797182192.5038137
Linear Regression Coefficients -  [  4180.17304544  23339.08615111    256.1555787    3954.49675624
  27219.55702538  25808.97624778   4379.45207963  21265.47471531
 -33314.2968146    -846.85279265  -7982.67938488  29161.193885
  10640.8824428   -1726.46335909 -57950.94614532 -21140.02822326
 -64262.98640991 -50759.07375427 -30488.6868149  -73065.7120044
 -43180.28780165 -72352.96165118  19231.33275277]
Linear Regression Intercept -  68281.21794547673
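The raw coefficient array is easier to read when each value is paired with its feature name. The sketch below does this for the first six features only, since the dtypes listing above is cut off. It assumes the coefficients follow the column order of X, which is the dtypes order with salary_in_usd removed. The `ranked` name is hypothetical.

```python
# First six feature names, in the order shown by data.dtypes (salary_in_usd dropped)
feature_names = [
    "work_year",
    "experience_level",
    "remote_ratio",
    "employment_type_FL",
    "employment_type_FT",
    "employment_type_PT",
]

# First six coefficients, copied from the printed output above
coefficients = [
    4180.17304544,
    23339.08615111,
    256.1555787,
    3954.49675624,
    27219.55702538,
    25808.97624778,
]

# Sort by absolute size so the largest effects come first
ranked = sorted(zip(feature_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    print(f"{name:<20} {coef:>12.2f}")
```

In the notebook itself, zipping `X_train.columns` with `model.coef_` in the same way would cover all 23 features. Note that the sizes are only directly comparable between features on similar scales.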


## 4. Evaluate the model 

In [65]:
# Print R2 Score
r2 = r2_score(Y_test, Y_pred)
print(f"R2 Score - {r2}")
# R2 measures the proportion of the variance in the target that the model explains. A value of 1 is a perfect fit, 0 is no better than always predicting the mean, and negative values mean the model is worse than predicting the mean.


# print rmse
# The square root is taken directly because the squared=False argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print(f"Root Mean Squared Error - {rmse}")
# RMSE is the square root of the mean squared error above. It measures how far the predicted values are from the actual values, in the same units as the target (USD).

# print mae
mae = mean_absolute_error(Y_test, Y_pred)
print(f"Mean Absolute Error - {mae}")
# MAE is another measure of how close the predicted values are to the actual values. Instead of squaring the errors, it averages their absolute values, so large errors are not weighted more heavily.

R2 Score - 0.4557117710982329
Root Mean Squared Error - 42393.185684775024
Mean Absolute Error - 33477.1769252552
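The three metrics above can also be computed by hand, which makes their definitions explicit. This is a small sketch in plain Python on made-up values. The lists `y_true` and `y_pred` are hypothetical and are not taken from the salary data.

```python
import math

# Hypothetical actual and predicted values
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]

n = len(y_true)
errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

# Mean squared error and its square root
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

# Mean absolute error
mae = sum(abs(e) for e in errors) / n

# R2 = 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE - {mse}")
print(f"RMSE - {rmse}")
print(f"MAE - {mae}")
print(f"R2 - {r2}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and the gap between them grows when a few predictions are far off.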


## 5. Summary

<p> In step 1 we imported the dataset, summarized its purpose in predicting the salary of data science related jobs, and then created the training and testing datasets with a split of 80 to 20.</p>

<p> In step 2 we summarized the linear regression model and its use in predicting a continuous dependent variable with a linear equation that has one coefficient per feature, and imported the model itself along with the metrics.</p>

<p> In step 3 the model was trained, predictions were made on the test set, and the mean squared error was displayed.</p>

<p> In step 4 the R2 score, RMSE, and MAE were computed and their uses summarized.</p>

# Evaluation 
<p> 
The linear regression coefficients printed above form part of the linear regression equation y = m1x1 + m2x2 + ... + mnxn + c, where m1, m2, ..., mn are the coefficients and c is the intercept. Each coefficient is multiplied by the value of its feature, and the products are added to the intercept to produce a predicted value.

Our R2 score was 0.4557, telling us that about 45.6% of the variance in salary across the test set is explained by the features we chose to use. The rest of the variance is unexplained.
Our RMSE and MAE were 42393 and 33477 respectively. Given that our dependent variable, salary in USD, was within the six-figure range, this is a significant amount of error.

In short, the model captures less than half of the variance in salary, as shown by our R2 score, and the RMSE and MAE show that its predictions carry a significant amount of error.

There are several options for improving these metrics. A more complex model may help, since linear regression may not be enough to capture the relationship between the features and the dependent variable. Our preprocessing also simplified a large portion of the data, which had many distinct values for job title and location; a different model might have handled these features better. Finally, data collection could be improved to capture more fields that may have a big impact on the dependent variable.

</p>
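One way to put the RMSE in context is to compare it with the error of a baseline that always predicts the mean test salary. Since R2 = 1 - MSE_model / MSE_baseline, the baseline error can be recovered from the two numbers already reported. This is a rough sketch using only those printed values, and the variable names are hypothetical.

```python
import math

# Values copied from the outputs above
mse = 1797182192.5038137
r2 = 0.4557117710982329

# Rearranging R2 = 1 - mse / baseline_mse gives the mean-predictor's MSE
baseline_mse = mse / (1 - r2)

model_rmse = math.sqrt(mse)
baseline_rmse = math.sqrt(baseline_mse)

print(f"Model RMSE - {model_rmse:.0f}")
print(f"Baseline RMSE (always predict the mean) - {baseline_rmse:.0f}")
print(f"Reduction in RMSE - {1 - model_rmse / baseline_rmse:.1%}")
```

The baseline RMSE comes out somewhat above 57,000, so the model reduces the typical error compared with guessing the mean, but by well under half.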

In [68]:
# Export model using pickle
import pickle

with open('salary_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
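Loading the model back mirrors the export step. The sketch below shows the round trip on its own, so a plain dictionary stands in for the fitted LinearRegression object and a temporary path is used instead of salary_model.pkl. Both are assumptions made to keep the example runnable outside the notebook.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (hypothetical); in the notebook this is `model`
stand_in_model = {"intercept": 68281.21794547673, "n_features": 23}

# Temporary location used here in place of 'salary_model.pkl'
path = os.path.join(tempfile.mkdtemp(), "salary_model.pkl")

# Export, as in the cell above
with open(path, "wb") as model_file:
    pickle.dump(stand_in_model, model_file)

# Load it back, opening the file in binary read mode
with open(path, "rb") as model_file:
    loaded_model = pickle.load(model_file)

print("Round trip preserved the object:", loaded_model == stand_in_model)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code. It is also safest to load a saved scikit-learn model with the same library version that created it.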