# Linear Regression Model

## Scaling and transformations
Before building the models, I'm going to do the X/y split and then scale and transform the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../data/cleaned/6.jobs_in_data.csv')
df.head()

| | work_year | job_title | job_category | employee_residence | experience_level | employment_type | work_setting | company_location | company_size | salary_in_euros | cost_of_living | purchasing_power | job_field |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | Data DevOps Engineer | Data Engineering | Germany | 2 | 4 | 1 | Germany | L | 87411 | 127.47 | 685.74 | Data Engineering |
| 1 | 2023 | Data Architect | Data Architecture and Modeling | United States | 3 | 4 | 3 | United States | M | 171120 | 143.34 | 1193.8 | Data Engineering |
| 2 | 2023 | Data Architect | Data Architecture and Modeling | United States | 3 | 4 | 3 | United States | M | 75256 | 143.34 | 525.02 | Data Engineering |
| 3 | 2023 | Data Scientist | Data Science and Research | United States | 3 | 4 | 3 | United States | M | 195040 | 143.34 | 1360.68 | Data Science |
| 4 | 2023 | Data Scientist | Data Science and Research | United States | 3 | 4 | 3 | United States | M | 85836 | 143.34 | 598.83 | Data Science |


## X/y Split
The target will be 'salary_in_euros'. I also want to drop the columns 'job_title' and 'job_category': they are redundant for the model because I already added the column 'job_field' with the categories that I want to work with.

(Edit) After some trial and error, I decided to also drop the columns 'employee_residence' and 'company_location', since their large number of unique values was hurting the performance of the model and causing overfitting. I also dropped 'purchasing_power' because it is a feature that I derived from the target column for EDA purposes, so it makes no sense to use it to predict the target value.
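
The reason for dropping 'purchasing_power' can be checked on the five rows shown by `df.head()` above. The sketch below assumes that 'purchasing_power' was computed as 'salary_in_euros' divided by 'cost_of_living'; the notebook does not show that step, so treat the formula as an assumption that the numbers happen to support.

```python
import pandas as pd

# The five rows shown by df.head(), retyped so the check is self-contained.
sample = pd.DataFrame({
    "salary_in_euros": [87411, 171120, 75256, 195040, 85836],
    "cost_of_living": [127.47, 143.34, 143.34, 143.34, 143.34],
    "purchasing_power": [685.74, 1193.8, 525.02, 1360.68, 598.83],
})

# Assumption: purchasing_power = salary_in_euros / cost_of_living.
# If that holds, multiplying back should recover the target.
reconstructed = sample["purchasing_power"] * sample["cost_of_living"]

# Relative error between the reconstruction and the real target.
rel_error = (reconstructed - sample["salary_in_euros"]).abs() / sample["salary_in_euros"]
print(rel_error.max())
```

If the relative error is tiny, a model that sees both 'purchasing_power' and 'cost_of_living' can effectively read the target off the features, which is why the column has to go.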

In [3]:
# `columns=` already selects the axis, so `axis=1` is not needed
X = df.drop(columns=['salary_in_euros', 'job_title', 'job_category', 'purchasing_power', 'employee_residence', 'company_location'])
y = df['salary_in_euros']

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(4264, 6)

(1066, 6)

(4264,)

(1066,)
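
The shapes above follow from `test_size=0.2` applied to the 5330 rows implied by the two outputs (4264 + 1066). A small sketch with placeholder arrays, not the real data, shows the same arithmetic and that a fixed `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4264 + 1066  # total rows implied by the shapes above
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature matrix
y_demo = np.arange(n_rows)                 # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Running the split again with the same seed gives the same rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))
```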

## Dividing X into numerical and categorical

In [6]:
X_train_num = X_train.select_dtypes(np.number)
X_train_cat = X_train.select_dtypes(object)
X_test_num = X_test.select_dtypes(np.number)
X_test_cat = X_test.select_dtypes(object)

In [7]:
X_train_num.head()

| | work_year | experience_level | employment_type | work_setting | cost_of_living |
|---|---|---|---|---|---|
| 2804 | 2023 | 3 | 4 | 3 | 143.34 |
| 3858 | 2023 | 3 | 4 | 3 | 143.34 |
| 511 | 2023 | 2 | 4 | 3 | 143.34 |
| 62 | 2023 | 3 | 4 | 3 | 143.34 |
| 3034 | 2023 | 3 | 4 | 2 | 143.34 |


In [8]:
X_test_num.head()

| | work_year | experience_level | employment_type | work_setting | cost_of_living |
|---|---|---|---|---|---|
| 1323 | 2023 | 3 | 4 | 3 | 143.34 |
| 1839 | 2023 | 3 | 4 | 3 | 143.34 |
| 798 | 2023 | 3 | 4 | 2 | 143.34 |
| 3856 | 2023 | 4 | 4 | 3 | 143.34 |
| 4553 | 2022 | 3 | 4 | 3 | 106.46 |


In [9]:
X_train_num.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| work_year | 4264.0 | 2022.685038 | 0.603377 | 2020.0 | 2022.0 | 2023.0 | 2023.0 | 2023.0 |
| experience_level | 4264.0 | 2.656895 | 0.678589 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| employment_type | 4264.0 | 3.984522 | 0.183138 | 1.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| work_setting | 4264.0 | 2.505629 | 0.565209 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| cost_of_living | 4264.0 | 139.699977 | 15.292912 | 27.37 | 143.34 | 143.34 | 143.34 | 197.89 |


## Scaling numerical features
Since the numerical features have very different ranges, I'm going to use the StandardScaler. The scaler is fitted on the training set only and then applied to both the training and test sets.

In [11]:
# from sklearn.preprocessing import MinMaxScaler
# import pickle

# scaler = MinMaxScaler()
# scaler.fit(X_train_num)

# path = "../ml/scalers/"
# scaler_file_name = "MinMaxScaler.pkl"

# with open(path + scaler_file_name, "wb") as file:
#     pickle.dump(scaler, file)

# X_train_num_transformed = scaler.transform(X_train_num)
# X_test_num_transformed = scaler.transform(X_test_num)

In [12]:
from sklearn.preprocessing import StandardScaler
import pickle

scaler = StandardScaler()
scaler.fit(X_train_num)

path = "../ml/scalers/"
scaler_file_name = "standard_scaler.pkl"

with open(path + scaler_file_name, "wb") as file:
    pickle.dump(scaler, file)

X_train_num_transformed = scaler.transform(X_train_num)
X_test_num_transformed = scaler.transform(X_test_num)

In [13]:
X_train_num_transformed_df = pd.DataFrame(X_train_num_transformed, columns=X_train_num.columns, index=X_train_num.index)
X_test_num_transformed_df = pd.DataFrame(X_test_num_transformed, columns=X_test_num.columns, index=X_test_num.index)

In [14]:
X_train_num_transformed_df.head()

| | work_year | experience_level | employment_type | work_setting | cost_of_living |
|---|---|---|---|---|---|
| 2804 | 0.52206 | 0.505675 | 0.084528 | 0.874772 | 0.238048 |
| 3858 | 0.52206 | 0.505675 | 0.084528 | 0.874772 | 0.238048 |
| 511 | 0.52206 | -0.968145 | 0.084528 | 0.874772 | 0.238048 |
| 62 | 0.52206 | 0.505675 | 0.084528 | 0.874772 | 0.238048 |
| 3034 | 0.52206 | 0.505675 | 0.084528 | -0.894691 | 0.238048 |


In [15]:
X_test_num_transformed_df.head()

| | work_year | experience_level | employment_type | work_setting | cost_of_living |
|---|---|---|---|---|---|
| 1323 | 0.52206 | 0.505675 | 0.084528 | 0.874772 | 0.238048 |
| 1839 | 0.52206 | 0.505675 | 0.084528 | 0.874772 | 0.238048 |
| 798 | 0.52206 | 0.505675 | 0.084528 | -0.894691 | 0.238048 |
| 3856 | 0.52206 | 1.979495 | 0.084528 | 0.874772 | 0.238048 |
| 4553 | -1.135472 | 0.505675 | 0.084528 | 0.874772 | -2.173809 |
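
The scaled values can be checked by hand. StandardScaler computes z = (x - mean) / std using the population standard deviation (ddof=0), whereas `describe()` reports the sample standard deviation (ddof=1). The sketch below first confirms the formula on a toy column, then reproduces the scaled value for `work_year = 2023` from the `describe()` statistics shown earlier; the toy array and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Part 1: StandardScaler matches (x - mean) / std with ddof=0 on toy data.
values = np.array([[2020.0], [2021.0], [2022.0], [2023.0], [2023.0]])
scaled = StandardScaler().fit_transform(values)
manual = (values - values.mean(axis=0)) / values.std(axis=0, ddof=0)
print(np.allclose(scaled, manual))

# Part 2: reproduce one notebook value from the describe() output for work_year.
n = 4264                 # count
mean = 2022.685038       # mean
sample_std = 0.603377    # std as reported by describe() (ddof=1)

# Convert the sample std to the population std that StandardScaler uses.
population_std = sample_std * np.sqrt((n - 1) / n)
z_2023 = (2023 - mean) / population_std
print(round(z_2023, 5))
```

The second printed value should agree with the `work_year` entries in the scaled tables to the precision shown there.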


## Encoding categorical features

In [16]:
X_train_cat.head()

| | job_field |
|---|---|
| 2804 | Data Engineering |
| 3858 | Data Science |
| 511 | Data Analysis |
| 62 | Data Science |
| 3034 | Data Science |


In [17]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train_cat)

path = "../ml/encoders/"
encoder_file_name = "one_hot_encoder.pkl"

with open(path + encoder_file_name, "wb") as file:
    pickle.dump(encoder, file)

X_train_cat_encoded = encoder.transform(X_train_cat).toarray()
X_test_cat_encoded = encoder.transform(X_test_cat).toarray()

In [18]:
encoded_feature_names = encoder.get_feature_names_out(X_train_cat.columns)

X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded, columns=encoded_feature_names, index=X_train_cat.index)
X_test_cat_encoded_df = pd.DataFrame(X_test_cat_encoded, columns=encoded_feature_names, index=X_test_cat.index)

In [19]:
X_train_cat_encoded_df.head()

| | job_field_Data Analysis | job_field_Data Engineering | job_field_Data Science | job_field_Other |
|---|---|---|---|---|
| 2804 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3858 | 0.0 | 0.0 | 1.0 | 0.0 |
| 511 | 1.0 | 0.0 | 0.0 | 0.0 |
| 62 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3034 | 0.0 | 0.0 | 1.0 | 0.0 |


In [20]:
X_test_cat_encoded_df.head()

| | job_field_Data Analysis | job_field_Data Engineering | job_field_Data Science | job_field_Other |
|---|---|---|---|---|
| 1323 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1839 | 0.0 | 0.0 | 1.0 | 0.0 |
| 798 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3856 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4553 | 0.0 | 1.0 | 0.0 | 0.0 |


In [21]:
X_train_concat = pd.concat([X_train_num_transformed_df, X_train_cat_encoded_df], axis=1)
X_train_concat

| | work_year | experience_level | employment_type | work_setting | cost_of_living | job_field_Data Analysis | job_field_Data Engineering | job_field_Data Science | job_field_Other |
|---|---|---|---|---|---|---|---|---|---|
| 2804 | 0.522060 | 0.505675 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3858 | 0.522060 | 0.505675 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 511 | 0.522060 | -0.968145 | 0.084528 | 0.874772 | 0.238048 | 1.0 | 0.0 | 0.0 | 0.0 |
| 62 | 0.522060 | 0.505675 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3034 | 0.522060 | 0.505675 | 0.084528 | -0.894691 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3092 | 0.522060 | 0.505675 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3772 | 0.522060 | 0.505675 | 0.084528 | -0.894691 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 5191 | -2.793004 | 0.505675 | 0.084528 | -0.894691 | -2.299372 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5226 | -2.793004 | -2.441964 | 0.084528 | -2.664155 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |


In [22]:
X_test_concat = pd.concat([X_test_num_transformed_df, X_test_cat_encoded_df], axis=1)
X_test_concat

| | work_year | experience_level | employment_type | work_setting | cost_of_living | job_field_Data Analysis | job_field_Data Engineering | job_field_Data Science | job_field_Other |
|---|---|---|---|---|---|---|---|---|---|
| 1323 | 0.522060 | 0.505675 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1839 | 0.522060 | 0.505675 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 798 | 0.522060 | 0.505675 | 0.084528 | -0.894691 | 0.238048 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3856 | 0.522060 | 1.979495 | 0.084528 | 0.874772 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4553 | -1.135472 | 0.505675 | 0.084528 | 0.874772 | -2.173809 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2841 | 0.522060 | -0.968145 | 0.084528 | -0.894691 | -5.699385 | 0.0 | 0.0 | 1.0 | 0.0 |
| 5208 | -4.450536 | -0.968145 | 0.084528 | 0.874772 | -1.899794 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1965 | 0.522060 | 0.505675 | 0.084528 | -0.894691 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4538 | -1.135472 | -0.968145 | 0.084528 | -0.894691 | 0.238048 | 0.0 | 0.0 | 1.0 | 0.0 |
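
The split, scale, encode and concatenate steps above could also be bundled into one object. This is not what the notebook does; it is an alternative sketch using scikit-learn's `ColumnTransformer` and `Pipeline`, shown on a few made-up rows shaped like the notebook's features. All names and values in it are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like the notebook's features (values are illustrative only).
X_toy = pd.DataFrame({
    "work_year": [2022, 2023, 2023, 2021, 2023, 2022],
    "experience_level": [2, 3, 3, 1, 4, 2],
    "cost_of_living": [127.47, 143.34, 143.34, 106.46, 143.34, 127.47],
    "job_field": ["Data Engineering", "Data Science", "Data Analysis",
                  "Other", "Data Science", "Data Engineering"],
})
y_toy = pd.Series([87000, 171000, 95000, 60000, 195000, 90000])

numeric_cols = ["work_year", "experience_level", "cost_of_living"]
categorical_cols = ["job_field"]

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Chain preprocessing and the regressor so they are fitted and saved together.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions.shape)
```

With this layout a single pickle file would hold the scaler, encoder and model, at the cost of losing the intermediate DataFrames that the notebook inspects step by step.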


In [23]:
# Import the linear regression model and fit it on the training set
from sklearn.linear_model import LinearRegression
import os

lr = LinearRegression()

lr.fit(X_train_concat, y_train)

# Create the output directory if it does not exist yet
path = "../ml/models/"
os.makedirs(path, exist_ok=True)

# Save the model on every run, not only when the directory was just created
filename = "LinearRegression.pkl"
with open(path + filename, "wb") as file:
    pickle.dump(lr, file)
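
To reuse a pickled model later, it has to be loaded back with `pickle.load`. The sketch below shows the round trip on a stand-in model fitted on toy data (y = 2x + 1), with a temporary directory standing in for "../ml/models/"; the same pattern would apply to the saved scaler and encoder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on toy data (y = 2x + 1), not the notebook's model.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
fitted = LinearRegression().fit(X_demo, y_demo)

# A temporary directory stands in for "../ml/models/".
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "LinearRegression.pkl")
    with open(model_path, "wb") as file:
        pickle.dump(fitted, file)
    with open(model_path, "rb") as file:
        restored = pickle.load(file)

# The restored model should predict exactly like the original one.
same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.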

In [24]:
y_train_pred = lr.predict(X_train_concat)
y_test_pred  = lr.predict(X_test_concat)

In [25]:
import functions

functions.error_metrics_report(y_train, y_test, y_train_pred, y_test_pred)

| | Metric | Train | Test |
|---|---|---|---|
| 0 | MAE | 40409.9 | 38482.78 |
| 1 | MSE | 2749398532.44 | 2337308565.16 |
| 2 | RMSE | 52434.71 | 48345.72 |
| 3 | R2 | 0.29 | 0.31 |
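
The `functions` module is a local helper whose source is not shown in this notebook. The sketch below is a hypothetical stand-in, named `error_metrics_report_sketch` to make that clear, that would produce a table with the same four metrics; the real implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def error_metrics_report_sketch(y_train, y_test, y_train_pred, y_test_pred):
    """Hypothetical stand-in for functions.error_metrics_report."""

    def metrics(y_true, y_pred):
        mse = mean_squared_error(y_true, y_pred)
        # RMSE is the square root of MSE, so it is back in the target's units.
        return [mean_absolute_error(y_true, y_pred), mse, np.sqrt(mse),
                r2_score(y_true, y_pred)]

    report = pd.DataFrame({
        "Metric": ["MAE", "MSE", "RMSE", "R2"],
        "Train": metrics(y_train, y_train_pred),
        "Test": metrics(y_test, y_test_pred),
    })
    return report.round(2)


# Toy usage: one training prediction is off by 1, the test predictions are exact.
report = error_metrics_report_sketch(
    y_train=[1, 2, 3, 4], y_test=[2, 4],
    y_train_pred=[1, 2, 3, 5], y_test_pred=[2, 4],
)
print(report)
```

MAE and RMSE are in euros, so an RMSE of roughly 48,000 on the test set is easier to interpret against typical salaries than the MSE value.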


### Conclusion:
After some testing with different features and pre-processing steps, I couldn't get an R2 score better than around 0.3 for this model on either the training set (0.29) or the test set (0.31). The similar scores suggest the model is underfitting rather than overfitting: there might not be a sufficient linear correlation between the features and the target to get a good performance using a linear model.