# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Yajur Vashisht

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)

df = pd.read_csv('salaries.csv')

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

1. My dataset is from: https://www.kaggle.com/datasets/infosecjobs/global-salaries-in-cybersecurity-infosec/data
2. I think this is a relevant dataset because many people want to see what their salary could look like, or what a good range would be for them, depending on their experience level and the size of the company they work at.
3. There are a few challenges with finding relevant data. I liked that the data in this dataset is very current (2020-2023): there has been record inflation, so using historical data could skew the results. I also like that salaries are converted to USD, which gives a common unit for comparison.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [3]:
# Clean data (if needed)

df.head()

|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | EX | FT | Information Security Officer | 160000 | USD | 160000 | US | 100 | US | M |
| 1 | 2023 | EX | FT | Information Security Officer | 100000 | USD | 100000 | US | 100 | US | M |
| 2 | 2023 | SE | FT | Security Engineer | 247250 | USD | 247250 | US | 0 | US | M |
| 3 | 2023 | SE | FT | Security Engineer | 160000 | USD | 160000 | US | 0 | US | M |
| 4 | 2023 | SE | FT | Security Engineer | 224250 | USD | 224250 | US | 0 | US | M |


In [4]:
df.dtypes

work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object

In [5]:
null_values_per_column = df.isnull().sum()
print(null_values_per_column)

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64


In [6]:
df = df.drop_duplicates()

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

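# Initialize OneHotEncoder with dense output
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the two categorical features used in this analysis
encoded_data = encoder.fit_transform(df[['experience_level', 'company_size']])
encoded_columns = encoder.get_feature_names_out(['experience_level', 'company_size'])
# Note: encoded_df gets a fresh 0..n-1 index, while df keeps its original labels after
# drop_duplicates(). train_test_split pairs rows by position, so X and y still line up.
encoded_df = pd.DataFrame(encoded_data, columns=encoded_columns)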

# Define features and target
y = df['salary_in_usd']
X = encoded_df

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.head()



|   | experience_level_EN | experience_level_EX | experience_level_MI | experience_level_SE | company_size_L | company_size_M | company_size_S |
|---|---|---|---|---|---|---|---|
| 1173 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 168 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2589 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1011 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2977 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

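1. There were no missing or null values in my dataset. If there had been, I would have either replaced them with 0, filled them with the column mean, or dropped the affected rows entirely, depending on the column and how many rows were affected.

2. I have mixed data: four numerical columns (work_year, salary, salary_in_usd, remote_ratio) and seven categorical columns. salary_in_usd is my target (y), and I needed to one-hot encode the experience_level and company_size columns to use them as features.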
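
The preprocessing described in these answers could also be bundled into a single scikit-learn pipeline. The block below is a minimal sketch, not the code used for the results in this notebook: the `toy` DataFrame is made-up data that only mimics the column names, and the missing `remote_ratio` value is invented to show how mean imputation would behave.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows that mimic the salary columns (NOT the real dataset)
toy = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EX", "SE", "MI"],
    "company_size": ["S", "M", "L", "M", "M", "L"],
    "remote_ratio": [0.0, 50.0, np.nan, 100.0, 0.0, 100.0],  # one invented gap
    "salary_in_usd": [60000, 90000, 150000, 200000, 140000, 95000],
})
X_toy = toy.drop(columns="salary_in_usd")
y_toy = toy["salary_in_usd"]

# One-hot encode the categorical columns; fill numeric gaps with the column mean
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["experience_level", "company_size"]),
    ("num", SimpleImputer(strategy="mean"), ["remote_ratio"]),
])

# Chain preprocessing and the regressor so both are fit together
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(X_toy, y_toy)
toy_predictions = model.predict(X_toy)

# The value the imputer learned for remote_ratio (mean of the non-missing rows)
imputed_value = model.named_steps["preprocess"].named_transformers_["num"].statistics_[0]
print("Imputed remote_ratio:", imputed_value)
```

Because the encoder and imputer live inside the pipeline, they would be refit on each training fold during cross-validation, which avoids leaking information from the validation rows.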

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [8]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


# Define hyperparameter grids for each model
param_grid_lr = {}  # Linear Regression may not have many hyperparameters to tune
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None]
}
param_grid_gb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Set up grid search for each model
grid_search_lr = GridSearchCV(LinearRegression(), param_grid_lr, cv=5, scoring='neg_mean_squared_error')
grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid_rf, cv=5, scoring='neg_mean_squared_error')
grid_search_gb = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid_gb, cv=5, scoring='neg_mean_squared_error')

# Fit models and print the best parameters and scores
# Fit Linear Regression
grid_search_lr.fit(X_train, y_train)
print("Best parameters for Linear Regression:", grid_search_lr.best_params_)
print("Best score for Linear Regression:", grid_search_lr.best_score_)

# Fit Random Forest
grid_search_rf.fit(X_train, y_train)
print("Best parameters for Random Forest:", grid_search_rf.best_params_)
print("Best score for Random Forest:", grid_search_rf.best_score_)

# Fit Gradient Boosting
grid_search_gb.fit(X_train, y_train)
print("Best parameters for Gradient Boosting:", grid_search_gb.best_params_)
print("Best score for Gradient Boosting:", grid_search_gb.best_score_)

Best parameters for Linear Regression: {}
Best score for Linear Regression: -3157074833.2540436
Best parameters for Random Forest: {'max_depth': 10, 'n_estimators': 200}
Best score for Random Forest: -3158408973.1292124
Best parameters for Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Best score for Gradient Boosting: -3158569672.567332


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

1. I need a regression model. This is because my target variable, salary_in_usd, is a continuous numeric variable. Regression models are used when the output variable is a real or continuous value.

2. I used three different models:
   - Linear Regression: the simplest regression model and a good baseline. It works well when there is a linear relationship between the features and the target variable.
   - Random Forest Regressor: a model that uses an ensemble of decision trees. It was chosen for its ability to handle complex, higher-dimensional datasets and for its robustness to overfitting.
   - Gradient Boosting Regressor: builds trees sequentially, with each tree trying to correct the errors of the previous one. It performs well across a variety of tasks and can handle heterogeneous features.

3. Based on the results, the Linear Regression model worked the best. Its grid search score (negative mean squared error) is the highest, meaning closest to zero, although all three models score within about 0.05% of each other. Linear Regression is less prone to overfitting than more complex models like Random Forest and Gradient Boosting, especially if the dataset isn't very large or if the features don't have complex non-linear relationships with the target variable. Since some of the trickier variables were dropped because I wanted to compare experience level and company size against the salary an employee receives, it makes sense that the linear model worked the best: with only two one-hot encoded features there is likely little non-linear structure left for the tree-based models to exploit. In the context of salary prediction, many factors could have a linear impact on salaries, like years of experience or level of education. If the relationships in the data are not highly non-linear, simpler models like Linear Regression can often yield better results, as we have seen demonstrated in class.
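
Negative MSE values are hard to compare by eye, so converting them to RMSE (in USD) may help. The sketch below simply reuses the three best scores printed by the grid search above; the variable names are illustrative.

```python
import math

# Best cross-validation scores copied from the grid search output above
neg_mse_scores = {
    "Linear Regression": -3157074833.2540436,
    "Random Forest": -3158408973.1292124,
    "Gradient Boosting": -3158569672.567332,
}

# neg_mean_squared_error is -MSE, so RMSE = sqrt(-score)
cv_rmse = {name: math.sqrt(-score) for name, score in neg_mse_scores.items()}

# The best model has the highest (closest to zero) negative MSE
best_name = max(neg_mse_scores, key=neg_mse_scores.get)

# Gap in RMSE between the best and worst of the three models
spread = max(cv_rmse.values()) - min(cv_rmse.values())

for name, value in cv_rmse.items():
    print(f"{name}: RMSE of about {value:,.0f} USD")
print(f"Best model: {best_name} (spread across models: {spread:.1f} USD)")
```

On this scale the three models differ by roughly 13 USD of RMSE, so the ranking between them should be read as a near tie rather than a clear win.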

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [13]:
# Calculate testing accuracy (1 mark)

from sklearn.metrics import mean_squared_error

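# Initialize the best model (Linear Regression). The Step 3 grid was empty, so
# fit_intercept=False is a manual choice here, not a grid search result. The one-hot
# columns of each feature sum to 1, so the model can still fit a constant offset.
best_model = LinearRegression(fit_intercept=False)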

# Fit the model on the training data
best_model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = best_model.predict(X_val)

# Calculate the Mean Squared Error
mse = mean_squared_error(y_val, y_pred)

# Calculate the Root Mean Squared Error
rmse = np.sqrt(mse)

# RMSE implied by the best cross-validation score from Part 3 (a negative MSE)
part_3_result = np.sqrt(-grid_search_lr.best_score_)
print(f"Part 3 Results: {part_3_result}")
print(f"Testing set MSE: {mse}")
print(f"Testing set RMSE: {rmse}")

Part 3 Results: 56187.853075678584
Testing set MSE: 3233701947.4403224
Testing set RMSE: 56865.64821964419


In [14]:
difference_rmse = part_3_result - rmse
print("The difference between both RMSE: ", difference_rmse)

The difference between both RMSE:  -677.7951439656026


In [10]:
# Use descriptive names so Python's built-in max() and min() are not shadowed
max_salary = df['salary_in_usd'].max()
min_salary = df['salary_in_usd'].min()
print("The max salary in USD is: ", max_salary)
print("The min salary in USD is: ", min_salary)
std_salary = df['salary_in_usd'].std()
print("The standard deviation of the salaries in USD is: ", std_salary)

The max salary in USD is:  456621
The min salary in USD is:  15897
The standard deviation of the salaries in USD is:  64924.26332277093
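
One rough way to put the test RMSE in context is to compare it with this standard deviation, since a model that always predicts the mean salary would have an RMSE close to the standard deviation. The sketch below reuses the two numbers printed above. It is only an approximation, because the standard deviation was computed on the full dataset while the RMSE comes from the held-out 20%.

```python
# Values copied from the outputs printed above
test_rmse = 56865.64821964419    # Testing set RMSE
salary_std = 64924.26332277093   # standard deviation of salary_in_usd

# Approximate share of salary variance explained by the model (a rough R^2):
# 1 - MSE / variance, using the full-data variance as a stand-in
approx_r2 = 1 - (test_rmse / salary_std) ** 2

print(f"RMSE / std: {test_rmse / salary_std:.2f}")
print(f"Approximate R^2: {approx_r2:.2f}")
```

Under these assumptions the model explains somewhere around a quarter of the variance in salaries, which is consistent with an error that is only modestly smaller than the spread of the data itself.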



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

1. I chose RMSE as my accuracy metric. It is expressed in the same units as the target (USD), which makes it easier to interpret given how widely the salary_in_usd column varies, from as little as 15,897 USD to 456,621 USD.
2. The RMSE implied by the cross-validation score in Part 3 (about 56,188) and the RMSE on the held-out set in Part 4 (about 56,866) differ by only about 678, which is negligible for salaries in the tens to hundreds of thousands. In that sense the model generalized well: its error on unseen data matches its cross-validation error.
3. Based on the results and the context of my dataset, the model did not perform well enough to be used in a real-world application. An RMSE of 56,866 is massive in the context of a salary, and it is not far below the standard deviation of the salaries themselves (64,924). I believe a major reason is that I had to remove the job title: using OneHotEncoder() on it created far too many columns, which was unproductive since some of the job titles were extremely similar. A suggestion would be to group similar jobs together before encoding, so that encoding does not split the dataset into an additional 20 columns.
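
The job-title grouping suggested in the last answer could look something like the sketch below. The keyword-to-group mapping and most of the example titles are hypothetical (only "Information Security Officer" and "Security Engineer" appear in the data preview above), so a real mapping would need to be built from the actual `job_title` values.

```python
import pandas as pd

# Hypothetical keyword -> group mapping, for illustration only
TITLE_GROUPS = {
    "engineer": "Engineering",
    "analyst": "Analysis",
    "architect": "Architecture",
    "officer": "Leadership",
    "manager": "Leadership",
}


def group_job_title(title: str) -> str:
    """Return the group of the first keyword found in the title, else 'Other'."""
    lowered = title.lower()
    for keyword, group in TITLE_GROUPS.items():
        if keyword in lowered:
            return group
    return "Other"


# Example titles; the last three are made up for this sketch
titles = pd.Series([
    "Information Security Officer",
    "Security Engineer",
    "Cloud Security Engineer",
    "Security Analyst",
    "Penetration Tester",
])

# Map every title to its broader group before one-hot encoding
grouped = titles.map(group_job_title)
print(grouped.value_counts())
```

The grouped column could then be one-hot encoded alongside `experience_level` and `company_size`, adding one column per group instead of one per distinct title.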

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. My code was sourced from the Pipeline and ColumnTransformer examples that we went over in class along with the regression metrics examples we did. I also used a lot of the previous assignment to reference how to set up a lot of the same steps that we do every time we do machine learning.

2. The steps were completed in the order they were presented in the assignment itself. Until the pre-processing is complete, the rest of the steps cannot be completed. For parts that I got stuck on, I referenced our notes until I was able to work through all of the problems sequentially. I also find that, since the five machine learning steps are always the same, it is easier to use the same methodology and go from step one to five in order.

3. Generative AI was used for some of the pre-processing methods because it was a nightmare to get this dataset cleaned up enough to do work on. The prompts include: 

- What's the difference between: make_column_transformer and this: ColumnTransformer
- Invalid parameter 'normalize' for estimator LinearRegression(). Valid parameters are: ['copy_X', 'fit_intercept', 'n_jobs', 'positive'].
- There's no way this is right: 
Linear Regression Best Params: {'fit_intercept': False}
Linear Regression Best Score: -3203022745.01006
Random Forest Best Params: {'max_depth': 10, 'n_estimators': 200}
Random Forest Best Score: -3205715167.8523016
Gradient Boosting Best Params: {'learning_rate': 0.1, 'n_estimators': 100}
Gradient Boosting Best Score: -3205302866.120061

The use of generative AI was for debugging and for making sure I was not going crazy because of the massive MSEs I got. The code was not used; it was more for checking that I had entered everything correctly and was not missing something that caused these huge errors. It turned out all the models were just very poor, which was what was causing the issue. That and I believe not using 

4. I think there were significant challenges in setting up the data and ensuring it was done correctly to minimize the errors produced by the model. Unlike the datasets used in class, this is a real dataset, which had a lot more challenges for pre-processing. Following the in-class examples and taking each pre-processing step one at a time helped me with this assignment.

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating while working on this assignment.


*ADD YOUR THOUGHTS HERE*

The part of the assignment that I really enjoyed was using different models and seeing how they interact with data differently. I also disliked using multiple models, because some of the lines of code started to look the same after a while and it was really hard to tell them apart. What I found very interesting was being able to take seemingly mundane datasets and create interesting models and observations from them, learning more about what seems like very boring data. I found it challenging to work with so many pre-processing methods to clean my data instead of having a relatively clean dataset to begin with.