# Business Understanding
Briefly restate the project’s purpose and goals.

Define the research/prediction question (e.g., “Can we predict a job’s salary based on location, company, and job title?”).

Describe why the problem is important or useful.

### Data Preview
- This section previews the original CSV data before any feature engineering is applied.

In [None]:
# Preview
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
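from dotenv import load_dotenv
import os

load_dotenv()

# Read the path from the environment so each teammate can keep the CSV wherever suits them
csv_path = os.getenv("CSV_PATHNAME")
if csv_path is None:
    raise RuntimeError("CSV_PATHNAME is not set; define it in your .env file")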

# Load data
data = pd.read_csv(csv_path)

data

In [None]:
# Describe all fields in the data
data.info()
print(data.head())  # a bare expression in the middle of a cell is not displayed
print(data.describe())

# Data Cleaning

### Handling Missing Data

In [None]:
# Print all fields and sum of any missing values in their respective column
print("Missing Values:")
data.isnull().sum()

#### Conclusion
- The untouched dataset has no missing values, so no code is needed to handle them.

# Data Exploration
- This step covers the univariate, bivariate, and multivariate analysis of our cleaned dataset.

### Univariate Analysis

In [None]:
# Describe data again to remind ourselves
data.info()
data.describe(include='all')

There are 3 numeric columns to perform univariate analysis on. They are:
1. work_year
2. salary
3. salary_in_usd

The rest of the columns are all categorical values. They are:
1. job_title
2. job_category
3. salary_currency
4. employee_residence
5. experience_level
6. employment_type
7. work_setting
8. company_location
9. company_size

In [None]:
# Histogram of numeric columns
numeric_cols = ['work_year', 'salary', 'salary_in_usd']

data[numeric_cols].hist(figsize=(10,6))
plt.tight_layout()
plt.show()

In [None]:
# Boxplots of numeric columns
plt.figure(figsize=(12, 8))

for i, column in enumerate(numeric_cols, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x=data[column])

plt.tight_layout()
plt.show()

In [None]:
# Categorical columns analysis
categorical_cols = ['job_title', 'job_category', 'salary_currency', 'employee_residence',
            'experience_level', 'employment_type', 'work_setting',
            'company_location', 'company_size']

# Print all unique values and their count
for column in categorical_cols:
    print(f"--------------- {column} ---------------")
    print(data[column].value_counts())
    print()

In [None]:
# Get fields with few unique values (<= 15)
columns_to_plot = []
for column in categorical_cols:
    unique_count = data[column].nunique()
    if unique_count <= 15:
        columns_to_plot.append(column)

print(columns_to_plot)

In [None]:
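# Bar charts of value counts for the low-cardinality categorical columns
n_rows = max(1, (len(columns_to_plot) + 1) // 2)  # two plots per row, however many columns qualify
plt.figure(figsize=(14, 6 * n_rows))

for i, column in enumerate(columns_to_plot, 1):
    plt.subplot(n_rows, 2, i)
    data[column].value_counts().plot(kind='bar')
    plt.title(column)
    if column == "job_category":
        plt.xticks(rotation=90)
    else:
        plt.xticks(rotation=0)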

plt.tight_layout()
plt.show()

### Bivariate Analysis

#### Salary vs. Categorical columns

In [None]:
# Compare salary against a subset of the categorical columns
# (job_title and company_location, among others, are left out)
bivariate_cols = ['job_category', 'experience_level', 'company_size', 'work_setting']

plt.figure(figsize=(16, 12))

for i, column in enumerate(bivariate_cols, 1):
    plt.subplot(2, 2, i)  
    sns.boxplot(x=column, y='salary_in_usd', data=data)
    
    if column == "job_category":
        plt.xticks(rotation=90)
    else:
        plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

#### Salary vs. Numerical columns

In [None]:
# salary_in_usd is the y-axis, so leave it out of the x-axis columns
scatter_cols = [col for col in numeric_cols if col != 'salary_in_usd']

plt.figure(figsize=(16, 6))

for i, column in enumerate(scatter_cols, 1):
    plt.subplot(1, 2, i)

    # Scatter plot for numeric vs. numeric
    sns.scatterplot(x=column, y='salary_in_usd', data=data)

    plt.title(f'Salary vs. {column}')
    plt.xlabel(column)
    plt.ylabel('Salary (USD)')

plt.tight_layout()
plt.show()

### Multivariate Analysis

# Feature Engineering

### Response variable
- For the dataset we are using, we will be building a prediction model for the salary_in_usd field

### Variable Selection and Creation

In [None]:
# Step 2: Drop variables not useful for modeling 
data.drop(['salary_currency', 'job_title', 'salary', 'employee_residence'], axis=1, inplace=True)

# Step 3: Construct New Variables

# make 'company_location' categorical with top 6 locations, rest as 'Other'
top_6 = data['company_location'].value_counts().nlargest(6).index

data['company_location'] = np.where(data['company_location'].isin(top_6),
                                    data['company_location'],
                                    'Other')

# Convert all categorical columns to numeric codes
categorical_cols = ['company_location', 'job_category', 'experience_level', 
                    'employment_type', 'work_setting', 'company_size']

for col in categorical_cols:
    data[col + '_code'] = data[col].astype('category').cat.codes

# drop original categorical columns
data_numeric = data.drop(columns=categorical_cols)

# Check result
data_numeric.head()

# Step 4: Scale Data if required


Then explain why we created or removed each variable.
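
Step 4 is left open above. As a rough sketch only, the cell below shows what z-score scaling could look like, alongside one-hot encoding as a possible alternative to integer category codes. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the dataset. Integer codes number the categories in sorted order, which implies an ordering that may not be meaningful for a linear model, so one-hot columns are a common alternative worth considering.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "company_size": ["S", "M", "L", "M", "L"],
    "work_year": [2020, 2021, 2022, 2023, 2023],
    "salary_in_usd": [50000, 65000, 90000, 70000, 120000],
})

# Integer codes (the approach used above): categories are numbered in sorted order
toy["company_size_code"] = toy["company_size"].astype("category").cat.codes

# One-hot alternative: one 0/1 column per category, dropping the first to avoid redundancy
one_hot = pd.get_dummies(toy["company_size"], prefix="company_size",
                         drop_first=True, dtype=int)

# Z-score scaling of a numeric predictor: subtract the mean, divide by the standard deviation
toy["work_year_scaled"] = (
    (toy["work_year"] - toy["work_year"].mean()) / toy["work_year"].std()
)

print(toy)
print(one_hot)
```

Whether scaling is needed at all depends on the model chosen later; plain linear regression does not require it, though it can make coefficients easier to compare.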

### Importance and Multicollinearity

In [None]:
# Step 5: Importance & Multicollinearity

# Correlation matrix for all numeric features
corrVals = data_numeric.corr()
print(corrVals)

# Correlation of each feature with the target (salary_in_usd), reusing the matrix above
target_corr = corrVals['salary_in_usd'].drop('salary_in_usd')  # drop self-correlation

# Sort correlations
target_corr = target_corr.sort_values(ascending=False)

# Plot as bar chart
plt.figure(figsize=(12,6))
target_corr.plot(kind='bar', color='skyblue')
plt.title("Correlation of Features with Salary")
plt.ylabel("Correlation coefficient")
plt.xlabel("Features")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Correlation heatmap (diverging colours so negative and positive correlations are distinguishable)
plt.figure(figsize=(16, 14))
sns.heatmap(corrVals, annot=True, cmap='coolwarm', center=0, fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


# Drop a feature if it is too similar to another one
# ('TaxPaid' is a placeholder; it is not a column in this dataset)
# data_numeric.drop('TaxPaid', axis=1, inplace=True)

Then explain output

# Predictive Modelling
State what we will use. It is best to follow the notes.

## (Regression) Modelling (Pick which ones best for our project)

Pick model

Reasons for picking model

### Step 1: Train-Test Split

In [None]:
# Step 1 - Split Data - 80% Train, 20% Test
# (a 66.7% train / 33.3% test split is the other option; pick one)

# Set the response and the predictor variables

# Split the data set into training data and test data
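
The split itself is not implemented yet. A minimal sketch of an 80/20 split using only pandas is shown below; the `toy` frame is randomly generated stand-in data, not the project dataset, and the seed of 42 is an arbitrary choice. scikit-learn's `train_test_split` is a commonly used alternative that does the same job in one call.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_numeric (randomly generated, illustration only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "work_year": rng.integers(2020, 2024, size=100),
    "experience_level_code": rng.integers(0, 4, size=100),
    "salary_in_usd": rng.normal(100_000, 20_000, size=100).round(),
})

# Set the response and the predictor variables
y = toy["salary_in_usd"]
X = toy.drop(columns="salary_in_usd")

# Sample 80% of the rows at random for training; the remaining 20% form the test set
X_train = X.sample(frac=0.8, random_state=42)
X_test = X.drop(index=X_train.index)
y_train = y.loc[X_train.index]
y_test = y.loc[X_test.index]

print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which matters when comparing models on the same test set.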

### Step 2: (Regression) Modelling - Model Selection using (Linear Regression) based on training set

We use a stepwise approach: first forward selection, then backward elimination. If the two produce different models, we keep the better one.

#### Stepwise Forward

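To do this, start by fitting a separate model for each variable on its own and keep the best one. Then try adding each remaining variable in turn, keep whichever addition improves the model most, and repeat until there is no improvement. Each model is evaluated on its adjusted R-squared, which is usually between 0 and 1 (higher is better) but can fall below 0 for a very poor fit. Adjusted R-squared is derived from the residual sum of squares, the same quantity that underlies the RMSE (the square root of the mean squared error), with a penalty for each additional predictor.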

TLDR: Build the model one variable at a time, adding a variable at each step until the adjusted R-squared stops improving.
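
The forward procedure can be sketched as follows. This is an illustrative outline, not the project's final implementation: `adjusted_r2` and `forward_stepwise` are hypothetical helper names, the fit uses ordinary least squares via NumPy, and the data is synthetic (two informative columns and one noise column).

```python
import numpy as np
import pandas as pd

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X.to_numpy(dtype=float)])
    target = y.to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    ss_res = np.sum((target - design @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    # 1 - (1 - R^2) * (n - 1) / (n - p - 1), where 1 - R^2 = ss_res / ss_tot
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)

def forward_stepwise(X, y):
    """Add one predictor at a time while the adjusted R-squared keeps improving."""
    selected, best_score = [], -np.inf
    remaining = list(X.columns)
    while remaining:
        # Score every model that adds exactly one of the remaining columns
        scores = {col: adjusted_r2(X[selected + [col]], y) for col in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no addition improves the model, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

# Synthetic training data (illustration only)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal_a": rng.normal(size=200),
    "signal_b": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 3 * X_toy["signal_a"] - 2 * X_toy["signal_b"] + rng.normal(scale=0.5, size=200)

selected, score = forward_stepwise(X_toy, y_toy)
print(selected, round(score, 3))
```

In the project, `X_toy` and `y_toy` would be replaced by the training predictors and response from Step 1.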

In [None]:
# State all variables to be used in the model

# Build model 1:

# Build model 2:

# Check Rsq Adj

# etc.

#### Stepwise Backwards

To do this, start with the model containing all variables, remove each variable in turn, and check which removal improves the model most. Then repeat the process until there is no improvement. Each model is evaluated on its adjusted R-squared, which is usually between 0 and 1 (higher is better) but can fall below 0 for a very poor fit. Adjusted R-squared is derived from the residual sum of squares, the same quantity that underlies the RMSE (the square root of the mean squared error), with a penalty for each additional predictor.

TLDR: The opposite of forward selection. Instead of building up, start with all variables in the model and remove them one at a time until a further removal would lower the adjusted R-squared.
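
A matching sketch of backward elimination is given below, under the same assumptions as before: `adjusted_r2` and `backward_stepwise` are hypothetical helper names, the fit is ordinary least squares via NumPy, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X.to_numpy(dtype=float)])
    target = y.to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    ss_res = np.sum((target - design @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)

def backward_stepwise(X, y):
    """Start from all predictors and drop one at a time while adjusted R-squared improves."""
    selected = list(X.columns)
    best_score = adjusted_r2(X[selected], y)
    while len(selected) > 1:
        # Score every model that leaves out exactly one of the current columns
        scores = {
            col: adjusted_r2(X[[c for c in selected if c != col]], y)
            for col in selected
        }
        candidate = max(scores, key=scores.get)  # removal that gives the best score
        if scores[candidate] <= best_score:
            break  # every removal makes the model no better, so stop
        selected.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

# Synthetic training data (illustration only)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal_a": rng.normal(size=200),
    "signal_b": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 3 * X_toy["signal_a"] - 2 * X_toy["signal_b"] + rng.normal(scale=0.5, size=200)

selected, score = backward_stepwise(X_toy, y_toy)
print(selected, round(score, 3))
```

Forward and backward runs can end on different variable sets, which is why the two results are compared afterwards.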

In [None]:
# State all variables to be used in the model

# Build model 1:

# Build model 2:

# Check Rsq Adj

# etc.

#### Compare the Stepwise Forward and Stepwise Backwards Models

In [None]:
# Stepwise Forwards Rsq Adj vs Stepwise Backwards Rsq Adj

# Pick best one

### Step 2: Statement of Best Model

In [None]:
# Best model: 

# Show model params

### Step 3: (Regression) Modelling - Model Evaluation based on the TEST set

In [None]:
# Calculate the summary measures MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error)
# and RMSE (Root Mean Square Error) for the model based on the TEST set.
# These give a measure of the quality of the predictions.
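
As a minimal sketch, the three measures can be computed with NumPy as below. `regression_errors` is a hypothetical helper name, and the `actual` and `predicted` values are made up for illustration; in the project they would be the test-set salaries and the model's predictions.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MAPE (in percent) and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mape = np.mean(np.abs(errors / y_true)) * 100  # assumes no actual value is zero
    rmse = np.sqrt(np.mean(errors ** 2))
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse}

# Made-up values, illustration only
actual = [100_000, 80_000, 120_000, 60_000]
predicted = [110_000, 70_000, 120_000, 75_000]

metrics = regression_errors(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the salary, while MAPE is a percentage, which makes it easier to compare across salary ranges.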

### Is the model acceptable for use in practice?

In [None]:
# Explore how the errors depend on key variables

# Plot the error values against the actual value

# Plot the error values against key predictors
# ('AGST' and 'age' are not columns in this dataset; use e.g. work_year or experience_level_code)
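
Alongside the plots, a grouped summary can show whether errors grow for particular groups. The sketch below assumes a `results` frame holding test-set actuals and predictions next to one predictor; the values are made up, and `experience_level_code` is used only as an example of a key variable.

```python
import pandas as pd

# Made-up test-set results, illustration only
results = pd.DataFrame({
    "experience_level_code": [0, 0, 1, 1, 2, 2],
    "actual": [50_000, 60_000, 90_000, 100_000, 150_000, 170_000],
    "predicted": [55_000, 58_000, 95_000, 90_000, 130_000, 150_000],
})

# Signed error shows bias (over- or under-prediction); absolute error shows size
results["error"] = results["actual"] - results["predicted"]
results["abs_error"] = results["error"].abs()

# Average error per group of the chosen predictor
summary = results.groupby("experience_level_code")[["error", "abs_error"]].mean()
print(summary)
```

A mean signed error far from zero in one group would suggest the model systematically under- or over-predicts for that group.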

# Findings
Summarize main results and conclusions.

Highlight key visualizations that support findings.

Connect insights back to your original business question.

Discuss limitations and possible improvements.

It is probably best to split the four points above into four separate blocks.

# Team Contributions

### Theo's Contributions
- Provided basic code to extract data from Jobs.ie and export it to a CSV file
- Created initial template for both DataMining and DataAnalyses files 
- Found appropriate dataset on Kaggle
- Carried out univariate, bivariate and multivariate analysis of dataset

### Oisín's Contributions
- Created README.md file to explain project
- Researched potential websites to use for project
- Completed Feature engineering for project
- Created initial template for Predictive modelling