# Business Understanding
Briefly restate the project’s purpose and goals.

Define the research/prediction question (e.g., “Can we predict a job’s salary based on location, company, and job title?”).

Describe why the problem is important or useful.

### Data Preview

In [30]:
# pandas DataFrame here to showcase the original data from the CSV

# Data Cleaning

### Variable Characterisation
- Have all fields defined in this block

### Handling Missing Data

In [31]:
# Code here to check for any missing values to address
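
One possible shape for this check is sketched below. It is only an illustration on a made-up toy table (the values are invented; the column names are borrowed from the feature engineering cell), and `toy` and `missing_report` are hypothetical names rather than part of the project code.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; column names follow the feature engineering cell.
toy = pd.DataFrame({
    "salary_in_usd": [90000, np.nan, 120000, 75000],
    "experience_level": ["Senior", "Entry-level", None, "Mid-level"],
    "company_size": ["M", "L", "M", "S"],
})

# Count and percentage of missing values per column.
missing_count = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

# Combine both views into one small report table.
missing_report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(missing_report)
```

Reporting the percentage alongside the raw count makes it easier to judge whether a column can be imputed or should be dropped.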

Then explain any missing values found in this cell and how we will handle them.

In [32]:
# Then code here to handle missing values, if we have any. Follow TitanicDataCleaning.ipynb notes
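
A minimal sketch of one common approach follows, assuming we drop rows with a missing target, fill numeric gaps with the median, and fill categorical gaps with the mode. This is not necessarily what TitanicDataCleaning.ipynb does. The toy values are invented, and `work_year` is a hypothetical numeric column used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; `work_year` is a hypothetical numeric feature.
toy = pd.DataFrame({
    "salary_in_usd": [90000, np.nan, 120000, 75000, 60000],
    "work_year": [2023, 2022, np.nan, 2023, 2023],
    "experience_level": ["Senior", "Entry-level", None, "Senior", "Mid-level"],
})

# Rows without a target value cannot be used for training, so drop them.
cleaned = toy.dropna(subset=["salary_in_usd"]).copy()

# Numeric feature: fill gaps with the median, which is robust to outliers.
cleaned["work_year"] = cleaned["work_year"].fillna(cleaned["work_year"].median())

# Categorical feature: fill gaps with the most frequent level.
most_common_level = cleaned["experience_level"].mode()[0]
cleaned["experience_level"] = cleaned["experience_level"].fillna(most_common_level)

print(cleaned)
```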

# Data Exploration
This step includes univariate, bivariate, and multivariate analysis. Like CA1, but in code this time.

### Univariate Analysis

In [33]:
# Univariate analysis
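
As a rough starting point, univariate analysis could summarise one variable at a time. The sketch below uses invented toy values and looks at the salary distribution and the frequency of each experience level; plots such as histograms or count plots could be layered on top.

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "salary_in_usd": [60000, 75000, 90000, 120000, 300000],
    "experience_level": ["Entry-level", "Mid-level", "Senior", "Senior", "Executive"],
})

# Numeric variable: summary statistics and skewness of the distribution.
salary_summary = toy["salary_in_usd"].describe()
salary_skew = toy["salary_in_usd"].skew()

# Categorical variable: frequency of each level.
level_counts = toy["experience_level"].value_counts()

print(salary_summary)
print("Skewness:", salary_skew)
print(level_counts)
```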

### Bivariate Analysis

In [34]:
# Bivariate analysis
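
For the bivariate step, one simple option is to compare the target against a single feature. The sketch below, on invented toy values, computes the median salary per experience level; the median is used here only because it is less sensitive to extreme salaries than the mean.

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "experience_level": ["Entry-level", "Entry-level", "Mid-level",
                         "Mid-level", "Senior", "Senior"],
    "salary_in_usd": [50000, 60000, 80000, 90000, 120000, 140000],
})

# Median salary for each experience level, highest first.
median_by_level = (
    toy.groupby("experience_level")["salary_in_usd"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_level)
```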

### Multivariate Analysis

In [35]:
# Multivariate analysis
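
A multivariate view could look at the target across two features at once. As a hedged illustration on invented toy values, the sketch below builds a pivot table of mean salary by experience level and work setting; the level names are made up and may not match the real data.

```python
import pandas as pd

# Toy data with invented values; level names may differ from the real dataset.
toy = pd.DataFrame({
    "experience_level": ["Entry-level", "Entry-level", "Senior", "Senior", "Senior"],
    "work_setting": ["Remote", "In-person", "Remote", "Remote", "In-person"],
    "salary_in_usd": [50000, 60000, 130000, 150000, 120000],
})

# Mean salary for every combination of experience level and work setting.
pivot = toy.pivot_table(
    values="salary_in_usd",
    index="experience_level",
    columns="work_setting",
    aggfunc="mean",
)
print(pivot)
```

A table like this can be passed to a heatmap to make interaction effects easier to spot.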

# Feature Engineering

### Response variable
- Explain that we're doing salary

### Variable Selection and Creation

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# Load data
data = pd.read_csv("C:/Users/MyPC/OneDrive - Dundalk Institute of Technology/Year 3/Data Science/CA1/DataSci_CA2_PairProject_2025/jobs_data.csv")

# show data
data.info()
print(data.head())
print(data.describe())

# Check missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Step 2: Drop variables not useful for modeling
data.drop(['salary_currency', 'job_title', 'salary', 'employee_residence'], axis=1, inplace=True)

# Step 3: Construct new variables if required
# Keep the six most frequent company locations and group the rest as 'Other'
top_6 = data['company_location'].value_counts().nlargest(6).index

data['company_location'] = np.where(data['company_location'].isin(top_6),
                                    data['company_location'],
                                    'Other')

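# One-hot encode the categorical variables
data = pd.get_dummies(data, columns=['company_location'], drop_first=True)

# Note: job_category keeps every level (drop_first=False), unlike the other encodings
data = pd.get_dummies(data, columns=['job_category'], drop_first=False)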

data = pd.get_dummies(data, columns=['experience_level'], drop_first=True)

data = pd.get_dummies(data, columns=['employment_type'], drop_first=True)

data = pd.get_dummies(data, columns=['work_setting'], drop_first=True)

data = pd.get_dummies(data, columns=['company_size'], drop_first=True)

# Step 4: Scale Data if required
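
If scaling turns out to be needed, one option is z-score standardisation of the numeric features, leaving the target and the dummy columns untouched. The sketch below assumes a hypothetical numeric feature `work_year` and uses invented values; `numeric_features` is an illustrative name, not part of the project code.

```python
import pandas as pd

# Toy data with invented values; `work_year` is a hypothetical numeric feature.
toy = pd.DataFrame({
    "work_year": [2020, 2021, 2022, 2023],
    "salary_in_usd": [70000, 80000, 95000, 110000],
})

# Columns to standardise (the target is deliberately excluded).
numeric_features = ["work_year"]

scaled = toy.copy()
for col in numeric_features:
    # z-score: subtract the mean and divide by the population standard deviation.
    scaled[col] = (toy[col] - toy[col].mean()) / toy[col].std(ddof=0)

print(scaled)
```

Using `ddof=0` gives the population standard deviation, so each scaled column has mean 0 and standard deviation 1 on the data it was fitted to.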


Then explain why we created or removed certain variables.

### Importance and Multicollinearity

In [None]:
# Step 5: Importance & Multicollinearity

# Compute the correlation matrix once and reuse it below
corrVals = data.corr()
print(corrVals)

# Correlation of each feature with the target
target_corr = corrVals['salary_in_usd'].drop('salary_in_usd')  # drop self-correlation

# Sort correlations
target_corr = target_corr.sort_values(ascending=False)

# Plot as bar chart
plt.figure(figsize=(12,6))
target_corr.plot(kind='bar', color='skyblue')
plt.title("Correlation of Features with Salary")
plt.ylabel("Correlation coefficient")
plt.xlabel("Features")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Correlation heatmap (reuses the matrix computed above)
plt.figure(figsize=(16, 14))
sns.heatmap(corrVals, annot=True, cmap='Reds', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


# drop if too similar to other variables
# data.drop('TaxPaid', axis = 1, inplace = True)
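
To decide which variables are "too similar", one possible heuristic is to flag any feature whose absolute correlation with an earlier feature exceeds a threshold. The sketch below uses invented values, hypothetical column names (`feature_a`, `feature_b`, `feature_c`), and an arbitrary threshold of 0.9; it only lists candidates for review and does not drop anything.

```python
import numpy as np
import pandas as pd

# Toy feature table with invented values; the target column is assumed to be excluded.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [2, 4, 6, 8, 11],   # nearly a multiple of feature_a
    "feature_c": [5, 1, 4, 2, 3],
})

# Absolute pairwise correlations between features.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Arbitrary cut-off, chosen for illustration only.
threshold = 0.9
to_review = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_review)
```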

Then explain the output.

# Predictive Modelling
State which modelling techniques we will use. Best to follow the notes.

# Findings
Summarize main results and conclusions.

Highlight key visualizations that support findings.

Connect insights back to your original business question.

Discuss limitations and possible improvements.

Probably best to split the four points above into four separate blocks.

# Team Contributions

### Theo's Contributions
- Provided basic code to extract data from Jobs.ie and export it to a CSV file
- Created initial template for both DataMining and DataAnalyses files 

### Oisín's Contributions
- Created README.md file to explain project
- Researched potential websites to use for project