# Field of Study vs. Occupation
Adam Ben-Aamr

12-10-2024

# Data Introduction
Many individuals enter college or university with an idea of what they want to focus their career on. However, there are many instances where these same individuals end up changing their choice of career. I have always been curious about the factors that influence career changes. I want to explore how someone's field of study impacts their likelihood of sticking with or switching their career. According to the distributor and maintainer, this dataset is designed to help explore and predict whether individuals are likely to change their occupation based on their academic background, job experience, and other demographic factors.

It is unknown where this data originated, and all records are completely anonymous. The unknown origin is a potential source of bias, although the anonymity may also help eliminate some forms of bias. The csv file used for analysis in this project was retrieved from [Kaggle](https://www.kaggle.com/datasets/jahnavipaliwal/field-of-study-vs-occupation), uploaded by the user [Jahnavi Paliwal](https://www.kaggle.com/jahnavipaliwal) under the Apache 2.0 license.

The features used in this analysis are `Field of Study`, `Years of Experience`, `Education Level`, `Current Occupation`, `Industry Growth Rate`, `Job Satisfaction`, `Work-Life Balance`, `Job Opportunities`, `Salary`, `Job Security`, `Career Change Interest`, `Skills Gap`, `Family Influence`, and `Mentorship Available`, with the target variable being `Likely to Change Occupation`.

Attribute Information:
1. Field of Study: The area of academic focus during the individual’s education
2. Current Occupation: The individual's current job or industry they are employed in (Software Engineer, Mechanical Engineer, etc.)
3. Age: The age of the individual
4. Gender: The gender of the individual (Male, Female)
5. Years of Experience: The number of years the individual has been in the workforce
6. Education Level: The highest level of education completed by the individual (High School, Bachelor's, Master's, PhD)
7. Industry Growth Rate: The growth rate of the industry the individual works in (High, Medium, Low)
8. Job Satisfaction: A rating of the individual’s job satisfaction (1 - 10 scale)
9. Work-Life Balance: A rating of the individual's perceived work-life balance (1 - 10 scale)
10. Job Opportunities: The number of available job opportunities in the individual’s field
11. Salary: The annual salary of the individual (in USD or local currency equivalent)
12. Job Security: A rating of the individual’s perceived job security (1 - 10 scale)
13. Career Change Interest: Whether the individual is interested in changing their occupation (1 for yes, 0 for no)
14. Skills Gap: A measure of how well the individual’s current skills match their job requirements (1 - 10 scale)
15. Family Influence: The degree of influence the individual’s family has on their career choice (None, Low, Medium, High)
16. Mentorship Available: Whether the individual has access to a mentor in their current job
17. Certifications: Whether the individual holds any certifications relevant to their occupation
18. Freelancing Experience: Whether the individual has freelanced in the past
19. Geographic Mobility: Whether the individual is willing to relocate for a job
20. Professional Networks: A measure of how strong the individual's professional network is (1 - 10 scale)
21. Career Change Events: The number of career changes the individual has made in the past
22. Technology Adoption: A measure of the individual’s comfort level with adopting new technologies (1 - 10 scale)
23. Likely to Change Occupation: Variable indicating whether an individual is likely to change their occupation (1 for likely to change, 0 for unlikely to change)

In this project, I will perform a classification analysis.

In [None]:
import pandas as pd
from datacleaner import print_data_quality
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree

## Descriptive Statistics
The first step is to visually inspect the new data set and clean it up if necessary.

In [None]:
data = pd.read_csv('career_change_prediction_dataset.csv')

data.head()

This exploration will focus on 14 of the 23 variables presented in this dataset: Field of Study, Years of Experience, Education Level, Current Occupation, Industry Growth Rate, Job Satisfaction, Work-Life Balance, Job Opportunities, Salary, Job Security, Career Change Interest, Skills Gap, Family Influence, and Mentorship Available.

Let's check for missing values:

In [None]:
# Check for missing values or duplicate rows
data_quality = print_data_quality(data)

data_quality
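The `print_data_quality` helper is imported from a module that is not shown in this notebook, so its exact output is unknown. As a rough sketch of what such a check might report, assuming it counts missing cells and duplicate rows, the same information can be gathered with plain pandas. The function name `summarize_quality` and the toy data are hypothetical and only for illustration.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for a data quality report:
    counts rows, columns, missing cells, and duplicate rows."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame with one missing value and one duplicated row (not the real dataset)
toy = pd.DataFrame({
    "Salary": [50000, 62000, 62000, None],
    "Job Satisfaction": [7, 3, 3, 5],
})

report = summarize_quality(toy)
print(report)
```

On the real dataset the same function could be called as `summarize_quality(data)`.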

Even though the report shows no empty cells and no duplicates within the dataset, it doesn't hurt to double check manually and clean the data in the process.

In [None]:
# Remove any duplicate values
data = data.drop_duplicates()

# Removes any NaN values
data = data.dropna()

# Check for NaN values in the dataset
na_values = data.isna()

# Print the rows that still contain NaN values, if any
if na_values.any().any(): # Checks if there are any NaN values in the dataset
  print('NaN values in the dataset:')
  print(data[na_values.any(axis=1)])
else:
  print('No NaN values in the dataset')

data.head()

Following the completion of the data cleaning process, including the resolution of any missing (NaN) or duplicate values, the next step involves removing all irrelevant columns not pertinent to the scope of the current analysis. This step ensures the dataset is streamlined and focused on the variables of interest. Subsequently, the structure of the dataset can be examined, allowing for a comprehensive understanding of its remaining composition, including the number of rows, columns, and the specific attributes contained within the relevant columns.

In [None]:
# Drop columns not focused on for analysis
data = data.drop(['Age', 'Gender', 'Certifications', 'Freelancing Experience', 'Geographic Mobility', 'Professional Networks', 'Technology Adoption'], axis=1)

# Check to see if all columns are accurately represented and all null values have been eliminated
data.info()

In [None]:
data.describe()

In [None]:
# Remap the int categories to str
data['Likely to Change Occupation'] = data['Likely to Change Occupation'].map({
  0: 'No',
  1: 'Yes',
})

# Count the class distribution
change_occupation = data.groupby('Likely to Change Occupation')

change_occupation.count()

### Description
In the results displayed, you can see the data has 38,444 records. The original file has 23 columns, 16 of which remain after dropping the columns not used in this analysis.

Likely to Change Occupation is a categorical variable represented with numerical values (0 indicating no and 1 indicating yes).

Missing attribute values: none

Class distribution: 16,279 not likely to change occupation, 22,165 likely to change occupation
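The class counts above also give a useful reference point for the models that follow: a classifier that always predicts the majority class. The short sketch below computes that baseline accuracy directly from the reported counts; the variable names are only illustrative.

```python
# Class counts as reported in the description above
counts = {"No": 16279, "Yes": 22165}

total = sum(counts.values())

# The majority class is the label with the most records
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(f"Total records: {total}")
print(f"Majority class: {majority_label} ({baseline_accuracy:.1%})")
```

Any model worth keeping should score noticeably above this baseline on the test split.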

## Visualize Distribution of Data
The primary objective of visualizing the data in this context is twofold: first, to identify the features that are most effective in predicting whether an individual in a specific field is likely to experience an occupational change; second, to observe general trends within the data that may inform and guide the selection of an appropriate model.

In [None]:
# Scatter plots of selected feature pairs, colored by the target variable
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
x1 = data['Years of Experience']
y1 = data['Job Satisfaction']
x2 = data['Salary']
y2 = data['Job Opportunities']

# Left scatter plot
sns.scatterplot(data=data, x=x1, y=y1, ax=ax[0], hue='Likely to Change Occupation')
ax[0].set_title('Years of Experience vs. Job Satisfaction')

# Right scatter plot
sns.scatterplot(data=data, x=x2, y=y2, ax=ax[1], hue='Likely to Change Occupation')
ax[1].set_title('Salary vs. Job Opportunities')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Density plot of Years of Experience distribution
sns.kdeplot(data[data['Likely to Change Occupation'] == 'Yes']['Years of Experience'], fill=True, label='Yes', ax=ax[0])
sns.kdeplot(data[data['Likely to Change Occupation'] == 'No']['Years of Experience'], fill=True, label='No', ax=ax[0])
ax[0].set_title('Density Plot of Years of Experience by Likely to Change Occupation')

# Density plot of Job Satisfaction distribution
sns.kdeplot(data[data['Likely to Change Occupation'] == 'Yes']['Job Satisfaction'], fill=True, label='Yes', ax=ax[1])
sns.kdeplot(data[data['Likely to Change Occupation'] == 'No']['Job Satisfaction'], fill=True, label='No', ax=ax[1])
ax[1].set_title('Density Plot of Job Satisfaction by Likely to Change Occupation')

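# Add a legend to each subplot (plt.legend() alone only affects the last axes)
ax[0].legend()
ax[1].legend()
plt.tight_layout()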
plt.show()

In [None]:
# fig, ax = plt.subplots(1, 3, figsize=(18, 6))

# Boxplot for Years of Experience vs. Likely to Change Occupation
# data['Years of Experience'] = data['Years of Experience'].astype(float)
# sns.boxplot(x='Likely to Change Occupation', y='Years of Experience', data=data, ax=ax[0])
# ax[0].set_title('Years of Experience vs. Likely to Change Occupation')

# sns.boxplot(x='Likely to Change Occupation', y='Job Satisfaction', data=data, ax=ax[1])
# ax[1].set_title('Job Satisfaction vs. Likely to Change Occupation')

# sns.boxplot(x='Likely to Change Occupation', y='Work-Life Balance', data=data, ax=ax[2])
# ax[2].set_title('Work-Life Balance vs. Likely to Change Occupation')

# plt.tight_layout()

The overall distribution of the data is notably...

## Model Building
### High Complexity Model

In [None]:
# Apply mappings to columns
mappings = {
  'Field of Study': {
    'Medicine': 1,
    'Education': 2,
    'Arts': 3,
    'Computer Science': 4,
    'Business': 5,
    'Mechanical Engineering': 6,
    'Biology': 7,
    'Law': 8,
    'Economics': 9,
    'Psychology': 10,
  },
  'Current Occupation': {
    'Business Analyst': 1,
    'Economist': 2,
    'Biologist': 3,
    'Doctor': 4,
    'Lawyer': 5,
    'Software Developer': 6,
    'Artist': 7,
    'Psychologist': 8,
    'Teacher': 9,
    'Mechanical Engineer': 10,
  },
  'Education Level': {
    'High School': 1,
    'Bachelor\'s': 2,
    'Master\'s': 3,
    'PhD': 4,
  },
  'Industry Growth Rate': {
    'Low': 1,
    'Medium': 2,
    'High': 3,
  },
  'Family Influence': {
    'None': 1,
    'Low': 2,
    'Medium': 3,
    'High': 4,
  }
}

# Map the columns
for column, mapping in mappings.items():
  data[column] = data[column].map(mapping)
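One thing to watch with `Series.map` is that any category missing from the mapping dictionary silently becomes NaN. A minimal sketch of a pre-check is shown below on toy data; the helper name `unmapped_values` and the `'Associate'` category are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the categories present in the column but absent from the mapping."""
    return set(series.dropna().unique()) - set(mapping)


# Toy column containing one category that the mapping does not cover
toy = pd.DataFrame({"Education Level": ["High School", "PhD", "Associate"]})
education_map = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}

missing = unmapped_values(toy["Education Level"], education_map)
print("Unmapped categories:", missing)

# Mapping anyway turns the uncovered category into NaN
encoded = toy["Education Level"].map(education_map)
print("NaN values after mapping:", int(encoded.isna().sum()))
```

Running a check like this on each column in `mappings` before applying the mapping would make unexpected categories visible instead of turning them into NaN.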

In [None]:
features = data.drop('Likely to Change Occupation', axis=1)
target = data['Likely to Change Occupation']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, shuffle=True, random_state=42)

* `x_train` - Training features used to fit (train) the model
* `x_test` - Testing features used to test the model
* `y_train` - Training target labels used to fit (train) the model
* `y_test` - Testing target labels used to compare against the model's predicted labels
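To make the four splits concrete, here is a small sketch on toy data showing the sizes that `test_size=0.2` produces. The `stratify` argument is an optional addition that is not used in the cell above; it keeps the class proportions the same in both splits. All `toy_` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features, perfectly balanced binary target
toy_features = np.arange(200).reshape(100, 2)
toy_target = np.array([0, 1] * 50)

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_features,
    toy_target,
    test_size=0.2,        # 20% of the rows go to the test split
    shuffle=True,
    random_state=42,
    stratify=toy_target,  # optional: preserve class proportions in both splits
)

print("Train features:", toy_x_train.shape)
print("Test features:", toy_x_test.shape)
print("Positive labels in test split:", int(toy_y_test.sum()))
```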

In [None]:
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=None, random_state=42)

dtree.fit(x_train, y_train)

pred_train = dtree.predict(x_train)
pred_test = dtree.predict(x_test)

print('Train Accuracy: {:3.2f}'.format(accuracy_score(y_train, pred_train)))
print('Test Accuracy: {:3.2f}'.format(accuracy_score(y_test, pred_test)))

fig, ax = plt.subplots(1, 1, figsize=(9, 9), dpi=150)

plot_tree(dtree, fontsize=4, filled=True, max_depth=3, feature_names=features.columns, class_names=['No', 'Yes'])

### Low Complexity Model

In [None]:
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)

dtree.fit(x_train, y_train)

pred_train = dtree.predict(x_train)
pred_test = dtree.predict(x_test)

print('Train Accuracy: {:3.2f}'.format(accuracy_score(y_train, pred_train)))
print('Test Accuracy: {:3.2f}'.format(accuracy_score(y_test, pred_test)))

fig, ax = plt.subplots(1, 1, figsize=(3, 3), dpi=150)

plot_tree(dtree, fontsize=4, filled=True, max_depth=3, feature_names=features.columns, class_names=['No', 'Yes'])

In the low complexity model, only a single comparison is made, producing two possible outcomes, which indicates underfitting. The model bases its decision solely on a single attribute, job satisfaction. Specifically, if an individual's job satisfaction is less than or equal to 4.5, they are predicted to change their career. While this pattern holds within the dataset, it oversimplifies the reasons why people go through a career change, overlooking other potential factors and outcomes that could arise due to the inherent complexity of human satisfaction. In this case the only decision that can be made is: if a person rates their job satisfaction at 4 or below (that is, at or below the 4.5 threshold), they are predicted to change their career.

The high complexity model, with approximately six comparisons, fits the data perfectly: both the training and testing splits yield 100% accuracy. In this case, a medium complexity model is not necessary.
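The 4.5 threshold may look odd for a rating that only takes whole numbers. A short sketch on synthetic data (not the real dataset) illustrates where such a value comes from: scikit-learn places a split at the midpoint between two adjacent observed values, so a rule separating ratings of 4 and 5 is reported as 4.5. The synthetic label rule below is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic ratings 1..10, repeated so every value appears 50 times
satisfaction = np.tile(np.arange(1, 11), 50)
# An unrelated second feature, to show that the stump ignores it
noise = rng.integers(1, 11, size=satisfaction.size)

X = np.column_stack([satisfaction, noise])
# Assumed rule for this toy example: low satisfaction means a likely change
y = (satisfaction <= 4).astype(int)

# A depth-1 tree (a "stump") can make only one comparison
stump = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=42)
stump.fit(X, y)

print("Nodes in the tree:", stump.tree_.node_count)       # root plus two leaves
print("Feature used at the root:", stump.tree_.feature[0])
print("Threshold at the root:", stump.tree_.threshold[0])
```

Because the stump has one split, every prediction it makes depends on a single feature, which is the behavior described for the low complexity model.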

### Grid Search with Cross Validation

In [None]:
# 5-Fold cross validation and shuffle the data
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Setting up grid search
model = tree.DecisionTreeClassifier(random_state=42)
param_grid = {
  'max_depth': list(range(1, 11)),
  'criterion': ['entropy', 'gini'],
}

# Pass the shuffled KFold splitter defined above instead of a plain integer
grid = GridSearchCV(model, param_grid, cv=cv)

# Performing the grid search
grid.fit(x_train, y_train)

# Print the results
print('Best Parameters: {}'.format(grid.best_params_))

In [None]:
# Visualize the model
fig, ax = plt.subplots(1, 1, figsize=(3, 3), dpi=150)

plot_tree(grid.best_estimator_, fontsize=4, filled=True, max_depth=3, feature_names=features.columns, class_names=['No', 'Yes'])

In [None]:
# Prediction and accuracy
pred_test = grid.best_estimator_.predict(x_test)

print('Accuracy of optimal model: {:3.2f}'.format(accuracy_score(y_test, pred_test)))

The optimal classifier, along with the best set of hyperparameters, was identified as the entropy criterion with a maximum depth of 3. Interestingly, this aligns with the high complexity model, suggesting that the high complexity model is indeed the most suitable choice.

I performed cross-validation, a critical step in the machine learning process, to ensure the selected model is both robust and generalizes well to unseen data. In this case, the mean accuracy of the best model is identical to that of the high complexity model, with an accuracy score of 100%. This indicates that the high complexity model is the most appropriate for deployment.
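The mean cross-validated accuracy mentioned above can be read directly from a fitted `GridSearchCV` object. The sketch below uses synthetic data from `make_classification` rather than the career dataset, with a smaller parameter grid to keep it quick, so the numbers it prints say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the real dataset)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {
    "max_depth": [1, 3, 5],
    "criterion": ["entropy", "gini"],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=cv)
search.fit(X, y)

# Mean and spread of the validation accuracy across the 5 folds for the best setting
best_index = search.best_index_
mean_cv = search.cv_results_["mean_test_score"][best_index]
std_cv = search.cv_results_["std_test_score"][best_index]

print("Best parameters:", search.best_params_)
print("Mean CV accuracy: {:.3f} (+/- {:.3f})".format(mean_cv, std_cv))
```

Reporting the standard deviation alongside the mean shows how much the accuracy varies from fold to fold, which a single test-split accuracy cannot show.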

## Scatter Matrix

In [None]:
# Check for NaN values after mapping
# if data.isnull().values.any():
#   print("There are NaN values in the data after mapping...")
fields_to_check = ['Field of Study', 'Current Occupation', 'Education Level', 'Industry Growth Rate', 'Family Influence', 'Likely to Change Occupation']
unique_values = {field: data[field].unique() for field in fields_to_check}

print(unique_values)

# Check for NaN values introduced during mapping
nan_columns = data.columns[data.isnull().any()]
nan_info = {col: data[col].isnull().sum() for col in nan_columns}

print(nan_info)

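# Convert 'Likely to Change Occupation' back to numeric (it was remapped to 'No'/'Yes' earlier,
# so pd.to_numeric with errors='coerce' would turn every value into NaN)
data['Likely to Change Occupation'] = data['Likely to Change Occupation'].map({'No': 0, 'Yes': 1})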

# Plot the pairplot
sns.pairplot(data, vars=['Years of Experience', 'Job Satisfaction', 'Work-Life Balance', 'Job Opportunities', 'Salary', 'Job Security', 'Career Change Interest', 'Skills Gap', 'Mentorship Available', 'Career Change Events'], height=4)
plt.show()