# Field of Study vs. Occupation
Adam Ben-Aamr

12-10-2024

# Data Introduction
Many individuals enter college or university with an idea of what they want to focus their career on. However, many of these same individuals end up changing their choice of career. I have always been curious about the factors that influence career changes, and I want to explore how someone's field of study impacts their likelihood of sticking with or switching careers. According to the distributor and maintainer, this dataset is designed to help explore and predict whether individuals are likely to change their occupation based on their academic background, job experience, and other demographic factors.

It is unknown where this data originated, and everything is completely anonymous. The unknown origin is a potential source of bias, although the anonymity could also provide the opportunity to eliminate some subset of bias. The CSV file used for analysis in this project was retrieved from [Kaggle](https://www.kaggle.com/datasets/jahnavipaliwal/field-of-study-vs-occupation), uploaded by the user [Jahnavi Paliwal](https://www.kaggle.com/jahnavipaliwal) under the Apache 2.0 license.

The features used in this analysis are `Field of Study`, `Years of Experience`, `Education Level`, `Current Occupation`, `Industry Growth Rate`, `Job Satisfaction`, `Work-Life Balance`, `Job Opportunities`, `Salary`, `Job Security`, `Career Change Interest`, `Skills Gap`, `Family Influence`, and `Mentorship Available`, with the target variable being `Likely to Change Occupation`.

Attribute Information:
1. Field of Study: The area of academic focus during the individual’s education
2. Current Occupation: The individual's current job or industry they are employed in (Software Engineer, Mechanical Engineer, etc.)
3. Age: The age of the individual
4. Gender: The gender of the individual (Male, Female)
5. Years of Experience: The number of years the individual has been in the workforce
6. Education Level: The highest level of education completed by the individual (High School, Bachelor's, Master's, PhD)
7. Industry Growth Rate: The growth rate of the industry the individual works in (High, Medium, Low)
8. Job Satisfaction: A rating of the individual’s job satisfaction (1 - 10 scale)
9. Work-Life Balance: A rating of the individual's perceived work-life balance (1 - 10 scale)
10. Job Opportunities: The number of available job opportunities in the individual’s field
11. Salary: The annual salary of the individual (in USD or local currency equivalent)
12. Job Security: A rating of the individual’s perceived job security (1 - 10 scale)
13. Career Change Interest: Whether the individual is interested in changing their occupation (1 for yes, 0 for no)
14. Skills Gap: A measure of how well the individual’s current skills match their job requirements (1 - 10 scale)
15. Family Influence: The degree of influence the individual’s family has on their career choice (None, Low, Medium, High)
16. Mentorship Available: Whether the individual has access to a mentor in their current job
17. Certifications: Whether the individual holds any certifications relevant to their occupation
18. Freelancing Experience: Whether the individual has freelanced in the past
19. Geographic Mobility: Whether the individual is willing to relocate for a job
20. Professional Networks: A measure of how strong the individual's professional network is (1 - 10 scale)
21. Career Change Events: The number of career changes the individual has made in the past
22. Technology Adoption: A measure of the individual’s comfort level with adopting new technologies (1 - 10 scale)
23. Likely to Change Occupation: Variable indicating whether an individual is likely to change their occupation (1 for likely to change, 0 for unlikely to change)

In this project, I will perform a classification analysis.

In [None]:
import pandas as pd
from datacleaner import *
import matplotlib.pyplot as plt
import seaborn as sns

## Descriptive Statistics
The first step is to visually inspect the new data set and clean it up if necessary.

In [None]:
data = pd.read_csv('career_change_prediction_dataset.csv')

data.head()
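One caveat worth checking at load time: `Family Influence` uses the literal string `None` as a category, and, as far as I know, recent pandas releases include `None` among the strings that `read_csv` converts to missing values by default. The sketch below uses a small hypothetical in-memory sample (not the real file) to show how `keep_default_na=False` keeps that category as text while still treating empty cells as missing. Whether this is needed depends on the pandas version in use.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the `Family Influence` column
sample_csv = io.StringIO(
    "Family Influence,Salary\n"
    "None,50000\n"
    "Low,62000\n"
    "High,71000\n"
)

# keep_default_na=False stops pandas from converting strings such as "None" to NaN;
# na_values=[''] still treats genuinely empty cells as missing
sample = pd.read_csv(sample_csv, keep_default_na=False, na_values=[''])

family_influence_values = sample['Family Influence'].tolist()
missing_count = int(sample['Family Influence'].isna().sum())
print(family_influence_values, missing_count)
```

The same two arguments could be passed when reading `career_change_prediction_dataset.csv` if the `None` category shows up as missing.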

This exploration will focus on 14 of the 23 variables presented in this dataset: Field of Study, Years of Experience, Education Level, Current Occupation, Industry Growth Rate, Job Satisfaction, Work-Life Balance, Job Opportunities, Salary, Job Security, Career Change Interest, Skills Gap, Family Influence, and Mentorship Available.

Let's check for missing values and duplicate rows:

In [None]:
# Check for missing values or duplicate rows
data_quality = print_data_quality(data)

data_quality

Since there are no empty cells and no duplicates within the dataset, we can continue our data exploration without cleaning.
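For readers without the `datacleaner` module, the same kind of check can be approximated with plain pandas. This is only a sketch: `summarize_quality` is a hypothetical helper, not the implementation of `print_data_quality`, and the toy frame stands in for the real dataset.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.Series:
    """Return the row count, missing-cell count, and duplicated-row count."""
    return pd.Series({
        'rows': len(frame),
        'missing_cells': int(frame.isna().sum().sum()),
        'duplicate_rows': int(frame.duplicated().sum()),
    })


# Hypothetical toy frame: one missing cell and one duplicated row
toy = pd.DataFrame({
    'Salary': [50000, 62000, 62000, None],
    'Job Satisfaction': [7, 4, 4, 9],
})

quality = summarize_quality(toy)
print(quality)
```

Calling `summarize_quality(data)` on the loaded dataset should report zero missing cells and zero duplicate rows if the conclusion above holds.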

In [None]:
# Drop columns not focused on for analysis
data = data.drop(['Age', 'Gender', 'Certifications', 'Freelancing Experience', 'Geographic Mobility', 'Professional Networks', 'Technology Adoption'], axis=1)

# Check to see if all columns are accurately represented
data.info()

In [None]:
data.describe()

In [None]:
# Remap the int categories to str
data['Likely to Change Occupation'] = data['Likely to Change Occupation'].map({
  0: 'No',
  1: 'Yes',
})

# Count the class distribution
change_occupation = data.groupby('Likely to Change Occupation')

change_occupation.count()

### Description
In the results displayed, you can see the data has 38,444 records. The original file has 23 columns, 16 of which remain after dropping the seven columns not used in this analysis.

Likely to Change Occupation is a categorical variable represented with numerical values (0 indicating no and 1 indicating yes).

Missing attribute values: none

Class distribution: 16,279 not likely to change occupation, 22,165 likely to change occupation
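The reported counts work out to roughly 57.7% likely to change and 42.3% not likely to change, so the classes are only mildly imbalanced. A more direct way to get both counts and proportions than `groupby(...).count()` is `value_counts`, sketched here on a hypothetical five-value stand-in for the target column.

```python
import pandas as pd

# Hypothetical stand-in for the target column; the notebook itself would use
# data['Likely to Change Occupation'] instead
target = pd.Series(
    ['Yes', 'No', 'Yes', 'Yes', 'No'],
    name='Likely to Change Occupation',
)

# Absolute number of records per class
class_counts = target.value_counts()

# Share of records per class (sums to 1.0)
class_share = target.value_counts(normalize=True).round(3)

print(class_counts)
print(class_share)
```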

## Visualize Distribution of Data

In [None]:
# Compare pairs of features, colored by the target variable
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
x1 = data['Years of Experience']
y1 = data['Job Satisfaction']
x2 = data['Salary']
y2 = data['Job Opportunities']

# Left scatter plot
sns.scatterplot(data=data, x=x1, y=y1, ax=ax[0], hue='Likely to Change Occupation')
ax[0].set_title('Years of Experience vs. Job Satisfaction')

# Right scatter plot
sns.scatterplot(data=data, x=x2, y=y2, ax=ax[1], hue='Likely to Change Occupation')
ax[1].set_title('Salary vs. Job Opportunities')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Density plot of Years of Experience distribution
sns.kdeplot(data[data['Likely to Change Occupation'] == 'Yes']['Years of Experience'], fill=True, label='Yes', ax=ax[0])
sns.kdeplot(data[data['Likely to Change Occupation'] == 'No']['Years of Experience'], fill=True, label='No', ax=ax[0])
ax[0].set_title('Density Plot of Years of Experience by Likely to Change Occupation')

# Density plot of Job Satisfaction distribution
sns.kdeplot(data[data['Likely to Change Occupation'] == 'Yes']['Job Satisfaction'], fill=True, label='Yes', ax=ax[1])
sns.kdeplot(data[data['Likely to Change Occupation'] == 'No']['Job Satisfaction'], fill=True, label='No', ax=ax[1])
ax[1].set_title('Density Plot of Job Satisfaction by Likely to Change Occupation')

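# plt.legend() only affects the last axes, so add a legend to each subplot
for axis in ax:
  axis.legend(title='Likely to Change Occupation')

plt.tight_layout()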
plt.show()

In [None]:
# Reassign all str representations to int representations
data['Field of Study'] = data['Field of Study'].map({
  'Medicine': 1,
  'Education': 2,
  'Arts': 3,
  'Computer Science': 4,
  'Business': 5,
  'Mechanical Engineering': 6,
  'Biology': 7,
  'Law': 8,
  'Economics': 9,
  'Psychology': 10
})

data['Current Occupation'] = data['Current Occupation'].map({
  'Business Analyst': 1,
  'Economist': 2,
  'Biologist': 3,
  'Doctor': 4,
  'Lawyer': 5,
  'Software Developer': 6,
  'Artist': 7,
  'Psychologist': 8,
  'Teacher': 9,
  'Mechanical Engineer': 10,
})

data['Education Level'] = data['Education Level'].map({
  'High School': 1,
  "Bachelor's": 2,
  "Master's": 3,
  'PhD': 4,
})

data['Industry Growth Rate'] = data['Industry Growth Rate'].map({
  'Low': 1,
  'Medium': 2,
  'High': 3,
})

data['Family Influence'] = data['Family Influence'].map({
  'None': 1,
  'Low': 2,
  'Medium': 3,
  'High': 4,
})

# Numeric features to compare; the target is passed as hue rather than plotted as an axis
pairplot_vars = [
  'Field of Study', 'Current Occupation', 'Years of Experience', 'Education Level',
  'Industry Growth Rate', 'Job Satisfaction', 'Work-Life Balance', 'Job Opportunities',
  'Salary', 'Job Security', 'Career Change Interest', 'Skills Gap', 'Family Influence',
  'Mentorship Available', 'Career Change Events',
]

# height is the size of each facet in inches, so keep it small for a 15 x 15 grid
sns.pairplot(data, hue='Likely to Change Occupation', height=2, vars=pairplot_vars)
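The integer codes assigned to `Field of Study` and `Current Occupation` are arbitrary labels: nothing makes `Law` (8) "greater than" `Medicine` (1). That is fine for plotting, but a classifier may read the codes as an ordering. One common alternative, sketched below on a hypothetical four-row sample, is to keep integer codes only for ordered columns such as `Education Level` and to one-hot encode the nominal ones with `pd.get_dummies`. This is an illustration of the idea, not the encoding used in this notebook.

```python
import pandas as pd

# Hypothetical four-row sample with one nominal and one ordinal column
sample = pd.DataFrame({
    'Field of Study': ['Medicine', 'Law', 'Arts', 'Law'],
    'Education Level': ['PhD', "Bachelor's", 'High School', "Master's"],
})

# Ordinal column: an integer code preserves the natural ordering
education_order = {'High School': 1, "Bachelor's": 2, "Master's": 3, 'PhD': 4}
sample['Education Level'] = sample['Education Level'].map(education_order)

# Nominal column: one indicator column per category, so no order is implied
encoded = pd.get_dummies(sample, columns=['Field of Study'], dtype=int)
print(encoded)
```

The trade-off is width: each nominal column with ten categories becomes ten indicator columns.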