# Career Transition Analytics

This notebook demonstrates an end‑to‑end exploratory and predictive analytics project for a synthetic career transition dataset. We will:

- Explore the structure and summary statistics of the data.
- Visualize key variables to understand distributions and relationships.
- Build predictive models to estimate the probability of a successful career transition.
- Build a regression model to predict the salary change percentage for an individual transitioning to a new role.

The dataset contains 500 synthetic records with the following columns (features and targets):

| Column | Description |
|-------|-------------|
| **Age** | Age of the individual in years |
| **Education** | Highest education level (High School, Bachelor, Master, PhD) |
| **YearsExperience** | Total years of work experience |
| **CurrentIndustry** | Industry of the current job |
| **NewIndustry** | Industry of the desired/transition job |
| **SalaryDiffPct** | Percentage change in salary (positive means an increase) |
| **Certification** | 1 if the individual holds a professional certification, 0 otherwise |
| **SoftSkillsScore** | Soft skills score on a scale of 1–10 |
| **HardSkillsScore** | Technical skills score on a scale of 1–10 |
| **ManagerialExperience** | 1 if the individual has managerial experience, 0 otherwise |
| **ProgramInvolvement** | 1 if the individual has led or participated in cross-functional programs, 0 otherwise |
| **TimeToTransitionMonths** | Number of months it took to transition to the new role |
| **SuccessfulTransition** | 1 if the transition was successful, 0 otherwise |



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, r2_score

# Configure plotting
sns.set(style='whitegrid')
%matplotlib inline


In [None]:
# Load the synthetic dataset
file_path = 'career_transition_dataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows
print('Data preview:')
df.head()


In [None]:
# Summary statistics
print('Summary statistics:')
df.describe(include='all')


In [None]:
# Age distribution
plt.figure(figsize=(6,4))
sns.histplot(df['Age'], bins=15, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# Education count plot
plt.figure(figsize=(6,4))
sns.countplot(x='Education', data=df, order=df['Education'].value_counts().index)
plt.title('Education Level Distribution')
plt.xlabel('Education')
plt.ylabel('Count')
plt.show()

# Current industry distribution
plt.figure(figsize=(8,4))
sns.countplot(y='CurrentIndustry', data=df, order=df['CurrentIndustry'].value_counts().index)
plt.title('Current Industry Distribution')
plt.xlabel('Count')
plt.ylabel('Industry')
plt.show()

# Correlation heatmap for numeric features
numeric_cols = ['Age', 'YearsExperience', 'SalaryDiffPct', 'SoftSkillsScore', 'HardSkillsScore', 'TimeToTransitionMonths']
plt.figure(figsize=(8,6))
sns.heatmap(df[numeric_cols + ['SuccessfulTransition']].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# Convert categorical variables using one-hot encoding
cat_cols = ['Education', 'CurrentIndustry', 'NewIndustry']
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# Features and target
X = df_encoded.drop(['SuccessfulTransition'], axis=1)
y = df_encoded['SuccessfulTransition']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Work on explicit copies so the column assignments below do not raise SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize numeric features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
num_cols = ['Age', 'YearsExperience', 'SalaryDiffPct', 'SoftSkillsScore', 'HardSkillsScore', 'ManagerialExperience', 'ProgramInvolvement', 'TimeToTransitionMonths']
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predictions and evaluation
y_pred_log = log_reg.predict(X_test)
print('Logistic Regression Accuracy:', accuracy_score(y_test, y_pred_log))
print('\nClassification Report (Logistic Regression):\n', classification_report(y_test, y_pred_log))


In [None]:
# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
rf_clf.fit(X_train, y_train)

# Predictions and evaluation
y_pred_rf = rf_clf.predict(X_test)
print('Random Forest Accuracy:', accuracy_score(y_test, y_pred_rf))
print('\nClassification Report (Random Forest):\n', classification_report(y_test, y_pred_rf))


In [None]:
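# Predict SalaryDiffPct with a linear regression

# Drop the regression target (SalaryDiffPct) from the features. The classification
# label (SuccessfulTransition) is dropped as well so it cannot leak outcome information.
features_reg = df_encoded.drop(['SalaryDiffPct', 'SuccessfulTransition'], axis=1)
target_reg = df_encoded['SalaryDiffPct']

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(features_reg, target_reg, test_size=0.3, random_state=42)
X_train_reg, X_test_reg = X_train_reg.copy(), X_test_reg.copy()

# Standardize numeric columns with a separate scaler. SalaryDiffPct is the target here,
# so it is no longer a feature column and must be left out of the list.
num_cols_reg = [col for col in num_cols if col != 'SalaryDiffPct']
scaler_reg = StandardScaler()
X_train_reg[num_cols_reg] = scaler_reg.fit_transform(X_train_reg[num_cols_reg])
X_test_reg[num_cols_reg] = scaler_reg.transform(X_test_reg[num_cols_reg])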

lin_reg = LinearRegression()
lin_reg.fit(X_train_reg, y_train_reg)

# Predictions and evaluation
y_pred_reg = lin_reg.predict(X_test_reg)
print('Linear Regression R^2 score:', r2_score(y_test_reg, y_pred_reg))


## Conclusions

In this project, we explored a synthetic career transition dataset and built three predictive models. Key takeaways:

- The histogram, count plots, and correlation heatmap show how age, education, industry, and skill scores are distributed and how the numeric features relate to `SuccessfulTransition`.
- Logistic regression and a random forest classifier were evaluated on the same 30% hold-out set. Compare the accuracy and classification reports printed by those two cells to see which performs better on this dataset.
- A simple linear regression provides a baseline R² for predicting the percentage salary change; more flexible models may improve on it.
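
A single train-test split can make one classifier look better than another by chance, so a cross-validated comparison is often more informative. The sketch below is an illustration rather than the notebook's own code: it uses `make_classification` as stand-in data (an assumption, so that it runs without `career_transition_dataset.csv`), and the names `X_demo`, `y_demo`, and `cv_scores` are hypothetical. Wrapping the scaler in a pipeline means it is refit inside each fold, so no test-fold information reaches the scaler.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 500 rows, generated here instead of loading the CSV.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=42
)

# Same model settings as in the notebook cells; the scaler lives inside the pipeline.
models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
}

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    cv_scores[name] = scores
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```

To apply this to the real data, `X_demo` and `y_demo` would be replaced by the unscaled `X` and `y` from the encoding step.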

Feel free to experiment with the models, feature engineering, and hyperparameter tuning to further enhance the predictive power.
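
As one possible starting point for hyperparameter tuning, the sketch below runs a small grid search over the random forest. It again uses `make_classification` as stand-in data (an assumption), the variable names are hypothetical, and the grid values are illustrative rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption): generated here so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    'n_estimators': [50, 200],
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 5],
}

# Each candidate is scored with 3-fold cross-validation on the training set only;
# the best one is then refit on the full training set (refit=True is the default).
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', round(search.best_score_, 3))
print('Hold-out accuracy of the refit model:', round(search.score(X_te, y_te), 3))
```

The hold-out set is touched only once, after the search has finished, so its accuracy remains an honest estimate for the selected model.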
