# Salary Prediction Challenge

## Table of Contents

1. [Introduction and Objectives](#introduction)
   - Project Overview
   - Goals and Success Criteria

2. [Data Description](#data-description)
   - Dataset Overview
   - Feature Descriptions
   - Initial Data Quality Assessment

3. [Exploratory Data Analysis](#eda)
   - Univariate Analysis
   - Bivariate Analysis
   - Correlation Analysis
   - Key Insights

4. [Data Preprocessing](#preprocessing)
   - Handling Missing Values
   - Feature Encoding
   - Data Cleaning Steps

5. [Feature Engineering](#features)
   - Feature Creation
   - Feature Selection
   - Feature Importance Analysis

6. [Baseline Model](#baseline)
   - Model Implementation
   - Performance Metrics
   - Confidence Intervals

7. [Main Model](#main-model)
   - Model Selection
   - Hyperparameter Tuning
   - Model Training
   - Performance Evaluation

8. [Model Comparison](#comparison)
   - Performance Metrics Comparison
   - Statistical Tests
   - Visualization of Results

9. [Optional Features Implementation](#optional)
   - Advanced Cross-validation
   - Feature Importance Analysis
   - Model Interpretability

10. [Conclusions and Recommendations](#conclusions)
    - Key Findings
    - Model Performance Summary
    - Future Improvements

## 1. Introduction and Objectives <a name="introduction"></a>

This project develops a machine learning model that predicts average salary (`avg_salary`) from job-related features. The model is intended to help both employers and job seekers understand fair market compensation.

### Project Goals:
- Develop an accurate salary prediction model
- Identify key factors influencing salaries
- Provide reliable confidence intervals for predictions
- Compare performance against a baseline model

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set plotting style
sns.set_palette('husl')

# Add project root to path
import sys
import os
sys.path.append(os.path.abspath('..'))

## 2. Data Description <a name="data-description"></a>

We'll start by loading and examining our dataset to understand its structure and characteristics.

In [1]:
# Load data using our custom module
from src.data.make_dataset import load_data

df = load_data('../data/raw/salary_data.csv')

print('Dataset Overview:')
print('----------------')
print(f'Shape: {df.shape}')
print('\nFeature Information:')
df.info()  # info() prints its report directly and returns None, so it should not be wrapped in print()

print('\nSummary Statistics:')
print(df.describe())

ModuleNotFoundError: No module named 'src'
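
For the initial data quality assessment, a small summary table is often enough to spot problem columns. The sketch below is only an illustration: `summarize_quality` is a hypothetical helper (not part of `src`), and the toy frame stands in for the real dataset, with every column name except `avg_salary` invented for the example.

```python
import numpy as np
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, cardinality."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isna().sum(),
        'pct_missing': (frame.isna().mean() * 100).round(1),
        'n_unique': frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy stand-in for the real data; 'job_title' and 'rating' are illustrative names only
toy = pd.DataFrame({
    'job_title': ['Data Scientist', 'Data Analyst', None, 'Data Scientist'],
    'rating': [3.8, np.nan, 4.1, 3.8],
    'avg_salary': [95.0, 62.5, 110.0, 95.0],
})

quality = summarize_quality(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows after the first occurrence

print(quality)
print(f'Duplicate rows: {n_duplicates}')
```

On the real data, the same call would be `summarize_quality(df)`, assuming `df` loaded successfully.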

## 3. Exploratory Data Analysis <a name="eda"></a>

Let's explore the relationships between variables and identify patterns in our data.

In [None]:
# Correlation analysis
plt.figure(figsize=(12, 8))
numeric_columns = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numeric_columns].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
# Salary distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='avg_salary', bins=30)
plt.title('Distribution of Average Salaries')
plt.xlabel('Average Salary')
plt.ylabel('Count')
plt.show()
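
A heatmap shows every pairwise correlation, but for a prediction task it usually helps to rank features by their correlation with the target alone and to check how skewed the target is. The following is a minimal sketch on synthetic data; `years_experience` and `company_age` are hypothetical feature names, and only `avg_salary` comes from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: salary depends on experience but not on company age
toy = pd.DataFrame({
    'years_experience': rng.uniform(0, 20, n),  # hypothetical feature
    'company_age': rng.uniform(1, 100, n),      # hypothetical feature
})
toy['avg_salary'] = 50 + 4 * toy['years_experience'] + rng.normal(0, 5, n)

# Correlation of each numeric feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)['avg_salary']
    .drop('avg_salary')
    .sort_values(key=np.abs, ascending=False)
)

print(target_corr)
print(f"Skewness of avg_salary: {toy['avg_salary'].skew():.2f}")
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature that a tree-based model could still use.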

## 4. Data Preprocessing <a name="preprocessing"></a>

We'll now preprocess our data using the functions defined in our preprocessing module.

In [None]:
from src.data.preprocess import clean_data

# Clean the data
df_cleaned = clean_data(df.copy())

print('Preprocessing Results:')
print('--------------------')
print(f'Original shape: {df.shape}')
print(f'Cleaned shape: {df_cleaned.shape}')
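
The body of `clean_data` is not shown in this notebook. As a rough idea of what such a step typically covers (duplicates, rows with a missing target, and imputation), here is a hypothetical `clean_data_sketch`. It is an assumption about the kind of cleaning involved, not the project's actual implementation, and the `sector` and `rating` columns are invented for the example.

```python
import numpy as np
import pandas as pd


def clean_data_sketch(frame: pd.DataFrame, target: str = 'avg_salary') -> pd.DataFrame:
    """Hypothetical cleaning routine; NOT the project's clean_data."""
    out = frame.drop_duplicates().copy()
    # Rows without a target value cannot be used for supervised training
    out = out.dropna(subset=[target])

    num_cols = out.select_dtypes(include=[np.number]).columns.drop(target, errors='ignore')
    cat_cols = out.select_dtypes(exclude=[np.number]).columns

    # Median for numeric gaps, an explicit 'unknown' level for categorical gaps
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    out[cat_cols] = out[cat_cols].fillna('unknown')
    return out.reset_index(drop=True)


# Toy data: one duplicate row, one missing target, one missing value per feature
toy = pd.DataFrame({
    'sector': ['Tech', None, 'Finance', 'Tech', 'Tech'],
    'rating': [4.0, 3.0, np.nan, 4.0, 5.0],
    'avg_salary': [100.0, 80.0, 90.0, 100.0, np.nan],
})

cleaned = clean_data_sketch(toy)
print(cleaned)
```

Note that computing the median on the full dataset lets test rows influence the imputed values. A stricter setup would fit the imputation on the training split only.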

## 5. Feature Engineering <a name="features"></a>

Let's create and select relevant features using our feature engineering module.

In [None]:
from src.features.build_features import preprocess_and_engineer_features

# Engineer features
df_featured = preprocess_and_engineer_features(df_cleaned.copy())

print('Feature Engineering Results:')
print('-------------------------')
print(f'Input shape: {df_cleaned.shape}')
print(f'Output shape: {df_featured.shape}')
print('\nNew features created:')
print([col for col in df_featured.columns if col not in df_cleaned.columns])
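
Since `preprocess_and_engineer_features` lives in `src` and is not reproduced here, the sketch below only illustrates the two ideas named in this section, feature creation and encoding. Everything in it is hypothetical: the function name, the input columns (`job_title`, `founded`, `sector`), the derived features, and the `reference_year` default.

```python
import pandas as pd


def engineer_features_sketch(frame: pd.DataFrame, reference_year: int = 2024) -> pd.DataFrame:
    """Hypothetical feature step; NOT the project's preprocess_and_engineer_features."""
    out = frame.copy()

    # Feature creation: derive numeric signals from raw columns
    out['company_age'] = reference_year - out['founded']
    out['is_senior'] = (
        out['job_title']
        .str.contains(r'\b(?:senior|sr)\b', case=False, regex=True, na=False)
        .astype(int)
    )

    # Feature encoding: one-hot encode the categorical column, drop the raw inputs
    out = out.drop(columns=['job_title', 'founded'])
    return pd.get_dummies(out, columns=['sector'], dtype=int)


toy = pd.DataFrame({
    'job_title': ['Senior Data Scientist', 'Data Analyst', 'Sr ML Engineer'],
    'founded': [2010, 1995, 2020],
    'sector': ['Tech', 'Finance', 'Tech'],
    'avg_salary': [120.0, 70.0, 130.0],
})

featured = engineer_features_sketch(toy)
print(featured)
```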

## 6. Baseline Model <a name="baseline"></a>

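We'll implement and evaluate our baseline model using scikit-learn's `DummyRegressor`, which ignores the input features and predicts a constant such as the training-set mean. Any useful model should beat it.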

In [None]:
from src.models.train_model import create_dummy_model
from src.models.evaluate_model import evaluate_model
from sklearn.model_selection import train_test_split

# Split the data
X = df_featured.drop('avg_salary', axis=1)
y = df_featured['avg_salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate baseline model
baseline_model = create_dummy_model(X_train, y_train)
baseline_metrics = evaluate_model(baseline_model, X_test, y_test)

print('Baseline Model Results:')
print('---------------------')
for metric, value in baseline_metrics.items():
    print(f'{metric}: {value}')
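
For readers without access to `src`, here is a minimal sketch of what the two helpers might look like. The function names carry a `_sketch` suffix to mark them as assumptions. The metric keys `rmse`, `r2`, and `mape` match the ones used in the comparison cell of Section 8, while the `strategy='mean'` default is a guess.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score


def create_dummy_model_sketch(X_train, y_train, strategy='mean'):
    """Hypothetical stand-in for create_dummy_model: fit a constant predictor."""
    model = DummyRegressor(strategy=strategy)
    model.fit(X_train, y_train)
    return model


def evaluate_model_sketch(model, X_test, y_test):
    """Hypothetical stand-in for evaluate_model: return a dict of regression metrics."""
    y_pred = model.predict(X_test)
    return {
        'rmse': float(np.sqrt(mean_squared_error(y_test, y_pred))),
        'r2': float(r2_score(y_test, y_pred)),
        # scikit-learn reports MAPE as a fraction (0.15 means 15%)
        'mape': float(mean_absolute_percentage_error(y_test, y_pred)),
    }


# Tiny worked example: the baseline predicts the training mean (70) for every row
X_tr, y_tr = np.zeros((4, 1)), np.array([40.0, 60.0, 80.0, 100.0])
X_te, y_te = np.zeros((2, 1)), np.array([60.0, 80.0])

toy_baseline = create_dummy_model_sketch(X_tr, y_tr)
toy_metrics = evaluate_model_sketch(toy_baseline, X_te, y_te)
print(toy_metrics)
```

In this toy case both test errors are exactly 10, so RMSE is 10, and R² is 0 because the prediction equals the test-set mean.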

## 7. Main Model <a name="main-model"></a>

Now we'll implement our main Random Forest model with hyperparameter tuning.

In [None]:
from src.models.train_model import train_random_forest_model

# Train and evaluate main model
rf_model = train_random_forest_model(X_train, y_train)
rf_metrics = evaluate_model(rf_model, X_test, y_test)

print('Random Forest Model Results:')
print('-------------------------')
for metric, value in rf_metrics.items():
    print(f'{metric}: {value}')
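
The tuning logic inside `train_random_forest_model` is not shown here. One plausible shape, given that this notebook mentions `GridSearchCV`, is sketched below. The function name, the parameter grid, the scoring choice, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split


def train_random_forest_sketch(X_train, y_train, param_grid=None, cv=3):
    """Hypothetical stand-in for train_random_forest_model using a small grid search."""
    if param_grid is None:
        # Deliberately tiny grid so the example runs quickly
        param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}
    search = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',  # higher is better, so RMSE is negated
        cv=cv,
    )
    search.fit(X_train, y_train)
    # With the default refit=True, best_estimator_ is retrained on all of X_train
    return search.best_estimator_, search.best_params_


# Synthetic regression problem standing in for the salary data
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_model, best_params = train_random_forest_sketch(X_tr, y_tr)
test_r2 = toy_model.score(X_te, y_te)
print(f'Best parameters: {best_params}')
print(f'Test R^2: {test_r2:.3f}')
```

Because the search only ever sees the training split, the held-out test set remains an honest estimate of performance after tuning.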

## 8. Model Comparison <a name="comparison"></a>

Let's compare the performance of our baseline and main models.

In [None]:
# Visualize model comparison
metrics = ['rmse', 'r2', 'mape']
models = ['Baseline', 'Random Forest']

for metric in metrics:
    plt.figure(figsize=(8, 6))
    values = [baseline_metrics[metric], rf_metrics[metric]]
    plt.bar(models, values)
    plt.title(f'Model Comparison - {metric.upper()}')
    plt.ylabel(metric.upper())
    plt.show()
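
Bar charts show the size of the gap between models but not whether it could be an artifact of the particular test split. One hedged way to check is a paired bootstrap over test rows, which yields an approximate confidence interval for the RMSE difference. The helper below is hypothetical and runs on synthetic predictions rather than on this project's models.

```python
import numpy as np


def bootstrap_rmse_difference(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Percentile bootstrap interval for RMSE(a) - RMSE(b) on a shared test set."""
    y_true, pred_a, pred_b = (np.asarray(v, dtype=float) for v in (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test rows with replacement
        rmse_a = np.sqrt(np.mean((y_true[idx] - pred_a[idx]) ** 2))
        rmse_b = np.sqrt(np.mean((y_true[idx] - pred_b[idx]) ** 2))
        diffs[i] = rmse_a - rmse_b
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return float(diffs.mean()), float(lower), float(upper)


# Synthetic example: a constant 'baseline' versus a model with small errors
rng = np.random.default_rng(1)
y_true = rng.normal(100, 20, 200)
pred_baseline = np.full_like(y_true, y_true.mean())
pred_model = y_true + rng.normal(0, 5, 200)

mean_diff, ci_lower, ci_upper = bootstrap_rmse_difference(y_true, pred_baseline, pred_model)
print(f'RMSE(baseline) - RMSE(model): {mean_diff:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})')
```

If the interval excludes zero, the improvement is unlikely to be due to test-set sampling alone. With the notebook's variables, the call would take `y_test`, `baseline_model.predict(X_test)`, and `rf_model.predict(X_test)`.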

## 9. Optional Features Implementation <a name="optional"></a>

We've implemented several optional features including:
- Advanced cross-validation with confidence intervals
- Hyperparameter tuning using GridSearchCV
- Feature importance analysis
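
The first item in the list above, cross-validation with confidence intervals, could look roughly like the following. This is a sketch under stated assumptions: the helper name is hypothetical, the data is synthetic, and the interval uses a normal approximation over fold scores, which are not fully independent, so it should be read as indicative rather than exact.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score


def cv_rmse_with_interval(model, X, y, n_splits=5, n_repeats=3, seed=42):
    """Repeated K-fold RMSE with an approximate 95% interval for the mean score."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    # The scorer returns negative RMSE, so flip the sign back
    scores = -cross_val_score(model, X, y, cv=cv, scoring='neg_root_mean_squared_error')
    mean = scores.mean()
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return float(mean), float(mean - half_width), float(mean + half_width), scores


X_toy, y_toy = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)
toy_forest = RandomForestRegressor(n_estimators=50, random_state=42)

mean_rmse, rmse_lower, rmse_upper, fold_scores = cv_rmse_with_interval(toy_forest, X_toy, y_toy)
print(f'CV RMSE: {mean_rmse:.2f} (approx. 95% CI {rmse_lower:.2f} to {rmse_upper:.2f})')
```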

In [None]:
# Feature importance visualization
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Most Important Features')
plt.show()
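
The `feature_importances_` attribute used above is impurity-based, which is computed on training data and can favor high-cardinality features. As a cross-check for model interpretability, scikit-learn's `permutation_importance` measures how much a held-out score drops when each feature is shuffled. The sketch below uses synthetic data with made-up feature names `f0` to `f3`, where only `f0` and `f1` affect the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: f0 matters most, f1 a little, f2 and f3 are pure noise
X_toy = pd.DataFrame(rng.normal(size=(300, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y_toy = 10 * X_toy['f0'] + 2 * X_toy['f1'] + rng.normal(0, 1, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
toy_forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and record the drop in R^2
result = permutation_importance(toy_forest, X_te, y_te, n_repeats=10, random_state=42)

ranking = pd.DataFrame({
    'feature': X_toy.columns,
    'importance_mean': result.importances_mean,
    'importance_std': result.importances_std,
}).sort_values('importance_mean', ascending=False)

print(ranking)
```

Applied to this notebook, the call would take `rf_model`, `X_test`, and `y_test`. Features that rank highly under both methods are the more trustworthy findings.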

## 10. Conclusions and Recommendations <a name="conclusions"></a>

### Key Findings:
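- The Random Forest model significantly outperforms the `DummyRegressor` baseline on the metrics compared in Section 8 (RMSE, R², MAPE)
- The feature importance analysis identifies the most important features for salary prediction
- The model provides reliable predictions with confidence intervals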

### Future Improvements:
- Collect more data for underrepresented categories
- Experiment with other advanced models
- Implement real-time model updates
- Add more domain-specific features