# Salary Prediction Model - Data Science Challenge

**Table of Contents**
1. [Introduction](#introduction)
2. [Data Loading and Exploration](#data-loading-and-exploration)
3. [Data Cleaning and Preprocessing](#data-cleaning-and-preprocessing)
4. [Feature Engineering](#feature-engineering)
5. [Model Training](#model-training)
6. [Model Evaluation](#model-evaluation)
7. [Conclusion](#conclusion)

<a id='introduction'></a>
## 1. Introduction

This notebook details the development of a predictive model for employee salaries. The goal is to forecast an individual's salary based on a dataset containing features like job title, company information, and employee demographics. This project is structured as a technical challenge for a Data Scientist role, emphasizing best practices in data science and machine learning.

<a id='data-loading-and-exploration'></a>
## 2. Data Loading and Exploration

First, we load the dataset and explore its basic structure and characteristics.

In [None]:
import pandas as pd
from src.data_loading import load_data

# Load the dataset
filepath = '../salary_prediction/data/salary_data.csv'
df = load_data(filepath)

# Display the first few rows and dataset info
if df is not None:
    print("\nDataset Head:")
    print(df.head())
    print("\nDataset Information:")
    # info() prints its report directly and returns None, so it is not wrapped in print()
    df.info()
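
The `load_data` helper lives in `src.data_loading` and is not shown in this notebook. The `if df is not None` guard suggests it returns `None` when loading fails. A minimal sketch of a loader with that behavior, under that assumption, might look like the following; the name `load_data_sketch` and the error handling are hypothetical, not the project's actual implementation.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_data_sketch(filepath: str) -> Optional[pd.DataFrame]:
    """Hypothetical stand-in for src.data_loading.load_data.

    Returns a DataFrame on success and None on failure, which is the
    contract the notebook's `if df is not None` checks appear to rely on.
    """
    path = Path(filepath)
    if not path.is_file():
        print(f"File not found: {path}")
        return None
    try:
        return pd.read_csv(path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # Empty or malformed CSV files are reported rather than raised.
        print(f"Could not parse {path}: {exc}")
        return None
```

Returning `None` keeps the later cells simple, at the cost of hiding the original exception; raising instead would be an equally reasonable design.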

<a id='data-cleaning-and-preprocessing'></a>
## 3. Data Cleaning and Preprocessing

In this step, we handle missing data, remove duplicates, and prepare the data for feature engineering.

In [None]:
from src.data_preprocessing import clean_data

# Clean the data
if df is not None:
    cleaned_df = clean_data(df.copy())
    print("\nCleaned Data Head:")
    print(cleaned_df.head())
    print("\nCleaned Data Info:")
    # info() prints its report directly and returns None
    cleaned_df.info()
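
The body of `clean_data` is not shown here. As a rough illustration of the two steps named above (handling missing data and removing duplicates), a sketch could look like this. The target column name `salary`, the median and `"Unknown"` fill values, and the function name are all assumptions made for the example.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame, target: str = "salary") -> pd.DataFrame:
    """Hypothetical cleaning step: drop duplicates, then handle missing values."""
    # Remove exact duplicate rows, then rows with no target value to learn from.
    out = df.drop_duplicates().dropna(subset=[target]).copy()
    for col in out.columns:
        if col == target:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric gaps: fill with the column median (an assumed choice).
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical gaps: fill with an explicit placeholder label.
            out[col] = out[col].fillna("Unknown")
    return out.reset_index(drop=True)
```

Filling with statistics computed on the full dataset leaks a little information from the test rows; a stricter pipeline would compute the fill values on the training split only.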

<a id='feature-engineering'></a>
## 4. Feature Engineering

Here, we engineer new features to potentially improve model performance. This includes creating an 'experience_level' feature from job titles.

In [None]:
from src.data_preprocessing import engineer_features

# Engineer features
if df is not None:
    engineered_df = engineer_features(cleaned_df.copy())
    print("\nEngineered Data Head:")
    print(engineered_df.head())
    print("\nEngineered Data Info:")
    # info() prints its report directly and returns None
    engineered_df.info()
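
How `engineer_features` derives `experience_level` from job titles is not shown in this notebook. One plausible approach is keyword matching on the title text, sketched below. The column name `job_title`, the keyword sets, and the three level labels are assumptions chosen for illustration.

```python
import re

import pandas as pd

# Hypothetical keyword sets; a real project would tune these to its data.
LEVEL_KEYWORDS = {
    "senior": {"senior", "sr", "lead", "principal", "director"},
    "junior": {"junior", "jr", "intern", "trainee"},
}


def infer_experience_level(title: object) -> str:
    """Map a job title to 'senior', 'junior', or 'mid' by whole-word match."""
    # Tokenize into words so that 'intern' does not match 'Internal'.
    tokens = set(re.findall(r"[a-z]+", str(title).lower()))
    for level, keywords in LEVEL_KEYWORDS.items():
        if tokens & keywords:
            return level
    return "mid"


def add_experience_level(df: pd.DataFrame, title_col: str = "job_title") -> pd.DataFrame:
    """Return a copy of df with an 'experience_level' column added."""
    out = df.copy()
    out["experience_level"] = out[title_col].map(infer_experience_level)
    return out
```

Titles with no matching keyword, including missing titles, fall back to `"mid"` in this sketch.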

<a id='model-training'></a>
## 5. Model Training

We train a RandomForestRegressor model and a DummyRegressor as a baseline for comparison.

In [None]:
from src.model_training import train_random_forest_model, create_dummy_model
from src.data_preprocessing import split_data

# Split data into training and testing sets
if df is not None:
    X_train, X_test, y_train, y_test = split_data(engineered_df.copy())

    # Train models
    rf_model = train_random_forest_model(X_train, y_train)
    dummy_model = create_dummy_model(X_train, y_train)

    print("\nRandom Forest Model:", rf_model)
    print("\nDummy Model:", dummy_model)
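
The helpers `split_data`, `train_random_forest_model`, and `create_dummy_model` are imported from `src` and not shown. A sketch of what they might do with scikit-learn follows. The target column `salary`, the 80/20 split, one-hot encoding via `pd.get_dummies`, the hyperparameters, and the mean-predicting baseline are all assumptions.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def split_data_sketch(df: pd.DataFrame, target: str = "salary",
                      test_size: float = 0.2, seed: int = 42):
    """Hypothetical split: one-hot encode features, then hold out a test set."""
    X = pd.get_dummies(df.drop(columns=[target]))
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed)


def train_models_sketch(X_train, y_train, seed: int = 42):
    """Fit a random forest and a mean-predicting baseline on the same data."""
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(X_train, y_train)
    # The baseline ignores the features and always predicts the training mean.
    dummy = DummyRegressor(strategy="mean")
    dummy.fit(X_train, y_train)
    return rf, dummy
```

Training both models on the same split means any gap between their test scores reflects what the features contribute beyond the average salary.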

<a id='model-evaluation'></a>
## 6. Model Evaluation

We evaluate both trained models on the held-out test set, reporting each metric with a confidence interval so that the comparison reflects sampling uncertainty.

In [None]:
from src.model_evaluation import evaluate_model

# Evaluate models
if df is not None:
    print("\nEvaluating Random Forest Model:")
    rf_metrics = evaluate_model(rf_model, X_test, y_test)

    print("\nEvaluating Dummy Model:")
    dummy_metrics = evaluate_model(dummy_model, X_test, y_test)
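
The notebook does not show how `evaluate_model` computes its confidence intervals. One common option is the percentile bootstrap, sketched here for mean absolute error (MAE). The choice of MAE, the bootstrap method, and the parameters `n_boot`, `alpha`, and `seed` are assumptions rather than a description of the project's code.

```python
import numpy as np


def bootstrap_mae_ci(y_true, y_pred, n_boot: int = 1000,
                     alpha: float = 0.05, seed: int = 0):
    """Return (mae, lower, upper) using a percentile bootstrap over test rows."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    rng = np.random.default_rng(seed)
    # Each bootstrap replicate resamples the test rows with replacement.
    idx = rng.integers(0, len(abs_err), size=(n_boot, len(abs_err)))
    boot_maes = abs_err[idx].mean(axis=1)
    lower, upper = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return float(abs_err.mean()), float(lower), float(upper)
```

With fitted models in hand, a call such as `bootstrap_mae_ci(y_test, rf_model.predict(X_test))` would give the point estimate and a 95% interval. If the intervals for the Random Forest and the Dummy baseline do not overlap, that is reasonable evidence the gap is not a sampling artifact.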

<a id='conclusion'></a>
## 7. Conclusion

Summarize the findings, compare the performance of the Random Forest model against the Dummy model, and discuss potential improvements and next steps.