# 2.0 Automobile Sales Classification Model Training

This notebook documents the last stage of the data pipeline: loading the raw data, engineering the features, and training the **Logistic Regression Model** that classifies "Successful Sales" for the dashboard visualization.

In [None]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import os

# --- PATH SETUP ---
# We use the parent directory (..) to navigate from 'notebooks/' to 'data/'
PROCESSED_DATA_PATH = '../data/processed/cleaned_data.csv'
RAW_DATA_PATH = '../data/raw/auto_sales_raw.csv'
MODEL_PATH = '../data/model/classification_model.pkl'

# Load the raw data file (assumed to have been created by setup_pipeline_data.py)
try:
    df = pd.read_csv(RAW_DATA_PATH)
    print(f"Data loaded successfully from {RAW_DATA_PATH}")
except FileNotFoundError:
    # Re-raise instead of calling exit(), which would shut down the notebook kernel
    print("ERROR: Raw data file not found. Please ensure the data setup script has been executed.")
    raise

print("\nInitial Data Head:")
print(df.head())
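
Before any feature engineering, it can help to confirm that the loaded file carries the columns the later cells read. The sketch below is one possible check, not part of the original pipeline: the helper `missing_columns` and the list `REQUIRED_COLUMNS` are hypothetical names, the column list is read off this notebook's code, and the toy frame holds made-up values.

```python
import pandas as pd

# Columns read by the later cells (an assumption taken from this notebook's code).
REQUIRED_COLUMNS = [
    "Manufacturer", "Region", "TimePeriod",
    "SalesVolume", "Price_k", "Success_Category",
]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for the raw CSV; values are invented for illustration.
toy = pd.DataFrame({
    "Manufacturer": ["A", "B"],
    "Region": ["North", "South"],
    "SalesVolume": [120, 80],
    "Price_k": ["25.5", "31.0"],
})

absent = missing_columns(toy)
if absent:
    print(f"Missing columns: {absent}")
```

Running the same check on `df` right after `pd.read_csv` would surface a schema mismatch early, rather than as a `KeyError` partway through the encoding cell.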

## Feature Engineering and Target Variable Encoding

We must prepare all categorical features (Manufacturer, Region, TimePeriod) for the ML model by encoding them numerically. Critically, we encode the `Success_Category` target variable into binary form (0 or 1).

In [None]:
# Convert Price to a numerical type
df['Price_k'] = df['Price_k'].astype(float)

# --- Feature Engineering ---
# 1. Encode categorical features (Manufacturer, Region, TimePeriod)
# Use one LabelEncoder per column so each fitted mapping stays available for decoding
encoders = {}
for col in ['Manufacturer', 'Region', 'TimePeriod']:
    encoders[col] = LabelEncoder()
    df[f'{col}_Encoded'] = encoders[col].fit_transform(df[col])

# 2. Encode the target variable (Success_Category) into binary (1 for Success, 0 for Unsuccessful)
df['Success_Target'] = df['Success_Category'].apply(
    lambda x: 1 if x == 'Successful Sales' else 0
)

# Define the final DataFrame used by the dashboard
# ('Success_Category' matches the column encoded above; .copy() avoids a view of df)
cleaned_df = df[[
    'Manufacturer', 'Region', 'TimePeriod', 'SalesVolume',
    'Price_k', 'Success_Category', 'Success_Target'
]].copy()

print("Features Engineered and Target Variable Encoded.")
print("\nCleaned Data Sample:")
print(cleaned_df.sample(5))
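
The encoding step can be tried on a small frame to see what the integer codes look like and how they map back to the original labels. This is a minimal sketch with invented rows; the label `"Unsuccessful"` is a placeholder for any value other than `"Successful Sales"`, and the `encoders` dictionary is an illustrative way to keep one fitted `LabelEncoder` per column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Manufacturer": ["Toyota", "Ford", "Toyota", "BMW"],
    "Region": ["West", "East", "East", "West"],
    "Success_Category": [
        "Successful Sales", "Unsuccessful", "Successful Sales", "Unsuccessful",
    ],
})

# One encoder per column, so each mapping can be inverted later.
encoders = {}
for col in ["Manufacturer", "Region"]:
    enc = LabelEncoder()
    toy[f"{col}_Encoded"] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Binary target: 1 for "Successful Sales", 0 for everything else.
toy["Success_Target"] = (toy["Success_Category"] == "Successful Sales").astype(int)

# Recover the original manufacturer names from the integer codes.
decoded = encoders["Manufacturer"].inverse_transform(toy["Manufacturer_Encoded"])

print(toy[["Manufacturer", "Manufacturer_Encoded", "Success_Target"]])
print(list(decoded))
```

`LabelEncoder` assigns codes in sorted label order, so the integers carry no business meaning. Keeping the fitted encoders around is what lets a consumer translate codes back into names.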

## Training the Logistic Regression Model

We split the data and train the **Logistic Regression** model. This model's success rate prediction is displayed on the dashboard's KPI card and in the pie chart.

In [None]:
# Define Features (X) and Target (y)
features = ['SalesVolume', 'Price_k', 'Manufacturer_Encoded', 'Region_Encoded', 'TimePeriod_Encoded']
X = df[features]
y = df['Success_Target']

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression Model
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Training Complete.")
print(f"Logistic Regression Model Accuracy: {accuracy:.4f}")
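
Accuracy alone can be hard to interpret when one class dominates, so a common sanity check is to compare it against a majority-class baseline and to look at the confusion matrix. The sketch below does this on synthetic data rather than the sales data. The names `X_demo`, `y_demo`, `baseline`, and `clf` are illustrative, and the `stratify` argument is an addition that the original split does not use.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced binary problem (not the sales data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(solver="liblinear", random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, clf.predict(X_te))

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")
print(cm)
```

If the model's accuracy on the real data sits close to the baseline, the headline number mostly reflects class balance rather than anything the model learned.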

## Saving Artifacts

The final, crucial step is saving both the processed data and the trained model artifact (`.pkl` file). These are the exact files loaded by the Dash application (`app/app.py`).

In [None]:
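# Make sure the output folders exist before writing
os.makedirs(os.path.dirname(PROCESSED_DATA_PATH), exist_ok=True)
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)

# Save the cleaned DataFrame to the processed data folder
cleaned_df.to_csv(PROCESSED_DATA_PATH, index=False)
print(f"Cleaned data saved to: {PROCESSED_DATA_PATH}")

# Save the trained model using pickle
with open(MODEL_PATH, 'wb') as file:
    pickle.dump(model, file)
print(f"Trained ML Model saved to: {MODEL_PATH}")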
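
A quick way to confirm that the saved artifact is usable is to load it back and check that it predicts the same as the in-memory model. The sketch below does a round trip through a temporary directory with a tiny model trained on made-up points. How `app/app.py` actually loads the file is not shown in this notebook, so the `pickle.load` call here is an assumption about a typical consumer.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset, only to have a fitted model to serialize.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y_demo = np.array([0, 1, 0, 1])
clf = LogisticRegression(solver="liblinear", random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the 'model/' subfolder layout; the directory itself is temporary.
    model_path = os.path.join(tmp_dir, "model", "classification_model.pkl")
    os.makedirs(os.path.dirname(model_path), exist_ok=True)

    with open(model_path, "wb") as fh:
        pickle.dump(clf, fh)

    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(f"Round trip preserved predictions: {same_predictions}")
```

Pickled scikit-learn models are generally only reliable when loaded with the same library version that saved them, and pickle files should only be loaded from trusted sources.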