## Dataset Description

This dataset is used to predict the price of used vehicles. The original dataset contains the following features:

- **Car_Name**: Name of the vehicle.  
- **Year**: Year of manufacture.  
- **Selling_Price**: Final selling price of the vehicle.  
- **Present_Price**: Current market price of the vehicle (the target variable in the code below).  
- **Driven_kms**: Total kilometers driven.  
- **Fuel_Type**: Type of fuel used (e.g., Petrol, Diesel, CNG).  
- **Selling_type**: Type of seller (Dealer or Individual).  
- **Transmission**: Type of transmission (Manual or Automatic).  
- **Owner**: Number of previous owners.  

The dataset has already been split into training and testing sets, which are available in the `files/input/` directory.

The following sections describe the methodology used to build the predictive model.



## Design Decisions

### 1. Handling Missing and Duplicate Values

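Rows containing missing values were removed so that every remaining record is complete.  
Exact duplicate rows were also dropped so that repeated records are not counted more than once during training. In the run shown in this document, this step reduced the training set from 211 to 210 rows and left the 90 test rows unchanged.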
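
As a minimal sketch of this cleaning step, the snippet below counts missing values and duplicates on a small toy frame before dropping them. The values are hypothetical and only illustrate the mechanics; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one incomplete row and one exact duplicate
df = pd.DataFrame({
    "Year": [2014, 2014, 2017, np.nan],
    "Present_Price": [5.59, 5.59, 9.85, 4.15],
    "Fuel_Type": ["Petrol", "Petrol", "Diesel", "Petrol"],
})

missing_cells = int(df.isna().sum().sum())   # total number of missing cells
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

# Drop incomplete rows first, then exact duplicates
clean = df.dropna().drop_duplicates().reset_index(drop=True)

print("Missing cells:", missing_cells)
print("Duplicate rows:", duplicate_rows)
print("Rows before:", len(df), "after:", len(clean))
```

Inspecting the counts before dropping makes it easy to see how much data the cleaning step actually removes.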

---

### 2. Feature Engineering: Vehicle Age

Instead of using the raw `Year` feature, a new variable `Age` was created:

`Age = 2021 - Year`

This transformation improves interpretability and captures the real factor influencing depreciation: the vehicle’s age rather than its manufacturing year.

The original `Year` column was removed to avoid redundancy.
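
The transformation can be sketched on a toy frame as follows. `REFERENCE_YEAR` is a name introduced here for illustration, and the rows are hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # reference year used in this document

# Hypothetical rows, only to illustrate the transformation
df = pd.DataFrame({"Year": [2011, 2016, 2020]})

df["Age"] = REFERENCE_YEAR - df["Year"]  # vehicle age in years
df = df.drop(columns=["Year"])           # Year is now redundant

print(df["Age"].tolist())  # [10, 5, 1]
```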

---

### 3. Removal of `Car_Name`

`Car_Name` was excluded because it is a high-cardinality categorical variable: one-hot encoding it would add one sparse column per distinct vehicle name, many of them backed by only a few rows.

Removing it keeps the model simple, at the cost of discarding any model-specific signal the name might carry.
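
One way to check cardinality before deciding to drop a column is to count distinct values and the number of dummy columns they would produce. The sketch below uses hypothetical vehicle names, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample; the names are illustrative only
df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "wagon r", "swift", "ritz"],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Petrol", "Diesel", "Petrol"],
})

# Number of distinct values per column
cardinality = df.nunique()

# Columns that one-hot encoding Car_Name alone would create
car_name_dummies = pd.get_dummies(df[["Car_Name"]]).shape[1]

print(cardinality)
print("Dummy columns for Car_Name:", car_name_dummies)
```

Even in this six-row sample, `Car_Name` would contribute more columns than `Fuel_Type`, and the gap grows with the number of distinct names.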

---

### 4. Logarithmic Transformation of Price Variables

A logarithmic transformation (`log1p`) was applied to:

- `Present_Price`
- `Selling_Price`

#### Why?

Price-related variables are typically right-skewed, meaning:

- A few high-value vehicles distort the distribution
- Variance increases with magnitude

Applying a log transformation:

- Stabilizes variance
- Reduces skewness
- Improves linear model performance
- Makes relationships more linear

`log1p` computes log(1 + x), which is defined at x = 0 and remains accurate for values close to zero. Its inverse, `expm1`, maps values back to the original price scale.
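
The effect on skewness and the round trip through `expm1` can be sketched as follows. The price values are hypothetical and chosen only to be right-skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices, including a zero
prices = pd.Series([0.0, 0.5, 3.0, 5.5, 9.0, 35.0, 92.6])

log_prices = np.log1p(prices)  # log(1 + x), defined at x = 0

print("Skewness before:", prices.skew())
print("Skewness after: ", log_prices.skew())

# expm1 is the exact inverse of log1p
restored = np.expm1(log_prices)
print("Round trip OK:", np.allclose(restored, prices))
```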

---

### 5. Feature/Target Separation

Each of the two pre-split sets (training and testing) was separated into:

- Feature matrix (`X`)
- Target variable (`y`)

This structure ensures:

- Clear separation between predictors and outcome
- Clean pipeline integration
- Evaluation on a test set that plays no part in fitting

In [1]:
import pandas as pd 
import numpy as np
import os
import pickle
import gzip
import json
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, make_scorer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif, f_regression, mutual_info_regression
from sklearn.linear_model import LinearRegression

def carga_y_limpieza(train_path, test_path):
    """
    Loads training and testing datasets, performs cleaning,
    feature engineering, and basic transformations.

    Returns:
        X_train, y_train, X_test, y_test,
        categorical_columns, numerical_columns
    """

    # Load datasets
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)

    print("Train shape:", train_df.shape)
    print("Test shape:", test_df.shape)

    # ---------------------------
    # Data Cleaning
    # ---------------------------

    # Remove missing values
    train_df.dropna(inplace=True)
    test_df.dropna(inplace=True)

    # Remove duplicate records (if any)
    train_df.drop_duplicates(inplace=True)
    test_df.drop_duplicates(inplace=True)

    # ---------------------------
    # Feature Engineering
    # ---------------------------

    # Create vehicle age assuming reference year = 2021
    train_df["Age"] = 2021 - train_df["Year"]
    test_df["Age"] = 2021 - test_df["Year"]

    # Drop non-informative or redundant columns
    train_df.drop(["Car_Name", "Year"], axis=1, inplace=True)
    test_df.drop(["Car_Name", "Year"], axis=1, inplace=True)

    # ---------------------------
    # Log Transformation
    # ---------------------------

    # Apply log1p transformation to stabilize variance
    # and reduce skewness in price-related features
    price_columns = ["Present_Price", "Selling_Price"]

    for col in price_columns:
        train_df[col] = np.log1p(train_df[col])
        test_df[col] = np.log1p(test_df[col])

    # ---------------------------
    # Feature / Target Separation
    # ---------------------------

    # Target variable: Present_Price
    y_train = train_df["Present_Price"]
    X_train = train_df.drop(columns=["Present_Price"])

    y_test = test_df["Present_Price"]
    X_test = test_df.drop(columns=["Present_Price"])

    # ---------------------------
    # Column Classification
    # ---------------------------

    categorical_columns = ["Fuel_Type", "Selling_type", "Transmission"]
    numerical_columns = ["Selling_Price", "Driven_kms", "Owner", "Age"]

    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)

    return X_train, y_train, X_test, y_test, categorical_columns, numerical_columns

## Modeling Pipeline

A machine learning pipeline was constructed to ensure a structured, reproducible, and scalable modeling workflow. The pipeline integrates preprocessing, feature selection, and model training into a single unified framework.

### 1. Categorical Encoding

Categorical variables were transformed using **One-Hot Encoding**.

This approach:
- Converts categorical features into a numerical representation
- Prevents unintended ordinal relationships
- Allows linear models to interpret categorical information correctly

---

### 2. Feature Scaling

Numerical variables were scaled to the range **[0, 1]** using Min-Max Scaling.

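Scaling was applied to:

- Put all numerical features on a common scale, so that coefficient magnitudes are easier to compare
- Keep the numerical inputs on the same 0 to 1 footing as the one-hot columns
- Keep the pipeline reusable with scale-sensitive estimators; ordinary least squares is solved directly, so scaling does not change its predictions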
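
The sketch below shows Min-Max Scaling fitted on training data and then applied to new data. The arrays are hypothetical (kilometers and age); the point is that the minimum and maximum come from the training set only, so new values can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical columns: kilometers driven, age in years
X_train = np.array([[1000.0, 1.0], [50000.0, 5.0], [100000.0, 11.0]])
X_test = np.array([[120000.0, 3.0]])  # kilometers above the training maximum

scaler = MinMaxScaler().fit(X_train)  # learns min and max from training data only

train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)

print(train_scaled)
print(test_scaled)  # first value exceeds 1 because it is outside the training range
```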

---

### 3. Feature Selection

A **K-best feature selection** method was applied to retain the most informative predictors.

This step:
- Reduces dimensionality
- Minimizes noise
- Helps prevent overfitting
- Improves interpretability

Features were scored with the univariate F-test `f_regression`, and the `k = 10` highest-scoring columns were kept, as set in the parameter grid of the execution cell.
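
As an illustration of K-best selection with a regression score function, the sketch below builds synthetic data in which only two of five columns influence the target and checks which columns `SelectKBest` keeps. The data and the choice of `k=2` are assumptions made for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Synthetic target: only columns 0 and 3 matter, the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = selector.get_support(indices=True)  # indices of the kept columns

print("F-scores:", selector.scores_)
print("Selected columns:", selected)
```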

---

### 4. Model Training

A **Linear Regression** model was used as the final estimator.

Linear regression was selected because:

- It provides interpretability
- It establishes a strong baseline model
- It performs well when relationships are approximately linear
- It allows direct evaluation of feature influence
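
The interpretability argument can be made concrete with a small sketch: on a noiseless synthetic target, the fitted coefficients recover the weights used to generate the data. The weights below are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))

# Noiseless synthetic target with known weights
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

model = LinearRegression().fit(X, y)

print("Coefficients:", model.coef_)    # close to [2.0, -0.5]
print("Intercept:", model.intercept_)  # close to 1.5
```

Each coefficient is the change in the prediction for a one-unit change in that feature, holding the others fixed.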

---

## Why Use a Pipeline?

The pipeline structure ensures:

- Clean separation of preprocessing and modeling logic
- Prevention of data leakage
- Reproducibility
- Easier deployment and scalability
- Compatibility with cross-validation and hyperparameter tuning

By integrating all steps into a single pipeline, the workflow becomes robust, modular, and production-oriented.
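
The leakage point can be sketched with `cross_val_score`: because the scaler, encoder, and selector live inside the pipeline, they are refitted on the training part of every fold and never see the validation rows. The data here is synthetic, and the reduced column set and `k=2` are assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic data with a subset of the column names used in this document
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "Fuel_Type": rng.choice(["Petrol", "Diesel"], size=n),
    "Age": rng.integers(1, 15, size=n).astype(float),
    "Driven_kms": rng.uniform(1_000, 100_000, size=n),
})
y = 10.0 - 0.4 * X["Age"] + rng.normal(scale=0.2, size=n)

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type"]),
        ("numerical", MinMaxScaler(), ["Age", "Driven_kms"]),
    ])),
    ("feature_selection", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),
])

# Every preprocessing step is refitted inside each fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```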

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression


def build_pipeline(estimator, categorical_columns=None, numeric_columns=None):
    """
    Builds a complete machine learning pipeline including:
    - Categorical encoding
    - Numerical scaling
    - Feature selection
    - Model training

    Parameters:
        estimator: sklearn-compatible model
        categorical_columns: list of categorical feature names
        numeric_columns: list of numerical feature names

    Returns:
        sklearn Pipeline object
    """

    # ---------------------------
    # Preprocessing Stage
    # ---------------------------

    # Apply One-Hot Encoding to categorical features
    # Apply Min-Max Scaling to numerical features
    preprocessor = ColumnTransformer(
        transformers=[
            ("categorical",
             OneHotEncoder(handle_unknown="ignore"),
             categorical_columns),

            ("numerical",
             MinMaxScaler(),
             numeric_columns)
        ]
    )

    # ---------------------------
    # Full Pipeline
    # ---------------------------

    pipeline = Pipeline(steps=[

        # Step 1: Feature preprocessing
        ("preprocessor", preprocessor),

        # Step 2: Feature selection (top K best features)
        # f_regression is set explicitly because the SelectKBest default,
        # f_classif, is meant for classification targets
        ("feature_selection", SelectKBest(score_func=f_regression)),

        # Step 3: Final estimator (e.g., LinearRegression)
        ("model", estimator)
    ])

    return pipeline

## Hyperparameter Optimization

Model hyperparameters were selected with a grid search scored by **10-fold cross-validation**, which gives a more stable performance estimate than a single train/validation split.

In each iteration:
- The model was trained on 9 folds
- Validated on the remaining fold
- Final performance was averaged across all splits
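
The fold sizes can be checked with a short sketch. It assumes 210 training rows, matching the `X_train` shape printed in this document, and uses `KFold` without shuffling, which is what an integer `cv` resolves to for a regressor.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(210).reshape(-1, 1)  # placeholder rows; only the count matters

kfold = KFold(n_splits=10)
sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kfold.split(X)]

for i, (n_train, n_val) in enumerate(sizes, start=1):
    print(f"Fold {i}: train={n_train}, validation={n_val}")
```

Each of the ten iterations trains on 189 rows and validates on the remaining 21.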

### Evaluation Metric

Performance was measured using **Mean Absolute Error (MAE)**.

MAE was selected because:
- It is expressed in the same unit as the target variable (here the `log1p`-transformed price, so errors are in log units)
- It is less sensitive to outliers than squared-error metrics
- It weights all deviations linearly, so it reads as a typical prediction error

This approach bases model selection on out-of-fold performance rather than on the training fit.
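
Because the price columns are `log1p`-transformed in this document, an MAE computed on the transformed target is in log units. The sketch below, using hypothetical prices, contrasts that with the MAE on the original scale, which is what one would get after mapping predictions back with `expm1`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices in original units
y_true = np.array([2.5, 6.0, 11.0, 30.0])
y_pred = np.array([3.0, 5.5, 12.5, 24.0])

# MAE as the model sees it: on the log1p scale
mae_log = mean_absolute_error(np.log1p(y_true), np.log1p(y_pred))

# MAE in original price units
mae_original = mean_absolute_error(y_true, y_pred)

print("MAE (log units):     ", mae_log)
print("MAE (original units):", mae_original)
```

The two numbers are not comparable, so it helps to state which scale a reported MAE refers to.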


In [3]:
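def make_grid_search(estimator, param_grid, cv=10):
    """Wrap an estimator in a grid search scored by negative MAE."""

    from sklearn.model_selection import GridSearchCV

    grid_search = GridSearchCV(
        estimator=estimator,
        param_grid=param_grid,
        cv=cv,  # number of folds; without it GridSearchCV defaults to 5
        scoring="neg_mean_absolute_error",
        n_jobs=-1,
        refit=True,
    )

    return grid_search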

## Model Persistence

Once the final model was trained and validated, it was serialized and stored for future use.

The model was saved in compressed format using `gzip`:

In [4]:
def save_estimator(estimator):

    os.makedirs("../files/models", exist_ok=True)
    with gzip.open("../files/models/model.pkl.gz", "wb") as f:
        pickle.dump(estimator, f)
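
Loading the model back mirrors the saving step. The sketch below is a self-contained round trip that writes to a temporary directory rather than to the path used above, and it uses a tiny stand-in model fitted on made-up points.

```python
import gzip
import os
import pickle
import tempfile

from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up points lying on y = 2x + 1
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl.gz")

    # Save in compressed form
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

prediction = restored.predict([[3.0]])
print(prediction)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.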

## Model Evaluation and Metrics Export

The final model was evaluated on both training and testing datasets using multiple regression metrics to assess performance and generalization.

### Evaluation Metrics

The following metrics were computed:

- **R² (Coefficient of Determination)** – Measures the proportion of variance explained by the model.
- **Mean Squared Error (MSE)** – Penalizes larger errors more heavily.
- **Mean Absolute Error (MAE)** – Measures average absolute prediction error.

Evaluating both training and test sets allows detection of potential overfitting or underfitting.
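
The three metrics can be computed on a small hand-checkable example. The arrays below are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# Errors are 0.5, 0.0, 0.5 and 1.0
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
r2 = r2_score(y_true, y_pred)              # 1 - 1.5 / 5 = 0.7

print({"r2": r2, "mse": mse, "mae": mae})
```

The single error of 1.0 contributes two thirds of the MSE but only half of the MAE, which is the sense in which MSE penalizes larger errors more heavily.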

---

## Metrics Persistence

All evaluation results were exported to: `files/output/metrics.json`

Each line in the file contains a JSON object with the following structure:

```python
{
    "type": "metrics",
    "dataset": "train" or "test",
    "r2": float,
    "mse": float,
    "mae": float
}
```

In [5]:
def guardar_metricas(best_model, y_train, y_train_pred, y_test, y_test_pred, output_path):
    """Write train and test metrics to output_path, one JSON object per line."""
    train_metrics = {
        'type': 'metrics',
        'dataset': 'train',
        'r2': r2_score(y_train, y_train_pred),
        'mse': mean_squared_error(y_train, y_train_pred),
        'mae': mean_absolute_error(y_train, y_train_pred)
    }

    test_metrics = {
        'type': 'metrics',
        'dataset': 'test',
        'r2': r2_score(y_test, y_test_pred),
        'mse': mean_squared_error(y_test, y_test_pred),
        'mae': mean_absolute_error(y_test, y_test_pred)
    }

    # Create the output directory if it does not exist yet
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    with open(output_path, 'w') as f:
        f.write(json.dumps(train_metrics) + '\n')
        f.write(json.dumps(test_metrics) + '\n')

## Execution

In [6]:
def linear_regression():
    train_path = "../data/raw/train_data.csv"
    test_path = "../data/raw/test_data.csv"

    x_train, y_train, x_test, y_test, categorical_columns, numeric_columns = carga_y_limpieza(train_path, test_path)

    pipeline = build_pipeline(
        estimator=LinearRegression(),
        categorical_columns=categorical_columns,
        numeric_columns=numeric_columns,
    )

    param_grid = {
        'feature_selection__k': [10],
        'feature_selection__score_func': [f_regression], 
    }
        
    grid_search = make_grid_search(
        estimator=pipeline,
        param_grid=param_grid,
        cv=10,
    )

    grid_search.fit(x_train, y_train)

    best_model = grid_search.best_estimator_

    # Persist the fitted GridSearchCV object (it wraps best_estimator_)
    save_estimator(grid_search)
    y_train_pred = best_model.predict(x_train)
    y_test_pred = best_model.predict(x_test)

    print(best_model)


    output_path = "../outputs/metrics.json"
    guardar_metricas(best_model, y_train, y_train_pred, y_test, y_test_pred, output_path)

linear_regression()


Train shape: (211, 9)
Test shape: (90, 9)
X_train shape: (210, 7)
X_test shape: (90, 7)


Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Fuel_Type', 'Selling_type',
                                                   'Transmission']),
                                                 ('numerical', MinMaxScaler(),
                                                  ['Selling_Price',
                                                   'Driven_kms', 'Owner',
                                                   'Age'])])),
                ('feature_selection',
                 SelectKBest(score_func=<function f_regression at 0x115921800>)),
                ('model', LinearRegression())])
