## Car Market Value Estimation

<span style='font-family:Helvetica'>
The purpose of this notebook is to load and clean the dataset, perform feature engineering (i.e. the data cleaning and preprocessing steps), and arrive at an optimized machine learning model that can accurately estimate the value of a car, incorporating factors such as mileage and car age.

Here's a high-level approach to the problem:

<b>Data Cleaning and Preprocessing:</b>
- First, I will clean the data and handle missing values. For example, if 'listing_price' is missing, the record is dropped because it is the target variable. Other missing values will be filled with mean/median imputation or a more sophisticated method such as K-NN imputation, depending on data health (the percentage of missing values).

<b>Feature Engineering:</b>
- Next, I will examine the correlation among the features and create new features that might be relevant to the car's price. For example, the car's age (current year - year), whether the car is luxury or not (based on the 'make'), etc.

<b>Encoding Categorical Variables:</b>
- Then the categorical variables such as 'make', 'model', 'trim', 'dealer_state', etc. need to be converted into a form that can be provided to a machine learning model. I will be using One-Hot Encoding.

<b>Model Building:</b>
- I will start with a simple model like Linear Regression, and then try more complex models like Random Forest, Gradient Boosting, or Neural Networks. I will split the data into a training set and a test set to evaluate the model's performance.

<b>Model Evaluation:</b>
- I will be using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared to evaluate the model's performance.

<b>Hyperparameter Tuning:</b>
- If the baseline model's accuracy is not acceptable, or the model is underfitting or overfitting, I will need to tune the hyperparameters further.
     </span>
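
The cleaning, feature engineering, and encoding steps listed above can be sketched on a small toy table. This is a minimal illustration rather than the notebook's actual pipeline: the rows are made up, the column names mirror the dataset, and the reference year 2023 is an assumption. K-NN imputation could be tried by swapping `SimpleImputer` for `sklearn.impute.KNNImputer` on the numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows, made up for illustration; column names mirror the dataset.
toy = pd.DataFrame({
    "make": ["Honda", "BMW", "Honda", "Toyota", "Ford"],
    "year": [2018, 2020, 2015, 2019, 2012],
    "listing_mileage": [42000.0, np.nan, 88000.0, 30000.0, 120000.0],
    "listing_price": [18500.0, 41000.0, 11000.0, np.nan, 6500.0],
})

# 1. Cleaning: a row without the target cannot be used for training.
toy = toy.dropna(subset=["listing_price"]).copy()

# 2. Feature engineering: age relative to an assumed reference year.
REFERENCE_YEAR = 2023
toy["age"] = REFERENCE_YEAR - toy["year"]

X = toy[["make", "age", "listing_mileage"]]
y = toy["listing_price"]

# 3. Imputation and encoding: median for numeric columns,
#    a constant placeholder plus one-hot encoding for categorical ones.
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "listing_mileage"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["make"]),
])

X_encoded = preprocessor.fit_transform(X)
print(X_encoded.shape)  # 2 numeric columns + one column per distinct make
```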

<span style='font-family:Helvetica'> Now, to address the specific questions about the negative correlation between mileage and price, and about other features that can be incorporated to improve accuracy:
    
- To account for the negative correlation between price and mileage, we can include 'mileage' as a feature in our model. The model will learn the relationship between mileage and price from the data. If the relationship is indeed negative, the model will capture that.
    
- Other factors we can incorporate include the car's age, whether it's used or certified, the dealer's location (some locations might have higher prices), the car's style, exterior and interior color, etc.

<hr style="border:2px solid gray">
</span>
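
To make the mileage point concrete, here is a small sketch on synthetic data. The data-generating formula (price falling by 0.08 per mile and 900 per year of age, plus noise) is an assumption chosen only for illustration and is not taken from the dataset. It shows that a linear model recovers a negative mileage coefficient when the underlying relationship is negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumed ranges, for illustration only).
mileage = rng.uniform(5_000, 150_000, size=n)
age = rng.uniform(0, 15, size=n)

# Assumed relationship: price drops with both mileage and age.
price = 30_000 - 0.08 * mileage - 900 * age + rng.normal(0, 1_000, size=n)

# Pearson correlation between mileage and price.
corr = np.corrcoef(mileage, price)[0, 1]

# Fit a linear model on both features and read off the learned coefficients.
model = LinearRegression().fit(np.column_stack([mileage, age]), price)
mileage_coef, age_coef = model.coef_

print(f"correlation(mileage, price) = {corr:.2f}")
print(f"learned mileage coefficient = {mileage_coef:.4f}")
print(f"learned age coefficient     = {age_coef:.1f}")
```

Both the correlation and the learned mileage coefficient come out negative here, without the sign being specified anywhere in the model.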

In [7]:
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

<span style='font-family:Helvetica'> The `CarPricePredictor` class encapsulates all the steps needed to train a model to predict car prices. It uses a linear regression model by default, which can be swapped for another estimator by assigning to `self.model`.</span>

In [63]:
"""
    Car Price Estimation Model

    This module implements a car price estimation model using a machine learning approach. It preprocesses the car market 
    dataset, performs feature engineering, encodes categorical variables, and trains a regression model to predict car 
    prices.

    Attributes:
        - data_path (str): Path to the raw, pipe-delimited dataset.
        - features (pandas.DataFrame): The feature columns kept after preprocessing.
        - target (pandas.Series): The target values ('listing_price').
        - model (sklearn estimator): The regression model (LinearRegression by default).
        - preprocessor (sklearn.compose.ColumnTransformer): The data preprocessing pipeline.

    Methods:
        - __init__(self, data_path): Initializes the CarPricePredictor object.
        - load_data(self): Loads the raw car market dataset from data_path.
        - preprocess_data(self, df): Performs data cleaning and selects the feature columns.
        - prepare_pipeline(self): Defines the imputation and encoding steps.
        - train_model(self): Trains the model on an 80/20 split and prints the test MSE.
        - predict(self, data): Predicts car prices for the given data.

"""

class CarPricePredictor:
    def __init__(self, data_path):
        self.data_path = data_path
        self.model = LinearRegression()
        self.preprocessor = None
        self.features = None
        self.target = None

    def load_data(self):
        df = pd.read_csv(self.data_path, delimiter='|')
        return df

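    def preprocess_data(self, df):
        # Work on a copy so the caller's DataFrame is left untouched.
        df = df.copy()
        # Derive the car's age; 2023 is the hard-coded reference year.
        df['age'] = 2023 - df['year']
        df = df.drop(columns=['year'])
        # Mark placeholder and infinite values as missing so that the imputers
        # in prepare_pipeline() handle them (filling with 0 would bypass them).
        df = df.replace([np.inf, -np.inf, 'N/a'], np.nan)
        # Rows without these key fields cannot be used.
        df = df.dropna(subset=['make', 'model', 'listing_price', 'age'])
        # Keep only the columns used as features. Note that 'listing_mileage' is
        # in the drop list, so the model as written does not see mileage.
        self.features = df.drop(['vin', 'trim', 'dealer_name', 'dealer_street', 'dealer_city', 'dealer_state', 'dealer_zip',
                                 'listing_mileage', 'used', 'certified', 'style', 'driven_wheels', 'engine', 'fuel_type',
                                 'exterior_color', 'interior_color', 'seller_website', 'first_seen_date',
                                 'last_seen_date', 'dealer_vdp_last_seen_date', 'listing_status','listing_price'], axis=1)
        self.target = df['listing_price']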

    def prepare_pipeline(self):
        numeric_features = self.features.select_dtypes(include=['int64', 'float64']).columns
        categorical_features = self.features.select_dtypes(include=['object']).columns

        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median'))])

        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))])

        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features),
                ('cat', categorical_transformer, categorical_features)])

    def train_model(self):
        X_train, X_test, y_train, y_test = train_test_split(self.features, self.target, test_size=0.2, random_state=0)
        X_train = self.preprocessor.fit_transform(X_train)
        self.model.fit(X_train, y_train)
        X_test = self.preprocessor.transform(X_test)
        y_pred = self.model.predict(X_test)
        print('Model trained successfully. MSE:', mean_squared_error(y_test, y_pred))

    def predict(self, data):
        data = self.preprocessor.transform(data)
        return self.model.predict(data)


In [None]:
# Usage
predictor = CarPricePredictor('raw_dataset.txt')
df = predictor.load_data()
predictor.preprocess_data(df)
predictor.prepare_pipeline()
predictor.train_model()
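
With the class above, trying a different estimator amounts to assigning it before training, e.g. `predictor.model = RandomForestRegressor(...)` followed by `predictor.train_model()`. The sketch below shows the same comparison in a self-contained form, together with the MAE, MSE, and R-squared metrics mentioned in the plan. The synthetic data and the `n_estimators=50` setting are assumptions for illustration, not results from the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic features: mileage and age (assumed ranges, illustration only).
X = np.column_stack([rng.uniform(5_000, 150_000, n), rng.uniform(0, 15, n)])
y = 30_000 - 0.08 * X[:, 0] - 900 * X[:, 1] + rng.normal(0, 1_000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate estimators; any scikit-learn regressor can be dropped in here.
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Because every candidate is scored on the same held-out split, the metrics are directly comparable across models.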

In [65]:
joblib.dump(predictor.model, 'Weights/model.pkl')
joblib.dump(predictor.preprocessor, 'Weights/preprocessor.pkl')


['Weights/preprocessor.pkl']
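
The saved artifacts can later be restored with `joblib.load` and used for inference. The following is a minimal, self-contained sketch of that round trip: it fits a tiny stand-in preprocessor and model on made-up rows, writes them to a temporary `Weights` directory, and checks that the restored objects give the same predictions. The toy data and the temporary directory are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows standing in for the preprocessed features.
train = pd.DataFrame({
    "make": ["Honda", "BMW", "Ford", "Honda"],
    "age": [5, 3, 11, 8],
})
prices = [18500.0, 41000.0, 6500.0, 11000.0]

# One-hot encode 'make' and pass 'age' through unchanged.
preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["make"])],
    remainder="passthrough",
)
model = LinearRegression().fit(preprocessor.fit_transform(train), prices)

with tempfile.TemporaryDirectory() as tmp:
    # joblib.dump does not create missing directories, so create it first.
    weights_dir = Path(tmp) / "Weights"
    weights_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, weights_dir / "model.pkl")
    joblib.dump(preprocessor, weights_dir / "preprocessor.pkl")

    loaded_model = joblib.load(weights_dir / "model.pkl")
    loaded_preprocessor = joblib.load(weights_dir / "preprocessor.pkl")

# New rows must go through the same fitted preprocessor before prediction.
# 'Tesla' was not seen during fitting and is encoded as all zeros.
new_cars = pd.DataFrame({"make": ["Honda", "Tesla"], "age": [2, 1]})
original = model.predict(preprocessor.transform(new_cars))
restored = loaded_model.predict(loaded_preprocessor.transform(new_cars))
print(restored)
```

Saving the preprocessor alongside the model matters because the model only understands the encoded column layout produced by that exact fitted transformer.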