# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

# Required Assignment 11: What drives the car price? \n",
## Using data from the Kaggle identifying the reason for the change in the car price

***This activity focuses on producing the plots using the above mentioned dataset to support my findings***

### Importing necessary libraries

In [1]:
import pandas as pd
import plotly.express as px
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Loading data

In [2]:
dfcarprice = pd.read_csv("data/vehicles.csv")

## Assignment Objectives
  Identifying the most probable reason that drives the car price using CRISP-DM process.
  
## Assessment of Data
  Provided data from car delaership contains used Car price from different Regions for
  various manufacturers, car details like models with year, Cylinders,odometer and condition.
  
  At First look many of the data seems to be null, so , we may not be able to use all the features effectively.
  ### Risks and benefits
  As this limited data will be used for analysis , it pose a risk of providing wrong assesssment to the car dealers.
  As the benefit, we will be able to use modeling techinques and try to come up with other possible results to analyze further.


### verifying 

In [3]:
#print(dfcarprice.info)

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

### Using the provided data, 
    1. Verify and clean Data
    2. Cleanup all null values
    3. Identify features to use with modeling and their correlation to the price variation
   

In [4]:

#print(dfcarprice)

price_counts = pd.DataFrame(dfcarprice['price'])

price_counts['price_cnt'] = dfcarprice['price'].value_counts()
low_count_items = price_counts[price_counts['price_cnt'] < 3]
#print(low_count_items)
remove_lst = low_count_items['price'].tolist()
dfcarprice_filtrd = dfcarprice[~dfcarprice['price'].isin(remove_lst)]
#print(dfcarprice_filtrd)
#dfcarprice_filtrd_srtd = dfcarprice_filtrd.sort_values(by=['price'],ascending=[True])

#print(dfcarprice_filtrd)
#dfcarprice_filtrd_srtd = dfcarprice_filtrd_srtd.set_index('price')
#print(dfcarprice)

In [5]:

dfcarprice_filtrd_srtd =dfcarprice_filtrd.groupby(['region','manufacturer','model','year']).filter(lambda x: len(x) >= 2)
dfcarprice_filtrd_srtd =dfcarprice_filtrd.groupby(['region','manufacturer']).filter(lambda x: len(x) >= 2)

#print(dfcarprice_filtrd_srtd)

In [6]:
dfcarprice_filtrd_srtd =dfcarprice_filtrd_srtd.groupby(['region','price','manufacturer','model','year']).filter(lambda x: len(x) >= 3)

In [7]:
print(dfcarprice_filtrd_srtd)

                id      region  price    year manufacturer             model  \
186     7316852517  birmingham  21250  2002.0         ford       thunderbird   
214     7316535686  birmingham   3399  2006.0        buick          lacrosse   
247     7316163074  birmingham  16902  2020.0       toyota           corolla   
248     7316163069  birmingham  12307  2018.0          kia              soul   
316     7315498556  birmingham   4980  2010.0    chevrolet          traverse   
...            ...         ...    ...     ...          ...               ...   
425423  7309190449   sheboygan  28899  2017.0         ford              edge   
425498  7304111722   sheboygan  28899  2017.0         ford              edge   
425597  7316813658      wausau  33480  2017.0    chevrolet  silverado 2500hd   
425891  7310439607      wausau  33480  2017.0    chevrolet  silverado 2500hd   
426248  7302148581      wausau  33480  2017.0    chevrolet  silverado 2500hd   

        condition    cylinders fuel  od

In [8]:
#dfcarprice_filtrd_srtd = dfcarprice_filtrd_srtd.set_index('price')
dfcarprice_filtrd_srtd = dfcarprice_filtrd_srtd.dropna()



### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

### Trasform all columns with categories into numeric using the 'factorize()'

In [9]:
#dfcarprice_filtrd_srtd['region_n'] = pd.to_numeric(pd.factorize(dfcarprice_filtrd_srtd['region'])[0])
#dfcarprice_filtrd_srtd['manufacturer_n'] = pd.to_numeric(pd.factorize(dfcarprice_filtrd_srtd['manufacturer'])[0])
#dfcarprice_filtrd_srtd['model_n'] = pd.to_numeric(pd.factorize(dfcarprice_filtrd_srtd['model'])[0])
#dfcarprice_filtrd_srtd['condition_n'] = pd.to_numeric(pd.factorize(dfcarprice_filtrd_srtd['condition'])[0])
#dfcarprice_filtrd_srtd['cylinders'] = dfcarprice_filtrd_srtd['cylinders'].str.replace(r'[^\d.]', '', regex=True)
#dfcarprice_filtrd_srtd['cylinders_n'] = pd.to_numeric(pd.factorize(dfcarprice_filtrd_srtd['cylinders'])[0])

In [10]:
#print(dfcarprice_filtrd_srtd)

In [11]:
#sns.jointplot(dfcarprice_filtrd_srtd,x='region_n',y='price',kind='hist',hue='region',height=12)

In [12]:
#sns.relplot(
#    data=dfcarprice_filtrd_srtd,
#  x='region_n',
#    y='price',
#    col='manufacturer',  # Separate plots for each manufacturer
#    hue='model',         # Different colors for each model within a manufacturer
#    style='year',        # Different markers for each year
#    kind='scatter',      # Use scatter plot for individual points
#    col_wrap=1,          # Wrap columns after 2 plots for better layout
#    height=5,            # Height of each facet
#    aspect=1.5           # Aspect ratio of each facet
#)

#plt.suptitle('Price by Region, Manufacturer, Model, and Year', y=1.02)
#plt.tight_layout(rect=[0, 0, 1, 0.98]) # Adjusted layout to prevent title overlap
#plt.show()

### Standard Transformation of Data

In [13]:
dfcarprice_filtrd_srtd.index.name='id'
#print(dfcarprice_filtrd_srtd)

In [14]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
categorical_features = ['region','manufacturer','model','year','condition','cylinders','fuel','title_status','transmission','drive','size','type', 'paint_color', 'state']
numerical_dfcarprice = dfcarprice_filtrd_srtd.drop(columns=categorical_features)

# Fit the encoder to the categorical data and transform it
encoded_features = encoder.fit_transform(dfcarprice_filtrd_srtd[categorical_features])

# Convert the output to a DataFrame with clear column names
encoded_dfcarprice = pd.DataFrame(encoded_features.toarray(), 
                          columns=encoder.get_feature_names_out(categorical_features))
#print("\nEncoded DataFrame:\n", encoded_dfcarprice)

dfcarprice_encoded_all = pd.concat([numerical_dfcarprice,encoded_dfcarprice],axis=1)


In [15]:
#print("\nEncoded DataFrame:\n", dfcarprice_encoded_all)
print("\nEncoded DataFrame:\n", dfcarprice_encoded_all.columns)


Encoded DataFrame:
 Index(['id', 'price', 'odometer', 'VIN', 'region_anchorage / mat-su',
       'region_asheville', 'region_baltimore', 'region_battle creek',
       'region_billings', 'region_boise',
       ...
       'state_nj', 'state_nv', 'state_ny', 'state_oh', 'state_sc', 'state_tn',
       'state_tx', 'state_va', 'state_vt', 'state_wi'],
      dtype='object', length=271)


In [16]:
dfcarprice_encoded_all.dropna()

Unnamed: 0,id,price,odometer,VIN,region_anchorage / mat-su,region_asheville,region_baltimore,region_battle creek,region_billings,region_boise,...,state_nj,state_nv,state_ny,state_oh,state_sc,state_tn,state_tx,state_va,state_vt,state_wi


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [17]:
X = dfcarprice_encoded_all.drop(['price','VIN'], axis=1)
y = dfcarprice_encoded_all['price']

y = y.fillna(y.mode()[0])
X = X.fillna(0)

In [18]:
carfeature_sets = [
    [col for col in X.columns if 'model' in col or'year' in col or 'cylinders' in col ],
    [col for col in X.columns if 'year' in col or 'cylinders' in col or 'transmission_manual'in col ],
    [col for col in X.columns if 'region' in col or 'manufacturer' in col], # Example: all region/manufacturer features
    X.columns.tolist() # All features
]

In [19]:
print(carfeature_sets)

[['model_1500', 'model_1500 reg cab st', 'model_1500 slt', 'model_2500 4x4 megacab', 'model_350z convertible', 'model_a5', 'model_accord cpe', 'model_altima s 2.5', 'model_boxster', 'model_bus', 'model_camry', 'model_cherokee', 'model_civic sedan', 'model_corvette', 'model_cr-v', 'model_e-150', 'model_e-150 transit', 'model_e-class', 'model_e350 econoline cargo', 'model_edge 4dr se', 'model_edge sel', 'model_elantra', 'model_enclave', 'model_equinox', 'model_es350', 'model_escape', 'model_escape xls', 'model_express 3500', 'model_f-150', 'model_f-150 4x4 supercab', 'model_f-150 xlt', 'model_f150 supercab stx 4x4', 'model_f150 supercrew fx4', 'model_f150 xlt 4x4', 'model_f150 xlt supercrew 4x4', 'model_f250 super duty', 'model_f250sd', 'model_f350 super duty', 'model_f450sd', 'model_fiesta se', 'model_forester', 'model_frontier king cab sv', 'model_fusion sel', 'model_gl-class', 'model_gls550', 'model_grand caravan', 'model_grand cherokee', 'model_grand cherokee limited', 'model_gti', '

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

### TRIED BUT NOT USED DUE TO  CONTINUOUS CATEGORICAL VALUES NOT SUPPORTED
N_SPLITS = 4
def get_score(X, y, imputer=None):
    regressor =    RandomForestRegressor(random_state=0)
    if imputer is not None:
        estimator = make_pipeline(imputer, regressor)
    else:
        estimator = regressor
    scores = cross_val_score(
        estimator, X, y, scoring="neg_mean_squared_error", cv=N_SPLITS
    )
    return scores.mean(), scores.std()


#imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
#X[1], y[1] = get_score(X, y, imputer)

In [23]:
#model = LogisticRegression(solver='liblinear')
model = RandomForestClassifier(n_estimators=100, random_state=42)

In [50]:
accuracy = {}
for i, features in enumerate(carfeature_sets):
    X_subset = X[features]
    X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, random_state=42)

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
 # Convert continuous predictions to binary using a threshold (e.g., 0.5)
    threshold = 0.5
    y_pred_binary = (y_pred >= threshold).astype(int)


    accuracy[i] = [features,accuracy_score(y_test, y_pred)]
print(f"Feature Set {i+1} Accuracy: {accuracy[i][1]:.4f}  with features: {accuracy[i][0]}")
    

Feature Set 4 Accuracy: 0.7414  with features: ['id', 'odometer', 'region_anchorage / mat-su', 'region_asheville', 'region_baltimore', 'region_battle creek', 'region_billings', 'region_boise', 'region_brownsville', 'region_buffalo', 'region_central NJ', 'region_charleston', 'region_charlotte', 'region_cleveland', 'region_columbus', 'region_corpus christi', 'region_denver', 'region_fayetteville', 'region_fort collins / north CO', 'region_ft myers / SW florida', 'region_greensboro', 'region_greenville / upstate', 'region_hawaii', 'region_indianapolis', 'region_jacksonville', 'region_kalispell', 'region_knoxville', 'region_las vegas', 'region_long island', 'region_louisville', 'region_madison', 'region_missoula', 'region_modesto', 'region_myrtle beach', 'region_nashville', 'region_ocala', 'region_orlando', 'region_palm springs', 'region_raleigh / durham / CH', 'region_redding', 'region_roanoke', 'region_rochester', 'region_san diego', 'region_sarasota-bradenton', 'region_south bend / mich

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

## VERIFYING Accuracy results to evaluate which features provide accurate or proper result

### Identified the 4th feature with all of the features came closer

### Having the Odometer identifying the record uniquley and brings the Accuracy to close to the actual value

In [66]:
carfeature_limtd_sets = [
    [col for col in X.columns if 'odometer' in col or 'model' in col or'year' in col or 'cylinders' in col or 'condition' in col],
    [col for col in X.columns if 'odometer' in col or 'model' in col or 'year' in col or 'condition' in col ],
    [col for col in X.columns if 'odometer' in col or 'region' in col or 'manufacturer' in col or 'model' in col or 'year' in col or 'condition' in col ], # Example: all region/manufacturer features
    [col for col in X.columns if 'id' in col or 'odometer' in col or 'region' in col or 'manufacturer' in col or 'model' in col or 'year' in col or 'condition' in col or 'cylinders' in col or 'fuel' in col or 'title_status' in col or 'transmission' in col or 'drive' in col or 'size' in col or 'type'  in col or 'state'in col or 'paint'in col]
]

In [67]:
lmtd_featrs_accuracy = {}
for i, features in enumerate(carfeature_limtd_sets):
    X_subset = X[features]
    X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, random_state=42)

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
 # Convert continuous predictions to binary using a threshold (e.g., 0.5)
    threshold = 0.5
    y_pred_binary = (y_pred >= threshold).astype(int)


    lmtd_featrs_accuracy[i] = [features,accuracy_score(y_test, y_pred)]
    print(f"Feature Set {i+1} Accuracy: {lmtd_featrs_accuracy[i][1]:.4f}  with features: {lmtd_featrs_accuracy[i][0]}")
    

Feature Set 1 Accuracy: 0.9784  with features: ['odometer', 'model_1500', 'model_1500 reg cab st', 'model_1500 slt', 'model_2500 4x4 megacab', 'model_350z convertible', 'model_a5', 'model_accord cpe', 'model_altima s 2.5', 'model_boxster', 'model_bus', 'model_camry', 'model_cherokee', 'model_civic sedan', 'model_corvette', 'model_cr-v', 'model_e-150', 'model_e-150 transit', 'model_e-class', 'model_e350 econoline cargo', 'model_edge 4dr se', 'model_edge sel', 'model_elantra', 'model_enclave', 'model_equinox', 'model_es350', 'model_escape', 'model_escape xls', 'model_express 3500', 'model_f-150', 'model_f-150 4x4 supercab', 'model_f-150 xlt', 'model_f150 supercab stx 4x4', 'model_f150 supercrew fx4', 'model_f150 xlt 4x4', 'model_f150 xlt supercrew 4x4', 'model_f250 super duty', 'model_f250sd', 'model_f350 super duty', 'model_f450sd', 'model_fiesta se', 'model_forester', 'model_frontier king cab sv', 'model_fusion sel', 'model_gl-class', 'model_gls550', 'model_grand caravan', 'model_grand

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

## FINDING 
### Having the odometer value may probably provide the closest value to real values reported
#### projecting price with the odometer may provide the probable value to customers