# Introduction: What Factors Sell a Car?

Hundreds of free vehicle ads are posted on websites every day. Each ad provides several kinds of information, including the physical condition of the vehicle, the year it was made, and its selling price. This analysis studies a data set of such ads collected over the past few years to determine the factors that influence the price of a vehicle.

# Objective
This project aims to study the parameters that affect the price of a vehicle. The analysis is guided by the following questions:

1. How are the non-numerical variables (transmission type and vehicle color) related to vehicle price?
2. How are age, mileage, and vehicle condition correlated with vehicle price?

## Data Preprocessing

In [1]:
# Import all libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go

# Load a data file into a DataFrame
data = pd.read_csv('/kaggle/input/vehicle-us/vehicles_us.csv')

### Exploring Preliminary Data
The dataset contains the following columns:

- `price`
- `model_year`
- `model`
- `condition`
- `cylinders`
- `fuel` - gas, diesel, etc.
- `odometer` - the mileage of the vehicle at the time the ad was posted
- `transmission`
- `type`
- `paint_color`
- `is_4wd` - whether the vehicle has 4-wheel drive (Boolean type)
- `date_posted` - the date the ad was posted
- `days_listed` - the number of days the ad was live before it was removed

In [2]:
# Show general information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [3]:
# Display sample data
data.sample()

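| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11280 | 19500 | 2018.0 | jeep cherokee | like new | 6.0 | gas | 15800.0 | automatic | SUV | blue | NaN | 2019-02-25 | 17 |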


In [4]:
# Display description of the dataframe
data.describe()

| | price | model_year | cylinders | odometer | is_4wd | days_listed |
|---|---|---|---|---|---|---|
| count | 51525.0 | 47906.0 | 46265.0 | 43633.0 | 25572.0 | 51525.0 |
| mean | 12132.46492 | 2009.75047 | 6.125235 | 115553.461738 | 1.0 | 39.55476 |
| std | 10040.803015 | 6.282065 | 1.66036 | 65094.611341 | 0.0 | 28.20427 |
| min | 1.0 | 1908.0 | 3.0 | 0.0 | 1.0 | 0.0 |
| 25% | 5000.0 | 2006.0 | 4.0 | 70000.0 | 1.0 | 19.0 |
| 50% | 9000.0 | 2011.0 | 6.0 | 113000.0 | 1.0 | 33.0 |
| 75% | 16839.0 | 2014.0 | 8.0 | 155000.0 | 1.0 | 53.0 |
| max | 375000.0 | 2019.0 | 12.0 | 990000.0 | 1.0 | 271.0 |


Based on the information above, the dataset has 13 columns and 51525 rows. The columns 'model_year', 'cylinders', 'odometer', 'paint_color', and 'is_4wd' contain missing values.

Missing values in these columns can be resolved either by deriving a suitable value from the existing data (for example, the column median) or by replacing 'NaN' with a placeholder value. Both methods are used here to improve data quality.

The data type of column 'is_4wd' should be Boolean, not float, and the data type of column 'date_posted' should be datetime, not string. Both problems can be solved by converting each column to the appropriate data type.
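As a minimal sketch of these two conversions, the snippet below works on a few made-up rows rather than the real file, and `sample` is a hypothetical name. It assumes that a missing 'is_4wd' value means the vehicle is not 4-wheel drive, which the text above does not confirm.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the two problem columns (values are made up for illustration)
sample = pd.DataFrame({
    'is_4wd': [1.0, np.nan, 1.0, np.nan],
    'date_posted': ['2019-02-25', '2018-06-23', '2018-10-19', '2019-03-22'],
})

# Assumption: a missing 'is_4wd' means "not 4-wheel drive".
# Fill before casting, because NaN would otherwise be cast to True.
sample['is_4wd'] = sample['is_4wd'].fillna(0).astype(bool)

# Parse the ISO-formatted date strings into datetime values
sample['date_posted'] = pd.to_datetime(sample['date_posted'], format='%Y-%m-%d')

print(sample.dtypes)
```

The order matters for 'is_4wd': casting a float column that still contains NaN directly to `bool` turns every NaN into `True`, which would mark all vehicles as 4-wheel drive.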


In [5]:
# Count the missing values in each column, largest first
data.isna().sum().sort_values(ascending=False)

is_4wd          25953
paint_color      9267
odometer         7892
cylinders        5260
model_year       3619
price               0
model               0
condition           0
fuel                0
transmission        0
type                0
date_posted         0
days_listed         0
dtype: int64
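The raw counts are easier to judge as a share of all rows; for instance, 25953 of 51525 rows (about 50%) have no 'is_4wd' value. A minimal sketch of that calculation, using a few made-up rows and the hypothetical names `sample` and `missing_share`, might look like this:

```python
import numpy as np
import pandas as pd

# Toy rows with gaps in two columns (values are made up for illustration)
sample = pd.DataFrame({
    'price': [19500, 9400, 5500, 1500],
    'odometer': [15800.0, np.nan, 110000.0, np.nan],
    'paint_color': ['blue', None, 'red', 'white'],
})

# isna().mean() is the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
missing_share = (sample.isna().mean() * 100).sort_values(ascending=False)

print(missing_share.round(1))
```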

### Conclusions and Next Steps
The data exploration leads to the following conclusions and next steps:

1. Missing values in the 'is_4wd' column can be resolved by imputing a Boolean value. Every non-missing entry is 1.0, so 'NaN' most likely means the vehicle does not have 4-wheel drive and can be replaced with False.
2. Missing values in the 'paint_color' column can be resolved by replacing 'NaN' with 'unknown'. Because the column holds strings, this keeps every vehicle in a color category.
3. Missing values in the 'odometer', 'cylinders', and 'model_year' columns can be resolved by imputing the median of each column.
4. The 'is_4wd' column can be converted to a Boolean data type to facilitate analysis, and the 'date_posted' column can be converted from a string to a datetime data type.
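Step 3 could be sketched as follows. The rows are made-up stand-ins for the real data and `sample` is a hypothetical name; the loop fills each column with its own overall median. A per-group median (for example, per model) would be an alternative, and the steps above do not say which is intended.

```python
import numpy as np
import pandas as pd

# Toy rows with one gap per numeric column (values are made up for illustration)
sample = pd.DataFrame({
    'odometer': [15800.0, np.nan, 110000.0, 145000.0],
    'cylinders': [6.0, 4.0, np.nan, 8.0],
    'model_year': [2018.0, np.nan, 2013.0, 2003.0],
})

# Replace NaN in each column with that column's median;
# median() skips NaN by default, so the gaps do not distort it
for column in ['odometer', 'cylinders', 'model_year']:
    sample[column] = sample[column].fillna(sample[column].median())

print(sample)
```

The median is less sensitive than the mean to extreme values such as the 990000 maximum seen in 'odometer'.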

## Resolve Missing Values

In [6]:
# Review the unique values in the 'paint_color' column
data['paint_color'].unique()

array([nan, 'white', 'red', 'black', 'blue', 'grey', 'silver', 'custom',
       'orange', 'yellow', 'brown', 'green', 'purple'], dtype=object)

In [7]:
# Fill the NaN values in the 'paint_color' column with 'unknown'
data['paint_color'] = data['paint_color'].fillna('unknown')

# Review the unique values of the 'paint_color' column after filling the NaN values
data['paint_color'].unique()

array(['unknown', 'white', 'red', 'black', 'blue', 'grey', 'silver',
       'custom', 'orange', 'yellow', 'brown', 'green', 'purple'],
      dtype=object)