

### Background:

In the dynamic automotive industry, accurately estimating vehicle prices is crucial for anyone looking to buy or sell a vehicle. However, traditional pricing methods often overlook important factors, which leads to inaccurate price estimates.

--- 

### Objective:

The objective is to help consumers analyse trends in the automotive market and estimate vehicle prices from attributes such as condition, mileage, and brand. Using a dataset of individual vehicle sales transactions, we aim to provide insights and tools for market analysis and price prediction.

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

# Import from Scikit-Learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
from sklearn.metrics import confusion_matrix
np.random.seed(0)

In [None]:
# Import the Dataset
sales_data = pd.read_csv('car_prices.csv')
sales_data.head()

In [None]:
sales_data.info()

### Dataset Description

The description of the dataset, as available on Kaggle, is as follows.  
Learn more: https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data/data
> **year** : The manufacturing year of the vehicle         
> **make** : The brand or manufacturer of the vehicle        
> **model** : The specific model of the vehicle        
> **trim** : Additional designation for the vehicle model       
> **body** : The body type of the vehicle (e.g., SUV, Sedan)       
> **transmission** : The type of transmission in the vehicle (e.g., automatic)       
> **vin** : Vehicle Identification Number, a unique code for each vehicle      
> **state** : The state where the vehicle is registered       
> **condition** : Condition of the vehicle, possibly rated on a scale      
> **odometer** : The mileage or distance traveled by the vehicle     
> **color** : Exterior color of the vehicle    
> **interior** : Interior color of the vehicle     
> **seller** : The entity selling the vehicle      
> **mmr** : Manheim Market Report, possibly indicating the estimated market value of the vehicle    
> **sellingprice** : The price at which the vehicle was sold      
> **saledate** : The date and time when the vehicle was sold    

---
### Cleaning of Data (Review Lecture @20mins)


Things to do:

1. Rename columns to more understandable names
2. Check for null values
3. Remove columns with no meaning (optional)

In [None]:
# Rename columns to more understandable names
sales_data.rename(columns={
    "year": "Manufacturing Year",
    "make": "Brand",
    "trim": "Model Version",
    "body": "Vehicle Type",
    "transmission" : "Gear",
    "vin" : "Vehicle ID",
    "state" : "Registration State",
    "odometer" : "Mileage",
    "color" : "Exterior Colour",
    "interior" : "Interior Colour",
    "mmr" : "Market Value",
    "sellingprice" : "Selling Price",
    "saledate" : "Sale Date"
}, inplace=True)

# Convert all variable names to uppercase
sales_data.columns = sales_data.columns.str.upper()
# Work on a copy so later in-place cleaning does not alter sales_data
cleaned_data = sales_data.copy()
cleaned_data.head()

In [None]:
# Cleaning the data (Check for NULL values)
cleaned_data.isnull().sum()

In [None]:
# Drop missing values
cleaned_data.dropna(inplace=True)
cleaned_data.isnull().sum()

In [None]:
cleaned_data.info()
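
Because `dropna` removes every row that has at least one missing value, it can be worth checking how much data that costs and which columns are responsible. The following is a minimal sketch on a small made-up frame; `summarise_missing` and the `toy` values are hypothetical and only stand in for `cleaned_data`.

```python
import pandas as pd

def summarise_missing(df):
    """Return per-column missing counts and shares, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary.sort_values("missing", ascending=False)

# Toy frame standing in for the sales data (values are made up)
toy = pd.DataFrame({
    "BRAND": ["Kia", "BMW", None, "Ford"],
    "GEAR": ["automatic", None, None, "manual"],
    "SELLING PRICE": [21500.0, 30000.0, 27750.0, 9800.0],
})

summary = summarise_missing(toy)
rows_kept = len(toy.dropna())  # rows with no missing value in any column
print(summary)
print(f"Rows kept after dropna: {rows_kept} of {len(toy)}")
```

If one column accounts for most of the loss, dropping or imputing that column alone may preserve more rows than a blanket `dropna`.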

In [None]:
cleaned_data.to_csv('cleaned_data.csv', index=False)

## Exploratory Data Analysis

1. Separate Numerical and Categorical data

#### Univariate Analysis
2. Plot distributions of numerical data
3. Plot distributions of categorical data

#### Bivariate Analysis
4. Plot predictors against response (Correlation)

In [None]:
# Display data types of each column
print(cleaned_data.dtypes)

# Separate Numerical and Categorical Variables
numerical_columns = cleaned_data.select_dtypes(include=['int64','float64']).columns
categorical_columns = cleaned_data.select_dtypes(include=['object']).columns

In [None]:
# Distributions of numerical variables

for col in numerical_columns:
    f, axes = plt.subplots(1, 3, figsize=(24, 6))
    sb.boxplot(data = cleaned_data[col], orient = "h", ax = axes[0])
    sb.histplot(data = cleaned_data[col], ax = axes[1])
    sb.violinplot(data = cleaned_data[col], orient = "h", ax = axes[2])
    plt.show()

In [None]:
# Distributions of categorical variables

# Find number of unique values for categorical columns
unique_values_count = {column: cleaned_data[column].nunique() for column in categorical_columns}

# Plot distributions for columns with fewer than 100 unique values
for key, value in unique_values_count.items():
    # Skip SALE DATE for now because it needs preprocessing first
    if key == 'SALE DATE':
        continue

    if value < 100:
        plt.figure(figsize=(12, 10))
        sb.histplot(y=cleaned_data[key])
        # Add a title and show the plot
        plt.title(f'{key} Distribution')
        plt.show()
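
SALE DATE is skipped above because it needs preprocessing. One possible approach is sketched below, under the assumption that the raw strings look like `Tue Dec 16 2014 12:30:00 GMT-0800 (PST)`; this format should be checked against the actual column before relying on it. The helper `parse_sale_date` and the sample strings are hypothetical. The timezone suffix is discarded, so the result is a naive local timestamp.

```python
import pandas as pd

def parse_sale_date(series):
    """Parse the leading 'Tue Dec 16 2014 12:30:00' part of each string.

    Strings that do not match the assumed pattern become NaT.
    """
    pattern = r"^(\w{3} \w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2})"
    leading = series.str.extract(pattern)[0]
    return pd.to_datetime(leading, format="%a %b %d %Y %H:%M:%S", errors="coerce")

# Made-up examples in the assumed format, plus one malformed entry
raw = pd.Series([
    "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)",
    "Wed Jan 07 2015 09:30:00 GMT-0800 (PST)",
    "not a date",
])

parsed = parse_sale_date(raw)
sale_month = parsed.dt.month  # could later be used to look at seasonality
print(parsed)
```

Once parsed, components such as year, month, or weekday have few unique values and can be plotted like the other categorical variables.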

In [None]:
# Relationship between each numerical variable and selling price
for col in numerical_columns:
    # Skip the response itself; plotting it against itself is uninformative
    if col == "SELLING PRICE":
        continue

    # Jointplot of the predictor against the response
    sb.jointplot(data = cleaned_data, x = col, y = "SELLING PRICE", height = 12)
    plt.show()

In [None]:
# Correlation between Response and the Predictors
numeric_df = cleaned_data.select_dtypes(include=['int64','float64'])
f = plt.figure(figsize=(12, 8))
sb.heatmap(numeric_df.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f")
plt.title('Correlation Matrix between numeric variables', fontsize=20)
plt.show()

sb.pairplot(data = numeric_df)
plt.show()
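
The heatmap shows every pairwise correlation, but for price prediction the column for SELLING PRICE matters most. As a rough sketch, the predictors can be ranked by the absolute value of their correlation with the response. The helper `rank_correlations` and the toy numbers below are hypothetical and are not results from the dataset.

```python
import pandas as pd

def rank_correlations(df, target):
    """Pearson correlation of each numeric column with `target`,
    ordered by absolute strength (strongest first)."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up values for illustration only
toy = pd.DataFrame({
    "MILEAGE": [120000, 60000, 80000, 20000],
    "MARKET VALUE": [5000, 10000, 15000, 20000],
    "SELLING PRICE": [5200, 9800, 15100, 19900],
})

ranked = rank_correlations(toy, "SELLING PRICE")
print(ranked)
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship; the jointplots remain useful for spotting those.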

In [None]:
# Categorical variables against sale price

# Plot boxplots for columns with fewer than 100 unique values
for col, value in unique_values_count.items():
    if value < 100:
        # Boxplot of selling price for each level of the category
        f = plt.figure(figsize=(12, 8))
        sb.boxplot(x = "SELLING PRICE", y = col, data = cleaned_data)
        plt.show()
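
Boxplots with many levels can be hard to read, so a numeric summary per level may complement them. The sketch below computes the count and median selling price for each level and drops levels with too few sales; `median_price_by`, its `min_count` parameter, and the toy values are hypothetical.

```python
import pandas as pd

def median_price_by(df, column, target="SELLING PRICE", min_count=1):
    """Count and median of `target` per level of `column`, highest median first.

    Levels with fewer than `min_count` rows are left out.
    """
    stats = df.groupby(column)[target].agg(["count", "median"])
    stats = stats[stats["count"] >= min_count]
    return stats.sort_values("median", ascending=False)

# Made-up values for illustration only
toy = pd.DataFrame({
    "VEHICLE TYPE": ["SUV", "SUV", "Sedan", "Sedan", "Sedan", "Coupe"],
    "SELLING PRICE": [24000, 26000, 12000, 15000, 13000, 40000],
})

by_type = median_price_by(toy, "VEHICLE TYPE", min_count=2)
print(by_type)
```

The median is used rather than the mean because selling prices tend to be right-skewed, and a few expensive vehicles would otherwise dominate a level's summary.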
