# Assignment 3 (OPTION A): Machine Learning
# 1. LOOK AT THE BIG PICTURE AND DATA EXPLORATION
## 1.1 Look at the big picture

1. End goal of the project: 
The purpose of this ML project is to train models that predict the selling price of a car from its condition and attributes. The end goal is to help businesses understand market trends and plan profitable products in the future. It also helps to analyze consumer behavior and to assess how past products performed.

2. Algorithm:
The algorithm to use is supervised regression.

3. Performance Measure:
* RMSE
* R2 Score

4. Collected Data:
The data was provided by the lecturer.
The dataset describes car retail sales.
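
The two performance measures named above can be written out directly from their definitions. The following is a minimal sketch using NumPy only; the helper names `rmse` and `r2` and the example prices are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices, for illustration only (not taken from the dataset)
y_true = [100_000, 200_000, 300_000, 400_000]
y_pred = [110_000, 190_000, 310_000, 390_000]

rmse_value = rmse(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(rmse_value, r2_value)
```

RMSE is expressed in the same units as selling_price, so lower is better, while an R2 Score closer to 1 means the model explains more of the variation in price.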

## 1.2 Data Exploration: Get the data

In [2]:
# In[0]: IMPORT AND FUNCTIONS
#region 
# pip install scikit-learn # to install sklearn
# pip install xgboost
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  
from sklearn.preprocessing import OneHotEncoder      
from sklearn.model_selection import KFold   
from statistics import mean
import joblib 
import os
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
#endregion


### Load dataset into the Jupyter Notebook

In [4]:
# In[2]: STEP 2. GET THE DATA (DONE). LOAD DATA
datasets_folder = 'datasets'
file_name = 'CarDetailsV3.csv'

file_path = os.path.join(datasets_folder, file_name)

raw_data = pd.read_csv(file_path)

### Our label data is selling_price with the datatype int64. 


### Overview of the dataset
The table above shows that there are 8128 rows. However, some values are missing in mileage, engine, max_power, torque, and seats.
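
The row count and the per-column missing values described here can be checked with a few pandas calls. This is a minimal sketch on a small made-up frame named `sample`; in the notebook the same calls would be applied to `raw_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real notebook would use raw_data
sample = pd.DataFrame({
    "selling_price": [450000, 370000, 158000, 225000],
    "km_driven": [145500, 120000, 140000, 127000],
    "mileage": ["23.4 kmpl", None, "17.7 kmpl", None],
    "seats": [5.0, 5.0, np.nan, 5.0],
})

n_rows = len(sample)               # number of rows
missing = sample.isnull().sum()    # missing values per column
print(n_rows)
print(missing[missing > 0])        # only the columns that have gaps
```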

## 1.3: Scatter plot between 2 features
### Feature selections:
To approach this regression problem and focus on predicting the price, the features suspected to correlate most strongly with selling price are, ranked from high to low:
1. Kilometers Driven (km_driven)
2. The year the car was produced (year)
3. Fuel type (fuel)


In [None]:
# 3.2 Scatter plot b/w 2 features
if 1:
    raw_data.plot(kind="scatter", y="selling_price", x="km_driven", alpha=0.2)
    #plt.axis([0, 5, 0, 10000])
    # plt.savefig('figures/scatter_1_feat.png', format='png', dpi=300)
    plt.show()      
if 1:
    raw_data.plot(kind="scatter", y="selling_price", x="year", alpha=0.2)
    #plt.axis([0, 5, 0, 10000])
    #plt.savefig('figures/scatter_2_feat.png', format='png', dpi=300)
    plt.show()

if 1:
    raw_data.plot(kind="scatter", y="selling_price", x="fuel", alpha=0.2)
    #plt.axis([0, 5, 0, 10000])
    #plt.savefig('figures/scatter_3_feat.png', format='png', dpi=300)
    plt.show()

# 3.3 Scatter plot b/w every pair of features
if 1:
    from pandas.plotting import scatter_matrix   
    features_to_plot = ["selling_price", "km_driven", "year", "fuel"]
    scatter_matrix(raw_data[features_to_plot], figsize=(12, 8)) # Note: histograms on the main diagonal; the non-numeric "fuel" column is skipped
    # plt.savefig('figures/scatter_mat_all_feat.png', format='png', dpi=300)
    plt.show()

# 3.4 Plot histogram of 1 feature
# if 1:
#     from pandas.plotting import scatter_matrix   
#     features_to_plot = ["selling_price"]
#     scatter_matrix(raw_data[features_to_plot], figsize=(12, 8)) # Note: histograms on the main diagonal
#     plt.show()

if 1:
    # Plotting the distribution of the 'selling_price' feature
    plt.figure(figsize=(10, 6))
    plt.hist(raw_data['selling_price'], bins=50, edgecolor='black')
    plt.title('Distribution of Selling Price')
    plt.xlabel('Selling Price')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()

# 3.5 Plot histogram of numeric features
if 0:
    #raw_data.hist(bins=10, figsize=(10,5)) #bins: no. of intervals
    raw_data.hist(figsize=(10,5)) #bins: no. of intervals
    plt.rcParams['xtick.labelsize'] = 10
    plt.rcParams['ytick.labelsize'] = 10
    plt.tight_layout()
    # plt.savefig('figures/hist_raw_data.png', format='png', dpi=300) # must save before show()
    plt.show()

# 3.6 Compute correlations b/w features
corr_matrix = raw_data.corr(numeric_only=True)
#print(corr_matrix) # print correlation matrix
print('\n',corr_matrix["selling_price"].sort_values(ascending=False)) # print correlation b/w a feature and other features

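# 3.7 Try combining features
# Note: price_per_km is derived from the label (selling_price), so it is for exploration only, not a model input
raw_data["price_per_km"] = raw_data["selling_price"] / raw_data["km_driven"]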
corr_matrix = raw_data.corr(numeric_only=True)
print(corr_matrix["selling_price"].sort_values(ascending=False)) # print correlation b/w a feature and other features
#raw_data.drop(columns=["price_per_km"], inplace=True) # remove experiment columns


## Explanation and Analysis 
### 3.1: Scatter Plot between 2 features
* Relationship between selling_price and km_driven
Most data lies in the bottom left corner. Most vehicles have been driven less than 50,000 km and sell for between 0 and 600,000 units.
There are outliers with extremely high km_driven and with extremely high prices.

The two are negatively correlated: as km_driven increases, the selling price tends to decrease. This means that cars with high mileage tend to be sold at a lower price.
<br>

* Relationship between selling_price and year 

Most data lies at the bottom right of the plot. Most vehicles were produced after 1995.

They have a positive linear relationship: as the production year increases, the price also tends to increase. This means that newer cars are sold at higher prices.
<br>

* Relationship between selling_price and fuel 

Most vehicles use Diesel or Petrol as their fuel, while a minority use LPG or CNG.

Diesel has the widest spread, with some cars reaching the top selling prices. However, the majority still sit at the bottom with a low selling price.

Petrol has one outlier with the highest selling price, but its spread is narrower and the data mostly lies at lower prices.

LPG and CNG have no outliers, and their data lies in the lower range, below 100,000 units.

Ranking from high to low selling price:
1. Diesel
2. Petrol
3. CNG
4. LPG
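
A ranking like the one above, read off the scatter plot, can be cross-checked numerically by summarising selling_price per fuel type. The sketch below assumes the median as the ranking statistic, which is a choice made here and not one stated in the notebook, and it uses made-up rows in `sample` in place of `raw_data`.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook would use raw_data
sample = pd.DataFrame({
    "fuel": ["Diesel", "Diesel", "Petrol", "Petrol", "CNG", "LPG"],
    "selling_price": [900000, 500000, 600000, 300000, 250000, 200000],
})

# Summarise selling_price per fuel type, highest median first
summary = (
    sample.groupby("fuel")["selling_price"]
    .agg(["count", "median", "max"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The `count` column matters when reading the result: a fuel type with very few cars can rank high or low purely by chance.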



## Scatter plot between selling_price and other features
A few insights gained from this plot: <br>
About km_driven
* Very low-mileage vehicles (close to 0 km driven) seem to command a premium, with some of the highest prices in this category.
* There are some very old vehicles (before 2000) with extremely low mileage, which could be classic or collector cars.
* Conversely, there are some relatively new vehicles with unusually high mileage, possibly indicating commercial use or long-distance drivers.


About year
* The highest frequency appears in the most recent years, from 2018 to 2020.
* Recent cars (after 2015) show a tight cluster of low mileage, while older cars have a much wider spread of mileages.

## Distribution Histogram
The highest frequency is for vehicles with a selling price of around 200,000 units, and the lowest is for vehicles priced at around 500,000 units.
Most vehicles lie between 0 and 2,000,000 units of selling price. Above 2,000,000 units, there are only a few outliers.
The majority of outliers are located in the approximate range of 300,000 to 500,000 units.
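
Readings taken from a histogram by eye can be confirmed by computing the bin counts directly. This is a minimal sketch with hypothetical prices in `prices`; the bin count of 5 and the 2,000,000 threshold are illustrative choices, and the notebook would pass `raw_data["selling_price"]` instead.

```python
import numpy as np

# Hypothetical prices; the notebook would pass raw_data["selling_price"]
prices = np.array([150000, 180000, 210000, 220000, 240000, 520000, 2500000])

# Same binning idea as plt.hist, but returning the numbers instead of drawing them
counts, edges = np.histogram(prices, bins=5)
peak = int(np.argmax(counts))  # index of the most populated bin
print(counts)
print(f"busiest bin: {edges[peak]:.0f} to {edges[peak + 1]:.0f}")

# Share of observations above a chosen threshold
share_above = float(np.mean(prices > 2_000_000))
print(share_above)
```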

## Correlation coefficients between selling_price and other variables 

1. selling_price to year (0.414092): <br>
Moderate positive correlation, which means newer cars tend to have higher selling prices.
<br>
2. selling_price to seats (0.041358): <br>
Very weak positive correlation, which means the number of seats has minimal relationship with selling price. More seats may be associated with a slightly higher price.
<br>

3. selling_price to km_driven (-0.22553): <br>
Weak negative correlation, which means higher mileage tends to go with a lower selling price. However, the correlation is quite weak, so this cannot be taken as a strong trend.
<br>

4. selling_price to price_per_km (0.024245): <br>
Very weak positive correlation, which means this new combined feature has almost no linear relationship with the selling price.
<br>
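
When comparing coefficients like those listed above, the sign gives the direction of the relationship and the absolute value gives its strength, so a negative coefficient sorted to the bottom of a list is not necessarily the weakest. A small sketch that ranks the reported values by magnitude (the variable names are illustrative):

```python
import pandas as pd

# Coefficients as reported in the list above
corr = pd.Series({
    "year": 0.414092,
    "seats": 0.041358,
    "km_driven": -0.22553,
    "price_per_km": 0.024245,
})

# Rank by magnitude while keeping the original signs for reference
order = corr.abs().sort_values(ascending=False).index
ranked = corr.reindex(order)
print(ranked)
```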

## Generally:
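year is the strongest predictor of selling price among these variables. km_driven has a weak negative correlation, while seats and price_per_km are the weakest, with coefficients close to zero.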
