# Fundamentals of Machine Learning - Diamond price prediction
* We will use the diamonds price [dataset](https://www.kaggle.com/datasets/amirhosseinmirzaie/diamonds-price-dataset)
* Predicting the price of a diamond is a complex task

![img01](https://github.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/diamond01.jpg?raw=true)

In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/master/datasets/diamonds.csv").drop(columns=["x", "y", "z"])
df

# 📌 Explanation of the properties in the dataset

## **Carats**
* The weight of the diamond ('carat' is a term used to describe the weight of a diamond; the word originates from *Ceratonia siliqua*, commonly known as the carob tree)

![img02](https://github.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/diamond02.jpg?raw=true)

## **Clarity** 
* Clarity of the diamond is a measure of the purity and rarity of the stone, graded by the visibility of inclusions and blemishes under 10-power magnification

![img03](https://github.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/diamond03.jpg?raw=true)

## **Color** of the diamond

![img04](https://github.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/diamond04.jpg?raw=true)

## **Depth**
* The **depth** of a diamond refers to its measurement from top to bottom, from the table on the top of the diamond to the culet at its base
* The depth of any diamond is expressed as a percentage
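
As a rough illustration of the percentage mentioned above, here is a minimal sketch. It assumes the common definition of total depth percentage (depth divided by the average of length and width). The function name `depth_percent` and the sample measurements are hypothetical; the corresponding `x`, `y`, `z` columns are dropped from `df` in this notebook.

```python
def depth_percent(length_mm: float, width_mm: float, depth_mm: float) -> float:
    """Total depth percentage: depth relative to the mean of length and width."""
    mean_diameter = (length_mm + width_mm) / 2.0
    return 100.0 * depth_mm / mean_diameter


# Hypothetical measurements in millimetres (length, width, depth)
example = depth_percent(3.95, 3.98, 2.43)
print(f"Depth: {example:.1f} %")
```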

## **Diamond's table**
* A diamond's **table** is the large, flat facet on top of the stone that you can see when you look at the diamond from above
* As the largest facet on a diamond, the table plays a major role in determining how brilliant (sparkly) the diamond is

![img05](https://github.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/diamond05.jpg?raw=true)

In [None]:
df["clarity"].unique()

## 💡 Conversion of the Cut, Color and Clarity will differ
*   Cut already appears in the dataset in a suitable order, so it is encoded in order of first appearance
*   Color is graded by alphabetically ordered letters, so the values must be sorted before encoding
*   Clarity needs a conversion table
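
The difference between the encodings can be sketched on a toy column. `pd.factorize(..., sort=False)` assigns codes in order of first appearance, so the result depends on the row order of the data. An explicit ordered category list does not. The toy values and the quality order `Fair < Good < Very Good < Premium < Ideal` are assumptions for illustration and should be checked against `df["cut"].unique()`.

```python
import pandas as pd

# Hypothetical sample of cut labels (not taken from the real dataset)
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Very Good", "Fair", "Good"]})

# Codes follow the order of first appearance in the column
codes_by_appearance, uniques = pd.factorize(toy["cut"], sort=False)

# Codes follow an explicitly stated quality order (assumed here)
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
codes_by_quality = pd.Categorical(toy["cut"], categories=cut_order, ordered=True).codes

print(list(uniques))
print(list(codes_by_appearance))
print(list(codes_by_quality))
```

With an explicit order, the numeric codes stay the same even if the rows of the dataset are shuffled.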

In [None]:
df["cut"], unique_cut = pd.factorize(df["cut"], sort=False)
df["color"], unique_color =  pd.factorize(df["color"], sort=True)
# Map clarity grades to ordinal codes (0 = best, 7 = worst); plain assignment avoids a chained in-place replace
df["clarity"] = df["clarity"].map({'IF':0, 'VVS1':1,'VVS2':2, 'VS1':3, 'VS2':4,'SI1':5, 'SI2':6, 'I1':7})
df

## Let's look at the distribution of the numerical values and the histograms of the categorical ones

In [None]:
px.box(df, y=["carat","depth", "table"])

In [None]:
px.histogram(df, x=["cut", "color", "clarity"])

In [None]:
px.scatter_matrix(df, dimensions=["carat","depth", "table"], color="price")

## A parallel coordinate plot is sometimes hard to interpret
* 💡 The presence of categorical data may hide the basic trends
* As can be seen, the highest prices do not always belong to the heaviest or the clearest diamonds

In [None]:
px.parallel_coordinates(df, color="price", dimensions=["carat","cut", "color", "clarity", "depth", "table"])

In [None]:
px.parallel_coordinates(df, color="price", dimensions=["carat", "depth", "table"])

# 🚀 Regression of the data
* Split the dataset into the input matrix `X` and the output vector `y`

In [None]:
X = df.drop(columns=["price"]).values
y = df["price"].values

In [None]:
X.shape, y.shape

# Split the dataset into training and testing subsets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

# Let's build the models
* 💡 Try KNeighborsRegressor, DecisionTreeRegressor, RandomForestRegressor and MLPRegressor (multi-layer perceptron, aka neural network)
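
The comparison cells in this notebook report MAE, MSE and MAPE for each model. As a minimal sketch of what these three numbers mean, the snippet computes them by hand on made-up prices and checks them against the scikit-learn functions imported at the top. The values in `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# Hypothetical true and predicted prices
y_true = np.array([500.0, 1000.0, 4000.0])
y_pred = np.array([600.0, 900.0, 3000.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))                      # average absolute error, in price units
mse = np.mean(errors ** 2)                         # average squared error, punishes large misses
mape = np.mean(np.abs(errors) / np.abs(y_true))    # average relative error, as a fraction

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(mape, mean_absolute_percentage_error(y_true, y_pred))
```

MAPE is returned as a fraction, which is why the notebook multiplies it by 100 before printing.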

In [None]:
models = [
    ["KNN(3)", KNeighborsRegressor(3)],
    ["KNN(5)", KNeighborsRegressor(5)],
    ["KNN(7)", KNeighborsRegressor(7)],
    ["DecisionTree", DecisionTreeRegressor()],
    ["RandomForest(10)", RandomForestRegressor(10)],
    ["RandomForest(20)", RandomForestRegressor(20)],
    ["RandomForest(50)", RandomForestRegressor(50)],
    ["RandomForest(100)", RandomForestRegressor(100)],
    ["MLP(10)", MLPRegressor(10)],
    ["MLP(20)", MLPRegressor(20)],
    ["MLP(50)", MLPRegressor(50)],
    ["MLP(100)", MLPRegressor(100)],
    ["MLP(200)", MLPRegressor(200)],
]

In [None]:
print("                          Train                          Test")
print("                          MAE        MSE      MAPE       MAE        MSE      MAPE")
for name, model in models:
  model.fit(X_train, y_train)
  train_predicted = model.predict(X_train)
  test_predicted = model.predict(X_test)
  mae_train = mean_absolute_error(y_train, train_predicted)
  mae_test = mean_absolute_error(y_test, test_predicted)
  mse_train = mean_squared_error(y_train, train_predicted)
  mse_test = mean_squared_error(y_test, test_predicted)
  mape_train = mean_absolute_percentage_error(y_train, train_predicted)
  mape_test = mean_absolute_percentage_error(y_test, test_predicted)
  print(f"{name:20s}  {mae_train:7.1f}    {mse_train:7.0f}  {100*mape_train:7.1f}%   {mae_test:7.1f}    {mse_test:7.0f}  {100*mape_test:7.1f}%")

## Let's check the scaled version of the dataset
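
A minimal sketch of what `MinMaxScaler` does, on made-up numbers: each column is mapped to `(x - min) / (max - min)`, with `min` and `max` learned from the training rows only. The arrays `train` and `test` are hypothetical and stand in for `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature rows: [carat, depth]
train = np.array([[0.2, 55.0], [1.0, 60.0], [2.2, 65.0]])
test = np.array([[3.2, 50.0]])

scaler_demo = MinMaxScaler().fit(train)  # min and max come from the training data only

# The same transformation written out by hand
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (train - col_min) / (col_max - col_min)

scaled_train = scaler_demo.transform(train)
scaled_test = scaler_demo.transform(test)

print(scaled_train)
print(scaled_test)  # test values may fall outside [0, 1]
```

Because the scaler is fitted on the training subset only, test rows that exceed the training range end up below 0 or above 1.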

In [None]:
scaler = MinMaxScaler()
scaler.fit(X_train)

sX_train = scaler.transform(X_train)
sX_test = scaler.transform(X_test)

In [None]:
print("                          Train                          Test")
print("                          MAE        MSE      MAPE       MAE        MSE      MAPE")
for name, model in models:
  model.fit(sX_train, y_train)
  train_predicted = model.predict(sX_train)
  test_predicted = model.predict(sX_test)
  mae_train = mean_absolute_error(y_train, train_predicted)
  mae_test = mean_absolute_error(y_test, test_predicted)
  mse_train = mean_squared_error(y_train, train_predicted)
  mse_test = mean_squared_error(y_test, test_predicted)
  mape_train = mean_absolute_percentage_error(y_train, train_predicted)
  mape_test = mean_absolute_percentage_error(y_test, test_predicted)
  print(f"{name:20s}  {mae_train:7.1f}    {mse_train:7.0f}  {100*mape_train:7.1f}%   {mae_test:7.1f}    {mse_test:7.0f}  {100*mape_test:7.1f}%")

## ⚡ Let's scale the `price` too
* This may allow the MLP to achieve better results
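
One possible alternative to scaling and inverse-scaling the target by hand is scikit-learn's `TransformedTargetRegressor`, which applies the transformer to `y` during `fit` and inverts it during `predict`. The sketch below is not the approach used in this notebook. It uses synthetic data and a `LinearRegression` stand-in so that it runs quickly; all values are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0.2, 3.0, size=(200, 1))   # hypothetical carat values
y_toy = 4000.0 * X_toy[:, 0] + 300.0           # hypothetical noise-free prices

# The wrapper scales y before fitting and maps predictions back to price units
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=MinMaxScaler(),
)
wrapped.fit(X_toy, y_toy)
pred = wrapped.predict(X_toy)

print(pred[:3])
print(y_toy[:3])
```

The predictions come back in the original price scale, so the error metrics can be computed against `y` directly.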

In [None]:
yscaler = MinMaxScaler()
yscaler.fit(y_train.reshape((y_train.shape[0], 1)))
sy_train = yscaler.transform(y_train.reshape((y_train.shape[0], 1))).ravel()
sy_test = yscaler.transform(y_test.reshape((y_test.shape[0], 1))).ravel()

In [None]:
print("                          Train                          Test")
print("                          MAE        MSE      MAPE       MAE        MSE      MAPE")
for name, model in models:
  model.fit(sX_train, sy_train)
  train_predicted = model.predict(sX_train)
  test_predicted = model.predict(sX_test)
  train_predicted = yscaler.inverse_transform(train_predicted.reshape(train_predicted.shape[0], 1)).ravel()
  test_predicted = yscaler.inverse_transform(test_predicted.reshape(test_predicted.shape[0], 1)).ravel()
  mae_train = mean_absolute_error(y_train, train_predicted)
  mae_test = mean_absolute_error(y_test, test_predicted)
  mse_train = mean_squared_error(y_train, train_predicted)
  mse_test = mean_squared_error(y_test, test_predicted)
  mape_train = mean_absolute_percentage_error(y_train, train_predicted)
  mape_test = mean_absolute_percentage_error(y_test, test_predicted)
  print(f"{name:20s}  {mae_train:7.1f}    {mse_train:7.0f}  {100*mape_train:7.1f}%   {mae_test:7.1f}    {mse_test:7.0f}  {100*mape_test:7.1f}%")

## 💡 Let's try a different approach
* Generate dummies (one-hot encoding) of all categorical columns and see the effect on the predictions of the different models
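
A small sketch of what `pd.get_dummies` produces, on a made-up frame: every categorical column is replaced by one indicator column per distinct value, while numeric columns are left as they are. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with one numeric and two categorical columns
toy = pd.DataFrame({
    "carat": [0.23, 0.31],
    "cut": ["Ideal", "Good"],
    "color": ["E", "J"],
})

# dtype=int gives 0/1 indicator columns instead of booleans
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
print(encoded)
```

Unlike the ordinal codes used earlier, this encoding does not impose any order on the categories, at the cost of a wider input matrix.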

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/master/datasets/diamonds.csv").drop(columns=["x", "y", "z"])
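# dtype=float keeps the feature matrix numeric (newer pandas versions default to boolean dummy columns)
df = pd.get_dummies(df, dtype=float)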

X = df.drop(columns=["price"]).values
y = df["price"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

scaler = MinMaxScaler()
scaler.fit(X_train)

sX_train = scaler.transform(X_train)
sX_test = scaler.transform(X_test)

yscaler = MinMaxScaler()
yscaler.fit(y_train.reshape((y_train.shape[0], 1)))
sy_train = yscaler.transform(y_train.reshape((y_train.shape[0], 1))).ravel()
sy_test = yscaler.transform(y_test.reshape((y_test.shape[0], 1))).ravel()

In [None]:
print("                          Train                          Test")
print("                          MAE        MSE      MAPE       MAE        MSE      MAPE")
test_results = {'GroundTruth': y_test}
for name, model in models:
  model.fit(sX_train, sy_train)
  train_predicted = model.predict(sX_train)
  test_predicted = model.predict(sX_test)
  train_predicted = yscaler.inverse_transform(train_predicted.reshape(train_predicted.shape[0], 1)).ravel()
  test_predicted = yscaler.inverse_transform(test_predicted.reshape(test_predicted.shape[0], 1)).ravel()
  test_results[name] = test_predicted
  mae_train = mean_absolute_error(y_train, train_predicted)
  mae_test = mean_absolute_error(y_test, test_predicted)
  mse_train = mean_squared_error(y_train, train_predicted)
  mse_test = mean_squared_error(y_test, test_predicted)
  mape_train = mean_absolute_percentage_error(y_train, train_predicted)
  mape_test = mean_absolute_percentage_error(y_test, test_predicted)
  print(f"{name:20s}  {mae_train:7.1f}    {mse_train:7.0f}  {100*mape_train:7.1f}%   {mae_test:7.1f}    {mse_test:7.0f}  {100*mape_test:7.1f}%")
  # break;
px.line(pd.DataFrame(data=test_results))