In this demo we use a random forest regressor to predict diamond prices from a fixed set of features: carat, cut, color, clarity, depth, table, and the x, y and z dimensions.

In [58]:
import pandas as pd
import numpy as np

import sklearn.utils  # explicit import so that sklearn.utils.shuffle is available below
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings('ignore')

### Load and preprocess data

The dataset can be downloaded from Kaggle: https://www.kaggle.com/shivam2503/diamonds

In [89]:
diamonds = pd.read_csv("./diamonds.csv", index_col=0)

# Convert the ordinal categorical columns to integers (a higher number means a better grade)

cut_class_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
clarity_dict = {'I3': 1, 'I2': 2, 'I1': 3, 'SI2': 4, 'SI1': 5, 'VS2': 6, 'VS1': 7,
                'VVS2': 8, 'VVS1': 9, 'IF': 10, 'FL': 11}
color_dict = {'J': 1, 'I': 2, 'H': 3, 'G': 4, 'F': 5, 'E': 6, 'D': 7}

diamonds['cut'] = diamonds['cut'].map(cut_class_dict)
diamonds['clarity'] = diamonds['clarity'].map(clarity_dict)
diamonds['color'] = diamonds['color'].map(color_dict)

diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,6,4,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,6,5,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,6,7,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,2,6,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,1,4,63.3,58.0,335,4.34,4.35,2.75


In [90]:
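# Shuffle the rows so that the held-out slice below is a random sample
diamonds = sklearn.utils.shuffle(diamonds)

# Features are every column except the target; standardize them to zero mean and unit variance
x = diamonds.drop('price', axis=1).values
x = preprocessing.scale(x)
y = diamonds['price'].values

# Hold out the last 300 shuffled rows for testing
test_size = 300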

train_x = x[:-test_size]
train_y = y[:-test_size]

test_x = x[-test_size:]
test_y = y[-test_size:]
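
The cell above standardizes all rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. With only 300 held-out rows the effect is probably small, and tree-based models are largely insensitive to feature scaling anyway. As a hedged alternative, the sketch below splits first and fits the scaler on the training rows only. The synthetic `features` and `target` arrays are assumptions standing in for the diamonds data; `train_test_split` and `StandardScaler` are existing scikit-learn utilities.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diamonds feature matrix (assumption: 1000 rows, 9 features)
rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(1000, 9))
target = features @ rng.normal(size=9) + rng.normal(size=1000)

# Shuffle and hold out 300 rows in a single step
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=300, shuffle=True, random_state=0
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows
scaler = StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)

print(train_x_scaled.shape, test_x_scaled.shape)
```

Fixing `random_state` also makes the split reproducible between runs, which the shuffle in the notebook is not.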

### Initialize the RandomForestRegressor

A random forest fits n decision trees (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor), each on a random bootstrap sub-sample of the dataset, and then averages their predictions. Averaging many slightly different trees improves accuracy and helps control overfitting.
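
To make the averaging step concrete, here is a minimal sketch that fits a small forest and then recomputes its prediction by hand from the individual trees. The data comes from `make_regression` and is only an assumed stand-in for the diamonds features; `estimators_` is the fitted-tree list that scikit-learn exposes on a trained forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: stands in for the diamonds features)
X, y = make_regression(n_samples=500, n_features=9, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=16, random_state=0)
forest.fit(X, y)

# Each fitted tree lives in forest.estimators_; average their predictions by hand
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own prediction
matches = np.allclose(manual_average, forest.predict(X))
print(per_tree.shape, matches)
```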

In [91]:
# We can tweak n_estimators, the number of decision trees, to trade accuracy against training speed
rf = RandomForestRegressor(n_estimators=16, verbose=0)
rf.fit(train_x, train_y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=16, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
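
One way to explore the accuracy/speed trade-off is to fit forests of several sizes and record the test score and fit time for each. The sketch below does this on synthetic data; the candidate sizes `(1, 4, 16, 64)` and the dataset are illustrative assumptions, not values taken from this notebook.

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for the diamonds features and prices)
X, y = make_regression(n_samples=2000, n_features=9, noise=10.0, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=300, random_state=0)

results = []
for n_trees in (1, 4, 16, 64):
    start = time.perf_counter()
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(train_x, train_y)
    elapsed = time.perf_counter() - start
    # Store the forest size, its R^2 on the held-out rows, and the fit time in seconds
    results.append((n_trees, model.score(test_x, test_y), elapsed))

for n_trees, r2, seconds in results:
    print(f"n_estimators={n_trees:3d}  R^2={r2:.4f}  fit time={seconds:.2f}s")
```

Typically the score rises quickly for the first few trees and then flattens, while the fit time keeps growing roughly in proportion to the number of trees.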

In [92]:
# score() returns the R^2 coefficient: the fraction of the variance in the test prices that the model explains
rf.score(test_x, test_y)

0.9824549121692727

In [93]:
predictions = rf.predict(test_x)
result = pd.DataFrame()
result['predictions'] = predictions
result['actuals'] = test_y

result.head(10)

Unnamed: 0,predictions,actuals
0,5996.375,6610
1,3424.375,3662
2,2211.0625,2127
3,8859.5625,8830
4,5262.0,4773
5,1587.25,1551
6,3439.9375,3003
7,853.46875,912
8,1043.5,1061
9,1283.625,1378
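
The R^2 value returned by `rf.score` can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does so for the ten rows shown in the table above, so its result will differ somewhat from the score on all 300 test rows. `r2_score` from `sklearn.metrics` is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import r2_score

# The ten prediction/actual pairs from the table above
predicted = np.array([5996.375, 3424.375, 2211.0625, 8859.5625, 5262.0,
                      1587.25, 3439.9375, 853.46875, 1043.5, 1283.625])
actual = np.array([6610, 3662, 2127, 8830, 4773,
                   1551, 3003, 912, 1061, 1378], dtype=float)

# Residual sum of squares: squared prediction errors
ss_res = np.sum((actual - predicted) ** 2)
# Total sum of squares: squared deviations of the actual prices from their mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(actual, predicted))
```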


### Results

As we can see, our random forest achieves an R^2 of about 0.98 on the 300 held-out diamonds, so its predicted prices track the actual prices closely.

### Final remarks

Despite the model's excellent predictive power, we cannot determine the individual effect of each predictor, or which predictors are redundant with respect to the others. This is because the dataset is multicollinear: carat, x, y and z are all correlated with one another at 0.95 or above, as the correlation matrix below shows.

In [80]:
diamonds.corr()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
carat,1.0,-0.134967,-0.291437,-0.352841,0.028224,0.181618,0.921591,0.975094,0.951722,0.953387
cut,-0.134967,1.0,0.020519,0.189175,-0.218055,-0.433405,-0.053491,-0.125565,-0.121462,-0.149323
color,-0.291437,0.020519,1.0,-0.025631,-0.047279,-0.026465,-0.172511,-0.270287,-0.263584,-0.268227
clarity,-0.352841,0.189175,-0.025631,1.0,-0.067384,-0.160327,-0.1468,-0.371999,-0.35842,-0.366952
depth,0.028224,-0.218055,-0.047279,-0.067384,1.0,-0.295779,-0.010647,-0.025289,-0.029341,0.094924
table,0.181618,-0.433405,-0.026465,-0.160327,-0.295779,1.0,0.127134,0.195344,0.18376,0.150929
price,0.921591,-0.053491,-0.172511,-0.1468,-0.010647,0.127134,1.0,0.884435,0.865421,0.861249
x,0.975094,-0.125565,-0.270287,-0.371999,-0.025289,0.195344,0.884435,1.0,0.974701,0.970772
y,0.951722,-0.121462,-0.263584,-0.35842,-0.029341,0.18376,0.865421,0.974701,1.0,0.952006
z,0.953387,-0.149323,-0.268227,-0.366952,0.094924,0.150929,0.861249,0.970772,0.952006,1.0
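
A common way to quantify this kind of multicollinearity is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R^2). The sketch below is a minimal illustration on synthetic data, not on the diamonds dataset. The column names and the noise level are assumptions chosen to mimic three near-duplicate size measurements plus one independent feature, and `variance_inflation_factors` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (assumption): one latent 'size' drives three near-duplicate columns,
# while 'other' is independent of them
size = rng.normal(size=n)
features = np.column_stack([
    size + 0.1 * rng.normal(size=n),
    size + 0.1 * rng.normal(size=n),
    size + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
names = ["x", "y", "z", "other"]

def variance_inflation_factors(data):
    """Regress each column on the remaining columns and return 1 / (1 - R^2)."""
    vifs = []
    for j in range(data.shape[1]):
        others = np.delete(data, j, axis=1)
        r2 = LinearRegression().fit(others, data[:, j]).score(others, data[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

vifs = variance_inflation_factors(features)
for name, vif in zip(names, vifs):
    print(f"{name:>5}: VIF = {vif:.1f}")
```

A frequently cited rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others. In this synthetic setup the three correlated columns should land well above that range and the independent one close to 1.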
