Problem Description: If you love collecting rare diamonds, you might be curious how much a diamond is actually worth. If it is worth millions of dollars, you might want to sell it at auction. Hence, given some attributes of a diamond, we wish to write a machine learning program that predicts how much the diamond is worth, using the Diamonds dataset (https://www.kaggle.com/shivam2503/diamonds), which contains detailed information on about 54000 diamonds.

Team members: Adriel, Chee Heng, Bohan

Objective: Given the weight of the diamond, quality of the cut, diamond colour, clarity, length, width, depth and the other attributes in the Diamonds dataset (other than the price), predict the price of the diamond in US dollars.

Machine learning category: This is a regression problem, since the price of the diamond is a continuous variable. It is also a supervised learning problem: the model can only learn the relationship between the attributes and the price from diamonds whose prices are already known.


These are the libraries required for the tutorial. The Python 3 interpreter does not load them by default, so we need to import them before their functions can be used.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math, time # builtin libraries from the Python 3 system, but still requires importing

# Scikit learn libraries
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OrdinalEncoder

Firstly, we need to read the Diamonds dataset. The dataset can be downloaded from https://www.kaggle.com/shivam2503/diamonds. The code below uses the path at which Kaggle mounts the dataset, /kaggle/input/diamonds/diamonds.csv; if you run the notebook elsewhere, change the path to wherever diamonds.csv is saved.

The diamonds.csv file is read into a Pandas DataFrame. By convention, the DataFrame is named df. Any other name would work just as well, but following the convention makes the code easier for others to read and understand.

Now, let's take a look at the first 5 rows of the Dataframe. df.head(5) allows you to see the first 5 data points inside our dataset.


In [None]:
df = pd.read_csv('/kaggle/input/diamonds/diamonds.csv')
print(df.head(5))

Let's take a look at the dimensions of the dataset using df.shape. As you can see, there are 53940 rows (indicating 53940 diamonds) and 11 columns.

In [None]:
print(df.shape)

The first column is just a row index left over from the CSV file, so it carries no information about the diamonds. To remove a column, we use the drop method of the DataFrame, as follows. axis=1 means that a column (rather than a row) is deleted.

In [None]:
df = df.drop(df.columns[0],axis=1)
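As a minimal sketch of what the drop call does, here is the same operation on a toy DataFrame. The column names and values below are made up for illustration and are not read from diamonds.csv.

```python
import pandas as pd

# Toy frame standing in for diamonds.csv; the values are illustrative only.
toy = pd.DataFrame({
    "index_col": [1, 2, 3],
    "carat": [0.23, 0.21, 0.29],
    "price": [326, 326, 334],
})

# axis=1 drops a column; axis=0 (the default) would drop a row label instead.
trimmed = toy.drop(toy.columns[0], axis=1)
print(trimmed.columns.tolist())

# drop returns a new DataFrame, so the original is left unchanged.
print(toy.shape, trimmed.shape)
```

Because the call removes whichever column is currently first, running the notebook cell a second time would also remove the next column, so it should only be run once per load of the data.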

You might have noticed that some of the columns hold categorical data, such as cut, which ranges from ‘Fair’ to ‘Ideal’. As the Ridge algorithm cannot handle such string values (a runtime error will result), we have to convert them into numbers that it can handle.

Normally, one would convert the feature into a set of boolean features. For instance, the column ‘cut’ would be split into five columns: ‘cut_Fair’, ‘cut_Good’, ‘cut_Very Good’, ‘cut_Premium’ and ‘cut_Ideal’. This particular encoding is called one-hot encoding. The disadvantage is that memory usage is very high. We could cut the number of columns to four, as having zeroes in four of the columns implies the fifth category. However, this memory saving is insignificant when the number of columns is large.
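The one-hot encoding described above can be sketched with pandas' get_dummies. This is only an illustration on a toy column (the six rows are made up); the notebook itself does not use this encoding.

```python
import pandas as pd

# Toy 'cut' column; the values are illustrative only.
cuts = pd.DataFrame({"cut": ["Fair", "Good", "Very Good", "Premium", "Ideal", "Ideal"]})

# Full one-hot encoding: one boolean column per category.
one_hot = pd.get_dummies(cuts, columns=["cut"])

# drop_first=True removes the first category's column, since a row of
# all zeroes in the remaining columns already implies that category.
reduced = pd.get_dummies(cuts, columns=["cut"], drop_first=True)

print(one_hot.columns.tolist())
print(reduced.columns.tolist())
```

Each row of one_hot has exactly one active column, which is why one of the columns is redundant.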

The good news is that we can do better in memory usage while making more sense of the data at the same time since the data is ordinal (there exists an order, although we are unable to quantify the distance between each category). For instance, in the case of the ‘cut’ column, we know ‘Ideal’ is the best, followed by ‘Premium’, ‘Very Good’, ‘Good’ and finally ‘Fair’. Hence, we assign numbers to each category, in the order they come in. For instance, here ‘Fair’ is assigned 0, ‘Ideal’ is assigned 4, and everything else in between taking the remaining values.


In [None]:
cut_categories = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
clarity_categories = ['I3','I2','I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF','FL']
color_categories = ['J', 'I', 'H', 'G', 'F', 'E', 'D']

We then convert the categorical values to the corresponding numbers as shown in the code below. The first 5 rows of the transformed DataFrame are printed to help you see the difference.

In [None]:
# Check whether the DataFrame has already been converted to numeric codes,
# so that re-running this cell is safe.
# Without the check, re-encoding the already-numeric values would turn
# every code into -1.
already_converted = True
try:
    float(df['cut'][0])
except ValueError:
    already_converted = False

if not already_converted:
    cut_enc = pd.Categorical(df.cut, categories=cut_categories, ordered=True)
    cut_labels, cut_unique = pd.factorize(cut_enc, sort=True)
    df.cut = cut_labels

    clarity_enc = pd.Categorical(df.clarity, categories=clarity_categories, ordered=True)
    clarity_labels, clarity_unique = pd.factorize(clarity_enc, sort=True)
    df.clarity = clarity_labels

    color_enc = pd.Categorical(df.color, categories=color_categories, ordered=True)
    color_labels, color_unique = pd.factorize(color_enc, sort=True)
    df.color = color_labels

print(df.head(5))
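The notebook imports OrdinalEncoder from sklearn.preprocessing but performs the conversion with pandas. As a hedged alternative (not the method used above), the same ordering can be expressed with OrdinalEncoder by passing the category order explicitly. The toy DataFrame below is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Worst-to-best order, as in cut_categories above.
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Toy data; the rows are illustrative only.
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Good', 'Very Good']})

# categories takes one list per encoded column.
encoder = OrdinalEncoder(categories=[cut_order])
codes = encoder.fit_transform(toy[['cut']])

# The result is a 2-D float array with one column per encoded feature.
toy['cut_code'] = codes[:, 0].astype(int)
print(toy)
```

With its default settings, OrdinalEncoder raises an error when it meets a value that is not in the supplied list, which can help catch typos in the category names.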

We have completed the data preprocessing. Now, the data is ready to be used for training our model! Firstly, we need to split the data into a training set and a test set. Testing the model against never-before-seen data lets us detect whether it has been overfitted to the training data.

We do this using train_test_split() from sklearn.model_selection. In this case, we use a test_size of 0.25, which means that 25% of the data points are used for the test set. This is also the value scikit-learn uses when neither test_size nor train_size is given, but stating it explicitly makes the split easier to read.

We use the current time as the random state to introduce some run-to-run variability, so that you can see how the performance of the model changes when a different test set is chosen at random. However, when reproducible results are desired (e.g. for easy debugging), it is advisable to choose a fixed random state.


In [None]:
# You would want to drop the price column so that the machine learning program 
# does not predict the prices by peeking at the prices.
X_vals = df.drop('price', axis=1) 

y_vals = df.price
X_train, X_test, y_train, y_test = train_test_split(X_vals, y_vals, test_size=0.25, random_state=int(time.time()))
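To illustrate the point about reproducibility, here is a small sketch on synthetic arrays (not the diamonds data) showing that a fixed random_state gives the same split every time. The seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same seed produces an identical split on every call.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.25, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.25, random_state=42)

print(len(y_test_a))
print(np.array_equal(y_test_a, y_test_b))
```

With 10 samples and test_size=0.25, scikit-learn rounds the test set up to 3 samples and uses the remaining 7 for training.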

Now, the data has been split into the training set and the test set and we are ready to train our model! However, the Ridge algorithm has a hyperparameter called ‘alpha’, which controls the strength of the regularization. As we do not know what value of alpha to use, we try multiple values of alpha and pick the best one based on our chosen performance metric, namely the R^2 value (the coefficient of determination) of the fitted model. We do this using GridSearchCV. The best parameters can be found using the best_params_ attribute, while the best score can be found using the best_score_ attribute.
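To give a feel for what alpha does, the following sketch fits Ridge on synthetic data (generated here with an arbitrary seed, not taken from the diamonds dataset) and prints the size of the coefficient vector for a few values of alpha. Larger values of alpha shrink the coefficients towards zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem; the true weights are an arbitrary choice.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

norms = {}
for alpha in (1e-6, 1.0, 1e6):
    model = Ridge(alpha=alpha).fit(X, y)
    # Euclidean length of the fitted coefficient vector.
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(alpha, norms[alpha])
```

A very small alpha behaves almost like ordinary least squares, while a very large alpha forces the coefficients close to zero, so a sensible value usually lies somewhere in between.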

5-fold cross-validation is used to reduce the risk of overfitting to one particular split. The training set is divided into 5 non-overlapping folds, and each fold in turn is held out as a validation set while the model is fitted on the other 4 folds. The score reported for each value of alpha is the average over the 5 held-out folds, which tests how well the model generalizes to diamonds it was not fitted on.
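The fold structure can be sketched with KFold, which is the splitter scikit-learn uses for a regressor when cv is given as an integer. The 10-sample array below is a stand-in for the training set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the training set: 10 samples, 1 feature.
X = np.arange(10).reshape(10, 1)

kf = KFold(n_splits=5)
seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # val_idx holds the samples held out in this fold;
    # train_idx holds the samples the model would be fitted on.
    print(fold, val_idx)
    seen.extend(val_idx.tolist())
```

Every sample appears in exactly one validation fold, so each diamond in the training set is used for validation once and for fitting four times.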

The code could take several seconds to run, so please be patient here.



In [None]:
rdge = Ridge()

# logspace is used because we do not really know the order of magnitude of alpha
param_grid = {'alpha':np.logspace(-6,6,100)} 

gscv = GridSearchCV(rdge,param_grid,cv=5)
gscv.fit(X_train,y_train)
print("Best alpha value: {}".format(gscv.best_params_['alpha']))
print("Mean cross-validated R-Squared value on training set: {}".format(gscv.best_score_))
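To see why logspace suits a search over orders of magnitude, here is a small sketch with 13 points instead of the 100 used above, so that each point falls on an exact power of ten.

```python
import numpy as np

# 13 points from 10**-6 to 10**6, evenly spaced on a logarithmic scale.
alphas = np.logspace(-6, 6, 13)
print(alphas[0], alphas[6], alphas[-1])

# Consecutive values differ by a constant factor rather than a constant step.
ratios = alphas[1:] / alphas[:-1]
print(ratios)
```

A linear grid over the same range would place almost all of its points at very large values of alpha and barely sample the small ones.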

In [None]:
print("R-Squared value for test set: {}".format(gscv.score(X_test,y_test)))
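The score printed above is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a worked sketch, it can be computed by hand on a few made-up values (these numbers are illustrative, not diamond prices):

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far the predictions are from the actual values.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the actual values are from their own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible when it does worse than that.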

In [None]:
y_predict = gscv.predict(X_test)
# Mean absolute difference between the actual and predicted prices
ave_diff = np.mean(np.abs(y_test.to_numpy() - y_predict))
ave_pred = np.mean(y_predict)
ave_value = np.mean(y_test.to_numpy())
print("average difference: %s" % ave_diff)
print("average predicted value: %s" % ave_pred)
print("average actual value: %s" % ave_value)
# Mean absolute difference as a fraction of the average actual price
print("error: %s" % (ave_diff / ave_value))
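The "average difference" computed in the cell above is the mean absolute error. scikit-learn provides this metric as mean_absolute_error in sklearn.metrics; the sketch below applies it to made-up numbers rather than to the model's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

# Average of |actual - predicted| over all samples.
mae = mean_absolute_error(y_true, y_pred)

# The same error expressed as a fraction of the average actual value.
relative = mae / y_true.mean()
print(mae, relative)
```

Unlike R^2, the mean absolute error is expressed in the same unit as the target, which for the diamonds data would be US dollars.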

In [None]:
# Alternative encoding using explicit dictionaries.
# The index column was already dropped in an earlier cell, so it is not
# dropped again here (doing so would remove the carat column).
print(df.columns)
# Note: Fair < Good < Very Good < Premium < Ideal; colours run from D to J
cuttonum = {'Fair':0, 'Good':1, 'Very Good':2, 'Premium':3, 'Ideal':4}
claritytonum = {'I3':0,'I2':1,'I1':2,'SI2':3,'SI1':4,'VS2':5,'VS1':6,'VVS2':7,'VVS1':8,'IF':9,'FL':10}
# Only convert if the columns still hold strings, i.e. they have not
# already been encoded by the cells above.
if not pd.api.types.is_numeric_dtype(df['cut']):
    df['cut'] = df['cut'].map(cuttonum)
    df['clarity'] = df['clarity'].map(claritytonum)
    df['color'] = df['color'].map(lambda c: ord(c) - ord('D'))
print('Done!\n')
y_vals = df.price
X_train, X_test, y_train, y_test = train_test_split(df.drop('price', axis=1), y_vals, random_state=int(time.time()))
rdge = Ridge()
param_grid = {'alpha':np.logspace(-6,6,50)}
gscv = GridSearchCV(rdge,param_grid,cv=5)
gscv.fit(X_train,y_train)
print("Best alpha value: {}".format(gscv.best_params_['alpha']))
print("R-Squared value: {}".format(gscv.best_score_))

In [None]:
rdge = Ridge()
#rdge.fit(X_train, y_train)
#y_pred = rdge.predict(X_test)
param_grid = {'alpha':np.logspace(-6,6,200)}
gscv = GridSearchCV(rdge,param_grid,cv=5)
gscv.fit(X_train,y_train)
print("Best alpha value: {}".format(gscv.best_params_['alpha']))
print("R-Squared value: {}".format(gscv.best_score_))

In [None]:
gscv.score(X_test,y_test)

In [None]:
# Predicted prices for the diamonds in the test set
print(gscv.predict(X_test))