# The lapidarist* problem

This problem is for you to show your modeling abilities. Bonus points for testing your model’s error using a test set.

The limousine comes to a full stop. As the driver gets out to open the door you take a deep breath and get inside. You enter 10 Downing Street and are conducted to the usual meeting room. Inside you find the Prime Minister, accompanied by a fat, tall man and a short, deform one with long ears and an even longer nose.

Prime Minister: “Ah! You’re here! Great! Let me introduce my guests. This is Fidelious, Minister of Magic, and Krenk, the owner of the Gringotts Wizarding Bank.”

You: “Uhhh, ma’am, is this a joke?”

Fidelious: “Not at all, but don’t worry, don’t sweat the details, tomorrow you won’t remember anything. Security measures, you see.”

Krenk: “Let’s move things along. I don’t like to be exposed to Muggles.”

You: “What...” The Prime Minister interrupts you.

Prime Minister: “Our friends here seem to have run into a bit of an issue, see, some diamonds seem to have been stolen. Problem is, the only person... goblin, sorry“ he says apologetically to Krenk ” to have seen them is our distinguished guest, Krenk.”

Fidelious: “And while the Ministry completely believes Krenk as to the diamonds’ worth, we need another person to validate his claim. Safety policies, you see.”

Prime Minister: “So, since you’re the best data scientist in our country, I thought you could help. Mr. Krenk will provide you with the characteristics of the missing diamonds and we need you to create a model to value them.”

You: “But I’m not a lapidarist.”

Prime Minister: “Which is why we’re providing you with a huge dataset, containing characteristics and valuations for tens of thousands of diamonds. Now, get working.“

“Huge? Tens of thousands?“ You think. “And I thought I was the clueless one here.”

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

In [2]:
diamonds = pd.read_csv("/Users/sebastianquintanilla/Downloads/diamonds/diamonds_data.csv")

In [3]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53930 entries, 0 to 53929
Data columns (total 10 columns):
carat      53930 non-null float64
cut        53930 non-null object
color      53930 non-null object
clarity    53930 non-null object
depth      53930 non-null float64
table      53930 non-null float64
price      53930 non-null int64
x          53930 non-null float64
y          53930 non-null float64
z          53930 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


## 1.Preparing the data

We have information on 53930 diamonds, and the info is on carat, cut, color, clarity, depth, table, dimensions (x, y and z) and price. We have no missing values in any of the columns.

Three of these columns have categorical variables, color, clarity and cut. We check what values they can take.

In [4]:
color = diamonds['color'].unique()
color

array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)

In [5]:
clarity = diamonds['clarity'].unique()
clarity

array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'], dtype=object)

In [6]:
cut = diamonds['cut'].unique()
cut

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

According to the dictionary, these 3 categorical variables can be ordered, so we do it with the following function.

In [None]:
def number_cat(row):
    ''' This function makes three new rows, for the three categorical variables
    '''
    #first we order the colors
    if row['color'] == 'D':
        row['n_color'] = 7
    elif row['color'] == 'E':
        row['n_color'] = 6
    elif row['color'] == 'F':
        row['n_color'] = 5
    elif row['color'] == 'G':
        row['n_color'] = 4
    elif row['color'] == 'H':
        row['n_color'] = 3
    elif row['color'] == 'I':
        row['n_color'] = 2
    else:
        row['n_color'] = 1
        
    #then we order the clarity
    if row['clarity'] == 'IF':
        row['n_clarity'] = 8
    elif row['clarity'] == 'VVS1':
        row['n_clarity'] = 7
    elif row['clarity'] == 'VVS2':
        row['n_clarity'] = 6
    elif row['clarity'] == 'VS1':
        row['n_clarity'] = 5
    elif row['clarity'] == 'VS2':
        row['n_clarity'] = 4
    elif row['clarity'] == 'SI1':
        row['n_clarity'] = 3
    elif row['clarity'] == 'SI2':
        row['n_clarity'] = 2
    else:
        row['n_clarity'] = 1
        
    #lastly we order the cut
    if row['cut'] == 'Ideal':
        row['n_cut'] = 5
    elif row['cut'] == 'Premium':
        row['n_cut'] = 4
    elif row['cut'] == 'Very Good':
        row['n_cut'] = 3
    elif row['cut'] == 'Good':
        row['n_cut'] = 2
    else:
        row['n_cut'] = 1
        
    return row

diamonds = diamonds.apply(number_cat, axis=1)

In [8]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,n_color,n_clarity,n_cut
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,6,2,5
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,6,3,4
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,6,5,2
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63,2,4,4
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,1,2,2


## 2. Random forest regressor

We will now divide our data in the dependent variable (price) and all the others, also droping the categorical variables that aren't numerical.

In [30]:
X = diamonds.drop(['cut','color','clarity','price'],axis=1)
y = diamonds.price

To test our predictions we will separate our data in two, training and validation, using a function imported from sklearn. 

In [31]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 8)

We will now use and fit a Random Forest Model to the training data.

In [32]:
model = RandomForestRegressor(random_state = 8, n_jobs=-1)
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=-1, oob_score=False, random_state=8,
           verbose=0, warm_start=False)

In [33]:
model.score(X_train,y_train)

0.99645483325815409

The R^2 of this model (with the training data) is very close to 1, so the model performed very well for the training data (as expected).

We will truly test this model by calculating the R^2 and the mean absolute error (MAE) for the validation data (which wasn't used for fitting the model).

In [34]:
y_pred = model.predict(X_val)
r2 = r2_score(y_pred, y_val)
mae = mean_absolute_error(y_pred,y_val)

print('The R2 for the validation data is: '+ str(r2))
print('The MAE for the validation data is: '+ str(mae))



The R2 for the validation data is: 0.979247263249
The MAE for the validation data is: 284.25210376


We can see we got a great R^2 (really close to 1) for the validation data. The MAE is 284.25, which means that on average our predictions are off by 248.25 muggle money.

## 3. Predicting for the test data

We will now make a data frame of the stolen diamonds and predict their price.

In [35]:
data = {'carat':[0.71,0.83,0.5,0.39,0.32,0.9,0.51,1.12,0.4,0.36], 'n_cut':[2,5,5,4,4,2,5,5,5,4], 'n_color':[2,4,6,1,4,5,7,4,4,2], 'n_clarity':[6,5,4,5,5,2,5,6,6,4], 'depth':[63.1,62.1,61.5,61.6,62.1,63.3,60.9,62.1,62.4,62.7], 'table':[58,55,55,59,56,57,57,54.8,56,59], 'x':[5.64,6.02,5.11,4.67,4.43,6.08,5.2,6.64,4.72,4.54], 'y':[5.71,6.05,5.16,4.71,4.4,6.14,5.17,6.66,4.74,4.58], 'z':[3.58,3.75,3.16,2.89,2.74,3.87,3.16,4.13,2.95,2.86]}
stolen_diamonds = pd.DataFrame(data)
stolen_diamonds


Unnamed: 0,carat,depth,n_clarity,n_color,n_cut,table,x,y,z
0,0.71,63.1,6,2,2,58.0,5.64,5.71,3.58
1,0.83,62.1,5,4,5,55.0,6.02,6.05,3.75
2,0.5,61.5,4,6,5,55.0,5.11,5.16,3.16
3,0.39,61.6,5,1,4,59.0,4.67,4.71,2.89
4,0.32,62.1,5,4,4,56.0,4.43,4.4,2.74
5,0.9,63.3,2,5,2,57.0,6.08,6.14,3.87
6,0.51,60.9,5,7,5,57.0,5.2,5.17,3.16
7,1.12,62.1,6,4,5,54.8,6.64,6.66,4.13
8,0.4,62.4,6,4,5,56.0,4.72,4.74,2.95
9,0.36,62.7,4,2,4,59.0,4.54,4.58,2.86


In [39]:
stolen_price = model.predict(stolen_diamonds)
print('Our prediction for the prices of the stolen diamonds is: ')
print(stolen_price)

Our prediction for the prices of the stolen diamonds is: 
[  2918.6   3461.9   1887.8    466.3    447.5   4228.    2154.2  11648.5
    710.8    466.3]


## 4. Conclusions.

This random forest regressor has a very good R^2 for the validation sample, that means that most of it's estimations are close to the real value, but also it has a significant mean absolute error (248 $ is a considerable error). This means that sometimes the model fails by big amounts of money, making that average higher.
