# Diamond Price Predictions
This notebook will explore the diamonds dataset and build a model for price predictions.


In [None]:
#Import libraries and initialise input_path
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plot
import seaborn as sns #plot

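import os
# Walk the Kaggle input directory and print every file found.
# input_path keeps the last file visited, so this assumes the dataset is a single CSV.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        input_path = os.path.join(dirname, filename)
        print(input_path)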



In [None]:
# Read file
diamonds = pd.read_csv(input_path)
diamonds.head()

In [None]:
# Drop any 'Unnamed' columns (typically a row index saved into the CSV)
diamonds.drop(
    columns = diamonds.filter(regex='Unnamed').columns,
    inplace = True)
diamonds.head()

## EDA

In [None]:
#Examine distribution of target variable
sns.displot(diamonds['price'])

In [None]:
# A 5-number summary of the target variable
print('Min: ', diamonds['price'].min())
print('Q1: ', np.percentile(diamonds['price'],25))
print('Median: ', np.percentile(diamonds['price'],50))
print('Q3: ', np.percentile(diamonds['price'],75))
print('Max: ', diamonds['price'].max())
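
The same five-number summary can also be obtained in a single call. A minimal sketch using `Series.quantile` on a small made-up price list (the values and the name `toy_prices` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up prices for illustration only
toy_prices = pd.Series([300, 900, 2400, 5300, 18800])

# Quantiles 0, 0.25, 0.5, 0.75 and 1 are the min, Q1, median, Q3 and max
five_number = toy_prices.quantile([0, 0.25, 0.5, 0.75, 1.0])
five_number.index = ['Min', 'Q1', 'Median', 'Q3', 'Max']
print(five_number)
```

In the notebook, `diamonds['price']` would take the place of `toy_prices`.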

In [None]:
# A pairplot is an easy way to see the pairwise relationships between numerical variables
sns.pairplot(diamonds)

From the scatter plots, price appears to rise roughly exponentially with most of the numerical variables (carat, x, y, and z). For depth and table, however, there seems to be an optimum range: diamonds whose depth or table falls outside it fetch lower prices.
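
One way to check an exponential-looking relationship is to take the logarithm of the target and see whether the trend becomes a straight line. The sketch below uses synthetic data rather than the diamonds dataset; the growth rate, the scale, and the noise level are assumptions chosen purely for illustration.

```python
import numpy as np

# Synthetic data, not the diamonds dataset: price grows exponentially with carat
rng = np.random.default_rng(0)
toy_carat = rng.uniform(0.2, 3.0, size=500)
true_rate = 1.5  # hypothetical growth rate, chosen for illustration
toy_price = 300 * np.exp(true_rate * toy_carat) * np.exp(rng.normal(0, 0.1, size=500))

# If price = a * exp(b * carat), then log(price) = log(a) + b * carat,
# so a straight-line fit on log(price) should recover b
slope, intercept = np.polyfit(toy_carat, np.log(toy_price), deg=1)
print(f'Estimated rate: {slope:.3f} (true rate: {true_rate})')
```

If the real data behaves similarly, plotting `np.log(diamonds['price'])` against a feature should look much closer to a line than the raw scatter plot.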

## Encoding Categorical Variables
We need to encode the categorical variables as numbers before feeding them into the predictive algorithms, but note that they are ordinal. Therefore, we cannot simply apply the label encoder to all of them: it assigns labels in alphabetical order, which may not match the rank of the values.

The exception is color, where D is the best and J is the worst. The ranking matches alphabetical order, so we can use the label encoder (D is mapped to 0, so a higher code means a worse color).
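
To see why alphabetical labels are a problem for the other columns, here is a small sketch that applies `LabelEncoder` to the cut grades alone. The names `cut_values` and `cut_encoder` are introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# The five cut grades that appear in the dataset
cut_values = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

cut_encoder = LabelEncoder()
cut_encoder.fit(cut_values)

# classes_ is sorted alphabetically; the position in this list is the label
for label, grade in enumerate(cut_encoder.classes_):
    print(label, grade)
```

Here 'Very Good' receives the highest label simply because it sorts last, which is not where it sits in the quality ranking.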

In [None]:
#Use label encoder for color
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(diamonds['color'])
diamonds['color'] = le.transform(diamonds['color'])
print('color: ', set(diamonds['color']))

In [None]:
#Define our own encoding using a dictionary
print(set(diamonds['cut']))
print(set(diamonds['clarity']))


replace = {
    'cut':{
        # Ideal is the top cut grade in this dataset; Premium ranks just below it
        'Fair': 0,
        'Good': 1,
        'Very Good': 2,
        'Premium': 3,
        'Ideal': 4},
    'clarity':{
        'I1':0,
        'SI2':1,
        'SI1':2,
        'VS2':3,
        'VS1':4,
        'VVS2':5,
        'VVS1':6,
        'IF':7}
}

diamonds.replace(replace,inplace=True)
print('cut: ', set(diamonds['cut']))
print('clarity: ',set(diamonds['clarity']))
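
An alternative to a replacement dictionary is an ordered pandas categorical, which stores the ranking in the column's dtype. The sketch below is only an illustration on a made-up sample; the names `clarity_order`, `clarity_dtype`, and `toy_stones` are not part of the notebook.

```python
import pandas as pd

# Clarity grades from worst to best, matching the dictionary above
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_dtype = pd.CategoricalDtype(categories=clarity_order, ordered=True)

# Made-up sample rows for illustration
toy_stones = pd.DataFrame({'clarity': ['VS1', 'I1', 'IF', 'SI2']})

# cat.codes gives each value's position in clarity_order
toy_stones['clarity_code'] = toy_stones['clarity'].astype(clarity_dtype).cat.codes
print(toy_stones)
```

The codes are the same integers as in the dictionary, but the ordering is declared once as a list, which may be easier to review.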

In [None]:
#We can also use a heatmap to see the correlation between variables
plt.figure(figsize=(20,15))
ax=plt.subplot(111)
sns.heatmap(diamonds.corr(),cmap="coolwarm",center=0,annot=True)

# Regression Model
## Random Forest
To detect overfitting, we first split the data into training and test datasets, so that the model can be evaluated on data it has not seen during training.

In [None]:
#Train Test Split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(diamonds.drop(['price'],axis=1),
                                                    diamonds['price'],
                                                    test_size=0.3,
                                                    random_state=42)

print(x_train)

In [None]:
#Train Model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)
model.fit(x_train,y_train)

In [None]:
#Predict Test Dataset
predictions = model.predict(x_test)

In [None]:
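#Examine results of prediction vs the actual price. A perfect model would put every point on a straight diagonal line
plt.scatter(predictions,y_test)
plt.xlabel('Predicted price')
plt.ylabel('Actual price')
plt.title('Prediction vs Actuals')
plt.show()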

In [None]:
#Print model score (R^2)
print(f'Training score: {model.score(x_train,y_train):.5f}')
print(f'Test score: {model.score(x_test,y_test):.5f}')
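
Comparing the two scores is how the split reveals overfitting: a training score well above the test score suggests the model has memorised noise. The sketch below shows such a gap on synthetic, deliberately noisy data. It is not the diamonds dataset, and every name prefixed with `toy_` is introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature matters, and the target is very noisy
rng = np.random.default_rng(42)
toy_X = rng.uniform(0, 1, size=(500, 3))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 1.0, size=500)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42)

toy_forest = RandomForestRegressor(n_estimators=50, random_state=42)
toy_forest.fit(toy_X_train, toy_y_train)

toy_train_r2 = toy_forest.score(toy_X_train, toy_y_train)
toy_test_r2 = toy_forest.score(toy_X_test, toy_y_test)
print(f'Training R^2: {toy_train_r2:.3f}')
print(f'Test R^2:     {toy_test_r2:.3f}')
```

When the two scores are close, as one would hope for the diamonds model, overfitting is less of a concern.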

In [None]:
#Print error metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

print('MAE\t\tMSE\t\tRMSE')
print(f'{mean_absolute_error(y_test,predictions):.2f}\t\t',
      f'{mean_squared_error(y_test,predictions):.2f}\t',
      f'{np.sqrt(mean_squared_error(y_test,predictions)):.2f}')
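
For reference, the three metrics can be computed by hand. The sketch below uses four made-up actual and predicted prices (not model output) and checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only
toy_actual = np.array([100.0, 200.0, 300.0, 400.0])
toy_predicted = np.array([110.0, 190.0, 330.0, 380.0])

toy_errors = toy_actual - toy_predicted   # [-10, 10, -30, 20]
toy_mae = np.mean(np.abs(toy_errors))     # mean absolute error
toy_mse = np.mean(toy_errors ** 2)        # mean squared error
toy_rmse = np.sqrt(toy_mse)               # root mean squared error, in price units

print(f'MAE: {toy_mae:.2f}  MSE: {toy_mse:.2f}  RMSE: {toy_rmse:.2f}')

# The manual values agree with scikit-learn's implementations
assert np.isclose(toy_mae, mean_absolute_error(toy_actual, toy_predicted))
assert np.isclose(toy_mse, mean_squared_error(toy_actual, toy_predicted))
```

MAE and RMSE are in the same units as the price, while MSE is in squared units, which is why RMSE is usually the easier of the two to interpret.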


In [None]:
#Get feature importances to see which variables are important predictors of diamond prices
feature_importance = pd.DataFrame({
    'Column':x_train.columns,
    'Importance':model.feature_importances_})
feature_importance.sort_values('Importance',inplace=True)
plt.barh(feature_importance['Column'],feature_importance['Importance'])

plt.title('Diamond Price Prediction Feature Importances')
plt.show()


This model achieves an R^2 of 0.9813 on the test dataset. Unsurprisingly, carat is the most important feature for predicting price, followed by y, clarity, and color. This is an interesting observation because traditionally we would value color and clarity more, but apparently y plays an important role.
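
One possible explanation, assuming this is the standard diamonds dataset in which y is the stone's width in mm, is that y is strongly correlated with carat, so the forest can split on either one and the importance is shared between them. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data with an invented `y` column that tracks carat; all `toy_` names, coefficients, and noise levels are assumptions for illustration, not results from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
toy_carat = rng.uniform(0.2, 2.5, size=n)
toy_features = pd.DataFrame({
    'carat': toy_carat,
    # Hypothetical width that follows carat closely
    'y': 6.5 * np.cbrt(toy_carat) + rng.normal(0, 0.05, size=n),
    # A column unrelated to the price
    'noise': rng.normal(0, 1, size=n),
})
toy_price = 5000 * toy_carat ** 1.5 + rng.normal(0, 300, size=n)

f_train, f_test, p_train, p_test = train_test_split(
    toy_features, toy_price, test_size=0.3, random_state=0)
toy_model = RandomForestRegressor(n_estimators=50, random_state=0)
toy_model.fit(f_train, p_train)

# Permutation importance: drop in test R^2 when one column is shuffled
perm = permutation_importance(toy_model, f_test, p_test,
                              n_repeats=5, random_state=0)
comparison = pd.DataFrame({
    'impurity': toy_model.feature_importances_,
    'permutation': perm.importances_mean,
}, index=toy_features.columns)
print(comparison)
```

In the notebook, the same call could be made with `model`, `x_test`, and `y_test` to see whether y keeps its rank when importance is measured on unseen data.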