Diamonds: the most valuable gemstone in the world. But what is the relationship between the stone itself and its price? Is it as simple as a linear regression, or is the relationship perhaps non-linear?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
plt.rcParams["figure.figsize"]=12,6

According to the data, there are 53940 diamonds, each described by 10 attributes:
1. carat (i.e. weight)
1. quality of the cut (5 categories in ascending order of Fair, Good, Very Good, Premium, Ideal)
1. colour (7 categories with J being the worst to D being the best)
1. clarity (8 categories in ascending order of I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
1. depth out of 100 (i.e. how deep the diamond is relative to its width)
1. table out of 100 (i.e. the ratio of the flat surface at the top of the diamond to its average width)
1. price in USD
1. length (x) in mm
1. width (y) in mm 
1. depth (z) in mm

In [None]:
df=pd.read_csv("/kaggle/input/diamonds/diamonds.csv",index_col=0)
df

In [None]:
df.info()

Although cut, colour and clarity are categorical variables (stored as object), each has a natural order/ranking and can therefore be transformed into numeric values.

In [None]:
# Map each ordered category to an integer rank (higher = better quality).

#CUT
cut_codes={"Fair":1,"Good":2,"Very Good":3,"Premium":4,"Ideal":5}
df["cut"]=df["cut"].map(cut_codes)

#COLOUR
color_codes={"J":1,"I":2,"H":3,"G":4,"F":5,"E":6,"D":7}
df["color"]=df["color"].map(color_codes)

#CLARITY
# Ascending order, matching the description above: I1 is the worst, IF the best.
clarity_codes={"I1":1,"SI2":2,"SI1":3,"VS2":4,"VS1":5,"VVS2":6,"VVS1":7,"IF":8}
df["clarity"]=df["clarity"].map(clarity_codes)

df.head()

There are no null values, but the data contain some 0s for x, y and z. These should not be read as measurements of 0 mm, because a diamond without a length, width or depth cannot be a 3-dimensional object. Instead, these 0 values should be treated as missing data. Our options for dealing with missing data are to either:
1. drop the rows; or
1. replace the 0s with a new value such as the mean
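
This notebook goes with the first option. For comparison, here is a minimal sketch of the second option on a made-up toy frame (the `toy` and `filled` names and all values are illustrative, not taken from the diamonds data). It treats 0 as missing and fills each column with the mean of its remaining values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the x, y and z columns (values are made up).
toy = pd.DataFrame({
    "x": [4.0, 0.0, 6.0],
    "y": [4.1, 5.0, 0.0],
    "z": [2.5, 3.0, 3.5],
})

dims = ["x", "y", "z"]
filled = toy.copy()

# Step 1: mark the impossible 0 measurements as missing.
filled[dims] = filled[dims].replace(0, np.nan)

# Step 2: fill each column with the mean of its non-missing values.
filled[dims] = filled[dims].fillna(filled[dims].mean())

print(filled)
```

Replacing the 0s with NaN first matters: otherwise the 0s themselves would pull the column mean down.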

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
print(f"The number of rows with a value of 0 for x is {(df['x']==0).sum()}.")
print(f"The number of rows with a value of 0 for y is {(df['y']==0).sum()}.")
print(f"The number of rows with a value of 0 for z is {(df['z']==0).sum()}.")
print(f"The total number of rows with a value of 0 is {((df['x']==0)|(df['y']==0)|(df['z']==0)).sum()}.")

Since only 20 of the 53940 rows (0.04%) contain a value of 0, we simply drop those rows.

In [None]:
df.drop(df[(df["x"]==0)|(df["y"]==0)|(df["z"]==0)].index,inplace=True)

But now when we take a look at our data, there are some visually obvious outliers (especially in y, z, depth and table).

In [None]:
fig=plt.figure(figsize=(16,16))

ax1=plt.subplot2grid((4,3),(0,0),colspan=2,rowspan=2)
ax2=plt.subplot2grid((4,3),(0,2))
ax3=plt.subplot2grid((4,3),(1,2))
ax4=plt.subplot2grid((4,3),(2,0))
ax5=plt.subplot2grid((4,3),(2,1))
ax6=plt.subplot2grid((4,3),(2,2))
ax7=plt.subplot2grid((4,3),(3,0))
ax8=plt.subplot2grid((4,3),(3,1))
ax9=plt.subplot2grid((4,3),(3,2))

sns.scatterplot(x=df["carat"],y=df["price"],color="lavender",ax=ax1)
sns.scatterplot(x=df["x"],y=df["price"],color="powderblue",ax=ax2)
sns.scatterplot(x=df["y"],y=df["price"],color="lightblue",ax=ax3)
sns.scatterplot(x=df["depth"],y=df["price"],color="palegreen",ax=ax4)
sns.scatterplot(x=df["table"],y=df["price"],color="lightgreen",ax=ax5)
sns.scatterplot(x=df["z"],y=df["price"],color="skyblue",ax=ax6)
sns.scatterplot(x=df["cut"],y=df["price"],color="lightsalmon",ax=ax7)
sns.scatterplot(x=df["color"],y=df["price"],color="palevioletred",ax=ax8)
sns.scatterplot(x=df["clarity"],y=df["price"],color="gold",ax=ax9)

plt.tight_layout(pad=1,h_pad=1,w_pad=1)

So let's filter out the outliers.
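
The thresholds used in this notebook were picked by eye from the scatterplots. As a rough alternative, here is a sketch of the common 1.5 x IQR rule, which is a different technique from the visual cut-offs used here. The `iqr_outlier_mask` helper and the sample widths are made up for illustration:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up widths in mm; 31.8 plays the role of an implausible measurement.
widths = np.array([4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 31.8])
mask = iqr_outlier_mask(widths)

print(widths[mask])   # the values flagged as outliers
print(widths[~mask])  # the values that would be kept
```

A rule like this can be stricter or looser than a visual cut-off, so it is worth checking how many rows it would flag before dropping anything.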

In [None]:
outlier=df[(df["y"]>30)|(df["z"]>10)|(df["depth"]<50)|(df["depth"]>75)|(df["table"]<45)|(df["table"]>75)]
print("There are {num} rows of visual outliers where y >30, z >10, depth <50 or >75, or table <45 or >75.".format(num=outlier.shape[0]))

In [None]:
df.drop(outlier.index,inplace=True)

print("After dropping the Visual Outliers, the number of rows are now",df.shape[0],".")

Time to explore the diamonds!

In [None]:
sns.set_style("ticks")
sns.histplot(x=df["price"],bins=100,color="#AE9CCD")  # histplot replaces the deprecated distplot
plt.xlabel("Price in USD")
plt.ylabel("Count")
plt.title("Price Distribution",size=16)

In [None]:
print("The highest price is USD ",df["price"].max())
print("The lowest price is USD ",df["price"].min())
print("The average price is USD ",round(df["price"].mean()))
print("The most common price is USD ",df["price"].value_counts().idxmax())

In [None]:
mask=np.zeros_like(df.corr(),dtype=bool)  # np.bool was removed from NumPy; the built-in bool works
mask[np.triu_indices_from(mask)]=True
sns.heatmap(data=df.corr(),annot=True,square=True,mask=mask,cmap="Pastel1",linewidths=1,linecolor="white")

Of the 9 attributes, the weight (i.e. carat) of a diamond has the highest correlation with the price followed by its dimensions x, y and z.

But just by looking at these numbers, is it safe to assume a linear relationship? What if there is a non-linear relationship?
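
One reason to be careful: the correlation coefficient in the heatmap only measures linear association. A small sketch on synthetic data (not the diamonds data) shows a perfectly predictable squared relationship that has a correlation of essentially zero:

```python
import numpy as np

# Synthetic inputs, symmetric around 0.
x_demo = np.linspace(-3, 3, 61)

y_linear = 2 * x_demo + 1   # exactly linear in x_demo
y_squared = x_demo ** 2     # exactly determined by x_demo, but not linear

r_linear = np.corrcoef(x_demo, y_linear)[0, 1]
r_squared = np.corrcoef(x_demo, y_squared)[0, 1]

print("linear relationship:  r = %.3f" % r_linear)
print("squared relationship: r = %.3f" % abs(r_squared))
```

So a high correlation supports a linear trend, but it does not rule out that a non-linear model would fit better.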

In [None]:
from sklearn.model_selection import train_test_split
x=df.drop(["price"],axis=1)
y=df["price"]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=5)

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
relationship=["Linear","Squared","Cubic"]
rmse=[]
r2score=[]

**Linear Relationship**

In [None]:
lr=LinearRegression()
lr.fit(x_train,y_train)
y1_predict=lr.predict(x_test)

print("MAE: %.2f"%mean_absolute_error(y_test,y1_predict))
print("MSE: %.2f"%mean_squared_error(y_test,y1_predict))
print("RMSE: %.2f"%np.sqrt(mean_squared_error(y_test,y1_predict)))  # RMSE is the square root of the MSE, not of the MAE
print("R2: %.2f"%r2_score(y_test,y1_predict))

rmse.append(np.sqrt(mean_squared_error(y_test,y1_predict)))
r2score.append(r2_score(y_test,y1_predict))

sns.histplot(y_test,bins=50,kde=True,stat="density",color="lightskyblue",label="Actual Values")  # histplot replaces the deprecated distplot
sns.histplot(y1_predict,bins=70,kde=True,stat="density",color="plum",label="Predicted Values")
plt.legend()
plt.xlabel("Price")

In [None]:
y1_predict.min()

In [None]:
print("The R Squared Score was not bad at 0.91, but the model predicted negative prices, ranging from a min of {predmin} to a max of {predmax}, while the actual prices only ranged from {actualmin} to {actualmax}. The predicted price distribution followed the general trend, but the medians were off by almost 400, with the predicted at {predmed} and the actual at {actualmed}.".format(predmin=int(y1_predict.min()),predmax=int(y1_predict.max()),actualmin=y_test.min(),actualmax=y_test.max(),predmed=int(np.median(y1_predict)),actualmed=int(np.median(y_test))))

**Squared Relationship**
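
Before fitting, it may help to see what a polynomial expansion adds. The sketch below uses only the standard library to list the terms of a degree-2 expansion for two features and to count the columns produced for the 9 features used here. The `polynomial_terms` helper is made up for illustration. The counts correspond to what `PolynomialFeatures` generates with its default settings (bias column included, interaction terms kept):

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(names, degree):
    """List every monomial of total degree <= `degree` built from `names`."""
    terms = ["1"]  # the constant (bias) column
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

# Two features are enough to see the pattern.
print(polynomial_terms(["carat", "x"], 2))

# Number of columns for 9 features: C(n + d, d) monomials of degree <= d.
n_features = 9
for degree in (1, 2, 3):
    print("degree", degree, "->", comb(n_features + degree, degree), "columns")
```

The column count grows quickly with the degree, which is worth keeping in mind when comparing the squared and cubic fits.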

In [None]:
poly2=PolynomialFeatures(degree=2)
x2_train=poly2.fit_transform(x_train)
x2_test=poly2.transform(x_test)

lr2=LinearRegression()
lr2.fit(x2_train,y_train)
y2_predict=lr2.predict(x2_test)

print("MAE: %.2f"%mean_absolute_error(y_test,y2_predict))
print("MSE: %.2f"%mean_squared_error(y_test,y2_predict))
print("RMSE: %.2f"%np.sqrt(mean_squared_error(y_test,y2_predict)))  # RMSE is the square root of the MSE, not of the MAE
print("R2: %.2f"%r2_score(y_test,y2_predict))

rmse.append(np.sqrt(mean_squared_error(y_test,y2_predict)))
r2score.append(r2_score(y_test,y2_predict))

sns.histplot(y_test,bins=50,kde=True,stat="density",color="lightskyblue",label="Actual Values")  # histplot replaces the deprecated distplot
sns.histplot(y2_predict,bins=70,kde=True,stat="density",color="plum",label="Predicted Values")
plt.legend()
plt.xlabel("Price")

In [None]:
print("The R Squared Score improved to 0.96 and the model followed the price distribution more closely, but it still predicted negative prices, ranging from a min of {predmin} to a max of {predmax}, while the actual prices only ranged from {actualmin} to {actualmax}. The median prices were much closer and only off by 10 (predicted {predmed} and actual {actualmed}).".format(predmin=int(y2_predict.min()),predmax=int(y2_predict.max()),actualmin=y_test.min(),actualmax=y_test.max(),predmed=int(np.median(y2_predict)),actualmed=int(np.median(y_test))))

**Cubic Relationship**

In [None]:
poly3=PolynomialFeatures(degree=3)
x3_train=poly3.fit_transform(x_train)
x3_test=poly3.transform(x_test)

lr3=LinearRegression()
lr3.fit(x3_train,y_train)
y3_predict=lr3.predict(x3_test)

print("MAE: %.2f"%mean_absolute_error(y_test,y3_predict))
print("MSE: %.2f"%mean_squared_error(y_test,y3_predict))
print("RMSE: %.2f"%np.sqrt(mean_squared_error(y_test,y3_predict)))  # RMSE is the square root of the MSE, not of the MAE
print("R2: %.2f"%r2_score(y_test,y3_predict))

rmse.append(np.sqrt(mean_squared_error(y_test,y3_predict)))
r2score.append(r2_score(y_test,y3_predict))

fig=plt.figure(figsize=(10,6))
ax1=fig.add_axes([0,0,1,1])
ax2=fig.add_axes([0.15,0.3,0.5,0.5])
# histplot replaces the deprecated distplot; ax2 is a zoomed-in inset of ax1
sns.histplot(y_test,bins=50,kde=True,stat="density",color="lightskyblue",label="Actual Values",ax=ax1)
sns.histplot(y3_predict,bins=70,kde=True,stat="density",color="plum",label="Predicted Values",ax=ax1)
sns.histplot(y_test,bins=50,kde=True,stat="density",color="lightskyblue",label="Actual Values",ax=ax2)
sns.histplot(y3_predict,bins=1000,kde=True,stat="density",color="plum",label="Predicted Values",ax=ax2)
ax1.legend(loc="upper right")
ax1.set_xlabel("Price")
ax2.set_xlabel("Price")
ax2.set_xlim([-1000,25000])

In [None]:
print("The R Squared Score was the worst at 0.42. The model predicted well between the prices 0 to 2000, but predicted a much larger range of prices, from a min of {predmin} to a max of {predmax}, while the actual prices only ranged from {actualmin} to {actualmax}. Although the predicted price range was way off, the median prices were not too bad, with the predicted at {predmed} and the actual at {actualmed}.".format(predmin=int(y3_predict.min()),predmax=int(y3_predict.max()),actualmin=y_test.min(),actualmax=y_test.max(),predmed=int(np.median(y3_predict)),actualmed=int(np.median(y_test))))

**So what is the relationship?**

The root mean squared error and the r-squared scores for each model will be compared to determine the model with the best performance (i.e. the smallest RMSE and largest R2).
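
As a reminder of how these metrics are defined, here is a minimal sketch that computes them by hand on four made-up prices (not taken from the diamonds data). RMSE is the square root of the mean squared error, and R2 compares the squared errors with the spread of the actual values:

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([300.0, 500.0, 900.0, 1500.0])
y_pred = np.array([350.0, 450.0, 1000.0, 1400.0])

errors = y_true - y_pred

mae_value = np.mean(np.abs(errors))   # mean absolute error
mse_value = np.mean(errors ** 2)      # mean squared error
rmse_value = np.sqrt(mse_value)       # root mean squared error

# R2 = 1 - (sum of squared errors) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_value = 1 - ss_res / ss_tot

print("MAE: %.2f" % mae_value)
print("MSE: %.2f" % mse_value)
print("RMSE: %.2f" % rmse_value)
print("R2: %.4f" % r2_value)
```

Because RMSE and R2 are both built from the same squared errors, on a fixed test set a lower RMSE always goes together with a higher R2.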

In [None]:
scores=pd.DataFrame({"Relationship":relationship,"RMSE":rmse,"R2-Scores":r2score})
scores

In [None]:
fig,ax1=plt.subplots()

ax1.plot(scores["Relationship"],scores["RMSE"],color="sandybrown",marker="o")
ax1.set_ylabel("Root Mean Square Error",fontsize=12,color="sandybrown")
for label in ax1.get_yticklabels():
    label.set_color("sandybrown")
    
ax2=ax1.twinx()
ax2.plot(scores["Relationship"],scores["R2-Scores"],color="yellowgreen",marker="^")
ax2.set_ylabel("R Squared",fontsize=12,color="yellowgreen")
for label in ax2.get_yticklabels():
    label.set_color("yellowgreen")

The squared regression did the best job, with the highest R Squared Score of 0.96. Because RMSE and R Squared are both computed from the squared errors on the same test set, the model with the highest R Squared Score also has the lowest RMSE. And although all three models predicted negative prices, the squared regression model predicted the most accurate range of positive prices and the closest median price.

The linear regression did slightly worse than the squared regression but still did a good job, and far better than the cubic regression. The cubic regression had by far the worst R Squared Score (0.42) despite predicting well at the lower prices, which may be due to overfitting.
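
One way to check the overfitting suspicion is to compare the error on the training data with the error on held-out data as the polynomial degree grows. The sketch below does this on synthetic one-dimensional data with `np.polyfit`, which is an assumption made for brevity and not the multi-feature setup used in this notebook. All names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a straight line plus noise (not the diamonds data).
x_fit = np.linspace(-1, 1, 10)
x_held_out = np.linspace(-0.95, 0.95, 50)
y_fit = 3 * x_fit + rng.normal(scale=0.5, size=x_fit.size)
y_held_out = 3 * x_held_out + rng.normal(scale=0.5, size=x_held_out.size)

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 6):
    # Least-squares polynomial fit on the training points only.
    coefs = np.polyfit(x_fit, y_fit, degree)
    train_mse[degree] = mse(y_fit, np.polyval(coefs, x_fit))
    test_mse[degree] = mse(y_held_out, np.polyval(coefs, x_held_out))
    print("degree %d: train MSE %.3f, held-out MSE %.3f"
          % (degree, train_mse[degree], test_mse[degree]))
```

The training error can only go down as the degree increases, because each higher-degree fit includes the lower-degree ones as special cases. A held-out error that does not follow it down is the usual sign of overfitting.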