**Diamonds** were formed naturally under extreme heat and pressure deep beneath the Earth's surface millions of years ago, and are only brought up to the surface through volcanic eruptions. Because they are natural, they come in many different shapes, sizes, colours and clarities.

Of course, the bigger the better, right? But at what point does big become fake?

When modelling the price of diamonds, these *outliers* can skew predictions and therefore lessen the model's accuracy. Let's explore the diamond dataset and see the effect of outliers on linear regression:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
plt.rcParams["figure.figsize"]=12,6

The dataset contains 53,940 diamonds, each described by 10 attributes:
1. carat (i.e. weight)
1. quality of the cut (5 categories in ascending order: Fair, Good, Very Good, Premium, Ideal)
1. colour (7 categories, from J being the worst to D being the best)
1. clarity (8 categories in ascending order: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
1. depth out of 100 (i.e. how deep the diamond is relative to its average width, as a percentage)
1. table out of 100 (i.e. the width of the flat surface at the top of the diamond relative to its widest point, as a percentage)
1. price in USD
1. length (x) in mm
1. width (y) in mm 
1. depth (z) in mm
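
The depth percentage in the list above can be reproduced from the three measurements. The ggplot2 documentation for this dataset gives it as z / mean(x, y), i.e. 2 * z / (x + y). Here is a minimal sketch of that formula on three made-up stones (the `sample` frame and its values are illustrative only, not rows from the dataset):

```python
import pandas as pd

# Three made-up stones; x, y and z are in mm (illustrative values only).
sample = pd.DataFrame({
    "x": [4.0, 5.0, 6.0],
    "y": [4.0, 5.2, 6.1],
    "z": [2.4, 3.1, 3.7],
})

# Total depth percentage = z / mean(x, y) * 100 = 2 * z / (x + y) * 100
sample["depth_pct"] = 100 * 2 * sample["z"] / (sample["x"] + sample["y"])

print(sample.round(2))
```

Running the same calculation on the real columns would be one way to spot rows whose recorded depth disagrees with their measurements.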

In [None]:
df=pd.read_csv("/kaggle/input/diamonds/diamonds.csv",index_col=0)
df

Although cut, colour and clarity are categorical variables, each has a natural order and can therefore be transformed into numeric values.

In [None]:
#CUT
def cut_codes(cut):
    if cut=="Ideal":
        return 5
    elif cut=="Premium":
        return 4
    elif cut=="Very Good":
        return 3
    elif cut=="Good":
        return 2
    else:
        return 1
df["cut"]=df["cut"].apply(lambda x:cut_codes(x))

#COLOUR
def color_codes(color):
    if color=="D":
        return 7
    elif color=="E":
        return 6
    elif color=="F":
        return 5
    elif color=="G":
        return 4
    elif color=="H":
        return 3
    elif color=="I":
        return 2
    else:
        return 1
df["color"]=df["color"].apply(lambda x:color_codes(x))

#CLARITY
# Codes ascend from worst (I1) to best (IF), matching the direction used for cut and colour
def clarity_codes(clarity):
    if clarity=="I1":
        return 1
    elif clarity=="SI2":
        return 2
    elif clarity=="SI1":
        return 3
    elif clarity=="VS2":
        return 4
    elif clarity=="VS1":
        return 5
    elif clarity=="VVS2":
        return 6
    elif clarity=="VVS1":
        return 7
    else:
        return 8
df["clarity"]=df["clarity"].apply(lambda x:clarity_codes(x))

df.head()
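
The same kind of ordinal encoding can also be written as a dictionary lookup. This is only a sketch of an alternative, shown for cut on a tiny made-up frame; the names `cut_order`, `cut_map` and `toy` are hypothetical and not part of the notebook:

```python
import pandas as pd

# Categories listed from worst to best, as described in the attribute list.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Build {"Fair": 1, "Good": 2, ..., "Ideal": 5} from the ordered list.
cut_map = {level: rank for rank, level in enumerate(cut_order, start=1)}

# A tiny made-up frame standing in for the real data.
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium", "Good"]})
toy["cut_code"] = toy["cut"].map(cut_map)

print(toy)
```

One difference worth knowing: with `map`, a label that is missing from the dictionary becomes `NaN`, whereas a final `else` branch quietly assigns it a code.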

Let's take a look at the data:

In [None]:
df.isnull().sum()

In [None]:
df.describe()

There are no null values, but there are some 0s in x, y and z. These should not be read as a measurement of 0 mm, because a diamond without a length, width or depth cannot be a three-dimensional object. Instead, a 0 should be treated as missing data. Our options for dealing with missing data are to either:

1. drop the rows; or
1. replace the 0s with a new value such as the mean
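
To make the second option concrete, here is a minimal sketch of mean imputation on a made-up frame (the `toy` values are invented). It first converts the 0s to `NaN` so that they do not drag down the mean they are about to be replaced with:

```python
import numpy as np
import pandas as pd

# Made-up measurements in mm; the 0s stand for missing values.
toy = pd.DataFrame({
    "x": [4.0, 0.0, 5.0],
    "y": [4.1, 4.9, 0.0],
    "z": [2.5, 3.0, 3.1],
})
dims = ["x", "y", "z"]

# Step 1: treat 0 as missing.
toy[dims] = toy[dims].replace(0, np.nan)

# Step 2: fill each gap with the mean of the remaining values in that column.
toy[dims] = toy[dims].fillna(toy[dims].mean())

print(toy)
```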

In [None]:
print(f"The number of rows with a value of 0 for x is {(df['x']==0).sum()}.")
print(f"The number of rows with a value of 0 for y is {(df['y']==0).sum()}.")
print(f"The number of rows with a value of 0 for z is {(df['z']==0).sum()}.")
print(f"The total number of rows with a value of 0 is {((df['x']==0)|(df['y']==0)|(df['z']==0)).sum()}.")

Since the number of rows with a value of 0 is only 20 (0.04%) out of the total 53940, we shall just drop said rows.

In [None]:
df.drop(df[(df["x"]==0)|(df["y"]==0)|(df["z"]==0)].index,inplace=True)

But now when we take a look at the data, there are some obvious outliers (especially in y, z, depth and table).

In [None]:
fig=plt.figure(figsize=(16,16))
ax1=plt.subplot2grid((4,3),(0,0),colspan=2,rowspan=2)
ax2=plt.subplot2grid((4,3),(0,2))
ax3=plt.subplot2grid((4,3),(1,2))
ax4=plt.subplot2grid((4,3),(2,0))
ax5=plt.subplot2grid((4,3),(2,1))
ax6=plt.subplot2grid((4,3),(2,2))
ax7=plt.subplot2grid((4,3),(3,0))
ax8=plt.subplot2grid((4,3),(3,1))
ax9=plt.subplot2grid((4,3),(3,2))

sns.scatterplot(x=df["carat"],y=df["price"],color="lavender",ax=ax1)
sns.scatterplot(x=df["x"],y=df["price"],color="powderblue",ax=ax2)
sns.scatterplot(x=df["y"],y=df["price"],color="lightblue",ax=ax3)
sns.scatterplot(x=df["depth"],y=df["price"],color="palegreen",ax=ax4)
sns.scatterplot(x=df["table"],y=df["price"],color="lightgreen",ax=ax5)
sns.scatterplot(x=df["z"],y=df["price"],color="skyblue",ax=ax6)
sns.scatterplot(x=df["cut"],y=df["price"],color="lightsalmon",ax=ax7)
sns.scatterplot(x=df["color"],y=df["price"],color="palevioletred",ax=ax8)
sns.scatterplot(x=df["clarity"],y=df["price"],color="gold",ax=ax9)

plt.tight_layout(pad=1,h_pad=1,w_pad=1)

What should we do with these outliers? We can either carry on the analysis with them included or filter them out. And if we filter them out, should we remove them visually, based on their position on the graph, or by calculation, treating any value more than 1.5 times the interquartile range above the third quartile or below the first quartile as an outlier?
Let's try all three methods and compare their model accuracy!

1. Including the outliers
1. Removing the visual outliers
1. Removing the calculated outliers
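
For the calculated approach, the 1.5 × IQR rule works like this on a single column. This is a minimal sketch on ten made-up numbers (the `values` series is illustrative only):

```python
import pandas as pd

# Ten made-up values with one obvious outlier at the end.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # anything below this is flagged
upper = q3 + 1.5 * iqr       # anything above this is flagged

flagged = values[(values < lower) | (values > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Limits:", lower, "to", upper)
print("Flagged values:", flagged.tolist())
```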

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
outliers=["Including the Outliers","Removing the Visual Outliers","Removing the Calculated Outliers"]
rmse=[]
r2score=[]

### Including the Outliers

In [None]:
df1=df.copy()

print("The number of rows including all the outliers is",df1.shape[0],".")

In [None]:
x1=df1.drop(["price"],axis=1)
y1=df1["price"]
x1_train,x1_test,y1_train,y1_test=train_test_split(x1,y1,test_size=0.3,random_state=5)

lr1=LinearRegression()
lr1.fit(x1_train,y1_train)
y1_predict=lr1.predict(x1_test)

print("MAE: %.2f"%mean_absolute_error(y1_test,y1_predict))
print("MSE: %.2f"%mean_squared_error(y1_test,y1_predict))
# RMSE is the square root of the MSE (not of the MAE)
print("RMSE: %.2f"%np.sqrt(mean_squared_error(y1_test,y1_predict)))
print("R2: %.2f"%r2_score(y1_test,y1_predict))

rmse.append(np.sqrt(mean_squared_error(y1_test,y1_predict)))
r2score.append(r2_score(y1_test,y1_predict))
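
For reference, RMSE is defined as the square root of the MSE, which is not the same thing as the square root of the MAE. Here is a minimal sketch with four made-up prices (the numbers are illustrative, not taken from the dataset), using only NumPy so the arithmetic is visible:

```python
import numpy as np

# Made-up actual and predicted prices in USD.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 300.0, 360.0])

errors = predicted - actual

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print("MAE: %.2f" % mae)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % rmse)
print("sqrt(MAE): %.2f" % np.sqrt(mae))  # a different, much smaller number
```

Because the errors are squared before averaging, RMSE punishes a few large misses more heavily than MAE does, which is why it is a useful metric when looking at outliers.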

In [None]:
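# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(x=y1_test,kde=True,stat="density",color="lightskyblue",label="Actual Values")
sns.histplot(x=y1_predict,kde=True,stat="density",color="plum",label="Predicted Values")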
plt.legend()
plt.xlabel("Price")

In [None]:
f,axes=plt.subplots(2,1,sharex=True)
# Pass the values as x= so the boxes are drawn horizontally along the price axis
sns.boxplot(x=y1_test,color="lightskyblue",whis=4,ax=axes[0])
sns.boxplot(x=y1_predict,color="plum",whis=7,ax=axes[1])

axes[0].set_xlabel("")
plt.xlabel("Price")

axes[0].set_ylabel("Actual Price")
axes[1].set_ylabel("Predicted Price")

Using the entire dataset with all the outliers included did not do too bad a job, with an R-squared score of 0.91. The predicted prices followed the general trend and distribution of the actual prices, but the model predicted a much larger price range and even predicted negative prices, which are impossible! So maybe not so good after all.
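
One plausible reason for the negative predictions (an assumption, not something tested in this notebook) is that price rises faster than linearly with carat, so a straight line fitted through the data undershoots at the small end. The toy numbers below are invented purely to illustrate the effect:

```python
import numpy as np

# Invented carat/price pairs where price grows faster than carat.
carat = np.array([0.2, 0.5, 1.0, 1.5, 2.0])
price = np.array([400.0, 1500.0, 5000.0, 10000.0, 16000.0])

# Fit a straight line: price is approximated by slope * carat + intercept.
slope, intercept = np.polyfit(carat, price, 1)
predicted = slope * carat + intercept

print("Slope: %.1f, intercept: %.1f" % (slope, intercept))
for c, p in zip(carat, predicted):
    print("carat %.1f -> predicted price %.0f" % (c, p))

print("Negative predictions:", int((predicted < 0).sum()))
```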

### Removing the Visual Outliers

In [None]:
df2=df.copy()

visual=df2[(df2["y"]>30)|(df2["z"]>10)|(df2["depth"]<50)|(df2["depth"]>75)|(df2["table"]<45)|(df2["table"]>75)]
print("There are {num} rows of visual outliers where y >30, z >10, depth <50 and >75, and table <45 and >75.".format(num=visual.shape[0]))

In [None]:
df2.drop(visual.index,inplace=True)

print("After dropping the Visual Outliers, the number of rows is now",df2.shape[0],".")
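
Because these limits are hand-picked, it can help to keep them in one place so they are easy to adjust. The sketch below is one possible way to do that on a made-up frame; the `bounds` dictionary reuses the limits from the filter above, while `toy`, `mask` and `kept` are hypothetical names:

```python
import pandas as pd

# (lower limit, upper limit) per column; None means no limit on that side.
bounds = {
    "y": (None, 30),
    "z": (None, 10),
    "depth": (50, 75),
    "table": (45, 75),
}

# Made-up rows: the second has an extreme y, the third an extreme depth.
toy = pd.DataFrame({
    "y": [4.0, 31.8, 5.0],
    "z": [2.5, 3.0, 3.1],
    "depth": [61.0, 62.0, 79.0],
    "table": [57.0, 58.0, 56.0],
})

# Start with nothing flagged, then flag any row that breaks a limit.
mask = pd.Series(False, index=toy.index)
for col, (low, high) in bounds.items():
    if low is not None:
        mask |= toy[col] < low
    if high is not None:
        mask |= toy[col] > high

kept = toy[~mask]
print("Rows flagged:", int(mask.sum()))
print(kept)
```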

In [None]:
x2=df2.drop(["price"],axis=1)
y2=df2["price"]
x2_train,x2_test,y2_train,y2_test=train_test_split(x2,y2,test_size=0.3,random_state=5)

lr2=LinearRegression()
lr2.fit(x2_train,y2_train)
y2_predict=lr2.predict(x2_test)

print("MAE: %.2f"%mean_absolute_error(y2_test,y2_predict))
print("MSE: %.2f"%mean_squared_error(y2_test,y2_predict))
# RMSE is the square root of the MSE (not of the MAE)
print("RMSE: %.2f"%np.sqrt(mean_squared_error(y2_test,y2_predict)))
print("R2: %.2f"%r2_score(y2_test,y2_predict))

rmse.append(np.sqrt(mean_squared_error(y2_test,y2_predict)))
r2score.append(r2_score(y2_test,y2_predict))

In [None]:
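# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(x=y2_test,kde=True,stat="density",color="lightskyblue",label="Actual Values")
sns.histplot(x=y2_predict,kde=True,stat="density",color="plum",label="Predicted Values")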
plt.legend()
plt.xlabel("Price")

In [None]:
f,axes=plt.subplots(2,1,sharex=True)
# Pass the values as x= so the boxes are drawn horizontally along the price axis
sns.boxplot(x=y2_test,color="lightskyblue",whis=4,ax=axes[0])
sns.boxplot(x=y2_predict,color="plum",whis=7,ax=axes[1])

axes[0].set_xlabel("")
plt.xlabel("Price")

axes[0].set_ylabel("Actual Price")
axes[1].set_ylabel("Predicted Price")

Removing the visual outliers did ever so slightly better than including all the outliers. The R-squared score was the same at 0.91, and the error improved only marginally. This could be because only 14 data points were removed out of the original 50,000+. Once again, although the predicted prices followed the same distribution as the actual prices, the model still predicted a larger price range and even negative prices.

### Removing the Calculated Outliers

In [None]:
df3=df.copy()

Q1=df3.quantile(0.25)
Q3=df3.quantile(0.75)
IQR=Q3-Q1

col=list(df3.columns)

print("Number of Calculated Outliers")
print(df3[(df3[col]<(Q1[col]-1.5*IQR[col]))|(df3[col]>(Q3[col]+1.5*IQR[col]))].count())

In [None]:
def c_outliers(col):
    return df3[(df3[col]<(Q1[col]-1.5*IQR[col]))|(df3[col]>(Q3[col]+1.5*IQR[col]))]

for col in df3:
    df3.drop(c_outliers(col).index,inplace=True)
    
print("After dropping the Calculated Outliers, the number of rows is now",df3.shape[0],".")
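
Since the quartiles are computed once, before any rows are dropped, the column-by-column loop should give the same result as a single boolean mask over the whole frame. Here is a minimal sketch of that single-pass version on a made-up frame (`toy`, `outside` and `filtered` are hypothetical names):

```python
import pandas as pd

# Made-up data: the last carat and the fifth table value are extreme.
toy = pd.DataFrame({
    "carat": [0.3, 0.4, 0.5, 0.6, 0.7, 5.0],
    "table": [55, 56, 57, 58, 95, 57],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True wherever a cell lies outside its own column's limits.
outside = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# Keep only the rows with no flagged cell in any column.
filtered = toy[~outside.any(axis=1)]

print("Outliers per column:")
print(outside.sum())
print(filtered)
```

Note that a row is removed as soon as one of its columns is flagged, which is why the total number of dropped rows can be smaller than the sum of the per-column counts.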

In [None]:
x3=df3.drop(["price"],axis=1)
y3=df3["price"]
x3_train,x3_test,y3_train,y3_test=train_test_split(x3,y3,test_size=0.3,random_state=5)

lr3=LinearRegression()
lr3.fit(x3_train,y3_train)
y3_predict=lr3.predict(x3_test)

print("MAE: %.2f"%mean_absolute_error(y3_test,y3_predict))
print("MSE: %.2f"%mean_squared_error(y3_test,y3_predict))
# RMSE is the square root of the MSE (not of the MAE)
print("RMSE: %.2f"%np.sqrt(mean_squared_error(y3_test,y3_predict)))
print("R2: %.2f"%r2_score(y3_test,y3_predict))

rmse.append(np.sqrt(mean_squared_error(y3_test,y3_predict)))
r2score.append(r2_score(y3_test,y3_predict))

In [None]:
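# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(x=y3_test,kde=True,stat="density",color="lightskyblue",label="Actual Values")
sns.histplot(x=y3_predict,kde=True,stat="density",color="plum",label="Predicted Values")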
plt.legend()
plt.xlabel("Price")

In [None]:
f,axes=plt.subplots(2,1,sharex=True)
# Pass the values as x= so the boxes are drawn horizontally along the price axis
sns.boxplot(x=y3_test,color="lightskyblue",whis=4,ax=axes[0])
sns.boxplot(x=y3_predict,color="plum",whis=7,ax=axes[1])

axes[0].set_xlabel("")
plt.xlabel("Price")

axes[0].set_ylabel("Actual Price")
axes[1].set_ylabel("Predicted Price")

The most accurate model out of the three! The R-squared score only improved slightly, from 0.91 to 0.92, and negative prices were still predicted, but at least the predicted prices were no longer off by tens of thousands.

### So did the outliers do anything?

In [None]:
scores=pd.DataFrame({"Data":outliers,"RMSE":rmse,"R2-Scores":r2score})
scores

In [None]:
fig,ax1=plt.subplots()

ax1.plot(scores["Data"],scores["RMSE"],color="sandybrown",marker="o")
ax1.set_ylabel("Root Mean Square Error",fontsize=12,color="sandybrown")
for label in ax1.get_yticklabels():
    label.set_color("sandybrown")
    
ax2=ax1.twinx()
ax2.plot(scores["Data"],scores["R2-Scores"],color="yellowgreen",marker="^")
ax2.set_ylabel("R Squared Score",fontsize=12,color="yellowgreen")
for label in ax2.get_yticklabels():
    label.set_color("yellowgreen")

*Removing the calculated outliers* gave the most accurate model because it decreased variability: almost 12% (6412 rows) of the dataset was filtered out to create an idealised dataset. However, this could make for an unreliable model, as real datasets are not always so consistent.

*Removing the visual outliers* did not improve much on *including the outliers*. This may be because of the boundaries I set for the outliers - I am no diamond expert, so these numbers were just based on what I thought was suitable (i.e. y >30; z >10; depth <50 and >75; table <45 and >75; I also did not set any outlier boundaries for carat).

Now this is where domain knowledge would be very useful in identifying appropriate boundaries for outliers. So if any diamond extraordinaire is reading this, please do let me know!