In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('classic')
%matplotlib inline

import warnings
warnings.filterwarnings(action='ignore')

# Diamonds



<img src="https://imgur.com/LqVrupH.jpg" width="800">

**price** price in US dollars (\$326--\$18,823)

**carat** weight of the diamond (0.2--5.01)

**cut** quality of the cut (Fair, Good, Very Good, Premium, Ideal)

**color** diamond colour, from J (worst) to D (best)

**clarity** a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**x** length in mm (0--10.74)

**y** width in mm (0--58.9)

**z** depth in mm (0--31.8)

**depth** total depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y) (43--79)

**table** width of top of diamond relative to widest point (43--95)
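
As a quick check of the **depth** definition, here is a minimal sketch that computes the total depth percentage from the three dimensions. The helper name `depth_percentage` and the measurements are assumptions made for illustration. They are not taken from the file. The factor of 100 is what puts the ratio on the 43--79 scale.

```python
def depth_percentage(x_mm, y_mm, z_mm):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100.0 * 2.0 * z_mm / (x_mm + y_mm)

# Illustrative measurements in mm (assumed values, not quoted from the file).
example_depth = depth_percentage(3.95, 3.98, 2.43)
print(round(example_depth, 2))
```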

In [None]:
data = pd.read_csv("/kaggle/input/diamonds/diamonds.csv")

data

In [None]:
data.info()

In [None]:
data.nunique()

In [None]:
data = data.drop('Unnamed: 0', axis=1)  # this column is just a copy of the row index

# Feature Analysis

In [None]:
object_columns = data.select_dtypes(include='object').columns

for col in object_columns:
    print('-'*50)
    print(col+':', '\n\n', ',  '.join(data[col].unique()), '\n')
print('-'*50)

### Diamond cut grade:

<img src="https://imgur.com/ppd6912.jpg" width="800">

1) **Excellent:**  	Excellent Cut Diamonds provide the highest level of fire and brilliance. Because almost all of the incoming light is reflected through the table, the diamond radiates with magnificent sparkle.

2) **Very Good:**  Very Good Cut Diamonds offer exceptional brilliance and fire. A large majority of the entering light reflects through the diamond’s table. To the naked eye, Very Good diamonds provide similar sparkle to those of Excellent grade.

3) **Good:**  Good Cut Diamonds showcase brilliance and sparkle, with much of the light reflecting through the table to the viewer’s eye. These diamonds provide beauty at a lower price point.

4) **Fair:**  Fair Cut Diamonds offer little brilliance, as light easily exits through the bottom and sides of the diamond. Diamonds of a Fair Cut may be a satisfactory choice for smaller carats and those acting as side stones.

5) **Poor:**  Poor Cut Diamonds yield nearly no sparkle, brilliance or fire. Entering light escapes from the sides and bottom of the diamond.

Learn more about diamond cut: https://www.diamonds.pro/education/cuts/

 
 

The dataset labels its cut grades differently from the scale above. From worst to best, the quality of the cut is: **Fair < Good < Very Good < Premium < Ideal**

### Diamond color grade:

<img src="https://imgur.com/97Txyxi.jpg" width="800">

1) **D:**  	D is the highest color grade, meaning it has nearly no color. Under magnification and to the naked eye, a D color diamond will appear colorless.

2) **E:**  	E color diamonds look almost identical to D color diamonds. Most of the time, the differences in color between a D and E diamond are only visible to an expert gemologist when the two diamonds are viewed under magnification.

3) **F:**  F color diamonds are almost identical to D and E color diamonds, with nearly no visible color. Even under magnification and side by side, a D, E and F diamond will look almost identical to anyone other than an expert gemologist.

4) **G:**  	G color diamonds exhibit nearly no color and appear primarily colorless to the naked eye. The G color grade is the highest, best grade in the “Near Colorless” range of the GIA’s scale, which covers diamonds graded G to J.

5) **H:**  H color diamonds appear primarily colorless to the naked eye but have a faint yellow hue that’s often visible under magnification in bright lighting, especially when they’re compared to diamonds of a higher color grade.

6) **I:**  I color diamonds offer a great combination of near colorless looks and good value for money. These diamonds have a slight yellow tint that’s usually only visible when they’re viewed next to diamonds of a higher color grade.

7) **J:**  J color diamonds look mostly colorless to the naked eye, but usually have a faint yellow tint that’s easy to notice under bright lights and magnification. In diamonds with a large table, the color might also be visible with the naked eye in certain lighting conditions.

8) **K:**  K color diamonds are classed as “faint tint” on the GIA’s diamond color scale, meaning they have a slight yellow tint that’s visible even to the naked eye.

9) **L:**  L color diamonds have a yellow tint that’s visible to the naked eye in normal lighting conditions. Diamonds with this color grade are much more affordable than those in the G to J range, making them a good value for money option.

10) **M:**  M color diamonds have a definite yellow tint that’s visible to the naked eye. Like K and L diamonds, M color diamonds offer fantastic value for money when compared to near colorless or colorless diamonds.

11) **N-R:**  Diamonds in the N to R range have noticeable yellow or brown tinting. These diamonds are available at a much lower price point than faintly tinted or near colorless diamonds. We do not recommend diamonds of an N-R grade.

12) **S-Z:**  Diamonds in the S-Z range have easily noticeable yellow or brown tinting. For this reason, we do not recommend S-Z diamonds.

Learn more about diamond color: https://www.diamonds.pro/education/color/



The dataset contains only grades D through J, so the diamond colour runs from J (worst) to D (best): **J < I < H < G < F < E < D**

### Diamond clarity grade:

<img src="https://imgur.com/NErG1AE.jpg" width="800">

1) **IF:**  Internally Flawless / Flawless – No internal or external imperfections. Flawless diamonds are extremely rare.

2) **VVS1:**  Very Very Slightly Included (1st Degree) – Diamond clarity inclusions rated VVS1 are not visible at all under 10x magnification.

3) **VVS2:**  Very Very Slightly Included (2nd Degree) – Diamond clarity inclusions rated VVS2 are sometimes just barely visible under 10x magnification (standard jeweler’s loupe). When they are visible, they are quite difficult to find and can often take quite a while to locate.

4) **VS1:**  Very Slightly Included (1st Degree) – VS1 diamond clarity inclusions are just barely visible under 10x magnification (standard jeweler’s loupe). When looking for VS1 clarity inclusions with a loupe, it can sometimes take a good few seconds until the pinpoint is located.

5) **VS2:**  Very Slightly Included (2nd Degree) – VS2 clarity inclusions are almost always easily noticeable at 10x magnification (standard jeweler’s loupe). Occasionally, the inclusion will be located in a difficult-to-spot location, but otherwise, the inclusion is large enough that it can be spotted quickly under magnification.

6) **SI1:**  Slightly Included (1st Degree) – SI1 clarity inclusions are easily found with a standard jeweler’s loupe at 10x magnification. With most shapes (to the exclusion of step cuts like Asscher and Emerald Cuts), SI1 clarity inclusions are almost always clean to the naked eye.

7) **SI2:**  Slightly Included (2nd Degree) – SI2 clarity inclusions are seen clearly and obviously with the help of a jeweler’s loupe. With step cuts like Emerald and Asscher cuts, an SI2 clarity inclusion will most likely be visible to the naked eye.

8) **I1:**  Included (1st Degree) – I1 clarity inclusions are even more obvious and clearly seen than SI2 clarity inclusions. Most I1 inclusions are visible to the naked eye, even on brilliant cuts.

9) **I2-I3:**  Included (2nd-3rd Degree) – I2 and I3 are the lowest official clarity grades for a diamond.

Learn more about diamond clarity: https://www.diamonds.pro/education/clarity/




So, we have diamond clarity, from I1 (worst) to IF (best): **I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF**
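
The three orderings can be made explicit in pandas so that sorting and comparisons follow grade quality rather than alphabetical order. The sketch below uses a tiny hand-made sample, not rows from the dataset. The names `cut_order`, `color_order`, `clarity_order` and `sample` are introduced only for illustration.

```python
import pandas as pd

# Grade orderings, worst to best, as stated in the text.
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Tiny hand-made sample (hypothetical values).
sample = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Good']})
sample['cut'] = pd.Categorical(sample['cut'], categories=cut_order, ordered=True)

# Sorting and comparisons now respect the grade order.
sorted_cuts = sample['cut'].sort_values().tolist()
better_than_good = (sample['cut'] > 'Good').tolist()
print(sorted_cuts)
print(better_than_good)
```

The same lists can be passed as the `order=` argument of seaborn's `countplot` or `barplot` to draw the bars from worst to best grade.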

In [None]:
data.describe()

In [None]:
cols = object_columns

fig = plt.figure(figsize = (20, 6), facecolor='#fbe7dd')



for i in range(len(cols)):
    fig.add_subplot(1, 3, i+1)
    # pass the column as x explicitly; recent seaborn versions reject a positional vector here
    sns.countplot(x=data[cols[i]], palette='icefire_r')
plt.show()

fig = plt.figure(figsize = (20, 6), facecolor='#fbe7dd')
for i in range(len(cols)):
    fig.add_subplot(1, 3, i+1)
    sns.barplot(x=cols[i], y="price", data=data, palette='icefire_r')

plt.show()

In [None]:
fig = plt.figure(figsize = (20, 6), facecolor='#fbe7dd')

plt.style.use('Solarize_Light2')

for i, col in enumerate(cols):
    counts = data[col].value_counts()  # compute the category counts once per column
    fig.add_subplot(1, 3, i+1)
    plt.title(col, color='black', fontsize=19)
    plt.pie(x=counts.values, labels=counts.index, autopct='%1.1f%%')

plt.show()

In [None]:
cols = object_columns

fig = plt.figure(figsize = (25, 17), facecolor='#fbe7dd')



for i in range(len(cols)):
    fig.add_subplot(1, 3, i+1)
    # hue_order expects the unique category labels, not the full column
    sns.scatterplot(
        y=data['carat'], x=data['price'], hue=data[cols[i]], palette='Paired',
        hue_order=sorted(data[cols[i]].unique())
    )

plt.show()

In [None]:
float_columns = data.select_dtypes(include='float64').columns

fig = plt.figure(figsize = (22, 9), facecolor='#fbe7dd')



for i in range(len(float_columns)):
    fig.add_subplot(2, 3, i+1)
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(x=data[float_columns[i]], kde=True, color='#604039')

plt.show()

In [None]:
# correlation matrix

plt.figure(figsize = (16, 7), facecolor='#fbe7dd')
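# restrict to numeric columns: corr() fails on the string columns in recent pandas versions
sns.heatmap(data.select_dtypes(include='number').corr(), vmin=-1, vmax=1, cmap='icefire')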
plt.show()

We will not use the x, y and z features: they are very strongly correlated with carat, and they carry much the same information (the size of the stone).
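
One way to see why the dimensions track carat so closely: carat is a weight, and weight scales with volume, roughly x * y * z. The sketch below illustrates this on synthetic data only. The ranges, noise levels and the constant `0.0061` are made-up assumptions, and `toy` is a hypothetical stand-in for the real table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic dimensions in mm with small noise (not the real dataset).
length = rng.uniform(4.0, 8.0, size=1000)                   # plays the role of x
width = length + rng.normal(0.0, 0.05, size=1000)           # plays the role of y
height = 0.61 * length + rng.normal(0.0, 0.05, size=1000)   # plays the role of z

# Assumption: weight is proportional to volume; the constant is made up.
toy = pd.DataFrame({'x': length, 'y': width, 'z': height})
toy['carat'] = 0.0061 * toy['x'] * toy['y'] * toy['z']

# Correlation of each dimension with the synthetic carat.
carat_corr = toy.corr()['carat'].drop('carat')
print(carat_corr.round(3))
```

On the real data, the equivalent check would be `data[['carat', 'x', 'y', 'z']].corr()`.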

# get_dummies

In [None]:
y = data['price'] #Assigning the target as y

In [None]:
# Build the feature matrix df_tran: one-hot encode the categorical columns, min-max scale the numeric ones

df_tran = pd.get_dummies(data[["cut", 'color','clarity']])
df_tran['carat'] = MinMaxScaler().fit_transform(pd.DataFrame(data['carat']))
df_tran['table'] = MinMaxScaler().fit_transform(pd.DataFrame(data['table']))
df_tran['depth'] = MinMaxScaler().fit_transform(pd.DataFrame(data['depth']))
df_tran.head()

In [None]:
# correlation matrix

plt.figure(figsize = (16, 7), facecolor='#fbe7dd')
sns.heatmap(df_tran.corr(), vmin=-1, vmax=1, cmap= 'icefire')
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_tran, y, train_size=0.7, shuffle=True, random_state=42)

In [None]:
Models = {
    "               Random Forest Regressor": RandomForestRegressor(),
    "           Gradient Boosting Regressor": GradientBoostingRegressor(),
    "                     Bagging Regressor": BaggingRegressor(),
    "                    AdaBoost Regressor": AdaBoostRegressor(),
    "                     Linear Regression": LinearRegression()
}

# Models Evaluation

for name, model in Models.items():
    model.fit(X_train, y_train)

    print(name + ": {:1.2f}%".format(model.score(X_test, y_test) * 100))
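
For scikit-learn regressors, `score` returns the coefficient of determination R², so the percentages printed by the loop are R² values multiplied by 100. The sketch below computes R² by hand on a few made-up numbers to show what is being measured. The helper `r_squared` and the arrays are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only.
toy_true = [3.0, -0.5, 2.0, 7.0]
toy_pred = [2.5, 0.0, 2.0, 8.0]
toy_r2 = r_squared(toy_true, toy_pred)
print("{:1.2f}%".format(toy_r2 * 100))
```

A value of 100% means the predictions match the targets exactly, while 0% is what a model that always predicts the mean would get.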

# Ordinal Encoder

In [None]:
df_tran2 = data.copy().drop('price', axis=1)

df_tran2.head()

The cut, color and clarity features are ordinal, so we encode each one with integers that follow its order.

Recall that

**Fair < Good < Very Good < Premium < Ideal**

**J < I < H < G < F < E < D**

**I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF**
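
As an alternative to hand-written dictionaries, scikit-learn's `OrdinalEncoder` accepts an explicit category order per column. This is a minimal sketch on a hand-made three-row sample, not on the dataset. The names `sample`, `encoder` and the `*_order` lists are introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Grade orderings, worst to best, as listed above.
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Hand-made sample rows (hypothetical values).
sample = pd.DataFrame({
    'cut': ['Ideal', 'Fair', 'Premium'],
    'color': ['D', 'J', 'G'],
    'clarity': ['IF', 'I1', 'VS2'],
})

# One category list per column, in the same order as the columns passed in.
encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order])
encoded = encoder.fit_transform(sample[['cut', 'color', 'clarity']])
print(encoded)
```

Each value is the position of the grade in its list, so the worst grade maps to 0 and the best to the last index. Unlike a dictionary with `map`, which silently produces NaN for an unknown label, the encoder raises an error by default.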

In [None]:
Cut_dict = {
    'Fair': 0,
    'Good': 1,
    'Very Good': 2,
    'Premium': 3,
    'Ideal': 4
}

Color_dict = {
    'J': 0,
    'I': 1,
    'H': 2,
    'G': 3,
    'F': 4,
    'E': 5,
    'D': 6
}

Clarity_dict = {
    'I1': 0,
    'SI2': 1,
    'SI1': 2,
    'VS2': 3,
    'VS1': 4,
    'VVS2': 5,
    'VVS1': 6,
    'IF': 7
}

In [None]:
df_tran2['Cut_Ordinal'] = df_tran2.cut.map(Cut_dict)
df_tran2['Color_Ordinal'] = df_tran2.color.map(Color_dict)
df_tran2['Clarity_Ordinal'] = df_tran2.clarity.map(Clarity_dict)
df_tran2 = df_tran2.drop(['cut', 'color', 'clarity'], axis=1)
df_tran2 = df_tran2.drop(['x', 'y', 'z'], axis=1)

In [None]:
df_tran2

In [None]:
# Scale the features to [0, 1] and keep the column names so later plots stay labelled

X_tran = pd.DataFrame(MinMaxScaler().fit_transform(df_tran2), columns=df_tran2.columns)

In [None]:
# correlation matrix

plt.figure(figsize = (16, 7), facecolor='#fbe7dd')
sns.heatmap(X_tran.corr(), vmin=-1, vmax=1, cmap= 'icefire')
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_tran, y, train_size=0.7, shuffle=True, random_state=42)

In [None]:
# Models Evaluation

for name, model in Models.items():
    model.fit(X_train, y_train)

    print(name + ": {:1.2f}%".format(model.score(X_test, y_test) * 100))

With ordinal encoding, the test score (R², shown as a percentage) improved for every model except Linear Regression, which scored better with the one-hot encoding from get_dummies.
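
One plausible explanation, offered here as an assumption that the notebook does not test: in a linear model, an ordinal code forces the steps between adjacent grades to be equal, while one-hot columns give each grade its own coefficient. The sketch below shows this on a toy target with deliberately uneven grade effects. The numbers in `grade_effect` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly spaced effect of each grade on the target.
grade_effect = {'Fair': 100.0, 'Good': 150.0, 'Very Good': 400.0,
                'Premium': 420.0, 'Ideal': 900.0}
order = list(grade_effect)

toy = pd.DataFrame({'cut': order * 4})
toy['target'] = toy['cut'].map(grade_effect)

def r2_of_linear_fit(X, y):
    """Least-squares fit with an intercept; returns R^2 on the same data."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1 - (residual ** 2).sum() / ((y - y.mean()) ** 2).sum()

y_toy = toy['target'].to_numpy()
codes = toy['cut'].map({g: i for i, g in enumerate(order)})
X_ordinal = codes.to_numpy(dtype=float).reshape(-1, 1)
X_onehot = pd.get_dummies(toy['cut']).to_numpy(dtype=float)

r2_ordinal = r2_of_linear_fit(X_ordinal, y_toy)
r2_onehot = r2_of_linear_fit(X_onehot, y_toy)
print(f"ordinal: {r2_ordinal:.3f}, one-hot: {r2_onehot:.3f}")
```

Tree-based ensembles split on thresholds, so they are less sensitive to the spacing of the codes, which may be why they did not show the same preference.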