# Diamond Price Prediction : Complete Project

# UPVOTE if you like my project :)
You can visit my other works at [kaggle](https://www.kaggle.com/sagnik1511/notebooks)  or [github](https://github.com/sagnik1511?tab=repositories).

This project analyses diamond data and predicts diamond prices.
Diamonds are very costly these days, so buyers can face difficulties or abrupt changes in price.
Using this project, they can find the best diamond for their needs.
This project is made with ❤️.

# Libraries :
In this project we use classical machine learning models rather than neural networks (NN), so we do not import **tensorflow** or **pytorch**.


In [None]:

# supporting libraries -----------------------------------------

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import re

# for data processing ------------------------------------------

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# for prediction (machine learning models) ---------------------

from sklearn.linear_model import*
from sklearn.preprocessing import*
from sklearn.ensemble import*
from sklearn.neighbors import*
from sklearn import svm
from sklearn.naive_bayes import*
import xgboost as xgb

# Data Gathering and Primary Visualization:
We are using the [*read_csv*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function of [*pandas*](https://pandas.pydata.org/) to make the dataframe and visualize it.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df=pd.read_csv('/kaggle/input/diamonds/diamonds.csv')
df.head()

In [None]:
df.info()

The dataframe contains:
* 2 **integer** features
* 6 **float** features
* 3 **object** features

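So, we have to encode the categorical (object) features, because only numerical features can be fed into the machine learning model.

We encode those features manually: each category is replaced by its rank in the frequency count (0 for the most frequent category).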

**Feature : CUT**

In [None]:
# map each category to its frequency rank (0 = most frequent)
cut=df['cut'].value_counts().index
df['cut']=df['cut'].map({name:code for code,name in enumerate(cut)})
df.head()

**Feature : COLOR**

In [None]:
# map each category to its frequency rank (0 = most frequent)
color=df['color'].value_counts().index
df['color']=df['color'].map({name:code for code,name in enumerate(color)})
df.head()

**Feature : CLARITY**

In [None]:
# map each category to its frequency rank (0 = most frequent)
clarity=df['clarity'].value_counts().index
df['clarity']=df['clarity'].map({name:code for code,name in enumerate(clarity)})
df.head()

All categorical features have now been encoded.
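The cells above assign codes in order of frequency, so the integers do not follow the quality order of the grades. A minimal sketch of an alternative is shown here, assuming the grade orders given in the dataset description (cut from Fair to Ideal, colour from J to D, clarity from I1 to IF). The helper `encode_ordinal` and the `toy` dataframe are hypothetical and only for illustration.

```python
import pandas as pd

# Grade orders from worst to best, as listed in the dataset description
# (assumption: the CSV uses exactly these labels).
CUT_ORDER = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
COLOR_ORDER = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
CLARITY_ORDER = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']


def encode_ordinal(series, order):
    """Map each label to its position in `order` (0 = worst grade)."""
    mapping = {label: code for code, label in enumerate(order)}
    return series.map(mapping)


# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    'cut': ['Ideal', 'Fair', 'Premium'],
    'color': ['E', 'J', 'D'],
    'clarity': ['SI2', 'IF', 'VS1'],
})
toy['cut'] = encode_ordinal(toy['cut'], CUT_ORDER)
toy['color'] = encode_ordinal(toy['color'], COLOR_ORDER)
toy['clarity'] = encode_ordinal(toy['clarity'], CLARITY_ORDER)
print(toy)
```

Tree-based models such as random forests are fairly insensitive to the exact coding, so this mainly helps when reading the plots or when using linear models.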

### Missing Value Processing :
If there are missing values in the data, we can follow these steps -

     1. If there are no missing values, we can skip to the next steps.
     2. If there are only a few missing values, we can fill them with a sentinel such as a very small number, e.g. -99999.
     3. If there is a moderate number of missing values, we can fill them with the mean of the feature.
     4. If a feature is mostly or entirely missing, it is best to drop or omit the feature.

In [None]:
df.isnull().sum()

The dataset is quite good, as there are no missing values. So, we can proceed further.
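Although this dataset needs no filling, the four rules listed above can be sketched as a small function. This is only an illustration: the thresholds `low` and `high`, the function name `handle_missing` and the toy data are assumptions, and the sketch assumes numeric columns.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, low=0.05, high=0.5, sentinel=-99999):
    """Apply the four rules column by column.

    `low`, `high` and `sentinel` are illustrative choices, not values
    taken from the notebook.
    """
    result = frame.copy()
    for column in frame.columns:
        missing_share = frame[column].isnull().mean()
        if missing_share == 0:
            continue  # rule 1: nothing to do
        if missing_share <= low:
            # rule 2: few gaps, fill with a sentinel value
            result[column] = frame[column].fillna(sentinel)
        elif missing_share <= high:
            # rule 3: moderate gaps, fill with the column mean
            result[column] = frame[column].fillna(frame[column].mean())
        else:
            # rule 4: mostly missing, drop the feature
            result = result.drop(columns=column)
    return result


# Toy data for illustration only.
toy = pd.DataFrame({
    'complete': [1.0, 2.0, 3.0, 4.0],
    'moderate': [1.0, np.nan, 3.0, 5.0],             # 25% missing
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],   # 75% missing
})
cleaned = handle_missing(toy)
print(cleaned)
```

With the default thresholds, `moderate` is filled with its mean and `mostly_empty` is dropped.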

In [None]:
df.describe()

As we have seen in the [data description](https://www.kaggle.com/shivam2503/diamonds), the depth percentage $D$ is

$$D = \frac{z}{\operatorname{mean}(x, y)} = \frac{z}{(x+y)/2} = \frac{2z}{x+y}$$

In [None]:
# vectorised form of 2*z/(x+y), kept as a list for the next cells
depth_percentage=((2*df['z'])/(df['x']+df['y'])).tolist()

In [None]:
len(depth_percentage)

So we can drop the features 'x', 'y' and 'z' and then concatenate the depth percentage.

In [None]:
df.drop(labels=['x','y','z'],axis=1,inplace=True)
depth_percentage=pd.DataFrame({'depth_percentage':depth_percentage})
df=pd.concat([df,depth_percentage],axis=1)
df.head()

# EDA : Exploratory Data Analysis

We can see that depth and depth percentage look similar, so we can check whether they carry the same information.
If they do, we can drop one of the two features.

In [None]:
plt.figure(figsize=(20,5))
plt.title('depth vs depth_percentage')
plt.xlabel('depth percentage')
plt.ylabel('depth')
plt.scatter(df['depth_percentage'],df['depth'],s=4,color='g')
plt.show()

Depth and depth percentage show similar behaviour, so we omit depth_percentage to reduce dimensionality.
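One way to back the visual comparison with a number is the Pearson correlation between the two columns. The sketch below uses toy values, and `depth_from_dimensions` is a hypothetical helper. It assumes the `depth` column is stored on a percent scale, so the ratio is multiplied by 100 before comparing.

```python
import pandas as pd


def depth_from_dimensions(frame):
    """Return 2*z / (x + y) as a percentage (assumed scale of `depth`)."""
    return 100 * (2 * frame['z']) / (frame['x'] + frame['y'])


# Toy rows for illustration only; `depth` was rounded from the formula by hand.
toy = pd.DataFrame({
    'x': [4.0, 5.0, 6.0, 4.4],
    'y': [4.0, 5.2, 6.1, 4.5],
    'z': [2.4, 3.2, 3.6, 2.8],
    'depth': [60.0, 62.7, 59.5, 62.9],
})
toy['depth_percentage'] = depth_from_dimensions(toy)

# A Pearson correlation close to 1 means both columns carry the same information.
correlation = toy['depth'].corr(toy['depth_percentage'])
print(round(correlation, 4))
```

On the real data, rows where `x + y` is zero (if any exist) would produce infinite or undefined ratios and would need to be filtered out first.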

In [None]:
df.drop(columns='depth_percentage',inplace=True)
df.head()

Now we are going to look at the relationship of the continuous features with the *price* feature using [*matplotlib.pyplot.scatter*](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html).

In [None]:
fig, axs = plt.subplots(3,1,figsize=(20,20))
# one scatter plot of price against each continuous feature
for ax,feature in zip(axs,['carat','depth','table']):
    ax.set_title(feature.upper())
    ax.scatter(df[feature],df['price'],s=4)
plt.show()

#### Conclusion :
* The point clouds of the continuous features resemble the shape of a Gaussian distribution.
* The continuous features are not completely continuous; rather, they take a wide range of distinct values.
* Carat shows a direct (increasing) relationship with the price. It is also a reminder that the data isn't shuffled well.
* The Depth feature forms a pyramidal shape against the price, which suggests that diamonds near a single depth value are the most valuable.
In [None]:

fig, axs = plt.subplots(3,1,figsize=(10,10))
fig.suptitle('Discrete Features Correlation with price',fontsize=20)
# one violin plot of price for each encoded categorical feature
for ax,(feature,title) in zip(axs,[('cut','CUT'),('color','COLOUR'),('clarity','CLARITY')]):
    ax.set_title(title)
    sns.violinplot(x=feature, y="price", data=df, ax=ax)

#### Conclusion:
* These graphs show that categories holding a large number of points with widely varying prices produce wider, flatter violins, which pulls the mean of the plot down.
* In terms of the encoded values, price is directly related to colour and cut and inversely related to clarity.
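Violin plots can be hard to compare by eye, so a table of the median price per category is a useful cross-check. This is a sketch on toy data, and `median_price_by` is a hypothetical helper.

```python
import pandas as pd


def median_price_by(frame, feature, target='price'):
    """Median of `target` within each level of `feature`, sorted by level."""
    return frame.groupby(feature)[target].median().sort_index()


# Toy rows for illustration only; `cut` holds encoded category codes.
toy = pd.DataFrame({
    'cut': [0, 0, 1, 1, 2, 2],
    'price': [500, 700, 900, 1500, 3000, 5000],
})
medians = median_price_by(toy, 'cut')
print(medians)
```

A median that rises or falls steadily across the codes would support the direction of the relationship read from the plots.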

In [None]:
sns.violinplot(x='color',y='clarity',data=df)

Here we can see that clarity is also inversely related to colour.

In [None]:
df=df.sample(frac=1)

As the rows of the dataset follow a pattern, we shuffle the dataset, which will help us obtain better predictions.
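A reproducible variant of the shuffle can be sketched as follows. The `seed` value and the helper name `shuffle_rows` are assumptions for illustration. Note that `train_test_split` also shuffles by default, so an explicit shuffle mainly matters for steps that depend on row order.

```python
import pandas as pd


def shuffle_rows(frame, seed=0):
    """Return the rows in random order with a fresh 0..n-1 index."""
    return frame.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy rows for illustration only.
toy = pd.DataFrame({'carat': [0.2, 0.3, 0.4, 0.5],
                    'price': [300, 400, 500, 600]})
shuffled = shuffle_rows(toy)
print(shuffled)
```

Fixing the seed makes the row order, and therefore any later results that depend on it, repeatable between runs.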

# Final Data Preparation :
1. At first we have to create the X and y.
2. Then we have to split the X and y into train and test sets.
   In this project we are performing an 80%-20% train-test split.

In [None]:
X=df.drop(columns='price')
y=df['price']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

# Model Creating, Fitting and Evaluation :
At first we are going to check the model with the 'Unnamed: 0' feature.
After that we are going to check the model with that feature omitted.

In [None]:
model=RandomForestRegressor(random_state=0)
model.fit(X_train,y_train)
y_1=model.predict(X_train)
print('RMSE in train data :',np.sqrt(mean_squared_error(y_train,y_1)))
y_pred=model.predict(X_test)
print('RMSE in test data :',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:
print('R-squared score of the model on train data is : ',model.score(X_train,y_train))

In [None]:
print('R-squared score of the model on test data is : ',model.score(X_test,y_test))

In [None]:
plt.figure(figsize=(15,5))
plt.title('Model Evaluation')
plt.scatter(range(0,len(X_test)),y_test,label='true',s=5)
plt.scatter(range(0,len(X_test)),y_pred,label='predicted',s=5)
plt.xlabel('index')
plt.ylabel('Price')
plt.legend()
plt.show()

We can see that at most 5 blue ('true') points are visible, as the rest are covered by the predicted points; this indicates that the model fits the data very well.

In [None]:
difference=abs(y_test-y_pred)
plt.figure(figsize=(15,5))
plt.title('difference')
plt.scatter(range(0,len(difference)),difference,s=3)
plt.xlabel('index')
plt.ylabel('Difference in Price')
plt.show()

Most of the points lie close to the zero line, showing a very good prediction.
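The scatter of absolute differences can also be condensed into a few numbers. The sketch below uses toy prices, and `error_summary` is a hypothetical helper.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Summarise absolute prediction errors with a few statistics."""
    errors = np.abs(np.asarray(y_true, dtype=float)
                    - np.asarray(y_pred, dtype=float))
    return {
        'mean_abs_error': float(errors.mean()),
        'median_abs_error': float(np.median(errors)),
        'p95_abs_error': float(np.percentile(errors, 95)),
        'max_abs_error': float(errors.max()),
    }


# Toy prices for illustration only.
summary = error_summary([500, 1000, 1500, 2000], [520, 990, 1500, 2100])
print(summary)
```

In the notebook the call would be `error_summary(y_test, y_pred)`. A median far below the maximum would confirm that most errors are small while a few points are off by a large amount.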

In [None]:
X_train=X_train.drop(columns='Unnamed: 0')
X_test=X_test.drop(columns='Unnamed: 0')

In [None]:
model=RandomForestRegressor(random_state=0)
model.fit(X_train,y_train)
y_1=model.predict(X_train)
print('RMSE in train data :',np.sqrt(mean_squared_error(y_train,y_1)))
y_pred=model.predict(X_test)
print('RMSE in test data :',np.sqrt(mean_squared_error(y_test,y_pred)))

In [None]:
print('R-squared score of the model on train data is : ',model.score(X_train,y_train))
print('R-squared score of the model on test data is : ',model.score(X_test,y_test))

In [None]:
plt.figure(figsize=(15,5))
plt.title('Model Evaluation')
plt.scatter(range(0,len(X_test)),y_test,label='true',s=5)
plt.scatter(range(0,len(X_test)),y_pred,label='predicted',s=5)
plt.xlabel('index')
plt.ylabel('Price')
plt.legend()
plt.show()

In [None]:
difference=abs(y_test-y_pred)
plt.figure(figsize=(15,5))
plt.title('difference')
plt.scatter(range(0,len(difference)),difference,s=3)
plt.xlabel('index')
plt.ylabel('Difference in Price')
plt.show()

The model gives 99.998 % accuracy when 'Unnamed: 0' is present.
The model gives 98.182 % accuracy when 'Unnamed: 0' is omitted.

# Final Conclusion :

 As we have seen here, the model accuracy decreases when the ***Unnamed: 0*** feature is omitted. So, we can say that, in this dataset, the price also depends on that feature, and it can't be taken out if any other prediction happens in future.
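One possible reading of this result, offered as an assumption rather than something the notebook verifies: if 'Unnamed: 0' is simply the row number of the CSV and the rows are stored in price order, the row number alone reveals much of the price. The sketch below reproduces that effect on synthetic data. All names and values are hypothetical, and `holdout_r2` is an illustrative helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic target, sorted so that the row position tracks its value.
price = np.sort(rng.uniform(300, 19000, size=n_rows))
row_id = np.arange(n_rows)          # plays the role of 'Unnamed: 0'
noise = rng.normal(size=n_rows)     # a feature unrelated to price

X_with_id = np.column_stack([row_id, noise])
X_without_id = noise.reshape(-1, 1)


def holdout_r2(X, y):
    """Fit a random forest and return R-squared on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


score_with_id = holdout_r2(X_with_id, price)
score_without_id = holdout_r2(X_without_id, price)
print(score_with_id, score_without_id)
```

If the same mechanism applies to the real data, the extra accuracy would not carry over to new diamonds, which have no meaningful row number. Checking whether `df['Unnamed: 0']` and `df['price']` move together before shuffling would help decide.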

### Hurrah !  We've completed the project. 
If you have any questions or want to give any feedback, please contact me by email.

**Email** : *sagnik.jal00@gmail.com*

Or you can contact me over **discord** -***'s_agnik1511#6085'***



# Thank You :)