**Title: Exploring the diamonds dataset (diamonds.csv) with EDA and VDA to derive relevant insights**  
**Author: Eklavya Attar**
  
  
*__DiamondsDataset__*  
**The dataset “diamonds.csv” contains the prices and other attributes of almost 54,000 diamonds, described by 10 variables:**  

**1. price   - price in US dollars (\$326--\$18,823)**  
**2. carat   - weight of the diamond (0.2--5.01)**  
**3. cut     - quality of the cut (Fair, Good, Very Good, Premium, Ideal)**  
**4. color   - diamond colour, from J(worst) to D(best)**  
**5. clarity - a measurement of how clear the diamond is (I1(worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF(best))**  
**6. x       - length in mm (0--10.74)**  
**7. y       - width in mm (0--58.9)**  
**8. z       - depth in mm (0--31.8)**  
**9. depth   - total depth percentage = z/mean(x,y)**  
**10. table  - width of top of diamond relative to widest point**  
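
As a quick check of the depth definition in the list above, the sketch below computes the total depth percentage for two made-up stones. The x, y and z values are hypothetical (not taken from diamonds.csv), and the factor of 100 assumes the column is stored as a percentage.

```python
import pandas as pd

# Hypothetical dimensions in mm (illustration only, not rows from diamonds.csv)
dims = pd.DataFrame({"x": [4.0, 5.0], "y": [4.0, 5.2], "z": [2.4, 3.06]})

# depth = z / mean(x, y), expressed as a percentage
dims["depth_pct"] = 100 * dims["z"] / dims[["x", "y"]].mean(axis=1)
print(dims)
```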


#### Note: We will exclude the columns: x, y and z.

In [None]:
# Import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#read data file
df_diamondDataComplete = pd.read_csv("../input/diamonds.csv")
df_diamondDataComplete.head(10)

In [None]:
# Create a new dataframe that is an independent copy of the imported one
# (plain assignment would only create a second name for the same object)
df_diamondData = df_diamondDataComplete.copy()
df_diamondData.head()
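
Plain assignment in pandas binds a second name to the same DataFrame rather than copying it, so in-place changes made through either name are visible through both. A minimal sketch with a hypothetical two-row frame illustrates the difference:

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate aliasing
original = pd.DataFrame({"carat": [0.23, 0.31], "x": [3.95, 4.34]})

alias = original               # same object, no data is copied
independent = original.copy()  # separate object with its own data

# An in-place drop through the alias also removes "x" from `original`
alias.drop(columns=["x"], inplace=True)

print("x" in original.columns)     # False: the alias shares the data
print("x" in independent.columns)  # True: the copy is unaffected
```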

In [None]:
# Drop unwanted columns: x, y and z
df_diamondData.drop(['x', 'y', 'z'], axis=1, inplace=True)
# Drop the first column (assumed to be the row index saved in the CSV)
df_diamondData = df_diamondData.drop(df_diamondData.columns[0], axis=1)

In [None]:
# View Data
print(df_diamondData.head())

In [None]:
# Check Column Names
print(df_diamondData.columns)

In [None]:
# Check datatypes of the columns
df_diamondData.dtypes

In [None]:
# Check summary
df_diamondData.describe()

## -------------------------------------------------- STEP 1 - Data cleaning and imputation ---------------------------------------------

### SECTION 1 - Only checking for null and outlier values

In [None]:
#1. Checking for NULLS
df_diamondData.isnull().sum()

#### Therefore we see that there are no null values in the given data set.

In [None]:
#2. Checking for whitespace-only cells
np.where(df_diamondData.applymap(lambda x: x == ' '))

#### Therefore we see that there are no white spaces in the given data set.
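
The check above only matches cells that are exactly one space. As a hedged alternative, the sketch below counts blank cells and cells with leading or trailing whitespace in the non-numeric columns. The `sample` frame is hypothetical and deliberately contains such cells.

```python
import pandas as pd

# Hypothetical frame with one blank cell and one padded label
sample = pd.DataFrame({"cut": ["Ideal", " ", "Good "], "price": [326, 400, 351]})

# Only non-numeric columns can hold whitespace
text_cols = [c for c in sample.columns
             if not pd.api.types.is_numeric_dtype(sample[c])]

# Cells that are empty once whitespace is stripped
blank_counts = {c: int((sample[c].str.strip() == "").sum()) for c in text_cols}

# Cells that change when stripped, i.e. have leading or trailing whitespace
padded_counts = {c: int((sample[c] != sample[c].str.strip()).sum())
                 for c in text_cols}

print(blank_counts, padded_counts)
```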

In [None]:
#3. Checking for outlier values based on the given information for each of the desired columns

In [None]:
# Create a dummy dataframe, as we are only checking the values
df_diamondData_dummy = df_diamondData.copy()

In [None]:
# -A. CARAT (values to be present between 0.2 to 5.01, as given in the handout)
print(df_diamondData_dummy[(df_diamondData_dummy['carat'] < 0.2) | (df_diamondData_dummy['carat'] > 5.01)])

In [None]:
# -B. CUT (values to be present Fair, Good, Very Good, Premium, Ideal)
value_list_cut = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
print(df_diamondData[~df_diamondData.cut.isin(value_list_cut)])

In [None]:
# -C. COLOR ( values to be present D ,E ,F ,G ,H ,I ,J)
value_list_color = ['D' ,'E' ,'F' ,'G' ,'H' ,'I' ,'J']
print(df_diamondData[~df_diamondData.color.isin(value_list_color)])

In [None]:
# -D. CLARITY ( values to be present I1 ,SI2 ,SI1 ,VS2 ,VS1 ,VVS2 ,VVS1 ,IF)
value_list_clarity = ['I1' ,'SI2' ,'SI1' ,'VS2' ,'VS1' ,'VVS2' ,'VVS1' ,'IF']
print(df_diamondData[~df_diamondData.clarity.isin(value_list_clarity)])

In [None]:
# -E. PRICE ( values to be present between $326 to $18,823)
print(df_diamondData_dummy[(df_diamondData_dummy['price'] < 326) | (df_diamondData_dummy['price'] > 18823)])

In [None]:
# For table and depth we don't have any accepted range of values, hence we will use an
# outlier function to detect outliers in the table and depth columns

In [None]:
# Outlier detection function (1.5 * IQR rule); returns the positions of the outliers
def outliers_iqr(ys):
    quartile_1, quartile_3 = np.percentile(ys, [25, 75])
    iqr = quartile_3 - quartile_1
    lower_bound = quartile_1 - (iqr * 1.5)
    upper_bound = quartile_3 + (iqr * 1.5)
    return np.where((ys > upper_bound) | (ys < lower_bound))
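
To make the 1.5 × IQR rule concrete, the following sketch walks through the same steps on a small array. The values are hypothetical and chosen so that exactly one of them falls outside the fences.

```python
import numpy as np

# Hypothetical "table" values; 95 is deliberately extreme
values = np.array([55, 56, 57, 57, 58, 59, 95])

q1, q3 = np.percentile(values, [25, 75])   # 56.5 and 58.5
iqr = q3 - q1                              # 2.0
lower = q1 - 1.5 * iqr                     # 53.5
upper = q3 + 1.5 * iqr                     # 61.5

# Positions of values outside the fences
outlier_idx = np.where((values < lower) | (values > upper))[0]
print(outlier_idx)   # [6]
```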

In [None]:
# -F. TABLE
# Call the outlier function and count the outliers' index positions
a = outliers_iqr(ys=df_diamondData_dummy['table'])[0]
a.size

In [None]:
# -G. DEPTH
# Call the outlier function and count the outliers' index positions
b = outliers_iqr(ys=df_diamondData_dummy['depth'])[0]
b.size

###### Section 1 Summary


**1.) Total nulls : 0**  
__2.) Total whitespaces : 0__    
__3.) Total Outlier values : 3150__  

_-A. Based on the outlier function_  
1. Table : 605
2. Depth : 2545

**We see that the percentage of outliers is about 6%; we will still try to impute them.**

### SECTION 1 Ends...

### SECTION 2 - Data imputation

In [None]:
# Create a new dataframe to store the imputed values
df_diamondData_mod = df_diamondData.copy()

In [None]:
#Imputation for Table and Depth columns using quantile method

In [None]:
#1. TABLE

In [None]:
# Creating the accepted range of values
down_quantiles_diamondtable = df_diamondData_mod.table.quantile(0.25)
up_quantiles_diamondtable   = df_diamondData_mod.table.quantile(0.75)

In [None]:
# Flagging the values below the lower quartile and above the upper quartile
outliers_low_diamondtable = (df_diamondData_mod.table < down_quantiles_diamondtable)
outliers_high_diamondtable = (df_diamondData_mod.table > up_quantiles_diamondtable)

In [None]:
# Updating the column with the quantile values
df_diamondData_mod.table  = df_diamondData_mod.table.mask(outliers_low_diamondtable,down_quantiles_diamondtable)
df_diamondData_mod.table  = df_diamondData_mod.table.mask(outliers_high_diamondtable,up_quantiles_diamondtable)

In [None]:
# Call outlier function and check for the count of outlier values
a = np.array([outliers_iqr(ys=df_diamondData_mod['table'])])
a.size

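#### Hence no outliers remain in the table column: all 605 flagged values have been replaced. Note that masking at Q1 and Q3 also changes every non-outlier value that lies outside the interquartile range.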
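
The masking above clips to the quartiles themselves. A gentler alternative, shown here only as a sketch and not as the method used in this notebook, caps values at the IQR fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) so that values inside the fences stay unchanged. The `table` series is hypothetical.

```python
import pandas as pd

# Hypothetical "table" values with one low and one high outlier
table = pd.Series([50.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 95.0])

q1, q3 = table.quantile(0.25), table.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap only the values that fall outside the fences
capped = table.clip(lower=lower, upper=upper)

# Number of values that were actually modified
n_changed = int((capped != table).sum())
print(lower, upper, n_changed)
```

With this approach only the two extreme values are altered, whereas masking at Q1 and Q3 would also overwrite 55.0 and 59.0.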

In [None]:
#2. DEPTH

In [None]:
# Creating the accepted range of values
down_quantiles_diamonddepth = df_diamondData_mod.depth.quantile(0.25)
up_quantiles_diamonddepth = df_diamondData_mod.depth.quantile(0.75)

In [None]:
# Flagging the values below the lower quartile and above the upper quartile
outliers_low_diamonddepth = (df_diamondData_mod.depth < down_quantiles_diamonddepth)
outliers_high_diamonddepth = (df_diamondData_mod.depth > up_quantiles_diamonddepth)

In [None]:
# Updating the column with the quantile values
df_diamondData_mod.depth  = df_diamondData_mod.depth.mask(outliers_low_diamonddepth,down_quantiles_diamonddepth)
df_diamondData_mod.depth  = df_diamondData_mod.depth.mask(outliers_high_diamonddepth,up_quantiles_diamonddepth)

In [None]:
# Call outlier function and check for the count of outlier values
b = np.array([outliers_iqr(ys=df_diamondData_mod['depth'])])
b.size

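#### Hence no outliers remain in the depth column: all 2545 flagged values have been replaced. As with table, masking at Q1 and Q3 also changes every non-outlier value that lies outside the interquartile range.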

In [None]:
# Assign the imputed dataframe to a new dataframe
df_diamond = df_diamondData_mod

###### Section 2 Summary

**1.) Data imputation was carried out successfully**

### SECTION 2 Ends...

## ---------------------------------------------------------- STEP 1 ends -----------------------------------------------------------------


## ---------------------------------------------------------- STEP 2 - EDA ----------------------------------------------------------------


#### We will try to understand the influence of the 4 Cs of a diamond, i.e. clarity, cut, carat and color, on the price of the diamond and also on each other

In [None]:
#1. Relationship between clarity and price
pd.crosstab(df_diamond["clarity"], df_diamond["price"], margins= True)

#### Therefore clarities SI1 & VS2 together account for almost 50% of the diamonds (a share of the row counts in the margins, not of the price itself).

In [None]:
pd.crosstab(df_diamond["clarity"], columns="count")
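
Percentages such as the share of each clarity grade can be read off directly by normalising the counts. A minimal sketch on a hypothetical eight-row sample, assuming a `clarity` column as in the notebook:

```python
import pandas as pd

# Hypothetical sample; the real shares come from diamonds.csv
sample = pd.DataFrame(
    {"clarity": ["SI1", "VS2", "SI1", "IF", "VS2", "SI2", "SI1", "VS1"]}
)

# Share of rows per clarity grade, in percent
share = sample["clarity"].value_counts(normalize=True) * 100
print(share)
```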

In [None]:
#2. Relationship between cut and price
pd.crosstab(df_diamond["cut"], df_diamond["price"], margins= True)

#### Therefore diamonds with the "Ideal" cut account for almost 40% of the rows, more than any other cut value

In [None]:
pd.crosstab(df_diamond["cut"], columns="count")

In [None]:
#3. Relationship between color and price
pd.crosstab(df_diamond["color"], df_diamond["price"], margins= True)

#### Therefore color G accounts for more diamonds than any other color

In [None]:
pd.crosstab(df_diamond["color"], columns="count")

In [None]:
#4. Relationship between carat and price
pd.crosstab(df_diamond["carat"], df_diamond["price"], margins= True)

#### Therefore we can see that the data points are skewed toward lower carat values; carat values in the range 0.23 - 0.45 account for more of the diamonds.

In [None]:
pd.crosstab(df_diamond["carat"], columns="count")
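
A crosstab of two continuous columns produces one row and one column per distinct value, which is hard to read. One option, sketched here with hypothetical data and arbitrarily chosen bin edges, is to bin both columns with `pd.cut` before cross-tabulating:

```python
import pandas as pd

# Hypothetical carat/price pairs, for illustration only
sample = pd.DataFrame({
    "carat": [0.23, 0.31, 0.45, 0.70, 1.01, 1.50, 2.10],
    "price": [326, 420, 800, 2500, 5000, 9000, 15000],
})

# Bin edges are assumptions; intervals are right-inclusive by default
carat_bins = pd.cut(sample["carat"], bins=[0, 0.5, 1.0, 2.0, 6.0])
price_bins = pd.cut(sample["price"], bins=[0, 1000, 5000, 20000])

# Counts of diamonds per (carat range, price range) cell
binned = pd.crosstab(carat_bins, price_bins)
print(binned)
```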

In [None]:
#5. Price table
pd.crosstab(index=df_diamond["price"], columns="count")

In [None]:
# Most frequent price values (top 10 by count)
pd.crosstab(index=df_diamond["price"], columns="count").nlargest(10, 'count')

#### Therefore we see that the price value 605 occurs more often than any other price value; hence we can say that the data we have is not of very expensive diamonds.

In [None]:
#6. Relationship between carat and cut
pd.crosstab(index=df_diamond["carat"], columns=df_diamond["cut"])

#### The data points are skewed toward lower carat values, and with respect to cut, more values are present for Good, Ideal, Premium and Very Good, while fewer exist for the Fair cut.

In [None]:
#7. Relationship between clarity and color
pd.crosstab(df_diamond["clarity"],df_diamond["color"], margins = True)

#### Therefore we can see that G is the most common color, and for color G, diamonds with clarity VS1 and VS2 are more numerous than those of other clarities.

#### In terms of clarity, SI1 and VS2 are the most numerous overall.

##### Step 2: Summary
**1.) The most common color is G (mid-scale), the most common clarities are SI1 & VS2, and carat is skewed toward lower values.**  
**2.) The most common cut is Ideal.**  
**3.) The data that we have is not of very expensive diamonds.** 

## --------------------------------------------------------- STEP 2 ends -----------------------------------------------------------------

## --------------------------------------------------------- STEP 3 - VDA -----------------------------------------------------------------


###  We will try to understand the influence of the 4 Cs of a diamond, i.e. clarity, cut, carat and color, on the price of the diamond and also on each other.

In [None]:
# 1. To check the relationship between clarity and price using a violin plot
sns.violinplot(x='clarity', y='price', data=df_diamond)
plt.show()

**The violinplot shows that diamonds on the highest end of the clarity spectrum i.e. IF have lower median price than low clarity diamonds.**  
**It is quite unusual, since diamonds with better clarity are expected to have higher prices.**  
**Hence there have to be other factors deciding the price of the diamonds.**  
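
The medians that the violin plot shows visually can also be tabulated. The sketch below uses a hypothetical six-row sample, constructed only to mimic the pattern described above, and orders the grades from worst to best using the clarity scale given at the top of this notebook.

```python
import pandas as pd

# Clarity scale from worst to best, as listed in the dataset description
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Hypothetical rows, not taken from diamonds.csv
sample = pd.DataFrame({
    "clarity": ["IF", "I1", "SI1", "IF", "I1", "SI1"],
    "price":   [900, 3000, 2400, 1300, 4200, 2800],
})

# An ordered categorical keeps the grades in scale order instead of alphabetical
sample["clarity"] = pd.Categorical(
    sample["clarity"], categories=clarity_order, ordered=True
)

# Median price per clarity grade, only for grades present in the sample
median_price = sample.groupby("clarity", observed=True)["price"].median()
print(median_price)
```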

In [None]:
# 2. To check the relationship between cut, clarity and price using a categorical plot
# (sns.catplot replaces the older sns.factorplot; `height` replaces `size`)
sns.catplot(x='clarity', y='price', data=df_diamond, hue='cut',
            kind='bar', height=8)
plt.show()

#### Hence we see that even with low clarity, most of the diamonds have ideal, very good  and premium cuts which can have an effect on the price.

In [None]:
# 3. To check the relationship between clarity and carat using a violin plot
sns.violinplot(x='clarity', y='carat', data=df_diamond)
plt.show()

**The violinplot shows that diamonds with low clarity ratings also tend to be larger.**   
**Since size is an important factor in determining a diamond’s value, it isn’t too surprising that low clarity diamonds have higher median prices.**  
**Lighter diamonds can still be expensive if they have a high clarity rating, and**   
**conversely some of the heavier diamonds with a low clarity rating are not as expensive as their weight alone would suggest.**
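
One way to separate the effect of size from the effect of clarity is to compare price per carat rather than raw price. This is a sketch on hypothetical rows, and `price_per_carat` is a helper column introduced here for illustration rather than part of the dataset.

```python
import pandas as pd

# Hypothetical rows: large low-clarity stones and small high-clarity stones
sample = pd.DataFrame({
    "clarity": ["I1", "I1", "IF", "IF"],
    "carat":   [1.5, 1.0, 0.3, 0.5],
    "price":   [4500, 2500, 1200, 2500],
})

# Normalise price by weight
sample["price_per_carat"] = sample["price"] / sample["carat"]

# Median raw price vs. median price per carat, by clarity
raw = sample.groupby("clarity")["price"].median()
per_carat = sample.groupby("clarity")["price_per_carat"].median()
print(raw, per_carat, sep="\n")
```

In this toy sample the I1 stones have the higher median price but the lower median price per carat, which is the pattern the violin plots suggest.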

In [None]:
# 4. To check the relationship between price and carat using a joint plot
sns.jointplot(x='carat', y='price', data=df_diamond)
plt.show()

#### Although the scatterplot above has many overlapping points, it still gives us some insights into the relationship between diamond carat (weight) and price i.e. bigger diamonds are generally more expensive.
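
The visual impression can be backed by a correlation coefficient. The sketch below computes the Pearson correlation and a rank-based (Spearman-style) correlation, obtained as the Pearson correlation of the ranks, on hypothetical carat/price pairs.

```python
import pandas as pd

# Hypothetical pairs in which price rises faster than linearly with carat
sample = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0],
    "price": [500, 1200, 2300, 5000, 10000, 16000],
})

# Linear association
pearson = sample["carat"].corr(sample["price"])

# Monotonic association: Pearson correlation of the ranks
rank_corr = sample["carat"].rank().corr(sample["price"].rank())

print(round(pearson, 3), round(rank_corr, 3))
```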

In [None]:
# 5. Cut - price relation, split by color
# (sns.catplot replaces the older sns.factorplot; `height` replaces `size`)
sns.catplot(x='cut', y='price', data=df_diamond, hue='color',
            kind='bar', height=8)
plt.show()

#### Hence we can say that cut alone doesn’t seem to determine the price; for example, here we can see that Ideal-cut diamonds are spread from the low to the high price range.

In [None]:
# 6. To check the relationship between cut and price w.r.t. clarity using a categorical plot
# (sns.catplot replaces the older sns.factorplot; `height` replaces `size`)
sns.catplot(x='clarity', y='price', data=df_diamond, hue='cut',
            kind='bar', height=8)
plt.show()

#### Therefore we see that premium cut and not ideal cut is expensive across most of the diamond categories based on clarity

In [None]:
# 7. To check the relationship between color and price w.r.t. clarity using a categorical plot
# (sns.catplot replaces the older sns.factorplot; `height` replaces `size`)
sns.catplot(x='clarity', y='price', data=df_diamond, hue='color',
            kind='bar', height=8)
plt.show()

#### Therefore we see that the colors - I, J are expensive across most of the diamond categories based on clarity

In [None]:
# 8. To check the relationship between color and cut
sns.countplot(y="cut", hue="color", data=df_diamond)
plt.show()

**The most common colour in both Ideal and Premium cut diamonds is G, which sits in the middle of the colour scale (D best, J worst).**   
**Therefore, the cut of the diamonds does not appear to be affected by the colour.**
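
Whether the colour mix differs between cuts can also be checked numerically by normalising a crosstab within each cut. This is a sketch on a hypothetical eight-row sample; similar rows would indicate that colour and cut are largely independent.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Ideal", "Ideal",
              "Premium", "Premium", "Premium", "Premium"],
    "color": ["G", "G", "E", "D", "G", "G", "H", "E"],
})

# Share of each colour within each cut (every row sums to 1)
color_share_by_cut = pd.crosstab(sample["cut"], sample["color"],
                                 normalize="index")
print(color_share_by_cut)
```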

###### Step 3: Summary

**1.) VDA is a better option than EDA when the dataset is large.**  
**2.) We can say that Price of diamond is collectively influenced by Carat, Cut, Clarity, Color but mostly by Carat.**  
**3.) The most common color is G (mid-scale), the most common clarities are SI1 & VS2, and carat is skewed toward lower values.**  
**4.) The most common cut is Ideal.**  
**5.) The data that we have is not of very expensive diamonds.**  

## ---------------------------------------------------------------- STEP 3 ends -----------------------------------------------------------