# Diamonds Dataset Analysis and Pre-Processing  

## Context
### This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.


## Content

* price: price in US dollars (\$326--\$18,823)

* carat: weight of the diamond (0.2--5.01)

* cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

* color: diamond colour, from J (worst) to D (best)

* clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

* x: length in mm (0--10.74)

* y: width in mm (0--58.9)

* z: depth in mm (0--31.8)

* depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

* table: width of top of diamond relative to widest point (43--95)
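
The depth formula in the list yields a fraction, while the 43--79 range is that fraction expressed as a percentage. A minimal sketch with made-up measurements (the values and the `depth_pct` name are illustrative, not rows from the dataset):

```python
import pandas as pd

# Hypothetical measurements in mm; not rows from the real dataset.
stones = pd.DataFrame({
    "x": [4.0, 6.0],
    "y": [4.0, 6.2],
    "z": [2.5, 3.8],
})

# depth % = z / mean(x, y) * 100 = 2 * z / (x + y) * 100
stones["depth_pct"] = 2 * stones["z"] / (stones["x"] + stones["y"]) * 100
print(stones.round(2))
```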

# Topics:
## 1. Exploring Data
## 2. Handling Data 
## 3. Visualization of Data


# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Extract Data

In [None]:

data = pd.read_csv('/kaggle/input/diamonds/diamonds.csv')
data.head()

# 1. Exploring Data

# 1.1 Data Size & Shape

In [None]:
data.size

In [None]:
data.shape
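
`size` and `shape` are related: `shape` is a `(rows, columns)` tuple and `size` is the product of the two. A small sketch on a made-up frame (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame standing in for the real data.
toy = pd.DataFrame({"carat": [0.2, 0.3, 0.9], "price": [300, 400, 2500]})

n_rows, n_cols = toy.shape
print(toy.shape)  # (rows, columns)
print(toy.size)   # total number of cells = rows * columns
```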

# 1.2 Get general information about data columns, data types

In [None]:
data.info()

# 1.3 Get Statistical Summary  

In [None]:
data.describe()

### The minimum value of x, y and z (length, width, depth) is zero, which is not physically possible. We will handle these rows in the Handling Data section.

# 1.4 Correlation Table

In [None]:
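# numeric_only=True skips the text columns (cut, color, clarity);
# recent pandas versions raise an error without it (needs pandas 1.5+)
data.corr(numeric_only=True)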

## From the correlation table we get:
### 1. Carat and price are highly correlated with x, y and z
### 2. Depth, although derived from x, y and z, has little correlation with price
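
One possible reading of the second point: depth is a ratio, so scaling a stone up changes x, y and z together but leaves the ratio roughly unchanged. A small sketch on synthetic data (the distributions below are assumptions chosen for illustration, not estimates from the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stones (an assumption for illustration): size varies a lot,
# while the depth ratio stays in a narrow band that does not depend on size.
x = rng.uniform(4.0, 8.0, size=1000)        # length in mm
y = x.copy()                                # width, taken equal to length
ratio = rng.normal(0.62, 0.01, size=1000)   # z / mean(x, y)
z = ratio * x                               # depth in mm

depth_pct = 2 * z / (x + y) * 100

corr_z_x = np.corrcoef(z, x)[0, 1]
corr_depth_x = np.corrcoef(depth_pct, x)[0, 1]
print(round(corr_z_x, 3), round(corr_depth_x, 3))
```

In this sketch z tracks x almost perfectly, while the depth percentage is close to uncorrelated with x.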

# 1.5 Check for null values

In [None]:
data.isnull().sum()

## No null values
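
A clean null check does not rule out invalid placeholders: a zero is an ordinary number to pandas, so `isnull()` does not count it. A sketch on made-up rows (the `toy` and `marked` names and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second stone has an impossible z of 0 mm.
toy = pd.DataFrame({"x": [4.0, 6.5], "y": [4.1, 6.4], "z": [2.5, 0.0]})

print(toy.isnull().sum())   # all zeros: 0.0 is not treated as missing
print((toy == 0).sum())     # counts the zero placeholders per column

# Converting zeros to NaN makes them visible to isnull()
marked = toy.replace(0, np.nan)
print(marked.isnull().sum())
```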

# 2 Handling Data

## 1. We need to handle the records where the value of x, y or z is equal to zero.
## 2. We don't need the `Unnamed: 0` column

# 2.1 Drop Unnamed Column 

In [None]:
data.drop(['Unnamed: 0'],inplace = True, axis = 1)
data.head()
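
An `Unnamed: 0` column usually appears when a CSV was saved together with its row index. As an alternative to dropping it afterwards, `pd.read_csv` can read that first column as the index via `index_col=0`. A sketch using a tiny in-memory CSV (the `csv_text` content is made up, and it is an assumption that the real file has the same layout):

```python
import io
import pandas as pd

# Tiny stand-in for diamonds.csv: the leading empty header is the saved index.
csv_text = ",carat,price\n1,0.2,300\n2,0.3,400\n"

with_index_column = pd.read_csv(io.StringIO(csv_text))
print(list(with_index_column.columns))  # includes 'Unnamed: 0'

# index_col=0 uses that first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```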

# 2.2 Handle wrong value of x,y,z 

## 2.2.1 rows of x,y,z with wrong values

In [None]:
# Count the rows where each dimension is recorded as zero
print('Number of rows with x = 0 are {}'.format((data.x==0).sum()))
print('Number of rows with y = 0 are {}'.format((data.y==0).sum()))
print('Number of rows with z = 0 are {}'.format((data.z==0).sum()))

## 2.2.2 rows of x,y,z with wrong values set to NaN

In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
data.x = data.x.replace(0,np.nan)
data.y = data.y.replace(0,np.nan)
data.z = data.z.replace(0,np.nan)

print("Number of rows with x = 0 are {}".format((data.x==0).sum()))
print("Number of rows with y = 0 are {}".format((data.y==0).sum()))
print("Number of rows with z = 0 are {}".format((data.z==0).sum()))

In [None]:
data.isna().sum()

## 2.2.3 rows with NaN values removed

In [None]:
data.dropna(inplace=True)
data.isna().sum()
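
The replace-then-drop steps above can also be written as a single boolean filter. A sketch on made-up rows (the `toy`, `valid` and `cleaned` names and the values are illustrative):

```python
import pandas as pd

# Hypothetical rows; the second and third have a zero dimension.
toy = pd.DataFrame({
    "x": [4.0, 0.0, 6.5],
    "y": [4.1, 0.0, 6.4],
    "z": [2.5, 0.0, 0.0],
    "price": [300, 5000, 18000],
})

# Keep only rows where x, y and z are all non-zero
valid = (toy[["x", "y", "z"]] != 0).all(axis=1)
cleaned = toy[valid].reset_index(drop=True)
print(cleaned)
```

The filter leaves the original frame untouched, which makes it easy to compare row counts before and after.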

# 3. Visualization of Data 

# 3.1 Count of diamonds based on cut

In [None]:
sns.catplot(data=data, x='cut', kind = "count")

# 3.2 Clarity vs Price (bar plot)
### Clarity levels from the Content section: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)

In [None]:
sns.catplot(data=data,x='clarity', y= 'price', kind = 'bar')
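
The clarity grades are easier to compare when they appear from worst to best. One way to get that, sketched here on made-up rows, is to store the column as an ordered categorical; passing the same list as `order=` to `sns.catplot` is another option. The `clarity_order` and `toy` names and the prices are hypothetical:

```python
import pandas as pd

# Worst-to-best order, as listed in the Content section
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Hypothetical rows, deliberately out of order
toy = pd.DataFrame({"clarity": ["IF", "SI2", "VS1", "I1"],
                    "price": [2800, 5000, 3800, 3900]})

toy["clarity"] = pd.Categorical(toy["clarity"],
                                categories=clarity_order, ordered=True)
print(toy.sort_values("clarity")["clarity"].tolist())
```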

# 3.3 Color vs Price
## *Diamonds with the best color grades aren't the most costly.*
### Color runs from D (best) to J (worst)

In [None]:
sns.catplot(data=data, x='color', y = 'price', kind = "bar")
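
By default, each bar in a seaborn bar plot shows the mean of `y` within the group, so the same numbers can be read from a `groupby`. A sketch on made-up prices (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical prices for two color grades
toy = pd.DataFrame({"color": ["D", "D", "J", "J"],
                    "price": [1000, 3000, 2000, 6000]})

# Mean price per color grade, the quantity a default bar plot displays
mean_price = toy.groupby("color")["price"].mean()
print(mean_price)
```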

# 3.4 Clarity vs Price (box plot)
### *Note that the median is close to the 1st quartile*


In [None]:
sns.catplot(data=data, x='clarity', y = 'price', kind = "box")
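
A median that sits near the first quartile is typical of right-skewed data: most values are low and a few high ones stretch the upper half of the box. A sketch on made-up prices for a single grade (the `prices` values are hypothetical):

```python
import pandas as pd

# Hypothetical right-skewed prices for a single clarity grade
prices = pd.Series([300, 400, 500, 600, 700, 900, 1500, 3000, 8000])

q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)
print(median - q1 < q3 - median)  # True when the median sits nearer to Q1
```

On the real data, the same quartiles per grade could be obtained with `data.groupby('clarity')['price'].quantile([0.25, 0.5, 0.75])`.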

# 3.5 Cut vs Price
### *Different cuts don't have much impact on price*

In [None]:
# No print() needed: wrapping the call would only print the FacetGrid object
sns.catplot(data=data, x='cut', y = 'price', kind = "box")
