### INTRODUCTION TO THE DATASET



**The dataset**: The goal is to predict the `price` of a given diamond/gemstone (a regression problem).

**There are 10 independent variables (including `id`)**
* `id` : unique identifier of each diamond
* `carat` : Carat (ct.) is the unit of weight used for gemstones and diamonds.
* `cut` : Quality of the diamond cut
* `color` : Color grade of the diamond
* `clarity` : Diamond clarity is a measure of the purity and rarity of the stone, graded by the visibility of its inclusions and blemishes under 10-power magnification.
* `depth` : The height of the diamond, measured from the culet (bottom tip) to the table (flat top surface); in this dataset it is expressed as a percentage of the average diameter.
* `table` : A diamond's table is the facet that can be seen when the stone is viewed face up; in this dataset its width is expressed as a percentage of the average diameter.
* `x` : Diamond X dimension
* `y` : Diamond Y dimension
* `z` : Diamond Z dimension
  
**Target variable:**
* `price`: Price of the given Diamond.

Dataset Source Link :
[Link of gemstone.csv dataset](https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv)

----------------------------------------------------------------------------------------------------------------------
   

### Step 1: IMPORT THE REQUIRED LIBRARIES AND INGEST THE DATA

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings("ignore")

In [None]:
data=pd.read_csv(r"C:\\Users\\Admin\Desktop\\ineuron\\_FSDSM_SelfLearned_RashmiKumari\\MACHINE_LEARNING\\01.End_to_End_setup_and_MLproject\\01.Setup_for_ML_DL_projects\\notebooks\\data\\gemstone.csv")

In [None]:
data.head()    # head(): shows the first 5 rows of the dataset.

In [None]:
data.tail()    # tail(): shows the last 5 rows of the dataset.

The loaded data provides information about gemstones, including their carat weight, cut quality, color, clarity, dimensions (x, y, z), and price. Next, let's inspect the structure of this dataset.

In [None]:
data.columns    # columns: gives the names of the columns. It is an attribute, not a method, hence no () after columns.

In [None]:
data.shape      # shape : returns the shape of the dataset as (number of rows, number of columns).
                #         shape is also an attribute, not a method/function.

In [None]:
print("Number of rows/records in the dataset:",data.shape[0])
print("Number of columns in the dataset:",data.shape[1])

In [None]:
data.sample(20)   # sample(): returns a random sample of rows from the dataset.
                  #           The argument is the number of rows wanted in the sample, i.e. the sample size.

In [None]:
data.info()

**Check for the null values.**

In [None]:
# data.isnull()   : This gives Boolean values in Dataframe form.

print("Number of Null/missing values in the dataset: ")

data.isnull().sum()   # Conclusion: No null value is there in any column.

**Check for the duplicate values**

In [None]:
# data.duplicated() : This gives Boolean values 
data.duplicated().sum()  # This gives the number of records which are duplicate.

In [None]:
data[data.duplicated()]           # duplicated() already returns a Boolean mask, so it can be used directly to list the duplicate rows.
                                  # (Comparing the mask with the string "True" never matches and always returns an empty frame.)

                                  # Conclusion: There are no duplicate records.
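
A minimal sketch of how a duplicate mask behaves, using a small made-up frame (`toy` is hypothetical and not part of the gemstone data). It shows that the Boolean Series from `duplicated()` filters rows directly, and that `keep=False` marks every copy of a repeated row rather than only the later ones.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({"carat": [0.3, 0.5, 0.3, 0.7],
                    "cut": ["Ideal", "Good", "Ideal", "Fair"]})

dup_mask = toy.duplicated()                      # True for the 2nd and later copies of a row
later_copies = toy[dup_mask]                     # Boolean mask used directly as a row filter
all_copies = toy[toy.duplicated(keep=False)]     # every row that has at least one twin

print(dup_mask.sum())                            # number of duplicate records
print(all_copies)
```

If duplicates were found, `drop_duplicates()` would be the usual way to remove them.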

**Check the statistical summary  of data**

In [None]:
data.describe().T     # describe() considers only the numerical columns by default.
                      # T : transposes the result.

**Conclusions from describe() (statistical summary):**

1. *Diamond Count:*
   - Dataset comprises 193,573 diamonds with unique IDs ranging from 0 to 193,572.

2. *Carat Weight (Carat):*
   - Average carat weight is 0.79.
   - Ranges from 0.2 to 3.5.
   - 25% of diamonds have a carat weight below 0.4, 50% below 0.70 and 75% below 1.03.
   - Standard deviation is 0.46.

3. *Depth (percentage):*
   - Ranges from 52.1 to 71.60.
   - Average depth percentage is 61.82.
   - Standard deviation is 1.08.


4. *Table (percentage):*
   - Ranges from 49.0 to 79.00.
   - Average table percentage is 57.23.
   - Standard deviation is 1.92.

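5. *Dimensions (x, y, z) (in millimeters, mm):*
   - Ranges:
     - x : 0 to 9.65
     - y : 0 to 10.01
     - z : 0 to 31.30
   - A minimum of 0 is not physically possible for a dimension, so such rows likely hold missing or erroneous measurements.
   - Average dimensions are approximately 5.72 x 5.72 x 3.53.
   - Standard deviations: 1.11 (x), 1.10 (y), 0.69 (z).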

6. *Price:*
   - Prices Range from 326 to 18,818.
   - Average price is 3,969.16.
   - Standard Deviation is 4034.37
   - Right-skewed distribution; median price (50%) is 2,401, while the mean is higher at 3,969.16.
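
The mean-versus-median reading of skewness can be checked numerically. The sketch below uses a made-up `prices` Series (hypothetical values, not taken from the dataset) in which a few large values stretch the right tail, and compares the mean, the median and pandas' sample skewness.

```python
import pandas as pd

# Hypothetical prices, chosen so that a few large values stretch the right tail.
prices = pd.Series([400, 600, 900, 1500, 2400, 5200, 18000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()   # sample skewness; a positive value means a longer right tail

print(f"mean={mean_price:.2f}, median={median_price:.2f}, skew={skewness:.2f}")
```

On the notebook's data the same check would be `data['price'].skew()`, assuming `data` has been loaded as above.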
   
------------------------------------------------------------------------------------


**Note:**
* Carats = Milligrams / 200   (since 200 mg = 1 carat)

* The unit for both "Depth" and "Table" in this dataset is percentage (%). 
These percentages express the gemstone's depth and table width relative to its average diameter.

        Depth (depth percentage) : calculated as (depth / average diameter) * 100
        Table (table percentage) : calculated as (table width / average diameter) * 100
   
   ![depth_table_of_dimanond.png](attachment:depth_table_of_dimanond.png)

* Standard deviation: The standard deviation for each of these features would have the same unit as the respective feature. 

*  ![c101b0da6ea1a0dab31f80d9963b0368_orig.png](attachment:c101b0da6ea1a0dab31f80d9963b0368_orig.png)

   mean < median => left-skewed distribution (Negatively-skewed) => tail on the left side is longer

   mean > median => right-skewed distribution (Positively-skewed) => tail on the right side is longer 
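
The carat and depth-percentage formulas from the note can be written out as a short sketch. It assumes that `z` is the height of the stone and that `x` and `y` are its two diameters, which is an assumption about the column meanings; the function names and the example stone are hypothetical.

```python
def depth_percentage(x_mm: float, y_mm: float, z_mm: float) -> float:
    """Depth % = height / average diameter * 100.

    Assumes z is the height and x, y are the two diameters
    (an assumption about what the columns mean).
    """
    average_diameter = (x_mm + y_mm) / 2
    return z_mm / average_diameter * 100


def carat_to_milligrams(carat: float) -> float:
    """Convert carats to milligrams, using 1 carat = 200 mg."""
    return carat * 200


# Hypothetical stone, not a row taken from the dataset.
example_depth = depth_percentage(5.70, 5.74, 3.53)
example_weight = carat_to_milligrams(0.79)
print(round(example_depth, 2), round(example_weight, 1))
```

Comparing such a computed value with the `depth` column is one way to spot rows whose dimensions look inconsistent.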

------------------------------------------------------------------------------


### Step 2: FEATURE ENGINEERING

**The `id` feature is not needed for the analysis, so it is better to drop it.**

In [None]:
data.drop("id",axis=1,inplace=True)   #inplace= True :  modify the original DataFrame by dropping the mentioned feature permanently

*Each of the following drops the `id` feature from the dataset:*

        data.drop(labels=['id'],axis=1,inplace=True)
        data.drop(columns=['id'],inplace=True)
        data.drop("id",axis=1,inplace=True) 
        data= data.drop("id",axis=1) 

        Note: axis=0 refers to rows and axis=1 refers to columns in drop(). With columns=..., axis is not needed.
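
A small sketch of the difference between dropping in place and dropping into a new frame, on a made-up `toy` frame (hypothetical). It also uses `errors="ignore"`, which lets the cell be re-run after the column is already gone instead of raising a `KeyError`.

```python
import pandas as pd

# Hypothetical toy frame with an id column.
toy = pd.DataFrame({"id": [0, 1], "carat": [0.3, 0.5], "price": [500, 900]})

without_id = toy.drop(columns="id")          # returns a new frame; toy is unchanged
print(list(toy.columns), list(without_id.columns))

toy.drop(columns="id", inplace=True)         # modifies toy itself and returns None
toy.drop(columns="id", inplace=True, errors="ignore")   # safe to re-run: no KeyError
print(list(toy.columns))
```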

In [None]:
data.head(2)   #We can observe that 'id' column is gone.


**Separating the Numerical and Categorical Features**

In [None]:
data.info()  # Shows the datatype of each feature.
             # dtype is float, int or object: float and int are numerical (non-object) types.
             # Therefore we can separate the data as object --> Categorical and non-object --> Numerical

In [None]:
# Method I: (Suggested as it is short)

categorical_col=data.columns[data.dtypes=="object"]
numerical_col=data.columns[data.dtypes!="object"]

print("Numerical_columns:",list(numerical_col))
print("Categorical_feature:",list(categorical_col))

In [None]:
# Method II: Using loop and if-else: (Not Suggested as it is lengthy...Good to understand the logic!!!)

features=list(data.columns)    #Typecasting to get a List structure.

numerical_feature=[]
categorical_feature=[]

for f in features:

    if data[f].dtype=='object':
        categorical_feature.append(f)
    else:
        numerical_feature.append(f)


print("Numerical_columns:",numerical_feature)
print("Categorical_feature:",categorical_feature)
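
A third option, sketched here on a made-up `toy` frame (hypothetical), is `select_dtypes`. Splitting on "number" versus "not number" does not rely on the text columns being stored with the `object` dtype.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column.
toy = pd.DataFrame({"carat": [0.3, 0.5], "cut": ["Ideal", "Good"], "price": [500, 900]})

numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)
```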

In [None]:
# Analysing the Categorical features : ['cut', 'color', 'clarity']

categorical_feature

In [None]:
data[categorical_col]

In [None]:
data[categorical_col].info()

In [None]:
data[categorical_col].isnull().sum()

In [None]:
data[categorical_col].describe()   # Statistical Summary

In [None]:
# Unique values present inside each of the categorical features:
print("There are",data[categorical_col]['cut'].nunique(),"values present in 'cut' feature :",data[categorical_col]['cut'].unique())
print("There are",data[categorical_col]['color'].nunique(),"values present in 'color' feature :",data[categorical_col]['color'].unique())
print("There are",data[categorical_col]['clarity'].nunique(),"values present in 'clarity' feature :",data[categorical_col]['clarity'].unique())

In [None]:
# Since cut, color and clarity are non-numerical, we cannot compute a mean or median, but we can find the mode.

print("Mode in cut feature:",data[categorical_col]['cut'].mode().values)
print("Mode in color feature:",data[categorical_col]['color'].mode().values)
print("Mode in clarity feature:",data[categorical_col]['clarity'].mode().values)


-------------------------------------------------------------------------

# INDEX OF EXPLORATORY DATA ANALYSIS :
A. Categorical Features: cut, clarity and color

(a) Univariate Analysis : Distribution of Diamond Cut, Color, and Clarity

`(a.1)`

- Individual countplots for cut, color and clarity
- Using Subplot with sns : Combined countplots of these 3 features using seaborn with subplots.
- Using Subplot with plotly : Combined countplots of these 3 features using plotly with subplots.

`(a.2)`

- Individual donut charts for cut, color and clarity in percentage form
- Using Subplot with sns : Combined donut charts of these 3 features using seaborn with subplots.
- Using Subplot with plotly : Combined donut charts of these 3 features using plotly with subplots.


`(a.3)`

- Distribution of Diamond Cut, Color, and Clarity using sns and plotly (combination of the a.1 and a.2 distributions)

(b) Bivariate Analysis : Count of Diamonds by (Cut and Color), (Cut and Clarity), (Color and Clarity)

`(b.1)`
- Countplot of Diamonds by (Cut and Color), (Cut and Clarity), (Color and Clarity)

`(b.2)`
- Frequency table and percentage table, heatmap (frequency & percentage), histogram (distribution of two features) by (Cut and Color), (Cut and Clarity), (Color and Clarity)



**Observation: cut, clarity and color are all ordinal categorical variables.**
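
Because these features are ordinal, their order can be declared explicitly in pandas. The sketch below is one possible way to do it with an ordered `CategoricalDtype`; the `cut_order` list follows the quality ranking used in this notebook, and `toy_cut` is a made-up Series.

```python
import pandas as pd

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]   # low -> high quality
cut_dtype = pd.CategoricalDtype(categories=cut_order, ordered=True)

# Hypothetical values for illustration.
toy_cut = pd.Series(["Ideal", "Fair", "Premium", "Good"]).astype(cut_dtype)

sorted_cuts = toy_cut.sort_values().tolist()      # sorted by quality, not alphabetically
at_least_premium = (toy_cut >= "Premium").tolist()  # comparisons respect the declared order
ranks = toy_cut.cat.codes.tolist()                # integer rank of each value

print(sorted_cuts, at_least_premium, ranks)
```

The integer codes are one simple ordinal encoding that could later be fed to a regression model.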

------------------------------------------------------------------------------------------------------------------------------------

## 01. UNIVARIATE ANALYSIS

In [None]:
# Let's observe the unique values and their respective counts in each categorical feature:
data[categorical_col]['cut'].value_counts()

In [None]:
plt.figure(figsize=(10,6))    # Sets the figure size

ax1=sns.countplot(x='cut',data=data[categorical_col],order=["Ideal","Premium","Very Good","Good","Fair"],color='green')

for p in ax1.patches:     # Loop over the bars to write the count above each one; ax1.patches is the list of bar rectangles drawn on the axes.
    ax1.annotate(f'{p.get_height()}',(p.get_x() + p.get_width() / 2., p.get_height()),ha='center', va='center', xytext=(0,8), textcoords='offset points')

In [None]:
data[categorical_col]['color'].value_counts()

In [None]:
plt.figure(figsize=(10,6))    # Sets the figure size

ax2=sns.countplot(x='color',data=data[categorical_col],order=["D","E","F","G","H","I","J"],color='Orange')

for p in ax2.patches:     # Loop over the bars to write the count above each one; ax2.patches is the list of bar rectangles drawn on the axes.
    ax2.annotate(f'{p.get_height()}',(p.get_x() + p.get_width() / 2., p.get_height()),ha='center', va='center', xytext=(0,8), textcoords='offset points')

In [None]:
data[categorical_col]['clarity'].value_counts()

In [None]:
plt.figure(figsize=(15,6))    # Sets the figure size

ax3=sns.countplot(x='clarity',data=data[categorical_col],order=['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1'],color='Purple')   #This plots the bars 

for p in ax3.patches:     # Loop over the bars to write the count above each one; ax3.patches is the list of bar rectangles drawn on the axes.
    ax3.annotate(f'{p.get_height()}',(p.get_x() + p.get_width() / 2., p.get_height()),ha='center', va='center', xytext=(0,8), textcoords='offset points')

# xytext=(0,8) with textcoords='offset points' : places the label 8 points above the annotated point (the top of the bar).
# ha and va : horizontal and vertical alignment of the label.
# (p.get_x() + p.get_width() / 2., p.get_height()) : the annotated point, i.e. the top-centre of the bar.


In [None]:
# Summary: the three count plots side by side, using seaborn with subplots

plt.rcParams['figure.figsize'] = (20, 6)

# Create subplots with 1 row and 3 columns
fig, axes = plt.subplots(1, 3)

# Set custom colors
colors = ['gold', 'green', 'blue']

# Countplot for 'cut'
ax1 = sns.countplot(x='cut', data=data[categorical_col], order=["Ideal", "Premium", "Very Good", "Good", "Fair"], color=colors[0], ax=axes[0])
for p in ax1.patches:
    ax1.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 8), textcoords='offset points')

# Countplot for 'color'
ax2 = sns.countplot(x='color', data=data[categorical_col], order=["D", "E", "F", "G", "H", "I", "J"], color=colors[1], ax=axes[1])
for p in ax2.patches:
    ax2.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 8), textcoords='offset points')

# Countplot for 'clarity'
ax3 = sns.countplot(x='clarity', data=data[categorical_col], order=['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'], color=colors[2], ax=axes[2])
for p in ax3.patches:
    ax3.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 8), textcoords='offset points')

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()


In [None]:
# Doing the same using Plotly

import plotly.graph_objects as go
import plotly.express as px

# Single figure: the three bar traces share one set of axes (no subplots are created here)
fig = go.Figure()

# Custom colors
colors = ['gold', 'green', 'blue']

# Countplot for 'cut'
fig.add_trace(go.Bar(
    x=data[categorical_col]['cut'].value_counts().index,
    y=data[categorical_col]['cut'].value_counts().values,
    marker_color=colors[0],
    text=data[categorical_col]['cut'].value_counts().values,
    textposition='auto',
    name='Cut',
))

# Countplot for 'color'
fig.add_trace(go.Bar(
    x=data[categorical_col]['color'].value_counts().index,
    y=data[categorical_col]['color'].value_counts().values,
    marker_color=colors[1],
    text=data[categorical_col]['color'].value_counts().values,
    textposition='auto',
    name='Color',
))

# Countplot for 'clarity'
fig.add_trace(go.Bar(
    x=data[categorical_col]['clarity'].value_counts().index,
    y=data[categorical_col]['clarity'].value_counts().values,
    marker_color=colors[2],
    text=data[categorical_col]['clarity'].value_counts().values,
    textposition='auto',
    name='Clarity',
))

# Update layout
fig.update_layout(
    barmode='stack',
    bargap=0.15,
    title='Distribution of Cut, Color, and Clarity of Diamonds in Descending Order',
    showlegend=True,
)

# Show the plot
fig.show()


----------------------------------------------------------------------------------------------------------------
Let's observe the distributions in percentage form:

`value_counts(normalize=True)*100` gives the percentage share of each unique value.

In [None]:
# value_counts(normalize=True) gives the share of each category within a categorical feature:
data[categorical_col]['cut'].value_counts(normalize=True)*100

In [None]:
# Round to 2 decimal places and convert to a DataFrame.
# to_frame(name=...) fixes the column name, which otherwise depends on the pandas version ('cut' vs 'proportion').

cut_distribution_percentage=round(data[categorical_col]['cut'].value_counts(normalize=True)*100,2).to_frame(name='cut')
color_distribution_percentage=round(data[categorical_col]['color'].value_counts(normalize=True)*100,2).to_frame(name='color')
clarity_distribution_percentage=round(data[categorical_col]['clarity'].value_counts(normalize=True)*100,2).to_frame(name='clarity')

In [None]:
print("Distribution of the categories of the Cut feature in percentage:")
cut_distribution_percentage

In [None]:
print("Distribution of the categories of the Color feature in percentage:")
color_distribution_percentage

In [None]:
print("Distribution of the categories of the Clarity feature in percentage:")
clarity_distribution_percentage


![colors.jpg](attachment:colors.jpg)

![diamond_cuts.jpg](attachment:diamond_cuts.jpg)


In [None]:
# Customize order: Top to Bottom Quality Ranking

order1 = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
print("Cut Ranking of Diamond from High to Low quality!!!")
cut_sorted_df=pd.DataFrame(cut_distribution_percentage,index=order1)
cut_sorted_df
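
The reordering step can also be sketched with `reindex`, shown here on a made-up `toy_cut` Series (hypothetical values). `value_counts` sorts by frequency, so the quality order has to be imposed afterwards; `fill_value=0.0` covers grades that do not occur in the sample.

```python
import pandas as pd

# Hypothetical cut grades for illustration.
toy_cut = pd.Series(["Ideal", "Ideal", "Good", "Fair", "Ideal", "Good", "Premium", "Ideal"])

share = (toy_cut.value_counts(normalize=True) * 100).round(2)     # percentages, most frequent first
quality_order = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
share_by_quality = share.reindex(quality_order, fill_value=0.0)   # quality order; absent grades become 0

print(share_by_quality)
```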


In [None]:
plt.figure(figsize=(10,5))
sns.set_palette("Paired") 

plt.pie(cut_sorted_df['cut'], labels=cut_sorted_df.index, autopct='%0.2f%%',startangle=180)

plt.title('Pie chart : Distribution of Diamond Cuts')
plt.show()

In [None]:
plt.figure(figsize=(12,5))
sns.set_palette("Set2") 

plt.pie(cut_sorted_df['cut'], labels=cut_sorted_df.index, autopct='%0.2f%%',startangle=180)


# Draw the donut by placing a white circle at the centre of the pie.
centre_circle = plt.Circle((0, 0),0.7,fc='white')   # (0, 0): centre of the inner circle; 0.7: its radius, i.e. 70% of the pie's radius; fc='white': fill colour
fig = plt.gcf()   # gcf(): get current figure
fig.gca().add_artist(centre_circle)

# Set plot title and show the plot
plt.title('Distribution of Diamond Cuts')
plt.show()


In [None]:
order2=['D','E','F','G','H','I','J']
print("Color Ranking  of Diamond from High to Low quality!!!")
color_sorted_df=pd.DataFrame(color_distribution_percentage,index=order2)
color_sorted_df


In [None]:
plt.figure(figsize=(12,5))
sns.set_palette("Set2") 

plt.pie(color_sorted_df['color'], labels=color_sorted_df.index, autopct='%0.2f%%',startangle=180)
centre_circle = plt.Circle((0, 0),0.7,fc='white')  
fig = plt.gcf()   
fig.gca().add_artist(centre_circle)

# Set plot title and show the plot
plt.title('Distribution of Diamond Color')
plt.show()

In [None]:
order3=['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1']
print("Clarity Ranking of Diamond from High to Low quality!!!")
clarity_sorted_df=pd.DataFrame(clarity_distribution_percentage,index=order3)
clarity_sorted_df


In [None]:
plt.figure(figsize=(12,7))
sns.set_palette("Set2") 

plt.pie(clarity_sorted_df['clarity'], labels=clarity_sorted_df.index, autopct='%0.2f%%',startangle=180)
centre_circle = plt.Circle((0, 0),0.7,fc='white')  
fig = plt.gcf()   
fig.gca().add_artist(centre_circle)

# Set plot title and show the plot
plt.title('Distribution of Diamond Clarity')
plt.show()

In [None]:
# Observation Summary:

# Set the figure size
plt.figure(figsize=(18, 6))

# Set the color palette
sns.set_palette("Set2")

# Create subplots with 1 row and 3 columns
plt.subplot(1, 3, 1)

# Plot for 'cut'
plt.pie(cut_sorted_df['cut'], labels=cut_sorted_df.index, autopct='%0.2f%%', startangle=180)
centre_circle = plt.Circle((0, 0), 0.7, fc='white')  
fig = plt.gcf()   
fig.gca().add_artist(centre_circle)
plt.title('Distribution of Diamond Cuts')

plt.subplot(1, 3, 2)

# Plot for 'color'
plt.pie(color_sorted_df['color'], labels=color_sorted_df.index, autopct='%0.2f%%', startangle=180)
centre_circle = plt.Circle((0, 0), 0.7, fc='white')  
fig = plt.gcf()   
fig.gca().add_artist(centre_circle)
plt.title('Distribution of Diamond Color')

plt.subplot(1, 3, 3)

# Plot for 'clarity'
plt.pie(clarity_sorted_df['clarity'], labels=clarity_sorted_df.index, autopct='%0.2f%%', startangle=180)
centre_circle = plt.Circle((0, 0), 0.7, fc='white')  
fig = plt.gcf()   
fig.gca().add_artist(centre_circle)
plt.title('Distribution of Diamond Clarity')

# Adjust layout and show the plot
plt.tight_layout()
plt.show()


In [None]:
# The same as above, but using Plotly.

import plotly.graph_objects as go
import plotly.express as px   # needed for px.colors below

# Data preparation
cut_labels = cut_sorted_df.index
cut_values = cut_sorted_df['cut']

color_labels = color_sorted_df.index
color_values = color_sorted_df['color']

clarity_labels = clarity_sorted_df.index
clarity_values = clarity_sorted_df['clarity']

# Single figure: the three pies are placed side by side through their `domain` x-ranges
fig = go.Figure()

# Plot for 'cut'
fig.add_trace(go.Pie(
    labels=cut_labels,
    values=cut_values,
    hole=0.4,
    marker=dict(colors=px.colors.qualitative.Set2),
    textinfo='percent+label',
    insidetextorientation='radial',
    domain=dict(x=[0, 0.33])
))

# Plot for 'color'
fig.add_trace(go.Pie(
    labels=color_labels,
    values=color_values,
    hole=0.4,
    marker=dict(colors=px.colors.qualitative.Set2),
    textinfo='percent+label',
    insidetextorientation='radial',
    domain=dict(x=[0.33, 0.66])
))

# Plot for 'clarity'
fig.add_trace(go.Pie(
    labels=clarity_labels,
    values=clarity_values,
    hole=0.4,
    marker=dict(colors=px.colors.qualitative.Set2),
    textinfo='percent+label',
    insidetextorientation='radial',
    domain=dict(x=[0.66, 1])
))

# Update layout
fig.update_layout(
    title='Distribution of Diamond Cut, Color, and Clarity',
    showlegend=False,
)

# Show the plot
fig.show()


In [None]:
# In Short In one frame:


plt.rcParams['figure.figsize'] = (23, 5)

fig, axes = plt.subplots(1, 3)

# Pie chart
axes[0].pie(data[categorical_col]['cut'].value_counts().values,
            labels=data[categorical_col]['cut'].value_counts().index,
            startangle=90,
            colors=['gold', 'lightgreen', 'red', 'lightblue', 'pink'],
            explode=[0.05, 0.05, 0.05, 0.05, 0.2],
            shadow=True, autopct='%1.2f%%')

# Countplot
sns.countplot(x=data[categorical_col]['color'], palette='ocean', order=data[categorical_col]['color'].value_counts().index, ax=axes[1])

# Bar chart
data[categorical_col]["clarity"].value_counts().plot.bar(ax=axes[2])
axes[2].set_xlabel('clarity types')
axes[2].set_ylabel('count')

plt.suptitle('Univariate : Distribution of Cut, Color, and Clarity of Gems')
plt.show()



In [None]:
# The same as above but using Plotly.

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px



# Counts and labels for each feature
values1 = data[categorical_col]['cut'].value_counts().values
labels1 = data[categorical_col]['cut'].value_counts().index

values2 = data[categorical_col]['color'].value_counts().values
labels2 = data[categorical_col]['color'].value_counts().index

values3 = data[categorical_col]['clarity'].value_counts().values
labels3 = data[categorical_col]['clarity'].value_counts().index

# Create subplots
fig = make_subplots(rows=1, cols=3, specs=[[{'type': 'pie'}, {'type': 'pie'}, {'type': 'bar'}]],
                    subplot_titles=['Cut Distribution', 'Color Distribution', 'Clarity Distribution'])

# Pie chart 1
fig.add_trace(go.Pie(
    labels=labels1,
    values=values1,
    hole=0.5,
    pull=[0.05, 0.05, 0.05, 0.05, 0.2],
    marker=dict(colors=['gold', 'lightgreen', 'red', 'lightblue', 'pink']),
    textinfo='percent+label',
    insidetextorientation='radial',
), row=1, col=1)

# Pie chart 2
fig.add_trace(go.Pie(
    labels=labels2,
    values=values2,
    hole=0.5,
    pull=[0.05, 0.05, 0.05, 0.05, 0.2],
    marker=dict(colors=px.colors.qualitative.Set2),
    textinfo='percent+label',
    insidetextorientation='radial',
), row=1, col=2)

# Bar chart
fig.add_trace(go.Bar(
    x=labels3,
    y=values3,
    marker=dict(color=px.colors.qualitative.Set3),
), row=1, col=3)

# Update layout
fig.update_layout(
    barmode='stack',
    bargap=0.15,
    title='Univariate : Distribution of Cut, Color, and Clarity of Gems',
    showlegend=False,
)

# Show the plot
fig.show()


In [None]:
# Note : For categorical variables like 'cut', 'color', and 'clarity', count plots or bar plots are more appropriate than KDE (Kernel Density Estimation) plots or distribution plots.

------------------------------------------------------------------------------------------------------
## 2. BIVARIATE

Bivariate: Plotting the count for two features together. Note: The same could be achieved using a barplot or a histogram with frequency (count) on the y-axis.

In [None]:
data[categorical_col].head()

### 2.1  BIVARIATE : COUNTPLOT 

**`(2.1.1) : Countplot:Cut Vs Color `**

In [None]:
# Cut vs Color


plt.figure(figsize=(20,15))    # Sets the figure size

b1=sns.countplot(x='cut',hue='color',data=data[categorical_col])

for p in b1.patches:     
    b1.annotate(f'{p.get_height()}',(p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0,20), textcoords='offset points',rotation=90)
    
plt.title("Count of Diamonds by Cut and Color")

**`(2.1.2) : Countplot: Clarity Vs Color `**

In [None]:

plt.figure(figsize=(20,15))    # Sets the figure size

b1=sns.countplot(x='clarity',hue='color',data=data[categorical_col])


for p in b1.patches:     
    b1.annotate(f'{p.get_height()}',(p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0,20), textcoords='offset points',rotation=90)
    


plt.title("Count of Diamonds by Clarity and Color")

**`(2.1.3) : Countplot: Cut Vs Clarity `**

In [None]:
# Cut Vs clarity

plt.figure(figsize=(20,15))    # Sets the figure size

b1=sns.countplot(x='cut',hue='clarity',data=data[categorical_col])

for p in b1.patches:     
    b1.annotate(f'{p.get_height()}',(p.get_x() + p.get_width()/2., p.get_height()),
               ha='center', va='center', xytext=(0,20), textcoords='offset points',rotation=90)
    

plt.title("Count of Diamonds by Cut and Clarity")

In [None]:
# Countplot, barplot and histogram behave the same here when plotting categories vs frequency (count).

-------------------------------------------------------------------------------------

#### Frequency Table | Percentage Table | Studying Distribution of two categorical features (using heatmap and histogram)

**`(2.2.1) : Frequency table and percentage table , heatmap (freq & percentage) , histogram (distribution of two features) by Cut and Color`**

In [None]:
# Frequency table : Cut Vs Color 
count_table = pd.crosstab(data[categorical_col]["cut"], data[categorical_col]["color"])
print("Count Table:")
count_table

In [None]:
# Percentage Table : Cut Vs Color
percentage_table = count_table.div(count_table.sum(axis=1), axis=0) * 100
print("\nPercentage Table:")
percentage_table.round(2)
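
The row-wise percentage step can be checked on a small example. The sketch below uses a made-up `toy` frame (hypothetical) and shows that dividing each row by its own total gives the same result as `pd.crosstab(..., normalize="index")`, so every row sums to 100.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Ideal", "Good", "Good", "Fair"],
    "color": ["D",     "E",     "E",     "D",    "D",    "E"],
})

counts = pd.crosstab(toy["cut"], toy["color"])
row_pct_manual = counts.div(counts.sum(axis=1), axis=0) * 100     # divide each row by its own total
row_pct_builtin = pd.crosstab(toy["cut"], toy["color"], normalize="index") * 100

print(row_pct_manual.round(2))
```

Reading the table row by row therefore answers "within this cut, how are the colors distributed?", not "what share of all diamonds falls in this cell?".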


In [None]:
# Create a figure and three subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25, 8))


# Plot 1: Heatmap with Count
sns.heatmap(count_table, annot=True, fmt='d', cmap="YlGnBu", cbar=False, ax=axes[0])   # fmt='d' => annotations are shown as plain integers (e.g. 2000) rather than in scientific notation (2e+03)
axes[0].set_title("Frequency Distribution of Cut & Color (Count)")
axes[0].set_xlabel("Color")
axes[0].set_ylabel("Cut")

# Plot 2: Stacked Histogram (counts)
sns.histplot(data, x='cut', hue='color', multiple="stack", ax=axes[1], palette="pastel", edgecolor='black')
axes[1].set_title("Stacked Histogram: Cut & Color")
axes[1].set_xlabel("Cut")
axes[1].set_ylabel("Count")

# Plot 3: Heatmap with Percentage
percentage_table = count_table.div(count_table.sum(axis=1), axis=0) * 100
sns.heatmap(percentage_table, annot=True, fmt=".2f", cbar=False, ax=axes[2])
axes[2].set_title("Percentage Distribution of Cut & Color (in %)")
axes[2].set_xlabel("Color")
axes[2].set_ylabel("Cut")



# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()


In [None]:
#CONCLUSION:

# Convert the crosstab to a Pandas Series
count_series = count_table.stack()  


# stack() turns the crosstab into a Series indexed by (cut, color) pairs, so that:
# count_series.idxmax()       : returns the (cut, color) pair with the largest count
# count_series.loc[max_index] : returns that largest count

# Find the index (cut, color) with the maximum value
max_index = count_series.idxmax()
min_index = count_series.idxmin()

# Get the maximum and minimum values
max_value = count_series.loc[max_index]
min_value = count_series.loc[min_index]

# Calculate percentages
total_count = count_series.sum()
max_percentage = (max_value / total_count) * 100
min_percentage = (min_value / total_count) * 100

print(f"The maximum count is {max_value} ie. {max_percentage:.2f}% for the combination {max_index}.")
print(f"The minimum count is {min_value} ie. {min_percentage:.2f}% for the combination {min_index}.")
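
The stack-then-`idxmax` pattern used above can be traced on a tiny table. The sketch uses a made-up `toy_counts` crosstab (hypothetical numbers) so that each intermediate result is easy to verify by eye.

```python
import pandas as pd

# Hypothetical count table: rows are cuts, columns are colors.
toy_counts = pd.DataFrame(
    [[5, 2], [1, 7], [3, 4]],
    index=pd.Index(["Fair", "Good", "Ideal"], name="cut"),
    columns=pd.Index(["D", "E"], name="color"),
)

stacked = toy_counts.stack()        # Series indexed by (cut, color) pairs
largest_pair = stacked.idxmax()     # label of the largest cell
smallest_pair = stacked.idxmin()    # label of the smallest cell
share_of_total = stacked.loc[largest_pair] / stacked.sum() * 100

print(largest_pair, smallest_pair, round(share_of_total, 2))
```

Note that the percentage here is relative to the grand total, unlike the row-wise percentage table.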


**`(2.2.2) : Frequency table and percentage table , heatmap (freq & percentage) , histogram (distribution of two features) by Color and Clarity`**

In [None]:
# Frequency table : Color Vs Clarity
count_table = pd.crosstab(data[categorical_col]["color"], data[categorical_col]["clarity"])
print("Count Table:")
count_table

In [None]:
# Percentage Table : Color Vs Clarity
percentage_table = count_table.div(count_table.sum(axis=1), axis=0) * 100
print("\nPercentage Table:")
percentage_table.round(2)

In [None]:
# Create a figure and three subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25, 8))


# Plot 1: Heatmap with Count
sns.heatmap(count_table, annot=True, fmt='d', cmap="YlGnBu", cbar=False, ax=axes[0])   
axes[0].set_title("Frequency Distribution of Color & Clarity (Count)")
axes[0].set_xlabel("Clarity")   # crosstab columns (clarity) run along the x-axis
axes[0].set_ylabel("Color")     # crosstab rows (color) run along the y-axis

# Plot 2: Stacked Histogram (counts)
sns.histplot(data, x='color', hue='clarity', multiple="stack", ax=axes[1], palette="pastel", edgecolor='black')
axes[1].set_title("Stacked Histogram: Color & Clarity")
axes[1].set_xlabel("Color")
axes[1].set_ylabel("Count")

# Plot 3: Heatmap with Percentage
percentage_table = count_table.div(count_table.sum(axis=1), axis=0) * 100
sns.heatmap(percentage_table, annot=True, fmt=".2f", cbar=False, ax=axes[2])
axes[2].set_title("Percentage Distribution of Color & Clarity (in %)")
axes[2].set_xlabel("Clarity")
axes[2].set_ylabel("Color")



# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()


In [None]:
#CONCLUSION:

# Convert the crosstab to a Pandas Series
count_series = count_table.stack()  


# stack() turns the crosstab into a Series indexed by (color, clarity) pairs, so that:
# count_series.idxmax()       : returns the (color, clarity) pair with the largest count
# count_series.loc[max_index] : returns that largest count

# Find the index (color, clarity) with the maximum value
max_index = count_series.idxmax()
min_index = count_series.idxmin()

# Get the maximum and minimum values
max_value = count_series.loc[max_index]
min_value = count_series.loc[min_index]

# Calculate percentages
total_count = count_series.sum()
max_percentage = (max_value / total_count) * 100
min_percentage = (min_value / total_count) * 100

print(f"The maximum count is {max_value} ie. {max_percentage:.2f}% for the combination {max_index}.")
print(f"The minimum count is {min_value} ie. {min_percentage:.2f}% for the combination {min_index}.")



**`(2.2.3) : Frequency table and percentage table , heatmap (freq & percentage) , histogram (distribution of two features) by Cut and Clarity`**

In [None]:
# Frequency table : Cut Vs Clarity
count_table = pd.crosstab(data[categorical_col]["cut"], data[categorical_col]["clarity"])
print("Count Table:")
count_table

In [None]:
# Percentage Table : Cut Vs Clarity
percentage_table = count_table.div(count_table.sum(axis=1), axis=0) * 100
print("\nPercentage Table:")
percentage_table.round(2)

In [None]:
# Create a figure and three subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25, 8))


# Plot 1: Heatmap with Count
sns.heatmap(count_table, annot=True, fmt='d', cmap="YlGnBu", cbar=False, ax=axes[0])   
axes[0].set_title("Frequency Distribution of Cut & Clarity (Count)")
axes[0].set_xlabel("Cut")
axes[0].set_ylabel("Clarity")

# Plot 2: Stacked Histogram with Percentage Values
sns.histplot(data, x='cut', hue='clarity', multiple="stack", ax=axes[1], palette="pastel", edgecolor='black')
axes[1].set_title("Stacked Histogram: Cut & Clarity")
axes[1].set_xlabel("Cut")
axes[1].set_ylabel("Clarity")

# Plot 3: Heatmap with Percentage
percentage_table = count_table.div(count_table.sum(axis=1), axis=0) * 100
sns.heatmap(percentage_table, annot=True, fmt=".2f", cbar=False, ax=axes[2])
axes[2].set_title("Percentage Distribution of Cut & Clarity (in %)")
axes[2].set_xlabel("Clarity")   # crosstab columns are clarity levels
axes[2].set_ylabel("Cut")       # crosstab rows are cut levels



# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()


In [None]:
#CONCLUSION:

# Convert the crosstab to a Pandas Series
count_series = count_table.stack()  


# stack() flattens the crosstab into a Series indexed by (cut, clarity) pairs,
# so idxmax()/idxmin() can return the pair with the largest/smallest count.
# count_series.idxmax()  : Fetch the index (cut, clarity) with the maximum value
# count_series.loc[max_index] : Get the maximum value

# Find the index (cut, clarity) with the maximum value
max_index = count_series.idxmax()
min_index = count_series.idxmin()

# Get the maximum and minimum values
max_value = count_series.loc[max_index]
min_value = count_series.loc[min_index]

# Calculate percentages
total_count = count_series.sum()
max_percentage = (max_value / total_count) * 100
min_percentage = (min_value / total_count) * 100

print(f"The maximum count is {max_value}, i.e. {max_percentage:.2f}%, for the combination {max_index}.")
print(f"The minimum count is {min_value}, i.e. {min_percentage:.2f}%, for the combination {min_index}.")
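
To make the `stack()` / `idxmax()` step concrete, here is a minimal sketch on a made-up 2x2 table (`toy_table` and its counts are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical toy counts: rows are `cut`, columns are `clarity`.
toy_table = pd.DataFrame(
    {"SI1": [4, 1], "VS2": [9, 2]},
    index=pd.Index(["Ideal", "Fair"], name="cut"),
)
toy_table.columns.name = "clarity"

# stack() turns the 2-D table into a Series keyed by (cut, clarity) pairs.
toy_series = toy_table.stack()

# idxmax()/idxmin() return the (row, column) label pair of the extreme cell.
toy_max_index = toy_series.idxmax()
toy_min_index = toy_series.idxmin()

print(toy_max_index, toy_series.loc[toy_max_index])
print(toy_min_index, toy_series.loc[toy_min_index])
```

If several cells tie for the extreme value, `idxmax()` and `idxmin()` report only the first one, so a tie is worth checking separately when it matters.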


----------------------------------------------------------------------------------------------------


# 3. Multivariate Analysis

Since `cut`, `color`, and `clarity` are categorical rather than continuous, correlation analysis is not appropriate for them. Instead, we can use the chi-squared test of independence, which assesses whether there is a statistically significant association between categorical variables.

In [None]:
# Use the `chi2_contingency` function from `scipy.stats` to perform the chi-squared test:

from scipy.stats import chi2_contingency

# `data` is the DataFrame loaded earlier; rows are clarity levels, columns are (cut, color) combinations
contingency_table = pd.crosstab(data[categorical_col]['clarity'],[data[categorical_col]['cut'],data[categorical_col]['color']])


# Note: swapping the order of `cut` and `color` in the column list changes the layout of the table
# (and of any plot built from it), but the counts themselves stay exactly the same.
# So either ordering can be used. (Already tried!)


In [None]:
contingency_table   # With this column order the table displays without issue. With the order swapped, i.e.
                    # pd.crosstab(data[categorical_col]['clarity'],[data[categorical_col]['color'],data[categorical_col]['cut']]),
                    # most of the values would be hidden, and the approach below is needed to make all entries visible.

We noticed that some information in the table is hidden, making it challenging for thorough analysis. To address this, we've adjusted the settings to display more rows and columns, ensuring a comprehensive view of the data.

In [None]:
# Set maximum number of displayed rows and columns
pd.set_option('display.max_rows',10)  # Adjust the number as needed
pd.set_option('display.max_columns', 22000)  # Adjust the number as needed
print("These numerical values are count or frequency...")

contingency_table
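
`pd.set_option` changes the display settings for the rest of the session. If the wider view is only needed for one table, `pd.option_context` is a possible alternative that restores the previous settings automatically. A minimal sketch, using a made-up wide frame (`wide` is hypothetical):

```python
import pandas as pd

# Hypothetical wide table: 3 rows by 40 columns, wide enough to be truncated
# under pandas' default display settings.
wide = pd.DataFrame([[i * j for j in range(40)] for i in range(3)])

default_max_columns = pd.get_option("display.max_columns")

# Inside the block no columns are hidden; None means "no limit".
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# After the block the previous setting is back in effect.
restored = pd.get_option("display.max_columns")
print("Restored display.max_columns:", restored)
```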

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(30,13))
sns.heatmap(contingency_table,cmap="YlGnBu", annot=True, cbar=False,fmt='d', annot_kws={'rotation': 90})   # fmt='d' prints the integer counts without a trailing ".0"

In [None]:
print("Maximum Frequency of table :",contingency_table.values.max())    # Darkest shade
print("Minimum Frequency of table :",contingency_table.values.min())    #Lightest shade

Conclusion:

* Diamonds with clarity 'VS2', cut 'Ideal', and color 'E' form the most frequent combination in the dataset.
* Diamonds with clarity 'IF', cut 'Fair', and color 'E', 'H', 'I', or 'J' have the minimum frequency, which is zero here: the dataset contains no diamonds with these combinations of features.
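
Rather than reading the empty combinations off the heatmap, they can be listed programmatically. The sketch below assumes a table built the same way as `contingency_table` (clarity in the rows, (cut, color) in the columns) but uses made-up data; `toy`, `toy_ct`, and `zero_cells` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data with the same three categorical columns.
toy = pd.DataFrame({
    "clarity": ["IF", "IF", "VS2", "VS2", "VS2"],
    "cut": ["Ideal", "Ideal", "Ideal", "Fair", "Fair"],
    "color": ["E", "E", "E", "E", "H"],
})

# Same layout as the notebook's table: clarity rows, (cut, color) columns.
toy_ct = pd.crosstab(toy["clarity"], [toy["cut"], toy["color"]])

# unstack() on a DataFrame yields a Series keyed by (cut, color, clarity).
cells = toy_ct.unstack()

# Keep only the cells whose count is zero.
zero_cells = cells[cells == 0]
print(zero_cells.index.tolist())
```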

---------------------------------------------------------------------------------------------------------------------------------------------------------------------

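The chi-square (χ²) test of independence is a statistical test used to determine whether there is a significant association between two categorical variables. Here the rows of the contingency table are `clarity` levels and the columns are (`cut`, `color`) combinations, so the test treats each (`cut`, `color`) pair as a single category.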


### Step 1: Set Up the Hypotheses

- **Null Hypothesis (H0):** The categorical features (`cut`, `clarity`, `color`) are independent; there is no association between them.
  
- **Alternative Hypothesis (H1):** The categorical features are not independent; there is a significant association between them.

### Step 2: Perform the Chi-Squared Test
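
For reference, this is the standard Pearson form of the statistic for an $r \times c$ table, where $O_{ij}$ are the observed counts, $E_{ij}$ the counts expected under independence, and $n$ the grand total. Note that `chi2_contingency` applies Yates' continuity correction by default only when the table has one degree of freedom (a 2x2 table), which is not the case here.

$$
E_{ij} = \frac{\left(\sum_{k=1}^{c} O_{ik}\right)\left(\sum_{k=1}^{r} O_{kj}\right)}{n},
\qquad
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
\text{dof} = (r-1)(c-1)
$$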

### Step 3: Interpret the Results

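- **Chi-squared value:** This measures how far the observed counts deviate from the counts expected under independence. Because it grows with sample size and with the size of the table, it is not by itself a measure of how strong the association is; an effect-size measure such as Cramér's V is better suited to comparing strength.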
  
- **P-value:** If the p-value is below our chosen significance level (commonly 0.05), we reject the null hypothesis. A low p-value suggests a significant association.

### Step 4: Draw Conclusions

- **Reject H0:** If the p-value is low, we can reject the null hypothesis and conclude that there is a significant association between at least some of the categorical features (`cut`, `clarity`, `color`).
- **Do Not Reject H0:** If the p-value is high, there isn't enough evidence to reject the null hypothesis. The data are then consistent with independence between the categorical features, although this does not prove independence.
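
The four steps can be traced on a small made-up table where the arithmetic is easy to follow by hand. This is only a sketch: `toy_observed` and its counts are hypothetical, and it also unpacks the degrees of freedom and expected counts that `chi2_contingency` returns alongside the statistic and p-value.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts (rows: two cuts, columns: three colors).
toy_observed = pd.DataFrame(
    [[30, 20, 10], [10, 20, 30]],
    index=["Ideal", "Fair"],
    columns=["E", "H", "J"],
)

toy_chi2, toy_p, toy_dof, toy_expected = chi2_contingency(toy_observed)

# Expected counts under independence: row total * column total / grand total.
manual_expected = np.outer(
    toy_observed.sum(axis=1), toy_observed.sum(axis=0)
) / toy_observed.values.sum()

# Pearson statistic recomputed by hand from observed and expected counts.
manual_chi2 = ((toy_observed.values - manual_expected) ** 2 / manual_expected).sum()

print(f"chi2={toy_chi2:.3f}, dof={toy_dof}, p={toy_p:.4g}")
print("Reject H0 at alpha=0.05:", toy_p < 0.05)
```

Every expected count in this table is 20, so the statistic works out to 400 / 20 = 20 with (2 - 1)(3 - 1) = 2 degrees of freedom.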

In [None]:
#Step1:
Null_Hypothesis="The categorical features (`cut`, `clarity`, `color`) are independent and there is no association between them."
Alternative_Hypothesis="The categorical features (`cut`, `clarity`, `color`) are not independent; there is association between them."

# Step2:
# Perform chi-square test
chi2_stat, p_val, _, _ = chi2_contingency(contingency_table)

# Output results
print(f"Chi-square statistic: {chi2_stat}")
print(f"P-value: {p_val}")

#Step3:
# Interpret results
alpha = 0.05  # significance level
print(f"Significance level: {alpha}")
print(f"Is the p-value less than alpha? {'Yes' if p_val < alpha else 'No'}")

#Step4:
# Conclusion
if p_val < alpha:
    print("REJECT the null hypothesis which stated that:",Null_Hypothesis)
    print("-----------------------------------------------------------------------------------------------------------------------------------------------------------")
    
    print("CONCLUSION: The categorical features (`cut`, `clarity`, `color`) are dependent; there is association between them.")
else:
    print("FAIL TO REJECT the null hypothesis which stated that:",Null_Hypothesis)
    print("-----------------------------------------------------------------------------------------------------------------------------------------------------------")    
    print("CONCLUSION: There is not enough evidence of an association between the categorical features (`cut`, `clarity`, `color`).")

In [None]:

# Define the pairs of categorical variables
variable_pairs = [('cut', 'clarity'), ('cut', 'color'), ('clarity', 'color')]

# Loop through each pair
for variable1, variable2 in variable_pairs:
    # Create a contingency table
    contingency_table = pd.crosstab(data[categorical_col][variable1], data[categorical_col][variable2])

    # Perform the chi-square test
    chi2, p, _, _ = chi2_contingency(contingency_table)

    # Output the results
    print(f"Chi-square test for {variable1} and {variable2}:")
    print(f"  Chi2 Statistic: {chi2}")
    print(f"  P-value: {p}")
    
    # Check for significance
    alpha = 0.05
    if p < alpha:
        print("  Result: There is a significant association.")
    else:
        print("  Result: There is no significant association.")
    
    print("\n")



In [None]:

variable_pairs = [('cut', 'clarity'), ('cut', 'color'), ('clarity', 'color')]

# Loop through each pair
for variable1, variable2 in variable_pairs:
    # Create a contingency table
    contingency_table = pd.crosstab(data[categorical_col][variable1], data[categorical_col][variable2])

    # Perform the chi-square test
    chi2, p, _, _ = chi2_contingency(contingency_table)

    # Output the results with p-value rounded to 2 decimal points
    print(f"Chi-square test for {variable1} and {variable2}:")
    print(f"  Chi2 Statistic: {chi2}")
    print(f"  P-value: {p:.2f}")  # Rounded to 2 decimal places; very small p-values display as 0.00
    
    # Check for significance
    alpha = 0.05
    if p < alpha:
        print("  Result: There is a significant association.")
    else:
        print("  Result: There is no significant association.")
    
    print("\n")


![download.png](attachment:download.png)

In [None]:
# To see which pair has the strongest association, we use Cramér's V


# Define the pairs of categorical variables
variable_pairs = [('cut', 'clarity'), ('cut', 'color'), ('clarity', 'color')]

# Loop through each pair
for variable1, variable2 in variable_pairs:
    # Create a contingency table
    contingency_table = pd.crosstab(data[variable1], data[variable2])

    # Perform the chi-square test
    chi2, _, _, _ = chi2_contingency(contingency_table)

    # Calculate Cramér's V, which ranges from 0 (no association) to 1 (perfect association)
    num_rows = contingency_table.shape[0]
    num_cols = contingency_table.shape[1]
    n = contingency_table.values.sum()   # total number of observations counted in the table
    cramers_v = np.sqrt(chi2 / (n * min(num_rows - 1, num_cols - 1)))

    # Output the results
    print(f"Cramér's V for {variable1} and {variable2}: {cramers_v:.2f}")

    # Interpret the strength of association
    if cramers_v < 0.1:
        print("  Strength of association: Weak")
    elif 0.1 <= cramers_v < 0.3:
        print("  Strength of association: Moderate")
    elif 0.3 <= cramers_v < 0.5:
        print("  Strength of association: Strong")
    else:
        print("  Strength of association: Very Strong")

    print("\n")
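
If Cramér's V is needed in more than one place, the calculation can be wrapped in a small helper. The function below is a sketch, not part of the original notebook: `cramers_v_from_table` is a hypothetical name, it applies no bias correction, and it turns off Yates' continuity correction so that 2x2 tables are handled consistently with larger ones.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_from_table(table: pd.DataFrame) -> float:
    """Cramér's V for a two-way table of counts (no bias correction)."""
    # correction=False: skip Yates' correction, which only affects 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()      # total observations in the table
    k = min(table.shape) - 1        # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical toy tables: perfectly associated vs. perfectly independent.
perfect = pd.DataFrame([[50, 0], [0, 50]])
independent = pd.DataFrame([[25, 25], [25, 25]])

print(f"Perfect association: {cramers_v_from_table(perfect):.2f}")
print(f"Independence:        {cramers_v_from_table(independent):.2f}")
```

With the gemstone data it could be called as `cramers_v_from_table(pd.crosstab(data['cut'], data['clarity']))`, assuming `data` is loaded as above.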
