### INTRODUCTION TO THE DATASET


 
**The dataset**: The goal is to predict the `price` of a given diamond/gemstone (regression analysis).

**There are 10 independent variables (including `id`)**
* `id` : unique identifier of each diamond
* `carat` : Carat (ct.) is the unit of weight used for gemstones and diamonds.
* `cut` : Quality of Diamond Cut
* `color` : Color of Diamond
* `clarity` : Diamond clarity is a measure of the purity and rarity of the stone, graded by how visible its inclusions and blemishes are under 10-power magnification.
* `depth` : The depth of a diamond is its height measured from the culet (bottom tip) to the table (flat, top surface). In this dataset it is recorded as a percentage of the average diameter, not in millimeters.
* `table` : A diamond's table is the facet which can be seen when the stone is viewed face up.
* `x` : Diamond X dimension (mm)
* `y` : Diamond Y dimension (mm)
* `z` : Diamond Z dimension (mm)
  
**Target variable:**
* `price`: Price of the given Diamond.
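
The split between independent variables and the target can be sketched as below. This is a minimal illustration on two rows copied from the `data.head()` output of this notebook; the names `sample`, `TARGET`, `X` and `y_target` are assumptions made for the sketch, not part of the original analysis.

```python
import pandas as pd

# Two rows copied from the data.head() output shown in this notebook.
sample = pd.DataFrame({
    "id": [0, 1],
    "carat": [1.52, 2.03],
    "cut": ["Premium", "Very Good"],
    "color": ["F", "J"],
    "clarity": ["VS2", "SI2"],
    "depth": [62.2, 62.0],
    "table": [58.0, 58.0],
    "x": [7.27, 8.06],
    "y": [7.33, 8.12],
    "z": [4.55, 5.05],
    "price": [13619, 13387],
})

TARGET = "price"                      # hypothetical constant for this sketch

X = sample.drop(columns=[TARGET])     # the 10 independent variables (including id)
y_target = sample[TARGET]             # the target variable

print(X.shape, y_target.shape)
```

Keeping the target name in one place makes it harder to leak `price` into the feature matrix by accident.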

Dataset Source Link :
[Link of gemstone.csv dataset](https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv)

----------------------------------------------------------------------------------------------------------------------
   

### Step 1: IMPORT THE REQUIRED LIBRARIES AND INGEST THE DATA

In [1]:
import numpy as np
import pandas as pd

In [2]:
data=pd.read_csv(r"C:\Users\Admin\Desktop\ineuron\_FSDSM_SelfLearned_RashmiKumari\MACHINE_LEARNING\01.End_to_End_setup_and_MLproject\01.Setup_for_ML_DL_projects\notebooks\data\gemstone.csv")   # raw string (r"..."): single backslashes are enough
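
An absolute path ties the notebook to one machine. A possible alternative, sketched under the assumption that the CSV sits in a `data` folder below some project directory, is to build the path with `pathlib`. The temporary directory and the two-row file below are stand-ins so that the sketch runs anywhere; they are not the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumption: a temporary directory stands in for the project folder.
project_dir = Path(tempfile.mkdtemp())
data_dir = project_dir / "data"
data_dir.mkdir()

# Write two rows (taken from the head() output) so the example is self-contained.
csv_path = data_dir / "gemstone.csv"
csv_path.write_text(
    "id,carat,cut,color,clarity,depth,table,x,y,z,price\n"
    "0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619\n"
    "1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387\n"
)

# Path objects avoid hand-written backslashes and work on any operating system.
demo = pd.read_csv(csv_path)
print(demo.shape)
```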

In [3]:
data.head()    # head(): returns the first 5 rows of the dataset by default.

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,2,0.7,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.5,2772
3,3,0.32,Ideal,G,VS1,61.6,56.0,4.38,4.41,2.71,666
4,4,1.7,Premium,G,VS2,62.6,59.0,7.65,7.61,4.77,14453


In [4]:
data.tail()    # tail(): returns the last 5 rows of the dataset by default.

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
193568,193568,0.31,Ideal,D,VVS2,61.1,56.0,4.35,4.39,2.67,1130
193569,193569,0.7,Premium,G,VVS2,60.3,58.0,5.75,5.77,3.47,2874
193570,193570,0.73,Very Good,F,SI1,63.1,57.0,5.72,5.75,3.62,3036
193571,193571,0.34,Very Good,D,SI1,62.9,55.0,4.45,4.49,2.81,681
193572,193572,0.71,Good,E,SI2,60.8,64.0,5.73,5.71,3.48,2258


In [5]:
data.columns    # columns: gives the names of the columns. It is an attribute, not a method, which is why there is no () after it.

Index(['id', 'carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y',
       'z', 'price'],
      dtype='object')

In [6]:
data.shape      # shape: returns the shape of the dataset as (number of rows, number of columns).
                #        shape is also an attribute, not a method/function.

(193573, 11)

In [7]:
print("Number of rows/records in the dataset:",data.shape[0])
print("Number of columns in the dataset:",data.shape[1])

Number of rows/records in the dataset: 193573
Number of columns in the dataset: 11


In [8]:
data.sample(20)   # sample(): returns a random sample of rows from the dataset.
                  #           The argument is the number of rows we want in the random sample.

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
100962,100962,0.3,Premium,G,VS2,61.3,60.0,4.33,4.29,2.64,524
28165,28165,0.72,Ideal,D,SI2,61.1,56.0,5.77,5.79,3.53,2662
180386,180386,0.7,Ideal,D,SI1,61.9,54.0,5.71,5.74,3.55,2522
130045,130045,0.3,Ideal,D,SI1,62.3,56.0,4.25,4.27,2.66,552
164471,164471,0.37,Ideal,E,VS2,61.5,56.0,4.63,4.59,2.84,1041
33301,33301,0.37,Ideal,E,VS2,61.6,57.0,4.64,4.6,2.84,758
84984,84984,0.25,Very Good,F,VVS2,60.8,59.0,4.03,4.07,2.44,575
189471,189471,1.36,Ideal,G,VS1,61.6,56.0,7.13,7.17,4.41,10983
81532,81532,0.32,Very Good,E,SI2,60.8,61.0,4.39,4.42,2.68,449
59405,59405,1.63,Ideal,G,VVS1,60.6,57.0,7.59,7.63,4.62,15377
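
Because `sample()` draws rows at random, the output changes on every run. A small sketch of how a fixed seed makes the draw repeatable follows; the `toy` frame and the seed value 42 are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy frame for illustration only, not the real dataset.
toy = pd.DataFrame({
    "carat": [0.3, 0.72, 0.7, 0.3, 0.37, 0.25],
    "price": [524, 2662, 2522, 552, 1041, 575],
})

# The same random_state yields the same rows on every call.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

print(first.equals(second))
```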


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193573 entries, 0 to 193572
Data columns (total 11 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       193573 non-null  int64  
 1   carat    193573 non-null  float64
 2   cut      193573 non-null  object 
 3   color    193573 non-null  object 
 4   clarity  193573 non-null  object 
 5   depth    193573 non-null  float64
 6   table    193573 non-null  float64
 7   x        193573 non-null  float64
 8   y        193573 non-null  float64
 9   z        193573 non-null  float64
 10  price    193573 non-null  int64  
dtypes: float64(6), int64(2), object(3)
memory usage: 16.2+ MB


**Check for the null values.**

In [10]:
# data.isnull() returns a DataFrame of Boolean values (True where a value is missing).

print("Number of Null/missing values in the dataset: ")

data.isnull().sum()   # sum() counts the True values per column. Conclusion: no column has a null value.

Number of Null/missing values in the dataset: 


id         0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64
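
When a dataset does contain gaps, the share of missing values per column is often more informative than the raw count. The sketch below uses a small toy frame with deliberately inserted gaps (hypothetical values, since the real dataset has none) to show both numbers side by side.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps inserted on purpose; the real dataset has no nulls.
toy = pd.DataFrame({
    "carat": [0.3, np.nan, 0.7, 1.52],
    "cut": ["Ideal", "Premium", None, "Good"],
    "price": [524, 2662, 2522, 13619],
})

missing_count = toy.isnull().sum()        # number of missing values per column
missing_pct = toy.isnull().mean() * 100   # mean of a Boolean mask = fraction of True

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```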

**Check for the duplicate values**

In [11]:
# data.duplicated() : This gives Boolean values 
data.duplicated().sum()  # This gives the number of records which are duplicate.

0

In [12]:
data[data.duplicated()]           # duplicated() already returns a Boolean mask, so it is used directly for indexing.
                                  # Comparing the mask with the string "True" would never match any row.
                                  # Use this when there are duplicates and we want to inspect them.

                                  # Conclusion: No duplicate record is there.

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
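
Since the real dataset has no duplicates, the effect of the Boolean mask is easier to see on a toy frame. The sketch below (hypothetical rows, illustrative names) shows the default behaviour and the `keep=False` variant, which flags every member of a duplicate group.

```python
import pandas as pd

# Toy frame: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "carat": [0.7, 0.7, 0.32, 0.7],
    "cut": ["Ideal", "Ideal", "Ideal", "Ideal"],
    "price": [2772, 2772, 666, 2772],
})

mask = toy.duplicated()                        # True for the 2nd and later copies of a row
repeats = toy[mask]                            # Boolean indexing with the mask itself
all_copies = toy[toy.duplicated(keep=False)]   # every row that belongs to a duplicate group
deduplicated = toy.drop_duplicates()           # keeps the first copy of each row

print(len(repeats), len(all_copies), len(deduplicated))
```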


**Check the statistical summary  of data**

In [13]:
data.describe().T     # By default describe() summarizes only the numerical columns.
                      # T : transposes the result, so each feature becomes a row.

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,193573.0,96786.0,55879.856166,0.0,48393.0,96786.0,145179.0,193572.0
carat,193573.0,0.790688,0.462688,0.2,0.4,0.7,1.03,3.5
depth,193573.0,61.820574,1.081704,52.1,61.3,61.9,62.4,71.6
table,193573.0,57.227675,1.918844,49.0,56.0,57.0,58.0,79.0
x,193573.0,5.715312,1.109422,0.0,4.7,5.7,6.51,9.65
y,193573.0,5.720094,1.102333,0.0,4.71,5.72,6.51,10.01
z,193573.0,3.534246,0.688922,0.0,2.9,3.53,4.03,31.3
price,193573.0,3969.155414,4034.374138,326.0,951.0,2401.0,5408.0,18818.0
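
The summary reports a minimum of 0.0 for `x`, `y` and `z`. One way to locate such rows is a Boolean mask over the three dimension columns. The sketch below runs on a toy frame in which the middle row is hypothetical; how to treat such rows in the real data (drop, impute, keep) is left open here.

```python
import pandas as pd

# Toy frame: first and last rows come from head(); the middle row is hypothetical.
toy = pd.DataFrame({
    "x": [7.27, 0.0, 5.69],
    "y": [7.33, 0.0, 5.73],
    "z": [4.55, 0.0, 3.5],
})

dims = ["x", "y", "z"]
zero_mask = (toy[dims] == 0).any(axis=1)   # True if any dimension is exactly 0

print("Rows with a zero dimension:", int(zero_mask.sum()))
valid = toy[~zero_mask]                    # rows with all dimensions above 0
```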


**Conclusions from describe() (statistical summary)**

1. *Diamond Count:*
   - Dataset comprises 193,573 diamonds with unique IDs ranging from 0 to 193,572.

2. *Carat Weight (Carat):*
   - Average carat weight is 0.79.
   - Ranges from 0.2 to 3.5.
   - 25% of diamonds have carat weight below 0.4, 50% below 0.70 and 75% below 1.03.
   - Standard Deviation is 0.46

3. *Depth (percentage):*
   - Ranges from 52.1 to 71.6.
   - Average depth percentage is 61.82.
   - Standard deviation is 1.08.

4. *Table (percentage):*
   - Ranges from 49.0 to 79.0.
   - Average table percentage is 57.23.
   - Standard deviation is 1.92.

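5. *Dimensions (x, y, z) (millimeters, i.e. mm):*
   - Ranges:
     - x : 0 to 9.65
     - y : 0 to 10.01
     - z : 0 to 31.30
   - A minimum of 0 mm is not physically possible, and a `z` of 31.30 is far above the 75th percentile of 4.03, so these extremes are probably data-entry errors or outliers.
   - Average dimensions are approximately 5.72 x 5.72 x 3.53.
   - Standard deviations: 1.11 (x), 1.10 (y), 0.69 (z).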

6. *Price:*
   - Prices Range from 326 to 18,818.
   - Average price is 3,969.16.
   - Standard Deviation is 4034.37
   - Right-skewed distribution; median price (50%) is 2,401, while the mean is higher at 3,969.16.
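
The mean-versus-median reading of skewness can be checked numerically. The sketch below uses only the ten prices visible in the `head()` and `tail()` outputs of this notebook, so it illustrates the method rather than reproducing the full-dataset figures.

```python
import pandas as pd

# The ten prices shown in the head() and tail() outputs above.
price = pd.Series([13619, 13387, 2772, 666, 14453, 1130, 2874, 3036, 681, 2258])

mean_price = price.mean()
median_price = price.median()
skewness = price.skew()      # sample skewness; a positive value means a longer right tail

print("mean:", mean_price, "median:", median_price)
print("right-skewed:", mean_price > median_price and skewness > 0)
```

On the full dataset the same two calls would be `data["price"].mean()` and `data["price"].skew()`.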
   
------------------------------------------------------------------------------------


**Note:**
* Carats = Milligrams / 200   (since 200 mg = 1 carat)


* The unit for both "Depth" and "Table" in this dataset is percentage (%). 
These percentages express the gemstone's depth and table width relative to its average diameter.

        Depth or Depth percentage : (depth / average diameter) * 100
        Table or Table percentage : (table width / average diameter) * 100
   
   ![depth_table_of_dimanond.png](attachment:depth_table_of_dimanond.png)

* Standard deviation: The standard deviation of each feature has the same unit as the feature itself.

*  ![c101b0da6ea1a0dab31f80d9963b0368_orig.png](attachment:c101b0da6ea1a0dab31f80d9963b0368_orig.png)

   mean < median => left-skewed distribution (Negatively-skewed) => tail on the left side is longer

   mean > median => right-skewed distribution (Positively-skewed) => tail on the right side is longer 
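
The depth-percentage formula in the note above can be tried on the first three rows of `data.head()`. The sketch assumes that `z` is the stone's depth in mm and that the average of `x` and `y` is its average diameter; that mapping is an assumption of this example, not something the dataset documentation in this notebook states. The column name `depth_pct_est` is likewise made up for the sketch.

```python
import pandas as pd

# First three rows of data.head(), restricted to the columns needed here.
rows = pd.DataFrame({
    "x": [7.27, 8.06, 5.69],
    "y": [7.33, 8.12, 5.73],
    "z": [4.55, 5.05, 3.5],
    "depth": [62.2, 62.0, 61.2],   # recorded depth percentage
})

# Assumption: average diameter = mean of x and y, depth in mm = z.
avg_diameter = (rows["x"] + rows["y"]) / 2
rows["depth_pct_est"] = (rows["z"] / avg_diameter * 100).round(1)

print(rows[["depth", "depth_pct_est"]])
```

For these three rows the estimate stays within half a percentage point of the recorded `depth`, which is consistent with the formula but does not prove it for every row.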

------------------------------------------------------------------------------


### Step 2: FEATURE ENGINEERING

**The 'id' feature is only a row identifier and carries no information for the analysis, so it is better to drop it.**

In [14]:
data.drop("id",axis=1,inplace=True)   # inplace=True: modifies the original DataFrame instead of returning a new one

*Each of the following drops the 'id' feature from the dataset. Only one of them is needed; running a second one after 'id' is gone raises a KeyError.*

        data.drop(labels=['id'],axis=1,inplace=True)
        data.drop(columns=['id'],inplace=True)
        data.drop("id",axis=1,inplace=True) 
        data= data.drop("id",axis=1)          # returns a new DataFrame and reassigns it

        Note: axis=0 means row and axis=1 means column in drop(). With columns=[...] the axis argument is not needed.
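
The difference between the in-place and the returning form of `drop()` can be seen on a toy frame. The sketch below also uses the `errors="ignore"` argument of `DataFrame.drop`, which turns a repeated drop into a no-op; the frame and variable names are illustrative only.

```python
import pandas as pd

# Toy frame built from two rows of head().
toy = pd.DataFrame({"id": [0, 1], "carat": [1.52, 2.03], "price": [13619, 13387]})

# Without inplace, drop() returns a new DataFrame and leaves toy untouched.
without_id = toy.drop(columns=["id"])
columns_before = list(toy.columns)

# With inplace=True, toy itself is modified and the call returns None.
toy.drop(columns=["id"], inplace=True)

# Dropping "id" again would raise a KeyError; errors="ignore" skips missing labels.
toy.drop(columns=["id"], inplace=True, errors="ignore")

print(columns_before, list(without_id.columns), list(toy.columns))
```

This matters in a notebook, where re-running the cell that drops `id` would otherwise fail.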

In [15]:
data.head(2)   # The 'id' column is no longer present.

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387



**Separating the Numerical and Categorical Features**

In [16]:
data.info()  # info() shows the datatype of each feature.
             # dtype is float64, int64 or object: float and int are numerical, i.e. non-object.
             # Therefore we can separate the features as object --> Categorical and non-object --> Numerical.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193573 entries, 0 to 193572
Data columns (total 10 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   carat    193573 non-null  float64
 1   cut      193573 non-null  object 
 2   color    193573 non-null  object 
 3   clarity  193573 non-null  object 
 4   depth    193573 non-null  float64
 5   table    193573 non-null  float64
 6   x        193573 non-null  float64
 7   y        193573 non-null  float64
 8   z        193573 non-null  float64
 9   price    193573 non-null  int64  
dtypes: float64(6), int64(1), object(3)
memory usage: 14.8+ MB


In [17]:
# Method I (recommended because it is short):

categorical_col=data.columns[data.dtypes=="object"]
numerical_col=data.columns[data.dtypes!="object"]

print("Numerical_columns:",list(numerical_col))
print("Categorical_feature:",list(categorical_col))

Numerical_columns: ['carat', 'depth', 'table', 'x', 'y', 'z', 'price']
Categorical_feature: ['cut', 'color', 'clarity']


In [18]:
# Method II: using a loop and if-else (longer, but useful for understanding the logic)

features=list(data.columns)    #Typecasting to get a List structure.

numerical_feature=[]
categorical_feature=[]

for f in features:

    if data[f].dtype=='object':
        categorical_feature.append(f)
    else:
        numerical_feature.append(f)


print("Numerical_columns:",numerical_feature)
print("Categorical_feature:",categorical_feature)

Numerical_columns: ['carat', 'depth', 'table', 'x', 'y', 'z', 'price']
Categorical_feature: ['cut', 'color', 'clarity']
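
A third option, shown here as a sketch and not as the method used in this notebook, is pandas' `select_dtypes`. It selects by dtype family, so it does not depend on the text columns being stored with dtype `object`. The `toy` frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame built from two rows of head().
toy = pd.DataFrame({
    "carat": [1.52, 2.03],
    "cut": ["Premium", "Very Good"],
    "color": ["F", "J"],
    "price": [13619, 13387],
})

# "number" covers all integer and float dtypes.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)
```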
