**Importing Library**

In [1]:
import pandas as pd

**Reading the Dataset**

In [2]:
df = pd.read_csv('/content/gemstone.csv')
df.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,2,0.7,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.5,2772
3,3,0.32,Ideal,G,VS1,61.6,56.0,4.38,4.41,2.71,666
4,4,1.7,Premium,G,VS2,62.6,59.0,7.65,7.61,4.77,14453


**Dropping the id column**

In [3]:
df=df.drop(labels=['id'],axis=1)

**Separating the Independent and Dependent Features**

In [4]:
## Independent and dependent features
X = df.drop(labels=['price'],axis=1)
Y = df[['price']]

In [5]:
Y

Unnamed: 0,price
0,13619
1,13387
2,2772
3,666
4,14453
...,...
193568,1130
193569,2874
193570,3036
193571,681


**Segregating numerical and categorical variables**

In [6]:
# Segregating numerical and categorical variables
categorical_cols = X.select_dtypes(include='object').columns
numerical_cols = X.select_dtypes(exclude='object').columns
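
As a rough illustration of how this dtype-based split behaves, here is a minimal sketch on a hypothetical three-row frame (`toy` and its values are made up for the example, not taken from the pipeline). The text columns are created with an explicit `object` dtype so the result does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself uses X built from gemstone.csv
toy = pd.DataFrame({
    "carat": [0.32, 0.70, 1.52],
    "depth": [61.6, 61.2, 62.2],
    "cut": pd.Series(["Ideal", "Ideal", "Premium"], dtype=object),
    "color": pd.Series(["G", "G", "F"], dtype=object),
})

# Text columns are selected by include='object', everything else by exclude='object'
toy_categorical = list(toy.select_dtypes(include="object").columns)
toy_numerical = list(toy.select_dtypes(exclude="object").columns)

print(toy_categorical)
print(toy_numerical)
```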

**Define the custom ranking for each ordinal variable**

In [7]:
# Define the custom ranking for each ordinal variable
cut_categories = ['Fair', 'Good', 'Very Good','Premium','Ideal']
color_categories = ['D', 'E', 'F', 'G', 'H', 'I', 'J']
clarity_categories = ['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF']

These tools are used to build **clean, organized, and efficient** machine learning preprocessing workflows. Here's why each component matters:

SimpleImputer
Handles missing values by filling them with a chosen statistic such as the mean, median, or most frequent value. Without this step, most scikit-learn estimators raise an error when they encounter NaN values.
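
A minimal sketch of median imputation, assuming a hypothetical single column (`col`) with one missing value and one outlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value and one outlier (100.0)
col = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

median_imputer = SimpleImputer(strategy="median")
filled = median_imputer.fit_transform(col)

# statistics_ holds the per-column fill value learned during fit
print(median_imputer.statistics_)
print(filled.ravel())
```

The fill value is the median of the observed entries, so the outlier has far less influence than it would under `strategy="mean"`.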

StandardScaler
Standardizes each numerical feature to mean=0 and standard deviation=1. This matters because features on very different scales (like age: 20-60 vs salary: 20,000-100,000) can bias models that are sensitive to feature magnitudes, such as regularized linear models, SVMs, and k-nearest neighbors.
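
To make the effect concrete, here is a small sketch using hypothetical age and salary values (the numbers are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age and salary columns on very different scales
data = np.array([
    [20.0, 20_000.0],
    [30.0, 40_000.0],
    [40.0, 60_000.0],
    [60.0, 100_000.0],
])

scaled = StandardScaler().fit_transform(data)

# After scaling, each column has mean ~0 and (population) standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```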

OrdinalEncoder
Converts ordered categorical features (like 'low', 'medium', 'high') into numeric codes while preserving their order. For example, 'low'=0, 'medium'=1, 'high'=2.
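
A short sketch of the 'low' / 'medium' / 'high' example. The `levels` list is passed explicitly because, without the `categories` argument, the encoder orders categories alphabetically rather than by meaning.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit ranking from lowest to highest
levels = ["low", "medium", "high"]
encoder = OrdinalEncoder(categories=[levels])

values = np.array([["medium"], ["low"], ["high"]], dtype=object)
codes = encoder.fit_transform(values)

# Each code is the position of the value in `levels`
print(codes.ravel())
```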

Pipeline
Chains multiple preprocessing steps (and optionally the final model) into one sequential workflow. Benefits include:

Applies all transformations in the correct order automatically

Helps prevent data leakage, because fit() learns statistics from the training data only and transform() reuses them on new data

Makes code cleaner and more maintainable
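
The leakage point can be sketched as follows, assuming hypothetical one-column `train` and `test` arrays. The pipeline is fitted on the training rows only, and the test rows are then transformed with the statistics learned there.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column train and test sets, each with a missing value
train = np.array([[1.0], [2.0], [3.0], [np.nan]])
test = np.array([[10.0], [np.nan]])

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

pipe.fit(train)                      # median, mean and scale come from train only
test_scaled = pipe.transform(test)   # test rows reuse those training statistics

print(pipe.named_steps["imputer"].statistics_)
print(pipe.named_steps["scaler"].mean_)
print(test_scaled.ravel())
```

The missing test value is filled with the training median, so it lands exactly on the training mean after scaling.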

ColumnTransformer
Applies different transformations to different columns in one step and concatenates the results, which makes it the natural way to combine the numerical and categorical pipelines.

In [8]:
from sklearn.impute import SimpleImputer  # Handling missing values
from sklearn.preprocessing import StandardScaler  # Handling feature scaling
from sklearn.preprocessing import OrdinalEncoder  # Ordinal encoding
## pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

**Numerical Pipeline**

Purpose: Processes numerical columns (carat, depth, table, and the dimensions x, y, z)

Step 1 - Imputer: Fills missing values using the median (middle value), which is robust against outliers

Step 2 - Scaler: Standardizes values to mean=0 and standard deviation=1, ensuring all numerical features are on the same scale


**Categorical Pipeline**

Purpose: Processes categorical columns (cut, color, clarity) that have inherent order

Step 1 - Imputer: Fills missing values with the most frequent category (mode) - appropriate for categorical data

Step 2 - OrdinalEncoder: Converts categorical values to numbers while preserving their quality order:

Cut: Fair=0, Good=1, Very Good=2, Premium=3, Ideal=4 (Fair is worst, Ideal is best)

Color: D=0, E=1, F=2, G=3, etc. (D is best, so a higher code means a worse color)

Clarity: I1=0, SI2=1, SI1=2, VS2=3, etc. (I1 is worst; IF=7 is the best grade in this ranking)

Step 3 - Scaler: Scales the encoded values to prevent any single feature from dominating
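
The mapping above can be checked with a small sketch. The category lists are the ones defined earlier in the notebook, and the three rows are taken from the `df.head()` output; the `sample` frame itself is only an illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cut_categories = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_categories = ['D', 'E', 'F', 'G', 'H', 'I', 'J']
clarity_categories = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# First three rows of the dataset, categorical columns only
sample = pd.DataFrame({
    "cut": ["Premium", "Very Good", "Ideal"],
    "color": ["F", "J", "G"],
    "clarity": ["VS2", "SI2", "VS1"],
}, dtype=object)

# The order of the category lists must match the order of the columns
encoder = OrdinalEncoder(
    categories=[cut_categories, color_categories, clarity_categories]
)
encoded = encoder.fit_transform(sample)

print(encoded)
```

Note that cut and clarity are ranked worst to best while color is ranked best to worst, so the sign of each feature's relationship with price differs.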


**ColumnTransformer (The Orchestrator)**

Purpose: Applies different pipelines to different column groups simultaneously

Structure: Each tuple contains (name, pipeline, columns):

'num_pipeline': Label for the transformation

num_pipeline: The pipeline to apply

numerical_cols: Which columns to apply it to

How It Works
When you call preprocessor.fit_transform(X_train), it:

Takes numerical columns → runs them through num_pipeline (median imputation → scaling)

Takes categorical columns → runs them through cat_pipeline (mode imputation → ordinal encoding → scaling)

Combines both transformed outputs into a single feature matrix

Returns a fully preprocessed dataset ready for model training

In [9]:
## Numerical Pipeline
num_pipeline=Pipeline(
    steps=[
    ('imputer',SimpleImputer(strategy='median')),
    ('scaler',StandardScaler())

    ]

)

# Categorical Pipeline
cat_pipeline=Pipeline(
    steps=[
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('ordinalencoder',OrdinalEncoder(categories=[cut_categories,color_categories,clarity_categories])),
    ('scaler',StandardScaler())
    ]

)

preprocessor=ColumnTransformer([
('num_pipeline',num_pipeline,numerical_cols),
('cat_pipeline',cat_pipeline,categorical_cols)
])


**Splitting the Dataset into Train and Test Sets**

Inputs
X: Your feature variables (independent variables) - all the columns used to make predictions (like carat, cut, color, clarity for diamonds)

Y: Your target variable (dependent variable) - what you're trying to predict (like diamond price)

Parameters
test_size=0.30: Sets aside 30% of data for testing and 70% for training

30% goes to test set (unseen data to evaluate model performance)

70% goes to training set (data used to train the model)

Common splits are 70/30 or 80/20 (training/testing)

random_state=30: Ensures reproducibility

Makes the random split consistent every time you run the code

Same random_state value = same exact split each time

Without it, you'd get different splits on each run, making results non-reproducible
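
A minimal sketch of the reproducibility claim, assuming hypothetical arrays `X_demo` and `y_demo` of ten rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with a single feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 100

split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=30)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=30)

# With the same random_state, both calls select the same test rows
same_split = np.array_equal(split_a[1], split_b[1])

print(same_split)
print(len(split_a[0]), len(split_a[1]))
```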

Outputs (4 Arrays)
X_train: Training features (70% of X) - used to train the model

X_test: Testing features (30% of X) - used to evaluate the model

y_train: Training labels (70% of Y) - actual values corresponding to X_train

y_test: Testing labels (30% of Y) - actual values to compare against predictions

Why This Is Essential
Detects Overfitting: Evaluating on the same data used for training would give a misleadingly high score, because the model might have memorized the training data instead of learning patterns. A held-out test set exposes that gap.

Simulates Real-World Performance: The test set acts as "new, unseen data" to see how well your model generalizes.

Model Validation: You can compare predicted values (from X_test) with actual values (y_test) to calculate metrics like RMSE, MAE, and R² score.
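
For reference, the three metrics mentioned above can be computed by hand as in the sketch below. The `y_true` and `y_hat` values are hypothetical, and the scikit-learn functions are imported only to cross-check the manual formulas.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([666.0, 2772.0, 13387.0, 13619.0])
y_hat = np.array([700.0, 2500.0, 13000.0, 14000.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))            # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error
# R²: 1 minus residual sum of squares over total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, rmse, r2)
```

MAE and RMSE are in the same units as the target (price), while R² is unitless.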

How It Works
The function shuffles your data randomly

Takes 70% of rows → X_train and y_train

Takes remaining 30% of rows → X_test and y_test

Ensures corresponding indices match (row 5 in X_train matches row 5 in y_train)
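
The row alignment described above can be verified with a small sketch, assuming hypothetical ten-row frames `X_demo` and `Y_demo`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature and target frames sharing the same index
X_demo = pd.DataFrame({"carat": list(range(1, 11))})
Y_demo = pd.DataFrame({"price": [1000 * i for i in range(1, 11)]})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.30, random_state=30
)

# Features and targets keep matching index labels after the shuffle
print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
```

The original index labels survive the split, which is why the feature and target rows can still be matched afterwards.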

In [11]:
## Train test split

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.30,random_state=30)

**Preprocessing Transformation of Train & Test Data**

In [12]:
X_train=pd.DataFrame(preprocessor.fit_transform(X_train),columns=preprocessor.get_feature_names_out())
X_test=pd.DataFrame(preprocessor.transform(X_test),columns=preprocessor.get_feature_names_out())

In [13]:
X_train.head()

Unnamed: 0,num_pipeline__carat,num_pipeline__depth,num_pipeline__table,num_pipeline__x,num_pipeline__y,num_pipeline__z,cat_pipeline__cut,cat_pipeline__color,cat_pipeline__clarity
0,-0.975439,-0.849607,-0.121531,-1.042757,-1.08097,-1.12315,0.874076,1.528722,1.352731
1,0.235195,1.833637,-0.121531,0.318447,0.279859,0.485354,-2.144558,-0.935071,-0.646786
2,0.494617,0.815855,0.3998,0.570855,0.606458,0.673737,-0.132136,0.296826,0.686225
3,-1.018676,0.260701,0.921131,-1.214034,-1.24427,-1.195605,-0.132136,0.296826,0.01972
4,-0.953821,-0.664555,-0.642862,-1.069801,-1.044681,-1.094168,0.874076,2.14467,1.352731


**Model Training**

In [14]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [15]:
## Model Training

from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
import numpy as np
import pandas as pd
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.linear_model import BayesianRidge


**Training Multiple Models**

In [17]:
## Train multiple models
## Model Evaluation
from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet, HuberRegressor, PassiveAggressiveRegressor, BayesianRidge, RANSACRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square


models={
    'LinearRegression':LinearRegression(),
    'Lasso':Lasso(),
    'Ridge':Ridge(),
    'Decision Tree':DecisionTreeRegressor(),
    'Random Forest Regression':RandomForestRegressor(),
    'Huber Regression':HuberRegressor(),
    'Passive Aggressive':PassiveAggressiveRegressor(),
    'Bayesian Ridge':Ridge(), # NOTE: Ridge() stands in for BayesianRidge() here, so this entry's metrics duplicate the 'Ridge' entry
    'Extra Trees':ExtraTreesRegressor(),
    'K-Neighbors':KNeighborsRegressor(),
    'XGBoost':XGBRegressor(),
    'CatBoost':CatBoostRegressor(verbose=0), # Set verbose to 0 to reduce output
    'LightGBM': lgb.LGBMRegressor(),
    'XGBoost_tuned': XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=-1),
    'RANSAC Regressor': RANSACRegressor()
}
trained_model_list=[]
model_list=[]
r2_list=[]

for model_name, model in models.items():
    print(model_name)

    # Fit on the preprocessed training set, then predict on the held-out test set
    model.fit(X_train, y_train.values.ravel())
    y_pred = model.predict(X_test)

    # These metrics are computed on the test set, despite the printed label below
    mae, rmse, r2_square = evaluate_model(y_test, y_pred)

    model_list.append(model_name)

    print('Model Training Performance')
    print("RMSE:", rmse)
    print("MAE:", mae)
    print("R2 score", r2_square*100)

    r2_list.append(r2_square)

    print('='*35)
    print('\n')

LinearRegression
Model Training Performance
RMSE: 1013.9047094344002
MAE: 674.025511579685
R2 score 93.68908248567512


Lasso
Model Training Performance
RMSE: 1013.8784226767013
MAE: 675.0716923362156
R2 score 93.68940971841704


Ridge
Model Training Performance
RMSE: 1013.9059272771406
MAE: 674.0555800798531
R2 score 93.68906732505968


Decision Tree
Model Training Performance
RMSE: 836.5035917149488
MAE: 423.3444144051063
R2 score 95.70430099364165


Random Forest Regression
Model Training Performance
RMSE: 610.6621720811702
MAE: 311.7576949743012
R2 score 97.71071290801133


Huber Regression
Model Training Performance
RMSE: 1079.813310118495
MAE: 607.2043993104802
R2 score 92.84193601619937


Passive Aggressive
Model Training Performance
RMSE: 1096.978611698554
MAE: 608.2915921136271
R2 score 92.61255019046062


Bayesian Ridge
Model Training Performance
RMSE: 1013.9059272771406
MAE: 674.0555800798531
R2 score 93.68906732505968


Extra Trees
Model Training Performance
RMSE: 618.68016

In [None]:
results = pd.DataFrame({'Model': model_list, 'R2 Score': r2_list})
results = results.sort_values(by='R2 Score', ascending=False).reset_index(drop=True)
print(results)

**Summary of All Models Trained**

**Conclusion**

**This table summarizes the R2 scores for all the regression models we trained to predict gemstone prices.**

The **CatBoost** model achieved the **highest R2 score of 0.979544**, meaning it explains the largest proportion of the variance in gemstone prices among the models tested. LightGBM and XGBoost also performed very well, which suggests that tree-based ensemble models are well suited to this dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter out CatBoost and get the top 4 models
top_4_models = results[results['Model'] != 'CatBoost'].head(4)

# Create the bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='R2 Score', data=top_4_models, hue='Model', palette='viridis', legend=False)
plt.title('Top 4 Models (Excluding CatBoost) by R2 Score')
plt.xlabel('Model')
plt.ylabel('R2 Score')
plt.ylim(0.9, 1.0) # Set y-axis limits for better visualization of high R2 scores
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [18]:
import xgboost as xgb
import pickle

# Initialize and train the XGBoost Regressor model
xgb_model = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=-1)
xgb_model.fit(X_train, y_train.values.ravel())

# Save the trained model to a .pkl file
with open('model.pkl', 'wb') as file:
    pickle.dump(xgb_model, file)

print("XGBoost model trained and saved as model.pkl")

XGBoost model trained and saved as model.pkl
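
The saved `model.pkl` contains only the regressor, so any code that loads it has to apply the same fitted `preprocessor` to raw inputs before calling `predict`. One possible approach, sketched here under assumptions, is to pickle the preprocessing and the model together as a single `Pipeline`. In this sketch `StandardScaler` stands in for the full preprocessor and `LinearRegression` stands in for the XGBoost model; the data and the name `bundle` are hypothetical.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data; stand-ins replace the notebook's preprocessor and model
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
y_raw = X_raw @ np.array([1.0, 2.0, 3.0])

bundle = Pipeline([
    ("preprocessor", StandardScaler()),
    ("model", LinearRegression()),
])
bundle.fit(X_raw, y_raw)

# Round-trip through pickle in memory; pickle.dump to a file works the same way
restored = pickle.loads(pickle.dumps(bundle))

# The restored pipeline accepts raw inputs and reproduces the predictions
same_predictions = np.allclose(bundle.predict(X_raw), restored.predict(X_raw))
print(same_predictions)
```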
