In [90]:
import pandas as pd


In [91]:
df=pd.read_csv('data/gemstone.csv')

In [92]:
df=df.drop(labels='id',axis=1)

In [93]:
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.52,Premium,F,VS2,62.2,58.0,7.27,7.33,4.55,13619
1,2.03,Very Good,J,SI2,62.0,58.0,8.06,8.12,5.05,13387
2,0.7,Ideal,G,VS1,61.2,57.0,5.69,5.73,3.5,2772
3,0.32,Ideal,G,VS1,61.6,56.0,4.38,4.41,2.71,666
4,1.7,Premium,G,VS2,62.6,59.0,7.65,7.61,4.77,14453


In [94]:
# Independent features (x) and dependent target (y)
x=df.drop(labels=['price'],axis=1)
y=df[['price']]


y = df[['price']] selects the 'price' column with a list of column names, so the result is a DataFrame with a single column named 'price'. The double brackets [[...]] are simply the indexing operator applied to a list (['price']). Passing a list always returns a DataFrame, whether the list names one column or several, so this form is the one to use when the result must keep its two-dimensional structure.

y = df['price'] returns the 'price' column as a pandas Series, which is a one-dimensional labeled array. Use this form when you want a single column as a 1-D object for further analysis or modeling.
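The difference is easiest to see by checking the type and shape of each result. The following is a minimal sketch on a small made-up frame (the name toy and its values are illustrative, not taken from the gemstone data):

```python
import pandas as pd

# Small made-up frame; the values are illustrative only.
toy = pd.DataFrame({"carat": [0.3, 0.7, 1.5], "price": [666, 2772, 13619]})

as_frame = toy[["price"]]   # list of column names -> DataFrame
as_series = toy["price"]    # single column name   -> Series

print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
print(type(as_series).__name__, as_series.shape)  # Series (3,)

# A one-column frame can be flattened to 1-D when an estimator expects it.
flat = as_frame.to_numpy().ravel()
print(flat.shape)  # (3,)
```

Because y is kept as a DataFrame in this notebook, fitted attributes such as coef_ come back as 2-D arrays later on.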

##### Define which column should be ordinal encoded and which should be scaled

Ordinal encoding is typically used for categorical features that have an inherent order or hierarchy. For example, if your dataset has a feature called 'education level' with categories like 'high school', 'college', and 'graduate school', you could encode these categories as ordinal values like 1, 2, and 3, respectively, based on their relative levels of education. However, ordinal encoding maps the categories to equally spaced integers, so a model such as linear regression treats the step between any two adjacent categories as the same size, which may not always be the case.

Scaling is typically used for numerical features that have different scales or units of measurement. For example, if your dataset has a feature called 'income' that ranges from 10,000 to 100,000, and a feature called 'age' that ranges from 20 to 80, you could apply scaling to standardize the range of values for both features to a common scale. This helps to prevent features with large values from dominating the model or introducing bias.
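As a rough illustration of the income and age example, the sketch below standardizes two made-up columns with StandardScaler and checks the result against the formula (value - mean) / standard deviation. The numbers are hypothetical and only serve to show the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical income and age values on very different scales.
data = np.array([
    [10_000.0, 20.0],
    [40_000.0, 35.0],
    [70_000.0, 60.0],
    [100_000.0, 80.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Same computation by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0), which StandardScaler also uses.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled.round(3))
print(np.allclose(scaled, manual))  # True
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.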

In [95]:
# Define which columns should be ordinal encoded and which should be scaled
categorical_cols=x.columns[x.dtypes=='object']
numerical_cols=x.columns[x.dtypes!='object']

Another way to pick out the categorical columns is select_dtypes:

In [96]:
cat_col=x.select_dtypes(include='object').columns

In [97]:
categorical_cols
numerical_cols

Index(['carat', 'depth', 'table', 'x', 'y', 'z'], dtype='object')

In [98]:
cat_col

Index(['cut', 'color', 'clarity'], dtype='object')
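The two column-selection approaches can be compared side by side. This is a small sketch on a made-up frame (toy and its values are assumptions for illustration). The string columns are given an explicit object dtype so that the comparison with 'object' behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Made-up frame; string columns use an explicit object dtype.
toy = pd.DataFrame({
    "carat": [0.32, 0.70, 1.52],
    "cut": pd.Series(["Ideal", "Ideal", "Premium"], dtype=object),
    "depth": [61.6, 61.2, 62.2],
    "color": pd.Series(["G", "G", "F"], dtype=object),
})

# Boolean-mask approach, as in cell 95.
cat_by_mask = toy.columns[toy.dtypes == "object"]
num_by_mask = toy.columns[toy.dtypes != "object"]

# select_dtypes approach; "number" covers integer and float columns.
cat_by_select = toy.select_dtypes(include="object").columns
num_by_select = toy.select_dtypes(include="number").columns

print(list(cat_by_mask), list(cat_by_select))  # ['cut', 'color'] ['cut', 'color']
print(list(num_by_mask), list(num_by_select))  # ['carat', 'depth'] ['carat', 'depth']
```

Note that the mask with != 'object' treats every non-object column as numerical, which would also include datetime or boolean columns if the data had any, while include='number' is stricter.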

#### Define the custom ranking for each ordinal variable

For example, consider the variable "education level" with categories "high school", "some college", "associate's degree", "bachelor's degree", "master's degree", and "doctoral degree". If no order is given, scikit-learn's OrdinalEncoder sorts the category labels (alphabetically for strings) and numbers them in that order, which would put "associate's degree" first and "high school" fourth. You know, however, that the true ranking is "high school" < "some college" < "associate's degree" < "bachelor's degree" < "master's degree" < "doctoral degree". In this case, you pass that order explicitly so that the codes increase with the level of education: "high school" = 0, "some college" = 1, "associate's degree" = 2, etc. (OrdinalEncoder numbers categories starting from 0.)

Defining custom ranking for each ordinal variable can be useful for improving the interpretability and accuracy of machine learning models that use these variables as inputs. However, it is important to ensure that the custom ranking reflects the true order or hierarchy of the categories in the variable.
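The effect of a custom ranking can be checked on the cut grades used in this notebook. The sketch below compares OrdinalEncoder with its default, sorted category order against the same encoder given an explicit order. The names sample, default_enc, and custom_enc are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Intended ranking of the cut grades, lowest to highest.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# One sample per grade, as a single-column 2-D array.
sample = np.array(
    [["Ideal"], ["Fair"], ["Premium"], ["Good"], ["Very Good"]], dtype=object
)

default_enc = OrdinalEncoder()                        # sorted label order
custom_enc = OrdinalEncoder(categories=[cut_order])   # explicit ranking

default_codes = default_enc.fit_transform(sample).ravel()
custom_codes = custom_enc.fit_transform(sample).ravel()

print(default_codes)  # [2. 0. 3. 1. 4.]
print(custom_codes)   # [4. 0. 3. 1. 2.]
```

With the default order, "Very Good" receives the highest code simply because it sorts last alphabetically. With the explicit order, "Ideal" is ranked highest, as intended.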

In [99]:
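# Define custom ranking for each ordinal variable.
# cut and clarity are listed from lowest to highest grade; color is listed
# from D (most colorless) to J, so a higher color code means a lower grade.
cut_categories=['Fair','Good','Very Good','Premium','Ideal']
color_categories=['D','E','F','G','H','I','J']
clarity_categories=["I1","SI2","SI1","VS2","VS1","VVS2","VVS1","IF"]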

In [100]:
from sklearn.impute import SimpleImputer   # Handling missing values
from sklearn.preprocessing import StandardScaler  # handling feature scaling
from sklearn.preprocessing import OrdinalEncoder  # create ordinal encoding

# The three steps above should run one after another;
# Pipeline groups them into a single object that applies them in sequence
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

Pipeline: This is a scikit-learn class that chains a sequence of processing steps into a single estimator. Every step except the last must be a transformer (an object with fit and transform methods); the last step can be either a transformer or a model. Calling fit on the pipeline fits each step in order and passes its output to the next step, so exactly the same preprocessing is applied at training time and at prediction time.

ColumnTransformer: This is a scikit-learn class that applies different preprocessing steps to different columns of a dataset. You give it a list of (name, transformer, columns) entries; each transformer is fitted and applied only to its own subset of columns, and the results are concatenated side by side into a single feature matrix. This is useful when a dataset has mixed data types, such as numerical and categorical columns, that need different preprocessing. A ColumnTransformer can itself be used as a step inside a Pipeline to build a complete machine learning pipeline.

In [101]:
# Numerical pipeline: impute missing values with the median, then standardize
num_pipeline=Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='median')),
        ('scaler',StandardScaler())
    ]
)

# Categorical pipeline: impute with the most frequent value, encode with the
# custom category order, then standardize
cat_pipeline=Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='most_frequent')),
        ('ordinal encoder',OrdinalEncoder(categories=[cut_categories,color_categories,clarity_categories])),
        ('scaler',StandardScaler())
    ]
)

# Apply each pipeline to its own set of columns
preprocessor=ColumnTransformer([
    ('num_pipeline',num_pipeline,numerical_cols),
    ('cat_pipeline',cat_pipeline,categorical_cols)
])



This code defines two pipelines (num_pipeline and cat_pipeline) for data preprocessing, and a ColumnTransformer (preprocessor) that combines these pipelines to preprocess both numerical and categorical features in a single step. Here's a step-by-step breakdown of the code:

num_pipeline: This pipeline has two steps - SimpleImputer and StandardScaler. SimpleImputer is used to fill any missing values in the numerical columns with the median value of the column. StandardScaler is then used to scale the numerical features to have zero mean and unit variance.

cat_pipeline: This pipeline has three steps - SimpleImputer, OrdinalEncoder, and StandardScaler. SimpleImputer is used to fill any missing values in the categorical columns with the most frequent value of the column. OrdinalEncoder is used to convert the categorical values into integers using the specified order of categories (which are defined in cut_categories, color_categories, and clarity_categories). StandardScaler is then used to scale the encoded categorical features to have zero mean and unit variance.

preprocessor: This ColumnTransformer combines the num_pipeline and cat_pipeline pipelines. It applies the num_pipeline to the numerical_cols and the cat_pipeline to the categorical_cols.

Overall, this code defines a complete data preprocessing pipeline that can be used to preprocess both numerical and categorical features in a single step. It can be used as a preprocessing step before training a machine learning model.
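One possible way to use such a preprocessor "before training a model" is to make it the first step of a larger Pipeline that ends in an estimator. The sketch below does this on a tiny made-up frame with one numerical and one categorical column. The names toy, toy_preprocessor, and full_pipeline, and all the values, are assumptions for illustration; the notebook itself transforms the data first and fits the model separately.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Tiny made-up dataset; values are illustrative only.
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.0, 1.5, 2.0, 0.5],
    "cut": pd.Series(
        ["Fair", "Good", "Ideal", "Premium", "Very Good", "Ideal"], dtype=object
    ),
    "price": [500, 2500, 5000, 11000, 15000, 1500],
})
X_toy = toy.drop(columns="price")
y_toy = toy["price"]

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["carat"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(categories=[cut_order])),
        ("scaler", StandardScaler()),
    ]), ["cut"]),
])

# Preprocessing and model in one object: fit() fits both, predict() applies
# the already-fitted preprocessing before calling the model.
full_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("model", LinearRegression()),
])
full_pipeline.fit(X_toy, y_toy)
preds = full_pipeline.predict(X_toy)
print(preds.shape)  # (6,)
```

Bundling the steps this way means the raw, untransformed frame can be passed to predict, and the preprocessing statistics always come from the data used in fit.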

In [102]:
#Train Test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42)

In [103]:
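# Fit the preprocessor on the training data only, then transform it
X_train=pd.DataFrame(preprocessor.fit_transform(X_train),columns=preprocessor.get_feature_names_out())

In [104]:
# Reuse the statistics learned from the training data; do not refit on the test set
X_test=pd.DataFrame(preprocessor.transform(X_test),columns=preprocessor.get_feature_names_out())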

In [105]:
X_train.head()

Unnamed: 0,num_pipeline__carat,num_pipeline__depth,num_pipeline__table,num_pipeline__x,num_pipeline__y,num_pipeline__z,cat_pipeline__cut,cat_pipeline__color,cat_pipeline__clarity
0,-0.823144,-1.129988,-0.641897,-0.780451,-0.835103,-0.876024,0.8741,-0.936747,1.350746
1,0.945023,-1.777823,0.921902,1.073226,1.166389,0.946633,-1.137644,0.910853,0.684455
2,1.958484,0.165682,0.400636,1.703116,1.755063,1.742237,-0.131772,0.910853,0.018164
3,-0.995648,-0.574701,-0.641897,-1.122391,-1.161138,-1.165334,0.8741,-0.32088,2.017037
4,-0.995648,0.25823,0.400636,-1.176382,-1.152082,-1.136403,-1.137644,1.52672,-0.648127


In [107]:
X_test.head()

Unnamed: 0,num_pipeline__carat,num_pipeline__depth,num_pipeline__table,num_pipeline__x,num_pipeline__y,num_pipeline__z,cat_pipeline__cut,cat_pipeline__color,cat_pipeline__clarity
0,-0.629077,0.25823,-0.12063,-0.600482,-0.581521,-0.572248,0.8741,-1.552614,-0.648127
1,2.605374,-2.148014,-0.12063,2.126042,2.198832,1.959219,-1.137644,0.294987,-1.314417
2,-1.125026,-1.222536,0.921902,-1.374347,-1.414721,-1.46911,-0.131772,-0.936747,2.017037
3,-1.017211,-0.574701,0.921902,-1.158385,-1.161138,-1.194265,-0.131772,1.52672,2.017037
4,0.858771,0.628421,-0.641897,0.947248,0.985258,1.004495,0.8741,0.910853,-0.648127


In [109]:
#Model Training
from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

In [110]:
regression=LinearRegression()
regression.fit(X_train,y_train)


In [111]:
regression.coef_

array([[ 6432.97591819,  -132.34206204,   -70.48787525, -1701.38593925,
         -494.17005097,   -76.32351645,    68.80035873,  -464.67990411,
          652.10059539]])

In [112]:
regression.intercept_

array([3976.8787389])

In [121]:
import numpy as np

def evaluate_model(true,predicted):
    """Return MAE, MSE, RMSE and R-squared for a set of predictions."""
    mae=mean_absolute_error(true,predicted)
    mse=mean_squared_error(true,predicted)
    rmse=np.sqrt(mse)  # reuse the MSE instead of recomputing it
    r2_square=r2_score(true,predicted)
    return mae,mse,rmse,r2_square
                 


This code defines a function named evaluate_model which takes two arguments true and predicted. The function computes four evaluation metrics for regression models: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2). These metrics are commonly used to evaluate the performance of a regression model.

The cell first imports the NumPy library with import numpy as np; the import sits at the top of the cell, outside the function. NumPy is a popular library for scientific computing in Python and provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

The function then computes the four evaluation metrics using functions from scikit-learn and NumPy:

mean_absolute_error: computes the mean absolute error between the true and predicted values.
mean_squared_error: computes the mean squared error between the true and predicted values.
np.sqrt: computes the square root of a given array or scalar. Here it is used to compute the RMSE from the MSE.
r2_score: computes the R-squared value between the true and predicted values.
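To make the four metrics concrete, here is a small worked sketch that computes them by hand in plain Python on made-up numbers. It follows the standard definitions (R-squared as 1 minus the ratio of residual to total sum of squares) and is not the scikit-learn implementation itself.

```python
import math

# Made-up true and predicted values.
true = [3.0, 5.0, 2.0, 7.0]
pred = [2.5, 5.0, 3.0, 8.0]
n = len(true)

errors = [t - p for t, p in zip(true, pred)]

mae = sum(abs(e) for e in errors) / n    # mean absolute error
mse = sum(e ** 2 for e in errors) / n    # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error

mean_true = sum(true) / n
ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in true)   # total sum of squares
r2 = 1 - ss_res / ss_tot                           # R-squared

print(mae)           # 0.625
print(mse)           # 0.5625
print(rmse)          # 0.75
print(round(r2, 4))  # 0.8475
```

MAE and RMSE are in the same units as the target (price, in this notebook), while R-squared is unitless, which is why it is convenient for comparing models.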

In [125]:
#Train multiple models
models={
    'LinearRegression':LinearRegression(),
    'lasso':Lasso(),
    'Ridge':Ridge(),
    'ElasticNet':ElasticNet()
}
model_list=[]
r2_list=[]

for i in range(len(list(models))):
    model=list(models.values())[i]
    model.fit(X_train,y_train)

    # Make predictions on the test set
    y_pred=model.predict(X_test)
    mae,mse,rmse,r2_square=evaluate_model(y_test,y_pred)

    # Record the model name and its R-squared score
    print(list(models.keys())[i])
    model_list.append(list(models.keys())[i])
    r2_list.append(r2_square)

    print('Model test performance')
    print('RMSE: ', rmse)
    print('MAE',mae)
    print('r2_square: ',r2_square)
    print('='*35)
    print('\n')

LinearRegression
Model test performance
RMSE:  1014.6296630375466
MAE 675.0758270067466
r2_square:  0.9362906819996047
===================================


lasso
Model test performance
RMSE:  1014.659130275064
MAE 676.242117366551
r2_square:  0.9362869814082755
===================================


Ridge
Model test performance
RMSE:  1014.6343233534432
MAE 675.1077629781347
r2_square:  0.9362900967491629
===================================


ElasticNet
Model test performance
RMSE:  1533.3541245902309
MAE 1060.9432977143
r2_square:  0.8544967219374031
===================================




The code defines a dictionary called models containing four different regression models: LinearRegression, Lasso, Ridge, and ElasticNet. Each model is instantiated with default hyperparameters.

Two empty lists are created: model_list and r2_list. These lists will store the names of the models and their R-squared values, respectively.

A for loop is used to iterate over each model in the models dictionary. The i variable is used as an index to access the ith model in the list of models.

The ith model is fitted to the training data using the fit() method. X_train and y_train represent the independent and dependent variables of the training set, respectively.

The model is used to make predictions on the test set using the predict() method. X_test is the independent variable of the test set.

The evaluate_model() function is called to calculate four metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R-squared. These metrics are calculated by comparing the true y_test values with the predicted y_pred values.

The name of the ith model is printed using print(list(models.keys())[i]). The name is added to the model_list list using model_list.append(list(models.keys())[i]).

The model's performance metrics are printed using print('RMSE: ', rmse), print('MAE',mae), and print('r2_square: ',r2_square). Because the predictions are made on X_test, these are test-set scores. The R-squared value is also added to the r2_list list using r2_list.append(r2_square).

Finally, a separator line is printed using print('='*35) to visually separate the results of each model.

Overall, the code trains multiple regression models on a given dataset and evaluates their performance using various metrics. The results of each model are printed to the console and stored in lists for further analysis.
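As one example of such further analysis, the model names and R-squared scores can be paired up and ranked. The sketch below copies the four R-squared values from the printed output above into a plain list; the variable names model_names, r2_scores, and ranked are made up for this illustration.

```python
# Names and R-squared scores copied from the printed results above.
model_names = ["LinearRegression", "lasso", "Ridge", "ElasticNet"]
r2_scores = [
    0.9362906819996047,
    0.9362869814082755,
    0.9362900967491629,
    0.8544967219374031,
]

# Pair each name with its score and sort from best to worst R-squared.
ranked = sorted(zip(model_names, r2_scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:<18}{score:.6f}")

best_name, best_r2 = ranked[0]
print("Best model:", best_name)  # Best model: LinearRegression
```

On these numbers, LinearRegression, Ridge, and Lasso are effectively tied, and ElasticNet with default hyperparameters is clearly behind.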







In [126]:
model_list

['LinearRegression', 'lasso', 'Ridge', 'ElasticNet']