# Diamond Prediction

After the exploratory data analysis, the next step is to build a prediction model.
The prediction follows these steps:
1. **Preprocessing** - handle outliers and engineer features.
2. **Predictions** - predict on the validation data, then on the test data, and record metrics.
3. **Evaluation** - build baselines for comparison.
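
The cells below refer to `X_train`, `X_val`, `X_test` and `X_train2` without showing how they were created. The following is a minimal sketch of one way such a split could be produced, assuming `X_train` is the full training portion (train + validation) and `X_train2` is what remains after the validation rows are held out. The toy data, the `price` target, the split sizes and `random_state` are all assumptions, not the notebook's actual values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real table (hypothetical values and columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, size=100),
    "price": rng.uniform(300, 15000, size=100),
})
X, y = df[["carat"]], df["price"]

# Hold out a test set first (assumed 20% of all rows).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Carve a validation set out of the training portion (assumed 25% of it).
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)

print(len(X_train2), len(X_val), len(X_test))  # 60 20 20
```

With a split like this, a model can be fitted on `X_train2`, checked on `X_val`, then refitted on `X_train` before it is scored once on `X_test`.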

## Libraries & settings

In [18]:
#numpy & pandas (pd is used by the preprocessing and results cells below)
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

#pipeline tools
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from pipelinehelper import PipelineHelper
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
#time related
from timeit import default_timer as timer
from datetime import timedelta

#timer for entire code
start = timer()

#warning handling
import warnings
warnings.filterwarnings("ignore")

#plotly
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

## Baseline 1: Basic approach

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **Categorical** - fill with the most frequent value. **Continuous** - replace values outside 1.5 × IQR with the median.
    2. **Feature engineering:** only the categorical features are transformed, using `OrdinalEncoder`.
* **Model training** - fit a simple linear regression model, then evaluate it on the validation data and with 5-fold cross-validation.
* **Model testing** - refit on the whole train + validation set and evaluate on the test data.
* **Model evaluating** - record the following metrics for validation and test:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
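
To make the three metrics concrete, here is a small sketch that computes them by hand on made-up numbers and checks the results against the scikit-learn functions. The arrays `y_true` and `y_pred` are illustrative assumptions, not values from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                         # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))                    # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)                       # (1 + 0 + 1 + 4) / 4 = 1.5

ss_res = np.sum(errors ** 2)                     # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 6/20 = 0.7

# The hand-computed values agree with scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(mae, mse, r2)
```

MAE and MSE are expressed in the units of the target (and its square), so lower is better. R2 is unitless and higher is better, with 1.0 meaning a perfect fit.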
    

**Preprocessing**

In [19]:
# Preprocessing for continuous data

def Outlier_Detector(X,factor):
    """Replace values outside [Q1 - factor*IQR, Q3 + factor*IQR] with NaN, column by column."""
    X = pd.DataFrame(X).copy()
    for i in range(X.shape[1]):
        x = X.iloc[:,i]
        q1 = x.quantile(0.25)
        q3 = x.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - (factor * iqr)
        upper_bound = q3 + (factor * iqr)
        # mask the outliers so the imputer in the next pipeline step can fill them
        X.iloc[((x < lower_bound) | (x > upper_bound)).to_numpy(), i] = np.nan
    return X

#creating outlier_remover object using FunctionTransformer with factor=1.5
Outlier = FunctionTransformer(Outlier_Detector,kw_args={'factor':1.5})

#contiuous_transformer = SimpleImputer(strategy='median')

contiuous_transformer = Pipeline(steps=[
('outlier', Outlier),
('imputer', SimpleImputer(strategy='median'))
])

# building categorical transformers (worst to best)
cut_enc = OrdinalEncoder(categories=[["Fair", "Good", "Very Good", "Premium","Ideal"]])
color_enc = OrdinalEncoder(categories=[['J', 'I', 'H', 'G', 'F', 'E','D']])
clarity_enc = OrdinalEncoder(categories=[["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1",'IF']])


# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', contiuous_transformer, Continuous),
        ('cuts', cut_enc, ["cut"]),
        ('colors', color_enc, ["color"]),
        ('clarities', clarity_enc, ["clarity"])
    ])
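
To illustrate what the preprocessing above does to the data, the sketch below applies the same two ideas to a toy example: values outside 1.5 × IQR are masked and replaced by the median, and an ordered category is mapped to its rank. It uses `DataFrame.where` in place of the `Outlier_Detector` function, and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Toy continuous column with one obvious outlier (made-up values).
toy = pd.DataFrame({"carat": [0.3, 0.4, 0.5, 0.6, 0.7, 5.0]})

q1 = toy["carat"].quantile(0.25)        # 0.425
q3 = toy["carat"].quantile(0.75)        # 0.675
iqr = q3 - q1                           # 0.25
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep values inside the bounds, turn everything else into NaN.
masked = toy.where((toy >= lower) & (toy <= upper))

# Fill the NaN with the median of the remaining values (0.5 here).
imputed = SimpleImputer(strategy="median").fit_transform(masked)
print(imputed.ravel())

# Ordinal encoding with an explicit worst-to-best order.
cut_enc = OrdinalEncoder(
    categories=[["Fair", "Good", "Very Good", "Premium", "Ideal"]]
)
codes = cut_enc.fit_transform(pd.DataFrame({"cut": ["Ideal", "Fair", "Premium"]}))
print(codes.ravel())  # [4. 0. 3.]
```

Passing the category order explicitly matters here: without it, `OrdinalEncoder` sorts the labels alphabetically, which would not reflect the worst-to-best ranking.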

In [20]:
model = LinearRegression()

In [21]:
# Bundle preprocessing and modeling code in a pipeline
Baseline1 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
Baseline1.fit(X_train2, y_train2)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('outlier',
                                                                   FunctionTransformer(func=<function Outlier_Detector at 0x000001DAFB68A790>,
                                                                                       kw_args={'factor': 1.5})),
                                                                  ('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                                  OrdinalEncoder(categories=[['Fair',
                                                                              'Good',
                 

In [22]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


# Preprocessing of validation data, get predictions
b1_val_preds = Baseline1.predict(X_val)

# Evaluate the model
b1_val_mae = mean_absolute_error(y_val, b1_val_preds)
b1_val_mse = mean_squared_error(y_val, b1_val_preds)
b1_val_r2 = r2_score(y_val, b1_val_preds)

print('MAE:', b1_val_mae)
print("MSE: ",b1_val_mse)
print("R2: ",b1_val_r2)

MAE: 1227.2738971261278
MSE:  2652792.3125459263
R2:  0.8367188169244258


In [23]:
from sklearn.model_selection import cross_val_score
CV = cross_val_score(Baseline1, X_train, y_train, cv=5, scoring = "neg_root_mean_squared_error")
print(f"negative RMSE per fold (5-fold cross validation): {CV}")
print(f"mean negative RMSE across folds: {CV.mean()}")

negative RMSE per fold (5-fold cross validation): [-1674.34968697 -1594.12551928 -1562.22328898 -1589.48519538
 -1627.25022853]
mean negative RMSE across folds: -1609.4867838266748
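
The cross-validation scores above are negative because scikit-learn scorers follow a "higher is better" convention, so error metrics are reported with the sign flipped. A short sketch of converting the fold scores printed above back to ordinary RMSE values (the variable names are illustrative):

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
cv_scores = np.array([
    -1674.34968697, -1594.12551928, -1562.22328898,
    -1589.48519538, -1627.25022853,
])

# Flip the sign to get the usual, positive RMSE per fold.
rmse_per_fold = -cv_scores

mean_rmse = rmse_per_fold.mean()
spread = rmse_per_fold.std()   # how much the folds disagree

print(f"RMSE: {mean_rmse:.2f} +/- {spread:.2f}")
```

Reporting the spread alongside the mean gives a rough sense of how stable the model is across folds.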


In [24]:
Baseline1.fit(X_train, y_train)
b1_test_preds = Baseline1.predict(X_test)
b1_test_mae = mean_absolute_error(y_test, b1_test_preds)
b1_test_mse = mean_squared_error(y_test, b1_test_preds)
b1_test_r2 = r2_score(y_test, b1_test_preds)
print('MAE:', b1_test_mae)
print("MSE: ",b1_test_mse)
print("R2: ",b1_test_r2)

MAE: 1249.8900248224795
MSE:  2697964.8791565355
R2:  0.8331690897943889


In [25]:
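# negative RMSE (sign flipped to match the "neg_root_mean_squared_error" scorer), not a normalized RMSE
print("NRMSE: ",-np.sqrt(b1_test_mse))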

NRMSE:  -1642.5482882267222
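
As a quick arithmetic check, the RMSE is simply the square root of the MSE, which brings the error back to the units of the target. The sketch below uses the MSE values printed in the earlier outputs.

```python
import math

# MSE values copied from the validation and test outputs above.
val_mse = 2652792.3125459263
test_mse = 2697964.8791565355

val_rmse = math.sqrt(val_mse)
test_rmse = math.sqrt(test_mse)

print(round(test_rmse, 2))  # 1642.55
print(f"validation RMSE: {val_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

Note that the hold-out validation RMSE computed this way is not the same number as the mean cross-validated RMSE, since the two are measured on different data splits.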


In [28]:
# Collect validation and test metrics in one row (this rebinds Baseline1 from the pipeline to a DataFrame)
Baseline1 = pd.DataFrame({
    "val_mae": b1_val_mae, "val_mse": b1_val_mse,
    "val_r2": b1_val_r2, "val_nrmse": CV.mean(),
    "test_mae": b1_test_mae, "test_mse": b1_test_mse,
    "test_r2": b1_test_r2, "test_nrmse": -np.sqrt(b1_test_mse),
}, index=["Baseline1"])

Baseline1

| | val_mae | val_mse | val_r2 | val_nrmse | test_mae | test_mse | test_r2 | test_nrmse |
|---|---|---|---|---|---|---|---|---|
| Baseline1 | 1227.273897 | 2652792.0 | 0.836719 | -1609.486784 | 1249.890025 | 2697965.0 | 0.833169 | -1642.548288 |
