# Diamond Prediction

After the exploratory data analysis, the next stage is prediction.
The prediction follows these steps:
1. **Preprocessing** - handle outliers, feature engineering.
2. **Predictions** - predict on the validation data, then on the test data, and record metrics.
3. **Evaluation** - baselines for comparison.

## Libraries & settings

In [50]:
#numpy
import numpy as np

from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

#pipeline tools
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from pipelinehelper import PipelineHelper
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
#time related
from timeit import default_timer as timer
from datetime import timedelta

#timer for entire code
start = timer()

#warning handling
import warnings
warnings.filterwarnings("ignore")

#plotly
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

## Baseline 1: Basic approach

### Decisions:
* **Preprocessing:**
    1. **Outliers:** **categorical** - fill with the most frequent value; **continuous** - fill with the median.
    2. **Feature engineering:** only the categorical features, encoded with `OrdinalEncoder`.
* **Model training** - fit a simple linear regression model, then score it on the validation data and with cross-validation.
* **Model testing** - train on the whole train + validation set and use the test data for results.
* **Model evaluating** - record the following metrics for validation and test:
    1. MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
    2. R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    3. MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)
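
The cells below use `X_train2`/`X_val` for validation and `X_train`/`X_test` for testing, but the split itself is not shown in this section. A minimal sketch of how such a nested split could be produced, assuming `train_test_split` and hypothetical split fractions and synthetic data (none of these come from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); the notebook works on the diamonds data instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=1000)

# Hold out a test set first. The 20% fraction is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Split the remaining data again into a smaller training set and a validation set.
# X_train stays the "train + validation" set used for the final fit.
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)

print(X_train2.shape, X_val.shape, X_test.shape)
```

With this layout, fitting on `X_train2` and scoring on `X_val` gives the validation metrics, while refitting on `X_train` and scoring on `X_test` gives the test metrics.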
    

**Preprocessing**

In [134]:
# Preprocessing for continuous data: replace missing values with the column median
continuous_transformer = SimpleImputer(strategy='median')

# Building categorical transformers (categories listed from worst to best,
# so a higher code means a better grade)
cut_enc = OrdinalEncoder(categories=[["Fair", "Good", "Very Good", "Premium", "Ideal"]])
color_enc = OrdinalEncoder(categories=[["J", "I", "H", "G", "F", "E", "D"]])
clarity_enc = OrdinalEncoder(categories=[["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]])

# Create preprocessor: each transformer is applied only to its own columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', continuous_transformer, Continuous),
        ('cuts', cut_enc, ["cut"]),
        ('colors', color_enc, ["color"]),
        ('clarities', clarity_enc, ["clarity"])
    ])
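
To illustrate what the explicit `categories` argument does, here is a small sketch on a toy column. The toy array and the names `toy_cut`, `cut_order`, and `toy_enc` are made up for illustration; only the category order is taken from the cell above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column input (hypothetical), shaped (n_samples, 1) as the encoder expects.
toy_cut = np.array(
    [["Ideal"], ["Fair"], ["Premium"], ["Good"], ["Very Good"]], dtype=object
)

# Same worst-to-best order as cut_enc above.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
toy_enc = OrdinalEncoder(categories=[cut_order])

# Each label is replaced by its position in cut_order (as a float).
codes = toy_enc.fit_transform(toy_cut).ravel()
for label, code in zip(toy_cut.ravel(), codes):
    print(label, "->", code)
```

Without the explicit list, `OrdinalEncoder` would order the labels alphabetically, so the codes would no longer reflect quality. With the default settings, a label that is missing from the list raises an error at fit or transform time.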

In [135]:
model = LinearRegression()

In [136]:
# Bundle preprocessing and modeling code in a pipeline
Baseline1 = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
Baseline1.fit(X_train2, y_train2)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='median'),
                                                  ['carat', 'depth', 'table',
                                                   'x', 'y', 'z']),
                                                 ('cuts',
                                                  OrdinalEncoder(categories=[['Fair',
                                                                              'Good',
                                                                              'Very '
                                                                              'Good',
                                                                              'Premium',
                                                                              'Ideal']]),
                                                  ['cut']),
                                       

In [139]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


# Preprocessing of validation data, get predictions
b1_val_preds = Baseline1.predict(X_val)

# Evaluate the model
b1_val_mae = mean_absolute_error(y_val, b1_val_preds)
b1_val_mse = mean_squared_error(y_val, b1_val_preds)
b1_val_r2 = r2_score(y_val, b1_val_preds)

print('MAE:', b1_val_mae)
print("MSE: ",b1_val_mse)
print("R2: ",b1_val_r2)

MAE: 804.1298331411481
MSE:  1400528.1039965793
R2:  0.9137965363252714
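
For reference, the three metrics printed above can also be computed directly from their definitions. The sketch below uses four made-up values (not taken from the diamonds data) and checks the hand-written formulas against the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([300.0, 500.0, 1000.0, 2500.0])
y_pred = np.array([350.0, 450.0, 1200.0, 2300.0])

errors = y_true - y_pred

# MAE: average absolute error, in the same units as the target.
mae = np.mean(np.abs(errors))

# MSE: average squared error, which penalizes large errors more heavily.
mse = np.mean(errors ** 2)

# R2: 1 minus the ratio of the residual sum of squares to the total sum of squares.
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-written versions agree with scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print("MAE:", mae)
print("MSE:", mse)
print("R2:", r2)
```

Because MSE is in squared units of the target, it is much larger than MAE; taking its square root brings it back to the target's units.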


In [144]:
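from sklearn.model_selection import cross_val_score

# scikit-learn scorers follow "higher is better", so RMSE is reported negated;
# values closer to 0 mean a better fit
CV = cross_val_score(Baseline1, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")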
print(f"validation negative root mean squared error on 5 fold cross validation: {CV}")
print(f"validation negative root mean squared error accuracy: {CV.mean()}")

validation negative root mean squared error on 5 fold cross validation: [-1249.40493603 -1199.18055605 -1202.33643443 -1213.50834468
 -1224.87038415]
validation negative root mean squared error accuracy: -1217.8601310704983
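
The scores above are negated RMSE values, so flipping the sign gives the error in the target's units. As a rough sketch, the snippet below reuses the numbers printed above (copied by hand) and compares the cross-validated RMSE with the square root of the hold-out validation MSE. The two were obtained on different data (`X_train` with 5 folds versus `X_val`), so only the order of magnitude is expected to match.

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
cv_scores = np.array([
    -1249.40493603, -1199.18055605, -1202.33643443,
    -1213.50834468, -1224.87038415,
])

# Flip the sign to get ordinary (positive) RMSE per fold.
cv_rmse = -cv_scores
print("CV RMSE mean:", cv_rmse.mean())
print("CV RMSE std: ", cv_rmse.std())

# RMSE on the hold-out validation set is the square root of the validation MSE.
val_mse = 1400528.1039965793
val_rmse = np.sqrt(val_mse)
print("Validation RMSE:", val_rmse)
```

The small spread across folds suggests that the linear model's error is stable across different subsets of the training data.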


In [145]:
# Refit on the whole train + validation set, then evaluate once on the test data
Baseline1.fit(X_train, y_train)
b1_test_preds = Baseline1.predict(X_test)
b1_test_mae = mean_absolute_error(y_test, b1_test_preds)
b1_test_mse = mean_squared_error(y_test, b1_test_preds)
b1_test_r2 = r2_score(y_test, b1_test_preds)
print('MAE:', b1_test_mae)
print("MSE: ",b1_test_mse)
print("R2: ",b1_test_r2)

MAE: 803.191441879669
MSE:  1461801.62701735
R2:  0.9096082762754162
