# **Build an AI Web App for Diamond Price Prediction 💎**

Estimated time needed: **90** minutes

This project is designed to help you understand the key steps in building machine learning models, evaluating their performance, and deploying the results using the open-source Mercury framework. The end result will resemble [the demo](https://ibm.runmercury.com/app/app-1) we've prepared.

## Objectives
By the end of this hands-on project, you will be able to:
- Master the data preprocessing pipeline, including feature selection, encoding, and handling outliers.
- Explore and understand the concepts of correlation analysis and feature visualization.
- Build and evaluate various machine learning models, interpreting metrics like MAE, MSE, and R-squared.
- Identify and choose the most effective model for making future diamond price predictions.
- Use the open-source framework Mercury to share your predictions with a global audience through the web.

Now, let's dive into the project step by step:

---

# Step 1: Setup

For this project, we will be using the following libraries:

*   `pandas` for managing the data.
*   `numpy` for mathematical operations.
*   `sklearn` for machine learning and machine-learning-pipeline related functions.
*   `seaborn` for visualizing the data.
*   `matplotlib` for additional plotting tools.
*   `mercury` for building a web app.

### Importing Required Libraries

In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=False)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
# Regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, ElasticNet
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,GradientBoostingRegressor,AdaBoostRegressor 
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning, module="numpy")
warnings.filterwarnings("ignore", category=DeprecationWarning, module="sklearn")
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")
warnings.filterwarnings("ignore", category=UserWarning, module="seaborn")

---

# Step 2: Data Loading and Preprocessing
In this section, we'll start by loading the diamond price dataset from a provided URL. We'll then preprocess the data to make it suitable for our machine learning tasks.

Tasks:

1. Load the dataset into a pandas DataFrame.
2. Data preprocessing including:
    - Remove unnecessary columns like index and dimensions.
    - Convert categorical features (cut, color, clarity) into numerical representations.
    - Calculate correlation coefficients and visualize them using a heatmap.
3. Feature visualization
4. Split the data and handle outliers using Isolation Forest

## Task 1: Load the Diamond Price Dataset
There are 10 attributes in the dataset, including the target, i.e., price.

Feature description:
- `price`: price in US dollars (\$326--\$18,823)
- `carat`: weight of the diamond (0.2--5.01)
- `cut`: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- `color`: diamond color, from J (worst) to D (best)
- `clarity`: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- `x`: length in mm (0--10.74)
- `y`: width in mm (0--58.9)
- `z`: depth in mm (0--31.8)
- `depth`: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- `table`: width of top of diamond relative to widest point (43--95)

In [ ]:
# URL for the diamond price dataset
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(URL)

df.head()

In [ ]:
print(f"""The DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.""")

## Task 2: Data Preprocessing

### Remove Unnecessary Data
The first column appears to be just a row index, so we can remove it.


In [ ]:
df = df.drop(["s"], axis=1)

The dimensions of the diamond (length x, width y, depth z) have a high positive correlation with carat. Strongly correlated features can indicate multicollinearity, which affects model stability and interpretability, so we drop x, y, and z.

In [ ]:
df = df.drop(["x","y",'z'], axis=1)
df.head()
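
The reasoning behind dropping the dimensions can be checked with a correlation table. The sketch below uses synthetic data rather than the diamond dataset: the DataFrame `toy`, the constants, and the noise levels are assumptions chosen only so that weight grows roughly with volume.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diamonds data (an assumption for illustration):
# dimensions in mm, with weight roughly proportional to volume.
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 9.0, size=500)
y = x + rng.normal(0.0, 0.05, size=500)
z = 0.62 * x + rng.normal(0.0, 0.05, size=500)
carat = 0.0061 * x * y * z + rng.normal(0.0, 0.01, size=500)

toy = pd.DataFrame({"carat": carat, "x": x, "y": y, "z": z})

# Pairwise Pearson correlations of each dimension with carat
carat_corr = toy.corr()["carat"].drop("carat")
print(carat_corr.round(3))
```

On the real data, the same check would be `df[["carat", "x", "y", "z"]].corr()`, run before the three columns are dropped.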

## Convert categorical features (e.g., 'cut', 'color', 'clarity') to numerical

In [ ]:
# Ordinal encoding from best (1) to worst (5): Ideal, Premium, Very Good, Good, Fair
df["cut"] = df["cut"].map({"Ideal": 1, "Premium": 2, "Very Good": 3, "Good": 4, "Fair": 5})
df["color"] = df["color"].map({"D": 1, "E": 2, "F": 3, "G": 4, "H": 5, "I": 6, "J": 7})
df["clarity"] = df["clarity"].map({"IF": 1, "VVS1": 2, "VVS2": 3, "VS1": 4, "VS2": 5, "SI1": 6, "SI2": 7, "I1": 8})

df.head()
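
`Series.map` with a dictionary leaves `NaN` for any label that is missing from the dictionary, so a quick check after encoding can catch typos or unexpected categories. The following is a minimal sketch on a hypothetical sample; the label `"K"` is made up and deliberately absent from the mapping.

```python
import pandas as pd

# Hypothetical sample; "K" is not a key in the mapping below
sample = pd.Series(["D", "G", "K", "J"], name="color")
color_order = {"D": 1, "E": 2, "F": 3, "G": 4, "H": 5, "I": 6, "J": 7}

encoded = sample.map(color_order)

# Labels that the mapping did not cover end up as NaN
unmapped = sample[encoded.isna()].unique().tolist()
print(encoded.tolist())               # [1.0, 4.0, nan, 7.0]
print("Unmapped labels:", unmapped)   # Unmapped labels: ['K']
```

Applied to the notebook, `df[["cut", "color", "clarity"]].isna().sum()` after the mapping cell should report zero for every column.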

In [ ]:
df.describe()

## Correlation Analysis
Calculate the correlation coefficients between the features and the target variable (price). This will print the correlation coefficients of all features with the target variable in descending order. Positive values indicate a positive correlation, and negative values indicate a negative correlation.

In [ ]:
correlation_matrix = df.corr()
correlation_with_target = correlation_matrix['price'].sort_values(ascending=False)
print(correlation_with_target)

### Visualize Correlations
This helps identify features with strong correlations (both positive and negative) with the target variable.

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

The heatmap above shows that "cut", "depth", and "table" are only weakly correlated with price. This suggests that these features contribute little to our predictions, so we can remove them from our model:

In [ ]:
df = df.drop(["cut", "depth", "table"], axis=1)
df.head()

## Task 3: Feature Visualization 

In [ ]:
p = sns.catplot(x='color', data=df , kind='count',aspect=2.5 )

In [ ]:
p = sns.catplot(x='color', y='price', data=df, kind='box' ,aspect=2.5 )

In [ ]:
p = sns.catplot(x='clarity', data=df , kind='count',aspect=2.5 )

In [ ]:
p = sns.catplot(x='clarity', y='price', data=df, kind='box' ,aspect=2.5)

## Task 4: Data Splits and Handle Outliers Using Isolation Forest

In [ ]:
# Split the data into features (X) and target (y)
X = df.drop(["price"], axis=1)
y = df["price"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                test_size=0.2, 
                                                random_state=42)

# Check the shape of the training dataset
print("Shape of training data - X_train:", X_train.shape)
print("Shape of training data - y_train:", y_train.shape)

Now we can use a tree-based anomaly detection algorithm, Isolation Forest (iForest for short), to remove outliers automatically. It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space. The outliers are removed from the training set only, so the test set stays untouched.

In [ ]:
# Identify outliers in the training dataset using Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
outlier_predictions_train = iso.fit_predict(X_train)

# Select non-outlier rows
non_outlier_mask = outlier_predictions_train != -1
X_train, y_train = X_train.loc[non_outlier_mask], y_train.loc[non_outlier_mask]

# Summarize the shape of the updated training dataset
print("Shape of updated training data - X_train:", X_train.shape)
print("Shape of updated training data - y_train:", y_train.shape)
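
To see what `fit_predict` returns, here is a minimal sketch on synthetic two-dimensional points (not the diamond data). The arrays `normal` and `far` and the name `iso_demo` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 "normal" points near the origin plus 10 points far away (synthetic data)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far = rng.uniform(low=8.0, high=10.0, size=(10, 2))
points = np.vstack([normal, far])

# contamination=0.1 sets the threshold so that roughly 10% of rows are flagged
iso_demo = IsolationForest(contamination=0.1, random_state=42)
labels = iso_demo.fit_predict(points)   # 1 = inlier, -1 = outlier

n_flagged = int((labels == -1).sum())
print("Flagged as outliers:", n_flagged, "of", len(points))
print("Far points flagged:", int((labels[-10:] == -1).sum()), "of 10")
```

Because `contamination` fixes the share of flagged rows, a value of 0.1 removes about 10% of the training data whether or not that many rows are truly anomalous, so it is worth comparing the shapes before and after filtering.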

---

# Step 3: Build Models To Predict Diamond Prices
In this step, we will utilize a variety of machine learning algorithms, including 'Linear Regression', 'Lasso Regression', 'AdaBoost Regression', 'Ridge Regression', 'Gradient Boosting Regression', 'Random Forest Regression', and 'KNeighbours Regression'. Our aim is to train these models on our dataset and subsequently evaluate their performance. By doing so, we can identify the most suitable model for predicting diamond prices in the subsequent stages of our project.

- **Linear Regression:** This algorithm establishes a linear relationship between the input features and the target variable, aiming to predict numeric values.

- **Lasso Regression:** Similar to linear regression, but this technique adds an L1 penalty term to the model's cost function, which encourages the selection of only the most influential features and helps prevent overfitting.

- **AdaBoost Regression:** This boosting algorithm constructs a strong predictive model by combining the outputs of several weak models in an iterative manner.

- **Ridge Regression:** Like linear regression, ridge regression aims to predict outcomes. However, it introduces L2 regularization to manage multicollinearity and enhance model stability.

- **Gradient Boosting Regression:** This ensemble method builds a powerful model by adding weak learners sequentially, with each new learner fitted to the residual errors of the ensemble built so far.

- **Random Forest Regression:** Another ensemble technique, random forest regression, creates multiple decision trees and aggregates their predictions to provide robust and accurate results.

- **KNeighbours Regression:** This instance-based algorithm predicts outcomes by considering the outcomes of nearby data points in the feature space.
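
The difference between the Lasso and Ridge penalties described above can be seen in the fitted coefficients. This is a small sketch on synthetic data where only the first feature matters; `X_demo`, `y_demo`, and `alpha=0.5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only the first feature drives the target; the other two are pure noise
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Lasso (L1) can set coefficients exactly to zero; Ridge (L2) only shrinks them
print("Lasso coefficients:", lasso.coef_.round(3))
print("Ridge coefficients:", ridge.coef_.round(3))
```

With this setup, Lasso drops the two noise features by giving them a coefficient of exactly zero, while Ridge keeps small non-zero values for them.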

## Model Evaluation
- **Mean Absolute Error (MAE):** The MAE represents the average absolute difference between the predicted values and the actual values. It can be calculated using the formula:
```
MAE = (1/n) * Σ | y_i - ŷ_i |
```
    Where:
    - n is the number of data points.
    - y_i is the actual value for the i-th data point.
    - ŷ_i is the predicted value for the i-th data point.
    Lower MAE values indicate better performance.

- **Mean Squared Error (MSE):** The MSE calculates the average squared difference between predicted and actual values:
```
MSE = (1/n) * Σ (y_i - ŷ_i)^2
```
    Where:
    - n is the number of data points.
    - y_i is the actual value for the i-th data point.
    - ŷ_i is the predicted value for the i-th data point.
    Lower MSE values indicate better performance.

- **R-squared Score (Coefficient of Determination)**: The R-squared score measures the proportion of variance predictable by the model:
```
R^2 = 1 - Σ (y_i - ŷ_i)^2 / Σ (y_i - ȳ)^2
```
    Where:
    - y_i is the actual value for the i-th data point.
    - ŷ_i is the predicted value for the i-th data point.
    - ȳ is the mean of the actual values.
    Higher R^2 values indicate a better fit. The best possible score is 1, and the score can be negative on test data when the model predicts worse than always predicting ȳ.
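
The three formulas can be verified by hand on a few numbers. The sketch below uses four hypothetical prices, chosen only to make the arithmetic easy to follow, and compares the manual results with the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([500.0, 1000.0, 1500.0, 2000.0])
y_hat = np.array([600.0, 900.0, 1500.0, 2200.0])

errors = y_true - y_hat                       # [-100, 100, 0, -200]
mae_manual = np.mean(np.abs(errors))          # (100 + 100 + 0 + 200) / 4
mse_manual = np.mean(errors ** 2)             # (10000 + 10000 + 0 + 40000) / 4
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae_manual)              # MAE: 100.0
print("MSE:", mse_manual)              # MSE: 15000.0
print("R^2:", round(r2_manual, 3))     # R^2: 0.952

# The library functions give the same values
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

Note that MSE is expressed in squared dollars, which is why it is so much larger than MAE for the same predictions.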

In [ ]:
# Collect all R2 scores
R2_scores = []
models = ['Linear Regression' , 'Lasso Regression' , 'AdaBoost Regression' , 'Ridge Regression' , 'Gradient Boosting Regression',
          'Random Forest Regression' ,
         'KNeighbours Regression']

## Linear Regression

In [ ]:
# Create and train the linear regression model
clf_lr = LinearRegression()
clf_lr.fit(X_train, y_train)

# Predict diamond prices for the test set using the linear regression model
y_pred = clf_lr.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.901 suggests that about 90% of the variance in diamond prices can be explained by the linear regression model.

## Lasso Regression 

In [ ]:
# Create and train the Lasso regression model
clf_la = Lasso()
clf_la.fit(X_train , y_train)

# Predict diamond prices for the test set using the Lasso regression model
y_pred = clf_la.predict(X_test)


# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.901 suggests that about 90% of the variance in diamond prices can be explained by the Lasso regression model.

## AdaBoost Regression

In [ ]:
# Create and train the AdaBoost regression model
clf_ar = AdaBoostRegressor()
clf_ar.fit(X_train , y_train)

# Predict diamond prices for the test set using the AdaBoost regression model
y_pred = clf_ar.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.934 suggests that about 93% of the variance in diamond prices can be explained by the AdaBoost regression model.

## Ridge Regression 

In [ ]:
# Create and train the Ridge regression model
clf_rr = Ridge()
clf_rr.fit(X_train , y_train)

# Predict diamond prices for the test set using the Ridge regression model
y_pred = clf_rr.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.901 suggests that about 90% of the variance in diamond prices can be explained by the Ridge regression model.

## Gradient Boosting Regression

In [ ]:
# Create and train the GradientBoosting regression model
clf_gbr = GradientBoostingRegressor()
clf_gbr.fit(X_train , y_train)

# Predict diamond prices for the test set using the Gradient Boosting regression model
y_pred = clf_gbr.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.965 suggests that about 97% of the variance in diamond prices can be explained by the Gradient Boosting regression model.

## Random Forest Regression 

In [ ]:
# Create and train the Random Forest regression model
clf_rf = RandomForestRegressor()
clf_rf.fit(X_train , y_train)

# Predict diamond prices for the test set using the Random Forest regression model
y_pred = clf_rf.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.962 suggests that about 96% of the variance in diamond prices can be explained by the Random Forest regression model.

## KNeighbours Regression 

In [ ]:
# Create and train the KNeighbours regression model
clf_knn = KNeighborsRegressor()
clf_knn.fit(X_train , y_train)

# Predict diamond prices for the test set using the KNeighbours regression model
y_pred = clf_knn.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: %.3f" %mae)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: %.3f" %mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)
print("R-squared Score: %.3f" %r2)

R2_scores.append(r2)

In this case, an R-squared score of approximately 0.867 suggests that about 87% of the variance in diamond prices can be explained by the KNeighbours regression model.
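
The cells above repeat the same fit, predict, and score pattern for every model. One possible way to condense it is a small helper plus a loop. This is a sketch under stated assumptions: the function name `evaluate` and the synthetic data are not part of the original notebook, and only three of the seven models are shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return its test-set MAE, MSE, and R-squared."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "MAE": mean_absolute_error(y_te, pred),
        "MSE": mean_squared_error(y_te, pred),
        "R2": r2_score(y_te, pred),
    }


# Synthetic regression data stands in for the diamond features here
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(400, 3))
y_demo = 3000 * X_demo[:, 0] ** 2 + 200 * X_demo[:, 1] + rng.normal(0, 50, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Gradient Boosting Regression": GradientBoostingRegressor(random_state=42),
}
results = {name: evaluate(model, X_tr, y_tr, X_te, y_te) for name, model in candidates.items()}
for name, scores in results.items():
    print(f"{name}: R2 = {scores['R2']:.3f}")
```

Passing `random_state` to the ensemble models, as done here, makes the scores repeatable between runs.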

## R2-Score Visualization

In [ ]:
compare = pd.DataFrame({'Algorithms' : models , 'R2 Scores' : R2_scores})
compare.sort_values(by='R2 Scores' ,ascending=False).style.background_gradient(cmap="Reds")

In [ ]:
sns.catplot(x='Algorithms', y='R2 Scores' , data=compare, kind='point', aspect=5)

Since Gradient Boosting achieved the highest R2 score among all the models, we will use it exclusively for the subsequent diamond price prediction.

# Step 4: Diamond Price Prediction

You can use the following code to make predictions about diamond prices based on user input:

```
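a = float(input("Carat Size: "))
b = int(input("Color (1-7): "))
c = int(input("Clarity (1-8): "))
# Build a one-row DataFrame with the same column names the model was trained on
features = pd.DataFrame([[a, b, c]], columns=X_train.columns)
print("Predicted Diamond's Price =", clf_gbr.predict(features)[0])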
```

Because Mercury Cloud does not display user input prompts, running this code within the app on Mercury Cloud can lead to errors. Do not execute this code within the notebook you upload to Mercury.
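
A function that takes the three values as arguments avoids `input()` altogether and can validate the encoded ranges first. The following is a minimal sketch, not part of the original notebook: `predict_price`, `FEATURES`, and the toy model trained on made-up data are all assumptions so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["carat", "color", "clarity"]


def predict_price(model, carat, color, clarity):
    """Validate the encoded inputs and return a single predicted price."""
    if carat <= 0:
        raise ValueError("carat must be positive")
    if not 1 <= color <= 7:
        raise ValueError("color must be between 1 and 7")
    if not 1 <= clarity <= 8:
        raise ValueError("clarity must be between 1 and 8")
    row = pd.DataFrame([[carat, color, clarity]], columns=FEATURES)
    return float(model.predict(row)[0])


# Toy model trained on made-up data, only so the sketch is self-contained
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "carat": rng.uniform(0.2, 3.0, 300),
    "color": rng.integers(1, 8, 300),
    "clarity": rng.integers(1, 9, 300),
})
toy_y = 4000 * toy_X["carat"] - 100 * toy_X["color"] - 100 * toy_X["clarity"]
toy_model = GradientBoostingRegressor(random_state=0).fit(toy_X, toy_y)

print(round(predict_price(toy_model, 1.0, 3, 4), 2))
```

In the notebook itself, the call would be `predict_price(clf_gbr, 1.0, 3, 4)` with the trained Gradient Boosting model.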

---

# Step 5: Deploy the Notebook as a Web App Using Mercury
The final step is to deploy our model as an interactive web app using the open-source framework Mercury. This will allow users to input diamond attributes and receive instant price predictions.

## Understanding Mercury
### 1. What is Mercury?
Mercury Server is the core of our web app framework. It's built using technologies like Django, Django Channels, and React. Imagine it as a bridge between our notebooks and the web browser. When a user opens a notebook in a web browser, Mercury Server steps in and establishes a connection, much like a virtual highway, between the browser and itself.

### 2. How It Works:

- **WebSocket Connection:** This connection is established between the user's web browser and Mercury Server. It's like a live channel of communication that lets data flow back and forth seamlessly.
- **Mercury Server and Worker:** Mercury Server ensures there's a "worker" ready to handle requests. Think of this worker as a skilled assistant. It connects to Mercury Server via WebSocket.
- **Action Forwarding:** When a user interacts with the app in their browser, every action they take is forwarded by Mercury Server to the worker. It's like passing notes between them.

- **Kernel Magic:** The worker maintains an open IPython kernel and understands the code from our notebook. It's our app's brain. When users interact with widgets, the worker's kernel executes the code related to those interactions.

- **Results Flow:** The worker sends the results of these executions back to the user's browser through Mercury Server. Imagine getting a quick answer after asking a question.

### 3. Why Mercury?
Mercury is the simplest way to transform our notebooks into web apps. It offers great features:
- You can show or hide your code.
- Your users can easily export the executed notebook to PDF/HTML.
- There is built-in authentication, so you can control who accesses your app.
- You can produce files in the notebook and make them downloadable.
- You can share multiple notebooks with Mercury Cloud.

## Task 0: Getting Started with Web App Transformation
To kickstart the process of turning your Jupyter Notebook into an interactive web app, make sure to retain the Mercury code cells at the end of this notebook. <font color="red">There's no need to execute that code here, as it's incompatible with the notebook version. However, it will serve as a foundational element for the upcoming steps.</font>

## Task 1: Download the Notebook
Look for the "Download Notebook" button at the top of the notebook's toolbar to download the notebook.

## Task 2: Setup a Mercury Account
1. Sign up for a free account at [Mercury](https://cloud.runmercury.com/register).
2. After signing up, log in to your Mercury account.

## Task 3: Create Your First Site
1. Look for the green **"+ Add Site"** button and click it.
2. In the "Title of your website" field, enter `Diamond Price Prediction`.
3. In the "Subdomain at which website will be available" field, enter `demo`. (Site subdomain names must be unique, so if the chosen subdomain is already taken, be prepared to pick a different one.)
4. Click the green **"OK"** button at the bottom of the page.

## Task 4: Upload Your File
1. Click the **"Upload Files"** button on the right side of the site you just created.
2. Upload your downloaded `ipynb` notebook from Task 1 into your new site. 

## Task 5: Open the App
1. Return to the sites menu by clicking the **"Sites"** button on the top navigation bar.
2. Click the link associated with the `Diamond Price Prediction` site.
3. Voilà! You've successfully accessed your site where you can explore and interact with your uploaded notebook.

In [ ]:
import mercury as mr
import math

# Configure the app:
# The `App` class controls how the notebook is displayed in Mercury.
app = mr.App(title="💎 Diamond Price Prediction",
        description="",
        show_code=False)

# Populate the app with widgets:
# Use the `mr.Note` widget to display a Markdown text. It's used here to prompt the user to select a metric using the slider.
mr.Note(text="__Select the metrics to see the prediction__")

# Calculate the maximum and minimum values of the feature columns in the `DataFrame`. These values will define the range of the sliders.
c1_max = df['carat'].max()
c1_min = df['carat'].min()
# a numeric input widget that accepts decimal numbers
carat = mr.Numeric(label="Carat", value=c1_min, min=c1_min, max=c1_max, step=0.1)

c2_max = df['color'].max()
c2_min = df['color'].min()
# a slider widget capable of handling whole numbers
color = mr.Slider(label="Color", value=c2_min, min=c2_min, max=c2_max)

c3_max = df['clarity'].max()
c3_min = df['clarity'].min()
clarity = mr.Slider(label="Clarity", value=c3_min, min=c3_min, max=c3_max)

In [ ]:
# Predict the price for the selected values and display it in a large box
predicted_price = clf_gbr.predict(np.array([[carat.value, color.value, clarity.value]]))[0]
mr.NumberBox(data="Predicted Price in USD: $" + str(math.ceil(predicted_price)))

In [ ]:
# display numbers in a row of three boxes
mr.NumberBox([
    mr.NumberBox(data=carat.value, title="Carat"),
    mr.NumberBox(data=color.value, title="Color"),
    mr.NumberBox(data=clarity.value, title="Clarity")
])

---

# Congratulations! You have completed the project

## Authors

[Vicky Kuo](https://author.skills.network/instructors/vicky_kuo)

## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-01|0.1|Vicky Kuo|Initial Lab Created|

Copyright © 2023 IBM Corporation. All rights reserved.