# Group Project: Determining Diamond Cut Grades Using KNN Classification

**Section 009 Group 2**

**Ziqing Wang**<br>**Anna Tao**<br>**Ruby de Lang**

### Introduction

The 4Cs (cut, clarity, color, and carat weight) are internationally accepted standards for assessing the quality of a diamond. Diamond cut grade is a pivotal factor in determining the beauty and value of a diamond.

The dataset being used reports on the characteristics of diamonds. 
We want to use the KNN classification method to predict the cut grades of diamonds.

The columns of the dataset:
* **carat**: a unit of measurement for a diamond's weight.
* **cut**: cut grade of the diamond, on a five-level scale (high to low): Ideal, Premium, Very Good, Good, Fair.
* **color**: color is graded on a scale from D (colorless) to Z (light yellow or brown).
* **clarity**: the presence of internal and external flaws within a diamond.
* **depth**: the distance from the table to the culet (the bottom of the diamond).
* **table**: the flat, topmost facet of diamonds.
* **price**: the price of diamonds.
* **x**: the x-dimension of diamonds.
* **y**: the y-dimension of diamonds.
* **z**: the z-dimension of diamonds.
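
As a rough illustration of how the dimension columns can relate to `depth`: the documentation for the `diamonds` dataset bundled with ggplot2 defines depth as a total depth percentage, `2 * z / (x + y)`. Assuming the CSV used here follows that same definition (this is not verified in this report), the relationship can be sketched as below. The measurements are made-up values, not rows from the dataset.

```r
# Hypothetical measurements in mm (not taken from the dataset)
x <- 4.0
y <- 4.1
z <- 2.5

# Total depth percentage under the assumed definition: z relative to mean(x, y)
depth_pct <- 100 * 2 * z / (x + y)
round(depth_pct, 1)
```

If this definition holds for our data, depth already summarizes part of the information in x, y, and z.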

### Method Overview:

Our project aims to determine whether we can use diamond data to predict the cut of a diamond.

1. Preliminary Exploratory Data Analysis: Prepare our dataset by reading and wrangling for further analysis.
2. Splitting Data: Split the dataset into a training set and a testing set. Summarize the distribution of the target variable and the predictor variables for the training data.
3. Select Predictor Variables: Check for the relationship between each variable and the cut quality. Eliminate the variables with no visible correlation.
4. Create a Classification Model: Employ the K-nearest neighbors classification algorithm to identify the optimal K value. After identifying the most suitable value, run the model on the test set to check the accuracy value. 
5. Fulfill Curiosity: Check whether the addition of predictor variables increases or decreases the accuracy of predictions. We will do this by adding predictor variables to our recipe, then running that model on the test set to check the accuracy value.


### 1. Preliminary exploratory data analysis

In [12]:
install.packages("tidyverse")
install.packages("cowplot")
install.packages("tidymodels")
install.packages("kknn")
library(tidyverse)
library(ggplot2)
library(repr)
library(tidymodels)
library(readr)
library(cowplot)




**Reading Data from Online Source Into R**

After reading the CSV file from an online source into a data frame in Jupyter, we have 53940 recorded observations.

In [13]:
diamond_data <- read_csv("https://raw.githubusercontent.com/rubydelang/sonar_data/main/diamonds.csv") 

head(diamond_data)



**Mutating Data**

For our target variable (the cut column), we need to convert the type to a factor. This will allow us to predict the cut quality as a category.

In [None]:
diamond_data <- diamond_data |>
mutate(cut = as_factor(cut))

**Checking for Missing Data** 

The 'na_rows' variable counts the number of rows containing missing data; a result of 0 means we do not have any missing data in this dataset.

In [None]:
na_rows <- sum(apply(is.na(diamond_data), 1, any))
print(na_rows)
summary(diamond_data)
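
To make the row-wise check above easier to follow, here is a minimal sketch on a made-up data frame (the values are hypothetical, not from the diamonds data). It also shows `complete.cases()`, a base R function that should give the same count.

```r
# Toy data frame with missing values in two different rows (hypothetical data)
toy <- data.frame(
  carat = c(0.3, NA, 0.5, 0.7),
  depth = c(61.5, 62.0, NA, 60.8),
  table = c(55, 57, NA, 58)
)

# Same approach as above: flag each row that has at least one NA, then count
na_rows_apply <- sum(apply(is.na(toy), 1, any))

# Base R alternative: complete.cases() is TRUE for rows without any NA
na_rows_cc <- sum(!complete.cases(toy))

print(c(apply = na_rows_apply, complete_cases = na_rows_cc))
```

Row 3 has two missing cells but is counted once, because the check is per row rather than per cell.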

**Tidy Data**

Looking at the dataset, each row is a single observation, each column is a single variable with a meaningful name, and each cell contains only a single value. 
Therefore, the data is already tidy, so we do not need to take any further action. <br><br>

### 2. Splitting Data

**Splitting the data into training and testing data**

We have 53940 recorded observations that can be used for analysis. Our next step is to split the data into a training set and a testing set. We set the proportion to 0.75, which means 75% of our 53940 observations go to the training set and the remaining observations are held out for the testing set. We stratify the split by cut so that both sets have similar proportions of each cut grade. We set our seed to 2023 to make the random split reproducible.

In [None]:
set.seed(2023)
diamond_split <- initial_split(diamond_data, prop = 0.75, strata = cut)
diamond_training <- training(diamond_split)
diamond_testing <- testing(diamond_split)
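
The effect of `strata` can be seen on a small example. The sketch below uses a made-up, imbalanced two-class data frame (hypothetical values, not our diamonds data) and the same rsample functions as above.

```r
library(rsample)

set.seed(2023)
# Hypothetical toy data: 200 rows with an imbalanced two-level class
toy <- data.frame(
  cut = factor(rep(c("Ideal", "Fair"), times = c(160, 40))),
  depth = rnorm(200, mean = 61.7, sd = 1.4)
)

toy_split <- initial_split(toy, prop = 0.75, strata = cut)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class proportions should stay close to 0.2 / 0.8 in both pieces
prop.table(table(toy_train$cut))
prop.table(table(toy_test$cut))
```

Without stratification, a random split could by chance leave one of the sets with noticeably fewer observations of a rare class.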

**Examining the Distribution of Training data**

In the following visualization, we examine the distribution of our target variable 'cut' in the training data set.

In [None]:
# Count the number of diamonds for each cut grade in the training set
counting_types <- diamond_training |>
    group_by(cut) |>
    summarize(types_count = n()) 

# Bar chart of the counts; the legend uses the five cut grades as they appear in the data
dist <- counting_types |>
    ggplot(aes(x = cut, y = types_count, fill = cut)) +
    geom_bar(stat = "identity") +
    labs(x = "Cut Quality", y = "Diamond Count", fill = "Cut Quality") +
    ggtitle("Cut Quality Distribution") +
    theme(text = element_text(size = 20))
dist

**Predictor Distribution**

Below are predictor variables for cut quality. We can observe their distribution and select those that exhibit some type of relationship with the target variable (cut quality).

In [None]:
options(repr.plot.width = 13, repr.plot.height = 14)
x_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = x, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs x-dimension") +
labs(x ="cut grades", y = "x-dimension of diamonds", color = "cut grades")

y_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = y, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs y-dimension") +
labs(x ="cut grades", y = "y-dimension of diamonds", color = "cut grades")

z_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = z, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs z-dimension") +
labs(x ="cut grades", y = "z-dimension of diamonds", color = "cut grades")

table_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = table, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs Table") +
labs(x ="cut grades", y = "table of diamonds", color = "cut grades")

depth_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = depth, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs Depth") +
labs(x ="cut grades", y = "depth of diamonds", color = "cut grades")

carat_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = carat, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs Carat") +
labs(x ="cut grades", y = "carat of diamonds", color = "cut grades")

price_cut_graph <- diamond_training |>
ggplot(aes(x = cut, y = price, color = cut)) +
geom_boxplot() +
ggtitle("Cut Grades vs price") +
labs(x ="cut grades", y = "price of diamonds", color = "cut grades")

plot_grid(x_cut_graph, y_cut_graph, z_cut_graph, table_cut_graph, depth_cut_graph, carat_cut_graph, price_cut_graph, align = "h", ncol = 3)

The more separated the boxes are across cut grades, the more useful that variable is for prediction. Based on the graphs, none of the variables has a strong relationship with the cut grade. We can maximize the prediction accuracy by choosing the variables that are relatively more associated with cut: depth and table. These two variables show a slight change across cut grades.


In reality, cut grades are assigned based on how well the diamond reflects light. x, y, z, depth, and table describe the diamond's geometry and should be considered as predictors. However, based on the graphs we have plotted, the associations are weak. Therefore, we start with only depth and table as predictors. If the accuracy is not desirable, we will add x, y, and z and compare the accuracy. We will visualize the results by plotting the accuracy graphs. We hypothesize that adding the x, y, and z columns will not affect the accuracy much because their boxes look almost identical across cut grades.

### Building a Classification Model

First, let's choose a reasonable k value to work with. We use tuning to select the best k value, and use cross validation to make sure the accuracy is not an unlucky value. **Warning: This step needs approximately 20 minutes to perform.**
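
For intuition about what K controls, here is a minimal K-nearest-neighbors classifier written in base R. It is only an illustrative sketch: the report itself uses the kknn engine through tidymodels, and the function name `knn_predict` and the data points below are made up for this example.

```r
# Minimal KNN classifier in base R (illustrative only; the report uses kknn)
knn_predict <- function(train_x, train_y, new_point, k) {
  # Euclidean distance from new_point to every training row
  diffs <- sweep(train_x, 2, new_point)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote among those labels
  names(which.max(table(nearest)))
}

# Hypothetical standardized predictors (depth, table) and their labels
train_x <- matrix(c(-1.0, -0.8,
                    -0.9, -1.1,
                     1.0,  0.9,
                     1.2,  1.1,
                     0.1,  0.2),
                  ncol = 2, byrow = TRUE)
train_y <- c("Ideal", "Ideal", "Fair", "Fair", "Fair")

pred_k1 <- knn_predict(train_x, train_y, new_point = c(-0.7, -0.7), k = 1)
pred_k5 <- knn_predict(train_x, train_y, new_point = c(-0.7, -0.7), k = 5)
c(k1 = pred_k1, k5 = pred_k5)
```

With k = 1 the prediction follows the single closest point, while with k = 5 every point votes and the larger class wins. This is why K needs to be tuned rather than guessed.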

**Creating a recipe**

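Our first step in building the classification model is to create a recipe that prepares the training data for the model. We include both the scale and center steps so that every predictor has a mean of zero and a standard deviation of one. KNN relies on distances between observations, so putting the predictors on a common scale keeps any one variable from dominating the distance calculation. Standardizing does not change the shape of a variable's distribution.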

In [None]:
set.seed(2023)
diamond_recipe <- recipe(cut ~ depth+table, data = diamond_training) |>
step_scale(all_predictors()) |>
step_center(all_predictors())

diamond_recipe_scaled <- diamond_recipe |>
                       prep() |>
                       bake(diamond_training)
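
To show what centering and scaling do and do not change, the sketch below standardizes a made-up, right-skewed variable in base R. The data and the small `skewness` helper are assumptions for illustration only.

```r
set.seed(2023)
# Hypothetical right-skewed predictor (exponential draws)
raw <- rexp(1000, rate = 1)

# Standardize: subtract the mean, then divide by the standard deviation
standardized <- (raw - mean(raw)) / sd(raw)

# A simple moment-based skewness measure, defined here for illustration
skewness <- function(v) mean((v - mean(v))^3) / (mean((v - mean(v))^2))^1.5

round(c(mean = mean(standardized), sd = sd(standardized)), 3)
c(raw = skewness(raw), standardized = skewness(standardized))
```

The mean becomes 0 and the standard deviation becomes 1, but the skewness is the same before and after, so the variable is rescaled rather than reshaped.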

**Creating a tuning model**

After scaling the training data, our next step is to identify the best K value. To begin the tuning process, we define a model specification for classification. Setting neighbors = tune() marks the number of neighbors as a parameter to be tuned rather than fixed, so that different K values can be compared in the steps that follow.

In [None]:
set.seed(2023)

diamond_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

**Performing 5-fold cross validation**


After defining the model specification, we set up 5-fold cross-validation, which gives five distinct accuracy estimates for each candidate K. We chose five folds because our dataset is of medium size, so splitting the training data into five chunks is a reasonable choice.

In [None]:
set.seed(2023)

diamond_vfold_5 <- vfold_cv(diamond_training, v = 5, strata = cut)

**Collecting metrics for many values of K**

First we create a tibble of candidate K values using the seq() function, starting from 1, incrementing by 10, and stopping at 151 (our dataset is on the larger side, so a wider range of K values is worth checking). This grid defines the K values that the tuning step will evaluate.

In [None]:
set.seed(2023)

k_vals <- tibble(neighbors = seq(from = 1, to = 151, by = 10))
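
For reference, the candidate values produced by this `seq()` call can be listed directly in base R:

```r
# The same sequence used for the tuning grid
k_grid <- seq(from = 1, to = 151, by = 10)
k_grid
length(k_grid)
```

The grid holds 16 candidate values (1, 11, 21, and so on up to 151), so the tuning step fits 16 models on each of the five folds.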

**Workflow**

We then create a workflow that combines the initial recipe (diamond_recipe) and the tunable specification (diamond_spec), and pass it to the tune_grid() function, which fits and evaluates the model on the cross-validation folds for every K value in the tibble created earlier.

In [None]:
set.seed(2023)

diamond_fit_tune_1 <- workflow() |>
  add_recipe(diamond_recipe) |>
  add_model(diamond_spec) |>
  tune_grid(resamples = diamond_vfold_5, grid = k_vals)

**Collecting and Filtering the Accuracy**

We use the collect_metrics() function to gather, for each K, the mean and standard error of the five fold-level estimates, and then the filter() function to keep only the accuracy metric.

In [None]:
diamond_vfold_tune_metrics_1 <- diamond_fit_tune_1 |>
    collect_metrics() |> 
    filter(.metric == "accuracy")
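
As a sketch of what the `mean` and `std_err` columns summarize, the calculation below uses five made-up fold accuracies for a single K. These numbers are hypothetical and are not results from our model. The standard error is taken here as the sample standard deviation divided by the square root of the number of folds, which to our understanding matches what collect_metrics() reports.

```r
# Hypothetical per-fold accuracies for one value of K (not real results)
fold_accuracy <- c(0.70, 0.71, 0.69, 0.72, 0.70)

n_folds <- length(fold_accuracy)
mean_accuracy <- mean(fold_accuracy)
# Standard error of the mean: sample sd divided by sqrt(number of folds)
std_err <- sd(fold_accuracy) / sqrt(n_folds)

c(mean = mean_accuracy, n = n_folds, std_err = std_err)
```

A small standard error relative to the differences between K values suggests that the ranking of K values is not just due to how the folds happened to be drawn.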

All of these steps can be found below: 

In [None]:
set.seed(2023)

diamond_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

diamond_recipe <- recipe(cut ~ depth+table, data = diamond_training) |>
step_scale(all_predictors()) |>
step_center(all_predictors())

k_vals <- tibble(neighbors = seq(from = 1, to = 151, by = 10))
diamond_vfold_5 <- vfold_cv(diamond_training, v = 5, strata = cut)

diamond_fit_tune_1 <- workflow() |>
  add_recipe(diamond_recipe) |>
  add_model(diamond_spec) |>
  tune_grid(resamples = diamond_vfold_5, grid = k_vals)

diamond_vfold_tune_metrics_1 <- diamond_fit_tune_1 |>
collect_metrics() |> filter(.metric == "accuracy")

**Accuracy vs k graph**

After filtering for the accuracy values, we plot the mean accuracy against the number of neighbors to see where the accuracy is both high and stable. We estimate the best number of neighbors by looking for the K with the highest accuracy.

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6)

# The mean column is a proportion between 0 and 1, not a percentage
accuracy_plot <- ggplot(diamond_vfold_tune_metrics_1, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
ggtitle("Accuracy vs. K")+
labs(x = "Number of neighbors (K)", y = "Mean accuracy (proportion)")

accuracy_plot

The graph above suggests that any K between 50 and 150 gives similar, near-best accuracy, so let's pick a relatively small number in that range to speed up our remaining calculations. 70 looks to be a good K value to choose.

**Selecting the k with greatest accuracy**

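To check the conclusion we reached above, we filter for the row with the highest mean accuracy, select the K that corresponds to it, and then pull it out as a single value. The value returned is one of the grid points (1, 11, 21, ..., 151), so it can only approximate a visual estimate such as 70.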

In [None]:
best_k_v <- diamond_vfold_tune_metrics_1 |>
          filter(mean == max(mean)) |>        
          select(neighbors)
best_k_v

best_k_vV <- best_k_v |>
pull()

best_k_vV
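
If two K values tie for the highest mean accuracy, the filter above returns more than one row. One possible way to handle this, sketched below on made-up tuning results (hypothetical numbers, not our output), is to sort by accuracy and then by K and keep the first row, which picks the smallest K among the best.

```r
library(dplyr)

# Hypothetical tuning results with a tie at the top (not real output)
toy_metrics <- tibble(
  neighbors = c(1, 11, 21, 31),
  mean = c(0.62, 0.70, 0.71, 0.71)
)

# Sort by accuracy (descending), then by K (ascending), and keep the first row
best_k_single <- toy_metrics |>
  arrange(desc(mean), neighbors) |>
  slice(1) |>
  pull(neighbors)

best_k_single
```

Preferring the smaller K among tied values is a design choice made for this sketch; it keeps the later model fits slightly cheaper.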

**Creating the optimized model with new K value**

Now that we have identified the best k value (70), our next step is to build a model specification with that value fixed, so that the model can be fit on the training data and then evaluated on our testing dataset.

In [None]:
set.seed(2023)

diamond_spec_final <- nearest_neighbor(weight_func = "rectangular", neighbors = 70) |>
  set_engine("kknn") |>
  set_mode("classification")

**Fitting the model**

Then we fit the final workflow, which combines the initial recipe and the model specification with K fixed, on the training data.

In [None]:
set.seed(2023)

diamond_fit_final <- workflow() |>
  add_recipe(diamond_recipe) |>
  add_model(diamond_spec_final) |>
  fit(diamond_training)

**Determining the classification model's accuracy using the test set**

The classification model is run now using the testing dataset.

In [None]:
set.seed(2023)

diamond_predict <- predict(diamond_fit_final, diamond_testing) |>bind_cols(diamond_testing)

diamond_prediction_accuracy <- diamond_predict |>
  metrics(truth = cut, estimate = .pred_class) |>
  filter(.metric == "accuracy")

diamond_prediction_accuracy

We can see the final accuracy is 70.8%. 
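
As a small illustration of how an accuracy figure like this is computed, the sketch below uses made-up true and predicted labels (hypothetical values, not our model's predictions). It also computes a confusion matrix and a majority-class baseline, which can help put an accuracy value in context.

```r
# Hypothetical true and predicted labels (not real model output)
grades <- c("Fair", "Good", "Ideal")
truth <- factor(c("Ideal", "Ideal", "Good", "Fair", "Ideal", "Good"), levels = grades)
pred  <- factor(c("Ideal", "Good", "Good", "Ideal", "Ideal", "Good"), levels = grades)

# Accuracy: share of predictions that match the true label
accuracy <- mean(pred == truth)

# Confusion matrix: rows are true grades, columns are predicted grades
confusion <- table(truth = truth, predicted = pred)

# Baseline: accuracy from always predicting the most common true grade
baseline <- max(table(truth)) / length(truth)

confusion
c(accuracy = accuracy, baseline = baseline)
```

Comparing the model's accuracy with the baseline shows how much the predictors add beyond simply guessing the most common grade.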

### Fulfilling Curiosity

Now we want to fulfill our curiosity: will adding x, y, and z as predictors affect the accuracy, even though the box plots from section 2 do not show strong relationships? We can find out by repeating the same steps we used before.

**Creating the recipe**

We will create a recipe here, this time adding x, y, and z as predictors. Other than that we follow the same steps as before. 

In [None]:
set.seed(2023)

diamond_recipe_curiosity <- recipe(cut ~ depth+table+x+y+z, data = diamond_training) |>
step_scale(all_predictors()) |>
step_center(all_predictors())



**Fitting the recipe to a workflow**

Now we add it to a workflow, this time using our new recipe (diamond_recipe_curiosity). Other than that, all our steps remain the same, and we use the same model specification, diamond_spec.

We then collect and filter the accuracy metrics to plot.

In [None]:
diamond_fit_tune_curiosity <- workflow() |>
  add_recipe(diamond_recipe_curiosity) |>
  add_model(diamond_spec) |>
  tune_grid(resamples = diamond_vfold_5, grid = k_vals)

diamond_vfold_tune_metrics_curiosity <- diamond_fit_tune_curiosity |>
    collect_metrics() |> 
    filter(.metric == "accuracy")

**Plotting the curiosity accuracy graph**

Here, we plot the new accuracy estimates to find the best K value. As with our last graph, we take the best number of neighbors to be the one with the highest accuracy.

In [None]:
# The mean column is a proportion between 0 and 1, not a percentage
accuracy_curiosity_plot <- ggplot(diamond_vfold_tune_metrics_curiosity, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Number of neighbors (K)", y = "Mean accuracy (proportion)") +
ggtitle("Accuracy vs K")+
scale_x_continuous(breaks = seq(from = 0, to = 150, by = 10))

accuracy_curiosity_plot

The graph above suggests the k value to be 10. Now let's test on the testing set.

**Curiosity Model Specification**

We create a model specification with the best k value (10) fixed. We will then fit this model with neighbours = 10 and run it on our test dataset.

In [None]:
set.seed(2023)

diamond_spec_curiosity <- nearest_neighbor(weight_func = "rectangular", neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("classification")


**Fitting the model**

As we did for the first classification model, we create a workflow that includes our newly created recipe (diamond_recipe_curiosity), the model specification with K fixed (diamond_spec_curiosity), and the fit() function to build the classifier on the training data.

In [None]:
set.seed(2023)

diamond_fit_curiosity <- workflow() |>
  add_recipe(diamond_recipe_curiosity) |>
  add_model(diamond_spec_curiosity) |>
  fit(diamond_training)

**Determining the curiosity model's accuracy using the test set**

We then run the model on the testing dataset. It gives us approximately 73% accuracy.

In [None]:
diamond_predict_curiosity <- predict(diamond_fit_curiosity, diamond_testing) |>bind_cols(diamond_testing)

diamond_prediction_accuracy_curiosity <- diamond_predict_curiosity |>
  metrics(truth = cut, estimate = .pred_class) |>
  filter(.metric == "accuracy")

diamond_prediction_accuracy_curiosity

**Comparing them side-by-side**

Let's take a look at them together:

In [None]:
diamond_prediction_accuracy
diamond_prediction_accuracy_curiosity

With only depth and table as our predictors, the accuracy is 70.8%, and after adding x, y, and z as predictors, the accuracy increases by about 2 percentage points. Just as we hypothesized, adding x, y, and z does not affect the accuracy much, even though they are geometric factors.
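
The size of the change can be expressed either in percentage points or as a relative change. The short calculation below uses the two accuracy values reported in this project (70.8% and 72.7%); the variable names are only for this sketch.

```r
acc_two_predictors <- 0.708   # depth + table
acc_five_predictors <- 0.727  # depth + table + x + y + z

# Absolute change in percentage points
point_change <- (acc_five_predictors - acc_two_predictors) * 100
# Relative change compared with the two-predictor model, in percent
relative_change <- (acc_five_predictors / acc_two_predictors - 1) * 100

round(c(percentage_points = point_change, relative_percent = relative_change), 1)
```

The two figures differ (about 1.9 points versus about 2.7 percent), so it helps to state which one is meant when describing the increase.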

### 4.0 Discussion

We expected the cut grade prediction accuracy to be low because the associations seen in the box plots are not strong. In terms of impact, if the accuracy is good enough, the model could speed up the process of grading diamonds.

**4.1 Summary of Results**\
Through this data analysis, we found that depth and table are the two quantitative predictors most useful for predicting the cut quality of a diamond. When the model was run with those two as predictors, the prediction accuracy was 70.8%. The model was then tested again using depth, table, and the geometric factors x, y, and z as predictors, and the accuracy was 72.7% (an increase of 1.9 percentage points). The main takeaway from this finding is that while the geometric factors x, y, and z affect the overall visual aesthetic of the diamond, they add very little to the prediction of cut in our model. \
**4.2 Relation to Expectations**\
This is in line with what we expected from the prediction models. As seen in the box plots created in section 2, the predictive capabilities did not appear very strong from the outset. \
**4.3 Future Impacts**\
This could have an impact on the diamond industry, since our model suggests that the geometric factors x, y, and z have very little influence on the predicted quality of the cut. The implications are that the process of grading diamonds could be streamlined, and that overselling diamonds based on these geometric variables could be discouraged, as they contributed little to cut prediction in this classification model. 

**4.4 Future questions** 
1. What factors do affect the cut of the diamond the most?
2. How can we improve the prediction accuracy? 
3. What other factors not considered in our model influence the prediction of cut quality?

### References

1. Why Is A Diamond’s Cut Important? (n.d.). BRILLIANT EARTH. https://www.brilliantearth.com/en-ca/diamond/buying-guide/cut/