# **Color Classification**
---

#### **Description**

This dataset contains 5053 RGB color samples, each labelled with one of 11 categories: <br />
> *Red, Orange, Yellow, Green, Blue, Purple, Pink, Brown, Black, Gray, White*

Using pandas and the sklearn library, train a model to categorize colors based on their red, green, and blue values. Plot and compare the results and make note of your findings.

This **[Google Color Picker](https://www.google.com/search?rls=en&q=color+picker)** may be a handy tool to use throughout the project!


#### **Contents**
- **[Introduction](#intro)**
- **[Loading the Data](#load_data)**
- **[Label Encoding](#encode_labels)**
- **[Training and Testing the Model](#train_test)**
- **[Label Decoding](#decode_labels)**
- **[Notebook Output Styling](#styling)**
- **[View and Compare Results](#view_results)**
- **[More to Explore](#more)**


### **Introduction** <a name="intro"></a>

Visible colors of light (for humans) and digital colors can be represented in the form of RGB values. The **RGB color model** operates as an additive system where <font color="red">red</font>, <font color="green">green</font>, and <font color="#2964f0">blue</font> (**<font color="red">R</font><font color="green">G</font><font color="#2964f0">B</font>**) primary light colors combine in diverse ways to replicate a wide spectrum of colors.

Each of these primary colors is represented by an integer value between 0 and 255, which gives 256 (2<sup>8</sup>) possible levels, the range of a single 8-bit byte. A lower value means a lower intensity, or a darker color. At the extremes, RGB(0, 0, 0) is black, while RGB(255, 255, 255) is white. The closer the red, green, and blue values are to being equal, the more likely a color is to appear gray. For example, RGB(113, 113, 113) and RGB(207, 207, 207) are different shades of gray.

Without checking, what color would RGB(172, 145, 236) be? <br />That's a tough question to answer! 

Naturally, we as humans often use names to describe colors as we see them. You may have grown up using crayons with names like *scarlet*, *dark purple*, or *yellow-orange*. But where is the line between *orange* and *yellow* drawn? Classifying colors under an umbrella label can be difficult and is often individually or even culturally subjective! This task becomes even more challenging for those with a form of color blindness. 

Beyond the commonly known groupings of primary, secondary, and tertiary colors, [research](https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC1618485&blobtype=pdf) has found an optimized set of 11 distinct color categories for classification: <br />
> *Red, Orange, Yellow, Green, Blue, Purple, Pink, Brown, Black, Gray, White*


<br />

---


### **Loading the Data** <a name="load_data"></a>

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.preprocessing import OrdinalEncoder

In [17]:
df = pd.read_csv('color_data.csv')
labels = df['label']
df = df.drop('label', axis='columns')
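
Before encoding anything, it can help to confirm that the file loaded the way you expect. The following is a minimal sketch, assuming the CSV has the columns `red`, `green`, `blue`, and `label` used elsewhere in this notebook. The inline rows are a few samples copied from the results table later in the notebook, so the snippet runs without the data file; `csv_text` and `sample_df` are illustrative names.

```python
import io
import pandas as pd

# Stand-in for color_data.csv (assumed columns: red, green, blue, label)
csv_text = """red,green,blue,label
163,245,72,Green
67,189,237,Blue
211,34,72,Red
16,199,4,Green
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Every channel should fall in the 0-255 range
channels = sample_df[['red', 'green', 'blue']]
in_range = bool(((channels >= 0) & (channels <= 255)).all().all())

# How many samples each label has; a very uneven split can skew the metrics
label_counts = sample_df['label'].value_counts()

print(in_range)
print(label_counts)
```

With the real file, replace `io.StringIO(csv_text)` with `'color_data.csv'`.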

### **Label Encoding** <a name="encode_labels"></a>

A model will not be able to understand or evaluate words like *'Blue'* or *'Orange'* so it is necessary to encode these labels. It is probably not necessary to encode the feature columns as the red, green, and blue values are already integers.

Try different label encoding methods and compare the results.
- Ordinal Encoding &rarr; found to have better accuracy and precision with this dataset
- One-Hot Encoding &rarr; found to have better recall with this dataset
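
The notebook only shows code for ordinal encoding, so here is a minimal sketch of the one-hot alternative using `pd.get_dummies`. The `example_labels` Series is a hypothetical stand-in for the notebook's `labels`; this is one possible approach, not necessarily the one used to produce the comparison above.

```python
import pandas as pd

# Hypothetical handful of labels standing in for the notebook's `labels` Series
example_labels = pd.Series(['Red', 'Blue', 'Gray', 'Blue'])

# One column per color name, with a single 1 marking each sample's label
one_hot = pd.get_dummies(example_labels, dtype=int)
print(one_hot)

# Decoding: the column that holds the 1 is the original label
decoded = one_hot.idxmax(axis='columns').tolist()
print(decoded)
```

If a 2D target like this is passed to a tree-based sklearn classifier, each column is treated as its own yes/no output, so a predicted row may contain no 1 at all or more than one. That can affect how precision and recall come out.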

##### **Ordinal Encoding**
Unlike one-hot encoding, **ordinal encoding** implies that the labels have a particular sequential order or relation to one another. For this dataset, we know it is possible for the labels to have an inherent ordering because most of the colors fall somewhere on the **color <font color="red">s</font><font color="orange">p</font><font color="yellow">e</font><font color="#31b53c">c</font><font color="#2fccb2">t</font><font color="#2964f0">r</font><font color="#874ced">u</font><font color="#ed4ce2">m</font>**. Think about the order of colors in a rainbow!

<font color="red">*Red*</font> relates to <font color="orange">*orange*</font> as <font color="#2964f0">*blue*</font> relates to <font color="#874ced">*purple*</font>.

The order in which you decide to encode the labels tells the model something about their relationships. Try a few different orderings to see if you can improve the metric scores. Two examples are already given below. 

In [18]:
# Two example orderings. Only one assignment should be active at a time,
# so comment out whichever ordering you are not testing.
# label_nums = {'Brown':0, 'Red':1, 'Orange':2, 'Yellow':3, 'Green':4, 'Blue':5, 'Purple':6, 'Pink':7, 'White':8, 'Gray':9, 'Black':10}
label_nums = {'Green':0, 'Blue':1, 'Purple':2, 'Pink':3, 'Red':4, 'Orange':5, 'Yellow':6, 'Brown':7, 'Black':8, 'Gray':9, 'White':10}

encoded_labels = [label_nums[l] for l in labels]
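
The dictionary lookup above can also be written with sklearn's `OrdinalEncoder`, which this notebook imports but does not otherwise use. This is a sketch under the assumption that you want to control the ordering yourself through the `categories` parameter; `order` and `example_labels` are illustrative names. Note that `OrdinalEncoder` expects 2D input because it is designed for feature columns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One candidate ordering (the same as the second dictionary above)
order = ['Green', 'Blue', 'Purple', 'Pink', 'Red', 'Orange',
         'Yellow', 'Brown', 'Black', 'Gray', 'White']

# Hypothetical labels standing in for the notebook's `labels` column
example_labels = pd.DataFrame({'label': ['Red', 'Green', 'White', 'Blue']})

# Position in `order` becomes the code for each label
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(example_labels).ravel().astype(int)
print(codes)

# inverse_transform recovers the color names from the codes
names = encoder.inverse_transform(codes.reshape(-1, 1)).ravel().tolist()
print(names)
```

Trying a different ordering then only requires editing the `order` list.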

### **Training and Testing the Model** <a name="train_test"></a>

A `RandomForestClassifier()` is a sensible starting point, but try some other models as well to see which gives the best performance.

Advanced students are encouraged to create an Artificial Neural Network (ANN) model using Keras and TensorFlow.

In [22]:
X_train, X_test, y_train, y_test = train_test_split(df, encoded_labels, test_size=0.2, random_state=42)

rfclas = RandomForestClassifier()
rfclas.fit(X_train, y_train)
pred = rfclas.predict(X_test)

# sklearn metrics expect the true labels first, then the predictions
print(metrics.accuracy_score(y_test, pred))
print(metrics.precision_score(y_test, pred, average='macro'))
print(metrics.recall_score(y_test, pred, average='macro'))

0.8931750741839762
0.8574688256766958
0.8450534406250614
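
To compare several models side by side, a loop like the one below may help. It is a sketch only: the data is synthetic (random RGB rows labelled by their strongest channel) so that the snippet runs without `color_data.csv`, and the choice of three classifiers is an assumption, not a recommendation from the dataset authors.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: random RGB rows labelled by their strongest channel
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.integers(0, 256, size=(600, 3)), columns=['red', 'green', 'blue'])
y = X.to_numpy().argmax(axis=1)   # 0 = red, 1 = green, 2 = blue is strongest

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Decision tree': DecisionTreeClassifier(random_state=42),
    'Random forest': RandomForestClassifier(random_state=42),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # True labels go first, predictions second
    scores[name] = {
        'accuracy': metrics.accuracy_score(y_test, pred),
        'precision': metrics.precision_score(y_test, pred, average='macro'),
        'recall': metrics.recall_score(y_test, pred, average='macro'),
    }

# One row per model, one column per metric
print(pd.DataFrame(scores).T.round(3))
```

Setting `random_state` on the models makes repeated runs comparable; without it, a random forest gives slightly different scores each time it is trained.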


### **Label Decoding** <a name="decode_labels"></a>

In [23]:
# Undo label encoding - get the color names for the predicted and actual labels
label_names = {v: k for k, v in label_nums.items()}     # reverses the dict used for ordinal encoding
actual_labels = [label_names[x] for x in y_test]
pred_labels = [label_names[x] for x in pred]
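
Once the labels are decoded back into names, a confusion matrix shows which colors get mistaken for which. A minimal sketch follows; `actual_example` and `pred_example` are short hypothetical lists standing in for `actual_labels` and `pred_labels`.

```python
import pandas as pd
from sklearn import metrics

# Hypothetical decoded labels standing in for actual_labels / pred_labels
actual_example = ['Green', 'Blue', 'Gray', 'Purple', 'Red', 'Green']
pred_example = ['Green', 'Blue', 'Purple', 'Pink', 'Red', 'Green']

# Fix the label order so rows and columns line up
names = ['Blue', 'Gray', 'Green', 'Pink', 'Purple', 'Red']
matrix = metrics.confusion_matrix(actual_example, pred_example, labels=names)

# Rows are the actual labels, columns are the predicted labels
matrix_df = pd.DataFrame(matrix, index=names, columns=names)
print(matrix_df)
```

Counts on the diagonal are correct predictions; anything off the diagonal is a confusion between two colors.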

### **Notebook Output Styling** <a name="styling"></a>

Analyzing colors can be an intensely visual process, and evaluating model predictions may heavily rely on visual confirmation. Because of this, it is incredibly helpful to have stylized outputs that reflect the actual color of the sample and the color of the label it has been assigned.

The following functions help you do exactly that:

- `output_styling()` - Takes a feature column as a parameter and returns a list of styling instructions for each column value as strings. These styling instructions are CSS ([cascading style sheets](https://www.w3schools.com/css/css_intro.asp)) ***attribute:value*** pairs. The primary attributes to consider are:
    - **Background Color** `background-color` - Accepts hex codes (ex: #012D9C) and some standard [color names](https://developer.mozilla.org/en-US/docs/Web/CSS/named-color) (ex: 'red'). At a glance, the background should show us exactly what the sample color looks like *(hex code)* and the colors of the predicted/actual labels *(word name)*.
    - **Text Color** `color` - Accepts hex codes (ex: #012D9C) and some standard [color names](https://developer.mozilla.org/en-US/docs/Web/CSS/named-color) (ex: 'red'). We want to manipulate the color of the text so that it can be properly seen/read against the background color.
    
    When this function is passed to `df.style.apply()`, pandas calls it once for each column in `df`.

- `text_color_from_ratio()` - Takes either a hex code or a color label name as a parameter and returns a string ('black' or 'white') to be the color of the output text based on standards set by [WCAG](https://www.w3.org/TR/WCAG21/#contrast-minimum). This function determines the [contrast ratio](https://www.w3.org/TR/WCAG21/#dfn-contrast-ratio) of the input color to white by computing the [relative luminance](https://www.w3.org/TR/WCAG21/#dfn-relative-luminance) of the input color. The contrast ratio will determine whether black or white will stand out better against the background color and can be easily read. To learn more about the math involved, visit any of the linked webpages.

- `make_hex_col()` - Takes a dataframe object (with the features ['red', 'green', 'blue']) as a parameter and returns a list of all the RGB color samples as hexadecimal codes, ready to be added to the dataframe as a new column. Essentially, combining the three RGB values into one hex color code allows the `output_styling` function to show us the exact color of a sample. <br />
To learn more about the process of converting RGB to HEX, see this [infographic](https://drive.google.com/file/d/17SjN9rsOv57Y7V1nECDT1R4lGu2yLHq7/view?usp=sharing).
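
The contrast check described above can be worked through in plain Python. This is a sketch based on the relative luminance and contrast ratio definitions in the linked WCAG 2.1 pages, not the notebook's own helper; the function names `relative_luminance` and `contrast_ratio` are made up for illustration.

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an (r, g, b) tuple with 0-255 channels."""
    weights = (0.2126, 0.7152, 0.0722)
    total = 0.0
    for channel, weight in zip(rgb, weights):
        c = channel / 255
        # Undo the sRGB gamma curve (threshold as written in WCAG 2.1)
        linear = c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        total += weight * linear
    return total


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), from 1 to 21."""
    lum_a, lum_b = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


sample = (9, 67, 92)   # #09435C, a dark blue from this dataset
print(contrast_ratio(sample, (255, 255, 255)))   # against white text
print(contrast_ratio(sample, (0, 0, 0)))         # against black text
```

For this dark blue, white text has the higher ratio and clears the 4.5 minimum, so white is the readable choice.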

##### **Instructor Notes:** 
- Consider having your students create the `make_hex_col()` function themselves.
- Walk through how the functions work, or encourage students to review them individually, so they understand the code well enough to make their own customizations if desired.
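
For students writing their own conversion, one possible shape for a single-color helper is sketched below. The names `rgb_to_hex` and `hex_to_rgb` are illustrative and are not part of this notebook; the example color is a sample from this dataset.

```python
def rgb_to_hex(red, green, blue):
    """Format three 0-255 channel values as a '#RRGGBB' string."""
    return '#{:02X}{:02X}{:02X}'.format(red, green, blue)


def hex_to_rgb(hex_code):
    """Split a '#RRGGBB' string back into a tuple of three integers."""
    digits = hex_code.lstrip('#')
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(rgb_to_hex(163, 245, 72))   # '#A3F548'
print(hex_to_rgb('#A3F548'))      # (163, 245, 72)
```

Each channel becomes exactly two hex digits, which is why `02X` pads small values such as 9 to `09`.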

In [24]:
# Combine the RGB features into one hex code for each sample using string formatting
# Return all hex codes as a list so it can be added as a new column to the dataframe
def make_hex_col(frame):
    hex_vals = []
    for index, row in frame.iterrows():
        hex_vals.append(('#%02X%02X%02X' % (row['red'], row['green'], row['blue'])))
    return hex_vals


# Revert color parameter (hex or label name) to RGB for contrast ratio analysis
# Learn more about this process by visiting the resources linked above
def text_color_from_ratio(color):
    rgb = matplotlib.colors.to_rgb(color.lower())
    mult = [0.2126, 0.7152, 0.0722]
    lum = 0
    for x in range(len(rgb)):
        sRGB = rgb[x] / 12.92 if rgb[x] <= 0.03928 else ( (rgb[x] + 0.055) / 1.055) ** 2.4
        lum += sRGB * mult[x]
    # WCAG contrast ratio of white (relative luminance 1.0) against this color
    ratio = 1.05 / (lum + 0.05)
    return 'white' if ratio >= 4.5 else 'black'


def output_styling(column):
    if column.name in ['red','green','blue']:
        return ['color: ' + column.name for val in column]  # colors the red, green, blue column text
        # return ['' for val in column]                     # use this line for default text color instead
    
    # For styling hex and target label columns: 
    #   Center-align text
    #   Set bg color to color parameter
    #   Appropriately color text for visibility
    return ['text-align: center; color: ' + text_color_from_ratio(val) + '; background-color: ' + val.lower() for val in column]


### **View and Compare Results** <a name="view_results"></a>

In [30]:
# Build DataFrame to compare results of the tested samples
results_df = X_test.copy()
results_df.insert(3, 'hex', make_hex_col(results_df), True)
results_df.insert(4, 'actual', actual_labels, True)
results_df.insert(5, 'predicted', pred_labels, True)

# Apply styling functions and output results
results_df.style.apply(output_styling)

| | red | green | blue | hex | actual | predicted |
|---|---|---|---|---|---|---|
| 2638 | 163 | 245 | 72 | #A3F548 | Green | Green |
| 4589 | 16 | 199 | 4 | #10C704 | Green | Green |
| 798 | 67 | 189 | 237 | #43BDED | Blue | Blue |
| 3699 | 170 | 145 | 169 | #AA91A9 | Gray | Purple |
| 1557 | 9 | 67 | 92 | #09435C | Blue | Blue |
| 465 | 82 | 169 | 128 | #52A980 | Green | Green |
| 4127 | 177 | 170 | 46 | #B1AA2E | Green | Green |
| 915 | 210 | 130 | 229 | #D282E5 | Purple | Pink |
| 4946 | 53 | 214 | 103 | #35D667 | Green | Green |
| 4678 | 211 | 34 | 72 | #D32248 | Red | Red |
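
To focus on where the model goes wrong, you can filter the results down to the rows where the two label columns disagree. A minimal sketch, using a few rows copied from the output above; `example_results` is a stand-in for `results_df`.

```python
import pandas as pd

# A few rows copied from the results shown above
example_results = pd.DataFrame({
    'hex': ['#A3F548', '#AA91A9', '#09435C', '#D282E5', '#D32248'],
    'actual': ['Green', 'Gray', 'Blue', 'Purple', 'Red'],
    'predicted': ['Green', 'Purple', 'Blue', 'Pink', 'Red'],
})

# Keep only the samples where the prediction disagrees with the label
misses = example_results[example_results['actual'] != example_results['predicted']]
print(misses)

# Count each (actual, predicted) pair to see which colors are confused most often
confused_pairs = misses.groupby(['actual', 'predicted']).size().sort_values(ascending=False)
print(confused_pairs)
```

The same filter works on the full `results_df`, and the filtered frame can still be passed through `.style.apply(output_styling)` to view the misclassified colors.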


### **More to Explore!** <a name="more"></a>

1. Test your trained model on any color you wish. Many computers have a built-in digital color meter or dropper tool you can use to pull specific RGB colors from images. Otherwise, this online [Image Color Picker](https://imagecolorpicker.com) may be helpful; it gives both the HEX and RGBA values of a selected color. <br />\*(The *A* in RGBA stands for *alpha*, an opacity value that you should leave out when passing the color to the model.)

```python
# Example: testing 3 new RGB colors on a trained model
# my_trained_model is a placeholder for any fitted model, such as rfclas
my_colors = [[15, 188, 120], [76, 76, 45], [7, 198, 64]]
my_X_test = pd.DataFrame(my_colors, columns=['red', 'green', 'blue'])
pred = my_trained_model.predict(my_X_test)
```

2. If you disagree with the initial labeling of a color sample in the dataset, feel free to find and edit it in the data file! The way we perceive colors is often subjective, and the boundary that separates one color from its neighbor on the spectrum isn't rigidly defined.

3. Not enough data points? Try writing a program to randomly generate more RGB combinations and label them using your own judgement.<br /><br />
    **Instructors:** Below is an example program, but you can encourage your students to create their own if you feel that they are skilled enough to do so.

In [None]:
#################################################################
# This program generates random RGB values that it will display 
# and ask you to categorize into one of the 11 color labels.
# 
# If there is a color you are unsure about how to label, simply
# press ENTER to skip it.
#
# When you are finished, type 'quit' and your newly created 
# samples will be appended to the end of the dataset file.
# Double check the file name if they do not appear.
#################################################################

import random
import matplotlib.patches as patches

def generate_rgb():
    # randint includes both endpoints, so 255 can be generated
    return [random.randint(0, 255) for _ in range(3)]

samples = []
label = ''
while label.lower() != 'quit':
    random_rgb = generate_rgb()
    
    new_color = pd.DataFrame([random_rgb], columns=['red', 'green', 'blue'])
    hex_col = make_hex_col(new_color)

    fig = plt.figure(dpi=20)
    color_box = fig.add_subplot(111, aspect='equal')
    color_box.set_axis_off()
    color_box.add_patch(patches.Rectangle((0,0), 1, 1, color=hex_col[0]))
    plt.show(block=False)
    
    label = input('RGB (' + ', '.join([str(x) for x in random_rgb]) + ')\nLabel: ')

    if label.lower() in ['orange', 'pink', 'white', 'yellow', 'black', 'blue', 'brown', 'gray', 'green', 'red', 'purple']:
        samples.append(','.join([str(x) for x in random_rgb]) + ',' + label[0].upper() + label[1:].lower())

print('\n' + str(len(samples)) + ' Samples Added')
with open('color_data.csv', 'a') as file:
    file.write('\n' + '\n'.join(samples))