## 1. Cosmetics, chemicals... it's complicated
<p>Whenever I want to try a new cosmetic item, it's so difficult to choose. It's actually more than difficult. It's sometimes scary because new items that I've never tried end up giving me skin trouble. We know the information we need is on the back of each product, but it's really hard to interpret those ingredient lists unless you're a chemist. You may be able to relate to this situation.</p>
<p><img src="https://assets.datacamp.com/production/project_695/img/image_1.png" style="width:600px;height:400px;"></p>
<p>So instead of buying and hoping for the best, why don't we use data science to help us predict which products may be good fits for us? In this notebook, we are going to create a content-based recommendation system where the 'content' will be the chemical components of cosmetics. Specifically, we will process ingredient lists for 1472 cosmetics on Sephora via <a href="https://en.wikipedia.org/wiki/Word_embedding">word embedding</a>, then visualize ingredient similarity using a machine learning method called t-SNE and an interactive visualization library called Bokeh. Let's inspect our data first.</p>

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE



# Load the data
df = pd.read_csv('/content/cosmetics.csv')

# Check the first five rows
print("First five rows of the dataset:")
display(df.head())

# Inspect the product names (the product category itself is stored in the 'Label' column)
print("\nUnique product types in the 'Name' column:")
display(df['Name'].unique())


First five rows of the dataset:


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
0,Moisturizer,LA MER,Crème de la Mer,175,4.1,"Algae (Seaweed) Extract, Mineral Oil, Petrolat...",1,1,1,1,1
1,Moisturizer,SK-II,Facial Treatment Essence,179,4.1,"Galactomyces Ferment Filtrate (Pitera), Butyle...",1,1,1,1,1
2,Moisturizer,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,"Water, Dicaprylyl Carbonate, Glycerin, Ceteary...",1,1,1,1,0
3,Moisturizer,LA MER,The Moisturizing Soft Cream,175,3.8,"Algae (Seaweed) Extract, Cyclopentasiloxane, P...",1,1,1,1,1
4,Moisturizer,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,"Water, Snail Secretion Filtrate, Phenyl Trimet...",1,1,1,1,1



Unique product types in the 'Name' column:


array(['Crème de la Mer', 'Facial Treatment Essence',
       'Protini™ Polypeptide Cream', ..., 'Self Tan Dry Oil SPF 50',
       'Pro Light Self Tan Bronzing Mist',
       'DERMAPROTECT Daily Defense Broad Spectrum SPF 50+'], dtype=object)

## 2. Focus on one product category and one skin type
<p>There are six categories of product in our data (<strong><em>moisturizers, cleansers, treatments, face masks, eye creams</em></strong>, and <strong><em>sun protection</em></strong>) and there are five different skin types (<strong><em>combination, dry, normal, oily</em></strong> and <strong><em>sensitive</em></strong>). Because individuals have different product needs as well as different skin types, let's set up our workflow so its outputs (a t-SNE model and a visualization of that model) can be customized. For the example in this notebook, let's focus on moisturizers for those with dry skin by filtering the data accordingly.</p>

In [None]:
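# Filter for products whose name contains "moisturizer"
# (this is a name match, not the whole 'Moisturizer' category in the 'Label' column)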
moisturizers = df[df['Name'].str.contains("moisturizer", case=False, na=False)]

# Filter for dry skin as well
moisturizers_dry = moisturizers[moisturizers['Dry'] == 1]

# Reset index
moisturizers_dry = moisturizers_dry.reset_index(drop=True)

# Display the filtered dataset
print("Filtered dataset for moisturizers suitable for dry skin:")
display(moisturizers_dry.head())

Filtered dataset for moisturizers suitable for dry skin:


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
0,Moisturizer,OLEHENRIKSEN,Sheer Transformation® Perfecting Moisturizer,38,4.2,Visit the OLEHENRIKSEN boutique,1,1,1,1,1
1,Moisturizer,ORIGINS,GinZing™ Energy-Boosting Gel Moisturizer,29,4.4,"Water, Methyl Trimethicone, Butylene Glycol, G...",1,1,1,1,1
2,Moisturizer,SUNDAY RILEY,C.E.O. C + E antiOXIDANT Protect + Repair Mois...,65,4.1,"Water, Squalane, Tetrahexyldecyl Ascorbate (Vi...",1,1,1,1,1
3,Moisturizer,GLAMGLOW,GLOWSTARTER™ Mega Illuminating Moisturizer,49,4.0,"Water, Dimethicone, Butylene Glycol, Cetyl Ric...",1,1,1,1,0
4,Moisturizer,FARMACY,Honey Drop Lightweight Moisturizer with Echina...,45,4.1,"Water, Glycereth-26, Glycerin, C13-15 Alkane, ...",1,1,1,1,1
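
<p>To make the category and skin type easy to swap, the filtering step could be wrapped in a small helper. The following is only a sketch under assumptions: <code>filter_products</code> and the toy data are hypothetical, not part of the original notebook. Note that it matches on the <code>Label</code> column, whereas the cell above matches on the product name, so on the real data the two approaches select different rows.</p>

```python
import pandas as pd

def filter_products(df, label, skin_type):
    """Return the rows of one product category that are flagged for one skin type.

    Hypothetical helper: `label` is a value of the 'Label' column and
    `skin_type` is the name of a 0/1 skin-type column such as 'Dry'.
    """
    mask = (df["Label"] == label) & (df[skin_type] == 1)
    return df[mask].reset_index(drop=True)

# Toy data with the same column names as the cosmetics dataset (values are made up)
toy = pd.DataFrame({
    "Label": ["Moisturizer", "Moisturizer", "Cleanser", "Moisturizer"],
    "Name": ["Cream A", "Gel B", "Foam C", "Lotion D"],
    "Dry": [1, 0, 1, 1],
    "Oily": [0, 1, 1, 1],
})

dry_moisturizers = filter_products(toy, "Moisturizer", "Dry")
print(dry_moisturizers["Name"].tolist())  # ['Cream A', 'Lotion D']
```

<p>Changing the two arguments, for example to <code>"Cleanser"</code> and <code>"Oily"</code>, would be enough to rerun the rest of the workflow for a different category and skin type.</p>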


## 3. Tokenizing the ingredients
<p>To get to our end goal of comparing ingredients in each product, we first need to do some preprocessing and bookkeeping of the actual words in each product's ingredients list. The first step is tokenizing the list of ingredients in the <code>Ingredients</code> column. After splitting them into tokens, we'll make a binary bag of words. Then we will create a dictionary of the tokens, <code>ingredient_idx</code>, which will have the following format:</p>
<p>{ <strong><em>"ingredient"</em></strong>: index value, … }</p>

In [None]:
# Initialize dictionary, list, and initial index
ingredient_idx = {}
corpus = []
idx = 0

# For loop for tokenization
for i in range(len(moisturizers_dry)):
    # Get the ingredients for the current product
    ingredients = moisturizers_dry['Ingredients'][i]

    # Convert to lowercase
    ingredients_lower = ingredients.lower()

    # Tokenize the ingredients by splitting on ", "
    tokens = ingredients_lower.split(', ')

    # Append tokens to the corpus (list of lists)
    corpus.append(tokens)

    # Update the ingredient index dictionary
    for ingredient in tokens:
        if ingredient not in ingredient_idx:
            ingredient_idx[ingredient] = idx
            idx += 1

# Check the result
print("The index for decyl oleate is", ingredient_idx.get('decyl oleate', "Ingredient not found"))


The index for decyl oleate is Ingredient not found
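
<p>The tokenization loop above can be tried on a handful of made-up ingredient lists to see what <code>ingredient_idx</code> and <code>corpus</code> look like. This is a minimal sketch, not the notebook's exact code: <code>build_ingredient_index</code> is a hypothetical name, and it splits on <code>","</code> and strips whitespace instead of splitting on <code>", "</code>, which tolerates irregular spacing. Either way, an ingredient whose own name contains a comma would still be split incorrectly.</p>

```python
def build_ingredient_index(ingredient_lists):
    """Tokenize comma-separated ingredient strings and index each unique token."""
    ingredient_idx = {}
    corpus = []
    for text in ingredient_lists:
        # Lowercase, split on commas, and drop surrounding whitespace and empty tokens
        tokens = [t.strip() for t in text.lower().split(",") if t.strip()]
        corpus.append(tokens)
        for token in tokens:
            if token not in ingredient_idx:
                # The next free index is the current size of the dictionary
                ingredient_idx[token] = len(ingredient_idx)
    return ingredient_idx, corpus

# Made-up ingredient lists, including inconsistent spacing and casing
toy_lists = ["Water, Glycerin, Decyl Oleate", "water,  Niacin", "Glycerin, Water"]
toy_idx, toy_corpus = build_ingredient_index(toy_lists)

print(toy_idx)     # {'water': 0, 'glycerin': 1, 'decyl oleate': 2, 'niacin': 3}
print(toy_corpus)  # [['water', 'glycerin', 'decyl oleate'], ['water', 'niacin'], ['glycerin', 'water']]
```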


## 4. Initializing a document-term matrix (DTM)
<p>The next step is making a document-term matrix (DTM). Here each cosmetic product will correspond to a document, and each ingredient will correspond to a term. This means we can think of the matrix as a <em>“cosmetic-ingredient”</em> matrix. The size of the matrix should be as shown in the picture below.
<img src="https://assets.datacamp.com/production/project_695/img/image_2.PNG" style="width:600px;height:250px;">
To create this matrix, we'll first make an empty matrix filled with zeros. The number of rows is the total number of cosmetic products in the data. The number of columns is the total number of unique ingredients. After initializing this empty matrix, we'll fill it in the following tasks. </p>

In [None]:

# Get the number of items (cosmetic products) and tokens (unique ingredients)
M = len(moisturizers_dry)  # Number of cosmetic products
N = len(ingredient_idx)    # Number of unique ingredients

# Initialize a matrix of zeros
A = np.zeros((M, N), dtype=int)

# Display the shape of the initialized matrix
print(f"Initialized document-term matrix with shape: {A.shape}")


Initialized document-term matrix with shape: (37, 701)


## 5. Creating a counter function
<p>Before we can fill the matrix, let's create a function to count the tokens (i.e., an ingredients list) for each row. Our end goal is to fill the matrix with 1 or 0: if an ingredient is in a cosmetic, the value is 1. If not, it remains 0. The name of this function, <code>oh_encoder</code>, will become clear next.</p>

In [None]:
# Define the oh_encoder function
def oh_encoder(tokens):
    # Initialize a zero vector of length equal to the number of unique ingredients
    x = np.zeros(len(ingredient_idx), dtype=int)

    # Iterate through the tokens (ingredients) in the current product
    for ingredient in tokens:
        # Get the index for each ingredient if it exists in ingredient_idx
        idx = ingredient_idx.get(ingredient)
        if idx is not None:
            # Set the corresponding index in the vector to 1
            x[idx] = 1

    return x
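
<p>To see what the encoder returns, here is a small self-contained sketch on a made-up index. It assumes a variant of the function, called <code>one_hot</code> here, that takes the index as a parameter instead of reading the global <code>ingredient_idx</code>. Tokens that are missing from the index are simply skipped.</p>

```python
import numpy as np

def one_hot(tokens, index):
    """Return a 0/1 vector with a 1 at the position of every known token."""
    x = np.zeros(len(index), dtype=int)
    for token in tokens:
        pos = index.get(token)
        if pos is not None:
            x[pos] = 1
    return x

# Made-up index of four ingredients
toy_index = {"water": 0, "glycerin": 1, "niacin": 2, "decyl oleate": 3}

# 'retinol' is not in the index, so it leaves the vector unchanged
vec = one_hot(["water", "niacin", "retinol"], toy_index)
print(vec)  # [1 0 1 0]
```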


## 6. The Cosmetic-Ingredient matrix!
<p>Now we'll apply the <code>oh_encoder()</code> function to the tokens in <code>corpus</code> and set the values in each row of this matrix. The result will tell us which ingredients each item is composed of. For example, if a cosmetic item contains <em>water, niacin, decyl oleate</em> and <em>sh-polypeptide-1</em>, the row for this item will be as follows.
<img src="https://assets.datacamp.com/production/project_695/img/image_3.PNG" style="width:800px;height:400px;">
This is called one-hot encoding. By encoding each ingredient in the items, the <em>Cosmetic-Ingredient</em> matrix will be filled with binary values. </p>

In [None]:
# Populate the document-term matrix
for i, tokens in enumerate(corpus):
    # Use the oh_encoder function to get the one-hot encoded vector for the current product
    A[i, :] = oh_encoder(tokens)

# Display a portion of the matrix to check the result
print("Cosmetic-Ingredient matrix (first 5 rows):")
print(A[:5, :])


Cosmetic-Ingredient matrix (first 5 rows):
[[1 0 0 ... 0 0 0]
 [0 1 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]


## 7. Dimension reduction with t-SNE
<p>The dimensions of the existing matrix are (37, 701), which means there are 701 features in our data. For visualization, we should reduce this to two dimensions. We'll use t-SNE for that here.</p>
<p><strong><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">T-distributed Stochastic Neighbor Embedding (t-SNE)</a></strong> is a nonlinear dimensionality reduction technique that is well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, this technique can reduce the dimension of data while preserving the similarities between instances. This lets us plot the data on the coordinate plane: each cosmetic item in our data is mapped to a two-dimensional coordinate, and the distances between the points indicate the similarities between the items. </p>

In [None]:
# Dimension reduction with t-SNE
model = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_features = model.fit_transform(A)

# Add the t-SNE features as new columns to the dataframe
moisturizers_dry['X'] = tsne_features[:, 0]
moisturizers_dry['Y'] = tsne_features[:, 1]

# Display the updated dataframe
print("Updated DataFrame with t-SNE features:")
print(moisturizers_dry[['X', 'Y']].head())


Updated DataFrame with t-SNE features:
          X         Y
0  0.916827 -0.716784
1  1.270983 -0.114038
2  0.884829 -1.372962
3  1.611735 -0.675027
4  0.196880 -0.118400
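
<p>The same call can be tried on a small random binary matrix to check the shape of the result. This is only a sketch: the matrix is random stand-in data, and the <code>perplexity</code> value is an arbitrary choice. One thing worth keeping in mind is that scikit-learn requires <code>perplexity</code> to be smaller than the number of rows, which matters when a filter leaves only a few dozen products.</p>

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for a cosmetic-ingredient matrix: 20 "products" x 50 "ingredients"
rng = np.random.default_rng(0)
toy_matrix = rng.integers(0, 2, size=(20, 50)).astype(float)

# perplexity must be smaller than the number of rows (20 here)
toy_model = TSNE(n_components=2, perplexity=5, random_state=42)
toy_coords = toy_model.fit_transform(toy_matrix)

print(toy_coords.shape)  # (20, 2)
```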


## 8. Let's map the items with Bokeh
<p>We are now ready to start creating our plot. With the t-SNE values, we can plot all our items on the coordinate plane. And the coolest part here is that it will also show us the name, the brand, the price and the rank of each item. Let's make a scatter plot using Bokeh first; the hover tool that shows that information is added in the next step.</p>

In [None]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool

# Display plots inline in the notebook
output_notebook()

# Create a ColumnDataSource with the relevant data
source = ColumnDataSource(data={
    'X': moisturizers_dry['X'],
    'Y': moisturizers_dry['Y'],
    'Name': moisturizers_dry['Name'],
    'Brand': moisturizers_dry['Brand'],
    'Price': moisturizers_dry['Price'],
    'Rank': moisturizers_dry['Rank']
})

# Create a scatter plot
plot = figure(
    title="Cosmetics t-SNE Visualization",
    x_axis_label="t-SNE Dimension 1",
    y_axis_label="t-SNE Dimension 2",
    width=500,
    height=400
)

# Add scatter points (scatter() draws circle markers by default)
plot.scatter(
    x='X',
    y='Y',
    source=source,
    size=10,
    color='#FF7373',
    alpha=0.8
)



# Show the plot
show(plot, notebook_handle=True)




## 9. Adding a hover tool
<p>Why don't we add a hover tool? Adding a hover tool allows us to check the information of each item whenever the cursor is directly over a glyph. We'll add tooltips with each product's name, brand, price, and rank (i.e., rating).</p>

In [None]:
# Create a HoverTool object
hover = HoverTool(tooltips=[
    ("Name", "@Name"),
    ("Brand", "@Brand"),
    ("Price", "@Price"),
    ("Rank", "@Rank")
])

# Add the HoverTool to the plot
plot.add_tools(hover)

# Show the plot again after adding the hover tool
show(plot, notebook_handle=True)


## 10. Mapping the cosmetic items
<p>Finally, it's show time! Let's see what the map we've made looks like. Each point on the plot corresponds to a cosmetic item. Then what do the axes mean here? The axes of a t-SNE plot aren't easily interpretable in terms of the original data. As mentioned above, t-SNE is a visualization technique for plotting high-dimensional data in a low-dimensional space. Therefore, it's not advisable to interpret a t-SNE plot quantitatively.</p>
<p>Instead, what we can get from this map is the distance between the points (which items are close and which are far apart). The closer two items are, the more similar their compositions. This enables us to compare the items without having any chemistry background.</p>
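
<p>Since map distances are only a qualitative guide, it can help to check a pair of nearby points against their actual ingredient vectors. One simple option, which is not part of the original workflow, is the Jaccard similarity: the number of shared ingredients divided by the number of ingredients that appear in either product. The sketch below assumes two rows of a binary cosmetic-ingredient matrix; <code>jaccard_similarity</code> is a hypothetical helper.</p>

```python
import numpy as np

def jaccard_similarity(a, b):
    """Shared ingredients divided by all ingredients present in either product."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Two empty ingredient lists: define the similarity as 0
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

# Two made-up rows of a binary cosmetic-ingredient matrix
row_a = [1, 1, 1, 0, 0]
row_b = [1, 1, 0, 1, 0]

# 2 shared ingredients out of 4 that appear in either product
print(jaccard_similarity(row_a, row_b))  # 0.5
```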

In [None]:
# Plot the map
show(plot, notebook_handle=True)

## 11. Comparing two products


<p>Since there are so many cosmetics and so many ingredients, the plot doesn't have many super obvious patterns that simpler t-SNE plots can have (<a href="https://campus.datacamp.com/courses/unsupervised-learning-in-python/visualization-with-hierarchical-clustering-and-t-sne?ex=10">example</a>). Our plot requires some digging to find insights, but that's okay!</p>
<p>If we enjoyed a specific product, there's an increased chance we'd enjoy another product that is similar in chemical composition. Say we enjoyed AmorePacific's <a href="https://www.sephora.com/product/color-control-cushion-compact-broad-spectrum-spf-50-P378121">Color Control Cushion Compact Broad Spectrum SPF 50+</a>. We could find this product on the plot and see if any similar products exist. And it turns out one does! If we look at the points furthest left on the plot, we see LANEIGE's <a href="https://www.sephora.com/product/bb-cushion-hydra-radiance-P420676">BB Cushion Hydra Radiance SPF 50</a> essentially overlaps with the AmorePacific product. By looking at the ingredients, we can visually confirm the compositions of the products are similar (<em>though it is difficult to do, which is why we did this analysis in the first place!</em>), plus LANEIGE's version is $22 cheaper and actually has higher ratings.</p>
<p>It's not perfect, but it's useful. In real life, we can actually use our little ingredient-based recommendation engine to help us make educated cosmetic purchase choices.</p>
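
<p>Instead of hunting for neighbors on the plot by eye, we could also rank products by their distance to a chosen product in the t-SNE coordinates. The following is a rough sketch on made-up coordinates; <code>nearest_products</code> is a hypothetical helper, and it assumes a DataFrame with <code>Name</code>, <code>X</code> and <code>Y</code> columns like the one built earlier. As with the plot itself, the ranking is only a hint about which ingredient lists are worth comparing.</p>

```python
import numpy as np
import pandas as pd

def nearest_products(df, name, k=3):
    """Return the k products closest to `name` in the (X, Y) plane."""
    target = df.loc[df["Name"] == name, ["X", "Y"]]
    if target.empty:
        raise ValueError(f"No product named {name!r}")
    tx, ty = target.iloc[0]
    # Euclidean distance from every product to the target
    dist = np.hypot(df["X"] - tx, df["Y"] - ty)
    others = df.assign(Distance=dist)[df["Name"] != name]
    return others.nsmallest(k, "Distance")[["Name", "Distance"]]

# Made-up map coordinates for four products
toy_map = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "X": [0.0, 0.1, 2.0, -3.0],
    "Y": [0.0, 0.0, 2.0, 1.0],
})

closest = nearest_products(toy_map, "A", k=2)
print(closest["Name"].tolist())  # ['B', 'C']
```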

In [None]:
# Display a few product names from the dataset to check for exact matches
print(moisturizers_dry['Name'].head(20))


0          Sheer Transformation® Perfecting Moisturizer
1              GinZing™ Energy-Boosting Gel Moisturizer
2     C.E.O. C + E antiOXIDANT Protect + Repair Mois...
3            GLOWSTARTER™ Mega Illuminating Moisturizer
4     Honey Drop Lightweight Moisturizer with Echina...
5     Hello FAB Coconut Skin Smoothie Priming Moistu...
6     Secret Sauce Clinically Advanced Miraculous An...
7                        Argan Daily Moisturizer SPF 47
8     Renewed Hope in A Jar Refreshing & Refining Mo...
9      Original Skin™ Matte Moisturizer with Willowherb
10    Drink of H2O Hydrating Boost Moisturizer Rainf...
11                 Squalane + Probiotic Gel Moisturizer
12    A Perfect World™ SPF 40 Age-Defense Moisturize...
13             Ferulic + Retinol Anti-Aging Moisturizer
14    Dr. Andrew Weil For Origins™ Mega-Bright SPF 3...
15                        Ultra Repair Face Moisturizer
16                  Ibuki Refining Moisturizer Enriched
17    GinZing™ SPF 40 Energy-Boosting Tinted Moi

In [None]:
# Search for products with "Cushion" in the name (or any other relevant keyword)
cosmetic_1 = moisturizers_dry[moisturizers_dry['Name'].str.contains("Cushion", case=False, na=False)]
cosmetic_2 = moisturizers_dry[moisturizers_dry['Name'].str.contains("SPF", case=False, na=False)]

# Display the results
display(cosmetic_1[['Name', 'Ingredients']])
display(cosmetic_2[['Name', 'Ingredients']])


Unnamed: 0,Name,Ingredients


Unnamed: 0,Name,Ingredients
7,Argan Daily Moisturizer SPF 47,**Natural.
12,A Perfect World™ SPF 40 Age-Defense Moisturize...,"Avobenzone 3.0%, Homosalate 8.0%, Octinoxate 7..."
14,Dr. Andrew Weil For Origins™ Mega-Bright SPF 3...,"Water, Butyloctyl Salicylate, Behenyl Alcohol,..."
17,GinZing™ SPF 40 Energy-Boosting Tinted Moistur...,"Octinoxate 7.5%, Octisalate 2.0%, Octocrylene ..."
18,Truth Revealed™ Brightening Broad Spectrum SPF...,"Water, Cocoglycerides, Ethylhexyl Salicylate, ..."
27,White Lucent All Day Brightener Broad Spectrum...,"Water, Sd Alcohol 40-B, Dimethicone, Dipropyle..."
29,MOISTURE BOUND Tinted Treatment Moisturizer SP...,"Phyllostachis Bambusoides Juice, Methyl Trimet..."
30,High Potency Classics: Face Finishing & Firmin...,"Water, Glycerin, Cyclopentasiloxane, Dimethico..."
31,8 HR Mattifying Moisturizer Sunscreen Broad Sp...,Visit the SEPHORA COLLECTION boutique
32,Ultra Facial Moisturizer SPF 30,"Water, Propylene Glycol, Dicaprylyl Ether, Gly..."


In [None]:
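# Select two products from the filtered list.
# Note: '==' needs the full product name. The names displayed above are truncated,
# so a shortened name matches nothing and returns an empty DataFrame.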
cosmetic_1 = moisturizers_dry[moisturizers_dry['Name'] == "GinZing™ SPF 40 Energy-Boosting Tinted Moisturizer"]
cosmetic_2 = moisturizers_dry[moisturizers_dry['Name'] == "Truth Revealed™ Brightening Broad Spectrum SPF 50"]

# Display each item's data and ingredients
display(cosmetic_1)
print(cosmetic_1['Ingredients'].values)
display(cosmetic_2)
print(cosmetic_2['Ingredients'].values)


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,X,Y
17,Moisturizer,ORIGINS,GinZing™ SPF 40 Energy-Boosting Tinted Moistur...,39,4.0,"Octinoxate 7.5%, Octisalate 2.0%, Octocrylene ...",1,1,1,1,0,1.686682,-0.024831


["Octinoxate 7.5%, Octisalate 2.0%, Octocrylene 2.0%, Titanium Dioxide 3.0%, Zinc Oxide 3.0%Water, Butylene Glycol, Cetyl Alcohol, Neopentyl Glycol Diheptanoate, C12-15 Alkyl Benzoate, Dimethicone, Laureth-4, Polyethylene, Peg-100 Stearate, Hydrogenated Lecithin, Citrus Limon (Lemon) Peel Oil*, Citrus Grandis (Grapefruit) Peel Oil*, Mentha Viridis (Spearmint) Leaf Oil*, Citrus Aurantium Dulcis (Orange) Peel Oil*, Limonene, Linalool, Citral, Garcinia Mangostana Peel Extract, Panax Ginseng (Ginseng) Root Extract, Citrus Aurantium Amara (Bitter Orange) Flower Wax, Castanea Sativa (Chestnut) Seed Extract, Psidium Guajava (Guava) Fruit Extract, Citrus Aurantium Amara (Bitter Orange) Flower Water, Laminaria Saccharina Extract, Triticum Vulgare (Wheat) Germ Extract, Adenosine Phosphate, Pantethine, Creatine, Hordeum Vulgare (Barley) Extract/Extrait D'Orge, Folic Acid, Tourmaline, Cordyceps Sinensis Extract, Ethylhexylglycerin, Acetyl Carnitine Hcl, Caffeine, Rhodochrosite, Sodium Hyaluronate,

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,X,Y


[]
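
<p>The empty result above is what an exact <code>==</code> comparison returns when the name is not the complete product name. When only the beginning of a name is known, a prefix match is more forgiving. Here is a minimal sketch on made-up product names; <code>find_by_prefix</code> is a hypothetical helper built on pandas' <code>str.startswith</code>.</p>

```python
import pandas as pd

def find_by_prefix(df, prefix):
    """Return the rows whose 'Name' starts with the given prefix."""
    mask = df["Name"].str.startswith(prefix, na=False)
    return df[mask]

# Made-up product names
toy_names = pd.DataFrame({
    "Name": [
        "Example Tinted Moisturizer SPF 40",
        "Example Brightening SPF 50 Day Cream",
        "Example Night Cream",
    ]
})

# An exact comparison with a shortened name finds nothing...
exact = toy_names[toy_names["Name"] == "Example Brightening SPF 50"]
# ...while the prefix match finds the intended row
by_prefix = find_by_prefix(toy_names, "Example Brightening SPF 50")

print(len(exact), len(by_prefix))  # 0 1
```

<p>A prefix can match more than one product, so it's worth checking how many rows come back before comparing ingredient lists.</p>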
