#### Intro

<div style="font-size: 14px">
I am a coffee drinker. I usually source my beans from a local roastery, The Coffee Company. In this notebook, I want to find out which of their coffees I am likely to enjoy, with the help of data science!
</div>
<br>
<div style="font-size: 14px">
To start off, I will import some tools to scrape product descriptions from their website. I then plan to visually cluster similar products with t-SNE.
</div>
<br>
<img src="https://i.imgur.com/M4spH5e.png" height=757 width=1200>

In [75]:
# First, we need to scrape our data from the retailer

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests 
import re
from nltk.corpus import stopwords

# Create lists to store product descriptions, the pooled keywords from all descriptions, product names, roast levels and prices
corpus, complete_list, prod_names, prod_roast, prod_price = ([] for _ in range(5))

# Build the stopword set once; a set lookup is far cheaper than reloading the list for every word
# (needs a one-off nltk.download('stopwords') beforehand)
stop_words = set(stopwords.words('english'))

# Roast levels as classified by the retailer
roast_levels = ['light', 'medium', 'dark', 'extra-dark']
for roast_level in roast_levels:
    url = 'https://coffeecompany.com.au/collections/coffee/' + roast_level
    web = requests.get(url)
    doc = BeautifulSoup(web.text, 'html.parser')
    hrefs = doc.find_all('a', href=re.compile('/collections/coffee/products'))
    product_pages = list(set(href['href'] for href in hrefs))
    for product_page in product_pages:
        url = 'https://coffeecompany.com.au' + product_page
        web = requests.get(url)
        prod_name = product_page.split('/')[-1]
        prod_names.append(prod_name)
        prod_roast.append(roast_level)
        prod_page = BeautifulSoup(web.text, 'html.parser')
        # The variable words stores the whitespace-separated tokens of the description text
        words = prod_page.find('div', class_='description').text.split()
        price = prod_page.find('span', class_=re.compile('price-item')).text
        # Strip non-alphanumeric characters and lower-case each token, then keep it
        # only if it is non-empty and not a pre-defined stopword
        comments = [key for word in words
                    if (key := ''.join(ch for ch in word if ch.isalnum()).lower())
                    and key not in stop_words]
        prod_price.append(price)
        # Roast level is included as part of the corpus, as it may not be brought up in the product description
        corpus.append(comments + [roast_level])
        complete_list.extend(comments + [roast_level])
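The keyword-cleaning step can be tried offline on a single sentence. The sketch below is only an illustration: the helper name `clean_description`, the sample sentence, and the tiny stopword set are all made up here, whereas the notebook itself relies on NLTK's English stopword list.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "with", "from"}

def clean_description(text, stop_words=STOP_WORDS):
    """Lower-case each token, strip non-alphanumeric characters, drop stopwords."""
    keywords = []
    for word in text.split():
        key = "".join(ch for ch in word if ch.isalnum()).lower()
        # Skip tokens that became empty (e.g. "&") and stopwords
        if key and key not in stop_words:
            keywords.append(key)
    return keywords

sample = "A rich, well-balanced blend from Peru & Mexico."
print(clean_description(sample))
# → ['rich', 'wellbalanced', 'blend', 'peru', 'mexico']
```

Note that hyphenated words are merged into one keyword (`wellbalanced`), which is a side effect of dropping every non-alphanumeric character.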

In [76]:
# Preprocessing is needed because t-SNE only accepts numeric input

from nltk.stem import PorterStemmer

# Words like "balance" and "balanced" carry the same meaning.
# Stemming lets the computer recognize that two beans are both balanced even if they are described differently
ps = PorterStemmer()
complete_list_stemmed = [ps.stem(word) for word in complete_list]
# Sort the unique keywords so that the column order is the same on every run
comment_dedup = sorted(set(complete_list_stemmed))
# Create a dictionary that stores the assigned index for each keyword
comment_idx = {keyword: i for i, keyword in enumerate(comment_dedup)}
# Do the same for corpus
corpus_stemmed = [[ps.stem(keyword) for keyword in des] for des in corpus]
# Create a numpy array that holds 1 if a keyword is present and 0 otherwise
# Rows (i) are products; columns (j) are the unique keywords in the pool of descriptions
comment_dtm = np.array([[1 if keyword in description else 0
                         for keyword in comment_dedup] for description in corpus_stemmed])
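To make the shape of the presence/absence matrix concrete, here is a minimal sketch on a toy corpus. The three keyword lists are invented for illustration and are assumed to be already stemmed; the names `toy_corpus`, `vocab`, `col` and `dtm` do not appear in the notebook.

```python
import numpy as np

# Hypothetical, already-stemmed keyword lists for three made-up products
toy_corpus = [
    ["rich", "smooth", "dark"],
    ["fruiti", "bright", "light"],
    ["rich", "bright", "medium"],
]

# A sorted vocabulary gives every keyword a stable column index
vocab = sorted({word for doc in toy_corpus for word in doc})
col = {word: j for j, word in enumerate(vocab)}

# Rows are products, columns are keywords; 1 marks that the keyword is present
dtm = np.zeros((len(toy_corpus), len(vocab)), dtype=int)
for i, doc in enumerate(toy_corpus):
    for word in doc:
        dtm[i, col[word]] = 1

print(vocab)
print(dtm)
```

Each row sums to the number of distinct keywords in that description, and two products share a keyword exactly when both rows hold a 1 in the same column.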

In [81]:
# Model fitting. A 3D plot would look cool, but it makes it hard to judge which beans are actually close, so we embed into 2 dimensions

from sklearn.manifold import TSNE
model = TSNE(n_components = 2, perplexity = 40, init='pca', learning_rate = 'auto')
tsne_coordinates = model.fit_transform(comment_dtm)

df_coffee = pd.DataFrame({'prod_names':prod_names, 'roast_level':prod_roast,
                        'price': prod_price, 't-snex':tsne_coordinates[:, 0],
                         't-sney':tsne_coordinates[:, 1], 'idx': range(len(prod_names))})
display(df_coffee.head())


The PCA initialization in TSNE will change to have the standard deviation of PC1 equal to 1e-4 in 1.2. This will ensure better convergence.



| | prod_names | roast_level | price | t-snex | t-sney | idx |
|---|---|---|---|---|---|---|
| 0 | colombia | light | $34.00/Kg | -0.654891 | -5.603013 | 0 |
| 1 | ethiopian-yirgacheffe | light | $36.00/Kg | -1.19578 | -5.572783 | 1 |
| 2 | royal-special | light | $34.00/Kg | -0.707725 | -5.47131 | 2 |
| 3 | mara-deluxe | light | $36.00/Kg | -0.341155 | -4.80718 | 3 |
| 4 | brazil-santos_5 | light | $34.00/Kg | 0.191571 | -4.159563 | 4 |


In [82]:
# Now, we need to visualize the result. Plotly is my go-to when I want to plot an interactive graph

import plotly.express as px
fig = px.scatter(df_coffee, x='t-snex', y='t-sney', color="roast_level", 
                hover_data=['prod_names','roast_level','price'], title='t-SNE map of coffee descriptions',
                width=800, height=800)
fig.show()

In [79]:
# disable pretty print first
%pprint

Pretty printing has been turned OFF


In [80]:
# How can we be sure 2 beans are similar, as suggested by the plot? We can retrieve the descriptions provided by the retailer!

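def get_des(prod_1, prod_2):
    def description(prod):
        # Look up the product's row index; .iloc[0] takes the first match
        # (calling int() on a one-element Series is deprecated in recent pandas)
        idx = int(df_coffee.loc[df_coffee['prod_names'] == prod, 'idx'].iloc[0])
        return ' '.join(corpus[idx])

    print(f'Description for {prod_1}: \n' + description(prod_1))
    print(' ')
    print(f'Description for {prod_2}: \n' + description(prod_2))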

get_des('organic-espresso','mocha-java')

Description for organic-espresso: 
blend organic estate coffees peru mexico nicaragua rich quite strong smooth like espresso coffee dark
 
Description for mocha-java: 
blend combines best africa indonesia produce strong rich smooth coffee plenty depth dark
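Reading two descriptions side by side works, but a number can make the overlap easier to judge. As a rough check, the sketch below computes the Jaccard similarity (shared keywords divided by all distinct keywords) for the two descriptions printed above. This measure is not what t-SNE uses internally and is not part of the original analysis; the function name `jaccard` is introduced here for illustration.

```python
def jaccard(words_a, words_b):
    """Share of distinct keywords that the two descriptions have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Keyword lists copied from the printed descriptions (unstemmed)
organic_espresso = ("blend organic estate coffees peru mexico nicaragua rich quite "
                    "strong smooth like espresso coffee dark").split()
mocha_java = ("blend combines best africa indonesia produce strong rich smooth "
              "coffee plenty depth dark").split()

shared = sorted(set(organic_espresso) & set(mocha_java))
print(shared)
# → ['blend', 'coffee', 'dark', 'rich', 'smooth', 'strong']
print(round(jaccard(organic_espresso, mocha_java), 3))
# → 0.273
```

Six of the 22 distinct keywords are shared, and most of them describe body and roast rather than origin, which matches the impression from reading the two descriptions.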


<div style="font-size: 14px">
That is a nice way to visualize how similar these beans are. Unfortunately, the output is only as good as the input: two beans sitting close together on the plot are not necessarily similar in the cup. For example, the retailer may say both beans are great for V60, but that does not mean their taste profiles are similar. Nonetheless, the plot gives me useful hints about which bean to try next!
</div>