# MediZen

### Cannabis strain recommendation system

Tobias Reaper

---

---

## Notebook Outline

* Introduction

---

## Introduction



---

### Imports and config

In [1]:
# === General imports === #
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

import janitor

# === ML Imports === #
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [2]:
# === Configuration === #
%matplotlib inline
# Configure pandas to display entire text of column
pd.set_option('max_colwidth', 200)
pd.set_option('max_columns', 200)  # Display up to 200 columns

---

## The Data

In [7]:
# === Load and look at the dataset === #
datapath = "assets/cannabis.csv"

strains = (pd.read_csv(datapath)
           .clean_names()
          )

print(strains.shape)
strains.head()

(2351, 6)


Unnamed: 0,strain,type,rating,effects,flavor,description
0,100-Og,hybrid,4.0,"Creative,Energetic,Tingly,Euphoric,Relaxed","Earthy,Sweet,Citrus","$100 OG is a 50/50 hybrid strain that packs a strong punch. The name supposedly refers to both its strength and high price when it first started showing up in Hollywood. As a plant, $100 OG tends ..."
1,98-White-Widow,hybrid,4.7,"Relaxed,Aroused,Creative,Happy,Energetic","Flowery,Violet,Diesel",The ‘98 Aloha White Widow is an especially potent cut of White Widow that has grown in renown alongside Hawaiian legends like Maui Wowie and Kona Gold. This White Widow phenotype reeks of diesel a...
2,1024,sativa,4.4,"Uplifted,Happy,Relaxed,Energetic,Creative","Spicy/Herbal,Sage,Woody","1024 is a sativa-dominant hybrid bred in Spain by Medical Seeds Co. The breeders claim to guard the secret genetics due to security reasons, but regardless of its genetic heritage, 1024 is a THC p..."
3,13-Dawgs,hybrid,4.2,"Tingly,Creative,Hungry,Relaxed,Uplifted","Apricot,Citrus,Grapefruit",13 Dawgs is a hybrid of G13 and Chemdawg genetics bred by Canadian LP Delta 9 BioTech. The two potent strains mix to create a balance between indica and sativa effects. 13 Dawgs has a sweet earthy...
4,24K-Gold,hybrid,4.6,"Happy,Relaxed,Euphoric,Uplifted,Talkative","Citrus,Earthy,Orange","Also known as Kosher Tangie, 24k Gold is a 60% indica-dominant hybrid that combines the legendary LA strain Kosher Kush with champion sativa Tangie to create something quite unique. Growing tall i..."


### Wrangling

In [8]:
# === Look for nulls === #
strains.isnull().sum()

strain          0
type            0
rating          0
effects         0
flavor         46
description    33
dtype: int64

In [10]:
# === Drop null values === #
strains = strains.dropna(subset=["flavor", "description"])
print(strains.shape)
strains.isnull().sum()

(2277, 6)


strain         0
type           0
rating         0
effects        0
flavor         0
description    0
dtype: int64

### Exploration

In [13]:
# === Distribution of strain types === #
strains["type"].value_counts()

hybrid    1169
indica     680
sativa     428
Name: type, dtype: int64

In [12]:
# === Get list of unique effects === #

# Create a blank set, loop thru df1, filling the set
# The result is a set of unique effects
effects_set = set()

for row in strains.itertuples(index=False):
    for w in row[3].split(","):
        if w != "None":
            effects_set.add(w)
            
effects_set

{'Aroused',
 'Creative',
 'Dry',
 'Energetic',
 'Euphoric',
 'Focused',
 'Giggly',
 'Happy',
 'Hungry',
 'Mouth',
 'Relaxed',
 'Sleepy',
 'Talkative',
 'Tingly',
 'Uplifted'}

In [None]:
# === Get list of unique flavors === #

In [None]:
# === Get count of rows containing each characteristic === #

# Create new dataframe for this
dfx = df1[["Strain", "Type", "Effects"]].copy()
dfx.head(2)

# Create new columns for each
for effect in effects_set:
    dfx[effect] = dfx["Effects"].str.contains(effect)

# Drop original column
dfx = dfx.drop(columns=["Effects"])

dfx.head()

In [None]:
dfx.value_counts()

In [28]:
# === Use pyjanitor to wrangle the data === #

df2 = (df1
        .clean_names()  # In this case, fixes Title Case
        .concatenate_columns(
            # Create a single feature for NLP analysis
            column_names=["type", "effects", "flavor"],
            new_column_name="type_effects_flavor",
            sep=",",
        )
        .remove_columns(column_names=[
            "rating",
            "description",
            "type",
            "effects",
            "flavor",
        ]))

What we're left with is a single feature containing the `type`, `effects`, and `flavor` of each strain.

In [29]:
strains.head(2)

Unnamed: 0,strain,type_effects_flavor
0,100-Og,"hybrid,Creative,Energetic,Tingly,Euphoric,Relaxed,Earthy,Sweet,Citrus"
1,98-White-Widow,"hybrid,Relaxed,Aroused,Creative,Happy,Energetic,Flowery,Violet,Diesel"


---

## Modeling

> ...to Filter or to Smart-Filter?

The input we would be receiving is a list of the user's desired characteristics.

We had two primary options for how to tackle the problem. We could build a simple filtering tool, or we could build a more powerful and flexible recommendation system.

### Naive filtering

* Break up `type`, `effects`, and `flavor` into separate columns for each individual element
  * Filter based on booleans for each column
* Use Pandas `str.contains()` or SQL `WHERE` clauses to filter
* These solutions could be implemented relatively easily...
  * They don't necessarily require ML modeling
* These filtering solutions are _naive_ - they strictly enforce the preferences sent to us by the user
    * If the user chose to filter by `sativa`, they would _only see sativa strains_
    * What if the best strain for their particular case, given the other characteristics, is a `hybrid`?
    * Sure, they would maybe be able to still find a good match
    * However, the recommendations would fail to accurately gauge their need
      * And, therefore miss out on providing the best recommendations

### Recommendation engine

* A good recommendation engine will be a little more flexible
* The recommendations will be a result of the entire filterset as a whole
  * Not based on individual elements
* To continue this example, the user thinks that the strain for them is `sativa`
  * Sativas are typically associated with effects like `energetic`, `creative`, `focused`
  * They check that on the filters list
* For the effects, however, they check off: `relaxed`, `giggly`, `sleepy`, `happy`, `hungry`
  * These effects are more likely to be associated with `hybrid` or `indica`
  * A more robust recommendations engine can recommend those, even if the user indicated `sativa`

---

### Recommended Model Recommendations

- NLP models are typically very good at breaking up and analyzing long strings (documents)
  - ...such as...a list of characteristics
  - Therefore, we thought it would simplify things to concatenate the three characteristics columns
  - The result is a single feature containing a single long string of each strain's characteristics
  - The input coming from the app can easily be concatenated and formatted to match our new feature
- Now we can think about methods, or models, that compare the input string to that of the dataset
- In order to use such data in an ML model, the words must be vectorized
  - or converted from words into numbers
  - The method we used is called TF-IDF

---

### Vectorizer

First, our pipeline uses TfidfVectorizer from scikit-learn to turn the words into numbers

* NLP models are typically very good at breaking up and analyzing long strings (documents)
  * ...such as...a list of characteristics
  * Therefore, we thought it would simplify things to concatenate the three characteristics columns

#### TF-IDF

* TF-IDF is a method of finding unique aspects of documents (strings)
  * The more common a word is across the documents the lower the score
  * The result is the unique topics rising to the top
* This way, we can compare the unique aspects of an input string.
  * In our case, we want to find the most similar, or least-unique
  * In this TF-IDF Matrix, 0 means completely similar

In [9]:
# === Vectorizer === #
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words="english")

# Create a vocabulary from new feature
dtm = tfidf.fit(strains2["type_effects_flavor"])

# Create vectorized version of concatenated feature
sparse = tfidf.transform(strains2["type_effects_flavor"])

# The result is a sparse matrix, which can be converted back to a dataframe
vdtm = pd.DataFrame(sparse.todense(), columns=tfidf.get_feature_names())

In [9]:
# === Serialize the vectorizer vocabulary === #
# The vocabulary is what we want to pickle and use in the app
with open("vector_vocab.pkl", "wb") as p:
    pickle.dump(dtm, p)

---

### Recommndations

#### Query for similarities with Nearest Neighbors (K-NN)

* Nearest Neighbor is a great method of calculating a list of similar "neighbors" to a given input
* Unsupervised model 
  * Means the model is not "predicting" anything
  * 

In [16]:
# === Instantiate the knn model === #
nn = NearestNeighbors(n_neighbors=10, algorithm='ball_tree')

# Fit (train) the model on the TF-IDF vector dataframe created above
nn.fit(vdtm)

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

In [15]:
# === Serialize model === #
# This trained model is what we want to pickle and use in the app
import pickle
with open("nn_rec_model.pkl", "wb") as p:
    pickle.dump(nn, p)

In [17]:
# === Test it out on a arbitrary input === #
input1 = "sativa,happy,energetic,focused,euphoric,earthy,woody,flowery"
num_recs = 10

# Create vector using the vocab that was fit above
input_vector = tfidf.transform([input1])

# Use NN model to retrieve top n similar strains
top_id = nn.kneighbors(input_vector.todense(), n_neighbors=num_recs)[1][0]

#### Returns

- Pass the vectorized input into the trained knn model, specifying the number of neighbors to return
- This returns a list of two arrays: one is a measure of each neighbors 'near-ness'
- the other (the one we want) contains the indexes for the neighbors
  - API returns a list of only indexes
  - For the purposes of this demo I'll hydrate that list with the rest data
  - from the original (pre-wrangled) dataframe

In [18]:
# === Index-locate neighbors in og dataframe === #
top_df = df1.iloc[top_id]
top_df

Unnamed: 0,Strain,Type,Rating,Effects,Flavor,Description
2335,Y-Griega,sativa,4.8,"Happy,Energetic,Uplifted,Focused,Euphoric","Earthy,Woody,Flowery","Also known as simply “Y,” the 80% sativa Y Griega is an ..."
2129,Thai-Tanic,sativa,4.0,"Energetic,Uplifted,Happy,Focused,Euphoric","Sweet,Earthy,Woody",Thai-Tanic is a very compact sativa variety with that cl...
8,3D-Cbd,sativa,4.6,"Uplifted,Focused,Happy,Talkative,Relaxed","Earthy,Woody,Flowery",3D CBD from Snoop Dogg’s branded line of cannabis strain...
987,Harlequin,sativa,4.3,"Relaxed,Focused,Happy,Uplifted,Energetic","Earthy,Sweet,Woody",Harlequin is a 75/25 sativa-dominant strain renowned for...
2125,Thai,sativa,4.2,"Happy,Relaxed,Focused,Uplifted,Energetic","Earthy,Flowery,Sweet",Thai refers to a cannabis variety that grows natively in...
475,Charlottes-Web,sativa,4.5,"Relaxed,Uplifted,Focused,Happy,Energetic","Earthy,Flowery,Sweet",Charlotte’s Web is a cultivar with less than 0.3% THC th...
1100,Jack-Herer,sativa,4.4,"Happy,Uplifted,Energetic,Focused,Euphoric","Earthy,Pine,Woody",Jack Herer is a sativa-dominant cannabis strain that has...
948,Green-Haze,sativa,3.8,"Happy,Talkative,Creative,Focused,Hungry","Woody,Flowery,Earthy",Green Haze by A.C.E. Seeds is another version of their s...
1164,Kali-Mist,sativa,4.1,"Energetic,Focused,Uplifted,Euphoric,Creative","Woody,Earthy,Citrus","Kali Mist is known to deliver clear-headed, energetic ef..."
2047,Super-Green-Crack,sativa,4.5,"Happy,Giggly,Energetic,Focused,Euphoric","Earthy,Flowery,Pungent",Super Green Crack is a true sativa. Like a cup of strong...
