# Data Cleaning

## Problem Statement
Analyze cannabis reviews from Leafly and determine what topics are being discussed.

## Input
The data consists of cannabis strain reviews from Leafly and cannabis strain chemical composition from lab tests.

For this project we will use only the reviews from Leafly, which are located in the "Strain data" folder.

__[Download the dataset here](https://data.mendeley.com/datasets/6zwcgrttkp/1)__

## Output

1. Corpus - a collection of text

2. Document-Term Matrix - word counts in matrix format

data_cleaning.ipynb

Author: UFO Software, LLC

Created: Tue 06 Sep 2022 01:09:31 PM PDT

MIT License

Copyright (c) 2022 UFO Software, LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


In [1]:
import os
from os.path import exists
import glob
import json
import spacy
from spacy import displacy
import pandas as pd
import itertools as it
import tqdm as notebook_tqdm
import warnings
import pickle
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer
from spacy.util import minibatch
from bokeh.plotting import figure, output_notebook, show
import numpy as np
import nltk
from collections import Counter
import string
import re

## Create Temp Directory
Many of the steps in this notebook take hours to complete.

To allow changes without rerunning every step, the results of long operations are saved in a temp directory.

Before each long operation, the code checks whether its output file already exists and skips the operation if it does.

To rerun part of the notebook, delete the files generated by the sections you want to rerun.
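The check-then-skip pattern described above can be captured in one small helper. The following is a minimal sketch using only the standard library; `load_or_build` and `slow_step` are hypothetical names, and it caches with `pickle` where the notebook itself writes parquet files.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the cached result at `path`, or call `build()` and cache it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_step():
    # stand-in for an operation that takes hours
    calls.append(1)
    return {'rows': 3}


cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, 'slow_step.pkl')

first = load_or_build(cache_file, slow_step)   # runs slow_step and saves the result
second = load_or_build(cache_file, slow_step)  # reads the saved result instead
```

Deleting `cache_file` forces the next call to run `slow_step` again, which matches the rerun instructions above.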

In [2]:
# fill in your directory structure
parent_dir = '/run/user/1000/gvfs/smb-share:server=titan.local,share=data_sets/strains'
strain_data_dir = parent_dir+'/strain_data'
temp_dir = parent_dir+'/temp'
if not exists(temp_dir):
    os.mkdir(temp_dir)

## Parse The Strain Files

Save the data in a Pandas DataFrame

In [3]:
def parse_strains():
    strains_df = pd.DataFrame(columns = ['strain', 'species', 'report', 'effect', 'aroma', 'user'])
    for filename in glob.iglob(strain_data_dir+'/*'):
        # a file may hold several pickled objects; read until end of file
        objects = []
        with open(filename, "rb") as openfile:
            while True:
                try:
                    objects.append(pickle.load(openfile))
                except EOFError:
                    break
        # only the first object is used; its keys are in Spanish
        index = 1
        strain = objects[0]['strain']
        species = objects[0]['categorias'][0]

        # walk the reviews until the index runs past the last one
        while True:
            try:
                review = objects[0]['data_strain'][index]
                report = review['reporte']
                effects = review['efectos']
                aroma = review['sabores']
                user = review['usuario']

                strains_df.loc[len(strains_df.index)] = [strain, species, report, effects, aroma, user]
                index+=1
            except (IndexError, KeyError):
                break

    return strains_df

In [4]:
strain_file = temp_dir+'/strains.parquet'

# If the file with the dataframe exists, read it in; otherwise parse the strain files and save the result as a parquet file.
# Parsing the strain files is a lengthy operation.
if exists(strain_file):
    strains_df = pd.read_parquet(strain_file)
else:
    strains_df = parse_strains()
    strains_df.reset_index(drop=True, inplace = True)
    strains_df.to_parquet(strain_file)
    
strains_df

,strain,species,report,effect,aroma,user
0,blueberry-jack,Hybrid,My go to strain when I&#39;m &quot;down in the...,"[Giggly, Happy, Talkative, Uplifted]","[Blueberry, Pine]",aliesha03
1,blueberry-jack,Hybrid,Excellent any time of the day strain. First of...,"[Creative, Focused, Talkative, Uplifted, Dry M...","[Blueberry, Earthy, Pine, Pungent, Spicy/Herbal]",serinity0087
2,blueberry-jack,Hybrid,This definitely got me giggly and happy. My fr...,"[Energetic, Euphoric, Giggly, Happy, Uplifted]","[Blueberry, Citrus, Grape, Grapefruit]",lightweightloser
3,blueberry-jack,Hybrid,Love this helps me with my pain anxiety love t...,"[Happy, Relaxed, Sleepy]",[Blueberry],Bosslady1374
4,blueberry-jack,Hybrid,"Fun, but intense with an overwhelming head hig...",[],[Blueberry],weazal
...,...,...,...,...,...,...
100810,707-headband,Hybrid,Got me real high for 5 minutes and then quickl...,[],[],Princess16
100811,707-headband,Hybrid,ALLLL TIME FAV nothing beats this weed,"[Energetic, Euphoric, Uplifted, Dry Eyes, Dry ...",[],Endergtp23
100812,707-headband,Hybrid,Estimated THC level.,"[Creative, Energetic, Euphoric, Focused, Giggl...","[Chemical, Citrus, Diesel, Earthy, Lime, Pine,...",CalisGoinBrokeBostonBudsMMJCERTIFIED
100813,707-headband,Hybrid,it is a a high yeald hard hiting strain avrage...,"[Creative, Focused, Giggly, Happy, Relaxed, Up...",[Lemon],bobct76


## Drop Duplicate Reviews and Irrelevant Columns
There are a large number of duplicate reviews, which negatively affect the results, so they are dropped here along with the columns that are not needed.
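To make the deduplication rule concrete, here is a plain-Python sketch of keeping the first occurrence of each (strain, report, user) combination, which is also what pandas `drop_duplicates` does with its default `keep='first'`. The function name `drop_duplicate_reviews` and the sample rows are illustrative assumptions, not part of the dataset.

```python
def drop_duplicate_reviews(rows, keys=('strain', 'report', 'user')):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[name] for name in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sample_rows = [
    {'strain': 'blueberry-jack', 'report': 'great strain', 'user': 'user_a'},
    {'strain': 'blueberry-jack', 'report': 'great strain', 'user': 'user_a'},  # exact repeat
    {'strain': 'blueberry-jack', 'report': 'great strain', 'user': 'user_b'},  # different user
]

deduped = drop_duplicate_reviews(sample_rows)
```

The same text posted by a different user is kept, because `user` is part of the key.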

In [5]:
data_clean = strains_df.copy()
data_clean.drop_duplicates(subset = ['strain','report', 'user'], inplace = True, ignore_index = True)
data_clean.drop(columns = ['effect', 'aroma', 'user'], inplace = True)
data_clean.strain = data_clean.strain.astype('string')
data_clean.species = data_clean.species.astype('string')
data_clean.report = data_clean.report.astype('string')
data_clean

,strain,species,report
0,blueberry-jack,Hybrid,My go to strain when I&#39;m &quot;down in the...
1,blueberry-jack,Hybrid,Excellent any time of the day strain. First of...
2,blueberry-jack,Hybrid,This definitely got me giggly and happy. My fr...
3,blueberry-jack,Hybrid,Love this helps me with my pain anxiety love t...
4,blueberry-jack,Hybrid,"Fun, but intense with an overwhelming head hig..."
...,...,...,...
77677,707-headband,Hybrid,Got me real high for 5 minutes and then quickl...
77678,707-headband,Hybrid,ALLLL TIME FAV nothing beats this weed
77679,707-headband,Hybrid,Estimated THC level.
77680,707-headband,Hybrid,it is a a high yeald hard hiting strain avrage...


## Cleanup

Remove newlines, numbers, punctuation, and extra spaces. Convert the text to lowercase and drop empty reviews.
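HTML entities such as `&#39;` and `&quot;` appear in the raw reviews. One way to handle them is to decode them with the standard library's `html.unescape` before stripping punctuation, so that no entity fragments survive. The following is a minimal plain-Python sketch of that ordering; `clean_review` is a hypothetical helper, and unlike the cell below it replaces newlines with a space and trims leading and trailing spaces.

```python
import html
import re
import string

# translation table that deletes every ASCII digit and punctuation character
_DROP = str.maketrans('', '', string.digits + string.punctuation)


def clean_review(text):
    text = html.unescape(text)              # &#39; -> '   &quot; -> "   &amp; -> &
    text = text.replace('&', ' and ')       # spell out ampersands
    text = re.sub(r'[\n\t\r/]', ' ', text)  # whitespace controls and slashes become spaces
    text = text.translate(_DROP)            # remove digits and punctuation
    text = re.sub(r' +', ' ', text)         # collapse runs of spaces
    return text.lower().strip()


cleaned = clean_review('I&#39;m &quot;down&quot; 24/7\nR&amp;R')
print(cleaned)  # im down r and r
```

Decoding first matters: once `;` or `&` has been stripped, the entity patterns can no longer be recognized.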

In [6]:
# remove new line, tab and carriage return
data_clean.report = data_clean.report.str.translate(str.maketrans('','', '\n\t\r'))
# replace / with space
data_clean.report = data_clean.report.str.replace('/',' ', regex = True)
# remove or fix special characters
# note: ';' is stripped first, so the three entity patterns that follow (each ends in ';') no longer match;
# leftover fragments such as "quot" remain visible in the output
data_clean.report = data_clean.report.str.replace('\\;','', regex = True)
data_clean.report = data_clean.report.str.replace('&amp;','and', regex = True)
data_clean.report = data_clean.report.str.replace('&#39;',"'", regex = True)
data_clean.report = data_clean.report.str.replace('&quot;',"", regex = True)
# remove numbers
data_clean.report = data_clean.report.str.translate(str.maketrans('', '', string.digits))
# remove punctuation
data_clean.report = data_clean.report.str.translate(str.maketrans('', '', string.punctuation))
# remove extra spaces
data_clean.report = data_clean.report.replace({' +':' '},regex=True)
# make text lowercase
data_clean.report = data_clean.report.str.lower()
# remove missing and empty reviews
data_clean = data_clean.dropna(subset = ['report'])
data_clean = data_clean[~(data_clean.report == '')].copy()
data_clean.reset_index(drop = True, inplace = True)
data_clean.report = data_clean.report.astype('string')
data_clean

,strain,species,report
0,blueberry-jack,Hybrid,my go to strain when im quotdown in the rabbit...
1,blueberry-jack,Hybrid,excellent any time of the day strain first of ...
2,blueberry-jack,Hybrid,this definitely got me giggly and happy my fri...
3,blueberry-jack,Hybrid,love this helps me with my pain anxiety love t...
4,blueberry-jack,Hybrid,fun but intense with an overwhelming head high...
...,...,...,...
77481,707-headband,Hybrid,got me real high for minutes and then quickly ...
77482,707-headband,Hybrid,allll time fav nothing beats this weed
77483,707-headband,Hybrid,estimated thc level
77484,707-headband,Hybrid,it is a a high yeald hard hiting strain avrage...


## Remove non-English reviews

In [7]:
data_clean.drop(data_clean[data_clean.report == 'ダッチパッション社製のを試しました。感じた点は、身体が重くなりにくい、強いハイ、長く続く、日中散歩に合う、といったところです。酸味が強く、スカンク系とは別の香りです。サティバ強め。'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'hella clean burn milky smoke very thick cheefed with friends lost of reminiscing was had super uplifting カーリークーシュ美味しいだぞ o'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'самая сильная дудка что пробовал в небольших средних количествах нехило поднимает настроение прям хоть танцуй музыку под нее хорошо слушать грамма через бонг с перколятрами отпрвили в единственный в моей жизни бэдтрип на три часа в другой раз грамма употребленные в течении получаса через вапорайзер размазали по креслу на четыре с половиной часа даже визуалы словил'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'очень радостный и позитивный сорт делает тебя энергичным и радости полные штаны с приятным расслабоном после пика это в малых средних количествах если пыхнуть побольше то минуя стадию quotрадостно и веселоquot сразу накрывает как хорошая индика '].index,inplace = True)
data_clean.drop(data_clean[data_clean.report == 'в небольших количествах прекрасно тонизирует и наполняет голову позитивом если слегка перебрать то размазывает и можно разучиться разговаривать'].index,inplace = True)
data_clean.drop(data_clean[data_clean.report == 'убойнейшая индика самый смак для того чтобы вечером покурить и залипнуть в окно попивая чай если был тяжелый день или ты весь такой на нерваках зайдет просто на ура'].index,inplace = True)
data_clean.drop(data_clean[data_clean.report == 'unikatowa amnesia najlepsze palonko jakie można palić szkoda tylko że w efekcie końcowym rozsadza ci czachę i powoduje senność oraz zawroty głowy może to tylko ajerkoniak nie pamięam jedno wiem nie dla początkujących bowiem żywi ludzie są jeszcze bardziej przydatni niż ci umierający haha na plus ale mogłoby być lepiej'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'idealna na długie konwersacje na niewiadomo jakie tematy z kumplami poprawia kreatywność i dość długo pozwala się sobą cieszyć całkiem przyjemna w zapachu i smaku mr'].index,inplace = True)
data_clean.drop(data_clean[data_clean.report == 'ヾ∀｀ﾉ'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'bardzo przyjemny strain działa zawsze tak samo dobrze bardzo odpręża i daje efekt euforii przez pierwsze minut potem jest przyjemny chill i bardzo lekki i delikatny koniec mimo że zawartość thc waha się między to i tak moc jest dużo bardziej wyczuwalna niż w mocniejszych odmianach polecam'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'очень ароматный с запахом лимона курится легко эффект такой будто съел за рас кило лимонов после двух выкуренных чувствуются психоделические наплывы эффект плавно распределяется по всему телу люблю курить один этот сорт '].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'очень сильный сорт покурив его впервые получил реально психоделический хай '].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'здравсвуйтея очень хочу заказать kk возможно ли доставить в москвуи какая цена'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'крос и не прихотлив '].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'автоцвет'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'комфортна ненапряжна супер'].index, inplace = True)
data_clean.drop(data_clean[data_clean.report == 'ог куш погружает'].index, inplace = True)
data_clean.reset_index(drop = True, inplace = True)
data_clean

,strain,species,report
0,blueberry-jack,Hybrid,my go to strain when im quotdown in the rabbit...
1,blueberry-jack,Hybrid,excellent any time of the day strain first of ...
2,blueberry-jack,Hybrid,this definitely got me giggly and happy my fri...
3,blueberry-jack,Hybrid,love this helps me with my pain anxiety love t...
4,blueberry-jack,Hybrid,fun but intense with an overwhelming head high...
...,...,...,...
77464,707-headband,Hybrid,got me real high for minutes and then quickly ...
77465,707-headband,Hybrid,allll time fav nothing beats this weed
77466,707-headband,Hybrid,estimated thc level
77467,707-headband,Hybrid,it is a a high yeald hard hiting strain avrage...
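The exact-match drops above depend on every string being copied character for character. As a complementary check, a script-based heuristic can flag reviews that contain letters from non-Latin scripts. The sketch below is an illustration under stated assumptions, not the method used in this notebook: `has_non_latin_letters` and `NON_LATIN_PREFIXES` are hypothetical names, only a few scripts are listed, and Latin-script languages such as Polish are not detected.

```python
import unicodedata

# Unicode character-name prefixes treated as "not Latin" (an assumed, incomplete list)
NON_LATIN_PREFIXES = ('CYRILLIC', 'HIRAGANA', 'KATAKANA', 'CJK')


def has_non_latin_letters(text):
    """Return True if any alphabetic character belongs to one of the listed scripts."""
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, '')
            if name.startswith(NON_LATIN_PREFIXES):
                return True
    return False


reviews = [
    'allll time fav nothing beats this weed',
    'автоцвет',
    'super uplifting カーリークーシュ',
    'bardzo przyjemny strain można',
]
flags = [has_non_latin_letters(r) for r in reviews]
print(flags)  # [False, True, True, False]
```

Flagged reviews would still need a manual look, since a mostly English review can contain a few non-Latin characters.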


In [8]:
data_clean_file = temp_dir+'/data_clean.parquet'
data_clean.to_parquet(data_clean_file)

## Combine Reviews by Strain

Combining the reviews by strain yielded the best topic modeling results.

Performing topic modeling using the non-grouped reviews yielded poor results.

Combining the reviews by type also yielded poor results.
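Combining by strain means that every strain becomes one long document made of all its reviews. A plain-Python sketch of that grouping is shown here; `combine_by_strain` and the sample tuples are illustrative assumptions, and the notebook itself does this with a pandas `groupby`.

```python
from collections import defaultdict


def combine_by_strain(rows):
    """Join all reviews that share the same (strain, species) into one string."""
    grouped = defaultdict(list)
    for strain, species, report in rows:
        grouped[(strain, species)].append(report)
    # sort by key so the result has a stable order
    return {key: ' '.join(reports) for key, reports in sorted(grouped.items())}


sample = [
    ('yummy', 'Hybrid', 'nice body high'),
    ('1024', 'Sativa', 'good head high'),
    ('yummy', 'Hybrid', 'great for day use'),
]
combined = combine_by_strain(sample)
```

Here `combined[('yummy', 'Hybrid')]` holds both yummy reviews separated by a space.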

In [9]:
data_by_strain = data_clean.groupby(by = ['strain', 'species'], as_index = False).report.apply(' '.join)
data_by_strain.to_parquet(temp_dir+'/by_strain.parquet')
data_by_strain

,strain,species,report
0,1024,Sativa,its a good even head and body high good for st...
1,24k-gold,Hybrid,you can change the name give it no name call i...
2,3-kings,Hybrid,i was skeptical about this strain after trying...
3,3x-crazy,Indica,this strain is always a favorite the top favor...
4,501st-og,Hybrid,i have ms and this strain was suggested for me...
...,...,...,...
1051,yoda-og,Indica,superdank i finally found my medicine i cant s...
1052,yogi-diesel,Hybrid,this strain provides a nice head high where yo...
1053,yumboldt,Indica,this strain is excellent for relieving my migr...
1054,yummy,Hybrid,really like this one nice body high great for ...


## Load spaCy
I used the transformer model, but the sm, md, and lg models will probably yield decent results.

In [10]:
!python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')

Collecting en-core-web-trf==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl (460.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 460.3/460.3 MB 5.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


## Lemmatization
Lemmatization returns the root (dictionary) form of a word. It reduces inflected forms, such as verb tenses and plurals, to a common base while keeping the meaning of the word the same.

Examples:
* better -> good
* walking -> walk
* was -> be
* mice -> mouse
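As a toy illustration of the mapping in the examples above, the sketch below lemmatizes by dictionary lookup. This is only a stand-in to show the input and output shape: `LEMMA_TABLE` and `lemmatize_tokens` are hypothetical, the table contains just the four example words, and the notebook relies on spaCy's trained pipeline rather than a lookup table.

```python
# toy lookup table built from the four examples above
LEMMA_TABLE = {
    'better': 'good',
    'walking': 'walk',
    'was': 'be',
    'mice': 'mouse',
}


def lemmatize_tokens(text, table=LEMMA_TABLE):
    """Replace each whitespace-separated token with its lemma when one is known."""
    return ' '.join(table.get(token, token) for token in text.split())


lemmatized = lemmatize_tokens('the mice was walking better')
print(lemmatized)  # the mouse be walk good
```

Words missing from the table pass through unchanged, which is why a real lemmatizer needs a full vocabulary or a trained model.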


In [11]:
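def lemmatize_review(x):
    doc = nlp(x)
    lemmed_list = []
    for token in doc:
        if not token.is_punct:
            # spaCy v2 returned the placeholder '-PRON-' as the lemma of pronouns;
            # keep the original text in that case. spaCy v3 returns real lemmas,
            # so this branch is not taken there.
            if token.lemma_ == '-PRON-':
                lemmed_list.append(token.text)
            else:
                lemmed_list.append(token.lemma_)

    return " ".join(lemmed_list)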

In [12]:
lemmed_file = temp_dir+'/lemmatized.parquet'
if exists(lemmed_file):
    data_by_strain = pd.read_parquet(lemmed_file)
else:
    data_by_strain.report = data_by_strain.report.apply(lambda x: lemmatize_review(x))
    data_by_strain.to_parquet(lemmed_file)
    
data_by_strain

,strain,species,report
0,1024,Sativa,it a good even head and body high good for str...
1,24k-gold,Hybrid,you can change the name give it no name call i...
2,3-kings,Hybrid,I be skeptical about this strain after try thr...
3,3x-crazy,Indica,this strain be always a favorite the top favor...
4,501st-og,Hybrid,I have m and this strain be suggest for I to h...
...,...,...,...
1051,yoda-og,Indica,superdank I finally find my medicine I can nt ...
1052,yogi-diesel,Hybrid,this strain provide a nice head high where you...
1053,yumboldt,Indica,this strain be excellent for relieve my migrai...
1054,yummy,Hybrid,really like this one nice body high great for ...


In [13]:
# write each review to a text file, separating reviews with a \n
lemm_reviews_file = temp_dir+'/lemm_reviews.txt'
if not exists(lemm_reviews_file):
    with open(lemm_reviews_file, 'w') as lem_review_txt_file:
        for review in data_by_strain.report:
            lem_review_txt_file.write(review + '\n')
# Read the reviews back in, treating each line (one review) as one sentence.
# max_sentence_length is increased because otherwise some of the reviews were cut off.
sentences_unigrams = LineSentence(lemm_reviews_file, max_sentence_length = 1000000)

## Phrase Modeling
Detect frequently used phrases and combine them.

## Bigrams
A bigram is a two-word phrase. Find the most frequently occurring two-word phrases and join each one into a single token.

## Trigrams
A trigram is a three-word phrase. Running the phrase model a second time, on the bigram output, finds frequently occurring three-word phrases and joins them.
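The idea behind phrase detection can be sketched with plain counting: count adjacent word pairs, keep the pairs seen often enough, and join them with an underscore. This is a simplified illustration, not gensim's algorithm: gensim's `Phrases` scores candidate pairs and applies its `min_count` and `threshold` parameters, while the hypothetical helpers `learn_bigrams` and `apply_bigrams` below use a raw count cutoff only.

```python
from collections import Counter


def learn_bigrams(sentences, min_count=2):
    """Return the set of adjacent word pairs seen at least `min_count` times."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in pair_counts.items() if n >= min_count}


def apply_bigrams(tokens, bigrams):
    """Join known pairs into single tokens, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


sentences = [
    ['nice', 'head', 'high'],
    ['strong', 'head', 'high'],
    ['nice', 'taste'],
]
bigrams = learn_bigrams(sentences, min_count=2)
with_bigrams = [apply_bigrams(s, bigrams) for s in sentences]
print(with_bigrams[0])  # ['nice', 'head_high']
```

Running `learn_bigrams` again on `with_bigrams` treats `head_high` as one word, so a pair such as `('nice', 'head_high')` can become a three-word phrase. That is the same two-pass structure the cells below use.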

In [14]:
bigram_model_file = temp_dir+'/bigram_phrase_model'

if not exists(bigram_model_file):
    bigram_phrases = Phrases(sentences_unigrams)
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    bigram_phrases = Phraser(bigram_phrases)
    bigram_phrases.save(bigram_model_file)

In [15]:
bigram_phrases = Phraser.load(bigram_model_file)
sentences_bigrams_file = temp_dir+'/sentence_bigram_phrases_all.txt'

if not exists(sentences_bigrams_file):
    with open(sentences_bigrams_file, 'w') as f:

        for sentence_unigrams in sentences_unigrams:

            sentence_bigrams = ' '.join(bigram_phrases[sentence_unigrams])

            # write the transformed sentence as a line in the new file
            f.write(sentence_bigrams + '\n')

In [16]:
sentences_bigrams = LineSentence(sentences_bigrams_file, max_sentence_length = 1000000)
trigram_model_file = temp_dir+'/trigram_phrase_model'

if not exists(trigram_model_file):
    trigram_phrases = Phrases(sentences_bigrams) 
    # Turn the finished Phrases model into a "Phraser" object,
    # which is optimized for speed and memory use
    trigram_phrases = Phraser(trigram_phrases)
    trigram_phrases.save(trigram_model_file)

In [17]:
trigram_phrases = Phraser.load(trigram_model_file)
sentences_trigrams_file = temp_dir+'/sentence_trigram_phrases_all.txt'

if not exists(sentences_trigrams_file):
    with open(sentences_trigrams_file, 'w') as f:
        
        for sentence_bigrams in sentences_bigrams:
            
            sentence_trigrams = ' '.join(trigram_phrases[sentence_bigrams])
            
            f.write(sentence_trigrams + '\n')   

In [18]:
review_trigrams_file = temp_dir+'/review_trigrams_all.txt'

if not exists(review_trigrams_file):
    # Read in each review where one line = one sentence.  Max sentence length needed to be increased otherwise some of the reviews were cut off
    reviews_lemmatized = LineSentence(lemm_reviews_file, max_sentence_length = 1000000)

    with open(review_trigrams_file, 'w') as f:
        
        for review_unigrams in reviews_lemmatized:
                        
            # apply the first-order and second-order phrase models
            review_bigrams = bigram_phrases[review_unigrams]
            review_trigrams = trigram_phrases[review_bigrams]
            
            # write the transformed review as a line in the new file
            review_trigrams = ' '.join(review_trigrams)
            f.write(review_trigrams + '\n')

In [19]:
trigram_df_file = temp_dir+'/tri_grams.parquet'

if not exists(trigram_df_file):
    with open(review_trigrams_file) as f:
        # one review per line; strip the trailing newline
        tri_df = pd.DataFrame({'tri_review': [review.rstrip('\n') for review in f]})

    tri_df.to_parquet(trigram_df_file)

else:
    tri_df = pd.read_parquet(trigram_df_file)
    
tri_df

,tri_review
0,it a good even head and body high good for str...
1,you can change the name give it no name call i...
2,I be skeptical_about this strain after try thr...
3,this strain be always a favorite the top favor...
4,I have m and this strain be suggest for I to h...
...,...
1051,superdank I finally_find my medicine I can nt ...
1052,this strain provide a nice head high where you...
1053,this strain be excellent for relieve my migrai...
1054,really like this one nice body high great for ...


In [20]:
# concatenate the reviews with trigrams to dataframe
data_by_strain = pd.concat([data_by_strain, tri_df],axis = 1)
data_by_strain

,strain,species,report,tri_review
0,1024,Sativa,it a good even head and body high good for str...,it a good even head and body high good for str...
1,24k-gold,Hybrid,you can change the name give it no name call i...,you can change the name give it no name call i...
2,3-kings,Hybrid,I be skeptical about this strain after try thr...,I be skeptical_about this strain after try thr...
3,3x-crazy,Indica,this strain be always a favorite the top favor...,this strain be always a favorite the top favor...
4,501st-og,Hybrid,I have m and this strain be suggest for I to h...,I have m and this strain be suggest for I to h...
...,...,...,...,...
1051,yoda-og,Indica,superdank I finally find my medicine I can nt ...,superdank I finally_find my medicine I can nt ...
1052,yogi-diesel,Hybrid,this strain provide a nice head high where you...,this strain provide a nice head high where you...
1053,yumboldt,Indica,this strain be excellent for relieve my migrai...,this strain be excellent for relieve my migrai...
1054,yummy,Hybrid,really like this one nice body high great for ...,really like this one nice body high great for ...


## Make the Strain Name the dataframe's Index

In [21]:
data_by_strain.set_index('strain', drop = True, inplace = True)
data_by_strain

strain,species,report,tri_review
1024,Sativa,it a good even head and body high good for str...,it a good even head and body high good for str...
24k-gold,Hybrid,you can change the name give it no name call i...,you can change the name give it no name call i...
3-kings,Hybrid,I be skeptical about this strain after try thr...,I be skeptical_about this strain after try thr...
3x-crazy,Indica,this strain be always a favorite the top favor...,this strain be always a favorite the top favor...
501st-og,Hybrid,I have m and this strain be suggest for I to h...,I have m and this strain be suggest for I to h...
...,...,...,...
yoda-og,Indica,superdank I finally find my medicine I can nt ...,superdank I finally_find my medicine I can nt ...
yogi-diesel,Hybrid,this strain provide a nice head high where you...,this strain provide a nice head high where you...
yumboldt,Indica,this strain be excellent for relieve my migrai...,this strain be excellent for relieve my migrai...
yummy,Hybrid,really like this one nice body high great for ...,really like this one nice body high great for ...


## Remove Stop Words

In [22]:
def remove_stop_words(x):
    doc = nlp(x)
    stopless_list = []
    for token in doc:
        if not token.is_stop:
            stopless_list.append(token.text)
    return " ".join(stopless_list)

In [23]:
data_by_strain.tri_review = data_by_strain.tri_review.apply(lambda x: remove_stop_words(x))
data_by_strain

strain,species,report,tri_review
1024,Sativa,it a good even head and body high good for str...,good head body high good stress nice high good...
24k-gold,Hybrid,you can change the name give it no name call i...,change schnauzerganjkosher tangie k gold gold ...
3-kings,Hybrid,I be skeptical about this strain after try thr...,skeptical_about strain try kings amazed reliev...
3x-crazy,Indica,this strain be always a favorite the top favor...,strain favorite favorite fact potency strain d...
501st-og,Hybrid,I have m and this strain be suggest for I to h...,m strain suggest help muscle_spasm occur m sle...
...,...,...,...
yoda-og,Indica,superdank I finally find my medicine I can nt ...,superdank finally_find medicine nt sleep activ...
yogi-diesel,Hybrid,this strain provide a nice head high where you...,strain provide nice head high thought forefron...
yumboldt,Indica,this strain be excellent for relieve my migrai...,strain excellent relieve migraine chance strai...
yummy,Hybrid,really like this one nice body high great for ...,like nice body high great day use light head h...


## Output
The output of this notebook is:

1. Corpus - a collection of text

2. Document-Term Matrix - word counts in matrix format
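To show what a document-term matrix contains, here is a small standard-library sketch that builds one from whitespace-separated documents: one row per document, one column per vocabulary term, and each cell a word count. `document_term_matrix` is a hypothetical helper. The notebook uses scikit-learn's `CountVectorizer`, whose default tokenizer also lowercases text and ignores single-character tokens, which this sketch does not replicate.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(doc.split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


docs = ['good head high good', 'nice body high']
vocabulary, rows = document_term_matrix(docs)
print(vocabulary)  # ['body', 'good', 'head', 'high', 'nice']
print(rows)        # [[0, 2, 1, 1, 0], [1, 0, 0, 1, 1]]
```

Most cells are zero because each document uses only a small part of the vocabulary, which is also visible in the matrix printed by the notebook.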

In [24]:
# Save the dataframe containing the corpus
data_by_strain.to_parquet(temp_dir+'/corpus.parquet')

In [25]:
cv = CountVectorizer()
data_cv = cv.fit_transform(data_by_strain.tri_review)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_by_strain.index
data_dtm

strain,_all,aa,aaa,aaaa,aaaaa,aaaaaaaaaaaaaa,aaaaaaaaaaaaaqaaaaaaaa,aaaaaaaaaand,aaaaaahhhhhhmaxing,aaaahhhhhhh,...,zzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,ıf,łēčtrpart,ʻohana,ʻono,δthc,⅛th,⅛thweight
1024,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24k-gold,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3-kings,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3x-crazy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
501st-og,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
yoda-og,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yogi-diesel,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yumboldt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yummy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
data_dtm.to_parquet(temp_dir+'/dtm.parquet')
# Also pickle the matrix and the fitted vectorizer for later use
data_dtm.to_pickle(temp_dir+"/dtm.pkl")
with open(temp_dir+"/cv.pkl", "wb") as f:
    pickle.dump(cv, f)