This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20 minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

In [1]:
# !pip install -r requirements.txt

In [2]:
# !python -m spacy download en_core_web_lg

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/cannabis.csv')
print('Shape:', df.shape, '\n')
df.head()

Shape: (2351, 6) 



| | Strain | Type | Rating | Effects | Flavor | Description |
|---|---|---|---|---|---|---|
| 0 | 100-Og | hybrid | 4.0 | Creative,Energetic,Tingly,Euphoric,Relaxed | Earthy,Sweet,Citrus | $100 OG is a 50/50 hybrid strain that packs a ... |
| 1 | 98-White-Widow | hybrid | 4.7 | Relaxed,Aroused,Creative,Happy,Energetic | Flowery,Violet,Diesel | The ‘98 Aloha White Widow is an especially pot... |
| 2 | 1024 | sativa | 4.4 | Uplifted,Happy,Relaxed,Energetic,Creative | Spicy/Herbal,Sage,Woody | 1024 is a sativa-dominant hybrid bred in Spain... |
| 3 | 13-Dawgs | hybrid | 4.2 | Tingly,Creative,Hungry,Relaxed,Uplifted | Apricot,Citrus,Grapefruit | 13 Dawgs is a hybrid of G13 and Chemdawg genet... |
| 4 | 24K-Gold | hybrid | 4.6 | Happy,Relaxed,Euphoric,Uplifted,Talkative | Citrus,Earthy,Orange | Also known as Kosher Tangie, 24k Gold is a 60%... |


# Tokens

## Definitions

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Natural Language Processing (NLP)**: `The branch of Data Science which deals with engineering human language features into a form which can be analyzed statistically.`

**Token**: `A unit of text, such as a word, number, or punctuation mark, produced by splitting a string. The simplest tokenizers split on white space.`

**Corpus**: `A body of texts.`

**Stopwords**: `Words which do not contribute sufficient insight in statistical analyses of a text or texts.`

**Statistical Trimming**: `A method wherein words which appear with a rather low or high frequency are removed from the data prior to analysis.`

**Stemming**: `A method wherein words which share a common base, such as ski, skis, and skiing, are grouped and counted under their base form, in this case ski. This method is designed to eliminate duplicates of the same concept and ensure the best sampling of words.`

**Lemmatization**: `Lemmatization operates on the same principle as stemming but in a more advanced way. Lemmatization is capable of dealing with stems which diverge, such as wolf and wolves, which stemming would mistakenly divide into wolf and wolv. Lemmatization accomplishes this by breaking words down to their base grammatical forms, e.g. converting "is" to "[to] be".`

**Vectorization**: `Vectorization converts text into numeric matrices, which a Data Scientist can then analyze with statistical methods.`
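To make the statistical trimming definition concrete, here is a minimal sketch that drops words appearing in too few or too many documents. The function name `trim_vocabulary`, its thresholds, and the toy corpus are all assumptions made up for illustration; they are not part of the study guide's dataset or of any library.

```python
from collections import Counter

def trim_vocabulary(docs, min_doc_frac=0.25, max_doc_frac=0.9):
    """Keep words whose document frequency falls inside the given bounds."""
    total_docs = len(docs)
    # Count each word once per document it appears in
    doc_freq = Counter(word for doc in docs for word in set(doc))
    keep = {
        word for word, n in doc_freq.items()
        if min_doc_frac <= n / total_docs <= max_doc_frac
    }
    return [[word for word in doc if word in keep] for doc in docs]

# Toy tokenized corpus, made up for illustration
docs = [
    ["strain", "sweet", "citrus"],
    ["strain", "earthy", "citrus"],
    ["strain", "sweet", "pine"],
    ["strain", "earthy", "diesel"],
]
trimmed = trim_vocabulary(docs, min_doc_frac=0.5, max_doc_frac=0.9)
print(trimmed)
```

In this toy corpus "strain" appears in every document and "pine" and "diesel" appear in only one each, so all three are removed while the mid-frequency words survive.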

## Questions of Understanding

1. What are at least 4 common cleaning tasks you need to do when creating tokens?
 1. `If webscraping, remove HTML or other extraneous elements, or convert them to text that Python can read.`
 2. `Perform stemming or lemmatization to group duplicates of the same base word together.`
 3. `Remove stop words, add custom stop words as needed.`
 4. `Account for named entities and other appropriate n-grams.`

2. Why is it important to apply custom stopwords to our dataset in addition to the ones that come in a library like spaCy?
```
Because when examining text from a certain category, e.g. movie reviews, certain words occur so frequently that they do not help distinguish one text from another. In our movie example, the words "movie" and "film" would likely be appropriate custom stop words because they appear in nearly every entry.
```

3. Explain the tradeoffs between statistical trimming, stemming, and lemmatizing.
```
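Statistical trimming removes very rare or very common words based on counts alone, which shrinks the vocabulary cheaply but risks discarding informative terms. Stemming is fast and favors recall, at the cost of sometimes producing non-words or merging unrelated words. Lemmatizing is slower but favors precision, because it maps each word to a real dictionary form.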
```
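The difference between stemming and lemmatizing can be sketched with two toy functions. Both are deliberately simplified assumptions: `toy_stem` strips a few suffixes (real stemmers such as Porter's use many more rules), and `toy_lemmatize` uses a hypothetical lookup table in place of a real lemmatizer's vocabulary.

```python
def toy_stem(word):
    """Crude suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's vocabulary
TOY_LEMMAS = {"wolves": "wolf", "skis": "ski", "skiing": "ski", "is": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["wolf", "wolves", "ski", "skis", "skiing", "is"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer needs no vocabulary but splits "wolf" and "wolves" into two groups, while the lookup approach keeps them together only because someone supplied the mapping.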

4. Why do we need to vectorize our documents?
```
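Because statistical methods and models operate on numbers, not raw text. Vectorizing turns each document into a numeric representation that the computer can perform statistics on.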
```

## Practice Problems

Write a function to tokenize the `Description` column. Make sure to include the following:
- Return the tokens in an iterable structure
- Normalize the case
- Remove non-alphanumeric characters such as punctuation, whitespace, unicode, etc.
- Apply stopwords and make sure to add stopwords specific to this dataset
- Lemmatize the tokens before returning them

In [3]:
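import spacy

nlp = spacy.load("en_core_web_lg")

# Dataset-specific stop words; this list is an assumption, tune it to your corpus
STOP_WORDS = nlp.Defaults.stop_words.union({"strain", "cannabis"})

def tokenize(text):
  """Return a list of lowercased, lemmatized tokens for one description."""
  tokens = []

  for token in nlp(text):
    # Normalize case on the lemma rather than on the raw text
    lemma = token.lemma_.lower().strip()
    # Keep alphanumeric lemmas that are not stop words
    if lemma.isalnum() and not token.is_stop and lemma not in STOP_WORDS:
      tokens.append(lemma)

  return tokens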

Apply your function to `Description` and save the resulting tokens in a new column, `Tokens`

In [5]:
#

Use the function below to create a `word_count` dataframe based off the `df['Tokens']` column you created.

In [None]:
from collections import Counter

def count(docs):
        word_counts = Counter()
        appears_in = Counter()
        total_docs = len(docs)

        for doc in docs:
            word_counts.update(doc)
            appears_in.update(set(doc))

        temp = zip(word_counts.keys(), word_counts.values())
        wc = pd.DataFrame(temp, columns = ['word', 'count'])

        wc['rank'] = wc['count'].rank(method='first', ascending=False)
        total = wc['count'].sum()

        wc['pct_total'] = wc['count'].apply(lambda x: x / total)
        
        wc = wc.sort_values(by='rank')
        wc['cul_pct_total'] = wc['pct_total'].cumsum()

        t2 = zip(appears_in.keys(), appears_in.values())
        ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
        wc = ac.merge(wc, on='word')

        wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)
        
        return wc.sort_values(by='rank')

Run the line of code below, and then explain how to interpret the graph.

```
Your Answer Here
```

In [None]:
import seaborn as sns

sns.lineplot(x='rank', y='cul_pct_total', data=word_count);

# Vectorization

## Definitions

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Vectorization**: `The process of converting natural language text into a matrix which can be used in statistical analysis.`

**Document Term Matrix (DTM)**: `A data structure used to show the frequency of each term in a document.`

**Latent Semantic Analysis**: `A method which applies dimensionality reduction (typically a truncated singular value decomposition) to a document term matrix in order to uncover latent concepts shared by the terms and documents in a corpus.`

**Term Frequency - Inverse Document Frequency (TF-IDF)**: `A statistic which weighs how frequently a given word appears in a text against how many texts in the corpus contain that word, so that words common to every text score low.`

**Word Embedding**: `A dense vector representation of a word, learned so that words used in similar contexts end up with similar vectors.`

**N-Gram**: `A contiguous sequence of n items from a text. Useful in grouping related words and ideas such as full names.`

**Skip-Gram**: `A Word2Vec training approach which predicts the surrounding context words for a given target word.`
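The n-gram and skip-gram definitions can be illustrated with plain Python. This is a sketch only: the helper names `ngrams` and `skipgram_pairs` are made up here, and the second function shows just the (target, context) training pairs a skip-gram model would consume, not the model itself.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["kosher", "tangie", "hybrid"]
print(ngrams(tokens, 2))
print(skipgram_pairs(tokens, window=1))
```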

## Questions of Understanding

1. Why do we need to vectorize our documents?
```
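Because statistical methods and models operate on numbers, not raw text. Vectorizing gives each document a numeric representation so that the computer can run statistics on the words.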
```
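As a rough illustration of what vectorizing produces, the sketch below builds a small document term matrix by hand. The function name `document_term_matrix` and the toy documents are assumptions for illustration; in practice a library vectorizer would normally do this work.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document, one column per term."""
    vocab = sorted({word for doc in docs for word in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # A Counter returns 0 for words the document does not contain
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

# Toy tokenized documents, made up for illustration
docs = [["sweet", "citrus", "sweet"], ["earthy", "citrus"]]
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(dtm)
```

Each row is now a fixed-length list of numbers, which is the form that distance measures and models expect.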

2. How is TF-IDF different from simple word frequency? Why do we use TF-IDF over word frequency?
```
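TF-IDF is used to find which words are specific to which documents. Simple word frequency counts every occurrence equally, so words that are common across the whole corpus dominate. TF-IDF down-weights those words, which makes it more useful for telling documents apart, particularly in unsupervised learning situations.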
```
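One way to see the difference is to compute a basic TF-IDF score by hand. The sketch below assumes the plain textbook form, term frequency times log(N / document frequency); library implementations typically use smoothed variants, so their numbers will differ. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Number of documents each word appears in
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Toy tokenized documents, made up for illustration
docs = [["strain", "sweet"], ["strain", "earthy"]]
scores = tf_idf(docs)
print(scores)
```

Both words in the first document have the same raw frequency, yet "strain" scores zero because it appears in every document, while "sweet" keeps a positive score.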

3. Why might we choose a word embedding approach over a bag-of-words approach when it comes to vectorization?
```
Word embedding preserves word context and can identify words commonly used together. The former is useful in determining whether a given set of texts is discussing battery in the sense of the battery that keeps this laptop running or battery as in assault and battery. The latter is useful in identifying common phrases which may indicate that multiple texts were written by the same author.
```
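A small sketch of the bag-of-words limitation that motivates embeddings: two phrases with similar meaning but no shared words look completely unrelated. The helper name `shared_terms` and the example phrases are made up for illustration, and the sketch shows only the bag-of-words side; it does not compute any embeddings.

```python
def shared_terms(text_a, text_b):
    """Return the words two texts have in common under a bag-of-words view."""
    return set(text_a.lower().split()) & set(text_b.lower().split())

overlap = shared_terms("relaxing sleepy effects", "calming drowsy high")
print(overlap)
```

The overlap is empty, so any count-based similarity between these phrases would be zero. An embedding trained on a large corpus would typically place words like "sleepy" and "drowsy" close together, which is the advantage described above.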

## Practice Problems

Use the dataframe `df` above to complete the following.

Vectorize the `Tokens` column.

Build a Nearest Neighbors model from your dataframe and then find the 5 nearest neighbors to the strain "100-Og".
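Conceptually, a nearest neighbors search ranks every other row by its similarity to the query row. The sketch below uses cosine similarity on made-up count vectors; the function names and data are assumptions for illustration, and cosine is only one possible metric. For the exercise itself you would normally use a library model on your real vectorized tokens.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(query_index, vectors, k=2):
    """Return indices of the k rows most similar to vectors[query_index]."""
    query = vectors[query_index]
    scored = [
        (cosine_similarity(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index  # skip the query row itself
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy count vectors, made up for illustration
vectors = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
    [2, 0, 1],
]
neighbors = nearest_neighbors(0, vectors, k=2)
print(neighbors)
```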

You will be putting together a classification model below, but before you do you'll need a baseline. Run the line of code below and then find the normalized value counts for the `Rating` column in `df`.

In [None]:
df['Rating'] = df['Rating'].round().astype(int)

What is the baseline accuracy?
```
Your Answer Here
```
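For a classification problem, the baseline is usually the accuracy you would get by always predicting the most common class. A minimal sketch, using a made-up list of ratings rather than the real `Rating` column (the function name `baseline_accuracy` is also an assumption):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Toy ratings, made up for illustration (not taken from the dataset)
ratings = [4, 5, 4, 4, 3, 5, 4, 4]
label, accuracy = baseline_accuracy(ratings)
print(label, accuracy)
```

The same number is the largest entry in the normalized value counts, so a model is only useful if it beats it.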

Visualize the rating counts from above

Use your vectorized tokens in the `df` dataframe to train a classification model

Predict the score of the fake strain description below.

```
'Afgooey, also known as Afgoo, is a potent indica strain that is believed to descend from an Afghani indica and Maui Haze. 
Its sativa parent may lend Afgoo some uplifting, creative qualities, but this strain undoubtedly takes after its indica 
parent as it primarily delivers relaxing, sleepy effects alongside its earthy pine flavor. Growers hoping to cultivate Afgoo 
may have a better chance of success indoors, but this indica can also thrive in Mediterranean climates outdoors.'
```

# Topic Modeling

## Questions of Understanding

1. What is Latent Dirichlet Allocation? What is another name for LDA in NLP?
```
Your Answer Here
```

2. How do you interpret the results of a topic modeling output?
```
Your Answer Here
```

## Practice Problems

Find the top 5 topics of the `Description` column using LDA

In a short paragraph, explain how to interpret the first topic your model came up with. If your topic words are difficult to interpret, explain how you could clean up the descriptions to improve your topics

```
Your Answer Here
```

Use `pyLDAvis` to create a visualization to help you interpret your topic modeling results

Explain how to interpret the results of `pyLDAvis`

```
Your Answer Here
```

Create at least 1 more visualization to help you interpret the results of your topic modeling