# Word Counts in Language Data

So, we'll start by importing some things:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# This allows us to use regex replace later
import re
import seaborn as sns

# Natural Language Toolkit
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.util import skipgrams


Now, let's pull in some text.  In the name of making you practice, **change the code below to read in the file located at the root of your datahub drive at `css_bootcamp_2022/unix/literature/thebrotherskaramazov.txt`**

In [None]:
with open('/Users/wstyler/github/css_bootcamp_2022/unix/literature/thebrotherskaramazov.txt') as f:
    broskis = f.read()

First, we'll tokenize the data using `nltk.word_tokenize(yourdata)`. **Figure out why I used .lower() below, then look at the first ten items.**

In [None]:
token = nltk.word_tokenize(broskis.lower())
token[0:10]

This is gonna be a lot of text.  **Figure out how many items are in the list.**

In [None]:
## Your code here

Oh no, that's a LOT of text.  **Save only the first 25000 words, then confirm by counting the number of items that that's what you've done.**

In [None]:
## Your code here

Now let's split it into a list of bigrams (that is, pairs of adjacent words).  **Make the split, then look at the first ten items of the list. You'll get this by running `list(ngrams(data, N))`.** (The `list()` call is needed because the `ngrams` function natively returns a lazy iterator, a `zip` or generator object depending on your NLTK version, rather than a list.)

In [1]:
## Your code here
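
To see why the `list()` wrapper matters, here is a minimal standard-library sketch of the same idea on a toy sentence. The helper `toy_ngrams` is hypothetical and only imitates the lazy behavior of `nltk.util.ngrams`; it is not NLTK's implementation.

```python
def toy_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-token tuples."""
    # zip over n copies of the list, each shifted one position further
    return zip(*(tokens[i:] for i in range(n)))

toy_tokens = "the dog saw the dog".split()
lazy = toy_ngrams(toy_tokens, 2)   # a lazy iterator, not yet a list
toy_bigrams = list(lazy)           # list() materializes it
leftover = list(lazy)              # the iterator is now exhausted

print(toy_bigrams)
print(leftover)
```

A lazy iterator can only be walked once, so without `list()` you could not index it, slice it, or count items in it repeatedly.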

Now we'll count the number of times each ngram occurs.  **The `cleanitem` line exists for a specific reason; try to figure out why it's there and what it's doing.**

In [None]:
counts = {}
# `ng` is the bigram list you made above; set() avoids recounting duplicates
for item in sorted(set(ng)):
    cleanitem = re.sub(r"[,'()]", "", str(item))
    counts[cleanitem] = ng.count(item)
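
Calling `.count()` inside a loop rescans the whole list for every item, which gets slow on tens of thousands of tokens. As a hedged alternative (a swapped-in technique, not the approach this notebook uses), `collections.Counter` tallies everything in a single pass. The bigram list here is made-up toy data, and the keys are left as tuples.

```python
from collections import Counter

# Toy bigram list standing in for the notebook's `ng` (made-up data)
toy_ng = [("the", "dog"), ("dog", "saw"), ("saw", "the"), ("the", "dog")]

toy_counts = Counter(toy_ng)        # one pass over the list
top = toy_counts.most_common(1)     # list of (item, count) pairs

print(toy_counts[("the", "dog")])
print(top)
```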

Now you'll turn those counts into a pandas DataFrame.  This is tricky for silly reasons, but **examine the code below to figure out how I'm doing it.**

In [None]:
cdf = pd.DataFrame.from_dict(counts, orient='index')
cdf.reset_index(inplace=True)
cdf = cdf.rename(columns={'index': 'word', 0: 'count'})

Now **sort the dataframe in descending order, and view the top 20 rows**.  Hint: `dataframe.sort_values('col',ascending=True/False)`.

In [None]:
## Your code here

As expected, many of the most frequent words are *function words*.  **How far down the list do you need to go to start finding content words, which tell us about actual patterns in the dataset (e.g. important characters or concepts)?  Do you see any names or words which tell you about the story?**

In [None]:
# Your answer here.

You'll need to **rerun the n-gram analysis to get unigram counts so we can look at individual words**.

In [None]:
## Your unigram code here

Now, create a barplot showing the counts of the 200 most frequent **single words**.  **Does this follow the expected (Zipfian) distribution?** (Hint: `plt.xticks(rotation=90);` might be useful in looking at the most frequent words in a plot)

In [None]:
## Your code here

Now, **create a new column in your count dataframe with the log (using `np.log`) of `count`, and graph the log counts of the most frequent words.**  Does the shape change?

In [1]:
## Your code here
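
As a reference point for both plots: in its simplest form, Zipf's law says a word's count is roughly proportional to 1 divided by its frequency rank. The sketch below builds idealized counts under that assumption (the constant 1000 is arbitrary and not taken from the novel) and checks that `log(count) + log(rank)` then comes out the same at every rank.

```python
import math

top_count = 1000                                    # arbitrary illustrative constant
ranks = range(1, 6)
ideal_counts = [top_count / r for r in ranks]       # count proportional to 1/rank
log_counts = [math.log(c) for c in ideal_counts]    # natural log, like np.log

# Under this idealization, log(count) + log(rank) is identical for every rank
sums = [lc + math.log(r) for lc, r in zip(log_counts, ranks)]

print(ideal_counts)
print([round(s, 6) for s in sums])
```

Real text only approximates this, so treat it as a shape to compare your plots against rather than a prediction of exact values.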

Now, **write a for loop to get the ngram counts for 1, 2, 3, 4, and 5 grams, and store the highest frequency gram (word/count pair) for each N.**  Do we see the highest frequency count fall off as we might expect?

In [2]:
## Your code here

### Getting into the Weed with Data

Will is allergic to Cannabis, which is not actually particularly uncommon (although present from birth in Will's case, the allergy occasionally develops later in life among smokers and workers in the cannabis industry who are constantly exposed to the plants and pollen).  As a result, Cannabis plants or smoke make his larynx (voicebox) swell up and, potentially, shut, on top of other allergic symptoms. This is pretty unfortunate for people who enjoy breathing, and serves as a plea to ask your friends to refrain from smoking or growing around other people and to focus on more considerate methods of cannabis use, but it also contextualizes why this analysis is interesting.

In 2013, when he was in the depths of dissertating, Will received a Christmas gift card from his girlfriend (now wife) to a relaxation spa called 'The Wellness Center'.  Although it was a very thoughtful gift, Will immediately thought it was a joke because, to someone who didn't know the facility, the name sounded entirely like a Marijuana dispensary.  It ended with a laugh, but everybody present agreed that it sounded like a dispensary, particularly given that Cannabis had initially been legalized under medical pretext in Colorado.

An initial corpus search (documented [here](https://wstyler.ucsd.edu/posts/dispensary_names.html)) revealed that, indeed, this was a good assumption, with 'wellness' being the most common word in Dispensary names in 2013 by far.  Now, nearly 10 years later, and with recreational use now being legal without 'medical' pretext, let's see whether California cannabis retailer culture follows the same pattern, and whether 'wellness' is still a common term used in cannabis company names.

`texts/ca_mj.csv` is a file containing a list of all Cannabis License Holder businesses, downloaded via <https://search.cannabis.ca.gov/>.  **Load in the dataset for our use in Pandas, and look at the first few rows to get a sense of the data.**


In [None]:
## Your code here

There's lots of interesting data here, particularly for the GIS inclined.  **How many licenses have been issued, according to this list?**

In [None]:
## Your code here

Let's focus on the `businessDbaName` and `businessLegalName` variables.  There are a huge number of 'Data not available', 'LLC', 'Inc', and similar entries, so we should drop those.  But we don't want to drop whole rows, as many businesses have the same 'DBA' (doing business as) name as their legal name.  So, we'll just remove those strings.  **Read the code below and explain to your neighbor what it does and why.**

In [None]:
mjdb['combined'] = mjdb['businessDbaName'].astype(str) + " " + mjdb['businessLegalName'].astype(str)
# \b keeps 'Inc' and 'LLC' from matching inside longer words (e.g. 'Incredible')
mjdbclean = mjdb.replace(regex=r'Data Not Available|\bLLC\b|\bInc\b|,|\.', value='')
dispnames = list(mjdbclean['combined'])
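
Regex cleanup like this is easy to get subtly wrong, so it can help to try a pattern on a few strings first. The sketch below uses made-up business names and two hypothetical patterns (`loose` and `strict`) to show how an unescaped `.` and a missing word boundary change what gets removed.

```python
import re

# Made-up business names, for illustration only
toy_names = ["Green Leaf, Inc.", "Incredible Edibles LLC", "Data Not Available"]

loose = r"Data Not Available|LLC|Inc.|Inc|,|\."       # bare '.' matches any character
strict = r"Data Not Available|\bLLC\b|\bInc\b|,|\."   # whole-word matches only

loose_out = [re.sub(loose, "", name).strip() for name in toy_names]
strict_out = [re.sub(strict, "", name).strip() for name in toy_names]

print(loose_out)
print(strict_out)
```

With the loose pattern, `Inc.` swallows the first four letters of "Incredible"; the word-boundary version leaves that name intact.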

Let's sanity check the data and **make sure our `dispnames` list has the same number of rows as the original dataframe.**

In [None]:
## Your code here

Now, we're going to create a single chunk of text out of all of the text. **Read the code below and explain to your neighbor what it does and why**

In [None]:
dn = ' '.join([str(i).lower() for i in dispnames])


Now, you've got a bunch of text.  **Use the code from above to tokenize, create bigrams, and get counts saved as `cdf`.**

In [None]:
## Your code here

Now look at the top 60 items.  **Do you see a theme in modern California Cannabis business naming?  Is it wellness?**

In [None]:
## Your code here

Now, find all rows which contain 'wellness'. Hint: `dataframe[dataframe['colname'].str.contains("string")]`.  **Does 'Wellness center' appear?**

In [None]:
## Your code here

Bigrams, when there are many variants, can sometimes disguise or downplay particular unigrams.  **Now run a unigram model, and see where 'wellness' ranks.**  (Note: `dataframe.reset_index()` allows you to re-index the data after sorting)

In [None]:
## Your code here
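
To make the "disguise" effect concrete, here is a small standard-library sketch on made-up tokens (hypothetical business names, not the license data). One word recurs three times, but always with a different neighbor, so no single bigram stands out.

```python
from collections import Counter

# Made-up tokens from hypothetical business names
toy_tokens = "green wellness center coastal wellness group wellness collective".split()

unigram_counts = Counter(toy_tokens)
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

print(unigram_counts.most_common(1))
print(max(bigram_counts.values()))
```

Note that joining all names into one string also creates bigrams that span two different businesses, such as `('center', 'coastal')` here, which is worth keeping in mind when reading the bigram table.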