<a href="https://colab.research.google.com/github/scskalicky/BDR/blob/main/14_Conditional_Frequency_Distributions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Conditional Frequency Distributions

We have explored word frequency as a variable for different texts using `nltk.FreqDist()`. In this notebook, we will explore another function called `nltk.ConditionalFreqDist()`.

A conditional frequency distribution extends a frequency distribution by adding subcategories (conditions) over which the counts are formed. For example, we *could* ask for a frequency distribution of modals over the entire Brown corpus, regardless of genre. This would be a simple frequency distribution.

When we ask for this frequency distribution to be *dependent* upon a category such as genre, we are placing a *condition* on the frequency counts, making it a conditional frequency distribution.

Why do we care about conditional frequency distributions? Because it is a way to directly compare different categories or subsets within a larger set of data.


Consider the following simplistic example which compares the frequency of words between two conditions.

In [None]:
# import nltk
import nltk

In [None]:
# Create two conditions: condition a and condition b.
# You can see how they differ in terms of the frequency of colours in each example
# condition a has yellow twice, and blue three times
# condition b has one of each: yellow, blue, red
condition_a = ['yellow', 'yellow', 'blue', 'blue', 'blue']
condition_b = ['yellow', 'blue', 'red']

# combine the conditions by concatenating the list
combined_conditions = condition_a + condition_b

# now everything is in one list
combined_conditions

If we ask for a `FreqDist()` of the combined conditions, we get an impression of frequency among these three colors:

In [None]:
# combined frequency distribution
fdistab = nltk.FreqDist(combined_conditions)
fdistab

However, we have no way of knowing whether blue "likes" one condition over the other, or whether these colors occur equally among our conditions. So we could run a frequency distribution on each condition separately:

In [None]:
# Frequency Distribution of condition a
fdista = nltk.FreqDist(condition_a)
fdista

In [None]:
# Frequency distribution of condition b
fdistb = nltk.FreqDist(condition_b)
fdistb

We can compare the output and see that the two different conditions have different amounts of each color. But having to do this manually and then potentially combine the results is tiring and we would like a more efficient process.
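
To see why the manual route gets tiring, here is a minimal sketch of what lining up the two separate counts by hand might look like. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which is a subclass of `Counter`), and the names `counts_a`, `counts_b`, and `comparison` are my own for illustration.

```python
from collections import Counter

# Counter stands in for nltk.FreqDist here (FreqDist is a Counter subclass).
condition_a = ['yellow', 'yellow', 'blue', 'blue', 'blue']
condition_b = ['yellow', 'blue', 'red']

counts_a = Counter(condition_a)
counts_b = Counter(condition_b)

# Collect every colour seen in either condition, then line the counts up.
all_colours = sorted(set(counts_a) | set(counts_b))
comparison = {colour: (counts_a[colour], counts_b[colour]) for colour in all_colours}

for colour, (in_a, in_b) in comparison.items():
    print(f'{colour:<8} a={in_a} b={in_b}')
```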

We can use the `nltk.ConditionalFreqDist()` function to do these things for us. This function will count the number of colours across both conditions. To use `ConditionalFreqDist()` (CFD), we have to provide the function with the required pieces of information:

1. the conditions and
2. the thing being counted.

We then instruct the CFD function how to loop: first by condition, and then by sample (which will essentially make a nested `for` loop). Let's first transform our lists above into a dictionary structure to help clarify what is going on here. Look at the dictionary below: there are two keys (a and b), each with a list of colors.


In [None]:
# first combine our lists so that they are nested in a dictionary
combined_colors = {'a': ['yellow', 'yellow', 'blue', 'blue', 'blue'],
                   'b': ['yellow', 'blue', 'red']}

The `ConditionalFreqDist()` function is effectively a set of nested loops, which can be difficult to get a grasp on at first (and is probably why the authors of NLTK say not to worry about it right away). See if you can understand what the double loop is doing in the example below:

In [None]:
# first loop through the conditions (the keys of the dictionary)
# and THEN loop through the colours listed under that condition, producing one (condition, color) pair each time.

color_cfd = nltk.ConditionalFreqDist((condition, color) # our pairs: condition = condition, color = sample
  for condition in combined_colors # for key in dictionary, in this case a, and then b...
  for color in combined_colors[condition]) # for each colour in the list stored under that key

In [None]:
# verify the different condition
color_cfd.conditions()

In [None]:
# it's helpful to know that the resulting object is fundamentally a dictionary
color_cfd.keys()

In [None]:
# so we can query the conditions
color_cfd['a']['blue']

In [None]:
# look at the whole thing - you see they are just multiple FreqDist
color_cfd.items()

So, really, the conditional frequency distribution is similar to the normal frequency distribution, it just has more layers and categories.
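
To make those "layers" concrete, here is a minimal sketch that builds the same counts with explicit loops and only the standard library. This is not how NLTK implements `ConditionalFreqDist()` internally; `manual_cfd` is just a name I picked to show the idea of one counter per condition.

```python
from collections import Counter, defaultdict

combined_colors = {'a': ['yellow', 'yellow', 'blue', 'blue', 'blue'],
                   'b': ['yellow', 'blue', 'red']}

# One Counter per condition, created the first time a condition is seen.
manual_cfd = defaultdict(Counter)

for condition in combined_colors:             # outer loop: 'a', then 'b'
    for color in combined_colors[condition]:  # inner loop: each colour in that list
        manual_cfd[condition][color] += 1     # count the (condition, sample) pair

print(manual_cfd['a']['blue'])
print(dict(manual_cfd['b']))
```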

The CFD also has a few different functions. We can visualize the different counts in a matrix, which helps clarify both the presence and the absence of values in each of the two conditions. Blue clearly prefers "a", while "b" allows for all three colours.

In [None]:
# the built in tabulate method lets you make a nice table for comparison.
color_cfd.tabulate()

In [None]:
# we can also make little plots, neato!
color_cfd.plot()

# Conditional Frequency Distributions with Brown Corpus

Now compare that color example to the CFD on the Brown corpus from the NLTK book. Genre is the condition (where above it was 'a' or 'b') and word frequency is the sample (where above it was the frequency of colours).

We are essentially asking for the frequency of each word in `modals` conditioned by the different genres in brown.

In [None]:
# download the required resources first.
nltk.download('brown')
from nltk.corpus import brown

In [None]:
# first create a CFD of *all* the words in Brown corpus, conditioned on genre.

cfd = nltk.ConditionalFreqDist(
           (genre, word) # condition = genre, sample = word
           for genre in brown.categories() # for each genre in Brown
           for word in brown.words(categories = genre)) # for each word in the genre

Now that we have created a conditional frequency distribution across the entire corpus, we can define more precise queries based on words and genres we are interested in.

In [None]:
# create a list of the Brown genres we want to compare (a subset of all the genres)
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

# let's again look at modal verbs only
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# Ask the CFD to give us the frequency of each modal in the selected genres
cfd.tabulate(conditions = genres, samples = modals)

What do these results say about word use across different genres?

## Comparing News and Romance

Allow me to copy the Your Turn straight from the NLTK book:

> *Your Turn: Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy, and which are most romantic. - Define a variable called days containing a list of days of the week, i.e. `['Monday', ...]`*.

> Now tabulate the counts for these words using `cfd.tabulate(samples = days)`.

> Now try the same thing using plot in place of tabulate. You may control the output order of days with the help of an extra parameter: `samples = ['Monday', ...]`.

In [None]:
# first create the brown cfd
brown_cfd = nltk.ConditionalFreqDist(
  (genre, word)
  for genre in brown.categories() # for each genre in the corpus
  for word in brown.words(categories=genre)) # then for each word in the genre

In [None]:
# Create the list of words you are interested in
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# then tabulate them (cumulative = True shows running totals across the days, left to right)
brown_cfd.tabulate(conditions = ['romance', 'news'], samples = days, cumulative=True)

In [None]:
# plot them
brown_cfd.plot(conditions = ['romance', 'news'], samples = days, cumulative = True)

What other words might make for an interesting comparison between different genres in the Brown corpus? You can apply the same ideas from above to do your own investigations. Can you locate some words which clearly define different genres?

In [None]:
# try looking for comparisons of different words/genres in the Brown corpus.

## **Discussion**

Now we are getting somewhere! We have a clear list of how modal verbs pattern across the different genres in Brown. We also have some interesting results about days of the week.

Take a moment to inspect the way different genres employ modal verbs.

- What can you say about romance vs news, for example?
- Moreover, can you think of any potential mistakes we are making by comparing these direct frequency counts?
- Any other words you would be interested in comparing?
  - have a play with possible word targets and report your results.
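
One way to think about the "potential mistakes" question is that genres differ in size, so raw counts are not directly comparable. Here is a minimal sketch of converting a raw count into a rate per 1,000 tokens. The helper `per_thousand` and the toy counts are my own inventions for illustration (they are not Brown figures). The helper should also work on something like `cfd['news']`, since `FreqDist` is a subclass of `Counter`.

```python
from collections import Counter

def per_thousand(dist, word):
    """Occurrences of `word` per 1,000 tokens in one frequency distribution."""
    total = sum(dist.values())  # total tokens counted in this condition
    if total == 0:
        return 0.0
    return 1000 * dist[word] / total

# Made-up counts for illustration only; these are NOT Brown corpus figures.
toy_counts = {
    'big_genre': Counter({'will': 40, 'other': 9960}),    # 10,000 tokens
    'small_genre': Counter({'will': 20, 'other': 1980}),  # 2,000 tokens
}

for genre, dist in toy_counts.items():
    print(genre, dist['will'], per_thousand(dist, 'will'))
```

In this toy case the bigger genre has the higher raw count, but the smaller genre uses the word at a higher rate.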

# Inaugural Address corpus

Chapter 2 of the NLTK book includes several visualizations of the Conditional Frequency Distribution. One corpus they use is the inaugural corpus, which is a collection of speeches given by US presidents at the start of each new term (so, one occurs every four years).

Before looking at the CFD, load in the resource and familiarise yourself with the corpus.



In [None]:
# load in required resources
nltk.download('inaugural')
from nltk.corpus import inaugural

In [None]:
# each file id contains the year and name of the US president
inaugural.fileids()

In [None]:
# remember, raw returns a string from the .txt file.
# here I pick a set of words from the string to find an interesting quote.
inaugural.raw('1945-Roosevelt.txt')[1025:1152]

## Conditional Frequency Distribution of Inaugural

Okay, the NLTK book shows how to do this, but let's repeat the process here. In the book, the authors want to examine how specific words change over time in the address. In particular, they are interested in examining how the words "america" and "citizen" change in frequency over the years. They do so using a CFD of this corpus. In this CFD, the target words are the conditions and the years are the samples, so we get a count per year for each target word.

The CFD below packs a lot of information into the code.

- In line 3, the pair `(target, fileid[:4])` holds the condition (the target word) and the sample (the year).
  - slicing the fileid this way gives us the first 4 characters, which is the year
  - because there is one speech per year, each file ends up as its own sample
- In line 4, a loop is then initiated over the list of fileids in the corpus.
- In line 5, a nested loop then runs over the words of a single file, represented as `w`
- In line 6, an additional nested loop then loops over the two target words
- In line 7, the code asks whether the lowercased `w` starts with the target. Using `.startswith()` rather than an exact match means related forms such as "American" or "citizens" are counted too.
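
The prefix test from line 7 can be tried on its own before running it over the whole corpus. This is a minimal sketch using a hand-picked list of tokens (my own examples, not drawn from the corpus) to show which words the test accepts and which it rejects.

```python
targets = ['america', 'citizen']

# Hypothetical sample of tokens, chosen to show what the prefix test accepts.
words = ['America', 'Americans', 'american', 'Citizens', 'citizenship', 'city', 'merit']

# Same loop order and filter as the CFD cell, minus the fileid loop.
matches = [(target, w)
           for w in words
           for target in targets
           if w.lower().startswith(target)]

for target, w in matches:
    print(target, w)
```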




In [None]:
# CFD of Inaugural
inaugural_cfd = nltk.ConditionalFreqDist(
  (target, fileid[:4]) # try this with and without fileid being sliced, so you understand why they did this
  for fileid in inaugural.fileids()
  for w in inaugural.words(fileid)
  for target in ['america', 'citizen']
  if w.lower().startswith(target))

Again, the code is quite a bit to digest. Make sure you slowly go through it and understand each line. Then, inspect the results below:

In [None]:
# inspect the matrix of results
inaugural_cfd.tabulate()

## Slicing meta data from filenames

How does this code figure out the year of each file? The answer is that information is included in the filename. Each filename in the Inaugural corpus is in this format:

```
YEAR-name.txt
```

Look again at line 3 in the CFD code cell above: the code asks for `target` and `fileid[:4]`. Slicing to 4 on fileid will return the first four characters of the filename, which happens to be the year. So this is a trick that only works if all the filenames are standardized to follow the same format.

What would happen if you just used `fileid` without slicing the year?

In [None]:
# use this one simple trick to get years!
'1945-Roosevelt.txt'[:4]
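
Here is the same trick applied to a few filenames at once, next to an alternative that splits on the `-` instead of counting characters. This is just a sketch: the split version is my own suggestion rather than what the NLTK book does, and it assumes the year always comes before the first `-`.

```python
# File names in the YEAR-name.txt format described above.
fileids = ['1789-Washington.txt', '1945-Roosevelt.txt', '2009-Obama.txt']

# Position-based: take the first four characters.
years_by_slice = [fileid[:4] for fileid in fileids]

# Delimiter-based: take everything before the first '-'.
years_by_split = [fileid.split('-')[0] for fileid in fileids]

print(years_by_slice)
print(years_by_split)
```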

## Plotting the CFD

You can plot the frequency distribution using the built-in `plot()` method for the CFD, although I've found the plots are small and you may want to use the code below to increase the size of the plot.

Examine the plot - what is it the authors of NLTK wanted you to notice about the use of `america` and `citizen` in inaugural US presidential speeches over time?


Because the filename of each file in this corpus includes the year as the first four characters, the authors could use this as a label. There is only one speech for any year in the data because these are the speeches given by US presidents when they are elected.

In [None]:
# the plot on its own is quite small, use this code to make the plot larger
import matplotlib.pyplot as plt
# define the size of the figure
plt.figure(figsize = (20, 10))

# then render the plot:
inaugural_cfd.plot()

The plot has frequency counts on the y-axis and year represented on the x-axis. This makes it a bit easier to compare.

## **Your Turn / Discussion**

- What do you think of the way these terms rise and fall over the years? Do you think it can be attributed to events during that time, specifics of the speaker, or some combination of those and other factors?
- Play around with the Inaugural Corpus - what other words might make for an interesting comparison over time?
- What might be a problem with using raw frequency counts from these different text files?

# Creating Your Own Categorized Corpus

In a prior notebook you were shown how to convert text files into an NLTK corpus object. Let's extend that now and use the `CategorizedPlaintextCorpusReader` to make a corpus with categories/genres.

In order to do so, we need some text files, and we also need a way to indicate what genre/category we would like those files to belong to. Let's follow the NLTK authors and extract this information from the filenames.

As an example, let's use some data from a [paper I published in 2015.](https://europeanjournalofhumour.org/index.php/ejhr/article/view/68)

In this paper, I analysed the linguistic properties of product reviews written for the American retail website Amazon.com. I was interested in two types of reviews: legitimate reviews and satirical/funny reviews.

The data lives here: [Amazon Data](https://github.com/scskalicky/LING-226-vuw/blob/main/other-data/amazon%20reviews.zip)

Thanks to Hayden (tutor, 2023) for showing us how remarkably easy it is to use `!wget` and `!unzip` to load in a zip file and save to drive without needing to manually download and then upload. Run the code cell below to download the data into the notebook:



In [None]:
# download the data
!wget 'https://github.com/scskalicky/LING-226-vuw/raw/main/other-data/amazon%20reviews.zip'

You can then use `!unzip` to unzip the folder from within colab. There's an additional -d flag you can use to unzip into a directory to make working with the data easier. For example, for the amazon data we can do:


`!unzip "amazon reviews.zip" -d "amazon reviews"`

This will give you a folder called `amazon reviews` that you can use the same way as if you'd mounted it from Google Drive, without needing to bother with files and unzipping manually.


In [None]:
!unzip "amazon reviews.zip" -d "amazon reviews"

In the folder are 375 normal reviews and 375 satirical reviews.

The name of each file looks like this:
```
001-5-satire.txt
002-2-normal.txt
```

The first three digits are the ID number, ranging from 001 to 375. The second number (between the two `-`) is the star rating of the review, from 1 to 5. The word `satire` or `normal` indicates whether the review was a normal review or a satirical/funny review.

We can exploit this information to make categories in our corpus. Just as the authors of NLTK sliced the year from the filename to examine change over time, we can do the same thing with these filenames to get different categories.
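
Before turning to the corpus reader, here is a minimal sketch of pulling the three pieces of information out of a filename with plain string methods. The helper `parse_review_filename` is a hypothetical name of my own, and it assumes every filename follows the `ID-rating-type.txt` format exactly.

```python
from pathlib import Path

def parse_review_filename(filename):
    """Split an 'ID-rating-type.txt' name into its three parts (hypothetical helper)."""
    # .stem drops the '.txt' extension, leaving e.g. '001-5-satire'
    review_id, rating, review_type = Path(filename).stem.split('-')
    return {'id': review_id, 'rating': int(rating), 'type': review_type}

for name in ['001-5-satire.txt', '002-2-normal.txt']:
    print(parse_review_filename(name))
```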




In [None]:
# first we will load in the Corpus Reader and define the location of our texts
import nltk
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader

# set the corpus location to point to wherever it is you saved the data
# (you may need to mount your Drive to the notebook)
corpus_location = '/content/amazon reviews'

Now, to use the filenames as categories, we will exploit a little bit of regular expression (regex) pattern matching. For now, it is enough to know that we can define a pattern to capture the `normal` or `satire` portion of the filenames using this pattern:

```
.*(......).txt
```

This pattern captures whatever is matched inside the parentheses `()`, and says give me the last six characters before `.txt`. That works here because `satire` and `normal` are both six letters long.

It corresponds to:

```
001-5-(satire).txt
002-2-(normal).txt
```
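
You can check what the pattern captures with Python's `re` module before handing it to the corpus reader. This is a sketch that assumes the reader applies the pattern with `re.match` and takes the first captured group. The second pattern, `flexible_pattern`, is my own alternative for labels that are not exactly six characters long; it is not from the NLTK book.

```python
import re

cat_pattern = '.*(......).txt'          # the pattern used in this notebook
flexible_pattern = r'.*-(\w+)\.txt'     # assumption: capture the word after the last '-'

filenames = ['001-5-satire.txt', '002-2-normal.txt']

captured = [re.match(cat_pattern, name).group(1) for name in filenames]
captured_flexible = [re.match(flexible_pattern, name).group(1) for name in filenames]

print(captured)
print(captured_flexible)
```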

Try it out:

In [None]:
# create a categorised corpus
amz_corpus = CategorizedPlaintextCorpusReader(root = corpus_location, fileids = '.*', cat_pattern = '.*(......).txt')

# you can check the categories
amz_corpus.categories()

In [None]:
# and we still have our fileids
amz_corpus.fileids()

Now that we've made our corpus, we can create CFD tabulations and plots just like the NLTK book did for Brown corpus.

Let's compare different words between the satirical and regular reviews.



In [None]:
# Create a CFD of the amazon corpus
# I am using the same code as the one for Brown with two modifications:
# I have replaced "genre" with "review_type"
# I lowercase the words in the corpus
amz_cfd = nltk.ConditionalFreqDist(
    (review_type, word)
    for review_type in amz_corpus.categories()
    for word in [w.lower() for w in amz_corpus.words(categories = review_type)]
)

In [None]:
# let's ask for some specific words
pronouns = ['i', 'me', 'you', 'my']

# then tabulate them
amz_cfd.tabulate(conditions = ['normal', 'satire'], samples = pronouns, cumulative = True)

In [None]:
# we can also plot this.
amz_cfd.plot(conditions = ['normal', 'satire'], samples = pronouns, cumulative = True)

In [None]:
# what about some other words?
emotions = ['good', 'bad', 'happy', 'sad', 'love', 'sweet', 'hurt', 'ugly', 'nasty']
amz_cfd.tabulate(conditions = ['normal', 'satire'], samples = emotions, cumulative = True)

In [None]:
amz_corpus.words()

We can also wrap the words from our corpus (all of them, or just one category) in `Text` so that we can look for concordances.

In [None]:
# Wrap the whole set of words to look at all concordances
nltk.text.Text(amz_corpus.words()).concordance('terrible')

In [None]:
# we can also look at concordances for just one category to compare them
# the word "banana" is strongly associated with the satire corpus
nltk.text.Text(amz_corpus.words(categories = 'satire')).concordance('banana')

In [None]:
# but only occurs once in the non-satire corpus.
nltk.text.Text(amz_corpus.words(categories = 'normal')).concordance('banana')

## **Your Turn**

What else can you do with this corpus in terms of comparisons? You may want to scan the Skalicky and Crossley (2015) article, particularly Table 2, which lists some word categories that differed between the two review types. Think of some words that might reflect those categories - negation would include *not*, *no*, *never*, etc., whereas quantifiers might include *many*, *few*, *some*, and so on. Can you find some differences in words between the two corpora using a combination of CFD and concordance lines?

# **Wrap Up**

Being able to create your own corpus and make a comparison across categories in your corpus is a good way to develop your assessments in this course.

At this point you might want to spend some time thinking about how to make your own corpus. Or, you might want to play more with this amazon review corpus. For instance - another piece of information you could pull from the corpus is the review rating which is located in the middle of the filename. The pattern to do so would be:

```
.*-(.)-.*.txt
```

Of course, you might still want to keep the satire/normal category, so perhaps expand your pattern to:

```
.*-(.-.*).txt
```

This would give you ten categories. I've typed the code below should you like to use that and do further comparisons.

In [None]:
# create a categorised corpus
amz_corpus2 = CategorizedPlaintextCorpusReader(root = corpus_location, fileids = '.*', cat_pattern = '.*-(.-.*).txt')

# you can check the categories
amz_corpus2.categories()
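
If you want to see what the two rating patterns capture without building a corpus, here is a minimal sketch using `re.match` on the example filenames. It assumes, as above, that the category is the first captured group. The list `possible_categories` simply enumerates every rating and review type combination; whether all of them actually occur depends on the data.

```python
import re

rating_pattern = '.*-(.)-.*.txt'     # captures the star rating only
combined_pattern = '.*-(.-.*).txt'   # captures the rating plus the review type

filenames = ['001-5-satire.txt', '002-2-normal.txt']

ratings = [re.match(rating_pattern, name).group(1) for name in filenames]
combined = [re.match(combined_pattern, name).group(1) for name in filenames]

print(ratings)
print(combined)

# Five star ratings times two review types gives the possible category labels.
possible_categories = sorted(f'{stars}-{kind}'
                             for stars in range(1, 6)
                             for kind in ('normal', 'satire'))
print(len(possible_categories))
```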