<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/10_Frequency_Analysis_The_Current.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Frequency of words in comments for The Current**

What are the most frequent words in the comments people leave in The Current data? Are certain words more frequent for some questions than others, or are words used about equally across the questions? What about the messiness of the data: how much cleaning and preprocessing must be done in order to make sense of it?

These are all valid questions to ask as a way to start thinking about research questions and how they might be answered using computational linguistic approaches. In this notebook, we will create and compare frequency distributions of the words in the comments for different questions in The Current.

First, load in the nltk resources.

In [None]:
# import the main nltk module
import nltk

# download the nltk.book resources
nltk.download('book')

# import the resources
from nltk.book import *

Now, we want to load in some of the data from The Current. I will load in data for two questions: whether petrol cars should be banned, and whether freedom camping should be illegal.



In [None]:
# petrol car data
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp001.txt'

# freedom camping data
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/the-current/tp017.txt'

Read each file and save its lines to a variable as a list: strip the trailing whitespace (including the final newline) from the file contents, then split on newlines.

In [None]:
petrol = open('tp001.txt').read().rstrip().split('\n')
camping = open('tp017.txt').read().rstrip().split('\n')
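To see why the trailing newline is stripped before splitting, here is a small sketch. The string `raw` is a made-up stand-in for the contents of a file, not real data from The Current.

```python
# a made-up stand-in for the contents of a file that ends in a newline
raw = "first comment\nsecond comment\n"

# splitting without stripping leaves an empty string at the end of the list
without_strip = raw.split('\n')

# stripping first removes the trailing newline, so every item is a real line
with_strip = raw.rstrip().split('\n')

print(without_strip)
print(with_strip)
```

Without `.rstrip()`, the list would end with an empty item, which would cause trouble later when we try to pull the comment out of every line.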

Currently each text is a list of separate comments, but we might want all the comments in one container, so that each text is a single string. We can do this by splitting each comment off from its rating and then using `' '.join()` to glue the comments together. Note that there is a space between the quotes in `' '.join()`, so that a space is placed between each comment.

In [None]:
# glue the comments together; note the space between the quotes in ' '.join()
petrol_text = ' '.join([comment.split('\t')[1] for comment in petrol])

In [None]:
# look - a single string
petrol_text
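If the splitting and joining feels like a lot to happen in one line, this sketch breaks it into steps on toy data. The lines in `toy_lines` are invented, and the layout (a rating, a tab, then the comment) is an assumption based on how the code above takes the second item after splitting on `'\t'`.

```python
# made-up lines, assuming each line is a rating, a tab, and then the comment
toy_lines = ["5\tban them now", "1\ti need my car"]

# keep only the part after the tab (the comment) from each line
toy_comments = [line.split('\t')[1] for line in toy_lines]

# joining with '' runs the comments together; joining with ' ' keeps them apart
glued_no_space = ''.join(toy_comments)
glued_with_space = ' '.join(toy_comments)

print(glued_no_space)    # ban them nowi need my car
print(glued_with_space)  # ban them now i need my car
```

Without the space, the last word of one comment and the first word of the next would merge into a single "word", which would distort the word counts.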

Okay, now to create a frequency distribution of the words, we first need to split the text into tokens. Let's use `nltk.word_tokenize()` to make the tokens and `nltk.FreqDist()` to count them.

In [None]:
# first create the tokens
petrol_tokens = nltk.word_tokenize(petrol_text)

In [None]:
# now create the Frequency Distribution
petrol_fdist = nltk.FreqDist(petrol_tokens)

Okay, now that there is a frequency distribution of the petrol text, we can look at what the most frequent words in those comments are! Let's look at the top 20 most frequent words using `.most_common()`.

What do you think of the results?

In [None]:
petrol_fdist.most_common(20)
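To get a feel for what the frequency distribution is doing, here is a minimal sketch on a toy sentence. It uses `collections.Counter` from the standard library, which is the class that `nltk.FreqDist` is built on. As a simplification, `str.split()` stands in for `nltk.word_tokenize()`, so it only splits on whitespace and does not separate punctuation. The sentence in `toy_text` is invented.

```python
from collections import Counter

# a made-up comment, not taken from The Current data
toy_text = "we should ban petrol cars because petrol cars pollute"

# str.split() is a rough stand-in for nltk.word_tokenize(); it only splits on whitespace
toy_tokens = toy_text.split()

# count how many times each token occurs
toy_counts = Counter(toy_tokens)

# the two most frequent tokens, each paired with its count
top_two = toy_counts.most_common(2)
print(top_two)
```

Each item in the result is a `(word, count)` pair, the same shape as the output of `petrol_fdist.most_common(20)`.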

### **Your Turn**

Can you repeat this to make a frequency distribution for the camping text? You just need to repeat the above code but with the camping text instead of the petrol text! This is the output you should see if you ask for the 20 most frequent words:

```
[('.', 1591),
 ('to', 1562),
 ('the', 1419),
 ('and', 1120),
 ('it', 899),
 ('be', 863),
 ('is', 792),
 ('people', 770),
 ('camping', 674),
 ('a', 663),
 ('i', 631),
 ('should', 628),
 ('of', 573),
 ('freedom', 563),
 ('for', 482),
 ('we', 478),
 ('that', 428),
 ('nature', 401),
 ('in', 392),
 ('not', 388)]
```

In [None]:
# code cells for making fdist for camping

### **Discuss**

- Which words are repeated between the two texts?
- Which words are unique to each text?
- Does this analysis tell us anything about the ability of computational measures to identify features of different texts?
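One possible way to check the first two questions with code, rather than by eye, is to compare sets of words. This is only a sketch on invented token lists (`toy_petrol` and `toy_camping` are not the real data). With the real frequency distributions you might first take just the words from `.most_common()` and compare those instead.

```python
# made-up token lists standing in for the two tokenized texts
toy_petrol = "the cars should be banned and the air should be clean".split()
toy_camping = "the campers should be allowed and the parks should be free".split()

# turn each list into a set so that every word appears only once
petrol_words = set(toy_petrol)
camping_words = set(toy_camping)

# words found in both texts
shared = petrol_words & camping_words

# words found in one text but not the other
only_petrol = petrol_words - camping_words
only_camping = camping_words - petrol_words

print(sorted(shared))
print(sorted(only_petrol))
print(sorted(only_camping))
```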

## **What about them stopwords?**

Many of the words repeated across both texts are so-called function words: determiners such as `the` or `an`, and prepositions such as `in` or `on`. These words *are* important for making meaning in English, but maybe we don't want them in this analysis?

Your challenge is to create a frequency distribution for each text that only considers words that are 4 characters or longer. What will this do to the results?