# Tokenization and Bag-of-Words

Welcome to week three. This week, we will build on what we learned last week. Regular expressions were helpful because they are our main tool for cleaning string (text) data. Now we will look at some elementary applications that become possible once we have clean data.

From last week you will remember our short introduction to tokenization. Let's recap what we have learned and work on Adam Smith's Wealth of Nations to understand the text a little better.

In [None]:
# A Short (Re)Introduction to Tokenization
# The NLTK library (imported below) lets us "tokenize" a text into smaller pieces. You can think of these tokens as the computer's version of how we keep words in our minds.

example_text = "This is the Data Science Lecture. We are going to have so much fun! The fun will not end..."

# Now we can run sentence tokenization on this example text. This splits the text into its sentences.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

first_result = sent_tokenize(example_text)
first_result
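
To see why `sent_tokenize` needs the extra `punkt` download instead of a simple rule, compare it with a naive splitter. The sketch below is not how NLTK works internally: `naive_sent_split` is a hypothetical helper that splits after every `.`, `!` or `?` followed by whitespace, using only the standard library.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (a naive rule)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

example_text = "This is the Data Science Lecture. We are going to have so much fun! The fun will not end..."
naive_result = naive_sent_split(example_text)
print(naive_result)

# The same rule breaks on abbreviations: it also splits after "Dr."
tricky_result = naive_sent_split("Dr. Smith wrote a long book. It sold well.")
print(tricky_result)
```

The rule works on our example text, but it cuts "Dr." off as its own sentence. Handling cases like this is what the trained sentence model is for.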

In [None]:
# Great! But we might also be interested in words, rather than sentences. No worries, NLTK has a solution for that as well.

from nltk.tokenize import word_tokenize

second_result = word_tokenize(example_text)
second_result

In [None]:
# It did what we asked it to do, but there is a clear problem: 'the' appears twice, once capitalized, and the two are treated as different tokens.
# To fix this problem, we can simply convert everything to lowercase, which is a common method in the field.
second_result_lower = [token.lower() for token in second_result]
second_result_lower
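
The effect of lowercasing is easiest to see in the counts. The sketch below uses only the standard library, with `re.findall` as a rough stand-in for `word_tokenize` (an assumption for illustration; NLTK treats some punctuation, such as the trailing dots, differently).

```python
import re
from collections import Counter

example_text = "This is the Data Science Lecture. We are going to have so much fun! The fun will not end..."

# Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks.
rough_tokens = re.findall(r"\w+|[^\w\s]", example_text)

counts_raw = Counter(rough_tokens)
counts_lower = Counter(token.lower() for token in rough_tokens)

print(counts_raw["the"], counts_raw["The"])  # counted as two different tokens
print(counts_lower["the"])                   # merged into a single entry
```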

In [None]:
# The resources we will use in this lecture.
wealth_of_nations = "https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/The_Wealth_of_Nations_Volume_1_Cleaned.txt"
# Importing the necessary libraries we have used in the last lectures.
import pandas as pd
import requests
import nltk
nltk.download('punkt') # Some extra knowledge for the computer, so it knows where the sentences end.
nltk.download('stopwords') # Same, but this time it will know the English stopwords like 'the' and 'and'.
nltk.download('wordnet') # The WordNet dictionary, which the lemmatizer needs later on.

In [None]:
# Import the book from Adam Smith.
adam_smith = requests.get(wealth_of_nations).text

In [None]:
# We will do three iterations of this example. First, let's simply tokenize the text using the simplest version of the nltk command.
# Import the tokenization command.
from nltk.tokenize import word_tokenize

# Tokenize the text.
unique_tokens = word_tokenize(adam_smith)

# Lowercase all tokens so that there is no confusion (e.g. 'The' vs. 'the').
lower_tokens = [token.lower() for token in unique_tokens]

# Let's count what we have in this text. For this, we will use a basic Python counter. Import the counter first.
from collections import Counter

# Call the counter on our lower_tokens, so that 'The' and 'the' are counted together.
wealth_counts = Counter(lower_tokens)

# What are the most common tokens?
wealth_counts.most_common(5)

# The results are interesting. Apparently, the most common token in Wealth of Nations is a comma.

In [None]:
# How do we get around this problem? No worries, the Python community has a fix for this.
from nltk.corpus import stopwords

# You remember our variable lower_tokens from above? Let's make a quick adjustment by filtering out some stopwords.
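# Have a careful look at the list comprehension below, and try to understand what it does.
# We build the stopword set once; calling stopwords.words('english') again for every token would be very slow.
english_stopwords = set(stopwords.words('english'))
filtered_words = [token for token in lower_tokens if token not in english_stopwords]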

# Going back to counting and listing the most common words:
wealth_counts_filtered = Counter(filtered_words)
wealth_counts_filtered.most_common(5)

# Okay, this is better, but we just got rid of the stopwords. We still have punctuation to deal with.
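
If the filter above is hard to read, here is the same idea on a tiny scale. Both `toy_stopwords` and `tokens` are hand-written for illustration; the real NLTK stopword list is much longer.

```python
from collections import Counter

# Hand-written stand-in for stopwords.words('english') (illustration only).
toy_stopwords = {"the", "of", "and", "is", "a", "in"}

tokens = ["the", "wealth", "of", "nations", ",", "and", "the", "division",
          "of", "labour", ",", "is", "a", "cause", "of", "wealth", "."]

# Keep only the tokens that are NOT in the stopword set.
kept = [token for token in tokens if token not in toy_stopwords]
print(kept)

# Punctuation is not a stopword, so it survives the filter and still gets counted.
print(Counter(kept).most_common(2))
```

Checking membership in a set is fast, which matters once the token list has hundreds of thousands of entries.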

In [None]:
# Let's take a step back, and use another tokenizer from the NLTK library.
# If you are confused, look two cells above. Instead of word_tokenize() we will use RegexpTokenizer().
from nltk.tokenize import RegexpTokenizer

# Create your own version of a regular expression tokenizer. Here we specify that it should only look for words.
timur = RegexpTokenizer(r'\w+')

# Let's do the processing.
words = timur.tokenize(adam_smith)

# Same filter for stopwords as above. Building the set once keeps the lookup fast.
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in english_stopwords]

# Okay, let's count it all one last time and see what we have now.
wealth_counts_filtered = Counter(filtered_words)
wealth_counts_filtered.most_common(20)

# Can we somehow make this better?
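
Before improving the result, it helps to see exactly what the pattern `\w+` keeps. With its default settings, `RegexpTokenizer` should collect every match of the pattern, much like `re.findall` does, so the sketch below uses `re.findall` on a made-up sentence as a stand-in.

```python
import re

# A made-up sentence, not a quote from the book.
sample = "The nation's wealth, in 1776, was not well-measured."

# \w+ matches runs of letters, digits and underscores; everything else acts as a separator.
word_like = re.findall(r"\w+", sample)
print(word_like)
```

Punctuation disappears, but apostrophes and hyphens split words into pieces, and numbers are kept as tokens. That is why fragments such as 's' or digits can show up among the counted words.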

In [None]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Classic lowercase tokenization.
unique_tokens = word_tokenize(adam_smith)
lower_tokens = [token.lower() for token in unique_tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops (the set is built once, so the lookup stays fast)
english_stopwords = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stopwords]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 20 most common tokens
print(bow.most_common(20))
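
What does lemmatizing change in the bag-of-words? The sketch below shows the idea with a tiny hand-made lookup table. `toy_lemmas` and `toy_lemmatize` are hypothetical stand-ins for WordNet, not the real lemmatizer.

```python
from collections import Counter

# Hypothetical mini lemma table standing in for WordNet (illustration only).
toy_lemmas = {"nations": "nation", "wages": "wage", "countries": "country"}

def toy_lemmatize(token):
    """Return the dictionary form if we know it, otherwise the token unchanged."""
    return toy_lemmas.get(token, token)

tokens = ["nation", "nations", "wage", "wages", "countries", "country", "produced"]

bow_before = Counter(tokens)
bow_after = Counter(toy_lemmatize(t) for t in tokens)

print(bow_before["nation"], bow_after["nation"])  # singular and plural are merged
print(bow_after["produced"])                      # unknown forms pass through unchanged
```

Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless you pass a different `pos` argument, so verb forms like 'produced' stay as they are in the cell above as well.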

In [None]:
# Assignment 1: Using 'https://www.gutenberg.org/cache/epub/84/pg84.txt', Frankenstein, tokenize the data like we did before with NLTK, and (1) show the most common 20 tokens (bag-of-words), (2) create a histogram of the average length of words in this book.

# Once you print it, you will realize that some words are just verbs or grammatical connectors. Modify the code so that the top 10 makes sense for YOU.

# You can (optionally) try to do it with WordNetLemmatizer, and then compare the results.

# Your code here...

In [None]:
# Create your own version of a regular expression tokenizer. Here we specify that it should only look for words.
timur = RegexpTokenizer(r'\w+')

# Let's do the processing.
words = timur.tokenize(adam_smith)

# What words do you want to exclude?
custom_exclusions = ['upon', 'may', 'therefore', 'one', 'much', 'must', '0', 'whole']

# Just add them to the stopwords list.
all_exclusions = stopwords.words('english') + custom_exclusions

# Use the filter.
bayern = [word for word in words if word.lower() not in all_exclusions]

# Okay, let's count it all one last time and see what we have now.
wealth_counts_filtered = Counter(bayern)
wealth_counts_filtered.most_common(20)

# Can we somehow make this better?

# Introduction to spaCy and further libraries

In [None]:
# The following will take a while.
# Introducing spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(adam_smith)

for token in doc:
  print(token.text)

# You can do this step in NLTK as well, but spaCy has some features that are much neater. Its named entity recognizer also has some extra categories, for example NORP, CARDINAL, MONEY, WORK_OF_ART, LANGUAGE, EVENT.

In [None]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
# Print only the entities labelled as MONEY or CARDINAL.
for ent in doc.ents:
    if ent.label_ in ('MONEY', 'CARDINAL'):
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
# Collect all geopolitical entities (GPE: countries, cities, states) in a list.
gpe_list = []
for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
        gpe_list.append(ent.text)

In [None]:
# Using regular expressions, let's search whether the book mentions Turkey.
import re
for gpe in gpe_list:
    if re.search("^Tur", gpe):
        print(gpe)

In [None]:
# What about the U.S.?
for gpe in gpe_list:
    if re.search("^U", gpe):
        print(gpe)
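
Patterns like `^Tur` and `^U` are deliberately broad, so they can catch more than we want. The sketch below uses a made-up list of place names (not output from the book) to show the difference between a broad and a narrower pattern.

```python
import re

# Made-up place names for illustration, not output from the book.
sample_places = ["Turkey", "Turin", "United Provinces", "Utrecht", "Europe"]

# ^ anchors the match at the start of the string.
broad = [p for p in sample_places if re.search(r"^Tur", p)]
narrow = [p for p in sample_places if re.search(r"^Turk(ey|ish)\b", p)]
starts_with_u = [p for p in sample_places if re.search(r"^U", p)]

print(broad)
print(narrow)
print(starts_with_u)
```

A broad pattern is fine for a first look, as long as you read the matches before drawing conclusions from them.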

In [None]:
for ent in doc.ents:
  if ent.label_ == 'ORG':
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
for ent in doc.ents:
  if ent.label_ == 'PERSON':
    print(ent.text, ent.start_char)

In [None]:
# Interesting that Adam Smith cites Caesar in his book. Let's see in what type of context that is...
print(adam_smith[416500:417150])
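
Instead of typing character positions by hand, we can compute the slice from an entity's `start_char` and `end_char`. The helper below is a minimal sketch: `context_window` is a hypothetical function name, and the sample sentence is made up so that the block runs without spaCy.

```python
def context_window(text, start, end, width=40):
    """Return `width` characters of context on each side of text[start:end]."""
    left = max(0, start - width)           # do not run past the beginning
    right = min(len(text), end + width)    # do not run past the end
    return text[left:right]

# A made-up sentence standing in for the book.
sample = "In the time of Caesar the Roman legions were paid from the public treasury."
start = sample.find("Caesar")
end = start + len("Caesar")

snippet = context_window(sample, start, end, width=15)
print(snippet)
```

With the spaCy `doc` from above, a call such as `context_window(adam_smith, ent.start_char, ent.end_char, width=300)` should print the passage around any entity.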

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word cloud
wordcloud = WordCloud(width = 800, height = 400, background_color ='white').generate(' '.join(gpe_list))

# Plotting the WordCloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()
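
Joining all names into one long string leaves it to the word cloud to decide where a word begins and ends, so multi-word names may be split. An alternative is to count the names ourselves. The sketch below uses a made-up list of mentions and plain whitespace splitting as a rough stand-in for the splitting step.

```python
from collections import Counter

# Made-up entity mentions for illustration, not output from the book.
sample_gpes = ["Great Britain", "France", "Great Britain", "England", "France", "Great Britain"]

# Counting after joining and splitting on whitespace breaks up "Great Britain".
word_counts = Counter(" ".join(sample_gpes).split())

# Counting the list directly keeps each name in one piece.
name_counts = Counter(sample_gpes)

print(word_counts)
print(name_counts)
```

A dictionary like `name_counts` can be passed to `WordCloud.generate_from_frequencies(...)` in place of `generate(...)`, which should keep names such as "Great Britain" together in the plot.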

In [None]:
# Assignment 2: Using "https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/ricardo.txt", follow steps similar to those we used for The Wealth of Nations and tell a story about the book.

# Your code here...

In [None]:
ricardo = requests.get('https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/ricardo.txt').text
doc = nlp(ricardo)

In [None]:
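# Collect entities by label. The label 'PERSON' is an assumption: the original comparison was against an empty string,
# which matches nothing, and the 'Smith' exclusion below suggests that person names were intended.
gpe_list = []
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        gpe_list.append(ent.text)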

exclusion = ['Smith']
gpe_list = [gpe for gpe in gpe_list if gpe not in exclusion]

# Generate word cloud
wordcloud = WordCloud(width = 800, height = 400, background_color ='white').generate(' '.join(gpe_list))
# Plotting the WordCloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()