In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('lab07.ok')

# Lab 5: Regular Expressions, Text Processing

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about
the homework, we ask that you **write your solutions individually**. If you do
discuss the assignments with others, please **include their names** at the top
of your solution.


## Due Date

This assignment is due at **11:59pm Wednesday, May 22**.

# Collaborators

Write names in this cell:

In [1]:
import pandas as pd
import numpy as np
import re

## Objectives for Lab 5:

This lab has two main parts. 

In the first part, you will practice the basic usage of regular expressions and learn to use the `re` module in Python. Some of the material is based on the tutorial at http://opim.wharton.upenn.edu/~sok/idtresources/python/regex.pdf. As you work through the first part of the lab, you may also find the website http://regex101.com helpful.

In the second part of the lab, we are going to practice NLP techniques.

---
# Part 1: Regular Expressions

We'll start by learning about the simplest possible regular expressions. Since regular expressions are used
to operate on strings, we'll start with the most common task: matching characters.

Most letters and characters will simply match themselves. For example, the regular expression `r'test'` will match the string `test` exactly. There are exceptions to this rule; some characters are special, and don't match themselves.
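
As a quick illustration of literal matching, here is a minimal sketch using `re.search`. The sample strings are made up for this example.

```python
import re

# Ordinary characters match themselves, anywhere in the string.
literal = r"test"
found = re.search(literal, "a test string")
missing = re.search(literal, "a tset string")

print(found.group())   # the text that matched
print(missing)         # None when nothing matches
```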

Here is a list of metacharacters that are widely used in regular expressions.

<table border="1" class="dataframe" >
<thead>
  <tr style="text-align: right;">
    <th>Pattern </th>
    <th>Description</th> 
  </tr>
 </thead>
 <tbody>
  <tr>
    <td>^</td>
    <td>Matches beginning of line.</td> 
  </tr>
  <tr>
    <td>$</td>
    <td>Matches end of line.</td> 
  </tr>
  <tr>
    <td>.</td>
    <td>Matches any single character except newline. </td> 
  </tr>
  <tr>
    <td>*</td>
    <td>Matches 0 or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>+</td>
    <td>Matches 1 or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>?</td>
    <td>Matches 0 or 1 occurrence of preceding expression.</td>
  </tr>
  <tr>
    <td>[...]</td>
    <td>Matches any single character in brackets.</td>
  </tr>
  <tr>
    <td>[^...]</td>
    <td>Matches any single character <b>not</b> in brackets.</td>
  </tr>
  <tr>
    <td>{n}</td>
    <td>Matches exactly n number of occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>{n,}</td>
    <td>Matches n or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>{n,m}</td>
    <td>Matches at least n and at most m occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>a|b</td>
    <td>Matches either a or b.</td>
  </tr>
  <tr>
    <td>\1...\9</td>
    <td>Matches n-th grouped subexpression.</td>
  </tr>
  </tbody>
</table>
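
To see a few of these metacharacters in action, here is a small sketch on made-up data (the word list and the sentence are hypothetical). It shows an optional character, anchors, a counted repetition, and a backreference.

```python
import re

words = ["color", "colour", "colouur", "cooler"]

# `?` makes the preceding `u` optional; `^` and `$` anchor both ends.
optional_u = re.compile(r"^colou?r$")
matches = [w for w in words if optional_u.search(w)]

# `{2,}` requires at least two of the preceding character in a row.
double_o = [w for w in words if re.search(r"o{2,}", w)]

# `\1` refers back to whatever the first group matched;
# `\b` keeps the match aligned to whole words.
repeated = re.search(r"\b(\w+) \1\b", "this is is a typo")

print(matches)
print(double_o)
print(repeated.group(1))
```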


Perhaps the most important metacharacter is the backslash, `\`. As in Python string literals, the backslash
can be followed by various characters to signal various special sequences. It's also used to escape all the
metacharacters so you can still match them in patterns; for example, if you need to match a `[` or `\`, you
can precede them with a backslash to remove their special meaning:  `\[` or `\\`. 
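
Here is a minimal sketch of escaping on a made-up string. `re.escape` is a standard-library helper that escapes a literal for you, which is handy when the literal comes from data rather than from your own pattern.

```python
import re

text = r"path[0] = C:\temp"

# Escape `[` and `]` so they match literally instead of opening a character class.
bracket = re.findall(r"\[\d\]", text)

# In a raw string, `\\` is the regex for one literal backslash.
backslashes = re.findall(r"\\", text)

# re.escape builds the escaped form of an arbitrary literal.
escaped = re.escape("[0]")

print(bracket, backslashes, escaped)
```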

The following predefined special sequences are available:

<table border="1" class="dataframe" >
<thead>
  <tr style="text-align: right; font-size=14;">
    <th>Pattern </th>
    <th>Description</th> 
  </tr>
 </thead>
 <tbody>
  <tr>
    <td>\d</td>
    <td>Matches any decimal digit; this is equivalent to the class `[0-9]`</td> 
  </tr>
  <tr>
    <td>\D</td>
    <td>Matches any non-digit character; this is equivalent to the class `[^0-9]`.</td> 
  </tr>
  <tr>
    <td>\s</td>
    <td>Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]` </td> 
  </tr>
  <tr>
    <td>\S</td>
    <td>Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`.</td>
  </tr>
  <tr>
    <td>\w</td>
    <td>Matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`</td>
  </tr>
  <tr>
    <td>\W</td>
    <td>Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`.</td>
  </tr>
  </tbody>
</table>
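
The sketch below tries some of these special sequences on made-up strings, using `re.split`, `re.findall`, and `re.sub` from the standard library.

```python
import re

sample = "room 42\tfloor 3"

# \s+ matches runs of whitespace, so the split handles both the space and the tab.
fields = re.split(r"\s+", sample)

# \w+ matches runs of word characters (letters, digits, underscore).
tokens = re.findall(r"\w+", "snake_case, kebab-case")

# \D matches anything that is not a digit; deleting those leaves only the digits.
digits_only = re.sub(r"\D", "", "(510) 555-0100")

print(fields)
print(tokens)
print(digits_only)
```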

# Question 1
In this question, write patterns that match the given sequences. It may be as simple as the common letters on each line.

---
## Question 1a

Write a single regular expression to match the following strings without using the `|` operator. Notice that the pattern must _start_ with "abc".

1. **Match:** `abcdefg`
1. **Match:** `abcde`
1. **Match:** `abc`
1. **Skip:** `c abc`

<!--
BEGIN QUESTION
name: q1a
-->

In [2]:
regx1 = r"" # fill in your pattern
...

In [None]:
ok.grade("q1a");

---
## Question 1b

Write a single regular expression to match the following strings without using the `|` operator.

1. **Match:** `can`
1. **Match:** `man`
1. **Match:** `fan`
1. **Skip:** `dan`
1. **Skip:** `ran`
1. **Skip:** `pan`

<!--
BEGIN QUESTION
name: q1b
-->

In [8]:
regx2 = r"" # fill in your pattern
...

In [None]:
ok.grade("q1b");

# Question 2

Now that we have written a few regular expressions, we are ready to move beyond matching. In this question, we'll take a look at some functions from the `re` module.
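
Before starting, it may help to compare three commonly used functions. This is a minimal sketch on a made-up string, not a solution to any of the questions below.

```python
import re

line = "alpha beta gamut"

# re.match only tries the pattern at the very start of the string.
at_start = re.match(r"beta", line)

# re.search scans forward and returns the first match anywhere.
anywhere = re.search(r"beta", line)

# re.findall returns every non-overlapping match as a list of strings.
ends_in_a = re.findall(r"\w*a\b", line)

print(at_start)
print(anywhere.span())
print(ends_in_a)
```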

---
## Question 2a:

Write a Python program to extract and print the numbers in a given string.

1. **Hint:** use `re.findall`
2. **Hint:** use `\d` for digits and one of either `*` or `+`.

<!--
BEGIN QUESTION
name: q2a
-->

In [16]:
text_q2a = "Ten 10, Twenty 20, Thirty 30"

res_q2a = ...
...

res_q2a

In [None]:
ok.grade("q2a");

---
## Question 2b:

Write a Python program to replace at most 2 occurrences of space, comma, or dot with a colon.

**Hint:** use `re.sub(pattern, replacement, string, count)`, where `count` is the maximum number of occurrences to replace.
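
To show how the `count` argument behaves, here is a small sketch on an unrelated, made-up string. The pattern here is deliberately different from the one this question asks for.

```python
import re

csv_line = "a;b;c;d"

# count caps how many replacements are made;
# the default, count=0, replaces every match.
first_two = re.sub(r";", ",", csv_line, count=2)
all_of_them = re.sub(r";", ",", csv_line)

print(first_two)
print(all_of_them)
```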

<!--
BEGIN QUESTION
name: q2b
-->

In [18]:
text_q2b = 'Python Exercises, PHP exercises.'
res_q2b = ... # Hint: use re.sub()
...

res_q2b

In [None]:
ok.grade("q2b");

---
## Question 2c: 

Write a Python program to extract values between quotation marks of a string.

<!--
BEGIN QUESTION
name: q2c
-->

In [20]:
text_q2c = '"Python", "PHP", "Java"'
res_q2c = ... # Hint: use re.findall()
...

res_q2c

In [None]:
ok.grade("q2c");

## Question 2d:

Write a regular expression to extract and print the quantity and type of objects in a string. You may assume that a space separates quantity and type, i.e., `"{quantity} {type}"`. See the example string below for more detail.

1. **Hint:** use `re.findall`
2. **Hint:** use `\d` for digits and one of either `*` or `+`.

<!--
BEGIN QUESTION
name: q2d
-->

In [22]:
text_q2d = "I've got 10 eggs that I stole from 20 gooses belonging to 30 giants."

res_q2d = ...
...

res_q2d

In [None]:
ok.grade("q2d");

## Question 2e (optional):

Write a regular expression to replace all words that are not `"mushroom"` with `"badger"`.

In [24]:
text_qe = 'this is a word mushroom mushroom'
res_qe = ... # Hint: https://www.regextester.com/94017
...
res_qe

## Question 2f (extra credit):

Write a regular expression to replace every occurrence of the words `"US"` and `"U.S."` with `"USA"`.

In [25]:
text_qe = 'This will replace US and U.S. but not USGS with USA.'
res_qe = ... 
...
res_qe

# Part 2: NLP

Let's reproduce the example of extracting named entities from this article: [Discovering the essential tools for Named Entities Recognition](https://towardsdatascience.com/discovering-the-essential-tools-for-named-entities-recognition-8176c94d9747).

In [26]:
# Import NLTK module
import nltk

# Import word_tokenize 
from nltk.tokenize import word_tokenize

# Import POS tagger
from nltk.tag import pos_tag

### Upload content from a website

In [27]:
import requests
from bs4 import BeautifulSoup # a library for processing webpages

In [28]:
# send a request to the website
page = requests.get("https://en.wikipedia.org/wiki/Natural_Language_Toolkit")

# Use BeautifulSoup to parse the HTML. The "html5lib" parser is slower
# but more lenient; with no parser argument, BeautifulSoup picks a default.
page_content = BeautifulSoup(page.text) #, "html5lib")
# page_content   # html source of the page

In [29]:
# Now we look for the paragraphs
textContent = []
for i in range(0, 3):
    paragraphs = page_content.find_all("p")[i].text  # find the text inside the paragraph tag <p>
    textContent.append(paragraphs)

# Join the paragraphs together and remove the `\n` characters
page_text = " ".join(textContent).replace("\n", "")
page_text

### Tag and tokenize the text

Create a function that takes the text as input and uses `nltk.word_tokenize` to split the text into tokens. Then, tag each token with its part of speech using `nltk.pos_tag`.

The function will return a list of tuples, each consisting of a word and its tag, that is, the part of speech the word corresponds to.

<!--
BEGIN QUESTION
name: q2_1
-->

In [48]:
def preprocess_text(text):
    """
    Split the text into tokens using word_tokenize, then tag each token
    with its part of speech using pos_tag from the NLTK module.
    Returns a list of tuples, each consisting of a word and its
    part-of-speech tag.
    """
    # Get the tokens
    tokens = ...
    # Tags the tokens
    tagged_tokens = ...
    # Returns the list of tuples
    return tagged_tokens

# Split and label the text
label_text = preprocess_text(page_text)

# Print first 20 tuples, 5 per line
#for i in range(0, 20, 5):
#    print(label_text[i], label_text[i+1], label_text[i+2], label_text[i+3], label_text[i+4])

In [None]:
ok.grade("q2_1");

### Chunk the text to get named entities

Let us now perform entity detection using a technique called **chunking**.

Tokenization extracts only “tokens” or words, whereas chunking extracts phrases that may carry actual meaning in the text.

Chunking requires that our text is first tokenized and POS tagged. It uses these tags as inputs. It outputs “chunks” that can indicate entities.

We can first apply noun phrase chunking, or NP-chunking. We’ll look for chunks matching individual noun phrases. For this, we will customize the regular expressions used by the chunker.

We first need to define rules. They will indicate how sentences should be chunked. You can define your own rules, if you want to extract different chunks.

For reference, here's the [list of tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
Our rule states that an NP chunk should consist of an optional determiner (DT), followed by any number of adjectives (JJ), and then one or more proper nouns (NNP).
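
To build intuition for how such a tag pattern selects tokens, here is a simplified sketch that uses only the standard `re` module. It is not NLTK's implementation: the tokens are hand-tagged (hypothetical, not NLTK output), the tag sequence is encoded as one string, and the rule is rewritten as an ordinary regular expression over that string.

```python
import re

# Hand-tagged tokens (hypothetical example, not produced by NLTK).
tagged = [("The", "DT"), ("Natural", "NNP"), ("Language", "NNP"),
          ("Toolkit", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("suite", "NN"), ("of", "IN"), ("libraries", "NNS")]

# Encode the tag sequence as one string, e.g. "<DT><NNP><NNP>...".
tag_string = "".join("<{}>".format(tag) for _, tag in tagged)

# The chunk rule <DT>?<JJ>*<NNP>+ written as a plain regex over that string.
np_rule = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NNP>)+")

chunks = []
for m in np_rule.finditer(tag_string):
    start = tag_string[:m.start()].count("<")   # index of the first token in the chunk
    length = m.group().count("<")               # number of tokens in the chunk
    chunks.append([word for word, _ in tagged[start:start + length]])

print(chunks)
```

Note that the second determiner, "a", is not chunked: it is followed by a common noun (NN), and the rule requires at least one proper noun (NNP).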

<!--
BEGIN QUESTION
name: q2_2
-->

In [33]:
# Define the rule
rule = "NP: {<DT>?<JJ>*<NNP>+}" 

# We define the parser using the rule
parser = nltk.RegexpParser(rule) 

# Apply to the tagged words 
# by using the parse() function of the parser you just created
# and giving it the label_text
result = ...

# Print only the chunks
for entity in result:
    if isinstance(entity, nltk.tree.Tree):
        print(entity)

In [None]:
ok.grade("q2_2");

We can instead use a pre-trained classifier using the function `nltk.ne_chunk()`. 

<!--
BEGIN QUESTION
name: q2_3
-->

In [35]:
# Use ne_chunk to get entities 
named_entities = ...

# Print only those that are recognized as entities
# Entities have type nltk.tree.Tree
for entity in named_entities:
    if isinstance(entity, nltk.tree.Tree):
        print(entity)

In [None]:
ok.grade("q2_3");

Notice that the results are very similar.

# Latent Dirichlet Allocation (LDA) for topic modelling

Let us now apply LDA to assign the text in a document to topics [[1]](https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925). LDA builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it will contain words that are somehow related. Therefore, choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
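
The generative story in the bullets above can be sketched with the standard library alone. This is a toy illustration, not gensim's algorithm: the topic names, vocabularies, and probabilities are made up, and the distributions are fixed by hand rather than drawn from Dirichlet priors as in full LDA.

```python
import random

# Hypothetical toy model: two topics, each a probability distribution over words.
topics = {
    "parsing": {"chunk": 0.5, "tag": 0.3, "token": 0.2},
    "library": {"python": 0.6, "module": 0.3, "token": 0.1},
}

def generate_document(topic_mixture, n_words, rng):
    """For each word: pick a topic from the mixture, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mixture),
                            weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist),
                           weights=list(word_dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"parsing": 0.8, "library": 0.2}, n_words=10, rng=rng)
print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it estimates topic mixtures and word distributions that could plausibly have generated them.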

To learn more about LDA check out this [link](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).


Let's add a second page and see whether LDA can separate the two texts by topic.

In [37]:
# send a request to the website
page1 = requests.get("https://en.wikipedia.org/wiki/Shallow_parsing")

page1_content = BeautifulSoup(page1.text) #, "html5lib")

In [38]:
# Now we look for the paragraphs
text1Content = []
for i in range(0, 3):
    paragraphs = page1_content.find_all("p")[i].text  # find the text inside the paragraph tag <p>
    text1Content.append(paragraphs)

# Join the paragraphs together and remove the `\n` characters
page1_text = " ".join(text1Content).replace("\n", "")
page1_text

# Preprocess the text

In [39]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim # another module for text processing

In [40]:
tokenizer = RegexpTokenizer(r'\w+')

stop_words = gensim.parsing.preprocessing.STOPWORDS

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# compile sample documents into a list
doc_set = [page_text, page1_text]

In [41]:
# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = ...

    # remove stop words from tokens: use list comprehension
    stopped_tokens = ...
    
    # stem tokens
    stemmed_tokens = ...
    
    # add tokens to list
    #texts.append(stopped_tokens)
    texts.append(stemmed_tokens)

### Are stemmed tokens important?

Run the analysis below using `stemmed_tokens`, then come back and comment-out the creation of the stemmed tokens in the code above (making sure to properly update `texts`). How does the result change? **Write down your observations and analysis.**

<!--
BEGIN QUESTION
name: q2_4
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## Convert the text to bag of words and run LDA

For topic modelling, we need to convert each preprocessed document to a bag of words: a list of (word id, count) pairs recording how many times each word occurs in that document. The word ids come from a dictionary that maps every distinct word in the corpus to an integer.

We then run LDA on our corpus, after specifying how many topics there are in the data set and how many training passes to make over the corpus.
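
To make the bag-of-words representation concrete, here is a minimal standard-library sketch on two made-up, pre-tokenized documents. It mimics the shape of the (word id, count) pairs, but the ids that gensim assigns may differ.

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [["token", "tag", "token"], ["tag", "chunk"]]

# Assign each distinct word an integer id, in order of first appearance.
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

# Represent each document as sorted (word_id, count) pairs.
bow_corpus = [sorted(Counter(word_to_id[w] for w in doc).items())
              for doc in docs]

print(word_to_id)
print(bow_corpus)
```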

In [42]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)

In [43]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

In [44]:
'''
Preview BOW for our sample preprocessed document
'''
document_num = 0 
bow_doc_x = corpus[document_num]

# Preview at most the first 12 entries
num_items = min(12, len(bow_doc_x))
for i in range(num_items):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1]))

In [45]:
print(ldamodel.print_topics(num_topics=2, num_words=3))

In [46]:
print(ldamodel.print_topics(num_topics=3, num_words=5))

In [47]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

We are ready to interpret the model. The output from the model is a list of topics, each characterized by a series of words along with the weight of each word in that topic. If you had to come up with names for the 3 topics that LDA identified, what would you call them?

<!--
BEGIN QUESTION
name: q2_5
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

**Congrats! You are finished with this assignment.**

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

<!-- EXPECT 2 EXPORTED QUESTIONS -->

In [None]:
# Save your notebook first, then run this cell to submit.
import jassign.to_pdf
jassign.to_pdf.generate_pdf('lab07.ipynb', 'lab07.pdf')
ok.submit()