### DS102 | In Class Practice Week 4A - Text Mining I - Text Normalisation
<hr>

## Pre-Requisites

1. Self-Study Week 3A - Cleaning Strings

## Learning Objectives
At the end of the lesson, you will be able to:

- define **corpus**, **documents** and **terms** as part of the study of Natural Language Processing

- define **tokenisation** as breaking a document into terms

- understand the definition of **root form** of a word for verbs and nouns

- identify **stemming** as a way to find the root form of a word

- learn how to use **stop words** to filter out terms in a document that are not meaningful

You will also be able to:

- perform simple tokenisation of a document into terms using `split()`

- use `word_tokenize` from `nltk.tokenize` to break a document into a list of words

- use `PorterStemmer`'s `stem` from `nltk.stem` to perform stemming of words

- retrieve a list of stopwords defined in `stopwords.words()` from `nltk.corpus`

### Datasets Required for this Self-Study
1. `songs-100.csv`

2. `loans-descs-1k.csv`

#### Import Libraries

In [1]:
import pandas as pd
import nltk
import re

In [2]:
# If you are running this for the first time, run the next cell first to download the required corpora
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [3]:
#Use this cell to download all the required corpora first
# try:
#     from nltk.stem import PorterStemmer, WordNetLemmatizer
#     from nltk import pos_tag, wordnet
#     from nltk.tokenize import word_tokenize
#     from nltk.corpus import stopwords
# except LookupError:
#     print("Downloading corpora...")    
#     nltk.download('punkt')
#     nltk.download('stopwords')
#     nltk.download('averaged_perceptron_tagger')
#     nltk.download('wordnet')
#     print("Downloads complete. Rerun this line.")

### Corpus, Documents and Terms

In linguistics (the study of language), a **corpus** is a collection of texts, each of which is called a **document**. A document contains multiple words which, when strung together, produce meaning. Each word is called a **term**.

Consider the following corpus of 4 documents from the 100 Song Titles dataset:

In [4]:
#Dataset 1, Credits at the end of the notebook
song_titles = ['shape of you', 'paris', 'scared to be lonely', 'symphony feat zara larsson',]

Each <u>song title is a document</u>. The <u>collection of song titles forms the corpus</u>. The first document `shape of you` has <u>3 terms</u>. The second document `paris` has <u>1 term</u>. The third document `scared to be lonely` has <u>4 terms</u>.
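
The term counts above can be checked with a few lines of Python. This is a minimal sketch that treats whitespace-separated words as terms; the variable name `term_counts` is introduced here for illustration only.

```python
# The same 4-document corpus as above
song_titles = ['shape of you', 'paris', 'scared to be lonely', 'symphony feat zara larsson']

# Count the terms in each document by splitting on whitespace
term_counts = [len(title.split()) for title in song_titles]

print(len(song_titles))  # number of documents in the corpus
print(term_counts)       # number of terms in each document
```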
<hr>
We now read from the dataset CSV file. As you can see, there are 100 documents in the song titles corpus.

In [5]:
#Dataset 1: Credits at the end of the notebook
song_titles_df = pd.read_csv('songs-100.csv', index_col=0)
print(song_titles_df['name'].head(7))
print("--- ---")
print(song_titles_df['name'].count())

0                          Shape of You
1                     Despacito - Remix
2    Despacito (Featuring Daddy Yankee)
3              Something Just Like This
4                           I'm the One
5                               HUMBLE.
6       It Ain't Me (with Selena Gomez)
Name: name, dtype: object
--- ---
100


In [6]:
# Your turn: Draw a sample of 50 observations from the dataset, with replacement. Use replace=True for this
song_titles_df.sample(50, replace=True)

Unnamed: 0,name
49,Havana
78,Look What You Made Me Do
46,Location
41,Call On Me - Ryan Riback Extended Remix
56,Pretty Girl - Cheat Codes X CADE Remix
9,I Don’t Wanna Live Forever (Fifty Shades Darke...
70,Sign of the Times
22,Say You Won't Let Go
35,1-800-273-8255
83,Side To Side


### Simple Tokenisation using `split()`

Before starting, store the song titles as a list of strings.

In [7]:
song_titles_list_0 = song_titles_df['name'].tolist()
print(song_titles_list_0[:10]) #Remove this index to show all titles.

['Shape of You', 'Despacito - Remix', 'Despacito (Featuring Daddy Yankee)', 'Something Just Like This', "I'm the One", 'HUMBLE.', "It Ain't Me (with Selena Gomez)", 'Unforgettable', "That's What I Like", 'I Don’t Wanna Live Forever (Fifty Shades Darker) - From "Fifty Shades Darker (Original Motion Picture Soundtrack)"']


Before performing any language processing, we first perform tokenisation. **Tokenisation** is the breaking up of a long string into individual words, or terms. For simple tokenisation, we use Python's `split()` method.

In its raw form, the document `Despacito (Featuring Daddy Yankee)` has 4 terms. When tokenisation is performed without any prior processing, the 2nd and 4th terms, `(Featuring` and `Yankee)`, **contain** special characters.

In [8]:
# Your turn: use split() on position 2 from the song_titles_list_0
print(song_titles_list_0[2].split())

['Despacito', '(Featuring', 'Daddy', 'Yankee)']


**1** - We first remove special characters using the `re.sub()` function and the pattern `[.®'&$’\"\-()]`. Because the characters are enclosed in square brackets, the pattern matches **any one** of the listed special characters wherever it occurs in a string.

In [9]:
song_titles_list_1 = []
# Define the regex as a raw string so that \- reaches the regex engine unchanged
special_chars_p = r"[.®'&$’\"\-()]"
for s in song_titles_list_0:
    # Your turn: perform the substitution using re.sub()
    s1 = re.sub(special_chars_p, '', s)
    # Your turn: append this string, after substitution to the list    
    song_titles_list_1.append(s1)
print(song_titles_list_1[:10]) #Remove the indices to see all the songs.

['Shape of You', 'Despacito  Remix', 'Despacito Featuring Daddy Yankee', 'Something Just Like This', 'Im the One', 'HUMBLE', 'It Aint Me with Selena Gomez', 'Unforgettable', 'Thats What I Like', 'I Dont Wanna Live Forever Fifty Shades Darker  From Fifty Shades Darker Original Motion Picture Soundtrack']


**2** - The second step is to convert all words to lowercase using `lower()`. Then, use `split()` to perform tokenisation and collect all terms in a document.
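
Removing `-` from `Despacito - Remix` leaves two consecutive spaces in the string. The sketch below shows how `split()` handles them: with no argument it treats any run of whitespace as a single separator, whereas `split(' ')` treats every space as a separator. The sample string is taken from the output above; the variable names are illustrative.

```python
title = 'Despacito  Remix'  # two spaces remain after the hyphen is removed

tokens_default = title.lower().split()     # any run of whitespace is one separator
tokens_space = title.lower().split(' ')    # every single space is a separator

print(tokens_default)
print(tokens_space)
```

This is why calling `split()` with no argument is the safer choice after characters have been stripped out of a string.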

<div class="alert alert-info">**Additional Notes**: Further steps are possible, for example replacing accented letters such as `ú` in `rehúso` with their unaccented equivalents, i.e. `u`. However, these are not in the DS102 syllabus. Refer to [this StackOverflow answer](https://stackoverflow.com/questions/33328645/how-to-remove-accent-in-python-3-5-and-get-a-string-with-unicodedata-or-other-so) for help.</div>
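
For the curious, accent removal can be sketched with the standard library's `unicodedata` module. This is one possible approach, not part of the DS102 syllabus; the helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (illustrative helper)."""
    # NFD normalisation splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize('NFD', text)
    # Keep only the characters that are not combining marks
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('rehúso'))
```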

In [24]:
song_titles_list_2 = []

for s1 in song_titles_list_1:
    # Your turn: Convert all words to lower case
    s2 = s1.lower()
    # Your turn: Use split() to convert the string into a list of tokens
    s2 = s2.split()
    # Then, append this to song_titles_list_2
    song_titles_list_2.append(s2)
    
for s2 in song_titles_list_2[:10]: #Remove the indices to see all the songs.
    print(s2)

['shape', 'of', 'you']
['despacito', 'remix']
['despacito', 'featuring', 'daddy', 'yankee']
['something', 'just', 'like', 'this']
['im', 'the', 'one']
['humble']
['it', 'aint', 'me', 'with', 'selena', 'gomez']
['unforgettable']
['thats', 'what', 'i', 'like']
['i', 'dont', 'wanna', 'live', 'forever', 'fifty', 'shades', 'darker', 'from', 'fifty', 'shades', 'darker', 'original', 'motion', 'picture', 'soundtrack']


### Tokenisation using `nltk`'s `word_tokenize()`

The Natural Language Toolkit, or `nltk`, is a Python library for natural language processing. We will use its `word_tokenize()` function to perform tokenisation; `nltk` also provides `pos_tag` for Part-of-Speech (POS) tagging of words in a sentence.

In [11]:
# Dataset 2: Credits at the end of the notebook
# Your turn: read the dataset from loans-descs-1k.csv into df 
df = pd.read_csv('loans-descs-1k.csv')

In [12]:
#Convert the raw text into 2 sentences that can be used for processing
s1 = df['desc'][4]
print(s1)

  Borrower added on 02/14/14 > I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.<br>


In [13]:
print(s1)

# Raw string so that \d is passed to the regex engine unchanged
r = r'Borrower added on \d+/\d+/\d+ >|<br>'
# Your turn: use re.sub() to substitute the string with that pattern to an empty string
s1 = re.sub(r, '', s1)
# Your turn: use strip() to remove leading and trailing spaces
s1 = s1.strip()

print()
print(s1)

  Borrower added on 02/14/14 > I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.<br>

I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.


To tokenise a sentence, simply use `word_tokenize()` from the `nltk` library. This will convert the sentence into individual words.

<div class="alert alert-info">**Additional Notes**: `nltk` provides other tokenisers, such as `RegexpTokenizer`, which splits text according to a regular expression that you supply. This is particularly useful for money expressions. However, these are not in the DS102 syllabus. Refer to [the API](https://www.nltk.org/api/nltk.tokenize.html) to find out more.</div>
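
To illustrate the idea of regex-based tokenisation without leaving the standard library, the sketch below uses `re.findall` with a pattern that keeps a money expression together as one token. The pattern and the sample sentence are assumptions made for this example; `RegexpTokenizer` is also driven by a pattern you supply, so consult the API for its exact behaviour.

```python
import re

# Illustrative pattern: an optional $ followed by a number, otherwise a run of word characters
money_or_word_p = r'\$?\d+(?:\.\d+)?|\w+'

sentence = 'I owe $3500.50 on 2 cards'   # made-up sentence for this example
tokens = re.findall(money_or_word_p, sentence)
print(tokens)
```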

In [14]:
# Your turn: Use word_tokenize(string) to convert the string into a list of tokens.
# Assign this to a new variable called ts1
ts1 = word_tokenize(s1)
print(ts1)

['I', 'am', 'consolidating', 'credit', 'card', 'debt', 'incurred', 'over', 'three', 'years', 'ago', 'and', 'having', 'a', 'concrete', 'end', 'in', 'sight', 'is', 'more', 'motivating', '.', 'I', 'am', 'eagerly', 'striving', 'towards', 'becoming', 'completely', 'debt', 'free', '.']


### Linguistics - The root form of a word (verbs & nouns)
Stemming is one way to infer the **root form of a word**. We limit our discussion to verbs (action words) and nouns (naming words). First, consider the following 3 sentences that use different forms of the word `watch`:

- `Larry watches television.` (third-person singular, present tense)

- `The children watch television.` (plural, present tense)

- `My son is watching television.` (present participle, as used in the present continuous tense)

The verb `watch` appears in 3 different <u>**forms of the verb**</u> because the sentences differ in tense and subject. However, algorithms treat these forms as **separate words** during analysis. Hence, we need to find the root form of the verb, in this case `watch`, so that all 3 forms are treated as the same word, since they carry the same meaning.

Now consider the next 2 sentences:

- `This is a very expensive vase.` (singular noun)

- `The third floor in this mall sells vases.` (plural noun)

Although `vase` and `vases` differ only by the ending `s`, algorithms treat them as distinct words. Similarly, we need to find the root <u>**form of the noun**</u>, in this case `vase`, so that both are treated as the same word, since they refer to the same object in real life.
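
The effect on analysis can be seen by counting words. The sketch below is illustrative only: the `root_forms` lookup table is hand-written for this example and is not how stemming actually finds root forms.

```python
from collections import Counter

tokens = ['watches', 'watch', 'watching', 'vase', 'vases']

# Without normalisation, every surface form is counted as a separate word
raw_counts = Counter(tokens)

# Hypothetical hand-written lookup table from surface form to root form
root_forms = {'watches': 'watch', 'watching': 'watch', 'vases': 'vase'}
root_counts = Counter(root_forms.get(t, t) for t in tokens)

print(raw_counts)    # 5 distinct words, each counted once
print(root_counts)   # 2 distinct root forms
```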

### Stemming

**Stemming** is the first way to find root forms of a word. It uses a fixed set of rules to shorten a word to its root form. `nltk` implements the **Porter Stemmer** and you can find the reference for the rules [here](http://www.nltk.org/howto/stem.html). Use `stemmer = PorterStemmer()` and then use the `stem()` method for each word to get its root form.

In [25]:
# Your turn: instantiate a stemmer
stemmer = PorterStemmer()
stemmed_words = []
for t in ts1:
    # Your turn: use stem(string) to find the root form of the word
    u = stemmer.stem(t)
    # Your turn: append the stemmed string to the list
    stemmed_words.append(u)
print(stemmed_words)

['I', 'am', 'consolid', 'credit', 'card', 'debt', 'incur', 'over', 'three', 'year', 'ago', 'and', 'have', 'a', 'concret', 'end', 'in', 'sight', 'is', 'more', 'motiv', '.', 'I', 'am', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free', '.']


Notice that some stemmed words are not valid English words: `consolid`, `motiv` and `eagerli`, for example. However, the algorithm is relatively simple, and many applications do not need the stem to be a real word, only that related forms map to the same stem. Search engines are one example: both the search term and the text can be stemmed before matching.
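
The search-engine idea can be sketched as follows. To keep the example free of dependencies, it uses a hypothetical `toy_stem` function with three suffix rules; this is **not** the Porter algorithm, only an illustration of rule-based suffix stripping applied to both the query and the text.

```python
def toy_stem(word):
    """Strip one common suffix; a toy illustration, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ('ing', 'es', 's'):
        # Only strip when at least 3 letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

query = 'watching'
document = 'Larry watches television'

# Stem both sides, then compare the stems instead of the raw words
query_stem = toy_stem(query)
doc_stems = [toy_stem(t) for t in document.split()]
is_match = query_stem in doc_stems

print(query_stem, doc_stems, is_match)
```

The raw word `watching` does not appear in the document, but its stem does, so the query still matches.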

The following shows the original form of the sentence and the result after stemming for easy comparison.

In [16]:
print("%15s   %15s   " % ("Raw", "Stemming"))
print("%15s-- %15s" % ("------------", "------------"))
for i in range(len(stemmed_words)):
    print("%15s   %15s" % (ts1[i], stemmed_words[i]))

            Raw          Stemming   
   --------------    ------------
              I                 I
             am                am
  consolidating          consolid
         credit            credit
           card              card
           debt              debt
       incurred             incur
           over              over
          three             three
          years              year
            ago               ago
            and               and
         having              have
              a                 a
       concrete           concret
            end               end
             in                in
          sight             sight
             is                is
           more              more
     motivating             motiv
              .                 .
              I                 I
             am                am
        eagerly           eagerli
       striving            strive
        towards            toward
       beco

### Removing Stop Words

Finally, before performing analysis, remove **stop words** from the sentence. A stop word is a word that appears in most texts and hence carries little meaning of its own. In signal processing language, this is referred to as <u>noise</u>. Refer to this [GitHub link](https://gist.github.com/sebleier/554280) for the list of English stop words from `nltk`. `nltk.corpus.stopwords.words()` returns the list of stop words; if a word appears in that list, ignore it. Note that calling `stopwords.words()` with no argument returns the stop words of every language in the corpus, while `stopwords.words('english')` restricts the list to English.

Recall that the logical expression

```python
    word in wordlist
``` 

used in the context

```python
    if word in wordlist:
``` 
is used to check if a word exists in a `list`. It returns `True` if the word is found and `False` otherwise.
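
A quick refresher on membership tests, using a small made-up word list rather than the `nltk` stop word list:

```python
wordlist = ['a', 'and', 'the']   # small illustrative list, not nltk's

found_the = 'the' in wordlist            # the word is in the list
found_debt = 'debt' in wordlist          # the word is not in the list
debt_missing = 'debt' not in wordlist    # "not in" gives the opposite answer

print(found_the, found_debt, debt_missing)
```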

In [17]:
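final_list_of_words = []
# Retrieve the stop word list once, rather than on every loop iteration
stop_word_list = stopwords.words()
for l in stemmed_words:
    # Your turn: Use not in with the stop word list to check if the word is a stop word.
    if l not in stop_word_list:
        final_list_of_words.append(l)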
print(stemmed_words)
print()
print(final_list_of_words)
# Your turn: How many words have been eliminated after removal of stop words?

['I', 'am', 'consolid', 'credit', 'card', 'debt', 'incur', 'over', 'three', 'year', 'ago', 'and', 'have', 'a', 'concret', 'end', 'in', 'sight', 'is', 'more', 'motiv', '.', 'I', 'am', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free', '.']

['I', 'consolid', 'credit', 'card', 'debt', 'incur', 'three', 'year', 'ago', 'concret', 'sight', 'motiv', '.', 'I', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free', '.']
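
One way to answer the question in the cell above is to compare the two printed lists directly. The sketch below copies them from the output so that it runs on its own; the variable name `removed` is introduced here for illustration.

```python
stemmed_words = ['I', 'am', 'consolid', 'credit', 'card', 'debt', 'incur', 'over',
                 'three', 'year', 'ago', 'and', 'have', 'a', 'concret', 'end', 'in',
                 'sight', 'is', 'more', 'motiv', '.', 'I', 'am', 'eagerli', 'strive',
                 'toward', 'becom', 'complet', 'debt', 'free', '.']
final_list_of_words = ['I', 'consolid', 'credit', 'card', 'debt', 'incur', 'three',
                       'year', 'ago', 'concret', 'sight', 'motiv', '.', 'I', 'eagerli',
                       'strive', 'toward', 'becom', 'complet', 'debt', 'free', '.']

# Number of words eliminated
n_removed = len(stemmed_words) - len(final_list_of_words)

# Which words were eliminated (repeats are listed each time they occur)
removed = [w for w in stemmed_words if w not in final_list_of_words]

print(n_removed)
print(removed)
```

Two details are worth noticing. `I` survives because the comparison is case-sensitive and the sentence was not lowercased first. `end` is removed even though it is not in the English list linked above, presumably because `stopwords.words()` was called without a language and so includes other languages' stop words.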


**Your turn: Text Normalisation** - Pick one of the following 2 sentences. Both are also from the `loans-descs-1k.csv` dataset.
- Convert the sentence to lowercase
- Remove all special characters according to the pattern `[.®'&$’\"\-()]`
- Perform tokenisation, followed by stemming of your selected sentence
- Remove stop words from the list of stemmed words

Note: You can use `'''` to specify a multi-line string. 

In [18]:
s2 = '''I really need to consolidate my credit card debt so that I can become debt free. 
The interest is killing me and I'm just not getting anywhere with the balances. Help!'''

s3 = '''Hello, I just closed on the house of my dreams and I would like to 
use this loan to pay off my high interest credit cards and build a deck on my home.'''

In [19]:
# Convert sentence to lower case
s2_1 = s2.lower()
print(s2_1)

i really need to consolidate my credit card debt so that i can become debt free. 
the interest is killing me and i'm just not getting anywhere with the balances. help!


In [20]:
# Regex substitution: replace each special character with a space (raw string keeps \- intact)
s2_2 = re.sub(r"[.®'&$’\"\-()]", " ", s2_1)
print(s2_2)

i really need to consolidate my credit card debt so that i can become debt free  
the interest is killing me and i m just not getting anywhere with the balances  help!


In [21]:
# Tokenise
s2_3 = nltk.word_tokenize(s2_2)
print(s2_3)

['i', 'really', 'need', 'to', 'consolidate', 'my', 'credit', 'card', 'debt', 'so', 'that', 'i', 'can', 'become', 'debt', 'free', 'the', 'interest', 'is', 'killing', 'me', 'and', 'i', 'm', 'just', 'not', 'getting', 'anywhere', 'with', 'the', 'balances', 'help', '!']


In [22]:
# Stemming
st = PorterStemmer()
s2_4 = [st.stem(s) for s in s2_3]
print(s2_4)

['i', 'realli', 'need', 'to', 'consolid', 'my', 'credit', 'card', 'debt', 'so', 'that', 'i', 'can', 'becom', 'debt', 'free', 'the', 'interest', 'is', 'kill', 'me', 'and', 'i', 'm', 'just', 'not', 'get', 'anywher', 'with', 'the', 'balanc', 'help', '!']


In [23]:
# Remove stop words (retrieve the list once, then filter)
stop_word_list = stopwords.words()
s2_5 = [s for s in s2_4 if s not in stop_word_list]
print(s2_5)

['realli', 'need', 'consolid', 'credit', 'card', 'debt', 'becom', 'debt', 'free', 'interest', 'kill', 'get', 'anywher', 'balanc', 'help', '!']
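
The four steps above can be collected into one helper. This is a minimal sketch under stated assumptions: the function name `normalise` is hypothetical, tokenisation uses `split()` rather than `word_tokenize()`, the stemmer is passed in as a callable (identity by default), and the demo uses a tiny made-up stop word set instead of the `nltk` list.

```python
import re

SPECIAL_CHARS = r"[.®'&$’\"\-()]"   # the pattern used throughout this notebook

def normalise(text, stem=lambda word: word, stop_words=frozenset()):
    """Lowercase, strip special characters, tokenise, stem, then drop stop words."""
    text = text.lower()
    text = re.sub(SPECIAL_CHARS, ' ', text)
    tokens = text.split()                       # simple whitespace tokenisation
    stems = [stem(t) for t in tokens]
    return [s for s in stems if s not in stop_words]

# Tiny illustrative stop word set; not the nltk list
demo_stop_words = {'the', 'is', 'me'}
result = normalise('The interest is killing me. Help!', stop_words=demo_stop_words)
print(result)
```

To get closer to the cells above, pass `stem=PorterStemmer().stem` and `stop_words=set(stopwords.words())`. Note that `split()` leaves `!` attached to `help`, whereas `word_tokenize()` separates it into its own token.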


**Credits**
- [Kaggle (Top Spotify Tracks of 2017)](https://www.kaggle.com/nadintamer/top-tracks-of-2017) for Dataset 1
- [Kaggle (Lending Club Loan Data)](https://www.kaggle.com/wendykan/lending-club-loan-data) for Dataset 2