### DS102 | In Class Practice Week 4A - Text Mining I - Text Normalisation
<hr>

## Pre-Requisites

1. Self-Study Week 3A - String Manipulation

## Learning Objectives
At the end of the lesson, you will be able to:

- define **corpus**, **documents** and **terms** as part of the study of Natural Language Processing

- define **tokenisation** as breaking a document into terms

- understand the definition of **root form** of a word for verbs and nouns

- identify **stemming** as a way to find the root form of a word

- learn how to use **stop words** to filter out terms in a document that are not meaningful

In terms of code, you will be able to:

- use `word_tokenize` from `nltk.tokenize` to break a document into a list of words

- use `PorterStemmer`'s `stem` from `nltk.stem` to perform stemming of words

- retrieve a list of stopwords defined in `stopwords.words()` from `nltk.corpus`

### Datasets Required for this Practice
1. `songs-100.csv`

2. `loans-descs-1k.csv`

#### Import Libraries

In [5]:
!pip install nltk
# The PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn


In [6]:
import pandas as pd
import nltk
import re

In [9]:
# Use this cell to download all the required corpora first. Then, comment out this 
# block of code.
print("Downloading corpora...")    
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
print("Corpora download complete.")

Downloading corpora...
Corpora download complete.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\victo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [10]:
# If you are running this for the first time, use the previous cell to download all 
# the corpora before starting.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

### Corpus, Documents and Terms

In linguistics (the study of language), a **corpus** is a collection of texts, each represented by a document. A **document** contains one or more words which, when strung together, produce meaning. Each word is called a **term**.

Consider the following corpus of 4 documents from the 100 Song Titles dataset:

In [12]:
# Dataset 1, Credits at the end of the notebook
song_titles = ['shape of you', 
               'paris', 
               'scared to be lonely', 
               'symphony feat zara larsson',]

Each <u>song title is a document</u>. The <u>collection of song titles forms the corpus</u>. The first document `shape of you` has <u>3 terms</u>. The second document `paris` has <u>1 term</u>. The third document `scared to be lonely` has <u>4 terms</u>.
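
The counts above can be reproduced with a minimal sketch that splits each document on whitespace. This is only a rough stand-in for proper tokenisation, and the variable name `term_counts` is introduced here purely for illustration.

```python
# The same 4 documents as in the cell above
song_titles = ['shape of you',
               'paris',
               'scared to be lonely',
               'symphony feat zara larsson']

# Split each document on whitespace and count the resulting terms
term_counts = {title: len(title.split()) for title in song_titles}

for title, n_terms in term_counts.items():
    print(f"{title!r} has {n_terms} term(s)")

# The corpus size is simply the number of documents
print(len(song_titles), "documents in the corpus")
```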

We now read from `songs-100.csv`, a CSV file which in this case is our corpus. There are $100$ documents in the song titles corpus.

In [20]:
# Dataset 1: Credits at the end of the notebook
# Read from 'songs-100.csv' into song_titles_df
song_titles_df = pd.read_csv('songs-100.csv')
# Each row is one document, so the number of rows (shape[0]) is the
# number of documents in the corpus.
song_titles_df.shape[0]

100

### Tokenisation using `nltk`

The Natural Language Toolkit, or `nltk`, is a widely used Python library for natural language processing. We will be using `nltk.tokenize.word_tokenize()` to perform tokenisation.

In [24]:
# Dataset 2: Credits at the end of the notebook
# Your turn: read the dataset from loans-descs-1k.csv into loan_descs_df 
loan_descs_df = pd.read_csv('loans-descs-1k.csv',  index_col=0)
loan_descs_df.head()


| | id | member_id | loan_amnt | grade | desc |
|---|---|---|---|---|---|
| 296 | 7337222 | 8999285 | 10000.0 | D | Borrower added on 09/17/13 > The wedding of ... |
| 844 | 7365470 | 9027579 | 31825.0 | C | Borrower added on 09/15/13 > Pay Off High Cr... |
| 110 | 676471 | 864466 | 20000.0 | G | Borrower added on 02/15/11 > My husband and ... |
| 278 | 612322 | 785185 | 20000.0 | F | Borrower added on 11/09/10 > debt refi, alwa... |
| 951 | 7051350 | 8713086 | 13000.0 | B | Borrower added on 02/14/14 > I am consolidat... |


In [38]:
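# Get the row at position 4 and store it in s1. Hint: use df.iloc[]
# The index holds labels such as 296 and 844, so position 4 needs iloc, not loc.
s1 = loan_descs_df.iloc[4]

# Print the whole row
print(s1)

# Extract the raw description text from the 'desc' column
s_desc = s1['desc']
print(s_desc)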

id                                                     7051350
member_id                                              8713086
loan_amnt                                                13000
grade                                                        B
desc           Borrower added on 02/14/14 > I am consolidat...
Name: 951, dtype: object
  Borrower added on 02/14/14 > I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.<br>


Use `re.sub()` to replace the initial phrase `Borrower added on ...` and the trailing `<br>` tag with an empty string.

In [41]:
# Raw string, so that \d reaches the regex engine unchanged
r = r'Borrower added on \d+/\d+/\d+ >|<br>'
# Your turn: use re.sub() to replace every match of the pattern with an empty string.
s_desc1 = re.sub(r, '', s_desc)

# Your turn: use strip() to remove leading and trailing spaces.
s_desc1 = s_desc1.strip()

# Print the cleaned description
print(s_desc1)

I am consolidating credit card debt incurred over three years ago and having a concrete end in sight is more motivating.  I am eagerly striving towards becoming completely debt free.
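
To see what each side of the `|` in the pattern matches, here is a small sketch using only the standard library. The `sample` string is made up for illustration and is not taken from the dataset.

```python
import re

# Same pattern as above: either the "Borrower added on <date> >" prefix OR a <br> tag
pattern = r'Borrower added on \d+/\d+/\d+ >|<br>'

# A made-up description in the same shape as the dataset's 'desc' column
sample = '  Borrower added on 01/02/12 > Need a new roof.<br>'

# findall() lists every piece of text that the pattern matches
matches = re.findall(pattern, sample)
print(matches)

# sub() removes those pieces; strip() then trims the leftover spaces
cleaned = re.sub(pattern, '', sample).strip()
print(cleaned)
```

The first alternative matches the prefix with its date, and the second matches the `<br>` tag, so one call to `re.sub()` removes both.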


To tokenise a sentence, use `word_tokenize()` from the `nltk` library. This converts the sentence into a list of individual words and punctuation marks such as full stops and commas.

<div class="alert alert-info"><b>DS102 Learning Guidelines: </b> `nltk` provides other tokenisers, such as `RegexpTokenizer`, which splits text according to a regular expression that you supply. Such tokenisers are particularly useful for text like money expressions. However, they are not in the DS102 syllabus. Refer to [the API](https://www.nltk.org/api/nltk.tokenize.html) to find out more.</div>

In [45]:
# Use word_tokenize(string) to convert the string into a list of tokens.
# Assign this to a new variable called ts1
#
ts1 = word_tokenize(s_desc1)
print(ts1)
print(len(ts1))

['I', 'am', 'consolidating', 'credit', 'card', 'debt', 'incurred', 'over', 'three', 'years', 'ago', 'and', 'having', 'a', 'concrete', 'end', 'in', 'sight', 'is', 'more', 'motivating', '.', 'I', 'am', 'eagerly', 'striving', 'towards', 'becoming', 'completely', 'debt', 'free', '.']
32
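
To build intuition for why punctuation ends up as separate tokens, the sketch below approximates tokenisation with a single regular expression from the standard library. It is not how `word_tokenize()` is implemented, and the helper name `rough_tokenize` is hypothetical.

```python
import re

def rough_tokenize(text):
    """Rough stand-in for a tokeniser: runs of word characters,
    or any single character that is neither a word character nor a space."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I am eagerly striving towards becoming completely debt free."
tokens = rough_tokenize(sentence)

print(tokens)
print(len(tokens))
```

This approximation differs from `word_tokenize()` on contractions: it would split `I'm` into `I`, `'` and `m`, whereas `word_tokenize()` keeps `'m` together as one token.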


### Linguistics - The root form of a word (verbs & nouns)
Stemming is one way to find the **root form of a word**. We limit our discussion to verbs (action words) and nouns (naming words). First, consider the following 4 sentences that use different forms of the word `watch`:

- `Larry watches television.` (third-person singular, present tense)

- `The children watch television.` (plural, present tense)

- `My son is watching television.` (present participle, used in the present continuous tense)

- `My mum watched television with me.` (past tense)

The word `watch` appears in 4 different <u>forms of the **verb**</u>, one per sentence, because the tense or the subject differs. However, algorithms treat these forms as **separate words** during analysis. Hence, we need to find the root form of the verb, in this case `watch`, so that all four forms are treated as the same word, since they share the same meaning.
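
A minimal sketch of the problem, using only the standard library: counting the raw forms gives four separate entries, while counting root forms gives one. The `root_of` dictionary is a hand-written mapping used purely for illustration; it is not how a stemmer works internally.

```python
from collections import Counter

# The four forms of the verb from the sentences above
tokens = ['watches', 'watch', 'watching', 'watched']

# Counting the raw tokens treats every form as a different word
raw_counts = Counter(tokens)
print(raw_counts)

# Hand-written lookup from each form to its root (illustration only)
root_of = {'watches': 'watch', 'watch': 'watch',
           'watching': 'watch', 'watched': 'watch'}

# Counting root forms groups all four occurrences together
root_counts = Counter(root_of[t] for t in tokens)
print(root_counts)
```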

`nltk` implements the **Porter Stemmer** and you can find the reference for the rules [here](http://www.nltk.org/howto/stem.html). Use `stemmer = PorterStemmer()` and then use the `stem()` method for each word to get its root form.

In [49]:
# Instantiate PorterStemmer
stemmer = PorterStemmer()

ss_verbs = ['Larry watches television.', 'The children watch television.', 
       'My son is watching television.', 'My mum watched television with me.']


ss1 = ss_verbs[0] # pull from corpus
ss2 = word_tokenize(ss1) # break document to terms
print(ss2)

ss3 = []
for s in ss2:
    s_stemmed = stemmer.stem(s)
    ss3.append(s_stemmed)
print(ss3)


# # (What meaningful comment can you write here?)
# for s in ss_verbs:
#     # (What meaningful comment can you write here?)
#     for st in word_tokenize(s):
#         # (What meaningful comment can you write here?)
#         print(stemmer.stem(st))
#     print()

['Larry', 'watches', 'television', '.']
['larri', 'watch', 'televis', '.']


Now consider the next 2 sentences:

- `This is a very expensive vase.` (singular noun)

- `The third floor in this mall sells vases.` (plural noun)

Similarly, we need to find the root <u>form of the **noun**</u>, in this case `vase`. Although `vase` and `vases` differ only by the ending `s`, algorithms treat them as distinct words. Reducing both to the root form lets them be treated as the same word, as they refer to the same object in real life.

In [54]:
ss_nouns = ['This is a very expensive vase.', 
            'The third floor in this mall sells vases.', ]

ssn1 = ss_nouns[1]
ssnlt = word_tokenize(ssn1)

ssn2 = []

# Exercise: Tokenise the sentence using word_tokenize() (done above), then
# collect the stemmed form of every term using stemmer.stem(term).
for s in ssnlt:
    ssn2.append(stemmer.stem(s))

print(ssn2)

['the', 'third', 'floor', 'in', 'thi', 'mall', 'sell', 'vase', '.']


### Stemming

Now, we apply the stemming step on the initial loans sentence.

In [55]:
# Just to refresh our memory...
print(ts1)

['I', 'am', 'consolidating', 'credit', 'card', 'debt', 'incurred', 'over', 'three', 'years', 'ago', 'and', 'having', 'a', 'concrete', 'end', 'in', 'sight', 'is', 'more', 'motivating', '.', 'I', 'am', 'eagerly', 'striving', 'towards', 'becoming', 'completely', 'debt', 'free', '.']


In [61]:
# Instantiate a PorterStemmer
stemmer = PorterStemmer()

stemmed_words = []
for t in ts1:
    # Use stemmer.stem() to find the root form of the word
    t2 = stemmer.stem(t)
    # Then amend this line to append the stemmed word into stemmed_words
    stemmed_words.append(t2)
    
print(stemmed_words)

['I', 'am', 'consolid', 'credit', 'card', 'debt', 'incur', 'over', 'three', 'year', 'ago', 'and', 'have', 'a', 'concret', 'end', 'in', 'sight', 'is', 'more', 'motiv', '.', 'I', 'am', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free', '.']


Notice that some stemmed words are not valid English words: `consolid`, `motiv` and `eagerli`, for example. However, the algorithm is relatively simple, and many applications only need different forms of a word to map to the same stem, not to a readable word. Search engines are one example: both the search term and the text can be stemmed so that they match.
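
The search engine idea can be sketched as follows. To keep the example self-contained, it uses a toy suffix stripper called `naive_stem`, which is a hypothetical helper and much cruder than the Porter Stemmer; the point is only that stemming both sides lets a query match a different form of the same word.

```python
def naive_stem(word):
    """Toy suffix stripper for illustration only; NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least 3 letters remain, to avoid mangling short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

document = "My mum watched television with me"
query = "watching"

doc_terms = document.lower().split()

# Without stemming, the query does not appear in the document
exact_match = query in doc_terms

# With stemming on both sides, 'watching' and 'watched' both become 'watch'
stemmed_match = naive_stem(query) in {naive_stem(t) for t in doc_terms}

print(exact_match, stemmed_match)
```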

The following shows the original form of the sentence and the result after stemming for easy comparison.

In [63]:
# Compare the raw tokens with their stemmed forms, side by side.

print("%15s   %15s   " % ("Raw", "Stemming"))
print("%15s-- %15s" % ("------------", "------------"))
for raw, stemmed in zip(ts1, stemmed_words):
    print("%15s   %15s" % (raw, stemmed))

            Raw          Stemming   
   --------------    ------------
              I                 I
             am                am
  consolidating          consolid
         credit            credit
           card              card
           debt              debt
       incurred             incur
           over              over
          three             three
          years              year
            ago               ago
            and               and
         having              have
              a                 a
       concrete           concret
            end               end
             in                in
          sight             sight
             is                is
           more              more
     motivating             motiv
              .                 .
              I                 I
             am                am
        eagerly           eagerli
       striving            strive
        towards            toward
       becoming             becom
     completely           complet
           debt              debt
           free              free
              .                 .

### Removing Stop Words

Finally, before performing analysis, remove **stop words** from the sentence. A stop word is a word that appears in most texts (for example `the`, `is` and `and`) and hence carries little meaning on its own. In signal processing language, this is referred to as <u>noise</u>. Refer to this [GitHub link](https://gist.github.com/sebleier/554280) for the list of stop words from `nltk`. `nltk.corpus.stopwords.words('english')` returns the list of English stop words; if a term is in that list, ignore it.

Recall that 

```python
    word in wordlist
``` 
is used to check if a word is in a list. It returns `True` if the word is found and `False` otherwise.
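
As a quick refresher, here is a small sketch of `in` and `not in`. The `stop_words` list below is a hand-picked subset for illustration only, not the full list from `nltk`.

```python
# Hand-picked subset for illustration; the real list comes from nltk
stop_words = ['i', 'am', 'and', 'a', 'in', 'is']
terms = ['i', 'am', 'debt', 'free']

print('am' in stop_words)        # 'am' is in the list
print('debt' in stop_words)      # 'debt' is not in the list

# Keep only the terms that are NOT stop words
kept = [t for t in terms if t not in stop_words]
print(kept)
```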

In [66]:
# Build the stop word lookup once; a set makes each membership check fast
stop_words = set(stopwords.words('english'))

final_list_of_words = []
for l in stemmed_words:
    l_lower = l.lower()
    # Your turn: Use not in to check if the word is a stop word.
    # If it isn't, then append it to final_list_of_words.
    if l_lower not in stop_words:
        final_list_of_words.append(l_lower)
print(final_list_of_words)

# Exercise: How many words have been eliminated after removal of stop words?
#

['consolid', 'credit', 'card', 'debt', 'incur', 'three', 'year', 'ago', 'concret', 'end', 'sight', 'motiv', '.', 'eagerli', 'strive', 'toward', 'becom', 'complet', 'debt', 'free', '.']


**Your turn: Text Normalisation** - Pick one of the following 2 sentences, both taken from the `loans-descs-1k.csv` dataset.
- Convert the sentence to lowercase
- Remove all special characters according to the pattern `[.®'&$’\"\-()]`
- Perform tokenisation, followed by stemming of your selected sentence
- Remove stop words from the list of stemmed words

Note: You can use `'''` to specify a multi-line string. 
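
The first two steps might look like the sketch below, which uses only the standard library. The `sample` text is made up for illustration, and writing the pattern as a raw string is one possible choice rather than a requirement.

```python
import re

# Character class from the exercise, written as a raw string
pattern = r"[.®'&$’\"\-()]"

# A made-up multi-line string, specified with triple quotes
sample = '''Macy's® card (store) & Visa - "high" rates.
Total: $5000'''

# Step 1: lower() returns a new string, so keep the result
lowered = sample.lower()

# Step 2: remove every character that appears in the character class
cleaned = re.sub(pattern, '', lowered)

# split() with no argument collapses the extra spaces and the line break
print(cleaned.split())
```

Characters that are not in the pattern, such as the colon in `total:`, are left untouched.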

In [67]:
s2 = '''I really need to consolidate my credit card debt so that I can become debt free. 
The interest is killing me and I'm just not getting anywhere with the balances. Help!'''

s3 = '''Hello, I just closed on the house of my dreams and I would like to 
use this loan to pay off my high interest credit cards and build a deck on my home.'''

In [70]:
# Step 1: Convert to lower case using lower()
# lower() returns a new string, so assign the result back to keep it.
s2 = s2.lower()
s3 = s3.lower()
s3

'hello, i just closed on the house of my dreams and i would like to \nuse this loan to pay off my high interest credit cards and build a deck on my home.'

In [75]:
# Step 2: Perform regex substitution. Only the line break is removed here;
# the special characters in the pattern above are left in place.
ss2 = re.sub('\n', '', s2)
ss3 = re.sub('\n', '', s3)

In [78]:
# Step 3: use word_tokenize() to tokenise the sentence and get a list of terms.
list2 = word_tokenize(ss2)
list3 = word_tokenize(ss3)

In [83]:
# Step 4: Use stemmer.stem() to get the list of stemmed terms.
list_2 = [stemmer.stem(t) for t in list2]
list_3 = [stemmer.stem(t) for t in list3]

list_2

['i',
 'realli',
 'need',
 'to',
 'consolid',
 'my',
 'credit',
 'card',
 'debt',
 'so',
 'that',
 'i',
 'can',
 'becom',
 'debt',
 'free',
 '.',
 'the',
 'interest',
 'is',
 'kill',
 'me',
 'and',
 'i',
 "'m",
 'just',
 'not',
 'get',
 'anywher',
 'with',
 'the',
 'balanc',
 '.',
 'help',
 '!']

In [84]:
# Step 5: Remove stop words. Remove the word if it appears in stopwords.words('english')
stop_words = set(stopwords.words('english'))  # build the lookup once

final_2 = [t for t in list_2 if t not in stop_words]
final_3 = [t for t in list_3 if t not in stop_words]

print(final_2)

['realli', 'need', 'consolid', 'credit', 'card', 'debt', 'becom', 'debt', 'free', '.', 'interest', 'kill', "'m", 'get', 'anywher', 'balanc', '.', 'help', '!']


In [None]:
# Finally, print() the sentence.
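
One way to print the normalised terms as a sentence is to join them with spaces. The sketch below uses the first few terms from the earlier stop word output as a stand-in; `final_terms` is a name chosen here for illustration.

```python
# A few normalised terms, copied from the earlier stop word removal output
final_terms = ['consolid', 'credit', 'card', 'debt', 'incur', 'three', 'year', 'ago']

# join() glues the terms back together with a single space between each
sentence = ' '.join(final_terms)
print(sentence)
```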


**Credits**
- [Kaggle](https://www.kaggle.com/nadintamer/top-tracks-of-2017) for Dataset 1
- [Kaggle](https://www.kaggle.com/wendykan/lending-club-loan-data) for Dataset 2