<a href="https://colab.research.google.com/github/kokchun/Deep-learning-AI21/blob/main/Lectures/Lec7-Text_preprocessing.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code

---
# Lecture notes - Text preprocessing
---

This is the lecture note for **text preprocessing**. 

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to text preprocessing. I encourage you to read further about text preprocessing. </p>

Read more:
- [Text preprocessing using NLTK - pysansar blog post](https://pythonsansar.com/how-to-do-text-preprocessing-using-python-nltk/)
- [NLTK tokenize - NLTK](https://www.nltk.org/api/nltk.tokenize.html)
- [Lemmatization - wikipedia](https://en.wikipedia.org/wiki/Lemmatisation)
- [Stemming - wikipedia](https://en.wikipedia.org/wiki/Stemming)

---

## Lower case

Words that differ only in capitalization should usually not be treated as different words. Converting the text to lower case as a first step avoids this.

In [2]:
import pyjokes

# get some text data
jokes = pyjokes.get_jokes()
print(f"{len(jokes)=}")

# put 3 jokes in one
raw_text = f"{jokes[1]}\n{jokes[10]}\n{jokes[5]}"

text = raw_text.lower()

print(text)


len(jokes)=97
ubuntu users are apt to get this joke.
'knock, knock.' 'who's there?' ... very long pause ... 'java.'
an sql query goes into a bar, walks up to two tables and asks, 'can i join you?'
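
To see why lower casing matters, here is a minimal sketch that counts words before and after lower casing. The sample sentence is made up for illustration and is not taken from pyjokes.

```python
from collections import Counter

# Hypothetical sample text, not taken from the jokes above
sample = "Java is fun. java is verbose. JAVA is everywhere."

# Naive whitespace split, stripping the trailing period from each word
raw_words = [word.strip(".") for word in sample.split()]
lowered_words = [word.lower() for word in raw_words]

raw_counts = Counter(raw_words)
lowered_counts = Counter(lowered_words)

# "Java", "java" and "JAVA" are three different keys before lower casing
print(f"{raw_counts['java']=}, {lowered_counts['java']=}")
```

For text in other languages, `str.casefold()` is a more aggressive alternative to `str.lower()` that may be worth trying.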


---
## Tokenize

A tokenizer divides a string into a list of substrings:

- sentence tokenization
- word tokenization

This is useful when you want to work with words or sentences or other sequences for an application.

In [3]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

# splits the text into sentences with the pre-trained Punkt model, not simply at every period '.'
text_sentence_tokens = sent_tokenize(text)
print(text_sentence_tokens)

['ubuntu users are apt to get this joke.', "'knock, knock.'", "'who's there?'", '... very long pause ...', "'java.'", "an sql query goes into a bar, walks up to two tables and asks, 'can i join you?'"]


[nltk_data] Downloading package punkt to C:\Users\YunaLiu-
[nltk_data]     AIU21GBG\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
from nltk.tokenize import word_tokenize

text_word_tokens = word_tokenize(text)
text_word_tokens[10:20]

[',', 'knock', '.', "'", "'who", "'s", 'there', '?', "'", '...']

In [9]:
words_in_sentence_tokens = [word_tokenize(sentence) for sentence in sent_tokenize(text)]
print(words_in_sentence_tokens)

[['ubuntu', 'users', 'are', 'apt', 'to', 'get', 'this', 'joke', '.'], ["'knock", ',', 'knock', '.', "'"], ["'who", "'s", 'there', '?', "'"], ['...', 'very', 'long', 'pause', '...'], ["'java", '.', "'"], ['an', 'sql', 'query', 'goes', 'into', 'a', 'bar', ',', 'walks', 'up', 'to', 'two', 'tables', 'and', 'asks', ',', "'can", 'i', 'join', 'you', '?', "'"]]
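
To get a feeling for what a tokenizer does, the following is a rough regex-based sketch. It is not how NLTK's tokenizers work, and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this example.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def simple_word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "ubuntu users are apt to get this joke. can i join you?"

sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sentences[1])

print(sentences)
print(words)
```

Such simple rules break quickly, for example on abbreviations like "e.g." in the middle of a sentence, which is one reason to prefer a trained tokenizer.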


--- 
## Remove noise

Some tokens act as noise: they carry little meaning for the task and can distort the analysis. Common candidates for removal are:
- digits
- punctuation
- stop words

In [10]:
import string

print(f"{string.punctuation=}")

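# add three dots so that the "..." token is removed as well
punctuations = string.punctuation + "..."
print(f"{punctuations=}")

# note: punctuations is a string, so `in` is a substring check here
tokens_no_punctuations = [token for token in text_word_tokens if token not in punctuations]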
print(tokens_no_punctuations)

string.punctuation='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
punctuations='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~...'
['ubuntu', 'users', 'are', 'apt', 'to', 'get', 'this', 'joke', "'knock", 'knock', "'who", "'s", 'there', 'very', 'long', 'pause', "'java", 'an', 'sql', 'query', 'goes', 'into', 'a', 'bar', 'walks', 'up', 'to', 'two', 'tables', 'and', 'asks', "'can", 'i', 'join', 'you']
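
Some tokens in the output above, such as `'knock`, still carry a leading apostrophe because only tokens that consist entirely of punctuation were removed. One possible way to handle this, as a sketch, is to strip punctuation from both ends of each token:

```python
import string

# A few tokens copied from the output above
tokens = ["'knock", "knock", "'who", "'s", "there", "'java"]

# Strip punctuation characters from both ends of each token, then drop empty strings
stripped_tokens = [token.strip(string.punctuation) for token in tokens]
stripped_tokens = [token for token in stripped_tokens if token]

print(stripped_tokens)
```

Note that `'s` becomes `s` with this approach, which may or may not be what you want.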


In [11]:
import nltk
from nltk.corpus import stopwords

# if you haven't downloaded stopwords before
nltk.download("stopwords")

swedish_stopwords = stopwords.words("swedish")

print(f"{swedish_stopwords=}")

swedish_stopwords=['och', 'det', 'att', 'i', 'en', 'jag', 'hon', 'som', 'han', 'på', 'den', 'med', 'var', 'sig', 'för', 'så', 'till', 'är', 'men', 'ett', 'om', 'hade', 'de', 'av', 'icke', 'mig', 'du', 'henne', 'då', 'sin', 'nu', 'har', 'inte', 'hans', 'honom', 'skulle', 'hennes', 'där', 'min', 'man', 'ej', 'vid', 'kunde', 'något', 'från', 'ut', 'när', 'efter', 'upp', 'vi', 'dem', 'vara', 'vad', 'över', 'än', 'dig', 'kan', 'sina', 'här', 'ha', 'mot', 'alla', 'under', 'någon', 'eller', 'allt', 'mycket', 'sedan', 'ju', 'denna', 'själv', 'detta', 'åt', 'utan', 'varit', 'hur', 'ingen', 'mitt', 'ni', 'bli', 'blev', 'oss', 'din', 'dessa', 'några', 'deras', 'blir', 'mina', 'samma', 'vilken', 'er', 'sådan', 'vår', 'blivit', 'dess', 'inom', 'mellan', 'sådant', 'varför', 'varje', 'vilka', 'ditt', 'vem', 'vilket', 'sitta', 'sådana', 'vart', 'dina', 'vars', 'vårt', 'våra', 'ert', 'era', 'vilkas']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kokch\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [12]:
english_stopwords = stopwords.words("english")
print(f"{english_stopwords=}")

english_stopwords=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same

In [13]:
tokens_no_stop = [token for token in tokens_no_punctuations if token not in english_stopwords]
print(tokens_no_stop)

['ubuntu', 'users', 'apt', 'get', 'joke', "'knock", 'knock', "'who", "'s", 'long', 'pause', "'java", 'sql', 'query', 'goes', 'bar', 'walks', 'two', 'tables', 'asks', "'can", 'join']
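
The jokes above contain no digits, so digit removal is not shown in the cells. Here is a small sketch with made-up tokens showing two possible approaches; which one fits depends on your use case.

```python
import re

# Hypothetical tokens containing digits, not taken from the jokes above
tokens = ["python", "3", "was", "released", "in", "2008", "py3k"]

# Option 1: drop tokens that consist only of digits
no_digit_tokens = [token for token in tokens if not token.isdigit()]

# Option 2: also remove digits inside tokens, then drop empty strings
digits_stripped = [re.sub(r"\d+", "", token) for token in tokens]
digits_stripped = [token for token in digits_stripped if token]

print(no_digit_tokens)
print(digits_stripped)
```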


---
## Stemming

Stemming reduces words to their stem by stripping suffixes, for example loves, loving --> love. The stem is not always a real word, as 'tabl' and 'paus' in the output below show. There are several stemmers available, and different stemmers use different rules, yielding different results. Some are more aggressive than others, and you should read about them to choose one that is suitable for your use case. 

In [14]:
from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

snowball_stemmed_tokens = [snowball.stem(token) for token in tokens_no_stop]
lancaster_stemmed_tokens = [lancaster.stem(token) for token in tokens_no_stop]

# different results with different types of stemmers, you should read and try them out
print(f"Original tokens = {tokens_no_stop}")
print(f"{snowball_stemmed_tokens= }")
print(f"{lancaster_stemmed_tokens= }")

Original tokens = ['ubuntu', 'users', 'apt', 'get', 'joke', "'knock", 'knock', "'who", "'s", 'long', 'pause', "'java", 'sql', 'query', 'goes', 'bar', 'walks', 'two', 'tables', 'asks', "'can", 'join']
snowball_stemmed_tokens= ['ubuntu', 'user', 'apt', 'get', 'joke', 'knock', 'knock', 'who', "'s", 'long', 'paus', 'java', 'sql', 'queri', 'goe', 'bar', 'walk', 'two', 'tabl', 'ask', 'can', 'join']
lancaster_stemmed_tokens= ['ubuntu', 'us', 'apt', 'get', 'jok', "'knock", 'knock', "'who", "'s", 'long', 'paus', "'java", 'sql', 'query', 'goe', 'bar', 'walk', 'two', 'tabl', 'ask', "'can", 'join']
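
To illustrate the idea of rule-based suffix stripping, here is a deliberately naive toy stemmer. It is not the Snowball or Lancaster algorithm; the function `naive_stem` and its suffix list are made up for this example.

```python
def naive_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walks", "tables", "goes", "joking", "bar"]
stems = [naive_stem(word) for word in words]

print(stems)
```

Even these three rules produce non-words such as 'tabl' and 'goe', similar to what the real stemmers give above. Real stemmers use many more rules and conditions.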


--- 
## Lemmatization

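Lemmatization groups the inflected forms of a word under its dictionary form, the lemma. Unlike stemming, it uses a vocabulary and the part of speech of the word, so the result is a real word, for example ran --> run.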

In [21]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('omw-1.4')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# with pos=wordnet.VERB the adjective "thicker" is left unchanged
[lemmatizer.lemmatize(word, wordnet.VERB) for word in ["runs", "running", "ran", "boats", "thicker"]]

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\kokch\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kokch\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['run', 'run', 'run', 'boat', 'thicker']
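
The result for "thicker" shows that the part of speech matters. The toy lookup below sketches this idea with a made-up miniature dictionary `LEMMAS`. It is only an illustration and not how WordNet is implemented.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("runs", "v"): "run",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("boats", "n"): "boat",
    ("thicker", "a"): "thick",
}

def lookup_lemma(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

# "ran" cannot be reduced to "run" by stripping suffixes, a lookup can do it
print(lookup_lemma("ran", "v"))

# the same word gives different results depending on the part of speech
print(lookup_lemma("thicker", "v"), lookup_lemma("thicker", "a"))
```

In practice you would typically determine the part of speech of each token first, for example with a part-of-speech tagger, and pass it on to the lemmatizer.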