<a href="https://colab.research.google.com/github/ucaokylong/NLP_learning/blob/main/02_Pre_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/1/lang-pic.jpg?raw=1' width=600>
</center>
    
# 1. Introduction

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1.1 NLP series</p>

This is the **second in a series of notebooks** covering the **fundamentals of Natural Language Processing (NLP)**. I find that the best way to learn is by teaching others, which is why I am sharing my journey of learning this field from scratch. I hope these notebooks can be helpful to you too.

NLP series:

1. [Tokenization](./01_Tokenization.ipynb)
2. Preprocessing
<a target="_blank" href="https://colab.research.google.com/github/JUSTSUJAY/NLP_One_Shot/blob/28eb64d1c9db75ec790b0945aee7f533f0c52ecd/Notebooks/02_Pre_Processing.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/Notebooks/02_Pre_Processing.ipynb)

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1.2 Outline</p>

Last time, we saw how to load a language model and tokenize a string of text. This notebook focuses on **further pre-processing steps** we can perform on tokens. In particular, we will begin by looking at **case folding**, **stop word removal**, **stemming** and **lemmatization**.

We will then examine some more **advanced** pre-processing techniques, namely **part-of-speech tagging** and **named entity recognition**, which are useful tasks in and of themselves.

# 2. Basic pre-processing

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.1 Case folding</p>

This is the act of converting every token to be uniformly **lower case** or **upper case**.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/2/case-folding.jpg?raw=1' width=600>
</center>
    
This can be beneficial because it will **reduce the number of unique tokens** in a corpus, i.e. the size of the **vocabulary**, and hence make the processing of these tokens more memory- and compute-efficient. The downside, however, is **information loss**.

For example, `"Green"` (name) has a different meaning to `"green"` (colour), but both would get the **same token** if case folding is applied. Whether it makes sense to use case folding **depends on the application** (is speed or accuracy more important?).
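To make the trade-off concrete, here is a minimal sketch in plain Python. The three sentences are a made-up toy corpus, and splitting on whitespace stands in for a real tokenizer. It compares the vocabulary size before and after lower-casing and shows the `"Green"`/`"green"` collision.

```python
# Toy corpus (hypothetical sentences, whitespace-tokenized for simplicity)
corpus = [
    "The train to London leaves at 10am on Tuesday",
    "the green train leaves London on tuesday",
    "Mr Green takes the train",
]

tokens = [tok for sentence in corpus for tok in sentence.split()]

# Vocabulary = set of unique tokens, before and after case folding
vocab = set(tokens)
folded_vocab = {tok.lower() for tok in tokens}

print("Vocabulary size without folding:", len(vocab))
print("Vocabulary size with folding:   ", len(folded_vocab))

# Information loss: the name and the colour are distinct before folding only
print("Distinct before folding:", {"Green", "green"} <= vocab)
print("Distinct after folding: ", "Green".lower() != "green".lower())
```

The folded vocabulary is smaller, but the name and the colour can no longer be told apart.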

In [None]:
# Import spacy library
import spacy
print(spacy.__name__, spacy.__version__)

# Load language model
nlp = spacy.load("en_core_web_sm")

spacy 3.3.1


To case fold to lower case, we can use the `.lower_` attribute.

In [None]:
# Tokenize
s = "The train to London leaves at 10am on Tuesday."
doc = nlp(s)

# Case fold
print([t.lower_ for t in doc])

['the', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on', 'tuesday', '.']


We might want to be **more granular** and only case fold if certain conditions are met. For example, we could **skip the first word** in a sentence.

In [None]:
# Conditional case folding
print([t.lower_ if not t.is_sent_start else t.text for t in doc])

['The', 'train', 'to', 'london', 'leaves', 'at', '10', 'am', 'on', 'tuesday', '.']


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.2 Stop word removal</p>

Stop words are words that **appear commonly** but **carry little information**. Examples include `"a"`, `"the"`, `"of"`, `"an"`, `"this"` and `"that"`. Similar to case folding, removing stop words can **improve efficiency** but comes at the cost of **losing contextual information**.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/2/stop-word-removal.jpg?raw=1' width=600>
</center>

The choice of whether to use stop word removal will depend on the task being performed. For some tasks like **topic modelling** (identifying topics in text), contextual information is not as **important** compared to a task like **sentiment analysis** where the stop word `"not"` can change the sentiment completely.
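The effect of dropping `"not"` can be sketched without any library. The stop word list below is a deliberately tiny, hypothetical one (not spacy's), and `remove_stop_words` is an illustrative helper rather than a library function.

```python
# Hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"the", "was", "a", "this", "not", "is"}
# Words we may want to keep for sentiment-style tasks (an assumed choice)
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lower-cased form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

review = "The film was not good".split()

# Default list: the negation disappears and the sentiment flips
print(remove_stop_words(review, STOP_WORDS))

# Tuned list: negations are kept
print(remove_stop_words(review, STOP_WORDS - NEGATIONS))
```

With the full list the review collapses to "film good". Subtracting the negations from the list keeps the `"not"` that carries the sentiment.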

Also note that different libraries have **different** stop word lists so you might want to **tune** your list depending on the application. Spacy's language model has **over 300 stop words**.

In [None]:
# Print spacy's stop word list
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

{'seems', 'the', 'whatever', 'front', 'hereby', 'nevertheless', 'else', 'somewhere', 'toward', 'keep', 'top', 'get', 'across', 'latter', 'n‘t', '’ve', 'perhaps', 'everyone', 'will', 'somehow', 'even', 'whole', 'together', "'m", 'twenty', 'thereby', 'made', 'each', 'what', 'have', 'of', 'just', 'really', 'must', 'call', 'bottom', 'hers', 'never', 'wherever', 'latterly', 'why', '’re', 'part', 'becomes', 'might', 'along', 'am', 'others', 'formerly', 'fifty', 'thus', 'namely', 'one', 'an', 'towards', 'many', 'upon', 'them', 'my', 'always', 'several', 'unless', 'fifteen', 'whoever', 'herself', 'quite', '‘re', 'nowhere', '’m', 'until', 'whether', 'three', 'via', 'something', 'may', 'ever', 'mine', 'herein', 'more', 'least', 'be', 'own', 'nine', "'d", 'therein', 'already', 'whereafter', 'amount', 'because', 'has', 'every', 'now', 'then', 'regarding', 'sometime', 're', 'side', 'per', 'alone', 'all', 'once', 'out', 'someone', 'thence', 'next', 'through', 'me', 'yours', 'four', 'itself', 'used',

To remove stop words, we use the `.is_stop` attribute.

In [None]:
# Stop word removal
print([t.text for t in doc if not t.is_stop])

['train', 'London', 'leaves', '10', 'Tuesday', '.']


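Depending on the application, you might want to **customize spacy's** stop word list. A word that spacy has already processed may keep its cached `.is_stop` flag, in which case the flag can also be set directly on the lexeme (e.g. `nlp.vocab["whatever"].is_stop = False`). The list itself can be edited as follows.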

In [None]:
nlp.Defaults.stop_words.add("ergo")
nlp.Defaults.stop_words.remove("whatever")

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.3 Stemming</p>

Stemming is the act of **reducing a word to its stem** by **removing suffixes** and sometimes prefixes depending on the language.

For example, the words `"developed"` and `"developing"` both have the stem `"develop"`.

While this technique also reduces the size of the vocabulary, it can result in **invalid words**, for example `"studies"` might be stemmed to `"studi"`. For this reason, stemming is rarely used these days. It turns out there is a **better alternative**, called **lemmatization**, which we'll look at next.
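As a rough illustration of suffix stripping, here is a toy stemmer in plain Python. The rule list and the minimum stem length are assumptions made for this sketch. It is not the Porter algorithm or any library's stemmer, just enough to reproduce the examples above.

```python
# Ordered (suffix, replacement) rules; a toy rule set, NOT the Porter algorithm
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
MIN_STEM_LEN = 3  # hypothetical guard so short words like "is" are left alone

def toy_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["developed", "developing", "studies", "is"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at the characters of the word, nothing stops them from producing a non-word such as `"studi"`.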

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.4 Lemmatization</p>

Lemmatization reduces a word down to its **lemma**, i.e. dictionary form.

While this is similar to stemming, it also takes into account things like **tenses** and **irregular forms**. For example, the words `"did"`, `"done"` and `"doing"` would be converted to the base form `"do"`.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/2/lemmatization.jpg?raw=1' width=600>
</center>
    
It also takes into account whether a word is a **noun**, **verb** or **adjective** when deciding whether to lemmatize. For example, it might not modify some adjectives so as not to change their meaning (`"energetic"` is different to `"energy"`).

Lemmatization is generally preferred to stemming because it is **more accurate and robust** while still offering the same benefit of vocabulary size reduction. It does however remove your ability to distinguish different **tenses**, which may be important for some applications.
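One simple way to picture a lemmatizer is as a dictionary lookup keyed by the word *and* its part of speech. The table below is a small hypothetical one written for this sketch. Real lemmatizers such as spacy's use much larger resources and additional rules.

```python
# Hypothetical lookup table keyed by (lower-cased word, part of speech)
LEMMA_TABLE = {
    ("did", "VERB"): "do",
    ("done", "VERB"): "do",
    ("doing", "VERB"): "do",
    ("was", "AUX"): "be",
    ("fastest", "ADJ"): "fast",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}

def toy_lemmatize(word, pos):
    """Look up the lemma; fall back to the lower-cased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print([toy_lemmatize(w, "VERB") for w in ["did", "done", "doing"]])
print(toy_lemmatize("leaves", "VERB"), toy_lemmatize("leaves", "NOUN"))
print(toy_lemmatize("energetic", "ADJ"))  # not in the table, left unchanged
```

Unlike the stemming rules, every output here is a dictionary word, and the same surface form (`"leaves"`) can map to different lemmas depending on how it is used.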

In [None]:
# Tokenize
s = "She was the fastest swimmer."
doc = nlp(s)

We can view the lemmatization using the `.lemma_` attribute.

In [None]:
# Lemmatization
print([(t.text,t.lemma_) for t in doc])

[('She', 'she'), ('was', 'be'), ('the', 'the'), ('fastest', 'fast'), ('swimmer', 'swimmer'), ('.', '.')]


# 3. Advanced pre-processing

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.1 Part-of-speech tagging</p>

Part-of-speech tagging is the method of **classifying how a word is used in a sentence**, for example, **noun, verb, adjective**.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/2/pos-tagging.jpg?raw=1' width=600>
</center>

This is very helpful because it can help us understand the **intent or action** of an ambiguous word. For example, when we say `"Hand me a hammer."`, the word `"hand"` is a **verb** (doing word) as opposed to `"The hammer is in my hand."` where it is a **noun** (thing) and has a different meaning.

We can access the part-of-speech tags using the `.pos_` attribute.

In [None]:
# Part-of-speech
print([(t.text,t.pos_) for t in doc])

[('She', 'PRON'), ('was', 'AUX'), ('the', 'DET'), ('fastest', 'ADJ'), ('swimmer', 'NOUN'), ('.', 'PUNCT')]


A full description of the tags can be found using `spacy.explain`.

In [None]:
print([(t.pos_,spacy.explain(t.pos_)) for t in doc])

[('PRON', 'pronoun'), ('AUX', 'auxiliary'), ('DET', 'determiner'), ('ADJ', 'adjective'), ('NOUN', 'noun'), ('PUNCT', 'punctuation')]
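
Once tokens carry part-of-speech tags, the tags can drive further pre-processing. The sketch below reuses the `(text, tag)` pairs printed above as plain data, so it runs without a language model. The set of "content" tags is an assumed choice for illustration, not something spacy prescribes.

```python
from collections import Counter

# (text, tag) pairs copied from the part-of-speech output above
tagged = [("She", "PRON"), ("was", "AUX"), ("the", "DET"),
          ("fastest", "ADJ"), ("swimmer", "NOUN"), (".", "PUNCT")]

# Assumed choice of tags that usually carry the content of a sentence
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

# Keep only the content words
content_words = [text for text, tag in tagged if tag in CONTENT_TAGS]
print(content_words)

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts)
```

Filtering by tag is an alternative to a fixed stop word list: pronouns, auxiliaries, determiners and punctuation are dropped because of how they are used, not because they appear in a list.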


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.2 Named Entity Recognition</p>

Named Entity Recognition (NER) is the act of tagging **named entities** in text.

A **named entity** is anything that can be referred to by a **proper name** and usually has the **proper noun** tag. Common examples include people, cities, countries and companies. Note that it is common to extend entities to include money, time, dates, etc.

<br>
<br>
<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/2/ner.png?raw=1' width=600>
</center>
<br>
<br>


NER can help **categorize and organize** a corpus. It is especially useful, for example, in helping **chatbots** raise accurate support tickets depending on the customer problem.

Some of the **challenges** to building a state-of-the-art NER model include **type ambiguity**, where one word can have multiple meanings (e.g. Amazon - river or company?) and the fact that **entities can span multiple tokens** (e.g. John Smith). Luckily, spacy has a very good NER model that we can utilize.

In [None]:
# Tokenize
s = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

There are two ways to do NER in spacy. The **first way** is via the `.ent_type_` attribute.

In [None]:
# Named Entity Recognition
print([(t.text,t.ent_type_) for t in doc])

[('Apple', 'ORG'), ('is', ''), ('looking', ''), ('at', ''), ('buying', ''), ('U.K.', 'GPE'), ('startup', ''), ('for', ''), ('$', 'MONEY'), ('1', 'MONEY'), ('billion', 'MONEY')]


In [None]:
# Only print entities
print([(t.text, t.ent_type_) for t in doc if t.ent_type_])

[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$', 'MONEY'), ('1', 'MONEY'), ('billion', 'MONEY')]


Like before, we can use `spacy.explain` to understand each tag.

In [None]:
# Entity explanation
print('ORG:', spacy.explain('ORG'))
print('GPE:', spacy.explain('GPE'))
print('MONEY:', spacy.explain('MONEY'))

ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states
MONEY: Monetary values, including unit


The **second way** to do NER in spacy is to use the `.ents` attribute.

In [None]:
# Named Entity Recognition
print([(ent.text, ent.label_) for ent in doc.ents])

[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]


Notice this time how `$1 billion` is grouped into **one entity**, whereas before each token was a separate entity.
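
The relationship between the two views can be sketched in plain Python by merging consecutive tokens that share an entity type. This is a simplification made for illustration. spacy itself tracks entity boundaries explicitly, so two adjacent entities of the same type would be wrongly merged here, and joining with spaces loses the original spacing (`"$ 1 billion"` rather than `"$1 billion"`). `group_entities` is a hypothetical helper, not a spacy function.

```python
from itertools import groupby

# (text, entity type) pairs copied from the per-token output above
tagged = [("Apple", "ORG"), ("is", ""), ("looking", ""), ("at", ""),
          ("buying", ""), ("U.K.", "GPE"), ("startup", ""), ("for", ""),
          ("$", "MONEY"), ("1", "MONEY"), ("billion", "MONEY")]

def group_entities(pairs):
    """Merge runs of consecutive tokens that share a non-empty entity type."""
    spans = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label:  # skip runs of tokens that are not part of any entity
            spans.append((" ".join(text for text, _ in run), label))
    return spans

print(group_entities(tagged))
```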

Finally, we can **visualize** the entities using a spacy built-in function.

In [None]:
# Import function
from spacy import displacy

# Visualize entities
displacy.render(doc, style='ent', jupyter=True)

# 4. Conclusion

Whilst we have seen the most common pre-processing techniques in this notebook, there exist many more depending on the application. Some other ideas to keep in mind include converting **emojis to text**, **language detection** in a mixed-language corpus, **spelling correction** and **parsing** the relationships between words.

**References:**
    
* [NLP demystified](https://www.nlpdemystified.org/)

### Coming up
#### [3. Bag Of Words and Similarity](./03_BOW_Similarity.ipynb)

Thanks for reading!