# Text Normalization


## Overview

Text normalization cleans up text by removing unnecessary and irrelevant components. Its goals include:
- reducing the number of unique tokens present in the text
- removing variation in the text
- removing redundant information
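
To make the first goal concrete, here is a minimal sketch that counts unique tokens before and after a very simple normalization step (lowercasing and stripping punctuation). The sample sentence and variable names are made up for illustration.

```python
import re
from collections import Counter

raw = "The Cat sat. the cat SAT! The cats sat?"

# Naive whitespace tokenization keeps case and punctuation variants apart.
raw_tokens = raw.split()

# Lowercase, then drop everything except letters, digits, and whitespace.
norm_tokens = re.sub(r"[^a-z0-9\s]", "", raw.lower()).split()

raw_vocab = Counter(raw_tokens)
norm_vocab = Counter(norm_tokens)

print(len(raw_vocab), len(norm_vocab))  # 8 4
```

Even this crude step halves the vocabulary here; stemming and lemmatization would go one step further and merge `cat` and `cats`.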


In [None]:
import spacy
import unicodedata
import re
from nltk.corpus import wordnet
import collections
from nltk.tokenize.toktok import ToktokTokenizer


## Stemming

- Stemming reduces word forms to a base stem, regardless of their inflections. The stem need not be a valid word.
- `nltk` provides several popular stemmers for English:
    - `nltk.stem.PorterStemmer`
    - `nltk.stem.LancasterStemmer`
    - `nltk.stem.RegexpStemmer`
    - `nltk.stem.SnowballStemmer`

- We can compare the results of different stemmers.

In [None]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, RegexpStemmer, SnowballStemmer

words = ['jumping', 'jumps', 'jumped', 'jumpy']
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

rs = RegexpStemmer('ing$|s$|ed$|y$', min=4)  # min=4: words shorter than 4 characters are left unstemmed


In [None]:
[ps.stem(w) for w in words]

['jump', 'jump', 'jump', 'jumpi']

In [None]:
[ls.stem(w) for w in words]

['jump', 'jump', 'jump', 'jumpy']

In [None]:
[ss.stem(w) for w in words]

['jump', 'jump', 'jump', 'jumpi']

In [None]:
[rs.stem(w) for w in words]

['jump', 'jump', 'jump', 'jump']
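
The regular-expression stemmer is the easiest of the four to reason about. The following is a minimal sketch of the idea using only the standard library; `regexp_stem` and its parameter names are hypothetical and are not part of `nltk`.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|ed$|y$", min_length=4):
    """Strip a suffix matching `pattern`; leave short words untouched."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

words = ["jumping", "jumps", "jumped", "jumpy", "is"]
stems = [regexp_stem(w) for w in words]
print(stems)  # ['jump', 'jump', 'jump', 'jump', 'is']
```

Without the minimum-length check, a short word such as `is` would be reduced to `i`, which is why the threshold matters.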

## Lemmatization


- Lemmatization is similar to stemming.
- It also removes word affixes, but the result is a **root word** (lemma) rather than a **root stem**.
- Lemmas are lexicographically correct words that can be found in a dictionary.

```{admonition} Question
:class: attention
Comparing lemmatization and stemming, which one has the higher computational cost? That is, which is likely to be slower?
```

- Two frequently used lemmatizers:
    - `nltk.stem.WordNetLemmatizer`
    - `spacy`

### WordNet Lemmatizer

- `WordNetLemmatizer` uses the WordNet dictionary.
- It requires the **part of speech** of the word for lemmatization.
- The `pos` argument accepts `'n'` (noun, the default), `'v'` (verb), `'a'` (adjective), `'r'` (adverb), and `'s'` (satellite adjective); the examples here cover nouns, verbs, and adjectives.

In [None]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# nouns
print(wnl.lemmatize('cars','n')) # noun
print(wnl.lemmatize('men', 'n'))

car
men


In [None]:
# verbs
print(wnl.lemmatize('running','v')) # verb
print(wnl.lemmatize('ate', 'v'))

run
eat


In [None]:
# adj
print(wnl.lemmatize('saddest','a')) # adjectives
print(wnl.lemmatize('fancier','a'))
print(wnl.lemmatize('jumpy','a'))

sad
fancy
jumpy
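
Because `WordNetLemmatizer` needs a part of speech, a tagged corpus usually has to be mapped to WordNet's single-letter codes first. The sketch below assumes Penn Treebank style tags (the kind a POS tagger typically emits); the helper name `penn_to_wordnet` and the hand-written tags are hypothetical.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    prefix_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
    # Only the first character of the tag is needed for this coarse mapping.
    return prefix_map.get(tag[:1], default)

# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("cars", "NNS"), ("ate", "VBD"), ("saddest", "JJS"), ("quickly", "RB")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting pair can then be passed on as `wnl.lemmatize(word, pos)`. Falling back to `'n'` for unknown tags mirrors the lemmatizer's own default.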


## Contractions

- In English data, contractions (e.g., *don't*, *I'll*) can be problematic.
- Matters get more complicated because different tokenizers handle contractions differently.
- A common solution is to expand all contractions into their full word forms.

In [None]:
!pip install contractions


In [None]:
import contractions

# Text containing contractions
text = '''I'll be there within 5 min. Shouldn't you be there too?
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.'''

# Expand each whitespace-separated token with contractions.fix
expanded_words = [contractions.fix(word) for word in text.split()]

expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)

Original text: I'll be there within 5 min. Shouldn't you be there too?
          I'd love to see u there my dear. It's awesome to meet new friends.
          We've been waiting for this day for so long.
Expanded_text: I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.


In [None]:
text = '''She'd like to know how I'd done that!
          She's going to the park and I don't think I'll be home for dinner.
          Theyre going to the zoo and she'll be home for dinner.'''
 
contractions.fix(text)

'She would like to know how I would done that!\n          She is going to the park and I do not think I will be home for dinner.\n          They Are going to the zoo and she will be home for dinner.'
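
Under the hood, contraction expansion is essentially a dictionary lookup driven by a regular expression. The following is a minimal standard-library sketch of that idea, not the `contractions` package's actual implementation; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the mapping is deliberately tiny.

```python
import re

# Tiny illustrative mapping; a realistic list would be far longer.
CONTRACTION_MAP = {
    "i'll": "I will",
    "don't": "do not",
    "it's": "it is",
    "we've": "we have",
    "shouldn't": "should not",
}

# One alternation over all keys, longest first, matched on word boundaries.
_keys = sorted(map(re.escape, CONTRACTION_MAP), key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(_keys) + r")\b", flags=re.IGNORECASE)

def expand_contractions(text):
    def _replace(match):
        expanded = CONTRACTION_MAP[match.group(0).lower()]
        # Preserve a leading capital from the original token.
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _pattern.sub(_replace, text)

print(expand_contractions("I'll go. Shouldn't we? It's late and we've no time, don't wait."))
```

A pure lookup cannot resolve ambiguous forms: `'d` may stand for *would* or *had*, which is why the output above contains "I would done that".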

## Accented Characters (Non-ASCII)

- When working with English data, we often encounter characters outside the ASCII character set, such as accented letters.
- The `unicodedata` module gives access to the Unicode Character Database and can normalize such characters. See the [unicodedata documentation](https://docs.python.org/3/library/unicodedata.html) for more details.

In [None]:
import unicodedata

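def remove_accented_chars(text):
    """Strip accents by decomposing characters and dropping non-ASCII bytes.

    NFKD applies the compatibility decomposition, i.e., it replaces all
    compatibility characters with their equivalents, so an accented letter
    becomes a base letter followed by a combining mark.
    """
    # Encoding to ASCII with 'ignore' discards the combining marks.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text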


remove_accented_chars('Sómě Áccěntěd těxt')

# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt'))
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt').encode('ascii','ignore'))
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt').encode('ascii','ignore').decode('utf-8', 'ignore'))

'Some Accented text'

:::{note}
- `str.encode()` returns an encoded version of the string as a bytes object using the specified encoding.
- `bytes.decode()` returns a string decoded from the given bytes using the specified encoding.
:::
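
One caveat of the ASCII round trip is that it silently deletes any character that has no decomposition, such as `ø`. As a possible alternative, the sketch below removes only the combining marks after NFKD normalization; the function names are hypothetical.

```python
import unicodedata

def ascii_only(text):
    """NFKD-normalize, then drop every non-ASCII character."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def strip_combining_marks(text):
    """NFKD-normalize, then drop only the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "Søren ate crème brûlée"
print(ascii_only(sample))             # Sren ate creme brulee
print(strip_combining_marks(sample))  # Søren ate creme brulee
```

Which behavior is preferable depends on the downstream task: the first guarantees pure ASCII, the second keeps letters that merely look unusual.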

## Converting Emojis to Text



In [None]:
!pip install demoji


In [None]:
import demoji
 
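# Fetch the emoji codes. In demoji 1.0+ the codes ship with the package,
# so this call is deprecated and may be unnecessary.
demoji.download_codes()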


In [None]:
text = "i am happy 😁"
demoji.findall(text)

{'😁': 'beaming face with smiling eyes'}
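
Finding emojis is only half of the conversion; the descriptions still have to be written back into the text. A minimal sketch, assuming a mapping shaped like the dictionary shown above (the helper `replace_emojis` and the `:description:` format are illustrative choices):

```python
def replace_emojis(text, emoji_map):
    """Replace each emoji in `text` with its ':description:' form."""
    # Longest keys first, so multi-codepoint emojis are replaced before their parts.
    for emoji in sorted(emoji_map, key=len, reverse=True):
        text = text.replace(emoji, f":{emoji_map[emoji]}:")
    return text

emoji_map = {"😁": "beaming face with smiling eyes"}
converted = replace_emojis("i am happy 😁", emoji_map)
print(converted)  # i am happy :beaming face with smiling eyes:
```

The library also ships its own replacement helpers; check its documentation before rolling your own.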

## Slang Words

In [None]:

slang_text = 'ok, tq dr'
output = slang_text.replace('ok', 'okay')
output = output.replace('tq', 'thank you')
output = output.replace('dr', 'Doctor')


print(output)
# We cannot rely on this approach: chained str.replace calls
# also rewrite matches inside longer words.

okay, thank you Doctor


Alternatively, you can use a dictionary:

In [None]:
dictionary_slang = {
    'ok': 'okay',
    'tq': 'thank you',
    'dr': 'Doctor',
}

dictionary_slang['tq']


'thank you'
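
A dictionary becomes more useful when it is applied to whole tokens rather than raw substrings. The following sketch combines a slang dictionary with a word-boundary regular expression; `slang_map` and `normalize_slang` are hypothetical names chosen for this example.

```python
import re

slang_map = {"ok": "okay", "tq": "thank you", "dr": "Doctor"}

# Match any key, but only as a whole word.
_slang_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, slang_map)) + r")\b", flags=re.IGNORECASE
)

def normalize_slang(text):
    """Replace whole-word slang terms using `slang_map`."""
    return _slang_pattern.sub(lambda m: slang_map[m.group(0).lower()], text)

print(normalize_slang("ok, tq dr"))               # okay, thank you Doctor
print(normalize_slang("the driver took a look"))  # the driver took a look
```

With plain `str.replace`, the second sentence would be mangled because `dr` and `ok` also occur inside `driver`, `took`, and `look`.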

## Translating Mixed-Language Text

In [None]:
!pip install googletrans

In [None]:
import googletrans

In [None]:
from googletrans import Translator

In [None]:
print(googletrans.LANGUAGES)

{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'he': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'lat

In [None]:
translator = Translator()

In [None]:
result = translator.translate("selamat", src="ms", dest="en")

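AttributeError: ignored

:::{note}
The call above raised an `AttributeError` in this session with `googletrans` 3.0.0. This release is widely reported to break when the underlying web API changes. A commonly suggested workaround, not verified here, is to install a newer pre-release such as `googletrans==4.0.0rc1`.
:::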
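
Since translation depends on an external service, a normalization pipeline should not crash when the call fails. Below is a minimal sketch of a defensive wrapper; `translate_or_keep` is a hypothetical helper, and `broken_translate` is a stand-in that simulates the failure seen above.

```python
def translate_or_keep(translate_fn, text, **kwargs):
    """Return the translated text, or the original text if the call fails."""
    try:
        result = translate_fn(text, **kwargs)
    except Exception as exc:  # network errors, API changes, etc.
        print(f"translation failed ({type(exc).__name__}); keeping original")
        return text
    # googletrans returns an object with a .text attribute; plain strings pass through.
    return getattr(result, "text", result)

def broken_translate(text, **kwargs):
    # Stand-in for a translator whose backend is unavailable.
    raise AttributeError("simulated failure")

kept = translate_or_keep(broken_translate, "selamat", src="ms", dest="en")
print(kept)
```

With a working translator, the same wrapper could be called as `translate_or_keep(translator.translate, "selamat", src="ms", dest="en")`.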