# Text processing with NLTK

- Tokenization - splitting text into a list of words or a list of sentences
- Morphological analysis - reducing a word to its root form
    - stemming
    - lemmatization
- PoS (Part of Speech) Tagging

In [2]:
import nltk

In [3]:
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
nltk.download("tagsets")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\idrus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\idrus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\idrus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\idrus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets.zip.


True

In [4]:
data = "Jakarta (/dʒəˈkɑːrtə/; Indonesian pronunciation: [dʒaˈkarta] (About this soundlisten)), officially the Special Capital Region of Jakarta (Indonesian: Daerah Khusus Ibukota Jakarta), is the capital and largest city of Indonesia. On the northwest coast of the world's most-populous island of Java, it is the centre of economy, culture and politics of Indonesia with a population of 10,770,487 in the city as of 2020.[6] Although Jakarta only covers 699.5 square kilometres (270.1 sq mi), the smallest among any Indonesian provinces, its metropolitan area covers 6,392 square kilometres (2,468 sq mi), and is the world's second-most populous urban area after Tokyo, with a population of about 35.934 million as of 2020"
print(data)

Jakarta (/dʒəˈkɑːrtə/; Indonesian pronunciation: [dʒaˈkarta] (About this soundlisten)), officially the Special Capital Region of Jakarta (Indonesian: Daerah Khusus Ibukota Jakarta), is the capital and largest city of Indonesia. On the northwest coast of the world's most-populous island of Java, it is the centre of economy, culture and politics of Indonesia with a population of 10,770,487 in the city as of 2020.[6] Although Jakarta only covers 699.5 square kilometres (270.1 sq mi), the smallest among any Indonesian provinces, its metropolitan area covers 6,392 square kilometres (2,468 sq mi), and is the world's second-most populous urban area after Tokyo, with a population of about 35.934 million as of 2020


In [5]:
# sentence tokenization
nltk.sent_tokenize(data)

['Jakarta (/dʒəˈkɑːrtə/; Indonesian pronunciation: [dʒaˈkarta] (About this soundlisten)), officially the Special Capital Region of Jakarta (Indonesian: Daerah Khusus Ibukota Jakarta), is the capital and largest city of Indonesia.',
 "On the northwest coast of the world's most-populous island of Java, it is the centre of economy, culture and politics of Indonesia with a population of 10,770,487 in the city as of 2020.",
 "[6] Although Jakarta only covers 699.5 square kilometres (270.1 sq mi), the smallest among any Indonesian provinces, its metropolitan area covers 6,392 square kilometres (2,468 sq mi), and is the world's second-most populous urban area after Tokyo, with a population of about 35.934 million as of 2020"]

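The third sentence in the output above starts with the citation marker `[6]`, which was carried over from the source text. A minimal sketch of one way to remove such markers before tokenizing, using only the standard library. The helper name `strip_citation_markers` is hypothetical and is not part of NLTK; the pattern only matches purely numeric markers, so bracketed text such as `[dʒaˈkarta]` is left alone.

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric citation markers such as [6] or [12]."""
    cleaned = re.sub(r"\[\d+\]", "", text)
    # Collapse any run of whitespace left behind into a single space.
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "in the city as of 2020.[6] Although Jakarta only covers 699.5 square kilometres"
print(strip_citation_markers(sample))
```

The cleaned string can then be passed to `nltk.sent_tokenize` in the same way as `data`.
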
In [6]:
# word tokenization
nltk.word_tokenize(data)

['Jakarta',
 '(',
 '/dʒəˈkɑːrtə/',
 ';',
 'Indonesian',
 'pronunciation',
 ':',
 '[',
 'dʒaˈkarta',
 ']',
 '(',
 'About',
 'this',
 'soundlisten',
 ')',
 ')',
 ',',
 'officially',
 'the',
 'Special',
 'Capital',
 'Region',
 'of',
 'Jakarta',
 '(',
 'Indonesian',
 ':',
 'Daerah',
 'Khusus',
 'Ibukota',
 'Jakarta',
 ')',
 ',',
 'is',
 'the',
 'capital',
 'and',
 'largest',
 'city',
 'of',
 'Indonesia',
 '.',
 'On',
 'the',
 'northwest',
 'coast',
 'of',
 'the',
 'world',
 "'s",
 'most-populous',
 'island',
 'of',
 'Java',
 ',',
 'it',
 'is',
 'the',
 'centre',
 'of',
 'economy',
 ',',
 'culture',
 'and',
 'politics',
 'of',
 'Indonesia',
 'with',
 'a',
 'population',
 'of',
 '10,770,487',
 'in',
 'the',
 'city',
 'as',
 'of',
 '2020',
 '.',
 '[',
 '6',
 ']',
 'Although',
 'Jakarta',
 'only',
 'covers',
 '699.5',
 'square',
 'kilometres',
 '(',
 '270.1',
 'sq',
 'mi',
 ')',
 ',',
 'the',
 'smallest',
 'among',
 'any',
 'Indonesian',
 'provinces',
 ',',
 'its',
 'metropolitan',
 'area',
 ...]
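
The word tokenizer returns punctuation marks such as `(`, `)` and `,` as separate tokens. If only words and numbers are needed, they can be filtered out afterwards. The sketch below assumes a token counts as punctuation when it contains no letter or digit; `drop_punctuation` is a hypothetical helper, and the token list is a short slice copied from the output above.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# A slice of the word_tokenize output shown above.
tokens = ['699.5', 'square', 'kilometres', '(', '270.1', 'sq', 'mi', ')', ',', 'the', 'smallest']
print(drop_punctuation(tokens))
```

Numbers such as `'699.5'` are kept because they contain digits, even though they also contain a period.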

## Morphological analysis
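- converting a word into its root form
    - cars -> car
    - wives -> wife
    - went -> go
- stemming - strips suffixes by rule; faster, but less accurate (the result may not be a real word)
- lemmatization - looks the word up in a dictionary (WordNet here); slower, but more accurate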
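
To make the trade-off concrete, here is a toy sketch of the two approaches. It is not the Porter algorithm and does not use WordNet: `toy_stem` applies three made-up suffix rules, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. All names and rules in it are assumptions chosen for illustration.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix; no dictionary is consulted."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

# A hand-written stand-in for a real lexical database.
TOY_LEMMAS = {"cars": "car", "wives": "wife", "went": "go", "children": "child"}

def toy_lemmatize(word):
    """Look the word up; return it unchanged if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["cars", "wives", "went", "children"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based function is cheap but cannot handle irregular forms such as "went", while the lookup handles them only as far as its dictionary reaches.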

In [8]:
# stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("cars")

'car'

In [9]:
ps.stem("boxes")

'box'

In [10]:
ps.stem("wives")

'wive'

In [11]:
# Lemmatization
from nltk.stem import WordNetLemmatizer
wd = WordNetLemmatizer()
wd.lemmatize("wives")

'wife'

In [12]:
wd.lemmatize("children")

'child'

In [13]:
wd.lemmatize("went", "v")  # second argument is the part of speech; "v" = verb

'go'

## PoS Tagging

In [14]:
data = "I love python programming How about you?"
nltk.pos_tag(nltk.word_tokenize(data))

[('I', 'PRP'),
 ('love', 'VBP'),
 ('python', 'RB'),
 ('programming', 'VBG'),
 ('How', 'WRB'),
 ('about', 'IN'),
 ('you', 'PRP'),
 ('?', '.')]

In [15]:
nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
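
The tags returned by `nltk.pos_tag` are Penn Treebank tags, while the second argument of `WordNetLemmatizer.lemmatize` expects a WordNet part-of-speech letter such as `"v"`. A common way to connect the two is a small mapping function. The sketch below is one possible mapping, not an NLTK API; `penn_to_wordnet` is a hypothetical helper name, and the tagged pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ...: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # everything else falls back to noun

# Pairs taken from the pos_tag output shown above.
tagged = [('I', 'PRP'), ('love', 'VBP'), ('python', 'RB'), ('programming', 'VBG')]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The returned letter can be passed as the second argument to `wd.lemmatize`, in the same way `"v"` was passed for "went" earlier.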
