Text Mining is the process of deriving meaningful information from natural language text.

The overall goal is essentially to turn text into data for
analysis by applying Natural Language Processing (NLP).

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence that deals with human languages.

NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine “read” text.

Applications of NLP
- Sentiment Analysis
- Speech Recognition
- Chatbot
- Machine Translation
- Spell Checking
- Keyword Search
- Advertising matching

NLP has two parts: 1) Natural Language Understanding, 2) Natural Language Generation.


### Tutorials

- Tokenization
- Stop Words
- Stemming
- POS
- Chunking
- Chinking
- Named Entity Recognition
- Lemmatization
- Corpora
- Wordnet
- Text Classification


In [1]:
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus

In [10]:
# print(os.listdir(nltk.data.find('corpora')))

In [17]:
# print(nltk.corpus.gutenberg.fileids())

In [6]:
# nltk.download('punkt')  # downloads the Punkt tokenizer models

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maury\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [14]:
# nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\maury\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

In [19]:
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
hamlet

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

In [20]:
for word in hamlet[:500]:
    print(word, end=" ")

[ The Tragedie of Hamlet by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Barnardo and Francisco two Centinels . Barnardo . Who ' s there ? Fran . Nay answer me : Stand & vnfold your selfe Bar . Long liue the King Fran . Barnardo ? Bar . He Fran . You come most carefully vpon your houre Bar . ' Tis now strook twelue , get thee to bed Francisco Fran . For this releefe much thankes : ' Tis bitter cold , And I am sicke at heart Barn . Haue you had quiet Guard ? Fran . Not a Mouse stirring Barn . Well , goodnight . If you do meet Horatio and Marcellus , the Riuals of my Watch , bid them make hast . Enter Horatio and Marcellus . Fran . I thinke I heare them . Stand : who ' s there ? Hor . Friends to this ground Mar . And Leige - men to the Dane Fran . Giue you good night Mar . O farwel honest Soldier , who hath relieu ' d you ? Fra . Barnardo ha ' s my place : giue you goodnight . Exit Fran . Mar . Holla Barnardo Bar . Say , what is Horatio there ? Hor . A peece of him Bar 

## Tokenization

- Break a complex sentence into words
- Understand the importance of each of the words with respect to the sentence.
- Produce a structural description of an input sentence.
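
As a rough illustration of the first bullet, the sketch below splits a sentence into word and punctuation tokens with NLTK's `RegexpTokenizer`, which needs no downloaded models. The pattern is an assumption chosen for this example (it keeps hyphenated words such as `right-hand` together); it is not the rule set that `nltk.word_tokenize` applies.

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern (an assumption, not word_tokenize's rules):
#   \w+(?:-\w+)*  -> a word, optionally followed by hyphenated parts
#   [^\w\s]       -> any single punctuation character
tokenizer = RegexpTokenizer(r"\w+(?:-\w+)*|[^\w\s]")

sentence = "In Brazil they drive on the right-hand side of the road."
tokens = tokenizer.tokenize(sentence)

print(tokens)
print(len(tokens))
```

A hand-written pattern like this is easy to reason about, but it will probably mishandle contractions and abbreviations, which is one reason to prefer `word_tokenize` for real text.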

In [2]:
text = "In Brazil they drive on the right-hand side of the road. Brazil has a large coastline on the eastern side of South America"

In [3]:
token = nltk.word_tokenize(text)
print(token)

['In', 'Brazil', 'they', 'drive', 'on', 'the', 'right-hand', 'side', 'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline', 'on', 'the', 'eastern', 'side', 'of', 'South', 'America']


In [4]:
len(token)

24

In [5]:
from nltk.probability import FreqDist
fdist = FreqDist()


In [6]:
for word in token:
    fdist[word.lower()] += 1
fdist

FreqDist({'the': 3, 'brazil': 2, 'on': 2, 'side': 2, 'of': 2, 'in': 1, 'they': 1, 'drive': 1, 'right-hand': 1, 'road': 1, ...})

In [7]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[('the', 3),
 ('brazil', 2),
 ('on', 2),
 ('side', 2),
 ('of', 2),
 ('in', 1),
 ('they', 1),
 ('drive', 1),
 ('right-hand', 1),
 ('road', 1)]
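
The counting loop above can also be written in one step, because `FreqDist` accepts an iterable of samples. The sketch below repeats the token list from the earlier output so that it runs on its own, and shows a few accessors (`N`, `freq`, `most_common`).

```python
from nltk.probability import FreqDist

# Tokens copied from the word_tokenize output shown earlier
tokens = ['In', 'Brazil', 'they', 'drive', 'on', 'the', 'right-hand', 'side',
          'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline',
          'on', 'the', 'eastern', 'side', 'of', 'South', 'America']

# Passing an iterable counts every item without an explicit loop
fdist = FreqDist(word.lower() for word in tokens)

print(fdist.N())             # total number of tokens counted
print(fdist['the'])          # raw count of a single token
print(fdist.freq('the'))     # relative frequency: count / total
print(fdist.most_common(3))  # the three most frequent tokens
```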

In [8]:
# Count the number of paragraphs (blocks separated by blank lines)
from nltk.tokenize import blankline_tokenize
b_token = blankline_tokenize(text)
len(b_token)

1
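
The result is 1 because `text` contains no blank lines. A minimal sketch with a made-up three-paragraph string (the string is an assumption for illustration) shows how the count changes:

```python
from nltk.tokenize import blankline_tokenize

# Hypothetical text: three paragraphs separated by blank lines
doc = (
    "Brazil is in South America.\n\n"
    "It has a long coastline.\n\n"
    "People drive on the right."
)

paragraphs = blankline_tokenize(doc)
print(len(paragraphs))
for paragraph in paragraphs:
    print(repr(paragraph))
```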


- Bigrams - tokens of two consecutive written words are known as bigrams
- Trigrams - tokens of three consecutive written words are known as trigrams
- Ngrams - tokens of any number of consecutive written words are known as n-grams

In [9]:
from nltk.util import bigrams, trigrams, ngrams

In [10]:
from nltk.tokenize import word_tokenize

In [11]:
token_bigrams = list(bigrams(token))

In [12]:
token_bigrams

[('In', 'Brazil'),
 ('Brazil', 'they'),
 ('they', 'drive'),
 ('drive', 'on'),
 ('on', 'the'),
 ('the', 'right-hand'),
 ('right-hand', 'side'),
 ('side', 'of'),
 ('of', 'the'),
 ('the', 'road'),
 ('road', '.'),
 ('.', 'Brazil'),
 ('Brazil', 'has'),
 ('has', 'a'),
 ('a', 'large'),
 ('large', 'coastline'),
 ('coastline', 'on'),
 ('on', 'the'),
 ('the', 'eastern'),
 ('eastern', 'side'),
 ('side', 'of'),
 ('of', 'South'),
 ('South', 'America')]
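
`trigrams` and `ngrams` are imported above and work the same way as `bigrams`. A short sketch on a five-token list (copied from the second sentence of `text`) shows both; a sequence of n tokens yields n - k + 1 k-grams.

```python
from nltk.util import ngrams, trigrams

tokens = ['Brazil', 'has', 'a', 'large', 'coastline']

# Three consecutive tokens per tuple
token_trigrams = list(trigrams(tokens))
print(token_trigrams)

# Any window size: here four consecutive tokens per tuple
token_4grams = list(ngrams(tokens, 4))
print(token_4grams)
```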

### Stemming
Normalizes words to their base or root form.

Affect (root word): affection, affects, affectation, affected, affecting

In [13]:
from nltk.stem import PorterStemmer

In [14]:
pst = PorterStemmer()

In [15]:
pst.stem('having')

'have'

In [21]:
words_to_stem = ['give','gave','given','giving']

for word in words_to_stem:
    print(word, ":", pst.stem(word))

give : give
gave : gave
given : given
giving : give


In [17]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()

In [18]:
for word in words_to_stem:
    print(word, ":", lst.stem(word))

give : giv
gave : gav
given : giv
giving : giv
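
Both stemmers are rule-based suffix strippers, which is why an irregular form such as "gave" is not reduced to "give". The sketch below runs the two stemmers on the "affect" family listed earlier so their behavior can be compared side by side; the variable names are chosen for this example only.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

affect_family = ['affection', 'affects', 'affectation', 'affected', 'affecting']

# Collect the stems so the two algorithms can be compared word by word
porter_stems = {word: porter.stem(word) for word in affect_family}
lancaster_stems = {word: lancaster.stem(word) for word in affect_family}

for word in affect_family:
    print(f"{word:12} porter={porter_stems[word]:10} lancaster={lancaster_stems[word]}")
```

Lancaster is generally the more aggressive of the two, as the "giv" and "gav" stems above suggest, so its output is less likely to be a dictionary word.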


### Lemmatization
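Groups together the different inflected forms of a word under a single base form, called the lemma.
It is somewhat similar to stemming, as it maps several words to one common root.

- The output of lemmatization is a proper word
- A lemmatizer should map gone, going and went to go
- `WordNetLemmatizer.lemmatize` treats every word as a noun by default (`pos='n'`); pass `pos='v'` to lemmatize verbs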

In [22]:
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()

In [26]:
for word in words_to_stem:
    print(word, ":", word_lem.lemmatize(word))

give : give
gave : gave
given : given
giving : giving


In [25]:
# import nltk
# nltk.download('wordnet')

## Stop Words

- Stop words are very common words (such as "the", "is" and "of") that carry little meaning on their own, so they are usually filtered out before text analysis.

In [22]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [25]:
example_sent = "this is example showing off stop word filteration"

words = word_tokenize(example_sent)

In [29]:
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
filtered = [w for w in words if w not in stop_words]
print(filtered)

['example', 'showing', 'stop', 'word', 'filteration']
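
The NLTK stop-word list is all lowercase, so a capitalized word such as "This" would pass through the filter above. A minimal, dependency-free sketch of a case-insensitive filter follows; `STOP_WORDS` is a small hand-picked subset used only for illustration, not the full NLTK list.

```python
# Small illustrative subset of English stop words (an assumption for this
# sketch; the full list comes from stopwords.words('english'))
STOP_WORDS = {'this', 'is', 'an', 'off', 'the', 'of'}

sentence = "This is an example showing off stop word filtering"
words = sentence.split()

# Lowercase each word before the membership test so 'This' is also removed
filtered = [w for w in words if w.lower() not in STOP_WORDS]
print(filtered)
```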


In [26]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [31]:
## Parts of speech

In [34]:
# token

In [38]:
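# Each token is tagged in isolation here, so the tagger sees no sentence
# context (e.g. 'drive' is tagged NN); nltk.pos_tag(token) tags the whole list.
for word in token:
    print(nltk.pos_tag([word]))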

[('In', 'IN')]
[('Brazil', 'NNP')]
[('they', 'PRP')]
[('drive', 'NN')]
[('on', 'IN')]
[('the', 'DT')]
[('right-hand', 'NN')]
[('side', 'NN')]
[('of', 'IN')]
[('the', 'DT')]
[('road', 'NN')]
[('.', '.')]
[('Brazil', 'NNP')]
[('has', 'VBZ')]
[('a', 'DT')]
[('large', 'JJ')]
[('coastline', 'NN')]
[('on', 'IN')]
[('the', 'DT')]
[('eastern', 'JJ')]
[('side', 'NN')]
[('of', 'IN')]
[('South', 'NNP')]
[('America', 'NNP')]


In [37]:
# nltk.download('averaged_perceptron_tagger')

In [39]:
# Named Entity Recognition (NER)
# Google's (organization) CEO Sundar Pichai (person) introduced the new Pixel at Minnesota (location)

In [40]:
from nltk import ne_chunk

In [41]:
sent = "The US president stays in the White House"

In [3]:
from nltk.tokenize import word_tokenize

In [43]:
toks = word_tokenize(sent)
tags = nltk.pos_tag(toks)

In [50]:
print(ne_chunk(tags))

(S
  The/DT
  (GSP US/NNP)
  president/NN
  stays/NNS
  in/IN
  the/DT
  (FACILITY White/NNP House/NNP))


In [48]:
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\maury\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [2]:
# Syntax -> the principles, rules and processes that govern sentence structure
# Syntax tree -> a tree representation of the syntactic
# structure of a sentence or string

In [None]:
## Chunking -> picking up individual pieces of information
## and grouping them into bigger pieces

In [4]:
sen = 'the big cat ate the little mouse who was after cheese'

cat_tokens = nltk.pos_tag(word_tokenize(sen))
cat_tokens

[('the', 'DT'),
 ('big', 'JJ'),
 ('cat', 'NN'),
 ('ate', 'VBD'),
 ('the', 'DT'),
 ('little', 'JJ'),
 ('mouse', 'NN'),
 ('who', 'WP'),
 ('was', 'VBD'),
 ('after', 'IN'),
 ('cheese', 'NN')]
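
With the tagged tokens in hand, chunking and chinking can be sketched with `nltk.RegexpParser`, which works on tag patterns and needs no extra downloads. The grammars below are assumptions written for this sentence, not a general noun-phrase grammar, and `noun_phrases` is a helper defined only for this example.

```python
import nltk

# POS-tagged tokens copied from the output above
cat_tokens = [('the', 'DT'), ('big', 'JJ'), ('cat', 'NN'), ('ate', 'VBD'),
              ('the', 'DT'), ('little', 'JJ'), ('mouse', 'NN'), ('who', 'WP'),
              ('was', 'VBD'), ('after', 'IN'), ('cheese', 'NN')]

# Chunking: an optional determiner, any number of adjectives, then a noun
chunk_parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
chunked = chunk_parser.parse(cat_tokens)
print(chunked)

# Chinking: chunk everything first, then carve out the tags we do not want
chink_grammar = r"""
NP:
    {<.*>+}            # chunk every token
    }<VBD|IN|WP>+{     # chink verbs, prepositions and wh-pronouns
"""
chinked = nltk.RegexpParser(chink_grammar).parse(cat_tokens)
print(chinked)


def noun_phrases(tree):
    """Return the text of every NP subtree (helper for this example only)."""
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == 'NP']


print(noun_phrases(chunked))
print(noun_phrases(chinked))
```

Both grammars should pick out the same three noun phrases here; chinking tends to be more convenient when it is easier to say what does not belong in a chunk than what does.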

In [7]:
import pandas as pd
import numpy as np
import os
import nltk

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
print(os.listdir(nltk.data.find('corpora')))

['gutenberg', 'gutenberg.zip', 'stopwords', 'stopwords.zip', 'wordnet', 'wordnet.zip', 'words', 'words.zip']


In [11]:
from nltk.corpus import movie_reviews

In [14]:
# nltk.download('movie_reviews')

print(movie_reviews.categories())

['neg', 'pos']


In [16]:
len(movie_reviews.fileids('pos'))

1000

In [17]:
len(movie_reviews.fileids('neg'))

1000

In [18]:
neg = movie_reviews.fileids('neg')
print(neg)

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt', 'neg/cv010_29063.txt', 'neg/cv011_13044.txt', 'neg/cv012_29411.txt', 'neg/cv013_10494.txt', 'neg/cv014_15600.txt', 'neg/cv015_29356.txt', 'neg/cv016_4348.txt', 'neg/cv017_23487.txt', 'neg/cv018_21672.txt', 'neg/cv019_16117.txt', 'neg/cv020_9234.txt', 'neg/cv021_17313.txt', 'neg/cv022_14227.txt', 'neg/cv023_13847.txt', 'neg/cv024_7033.txt', 'neg/cv025_29825.txt', 'neg/cv026_29229.txt', 'neg/cv027_26270.txt', 'neg/cv028_26964.txt', 'neg/cv029_19943.txt', 'neg/cv030_22893.txt', 'neg/cv031_19540.txt', 'neg/cv032_23718.txt', 'neg/cv033_25680.txt', 'neg/cv034_29446.txt', 'neg/cv035_3343.txt', 'neg/cv036_18385.txt', 'neg/cv037_19798.txt', 'neg/cv038_9781.txt', 'neg/cv039_5963.txt', 'neg/cv040_8829.txt', 'neg/cv041_22364.txt', 'neg/cv042_11927.txt', 'neg/cv043_16808.t

In [19]:
rev = movie_reviews.words('neg/cv291_26844.txt')

In [20]:
print(rev)

['movies', 'can', 'do', 'the', 'two', 'big', 'es', ...]
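
`CountVectorizer`, imported above, turns documents into bag-of-words count vectors, which is a common input format for a text classifier. The sketch below uses two made-up review snippets (they are not taken from `movie_reviews`) to show what the vocabulary and count matrix look like.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, for illustration only
docs = ["a great and moving film", "a boring and predictable film"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = X.toarray().tolist()

print(vocab)
print(counts)
```

With the default settings, single-character tokens such as "a" are dropped, so they do not appear in the vocabulary.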
