<h2 style="color:blue"> Text Mining </h2>

<br>
Text Mining is the process of deriving meaningful information <br> from natural language text.

The overall goal is essentially to turn text into data for <br>
analysis by applying Natural Language Processing (NLP).

NLP is a component of text mining that performs a special kind of <br> linguistic analysis that essentially helps a machine "read" text.


<h4> What is NLTK? </h4>

NLTK stands for Natural Language Toolkit. It is one of <br> the most widely used Python NLP libraries and contains packages <br>
that help machines process human language.

pip install nltk

<h4> Applications of NLP </h4>

- Sentiment Analysis
- Speech Recognition
- Chatbot
- Machine Translation
- Spell Checking
- Keyword Search
- Advertisement matching

<h4> Tutorials </h4>

- Tokenization
- Stop Words
- Stemming
- POS (part-of-speech) tagging





In [1]:
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus

In [2]:
print(os.listdir(nltk.data.find('corpora')))

['gutenberg', 'gutenberg.zip', 'movie_reviews', 'movie_reviews.zip', 'stopwords', 'stopwords.zip', 'wordnet', 'wordnet.zip', 'words', 'words.zip']


**Corpus**: a collection of written texts, e.g. medical journals or parliamentary debates.

**Downloading a corpus**

nltk.download(corpus_name)

In [3]:
# nltk.download('punkt')  # downloads the Punkt models used for tokenization
# nltk.download('gutenberg')

In [11]:
# nltk.corpus.gutenberg.fileids()  # list every file ID in the Gutenberg corpus


In [16]:
# emma = nltk.corpus.gutenberg.words('austen-emma.txt')
# emma

In [15]:
# for word in emma[:500]:
#     print(word, end=" ")

<h3 style="color:blue"> Tokenization </h3>

- Break a complex sentence into words.
- Understand the importance of each word with respect to the sentence.
- Produce a structural description of an input sentence.

In [17]:
text = "In Brazil they drive on the right-hand side of the road. Brazil has a large coastline on the eastern side of South America"

In [26]:
from nltk.tokenize import word_tokenize, sent_tokenize

token = word_tokenize(text)
print(token)

['In', 'Brazil', 'they', 'drive', 'on', 'the', 'right-hand', 'side', 'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline', 'on', 'the', 'eastern', 'side', 'of', 'South', 'America']
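
To make the idea of tokenization concrete without any downloads, here is a minimal standard-library sketch of word and sentence splitting. The regular expressions and the names `simple_word_tokenize` and `simple_sent_tokenize` are illustrative assumptions, not NLTK's implementation; `word_tokenize` and `sent_tokenize` handle many more cases (abbreviations, contractions, quotes).

```python
import re

text = ("In Brazil they drive on the right-hand side of the road. "
        "Brazil has a large coastline on the eastern side of South America")

def simple_word_tokenize(s):
    # A token is a word (optionally hyphenated) or a single punctuation mark.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", s)

def simple_sent_tokenize(s):
    # Split after '.', '!' or '?' when it is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", s.strip())

words = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)

print(len(words), words[:8])
print(len(sentences), sentences)
```

Word tokens are what the frequency counts and n-grams in the following cells operate on; sentence tokens are the coarser unit.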


In [19]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [20]:
for word in token:
    fdist[word.lower()] += 1
fdist

FreqDist({'the': 3, 'brazil': 2, 'on': 2, 'side': 2, 'of': 2, 'in': 1, 'they': 1, 'drive': 1, 'right-hand': 1, 'road': 1, ...})

In [21]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[('the', 3),
 ('brazil', 2),
 ('on', 2),
 ('side', 2),
 ('of', 2),
 ('in', 1),
 ('they', 1),
 ('drive', 1),
 ('right-hand', 1),
 ('road', 1)]
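
`FreqDist` is a subclass of Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The token list below is copied by hand from the sentence used above; the variable names are illustrative.

```python
from collections import Counter

tokens = ["In", "Brazil", "they", "drive", "on", "the", "right-hand", "side",
          "of", "the", "road", ".", "Brazil", "has", "a", "large", "coastline",
          "on", "the", "eastern", "side", "of", "South", "America"]

# Lowercase before counting so that "In" and "in" share one entry.
counts = Counter(tok.lower() for tok in tokens)

top5 = counts.most_common(5)
print(top5)
```

Ties are returned in the order the words were first seen, which is why `brazil` precedes `on` in the result.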

In [22]:
# Count the paragraphs (blocks of text separated by blank lines)
from nltk.tokenize import blankline_tokenize
b_token = blankline_tokenize(text)
len(b_token)

1
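
The count is 1 because the sample text contains no blank lines. As a rough sketch of what blank-line splitting does, the snippet below uses only the `re` module on a made-up three-paragraph string; `split_paragraphs` and `doc` are hypothetical names, and the pattern is an assumption chosen to treat two newlines (with optional surrounding whitespace) as a paragraph break.

```python
import re

doc = ("First paragraph.\n\n"
       "Second paragraph,\nstill the second.\n\n\n"
       "Third paragraph.")

def split_paragraphs(s):
    # A paragraph break is two newlines, possibly padded with whitespace.
    parts = re.split(r"\s*\n\s*\n\s*", s)
    # Drop empty strings produced by leading or trailing breaks.
    return [p for p in parts if p]

paragraphs = split_paragraphs(doc)
print(len(paragraphs))
```

A single newline inside the second paragraph does not start a new paragraph; only a blank line does.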


- **Bigrams** - tokens made of two consecutive words
- **Trigrams** - tokens made of three consecutive words
- **N-grams** - tokens made of any number (n) of consecutive words

In [23]:
from nltk.util import bigrams, trigrams, ngrams

In [24]:
token_bigrams = list(bigrams(token))
token_bigrams

[('In', 'Brazil'),
 ('Brazil', 'they'),
 ('they', 'drive'),
 ('drive', 'on'),
 ('on', 'the'),
 ('the', 'right-hand'),
 ('right-hand', 'side'),
 ('side', 'of'),
 ('of', 'the'),
 ('the', 'road'),
 ('road', '.'),
 ('.', 'Brazil'),
 ('Brazil', 'has'),
 ('has', 'a'),
 ('a', 'large'),
 ('large', 'coastline'),
 ('coastline', 'on'),
 ('on', 'the'),
 ('the', 'eastern'),
 ('eastern', 'side'),
 ('side', 'of'),
 ('of', 'South'),
 ('South', 'America')]
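
The same sliding-window idea generalizes to trigrams and n-grams. The following is a minimal pure-Python sketch; `make_ngrams` is a hypothetical helper written for illustration and is not part of `nltk.util`.

```python
def make_ngrams(tokens, n):
    # Zip together n staggered views of the token list:
    # tokens[0:], tokens[1:], ..., tokens[n-1:].
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["In", "Brazil", "they", "drive", "on", "the", "right-hand", "side"]

print(make_ngrams(tokens, 2)[:3])  # first three bigrams
print(make_ngrams(tokens, 3)[:2])  # first two trigrams
```

A list of k tokens yields k - n + 1 n-grams, and none at all when k is smaller than n.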

<h3 style="color:blue"> Stop Words </h3> <br>
Stopwords are the most common words in any natural language. For the purpose <br>
of analyzing text data and building NLP models, these stopwords might not add <br>
much value to the meaning of the document.

In [27]:
from nltk.corpus import stopwords

In [28]:
stop_words = stopwords.words('english')

In [30]:
# filtered = [t for t in token if t.lower() not in stop_words]  # lowercase first: the stop list is all lowercase
# print(filtered)
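
The filtering step can be sketched without downloading the stopwords corpus. The stop list below is a small hand-picked set used only for illustration (an assumption, not NLTK's English list), and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical, hand-picked stop list for illustration only.
SAMPLE_STOP_WORDS = {"in", "they", "on", "the", "of", "has", "a"}

tokens = ["In", "Brazil", "they", "drive", "on", "the",
          "right-hand", "side", "of", "the", "road", "."]

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so that capitalized words such as "In" are caught.
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

Punctuation such as `.` survives this filter; removing it would need a separate check.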

<h3 style="color:blue"> Stemming </h3> <br>
Stemming normalizes words to their base or root form.

Example: the root word *affect* underlies affection, affects, affectation, affected, and affecting.
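
As a rough illustration of the idea, the toy function below strips a handful of suffixes. It is a deliberately naive sketch: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made for this example, and it is not the Porter, Lancaster, or Snowball algorithm.

```python
# Suffixes are checked in this order, so longer ones come first.
SUFFIXES = ("ation", "ion", "ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["affection", "affects", "affectation", "affected", "affecting"]:
    print(w, "->", toy_stem(w))
```

Real stemmers apply many ordered rewrite rules with extra conditions, which is why their results differ from simple suffix stripping.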

In [36]:
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

In [33]:
pst = PorterStemmer()

In [34]:
pst.stem('having')

'have'

In [37]:
# Plural and inflected word forms
words = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization','sensational', 'traditional', 'reference', 'colonizer','plotted']

# for word in words:
#     print(word, ":", pst.stem(word))

In [42]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
word_list = ["friend", "friendship", "friends", "friendships", "stabil",
             "destabilize", "misunderstanding", "railroad", "moonlight", "football"]
print("{0:20}{1:20}{2:20}".format("Word", "Porter Stemmer", "lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word, porter.stem(word), lancaster.stem(word)))

Word                Porter Stemmer      lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             


<h3 style="color:blue"> POS tagging - Part of Speech </h3>

The process of classifying words into their parts of speech and labeling <br> them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.

**List of POS tags**: https://www.sketchengine.eu/penn-treebank-tagset/


In [43]:
tokens = word_tokenize("And now for something completely different")
nltk.pos_tag(tokens)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
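
The tags are Penn Treebank abbreviations. The sketch below pairs each tag from the output above with a short description; the `TAG_DESCRIPTIONS` table covers only the five tags seen here, and `describe` is a hypothetical helper, not an NLTK function.

```python
# Descriptions for the Penn Treebank tags that appear in the output above.
TAG_DESCRIPTIONS = {
    "CC": "coordinating conjunction",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "NN": "noun, singular or mass",
    "JJ": "adjective",
}

# Tagged tokens copied from the pos_tag output above.
tagged = [("And", "CC"), ("now", "RB"), ("for", "IN"),
          ("something", "NN"), ("completely", "RB"), ("different", "JJ")]

def describe(tagged_tokens):
    # Attach a readable description to each (word, tag) pair.
    return [(w, t, TAG_DESCRIPTIONS.get(t, "unknown tag")) for w, t in tagged_tokens]

for word, tag, meaning in describe(tagged):
    print(f"{word:12}{tag:4}{meaning}")
```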