<h2 style="color:blue"> Text Mining </h2>

<br>
Text Mining is the process of deriving meaningful information <br> from natural language text.

The overall goal is essentially to turn text into data for <br>
analysis by applying Natural Language Processing (NLP).

NLP is the component of text mining that performs the <br> linguistic analysis that helps a machine "read" text.


What is NLTK?
NLTK stands for Natural Language Toolkit. It is one of <br> the most widely used Python NLP libraries and contains packages <br>
that help machines process human language.

pip install nltk

<h4> Applications of NLP </h4>

- Sentiment Analysis
- Speech Recognition
- Chatbot
- Machine Translation
- Spell Checking
- Keyword Search
- Advertising matching

<h4> Tutorials </h4>

- Tokenization
- Stop Words
- Stemming
- POS (part-of-speech) tagging





In [1]:
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus

In [6]:
print(os.listdir(nltk.data.find('corpora')))

['gutenberg', 'gutenberg.zip', 'movie_reviews', 'movie_reviews.zip', 'stopwords', 'stopwords.zip', 'wordnet', 'wordnet.zip', 'words', 'words.zip']


**Corpus** -> A collection of written texts, e.g. medical journals or parliamentary debates

**Downloading a corpus**

nltk.download(corpus_name)

In [2]:
# nltk.download('punkt')  # tokenizer models needed by word_tokenize
nltk.download('gutenberg')  # run once; later runs report the package as up-to-date

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\maury\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [3]:
nltk.corpus.gutenberg.fileids()  # list every file in the Gutenberg corpus


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

192427

In [5]:
for word in emma[:500]:
    print(word, end= " ")

[ Emma by Jane Austen 1816 ] VOLUME I CHAPTER I Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her . She was the youngest of the two daughters of a most affectionate , indulgent father ; and had , in consequence of her sister ' s marriage , been mistress of his house from a very early period . Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses ; and her place had been supplied by an excellent woman as governess , who had fallen little short of a mother in affection . Sixteen years had Miss Taylor been in Mr . Woodhouse ' s family , less as a governess than a friend , very fond of both daughters , but particularly of Emma . Between _them_ it was more the intimacy of sisters . Even before Miss Taylor had ceased to hold the nominal office of gover

<h3 style="color:blue"> Tokenization </h3>

- Break a complex sentence into words.
- Understand the importance of each word with respect to the sentence.
- Produce a structural description of an input sentence.

In [6]:
text = "In Brazil they drive on the right-hand side of the road. Brazil has a large coastline on the eastern side of South America"

In [9]:
from nltk.tokenize import word_tokenize, sent_tokenize

token = word_tokenize(text)
print(token)

['In', 'Brazil', 'they', 'drive', 'on', 'the', 'right-hand', 'side', 'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline', 'on', 'the', 'eastern', 'side', 'of', 'South', 'America']
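
To make the idea of tokenization concrete, here is a rough regex approximation using only the standard library. It is a sketch, not how `word_tokenize` or `sent_tokenize` work internally; the helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the patterns only handle simple English text like the sentence above.

```python
import re

text = ("In Brazil they drive on the right-hand side of the road. "
        "Brazil has a large coastline on the eastern side of South America")

# A token is either a word (hyphenated words stay together)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_word_tokenize(s):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(s)


def simple_sent_tokenize(s):
    """Split after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", s.strip())


tokens = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)
print(len(tokens), tokens[:8])
print(sentences)
```

On this input the sketch yields the same 24 tokens as the cell above, but it will diverge from NLTK on abbreviations, contractions, and quotes.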


In [10]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [11]:
for word in token:
    fdist[word.lower()] += 1
fdist

FreqDist({'the': 3, 'brazil': 2, 'on': 2, 'side': 2, 'of': 2, 'in': 1, 'they': 1, 'drive': 1, 'right-hand': 1, 'road': 1, ...})

In [12]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[('the', 3),
 ('brazil', 2),
 ('on', 2),
 ('side', 2),
 ('of', 2),
 ('in', 1),
 ('they', 1),
 ('drive', 1),
 ('right-hand', 1),
 ('road', 1)]
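
`FreqDist` behaves like a counter keyed by token. The same counts can be reproduced with `collections.Counter` from the standard library, which may help when NLTK is not available. The token list below is copied from the tokenization output above.

```python
from collections import Counter

tokens = ['In', 'Brazil', 'they', 'drive', 'on', 'the', 'right-hand', 'side',
          'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline',
          'on', 'the', 'eastern', 'side', 'of', 'South', 'America']

# Count case-insensitively, as the FreqDist loop does.
word_counts = Counter(tok.lower() for tok in tokens)

# Ties keep first-seen order, so the ranking is deterministic.
top5 = word_counts.most_common(5)
print(top5)
```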

In [17]:
# Number of paragraphs (blocks of text separated by blank lines)
from nltk.tokenize import blankline_tokenize
b_token = blankline_tokenize(text)
len(b_token)

1
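
The result is 1 because `text` contains no blank line. A small regex sketch in the same spirit shows how a blank line changes the count; `split_paragraphs` is a hypothetical helper, not an NLTK function.

```python
import re


def split_paragraphs(s):
    """Split on one or more blank lines and drop empty pieces."""
    return [p for p in re.split(r"\s*\n\s*\n\s*", s) if p]


one_paragraph = "Brazil is large. It has a long coastline."
two_paragraphs = "Brazil is large.\n\nIt has a long coastline."

print(len(split_paragraphs(one_paragraph)))
print(len(split_paragraphs(two_paragraphs)))
```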


- **Bigrams** - Tokens of two consecutive written words
- **Trigrams** - Tokens of three consecutive written words
- **Ngrams** - Tokens of any number of consecutive written words

In [14]:
from nltk.util import bigrams, trigrams, ngrams

In [15]:
# trigrams() yields every run of three consecutive tokens
token_trigrams = list(trigrams(token))
token_trigrams

[('In', 'Brazil', 'they'),
 ('Brazil', 'they', 'drive'),
 ('they', 'drive', 'on'),
 ('drive', 'on', 'the'),
 ('on', 'the', 'right-hand'),
 ('the', 'right-hand', 'side'),
 ('right-hand', 'side', 'of'),
 ('side', 'of', 'the'),
 ('of', 'the', 'road'),
 ('the', 'road', '.'),
 ('road', '.', 'Brazil'),
 ('.', 'Brazil', 'has'),
 ('Brazil', 'has', 'a'),
 ('has', 'a', 'large'),
 ('a', 'large', 'coastline'),
 ('large', 'coastline', 'on'),
 ('coastline', 'on', 'the'),
 ('on', 'the', 'eastern'),
 ('the', 'eastern', 'side'),
 ('eastern', 'side', 'of'),
 ('side', 'of', 'South'),
 ('of', 'South', 'America')]
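
The cell above shows trigrams only. As a sketch of the general case, an n-gram list can be built with plain Python by zipping `n` staggered views of the token list; `simple_ngrams` is a hypothetical helper written for illustration.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["In", "Brazil", "they", "drive", "on", "the", "right-hand", "side"]

bigram_list = simple_ngrams(tokens, 2)
fourgram_list = simple_ngrams(tokens, 4)
print(bigram_list[:3])
print(fourgram_list[0])
```

A list of `k` tokens produces `k - n + 1` n-grams, which is why the 24-token sentence yields 22 trigrams.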

<h3 style="color:blue"> Stop Words </h3> <br>
Stop words are the most common words in a natural language, such as "the", "is", and "of". For the purpose <br>
of analyzing text data and building NLP models, these words usually add <br>
little to the meaning of the document.

In [16]:
from nltk.corpus import stopwords

In [17]:
stop_words = stopwords.words('english')

print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [18]:
filtered = [word for word in token if word.lower() not in stop_words]
print(filtered)

['Brazil', 'drive', 'right-hand', 'side', 'road', '.', 'Brazil', 'large', 'coastline', 'eastern', 'side', 'South', 'America']
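
Note that the period survives the filter because it is not a stop word. A minimal sketch of filtering punctuation in the same pass follows; `SMALL_STOP_SET` is a hand-picked, hypothetical subset used so the example runs without the NLTK stop word corpus.

```python
import string

tokens = ['In', 'Brazil', 'they', 'drive', 'on', 'the', 'right-hand', 'side',
          'of', 'the', 'road', '.', 'Brazil', 'has', 'a', 'large', 'coastline',
          'on', 'the', 'eastern', 'side', 'of', 'South', 'America']

# Hypothetical subset of English stop words, just enough for this sentence.
SMALL_STOP_SET = {"in", "they", "on", "the", "of", "has", "a"}


def is_punctuation(tok):
    """True when every character of the token is punctuation."""
    return all(ch in string.punctuation for ch in tok)


content_tokens = [
    tok for tok in tokens
    if tok.lower() not in SMALL_STOP_SET and not is_punctuation(tok)
]
print(content_tokens)
```

Using a `set` for the stop words also makes each membership test constant time, which matters on larger corpora.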


In [25]:
nltk.download('stopwords')  # run once, before calling stopwords.words()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maury\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

<h3 style="color:blue"> Stemming </h3> <br>
Stemming normalizes words to their base or root form.

Example: affect (root word) - affection, affects, affectation, affected, affecting
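
To illustrate the idea only, here is a toy suffix-stripping stemmer. It is not the Porter or Lancaster algorithm; the suffix list and the `toy_stem` name are assumptions chosen so that the example words above collapse to "affect".

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ation", "ion", "ing", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


family = ["affection", "affects", "affectation", "affected", "affecting"]
stems = [toy_stem(w) for w in family]
print(stems)
```

Real stemmers apply many more rules and conditions, so their output on other words will differ from this sketch.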

In [19]:
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

In [20]:
pst = PorterStemmer()

In [21]:
pst.stem('having')

'have'

In [22]:
# Plural and inflected forms
words = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization','sensational', 'traditional', 'reference', 'colonizer','plotted']

for word in words:
    print(word, ":", pst.stem(word))

caresses : caress
flies : fli
dies : die
mules : mule
denied : deni
died : die
agreed : agre
owned : own
humbled : humbl
sized : size
meeting : meet
stating : state
siezing : siez
itemization : item
sensational : sensat
traditional : tradit
reference : refer
colonizer : colon
plotted : plot


In [23]:
porter = PorterStemmer()
lancaster=LancasterStemmer()
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word,porter.stem(word),lancaster.stem(word)))

Word                Porter Stemmer      lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             


<h3 style="color:blue"> POS tagging - Part of Speech </h3>

The process of classifying words into their parts of speech and labeling <br> them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.

**List of POS tags** -> https://www.sketchengine.eu/penn-treebank-tagset/


In [24]:
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
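
The tagger returns `(word, tag)` pairs, so a common next step is grouping words by tag. The sketch below works on the output shown above (copied in as a literal) and pulls out the adverbs (`RB`) and adjectives (`JJ`).

```python
from collections import defaultdict

# Tagged output copied from the cell above.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Map each tag to the words that carry it, preserving sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

adverbs = words_by_tag['RB']
adjectives = words_by_tag['JJ']
print(adverbs, adjectives)
```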

In [None]:
# nltk.download('averaged_perceptron_tagger')

### Text mining and classification

https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/

In [25]:
import pandas as pd

In [52]:
# Forward slash avoids the invalid '\S' escape and works on every OS
df = pd.read_table('sample_data/SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

In [53]:
df.head(5)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [54]:
df['label'] = df.label.map({'ham': 0, 'spam': 1})

In [55]:
df['message'] = df.message.map(lambda x: x.lower())

In [56]:
df.head(5)

Unnamed: 0,label,message
0,0,"go until jurong point, crazy.. available only ..."
1,0,ok lar... joking wif u oni...
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor... u c already then say...
4,0,"nah i don't think he goes to usf, he lives aro..."


In [57]:
# Strip punctuation; regex=True is required on recent pandas versions
df['message'] = df.message.str.replace(r'[^\w\s]', '', regex=True)
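
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace. A standalone sketch with `re.sub` on two shortened sample messages shows the effect, including the side effect that apostrophes are removed.

```python
import re

# Anything that is not a letter, digit, underscore, or whitespace.
PUNCT = re.compile(r"[^\w\s]")

samples = ["go until jurong point, crazy..",
           "nah i don't think he goes to usf"]

cleaned = [PUNCT.sub("", s) for s in samples]
print(cleaned)
```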

In [58]:
df.head(5)

Unnamed: 0,label,message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...


In [59]:
import nltk

In [60]:
df['message'] = df['message'].apply(nltk.word_tokenize)

In [61]:
df.head(5)

Unnamed: 0,label,message
0,0,"[go, until, jurong, point, crazy, available, o..."
1,0,"[ok, lar, joking, wif, u, oni]"
2,1,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,0,"[u, dun, say, so, early, hor, u, c, already, t..."
4,0,"[nah, i, dont, think, he, goes, to, usf, he, l..."


In [62]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
 
df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])

In [63]:
df.head(3)

Unnamed: 0,label,message
0,0,"[go, until, jurong, point, crazi, avail, onli,..."
1,0,"[ok, lar, joke, wif, u, oni]"
2,1,"[free, entri, in, 2, a, wkli, comp, to, win, f..."


In [51]:
from sklearn.feature_extraction.text import CountVectorizer

In [64]:
df['message'] = df['message'].apply(lambda x: ' '.join(x))


In [65]:
df.head(3)

Unnamed: 0,label,message
0,0,go until jurong point crazi avail onli in bugi...
1,0,ok lar joke wif u oni
2,1,free entri in 2 a wkli comp to win fa cup fina...


In [66]:

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['message'])

# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
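
What `fit_transform` produces is a document-term count matrix. The standard-library sketch below builds one by hand under two assumptions that mirror `CountVectorizer` defaults: tokens are runs of two or more word characters, and the vocabulary is sorted alphabetically. The two documents are made up for illustration.

```python
import re
from collections import Counter

docs = ["free entri free prize", "ok lar joke wif u oni"]


def tokenize(doc):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(d) for d in docs]

# One column per distinct token, in alphabetical order.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one count per vocabulary entry.
count_matrix = []
for toks in tokenized:
    row_counts = Counter(toks)
    count_matrix.append([row_counts[term] for term in vocabulary])

print(vocabulary)
print(count_matrix)
```

The single-letter token "u" is dropped by the token pattern, which is worth keeping in mind for SMS text where such abbreviations are frequent.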

In [68]:
# Vocabulary size; on scikit-learn < 1.0 use get_feature_names() instead
print(len(count_vect.get_feature_names_out()))

8169


In [69]:
counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [70]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.2, random_state=69)

In [71]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)
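
For intuition about what the classifier learns, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation; `train_nb`, `predict_nb`, and the tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Return {label: (log_prior, {word: log_likelihood})}."""
    class_totals = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_docs in class_totals.items():
        # Smoothed denominator: total words in class + alpha per vocab entry.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        log_lik = {w: math.log((word_counts[label][w] + alpha) / denom)
                   for w in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_lik)
    return model


def predict_nb(model, tokens):
    """Pick the label with the highest log posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(model, key=score)


train_docs = [["free", "prize", "win"], ["free", "entri", "win"],
              ["ok", "see", "you"], ["see", "you", "later"]]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_model = train_nb(train_docs, train_labels)
print(predict_nb(toy_model, ["win", "free"]))
print(predict_nb(toy_model, ["see", "you"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.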

In [16]:
import numpy as np

predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

0.982078853046595


In [17]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predicted))

[[478   4]
 [  6  70]]
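
In scikit-learn's layout, rows are true labels and columns are predicted labels, so with `0 = ham` and `1 = spam` the four cells can be read off directly. The sketch below recomputes accuracy from the matrix above and adds precision and recall for the spam class.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label.
matrix = [[478, 4],
          [6, 70]]

tn, fp = matrix[0]  # ham classified as ham / ham classified as spam
fn, tp = matrix[1]  # spam classified as ham / spam classified as spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy matches the value printed earlier, while recall shows that 6 of the 76 spam messages in the test set were missed.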
