# TEXT ANALYTICS / MINING

- Unstructured text data is being generated all the time

- Text analytics (also called text mining) covers the techniques and algorithms used to analyze text

- Traditional data mining techniques may be used if text is converted to numerical vectors
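
As a rough illustration of what "converted to numerical vectors" can mean, the sketch below builds simple word-count (bag-of-words) vectors using only the standard library. The two documents are made up for this example, and real projects would more likely use a library vectorizer such as the TF-IDF one listed under Key Techniques.

```python
from collections import Counter

# Two toy documents (made up for illustration).
docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize on whitespace and build a sorted vocabulary over all documents.
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a vector of word counts, one slot per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Once every document is a fixed-length list of numbers, it can be passed to clustering or classification methods like any other tabular data.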


## Key Techniques
- NLTK: stemming, stopwords, punctuation, top words
- WordCloud: visualization
- TF-IDF Vectorizer with sklearn
- Topic Modeling with gensim
- Sentiment analysis with TextBlob

### Fig. Text Mining Process

![text_mining_process](text_mining_process.png)

# 1. Text Preprocessing with NLTK (Natural Language Toolkit)

To use NLTK properly, you need to download its text corpora by running:
- `import nltk`
- `nltk.download()`

Otherwise, you may see `LookupError` messages when a required resource is missing.

In [None]:
#!pip install --upgrade nltk
#!conda install nltk
!conda list

In [None]:
!pip install nltk

In [None]:
import nltk

In [None]:
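# Opens the NLTK downloader; please download 'All Packages'.
# Alternatively, fetch only what this notebook uses:
# nltk.download('stopwords'); nltk.download('wordnet')
nltk.download()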

## (1) Removing Punctuation & Normalization

In [None]:
s = "Hello!! 2019 was great, isn't it? So is 2020!!!"

In [None]:
import string 
p = string.punctuation
print(p)

> **str.maketrans(< intabstring >, < outtabstring >)** returns a translation table that maps each character in the intabstring to the character at the same position in the outtabstring. The two strings must have the same length.
>
> This table is then passed to the string's translate() method.

In [None]:
p_out = len(p)* " "

In [None]:
p_out

In [None]:
table_p = str.maketrans(p, p_out)
table_p

In [None]:
s.translate(table_p).lower()
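
The cells above replace each punctuation character with a space. As a hedged side note, `str.maketrans` also accepts a third argument listing characters to delete outright; the sketch below compares the two choices on the same sample string. The variable names are introduced here for illustration only.

```python
import string

s = "Hello!! 2019 was great, isn't it? So is 2020!!!"

# Two-argument form: map every punctuation character to a space.
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Three-argument form: characters in the third argument are deleted.
to_nothing = str.maketrans("", "", string.punctuation)

spaced = s.translate(to_space).lower()
deleted = s.translate(to_nothing).lower()

print(spaced.split())   # "isn't" is split into 'isn' and 't'
print(deleted.split())  # "isn't" collapses into 'isnt'
```

Which form is preferable depends on the later steps: replacing with spaces keeps words such as "well-known" apart, while deleting keeps contractions in one token.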

## (2) Stemming & Lemmatization

- Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form (generally a written word form).
- Lemmatisation (or lemmatization), in linguistics, is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item.
- Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
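
To make the difference concrete, here is a deliberately simplified sketch. `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration; they are not NLTK's algorithms, and the lemma table is hand-written. The stemmer sees only the characters of a word, while the lemmatizer is also given a part of speech.

```python
# Toy suffix-stripping stemmer: looks only at the characters of the word.
def toy_stem(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written lookup keyed by (word, part of speech).
TOY_LEMMAS = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("dogs", "noun"): "dog",
    ("churches", "noun"): "church",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the table has no entry.
    return TOY_LEMMAS.get((word, pos), word)

# The stemmer cannot tell the verb "saw" from the noun "saw".
print(toy_stem("saw"), toy_lemmatize("saw", "verb"), toy_lemmatize("saw", "noun"))
print(toy_stem("churches"), toy_lemmatize("churches", "noun"))
# Blind suffix stripping also damages words that merely end in "ing".
print(toy_stem("bringing"), toy_stem("string"))
```

The last line shows the cost of ignoring context: this toy rule turns "string" into "str". Production stemmers use more careful rules, but the trade-off between speed and accuracy described above is the same.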

In [None]:
from nltk.stem.lancaster import LancasterStemmer
ls = LancasterStemmer()

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

from nltk.stem.snowball import SnowballStemmer
ss = SnowballStemmer("english") 

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [None]:
ps.stem('says')

In [None]:
words = ['string', 'bringing', 'maximum', 'roughly','would',
         'multiply', 'provision', 'saying', 'saw', 'dogs', 'churches']

In [None]:
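# Compare the three stemmers with the WordNet lemmatizer.
# Note: wnl.lemmatize(word) treats every word as a noun unless pos is given.
for word in words:
    print('Word: {}\tLancaster: {}\tPorter: {}\tSnowball: {}\tWordNet: {}'.format(
        word, ls.stem(word), ps.stem(word), ss.stem(word), wnl.lemmatize(word)))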

## (3) Removing Stopwords

In [None]:
# Create a list of lowercased, whitespace-separated words
with open('frankenstein.txt') as infile:
    words = infile.read().lower().split()  # normalization

> ***nltk.FreqDist( ):*** A frequency distribution for the outcomes of an experiment. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome. For example, it will produce a frequency distribution that encodes how often each word occurs in a text.
>
> http://www.nltk.org/api/nltk.html?highlight=freqdist#nltk.probability.FreqDist
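
The idea of "a function mapping from each sample to the number of times it occurred" can be sketched with the standard library's `collections.Counter`, which `nltk.FreqDist` builds on in current NLTK releases. The token list below is made up for illustration.

```python
from collections import Counter

# Toy token list (made up for illustration).
tokens = "the monster saw the creator and the creator fled".split()

# Map each distinct token to the number of times it occurs.
counts = Counter(tokens)

print(counts["the"])           # count of a token that occurs
print(counts["frankenstein"])  # tokens that never occur count as 0
print(counts.most_common(2))   # the two most frequent tokens
```

`FreqDist` adds conveniences such as `plot()` on top of this counting behaviour.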

In [None]:
# get frequent words
freq = nltk.FreqDist(words)
freq

In [None]:
freq['frankenstein']

In [None]:
# get a plot of top 10 frequent words
%matplotlib inline
freq.plot(10)

> ***Most of them are stopwords... We need to remove stopwords...***

In [None]:
stopwords = nltk.corpus.stopwords.words('english')

print(type(stopwords))
print(len(stopwords))
print(stopwords)

In [None]:
stopwords.append('would')
print(stopwords)

In [None]:
with open('frankenstein.txt') as infile:
    words = infile.read().lower().split()  # normalization

words2 = []
for w in words:
    if w not in stopwords:
        words2.append(w)

print(len(words))
print(len(words2))

In [None]:
# from the words list, remove stopwords and single-character tokens
words2 = []
for w in words:
    if w not in stopwords and len(w) > 1:
        words2.append(w)

print(len(words))
print(len(words2))

In [None]:
freq = nltk.FreqDist(words)
freq2 = nltk.FreqDist(words2)
freq.plot(10)
freq2.plot(10)
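
The filtering loop used in this section can also be written as a list comprehension over a `set`. This is a minimal sketch with a tiny hand-written stop list and a made-up sentence, both assumptions for illustration; the point is that membership tests on a set take constant time on average, whereas `w not in stopwords` on a list scans the whole list for every word.

```python
# Tiny hand-written stop list (hypothetical; NLTK's English list is much longer).
stop_set = {"the", "of", "and", "i", "to", "a", "would"}

# Made-up sample sentence, already lowercased and split into tokens.
tokens = "i would give the name of frankenstein to a monster".split()

# Keep tokens that are not stopwords and are longer than one character.
kept = [w for w in tokens if w not in stop_set and len(w) > 1]

print(len(tokens), len(kept))
print(kept)
```

With the NLTK list, the same pattern would start from `set(stopwords)` after any extra words have been appended.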

## (4) Example (frankenstein.txt)

### Getting the word frequency after preprocessing

In [None]:
import nltk
import string
from nltk.stem import WordNetLemmatizer
%matplotlib inline

#(1) open a dataset (i.e., a text file)
with open('frankenstein.txt') as infile:
    content = infile.read()

#(2) normalization and removing punctuation
p = string.punctuation
table_p = str.maketrans(p, len(p) * " ")

l_content = content.lower() #normalization
n_content = l_content.translate(table_p) #removing punctuation

#(3) removing stopwords
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('could')
stopwords.append('would')
stopwords.append('upon')

words = n_content.split()

rs_words = []
for w in words:
    if w not in stopwords:
        rs_words.append(w)

#(4) lemmatizing
wnl = WordNetLemmatizer()
        
le_words = []
for w in rs_words:
    le_words.append(wnl.lemmatize(w))

#(5) getting the word frequency
freq = nltk.FreqDist(le_words)
freq.plot(10)