# TEXT ANALYTICS / MINING

- Unstructured text data is being generated all the time

- Text analytics / Text Mining involves techniques and algorithms for analyzing text

- Traditional data mining techniques may be used if text is converted to numerical vectors

## Key Techniques
- NLTK: stemming, stopwords, punctuation, top words
- WordCloud: visualization
- TF-IDF Vectorizer with sklearn
- Topic Modeling with gensim
- Sentiment analysis with TextBlob

# TF-IDF Vectorizer with sklearn
- Vectorizers are used to transform words into numbers

- Some use a CountVectorizer – just raw counts of each word in each document

- But TfidfVectorizer is usually recommended: it weights each word by how distinctive it is across documents, not just by how often it occurs

## 1) CountVectorizer (i.e. Term Frequency)

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

#mock data for demo purposes
doc1 = "the moving finger writes and having writ moves on"
doc2 = "the gold finger or golden finger the question is moot"
doc3 = "he is a finger spinner and can write with it too"
doc4 = "the valiant never taste of death but once or so they say"
doc5 = "knights are valiant and never afraid of death"
corpus = [doc1, doc2, doc3, doc4, doc5]

> **fit_transform( )** method "tokenizes the strings and gives you a vector for each string".
> 
> Each vector has one dimension per distinct token in the whole corpus (the vocabulary).
> 
> The value in each dimension is the number of times that token is found in the corresponding string.
> 
> So, it has determined both (1) which tokens it will count, and (2) how they correspond to entries in the count vector.

In [None]:
vectorizer = CountVectorizer(stop_words='english')  # create a CountVectorizer that drops English stop words
matrix = vectorizer.fit_transform(corpus)           # tokenize all the strings in the corpus and return a vector for each string
print(type(matrix))                                 # csr_matrix: compressed sparse row matrix
print(matrix.shape)                                 # shape of the outcome of fit_transform():
                                                    # (number of documents, number of distinct tokens)

> **vocabulary_** attribute "is a dictionary that maps each token to its column index in the count matrix".

In [None]:
print(vectorizer.vocabulary_)
len(vectorizer.vocabulary_)
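Because vocabulary_ maps a token to a column index, it can be used to look up a single count directly. The sketch below assumes two of the mock documents; the names `docs_demo`, `vec_demo` and `counts_demo` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs_demo = [
    "the moving finger writes and having writ moves on",
    "the gold finger or golden finger the question is moot",
]
vec_demo = CountVectorizer(stop_words='english')
counts_demo = vec_demo.fit_transform(docs_demo)

col = vec_demo.vocabulary_['finger']   # column index assigned to the token 'finger'
finger_in_doc2 = counts_demo[1, col]   # row 1 is the second document
print(col, finger_in_doc2)
```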

In [None]:
print(matrix)

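> **get_feature_names_out( )** method "returns an array of all the tokens (i.e. words) kept from the corpus", ordered by column index. Each token is a feature of the fitted CountVectorizer( ) object. Older scikit-learn versions called this method get_feature_names( ); it was removed in version 1.2.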

In [None]:
# Let's investigate the features of the instance object.
print(vectorizer.get_feature_names_out())

> **toarray( )** method "returns a dense ndarray representation of the given matrix".

In [None]:
# each document is represented as a term-frequency vector, where each dimension corresponds to a word
matrix.toarray()

In [None]:
print(matrix.toarray())
# But it is not clear which feature name (i.e. token or word) each frequency corresponds to!

In [None]:
# Let's combine the feature names and the frequencies! Note that the feature-name array and each row share the same index.
feature_names = vectorizer.get_feature_names_out()
for doc in matrix.toarray():
    for name, count in zip(feature_names, doc):
        print('{}:{}'.format(name, count), end=' ')
    print('\n')

> ***As we saw before, there are several words (e.g. finger) that appear frequently in the text.***
> ***But those words do not have much "distinguishing" power.***

## 2) Solution: TF-IDF Vectorizer

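- TF: Term Frequency: how many times does a word appear in a document?

- IDF: Inverse Document Frequency: in how many documents does the word appear? The fewer documents, the higher the weight.

- TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

- IDF(t, D) = log( N / |{d in D:  t in d}| )
- N: total number of documents in the corpus
- |{d in D:  t in d}|: number of documents in the corpus that contain the term t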
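The formulas above can be worked through by hand on the mock corpus. This is a minimal sketch with a naive whitespace tokenizer and no stop-word removal; the helper names `tf`, `idf` and `tf_idf` are made up for illustration.

```python
import math

corpus_demo = [
    "the moving finger writes and having writ moves on",
    "the gold finger or golden finger the question is moot",
    "he is a finger spinner and can write with it too",
    "the valiant never taste of death but once or so they say",
    "knights are valiant and never afraid of death",
]
tokenized = [doc.split() for doc in corpus_demo]   # naive whitespace tokenizer
n_docs = len(tokenized)                            # N in the formula

def tf(term, tokens):
    """Raw count of `term` in one tokenized document."""
    return tokens.count(term)

def idf(term):
    """log(N / number of documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / doc_freq)

def tf_idf(term, doc_index):
    """TF-IDF of `term` in the document at `doc_index`."""
    return tf(term, tokenized[doc_index]) * idf(term)

for term in ("finger", "valiant", "moot"):
    print(term, round(idf(term), 4))
print(round(tf_idf("finger", 1), 4))   # 'finger' occurs twice in the second document
```

Here "finger" appears in three of the five documents, so it gets a lower IDF than "moot", which appears in only one. The numbers will not match TfidfVectorizer exactly: by default scikit-learn uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector.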


http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#mock data for demo purposes
doc1 = "the moving finger writes and having writ moves on"
doc2 = "the gold finger or golden finger the question is moot"
doc3 = "he is a finger spinner and can write with it too"
doc4 = "the valiant never taste of death but once or so they say"
doc5 = "knights are valiant and never afraid of death"
docs = [doc1, doc2, doc3, doc4, doc5]

vectorizer2 = TfidfVectorizer(stop_words='english') # The only difference is the type of Vectorizer!
matrix2 = vectorizer2.fit_transform(docs)
print(vectorizer2.get_feature_names_out())          # get_feature_names() was removed in scikit-learn 1.2
print(matrix2.shape)
print(matrix2.toarray()) 

### Let's compare (1) CountVectorizer and (2) TF-IDF Vectorizer

In [None]:
print(vectorizer.get_feature_names_out()) # (1) CountVectorizer
len(vectorizer.get_feature_names_out())

In [None]:
print(vectorizer2.get_feature_names_out()) # (2) TfidfVectorizer
len(vectorizer2.get_feature_names_out())

In [None]:
print(matrix.toarray()) # (1) CountVectorizer

In [None]:
print(matrix2.toarray()) # (2) TfidfVectorizer

## 3) Practice:  Let's Calculate the Pairwise Document Distance with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#mock data for demo purposes
doc1 = "the moving finger writes and having writ moves on"
doc2 = "gold finger or golden finger the question is moot"
doc3 = "he is a finger spinner and can write with it too"
doc4 = "the valiant never taste of death but once or so they say"
doc5 = "knights are valiant and never afraid of death"
docs = [doc1, doc2, doc3, doc4, doc5]

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(docs)
print(len(docs))
print(vectorizer.get_feature_names_out())
print(matrix.shape)
print(matrix.toarray())

> **cosine_distances( )** function "takes a matrix with one sample per row, computes the cosine distance between every pair of samples, and returns a distance matrix".
>
> Cosine distance is defined as 1.0 minus the cosine similarity.
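To make the definition concrete, here is a minimal pure-Python sketch of the cosine distance between two vectors. The helper name `cosine_distance` is made up for illustration and is not the scikit-learn function; it assumes both vectors are non-zero and of equal length.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1, 2, 0], [2, 4, 0]))   # same direction, different length
print(cosine_distance([1, 0, 0], [0, 3, 0]))   # no shared non-zero dimensions
```

Two documents with the same word proportions are at distance (close to) 0 regardless of their length, while two documents with no words in common are at distance 1.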

In [None]:
from sklearn.metrics.pairwise import cosine_distances

cos_dist = cosine_distances(matrix)
print(cos_dist.shape)

In [None]:
print(cos_dist)

## Extra: The Pairwise Document Cosine-Distance with TF (i.e. CountVectorizer)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

doc1 = "the moving finger writes and having writ moves on"
doc2 = "gold finger or golden finger the question is moot"
doc3 = "he is a finger spinner and can write with it too"
doc4 = "the valiant never taste of death but once or so they say"
doc5 = "knights are valiant and never afraid of death"
docs = [doc1, doc2, doc3, doc4, doc5]

vectorizer = CountVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(docs)
cos_dist = cosine_distances(matrix)
print(cos_dist)

## Practice: The Pairwise Document Cosine-Distance with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

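# read each file with a context manager so that it is closed after reading
with open('frankenstein.txt') as f:
    doc1 = f.read()
with open('raven.txt') as f:
    doc2 = f.read()
with open('abbey.txt') as f:
    doc3 = f.read()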
docs = [doc1, doc2, doc3]

vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(docs)
cos_dist = cosine_distances(matrix)
print(cos_dist)