# Text Analysis

This example shows a few text analysis methods: similarity, clustering, topic modeling, and sentiment analysis.

## Data

Data is typically input as a list of documents.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import jaccard_score  # Jaccard
from sklearn.metrics.pairwise import cosine_similarity  # Cosine

d1 = "He is a good guy, he is not bad"
d2 = "feet wolves cooked boys girls ,!<@!"
d3 = "He is not a good guy, he is bad"
d4 = "I drink water in parties"
d5 = "I grab a drink in parties"

c3 = [d4, d5]

#### CountVectorizer

`fit()` builds the bag-of-words (BOW) vocabulary and assigns each word an index in alphabetical order. By default, single-character tokens such as "I" and "a" are dropped.

In [2]:
vectorizer5 = CountVectorizer()
vectorizer5.fit(c3)
print(vectorizer5.vocabulary_)

{'drink': 0, 'water': 4, 'in': 2, 'parties': 3, 'grab': 1}


`transform()` converts each document into a vector of word counts, with one component per vocabulary entry.

In [3]:
v_c3 = vectorizer5.transform(c3).toarray()
v_c3

array([[1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0]], dtype=int64)
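
The two steps can be mimicked without scikit-learn. The sketch below is an approximation under stated assumptions, not the library's implementation: it lowercases the text, keeps only tokens of two or more word characters (mirroring the default token pattern), and indexes the vocabulary alphabetically. The helper names `tokenize`, `fit_vocabulary`, and `transform_counts` are hypothetical.

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters, so "I" and "a" are dropped.
    return re.findall(r"\b\w\w+\b", doc.lower())

def fit_vocabulary(docs):
    # Map each distinct token to its position in alphabetical order.
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}

def transform_counts(docs, vocabulary):
    # One count vector per document, one component per vocabulary entry.
    ordered = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(word, 0) for word in ordered])
    return rows

docs = ["I drink water in parties", "I grab a drink in parties"]
vocab = fit_vocabulary(docs)
vectors = transform_counts(docs, vocab)
print(vocab)
print(vectors)
```

On these two documents the sketch should reproduce the vocabulary and count vectors shown in this section.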

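**Similarity** - `cosine_similarity` expects 2D inputs (one row per sample), so each vector is wrapped in a list. `jaccard_score` takes the 1D vectors directly.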

In [4]:
print(cosine_similarity([v_c3[0]], [v_c3[1]]))
jaccard_score(v_c3[0],v_c3[1])

[[0.75]]


0.6
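
Both numbers can be checked by hand. The following sketch recomputes them with plain Python on the two count vectors from this section. It treats the vectors as binary, which holds here because every count is 0 or 1.

```python
import math

a = [1, 0, 1, 1, 1]  # "I drink water in parties"
b = [1, 1, 1, 1, 0]  # "I grab a drink in parties"

# Cosine similarity: dot product divided by the product of the vector lengths.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Jaccard score for binary vectors: shared words / words present in either document.
shared = sum(1 for x, y in zip(a, b) if x and y)
either = sum(1 for x, y in zip(a, b) if x or y)
jaccard = shared / either

print(cosine, jaccard)
```

The documents share 3 words, each has 4 words, and together they cover 5 distinct words, giving 3 / (2 * 2) = 0.75 and 3 / 5 = 0.6.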

## Clustering Analysis

We use K-Means and Agglomerative Clustering for clustering analysis.

### TF-IDF Vectorizer

In [5]:
from sklearn.cluster import KMeans
d6 = "Seattle weather is bad in winter"
d7 = "Seattle Seahawks is a great football team"
d8 = "I love Seahawks"
d9 = "I learned a lot of Data analytics tools"
d10 = "I am a data scientist"
c4 = [d1,d2,d3,d4,d5,d6,d7,d8,d9,d10]

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer6 = TfidfVectorizer(stop_words='english')
X = vectorizer6.fit_transform(c4)
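
For intuition about what the TF-IDF weights represent, here is a minimal sketch on two pre-tokenized toy documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which scikit-learn documents as its defaults. It skips stop-word removal, so the numbers will not match `X`.

```python
import math
from collections import Counter

docs = [["drink", "water", "in", "parties"],
        ["grab", "drink", "in", "parties"]]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = Counter(tok for doc in docs for tok in set(doc))
# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # scale the row to unit length

rows = [tfidf_row(doc) for doc in docs]
for word, weight in zip(vocab, rows[0]):
    print(f"{word:8s} {weight:.3f}")
```

In the first document, "water" appears in only one document and so should receive a larger weight than "drink", which appears in both.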

### K-Means

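`fit()` trains the model and stores the cluster centroids in `cluster_centers_`. Because the centroids are initialized randomly, results can vary between runs unless `random_state` is set.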

In [6]:
k = 4 # number of clusters
model1 = KMeans(n_clusters = k)
model1.fit(X)
model1.cluster_centers_

array([[0.23007895, 0.        , 0.        , 0.        , 0.51943254,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.23007895, 0.23007895,
        0.        , 0.        , 0.38095248, 0.        , 0.        ,
        0.        , 0.23007895, 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.10272995, 0.1118034 , 0.1118034 , 0.        ,
        0.        , 0.1118034 , 0.11857386, 0.1118034 , 0.        ,
        0.        , 0.11857386, 0.        , 0.        , 0.        ,
        0.19047624, 0.        , 0.        , 0.26272082, 0.21822012,
        0.11857386, 0.        , 0.        , 0.13812811, 0.13812811,
        0.1118034 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.54362395, 0.        , 0.        , 0.        , 0.        ,
        0.31974443, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.54362395, 0.        , 0.        , 0.        ,
      

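`fit()` also stores each document's cluster membership in `labels_`. Calling `fit_transform()` here would refit the model and return each document's distance to every centroid, not the labels, so it is not needed.

In [7]:
model1clusters = model1.labels_.tolist()
model1clusters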

[3, 0, 3, 2, 2, 0, 0, 0, 1, 1]
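
The relationship between centroid distances and labels can be illustrated with a small sketch. The 2D points and centroids are hypothetical, and the two helper functions only imitate what `KMeans.transform()` (distances to every centroid) and `KMeans.predict()` (index of the nearest centroid) return.

```python
import math

centers = [(0.0, 0.0), (5.0, 5.0)]               # hypothetical centroids
points = [(0.5, 1.0), (4.0, 6.0), (1.0, 0.0)]    # hypothetical documents in 2D

def distances_to_centers(point):
    # Analogous to one row of KMeans.transform(): distance to every centroid.
    return [math.dist(point, c) for c in centers]

def nearest_center(point):
    # Analogous to KMeans.predict(): index of the closest centroid.
    d = distances_to_centers(point)
    return d.index(min(d))

labels = [nearest_center(p) for p in points]
print(labels)
```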

#### Characteristics of each cluster

We want to find the 4 most significant words for each cluster.

1. For each cluster, get the term indices sorted by centroid weight in descending order. `argsort()` returns the indices that would sort the array in ascending order; the slice `[:, ::-1]` then reverses each row, which gives descending order.
2. `get_feature_names_out()` returns the words corresponding to the indices.
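
The two steps can be sketched for a single cluster. The vocabulary and centroid weights are hypothetical, and the `argsort` helper is a plain-Python stand-in for NumPy's method of the same name.

```python
def argsort(values):
    # Indices that would sort `values` in ascending order, like numpy's argsort().
    return sorted(range(len(values)), key=lambda i: values[i])

terms = ["bad", "drink", "good", "water"]   # hypothetical vocabulary
centroid = [0.10, 0.50, 0.00, 0.30]          # hypothetical centroid weights

ascending = argsort(centroid)    # [2, 0, 3, 1]
descending = ascending[::-1]     # [1, 3, 0, 2]
top_words = [terms[i] for i in descending[:2]]
print(top_words)                 # ['drink', 'water']
```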

In [8]:
order_centroids = model1.cluster_centers_.argsort()[:,::-1] 
terms = vectorizer6.get_feature_names_out()

for c in range(4):
    print('Cluster %d:' % c)
    for ind in order_centroids[c, :4]:
        print(terms[ind])
    

Cluster 0:
seahawks
seattle
love
weather
Cluster 1:
data
scientist
analytics
tools
Cluster 2:
drink
parties
water
grab
Cluster 3:
guy
good
bad
winter


### Agglomerative Clustering

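1. **n_clusters** - number of clusters, chosen in advance
2. **affinity** - distance measure between documents; other options include manhattan and cosine. The `ward` linkage only supports euclidean. In scikit-learn 1.2 this parameter was renamed **metric**, and **affinity** was removed in 1.4.
3. **linkage** - distance measure between clusters; other options are single, complete, and average  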
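
To make the linkage options concrete, the sketch below computes the single, complete, and average linkage distances between two hypothetical clusters of 1-D points. Ward linkage, used in this section, works differently: it merges the pair of clusters that gives the smallest increase in within-cluster variance, and is not shown here.

```python
from itertools import product

cluster_a = [1.0, 2.0]   # hypothetical 1-D points
cluster_b = [5.0, 9.0]

# All pairwise distances between a point in A and a point in B.
pairwise = [abs(a - b) for a, b in product(cluster_a, cluster_b)]

single = min(pairwise)                     # closest pair
complete = max(pairwise)                   # farthest pair
average = sum(pairwise) / len(pairwise)    # mean over all pairs
print(single, complete, average)
```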

`fit_predict()` fits the hierarchical clustering from features or distance matrix, and returns cluster labels.

In [9]:
from sklearn.cluster import AgglomerativeClustering

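# ward linkage uses euclidean distance, which is the default; the 'affinity'
# keyword was renamed 'metric' in scikit-learn 1.2, so it is omitted here
model2 = AgglomerativeClustering(n_clusters=4, linkage='ward')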
model2.fit_predict(X.toarray())
print(model2.labels_)


[3 0 3 2 2 0 0 0 1 1]


## LDA and Topic Modeling

### Data and document vectors

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer7 = CountVectorizer(stop_words='english')
X2 = vectorizer7.fit_transform(c4)
terms = vectorizer7.get_feature_names_out()

### Latent Dirichlet Allocation

* **n_components** - number of topics
* The model **lda**'s attribute **components_** stores topic word distribution. The array **components_[i, j]** can be viewed as pseudocount that represents the number of times **word j** was assigned to **topic i**.
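
Since the entries are pseudocounts rather than probabilities, dividing each row by its sum turns it into a word distribution for that topic. A minimal sketch with hypothetical pseudocounts, using nested lists in place of the `components_` array:

```python
# Hypothetical pseudocounts: 2 topics (rows) x 3 words (columns).
components = [[4.0, 1.0, 0.0],
              [0.5, 0.5, 1.0]]

# Divide each row by its sum to get a word distribution per topic.
topic_word = [[count / sum(row) for count in row] for row in components]
print(topic_word)
```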

To display the representative words under each topic, for each topic:
1. sort the word indices in descending order of pseudocount
2. take the top 4
3. retrieve the corresponding words and join them with a space " " between them
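
The slice `[:-4-1:-1]` used for steps 1 and 2 is compact but easy to misread. The sketch below shows the same idiom with `n = 2` on hypothetical scores: it walks backwards from the end of the ascending index list for `n` steps, which is equivalent to reversing the list and taking the first `n` entries.

```python
scores = [0.2, 0.9, 0.1, 0.7, 0.4]   # hypothetical pseudocounts for one topic
n = 2

order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending indices
top_n = order[:-n - 1:-1]             # walk backwards from the end, n steps
print(top_n)

# Equivalent, more explicit form: reverse first, then take the first n.
assert top_n == order[::-1][:n]
```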

In [11]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=4).fit(X2)

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx))
    print(" ".join([terms[i] for i in topic.argsort()[:-4-1:-1]]))

Topic 0:
wolves boys cooked feet
Topic 1:
bad guy good parties
Topic 2:
seahawks seattle great team
Topic 3:
scientist grab water seattle


#### Probability of a document containing a topic

In [12]:
lda.transform(X2[0])

array([[0.06257622, 0.81221881, 0.06256438, 0.06264059]])
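
Each row returned by `transform()` is a distribution over topics, so its entries sum to approximately 1 and the largest entry marks the dominant topic. A quick check on the values printed above, copied by hand into a plain list:

```python
doc_topics = [0.06257622, 0.81221881, 0.06256438, 0.06264059]  # output shown above

total = sum(doc_topics)   # approximately 1
dominant = max(range(len(doc_topics)), key=lambda i: doc_topics[i])
print(round(total, 6), dominant)
```

Here the dominant topic for the first document is Topic 1, whose top words in the run shown include "bad", "guy", and "good".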

### LDA using gensim

#### Prepare the data

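Convert the list of documents into a list of token lists: lowercase, tokenize, lemmatize, stem, and remove stop words. The tokenizer, WordNet, and stop-word resources must be downloaded beforehand with `nltk.download()`.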

In [13]:
import nltk
from nltk.corpus import stopwords

lemmatizer = nltk.stem.WordNetLemmatizer()
stemmer = nltk.stem.PorterStemmer()

stop_words = set(stopwords.words('english'))  # build once instead of once per token

processed_c4 = []
for doc in c4:
    tokens = nltk.word_tokenize(doc.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [token for token in tokens if token not in stop_words]
    processed_c4.append(tokens)
    
print(processed_c4)

[['good', 'guy', 'bad'], ['foot', 'wolf', 'cook', 'boy', 'girl'], ['good', 'guy', 'bad'], ['drink', 'water', 'parti'], ['grab', 'drink', 'parti'], ['seattl', 'weather', 'bad', 'winter'], ['seattl', 'seahawk', 'great', 'footbal', 'team'], ['love', 'seahawk'], ['learn', 'lot', 'data', 'analyt', 'tool'], ['data', 'scientist']]


#### gensim 

In a terminal, run **pip install gensim**.

1. create the **dictionary**, which maps each unique word across all documents to an integer ID
2. convert each document into the bag-of-words format: a list of (word_id, word_count) 2-tuples
3. train the model with the number of topics **num_topics** and the mapping from word IDs to words **id2word**
4. use **`print_topics(num_topics, num_words)`** to display **num_topics** topics, each as a string of its **num_words** most significant words. Both are optional, with the defaults `num_topics=20, num_words=10`. `num_topics = -1` shows all topics; otherwise the subset of topics shown is arbitrary.
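
Steps 1 and 2 can be imitated in plain Python to see what the bag-of-words format contains. This is a sketch, not gensim's code: it assumes that unseen tokens are given the next free IDs in sorted order within each document, an assumption that reproduces the `bow_c4[1]` output in this section. The names `token2id` and `bows` are used here for illustration only.

```python
from collections import Counter

docs = [["good", "guy", "bad"],
        ["foot", "wolf", "cook", "boy", "girl"]]

token2id = {}
bows = []
for doc in docs:
    counts = Counter(doc)
    # Assumption: unseen tokens get the next free IDs in sorted order.
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    # Bag-of-words: (word_id, word_count) pairs sorted by ID.
    bows.append(sorted((token2id[t], c) for t, c in counts.items()))

print(token2id)
print(bows[1])
```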

In [14]:
import gensim
dictionary = gensim.corpora.Dictionary(processed_c4)
bow_c4 = [dictionary.doc2bow(doc) for doc in processed_c4]
print(bow_c4[1])

lda_model = gensim.models.LdaModel(bow_c4, num_topics=4, id2word=dictionary)

for idx, topic in lda_model.print_topics(-1, 4):
    print('Topic: {} \n Words: {}'.format(idx, topic))


[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
Topic: 0 
 Words: 0.164*"seahawk" + 0.092*"seattl" + 0.092*"great" + 0.091*"footbal"
Topic: 1 
 Words: 0.092*"girl" + 0.091*"boy" + 0.091*"foot" + 0.091*"cook"
Topic: 2 
 Words: 0.117*"parti" + 0.117*"drink" + 0.109*"bad" + 0.066*"good"
Topic: 3 
 Words: 0.094*"bad" + 0.082*"lot" + 0.082*"guy" + 0.082*"analyt"


## Sentiment Analysis

### Install vader_lexicon

In [15]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\ytan\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

### Data

In [16]:
sentences = ["They are smart, cute, and funny.",  # positive sentence example
    "They are smart, cute, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted)
    "They are very smart, cute, and funny.",# booster words handled correctly (sentiment intensity adjusted)
    "They are VERY SMART, cute, and FUNNY.",  # emphasis for ALLCAPS handled
    "They are VERY SMART, cute, and FUNNY!!!",  # combination of signals - VADER appropriately adjusts intensity
    "They are VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
    "The book was good.",  # positive sentence
    "The book was kind of good.",  # qualified positive sentence is handled correctly (intensity adjusted)
    "The plot was good, but the characters are uncompelling and the dialog is not great.",  # mixed negation sentence
    "A really bad, horrible book.",  # negative sentence with booster words
    "At least it isn't a horrible book.",  # negated negative sentence with contraction
    ":) and :D",  # emoticons handled
    "",  # an empty string is correctly handled
    "Today sux",  # negative slang handled
    "Today sux!",  # negative slang with punctuation emphasis handled
    "Today SUX!",  # negative slang with capitalization emphasis
    "Today kinda sux! But I'll get by, lol"  # mixed sentiment example with slang and contrastive conjunction "but"
     ]


### Sentiment analysis with existing classifier

1. create a `SentimentIntensityAnalyzer()`
2. get the polarity scores: neg, neu, pos, and compound. `polarity_scores()` returns a dictionary, which is merged into a dictionary holding the sentence using `update()`.
3. convert the list of dictionaries to a `DataFrame` and display it.
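
The merge in step 2 can be tried without NLTK by using scores copied from the results table for "The book was good.". The `label` helper is a hypothetical addition: it applies the commonly cited convention of treating compound scores of at least 0.05 as positive and at most -0.05 as negative, which is an assumption here and not part of the steps above.

```python
# Scores copied from the results table for "The book was good."
polarity = {"neg": 0.0, "neu": 0.508, "pos": 0.492, "compound": 0.4404}

score = {"sentence": "The book was good."}
score.update(polarity)   # merge the polarity keys into the same dict

def label(compound, threshold=0.05):
    # Assumed convention: near-zero compound scores count as neutral.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

score["label"] = label(score["compound"])
print(score)
```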

In [17]:
sid = SentimentIntensityAnalyzer()

scores = []
for sentence in sentences:
    score = {'sentence': sentence}
    score.update(sid.polarity_scores(sentence))   
    scores.append(score)

df_scores = pd.DataFrame(scores)
df_scores

Unnamed: 0,sentence,neg,neu,pos,compound
0,"They are smart, cute, and funny.",0.0,0.259,0.741,0.8225
1,"They are smart, cute, and funny!",0.0,0.252,0.748,0.8356
2,"They are very smart, cute, and funny.",0.0,0.304,0.696,0.847
3,"They are VERY SMART, cute, and FUNNY.",0.0,0.249,0.751,0.9196
4,"They are VERY SMART, cute, and FUNNY!!!",0.0,0.236,0.764,0.9318
5,"They are VERY SMART, really handsome, and INCR...",0.0,0.294,0.706,0.9469
6,The book was good.,0.0,0.508,0.492,0.4404
7,The book was kind of good.,0.0,0.657,0.343,0.3832
8,"The plot was good, but the characters are unco...",0.327,0.579,0.094,-0.7042
9,"A really bad, horrible book.",0.791,0.209,0.0,-0.8211
