<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Natural Language Processing: Topic Modeling with NMF
              
</p>
</div>

Data Science Cohort Live NYC 2023
<p>Phase 4</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
# standard packages for data analysis and NLP

import numpy as np
import pandas as pd
from copy import deepcopy

#visualization packages
import seaborn as sns
import matplotlib.pyplot as plt

# NLP modules we will use for text normalization
import re #regex 
import nltk # the natural language toolkit
from nltk.tokenize import word_tokenize
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag

# feature construction
from sklearn.feature_extraction.text import TfidfVectorizer #use this to create BoW matrix

**Some special packages**

In [None]:
#pip install pyLDAvis==3.3.1

#conda install -c conda-forge pyldavis


In [None]:
import pyLDAvis.sklearn # a specialized package for topic model visualization

#modeling and dimensionality reduction for visualization
from sklearn.decomposition import NMF 

#### Taking a look at our data

Load our COVID-19 tweet dataset. 
- In general, tweets can be downloaded with the Twitter REST API or the tweepy package.
- Here we load them from a CSV file.

In [None]:
cvid_dataset_orig = pd.read_csv("Data/Corona_NLP_train.csv", encoding='latin-1')
cvid_dataset_orig = cvid_dataset_orig.rename(columns = {'UserName': 'user_name', 'ScreenName': 'screen_name', 'Sentiment': 'sentiment', 'OriginalTweet': 'text', 'TweetAt': 'date', 'Location': 'location'})

In [None]:
cvid_dataset = deepcopy(cvid_dataset_orig).drop(columns = ['sentiment'])
cvid_dataset.head()

There are no nulls in the actual text data.

In [None]:
cvid_dataset.info()

Let's take a closer look at some of our text data:

In [None]:
cvid_dataset['text'].loc[0]

In [None]:
cvid_dataset['text'].loc[15]

In [None]:
cvid_dataset['text'].loc[239]

In [None]:
cvid_dataset['text'].loc[2300]

What are some potential cleaning tasks that you can identify?

#### Preprocess tweet text data

`process_tweet` will be our workhorse function for text cleaning. It preprocesses a single tweet. 

- Regex: removes hashtags, mentions, URLs, line-break characters, etc.
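
The regex step on its own can be sketched on a made-up tweet. This is a minimal illustration, not the notebook's full pipeline: the sample text is hypothetical, and whitespace is collapsed with `split`/`join` here rather than with the chained `replace` calls used in `process_tweet`.

```python
import re

# Hypothetical sample tweet, not taken from the dataset
sample = "Stock up NOW!\r\n@GroceryStore #covid19 shelves are empty https://t.co/abc123"

# Same pattern as in process_tweet: mentions, hashtags, and anything starting with http
pattern = r"@[a-z0-9_]+|#[a-z0-9_]+|http\S+"

lowered = sample.lower()                # the pattern only matches lower-case letters
stripped = re.sub(pattern, "", lowered)
cleaned = " ".join(stripped.split())    # collapse line breaks and repeated spaces

print(cleaned)  # stock up now! shelves are empty
```

Lower-casing first matters because the character classes in the pattern only cover `a-z`.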

In [None]:
# min_length sets the cutoff: a tweet with min_length or fewer tokens after cleaning is returned as an empty string.
def process_tweet(tweet_text, min_length):
    
    # get common stop words that we'll remove during tokenization/text normalization (a set gives fast membership tests)
    stop_words = set(stopwords.words('english'))

    #initialize lemmatizer
    wnl = WordNetLemmatizer()

    # helper function to change nltk's part of speech tagging to a wordnet format.
    def pos_tagger(nltk_tag):
        if nltk_tag.startswith('J'):
            return wordnet.ADJ
        elif nltk_tag.startswith('V'):
            return wordnet.VERB
        elif nltk_tag.startswith('N'):
            return wordnet.NOUN
        elif nltk_tag.startswith('R'):
            return wordnet.ADV
        else:         
            return None
   

    # lower case everything
    tweet_lower = tweet_text.lower()

    #remove mentions, hashtags, and urls; strip whitespace and turn line breaks/tabs into spaces so adjacent words are not glued together
    tweet_lower = re.sub(r"@[a-z0-9_]+|#[a-z0-9_]+|http\S+", "", tweet_lower).strip().replace("\r", " ").replace("\n", " ").replace("\t", " ")
    
    
    # keep alphabetic tokens only (drops punctuation and numbers) and remove stop words
    tweet_norm = [x for x in word_tokenize(tweet_lower) if x.isalpha() and x not in stop_words]

    #  POS detection on the result will be important in telling Wordnet's lemmatizer how to lemmatize
    
    # creates list of tuples with tokens and POS tags in wordnet format
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tag(tweet_norm))) 

    # cutoff: any tokenized document with length <= min_length becomes an empty string and is later removed from the corpus
    if len(wordnet_tagged) <= min_length:
        return ''
    else:
        # rejoins lemmatized sentence, skipping tokens whose POS tag has no wordnet equivalent
        tweet_norm = " ".join([wnl.lemmatize(x[0], x[1]) for x in wordnet_tagged if x[1] is not None])
        return tweet_norm



Apply our text normalization and delete empty tweets. Might take a minute or two.

In [None]:
# anything with number of tokens <= 10 is likely junk. apply takes an args tuple to pass extra function arguments.
cvid_dataset['text'] = cvid_dataset['text'].apply(process_tweet, args = (10,))

#our processing created some empty documents, so we should drop these.


In [None]:
# some documents are short enough that cleaning wiped them out entirely; keep only the non-empty ones.
cvid_dataset_new = cvid_dataset[cvid_dataset['text'] != '']

#### Creating our Bag of Words Term-Document Matrix

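Apply a tf-idf vectorizer to transform the preprocessed text into a term-document matrix. Scikit-learn stores it with one row per tweet and one column per token. The basic weight of token $i$ in document $j$ is
$$ w_{ij} = tf_{ij}\log\left(\frac{N}{df_i}\right) $$
where $tf_{ij}$ is the number of times token $i$ appears in document $j$, $N$ is the number of documents, and $df_i$ is the number of documents containing token $i$. By default, `TfidfVectorizer` uses a smoothed version of the log term and L2-normalizes each row, so its values differ slightly from this formula.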
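
As a hand check of the weighting formula, here is a minimal sketch on a three-document toy corpus. The corpus is hypothetical, and the code implements the plain formula above, not scikit-learn's smoothed and normalized default.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already cleaned and lower-cased)
docs = ["mask price rise", "mask shortage store", "oil price fall"]
tokenized = [d.split() for d in docs]

N = len(tokenized)  # number of documents
# df: number of documents that contain each token
df = Counter(tok for doc in tokenized for tok in set(doc))

# w_ij = tf_ij * log(N / df_i), stored as one dict per document
weights = []
for doc in tokenized:
    tf = Counter(doc)
    weights.append({tok: tf[tok] * math.log(N / df[tok]) for tok in tf})

for doc, w in zip(docs, weights):
    print(doc, {tok: round(val, 3) for tok, val in w.items()})
```

Tokens that occur in several documents ("mask", "price") get a lower weight than tokens unique to one document ("rise", "oil"), which is the point of the idf term.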

In [None]:
corpus = cvid_dataset_new['text']
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
X_train

X_train is stored as a sparse matrix (which saves space and time), so it cannot be viewed directly. 
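
A sparse matrix can still be inspected by looking at its shape, counting its stored entries, or converting a small slice to a dense array. The sketch below uses a tiny hypothetical matrix in place of `X_train` and assumes SciPy is available (scikit-learn depends on it).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 3 stand-in for the tf-idf matrix (rows: documents, columns: tokens)
X_toy = csr_matrix(np.array([[0.0, 0.5, 0.0],
                             [0.7, 0.0, 0.7],
                             [0.0, 0.0, 1.0]]))

n_rows, n_cols = X_toy.shape
density = X_toy.nnz / (n_rows * n_cols)  # fraction of entries that are non-zero
dense_head = X_toy[:2].toarray()         # only densify a small slice

print(X_toy.shape, X_toy.nnz, round(density, 3))
print(dense_head)
```

Calling `.toarray()` on the full `X_train` would allocate every zero explicitly, so slicing first is the safer habit.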

- Number of unique tokens in our term frequency matrix:

In [None]:
len(vectorizer.get_feature_names_out())

#### Topic Modeling

Once the data is in the right form, scikit-learn makes NMF topic modeling as easy as this:
- fit with 5 topics

In [None]:
topic_model = NMF(n_components = 5, random_state = 42) # fix the seed so reruns give the same topics
topic_model.fit(X_train)

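Remember that NMF factorizes the matrix approximately. With documents as the rows of `X_train`, and with the variable names used in the next cell: $$ X \approx HW $$

Our model has fitted W and H, so we can get these components independently.
- $ W $ encodes the importance of each token in the fitted topics (one row per topic, one column per token). 
- $ H $ encodes the weight of the fitted topics for each document (one row per document, one column per topic). 

Note that scikit-learn's own documentation swaps the letters: it calls the `transform` output W and `components_` H.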

In [None]:
# to get H
H = topic_model.transform(X_train) # transform document into topic vector representation

# to get W 
W = topic_model.components_ # word component weights for each topic

print("Shape of H is " + str(H.shape))
print("Shape of W is " + str(W.shape))
print("Shape of X_train is " + str(X_train.shape))

- Remember that there are 29,178 tweets in our dataset. 
- Vectorizer created 25,788 features with varying importance. 


**Dimensions and our interpretations of W and H make sense.** 
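
To see why the shapes fit together, here is a minimal NumPy sketch of NMF on a toy matrix, using the same names as the cell above (`H`: documents by topics, `W`: topics by tokens). It uses the classic multiplicative update rules as an illustration. That is an assumption made for brevity, not what the notebook runs: scikit-learn's `NMF` defaults to a coordinate descent solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative data: 6 documents, 8 tokens
X = rng.random((6, 8))
n_topics = 2

# Random non-negative starting factors
H = rng.random((6, n_topics))   # document-by-topic weights
W = rng.random((n_topics, 8))   # topic-by-token weights
eps = 1e-10                     # avoids division by zero

err_start = np.linalg.norm(X - H @ W)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative
    H *= (X @ W.T) / (H @ W @ W.T + eps)
    W *= (H.T @ X) / (H.T @ H @ W + eps)
err_end = np.linalg.norm(X - H @ W)

print("shapes:", H.shape, W.shape, (H @ W).shape)
print("reconstruction error:", round(err_start, 3), "->", round(err_end, 3))
```

The product `H @ W` has the same shape as `X`, and the reconstruction error shrinks as the factors are refined, which is all that the "approximately equal" in the factorization means.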


#### The W matrix

Let's take a look at the tokens (columns of W) with the highest weight for each topic (rows of W):
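
The top-token lookup relies on `argsort`, which returns the indices that would sort a row in ascending order. A small sketch with a hypothetical five-token vocabulary and one made-up row of W:

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights (one row of W)
feature_names = np.array(["food", "mask", "oil", "price", "store"])
topic_weights = np.array([0.1, 0.9, 0.0, 0.4, 0.7])

top_n = 3
# Slice the ascending argsort from the end, stepping backwards: largest weight first
top_idx = topic_weights.argsort()[: -top_n - 1 : -1]
top_tokens = feature_names[top_idx].tolist()

print(top_tokens)  # ['mask', 'store', 'price']
```

Slicing with `[-top_n:]` instead returns the same tokens in ascending order of weight, so the strongest token is printed last.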

In [None]:
# weights of every token for topic 0
print(W[0])
print(len(W[0]))

In [None]:
for index,topic in enumerate(W):
    print(f'THE TOP 25 WORDS FOR TOPIC #{index}')
    print([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-25:]])
    print('\n')

It's often helpful to make a bar visualization of the most relevant token weights for each topic.

In [None]:
%%capture topic_word_plot
def plot_top_words(W, feature_names, n_top_words, title):
    fig, axes = plt.subplots(1, 5, figsize=(15, 8), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(W):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 20})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=15)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

    # set the figure title once, after all panels are drawn
    fig.suptitle(title, fontsize=25)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

n_top_words = 20
tfidf_feature_names = vectorizer.get_feature_names_out()
plot_top_words(W, tfidf_feature_names, n_top_words, "Topics in NMF model")

In [None]:
topic_word_plot()

#### PyLDAvis: an excellent tool for visualizing topic models

In [None]:
vis = pyLDAvis.sklearn.prepare(topic_model, X_train,vectorizer)
pyLDAvis.display(vis)
pyLDAvis.save_html(vis, 'nmf_topics.html')

Based on this, let's label these topics with names:

In [None]:
topic_name_dict = {0: 'essential_worker', 1: 'oil_market_crisis', 2: 'supply_shortage', 3: 'sanitizing_products', 4: 'shopping'}

#### The H matrix

So far, we have looked at the word distributions in each topic and evaluated topic similarity/difference. 
- This was all in the matrix $W$. 

But what about $H$? 
- $H$ contains information about breakdowns of topics in each document. Let's see this in action.
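
Before applying this to real tweets, here is a minimal sketch of how a row of H can be read. The weights are made up for illustration and are not model output. Dividing each row by its sum gives topic shares, and `argmax` gives the dominant topic.

```python
import numpy as np

# Hypothetical topic weights for 3 documents and 3 topics
H_toy = np.array([[0.2, 0.6, 0.2],
                  [0.0, 0.1, 0.3],
                  [0.5, 0.0, 0.0]])

# Normalize each row so the topic weights of a document sum to 1
row_sums = H_toy.sum(axis=1, keepdims=True)
H_share = H_toy / row_sums

# Hard assignment: index of the largest weight in each row
dominant_topic = H_toy.argmax(axis=1)

print(H_share)
print(dominant_topic)  # [1 2 0]
```

A document whose weights are all zero would produce NaN shares after the division, so such rows need to be handled separately.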

In [None]:
# takes the index label of one tweet, prints its original text, and plots its topic weight vector
def tweet_topbreakdown(locator):

    print(cvid_dataset_orig.loc[locator].text)
    int_index = cvid_dataset_new.index.get_loc(locator)

    topic_keys = topic_name_dict.values()
    zipped_tuple = list(zip(topic_keys, list(H[int_index,:])))

    topic_breakdown = pd.DataFrame(zipped_tuple, columns = ['Topic', 'Weight']).set_index(['Topic'])
    topic_breakdown['Normalized weight'] = topic_breakdown['Weight']/topic_breakdown['Weight'].sum()

    sns.barplot(y = topic_breakdown.index, x = 'Normalized weight', data = topic_breakdown)
    plt.title("Distribution of topics for tweet no. " + str(locator))
    plt.show()

    return display(topic_breakdown)
    

In [None]:
tweet_loc_list = [5,10,115, 320]
g = list(map(tweet_topbreakdown, tweet_loc_list))


### TSNE: A way to visualize our documents by topic in 2D

TSNE (or t-distributed stochastic neighbor embedding) is a way to take high-dimensional data and embed it into 2D for visualization. The technique is good at helping to identify clusters or neighborhoods in text data. 

Scales and distances in the embedding don't mean too much, but clustering and closeness do.

It is stochastic: each run gives a different 2D embedding unless `random_state` is fixed.

Scikit-learn makes it easy: `TSNE` has a `fit_transform` method but no separate `transform`, so new points cannot be projected into an existing embedding.
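
The two properties above can be checked on toy data. The sketch below uses a hypothetical random matrix as a stand-in for H. It sets `init="random"` so that the effect of the seed is visible, and a small `perplexity` because the toy set has only 30 rows. Both settings are choices for this illustration, not settings used in the notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
toy = rng.random((30, 5))  # hypothetical stand-in: 30 documents, 5 topic weights

# Same data, two different seeds
emb_a = TSNE(n_components=2, perplexity=5, init="random", random_state=0).fit_transform(toy)
emb_b = TSNE(n_components=2, perplexity=5, init="random", random_state=1).fit_transform(toy)

print(emb_a.shape)                  # one 2D point per document
print(np.allclose(emb_a, emb_b))    # compares the embeddings from the two seeds
print(hasattr(TSNE, "transform"))   # no out-of-sample transform method
```

`perplexity` must be smaller than the number of samples, which is why the default value is lowered here.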

In [None]:
from sklearn.manifold import TSNE # T-distributed Stochastic Neighbor Embedding

In [None]:
tsne = TSNE(random_state=42, learning_rate=100)
tsne_trans = tsne.fit_transform(H)
tsne_trans = pd.DataFrame(tsne_trans, columns = ['TSNE1', 'TSNE2'])

In [None]:
# for each document takes the topic with highest weight and assigns document to this class -- hard clustering.
tsne_trans['class'] = np.argmax(H, axis = 1)
tsne_trans['class'] = tsne_trans['class'].replace(topic_name_dict)

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x = 'TSNE1', y = 'TSNE2', hue = 'class', data = tsne_trans, palette = 'tab10')
plt.title('Visualization of COVID-19 tweet topic segmentation')
plt.show()

#### Finishing up and next steps

We have found a higher level representation of the data, encoded in $H$:

- The algorithm has learned concepts and now we want to do analytics on these concepts.

First, let's reshape the data into a normalized dataframe.

In [None]:
H_repres_norm = pd.DataFrame(H, columns = topic_name_dict.values(), index = cvid_dataset_new.index)
H_repres_norm = H_repres_norm.divide(H_repres_norm.sum(axis=1), axis=0)

In [None]:
H_repres_norm.head()

Let's join this to the rest of the dataset and clean up a little bit.

In [None]:
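# drop only the tweets that have no topic weights (removed during cleaning); a bare dropna() would also drop rows with a null in any other column
embedded_tweets_df = cvid_dataset.join(H_repres_norm).dropna(subset = list(H_repres_norm.columns))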
embedded_tweets_df.head()

In [None]:
embedded_tweets_df.to_csv('cvid_embedded_train.csv')

What kind of tasks could we do or things could we learn from the data in this representation?