## Assignment 8: Twitter Topic Modeling with Non-negative Matrix Factorization.

### Due: Thursday, June 4th, 11:59 pm on Gradescope.

In this homework you will practice extracting topics from tweets using the relatively simple Non-negative Matrix Factorization (NMF) method. This method assumes every tweet is a combination of several topics weighted by their prevalence in the text. In effect, this approach finds a low-dimensional representation of the tweets (through the topic weights).
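As a rough sketch of what "a combination of several topics" means in matrix terms (the symbols $X$, $W$, $H$, $n$, $d$ and $k$ are notation introduced here for illustration, not taken from the assignment), the factorization can be written as:

$$
X \approx W H, \qquad W \ge 0, \quad H \ge 0,
$$

where $X \in \mathbb{R}^{n \times d}$ holds one row per tweet and one column per word, $W \in \mathbb{R}^{n \times k}$ holds the weight of each of the $k$ topics in each tweet, and $H \in \mathbb{R}^{k \times d}$ holds the weight of each word in each topic. Assuming the Frobenius loss and ignoring any regularization terms, the factors are chosen to approximately solve

$$
\min_{W \ge 0,\, H \ge 0} \; \tfrac{1}{2} \, \lVert X - W H \rVert_F^2 .
$$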

The dataset was obtained from https://www.kaggle.com/smid80/coronavirus-covid19-tweets-late-april?select=2020-04-30+Coronavirus+Tweets.CSV and the preprocessing was borrowed from https://www.kaggle.com/satanizer/covid-19-tweets-analysis. It consists of tweets that contain hashtags related to the Coronavirus. For computational speed we will analyze the data from a single day: April 30, 2020. We encourage you to explore this dataset further and see how topics change over time.


Fill in the cells provided marked `TODO` with code to answer the questions. **Unless otherwise noted, every answer you submit should have code that clearly shows the answer in the output.** Answers submitted that do not have associated code that shows the answer may not be accepted for credit. 

**Make sure to restart the kernel and run all cells** (especially before turning it in) to make sure your code runs correctly. Answer the questions on Gradescope and make sure to download this file once you've finished the assignment and upload it to Canvas as well.

> Copyright ©2020 Valentina Staneva.  All rights reserved.  Permission is hereby granted to students registered for University of Washington CSE/STAT 416 for use solely during Spring 2020 for purposes of the course.  No other use, copying, distribution, or modification is permitted without prior written consent. Copyrights for third-party components of this work must be honored.  Instructors interested in reusing these course materials should contact the author.

---



In [0]:
import pandas as pd

# string manipulation libraries
import string
import re

# text processing libraries
import nltk
from nltk.corpus import words
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

---
### Data Loading

First let's read the dataset into a data frame and have a look at what is there.

In [87]:
data = pd.read_csv('https://raw.githubusercontent.com/valentina-s/cse-stat-416-sp20/master/data/2020-04-30_Coronavirus_Tweets_small.csv')
data.head()



Unnamed: 0,text,is_quote,is_retweet,retweet_count,country_code,place_full_name,place_type,verified,lang
0,"Asegura sus beneficios, registra a tu esposa e...",False,False,0.0,,,,False,es
1,"#COVID19 | El Faro conversó con policías, un f...",False,False,11.0,,,,True,es
2,"Si ya era cuestionable la burocracia, lo es má...",False,False,1.0,,,,False,es
3,Las medidas de higiene ayudan a reducir la pro...,False,False,38.0,,,,True,es
4,Cubre tu nariz y boca al estornudar con el áng...,False,False,0.0,,,,False,es


---
### Text Preprocessing

First, we will do several text preprocessing steps. We will:
* limit to English language
* remove URL links
* make lower case
* remove punctuation
* remove stopwords

In [0]:
# select tweets in English
text_en = data['text'][data['lang']=='en']

In [0]:
# remove URL links
text_en_lr = text_en.apply(lambda x: re.sub(r"https\S+", "", str(x)))

In [0]:
# make lower case
text_en_lr_lc = text_en_lr.apply(lambda x: x.lower())

In [0]:
# remove punctuation
text_en_lr_lc_pr = text_en_lr_lc.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

In [92]:
# remove stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
stop_words = set(stopwords.words('english'))
stop_words.update(['#coronavirus', '#coronavirusoutbreak', '#coronavirusPandemic', '#covid19', '#covid_19', '#epitwitter', '#ihavecorona', 'amp', 'coronavirus', 'covid19','covid-19', 'covidー19'])

text_en_lr_lc_pr_sr = text_en_lr_lc_pr.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

### TF-IDF Matrix

Remember that matrix factorization methods operate on matrices of numbers, not text, so we need to convert the text into a meaningful numeric representation. Earlier we discussed Term Frequency-Inverse Document Frequency (TF-IDF) as a good way to do that: it defines a word weight vector for each document and down-weights words that appear in most documents, such as `the` or `a`. We can compute it using `scikit-learn`.
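As a small illustration of the down-weighting, the sketch below fits a `TfidfVectorizer` with default settings on three made-up tweets (the corpus and variable names are assumptions for this example). A word that appears in every document should receive a smaller inverse document frequency than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus: "cases" is in all three documents, "vaccine" in only one.
toy_tweets = [
    "cases rise today",
    "cases fall today",
    "cases vaccine news",
]

toy_vectorizer = TfidfVectorizer()
tf_idf_toy = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each word to its column index; idf_ holds one IDF value per column
col = toy_vectorizer.vocabulary_
idf = toy_vectorizer.idf_

print(tf_idf_toy.shape)  # (number of documents, number of distinct words)
print("idf of 'cases':  ", idf[col["cases"]])
print("idf of 'vaccine':", idf[col["vaccine"]])
```

The rows of the resulting sparse matrix are the documents and the columns are the words, which is the same orientation as the `tf_idf` matrix built from the tweets.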

In [0]:
# create TF-IDF matrix
vectorizer = TfidfVectorizer(max_df=0.95)  # ignore words with very high doc frequency
tf_idf = vectorizer.fit_transform(text_en_lr_lc_pr_sr)

# also extract the words so that we know which feature corresponds to which word
# (note: scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2)
feature_names = vectorizer.get_feature_names()

In [95]:
# check out the shape
tf_idf.shape

(198579, 255511)

**Question 1** (enter answers on Gradescope)

What is the number of observations? What is the number of features? Which dimension are we going to reduce?

### Non-negative Matrix Decomposition for Topic Discovery

Next we will use the [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) method from `scikit-learn` to extract the topics.

In [0]:
from sklearn.decomposition import NMF

Set up an NMF model with 5 components. So that we all get the same results, please pass these parameters `init = 'nndsvd'` and `random_state = 1`.

In [0]:
#TODO 
# define an NMF model with `n_components = 5`
nmf = NMF(n_components=5, init='nndsvd', random_state=1)

Now use the `nmf.fit` method to obtain the factorization. Note you do not need to change any default parameters, but you have to ensure your matrix is passed in the right format, i.e. `#observations x #features`.

In [98]:
# TODO fit NMF to the TF-IDF matrix
nmf.fit(tf_idf)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=1, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

The topics are stored in `nmf.components_`, one row per topic, so you can find the weight of each word within a topic. It is interesting to look at the words of each topic ordered from highest to lowest weight. Remember that the words are stored in `feature_names`, while the weights are stored in `nmf.components_`. You can use the [`argsort()`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) function to get the indices that sort the weights. Note that `argsort` sorts from lowest to highest, so the words with the highest weights are at the end. You can reverse a list/array with `[::-1]`.
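The following sketch shows how `argsort` and `[::-1]` behave on a made-up weight vector. The names `toy_words` and `toy_topic` and their values are hypothetical and unrelated to the fitted model.

```python
import numpy as np

# Hypothetical vocabulary and the weights of one toy "topic" over it
toy_words = ["home", "cases", "help", "china"]
toy_topic = np.array([0.2, 1.5, 0.0, 0.7])

ascending = np.argsort(toy_topic)   # indices from lowest to highest weight
descending = ascending[::-1]        # reversed: highest weight first

print(ascending)                             # [2 0 3 1]
print(descending)                            # [1 3 0 2]
print([toy_words[i] for i in descending])    # ['cases', 'china', 'home', 'help']

# argmax gives the index of the single largest weight directly
print(toy_words[np.argmax(toy_topic)], toy_topic.max())   # cases 1.5
```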

Find the maximum weight of a word in the first topic, and the word which corresponds to it.


In [99]:
#TODO
import numpy as np
topics = nmf.components_
max_weight = max(topics[0])
max_word = feature_names[np.argmax(topics[0])]
print("max weight:",max_weight,"\nmax_word:",max_word)

max weight: 2.178419059331134 
max_word: people


**Question 2.1-2.2**

What is the maximum weight of a word in the first topic? What is the word associated with it?

Create a function `words_from_topic` to extract an ordered list of words in a topic (highest weight first).

In [0]:
#TODO
def words_from_topic(topic, feature_names):
  # write a loop to create the list of ordered words (highest weight first)
  # (you can use a list comprehension if you are familiar with them)
  ordered_words = [feature_names[i] for i in np.argsort(topic)[::-1]]
  return(ordered_words)

Now you can use the function below to look at all the topics.



In [0]:
def print_top_words(components, feature_names, n_top_words):
    """ 
    print_top_words prints the first n_top_words for each topic in components
    """
    for topic_idx, topic in enumerate(components):
        ordered_words = words_from_topic(topic, feature_names)
        message = "Topic #%d: " % (topic_idx+1)
        message += ", ".join(ordered_words[:n_top_words])
        print(message)

In [102]:
print_top_words(nmf.components_, feature_names, 10)

Topic #1: people, lockdown, get, home, stay, like, one, time, know, day
Topic #2: cases, new, deaths, total, confirmed, reported, number, positive, reports, today
Topic #3: help, spread, app, selfreporting, symptoms, download, sooner, identify, slow, daily
Topic #4: us, china, join, trump, let, drug, says, million, realdonaldtrump, deaths
Topic #5: pandemic, health, support, help, crisis, workers, global, news, read, response


**Question 2.3:** (answer on Gradescope)

What is the 4th word of the 4th topic?

Next let's look at a specific tweet and the individual contributions of the topics. For that we need the coordinates of each tweet in topic space, i.e. the transformed tf-idf features. They can be obtained through the `nmf.fit_transform` method.
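To make the shapes concrete, here is a minimal sketch on a made-up 4 x 4 non-negative matrix in which the first two rows only use the first two columns and the last two rows only use the last two. The matrix, the variable names, and the choice of `max_iter=500` are assumptions for this toy example. `fit_transform` returns one row of topic weights per observation, while `components_` holds one row of feature weights per topic.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy "documents x words" matrix with two obvious groups of rows (made-up data)
X_toy = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=1, max_iter=500)
W_toy = toy_nmf.fit_transform(X_toy)   # shape: (observations, topics)
H_toy = toy_nmf.components_            # shape: (topics, features)

# The topic with the largest weight in each row
dominant = W_toy.argmax(axis=1)

print(W_toy.shape, H_toy.shape)
print("dominant topic per row:", dominant)
```

Rows that use the same columns should end up with the same dominant topic, which is the same reasoning used to associate a tweet with the topic that has the largest weight in its row.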

In [0]:
#TODO
tweets_projected = nmf.fit_transform(tf_idf)

In [104]:
tweets_projected.shape

(198579, 5)

In [105]:
text_en_lr_lc_pr_sr.iloc[0]

'attention seattle shoppers grocery stores working hard keep employees customers safe part help slow spread ☑️ limit trips ☑️ respect special shopping hours ☑️ follow socialdistance guidance stores wegotthisseattle'

In [106]:
# TODO find the weight of topic 3 in the first tweet
tweets_projected[0,2]

0.02597223727522194

In [107]:
text_en_lr_lc_pr_sr.iloc[1]

'microsoft sees digital reboot pandemic profits'

In [110]:
#TODO look at the weights for the second tweet and decide which topic it is associated with
print("weights:", tweets_projected[1])
print("-- most associated with topic", np.argmax(tweets_projected[1])+1)

weights: [0.         0.         0.         0.         0.02828405]
-- most associated with topic 5


**Question 3.1:** (answer on Gradescope)

What is the weight of topic 3 in the first tweet?

**Question 3.2:** (answer on Gradescope)

Which topic is tweet 2 associated with?