#### NLP | MVP

# Coronavirus Tweets: April 2020<a id='top'></a> 

## **Analysis Goal**  
The client, the [Centers for Disease Control and Prevention](https://www.cdc.gov/coronavirus/2019-ncov/index.html), wants to understand what types of information spread via Twitter early in the COVID-19 pandemic in the United States, in order to better inform the communication strategy for future pandemics. The goal of this preliminary analysis is to explore which topics appeared in the April 2020 tweets. 

**RQ:** What were Americans tweeting about coronavirus and COVID-19 in April 2020? 

## **Process**
**Data source:** 
Coronavirus COVID-19 Tweets, [early](https://www.kaggle.com/datasets/smid80/coronavirus-covid19-tweets-early-april) and [late](https://www.kaggle.com/datasets/smid80/coronavirus-covid19-tweets-late-april) April 2020.  
The corpus is filtered to English-language tweets with a US country code (n=138,789).  
Text preprocessing lowercased the text and removed tokens containing digits, punctuation, emojis, and other non-ASCII characters. 

**Models:** 
**PCA** returned the component terms:
* C0: https, coronavirus, new, stayhome, pandemic, covid, quarantine, socialdistancing, stayathome, thank
* C1: coronavirus, trump, people, covid, cases, realdonaldtrump, quarantine, just, pandemic, deaths
* C2: amp, coronavirus, people, trump, need, health, pandemic, help, work, realdonaldtrump
* C3: people, https, home, help, need, just, stay, know, cases, dont
* C4: new, cases, deaths, covid, today, day, state, home, just, health

**NMF** returned the topic terms:
* T0: https, thank, pandemic, coronaviruspandemic, great, thanks, stayhomestaysafe, join, check, support
* T1: coronavirus, trump, lockdown, death, realdonaldtrump, news, covid, million, china, pandemic
* T2: amp, thank, health, today, support, work, workers, help, need, community
* T3: people, just, pandemic, like, coronaviruspandemic, realdonaldtrump, trump, time, need, help
* T4: new, quarantine, stayhome, covid, day, home, today, stay, cases, york

## **Preliminary Conclusions**

The NMF topic terms suggest that Topic 0 is likely about urging people to stay home, Topics 1 and 3 are likely about the president's response to the pandemic, Topic 2 is likely about supporting and thanking frontline workers, and Topic 4 is likely about the high toll of cases in New York. 

Next steps are to explore the topics more closely, add stop words (such as `https` and `amp`, which appear to be URL and `&amp;` residue), stem and/or lemmatize terms, and finally name the topics. 

In [44]:
import glob 
import nltk
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import re

import seaborn as sns
import string
pd.set_option('display.max_colwidth', None)
%matplotlib inline
%config InlineBackend.figure_formats = ['retina']  # or svg
sns.set(context='notebook', style='whitegrid')

from cleantext import clean
# from itertools import cycle
# from nltk.tokenize import word_tokenize, RegexpTokenizer
# from nltk.util import ngrams
# from sklearn import svm
from sklearn.decomposition import PCA, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import confusion_matrix, accuracy_score
# from sklearn.model_selection import train_test_split
# from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 


## 1 | Dataset: Coronavirus Tweets ([early](https://www.kaggle.com/datasets/smid80/coronavirus-covid19-tweets-early-april) and [late](https://www.kaggle.com/datasets/smid80/coronavirus-covid19-tweets-late-april) April) <a id='1'></a>  

In [2]:
# import one CSV file to inspect the data
single_df = pd.read_csv('./raw_data/20200401.csv')

In [3]:
single_df.head(2)

Unnamed: 0,status_id,user_id,created_at,screen_name,text,source,reply_to_status_id,reply_to_user_id,reply_to_screen_name,is_quote,...,retweet_count,country_code,place_full_name,place_type,followers_count,friends_count,account_lang,account_created_at,verified,lang
0,1245138808619724800,2722502906,2020-04-01T00:00:00Z,GradaNorteMX,"Cuando mejor iban las cosas en el circuito de tenis universitario de Estados Unidos, el #sonorense Alán Rubio volvió a Hermosillo ante la pandemia del #COVID_19 😕🇲🇽🇺🇸\n\n🎾 https://t.co/SldPvrP81A https://t.co/K7BVA6LV94",TweetDeck,,,,False,...,0,,,,1847,252,,2014-08-10T21:20:32Z,False,es
1,1245138810071142405,817072420947247104,2020-04-01T00:00:00Z,Tu_IMSS_Coah,"El #Coronavirus se transmite de una persona infectada a otras a través de gotitas de saliva, acata las reglas de etiqueta | #EnfermedadesRespiratorias #PrevenciónCoronavirus \n#QuédateEnCasa #COVID19 #SanaDistancia\n#MéxicoUnido #IMSSolidario https://t.co/3F3rGiXjQm",TweetDeck,,,,False,...,2,,,,1576,169,,2017-01-05T18:17:00Z,False,es


In [4]:
single_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 591480 entries, 0 to 591479
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   status_id             591480 non-null  int64  
 1   user_id               591480 non-null  int64  
 2   created_at            591480 non-null  object 
 3   screen_name           591480 non-null  object 
 4   text                  591480 non-null  object 
 5   source                591478 non-null  object 
 6   reply_to_status_id    69502 non-null   float64
 7   reply_to_user_id      84914 non-null   float64
 8   reply_to_screen_name  84914 non-null   object 
 9   is_quote              591480 non-null  bool   
 10  is_retweet            591480 non-null  bool   
 11  favourites_count      591480 non-null  int64  
 12  retweet_count         591480 non-null  int64  
 13  country_code          26040 non-null   object 
 14  place_full_name       26116 non-null   object 
 15  

In [5]:
# keep only 7 columns 

#  2   created_at            591480 non-null  object 
#  3   screen_name           591480 non-null  object 
#  4   text                  591480 non-null  object 
#  13  country_code          26040 non-null   object 
#  18  account_lang          0 non-null       float64
#  20  verified              591480 non-null  bool   
#  21  lang                  591480 non-null  object 

In [7]:
# read all April 2020 files, keeping only the columns listed above

path = r'/Users/sandraparedes/Documents/GitHub/metis_dsml/05_nlp/g00-nlp-project/raw_data' 
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, usecols=[2,3,4,13,18,20,21])
    li.append(df)
df = pd.concat(li, axis=0, ignore_index=True)
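The loop above loads all 12.8 million rows before any filtering happens. A minimal sketch of an alternative, assuming every daily file carries the column names shown in `single_df.info()`: select columns by name and apply the English/US filter per file, so only the rows of interest are kept in memory. The helper name `load_us_english` and the constant `KEEP_COLS` are hypothetical, not part of the original notebook.

```python
import glob
import os

import pandas as pd

# Columns kept by the notebook, referenced by name instead of position.
KEEP_COLS = ['created_at', 'screen_name', 'text', 'country_code',
             'account_lang', 'verified', 'lang']


def load_us_english(data_dir):
    """Read every CSV in data_dir, keeping only English-language US rows."""
    frames = []
    for filename in sorted(glob.glob(os.path.join(data_dir, '*.csv'))):
        day = pd.read_csv(filename, usecols=KEEP_COLS)
        # Same filter the notebook applies later, applied per file here.
        mask = (day['country_code'] == 'US') & (day['lang'] == 'en')
        frames.append(day.loc[mask])
    return pd.concat(frames, ignore_index=True)
```

Selecting by name also guards against a file whose column order differs from the first one inspected.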

## 2 | Exploratory Data Analysis<a id='2'></a>  

### Corpus Selection <a id='2'></a>  

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12786782 entries, 0 to 12786781
Data columns (total 7 columns):
 #   Column        Dtype  
---  ------        -----  
 0   created_at    object 
 1   screen_name   object 
 2   text          object 
 3   country_code  object 
 4   account_lang  float64
 5   verified      bool   
 6   lang          object 
dtypes: bool(1), float64(1), object(5)
memory usage: 597.5+ MB


In [9]:
df.head(2)

Unnamed: 0,created_at,screen_name,text,country_code,account_lang,verified,lang
0,2020-04-06T00:00:00Z,EricSchneiderMD,The Pearl Harbor metaphor for #Covid_19 is an odd choice. Pearl Harbor was the start of 3 years of US mobilizing after trying to stay out of WWII.,,,False,en
1,2020-04-06T00:00:00Z,kubofinanciero,🟩 Para nosotros cada celular o computadora es una sucursal de kubo.financiero. Inicia un plan de inversión o pide un préstamo desde tu casa #ConectemosAúnMás 👉 https://t.co/p3eaYnigMZ #kubofinanciero #kubo #fintech #COVID19 https://t.co/xdIvsvXL5x,,,False,es


In [10]:
df['country_code'].unique()

array([nan, 'ZA', 'US', 'PT', 'PK', 'IN', 'BR', 'IT', 'TR', 'MX', 'BE',
       'GB', 'LK', 'DE', 'GT', 'TH', 'CL', 'DO', 'UY', 'AR', 'CA', 'SV',
       'CU', 'JP', 'CO', 'NG', 'ES', 'PY', 'PH', 'SA', 'KE', 'NZ', 'PA',
       'MY', 'JO', 'RS', 'AU', 'EC', 'HN', 'PL', 'SG', 'VE', 'PE', 'TT',
       'NI', 'CN', 'OM', 'MV', 'ID', 'JM', 'IE', 'GH', 'LV', 'KH', 'BO',
       'IL', 'NL', 'SN', 'GU', 'TW', 'NP', 'FJ', 'PF', 'UG', 'VN', 'BN',
       'LY', 'IR', 'CD', 'HK', 'BD', 'GN', 'FR', 'DK', 'CR', 'AE', 'HT',
       'HU', 'RU', 'KR', 'CH', 'IQ', 'TL', 'BH', 'PG', 'KZ', 'MA', 'KW',
       'PR', 'SE', 'GR', 'AT', 'SZ', 'FI', 'RW', 'QA', 'GE', 'NO', 'CM',
       'ET', 'MQ', 'BS', 'SC', 'LC', 'CZ', 'TZ', 'LB', 'MZ', 'EG', 'UA',
       'SO', 'ZW', 'CY', 'SL', 'MU', 'SY', 'BA', 'NC', 'BJ', 'RE', 'GP',
       'CG', 'GG', 'AF', 'LS', 'ZM', 'BW', 'AZ', 'AO', 'MW', 'ME', 'RO',
       'MT', 'HR', 'IS', 'SR', 'MN', 'AD', 'CI', 'IM', 'BY', 'DZ', 'BI',
       'MC', 'GQ', 'TG', 'AL', 'GI', 'ML', 'TN', 'MM

In [11]:
df['lang'].unique()

array(['en', 'es', 'in', 'th', 'und', 'pt', 'tl', 'ar', 'it', 'uk', 'ca',
       'hu', 'fr', 'ja', 'ko', 'el', 'tr', 'fi', 'de', 'ru', 'lt', 'zh',
       'eu', 'fa', 'nl', 'ur', 'bn', 'sv', 'hi', 'ro', 'ht', 'ckb', 'vi',
       'pl', 'sl', 'ta', 'ne', 'et', 'cy', 'da', 'si', 'ps', 'lv', 'cs',
       'iw', 'mr', 'ml', 'kn', 'te', 'no', 'or', 'gu', 'pa', 'am', 'lo',
       'dv', 'sr', 'is', 'ka', 'sd', 'bg', 'my', 'hy', 'km', 'bo', 'ug'],
      dtype=object)

In [12]:
print('English entries:', (df['lang'] == 'en').sum())

English entries: 7128121


In [13]:
print('US entries:', (df['country_code'] == 'US').sum())

US entries: 153660


In [14]:
eng_us_df = df[(df['country_code'] == 'US') &
               (df['lang'] == 'en')]
eng_us_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 138789 entries, 141 to 12786770
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   created_at    138789 non-null  object 
 1   screen_name   138788 non-null  object 
 2   text          138789 non-null  object 
 3   country_code  138789 non-null  object 
 4   account_lang  0 non-null       float64
 5   verified      138789 non-null  bool   
 6   lang          138789 non-null  object 
dtypes: bool(1), float64(1), object(5)
memory usage: 7.5+ MB


In [15]:
eng_us_df.head(2)

Unnamed: 0,created_at,screen_name,text,country_code,account_lang,verified,lang
141,2020-04-06T00:00:05Z,WFMGINC,....#SUNDAYFUNDAY #coronavirus style #vino cheers 🍷 https://t.co/SrymChBkq2,US,,False,en
234,2020-04-06T00:00:14Z,jpomietlasz,"This pandemic has confirmed my worst fears, most people don’t know how to make entertaining videos. #Covid_19 #SinceIveBeenQuarantined #AmericasUnfunniestVideos #WrestleMania #tonyaharding",US,,False,en


#### Corpus = English-language US tweets (n=138,789)

In [16]:
# save corpus selection as tweet_df
tweet_df = eng_us_df 
tweet_df.to_pickle('./raw_data/tweet_df.pkl')
tweet_df.to_csv(r'//Users/sandraparedes/Documents/GitHub/metis_dsml/05_nlp/g00-nlp-project/raw_data/tweet_df.csv', index=False)


### Text Preprocessing<a id='tp'></a>  

In [17]:
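# read in corpus from CSV
# note: df.info() further down reports 138,796 rows after this round trip,
# 7 more than the 138,789 saved (possibly tweets split at embedded line breaks)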
df = pd.read_csv('./raw_data/tweet_df.csv', low_memory=False)
df.head(2)

Unnamed: 0,created_at,screen_name,text,country_code,account_lang,verified,lang
0,2020-04-06T00:00:05Z,WFMGINC,....#SUNDAYFUNDAY #coronavirus style #vino cheers 🍷 https://t.co/SrymChBkq2,US,,False,en
1,2020-04-06T00:00:14Z,jpomietlasz,"This pandemic has confirmed my worst fears, most people don’t know how to make entertaining videos. #Covid_19 #SinceIveBeenQuarantined #AmericasUnfunniestVideos #WrestleMania #tonyaharding",US,,False,en


In [18]:
# drop all columns except text
df = df.drop(columns=['created_at', 'screen_name', 'country_code', 
                 'account_lang', 'verified', 'lang'])
df.head(2)

Unnamed: 0,text
0,....#SUNDAYFUNDAY #coronavirus style #vino cheers 🍷 https://t.co/SrymChBkq2
1,"This pandemic has confirmed my worst fears, most people don’t know how to make entertaining videos. #Covid_19 #SinceIveBeenQuarantined #AmericasUnfunniestVideos #WrestleMania #tonyaharding"


In [19]:
# remove tokens containing digits, replace punctuation with spaces, and lowercase
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', str(x))
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
                          
df['text'] = df.text.map(alphanumeric).map(punc_lower)
df.head(2)

Unnamed: 0,text
0,sundayfunday coronavirus style vino cheers 🍷 https t co
1,this pandemic has confirmed my worst fears most people don’t know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding


In [20]:
# remove emojis (this drops every non-ASCII character, including curly apostrophes)
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
df.head(2)

Unnamed: 0,text
0,sundayfunday coronavirus style vino cheers https t co
1,this pandemic has confirmed my worst fears most people dont know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding
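The cleaning steps above can be gathered into one function that is easier to test on single tweets. This is a sketch using only the standard library. The whitespace collapsing and the optional `strip_urls` flag are additions that are not in the original notebook; the flag is one possible way to keep URL fragments such as `https t co` out of the vocabulary.

```python
import re
import string

_URL = re.compile(r'https?://\S+')
_DIGIT_TOKEN = re.compile(r'\w*\d\w*')
_PUNCT = re.compile('[%s]' % re.escape(string.punctuation))


def clean_tweet(text, strip_urls=False):
    """Apply the notebook's cleaning steps to a single tweet."""
    text = str(text)
    if strip_urls:
        # Assumption: dropping whole URLs first (not done in the notebook).
        text = _URL.sub(' ', text)
    text = _DIGIT_TOKEN.sub(' ', text)                    # tokens containing digits
    text = _PUNCT.sub(' ', text.lower())                  # lowercase, ASCII punctuation
    text = text.encode('ascii', 'ignore').decode('ascii')  # emojis, other non-ASCII
    return ' '.join(text.split())                         # collapse whitespace (addition)


raw = "....#SUNDAYFUNDAY #coronavirus style #vino cheers 🍷 https://t.co/SrymChBkq2"
print(clean_tweet(raw))
print(clean_tweet(raw, strip_urls=True))
```

It could be applied to the corpus with `df['text'].map(clean_tweet)`.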


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138796 entries, 0 to 138795
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    138796 non-null  object
dtypes: object(1)
memory usage: 1.1+ MB


In [22]:
# save preprocessed corpus as corpus_tweets_df
corpus_tweets_df = df 
corpus_tweets_df.to_pickle('./raw_data/corpus_tweets_df.pkl')
corpus_tweets_df.to_csv(r'//Users/sandraparedes/Documents/GitHub/metis_dsml/05_nlp/g00-nlp-project/raw_data/corpus_tweets_df.csv', index=False)


## 3 | Sentiment Analysis<a id='3'></a>  

In [23]:
#read in corpus
df = pd.read_pickle("./raw_data/corpus_tweets_df.pkl")  
df.head(2)

Unnamed: 0,text
0,sundayfunday coronavirus style vino cheers https t co
1,this pandemic has confirmed my worst fears most people dont know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding


In [24]:
# VADER sentiment: polarity_scores expects a single string, so score one tweet as a check
analyzer = SentimentIntensityAnalyzer() 
sentiment = analyzer.polarity_scores(df.text.iloc[0]).get('compound')
print('compound', sentiment)

compound 0.4767


In [25]:
df['score'] = df.text.map(analyzer.polarity_scores).map(lambda x: x.get('compound'))
df.head(5)

Unnamed: 0,text,score
0,sundayfunday coronavirus style vino cheers https t co,0.4767
1,this pandemic has confirmed my worst fears most people dont know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding,-0.6124
2,is this true \nhttps t co \n ecuadorenemergencia coronaviruspandemic,0.4215
3,many us thought it was wuhan province but it could never be us then it was italy but it could never be us now it is here one newyorker died every minutes from over this weekend absolutely devastating \n\nhttps t co,-0.923
4,ah coronavirus humor https t co,0.2732
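The compound score ranges from -1 to 1. To summarize it as a category, the VADER documentation suggests cutoffs of 0.05 and -0.05. The sketch below applies those cutoffs; the function name `label_sentiment` is hypothetical, and this labeling step is not part of the original analysis.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Scores taken from the first two rows of the table above.
print(label_sentiment(0.4767))
print(label_sentiment(-0.6124))
```

With the `score` column in place, `df['score'].map(label_sentiment).value_counts()` would give the share of each label.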


## 4 | Vectorizer<a id='4'></a>  

In [26]:
corpus = df.text
print('corpus type:', type(corpus))
print(corpus.head(2))

corpus type: <class 'pandas.core.series.Series'>
0                                                                                                                            sundayfunday  coronavirus style  vino cheers  https   t co  
1    this pandemic has confirmed my worst fears  most people dont know how to make entertaining videos      sinceivebeenquarantined  americasunfunniestvideos  wrestlemania  tonyaharding
Name: text, dtype: object


### CountVectorizer

In [27]:
cv_vectorizer = CountVectorizer(stop_words='english', min_df=0.02, max_df=.95)
cv_vectorizer

CountVectorizer(max_df=0.95, min_df=0.02, stop_words='english')

In [28]:
# document-term matrix with count vectorizer
cv_doc_term_mtx = cv_vectorizer.fit_transform(corpus)
type(cv_doc_term_mtx)

scipy.sparse._csr.csr_matrix

In [29]:
cv_doc_term_df = pd.DataFrame(cv_doc_term_mtx.toarray(), 
                              columns=cv_vectorizer.get_feature_names_out())
cv_doc_term_df.head(2)

Unnamed: 0,amp,care,cases,coronavirus,coronaviruspandemic,covid,day,deaths,doing,dont,...,stayathome,stayhome,thank,think,time,today,trump,virus,work,world
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
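When `min_df` and `max_df` are floats, scikit-learn treats them as proportions of documents, so `min_df=0.02` keeps only terms that occur in at least 2% of tweets. That explains why the vocabulary shrinks to a few dozen terms. The following standard-library sketch illustrates that filter. It assumes plain whitespace tokenization and no stop-word list, so it is not a reimplementation of `CountVectorizer`; `filter_by_doc_freq` is a hypothetical helper.

```python
from collections import Counter


def filter_by_doc_freq(docs, min_df=0.02, max_df=0.95):
    """Return terms whose document frequency lies within the given proportions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.split()))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, count in doc_freq.items() if low <= count <= high)


docs = ['stay home', 'stay safe', 'stay home today', 'stay well']
print(filter_by_doc_freq(docs, min_df=0.5, max_df=0.95))
```

Here `stay` appears in every document and is dropped by `max_df`, while `home` appears in half of them and survives `min_df`.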


### Term Frequency Inverse Document Frequency (TF-IDF)

In [30]:
# option tfidf2 = TfidfVectorizer(ngram_range=(1,2), binary=True, stop_words='english')
tf_vectorizer = TfidfVectorizer(stop_words='english', 
                                min_df=0.01, 
                                max_df=.95)
tf_vectorizer

TfidfVectorizer(max_df=0.95, min_df=0.01, stop_words='english')

In [31]:
# document-term matrix with TF-IDF
tf_doc_term_mtx = tf_vectorizer.fit_transform(corpus)
type(tf_doc_term_mtx)

scipy.sparse._csr.csr_matrix

In [32]:
tf_doc_term_df = pd.DataFrame(tf_doc_term_mtx.toarray(), 
                              columns=tf_vectorizer.get_feature_names_out())
tf_doc_term_df.head(2)

Unnamed: 0,america,americans,amp,april,away,best,better,business,california,care,...,watch,way,week,weeks,work,workers,working,world,year,york
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
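With its default settings, `TfidfVectorizer` weights each raw term count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every document vector to unit L2 length. The sketch below reproduces that arithmetic with the standard library. It assumes whitespace tokenization and applies no stop words or `min_df`/`max_df` filtering; `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute smoothed, L2-normalized TF-IDF weights for each document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


for vector in tfidf_vectors(['stay home', 'stay safe']):
    print(vector)
```

In this toy corpus `stay` occurs in both documents, so it receives a lower weight than `home` or `safe`.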


## 5 | Dimensionality Reduction <a id='5'></a>  

### Principal Component Analysis (PCA)

In [33]:
pca = PCA(n_components=5)
pca_matrix = pca.fit_transform(cv_doc_term_df)
pca_matrix

array([[ 0.44088238,  0.50226742, -0.07243432, -0.0830229 , -0.08805793],
       [-0.92361988, -0.0752408 , -0.06110429,  0.94002008, -0.17652642],
       [ 0.10340195, -0.45334209, -0.20403876, -0.09259598, -0.08051224],
       ...,
       [-0.83093773, -0.12936078, -0.14441171, -0.20125915, -0.00408033],
       [-0.8471146 , -0.10162256, -0.14650546, -0.17682858, -0.07491721],
       [ 0.06727795, -0.41857728, -0.19291245,  0.00994001, -0.05351073]])

In [34]:
pca_variance = pca.explained_variance_ratio_
print('pca_variance: ', pca_variance)

total_variance = sum(pca.explained_variance_ratio_)
print('total_variance: ', total_variance) 


pca_variance:  [0.1239549  0.11011231 0.07528215 0.03777027 0.03386963]
total_variance:  0.3809892569368003
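The explained variance ratio can be derived from the singular values of the centered data matrix. The NumPy sketch below shows that calculation as an illustration. It is not scikit-learn's implementation, and `explained_variance_ratio` is a hypothetical helper.

```python
import numpy as np


def explained_variance_ratio(X, n_components):
    """Share of total variance captured by each of the leading components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                  # PCA works on centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances[:n_components] / variances.sum()


# Cumulative share for the five ratios printed by the notebook.
pca_variance = np.array([0.1239549, 0.11011231, 0.07528215, 0.03777027, 0.03386963])
print(np.cumsum(pca_variance))
```

The cumulative sum shows how the five components together reach roughly 38% of the variance in the count matrix.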


In [35]:
# component loadings (weight of each term on each principal component)
pca_components = pd.DataFrame(pca.components_.round(2), 
                              index = ['pc1','pc2', 'pc3','pc4', 'pc5'],
                              columns=cv_vectorizer.get_feature_names_out())
                           
print('pca_components.shape:', pca_components.shape)
pca_components.T.style.background_gradient(cmap='Reds')


pca_components.shape: (5, 46)


Unnamed: 0,pc1,pc2,pc3,pc4,pc5
amp,0.0,-0.13,0.99,-0.06,0.01
care,-0.01,-0.01,0.01,0.02,-0.0
cases,-0.0,0.02,0.01,0.06,0.38
coronavirus,0.32,0.94,0.13,-0.0,-0.02
coronaviruspandemic,-0.02,-0.02,-0.0,-0.01,-0.02
covid,0.03,0.02,-0.0,-0.0,0.08
day,-0.0,-0.0,0.01,-0.01,0.05
deaths,-0.01,0.01,0.01,0.03,0.23
doing,-0.01,-0.0,0.01,0.01,-0.01
dont,-0.02,0.0,0.01,0.05,-0.02


In [36]:
# function to display components
def display_pcs(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nComponent:", ix)
        else:
            print("\nComponent: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))


#### Top terms by principal component:

In [37]:
display_pcs(pca, cv_vectorizer.get_feature_names_out(), 10)


Component: 0
https, coronavirus, new, stayhome, pandemic, covid, quarantine, socialdistancing, stayathome, thank

Component: 1
coronavirus, trump, people, covid, cases, realdonaldtrump, quarantine, just, pandemic, deaths

Component: 2
amp, coronavirus, people, trump, need, health, pandemic, help, work, realdonaldtrump

Component: 3
people, https, home, help, need, just, stay, know, cases, dont

Component: 4
new, cases, deaths, covid, today, day, state, home, just, health
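`display_pcs` ranks terms by signed loading, so a term with a strongly negative loading never appears in these lists, although PCA loadings can be negative (for example `amp` on `pc2` in the table above). A sketch of an alternative ranking by absolute loading that keeps the sign for interpretation; `top_terms_by_abs_loading` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np


def top_terms_by_abs_loading(components, feature_names, n_top=10):
    """For each component, list (term, loading) pairs ordered by |loading|."""
    feature_names = np.asarray(feature_names)
    result = []
    for loadings in np.asarray(components, dtype=float):
        order = np.argsort(np.abs(loadings))[::-1][:n_top]
        result.append([(str(feature_names[i]), float(loadings[i])) for i in order])
    return result


# Toy loadings for a single component over three terms.
print(top_terms_by_abs_loading([[0.1, -0.9, 0.3]], ['cases', 'amp', 'trump'], n_top=2))
```

In the notebook it could be called as `top_terms_by_abs_loading(pca.components_, cv_vectorizer.get_feature_names_out())`.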


## 6 | Topic Modeling <a id='6'></a>  

### Non-Negative Matrix Factorization (NMF)

In [38]:
# V     visible variables     doc_term             input (corpus matrix)
# W     weights               doc_topic            feature set
# H     hidden variables      topic_term           coefficients

In [39]:
V = tf_doc_term_mtx
V.shape

(138796, 146)

In [40]:
# W: document-topic matrix (one row per tweet, one weight per topic)

nmf = NMF(n_components=5, init=None)
W = nmf.fit_transform(V).round(3)
print(type(W))
W.shape

<class 'numpy.ndarray'>


(138796, 5)
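NMF approximates the document-term matrix as a product of two non-negative factors, V ≈ W H. To make the factorization concrete, here is a small NumPy sketch using multiplicative updates for the Frobenius-norm objective. This is a different algorithm from scikit-learn's default coordinate-descent solver. It also assumes a dense array, so the sparse `V` above would first need converting. `nmf_multiplicative` and its parameters are hypothetical.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (docs x terms) into W (docs x k) and H (k x terms)."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components))
    H = rng.random((n_components, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


rng = np.random.default_rng(1)
V_toy = rng.random((6, 2)) @ rng.random((2, 5))   # small non-negative example
W_toy, H_toy = nmf_multiplicative(V_toy, n_components=2)
print(np.linalg.norm(V_toy - W_toy @ H_toy) / np.linalg.norm(V_toy))
```

The printed value is the relative reconstruction error, which gives a rough sense of how well the two factors reproduce the input.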

In [41]:
# H: topic-term matrix (one row per topic, one coefficient per term)

H = pd.DataFrame(nmf.components_.round(2),
                 index = ['c1', 'c2','c3', 'c4', 'c5'],
                 columns = tf_vectorizer.get_feature_names_out())
print('H.shape:',  H.shape)
H.T.style.background_gradient(cmap='Blues')


H.shape: (5, 146)


Unnamed: 0,c1,c2,c3,c4,c5
america,0.02,0.05,0.07,0.19,0.02
americans,0.0,0.07,0.12,0.26,0.0
amp,0.0,0.0,8.04,0.0,0.0
april,0.06,0.03,0.09,0.04,0.2
away,0.0,0.02,0.05,0.19,0.07
best,0.04,0.02,0.07,0.16,0.1
better,0.03,0.02,0.06,0.23,0.08
business,0.06,0.02,0.11,0.15,0.04
california,0.1,0.07,0.0,0.02,0.26
care,0.05,0.0,0.18,0.37,0.06


In [42]:
# function to display topics
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))


#### Top terms by topic:

In [43]:
display_topics(nmf, tf_vectorizer.get_feature_names_out(), 10)


Topic  0
https, thank, pandemic, coronaviruspandemic, great, thanks, stayhomestaysafe, join, check, support

Topic  1
coronavirus, trump, lockdown, death, realdonaldtrump, news, covid, million, china, pandemic

Topic  2
amp, thank, health, today, support, work, workers, help, need, community

Topic  3
people, just, pandemic, like, coronaviruspandemic, realdonaldtrump, trump, time, need, help

Topic  4
new, quarantine, stayhome, covid, day, home, today, stay, cases, york
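One way to explore the topics further is to assign each tweet to the topic with the largest weight in `W` and count how many tweets fall under each. The sketch below does this with NumPy. It skips rows whose weights are all zero, which can happen when a tweet contains none of the retained vocabulary terms. `dominant_topic_counts` is a hypothetical helper.

```python
import numpy as np


def dominant_topic_counts(W):
    """Count documents per dominant topic, ignoring rows with no topic weight."""
    W = np.asarray(W, dtype=float)
    has_weight = W.sum(axis=1) > 0
    dominant = W[has_weight].argmax(axis=1)
    return np.bincount(dominant, minlength=W.shape[1])


# Toy document-topic weights: four documents, two topics.
W_example = [[0.1, 0.9], [0.8, 0.2], [0.0, 0.0], [0.3, 0.7]]
print(dominant_topic_counts(W_example))
```

Applied to the `W` matrix above, this would show whether one topic dominates the corpus or the five are roughly balanced.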
