#### NLP | Model

# Coronavirus Tweets: April 2020<a id='top'></a> 

### Natural Language Processing Stepwise Analysis

1. [Research Design](#1)<br/>
2. [DataFrames](#2)<br/>
3. [Exploratory Data Analysis](#3)<br/>
    3.1 [Data Summary](#31)<br/>
    3.2 [Text Preprocessing](#32)<br/>
4. [Vectorizer](#4)<br/>
5. [Topic Modeling/Dimensionality Reduction](#5)<br/>
6. [Sentiment Analysis](#6)<br/>
7. [Classification](#7)<br/>
    7.1 [Naive Bayes: Bernoulli](#71)<br/>
    7.2 [Naive Bayes: Gaussian](#72)<br/>
    7.3 [Naive Bayes: Multinomial](#73)<br/>

In [1]:
import glob 
import nltk
# nltk.download('stopwords')
import matplotlib.pyplot as plt
import numpy as np


# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
import pandas as pd
import pickle
import re
import seaborn as sns
import string
pd.set_option('display.max_colwidth', None)
%matplotlib inline
%config InlineBackend.figure_formats = ['retina']  # or svg
sns.set(context='notebook', style='whitegrid')


from cleantext import clean
# from itertools import cycle
from nltk.tokenize import MWETokenizer, word_tokenize  # , RegexpTokenizer
from nltk.tag import pos_tag
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.util import ngrams
from sklearn import svm
from sklearn.decomposition import PCA, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import recall_score, roc_auc_score
# from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 







Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


# 1 | Research Design<a id='1'></a> 

* **Research Question:** What were Americans tweeting about coronavirus and COVID-19 in April 2020? 
* **Impact Hypothesis:** Inform CDC's communication strategy for future pandemics. 
* **Data source:** Coronavirus COVID-19 Tweets [early](https://www.kaggle.com/datasets/smid80/coronavirus-covid19-tweets-early-april) and [late](https://www.kaggle.com/datasets/smid80/coronavirus-covid19-tweets-late-april) April, n=138,796


[back to top](#top)

# 2 | [DataFrames](https://github.com/slp22/nlp-project/blob/main/nlp-coronavirus-tweets-mvp.ipynb)<a id='2'></a> 

In [2]:
# load clean tweet corpus from mvp 
df = pd.read_csv('./raw_data/tweet_df.csv', low_memory=False)


In [3]:
df.head(2)

Unnamed: 0,created_at,screen_name,text,country_code,account_lang,verified,lang
0,2020-04-06T00:00:05Z,WFMGINC,....#SUNDAYFUNDAY #coronavirus style #vino cheers 🍷 https://t.co/SrymChBkq2,US,,False,en
1,2020-04-06T00:00:14Z,jpomietlasz,"This pandemic has confirmed my worst fears, most people don’t know how to make entertaining videos. #Covid_19 #SinceIveBeenQuarantined #AmericasUnfunniestVideos #WrestleMania #tonyaharding",US,,False,en


[back to top](#top)

# 3 | Exploratory Data Analysis<a id='3'></a> 

##### Note: The full EDA is in the [MVP notebook](https://github.com/slp22/nlp-project/blob/main/nlp-coronavirus-tweets-mvp.ipynb).

### 3.1 Data Summary<a id='31'></a> 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138796 entries, 0 to 138795
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   created_at    138796 non-null  object
 1   screen_name   138790 non-null  object
 2   text          138789 non-null  object
 3   country_code  138789 non-null  object
 4   account_lang  2 non-null       object
 5   verified      138787 non-null  object
 6   lang          138787 non-null  object
dtypes: object(7)
memory usage: 7.4+ MB


#### All columns have dtype `object`. The `account_lang` column is almost entirely null (2 non-null values out of 138,796) and is dropped in the next step.

[back to top](#top)

### 3.2 Text Preprocessing<a id='32'></a>  

In [5]:
# drop all columns except text
df = df.drop(columns=['created_at', 'screen_name', 'country_code', 
                 'account_lang', 'verified', 'lang'])
df.head(2)

Unnamed: 0,text
0,....#SUNDAYFUNDAY #coronavirus style #vino cheers 🍷 https://t.co/SrymChBkq2
1,"This pandemic has confirmed my worst fears, most people don’t know how to make entertaining videos. #Covid_19 #SinceIveBeenQuarantined #AmericasUnfunniestVideos #WrestleMania #tonyaharding"


In [6]:
# drop tokens containing digits, replace punctuation with spaces, and lowercase
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', str(x))
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

df['text'] = df.text.map(alphanumeric).map(punc_lower)
df.head(2)

Unnamed: 0,text
0,sundayfunday coronavirus style vino cheers 🍷 https t co
1,this pandemic has confirmed my worst fears most people don’t know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding
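To make the effect of the two regular expressions concrete, here is a small standalone sketch that applies the same patterns to a single string. The function name `clean_tweet` and the sample string are illustrative only and are not part of the notebook.

```python
import re
import string

def clean_tweet(text):
    """Apply the same two substitutions used above to one string."""
    # drop any run of word characters that contains a digit,
    # e.g. "Covid_19" or the URL slug "SrymChBkq2"
    text = re.sub(r'\w*\d\w*', ' ', str(text))
    # lowercase, then replace every ASCII punctuation character with a space
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())

raw = "#SUNDAYFUNDAY #coronavirus style https://t.co/SrymChBkq2 #Covid_19"
tokens = clean_tweet(raw).split()
print(tokens)
```

Note that hashtags containing digits (such as `#Covid_19`) disappear entirely, while the URL leaves the fragments `https`, `t`, and `co` behind, which is why those fragments are handled later as stop words.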


In [7]:
# remove emojis and every other non-ASCII character (e.g. the curly apostrophe in "don’t")
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
df.head(5)

Unnamed: 0,text
0,sundayfunday coronavirus style vino cheers https t co
1,this pandemic has confirmed my worst fears most people dont know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding
2,is this true \nhttps t co \n ecuadorenemergencia coronaviruspandemic
3,many us thought it was wuhan province but it could never be us then it was italy but it could never be us now it is here one newyorker died every minutes from over this weekend absolutely devastating \n\nhttps t co
4,ah coronavirus humor https t co
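The ASCII round trip removes more than emojis, which is worth keeping in mind when reading the cleaned text. A minimal sketch on a made-up string (the helper name `strip_non_ascii` is hypothetical):

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode('ascii', 'ignore').decode('ascii')

sample = "don’t 🍷 café"
stripped = strip_non_ascii(sample)
print(repr(stripped))
```

The curly apostrophe and the accented letter are dropped along with the emoji, so "don’t" becomes "dont", as seen in row 1 above.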


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138796 entries, 0 to 138795
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    138796 non-null  object
dtypes: object(1)
memory usage: 1.1+ MB


In [9]:
# # save preprocessed corpus as corpus_tweets_df
# corpus_tweets_df = df 
# corpus_tweets_df.to_pickle('./raw_data/corpus_tweets_df.pkl')
# corpus_tweets_df.to_csv(r'//Users/sandraparedes/Documents/GitHub/metis_dsml/05_nlp/g00-nlp-project/raw_data/corpus_tweets_df.csv', index=False)


[back to top](#top)


## 4 | Vectorizer<a id='4'></a>  

In [10]:
# load preprocessed corpus from step 3 
df = pd.read_pickle("./raw_data/corpus_tweets_df.pkl")  
df.head(2)

Unnamed: 0,text
0,sundayfunday coronavirus style vino cheers https t co
1,this pandemic has confirmed my worst fears most people dont know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding


In [11]:
# isolate tweet text in dataframe
corpus = df.text
print('corpus type:', type(corpus))
print(corpus.head(2))

corpus type: <class 'pandas.core.series.Series'>
0                                                                                                                            sundayfunday  coronavirus style  vino cheers  https   t co  
1    this pandemic has confirmed my worst fears  most people dont know how to make entertaining videos      sinceivebeenquarantined  americasunfunniestvideos  wrestlemania  tonyaharding
Name: text, dtype: object


In [12]:
# custom stop words 
stopwords = nltk.corpus.stopwords.words('english')
new_words = ['also',
             'amp',
             'corona',
             'coronavirus',
             'https',
             'http',
             'pandemic',
             'covid',
             'hers',
             'his',
             'weeks',
             'americans',
             'another',
             'anyone',
             'working',
             'workers',
             'would',
             'co']
stopwords.extend(new_words)
# print(stopwords)
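As a rough illustration of what the extended list does downstream, the sketch below filters a tokenized tweet against a combined stop-word set. The tiny `base_stopwords` list is a stand-in for NLTK's English list, and the sample tweet is made up.

```python
# tiny stand-in for nltk.corpus.stopwords.words('english') (assumption for this demo)
base_stopwords = ["the", "is", "to", "t"]
# a few of the domain-specific additions from the cell above
domain_words = ["coronavirus", "covid", "https", "co", "amp"]

# a set gives constant-time membership checks
stop_set = set(base_stopwords + domain_words)

tweet = "the coronavirus lockdown is hard https t co"
kept = [tok for tok in tweet.split() if tok not in stop_set]
print(kept)
```

Terms that appear in nearly every tweet in this corpus (such as "coronavirus") carry little information for separating topics, which is the motivation for adding them to the list.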


In [13]:
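# stemmer (only applied if passed explicitly to prep)
stemmer = SnowballStemmer("english")

def prep(doc, stemmer=None):
    """Preprocess one tweet for TfidfVectorizer.

    scikit-learn calls `preprocessor` on the whole document string, not on
    single words, so stop-word filtering is left to the `stop_words` argument.
    """
    doc = doc.lower()
    if stemmer is None:
        return doc
    # stem each whitespace-separated token when a stemmer is supplied
    return " ".join(stemmer.stem(tok) for tok in doc.split())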



### Term Frequency Inverse Document Frequency (TF-IDF)

In [14]:
tf_vectorizer = TfidfVectorizer(stop_words=stopwords, 
                                min_df=0.01, 
                                max_df=.95, 
                                preprocessor=prep)
tf_vectorizer

TfidfVectorizer(max_df=0.95, min_df=0.01,
                preprocessor=<function prep at 0x7f9e64e14ca0>,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])
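For intuition about the weights the vectorizer produces, here is a plain-Python sketch of TF-IDF using scikit-learn's default settings as I understand them: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. The three toy documents are invented, and stop words and the `min_df`/`max_df` filters are left out for brevity.

```python
import math
from collections import Counter

docs = [
    "stay home stay safe",
    "new york cases",
    "stay safe new york",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: number of documents that contain each term
df_counts = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    # smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    # scale the row to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(toks) for toks in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, "cases" appears in only one document and therefore receives a higher weight than "new" or "york". With `min_df=0.01` and `max_df=.95`, the vectorizer above additionally drops terms that occur in fewer than 1% or more than 95% of tweets.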

In [15]:
# document-term matrix with TF-IDF
tf_doc_term_mtx = tf_vectorizer.fit_transform(corpus)
type(tf_doc_term_mtx)

scipy.sparse._csr.csr_matrix

In [16]:
tf_doc_term_df = pd.DataFrame(tf_doc_term_mtx.toarray(), 
                              columns=tf_vectorizer.get_feature_names_out())
tf_doc_term_df.head()

Unnamed: 0,america,april,around,away,back,best,better,business,california,call,...,virus,want,watch,way,week,well,work,world,year,york
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# double-check that domain-specific words were omitted
feature_names = set(tf_vectorizer.get_feature_names_out())
for word in ['https', 'corona', 'covid', 't', 'amp']:
    print(word in feature_names)


False
False
False
False
False


[back to top](#top)


## 5 | Topic Modeling/Dimensionality Reduction <a id='5'></a>  

### Non-Negative Matrix Factorization (NMF)

In [19]:
# V     visible variables     doc_term             input (corpus matrix)
# W     weights               doc_topic            feature set
# H     hidden variables      topic_term           coefficients
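The factorization behind these names is V ≈ W·H with all entries non-negative. The sketch below shows the idea on a small random matrix using the classic multiplicative-update rules. This is a swapped-in technique chosen for brevity: scikit-learn's `NMF` uses a coordinate-descent solver by default, so the numbers will not match the notebook. The matrix sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

V = rng.random((6, 4))   # toy non-negative "document-term" matrix
k = 2                    # number of topics (n_components)
W = rng.random((6, k))   # document-topic weights
H = rng.random((k, 4))   # topic-term coefficients
eps = 1e-10              # guards against division by zero

err_start = np.linalg.norm(V - W @ H)
for _ in range(200):
    # multiplicative updates keep W and H non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(V - W @ H)

print(W.shape, H.shape)
print("reconstruction error:", round(err_start, 3), "->", round(err_end, 3))
```

Each row of `H` can be read as a topic (a weighting over terms), and each row of `W` says how strongly a document loads on each topic.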

In [20]:
V = tf_doc_term_mtx
V.shape

(138796, 170)

In [21]:
# W matrix = feature set & weights

nmf = NMF(n_components=5, init=None)
W = nmf.fit_transform(V).round(3)
print(type(W))
W.shape

<class 'numpy.ndarray'>


(138796, 5)

In [22]:
# H matrix = hidden variables & coefficients 

H = pd.DataFrame(nmf.components_.round(2),
                 index = ['c1', 'c2','c3', 'c4', 'c5'],
                 columns = tf_vectorizer.get_feature_names_out())
print('H.shape:',  H.shape)
H.T.style.background_gradient(cmap='Blues')


H.shape: (5, 170)


Unnamed: 0,c1,c2,c3,c4,c5
america,0.2,0.0,0.02,0.05,0.2
april,0.26,0.04,0.17,0.0,0.01
around,0.31,0.02,0.02,0.1,0.01
away,0.29,0.02,0.0,0.04,0.02
back,0.95,0.03,0.0,0.03,0.03
best,0.34,0.06,0.01,0.01,0.04
better,0.37,0.02,0.01,0.05,0.04
business,0.33,0.0,0.02,0.03,0.01
california,0.2,0.35,0.03,0.0,0.0
call,0.36,0.01,0.02,0.01,0.02


In [23]:
# function to display topics
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
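The slice `topic.argsort()[:-no_top_words - 1:-1]` is compact but easy to misread: `argsort` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end to pick the largest `no_top_words` entries. A toy example with invented weights:

```python
import numpy as np

feature_names = np.array(["cases", "home", "new", "stay", "york"])
topic = np.array([0.10, 0.70, 0.05, 0.90, 0.30])  # one row of H (made-up values)
no_top_words = 3

# indices of the three largest weights, largest first
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_terms = feature_names[top_idx].tolist()
print(top_terms)
```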


#### Top terms by topic:

In [24]:
display_topics(nmf, tf_vectorizer.get_feature_names_out(), 10)


Topic  0
us, get, like, time, today, home, one, need, thank, help

Topic  1
quarantine, stayhome, quarantinelife, socialdistancing, day, stayathome, staysafe, lockdown, california, stayhomestaysafe

Topic  2
new, york, cases, deaths, city, nyc, state, county, today, positive

Topic  3
people, many, million, economy, help, next, virus, know, cases, dont

Topic  4
trump, coronaviruspandemic, realdonaldtrump, president, says, virus, coronavirusoutbreak, news, china, said


[back to top](#top)

## 6 | Sentiment Analysis<a id='6'></a>  

Pass the preprocessed tweets to VADER and keep the compound sentiment score for each one.

In [26]:
# Vader Sentiment: score a single tweet first (polarity_scores expects a string, not a DataFrame)
analyzer = SentimentIntensityAnalyzer()
sentiment = analyzer.polarity_scores(df.text[0]).get('compound')
print('compound', sentiment)

compound 0.4767


In [27]:
df['score'] = df.text.map(analyzer.polarity_scores).map(lambda x: x.get('compound'))
df.head(5)

Unnamed: 0,text,score
0,sundayfunday coronavirus style vino cheers https t co,0.4767
1,this pandemic has confirmed my worst fears most people dont know how to make entertaining videos sinceivebeenquarantined americasunfunniestvideos wrestlemania tonyaharding,-0.6124
2,is this true \nhttps t co \n ecuadorenemergencia coronaviruspandemic,0.4215
3,many us thought it was wuhan province but it could never be us then it was italy but it could never be us now it is here one newyorker died every minutes from over this weekend absolutely devastating \n\nhttps t co,-0.923
4,ah coronavirus humor https t co,0.2732
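The compound score ranges from -1 to 1. A common convention, taken from the VADER documentation rather than from this notebook, is to call a text positive at 0.05 or above, negative at -0.05 or below, and neutral in between. A minimal sketch of that labeling rule, using the five scores above plus one hypothetical neutral score (the function name `label_sentiment` is illustrative):

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# first five values come from the table above; 0.0 is a hypothetical extra
scores = [0.4767, -0.6124, 0.4215, -0.923, 0.2732, 0.0]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

With a DataFrame, the same function could be applied as `df['score'].map(label_sentiment)`.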


[back to top](#top)

## 7 | Classification<a id='7'></a>  
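The cells in this section rely on `X_train`, `y_train`, `X_val`, and `y_val`, which are not defined in the cells shown here. One possible setup, sketched below on toy data, assumes binary labels derived from the VADER compound score and a stratified train/validation split. The labeling rule, the toy texts and scores, and the names `toy_text`, `toy_score`, and `bern_demo` are all assumptions for illustration, not the author's confirmed approach.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# hypothetical stand-ins for df.text and df.score
toy_text = [
    "thank you nurses and doctors",
    "great news today stay safe",
    "love seeing neighbors help",
    "happy to be home with family",
    "so many deaths this week",
    "worst fears confirmed",
    "absolutely devastating news",
    "people are losing jobs",
]
toy_score = np.array([0.6, 0.8, 0.7, 0.5, -0.4, -0.6, -0.7, -0.5])

# assumed labeling rule: 1 = positive compound score, 0 = otherwise
y = (toy_score >= 0.05).astype(int)
X = TfidfVectorizer().fit_transform(toy_text)

# hold out 25% for validation, keeping the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

bern_demo = BernoulliNB().fit(X_train, y_train)
y_val_pred = bern_demo.predict(X_val)
print(X_train.shape, X_val.shape, y_val_pred)
```

In a real run, fitting the vectorizer on the training split only (and calling `transform` on the validation split) would avoid leaking validation vocabulary statistics into training.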

### 7.1 Naive Bayes: Bernoulli<a id='71'></a> 

In [30]:
bern = BernoulliNB().fit(X_train, y_train)
y_predict_bern = bern.predict(X_val) 


In [None]:
seven = ["Bernoulli NB", 'bern',
         recall_score(y_val, y_predict_bern),  # scikit-learn expects (y_true, y_pred)
         roc_auc_score(y_val, bern.predict_proba(X_val)[:,1])]

seven
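Argument order matters for recall: scikit-learn's `recall_score` takes the true labels first and the predictions second. The plain-Python sketch below, on invented labels, shows that swapping the two arguments silently computes precision instead.

```python
def recall(y_true, y_pred):
    """Recall for the positive class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0]   # made-up validation labels
y_pred = [1, 1, 0, 0, 1, 0]   # made-up predictions

correct_order = recall(y_true, y_pred)   # 2 of 4 true positives found
swapped_order = recall(y_pred, y_true)   # 2 of 3 predicted positives correct (precision)
print(correct_order, swapped_order)
```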

[back to top](#top)

### 7.2 Naive Bayes: Gaussian<a id='72'></a> 

In [None]:
# GaussianNB needs dense arrays; convert with .toarray() first if X_train / X_val are sparse
gaus = GaussianNB().fit(X_train, y_train)
y_pred_gaus = gaus.predict(X_val)


In [None]:
eight = ["Gaussian NB", 'gaus',
         recall_score(y_val, y_pred_gaus),  # scikit-learn expects (y_true, y_pred)
         roc_auc_score(y_val, gaus.predict_proba(X_val)[:,1])]
eight

[back to top](#top)

### 7.3 Naive Bayes: Multinomial<a id='73'></a> 

In [None]:
multi = MultinomialNB().fit(X_train, y_train)
y_pred_multi = multi.predict(X_val)


In [None]:
nine = ["Multinomial NB", 'multi',
        recall_score(y_val, y_pred_multi),  # scikit-learn expects (y_true, y_pred)
        roc_auc_score(y_val, multi.predict_proba(X_val)[:,1])]
nine

[back to top](#top)

In [None]:
# Our goal is to create a Naive Bayes model that will look at the review text and determine if the review is positive or negative. Let's start by prepping the data.

# # define the input and output of the model
# X = data.review
# y = data.sentiment

# # split the data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# X_train_cv = cv.fit_transform(X_train)
# X_test_cv  = cv.transform(X_test)

In [None]:
# dtm_cv = pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names_out())
# dtm_cv

In [None]:
# Next, we're going to put this document-term matrix through a Naive Bayes model and see how well the model performs.
# mnb = MultinomialNB()
# mnb.fit(X_train_cv, y_train)
# mnb.score(X_test_cv, y_test)

In [None]:
# Using `CountVectorizer`, we are able to predict the sentiment of a review with 87.7% accuracy. Next, you are tasked with repeating the whole process again, but using `TfidfVectorizer` instead to see if you can get a better prediction score.

# 1. Create a `TfidfVectorizer` object with the same hyperparameters as the `CountVectorizer` object we created earlier and name it `tv`.

# 2. Take the `X_train` data and turn it into a TF-IDF matrix called `X_train_tv`.

# 3. Take the `X_test` data and turn it into a TF-IDF matrix with the same columns as `X_train_tv` and call it `X_test_tv`.

# 4. Turn `X_train_tv` into a pandas dataframe and call it `tfidf`.

# (See final step below)

In [None]:
# tv = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
# X_train_tv = tv.fit_transform(X_train)
# X_test_tv  = tv.transform(X_test)
# tfidf = pd.DataFrame(X_train_tv.toarray(), columns=tv.get_feature_names_out())
# tfidf

In [None]:
# 5. Fit a GaussianNB model and save the final score as `tfidf_score`.

# gnb = GaussianNB()
# gnb.fit(X_train_tv.toarray(), y_train)
# tfidf_score = gnb.score(X_test_tv.toarray(), y_test)
# tfidf_score

In [None]:
# The final prediction accuracy using the TF-IDF Vectorizer was 84.5% versus the final prediction accuracy using the Count Vectorizer, which was 87.7%.

# This tells us that while TF-IDF can be the better option over simple word counts, it is not always the case. The best approach is to try both vectorizers and choose the one that works best for your dataset and analysis goal.

[back to top](#top)