# Project Phase 3 
Tiffany Wong 

### Description 
In the final phase of our class project, you will use the datasets you created in phase 2 to build a text categorization model.

### Note to self 
- PRIMARY = twitter-stance 
- SECONDARY = changeorg-stance 

### Goal 
The goal of this final phase of the project is to build a text categorization model on your primary dataset, and to evaluate it on both your primary and your secondary dataset. 

## 1. Data Partitioning 
Create a Training set for your model by randomly selecting 70% of the texts in your PRIMARY dataset.  Use the remaining 30% of texts from the PRIMARY dataset as your Test (PRIMARY) set.  Designate 100% of your SECONDARY dataset as the Test (SECONDARY) dataset.  So you should have one Training set (drawn from the PRIMARY data), and two different Test sets (one from PRIMARY and one from SECONDARY). 

In [455]:
# import libraries 
import pandas as pd 
import numpy as np 
import sklearn 
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.pipeline import Pipeline

In [456]:
# read in primary and secondary datasets 
twitter_stance = pd.read_csv('/Users/tiffwong/Desktop/cs585/project/twitter_stance_labels2.csv') 
changeorg_stance = pd.read_csv('/Users/tiffwong/Desktop/cs585/project/changeorg_stance_labels.csv') 
twitter_stance = twitter_stance.drop(['Unnamed: 0'], axis=1)
changeorg_stance = changeorg_stance.drop(['Unnamed: 0'], axis=1) 

# sanity check 
# changeorg_stance.head() 

In [457]:
twittertext_train, twittertext_test, twitterlabel_train, twitterlabel_test = train_test_split(twitter_stance["text"], twitter_stance["label"],
                                   random_state=1, 
                                   train_size=0.70, 
                                   shuffle=True) 

# sanity check
len(twittertext_train)/len(twitter_stance)

# join twittertext_train and twitterlabel_train
twitter_train = pd.concat([twittertext_train, twitterlabel_train], axis=1, join='inner') 
# join twittertext_test and twitterlabel_test 
twitter_test = pd.concat([twittertext_test, twitterlabel_test], axis=1, join='inner') 

# rename changeorg_stance dataset to be 100% test 
changetext_test = changeorg_stance["text"] 
changelabel_test = changeorg_stance["label"]

# twitter_test.head() 
# changetext_test.head() 
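
A possible variant of the split above, not used in this notebook: passing `stratify=` asks `train_test_split` to keep the label proportions roughly equal in the Training and Test sets. The sketch below uses a small made-up frame (`toy` and its contents are assumptions for illustration only, not the project data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for twitter_stance
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "label": ["unclear"] * 50 + ["pro-mitigation"] * 30 + ["anti-mitigation"] * 20,
})

text_train, text_test, label_train, label_test = train_test_split(
    toy["text"], toy["label"],
    train_size=0.70,
    random_state=1,
    shuffle=True,
    stratify=toy["label"],  # keep class proportions similar in both parts
)

# Sanity checks: split sizes and label shares in each part
print(len(text_train), len(text_test))
print(label_train.value_counts(normalize=True))
print(label_test.value_counts(normalize=True))
```

Whether stratifying helps here depends on how imbalanced the three stance labels are, so treat it as an option to try rather than a required change.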

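Dataframes right now: 

* PRIMARY training 

twitter_train = 70% of twitter_stance dataset for training (text in twittertext_train, labels in twitterlabel_train) 

* PRIMARY testing 

twitter_test = 30% of twitter_stance dataset for testing (text in twittertext_test, labels in twitterlabel_test) 

* SECONDARY testing 

changetext_test and changelabel_test = 100% of changeorg_stance dataset for testing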

## 2. Baseline model training: 
Train a simple bag-of-words classifier on your Training dataset.  If your data comes from the stance task, you will build a multiclass model (one which can assign one of three labels - pro-mitigation, anti-mitigation, or unclear). 


An example of how to use scikit-learn to build a simple text categorization model is here: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.  

In [458]:
count_vect = CountVectorizer() 
twitter_train_counts = count_vect.fit_transform(twittertext_train) 
twitter_train_counts.shape

(1260, 5841)

In [459]:
tf_transformer = TfidfTransformer(use_idf=False).fit(twitter_train_counts) 
twitter_train_tf = tf_transformer.transform(twitter_train_counts) 
twitter_train_tf.shape 

(1260, 5841)

In [460]:
tfidf_transformer = TfidfTransformer() 
twitter_train_tfidf = tfidf_transformer.fit_transform(twitter_train_counts) 
twitter_train_tfidf.shape

(1260, 5841)

### train a classifier

In [461]:
clf = MultinomialNB().fit(twitter_train_tfidf, twitterlabel_train)

## 3. Model evaluation 1: 
Calculate your baseline model's accuracy for your model's predictions on the Test (PRIMARY) set, and on the Test (SECONDARY) set.  Enter these values in the answer boxes provided.

### Build a pipeline 
In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [462]:
twittertext_train

1145    A local #pharmaceutical company had expressed ...
927     200K dead isn’t enough for you? You’re gonna r...
1189    After vaccine announcement, \ngovt. announces ...
1065    Please don't put tracking nano-bots in our Cov...
671     Leave Fauci alone. Talk about Trump and Pence,...
                              ...                        
905     #CovidVaccine will never work. ANd has never w...
1791    Go America! Time to unite against the real ene...
1096    The CDC announced Friday the relaxing of COVID...
235     When something doesn't make logical sense. Lik...
1061    According to Wong late on Friday, details of t...
Name: text, Length: 1260, dtype: object

In [463]:
text_clf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('clf', MultinomialNB()),
]) 
text_clf.fit(twittertext_train, twitterlabel_train)

### PRIMARY dataset evaluation (twitter_stance)

In [464]:
# use twittertext_test and twitterlabel_test 
docs_test = twittertext_test
predicted = text_clf.predict(docs_test) 
twitter_accuracy = np.mean(predicted == twitterlabel_test) 

### SECONDARY dataset evaluation (changeorg_stance)

In [465]:
# use changetext_test and changelabel_test 
docs_test = changetext_test 
predicted = text_clf.predict(docs_test) 
change_accuracy = np.mean(predicted == changelabel_test) 

## 4. Feature engineering: 
In order to try to improve your model, think about what features of the text might be associated with the category you are trying to predict.  What attributes of a text besides the presence of individual words might be good predictors (for example, regular expression patterns or specific word sequences)?  Create at least three new features that represent attributes of the text.  Add them to your model and retrain. 

An example of how to add a set of features (defined as a vector of 1/0 values indicating whether the attribute is present or absent for a given text) is shown here: https://gist.github.com/DerrickHiggins/20c77745b080e3d493231424d7da9a2f 

### Feature 1: Word Count 
The first new feature is word count, to see whether the length of a text gives any indication of its stance label. 

In [466]:
def word_count(string): 
    wrdcount = len(string.split(" ")) 
    return(wrdcount) 

word_count(twittertext_train[1]) 

19
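
One detail worth noting about the function above: `split(" ")` splits on every single space, so two spaces in a row produce an empty string that is counted as a word. The sketch below compares it with `split()` (no argument), which splits on any run of whitespace. `word_count_ws` is a hypothetical name used only for this comparison; the notebook's results were produced with the original `word_count`.

```python
def word_count(string):
    # Original behaviour: split on each single space
    return len(string.split(" "))


def word_count_ws(string):
    # Hypothetical variant: split on any run of whitespace
    return len(string.split())


# Example tweet from this notebook; it has two spaces after "guidelines."
example = ("Excellent thread on CDC and CDPH school guidelines.  "
           "Prioritizing students in school is a great place to start")

print(word_count(example))     # counts the empty string between the two spaces
print(word_count_ws(example))  # counts only non-empty tokens
```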

### Feature 2: Sentiment Analysis - subjectivity
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. 

I will be adding each text's subjectivity and polarity rating as two new features. 

In [467]:
# pip install -U textblob

In [468]:
import textblob
from textblob import TextBlob 

In [469]:
def sentiment_subjectivity(string): 
    text = TextBlob(string) 
    subjectivity = text.sentiment.subjectivity 
    return(subjectivity)

In [470]:
twittertext_train[1]

'Excellent thread on CDC and CDPH school guidelines.  Prioritizing students in school is a great place to start'

In [471]:
sentiment_subjectivity(twittertext_train[1])  

0.875

### Feature 3: Sentiment Analysis - Polarity

In [472]:
def sentiment_polarity(string): 
    text = TextBlob(string) 
    polarity = text.sentiment.polarity 
    return(polarity)

In [473]:
sentiment_polarity(twittertext_train[1])  

0.9

### Feature 4: Number of stopwords 
Stopwords are very common words (such as "the", "is", and "to") that carry little meaning on their own and are often filtered out before text processing. 

The fourth feature to add is the number of stopwords in each text. 

In [474]:
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
nltk.download('stopwords') 
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tiffwong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tiffwong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [475]:
def count_stopwords(string):
    stop_words = set(stopwords.words('english'))  
    word_tokens = word_tokenize(string)
    stopwords_x = [w for w in word_tokens if w in stop_words]
    return len(stopwords_x) 

In [476]:
count_stopwords(twitter_stance["text"][1])

6
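
A caveat about the count above: the NLTK English stopword list is all lowercase, and the membership test is case-sensitive, so a capitalized stopword such as "The" at the start of a sentence is not counted. The sketch below illustrates the difference with a tiny hand-written stop list and plain `str.split` tokenization (both are simplifications assumed for illustration; the notebook itself uses NLTK's list and `word_tokenize`).

```python
# Hypothetical subset of a stopword list, all lowercase like NLTK's English list
STOP_WORDS = {"the", "is", "a", "to", "in", "on", "and"}


def count_stopwords_cased(tokens):
    # Case-sensitive membership test, as in count_stopwords above
    return sum(token in STOP_WORDS for token in tokens)


def count_stopwords_lower(tokens):
    # Lowercase each token before the membership test
    return sum(token.lower() in STOP_WORDS for token in tokens)


tokens = "The vaccine is safe and The CDC is right".split()
print(count_stopwords_cased(tokens))
print(count_stopwords_lower(tokens))
```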


### generate new features 

In [478]:
def new_features(dataframe1): 
    return_df = pd.DataFrame({'index':dataframe1.index, 'text':dataframe1.values}) 
    # word count 
    return_df['word_count'] = return_df['text'].apply(lambda x: word_count(x)) 

    # subjectivity 
    return_df['subjectivity'] = return_df['text'].apply(lambda x: sentiment_subjectivity(x)) 

    # polarity: add 1 so the range shifts from [-1, 1] to [0, 2] (MultinomialNB rejects negative feature values) 
    return_df['polarity'] = return_df['text'].apply(lambda x: sentiment_polarity(x)) 
    return_df['polarity'] = return_df['polarity'] + 1

    # stopword count
    return_df['stopwords'] = return_df['text'].apply(lambda x: count_stopwords(x))  

    return return_df 
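
The reason for shifting polarity by 1 can be checked directly: `MultinomialNB` raises a `ValueError` when the feature matrix contains negative values. The sketch below uses three made-up polarity scores and labels (assumptions for illustration only).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical raw polarity scores in [-1, 1], one feature per text
X_raw = np.array([[-0.5], [0.0], [0.9]])
y = ["anti-mitigation", "unclear", "pro-mitigation"]

try:
    MultinomialNB().fit(X_raw, y)
    raised = False
except ValueError as err:
    raised = True
    print("rejected:", err)

# Shifting by 1 moves the range to [0, 2], which the classifier accepts
X_shifted = X_raw + 1
model = MultinomialNB().fit(X_shifted, y)
print(model.classes_)
```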

#### twittertext_train 

In [479]:
twittertext_train_wfeatures = new_features(twittertext_train)
twittertext_train_wfeatures.head()

| | index | text | word_count | subjectivity | polarity | stopwords |
|---|---|---|---|---|---|---|
| 0 | 1145 | A local #pharmaceutical company had expressed ... | 17 | 0.1875 | 0.9375 | 5 |
| 1 | 927 | 200K dead isn’t enough for you? You’re gonna r... | 16 | 0.45 | 0.9 | 7 |
| 2 | 1189 | After vaccine announcement, \ngovt. announces ... | 10 | 0.0 | 1.0 | 1 |
| 3 | 1065 | Please don't put tracking nano-bots in our Cov... | 16 | 0.0 | 1.0 | 6 |
| 4 | 671 | Leave Fauci alone. Talk about Trump and Pence,... | 34 | 0.0 | 1.0 | 11 |


#### twittertext_test

In [480]:
twittertext_test_wfeatures = new_features(twittertext_test)
twittertext_test_wfeatures.head()

| | index | text | word_count | subjectivity | polarity | stopwords |
|---|---|---|---|---|---|---|
| 0 | 1462 | #coronavirus #Covid #COVID19 #COVIDー19 #CovidV... | 12 | 0.0 | 1.0 | 2 |
| 1 | 510 | I wouldn’t take a vaccine from any of them unt... | 15 | 0.2 | 0.85 | 8 |
| 2 | 612 | Safety &amp; efficacy of #CovidVaccine must fo... | 14 | 0.1 | 1.0 | 1 |
| 3 | 1322 | to produce up to an additional 100 million #Co... | 17 | 0.3 | 1.0 | 7 |
| 4 | 993 | CDC won’t take blame for companies who force m... | 50 | 0.227679 | 1.040179 | 11 |


#### changetext_test

In [481]:
changetext_test_wfeatures = new_features(changetext_test)
changetext_test_wfeatures.head() 

| | index | text | word_count | subjectivity | polarity | stopwords |
|---|---|---|---|---|---|---|
| 0 | 0 | CARTA PELOS POBRES! | 3 | 0.0 | 1.0 | 0 |
| 1 | 1 | Target: Protect your team and community as Cov... | 10 | 0.0 | 1.0 | 4 |
| 2 | 2 | Restrict Maximum Monthy BCHydro Charges During... | 8 | 0.0 | 1.0 | 0 |
| 3 | 3 | Ensure Sir Richard Branson sells private asset... | 14 | 0.375 | 1.0 | 5 |
| 4 | 4 | Cancel Lucknow University Exams! Mass promotio... | 11 | 0.0 | 1.0 | 3 |


### Retrain model with new features

In [482]:
from scipy import sparse

In [483]:
count_vect = CountVectorizer() 
twitter_train_counts = count_vect.fit_transform(twittertext_train) 
tfidf_transformer = TfidfTransformer() 
twitter_train_tfidf = tfidf_transformer.fit_transform(twitter_train_counts) 
twitter_train_tfidf.shape

(1260, 5841)

In [484]:
# Add new feature to the document representation 
# combine feature arrays 
with_features = sparse.hstack([twitter_train_tfidf, twittertext_train_wfeatures[['word_count', "subjectivity", "polarity", "stopwords"]]]) 

In [485]:
# retrain the model 
clf = MultinomialNB().fit(with_features, twitterlabel_train)  
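
To make the combining step easier to follow, here is a small sketch of what `sparse.hstack` does: it appends the extra feature columns to the right of the TF-IDF matrix, matching rows by position rather than by index label. The texts, the index values, and the single `word_count` column are made up for illustration, and `TfidfVectorizer` is used as a shortcut for `CountVectorizer` followed by `TfidfTransformer`.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts with a shuffled index, like a Series after train_test_split
texts = pd.Series(
    ["masks work", "no more lockdowns please", "vaccine news today"],
    index=[7, 2, 5],
)

tfidf = TfidfVectorizer().fit_transform(texts)  # sparse, shape (3, 9)

# Extra feature in the same row order as texts
extra = pd.DataFrame({"word_count": [len(t.split()) for t in texts]})

# Convert the dense features explicitly, then stack column-wise
combined = sparse.hstack(
    [tfidf, sparse.csr_matrix(extra.to_numpy(dtype=float))]
).tocsr()

print(tfidf.shape, combined.shape)
print(combined[:, -1].toarray().ravel())  # the appended word_count column
```

Because rows are matched by position, the feature frame must be in the same order as the texts passed to the vectorizer, which is the case for the frames built by `new_features`. Also note that raw counts such as `word_count` are much larger than TF-IDF weights, which may affect how much they influence the classifier.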

## 5. Model evaluation 2: 
Calculate overall model accuracy for your new model's predictions on the Test (PRIMARY) set, and on the Test (SECONDARY) set.  Enter these values in the answer boxes provided.

### PRIMARY dataset evaluation (twitter_stance)
Transform twittertext_test with the vectorizer and transformer fitted on the Training set, then score the predictions against twitterlabel_test. 

In [486]:
primary_test_counts = count_vect.transform(twittertext_test) 
twitter_test_tfidf = tfidf_transformer.transform(primary_test_counts) 

In [487]:
# use twittertext_test_wfeatures and twitterlabel_test 
# combine test and new features 
docs_test = sparse.hstack([twitter_test_tfidf, twittertext_test_wfeatures[['word_count','subjectivity', 'polarity', 'stopwords']]]) 

# predict labels 
predicted = clf.predict(docs_test) 

# calculate accuracy 
twitter_accuracy2 = np.mean(predicted == twitterlabel_test) 

### SECONDARY dataset evaluation (changeorg_stance)
Transform changetext_test with the vectorizer and transformer fitted on the Training set, then score the predictions against changelabel_test. 

In [488]:
secondary_test_counts = count_vect.transform(changetext_test) 
change_test_tfidf = tfidf_transformer.transform(secondary_test_counts) 

In [489]:
# use changetext_test_wfeatures and changelabel_test 
# combine test and new features 
docs_test = sparse.hstack([change_test_tfidf, changetext_test_wfeatures[['word_count','subjectivity', 'polarity', 'stopwords']]]) 

# predict labels 
predicted = clf.predict(docs_test) 

# calculate accuracy 
change_accuracy2 = np.mean(predicted == changelabel_test) 
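
The transform-then-stack steps above are repeated by hand for each test set. One alternative, sketched here under assumptions and not what this notebook does, is to wrap the extra features in a `FunctionTransformer` and join them with TF-IDF through a `FeatureUnion`, so a single `Pipeline` handles training and both test sets. `word_count_column` and the toy texts and labels are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def word_count_column(texts):
    """Return an (n_samples, 1) array with the whitespace-token count of each text."""
    return np.array([[len(t.split())] for t in texts], dtype=float)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("word_count", FunctionTransformer(word_count_column)),
])
model = Pipeline([("features", features), ("clf", MultinomialNB())])

# Hypothetical toy training data
train_texts = [
    "wear a mask and get the vaccine",
    "vaccines are safe and effective",
    "end the lockdown now",
    "no mandates no lockdowns",
    "the cdc released new numbers today",
    "news update on case counts",
]
train_labels = [
    "pro-mitigation", "pro-mitigation",
    "anti-mitigation", "anti-mitigation",
    "unclear", "unclear",
]

model.fit(train_texts, train_labels)
predictions = model.predict(["vaccines are safe", "stop the lockdown"])
print(predictions)
```

With this layout, evaluating on another dataset is a single `model.predict(...)` call, and the vectorizer fitted on the Training set is reused automatically.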

## Model comparison 

In [493]:
"With the first model, the primary dataset had {twitter_accuracy}% accuracy, while with the second model, it had {twitter_accuracy2}% accuracy.".format(twitter_accuracy=round(twitter_accuracy*100, 2), twitter_accuracy2=round(twitter_accuracy2*100, 2)) 

'With the first model, the primary dataset had 48.33% accuracy, while with the second model, it had 44.07% accuracy.'

In [494]:
"With the first model, the secondary dataset had {change_accuracy}% accuracy, while with the second model, it had {change_accuracy2}% accuracy.".format(change_accuracy=round(change_accuracy*100, 2), change_accuracy2=round(change_accuracy2*100, 2)) 

'With the first model, the secondary dataset had 43.87% accuracy, while with the second model, it had 46.07% accuracy.'

In [498]:
(changeorg_stance['label']=='unclear').sum()

535

In [499]:
(twitter_stance['label']=='unclear').sum()

761
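
The `unclear` counts above give context for the accuracies: a model that always predicts the most frequent training label sets a floor that any useful classifier should beat. A minimal sketch of that comparison with `DummyClassifier` follows; the label arrays are made up for illustration and stand in for twitterlabel_train and twitterlabel_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels standing in for the real train/test labels
train_labels = np.array(
    ["unclear"] * 5 + ["pro-mitigation"] * 3 + ["anti-mitigation"] * 2
)
test_labels = np.array(
    ["unclear"] * 2 + ["pro-mitigation"] * 2 + ["anti-mitigation"] * 1
)

# The features are ignored by this strategy, so a column of zeros is enough
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(train_labels), 1)), train_labels)

baseline_accuracy = baseline.score(np.zeros((len(test_labels), 1)), test_labels)
print(baseline_accuracy)  # share of test labels equal to the majority training label
```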