This notebook contains the writeup, the three learners, and the ensemble learner, along with some extra analysis. 

forest.ipynb includes the code used for a grid search to get the optimal parameters for the random forest. logistic_regression.ipynb includes the code for its respective grid search as well as the code for collecting data. The data collected was saved as the csv fulldatasetwithends.csv. Not all code submitted was run exactly as is. For the grid searches, I submitted code that ran grid searches on only 1% of the dataset, since they took a long time. And I just commented out the calling of the data collection function for the same reason.

**Classifying Congressional Tweets Using Various Machine Learning Algorithms**

For my project I decided to look at applying machine learning to classify by party tweets made by Congresspeople and Senators. In particular, I wanted to know if ensemble methods combining different learners could lead to better predictions. Data were collected from twitter using the tweepy library, all tweets by people on this list of Congressmen and Senators were used: https://triagecancer.org/congressional-social-media. This source for twitter handles was used as it includes party info. Those who were not Republicans or Democrats were filtered out. 

Obviously, lots of work has been done in the field of sentiment analysis, even in the specific political classification area. Some studies have found that political orientation classification can be difficult (The Perils of Classifying Political Orientation From Text;
Hao Yan, Allen Lavoie, and Sanmay Das; https://www.cse.wustl.edu/~sanmay/papers/political-orientation.pdf) because political language isn’t necessarily constant in different media. I think that the dataset I’m using of just tweets should mitigate that problem, as much of the political messaging across politicians on twitter specifically, with a character limit, probably by necessity is limited to buzzwords and hashtags that should be easy to identify. Other analyses have covered looking at the orientation of users as a whole, not specific tweets (Predicting the Political Alignment of Twitter Users; Michael D. Conover, Bruno Gonc¸alves, Jacob Ratkiewicz, Alessandro Flammini and Filippo Menczer; https://cnets.indiana.edu/wp-content/uploads/conover_prediction_socialcom_pdfexpress_ok_version.pdf). Of course, classifying specific tweets is much harder than classifying users as a whole. The analysis done by Conover et al found that using hashtags had the greatest accuracy, but not all tweets have hashtags so I couldn’t take that approach.

The approach I did take was to use the bag of words model for two learners and to leverage the sequential nature of a tweet for the third. I trained three models: a logistic regression model, a random forest model, and an lstm recurrent neural network. The first two were done in scikit-learn, the last in tensorflow. For the first two I vectorized using tfidf and some simple preprocessing (to remove links and such). I ran grid searches on both the vectorizer and the model as a whole for both the random forest and logistic regression. I was surprised to find for both that the optimal n_gram range was only 1 (so no word pairs or triplets). The grid search found stop words helpful for both. Other optimal parameters can be found in forest.ipynb and logistic_regression.ipynb. For the logistic regression, before doing the full grid search I determined a K value for SelectKBest by running a few grid searches on subsamples and picking a number that led to a reasonable training time at little cost to accuracy. The lstm was vectorized by a simple Token Text encoding that padded each vector to be the same length as other vectors in the batch.

The lstm included an embedding layer, a bidirectional layer, and two dense layers: one with relu and one with a sigmoid activation function. Much of my code for this learner was adapted from Sebastian Raschka’s Python Machine Learning 3rd Edition. Not much hyperparameter optimization was done for this model as I was really most interested in seeing how it overlaps with the other models, and optimization is very time-intensive, as this learner has very high training time (close to an hour!). 

The ensemble learner simply took a majority vote of the three other learners. I didn’t use the built in majority vote module in scikit-learn because I had a tensorflow model, so I just summed the predictions on the test set (making sure they weren’t accidentally shuffled to not match). This had an accuracy in the range of the neural network and the logistic regression. 

Results:
**Logistic Regression**

Accuracy on test set:

0.8361050599507642
Training time:

38.06299066543579

Prediction time (test):

5.7797932624816895

**Random Forest**

Accuracy on test set:

0.6471051839257567

Training time:

157.40676164627075

Prediction time (test):

69.72026896476746

**Recurrent Neural Network**

Accuracy on test set:

80.83%

Training time:

3024.5883202552795

Prediction time (test):

69.72026896476746

**Majority Vote**

Accuracy on test set:

82.62%

Majority vote training and prediction times are the sum of component parts. I didn't include numbers here because I thought it would be slightly misleading, as you can parallelize training and prediction and I didn't implement that as I didn't think it was too important for this project and it's arguably outside the scope of this course. As you can see, the logistic regression was the standout model, with the lowest training time and highest test set accuracy. The RNN was close in accuracy, but had an abysmally high training time. The random forest was pretty terrible, with very low test set accuracy. Its accuracy was about equal to the proportion of Democrat tweets in the test dataset (about 66%). It classified almost everything as Democrat, and very few (about 1000 in the run I’m submitting) as Republican. Unfortunately, the poor performance of the random forest means that my ensemble learner suffered as well. 

I suspect the poor performance of the random forest may be because there are too many features for a random forest to do well. If there’s a few words that are indicative within a tweet, then it’s likely only one or two trees in the forest will have been significantly trained on those words. Those trees would be outvoted by the others, and when there’s no information to go on they’ll vote Democrat since the majority of the dataset is Democrat.

Other results can be seen below in the code blocks. Because the random forest was poor, I suspected that I couldn’t really test the question of whether a majority vote improves learners because if the two strong learners tie, the tiebreaker random forest would almost always vote Democrat. 75% of those misclassified by the majority were Republican, supporting this theory. So I looked at those wrongly predicted by the majority vote and looked at the vote scores themselves. A 0 or 3 indicates that all three models were wrong, 1 or 2 that 1 model was correct. I found in 74% of cases, at least one model actually had the right answer. I did the same analysis without the random forest and found that in 48% of cases, one of the two stronger models predicted correctly. This lends strong support to the thesis that ensemble learning can be helpful, as these very different models are incorrect about different test examples. This suggests that if a third strong model is found, a majority vote between the three would yield significant improvement.

It’s also the case that 100% accuracy is impossible with this dataset. I doubt that a machine learning algorithm could achieve much greater accuracy than a human. I don’t think there are patterns unrelated to politics that are correlated with party. There’s enough diversity across the aisle that I imagine anything a machine could use to classify would be noticed by an educated human. This includes words and phrases like “the wall” or “our troops” or “minority representation”. Tweets that don’t have political content (like simple happy birthday wishes) are bound to be misclassified. So I grabbed a sample of some of the misclassified tweets to see how hard they are for a person to identify. Many of them I couldn’t really figure out correctly, although some of them I could. An example of the former: “You can also join us by calling 855-859-6912! Press * 3 to ask a question!”. I don’t think any algorithm or person is gonna be more than 50% likely to classify that correctly. An example of the latter: “For those insisting Mail-in Voting doesn’t cause voter fraud; election theft, New Jersey proves otherwise (Note: Absentee Voting verification procedures reduce fraud).” This one is by a Republican, as should be obvious to anyone who’s paid attention this election. So clearly there’s room for improvement if this got misclassified.

As I discussed above, there’s quite a bit of disagreement between the two strong models. This means that a third model could be added (assuming it’s better than the random forest) that would lead to a majority voting model that’s significantly better than any of its component parts. I think a support vector machine might be a good candidate, as that’s what Conover et al used. Other improvements can be made on the models themselves. More hyperparameter tuning, particularly on the recurrent neural network, would likely improve performance. Future work could explore these possibilities and improve the models I made here.

But overall from this project it is clear that classification of individual tweets by party is viable, and ensemble learning is a viable possibility to improve it. 80% success isn’t bad, and I’ve laid out some clear ways to get even better results. This project may not seem all that relevant on its face, but there’s clear uses. For one thing, a model can be trained on Congresspeople and then generalized and used to predict the leanings of tweets by general Twitter users, or used to predict on any text at all. The work of Yan et al suggests this may not work though, since the training population needs to have similar speech patterns as the testing population. The general twitter population likely writes informally compared to Senators and Congresspeople. Also, with both the logistic regression and the recurrent neural networks, the probability estimates can be used to suggest how partisan different politicians are. It could assess whether a candidate is trying to appeal to the middle ground or trying to appeal to his/her parties base. If a Democrat candidate’s tweets are strongly predicted Democrat, then it would be the latter. If the prediction is weaker, then the former seems true. The results of this project suggest the viability of those options, and they should be further explored.

**Prototype writeup:**

Title: Classifying Congressional Tweets Using Various Machine Learning Algorithms

Description:

I am trying to read tweets written by Congresspeople (Senate and House) and use machine learning to predict what party they're from. The tweets are collected using Twitter's API and the python libraryTweepy. I'm converting the text of the words to a bag of words, filtering out punctuation and stemming as well. I'm leaving in hashtags and @s. 

In terms of specific algorithms, I expect to use multiple different classifiers. I also will use a neural net,though I'm not sure the architecture I'll aim for. I still need to do more research on what's commonly used for NLP. It will likely incorporate existing sentiment analysis models to improve accuracy. In addition to that, I'll try a number of different classifiers, using grid search to optimize them, and test them with both test set accuracy and k-fold cross validation.

I plan to present the best models based on multiple metrics, including accuracy, time to learn, and time to classify. I'll include the optimized parameters as well, and the optimal preparatory pipelines.

I've done most of the data preprocessing and collected a decent-sized dataset. I've run a logistic
regression classifier with default hyperparameters and a tfidf vectorizer. Stemming and filtering functions are completed. You can see the accuracy of this basic model below; with 75% accuracy on the test set, I'd say that's a good proof of concept.

I still need to complete my other models, and I also would like to get a larger dataset. I need to see how to get more tweets per person using tweepy. I'll also look into other tools for preprocessing. And I'll look into other dictionaries that could be helpful.


In [1]:
import pandas as pd
import numpy as np 
import tensorflow as tf
import tensorflow_datasets as tfds
from joblib import load, dump
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\quincy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
#read in the dataset

origdf = pd.read_csv('fulldatasetwithends.csv').dropna()
origdf

Unnamed: 0.1,Unnamed: 0,account,text,party
0,0,SenShelby,Today is #SmallBusinessSaturday! I encourage a...,R
1,1,SenShelby,Wishing everyone a very happy #Thanksgiving. T...,R
2,2,SenShelby,The @USAirForce has selected Maxwell Air Force...,R
3,3,SenShelby,Great news! @NSF recently awarded @SamfordU $1...,R
4,4,SenShelby,Thank you to each and every one of the brave m...,R
...,...,...,...,...
376415,376415,RepGwenMoore,In conjunction with the reckless decision to p...,D
376416,376416,RepGwenMoore,Without consultation with or authorization fro...,D
376417,376417,RepGwenMoore,I will continue doing everything in my power t...,D
376418,376418,RepGwenMoore,Poverty is rising in parts of the US. We need ...,D


In [3]:
# process text
import string

removable_punctuation = "!\"$%&'()*+,-./:;<=>?[\]^_`{|}~"
translator = str.maketrans('', '', removable_punctuation)

def remove_links(text):
    index = text.find('https://t.co/')
    if index is -1:
        return text
    else:
        text = text[0:index] + text[index+23:]
        return remove_links(text)
 

def preprocessor(text):
    try:
        text = remove_links(text)
#         print(text[index+23:])
    except:
        print(text)
    text = text.replace("&amp;", '').lower()
    return text.translate(translator)
    


In [4]:
# shuffle dataframe, convert party to binary (0 = Democrat, 1 = Republican)
df = origdf.sample(frac=1)

df['party'] =df['party']=='R'
df['party'] = df['party'].astype(int)

In [5]:
print("Republicans:")
print(sum(df['party']))
print("Democrats:")
print(len(df) - sum(df['party']))


Republicans:
135933
Democrats:
240487


In [6]:
#tokenizer
from sklearn.model_selection import train_test_split
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

porter = PorterStemmer()
stop = []

#need to put stopwords through preprocessor, since stopwords have punctuation and features don't
for word in stopwords.words('english'):
    stop.append(preprocessor(word))

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split() if word not in stop]

X = list(df['text']) 
y = list(df['party'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.3, random_state = 0)

In [7]:
#logistic regression with hyperparameters taken from the grid search
from joblib import dump
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import time

starttime = time.time()

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=preprocessor)



lr = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0, solver='liblinear', C=100, penalty= 'l2'))])

lr.fit(X_train,y_train)
trainingtime = time.time()-starttime
results_train = lr.predict(X_train) == y_train
startpredicttime = time.time()
results_test = lr.predict(X_test) == y_test
predicttime = time.time()-startpredicttime

print("Accuracy on training set:")
print(np.count_nonzero(results_train)/len(results_train))
print("Accuracy on test set:")
print(np.count_nonzero(results_test)/len(results_test))
print("Training time:")
print(trainingtime)
print("Prediction time (test):")
print(predicttime)

Accuracy on training set:
0.9276567967392046
Accuracy on test set:
0.8361050599507642
Training time:
38.06299066543579
Prediction time (test):
5.7797932624816895


In [8]:
# random forest classifier with (some) hyperparameters taken from the grid search

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import time

starttime = time.time()
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=preprocessor, tokenizer=tokenizer_porter,
                        stop_words=stop)




rf = Pipeline([('vect', tfidf),('clf', RandomForestClassifier(n_estimators=10, random_state=0,
                                                             max_depth=20, max_features='auto',
                                                             n_jobs=-1))])



rf.fit(X_train, y_train) 
trainingtime = time.time()-starttime
results_train = rf.predict(X_train) == y_train
startpredicttime = time.time()
results_test = rf.predict(X_test) == y_test
predicttime = time.time()-startpredicttime

print("Accuracy on training set:")
print(np.count_nonzero(results_train)/len(results_train))
print("Accuracy on test set:")
print(np.count_nonzero(results_test)/len(results_test))
print("Training time:")
print(trainingtime)
print("Prediction time (test):")
print(predicttime)

Accuracy on training set:
0.6475441566031864
Accuracy on test set:
0.6471051839257567
Training time:
157.40676164627075
Prediction time (test):
69.72026896476746


In [9]:
#converting train and test lists back to dataframes for the RNN

rnn_df_train = pd.DataFrame(list(zip(X_train, y_train)), 
               columns =['text', 'party'])

rnn_df_test = pd.DataFrame(list(zip(X_test, y_test)), 
               columns =['text', 'party'])
rnn_df_test_copy = rnn_df_test.copy() #will be used later
rnn_df_test['text'] = rnn_df_test['text'].apply(preprocessor)
rnn_df_train['text'] = rnn_df_train['text'].apply(preprocessor)


target_train = rnn_df_train.pop('party') 
target_test = rnn_df_test.pop('party') 
rnn_df_train

Unnamed: 0,text
0,hello tomatoes peppers #covid19garden
1,still quite excited that just this past weeken...
2,when the trump administration relaxes enforcem...
3,minimizing the risk and severity of this virus...
4,honored to have fought for montana every step ...
...,...
263489,in the #caresact senate republicans and the wh...
263490,thanks for always having my back sis @ayannapr...
263491,too many servicemembers come home with invisib...
263492,a president that values american lives doesnt ...


In [11]:
#creating tensorflow datasets

ds_raw_test = tf.data.Dataset.from_tensor_slices((rnn_df_test.values,target_test.values))
ds_raw_train_valid =  tf.data.Dataset.from_tensor_slices((rnn_df_train.values,target_train.values))

trainsize =(int) (len(ds_raw_train_valid)*0.8)

ds_raw_train = ds_raw_train_valid.take(trainsize)

ds_raw_valid = ds_raw_train_valid.skip(trainsize)

In [12]:
#processing datasets

from collections import Counter

tokenizer = tfds.deprecated.text.Tokenizer()
token_counts = Counter()

for example in ds_raw_train:
    tokens = tokenizer.tokenize(example[0].numpy()[0])
    token_counts.update(tokens)

encoder = tfds.deprecated.text.TokenTextEncoder(token_counts)
    
def encode(text_tensor, label):
    text = text_tensor.numpy()[0]
    encoded_text = encoder.encode(text)
    return encoded_text, label

def encode_map_fn(text, label):
    return tf.py_function(encode, inp=[text,label], Tout=(tf.int64, tf.int64)) # 64 or 32?

ds_train = ds_raw_train.map(encode_map_fn)
ds_valid = ds_raw_valid.map(encode_map_fn)
ds_test = ds_raw_test.map(encode_map_fn)

    
train_data = ds_train.padded_batch(32, padded_shapes=([-1],[]))
valid_data = ds_valid.padded_batch(32, padded_shapes=([-1],[]))
test_data = ds_test.padded_batch(32, padded_shapes=([-1],[]))

In [13]:
#create and train RNN

from tensorflow.keras.layers import Embedding

model = tf.keras.Sequential()
model.add(Embedding(input_dim=100, output_dim = 6, input_length = 20, name= 'embed-layer'))

embedding_dim = 20
vocab_size = len(token_counts) + 2

tf.random.set_seed(1)

rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim = vocab_size,
                             output_dim = embedding_dim,
                             name = 'embed-layer'),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, name='lstm-kayer')),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

print(rnn.summary())

rnn.compile(optimizer = tf.keras.optimizers.Adam(1e-3),
                     loss = tf.keras.losses.BinaryCrossentropy(from_logits=False),
                     metrics=['accuracy'])
starttime = time.time()
history = rnn.fit(train_data, validation_data = valid_data, epochs=10)
trainingtime = time.time()-starttime
test_results = rnn.evaluate(test_data)
testtime = time.time()-trainingtime
print('Test Accuracy: {:.2f}%'.format(test_results[1]*100))

print("Training time:")
print(trainingtime)
print("Prediction time (test):")
print(predicttime)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embed-layer (Embedding)      (None, None, 20)          2134520   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               43520     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 2,186,361
Trainable params: 2,186,361
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 80.83%
Training time:
3024.5883202552795
Prediction time (test):
69.72026896476746


In [14]:
rnn_predictions = np.concatenate(np.round(rnn.predict(test_data))).astype(int)
rnn_predictions

array([1, 0, 1, ..., 1, 0, 0])

In [15]:
#this is to verify that the test set isn't shuffled for the RNN compared to the others

targets = np.concatenate([y for x, y in test_data], axis=0)
np.all(targets==y_test)

True

In [16]:
lr_predictions = lr.predict(X_test)
lr_predictions

array([1, 1, 1, ..., 1, 0, 0])

In [17]:
rf_predictions = rf.predict(X_test)
rf_predictions

array([0, 0, 0, ..., 0, 0, 0])

In [23]:
#how many did rf predict as republican

sum(rf_predictions)

1045

In [18]:
print(len(rnn_predictions))
sum(rnn_predictions == lr_predictions)

112926


96543

In [19]:
sum_predictions = np.add(lr_predictions, rf_predictions)
sum_predictions = np.add(sum_predictions, rnn_predictions)
sum_predictions

array([2, 1, 2, ..., 2, 0, 0])

In [20]:
majority_predictions = sum_predictions>1
majority_predictions = majority_predictions.astype(int)
majority_predictions

array([1, 0, 1, ..., 1, 0, 0])

In [44]:
print("Majority vote test set accuracy:")
print(sum(majority_predictions == y_test)/len (y_test)*100)

Majority vote test set accuracy:
82.6249048049165


In [22]:
#the text of some incorrectly classified tweets. This is mostly to see if those incorrectly classified are "hard" to classify (that is, could a person identify them)

majority_wrong = rnn_df_test_copy[majority_predictions!=y_test]
print(majority_wrong)
wrong_sample = majority_wrong.sample(n=100)
for index, row in wrong_sample.iterrows():
    print('Party: %s Text:' % row['party'], row['text'])

                                                     text  party
11      I have full confidence in @POTUS and @VP's abi...      1
18      Thanks to @POTUS and the #CARESAct, Virginia’s...      1
20      As we continue to work to safely reopen our co...      1
25      I’m live with @scrowder. Tune in here: https:/...      1
29      Grateful for the overwhelming support from Ohi...      1
...                                                   ...    ...
112881  "Everything" including court packing on the ag...      1
112886  Happy Birthday @RodneyChilders4 ! 🏁🏁🏁🎊🎂🎈🎁🎉👍😎🏁🏁...      1
112895  My Clean Cities Bill and EJ for All Act are in...      1
112913  Democrats would rather score political points ...      1
112915  Trump declared surrender to COVID the minute h...      0

[19621 rows x 2 columns]
Party: 0 Text: Looking foward to a news conference with @SenMikeLee today at 2pm, to celebrate the 5th Annual Flavors of Utah. Some of Utah's food producers are donating signature Utah  items to

In [42]:
#wanted to determine the scores of the wrongly predicted. If all the models predict incorrectly those the majority predicts incorrectly,
# then it's unlikely ensemble learning is helpful for this classification task

wrong_sum_predictions =sum_predictions[majority_predictions!=y_test]

print("sum scores of test examples predicted wrongly by majority")
print("# with score 0:")
print(sum(wrong_sum_predictions==0))
print("# with score 1:")
print(sum(wrong_sum_predictions==1))
print("# with score 2:")
print(sum(wrong_sum_predictions==2))
print("# with score 3:")
print(sum(wrong_sum_predictions==3))
print("percent with score 1 or 2 (so percent with disagreement):")
print((sum(wrong_sum_predictions==1) + sum(wrong_sum_predictions==2))/sum(wrong_sum_predictions)*100)

sum scores of test examples predicted wrongly by majority
# with score 0:
7707
# with score 1:
7736
# with score 2:
4143
# with score 3:
35
percent with score 1 or 2 (so percent with disagreement):
73.65908104421158


In [41]:
#The same analysis above, but ignoring the random forest since it just predicts Democrat for almost every example

lr_rnn_sum_predictions =  np.add(lr_predictions, rnn_predictions)

lr_rnn_wrong_sum_predictions =sum_predictions[majority_predictions!=y_test]

print("sum scores of test examples predicted wrongly by majority")
print("# with score 0:")
print(sum(lr_rnn_wrong_sum_predictions==0))
print("# with score 1:")
print(sum(lr_rnn_wrong_sum_predictions==1))
print("# with score 2:")
print(sum(lr_rnn_wrong_sum_predictions==2))
print("percent with score 1 (so percent with disagreement):")
print((sum(wrong_sum_predictions==1))/sum(wrong_sum_predictions)*100)

sum scores of test examples predicted wrongly by majority
# with score 0:
7707
# with score 1:
7736
# with score 2:
4143
percent with score 1 (so percent with disagreement):
47.96924412475972


In [40]:
# breaking down the wrong by party

print('Percent Republican of those misclassified by the majority vote')
print(sum(majority_wrong['party']) / len(majority_wrong['party'])*100)

Percent Republican of those misclassified by the majority vote
78.70648794658784
