## [Computational Social Science] Project 5: Natural Language Processing

In this project, you will use natural language processing techniques to explore a dataset containing tweets from members of the 116th United States Congress that met from January 3, 2019 to January 2, 2021. The dataset has also been cleaned to contain information about each legislator. Concretely, you will do the following:

* Preprocess the text of legislators' tweets
* Conduct Exploratory Data Analysis of the text
* Use sentiment analysis to explore differences between legislators' tweets
* Featurize text with manual feature engineering, frequency-based, and vector-based techniques
* Predict legislators' political parties and whether they are a Senator or Representative

You will explore two questions that relate to two central findings in political science and examine how they relate to the text of legislators' tweets. First, political scientists have argued that U.S. politics is currently highly polarized relative to other periods in American history, but also that the polarization is asymmetric. Historically, there were several conservative Democrats (i.e. "blue dog Democrats") and liberal Republicans (i.e. "Rockefeller Republicans"), as measured by popular measurement tools like [DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)#:~:text=DW\%2DNOMINATE\%20scores\%20have\%20been,in\%20the\%20liberal\%2Dconservative\%20scale.). However, in the last few years, there are few if any examples of any Democrat in Congress being further to the right than any Republican and vice versa. At the same time, scholars have argued that this polarization is mostly a function of the Republican party moving further right than the Democratic party has moved left. **Does this sort of asymmetric polarization show up in how politicians communicate to their constituents through tweets?**

Second, the U.S. Congress is a bicameral legislature, and there has long been debate about partisanship in the Senate versus the House. The House of Representatives is apportioned by population and all members serve two year terms. In the Senate, each state receives two Senators and each Senator serves a term of six years. For a variety of reasons (smaller chamber size, more insulation from the voters, rules and norms like the filibuster, etc.), the Senate has been argued to be the "cooling saucer" of Congress in that it is more bipartisan and moderate than the House. **Does the theory that the Senate is more moderate have support in Senators' tweets?**

**Note**: See the project handout for more details on caveats and the data dictionary.

In [None]:
# pandas and numpy
import pandas as pd
import numpy

# punctuation, spacy, stop words and English language model
from string import punctuation
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm
nlp = en_core_web_sm.load()

# textblob
from textblob import TextBlob

# CountVectorizer, TfidfVectorizer, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# gensim
import gensim
from gensim import models

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Image, scattertext, WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import scattertext as st
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

#Classification
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

import multiprocessing


In [None]:
congress_tweets = pd.read_csv("116th Congressional Tweets and Demographics.csv")
# fill in this line of code with a sufficient number of tweets, depending on your computational resources
#congress_tweets = congress_tweets.sample(...)
congress_tweets.head()

In [None]:
congress_tweets=congress_tweets.drop(['tweet_id', 'screen_name'],axis=1)

In [None]:
congress_tweets

## Preprocessing

The first step in working with text data is to preprocess it. Make sure you do the following:

* Remove punctuation and stop words. The `rem_punc_stop()` function we used in lab is provided to you but you should feel free to edit it as necessary for other steps
* Remove tokens that occur frequently in tweets, but may not be helpful for downstream classification. For instance, many tweets contain a flag for retweeting, or share a URL 

As you search online, you might run into solutions that rely on regular expressions. You are free to use these, but you should also be able to preprocess using the techniques we covered in lab. Specifically, we encourage you to use spaCy's token attributes and string methods to do some of this text preprocessing.
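
As a minimal sketch of the string-method route described above, the helper below drops retweet flags, URLs, and stop words using only Python string methods. The function name `clean_tweet_tokens`, the tiny `DEMO_STOP_WORDS` set, and the sample tweet are illustrative assumptions and not part of the lab code; in the project you would likely swap in spaCy's `STOP_WORDS` and tokenizer.

```python
from string import punctuation

# Hypothetical, deliberately tiny stop-word list for illustration only
DEMO_STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "for", "our"}

def clean_tweet_tokens(text):
    """Split on whitespace, then drop URLs, retweet flags, and stop words."""
    tokens = []
    for raw in text.split():
        # Filter URLs before stripping punctuation, while "http" is still intact
        if raw.lower().startswith("http"):
            continue
        # Remove leading/trailing punctuation such as "@", ":" or "!"
        word = raw.strip(punctuation)
        if not word or word == "RT":
            continue
        if word.lower() in DEMO_STOP_WORDS:
            continue
        tokens.append(word)
    return tokens

# Made-up tweet for illustration
sample = "RT @SenExample: Proud to vote for the bill! https://t.co/abc123"
print(clean_tweet_tokens(sample))
```

Checking for URLs before removing punctuation is a design choice: once the punctuation is stripped, a link collapses into a single string that is harder to tell apart from an ordinary word.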

In [None]:
def rem_punc_stop(text):
    stop_words = STOP_WORDS
    punc = set(punctuation)
    
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    doc = nlp(punc_free)
    
    spacy_words = [token.text for token in doc]
    
    spacy_words = [word for word in spacy_words if not word.startswith(('http', 'RT'))]
    
    #spacy_words2 = [token.text.lower() for token in doc]
    
    no_punc = [word for word in spacy_words if word not in stop_words]
    
    return no_punc

In [None]:
text = congress_tweets['text'][2]

In [None]:
tokens_reduced = rem_punc_stop(text)
tokens_reduced

In [None]:
numpy.random.seed(10)
ct_sub = congress_tweets.sample(n=3000)

In [None]:
ct_sub

In [None]:
ct_sub['tokens'] = ct_sub['text'].map(lambda x: rem_punc_stop(x))
ct_sub['tokens']

In [None]:
ct_sub['tokens_str'] = ct_sub['tokens'].map(lambda text: ' '.join(text))

## Exploratory Data Analysis

Use two of the techniques we covered in lab (or other techniques outside of lab!) to explore the text of the tweets. You should construct these visualizations with an eye toward the eventual classification tasks: (1) predicting the legislator's political party based on the text of their tweet, and (2) predicting whether the legislator is a Senator or Representative. As a reminder, in lab we covered <u>word frequencies,</u> <u>word clouds,</u> <u>word/character counts,</u> <u>scattertext,</u> and <u>topic modeling</u> as possible exploration tools. 
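
Word frequencies are the simplest of the tools listed above, so here is a minimal sketch of comparing the most common tokens across groups. The function name `top_words_by_group` and the toy token lists are assumptions for illustration; in the notebook the inputs would be something like `ct_sub['tokens']` and `ct_sub['party']`.

```python
from collections import Counter

def top_words_by_group(token_lists, groups, n=3):
    """Return the n most common tokens for each group label."""
    counters = {}
    for tokens, group in zip(token_lists, groups):
        counters.setdefault(group, Counter()).update(tokens)
    return {group: counter.most_common(n) for group, counter in counters.items()}

# Toy, made-up token lists standing in for the tokenized tweets
toy_tokens = [["health", "care", "vote"], ["tax", "cuts", "vote"], ["health", "workers"]]
toy_party = ["Democrat", "Republican", "Democrat"]

result = top_words_by_group(toy_tokens, toy_party, n=1)
print(result)
```

Tokens that rank highly for one group but not the other are natural candidates for features in the classification tasks.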

### EDA 1 - Word Cloud

In [None]:
# join the cleaned tokens of every sampled tweet into one string
wordtext = ' '.join(ct_sub['tokens'].map(lambda tokens: ' '.join(tokens)))
# generate the cloud from the full corpus, not the single example tweet stored in `text`
wordcloud = WordCloud().generate(wordtext)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

We noticed that `RT` appeared in the word cloud. `RT` marks a retweet and won't help with our analysis, so we adjust the `rem_punc_stop()` function to remove it from all tweets.

In [None]:
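def rem_punc_stop(text):
    # spaCy's default stop words plus the retweet flag "RT"
    stop_words = STOP_WORDS | {"RT"}
    
    punc = set(punctuation)
    
    # strip punctuation character by character before tokenizing
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    doc = nlp(punc_free)
    
    spacy_words = [token.text for token in doc]
    
    # drop stop words (including "RT") and URL remnants such as "httpstco..."
    no_punc = [word for word in spacy_words
               if word not in stop_words and not word.startswith('http')]
    
    return no_punc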

In [None]:
ct_sub['tokens'] = ct_sub['text'].map(lambda x: rem_punc_stop(x))
text = ' '.join(ct_sub['tokens'].map(lambda text: ' '.join(text)))

wordcloud = WordCloud(background_color = "white").generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

### EDA 2 - Topic Modeling 

#### Part 1: 5 components

In [None]:
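# create tf-idf matrix
X = ct_sub['text']
tf = TfidfVectorizer(tokenizer = rem_punc_stop)

tfidf_matrix =  tf.fit_transform(X)
# toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn versions reject
dense_matrix = tfidf_matrix.toarray()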

In [None]:
#apply LDA model with hyperparameter n_components = 5
lda = LatentDirichletAllocation(n_components=5, max_iter=20, random_state=0)
lda = lda.fit(dense_matrix)

In [None]:
#print topics
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [None]:
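#print top words
# note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
tf_feature_names = tf.get_feature_names()
print_top_words(lda, tf_feature_names, 20)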

In [None]:
#compare prevalence of each topic across documents

#get topic distribution array
topic_dist = lda.transform(tfidf_matrix)
topic_dist

In [None]:
#merge back with original df
topic_dist_df = pd.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(ct_sub.reset_index())
df_w_topics.head()

In [None]:
ct_sub

In [None]:
#check average weight of each topic across party using group by
grouped = df_w_topics.groupby('party')
for i in range(0, 5):
    print(grouped[i].mean().sort_values(ascending=False))

In [None]:
#check average weight of each topic across position using group by
grouped = df_w_topics.groupby('position')
for i in range(0, 5):
    print(grouped[i].mean().sort_values(ascending=False))

<b> What do we see so far? </b>
With `n_components = 5`, we don't see much separation by either party or position. We retrain LDA with more topics, `n_components = 10`.

### EDA 2 - Topic Modeling 

#### Part 2: 10 components

In [None]:
#apply LDA model with hyperparameter n_components = 10
lda_10 = LatentDirichletAllocation(n_components=10, max_iter=20, random_state=0)
lda_10 = lda_10.fit(dense_matrix)

In [None]:
#print top words
tf_feature_names = tf.get_feature_names()
print_top_words(lda_10, tf_feature_names, 20)

In [None]:
#compare prevalence of each topic across documents

#get topic distribution array
topic_dist = lda_10.transform(tfidf_matrix)
topic_dist

In [None]:
#merge back with original df
topic_dist_df = pd.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(ct_sub.reset_index())
df_w_topics.head()

In [None]:
#check average weight of each topic across party using group by
grouped = df_w_topics.groupby('party')
for i in range(0, 10):
    print(grouped[i].mean().sort_values(ascending=False))

In [None]:
#check average weight of each topic across position using group by
grouped = df_w_topics.groupby('position')
for i in range(0, 10):
    print(grouped[i].mean().sort_values(ascending=False))

<b>What do we see?</b>
We still don't see much separation by either party or position. Although the topics produced are sensitive to our choice of `n_components`, choosing more topics did not do any better at separating tweets by party or position. Members of Congress, irrespective of party and position, seem to discuss similar topics. The ways they discuss these topics may differ, and sentiment analysis may do a better job of picking up these differences of opinion.
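
To make "separation" a little more concrete, one option is to compute the gap in mean topic weight between two groups for each topic. This is only a sketch: the function name `group_mean_gap` and the toy weights are assumptions, and the numbers below are made up rather than taken from our LDA output.

```python
def group_mean_gap(weights, groups, group_a, group_b):
    """Absolute difference in mean topic weight between two groups, per topic."""
    def column_means(label):
        rows = [w for w, g in zip(weights, groups) if g == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    means_a = column_means(group_a)
    means_b = column_means(group_b)
    return [abs(a - b) for a, b in zip(means_a, means_b)]

# Made-up topic weights (rows: tweets, columns: topics); not real model output
toy_weights = [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
toy_party = ["Democrat", "Democrat", "Republican", "Republican"]

gaps = group_mean_gap(toy_weights, toy_party, "Democrat", "Republican")
print([round(g, 2) for g in gaps])
```

Gaps close to zero for every topic would be consistent with the reading that both groups discuss similar topics.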

### EDA 3 - Word Count

In [None]:
ct_sub['word_count'] = ct_sub['text'].apply(lambda x: len(str(x).split()))

In [None]:
#for party
sns.displot(ct_sub, x="word_count", hue = "party", col = "party")
plt.show()

In [None]:
#for position
sns.displot(ct_sub, x="word_count", hue = "position", col = "position")
plt.show()

<b>What do we see?</b>
Democrats and Representatives appear to have higher word counts than their counterparts (Republicans and Senators, respectively). This might influence the results we see in the sentiment analysis and beyond.
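
Because the groups differ in size, a histogram of raw counts can be hard to compare by eye. A minimal sketch of a numeric check is below; the function name `word_count_summary` and the toy tweets are assumptions for illustration only.

```python
from statistics import mean, median

def word_count_summary(texts, groups):
    """Mean and median whitespace-delimited word count for each group."""
    counts = {}
    for text, group in zip(texts, groups):
        counts.setdefault(group, []).append(len(str(text).split()))
    return {g: {"n": len(c), "mean": mean(c), "median": median(c)}
            for g, c in counts.items()}

# Made-up tweets for illustration only
toy_texts = ["We passed the bill today", "Thank you",
             "Join us at the town hall tonight at six", "Great news"]
toy_position = ["Rep", "Sen", "Rep", "Sen"]

summary = word_count_summary(toy_texts, toy_position)
print(summary)
```

Reporting `n` alongside the mean and median makes it easier to judge how much weight a difference between groups deserves.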

## Sentiment Analysis

Next, let's analyze the sentiments contained within the tweets. You may use TextBlob or another library for these tasks. Do the following:

* Choose two legislators, one who you think will be more liberal and one who you think will be more conservative, and analyze their sentiment and/or subjectivity scores per tweet. For instance, you might do two scatterplots that plot each legislator's sentiment against their subjectivity, or two density plots for their sentiments. Do the scores match what you thought?
* Plot two more visualizations like the ones you chose in the first part, but do them to compare (1) Democrats v. Republicans and (2) Senators v. Representatives 

`TextBlob` has already been imported in the top cell.

In [None]:
# Rep. Nydia Velázquez (D) vs. Rep. Liz Cheney (R)

velázquez_text = congress_tweets[congress_tweets['name_wikipedia']=='Nydia Velázquez']['text']
velázquez_text

cheney_text = congress_tweets[congress_tweets['name_wikipedia']=='Liz Cheney']['text']
cheney_text

#### Sentiment

In [None]:
congress_tweets['velázquez_polarity']= velázquez_text.map(lambda text: TextBlob(text).sentiment.polarity)
sns.displot(congress_tweets, x='velázquez_polarity')
plt.show()

In [None]:
congress_tweets['cheney_polarity']= cheney_text.map(lambda text: TextBlob(text).sentiment.polarity)
sns.displot(congress_tweets, x='cheney_polarity')
plt.show()

<b>What do we see?</b>
According to the plots, both legislators' tweets are mostly neutral. This matches what I'd assume, given that politicians tend to stay away from polarizing statements.
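
One way to put a number on "mostly neutral" is the share of tweets whose polarity is close to zero. The sketch below assumes a cutoff of 0.1, which is an arbitrary choice, and uses made-up scores in TextBlob's [-1, 1] polarity range; `neutral_share` is a hypothetical helper name.

```python
def neutral_share(polarities, threshold=0.1):
    """Fraction of scores whose absolute polarity is below the threshold."""
    scores = [p for p in polarities if p is not None]
    if not scores:
        return 0.0
    return sum(abs(p) < threshold for p in scores) / len(scores)

# Made-up polarity scores for illustration only
toy_polarity = [0.0, 0.05, -0.02, 0.6, -0.4]
print(neutral_share(toy_polarity))
```

Computing this share separately for each legislator gives a comparison that does not depend on how the histogram bins are drawn.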

#### Subjectivity

In [None]:
congress_tweets['velázquez_subjectivity'] = velázquez_text.map(lambda text: TextBlob(text).sentiment.subjectivity)
sns.displot(congress_tweets, x="velázquez_subjectivity")
plt.show()

In [None]:
congress_tweets['cheney_subjectivity'] = cheney_text.map(lambda text: TextBlob(text).sentiment.subjectivity)
sns.displot(congress_tweets, x="cheney_subjectivity")
plt.show()

<b>What do we see?</b>
According to the plots, Representative Velázquez's tweets tend to be more objective. A significant portion of Representative Cheney's tweets are also objective; however, her scores seem to have a mean closer to ~0.5. This may suggest that Democratic legislators are more objective, but a comparison of two legislators cannot establish that, and the difference may also be due to the much smaller sample from Representative Cheney.

#### Sentiment vs. Subjectivity

In [None]:
sns.scatterplot(data = congress_tweets, x = 'velázquez_subjectivity', y = 'velázquez_polarity')
plt.show()

In [None]:
sns.scatterplot(data = congress_tweets, x = 'cheney_subjectivity', y = 'cheney_polarity')
plt.show()

<b>What do we see?</b>
According to the plots, both legislators' tweets take on more extreme polarity scores as they become more subjective. For Representative Cheney, however, the scores skew much more positive than for Representative Velázquez. This, again, may be due to the smaller sample size from Representative Cheney.

#### Democrats vs. Republicans

In [None]:
ct_sub['polarity']= ct_sub['tokens_str'].map(lambda text: TextBlob(text).sentiment.polarity)
ct_sub['subjectivity']= ct_sub['tokens_str'].map(lambda text: TextBlob(text).sentiment.subjectivity)

In [None]:
sns.relplot(
    data=ct_sub, x="subjectivity", y="polarity",
    col="party", hue = "party", kind="scatter"
)
plt.show()

<b>What do we see?</b>
These plots reveal a pattern similar to the one we just saw between Representatives Velázquez and Cheney. It seems that both parties' tweets become more positive as they get more subjective, although the relationship seems to be stronger for Republicans. This suggests that what we saw earlier was not due solely to Representative Cheney's smaller sample size.

#### Senators vs. Representatives

In [None]:
sns.relplot(
    data=ct_sub, x="subjectivity", y="polarity",
    col="position", hue = "position", kind="scatter"
)
plt.show()

<b>What do we see?</b>
Again, we see this same pattern. As tweets become more subjective, they also become more positive. The slight differences we see here are likely due to the composition of each chamber. For the 116th Congress, the Senate was majority Republican and the House was majority Democratic.

## Featurization

Before going to classification, explore different featurization techniques. Create three dataframes or arrays to represent your text features, specifically:

* Features engineered from your previous analysis. For example, word counts, sentiment scores, topic model etc.
* A term frequency-inverse document frequency matrix. 
* An embedding-based featurization (like a document averaged word2vec)

In the next section, you will experiment with each of these featurization techniques to see which one produces the best classifications.
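
To show what a tf-idf matrix contains, here is a small worked sketch on a toy corpus. It uses raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows, which to our understanding mirrors the default weighting of scikit-learn's `TfidfVectorizer`; treat that correspondence as an assumption. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math

def tfidf_rows(docs):
    """Tf-idf with smoothed idf and L2-normalized rows for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = {t: sum(t in doc for doc in docs) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        raw = [doc.count(t) * idf[t] for t in vocab]
        # Guard against all-zero rows (empty documents)
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy tokenized documents for illustration only
toy_docs = [["vote", "bill", "vote"], ["bill", "health"]]
vocab, rows = tfidf_rows(toy_docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first toy document, "vote" is both frequent and specific to that document, so it receives a larger weight than "bill", which appears in every document.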

In [None]:
ct_sub.columns

### Engineered Text Features

In [None]:
# Engineered features: word count, polarity, and subjectivity
engineered_features = ct_sub[['word_count','polarity', 'subjectivity']].reset_index(drop = True)

In [None]:
engineered_features

### Bag-of-words or Tf-idf

In [None]:
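# Frequency-based featurization

# tf-idf (toarray() gives a plain ndarray; in scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names())
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns = tf.get_feature_names())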

### Word Embedding

In [None]:
# Load Word2Vec model from Google; OPTIONAL depending on your computational resources (the file is ~1 GB)
# Also note that this file path assumes that the word vectors are underneath 'data'; you may wish to point to the CSS course repo and change the path
# or move the vector file to the project repo 

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary = True) 

In [None]:
import numpy as np
# Function to average word embeddings for a document; use examples from lab to apply this function. You can use also other techniques such as PCA and doc2vec instead.
def document_vector(word2vec_model, doc):
    # use the model passed in rather than the global `model`; `wv.vocab` is the gensim 3.x vocabulary lookup
    doc = [word for word in doc if word in word2vec_model.wv.vocab]
    return np.mean(word2vec_model.wv[doc], axis=0)

In [None]:
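# embedding-based featurization: train Word2Vec on our own tokens
# (this replaces the pretrained GoogleNews vectors loaded above)
# gensim 3.x argument names; gensim 4 renamed size -> vector_size and iter -> epochs
model = gensim.models.Word2Vec(ct_sub['tokens'], size = 100, window = 5,
                               min_count = 5, sg = 0, alpha = 0.025, iter = 5, batch_words = 10000)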

In [None]:
model.wv.vocab

In [None]:
doc = [word for word in ct_sub.reset_index()['tokens'][0] if word in model.wv.vocab]
len(doc)

In [None]:
doc[0:5]

In [None]:
def document_vector(word2vec_model, doc):
    # keep only tokens in the passed-in model's vocabulary (gensim 3.x API)
    doc = [word for word in doc if word in word2vec_model.wv.vocab]
    if not doc:  # no in-vocabulary tokens: fall back to a zero vector
        return numpy.zeros(word2vec_model.vector_size)
    return numpy.mean(word2vec_model.wv[doc], axis=0)

In [None]:
# Collect one averaged embedding per document
doc_embedding_means = []
for tokens in ct_sub['tokens']: # pass the token list; iterating over 'tokens_str' would split each tweet into characters
    doc_embedding_means.append(document_vector(model, tokens))
    
doc_average_embeddings = numpy.array(doc_embedding_means) # list to array 

In [None]:
doc_average_embeddings

In [None]:
dae_df = pd.DataFrame(doc_average_embeddings)

In [None]:
dae_df

## Classification

Either use cross-validation or partition your data with training/validation/test sets for this section. Do the following:

* Choose a supervised learning algorithm such as logistic regression, random forest etc. 
* Train six models. For each of the three dataframes you created in the featurization part, train one model to predict whether the author of the tweet is a Democrat or Republican, and a second model to predict whether the author is a Senator or Representative.
* Report the accuracy and other relevant metrics for each of these six models.
* Choose the featurization technique associated with your best model. Combine those text features with non-text features. Train two more models: (1) A supervised learning algorithm that uses just the non-text features and (2) a supervised learning algorithm that combines text and non-text features. Report accuracy and other relevant metrics. 

If time permits, you are encouraged to use hyperparameter tuning or AutoML techniques like TPOT, but are not explicitly required to do so.
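
A normalized confusion matrix is useful, but the task also asks for accuracy and other metrics. The sketch below computes accuracy, precision, recall, and F1 by hand for binary labels so the definitions are visible; `binary_metrics` is a hypothetical helper and the labels are made up. In practice, `sklearn.metrics` offers `accuracy_score` and `classification_report` for the same purpose.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 1 = Republican, 0 = Democrat
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]

metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Precision and recall matter here because the classes are imbalanced in both tasks, so accuracy alone can look good for a model that mostly predicts the majority class.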

### Train Six Models with Just Text

In [None]:
ct_sub

Create a binary party column (Democrat = 0, otherwise 1)

In [None]:
# 0 = Democrat, 1 = everyone else; note that Independents are also coded 1 unless they are filtered out first
ct_sub['DR'] = ct_sub['party'].apply(lambda x: 0 if x == 'Democrat' else 1)

In [None]:
ct_sub

Collect the feature dataframes to compare

In [None]:
dataframes = [engineered_features,
                    topic_dist_df, 
                    tfidf_df]

featurization_technique = ['Engineered Text Features', 
                            'Topic Model',
                             'Tf-idf Features']


In [None]:
# binarize label
lb_style = LabelBinarizer()
y = ct_sub['party_binary'] = lb_style.fit_transform(ct_sub['DR'])


In [None]:
for dataframe, featurization in zip(dataframes, featurization_technique):
    X_train, X_test, y_train, y_test = train_test_split(dataframe, 
                                                        y, 
                                                        train_size = .80, 
                                                        test_size=0.20, 
                                                        random_state = 10)
    # create a model
    logit_reg = LogisticRegression()

    # fit the model
    logit_model = logit_reg.fit(X_train, y_train.ravel())

    y_pred = logit_model.predict(X_test)
    
    cf_matrix = confusion_matrix(y_test, y_pred, normalize = "true")

    df_cm = pd.DataFrame(cf_matrix, range(2),
                      range(2))

    df_cm = df_cm.rename(index=str, columns={0: "Democrat", 1: "Republican"})
    df_cm.index = ["Democrat", "Republican"]
    plt.figure(figsize = (10,7))
    sns.set(font_scale=1.4)#for label size
    sns.heatmap(df_cm, 
               annot=True,
               annot_kws={"size": 16},
               fmt='g')

    plt.title(featurization)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

In [None]:
y2 = ct_sub['position_binary'] = lb_style.fit_transform(ct_sub['position'])

In [None]:
ct_sub

In [None]:
for dataframe, featurization in zip(dataframes, featurization_technique):
    X2_train, X2_test, y2_train, y2_test = train_test_split(dataframe, 
                                                        y2, 
                                                        train_size = .80, 
                                                        test_size=0.20, 
                                                        random_state = 10)
    # create a model
    logit_reg = LogisticRegression()

    # fit the model
    logit_model = logit_reg.fit(X2_train, y2_train.ravel())

    y2_pred = logit_model.predict(X2_test)
    
    cf_matrix = confusion_matrix(y2_test, y2_pred, normalize = "true")

    df_cm = pd.DataFrame(cf_matrix, range(2),
                      range(2))

    df_cm = df_cm.rename(index=str, columns={0: "Representative", 1: "Senator"})
    df_cm.index = ["Representative", "Senator"]
    plt.figure(figsize = (10,7))
    sns.set(font_scale=1.4)#for label size
    sns.heatmap(df_cm, 
               annot=True,
               annot_kws={"size": 16},
               fmt='g')

    plt.title(featurization)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

### Two Combined Models

In [None]:
categories = ['position_binary', 'party_binary']
labels = [['Representative', 'Senator'], ['Democrat', 'Republican']]

In [None]:
# two models ([best text features + non-text features] * [democrat/republican, senator/representative])
ct_sub.columns
non_text_features = ct_sub[['trump_2016_state_share', 'clinton_2016_state_share', 
                        'obama_2012_state_share', 'romney_2012_state_share']]
non_text_features = non_text_features.replace({',':''}, regex=True)
non_text_features = non_text_features.apply(pd.to_numeric)
non_text_dummies = pd.get_dummies(ct_sub[['gender', 'state']]).reset_index(drop = True)
non_text_features = non_text_features.reset_index(drop = True).join(non_text_dummies)
non_text_features

In [None]:
# Non-text model
dataframe = non_text_features
featurization = "Non-Text Features"
for category, label in zip(categories, labels):
    print(category)
    
    y = ct_sub[category]
    
    X_train, X_test, y_train, y_test = train_test_split(dataframe, 
                                                        y, 
                                                        train_size = .80, 
                                                        test_size=0.20, 
                                                        random_state = 10)
    # create a model
    logit_reg = LogisticRegression()

    # fit the model
    logit_model = logit_reg.fit(X_train, y_train.ravel())

    y_pred = logit_model.predict(X_test)
    
    cf_matrix = confusion_matrix(y_test, y_pred, normalize = "true")

    # label rows (true) and columns (predicted) with the class names for this category,
    # instead of hardcoding Democrat/Republican for both tasks
    df_cm = pd.DataFrame(cf_matrix, index = label, columns = label)
    plt.figure(figsize = (10,7))
    sns.set(font_scale=1.4)#for label size
    sns.heatmap(df_cm, 
               annot=True,
               annot_kws={"size": 16},
               fmt='g')

    plt.title(featurization)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

In [None]:
# Non-text plus text features from TF-IDF model
dataframe = tfidf_df.reset_index(drop = True).join(non_text_features)
featurization = "Non-Text Features + TF-IDF"
for category, label in zip(categories, labels):
    print(category)
    
    y = ct_sub[category]
    
    X_train, X_test, y_train, y_test = train_test_split(dataframe, 
                                                        y, 
                                                        train_size = .80, 
                                                        test_size=0.20, 
                                                        random_state = 10)

    # create a model
    logit_reg = LogisticRegression()

    # fit the model on this loop's split (not the leftover X2_*/y2_* variables from the earlier position-only cell)
    logit_model = logit_reg.fit(X_train, y_train.ravel())

    y_pred = logit_model.predict(X_test)
    
    cf_matrix = confusion_matrix(y_test, y_pred, normalize = "true")

    # label rows (true) and columns (predicted) with the class names for this category
    df_cm = pd.DataFrame(cf_matrix, index = label, columns = label)
    plt.figure(figsize = (10,7))
    sns.set(font_scale=1.4)#for label size
    sns.heatmap(df_cm, 
               annot=True,
               annot_kws={"size": 16},
               fmt='g')

    plt.title(featurization)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

## Discussion Questions

1. Why do standard preprocessing techniques need to be further customized to a particular corpus?

**Your Answer Here**

2. Did you find evidence for the idea that Democrats and Republicans have different sentiments in their tweets? What about Senators and Representatives?

**Your Answer Here**

3. Why is validating your exploratory and unsupervised learning approaches with a supervised learning algorithm valuable?

**Your Answer Here**

4. Did text only, non-text only, or text and non-text features together perform the best? What is the intuition behind combining text and non-text features in a supervised learning algorithm?

**Your Answer Here**