# Part 1: Twitter Network Analysis

## Exercise 1: Build the network of retweets
We will now build a network whose nodes are the Twitter handles of the members of the house, with a directed edge from node A to node B if A has retweeted content posted by B. The network is weighted: the weight of an edge equals the number of retweets. You can build the network by following the steps below (you should be able to reuse many of the functions you wrote for the exercises in previous weeks):

* Consider the 200 most recent tweets written by each member of the house (use the files [here](https://github.com/suneman/socialgraphs2019/tree/master/files/data_twitter/tweets_2019.zip/)). For each file, use a regular expression to find retweets and to extract the Twitter handle of the user whose content was retweeted. All retweets begin with "*RT @originalAuthor:*", where "*originalAuthor*" is the handle of the user whose content was retweeted (and the part of the text you want to extract).

* For each retweet, check if the handle retweeted is, in fact, the handle of a member of the house. If yes, keep it. If no, discard it.

In [116]:
import re
import io
import glob
import pandas as pd

In [117]:
# load Twitter handles of the members of the house
members_df = pd.read_csv("H115_tw_2019.csv")
members_df.head()

Unnamed: 0,WikiPageName,Party,State,Name,tw_id,tw_name
0,Don_Young,Republican,Alaska,Don Young,37007270.0,repdonyoung
1,Jim_Sensenbrenner,Republican,Wisconsin,Jim Sensenbrenner,851621400.0,JimPressOffice
2,Hal_Rogers,Republican,Kentucky,Hal Rogers,550401800.0,RepHalRogers
3,Chris_Smith_(New_Jersey_politician),Republican,New Jersey,Chris Smith,1289319000.0,RepChrisSmith
4,Steny_Hoyer,Democratic,Maryland,Steny Hoyer,22012090.0,LeaderHoyer


In [118]:
# get the list of twitter names
members_tw_name = members_df.tw_name.unique().tolist()
members_tw_name

['repdonyoung',
 'JimPressOffice',
 'RepHalRogers',
 'RepChrisSmith',
 'LeaderHoyer',
 'RepMarcyKaptur',
 'RepVisclosky',
 'RepPeterDeFazio',
 'repjohnlewis',
 'RepFredUpton',
 'SpeakerPelosi',
 'FrankPallone',
 'RepEliotEngel',
 'NitaLowey',
 'RepRichardNeal',
 'RepJoseSerrano',
 'RepDavidEPrice',
 'rosadelauro',
 'RepMaxineWaters',
 'RepJerryNadler',
 'repjimcooper',
 'KenCalvert',
 'WhipClyburn',
 'RepAnnaEshoo',
 'RepMarkGreen',
 'RepHastingsFL',
 'RepPeteKing',
 'RepMaloney',
 'RepRoybalAllard',
 'RepBobbyRush',
 'BobbyScott',
 'BennieGThompson',
 'RepFrankLucas',
 'RepLloydDoggett',
 'USRepMikeDoyle',
 'JacksonLeeTX18',
 'RepZoeLofgren',
 'MacTXPress',
 'RepCummings',
 'repblumenauer',
 'Robert_Aderholt',
 'RepKevinBrady',
 'RepDannyDavis',
 'RepDianaDeGette',
 'RepKayGranger',
 'RepRonKind',
 'RepMcGovern',
 'BillPascrell',
 'BradSherman',
 'RepShimkus',
 'RepAdamSmith',
 'RepGregoryMeeks',
 'RepBarbaraLee',
 'RepSteveChabot',
 'JoeCrowleyNY',
 'RepJohnLarson',
 'gracenapolitano

In [123]:
# define a function to check if the original authors of a person's retweets are members of the house

def check_if_member(tup):
    
    # initialize a list that only stores retweets whose original author is a member of the house 
    retweets_list = []
    
    # iterate through the retweet tuples and examine the original author of each retweet
    for grp in tup:
        
        # add to the list if original author is a member
        if grp[0] in members_tw_name:
            retweets_list.append(grp)
            
    return retweets_list

In [124]:
# define a function to find the original author and the content of ONE member's retweets 

def find_and_filter_retweets(path):
    
    # first, read the tweets of a particular member
    # io.open is used for its encoding parameter, since utf-8 encoding is needed;
    # the with-block closes the file once it has been read
    with io.open(path, mode="r", encoding="utf-8") as tweet_file:
        f = tweet_file.read()
    
    # then, use regex to find all retweets of this member
    # result is a list of tuples of the following format: 
    # [(originalAuthor1, content1), (originalAuthor2, content2), ...]
    tup = re.findall(r"RT @(\w+): (.*)", f)
    
    # check if the original authors of a person's retweets are members of the house
    retweets_list = check_if_member(tup)
    return retweets_list

In [125]:
# define a function to find and filter ALL members' retweets

def process_all():
    
    # initialize a dictionary that stores the Wikipedia name of the member, 
    # the author of his/her retweets (must be a house member too), and the content of these retweets.
    # format: {houseMember1 : [(originalAuthor1, content1),(...,...),...],  houseMember2 : [(...,...),(...,...),...],  ...}
    retweets_dict = {}
    
    # iterate through every file in the directory
    for file in glob.glob("tweets_2019/*"):
        
        # find the retweet list of each house member
        retweets_list = find_and_filter_retweets(file)
        
        # extract the name of the house member by dropping the 12-character
        # "tweets_2019/" directory prefix from the file path
        name = file[12:]
        
        # append the name and retweets info to the dictionary
        retweets_dict[name] = retweets_list
    
    return retweets_dict

In [127]:
retweets_full_dict = process_all()
retweets_full_dict

{'Adam_Kinzinger': [],
 'Adam_Schiff': [('SpeakerPelosi',
   'Over the past week, we’ve learned a great deal about @realDonaldTrump’s abuses of power &amp; betrayal of his oath of office…'),
  ('RepAdamSchiff',
   "@GOPLeader Actually Kevin, Mueller's report confirmed that in 2016, the Trump campaign had multiple contacts with Russia…"),
  ('SpeakerPelosi',
   'In this House, we speak truth to power. https://t.co/obQc9WqpY1'),
  ('SpeakerPelosi',
   'A special message from @RepAdamSchiff &amp; @HouseDemocrats as Americans celebrate the birth of our democracy. #ProtectOurDe…')],
 'Adam_Smith_(politician)': [('RepLindaSanchez',
   'As part of its anti-immigrant agenda, the Trump admin. wants to strip rights from immigration judges. It’s never been…'),
  ('RepRaulGrijalva',
   'Yesterday we passed @RepCunningham and @USRepKCastor &amp; @RepRooney’s bills to #ProtectOurCoast from offshore drilling.…'),
  ('USRepKCastor',
   '#ProtectOurCoast Alert❗️We cannot risk another oil drilling disas

* Use a NetworkX [`DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) to store the network. Use weighted edges to account for multiple retweets. Also store the party of each member as a node attribute (use the data in [this file](https://github.com/suneman/socialgraphs2019/blob/master/files/data_twitter/H115_tw_2019.csv)). Remove self-loops (edges that connect a node with itself).
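
The step above could be sketched as follows. This is a minimal illustration, not the required solution: `retweets`, `handle_of` and `party_of` are hypothetical miniature inputs that mimic the dictionary returned by `process_all()` and the `WikiPageName`, `tw_name` and `Party` columns of the CSV file. It assumes the retweets have already been filtered to house members.

```python
from collections import Counter

import networkx as nx

# Hypothetical toy inputs (not real data).
# retweets: WikiPageName -> list of (retweeted handle, tweet text) pairs
retweets = {
    "Member_A": [("handleB", "text 1"), ("handleB", "text 2"), ("handleA", "self")],
    "Member_B": [("handleA", "text 3")],
}
# handle_of maps WikiPageName -> Twitter handle, party_of maps handle -> party
handle_of = {"Member_A": "handleA", "Member_B": "handleB"}
party_of = {"handleA": "Democratic", "handleB": "Republican"}


def build_retweet_graph(retweets, handle_of, party_of):
    """Return a weighted DiGraph with an edge A -> B if A retweeted B."""
    graph = nx.DiGraph()

    # one node per member, with the party stored as a node attribute
    for handle, party in party_of.items():
        graph.add_node(handle, party=party)

    # count how often each (retweeter, original author) pair occurs
    counts = Counter()
    for wiki_name, pairs in retweets.items():
        source = handle_of[wiki_name]
        for target, _text in pairs:
            counts[(source, target)] += 1

    # the count becomes the edge weight; self-loops are skipped
    for (source, target), weight in counts.items():
        if source != target:
            graph.add_edge(source, target, weight=weight)
    return graph


G = build_retweet_graph(retweets, handle_of, party_of)
print(list(G.edges(data=True)))
```

An alternative to skipping self-loops while adding edges is to add every edge and then call `G.remove_edges_from(list(nx.selfloop_edges(G)))`.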

## Exercise 2: Visualize the network of retweets and investigate differences between the parties

**code below is copied directly from the previous exercise and thus needs modification, do not touch first**

In [67]:
# clean members names
processed_names = []
for name in name_list115:
    name = re.sub(r'_\([^)]*\)', '', name) #remove "_(...)"
    name = re.sub(r'_', ' ', name.lower()) #replace underscore with space
    names = name.split() #split on white space
    processed_names.extend(names)

In [68]:
# find punctuations
punctuations = [i for i in string.punctuation]
punctuations

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [69]:
# add members names and punctuations to the stopwords
stop_words = set(nltk.corpus.stopwords.words('english')).union(processed_names)
stop_words = stop_words.union(set(punctuations))
stop_words

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'abraham',
 'adam',
 'adams',
 'aderholt',
 'adrian',
 'adriano',
 'after',
 'again',
 'against',
 'aguilar',
 'ain',
 'al',
 'alan',
 'albio',
 'alcee',
 'alex',
 'all',
 'allen',
 'alma',
 'am',
 'amash',
 'ami',
 'amodei',
 'an',
 'and',
 'andré',
 'andy',
 'ann',
 'anna',
 'anthony',
 'any',
 'are',
 'aren',
 "aren't",
 'arrington',
 'as',
 'at',
 'austin',
 'b.',
 'babin',
 'bacon',
 'balderson',
 'banks',
 'barbara',
 'barletta',
 'barr',
 'barragán',
 'barry',
 'barton',
 'bass',
 'be',
 'beatty',
 'because',
 'becerra',
 'been',
 'before',
 'being',
 'below',
 'ben',
 'bennie',
 'bera',
 'bergman',
 'bernice',
 'beto',
 'betty',
 'between',
 'beutler',
 'beyer',
 'biggs',
 'bilirakis',
 'bill',
 'billy',
 'bishop',
 'black',
 'blackburn',
 'blaine',
 'blake',
 'blum',
 'blumenauer',
 '

In [142]:
# function to check if a string contains numbers
def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))

In [147]:
def tokenize(name):
    # load the pickled plain-text Wikipedia page of this member;
    # the with-block closes the file after loading
    with open("D:/NUS/Y3S1/02805 Social Graphs and Interactions/Exercise/politicians_wiki/plain_text/115/" + name + ".pickle", 
              "rb") as pickle_file:
        text = pickle.load(pickle_file)
    
    word_tokens = nltk.tokenize.word_tokenize(text) # tokenize text
    word_tokens = [w.lower() for w in word_tokens if w.lower() not in stop_words] # remove stopwords
    word_tokens = [w for w in word_tokens if not hasNumbers(w)] # remove tokens containing digits
    
    return word_tokens

In [148]:
republican_names = df[df.Party == 'Republican'].WikiPageName.drop_duplicates().tolist()
democratic_names = df[df.Party == 'Democratic'].WikiPageName.drop_duplicates().tolist()

In [149]:
republican_tokens = []
democratic_tokens = []

In [150]:
for i in republican_names:
    republican_tokens.extend(tokenize(i))

In [151]:
for i in democratic_names:
    democratic_tokens.extend(tokenize(i))

Now, we're ready to calculate the TF for each word. Use the method of your choice to find the top 5 terms within each party.
* Describe similarities and differences between the parties.
* Why are the TFs not necessarily a good description of the parties?

In [152]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [153]:
republican_vectorizer = TfidfVectorizer(use_idf = False, max_features = 5)
democratic_vectorizer = TfidfVectorizer(use_idf = False, max_features = 5)

In [154]:
republican_vectorizer.fit_transform(republican_tokens)
democratic_vectorizer.fit_transform(democratic_tokens)

<216176x5 sparse matrix of type '<class 'numpy.float64'>'
	with 8338 stored elements in Compressed Sparse Row format>

In [155]:
republican_vectorizer.get_feature_names()

['committee', 'district', 'election', 'house', 'republican']

In [156]:
democratic_vectorizer.get_feature_names()

['congress', 'congressional', 'district', 'election', 'house']

Next, we calculate IDF for every word.
* What base logarithm did you use? Is that important?
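
As a hint for the question above, here is a short worked identity. It assumes the common definition $\mathrm{idf}(t) = \log_b (N / n_t)$, where $N$ is the number of documents and $n_t$ the number of documents containing term $t$. These symbols are introduced here for illustration, and library implementations may use a smoothed variant.

$$
\log_b \frac{N}{n_t} = \frac{\ln (N / n_t)}{\ln b}
$$

Under this assumption, changing the base $b$ multiplies every IDF value by the same constant $1/\ln b$. The ranking of terms therefore stays the same and only the absolute scale of the scores changes.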

In [157]:
republican_vectorizer = TfidfVectorizer(use_idf = True)
democratic_vectorizer = TfidfVectorizer(use_idf = True)
rep_X = republican_vectorizer.fit_transform(republican_tokens)
dem_X = democratic_vectorizer.fit_transform(democratic_tokens)

In [158]:
republican_vectorizer.get_feature_names()

['aa',
 'aaps',
 'aaron',
 'ab',
 'abandoned',
 'abandoning',
 'abatement',
 'abaya',
 'abayas',
 'abbasse',
 'abbey',
 'abbott',
 'abbreviated',
 'abby',
 'abc',
 'abdication',
 'abdirizak',
 'abducted',
 'abduction',
 'abdulhakim',
 'abdulmutallab',
 'abdulrahman',
 'abedin',
 'abel',
 'abercrombie',
 'abetting',
 'abhorred',
 'abhorrent',
 'abide',
 'abiding',
 'abigail',
 'abilene',
 'abilities',
 'ability',
 'able',
 'abney',
 'abnormally',
 'aboard',
 'abolish',
 'abolished',
 'abolishing',
 'abolishment',
 'abolition',
 'abomination',
 'abort',
 'aborted',
 'abortifacients',
 'aborting',
 'abortion',
 'abortionfortenberry',
 'abortionin',
 'abortions',
 'abortionschweikert',
 'abounded',
 'about',
 'aboutalebi',
 'above',
 'abraham',
 'abramoff',
 'abrams',
 'abroad',
 'abrupt',
 'abruptly',
 'abruzzo',
 'abscam',
 'absence',
 'absent',
 'absentee',
 'absolute',
 'absolutely',
 'absorb',
 'absorbed',
 'absorbers',
 'absorbing',
 'abstained',
 'abstaining',
 'abstinence',
 'abstr

In [159]:
democratic_vectorizer.get_feature_names()

['aapj',
 'aaron',
 'aarp',
 'aayesha',
 'ab',
 'aba',
 'abandoned',
 'abatement',
 'abbas',
 'abbey',
 'abboud',
 'abc',
 'abdication',
 'abdul',
 'abdullah',
 'abdulmutallab',
 'abdulmutullab',
 'abel',
 'abercrombie',
 'aberdeen',
 'abhi',
 'abiding',
 'abigail',
 'abilities',
 'ability',
 'able',
 'ably',
 'abner',
 'abnormalities',
 'abolish',
 'abolished',
 'abolishing',
 'abolition',
 'abort',
 'abortion',
 'abortionchu',
 'abortiondoggett',
 'abortionquigley',
 'abortionrepresentative',
 'abortions',
 'abortionthompson',
 'abrahams',
 'abrasive',
 'abroad',
 'abrupt',
 'abruptly',
 'abruzzo',
 'absalom',
 'abscam',
 'absence',
 'absences',
 'absent',
 'absentee',
 'absolutely',
 'absorbed',
 'absorbing',
 'absurd',
 'abu',
 'abuse',
 'abused',
 'abuser',
 'abusers',
 'abuses',
 'abusing',
 'abusive',
 'abysmal',
 'abzug',
 'aca',
 'academia',
 'academic',
 'academics',
 'academy',
 'acceded',
 'acceding',
 'accelerated',
 'accelerator',
 'accelerators',
 'accent',
 'accenture',

In [164]:
rep_X[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

In [138]:
republican_vectorizer.get_feature_names()

['committee', 'district', 'election', 'house', 'republican']

In [135]:
dem_X.toarray()

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])

* Visualize the network using the [NetworkX draw function](https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw.html#networkx.drawing.nx_pylab.draw), with node coordinates from the Force Atlas algorithm. *Hint: use an undirected version of the graph to find the node positions for better results, but stick to the directed version for all measurements.* Plot nodes in colors according to their party (e.g. 'red' for republicans and 'blue' for democrats) and set the node size to be proportional to total degree.

* Compare the network of Retweets with the network of Wikipedia pages (generated during Week 5). Do you observe any differences? How do you explain them?

* Now set the node size to be proportional to betweenness centrality. Do you observe any changes?

* Repeat the point above using eigenvector centrality. Again, do you see a difference? Can you explain why based on what eigenvector centrality measures?

* Who are the three nodes with the highest degree within each party? What are their eigenvector centrality and their betweenness centrality?

* Plot (on the same figure) the distribution of outgoing strength for the republican and democratic nodes respectively (i.e. the sum of the weights on outgoing links). Which party is more active in retweeting other members of the house?

* Find the 3 members of the republican party that have retweeted tweets from democratic members most often. Repeat the measure for the democratic members. Can you explain your results by looking at the Wikipedia pages of these members of the house?
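
The measurements in the list above could be computed along these lines. This is a sketch on a hypothetical four-node toy graph (the node names and weights are invented), assuming the party is stored in a `party` node attribute and retweet counts in a `weight` edge attribute. It computes node colors, node sizes, outgoing strength, cross-party retweet counts and betweenness centrality, but does no drawing and computes no Force Atlas layout.

```python
import networkx as nx

# Hypothetical toy graph: two republican and two democratic nodes
G = nx.DiGraph()
G.add_nodes_from(["r1", "r2"], party="Republican")
G.add_nodes_from(["d1", "d2"], party="Democratic")
G.add_weighted_edges_from(
    [("r1", "r2", 3), ("r1", "d1", 2), ("d1", "d2", 4), ("d2", "r1", 1)]
)

party = nx.get_node_attributes(G, "party")

# node colors by party and node sizes proportional to total (in + out) degree
node_colors = ["red" if party[n] == "Republican" else "blue" for n in G.nodes()]
node_sizes = [20 * G.degree(n) for n in G.nodes()]

# outgoing strength: sum of the weights on outgoing links
out_strength = dict(G.out_degree(weight="weight"))

# number of retweets each node made of members of the other party
cross_party = {n: 0 for n in G.nodes()}
for source, target, weight in G.edges(data="weight"):
    if party[source] != party[target]:
        cross_party[source] += weight

# betweenness centrality on the directed graph
betweenness = nx.betweenness_centrality(G)

# the three nodes with the most cross-party retweets
top_cross = sorted(cross_party, key=cross_party.get, reverse=True)[:3]
print(out_strength, cross_party, top_cross)
```

The lists `node_colors` and `node_sizes` can be passed to `nx.draw` through its `node_color` and `node_size` arguments, and `nx.eigenvector_centrality(G)` can be used in the same way as the betweenness call.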

## Exercise 3: Community detection

* Use your favorite method of community detection to find communities in the full house of representatives network. Report the value of modularity found by the algorithm. Is it higher or lower than what you found for the Wikipedia network (Week 7)? Comment on your result.

* Visualize the network, using the Force Atlas algorithm. This time assign each node a different color based on their *community*. Describe the structure you observe.

* Compare the communities found by your algorithm with the parties by creating a matrix $\mathbf{D}$ with dimension $(B \times C)$, where $B$ is the number of parties and $C$ is the number of communities. We set entry $D(i,j)$ to be the number of nodes that party $i$ has in common with community $j$. The matrix $\mathbf{D}$ is what we call a [**confusion matrix**](https://en.wikipedia.org/wiki/Confusion_matrix). 

* Use the confusion matrix to explain how well the communities you've detected correspond to the parties. Consider the following questions:
  * Are there any republicans grouped with democrats (and vice versa)?
  * Does the community detection algorithm sub-divide the parties? Do you know anything about American politics that could explain such sub-divisions? Answer in your own words.
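
The confusion matrix $\mathbf{D}$ described above could be built as follows. This is a minimal sketch in plain Python: `party_of` and `community_of` are hypothetical dictionaries mapping each node to its party and to the community label returned by whichever detection method is used.

```python
def confusion_matrix(party_of, community_of):
    """Return (parties, communities, D) where D[i][j] counts the nodes
    that belong to party i and to community j."""
    parties = sorted(set(party_of.values()))
    communities = sorted(set(community_of.values()))
    row = {p: i for i, p in enumerate(parties)}
    col = {c: j for j, c in enumerate(communities)}

    # B x C matrix of zeros, B = number of parties, C = number of communities
    D = [[0] * len(communities) for _ in parties]
    for node, node_party in party_of.items():
        D[row[node_party]][col[community_of[node]]] += 1
    return parties, communities, D


# Hypothetical toy assignment of five nodes
party_of = {"a": "Democratic", "b": "Democratic",
            "c": "Republican", "d": "Republican", "e": "Republican"}
community_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 0}

parties, communities, D = confusion_matrix(party_of, community_of)
for name, counts in zip(parties, D):
    print(name, counts)
```

In this toy case the republican row has a non-zero entry in the community dominated by democrats, which is the kind of pattern the questions above ask about.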

# Part 2: What do republican and democratic members tweet about?

## Exercise 4: TF-IDF of the republican and democratic tweets
We will create two documents, one containing the words extracted from tweets of republican members, and the other for democratic members. We will then use TF-IDF to compare the content of these two documents and create a word-cloud. The procedure is exactly the same as the one you used in exercise 2 of week 7. The main steps are summarized below:

* Create two large documents, one for the democratic and one for the republican party. Tokenize the tweets, and combine the tokens into one long list including all the tweets of the members of the same party. 
  * Exclude the twitter handles of other members.
  * Exclude punctuation.
  * Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).
  * Exclude numbers (since they're difficult to interpret in the word cloud).
  * Set everything to lower case.
  * Compute the TF-IDF for each document.

* Now, create a word-cloud for each party. Are these topics less "boring" than the Wikipedia topics from Week 7? Why? Comment on the results.
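
The TF-IDF step could be sketched in plain Python as below. This assumes term frequency is the count divided by the document length and IDF is the natural logarithm of $N / n_t$; other conventions exist. The two token lists are invented placeholders, not real tweets.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict mapping a document name to its list of tokens.
    Returns a dict mapping each name to {term: tf-idf score}."""
    n_docs = len(documents)

    # document frequency: in how many documents each term appears
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical, already cleaned token lists
docs = {
    "democratic": ["health", "care", "vote", "health"],
    "republican": ["tax", "vote", "border", "tax"],
}
scores = tf_idf(docs)
print(sorted(scores["democratic"].items(), key=lambda kv: -kv[1]))
```

With only two documents, any term used by both parties gets a score of exactly zero under this unsmoothed definition, so only party-specific words survive into the word-cloud.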

# Part 3: Sentiment analysis

## Exercise 5: Sentiment over the Twitter data

* Download the LabMT wordlist. It's available as supplementary material from [Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) (Data Set S1). Describe briefly how the list was generated.

* Based on the LabMT word list, write a function that calculates sentiment given a list of tokens (the tokens should be lower case, etc).

* Create two lists: one containing tweets by democratic members, and the other with the tweets of republican members. Calculate the sentiment of each tweet and plot the distribution of sentiment for each of the two lists. Are there significant differences between the two? Which party posts more positive tweets?

* Compute the average $m$ and standard deviation $\sigma$ of the tweet sentiment (considering tweets by both republicans and democrats).

* Now consider only tweets with sentiment lower than $m-2\sigma$. We will refer to them as *negative* tweets. Build a list containing the *negative* tweets written by democrats, and one for republicans. Compute the TF-IDF on these two lists. Create a word-cloud for each of them. Are there differences between the negative content posted by republicans and democrats?

* Repeat the point above, but considering _positive_ tweets (i.e. with sentiment larger than $m+2\sigma$). Comment on your results.
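
The sentiment steps above could look roughly like this. It is a sketch under stated assumptions: `toy_scores` is an invented stand-in for the LabMT word list (the values are placeholders, not the published scores), a tweet's sentiment is taken as the plain average of the scores of its tokens found in the list, and tweets with no scored token are left out.

```python
from statistics import mean, pstdev

# Hypothetical word -> happiness score mapping (placeholder values)
toy_scores = {"love": 8.0, "happy": 8.0, "tax": 3.0, "war": 2.0, "vote": 5.0}


def sentiment(tokens, scores):
    """Average score of the tokens present in `scores`, or None if none are."""
    values = [scores[t] for t in tokens if t in scores]
    return mean(values) if values else None


# Hypothetical, already tokenized and lower-cased tweets
tweets = [
    ["love", "happy", "unknownword"],
    ["tax", "war"],
    ["vote", "tax"],
    ["qwerty"],
]
tweet_sentiments = [sentiment(t, toy_scores) for t in tweets]

# mean and standard deviation over the tweets that could be scored
scored = [s for s in tweet_sentiments if s is not None]
m = mean(scored)
sigma = pstdev(scored)

# tweets more than two standard deviations below / above the mean
negative = [t for t, s in zip(tweets, tweet_sentiments)
            if s is not None and s < m - 2 * sigma]
positive = [t for t, s in zip(tweets, tweet_sentiments)
            if s is not None and s > m + 2 * sigma]
print(tweet_sentiments, m, sigma, negative, positive)
```

`pstdev` treats the tweets as the whole population; `statistics.stdev` would give the sample estimate instead. With only three scored toy tweets no value can lie two standard deviations from the mean, so both lists are empty here; on the full data set they would not be.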