<a href="https://colab.research.google.com/github/sharona1ex/117th-US-Congress-Twitter-Interaction-Graph-Analysis/blob/main/Social_Network_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>117<sup>th</sup> US Congress Twitter Interaction Graph Analysis</center></h1>

The dataset is the Twitter interaction network of the 117th United States Congress, covering both the House of Representatives and the Senate. [1]

The 117th United States Congress was a meeting of the legislative branch of the United States federal government, composed of the United States Senate and the United States House of Representatives. It convened in Washington, D.C., on January 3, 2021, during the final weeks of Donald Trump's first presidency and the first two years of Joe Biden's presidency and ended on January 3, 2023. [2]

*(To read the analysis of this network directly, scroll to the bottom.)*

<small> References:<br> [1] Stanford Network Analysis Project (SNAP). "Twitter Interaction Network for the 117th United States Congress." https://snap.stanford.edu<br> [2] Wikipedia contributors. (2023). "117th United States Congress." Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/117th_United_States_Congress </small>

<h2><center>Setup - Apache Spark GraphFrames</center></h2>

In Google Colab, you need to set up GraphFrames before you run the queries. Please run each cell in order, step by step.


### Installing Spark

Install Dependencies:


1.   Java 8
2.   Apache Spark with Hadoop (installed here through the `pyspark` pip package)
3.   Findspark (used to locate Spark on the system)


In [1]:
!rm -rf spark-3.1.1-bin-hadoop3.2

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#!wget -q --show-progress http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
#!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark pyspark
#!pip -q install findspark pyspark graphframes

Set Environment Variables:

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [4]:
!ls

sample_data


In [5]:
!pip show pyspark

Name: pyspark
Version: 3.5.3
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: py4j
Required-by: 


### Installing GraphFrames

In [6]:
!pip install graphframes

Collecting graphframes
  Downloading graphframes-0.6-py2.py3-none-any.whl.metadata (934 bytes)
Collecting nose (from graphframes)
  Downloading nose-1.3.7-py3-none-any.whl.metadata (1.7 kB)
Downloading graphframes-0.6-py2.py3-none-any.whl (18 kB)
Downloading nose-1.3.7-py3-none-any.whl (154 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.7/154.7 kB 9.7 MB/s eta 0:00:00
Installing collected packages: nose, graphframes
Successfully installed graphframes-0.6 nose-1.3.7


In [7]:
!python -V

Python 3.10.12


In [8]:
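# Note: the local file name passed to -o is only a label. The jar that is actually
# downloaded is the GraphFrames 0.8.2 build for Spark 3.1 / Scala 2.12.
!curl -L -o "/usr/local/lib/python3.10/dist-packages/pyspark/jars/graphframes-0.8.2-spark3.3.2-s_2.11.jar" https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.1-s_2.12/graphframes-0.8.2-spark3.1-s_2.12.jar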

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  242k  100  242k    0     0  1035k      0 --:--:-- --:--:-- --:--:-- 1038k


In [9]:
!ls /usr/local/lib/python3.10/dist-packages/pyspark/jars/

activation-1.1.1.jar
aircompressor-0.27.jar
algebra_2.12-2.0.1.jar
annotations-17.0.0.jar
antlr4-runtime-4.9.3.jar
antlr-runtime-3.5.2.jar
aopalliance-repackaged-2.6.1.jar
arpack-3.0.3.jar
arpack_combined_all-0.1.jar
arrow-format-12.0.1.jar
arrow-memory-core-12.0.1.jar
arrow-memory-netty-12.0.1.jar
arrow-vector-12.0.1.jar
audience-annotations-0.5.0.jar
avro-1.11.2.jar
avro-ipc-1.11.2.jar
avro-mapred-1.11.2.jar
blas-3.0.3.jar
bonecp-0.8.0.RELEASE.jar
breeze_2.12-2.1.0.jar
breeze-macros_2.12-2.1.0.jar
cats-kernel_2.12-2.1.1.jar
chill_2.12-0.10.0.jar
chill-java-0.10.0.jar
commons-cli-1.5.0.jar
commons-codec-1.16.1.jar
commons-collections-3.2.2.jar
commons-collections4-4.4.jar
commons-compiler-3.1.9.jar
commons-compress-1.23.0.jar
commons-crypto-1.1.0.jar
commons-dbcp-1.4.jar
commons-io-2.16.1.jar
commons-lang-2.6.jar
commons-lang3-3.12.0.jar
commons-logging-1.1.3.jar
commons-math3-3.6.1.jar
commons-pool-1.5.4.jar
commons-text-1.10.0.jar
compress-lzf-1.1.2.jar
curator-client-2.13.0.jar
...

### Starting Spark with Libraries Loaded

In [10]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars", "/usr/local/lib/python3.10/dist-packages/pyspark/jars/graphframes-0.8.2-spark3.3.2-s_2.11.jar") \
    .config("spark.driver.memory", "12g") \
    .getOrCreate()

spark.conf.set("spark.sql.repl.eagerEval.enabled", True)  # Property used to format output tables better


### Dataset Loading
*link - https://snap.stanford.edu/data/congress-twitter.html*

In [28]:
# Dataset loading helpers
import json

import pandas as pd


def read_congress_data(file_path):
  # Read the whitespace-separated edge list into a DataFrame.
  # sep=r'\s+' replaces the deprecated delim_whitespace=True.
  edgelist = pd.read_csv(file_path, sep=r'\s+', header=None)
  # Each value in edgelist[3] looks like "0.009615384615384616}"; strip the trailing "}"
  edgelist[4] = edgelist[3].apply(lambda x: x.replace('}', ''))
  # Drop the raw attribute columns
  edgelist = edgelist.drop(columns=[2, 3])
  # Rename the columns to the edge column names GraphFrames expects
  edgelist = edgelist.rename(columns={0: 'src', 1: 'dst', 4: 'weight'})
  # Cast the weight column to float
  edgelist['weight'] = edgelist['weight'].astype(float)

  return edgelist


def read_vertices_data(file_path):
  # Read the JSON file that holds the per-node metadata
  with open(file_path, 'r') as f:
    data = json.load(f)

  # Extract the first record and convert it to a DataFrame
  vert_df = pd.DataFrame(data[0])

  # Drop unwanted columns
  vert_df = vert_df.drop(columns=['inList', 'inWeight', 'outList', 'outWeight'])

  # Add an id column taken from the row index
  vert_df["id"] = vert_df.index

  # Move id to the first column of the DataFrame
  cols = vert_df.columns.tolist()
  cols = cols[-1:] + cols[:-1]
  vert_df = vert_df[cols]

  return vert_df
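
The clean-up in `read_congress_data` suggests that each line of the edge list carries a dict-style weight attribute. As a minimal sketch of the same parsing on a single line, using only the standard library: the sample line below is an assumed format inferred from the `}` comment above, and `parse_edge_line` is a hypothetical helper, not part of the notebook.

```python
import ast


def parse_edge_line(line):
    """Split one edge-list line into (src, dst, weight).

    Assumes the line looks like: 0 4 {'weight': 0.0021}
    (an assumed format, not confirmed by the dataset documentation).
    """
    # The first two whitespace-separated tokens are the node ids;
    # the remainder is a Python-style dict literal holding the weight.
    src, dst, attrs = line.strip().split(maxsplit=2)
    weight = ast.literal_eval(attrs)["weight"]
    return int(src), int(dst), float(weight)


sample = "0 4 {'weight': 0.002105263157894737}"  # illustrative line
print(parse_edge_line(sample))
```

Parsing the dict literal directly avoids the string surgery on the trailing `}`, at the cost of being slower than a vectorised `read_csv` on large files.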

In [29]:
cong_edge = read_congress_data(file_path="/content/congress.edgelist")



In [30]:
cong_vert = read_vertices_data(file_path="/content/congress_network_data.json")

In [55]:
cong_edge.head()

   src  dst    weight
0    0    4  0.002105
1    0   12  0.002105
2    0   18  0.002105
3    0   25  0.004211
4    0   30  0.002105


In [32]:
cong_vert.head()

   id     usernameList
0   0   SenatorBaldwin
1   1  SenJohnBarrasso
2   2    SenatorBennet
3   3  MarshaBlackburn
4   4    SenBlumenthal


In [33]:
# Vertex DataFrame
v = spark.createDataFrame(cong_vert)
# Edge DataFrame
e = spark.createDataFrame(cong_edge)

### Creating Graph

In [34]:
from graphframes import GraphFrame

In [35]:
print('Spark session version: ' + spark.version)
print('Spark context version: ' + spark.sparkContext.version)


Spark session version: 3.5.3
Spark context version: 3.5.3


In [36]:
g = GraphFrame(v, e)



### Testing the graph setup

In [39]:
# Do a test run of the graph: SUCCESS
# Search from vertex 0 to vertex 12
paths = g.bfs("id = 0", "id = 12")
paths.show()



+-------------------+--------------------+-------------------+
|               from|                  e0|                 to|
+-------------------+--------------------+-------------------+
|{0, SenatorBaldwin}|{0, 12, 0.0021052...|{12, SenatorCardin}|
+-------------------+--------------------+-------------------+
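
The test above uses breadth-first search to find a shortest path between two vertices. As a rough illustration of that idea outside Spark, here is a plain-Python sketch on a toy directed graph. The edges are made up for illustration and `bfs_path` is a hypothetical helper; it is not how GraphFrames implements `bfs` internally.

```python
from collections import deque


def bfs_path(edges, start, goal):
    """Return one shortest directed path from start to goal, or None."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Toy edges (hypothetical, not taken from the dataset)
toy_edges = [(0, 4), (0, 12), (4, 12), (12, 18)]
print(bfs_path(toy_edges, 0, 12))
```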



### Queries

#### a. Find the top 5 nodes with the highest outdegree and find the count of the number of outgoing edges in each

In [56]:
# a. Find the top 5 nodes with the highest outdegree and find the count of the number of outgoing edges in each

# Calculate out-degrees and get top 5
top_outdegrees = g.outDegrees.orderBy("outDegree", ascending=False).limit(5)

# Join with vertex DataFrame to get usernames
result = top_outdegrees.join(g.vertices, top_outdegrees.id == g.vertices.id) \
                       .select(g.vertices.id, "usernameList", "outDegree") \
                       .orderBy("outDegree", ascending=False)

# Show the result
result.show(truncate=False)



+---+-------------+---------+
|id |usernameList |outDegree|
+---+-------------+---------+
|367|SpeakerPelosi|210      |
|322|GOPLeader    |157      |
|393|RepBobbyRush |111      |
|71 |SenSchumer   |97       |
|399|SteveScalise |89       |
+---+-------------+---------+



#### b. Find the top 5 nodes with the highest indegree and find the count of the number of incoming edges in each

In [45]:
# b. Find the top 5 nodes with the highest indegree and find the count of the number of incoming edges in each

# Calculate in-degrees and get top 5
top_indegrees = g.inDegrees.orderBy("inDegree", ascending=False).limit(5)

# Join with vertex DataFrame to get usernames
result = top_indegrees.join(g.vertices, top_indegrees.id == g.vertices.id) \
                      .select(g.vertices.id, "usernameList", "inDegree") \
                      .orderBy("inDegree", ascending=False)

# Show the result
result.show(truncate=False)

+---+-------------+--------+
|id |usernameList |inDegree|
+---+-------------+--------+
|322|GOPLeader    |127     |
|208|RepFranklin  |121     |
|190|RepJeffDuncan|120     |
|111|RepDonBeyer  |109     |
|385|RepJohnRose  |108     |
+---+-------------+--------+
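
Out-degree and in-degree are simply counts of how often a node appears as the source or the destination of an edge. A small standard-library sketch on a toy edge list (the edges are hypothetical, chosen only to make the counts easy to check by hand):

```python
from collections import Counter

# Toy directed edges as (src, dst) pairs; not taken from the dataset
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 2), (2, 0)]

# Out-degree: number of edges leaving each node
out_degree = Counter(src for src, _ in toy_edges)
# In-degree: number of edges arriving at each node
in_degree = Counter(dst for _, dst in toy_edges)

print("highest out-degree:", out_degree.most_common(1))
print("highest in-degree:", in_degree.most_common(1))
```

Nodes that never appear as a source (or destination) are simply missing from the corresponding counter, so a node with out-degree 0 would not be listed by this sketch.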



#### c. Calculate PageRank for each of the nodes and output the top 5 nodes with the highest PageRank values. You are free to define any suitable parameters.

In [49]:
# c. Calculate PageRank for each of the nodes and output the top 5 nodes with the highest PageRank values. You are free to define any suitable parameters.

# Calculate PageRank
max_iterations = 20
reset_probability = 0.15

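pagerank_result = g.pageRank(resetProbability=reset_probability, maxIter=max_iterations)

# Preview the PageRank score of each vertex (shows the first 20 rows)
pagerank_result.vertices.show()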




+---+---------------+-------------------+
| id|   usernameList|           pagerank|
+---+---------------+-------------------+
|474|   RepLeeZeldin| 0.2427682690820443|
| 26|   SenJoniErnst| 0.5725160790561225|
|418|  RepJasonSmith| 0.7998705298459917|
|222|  RepJimmyGomez|  1.223589672052033|
|270|    RepMondaire| 0.7434097060228391|
|278|  RepRobinKelly|  1.039626289967799|
|442|   repdinatitus| 1.5985902646733952|
|296|    RepLawrence| 1.1406138478822057|
| 54|SenatorMenendez|  1.173045484400616|
|  0| SenatorBaldwin| 0.6755726663996988|
|348| RepRichardNeal|0.28650213253334417|
|112|        RepBice| 1.5596407466678006|
|330|RepGregoryMeeks| 0.7612260699989268|
| 22|     SenTedCruz| 0.7596636472185839|
|198|   RepPatFallon| 0.9596579938753536|
|414|       RepSires| 0.4909244018194002|
|130|     RepKenBuck| 1.6905696606558955|
|196|    RepRonEstes|0.16882740238109212|
|184|   RepTedDeutch|0.32783786092371625|
| 34|       HawleyMO| 0.1560716558179307|
+---+---------------+-------------------+

In [50]:
# top 5 nodes
result = pagerank_result.vertices.orderBy("pagerank", ascending=False).limit(5)
result.show(truncate=False)

+---+-------------+------------------+
|id |usernameList |pagerank          |
+---+-------------+------------------+
|322|GOPLeader    |4.429485581824597 |
|208|RepFranklin  |4.3894448062995455|
|190|RepJeffDuncan|4.20400762778138  |
|385|RepJohnRose  |4.015253326230502 |
|192|RepTomEmmer  |3.993500324865214 |
+---+-------------+------------------+
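
To make the roles of `resetProbability` and `maxIter` concrete, here is a minimal plain-Python sketch of an unnormalised PageRank iteration on a toy graph. It assumes every node has at least one outgoing edge and starts every rank at 1.0, so the scores sum to the number of nodes rather than to 1, which is consistent with scores above 1 in the table above. The function `pagerank` and the toy edges are hypothetical; the exact GraphFrames implementation may differ, for example in how it treats nodes without outgoing edges.

```python
def pagerank(edges, reset_probability=0.15, max_iter=20):
    """Fixed-iteration, unnormalised PageRank on (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Each node splits its current rank evenly over its outgoing edges
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_degree[src]
        # Keep a fixed reset share and damp what arrives over the edges
        rank = {
            n: reset_probability + (1 - reset_probability) * incoming[n]
            for n in nodes
        }
    return rank


# Toy graph (hypothetical): a 3-cycle plus node 3, which nobody links to
toy_edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = pagerank(toy_edges)
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

Node 3 has no incoming edges, so its score settles at the reset probability, while the nodes on the cycle accumulate rank from one another.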



#### d. Run the connected components algorithm on it and find the top 5 components with the largest number of nodes.

In [52]:
from pyspark.sql.functions import col, count
import os

# Create a 'checkpoints' directory in the current working directory
checkpoint_dir = os.path.join(os.getcwd(), 'checkpoints')
os.makedirs(checkpoint_dir, exist_ok=True)
spark.sparkContext.setCheckpointDir(checkpoint_dir)

# Run connected components algorithm
result = g.connectedComponents()

# Count the number of nodes in each component
component_sizes = result.groupBy("component") \
                        .agg(count("id").alias("node_count")) \
                        .orderBy("node_count", ascending=False) \
                        .limit(5)


# Show the result
print("Top 5 largest connected components:")
component_sizes.show(truncate=False)



Top 5 largest connected components:
+---------+----------+
|component|node_count|
+---------+----------+
|0        |475       |
+---------+----------+
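
For intuition about what the component sizes mean, the following plain-Python sketch groups nodes with a union-find structure, treating edges as undirected. That is an assumption of this sketch (it corresponds to weak connectivity), and labelling each component by its smallest node id is an arbitrary choice made here. The helper `connected_components` and the toy data are hypothetical.

```python
def connected_components(nodes, edges):
    """Return {component_label: node_count}, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, shortening the path on the way
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst in edges:
        root_a, root_b = find(src), find(dst)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so every
            # component ends up labelled by its smallest node id
            parent[max(root_a, root_b)] = min(root_a, root_b)

    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sizes


# Toy data (hypothetical): two groups, {0, 1, 2} and {3, 4}
toy_nodes = [0, 1, 2, 3, 4]
toy_edges = [(0, 1), (2, 1), (3, 4)]
sizes = connected_components(toy_nodes, toy_edges)
print(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```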



#### e. Run the triangle counts algorithm on each of the vertices and output the top 5 vertices with the largest triangle count. In case of ties, you can randomly select the top 5 vertices.

In [54]:
from pyspark.sql.functions import desc

# Run triangle count algorithm
triangle_counts = g.triangleCount()

# Select the top 5 vertices with the largest triangle count
top_5_triangles = triangle_counts.select("id", "count", "usernameList") \
                                 .orderBy(desc("count")) \
                                 .limit(5)

# Show the result
print("Top 5 vertices with the largest triangle count:")
top_5_triangles.show(truncate=False)

Top 5 vertices with the largest triangle count:
+---+-----+-------------+
|id |count|usernameList |
+---+-----+-------------+
|367|3281 |SpeakerPelosi|
|322|2777 |GOPLeader    |
|190|1900 |RepJeffDuncan|
|208|1894 |RepFranklin  |
|254|1893 |LeaderHoyer  |
+---+-----+-------------+
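
A vertex's triangle count is the number of pairs of its neighbours that are themselves connected. The sketch below computes this in plain Python on a toy graph, assuming edge direction is ignored and self-loops are skipped. The helper `triangle_counts` and the edges are hypothetical and only illustrate the definition.

```python
from itertools import combinations


def triangle_counts(edges):
    """Count, for each node, the triangles that pass through it."""
    neighbors = {}
    for src, dst in edges:
        if src == dst:
            continue  # a self-loop cannot be part of a triangle
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    counts = {n: 0 for n in neighbors}
    for node, adjacent in neighbors.items():
        # Every connected pair of neighbours closes one triangle at node
        for a, b in combinations(sorted(adjacent), 2):
            if b in neighbors[a]:
                counts[node] += 1
    return counts


# Toy edges (hypothetical): triangles {0, 1, 2} and {0, 2, 3}, plus a tail to 4
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (3, 4)]
counts = triangle_counts(toy_edges)
print(sorted(counts.items(), key=lambda item: item[1], reverse=True))
```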



### Brief Analysis
1. SpeakerPelosi (Nancy Pelosi) has the highest outdegree (210), indicating that her account is the most active in initiating interactions on Twitter. This aligns with her role as Speaker of the House, a position that requires frequent communication.

2. GOPLeader (Kevin McCarthy) ranks high in both outdegree (2nd, 157) and indegree (1st, 127). He's both active and frequently mentioned or interacted with by others. This reflects his position as the Republican leader in the House. (https://apnews.com/article/donald-trump-kevin-mccarthy-coronavirus-pandemic-9d801249ae7c642576a66a1ae2e3219a)

3. The PageRank results closely mirror the indegree rankings, with GOPLeader, RepFranklin, and RepJeffDuncan appearing in the top 3 for both metrics. This suggests that these representatives are not only frequently mentioned but also occupy central positions in the network's information flow.

4. The connected components algorithm finds a single component containing all 475 nodes, which shows that every member of the 117th US Congress in this network is connected to every other member, directly or indirectly.

5. Representatives like RepFranklin, RepJeffDuncan, and RepJohnRose appear prominently in indegree and PageRank metrics despite not being in top leadership positions. This could indicate that they are emerging influencers or particularly active in Twitter discussions.

6. The triangle count results provide insight into clustering within the network. SpeakerPelosi has the highest triangle count (3281), followed by GOPLeader (2777). This suggests that these leaders are part of many tightly connected groups, which could represent frequent collaborations or discussions among smaller groups of representatives.
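
The overlap between the indegree and PageRank rankings noted in point 3 can be checked directly from the two top-5 tables above. The short sketch below copies the usernames from those outputs; the variable names are chosen here for illustration.

```python
# Usernames copied from the top-5 tables in queries b and c, in rank order
top_indegree = ["GOPLeader", "RepFranklin", "RepJeffDuncan", "RepDonBeyer", "RepJohnRose"]
top_pagerank = ["GOPLeader", "RepFranklin", "RepJeffDuncan", "RepJohnRose", "RepTomEmmer"]

# Accounts that appear in both lists, in indegree order
shared = [name for name in top_indegree if name in top_pagerank]
# Whether the first three positions are identical in both rankings
same_top3 = top_indegree[:3] == top_pagerank[:3]

print("shared accounts:", shared)
print("identical top 3:", same_top3)
```

Four of the five accounts appear in both lists, and the first three positions match exactly.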

