# Graphs Lab

The purpose of this lab is to get some practice with graphs and the NetworkX library.

For full credit on this assignment, complete the steps below labeled TODO(1)-TODO(5).

This lab is less guided than the PAs: you will be doing exploratory analysis on a graph of Twitter interactions between members of Congress.

Data Source: https://snap.stanford.edu/data/congress-twitter.html

In [None]:
import json
import networkx as nx

In [None]:
with open("congress_network/congress_network_data.json") as f:
    data = json.load(f)
print(data[0].keys())

In [None]:
# this data is a list with a single element, so we grab that 0th element 
# and take a look at the keys

# We see the keys inList, inWeight, outList, outWeight, usernameList

# Each of these lists is 475 records long.
# The record at any given index in each list refers to the same individual.
# - usernameList is a list of Twitter usernames
# - inList[i] is a list of indexes of other members that have shared member i's content
# - inWeight[i] is a list of probabilities, one per entry of inList[i], that the
#   corresponding member shares member i's content
# - outList & outWeight are the mirror images of inList & inWeight; we do not need them for our purposes

# Refer to congress_network/README for further documentation on this file.

In [None]:
idx = 0  # let's examine the data at a given index
print("Username: ", data[0]["usernameList"][idx])
print("Followers: ", data[0]["inList"][idx])
print("Follower Weights: ", data[0]["inWeight"][idx])

This shows that Senator Tammy Baldwin has had content reshared by the members at indexes 4, 9, 11, etc.

The corresponding weights are "transmission probabilities": the probability that a given post is reshared by that member.
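
To make the parallel-list layout concrete, here is a minimal sketch that pairs each index in an `inList` entry with the weight at the same position. The toy lists and the helper name `reshare_records` are made up for illustration and are not part of the dataset or the assignment.

```python
# Toy stand-in for the dataset's parallel lists (all values are invented).
username_list = ["alice", "bob", "carol"]
in_list = [[1, 2], [2], []]
in_weight = [[0.02, 0.01], [0.05], []]


def reshare_records(usernames, in_list, in_weight):
    """Return (member, resharer, weight) triples from the parallel lists."""
    records = []
    for idx, member in enumerate(usernames):
        # zip pairs each neighbor index with the weight at the same position
        for other_idx, weight in zip(in_list[idx], in_weight[idx]):
            records.append((member, usernames[other_idx], weight))
    return records


records = reshare_records(username_list, in_list, in_weight)
for record in records:
    print(record)
```

Each triple reads as "this member's content is reshared by that member with this probability". The sketch deliberately does not decide which way an edge should point in a graph.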

A network like this can be used to compute "viral centrality", a measure of whose information spreads the farthest within a given network.  Take a look at the other files in the data directory if you'd like to learn more.

## Lab Assignment

Our goal today is to explore this data and write a few functions that will help in doing so.

The first function we'll need is a function that loads this data into a NetworkX graph.

NetworkX gives four graph choices:
- Graph — Undirected graphs with self loops
- DiGraph — Directed graphs with self loops
- MultiGraph — Undirected graphs with self loops and parallel edges
- MultiDiGraph — Directed graphs with self loops and parallel edges

Each node will be a member of Congress, and each edge will be a follow relationship between two members.
Keep in mind that on Twitter, as with most social media, a user can follow another user without being followed back, and vice versa.
Consider which graph type you'd want to use here.
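
As a small illustration of what is at stake in that choice, the sketch below adds the same two weighted edges to an undirected and a directed graph. The node names and weights are invented. It only shows how `nx.Graph` and `nx.DiGraph` treat a pair of opposite edges, not how the lab data should be loaded.

```python
import networkx as nx

undirected = nx.Graph()
directed = nx.DiGraph()
for g in (undirected, directed):
    g.add_edge("alice", "bob", weight=0.02)
    g.add_edge("bob", "alice", weight=0.05)

# In the undirected graph the second add_edge updates the same edge,
# so only one edge (and one weight) survives.
print(undirected.number_of_edges(), undirected["alice"]["bob"]["weight"])

# The directed graph keeps both directions with their own weights.
print(directed.number_of_edges())
print(directed["alice"]["bob"]["weight"], directed["bob"]["alice"]["weight"])
```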



In [None]:
def build_graph(usernameList, inList, inWeight):
    """
    Build a graph given usernames, connections, and weights from the dataset.
    
    usernameList: list[str]     - list of Twitter usernames
    inList: list[list[int]]     - list of list of connected nodes (followers)
    inWeight: list[list[float]] - list of list of floats (weights) 
                                  These should be set as the "weight" property on each edge.
    """
    # TODO(1): complete this function

In [None]:
graph = build_graph(data[0]["usernameList"], data[0]["inList"], data[0]["inWeight"])

In [None]:
def test_build_graph(tg):
    # these checks will help ensure you loaded the correct amount of data
    assert len(tg) == 475
    assert len(tg.edges) == 13289
    # Pelosi reshared ~.6% of Bobby Rush's content
    assert tg["SpeakerPelosi"]["RepBobbyRush"]["weight"] == 0.00644122383252818
    # Bobby Rush reshared ~1% of Speaker Pelosi's content
    assert tg["RepBobbyRush"]["SpeakerPelosi"]["weight"] == 0.010309278350515464
    return "OK"

In [None]:
test_build_graph(graph)

## Part 2

Now that the data is loaded, let's do some initial queries. 
These help us get familiar with the data and can serve as reasonableness checks on our data.

Write two short functions (and any helper functions you deem necessary):

- most_followed - Return the top N people with the highest number of followers (in edges).
- most_central - Return the top N people with the highest degree centrality as determined by NetworkX.  [Use nx.degree_centrality](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.degree_centrality.html#networkx.algorithms.centrality.degree_centrality)
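
The two quantities can be compared on a toy directed graph (the nodes and edges below are invented). This is a sketch of the NetworkX calls involved, not a reference solution. Note that for a `DiGraph`, `nx.degree_centrality` counts in-edges plus out-edges and divides by `n - 1`, so a score can exceed 1.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("a", "c"), ("b", "c"), ("c", "a"), ("d", "a"), ("d", "c")])

# in_degree counts edges pointing at each node
in_degrees = dict(g.in_degree())

# degree_centrality returns a dict mapping node -> (in + out degree) / (n - 1)
centrality = nx.degree_centrality(g)

# sort (node, count) pairs by count, largest first, and keep the top two
top_two = sorted(in_degrees.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)
print(centrality)
```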

In [None]:
def most_followed(g, top_n):
    """
    Return a list of tuples of the N most followed members (most in-edges)
    along with their follower counts.
        
    Parameters:
        g: Graph
        top_n: How many members to include.
    
    Return format:
    
    [('GOPLeader', 127),
     ('RepFranklin', 121),
     ('RepJeffDuncan', 120)]
    """
    # TODO(2)
    return []
    
def most_central(g, top_n):
    """
    Return a list of tuples of the N most central and their degree centrality score.
    
    (Use nx.degree_centrality.)
    
    Parameters:
        g: Graph
        top_n: How many members to include.
    
    Return format:
    [('GOPLeader', 0.5991561181434598),
     ('SpeakerPelosi', 0.550632911392405),
     ('RepBobbyRush', 0.4008438818565401),
     ('LeaderHoyer', 0.3945147679324894),
     ('RepFranklin', 0.38396624472573837)]
    """
    # TODO(3)
    return []

In [None]:
most_followed(graph, 10)

In [None]:
most_central(graph, 6)

## Part 3

After exploring our data, we will implement a graph algorithm using this graph.

Let's find out the shortest path between two members.

For this portion, you must implement Dijkstra's algorithm.  You may not use the implementation in NetworkX, but you may use it to check yours.
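
For reference, the general shape of Dijkstra's algorithm is sketched below on a plain dict-of-dicts rather than a NetworkX graph. The function name `dijkstra_path`, the `adjacency` layout, and the toy weights are assumptions made for this illustration. Adapting it to the lab's graph and deciding what "distance" means for this data is left to you.

```python
import heapq


def dijkstra_path(adjacency, start, end):
    """Return the lowest-total-weight path from start to end, or None."""
    best = {start: 0.0}   # best known distance to each node
    previous = {}         # node -> predecessor on the best known path
    heap = [(0.0, start)]
    visited = set()
    while heap:
        dist, node = heapq.heappop(heap)
        if node in visited:
            continue      # stale heap entry; node was already finalized
        visited.add(node)
        if node == end:
            break
        for neighbor, weight in adjacency.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if end not in visited:
        return None
    # walk predecessors back from end to start, then reverse
    path = [end]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]


toy = {"A": {"B": 1.0, "C": 5.0}, "B": {"C": 1.0}, "C": {}, "D": {"A": 1.0}}
print(dijkstra_path(toy, "A", "C"))
print(dijkstra_path(toy, "C", "A"))
```

In this toy graph the route through B has a lower total weight than the direct A to C edge, and C has no outgoing edges, so no path leads back from C to A.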

In [None]:
from collections import deque

def shortest_path(graph, start, end):
    """
    Return a list of usernames that form a path from one user to another in graph.
    Include both usernames in the list. So for a path from A to C via B, (A -> B -> C),
    return ["A", "B", "C"].

    Return None if no path is found.

    
    Input:
        graph: Graph
        start: str - username to start at
        end: str - username to search for 
    
    This path would represent (for our data) the shortest path information
    could take between two users.
    
    Note that because the relationships are one-directional, shortest_path(g, A, B)
    and shortest_path(g, B, A) are not necessarily the same path in reverse.
    """
    # TODO(4)
    return []

In [None]:
shortest_path(graph, "RepMTG", "SenSanders")

In [None]:
shortest_path(graph, "SenSanders", "RepMTG")

## Part 4

Now, use `shortest_path` to measure how far each member is from GOPLeader & SpeakerPelosi.

Let's define a metric to approximate someone's influence over both parties.

This metric is given in the function below:

In [None]:
def metric(graph, member):
    """
    Compute a measure of connectedness to party leadership.
    
    Our metric is the sum of the lengths (counted as the number of
    intermediate members) of the four paths between:
        (member, GOPLeader)
        (GOPLeader, member)
        (member, SpeakerPelosi)
        (SpeakerPelosi, member)
    
    Each missing path contributes -100, so a member with any missing path
    gets a large negative score.
    Return a negative number for GOPLeader & SpeakerPelosi as well.
    """
    people = ("GOPLeader", "SpeakerPelosi")
    if member in people:
        return -1
    index = 0
    for p in people:
        # forward path (leader -> member): count only intermediate members,
        # so subtract the two endpoints from the path length;
        # a missing path (likely missing data) contributes -100 instead
        path = shortest_path(graph, p, member)
        index += len(path) - 2 if path else -100 

        # same as above, but backwards path
        path = shortest_path(graph, member, p)
        index += len(path) - 2 if path else -100
    return index

Now you can apply this metric over all nodes to find the person "farthest" from both of these accounts.
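
One way to do that ranking is sketched below, assuming that negative scores mark members to exclude (the leaders themselves and members with missing paths). The helper name `farthest_member` and the toy scores are invented for illustration.

```python
def farthest_member(members, score):
    """Return (member, value) for the highest non-negative score, or None."""
    scored = [(m, score(m)) for m in members]
    # negative scores are treated as "excluded" under the stated assumption
    valid = [(m, s) for m, s in scored if s >= 0]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])


toy_scores = {"GOPLeader": -1, "alice": 3, "bob": 7, "carol": -198}
result = farthest_member(toy_scores, toy_scores.get)
print(result)
```

With the lab's graph, a call along the lines of `farthest_member(graph.nodes, lambda m: metric(graph, m))` would apply the same idea once `shortest_path` is implemented.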

**TODO(5)**
**To complete the assignment, print this member's name & the name of their state/territory.** 

(You will need to do a web search since their name is not part of the data set.)