# Online Social Networks, Hilary Term 2020
## Week 3 Formative Assignment

You will be marked for correct output and for following the instructions carefully.  

Remember to name this file correctly. So if your surname is "Turing", the file you submit should be:
TURING_OSN20_Week03_Formative.ipynb 

Be careful that the extension is not .json. **Edit it above now where it says SURNAME before you continue.** 

For this series of exercises, you will be asked to use the "comm16.graphml" data set uploaded to Canvas. Please do not distribute this file as it is licensed for academic use only. 

In [2]:
# To get started
import networkx as nx
import pandas as pd
%matplotlib inline 

g = nx.read_graphml("comm16.graphml")

# A helpful code snippet reminder for getting the attributes of a node: 
# next(iter(g.nodes)) will get the NodeID of the first node
# g.nodes[NODEID] will get the attributes of a node by NodeID, thus: 

g.nodes[next(iter(g.nodes))]

{'name': '1',
 'sex': '1',
 'race': '1',
 'grade': '10',
 'totalnoms': '8',
 'scode': '1'}
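
In this data set the node attributes come back as strings (for example `'grade': '10'`), so they need casting before any numeric comparison. A minimal sketch of that step, using a hypothetical two-node toy graph with made-up attribute values rather than comm16.graphml:

```python
import networkx as nx

# Hypothetical toy graph; node IDs and attribute values are made up for illustration.
toy = nx.DiGraph()
toy.add_node("n0", name="1", grade="10", totalnoms="8")
toy.add_node("n1", name="2", grade="9", totalnoms="3")

# Same pattern as above: take the first node ID, then look up its attributes.
first_id = next(iter(toy.nodes))
attrs = toy.nodes[first_id]

# Collect one attribute for every node and cast the string values to integers.
grades = {node: int(value)
          for node, value in nx.get_node_attributes(toy, "grade").items()}

print(first_id, attrs)
print(grades)
```

The same dictionary comprehension should work on `g` for any attribute whose values are all numeric strings.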

In [3]:
# Exercise 1. Basic network description 

# Describe the network using the following features: 
# 1.1 Number of nodes (1pt)
# 1.2 Number of edges (1pt)
# 1.3 Average in-degree/degree (explain why it is a directed or undirected graph), 2pts

# Answer below here: 

print(nx.info(g))  # nx.info() was removed in NetworkX 3.0; print(g) gives a shorter summary there
print("\n")

# 1.1
print("There are", g.number_of_nodes(), "nodes.")

# 1.2
print("There are", g.number_of_edges(), "edges.")

# 1.3
df = pd.DataFrame.from_dict(dict(g.nodes), orient = "index")
df["degree"] = pd.Series(dict(nx.degree(g)))
df["in_degree"] = pd.Series(dict(g.in_degree()))
df["out_degree"] = pd.Series(dict(g.out_degree()))
# display(df)

print("The average degree is {:.2f}".format(df["degree"].mean()), 
      "but the average in-degree is {:.2f}".format(df["in_degree"].mean()))

print("The average out-degree is also {:.2f}".format(df["out_degree"].mean()))
print("\n")

deg_in = g.in_degree('n5')
deg_out = g.out_degree('n5')

print("Node n5 has an in-degree of {}.".format(deg_in))
print("However, node n5 has an out-degree of {}, which shows it is a directed graph as in-degrees are not equal to out-degrees.".format(deg_out))

# Comments below here: 


ex1_of4 = 4

Name: 
Type: DiGraph
Number of nodes: 795
Number of edges: 4125
Average in degree:   5.1887
Average out degree:   5.1887


There are 795 nodes.
There are 4125 edges.
The average degree is 10.38 but the average in-degree is 5.19
The average out-degree is also 5.19


Node n5 has an in-degree of 6.
However, node n5 has an out-degree of 7, which shows it is a directed graph as in-degrees are not equal to out-degrees.
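
The averages above follow directly from the edge count: every directed edge adds one to some node's in-degree and one to some node's out-degree, so both averages equal edges / nodes (4125 / 795 ≈ 5.19), and the total degree averages twice that (≈ 10.38). A small sketch on a hypothetical toy graph (not the comm16 data) that checks this relationship and tests directedness explicitly:

```python
import networkx as nx

# Hypothetical toy directed graph, for illustration only.
toy = nx.DiGraph()
toy.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Average each degree type over all nodes.
avg_in = sum(d for _, d in toy.in_degree()) / n
avg_out = sum(d for _, d in toy.out_degree()) / n
avg_total = sum(d for _, d in toy.degree()) / n

print("Directed:", toy.is_directed())
print("edges / nodes:", m / n)
print("average in, out, total:", avg_in, avg_out, avg_total)

# A single node can still have unequal in- and out-degree.
print("node a:", toy.in_degree("a"), "in,", toy.out_degree("a"), "out")
```

Unequal in- and out-degree at one node is consistent with a directed graph, but `is_directed()` (or the `Type: DiGraph` line in the summary) is the more direct check.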


In [8]:
# Exercise 2 - Path based metrics. (2pts each)

# 2.1. What is the diameter of the network? 
# 2.2. How many nodes are in the giant component (assuming there is a giant component)?
# 2.3. What is the largest strongly_connected_component?


# Answer below here: 

count = 0
for c in nx.strongly_connected_components(g): count += 1
print("There are", count, "strongly connected components with", g.number_of_nodes(), "nodes.")

g_strong = max(nx.strongly_connected_components(g), key=len) # to select the largest strongly connected component
print(len(g_strong))

# print(nx.diameter(g.subgraph(g_strong)))

Gc = max(nx.weakly_connected_components(g), key=len)
print(len(Gc))

# print("The giant component has", g.subgraph(Gc).number_of_nodes(), "nodes.")

# I went through the documentation and help files but could not get the diameter
# and number_of_nodes calls to run on g_strong and Gc directly.
# The cause: max(...) over the components returns a set of node IDs, not a graph,
# so it has to be wrapped in g.subgraph(...) first, as in the commented lines above.

# Comments below here: 
"""1. The diameter of the network is the diameter of its largest strongly connected (giant) component."""


ex2_of6 = 4

There are 160 strongly connected components with 795 nodes.
391
778
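
The component functions used above yield sets of node IDs, which is why `len(...)` works on them but graph methods do not. A minimal sketch, on a hypothetical four-node toy graph rather than the comm16 data, of turning the largest component back into a graph so that `nx.diameter` and `number_of_nodes` can be called on it:

```python
import networkx as nx

# Hypothetical toy graph: a directed 3-cycle plus one node that can be
# reached from the cycle but has no path back into it.
toy = nx.DiGraph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

# Each component is returned as a plain set of node IDs.
largest_scc_nodes = max(nx.strongly_connected_components(toy), key=len)
largest_wcc_nodes = max(nx.weakly_connected_components(toy), key=len)

# subgraph() turns a node set back into a graph (a view on the original).
scc = toy.subgraph(largest_scc_nodes)
wcc = toy.subgraph(largest_wcc_nodes)

print("Largest strongly connected component:", scc.number_of_nodes(), "nodes")
print("Largest weakly connected component:", wcc.number_of_nodes(), "nodes")

# The diameter is only defined where every node can reach every other node,
# so it is computed on the strongly connected subgraph.
print("Diameter of the largest strongly connected component:", nx.diameter(scc))
```

Calling `nx.diameter` on the whole directed graph, or on a weakly connected subgraph, raises an error whenever some pair of nodes has no directed path between them.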


In [4]:
# Exercise 3 - Using betweenness (2pts each)
#
# Find the three nodes with the highest betweenness centrality 
# For each of these nodes, you will see there are attributes attached to the data. 
# 3.1. Report the betweenness of these nodes. 
# 3.2. Describe the grade, race, and gender of these nodes. 
# Later we will discuss how networks can reveal intersectional inequalities using these data 
#
# You may consult the codebook (AddHealth_agreement_and_codebook.pdf) for this task

# Answer below here: 

from collections import Counter

bc = nx.betweenness_centrality(g) #code has been changed from .degree_centrality to what it is now
# print(bc)

k = Counter(bc)
high = k.most_common(3)

print("Nodes with highest betweenness centrality and its respective value are:")

for i in high: 
    print(i[0]," :",i[1]," ")

print("\n")

print("And their attributes are coded as follows:")
print(g.nodes["n474"])
print(g.nodes["n699"])
print(g.nodes["n189"])

print("\n")
print("Node name 475 is a white female with a grade of 9.")
print("Node name 700 is a white female with a grade of 8.")
print("Node name 190 is also a white female with a grade of 9.")

# Comments below here: 
""" You looked at the degree centrality, not the betweenness centrality."""


ex3_of4 = 0


Nodes with highest betweenness centrality and its respective value are:
n153  : 0.029592448043782093  
n375  : 0.027386116755251527  
n417  : 0.021653419757343584  


And their attributes are coded as follows:
{'name': '475', 'sex': '2', 'race': '1', 'grade': '9', 'totalnoms': '10', 'scode': '1'}
{'name': '700', 'sex': '2', 'race': '1', 'grade': '8', 'totalnoms': '10', 'scode': '0'}
{'name': '190', 'sex': '2', 'race': '1', 'grade': '9', 'totalnoms': '10', 'scode': '1'}


Node name 475 is a white female with a grade of 9.
Node name 700 is a white female with a grade of 8.
Node name 190 is also a white female with a grade of 9.
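
The three nodes printed with the highest betweenness (n153, n375, n417) differ from the IDs hard-coded in the attribute lookup (n474, n699, n189). One way to keep the two in sync is to loop over the ranked nodes instead of typing IDs by hand. A sketch under stated assumptions: the toy graph and its `grade` values are hypothetical, not taken from comm16.graphml.

```python
import networkx as nx

# Hypothetical toy graph: a directed path a -> b -> c -> d -> e.
toy = nx.DiGraph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])

# Made-up attribute values, stored as strings as in the GraphML file.
for node, grade in zip("abcde", ["9", "9", "10", "8", "10"]):
    toy.nodes[node]["grade"] = grade

bc = nx.betweenness_centrality(toy)

# Rank node IDs by betweenness, highest first, and keep the top three.
top3 = sorted(bc, key=bc.get, reverse=True)[:3]

# Look up attributes from the ranking itself, so the IDs cannot drift apart.
for node in top3:
    print(node, round(bc[node], 4), toy.nodes[node])
```

On the real graph the same loop over `top3` would replace the three hard-coded `g.nodes[...]` lookups.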


In [10]:
# Exercise 4 - Using closeness (2pts each)
#
# Find the three nodes with the highest closeness centrality. 
# For each of the nodes report: 
# 4.1. How many nodes can these three reach?
# 4.2. Report on their average distance to their reachable nodes 
# 4.3. Are they the same nodes as featured in Exercise 3? 


# Answer below here: 

from collections import Counter

cc = nx.closeness_centrality(g)
k = Counter(cc)
high = k.most_common(3)
print("Nodes with highest closeness centrality and its respective value are:")
for i in high: 
    print(i[0]," :",i[1]," ")
    
deg1 = nx.degree(g,'n474')
deg2 = nx.degree(g,'n189')
deg3 = nx.degree(g,'n370')

paths1 = nx.single_source_shortest_path(g,'n474')
paths2 = nx.single_source_shortest_path(g,'n189')
paths3 = nx.single_source_shortest_path(g,'n370')

print("\n")

print("Node n474 has a direct connection to {} others.".format(deg1))
print("Node n474 can indirectly reach {} others.".format(len(paths1)))

print("\n")

print("Node n189 has a direct connection to {} others.".format(deg2))
print("Node n189 can indirectly reach {} others.".format(len(paths2)))

print("\n")

print("Node n370 has a direct connection to {} others.".format(deg3))
print("Node n370 can indirectly reach {} others.".format(len(paths3)))

print("\n")

import numpy as np

avg_dist1 = np.mean([len(path)-1 for index,path in paths1.items()])
print("The average distance from node n474 to all other reachable nodes "\
      "is {:.3f}.".format(avg_dist1))

avg_dist2 = np.mean([len(path)-1 for index,path in paths2.items()])
print("The average distance from node n189 to all other reachable nodes "\
      "is {:.3f}.".format(avg_dist2))

avg_dist3 = np.mean([len(path)-1 for index,path in paths3.items()])
print("The average distance from node n370 to all other reachable nodes "\
      "is {:.3f}.".format(avg_dist3))

print("\n")

print("Only n474 and n189 are the same as the question before.")

# Comments below here: 



ex4_of6 = 5

Nodes with highest closeness centrality and its respective value are:
n474  : 0.24718495678539812  
n189  : 0.24214875600266028  
n370  : 0.23946042688585442  


Node n474 has a direct connection to 30 others.
Node n474 can indirectly reach 451 others.


Node n189 has a direct connection to 28 others.
Node n189 can indirectly reach 451 others.


Node n370 has a direct connection to 27 others.
Node n370 can indirectly reach 451 others.


The average distance from node n474 to all other reachable nodes is 4.253.
The average distance from node n189 to all other reachable nodes is 4.408.
The average distance from node n370 to all other reachable nodes is 4.355.


Only n474 and n189 are the same as the question before.
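
`nx.single_source_shortest_path` includes the source node itself (with a path of length 0), so a count of 451 entries corresponds to 450 other nodes, and averaging over all entries pulls the mean distance down slightly. A minimal sketch that excludes the source, using a hypothetical toy graph and a helper name (`reach_and_mean_distance`) invented for illustration:

```python
import networkx as nx

# Hypothetical toy graph, for illustration only.
toy = nx.DiGraph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("a", "d"), ("e", "a")])

def reach_and_mean_distance(graph, source):
    """Return (number of other reachable nodes, mean distance to them)."""
    lengths = nx.single_source_shortest_path_length(graph, source)
    # Drop the source itself, which is always present at distance 0.
    others = {node: dist for node, dist in lengths.items() if node != source}
    if not others:
        return 0, float("nan")
    return len(others), sum(others.values()) / len(others)

count, mean_dist = reach_and_mean_distance(toy, "a")
print("Node a reaches", count, "other nodes at mean distance", round(mean_dist, 3))

# nx.descendants gives the same reachable set without the distances.
print(sorted(nx.descendants(toy, "a")))
```

Per the NetworkX documentation, `closeness_centrality` on a directed graph uses inward distances, whereas the paths here are outward from the source. If the intention is to match the centrality ranking, running the helper on `graph.reverse()` may be the closer comparison.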


In [11]:
try:
    total = ex1_of4 + ex2_of6 + ex3_of4 + ex4_of6
    print("Your total out of 20 is: {}.".format(total))
except (NameError, TypeError):
    print("The totals have not been tallied yet")

Your total out of 20 is: 13.
