# Warm-up
In this warm-up task we'll build a small network and analyze it with standard network analysis techniques: centrality ranking, degree distribution, and community detection.

## Dataset
The [MOOC user action dataset](https://snap.stanford.edu/data/act-mooc.html) represents the actions taken by users on a popular MOOC platform. The actions are represented as a directed, temporal network. The nodes represent users and course activities (targets), and edges represent the actions by users on the targets. The actions have attributes and timestamps. To protect user privacy, users are anonymized and timestamps are standardized to start from 0.

Additionally, each action has a binary label representing whether the user dropped out of the course after this action, i.e., whether this is the last action of the user.

The cells below, however, load the Wikispeedia navigation paths (`paths_finished.tsv`, `articles.tsv`, `categories.tsv`): nodes are articles, and each edge is a click from one article to the next.

In [63]:
import pandas as pd
import igraph
import seaborn as sns

%config Completer.use_jedi = False

In [2]:
finished_paths = pd.read_csv("../datasets/wikispeedia_paths-and-graph/paths_finished.tsv", 
                                delimiter = "\t", 
                                skiprows = 15, 
                                names = ['hashedIpAddress', 'timestamp', 'durationInSec', 'path', 'rating'])
finished_paths

Unnamed: 0,hashedIpAddress,timestamp,durationInSec,path,rating
0,6a3701d319fc3754,1297740409,166,14th_century;15th_century;16th_century;Pacific...,
1,3824310e536af032,1344753412,88,14th_century;Europe;Africa;Atlantic_slave_trad...,3.0
2,415612e93584d30e,1349298640,138,14th_century;Niger;Nigeria;British_Empire;Slav...,
3,64dd5cd342e3780c,1265613925,37,14th_century;Renaissance;Ancient_Greece;Greece,
4,015245d773376aab,1366730828,175,14th_century;Italy;Roman_Catholic_Church;HIV;R...,3.0
...,...,...,...,...,...
51313,15a13a1d66ef5456,1349231015,66,Yagan;Ancient_Egypt;Civilization,
51314,2ef7ac844cefda58,1300254138,165,Yagan;Folklore;Brothers_Grimm;<;19th_century;C...,3.0
51315,12863abb7887f890,1385095372,228,Yagan;Australia;England;France;United_States;T...,
51316,19f8284371753362,1298792567,56,Yarralumla%2C_Australian_Capital_Territory;Aus...,1.0


In [3]:
articles = pd.read_csv("../datasets/wikispeedia_paths-and-graph/articles.tsv", 
                                delimiter = "\t", 
                                skiprows = 11, 
                                names = ['article']).reset_index().rename(columns = {"index": "article_id"})
articles

Unnamed: 0,article_id,article
0,0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in
1,1,%C3%85land
2,2,%C3%89douard_Manet
3,3,%C3%89ire
4,4,%C3%93engus_I_of_the_Picts
...,...,...
4599,4599,Zionism
4600,4600,Zirconium
4601,4601,Zoroaster
4602,4602,Zuid-Gelders


In [4]:
categories = pd.read_csv("../datasets/wikispeedia_paths-and-graph/categories.tsv", 
                                delimiter = "\t", 
                                skiprows = 12, 
                                names = ['article', 'category'])
categories

Unnamed: 0,article,category
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,subject.History.British_History.British_Histor...
1,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,subject.People.Historical_figures
2,%C3%85land,subject.Countries
3,%C3%85land,subject.Geography.European_Geography.European_...
4,%C3%89douard_Manet,subject.People.Artists
...,...,...
5199,Zirconium,subject.Science.Chemistry.Chemical_elements
5200,Zoroaster,subject.People.Religious_figures_and_leaders
5201,Zuid-Gelders,subject.Geography.European_Geography
5202,Zuid-Gelders,subject.Language_and_literature.Languages


In [5]:
def replace_back_page(path):
    while "<" in path:
        idx = path.index("<")
        path[idx] = path[idx - 2]
    return path
    
def parse_path(unparsed_path, delimiter = ";"):
    nodes = unparsed_path.split(delimiter)
    replaced = replace_back_page(nodes)
    return replaced
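
The `idx - 2` lookup above resolves a single back click (`A;B;<;C` becomes `A;B;A;C`), but for two consecutive back clicks (`A;B;C;<;<;D`) it yields `A;B;C;B;C;D`. If `<` is assumed to behave like a browser back button, the second back click should land on `A`. A minimal sketch of a stack-based alternative under that assumption; `resolve_back_clicks` is a hypothetical helper, not part of the notebook:

```python
def resolve_back_clicks(nodes, back_token="<"):
    """Replace each back click with the page the player returned to."""
    stack = []      # pages on the current navigation trail
    resolved = []   # every page displayed, in order
    for node in nodes:
        if node == back_token:
            # Going back leaves the current page and shows the previous one again.
            if len(stack) > 1:
                stack.pop()
                resolved.append(stack[-1])
        else:
            stack.append(node)
            resolved.append(node)
    return resolved


print(resolve_back_clicks("A;B;<;C".split(";")))
print(resolve_back_clicks("A;B;C;<;<;D".split(";")))
```

Each `<` is still replaced by exactly one page, so the number of edges derived from a path does not change; only the endpoints after repeated back clicks do.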

In [6]:
parsed_paths = list()
for path in finished_paths['path'].tolist():
    parsed_paths.append(parse_path(path))

In [7]:
def path_array_to_tuple(path_array):
    tuple_array = list()
    for i in range(0, len(path_array)-1, 1):
        tuple_array.append((path_array[i], path_array[i+1]))
    return tuple_array

edges = list()
for p in parsed_paths:
    edges.extend(path_array_to_tuple(p))

In [8]:
len(edges)

295460
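
The edge list holds one entry per click, so the same pair of articles can appear many times. A small sketch of the same path-to-edges conversion written with `zip`, plus a count of repeated pairs; `path_to_edges` and the toy paths are illustrative assumptions, not notebook code or dataset values:

```python
from collections import Counter


def path_to_edges(path):
    """Pair each page with the page visited right after it."""
    return list(zip(path, path[1:]))


toy_paths = [["A", "B", "C"], ["A", "B", "D"]]   # toy data
edge_counts = Counter(edge for p in toy_paths for edge in path_to_edges(p))

print(path_to_edges(["A", "B", "C"]))
print(edge_counts.most_common(1))
```

Counts like these could later serve as edge weights if repeated clicks should not become parallel edges.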

## Graph building

Since analyzing a large network is very time-consuming, we'll consider only the first 1000 edges.

In [9]:
g = igraph.Graph()
g.add_vertices(articles['article'].tolist())#, attributes = categories.set_index(keys = 'article').to_dict()['category'])

In [10]:
%%time

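def add_edges_to_graph(graph, edges, autoconnect=False):
    """Add edges one by one; self-loops are skipped unless autoconnect is True."""
    for e in edges:
        # `not` is the boolean negation; `~False` evaluates to -1, which is truthy
        if e[0] == e[1] and not autoconnect:
            continue
        try:
            graph.add_edges([e])
        except ValueError:
            # igraph raises ValueError when an endpoint is not a known vertex name
            pass
add_edges_to_graph(g, edges[:1000])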

CPU times: user 55.7 ms, sys: 1.92 ms, total: 57.7 ms
Wall time: 76.2 ms


In [11]:
igraph.summary(g)

IGRAPH UN-- 4604 1000 -- 
+ attr: name (v)


In [12]:
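# Align categories by article name: an article may have several categories, or none (NaN).
g.vs["category"] = articles["article"].map(categories.groupby("article")["category"].agg(list)).tolist()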

In [13]:
graph_analysis = articles[["article"]].copy()  # copy so that adding columns does not touch `articles`
graph_analysis["degree"] = g.degree()
graph_analysis["vertex_betweenness"] = g.betweenness()
graph_analysis["eigen_centrality"] = g.eigenvector_centrality()


## Ranking

To rank the nodes, we'll consider the betweenness measure.

In [73]:
graph_analysis.sort_values(by = ["vertex_betweenness"], ascending = False).head(10)

Unnamed: 0,article,degree,vertex_betweenness,eigen_centrality
4297,United_States,42,15218.31399,0.035618
10,14th_century,48,14818.061081,0.848915
1433,Europe,38,12787.399283,0.308413
4147,Time,40,10091.471753,1.0
1385,England,36,9534.479396,0.234948
128,Africa,32,7531.992252,0.112957
873,Chemistry,14,6798.214326,0.129564
33,2005_Atlantic_hurricane_season,20,6341.954255,0.019499
377,Atlantic_Ocean,20,5189.531923,0.033121
3650,Science,18,5091.979364,0.26139


`United_States` is the node with the greatest betweenness, which means it lies on more shortest paths between pairs of other nodes than any other article. This suggests that players use it as a bridge to reach other content, plausibly pages related to the United States.
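
To make the "bridge" reading concrete, here is a minimal pure-Python sketch of unnormalized betweenness on an undirected, unweighted toy graph, following Brandes' algorithm. It is meant to illustrate the quantity, not to reproduce igraph's implementation; the `betweenness` function and the toy adjacency are assumptions for illustration:

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness; adj maps each node to a set of neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Two triangles that share the single node "X" (toy graph).
toy_adj = {
    "A": {"B", "X"}, "B": {"A", "X"},
    "C": {"D", "X"}, "D": {"C", "X"},
    "X": {"A", "B", "C", "D"},
}
scores = betweenness(toy_adj)
print(max(scores, key=scores.get))
```

Every shortest path between the two triangles passes through `X`, so it collects all of the betweenness while the other nodes get none.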

## Properties

To analyze a property of the network, we'll consider the degree measure. 

In [68]:
graph_analysis.sort_values(by = ["degree"], ascending = False)

Unnamed: 0,article,degree,vertex_betweenness,eigen_centrality
108,Achilles,61,4355.611994,0.343946
10,14th_century,48,14818.061081,0.848915
4297,United_States,42,15218.313990,0.035618
4147,Time,40,10091.471753,1.000000
1433,Europe,38,12787.399283,0.308413
...,...,...,...,...
1616,GNU_Linux_naming_controversy,0,0.000000,-0.000000
1617,GNU_Project,0,0.000000,-0.000000
1618,Gabon,0,0.000000,-0.000000
1619,Gaborone,0,0.000000,-0.000000


The node `Achilles` has the largest `degree`, i.e., the most edges attached to it. Because the graph keeps one edge per click, repeated clicks between the same two pages each add to the degree. Now let's look at the degree distribution. Since we created only 1000 edges, most nodes have no connections, so we discard nodes with degree 0.
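
Since the graph is built from one edge per click, degree and "number of distinct linked pages" are not necessarily the same thing. A small sketch of the difference on toy edges (the edge list below is made up for illustration):

```python
from collections import Counter

# Toy click edges; the pair ("A", "B") is traversed twice.
toy_edges = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")]

degree = Counter()   # counts every edge endpoint, parallel edges included
neighbours = {}      # distinct pages each node is linked to
for u, v in toy_edges:
    degree[u] += 1
    degree[v] += 1
    neighbours.setdefault(u, set()).add(v)
    neighbours.setdefault(v, set()).add(u)

print(degree["A"], len(neighbours["A"]))
```

Here `A` has degree 3 but only 2 distinct neighbours, which is worth keeping in mind when reading a high degree as "many linked pages".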

In [72]:
graph_analysis[graph_analysis["degree"] > 0][["degree"]].describe(percentiles = [0.25, 0.5, 0.75, 0.85, 0.99])

Unnamed: 0,degree
count,370.0
mean,5.405405
std,7.564213
min,1.0
25%,2.0
50%,2.0
75%,6.0
85%,8.0
99%,38.62
max,61.0


As we can see, 85% of the nodes with at least one connection have degree 8 or less, and half of them have degree 2 or less. This suggests that, in this sample, a typical page is linked to only one or two other pages, while a few hubs account for most of the connections.
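
The fractional 99th percentile (38.62) arises because the percentile falls between two observed degrees and is interpolated. A minimal sketch, assuming the linear interpolation that pandas applies by default; `percentile` is a hypothetical helper and the values are toy data, not the notebook's degrees:

```python
def percentile(values, q):
    """q-th quantile (0 <= q <= 1), interpolating linearly between sorted values."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)          # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


toy_degrees = [1, 2, 2, 6, 8]        # toy values
print(percentile(toy_degrees, 0.5))  # the median, an observed value
print(percentile(toy_degrees, 0.9))  # falls between 6 and 8
```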

In [16]:
graph_analysis.sort_values(by = ["eigen_centrality"], ascending = False).head(10)

Unnamed: 0,article,degree,vertex_betweenness,eigen_centrality
4147,Time,40,10091.471753,1.0
10,14th_century,48,14818.061081,0.848915
2466,Light,38,3947.398944,0.782667
103,Acceleration,33,3820.782333,0.724419
3244,Physics,38,4779.615849,0.684546
3425,Rainbow,27,534.179851,0.587099
969,Color,24,522.955897,0.525907
2165,Isaac_Newton,14,1199.475691,0.390826
108,Achilles,61,4355.611994,0.343946
1433,Europe,38,12787.399283,0.308413


The node with the greatest eigenvector centrality is the `Time` page. This indicates that the page is very influential within the network: it is connected to pages that are themselves well connected. It's interesting to note that this node also has the 4th highest degree, which reinforces that `Time` is a very important node in this sample.
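
Degree and eigenvector centrality can disagree: in the tables above, `Achilles` has the highest degree (61) but an eigenvector centrality of only about 0.34. A minimal power-iteration sketch on a toy graph shows how this can happen. The function and the graph are illustrative assumptions and do not reproduce igraph's solver:

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration on an undirected graph; scores are scaled so the max is 1."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # A node's new score is the sum of its neighbours' current scores.
        new = {v: sum(score[w] for w in adj[v]) for v in adj}
        top = max(new.values())
        score = {v: x / top for v, x in new.items()}
    return score


# Toy graph: a 4-clique {a, b, c, d} linked through d to a star centred on S.
toy_adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "S"},
    "S": {"d", "l1", "l2", "l3", "l4"},
    "l1": {"S"}, "l2": {"S"}, "l3": {"S"}, "l4": {"S"},
}
score = eigenvector_centrality(toy_adj)
print(len(toy_adj["S"]), len(toy_adj["a"]))   # S has more neighbours than a
print(score["S"] < score["a"])                # yet S is less central
```

`S` has the most neighbours, but most of them are leaves with no other connections, so it ends up less central than the clique members whose neighbours are all well connected.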

## Finding communities

Let's try to find some communities within the network, using the following definition of a community:
>*"A community, with respect to graphs, can be defined as a subset of nodes that are densely connected to each other and loosely connected to the nodes in the other communities in the same graph."*

To find communities we'll run the Girvan-Newman algorithm. 
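
The idea behind Girvan-Newman is to repeatedly remove the edge with the highest edge betweenness until the graph splits into more components. The notebook relies on the networkx implementation; the pure-Python sketch below illustrates a single splitting step on a toy graph. All function names here are hypothetical, and self-loops are assumed to be absent:

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an undirected, unweighted graph."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1}, {}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds.get(w, []):
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    return score


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_step(adj):
    """Remove highest-betweenness edges until the component count grows."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:                            # no edges left to remove
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the bridge C-D (toy graph).
toy_adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
parts = girvan_newman_step(toy_adj)
print(sorted(sorted(p) for p in parts))
```

The bridge carries every shortest path between the two triangles, so it is removed first and the graph splits into the two triangles.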

In [56]:
import networkx as nx

nx_graph = nx.Graph()

for e in edges[:1000]:
    # Self-loops are not allowed
    if e[0] != e[1]:
        nx_graph.add_edge(e[0], e[1], weight = 1)

In [57]:
from networkx.algorithms import community

communities_generator = community.girvan_newman(nx_graph)
top_level_communities = next(communities_generator)
next_level_communities = next(communities_generator)

In [62]:
c = sorted(map(sorted, next_level_communities))

Three communities were found within the network. The first seems to group pages related to History and Geography.

In [80]:
c[0]

['10th_century',
 '11th_century',
 '12th_century',
 '13th_century',
 '14th_century',
 '15th_century',
 '16th_century',
 '2004_Atlantic_hurricane_season',
 '2005_Atlantic_hurricane_season',
 '8th_century',
 'Aberdeen',
 'Abraham_Lincoln',
 'Accra',
 'Achilles',
 'Achilles_tendon',
 'Africa',
 'African_slave_trade',
 'Alexander_the_Great',
 'Alfred_Nobel',
 'Amazon_Rainforest',
 'American_Civil_War',
 'American_Revolutionary_War',
 'Americium',
 'Ancient_Greece',
 'Ancient_Rome',
 'Animal',
 'Antelope',
 'Apollo',
 'Atlantic_Ocean',
 'Atlantic_slave_trade',
 'Bald_Eagle',
 'Bantu',
 'Barley',
 'Bat',
 'Beetle',
 'Bermuda',
 'Black_Sea',
 'Bongo_%28antelope%29',
 'Brazil',
 'Bread',
 'British_Empire',
 'Bronze_Age',
 'Burundi',
 'C%C3%B4te_d%27Ivoire',
 'C._S._Lewis',
 'Calcium',
 'Cape_Verde',
 'Chicago',
 'China',
 'Christopher_Columbus',
 'City_status_in_the_United_Kingdom',
 'Climate',
 'Climate_change',
 'Cod',
 'Country',
 'Crocodile',
 'Cuba',
 'Culture',
 'Currency',
 'Cyclops',
 

The second community mostly contains Science and Technology articles.

In [78]:
c[1]

['19th_century',
 'Acceleration',
 'Acid_rain',
 'Age_of_Enlightenment',
 'Albert_Einstein',
 'Aluminium',
 'Anarchism',
 'Anatomy',
 'Ancient_Egypt',
 'Anglicanism',
 'Archbishop_of_Canterbury',
 'Argon',
 'Art',
 'Astrology',
 'Astronomy',
 'Atom',
 'Bible',
 'Biology',
 'Buckminster_Fuller',
 'California',
 'Carbon',
 'Carbon_dioxide',
 'Chemical_element',
 'Chemistry',
 'Chlorine',
 'Christianity',
 'Chromatic_aberration',
 'Chromium',
 'Cloud',
 'Color',
 'Comet',
 'Comet_Hale-Bopp',
 'Comet_Halley',
 'Communication',
 'Computer',
 'Computer_science',
 'Constitutional_monarchy',
 'David_Hume',
 'Day',
 'Drawing',
 'Dresden',
 'Drosophila_melanogaster',
 'Earth%27s_atmosphere',
 'Education_in_the_United_States',
 'Effects_of_global_warming',
 'Egypt',
 'Electrical_engineering',
 'Empiricism',
 'Energy',
 'English_peasants%27_revolt_of_1381',
 'Fertilizer',
 'Fine_art',
 'French_Revolution',
 'Fyodor_Dostoevsky',
 'Germany',
 'Glass',
 'God',
 'Government',
 'Gravitation',
 'Great_c

The third community consists mostly of articles related to birds.

In [79]:
c[2]

['Bird', 'British_Isles', 'Hawk', 'River_Severn', 'Sparrowhawk', 'Wader']