# Creating Network Graphs for the WALK Project

The purpose of this notebook is to create visually attractive network graphs from the WALK partner Web Archives for the public to view and manipulate online. We would like to promote the use of web archives, but unfortunately, the datasets for these files are very large. Without some visual cues, it would be difficult to know what is going on inside them.

Networks are one way among many to show what websites are mentioned in the web archives and how they link to each other. Hopefully, most of the technical details are covered for you.  I wanted you to be able to see the technical side of things, while at the same time being able to modify it if you wish.

Because we are using [git](https://www.github.com), you should be able to make as many changes as you want to this notebook without hurting the main repository.  So, feel free to play around as much as you want.  

The code being used for this repository is R. R is optimised for conducting statistical analysis. Inside R are a number of libraries that can be installed.  A library is more code that has been built in such a way to be used by other programmers.  To access a library, we "import" the library.  In our case, the main libraries are iGraph, which is a tool for analyzing network graphs, and visNetwork which is a library for making attractive interactive network graphs.

We install the libraries below.  In R code, putting a `#` in front of code means it is a comment.  Coders use comments for two main reasons.  1. To describe what is happening in code for other coders to read it.  2. To "comment out" code that is not being used currently, but might be used later.  When I want to use the code that is commented out, I will just simply remove the `#` and it will run as stated.

In [1]:
# Install the libraries
# suppressMessages is just a tool that keeps R from outputting information that we don't need to see right now.
suppressMessages(library(igraph))
suppressMessages(library(visNetwork))

# Some other libraries we may use later

#suppressMessages(library(intergraph))
#suppressMessages(library(network))
#suppressMessages(library(ergm))


processFile is something we call a function.  It is a process that we might call later on to simplify out code.  There's not too much to worry about here, except to know that this takes the path of a file, grabs some data and returns a "dataframe" which is kind of like a spreadsheet for us to manipulate.

In [2]:

# Helper function for getting warcbase data into igraph
processFile = function(filepath) {
  con = file(filepath, "r")
  line = readLines(con)
  line <- as.data.frame(do.call(rbind, (strsplit(gsub('\\(|\\)', '', line), ','))), stringsAsFactors=FALSE)
  names(line) = c("date", "from", "to", "weight")
  return (line)
  close(con)
}


This next part takes the data file from the folder already on your vpn and gives it an easy name (variable) for futre use.  If you put a different title next to word `file` below, the data will be given the name `graph_file` and whatever you put in file2 will be put into `graph_file2`.  That means if you were to create a note book and write `print (graph_file2)` it would show you a summary of the data as an igraph `object`.  

In [7]:

# Grab the data files
filepath = '/data/links/'


# Change the following file to the 
file = 'ALBERTA_edmonton_public_library'
#       ^^^^^^^^^^^^^^^^^^^^^^^^^^  <- Change to the file name you want. Leave out "-links.txt"

suffix = '-links.txt'
file2 = 'DALHOUSIE_Halifax_Regional_Municipality_Documents'
#       ^^^^^^^^^^^^^^^^^^^^^^^^^^  <- Change to the file name you want. Leave out "-links.txt"


savesuffix = '.graphml'
savepath = '~/data/graphml/'
htmlsavepath = '~/data/html/'
htmlsavesuffix = '.html'

filename = paste0(filepath, file, suffix)
filename2 = paste0(filepath, file2, suffix)
savename = paste0(savepath, file, savesuffix)
savename2 = paste0(savepath, file2, savesuffix)
savehtml = paste0(htmlsavepath, file, htmlsavesuffix)
savehtml2 = paste0(htmlsavepath, file2, htmlsavesuffix)

graph_file <- graph.data.frame(unique(processFile(filename)[c("from", "to")]), directed=TRUE)
graph_file <- simplify(graph_file)

graph_file2 <- graph.data.frame(unique(processFile(filename2)[c("from", "to")]), directed=TRUE)
graph_file2 <- simplify(graph_file2)

The section below does some mathematical work that can be useful in sizing and shaping the network.  For instance, it is common to make the bubbles in a network bigger when they have more links.  It is also common to use different algoritms to detect groupings of bubbles based on common relationships and show them as the same color.

A `vector` in r, is really just a list of data.  It is created using `c("here", "is", "a", "vector")`.

You can switch between graph_file & graph_file2 simply by changing what happens after `graphfile <-` to the right igraph object name.

I also asked R to output some descriptive data for the network here.  For example, "average in-degree" refers to the average number of links to each website in the archive.

Later on there will be a little workshop explaining some of this to you.  But for now, we just want to create a reasonably attractive network.

In [5]:
# Change this to whatever graph_file you want.
# graph_file <- graph_file
#             ^^^^^^^^^^^  <- Change to graph_file or graph_file2 as required.

#Create vectors for strongly connected components, degree etc.
# A "weakly" connected component is one where websites are connected in a single unit.
# A "strongly" connected component happens when you are able to follow a path along the links and hit every bubble.
# For our purposes, we only need to know that 
# whole network > weakly connected component > strongly connected component.
# We will want to use "weak" "strong" or "full" to modify the size of the network for visualization.
V(graph_file)$s_component <- clusters(graph_file, "strong")$membership
V(graph_file)$w_component <- clusters(graph_file, "weak")$membership

V(graph_file2)$s_component <- clusters(graph_file2, "strong")$membership
V(graph_file2)$w_component <- clusters(graph_file2, "weak")$membership
 
dist_component_id <- which(clusters(graph_file, "strong")$csize == max(clusters(graph_file, "strong")$csize))
weak_component_id <- which(clusters(graph_file, "weak")$csize == max(clusters(graph_file, "weak")$csize))

dist_component_id2 <- which(clusters(graph_file2, "strong")$csize == max(clusters(graph_file2, "strong")$csize))
weak_component_id2 <- which(clusters(graph_file2, "weak")$csize == max(clusters(graph_file2, "weak")$csize))

strongly_connected_graph <- induced_subgraph(graph_file, which(V(graph_file)$s_component == dist_component_id))
weakly_connected_graph <- induced_subgraph(graph_file, which(V(graph_file)$w_component == weak_component_id))

strongly_connected_graph2 <- induced_subgraph(graph_file2, which(V(graph_file2)$s_component == dist_component_id2))
weakly_connected_graph2 <- induced_subgraph(graph_file2, which(V(graph_file2)$w_component == weak_component_id2))

#  It may not make sense to plot the full graph if it is too large.
#  In this case, we should plot a large component.  In extreme cases,
#  We can plot just the strongly connected component.

full <- graph_file
weak <- weakly_connected_graph
strong <- strongly_connected_graph

full2 <- graph_file2
weak2 <- weakly_connected_graph2
strong2 <- strongly_connected_graph2


#  Change graph_plot_name to whichever of the above you'd like to plot.

graph_plot_name <- full
#                  ^^^^^^  can be weak, strong or full.

graph_plot_name2 <- full2

V(graph_plot_name)$rlabel <- NA
V(graph_plot_name)$relevant <- FALSE
V(graph_plot_name)$wtc <- NA

V(graph_plot_name2)$rlabel <- NA
V(graph_plot_name2)$relevant <- FALSE
V(graph_plot_name2)$wtc <- NA

V(graph_plot_name)$indegree <- degree(graph_plot_name, mode="in", loops=FALSE)
V(graph_plot_name)$outdegree <- degree(graph_plot_name, mode="out", loops=FALSE)
V(graph_plot_name)$degree <- degree(graph_plot_name, loops=FALSE)
avg_indegree <- mean(V(graph_plot_name)$indegree)
avg_outdegree <- mean(V(graph_plot_name)$outdegree)
avg_degree <- mean(V(graph_plot_name)$degree)

V(graph_plot_name2)$indegree <- degree(graph_plot_name2, mode="in", loops=FALSE)
V(graph_plot_name2)$outdegree <- degree(graph_plot_name2, mode="out", loops=FALSE)
V(graph_plot_name2)$degree <- degree(graph_plot_name2, loops=FALSE)
avg_indegree2 <- mean(V(graph_plot_name2)$indegree)
avg_outdegree2 <- mean(V(graph_plot_name2)$outdegree)
avg_degree2 <- mean(V(graph_plot_name2)$degree)

print (paste("Average In-degree:", avg_indegree))
print (paste("Average Out-degree:", avg_outdegree))
print (paste("Average Degree:", avg_degree))

print (paste("Average In-degree:", avg_indegree2))
print (paste("Average Out-degree:", avg_outdegree2))
print (paste("Average Degree:", avg_degree2))

# This piece of code can be used to determine the number of nodes that have their names on them.
V(graph_plot_name)$relevant <- V(graph_plot_name)$degree > avg_degree + 5
#                                                          ^^^^^^^^^^^^^  You could change this to +1, -1 etc.

V(graph_plot_name2)$relevant <- V(graph_plot_name2)$degree > avg_degree2 + 5

V(graph_plot_name)$rlabel[which(V(graph_plot_name)$relevant == TRUE)] <- V(graph_plot_name)$name[which(V(graph_plot_name)$relevant == TRUE)]
V(graph_plot_name)$wtc <- membership(cluster_edge_betweenness(graph_plot_name))

V(graph_plot_name2)$rlabel[which(V(graph_plot_name2)$relevant == TRUE)] <- V(graph_plot_name2)$name[which(V(graph_plot_name2)$relevant == TRUE)]
V(graph_plot_name2)$wtc <- membership(cluster_edge_betweenness(graph_plot_name2))

# With this, you will see the number and names of websites that will be labelled in the network graph.
#print(V(graph_plot_name)$rlabel[which(V(graph_plot_name)$relevant == TRUE)])


[1] "Average In-degree: 1.15150736003085"
[1] "Average Out-degree: 1.15150736003085"
[1] "Average Degree: 2.30301472006171"
[1] "Average In-degree: 1.4903156384505"
[1] "Average Out-degree: 1.4903156384505"
[1] "Average Degree: 2.980631276901"


In [None]:
# Running this code will give you the details for plotting
# in Igraph
#?visIgraph


In [7]:
mems <- V(graph_plot_name)$wtc
colors <- rainbow(max(mems))

mems2 <- V(graph_plot_name2)$wtc
colors2 <- rainbow(max(mems2))


V(graph_plot_name)$frame.color <- "black"
V(graph_plot_name)$size <- log(V(graph_plot_name)$degree)*5
V(graph_plot_name)$label <- V(graph_plot_name)$rlabel
V(graph_plot_name)$color=c("black", "red", "orange", "blue", "green", "orange", "yellow", "cadetblue1", "deeppink")[as.numeric(match.arg(as.character(0:6),as.character(mems), several.ok=TRUE))]
V(graph_plot_name)$label.cex=log(V(graph_plot_name)$degree/10)
V(graph_plot_name)$shadow = FALSE
E(graph_plot_name)$edge.curved=0.5
E(graph_plot_name)$arrow.size=1
E(graph_plot_name)$width=0.1

V(graph_plot_name2)$frame.color <- "black"
V(graph_plot_name2)$size <- log(V(graph_plot_name2)$degree)*5
V(graph_plot_name2)$label <- V(graph_plot_name2)$rlabel
V(graph_plot_name2)$color=c("black", "red", "orange", "blue", "green", "orange", "yellow", "cadetblue1", "deeppink")[as.numeric(match.arg(as.character(0:6),as.character(mems), several.ok=TRUE))]
V(graph_plot_name2)$label.cex=log(V(graph_plot_name2)$degree/10)
V(graph_plot_name2)$shadow = FALSE
E(graph_plot_name2)$edge.curved=0.5
E(graph_plot_name2)$arrow.size=1
E(graph_plot_name2)$width=0.1

# R plots have a variety of layouts that you can choose from.
# cut and paste any of these into the "layout =" section to change the layout.
# "layout_nicely" is a good generic layout in most cases.

#  layout_nicely
#  layout_with_sugiyama
#  layout_as_tree
#  layout_in_circle
#  layout_on_grid
#  layout_on_sphere
#  layout_with_lgl
#  layout_with_fr
#  layout_with_kk

layout = "layout_nicely"
#graph_plot_vis <- toVisNetworkData(graph_plot_name, idToLabel=TRUE)

#visIgraph(graph_plot_name, layout=layout) %>%
#      visEdges(shadow=FALSE)




“number of items to replace is not a multiple of replacement length”

In [8]:
outputIgraph <- visIgraph(graph_plot_name, layout=layout, idToLabel=FALSE, type="full") %>%
      visEdges(shadow=FALSE) %>%
      visConfigure(enabled = TRUE)

outputIgraph2 <- visIgraph(graph_plot_name2, layout=layout, idToLabel=FALSE, type="full") %>%
      visEdges(shadow=FALSE) %>%
      visConfigure(enabled = TRUE)

In [8]:
# Write the graphml
write.graph (graph_file, savename, format='graphml')
write.graph (graph_file2, savename2, format='graphml')



In [9]:
visSave(outputIgraph, file = savehtml)
visSave(outputIgraph2, file = savehtml2)