TTS Link Analysis Report
Ruaridh Thomson s0786036
Both PageRank and Hubs and Authorities (H&A) treat the graph as a network of nodes, with all unique nodes stored in a dictionary keyed by node name; this dictionary is the graph. Nodes correspond to unique email addresses, so email@example.com and firstname.lastname@example.org are treated as separate nodes. Each node is a Node object that stores the name (email) of the node, its destination links (dest_nodes, the names of the nodes this node points to) and its source links (source_links, the names of the nodes pointing to this one). Iterating over graph.txt and populating each node with its destination and source links is quick (~10 seconds) and is done before any link analysis, which avoids repeatedly searching for source or destination nodes during the calculations.
The main part of each algorithm (pagerank and hubs_auth) follows the conventions and implementation outlined in the slides (web-2x2.pdf) and should be readable. The number of iterations is defined at the top of the source.
Against the sanity checks, PageRank reached the correct scores in 10 iterations, whereas H&A took only 9. Although H&A matched the sanity checks in 9 iterations, calculating the hub scores before the authority scores gave very slightly different overall scores from calculating authorities before hubs. With enough iterations the difference is negligible.
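The structure described above can be sketched roughly as follows. This is a minimal illustration rather than the submitted source: it assumes the links are already available as (sender, recipient) pairs, collapses repeated links between the same pair, uses a damping value of 0.85 (a common default, not a value taken from the slides) and normalises the hub and authority scores by their Euclidean length. The hubs_first flag is a hypothetical parameter added to show the two update orders discussed above.

```python
import math


class Node:
    """One email address with its outgoing and incoming links."""

    def __init__(self, name):
        self.name = name
        self.dest_nodes = set()    # names of nodes this node points to
        self.source_links = set()  # names of nodes pointing to this one


def build_graph(edges):
    """Build {name: Node} from (sender, recipient) pairs in a single pass."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, Node(src)).dest_nodes.add(dst)
        graph.setdefault(dst, Node(dst)).source_links.add(src)
    return graph


def pagerank(graph, iterations=10, damping=0.85):
    """Basic PageRank; damping=0.85 is an assumed value.

    Rank held by nodes with no outgoing links is not redistributed.
    """
    n = len(graph)
    rank = {name: 1.0 / n for name in graph}
    for _ in range(iterations):
        # Each new score is built only from the previous iteration's scores.
        rank = {
            name: (1 - damping) / n + damping * sum(
                rank[s] / len(graph[s].dest_nodes) for s in node.source_links
            )
            for name, node in graph.items()
        }
    return rank


def _normalise(scores):
    """Scale scores to unit Euclidean length (assumed normalisation)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hubs_auth(graph, iterations=10, hubs_first=True):
    """Hubs and Authorities; the second update uses the first one's new scores."""
    hub = {name: 1.0 for name in graph}
    auth = {name: 1.0 for name in graph}

    def update_hub():
        return _normalise({k: sum(auth[d] for d in n.dest_nodes) for k, n in graph.items()})

    def update_auth():
        return _normalise({k: sum(hub[s] for s in n.source_links) for k, n in graph.items()})

    for _ in range(iterations):
        if hubs_first:
            hub = update_hub()
            auth = update_auth()
        else:
            auth = update_auth()
            hub = update_hub()
    return hub, auth


# Toy example with made-up node names.
graph = build_graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
ranks = pagerank(graph)
hub, auth = hubs_auth(graph, hubs_first=True)
```

Storing both link directions on each node means that PageRank and the authority update only read source_links, and the hub update only reads dest_nodes, so no pass over the full edge list is needed inside the iteration loop.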
Results from PageRank
Score       Email                           Role
0.007376    email@example.com               N/A
0.00735341  firstname.lastname@example.org  President (Enron Online)
0.00713286  email@example.com               xxx
0.00671554  firstname.lastname@example.org  Employee
0.00554463  email@example.com               N/A
0.00530575  firstname.lastname@example.org  Manager (Logistics Manager)
0.00459808  email@example.com               Employee (Chief Operating Officer)
0.00412046  firstname.lastname@example.org  CEO
0.00405156  email@example.com               CEO (Enron America)
0.00390492  firstname.lastname@example.org  Employee
Results from hubs
Score       Email                              Role
0.99928093  email@example.com                  Used for auto-generated emails (fake user)
0.03296957  firstname.lastname@example.org     xxx
0.01040851  email@example.com                  Not in roles.txt
0.00677409  l..firstname.lastname@example.org  Not in roles.txt
0.00582504  email@example.com                  Not in roles.txt
0.00475687  firstname.lastname@example.org     Not in roles.txt
0.00450159  email@example.com                  Not in roles.txt
0.00401388  firstname.lastname@example.org     Employee
0.00329035  email@example.com                  Not in roles.txt
0.00280233  firstname.lastname@example.org     Not in roles.txt
Results from authorities
Score       Email                           Role
0.38418728  email@example.com               Trader
0.38417654  firstname.lastname@example.org  Employee (Specialist)
0.38384904  email@example.com               Not in roles.txt
0.38376442  firstname.lastname@example.org  Employee (Analyst)
0.3555813   email@example.com               Trader
0.277949    firstname.lastname@example.org  xxx
0.21582593  email@example.com               Not in roles.txt
0.21574703  firstname.lastname@example.org  Not in roles.txt
0.1721554   email@example.com               Not in roles.txt
0.14322686  firstname.lastname@example.org  Not in roles.txt
PageRank identifies significantly more people who are present in roles.txt. Hubs and Authorities identifies what we would expect from each score: people with a lot of outgoing emails (hubs) and people with a lot of incoming emails (authorities), though many of these addresses cannot be identified in roles.txt. The company's automatic emailer is a likely candidate for the top hub, since it effectively spams everyone with company information (without actually being spam). It is unclear whether the description 'Employee' is any more useful than 'N/A', 'xxx' or the person not appearing in roles.txt at all. We assume that a person who is not present in roles.txt is still an employee; the alternative reading is that such addresses may be fake or may not belong to people employed by Enron.
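One way to make this comparison concrete is to count how many of the top ten addresses for each score carry an informative role. The sketch below is illustrative only: it assumes the scores and the contents of roles.txt have already been loaded into plain dictionaries (the file format is not assumed), and the function names and the set of uninformative labels are hypothetical choices based on the labels used in the tables above.

```python
# Labels treated as carrying no real role information (an assumption).
UNINFORMATIVE = {"N/A", "xxx", "Not in roles.txt"}


def top_k_with_roles(scores, roles, k=10):
    """Return (score, name, role) rows for the k highest-scoring names."""
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    return [(score, name, roles.get(name, "Not in roles.txt")) for name, score in top]


def informative_fraction(rows):
    """Fraction of rows whose role is not one of the uninformative labels."""
    if not rows:
        return 0.0
    known = [row for row in rows if row[2] not in UNINFORMATIVE]
    return len(known) / len(rows)


# Toy example with made-up names, scores and roles.
scores = {"alice": 0.4, "bob": 0.3, "carol": 0.2, "dave": 0.1}
roles = {"alice": "CEO", "bob": "xxx", "carol": "Employee"}
rows = top_k_with_roles(scores, roles, k=3)
fraction = informative_fraction(rows)
```

Whether 'Employee' should count as informative is exactly the open question raised above, so the set of uninformative labels is kept separate and easy to change.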
Visualising Key Connections
It is possible to add labels to the connections we are visualising, though as far as I can tell these labels would be the subject of the email - which one would get by reading enron.xml.
The following were chosen purely because they exist in roles.txt, so we can observe how information is exchanged between people we know. This gives us a variety of employees, all of whom rank highly in PageRank, Hubs or Authorities. We are also able to observe information about the two unknown employees (bill.williams and gerald.nemec).
10-20 useful employees
email@example.com               Employee
firstname.lastname@example.org  N/A
email@example.com               Manager (Logistics Manager)
firstname.lastname@example.org  Employee (Chief Operating Officer)
email@example.com               CEO
firstname.lastname@example.org  CEO (alias of above?)
email@example.com               CEO (Enron America)
firstname.lastname@example.org  Employee
email@example.com               Employee
firstname.lastname@example.org  Used for auto-generated emails (fake user)
email@example.com               xxx
firstname.lastname@example.org  Employee (Analyst)
email@example.com               Trader
firstname.lastname@example.org  Trader
email@example.com               Employee (Specialist)
Here we have 15 employees, some of whom are obviously important to the company (CEO, Chief Operating Officer, etc.) and some of unknown importance. In the case of firstname.lastname@example.org, bill has sent a lot of emails and has been recognised as an authoritative source of information, though it is uncertain who he is. We will visualise the exchange of information between these individuals, using the edge label to indicate the number of emails.
Some curious observations: pete.davis (the email robot) sends 3615 emails to itself, though this may be to keep a record of all unique emails sent. There is a clear divide in who sends emails to whom, with one exception, bill.williams, though we do not know who he is. We can see that Kenneth Lay uses email@example.com for something (it appears in the PageRank results), but only seems to communicate with these people using firstname.lastname@example.org.
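The edge labels described above can be produced by counting emails per (sender, recipient) pair among the selected addresses. The following is a sketch under stated assumptions, not the code used for the figure: it assumes one (sender, recipient) pair per email is available, and it writes Graphviz DOT text, which is an assumed output format since the report does not name the drawing tool. The function names are hypothetical.

```python
from collections import Counter


def edge_counts(edges, selected):
    """Count emails per (sender, recipient) pair, keeping only selected people."""
    return Counter(
        (src, dst) for src, dst in edges if src in selected and dst in selected
    )


def to_dot(counts):
    """Render the counts as a DOT digraph with the email count as edge label."""
    lines = ["digraph emails {"]
    for (src, dst), n in sorted(counts.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{n}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy example with made-up addresses; a self-loop is counted like any other edge.
edges = [("a@x", "b@x"), ("a@x", "b@x"), ("b@x", "a@x"), ("a@x", "a@x"), ("a@x", "c@x")]
dot_text = to_dot(edge_counts(edges, selected={"a@x", "b@x"}))
```

Counting self-loops in the same way as other edges is what would surface a case like the pete.davis emails to itself.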
Implications of the algorithms
This really comes down to what each algorithm achieves. By visualising the results we can easily identify how communication is structured within the company. bill.williams is clearly an important figure, almost central, when observed not only as a hub but also as an informed figure. However, the only contact he has had from either of the CEOs is a single email from john.lavorato: while he is well connected, he is not so well connected with the heads of the company. The left-hand side of the graph contains those found by PageRank and the right-hand side those found via Hubs and Authorities (H&A), suggesting that the more influential figures in the company are found by PageRank rather than H&A. H&A appears to find a network of people who exchange a lot of information, rather than reflecting the importance of that information. pete.davis may be important here: the more email someone receives from what is effectively the hub address of the company, the more likely they are to be found to be an authority.