# INFO 4271 - Exercise 6 - Link Analysis

Issued: May 28, 2024

Due: June 3rd, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Co-Linking Similarity 
The directed graph of resource pointers (e.g., hyperlinks on the Internet, or citations in academic publishing) implicitly encodes topic information but can be much cheaper to process than the content words of the individual documents.

a) Implement a document similarity measure based only on graph topology, assuming that documents are similar if they link to similar documents.
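
One common way to turn this idea into a number is the Jaccard coefficient of the two outgoing-link sets, which is also what the solution below computes. The notation Out(x), for the set of documents that x links to, is introduced here only for illustration:

```latex
\[
\mathrm{sim}_{\mathrm{out}}(x, y) =
\begin{cases}
\dfrac{|\mathrm{Out}(x) \cap \mathrm{Out}(y)|}{|\mathrm{Out}(x) \cup \mathrm{Out}(y)|} & \text{if } \mathrm{Out}(x) \cup \mathrm{Out}(y) \neq \emptyset \\[2ex]
0 & \text{otherwise}
\end{cases}
\]
```

For instance, in the example graph D1 links to {D14, D16} and D14 links to {D2, D14}. The two sets share one document out of three distinct ones, so the score is 1/3 ≈ 0.333.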

In [5]:
#An example graph topology. Each entry represents a document 
# alongside the outgoing links found in its content. 
graph = {'D1' : ['D14', 'D16'],
		 'D2' : ['D5', 'D6', 'D7'],
		 'D3' : ['D4', 'D14', 'D15', 'D18', 'D19'],
		 'D4' : ['D2', 'D9', 'D14'],
		 'D5' : ['D2', 'D8', 'D17'],
		 'D6' : ['D3', 'D8', 'D12', 'D15'],
		 'D7' : ['D3', 'D19'],
		 'D8' : ['D1', 'D2', 'D3', 'D5', 'D9', 'D10', 'D11', 'D13', 'D14', 'D15', 'D17', 'D19'],
		 'D9' : [],
		 'D10' : ['D1', 'D14', 'D19'],
		 'D11' : ['D6'],
		 'D12' : ['D9', 'D11', 'D13', 'D16', 'D18'],
		 'D13' : ['D2', 'D4', 'D18'],
		 'D14' : ['D2', 'D14'],
		 'D15' : ['D7'],
		 'D16' : ['D2', 'D10', 'D16'],
		 'D17' : ['D1', 'D4', 'D6', 'D7', 'D11', 'D12'],
		 'D18' : ['D2', 'D9', 'D14'],
		 'D19' : [],
		 'D20' : ['D12']
		}

# Measure the similarity between two documents x and y as the Jaccard coefficient of their outgoing-link sets.
def sim_out(x, y, graph):
    
	out_x = set(graph[x])
	out_y = set(graph[y])
	
	union = len(out_x.union(out_y))
	if union == 0:
		return 0.0

	intersection = len(out_x.intersection(out_y))
	
	return round(intersection/union, 3)
    

# Print a document similarity matrix
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		l += str(sim_out(doc, d, graph))+'\t'
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.0	0.0	0.167	0.25	0.0	0.0	0.0	0.077	0.0	0.25	0.0	0.167	0.0	0.333	0.0	0.25	0.0	0.25	0.0	0.0	
D2	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.071	0.0	0.0	0.333	0.0	0.0	0.0	0.333	0.0	0.286	0.0	0.0	0.0	
D3	0.167	0.0	1.0	0.143	0.0	0.125	0.167	0.214	0.0	0.333	0.0	0.111	0.333	0.167	0.0	0.0	0.1	0.143	0.0	0.0	
D4	0.25	0.0	0.143	1.0	0.2	0.0	0.0	0.25	0.0	0.2	0.0	0.143	0.2	0.667	0.0	0.2	0.0	1.0	0.0	0.0	
D5	0.0	0.0	0.0	0.2	1.0	0.167	0.0	0.154	0.0	0.0	0.0	0.0	0.2	0.25	0.0	0.2	0.0	0.2	0.0	0.0	
D6	0.0	0.0	0.125	0.0	0.167	1.0	0.2	0.143	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.111	0.0	0.0	0.25	
D7	0.0	0.0	0.167	0.0	0.0	0.2	1.0	0.167	0.0	0.25	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D8	0.077	0.071	0.214	0.25	0.154	0.143	0.167	1.0	0.0	0.25	0.0	0.214	0.071	0.167	0.0	0.154	0.125	0.25	0.0	0.0	
D9	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D10	0.25	0.0	0.333	0.2	0.0	0.0	0.25	0.25	0.0	1.0	0.0	0.0	0.0	0.25	0.0	0.0	0.125

b) Now let us modify the above scheme to also use the documents' incoming links in the calculation of the similarity score.
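
One possible way to combine both directions, and the one the solution below follows, is to pool the intersection and union counts of the outgoing-link and incoming-link sets into a single ratio. The notation In(x), for the set of documents linking to x, and Out(x), for the set of documents x links to, is assumed here for illustration:

```latex
\[
\mathrm{sim}_{\mathrm{inout}}(x, y) =
\frac{|\mathrm{Out}(x) \cap \mathrm{Out}(y)| + |\mathrm{In}(x) \cap \mathrm{In}(y)|}
     {|\mathrm{Out}(x) \cup \mathrm{Out}(y)| + |\mathrm{In}(x) \cup \mathrm{In}(y)|}
\]
```

The score is defined as 0 when the denominator is 0. Note that this pools the counts rather than averaging two separate Jaccard scores, so the direction with the larger union carries more weight. For example, D1 has no outgoing links in common with D19, but the two share 2 of 5 distinct incoming links (D8 and D10), which gives (0 + 2) / (2 + 5) ≈ 0.286.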

In [8]:
from collections import defaultdict

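def incoming(graph) -> dict:
    """Invert the graph: map each document to the list of documents that link to it."""
    in_graph = defaultdict(list)

    # Every outgoing link in_doc -> out_doc is an incoming link of out_doc.
    for in_doc, out_docs in graph.items():
        for out_doc in out_docs:
            in_graph[out_doc].append(in_doc)

    return in_graph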
            

# Measure the similarity between two documents x and y in a graph based on their incoming and outgoing links.
# The intersection and union sizes of both link directions are pooled into one ratio.
def sim_inout(x, y, graph):

	in_graph = incoming(graph=graph)

	out_x, out_y = set(graph[x]), set(graph[y])
	in_x, in_y = set(in_graph[x]), set(in_graph[y])

	out_union = len(out_x.union(out_y))
	in_union = len(in_x.union(in_y))
	union = out_union + in_union
	
	if union == 0:
		return 0.0

	out_intersection = len(out_x.intersection(out_y))
	in_intersection = len(in_x.intersection(in_y))
	intersection = out_intersection + in_intersection

	return round(intersection / union, 3)

# Print a document similarity matrix
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		l += str(sim_inout(doc, d, graph))+'\t'
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.0	0.071	0.182	0.222	0.111	0.091	0.111	0.056	0.125	0.25	0.286	0.182	0.111	0.273	0.125	0.1	0.083	0.1	0.286	0.0	
D2	0.071	1.0	0.059	0.067	0.071	0.0	0.0	0.091	0.273	0.154	0.167	0.0	0.071	0.267	0.167	0.067	0.286	0.067	0.077	0.0	
D3	0.182	0.059	1.0	0.077	0.083	0.071	0.083	0.222	0.091	0.3	0.091	0.143	0.3	0.133	0.2	0.0	0.143	0.077	0.2	0.0	
D4	0.222	0.067	0.077	1.0	0.1	0.083	0.1	0.176	0.0	0.1	0.111	0.167	0.1	0.25	0.111	0.091	0.0	0.714	0.111	0.0	
D5	0.111	0.071	0.083	0.1	1.0	0.2	0.111	0.118	0.125	0.111	0.125	0.0	0.25	0.167	0.125	0.1	0.083	0.1	0.125	0.0	
D6	0.091	0.0	0.071	0.083	0.2	1.0	0.333	0.105	0.0	0.0	0.1	0.071	0.0	0.0	0.0	0.0	0.071	0.0	0.0	0.143	
D7	0.111	0.0	0.083	0.1	0.111	0.333	1.0	0.118	0.0	0.111	0.125	0.083	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D8	0.056	0.091	0.222	0.176	0.118	0.105	0.118	1.0	0.0	0.188	0.0	0.222	0.056	0.095	0.059	0.111	0.158	0.176	0.0	0.0	
D9	0.125	0.273	0.091	0.0	0.125	0.0	0.0	0.0	1.0	0.125	0.333

c) Discuss the differences between these two similarity score variants. What are the salient advantages and disadvantages they offer?

Advantages of out only:
- It is computationally cheaper, as it only considers the outgoing links, which are already stored with each document.
- It can provide a quick measure of similarity based on the graph structure (topology).

Disadvantages:
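- It may not capture the full context of document similarity, as it ignores the incoming links. An example would be an "authority" page that has very few outgoing links but many incoming links from different topics. In the extreme case, documents without any outgoing links (D9 and D19 above) score 0.0 against every document, including themselves.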

Advantages of in-out:
- It captures the relationships between documents in both directions, which can give a more informative similarity score, for example in the case described above.

Disadvantages:
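- It requires more computation and memory: incoming links are not stored with a document, so the whole graph has to be inverted first, and two set comparisons are needed per document pair instead of one.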
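
The trade-off above can be made concrete with a small sketch. It uses a hypothetical four-document toy graph (not the exercise graph) in which `A` and `B` have no outgoing links but are linked from the same documents. The helper names `build_incoming`, `sim_out_only` and `sim_in_and_out` are assumptions made for this illustration. The sketch also builds the incoming-link index once and passes it in, which is one way to keep the extra cost of the in/out variant down.

```python
from collections import defaultdict

# Hypothetical toy graph: A and B are sinks (no outgoing links)
# that are both linked from the same two documents.
toy_graph = {
    'A': [],
    'B': [],
    'C': ['A', 'B'],
    'D': ['A', 'B'],
}

def build_incoming(graph):
    """Invert the adjacency lists once: target -> set of linking documents."""
    in_graph = defaultdict(set)
    for src, targets in graph.items():
        for tgt in targets:
            in_graph[tgt].add(src)
    return in_graph

def overlap(a, b):
    """Return (intersection size, union size) of two sets."""
    return len(a & b), len(a | b)

def sim_out_only(x, y, graph):
    """Jaccard coefficient of the outgoing-link sets (0.0 if both are empty)."""
    inter, union = overlap(set(graph[x]), set(graph[y]))
    return inter / union if union else 0.0

def sim_in_and_out(x, y, graph, in_graph):
    """Pooled intersection/union ratio over outgoing and incoming links."""
    out_i, out_u = overlap(set(graph[x]), set(graph[y]))
    in_i, in_u = overlap(in_graph.get(x, set()), in_graph.get(y, set()))
    total = out_u + in_u
    return (out_i + in_i) / total if total else 0.0

in_graph = build_incoming(toy_graph)  # built once, reused for every pair

print(sim_out_only('A', 'B', toy_graph))              # 0.0
print(sim_in_and_out('A', 'B', toy_graph, in_graph))  # 1.0
```

The out-only score cannot relate the two sinks at all, while the in/out score treats them as identical because they share all of their incoming links.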

# 2. PageRank

The PageRank algorithm models page authoritativeness. Is it robust to tampering? Can you think of ways to game the PageRank scheme and give your website an artificially high score? What are ways to defend against such attacks?

Possible exploits:

- Buying backlinks: You could pay other high-ranking webpages to link to your site and boost your own PageRank. There is no direct defense against these schemes, but a page that links to other websites for money tends to lose quality and reputation itself, which indirectly limits how much rank it can pass on.
  
- Link farms: Groups of websites that all hyperlink to every other site in the group, which artificially inflates the PageRank score of each member. Such unusually dense clusters can often be detected over time, for example with clustering methods applied during crawling or alongside the PageRank computation, although detection is not guaranteed.

- Hidden links: Include invisible hyperlinks that only crawlers see but users do not. Search engines could use automated detection algorithms that, for example, flag links rendered in tiny or invisible text or other anomalies.
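
To illustrate the link-farm exploit, the following is a minimal sketch of PageRank by power iteration. It assumes the common textbook formulation with a damping factor of 0.85 and uniform redistribution of rank from pages without outgoing links. The page names and the two toy webs are hypothetical and only serve to show the effect.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Plain power iteration; rank of dangling pages is spread uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared by all pages.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every page passes an equal share of its rank along each outgoing link.
        for node in nodes:
            targets = graph[node]
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

# Hypothetical small web: nobody links to 'target'.
honest_web = {
    'home': ['news', 'shop'],
    'news': ['home'],
    'shop': ['home'],
    'target': ['home'],
}

# The same web plus a link farm: five pages that link to each other
# and to 'target', which links back to them.
farm = [f'farm{i}' for i in range(5)]
farmed_web = dict(honest_web)
farmed_web['target'] = ['home'] + farm
for page in farm:
    farmed_web[page] = ['target'] + [p for p in farm if p != page]

before = pagerank(honest_web)
after = pagerank(farmed_web)
print(f"target without farm: {before['target']:.4f}")
print(f"target with farm:    {after['target']:.4f}")
```

In this toy setting the score of `target` rises once the farm is added, even though no page from the original web links to it. The farm has no incoming links from the rest of the web, which is the kind of structural signal a detection method could look for.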