# DBLP Co-authorship Data Analysis
## Submitted By : Sharanya Dasgupta
## Roll No.: CS2320



This is the Colab version of the notebook.

In this assignment, you are given an undirected graph (<a href="https://snap.stanford.edu/data/com-DBLP.html">DBLP co-authorship</a>) in the edge list format. Each author is represented by an anonymous integer ID. Each line of the file contains the IDs of two authors who have co-authored at least one paper together, separated by a TAB.

The original file has some extra lines at the beginning, but those have been removed from the file given to you.

If you are using your own system, download the data here: <a href="https://drive.google.com/file/d/1B-cimWJmdEJio07kBBBH1RpyAhhG-euQ/view?usp=sharing">Download the data</a>.

If you are running it on Colab, the data can be imported by the cells given below; there is no need to download a local copy.

You may assume that each edge is represented in the file only once, but the order of the two nodes is not known. In other words, if <tt>(a,b)</tt> is an edge in the graph, then the file contains either a line '<tt>a TAB b</tt>' or '<tt>b TAB a</tt>', but not both, and there are no duplicate lines.
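As a small illustration of the file format described above, the sketch below parses one line into a pair of integer IDs and puts the smaller ID first, so that '<tt>a TAB b</tt>' and '<tt>b TAB a</tt>' map to the same pair. The helper name <tt>parse_edge</tt> and the sample lines are assumptions made for this example; they are not part of the assignment.

```python
def parse_edge(line):
    """Split one 'a<TAB>b' line into a canonical (smaller, larger) pair of integer IDs.

    parse_edge is a hypothetical helper used only for illustration.
    """
    a, b = line.strip().split("\t")
    a, b = int(a), int(b)
    return (a, b) if a <= b else (b, a)


# Two made-up lines in the same format as the input file
sample_lines = ["46395\t86575", "86575\t21642"]
sample_edges = [parse_edge(line) for line in sample_lines]
```

Because the file never lists an edge twice, this normalisation is not required for the tasks below; it only makes the "order is not known" remark concrete.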

For each of the tasks below, you are instructed to implement a function. Use appropriate map and reduce operations (or other Spark / Python functions) as required to implement them. If you need to define any other function (e.g., a separate function for map or reduce), feel free to define it in the same cell above the desired function.

### Getting PySpark to work with Colab

Execute the cells below to make PySpark work with Colab.

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=ec3239f0588cf01bd7aaa3600f08e47baa13a561ff8138ee49954efd5dab7e90
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
The following additional packages will be installed:
  libxtst6 openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra fonts-nanum fonts-ipafont-gothic
  fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic

### Getting Google Drive to work with Colab

In [2]:
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

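# Google Drive file ID of the data set (named file_id to avoid shadowing the built-in id)
file_id = '1B-cimWJmdEJio07kBBBH1RpyAhhG-euQ'
filename = 'com-dblp.ungraph.txt'

downloaded = drive.CreateFile({'id': file_id})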
downloaded.GetContentFile(filename)

In [3]:
# We start by getting the spark context
from pyspark import SparkContext, SparkConf
sc = SparkContext.getOrCreate()

## Create an RDD from the input file

In [4]:
# Use the appropriate file path for file downloaded in your system
# edgeList = sc.textFile("/Users/debapriyo/spark-playground/com-dblp.ungraph.txt")

# Use this line for running on Colab
edgeList = sc.textFile(filename)

# See some sample lines
edgeList.takeSample(False,10)

['46395\t86575',
 '41654\t107818',
 '231445\t322853',
 '76909\t83620',
 '246430\t380199',
 '21642\t24286',
 '111003\t131731',
 '32951\t234250',
 '407896\t407897',
 '133771\t345198']

If you count the number of edges, you should get 1049866.

In [5]:
# Use the built-in Spark function to count the number of lines (elements in the RDD) (ungraded)
edgeList.count()

1049866

## Task 1: Compute the number of co-authors (degree) for each author (3 marks)

Find the degree of each node. Be mindful of the fact that the graph is undirected.
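To see why the undirected nature matters, here is a minimal plain-Python sketch on a made-up four-edge graph (the edges in <tt>toy_edges</tt> are assumptions for illustration, not taken from the data). Each line contributes 1 to the degree of both endpoints, which is the same idea a flatMap followed by a per-key sum expresses in Spark.

```python
from collections import Counter

# Made-up edge list in the same 'a<TAB>b' format as the input file
toy_edges = ["1\t2", "1\t3", "2\t3", "3\t4"]

degree_counts = Counter()
for line in toy_edges:
    u, v = line.split("\t")
    # An undirected edge counts once for each of its two endpoints
    degree_counts[u] += 1
    degree_counts[v] += 1

# Sanity check: every edge is counted twice, so the degrees sum to 2 * (number of edges)
total_degree = sum(degree_counts.values())
```

The same check applies to the full data: if the edge count is 1049866, the degrees should sum to twice that number.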

In [6]:
def degree(edgeList):
  """
      Input: an RDD in the edgelist format as above
      Returns: an RDD with each element being a pair (node_ID, degree)
  """
  # Flattened list of author IDs: every edge contributes both of its endpoints
  author_ids = edgeList.flatMap(lambda line: line.split("\t"))

  # Each occurrence of an author ID contributes 1 towards its total degree (as the graph is undirected)
  ones = author_ids.map(lambda a: (a, 1))

  # Sum the contributions of each author ID with reduceByKey
  author_degrees = ones.reduceByKey(lambda m, n: m + n)

  return author_degrees

In [7]:
degrees = degree(edgeList)
degrees.takeSample(False,10)

[('299341', 1),
 ('188121', 4),
 ('201535', 3),
 ('385766', 2),
 ('286551', 2),
 ('327932', 2),
 ('283069', 4),
 ('238237', 2),
 ('364624', 3),
 ('262966', 1)]

A sample of the RDD returned by the <tt>degree</tt> function should have the following format:

````{verbatim}
[('86281', 15),
 ('364476', 8),
 ('277755', 2),
 ('372715', 2),
 ('60523', 3),
 ('370437', 3),
 ('191955', 11),
 ('286360', 113),
 ('264995', 10),
 ('308008', 2)]
````

## Task 2: Convert to adjacency-list format (3 marks)

Convert the data from edge-list format to the adjacency-list format.
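The conversion can be sketched in plain Python on a made-up graph (the edges in <tt>toy_edges</tt> are assumptions for illustration). Each edge adds each endpoint to the other's neighbour list, and the lists are sorted by numeric ID at the end.

```python
from collections import defaultdict

# Made-up edge list; note that '3<TAB>1' lists the larger ID first
toy_edges = ["1\t2", "3\t1", "2\t3", "3\t4"]

neighbours = defaultdict(list)
for line in toy_edges:
    u, v = line.split("\t")
    # Record the edge in both directions because the graph is undirected
    neighbours[u].append(v)
    neighbours[v].append(u)

# IDs are strings, so sort with key=int to get numeric rather than lexicographic order
toy_adj = {node: sorted(adj, key=int) for node, adj in neighbours.items()}
```

Sorting with <tt>key=int</tt> matters because the IDs are kept as strings: a plain string sort would place '10' before '9'.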

In [8]:
"""
    Input: an RDD in the edgelist format as above
    Returns: an RDD with each element being a pair (node_ID, list of adjacent node IDs)
             i.e., (string,list)
"""
# Helper used with reduceByKey to build the adjacency lists

def adj_list(a,b): # Merge two lists into a single list sorted by numeric ID
    a.extend(b)
    a.sort(key=int)
    return a

def toAdjList(edgeList):

  # Creating list of lists of Coauthors
  edges = edgeList.map(lambda line : line.split("\t"))

  # Each Edge contributes towards the adjacency-lists of both end nodes
  edge1 = edges.map(lambda w : (w[0],[w[1]]))
  edge2 = edges.map(lambda w : (w[1],[w[0]]))
  effective_edges=edge1.union(edge2)

  # Creating the adjacency-list for each author_id with reduceByKey and the adj_list helper
  adjacency_list=effective_edges.reduceByKey(adj_list)

  return adjacency_list

In [9]:
adjList = toAdjList(edgeList)
adjList.takeSample(False,2)

[('206883', ['45992', '109764']),
 ('255676', ['6535', '9935', '109620', '233521'])]

A sample of the <tt>adjList</tt> RDD should have the following format:
````{verbatim}
[('229379',
  ['16758',
   '20208',
   '84673',
   '86302',
   '201707',
   '208085',
   '229377',
   '229378',
   '270355',
   '284357',
   '299813',
   '344507']),
 ('92555', ['43489', '58094', '92554'])]
````


## Task 3: Compute the number of mutual co-authors (4 marks)

From either the edge-list representation or the adjacency-list representation, compute the number of mutual co-authors for each pair of authors (only for pairs of authors who have at least one mutual co-author).
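One way to read this task: if X has co-authors Y1, ..., Yn, then X is a mutual co-author of every pair (Yi, Yj). The plain-Python sketch below counts such pairs on a made-up adjacency list (<tt>toy_adj</tt> is an assumption for illustration), building each key with the smaller ID first so that the same pair always gets the same key.

```python
from collections import Counter
from itertools import combinations

# Made-up adjacency lists for the graph with edges 1-2, 1-3, 2-3, 2-4, 3-4
toy_adj = {
    "1": ["2", "3"],
    "2": ["1", "3", "4"],
    "3": ["1", "2", "4"],
    "4": ["2", "3"],
}

mutual = Counter()
for node, adj in toy_adj.items():
    # Sort numerically so each pair is emitted as 'smaller-larger'
    for a, b in combinations(sorted(adj, key=int), 2):
        # 'node' is one mutual co-author of the pair (a, b)
        mutual[f"{a}-{b}"] += 1
```

In this toy graph, authors 1 and 4 are not co-authors of each other but share the two co-authors 2 and 3, so the pair '1-4' gets a count of 2.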

In [10]:
"""
    Input: an RDD in the edgelist or adjacency list format as above
    Returns: an RDD with each element being a pair
             (pair of author IDs, number of mutual co-authors for those authors)
             i.e., (string,integer)
"""
# For each author X with adjacency-list (Y1, Y2, … , Yn),
# emit the key value pair (Yi-Yj,1) for every pair Yi, Yj with i < j,
# since X contributes one towards the no. of mutual co-authors of each such pair.
# The adjacency-lists are sorted by numeric ID, so a pair always gets the same key.

def MutualCoAuthor(w):
  ls=[]
  res = w[1]
  # Iterating through adjacency-list to emit key value pair (Y1-Y2,1)
  for i in range(0,len(res),1):
    for j in range(i+1,len(res),1):
      ls.append((str(res[i])+'-'+str(res[j]),1))
  return ls

def numMutualCoAuthors(data):

  # Flattened list of key value pairs (Y1-Y2,1)
  CoAuthor=data.flatMap(MutualCoAuthor)

  # Summing up the total no. of mutual co-authors of each pair of author_id by reduceByKey operation
  MutualCoAuthors=CoAuthor.reduceByKey(lambda m, n: m + n)

  return MutualCoAuthors

Now, execute the process. Even with good programming, it may take a few minutes to run.
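A rough way to see where the running time goes, assuming the pair-emitting approach: an author with degree d emits d(d-1)/2 pairs, so the total number of emitted records is the sum of that quantity over all authors. The sketch below evaluates this on made-up degrees (<tt>toy_degrees</tt> is an assumption for illustration, not a result from the data).

```python
# Made-up degrees for a four-node graph
toy_degrees = [2, 3, 3, 2]

# An author with d co-authors produces d * (d - 1) / 2 unordered pairs
pairs_emitted = sum(d * (d - 1) // 2 for d in toy_degrees)
```

Because the cost grows quadratically with degree, a few authors with very many co-authors can account for a large share of the work.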

In [11]:
# use one of the following lines depending on your function
# Using adjacency list
data = adjList
# data = edgeList
numMF = numMutualCoAuthors(data)
numMF.takeSample(False,10)

[('101895-322836', 1),
 ('1111-12605', 1),
 ('58316-101799', 1),
 ('10275-40145', 1),
 ('70047-186671', 1),
 ('198239-380100', 1),
 ('195464-267276', 1),
 ('121221-324055', 1),
 ('56723-73918', 1),
 ('116176-155494', 4)]

The output RDD should have the following format:
````{verbatim}
[('33109-40583', 1),
 ('75663-261743', 1),
 ('83029-266423', 1),
 ('348648-379269', 3),
 ('36973-38262', 1),
 ('158280-190810', 2),
 ('97465-109654', 1),
 ('40020-86638', 1),
 ('49728-267552', 1),
 ('147445-208076', 1)]
````