This project was completed as part of the course requirements for CS 8803: Data Science for Epidemiology (Fall 2020). This package contains two folders. The DOC folder contains a report and a presentation for the project. The SRC folder contains a code folder and a data folder.

The code for this project is written in Python 3. From here on, all folder references are relative to the SRC folder. The packages needed to run this project can be installed by running pip3 install -r requirements.txt, using the requirements.txt file in SRC. To use this project and run a demo, perform the following steps in the given order. Each step is described with its context, goal and data requirements. Each step consists of either a notebook or a script, which only needs to be run to carry out that step.

  1. Preparing Prestige Data:

The raw CSV files for US News Rankings, CS Rankings and the faculty-hiring-driven prestige are in data\prestige\. The notebook prestige_calculation.ipynb contains code to compute prestige scores for all universities. The resulting dataframe is pickled and stored for convenience in data\prestige\ as prestige_data.pkl.
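
Several steps in this README store their results as pickle files. As a minimal sketch of how such a file could be read back, the helper below uses only the standard library. The name load_pickle is hypothetical and not part of the project; it also assumes the stored dataframe is a pandas dataframe, in which case pandas must be installed for unpickling to work. Pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object (for example the prestige dataframe) from disk.

    Hypothetical helper for illustration; not part of the project code.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Assumed usage, run from the SRC folder:
# prestige = load_pickle(Path("data") / "prestige" / "prestige_data.pkl")
# print(prestige.head())
```

The same helper would apply to the other .pkl files mentioned in the later steps.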

  1. Preparing Idea Quality Data:

ICLR data is stored in \data\mag_papers\iclr.json. The script that processes this data and finds the papers matching those in our MAG data is in code\ The matched papers are pickled and stored for convenience in data\paper_quality\iclr_matched_papers.pkl. The metadata for each ICLR 2018 paper is stored as a jsonlines file in data\paper_quality\iclr2018_metadata.jsonl.
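
A jsonlines file holds one JSON object per line, so the metadata file can be read line by line. The sketch below shows one way to do this with the standard library. The function name read_jsonlines is hypothetical, and the fields inside each record are not assumed here.

```python
import json


def read_jsonlines(path):
    """Return a list with one parsed JSON object per non-empty line.

    Hypothetical helper for illustration; not part of the project code.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Assumed usage, run from the SRC folder:
# metadata = read_jsonlines("data/paper_quality/iclr2018_metadata.jsonl")
# print(len(metadata))
```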

  1. Collecting and processing MAG data:

We obtained our data from the Microsoft Academic Graph (MAG). One can read more about how to use their API here : The script to collect data is in the script The parameters currently passed in the MAG query fetch conference papers post 2017 for the conferences ACL, ICCV, CVPR, ICLR, ICML, NAACL and NEURIPS. These can be changed as needed. When this script is run, the necessary data files are collected in data\paper_data_mag. Each .json file is named in the form MAG_conference_name.json. These files are too big to be stored on GitHub and hence have to be collected by running the script.

These files are then further used to create the citation and infection networks. The notebook code\citation_network_dk.ipynb contains code to aggregate the data for all the conferences and collect it in a dataframe. Running this notebook stores the pickled version of this dataframe in data\all_paper_data.pkl. This file is also too big to be stored on GitHub and has to be generated by running the code.
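
The aggregation step can be pictured as reading every MAG_conference_name.json file and tagging each record with its conference. The sketch below is an illustration only, not the notebook's code. It assumes each file holds a JSON list of paper records (dictionaries); the function name load_conference_papers and the added "conference" key are hypothetical.

```python
import json
from pathlib import Path


def load_conference_papers(data_dir):
    """Combine all MAG_<conference>.json files into one list of records.

    Assumption: each file contains a JSON list of dictionaries.
    """
    papers = []
    for path in sorted(Path(data_dir).glob("MAG_*.json")):
        # "MAG_ICLR.json" -> "ICLR"
        conference = path.stem[len("MAG_"):]
        with path.open(encoding="utf-8") as handle:
            records = json.load(handle)
        for record in records:
            # Copy the record and tag it with the conference it came from.
            papers.append({**record, "conference": conference})
    return papers


# Assumed usage, run from the SRC folder:
# papers = load_conference_papers(Path("data") / "paper_data_mag")
```

A list of dictionaries like this can then be turned into a dataframe for the later steps.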

  1. Creating the citation and infection network:

The code to create the citation network and its infection network is in prestige_calculation.ipynb. Both networks have been pickled and stored in \data\networks\citation_network.pkl and \data\networks\citation_infection_network.
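
To illustrate what a citation network looks like as a data structure, the sketch below builds a directed adjacency mapping from paper records. It covers only the citation network, not the infection network. The record keys "id" and "references" and the function name build_citation_network are assumptions made for this example and may not match the project's data.

```python
def build_citation_network(papers):
    """Map each paper id to the set of papers in the collection that it cites.

    Assumption: each record has an "id" and an optional "references" list.
    """
    # Every paper is a node, even if it cites nothing in the collection.
    cites = {paper["id"]: set() for paper in papers}
    for paper in papers:
        for ref in paper.get("references", []):
            # Keep only edges to papers we know about; drop self-citations.
            if ref in cites and ref != paper["id"]:
                cites[paper["id"]].add(ref)
    return cites


# Small made-up example:
# build_citation_network([
#     {"id": 1, "references": [2, 99]},
#     {"id": 2, "references": []},
# ])
```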

  1. Creating the patient zero paper data:

The notebook code\patient_zero.ipynb contains code to process, aggregate and visualise data for the infection sources and can be run to do so. The resulting dataframe is pickled and stored in \data\patient_zero_data.pkl.

  1. Creating the researcher collaboration network and visualizing network topology:

The notebook collaboration_network.ipynb generates our collaboration network of researchers from paper collaborations, then calculates the degree and betweenness centrality measures of this network. It also generates plots to visualize and help understand these centralities. Pickles of the centrality measures assessed in this notebook are stored in \data\nodes\author_centrality.pkl and \data\nodes\author_list.pkl.
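
As a rough illustration of this step, the sketch below builds a co-authorship graph from per-paper author lists and computes degree centrality with the usual normalization, degree divided by (n - 1). It uses plain dictionaries rather than whichever graph library the notebook uses, does not cover betweenness centrality, and both function names are hypothetical.

```python
from itertools import combinations


def build_collaboration_network(author_lists):
    """Return {author: set of co-authors} from per-paper author lists."""
    neighbors = {}
    for authors in author_lists:
        unique = sorted(set(authors))
        for author in unique:
            neighbors.setdefault(author, set())
        # Every pair of authors on the same paper is connected.
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def degree_centrality(neighbors):
    """Degree of each node divided by (n - 1), where n is the node count."""
    n = len(neighbors)
    if n <= 1:
        return {author: 0.0 for author in neighbors}
    return {author: len(adj) / (n - 1) for author, adj in neighbors.items()}


# Small made-up example:
# graph = build_collaboration_network([["A", "B", "C"], ["A", "D"]])
# centrality = degree_centrality(graph)
```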

  1. Simulating Epidemics:

The simulation of an SIR model on our network is done in the notebook network_simulation.ipynb. It also contains plots to visualize the simulation results and their correlations.
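
For readers unfamiliar with SIR on a network, the sketch below shows one common discrete-time, stochastic variant: at each step, every infected node infects each susceptible neighbor with probability beta and then recovers with probability gamma. This is an illustration under those assumptions, not necessarily the variant used in the notebook. The function name simulate_sir and its parameters are hypothetical.

```python
import random


def simulate_sir(neighbors, seeds, beta, gamma, rng=None, max_steps=1000):
    """Run a discrete-time stochastic SIR process on an undirected graph.

    neighbors: {node: set of adjacent nodes}
    seeds: initially infected nodes
    Returns a list of (susceptible, infected, recovered) counts per step.
    """
    if rng is None:
        rng = random.Random()
    infected = set(seeds)
    recovered = set()
    susceptible = set(neighbors) - infected
    history = [(len(susceptible), len(infected), len(recovered))]
    for _ in range(max_steps):
        if not infected:
            break  # the epidemic has died out
        # Infection: each infected node tries to infect susceptible neighbors.
        newly_infected = set()
        for node in infected:
            for other in neighbors[node]:
                if other in susceptible and rng.random() < beta:
                    newly_infected.add(other)
        # Recovery: nodes infected at the start of the step may recover.
        newly_recovered = {node for node in infected if rng.random() < gamma}
        susceptible -= newly_infected
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
        history.append((len(susceptible), len(infected), len(recovered)))
    return history


# Small made-up example on a path graph 0 - 1 - 2:
# path = {0: {1}, 1: {0, 2}, 2: {1}}
# history = simulate_sir(path, seeds=[0], beta=0.5, gamma=0.3,
#                        rng=random.Random(0))
```

Passing a seeded random.Random instance makes a run repeatable, which helps when comparing simulation results across different seed nodes.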


