# COMP440 Collective Intelligence Team Nostradami Final Project: Citation Networks

## Team Nostradami is: Gugo Babayan, Eddie Chen, Nick Duncan, and Rory Donaghy
___
### What dataset are we working with?
AMiner is a project from a Chinese university, Beijing Jiaotong University, by researchers Huaiyu Wan, Yutao Zhang, Jing Zhang, and Jie Tang. The paper describing the project was published in 2019 via MIT Press Direct and can be found [here](https://direct.mit.edu/dint/article/1/1/58/9974/AMiner-Search-and-Mining-of-Academic-Social). We're using v12, which we're downloading from Kaggle ([Dataset Link](https://www.kaggle.com/datasets/mathurinache/citation-network-dataset/data?select=dblp.v12.json)), because the latest version (v14) is difficult to access: its data servers are located in China.

### How was the dataset collected?
AMiner's scientific publication citation network was created by scraping sources from DBLP, ACM, and MAG (Microsoft Academic Graph). The construction of the network is outlined in the paper linked above.

### For what purpose was the dataset collected?
This dataset was created strictly for academic research purposes and is offered for free through the project's website for data scientists to analyze.
___
## To Run This Project, First Reference The README.md and Ensure You Have A Copy of Our Cleaned Data (named indexed_data.csv) Which Can Be Created With clean_data.py Or Downloaded

In [None]:
import pandas as pd
import numpy as np
import json
import csv
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

from kaggle.api.kaggle_api_extended import KaggleApi
from zipfile import ZipFile 

In [None]:
citations_df = pd.read_csv('unindexed_data.csv')

citations_df.head()

In [None]:
# Recover Dataframe From CSV: drop rows missing key columns, then turn the
# comma-separated string columns back into Python lists
citations_df = citations_df.dropna(subset=["Document Type"])
citations_df = citations_df.dropna(subset=["Field of Study"])
citations_df["Field of Study"] = citations_df["Field of Study"].apply(lambda fields: fields.split(", "))
# Missing authors or references become empty lists instead of raising on NaN
citations_df["Authors"] = citations_df["Authors"].apply(lambda names: names.split(", ") if isinstance(names, str) else [])
citations_df["References"] = citations_df["References"].apply(lambda refs: refs.split(", ") if isinstance(refs, str) else [])
print(citations_df.shape)
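
To make the recovery step concrete, here is a minimal sketch on a tiny in-memory CSV. The two sample rows and the helper name `split_list_column` are made up for illustration; only the column names come from the notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample that reuses the notebook's column names.
sample_csv = io.StringIO(
    'ID,Title,Authors,Field of Study,References\n'
    '1,Paper A,"Ada, Bob","Computer science, Mathematics","2, 3"\n'
    '2,Paper B,Cy,Computer science,\n'
)

def split_list_column(value):
    """Split a comma-separated string into a list; missing values become []."""
    return value.split(", ") if isinstance(value, str) else []

sample_df = pd.read_csv(sample_csv)
for column in ["Authors", "Field of Study", "References"]:
    sample_df[column] = sample_df[column].apply(split_list_column)

print(sample_df[["Authors", "Field of Study", "References"]])
```

Note that the recovered reference IDs are strings, which matters later when they are matched against the `ID` column.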

In [None]:
# Earliest and latest publication years (each column is a pandas Series)
print(citations_df.Year.min())
print(citations_df.Year.max())

### Basic Analytics

In [1]:
# Papers written between 1800 (inclusive) and 1899 (inclusive)
print(citations_df[(citations_df["Year"] >= 1800) & (citations_df["Year"] < 1900)])

In [None]:
# Papers written between 1900 (inclusive) and 1999 (inclusive)
print(citations_df[(citations_df["Year"] >= 1900) & (citations_df["Year"] < 2000)])

In [None]:
# Papers written in 2000 or later
print(citations_df[(citations_df["Year"] >= 2000)])

In [None]:
num_unique_titles = len(citations_df.Title.unique())
print(num_unique_titles)
# Rows beyond the first occurrence of each title
print(f"There are {citations_df.shape[0] - num_unique_titles} papers that repeat a title already used by another paper")

Of the 3,277,181 papers in the dataset, there are only 3,232,994 unique titles. This means that 44,187 papers repeat a title already used by another paper in the dataset.
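
The difference between the row count and the unique-title count measures surplus copies, which is not the same as the number of papers that belong to a group of shared titles. A small sketch on made-up titles shows both counts; the variable names are illustrative only.

```python
import pandas as pd

# Made-up titles: "A" appears three times, "B" twice, "C" once.
titles = pd.Series(["A", "A", "A", "B", "B", "C"], name="Title")

# Rows beyond the first occurrence of each title (what the cell above computes).
surplus_copies = len(titles) - titles.nunique()

# Every row whose title appears more than once, first occurrences included.
papers_in_shared_groups = int(titles.duplicated(keep=False).sum())

print(surplus_copies, papers_in_shared_groups)
```

On the real data the second count could be obtained with `citations_df.Title.duplicated(keep=False).sum()`.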

In [None]:
# Row of the most cited paper (first one if there is a tie)
citations_df.loc[citations_df["Citations"].idxmax()]

In [None]:
print(citations_df.Citations.min())
print(citations_df.Citations.max())

print(citations_df[citations_df["Citations"] >= citations_df.Citations.max()])

The most cited paper is "Distinctive Image Features from Scale-Invariant Keypoints", with 35,541 citations.

### Show Most Common Fields of Study

In [None]:
# ---------------------------------------------------------------------------
# Create frequency dictionary for the "Field of Study" column
# of the citations dataframe, citations_df.
# ---------------------------------------------------------------------------
fosDict={}

for fields in citations_df["Field of Study"]:
  for field in fields:
    fosDict[field] = fosDict.get(field, 0) + 1


fieldsKeys=list(fosDict.keys())
fieldsValues=[fosDict.get(field) for field in fieldsKeys]

# ------------------------------------------------------------
# Create a Field of Study Dataframe, fos_df.
# ------------------------------------------------------------
fos_df = pd.DataFrame({
    "Field": fieldsKeys,
    "Frequency": fieldsValues
})

fos_df.set_index("Field", inplace=True)

fos_df = fos_df.sort_values("Frequency", ascending=False)
fos_df = fos_df.reset_index()

In [None]:
# Print out Top 5 most common fields
print(fos_df.head(5))

In [None]:
# ---------------------------------------------------------------------------
# Finding the number of fields of study that are only mentioned once.
# ---------------------------------------------------------------------------
num_lowest_freq_field = 0
for key in fosDict:
    if fosDict.get(key) == 1:
        num_lowest_freq_field += 1

print(num_lowest_freq_field)

In [None]:
# ---------------------------------------------------------------------------
# Print out the least common fields.
# ---------------------------------------------------------------------------
print(fos_df.tail(num_lowest_freq_field + 1))

24,390 fields of study appear only once in the entire dataset.
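
As an alternative to the hand-built frequency dictionary, the same counts can be sketched with `explode` and `value_counts`. The toy dataframe below is invented for illustration and assumes, as in the notebook, that "Field of Study" holds a list per paper.

```python
import pandas as pd

# Toy stand-in for citations_df with a list-valued "Field of Study" column.
toy_df = pd.DataFrame({"Field of Study": [
    ["Computer science", "Mathematics"],
    ["Computer science"],
    ["Computer science", "Multimedia"],
]})

# One row per (paper, field) pair, then count how often each field occurs.
field_counts = toy_df["Field of Study"].explode().value_counts()

# Number of fields that are mentioned exactly once.
num_singleton_fields = int((field_counts == 1).sum())

print(field_counts.head(5))
print(num_singleton_fields)
```

`value_counts` already sorts in descending order, so `head(5)` gives the most common fields directly.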

### Show Top Authors In Each Field of Study

In [None]:
# ---------------------------------------------------------------------------
# Create A Dictionary Containing The Top 20 Authors of Each Field In Dataset
# ---------------------------------------------------------------------------

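# Explode into a separate dataframe so citations_df keeps its list-valued
# "Field of Study" column for the later cells
exploded_df = citations_df.explode("Field of Study")
grouped = exploded_df.groupby("Field of Study")

top_influential_figures = {}

for field, group in grouped:
    author_citations = {}
    
    for index, row in group.iterrows():
        authors = row["Authors"]  # Already split into a list when the CSV was loaded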
        
        for author in authors:
            author_citations[author] = author_citations.get(author, 0) + row["Citations"]
            
    sorted_authors = sorted(author_citations.items(), key=lambda x: x[1], reverse=True)
    top_influential_figures[field] = sorted_authors[:20]

In [None]:
print(top_influential_figures)
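
Looping over every row with `iterrows` can be slow on a dataset of this size. A possible vectorized sketch is shown below on invented data; it assumes list-valued "Authors" and "Field of Study" columns, and `top_n` is a hypothetical parameter (the notebook uses 20).

```python
import pandas as pd

# Invented sample with the notebook's column names.
toy_df = pd.DataFrame({
    "Authors": [["Ada", "Bob"], ["Ada"], ["Cy"]],
    "Field of Study": [["Computer science"], ["Computer science", "Mathematics"], ["Mathematics"]],
    "Citations": [10, 5, 7],
})

# One row per (paper, field, author) combination.
long_df = (
    toy_df.explode("Field of Study")
          .reset_index(drop=True)
          .explode("Authors")
)

# Total citations per author within each field.
totals = long_df.groupby(["Field of Study", "Authors"])["Citations"].sum()

# Keep the top_n authors per field, highest totals first.
top_n = 1
top_rows = totals.sort_values(ascending=False).groupby(level="Field of Study").head(top_n)

top_by_field = {
    field: list(zip(group.index.get_level_values("Authors"), group.tolist()))
    for field, group in top_rows.groupby(level="Field of Study")
}
print(top_by_field)
```

As in the loop above, a paper's full citation count is credited to each of its authors.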

### Illustrate Growth In Fields Over Time

In [None]:
# ---------------------------------------------------------------------------
# Create a line graph of the top 5 fields of study in the citations 
# dataframe (citations_df) where the x-axis represents "Year" and the
# y-axis represents the "Frequency" that we've seen the field of study
# throughout the entire citations dataframe.
# ---------------------------------------------------------------------------

# Group the papers by publication year (the earliest year in the data is 1800)
grouped_df = citations_df.groupby("Year")
# Create dictionary where key=Year and value=Dictionary of Field of Study Frequency for that year
year_fos_dict = {}
for year in grouped_df.indices.keys():
    year_fos_list = grouped_df.get_group(year)["Field of Study"].tolist()
    temp_list = []
    for fields in year_fos_list:
        temp_list.extend(fields)

    fos_dict = {}
    for fos in temp_list:
        fos_dict[fos] = fos_dict.get(fos, 0) + 1

    year_fos_dict[year] = fos_dict

# ----------------------------------------------------------------------
topFields = fos_df.head(5)["Field"].tolist()

# Count how often each of the top 5 fields is mentioned in each year
# by making a list whose length is the range between the smallest year in
# citations_df and the largest year in citations_df. 
# For this list, index 0=the lowest year in citations_df.
fos_freq_year_dict = {}
for fos in topFields:
    fos_freq_year = [] # Field of study frequency for each year
    for year in range(citations_df.Year.min(), citations_df.Year.max() + 1):
        # Years with no papers, or no papers in this field, count as 0
        fos_freq_year.append(year_fos_dict.get(year, {}).get(fos, 0))
    fos_freq_year_dict[fos] = fos_freq_year

# ----------------------------------------------------------------
# Create a line chart starting from the year 1950 and 
# going to the maximum year in citations_df minus 2 years
# (i.e. 2018, because the graph appears to dip off toward 2020)
# ----------------------------------------------------------------
start_year = 1950
years = list(range(start_year, citations_df.Year.max() - 1))
# Position of 1950 in the per-year lists, whose index 0 is the earliest year
start_index = start_year - citations_df.Year.min()

for key in fos_freq_year_dict:
    plt.plot(years, fos_freq_year_dict[key][start_index:-2], label=key)

plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title('Number of Mentions of a Field per Year')
plt.legend(loc="upper left")
plt.show()
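
The per-year counting above can also be expressed as a single year-by-field table. The sketch below uses invented data and assumes integer years and a list-valued "Field of Study" column; the name `year_field_counts` is illustrative.

```python
import pandas as pd

# Invented sample: no papers at all in 2001.
toy_df = pd.DataFrame({
    "Year": [2000, 2000, 2002],
    "Field of Study": [["Computer science", "Mathematics"], ["Computer science"], ["Mathematics"]],
})

# One row per (paper, field) pair.
long_df = toy_df.explode("Field of Study")

# Every year between the earliest and latest, so gaps show up as zeros.
all_years = range(int(toy_df["Year"].min()), int(toy_df["Year"].max()) + 1)

# Rows are years, columns are fields, values are mention counts.
year_field_counts = (
    long_df.groupby(["Year", "Field of Study"]).size()
           .unstack(fill_value=0)
           .reindex(index=all_years, fill_value=0)
)
print(year_field_counts)
```

With such a table, selecting a year range and a list of fields with `.loc` and calling `.plot()` would give a line chart comparable to the one above.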

### Find Original Papers

In [None]:
final_df = pd.read_csv('final_dataset.csv')

In [None]:
final_df.head()

In [None]:
small_df = final_df[:5]
print(small_df)

In [None]:
# Select the papers that list no references
mask = final_df.references_column.isna()
masked_df = final_df[mask]

In [None]:
print(len(masked_df[masked_df["Document Type"] == "Journal"]))

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=masked_df, x="Document Type")
plt.title("Distribution of Original Papers' Document Types")
plt.xlabel("Document Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=final_df, x="Document Type")
plt.title("Distribution of All Papers' Document Types")
plt.xlabel("Document Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

In [None]:
print(fos_df.head(20))
# Skip the most common field ("Computer science") and keep the next 19
topFields = fos_df["Field"].to_numpy()[1:20]
print(len(topFields))

### Illustrate How CS Has Interacted With Other Fields Over Time By Showing Growth of Fields Associated With CS In Dataset

In [None]:
# ---------------------------------------------------------------------------
# Create a line graph tracking the times we've seen a field of study 
# for the top 19 fields (excluding "Computer science") over time for papers
# that have both the fields of study "Computer science" and one of or more
# fields from the top 19 fields of study in the citations dataframe (citations_df) 
# where the x-axis represents "Year" and the y-axis represents the "Frequency"
# that we've seen the field of study throughout the entire citations dataframe.
# ---------------------------------------------------------------------------

# Create dictionary where key=Year and value=Dictionary of Field of Study Frequency for that year
year_fos_dict = {}
for year in grouped_df.indices.keys():
    year_fos_list = grouped_df.get_group(year)["Field of Study"].to_numpy()
    temp_list = []
    for fields in year_fos_list:
        if (not set(topFields).isdisjoint(set(fields)) and "Computer science" in set(fields)):
            temp_list.extend(fields)

    fos_dict = {}
    for fos in temp_list:
        fos_dict[fos] = fos_dict.get(fos, 0) + 1

    year_fos_dict[year] = fos_dict

# ----------------------------------------------------------------
# topFields = ["Artificial intelligence", "Mathematics", "Machine learning", "Mathematical optimization", ... , "Multimedia"]

# Count how often each of the top 19 fields is mentioned in each year for 
# papers that have the "Computer science" field and one or more of the top 19 fields
# by making a list whose length is the range between the smallest year in
# citations_df and the largest year in citations_df. 
# For this list, index 0=the lowest year in citations_df.
fos_freq_year_dict = {}
for fos in topFields:
    fos_freq_year = [] # Field of study frequency for each year
    for year in range(citations_df.Year.min(), citations_df.Year.max() + 1):
        # Years with no matching papers, or none in this field, count as 0
        fos_freq_year.append(year_fos_dict.get(year, {}).get(fos, 0))
    fos_freq_year_dict[fos] = fos_freq_year

# print(fos_freq_year_dict)
# print(fos_freq_year_dict.keys())
# print(fos_freq_year_dict.values())
# ----------------------------------------------------------------
# Create a line chart starting from the year 1950 and 
# going to the maximum year in citations_df minus 2 years
# (i.e. 2018, because the graph appears to dip off toward 2020)
start_year = 1950
years = list(range(start_year, citations_df.Year.max() - 1))
# Position of 1950 in the per-year lists, whose index 0 is the earliest year
start_index = start_year - citations_df.Year.min()
plt.figure(figsize=(100,50))
plt.rcParams["font.size"] = 50
plt.rcParams["lines.linewidth"] = 10

for key in fos_freq_year_dict:
    plt.plot(years, fos_freq_year_dict[key][start_index:-2], label=key)

plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title('Number of Mentions of a Field Pair per Year')
plt.legend(loc="upper left")
plt.show()
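
The filtering idea behind this chart, counting a field only when it appears alongside "Computer science", can be sketched in isolation. The function name `count_cs_partner_fields` and the sample papers are hypothetical. Unlike the cell above, this version tallies only the partner fields of interest and counts each field at most once per paper.

```python
from collections import Counter

def count_cs_partner_fields(papers, partner_fields):
    """Count partner fields across papers that are also tagged "Computer science"."""
    partner_fields = set(partner_fields)
    counts = Counter()
    for fields in papers:
        field_set = set(fields)
        if "Computer science" in field_set:
            # Only fields that co-occur with CS and are in the partner list count.
            counts.update(field_set & partner_fields)
    return counts

# Hypothetical field lists for four papers.
sample_papers = [
    ["Computer science", "Mathematics"],
    ["Mathematics"],
    ["Computer science", "Multimedia", "Mathematics"],
    ["Computer science"],
]
pair_counts = count_cs_partner_fields(sample_papers, ["Mathematics", "Multimedia"])
print(pair_counts)
```

Applying a function like this to each year's group of papers would give the per-year counts that the chart plots.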

### INCOMPLETE AS OF PROJECT DEADLINE: Show How Authors Interact With Each Other

In [None]:
# ---------------------------------------------------------------------------
# Removes papers that are not part of the CS field or do not contain a 
# secondary field that is a part of the top 20 fields
# ---------------------------------------------------------------------------

# The 19 most common fields other than "Computer science", taken from fos_df
top_19 = [field for field in fos_df["Field"].head(20) if field != "Computer science"]

def trim_non_top_fields(fos_list):
  # Keep only papers tagged with CS plus at least one other top field
  if "Computer science" in fos_list and not set(fos_list).isdisjoint(top_19):
    # Put "Computer science" first, followed by the paper's other top fields
    return ["Computer science"] + [x for x in fos_list if x in top_19]
  return np.nan  # np.NaN was removed in NumPy 2.0

citations_df['Field of Study'] = citations_df['Field of Study'].apply(trim_non_top_fields)

citations_df = citations_df.dropna(subset=["Field of Study"])

In [None]:
# ---------------------------------------------------------------------------
# Creates a dataframe that contains edge relationships between authors, WIP
# ---------------------------------------------------------------------------

citations_df = citations_df.reset_index(drop=True) # Since I'm relying on the index here, I don't want gaps
authors = list({author for paper_authors in citations_df["Authors"] for author in paper_authors})

# Lookup table from paper ID to row position.
# References were parsed from strings, so IDs are compared as strings.
id_lookup_table = {str(paper_id): pos for pos, paper_id in enumerate(citations_df["ID"])}

# Map each citing author (A) to the set of authors they cite (B)
cited_authors = {author: set() for author in authors}

# Iterates through papers and adds the authors of each referenced paper
# to the citation set of every author of the citing paper
for _, row in citations_df.iterrows():
  for ref in row["References"]:
    paper_index = id_lookup_table.get(ref, -1)
    if paper_index >= 0:
      referenced_authors = citations_df.at[paper_index, "Authors"]
      for author in row["Authors"]:
        cited_authors[author].update(referenced_authors)

author_network_df = pd.DataFrame({"A": authors, "B": [sorted(cited_authors[a]) for a in authors]})

# Explode to make them single edge relationships; authors with no resolved
# citations would become NaN rows, so drop them
exploded = author_network_df.explode("B").dropna(subset=["B"])

In [None]:
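G = nx.Graph()
# Skip rows without a citation target so NaN never becomes a node
edge_df = exploded.dropna(subset=["A", "B"])
edges = list(edge_df[["A", "B"]].itertuples(index=False, name=None))
G.add_nodes_from(exploded['A'])
G.add_edges_from(edges)  # Also adds any B authors that are not yet nodes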

pos = nx.spring_layout(G) 
nx.draw(G, pos, with_labels=True, node_size=300, node_color='skyblue', font_size=10, font_color='black')
plt.show()
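
Since "A cites B" is a one-way relationship, a directed graph may be a closer fit than `nx.Graph`, and a summary such as in-degree may be more readable than a full drawing on a large network. The sketch below uses an invented edge list with the same `A` and `B` column names; it is one possible direction for this unfinished section, not the project's method.

```python
import networkx as nx
import pandas as pd

# Invented edges: each row means author A cites author B.
toy_edges = pd.DataFrame({
    "A": ["Ada", "Ada", "Bob", "Cy"],
    "B": ["Bob", "Cy", "Cy", "Ada"],
})

# Build a directed graph so the direction of each citation is preserved.
citation_graph = nx.from_pandas_edgelist(
    toy_edges, source="A", target="B", create_using=nx.DiGraph()
)

# In-degree = number of distinct authors who cite this author.
most_cited = sorted(citation_graph.in_degree(), key=lambda pair: pair[1], reverse=True)
print(most_cited[:3])
```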