# Homework 1
## Introduction
### Important dates
* Homework release: Thursday, 3rd October 2019
* **Homework due**: Wednesday, 16th October 2019 (23:59 hours, 11:59 PM)
* Peer review due: Wednesday, 23rd October 2019 (23:59 hours, 11:59 PM)
* Grading release: Wednesday, 30th October 2019

### Description

The data you'll be working with comes from multiple sources. The main data source will be [DBLP](https://dblp.uni-trier.de/), a database of publications from major computer science journals and conferences. A subset of DBLP, which you will use in this assignment, is provided to you via a [google drive folder](https://drive.google.com/file/d/1Kci8joML74tCSzuBbhxtd1ylR4f0dlm6/view). Later on, you will enrich the DBLP data with a dataset on conference rankings and with the proceedings of the [NIPS conference](https://nips.cc/) [1] ('proceedings' is another word for the set of papers published at an academic conference). After loading and cleaning the data, you will answer various questions about its contents.

**Some rules:**
- You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you have to justify your choice.
- Make sure you use the data folder provided in the repository in *read-only* mode.
- Be sure to provide explanations for your answers. A notebook that only has code cells will not suffice.
- Also, be sure to *hand in a fully-run and evaluated notebook*. We will not run your notebook for you; we will grade it as is. Only the results contained in your evaluated code cells will be considered, and we will not see the results of unevaluated code cells. To check whether everything looks as intended, view the rendered notebook on the GitHub website once you have pushed your solution there.

[1] Note that NIPS was renamed to NeurIPS in 2018, but for simplicity, whenever we say 'NIPS', we really mean 'NIPS and NeurIPS'.

## Task A. Getting a sense of the dataset

### A1. Loading the data
Download the DBLP dataset (available on [google drive](https://drive.google.com/file/d/1Kci8joML74tCSzuBbhxtd1ylR4f0dlm6/view)) and load it into a Pandas dataframe. A row of your dataframe should look as follows:

| paper id | author names | publication year | paper title | 
| :----:|:-------------:| :-----:|:-----:|
| conf/nips/doe1036 | [John Doe, Jane Doe] | 2003 | Some Catchy Title: An Expanded and Boring Title | 


1. Filter the papers: keep only conference papers. For each of the remaining ones, find the acronym of the conference where it was published. Retain only those papers that have been published in the conferences listed in `data/list_of_ai_conferences.txt`. Additionally, add a column named 'conference' to your dataframe.   
_Hint: The `paper id` tells you whether a paper was published at a conference, and if so, at which one._

2. Report the overall number of papers in the filtered dataset, as well as the number of papers per conference.
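
The two steps above can be sketched on a toy dataframe. The paper ids and the conference list here are made up for illustration; the real values come from the DBLP file and from `data/list_of_ai_conferences.txt`. The sketch assumes every id has the shape `<type>/<venue>/<key>`.

```python
import pandas as pd

# Toy data: hypothetical ids shaped like "<type>/<venue>/<key>".
toy = pd.DataFrame({
    "paper_id": [
        "conf/nips/doe1036",
        "journals/jmlr/roe12",
        "conf/icml/poe7",
        "conf/sigmod/moe3",
    ],
})
toy_ai_conf = ["nips", "icml"]  # stand-in for the list of AI conferences

# Split the id once, then pick the first two components.
parts = toy["paper_id"].str.split("/")
toy["type"] = parts.str[0]
toy["conference"] = parts.str[1]

# Keep conference papers whose venue is in the AI list.
toy_filtered = toy[(toy["type"] == "conf") & toy["conference"].isin(toy_ai_conf)]

# Overall count and per-conference counts.
toy_counts = toy_filtered["conference"].value_counts()
print(len(toy_filtered))
print(toy_counts)
```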

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import ast  # needed in A2 to parse the author lists
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
from requests import get
from bs4 import BeautifulSoup

In [None]:
#Load the data with \t separator
df = pd.read_csv('data/dblp.tsv', sep='\t')

In [None]:
#Rename to fit the instructions
df.rename(columns={'id': 'paper_id',
                   'authors': 'authors_name',
                   'year': 'publication_year',
                   'title': 'paper_title'}, inplace=True)

In [None]:
#Build the Dataframe columns to filter on
#A paper_id looks like '<type>/<conference>/<key>', e.g. 'conf/nips/doe1036'
id_parts = df['paper_id'].str.split('/')

df['type'] = id_parts.str[0]
df['conference'] = id_parts.str[1]

In [None]:
print("The dataset has these sources: ", df['type'].unique())
print("The dataset has these conferences: ", df['conference'].unique())

In [None]:
#Filter conferences
conf_papers = df[df.type == 'conf']

In [None]:
#Load ai conferences
ai_conferences = pd.read_csv('data/list_of_ai_conferences.txt', header=None)
ai_conf = [i[0] for i in ai_conferences.values]

In [None]:
#Filter ai conferences
ai_conf_papers = conf_papers[conf_papers['conference'].isin(ai_conf)]

In [None]:
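#Reassign instead of dropping in place: ai_conf_papers is a filtered slice,
#and an in-place drop on it triggers a SettingWithCopyWarning
ai_conf_papers = ai_conf_papers.drop(columns='type')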

In [None]:
ai_conf_papers.head()

In [None]:
print("Overall number of papers: " + str(ai_conf_papers.shape[0]))

In [None]:
#Number of papers per conference
ai_conf_papers['conference'].value_counts()

In [None]:
ai_conf_papers.to_csv('data/ai_conf_papers')

### A2. An author-centric look
The dataframe you created above was a paper-centric one. Now, we want you to create a new dataframe centered around authors. Do this by expanding the author names in the lists in the 2nd column into separate rows. That is, if a paper has 3 authors, turn that row into 3 rows, each of which only contains one of the author names (along with the rest of the paper information, i.e., title, conference and year). Keep both dataframes, we are going to need both of them.    
**Report the number of unique authors.**
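
A minimal sketch of the expansion, on made-up rows. It assumes the author lists are stored as strings such as `"['John Doe', 'Jane Doe']"` (which is what a round trip through CSV produces), so they are parsed with `ast.literal_eval` before calling `explode`.

```python
import ast
import pandas as pd

# Toy paper-centric frame; names and ids are illustrative only.
papers = pd.DataFrame({
    "paper_id": ["conf/nips/doe1036", "conf/icml/poe7"],
    "authors_name": ["['John Doe', 'Jane Doe']", "['Jane Doe']"],
    "publication_year": [2003, 2005],
})

# literal_eval turns the string "['a', 'b']" into a real Python list.
papers["author"] = papers["authors_name"].apply(ast.literal_eval)

# explode() emits one row per list element and repeats the other columns.
authors = (
    papers.drop(columns="authors_name")
    .explode("author")
    .reset_index(drop=True)
)

n_unique_authors = authors["author"].nunique()
print(authors)
print(n_unique_authors)
```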

In [None]:
#In the first paper's author list, Thomas Prescher and Michael Schwarz carry a number at the end of their names.
#The number seems to be generated automatically to tell apart different authors with the same name.
#Thus, we assume that no two different authors share the same authors_name value.
df.iloc[0]['authors_name']

In [None]:
spl = pd.read_csv('data/ai_conf_papers')

In [None]:
spl.drop('paper_id', axis=1, inplace=True)

In [None]:
author_list_of_list = spl['authors_name'].apply(ast.literal_eval).values.tolist()

spl['author_list'] = author_list_of_list

In [None]:
spl.drop('authors_name', axis=1, inplace=True)

In [None]:
author_centred_df = spl.explode('author_list')

In [None]:
author_centred_df.rename(columns={'author_list': 'author'}, inplace=True)

In [None]:
author_centred_df.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
#Change the order of the columns
cols = author_centred_df.columns.tolist()
cols

In [None]:
cols = cols[:1] + cols[-1:] + cols[1:-1]
cols

In [None]:
author_centred_df = author_centred_df[cols]

In [None]:
author_centred_df.reset_index(inplace=True, drop=True)

In [None]:
author_centred_df.head()

In [None]:
author_centred_df.to_csv('data/author_centred_df.csv')

### A3. Is your data analysis-friendly?

Using an appropriate view of your data (paper-centric or author-centric), solve the following tasks:

1. Plot the number of papers per author and analyze it. Do you observe any outliers? Can you identify the source of the problem? Please elaborate!   
_Hint: To find out where the problem comes from, try performing an analysis at the conference or year level._   
Make sure you remove this inconsistency from your dataframe before moving to the next step, and also create a new plot of the number of papers per author after fixing the problem.   

2. Plot the number of papers per year. Do you observe any inconsistency in the output? Real-world data is usually messy, with typos, erroneous entries, and sometimes issues that make even loading the data problematic. Fix any errors that you encounter along the way, find ways to clean the attribute `year`, and redo the plot of the number of papers per year.   

3. Machine learning (ML) has been one of the hottest topics within the broader area of AI recently, so let’s see if this is reflected in the number of ML publications. In particular, let’s focus on the two major ML conferences, NIPS and ICML: make a new dataframe with only NIPS and ICML papers (let’s call these the “ML papers”), plot the number of ML papers over time, and analyze and discuss the plot. Do you observe anything odd in this plot? What causes these problems?   
_Hint: Try to perform an analysis at the conference or year level._   

4. By now, you may have noticed that some conferences are not fully represented in the DBLP dataset. Complete the paper-centric dataframe by scraping the full NIPS data from the online proceedings at https://papers.nips.cc/ (maintain the same schema used in your previous dataframes, but fill in missing values). After this step, remove any remaining papers that have missing values. Redo the plots of steps A3.2 and A3.3 after fixing the issue.   

_Note: In order to avoid re-running the cleaning part of the notebook every time, you could save the results at this point as a pickle file! Also, propagating your cleaning to both dataframes might prove useful later on._
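
For steps 3 and 4, one way to locate gaps is to count papers per year and per conference and look for zeros. The sketch below uses invented counts; the column names `conference` and `publication_year` follow the dataframes built earlier in this notebook.

```python
import pandas as pd

# Toy data: one row per paper, values invented for illustration.
toy = pd.DataFrame({
    "conference": ["nips", "nips", "icml", "icml", "icml", "aaai"],
    "publication_year": [2016, 2018, 2016, 2017, 2018, 2017],
})

ml_papers = toy[toy["conference"].isin(["nips", "icml"])]

# Rows: years, columns: conferences, cells: number of papers.
per_year = (
    ml_papers.groupby(["publication_year", "conference"])
    .size()
    .unstack(fill_value=0)
)

# A zero flags a year in which a conference has no papers in the data.
missing_nips_years = per_year.index[per_year["nips"] == 0].tolist()
print(per_year)
print(missing_nips_years)
```

Calling `per_year.plot()` on the real data would then draw one line per conference, which makes missing years easier to spot than a single aggregated curve.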

***A3.1***

In [None]:
author_centric_df = pd.read_csv('data/author_centred_df.csv', index_col=0)
#The file was saved in A1 as 'data/ai_conf_papers' (without extension)
conference_centric_df = pd.read_csv('data/ai_conf_papers', index_col=0)

In [None]:
author_centric_df.head()

In [None]:
author_pub = author_centric_df.groupby('author')['author'].count()

In [None]:
plt.plot(author_pub.values)
plt.title('Publication count per author')
plt.ylabel('# publications')
plt.xlabel('authors')
plt.show()

We can see two outliers who published much more than other authors.

In [None]:
author_pub[author_pub > 400]

In [None]:
sheila_df = author_centric_df[author_centric_df.author == 'Sheila A. McIlraith']

In [None]:
sheila_df.head()

In [None]:
sheila_df['publication_year'].value_counts()

Nobody can publish this many papers in a single year. Maybe these entries credit a contribution other than authorship?

In [None]:
sheila_df['conference'].unique().tolist()

In [None]:
satinder_df = author_centric_df[author_centric_df.author == 'Satinder P. Singh (ed.)']

In [None]:
satinder_df['publication_year'].value_counts()

In [None]:
conference_centric_df.head()

In [None]:
conference_centric_df[conference_centric_df.publication_year == '2018.0'].count()

***A3.2***

In [None]:
author_centric_df = pd.read_csv('data/author_centred_df.csv', index_col=0)
#The file was saved in A1 as 'data/ai_conf_papers' (without extension)
conference_centric_df = pd.read_csv('data/ai_conf_papers', index_col=0)

In [None]:
#Number of papers per year

conf_per_year = conference_centric_df.groupby('publication_year')['publication_year'].count()

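#The x axis is the position of each raw year value in sorted order, not the year itself
plt.plot(conf_per_year.values)
plt.title('Number of papers per year (raw year values)')
plt.xlabel('raw year value (sorted)')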
plt.ylabel('# papers')
plt.show()

We see that the increase in the number of papers follows a cyclic pattern, which is not the case in reality.

In [None]:
#Looking at the year format, we can see inconsistencies.
conference_centric_df['publication_year'].sample(30)

We can see mainly four types of dates: `1993.0`, `<em>2015.0</em>`, `<i>2017.0</i>`, and `'17`.
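
The four formats listed above can also be handled by one small function applied to each value. This is only a sketch: the helper name `clean_year` and the `pivot` parameter are hypothetical, and the cut-off of 19 for two-digit years assumes the data ends in 2019.

```python
import re

def clean_year(raw, pivot=19):
    """Return the year as an int, or None if the value cannot be parsed.

    pivot is an assumption: two-digit years <= pivot are read as 20xx,
    all other two-digit years as 19xx.
    """
    if raw is None:
        return None
    # Drop HTML-like tags such as <em>, </em>, <i>, </i>.
    text = re.sub(r"<[^>]+>", "", str(raw)).strip()
    # Two-digit years written with a leading apostrophe, e.g. '17.
    match = re.fullmatch(r"'(\d{2})", text)
    if match:
        short = int(match.group(1))
        return 2000 + short if short <= pivot else 1900 + short
    try:
        return int(float(text))  # handles "1993.0" and "1993"
    except ValueError:
        return None  # e.g. "?" or "nan"

for sample in ["1993.0", "<em>2015.0</em>", "<i>2017.0</i>", "'17", "'98", "?"]:
    print(repr(sample), "->", clean_year(sample))
```

On a dataframe, the function could be applied with `df['publication_year'].map(clean_year)`, followed by a cast to the nullable `Int64` dtype.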

In [None]:
conference_centric_df['publication_year'] = conference_centric_df['publication_year'].replace({'<em>': '', '</em>': '', '<i>': '', '</i>': ''}, regex=True)

In [None]:
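#na=False treats missing years as "no apostrophe"; otherwise the boolean mask fails on NaN
comma_date_df = conference_centric_df[conference_centric_df['publication_year'].str.contains("'", na=False)]['publication_year'].replace("'", '', regex=True)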

In [None]:
conference_centric_df['publication_year'] = conference_centric_df['publication_year'].replace({'?': np.nan}, regex=False)

In [None]:
conference_centric_df['publication_year'].sample(30)

In [None]:
comma_date_df = comma_date_df.astype('int32')

In [None]:
#Two-digit years up to and including 19 belong to the 2000s (the data ends in 2019)
comma_date_df[comma_date_df <= 19] = comma_date_df[comma_date_df <= 19].values + 2000

In [None]:
comma_date_df[comma_date_df < 100 ] = comma_date_df[comma_date_df < 100].values + 1900

In [None]:
mess_dates = conference_centric_df['publication_year'] 

In [None]:
mess_dates.sample(30)

In [None]:
mess_dates.update(comma_date_df)

In [None]:
conference_centric_df['publication_year'] = mess_dates.astype('float64').astype('Int64')

In [None]:
conference_centric_df.sample(30)

In [None]:
conf_per_year = conference_centric_df.groupby('publication_year')['publication_year'].count()

#Plot against the cleaned years so that the x axis shows actual years
plt.plot(conf_per_year.index.astype(int), conf_per_year.values)
plt.title('Number of papers per year')
plt.xlabel('year')
plt.ylabel('# papers')
plt.show()

***A3.3***

In [None]:
mlconf = ['icml', 'nips']
''' Add your code here '''

***A3.4***

In [None]:
''' Add your code here '''

### A4. Author activity

For each author, calculate their normalized paper count, defined as the total number of papers divided by the author’s period of activity. An author’s period of activity is defined as the number of years between the earliest and latest papers of this author. Plot the distribution of the normalized paper count. What is the appropriate scale for the axes? Does the distribution (roughly) follow a particular law, and if yes, which one?
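
The definition above can be sketched with a single `groupby`. The toy data is invented. Note one assumption that is not part of the task statement: the period is computed as `max - min + 1`, so that an author whose papers all fall in one year gets a period of 1 instead of 0 and the division stays defined.

```python
import pandas as pd

# Toy author-centric frame: one row per (author, paper).
toy = pd.DataFrame({
    "author": ["A", "A", "A", "B", "C", "C"],
    "publication_year": [2010, 2012, 2014, 2015, 2000, 2001],
})

# Per author: number of papers, first and last year of activity.
activity = toy.groupby("author")["publication_year"].agg(["count", "min", "max"])

# Assumption: +1 so that single-year authors have a period of 1, not 0.
activity["period"] = activity["max"] - activity["min"] + 1
activity["normalized_count"] = activity["count"] / activity["period"]
print(activity)
```

Heavy-tailed count data is often easier to read on logarithmic axes, so trying `plt.xscale('log')` and `plt.yscale('log')` on the histogram is one reasonable starting point.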

In [None]:
''' Add your code here '''

## Task B. Ranking authors

As you may know, there exist rankings for universities, which represent their relative quality and are used to compare the performance of different universities. In the same vein, there are rankings for conferences and journals, which represent the importance and impact of each conference or journal, and therefore allow for approximate comparisons. In this part, you will rank authors based on different aspects of their research output.

### B1. A Naïve Score

In the absence of citation counts, it is hard to objectively rank the authors based on the impact of their contributions to the field of AI research. A naïve way would be to rank them based on their number of published papers. Obtain such a ranking and analyze your result. Identify and explain some obvious limitations of this scheme.
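
A minimal sketch of the paper-count ranking on invented data. The choice of `method="min"` for ties (tied authors share the best rank) is an assumption; other tie-breaking rules are equally defensible.

```python
import pandas as pd

# Toy author-centric frame: one row per (author, paper).
toy = pd.DataFrame({"author": ["A", "B", "A", "C", "A", "B"]})

# Number of papers per author, sorted in descending order.
paper_counts = toy["author"].value_counts()

# Rank 1 is the author with the most papers; ties share the lowest rank.
naive_rank = paper_counts.rank(ascending=False, method="min").astype(int)
print(naive_rank)
```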

In [None]:
''' Add your code here '''

### B2. H5-index

Another way to score and rank authors could be based on the quality of the conferences and journals where they publish their papers. For this task, you have to use the H5-index score from AMiner (https://aminer.org/ranks/conf) (another database of scholarly publications), which captures the quality of academic conferences: the higher the H5-index, the better the conference.
1. Load the AMiner dataset ( *'aminer_ai.tsv'* available in the folder ``data/``), which contains H5-index values for AI conferences. Load it into a new Pandas dataframe, and join it with the author-centric DBLP dataframe.
2. Calculate a *'new'* author ranking (give each author a score, by which the authors are then sorted in order to obtain the ranking), where each author's score is the sum of the H5-indices of all their papers (the H5-index of a paper being the H5-index of the conference it is published in).
3. Analyze your new, H5-index-based author ranking and explain how and why your results are different from the previous ranking. Do you see any differences in the top-20 authors based on the H5-index-based ranking and the one produced using publication counts? If yes, list the authors that are ranked in the top 20 based on publication counts but absent in the top 20 based on the H5-index-based ranking. Identify the ranks of these authors in the ranking produced by the H5-index based ranking scheme.
4. Now, take the authors in the file `data/list_of_selected_authors.txt`, and compute their rankings using the two (naïve and H5-index-based) ranking schemes. What do you observe? Explain the potential dangers of the naïve, paper-count-based score.
5. On the flip side, do you see any potential dangers of using the H5-index-based score?   
_Hint: Analyze the conferences in which the top ranked authors publish. Investigate the effect of the conferences in which these authors publish more frequently on the obtained ranking._
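
Steps 1 and 2 amount to a join followed by a grouped sum. In the sketch below, the column names `conference` and `h5_index` and all H5 values are hypothetical; the actual layout of `aminer_ai.tsv` has to be checked after loading it, and conference names may need to be normalized before they match the DBLP acronyms.

```python
import pandas as pd

# Toy author-centric frame: one row per (author, paper).
authors = pd.DataFrame({
    "author": ["A", "A", "B", "B", "B"],
    "conference": ["nips", "icml", "xconf", "xconf", "xconf"],
})

# Hypothetical H5-index table; values are invented.
h5 = pd.DataFrame({
    "conference": ["nips", "icml", "xconf"],
    "h5_index": [100, 90, 10],
})

# A left join keeps every paper, even if its conference has no H5 entry.
scored = authors.merge(h5, on="conference", how="left")

# Score per author: sum of the H5-indices of their papers.
h5_score = scored.groupby("author")["h5_index"].sum().sort_values(ascending=False)

# Naive score for comparison: number of papers.
paper_count = scored["author"].value_counts()
print(h5_score)
print(paper_count)
```

In this toy example the two schemes disagree: `B` leads by paper count, while `A` leads by H5-based score.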

***B2.1***

In [None]:
''' Add your code here '''

***B2.2***

In [None]:
''' Add your code here '''

***B2.3***

In [None]:
''' Add your code here '''

***B2.4***

In [None]:
''' Add your code here '''

***B2.5***

In [None]:
''' Add your code here '''

### B3. And Justice For All

An ideal ranking scheme should not give undue advantage to authors who have been conducting research for a longer period of time and have therefore naturally published more papers than a junior researcher. Does the ranking scheme you designed in part ``B2`` take this factor into account? If not, introduce variations in your ranking scheme to mitigate this effect. Do you observe anything odd with this new ranking? Clearly explain your observations.

_Hint: What you did in part A4 may be useful here._
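
One possible variation, sketched on invented data, divides the summed H5 score by the period of activity. The column name `h5_index` is hypothetical, and the period again uses `max - min + 1` as an assumption to avoid dividing by zero.

```python
import pandas as pd

# Toy frame: one row per (author, paper), with the paper's H5-index attached.
scored = pd.DataFrame({
    "author": ["senior"] * 4 + ["junior"] * 2,
    "publication_year": [1990, 2000, 2010, 2019, 2018, 2019],
    "h5_index": [50, 50, 50, 50, 50, 50],
})

per_author = scored.groupby("author").agg(
    h5_sum=("h5_index", "sum"),
    first_year=("publication_year", "min"),
    last_year=("publication_year", "max"),
)

# Assumption: +1 so that a single-year career counts as one year.
per_author["period"] = per_author["last_year"] - per_author["first_year"] + 1
per_author["h5_per_year"] = per_author["h5_sum"] / per_author["period"]
print(per_author.sort_values("h5_per_year", ascending=False))
```

Here the junior author overtakes the senior one (50.0 against roughly 6.7 per year). Very short activity periods can inflate the normalized score, which is one thing to watch for in the real ranking.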

In [None]:
''' Add your code here '''

## Task C. Trending topics

Historically, the field of AI has witnessed research in two broad flavors: “symbolic” (logic, planning, control, etc.) vs. “connectionist” (neural networks, deep learning, Bayesian methods, etc.). Let’s see if we can see how the popularity of these two approaches to AI is reflected in the DBLP data.

To this end, construct two dataframes: ``symbolic`` and ``connectionist``. ``symbolic`` is your paper-centric dataframe from part A1 filtered down to those papers whose titles contain at least one of the following words (not differentiating between upper and lower case letters): “logic”, “planning”, “control”; ``connectionist`` is a dataframe constructed in a similar manner, but with the words “deep”, “learning”, “feature”, “bayesian”. Plot the number of papers per year for ``symbolic`` and ``connectionist`` separately (i.e., 2 plots).
1. Describe the trends you observe. Based on these plots alone, what might one conclude about the popularity of the two approaches to AI?
2. Moving beyond these plots, what do you, as a careful data scientist, conclude about the popularity of symbolic vs. connectionist AI? Corroborate your reasoning with further plots.
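
The keyword filter can be sketched with a case-insensitive regular expression. The helper `keyword_mask` is hypothetical, the titles are invented, and matching whole words only (so that "Logical" does not count as "logic") is an assumption about how "contain the word" should be read.

```python
import re
import pandas as pd

toy = pd.DataFrame({
    "paper_title": [
        "Deep Learning for Planning",
        "Logical Foundations",
        "A Logic of Control",
        "Bayesian Feature Selection",
    ],
    "publication_year": [2018, 1995, 1996, 2010],
})

def keyword_mask(titles, words):
    """Boolean mask: True where a title contains any of the words as a whole word."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return titles.str.contains(pattern, case=False, regex=True)

symbolic = toy[keyword_mask(toy["paper_title"], ["logic", "planning", "control"])]
connectionist = toy[
    keyword_mask(toy["paper_title"], ["deep", "learning", "feature", "bayesian"])
]
print(symbolic["paper_title"].tolist())
print(connectionist["paper_title"].tolist())
```

A title can land in both dataframes, as "Deep Learning for Planning" does here, which is worth keeping in mind when comparing the two counts.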

_Note: You could use the text handling utilities below to clean the text in the paper titles._

In [None]:
# Text handling utilities
from string import punctuation
# Read the stopwords with a context manager so that the file is closed afterwards
with open('data/stopwords.txt', 'r') as f:
    stopwords_list = [x.strip() for x in f]
def stopword_remover(text):
    text_list = text.split()
    text_list = [x for x in text_list if x not in stopwords_list]
    return ' '.join(text_list)
def lowercase_all(text):
    return text.lower()
def remove_punct(text):
    return ''.join([ch for ch in text if ch not in punctuation])

In [None]:
''' Add your code here '''

In [None]:
words_symbolic = ['logic', 'planning', 'control']
''' Add your code here '''

In [None]:
words_connectionist = ['deep', 'learning', 'feature', 'bayesian']
''' Add your code here '''