**Neural Information Processing Systems (NIPS)** is a multi-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, symposia and oral and poster presentations of refereed papers. 

I've been longing to start my research in AI, specifically at the cusp of machine learning and computational/systems neuroscience, which is exactly the field that NIPS plays in. The availability of this dataset is a blessing, because I can now better understand:
- who the players contributing to cutting-edge research in this area are within the Australian research community,
- which unis are doing more than the others,
- who the up-and-comers are versus the long-timers,
- which pairs of researchers are the biggest collaborators,
- which areas they are playing in, and how that has been changing over the years,
- which benchmark datasets they have been using the most.


I'm fully cognisant of the fact that this view is pretty myopic. For starters, there are other conferences such as ICML, there are journal publications that may not get featured in or accepted into NIPS, and there is private research that may not be amenable to publishing at all. Still, I think this analysis can certainly be used directionally.

I've downloaded the data from the Kaggle NIPS dataset uploaded by Ben Hamner here (https://github.com/benhamner/nips-papers).  

## Admin stuff

In [1]:
import sqlite3
import os
import sys
import seaborn as sns
sns.set(style="darkgrid")
import pandas as pd
import numpy as np

from datetime import datetime

# Load the "autoreload" extension
%load_ext autoreload

# Reload modules marked with "%aimport" before executing each cell
%autoreload 1



In [2]:
# python-dotenv loads the entries in the .env file as environment variables, so they are accessible with os.environ.get.
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load the entries in .env as environment variables
load_dotenv(dotenv_path)

# the project root is the directory that contains .env
project_root_path = os.path.dirname(dotenv_path)

In [3]:
# creating a path for DATA directory
data_dir = os.path.join(project_root_path, 'DATA')

## Connect to database

Note: Ben has conveniently provided the data both as CSV files and as a SQLite database. I'm just gonna use the database, because the CSVs take ages to open in Sublime Text!

In [4]:
# build the path to the database file
database_path = os.path.join(data_dir, 'database.sqlite')

In [5]:
cnx = sqlite3.connect(database_path)
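
Before querying, it can help to check which tables the database actually contains. The sketch below is not part of the original notebook: `list_tables` is a hypothetical helper, and the in-memory database with three empty tables is only a stand-in for `database.sqlite` so that the snippet runs on its own.

```python
import sqlite3


def list_tables(cnx):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = cnx.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database (assumption): three empty tables named like the ones
# queried later in this notebook.
demo_cnx = sqlite3.connect(":memory:")
for table in ("papers", "authors", "paper_authors"):
    demo_cnx.execute(f"CREATE TABLE {table} (id INTEGER)")

table_names = list_tables(demo_cnx)
demo_cnx.close()
```

Against the real file, the same helper would be called as `list_tables(cnx)`.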

## Query the database

### Querying the Papers table

In [6]:
df_papers = pd.read_sql("Select * from papers;", cnx)

In [7]:
df_papers.head()

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,2,1987,The Capacity of the Kanerva Associative Memory...,,2-the-capacity-of-the-kanerva-associative-memo...,Abstract Missing,184\n\nTHE CAPACITY OF THE KANERVA ASSOCIATIVE...
2,3,1987,Supervised Learning of Probability Distributio...,,3-supervised-learning-of-probability-distribut...,Abstract Missing,52\n\nSupervised Learning of Probability Distr...
3,4,1987,Constrained Differential Optimization,,4-constrained-differential-optimization.pdf,Abstract Missing,612\n\nConstrained Differential Optimization\n...
4,5,1987,Towards an Organizing Principle for a Layered ...,,5-towards-an-organizing-principle-for-a-layere...,Abstract Missing,485\n\nTOWARDS AN ORGANIZING PRINCIPLE FOR\nA ...


In [8]:
df_papers.shape

(6560, 7)

In [9]:
df_papers.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6560 entries, 0 to 6559
Data columns (total 7 columns):
id            6560 non-null int64
year          6560 non-null int64
title         6560 non-null object
event_type    6560 non-null object
pdf_name      6560 non-null object
abstract      6560 non-null object
paper_text    6560 non-null object
dtypes: int64(2), object(5)
memory usage: 177.2 MB


The CSVs were struggling to open because of their size (the papers table alone takes about 177 MB in memory). That's why I used SQLite.
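
Most of that memory is presumably the full `paper_text` column, so one option when only metadata is needed is to leave it out of the query. This is a rough sketch rather than what the notebook does: `load_paper_metadata` is a hypothetical helper, and the one-row in-memory table is made up so the snippet is self-contained.

```python
import sqlite3

import pandas as pd


def load_paper_metadata(cnx):
    """Load every column of the papers table except the bulky paper_text."""
    query = "SELECT id, year, title, event_type, pdf_name, abstract FROM papers;"
    return pd.read_sql(query, cnx)


# Stand-in database (assumption): same column names as the real papers table,
# with a single made-up row.
demo_cnx = sqlite3.connect(":memory:")
demo_cnx.execute(
    "CREATE TABLE papers (id INTEGER, year INTEGER, title TEXT, "
    "event_type TEXT, pdf_name TEXT, abstract TEXT, paper_text TEXT)"
)
demo_cnx.execute(
    "INSERT INTO papers VALUES "
    "(1, 1987, 'Toy title', '', '1-toy.pdf', 'Abstract Missing', 'long text')"
)

df_meta = load_paper_metadata(demo_cnx)
demo_cnx.close()
```

The full text can still be pulled later for just the papers of interest, for example with a `WHERE id IN (...)` clause.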

### Querying the authors table

In [10]:
df_authors = pd.read_sql("Select * from authors;", cnx)

In [11]:
df_authors.head()

Unnamed: 0,id,name
0,1,Hisashi Suzuki
1,2,Suguru Arimoto
2,3,Philip A. Chou
3,4,John C. Platt
4,5,Alan H. Barr


In [12]:
df_authors.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8653 entries, 0 to 8652
Data columns (total 2 columns):
id      8653 non-null int64
name    8653 non-null object
dtypes: int64(1), object(1)
memory usage: 665.9 KB


### Querying the paper author key table

In [13]:
df_key = pd.read_sql("Select * from paper_authors;", cnx)

In [14]:
df_key.head()

Unnamed: 0,id,paper_id,author_id
0,1,63,94
1,2,80,124
2,3,80,125
3,4,80,126
4,5,80,127
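
Paper 80 shows up four times in the output above, once per author, which is what you'd expect from a link table. A minimal sketch of how the three tables could be stitched together with `merge` follows; the toy frames reuse the column names shown in this notebook, but the rows, titles, and names are made up, and `attach_authors` is a hypothetical helper.

```python
import pandas as pd

# Toy frames (assumption): same column names as df_papers, df_authors, df_key.
toy_papers = pd.DataFrame(
    {"id": [63, 80], "year": [1987, 1987], "title": ["Paper A", "Paper B"]}
)
toy_authors = pd.DataFrame(
    {"id": [94, 124, 125], "name": ["Author X", "Author Y", "Author Z"]}
)
toy_key = pd.DataFrame(
    {"id": [1, 2, 3], "paper_id": [63, 80, 80], "author_id": [94, 124, 125]}
)


def attach_authors(papers, authors, key):
    """Return one row per (paper, author) pair with the title and author name."""
    paper_cols = papers[["id", "year", "title"]].rename(columns={"id": "paper_id"})
    author_cols = authors.rename(columns={"id": "author_id"})
    merged = key.merge(paper_cols, on="paper_id", how="left")
    merged = merged.merge(author_cols, on="author_id", how="left")
    return merged[["paper_id", "year", "title", "author_id", "name"]]


paper_author_rows = attach_authors(toy_papers, toy_authors, toy_key)
```

With the real data this would be `attach_authors(df_papers, df_authors, df_key)`. Renaming both `id` columns first avoids clashing with the key table's own `id`.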


In [15]:
df_key.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18321 entries, 0 to 18320
Data columns (total 3 columns):
id           18321 non-null int64
paper_id     18321 non-null int64
author_id    18321 non-null int64
dtypes: int64(3)
memory usage: 429.5 KB
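
This key table is also enough to get a first answer to the "biggest collaborators" question. One possible approach, sketched here on made-up rows rather than the real `df_key`, is to count how many papers each unordered pair of author ids shares; `count_coauthor_pairs` is a hypothetical helper, not something from the dataset.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def count_coauthor_pairs(key):
    """Count how many papers each unordered pair of author ids shares."""
    pair_counts = Counter()
    for _, author_ids in key.groupby("paper_id")["author_id"]:
        # De-duplicate and sort so each pair is always stored as (low, high)
        unique_ids = sorted(int(author_id) for author_id in set(author_ids))
        pair_counts.update(combinations(unique_ids, 2))
    return pair_counts


# Toy key table (assumption): authors 124 and 125 share two papers.
toy_key = pd.DataFrame(
    {
        "paper_id": [63, 80, 80, 80, 81, 81],
        "author_id": [94, 124, 125, 126, 124, 125],
    }
)

pair_counts = count_coauthor_pairs(toy_key)
top_pair, top_count = pair_counts.most_common(1)[0]
```

Single-author papers contribute no pairs, and the ids would still need to be mapped back to names through the authors table.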


## Start analysing the hell out of it

### How big are the papers?

In [16]:
df_papers['paper_length'] = df_papers['paper_text'].str.split().apply(len)
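
To be clear about what `paper_length` measures: it is the number of whitespace-separated tokens in the extracted text, so page numbers and other extraction leftovers get counted too. The sketch below applies the same expression to a toy frame (the years and texts are made up) and then summarises it per year, which is one way the column could be used.

```python
import pandas as pd

# Toy stand-in for df_papers (assumption): made-up years and texts.
toy_papers = pd.DataFrame(
    {
        "year": [1987, 1987, 2016],
        "paper_text": [
            "767\n\nSELF-ORGANIZATION OF ASSOCIATIVE",
            "one two",
            "a b c d e f",
        ],
    }
)

# Same expression as the cell above: split on whitespace, then count tokens
toy_papers["paper_length"] = toy_papers["paper_text"].str.split().apply(len)

# Median token count per year
median_length_by_year = toy_papers.groupby("year")["paper_length"].median()
```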

### Simple test - How many papers mention UNSW?

In [17]:
df_papers['paper_text'].str.contains('UNSW').sum()

5

In [18]:
df_papers['paper_text'].str.contains('University of New South Wales').sum()

10

In [19]:
df_papers[df_papers['paper_text'].str.contains('University of New South Wales')]

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text,paper_length
3580,3619,2008,A computational model of hippocampal function ...,,3619-a-computational-model-of-hippocampal-func...,We present a new reinforcement-learning model ...,A computational model of hippocampal function ...,4392
4402,4441,2011,Generalized Lasso based Approximation of Spars...,,4441-generalized-lasso-based-approximation-of-...,"Sparse coding, a method of explaining sensory ...",Generalized Lasso based Approximation of Spars...,5549
4924,4964,2013,Projecting Ising Model Parameters for Fast Mixing,Poster,4964-projecting-ising-model-parameters-for-fas...,Inference in general Ising models is difficult...,Projecting Ising Model Parameters for Fast Mix...,5093
5131,5171,2013,Factorized Asymptotic Bayesian Inference for L...,Poster,5171-factorized-asymptotic-bayesian-inference-...,This paper extends factorized asymptotic Bayes...,Factorized Asymptotic Bayesian Inference\nfor ...,5776
5272,5315,2014,Projecting Markov Random Field Parameters for ...,Poster,5315-projecting-markov-random-field-parameters...,Markov chain Monte Carlo (MCMC) algorithms are...,Projecting Markov Random Field Parameters for\...,5122
5331,5374,2014,Automated Variational Inference for Gaussian P...,Poster,5374-automated-variational-inference-for-gauss...,We develop an automated variational method for...,Automated Variational Inference\nfor Gaussian ...,5133
5410,5453,2014,(Almost) No Label No Cry,Spotlight,5453-almost-no-label-no-cry.pdf,"In Learning with Label Proportions (LLP), the ...","(Almost) No Label No Cry\n\nGiorgio Patrini1,2...",6630
5412,5455,2014,Extended and Unscented Gaussian Processes,Spotlight,5455-extended-and-unscented-gaussian-processes...,We present two new methods for inference in Ga...,Extended and Unscented Gaussian Processes\nDan...,5355
5622,5665,2015,Scalable Inference for Gaussian Process Models...,Poster,5665-scalable-inference-for-gaussian-process-m...,We propose a sparse method for scalable automa...,Scalable Inference for Gaussian Process Models...,5226
6200,6243,2016,Infinite Hidden Semi-Markov Modulated Interact...,Poster,6243-infinite-hidden-semi-markov-modulated-int...,The correlation between events is ubiquitous a...,Infinite Hidden Semi-Markov Modulated Interact...,5639
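
The two counts above can't simply be added up, because a paper that uses both spellings would be counted twice. A possible way around this, sketched on made-up strings, is to build a single mask that matches either spelling; `mentions_any` is a hypothetical helper, and a text mention is only a rough proxy for affiliation (the name could just as well sit in the references).

```python
import re

import pandas as pd


def mentions_any(texts, names):
    """Boolean mask: True where a text contains at least one of the names."""
    pattern = "|".join(re.escape(name) for name in names)
    return texts.str.contains(pattern, regex=True, na=False)


# Toy texts (assumption): the second one uses both spellings.
toy_texts = pd.Series(
    [
        "School of CSE, UNSW, Sydney",
        "University of New South Wales and UNSW Canberra",
        "University of Sydney",
    ]
)

unsw_mask = mentions_any(toy_texts, ["UNSW", "University of New South Wales"])
unsw_count = int(unsw_mask.sum())
```

On the real data the call would be `mentions_any(df_papers['paper_text'], [...])`, and the mask can be used directly to filter `df_papers`.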


# TO BE CONTINUED...

## Close the connection

In [20]:
cnx.close()

FIN.