**Neural Information Processing Systems (NIPS)** is a multi-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, symposia and oral and poster presentations of refereed papers. 

I've been longing to start my research in AI, specifically at the cusp of machine learning and computational/systems neuroscience, which is exactly the field that NIPS plays in. The availability of this dataset is a blessing, because I can now better understand:
- who the players are that contribute to cutting-edge research in this area within the Australian research community,
- which unis are doing more than the others,
- who the up-and-comers are versus the long-timers,
- which pairs of researchers are the biggest collaborators,
- which areas they are playing in, and how that has been changing over the years,
- which benchmark datasets they have mostly been using.


I'm fully cognisant of the fact that this view is pretty myopic. For starters, there are other conferences such as ICML, journal publications that never get submitted to or accepted into NIPS, and private research that may not be amenable to publishing. However, I think this analysis can certainly be used directionally.

I've downloaded the data from the NIPS papers dataset that Ben Hamner uploaded to Kaggle; the code behind it is here (https://github.com/benhamner/nips-papers).

## Admin stuff

In [1]:
import sqlite3
import os
import sys
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
sns.set(style="darkgrid")
import pandas as pd
import numpy as np

from datetime import datetime

# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1



In [2]:
# python-dotenv can load the entries of a .env file as environment variables (accessible with os.environ.get).
# Here it is only used to locate the .env file, and through it the project root.
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# find root directory path
project_root_path = os.path.dirname(find_dotenv())
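
One thing to watch: `find_dotenv()` returns an empty string when no `.env` file is found, so `project_root_path` silently becomes `''` and the data path turns into a relative one. Here is a minimal sketch of the same walk-up idea using only the standard library. `find_project_root` is a hypothetical helper of mine, not part of python-dotenv.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = ".env") -> Optional[Path]:
    """Walk up from `start` and return the first directory containing `marker`.

    Hypothetical helper (not part of python-dotenv); returns None if nothing is found.
    """
    start = start.resolve()
    for directory in [start, *start.parents]:
        if (directory / marker).is_file():
            return directory
    return None


# Usage: make the "not found" case explicit instead of falling back to a relative path.
root = find_project_root(Path.cwd())
if root is None:
    print("No .env found above the current directory")
else:
    print("Project root:", root)
```

Returning `None` forces the caller to decide what to do when the marker file is missing, rather than carrying on with an empty path.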

In [3]:
# creating a path for DATA directory
data_dir = os.path.join(project_root_path, 'DATA')

## Function to munge and filter country specific papers and authors

In [1]:
# function for filtering and transforming country specific papers and authors
def nips_filter_country_specific(country_name):
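    '''
    PURPOSE
    *******
    Flag the papers whose first 500 words mention the given country, then attach
    author names to the flagged papers via the paper_authors key table.
    
    ARGUMENTS
    *********
    country_name: string denoting the name of the country you want to filter the dataset on
                  ex: 'Australia'
    
    REQUIRES
    ********
    DATA TO BE DOWNLOADED AND STORED INTO AN SQLITE DATABASE USING BEN HAMNER'S SCRIPT. CHECK SCRIPTS FOLDER
    A module-level `data_dir` pointing at the folder that holds database.sqlite
    pip install python-dotenv
    import os
    import sqlite3
    import numpy as np
    import pandas as pd
    
    RETURNS
    *******
    1) df_papers: all papers, with paper_length, paper_text_500_words and country_paper added
    2) df_authors: the raw authors table
    3) df_country: one row per (paper, author) pair for the flagged papers
    
    TO DO
    *****
    need to do error checking
    '''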
    
    # create a database connection
    database_path = os.path.join(data_dir, 'database.sqlite')
    cnx = sqlite3.connect(database_path)
    
    # download the raw datasets for author, paper, relationship key
    df_papers = pd.read_sql("Select * from papers;", cnx)
    df_authors = pd.read_sql("Select * from authors;", cnx)
    df_key = pd.read_sql("Select * from paper_authors;", cnx)
    
    # create a feature for paper length
    df_papers['paper_length'] = df_papers['paper_text'].str.split().apply(len)
    
    # get the first 500 words of the paper text and store it in a new variable "paper_text_500_words"
    df_papers['paper_text_500_words'] = df_papers['paper_text'].apply(lambda x: ' '.join(x.split()[:500]))
    
    # flag the papers likely to have been published by authors from the country specified by the user
    df_papers['country_paper'] = np.where(df_papers['paper_text_500_words'].str.contains(country_name),1,0)
    
    # bring in authors by merging papers with authors using the key table
    df_country = pd.merge(left=df_papers[df_papers.country_paper == 1], right=df_key, left_on='id', right_on='paper_id',suffixes=('_papers', '_key'))
    df_country = pd.merge(left = df_country, right = df_authors, how='left',left_on='author_id', right_on='id',suffixes=('_country', '_author'))

    df_country.drop(['id_papers','event_type','abstract','country_paper','id_key','paper_id','author_id','id'], axis = 1, inplace = True)

    # close the connection
    cnx.close()
    
    return df_papers, df_authors, df_country
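
The docstring's TO DO mentions error checking. A minimal sketch of what that could look like follows; `check_inputs` is a hypothetical helper, not something the function above calls. The file check matters because `sqlite3.connect` happily creates a new, empty database when the path doesn't exist, so the failure would only surface later as a "no such table" error.

```python
import os


def check_inputs(country_name, database_path):
    """Hypothetical pre-flight checks; raises before any query is run."""
    if not isinstance(country_name, str) or not country_name.strip():
        raise ValueError("country_name must be a non-empty string, e.g. 'Australia'")
    if not os.path.isfile(database_path):
        raise FileNotFoundError(f"SQLite database not found at {database_path}")
    return country_name.strip()


# Usage: a missing database is reported up front.
try:
    check_inputs("Australia", "no_such_dir/database.sqlite")
except FileNotFoundError as error:
    print(error)
```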

## Connect to database

The GitHub link above has a script for extracting NIPS papers data from the web. That's like teaching us how to fish. This time, however, I just want to eat the fish. Ben has also published the extracted data in CSV and SQLite db format here (https://www.kaggle.com/benhamner/nips-2015-papers).

I tried playing around with the CSVs and they were a bit sluggish due to their size. So I'm going to be using the SQLite db that I've downloaded into my data directory.

In [5]:
# create a pathway to the database
database_path = os.path.join(data_dir, 'database.sqlite')

In [6]:
cnx = sqlite3.connect(database_path)
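
Before querying, it can help to check which tables the database actually contains. This is a minimal sketch using SQLite's built-in `sqlite_master` catalogue. The in-memory database and the `list_tables` helper are stand-ins of mine so the snippet runs anywhere; they are not part of Ben's dataset.

```python
import sqlite3

# An in-memory database stands in for database.sqlite (assumption for illustration).
demo_cnx = sqlite3.connect(":memory:")
demo_cnx.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, year INTEGER, title TEXT)")
demo_cnx.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")


def list_tables(connection):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


tables = list_tables(demo_cnx)
print(tables)
demo_cnx.close()
```

Pointing the same helper at `cnx` should list the tables queried in the next section.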

## Query the database

### Querying the Papers table

In [7]:
df_papers = pd.read_sql("Select * from papers;", cnx)

In [8]:
df_papers.head()

| | id | year | title | event_type | pdf_name | abstract | paper_text |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Self-Organization of Associative Database and ... | | 1-self-organization-of-associative-database-an... | Abstract Missing | 767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA... |
| 1 | 2 | 1987 | The Capacity of the Kanerva Associative Memory... | | 2-the-capacity-of-the-kanerva-associative-memo... | Abstract Missing | 184\n\nTHE CAPACITY OF THE KANERVA ASSOCIATIVE... |
| 2 | 3 | 1987 | Supervised Learning of Probability Distributio... | | 3-supervised-learning-of-probability-distribut... | Abstract Missing | 52\n\nSupervised Learning of Probability Distr... |
| 3 | 4 | 1987 | Constrained Differential Optimization | | 4-constrained-differential-optimization.pdf | Abstract Missing | 612\n\nConstrained Differential Optimization\n... |
| 4 | 5 | 1987 | Towards an Organizing Principle for a Layered ... | | 5-towards-an-organizing-principle-for-a-layere... | Abstract Missing | 485\n\nTOWARDS AN ORGANIZING PRINCIPLE FOR\nA ... |


In [9]:
df_papers.shape

(6560, 7)

In [10]:
# note: memory_usage='deep' makes info() report a more accurate estimate of memory usage. Otherwise it just says
# something to the effect of 3.5+ KB
df_papers.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6560 entries, 0 to 6559
Data columns (total 7 columns):
id            6560 non-null int64
year          6560 non-null int64
title         6560 non-null object
event_type    6560 non-null object
pdf_name      6560 non-null object
abstract      6560 non-null object
paper_text    6560 non-null object
dtypes: int64(2), object(5)
memory usage: 177.2 MB
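
To see why the `deep` option matters, here is a small sketch on a toy frame (made-up data, not the NIPS tables). For object columns, the shallow figure only counts the references held in the column, while the deep figure also counts the strings they point to.

```python
import pandas as pd

# Toy frame with one long text column, forced to object dtype for illustration.
toy = pd.DataFrame({
    "id": [1, 2],
    "paper_text": pd.Series(["short", "a much longer string " * 1000], dtype="object"),
})

shallow = toy.memory_usage(deep=False).sum()  # references only
deep = toy.memory_usage(deep=True).sum()      # references plus string contents
print(shallow, deep)
```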


The CSVs were struggling to open due to their size (the papers table alone takes about 177 MB in memory). That's why I used SQLite.

### Querying the authors table

In [11]:
df_authors = pd.read_sql("Select * from authors;", cnx)

In [12]:
df_authors.head()

| | id | name |
|---|---|---|
| 0 | 1 | Hisashi Suzuki |
| 1 | 2 | Suguru Arimoto |
| 2 | 3 | Philip A. Chou |
| 3 | 4 | John C. Platt |
| 4 | 5 | Alan H. Barr |


In [13]:
df_authors.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8653 entries, 0 to 8652
Data columns (total 2 columns):
id      8653 non-null int64
name    8653 non-null object
dtypes: int64(1), object(1)
memory usage: 665.9 KB


### Querying the paper author key table

In [14]:
df_key = pd.read_sql("Select * from paper_authors;", cnx)

In [15]:
df_key.head()

| | id | paper_id | author_id |
|---|---|---|---|
| 0 | 1 | 63 | 94 |
| 1 | 2 | 80 | 124 |
| 2 | 3 | 80 | 125 |
| 3 | 4 | 80 | 126 |
| 4 | 5 | 80 | 127 |


In [16]:
df_key.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18321 entries, 0 to 18320
Data columns (total 3 columns):
id           18321 non-null int64
paper_id     18321 non-null int64
author_id    18321 non-null int64
dtypes: int64(3)
memory usage: 429.5 KB


## Start analysing the hell out of it

### How big are the papers?

In [17]:
df_papers['paper_length'] = df_papers['paper_text'].str.split().apply(len)

### Simple test - How many papers have been submitted by UNSW?

In [18]:
df_papers['paper_text'].str.contains('UNSW').sum()

5

In [19]:
df_papers['paper_text'].str.contains('University of New South Wales').sum()

10

In [20]:
df_papers[df_papers['paper_text'].str.contains('University of New South Wales')]

| | id | year | title | event_type | pdf_name | abstract | paper_text | paper_length |
|---|---|---|---|---|---|---|---|---|
| 3580 | 3619 | 2008 | A computational model of hippocampal function ... | | 3619-a-computational-model-of-hippocampal-func... | We present a new reinforcement-learning model ... | A computational model of hippocampal function ... | 4392 |
| 4402 | 4441 | 2011 | Generalized Lasso based Approximation of Spars... | | 4441-generalized-lasso-based-approximation-of-... | Sparse coding, a method of explaining sensory ... | Generalized Lasso based Approximation of Spars... | 5549 |
| 4924 | 4964 | 2013 | Projecting Ising Model Parameters for Fast Mixing | Poster | 4964-projecting-ising-model-parameters-for-fas... | Inference in general Ising models is difficult... | Projecting Ising Model Parameters for Fast Mix... | 5093 |
| 5131 | 5171 | 2013 | Factorized Asymptotic Bayesian Inference for L... | Poster | 5171-factorized-asymptotic-bayesian-inference-... | This paper extends factorized asymptotic Bayes... | Factorized Asymptotic Bayesian Inference\nfor ... | 5776 |
| 5272 | 5315 | 2014 | Projecting Markov Random Field Parameters for ... | Poster | 5315-projecting-markov-random-field-parameters... | Markov chain Monte Carlo (MCMC) algorithms are... | Projecting Markov Random Field Parameters for\... | 5122 |
| 5331 | 5374 | 2014 | Automated Variational Inference for Gaussian P... | Poster | 5374-automated-variational-inference-for-gauss... | We develop an automated variational method for... | Automated Variational Inference\nfor Gaussian ... | 5133 |
| 5410 | 5453 | 2014 | (Almost) No Label No Cry | Spotlight | 5453-almost-no-label-no-cry.pdf | In Learning with Label Proportions (LLP), the ... | (Almost) No Label No Cry\n\nGiorgio Patrini1,2... | 6630 |
| 5412 | 5455 | 2014 | Extended and Unscented Gaussian Processes | Spotlight | 5455-extended-and-unscented-gaussian-processes... | We present two new methods for inference in Ga... | Extended and Unscented Gaussian Processes\nDan... | 5355 |
| 5622 | 5665 | 2015 | Scalable Inference for Gaussian Process Models... | Poster | 5665-scalable-inference-for-gaussian-process-m... | We propose a sparse method for scalable automa... | Scalable Inference for Gaussian Process Models... | 5226 |
| 6200 | 6243 | 2016 | Infinite Hidden Semi-Markov Modulated Interact... | Poster | 6243-infinite-hidden-semi-markov-modulated-int... | The correlation between events is ubiquitous a... | Infinite Hidden Semi-Markov Modulated Interact... | 5639 |
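
The two searches above count overlapping sets of papers (a paper can contain both spellings), so the 5 and the 10 can't simply be added up. A minimal sketch of counting each paper once with a single pattern follows. The sample texts and the `mentions_unsw` helper are made up for illustration. Since `str.contains` treats its pattern as a regular expression by default, the same pattern string should also work there.

```python
import re

# One pattern covering both spellings; \b keeps "UNSW" from matching inside longer tokens.
UNSW_PATTERN = re.compile(r"\bUNSW\b|University of New South Wales")


def mentions_unsw(text):
    """Return True if either spelling of the institution appears in `text`."""
    return UNSW_PATTERN.search(text) is not None


# Made-up snippets standing in for paper_text values.
samples = [
    "School of CSE, UNSW, Sydney",
    "University of New South Wales, Australia",
    "UNSW (University of New South Wales)",
    "Massachusetts Institute of Technology",
]

unsw_count = sum(mentions_unsw(text) for text in samples)
print(unsw_count)
```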


### How about Australia as a whole?

In [21]:
df_papers['paper_text'].str.contains('Australia').sum()

263

In [22]:
df_papers[df_papers['paper_text'].str.contains('Australia')].head()

| | id | year | title | event_type | pdf_name | abstract | paper_text | paper_length |
|---|---|---|---|---|---|---|---|---|
| 262 | 269 | 1989 | Predicting Weather Using a Genetic Memory: A C... | | 269-predicting-weather-using-a-genetic-memory-... | Abstract Missing | Predicting Weather Using a Genetic Memory\n\nP... | 3861 |
| 280 | 289 | 1989 | Using Local Models to Control Movement | | 289-using-local-models-to-control-movement.pdf | Abstract Missing | 316\n\nAtkeson\n\nUsing Local Models to Contro... | 2926 |
| 375 | 386 | 1990 | e-Entropy and the Complexity of Feedforward Ne... | | 386-e-entropy-and-the-complexity-of-feedforwar... | Abstract Missing | c-Entropy and the Complexity of\nFeedforward N... | 2341 |
| 398 | 410 | 1990 | Comparison of three classification techniques:... | | 410-comparison-of-three-classification-techniq... | Abstract Missing | Comparison of three classification techniques,... | 3114 |
| 422 | 434 | 1990 | Direct memory access using two cues: Finding t... | | 434-direct-memory-access-using-two-cues-findin... | Abstract Missing | Direct memory access using two cues: Finding\n... | 3158 |


This gives me all kinds of junk. The row with dataframe index 262, "Predicting Weather Using a Genetic Memory", has the word Australia inside the body of the text while describing something near the Australian coast. One way of handling this is to search for the word Australia only within the first 500 words of the paper_text, which is where author affiliations normally show up.

In [23]:
# get the first 500 words of the paper text and store it in a new variable "paper_text_500_words"
df_papers['paper_text_500_words'] = df_papers['paper_text'].apply(lambda x: ' '.join(x.split()[:500]))

In [24]:
df_papers['paper_text_500_words'].str.contains('Australia').sum()

154

It's not bulletproof. Some Australian authors may have published without university affiliations. In such cases, there's no reason why the word 'Australia' would figure in the opening of their papers. However, I can't think of a better way to approach the problem. So it will have to do!
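
To make the heuristic and its blind spot concrete, here is a plain-Python sketch on two made-up papers. `mentions_in_opening` is a hypothetical helper that mirrors the "first 500 words" logic; the texts are invented for illustration.

```python
def mentions_in_opening(text, keyword, n_words=500):
    """Return True if `keyword` appears within the first `n_words` whitespace-separated words."""
    opening = " ".join(text.split()[:n_words])
    return keyword in opening


filler = "word " * 1000

# Affiliation near the top: picked up by the heuristic.
affiliated = "A Toy Title\nJane Doe\nSome University, Canberra, Australia\n" + filler

# Country mentioned only deep in the body: correctly ignored.
body_only = "Another Title\nJohn Roe\nSome Lab, Boston, USA\n" + filler + "off the coast of Australia"

flag_affiliated = mentions_in_opening(affiliated, "Australia")
flag_body_only = mentions_in_opening(body_only, "Australia")
print(flag_affiliated, flag_body_only)
```

A side effect of the split-and-join step is that whitespace gets normalised, so a multi-word keyword broken across a line in the PDF text would still match within the opening.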

WOW. 154 papers from Australia out of the 6,560 papers. Still, it's not too shabby, and I have my playground at last.

In [25]:
df_papers[df_papers['paper_text_500_words'].str.contains('Australia')].head()

| | id | year | title | event_type | pdf_name | abstract | paper_text | paper_length | paper_text_500_words |
|---|---|---|---|---|---|---|---|---|---|
| 375 | 386 | 1990 | e-Entropy and the Complexity of Feedforward Ne... | | 386-e-entropy-and-the-complexity-of-feedforwar... | Abstract Missing | c-Entropy and the Complexity of\nFeedforward N... | 2341 | c-Entropy and the Complexity of Feedforward Ne... |
| 398 | 410 | 1990 | Comparison of three classification techniques:... | | 410-comparison-of-three-classification-techniq... | Abstract Missing | Comparison of three classification techniques,... | 3114 | Comparison of three classification techniques,... |
| 422 | 434 | 1990 | Direct memory access using two cues: Finding t... | | 434-direct-memory-access-using-two-cues-findin... | Abstract Missing | Direct memory access using two cues: Finding\n... | 3158 | Direct memory access using two cues: Finding t... |
| 430 | 442 | 1991 | Splines, Rational Functions and Neural Networks | | 442-splines-rational-functions-and-neural-netw... | Abstract Missing | Splines, Rational Functions and Neural Network... | 2841 | Splines, Rational Functions and Neural Network... |
| 456 | 468 | 1991 | Operators and curried functions: Training and ... | | 468-operators-and-curried-functions-training-a... | Abstract Missing | Operators and curried functions:\nTraining and... | 2674 | Operators and curried functions: Training and ... |


In [26]:
df_papers['ozzie'] = np.where(df_papers['paper_text_500_words'].str.contains('Australia'),1,0)

In [27]:
df_papers.ozzie.value_counts()

0    6406
1     154
Name: ozzie, dtype: int64

## Let's get authors into the mix

In [28]:
df_authors.head()

| | id | name |
|---|---|---|
| 0 | 1 | Hisashi Suzuki |
| 1 | 2 | Suguru Arimoto |
| 2 | 3 | Philip A. Chou |
| 3 | 4 | John C. Platt |
| 4 | 5 | Alan H. Barr |


#### Merge with the key and the author tables

In [29]:
df_ozzie = pd.merge(left=df_papers[df_papers.ozzie == 1], right=df_key, left_on='id', right_on='paper_id',
                    suffixes=('_papers', '_key'))
df_ozzie = pd.merge(left = df_ozzie, right = df_authors, how='left',left_on='author_id', right_on='id',
                    suffixes=('_ozzie', '_author'))
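
The two-step merge is easier to follow on toy tables (made-up data with the same column names as the real tables). The sketch also shows where names like `id_papers` and `id_key` come from, and why the result has one row per (paper, author) pair rather than one row per paper.

```python
import pandas as pd

# Toy stand-ins for the papers, authors and paper_authors tables.
papers = pd.DataFrame({"id": [10, 11], "title": ["Paper A", "Paper B"]})
authors = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
key = pd.DataFrame({"id": [100, 101, 102], "paper_id": [10, 10, 11], "author_id": [1, 2, 3]})

# Step 1: papers -> key table. Both sides have an "id" column, so the suffixes kick in.
step1 = pd.merge(papers, key, left_on="id", right_on="paper_id", suffixes=("_papers", "_key"))

# Step 2: attach author names. No column names clash any more, so "id" stays unsuffixed.
step2 = pd.merge(step1, authors, how="left", left_on="author_id", right_on="id",
                 suffixes=("_ozzie", "_author"))

print(list(step2.columns))
print(step2[["title", "name"]])
```

Paper A has two authors, so it appears twice in the result.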

#### Remove the extraneous columns just to keep your sanity

In [30]:
df_ozzie.drop(['id_papers','event_type','abstract','ozzie','id_key',
               'paper_id','author_id','id'], axis = 1, inplace = True)
df_ozzie.head()

| | year | title | pdf_name | paper_text | paper_length | paper_text_500_words | name |
|---|---|---|---|---|---|---|---|
| 0 | 1990 | e-Entropy and the Complexity of Feedforward Ne... | 386-e-entropy-and-the-complexity-of-feedforwar... | c-Entropy and the Complexity of\nFeedforward N... | 2341 | c-Entropy and the Complexity of Feedforward Ne... | Robert C. Williamson |
| 1 | 1990 | Comparison of three classification techniques:... | 410-comparison-of-three-classification-techniq... | Comparison of three classification techniques,... | 3114 | Comparison of three classification techniques,... | A. C. Tsoi |
| 2 | 1990 | Comparison of three classification techniques:... | 410-comparison-of-three-classification-techniq... | Comparison of three classification techniques,... | 3114 | Comparison of three classification techniques,... | R. A. Pearson |
| 3 | 1990 | Direct memory access using two cues: Finding t... | 434-direct-memory-access-using-two-cues-findin... | Direct memory access using two cues: Finding\n... | 3158 | Direct memory access using two cues: Finding t... | Janet Wiles |
| 4 | 1990 | Direct memory access using two cues: Finding t... | 434-direct-memory-access-using-two-cues-findin... | Direct memory access using two cues: Finding\n... | 3158 | Direct memory access using two cues: Finding t... | Michael S. Humphreys |


## Close the connection

In [31]:
cnx.close()

## Test the function

In [32]:
df_ozzie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 448 entries, 0 to 447
Data columns (total 7 columns):
year                    448 non-null int64
title                   448 non-null object
pdf_name                448 non-null object
paper_text              448 non-null object
paper_length            448 non-null int64
paper_text_500_words    448 non-null object
name                    448 non-null object
dtypes: int64(2), object(5)
memory usage: 28.0+ KB


In [33]:
df_papers, df_authors, df_ozzie_fxn = nips_filter_country_specific('Australia')

In [34]:
df_ozzie_fxn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 448 entries, 0 to 447
Data columns (total 7 columns):
year                    448 non-null int64
title                   448 non-null object
pdf_name                448 non-null object
paper_text              448 non-null object
paper_length            448 non-null int64
paper_text_500_words    448 non-null object
name                    448 non-null object
dtypes: int64(2), object(5)
memory usage: 28.0+ KB


### Check whether the 2 dataframes are equal

In [35]:
# pandas.util.testing is deprecated; the public location is pandas.testing
from pandas.testing import assert_frame_equal
assert_frame_equal(df_ozzie_fxn.reset_index(drop=True), df_ozzie.reset_index(drop=True))
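
No output here is the good outcome: `assert_frame_equal` stays silent when the frames match and raises an `AssertionError` when they don't. A small sketch on toy frames (made-up data) shows both cases.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"year": [1990, 1991], "name": ["A", "B"]})
same = left.copy()
different = left.assign(year=[1990, 2000])

# Identical frames: passes silently.
assert_frame_equal(left, same)

# One differing value: raises AssertionError, which is caught here for illustration.
try:
    assert_frame_equal(left, different)
    frames_match = True
except AssertionError:
    frames_match = False

print(frames_match)
```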

FIN.