<center>
<h1>
Optimal Academic Funding
</h1>

<h3>
Tu Anh Nguyen
</h3>
<h4>
Tarleton State University, Stephenville, TX
</h4>
<h4>
12/04/2017
</h4>

</center>

# Introduction

This project is a preliminary study of the effect of varying funding levels. Our goal is to find a distribution of funding that maximizes the number of papers produced each year. Please note that this research focuses only on academic papers in science, technology, engineering, and mathematics (STEM). We hope to expand our research into other disciplines to obtain a broader picture.

## Data Collection

The data we use for this project are collected from arXiv (https://arxiv.org/). arXiv is a repository that contains papers from Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, Statistics, Electrical Engineering and Systems Science, and Economics. For each paper in arXiv, we collect the following data:

1. doi
1. Publishing date
1. Title
1. Author names
1. Main Field

According to https://arxiv.org/help/bulk_data, there are three services that we can use for collecting data from arXiv: `OAI-PMH`, `arXiv API`, and `RSS`. Our first approach was to use the `arXiv API`. However, there are inconsistencies in the category values because of recent changes in the category naming system. Furthermore, the `arXiv API` was not designed for accessing the whole arXiv repository. Thus, we use `OAI-PMH` to access the metadata for every paper in the arXiv repository.

### Data Collection Process

We use `urllib` to request a URL that serves as the query interface for arXiv's `OAI-PMH` service. arXiv's server then returns a text string in `XML` format that contains the results of the request. We use the `BeautifulSoup` package to parse the data. Each query returns at most 1,000 results. If there are more than 1,000 results, arXiv provides a `resumptionToken` that can be used to continue to the next 1,000 results. The Python code for the data collection process is provided in the appendix. The collected data are exported to a csv file, which can be accessed at https://www.dropbox.com/s/mxlqmphe9dxtw8y/data_v3.csv?dl=1. Please note that the size of this file is 371 MB.
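
The paging rule described above can be illustrated offline. The snippet below is a minimal sketch, not the collection script itself: it uses the standard-library `xml.etree.ElementTree` rather than `BeautifulSoup`, and the helper name `next_request_url` and the two trimmed sample responses are hypothetical. It assumes that the last page of a list may carry an empty `resumptionToken` element, which is treated as "no more results".

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# XML namespace used by OAI-PMH responses
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://export.arxiv.org/oai2?verb=ListRecords&"

def next_request_url(xml_text):
    """Return the URL of the next page, or None when the list is complete."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI_NS + "resumptionToken")
    # A missing or empty token both mean there is nothing left to fetch
    if token is None or not (token.text or "").strip():
        return None
    return BASE_URL + "resumptionToken=" + quote(token.text.strip())

# Hypothetical, heavily trimmed responses (for illustration only)
page_with_more = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)
last_page = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken/></ListRecords>"
    "</OAI-PMH>"
)

print(next_request_url(page_with_more))
print(next_request_url(last_page))
```

Keeping the "is there another page?" decision in one small function makes the stopping condition easy to check without contacting the server.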

## Data Exploration

To process the data, we first import the packages we need.

1. We use `pandas` for parsing csv files and organizing data
1. We use `numpy` for numerical computation
1. We use `matplotlib.pyplot` for visualizing our results

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Since the data set is too big for GitHub, we host it on Dropbox. Once we have downloaded the data to our local machine, we can change `download` to `False` to avoid downloading the data again.

In [3]:
# Download and import data
download = True # If download is true download and save data, else just read data

if(download):
    ## Data set
    data_url = "https://www.dropbox.com/s/mxlqmphe9dxtw8y/data_v3.csv?dl=1"
    df = pd.read_csv(data_url)
    df.to_csv("data.csv", index = False)
else:
    df = pd.read_csv("data.csv")

Since there might be duplicates and `NaN` values in the dataset, we use pandas' `dropna` and `drop_duplicates` to remove them from our data frame. Furthermore, we use `drop` to remove the index column stored in the csv file.

In [4]:
# Cleaning the dataframe
df.dropna(inplace = True)
df.drop_duplicates(subset = "doi", inplace = True)
df.drop("Unnamed: 0", axis = 1, inplace = True) # Drop the "Unnamed: 0"

# Get the category list
all_cat = list(set(df["category"].values))
all_cat.sort()

df.head()

| | doi | date | title | authors | category |
|---|---|---|---|---|---|
| 0 | oai:arXiv.org:0704.0002 | 2007-03-30 | Sparsity-certifying Graph Decompositions | Streinu Ileana;Theran Louis | cs |
| 1 | oai:arXiv.org:0704.0046 | 2007-04-01 | A limit relation for entropy and channel capac... | Csiszar I.;Hiai F.;Petz D. | cs |
| 2 | oai:arXiv.org:0704.0047 | 2007-04-01 | Intelligent location of simultaneously active ... | Kosel T.;Grabec I. | cs |
| 3 | oai:arXiv.org:0704.0050 | 2007-04-01 | Intelligent location of simultaneously active ... | Kosel T.;Grabec I. | cs |
| 4 | oai:arXiv.org:0704.0062 | 2007-03-31 | On-line Viterbi Algorithm and Its Relationship... | Šrámek Rastislav;Brejová Broňa;Vinař Tomáš | cs |


In [None]:
# Collecting all the authors
au_lst = []
for paper_authors in df["authors"].values:
    for author in paper_authors.split(";"):
        au_lst.append(author)
        
# Get all the unique authors       
au_lst = list(set(au_lst))
au_lst.sort()

au_dict = {author:index for (index, author) in enumerate(au_lst)}
cat_dict = {cat:index for (index, cat) in enumerate(all_cat)}

# Creating the matrix
n = len(au_dict)
p = len(all_cat)
credit_matrix = np.zeros((n, p))

In [None]:
for index, row in df[["authors", "category"]].iterrows():
    
    author_list = row["authors"].split(";")
    contribute = 1.0/len(author_list)
    
    for author in author_list:
        try:
            credit_matrix[ au_dict[author], cat_dict[row["category"]] ] += contribute
        except KeyError as e:
            print(e)

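# Row-normalize: share of each author's credit that falls in each field
author_activity = credit_matrix / credit_matrix.sum(axis=1, keepdims=True)
# Column-normalize: share of each field's credit held by each author
author_weight_in_field = credit_matrix / credit_matrix.sum(axis=0, keepdims=True)
# Field-by-field influence: entry (i, j) sums, over authors, activity in field i times weight in field j
field_field_influence = np.transpose(author_activity).dot(author_weight_in_field)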

proj1_df = pd.DataFrame(field_field_influence, columns = all_cat, index=all_cat)
proj1_df
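
To make the credit bookkeeping concrete, the following is a minimal sketch on invented toy data (three hypothetical papers and three hypothetical authors; none of these names come from the arXiv data set). It assumes the same rule as the cells above: each paper is worth one unit of credit, split evenly among its authors and assigned to the paper's category.

```python
import numpy as np

# Hypothetical toy data: (authors, category) for each paper
papers = [
    ("Alice;Bob", "cs"),
    ("Alice", "math"),
    ("Bob;Carol", "math"),
]

authors = sorted({a for names, _ in papers for a in names.split(";")})
fields = sorted({cat for _, cat in papers})
a_idx = {a: i for i, a in enumerate(authors)}
f_idx = {f: j for j, f in enumerate(fields)}

# Each paper contributes one unit of credit, split evenly among its authors
credit = np.zeros((len(authors), len(fields)))
for names, cat in papers:
    names = names.split(";")
    for a in names:
        credit[a_idx[a], f_idx[cat]] += 1.0 / len(names)

# Row-normalized: how each author divides their own credit across fields
activity = credit / credit.sum(axis=1, keepdims=True)
# Column-normalized: how each field's credit is divided among authors
weight = credit / credit.sum(axis=0, keepdims=True)
# Field-by-field influence matrix
influence = activity.T @ weight

print(credit)
print(influence)
```

In this construction each column of `credit` sums to the number of papers in that field, and each column of `influence` sums to 1, because every row of `activity` and every column of `weight` sums to 1. Both properties are convenient sanity checks before running the full data set.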

In [None]:
credit_df = pd.DataFrame(credit_matrix, columns = all_cat, index = au_lst)
credit_df.to_csv("credit_data.csv", index = False)

In [None]:
credit_df.head()

In [None]:
def update_author_funding(credit, field_funding):
    author_weight_in_field = credit / credit.sum(axis=0,keepdims=True)
    author_funding_from_field = author_weight_in_field * field_funding
    author_funding = author_funding_from_field.sum(axis=1,keepdims=True)
    return author_funding

def compute_credit(author_funding):
    # Relies on the global `author_prod` (credit per unit of funding), defined later in the notebook
    new_credit = author_prod * author_funding
    total_credit = new_credit.sum()
    return new_credit, total_credit
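
The two functions above can be read as a simple model: field funding is passed down to authors in proportion to their share of each field's credit, and an author's output scales linearly with the funding received. The sketch below walks through that model on a hypothetical 3-author, 2-field credit matrix with made-up funding vectors. The helper name `author_funding_from_fields` is introduced here for illustration, and productivity is passed around explicitly instead of being read from a global.

```python
import numpy as np

def author_funding_from_fields(credit, field_funding):
    """Split each field's funding among authors by their share of that field's credit."""
    weight = credit / credit.sum(axis=0, keepdims=True)
    return (weight * field_funding).sum(axis=1, keepdims=True)

# Hypothetical credit matrix (rows: authors, columns: fields)
credit = np.array([[0.5, 1.0],
                   [0.5, 0.5],
                   [0.0, 0.5]])
field_funding = np.array([0.4, 0.6])   # made-up funding shares, summing to 1

funding = author_funding_from_fields(credit, field_funding)
# Credit produced per unit of funding; treated as fixed for each author
productivity = credit / funding

# Shift funding toward the first field and recompute the predicted credit
new_field_funding = np.array([0.7, 0.3])
new_funding = author_funding_from_fields(credit, new_field_funding)
new_credit = productivity * new_funding

print(funding.ravel())
print(new_credit.sum())
```

Because the author weights in each field sum to 1, the total funding handed to authors equals the total field funding. Multiplying `productivity` by the original `funding` recovers the original `credit`, which is the sense in which productivity is held constant.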

In [None]:
num_field = len(all_cat)
num_auth  = len(au_lst)
num_steps = 200

# Learning hyperparameters
p = 0.05      # probability of a random exploration step
alpha = 0.1   # step size for the funding update

# Current credit
current_credit = credit_matrix

# Current field funding - Generating a random funding
d = np.random.rand(num_field)
current_field_funding = d / d.sum()

# Saving the original field funding 
original_field_funding = current_field_funding.copy()

In [None]:
# Initial calculation
current_author_funding = update_author_funding(current_credit, current_field_funding)
author_prod = current_credit / current_author_funding # Productivity (credit per unit of funding), held constant
current_credit, current_total_credit = compute_credit(current_author_funding)

# Initialize the best state
best_field_funding = current_field_funding.copy()
best_credit        = current_credit.copy()
best_total_credit  = current_total_credit.copy()

tot_credit_lst = []

In [None]:
for i in range(num_steps):
    current_author_weight_in_field = current_credit / current_credit.sum(axis=0,keepdims=True)

    # With probability p, take a random exploration step instead of the computed direction
    if np.random.rand() < p:
        gradient = np.random.rand(num_field)
    else:
        gradient = (author_prod * current_author_weight_in_field).sum(axis = 0)

    gradient_norm = gradient / gradient.sum()  # normalize so the step sums to 1
    # Update field funding
    new_field_funding = current_field_funding + alpha*gradient_norm
    new_field_funding = new_field_funding / new_field_funding.sum() # renormalize so funding sums to 1

    new_author_funding = update_author_funding(current_credit, new_field_funding)
    new_credit, new_total_credit = compute_credit(new_author_funding)
    
    tot_credit_lst.append(new_total_credit)
    
    # update the new best result
    if(best_total_credit < new_total_credit):
        best_field_funding = new_field_funding.copy()
        best_credit        = new_credit.copy()
        best_total_credit  = new_total_credit.copy()

#         print("new Best")

    # Update for new step 
    current_field_funding = new_field_funding.copy()
    current_credit        = new_credit.copy()
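
A short derivation may help in reading the update step. The symbols are introduced here for illustration: $f$ stands for `current_field_funding`, $\hat g$ for `gradient_norm`, and $\alpha$ for `alpha`. Since both $f$ and $\hat g$ sum to 1, the renormalization divides by $1 + \alpha$:

$$
f' = \frac{f + \alpha\,\hat g}{\sum_k \left(f_k + \alpha\,\hat g_k\right)} = \frac{f + \alpha\,\hat g}{1 + \alpha} = \frac{1}{1+\alpha}\,f + \frac{\alpha}{1+\alpha}\,\hat g
$$

Each step is therefore a convex combination of the current funding and the normalized direction. With $\alpha = 0.1$, the weight on $\hat g$ is $0.1 / 1.1 \approx 0.091$, so the funding vector moves about 9% of the way toward $\hat g$ per step and stays a valid distribution as long as the entries of $\hat g$ are non-negative.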

In [None]:
plt.plot(tot_credit_lst)
plt.ylabel('Total Credit')
plt.xlabel('step')
plt.show()

In [None]:
funding_df = pd.DataFrame(best_field_funding, columns = ["Field Funding"], index = all_cat)
funding_df.sort_values(by="Field Funding", ascending = False)

# Appendix

This is the Python code used for scraping the data from the arXiv repository using `OAI-PMH`.
```python
# Import for processing XML
from bs4 import BeautifulSoup
import time

# Import for requesting HTML
import urllib
import urllib.request
from urllib.error import HTTPError

# Import for text processing
import io
import re

# Import for data processing and organization
import pandas as pd
import numpy as np

# Find all the categories
url = "http://export.arxiv.org/oai2?verb=ListSets"
u = urllib.request.urlopen(url, data = None)
f = io.TextIOWrapper(u,encoding='utf-8')
text = f.read()
soup = BeautifulSoup(text, 'xml')
all_cat = [sp.text for sp in soup.findAll("setSpec")]

# Export the categories to a text file
with open("all_cat_v01.txt", "w") as f:
    f.write(",".join(all_cat))

def scrape(cat):

    '''
    Function to scrape all the papers from a category

    INPUT : category (String)
    OUTPUT: dataframe that contains doi, date, title, authors, category for each paper (pandas.DataFrame)
    '''

    # Initialization
    df = pd.DataFrame(columns=("doi", "date", "title", "authors", "category"))
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    url = base_url + "set={}&metadataPrefix=arXiv".format(cat)

    # while loop in order to loop through all the results
    while True:
        # print url to keep track of the progress
        print(url)
        # accessing the url
        try:
            u = urllib.request.urlopen(url, data = None)
        except HTTPError as e:
            # In case of an error that requires us to wait
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print("Got 503. Retrying after {0:d} seconds.".format(to))
                time.sleep(to)
                continue # Skip this loop, continue to the next one
            else:
                raise

        # read the request
        f = io.TextIOWrapper(u,encoding='utf-8')
        text = f.read()
        soup = BeautifulSoup(text, 'xml')

        # collect the data
        for record in soup.findAll("record"):
            # find() returns None when a tag is missing, so .text raises AttributeError
            try:
                doi = record.find("identifier").text
            except AttributeError:
                doi = np.nan

            try:
                date = record.find("created").text
            except AttributeError:
                date = np.nan

            try:
                title = record.find("title").text
            except AttributeError:
                title = np.nan

            try:
                authors = ";".join([author.get_text(" ") for author in record.findAll("author")])
            except AttributeError:
                authors = np.nan

            try:
                category = record.find("setSpec").text
            except AttributeError:
                category = np.nan

            # Add the record as a new row (DataFrame.append was removed in pandas 2.0)
            df.loc[len(df)] = [doi, date, title, authors, category]


        # resumptionToken tells us whether there are more results
        token = soup.find("resumptionToken")
        # The last page may carry an empty resumptionToken element, so check for empty text too
        if token is None or not token.text:
            break
        else:
            url = base_url + "resumptionToken=%s"%(token.text)

    return(df)

# Initialize master_df that contains all the data points
master_df = pd.DataFrame(columns=("doi", "date", "title", "authors", "category"))
for i in all_cat:
    # Print out the current category for progress tracking
    print("----------------",i,"-------------------")
    df = scrape(i)
    # append the new data to master_df (pd.concat replaces the removed DataFrame.append)
    master_df = pd.concat([master_df, df], ignore_index = True)

# Export the data to a csv file
master_df.to_csv("data.csv")
```