My understanding of the scraping process:
1. An author is chosen
2. All papers for that author are pulled, including;
 - Title
 - Subject classification
 - arxiv ID
 - other authors
 - number of pages
 - Do we get date?  Any other features?
3. Repeat for another author
 
We must avoid bias from papers with many co-authors.  Thus, we define a *paper credit* to be 1/number of authors, since each a paper with $k$ authors will be seen $k$ times (once for each author).

# Project 1
Scrape and compute field_field_influence matrix described below. Display as directed graph using networkx package.

## Notation
- $n$ = number of unique authors
- $p$ = number of unique fields
field[q]
- Credit Matrix
  - $a \times f$ matrix $C$ initialized with zeros
  - For each paper by author $i$, increment $C_{ij}$ by that paper's credit, where $j$ is the field of the paper.
- author_activity
  - $a \times f$ matrix $A$ initialized as a copy of $C$
  - Divide by columns sums: A = C/C.sum(axis=1, keepdims=True)
  - $A_{ij}$ = proportion of all credit earned by author $i$ that comes from papers in field $j$
  - In words, suppose author $3$ is Cook, field $8$ is "Math", and $A_{3,8}=0.75$, then we say Cook is 75% mathemtician, 25% other fields.
- author_weight_in_field
  - $a \times f$ matrix $W$ initialized as a copy of $C$
  - Divide by columns sums: W = C/C.sum(axis=0, keepdims=True)
  - $W_{ij}$ = proportion of all credit availble in field $j$ that is given to author $i$
- field_field_influence
  - $f \times f$ matrix $I = A^T W$
  - $I_{ij}$ = influence by field $i$ on field $j$ = proportion of work in field $j$ attributable to field $i$.
  - $I_{ij} = \sum_k A^T_{ik} W_{kj} = \sum_k A_{ki} W_{kj} = \sum_k (\mbox{activity of author k in field i})(\mbox{weight given to author k by field j}) = \sum_k (\mbox{proportion of author k that "belongs" to field i})(\mbox{propotion of work in field j done by author k})$




# Project 2
Explore effect of varying funding levels.

## Notation
- $F_f$ = $p$-vector of field funding amounts
- author_funding_from_field
  - $a \times f$ matrix where entry $(i,j)$ = $W_{ij} * F_j$
  - In words, field $j$ distributes its funds $F_j$ to authors proportionate to their weight in that field.
- author_funding
  - Compute row sums of author_funding_from_field (keepdims = True so it is $a \times 1$ and broadcasting works correctly below)
- author_productivity
  - $a \times f$ matrix $P = $ credit / author_funding
  - credit in each field per unit of funding
  - ASSUME this matrix is invariant - it is intrinsic to the author
- Observe that credit = author_productivity \* author_funding
  - We use \* for element-wise multiplication with broadcasting, not the usual matrix.dot(vector) multiplication.  In other words, every entry in row $i$ of author_productivity is multiplied by the same number - entry $i$ of author_funding.

Process:
1. Choose intial funding levels for each field.  Compute author_productivity.  Assume this is invariant.
3. Iterate the following system
  - Set new field_funding levels
  - Assume fields distribute funding based upon prior credit and author_weight_in_field
  - Assume author_productivity is invariant
  - Compute new:
    - author_funding
    - credit = author_productivity \* author_funding
    - author_weight_in_field
4. Explore effects of different funding levels on scientific output (encoded in the credit matrix).
  - Consider the same total funding, but different distributions across fields.
  - Consider increasing/decreading total funding.  To be fair, we should consider total_credit/total_funding.  I am not sure this will be intersting - total_credit may scale linearly with total_funding (can't quite tell).  If so, only changes to distribution will be meaningful.

In [1]:
# This cell searches randomly and dumbly for better funding levels.
# You should implement gradient descent to smartly (not randomly) choose our funding changes
from setup import *

num_field = 4
num_author = 6
num_steps = 2

current_credit = np.random.rand(num_author,num_field)*5
d = np.random.rand(num_field)
current_field_funding = d / d.sum()

def update_author_funding(credit, field_funding):
    author_weight_in_field = credit / credit.sum(axis=0,keepdims=True)
    author_funding_from_field = author_weight_in_field * field_funding
    author_funding = author_funding_from_field.sum(axis=1,keepdims=True)
    return author_funding

def compute_credit(author_funding):
    new_credit = author_prod * author_funding
    field_credit = new_credit.sum(axis=0)
    author_credit = new_credit.sum(axis=1)
    total_credit = new_credit.sum()
    return new_credit, total_credit
    
current_author_funding = update_author_funding(current_credit, current_field_funding)
author_prod = current_credit / current_author_funding
current_credit, current_total_credit = compute_credit(current_author_funding)

for i in range(num_steps):
    # Pick a random "direction" to move funding levels and computes effect on credit.
    v = np.random.rand(num_field)
    v -= v.mean()  # to make sure funding levels sum to 1
    x = current_field_funding
    # Makes sure we don't set any funding less than 0 or more than 1
    k0 = ((np.zeros_like(x) - x) / v).max()
    k1 = (( np.ones_like(x) - x) / v).max()
    k = min(k0,k1)*0.05

    new_field_funding = current_field_funding + k*v
    new_author_funding = update_author_funding(current_credit, new_field_funding)
    new_credit, new_total_credit = compute_credit(new_author_funding)
    best_field_fund, best_credit, best_total_credit = current_field_funding, current_credit, current_total_credit
    
    current_field_funding, current_credit, current_total_credit = new_field_funding, new_credit, new_total_credit
    if(best_total_credit < current_total_credit):
        best_field_funding, best_credit, best_total_credit = current_field_funding, current_credit, current_total_credit
    print("Current Field Funding")
    print(current_field_funding)
    print("Current Credit")
    display(margins(current_credit))
print("Best Field Funding")
print(best_field_funding)
print("Best Credit")
display(margins(best_credit))

Current Field Funding
[ 0.33103129  0.43317793  0.23918742 -0.00339664]
Current Credit


Unnamed: 0,0,1,2,3,TOTAL
0,1.088,0.944,0.383,3.996,6.411
1,3.212,0.821,0.475,1.286,5.795
2,0.947,1.217,1.025,0.347,3.536
3,0.903,0.807,1.029,3.595,6.333
4,0.16,4.87,4.21,4.828,14.068
5,0.612,5.245,1.638,0.431,7.926
TOTAL,6.922,13.904,8.761,14.482,44.069


Current Field Funding
[ 0.31447973  0.35121014  0.30245128  0.03185885]
Current Credit


Unnamed: 0,0,1,2,3,TOTAL
0,1.062,0.921,0.374,3.9,6.257
1,3.056,0.782,0.452,1.223,5.513
2,0.952,1.223,1.03,0.349,3.554
3,0.938,0.839,1.069,3.736,6.582
4,0.17,5.175,4.473,5.129,14.947
5,0.586,5.025,1.57,0.413,7.594
TOTAL,6.764,13.965,8.969,14.75,44.447


Best Field Funding
[ 0.31447973  0.35121014  0.30245128  0.03185885]
Best Credit


Unnamed: 0,0,1,2,3,TOTAL
0,1.062,0.921,0.374,3.9,6.257
1,3.056,0.782,0.452,1.223,5.513
2,0.952,1.223,1.03,0.349,3.554
3,0.938,0.839,1.069,3.736,6.582
4,0.17,5.175,4.473,5.129,14.947
5,0.586,5.025,1.57,0.413,7.594
TOTAL,6.764,13.965,8.969,14.75,44.447


In [2]:
# This cell does a basic MCMC random search funding levels.
# You should implement gradient descent to smartly (not randomly) choose our funding changes
from setup import *

num_field = 4
num_auth = 6
num_steps = 2

current_credit = np.random.rand(num_auth,num_field)*5
d = np.random.rand(num_field)
current_field_fund = d / d.sum()

def update_auth_fund(credit, field_fund):
    auth_weight_in_field = credit / credit.sum(axis=0,keepdims=True)
    auth_fund_from_field = auth_weight_in_field * field_fund
    auth_fund = auth_fund_from_field.sum(axis=1,keepdims=True)
    return auth_fund

def compute_credit(auth_fund):
    new_credit = auth_prod * auth_fund
    field_credit = new_credit.sum(axis=0)
    auth_credit = new_credit.sum(axis=1)
    total_credit = new_credit.sum()
    return new_credit, total_credit
    
current_auth_fund = update_auth_fund(current_credit, current_field_fund)
auth_prod = current_credit / current_auth_fund
current_credit, current_total_credit = compute_credit(current_auth_fund)
best_field_fund, best_credit, best_total_credit = current_field_fund, current_credit, current_total_credit

for i in range(num_steps):
    # Pick a random "direction" to move funding levels and computes effect on credit.
    v = np.random.rand(num_field)
    v -= v.mean()  # to make sure funding levels sum to 1
    x = current_field_fund
    # Makes sure we don't set any funding less than 0 or more than 1
    k0 = ((np.zeros_like(x) - x) / v).max()
    k1 = (( np.ones_like(x) - x) / v).max()
    k = min(k0,k1)*0.05

    new_field_fund = current_field_fund + k*v
    new_auth_fund = update_auth_fund(current_credit, new_field_fund)
    new_credit, new_total_credit = compute_credit(new_auth_fund)
    
    
    if(current_total_credit < new_total_credit):
        # accept new funding level because it is better
        current_field_fund, current_credit, current_total_credit = new_field_fund, new_credit, new_total_credit
        if(best_total_credit < current_total_credit):
            best_field_fund, best_credit, best_total_credit = current_field_fund, current_credit, current_total_credit
    else:
        credit_change = new_total_credit - current_total_credit
        r = np.random.rand()
        anneal = 3
        a = np.exp(credit_change * anneal)
        if(r < a):
            # accept new funding even though it is worse
            current_field_fund, current_credit, current_total_credit = new_field_fund, new_credit, new_total_credit
        else:
            #reject nee funding level
            pass        

    print("Current Field Funding")
    print(current_field_fund)
    print("Current Credit")
    display(margins(current_credit))
print("Best Field Funding")
print(best_field_fund)
print("Best Credit")
display(margins(best_credit))

Current Field Funding
[ 0.0869643   0.22169554  0.39187388  0.29946628]
Current Credit


Unnamed: 0,0,1,2,3,TOTAL
0,3.747,3.472,3.334,1.139,11.692
1,2.584,3.693,1.168,4.285,11.73
2,2.507,3.941,0.175,1.53,8.154
3,1.924,2.546,0.489,1.379,6.338
4,3.522,4.021,3.95,1.078,12.572
5,3.47,2.266,3.845,3.56,13.141
TOTAL,17.755,19.939,12.962,12.972,63.627


Current Field Funding
[ 0.13261609  0.2120236   0.38341007  0.27195024]
Current Credit


Unnamed: 0,0,1,2,3,TOTAL
0,3.74,3.466,3.328,1.137,11.671
1,2.581,3.689,1.167,4.281,11.717
2,2.636,4.143,0.185,1.609,8.573
3,1.976,2.614,0.502,1.416,6.508
4,3.506,4.003,3.933,1.073,12.515
5,3.392,2.215,3.758,3.48,12.844
TOTAL,17.831,20.129,12.872,12.996,63.828


Best Field Funding
[ 0.13261609  0.2120236   0.38341007  0.27195024]
Best Credit


Unnamed: 0,0,1,2,3,TOTAL
0,3.74,3.466,3.328,1.137,11.671
1,2.581,3.689,1.167,4.281,11.717
2,2.636,4.143,0.185,1.609,8.573
3,1.976,2.614,0.502,1.416,6.508
4,3.506,4.003,3.933,1.073,12.515
5,3.392,2.215,3.758,3.48,12.844
TOTAL,17.831,20.129,12.872,12.996,63.828
