# Early release of HA3

Dear all, this is an early release of HA3 so that you can already get started. We will release a more fleshed-out version with examples and more text at the beginning of next week.

In [None]:
import numpy as np
import pandas as pd

##  How Eventnet works


The main idea of this task is to program the "core" of Eventnet. Eventnet naturally offers much more flexibility than what we do below, but we restrict ourselves to a very stripped-down feature set to keep things simple. We will thus only consider three features already covered in the exercises: `user_activity` (user out-degree), `article_popularity` (article in-degree) and `previous_activity` (edge multiplicity).

We have provided you with the "main loop" of Eventnet below. The core functions to implement are the `update`, `log` and `negative_sample` steps.

In [None]:
def main_loop(df, columns, seed=1, start_row=0, end_row=-1):
    net = EventNet_0()
    rng = np.random.default_rng(seed)
    out_rows = []
    # assumption: the rows are in time-increasing order
    # note: with the default end_row=-1 the last row of df is not processed
    for row in df[start_row:end_row].itertuples():
        out_rows.append(net.log(row))
        net.update(row)
        
        fake_row = net.negative_sample(row, rng)
        if fake_row is not None:
            out_rows.append(fake_row)

    df_out = pd.DataFrame.from_records(out_rows, columns = columns)
    return df_out, net

## a) `update`



Choose suitable data types that let you keep track of the respective features efficiently (user out-degree, article in-degree and edge multiplicity). Also keep a list of users and a list of articles in which each user/article appears only once. Finally, implement the function `update`, which updates the features given an observed row (see also its use in the main loop).

Throughout this implementation avoid quadratic runtimes in the number of users, articles or events.


## b) `log`

Now create a `log` function that records the features of the provided row. It returns a named tuple of type `out_class` for the corresponding row, filled with the respective features.

In [None]:
from collections import namedtuple
columns = ("IS_OBSERVED", "SOURCE", "TARGET", "TIME", "TYPE", "user_activity", "article_popularity", "previous_activity" )
out_class = namedtuple("out_class", columns)

## c) Negative sample

Write a function `negative_sample` that draws one negative sample from all possible user/article combinations we have seen so far. Avoid drawing the same user-article combination as the current row by using rejection sampling (i.e. keep sampling until you have a valid pair).
When drawing a user-article pair, first draw an integer representing the user, then an integer representing the article.
Return a named tuple of the same type as in b).
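The rejection-sampling idea can be sketched as follows. The helper `sample_other_pair` and the toy lists are hypothetical, and the use of `rng.integers` is an assumption: the exact way the integers are drawn determines which pairs come out for a given seed, so this sketch is not guaranteed to reproduce the expected outputs shown in this notebook.

```python
import numpy as np

def sample_other_pair(users, articles, forbidden, rng):
    """Draw (user, article) uniformly; redraw while the pair equals `forbidden`."""
    while True:
        u = users[rng.integers(len(users))]        # user index first
        a = articles[rng.integers(len(articles))]  # then article index
        if (u, a) != forbidden:
            return u, a

rng = np.random.default_rng(1)
users = ["u1", "u2"]            # hypothetical toy data
articles = ["a1", "a2", "a3"]
draws = [sample_other_pair(users, articles, ("u1", "a1"), rng) for _ in range(5)]
print(draws)
```

The loop only terminates if at least one other pair exists, which is why a guard for the case of a single possible pair is needed before sampling.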

In [None]:
"""Need to still implement the bulk of this class"""
class EventNet_0:
    def __init__(self):
        self.users = []
        self.articles = []
        """Choose appropriate datatypes below"""
        self.user_out_degree = None   # choose datatype
        self.article_in_degree = None # choose datatype
        self.previous_activity = None # choose datatype

    def update(self, row):
        """Your code for a) here"""
        pass

    def construct(self, is_observed, user, article, time):
        return out_class(is_observed, user, article, time, "edit",
                    self.user_out_degree[user],
                    self.article_in_degree[article],
                    self.previous_activity[(user, article)])
    
    def log(self, row):
        """Your code for b) here"""
        return out_class(1,1,1,1,"edit",1,1,1)

    def negative_sample(self, row, rng):
        if (len(self.users) * len(self.articles) ) <= 1:
            return None
        """Your code for c) here"""
        return out_class(1,1,1,1,"edit",1,1,1)

### Example on fake data

In [None]:
fake_row = namedtuple("fake_row", ("time", "article", "user", ))
def row(u,a):
    return fake_row(1, a, u)
rng1=np.random.default_rng(1)
net = EventNet_0()
net.update(row(1,-1))
net.update(row(1,-2))
print(net.log(row(1,-1)))
# out_class(IS_OBSERVED=1, SOURCE=1, TARGET=-1, TIME=1, TYPE='edit', user_activity=2, article_popularity=1, previous_activity=1)
print(net.log(row(0,-2)))
# out_class(IS_OBSERVED=1, SOURCE=0, TARGET=-2, TIME=1, TYPE='edit', user_activity=0, article_popularity=1, previous_activity=0)
print(net.negative_sample(row(0,-3), rng1))
# out_class(IS_OBSERVED=0, SOURCE=1, TARGET=-1, TIME=1, TYPE='edit', user_activity=2, article_popularity=1, previous_activity=1)
print(net.negative_sample(row(0,-3), rng1))
# out_class(IS_OBSERVED=0, SOURCE=1, TARGET=-2, TIME=1, TYPE='edit', user_activity=2, article_popularity=1, previous_activity=1)

### Example on real data

In [None]:
from pathlib import Path

In [None]:
p = Path(r"YOUR_PATH_TO\human_migration_events.csv")

In [None]:
df = pd.read_csv(p, delimiter=";")
df = df[df.type =="edit"] # filter for only edit events

df_out, net = main_loop(df, columns=columns)

In [None]:
df.columns


In [None]:
df_out.head(15)


#### Output should be:


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>IS_OBSERVED</th>
      <th>SOURCE</th>
      <th>TARGET</th>
      <th>TIME</th>
      <th>TYPE</th>
      <th>user_activity</th>
      <th>article_popularity</th>
      <th>previous_activity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>209.91.204.xxx</td>
      <td>Puerto Rico</td>
      <td>984868500000</td>
      <td>edit</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>TimShell</td>
      <td>Puerto Rico</td>
      <td>985483241000</td>
      <td>edit</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>1</td>
      <td>Koyaanisqatsi</td>
      <td>Bermuda</td>
      <td>987892807000</td>
      <td>edit</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0</td>
      <td>209.91.204.xxx</td>
      <td>Puerto Rico</td>
      <td>987892807000</td>
      <td>edit</td>
      <td>1</td>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1</td>
      <td>Koyaanisqatsi</td>
      <td>Bermuda</td>
      <td>987895460000</td>
      <td>edit</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <th>5</th>
      <td>0</td>
      <td>TimShell</td>
      <td>Puerto Rico</td>
      <td>987895460000</td>
      <td>edit</td>
      <td>1</td>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <th>6</th>
      <td>1</td>
      <td>Koyaanisqatsi</td>
      <td>History of Barbados</td>
      <td>988030404000</td>
      <td>edit</td>
      <td>2</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>7</th>
      <td>0</td>
      <td>TimShell</td>
      <td>Bermuda</td>
      <td>988030404000</td>
      <td>edit</td>
      <td>1</td>
      <td>2</td>
      <td>0</td>
    </tr>
    <tr>
      <th>8</th>
      <td>1</td>
      <td>Koyaanisqatsi~enwiki</td>
      <td>History of the British Virgin Islands</td>
      <td>988036307000</td>
      <td>edit</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>9</th>
      <td>0</td>
      <td>TimShell</td>
      <td>Puerto Rico</td>
      <td>988036307000</td>
      <td>edit</td>
      <td>1</td>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <th>10</th>
      <td>1</td>
      <td>KoyaanisQatsi</td>
      <td>History of the Falkland Islands</td>
      <td>988551301000</td>
      <td>edit</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>11</th>
      <td>0</td>
      <td>209.91.204.xxx</td>
      <td>History of the British Virgin Islands</td>
      <td>988551301000</td>
      <td>edit</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <th>12</th>
      <td>1</td>
      <td>KoyaanisQatsi</td>
      <td>History of French Guiana</td>
      <td>988552596000</td>
      <td>edit</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>13</th>
      <td>0</td>
      <td>Koyaanisqatsi~enwiki</td>
      <td>History of the Falkland Islands</td>
      <td>988552596000</td>
      <td>edit</td>
      <td>1</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <th>14</th>
      <td>1</td>
      <td>KoyaanisQatsi</td>
      <td>French Guiana</td>
      <td>988939007000</td>
      <td>edit</td>
      <td>2</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>


In [None]:
df_out.tail(15)

#### Output should be:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>IS_OBSERVED</th>
      <th>SOURCE</th>
      <th>TARGET</th>
      <th>TIME</th>
      <th>TYPE</th>
      <th>user_activity</th>
      <th>article_popularity</th>
      <th>previous_activity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1633289</th>
      <td>0</td>
      <td>Kosandr</td>
      <td>Operation Black Tulip</td>
      <td>1514821157000</td>
      <td>edit</td>
      <td>1</td>
      <td>87</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633290</th>
      <td>1</td>
      <td>Joe Roe</td>
      <td>Recent African origin of modern humans</td>
      <td>1514822399000</td>
      <td>edit</td>
      <td>37</td>
      <td>1638</td>
      <td>6</td>
    </tr>
    <tr>
      <th>1633291</th>
      <td>0</td>
      <td>Cj005257</td>
      <td>Zahoor ul Akhlaq</td>
      <td>1514822399000</td>
      <td>edit</td>
      <td>1</td>
      <td>87</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633292</th>
      <td>1</td>
      <td>Narky Blert</td>
      <td>Ankai Fort</td>
      <td>1514825889000</td>
      <td>edit</td>
      <td>203</td>
      <td>61</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633293</th>
      <td>0</td>
      <td>Parkenator90</td>
      <td>History of Paraguay (to 1811)</td>
      <td>1514825889000</td>
      <td>edit</td>
      <td>1</td>
      <td>42</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633294</th>
      <td>1</td>
      <td>Jd22292</td>
      <td>Day to Mark the Departure and Expulsion of Jews from the Arab Countries and Iran</td>
      <td>1514828866000</td>
      <td>edit</td>
      <td>13</td>
      <td>34</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633295</th>
      <td>0</td>
      <td>Minsbot</td>
      <td>Syed Waseem Hussain</td>
      <td>1514828866000</td>
      <td>edit</td>
      <td>8</td>
      <td>22</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633296</th>
      <td>1</td>
      <td>Balkanique</td>
      <td>Iran–Turkey barrier</td>
      <td>1514830718000</td>
      <td>edit</td>
      <td>24</td>
      <td>14</td>
      <td>10</td>
    </tr>
    <tr>
      <th>1633297</th>
      <td>0</td>
      <td>Sonia Murillo Perales</td>
      <td>Inner emigration</td>
      <td>1514830718000</td>
      <td>edit</td>
      <td>17</td>
      <td>26</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633298</th>
      <td>1</td>
      <td>Jd22292</td>
      <td>1948 Palestinian exodus</td>
      <td>1514831144000</td>
      <td>edit</td>
      <td>14</td>
      <td>3229</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633299</th>
      <td>0</td>
      <td>Ledmonkey</td>
      <td>City Beautiful movement</td>
      <td>1514831144000</td>
      <td>edit</td>
      <td>1</td>
      <td>203</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633300</th>
      <td>1</td>
      <td>Jd22292</td>
      <td>Internally displaced Palestinians</td>
      <td>1514831226000</td>
      <td>edit</td>
      <td>15</td>
      <td>145</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633301</th>
      <td>0</td>
      <td>Zeng8r</td>
      <td>Drift to the north</td>
      <td>1514831226000</td>
      <td>edit</td>
      <td>27</td>
      <td>8</td>
      <td>0</td>
    </tr>
    <tr>
      <th>1633302</th>
      <td>1</td>
      <td>Hmains</td>
      <td>Chittagong Hill Tracts conflict</td>
      <td>1514832795000</td>
      <td>edit</td>
      <td>1899</td>
      <td>171</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1633303</th>
      <td>0</td>
      <td>Ammorgan2</td>
      <td>Nationality Law of the Democratic People's Republic of Korea</td>
      <td>1514832795000</td>
      <td>edit</td>
      <td>1</td>
      <td>36</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

## Task 2: Evaluation of link prediction

In this task you are first going to implement three link prediction metrics: preferential attachment, common neighbors and Jaccard similarity. We will then implement the functions needed to compute the ROC/PR curves for a given pairwise score matrix and observed adjacency matrix. The final task is to evaluate these functions on a real-world dataset.

In [None]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt

In [None]:
A=sparse.random(2000,2000,.005,'coo')
A=(A+A.T).tocoo()
A.setdiag(np.zeros(2000))

In [None]:
values = np.ones(len(A.col))
values[A.col==A.row]=0
A = sparse.coo_matrix((values,(A.col, A.row)), shape=A.shape, dtype = np.int16)  # keep the shape even if the last nodes are isolated

In [None]:
A_small_1 = sparse.coo_matrix(([1,1,1], ([0,1,2],[1,2,3])), shape=(4,4) )
A_small_1 = (A_small_1+A_small_1.T).tocoo()

In [None]:
A_small_2 = sparse.coo_matrix(([1,1], ([0,1],[1,2])), shape=(3,3) )
A_small_2 = (A_small_2+A_small_2.T).tocoo()

In [None]:
A_small_3 = sparse.coo_matrix(([1,1,1,1,1], ([0,1,2,3,3],[1,2,3,0,4])), shape=(5,5) )
A_small_3 = (A_small_3+A_small_3.T).tocoo()

In [None]:
A_small_4 = sparse.coo_matrix(([1,1,1,], ([0,1,2],[1,2,0])), shape=(3,3) )
A_small_4 = (A_small_4+A_small_4.T).tocoo()

In [None]:
A_small_5 = sparse.coo_matrix(([1,1,1,1], ([0,1,2,2],[1,2,0,3])), shape=(4,4) )
A_small_5 = (A_small_5+A_small_5.T).tocoo()

### a) preferential attachment

Compute the pairwise preferential attachment score of an adjacency matrix for all pairs of nodes. Assume the matrix describes a simple undirected graph with no self loops. Assume A is an adjacency matrix in sparse coo format. 
Return a dense (numpy) array of type `np.int16` with zeros on the diagonal.
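To make the definition concrete, here is a small sketch on a toy path graph, assuming the usual definition of the preferential attachment score as the product of the two node degrees. The names `A_toy`, `deg` and `pa` are illustrative; this is one possible way to compute the score, not the required implementation.

```python
import numpy as np
from scipy import sparse

# Path graph 0-1-2-3, built the same way as A_small_1
A_toy = sparse.coo_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 3])), shape=(4, 4))
A_toy = (A_toy + A_toy.T).tocoo()

deg = np.asarray(A_toy.sum(axis=1)).ravel()  # node degrees: [1, 2, 2, 1]
pa = np.outer(deg, deg).astype(np.int16)     # pa[i, j] = deg[i] * deg[j]
np.fill_diagonal(pa, 0)                      # no score for a node with itself
print(pa)
```

Keep in mind that `np.int16` overflows above 32767, so this cast is only safe while the product of the two largest degrees stays below that bound.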

In [None]:
def preferential_attachment(A):
    return np.zeros(A.shape, dtype=np.int16)

In [None]:
%%time
M_pref = preferential_attachment(A)

In [None]:
del M_pref

### b) common neighbors

Compute the pairwise common neighbors score of an adjacency matrix for all pairs of nodes. Assume the matrix describes a simple undirected graph with no self loops. Assume A is an adjacency matrix in sparse coo format. 
Return a dense (numpy) array of type `np.int16` with zeros on the diagonal.
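For a simple undirected graph, the number of common neighbors of `i` and `j` equals the entry `(i, j)` of the squared adjacency matrix. The sketch below checks this on a toy path graph; the names `A_toy` and `cn` are illustrative and the code is one possible approach under these assumptions.

```python
import numpy as np
from scipy import sparse

# Path graph 0-1-2-3, built the same way as A_small_1
A_toy = sparse.coo_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 3])), shape=(4, 4))
A_toy = (A_toy + A_toy.T).tocsr()  # CSR is the natural format for sparse products

cn = (A_toy @ A_toy).toarray().astype(np.int16)  # (A^2)[i, j] = number of paths i-k-j
np.fill_diagonal(cn, 0)                          # the diagonal of A^2 holds the degrees
print(cn)
```

Nodes 0 and 2 share neighbor 1, and nodes 1 and 3 share neighbor 2; all other pairs have no common neighbor.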

In [None]:
def common_neighbors(A):
    return np.zeros(A.shape, dtype=np.int16)

In [None]:
%%time
M_common=common_neighbors(A)


In [None]:
del M_common

### c) Jaccard

Compute the pairwise Jaccard score of an adjacency matrix for all pairs of nodes. Assume the matrix describes a simple undirected graph with no self loops. Assume A is an adjacency matrix in sparse coo format. 
Return a dense (numpy) array of type `np.float16` with zeros on the diagonal.
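A sketch of the Jaccard score on the same toy path graph, assuming the standard definition: the size of the intersection of the two neighborhoods divided by the size of their union, where the union size is `deg[i] + deg[j]` minus the number of common neighbors. Pairs with an empty union are assigned 0 here, which is an assumption of this sketch.

```python
import numpy as np
from scipy import sparse

# Path graph 0-1-2-3, built the same way as A_small_1
A_toy = sparse.coo_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 3])), shape=(4, 4))
A_toy = (A_toy + A_toy.T).tocsr()

cn = (A_toy @ A_toy).toarray().astype(np.float64)  # common neighbors
deg = np.asarray(A_toy.sum(axis=1)).ravel()
union = deg[:, None] + deg[None, :] - cn           # |N(i)| + |N(j)| - |N(i) & N(j)|

# Divide only where the union is non-empty; other entries stay 0
jac = np.divide(cn, union, out=np.zeros_like(cn), where=union > 0)
np.fill_diagonal(jac, 0)
jac = jac.astype(np.float16)
print(jac)
```

For example, nodes 0 and 2 have neighborhoods {1} and {1, 3}, giving a score of 1/2.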

In [None]:
def Jaccard(A):
    return np.zeros(A.shape, dtype=np.float16)

In [None]:
M_Jaccard_small_1 = Jaccard(A_small_1)
print(M_Jaccard_small_1)

In [None]:
M_Jaccard_small_2 = Jaccard(A_small_2)
print(M_Jaccard_small_2)

In [None]:
%%time
M_Jaccard = Jaccard(A)

In [None]:
del M_Jaccard

#### Helper function

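Usually the number of distinct values (i.e. distinct thresholds) is much smaller than the number of entries in the matrix. The following function counts how often each value occurs; keys passed via `additional_keys` that do not occur in the matrix are added with a count of zero.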

In [None]:
def value_counts(M, additional_keys=None):
    if M.dtype.char in np.typecodes['AllInteger']:
        keys, vals = value_counts_int(M)
    elif M.dtype.char in np.typecodes['Float']:
        keys, vals = value_counts_float(M)
    else:
        raise ValueError(f"invalid dtype {M.dtype}")
    
    if additional_keys is None:
        return keys, vals
    append_indices = additional_keys>keys[-1]
    append_keys = additional_keys[append_indices]
    additional_keys=additional_keys[~append_indices]

    indices = np.searchsorted(keys, additional_keys)
    inds = keys[indices]!=additional_keys
    new_indices = indices[inds]
    new_values = additional_keys[inds]
    out_keys = np.insert(keys, new_indices, new_values)
    out_values = np.insert(vals, new_indices, np.zeros(len(new_indices)))
    
    out_keys = np.append(out_keys, append_keys)
    out_values = np.append(out_values, np.zeros(len(append_keys)))
    return out_keys, out_values


In [None]:
def value_counts_int(M_in):
    counts = np.bincount(M_in.flat)
    a = np.nonzero(counts)[0]
    return a, counts[a]

In [None]:
def value_counts_float(M_in):
    return np.unique(M_in.flat, return_counts=True)
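To illustrate what the integer and float helpers compute, here is a standalone sketch on a hypothetical small score matrix `M_toy`. Both routes return the sorted distinct values together with their counts.

```python
import numpy as np

# Hypothetical small integer score matrix
M_toy = np.array([[0, 2, 2],
                  [2, 0, 5],
                  [2, 5, 0]], dtype=np.int16)

# Integer route: bincount over the flattened matrix, keep values that occur
counts = np.bincount(M_toy.ravel())
keys_int = np.nonzero(counts)[0]
vals_int = counts[keys_int]

# Generic route: np.unique works for floats as well
keys_unique, vals_unique = np.unique(M_toy.ravel(), return_counts=True)

print(keys_int, vals_int)
print(keys_unique, vals_unique)
```

The `bincount` route avoids sorting all entries, but it only works for non-negative integers.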

### d) fps tps and thresholds

Write a function that computes the true positive counts tps, false positive counts fps and thresholds given the observed adjacency matrix A and pairwise score matrix M. You can compare your results with those from sklearn.
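The quantities can be illustrated on flat toy arrays. The sketch below uses plain sorting instead of the `value_counts` helper, and the arrays `y_true` and `scores` are hypothetical; for the actual task the labels and scores would come from the entries of `A` and `M`. A pair counts as predicted positive when its score is at least the threshold.

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 0, 1])  # hypothetical labels (1 = edge observed)
scores = np.array([3, 3, 2, 1, 1, 0])  # hypothetical pairwise scores

order = np.argsort(-scores, kind="stable")  # sort by decreasing score
y_sorted, s_sorted = y_true[order], scores[order]

# Index of the last element in every run of equal scores
last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]

tps = np.cumsum(y_sorted)[last]  # positives with score >= threshold
fps = (last + 1) - tps           # everything else with score >= threshold
thresholds = s_sorted[last]
print(fps, tps, thresholds)
```

At threshold 2, for instance, three pairs have a score of at least 2; two of them are true edges and one is not.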

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve

In [None]:
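def get_fps_tps_thresholds(A, M):
    """Needs implementation, currently returns nonsense values"""
    fps, tps, thresholds = np.array([0,1]), np.array([0,1]), np.array([0,1])
    return fps, tps, thresholds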

### e) roc_curve

Implement the function `my_roc_curve` that computes the ROC curve for the observed adjacency matrix `A` and pairwise score matrix `M`. Use the `get_fps_tps_thresholds` function for this.
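Given cumulative counts, the ROC points follow by normalizing with the totals. The arrays below are hypothetical example counts, ordered from the highest to the lowest threshold; the sketch assumes that there is at least one positive and one negative pair.

```python
import numpy as np

# Hypothetical cumulative counts, highest threshold first
fps = np.array([1, 1, 3, 3])
tps = np.array([1, 2, 2, 3])

# Prepend the point (0, 0), then divide by the number of negatives / positives
fpr = np.r_[0, fps] / fps[-1]
tpr = np.r_[0, tps] / tps[-1]
print(fpr)
print(tpr)
```

The last entries of `fps` and `tps` are the total numbers of negatives and positives, because the lowest threshold accepts every pair.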

In [None]:
def my_roc_curve(A, M):
    """Needs implementation, currently returns nonsense values"""
    FPR = np.array([0,1])
    TPR = np.array([0,1])
    keys = np.array([0,1])
    return FPR, TPR, keys

In [None]:
my_roc_curve(A, Jaccard(A))

### f) precision recall curve

Write a function that computes the precision recall curve for the pairwise value matrix `M` and the observed adjacency matrix `A`.

The function returns three arrays: the precision values, the corresponding recall values, and the thresholds (the same order of return values as in sklearn's `precision_recall_curve`).
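Precision and recall can be derived from the same cumulative counts. The arrays below are hypothetical example counts, ordered from the highest to the lowest threshold, and the sketch assumes that every threshold accepts at least one pair.

```python
import numpy as np

# Hypothetical cumulative counts, highest threshold first
fps = np.array([1, 1, 3, 3])
tps = np.array([1, 2, 2, 3])

precision = tps / (tps + fps)  # share of predicted pairs that are true edges
recall = tps / tps[-1]         # share of true edges that are recovered
print(precision)
print(recall)
```

When comparing with sklearn, note that `precision_recall_curve` returns its values in a different order and appends a final point with precision 1 and recall 0.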

In [None]:
def my_pr_curve(A, M):
    """Needs implementation, currently returns nonsense values"""
    precision=np.array([0,1])
    recall=np.array([0,1])
    thresholds=np.array([0,1])
    return precision, recall, thresholds

### Information: There are three different ways to implement the AUC

See below for the different options. The one used in the lecture is "lower" or "upper" depending on the order of values.


In [2]:
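def my_auc(x, y, kind):
    # kind is one of "trapezoid", "lower" or "upper"
    s=0
    for i in range(1,len(x)):  # i-1 and i index neighboring points
        if kind=="trapezoid":
            s+=(y[i]+y[i-1])* (x[i]-x[i-1])/2
        elif kind=="lower":
            s+=(y[i])* (x[i]-x[i-1])
        elif kind == "upper":
            s+=(y[i-1])* (x[i]-x[i-1])
        else:
            raise ValueError(f"unknown kind {kind!r}")
    return s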
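The three variants can also be written in vectorized form. The sketch below evaluates them on a hypothetical three-point curve to show that they generally give different values; the variable names mirror the `kind` labels used in this notebook.

```python
import numpy as np

# Hypothetical curve with three points
x = np.array([0.0, 0.5, 1.0])
y = np.array([0.0, 1.0, 1.0])

dx = np.diff(x)                                    # widths of the intervals
auc_trapezoid = np.sum(dx * (y[1:] + y[:-1]) / 2)  # average of both end points
auc_lower = np.sum(dx * y[1:])                     # "lower": uses y[i]
auc_upper = np.sum(dx * y[:-1])                    # "upper": uses y[i-1]
print(auc_trapezoid, auc_lower, auc_upper)
```

The trapezoid value always lies halfway between the other two, since it averages the two end points of every interval.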

### g) Applying the functions to real data

Apply the functions to the Facebook network. Compute the average AUC-ROC and AUC-PR for the three different metrics over the first 20 weeks of the data. Compute the pairwise score matrix on one week and then evaluate it on the following week.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Week pairs (compute scores on the first week -> evaluate on the second):
# 1->2
# 2->3
# ...
# 19->20
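The pairing of consecutive weeks can be generated programmatically. This is a minimal sketch that assumes weeks are numbered from 1; the name `week_pairs` is illustrative.

```python
# Consecutive (train_week, test_week) pairs for the first 20 weeks
week_pairs = [(w, w + 1) for w in range(1, 20)]
print(week_pairs[0], week_pairs[-1], len(week_pairs))
```

The first 20 weeks therefore yield 19 evaluations, and the reported averages are taken over these 19 pairs.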