#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 2:  PageRank + Learning to Rank

### 100 points [10% of your final grade]

### Due: March 5, 2020 by 11:59pm

*Goals of this homework:* In this homework you will explore real-world challenges of building a graph (in this case, from tweets), implement and test the classic PageRank algortihm over this graph. In addition, you will apply learning to rank to a real-world dataset and report the performance in terms of NDCG.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw2.ipynb`. For example, my homework submission would be something like `555001234_hw2.ipynb`. Submit this notebook via eCampus (look for the homework 2 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1: PageRank (60 points)
In this assignment, we're going to adapt the classic PageRank approach to allow us to find not the most authoritative web pages, but rather to find significant Twitter users. 


## Part 1.1: A re-Tweet Graph (20 points)

So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Twitter users and their retweets of other Twitter users (so user = node, retweet of another user = edge). Over this Twitter-user graph, we can apply the PageRank approach to order the users. The main idea is that a user who is retweeted by other users is more "impactful". 

Here is a toy example. Suppose you are given the following four retweets:

* **userID**: diane, **text**: "RT ", **sourceID**: bob
* **userID**: charlie, **text**: "RT Welcome", **sourceID**: alice
* **userID**: bob, **text**: "RT Hi ", **sourceID**: diane
* **userID**: alice, **text**: "RT Howdy!", **sourceID**: parisa

There are four short tweets retweeted by four users. The retweet between users form a directed graph with five nodes and four edges. E.g., the "diane" node has a directed edge to the "bob" node.

You should build a graph by parsing the tweets in the file we provide called *PageRank.json*.

**Notes:**

* You may see some weird characters in the content of tweets, just ignore them. 
* The edges are binary and directed. If Bob retweets Alice once, in 10 tweets, or 10 times in one tweet, there is an edge from Bob to Alice, but there is not an edge from Alice to Bob.
* If a user retweets herself, ignore it.
* Correctly parsing screen_name in a tweet is error-prone. Use the id of the user (this is the user who is re-tweeting) and the id of the user in the retweeted_status field (this is the user who is being re-tweeted; that is, this user created the original tweet).
* Later you will need to implement the PageRank algorithm on the graph you build here.


In [1]:
# Here define your function for building the graph by parsing 
# the input file of tweets
# Insert as many cells as you want
import json
import numpy as np
import pandas as pd

from collections import defaultdict


def construct_graph(filename):
    """
    Construct the graph by the given file path.
    
    
    Since the graph is binary, we can use an adjacency list to represent it. 
    
    After computing the PageRank scores, we also need to print the screen name of each user. So we
    have another dictionary to store the mapping of user_id to screen_name.
    """
    # use two adjacency list to represent the graph
    graph = defaultdict(set)
    name_dict = dict()
    with open(filename) as f:
        for line in f:
            # parse json file line by line
            tweet_json = json.loads(line)
            
            # get properties
            retweeting_id = tweet_json["user"]["id"]
            retweeting_name = tweet_json["user"]["name"]
            retweeting_screen_name = tweet_json["user"]["screen_name"]
            retweeted_id = tweet_json["retweeted_status"]["user"]["id"]
            retweeted_name = tweet_json["retweeted_status"]["user"]["name"]
            retweeted_screen_name = tweet_json["retweeted_status"]["user"]["screen_name"]
            
            # update name dictionary
            name_dict[retweeting_id] = (retweeting_name, retweeting_screen_name)
            name_dict[retweeted_id] = (retweeted_name, retweeted_screen_name)
            
            # insert edges into the graph adjacency list; if the user retweets himself, we just skip
            if retweeting_id != retweeted_id:
                graph[retweeting_id].add(retweeted_id)

    # generate vertex set (except for the single nodes which have no edges coming out or coming in)
    vertex_set = set(graph.keys())
    for item in graph.values():
        vertex_set.update(item)
    
    return graph, vertex_set, name_dict
            

# construct the graph and user id-to-name mapping dictionary
graph, vertex_set, name_dict = construct_graph("PageRank.json")
print("size of the adjacency list of the graph: %d" % len(graph))
print("size of the name dictionary of the graph: %d" % len(name_dict))

size of the adjacency list of the graph: 902
size of the name dictionary of the graph: 1003


In [2]:
# Call your function to print out the size of the graph, 
# i.e., the number of nodes and edges
# How you maintain the graph is totaly up to you
# However, if you encounter any memory issues, we recommend you 
#write the graph into a file, and load it later.

# number of nodes
print("number of nodes: %d" % len(vertex_set))

# number of edges
edge_count = 0
for v in graph.values():
    edge_count += len(v);
print("number of edges: %d" % edge_count)

number of nodes: 1003
number of edges: 6177


We will not check the correctness of your graph. However, this will affect the PageRank results later.

## Part 1.2: PageRank Implementation (30 points)

Your program will return the top 10 users with highest PageRank scores. The **output** should be like:

* user1 - score1
* user2 - score2
* ...
* user10 - score10

You should follow these **rules**:

* Assume all nodes start out with equal probability.
* The probability of the random surfer teleporting is 0.1 (that is, the damping factor is 0.9).
* If a user is never retweeted and does not retweet anyone, their PageRank scores should be zero. Do not include the user in the calculation.
* It is up to you to decide when to terminate the PageRank calculation.
* There are PageRank implementations out there on the web. Remember, your code should be **your own**.


**Hints**:
* If you're using the matrix style approach, you should use [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
* Scipy is built on top of Numpy and has support for sparse matrices. You most likely will not need to use Scipy unless you'd like to try out their sparse matrices.
* If you choose to use Numpy (and Scipy), please make sure your Anaconda environment include their latest versions.
* Test your parsing and PageRank calculations using a handful of tweets, before moving on to the entire file we provide.
* We will evaluate the user ranks you provide as well as the quality of your code. So make sure that your code is clear and readable.

What is the termination condition in your PageRank implementation? Describe it below:

*ADD YOUR ANSWER HERE*

In [3]:
# Here add your code to implement a function called PageRanker
# Insert as many cells as you want

def cal_page_rank_score(graph, vertex_set, stop_threshold):
    d = 0.9
    n = len(vertex_set)
    
    # map the node id into the range from 0 to n
    idx_to_id = np.array(list(vertex_set))
    id_to_idx = dict()
    for idx_id_pair in enumerate(vertex_set):
        id_to_idx[idx_id_pair[1]] = idx_id_pair[0]
        
    # since we only have a thousand vertices, there's no need for using a sparse matrix
    # get the two matrices for iterative calculation
    page_rank_mt_left = np.ones(shape=(n, 1)) * (1-d) / n
    page_rank_mt_right = np.zeros(shape=(n,n))
    for cur, neighbors in graph.items():
        m = len(neighbors)
        for neighbor in neighbors:
            page_rank_mt_right[id_to_idx[neighbor]][id_to_idx[cur]] = d * 1.0 / m
    
    # calculate PageRank scores
    page_rank_scores = np.ones(shape=(n, 1))
    while True:
        # record the scores of the last round
        page_rank_scores_last = page_rank_scores
        page_rank_scores = page_rank_mt_left + page_rank_mt_right.dot(page_rank_scores)
        
        # every time we compare the total absolute difference to the stop_threshold
        # if satisfied, break the while loop and return the result (We consider the result has converged)
        if cal_total_diff(page_rank_scores_last, page_rank_scores) < stop_threshold:
            break

    # rank from high to low
    # combine the original id with the PR score
    page_rank_scores_with_id = np.zeros(shape=(n, 2))
    for idx_score_pair in enumerate(page_rank_scores):
        page_rank_scores_with_id[idx_score_pair[0]][0] = idx_to_id[idx_score_pair[0]]
        page_rank_scores_with_id[idx_score_pair[0]][1] = idx_score_pair[1]

    # sort by the second column, then reverse the vector so that it's in the descending order
    return page_rank_scores_with_id[page_rank_scores_with_id[:, 1].argsort()][::-1]    
            

def cal_total_diff(page_rank_scores_last, page_rank_scores):
    score_diff = page_rank_scores_last - page_rank_scores
    return sum(abs(diff) for diff in score_diff)
    

In [4]:
# Now let's call your function on the graph you've built. Output the results.
# call the function and get the page rank scores
page_rank_scores = cal_page_rank_score(graph, vertex_set, 1e-6)
page_rank_scores

array([[1.18390615e+09, 1.36480243e-02],
       [3.01965959e+09, 1.03577208e-02],
       [3.07769557e+09, 1.00544722e-02],
       ...,
       [2.95003020e+09, 9.97008973e-05],
       [3.16824535e+09, 9.97008973e-05],
       [2.92902092e+09, 9.97008973e-05]])

In [5]:
# output the rank results
# take the top 10 entries, then print
# print("user_id \t name \t screen_name \t score \t")
top_10_data = []
for entry in page_rank_scores[:10]:
    top_10_data.append([str(int(entry[0])), name_dict[entry[0]][0], name_dict[entry[0]][1], entry[1]])

pd.DataFrame(top_10_data, columns=["user_id", "name", "screen_name", "score"])    

Unnamed: 0,user_id,name,screen_name,score
0,1183906148,صهيب #البركة,Green_suhaib,0.013648
1,3019659587,جُــهـيـمااان,Seeer_2,0.010358
2,3077695572,الولاء والبراء,Time__11,0.010054
3,3068694151,السُلطان سِنجر,senjar0,0.008884
4,2598548166,مفتاح الخير,kikkobaba1,0.008469
5,3154266823,✴_الموحد الكردي_☝️✴️,VIP_____NEW_A1,0.008347
6,571198546,أبُـوالوليد الشيشاني,ossnkaksbahn,0.008304
7,3042570996,عُـيُـونُ الأمّـة,news_is_is,0.008269
8,3039321886,شؤون عسكرية,yhbyhbyhbyhb111,0.007502
9,3082766914,امي الجزراويه16,y_mam98,0.006962


## Part 1.3: Improving PageRank (10 points)
In the many years since PageRank was introduced, there have been many improvements and extensions. For this part, you should experiment with one such improvement and then compare the results you get with the original results in Part 1.2. 

In [6]:
# Plus be sure to describe your extension (what is it? 
# why did you choose it?) and your comparison to Part 1.2

### Solution of Improvement

My solution of improvement is to count the retweeted times as the weight of each edge in the graph and use the weight to calculate the outbound links `L()`. In other words, the probabilities of each outbound edges are not equal and they are determined by the weight of each edge.

### Why Choose it?

The original version of PageRank algorithm has an equal probability for each link and doesn't fully take the number of links to a same page into account. Therefore, I think for our scenario of retweeting, it might be a good way of improvement. 

In [7]:
# Here add your code
def construct_weighted_graph(filename):
    """
    Construct the weighted graph by the given file path.
    
    
    Since the graph is binary, we can use an adjacency list to represent it. 
    
    After computing the PageRank scores, we also need to print the screen name of each user. So we
    have another dictionary to store the mapping of user_id to screen_name.
    """
    # use two adjacency list to represent the graph
    graph = defaultdict(dict)
    name_dict = dict()
    with open(filename) as f:
        for line in f:
            # parse json file line by line
            tweet_json = json.loads(line)
            
            # get properties
            retweeting_id = tweet_json["user"]["id"]
            retweeting_name = tweet_json["user"]["name"]
            retweeting_screen_name = tweet_json["user"]["screen_name"]
            retweeted_id = tweet_json["retweeted_status"]["user"]["id"]
            retweeted_name = tweet_json["retweeted_status"]["user"]["name"]
            retweeted_screen_name = tweet_json["retweeted_status"]["user"]["screen_name"]
            
            # update name dictionary
            name_dict[retweeting_id] = (retweeting_name, retweeting_screen_name)
            name_dict[retweeted_id] = (retweeted_name, retweeted_screen_name)
            
            # insert edges into the graph adjacency list; if the user retweets himself, we just skip
            if retweeting_id != retweeted_id:
                if retweeted_id not in graph[retweeting_id]:
                    graph[retweeting_id][retweeted_id] = 0
                graph[retweeting_id][retweeted_id] += 1

    # normalize the weight to probability
    for neighbors in graph.values():
        weight_sum = sum(neighbors.values())
        for neighbor in neighbors.keys():
            neighbors[neighbor] /= weight_sum

    # generate vertex set (except for the single nodes which have no edges coming out or coming in)
    vertex_set = set(graph.keys())
    for item in graph.values():
        vertex_set.update(item.keys())
    
    return graph, vertex_set, name_dict
            

# construct the graph and user id-to-name mapping dictionary
graph, vertex_set, name_dict = construct_weighted_graph("PageRank.json")
print("size of the adjacency list of the graph: %d" % len(graph))
print("size of the name dictionary of the graph: %d" % len(name_dict))

size of the adjacency list of the graph: 902
size of the name dictionary of the graph: 1003


In [8]:
# Here add your code to implement a function called PageRanker
# Insert as many cells as you want

def cal_weighted_page_rank_score(graph, vertex_set, stop_threshold):
    d = 0.9
    n = len(vertex_set)
    
    # map the node id into the range from 0 to n
    idx_to_id = np.array(list(vertex_set))
    id_to_idx = dict()
    for idx_id_pair in enumerate(vertex_set):
        id_to_idx[idx_id_pair[1]] = idx_id_pair[0]
        
    # since we only have a thousand vertices, there's no need for using a sparse matrix
    # get the two matrices for iterative calculation
    page_rank_mt_left = np.ones(shape=(n, 1)) * (1-d) / n
    page_rank_mt_right = np.zeros(shape=(n,n))
    for cur, neighbors in graph.items():
        for neighbor in neighbors.keys():
            # now d times probability
            page_rank_mt_right[id_to_idx[neighbor]][id_to_idx[cur]] = d * neighbors[neighbor]
    
    # calculate PageRank scores
    page_rank_scores = np.ones(shape=(n, 1))
    while True:
        # record the scores of the last round
        page_rank_scores_last = page_rank_scores
        page_rank_scores = page_rank_mt_left + page_rank_mt_right.dot(page_rank_scores)
        
        # every time we compare the total absolute difference to the stop_threshold
        # if satisfied, break the while loop and return the result (We consider the result has converged)
        if cal_total_diff(page_rank_scores_last, page_rank_scores) < stop_threshold:
            break

    # rank from high to low
    # combine the original id with the PR score
    page_rank_scores_with_id = np.zeros(shape=(n, 2))
    for idx_score_pair in enumerate(page_rank_scores):
        page_rank_scores_with_id[idx_score_pair[0]][0] = idx_to_id[idx_score_pair[0]]
        page_rank_scores_with_id[idx_score_pair[0]][1] = idx_score_pair[1]

    # sort by the second column, then reverse the vector so that it's in the descending order
    return page_rank_scores_with_id[page_rank_scores_with_id[:, 1].argsort()][::-1]    
            

def cal_total_diff(page_rank_scores_last, page_rank_scores):
    score_diff = page_rank_scores_last - page_rank_scores
    return sum(abs(diff) for diff in score_diff)
    

In [9]:
# call the function and get the weighted page rank scores
weighted_page_rank_scores = cal_weighted_page_rank_score(graph, vertex_set, 1e-6)

In [10]:
# output the rank results
# take the top 10 entries, then print
# print("user_id \t name \t screen_name \t score \t")
top_10_data = []
for entry in weighted_page_rank_scores[:10]:
    top_10_data.append([str(int(entry[0])), name_dict[entry[0]][0], name_dict[entry[0]][1], entry[1]])

pd.DataFrame(top_10_data, columns=["user_id", "name", "screen_name", "score"])    

Unnamed: 0,user_id,name,screen_name,score
0,3042570996,عُـيُـونُ الأمّـة,news_is_is,0.019975
1,1183906148,صهيب #البركة,Green_suhaib,0.015222
2,2860872854,إبن الطيب,Dz_c4,0.015214
3,3154266823,✴_الموحد الكردي_☝️✴️,VIP_____NEW_A1,0.011731
4,610166901,أبوثامر المهاجر,R4_Qwy,0.011333
5,3142161801,أنباريـة بس زرقاويةْ,faresa_a3,0.010784
6,3068694151,السُلطان سِنجر,senjar0,0.009806
7,3019659587,جُــهـيـمااان,Seeer_2,0.009579
8,2598548166,مفتاح الخير,kikkobaba1,0.009428
9,3039321886,شؤون عسكرية,yhbyhbyhbyhb111,0.009036


We can see some users like `news_is_is` becomes the first top 1 user in this kind of PageRank. And the overall page rank scores of the top 10 users are increasing.

# Part 2: Learning to Rank (40 points)

For this part, we're going to play with some Microsoft LETOR data that has query-document relevance judgments. Let's see how learning to rank works in practice. 

First, you will need to download the MQ2008.zip file from the Resources tab on Piazza. This is data from the [Microsoft Research IR Group](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/).

The data includes 15,211 rows. Each row is a query-document pair. The first column is a relevance label of this pair (0,1 or 2--> the higher value the more related), the second column is query id, the following columns are features, and the end of the row is comment about the pair, including id of the document. A query-document pair is represented by a 46-dimensional feature vector. Features are a numeric value describing a document and query such as TFIDF, BM25, Page Rank, .... You can find compelete description of features from [here](https://arxiv.org/ftp/arxiv/papers/1306/1306.2597.pdf).

The good news for you is the dataset is ready for analysis: It has already been split into 5 folds (see the five folders called Fold1, ..., Fold5).


## Part 2.1: Build Point-wise Learning to Rank  (20 points)
First, you should build a point-wise Learning to Rank framework. 
1. You could train a binary classification model like SVM or logistic regression on the train file. In this case, 0 is treated as negative (irrelevant) sample and 1, 2 are treated as positive (relevant) sample.
2. You apply the already trained model to predict the scores for documents on test file.
3. Order the documents based on the scores.

add your results and discussion here

### Discussion

As Yun He mentioned in Piazza,

> Each fold in the hw2's learning to rank dataset has the same dataset but with different splits.
> Hence, the five-fold cross-validation in part 2 in hw2 should be done in this way:
> 
> 1. In each fold, you train on train.txt, tune hyperparameters on vali.txt and test on test.txt
> 2. Calculate the average NDCG@10 for the queries in the test.txt in each fold
> 3. Report the average NDCG@10 based on the five test.txt.

So I think the five folds of data should be used in the following way:

1. Create a parameter grid which contains several groups of hyperparameters


2. Do grid search on these 5 folds of data. That is: for each group of hyperparameters, we use them to train an SVM model on the train.txt of every fold of data and then validate it on the vali.txt of every fold of data respectively. Then calculate the average ndcg.


3. In the end, we choose the group of hyperparameters who has the maximum average ndcg. That is the grid search result.


4. Calculate the NDCG@10 based on the five test.txt using the best model we get in step 3.

In [11]:
import pandas as pd
import numpy as np
import math

from collections import defaultdict
from sklearn import svm

In [12]:
def parse_file(filename):
    """
    Parse dataset file by the given filename and return feature, qid, doc_id, labels arrays.
    """
    qid = []
    doc_id = []
    feature_vectors = []
    labels = []
    with open(filename) as f:
        for line in f:
            # parse features, qid, doc_id, labels
            items = line[:line.find('#')].split(' ')
            features = [float(f[f.find(':')+1:]) for f in items[2:-1]]
            qid.append(items[1][items[1].find(':')+1:])
            doc_id.append(line[line.find("#docid = ") + len("#docid = ") : line.find(" inc = ")])
            feature_vectors.append(features)
            labels.append(float(items[0]))

    # convert list to numpy array
    feature_vectors = np.array(feature_vectors)        
    labels = np.array(labels)
    
    return feature_vectors, qid, doc_id, labels


In [13]:
def calculate_ndcg(test_qid, pred_labels, test_labels):
    """
    Calculate NDCG@10
    """
    # gernate dataframe which contains column "qid", "pred_label", "true_label"
    qid_df = pd.DataFrame(test_qid, columns=["qid"])
    pred_label_df = pd.DataFrame(pred_labels, columns=["pred_label"])
    true_label_df = pd.DataFrame(test_labels, columns=["true_label"])
    query_result_df = pd.concat([qid_df,pred_label_df,true_label_df],axis=1)
    
    # calculate the prediction rank and ideal rank within each group (group by qid)
    query_result_df["pred_rank"] = query_result_df.groupby("qid")["pred_label"].rank("first", ascending=False)
    query_result_df["ideal_rank"] = query_result_df.groupby("qid")["true_label"].rank("first", ascending=False)
    
    # calculate DCG@10
    ndcg_dict = defaultdict(float)
    idcg_dict = defaultdict(float)
    for index, row in query_result_df.iterrows():
        if row["pred_rank"] <= 10.0:
            ndcg_dict[row["qid"]] += (2 ** row["true_label"] - 1) / math.log(row["pred_rank"] + 1, 2)
        if row["ideal_rank"] <= 10.0:
            idcg_dict[row["qid"]] += (2 ** row["true_label"] - 1) / math.log(row["ideal_rank"] + 1, 2)

    # normalize
    for key in ndcg_dict.keys():
        # if we meet the case that ideal dcg score is 0, just treat it as 0 (skip it)
        if idcg_dict[key] > 0:
#             print("key: %s, dcg: %f, idcg: %f" % (key, ndcg_dict[key], idcg_dict[key]))
            ndcg_dict[key] /= idcg_dict[key]

    # calculate average ndcg
    average_ndcg = np.array(list(ndcg_dict.values())).mean()
            
    return average_ndcg, ndcg_dict


In [14]:
# parameter grid
# since I have already pretest some groups of parameters and shrinked the range of the
# best parameters to gamma [0.65, 0.85] & C [10, 60], I just do grid search in the 
# following groups of hypeparameters
gamma = [0.65, 0.7, 0.75, 0.8, 0.85]
C = [1, 10, 20, 30, 40, 50, 60]
folds = ["MQ2008/Fold1/", "MQ2008/Fold2/", "MQ2008/Fold3/", "MQ2008/Fold4/", "MQ2008/Fold5/"]

# grid search
best_score = 0.
for i in range(len(gamma)):
    for j in range(len(C)):
        clf = svm.SVC(kernel='rbf', gamma=gamma[i], C=C[j])
        
        average_score = 0.
        for k in range(5):
            # train and predict on the validation dataset
            train_features, train_qid, train_doc_id, train_labels = parse_file(folds[k] + "train.txt")
            vali_features, vali_qid, vali_doc_id, vali_labels = parse_file(folds[k] + "vali.txt")
            clf.fit(train_features, train_labels)
            pred_labels = clf.predict(vali_features)
            
            # calculate ndcg as score for tuning the hypeparameters
            average_ndcg, ndcg_dict = calculate_ndcg(vali_qid, pred_labels, vali_labels)
            average_score += average_ndcg
        
        # calculate average score of the five folds
        average_score /= 5
        print("For gamma: %f & C: %f, the average ndcg score is: %f" % (gamma[i], C[j], average_score))
        
        # if the average score is better, record the current hyperparameter group 
        if average_score > best_score:
            best_parameter = [gamma[i], C[j]]
            best_score = average_score

print("The parameters of the best model are: ", best_parameter)
print("The score of the best model is: ", best_score)

For gamma: 0.650000 & C: 1.000000, the average ndcg score is: 0.348171
For gamma: 0.650000 & C: 10.000000, the average ndcg score is: 0.385171
For gamma: 0.650000 & C: 20.000000, the average ndcg score is: 0.392228
For gamma: 0.650000 & C: 30.000000, the average ndcg score is: 0.393491
For gamma: 0.650000 & C: 40.000000, the average ndcg score is: 0.392897
For gamma: 0.650000 & C: 50.000000, the average ndcg score is: 0.392256
For gamma: 0.650000 & C: 60.000000, the average ndcg score is: 0.390979
For gamma: 0.700000 & C: 1.000000, the average ndcg score is: 0.347433
For gamma: 0.700000 & C: 10.000000, the average ndcg score is: 0.385874
For gamma: 0.700000 & C: 20.000000, the average ndcg score is: 0.391876
For gamma: 0.700000 & C: 30.000000, the average ndcg score is: 0.393351
For gamma: 0.700000 & C: 40.000000, the average ndcg score is: 0.393520
For gamma: 0.700000 & C: 50.000000, the average ndcg score is: 0.392727
For gamma: 0.700000 & C: 60.000000, the average ndcg score is: 0.3

In [15]:
# predict the test set of fold 1 as an example using the best parameters we select in the last step
svm_clf = svm.SVC(kernel='rbf', gamma=best_parameter[0], C=best_parameter[1])
train_features, train_qid, train_doc_id, train_labels = parse_file(folds[0] + "train.txt")
test_features, test_qid, test_doc_id, test_labels = parse_file(folds[0] + "test.txt")
svm_clf.fit(train_features, train_labels)
pred_labels = clf.predict(test_features)

# sort by pred_labels
qid_df = pd.DataFrame(test_qid, columns=["qid"])
doc_id_df = pd.DataFrame(test_doc_id, columns=["dic_id"])
pred_label_df = pd.DataFrame(pred_labels, columns=["pred_label"])
true_label_df = pd.DataFrame(test_labels, columns=["true_label"])
query_result_df = pd.concat([qid_df, doc_id_df, pred_label_df, true_label_df], axis=1)
query_result_df["pred_rank"] = query_result_df.groupby("qid")["pred_label"].rank("first", ascending=False)
query_result_df = query_result_df.groupby(["qid"]).apply(lambda x: x.sort_values(["pred_rank"])).reset_index(drop=True)
print(query_result_df.to_string())

        qid             dic_id  pred_label  true_label  pred_rank
0     18219   GX020-25-8391882         1.0         1.0        1.0
1     18219   GX004-93-7097963         0.0         0.0        2.0
2     18219   GX010-40-4497720         0.0         0.0        3.0
3     18219  GX016-32-14546147         0.0         0.0        4.0
4     18219   GX025-94-0531672         0.0         0.0        5.0
5     18219  GX026-03-13004845         0.0         0.0        6.0
6     18219  GX048-02-13747475         0.0         0.0        7.0
7     18219  GX268-53-13016636         0.0         0.0        8.0
8     18230   GX019-16-5501512         2.0         2.0        1.0
9     18230   GX233-80-3062211         2.0         2.0        2.0
10    18230   GX236-77-6583677         2.0         2.0        3.0
11    18230  GX256-85-15564040         2.0         2.0        4.0
12    18230  GX272-52-14408887         2.0         2.0        5.0
13    18230   GX000-52-8600090         1.0         1.0        6.0
14    1823

## Part 2.2: NDCG (20 points)

Based on your prediction file (results could be ranked by scores in the prediction file) and ground-truth (i.e., 0,1,2) in the test file, calculate NDCG for each query. Report average NDCG for all queries in the five-fold cross validation.

For NDCG, please bulid your own function rather then using any package.

In [16]:
# get NDCG@10 of test dataset for each fold
svm_clf = svm.SVC(kernel='rbf', gamma=best_parameter[0], C=best_parameter[1])
total_avg_ndcg = 0.
for k in range(5):
    # train and predict
    train_features, train_qid, train_doc_id, train_labels = parse_file(folds[k] + "train.txt")
    test_features, test_qid, test_doc_id, test_labels = parse_file(folds[k] + "test.txt")
    clf.fit(train_features, train_labels)
    pred_labels = clf.predict(test_features)

    # calculate ndcg@10
    average_ndcg, ndcg_dict = calculate_ndcg(test_qid, pred_labels, test_labels)
    print("NDCG in fold%d (%d queries): %f" % (k+1, len(ndcg_dict.keys()), average_ndcg))
    
    total_avg_ndcg += average_ndcg
    
print("Average NDCG@10: %f" % (total_avg_ndcg / 5.) )

NDCG in fold1 (156 queries): 0.363621
NDCG in fold2 (157 queries): 0.374346
NDCG in fold3 (157 queries): 0.372298
NDCG in fold4 (157 queries): 0.429219
NDCG in fold5 (157 queries): 0.410176
Average NDCG@10: 0.389932


## (BONUS) Pairwise Learning to Rank (5 points)

Rather than use the point-wise approach as in Part 2.1, instead try to implement a paiwise approach.

In [17]:
# your code here

## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*

1. [A Improved PageRank Algorithm
Based on Page Link Weight](https://sci-hub.tw/10.1007/978-3-319-11197-1_57)
2. [PageRank - Wikipedia](https://en.wikipedia.org/wiki/PageRank)
3. [pagerank 算法 快速入门](https://www.zybuluo.com/zhuanxu/note/1157516)
4. [sklearn.svm.SVC — scikit-learn 0.22.2 documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)