<a href="https://colab.research.google.com/github/rosslogan702/learning_to_rank/blob/master/ltr_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Learning to Rank - Regression

This notebook is concerned with applying learning to rank approaches on a benchmark dataset LETOR-10K from Microsoft.

This notebook will be concerned with applying regression approaches to the learning to rank problem.

## Mount Google Drive

Dataset is stored in google drive, mount google drive onto notebook to enable to read dataset in.

In [8]:
# Mount google drive for access to the LETOR-10K dataset
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Load Dataset

The LETOR-10K dataset is quite large and contains a number of files already pre-split into train & test sets.

Load the train & test data into a dataframe ready to be modelled. The dataset is stored in google drive so load from here.

The dataset contains a number of folds of data that have corresponding train, test & validation datasets. For the purposes of this notebook, we are only going to use the first train/test dataset in fold 1.

In [0]:
import pandas as pd
import numpy as np

In [0]:
# feature names to make the dataframe more easily readable
names = ['relevance_score','query_id','covered_query_term_number_body',
         'covered_query_term_number_anchor','covered_query_term_number_title',
         'covered_query_term_number_url',
         'covered_query_term_number_whole_document',
         'covered_query_term_ratio_body','covered_query_term_ratio_anchor',
         'covered_query_term_ratio_title','covered_query_term_ratio_url',
         'covered_query_term_ratio_whole_document','stream_length_body',
         'stream_length_anchor','stream_length_title','stream_length_url',
         'stream_length_whole_document','IDF(Inverse_document_frequency)_body',
         'IDF(Inverse_document_frequency)_anchor',
         'IDF(Inverse_document_frequency)_title',
         'IDF(Inverse_document_frequency)_url',
         'IDF(Inverse_document_frequency)_whole_document',
         'sum_of_term_frequency_body','sum_of_term_frequency_anchor',
         'sum_of_term_frequency_title','sum_of_term_frequency_url',
         'sum_of_term_frequency_whole_document','min_of_term_frequency_body',
         'min_of_term_frequency_anchor','min_of_term_frequency_title',
         'min_of_term_frequency_url','min_of_term_frequency_whole_document',
         'max_of_term_frequency_body','max_of_term_frequency_anchor',
         'max_of_term_frequency_title','max_of_term_frequency_url',
         'max_of_term_frequency_whole_document','mean_of_term_frequency_body',
         'mean_of_term_frequency_anchor','mean_of_term_frequency_title',
         'mean_of_term_frequency_url','mean_of_term_frequency_whole_document',
         'variance_of_term_frequency_body','variance_of_term_frequency_anchor',
         'variance_of_term_frequency_title','variance_of_term_frequency_url',
         'variance_of_term_frequency_whole_document',
         'sum_of_stream_length_normalized_term_frequency_body',
         'sum_of_stream_length_normalized_term_frequency_anchor',
         'sum_of_stream_length_normalized_term_frequency_title',
         'sum_of_stream_length_normalized_term_frequency_url',
         'sum_of_stream_length_normalized_term_frequency_whole_document',
         'min_of_stream_length_normalized_term_frequency_body',
         'min_of_stream_length_normalized_term_frequency_anchor',
         'min_of_stream_length_normalized_term_frequency_title',
         'min_of_stream_length_normalized_term_frequency_url',
         'min_of_stream_length_normalized_term_frequency_whole_document',
         'max_of_stream_length_normalized_term_frequency_body',
         'max_of_stream_length_normalized_term_frequency_anchor',
         'max_of_stream_length_normalized_term_frequency_title',
         'max_of_stream_length_normalized_term_frequency_url',
         'max_of_stream_length_normalized_term_frequency_whole_document',
         'mean_of_stream_length_normalized_term_frequency_body',
         'mean_of_stream_length_normalized_term_frequency_anchor',
         'mean_of_stream_length_normalized_term_frequency_title',
         'mean_of_stream_length_normalized_term_frequency_url',
         'mean_of_stream_length_normalized_term_frequency_whole_document',
         'variance_of_stream_length_normalized_term_frequency_body',
         'variance_of_stream_length_normalized_term_frequency_anchor',
         'variance_of_stream_length_normalized_term_frequency_title',
         'variance_of_stream_length_normalized_term_frequency_url',
         'variance_of_stream_length_normalized_term_frequency_whole_document',
         'sum_of_tf*idf_body','sum_of_tf*idf_body_anchor',
         'sum_of_tf*idf_body_title','sum_of_tf*idf_body_url',
         'sum_of_tf*idf_body_whole_document','min_of_tf*idf_body',
         'min_of_tf*idf_anchor','min_of_tf*idf_title','min_of_tf*idf_url',
         'min_of_tf*idf_whole_document','max_of_tf*idf_body',
         'max_of_tf*idf_anchor','max_of_tf*idf_title','max_of_tf*idf_url',
         'max_of_tf*idf_whole_document','mean_of_tf*idf_body',
         'mean_of_tf*idf_anchor','mean_of_tf*idf_title','mean_of_tf*idf_url',
         'mean_of_tf*idf_whole_document','variance_of_tf*idf_body',
         'variance_of_tf*idf_anchor','variance_of_tf*idf_title',
         'variance_of_tf*idf_url','variance_of_tf*idf_whole_document',
         'boolean_model_body','boolean_model_anchor','boolean_model_title',
         'boolean_model_url','boolean_model_whole_document',
         'vector_space_model_body','vector_space_model_anchor',
         'vector_space_model_title','vector_space_model_url',
         'vector_space_model_whole_document','BM_body','BM_anchor',
         'BM_title','BM_url','BM_whole_document','LMIR.ABS_body',
         'LMIR.ABS_anchor','LMIR.ABS_title','LMIR.ABS_url',
         'LMIR.ABS_whole_document','LMIR.DIR_body','LMIR.DIR_anchor',
         'LMIR.DIR_title','LMIR.DIR_url','LMIR.DIR_whole_document',
         'LMIR.JM_body','LMIR.JM_anchor','LMIR.JM_title','LMIR.JM_url',
         'LMIR.JM_whole_document','Number_of_slash_in_URL','Length_of_URL',
         'Inlink_number','Outlink_number','PageRank','SiteRank','QualityScore',
         'QualityScore2','Query-url_click_count','url_click_count',
         'url_dwell_time']

In [0]:
# unzip dataset 
!unzip -q "/content/drive/My Drive/MSLR-WEB10K.zip"

In [0]:
# read fold 1 training set into df
mslr_train_df = pd.read_csv(filepath_or_buffer='/content/Fold1/train.txt',
                            header=0,
                            names=names,
                            delim_whitespace=True)

In [0]:
# read fold 1 test set into df
mslr_test_df = pd.read_csv(filepath_or_buffer='/content/Fold1/train.txt',
                      header=0,
                      names=names,
                      delim_whitespace=True)

### Dataset Preparation

In [0]:
# In the dataset each column has values of the form column_number:value
# This method is to extract the values only so that we are left with the values only in columns
def extract_value(value):
    return float(value.split(':')[1])

In [0]:
# Method to transform dataset by extracting the values for each column
# First column is the relevance score so start from index 1
def transform_dataset(df):
    df[df.columns[1:]] = df[df.columns[1:]].applymap(extract_value)
    return df

#### Training Set

In [0]:
df_train = transform_dataset(mslr_train_df)

In [21]:
df_train.shape

(723411, 138)

In [28]:
df_train.head()

Unnamed: 0,relevance_score,query_id,covered_query_term_number_body,covered_query_term_number_anchor,covered_query_term_number_title,covered_query_term_number_url,covered_query_term_number_whole_document,covered_query_term_ratio_body,covered_query_term_ratio_anchor,covered_query_term_ratio_title,covered_query_term_ratio_url,covered_query_term_ratio_whole_document,stream_length_body,stream_length_anchor,stream_length_title,stream_length_url,stream_length_whole_document,IDF(Inverse_document_frequency)_body,IDF(Inverse_document_frequency)_anchor,IDF(Inverse_document_frequency)_title,IDF(Inverse_document_frequency)_url,IDF(Inverse_document_frequency)_whole_document,sum_of_term_frequency_body,sum_of_term_frequency_anchor,sum_of_term_frequency_title,sum_of_term_frequency_url,sum_of_term_frequency_whole_document,min_of_term_frequency_body,min_of_term_frequency_anchor,min_of_term_frequency_title,min_of_term_frequency_url,min_of_term_frequency_whole_document,max_of_term_frequency_body,max_of_term_frequency_anchor,max_of_term_frequency_title,max_of_term_frequency_url,max_of_term_frequency_whole_document,mean_of_term_frequency_body,mean_of_term_frequency_anchor,mean_of_term_frequency_title,...,boolean_model_anchor,boolean_model_title,boolean_model_url,boolean_model_whole_document,vector_space_model_body,vector_space_model_anchor,vector_space_model_title,vector_space_model_url,vector_space_model_whole_document,BM_body,BM_anchor,BM_title,BM_url,BM_whole_document,LMIR.ABS_body,LMIR.ABS_anchor,LMIR.ABS_title,LMIR.ABS_url,LMIR.ABS_whole_document,LMIR.DIR_body,LMIR.DIR_anchor,LMIR.DIR_title,LMIR.DIR_url,LMIR.DIR_whole_document,LMIR.JM_body,LMIR.JM_anchor,LMIR.JM_title,LMIR.JM_url,LMIR.JM_whole_document,Number_of_slash_in_URL,Length_of_URL,Inlink_number,Outlink_number,PageRank,SiteRank,QualityScore,QualityScore2,Query-url_click_count,url_click_count,url_dwell_time
0,2,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,406.0,0.0,5.0,5.0,416.0,6.931275,22.076928,19.673353,22.255383,6.926551,28.0,0.0,3.0,0.0,31.0,8.0,0.0,1.0,0.0,9.0,10.0,0.0,1.0,0.0,11.0,9.333333,0.0,1.0,...,0.0,1.0,0.0,1.0,0.994425,0.0,1.0,0.0,0.995455,20.885118,0.0,24.233365,0.0,21.161666,-11.55585,-21.242171,-8.429024,-25.436074,-11.297811,-16.487275,-24.805464,-21.461317,-27.690319,-16.208808,-11.646141,-24.041386,-5.14386,-28.119826,-11.411068,2.0,54.0,11089534.0,2.0,124.0,64034.0,1.0,2.0,0.0,0.0,0.0
1,0,1.0,3.0,0.0,2.0,0.0,3.0,1.0,0.0,0.666667,0.0,1.0,146.0,0.0,3.0,7.0,156.0,6.931275,22.076928,19.673353,22.255383,6.926551,14.0,0.0,2.0,0.0,16.0,1.0,0.0,0.0,0.0,1.0,7.0,0.0,1.0,0.0,8.0,4.666667,0.0,0.666667,...,0.0,0.0,0.0,1.0,0.851903,0.0,0.720414,0.0,0.842789,18.140878,0.0,17.748073,0.0,18.279205,-12.609065,-21.242171,-14.935056,-25.436074,-12.487989,-18.832941,-24.805464,-23.925663,-27.690319,-18.589543,-11.525277,-24.041386,-14.689844,-28.119826,-11.436378,3.0,45.0,3.0,1.0,124.0,3344.0,14.0,67.0,0.0,0.0,0.0
2,2,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,287.0,1.0,4.0,7.0,299.0,6.931275,22.076928,19.673353,22.255383,6.926551,7.0,0.0,3.0,0.0,10.0,2.0,0.0,1.0,0.0,3.0,3.0,0.0,1.0,0.0,4.0,2.333333,0.0,1.0,...,0.0,1.0,0.0,1.0,0.989585,0.0,1.0,0.0,0.995185,15.572998,0.0,26.759999,0.0,17.53163,-15.55564,-21.242171,-7.76183,-25.436074,-14.198901,-20.103511,-24.805464,-21.45982,-27.690319,-19.180736,-14.798285,-24.041386,-4.474536,-28.119826,-13.825417,3.0,56.0,11089534.0,13.0,123.0,63933.0,1.0,3.0,0.0,0.0,0.0
3,1,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,2009.0,2.0,4.0,7.0,2022.0,6.931275,22.076928,19.673353,22.255383,6.926551,8.0,0.0,3.0,0.0,11.0,2.0,0.0,1.0,0.0,3.0,3.0,0.0,1.0,0.0,4.0,2.666667,0.0,1.0,...,0.0,1.0,0.0,1.0,0.980551,0.0,1.0,0.0,0.989938,7.802556,0.0,26.759999,0.0,9.749707,-20.673887,-21.242171,-7.76183,-25.436074,-19.469471,-21.419394,-24.805464,-21.45982,-27.690319,-20.58994,-20.168345,-24.041386,-4.474536,-28.119826,-19.226044,3.0,64.0,5.0,7.0,256.0,49697.0,1.0,13.0,0.0,0.0,0.0
4,1,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,935.0,3.0,4.0,7.0,949.0,6.931275,22.076928,19.673353,22.255383,6.926551,22.0,0.0,3.0,0.0,25.0,2.0,0.0,1.0,0.0,3.0,11.0,0.0,1.0,0.0,12.0,7.333333,0.0,1.0,...,0.0,1.0,0.0,1.0,0.871319,0.0,1.0,0.0,0.898802,15.468397,0.0,26.759999,0.0,16.692247,-15.755644,-21.242171,-7.76183,-25.436074,-15.038405,-18.401835,-24.805464,-21.45982,-27.690319,-17.869366,-15.535053,-24.041386,-4.474536,-28.119826,-14.984541,3.0,62.0,6.0,7.0,210.0,49923.0,5.0,15.0,0.0,0.0,0.0


In [35]:
# How many different query id's are present in the dataset
df_train.groupby(['query_id'])['relevance_score'].count()

query_id
1.0         85
4.0        103
7.0        111
16.0       106
19.0        76
          ... 
29974.0    164
29977.0    180
29986.0     81
29989.0     86
29992.0     40
Name: relevance_score, Length: 6000, dtype: int64

In [36]:
# Count of relevance scores in training set
df_train.groupby(['relevance_score'])['query_id'].count()

relevance_score
0    377957
1    232569
2     95081
3     12658
4      5146
Name: query_id, dtype: int64

In [0]:
# X_train features
X_train = df_train[df_train.columns[2:]]

In [0]:
# y_train relevance score labels
y_train = df_train['relevance_score']

#### Test Set

In [0]:
df_test = transform_dataset(mslr_test_df)

In [25]:
df_test.shape

(723411, 138)

In [29]:
df_test.head()

Unnamed: 0,relevance_score,query_id,covered_query_term_number_body,covered_query_term_number_anchor,covered_query_term_number_title,covered_query_term_number_url,covered_query_term_number_whole_document,covered_query_term_ratio_body,covered_query_term_ratio_anchor,covered_query_term_ratio_title,covered_query_term_ratio_url,covered_query_term_ratio_whole_document,stream_length_body,stream_length_anchor,stream_length_title,stream_length_url,stream_length_whole_document,IDF(Inverse_document_frequency)_body,IDF(Inverse_document_frequency)_anchor,IDF(Inverse_document_frequency)_title,IDF(Inverse_document_frequency)_url,IDF(Inverse_document_frequency)_whole_document,sum_of_term_frequency_body,sum_of_term_frequency_anchor,sum_of_term_frequency_title,sum_of_term_frequency_url,sum_of_term_frequency_whole_document,min_of_term_frequency_body,min_of_term_frequency_anchor,min_of_term_frequency_title,min_of_term_frequency_url,min_of_term_frequency_whole_document,max_of_term_frequency_body,max_of_term_frequency_anchor,max_of_term_frequency_title,max_of_term_frequency_url,max_of_term_frequency_whole_document,mean_of_term_frequency_body,mean_of_term_frequency_anchor,mean_of_term_frequency_title,...,boolean_model_anchor,boolean_model_title,boolean_model_url,boolean_model_whole_document,vector_space_model_body,vector_space_model_anchor,vector_space_model_title,vector_space_model_url,vector_space_model_whole_document,BM_body,BM_anchor,BM_title,BM_url,BM_whole_document,LMIR.ABS_body,LMIR.ABS_anchor,LMIR.ABS_title,LMIR.ABS_url,LMIR.ABS_whole_document,LMIR.DIR_body,LMIR.DIR_anchor,LMIR.DIR_title,LMIR.DIR_url,LMIR.DIR_whole_document,LMIR.JM_body,LMIR.JM_anchor,LMIR.JM_title,LMIR.JM_url,LMIR.JM_whole_document,Number_of_slash_in_URL,Length_of_URL,Inlink_number,Outlink_number,PageRank,SiteRank,QualityScore,QualityScore2,Query-url_click_count,url_click_count,url_dwell_time
0,2,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,406.0,0.0,5.0,5.0,416.0,6.931275,22.076928,19.673353,22.255383,6.926551,28.0,0.0,3.0,0.0,31.0,8.0,0.0,1.0,0.0,9.0,10.0,0.0,1.0,0.0,11.0,9.333333,0.0,1.0,...,0.0,1.0,0.0,1.0,0.994425,0.0,1.0,0.0,0.995455,20.885118,0.0,24.233365,0.0,21.161666,-11.55585,-21.242171,-8.429024,-25.436074,-11.297811,-16.487275,-24.805464,-21.461317,-27.690319,-16.208808,-11.646141,-24.041386,-5.14386,-28.119826,-11.411068,2.0,54.0,11089534.0,2.0,124.0,64034.0,1.0,2.0,0.0,0.0,0.0
1,0,1.0,3.0,0.0,2.0,0.0,3.0,1.0,0.0,0.666667,0.0,1.0,146.0,0.0,3.0,7.0,156.0,6.931275,22.076928,19.673353,22.255383,6.926551,14.0,0.0,2.0,0.0,16.0,1.0,0.0,0.0,0.0,1.0,7.0,0.0,1.0,0.0,8.0,4.666667,0.0,0.666667,...,0.0,0.0,0.0,1.0,0.851903,0.0,0.720414,0.0,0.842789,18.140878,0.0,17.748073,0.0,18.279205,-12.609065,-21.242171,-14.935056,-25.436074,-12.487989,-18.832941,-24.805464,-23.925663,-27.690319,-18.589543,-11.525277,-24.041386,-14.689844,-28.119826,-11.436378,3.0,45.0,3.0,1.0,124.0,3344.0,14.0,67.0,0.0,0.0,0.0
2,2,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,287.0,1.0,4.0,7.0,299.0,6.931275,22.076928,19.673353,22.255383,6.926551,7.0,0.0,3.0,0.0,10.0,2.0,0.0,1.0,0.0,3.0,3.0,0.0,1.0,0.0,4.0,2.333333,0.0,1.0,...,0.0,1.0,0.0,1.0,0.989585,0.0,1.0,0.0,0.995185,15.572998,0.0,26.759999,0.0,17.53163,-15.55564,-21.242171,-7.76183,-25.436074,-14.198901,-20.103511,-24.805464,-21.45982,-27.690319,-19.180736,-14.798285,-24.041386,-4.474536,-28.119826,-13.825417,3.0,56.0,11089534.0,13.0,123.0,63933.0,1.0,3.0,0.0,0.0,0.0
3,1,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,2009.0,2.0,4.0,7.0,2022.0,6.931275,22.076928,19.673353,22.255383,6.926551,8.0,0.0,3.0,0.0,11.0,2.0,0.0,1.0,0.0,3.0,3.0,0.0,1.0,0.0,4.0,2.666667,0.0,1.0,...,0.0,1.0,0.0,1.0,0.980551,0.0,1.0,0.0,0.989938,7.802556,0.0,26.759999,0.0,9.749707,-20.673887,-21.242171,-7.76183,-25.436074,-19.469471,-21.419394,-24.805464,-21.45982,-27.690319,-20.58994,-20.168345,-24.041386,-4.474536,-28.119826,-19.226044,3.0,64.0,5.0,7.0,256.0,49697.0,1.0,13.0,0.0,0.0,0.0
4,1,1.0,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,935.0,3.0,4.0,7.0,949.0,6.931275,22.076928,19.673353,22.255383,6.926551,22.0,0.0,3.0,0.0,25.0,2.0,0.0,1.0,0.0,3.0,11.0,0.0,1.0,0.0,12.0,7.333333,0.0,1.0,...,0.0,1.0,0.0,1.0,0.871319,0.0,1.0,0.0,0.898802,15.468397,0.0,26.759999,0.0,16.692247,-15.755644,-21.242171,-7.76183,-25.436074,-15.038405,-18.401835,-24.805464,-21.45982,-27.690319,-17.869366,-15.535053,-24.041386,-4.474536,-28.119826,-14.984541,3.0,62.0,6.0,7.0,210.0,49923.0,5.0,15.0,0.0,0.0,0.0


In [34]:
# How many different query id's are present in the test set
df_test.groupby(['query_id'])['relevance_score'].count()

query_id
1.0         85
4.0        103
7.0        111
16.0       106
19.0        76
          ... 
29974.0    164
29977.0    180
29986.0     81
29989.0     86
29992.0     40
Name: relevance_score, Length: 6000, dtype: int64

In [37]:
# Count of relevance scores in test set
df_test.groupby(['relevance_score'])['query_id'].count()

relevance_score
0    377957
1    232569
2     95081
3     12658
4      5146
Name: query_id, dtype: int64

In [0]:
# X test features
X_test = df_test[df_test.columns[2:]]

In [0]:
# y_test relevance score labels
y_test = df_test['relevance_score']