# Assignment 4: Evaluating Search Engines

For this assignment, we leave aside the code we developed so far, and look into the more general issue of how to evaluate and compare different search engines. The ultimate test for any Information Retrieval system is how well it is able to satisfy the information needs of users.

# Cohen's Kappa

Our evaluation will involve the calculation of [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen's_kappa) to quantify the degree to which two human assessors agree or disagree on whether results are considered relevant or not. To calculate Cohen's Kappa, we are going to use the [scikit-learn library](http://scikit-learn.org/stable/):

In [1]:
! pip install --user scikit-learn

/bin/sh: 1: pip: not found


In [2]:
from sklearn.metrics import cohen_kappa_score

This library expects relevance assessments as lists of elements where `1` stands for _relevant_ and `0` stands for _not relevant_, for example like this:

In [3]:
a1=[1,0,1,0,1,0,1,0]

This list means that the first document was assessed to be relevant, the second to be not relevant, the third to be relevant etc.

We need two assessments in order to calculate Cohen's Kappa, so let's make another example list that differs only in the last element:

In [4]:
a2=[1,0,1,0,1,0,1,1]

We can now invoke the library as follows to calculate the agreement between the two:

In [5]:
cohen_kappa_score(a1, a2)

0.75

This value represents high agreement. We can reach maximal agreement if the two assessments are identical:

In [6]:
cohen_kappa_score(a1, a1)

1.0

Now, let's see what happens for a third assessment that differs from the first one in three positions (the last three):

In [7]:
a3=[1,0,1,0,1,1,0,1]

cohen_kappa_score(a1, a3)

0.25

We get a smaller but still positive value, because these two assessments still mostly agree. If we make a further example that differs on 6 of the 8 positions, we get the following result:

In [8]:
a4=[1,0,0,1,0,1,0,1]

cohen_kappa_score(a1, a4)

-0.5

The score is now negative, because the two assessments disagree on more positions than they agree on. The agreement is in fact lower than what you would expect to occur just by chance. We get maximal disagreement if we define a fifth example that disagrees on all positions:

In [9]:
a5=[0,1,0,1,0,1,0,1]

cohen_kappa_score(a1, a5)

-1.0

Now that we understand how this function works, we will apply it below for our specific evaluation.
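To make the scores above less of a black box, here is a minimal sketch of the same calculation done by hand for binary labels, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa_binary` is made up for this illustration and is not part of scikit-learn.

```python
def cohen_kappa_binary(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of 0/1 labels (illustrative helper)."""
    n = len(labels_a)
    # Observed agreement: share of positions where both assessors give the same label.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: both say 1 by chance, plus both say 0 by chance.
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    # Note: this divides by zero when p_e == 1 (both lists contain a single, identical label).
    return (p_o - p_e) / (1 - p_e)


# Same example lists as above, repeated so that this cell runs on its own.
a1 = [1, 0, 1, 0, 1, 0, 1, 0]
a2 = [1, 0, 1, 0, 1, 0, 1, 1]
a5 = [0, 1, 0, 1, 0, 1, 0, 1]

print(cohen_kappa_binary(a1, a2))  # 0.75
print(cohen_kappa_binary(a1, a5))  # -1.0
```

For `a1` and `a2`, the assessors agree on 7 of 8 positions (p_o = 0.875) and chance agreement is 0.5, which gives (0.875 - 0.5) / 0.5 = 0.75, matching the library result.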

# Results and Assessments

Next, we will define some auxiliary code to deal with lists of URLs from search engines and associated relevance assessments. We will encode result lists like this:

In [10]:
urls = [
    'https://en.wikipedia.org/wiki/Information_retrieval/',  # 1st result
    'http://www.dictionary.com/browse/information',          # 2nd result
    'https://nlp.stanford.edu/IR-book/'                      # ...
]

And we represent corresponding assessments, as above, as lists of the same size containing relevance values:

In [11]:
my_assessment = [1, 0, 1]
another_assessment = [0, 0, 1]
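Because an assessment only makes sense if it lines up position by position with its URL list, a small sanity check can catch mistakes early. The sketch below is optional and not part of the assignment code; `check_assessment` is a hypothetical helper name.

```python
def check_assessment(urls, assessment):
    """Raise ValueError if an assessment does not line up with its URL list.

    Hypothetical helper: checks that both lists have the same length and that
    every relevance value is either 0 (not relevant) or 1 (relevant).
    """
    if len(urls) != len(assessment):
        raise ValueError(
            "expected {} relevance values, got {}".format(len(urls), len(assessment))
        )
    invalid = [value for value in assessment if value not in (0, 1)]
    if invalid:
        raise ValueError("relevance values must be 0 or 1, got {}".format(invalid))
    return True


# Placeholder URLs, only used to demonstrate the check.
example_urls = ['https://example.org/a', 'https://example.org/b', 'https://example.org/c']
print(check_assessment(example_urls, [1, 0, 1]))  # True
```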

In order to nicely display URL lists, with or without related assessments, we define a function called `display_results`:

In [12]:
from IPython.display import display, HTML

def display_results(urls, assessment1=None, assessment2=None):
    lines = []
    lines.append('<table>')
    header = '<tr><th>#</th><th>Result URL</th>'
    if (assessment1):
        header += '<th>Assessment 1</th>'
    if (assessment2):
        header += '<th>Assessment 2</th>'
    header += '</tr>'
    lines.append(header)
    i = 0
    for url in urls:
        show_url = url
        if (len(url) > 80):
            show_url = url[:75] + '...'
        line = '<tr><td>{}</td><td><a href="{:s}">{:s}</a></td>'.format(i+1, url, show_url)
        if (assessment1):
            if (assessment1[i] == 0):
                line += '<td><em>Not relevant</em></td>'
            else:
                line += '<td><strong>Relevant</strong></td>'
        if (assessment2):
            if (assessment2[i] == 0):
                line += '<td><em>Not relevant</em></td>'
            else:
                line += '<td><strong>Relevant</strong></td>'
        line += '</tr>'
        lines.append(line)
        i = i+1
    lines.append('</table>')
    display( HTML(''.join(lines)) )

We can use this function to display a list of URLs, optionally together with one or two assessment lists:

In [13]:
print("Just a list of URLs:")
display_results(urls)

print("With one assessment:")
display_results(urls, my_assessment)

print("With two assessments:")
display_results(urls, my_assessment, another_assessment)

Just a list of URLs:


| # | Result URL |
| --- | --- |
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ |
| 2 | http://www.dictionary.com/browse/information |
| 3 | https://nlp.stanford.edu/IR-book/ |


With one assessment:


| # | Result URL | Assessment 1 |
| --- | --- | --- |
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant |


With two assessments:


| # | Result URL | Assessment 1 | Assessment 2 |
| --- | --- | --- | --- |
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant | Not relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant | Relevant |


Now we are ready to perform an actual evaluation, which will involve a substantial amount of manual work.

---

# Tasks

**Your name:** David Rocker

### Task 1

Think up and formulate an information need in the areas of Computer Science or the Life Sciences (medicine, biology, etc.) for which you think the answer can be found in scientific publications. On page 152 in the book an example of such an information need is shown: "Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."

**Answer:** Information on whether socioeconomic status impacts health and diet choices in life

Next, write down specifically what documents have to look like to satisfy your information need. For example if your information need is about finding an overview of different cancer types, you could state that a document would need to list at least ten types of cancer to satisfy your information need (among other criteria). Write this down as a protocol with rules and examples. For example, such a protocol could state that at least three out of five given criteria have to be fulfilled for a document to be considered relevant for the information need, and then specify the criteria. Or your protocol could have the form of a sequence of rules, where each rule lets you either label the document as relevant or not relevant, or proceed with the next rule. Such rules and criteria can, for example, be about the general topic of the paper, the concepts mentioned in it, the covered relations between concepts, the type of publication (research paper, overview paper, etc.), the number of references, the types of contained diagrams, and so on, depending on your specified information need.

**Answer:** In order to be relevant, the document must meet at least 3 of these 5 criteria:
1. Must compare the diets of different socioeconomic classes
2. Must compare the health habits of different socioeconomic classes (doctor's checkups, exercise, injury compensation at work, etc.)
3. Must provide statistics with references to back up said comparisons
4. Must contain a description of the studies done
5. The studies done must include at least 500 people from each class
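The "at least 3 out of 5" rule above can be written down as a small decision function, which makes it harder for two assessors to apply the protocol differently. This is only a sketch: the name `is_relevant` and the encoding of the criteria as a list of booleans are assumptions made for illustration, and judging each criterion still has to be done by hand.

```python
def is_relevant(criteria_met, threshold=3):
    """Return 1 (relevant) if at least `threshold` criteria are fulfilled, else 0.

    criteria_met: list of booleans, one per criterion of the protocol, in the
    order in which the criteria are listed (hypothetical encoding).
    """
    fulfilled = sum(1 for met in criteria_met if met)
    return 1 if fulfilled >= threshold else 0


# A document that compares diets, compares health habits and describes its
# studies, but gives no referenced statistics and has fewer than 500 people per class:
print(is_relevant([True, True, False, True, False]))   # 1
# A document that fulfils only the first two criteria:
print(is_relevant([True, True, False, False, False]))  # 0
```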

### Task 2

Formulate a keyword query that represents the information need. For the example on page 152 in the book (see above), the example query "wine AND red AND white AND heart AND attack AND effective" is given. (You don't need to use connectors like "AND", but if you do, first make sure your chosen search engines below actually support them.)

**Answer:** socioeconomic status relating to health

Then submit your query to **two** of the following academic search engines:

- [Google Scholar](https://scholar.google.com) (all science disciplines)
- [Semantic Scholar](https://www.semanticscholar.org) (computer science and biomedicine)
- [PubMed Search](https://www.ncbi.nlm.nih.gov/pubmed) (Life Sciences / biomedicine)

The right choice of two of the three search engines depends on the topic of your information need. If your information need is in Computer Science, for example, you should use Google Scholar and Semantic Scholar.

For each search engine, extract the top 10 URLs returned for the query. Try to access the resulting publications. Where that is not possible (because of dead links or because the publication is pay-walled even within the VU network), exclude the publication from the list and add more publications to the end of your list (that is, append results number 11, then 12, etc. to ensure you have two lists of 10 publications each). In order to deal with paywalls, you should try accessing the articles from the VU network, use [UBVU Off-Campus Access](http://www.ub.vu.nl.vu-nl.idm.oclc.org/nl/faciliteiten/toegang-buiten-de-campus/index.aspx), or try to find the respective documents from alternative sources (Google Scholar, for example, is very good at finding free PDFs of articles). If you get fewer than 10 results for one of the search engines, modify the keyword query above to make it more inclusive, and then redo the steps of this task.

Store your two lists of URLs in the form of Python lists as introduced above. Then, use the `display_results` function to nicely display them.
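The replacement procedure described above (skip inaccessible results and move on to results 11, 12, and so on) can be sketched as follows. In the assignment the accessibility check is done manually; here it is represented by a caller-supplied function. The name `top_accessible`, the `is_accessible` parameter, and the example URLs are all hypothetical.

```python
def top_accessible(ranked_urls, is_accessible, k=10):
    """Return the first k URLs, in ranking order, for which is_accessible(url) is true."""
    kept = []
    for url in ranked_urls:
        if len(kept) >= k:
            break  # we already have k usable results
        if is_accessible(url):
            kept.append(url)
    return kept


# Made-up ranking of 15 results, of which results 2 and 7 cannot be accessed.
ranked = ['https://example.org/paper-{}'.format(i) for i in range(1, 16)]
blocked = {'https://example.org/paper-2', 'https://example.org/paper-7'}

top10 = top_accessible(ranked, lambda url: url not in blocked)
print(len(top10))  # 10
print(top10[-1])   # https://example.org/paper-12
```

If the ranking runs out before `k` accessible results are found, the function simply returns a shorter list, which corresponds to the case where the query has to be made more inclusive.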

In [14]:
# Create two of the lists below, depending on your chosen engines:

urls_google = ['http://psycnet.apa.org/record/1994-29613-001', 
              'https://www.ncbi.nlm.nih.gov/pubmed/22812021',
              'http://aldricharchive.com/cuttings/1984/july%201984.pdf',
              'https://inequality.stanford.edu/sites/default/files/media/_media/pdf/Reference%20Media/Currie_2008_Health%20and%20Mental%20Health.pdf',
              'http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1021.8087&rep=rep1&type=pdf',
              'https://www.annualreviews.org/doi/full/10.1146/annurev.publhealth.23.112001.112349',
              'https://ajph.aphapublications.org/doi/pdfplus/10.2105/AJPH.92.7.1151',
              'https://deepblue.lib.umich.edu/bitstream/handle/2027.42/71908/j.1749-6632.1999.tb08114.x.pdf?sequence=1&isAllowed=y',
              'https://www.researchgate.net/profile/Catherine_Cubbin/publication/7416827_Socioeconomic_Status_in_Health_Research_One_Size_Does_Not_Fit_All/links/09e4150b63d937d03a000000/Socioeconomic-Status-in-Health-Research-One-Size-Does-Not-Fit-All.pdf',
              'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4499872/']
#urls_semantic = ...
urls_pubmed = ['https://academic.oup.com/abm/advance-article/doi/10.1093/abm/kay089/5161001',
              'https://www.sciencedirect.com/science/article/pii/S0163834318300720?via%3Dihub',
              'https://www.liebertpub.com/doi/full/10.1089/bfm.2018.0132?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub%3Dpubmed&',
              'https://www.ncbi.nlm.nih.gov/books/NBK525234/',
              'https://onlinelibrary.wiley.com/doi/full/10.1111/hsc.12645',
              'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097401/',
              'http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-31802018005012103&lng=en&nrm=iso&tlng=en',
              'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6067362/',
              'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6006943/',
              'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6016129/']

# Call display_results here
display_results(urls_google)
display_results(urls_pubmed)

| # | Result URL |
| --- | --- |
| 1 | http://psycnet.apa.org/record/1994-29613-001 |
| 2 | https://www.ncbi.nlm.nih.gov/pubmed/22812021 |
| 3 | http://aldricharchive.com/cuttings/1984/july%201984.pdf |
| 4 | https://inequality.stanford.edu/sites/default/files/media/_media/pdf/Refere... |
| 5 | http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1021.8087&rep=rep1... |
| 6 | https://www.annualreviews.org/doi/full/10.1146/annurev.publhealth.23.112001... |
| 7 | https://ajph.aphapublications.org/doi/pdfplus/10.2105/AJPH.92.7.1151 |
| 8 | https://deepblue.lib.umich.edu/bitstream/handle/2027.42/71908/j.1749-6632.1... |
| 9 | https://www.researchgate.net/profile/Catherine_Cubbin/publication/7416827_S... |
| 10 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4499872/ |


| # | Result URL |
| --- | --- |
| 1 | https://academic.oup.com/abm/advance-article/doi/10.1093/abm/kay089/5161001 |
| 2 | https://www.sciencedirect.com/science/article/pii/S0163834318300720?via%3Dihub |
| 3 | https://www.liebertpub.com/doi/full/10.1089/bfm.2018.0132?url_ver=Z39.88-20... |
| 4 | https://www.ncbi.nlm.nih.gov/books/NBK525234/ |
| 5 | https://onlinelibrary.wiley.com/doi/full/10.1111/hsc.12645 |
| 6 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097401/ |
| 7 | http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-31802018005012... |
| 8 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6067362/ |
| 9 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6006943/ |
| 10 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6016129/ |


### Task 3

Then, find a fellow student who will **independently**
assess the results as "relevant" or "not relevant" using the protocol that you
have defined above, and also help (at least) one other student for his/her
assessment. Write down their names here:

**Name of the student who assesses my results:** Hannah Zucker

**Name of the student who I help to assess his/her results:** Hannah Zucker

Show to the other assessor everything you have written down above for Tasks 1 and 2 (and you might also want to give him/her the PDFs you got for these papers to simplify the process).

You as assessors need to stick to the protocol you made in Task 1 and should not discuss with each other, especially when you doubt whether a result is relevant or not. Write down your assessments as lists of relevance values, as introduced above, and make sure they correctly map to the URLs by displaying them together with the `display_results` function.

To avoid problems with extreme results, mark in each list at least one paper as 'relevant' and at least one paper as 'not relevant'. That is, if all papers seem relevant, mark the one that seems least relevant 'not relevant', and conversely, if none of the papers seem relevant, mark the one that seems a bit more relevant than the others as 'relevant'.
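The rule above can be checked mechanically before computing any scores. The reason it matters follows from the kappa formula: if both assessors put every result in the same class, the agreement expected by chance is 1 and the denominator of kappa becomes 0. The helper name `has_both_labels` is made up for this sketch.

```python
def has_both_labels(assessment):
    """True if the list contains at least one 1 (relevant) and at least one 0 (not relevant)."""
    return 0 in assessment and 1 in assessment


print(has_both_labels([1, 1, 0, 1]))  # True
print(has_both_labels([1, 1, 1, 1]))  # False: no result is marked 'not relevant'
```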

In [15]:
# 0 = not relevant; 1 = relevant

# You only need to create 4 of the following 6 lists, again depending on which search engines you chose.

# Assessment 1 is from you:

assessment1_google = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
#assessment1_semantic = ...
assessment1_pubmed = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

# Assessment 2 is from your fellow student (don't show him/her your own assessment!):

assessment2_google = [1, 1, 0, 1, 1, 1, 1, 1, 0, 0]
#assessment2_semantic = ...
assessment2_pubmed =[0, 0, 0, 0, 1, 0, 1, 0, 0, 1]

# Call display_results here
display_results(urls_google, assessment1_google, assessment2_google)
display_results(urls_pubmed, assessment1_pubmed, assessment2_pubmed)

| # | Result URL | Assessment 1 | Assessment 2 |
| --- | --- | --- | --- |
| 1 | http://psycnet.apa.org/record/1994-29613-001 | Relevant | Relevant |
| 2 | https://www.ncbi.nlm.nih.gov/pubmed/22812021 | Relevant | Relevant |
| 3 | http://aldricharchive.com/cuttings/1984/july%201984.pdf | Relevant | Not relevant |
| 4 | https://inequality.stanford.edu/sites/default/files/media/_media/pdf/Refere... | Relevant | Relevant |
| 5 | http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1021.8087&rep=rep1... | Not relevant | Relevant |
| 6 | https://www.annualreviews.org/doi/full/10.1146/annurev.publhealth.23.112001... | Relevant | Relevant |
| 7 | https://ajph.aphapublications.org/doi/pdfplus/10.2105/AJPH.92.7.1151 | Relevant | Relevant |
| 8 | https://deepblue.lib.umich.edu/bitstream/handle/2027.42/71908/j.1749-6632.1... | Relevant | Relevant |
| 9 | https://www.researchgate.net/profile/Catherine_Cubbin/publication/7416827_S... | Not relevant | Not relevant |
| 10 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4499872/ | Not relevant | Not relevant |


| # | Result URL | Assessment 1 | Assessment 2 |
| --- | --- | --- | --- |
| 1 | https://academic.oup.com/abm/advance-article/doi/10.1093/abm/kay089/5161001 | Not relevant | Not relevant |
| 2 | https://www.sciencedirect.com/science/article/pii/S0163834318300720?via%3Dihub | Not relevant | Not relevant |
| 3 | https://www.liebertpub.com/doi/full/10.1089/bfm.2018.0132?url_ver=Z39.88-20... | Relevant | Not relevant |
| 4 | https://www.ncbi.nlm.nih.gov/books/NBK525234/ | Not relevant | Not relevant |
| 5 | https://onlinelibrary.wiley.com/doi/full/10.1111/hsc.12645 | Relevant | Relevant |
| 6 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097401/ | Not relevant | Not relevant |
| 7 | http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1516-31802018005012... | Relevant | Relevant |
| 8 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6067362/ | Not relevant | Not relevant |
| 9 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6006943/ | Not relevant | Not relevant |
| 10 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6016129/ | Relevant | Relevant |


### Task 4

Compute Cohen's kappa to quantify how much the two assessors agreed. Use the function `cohen_kappa_score` demonstrated above to calculate the inter-annotator agreement twice (once for each of the two search engines), and print out the results.

In [16]:
# Add your code here:

kappa_google = cohen_kappa_score(assessment1_google, assessment2_google)
#kappa_semantic = ...
kappa_pubmed = cohen_kappa_score(assessment1_pubmed, assessment2_pubmed)

print("Kappa for Google Scholar:", kappa_google)
#print("Kappa for Semantic Scholar:", kappa_semantic)
print("Kappa for PubMed:", kappa_pubmed)

Kappa for Google Scholar: 0.5238095238095238
Kappa for PubMed: 0.7826086956521738


Explain whether the agreement can be considered high or not, based on the interpretation table on [this Wikipedia page](https://en.wikipedia.org/wiki/Fleiss'_kappa#Interpretation) (this Wikipedia page is about a different type of kappa but the interpretation table can also be used for Cohen's kappa).

**Answer:** According to the table, the agreement for Google Scholar (kappa ≈ 0.52) is moderate: the two assessments differ on two results. The agreement for PubMed (kappa ≈ 0.78) is high: there is only one difference, and the score falls in the substantial agreement category.
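The lookup in the interpretation table can also be written as a small function. The sketch below assumes the commonly cited Landis and Koch bands (poor below 0, then slight, fair, moderate, substantial, and almost perfect in steps of 0.20); check them against the linked table before relying on them. The name `interpret_kappa` is made up for this illustration.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a verbal label (assumed Landis and Koch bands)."""
    if kappa < 0:
        return "poor agreement"
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"


# The two values computed in Task 4:
print(interpret_kappa(0.5238095238095238))  # moderate agreement
print(interpret_kappa(0.7826086956521738))  # substantial agreement
```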

### Task 5

Define a function called `precision_at_n` that calculates Precision@n as described in the lecture slides, which takes as input an assessment list and a value for _n_ and returns the respective Precision@n value. Run this function to calculate Precision@10 (that is, n=10) on all four assessments (two assessors and two search engines).

In [18]:
# Add your code here:

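def precision_at_n(assessment_list, n):
    """Return the fraction of the first n results that are marked relevant (1)."""
    tp = 0  # relevant results among the top n
    fp = 0  # non-relevant results among the top n
    for item in range(0, n):
        if assessment_list[item] == 1:
            tp += 1
        else:
            fp += 1

    return tp / (tp + fp)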

# Print out Precision@10 for all assessments here.
print("Assessment1 Google:")
print(precision_at_n(assessment1_google, 10))
print("Assessment2 Google:")
print(precision_at_n(assessment2_google, 10))
print("Assessment1 PubMed:")
print(precision_at_n(assessment1_pubmed, 10))
print("Assessment2 PubMed:")
print(precision_at_n(assessment2_pubmed, 10))

Assessment1 Google:
0.7
Assessment2 Google:
0.7
Assessment1 PubMed:
0.4
Assessment2 PubMed:
0.3


Explain what these specific Precision@10 results tell us (or don't tell us) about the quality of the two search engines for your particular information need. You can also refer to the results of Task 4 if necessary.

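**Answer:** These results suggest that, for this information need, Google Scholar is a better search engine than PubMed. According to both of our assessments, Google Scholar returned far more relevant results in its top 10 than PubMed did (0.7 versus 0.4 in my assessment), and in Hannah's assessment it returned more than twice as many (0.7 versus 0.3).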
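One thing Precision@10 does not show is where in the top 10 the relevant results appear. A rough way to look at this is to compute Precision@n at several cutoffs. The sketch below uses a compact, hypothetical variant named `precision_at` and repeats the two Assessment 1 lists from Task 3 so that it runs on its own.

```python
def precision_at(assessment, n):
    """Share of relevant results (1s) among the first n entries; assumes n <= len(assessment)."""
    return sum(assessment[:n]) / n


# Assessment 1 lists from Task 3, repeated here to keep the example self-contained.
google = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
pubmed = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

for n in (1, 3, 5, 10):
    print("P@{}: Google Scholar {:.2f}, PubMed {:.2f}".format(
        n, precision_at(google, n), precision_at(pubmed, n)))
```

For these lists, Google Scholar starts with four relevant results in a row, while the first relevant PubMed result appears at rank 3. This is a difference in ranking quality that the single Precision@10 value hides.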

# Submission

Submit the answers to the assignment via Canvas as a modified version of this Notebook file (file with `.ipynb` extension) that includes your code and your answers.

Before submitting, restart the kernel and re-run the complete code (**Kernel > Restart & Run All**), and then check whether your assignment code still works as expected.

Don't forget to add your name, and remember that the assignments have to be done individually and group submissions are **not allowed**.