In [None]:
# This cell creates a button that hides/unhides code cells so that you can quickly view only the results.
# Works only with Jupyter Notebooks.

import os
from IPython.display import HTML

HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [None]:
# Description:
#   Exercise7 notebook.
#
# Copyright (C) 2018 Santiago Cortes, Juha Ylioinas
#
# This software is distributed under the GNU General Public 
# Licence (version 2 or later); please refer to the file 
# Licence.txt, included with the software, for details.

# Preparations
import os
import numpy as np

# Select data directory
if os.path.isdir('/coursedata'):
    # JupyterHub
    course_data_dir = '/coursedata'
elif os.path.isdir('../../../coursedata'):
    # Local installation
    course_data_dir = '../../../coursedata'
else:
    # Docker
    course_data_dir = '/home/jovyan/work/coursedata/'

print('The data directory is %s' % course_data_dir)
data_dir = os.path.join(course_data_dir, 'exercise-07-data')
print('Data stored in %s' % data_dir)

# CS-E4850 Computer Vision Exercise Round 7

The exercises should be solved and the solutions submitted via Aalto JupyterHub by the deadline. 

**Deliverables:**
- **Jupyter notebook** (`exercise7.ipynb`) containing your solutions to the programming tasks and answers to the questions. Do not change the name of the notebook file; renaming it may result in 0 points for the exercise. **Only this notebook will be graded.**

**Important:**
- Fill only the cells marked with `# YOUR CODE HERE`. Do not change function signatures.
- You may add extra cells for your own tests, but **do not** overwrite global variables or edit locked cells.
- **Never create new cells by menu commands "Edit/Copy Cells" and "Edit/Paste Cells ..."**. These commands create cells with duplicate ids and make autograding impossible. Use menu commands "Insert/Insert Cell ..." or the button with a plus sign to insert new cells.
- **All notebooks contain hidden tests** which are used for grading. They are hidden inside read-only cells. Therefore, **the read-only cells should never be removed.** 
- **Note:** Visible tests mainly check the shapes and data types of your function’s output. Hidden tests check the correctness of your solution more thoroughly. Passing the visible tests does not guarantee full points for the exercise.
- **Google Colab warning:** Uploading your assignment notebooks to Colab may cause problems. Colab can overwrite notebook metadata and break the autograding. To avoid this, we recommend copy-pasting your code into the notebooks fetched on JupyterHub. Sorry for the inconvenience.
- Make sure that everything you implement works with the images specified in the assignments of this exercise round.
- Running the cells out of order (which often happens while experimenting and debugging) may cause errors. While working on a particular cell, make sure that you have freshly run all the preceding cells belonging to the same exercise.
- **Before submitting**, simply run all the cells of the notebook (for example, select "Restart & Run All" in the menu) and check that all the cells run properly.
- **Remember to submit your assignment!**

## Exercise 1 - Comparing bags-of-words with tf-idf weighting (10 points)
Assume that we have an indexed collection of documents containing the five terms of the following table, where the second row indicates the percentage of documents in which each term appears.<br>

| term | cat | dog | mammals | mouse | pet |
| --- | :---: | :---: | :---: | :---: | :---: |
| **% of documents** | 5 | 20 | 2 | 10 | 60 |

Now, given the query $Q=\{mouse, cat, pet, mammals\}$, compute the similarity between $Q$ and the following example documents $D1$, $D2$, $D3$, by using the cosine similarity measure and tf-idf weights (i.e. term frequency - inverse document frequency) for the bag-of-words histogram representations of the documents and the query.

-  $D1$ = Cat is a pet, dog is a pet, and mouse may be a pet too.
-  $D2$ = Cat, dog and mouse are all mammals.
-  $D3$ = Cat and dog get along well, but cat may eat a mouse.

Ignore all words other than the five terms listed in the table above.

### Proceed with the following steps:

**1.1** Compute the inverse document frequency (idf) for each of the five terms<br>
**1.2** Compute the term frequencies for the query and each document. <br>
**1.3** Form the tf-idf weighted word occurrence histograms for the query and documents <br>
**1.4** Evaluate the cosine similarity between the query and each document<br> 
**1.5** Report the relative ranking of the documents <br>

Complete the tasks in **1.1**-**1.5**.

### 1.1 Compute the inverse document frequency (idf)

#### Task:
Compute the inverse document frequency (idf) for each of the five terms in the table above.
Use the **logarithm with base 2**. (idf is the logarithm term $\log_2(N/n_i)$ on the Lecture 6 slides; the table above gives the ratios $n_i/N$ as percentages.)

#### Important:
- Store the five inverse document frequencies in a variable named `idf`.
- `idf` must be a NumPy array of shape (5,).
- The values must follow the same order as the terms appear in the table above.
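
As an illustration of the computation, here is a minimal sketch that assumes idf is defined as $\log_2(N/n_i)$. The array `doc_fraction` and its values are made up for this example and are not the values from the table; percentages would first need to be divided by 100.

```python
import numpy as np

# Hypothetical document fractions n_i / N for three made-up terms
# (not the values from the exercise table).
doc_fraction = np.array([0.5, 0.25, 0.125])

# idf_i = log2(N / n_i) = log2(1 / (n_i / N))
example_idf = np.log2(1.0 / doc_fraction)
print(example_idf)  # [1. 2. 3.]
```

Rarer terms receive larger weights: the term appearing in 12.5% of the documents gets three times the weight of the term appearing in half of them.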


In [None]:
##--your-code-starts-here--##
#idf = np.zeros(5) # replace me
# YOUR CODE HERE
raise NotImplementedError()
##--your-code-ends-here--##

In [None]:
# Visible tests 

assert "idf" in globals(), "Missing variable: idf"
assert isinstance(idf, np.ndarray), "idf must be a NumPy array"
assert idf.shape[0] == 5, "idf must contain 5 values"
assert idf.ndim == 1, "idf must be one-dimensional"

assert np.allclose(idf[0], 4.3, atol=1e-1), "The first idf value is incorrect. Did you remember to use logarithm with base 2?"

print("All visible tests passed.")

In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


### 1.2 Compute the term frequencies for the query and each document

#### Task:
Compute the term frequencies (tf) for the query and each document.

#### Important:
- Store the term frequencies as NumPy arrays named:
  - `tf_q` for the query
  - `tf_d1`, `tf_d2`, and `tf_d3` for the three documents
- Each array must have shape (5,).
- The term order must be the same as in the table.
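
The counting step could be sketched as follows. The helper `term_frequencies`, the vocabulary, and the sentence are hypothetical and unrelated to the exercise data. The sketch assumes that tf is the term count divided by the number of vocabulary hits in the text; raw counts are another common convention, so check the lecture slides for the definition used on the course.

```python
import re
import numpy as np

def term_frequencies(text, vocabulary, normalize=True):
    """Count how often each vocabulary term occurs in `text`.

    Words outside `vocabulary` are ignored. With normalize=True the counts
    are divided by the total number of vocabulary hits (an assumption;
    raw counts are another common convention).
    """
    # Lowercase the text and split it into alphabetic words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = np.array([words.count(term) for term in vocabulary], dtype=float)
    total = counts.sum()
    if normalize and total > 0:
        counts = counts / total
    return counts

# Hypothetical vocabulary and sentence.
vocab = ["apple", "banana", "cherry"]
sentence = "Apple pie, apple juice and a banana."
example_tf = term_frequencies(sentence, vocab)
example_counts = term_frequencies(sentence, vocab, normalize=False)
print(example_counts)  # raw counts: apple 2, banana 1, cherry 0
print(example_tf)      # the same counts divided by 3
```

Because cosine similarity is invariant to scaling a vector by a positive constant, the choice between raw and normalized counts does not change the final similarities.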

In [None]:
##--your-code-starts-here--##
# tf_q = np.zeros(5)  # replace me
# tf_d1 = np.zeros(5)  # replace me
# tf_d2 = np.zeros(5)  # replace me
# tf_d3 = np.zeros(5)  # replace me
# YOUR CODE HERE
raise NotImplementedError()
##--your-code-ends-here--##

In [None]:
# Visible tests 

for name in ["tf_q", "tf_d1", "tf_d2", "tf_d3"]:
    assert name in globals(), f"Missing variable: {name}"
    arr = eval(name)
    assert isinstance(arr, np.ndarray), f"{name} must be a NumPy array"
    assert arr.ndim == 1, f"{name} must be one-dimensional"
    assert arr.shape[0] == 5, f"{name} must contain 5 values"

print("All visible tests passed.")

In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


### 1.3 Form the tf-idf weighted word occurrence histograms for the query and documents

#### Task:
Compute the tf-idf weighted word occurrence histograms for the query and each document.

#### Important:
- Store the tf-idf weighted histograms as NumPy arrays named:
  - `tf_idf_q` for the query
  - `tf_idf_d1`, `tf_idf_d2`, and `tf_idf_d3` for the three documents
- Each array must have shape (5,).
- The term order must be the same as in the table.
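
The weighting itself is an elementwise product, assuming the tf and idf arrays share the same term order. A minimal sketch with made-up values (not the exercise data):

```python
import numpy as np

# Hypothetical values for a three-term vocabulary.
example_tf = np.array([0.5, 0.5, 0.0])
example_idf = np.array([1.0, 2.0, 3.0])

# Elementwise product: each term frequency is scaled by that term's idf.
example_tf_idf = example_tf * example_idf
print(example_tf_idf)
```

A term that does not occur in the document keeps weight zero regardless of how large its idf is.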


In [None]:
##--your-code-starts-here--##
# tf_idf_q = np.zeros(5)  # replace me
# tf_idf_d1 = np.zeros(5)  # replace me
# tf_idf_d2 = np.zeros(5)  # replace me
# tf_idf_d3 = np.zeros(5)  # replace me
# YOUR CODE HERE
raise NotImplementedError()
##--your-code-ends-here--##

In [None]:
# Visible tests 
for name in ["tf_idf_q", "tf_idf_d1", "tf_idf_d2", "tf_idf_d3"]:
    assert name in globals(), f"Missing variable: {name}"
    arr = eval(name)
    assert isinstance(arr, np.ndarray), f"{name} must be a NumPy array"
    assert arr.ndim == 1, f"{name} must be one-dimensional"
    assert arr.shape[0] == 5, f"{name} must contain 5 values"

print("All visible tests passed.")

In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


### 1.4 Evaluate the cosine similarity between the query and each document

#### Task:
Evaluate the cosine similarity between the query and each document, i.e. the normalized scalar product between the weighted occurrence histograms, as shown on the slides.

You should get similarities 0.95, 0.64, and 0.63, but you need to determine which corresponds to which document.

#### Important:
- Store the cosine similarities (scalars) in variables:
  - `s1` = similarity(Q, D1)
  - `s2` = similarity(Q, D2)
  - `s3` = similarity(Q, D3)
- Each of `s1`, `s2`, `s3` must be a float.
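
A normalized scalar product can be sketched as below. The function name `cosine_similarity` and the zero-vector convention are choices made for this illustration, and the input vectors are made up.

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized scalar product of two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # about 0.7071
```

Wrapping the result in `float(...)` returns a plain Python float rather than a NumPy scalar; both are floating-point types.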

In [None]:
##--your-code-starts-here--##
# s1 = 0  # replace me
# s2 = 0  # replace me
# s3 = 0  # replace me
# YOUR CODE HERE
raise NotImplementedError()
##--your-code-ends-here--##

In [None]:
# Visible tests 
for name in ["s1", "s2", "s3"]:
    assert name in globals(), f"Missing variable: {name}"
    val = eval(name)
    assert isinstance(val, (float, np.floating)), f"{name} must be a float"

print("All visible tests passed.")

In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


### 1.5 Report the relative ranking of the documents 

#### Task:
Report the relative ranking of the documents. You should have got similarities 0.95, 0.64, and 0.63, but you need to determine which corresponds to which document.

#### Important:
- Create a Python list named `ranking`.
- Use document indices (1, 2, 3) and include each index exactly once.
- Report the ranking in descending order of similarity; for example, if document D3 is the most similar to the query, place 3 first in the list.
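
One possible way to derive such a ranking from similarity scores is `np.argsort`. The scores below are hypothetical and are not the results of this exercise.

```python
import numpy as np

# Hypothetical similarity scores for documents 1, 2 and 3.
example_scores = np.array([0.2, 0.9, 0.5])

# argsort sorts ascending, so negate the scores for descending order,
# then shift from 0-based positions to 1-based document indices.
example_ranking = (np.argsort(-example_scores) + 1).tolist()
print(example_ranking)  # [2, 3, 1]
```

Here document 2 has the highest score, so index 2 comes first in the list.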


In [None]:
##--your-code-starts-here--##
# ranking = [0, 0, 0]  # replace me
# YOUR CODE HERE
raise NotImplementedError()
##--your-code-ends-here--##

In [None]:
# Visible tests 
# Exists
assert "ranking" in globals(), "Missing variable: ranking"

# Allow list or ndarray; normalize to ndarray
_r = np.asarray(ranking)

# Must be 1D of length 3
assert _r.ndim == 1, "ranking must be one-dimensional"
assert _r.shape[0] == 3, "ranking must contain 3 document indices"

# Must be integers and a permutation of [1, 2, 3]
assert np.issubdtype(_r.dtype, np.integer), "ranking must be integer indices"
assert set(_r.tolist()) == {1, 2, 3}, "ranking must be a permutation of [1, 2, 3]"

print("All visible tests passed.")

In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


## Exercise 2 - Precision and recall
There is a database of 10000 images and a user who is interested only in images that contain a car. It is known that there are 500 such images in the database. An automatic image retrieval system retrieves 300 car images and 50 other images from the database. Determine and report the precision and recall of the retrieval system in this particular case. Also compute the counts of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). <br> 

### Hint: 
Precision and recall are explained on the slides of Lecture 6. There’s also a good Wikipedia overview (including definitions of true/false positives and negatives): https://en.wikipedia.org/wiki/Precision_and_recall.

### Important:
Store your answers in the following variables:
- `tp` true positives
- `tn` true negatives
- `fp` false positives
- `fn` false negatives
- `precision` precision
- `recall` recall
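
The relationship between these quantities can be sketched with a small made-up retrieval scenario. The helper `retrieval_counts` and all numbers below are hypothetical and differ from the exercise setup.

```python
def retrieval_counts(n_total, n_relevant, n_retrieved_relevant, n_retrieved_other):
    """Return (tp, tn, fp, fn) for a single retrieval result."""
    tp = n_retrieved_relevant        # relevant items that were retrieved
    fp = n_retrieved_other           # irrelevant items that were retrieved
    fn = n_relevant - tp             # relevant items that were missed
    tn = n_total - tp - fp - fn      # irrelevant items correctly left out
    return tp, tn, fp, fn

# Hypothetical scenario: 1000 items, 100 relevant, and 80 retrieved,
# of which 60 are relevant.
ex_tp, ex_tn, ex_fp, ex_fn = retrieval_counts(1000, 100, 60, 20)

ex_precision = ex_tp / (ex_tp + ex_fp)  # fraction of retrieved items that are relevant
ex_recall = ex_tp / (ex_tp + ex_fn)     # fraction of relevant items that are retrieved
print(ex_tp, ex_tn, ex_fp, ex_fn)       # 60 880 20 40
print(ex_precision, ex_recall)          # 0.75 0.6
```

The four counts always sum to the size of the database, which is a quick sanity check for any answer.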

In [None]:
##--your-code-starts-here--##
# tp = 0         # replace me
# tn = 0         # replace me 
# fp = 0         # replace me
# fn = 0         # replace me
# precision = 0  # replace me
# recall = 0     # replace me
# YOUR CODE HERE
raise NotImplementedError()
##--your-code-ends-here--##

In [None]:
# Visible tests 

# Variables exist
for name in ["tp", "tn", "fp", "fn", "precision", "recall"]:
    assert name in globals(), f"Missing variable: {name}"

# Counts should be integer and non-negative
for name in ["tp", "tn", "fp", "fn"]:
    val = eval(name)
    assert isinstance(val, (int, np.integer)), f"{name} must be an integer"
    assert val >= 0, f"{name} must be non-negative"
    
# precision/recall should be numeric
for name in ["precision", "recall"]:
    val = eval(name)
    assert isinstance(val, (int, float, np.integer, np.floating)), f"{name} must be numeric"

print("All visible tests passed.")

In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


In [None]:
# HIDDEN TEST CELL
# This cell contains hidden test cases that will be evaluated after the deadline.
# Please do not remove or modify this cell, as it is required for grading.


## Exercise 3 - VGG practical on object instance recognition (20 points)
See the questions in `part1.ipynb`, `part2.ipynb`, and `part3.ipynb` and write your answers in this notebook, `exercise7.ipynb`.

**Part 1:** <br>
Stage I.A (two questions) <br>
Stage I.B (two questions) <br>
Stage I.C (one question) <br>

**Part 2** (one question)

**Part 3:** <br>
Stage III.A (three questions) <br>
Stage III.B (one question) <br>
Stage III.C (two questions) <br>

Answering the questions in **Part 1** is worth **10 points**, and **Parts 2 and 3** together are worth **10 additional points**. Hence, Exercise 3 is worth 20 points in total.

### Part 1 (10 points)
Type your answers for **Part 1** (5 questions) below.

YOUR ANSWER HERE

### Parts 2 and 3 (10 points)
Type your answers for **Part 2** (1 question) and **Part 3** (6 questions) below.

YOUR ANSWER HERE