# Vector space with ML

This lab is devoted to using ML models for information retrieval and text classification.

**Searching in the curious facts database**

The facts dataset is available [here](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt); take a look. We want you to retrieve facts **relevant to the query** (whatever that means). For example, you type "good mood" and learn that cherophobia is the fear of fun. The idea is to use document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we want to obtain fixed-size document vectors from an ML model.

## 1. Use neural networks to embed sentences

Use any model you like, from doc2vec up to Transformers. Provide all code, dependencies, and installation requirements.


- [USE in spacy 2](https://spacy.io/universe/project/spacy-universal-sentence-encoder) (`!pip install spacy-universal-sentence-encoder`)
- [Sentence BERT in spacy 2](https://spacy.io/universe/project/spacy-sentence-bert) (`!pip install spacy-sentence-bert`)
- [Pretrained 🤗 Transformers](https://huggingface.co/transformers/pretrained_models.html)
- [Spacy 3 transformers](https://spacy.io/usage/embeddings-transformers#transformers-installation)
- [doc2vec pretrained](https://github.com/jhlau/doc2vec)
- [Some more sentence transformers](https://www.sbert.net/docs/quickstart.html)
- [Even fasttext can do a sentence embedding](https://fasttext.cc/docs/en/python-module.html#model-object)

Put dependency installation, download instructions, and so on here, with outputs.

In [None]:
!pip install tensorflow tensorflow-hub

And then use the library to download (and load) the model.

NB: downloading the model may take a while, depending on where it is hosted. If it takes too long, ask your TA for help with the binaries.

In [None]:
import tensorflow_hub as hub

embed_tf_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

## 2. Write a function that prepares embedding of arbitrary queries

Write a function that returns a fixed-size embedding vector.

In [None]:
def embed(text):
    # Wrap the text in a one-element batch, then flatten the (1, d) output to shape (d,).
    return embed_tf_model([text]).numpy().flatten()
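
If the model download is slow, the rest of the pipeline can still be exercised with a stand-in that honours the same contract: any string in, a 1-D vector of fixed length out. The sketch below is only such a stand-in (a hashed bag of words, not a neural embedding); the name `toy_embed` and the dimension `dim=64` are assumptions made for illustration and are not part of the lab.

```python
import math
import re
import zlib

def toy_embed(text, dim=64):
    """Hypothetical stand-in for embed(): hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave the all-zero vector untouched (text with no word characters).
    return [x / norm for x in vec] if norm > 0 else vec

short_vec = toy_embed("Some random text")
long_vec = toy_embed("Folks, here's a story about Minnie the Moocher.")
print(len(short_vec), len(long_vec))  # both equal dim, regardless of text length
```

Such a vector carries no semantics, so it should be swapped back for the real model before judging retrieval quality.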
    

Here we check that embeddings of different texts have the same shape.

In [None]:
assert embed(
            "Some random text"
        ).shape == \
        embed(
            "Folks, here's a story about Minnie the Moocher. "
            "She was a lowdown hoochie coocher. "
            "She was the roughest, toughest frail, "
            "but Minnie had a heart as big as a whale"
        ).shape, "Shape should match"

In [None]:
# embed("some text for testing")  # already a NumPy array, no .numpy() needed

NB: here we check DISTANCE, not similarity. Thus, similar texts should produce results close to 0.

In [None]:
from scipy.spatial.distance import cosine

assert abs(cosine(
            embed("some text for testing"), 
            embed("some text for testing")
        )) < 1e-4, "Embedding should match"

assert abs(cosine(
            embed("Cats eat mice."), 
            embed("Terminator is an autonomous cyborg, typically humanoid, originally conceived as a virtually indestructible soldier, infiltrator, and assassin.")
        )) > 0.2, "Embeddings should be far"
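
For reference, the quantity returned by `scipy.spatial.distance.cosine` is one minus the cosine similarity, so it is 0 for vectors pointing the same way and grows as they diverge. The sketch below recomputes it in plain Python on hand-made 2-D vectors; `cosine_distance` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); hypothetical helper for illustration."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 2.0], [2.0, 4.0]))   # same direction: close to 0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite: 2.0
```

The distance depends only on direction, not on vector length, which is why the first pair counts as identical.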

## 3. Read the data

Now, let's read the facts dataset. Download it from the URL given above and read it into a list of sentences.

In [None]:
import requests
url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"

fact_all = requests.get(url).text
# Split the text into a list with one fact per element.
facts = fact_all.split('\n')

In [None]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert 'our lovely little planet' in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.
2. McDonalds calls frequent buyers of their food "heavy users."
3. The average person spends 6 months of their lifetime waiting on a red light to turn green.
4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.
5. You burn more calories sleeping than you do watching television.
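
Each line starts with its own number (`1. `, `2. `, ...), and that prefix is embedded together with the fact. Whether it matters depends on the model; if you want to embed only the text, one option is to split the prefix off first. A minimal sketch, assuming every line follows the `N. text` pattern seen in the output above; `split_fact` is a hypothetical helper, not part of the lab.

```python
import re

def split_fact(line):
    """Split 'N. text' into (N, text); return (None, line) if there is no numeric prefix."""
    match = re.match(r"\s*(\d+)\.\s+(.*)", line)
    if match is None:
        return None, line
    return int(match.group(1)), match.group(2)

sample = [
    "2. McDonalds calls frequent buyers of their food \"heavy users.\"",
    "5. You burn more calories sleeping than you do watching television.",
]
for number, text in map(split_fact, sample):
    print(number, text)
```

Keeping the number separately still lets you print the original line in the search results.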


## 4. Transform sentences to vectors

Transform the list of facts into a `numpy.array` of vectors, one per document (`sent_vecs`), inferring them with the model we just loaded.

In [None]:
import numpy as np
sent_vecs = np.array([embed(fact) for fact in facts])

In [None]:
assert sent_vecs.shape[0] == len(facts)

## 5. Find closest to the query

Now find the 5 facts closest to the query using the cosine measure.

### 5.1. Closest search

In [None]:
def find_k_closest(query, dataset, k=10):
    # cosine() returns a distance, so the smallest values are the closest documents.
    distances = [cosine(query, target) for target in dataset]
    return np.argsort(distances)[:k]
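
The loop above calls `cosine` once per document. When the dataset grows, the same ranking can be computed with one matrix-vector product after normalising the rows. A sketch under the assumption that `dataset` is a 2-D array with no all-zero rows; `find_k_closest_vectorized` is a hypothetical name.

```python
import numpy as np

def find_k_closest_vectorized(query, dataset, k=10):
    """Indices of the k rows of `dataset` with the smallest cosine distance to `query`."""
    dataset = np.asarray(dataset, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalise so that a dot product equals cosine similarity.
    rows = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    distances = 1.0 - rows @ q
    return np.argsort(distances)[:k]

toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
print(find_k_closest_vectorized(np.array([1.0, 0.1]), toy, k=2))
```

For the 159 facts in this lab the speed difference is negligible, so the loop version is perfectly adequate here.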


### 5.2. Use your function

In [None]:
query = "good mood"
query_vec = embed(query)

print("Results for query:", query)
print()
for k in find_k_closest(query_vec, sent_vecs, 5):
    print("\t", facts[k])

Results for query: good mood

	 57. Gorillas burp when they are happy
	 68. Cherophobia is the fear of fun.
	 98. Blue-eyed people tend to have the highest tolerance of alcohol.
	 45. About half of all Americans are on a diet on any given day.
	 44. Honey never spoils.


## 6. Measure DCG@5 for the following query bucket
```
good mood
gorilla
woman
earth
japan
people
math
```

Recommend 5 facts for each of the queries. Write your code below.

In [None]:
bucket = """good mood
gorilla
woman
earth
japan
people
math""".split('\n')

for term in bucket:
    print(term)
    for k in find_k_closest(embed(term), sent_vecs, k=5):
        print("\t", facts[k])

good mood
	 57. Gorillas burp when they are happy
	 68. Cherophobia is the fear of fun.
	 98. Blue-eyed people tend to have the highest tolerance of alcohol.
	 45. About half of all Americans are on a diet on any given day.
	 44. Honey never spoils.
gorilla
	 55. The word "gorilla" is derived from a Greek word meaning, "A tribe of hairy women."
	 57. Gorillas burp when they are happy
	 137. Human birth control pills work on gorillas.
	 106. The male ostrich can roar just like a lion.
	 85. The elephant is the only mammal that can't jump!
woman
	 151. Women have twice as many pain receptors on their body than men. But a much higher pain tolerance.
	 16. Men are 6 times more likely to be struck by lightning than women.
	 116. Male dogs lift their legs when they are urinating for a reason. They are trying to leave their mark higher so that it gives off the message that they are tall and intimidating.
	 55. The word "gorilla" is derived from a Greek word meaning, "A tribe of hairy women."


## 7. Write your own relevance assessments and compute DCG@5
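
For reference, the DCG@k computed in the cells below discounts the relevance $rel_i$ of the result at rank $i$ (counting from 1) by the logarithm of its position:

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
$$

With binary assessments such as `[1, 1, 1, 0, 0]` this gives $1 + 1/\log_2 3 + 1/\log_2 4 \approx 2.1309$.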

In [None]:
assessments = [
    [1, 0, 0, 0, 0], # good mood
    [1, 1, 1, 0, 0], # gorilla
    [1, 1, 0, 1, 0], # ...
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0]
]


optimal = [[1] * 5] * 7
optimal

# compute_dcg(optimal[0], assessments[1])

[[1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1]]

In [None]:
import numpy as np

# Discounted gain at each position: rel_i / log2(i + 2) for 0-based rank i.
sc = []
for rels in assessments:
    sc.append([rel / np.log2(i + 2) for i, rel in enumerate(rels)])
np.sum(sc, axis=1)  # DCG@5 per query



array([1.        , 2.13092975, 2.06160631, 2.13092975, 2.13092975,
       2.44845912, 2.13092975])

In [None]:
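from math import log2

# Ideal ranking assumed here: all 5 results are relevant for every query.
optimal = [[1] * 5] * 7

def dcg(rels):
    # DCG = sum of rel_i / log2(i + 2) over 0-based ranks i.
    return sum(rel / log2(i + 2) for i, rel in enumerate(rels))

# Average DCG and ideal DCG over all queries.
dcg5 = sum(dcg(row) for row in assessments) / len(assessments)
idcg5 = sum(dcg(row) for row in optimal) / len(optimal)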

In [None]:
print(f"DCG@5 = {dcg5:.4f}")
print(f"IDCG@5 = {idcg5:.4f}")
print(f"nDCG@5 = {dcg5 / idcg5:.4f}")

DCG@5 = 2.0048
IDCG@5 = 2.9485
nDCG@5 = 0.6800
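
The normalisation above assumes that an ideal system returns five relevant facts for every query. A common alternative, shown here only as a sketch, normalises each query by the DCG of its own assessments sorted in descending order, so it measures ordering quality among the retrieved facts rather than how many relevant facts exist. The `assessments` list is copied from the cell above; `ndcg` is a hypothetical helper.

```python
from math import log2

def dcg(rels):
    """DCG with 0-based ranks: sum of rel_i / log2(i + 2)."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels):
    """DCG divided by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

assessments = [
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0],
]
per_query = [ndcg(row) for row in assessments]
print([round(score, 4) for score in per_query])
print(f"mean nDCG@5 = {sum(per_query) / len(per_query):.4f}")
```

Under this variant a query whose relevant facts are already ranked first scores 1.0 even if only one of the five facts is relevant, so the two numbers answer different questions and should not be compared directly.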
