<a href="https://colab.research.google.com/github/z216z/DNLP/blob/main/practices/P3/Practice_3_IR_and_Recommendation_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 3:** Information Retrieval & Elastic Search

### Download and setup ElasticSearch on Google Colab

In [1]:
# Download and extract elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.10.1


--2021-11-03 10:45:17--  https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
Resolving artifacts.elastic.co (artifacts.elastic.co)... 34.120.127.130, 2600:1901:0:1d7::
Connecting to artifacts.elastic.co (artifacts.elastic.co)|34.120.127.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 318801277 (304M) [application/x-gzip]
Saving to: ‘elasticsearch-7.10.1-linux-x86_64.tar.gz’


2021-11-03 10:45:28 (26.5 MB/s) - ‘elasticsearch-7.10.1-linux-x86_64.tar.gz’ saved [318801277/318801277]



In [178]:
import os
from subprocess import Popen, PIPE, STDOUT

# If issues are encountered with this section, ES can be manually started as follows:
# ./elasticsearch-7.10.1/bin/elasticsearch

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30
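
A fixed `sleep 30` may be too short on a slow machine and needlessly long on a fast one. As a minimal sketch (not part of the original notebook), the hypothetical helper `wait_for_http` below polls an HTTP endpoint until it answers or a timeout expires. The function name, URL and timing values are assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout_s=60.0, poll_interval_s=1.0):
    """Return True as soon as `url` answers with HTTP 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet: wait and retry
            pass
        time.sleep(poll_interval_s)
    return False
```

Calling `wait_for_http("http://localhost:9200/")` right after `Popen(...)` would then replace the fixed sleep.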

In [179]:
# wait a bit then test
!curl -X GET "localhost:9200/"

{
  "name" : "b66670bee898",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "tQm_d_qmTzeU8uxxmu38ZQ",
  "version" : {
    "number" : "7.10.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "1c34507e66d7db1211f66f3513706fdf548736aa",
    "build_date" : "2020-12-05T01:00:33.671820Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


## Information Retrieval

Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of **texts**, images or sounds. (source: Wikipedia).

In this practice you will build a Wikipedia-based search engine. Only a subset of the Wikipedia pages is used.

Data Source: https://snap.stanford.edu/data/wikispeedia.html 

### **Question 1: PageRank scores**
Using the Wikipedia citation (link) network in `links.tsv`, compute the [PageRank](http://ilpubs.stanford.edu:8090/422/) score of each page.

Which page has the highest PageRank score?
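
To make the computation concrete, here is a minimal pure-Python sketch of PageRank by power iteration on a toy graph. It is only illustrative: the function name, the toy edges and the defaults (damping 0.85, uniform redistribution of the rank held by pages without out-links) are assumptions, and `networkx.pagerank` remains the practical choice for the full network.

```python
def pagerank_power_iteration(edges, damping=0.85, max_iter=100, tol=1e-10):
    """Approximate PageRank for a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly over all pages
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        # Every page shares its rank equally among the pages it links to
        for source in nodes:
            if out_links[source]:
                share = damping * rank[source] / len(out_links[source])
                for target in out_links[source]:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph (hypothetical): three pages link to "C", and "C" links back to "A"
toy_edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
toy_scores = pagerank_power_iteration(toy_edges)
best_page = max(toy_scores, key=toy_scores.get)
```

The scores sum to 1, and the page with the most incoming links (`"C"` here) comes out on top, which is the same question asked above for the Wikipedia subset.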


In [4]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/articles.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/categories.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/links.tsv


In [5]:
%%capture
! pip install elasticsearch==7.10.1
! pip install networkx

In [6]:
from urllib.parse import unquote

# Read the URL-encoded article IDs, skipping blank lines and "#" comment lines
with open("articles.tsv") as f:
    list_articles = [l for l in f.read().split("\n") if l != "" and l[0] != "#"]

# Map each decoded (human-readable) ID to its record
unquoted_list_articles = [unquote(l) for l in list_articles]
dict_articles = {}
for i, l in enumerate(unquoted_list_articles):
    dict_articles[l] = {}
    dict_articles[l]["ID"] = l
    dict_articles[l]["quoted_ID"] = list_articles[i]

In [8]:
from urllib.parse import unquote

list_categories = open("categories.tsv").read()
list_categories = list_categories.split("\n")
list_categories = [l for l in list_categories if l!= ""]
list_categories = [l for l in list_categories if l[0] != "#"]

for l in list_categories:
    k, v = l.split("\t")
    k = unquote(k)
    v = unquote(v)
    if "categories" in dict_articles[k].keys():
        dict_articles[k]["categories"].append(v)
    else:
        dict_articles[k]["categories"] = [v]
    
print (dict_articles)

{'Áedán_mac_Gabráin': {'ID': 'Áedán_mac_Gabráin', 'quoted_ID': '%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures']}, 'Åland': {'ID': 'Åland', 'quoted_ID': '%C3%85land', 'categories': ['subject.Countries', 'subject.Geography.European_Geography.European_Countries']}, 'Édouard_Manet': {'ID': 'Édouard_Manet', 'quoted_ID': '%C3%89douard_Manet', 'categories': ['subject.People.Artists']}, 'Éire': {'ID': 'Éire', 'quoted_ID': '%C3%89ire', 'categories': ['subject.Countries', 'subject.Geography.European_Geography.European_Countries']}, 'Óengus_I_of_the_Picts': {'ID': 'Óengus_I_of_the_Picts', 'quoted_ID': '%C3%93engus_I_of_the_Picts', 'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures']}, '€2_commemorative_coins': {'ID': '€2_commemorative_coins', 'quoted_ID': '%E2%82%AC2_commemorative

In [9]:
from urllib.parse import unquote

list_links = open("links.tsv").read()
list_links = list_links.split("\n")
list_links = [l for l in list_links if l!= ""]
list_links = [l for l in list_links if l[0] != "#"]

for l in list_links:
    s, t = l.split("\t")
    s = unquote(s)
    t = unquote(t)
    if "out_links" in dict_articles[s].keys():
        dict_articles[s]["out_links"].append(t)
    else:
        dict_articles[s]["out_links"] = [t]

In [10]:
print (dict_articles["Áedán_mac_Gabráin"])

{'ID': 'Áedán_mac_Gabráin', 'quoted_ID': '%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'categories': ['subject.History.British_History.British_History_1500_and_before_including_Roman_Britain', 'subject.People.Historical_figures'], 'out_links': ['Bede', 'Columba', 'Dál_Riata', 'Great_Britain', 'Ireland', 'Isle_of_Man', 'Monarchy', 'Orkney', 'Picts', 'Scotland', 'Wales']}


In [54]:
# Build (source, target) edge tuples from the raw, URL-encoded link lines
list_links_4graph = []
for link in list_links:
    source, target = link.split("\t")
    list_links_4graph.append((source, target))

In [56]:
list_links_4graph

[('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Bede'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Columba'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'D%C3%A1l_Riata'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Great_Britain'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Ireland'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Isle_of_Man'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Monarchy'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Orkney'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Picts'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Scotland'),
 ('%C3%81ed%C3%A1n_mac_Gabr%C3%A1in', 'Wales'),
 ('%C3%85land', '20th_century'),
 ('%C3%85land', 'Baltic_Sea'),
 ('%C3%85land', 'Crimean_War'),
 ('%C3%85land', 'Currency'),
 ('%C3%85land', 'Euro'),
 ('%C3%85land', 'European_Union'),
 ('%C3%85land', 'Finland'),
 ('%C3%85land', 'League_of_Nations'),
 ('%C3%85land', 'List_of_countries_by_system_of_government'),
 ('%C3%85land', 'Nationality'),
 ('%C3%85land', 'Parliamentary_system'),
 ('%C3%85land', 'Police'),
 ('%C3%85land', 'Russia'),

In [81]:
import networkx as nx

# Build the directed link graph (node names are the URL-encoded article IDs)
G = nx.DiGraph()
G.add_nodes_from(list_articles)
G.add_edges_from(list_links_4graph)

# Compute PageRank once and reuse the result
pageranks = nx.pagerank(G)
print(sorted(pageranks.items(), key=lambda x: x[1], reverse=True))

[('United_States', 0.00956180652731311), ('France', 0.0064200413810133585), ('Europe', 0.006337014005458885), ('United_Kingdom', 0.006232394913963077), ('English_language', 0.004862980440047761), ('Germany', 0.00482224267836269), ('World_War_II', 0.0047226367934437305), ('England', 0.0044723357530703466), ('Latin', 0.004422148441338466), ('India', 0.004033922521194668), ('Japan', 0.0038882325861089276), ('Italy', 0.003715760912321338), ('Spain', 0.0036408643873478536), ('China', 0.00356413874641976), ('Russia', 0.0034947394714253083), ('Time_zone', 0.0034644735702094955), ('Canada', 0.003433549619451972), ('Currency', 0.0032358494514180634), ('Australia', 0.0032030209209760996), ('Africa', 0.0031664657423146948), ('London', 0.003076762208430332), ('Christianity', 0.0030168282774453155), ('Animal', 0.002882043170603183), ('List_of_countries_by_system_of_government', 0.0028321820144428947), ('United_Nations', 0.0028074111045188), ('French_language', 0.0027402318002918958), ('Islam', 0.00

In [82]:
list_articles

['%C3%81ed%C3%A1n_mac_Gabr%C3%A1in',
 '%C3%85land',
 '%C3%89douard_Manet',
 '%C3%89ire',
 '%C3%93engus_I_of_the_Picts',
 '%E2%82%AC2_commemorative_coins',
 '10th_century',
 '11th_century',
 '12th_century',
 '13th_century',
 '14th_century',
 '15th_Marine_Expeditionary_Unit',
 '15th_century',
 '16_Cygni',
 '16_Cygni_Bb',
 '16th_century',
 '1755_Lisbon_earthquake',
 '17th_century',
 '1896_Summer_Olympics',
 '18th_century',
 '1928_Okeechobee_Hurricane',
 '1973_oil_crisis',
 '1980_eruption_of_Mount_St._Helens',
 '1997_Pacific_hurricane_season',
 '19th_century',
 '1_Ceres',
 '1st_century',
 '1st_century_BC',
 '2-6-0',
 '2-8-0',
 '2003_Atlantic_hurricane_season',
 '2004_Atlantic_hurricane_season',
 '2004_Indian_Ocean_earthquake',
 '2005_Atlantic_hurricane_season',
 '2005_Hertfordshire_Oil_Storage_Terminal_fire',
 '2005_Kashmir_earthquake',
 '2005_Lake_Tanganyika_earthquake',
 '2005_Sumatra_earthquake',
 '20th_century',
 '21st_century',
 '2nd_century',
 '3_Juno',
 '3rd_century',
 '4-2-0',
 '4-

### **Question 2: Wikipedia pages indexing**

Create a new index in ElasticSearch and index the Wikipedia pages together with their content. The content of each page can be found at `plaintext_articles/QUOTED_ID_OF_THE_DOC.txt`.

NB: the PageRank score must be a field of each indexed document.
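
Indexing one document per request works but can be slow for thousands of pages. A possible alternative, sketched under the assumption that each record already holds its text, is to prepare bulk actions first. The helper name `build_index_actions` and the toy record are hypothetical; the field name `pagerank_score` is the one used in the Question 4 formula.

```python
def build_index_actions(articles, pagerank_scores, index_name="wiki"):
    """Yield one bulk action per article, adding its PageRank score as a field."""
    for doc in articles.values():
        source = dict(doc)  # copy, so the input records are left untouched
        # Scores are assumed to be keyed by the URL-encoded (quoted) article ID
        source["pagerank_score"] = pagerank_scores.get(doc["quoted_ID"], 0.0)
        yield {"_index": index_name, "_id": doc["quoted_ID"], "_source": source}


# Toy input (hypothetical) with the same shape as dict_articles
toy_articles = {
    "Åland": {"ID": "Åland", "quoted_ID": "%C3%85land", "content": "Some text."},
}
toy_actions = list(build_index_actions(toy_articles, {"%C3%85land": 0.5}))
```

The resulting generator can be passed to `elasticsearch.helpers.bulk(es, actions)`, which sends the documents in batches.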


In [64]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P3/plaintext_articles.zip
! unzip plaintext_articles.zip

In [181]:
from elasticsearch import Elasticsearch

es = Elasticsearch()
#create_index = es.indices.create(index="wiki")
# Index each article with its full text and its PageRank score
# (pageranks is keyed by the URL-encoded ID, as are the file names)
for doc in dict_articles.values():
  quoted_id = doc["quoted_ID"]
  with open("plaintext_articles/" + quoted_id + ".txt", "r") as f:
    doc["content"] = f.read()
  doc["pagerank_score"] = pageranks.get(quoted_id, 0.0)
  res = es.index(index="wiki", body=doc)


### **Question 3: Querying ElasticSearch**

Perform a query using ElasticSearch. Look for your favorite content (choose and report 3 of them) on the full text of the articles.

E.g.:
- query 1 : "The capital of Italy" (surprised by the result?)
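
A plain `match` query combines the analysed terms with OR by default, so any document containing even a very common term such as "of" counts as a hit, which may explain surprising results. The sketch below builds two stricter variants for comparison. The helper names are hypothetical and the default field name `content` is an assumption about how the pages were indexed.

```python
def match_all_terms(text, field="content"):
    """Query body requiring every term of `text` to appear in `field`."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}


def match_exact_phrase(text, field="content"):
    """Query body requiring the terms of `text` to appear as a contiguous phrase."""
    return {"query": {"match_phrase": {field: text}}}


all_terms_body = match_all_terms("capital of Italy")
phrase_body = match_exact_phrase("capital of Italy")
```

Either body can be passed to `es.search(index="wiki", body=...)` and compared with the default `match` results.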

In [223]:

search_param = {
    "query": {
        "match": {
            "content": "capital of France"
        }
    }
}
req = es.search(index="wiki",body=search_param)
print("Got %d Hits:" % req['hits']['total']['value'])
for hit in req['hits']['hits']:
    print( hit["_source"])
#print(sorted(req["hits"].items(), key=lambda x: x[1], reverse=True))

Got 7077 Hits:
{'ID': 'Lyon', 'quoted_ID': 'Lyon', 'categories': ['subject.Geography.European_Geography'], 'out_links': ['13th_century', '19th_century', 'Capital', 'Celtic_mythology', 'Christianity', 'Claudius', 'English_language', 'Europe', 'Film', 'France', 'French_language', 'Interpol', 'Italy', 'Julius_Caesar', 'Lille', 'List_of_countries', 'Marseille', 'Middle_Ages', 'Nazi_Germany', 'Paris', 'Renaissance', 'River', 'Roman_road', 'TGV', 'World_Heritage_Site', 'World_War_II'], 'content': '   #copyright\n\nLyon\n\n2007 Schools Wikipedia Selection. Related subjects: European Geography\n\n                               Ville de Lyon\n\n   Flag of Lyon\n                                    Coat of arms of Lyon\n      City flag                       City coat of arms\n   Motto: Avant, avant, Lion le melhor.\n   ( Arpitan: Forward, forward, Lyon the best)\n                                  Location\n\n   Image:Paris_plan_pointer_b_jms.gif\n   Map highlighting the commune of Lyon\n   Coordi

### **Question 4: integrating pagerank scores**

Create a template query that includes the PageRank score when computing the relevance score (`_score`).

Use the [Script score](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score) function to generate a hybrid score (`_score + pagerank_score * 250`).

Run the same set of queries with this modification. Does it change the results?
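
One way such a template could look is sketched below, assuming that each indexed document carries a numeric `pagerank_score` field and that the text lives in a field named `content` (both are assumptions about your index). The helper names are hypothetical. `boost_mode` is set to `replace` so that the final score is exactly the value returned by the script.

```python
def hybrid_score(text_score, pagerank_score, weight=250.0):
    """Plain-Python version of the formula, handy for checking numbers by hand."""
    return text_score + pagerank_score * weight


def pagerank_boosted_query(text, weight=250.0, field="content"):
    """Build a function_score query body implementing _score + pagerank_score * weight."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "script_score": {
                    "script": {
                        "source": "_score + doc['pagerank_score'].value * params.weight",
                        "params": {"weight": weight},
                    }
                },
                # Use only the script result, since it already includes _score
                "boost_mode": "replace",
            }
        }
    }


example_body = pagerank_boosted_query("capital of France")
```

For a sense of scale: a page with PageRank 0.00956 (the top value in the Question 1 output) gains about 2.39 points, so the weight of 250 mainly reorders documents whose text scores are close to each other.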



In [None]:
# Your code here

### **Question 5: integrate semantic dense-vectors**

Generate a new index ("wiki-semantic-search") including all the information of the previous one plus an additional field that contains a BERT-based embedding vector of the `full_text` of the article. Once indexing is completed, repeat the same queries for a qualitative evaluation of the IR system. 

**Some hints below:**
- Use Sentence-BERT pretrained encoders (www.sbert.net). Choose the most suitable pretrained model (trade off between speed and accuracy). E.g., `multi-qa-MiniLM-L6-cos-v1`
- Use cosine similarity to compute the similarity between queries and full text of the article.
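
As an illustration of the second hint, the sketch below computes cosine similarity in plain Python and builds a `script_score` query that asks ElasticSearch to do the same against a `dense_vector` field. The helper names are hypothetical and the field name `embedding_bert` follows the mapping used in this question. Adding `1.0` keeps the score non-negative, which ElasticSearch requires.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_query(query_vector, vector_field="embedding_bert", size=10):
    """Build a query body ranking all documents by cosine similarity to query_vector."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, '"
                              + vector_field + "') + 1.0",
                    # Plain floats keep the request body JSON-serialisable
                    "params": {"query_vector": [float(x) for x in query_vector]},
                },
            }
        },
    }


example_semantic_body = semantic_query([0.1, 0.2, 0.3])
```

The query vector would come from encoding the query text with the same Sentence-BERT model used for the articles.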

In [None]:
%%capture
!pip install sentence-transformers

In [None]:
# create mapping
# NB: sentence_encodings must already hold the article embeddings (one vector per article)
dense_dim = len(sentence_encodings[0])

index_properties = {}
index_properties['settings']={ "number_of_shards": 2, "number_of_replicas": 1}
index_properties['mappings']={ "dynamic": "true", "_source": { "enabled": "true" }, "properties": {}}
for t in ['ID', 'quoted_ID', 'full_text']: 
    index_properties['mappings']['properties'][t]={ "type": "text" }
for t in ['pagerank_score']: 
    index_properties['mappings']['properties'][t]={ "type": "float" }
for d in ["embedding_bert"]: 
    index_properties['mappings']['properties'][d]={ "type": "dense_vector", "dims": dense_dim }

In [None]:
# Your code here

## Content-based Recommender Systems

A recommender system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. (source: [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system))

In this part of the practice you will build a text-based, unsupervised recommendation system (**content**-based only). The goal is similar to that of an IR search engine; the main difference lies in **how you define the "queries".**

The tools at your disposal are:
1. `Sentence-BERT model`: should be used to obtain a vector representation of the input data.
2. `ElasticSearch`: can be used for indexing movie information and to perform **fast** similarity search.

For the recommendation system you need the following information:
- Movie's title
- Movie's plot
- Plot's embedding vector

The dataset used for this goal is: [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots). For this practice you will use a truncated version of the data collection to reduce runtime.

In [169]:
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
import pandas as pd
df_movies = pd.read_csv("wiki_plots_2005onward.csv")

--2021-11-03 12:35:52--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45936814 (44M) [text/plain]
Saving to: ‘wiki_plots_2005onward.csv’


2021-11-03 12:35:54 (153 MB/s) - ‘wiki_plots_2005onward.csv’ saved [45936814/45936814]



### **Question 6: movie encodings**

Use a Sentence-BERT model to encode movie plots into fixed-size vectors.

NB: the vector dimension depends on the chosen pretrained model.

In [170]:
! pip install sentence-transformers


In [172]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Plots to encode, as a plain list of strings
sentences = df_movies["Plot"].tolist()

# model.encode() returns one fixed-size vector per plot
embeddings = model.encode(sentences)




### **Question 7: ElasticSearch indexing**

Create a new ElasticSearch index (`recsys-movies`) and index all movies with their embedding vectors.
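
A possible shape for the index definition and the documents is sketched below. The helper names and the field names `title`, `plot` and `embedding` are assumptions, and the inputs are plain Python sequences so that the sketch does not depend on a particular dataframe layout.

```python
def movie_index_body(embedding_dim):
    """Index definition with two text fields and one dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "plot": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": embedding_dim},
            }
        }
    }


def movie_actions(titles, plots, embeddings, index_name="recsys-movies"):
    """Yield one bulk action per movie, storing the vector as a list of floats."""
    for title, plot, vector in zip(titles, plots, embeddings):
        yield {
            "_index": index_name,
            "_source": {
                "title": title,
                "plot": plot,
                "embedding": [float(x) for x in vector],
            },
        }


# Toy data (hypothetical) with 3-dimensional vectors
toy_body = movie_index_body(3)
toy_movie_actions = list(movie_actions(["Movie A"], ["A short plot."], [[0.1, 0.2, 0.3]]))
```

In the notebook, the dimension would be `len(embeddings[0])`, the body would go to `es.indices.create(index="recsys-movies", body=...)`, and the actions to `elasticsearch.helpers.bulk(es, actions)`.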



In [None]:
# Your code here

### **Question 8: Query generation**

Create a function that accepts the following arguments:
1. `embedding_model`: Sentence-BERT model used to generate embeddings
2. `df_movies`: the dataframe containing all the movies' information
3. `movie_title`: a string containing the title of the movie the user is currently watching.

It should return the embedding vector for the query: look up the plot of `movie_title` in `df_movies` and encode it with `embedding_model`.




In [None]:
# Your code here

### **Question 8: Qualitative evaluation (your personal movie recommendation system)**

Evaluate your personal recommendation system by querying for some movies in the data collection. You need to create an ElasticSearch query to use the recommendation system (see Q. 5 of this practice).

Some examples:
1. title: Harry Potter and the Goblet of Fire
2. title: Avengers: Age of Ultron
3. title: Star Wars: The Last Jedi
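
When the query vector is the embedding of a movie that is itself in the index, that movie will normally come back as the best match. A small post-processing step can drop it before showing recommendations. The sketch below assumes the usual ElasticSearch response layout (`hits.hits[]._source`) and a `title` field in each document; the helper name and the toy response are hypothetical.

```python
def top_recommendations(response, watched_title, k=5):
    """Return up to k distinct titles from a search response, skipping the watched movie."""
    titles = []
    for hit in response["hits"]["hits"]:
        title = hit["_source"]["title"]
        if title != watched_title and title not in titles:
            titles.append(title)
        if len(titles) == k:
            break
    return titles


# Toy response (hypothetical), ordered by decreasing score as ElasticSearch would return it
toy_response = {
    "hits": {
        "hits": [
            {"_score": 2.0, "_source": {"title": "Movie A"}},
            {"_score": 1.8, "_source": {"title": "Movie B"}},
            {"_score": 1.7, "_source": {"title": "Movie B"}},
            {"_score": 1.5, "_source": {"title": "Movie C"}},
        ]
    }
}
toy_recommendations = top_recommendations(toy_response, "Movie A", k=2)
```

Requesting a few more hits than `k` in the query leaves room for the entries that are filtered out.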


In [None]:
 # Your code here

### **Question 9 (Bonus)**

Rewrite the query-generation function from Question 8 to take multiple movie titles (a list of strings). Compute the average of their embedding vectors and use it to obtain recommendations. Perform a qualitative evaluation for this case (you may choose movie titles from the previous list).
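
The averaging step on its own can be sketched as follows. This is a plain-Python illustration with a hypothetical helper name; with NumPy arrays, `np.mean(vectors, axis=0)` does the same job.

```python
def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if len(vectors) == 0:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy example (hypothetical 2-dimensional "embeddings" of two movies)
toy_average = average_vector([[1.0, 2.0], [3.0, 4.0]])
```

The averaged vector is then used as the query vector exactly like a single-movie embedding.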

In [None]:
# Your code here