# Migration Reports project: Retrieval
In this part, we focus on the retrieval part of the RAG implementation.

In [1]:
import pandas as pd
import numpy as np
import ipywidgets as widgets
from IPython.display import display
import chromadb
from chromadb.config import Settings

# Simple example
In this section, we first get a hang of ChromadB by following a simple example. We will later on focus on the implementation of our `.csv` file. 
### Indexing
Let's first focus on indexing a sentence in ChromadB. To do so, we can first install chromadB by running `pip install chromadb` in the terminal.

The steps are detailed on the [Chroma website](https://docs.trychroma.com/docs/overview/getting-started) and in this [example](https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide).

The first step is to create a client, which will connect us to the database and thus allow us to create, store, query collections of embeddings.
We can create a persistent client, meaning we save the data on our disk, or an ephemeral client, in which case the data will disappear once we're done. We choose the latter for this first example.

In [None]:
# in-memory client for this example
#client_test = chromadb.Client()

# example of persistent client
# client = chromadb.PersistentClient(path="")


We can then create a collection, where our embeddings will be stored.

In [4]:
collection_test = client_test.create_collection(name="first_try")

Let's now add documents to our collection, these will be embedded with the default embedding model `all-MiniLM-L6-v2`. We could also choose a different embedding model, which we will most likely due for the rest of the project.

In [5]:
collection_test.add(
    ids=["id1", "id2"],
    documents=[
        "The sky is blue on earth and pink on Saturn.",
        "There is no sky in India."
    ]
)


### Add the query and compute the similarity

In [6]:
results_test = collection_test.query(
    query_texts=["What color is the sky on Saturn?"], # Chroma will embed the query
    n_results=2, # number of results to return, default=10
    include=["documents", "distances", "embeddings"] # default mode doesn't include embeddings, add them to visualize
)

In [7]:
print(results_test['distances'])
print(results_test['embeddings'])

[[0.2976532280445099, 1.2716426849365234]]
[array([[ 7.58604147e-03, -3.36620733e-02,  7.11525679e-02,
         8.41439050e-03,  7.40479901e-02, -1.85960811e-02,
         8.59402195e-02,  5.68682589e-02,  9.73398909e-02,
        -6.36850530e-03, -6.61930218e-02,  1.11738751e-02,
        -3.40286195e-02, -9.80932452e-03, -6.70989677e-02,
         5.86853437e-02, -2.41405573e-02, -7.16577992e-02,
        -2.85602733e-02, -1.48872854e-02, -4.84411605e-02,
         2.35866643e-02, -4.09192108e-02,  9.22331214e-02,
        -8.31628814e-02,  9.81782079e-02, -2.09377091e-02,
         5.39325876e-03,  7.10575609e-03,  6.13714978e-02,
        -1.21362749e-02,  3.07445694e-02,  2.28071883e-02,
        -1.43129472e-02, -8.39055106e-02, -3.72098982e-02,
        -1.21801952e-02, -3.42654921e-02, -1.34223672e-02,
         7.66166078e-04,  1.34405438e-02,  7.06612365e-04,
         2.47000419e-02, -5.17933586e-05,  4.71106265e-04,
         4.71932516e-02,  4.50511910e-02, -4.68019629e-03,
         2.8

As we see above the distance from the query to the first sentence is much smaller than to the second sentence, which makes sense as we are asking a question specifically about Saturn.

Let's now remove the word 'sky' altogether and see what we get.

In [8]:
results_test2 = collection_test.query(
    query_texts=["How is the weather on Saturn?"], 
    n_results=2
)
print(results_test2['distances'])

[[0.7035682797431946, 1.515793800354004]]


### Second example
Let's play with Chroma to get a sense of the limits of the embeddings.

In [9]:
collection_test.add(
    ids=["id3", "id4", "id5", "id6"],
    documents=[
        "Sri Lanka is not worth visiting.",
        "Ceylan is a beautiful  place.",
        "I like going to South Asia.",
        "I love going to the beach."
    ]
)

In [10]:
results_test3 = collection_test.query(
    query_texts=["I would love to go to Sri Lanka."], 
    n_results=6 
)
for i in range(5):
    sentence = results_test3['documents'][0][i]
    distance = results_test3['distances'][0][i]
    print(f"{sentence} {distance:.4f}")

Sri Lanka is not worth visiting. 0.8618
I like going to South Asia. 0.8624
I love going to the beach. 1.3409
Ceylan is a beautiful  place. 1.5688
There is no sky in India. 1.6129


We can already see some limitations, with sentiment not captured properly and certain words not fully understood (Sri Lanka was previously named Ceylan)

# Migration reports dataset

Let's now try to extract text directly from our `EMN.csv` file.

In [4]:
# load data
df = pd.read_csv('data/EMN.csv', encoding='utf-8')
df

Unnamed: 0,Country,Year,Title,Subtitle,Content,Grouped_Title
0,Finland,2012,LEGAL MIGRATION AND MOBILITY,Main,"As part of the Government Programme, the devel...",LEGAL MIGRATION AND MOBILITY
1,Finland,2012,LEGAL MIGRATION AND MOBILITY,PROMOTING LEGAL MIGRATION CHANNELS,As part of the Action Plan on labour migration...,LEGAL MIGRATION AND MOBILITY
2,Finland,2012,LEGAL MIGRATION AND MOBILITY,ECONOMIC MIGRATION,The improvement of the labour market position ...,LEGAL MIGRATION AND MOBILITY
3,Finland,2012,LEGAL MIGRATION AND MOBILITY,FAMILY REUNIFICATION,A comparative study on family reunification in...,LEGAL MIGRATION AND MOBILITY
4,Finland,2012,INTEGRATION,Main,"In September 2012, the Government Integration ...",INTEGRATION
...,...,...,...,...,...,...
4320,Portugal,2023,"BORDERS, VISA AND SCHENGEN",Main,"Ordinance 321/2023 of 27 October, first amendm...",BORDERS AND VISAS
4321,Portugal,2023,IRREGULAR MIGRATION,Main,No significant developments to report in 2023.,IRREGULAR MIGRATION
4322,Portugal,2023,TRAFFICKING IN HUMAN BEINGS,Main,"During 2023, the Observatory on Trafficking in...",TRAFFICKING IN HUMAN BEINGS
4323,Portugal,2023,RETURN AND READMISSION,Main,"During the second trimester of 2023, Portugal ...",RETURN AND READMISSION


Let's first focus on creating embeddings for a single country, single year, and without focusing on any metadata such as the title, year, country (consider only the Content column).

We can first consider only one cell, as it contains already a lot of text, and then build on this to add progressively more embeddings.

In [5]:
client = chromadb.PersistentClient(path="data")

In [12]:
collection = client.create_collection(name="emn_finland_2012")

Let's now extract the text and add it to our `migration_finland_2012` collection.

The text is the following:\
First cell:\
As part of the Government Programme, the development of a comprehensive Future of Migration 2020 Strategy started in 2012. The objective of the Strategy is to design a policy which supports the building of an unprejudiced, safe and pluralistic Finland as well as enhance Finland‚Äôs international competitiveness. The Strategy was developed under the coordination of the Ministry of Interior together with over 40 represented stakeholders and it was adopted in June 2013. Also, the Ministry of the Interior set up a project to improve the effectiveness of the administration of immigration affairs which will run until December 2014. The concept of improving the effectiveness of the administration of immigration affairs was supported by two projects related to centres of expertise: the first, led by the Ministry of Employment and the Economy, looked into the establishment of a centre of expertise that promotes integration activities, while the second, set up by the Ministry of the Interior, assessed the prerequisites for founding a centre of expertise on the compilation of statistics and research about immigration. A public dialogue on immigration was launched in 2012 focusing on the costs of immigration, while a project to adopt a cooperative model among immigration authorities ( FPB) was launched in January 2012 by the Minister of Interior. The objective of the project is to improve the effectiveness of co- operation of authorities responsible for immigration affairs. 

Second cell:\
As part of the Action Plan on labour migration, a number of projects in different parts of Finland have been financed by the European Social Fund ( ESF). Projects focused on creating training systems, developing services for settling as well as building models for recruitment of labour migrants. 

Third cell:\
The improvement of the labour market position of immigrants has been defined as one of the targets of the Government Programme. Cooperation between the State and the municipalities of the Helsinki Metropolitan Area was considered a functional operating model. The implementation of the policy guidelines for international employment services, approved by the Ministry of Employment and the Economy in 2011, started at the beginning of 2012. The Ministry of Employment and the Economy launched the HYV√Ñ nurse recruitment project as a part of its HYV√Ñ (entrepreneurship and cooperation programme (2013-15) with the objective to define the roles and tasks of various parties in the international recruitment of registered nursing staff. In addition, the Ministry of Employment and the Economy has published a practical guide ‚ÄúExperience of International Recruitment to Finland‚Äù, which contains practical hints, check-lists and experiences that can be utilised at different stages of recruitment of foreign nationals. The National Audit Office of Finland published the report ‚ÄúWork-based Immigration‚Äù in September 2012. The report reflected on targeted programmes and projects for the promotion of work-based immigration. 

Fourth cell:\
A comparative study on family reunification in Nordic countries conducted by the Ministry of Interior was published in April 2012 with the aim to analyse and compare legislative provisions related to 2 residence permits issued on the ground of family ties and to formulate a proposal for amendments to the Aliens Act. The study recommended raising the threshold requirements of sufficient income to beneficiaries of humanitarian protection as well as introducing requirements for adequate accommodation. 

Start by adding one cell to visualize the embeddings.

In [13]:
text0 = df['Content'][0]

In [14]:
collection.add(
    ids=["id0"],
    documents=[
        text0
    ]
)

In [15]:
results = collection.query(
    query_texts=["When did the development of a comprehensive Future of Migration 2020 Strategy start?"], 
    n_results=6,
    include=["documents", "distances", "embeddings"] 
    
)

print(np.shape(results['embeddings']), results['distances'])

(1, 1, 384) [[0.9370808601379395]]


In [16]:
client.delete_collection("emn_finland_2012")

As we can see above, the whole text was embedded as a single vector.

We can add progressively more cells and see how the retrieval performance is, to see if we need to chunk out text (we probably will).
To that purpose, let us define 3 questions with increasing difficulty:
- Which Finnish ministry coordinated the development of the Future of Migration 2020 Strategy? (refers to the first cell, matches literally)
- What initiatives were launched in Finland to recruit foreign nurses, and during which years did they take place? (refers to the 3rd cell, checks if meaning is understood)
- What measures has Finland taken to address climate change in Arctic regions? (not mentionned)

Let's also add the three other cells.

In [17]:
text1 = df['Content'][1]
text2 = df['Content'][2]
text3 = df['Content'][3]
print(f"{text1}\n")
print(f"{text2}\n")
print(f"{text3}\n")

As part of the Action Plan on labour migration, a number of projects in different parts of Finland have been financed by the European Social Fund ( ESF). Projects focused on creating training systems, developing services for settling as well as building models for recruitment of labour migrants. 

The improvement of the labour market position of immigrants has been defined as one of the targets of the Government Programme. Cooperation between the State and the municipalities of the Helsinki Metropolitan Area was considered a functional operating model. The implementation of the policy guidelines for international employment services, approved by the Ministry of Employment and the Economy in 2011, started at the beginning of 2012. The Ministry of Employment and the Economy launched the HYVÄ nurse recruitment project as a part of its HYVÄ (entrepreneurship and cooperation programme (2013-15) with the objective to define the roles and tasks of various parties in the international recruitm

In [18]:
collection = client.create_collection(name="emn_finland_2012")

collection.add(
    ids=["id0", "id1", "id2", "id3"],
    documents=[text0, text1, text2, text3]
)
print(collection.get()['documents'][3])  


A comparative study on family reunification in Nordic countries conducted by the Ministry of Interior was published in April 2012 with the aim to analyse and compare legislative provisions related to 2 residence permits issued on the ground of family ties and to formulate a proposal for amendments to the Aliens Act. The study recommended raising the threshold requirements of sufficient income to beneficiaries of humanitarian protection as well as introducing requirements for adequate accommodation. 


In [19]:
query0 = "Which Finnish ministry coordinated the development of the Future of Migration 2020 Strategy?"
query1 = "What initiatives were launched in Finland to recruit foreign nurses, and during which years did they take place?"
query2 = "What measures has Finland taken to address climate change in Arctic regions?"

queries = [query0, query1, query2]
results = collection.query(
    query_texts=queries, 
    include=["documents", "distances", "embeddings"] 
    
)

for i in range(3):
    query = queries[i]
    print(query)
    for j in range(4):
        document = results['documents'][0][j]
        dist = results['distances'][i][j]
        print(f"{dist:.4f} {document}")


Which Finnish ministry coordinated the development of the Future of Migration 2020 Strategy?
0.6571 As part of the Government Programme, the development of a comprehensive Future of Migration 2020 Strategy started in 2012. The objective of the Strategy is to design a policy which supports the building of an unprejudiced, safe and pluralistic Finland as well as enhance Finland’s international competitiveness. The Strategy was developed under the coordination of the Ministry of Interior together with over 40 represented stakeholders and it was adopted in June 2013. Also, the Ministry of the Interior set up a project to improve the effectiveness of the administration of immigration affairs which will run until December 2014. The concept of improving the effectiveness of the administration of immigration affairs was supported by two projects related to centres of expertise: the first, led by the Ministry of Employment and the Economy, looked into the establishment of a centre of expertise 

We see that the queries containing the exact text in the document render the right extraction (1st query), but the 2nd query, testing semantic understanding, does not render the right result.

Here are the following ideas to improve the similarity search:
- Choose a different embedding model.
- Split text into smaller chunks.
- Use metadata.
- Change distance function.

Before trying any of this, let's first add all the cells about Finland to our vector database.

In [20]:
client.delete_collection("emn_finland_2012")

# Indexing: Finland
As mentioned above, we can index all the cells regarding Finland as a baseline and then from there try to implement either four options cited above.

In [21]:
collection = client.create_collection(name="emn_finland")

In [22]:
df_finland = df[df['Country']=='Finland']
df_finland

Unnamed: 0,Country,Year,Title,Subtitle,Content,Grouped_Title
0,Finland,2012,LEGAL MIGRATION AND MOBILITY,Main,"As part of the Government Programme, the devel...",LEGAL MIGRATION AND MOBILITY
1,Finland,2012,LEGAL MIGRATION AND MOBILITY,PROMOTING LEGAL MIGRATION CHANNELS,As part of the Action Plan on labour migration...,LEGAL MIGRATION AND MOBILITY
2,Finland,2012,LEGAL MIGRATION AND MOBILITY,ECONOMIC MIGRATION,The improvement of the labour market position ...,LEGAL MIGRATION AND MOBILITY
3,Finland,2012,LEGAL MIGRATION AND MOBILITY,FAMILY REUNIFICATION,A comparative study on family reunification in...,LEGAL MIGRATION AND MOBILITY
4,Finland,2012,INTEGRATION,Main,"In September 2012, the Government Integration ...",INTEGRATION
...,...,...,...,...,...,...
4204,Finland,2023,"BORDERS, VISA AND SCHENGEN",Main,The instrumentalised migration from Russia to ...,BORDERS AND VISAS
4205,Finland,2023,IRREGULAR MIGRATION,Main,No significant developments to report in 2023.,IRREGULAR MIGRATION
4206,Finland,2023,TRAFFICKING IN HUMAN BEINGS,Main,Legislative amendments entered into force in J...,TRAFFICKING IN HUMAN BEINGS
4207,Finland,2023,RETURN AND READMISSION,Main,The government programme outlines a variety of...,RETURN AND READMISSION


In [23]:
# extract IDs (as in the original .csv -> not consecutive)
ids_finland = df_finland.index.tolist()
ids_finland_str = [str(i) for i in df_finland.index.tolist()]
# extract the text to embed
docus_finland = [df_finland['Content'][i] for i in ids_finland]

In [24]:
# add to our collection
collection.add(
    ids=ids_finland_str,
    documents=docus_finland
)
print(collection.get()['documents'][3])  


A comparative study on family reunification in Nordic countries conducted by the Ministry of Interior was published in April 2012 with the aim to analyse and compare legislative provisions related to 2 residence permits issued on the ground of family ties and to formulate a proposal for amendments to the Aliens Act. The study recommended raising the threshold requirements of sufficient income to beneficiaries of humanitarian protection as well as introducing requirements for adequate accommodation. 


### Queries
Now that the sub-dataset has been stored in the vector database, we can generate a few queries to see how well Chroma is able to retrieve the relevant information.

In [25]:
q_fin_1 = "Which Finnish ministry coordinated the development of the Future of Migration 2020 Strategy?" # row 2 in .csv
q_fin_2 = "What initiatives were launched in Finland to recruit foreign nurses, and during which years did they take place?" # row 4 in .csv
q_fin_3 = "What measures has Finland taken to address climate change in Arctic regions?" # not mentioned
q_fin_4 = "From 2015 on, how many asylum seekers has Finland commited to relocate?" # row 961 in .csv, 3,200 out of 160,000


In [26]:
queries = [q_fin_1, q_fin_2, q_fin_3, q_fin_4]

results = collection.query(
    query_texts=queries, 
    include=["documents", "distances", "embeddings"],
    n_results= len(ids_finland)
    
)

In [27]:
# Print top 5 distances and original text
print(np.shape(results['distances']))
print(results['distances'])

(4, 162)
[[0.5436655282974243, 0.5515410304069519, 0.5969497561454773, 0.6020833253860474, 0.6238492727279663, 0.6374987363815308, 0.6375800371170044, 0.6452267169952393, 0.6570550799369812, 0.7069797515869141, 0.7183794379234314, 0.7264158129692078, 0.7378663420677185, 0.7444997429847717, 0.7488411664962769, 0.7659775018692017, 0.7707520127296448, 0.7767466306686401, 0.7886557579040527, 0.7924311757087708, 0.802104651927948, 0.8148541450500488, 0.8218898773193359, 0.8362795114517212, 0.8373836278915405, 0.8382729887962341, 0.8483997583389282, 0.8492474555969238, 0.849816083908081, 0.8502651453018188, 0.8502699732780457, 0.85204017162323, 0.8520480394363403, 0.8554637432098389, 0.8560668230056763, 0.8600274920463562, 0.8617389798164368, 0.8723912239074707, 0.875162661075592, 0.8758412003517151, 0.8800768852233887, 0.8800935745239258, 0.8919601440429688, 0.9060940742492676, 0.9076313972473145, 0.9176934361457825, 0.9266870021820068, 0.9408370852470398, 0.94107985496521, 0.94116443395614

In [28]:
# print top 5 retrieved documents for each query
for q_id, query in enumerate(queries):
    print(f"\nQuery {q_id+1}: {queries[q_id]}")
    
    # get distances, original text, and ID for each query
    dists = np.array(results['distances'][q_id])
    docs = np.array(results['documents'][q_id])
    ids = np.array(results['ids'][q_id])
    
    top_k = 5
    
    # print 5 best results (smallest distance), results are already from closest to farthest
    for i in range(top_k):
        print(f"  Row in csv: {int(ids[i])+2} | Distance: {dists[i]:.4f} | Doc: {docs[i][:60]}...")
        


Query 1: Which Finnish ministry coordinated the development of the Future of Migration 2020 Strategy?
  Row in csv: 2345 | Distance: 0.5437 | Doc: The 2019 programme for government emphasised the connection ...
  Row in csv: 4200 | Distance: 0.5515 | Doc: A new government was formed in Finland after the parliamenta...
  Row in csv: 4199 | Distance: 0.5969 | Doc: A new government took office in 2023 and its programme seeks...
  Row in csv: 963 | Distance: 0.6021 | Doc: In 2015, the Finnish Government approved the Government migr...
  Row in csv: 11 | Distance: 0.6238 | Doc: The relationship between migration and development policy is...

Query 2: What initiatives were launched in Finland to recruit foreign nurses, and during which years did they take place?
  Row in csv: 4 | Distance: 0.6236 | Doc: The improvement of the labour market position of immigrants ...
  Row in csv: 464 | Distance: 0.7691 | Doc: The Development Policy Programme of 2012 identifies migratio...
  Row in csv: 967 

In [29]:
print(results['distances'][0])
print(results['documents'][0])
print(results['ids'][0])

[0.5436655282974243, 0.5515410304069519, 0.5969497561454773, 0.6020833253860474, 0.6238492727279663, 0.6374987363815308, 0.6375800371170044, 0.6452267169952393, 0.6570550799369812, 0.7069797515869141, 0.7183794379234314, 0.7264158129692078, 0.7378663420677185, 0.7444997429847717, 0.7488411664962769, 0.7659775018692017, 0.7707520127296448, 0.7767466306686401, 0.7886557579040527, 0.7924311757087708, 0.802104651927948, 0.8148541450500488, 0.8218898773193359, 0.8362795114517212, 0.8373836278915405, 0.8382729887962341, 0.8483997583389282, 0.8492474555969238, 0.849816083908081, 0.8502651453018188, 0.8502699732780457, 0.85204017162323, 0.8520480394363403, 0.8554637432098389, 0.8560668230056763, 0.8600274920463562, 0.8617389798164368, 0.8723912239074707, 0.875162661075592, 0.8758412003517151, 0.8800768852233887, 0.8800935745239258, 0.8919601440429688, 0.9060940742492676, 0.9076313972473145, 0.9176934361457825, 0.9266870021820068, 0.9408370852470398, 0.94107985496521, 0.9411644339561462, 0.9421