# EXERCISE 1: Parsing PDFs

The goal of the exercise is to download the PDFs listed in the Excel file `url2parse.xlsx` and create a table (CSV file) with their content.

For each document, the text should be extracted and split into paragraphs.
Each row of the table should contain the following attributes:


```
{'title': title of the PDF (from the Excel file),
'date': publication date (from the Excel file),
'paragraphId': a unique id for each paragraph; it can be generated any way you like (no restriction) but must be of type 'str',
'text': the parsed text}
```

Here is an example of the rows we expect:


| title | date | paragraphId | text |
|:--------:|:--------:|:--------:|:--------:|
|  Attention Is All You Need   |  12/06/2017   |  4ee373ed-32a9-499f-8aaa-ae4f6e79d57e   |  The dominant sequence transduction models are based on complex recurrent...    |
|  Attention Is All You Need   |  12/06/2017   |  86389454-6036-4ef1-b7b1-2e57fc3d53f7   |  Recurrent neural networks, long short-term memory...   |
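
As a minimal sketch of the row format, the snippet below builds one row per paragraph and writes them as CSV using only the standard library. The helper name `make_rows` and the sample paragraphs are hypothetical, and `uuid.uuid4()` is just one way to produce the id, since the exercise places no restriction on it.

```python
import csv
import io
import uuid

FIELDS = ["title", "date", "paragraphId", "text"]

def make_rows(title, date, paragraphs):
    """Build one row (dict) per paragraph; make_rows is a hypothetical helper."""
    return [
        {"title": title, "date": date, "paragraphId": str(uuid.uuid4()), "text": p}
        for p in paragraphs
    ]

rows = make_rows(
    "Attention Is All You Need",
    "12/06/2017",
    ["First paragraph.", "Second paragraph."],  # placeholder text
)

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```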


You can use the **library** that you are most comfortable with. There are no restrictions.

## Step 1: Convert PDFs into raw text

In [1]:
import pandas as pd
import requests
from io import BytesIO
import PyPDF2
import uuid

# Read URLs and metadata
def load_data(excel_path):
    return pd.read_excel(excel_path)

# Download PDF and extract text
def download_pdf(url):
    response = requests.get(url)
    response.raise_for_status()
    return BytesIO(response.content)

def extract_text_from_pdf(file_stream):
    reader = PyPDF2.PdfReader(file_stream)
    # Extract each page once and skip pages that yield no text
    pages = (page.extract_text() for page in reader.pages)
    text = [page_text for page_text in pages if page_text]
    return "\n".join(text).strip()

## Step 2: Split raw text into paragraphs

In [2]:
# Split raw text into paragraphs
def split_into_paragraphs(text):
    paragraphs = text.split('\n')
    return [para.strip() for para in paragraphs if para.strip()]
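
Splitting on every `'\n'` yields one entry per extracted line, so a paragraph that wraps over several lines ends up as several rows. A possible alternative, sketched below, splits on blank lines and joins the wrapped lines inside each block. The function name `split_on_blank_lines` is hypothetical, and the approach assumes the extracted text actually keeps blank lines between paragraphs, which depends on the PDF and on the extraction library and should be checked on real output.

```python
import re

def split_on_blank_lines(text):
    """Split text into paragraphs at blank lines and unwrap the lines inside each one."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Join the wrapped lines of one paragraph into a single line.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs

sample = "First paragraph,\nwrapped on two lines.\n\nSecond paragraph.\n\n\n"
paragraphs = split_on_blank_lines(sample)
print(paragraphs)
```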

## Step 3: Create the final table with all the metadata

In [3]:
# Create DataFrame with metadata
def create_data_frame(data, paragraphs):
    return pd.DataFrame({
        'title': data['title'],
        'date': data['date'],
        'paragraphId': [str(uuid.uuid4()) for _ in range(len(paragraphs))],
        'text': paragraphs
    })

## Step 4: Save the table in a csv file

In [4]:
# Main function to process the PDFs
data = load_data('url2parse.xlsx')
all_data = []

for _, row in data.iterrows():
    file_stream = download_pdf(row['url'])
    text = extract_text_from_pdf(file_stream)
    paragraphs = split_into_paragraphs(text)
    df = create_data_frame(row, paragraphs)
    all_data.append(df)

# Concatenate all data frames (once, after the loop has finished)
final_df = pd.concat(all_data, ignore_index=True)

# Save to CSV
final_df.to_csv('output.csv', index=False)

In [5]:
final_df

Unnamed: 0,title,date,paragraphId,text
0,Attention Is All You Need,2017-06-12,1e460845-beb6-47fb-805c-59bf66557b2b,"Provided proper attribution is provided, Googl..."
1,Attention Is All You Need,2017-06-12,467bc396-da9a-4a06-99f7-6256078b8cb8,reproduce the tables and figures in this paper...
2,Attention Is All You Need,2017-06-12,b1ac7576-14fa-4821-aa86-679dd4f84fe7,scholarly works.
3,Attention Is All You Need,2017-06-12,a0a4f389-c0fa-4e68-a133-37cacc8949ee,Attention Is All You Need
4,Attention Is All You Need,2017-06-12,df95a974-900c-4de7-b923-61e3ca35990e,Ashish Vaswani∗
...,...,...,...,...
5805,LLaMA: Open and Efficient Foundation Language ...,2023-02-27,fd917a59-4397-4361-a708-1eb553b2d68c,MemTotal: 164928 kB
5806,LLaMA: Open and Efficient Foundation Language ...,2023-02-27,f91af309-62a3-4773-bad0-6f32387db8a8,MemFree: 140604 kB
5807,LLaMA: Open and Efficient Foundation Language ...,2023-02-27,f946446f-4336-4ebb-82d2-c321539e40e4,Buffers: 48 kB
5808,LLaMA: Open and Efficient Foundation Language ...,2023-02-27,6528a310-b536-48f1-8461-92284b02e110,Cached: 19768 kB


# EXERCISE 2: Topic extraction

The goal of the task is to extract topics from a database containing articles about Covid.

Given a query, you need to score every paragraph with a value between 0 and 1:

*   0 = the document is not related to the query
*   1 = the document is highly related to the query


Then, you should select the top-200 paragraphs and compute a score to evaluate the performance of the model.


For this exercise, you are restricted to the following libraries:
**numpy**, **pandas**, **torch**, **transformers**, **sentence-transformers**, **scikit-learn**.

## Step 0: Load the database and the queries

In [6]:
%pip install --quiet numpy pandas torch datasets transformers sentence-transformers scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [7]:
from datasets import load_dataset

docs = load_dataset('BeIR/trec-covid', 'corpus')
queries = load_dataset('BeIR/trec-covid', 'queries')
qrels = load_dataset('BeIR/trec-covid-qrels')

In [8]:
docs['corpus'][0]

{'_id': 'ug7v899j',
 'title': 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia',
 'text': 'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (

In [9]:
queries['queries'][0]

{'_id': '1', 'title': '', 'text': 'what is the origin of COVID-19'}

In [10]:
qrels['test'][0]

{'query-id': 1, 'corpus-id': '005b2j4b', 'score': 2}

## Step 1: Filter database with a subset of documents

The database is too big. We will only work with a subset of documents. Do not change the code here.

In [11]:
queries_id = [6, 7, 20, 21, 26, 28, 36, 38, 39, 45]

In [12]:
filtered_queries = queries['queries'].filter(lambda x: int(x['_id']) in queries_id)

In [13]:
corpus_id =qrels['test'].filter(lambda x: x['query-id'] in queries_id)['corpus-id']
filtered_docs = docs['corpus'].filter(lambda x: x['_id'] in corpus_id)

## Step 2: Compute scores

For each query in `filtered_queries`, you need to compute scores for all the documents in `filtered_docs` database.

The score should be normalized between 0 and 1.

In [14]:
from sentence_transformers import SentenceTransformer
import torch

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute embeddings and cosine similarities
def compute_similarity_scores(queries, docs):

    query_embeddings = model.encode([q['text'] for q in queries], convert_to_tensor=True)
    doc_embeddings = model.encode([d['text'] for d in docs], convert_to_tensor=True)
    
    # Cosine similarity for all (query, document) pairs at once:
    # L2-normalize both sets of embeddings, then take a single matrix product.
    query_norm = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
    doc_norm = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1)
    similarity_scores = (query_norm @ doc_norm.T).cpu()

    return similarity_scores

In [15]:
# Compute scores
scores = compute_similarity_scores(filtered_queries, filtered_docs)
scores

tensor([[ 3.6140e-03,  1.5683e-02,  1.9560e-02,  ...,  3.5394e-01,
          5.2043e-01,  1.9560e-02],
        [-1.9257e-02,  7.2602e-02,  2.9699e-02,  ...,  3.8416e-01,
          5.6438e-01,  2.9699e-02],
        [-2.2836e-04, -1.1545e-02, -1.3360e-02,  ...,  6.5444e-01,
          3.3391e-01, -1.3360e-02],
        ...,
        [ 1.1536e-01,  1.7593e-01,  2.9238e-02,  ...,  4.4745e-01,
          4.9671e-01,  2.9238e-02],
        [ 1.3107e-01,  1.9313e-01, -1.6635e-02,  ...,  4.5137e-01,
          4.7010e-01, -1.6635e-02],
        [ 6.8332e-03,  3.5327e-02,  7.2967e-02,  ...,  3.0039e-01,
          3.5457e-01,  7.2967e-02]])

In [16]:
scores.size()

torch.Size([10, 11122])

In [17]:
def normalize_scores(scores):
    min_val = scores.min()
    max_val = scores.max()
    # Normalizing the scores to be between 0 and 1
    normalized_scores = (scores - min_val) / (max_val - min_val)
    return normalized_scores

# Apply normalization
normalized_scores = normalize_scores(scores)
normalized_scores

tensor([[0.2143, 0.2255, 0.2290,  ..., 0.5369, 0.6902, 0.2290],
        [0.1933, 0.2779, 0.2384,  ..., 0.5648, 0.7307, 0.2384],
        [0.2108, 0.2004, 0.1987,  ..., 0.8136, 0.5185, 0.1987],
        ...,
        [0.3172, 0.3730, 0.2379,  ..., 0.6230, 0.6684, 0.2379],
        [0.3317, 0.3888, 0.1957,  ..., 0.6266, 0.6439, 0.1957],
        [0.2173, 0.2435, 0.2782,  ..., 0.4876, 0.5375, 0.2782]])
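
The min-max normalization above uses the global minimum and maximum, so a given score depends on which queries and documents happen to be in the batch. Two possible alternatives are sketched below with NumPy on a toy matrix: a fixed affine map from the cosine range [-1, 1] to [0, 1], and a min-max applied per query. The function names and the toy values are hypothetical. All of these transforms are monotonically increasing within a row, so the top-200 ranking per query is the same whichever one is used.

```python
import numpy as np

def rescale_cosine(scores):
    """Map cosine similarities from [-1, 1] to [0, 1] with a fixed affine transform."""
    scores = np.asarray(scores, dtype=float)
    return np.clip((scores + 1.0) / 2.0, 0.0, 1.0)

def min_max_per_query(scores):
    """Min-max normalize each row (one row per query) independently."""
    scores = np.asarray(scores, dtype=float)
    row_min = scores.min(axis=1, keepdims=True)
    row_max = scores.max(axis=1, keepdims=True)
    # Rows with identical scores would divide by zero; use a span of 1 for them.
    span = np.where(row_max > row_min, row_max - row_min, 1.0)
    return (scores - row_min) / span

toy = np.array([[-1.0, 0.0, 1.0], [0.2, 0.2, 0.2]])  # hypothetical similarities
print(rescale_cosine(toy))
print(min_max_per_query(toy))
```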

## Step 3: Select top-200

For each query in `filtered_queries`, select the top 200 most relevant documents in the `filtered_docs` database.

In [18]:
def select_top_200(scores, docs):
    top_docs_per_query = []
    for score in scores:
        # Get the top 200 indices
        top_indices = torch.topk(score, 200).indices.tolist()
        
        # Use these indices to select documents
        top_docs = [docs[i] for i in top_indices]
        top_docs_per_query.append(top_docs)
    
    return top_docs_per_query

In [19]:
# Select top 200 docs
top_200_docs = select_top_200(normalized_scores, filtered_docs)
top_200_docs[0]

[{'_id': 'vn0xu1cn',
  'title': 'Do we know the diagnostic properties of the tests used in COVID-19? A rapid review of recently published literature.',
  'text': 'COVID-19 has brought death and disease to large parts of the world. Governments must deploy strategies to screen the population and subsequently isolate the suspect cases. Diagnostic testing is critical for epidemiological surveillance, but the accuracy (sensitivity and specificity) and clinical utility (impact on health outcomes) of the current diagnostic methods used for SARS-CoV-2 detection are not known. I ran a quick search in PubMed/MEDLINE to find studies on laboratory diagnostic tests and rapid viral diagnosis. After running the search strategies, I found 47 eligible articles that I discuss in this review, commenting on test characteristics and limitations. I did not find any papers that report on the clinical utility of the tests currently used for COVID-19 detection, meaning that we are fighting a battle without pro

In [20]:
print(len(top_200_docs))
print(len(top_200_docs[0]))

10
200
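
`select_top_200` returns the selected documents but not their scores. If the scores are needed later (for inspection or thresholding), one option is to keep `(doc_id, score)` pairs, as in this sketch. The function name `top_k_with_scores` and the toy inputs are hypothetical, and NumPy is used here instead of `torch.topk` only to keep the example small.

```python
import numpy as np

def top_k_with_scores(scores, doc_ids, k):
    """Return, for each query row, the k best (doc_id, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.shape[1])  # guard against corpora smaller than k
    results = []
    for row in scores:
        # Indices of the row sorted by descending score, truncated to k.
        order = np.argsort(-row, kind="stable")[:k]
        results.append([(doc_ids[i], float(row[i])) for i in order])
    return results

toy_scores = np.array([[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]])  # 2 queries x 3 docs
toy_ids = ["a", "b", "c"]
ranked = top_k_with_scores(toy_scores, toy_ids, k=2)
print(ranked)
```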


## Step 4: Evaluate the performance of the approach

For each query, you need to compute Discounted Cumulative Gain ([DCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)) of your top-200 documents.

You will use this formula:

$$DCG_{200} = \sum_{i=1}^{200} \frac{rel_i}{\log_2(i+1)}$$

To do so, you can find the true relevance score $rel_i$ of each document in the `qrels` dataset.
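
As a small worked example with hypothetical relevance values (not taken from `qrels`), a ranking of three documents with $rel = (2, 0, 1)$ gives:

$$DCG_3 = \frac{2}{\log_2 2} + \frac{0}{\log_2 3} + \frac{1}{\log_2 4} = 2 + 0 + 0.5 = 2.5$$

Swapping the first and last documents, $rel = (1, 0, 2)$, gives $1 + 0 + 1 = 2$: the same documents score lower when the most relevant one is ranked later.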

In [21]:
import numpy as np

# Compute the DCG for a ranked list of relevance scores
def compute_dcg(relevance_scores):
    relevance_scores = np.array(relevance_scores)
    n = len(relevance_scores)
    if n == 0:
        return 0
    discounts = np.log2(np.arange(2, n + 2))  # log2(i + 1) for ranks i = 1..n; starts at 2 because log2(1) = 0
    return np.sum(relevance_scores / discounts)

In [22]:
qrels_dict = {(item['query-id'], item['corpus-id']): item['score'] for item in qrels['test']}

In [23]:
# Compute the DCG for each query.
# top_200_docs follows the row order of filtered_queries, assumed here to match queries_id.
dcg_scores = []
for query_id, top_docs in zip(queries_id, top_200_docs):
    # Look up the true relevance of each retrieved document (0 if it was not judged)
    relevance_scores = [qrels_dict.get((query_id, doc['_id']), 0) for doc in top_docs]
    # Reuse compute_dcg rather than repeating the discount computation
    dcg_scores.append(compute_dcg(relevance_scores))

dcg_scores

[52.12833614148015,
 47.88755423350702,
 65.55953246079419,
 56.95627109709791,
 52.17888118900875,
 60.399993173520386,
 52.30716686180492,
 47.57748913412243,
 49.87357383120355,
 48.459369711610044]
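
Raw DCG values are hard to compare across queries because the best achievable value depends on how many relevant documents each query has. One common way to put them in context, not required by the exercise, is to divide by the DCG of the ideal ordering (normalized DCG). The sketch below assumes non-negative relevance labels; the function names and toy values are hypothetical.

```python
import numpy as np

def dcg(relevances):
    """DCG with the same discount as the formula above: rel_i / log2(i + 1)."""
    relevances = np.asarray(relevances, dtype=float)
    if relevances.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return float(np.sum(relevances / discounts))

def ndcg(retrieved_relevances, all_judged_relevances, k):
    """DCG of the retrieved list divided by the DCG of the best possible top-k."""
    ideal = sorted(all_judged_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg(retrieved_relevances[:k]) / ideal_dcg

retrieved = [1, 0, 2]   # relevance of the retrieved documents, in ranked order
judged = [2, 1, 0, 0]   # every relevance label known for the query
print(dcg(retrieved), ndcg(retrieved, judged, k=3))
```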