# Final Notebook

This notebook is your search engine. 

To test your work, we will run each cell, so your code must fit the expected structure.



## Initialisation

- Install the libraries (if you use Colab and they are needed),
- Import the modules,
- Declare the global variables.


In [27]:
!pip install nltk
!python -m textblob.download_corpora
!pip install beautifulsoup4
!pip install sentence-transformers --quiet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [28]:
import pandas as pd
import os
from bs4 import BeautifulSoup
import re
import math
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [29]:
# Only if you use Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
DATAPATH = 'drive/MyDrive/EI_web_data/Data'

### Save and load data

In [31]:
# Save and load your data in pickle format

def save_data(savepath, file_name, obj):
    with open(os.path.join(savepath, file_name), 'wb') as file:
        pickle.dump(obj, file)

def load_data(savepath, file_name):
    with open(os.path.join(savepath, file_name), 'rb') as file:
        return pickle.load(file)

## Extracting the data

In [32]:
def remove_tags(text:str)->str:
  '''Strip the HTML tags from the text.'''
  soup = BeautifulSoup(text, 'html.parser')

  # Remove every script and style element
  for script in soup(['script', 'style']):
    script.extract()

  # Get the clean text without tags
  texte_propre = soup.get_text()

  # Collapse extra spaces and line breaks
  texte_propre = re.sub(r'\s+', ' ', texte_propre)

  return texte_propre

def extract_tokens(text:str)->list:
  '''Return all the words of the text, lowercased.'''
  tokens=re.findall(r'\w+', text)
  return [token.lower() for token in tokens]

def lemmatize(tokens:list)->list:
  '''Lemmatize every word in the list.'''
  wnl = WordNetLemmatizer()
  return [wnl.lemmatize(token) for token in tokens]

def remove_stopwords(word_list):
  '''Remove every English stopword from the list.'''
  # Build the set once: calling stopwords.words() for each word is very slow
  english_stopwords = set(stopwords.words('english'))
  return [word for word in word_list if word not in english_stopwords]


def extract_data(datapath):
    df=pd.read_xml(os.path.join(datapath, 'Posts.xml'), parser="etree", encoding="utf8")
    df['CleanBody'] = df['Body'].fillna('').apply(remove_tags)
    df['Tokens'] = df['CleanBody'].apply(extract_tokens)
    df['Words']= df['Tokens'].apply(lemmatize)
    df['MeaningfullWords'] = df['Words'].apply(remove_stopwords)

    return df
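
As a quick illustration of the tokenization step, the sketch below re-declares `extract_tokens` on its own (standard library only) and runs it on a made-up sentence. It shows that `\w+` splits contractions and hyphenated words into separate tokens, which is why lists such as `[i, ve, always, ...]` appear in the `Tokens` column.

```python
import re

def extract_tokens(text: str) -> list:
    """Return every run of word characters in the text, lowercased."""
    return [token.lower() for token in re.findall(r'\w+', text)]

# Made-up sample sentence, for illustration only
sample = "I've always been interested in Machine-Learning!"
tokens = extract_tokens(sample)
print(tokens)
```

If contractions should stay whole, a different pattern or a dedicated tokenizer would be needed; that choice is left open here.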

The first time you run the code, uncomment the first two lines to extract the data and save it (this may take a few minutes). On later runs, you can simply load the DataFrame.

In [33]:
#df = extract_data(datapath=DATAPATH) 
#save_data(DATAPATH, 'df.pkl', df)
df=load_data(DATAPATH, 'df.pkl')

## Indexing the data

In [34]:
def index_data(df:pd.DataFrame)-> dict:
  ''' Return an inverted index of the form
  dic={word:{doc_id:frequency of the word in that document}}'''

  dic={}
  for rang in df.index:
    words = df.loc[rang, 'Words']
    doc_id = df.loc[rang,'Id']
    for word in words:
      if word in dic: # the word has already been seen in the corpus
        if doc_id in dic[word]: # the word has already been seen in this document
          dic[word][doc_id]+=1
        else:
          dic[word][doc_id]=1 # first occurrence of the word in this document
      else: # first occurrence of the word in the corpus
        dic[word]={doc_id: 1}

  return dic
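
To make the shape of the inverted index concrete, here is a minimal sketch on a made-up two-document corpus. `build_inverted_index` and `toy_docs` are hypothetical names used only for this illustration; the function works on a plain dictionary instead of the DataFrame but produces the same `{word: {doc_id: frequency}}` layout.

```python
def build_inverted_index(docs: dict) -> dict:
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = {}
    for doc_id, words in docs.items():
        for word in words:
            postings = index.setdefault(word, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index

# Hypothetical toy corpus: document id -> list of lemmatized words
toy_docs = {
    5: ["gradient", "descent", "gradient"],
    7: ["gradient", "boosting"],
}
toy_index = build_inverted_index(toy_docs)
print(toy_index)
```

The number of documents containing a word is then simply `len(toy_index[word])`, which is the quantity BM25 needs for its IDF term.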

In [35]:
#inverted_index=index_data(df)
#save_data(DATAPATH, 'inverted_index.pkl', inverted_index)
inverted_index=load_data(DATAPATH, 'inverted_index.pkl')

## Search Method

In [37]:
MODEL_ST = SentenceTransformer('all-MiniLM-L6-v2')
#embeddings = MODEL_ST.encode(df.CleanBody.values, normalize_embeddings=True)
#save_data(DATAPATH, 'embeddings.pkl', embeddings)
embeddings=load_data(DATAPATH, 'embeddings.pkl')

In [56]:
def similarity_matrix(query, df=df, embeddings=embeddings, MODEL_ST=MODEL_ST):
  encoded_query = MODEL_ST.encode([query], normalize_embeddings=True)
  matrix=cosine_similarity(encoded_query, embeddings)
  return matrix


def BM25(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75):
  '''Return the list of BM25 scores of the query, one per row of the DataFrame.'''
  # Extract the query terms (same preprocessing as the indexed 'Words' column)
  processed_query=lemmatize(extract_tokens(query))

  # Document lengths, in number of words
  lengths=df['Words'].apply(len)

  N=len(df) # number of documents in the collection
  avgdl=lengths.mean() # average document length

  # Compute the score of each document
  scores=[]
  for rang in df.index:
    doc_id=df.loc[rang,'Id']
    length=lengths.loc[rang]
    s=0

    for terme in processed_query:
      # A term absent from the corpus contributes nothing to the score
      if terme in inverted_index:
        n=len(inverted_index[terme]) # number of documents containing the term

        # Note: this IDF is negative for terms found in more than half of the documents
        IDF=math.log((N-n+0.5)/(n+0.5))
        freq=inverted_index[terme].get(doc_id, 0)
        s+=IDF*freq*(k1+1)/(freq+k1*(1-b+b*length/avgdl))

    scores.append(s)

  return scores


def scored_df(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75, embeddings=embeddings, MODEL_ST=MODEL_ST):
  ''' Return a copy of the DataFrame with the added columns 'ScoreBM25' and 'CosineSimilarity' '''
  df_copy=df.copy()
  df_copy['ScoreBM25']=BM25(query, df, inverted_index, k1, b)
  df_copy['CosineSimilarity'] = similarity_matrix(query, df, embeddings, MODEL_ST)[0]
  return df_copy

# scored_df returns a DataFrame with two extra columns holding the BM25 score and the cosine similarity

def search(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75, embeddings=embeddings, MODEL_ST=MODEL_ST):
  
  return
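
To make the BM25 formula concrete, here is a minimal, self-contained sketch on a made-up four-document corpus. `bm25_scores` and `toy_docs` are hypothetical names, and document frequencies are recomputed from the token lists rather than read from `inverted_index`. The example also illustrates one property of this IDF variant: a term present in more than half of the documents gets a negative weight, which explains negative `ScoreBM25` values for queries that contain very common words.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    term_freqs = [Counter(doc) for doc in docs]
    scores = []
    for doc, tf in zip(docs, term_freqs):
        score = 0.0
        for term in query_terms:
            n = sum(1 for counts in term_freqs if term in counts)
            if n == 0:
                continue  # term absent from the corpus: no contribution
            idf = math.log((n_docs - n + 0.5) / (n + 0.5))
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

# Made-up corpus, for illustration only
toy_docs = [
    ["the", "gradient", "descent", "step"],
    ["the", "gradient", "boosting", "model"],
    ["the", "random", "forest"],
    ["decision", "tree"],
]
rare = bm25_scores(["descent"], toy_docs)   # term found in 1 document out of 4
common = bm25_scores(["the"], toy_docs)     # term found in 3 documents out of 4
print(rare)
print(common)
```

Only the first document gets a positive score for "descent", while every document containing "the" is penalized. Removing stopwords from the query, or clamping the IDF at zero, are two possible ways to avoid that penalty.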

In [57]:
query='what is stochastic gradient descent ?'
scored_df(query)

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,OwnerDisplayName,CommunityOwnedDate,LastEditorDisplayName,FavoriteCount,CleanBody,Tokens,Words,MeaningfullWords,ScoreBM25,CosineSimilarity
0,5,1,2014-05-13T23:58:30.457,9,898.0,<p>I've always been interested in machine lear...,5.0,2014-05-14T00:36:31.077,How can I do simple machine learning without h...,<machine-learning>,...,,,,,I've always been interested in machine learnin...,"[i, ve, always, been, interested, in, machine,...","[i, ve, always, been, interested, in, machine,...","[always, interested, machine, learning, figure...",0.000000,0.200056
1,7,1,2014-05-14T00:11:06.457,4,478.0,"<p>As a researcher and instructor, I'm looking...",36.0,2014-05-16T13:45:00.237,What open-source books (or other materials) pr...,<education><open-source>,...,,,,,"As a researcher and instructor, I'm looking fo...","[as, a, researcher, and, instructor, i, m, loo...","[a, a, researcher, and, instructor, i, m, look...","[researcher, instructor, looking, open, source...",0.000000,0.073826
2,9,2,2014-05-14T00:36:31.077,5,,"<p>Not sure if this fits the scope of this SE,...",51.0,2014-05-14T00:36:31.077,,,...,,,,,"Not sure if this fits the scope of this SE, bu...","[not, sure, if, this, fits, the, scope, of, th...","[not, sure, if, this, fit, the, scope, of, thi...","[sure, fit, scope, se, stab, answer, anyway, a...",-0.851537,0.193558
3,10,2,2014-05-14T00:53:43.273,13,,"<p>One book that's freely available is ""The El...",22.0,2014-05-14T00:53:43.273,,,...,,,,,"One book that's freely available is ""The Eleme...","[one, book, that, s, freely, available, is, th...","[one, book, that, s, freely, available, is, th...","[one, book, freely, available, element, statis...",-2.265948,0.331188
4,14,1,2014-05-14T01:25:59.677,26,1901.0,<p>I am sure data science as will be discussed...,66.0,2020-08-16T13:01:33.543,Is Data Science the Same as Data Mining?,<data-mining><definitions>,...,,,,,I am sure data science as will be discussed in...,"[i, am, sure, data, science, as, will, be, dis...","[i, am, sure, data, science, a, will, be, disc...","[sure, data, science, discussed, forum, ha, se...",-0.668133,0.087813
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75722,119962,1,2023-03-04T20:06:06.820,0,8.0,<p>I am implementing a neural network of arbit...,147597.0,2023-03-04T20:22:12.523,Back Propagation on arbitrary depth network wi...,<neural-network><backpropagation>,...,,,,,I am implementing a neural network of arbitrar...,"[i, am, implementing, a, neural, network, of, ...","[i, am, implementing, a, neural, network, of, ...","[implementing, neural, network, arbitrary, dep...",2.314880,0.230572
75723,119963,1,2023-03-04T20:12:19.677,0,10.0,<p>I am using KNN for a regression task</p>\n<...,147598.0,2023-03-04T20:12:19.677,Evaluation parameter in knn,<regression><k-nn>,...,,,,,I am using KNN for a regression task It's like...,"[i, am, using, knn, for, a, regression, task, ...","[i, am, using, knn, for, a, regression, task, ...","[using, knn, regression, task, like, 1, normal...",-2.196023,-0.013185
75724,119964,1,2023-03-05T00:14:12.597,0,7.0,<p>I have developed a small encoding algorithm...,44581.0,2023-03-05T00:14:12.597,Can I use zero-padded input and output layers ...,<deep-learning><convolutional-neural-network>,...,,,,,I have developed a small encoding algorithm th...,"[i, have, developed, a, small, encoding, algor...","[i, have, developed, a, small, encoding, algor...","[developed, small, encoding, algorithm, accept...",-1.874446,0.124191
75725,119965,1,2023-03-05T00:43:12.213,0,5.0,"<p>To my understanding, optimizing a model wit...",84437.0,2023-03-05T00:43:12.213,Why does cross validation and hyperparameter t...,<cross-validation><hyperparameter-tuning>,...,,,,,"To my understanding, optimizing a model with k...","[to, my, understanding, optimizing, a, model, ...","[to, my, understanding, optimizing, a, model, ...","[understanding, optimizing, model, k, fold, cr...",-2.368447,0.106993
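
The two score columns live on different scales: BM25 is unbounded and can be negative, while cosine similarity lies in [-1, 1]. One possible way to merge them, not necessarily the one this notebook will adopt, is to rescale each column to [0, 1] and take a weighted sum. In the sketch below, `min_max`, `combine_scores` and the weight `alpha` are hypothetical names, and the scores are made up.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def combine_scores(bm25, cosine, alpha=0.5):
    """Weighted sum of the two rescaled score lists (alpha weights BM25)."""
    bm25_n, cosine_n = min_max(bm25), min_max(cosine)
    return [alpha * x + (1 - alpha) * y for x, y in zip(bm25_n, cosine_n)]

# Made-up scores for three documents, for illustration only
bm25 = [0.0, 2.0, -2.0]
cosine = [0.0, 0.5, 0.25]
combined = combine_scores(bm25, cosine, alpha=0.5)
print(combined)
```

The value of `alpha` would have to be tuned on real queries; 0.5 is only a placeholder.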


## Ranking

In [None]:
def rank_search(results, top=5):
    # TODO

    return
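
The `rank_search` stub is still to be written. As a starting point, here is a minimal sketch that assumes `results` is a list of `(doc_id, score)` pairs; that input format is an assumption, and `rank_search_sketch` is a hypothetical name. It keeps the `top` best-scoring pairs, best first.

```python
import heapq

def rank_search_sketch(results, top=5):
    """Return the `top` (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(top, results, key=lambda pair: pair[1])

# Made-up (doc_id, score) pairs, for illustration only
results = [(5, 0.20), (7, 0.07), (10, 0.33), (14, 0.09)]
best = rank_search_sketch(results, top=2)
print(best)
```

If the results are kept in a DataFrame instead, sorting on the score column and taking the first rows would play the same role.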

## Visualising Results

In [None]:
def visualize_output():
    # TODO
    
    return

## Querying

In [None]:
def make_query(natural_query):
    # TODO

    return

## Scoring

In [None]:
# Not sure whether to keep this part

## Testing

In [None]:
# TODO