# Final Notebook

This notebook is your search engine. 

To test your work, we will run each cell. Your code will therefore have to fit the expected structure.



## Initialisation

- Install libraries (if you use Colab and they are needed),
- Import the modules,
- Declare global variables


In [1]:
!pip install nltk
!python -m textblob.download_corpora
!pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [39]:
import pandas as pd
import os
from bs4 import BeautifulSoup
import re
import math
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Only if you use Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
DATAPATH = 'drive/MyDrive/EI_web_data/Data'

## Extracting the data

In [50]:
def remove_tags(text:str)->str:
  '''Remove the HTML tags from the text.'''
  soup = BeautifulSoup(text, 'html.parser')

  # Drop all script and style elements
  for script in soup(['script', 'style']):
    script.extract()

  # Get the clean text without tags
  clean_text = soup.get_text()
  
  # Collapse extra whitespace and line breaks
  clean_text = re.sub(r'\s+', ' ', clean_text)

  return clean_text

def extract_tokens(text:str)->list:
  '''Extract all the words of the text, lowercased.'''
  tokens=re.findall(r'\w+', text)     
  return [token.lower() for token in tokens]

def lemmatize(tokens:list)->list:
  '''Lemmatize every word in the list.'''
  wnl = WordNetLemmatizer()
  return [wnl.lemmatize(token) for token in tokens]

def remove_stopwords(word_list):
  '''Remove all English stopwords from the list.'''
  # Build the stopword set once per call instead of reloading the list for every word
  english_stopwords = set(stopwords.words('english'))
  return [word for word in word_list if word not in english_stopwords]


def extract_data(datapath):
    # Sample 1000 posts rather than loading everything, to save time
    df=pd.read_xml(os.path.join(datapath, 'Posts.xml'), parser="etree", encoding="utf8").sample(1000)
    df['CleanBody'] = df['Body'].fillna('').apply(remove_tags)
    df['Tokens'] = df['CleanBody'].apply(extract_tokens)
    df['Words']= df['Tokens'].apply(lemmatize)
    df['MeaningfullWords'] = df['Words'].apply(remove_stopwords)

    return df

In [51]:
df = extract_data(datapath=DATAPATH)
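
As a quick check of what the regular expression in `extract_tokens` does, the sketch below re-implements only that step with the standard library. The sample sentence is made up for illustration and does not come from the dataset.

```python
import re

def extract_tokens(text: str) -> list:
    """Same rule as in the cell above: runs of word characters, lowercased."""
    return [token.lower() for token in re.findall(r'\w+', text)]

# Hypothetical sample sentence, not taken from Posts.xml
sample = "Don't over-fit the model_1!"
tokens = extract_tokens(sample)
print(tokens)
```

Because `\w+` stops at apostrophes and hyphens, contractions and hyphenated words come out as several tokens, while digits and underscores stay inside a token.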

## Indexing the data

In [44]:
def index_data(df:pd.DataFrame)-> dict:
  '''Return an inverted index as a dictionary of the form
  dic={word:{doc_id:frequency}}'''

  dic={}
  for rang in df.index:
    words = df.loc[rang, 'Words']
    doc_id = df.loc[rang,'Id']
    for word in words:
      if word in dic: # the word has already appeared in the corpus
        if doc_id in dic[word]: # the word has already appeared in this document
          dic[word][doc_id]+=1
        else:
          dic[word][doc_id]=1 # first occurrence of the word in this document
      else: # first occurrence of the word in the corpus
        dic[word]={doc_id: 1}

  return dic

In [52]:
inverted_index=index_data(df)
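
To make the `{word: {doc_id: frequency}}` layout concrete, here is a minimal sketch that builds the same kind of index for a toy corpus. The documents and the names `toy_docs` and `toy_index` are hypothetical; only the structure mirrors `index_data`.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> list of already-lemmatised words
toy_docs = {
    1: ["gradient", "descent", "gradient"],
    2: ["stochastic", "gradient"],
    3: ["neural", "network"],
}

toy_index = defaultdict(dict)
for doc_id, words in toy_docs.items():
    for word in words:
        # Count how many times `word` occurs in document `doc_id`
        toy_index[word][doc_id] = toy_index[word].get(doc_id, 0) + 1

print(dict(toy_index))
```

Here `toy_index["gradient"]` maps document 1 to a count of 2 and document 2 to a count of 1, so both the document frequency (number of keys) and the term frequency (the values) can be read from one lookup.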

In [None]:
# Save and Load your Index(es) in Pickle format

def save_index(savepath):
    # TODO

    return


def load_index(savepath):
    # TODO

    return
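
One possible way to fill in these two stubs is sketched below with the standard `pickle` module. It is only a sketch under assumptions: the function names carry a `_sketch` suffix, the index is passed explicitly rather than read from a global, and the file name is made up.

```python
import os
import pickle
import tempfile

def save_index_sketch(index: dict, savepath: str) -> None:
    """Serialise the index to `savepath` with pickle."""
    with open(savepath, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_index_sketch(savepath: str) -> dict:
    """Read back an index written by save_index_sketch."""
    with open(savepath, "rb") as f:
        return pickle.load(f)

# Round trip on a toy index inside a temporary directory
toy_index = {"gradient": {1: 2, 2: 1}, "descent": {1: 1}}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "inverted_index.pkl")  # hypothetical file name
    save_index_sketch(toy_index, path)
    restored = load_index_sketch(path)

print(restored == toy_index)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.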

## Search Method

In [53]:
def BM25(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75):
  '''Return a copy of the dataframe with a 'QueryScore' column holding the BM25 score for the query.'''
  # extract the query terms
  processed_query=lemmatize(extract_tokens(query))

  # work on a copy of the dataframe
  df_copy=df.copy()

  # add a document-length column
  df_copy['Lenght']=df_copy['Words'].apply(len)

  N=len(df_copy) # number of documents in the collection
  avgdl=df_copy['Lenght'].mean() # average document length

  # compute the score of each document
  scores=[]
  for rang in df_copy.index:
    doc_id=df_copy.loc[rang,'Id']
    lenght=df_copy.loc[rang,'Lenght']
    s=0

    for terme in processed_query:
      if terme not in inverted_index:
        continue # the term appears nowhere in the corpus, so it contributes nothing to the score

      n=len(inverted_index[terme]) # number of documents containing the term

      # note: this IDF is negative when the term occurs in more than half of the documents
      IDF=math.log((N-n+0.5)/(n+0.5))
      freq=inverted_index[terme].get(doc_id, 0) # term frequency in this document
      s+=IDF*freq*(k1+1)/(freq+k1*(1-b+b*lenght/avgdl))
    
    scores.append(s)

  # add the 'QueryScore' column to the copy of the dataframe
  df_copy['QueryScore']=scores

  return df_copy




def search(query):
    # TODO

    return
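
For reference, the score computed by `BM25` above can be written as the following formula. The symbols are notation chosen here to match the code: $f(q, D)$ is `freq`, $|D|$ is `lenght`, $n(q)$ is the number of documents containing the term, $N$ is the number of documents, and $k_1 = 1.5$, $b = 0.75$ are the default parameters.

$$
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,\frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\frac{N - n(q) + 0.5}{n(q) + 0.5}
$$

Query terms that are absent from the inverted index are skipped, which is the same as giving them a zero contribution.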

In [57]:
query='what is stochastic gradient descent ?'
BM25(query).nlargest(5, 'QueryScore')

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,OwnerDisplayName,CommunityOwnedDate,LastEditorDisplayName,FavoriteCount,CleanBody,Tokens,Words,MeaningfullWords,Lenght,QueryScore
4472,9404,2,2015-12-15T19:31:07.267,7,,<p>Let us say that the output of one neural ne...,14779.0,2015-12-15T19:31:07.267,,,...,,,,,Let us say that the output of one neural netwo...,"[let, us, say, that, the, output, of, one, neu...","[let, u, say, that, the, output, of, one, neur...","[let, u, say, output, one, neural, network, gi...",163,16.047851
13155,22075,1,2017-08-08T15:24:06.500,3,5074.0,<p>Since the aim of a Discriminator is to outp...,32882.0,2018-03-31T15:41:23.930,Loss function in GAN,<unsupervised-learning><gan>,...,,,,,Since the aim of a Discriminator is to output ...,"[since, the, aim, of, a, discriminator, is, to...","[since, the, aim, of, a, discriminator, is, to...","[since, aim, discriminator, output, 1, real, d...",85,11.273306
65087,105080,2,2021-12-13T19:31:29.770,1,,<p>What you have is a standard optimization pr...,80744.0,2021-12-13T19:31:29.770,,,...,,,,,What you have is a standard optimization probl...,"[what, you, have, is, a, standard, optimizatio...","[what, you, have, is, a, standard, optimizatio...","[standard, optimization, problem, ha, got, not...",168,11.033807
52489,81654,1,2020-09-14T05:23:29.517,0,1218.0,<p>According to me:</p>\n<p><strong>Mini Batch...,98121.0,2020-09-14T13:44:32.070,Why Mini batch gradient descent is faster than...,<deep-learning><gradient-descent><mini-batch-g...,...,,,,,According to me: Mini Batch Gradient Descent :...,"[according, to, me, mini, batch, gradient, des...","[according, to, me, mini, batch, gradient, des...","[according, mini, batch, gradient, descent, 1,...",110,11.023721
23221,36804,4,2018-08-11T20:54:07.867,0,,"Use for questions about Backpropagation, which...",29575.0,2018-08-12T23:29:15.453,,,...,,,,,"Use for questions about Backpropagation, which...","[use, for, questions, about, backpropagation, ...","[use, for, question, about, backpropagation, w...","[use, question, backpropagation, commonly, use...",23,10.196139
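
To see how such scores arise, the sketch below applies the same formula to a three-document toy corpus using only the standard library. The corpus, the helper name `bm25_score`, and its signature are assumptions made for illustration; they are not part of the notebook.

```python
import math

def bm25_score(query_terms, doc_id, doc_len, index, n_docs, avgdl, k1=1.5, b=0.75):
    """BM25 score of one document, using the same formula as BM25 above."""
    score = 0.0
    for term in query_terms:
        postings = index.get(term)
        if postings is None:
            continue  # term absent from the corpus
        n = len(postings)  # number of documents containing the term
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        freq = postings.get(doc_id, 0)
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avgdl))
    return score

# Hypothetical toy corpus and its inverted index
toy_docs = {
    1: ["gradient", "descent", "gradient"],
    2: ["stochastic", "gradient"],
    3: ["neural", "network"],
}
toy_index = {}
for doc_id, words in toy_docs.items():
    for word in words:
        toy_index.setdefault(word, {})
        toy_index[word][doc_id] = toy_index[word].get(doc_id, 0) + 1

n_docs = len(toy_docs)
avgdl = sum(len(words) for words in toy_docs.values()) / n_docs

def score_all(query_terms):
    """Score every toy document against the query terms."""
    return {
        doc_id: bm25_score(query_terms, doc_id, len(words), toy_index, n_docs, avgdl)
        for doc_id, words in toy_docs.items()
    }

scores_stochastic = score_all(["stochastic"])
scores_gradient = score_all(["gradient"])
print(scores_stochastic)
print(scores_gradient)
```

In this toy corpus "gradient" occurs in two of the three documents, so its IDF is negative and the documents that contain it score below the one that does not. A common variant of the formula takes the logarithm of one plus the same ratio, which keeps the IDF non-negative; whether that matters for the real collection depends on how frequent the query terms are.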


## Ranking

In [None]:
def rank_search(results, top=5):
    # TODO

    return
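
A minimal sketch of one way to implement this step, assuming `results` is a dictionary mapping a document id to its score (the notebook does not fix that format). The query cell above achieves the same effect on a DataFrame with `nlargest(5, 'QueryScore')`.

```python
import heapq

def rank_search_sketch(results: dict, top: int = 5) -> list:
    """Return the `top` (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(top, results.items(), key=lambda pair: pair[1])

# Hypothetical scores, not taken from the dataset
toy_results = {10: 0.4, 11: 2.1, 12: 1.3, 13: 0.0}
ranked = rank_search_sketch(toy_results, top=2)
print(ranked)
```

`heapq.nlargest` avoids sorting the whole result set when only a few top documents are needed.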

## Visualising Results

In [None]:
def visualize_output():
    # TODO
    
    return
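
As an illustration only, the results could be rendered as a plain-text listing. The function name, its parameters, and the sample title below are hypothetical; the fallback label covers posts without a title, which is the case for the answers shown in the output above.

```python
def format_results_sketch(ranked: list, titles: dict) -> str:
    """Build a plain-text listing: rank, score, document id and title (if any)."""
    lines = []
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        title = titles.get(doc_id) or "(no title)"
        lines.append(f"{rank}. [{score:.2f}] #{doc_id} {title}")
    return "\n".join(lines)

# Hypothetical ranked results and titles
toy_ranked = [(11, 2.1), (12, 1.3)]
toy_titles = {11: "Example question title"}
output = format_results_sketch(toy_ranked, toy_titles)
print(output)
```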

## Querying

In [None]:
def make_query(natural_query):
    # TODO

    return
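
A possible shape for this step is sketched below: lowercase the query, tokenise it with the same regular expression as `extract_tokens`, and drop stopwords. The tiny stopword set is an assumption that keeps the example self-contained (the notebook relies on NLTK's English list), and lemmatisation is left out.

```python
import re

# Tiny illustrative stopword set; the notebook itself uses NLTK's English list
TOY_STOPWORDS = {"what", "is", "the", "a", "an", "of"}

def make_query_sketch(natural_query: str) -> list:
    """Lowercase, tokenise and drop stopwords from a natural-language query."""
    tokens = [t.lower() for t in re.findall(r'\w+', natural_query)]
    return [t for t in tokens if t not in TOY_STOPWORDS]

terms = make_query_sketch('what is stochastic gradient descent ?')
print(terms)
```

Whatever processing is applied to the query should match the processing applied to the indexed documents, otherwise query terms may never be found in the index.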

## Scoring

In [None]:
# Not sure whether to keep this part

## Testing

In [None]:
# TODO