<a href="https://colab.research.google.com/github/solalducloyer/EI_ST4_Groupe1/blob/main/Search_Engine_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Notebook

This notebook is your search engine. 

To test your work, we will run each cell, so your code must fit the expected structure.



## Initialisation

- Install the libraries (if you use Colab and they are needed),
- Import the modules,
- Declare the global variables.


In [1]:
!pip install nltk
!python -m textblob.download_corpora
!pip install beautifulsoup4
!pip install sentence-transformers --quiet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.

In [2]:
import pandas as pd
import os
from bs4 import BeautifulSoup
import re
import math
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Only if you use Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
DATAPATH = 'drive/MyDrive/EI_web_data/Data'

### Save and load data

In [5]:
# Save and Load your data in Pickle format

def save_data(savepath, file_name, obj):
    with open(os.path.join(savepath, file_name), 'wb') as file:
      pickle.dump(obj, file)

def load_data(savepath, file_name):
    with open(os.path.join(savepath, file_name),'rb') as file:
      return pickle.load(file)

## Extracting the data

In [6]:
def remove_tags(text:str)->str:
  '''Remove the HTML tags from the text.'''
  soup = BeautifulSoup(text, 'html.parser')

  # Remove all script and style elements
  for script in soup(['script', 'style']):
    script.extract()

  # Get the clean text without tags
  texte_propre = soup.get_text()
  
  # Collapse extra whitespace and line breaks into single spaces
  texte_propre = re.sub(r'\s+', ' ', texte_propre)

  return texte_propre

def extract_tokens(text:str)->list:
  '''Extract every word from the text, in lower case.'''
  tokens=re.findall(r'\w+', text)     
  return [token.lower() for token in tokens]

def lemmatize(tokens:list)->list:
  '''Lemmatize every word in the list.'''
  wnl = WordNetLemmatizer()
  return [wnl.lemmatize(token) for token in tokens]

def remove_stopwords(word_list):
  '''Remove every stopword from the list.'''
  # Build the stopword set once instead of reloading the list for every word
  english_stopwords = set(stopwords.words('english'))
  return [word for word in word_list if word not in english_stopwords]


def extract_data(datapath):
    df=pd.read_xml(os.path.join(datapath, 'Posts.xml'), parser="etree", encoding="utf8")
    df['CleanBody'] = df['Body'].fillna('').apply(remove_tags)
    df['Tokens'] = df['CleanBody'].apply(extract_tokens)
    df['Words']= df['Tokens'].apply(lemmatize)
    df['MeaningfullWords'] = df['Words'].apply(remove_stopwords)

    return df

The first time you execute the code, uncomment the first two lines to extract the data and save it (this may take a few minutes). On later runs you can simply load the dataframe.

In [7]:
# df = extract_data(datapath=DATAPATH) 
# save_data(DATAPATH, 'df.pkl', df)
df=load_data(DATAPATH, 'df.pkl')
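
The comment/uncomment routine above can also be expressed as a small "load if cached, otherwise compute and save" helper. The sketch below is only an illustration: `load_or_compute` and its `compute` argument are hypothetical names that do not exist in this notebook, and the demonstration writes to a temporary directory rather than to `DATAPATH`.

```python
import os
import pickle
import tempfile

def load_or_compute(savepath, file_name, compute):
    """Return the pickled object if the file exists, otherwise compute and save it.

    Hypothetical helper: `compute` is any zero-argument function building the object.
    """
    full_path = os.path.join(savepath, file_name)
    if os.path.exists(full_path):
        with open(full_path, 'rb') as file:
            return pickle.load(file)
    obj = compute()
    with open(full_path, 'wb') as file:
        pickle.dump(obj, file)
    return obj

# Toy demonstration: the second call reads the cache instead of recomputing.
calls = []

def expensive():
    calls.append(1)
    return {'rows': 3}

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(tmp, 'toy.pkl', expensive)
    second = load_or_compute(tmp, 'toy.pkl', expensive)

print(first == second, len(calls))
```

In this notebook the call would look like `df = load_or_compute(DATAPATH, 'df.pkl', lambda: extract_data(DATAPATH))`, assuming the cached file should be trusted whenever it exists.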

## Indexing the data

In [8]:
def index_data(df:pd.DataFrame)-> dict:
  '''Return an inverted index of the form
  dic={word: {doc_id: frequency of the word in that document}}'''

  dic={}
  for rang in df.index:
    Words = df.loc[rang, 'Words']
    doc_id = df.loc[rang,'Id']
    for word in Words:
      if word in dic.keys(): # the word has already appeared in the corpus
        if doc_id in dic[word].keys(): # the word has already appeared in this document
          dic[word][doc_id]+=1
        else:
          dic[word][doc_id]=1 # first occurrence of the word in this document
      else: # first occurrence of the word in the corpus
        dic[word]={doc_id: 1}

  return dic
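
To make the shape of the index concrete, here is a minimal sketch on a two-document toy corpus. It uses `collections.Counter` instead of the nested `if` tests, and `build_inverted_index`, `toy_docs` and `toy_index` are illustrative names, not part of the notebook.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: term frequency}; `docs` is {doc_id: list of words}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        # Counter gives the frequency of every distinct word in this document
        for word, freq in Counter(words).items():
            index[word][doc_id] = freq
    return dict(index)

toy_docs = {
    1: ['draw', 'neural', 'network', 'network'],
    2: ['neural', 'style', 'transfer'],
}
toy_index = build_inverted_index(toy_docs)
print(toy_index['network'])  # {1: 2}
print(toy_index['neural'])   # {1: 1, 2: 1}
```

The number of keys in `toy_index[word]` is the document frequency of the word, which is exactly what the BM25 scoring needs.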

In [9]:
# inverted_index=index_data(df)
# save_data(DATAPATH, 'inverted_index.pkl', inverted_index)
inverted_index=load_data(DATAPATH, 'inverted_index.pkl')

## Search Method

In [10]:
MODEL_ST = SentenceTransformer('all-MiniLM-L6-v2')
#embeddings = MODEL_ST.encode(df.CleanBody.values, normalize_embeddings=True)
#save_data(DATAPATH, 'embeddings.pkl', embeddings)
embeddings=load_data(DATAPATH, 'embeddings.pkl')


In [67]:
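def similarity_matrix(query, df=df, embeddings=embeddings, MODEL_ST=MODEL_ST):
  '''Return the cosine similarity between the query and every document embedding (one row, one column per document).'''
  encoded_query = MODEL_ST.encode([query], normalize_embeddings=True)
  matrix=cosine_similarity(encoded_query, embeddings)
  return matrix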


def BM25(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75):
  '''Return the list of BM25 scores for the query, together with a copy of the dataframe
  holding them in a 'ScoreBM25' column.'''
  # extract the words of the query
  processed_query=lemmatize(extract_tokens(query))

  # work on a copy of the dataframe
  df_copy=df.copy()

  # add a document-length column
  df_copy['Lenght']=df_copy['Words'].apply(lambda x : len(x))

  N=len(df_copy) # number of documents in the collection
  avgdl=df_copy['Lenght'].mean() # average document length

  # compute the score of each document
  scores=[]
  for rang in df_copy.index:
    doc_id=df_copy.loc[rang,'Id']
    lenght=df_copy.loc[rang,'Lenght']
    s=0

    for terme in processed_query:
      if terme not in inverted_index : 
        s+=0 # the term appears neither in the document nor in the corpus: no contribution to the score
      else :
        n=len(inverted_index[terme]) # number of documents containing the term
        
        IDF=math.log((N-n+0.5)/(n+0.5))
        if doc_id in inverted_index[terme]: freq=inverted_index[terme][doc_id]
        else: freq=0
        s+=IDF*freq*(k1+1)/(freq+k1*(1-b+b*lenght/avgdl))
    
    scores.append(s)

  # add a 'ScoreBM25' column to the copy of the dataframe
  df_copy['ScoreBM25']=scores

  return scores, df_copy


def scored_df(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75, m=0.7, embeddings=embeddings, MODEL_ST=MODEL_ST):
  '''Return a copy of the dataframe with the columns 'ScoreBM25', 'CosineSimilarity', 'XMesure' and 'MMesure'.
  XMesure is the product of the BM25 score and the cosine similarity.
  MMesure is the weighted mean (weight m) of the BM25 score normalised by its maximum and the cosine similarity.'''
  df_copy=df.copy()
  df_copy['ScoreBM25']=BM25(query, df, inverted_index, k1, b)[0]
  df_copy['CosineSimilarity'] = similarity_matrix(query, df, embeddings, MODEL_ST)[0]
  df_copy['XMesure']=df_copy['ScoreBM25']*df_copy['CosineSimilarity']
  
  A=df_copy['ScoreBM25'].max()
  df_copy['MMesure']=df_copy['ScoreBM25']*m/A + df_copy['CosineSimilarity']*(1-m)
  return df_copy


def search(query, top=10, mesure='MMesure', df=df, inverted_index=inverted_index, k1=1.5, b=0.75, m=0.7, embeddings=embeddings, MODEL_ST=MODEL_ST):
  df_copy=scored_df(query, df, inverted_index, k1, b, m, embeddings, MODEL_ST)
  return df_copy.nlargest(top, mesure)
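
The BM25 formula used above can be checked on a toy index. The sketch below is not the notebook's implementation: it walks only the postings of each query term instead of looping over every document, and `bm25_scores`, `toy_index`, `toy_lengths` and the `nonnegative_idf` flag are hypothetical names. The flag switches to a common variant that adds 1 inside the logarithm. With the plain form, a term present in more than half of the documents gets a negative IDF, which is consistent with the negative `ScoreBM25` values that show up in the Testing section.

```python
import math

def bm25_scores(query_terms, index, doc_lengths, k1=1.5, b=0.75, nonnegative_idf=False):
    """Score every document for the query.

    index: {term: {doc_id: frequency}}, doc_lengths: {doc_id: number of words}.
    """
    N = len(doc_lengths)                      # number of documents
    avgdl = sum(doc_lengths.values()) / N     # average document length
    scores = {doc_id: 0.0 for doc_id in doc_lengths}
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # term unknown to the corpus: no contribution
        n = len(postings)                     # documents containing the term
        ratio = (N - n + 0.5) / (n + 0.5)
        # Assumption: the "+1" variant is offered only as an option, it is not in the notebook
        idf = math.log(1 + ratio) if nonnegative_idf else math.log(ratio)
        for doc_id, freq in postings.items():
            norm = freq + k1 * (1 - b + b * doc_lengths[doc_id] / avgdl)
            scores[doc_id] += idf * freq * (k1 + 1) / norm
    return scores

toy_index = {
    'draw': {1: 1},
    'neural': {1: 1, 2: 1, 3: 1},   # present in 3 of the 4 documents
    'network': {1: 2, 3: 1},
}
toy_lengths = {1: 4, 2: 3, 3: 5, 4: 4}

raw = bm25_scores(['draw', 'neural'], toy_index, toy_lengths)
safe = bm25_scores(['draw', 'neural'], toy_index, toy_lengths, nonnegative_idf=True)
print(raw)
print(safe)
```

In the toy run, documents 2 and 3 contain only the very frequent term `neural`, so with the plain IDF they score below document 4, which matches no query term at all. Whether that behaviour is acceptable is a design choice; removing stopwords from the index or using the non-negative variant are two possible ways to avoid it.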

In [75]:
query='draw neural network'
search(query, top=10, mesure = 'MMesure').loc[:,['Id','Title','CleanBody','ScoreBM25','CosineSimilarity','XMesure','MMesure','ViewCount','Score']]

Unnamed: 0,Id,Title,CleanBody,ScoreBM25,CosineSimilarity,XMesure,MMesure,ViewCount,Score
40559,63201,,I asked me something similar as well as I thou...,15.57407,0.605822,9.435117,0.837603,,1
40555,63195,How to draw neural network diagrams with this ...,I would like to draw a neural network architec...,15.503238,0.595052,9.225228,0.831389,3084.0,1
43023,66343,Drawing Neural Network diagram for academic pa...,Is there any tool that one can use to draw neu...,15.019099,0.636582,9.560882,0.82346,9263.0,4
64585,104365,What Shape Does Naive Bayes make?,Decision Trees draw straight lines to partitio...,16.622309,0.398364,6.621737,0.819509,540.0,3
25738,40235,,I wrote some latex code to draw Deep networks ...,12.96927,0.618202,8.017623,0.731623,,86
43027,66349,,"As far as I know, most researchers use general...",10.830108,0.692501,7.499859,0.663829,,0
5016,10050,Simple ANN visualisation,TLDR: Please help me understand the graph repr...,11.025428,0.627747,6.921176,0.652628,353.0,3
7238,12859,,In Caffe you can use caffe/draw.py to draw the...,12.154403,0.458155,5.568605,0.649294,,21
71930,114669,,"The neural network is nonlinear if, and only i...",12.345783,0.377695,4.662938,0.633215,,1
55953,87844,How can I fix my classifier only predicting tw...,I have a relatively simple 16 feature neural n...,11.944656,0.408411,4.878327,0.625538,77.0,0
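
The `MMesure` column in the table above can be reproduced by hand. The sketch below recomputes it for three rows, using 16.622309 as the maximum BM25 score; this value appears to be the corpus maximum, since the `MMesure` of that row equals m + (1 - m) × cosine. The function name `m_measure` and the guard for a non-positive maximum are additions for illustration, not part of the notebook.

```python
def m_measure(bm25_scores, cosine_scores, m=0.7):
    """Weighted mean of the max-normalised BM25 score and the cosine similarity."""
    top = max(bm25_scores)
    # Guard (an assumption, absent from the notebook): if no BM25 score is positive,
    # dividing by the maximum is meaningless, so only the semantic part is kept.
    if top <= 0:
        return [(1 - m) * c for c in cosine_scores]
    return [m * s / top + (1 - m) * c for s, c in zip(bm25_scores, cosine_scores)]

# ScoreBM25 and CosineSimilarity of posts 63201, 63195 and 104365 from the table
bm25 = [15.57407, 15.503238, 16.622309]
cosine = [0.605822, 0.595052, 0.398364]

mixed = m_measure(bm25, cosine)
print([round(value, 6) for value in mixed])
```

With m = 0.7 the lexical score dominates: post 104365 reaches 0.82 despite a cosine similarity below 0.4, only because it has the highest BM25 score.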


## Ranking

In [82]:
def viewcount_ranking(query, top, mesure='MMesure'):

  '''Re-rank the top search results by multiplying each score by ViewCount**0.3.
  Posts without a view count (NaN) keep their original score.'''
  results = search(query, top, mesure)
  posts_ids = results['Id'].tolist()

  new_scores = []

  for post_id in posts_ids:

    post_metadata = results.loc[results['Id'] == post_id]
    post_viewcount = post_metadata['ViewCount'].tolist()[0]
    post_score = post_metadata[mesure].tolist()[0]

    if not math.isnan(post_viewcount) :
      new_scores.append(post_score*math.pow(post_viewcount, 0.3))
    else :
      new_scores.append(post_score)

  results['MetadataScore'] = new_scores

  new_ranking = results.nlargest(top, 'MetadataScore')

  return new_ranking
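
The boosting rule is easier to see on three rows taken from the search results above. This is a minimal sketch in plain Python; `boost_by_views` and `rows` are illustrative names, and the exponent 0.3 is the one used in `viewcount_ranking`.

```python
import math

def boost_by_views(score, view_count, exponent=0.3):
    """Multiply the score by view_count**exponent; keep it unchanged when views are missing."""
    if view_count is None or math.isnan(view_count):
        return score
    return score * math.pow(view_count, exponent)

# (Id, MMesure, ViewCount) for three posts returned by the query 'draw neural network'
rows = [
    (63201, 0.837603, float('nan')),
    (63195, 0.831389, 3084.0),
    (66343, 0.82346, 9263.0),
]
reranked = sorted(rows, key=lambda r: boost_by_views(r[1], r[2]), reverse=True)
print([post_id for post_id, _, _ in reranked])  # [66343, 63195, 63201]
```

Because `MMesure` stays below 1 and the factor ViewCount**0.3 is at least 1 for any post with one view or more, posts that have a view count (questions, in the tables shown here) will almost always move ahead of posts without one. Keep this in mind when judging whether the re-ranking favours relevance or simply the post type.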

In [100]:
query='draw neural network'
viewcount_ranking(query, top=10).loc[:,['Id','Title','CleanBody','ScoreBM25','CosineSimilarity','XMesure','MMesure','ViewCount','Score']]

Unnamed: 0,Id,Title,CleanBody,ScoreBM25,CosineSimilarity,XMesure,MMesure,ViewCount,Score
43023,66343,Drawing Neural Network diagram for academic pa...,Is there any tool that one can use to draw neu...,15.019099,0.636582,9.560882,0.82346,9263.0,4
40555,63195,How to draw neural network diagrams with this ...,I would like to draw a neural network architec...,15.503238,0.595052,9.225228,0.831389,3084.0,1
64585,104365,What Shape Does Naive Bayes make?,Decision Trees draw straight lines to partitio...,16.622309,0.398364,6.621737,0.819509,540.0,3
5016,10050,Simple ANN visualisation,TLDR: Please help me understand the graph repr...,11.025428,0.627747,6.921176,0.652628,353.0,3
55953,87844,How can I fix my classifier only predicting tw...,I have a relatively simple 16 feature neural n...,11.944656,0.408411,4.878327,0.625538,77.0,0
40559,63201,,I asked me something similar as well as I thou...,15.57407,0.605822,9.435117,0.837603,,1
25738,40235,,I wrote some latex code to draw Deep networks ...,12.96927,0.618202,8.017623,0.731623,,86
43027,66349,,"As far as I know, most researchers use general...",10.830108,0.692501,7.499859,0.663829,,0
7238,12859,,In Caffe you can use caffe/draw.py to draw the...,12.154403,0.458155,5.568605,0.649294,,21
71930,114669,,"The neural network is nonlinear if, and only i...",12.345783,0.377695,4.662938,0.633215,,1


## Visualising Results

In [15]:
def visualize_output():
    # TODO
    
    return

## Querying

In [16]:
def make_query(natural_query):
    # TODO

    return

## Scoring

In [17]:
# Not sure whether to keep this part

## Testing

In [108]:
query='mesure performance for multiclassification model'
liste_id=[6107,15989,13490,12321,22,14899,5706,15135,12851,694,9302,9443]
results=search(query, top=len(df), mesure = 'MMesure').loc[df['Id'].isin(liste_id)]
results


Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,LastEditorDisplayName,FavoriteCount,CleanBody,Tokens,Words,MeaningfullWords,ScoreBM25,CosineSimilarity,XMesure,MMesure
9533,15989,1,2016-12-29T17:39:07.967,276,340524.0,<p>I am trying out a multiclass classification...,13518.0,2022-04-02T09:28:24.697,Micro Average vs Macro average Performance in ...,<multiclass-classification><evaluation>,...,,,I am trying out a multiclass classification se...,"[i, am, trying, out, a, multiclass, classifica...","[i, am, trying, out, a, multiclass, classifica...","[trying, multiclass, classification, setting, ...",4.928014,0.44277,2.181977,0.383084
7741,13490,1,2016-08-17T09:35:45.110,251,391409.0,<p>I know that there is a possibility in Keras...,21560.0,2021-07-15T16:24:48.873,How to set class weights for imbalanced classe...,<deep-learning><classification><keras><weighte...,...,,,I know that there is a possibility in Keras wi...,"[i, know, that, there, is, a, possibility, in,...","[i, know, that, there, is, a, possibility, in,...","[know, possibility, kera, class_weights, param...",0.0,0.248855,0.0,0.074656
4383,9302,1,2015-12-10T06:22:48.927,149,164514.0,"<p>In the <a href=""https://www.tensorflow.org/...",8820.0,2021-09-11T16:22:44.930,The cross-entropy error function in neural net...,<machine-learning><tensorflow>,...,,,In the MNIST For ML Beginners they define cros...,"[in, the, mnist, for, ml, beginners, they, def...","[in, the, mnist, for, ml, beginner, they, defi...","[mnist, ml, beginner, define, cross, entropy, ...",-0.778684,0.268063,-0.208736,0.040876
612,694,1,2014-07-07T19:17:04.973,150,125487.0,<p>I'm using Neural Networks to solve differen...,989.0,2018-12-11T23:39:52.667,Best python library for neural networks,<machine-learning><python><neural-network>,...,,,I'm using Neural Networks to solve different M...,"[i, m, using, neural, networks, to, solve, dif...","[i, m, using, neural, network, to, solve, diff...","[using, neural, network, solve, different, mac...",0.0,0.117553,0.0,0.035266
4508,9443,1,2015-12-19T19:30:35.527,171,116988.0,<p>I have been building models with categorica...,10462.0,2022-02-20T22:17:56.370,When to use One Hot Encoding vs LabelEncoder v...,<scikit-learn><categorical-data><feature-engin...,...,,,I have been building models with categorical d...,"[i, have, been, building, models, with, catego...","[i, have, been, building, model, with, categor...","[building, model, categorical, data, situation...",0.080767,0.100002,0.008077,0.034102
7230,12851,1,2016-07-18T17:08:17.237,164,204629.0,<p>When writing a paper / making a presentatio...,8820.0,2022-08-29T01:27:44.843,How do you visualize neural network architectu...,<machine-learning><neural-network><deep-learni...,...,,,When writing a paper / making a presentation a...,"[when, writing, a, paper, making, a, presentat...","[when, writing, a, paper, making, a, presentat...","[writing, paper, making, presentation, topic, ...",0.0,0.109746,0.0,0.032924
8928,15135,1,2016-11-15T14:55:04.130,184,347425.0,<p>How could I randomly split a data matrix an...,21560.0,2022-12-05T16:13:04.480,Train/Test/Validation Set Splitting in Sklearn,<machine-learning><scikit-learn><cross-validat...,...,,,How could I randomly split a data matrix and t...,"[how, could, i, randomly, split, a, data, matr...","[how, could, i, randomly, split, a, data, matr...","[could, randomly, split, data, matrix, corresp...",0.0,0.1016,0.0,0.03048
6804,12321,1,2016-06-21T10:05:08.587,238,336451.0,<p>I do not understand the difference between ...,15064.0,2021-07-08T08:09:09.163,What's the difference between fit and fit_tran...,<python><scikit-learn>,...,,,I do not understand the difference between the...,"[i, do, not, understand, the, difference, betw...","[i, do, not, understand, the, difference, betw...","[understand, difference, fit, fit_transform, m...",-0.16219,0.108724,-0.017634,0.024381
12,22,1,2014-05-14T05:58:21.927,198,286053.0,<p>My data set contains a number of numeric at...,97.0,2022-10-14T09:40:25.270,K-Means clustering for mixed numeric and categ...,<data-mining><clustering><octave><k-means><cat...,...,,,My data set contains a number of numeric attri...,"[my, data, set, contains, a, number, of, numer...","[my, data, set, contains, a, number, of, numer...","[data, set, contains, number, numeric, attribu...",-0.884025,0.206735,-0.182759,0.017128
8752,14899,1,2016-11-03T03:10:24.893,189,316075.0,<p>I have built my model. Now I want to draw t...,20618.0,2020-05-12T19:03:48.143,How to draw Deep learning network architecture...,<machine-learning><neural-network><deep-learni...,...,,,I have built my model. Now I want to draw the ...,"[i, have, built, my, model, now, i, want, to, ...","[i, have, built, my, model, now, i, want, to, ...","[built, model, want, draw, network, architectu...",-0.193108,0.069474,-0.013416,0.011036
