<a href="https://colab.research.google.com/github/solalducloyer/EI_ST4_Groupe1/blob/main/Search_Engine_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Notebook

This notebook is your search engine. 

To test your work, we will run each cell, so your code must fit the expected structure.



## Initialisation

- Install the libraries (only needed if you use Colab),
- Import the modules,
- Declare the global variables


In [1]:
!pip install nltk
!python -m textblob.download_corpora
!pip install beautifulsoup4
!pip install sentence-transformers --quiet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [2]:
import pandas as pd
import os
from bs4 import BeautifulSoup
import re
import math
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Only if you use Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
DATAPATH = 'drive/MyDrive/EI_web_data/Data'
# DATAPATH = 'drive/MyDrive/EI_ST4_Groupe1/data'

### Save and load data

In [5]:
# Save and Load your data in Pickle format

def save_data(savepath, file_name, obj):
    with open(os.path.join(savepath, file_name), 'wb') as file:
      pickle.dump(obj, file)

def load_data(savepath, file_name):
    with open(os.path.join(savepath, file_name),'rb') as file:
      return pickle.load(file)

## Extracting the data

In [6]:
def remove_tags(text: str) -> str:
  '''Strip the HTML tags from the text.'''
  soup = BeautifulSoup(text, 'html.parser')

  # Drop all script and style elements
  for script in soup(['script', 'style']):
    script.extract()

  # Get the plain text without tags
  clean_text = soup.get_text()

  # Collapse runs of whitespace and line breaks into single spaces
  clean_text = re.sub(r'\s+', ' ', clean_text)

  return clean_text

def extract_tokens(text: str) -> list:
  '''Return all the words of the text, lowercased.'''
  tokens = re.findall(r'\w+', text)
  return [token.lower() for token in tokens]

def lemmatize(tokens: list) -> list:
  '''Lemmatize every word in the list.'''
  wnl = WordNetLemmatizer()
  return [wnl.lemmatize(token) for token in tokens]

def remove_stopwords(word_list):
  '''Remove every English stopword from the list.'''
  # Build the stopword set once instead of reloading the list for every word
  stop_words = set(stopwords.words('english'))
  return [word for word in word_list if word not in stop_words]


def extract_data(datapath):
    df = pd.read_xml(os.path.join(datapath, 'Posts.xml'), parser="etree", encoding="utf8")
    df['CleanBody'] = df['Body'].fillna('').apply(remove_tags)
    df['Tokens'] = df['CleanBody'].apply(extract_tokens)
    df['Words'] = df['Tokens'].apply(lemmatize)
    df['MeaningfullWords'] = df['Words'].apply(remove_stopwords)

    return df
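In the results table further down, the `MeaningfullWords` column contains fragments such as `wa` and `ha`. A plausible explanation is the order of the steps: lemmatizing first can shorten a stopword ("was", "has") into something the stopword list no longer recognises. The sketch below illustrates the effect with toy stand-ins only. `TOY_STOPWORDS` and `toy_lemmatize` are hypothetical names and deliberately crude; they are not NLTK and only mimic the behaviour for this example.

```python
# Toy stand-ins (NOT NLTK): a tiny stopword set and a naive lemmatizer.
TOY_STOPWORDS = {"has", "was", "is", "the", "a", "as"}

def toy_lemmatize(token: str) -> str:
    """Crude stand-in for a noun lemmatizer: drop a trailing 's' on words longer than 2 letters."""
    return token[:-1] if len(token) > 2 and token.endswith("s") else token

def lemmatize_then_filter(tokens):
    """Order used in extract_data: lemmatize first, then remove stopwords."""
    lemmas = [toy_lemmatize(t) for t in tokens]
    return [t for t in lemmas if t not in TOY_STOPWORDS]

def filter_then_lemmatize(tokens):
    """Alternative order: remove stopwords first, then lemmatize what is left."""
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    return [toy_lemmatize(t) for t in kept]

tokens = ["gradient", "descent", "has", "three", "variants"]
print(lemmatize_then_filter(tokens))   # "has" becomes "ha" and slips past the filter
print(filter_then_lemmatize(tokens))   # "has" is removed before it can be shortened
```

If the same effect is confirmed with the real lemmatizer, computing `MeaningfullWords` from `Tokens` and lemmatizing afterwards would be one way to avoid it. That would change the stored columns, so the pickled dataframe and index would need to be rebuilt.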

The first time you run the code, uncomment the first two lines to extract the data and save it (this may take a few minutes). On later runs, you can simply load the dataframe.

In [10]:
# df = extract_data(datapath=DATAPATH) 
# save_data(DATAPATH, 'df.pkl', df)
df=load_data(DATAPATH, 'df.pkl')
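The comment/uncomment routine can also be automated. The sketch below is one possible helper, not part of the expected notebook structure: `load_or_build` is a hypothetical name, and the demo uses a temporary directory and a dummy builder rather than the real data.

```python
import os
import pickle
import tempfile

def load_or_build(savepath, file_name, build_fn):
    """Load a pickled object if the file exists; otherwise build it, save it, and return it."""
    full_path = os.path.join(savepath, file_name)
    if os.path.exists(full_path):
        with open(full_path, "rb") as file:
            return pickle.load(file)
    obj = build_fn()
    with open(full_path, "wb") as file:
        pickle.dump(obj, file)
    return obj

# Demo: the builder records each call so we can check it runs only once.
calls = []

def expensive_build():
    calls.append(1)
    return {"answer": 42}

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, "demo.pkl", expensive_build)   # builds and saves
    second = load_or_build(tmp, "demo.pkl", expensive_build)  # loads from disk

print(first == second, len(calls))
```

In this notebook, the equivalent call would look like `df = load_or_build(DATAPATH, 'df.pkl', lambda: extract_data(DATAPATH))`.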



## Indexing the data

In [11]:
def index_data(df: pd.DataFrame) -> dict:
  '''Return an inverted index of the form
  dic = {word: {doc_id: frequency of the word in that document}}'''

  dic = {}
  for rang in df.index:
    words = df.loc[rang, 'Words']
    doc_id = df.loc[rang, 'Id']
    for word in words:
      if word in dic:  # the word has already appeared in the corpus
        if doc_id in dic[word]:  # the word has already appeared in this document
          dic[word][doc_id] += 1
        else:
          dic[word][doc_id] = 1  # first occurrence of the word in this document
      else:  # first occurrence of the word in the corpus
        dic[word] = {doc_id: 1}

  return dic
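To make the index structure concrete, here is a minimal sketch on a two-document toy corpus. It builds the same `{word: {doc_id: frequency}}` mapping, but from a plain dictionary instead of the dataframe; `build_toy_index` and `toy_docs` are illustrative names only.

```python
from collections import Counter, defaultdict

def build_toy_index(docs):
    """docs maps doc_id -> list of words; returns {word: {doc_id: term frequency}}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        # Counter gives the number of occurrences of each word in this document
        for word, freq in Counter(words).items():
            index[word][doc_id] = freq
    return dict(index)

toy_docs = {
    1: ["gradient", "descent", "gradient"],
    2: ["stochastic", "gradient"],
}
toy_index = build_toy_index(toy_docs)

print(toy_index["gradient"])       # term frequency per document
print(len(toy_index["gradient"]))  # number of documents containing the word
```

The length of each inner dictionary is the document frequency of the word, which is what the BM25 scoring needs for its IDF term.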

In [13]:
# inverted_index=index_data(df)
# save_data(DATAPATH, 'inverted_index.pkl', inverted_index)
inverted_index=load_data(DATAPATH, 'inverted_index.pkl')

## Search Method

In [14]:
MODEL_ST = SentenceTransformer('all-MiniLM-L6-v2')
#embeddings = MODEL_ST.encode(df.CleanBody.values, normalize_embeddings=True)
#save_data(DATAPATH, 'embeddings.pkl', embeddings)
embeddings=load_data(DATAPATH, 'embeddings.pkl')


In [23]:
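def similarity_matrix(query, df=df, embeddings=embeddings, MODEL_ST=MODEL_ST):
  '''Return the 1 x n matrix of cosine similarities between the query and each of the n document embeddings.'''
  encoded_query = MODEL_ST.encode([query], normalize_embeddings=True)
  matrix = cosine_similarity(encoded_query, embeddings)
  return matrix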


def BM25(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75):
  '''Return the list of BM25 scores for the query, together with a copy of the
  dataframe that has a 'QueryScore' column holding those scores.'''
  # extract the query terms
  processed_query = lemmatize(extract_tokens(query))

  # work on a copy of the dataframe
  df_copy = df.copy()

  # add a document-length column
  df_copy['Lenght'] = df_copy['Words'].apply(len)

  N = len(df_copy)  # number of documents in the collection
  avgdl = df_copy['Lenght'].mean()  # average document length

  # compute the score of each document
  scores = []
  for rang in df_copy.index:
    doc_id = df_copy.loc[rang, 'Id']
    lenght = df_copy.loc[rang, 'Lenght']
    s = 0

    for terme in processed_query:
      if terme not in inverted_index:
        continue  # the term appears nowhere in the corpus: zero contribution to the score

      n = len(inverted_index[terme])  # number of documents containing the term
      IDF = math.log((N - n + 0.5) / (n + 0.5))
      freq = inverted_index[terme].get(doc_id, 0)  # 0 if the term is not in this document
      s += IDF * freq * (k1 + 1) / (freq + k1 * (1 - b + b * lenght / avgdl))

    scores.append(s)

  # add a 'QueryScore' column to the copy of the dataframe
  df_copy['QueryScore'] = scores

  return scores, df_copy


def scored_df(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75, embeddings=embeddings, MODEL_ST=MODEL_ST):
  '''Return a copy of the dataframe with 'ScoreBM25' and 'CosineSimilarity' columns.'''
  df_copy = df.copy()
  df_copy['ScoreBM25'] = BM25(query, df, inverted_index, k1, b)[0]
  df_copy['CosineSimilarity'] = similarity_matrix(query, df, embeddings, MODEL_ST)[0]
  return df_copy

# scored_df returns a dataframe with two extra columns: the BM25 score and the cosine similarity

def search(query, df=df, inverted_index=inverted_index, k1=1.5, b=0.75, embeddings=embeddings, MODEL_ST=MODEL_ST):
  
  return
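The BM25 formula used above can be checked by hand on a small corpus. The sketch below applies the same IDF and term-frequency normalisation to five toy documents; `bm25_score`, `toy_docs`, and the rest are illustrative names, and no dataframe is involved.

```python
import math

def bm25_score(query_terms, doc_id, doc_len, index, n_docs, avgdl, k1=1.5, b=0.75):
    """BM25 score of one document, using the same formula as the BM25 function above."""
    score = 0.0
    for term in query_terms:
        postings = index.get(term)
        if not postings:
            continue  # term absent from the corpus: no contribution
        n = len(postings)  # number of documents containing the term
        idf = math.log((n_docs - n + 0.5) / (n + 0.5))
        freq = postings.get(doc_id, 0)
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avgdl))
    return score

toy_docs = {
    1: ["stochastic", "gradient", "descent"],
    2: ["gradient", "boosting"],
    3: ["random", "forest"],
    4: ["decision", "tree", "pruning"],
    5: ["k", "means", "clustering"],
}

# Build {word: {doc_id: frequency}} for the toy corpus
toy_index = {}
for doc_id, words in toy_docs.items():
    for word in words:
        toy_index.setdefault(word, {})
        toy_index[word][doc_id] = toy_index[word].get(doc_id, 0) + 1

n_docs = len(toy_docs)
avgdl = sum(len(words) for words in toy_docs.values()) / n_docs

query_terms = ["stochastic", "gradient"]
scores = {
    doc_id: bm25_score(query_terms, doc_id, len(words), toy_index, n_docs, avgdl)
    for doc_id, words in toy_docs.items()
}
for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, round(score, 3))
```

Document 1 matches both query terms and ranks first, document 2 matches only the more common term, and the others score zero. With this IDF variant, a term present in more than half of the documents gets a negative IDF, which is worth keeping in mind for queries that still contain very frequent words.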

In [24]:
query='what is stochastic gradient descent ?'
scored_df(query)

ValueError: ignored
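The `search` function is still an empty stub. One possible way to combine the two signals, offered purely as an assumption and not as the intended design, is to rescale both scores to [0, 1] and take a weighted sum. The names `min_max` and `hybrid_rank`, the weight `alpha`, and the numbers in the demo are all invented for illustration.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_rank(doc_ids, bm25_scores, cosine_scores, alpha=0.5, top=3):
    """Rank documents by alpha * BM25 + (1 - alpha) * cosine, both min-max normalised."""
    bm25_n = min_max(bm25_scores)
    cos_n = min_max(cosine_scores)
    combined = [alpha * b + (1 - alpha) * c for b, c in zip(bm25_n, cos_n)]
    ranked = sorted(zip(doc_ids, combined), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]

# Made-up scores for four hypothetical documents
doc_ids = [10, 11, 12, 13]
bm25_scores = [24.8, 21.5, 3.0, 0.0]
cosine_scores = [0.40, 0.80, 0.75, 0.10]

ranking = hybrid_rank(doc_ids, bm25_scores, cosine_scores)
print([doc_id for doc_id, _ in ranking])
```

Normalising first matters because BM25 scores are unbounded while cosine similarities lie in [-1, 1]; without it, the BM25 term would dominate the sum. Both score lists must have one entry per document, in the same order.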

## Ranking

In [25]:
def viewcount_ranking(query, top):
  '''Re-rank the 10*top best BM25 results by multiplying each score by ViewCount**0.3.'''

  results = BM25(query)[1].nlargest(10*top, 'QueryScore')

  views_max = results['ViewCount'].max()  # max() skips missing view counts
  posts_ids = results['Id'].tolist()
  print(views_max)
  print(posts_ids)

  new_scores = []

  for post_score, post_viewcount in zip(results['QueryScore'], results['ViewCount']):
    if not math.isnan(post_viewcount):
      new_scores.append(post_score*math.pow(post_viewcount, 0.3))
    else:
      # no view count available: keep the BM25 score unchanged
      new_scores.append(post_score)

  results['MetadataScore'] = new_scores

  new_ranking = results.nlargest(top, 'MetadataScore')

  return new_ranking

In [26]:
viewcount_ranking(query, top=10)

137355.0
[53047, 57795, 70271, 24919, 80232, 81452, 65174, 28492, 36450, 53884, 67663, 31858, 109347, 14370, 88153, 111960, 19317, 44464, 67772, 107875, 53052, 9717, 36481, 85585, 113073, 6434, 37941, 77105, 67150, 81982, 13525, 94355, 57789, 27438, 34059, 1246, 71643, 103883, 70229, 82482, 88154, 108326, 87485, 80233, 62329, 65624, 15996, 16074, 81628, 14118, 81289, 44446, 1185, 82786, 14104, 31483, 36299, 81768, 47146, 75660, 84168, 24671, 78524, 85390, 27228, 90311, 100622, 78513, 104447, 62752, 23384, 8892, 5349, 75907, 36451, 74383, 63879, 103410, 25520, 66731, 77091, 15416, 58806, 410, 36454, 45408, 67425, 9404, 19248, 37772, 102526, 44466, 23161, 94253, 65625, 16005, 53266, 30896, 68666, 16722]


Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,CommunityOwnedDate,LastEditorDisplayName,FavoriteCount,CleanBody,Tokens,Words,MeaningfullWords,Lenght,QueryScore,MetadataScore
22952,36450,1,2018-08-04T06:36:04.657,75,137355.0,<p>What is the difference between Gradient Des...,57082.0,2021-02-07T20:51:58.520,What is the difference between Gradient Descen...,<machine-learning><neural-network><deep-learni...,...,,,,What is the difference between Gradient Descen...,"[what, is, the, difference, between, gradient,...","[what, is, the, difference, between, gradient,...","[difference, gradient, descent, stochastic, gr...",27,21.527701,748.774413
361,410,1,2014-06-16T18:08:38.623,114,120202.0,<p>I'm currently working on implementing Stoch...,890.0,2020-01-31T16:28:25.547,Choosing a learning rate,<machine-learning><neural-network><deep-learni...,...,,,,I'm currently working on implementing Stochast...,"[i, m, currently, working, on, implementing, s...","[i, m, currently, working, on, implementing, s...","[currently, working, implementing, stochastic,...",159,15.629559,522.300232
24067,37941,1,2018-09-07T18:15:11.690,10,20279.0,"<p>To my understanding, the SGD classifier, an...",58736.0,2018-09-07T18:31:58.697,What is the difference between SGD classifier ...,<machine-learning><logistic-regression><gradie...,...,,,,"To my understanding, the SGD classifier, and L...","[to, my, understanding, the, sgd, classifier, ...","[to, my, understanding, the, sgd, classifier, ...","[understanding, sgd, classifier, logistic, reg...",73,19.106045,374.355965
1109,1246,1,2014-10-10T13:34:11.543,10,8760.0,<p>let's assume that I want to train a stochas...,2576.0,2014-11-21T11:50:47.717,Stochastic gradient descent based on vector op...,<python><gradient-descent><regression>,...,,,,let's assume that I want to train a stochastic...,"[let, s, assume, that, i, want, to, train, a, ...","[let, s, assume, that, i, want, to, train, a, ...","[let, assume, want, train, stochastic, gradien...",374,18.367978,279.777367
8191,14118,1,2016-09-20T21:27:12.427,4,9423.0,<p>I wonder when to use linear regression with...,21254.0,2016-09-21T00:46:01.750,Linear regression - LMS with gradient descent ...,<machine-learning><linear-regression>,...,,,,I wonder when to use linear regression with st...,"[i, wonder, when, to, use, linear, regression,...","[i, wonder, when, to, use, linear, regression,...","[wonder, use, linear, regression, stochastic, ...",91,17.785494,276.899834
7770,13525,1,2016-08-18T11:35:25.867,3,3901.0,"<p>I was reading <a href=""http://rads.stackove...",13100.0,2022-08-15T12:02:09.390,Is there any book for modern optimization in P...,<beginner><tools><career><reference-request><b...,...,,,,I was reading Modern Optimization with R (Use ...,"[i, was, reading, modern, optimization, with, ...","[i, wa, reading, modern, optimization, with, r...","[wa, reading, modern, optimization, r, use, r,...",36,18.758954,224.161699
51536,80232,1,2020-08-13T14:00:00.913,4,1259.0,"<p>Which of the following is true, given the o...",85353.0,2020-08-13T14:36:28.063,Does convergence of loss function is always gu...,<loss-function><optimization>,...,,,,"Which of the following is true, given the opti...","[which, of, the, following, is, true, given, t...","[which, of, the, following, is, true, given, t...","[following, true, given, optimal, learning, ra...",128,22.403807,190.690709
39984,62329,1,2019-10-28T19:48:04.397,0,956.0,<p>Is backpropagation a learning method or an ...,83406.0,2019-11-27T23:02:00.283,Back-propagation and stochastic gradient descent,<machine-learning><neural-network><gradient-de...,...,,,,Is backpropagation a learning method or an opt...,"[is, backpropagation, a, learning, method, or,...","[is, backpropagation, a, learning, method, or,...","[backpropagation, learning, method, optimisati...",20,18.047447,141.433776
34094,53047,1,2019-06-01T14:17:40.917,4,265.0,"<p>As I know, Gradient Descent has three varia...",56922.0,2020-12-14T16:54:45.387,How is Stochastic Gradient Descent used like M...,<optimization><gradient-descent>,...,,,,"As I know, Gradient Descent has three variants...","[as, i, know, gradient, descent, has, three, v...","[a, i, know, gradient, descent, ha, three, var...","[know, gradient, descent, ha, three, variant, ...",120,24.81127,132.319165
2033,5349,1,2015-03-18T17:25:32.460,4,801.0,<p>I'm not sure which word to use to different...,8714.0,2015-12-06T12:17:40.473,"Terminology: SOMs, batch learning, online lear...",<machine-learning><gradient-descent>,...,,,,I'm not sure which word to use to differentiat...,"[i, m, not, sure, which, word, to, use, to, di...","[i, m, not, sure, which, word, to, use, to, di...","[sure, word, use, differentiate, self, organiz...",147,16.159323,120.091676
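
The effect of the view-count boost can be checked against two rows of the table above. The sketch below recomputes `MetadataScore` for posts 53047 and 36450 from their `QueryScore` and `ViewCount` values; `boosted_score` is an illustrative helper name, not a function of the notebook.

```python
import math

def boosted_score(query_score, view_count, exponent=0.3):
    """Multiply the text score by view_count ** exponent; keep it unchanged if views are missing."""
    if math.isnan(view_count):
        return query_score
    return query_score * math.pow(view_count, exponent)

# (QueryScore, ViewCount) taken from the result table
post_53047 = boosted_score(24.81127, 265.0)       # best BM25 score, few views
post_36450 = boosted_score(21.527701, 137355.0)   # lower BM25 score, many views

print(round(post_53047, 1), round(post_36450, 1))
print(boosted_score(10.0, float("nan")))  # missing view count: score is left as-is
```

Post 53047 has the highest BM25 score of the candidates but drops to ninth place, because the factor `ViewCount**0.3` is roughly 5 for 265 views and roughly 35 for 137355 views. The exponent therefore controls how strongly popularity can override textual relevance.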


## Visualising Results

In [None]:
def visualize_output():
    # TODO
    
    return

## Querying

In [None]:
def make_query(natural_query):
    # TODO

    return

## Scoring

In [None]:
# Not sure whether to keep this section

## Testing

In [None]:
# TODO