# DAY 3: Student version

**Machine Learning NLP**

The goal of this session is to improve the search engine using NLP features.

This notebook guides you through different techniques to explore. You are expected to be inventive and to improve on the techniques introduced.

First, let's import the useful packages and load the data.

## Installs

In [87]:
! pip install sentence-transformers --quiet

In [88]:
!pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Imports

In [89]:
import os
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import find

from sentence_transformers import SentenceTransformer

import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load Data

In [90]:
# Only if you use Colab
from google.colab import drive
drive.mount('/content/drive')

import os

# TODO:
DATA_PATH = 'drive/MyDrive/EI_web_data/Data' 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [91]:
posts = pd.read_xml(os.path.join(DATA_PATH, 'Posts.xml'), parser="etree", encoding="utf8")

## Data Cleaning

Implement a function to clean the posts. 

You can reuse what you have used in the Day 2 notebook or improve it.

In [103]:
def clean_post(text:str)->str:
  # Non-string bodies (e.g. NaN) fall through and return None; those rows are dropped in a later cell
  if isinstance(text, str):
    soup = BeautifulSoup(text, 'html.parser')

    # Remove all script and style tags
    for script in soup(['script', 'style']):
      script.extract()

    # Get the plain text without tags
    texte_propre = soup.get_text()
    
    # Collapse extra whitespace and line breaks
    texte_propre = re.sub(r'\s+', ' ', texte_propre)

    return texte_propre

In [106]:
posts['cleaned_body'] = posts.Body.apply(clean_post)



In [107]:
# Drop the rows where 'cleaned_body' is not a string
posts = posts.drop(posts[~posts['cleaned_body'].apply(lambda x: isinstance(x, str))].index)

In [159]:
posts.columns

Index(['Id', 'PostTypeId', 'CreationDate', 'Score', 'ViewCount', 'Body',
       'OwnerUserId', 'LastActivityDate', 'Title', 'Tags', 'AnswerCount',
       'CommentCount', 'ClosedDate', 'ContentLicense', 'AcceptedAnswerId',
       'LastEditorUserId', 'LastEditDate', 'ParentId', 'OwnerDisplayName',
       'CommunityOwnedDate', 'LastEditorDisplayName', 'FavoriteCount',
       'cleaned_body'],
      dtype='object')

In [109]:
posts

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastActivityDate,Title,Tags,...,ContentLicense,AcceptedAnswerId,LastEditorUserId,LastEditDate,ParentId,OwnerDisplayName,CommunityOwnedDate,LastEditorDisplayName,FavoriteCount,cleaned_body
0,5,1,2014-05-13T23:58:30.457,9,898.0,<p>I've always been interested in machine lear...,5.0,2014-05-14T00:36:31.077,How can I do simple machine learning without h...,<machine-learning>,...,CC BY-SA 3.0,,,,,,,,,I've always been interested in machine learnin...
1,7,1,2014-05-14T00:11:06.457,4,478.0,"<p>As a researcher and instructor, I'm looking...",36.0,2014-05-16T13:45:00.237,What open-source books (or other materials) pr...,<education><open-source>,...,CC BY-SA 3.0,10.0,97.0,2014-05-16T13:45:00.237,,,,,,"As a researcher and instructor, I'm looking fo..."
2,9,2,2014-05-14T00:36:31.077,5,,"<p>Not sure if this fits the scope of this SE,...",51.0,2014-05-14T00:36:31.077,,,...,CC BY-SA 3.0,,,,5.0,,,,,"Not sure if this fits the scope of this SE, bu..."
3,10,2,2014-05-14T00:53:43.273,13,,"<p>One book that's freely available is ""The El...",22.0,2014-05-14T00:53:43.273,,,...,CC BY-SA 3.0,,,,7.0,,,,,"One book that's freely available is ""The Eleme..."
4,14,1,2014-05-14T01:25:59.677,26,1901.0,<p>I am sure data science as will be discussed...,66.0,2020-08-16T13:01:33.543,Is Data Science the Same as Data Mining?,<data-mining><definitions>,...,CC BY-SA 3.0,29.0,322.0,2014-06-17T16:17:20.473,,,,,,I am sure data science as will be discussed in...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75722,119962,1,2023-03-04T20:06:06.820,0,8.0,<p>I am implementing a neural network of arbit...,147597.0,2023-03-04T20:22:12.523,Back Propagation on arbitrary depth network wi...,<neural-network><backpropagation>,...,CC BY-SA 4.0,,147597.0,2023-03-04T20:22:12.523,,,,,,I am implementing a neural network of arbitrar...
75723,119963,1,2023-03-04T20:12:19.677,0,10.0,<p>I am using KNN for a regression task</p>\n<...,147598.0,2023-03-04T20:12:19.677,Evaluation parameter in knn,<regression><k-nn>,...,CC BY-SA 4.0,,,,,,,,,I am using KNN for a regression task It's like...
75724,119964,1,2023-03-05T00:14:12.597,0,7.0,<p>I have developed a small encoding algorithm...,44581.0,2023-03-05T00:14:12.597,Can I use zero-padded input and output layers ...,<deep-learning><convolutional-neural-network>,...,CC BY-SA 4.0,,,,,,,,,I have developed a small encoding algorithm th...
75725,119965,1,2023-03-05T00:43:12.213,0,5.0,"<p>To my understanding, optimizing a model wit...",84437.0,2023-03-05T00:43:12.213,Why does cross validation and hyperparameter t...,<cross-validation><hyperparameter-tuning>,...,CC BY-SA 4.0,,,,,,,,,"To my understanding, optimizing a model with k..."


You can also implement a function that cleans the user's query.

This step is optional (if you don't think it is necessary, just return the query).

In [110]:
def clean_query(text:str)->str:
    #TODO
    return cleaned_query
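
As a starting point, a minimal sketch of a query cleaner could look like the following. It assumes the query may contain stray HTML tags and irregular whitespace; the name `clean_query_sketch` is hypothetical, and the steps are only a suggestion that you should adapt to your own `clean_post`.

```python
import re


def clean_query_sketch(text: str) -> str:
    """Strip HTML-like tags and collapse whitespace in a user query."""
    if not isinstance(text, str):
        return ""
    # Replace anything that looks like an HTML tag with a space
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Whatever you choose, the query should go through the same kind of normalization as the posts, otherwise the two vocabularies may not line up.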

## Text specific metadata

What metadata can you get from the text at your disposal? Which ones are relevant?

In [111]:
#TODO
# The relevant metadata are the Title, the Tags, the CreationDate and the LastEditDate/LastActivityDate,
# the ViewCount, the FavoriteCount, and the OwnerUserId (to be linked with the user table)
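
Beyond the existing columns, some metadata can be derived from the text itself. The sketch below is one possible illustration, assuming the raw `Body` is HTML that marks code with `<code>` and links with `<a href=...>`; the function name `text_metadata` and the chosen features are hypothetical.

```python
import re


def text_metadata(body_html: str) -> dict:
    """Derive a few simple features from the raw HTML body of a post."""
    # Code snippets (inline or in blocks) are wrapped in <code> tags
    code_blocks = re.findall(r"<code>.*?</code>", body_html, flags=re.DOTALL)
    # Links are anchor tags carrying an href attribute
    links = re.findall(r"<a\s[^>]*href=", body_html)
    # Remove the tags before counting words
    text = re.sub(r"<[^>]+>", " ", body_html)
    words = re.findall(r"\w+", text)
    return {
        "n_words": len(words),
        "n_code_blocks": len(code_blocks),
        "n_links": len(links),
    }
```

Such features could be added as new columns, for instance with `posts.Body.apply(text_metadata)` on the rows whose body is a string.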

## Classic Preprocessing

The goal for this part is to implement a classic vectorization (Count vectorizer, tfidf...).

You can do it on your own or use scikit-learn.

Hints: pay attention to stopwords, additional preprocessing steps and vectorization techniques.


In [112]:
vectorizer = TfidfVectorizer()
docs = posts.cleaned_body.values.tolist()  # list of strings
# Fit on the corpus and keep the resulting document-term matrix
vectors = vectorizer.fit_transform(docs)
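
If you prefer to do it on your own, the sketch below shows what a hand-written TF-IDF with stopword removal might look like. It is an illustration under assumptions, not a replacement for scikit-learn: the stopword list is a tiny made-up sample, the tokenizer is a simple regex, and the helper names are hypothetical. The idf uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is also scikit-learn's default.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a real list in practice)
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "what"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def fit_tfidf(docs):
    """Return the idf weight of every term seen in the corpus."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def transform_tfidf(text, idf):
    """Return a sparse, L2-normalized TF-IDF vector as a {term: weight} dict."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}
```

With scikit-learn, a comparable effect is available through the `stop_words` parameter of `TfidfVectorizer`.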

Write a function that applies the same process to the query.


In [113]:
def vectorize_query(query : str, vectorizer=vectorizer):
    """vectorizes the query
    Args:
        query (str): query string
        vectorizer (optional): Defaults to vectorizer.

    Returns:
        query vectorized
    """
    query_vectorized=vectorizer.transform([query])

    return query_vectorized

Determine a way to use this vectorization to suggest the items in the database that are closest to the entry.

In [114]:
def vectorizer_search(query : str,
                      vectors=vectors,
                      vectorizer=vectorizer) -> list:
    # Vectorize the query
    query_vectorized = vectorize_query(query, vectorizer)

    # Calculate the similarity between the query vector and all database vectors
    similarity_scores = cosine_similarity(query_vectorized, vectors)

    # Get the indices of the closest items based on similarity scores
    closest_indices = np.argsort(similarity_scores, axis=1)[0][::-1]

    # Get the closest items from the database
    closest_items = [docs[i] for i in closest_indices]

    return closest_items

In [115]:
entry = 'what is stochastic gradient descent ?'

In [116]:
vectorizer_search(entry)[:3]

['As I know, Gradient Descent has three variants which are: 1- Batch Gradient Descent: processes all the training examples for each iteration of gradient descent. 2- Stochastic Gradient Descent: processes one training example per iteration. Hence, the parameters are being updated even after one iteration in which only a single example has been processed. 3- Mini Batch gradient descent: which works faster than both batch gradient descent and stochastic gradient descent. Here, b examples where b < m are processed per iteration. But in some cases they use Stochastic Gradient Descent and they define a batch size for training which is what I am confused about. Also, what about Adam, AdaDelta & AdaGrad, are they all mini-batch gradient descent or not? ',
 'What is the difference between Gradient Descent and Stochastic Gradient Descent? I am not very familiar with these, can you describe the difference with a short example? ',
 'Both algorithms are quite similar. The only difference comes whi

How can you improve this approach?

Answer here

## Semantic similarity

Some NLP methods go beyond word-by-word analysis by taking the context of the terms into account. There are several such methods, for example Word2vec and BERT.

From the Sentence Transformers documentation (https://www.sbert.net/docs/pretrained_models.html), choose the pre-trained model that you think is the most appropriate. Justify your choice.

In [117]:
sentence_transformer_model = 'all-MiniLM-L6-v2'

Answer Here

In [118]:
# Domain relevance: since the database consists of documents related to data science,
# a model that has been fine-tuned for NLI tasks on diverse texts should capture
# the subtleties and context specific to the data science domain.


⚠ To use Sentence Transformers, it is recommended to enable the GPU in Google Colab.

In [119]:
MODEL_ST = SentenceTransformer(sentence_transformer_model)

Use this model to encode the data in the database.

In [120]:
#embeddings = MODEL_ST.encode(posts.cleaned_body.values, normalize_embeddings=True)

*If this process is slow, you can save this array in case you need to load it again*

In [121]:
import pickle

#with open(os.path.join(DATA_PATH, 'embeddings.pkl'), 'wb') as file:
#   pickle.dump(embeddings, file)

# Load the embeddings saved by a previous run
with open(os.path.join(DATA_PATH, 'embeddings.pkl'),'rb') as file:
    embeddings=pickle.load(file)
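
To avoid commenting and uncommenting the save and load steps by hand, one option is a small caching helper. The sketch below is only a suggestion; `load_or_compute` is a hypothetical name, and `compute` stands for any zero-argument function that produces the array (for example a wrapper around the `encode` call).

```python
import os
import pickle


def load_or_compute(path, compute):
    """Load a pickled object from `path`, or compute and save it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = compute()
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result
```

Keep in mind that `pickle` should only be used to load files you created yourself, and that the cache must be deleted whenever the cleaning function or the model changes.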

Make a function that transforms the input

In [122]:
def encode_query(query : str) ->  np.ndarray:
    encoded_query = MODEL_ST.encode([query], normalize_embeddings=True)
    return encoded_query

Which distance is most relevant to measure the distance between the input and the data?

In [123]:
# cosine similarity
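
Since the embeddings are encoded with `normalize_embeddings=True`, it may help to see why the choice of measure matters less than it seems: for unit vectors, cosine similarity equals the dot product, and the squared Euclidean distance is `2 - 2 * cosine`. The toy vectors below are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_ab = cosine(a, b)            # cosine on the raw vectors
dot_unit = dot(a_unit, b_unit)   # dot product on the normalized vectors
sq_dist_unit = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))
```

All three quantities therefore produce the same ranking of documents when the embeddings are normalized.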

Write a function that returns a matrix containing information about the similarity between the query and the data

In [124]:
def similarity(query, embeddings=embeddings):
    encoded_query=encode_query(query)
    return cosine_similarity(encoded_query, embeddings)

In [125]:
query = 'what is stochastic gradient descent ?'
matrix_similarity = similarity(query, embeddings)

How do you determine which documents in the data set most closely match the input?

In [126]:
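def ordre_en_fonction_similarité(matrix_similarity):
    # Flatten first: cosine_similarity returns a (1, n_docs) matrix, and reversing
    # a 2D array with [::-1] would flip its rows instead of the sort order
    scores = np.ravel(matrix_similarity)

    # Get the indices of the documents sorted by decreasing similarity score
    sorted_indices = np.argsort(scores)[::-1]

    return sorted_indices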

In [127]:
ordre_en_fonction_similarité([0.6, 0.8, 0.7])
# expected: [1, 2, 0]

array([1, 2, 0])

Put it all together in a function.

In [128]:
def closest_semantic_doc(query, embeddings=embeddings, top_n=10):
    matrix_similarity=similarity(query, embeddings)
    closest_posts=ordre_en_fonction_similarité(matrix_similarity)[:top_n]
    return closest_posts
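
It can be useful to return the scores along with the indices, and to avoid sorting the whole corpus when only a few results are needed. The sketch below is one possible way to do this with the standard library; `top_n_with_scores` is a hypothetical name, and it assumes a one-dimensional sequence of scores (for the `(1, n_docs)` matrix returned by `similarity`, pass its first row).

```python
import heapq


def top_n_with_scores(scores, top_n=10):
    """Return (index, score) pairs for the top_n highest scores, best first."""
    # nlargest keeps only top_n candidates instead of sorting everything
    best = heapq.nlargest(top_n, range(len(scores)), key=lambda i: scores[i])
    return [(i, float(scores[i])) for i in best]
```

For example, `top_n_with_scores(matrix_similarity[0], top_n=10)` would give the ten best positions, which can then be mapped back to rows with `posts.iloc`.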

What methods could be used to improve the recommendations of this algorithm?

Answer here

In [129]:
#We could take the metadata into account and try to merge different search methods (for example BM25 and semantic search)

## Text clustering (BONUS)

We can use topic modeling techniques to identify groups of texts among our document base and classify the input to restrict the application of the proximity calculations seen previously.

### LDA

Latent Dirichlet Allocation is a topic modeling algorithm that allows soft clustering. Soft clustering means that LDA does not assign an input to a single cluster, but gives a probabilistic score for each identified cluster. This decomposition makes it possible to identify topics within the documents.

To run this algorithm, you need to vectorize your data (you can reuse the vectorization you did previously or make another one).

In [145]:
# Vectorize the documents using raw term counts (CountVectorizer, not TF-IDF)
vectorizer_lda = CountVectorizer()

# test sample
documents=posts.sample(100).cleaned_body.values

# Fit and Transform the documents
train_data = vectorizer_lda.fit_transform(documents)

You can use Gensim or scikit-learn to compute LDA.

In [146]:
!pip install lda

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [147]:
from lda import LDA

In [153]:
#Initialize the LDA model
model = LDA(n_topics=10, n_iter=500)

#Training
model.fit(train_data)

# Get the topic-word distributions
topic_word_distributions = model.topic_word_

# Infer the document-topic distributions
document_topic_distributions = model.transform(train_data)

# Print the top words for each topic
for topic_idx, topic_words in enumerate(model.topic_word_):
    top_words = [vectorizer_lda.get_feature_names_out()[i] for i in topic_words.argsort()[:-6:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Topic 0: np, for, reshape, w1, w2
Topic 1: of, the, each, value, features
Topic 2: topic, in, tensorflow, file, tf_files
Topic 3: the, for, network, is, of
Topic 4: the, to, is, and, in
Topic 5: self, nn, kernel_size, in, model
Topic 6: the, activation, layer, model, as
Topic 7: tf, time, model, n_hidden, plt
Topic 8: theta, am, of, in, hidden
Topic 9: of, for, as, or, use


In [156]:
from gensim import corpora, models, matutils

In [158]:
# Initialize a separate TfidfVectorizer (a new name avoids overwriting the `vectorizer` fitted on the full corpus)
vectorizer_gensim = TfidfVectorizer()

# Create the TF-IDF matrix
dtm_tfidf = vectorizer_gensim.fit_transform(documents)

# Convert the TF-IDF matrix to Gensim corpus
corpus = matutils.Sparse2Corpus(dtm_tfidf, documents_columns=False)

# Create the Gensim dictionary
id2word = dict((i, word) for i, word in enumerate(vectorizer_gensim.get_feature_names_out()))

# Train the LDA model
lda_model = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=10)

# Print the topics and their top words
for topic in lda_model.show_topics():
    print(topic)


(0, '0.002*"10" + 0.002*"bin" + 0.002*"83" + 0.002*"training" + 0.002*"label" + 0.001*"word" + 0.001*"learn" + 0.001*"column" + 0.001*"performance" + 0.001*"probably"')
(1, '0.003*"tf" + 0.002*"age" + 0.002*"nn" + 0.001*"img" + 0.001*"torch" + 0.001*"google" + 0.001*"errors" + 0.001*"soon" + 0.001*"scout" + 0.001*"isnt"')
(2, '0.002*"the" + 0.002*"move" + 0.002*"partition" + 0.002*"because" + 0.002*"each" + 0.002*"is" + 0.002*"independence" + 0.002*"sigmoid" + 0.002*"hidden" + 0.002*"function"')
(3, '0.003*"the" + 0.002*"theta" + 0.002*"self" + 0.002*"sentence" + 0.002*"to" + 0.002*"in" + 0.002*"bert" + 0.002*"covid" + 0.002*"argument" + 0.002*"semantic"')
(4, '0.002*"levels" + 0.002*"killed" + 0.002*"test" + 0.002*"tensorflow" + 0.002*"counter" + 0.002*"recent_school_shootings" + 0.001*"carts" + 0.001*"the" + 0.001*"sample" + 0.001*"dropout"')
(5, '0.002*"orange" + 0.002*"iloc" + 0.002*"rattle" + 0.001*"x_init" + 0.001*"chart" + 0.001*"convolution" + 0.001*"strides" + 0.001*"test_inde

Assign a main topic to each document

Write a function that assigns a topic to the query


In [None]:
def get_topic_query(query, vectorizer=vectorizer_lda, lda_model=lda_model) -> int:
    #TODO
    return topic_query
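
As a rough illustration of both tasks, the sketch below assigns a main topic by taking the most probable topic of a document, and scores a query against each topic's word distribution. This is a naive approximation under a uniform topic prior, not proper LDA inference; the function names and the plain-list inputs are assumptions. In practice you would rather vectorize the query and call the trained model (for example `model.transform` from the `lda` package used above).

```python
import math


def main_topic(doc_topic_row):
    """Index of the most probable topic for one document."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])


def naive_topic_for_query(query_tokens, topic_word, vocabulary):
    """Pick the topic maximizing the sum of log p(word | topic).

    topic_word: one row per topic, one probability per vocabulary word.
    vocabulary: dict mapping a word to its column index.
    Returns None when no query word is in the vocabulary.
    """
    word_ids = [vocabulary[t] for t in query_tokens if t in vocabulary]
    if not word_ids:
        return None
    # A small epsilon guards against log(0) for unsmoothed distributions
    scores = [sum(math.log(row[i] + 1e-12) for i in word_ids) for row in topic_word]
    return max(range(len(scores)), key=lambda k: scores[k])
```

Here `topic_word` plays the role of `model.topic_word_`, and `vocabulary` the role of `vectorizer_lda.vocabulary_`.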

## Merge methods

Write an algorithm to merge the methods seen. Which methods should you use? How can you check whether they are relevant?

Answer here

In [None]:
def nlp_search_algorithm(query,
                         topic_documents=topic_documents,
                         vectors=vectors,
                         vectorizer=vectorizer,
                         vectorizer_lda=vectorizer_lda,
                         lda_model=lda_model,
                         embeddings=embeddings,
                         top_n=10
                         )->list:
    #TODO
 
    return matching_posts
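
One simple way to merge several methods is to combine their rankings rather than their raw scores, since TF-IDF and embedding similarities are not on the same scale. The sketch below uses reciprocal rank fusion as an example; it is only one option among many, the function name is hypothetical, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Merge several ranked lists of document ids into one list.

    Each document receives 1 / (k + rank) from every ranking it appears in
    (rank starts at 1); documents are returned by decreasing total.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    ordered = sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
    return ordered[:top_n]
```

The input lists could be, for instance, the top indices returned by the TF-IDF search and by the semantic search for the same query.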




Once you have a list of possible results, you can rank it (you can use one of the ranking algorithms you have written previously or make up a new one):

In [None]:
def rank(possible_results):
    #TODO
    return possible_results
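
As one possible direction, the sketch below ranks candidates by a weighted mix of text similarity and post popularity. Everything about it is an assumption made for illustration: the input format (dicts with `id`, `similarity` and `score` keys), the name `rank_sketch`, the weight of 0.8 and the way the vote count is squashed into [0, 1).

```python
import math


def rank_sketch(possible_results, similarity_weight=0.8):
    """Sort candidates by a mix of similarity and (squashed) vote score.

    possible_results: list of dicts with keys 'id', 'similarity' (0 to 1)
    and 'score' (the post's vote count, possibly negative).
    """
    def combined(item):
        # log1p dampens large vote counts; p / (1 + p) maps them into [0, 1)
        popularity = math.log1p(max(item["score"], 0))
        popularity = popularity / (1 + popularity)
        return similarity_weight * item["similarity"] + (1 - similarity_weight) * popularity

    return sorted(possible_results, key=combined, reverse=True)
```

The weight is a parameter you would want to tune against a set of queries with known relevant posts.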

## Incorporation in the search engine

### Addition of metadata

You should now have a new set of metadata and data to add to your original index. You can load the index you built on Day 2 and add today's work to it.

In [None]:
#load previous data 

#TODO 

In [None]:
# add the new data to the previous index

Hint: if you have changed the preprocessing function at the beginning of this notebook, make sure to update the Clean Body attribute.

### Adaptation to the index format

Adapt your NLP search results to the index format.

In [None]:
def nlp_search_in_index(query,
                        index,
                        args):

    return matching_posts
  

### Compare the new searching and ranking method to the previous ones

Compare in terms of efficiency (precision, completeness, speed, memory usage)
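
To compare methods on precision and completeness, you need a few queries for which you know the relevant posts. Given such a set, precision and recall at rank `k` can be sketched as follows; the function names are hypothetical and the relevance judgments are assumed to be a set of post ids that you built by hand.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Share of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)
```

Speed can be measured separately, for example by timing each search function with `time.perf_counter` over the same list of queries.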

### Merge all methods to make an efficient search algorithm