# Search Engine Experiments

Test of different tokenization methods and Word Embedding for the search engine of the articles of Equinox by Asesoftware

## CSV of Articles

CSV Columns: “article_name”, “enumeration_in_article”, “content” 
“stringWithAllText”

In [118]:
import pandas as pd
df = pd.read_csv("articles_paragraphs.csv")

In [119]:
df_eng = df[df['language_code'] =='en'].reset_index()

In [120]:
df_eng

Unnamed: 0.1,index,Unnamed: 0,article_name,content,enumeration_in_article,file_id,language_code
0,0,0,Enhance medicine using AI,"During the last decades, AI transformed multip...",0,0,en
1,1,1,Enhance medicine using AI,The first time AI helped humans research was i...,1,0,en
2,2,2,Enhance medicine using AI,AI is turning the drug-discovery paradigm upsi...,2,0,en
3,3,3,Enhance medicine using AI,Another contribution of AI to this field was m...,3,0,en
4,4,4,Enhance medicine using AI,Then this information is used to create a repr...,4,0,en
...,...,...,...,...,...,...,...
383,537,537,BI for making decisions,"Data Analysis Reporting: In this step, data vi...",11,20,en
384,538,538,BI for making decisions,Decision Making: This last step is where the r...,12,20,en
385,539,539,BI for making decisions,BI can have different applications for one org...,13,20,en
386,540,540,BI for making decisions,"In conclusion, BI is important for organizatio...",14,20,en


## Data Preprocessing and Tokenization

### Lemma Whitespace Tokenization

In [121]:
import pandas as pd
import string
import spacy

'''
In this example, we use the Spacy library to preprocess and tokenize the text, 
lowercasing the text, removing punctuation, lemmatizing the words, and removing stopwords 
and short words. We then apply this function to each paragraph in the 'content' column of the CSV file using a for loop, 
and append the resulting list of tokens to a list of lists. The final result is a list of lists, where each 
sublist contains the tokens of each paragraph.

'''

# load spacy nlp model
nlp = spacy.load('en_core_web_sm')

# define function for pre-processing and tokenization
def preprocess_text_lemma(text):
    # lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # lemmatize
    doc = nlp(text)
    lemmatized_text = [token.lemma_ for token in doc]
    # remove stopwords and short words
    stopwords = spacy.lang.en.stop_words.STOP_WORDS
    tokens = [token for token in lemmatized_text if token not in stopwords and len(token) > 2]
    return tokens

# apply pre-processing and tokenization to the 'content' column of each row
tokenized_paragraphs_lemma_eng = []
for paragraph in df_eng['content']:
    tokens = preprocess_text_lemma(paragraph)
    tokenized_paragraphs_lemma_eng.append(tokens)

# print the resulting list of lists of tokens
print(tokenized_paragraphs_lemma_eng)


[['decade', 'transform', 'multiple', 'field', 'knowledge', 'medicine', 'transformation', 'different', 'way', 'enhance', 'medicine', 'use', 'article', 'introduce', 'help', 'discover', 'new', 'drug', 'understand', 'mystery', 'cancer', 'learn', 'billion', 'relation', 'different', 'research', 'resource'], ['time', 'help', 'human', 'research', '2007', 'adam', 'robot', 'generate', 'hypothesis', 'gene', 'code', 'critical', 'enzyme', 'catalyze', 'reaction', 'yeast', 'saccharomyce', 'cerevisiae', 'adam', 'use', 'robotic', 'test', 'prediction', 'lab', 'physically', 'researcher', 'university', 'aberystwyth', 'cambridge', 'independently', 'test', 'adamsadam', 'hypothese', 'function', 'gene', 'new', 'accurate', 'wrong', 'example', 'multiple', 'application', 'field', 'ready', 'learn'], ['turn', 'drugdiscovery', 'paradigm', 'upside', 'use', 'patientdriven', 'biology', 'datum', 'derive', 'morepredictive', 'hypothesis', 'traditional', 'trialanderror', 'approach', 'example', 'boston', 'berg', 'biotechno

#### Agregamos los vectores con su respectivo tokenization


### PoS tagging

In [122]:
import nltk
from gensim.models import Word2Vec
import numpy as np

Creamos las tuplas con sus tags

In [123]:
nltk.download('averaged_perceptron_tagger')

tagged_sentences_eng = [nltk.pos_tag((paragraph)) for paragraph in tokenized_paragraphs_lemma_eng]

# Entrenamiento del modelo Word2Vec
model_eng = Word2Vec(tagged_sentences_eng,  window=20, min_count=1, workers=4)

# Utilizar el modelo Word2Vec para encontrar palabras similares



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/felipe/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [165]:
# Acceder a los vectores 
vectors = model_eng.wv.vectors # Acceder a las palabras
words = model_eng.wv.index_to_key

Creamos los vectores por cada parrafo

In [124]:
# Calculate the meaning vector per paragraph
paragraph_tag_vectors_lemma = []
for paragraph_tokens in tagged_sentences_eng:
    vectors = []
    ind = 0
    for ind, token in enumerate(paragraph_tokens):
        ind = ind + 1
        if ind in range(len(tagged_sentences_eng)):
            vectors.append(model_eng.wv[ind])
    if len(vectors) > 0:
        paragraph_tag_vectors_lemma.append(np.mean(vectors, axis=0))
    else:
        paragraph_tag_vectors_lemma.append(np.zeros(model_eng.vector_size))

In [125]:
df_eng['vector_tag'] = paragraph_tag_vectors_lemma

In [126]:
df_eng.head()

Unnamed: 0.1,index,Unnamed: 0,article_name,content,enumeration_in_article,file_id,language_code,vector_tag
0,0,0,Enhance medicine using AI,"During the last decades, AI transformed multip...",0,0,en,"[-0.009098507, 0.018190403, 0.011062439, 0.000..."
1,1,1,Enhance medicine using AI,The first time AI helped humans research was i...,1,0,en,"[-0.008574716, 0.015772605, 0.009261614, -0.00..."
2,2,2,Enhance medicine using AI,AI is turning the drug-discovery paradigm upsi...,2,0,en,"[-0.008642484, 0.016703138, 0.010268111, 0.000..."
3,3,3,Enhance medicine using AI,Another contribution of AI to this field was m...,3,0,en,"[-0.011193633, 0.020102842, 0.011527595, -0.00..."
4,4,4,Enhance medicine using AI,Then this information is used to create a repr...,4,0,en,"[-0.008791533, 0.01816993, 0.010745333, 0.0005..."


## Similarity Function

In [127]:
import numpy as np
from gensim.models import KeyedVectors


def cosine_similarity_list(vectors_list, query_vector):
    #Compute the cosine similarity between the vector representation of the input and the vector representations of each sentence in the text
    similarity_scores = []
    for vector in vectors_list:
        score = query_vector.dot(vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
        similarity_scores.append(score)

    # Sort the sentences in descending order of their cosine similarity to the input and return the top-N most similar sentences
    n = 100
    most_similar_sentences = [[vectors_list[idx],idx] for idx in np.argsort(similarity_scores)[::-1][:n] if np.sum(vectors_list[idx]) != 0]

    return most_similar_sentences[:20]


In [128]:
cosine_similarity_list(df_eng['vector_tag'],df_eng['vector_tag'][0])

[[array([-9.09850746e-03,  1.81904025e-02,  1.10624386e-02,  3.33047123e-04,
          4.40837489e-03, -3.95729207e-02,  1.85493790e-02,  5.43353558e-02,
         -2.15476006e-02, -1.66466944e-02, -6.45605335e-03, -4.48495112e-02,
         -5.53743821e-03,  8.37479625e-03,  8.25444516e-03, -1.49675775e-02,
          4.55922028e-03, -2.26176269e-02,  7.70997140e-05, -5.08735739e-02,
          1.01740081e-02,  1.04416078e-02,  2.60108970e-02, -8.43573269e-03,
         -1.24019580e-02,  2.29363516e-03, -1.60197038e-02, -3.01229525e-02,
         -2.49278583e-02,  3.45556135e-03,  2.49259677e-02,  9.34775919e-03,
          1.57137364e-02, -5.32893743e-03, -9.95000824e-03,  3.08686271e-02,
          4.79548518e-03, -2.93166023e-02, -1.44181484e-02, -5.16894870e-02,
          1.92584482e-03, -1.70211960e-02, -7.24708708e-03, -2.28232704e-04,
          2.74234675e-02, -1.66440755e-02, -1.75678413e-02, -3.39111965e-03,
          1.15835546e-02,  1.24777183e-02,  9.14896745e-03, -3.10296435e-02,

## Prompt Preprocessing, Tokenization and Embedding

### Lemma with Pos-tagging

In [138]:
userPrompt="artificial "
tokenized_prompt = preprocess_text_lemma(userPrompt)
tagged_prompt = nltk.pos_tag(tokenized_prompt)
print(tagged_prompt)
promptVector_lemma_eng = np.zeros((model_eng.vector_size,))

word_count  = 0
ind = 0
for ind,token in enumerate(tokenized_prompt):
    ind = ind + 1
    if token in tokenized_prompt:
        promptVector_lemma_eng += model_eng.wv[ind]
        word_count += 1
        print(word_count,ind)

if word_count > 0:
    promptVector_lemma_eng /= word_count
    





[('artificial', 'JJ')]
['artificial']
1 1


### Testing PoS tagging

In [134]:
var = cosine_similarity_list(df_eng['vector_tag'],promptVector_lemma_eng)

## Similarity Test

In [135]:
var=np.array(var)
var[:,1]
possible_solutions=df_eng.iloc[var[:,1]]

  var=np.array(var)


In [136]:
for paragraph in possible_solutions["content"]:
    print(paragraph, "\n")

And what about these? Perhaps the sensations are better than the previous ones. 

BI can have different applications for one organization; among them, we can find [2]: 

For example, if the username is myjetson and IP is 192.168.101.15, then the command will be: 

If you want to read more of our content follow the link : 

Despite their differences, natural thinking and AI can complement each other. 

Now, why use AI to improve the user experience? Because Artificial Intelligence (AI) also has the potential to transform the way healthcare is delivered and can lead to better outcomes and improve productivity and efficiency in service delivery. 

These systems are divided into two main groups: task-oriented systems and chatbots. The former focus on accomplishing valuable tasks, while the latter focus on maintaining a conversation with the user without implementing any specific task. 

Robotic process automation (RPA) is a technology that develops robots. These robots are able to emulate 