# Search Engine Experiments

Test of different tokenization methods and Word Embedding for the search engine of the articles of Equinox by Asesoftware

## CSV of Articles

CSV Columns: “article_name”, “enumeration_in_article”, “content” 
“stringWithAllText”

In [1]:
import pandas as pd
df = pd.read_csv("articles_paragraphs.csv")
df_word = pd.read_csv("articles_paragraphs.csv")
df_stem = pd.read_csv("articles_paragraphs.csv")
df_lemma = pd.read_csv("articles_paragraphs.csv")

## Data Preprocessing and Tokenization

### Lemma Whitespace Tokenization

In [2]:
import pandas as pd
import string
import spacy

'''
In this example, we use the Spacy library to preprocess and tokenize the text, 
lowercasing the text, removing punctuation, lemmatizing the words, and removing stopwords 
and short words. We then apply this function to each paragraph in the 'content' column of the CSV file using a for loop, 
and append the resulting list of tokens to a list of lists. The final result is a list of lists, where each 
sublist contains the tokens of each paragraph.

'''

# load spacy nlp model
nlp = spacy.load('en_core_web_sm')

# define function for pre-processing and tokenization
def preprocess_text_lemma(text):
    # lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # lemmatize
    doc = nlp(text)
    lemmatized_text = [token.lemma_ for token in doc]
    # remove stopwords and short words
    stopwords = spacy.lang.en.stop_words.STOP_WORDS
    tokens = [token for token in lemmatized_text if token not in stopwords and len(token) > 2]
    return tokens

# apply pre-processing and tokenization to the 'content' column of each row
tokenized_paragraphs_lemma = []
for paragraph in df['content']:
    tokens = preprocess_text_lemma(paragraph)
    tokenized_paragraphs_lemma.append(tokens)

# print the resulting list of lists of tokens
print(tokenized_paragraphs_lemma)


2023-04-14 17:02:03.691566: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 17:02:03.873350: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-14 17:02:03.887970: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-04-14 17:02:03.888019: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore 

[['decade', 'transform', 'multiple', 'field', 'knowledge', 'medicine', 'transformation', 'different', 'way', 'enhance', 'medicine', 'use', 'article', 'introduce', 'help', 'discover', 'new', 'drug', 'understand', 'mystery', 'cancer', 'learn', 'billion', 'relation', 'different', 'research', 'resource'], ['time', 'help', 'human', 'research', '2007', 'adam', 'robot', 'generate', 'hypothesis', 'gene', 'code', 'critical', 'enzyme', 'catalyze', 'reaction', 'yeast', 'saccharomyce', 'cerevisiae', 'adam', 'use', 'robotic', 'test', 'prediction', 'lab', 'physically', 'researcher', 'university', 'aberystwyth', 'cambridge', 'independently', 'test', 'adamsadam', 'hypothese', 'function', 'gene', 'new', 'accurate', 'wrong', 'example', 'multiple', 'application', 'field', 'ready', 'learn'], ['turn', 'drugdiscovery', 'paradigm', 'upside', 'use', 'patientdriven', 'biology', 'datum', 'derive', 'morepredictive', 'hypothesis', 'traditional', 'trialanderror', 'approach', 'example', 'boston', 'berg', 'biotechno

### Stemmer Tokenization


In [3]:
import pandas as pd
import string
import spacy
from nltk.stem import SnowballStemmer

# load spacy nlp model
nlp = spacy.load('en_core_web_sm')
# load stemmer
stemmer = SnowballStemmer('english')

# define function for pre-processing and tokenization
def preprocess_text_stem(text):
    # lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # stem
    doc = nlp(text)
    stemmed_text = [stemmer.stem(token.text) for token in doc]
    # remove stopwords and short words
    stopwords = spacy.lang.en.stop_words.STOP_WORDS
    tokens = [token for token in stemmed_text if token not in stopwords and len(token) > 2]
    return tokens

# apply pre-processing and tokenization to the 'content' column of each row
tokenized_paragraphs_stem = []
for paragraph in df['content']:
    tokens = preprocess_text_stem(paragraph)
    tokenized_paragraphs_stem.append(tokens)

# print the resulting list of lists of tokens
print(tokenized_paragraphs_stem)


[['dure', 'decad', 'transform', 'multipl', 'field', 'knowledg', 'medicin', 'transform', 'mani', 'differ', 'way', 'enhanc', 'medicin', 'use', 'articl', 'introduc', 'help', 'discov', 'new', 'drug', 'understand', 'mysteri', 'cancer', 'learn', 'billion', 'relat', 'differ', 'research', 'resourc'], ['time', 'help', 'human', 'research', '2007', 'adam', 'robot', 'generat', 'hypothes', 'gene', 'code', 'critic', 'enzym', 'catalyz', 'reaction', 'yeast', 'saccharomyc', 'cerevisia', 'adam', 'use', 'robot', 'test', 'predict', 'lab', 'physic', 'research', 'univers', 'aberystwyth', 'cambridg', 'independ', 'test', 'adamsadam', 'hypothes', 'function', 'gene', 'new', 'accur', 'onli', 'wrong', 'onli', 'exampl', 'multipl', 'applic', 'field', 'readi', 'learn'], ['turn', 'drugdiscoveri', 'paradigm', 'upsid', 'use', 'patientdriven', 'biolog', 'data', 'deriv', 'morepredict', 'hypothes', 'tradit', 'trialanderror', 'approach', 'exampl', 'boston', 'berg', 'biotechnolog', 'compani', 'develop', 'model', 'identifi',

### Word-based Tokenization

In [4]:
import pandas as pd
import string
import spacy

# load spacy nlp model
nlp = spacy.load('en_core_web_sm')

# define function for pre-processing and tokenization
def preprocess_text_word(text):
    # lowercase
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # split into words
    words = text.split()
    # remove stopwords and short words
    stopwords = spacy.lang.en.stop_words.STOP_WORDS
    tokens = [word for word in words if word not in stopwords and len(word) > 2]
    return tokens

# apply pre-processing and tokenization to the 'content' column of each row
tokenized_paragraphs_word = []
for paragraph in df['content']:
    tokens = preprocess_text_word(paragraph)
    tokenized_paragraphs_word.append(tokens)

# print the resulting list of lists of tokens
print(tokenized_paragraphs_word)


[['decades', 'transformed', 'multiple', 'fields', 'knowledge', 'medicine', 'transformation', 'different', 'ways', 'enhance', 'medicine', 'article', 'introduce', 'help', 'discover', 'new', 'drugs', 'understand', 'mysteries', 'cancer', 'learn', 'billion', 'relations', 'different', 'research', 'resources'], ['time', 'helped', 'humans', 'research', '2007', 'adam', 'robot', 'generated', 'hypotheses', 'genes', 'code', 'critical', 'enzymes', 'catalyze', 'reactions', 'yeast', 'saccharomyces', 'cerevisiae', 'adam', 'robotics', 'test', 'predictions', 'lab', 'physically', 'researchers', 'universities', 'aberystwyth', 'cambridge', 'independently', 'tested', 'adamsadams', 'hypotheses', 'functions', 'genes', 'new', 'accurate', 'wrong', 'example', 'multiple', 'applications', 'field', 'ready', 'learn'], ['turning', 'drugdiscovery', 'paradigm', 'upside', 'patientdriven', 'biology', 'data', 'derive', 'morepredictive', 'hypotheses', 'traditional', 'trialanderror', 'approach', 'example', 'bostons', 'berg'

## Word Embedding

### Lemma Whitespace

In [5]:
import gensim
import numpy as np

'''
Here we train the Word2Vec model with a list of lists where each sublist is a tokenized paragraph.
After we get the word vectors per paragraph, we compute our paragraph meaning vector as the mean
of its word vectors.
'''

# Train Word2Vec model
lemmaModel = gensim.models.Word2Vec(tokenized_paragraphs_lemma, window=20, min_count=1, workers=4)
lemmaModel.save("paragraphModel")

# Calculate the meaning vector per paragraph
paragraph_vectors_lemma = []
for paragraph_tokens in tokenized_paragraphs_lemma:
    vectors = []
    for token in paragraph_tokens:
        if token in lemmaModel.wv.key_to_index:
            vectors.append(lemmaModel.wv[token])
    if len(vectors) > 0:
        paragraph_vectors_lemma.append(np.mean(vectors, axis=0))
    else:
        paragraph_vectors_lemma.append(np.zeros(lemmaModel.vector_size))

print(paragraph_vectors_lemma[383])

[-0.00525929  0.03342709  0.00908627 -0.00916142  0.00410249 -0.05454782
  0.00016843  0.06592313 -0.02907532 -0.02053871 -0.01257421 -0.05162814
 -0.00242832  0.02738758  0.00276962 -0.01666317  0.01349642 -0.02452609
  0.00457875 -0.06017574  0.01488929  0.02067158  0.03652705 -0.0148907
 -0.00455017  0.0014725  -0.03140944 -0.03297865 -0.0286214  -0.00708132
  0.03810474  0.00939171  0.0042708  -0.0196527  -0.00829194  0.02082983
  0.00433604 -0.04448033 -0.02428575 -0.06071176  0.0075468  -0.03541793
 -0.01029794  0.01837993  0.03032434 -0.00908384 -0.03771281 -0.00668032
  0.01649439  0.01441899  0.00699207 -0.01661702 -0.00167964  0.01152045
 -0.01277249  0.00516889  0.03587487 -0.01563798 -0.02780675  0.02512508
  0.01400139  0.01199561 -0.01159274  0.01326421 -0.03784709  0.03456539
 -0.00565172  0.01530736 -0.04080844  0.02315336 -0.01878805 -0.00052571
  0.03794163 -0.0145161   0.02880925  0.01611175  0.00693901 -0.00432779
 -0.01547797  0.01145408 -0.00658098 -0.01022015 -0.

### Stemmer

In [6]:
import gensim
import numpy as np

'''
Here we train the Word2Vec model with a list of lists where each sublist is a tokenized paragraph.
After we get the word vectors per paragraph, we compute our paragraph meaning vector as the mean
of its word vectors.
'''

# Train Word2Vec model
stemModel = gensim.models.Word2Vec(tokenized_paragraphs_stem, window=5, min_count=1, workers=4)

# Calculate the meaning vector per paragraph
paragraph_vectors_stem = []
for paragraph_tokens in tokenized_paragraphs_stem:
    vectors = []
    for token in paragraph_tokens:
        if token in stemModel.wv.key_to_index:
            vectors.append(stemModel.wv[token])
    if len(vectors) > 0:
        paragraph_vectors_stem.append(np.mean(vectors, axis=0))
    else:
        paragraph_vectors_stem.append(np.zeros(stemModel.vector_size))

print(paragraph_vectors_stem[383])

[-1.9842412e-03  1.7739395e-03  1.7528664e-04  7.9447302e-05
  1.7268516e-03 -6.5731863e-03  1.1929423e-03  6.7807138e-03
 -1.1121889e-03 -2.2192812e-03 -4.2660879e-03 -5.6049656e-03
 -2.7701227e-04  2.3648108e-03  2.3974280e-03 -5.8502299e-03
  3.2057595e-03 -5.0007133e-03  1.4750662e-03 -5.7442193e-03
  5.2495720e-03  6.4631167e-04  2.7877865e-03 -5.2025518e-04
 -1.3279220e-03 -8.7805552e-04 -3.2298502e-03 -3.3381509e-03
 -1.2146538e-03  1.6951225e-03  6.6512343e-03  2.4173772e-03
 -3.4524576e-04 -3.4868049e-03 -3.2569678e-03  4.3841265e-03
 -6.9366052e-04 -3.1345291e-03 -1.9551553e-03 -8.5864803e-03
 -1.8398198e-03 -4.5068068e-03 -2.2728366e-03 -1.5595128e-03
  3.2221535e-03 -1.7402075e-03 -4.4937115e-03  9.4602728e-04
  3.9650025e-03  2.8670046e-03  7.1823364e-04 -3.4133319e-03
 -2.1948619e-03  1.2182832e-03 -2.4633014e-03  3.0626932e-03
  3.5178251e-04 -1.5150794e-03 -7.8432197e-03  9.6524355e-04
 -4.8967166e-04  2.2514914e-03  3.4297953e-04 -3.1238785e-03
 -2.3440602e-03  2.97232

### Word-based

In [7]:
import gensim
import numpy as np

'''
Here we train the Word2Vec model with a list of lists where each sublist is a tokenized paragraph.
After we get the word vectors per paragraph, we compute our paragraph meaning vector as the mean
of its word vectors.
'''

# Train Word2Vec model
wordModel = gensim.models.Word2Vec(tokenized_paragraphs_word, window=5, min_count=1, workers=4)

# Calculate the meaning vector per paragraph
paragraph_vectors_word = []
for paragraph_tokens in tokenized_paragraphs_word:
    vectors = []
    for token in paragraph_tokens:
        if token in wordModel.wv.key_to_index:
            vectors.append(wordModel.wv[token])
    if len(vectors) > 0:
        paragraph_vectors_word.append(np.mean(vectors, axis=0))
    else:
        paragraph_vectors_word.append(np.zeros(wordModel.vector_size))

print(paragraph_vectors_word[383])

[ 7.5685891e-04  5.0688640e-04  1.7122237e-04 -2.1125354e-04
 -1.3664133e-03 -2.0826403e-03  1.7542954e-04  2.5175570e-03
  9.3788776e-04  7.6208124e-04 -1.6333509e-03 -2.8812313e-03
  1.6361651e-04  2.3910373e-03  1.0930161e-03 -3.9947630e-04
 -1.0646634e-03 -2.6911558e-03 -1.4608311e-04 -2.9126971e-03
  2.2641211e-03  5.5683759e-04  9.2044228e-04  6.9364015e-04
  1.3207922e-03 -9.7588840e-04 -1.4331409e-03 -1.2645663e-03
  6.3672790e-04 -1.5584974e-03  1.3582709e-03  1.5579015e-03
  7.5903430e-04 -2.9434140e-03 -7.4918900e-04  1.6580701e-03
 -1.3274289e-03 -6.9018034e-04  1.1352702e-04 -1.0104182e-03
 -1.1864083e-03 -2.2664671e-03  1.9740479e-03 -2.6771385e-04
  5.0429452e-07 -1.3779446e-04 -1.5670425e-03 -1.9170389e-03
 -9.0392109e-04  3.3079362e-03 -8.7222044e-04 -2.0475246e-03
 -4.4525741e-04 -1.1821578e-03 -8.7536575e-04 -5.9174094e-04
 -1.7086954e-03  2.4275477e-03 -2.9744427e-03  1.5323662e-03
  1.3436098e-03 -1.0287591e-03 -1.3170620e-04  1.1349778e-03
 -9.8780612e-04  9.59336

#### Agregamos los vectores con su respectivo tokenization


In [8]:
df_lemma['vector'] = paragraph_vectors_lemma
df_word['vector'] = paragraph_vectors_word
df_stem['vector'] = paragraph_vectors_stem

### PoS tagging

In [9]:
import nltk
from gensim.models import Word2Vec
import numpy as np

Creamos las tuplas con sus tags

In [10]:
nltk.download('averaged_perceptron_tagger')
tagged_sentences = [nltk.pos_tag((paragraph)) for paragraph in tokenized_paragraphs_lemma]

# Entrenamiento del modelo Word2Vec
model1 = Word2Vec(tagged_sentences,  window=20, min_count=5, workers=4)

# Utilizar el modelo Word2Vec para encontrar palabras similares

# print(similar_words)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/felipe/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [11]:
tagged_sentences

[[('decade', 'NN'),
  ('transform', 'NN'),
  ('multiple', 'JJ'),
  ('field', 'NN'),
  ('knowledge', 'NN'),
  ('medicine', 'JJ'),
  ('transformation', 'NN'),
  ('different', 'JJ'),
  ('way', 'NN'),
  ('enhance', 'NN'),
  ('medicine', 'NN'),
  ('use', 'NN'),
  ('article', 'NN'),
  ('introduce', 'NN'),
  ('help', 'NN'),
  ('discover', 'VB'),
  ('new', 'JJ'),
  ('drug', 'NN'),
  ('understand', 'NN'),
  ('mystery', 'NN'),
  ('cancer', 'NN'),
  ('learn', 'VBP'),
  ('billion', 'CD'),
  ('relation', 'NN'),
  ('different', 'JJ'),
  ('research', 'NN'),
  ('resource', 'NN')],
 [('time', 'NN'),
  ('help', 'NN'),
  ('human', 'VB'),
  ('research', 'NN'),
  ('2007', 'CD'),
  ('adam', 'NN'),
  ('robot', 'NN'),
  ('generate', 'NN'),
  ('hypothesis', 'NN'),
  ('gene', 'NN'),
  ('code', 'NN'),
  ('critical', 'JJ'),
  ('enzyme', 'NN'),
  ('catalyze', 'NN'),
  ('reaction', 'NN'),
  ('yeast', 'NN'),
  ('saccharomyce', 'NN'),
  ('cerevisiae', 'NN'),
  ('adam', 'NN'),
  ('use', 'NN'),
  ('robotic', 'JJ'),
  (

Creamos los vectores por cada parrafo

In [12]:
# Calculate the meaning vector per paragraph
paragraph_tag_vectors_lemma = []
for paragraph_tokens in tagged_sentences:
    vectors = []
    ind = 0
    for ind, token in enumerate(paragraph_tokens):
        ind = ind + 1
        if ind in range(len(tagged_sentences)):
            vectors.append(model1.wv[ind])
    if len(vectors) > 0:
        paragraph_tag_vectors_lemma.append(np.mean(vectors, axis=0))
    else:
        paragraph_tag_vectors_lemma.append(np.zeros(model1.vector_size))

print(paragraph_tag_vectors_lemma[383])

[ 0.02210978  0.3506812  -0.10570294  0.2434982  -0.10984829 -0.8197706
  0.30201432  1.2113723  -0.21940422 -0.12031902 -0.28925252 -0.93250346
  0.4537314  -0.1102639  -0.24586663 -0.52795917  0.10768392 -0.34957862
  0.0832575  -0.86937475  0.39030161  0.5484019  -0.18228899 -0.40937397
 -0.23189811  0.02812907 -0.29221344 -0.69715095 -0.5700277   0.15932018
  0.53407145  0.21488546  0.38273528 -0.2869111  -0.40650722  1.0559549
 -0.17314686 -0.63339037 -0.3971571  -0.8990972   0.03575141 -0.2776595
 -0.17005818 -0.49568808  0.6156093   0.08386169 -0.7123526  -0.23202404
  0.5587019  -0.199832    0.18055554 -0.39416587 -0.05753853 -0.16324385
 -0.46054536  0.36505532  0.11021925 -0.32921734 -0.45804605  0.13487531
  0.29039952 -0.0996778  -0.20079748 -0.32720166 -0.65087235  0.08756707
  0.1652264   0.09877604 -0.9558222   0.63089615 -0.5138576  -0.04107206
  0.330457   -0.22292982  0.46555263  0.273732    0.00462859 -0.3274106
 -0.57027304  0.59978193 -0.12988532 -0.43005958 -0.602

In [13]:
df_lemma['vector_tag'] = paragraph_tag_vectors_lemma

In [14]:
df_lemma.head()

Unnamed: 0.1,Unnamed: 0,article_name,content,enumeration_in_article,file_id,language_code,vector,vector_tag
0,0,Enhance medicine using AI,"During the last decades, AI transformed multip...",0,0,en,"[-0.009722112, 0.07163435, 0.016896, -0.021761...","[0.021664836, 0.3455166, -0.10470651, 0.240304..."
1,1,Enhance medicine using AI,The first time AI helped humans research was i...,1,0,en,"[-0.0071612075, 0.05785236, 0.01586827, -0.016...","[0.0197499, 0.3101085, -0.09615648, 0.21619526..."
2,2,Enhance medicine using AI,AI is turning the drug-discovery paradigm upsi...,2,0,en,"[-0.008561857, 0.05853796, 0.01637258, -0.0183...","[0.02115402, 0.32796857, -0.10073699, 0.229255..."
3,3,Enhance medicine using AI,Another contribution of AI to this field was m...,3,0,en,"[-0.008204909, 0.056730535, 0.014804537, -0.01...","[0.022283265, 0.38243735, -0.11590545, 0.26470..."
4,4,Enhance medicine using AI,Then this information is used to create a repr...,4,0,en,"[-0.007051957, 0.05966627, 0.015305638, -0.019...","[0.02172771, 0.34095144, -0.10375843, 0.237372..."


## Similarity Function

In [15]:
import numpy as np
from gensim.models import KeyedVectors


def cosine_similarity_list(vectors_list, query_vector):
    #Compute the cosine similarity between the vector representation of the input and the vector representations of each sentence in the text
    similarity_scores = []
    for vector in vectors_list:
        score = query_vector.dot(vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
        similarity_scores.append(score)

    # Sort the sentences in descending order of their cosine similarity to the input and return the top-N most similar sentences
    n = 100
    most_similar_sentences = [[vectors_list[idx],idx] for idx in np.argsort(similarity_scores)[::-1][:n] if np.sum(vectors_list[idx]) != 0]

    return most_similar_sentences[:20]


In [16]:
cosine_similarity_list(df_lemma['vector'],df_lemma['vector'][0])

[[array([-0.00972211,  0.07163435,  0.016896  , -0.02176142,  0.01410261,
         -0.12766795,  0.005295  ,  0.15194596, -0.06461419, -0.04598702,
         -0.02913097, -0.11594567, -0.00703777,  0.06123924,  0.00674666,
         -0.03721721,  0.02629692, -0.05647319,  0.0113253 , -0.14104082,
          0.03788463,  0.04276301,  0.08123159, -0.03031487, -0.01521166,
          0.00225455, -0.07138196, -0.07587193, -0.06547288, -0.01528888,
          0.08166   ,  0.02668528,  0.00488759, -0.04597041, -0.01613482,
          0.05589634,  0.00712082, -0.09762578, -0.05386311, -0.14052054,
          0.02224149, -0.08435624, -0.02070905,  0.04038403,  0.06669354,
         -0.02009915, -0.08375835, -0.01487685,  0.03729497,  0.03304398,
          0.01451912, -0.04007566, -0.00249123,  0.02289612, -0.02968929,
          0.00848527,  0.08076252, -0.03190417, -0.06090316,  0.05814185,
          0.03199105,  0.02307762, -0.02587534,  0.03330164, -0.08664081,
          0.07271503, -0.01424389,  0.

In [17]:
cosine_similarity_list(df_stem['vector'],df_stem['vector'][0])

[[array([-0.00195509,  0.00488185,  0.00453548, -0.00156028,  0.00341492,
         -0.01945026,  0.00398706,  0.01992424, -0.0038199 , -0.0089833 ,
         -0.0069988 , -0.01443844, -0.00071342,  0.00399412,  0.00459499,
         -0.01159773,  0.00440929, -0.01265747,  0.00308167, -0.01783688,
          0.00749393,  0.00552588,  0.00207943, -0.00597158, -0.00093428,
         -0.00082205, -0.01180057, -0.00633555, -0.00861118,  0.00377235,
          0.0145078 ,  0.00680286,  0.00028807, -0.01301141, -0.00585641,
          0.01345353, -0.00151225, -0.00845203, -0.00633184, -0.02362994,
         -0.00010767, -0.01342468, -0.00163693, -0.00320272,  0.00927213,
         -0.00441162, -0.01190884, -0.0003376 ,  0.0080568 ,  0.00772528,
          0.00331942, -0.01270547, -0.00567569,  0.00100349, -0.00708799,
          0.00392209,  0.00573043, -0.0023245 , -0.01500339,  0.00084338,
         -0.00050375,  0.00699709, -0.00263494,  0.00249798, -0.00779127,
          0.008602  ,  0.00194216,  0.

In [18]:
cosine_similarity_list(df_word['vector'],df_word['vector'][0])

[[array([-5.3599314e-04,  7.5327267e-04,  7.8932557e-04,  7.8670401e-04,
          2.0671041e-04, -1.4080501e-03,  9.6205226e-04,  6.4045638e-03,
         -2.4874697e-03,  1.4851890e-03, -2.6202013e-03, -2.7064187e-03,
         -3.3459705e-04,  1.5990541e-03,  1.4302612e-03, -1.3466969e-03,
         -2.9268049e-04, -1.2050077e-03,  9.1636402e-04, -2.6194355e-03,
          8.1092614e-04,  2.0547407e-03,  1.7025199e-03, -2.1447616e-03,
         -9.5983309e-04,  1.0152506e-03, -2.3019691e-03, -1.7348113e-03,
         -2.1743518e-03,  7.4292642e-05,  1.0824753e-03,  5.8054901e-04,
         -8.4552972e-04, -1.6295633e-03,  3.8108678e-04,  1.1933994e-03,
         -5.6966914e-05, -2.6193501e-03, -1.4333053e-03, -3.3040987e-03,
          2.1811016e-03, -3.2242155e-03, -5.3563376e-04,  1.1685791e-03,
          2.3050955e-03, -3.1989191e-03, -4.4237701e-03,  1.7049884e-03,
          9.1137204e-05,  3.8268589e-03,  2.4813286e-03, -2.3092183e-03,
          1.8301567e-03, -6.4463657e-04, -3.6581983

In [20]:
# print(df_word['content'][0],'\n',df_word['content'][86],'\n',df_word['content'][737])

## Prompt Preprocessing, Tokenization and Embedding

### Lemma Whitespace

In [33]:
userPrompt = "artificial intelligence, model medicine in debes ayuda amigos con todos ayudesz"
tokenized_prompt = preprocess_text_lemma(userPrompt)
print(tokenized_prompt)

promptVector_lemma = np.zeros((lemmaModel.vector_size,))
word_count = 0

for token in tokenized_prompt:
    if token in lemmaModel.wv.key_to_index:
        promptVector_lemma += lemmaModel.wv[token]
        word_count += 1
        print(token)

if word_count > 0:
    promptVector_lemma /= word_count
    
print(promptVector_lemma)
   


['artificial', 'intelligence', 'model', 'medicine', 'debe', 'ayuda', 'amigo', 'con', 'todo', 'ayudesz']
artificial
intelligence
model
medicine
debe
ayuda
amigo
con
todo
[-0.02309902  0.13113993  0.03705962 -0.04207387  0.02497981 -0.23858935
  0.00697994  0.28508285 -0.11497237 -0.08665495 -0.05599498 -0.21349209
 -0.01657503  0.11470655  0.01640302 -0.06840169  0.04782532 -0.10824977
  0.02610146 -0.26081892  0.06345818  0.08224655  0.14975433 -0.05507427
 -0.02803465  0.00554133 -0.13178302 -0.13418086 -0.11749568 -0.02654441
  0.14789241  0.04417824  0.0097922  -0.08066102 -0.03199581  0.10653211
  0.01250139 -0.17862383 -0.09951821 -0.25388424  0.0399696  -0.15025362
 -0.04638506  0.07619479  0.12688752 -0.03367056 -0.15848855 -0.02770912
  0.07167744  0.06535858  0.02794787 -0.06893092 -0.0086351   0.03821502
 -0.0512072   0.01859919  0.14716538 -0.063629   -0.10998878  0.1038334
  0.05667406  0.04572906 -0.04839327  0.06168818 -0.1583986   0.13559065
 -0.02420228  0.05011297 -0.1

### Stemmer

In [22]:
userPrompt = "medicine using artificial intelligence"
tokenized_prompt = preprocess_text_stem(userPrompt)
print(tokenized_prompt)

promptVector = np.zeros((stemModel.vector_size,))
word_count = 0

for token in tokenized_prompt:
    if token in lemmaModel.wv.key_to_index:
        promptVector += lemmaModel.wv[token]
        word_count += 1

if word_count > 0:
    promptVector /= word_count
    
print(promptVector)

['medicin', 'use', 'artifici', 'intellig']
[-0.02391157  0.2391587   0.06462903 -0.07437063  0.04754031 -0.42382395
  0.02397694  0.49994504 -0.21237099 -0.15959176 -0.09241626 -0.38017294
 -0.02903207  0.21064085  0.02069084 -0.12194536  0.10091076 -0.18146043
  0.04043122 -0.4745068   0.11705362  0.14517671  0.28189954 -0.10481324
 -0.04436547  0.01195794 -0.24235608 -0.25483415 -0.20900211 -0.05249029
  0.27440485  0.10054272  0.0152335  -0.14537829 -0.05600438  0.18903898
  0.02008404 -0.33404812 -0.18595149 -0.46971393  0.07830066 -0.27820224
 -0.06926267  0.12133776  0.23060875 -0.05358608 -0.28774959 -0.063265
  0.1248192   0.11331511  0.04948066 -0.13541988 -0.0259171   0.08395179
 -0.10420308  0.02846084  0.26512322 -0.11905078 -0.20689835  0.18091781
  0.1027649   0.07729243 -0.07810123  0.11410932 -0.28356278  0.250884
 -0.05012493  0.08368278 -0.312673    0.16591245 -0.15000468 -0.00409645
  0.28805396 -0.11041766  0.22477745  0.12999995  0.07077643 -0.0291267
 -0.11031137 

### Word-based

In [23]:
userPrompt = "artificial intelligence"
tokenized_prompt = preprocess_text_word(userPrompt)
print(tokenized_prompt)

promptVector = np.zeros((wordModel.vector_size,))
word_count = 0

for token in tokenized_prompt:
    if token in wordModel.wv.key_to_index:
        promptVector += wordModel.wv[token]
        word_count += 1

if word_count > 0:
    promptVector /= word_count
    
print(promptVector)

['artificial', 'intelligence']
[-0.00992446  0.01500029  0.00632449  0.01337329  0.00451102 -0.01007485
  0.01046465  0.01400736 -0.01051739 -0.00936545 -0.00416123 -0.02061749
  0.00984663  0.00565882  0.01076017 -0.00148383 -0.00736679 -0.00905269
  0.00131472 -0.01393862  0.00630008  0.01270976  0.00368491 -0.00436381
  0.00534625  0.00513423 -0.00997125 -0.00556974 -0.00283791 -0.00700042
  0.00880761  0.00115563 -0.00149072 -0.00153547  0.00141123  0.01572557
  0.0036969  -0.0037629  -0.00664401 -0.01315418  0.00452691 -0.00904297
 -0.00841212 -0.00139046  0.00374415 -0.00807601 -0.00653941  0.00407123
 -0.00186189  0.00108945  0.00982021 -0.01100161 -0.00502287 -0.00095846
 -0.00312118  0.00896563  0.00220033 -0.00403434 -0.00433606 -0.00490903
 -0.00535639 -0.00072206  0.00448262  0.00597795 -0.01019298  0.00659725
 -0.00259145  0.00576831 -0.01546877  0.01329302 -0.01349834  0.00930187
  0.01385972 -0.00276206 -0.0004157   0.00347066 -0.01060762 -0.003318
 -0.01174806  0.005243

### Lemma with Pos-tagging

In [24]:
userPrompt="medicine using artificial intelligence"
tokenized_prompt = preprocess_text_lemma(userPrompt)
tagged_prompt = nltk.pos_tag(tokenized_prompt)
print(tagged_prompt)
promptVector_lemma = np.zeros((model1.vector_size,))
word_count  = 0
ind = 0
for ind,token in enumerate(tokenized_prompt):
    ind = ind + 1
    if token in tokenized_prompt:
        promptVector_lemma += model1.wv[ind]
        word_count += 1
        print(word_count,ind)

if word_count > 0:
    promptVector_lemma /= word_count
    
print(promptVector_lemma)




[('medicine', 'NN'), ('use', 'NN'), ('artificial', 'JJ'), ('intelligence', 'NN')]
1 1
2 2
3 3
4 4
[ 0.02020932  0.42668521 -0.13106244  0.29355253 -0.12439486 -0.98649044
  0.3632608   1.46346477 -0.26635456 -0.14252803 -0.34912498 -1.12271033
  0.54369761 -0.13102918 -0.2929647  -0.63684423  0.13262438 -0.41816846
  0.09717494 -1.04651101  0.4697547   0.65657482 -0.21303554 -0.49364244
 -0.27752075  0.03195625 -0.34866028 -0.83547212 -0.6924832   0.19432786
  0.64876138  0.25666315  0.45709571 -0.35024316 -0.4887149   1.2730808
 -0.20662315 -0.76399386 -0.47720468 -1.08055915  0.04249744 -0.33607993
 -0.20945706 -0.59641731  0.74638252  0.10468077 -0.85288796 -0.27845751
  0.67257974 -0.24218452  0.22156953 -0.47636407 -0.07193792 -0.20103224
 -0.5585591   0.43699954  0.13269844 -0.39601797 -0.55520742  0.16134576
  0.35082333 -0.11906513 -0.24167253 -0.39774513 -0.78573549  0.10521483
  0.206856    0.12359836 -1.15230155  0.76342458 -0.6185988  -0.0491527
  0.40006654 -0.26486025  0.

### Testing PoS tagging

In [35]:
var = cosine_similarity_list(df_lemma['vector'],promptVector_lemma)

In [26]:
df['content'][var[0][1]]

'BI can have different applications for one organization; among them, we can find [2]:'

## Similarity Test

In [38]:
var=cosine_similarity_list(df_lemma['vector'],promptVector_lemma)
var[0]


[array([-0.01678474,  0.1014806 ,  0.02849557, -0.03156531,  0.02117067,
        -0.18022399,  0.01013669,  0.21405648, -0.08704736, -0.06676333,
        -0.04088521, -0.16050397, -0.01070653,  0.08513112,  0.01478421,
        -0.05308341,  0.03807824, -0.08141152,  0.0161918 , -0.19722204,
         0.04981495,  0.06076785,  0.1150202 , -0.04364143, -0.01884962,
         0.00484414, -0.09822907, -0.10460122, -0.08939172, -0.01959769,
         0.11258888,  0.03628403,  0.00739953, -0.06047707, -0.02201552,
         0.07935686,  0.01015776, -0.13538991, -0.07547379, -0.19431251,
         0.02863635, -0.11330178, -0.03492139,  0.05705839,  0.09460498,
        -0.0271433 , -0.11635169, -0.02193734,  0.05364744,  0.05018494,
         0.02181658, -0.0548007 , -0.00791511,  0.02914806, -0.03993335,
         0.01445641,  0.11011721, -0.0478379 , -0.08504202,  0.07732499,
         0.04278922,  0.03322824, -0.03630802,  0.04464458, -0.12044249,
         0.1058788 , -0.01645791,  0.03987676, -0.1

In [28]:
df["content"][var[1][1]]

'Select the SSH configuration file to update, press enter to select the first option, which should contain “user” or “home”.'

In [39]:
var=np.array(var)
var[:,1]
possible_solutions=df.iloc[var[:,1]]

  var=np.array(var)


In [40]:
for paragraph in possible_solutions["content"]:
    print(paragraph, "\n")

Un ejemplo se puede ver en el sector de la salud. Este trabaja todos los días para entender como el cuerpo humano y las enfermedades funcionan. Uno de estos es el cáncer de mamá, el cual es uno de los canceres más frecuentes con un porcentaje de 11,6% de todos los cánceres diagnosticados (Quironsalud). En mayo del 2019, el Computer Science and Artificial Intelligence Laboratory desarrollo un algoritmo que logra predecir la aparición de cáncer de mama hasta con 5 años de antelación (RGT Consultores Internacionales, 2020). 

Un ejemplo se puede ver en el sector de la salud. Este trabaja todos los días para entender como el cuerpo humano y las enfermedades funcionan. Uno de estos es el cáncer de mamá, el cual es uno de los canceres más frecuentes con un porcentaje de 11,6% de todos los cánceres diagnosticados (Quironsalud). En mayo del 2019, el Computer Science and Artificial Intelligence Laboratory desarrollo un algoritmo que logra predecir la aparición de cáncer de mama hasta con 5 años