<h3>Change into the search_engine directory to import the preprocessor</h3>

In [1]:
%cd search_engine/
from preprocess import Preprocessor # type: ignore

d:\Projects\Python\OOP-PROJECT-G4\src\main\python\search_engine


<h3>Import the libraries</h3>

In [2]:
import pandas as pd # type: ignore
from gensim.models.doc2vec import Doc2Vec # type: ignore
from sklearn.metrics.pairwise import cosine_similarity
from joblib import dump, load
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer


<h3>Return to the parent directory</h3>

In [3]:
%cd ../

d:\Projects\Python\OOP-PROJECT-G4\src\main\python


<h3>Load the gensim Doc2Vec model and instantiate the Preprocessor</h3>

In [4]:
doc2vec_model = Doc2Vec.load('search_engine/models/doc2vec_model.model')

preprocessor = Preprocessor()



<h3>Load the CSV file</h3>

In [5]:
relative_path = "../resources/data/data.csv"

sample_df = pd.read_csv(relative_path)

df = sample_df
df.head()

Unnamed: 0,Article link,Website source,Article type,Article title,Content,Creation date,Author,Category,Tags,Summary,Temp
0,https://www.theblock.co/post/285730/custodia-i...,https://www.theblock.co,News Article,Custodia is not entitled to a Fed master accou...,The Federal Reserve does not have to give digi...,"March 29, 2024, 7:05PM EDT",Sarah Wynn,Policy,COURT HEARINGS-LAWSUITS,Custodia Bank sued the central bank in 2022 fo...,
1,https://www.theblock.co/post/285724/multicoin-...,https://www.theblock.co,News Article,"Multicoin Capital's hedge fund has grown 9,281...",Multicoin Capital’s crypto-focused hedge fund ...,"March 29, 2024, 7:00PM EDT UPDATED: March 29, ...",Elizabeth Napolitano,Companies,INVESTMENT FIRMS,"Multicoin Capital’s hedge fund has returned 9,...",
2,https://www.theblock.co/post/285702/1kx-raise-...,https://www.theblock.co,News Article,1kx raises $75 million in latest funding round,"1kx has raised $75 million, the latest sign in...","March 29, 2024, 3:20PM EDT",Elizabeth Napolitano,Companies,,Investment firm 1kx has raised $75 million for...,
3,https://www.theblock.co/post/285690/cftc-commi...,https://www.theblock.co,News Article,CFTC Commissioner Pham says agency may be infr...,One of the Commodity Futures Trading Commissio...,"March 29, 2024, 12:06PM EDT",Sarah Wynn,Exchanges,CFTC-SEC,The agency’s complaint “appears to assert that...,
4,https://www.theblock.co/post/285608/bitcoin-fu...,https://www.theblock.co,News Article,Bitcoin futures open interest reaches new high...,Open interest for bitcoin futures on centraliz...,"March 29, 2024, 11:03AM EDT UPDATED: March 29,...",Vishal Chawla,The Block,,Bitcoin futures open interest on centralized e...,


In [6]:


def find_closest_word(word, model, model_name='Tfidf'):
    # Pick the vocabulary that matches the model type
    if model_name == 'Doc2Vec':
        vocab = model.wv.key_to_index
    elif model_name == 'Tfidf':
        vocab = model.vocabulary_
    else:
        raise ValueError(f"Unsupported model_name: {model_name!r}")

    # A word that is already in the vocabulary is returned unchanged
    if word in vocab:
        return word
    # An empty vocabulary has no candidates to compare against
    if not vocab:
        return None
    # Otherwise return the vocabulary word with the smallest Levenshtein distance
    return min(vocab, key=lambda x: Levenshtein.distance(word, x))

# Demo
word = "sam"  # Sample word; if it is already in the vocabulary it is returned unchanged
closest_word = find_closest_word(word, doc2vec_model, model_name='Doc2Vec')
print(f"The closest word to '{word}' is '{closest_word}'")


The closest word to 'sam' is 'sam'
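
The fallback used above picks the vocabulary entry with the smallest edit distance to the query word. As a rough illustration of what that distance measures, here is a minimal pure-Python sketch. The names `levenshtein`, `closest_in_vocab` and `toy_vocab` are hypothetical stand-ins made up for this example; the notebook itself relies on the `Levenshtein` package and the model's real vocabulary.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_in_vocab(word, vocab):
    """Return word if it is known, else the entry with the smallest distance."""
    if word in vocab:
        return word
    return min(vocab, key=lambda v: levenshtein(word, v))


toy_vocab = ["bitcoin", "ethereum", "altman", "musk"]  # hypothetical vocabulary
print(closest_in_vocab("bitcon", toy_vocab))  # misspelled word, one edit away from "bitcoin"
print(closest_in_vocab("musk", toy_vocab))    # known word, returned unchanged
```

Because `min` scans the whole vocabulary for every unknown word, this lookup is linear in vocabulary size, which may matter for large models.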


In [7]:
def search(query):
    query = preprocessor.preprocess_text(query)
    query = ' '.join(find_closest_word(word=word, model=doc2vec_model, model_name='Doc2Vec') for word in query.split())
    inferred_vector = doc2vec_model.infer_vector(query.split())
    
    sims = doc2vec_model.dv.most_similar([inferred_vector], topn=10)

    results = []
    for sim in sims:
        doc_index = int(sim[0])
        similarity = sim[1]
        title = df.iloc[doc_index][' Article title']
        content = df.iloc[doc_index][' Content']
        results.append((doc_index, similarity, title, content))

    return results



In [8]:
query = "altman"

for doc_index, similarity, title, content in search(query):
    print(doc_index)
    print(similarity)
    print(title)
    print(content)


1877
0.7729318737983704
Worldcoin price swings accompany twists in OpenAI saga
The saga surrounding Sam Altman and OpenAI appears to have triggered significant price action in WLD, the token issued by the crypto project Worldcoin, which Altman also co-founded. WLD has experienced significant volatility over the past few days, with price swings seemingly triggered by news relating to Altman’s post at OpenAI, analysts said. WLD rose 9.1% over the past 24 hours to trade at $2.55 at around 4 p.m. Hong Kong time on Monday, according to CoinGecko data. The token rose 31.4% in the past week. While he’s been ousted as CEO of OpenAI, Altman appears to remain the co-founder and chairman of Tools for Humanity, the developer behind Worldcoin. On Friday, the tech world was shocked by news that Altman had been removed as CEO from OpenAI, the outfit behind ChaptGPT. He has since engaged in negotiations with the board to return to the role. On Sunday night in the U.S., however, the board of directors 

In [9]:
tfidf_matrix = load("search_engine/models/tfidf/tfidf_matrix.joblib")
vectorizer = load("search_engine/models/tfidf/vectorizer.joblib")

In [10]:
def tfidf_query(query):
    preprocessed_query = preprocessor.preprocess_text(query)
    preprocessed_query = ' '.join(find_closest_word(word=word, model=vectorizer, model_name='Tfidf') for word in preprocessed_query.split())
    # Vectorize the query and score it against every document
    query_vector = vectorizer.transform([preprocessed_query])
    similarities = cosine_similarity(query_vector, tfidf_matrix)

    # Pair each title with its score, then sort by score in descending order
    results = []
    for idx, sim in enumerate(similarities[0]):
        results.append((df.iloc[idx][' Article title'], sim, preprocessed_query))

    results.sort(key=lambda x: x[1], reverse=True)

    # Keep only results whose similarity exceeds 0.1
    return [result for result in results if result[1] > 0.1]
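
The ranking step in `tfidf_query` comes down to three operations: score every document by cosine similarity, discard scores at or below 0.1, and sort what remains in descending order. The sketch below reproduces that logic with the standard library only, under stated assumptions: `cosine`, `rank`, the toy titles and the term weights are all hypothetical, and real weights would come from the fitted `vectorizer`.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs, titles, threshold=0.1):
    """Score every document, drop scores at or below threshold, sort descending."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in zip(titles, doc_vecs)]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical term weights, not taken from the notebook's TF-IDF model
titles = ["doc A", "doc B", "doc C"]
doc_vecs = [
    {"elon": 1.0, "musk": 1.0},
    {"elon": 1.0, "twitter": 1.0, "dogecoin": 1.0, "rally": 1.0},
    {"fed": 1.0, "bank": 1.0},
]
query_vec = {"elon": 1.0, "musk": 1.0}
ranked = rank(query_vec, doc_vecs, titles)
print(ranked)  # "doc C" shares no terms with the query, so it is filtered out
```

Filtering before sorting keeps the sort small when most documents share no terms with the query.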

In [13]:
query = "elon musk"
tfidf_query(query)


[("Elon Musk claims X 'never will' launch a crypto token",
  0.6452821533857734,
  'elon musk'),
 ('Dogecoin rallies after Elon Musk’s Twitter takeover',
  0.6271641085024764,
  'elon musk'),
 ('Twitter will be added to dogecoin Ponzi scheme litigation if Elon Musk cannot get case dismissed',
  0.5963614222316188,
  'elon musk'),
 ('Elon Musk could bring more crypto into Twitter: Bloomberg',
  0.584188564742343,
  'elon musk'),
 ("Worldcoin token drops 5% amid Elon Musk's lawsuit against OpenAI",
  0.5277470612701318,
  'elon musk'),
 ("Binance CEO surprised dogecoin hasn't died yet, 'but Elon Musk latched onto it'",
  0.5125700178097224,
  'elon musk'),
 ("Binance CEO unfollows Elon Musk on Twitter, but Binance's $500 million investment in Twitter remains intact",
  0.5108524667607449,
  'elon musk'),
 ('Elon Musk will find someone else to run Twitter: WSJ',
  0.5095181319110593,
  'elon musk'),
 ("Elon Musk's Dogecoin hype draws attention to mysterious wallet that once held $24 billi