# Search Engine Implementation

 * Implementing a news search engine
 * Dataset: https://www.kaggle.com/datasets/rmisra/news-category-dataset

## Installing and setting up Elasticsearch

In [None]:
# The following bash scripts download the ElasticSearch library and install it
# on the Google Colab instance.
# Credit for the bash scripts and server run start goes to:
# https://gist.github.com/korakot/15fe4f18d0e0f53d7b834ef797880500

# You need to run these only once when you work on your search engine notebook.

# NOTE: If you are working on a large dataset (20k+ docs) you should do this
# locally i.e. in a jupyter notebook. This way you only need to install ES once
# and index your data once.

In [None]:
!wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
!wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
!tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
!sudo chown -R daemon:daemon elasticsearch-7.9.2/
!shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


In [None]:
# https://stackoverflow.com/questions/68762774/elasticsearchunsupportedproducterror-the-client-noticed-that-the-server-is-no#answer-68918449
!pip install elasticsearch==7.9.1 -q

In [None]:
# check elasticsearch version in environment
!pip freeze | grep elasticsearch

elasticsearch==7.9.1


In [None]:
# Import required libraries
# import urllib.request
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from elasticsearch import Elasticsearch
from flask import Flask, render_template, request, jsonify
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn import preprocessing, model_selection, naive_bayes, pipeline, manifold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# from bs4 import BeautifulSoup
import re
import time
import string
from nltk.corpus import wordnet
import ipywidgets as widgets
from google.colab import files
import numpy as np

# for data
import json

# for plotting
# import matplotlib.pyplot as plt
import seaborn as sns

# for processing
import nltk

# for bag-of-words
# import feature_extraction
# import naive_bayes
# import manifold
# import preprocessing

# for explainer
# import lime_text

# for word embedding
# import gensim
# import gensim.downloader as gensim_api


In [None]:
# Download required NLTK resources
nltk.download('words')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
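%%bash --bg
# Start the Elasticsearch server in the background as the unprivileged daemon user.
# The %%bash cell magic has to be the first line of the cell.
sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch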

In [None]:
%%bash
ps -ef | grep elasticsearch

root       72132   72130  0 17:32 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon     72133   72132  0 17:32 ?        00:00:00 /bin/bash elasticsearch-7.9.2/bin/elasticsearch
root       72153   72139  0 17:32 ?        00:00:00 grep elasticsearch
daemon     72154   72133  0 17:32 ?        00:00:00 /bin/bash elasticsearch-7.9.2/bin/elasticsearch


In [None]:
# Wait 30 seconds for the server to finish starting, then check that it responds
time.sleep(30)
!curl -X GET "http://localhost:9200"

{
  "name" : "b7dbf95ab1b3",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "wmxTC68fQ5axkS4QPiSImA",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


In [None]:
# Testing Elasticsearch instance
es = Elasticsearch("http://localhost:9200")
if es.ping():
  print('ES instance working')
else:
  print('ES instance not working')

ES instance working
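
The fixed 30-second sleep used earlier either waits too long or not long enough, depending on the machine. One possible alternative is to poll until the server answers. The sketch below is a generic helper, not part of the original notebook; the name `wait_until` and its parameters are assumptions made for illustration.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or `timeout` seconds pass.

    Returns True if check() succeeded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat errors (for example a refused connection while the
            # server is still starting) as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible use in this notebook, once the client `es` exists:
#   wait_until(es.ping, timeout=60)
```

With the client created above, `wait_until(es.ping, timeout=60)` would return as soon as the server responds instead of always sleeping for the full 30 seconds.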


In [None]:
# The Server's information
es.info()

{'name': 'b7dbf95ab1b3',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'wmxTC68fQ5axkS4QPiSImA',
 'version': {'number': '7.9.2',
  'build_flavor': 'oss',
  'build_type': 'tar',
  'build_hash': 'd34da0ea4a966c4e49417f2da2f244e3e97b4e6e',
  'build_date': '2020-09-23T00:45:33.626720Z',
  'build_snapshot': False,
  'lucene_version': '8.6.2',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}





# Loading data from Kaggle dataset




## Setup Steps
To download the dataset's JSON file, follow these steps:


*   Select a dataset on Kaggle
*   Download your Kaggle API credentials (`kaggle.json`)
*   Set up the Colab notebook to use the credentials
*   Download the dataset
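
The credential setup in the steps above can also be done from Python instead of shell commands. This is a minimal sketch under the assumption that the token file is named `kaggle.json` and should end up in `~/.kaggle/` with owner-only permissions; the function name `install_kaggle_credentials` and its `home` parameter are hypothetical and exist only to make the example testable.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", home=None):
    """Copy the API token to <home>/.kaggle/kaggle.json with mode 600."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # same as `mkdir -p ~/.kaggle`
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)                   # same as `cp kaggle.json ~/.kaggle/`
    os.chmod(target, 0o600)                        # same as `chmod 600`
    return target
```

Calling `install_kaggle_credentials()` after uploading the file would replace the `mkdir`, `cp`, and `chmod` shell commands used in the following cells.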

In [None]:
#1. Upload the Kaggle API credentials file (kaggle.json)
uploaded = files.upload()

Saving kaggle.json to kaggle (8).json


In [None]:
#2. Install the Kaggle client and put the API key where it expects it

!ls -lha kaggle.json
!pip install -q kaggle # installing the kaggle package
!mkdir -p ~/.kaggle # creating .kaggle folder where the key should be placed
!cp kaggle.json ~/.kaggle/ # copy the key to the folder
!pwd # checking the present working directory

-rw-r--r-- 1 root root 68 Apr 14 13:56 kaggle.json
/content


In [None]:
#3. Restrict the key file to owner read/write (run this if you get 401 Unauthorized)

!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list

ref                                                        title                                           size  lastUpdated          downloadCount  voteCount  usabilityRating  
---------------------------------------------------------  ---------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
salvatorerastelli/spotify-and-youtube                      Spotify and Youtube                              9MB  2023-03-20 15:43:25           6466        246  1.0              
erdemtaha/cancer-data                                      Cancer Data                                     49KB  2023-03-22 07:57:00           2243         58  1.0              
ulrikthygepedersen/fastfood-nutrition                      Fastfood Nutrition                              12KB  2023-03-21 10:02:41           2372         48  1.0              
lokeshparab/amazon-products-dataset                        Amazon Products Sales Dataset 2023              80M

In [None]:
!kaggle datasets download -d rmisra/news-category-dataset

news-category-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


If prompted to replace the existing file, type y.

In [None]:
!unzip *news-category-dataset.zip -d news-category-dataset

Archive:  news-category-dataset.zip
replace news-category-dataset/News_Category_Dataset_v3.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: news-category-dataset/News_Category_Dataset_v3.json  



In [None]:
news_df = pd.read_json('news-category-dataset/News_Category_Dataset_v3.json', lines = True, nrows=2000)
news_df.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [None]:
# Checking for blank links
blank_links = news_df[news_df['link'].isnull()]

if not blank_links.empty:
    print("The following articles have blank links:")
    print(blank_links)
else:
    print("No articles have blank links.")

No articles have blank links.


In [None]:
news_df.head(50)

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
5,https://www.huffpost.com/entry/belk-worker-fou...,Cleaner Was Dead In Belk Bathroom For 4 Days B...,U.S. NEWS,The 63-year-old woman was seen working at the ...,,2022-09-22
6,https://www.huffpost.com/entry/reporter-gets-a...,Reporter Gets Adorable Surprise From Her Boyfr...,U.S. NEWS,"""Who's that behind you?"" an anchor for New Yor...",Elyse Wanshel,2022-09-22
7,https://www.huffpost.com/entry/puerto-rico-wat...,Puerto Ricans Desperate For Water After Hurric...,WORLD NEWS,More than half a million people remained witho...,"DÁNICA COTO, AP",2022-09-22
8,https://www.huffpost.com/entry/mija-documentar...,How A New Documentary Captures The Complexity ...,CULTURE & ARTS,"In ""Mija,"" director Isabel Castro combined mus...",Marina Fang,2022-09-22
9,https://www.huffpost.com/entry/biden-un-russia...,Biden At UN To Call Russian War An Affront To ...,WORLD NEWS,White House officials say the crux of the pres...,"Aamer Madhani, AP",2022-09-21


In [None]:
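# Remove duplicates and combine the headline and short_description columns
df = news_df.drop_duplicates().copy()  # copy so the new column is not written to a view
text_columns = ['headline', 'short_description']
# Join with a space: '_' counts as a word character and would glue two words into one token
df['combined'] = df[text_columns].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)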

## Preprocessing text data using NLTK

In [None]:
# Preprocess the text: lowercase, strip punctuation, drop non-English words and stop words
words = set(nltk.corpus.words.words())
stop_words = set(stopwords.words('english'))  # built once instead of on every call

def preprocess_text(text):
    text = text.lower()
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Keep dictionary words and non-alphabetic tokens such as numbers
    text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w in words or not w.isalpha())
    # Drop English stop words
    filtered = [w for w in word_tokenize(text) if w not in stop_words]
    return " ".join(filtered)
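
To see what the lowercasing, punctuation, and stop-word steps do without downloading any NLTK data, here is a simplified standard-library sketch. It is not the notebook's function: it skips the English-dictionary filter, and `STOP_WORDS` is a tiny illustrative list rather than NLTK's full English list.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# the notebook uses NLTK's full English stop-word list instead.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "for", "in", "on"}


def simple_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = text.split()                # whitespace tokenisation
    return " ".join(t for t in tokens if t not in stop_words)


print(simple_preprocess("The Funniest Tweets From Parents, This Week!"))
# prints: funniest tweets from parents this week
print(simple_preprocess("U.S. NEWS"))
# prints: us news
```

The second call shows why the category `U.S. NEWS` appears as `us news` after preprocessing: the periods are removed before tokenisation.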

## Creating a corpus for news articles

In [None]:
def get_news_corpus():
    """
    Preprocess the `category` and `combined` columns of the global `df` in
    place and return a corpus of (article_id, headline, link,
    short_description) tuples, one per article.
    """

    # preprocess the text data
    #df['headline'] = df['headline'].apply(preprocess_text)
    #df['short_description'] = df['short_description'].apply(preprocess_text)
    df['category'] = df['category'].apply(preprocess_text)
    #df['link'] = df['link'].apply(preprocess_text)
    df['combined'] = df['combined'].apply(preprocess_text)

    # create a corpus of news articles
    corpus = []
    for index, row in df.iterrows():
        article_id = index
        headline = row['headline']
        link = row['link']
        short_description = row['short_description']
        #article_len = len(headline.split() + desc.split())
        corpus.append((article_id, headline, link, short_description))

    return corpus

In [None]:
corpus = get_news_corpus()

In [None]:
df['headline'] = df['headline'].apply(preprocess_text)
#df['short_description'] = df['short_description'].apply(preprocess_text)
#df['category'] = df['category'].apply(preprocess_text)
#df['link'] = df['link'].apply(preprocess_text)
#df['combined'] = df['combined'].apply(preprocess_text)

In [None]:
# Split dataset
dtf_train, dtf_test = model_selection.train_test_split(df, test_size=0.3)
# Get target
y_train = dtf_train["category"].values
y_test = dtf_test["category"].values

In [None]:
# vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
# #vectorizer.fit(cor)
# X_train = vectorizer.transform(cor)
# dic_vocabulary = vectorizer.vocabulary_

In [None]:
# # pipeline
# model = pipeline.Pipeline([("vectorizer", vectorizer),
#                            ("classifier", classifier)])
# ## train classifier
# model["classifier"].fit(X_train, y_train)
# ## test
# X_test = dtf_test["text_clean"].values
# predicted = model.predict(X_test)
# predicted_prob = model.predict_proba(X_test)

# Machine Learning - KNN

In [None]:
df.shape

(2000, 7)

In [None]:
# List the unique category names (after preprocessing)
target_category = df['category'].unique()
print(target_category)

['us news' 'comedy' '' 'world news' 'culture' 'tech' 'sports'
 'entertainment' 'politics' 'weird news' 'environment' 'education' 'crime'
 'science' 'wellness' 'business' 'style beauty' 'food drink' 'media'
 'queer' 'home living' 'black' 'travel' 'money' 'religion']


In [None]:
# Associate each category name with a numerical index and save it in the new column CategoryId
df['CategoryId'] = df['category'].factorize()[0]
#df.head()

In [None]:
# Create a new pandas dataframe "category", which only has unique Categories, also sorting this list in order of CategoryId values
Category = df[['category', 'CategoryId']].drop_duplicates().sort_values('CategoryId')
Category

Unnamed: 0,category,CategoryId
0,us news,0
2,comedy,1
3,,2
7,world news,3
8,culture,4
13,tech,5
17,sports,6
20,entertainment,7
21,politics,8
29,weird news,9


## Building a text classification model using Scikit-Learn

In [None]:
# Bag-of-words features from the combined text; labels are the numeric category ids
y = np.array(df.CategoryId.values)
cv = CountVectorizer(max_features = 5000)  # keep the 5000 most frequent terms
x = cv.fit_transform(df.combined).toarray()
print("X.shape = ",x.shape)
print("y.shape = ",y.shape)

X.shape =  (2000, 5000)
y.shape =  (2000,)


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0, shuffle = True)
print(len(x_train))
print(len(x_test))

1400
600
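
The commented-out cell that follows outlines a k-nearest-neighbours classifier on these features. To make the idea concrete, here is a small pure-Python sketch of KNN over bag-of-words vectors. It is a stand-in, not the notebook's scikit-learn model: it uses cosine similarity rather than the Minkowski metric mentioned in the commented code, and the toy texts and labels are invented for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train_texts, train_labels, text, k=3):
    """Return the majority label among the k most similar training texts."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(t)), label)
         for t, label in zip(train_texts, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training data (hypothetical, not taken from the dataset)
train_texts = [
    "senate passes budget vote",
    "president signs election bill",
    "team wins championship game",
    "striker scores late goal",
    "election vote count continues",
]
train_labels = ["politics", "politics", "sports", "sports", "politics"]

print(knn_predict(train_texts, train_labels, "senate election vote"))
# prints: politics
print(knn_predict(train_texts, train_labels, "late goal wins game"))
# prints: sports
```

On the real data, `KNeighborsClassifier` fitted on `x_train` and `y_train` plays the role of `knn_predict`, with `CountVectorizer` producing the vectors.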


In [None]:
# from sklearn.metrics import precision_recall_fscore_support as score
# from sklearn.multiclass import OneVsRestClassifier
# def run_model(model_name, est_c, est_pnlty):
#     mdl = KNeighborsClassifier(n_neighbors=15 , metric= 'minkowski' , p = 4)
#     # Performance metrics
#     oneVsRest = OneVsRestClassifier(mdl)
#     oneVsRest.fit(x_train, y_train)
#     y_pred = oneVsRest.predict(x_test)
#     accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)

#     # Get precision, recall, f1 scores

#     precision, recall, f1score, support = score(y_test, y_pred, average='micro')

#     print(f'Test Accuracy Score of Basic {model_name}: % {accuracy}')

#     print(f'Precision : {precision}')

#     print(f'Recall : {recall}')

#     print(f'F1-score : {f1score}')

# run_model('K Nearest Neighbour', est_c=None, est_pnlty=None)

In [None]:
#corpus

In [None]:
# Mappings define the structure of the documents stored in an index.
# Here an explicit mapping is used:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/explicit-mapping.html

# The mapping is passed in the request body when the index is created:

request_body = {
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1
    },
    'mappings': {
        'properties': {
            'article_id': {'type': 'integer'},
            'headline': {'type': 'keyword'},
            'link': {'type': 'text'},
            'short_description': {'type': 'text'}
        }
    }
}


index_name = 'news'
# Create the index only if it does not exist yet
if es.indices.exists(index=index_name):
  print('index {} already exists'.format(index_name))
else:
  print('creating index {}'.format(index_name))
  es.indices.create(index=index_name, body=request_body)

index news already exists


In [None]:
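# Put the data into the index, one document per article.
# No document id is passed, so Elasticsearch generates a new one on every call:
# re-running this cell adds the articles again instead of overwriting them.
for article_id, headline, link, short_description in corpus:
  doc_body = {
      'article_id': article_id,
      'headline': headline,
      'link': link,
      'short_description': short_description
  }
  es.index(index=index_name, body=doc_body)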
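
One way to avoid duplicate documents when the indexing cell is run more than once is to give each document an explicit `_id`, and to send the documents in bulk rather than one request at a time. The sketch below only builds the bulk actions; the generator name `make_bulk_actions` is hypothetical, and using `article_id` as the document id is an assumption, not something the notebook does.

```python
def make_bulk_actions(corpus, index_name):
    """Yield one bulk-index action per (id, headline, link, description) tuple."""
    for article_id, headline, link, short_description in corpus:
        yield {
            "_index": index_name,
            "_id": article_id,  # fixed id: re-indexing overwrites instead of duplicating
            "_source": {
                "article_id": article_id,
                "headline": headline,
                "link": link,
                "short_description": short_description,
            },
        }


# With a live client these actions could be sent with the bulk helper:
#   from elasticsearch import helpers
#   helpers.bulk(es, make_bulk_actions(corpus, index_name))
```

Because every action carries a stable `_id`, running the cell again would replace the existing documents, so the document count would stay equal to the number of articles.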

In [None]:
# Define a function to retrieve and print information about an Elasticsearch index
def index_info(index_name):
  count, deleted, shards, =  es.cat.indices(index=index_name,
                                            h=['docs.count', 'docs.deleted', 'pri'])[:-1].split(' ')
  # Print the index information using a multi-line formatted string
  print(
      """
      #### INDEX INFO #####
      index_name = {}
      doc_count = {}
      shard_count = {}
      deleted_doc_count = {}
      """.format(index_name, count, shards, deleted)
  )

# Call the index_info function with a specific index name
index_info(index_name)


      #### INDEX INFO #####
      index_name = news
      doc_count = 12000
      shard_count = 1
      deleted_doc_count = 0
      


In [None]:
# This function searches for documents in Elasticsearch and returns two types of results:
# - the original Elasticsearch results (in the variable `results`)
# - a simplified list of relevant fields for each hit (in the variable `plain_results`).
# It also prints error messages if any of the required fields are missing in the Elasticsearch results.
def search(index_name, query_body):
    results = es.search(index=index_name, body=query_body, explain=False)
    plain_results = []  # The fields we are most interested in
    for hit in results['hits']['hits']:
        article_id = hit['_source'].get('article_id')
        headline = hit['_source'].get('headline')
        link = hit['_source'].get('link')
        short_description = hit['_source'].get('short_description')
        score = hit['_score']
        # Compare against None: article_id 0 is a valid id and must not count as missing
        if article_id is not None and score is not None:
            plain_results.append((article_id, headline, link, short_description, round(score, 1)))
        else:
            print('Error: missing data for article')
    return results, plain_results

In [None]:
# This function takes the simplified plain results as input and prints them to the console.
# It also checks if any of the required fields are missing and prints error messages if necessary.
def print_plain_results(plain_results):
    print('Results:')
    printed_article_ids = set()
    for article_id, headline, link, short_description, score in plain_results:
        # Compare against None: article_id 0 is a valid id
        if article_id is None:
            print('Error: missing article_id for article')
        elif score is None:
            print('Error: missing score for article')
        elif article_id not in printed_article_ids:
            print('-' * 70)
            print((headline or '').capitalize())
            print(link or '')  # substitute an empty string if link is None
            print(short_description)
            print(f'Article id: {article_id}')
            print(f'Score: {round(score, 1)}')
            printed_article_ids.add(article_id)


In [None]:
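# Smoke test: a term query for an empty string matches no documents,
# so only the "Results:" header is printed.
query_body = {
    'query':{
        'term': {
            'short_description':  ''
        }
    }
}
results, plain_results = search(index_name, query_body)
print_plain_results(plain_results)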

Results:
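
For a query that does return documents, a `match` query on an analysed text field is the usual starting point. The helper below is a small sketch that builds such a request body; the name `build_match_query` and its default arguments are assumptions for illustration, while `match`, `operator`, and `size` are standard parts of the Elasticsearch query DSL.

```python
def build_match_query(text, field="short_description", size=10):
    """Build a request body for a full-text match query on one field."""
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "match": {
                field: {
                    "query": text,
                    "operator": "or",  # any query term may match
                }
            }
        },
    }


body = build_match_query("midterm elections", size=5)
print(body["query"]["match"]["short_description"]["query"])
# prints: midterm elections
```

The result can be passed straight to the `search` function defined above, for example `search(index_name, build_match_query("midterm elections"))`. Note that `headline` is mapped as `keyword` in this notebook, so a `match` on that field would only hit exact, whole-value matches.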


# UI with Widgets

In [None]:
import ipywidgets as widgets

In [None]:
# Taking input from the user and preprocess the query
query1 = input("Enter a search query: ")
query = preprocess_text(query1)

# Getting synonyms for each word in the query
synonyms = []
for word in query.split():
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.name() not in synonyms and lemma.name() != word:
                synonyms.append(lemma.name())

# Combining the original query with synonyms
expanded_query = query + ' ' + ' '.join(synonyms)

# Searching for matching results in the index
query_body = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "short_description": expanded_query
                    }
                }
            ]
        }
    }
}

results, plain_results = search(index_name, query_body)

# If no results are found, print a message
# (`results` is always a non-empty dict, so check the list of hits instead)
if not plain_results:
    print('No results found.')
else:
    # Print the results and ask the user if they are satisfied
    print_plain_results(plain_results)
    satisfied = input("Are you satisfied with the results? (y/n) ")
    # If the user is satisfied, print a closing message
    if satisfied.lower() == 'y':
        print("Thank you for using our Search Engine!")
        print("Goodbye.")
    # If the user is not satisfied, ask which article should rank higher
    elif satisfied.lower() == 'n':
        # The feedback is only collected here; it is not yet used to re-rank results
        article_id = input("Which article_id is the most relevant? ")
        feedback_score = float(input("What should be the score? "))
        print("Thank you for the feedback!")
        print("We will work to improve it. Have a good day!")


Enter a search query: politics
Results:
----------------------------------------------------------------------
Republicans want parents to be angry. democrats are trying to give them money.
https://www.huffpost.com/entry/republicans-party-of-parents_n_618481ace4b0c8666bda7752
The 2022 midterm elections will be a big test of whether improving people's lives is effective politics.
Article id: 1710
Score: 6.9
----------------------------------------------------------------------
Trump touts 'massive' turnout at georgia rally that journalists say was 'smallest' in years
https://www.huffpost.com/entry/trump-georgia-rally-gop-primary_n_6240d6c0e4b0ccd4f5211e5d
“This is the smallest crowd I’ve seen at a rally of his in Georgia since he won the 2016 election," one local politics reporter said.
Article id: 940
Score: 5.9
Are you satisfied with the results? (y/n) y
Thank you for using our Search Engine!
Goodbye.
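
The synonym expansion in the cell above can be isolated into a small function, which makes it easier to test and to cap. The sketch below is an assumption-laden variant, not the notebook's code: `synonym_lookup` is a plain dictionary standing in for WordNet, `max_per_word` is an added limit, and underscores in multi-word entries are replaced with spaces.

```python
def expand_query(query, synonym_lookup, max_per_word=3):
    """Append up to `max_per_word` new synonyms per query word."""
    terms = query.split()
    expanded = list(terms)
    seen = set(terms)
    for word in terms:
        added = 0
        for synonym in synonym_lookup.get(word, []):
            # Multi-word entries such as "political_relation" use '_' as a separator
            synonym = synonym.replace("_", " ").lower()
            if synonym in seen:
                continue  # skip the word itself and repeated synonyms
            if added >= max_per_word:
                break
            expanded.append(synonym)
            seen.add(synonym)
            added += 1
    return " ".join(expanded)


# Hypothetical lookup table standing in for WordNet
lookup = {
    "politics": ["political_relation", "government", "politics", "political_science"],
}
print(expand_query("politics", lookup, max_per_word=2))
# prints: politics political relation government
```

Capping the number of synonyms is a design choice rather than a requirement: a long list of loosely related terms in a `match` query can pull in documents that have little to do with the original query.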


## Resources

### Elasticsearch API

* [Creating an Index](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html)
* [Using IR models](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html)
* [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html)
* [Shards](https://opster.com/guides/elasticsearch/glossary/elasticsearch-shards/)
* [Importing Kaggle datasets into Google Colab](https://www.geeksforgeeks.org/how-to-import-kaggle-datasets-directly-into-google-colab/)
* [Connecting Google Colab to Elasticsearch](https://discuss.elastic.co/t/how-to-connect-google-colab-to-elasticsearch/315618)


### More UI Examples

* [Colab Forms](https://colab.research.google.com/notebooks/forms.ipynb#scrollTo=eFN7-fUKs-Bu)
* [Jupyter Widgets](https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html#intrangeslider)
