In [1]:
!pip install beautifulsoup4
!pip install requests



I am going to scrape data from the popular Nepali news portal OnlineKhabar: https://english.onlinekhabar.com/. First, I collect the URLs of the news pages linked from the home page. Browsing the site showed that news pages start with https://english.onlinekhabar.com/ and end with .html, so the code below applies that filter using the BeautifulSoup library:

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_internal_urls(url):
    internal_urls = set()

    response = requests.get(url)

    if response.status_code == 200:
        html_content = response.text
        soup = BeautifulSoup(html_content, 'html.parser')

        anchor_tags = soup.find_all('a')

        # Filter article URLs from the anchor tags
        for tag in anchor_tags:
            href = tag.get('href')
            if href and href.startswith('https://english.onlinekhabar.com') and href.endswith('.html'):
                internal_urls.add(href)

    else:
        print("Failed to retrieve content.")

    return internal_urls


url = "https://english.onlinekhabar.com/"
internal_urls = get_internal_urls(url)

# Print all the internal URLs
for internal_url in internal_urls:
    print(internal_url)

https://english.onlinekhabar.com/president-ramchandra-paudel-appoints-nepals-ambassador-to-canada-and-portugal.html
https://english.onlinekhabar.com/change-education-system-nepal.html
https://english.onlinekhabar.com/tu-vice-chancellor-shortlist.html
https://english.onlinekhabar.com/nepal-canada-cricket-odi.html
https://english.onlinekhabar.com/dahal-calls-for-unity-among-progressive-forces.html
https://english.onlinekhabar.com/kag-nyimba-historic-site-mustang.html
https://english.onlinekhabar.com/deepfakes-threat-nepal.html
https://english.onlinekhabar.com/2024-gac-aion-y-ev-price.html
https://english.onlinekhabar.com/online-ticketing-national-park-nepal.html
https://english.onlinekhabar.com/everest-climbers-have-to-bring-back-faeces-and-urine.html
https://english.onlinekhabar.com/2024-asus-rog-strix-g18.html
https://english.onlinekhabar.com/nsjf-pulsar-sports-award-2024.html
https://english.onlinekhabar.com/rare-quartz-crystal-gorkha-darbar.html
https://english.onlinekhabar.com/crick
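
The filter above only keeps links that are already written as absolute URLs. As a minimal sketch, assuming the site may also use relative links or links with fragments, the same rule can be applied after resolving each href with `urljoin`. The helper name `normalize_article_url` and the sample hrefs that are not in the output above are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://english.onlinekhabar.com/"

def normalize_article_url(href, base_url=BASE_URL):
    """Return an absolute article URL, or None if href is not an article link."""
    if not href:
        return None
    # Resolve relative links such as "/some-article.html" against the base URL
    absolute = urljoin(base_url, href.strip())
    parsed = urlparse(absolute)
    same_host = parsed.netloc == urlparse(base_url).netloc
    if parsed.scheme in ("http", "https") and same_host and parsed.path.endswith(".html"):
        # Drop the query string and fragment so duplicates collapse in a set
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    return None

# Illustrative hrefs (the last three are made up for this sketch)
hrefs = [
    "/deepfakes-threat-nepal.html",
    "https://english.onlinekhabar.com/nepal-canada-cricket-odi.html#comments",
    "https://www.onlinekhabar.com/some-page.html",
    "https://english.onlinekhabar.com/category/travel",
    None,
]
article_urls = {u for u in map(normalize_article_url, hrefs) if u}
print(sorted(article_urls))
```

Inside `get_internal_urls`, the call `normalize_article_url(tag.get('href'))` could replace the `startswith`/`endswith` check, which would also put the already imported `urljoin` to use.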

Now, let's get the news (text data) from all these pages. I am going to use the BeautifulSoup library to scrape and store the text from each page, and print it to see what the news data looks like.

In [None]:
def get_page_data(url):
    document = ""

    response = requests.get(url)

    if response.status_code == 200:
        html_content = response.text

        soup = BeautifulSoup(html_content, 'html.parser')

        body = soup.find('div', class_='post-content-wrap')
        if body:
            paragraphs = body.find_all('p')

            # Join paragraphs into a single string
            document = '\n'.join([p.get_text(separator='\n', strip=True) for p in paragraphs])

    else:
        print("Failed to retrieve content.")

    return document

documents = []

# Loop through all the URLs
for url in internal_urls:
    document = get_page_data(url)

    print(document)
    print("="*50)
    print("\n")

    # Append the document to the list
    documents.append(document)


Kathmandu, February 9
President
Ramchandra Paudel
has appointed Nepal’s ambassadors for Canada and Portugal.
President Paudel appointed Bharat Raj Paudyal as Nepal’s ambassador to Canada and Sanil Nepal as Nepal’s ambassador to Portugal as per the Constitution.
According to the recommendation of the Council of Ministers, the process of their parliamentary hearing had already been completed. Ambassador Paudyal, who was appointed to Canada, is the outgoing
Foreign Secretary.


Currently, the role of local government in education is of primary concern. There was a general complaint from political leaders to activists that the centralised governance system had delayed the entire development work including education and brought the country’s situation to a critical state.
Many young leaders would often be heard advocating for a democratic
education system
. These people are now in local government leadership roles and have failed to deliver on promised changes to the education system, leavi

Using all of this data for my task would be cumbersome. So, for simplicity, I am going to scrape and use data from the https://english.onlinekhabar.com/tourists-visit-nepal-in-january.html page only.


In [None]:
text = get_page_data('https://english.onlinekhabar.com/tourists-visit-nepal-in-january.html')

In [None]:
text

'Kathmandu, February 1\nA total of 79,100 foreign tourists arrived in Nepal in January.\nThe number was up by 24,026 as compared to January 2023 when 55,074 foreign tourists visited Nepal.\nOut of the 79,100 tourists in January, the highest number of visitors to Nepal was from India, with 24,139 tourists, compared to 16,436 in January 2023, according to the\nNepal Tourism Board\n.\nAdditionally, there were 7,267 tourists from China, 7,047 from America, 4,619 from Thailand, 3,812 from South Korea, 3,629 from Bangladesh, and 3,421 from Australia who entered Nepal in January.\nSimilarly, 3,276 tourists from the UK, 2,229 from Bhutan, and 1,568 from Japan visited Nepal during the month.'

In [None]:
text = text.replace("\n", " ")

In [None]:
text

'Kathmandu, February 1 A total of 79,100 foreign tourists arrived in Nepal in January. The number was up by 24,026 as compared to January 2023 when 55,074 foreign tourists visited Nepal. Out of the 79,100 tourists in January, the highest number of visitors to Nepal was from India, with 24,139 tourists, compared to 16,436 in January 2023, according to the Nepal Tourism Board . Additionally, there were 7,267 tourists from China, 7,047 from America, 4,619 from Thailand, 3,812 from South Korea, 3,629 from Bangladesh, and 3,421 from Australia who entered Nepal in January. Similarly, 3,276 tourists from the UK, 2,229 from Bhutan, and 1,568 from Japan visited Nepal during the month.'
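
Replacing each newline with a space leaves a stray space before punctuation, as in "Nepal Tourism Board ." above. A minimal sketch of a slightly stricter cleanup, assuming a regex-based pass is acceptable here (the helper name `normalize_whitespace` is hypothetical):

```python
import re

def normalize_whitespace(raw):
    """Collapse whitespace runs and remove spaces left before punctuation."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)

# A fragment of the scraped article, with the original line breaks
sample = "according to the\nNepal Tourism Board\n.\nAdditionally, there were 7,267 tourists from China"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Numbers such as 7,267 are untouched because the second substitution only fires when whitespace comes directly before a punctuation mark.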

Now, I am going to install and import the necessary packages and libraries. I will also define the variables for the Neo4j database setup. I have used the getpass library to prompt for credentials so that they are not stored in the notebook.

In [2]:
%pip install langchain openai  tiktoken neo4j transformers

Collecting langchain
  Downloading langchain-0.1.7-py3-none-any.whl (815 kB)
Collecting openai
  Downloading openai-1.12.0-py3-none-any.whl (226 kB)
Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting neo4j
  Downloading neo4j-5.17.0.tar.gz (197 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ...

In [3]:
import os
import re
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.document_loaders import WikipediaLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [4]:
import getpass

openai_api_key = getpass.getpass(prompt='Enter your OPENAI API Key: ')
neo4j_uri = getpass.getpass(prompt='Enter your neo4j uri: ')
neo4j_username = getpass.getpass(prompt='Enter your neo4j username: ')
neo4j_password = getpass.getpass(prompt='Enter your neo4j password: ')

Enter your OPENAI API Key: ··········
Enter your neo4j uri: ··········
Enter your neo4j username: ··········
Enter your neo4j password: ··········


First, I am going to tokenize the text and split it into chunks.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def bert_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=bert_len,
    separators=['\n\n', '\n', ' ', ''],
)

# create_documents expects a list of strings; a bare string would be iterated character by character
documents = text_splitter.create_documents([text])
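
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified illustration that slides a fixed window over whitespace-separated words. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the listed separators and measures length with `bert_len`); the helper `chunk_tokens` and the tiny sizes are assumptions made only for this sketch.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Whitespace tokens stand in for BERT tokens in this sketch
words = "A total of 79,100 foreign tourists arrived in Nepal in January.".split()
chunks = chunk_tokens(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks repeat the last two words of the chunk before them.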

In [None]:
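# Instantiate Neo4j vector from documents
neo4j_vector = Neo4jVector.from_documents(
    documents,
    # Pass the key collected with getpass instead of relying on the OPENAI_API_KEY environment variable
    OpenAIEmbeddings(openai_api_key=openai_api_key),
    url=neo4j_uri,
    username=neo4j_username,
    password=neo4j_password
)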



Up to this point, I have instantiated a Neo4j vector store from the preprocessed text document. Moving ahead, I am going to use LangChain's GraphCypherQAChain to query the graph database and get answers to questions. (Since LangChain and LLMs are taking NLP to another level, why not use this power in my project?)

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import GraphCypherQAChain
from langchain.graphs import Neo4jGraph

In [None]:
graph = Neo4jGraph(
    url=neo4j_uri, username=neo4j_username, password=neo4j_password
)

In [None]:
# print(graph.schema)

Extracting entities and relationships using spaCy

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

entities = []
relationships = []

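# Extract entities (countries) and relationships (number of tourists).
# Each number is paired with the most recent country seen before it, so in
# "7,267 tourists from China" the count is attached to the preceding country.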
for ent in doc.ents:
    if ent.label_ == "GPE":
        entities.append(ent.text)
    elif ent.text.replace(',', '').isdigit():
        if entities:
            relationships.append((entities[-1], int(ent.text.replace(',', ''))))

print("Entities:", entities)
print("Relationships:", relationships)



Entities: ['Nepal', 'Nepal', 'Nepal', 'India', 'China', 'America', 'Thailand', 'South Korea', 'Bangladesh', 'Australia', 'Nepal', 'UK', 'Bhutan', 'Japan', 'Nepal']
Relationships: [('Nepal', 24026), ('Nepal', 55074), ('Nepal', 79100), ('India', 24139), ('India', 16436), ('India', 7267), ('China', 7047), ('America', 4619), ('Thailand', 3812), ('South Korea', 3629), ('Bangladesh', 3421), ('Nepal', 3276), ('UK', 2229), ('Bhutan', 1568)]
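
The output above shows the counts shifted by one country (for example `('India', 7267)`, although the article says 7,267 tourists came from China), because the number comes before the country in phrases like "7,267 tourists from China". As a hedged alternative, assuming the article keeps the "N [tourists] from Country" wording, a regular expression can pair each number with the country that follows it. The pattern and the helper `extract_country_counts` are illustrative and not part of the original pipeline.

```python
import re

# Matches "7,267 tourists from China", "7,047 from America" and "3,276 tourists from the UK"
COUNT_FROM = re.compile(
    r"(\d[\d,]*)\s+(?:tourists\s+)?from\s+(?:the\s+)?"
    r"([A-Z][A-Za-z]*(?:\s[A-Z][A-Za-z]*)*)"
)

def extract_country_counts(article_text):
    """Return (country, count) pairs for every 'N from Country' phrase."""
    pairs = []
    for number, country in COUNT_FROM.findall(article_text):
        pairs.append((country, int(number.replace(",", ""))))
    return pairs

# Two sentences taken from the scraped article
sample = (
    "Additionally, there were 7,267 tourists from China, 7,047 from America, "
    "4,619 from Thailand, 3,812 from South Korea, 3,629 from Bangladesh, and "
    "3,421 from Australia who entered Nepal in January. Similarly, 3,276 tourists "
    "from the UK, 2,229 from Bhutan, and 1,568 from Japan visited Nepal during the month."
)
pairs = extract_country_counts(sample)
print(pairs)
```

The sentence about India ("from India, with 24,139 tourists") uses a different word order and is not matched by this pattern, so it would need its own rule.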


Now, I am going to write Cypher queries to add the data to Neo4j. Since the data contains information about countries, the month, and the number of tourists visiting Nepal, I will create the nodes and relationships accordingly.

In [None]:
from neo4j import GraphDatabase

# Neo4j connection parameters
uri = neo4j_uri
username = neo4j_username
password = neo4j_password

# Function to execute a Cypher query and return its records
def execute_query(query):
    with GraphDatabase.driver(uri, auth=(username, password)) as driver:
        with driver.session() as session:
            result = session.run(query)
            # Materialize the records before the session closes
            return list(result)

In [None]:

cypher_queries = []

# Creating nodes for countries
for entity in entities:
    cypher_queries.append(f"MERGE (:Country {{name: '{entity}'}})")

# Creating node for month
cypher_queries.append("MERGE (:Month {name: 'January'})")

cypher_query = "\n".join(cypher_queries)

# calling execute_query function defined above to create nodes
execute_query(cypher_query)

# Creating relationships for number of tourists
for source, value in relationships:
    cypher_query = f"MATCH (source:Country {{name: '{source}'}}), " \
                   f"(month:Month {{name: 'January'}}) " \
                   f"MERGE (source)-[:TOURISTS {{number: {value}}}]->(month)"

    # Executing the Cypher query to create relationships
    execute_query(cypher_query)

print("Data added to Neo4j database successfully.")


Data added to Neo4j database successfully.
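
The queries above are built with f-strings, which breaks if a name contains a quote character. A minimal sketch of the same MERGE logic using Cypher parameters instead, assuming the `(country, count)` pairs are already available; the helper `build_tourist_statements` is hypothetical, and the commented usage relies on the driver's `session.run(query, parameters)` call.

```python
def build_tourist_statements(pairs, month="January"):
    """Return (query, parameters) tuples that create Country, Month and TOURISTS data."""
    query = (
        "MERGE (c:Country {name: $country}) "
        "MERGE (m:Month {name: $month}) "
        "MERGE (c)-[:TOURISTS {number: $number}]->(m)"
    )
    return [
        (query, {"country": country, "month": month, "number": number})
        for country, number in pairs
    ]

# Counts taken from the article text
statements = build_tourist_statements([("China", 7267), ("Bhutan", 2229)])
for query, params in statements:
    print(params)

# With a live database, each statement could be sent like this:
# with GraphDatabase.driver(uri, auth=(username, password)) as driver:
#     with driver.session() as session:
#         for query, params in statements:
#             session.run(query, params).consume()
```

Because the values travel separately from the query text, the same query string is reused for every pair and no manual quoting is needed.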


Up to this point, the data has been added to the Neo4j database. Now I will check the graph schema and create a GraphCypherQAChain. Finally, I will run the chain and check its output on some queries.

In [None]:
graph.refresh_schema()

In [None]:
graph.schema

'Node properties are the following:\nChunk {embedding: LIST, id: STRING, text: STRING},Country {name: STRING},Month {name: STRING}\nRelationship properties are the following:\nTOURISTS {number: INTEGER}\nThe relationships are the following:\n(:Country)-[:TOURISTS]->(:Month)'

In [None]:
chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0), graph=graph, verbose=True
)

In [None]:
chain.run("Do people from different countries visit Nepal?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c1:Country)-[t:TOURISTS]->(m:Month)<-[t2:TOURISTS]-(c2:Country {name: "Nepal"})
WHERE c1 <> c2
RETURN c1, c2, t.number as tourists_visited, m.name as month_visited
Full Context:
[{'c1': {'name': 'India'}, 'c2': {'name': 'Nepal'}, 'tourists_visited': 24139, 'month_visited': 'January'}, {'c1': {'name': 'India'}, 'c2': {'name': 'Nepal'}, 'tourists_visited': 16436, 'month_visited': 'January'}, {'c1': {'name': 'India'}, 'c2': {'name': 'Nepal'}, 'tourists_visited': 7267, 'month_visited': 'January'}, {'c1': {'name': 'China'}, 'c2': {'name': 'Nepal'}, 'tourists_visited': 7047, 'month_visited': 'January'}, {'c1': {'name': 'America'}, 'c2': {'name': 'Nepal'}, 'tourists_visited': 4619, 'month_visited': 'January'}, {'c1': {'name': 'Thailand'}, 'c2': {'name': 'Nepal'}, 'tourists_visited': 3812, 'month_visited': 'January'}, {'c1': {'name': 'South Korea'}, 'c2': {'name': 'Nepal'}, 'tourists_vis

'Yes, people from India, China, America, Thailand, South Korea, Bangladesh, UK, and Bhutan visit Nepal.'

In [None]:
chain.run("From which countries do people visit Nepal?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Country)-[r:TOURISTS]->(:Month)<-[:TOURISTS]-(n:Country {name: "Nepal"})
RETURN c.name
Full Context:
[{'c.name': 'Nepal'}, {'c.name': 'Nepal'}, {'c.name': 'Nepal'}, {'c.name': 'India'}, {'c.name': 'India'}, {'c.name': 'India'}, {'c.name': 'China'}, {'c.name': 'America'}, {'c.name': 'Thailand'}, {'c.name': 'South Korea'}]

> Finished chain.


'India, China, America, Thailand, South Korea.'

In [None]:
chain.run("Did tourists visit Nepal in January?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Country {name: "Nepal"})-[:TOURISTS]->(m:Month {name: "January"})
RETURN c, m
Full Context:
[{'c': {'name': 'Nepal'}, 'm': {'name': 'January'}}, {'c': {'name': 'Nepal'}, 'm': {'name': 'January'}}, {'c': {'name': 'Nepal'}, 'm': {'name': 'January'}}, {'c': {'name': 'Nepal'}, 'm': {'name': 'January'}}]

> Finished chain.


'Yes, tourists visited Nepal in January.'

In [None]:
chain.run("From which country were there the highest number of visitors to Nepal?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Country)-[r:TOURISTS]->(:Month)<-[:TOURISTS]-(nepal:Country {name: "Nepal"})
RETURN c.name, MAX(r.number) as highest_number_visitors
Full Context:
[{'c.name': 'Nepal', 'highest_number_visitors': 79100}, {'c.name': 'India', 'highest_number_visitors': 24139}, {'c.name': 'China', 'highest_number_visitors': 7047}, {'c.name': 'America', 'highest_number_visitors': 4619}, {'c.name': 'Thailand', 'highest_number_visitors': 3812}, {'c.name': 'South Korea', 'highest_number_visitors': 3629}, {'c.name': 'Bangladesh', 'highest_number_visitors': 3421}, {'c.name': 'UK', 'highest_number_visitors': 2229}, {'c.name': 'Bhutan', 'highest_number_visitors': 1568}]

> Finished chain.


'India had the highest number of visitors to Nepal.'

In [None]:
chain.run("Were there visitors from China to Nepal in January?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c1:Country {name: 'China'})-[:TOURISTS]->(m:Month)<-[:TOURISTS]-(c2:Country {name: 'Nepal'})
RETURN c1, m, c2
Full Context:
[{'c1': {'name': 'China'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}, {'c1': {'name': 'China'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}, {'c1': {'name': 'China'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}, {'c1': {'name': 'China'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}]

> Finished chain.


'Yes, there were visitors from China to Nepal in January.'

In [None]:
chain.run("Did people from Bhutan come to Nepal?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c1:Country {name: "Bhutan"})-[:TOURISTS]->(m:Month)<-[:TOURISTS]-(c2:Country {name: "Nepal"})
RETURN c1, m, c2;
Full Context:
[{'c1': {'name': 'Bhutan'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}, {'c1': {'name': 'Bhutan'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}, {'c1': {'name': 'Bhutan'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}, {'c1': {'name': 'Bhutan'}, 'm': {'name': 'January'}, 'c2': {'name': 'Nepal'}}]

> Finished chain.


'Yes, people from Bhutan came to Nepal in January.'

In [None]:
chain.run("Did people from China visit Nepal?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Country {name: 'China'})-[:TOURISTS]->(m:Month)<-[:TOURISTS]-(n:Country {name: 'Nepal'})
RETURN c, m, n
Full Context:
[{'c': {'name': 'China'}, 'm': {'name': 'January'}, 'n': {'name': 'Nepal'}}, {'c': {'name': 'China'}, 'm': {'name': 'January'}, 'n': {'name': 'Nepal'}}, {'c': {'name': 'China'}, 'm': {'name': 'January'}, 'n': {'name': 'Nepal'}}, {'c': {'name': 'China'}, 'm': {'name': 'January'}, 'n': {'name': 'Nepal'}}]

> Finished chain.


'Yes, people from China visited Nepal.'

In [None]:
chain.run("From which countries did people visit Nepal?")

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (c:Country)-[t:TOURISTS]->(:Month)<-[:TOURISTS]-(:Country {name: "Nepal"})
RETURN c.name
Full Context:
[{'c.name': 'Nepal'}, {'c.name': 'Nepal'}, {'c.name': 'Nepal'}, {'c.name': 'India'}, {'c.name': 'India'}, {'c.name': 'India'}, {'c.name': 'China'}, {'c.name': 'America'}, {'c.name': 'Thailand'}, {'c.name': 'South Korea'}]

> Finished chain.


'India, China, America, Thailand, South Korea visited Nepal.'

Note: I am using an OpenAI API key, which is free up to a certain limit. Neo4j also provides one free instance, which I am using for this task.