# Inspect Business News
This notebook inspects the news files (stored as CSV files) and ingests their content into the vector database.

In [1]:
%load_ext autoreload
%autoreload 2

In [6]:
import pandas as pd
import glob
import os
import sys
import warnings
from pathlib import Path
warnings.filterwarnings("ignore")

In [7]:
# Fix paths for src files
project_root = Path(os.getcwd()).parent
script_dir = project_root / "src"
if str(script_dir) not in sys.path:
    sys.path.append(str(script_dir))

In [8]:
from load.explore_news_schema import analyze_schemas

In [4]:
# Directory holding the downloaded news CSV files (requires the SCRATCH environment variable)
SCRATCH_DIR = os.environ["SCRATCH"]  # raises KeyError early if SCRATCH is unset
NEWS_DIR = os.path.join(SCRATCH_DIR, "mshauri-fedha/data/news")

# Run the exploration
analyze_schemas(NEWS_DIR)

🔍 Scanning 22 files in '/capstor/scratch/cscs/tligawa/mshauri-fedha/data/news'...

--- Schema Report ---

TYPE 1: Found in 5 files
Columns: ['description', 'published date', 'publisher', 'title', 'url']
Examples: ['google_news_10-11-2025.csv', 'google_news_10-11-2025-19-27.csv', 'google_news_19-11-2025-19-49.csv'] ... (+2 others)

TYPE 2: Found in 7 files
Columns: ['authors', 'date', 'full_content', 'image', 'source', 'summary', 'title', 'url', 'word_count']
Examples: ['kenya_news_full_27-10-2025.csv', 'kenya_news_full_17-11-2025-17-52.csv', 'newsdata_10-11-2025.csv'] ... (+4 others)

TYPE 3: Found in 10 files
Columns: ['content', 'date', 'source', 'title', 'url']
Examples: ['gnews_19-11-2025-19-49.csv', 'the_news_10-11-2025.csv', 'the_news_19-11-2025-19-49.csv'] ... (+7 others)

--- Date Format Sample ---
Sample from column 'published date' in google_news_10-11-2025.csv:
['Sun, 09 Nov 2025 03:15:00 GMT', 'Sun, 09 Nov 2025 18:45:00 GMT', 'Tue, 04 Nov 2025 06:00:00 GMT', 'Tue, 04 Nov
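The date sample above is in the RFC 2822 style (`Sun, 09 Nov 2025 03:15:00 GMT`), while the other schemas expose a plain `date` column. A minimal sketch of normalizing both to ISO `YYYY-MM-DD` strings follows. The helper name `normalize_news_date` is hypothetical, and it assumes the non-RFC dates are ISO formatted, which this notebook does not confirm for every file.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def normalize_news_date(raw: str) -> str:
    """Return an ISO 'YYYY-MM-DD' string for an RFC 2822 or ISO formatted date."""
    try:
        # Handles values such as 'Sun, 09 Nov 2025 03:15:00 GMT'
        return parsedate_to_datetime(raw).date().isoformat()
    except (TypeError, ValueError):
        # Assumption: anything else is ISO formatted, e.g. '2025-11-03'
        return datetime.fromisoformat(raw).date().isoformat()


print(normalize_news_date("Sun, 09 Nov 2025 03:15:00 GMT"))
```

Values that match neither layout raise `ValueError`, which makes unexpected formats visible instead of silently dropping them.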

In [9]:
from load.ingest_news import ingest_news_data
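The schema report lists three column layouts, so the rows need a common shape before ingestion. The sketch below shows one way to harmonize them, written against plain dict rows (as `csv.DictReader` would produce) to stay dependency-free. The alias mapping and the helper name `harmonize_row` are assumptions for illustration, not the actual logic inside `ingest_news_data`.

```python
# Hypothetical mapping from source-specific column names to a common schema.
COLUMN_ALIASES = {
    "published date": "date",    # TYPE 1
    "publisher": "source",       # TYPE 1
    "description": "content",    # TYPE 1
    "full_content": "content",   # TYPE 2
}
COMMON_COLUMNS = ("title", "url", "date", "source", "content")


def harmonize_row(row: dict) -> dict:
    """Rename aliased columns and keep only the common ones."""
    renamed = {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}
    # Columns missing from a schema are filled with an empty string.
    return {column: renamed.get(column, "") for column in COMMON_COLUMNS}


type1_row = {
    "description": "Short summary",
    "published date": "Sun, 09 Nov 2025 03:15:00 GMT",
    "publisher": "Example Publisher",
    "title": "Example title",
    "url": "https://example.com/a",
}
print(harmonize_row(type1_row))
```

Extra columns such as `authors`, `image`, or `word_count` are dropped here; whether the real pipeline keeps them as metadata is not shown in this notebook.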

In [10]:
from load.start_ollama import start_ollama_server, pull_embedding_model

In [7]:
start_ollama_server()

🚀 Starting Ollama Server...
⏳ Waiting for server to boot...
✅ Server started successfully.


True
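The log above shows the helper waiting for the server to boot before returning `True`. How `start_ollama_server` does this is not shown in the notebook; the sketch below is one common approach, polling the TCP port until it accepts a connection. The function name `wait_for_port` and its defaults are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not ready yet (refused or timed out): wait and retry.
            time.sleep(interval)
    return False
```

With the address used later in this notebook, a call would look like `wait_for_port("127.0.0.1", 25000)`. A successful connection only shows that the port is open, not that a model is loaded.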

In [8]:
# pull embedding model
pull_embedding_model("nomic-embed-text")

⬇️  Requesting pull for 'nomic-embed-text'...
   success
✅ Model 'nomic-embed-text' installed successfully!


In [9]:
VECTOR_DB = "mshauri_fedha_chroma_db"
EMBEDDING_MODEL = "nomic-embed-text" # Make sure this matches your existing DB model

# Run
ingest_news_data(NEWS_DIR, VECTOR_DB, EMBEDDING_MODEL)

🚀 Found 22 news files. Processing...


Reading CSVs: 100%|██████████| 22/22 [00:00<00:00, 179.19file/s]

   📉 Condensed into 198 unique articles.
🧠 Embedding 455 chunks into Vector DB...



  embeddings = OllamaEmbeddings(model=model, base_url="http://127.0.0.1:25000")
  vectorstore = Chroma(persist_directory=vector_db_path, embedding_function=embeddings)


Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.


Embedding News: 100%|██████████| 455/455 [00:21<00:00, 21.06chunk/s]


✅ News Ingestion Complete.
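The log reports 22 files condensed into 198 unique articles and then split into 455 chunks. The internals of `ingest_news_data` are not shown here, so the following is only a sketch of the two steps under stated assumptions: duplicates are identified by `url`, and chunks are fixed-size character windows with a small overlap. The function names, the key, and the sizes are all hypothetical.

```python
def deduplicate(articles: list[dict], key: str = "url") -> list[dict]:
    """Keep the first article seen for each value of `key`."""
    seen = set()
    unique = []
    for article in articles:
        if article[key] not in seen:
            seen.add(article[key])
            unique.append(article)
    return unique


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-character windows that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


articles = [
    {"url": "https://example.com/a", "content": "x" * 2500},
    {"url": "https://example.com/a", "content": "x" * 2500},  # duplicate URL
    {"url": "https://example.com/b", "content": "short article"},
]
unique = deduplicate(articles)
chunks = [chunk for article in unique for chunk in chunk_text(article["content"])]
print(len(unique), len(chunks))
```

Long articles yield several chunks and short ones a single chunk, which is consistent with the chunk count in the log exceeding the article count.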





In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

In [None]:
# --- CONFIG ---
VECTOR_DB_PATH = "mshauri_fedha_chroma_db"
EMBEDDING_MODEL = "nomic-embed-text"
OLLAMA_URL = "http://127.0.0.1:25000"

In [None]:
# Connect to DB
print("🔌 Connecting to Vector Store...")
embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=OLLAMA_URL)
vectorstore = Chroma(persist_directory=VECTOR_DB_PATH, embedding_function=embeddings)

In [None]:
# Get stats (_collection is a private attribute of the LangChain Chroma wrapper)
count = vectorstore._collection.count()
print(f"✅ Total Documents stored: {count}")

In [10]:
# Peek at a sample
print("\n👀 Random Sample Document:")
# get(limit=1) returns the first stored record, not a true random draw
result = vectorstore.get(limit=1)

if result['ids']:
    meta = result['metadatas'][0]
    content = result['documents'][0]
    
    print("--- Metadata ---")
    print(f"Date:   {meta.get('date')}")
    print(f"Source: {meta.get('source')}")
    print(f"Type:   {meta.get('type')}")
    
    print("\n--- Content (First 300 chars) ---")
    print(content[:300] + "...")
else:
    print("Database is empty!")

🔌 Connecting to Vector Store...
✅ Total Documents stored: 455

👀 Random Sample Document:
--- Metadata ---
Date:   2025-11-03
Source: african markets
Type:   news

--- Content (First 300 chars) ---
Title: BGFI Holding finally gets regulatory approval for its BVMAC IPO: inside a tumultuous IPO journey - african markets
Date: 2025-11-03
Source: african markets

BGFI Holding finally gets regulatory approval for its BVMAC IPO: inside a tumultuous IPO journey african markets...
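The stored chunks are retrieved by nearest-neighbour search over their embedding vectors. The toy sketch below uses made-up 3-dimensional vectors and cosine similarity to show the idea. The distance metric this Chroma collection actually uses is not shown in the notebook, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: dict, k: int = 2) -> list:
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors standing in for real embeddings.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))
```

A real vector store replaces the full sort with an approximate index, but the ranking principle is the same: the closest vectors are returned whether or not they are topically on target.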


In [11]:
# --- TEST QUERY ---
query = "How have the protests impacted the Kenyan economy?"

print(f"\nüîé Searching for: '{query}'...")

# Perform Similarity Search
results = vectorstore.similarity_search(query, k=10)

print(f"Found {len(results)} relevant articles:\n")

for i, doc in enumerate(results):
    print(f"Result #{i+1} -------------------------")
    print(f"📅 Date:   {doc.metadata.get('date', 'N/A')}")
    print(f"📰 Source: {doc.metadata.get('source', 'N/A')}")
    print(f"📝 Excerpt: {doc.page_content[:2500].replace(chr(10), ' ')}...")  # Replace newlines with spaces for a clean print
    print("------------------------------------\n")


🔎 Searching for: 'How have the protests impacted the Kenyan economy?'...
Found 10 relevant articles:

Result #1 -------------------------
📅 Date:   2025-09-29
📰 Source: Devdiscourse
📝 Excerpt: Title: Madagascar's Government Dissolution Amidst Gen Z-Inspired Protests: A Call for Dialogue and Reform Date: 2025-09-29 Source: Devdiscourse  In response to youth-led protests over worsening water and power shortages, Malagasy President Andry Rajoelina announced the dissolution of the government on Monday. The unrest, largely influenced by Gen Z movements in Kenya and Nepal, marks the largest such demonstrations in Madagascar in years. These rallies significantly challenge Rajoelina's leadership since his recent 2023 re-election. The president offered an apology for governmental shortcomings and vowed to engage in dialogue with the youth while ensuring support for affected businesses. The protests have seen significant casualties, with both protestors and bystanders affected, alth