<a href="https://colab.research.google.com/github/teofizzy/mshauri-fedha/blob/main/notebooks/01_data_ingestion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Mshauri-fedha**
Mshauri-fedha is an NLP-powered financial advisory system that processes diverse financial data sources, extracts key insights, and generates actionable advice tailored to the needs of the Central Bank of Kenya (CBK). The system should ingest multiple forms of data, including official reports (PDF, Word, spreadsheets), market news articles, policy briefs, and social media feeds that influence public sentiment.

Using Natural Language Processing techniques, the system automatically extracts relevant indicators such as inflation trends, interest rate changes, currency fluctuations, loan performance, and emerging risks in the banking sector. Once the data is analyzed, the system should generate plain-language summaries and policy insights.

For example, it might highlight that `“inflationary pressure is rising due to food and fuel prices, suggesting a possible need to adjust the central bank rate,”` or that `“loan defaults in the agricultural sector have spiked by 12% this quarter, indicating credit risk exposure.”`

The ultimate goal is to create a prototype policy advisory tool that blends structured data analytics with unstructured text understanding, supports interactive queries, and delivers trustworthy financial insights that could guide CBK decision-making on monetary policy, financial stability, and regulatory oversight.

## Data ingestion
In this section, we collect data on indicators such as inflation trends, interest rate changes, currency fluctuations, loan performance, and emerging risks in the banking sector from the sources listed below.
## Sources
### A. Central Bank of Kenya (CBK)
- Monetary Policy Committee press releases
- Monthly economic indicators
- Bank supervision reports

### B. Kenya National Bureau of Statistics
- CPI and inflation data
- Economic surveys and quarterly GDP

### C. Financial News
- Business Daily Africa
- Nation Business
- Reuters (Kenya tag)

In [None]:
!git clone https://github.com/teofizzy/mshauri-fedha.git

Cloning into 'mshauri-fedha'...
remote: Enumerating objects: 76, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 76 (delta 32), reused 43 (delta 13), pack-reused 0 (from 0)
Receiving objects: 100% (76/76), 124.87 KiB | 4.03 MiB/s, done.
Resolving deltas: 100% (32/32), done.


In [None]:
!pip install certifi --quiet

In [None]:
from google.colab import drive, userdata
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
github_username = userdata.get('GITHUB_USERNAME')
github_token = userdata.get('GITHUB_TOKENS')
github_email = userdata.get('GITHUB_EMAIL')

In [None]:
!pip install pdfplumber spacy sentence-transformers faiss-cpu openai --quiet
!pip install black flake8 mypy pytest --quiet


In [None]:
import os

In [None]:
os.listdir()

['.config', 'mshauri-fedha', 'drive', 'sample_data']

In [None]:
os.chdir('mshauri-fedha')

In [None]:
import pandas as pd

### Central Bank of Kenya

In [None]:
!pip install gnews newspaper3k lxml[html_clean] --quiet


In [None]:
from fynesse.access import CBKExplorer

In [None]:
import requests

In [None]:
# Check that the CBK site is reachable (a 200 status alone does not grant permission to scrape)
cbk_base_url = "https://www.centralbank.go.ke/"
response = requests.get(cbk_base_url)
print(response.status_code)

200
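
A 200 response only shows that the site is reachable. A site's `robots.txt` is the usual place to look for crawl permissions. The block below is a minimal sketch of that check using the standard library; the `SAMPLE_ROBOTS` text is hypothetical and is not the actual CBK `robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only to illustrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /wp-admin/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://www.centralbank.go.ke/forex/"))
print(is_allowed(SAMPLE_ROBOTS, "https://www.centralbank.go.ke/wp-admin/options.php"))
```

Against the live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the real file instead of parsing a string.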


In [None]:
repo_url = f"https://{github_token}@github.com/teofizzy/mshauri-fedha.git"

In [None]:
# !git config --global user.email {github_email}
# !git config --global user.name {github_username}

# !git add fynesse/access.py
# !git commit -m "Business news fetchers"
# !git push {repo_url} main

In [None]:
from fynesse.access import load_repo

In [None]:
# Init explorer
cbk_explorer = CBKExplorer(github_username)

In [None]:
root_dir = '/content/drive/MyDrive/school-projects/mshauri-fedha/data/cbk'

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(os.path.join(root_dir, 'file_links.csv'))
df.head()

Unnamed: 0,page,text,file_url
0,https://www.centralbank.go.ke/website-navigati...,government securities auction rules,https://www.centralbank.go.ke/wp-content/uploa...
1,https://www.centralbank.go.ke/website-navigati...,benchmark bond programme guidelines,https://www.centralbank.go.ke/wp-content/uploa...
2,https://www.centralbank.go.ke/website-navigati...,kenya master repurchase agreement,https://www.centralbank.go.ke/wp-content/uploa...
3,https://www.centralbank.go.ke/website-navigati...,diaspora remittances survey,https://www.centralbank.go.ke/wp-content/uploa...
4,https://www.centralbank.go.ke/website-navigati...,kenya electronic payment and settlement system...,https://www.centralbank.go.ke/wp-content/uploa...


In [None]:
df.shape

(11049, 3)

In [None]:
print(df.duplicated(subset=['file_url']).sum())
df.drop_duplicates(subset=['file_url'], inplace=True)

0


In [None]:
df['text'] = df['text'].str.replace('\n', ' ')
# df['text'] = df['text'].apply(lambda x: x.strip().lower())
df.head()

Unnamed: 0,page,text,file_url
0,https://www.centralbank.go.ke/website-navigati...,government securities auction rules,https://www.centralbank.go.ke/wp-content/uploa...
1,https://www.centralbank.go.ke/website-navigati...,benchmark bond programme guidelines,https://www.centralbank.go.ke/wp-content/uploa...
2,https://www.centralbank.go.ke/website-navigati...,kenya master repurchase agreement,https://www.centralbank.go.ke/wp-content/uploa...
3,https://www.centralbank.go.ke/website-navigati...,diaspora remittances survey,https://www.centralbank.go.ke/wp-content/uploa...
4,https://www.centralbank.go.ke/website-navigati...,kenya electronic payment and settlement system...,https://www.centralbank.go.ke/wp-content/uploa...


In [None]:
df[df['text'].str.contains('forex', na=False)]

Unnamed: 0,page,text,file_url
10,https://www.centralbank.go.ke/forex/,forex bureau rates,https://www.centralbank.go.ke/uploads/foreign_...
11,https://www.centralbank.go.ke/forex/,forex bureau rates,https://www.centralbank.go.ke/uploads/foreign_...
12,https://www.centralbank.go.ke/forex/,forex bureau rates,https://www.centralbank.go.ke/uploads/foreign_...
13,https://www.centralbank.go.ke/forex/,forex bureau rates,https://www.centralbank.go.ke/uploads/foreign_...
14,https://www.centralbank.go.ke/forex/,forex bureau rates,https://www.centralbank.go.ke/uploads/foreign_...
...,...,...,...
1227,https://www.centralbank.go.ke/forex/,forex bureau rates,https://www.centralbank.go.ke/uploads/foreign_...
4766,https://www.centralbank.go.ke/policy-procedure...,forex bureaus,https://www.centralbank.go.ke/images/docs/Lice...
4774,https://www.centralbank.go.ke/policy-procedure...,forex bureau guidelines 2011 – march 18 2011,https://www.centralbank.go.ke/wp-content/uploa...
4775,https://www.centralbank.go.ke/policy-procedure...,forex bureau penalty regulations,https://www.centralbank.go.ke/wp-content/uploa...


In [None]:
# df.to_csv(os.path.join(root_dir, 'file_links.csv'), index=False)

### Kenya National Bureau of Statistics

In [None]:
knbs_base_url = "https://www.knbs.or.ke/"

In [None]:
# knbs_response = requests.get(knbs_base_url)
# print(knbs_response.status_code)

In [None]:
# Init explorer
knbs_explorer = CBKExplorer(github_username)

In [None]:
resp, soup = knbs_explorer.fetch(knbs_base_url)
if resp:
    print("OK:", resp.status_code, "Title:", soup.title.string if soup.title else "<no title>")
else:
    # Every fetch strategy failed; suggest a manual retry over plain HTTP
    print("Failed to fetch via all strategies. Try http://www.knbs.or.ke or check network.")

[fetch] SSL error with certifi for https://www.knbs.or.ke/: HTTPSConnectionPool(host='www.knbs.or.ke', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
[fetch] HTTP fallback failed for http://www.knbs.or.ke/: HTTPSConnectionPool(host='www.knbs.or.ke', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
[fetch] Trying insecure fetch (verify=False) for https://www.knbs.or.ke/ — not recommended for sensitive data.
OK: 200 Title: Kenya National Bureau of Statistics - Kenya's Top Data Site


In [None]:
# Crawl for files
df_knbs = knbs_explorer.crawl_links_for_files(
    start_url=knbs_base_url,
    max_pages=2000  # crawl the first 2000 links
)

[fetch] SSL error with certifi for https://www.knbs.or.ke/: HTTPSConnectionPool(host='www.knbs.or.ke', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
[fetch] HTTP fallback failed for http://www.knbs.or.ke/: HTTPSConnectionPool(host='www.knbs.or.ke', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
[fetch] Trying insecure fetch (verify=False) for https://www.knbs.or.ke/ — not recommended for sensitive data.
Will inspect 0 linked pages from https://www.knbs.or.ke/



In [None]:
response = requests.get("https://www.knbs.or.ke/statistical-releases", verify=False)
print(response.status_code)

200


In [None]:
response = requests.get("https://www.knbs.or.ke/publications/", verify=False)
print(response.status_code)

200


In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
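# `soup` above was parsed from the publications page, so parse the
# statistical-releases page separately before collecting its links.
releases_response = requests.get("https://www.knbs.or.ke/statistical-releases", verify=False)
releases_soup = BeautifulSoup(releases_response.text, 'html.parser')

statistical_releases_links = []
for a in releases_soup.select("div.w-ibanner a"):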
    title = a.get("title") or a.get_text(strip=True)
    href = a.get("href")
    if href:
        statistical_releases_links.append((title, href))

# Display results
for t, h in statistical_releases_links:
    print(f"{t} -> {h}")

In [None]:
publications_links = []

for a in soup.select("div.w-ibanner a"):
    title = a.get("title") or a.get_text(strip=True)
    href = a.get("href")

    if href:
        publications_links.append((title, href))

# Display results
for t, h in publications_links:
    print(f"{t} -> {h}")

In [None]:
# statistical_releases_file_links_df.query("link=='https://www.knbs.or.ke/wp-content/uploads/2024/12/2024-FinAccess-Household-Survey-Report.pdf'")

In [None]:
import time

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

visited = set()

def fetch_page(url):
    try:
        r = requests.get(url, verify=False, timeout=15)
        r.raise_for_status()
        return BeautifulSoup(r.content, "html.parser")
    except Exception as e:
        print(f"[fetch] Failed {url}: {e}")
        return None

def explore_recursive(
    url,
    allowed_exts=(".pdf", ".xls", ".xlsx", ".csv"),
    depth=0,
    max_depth=2
):
    if url in visited:
        return []
    visited.add(url)

    if depth > max_depth:
        return []

    soup = fetch_page(url)
    if not soup:
        return []

    files, subpages = [], []

    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        text = a.get_text(strip=True)

        if href.startswith("/"):
            href = urljoin("https://www.knbs.or.ke", href)

        if "knbs.or.ke" not in href:
            continue

        if any(href.lower().endswith(ext) for ext in allowed_exts):
            files.append((text, href))
        else:
            # follow only useful categories (narrower filter)
            if any(kw in href for kw in [
                "/reports/",
                "/economic-surveys/",
                "/leading-economic-indicators/",
                "/cpi-and-inflation-rates/",
                "/statistical-abstracts/",
                "/all-reports/page/",
                "publications",
                "census",
                "abstracts"
            ]):
                subpages.append(href)

    print(f"{'  '*depth}[{url}] -> {len(files)} files, {len(subpages)} subpages")

    # recurse into subpages
    for sp in subpages:
        time.sleep(0.5)  # polite pause
        files.extend(explore_recursive(sp, allowed_exts, depth+1, max_depth))

    return files
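
The crawler above resolves only hrefs that start with `/` and tests the domain with a substring match. The block below sketches a stricter alternative, not the notebook's method: `urljoin` against the current page also resolves relative links such as `page/2/`, and the hostname is compared instead of the whole URL. `classify_links` and the sample paths are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FILE_EXTS = (".pdf", ".xls", ".xlsx", ".csv")


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href.strip())


def classify_links(html, page_url, domain="knbs.or.ke"):
    """Split on-domain links into (files, pages), resolving relative hrefs."""
    collector = _LinkCollector()
    collector.feed(html)
    files, pages = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        host = parts.hostname or ""
        # Compare the hostname, so off-site URLs that merely contain the domain are skipped
        if host != domain and not host.endswith("." + domain):
            continue
        # Test the path only, so query strings do not hide the file extension
        if parts.path.lower().endswith(FILE_EXTS):
            files.append(absolute)
        else:
            pages.append(absolute)
    return files, pages


# Hypothetical markup, used only to exercise the three cases.
SAMPLE_HTML = """
<a href="/wp-content/uploads/report.pdf?ver=2">Report</a>
<a href="page/2/">Next</a>
<a href="https://example.com/knbs.or.ke/file.pdf">Offsite</a>
"""
files, pages = classify_links(SAMPLE_HTML, "https://www.knbs.or.ke/all-reports/")
```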


In [None]:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin

visited = set()
semaphore = asyncio.Semaphore(10)  # limit concurrent requests

BASE_URL = "https://www.knbs.or.ke"

async def fetch_page(session, url):
    """Fetch a page asynchronously."""
    try:
        async with semaphore:
            # aiohttp expects a ClientTimeout object rather than a bare number
            async with session.get(url, ssl=False, timeout=aiohttp.ClientTimeout(total=20)) as r:
                if r.status == 200:
                    html = await r.text()
                    return BeautifulSoup(html, "html.parser")
                else:
                    print(f"[fetch] Failed {url}: {r.status}")
                    return None
    except Exception as e:
        print(f"[fetch] Exception {url}: {e}")
        return None

async def explore_recursive(session, url, allowed_exts=(".pdf", ".xls", ".xlsx", ".csv"), depth=0, max_depth=5):
    """Asynchronous recursive exploration."""
    if url in visited or depth > max_depth:
        return []
    visited.add(url)

    soup = await fetch_page(session, url)
    if not soup:
        return []

    files, subpages = [], []

    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        text = a.get_text(strip=True)

        if href.startswith("/"):
            href = urljoin(BASE_URL, href)

        if "knbs.or.ke" not in href:
            continue

        if any(href.lower().endswith(ext) for ext in allowed_exts):
            files.append((text, href))
        else:
            if any(kw in href for kw in [
                "/reports/",
                "/economic-surveys/",
                "/leading-economic-indicators/",
                "/cpi-and-inflation-rates/",
                "/statistical-abstracts/",
                "/all-reports/page/",
                "publications",
                "census",
                "abstracts"
            ]):
                subpages.append(href)

    print(f"{'  '*depth}[{url}] -> {len(files)} files, {len(subpages)} subpages")

    # Recurse concurrently into subpages
    tasks = [explore_recursive(session, sp, allowed_exts, depth+1, max_depth) for sp in subpages]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    for r in results:
        if isinstance(r, list):
            files.extend(r)

    return files

async def explore_async(start_urls):
    async with aiohttp.ClientSession() as session:
        all_files = []
        for url in start_urls:
            results = await explore_recursive(session, url)
            all_files.extend(results)

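        # Drop repeated (text, link) pairs so the reported count is truly unique
        all_files = list(dict.fromkeys(all_files))
        print(f"\n✅ Total unique files collected: {len(all_files)}")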
        for text, link in all_files:
            print(f"- {text} -> {link}")

        return all_files
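
The recursive version above shares a module-level `visited` set and semaphore. The block below sketches a level-by-level alternative, not the notebook's implementation: the semaphore is created inside the running event loop, and each URL is recorded as soon as it is discovered. `crawl`, `FAKE_SITE`, and `fake_get_links` are hypothetical. A real `get_links` would fetch and parse a page, for instance with `aiohttp`.

```python
import asyncio


async def crawl(start, get_links, max_concurrency=10, max_depth=5):
    """Breadth-first crawl; `get_links(url)` is an async callable returning child URLs."""
    semaphore = asyncio.Semaphore(max_concurrency)  # created inside the running loop
    seen = {start}
    frontier = [start]

    async def bounded(url):
        # At most `max_concurrency` fetches run at the same time
        async with semaphore:
            return await get_links(url)

    for _ in range(max_depth + 1):
        if not frontier:
            break
        results = await asyncio.gather(
            *(bounded(u) for u in frontier), return_exceptions=True
        )
        next_frontier = []
        for links in results:
            if isinstance(links, Exception):
                continue  # a failed page contributes no links
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen


# In-memory stand-in for a website, so the sketch runs offline.
FAKE_SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}


async def fake_get_links(url):
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_SITE.get(url, [])
```

In a notebook, `await crawl("/", fake_get_links)` can be called directly in a cell. In a plain script, wrap the call in `asyncio.run(...)`.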


In [None]:
import nest_asyncio
nest_asyncio.apply()

# Extract URLs from the list of tuples
statistical_releases_urls = [link for text, link in statistical_releases_links]

# Run the asynchronous fetch
statistical_releases_file_links = asyncio.run(explore_async(statistical_releases_urls))


✅ Total unique files collected: 0


In [None]:
# statistical_releases_file_links[0][1].split('/')[-1].split('.')[0]

In [None]:
def clean_text_col_knbs(df):
    # Ensure columns exist
    if 'text' not in df.columns or 'link' not in df.columns:
        raise ValueError("DataFrame must have 'text' and 'link' columns")

    # Step 1: Clean text safely
    df['text'] = (
        df['text']
        .fillna('')  # avoid NaNs
        .astype(str)
        .str.replace('\n', ' ', regex=False)
        .str.strip('- ')
        .str.lower()
    )

    # Step 2: Identify 'pdf' placeholders or empty text
    mask = df['text'].isin(['pdf', 'download pdf', '', 'download'])

    # Step 3: Replace with filename extracted from URL
    df.loc[mask, 'text'] = (
        df.loc[mask, 'link']
        .fillna('')
        .apply(lambda x: (
            x.rstrip('/')
             .split('/')[-1]
             .split('.')[0]
             .replace('%20', '-')
             .replace('_', '-')
             .lower()
        ) if isinstance(x, str) else '')
    )

    # Step 4: Final cleanup
    df['text'] = (
        df['text']
        .str.replace(r'[^a-z0-9\- ]+', '', regex=True)
        .str.strip('- ')
        .str.replace('--+', '-', regex=True)
    )

    return df
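
To make the placeholder fallback concrete, the block below applies the same idea to a single row in plain Python. It is a simplified sketch, not a drop-in replacement for `clean_text_col_knbs`. It keeps dotted filename parts such as `.v2`, which `split('.')[0]` would drop. `title_from_url`, `clean_title`, and the example URL are hypothetical.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

PLACEHOLDERS = {"", "pdf", "download", "download pdf"}


def title_from_url(url):
    """Build a lowercase, hyphenated slug from the file name in a URL."""
    stem = PurePosixPath(unquote(urlparse(url).path)).stem
    slug = re.sub(r"[^a-z0-9]+", "-", stem.lower())
    return slug.strip("-")


def clean_title(text, url):
    """Normalise link text, falling back to the file name for placeholder text."""
    cleaned = " ".join((text or "").split()).strip("- ").lower()
    return title_from_url(url) if cleaned in PLACEHOLDERS else cleaned


# Hypothetical URL, used only to show the transformation.
EXAMPLE_URL = (
    "https://www.knbs.or.ke/wp-content/uploads/2025/01/"
    "Kenya%20Leading_Economic-Indicators.v2.pdf"
)
print(clean_title("Download PDF", EXAMPLE_URL))
print(clean_title("  - CPI Report\n", EXAMPLE_URL))
```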

In [None]:
# statistical_releases_file_links_df = pd.DataFrame(statistical_releases_file_links, columns=['text', 'link'])
# statistical_releases_file_links_df = clean_text_col_knbs(statistical_releases_file_links_df)
# statistical_releases_file_links_df

In [None]:
# statistical_releases_file_links_df.drop_duplicates(inplace=True)
# statistical_releases_file_links_df.reset_index(drop=True, inplace=True)

In [None]:
knbs_dir = os.path.join(root_dir, "..", "knbs")
os.makedirs(name=knbs_dir, exist_ok=True)
os.path.exists(knbs_dir)

True

In [None]:
# statistical_releases_file_links_df.to_csv(os.path.join(knbs_dir, 'statistical_releases_file_links.csv'), index=False)

In [None]:
knbs_files = pd.read_csv(os.path.join(knbs_dir, 'statistical_releases_file_links.csv'))
knbs_files.head()

Unnamed: 0,text,link
0,kenya-leading-economic-indicators-july-2025,https://www.knbs.or.ke/wp-content/uploads/2025...
1,kenya-leading-economic-indicators-june-2025,https://www.knbs.or.ke/wp-content/uploads/2025...
2,kenya-leading-economic-indicators-may-2025,https://www.knbs.or.ke/wp-content/uploads/2025...
3,lei-april-report,https://www.knbs.or.ke/wp-content/uploads/2025...
4,leading-economic-indicators-march-2025,https://www.knbs.or.ke/wp-content/uploads/2025...


### Business News

In [None]:
# !pip install newspaper3k --quiet

In [None]:
# !pip install lxml[html_clean] --quiet

In [None]:
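# Load API keys from Colab secrets instead of hardcoding them in the notebook.
# The secret names below are assumed; use whatever names are set in the secrets panel.
news_data_api_key = userdata.get('NEWSDATA_API_KEY')
gnews_api_key = userdata.get('GNEWS_API_KEY')
the_news_api_token = userdata.get('THENEWSAPI_TOKEN')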

In [None]:
from fynesse.access import fetch_kenya_gnews

In [None]:
gnews_df = fetch_kenya_gnews(gnews_api_key)
gnews_df.head()

Unnamed: 0,title,content,url,date,source
0,Arcadis expands new global collaboration with ...,Arcadis expands new global collaboration with ...,https://www.arcadis.com/en/news/global/2025/11...,2025-11-19T06:03:57Z,Arcadis
1,China's diesel trucks are shifting to electric,China is rapidly replacing its aging diesel tr...,https://abcnews.go.com/International/wireStory...,2025-11-19T05:53:32Z,abcnews.go.com
2,ACI World reveals key trends shaping air trave...,"Montreal, Canada, 19 November 2025 – Airports ...",https://aci.aero/2025/11/19/aci-world-reveals-...,2025-11-19T05:30:17Z,ACI World
3,Uzbekistan to operate Airbus Flexrotor UAS,Airbus Helicopters has been awarded a contract...,https://www.airbus.com/en/newsroom/press-relea...,2025-11-19T05:00:00Z,Airbus
4,Rise of the robots: the promise of physical AI,"With enough practice, arms like these can comp...",https://www.citizen.digital/article/rise-of-th...,2025-11-19T04:42:52Z,Citizen Digital


In [None]:
gnews_df.shape

(10, 5)

In [None]:
print(gnews_df['content'][0])

Arcadis expands new global collaboration with Shell through major workplace Architecture and Design win


In [None]:
from fynesse.access import fetch_kenya_thenewsapi

In [None]:
the_news_df = fetch_kenya_thenewsapi(the_news_api_token)
the_news_df.head()

Unnamed: 0,title,content,url,date,source
0,"Report on Africa’s digital economy, China-Afri...","NAIROBI, Kenya, May 10, 2024 /PRNewswire/ — Th...",https://bubblear.com/report-on-africas-digital...,2024-05-11T04:48:00.000000Z,thebubble.com
1,"India-Kenya collaboration in trade, economy, e...","New Delhi [India], August 29 (ANI): Defence Mi...",https://theprint.in/world/india-kenya-collabor...,2023-08-29T07:43:02.000000Z,theprint.in
2,Kenya and South Africa offer insights into dig...,It’s still fashionable today to promote tech s...,https://theconversation.com/kenya-and-south-af...,2022-06-27T14:10:48.000000Z,theconversation.com


In [None]:
the_news_df.shape

(3, 5)

In [None]:
print(the_news_df['content'][0])

NAIROBI, Kenya, May 10, 2024 /PRNewswire/ — This is a report from China.org.cn: A report on Africa’s digital economic development index and China–Africa ...


In [None]:
from fynesse.access import scrape_google_news_kenya

In [None]:
google_news_df = scrape_google_news_kenya()
google_news_df.head()

Unnamed: 0,title,description,published date,url,publisher
0,Central Bank piloting instant payments to Stat...,Central Bank piloting instant payments to Stat...,"Mon, 17 Nov 2025 15:45:00 GMT",https://news.google.com/rss/articles/CBMisgFBV...,{'href': 'https://www.businessdailyafrica.com'...
1,Global rate cuts set to reshape East Africa’s ...,Global rate cuts set to reshape East Africa’s ...,"Wed, 19 Nov 2025 14:15:00 GMT",https://news.google.com/rss/articles/CBMivAFBV...,"{'href': 'https://www.theeastafrican.co.ke', '..."
2,Businessman Withdraws Lawsuit Against CBK Over...,Businessman Withdraws Lawsuit Against CBK Over...,"Mon, 17 Nov 2025 10:16:25 GMT",https://news.google.com/rss/articles/CBMirAFBV...,"{'href': 'https://kenyanwallstreet.com', 'titl..."
3,CBK seeking to raise Sh40 billion from investo...,CBK seeking to raise Sh40 billion from investo...,"Thu, 13 Nov 2025 05:00:00 GMT",https://news.google.com/rss/articles/CBMiuwFBV...,"{'href': 'https://www.the-star.co.ke', 'title'..."
4,Kenya’s Economy Poised for Faster Growth in 20...,Kenya’s Economy Poised for Faster Growth in 20...,"Thu, 13 Nov 2025 10:59:35 GMT",https://news.google.com/rss/articles/CBMitgFBV...,"{'href': 'https://www.dawan.africa', 'title': ..."
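
The `publisher` column above holds a dictionary with `href` and `title` keys. Once the frame is written to CSV, that dictionary comes back as a string. The block below sketches one way to reduce it to a plain publisher name. `publisher_title` is a hypothetical helper, and the sample values are illustrative because the real titles are truncated in the output above.

```python
import ast


def publisher_title(value):
    """Return the publisher name from a dict, or from its string form read back from CSV."""
    if isinstance(value, str):
        try:
            value = ast.literal_eval(value)  # safely parse "{'href': ..., 'title': ...}"
        except (ValueError, SyntaxError):
            return value  # already a plain name, keep it unchanged
    if isinstance(value, dict):
        return value.get("title") or value.get("href", "")
    return ""


# Illustrative values; the title text is assumed.
as_dict = {"href": "https://www.businessdailyafrica.com", "title": "Business Daily"}
as_text = "{'href': 'https://www.the-star.co.ke', 'title': 'The Star'}"
print(publisher_title(as_dict), "|", publisher_title(as_text))
```

On the DataFrame this could be applied as `google_news_df["publisher"].apply(publisher_title)`.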


In [None]:
google_news_df.shape

(50, 5)

In [None]:
from fynesse.access import scrape_african_business_rss

In [None]:
# No API key needed!
rss_df = scrape_african_business_rss()
rss_df.head()

Unnamed: 0,title,url,date,summary,source
0,Caution urged as Chinese AI takes root in Africa,https://african.business/2025/11/technology-in...,"Wed, 19 Nov 2025 14:13:39 +0000",,African Business
1,Africa's Business Heroes announces top 10 fina...,https://african.business/2025/11/quick-reads/a...,"Wed, 19 Nov 2025 10:36:51 +0000",,African Business
2,GAICA 2025: When Sousse became Africa's capita...,https://african.business/2025/11/innov-africa-...,"Wed, 19 Nov 2025 09:20:40 +0000",,African Business
3,Africa's Voice at the G20: Turning challenges ...,https://african.business/2025/11/partner-conte...,"Wed, 19 Nov 2025 07:27:11 +0000",,African Business
4,Muganga celebrates successful French launch fo...,https://african.business/2025/11/arts-culture/...,"Tue, 18 Nov 2025 14:49:29 +0000",,African Business
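
The sources above report dates in different formats: ISO 8601 with a trailing `Z` for the API results, and RFC 2822 strings for the RSS feeds. The block below sketches one way to bring them to a single UTC `datetime` before the frames are combined. `to_utc` is a hypothetical helper, and treating timestamps without a zone as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc(raw):
    """Parse an ISO 8601 or RFC 2822 date string into an aware UTC datetime."""
    text = raw.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept the trailing "Z" in fromisoformat
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        parsed = parsedate_to_datetime(text)  # e.g. "Mon, 17 Nov 2025 15:45:00 GMT"
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return parsed.astimezone(timezone.utc)


for sample in (
    "2025-11-19T06:03:57Z",
    "2024-05-11T04:48:00.000000Z",
    "Mon, 17 Nov 2025 15:45:00 GMT",
    "Wed, 19 Nov 2025 14:13:39 +0000",
):
    print(to_utc(sample).isoformat())
```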


In [None]:
from fynesse.access import scrape_kenya_news_maximum

In [None]:
df = scrape_kenya_news_maximum(
        newsdata_key=news_data_api_key,
        gnews_key=gnews_api_key,
        thenewsapi_key=the_news_api_token,
        max_workers=12
    )

🔍 Fetching maximum articles from all APIs...

📰 NewsData.io: 35 URLs
📰 GNews.io: 7 URLs
📰 TheNewsAPI: 24 URLs (limited to 3/request on free)

✅ Total unique URLs: 57





❌ https://www.reuters.com/world/africa/kenyas-inflat... | Article `download()` failed with 401 Client Error: HTTP Forbidden for url: https://www.reuters.com/world/africa/kenyas-inflation-steady-46-year-on-year-october-2025-10-31/ on URL https://www.reuters.com/world/africa/kenyas-inflation-steady-46-year-on-year-october-2025-10-31/

❌ http://www.flynous.com/kenya-airways-business-clas... | Article `download()` failed with 403 Client Error: Forbidden for url: https://www.flynous.com/kenya-airways-business-class-kenya/ on URL http://www.flynous.com/kenya-airways-business-class-kenya/

❌ https://www.investing.com/news/stock-market-news/u... | Article `download()` failed with 403 Client Error: Forbidden for url: https://www.investing.com/news/stock-market-news/us-kenya-look-to-strengthen-business-trade-ties-3456898 on URL https://www.investing.com/news/stock-market-news/us-kenya-look-to-strengthen-business-trade-ties-3456898

✅ 48 articles scraped | ❌ 9 failed | 📊 84.2% success
📝 Avg: 508
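
The log above reports 35 + 7 + 24 = 66 URLs reduced to 57 unique ones, and 48 of those 57 scraped successfully (84.2%). The internals of `scrape_kenya_news_maximum` are not shown here, so the block below is only a sketch of how such figures can be derived. `merge_unique` and `success_rate` are hypothetical helpers.

```python
def merge_unique(*url_lists):
    """Combine several URL lists, keeping the first occurrence of each URL in order."""
    return list(dict.fromkeys(url for urls in url_lists for url in urls))


def success_rate(scraped, failed):
    """Percentage of attempted articles that were scraped successfully."""
    total = scraped + failed
    return 100.0 * scraped / total if total else 0.0


# Placeholder URLs, used only to show the de-duplication.
merged = merge_unique(
    ["https://a.example/1", "https://b.example/2"],
    ["https://b.example/2", "https://c.example/3"],
)
print(len(merged), f"{success_rate(48, 9):.1f}%")
```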

In [None]:
from datetime import datetime
import os

In [None]:
news_dir = os.path.join(root_dir, "..", "news")
os.makedirs(news_dir, exist_ok=True)
file_time_stamp = datetime.now().strftime("%d-%m-%Y-%H-%M")
file_time_stamp


'19-11-2025-19-49'
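
File names stamped day-first, as above, do not sort chronologically once runs span more than one month. The block below sketches a year-first alternative for comparison, not the format the notebook uses. `sortable_stamp` is a hypothetical helper.

```python
from datetime import datetime


def sortable_stamp(moment):
    """Format a datetime year-first so that string order matches time order."""
    return moment.strftime("%Y-%m-%d-%H-%M")


earlier = datetime(2025, 12, 31, 23, 0)
later = datetime(2026, 1, 1, 8, 30)

year_first = [sortable_stamp(later), sortable_stamp(earlier)]
day_first = [later.strftime("%d-%m-%Y-%H-%M"), earlier.strftime("%d-%m-%Y-%H-%M")]

print(sorted(year_first))  # chronological order
print(sorted(day_first))   # the January 2026 run sorts before December 2025
```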

In [None]:

gnews_df.to_csv(os.path.join(news_dir, f"gnews_{file_time_stamp}.csv"), index=False)
google_news_df.to_csv(os.path.join(news_dir, f"google_news_{file_time_stamp}.csv"), index=False)
# newsdata_df.to_csv(os.path.join(news_dir, f"newsdata_{file_time_stamp}.csv"), index=False)
the_news_df.to_csv(os.path.join(news_dir, f"the_news_{file_time_stamp}.csv"), index=False)
df.to_csv(os.path.join(news_dir, f"kenya_news_full_{file_time_stamp}.csv"), index=False)

In [None]:
df.head()

Unnamed: 0,title,full_content,summary,url,date,source,authors,image,word_count
0,CJ Koome forms Standing Committee to Boost Sma...,"NAIROBI, Kenya Nov 19 – Chief Justice Martha K...","Since its launch in 2021, the Court has unlock...",https://www.capitalfm.co.ke/news/2025/11/cj-ko...,2025-11-19 05:20:00,capitalfm,"Laban Wanambisi, Bruhan Makong, .Wp-Block-Co-A...",https://www.capitalfm.co.ke/news/files/2025/08...,178
1,Bitcoin ATMs pop up in Nairobi malls as Kenya’...,"NAIROBI, Kenya Nov 18 – Bitcoin ATMs have been...",Their arrival coincides with the commencement ...,https://www.capitalfm.co.ke/news/2025/11/bitco...,2025-11-18 08:39:20,capitalfm,"Editorial Desk, Faith Masita, Bruhan Makong, Y...",https://www.capitalfm.co.ke/news/files/2025/11...,775
2,Mosiria Moved to Citizen Engagement in Sakaja’...,"NAIROBI, Kenya, Nov 19 – Environment Chief Off...","Governor Sakaja said the reshuffle, guided by ...",https://www.capitalfm.co.ke/news/2025/11/mosir...,2025-11-19 06:12:49,capitalfm,"Bruhan Makong, .Wp-Block-Co-Authors-Plus-Coaut...",https://www.capitalfm.co.ke/news/files/2025/11...,136
3,Govt Promises Kakamega Locals Direct Benefits ...,The government has assured Kakamega County res...,The government has assured Kakamega County res...,https://nairobiwire.com/2025/11/kakamega-gold-...,2025-11-18 03:44:23,nairobiwire,Richard Kamau,https://nairobiwire.com/wp-content/uploads/202...,341
4,"School Verification Exercise Exposes 87,000 Gh...","Kenya has cleared at least 44,495 schools for ...","Kenya has cleared at least 44,495 schools for ...",https://nairobiwire.com/2025/11/kenya-ghost-le...,2025-11-18 03:43:08,nairobiwire,Richard Kamau,https://nairobiwire.com/wp-content/uploads/202...,396


In [None]:
os.listdir(root_dir)

['cbk-landing-page', 'monetary-policy', 'file_links.csv']