

## The Problem

International observers, journalists, policy analysts, and other interested parties who do not speak Swahili face a significant language and data barrier when monitoring Tanzania's rapidly evolving political landscape, its elections, and the general state of the nation. Most real-time public discourse and immediate political reaction takes place on social media platforms (such as Twitter/X), primarily in Swahili.

## The Consequence

Currently, deriving real-time insights requires manually translating thousands of individual Swahili social media posts, a resource-intensive and often delayed process. The result is an inefficient, incomplete, and potentially biased understanding of the public's sentiment, trust levels, and immediate response to political events and candidates, which hinders timely and informed analysis and decision-making.

## The Objective

To bridge this intelligence gap, we need an automated, real-time political intelligence dashboard. The system must ingest high-volume Swahili-language social media data, automatically apply Natural Language Processing (NLP) and sentiment analysis, and present the findings in an intuitive English-language visualization.

The resulting dashboard must provide:

- **Total Aggregate Sentiment:** A comprehensive, real-time sentiment score for the overall political state of the nation.
- **Topic-Specific Analysis:** A breakdown of sentiment and a calculated "Trust Score" for key political topics, parties, or figures.

This solution will allow non-Swahili speaking users to instantaneously grasp the current political mood and state of public trust without the need for manual translation, enabling clearer, data-driven political awareness.

# Phase 1 → Preparing the Environment

In this phase, we **set up everything needed to predict sentiment** for Swahili political tweets using AfriSenti.

### Steps:

1. **Install libraries**  
   We need:
   - `transformers` (for the model)  
   - `torch` (for deep learning)  
   - `pandas` (for data handling)  
   - `numpy` and `scipy` (for ranking scores and the `softmax` function)  

2. **Import modules**  
   Load PyTorch, NumPy, Pandas, SciPy's `softmax`, and Hugging Face transformers.  

3. **Load the AfriSenti model**  
   We use a pretrained sentiment analysis model trained on African social media text.  

4. **Move model to GPU**  
   If a GPU is available, it will make predictions faster.  

5. **Define prediction function** – `predict_sentiment(text)`:
   - Converts text into model-friendly tokens.
   - Runs the model to get raw scores.
   - Converts scores to probabilities with `softmax`.
   - Returns a readable label (`positive`, `neutral`, `negative`) with confidence.  

6. **Test the setup**  
   Try a few Swahili sentences to make sure it works.
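
The scoring step in `predict_sentiment` can be illustrated without loading the model. The sketch below uses only the standard library and made-up logits (they are illustrative values, not real model output). It assumes the same label order as the notebook's `id2label` mapping, which should be verified against the loaded checkpoint. `rank_labels` is a hypothetical helper name.

```python
import math

# Label order assumed to match the notebook's id2label mapping; verify it
# against the loaded model's configuration before relying on it.
ID2LABEL = {0: "positive", 1: "neutral", 2: "negative"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(ID2LABEL[i], probs[i]) for i in order]

# Hypothetical logits for a single tweet (illustrative, not model output)
ranked = rank_labels([2.0, 0.5, -1.0])
print(ranked[0][0])
```

The first entry of the ranked list is the predicted label, and its probability serves as the confidence value used in later phases.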

### Goal:
Prepare the environment so that in **Phase 3** we can **predict sentiment for all tweets** and use these predictions to calculate Trust Scores later.


In [1]:
# PHASE 1 → Preparing the Environment

#  Install required libraries
!pip install transformers torch pandas numpy scipy tqdm

#  Import necessary modules
import torch
import numpy as np
import pandas as pd
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

#  Load AfriSenti model + tokenizer
MODEL_NAME = "Davlan/afrisenti-twitter-sentiment-afroxlmr-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Define a function to predict sentiment
def predict_sentiment(text: str):
    # Tokenize input
    encoded_input = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Run model inference
    with torch.no_grad():
        output = model(**encoded_input)

    # Convert logits to probabilities
    scores = output.logits[0].cpu().numpy()
    probs = softmax(scores)

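    # Map model output IDs to human-readable labels.
    # NOTE: this order is hard-coded; compare it with model.config.id2label
    # to confirm it matches the checkpoint's own label order.
    id2label = {0: "positive", 1: "neutral", 2: "negative"}

    # Return every (label, probability) pair, most likely first
    ranking = np.argsort(probs)[::-1]
    results = [(id2label[i], float(probs[i])) for i in ranking]
    return results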

#  Test the setup with example Swahili sentences
examples = [
    "Ninapenda kuona maendeleo haya kwa taifa letu.",
    "Serikali haijafanya lolote kuhusu ahadi zake.",
    "Tume ya uchaguzi imetangaza matokeo rasmi."
]

for text in examples:
    print(text, "→", predict_sentiment(text))




Ninapenda kuona maendeleo haya kwa taifa letu. → [('positive', 0.987655758857727), ('neutral', 0.011981110088527203), ('negative', 0.0003630876017268747)]
Serikali haijafanya lolote kuhusu ahadi zake. → [('neutral', 0.7480838298797607), ('negative', 0.23379431664943695), ('positive', 0.018121864646673203)]
Tume ya uchaguzi imetangaza matokeo rasmi. → [('neutral', 0.6829593181610107), ('positive', 0.29693707823753357), ('negative', 0.020103607326745987)]


# Phase 2 → Generating Synthetic Tanzanian Swahili Political Tweets

For our hackathon project, **our team built a synthetic dataset of political tweets in Swahili**.  

Because live Twitter scraping is unreliable (APIs can block requests or limit access), we decided to simulate realistic tweets with all the features needed for sentiment and trust analysis.

### What our team generated for each tweet:

- **User info:**  
  - Username & full name (realistic Swahili-style handles)  
  - Verified status  
  - Followers, following, and total posts  
  - Account creation date

- **Tweet content:**  
  - Topical phrases about `#ChaguziTZ`, `chaguzi`, `#Suluhu`, or political campaigns  
  - Randomly added hashtags and trending words

- **Engagement metrics:**  
  - Likes, retweets, replies, quotes  
  - Device type (Twitter for Android, iPhone, Web, etc.)

- **Advanced features for trust scoring:**  
  - Account age  
  - Follower-to-following ratio  
  - Verified status  
  - Posting consistency  
  - Engagement velocity  
  - Temporal freshness

### How we did it as a team:

1. **User pool creation:** Each team member contributed ideas for realistic Swahili-style usernames and names.  
2. **Tweet generation:** Randomly combined phrases and topics to simulate real social media conversations.  
3. **Metrics simulation:** Created likes, retweets, replies, and other engagement metrics programmatically.  
4. **Trust features:** Computed scores like account consistency, engagement velocity, and temporal freshness.  
5. **Quality check:** Ensured all tweets were unique and realistic.  
6. **Export:** Saved the final dataset as `tanzania_swahili_political_raw.csv` for the next phases (sentiment analysis and trust scoring).
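
The trust features from step 4 can be sketched as one small function. The formulas mirror those in the generation cell of this notebook; the function name `derive_trust_features` and the input values are hypothetical and chosen only for illustration.

```python
def derive_trust_features(likes, retweets, replies, statuses_count, verified, hours_since):
    """Recompute the per-tweet trust features with the generator's formulas."""
    hours_since = max(1.0, hours_since)  # floor at one hour to avoid inflated velocities
    engagement = likes + retweets + replies
    return {
        # Interactions per hour since the tweet was posted
        "engagement_velocity": round(engagement / hours_since, 3),
        # Weighted engagement, capped at 100
        "post_heat_score": round(min(100, likes * 0.5 + retweets * 1.0 + replies * 0.8), 2),
        # 10 points per 1,000 posts plus a 10-point bonus for verified accounts
        "account_consistency_score": round(min(100, statuses_count / 1000 * 10 + int(verified) * 10), 2),
        # One point per 10 interactions, capped at 100
        "community_validation_score": round(min(100, engagement / 10), 2),
        # Loses 10 points per day of age, never below 0
        "temporal_freshness_score": round(max(0, 100 - hours_since / 24 * 10), 2),
    }

# Hypothetical tweet: 19 interactions, posted 48 hours ago by a verified account
features = derive_trust_features(
    likes=10, retweets=4, replies=5, statuses_count=2000, verified=True, hours_since=48
)
print(features)
```

Keeping the formulas in one function makes it easier to adjust caps or weights later without touching the tweet-generation loop.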

**Goal:**  
Produce a realistic dataset of 600 Swahili political tweets that our hackathon team can use for **Phase 3 sentiment prediction** and **Phase 4 Trust Score calculations**.


In [2]:
# Generate 600 unique synthetic Tanzanian Swahili political tweets
# Saves CSV: tanzania_swahili_political_raw.csv

import random
import pandas as pd
from datetime import datetime, timedelta, timezone

random.seed(42)

OUTPUT_PATH = "tanzania_swahili_political_raw.csv"
TOTAL = 600 # Total set to 600 rows

topics = ["#ChaguziTZ", "chaguzi", "#Suluhu", "Political Campaigns"]
devices = ["Twitter for Android", "Twitter for iPhone", "Twitter Web App", "Twitter for iPad", "Twitter for Mac"]

# username style: realistic Swahili handles
def gen_username(i):
    prefixes = ["mwananchi","habari","sauti","mtu","mwanamke","mwanaume","mwanafunzi","mama","baba","habari360","mzanziblog","darnews"]
    return f"{random.choice(prefixes)}_{i}"

def gen_full_name(i):
    first = ["Amina","John","Fatuma","Ali","Hassan","Mariam","Juma","Grace","Sam","Hussein","Mwajuma","Peter"]
    last = ["Kileo","Mwanga","Jengo","Msuya","Mponda","Ngoma","Komba","Rashid","Mboya","Khalifa"]
    return f"{random.choice(first)} {random.choice(last)}"

# phrase pools (5 phrases per sentiment category, 15 in total)
neg_phrases = [
    "Hili ni uchovu mkubwa, serikali haijafanya lolote.",
    "Wananchi wanachoka na ahadi zisizotekelezwa.",
    "Taarifa hii inaonekana kupotosha; tunahitaji uchunguzi.",
    "Kuna madai ya ufisadi ambayo hayajachunguzwa.",
    "Huwa tunapata taarifa zisizo za uhakika zikienea."
]
neu_phrases = [
    "Tume ya uchaguzi imetoa ratiba rasmi leo.",
    "Kikao cha wadau kimefanyika kama ilivyopangiwa.",
    "Ripoti rasmi imetolewa na ofisi husika.",
    "Taarifa za asubuhi kuhusu mikutano ya kampeni zimetolewa.",
    "Matangazo rasmi yalikuwa wazi kwa vyombo vya habari."
]
pos_phrases = [
    "Napenda kuona hatua hizi za uwazi, ni muhimu kwa taifa.",
    "Kuna matumaini kwa vijana; kampeni zinaandika mambo mapya.",
    "Serikali imeanza kutoa msaada kwa familia zilizoathirika.",
    "Kiongozi ametangaza mpango mzuri wa maendeleo.",
    "Habari hizi zinaonyesha maendeleo hatua kwa hatua."
]

# Combined pool of phrases for random selection
all_phrases = neg_phrases + neu_phrases + pos_phrases

topic_templates = {
    "#ChaguziTZ": [
        "{} {}",
        "{} Matokeo ya awali yanaonyesha ushindani.",
        "{} Wapiga kura wanajitokeza kwa wingi leo."
    ],
    "chaguzi": [
        "{} {}",
        "{} Tume imeweka vipaumbele vya usalama.",
        "{} Hapa kuna taarifa kuhusu taratibu."
    ],
    "#Suluhu": [
        "{} {}",
        "{} Kuna maoni mengi kuhusu sera za Rais.",
        "{} Wananchi wanashiriki mijadala kuhusu uongozi."
    ],
    "Political Campaigns": [
        "{} {}",
        "{} Kampeni zimepunguza pengo la elimu kwa vijana.",
        "{} Wagombea wameweka sera za kiuchumi."
    ]
}

rows = []
user_pool = []

# create user pool: 1 user per tweet (unique users)
for i in range(1, TOTAL+1):
    user_id = f"{100000 + i}"              # short numeric id
    username = gen_username(i)
    full_name = gen_full_name(i)
    created_days = random.randint(365, 3650)  # 1-10 years
    account_created = (datetime.now(timezone.utc) - timedelta(days=created_days)).isoformat()
    verified = random.random() < 0.20      # 20% verified
    followers = max(5, int(random.lognormvariate(6,1.0)))
    following = random.randint(50, 5000)
    statuses_count = random.randint(100, 10000)
    user_pool.append({
        "user_id": user_id,
        "username": username,
        "full_name": full_name,
        "verified": verified,
        "followers_count": followers,
        "following_count": following,
        "statuses_count": statuses_count,
        "user_location": "Tanzania",
        "account_created_at": account_created
    })


tweet_counter = 200000000000000000
for i in range(TOTAL):
    topic_choice = random.choice(topics)
    # Pick phrase randomly from the combined pool
    phrase = random.choice(all_phrases)

    template = random.choice(topic_templates[topic_choice])
    text = template.format(phrase, topic_choice)
    if random.random() < 0.25:
        text = text + " " + random.choice(["#Tanzania","#Maendeleo","kwa kweli","sasa hivi"])
    user = user_pool[i]
    tweet_id = str(tweet_counter + i)
    hours_ago = random.randint(0, 7*24)
    created_at = (datetime.now(timezone.utc) - timedelta(hours=hours_ago, minutes=random.randint(0,59))).isoformat()
    likes = max(0, int(random.gauss(user["followers_count"]*0.01, 5)))
    retweets = max(0, int(random.gauss(user["followers_count"]*0.002, 2)))
    replies = max(0, int(random.gauss(user["followers_count"]*0.001, 1)))
    quote_count = max(0, int(random.gauss(user["followers_count"]*0.0005,1)))
    tweet_device = random.choice(devices)  # client app the tweet was posted from
    source_profile_age_days = (datetime.now(timezone.utc) - pd.to_datetime(user["account_created_at"]).tz_convert(timezone.utc)).days
    source_follower_ratio = round(user["followers_count"] / (user["following_count"]+1), 3)
    source_is_verified = int(user["verified"])
    hours_since = max(1.0, (datetime.now(timezone.utc) - pd.to_datetime(created_at).tz_convert(timezone.utc)).total_seconds()/3600.0)
    engagement_velocity = round((likes + retweets + replies) / hours_since, 3)
    post_heat_score = round(min(100, (likes*0.5 + retweets*1.0 + replies*0.8)), 2)
    account_consistency_score = round(min(100, user["statuses_count"]/1000*10 + (source_is_verified*10)),2)
    community_validation_score = round(min(100, (likes + retweets + replies)/10),2)
    temporal_freshness_score = round(max(0, 100 - hours_since/24*10),2)
    rows.append({
        "tweet_id": tweet_id,
        "user_id": user["user_id"],
        "username": user["username"],
        "full_name": user["full_name"],
        "account_created_at": user["account_created_at"],
        "verified": user["verified"],
        "followers_count": user["followers_count"],
        "following_count": user["following_count"],
        "statuses_count": user["statuses_count"],
        "user_location": user["user_location"],
        "device_type": tweet_device,
        "timestamp": created_at,
        "text": text,
        "hashtags": topic_choice if topic_choice in ["#ChaguziTZ","#Suluhu"] else "",
        "retweet_count": retweets,
        "like_count": likes,
        "reply_count": replies,
        "quote_count": quote_count,
        "language": "sw",
        "topic": topic_choice,
        # No sentiment label is stored here; sentiment is predicted in Phase 3
        "source_profile_age_days": source_profile_age_days,
        "source_follower_ratio": source_follower_ratio,
        "source_is_verified": source_is_verified,
        "engagement_velocity": engagement_velocity,
        "post_heat_score": post_heat_score,
        "account_consistency_score": account_consistency_score,
        "community_validation_score": community_validation_score,
        "temporal_freshness_score": temporal_freshness_score
    })

# Ensure text uniqueness: keep appending a random numeric tag until the text is unseen
seen = set()
for r in rows:
    while r["text"] in seen:
        r["text"] = r["text"] + " #" + str(random.randint(100,999))
    seen.add(r["text"])

df = pd.DataFrame(rows)
df.to_csv(OUTPUT_PATH, index=False, encoding="utf-8-sig")
print(f"Saved CSV to {OUTPUT_PATH} ({len(df)} rows).")

Saved CSV to tanzania_swahili_political_raw.csv (600 rows).


In [3]:
display(df.head())

| | tweet_id | user_id | username | full_name | account_created_at | verified | followers_count | following_count | statuses_count | user_location | ... | language | topic | source_profile_age_days | source_follower_ratio | source_is_verified | engagement_velocity | post_heat_score | account_consistency_score | community_validation_score | temporal_freshness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000000000000000 | 100001 | mzanziblog_1 | John Kileo | 2016-08-02T17:22:33.708072+00:00 | False | 66 | 4517 | 1524 | Tanzania | ... | sw | #ChaguziTZ | 3402 | 0.015 | 0 | 0.549 | 3.0 | 15.24 | 0.6 | 95.44 |
| 1 | 200000000000000001 | 100002 | habari360_2 | Juma Kileo | 2024-07-26T17:22:33.708146+00:00 | True | 127 | 4647 | 3357 | Tanzania | ... | sw | Political Campaigns | 487 | 0.027 | 1 | 0.021 | 1.0 | 43.57 | 0.1 | 79.87 |
| 2 | 200000000000000002 | 100003 | darnews_3 | Mwajuma Mboya | 2020-03-13T17:22:33.708173+00:00 | False | 901 | 103 | 2715 | Tanzania | ... | sw | Political Campaigns | 2083 | 8.663 | 0 | 0.135 | 11.0 | 27.15 | 1.8 | 44.6 |
| 3 | 200000000000000003 | 100004 | darnews_4 | Juma Ngoma | 2021-10-14T17:22:33.708193+00:00 | True | 1315 | 809 | 6324 | Tanzania | ... | sw | chaguzi | 1503 | 1.623 | 1 | 0.101 | 9.1 | 73.24 | 1.4 | 42.03 |
| 4 | 200000000000000004 | 100005 | habari_5 | Mariam Ngoma | 2018-02-18T17:22:33.708209+00:00 | False | 94 | 1072 | 6301 | Tanzania | ... | sw | Political Campaigns | 2837 | 0.088 | 0 | 0.0 | 0.0 | 63.01 | 0.0 | 30.43 |


# Phase 3 → Running AfriSenti on Our Synthetic Dataset

In this phase, **our hackathon team used AfriSenti to predict sentiment for the synthetic Swahili political tweets** we created in Phase 2.

### What we did as a team:

1. **Load dataset:** Imported our `tanzania_swahili_political_raw.csv` with 600 simulated tweets.  
2. **Prepare AfriSenti:** Ensured our sentiment analysis model and tokenizer were ready (Phase 1 setup).  
3. **Predict sentiment:**  
   - Each tweet was analyzed to assign a label: `positive`, `neutral`, or `negative`.  
   - Also recorded the model’s confidence/probability for each prediction.  
4. **Save results:** Stored the predictions in a new CSV file: `tanzania_swahili_political_sentiment.csv`.  

**Goal:**  
Provide sentiment labels for each tweet, so we can combine them with trust metrics in Phase 4 for a **complete analysis of credibility and opinion trends**.  

This step makes it easy for our team to quickly see which tweets are positive, neutral, or negative without manually reading hundreds of Swahili posts.
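
Once every tweet has a label, the overall mood can be summarized in a few lines. The sketch below is one possible way to do this and is not part of the notebook's pipeline. It reuses the numeric mapping that the dashboard applies for its trend chart (positive = 1, neutral = 0, negative = -1); the `summarize_sentiment` helper and the sample labels are hypothetical.

```python
from collections import Counter

# Same numeric mapping the dashboard uses for its sentiment trend chart
SENTIMENT_VALUE = {"positive": 1, "neutral": 0, "negative": -1}

def summarize_sentiment(labels):
    """Return the share of each label and a net score between -1 and 1."""
    total = len(labels)
    if total == 0:
        return {"shares": {}, "net_score": 0.0}
    counts = Counter(labels)
    shares = {label: counts.get(label, 0) / total for label in SENTIMENT_VALUE}
    net_score = sum(SENTIMENT_VALUE[label] for label in labels) / total
    return {"shares": shares, "net_score": net_score}

# Hypothetical predictions for four tweets
summary = summarize_sentiment(["positive", "neutral", "negative", "positive"])
print(summary)
```

A net score near 1 indicates predominantly positive posts, near -1 predominantly negative ones, and near 0 either neutral or evenly split opinion, so it is best read alongside the per-label shares.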


In [4]:
# PHASE 3 → Running AfriSenti on our synthetic dataset

import pandas as pd
from tqdm import tqdm

# Load our CSV
df = pd.read_csv("tanzania_swahili_political_raw.csv")

# Ensure the Phase 1 AfriSenti setup has been run
# (it provides tokenizer, model, device, and predict_sentiment())

# Run the model once per tweet and keep the top-ranked (label, probability) pair
tqdm.pandas(desc="Predicting sentiment")
top_prediction = df["text"].progress_apply(lambda x: predict_sentiment(x)[0])

# Split the pair into a label column and a confidence column
df["predicted_sentiment"] = top_prediction.map(lambda p: p[0])   # top label
df["predicted_confidence"] = top_prediction.map(lambda p: p[1])  # top probability

# Save new CSV with sentiment
OUTPUT_PATH_PHASE3 = "tanzania_swahili_political_sentiment.csv"
df.to_csv(OUTPUT_PATH_PHASE3, index=False, encoding="utf-8-sig")
print(f"Saved Phase 3 CSV with sentiment → {OUTPUT_PATH_PHASE3} ({len(df)} rows)")

Predicting sentiment: 100%|██████████| 600/600 [05:15<00:00,  1.90it/s]

Saved Phase 3 CSV with sentiment → tanzania_swahili_political_sentiment.csv (600 rows)





# Phase 4 → How Trust Scores Are Calculated

Each tweet is evaluated on **four components** to determine its overall **Trust Score (0–100):**

---

### 1. Source Credibility (40%)
Measures how reliable the account is. Factors include:

- **Verified status** – accounts with a check get higher scores  
- **Account age** – older accounts are generally more trustworthy  

---

### 2. User Consistency (30%)
Measures whether the user posts consistently.  

- Accounts with **stable posting behavior** and reasonable activity get higher scores  

---

### 3. Community Validation (20%)
Measures engagement from other users:  

- Likes, retweets, replies  
- Higher engagement increases the score (this prototype uses raw engagement counts and does not yet weight them by the credibility of the engaging users)  

---

### 4. Temporal Freshness (10%)
Measures how recent the content is.  

- Fresh, up-to-date tweets score higher than old ones  

---

### Final Trust Score
The **weighted combination** of all four components:

$$
\text{Trust Score} = 0.4 \times \text{Source Credibility} + 0.3 \times \text{User Consistency} + 0.2 \times \text{Community Validation} + 0.1 \times \text{Temporal Freshness}
$$




## Practical Example: Calculating the Trust Score for a Tweet

Suppose we have the following tweet:

| tweet_id | username    | verified | source_profile_age_days | account_consistency_score | community_validation_score | temporal_freshness_score |
|----------|------------|----------|------------------------|--------------------------|---------------------------|-------------------------|
| 2000001  | mwananchi_1| True     | 3650                   | 55                       | 22                        | 99                      |

---

## Step 1: Calculate Each Component

Each component is first expressed on a 0–100 scale:

- **Source Credibility:** the verified flag (1) and the normalized account age (3650/3650 = 1, taking 3650 days as the oldest account in the dataset) are averaged: `((1 + 1) / 2) * 100 = 100`
- **User Consistency:** `55`
- **Community Validation:** `22`
- **Temporal Freshness:** `99`

## Step 2: Apply the Weights

`trust_score = 0.4*source_credibility + 0.3*account_consistency_score + 0.2*community_validation_score + 0.1*temporal_freshness_score`

Add the values:

`trust_score = 0.4*100 + 0.3*55 + 0.2*22 + 0.1*99 = 40 + 16.5 + 4.4 + 9.9 = 70.8`

So this tweet would have a Trust Score of 70.8, or roughly 71/100.
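
The arithmetic of this example can be checked with a short standard-library sketch. It assumes Source Credibility is the average of the verified flag and the normalized account age on a 0-100 scale, an assumption chosen so that a verified account of maximum age scores 100, as the final sum requires. The helper names are hypothetical.

```python
# Component weights from the Trust Score definition
WEIGHTS = {
    "source_credibility": 0.4,
    "user_consistency": 0.3,
    "community_validation": 0.2,
    "temporal_freshness": 0.1,
}

def source_credibility(verified, age_days, max_age_days):
    """Assumed normalization: average of verified flag and relative age, 0-100."""
    age_norm = min(1.0, age_days / max_age_days)
    return (int(verified) + age_norm) / 2 * 100

def trust_score(components):
    """Weighted sum of the four 0-100 components, rounded to two decimals."""
    return round(sum(WEIGHTS[name] * components[name] for name in WEIGHTS), 2)

# Values from the example tweet (mwananchi_1)
example = {
    "source_credibility": source_credibility(True, 3650, 3650),
    "user_consistency": 55,
    "community_validation": 22,
    "temporal_freshness": 99,
}
score = trust_score(example)
print(score)
```

Because the weights sum to 1 and every component is capped at 100, the resulting score always stays within 0-100.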

In [36]:
import pandas as pd

# Load Phase 3 CSV
df = pd.read_csv("tanzania_swahili_political_sentiment.csv")

# Phase 4 → Build Trust Score Components

# Each component is kept on a 0-100 scale; the weights are applied once, in the final sum.

# Source Credibility (40%): average of the verified flag and the normalized account age
age_norm = df["source_profile_age_days"] / df["source_profile_age_days"].max()  # older accounts score higher
df["source_credibility"] = ((df["verified"].astype(int) + age_norm) / 2 * 100).clip(0, 100)

# User Consistency (30%)
df["user_consistency"] = df["account_consistency_score"].clip(0, 100)

# Community Validation (20%)
df["community_validation"] = df["community_validation_score"].clip(0, 100)

# Temporal Freshness (10%)
df["temporal_freshness"] = df["temporal_freshness_score"].clip(0, 100)

# Combine into final Trust Score (0-100)
df["trust_score"] = (
    df["source_credibility"] * 0.4 +
    df["user_consistency"] * 0.3 +
    df["community_validation"] * 0.2 +
    df["temporal_freshness"] * 0.1
).round(2)

# Save new CSV
OUTPUT_PATH_PHASE4 = "tanzania_swahili_political_trustscore.csv"
df.to_csv(OUTPUT_PATH_PHASE4, index=False, encoding="utf-8-sig")
print(f"Saved Phase 4 CSV → {OUTPUT_PATH_PHASE4} ({len(df)} rows)")


Saved Phase 4 CSV → tanzania_swahili_political_trustscore.csv (600 rows)


# Phase 5 → Building the Explainability Layer

In this phase, **our hackathon team enhanced the dataset with explanations for each tweet’s Trust Score**.  

### What we did:

1. **Load Phase 4 CSV:** Imported `tanzania_swahili_political_trustscore.csv` which already contains sentiment predictions and trust components.  
2. **Generate human-readable explanations:**  
   - **Source Credibility:** Verified status and account age.  
   - **User Consistency:** How regularly the account posts.  
   - **Community Validation:** Engagement metrics like likes, retweets, and replies.  
   - **Temporal Freshness:** How recent the tweet is.  
3. **Combine explanations:** Each tweet now has a clear summary of **why it received its Trust Score**.  
4. **Save enriched CSV:** Output to `tanzania_swahili_political_trustscore_explained.csv` for downstream use in dashboards or analysis.  

**Goal:**  
Make Trust Scores **transparent and interpretable**, so users (or journalists) can understand why a tweet is considered more or less credible.
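
One way to make an explanation more concrete is to state how many points each component contributes to the final score. The sketch below is an illustrative extension rather than the notebook's own explanation function. It assumes each component is on a 0-100 scale and uses the weights from the Trust Score definition; `explain_contributions` is a hypothetical helper.

```python
# Component weights from the Trust Score definition
WEIGHTS = {
    "source_credibility": 0.4,
    "user_consistency": 0.3,
    "community_validation": 0.2,
    "temporal_freshness": 0.1,
}

def explain_contributions(scores):
    """Describe how many Trust Score points each component contributes."""
    parts = []
    for name, weight in WEIGHTS.items():
        points = weight * scores[name]
        label = name.replace("_", " ").title()
        parts.append(f"{label}: {scores[name]:.1f} x {weight:.0%} = {points:.1f} pts")
    return "; ".join(parts)

# Hypothetical component scores for one tweet
text = explain_contributions({
    "source_credibility": 100,
    "user_consistency": 55,
    "community_validation": 22,
    "temporal_freshness": 99,
})
print(text)
```

A breakdown like this lets a reader see at a glance which component drives a high or low score.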


In [37]:
import pandas as pd

# Load Phase 4 CSV
df = pd.read_csv("tanzania_swahili_political_trustscore.csv")

# Phase 5 → Build Explainability Layer

def explain_trust(row):
    explanations = []

    # Source Credibility
    if row["verified"]:
        explanations.append("Account is verified (+ boost)")
    else:
        explanations.append("Account is not verified")

    age_days = row["source_profile_age_days"]
    explanations.append(f"Account age: {age_days} days (older accounts score higher)")

    # User Consistency
    consistency_score = row["account_consistency_score"]
    explanations.append(f"User consistency score: {consistency_score:.1f}")

    # Community Validation
    community_score = row["community_validation_score"]
    explanations.append(f"Community validation score: {community_score:.1f}")

    # Temporal Freshness
    freshness_score = row["temporal_freshness_score"]
    explanations.append(f"Temporal freshness score: {freshness_score:.1f}")

    return "; ".join(explanations)

# Apply explainability
df["trust_explanation"] = df.apply(explain_trust, axis=1)

# Save new CSV with explanations
OUTPUT_PATH_PHASE5 = "tanzania_swahili_political_trustscore_explained.csv"
df.to_csv(OUTPUT_PATH_PHASE5, index=False, encoding="utf-8-sig")
print(f"Saved Phase 5 CSV → {OUTPUT_PATH_PHASE5} ({len(df)} rows)")


Saved Phase 5 CSV → tanzania_swahili_political_trustscore_explained.csv (600 rows)


In [7]:
import pandas as pd
from IPython.display import display, HTML
import ipywidgets as widgets

# Load Phase 5 CSV with explanation column
df = pd.read_csv("tanzania_swahili_political_trustscore_explained.csv")

# Dropdown to select topic/hashtag
topic_dropdown = widgets.Dropdown(
    options=["All"] + sorted(df["topic"].unique().tolist()),
    description="Topic:"
)

# Slider to filter by minimum Trust Score
score_slider = widgets.IntSlider(
    value=50,
    min=0,
    max=100,
    step=1,
    description='Min Trust:'
)

# Function to display filtered tweets beautifully
def show_tweets(topic, min_trust):
    if topic == "All":
        filtered = df[df["trust_score"] >= min_trust]
    else:
        filtered = df[(df["topic"] == topic) & (df["trust_score"] >= min_trust)]

    html_content = ""
    for _, row in filtered.iterrows():
        # Sentiment color
        if row['predicted_sentiment'] == "positive":
            color = "#d4edda"  # green
        elif row['predicted_sentiment'] == "negative":
            color = "#f8d7da"  # red
        else:
            color = "#fff3cd"  # yellow

        # Trust bar width
        trust_percent = row['trust_score']

        html_content += f"""
        <div style="border:1px solid #ccc; border-radius:10px; padding:15px; margin-bottom:10px; background:{color};">
            <div style="font-size:14px; color:#555;">
                <strong>Tweet ID:</strong> {row['tweet_id']} &nbsp;|&nbsp;
                <strong>Username:</strong> {row['username']} &nbsp;|&nbsp;
                <strong>Topic:</strong> {row['topic']}
            </div>
            <div style="margin:5px 0; font-size:16px;"><em>{row['text']}</em></div>
            <div style="font-size:14px; color:#333;">
                <strong>Sentiment:</strong> {row['predicted_sentiment']} &nbsp;|&nbsp;
                <strong>Trust Score:</strong> {row['trust_score']}
            </div>
            <div style="background:#eee; height:10px; width:100%; border-radius:5px; margin-top:5px;">
                <div style="width:{trust_percent}%; height:10px; background:#007bff; border-radius:5px;"></div>
            </div>
            <div style="font-size:12px; color:#555; margin-top:5px;">
                <strong>Explanation:</strong> {row['trust_explanation']}
            </div>
        </div>
        """
    display(HTML(html_content))

# Interactive widget
widgets.interact(show_tweets, topic=topic_dropdown, min_trust=score_slider)


## Prepare Files for GitHub

Create the `dashboard.py` file with the content below and save `tanzania_swahili_political_trustscore_explained.csv` alongside it. Then create a new GitHub repository and push both files to it.


In [53]:
%%writefile dashboard.py
import streamlit as st
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

st.set_page_config(page_title="Tanzania Political AI Dashboard", layout="wide")

@st.cache_data
def load_data():
    df = pd.read_csv("tanzania_swahili_political_trustscore_explained.csv")
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df

df = load_data()

st.title("Tanzania Political Sentiment & Trust Score Dashboard")
st.write("Real-time AI analysis for users who do not speak Swahili.")

# --- Sidebar Filters ---
st.sidebar.header("Filter Options")

topic_filter = st.sidebar.multiselect(
    "Select Topic",
    options=df["topic"].unique(),
    default=df["topic"].unique()
)

min_trust_score = st.sidebar.slider(
    "Minimum Trust Score",
    min_value=0,
    max_value=100,
    value=0
)

min_sentiment_confidence = st.sidebar.slider(
    "Minimum Sentiment Confidence",
    min_value=0.0,
    max_value=1.0,
    value=0.0, step=0.05
)

# Apply filters
df_filtered = df[
    (df["topic"].isin(topic_filter)) &
    (df["trust_score"] >= min_trust_score) &
    (df["predicted_confidence"] >= min_sentiment_confidence)
]

# Check if filtered data is empty
if df_filtered.empty:
    st.warning("No data available based on the current filter settings.")
    st.stop()

# --- KPIs ---
st.subheader("Key Performance Indicators")
col1, col2, col3, col4 = st.columns(4)

with col1:
    st.metric("Total Posts", df_filtered.shape[0])
with col2:
    st.metric("Avg Trust Score", f"{df_filtered['trust_score'].mean():.2f}")
with col3:
    st.metric("Avg Sentiment Confidence", f"{df_filtered['predicted_confidence'].mean():.2f}")
with col4:
    most_frequent_sentiment = df_filtered["predicted_sentiment"].mode()[0] if not df_filtered["predicted_sentiment"].empty else "N/A"
    st.metric("Most Frequent Sentiment", most_frequent_sentiment)


# --- Visualizations ---
st.subheader("Data Visualizations")

# Sentiment Distribution (Pie Chart)
fig_sentiment = px.pie(df_filtered, names="predicted_sentiment", title="Sentiment Distribution")
st.plotly_chart(fig_sentiment, use_container_width=True)

# Trust Score Distribution (Histogram)
fig_trust = px.histogram(df_filtered, x="trust_score", nbins=20, title="Trust Score Distribution")
st.plotly_chart(fig_trust, use_container_width=True)

# Sentiment Over Time (Line Chart)
df_resampled = df_filtered.set_index('timestamp').resample('D').agg({
    'predicted_sentiment': lambda x: x.mode()[0] if not x.empty else 'neutral',
    'trust_score': 'mean'
}).reset_index()

# Map sentiment to numerical values for plotting
sentiment_map = {'positive': 1, 'neutral': 0, 'negative': -1}
df_resampled['sentiment_value'] = df_resampled['predicted_sentiment'].map(sentiment_map)

fig_time = px.line(df_resampled, x='timestamp', y='sentiment_value', title='Sentiment Trend Over Time',
                   labels={'sentiment_value': 'Sentiment (1=Pos, 0=Neu, -1=Neg)'})
st.plotly_chart(fig_time, use_container_width=True)

# --- Topic Insights Table ---
st.subheader("Topic Insights")
topic_summary = df_filtered.groupby("topic").agg(
    total_posts=("tweet_id", "count"),
    avg_trust_score=("trust_score", "mean"),
    avg_sentiment_confidence=("predicted_confidence", "mean"),
    most_frequent_sentiment=("predicted_sentiment", lambda x: x.mode()[0] if not x.empty else "N/A")
).reset_index().round(2)

st.dataframe(topic_summary, use_container_width=True)

# --- Explainability Section ---
st.subheader("Explainability for Individual Tweets")

selected_tweet_id = st.selectbox(
    "Select a Tweet ID to see its details and explanation",
    options=df_filtered["tweet_id"].unique()
)

if selected_tweet_id:
    selected_row = df_filtered[df_filtered["tweet_id"] == selected_tweet_id].iloc[0]
    st.write(f"**Tweet Text:** {selected_row['text']}")
    st.write(f"**Predicted Sentiment:** {selected_row['predicted_sentiment']} (Confidence: {selected_row['predicted_confidence']:.2f})")
    st.write(f"**Trust Score:** {selected_row['trust_score']:.2f}")
    st.write(f"**Explanation:** {selected_row['trust_explanation']}")
    st.write("**Full Details:**")
    st.json(selected_row.drop('trust_explanation').to_dict())


Overwriting dashboard.py
