# PART I: Weather Forecast Synthesizer--Jupyter Notebook

## Overview
This notebook synthesizes weather forecasts from multiple APIs and uses an LLM to summarize and compare against historical data.
- about the branching of capabilities and how it ties over to the other portions of the assignment.
- talk about the goal of the workflow
- additional details on architecture and decisions can be found in the report.

## Important Notes
- To support this application, data is pulled from various sources, usually leveraging an API key or token. These keys/tokens were obtained using my personal email and will be kept active through the evaluation period.
- **Please note that queries using the OpenAI API key are a paid service; running that portion of the application will incur a small charge.**

## Set-up

### About
Set up the Python environment and notebook with the necessary packages and functions.

### Instructions
1. Ensure that the environment has the required libraries. If necessary, uncomment the dependency cell and install the packages.
2. If the application's folder structure is maintained upon transfer, the file paths should not need updating. If the directories have changed, update the file paths accordingly.
3. API keys and tokens are included in the .env file. If errors appear when resolving them, obtain separate keys or contact Ian.
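
Once the library-loading cell has run, it can help to confirm that every expected key actually resolved from the .env file. The following is a minimal sketch, assuming the environment variable names read in that cell; `find_missing_keys` is a hypothetical helper and not part of the original notebook.

```python
import os

# Environment variable names as read in this notebook's loading cell
REQUIRED_KEYS = [
    "WEATHERBIT_API_KEY",
    "NOAA_TOKEN",
    "OPENAI_KEY",
    "NEWSAPI_API_KEY",
    "REDDIT_SECRET",
    "REDDIT_ID",
]

def find_missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env` (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = find_missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All API keys resolved.")
```

An empty value is treated the same as a missing one, since an empty key would fail at the first API call anyway.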

In [1]:
# # Install dependencies
# %pip install openmeteo-requests geopy transformers datasets torch python-dotenv openai protobuf accelerate sentencepiece huggingface_hub llama-cpp-python fastapi rich praw

In [2]:
# # Export environment packages and libraries
# with open("requirements.txt", "w") as f:
#     subprocess.run([sys.executable, "-m", "pip", "freeze"], stdout=f)

In [3]:
# Load libraries 
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
from transformers import pipeline
from dotenv import load_dotenv
import os
import requests
import json
from datetime import datetime, timedelta, date, timezone
import time
import pandas as pd
import openai
import pprint
from llama_cpp import Llama
from huggingface_hub import hf_hub_download
from pathlib import Path
import subprocess
import sys
from typing import List, Dict
import praw


notebook_dir = Path().resolve()  # This is the notebook's directory
env_path = notebook_dir.parent / "env" / ".env"
load_dotenv(dotenv_path=env_path)

# Load the API keys
WEATHERBIT_API_KEY = os.getenv("WEATHERBIT_API_KEY")
NOAA_TOKEN = os.getenv("NOAA_TOKEN")
OPENAI_KEY = os.getenv("OPENAI_KEY")
NEWSAPI_API_KEY = os.getenv("NEWSAPI_API_KEY")
REDDIT_SECRET = os.getenv("REDDIT_SECRET")
REDDIT_ID = os.getenv("REDDIT_ID")


# Initialize OpenAI endpoint
client = openai.OpenAI(api_key=OPENAI_KEY)



## Helper functions

### About
These helper functions are defined explicitly within this notebook for reference and use. They have also been refactored and transferred to the corresponding module.py files for use by the MCP server, but should otherwise work exactly the same. These functions were created to perform three general tasks: 
1. Obtain the necessary data,
2. Wrangle weather data into three tables, and
3. Create the necessary artifacts and execute a query to the LLM.

### Instructions
1. Run cells.

In [4]:
# Functions to obtain weather data
def get_coordinates(city_name):
    geolocator = Nominatim(user_agent="weather_forecast")
    location = geolocator.geocode(city_name)
    
    if not location:
        print("Location not found.")
        return None,None
    
    return location.latitude, location.longitude

def get_weatherbit_forecast(lat, lon):

    url = "https://api.weatherbit.io/v2.0/forecast/daily"
    params = {
        "lat": lat,
        "lon": lon,
        "key": WEATHERBIT_API_KEY,
        "days": 7
    }
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print("Weatherbit API error:", response.status_code)
        return {}

def get_open_meteo_forecast(lat, lon):
    url = "https://api.open-meteo.com/v1/forecast"
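    params = {
        "latitude": lat,
        "longitude": lon,
        "daily": [
            "temperature_2m_max",
            "temperature_2m_min",
            "precipitation_sum",
            "windspeed_10m_max"
        ],
        # Open-Meteo reports wind speed in km/h by default; request m/s to match the column names used when normalizing
        "wind_speed_unit": "ms",
        "timezone": "auto"
    }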
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print("Open-Meteo API error:", response.status_code)
        return {}

def safe_noaa_request(url, headers, params, max_retries=5, backoff=5):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, params=params)
        if response.status_code == 200:
            return response
        elif response.status_code == 503:
            print(f"503 error, retrying in {backoff} seconds...")
            time.sleep(backoff)
            backoff *= 2
        else:
            print(f"NOAA API error {response.status_code}: {response.text}")
            break
    return None

def generate_past_10yr_ranges(days_back=7):
    today = date.today()
    ranges = []
    for i in range(1, 11):
        try:
            start = today.replace(year=today.year - i)
        except ValueError:
            # Handle leap year case for Feb 29 by falling back to Feb 28
            start = today.replace(year=today.year - i, day=28)
        end = start + timedelta(days=days_back)
        ranges.append((start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")))
    return ranges

def find_nearest_station(lat, lon, start_date, end_date):
    url = "https://www.ncdc.noaa.gov/cdo-web/api/v2/stations"
    headers = {"token": NOAA_TOKEN}
    params = {
        "datasetid": "GHCND",
        "startdate": start_date,
        "enddate": end_date,
        "limit": 1000,
        # Slightly bigger bounding box to catch nearby stations
        "extent": f"{lat - 1},{lon - 1},{lat + 1},{lon + 1}",
        "sortfield": "datacoverage",
        "sortorder": "desc"
    }

    response = safe_noaa_request(url, headers, params)
    if response and response.status_code == 200:
        stations = response.json().get("results", [])
        # Filter to only USW stations
        usw_stations = [s for s in stations if s["id"].startswith("GHCND:USW")]

        if not usw_stations:
            print("No USW stations found in bounding box.")
            return None

        # Debug print to see what was found
        print(f"Found {len(usw_stations)} USW stations, choosing closest...")

        # Find closest USW station by geodesic distance
        closest_usw_station = min(
            usw_stations,
            key=lambda s: geodesic((lat, lon), (s["latitude"], s["longitude"])).km
        )

        print(f"Closest USW station: {closest_usw_station['id']} - {closest_usw_station.get('name', '')}")
        return closest_usw_station["id"]

    else:
        print(f"NOAA API request failed with status: {response.status_code if response else 'No response'}")
    return None

def get_noaa_data_for_range(station_id, start_date, end_date, datatypeids=None):
    if datatypeids is None:
        datatypeids = ["TMIN", "TMAX", "PRCP", "AWND"]

    url = "https://www.ncdc.noaa.gov/cdo-web/api/v2/data"
    headers = {"token": NOAA_TOKEN}
    all_results = []
    limit = 1000
    offset = 1

    while True:
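        params = {
            "datasetid": "GHCND",
            "datatypeid": datatypeids,
            "stationid": station_id,
            "startdate": start_date,
            "enddate": end_date,
            "limit": limit,
            "offset": offset,
            # Metric units (degC, mm, m/s) match the column names used when normalizing
            "units": "metric",
            "sortfield": "date",
            "sortorder": "asc",
            # Metadata must be included: the pagination check below reads resultset.count
            "includemetadata": "true"
        }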
        response = safe_noaa_request(url, headers, params)
        if not response:
            break
        data = response.json()
        results = data.get("results", [])
        all_results.extend(results)
        metadata = data.get("metadata", {}).get("resultset", {})
        count = metadata.get("count", 0)
        if offset + limit > count:
            break
        offset += limit

    return all_results

def get_noaa_10yr_historical(lat, lon, days_back=7):
    date_ranges = generate_past_10yr_ranges(days_back)
    # Use earliest date range (10 years ago) to find station with coverage
    earliest_start, earliest_end = date_ranges[-1]
    station_id = find_nearest_station(lat, lon, earliest_start, earliest_end)
    if not station_id:
        raise ValueError("No NOAA station found with data coverage for the location and date range.")

    print(f"Using station {station_id}")

    combined_results = []
    for start_date, end_date in date_ranges:
        print(f"Fetching data for {start_date} to {end_date}...")
        data = get_noaa_data_for_range(station_id, start_date, end_date)
        combined_results.extend(data)

    return combined_results, station_id


In [5]:
# Functions to obtain news data
US_STATE_ABBR_TO_NAME = {
    'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
    'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
    'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa',
    'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland',
    'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi',
    'MO': 'Missouri', 'MT': 'Montana', 'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire',
    'NJ': 'New Jersey', 'NM': 'New Mexico', 'NY': 'New York', 'NC': 'North Carolina',
    'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania',
    'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee',
    'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont', 'VA': 'Virginia', 'WA': 'Washington',
    'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming',
    'DC': 'District of Columbia'
}

def extract_city_state(location: str) -> tuple[str, str]:
    """
    Extracts the city name and state name from a location string formatted as
    'City', 'City, State', or 'City, State, Country'. Converts US state abbreviations
    to full state names.

    Args:
        location (str): A location string, e.g., 'Phoenix, AZ' or 'Los Angeles, CA, USA'.

    Returns:
        tuple[str, str]: A tuple containing the extracted city name and full state name.
                         If no state is found, returns empty string for state.
                         Example: ('Phoenix', 'Arizona'), ('Los Angeles', 'California'), ('London', '')
    """
    if not location or not isinstance(location, str):
        return "", ""

    parts = [part.strip() for part in location.split(",") if part.strip()]
    city = parts[0] if len(parts) >= 1 else ""
    state = parts[1].upper() if len(parts) >= 2 else ""

    # Convert state abbreviation to full name if found
    full_state = US_STATE_ABBR_TO_NAME.get(state, parts[1] if len(parts) >= 2 else "")

    return city, full_state

def get_weather_news(city: str, api_key: str, days_back: int = 3, max_articles: int = 5) -> List[Dict]:
    """
    Queries the NewsAPI for weather-related news articles about a given city.

    Args:
        city (str): The city to search news for. Accepts formats like 'City, State'.
        api_key (str): Your NewsAPI.org API key.
        days_back (int): How many days back to search for news.
        max_articles (int): Maximum number of articles to return.

    Returns:
        List[Dict]: A list of article dicts with keys 'title', 'source',
                    'datePublished', 'snippet', and 'url'.
    """
    # Extract the city name to improve search relevance
    clean_city,clean_state = extract_city_state(city)
    print(clean_city, clean_state)
    
    # Build query
    # query = f"{clean_city} weather OR storm OR rainfall OR heat OR climate OR flood"
    
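    # Quote each location term so multi-word names (e.g. "New York") match as phrases,
    # and skip the state when none was parsed
    location_terms = " OR ".join(f'"{term}"' for term in (clean_city, clean_state) if term)
    query = (
        f"({location_terms}) "
        "AND (weather OR storm OR forecast OR temperature OR rainfall OR snow OR flooding OR humidity) "
        "-sports -baseball -football -NBA -concert -game -soccer -crime"
    )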

    # Dates
    from_date = (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d')
    to_date = datetime.now().strftime('%Y-%m-%d')

    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "from": from_date,
        "to": to_date,
        "language": "en",
        "sortBy": "relevancy",
        "pageSize": max_articles,
        "apiKey": api_key
    }

    try:
        response = requests.get(url, params=params)
        response.raise_for_status()
        articles = response.json().get("articles", [])
        
        return [
            {
                "title": article.get("title"),
                "source": article.get("source", {}).get("name"),
                "datePublished": article.get("publishedAt"),
                "snippet": article.get("description"),
                "url": article.get("url")
            }
            for article in articles if article.get("title") and article.get("description")
        ]

    except Exception as e:
        print(f"[NewsAPI Error] {e}")
        return []


In [6]:
# Functions to obtain social media data
def fetch_reddit_weather_posts(
    reddit_client_id: str,
    reddit_client_secret: str,
    reddit_user_agent: str,
    location: str,
    max_posts: int = 20,
) -> List[Dict]:
    """
    Fetch recent weather-related Reddit posts from select subreddits and subreddits matching the location name.

    Args:
        reddit_client_id (str): Reddit API client ID.
        reddit_client_secret (str): Reddit API client secret.
        reddit_user_agent (str): User agent string.
        location (str): City or city+state string, e.g. "Seattle WA"
        max_posts (int): Maximum number of posts to return.

    Returns:
        List[Dict]: List of posts dicts with keys: title, subreddit, created_utc, url, selftext
    """

    reddit = praw.Reddit(
        client_id=reddit_client_id,
        client_secret=reddit_client_secret,
        user_agent=reddit_user_agent,
    )

    weather_subreddits = ["weather", "climate", "StormComing"]

    location_parts = location.lower().replace(",", "").split()
    location_subreddits = []

    for part in location_parts:
        # Search subreddits with location part in the name, limit 5 per part
        for sub in reddit.subreddits.search_by_name(part, exact=False)[:5]:
            sub_name = sub.display_name.lower()
            if any(loc_part in sub_name for loc_part in location_parts):
                location_subreddits.append(sub.display_name)

    # Unique combined list
    all_subreddits = list(set(weather_subreddits + location_subreddits))

    weather_keywords = [
        "weather", "storm", "flood", "rain", "snow", "heatwave",
        "tornado", "hurricane", "drought", "lightning", "climate",
        "hail", "wind"
    ]

    posts = []

    for subreddit_name in all_subreddits:
        subreddit = reddit.subreddit(subreddit_name)
        query = " OR ".join(weather_keywords)
        # Search top posts in past week, sorted by relevance
        for submission in subreddit.search(query, time_filter="week", sort="relevance", limit=max_posts):
            posts.append({
                "title": submission.title,
                "subreddit": subreddit_name,
                "created_utc": submission.created_utc,
                "url": submission.url,
                "selftext": submission.selftext,
            })
            if len(posts) >= max_posts:
                break
        if len(posts) >= max_posts:
            break

    return posts


In [7]:
# Functions to manipulate data
def normalize_forecast(data,source_name):
    normalized = []

    if source_name == "open_meteo":
        daily = data.get("daily", {})
        dates = daily.get("time", [])
        for i, date in enumerate(dates):
            normalized.append({
                "date": date,
                "temp_max-degC-open_meteo": daily.get("temperature_2m_max", [None]*len(dates))[i],
                "temp_min-degC-open_meteo": daily.get("temperature_2m_min", [None]*len(dates))[i],
                "precip-mm-open_meteo": daily.get("precipitation_sum", [None]*len(dates))[i],
                "wind_max-mpersec-open_meteo": daily.get("windspeed_10m_max", [None]*len(dates))[i]
            })

    elif source_name == "weatherbit":
        for day in data.get("data", []):
            normalized.append({
                "date": day.get("datetime"),
                "temp_max-degC-weatherbit": day.get("max_temp"),
                "temp_min-degC-weatherbit": day.get("min_temp"),
                "precip-mm-weatherbit": day.get("precip"),
                "wind_max-mpersec-weatherbit": day.get("wind_spd")
            })

    return pd.DataFrame(normalized)

def merge_forecasts(open_meteo_data, weatherbit_data, normalize_fn):
    # Normalize each source
    df_openmeteo = normalize_fn(open_meteo_data, source_name="open_meteo")
    df_weatherbit = normalize_fn(weatherbit_data, source_name="weatherbit")

    # Merge on date using outer join to retain all data points
    merged_df = pd.merge(df_openmeteo, df_weatherbit, on="date", how="outer")

    # Sort by date
    merged_df = merged_df.sort_values("date").reset_index(drop=True)
    
    return merged_df

def normalize_noaa_data(noaa_raw_data):
    """
    Normalize NOAA raw data into a pandas DataFrame matching your forecast column naming.
    
    Args:
        noaa_raw_data (list of dict): raw NOAA data list
    
    Returns:
        pd.DataFrame: normalized DataFrame with columns:
                      ['date', 'temp_max-degC-noaa', 'temp_min-degC-noaa', 'precip-mm-noaa', 'wind_max-mpersec-noaa']
    """
    if not noaa_raw_data:
        return pd.DataFrame(columns=[
            "date",
            "temp_max-degC-noaa",
            "temp_min-degC-noaa",
            "precip-mm-noaa",
            "wind_max-mpersec-noaa"
        ])

    # Convert to DataFrame
    df = pd.DataFrame(noaa_raw_data)
    df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')

    # Pivot so each datatype is a column
    df_pivot = df.pivot_table(index='date', 
                              columns='datatype', 
                              values='value', 
                              aggfunc='first').reset_index()

    # Rename columns with your naming convention
    df_pivot = df_pivot.rename(columns={
        "TMAX": "temp_max-degC-noaa",
        "TMIN": "temp_min-degC-noaa",
        "PRCP": "precip-mm-noaa",
        "AWND": "wind_max-mpersec-noaa"
    })

    # Ensure all columns exist
    expected_cols = [
        "date",
        "temp_max-degC-noaa",
        "temp_min-degC-noaa",
        "precip-mm-noaa",
        "wind_max-mpersec-noaa"
    ]
    for col in expected_cols:
        if col not in df_pivot.columns:
            df_pivot[col] = pd.NA

    # Reorder columns
    df_pivot = df_pivot[expected_cols]

    return df_pivot

def summarize_noaa_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Summarize NOAA historical weather data by computing mean, std, and count 
    for each weather variable, handling missing data appropriately.

    Parameters:
        df (pd.DataFrame): NOAA normalized DataFrame with expected columns:
            'temp_max-degC-noaa', 'temp_min-degC-noaa', 
            'precip-mm-noaa', 'wind_max-mpersec-noaa'

    Returns:
        pd.DataFrame: Summary with average, standard deviation, and count
    """

    # Ensure date is datetime
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])

    # Define weather columns to summarize
    weather_cols = [
        "temp_max-degC-noaa",
        "temp_min-degC-noaa",
        "precip-mm-noaa",
        "wind_max-mpersec-noaa"
    ]

    # Drop rows where all weather columns are missing
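    # Copy so the numeric conversion below does not write into a slice of the input (avoids SettingWithCopyWarning)
    df_clean = df.dropna(subset=weather_cols, how="all").copy()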

    # Ensure all columns are numeric (e.g., convert <NA> to np.nan)
    for col in weather_cols:
        df_clean[col] = pd.to_numeric(df_clean[col], errors="coerce")

    # Create summary stats
    summary = pd.DataFrame({
        "mean": df_clean[weather_cols].mean(),
        "std": df_clean[weather_cols].std(),
        "count": df_clean[weather_cols].count()
    })

    # Round results for clarity
    summary = summary.round(2)

    return summary

def summarize_noaa_daily_climatology(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute average, standard deviation, and count for each day-of-year across 10 years
    of NOAA weather data.

    Parameters:
        df (pd.DataFrame): NOAA normalized DataFrame with expected columns:
            'date', 'temp_max-degC-noaa', 'temp_min-degC-noaa', 
            'precip-mm-noaa', 'wind_max-mpersec-noaa'

    Returns:
        pd.DataFrame: Summary DataFrame with per-day statistics.
    """

    # Ensure datetime format
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    
    # Extract month and day to group by day-of-year
    df["month_day"] = df["date"].dt.strftime("%m-%d")

    # Define weather columns
    weather_cols = [
        "temp_max-degC-noaa",
        "temp_min-degC-noaa",
        "precip-mm-noaa",
        "wind_max-mpersec-noaa"
    ]

    # Ensure numeric and drop rows missing all variables
    df_clean = df.dropna(subset=weather_cols, how="all").copy()  # copy avoids SettingWithCopyWarning below
    for col in weather_cols:
        df_clean[col] = pd.to_numeric(df_clean[col], errors="coerce")

    # Group by month_day and calculate summary stats
    summary = df_clean.groupby("month_day")[weather_cols].agg(['mean', 'std', 'count'])

    # Flatten multi-index columns
    summary.columns = ['-'.join(col).strip() for col in summary.columns.values]
    summary = summary.reset_index()

    return summary

def format_news_data(articles: List[Dict]) -> str:
    """
    Formats a list of weather-related news articles into a readable summary
    suitable for LLM prompting or report inclusion.

    Args:
        articles (List[Dict]): A list of dictionaries, each containing metadata
                               for a news article with keys like 'title',
                               'source', 'datePublished', 'snippet', and 'url'.

    Returns:
        str: A formatted string summarizing article details. If no articles are
             provided, returns a fallback message.
    """
    if not articles:
        return "No weather-related news was found for this city in the past few days."

    lines = ["Recent Weather News Articles:"]
    for i, a in enumerate(articles, 1):
        title = a.get("title", "No Title")
        source = a.get("source", "Unknown Source")
        date = a.get("datePublished", "Unknown Date")
        snippet = a.get("snippet", "No snippet available.")
        url = a.get("url", "No URL")

        lines.append(
            f"\n[{i}] {title}\n"
            f"Source: {source} | Published: {date}\n"
            f"Snippet: {snippet}\n"
            f"Link: {url}"
        )
    return "\n".join(lines)

def format_reddit_posts_for_llm(posts: List[Dict]) -> str:
    """
    Format Reddit posts for input to an LLM.

    Args:
        posts (List[Dict]): List of Reddit post dicts with keys 'title', 'subreddit', 'created_utc', 'url', 'selftext'.

    Returns:
        str: Formatted multi-line string summarizing posts.
    """
    formatted_posts = []
    for post in posts:
        # Convert epoch UTC to a readable datetime string (timezone-aware; utcfromtimestamp is deprecated)
        dt = datetime.fromtimestamp(post['created_utc'], tz=timezone.utc).strftime('%Y-%m-%d %H:%M UTC')
        # Snippet from selftext, limited to 150 chars, with an ellipsis only when truncated
        selftext = post['selftext'] or ''
        snippet = selftext[:150] + ('...' if len(selftext) > 150 else '')
        formatted = (
            f"[{dt}] r/{post['subreddit']}: {post['title']}\n"
            f"Snippet: {snippet}\n"
            f"URL: {post['url']}\n"
            "----"
        )
        formatted_posts.append(formatted)

    return "\n".join(formatted_posts)



In [8]:
# Functions to create prompt and query LLM
def create_chatgpt_prompt(persona: str, instructions: str, output_format: str, city: str, lat: str, lon: str, station_id: str, news: str, social_media: str,
                         df1_str: str, df2_str: str, df3_str: str) -> str:
    """
    Constructs a prompt for ChatGPT with persona, instructions, output format,
    and embedded dataframes as context.

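    Args:
        persona (str): Description of assistant’s role and tone.
        instructions (str): Specific tasks or questions for the model.
        output_format (str): Desired format for the response.
        city (str): Name of the input city, echoed in the response header.
        lat (str): Latitude of the input city.
        lon (str): Longitude of the input city.
        station_id (str): ID of the NOAA station used for historical data.
        news (str): Formatted news articles.
        social_media (str): Formatted social media posts.
        df1_str (str): String representation (e.g., JSON or CSV) of first dataframe.
        df2_str (str): String representation of second dataframe.
        df3_str (str): String representation of third dataframe.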

    Returns:
        str: The full prompt string to send to the ChatGPT API.
    """

    prompt = f"""
You are {persona}.

Your task is to:
{instructions}

Here are the datasets to assist your analysis:

Dataset 1:
{df1_str}

Dataset 2:
{df2_str}

Dataset 3:
{df3_str}

News data:
{news}

Social media data:
{social_media}

Please provide your response strictly following this format:

City: {city}
Latitude: {lat}
Longitude: {lon}
NOAA Station ID: {station_id}

{output_format}

"""
    return prompt

def query_openai(prompt, openai_key, model="gpt-4o", temperature=0.7, max_tokens=1000):
    # Note: the request goes through the module-level `client`; `openai_key` is currently unused here
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user"
                 ,"content": prompt}
            ],
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"

def query_llm_with_fallback(
    prompt: str,
    openai_key: str = None,
    notebook_path: Path = None,
    repo_id: str = "TheBloke/OpenChat-3.5-1210-GGUF",
    model_filename: str = "openchat-3.5-1210.Q4_K_M.gguf",
    max_tokens: int = 1500,
    n_ctx: int = 8192,
    n_threads: int = 8,
    verbose: bool = False,
) -> str:
    """
    Query OpenAI if API key is present. Otherwise, use a local GGUF model downloaded from Hugging Face.

    Args:
        prompt (str): The prompt to query.
        openai_key (str, optional): OpenAI API key. If None, use local model.
        notebook_path (Path, optional): Path to the notebook (use __file__ or Path.cwd() in scripts).
        repo_id (str): Hugging Face repo ID for the GGUF model.
        model_filename (str): GGUF model file name.
        max_tokens (int): Max tokens for LLM response.
        n_ctx (int): Context length for LLM.
        n_threads (int): Threads used by Llama model.
        verbose (bool): Verbosity for Llama.

    Returns:
        str: The model's text output.
    """
    if openai_key:
        print("Querying OpenAI endpoint...")
        response = query_openai(prompt, openai_key=openai_key)  # query_openai is defined above
        return response
    
    print("No OpenAI key detected. Using local GGUF model.")

    # Determine model directory relative to the notebook
    notebook_dir = notebook_path.parent if notebook_path else Path.cwd()
    local_model_dir = notebook_dir / "models"
    model_path = local_model_dir / model_filename

    # Download model if not present
    if not model_path.exists():
        print("Model not found locally. Downloading from Hugging Face...")
        model_path = hf_hub_download(
            repo_id=repo_id,
            filename=model_filename,
            local_dir=local_model_dir
        )
        print(f"Download complete. Model saved at: {model_path}")
    else:
        print(f"Using cached model at: {model_path}")

    # Initialize model
    print("Querying model...")
    llm = Llama(
        model_path=str(model_path),
        n_ctx=n_ctx,
        n_threads=n_threads,
        use_mlock=True,
        verbose=verbose
    )

    # Query model
    output = llm(prompt, max_tokens=max_tokens, echo=False)
    return output["choices"][0]["text"]


## Obtain data

### About
Given an input city, weather data is obtained from the following sources: Weatherbit, Open-Meteo, and NOAA. Their respective databases are queried using their defined API logic and the lat/lon coordinates of the input city. These raw data are then stored as pandas dataframes for wrangling.

### Important Notes
For this Proof-of-Concept application, a few constraints on the functionality were placed to balance development time and demonstration of capability:
- Finding high-quality historical datasets usually required querying NOAA USW weather stations, as other stations often had incomplete data. As such, the NOAA query retrieves stations within a bounding box of ±1° around the latitude/longitude of the input city, sorted by data coverage for the historical date range, and selects the closest USW station. If there is no USW station within the bounding box, the workflow will fail with an error message, and no summary forecast report will be produced. In a production application, more graceful failover logic can be implemented.
- There has been limited testing of obtaining the appropriate data for cities outside of the United States.
- Occasionally, the underlying databases will be unavailable. While some failover logic has been implemented for temporary outages, extended outages may require running at a later time when the databases are available.
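
One possible shape for the more graceful failover mentioned above is to widen the bounding box step by step before giving up. The sketch below is illustrative only: `pick_station` and its `half_widths` parameter are hypothetical, the station records are made up, and a haversine distance stands in for the `geodesic` call used in the notebook so that the example has no dependencies. In the real workflow each widening step would re-query the NOAA stations endpoint rather than filter an in-memory list.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (spherical Earth, R = 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def pick_station(lat, lon, stations, half_widths=(1.0, 2.0, 4.0)):
    """Return the closest USW station, widening the bounding box until one is found."""
    for half_width in half_widths:
        candidates = [
            s for s in stations
            if s["id"].startswith("GHCND:USW")
            and abs(s["latitude"] - lat) <= half_width
            and abs(s["longitude"] - lon) <= half_width
        ]
        if candidates:
            return min(
                candidates,
                key=lambda s: haversine_km(lat, lon, s["latitude"], s["longitude"]),
            )
    return None  # caller decides how to degrade, e.g. skip the historical comparison

# Illustrative station records (made-up IDs and coordinates, not real NOAA data)
stations = [
    {"id": "GHCND:USW_EXAMPLE_A", "latitude": 47.5, "longitude": -122.3},
    {"id": "GHCND:USC_EXAMPLE_B", "latitude": 47.6, "longitude": -122.3},
    {"id": "GHCND:USW_EXAMPLE_C", "latitude": 49.9, "longitude": -122.3},
]
chosen = pick_station(47.6, -122.33, stations)
print(chosen["id"] if chosen else "No station found")
```

Returning `None` instead of raising leaves the caller free to produce a forecast-only report when no historical station is available.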

### Instructions
1. Update the cell with the desired city, located within the United States. Key data will be printed as the workflow progresses.

In [9]:
# Example inputs, with corresponding USW station for reference. Actual station used may differ
# city = "New York, NY" #GHCND:USW00094728
# city = "Chicago, IL" #GHCND:USW00094846
# city = "San Francisco, CA" #GHCND:USW00023234
city = "Seattle, WA" #GHCND:USW00024233
# city = "Atlanta, GA" #GHCND:USW00013874
# city = "Minneapolis, MN" #GHCND:USW00014922
# city = "Denver, CO" #GHCND:USW00023062
# city = "Boston, MA" #GHCND:USW00014739
# city = "Miami, FL" #GHCND:USW00012839

# Input city name and state, with following format: "City, State Abbreviation"
# city = "Phoenix, AZ" #GHCND:USW00023183

# Obtain and print lat/lon coordinates for selected city
lat, lon = get_coordinates(city)
print(f"Coordinates for {city}: {lat}, {lon}")

# Fetch Weatherbit forecast data
weatherbit_data = get_weatherbit_forecast(lat, lon)

# Fetch Open-Meteo forecast data
open_meteo_data = get_open_meteo_forecast(lat, lon)

# Fetch NOAA historical data
noaa_data, station_id = get_noaa_10yr_historical(lat, lon)

# Fetch news data
news_data = get_weather_news(city,api_key=NEWSAPI_API_KEY, max_articles=3)

# Fetch Reddit posts
reddit_data = fetch_reddit_weather_posts(
    reddit_client_id=REDDIT_ID,
    reddit_client_secret=REDDIT_SECRET,
    reddit_user_agent="WeatherAnalysisBot/0.1 by OkHold2363",
    location=city,
    max_posts=10
    )

Coordinates for Seattle, WA: 47.6038321, -122.330062
Found 12 USW stations, choosing closest...
Closest USW station: GHCND:USW00024234 - SEATTLE BOEING FIELD, WA US
Using station GHCND:USW00024234
Fetching data for 2024-06-02 to 2024-06-09...
Fetching data for 2023-06-02 to 2023-06-09...
Fetching data for 2022-06-02 to 2022-06-09...
Fetching data for 2021-06-02 to 2021-06-09...
Fetching data for 2020-06-02 to 2020-06-09...
Fetching data for 2019-06-02 to 2019-06-09...
Fetching data for 2018-06-02 to 2018-06-09...
Fetching data for 2017-06-02 to 2017-06-09...
Fetching data for 2016-06-02 to 2016-06-09...
Fetching data for 2015-06-02 to 2015-06-09...
Seattle Washington


## Manipulate and clean data

### About
This section will take the raw dataframes produced in the previous step and normalize and combine them to produce three tables to be used as input for the final LLM prompt, namely:

- forecast_df_merged: the forecast data from Open-Meteo and Weatherbit,
- daily_historical_df: the daily 10-year historical data from the nearest NOAA station, summarized by day, and
- summary_historical_df: the 10-year historical data from the nearest NOAA station, summarized by weather variable.

The raw datasets could have been passed to the LLM directly, with the analysis handled in the prompt. However, given the constraints of the assignment, the untested abilities of various LLMs for this task, and the relative ease of wrangling the data, the data were manually transformed and provided to the LLM in a standard form that makes the definitions and assumptions clearer.
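
To illustrate the difference between the two historical tables, the sketch below summarizes a few made-up records once by weather variable and once by calendar day. It uses plain Python instead of dataframes to stay dependency-free. The record layout and the function names are assumptions for illustration and are not the notebook's summarize_noaa_data or summarize_noaa_daily_climatology.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical normalized records: one row per station-day, metric units.
records = [
    {"date": "2023-06-02", "tmax": 21.0, "prcp": 0.0},
    {"date": "2024-06-02", "tmax": 19.0, "prcp": 2.0},
    {"date": "2023-06-03", "tmax": 24.0, "prcp": 0.0},
    {"date": "2024-06-03", "tmax": 22.0, "prcp": 1.0},
]


def summarize_by_variable(rows, variables):
    """One entry per variable: mean, min, and max over every row."""
    summary = {}
    for var in variables:
        values = [row[var] for row in rows]
        summary[var] = {"mean": mean(values), "min": min(values), "max": max(values)}
    return summary


def summarize_by_day(rows, variables):
    """One entry per calendar day (MM-DD): the mean of each variable across years."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["date"][5:]].append(row)
    return {
        day: {var: mean(r[var] for r in day_rows) for var in variables}
        for day, day_rows in sorted(grouped.items())
    }


variables = ["tmax", "prcp"]
print(summarize_by_variable(records, variables))
print(summarize_by_day(records, variables))
```

The per-variable summary gives the LLM a single baseline for the whole window, while the per-day summary lets it compare each forecast date against the same date in earlier years.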

### Instructions
1. Run cells.

In [10]:
# Normalize and merge forecast data
forecast_df_merged = merge_forecasts(open_meteo_data, weatherbit_data, normalize_forecast)

# Normalize historical data
historical_df = normalize_noaa_data(noaa_data)

# Summarize historical data
summary_historical_df = summarize_noaa_data(historical_df)

# Summarize daily historical data
daily_historical_df = summarize_noaa_daily_climatology(historical_df)

# Format news articles for LLM
news_formatted = format_news_data(news_data)

# Format social media posts for LLM
reddit_post_formatted = format_reddit_posts_for_llm(reddit_data)

## Prompt engineering

### About
These steps create the necessary artifacts to be included in the final LLM prompt. They are created to be modular and allow for rapid iteration and testing, should the need occur. Note that this prompt is relatively long. Its content and the number of tokens required can be optimized within the context of the selected LLM, its capabilities, and other considerations. For the purposes of this application POC, the decision was made to provide more explicit context to achieve a minimum quality of response.
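
As a rough illustration of the modular approach, the sketch below joins independent prompt parts into one string. The function `build_prompt` and its parameters are hypothetical. The notebook's own create_chatgpt_prompt is defined elsewhere and may assemble the prompt differently.

```python
import json


def build_prompt(persona, instructions, output_format, context, tables):
    """Join independent prompt parts into one string; `tables` maps a label to row dicts."""
    sections = [
        persona.strip(),
        "Instructions:\n" + instructions.strip(),
        "Output format:\n" + output_format.strip(),
        "Context:\n" + "\n".join(f"{key}: {value}" for key, value in context.items()),
    ]
    for label, rows in tables.items():
        # Each table is serialized as a JSON list of records.
        sections.append(f"{label}:\n{json.dumps(rows)}")
    return "\n\n".join(sections)


# Shortened, made-up inputs; the real artifacts are much longer.
example_prompt = build_prompt(
    persona="You are a professional, friendly meteorologist.",
    instructions="Produce a single 7-day forecast from the tables below.",
    output_format="A table with dates along the top.",
    context={"City": "Seattle, WA", "NOAA station": "GHCND:USW00024234"},
    tables={"Forecast data": [{"date": "2025-06-02", "tmax_c": 22.0}]},
)
print(example_prompt)
```

Because each part is a separate argument, one part (for example, the persona) can be swapped during testing without touching the others.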

### Instructions
1. Run cell.

In [11]:
# Create artifacts for prompt
persona = "You are a professional, friendly meteorologist, communicating with an audience about their local weather."
instructions = """
        First, analyze the provided table of weather forecast data for the city in question. It has the data from multiple sources. Use your expertise and knowledge about the weather for that location to provide a single, 7-day forecast of the weather in a table format along with any helpful commentary. For example, in cases where there is a large discrepancy between the two provided forecasts for a particular variable, consider providing commentary on its presence, potential root cause, and how you resolved it for the final forecast.
        Second, analyze and compare the summary historical data for the particular variable with the forecast data. 
        Third, analyze and compare the daily historical data for the particular variable on that date with the forecast data.
        For the second and third tasks, consider including anything noteworthy in the "Historical comparison" section, for example calling out large deviations from historical data. Use your well-informed, expert opinion to decide when and how to highlight discrepancies. Is the forecast typical compared to history? Anything unusual? Look at different metrics like temperature, humidity, wind, wind chill, cloud cover, etc.
        Fourth, for all of the weather variables, highlight important considerations that residents should take with regard to that variable including potential severe weather, unusual conditions, or impacts on daily life in the "Important considerations" section. For example, if there are extreme or unsafe temperatures, include a note to not leave children in cars, think about pets, and hydrate often.
        Fifth, in the "About the data" section, list any data anomalies, limitations, or special considerations that had to be taken into account in your analysis (for example, averaging two different values for temperature). Also attribute the sources of your data here.
        Sixth, the "Weather-related news" section: summarize the sentiment of the news articles that were found. Filter for only the most relevant to the weather and input location. Provide up to three high-quality news articles for reference.
        Seventh, the "Social media posts" section: summarize the sentiment of the social media posts that were found. Filter for only the most relevant to the weather and input location. Provide up to three high-relevancy posts for reference.
        If any data are conflicting or missing, please highlight and explain. If you are unsure of a conclusion, feel free to make it if you provide acknowledgement of limitations. If you don't know the answer, do not make one up or hallucinate a response; instead, acknowledge limitation(s) and recommend other actions to resolve. End your response politely and professionally.
        """
output_format = """The final 7-day forecast should be formatted as a table. The table should have the forecast dates along the top, and each row should contain the predicted values for one variable.

Example table:
| Date                | 2025-05-28 | 2025-05-29 | 2025-05-30 | 2025-05-31 | 2025-06-01 | 2025-06-02 | 2025-06-03 |
|---------------------|------------|------------|------------|------------|------------|------------|------------|
| Max Temp (°C)       | 39.0       | 37.3       | 38.9       | 39.1       | 33.2       | 35.0       | 33.2       |
| Min Temp (°C)       | 22.7       | 22.3       | 22.9       | 27.4       | 24.9       | 22.4       | 24.0       |
| Precipitation (mm)  | 0.0        | 0.0        | 0.0        | 0.25       | 2.5        | 2.5        | 0.0        |
| Max Wind Speed (m/s)| 8.9        | 7.3        | 10.0       | 17.4       | 18.6       | 18.5       | 15.6       |

Below the table, include a commentary section, formatted as below with the following information:

Commentary:
**Historical comparison:**
- individual points in bulleted list.

**Important considerations:**
- individual points in bulleted list.

**About the data:**
- individual points in bulleted list.

**Weather-related news:**
- Short summary of sentiment of recent weather-related news.
- Three relevant news articles, if present.

**Social media posts:**
- Short summary of sentiment of recent social media posts.
- Three relevant social media posts, including title and URL.
"""
df1_str = forecast_df_merged.to_json(orient='records') 
df2_str = summary_historical_df.to_json(orient='records')
df3_str = daily_historical_df.to_json(orient='records')

# Create prompt
prompt = create_chatgpt_prompt(persona, instructions, output_format, city, lat, lon, station_id, news_formatted, reddit_post_formatted, df1_str, df2_str, df3_str)

## Query LLM directly from Jupyter Notebook

### About
This section tests the final prompt against an LLM. There are two options for which LLM to query:
1. an OpenAI endpoint, or
2. a local GGUF model.

An OpenAI key has been provided in the initial transfer in the .env file and instantiated in the Setup section above. Running the cell below as-is will result in a call to the OpenAI endpoint at a small, nominal charge.

If you would like to test a local model, uncomment the line in the cell below that sets OPENAI_KEY = None. This triggers a workflow that looks in the application file directory for a specific GGUF model. If the model is present, the query proceeds with the generated prompt. Otherwise, the model is first downloaded from Hugging Face, which will take some time depending on connection speed, and the query runs once the download completes.

The default model to download is "openchat-3.5-1210.Q4_K_M.gguf". It was selected to demonstrate the capabilities of a local model that can be stored and run on a modestly powered CPU, free of charge, to produce serviceable results.
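
The decision order described above (OpenAI key first, then an existing local model, then download) can be sketched as a small function. This is an illustration only: `choose_backend` is a hypothetical name, it makes no API calls or downloads, and the real logic lives in query_llm_with_fallback, which is not shown in this section.

```python
import tempfile
from pathlib import Path


def choose_backend(openai_key, model_path):
    """Return which path the fallback workflow would take, without performing any calls."""
    if openai_key:
        return "openai"
    if Path(model_path).is_file():
        return "local"
    return "download-then-local"


results = []
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "openchat-3.5-1210.Q4_K_M.gguf"
    results.append(choose_backend("placeholder-key", model_file))  # key present
    results.append(choose_backend(None, model_file))               # no key, no model file
    model_file.touch()                                             # simulate a finished download
    results.append(choose_backend(None, model_file))               # no key, model file present
print(results)
```

Setting OPENAI_KEY = None in the cell below corresponds to the second and third cases.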

### Instructions
1. Adjust whether or not to use the OPENAI_KEY.
2. Run cell.

In [12]:
# Query LLM

# OPENAI_KEY = None # Uncomment to leverage a locally downloaded model. Download should initiate itself.
# response = query_llm_with_fallback(openai_key=OPENAI_KEY, prompt=prompt, notebook_path=notebook_dir)
# print(response)

# PART II: Weather Forecast Synthesizer--MCP Server

## Overview
The following cells demonstrate a few different ways to interact with the same workflow logic as above, which provides a friendly weather forecast and analysis, but through an MCP server. The server uses that logic to extract the data, find the appropriate model, and query it with the generated prompt. Updates to the logic must be made in the reference files/modules leveraged by the MCP server in order to take effect.

## Important Notes
- If you want to run inference against a local model using the MCP server, you will need to update the .env file to remove the OPENAI_KEY variable.
- To support this application, data is pulled from various sources, usually leveraging an API key or token. These keys/tokens were obtained using my personal email and will be kept active through the evaluation period.
- **Please note that queries using the OpenAI API key are a paid service; running that portion of the application will incur a small, nominal charge.**

## Spin-up MCP server

### About
Two methods are provided to spin up the associated MCP server.

### Option 1 Instructions (spin up the MCP server from this notebook): 
1. Run the below cell.

### Option 2 Instructions (spin up the MCP server from bash):
1. Open a bash terminal. 
2. Navigate to the application root directory, /weatherChatbot.
3. Activate virtual environment: source .venv/Scripts/activate
4. Install any necessary packages (see pip install line at top or env/requirements.txt).
5. Start FastAPI server: uvicorn mcp_server.main:app --reload
6. Confirm that server is running by following the URL provided in the resulting response.

In [13]:
# Step into project root (one level up from 'notebooks/')
os.chdir("..")

# Start FastAPI server via uvicorn
server = subprocess.Popen(
    [sys.executable, "-m", "uvicorn", "mcp_server.main:app", "--port", "8000"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE
    )

print("✅ MCP Server started on http://127.0.0.1:8000")
time.sleep(8)  # Give server time to start

res = requests.get("http://127.0.0.1:8000/")
print("✅ Response:", res.status_code)
print(res.json())

✅ MCP Server started on http://127.0.0.1:8000
✅ Response: 200
{'message': 'MCP Weather Server is running!'}
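
The cell above waits a fixed eight seconds before checking the server. An alternative is to poll the root URL until it responds. The sketch below uses only the standard library; `wait_for_server` is a hypothetical helper that is not part of the project, and the timeout and interval values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not accepting connections yet; retry after a pause.
        time.sleep(interval)
    return False


# Possible usage after starting the server process:
# if not wait_for_server("http://127.0.0.1:8000/"):
#     raise RuntimeError("MCP server did not start in time")
```

Polling returns as soon as the server is ready and reports a clear failure if it never comes up, whereas a fixed sleep can be either too short or longer than needed.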


## Ping MCP server from Jupyter notebook

### About
Two methods are provided to query the associated MCP server.

### Option 1 Instructions (query the MCP server from this notebook): 
1. Run the below cell.

### Option 2 Instructions (query the MCP server from CLI, PowerShell for Windows):
1. Once the MCP server is running, open a PowerShell terminal (or equivalent). 
2. Navigate to the application root directory, /weatherChatbot.
3. If applicable, start your virtual environment: .venv\Scripts\Activate.ps1

There are two options to run for the CLI Agent: 
- "cli_agent/main.py": this is the same script that runs in Option 1 below. It is a very simple intake from the CLI, with no error handling, conversation, etc.
- "cli_agent/main-enhanced.py": this version of the CLI agent adds an LLM back-end to support a conversation and more robust handling of input variation. A few important notes: the code is written to check first for an OpenAI endpoint and to fall back to a local model (downloading it if necessary), but the local model interaction is very slow and does not provide robust responses. To keep the UX usable, the code has therefore been written to leverage the OpenAI endpoint even if the original key is commented out (see comments in the .env file).

Make a selection and run using the below steps.

4. Run CLI agent with following command: python cli_agent/main.py OR python cli_agent/main-enhanced.py
5. A prompt asking for a city should appear. Enter the desired city (within the US) in the following format: City, State Abbreviation
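
Because the simple CLI agent has no error handling, input that does not follow the City, State Abbreviation format is passed along as typed. As a sketch only, a check like the one below could reject malformed input before it is sent. The helper `normalize_city_input` and its pattern are assumptions for illustration and do not exist in cli_agent.

```python
import re

# Assumed input shape: a city name, a comma, and a two-letter state abbreviation.
CITY_PATTERN = re.compile(r"^\s*([A-Za-z .'-]+?)\s*,\s*([A-Za-z]{2})\s*$")


def normalize_city_input(raw):
    """Return 'City, ST' with an upper-cased state, or raise ValueError on bad input."""
    match = CITY_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"Expected 'City, State Abbreviation', got {raw!r}")
    city, state = match.groups()
    return f"{city}, {state.upper()}"


print(normalize_city_input("New York, NY"))
print(normalize_city_input("  seattle ,wa "))
```

The pattern only checks the shape of the input; it does not verify that the abbreviation is a real US state.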

In [14]:
# Query MCP server from jupyter notebook

# Input city name, with corresponding USW station for reference. Actual station used may differ.
city = "New York, NY" #GHCND:USW00094728

# Obtain response
response = requests.post("http://127.0.0.1:8000/forecast", json={"city": city})
print(response.json())


{'city': 'New York, NY', 'forecast': 'City: New York, NY  \nLatitude: 40.7127281  \nLongitude: -74.0060152  \nNOAA Station ID: GHCND:USW00094728  \n\n| Date                | 2025-06-02 | 2025-06-03 | 2025-06-04 | 2025-06-05 | 2025-06-06 | 2025-06-07 | 2025-06-08 |\n|---------------------|------------|------------|------------|------------|------------|------------|------------|\n| Max Temp (°C)       | 22.0       | 25.5       | 24.5       | 29.5       | 28.7       | 24.0       | 24.6       |\n| Min Temp (°C)       | 12.2       | 13.4       | 15.7       | 19.5       | 21.5       | 19.6       | 18.1       |\n| Precipitation (mm)  | 0.0        | 0.0        | 0.0        | 1.0        | 2.8        | 8.5        | 0.55       |\n| Max Wind Speed (m/s)| 8.8        | 8.2        | 13.3       | 14.8       | 12.4       | 10.7       | 8.9        |\n\nCommentary:\n\n**Historical comparison:**\n- The forecasted maximum temperatures for this week are slightly below the historical mean for early June, in
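
The dictionary printed above is hard to read because the newlines in the forecast are shown as escape sequences. Printing only the forecast string renders the table and commentary as intended. The sketch below assumes the response body has the two keys visible in the output above (city and forecast); `forecast_text` is a hypothetical helper, and the payload is an abbreviated stand-in for the real response.

```python
def forecast_text(payload):
    """Return the Markdown forecast from a /forecast response body, or raise if absent."""
    forecast = payload.get("forecast")
    if not isinstance(forecast, str) or not forecast.strip():
        raise ValueError(f"No forecast text in response; keys were {sorted(payload)}")
    return forecast


# Abbreviated stand-in for `response.json()`; the keys match the output shown above.
example_payload = {
    "city": "New York, NY",
    "forecast": "City: New York, NY  \nLatitude: 40.7127281  \nLongitude: -74.0060152",
}
print(forecast_text(example_payload))
```

In the notebook, the equivalent call would be print(forecast_text(response.json())).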

## Close MCP server

### About
When finished, terminate all processes associated with application.

### Instructions: 
1. Run the cell.

In [14]:
# Terminate server when finished
server.terminate()
server.communicate()
time.sleep(8)  # Give server time to stop
print("🛑 MCP Server stopped.")

🛑 MCP Server stopped.
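
If the server does not exit after terminate(), the process can linger. A more defensive shutdown waits for a bounded time and then escalates to kill(). The sketch below demonstrates this on a stand-in child process rather than the actual uvicorn server; `stop_process` is a hypothetical helper, and the ten-second timeout is an arbitrary assumption.

```python
import subprocess
import sys


def stop_process(proc, timeout=10.0):
    """Terminate `proc`, escalating to kill if it has not exited within `timeout` seconds."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Stand-in child process that would otherwise run for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
print("Stopped with exit code", exit_code)
```

When the process was started with stdout and stderr pipes, as in the start-up cell, calling communicate(timeout=...) in place of wait(timeout=...) also drains the pipes while waiting.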
