# New word extraction and category dictionary pipeline

This document describes the full pipeline used to build a multilingual agricultural keyword dictionary from a large corpus of farmer questions.

The process consists of four main steps:

1. **New word extraction** from the raw question corpus.  
2. **External classification** of terms into agricultural categories using a GPT-based model.  
3. **Merging JSON outputs** into a single consolidated category dictionary.  
4. **Category assignment** of every question in the corpus using the consolidated dictionary.

---

## Step 1 – New word extraction from the question corpus

The goal of the first step is to scan a very large CSV file with farmer questions, extract word-like tokens, and collect only those words that have **not** been seen in previous runs.

The script in this notebook:

1. **Loads previously seen words**  
   - Reads the cumulative file `words_seen.csv` (if it exists).  
   - The file is expected to contain a single column `word`.  
   - The words are loaded into a Python `set`, optionally lower-cased.  
   - This set is used to skip tokens that were already collected in earlier runs.

2. **Streams the main CSV file in chunks**  
   - Reads `raw_challenge_2_seasonality.csv` using only the `question_content` column.  
   - The file is processed in chunks of `CHUNKSIZE` rows (e.g. 100 000) to avoid memory issues.  
   - Rows without text in `question_content` are dropped.

3. **Tokenises each question**  
   - For each question, the text is optionally converted to lowercase (when `LOWERCASE = True`).  
   - Tokens are extracted using a simple regex pattern `\w+` (letters, digits, underscore).  
   - Tokens shorter than `MIN_WORD_LENGTH` (e.g. `< 2` characters) are discarded.  
   - Only tokens **not present** in `words_seen.csv` are counted.

4. **Counts new words**  
   - A `Counter` object accumulates occurrences of each *new* token across all chunks.  
   - After the entire file is processed, the script reports how many distinct new words were found.

5. **Filters rare words**  
   - The counter is converted to a DataFrame with columns `word` and `count`.  
   - Words that occur fewer than `MIN_COUNT` times (e.g. 2, which drops words seen only once) are removed.  
   - If no words remain after filtering, the script exits.

6. **Updates the cumulative word list**  
   - From the filtered DataFrame, only the `word` column is taken.  
   - If `words_seen.csv` already exists, the new words are appended (no header).  
   - If it does not exist, a new file is created with a header.  
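   - Because only words absent from the file are counted, each word appears at most once in `words_seen.csv`. Words dropped by the `MIN_COUNT` filter are not recorded, so they are counted again in later runs.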

7. **Creates batch files for downstream processing**  
   - The filtered `(word, count)` table is split into batches of size `BATCH_SIZE` (e.g. 3000).  
   - Each batch is saved as a separate CSV file named:  
     `session_batch_001.csv`, `session_batch_002.csv`, etc.  
   - Rows keep the descending-frequency order, so the lowest-numbered batches hold the most frequent **new** words. The batch files are the input for the classification step.



```python
import pandas as pd
from collections import Counter
import re
from pathlib import Path

# === CONFIGURATION ===

INPUT_CSV = "raw_challenge_2_seasonality.csv"
TEXT_COL = "question_content"

# Cumulative file with all words ever seen (append, single column 'word')
CUMULATIVE_OUTPUT = "words_seen.csv"

# Output batches, e.g. session_batch_001.csv, 002, ...
BATCH_PREFIX = "session_batch_"
BATCH_SIZE = 3000

# Number of rows per chunk when reading the large CSV
CHUNKSIZE = 100_000

# Word filters
MIN_WORD_LENGTH = 2   # ignore tokens shorter than this
MIN_COUNT = 2         # drop words that occur only once

LOWERCASE = True

# Simple pattern for a "word"
TOKEN_PATTERN = re.compile(r"\w+", flags=re.UNICODE)


def load_already_seen_words(path: str) -> set:
    """Load words previously collected in earlier runs.

    Expects a CSV file with a 'word' column.
    Returns an empty set if the file does not exist.
    """
    p = Path(path)
    if not p.exists():
        return set()

    # keep_default_na=False stops pandas from turning words such as
    # "null" or "nan" into NaN, which would make them look unseen again
    df = pd.read_csv(p, usecols=["word"], encoding="utf-8", keep_default_na=False)
    words = df["word"].astype(str)
    if LOWERCASE:
        words = words.str.lower()
    return set(words)


def tokenize(text: str):
    """Tokenise text into simple word-like tokens."""
    if not isinstance(text, str):
        return []

    if LOWERCASE:
        text = text.lower()

    tokens = TOKEN_PATTERN.findall(text)
    return [t for t in tokens if len(t) >= MIN_WORD_LENGTH]


def main():
    # 1. Load words from previous runs to skip them
    already_seen = load_already_seen_words(CUMULATIVE_OUTPUT)
    print(f"Loaded {len(already_seen)} words from cumulative file '{CUMULATIVE_OUTPUT}'.")

    counter = Counter()
    total_rows = 0

    # 2. Stream the large CSV file in chunks
    for chunk in pd.read_csv(
        INPUT_CSV,
        sep=";",
        usecols=[TEXT_COL],
        chunksize=CHUNKSIZE,
        encoding="utf-8"
    ):
        total_rows += len(chunk)

        # Drop rows without text
        chunk = chunk.dropna(subset=[TEXT_COL])

        for text in chunk[TEXT_COL]:
            tokens = tokenize(text)
            for tok in tokens:
                if tok in already_seen:
                    continue
                counter[tok] += 1

        print(f"Processed {total_rows} rows...")

    print(f"Number of NEW words (before count filter): {len(counter)}")

    if not counter:
        print("No new words found – everything is already in the cumulative file.")
        return

    # 3. All new words sorted by frequency
    most_common = counter.most_common()
    df_all = pd.DataFrame(most_common, columns=["word", "count"])

    # 4. Filter out words that are too rare
    df_all = df_all[df_all["count"] >= MIN_COUNT].reset_index(drop=True)
    print(f"After dropping words with count < {MIN_COUNT}, {len(df_all)} words remain.")

    if df_all.empty:
        print("No words left to save after applying MIN_COUNT.")
        return

    # 5. Append words to the cumulative file (only the 'word' column)
    df_cum = df_all[["word"]].copy()

    cum_path = Path(CUMULATIVE_OUTPUT)
    if cum_path.exists():
        df_cum.to_csv(cum_path, mode="a", header=False, index=False, encoding="utf-8")
        print(f"Appended {len(df_cum)} words to cumulative file '{CUMULATIVE_OUTPUT}'.")
    else:
        df_cum.to_csv(cum_path, index=False, encoding="utf-8")
        print(f"Created new cumulative file '{CUMULATIVE_OUTPUT}' with {len(df_cum)} words.")

    # 6. Generate word batches of size BATCH_SIZE
    num_words = len(df_all)
    batch_count = 0

    for start in range(0, num_words, BATCH_SIZE):
        batch = df_all.iloc[start:start + BATCH_SIZE]
        batch_count += 1
        batch_filename = f"{BATCH_PREFIX}{batch_count:03d}.csv"
        batch.to_csv(batch_filename, index=False, encoding="utf-8")
        print(f"Saved batch {batch_count} ({len(batch)} words) to '{batch_filename}'.")

    print(f"Done. Created {batch_count} batch files with up to {BATCH_SIZE} words each.")


if __name__ == "__main__":
    main()
```
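
The behaviour of the tokeniser and the seen-word filter is easiest to see on a toy input. The sketch below re-implements the same counting logic in a self-contained form; the sample questions, the `seen` set and the helper name `count_new_words` are made up for illustration and are not taken from the corpus or the notebook.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w+", flags=re.UNICODE)
MIN_WORD_LENGTH = 2
MIN_COUNT = 2


def count_new_words(questions, seen):
    """Count lower-cased tokens that are long enough and not in `seen`."""
    counter = Counter()
    for text in questions:
        for tok in TOKEN_PATTERN.findall(text.lower()):
            if len(tok) >= MIN_WORD_LENGTH and tok not in seen:
                counter[tok] += 1
    return counter


# Hypothetical sample data, not taken from the real corpus
questions = [
    "When should I plant maize?",
    "Maize price at the market today?",
    "Ng'ombe wangu ni mgonjwa",
]
seen = {"when", "should", "the", "at"}

counts = count_new_words(questions, seen)
# Apply the MIN_COUNT filter: only words seen at least twice survive
kept = {w: c for w, c in counts.items() if c >= MIN_COUNT}
print(counts)
print(kept)
```

In this toy run only `maize` survives the `MIN_COUNT` filter. Note also that `\w+` splits on apostrophes, so a word such as `ng'ombe` yields the two tokens `ng` and `ombe`.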



---

## Step 2 – External classification of terms into categories (LLM step)

The batch files from Step 1 (e.g. `session_batch_001.csv`, `session_batch_002.csv`, …) contain lists of high-frequency terms extracted from the corpus.  
These terms are then assigned to agricultural categories by a GPT-based conversational model configured specifically for this task.

Because of input length limits, each ~3000-word batch is processed in smaller chunks of about **250 terms**:

- From each `session_batch_XXX.csv`, a sublist of roughly 250 terms is taken.  
- This list is sent to the GPT-based model, which uses a fixed prompt describing:  
  - the category definitions,  
  - and the required output format (a single JSON object).  
- The model returns **one JSON object** mapping categories to lists of terms.
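
The splitting of a batch into sublists of about 250 terms can be sketched as follows. This is only an illustration: the helper `iter_term_chunks` and the in-memory stand-in for a batch file are assumptions, and the source does not describe how the sublists were actually produced or sent to the model.

```python
import csv
import io

CHUNK_TERMS = 250  # approximate chunk size mentioned above


def iter_term_chunks(csv_file, size=CHUNK_TERMS):
    """Yield lists of up to `size` terms from a batch CSV with a 'word' column."""
    reader = csv.DictReader(csv_file)
    chunk = []
    for row in reader:
        chunk.append(row["word"])
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # last, possibly shorter, chunk
        yield chunk


# Stand-in for one session_batch_XXX.csv file (600 made-up terms)
fake_batch = io.StringIO(
    "word,count\n" + "\n".join(f"term{i},{600 - i}" for i in range(600))
)
chunks = list(iter_term_chunks(fake_batch))
print([len(c) for c in chunks])  # [250, 250, 100]
```

With a real batch, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`.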

The JSON returned by the model obeys these rules:

- Keys are fixed category names:  
  - `planting_growing`  
  - `livestock`  
  - `pests_disease`  
  - `timing_harvest`  
  - `weather`  
  - `market_price`  
- Values are lists of **original terms** assigned to each category (no translation or rewriting).  
- Terms classified as `other` are omitted and do **not** appear in the JSON.  
- Pure numbers and numeric suffixes are ignored.
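
Since the model output is free text, it can be useful to check each response against these rules before saving it. The following is a minimal sketch of such a check, not part of the original notebook; the function `validate_response` and the sample response are hypothetical.

```python
import json

ALLOWED_CATEGORIES = {
    "planting_growing", "livestock", "pests_disease",
    "timing_harvest", "weather", "market_price",
}


def validate_response(raw_json, sent_terms):
    """Return a list of human-readable problems found in one model response."""
    problems = []
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    sent = set(sent_terms)
    for category, terms in data.items():
        if category not in ALLOWED_CATEGORIES:
            problems.append(f"unexpected category: {category}")
            continue
        if not isinstance(terms, list):
            problems.append(f"{category}: value is not a list")
            continue
        for term in terms:
            if not isinstance(term, str):
                problems.append(f"{category}: non-string term {term!r}")
            elif term not in sent:
                # the model must return original terms, not rewrites
                problems.append(f"{category}: term not in input: {term}")
            elif term.isdigit():
                problems.append(f"{category}: pure number: {term}")
    return problems


# Made-up response for illustration
response = '{"livestock": ["ngombe", "cow"], "weather": ["mvua", "2020"], "other": ["hello"]}'
sent_terms = ["ngombe", "mvua", "2020", "hello"]
problems = validate_response(response, sent_terms)
for problem in problems:
    print(problem)
```

Here the check flags `cow` (not among the terms sent), `2020` (a pure number) and the `other` key (which should have been omitted).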

**Privacy note:**  
The model never receives full question texts or any user-level context. Only isolated words or very short expressions extracted from the corpus are sent, so no original questions or sentences are transferred, only a de-contextualised vocabulary of terms.

**Language-independence:**  
Because only tokens are sent (without assuming any specific language), the resulting dictionary is language-agnostic. The model assigns terms to categories based on their agricultural meaning, regardless of whether they are in English, Swahili or local languages.

Each JSON response from the model is saved as a separate file, for example:

```text
words_categories_batch_001.json
words_categories_batch_002.json
words_categories_batch_003.json
…
```

Together, these files form a set of partial dictionaries with term–category assignments.



---

## Step 3 – Merging JSON files into a single category dictionary

The final Python step merges the JSON files produced by the classifier into one consolidated dictionary.

Although the pipeline is capable of processing all files matching  
`words_categories_batch_*.json`, **only the first three batches were intentionally used**:

```text
words_categories_batch_001.json
words_categories_batch_002.json
words_categories_batch_003.json
```

The remaining batches contained words with lower global frequency. These terms were often ambiguous, strongly dialect-specific or appeared only a few times in the corpus, which introduced noise into the final dictionary. To maintain quality and avoid adding unstable or unreliable entries, only the top three files (covering the most frequent and semantically stable terms) were merged.
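
The source does not state how the restriction to three files was applied in practice. One possible way, shown here purely as an assumption, is to sort the matching files by name and keep the first three; `select_batch_files` and `MAX_BATCHES` are hypothetical names.

```python
from pathlib import Path
import tempfile

MAX_BATCHES = 3  # assumption: keep only the first three batch files


def select_batch_files(input_dir, pattern="words_categories_batch_*.json",
                       limit=MAX_BATCHES):
    """Return the first `limit` matching files in name order."""
    return sorted(Path(input_dir).glob(pattern))[:limit]


# Demonstration in a temporary directory with five placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 6):
        path = Path(tmp) / f"words_categories_batch_{i:03d}.json"
        path.write_text("{}", encoding="utf-8")
    selected = [p.name for p in select_batch_files(tmp)]

print(selected)
```

This relies on the zero-padded numbering in the file names, which makes alphabetical order coincide with batch order.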

Each selected JSON file has the form:

```json
{
  "planting_growing": ["term1", "term2", "..."],
  "livestock": ["term3", "..."],
  "pests_disease": ["..."],
  "timing_harvest": ["..."],
  "weather": ["..."],
  "market_price": ["..."]
}
```

The merging logic is implemented in the code cell below. It:

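1. Loads every JSON file in `INPUT_DIR` that matches `FILE_PATTERN`; to restrict the merge to the selected files, only those files should be present in that directory.  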
2. For each category, collects all terms using a Python `set` to ensure uniqueness.  
3. Converts each set into a sorted list.  
4. Writes the final consolidated dictionary to:

```text
words_categories_merged.json
```

The output file contains a clean, deduplicated dictionary of high-frequency agricultural terms, aggregated across the three most reliable classification batches.



```python
import json
from pathlib import Path

# === CONFIGURATION ===
# Directory with words_categories_batch_XXX.json files
INPUT_DIR = Path(".")  # current directory; change if needed

# File name pattern
FILE_PATTERN = "words_categories_batch_*.json"

# Output file
OUTPUT_FILE = Path("words_categories_merged.json")


def merge_word_category_batches(input_dir: Path, pattern: str) -> dict:
    """Merge all JSON files matching the pattern into a single dictionary.

    Each input file is expected to contain a JSON object of the form:
        category -> list of words

    The result is a dictionary:
        category -> sorted list of unique words.
    """
    merged = {}  # category -> set(words)

    json_files = sorted(input_dir.glob(pattern))
    if not json_files:
        raise FileNotFoundError(
            f"No files matching pattern: {pattern} found in {input_dir}"
        )

    print(f"Found {len(json_files)} files to merge.")

    for json_path in json_files:
        print(f"Processing: {json_path}")
        with open(json_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        if not isinstance(data, dict):
            print(f"  [WARNING] {json_path} does not contain a dict – skipping.")
            continue

        for category, words in data.items():
            if not isinstance(words, list):
                print(
                    f"  [WARNING] {json_path} -> key '{category}' is not a list – skipping this key."
                )
                continue

            if category not in merged:
                merged[category] = set()

            merged[category].update(words)

    # convert sets to sorted lists
    merged_as_lists = {cat: sorted(words) for cat, words in merged.items()}
    return merged_as_lists


def main():
    merged = merge_word_category_batches(INPUT_DIR, FILE_PATTERN)

    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)

    print(f"Saved merged dictionary to: {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
```
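
The effect of the merge can be illustrated on two small in-memory dictionaries. The sketch below uses made-up data and a hypothetical helper `merge_category_dicts` that mirrors the set-based logic without reading any files.

```python
def merge_category_dicts(dicts):
    """Union the term lists per category and return sorted, unique lists."""
    merged = {}
    for data in dicts:
        for category, words in data.items():
            merged.setdefault(category, set()).update(words)
    return {cat: sorted(words) for cat, words in merged.items()}


# Made-up partial dictionaries standing in for two batch files
batch_a = {"livestock": ["cow", "goat"], "weather": ["rain"]}
batch_b = {"livestock": ["goat", "hen"], "market_price": ["cow"]}

merged = merge_category_dicts([batch_a, batch_b])
print(merged)
# {'livestock': ['cow', 'goat', 'hen'], 'weather': ['rain'], 'market_price': ['cow']}
```

Duplicates are removed within a category only. If two responses place the same term in different categories (`cow` in this toy example), the term stays in both lists.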



---

## Step 4 – Assigning a single category to every question in the corpus

In the previous steps:

1. High-frequency words were extracted from the large question corpus (`raw_challenge_2_seasonality.csv`).
2. A GPT-based model was used to assign these words to agricultural categories and save the results as JSON batches.
3. Selected JSON files were merged into a consolidated dictionary `words_categories_merged.json`, where each category maps to a list of terms.

The goal of this final step is to **assign exactly one category to every question** in the original CSV file, using that consolidated dictionary.

### What this step does

1. **Loads the consolidated dictionary**

   The file `words_categories_merged.json` is read into memory as a dictionary of the form:

   `category -> list of words`.

2. **Normalises all words in the dictionary**

   Each word is:

   - converted to string,
   - lowercased,
   - stripped of surrounding whitespace,
   - optionally cleaned from a leading `@`.

   For every category, a `set` of normalised words is built. Using sets makes checking membership and intersections efficient.

3. **Defines category priority and ensures `other` is last**

   The category list is taken from the dictionary keys, preserving the original JSON order (which acts as a priority order).  
   The special category `other` is added if missing and always moved to the **end** of the list. This guarantees that:

   - all “regular” categories are checked first,
   - `other` is only used as a fallback if nothing else matches.

4. **Defines the classification function for a single text**

   The function `get_category_for_text(text)`:

   - returns `"other"` if the input is not a string,
   - tokenises the text into `\w+` tokens (Unicode, lowercased),
   - builds a set of tokens found in the text,
   - iterates over categories in the defined priority order (excluding `other`),
   - for each category, checks whether there is **any intersection** between:
     - the set of words associated with the category, and
     - the set of tokens appearing in the text,
   - returns the first category that has at least one matching word,
   - returns `"other"` if no category matches.

   This guarantees **exactly one category per question**, based on the first category that matches any token in the text.

5. **Streams the large CSV file and writes the result**

   The input file:

   - `CSV_INPUT = "raw_challenge_2_seasonality.csv"`  
   - `TEXT_COL = "question_content"`

   is processed in chunks (`CHUNKSIZE` rows at a time), using `sep=";"` and UTF-8 encoding.

   For each chunk:

   - a new column `category` is created by applying `get_category_for_text` to the `question_content` column,
   - only `question_content` and `category` are kept,
   - rows are appended to the output CSV file (`CSV_OUTPUT = "topics_with_categories_one_col.csv"`).

   Any previous version of the output file is deleted before processing starts. A header is written only for the first chunk; subsequent chunks are appended without a header.

At the end of this step, the file `topics_with_categories_one_col.csv` contains:

- `question_content` – the original question text,
- `category` – the assigned category (e.g. `livestock`, `market_price`, `pests_disease`, …, or `other`).


```python
import pandas as pd
import json
import re
from pathlib import Path

# ===== CONFIGURATION =====
CSV_INPUT = "raw_challenge_2_seasonality.csv"
DICT_JSON = "words_categories_merged.json"
CSV_OUTPUT = "topics_with_categories_one_col.csv"

TEXT_COL = "question_content"
CHUNKSIZE = 100_000

# ===== 1. Load dictionary: category -> [words] =====
with open(DICT_JSON, "r", encoding="utf-8") as f:
    # Python 3.7+ preserves key order = order from JSON
    cat_to_words_raw = json.load(f)


def norm_word(w):
    """Normalize a single word from the dictionary."""
    w = str(w).lower().strip()
    if w.startswith("@"):
        w = w[1:]
    return w


# Normalized dictionary: category -> set(words)
normalized_cat_to_words = {}
for cat, words in cat_to_words_raw.items():
    if not isinstance(words, (list, tuple)):
        # if value is not a list/tuple, skip
        continue
    norm_set = set()
    for w in words:
        nw = norm_word(w)
        if nw:
            norm_set.add(nw)
    normalized_cat_to_words[cat] = norm_set

# Category order as in JSON
categories_in_order = list(normalized_cat_to_words.keys())

# Ensure "other" exists and is at the END
if "other" not in categories_in_order:
    categories_in_order.append("other")
else:
    # move "other" to the end
    categories_in_order = [c for c in categories_in_order if c != "other"] + ["other"]

print("Category order (priority):", categories_in_order)

# ===== 2. Function: one category per text =====
token_re = re.compile(r"\w+", flags=re.UNICODE)


def get_category_for_text(text) -> str:
    """Return exactly one category for the given text."""
    # missing / wrong type -> other
    if not isinstance(text, str):
        return "other"

    # set of tokens from text (lowercase)
    tokens = {t.lower() for t in token_re.findall(text)}

    # iterate over categories in priority order
    for cat in categories_in_order:
        if cat == "other":
            # handle "other" at the very end if nothing matches
            continue

        cat_words = normalized_cat_to_words.get(cat, set())
        # if there is any intersection -> choose this category
        if cat_words & tokens:
            return cat

    # nothing matched -> "other"
    return "other"


# ===== 3. Process the CSV in chunks =====
out_path = Path(CSV_OUTPUT)
if out_path.exists():
    out_path.unlink()

header_written = False

for chunk in pd.read_csv(
    CSV_INPUT,
    chunksize=CHUNKSIZE,
    encoding="utf-8",
    on_bad_lines="skip",
    sep=";"      # adjust if your CSV uses a different separator
):
    if TEXT_COL not in chunk.columns:
        raise ValueError(f"Column '{TEXT_COL}' not found in input file!")

    # one category per row
    chunk["category"] = chunk[TEXT_COL].apply(get_category_for_text)

    cols_to_save = [TEXT_COL, "category"]

    chunk[cols_to_save].to_csv(
        out_path,
        mode="a",
        index=False,
        header=not header_written,
        encoding="utf-8"
    )
    header_written = True

print(f"Result saved to: {CSV_OUTPUT}")
```
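
The consequence of the first-match rule is easiest to see on a miniature dictionary. The sketch below is a self-contained illustration with made-up terms; `toy_dict` and `first_matching_category` are hypothetical and only mirror the logic of `get_category_for_text`.

```python
import re

token_re = re.compile(r"\w+", flags=re.UNICODE)

# Made-up miniature dictionary; key order defines the priority
toy_dict = {
    "livestock": {"cow", "ngombe"},
    "market_price": {"price", "bei"},
}


def first_matching_category(text, cat_to_words):
    """Return the first category (in dict order) sharing a token with the text."""
    if not isinstance(text, str):
        return "other"
    tokens = {t.lower() for t in token_re.findall(text)}
    for cat, words in cat_to_words.items():
        if words & tokens:
            return cat
    return "other"


print(first_matching_category("What is the price of a cow?", toy_dict))  # livestock
print(first_matching_category("Bei ya mahindi leo?", toy_dict))          # market_price
print(first_matching_category("Hello", toy_dict))                        # other
print(first_matching_category(None, toy_dict))                           # other
```

The first question mentions both a `livestock` term and a `market_price` term, and it is labelled `livestock` only because that key comes first. Reordering the keys in the dictionary would change the label for such multi-topic questions.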



### Classification quality

To evaluate the quality of dictionary-based categorisation, a manual review of 200 randomly selected questions was performed, covering the major languages present in the dataset (e.g., Swahili, Luganda, Runyankole, English).
Each question was verified against its assigned category and corrected where necessary.

Based on this sample, the estimated overall accuracy of the category assignment is approximately 85%.

This value should be interpreted as an approximation of real-world performance, as some ambiguity remains in multi-topic questions and very short or context-poor inputs.
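
To give a sense of how much uncertainty a 200-question sample carries, a binomial confidence interval can be computed. The sketch below uses the Wilson score interval and assumes that "approximately 85%" corresponds to about 170 correct labels out of 200; that count is an assumption, not a figure reported by the review.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Assumption: about 170 of the 200 reviewed questions were labelled correctly
low, high = wilson_interval(correct=170, total=200)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 79% to 89%, so the sample is compatible with a true accuracy several points above or below 85%.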