# File Description: Runyankore Word Frequency Analysis

This notebook is used to build structured lists of words appearing in the `question_content` column of `questions_nyn.csv`.  
The process is designed to first identify the most frequently occurring vocabulary, then filter out non-informative tokens (stop-words), and finally extract more informative words suitable for constructing a classification dictionary.

In the first stage, a list of the 3000 most frequent words is generated.  
This list is reviewed outside the notebook (in ChatGPT), where the file `nyn_non_category_words.csv` is created to store words not associated with any thematic category.  
In the second stage, these stop-words are filtered out while the frequencies are recomputed, and the 5000 most frequent remaining words are extracted as a second, more informative list.

### Stage 1 – Generate initial frequency list (top 3000 words)

In the first stage:

1. The dataset is loaded from `questions_nyn.csv`.  
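2. The text is cleaned (lowercased, every character other than a letter or whitespace replaced with a space, spacing normalised).  
3. Each question is tokenised on whitespace and reduced to its set of unique tokens, so that each word is counted at most once per question; tokens shorter than three characters are discarded.  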
4. For each word, the number of questions in which it appears is counted.  
5. The 3000 most frequent words are selected and saved to `nyn_top_words.csv`.
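
To make steps 2 to 4 concrete, here is a minimal sketch on three made-up strings. The sample texts are English placeholders, not taken from `questions_nyn.csv`. It applies the same cleaning regex as the notebook and contrasts per-question counting with raw occurrence counting.

```python
import re
from collections import Counter

def clean(text):
    # Same cleaning logic as the notebook: lowercase, keep letters and spaces.
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z\u00C0-\u024F\u1E00-\u1EFF\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample questions (placeholders, not real data).
samples = [
    "Maize maize MAIZE: price?",
    "Maize seeds, 2 kg!",
    "Bean seeds",
]

raw_counts = Counter()        # every occurrence counts
question_counts = Counter()   # at most one count per question

for text in samples:
    tokens = [w for w in clean(text).split() if len(w) > 2]
    raw_counts.update(tokens)
    question_counts.update(set(tokens))

print(raw_counts["maize"])        # 4
print(question_counts["maize"])   # 2
print("kg" in question_counts)    # False (shorter than three characters)
```

Counting per question keeps a single repetitive question from inflating a word's rank, which is why the notebook converts each token list to a set before counting.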

In [None]:
import pandas as pd
import re
from collections import Counter
import csv

def generate_top_nyn(n_top=3000):
    # Load entire file
    df = pd.read_csv("questions_nyn.csv")

    def clean(text):
        text = str(text).lower()
        # only letters and spaces - no dots or other characters
        text = re.sub(r"[^a-zA-Z\u00C0-\u024F\u1E00-\u1EFF\s]", " ", text)
        text = re.sub(r"\s+", " ", text).strip()
        return text

    df["clean"] = df["question_content"].apply(clean)

    words = Counter()

    # we count words max once per question
    for row in df["clean"]:
        tokens = set(row.split())           # <- set = no repetition in one question
        for w in tokens:
            if len(w) > 2:                  # skip tokens of one or two characters
                words[w] += 1               # number of QUESTIONS containing this word

    # TOP n_top words
    top_words = words.most_common(n_top)

    with open("nyn_top_words.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "question_count"])
        writer.writerows(top_words)

    return f"Saved {len(top_words)} most frequent words to 'nyn_top_words.csv'."

msg = generate_top_nyn(3000)
print(msg)

Saved 3000 most frequent words to 'nyn_top_words.csv'.


### Intermediate step – Stop-word selection (ChatGPT)

After the file `nyn_top_words.csv` is generated, an intermediate step is performed outside the notebook:

1. The list of the 3000 most frequent words is analysed in ChatGPT.  
2. Words lacking thematic relevance (pronouns, particles, fillers, purely grammatical forms, etc.) are identified.  
3. These non-informative tokens are saved in `nyn_non_category_words.csv`, which serves as the stop-word list.

The purpose of this step is to reduce high-frequency but low-information words before generating the next vocabulary list.
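
Because the stop-word list is produced outside the notebook, a quick sanity check before Stage 2 may be worthwhile. The sketch below is an assumption and not part of the original workflow: the helper `check_stop_words` is hypothetical. It reports stop-words that never occur in the Stage 1 list (possible typos or invented words) and the share of the Stage 1 list that was flagged.

```python
def check_stop_words(top_words, stop_words):
    """Compare a stop-word list against the Stage 1 vocabulary.

    top_words  -- iterable of words from the Stage 1 list
    stop_words -- iterable of words flagged as non-informative
    """
    top = {str(w).strip().lower() for w in top_words}
    stop = {str(w).strip().lower() for w in stop_words}

    unknown = sorted(stop - top)  # flagged words absent from the top list
    flagged_share = len(stop & top) / len(top) if top else 0.0
    return unknown, flagged_share

# Toy illustration with placeholder words (not real data).
unknown, share = check_stop_words(
    top_words=["alpha", "beta", "gamma", "delta"],
    stop_words=["Alpha", " beta", "omega"],
)
print(unknown)  # ['omega']
print(share)    # 0.5
```

In the notebook, `top_words` could be taken from the `word` column of `nyn_top_words.csv`, and `stop_words` from the contents of `nyn_non_category_words.csv`.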

### Stage 2 – Apply stop-words and extract next 5000 informative words

In the second stage:

1. The file `questions_nyn.csv` is loaded again and cleaned using the same function as in Stage 1.  
2. The stop-word list in `nyn_non_category_words.csv` is applied.  
3. All stop-words are removed from the question texts.  
4. For the remaining tokens, the number of questions containing each word is recalculated (each word counted at most once per question).  
5. The 5000 most frequent remaining words are saved to `nyn_next5000_words.csv`.

The resulting list contains vocabulary with higher informational value, suitable for constructing a classification dictionary.
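
Since both stages clean and count in the same way, the logic could be factored into one helper that takes an optional stop-word set. This is only a sketch under that assumption: the function name `question_frequency` and its parameters are illustrative and do not appear in the notebook, and the texts are English placeholders.

```python
from collections import Counter

def question_frequency(cleaned_texts, stop_words=frozenset(), min_len=3):
    """Count, for each word, the number of texts that contain it."""
    counts = Counter()
    for text in cleaned_texts:
        # set() ensures a word is counted at most once per text
        for word in set(text.split()):
            if len(word) >= min_len and word not in stop_words:
                counts[word] += 1
    return counts

# Placeholder texts, already cleaned (not real data).
texts = ["the cow the milk", "the goat milk", "the cow"]

stage1 = question_frequency(texts)                      # no filter
stage2 = question_frequency(texts, stop_words={"the"})  # with stop-words

print(stage1.most_common(1))   # [('the', 3)]
print(sorted(stage2.items()))  # [('cow', 2), ('goat', 1), ('milk', 2)]
```

With this shape, Stage 2 is Stage 1 plus a filter, so a change to the cleaning or counting rules would only need to be made in one place.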

In [None]:
import pandas as pd
import re
from collections import Counter

def extract_next_5000_words():
    # ----- 1. Load full questions -----
    df = pd.read_csv("questions_nyn.csv")   # adjust path if needed

    # Cleaning function – same logic as before
    def clean(text):
        text = str(text).lower()
        # keep only letters and spaces
        text = re.sub(r"[^a-zA-Z\u00C0-\u024F\u1E00-\u1EFF\s]", " ", text)
        text = re.sub(r"\s+", " ", text).strip()
        return text

    df["clean"] = df["question_content"].apply(clean)

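    # ----- 2. Load words to ignore (from nyn_non_category_words.csv) -----
    # Note: .iloc[0] reads only the first row, so this assumes the file stores
    # all stop-words as a single comma-separated row with no header.
    ignore_df = pd.read_csv("nyn_non_category_words.csv", header=None)
    ignore_words = (
        ignore_df.iloc[0]
        .dropna()
        .astype(str)
        .str.strip()   # drop stray spaces around words (e.g. after commas)
        .str.lower()
        .tolist()
    )
    ignore_words = set(ignore_words)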

    # ----- 3. Count word frequencies on full corpus (one count per question) -----
    freq = Counter()

    for row in df["clean"]:
        tokens = set(row.split())  # count each word at most once per question
        for w in tokens:
            if len(w) <= 2:
                continue
            if w in ignore_words:
                continue
            freq[w] += 1

    # ----- 4. Build DataFrame, sort and take the 5000 most frequent remaining words -----
    freq_df = pd.DataFrame(
        [(w, c) for w, c in freq.items()],
        columns=["word", "question_count"]
    ).sort_values("question_count", ascending=False)

    next_5000 = freq_df.head(5000)

    output_file = "nyn_next5000_words.csv"
    next_5000.to_csv(output_file, index=False, encoding="utf-8")

    # Report the actual number of rows written rather than a hard-coded 5000
    print("✅ Done. Saved", len(next_5000), "additional words to:", output_file)
    print("Total unique words considered (after filtering):", len(freq_df))

extract_next_5000_words()


✅ Done. Saved 5000 additional words to: nyn_next5000_words.csv
Total unique words considered (after filtering): 326910
