# Build Stoplist - Top 50 Words

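This notebook counts word frequencies across the `.txt` files in the `docs` directory and writes the 50 most frequent words, with their counts, to `stoplist_top50.txt`.

**Note:** Only a minimal set of 22 common English stop words (the, is, and, in, a, etc.) is filtered out before counting, so other function words such as "his", "so", and "or" still appear in the results.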


In [17]:
import pathlib, re, collections

DOCS = pathlib.Path("docs")
OUT = pathlib.Path("stoplist_top50.txt")

TOKEN_SPLIT = re.compile(r"[^A-Za-z]+")

# Common English stop words to filter out (minimal set)
STOP_WORDS = {
    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
    'has', 'in', 'is', 'it', 'of', 'on', 'the', 'to',
    'was', 'were', 'with', 'you'
}
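
To make the interaction between `TOKEN_SPLIT` and `STOP_WORDS` concrete, here is a minimal sketch on a made-up sentence. The pattern and the stop-word set are copied from the cell above; the `tokenize` helper and the sample text are assumptions for illustration and are not part of the notebook.

```python
import re

# Same pattern and stop-word set as in the notebook cell above
TOKEN_SPLIT = re.compile(r"[^A-Za-z]+")
STOP_WORDS = {
    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
    'has', 'in', 'is', 'it', 'of', 'on', 'the', 'to',
    'was', 'were', 'with', 'you'
}

def tokenize(text, filter_stopwords=True):
    """Hypothetical helper: lowercase, split on non-letters, optionally drop stop words."""
    tokens = [
        t for t in TOKEN_SPLIT.split(text.lower())
        if t and (len(t) > 1 or t in ("a", "i"))
    ]
    if filter_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

# Illustrative sentence, not taken from the docs directory
sample = "The census of 2020 was archived in the library."
print(tokenize(sample))                          # ['census', 'archived', 'library']
print(tokenize(sample, filter_stopwords=False))  # stop words and content words, digits dropped
```

Digits and punctuation act only as separators, so "2020" never becomes a token.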


In [18]:
def build_stoplist_top50(filter_stopwords=True):
    """Count tokens across docs/*.txt, save the 50 most common to OUT, and return them."""
    counter = collections.Counter()

    files = sorted(DOCS.glob("*.txt"))
    if not files:
        raise SystemExit("No files found in ./docs. Run fetch_wiki_docs.py first.")

    for p in files:
        txt = p.read_text(encoding="utf-8").lower()
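        # drop the leading header line "# TITLE: ..." together with the blank line after it
        # (the text is already lowercased, so the pattern is written in lowercase)
        txt = re.sub(r"^# title: .*?\n\n", "", txt, flags=re.IGNORECASE)
        # split on non-letters; keep tokens longer than one character, plus the words "a" and "i"
        tokens = [t for t in TOKEN_SPLIT.split(txt) if t and (len(t) > 1 or t in ("a", "i"))]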
        
        # Filter out stop words if requested
        if filter_stopwords:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        
        counter.update(tokens)

    top50 = counter.most_common(50)

    # save as tab-separated word/count rows under a one-line header
    with OUT.open("w", encoding="utf-8") as f:
        f.write("Top 50 words (word\tcount)\n")
        for w, c in top50:
            f.write(f"{w}\t{c}\n")
    
    return top50
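
The saved file consists of one header line followed by tab-separated word and count rows, so it can be read back with a few lines of code. The following is a minimal sketch under that assumption; `load_top50` is a hypothetical helper name that does not exist in the notebook, and the demo writes a small temporary file in the same format instead of relying on real output.

```python
import pathlib
import tempfile

def load_top50(path):
    """Hypothetical reader for the format written by build_stoplist_top50."""
    lines = pathlib.Path(path).read_text(encoding="utf-8").splitlines()
    pairs = []
    for line in lines[1:]:  # skip the one-line header
        if not line.strip():
            continue
        word, count = line.split("\t")
        pairs.append((word, int(count)))
    return pairs

# Demo on a temporary file that mimics the output format
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / "stoplist_top50.txt"
    demo.write_text(
        "Top 50 words (word\tcount)\nretrieved\t79\norlandi\t49\n",
        encoding="utf-8",
    )
    loaded = load_top50(demo)

print(loaded)  # [('retrieved', 79), ('orlandi', 49)]
```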


In [19]:
# Show which stop words are being filtered
print(f"Filtering out {len(STOP_WORDS)} common stop words")
print(f"Examples: {', '.join(sorted(STOP_WORDS)[:20])}...")
# Run the function to build the stoplist (with stop word filtering)
top50 = build_stoplist_top50(filter_stopwords=True)


Filtering out 22 common stop words
Examples: a, an, and, are, as, at, be, by, for, from, has, in, is, it, of, on, the, to, was, were...


In [20]:
# Generate and display markdown table
from IPython.display import Markdown, display

markdown_table = "## Top 50 Words\n\n| Rank | Word | Count |\n|------|------|-------|\n"
for i, (w, c) in enumerate(top50, 1):
    markdown_table += f"| {i} | {w} | {c} |\n"

display(Markdown(markdown_table))


## Top 50 Words

| Rank | Word | Count |
|------|------|-------|
| 1 | retrieved | 79 |
| 2 | orlandi | 49 |
| 3 | county | 47 |
| 4 | otto | 43 |
| 5 | music | 36 |
| 6 | his | 32 |
| 7 | dance | 29 |
| 8 | chaplin | 29 |
| 9 | route | 25 |
| 10 | census | 24 |
| 11 | so | 24 |
| 12 | can | 21 |
| 13 | or | 20 |
| 14 | espa | 20 |
| 15 | single | 19 |
| 16 | washington | 19 |
| 17 | season | 19 |
| 18 | library | 18 |
| 19 | we | 18 |
| 20 | rave | 18 |
| 21 | july | 17 |
| 22 | not | 17 |
| 23 | de | 17 |
| 24 | government | 17 |
| 25 | archived | 17 |
| 26 | original | 17 |
| 27 | beatport | 17 |
| 28 | think | 17 |
| 29 | state | 16 |
| 30 | columbia | 16 |
| 31 | walla | 16 |
| 32 | he | 16 |
| 33 | top | 16 |
| 34 | united | 15 |
| 35 | states | 15 |
| 36 | feat | 15 |
| 37 | dancing | 15 |
| 38 | national | 14 |
| 39 | population | 14 |
| 40 | ol | 14 |
| 41 | tracklists | 14 |
| 42 | cerda | 13 |
| 43 | that | 13 |
| 44 | no | 13 |
| 45 | its | 13 |
| 46 | deportivo | 13 |
| 47 | show | 13 |
| 48 | woodstar | 13 |
| 49 | don | 13 |
| 50 | stars | 13 |
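
Some entries in the table, such as `espa` and `ol`, look like word fragments rather than whole words. One plausible cause is that `TOKEN_SPLIT` treats every non-ASCII letter as a separator, so a word like "español" would be cut in two. This has not been checked against the actual documents. The sketch below contrasts the notebook's pattern with a Unicode-aware alternative; `UNICODE_TOKEN` and the sample string are assumptions introduced for illustration.

```python
import re
import unicodedata

ASCII_SPLIT = re.compile(r"[^A-Za-z]+")   # same pattern as the notebook's TOKEN_SPLIT
UNICODE_TOKEN = re.compile(r"[^\W\d_]+")  # hypothetical alternative: runs of word characters
                                          # excluding digits and underscore (roughly, letters)

# Illustrative text; NFC normalization keeps accented letters as single code points
sample = unicodedata.normalize("NFC", "Español y fútbol").lower()

ascii_tokens = [t for t in ASCII_SPLIT.split(sample) if t]
unicode_tokens = UNICODE_TOKEN.findall(sample)

print(ascii_tokens)    # ['espa', 'ol', 'y', 'f', 'tbol']
print(unicode_tokens)  # ['español', 'y', 'fútbol']
```

Switching patterns would change the counts, so the table above would need to be regenerated; whether that is desirable depends on how the stoplist is used downstream.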
