# Lesson 1: Text Basics: Tokenisation, n‑grams, Frequencies

**Goal:** build intuition by tokenising text, making n‑grams, and plotting frequencies on a ~50‑sentence business dataset.



> You'll fill in the `# TODO` parts while sharing your screen.

In [1]:
import pandas as pd
texts = [
    'The new ERP rollout improved our monthly reporting speed by 40 percent.',
    'Customer support response time is slow but the knowledge base is helpful.',
    'Weekly sales increased after the email campaign and discount codes.',
    'Our invoice reconciliation still takes too long despite the automation.',
    'The dashboard is clean, but export to Excel sometimes breaks.',
    'Warehouse picking errors dropped once barcodes were introduced.',
    'Marketing wants a way to tag and search proposals across clients.',
    'Cash flow forecasting is better, although the model drifts each quarter.',
    'We migrated the CRM without issues; user training was the hardest part.',
    'Late supplier deliveries cause stockouts and unhappy customers.',
    'Finance wants to compare actuals vs budget by region on one page.',
    'Legal needs a quick search for similar contract clauses.',
    'The Slack bot that summarizes meetings saves everyone time.',
    'Holiday season traffic overloaded our API rate limits last year.',
    'Managers ask for a weekly PDF with key KPIs and anomalies.',
    'Sales reps complain that leads are stale by the time they call.',
    'The procurement team built a simple approval workflow in Sheets.',
    'Our service desk tickets spike after each product release.',
    'The BI team wants ownership of all production dashboards.',
    'We track NPS monthly but the comments are hard to triage.',
    'Accounting wants to tag expenses with project codes automatically.',
    'The logistics team wants a daily SMS when shipments are delayed.',
    'Retail partners send price lists in different formats every quarter.',
    "The HR portal search is poor; people can't find policies quickly.",
    'Website conversions improved when we simplified the checkout page.',
    'Security wants automated alerts for suspicious logins out of hours.',
    'The pricing model needs to factor in currency volatility.',
    'Suppliers keep changing SKUs which breaks our catalog import.',
    'Revenue operations wants one source of truth for pipeline stages.',
    'The help center needs a better way to surface duplicate questions.',
    'Users asked for dark mode and better keyboard shortcuts.',
    'We should archive old contracts and keep only the latest version.',
    'The forecasting spreadsheet crashes when the dataset exceeds 50k rows.',
    'QA wants a daily digest of failed test cases grouped by component.',
    'The field sales team needs offline access to the product brochure.',
    'The team wants to track trial-to-paid conversion by cohort.',
    'Data entry errors happen when people copy from PDFs into Excel.',
    'Projects get delayed because stakeholders approve documents late.',
    'We need a faster way to find similar customer complaints.',
    'Finance wants bank feeds to reconcile automatically overnight.',
    'The board pack preparation takes three days every month.',
    'The SEO team wants to compare rankings by region and device.',
    'The training team needs a library of reusable lesson templates.',
    'Warehouse staff want to scan returns and print labels in one step.',
    'Customer success wants churn risk scores inside the CRM.',
    'Support needs a smart reply assistant for common questions.',
    'The expense policy changed; employees need an easy explainer.',
    'Legal wants a tool to search for similar indemnity clauses.',
    'IT wants a report of unused SaaS seats to reduce costs.',
    'Partners ask for a portal to check order status in real time.',
]

In [2]:
# Words dropped by tokenize(). The single letters and fragments ('t', 's', 're',
# 'll', 'd') catch what is left of contractions once punctuation is stripped.
STOPWORDS = {
    'the','a','an','and','or','but','to','of','in','on','for','by','with','is','are','was','were',
    'our','we','it','that','this','those','these','as','at','from','without','once','each','across',
    'still','too','vs','one','last','year','sometimes','although','over','into','out','when',
    'every','very','off','after','before','can','t','s','re','ll','d','should','could','would'
}


In [None]:
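import re
from collections import Counter

def clean_text(text: str) -> str:
    """Lowercase, replace every non-letter with a space, collapse whitespace."""
    text = text.lower()
    # Digits, punctuation, hyphens and apostrophes all become spaces,
    # so "can't" turns into "can t" and "50k" into "k".
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize(text: str):
    """Split already-cleaned text on whitespace and drop STOPWORDS."""
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

def make_ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples (empty if len(tokens) < n)."""
    # zip over n copies of the list, each shifted one position further.
    return list(zip(*[tokens[i:] for i in range(n)]))

def top_n(counter: Counter, n=15):
    """Return the n most frequent (item, count) pairs, most frequent first."""
    return counter.most_common(n)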
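
The one-liner in `make_ngrams` is compact enough to be cryptic, so here is a minimal sketch of what it does, step by step. The token list is a made-up toy example, not taken from the cleaned dataset.

```python
# Toy tokens (illustrative only).
tokens = ["weekly", "sales", "increased", "email", "campaign"]
n = 3

# Build n copies of the list, each starting one position later.
shifted = [tokens[i:] for i in range(n)]
for s in shifted:
    print(s)

# zip pairs up the i-th element of every copy and stops at the shortest copy,
# which yields exactly len(tokens) - n + 1 windows.
trigrams = list(zip(*shifted))
print(trigrams)

# A list shorter than n produces no n-grams at all.
too_short = ["sales"]
no_bigrams = list(zip(*[too_short[i:] for i in range(2)]))
print(no_bigrams)
```

Because `zip` stops at the shortest input, no padding or bounds checks are needed.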


In [3]:
import matplotlib.pyplot as plt

def plot_bar(items, title, xlabel, ylabel):
    """Bar chart of (label, count) pairs; n-gram tuples are joined with spaces."""
    labels = [' '.join(k) if isinstance(k, tuple) else k for k, _ in items]
    counts = [v for _, v in items]
    plt.figure()
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.bar(range(len(labels)), counts)
    plt.xticks(range(len(labels)), labels, rotation=45, ha='right')
    plt.tight_layout()
    plt.show()


## 1) DataFrame

In [None]:
df = ...  # TODO: load `texts` into a pandas DataFrame with a single 'text' column
df.head()

## 2) Clean + tokenize

In [None]:
df['tokens'] = ...  # TODO: apply clean_text, then tokenize, to each row of df['text']
df[['text','tokens']].head()
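
Before applying the helpers to the whole column, it helps to see what cleaning does to a single sentence. The sketch below repeats the two regular expressions from `clean_text` so that it runs on its own. `STOPWORDS_EXCERPT` is a hypothetical, shortened stand-in for the lesson's `STOPWORDS`.

```python
import re

# Hypothetical excerpt of the lesson's STOPWORDS, just for this demo.
STOPWORDS_EXCERPT = {"the", "is", "when", "can", "t"}

def clean(text):
    # Same two substitutions as clean_text: non-letters -> space, then collapse.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

raw = "The forecasting spreadsheet crashes when the dataset exceeds 50k rows."
cleaned = clean(raw)
tokens = [t for t in cleaned.split() if t not in STOPWORDS_EXCERPT]

print(cleaned)
print(tokens)

# Hyphens are treated like any other punctuation.
hyphenated = clean("trial-to-paid")
print(hyphenated)
```

Two side effects are worth noticing: digits disappear, so `50k` survives only as the stray token `k`, and hyphenated terms split into separate words.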

## 3) Unigram frequencies + plot

In [None]:
from collections import Counter
all_tokens = ...  # TODO: flatten df['tokens'] into one list, e.g. sum(df['tokens'].tolist(), [])
uni_counts = ...  # TODO: build a Counter from all_tokens
top_uni = top_n(uni_counts, 15)
top_uni[:5]
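
The `sum(..., [])` hint works, but it builds a new list at every step, which gets slow on large corpora. A minimal sketch of an alternative with `itertools.chain.from_iterable`, using made-up token lists rather than the lesson's DataFrame:

```python
from collections import Counter
from itertools import chain

# Toy token lists standing in for df['tokens'] (illustrative only).
token_lists = [
    ["finance", "wants", "bank", "feeds"],
    ["legal", "wants", "tool", "search"],
    ["legal", "needs", "quick", "search"],
]

# Two ways to flatten a list of lists into one list.
flat_sum = sum(token_lists, [])                      # simple, quadratic in total length
flat_chain = list(chain.from_iterable(token_lists))  # linear, same result

# Counter maps each token to its number of occurrences.
counts = Counter(flat_chain)
top3 = counts.most_common(3)

print(flat_sum == flat_chain)
print(counts["wants"], counts["finance"])
print(top3)
```

On a dataset of about 50 sentences either approach is fine. The difference only matters as the corpus grows.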

In [4]:
...  # TODO: call plot_bar on top_uni with a title and axis labels


## 4) Bigrams & trigrams

In [None]:
from collections import Counter
bigram_counts = Counter()
for toks in df['tokens']:
    bigs = ...  # TODO: build the 2-grams from toks with make_ngrams
    bigram_counts.update(bigs)

top_bi = top_n(bigram_counts, 15)
top_bi[:5]
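
The loop above counts n-grams one sentence at a time, not on the flattened token list. The sketch below shows why, on two made-up token lists. The local `ngrams` helper mirrors `make_ngrams` so that the block runs on its own.

```python
from collections import Counter

def ngrams(tokens, n):
    # Same zip-over-shifted-copies idea as make_ngrams.
    return list(zip(*[tokens[i:] for i in range(n)]))

# Toy tokenised sentences (illustrative only).
sentences = [
    ["finance", "wants", "bank", "feeds"],
    ["legal", "wants", "tool"],
]

# Per-sentence counting: windows never span two sentences.
per_sentence = Counter()
for toks in sentences:
    per_sentence.update(ngrams(toks, 2))

# Counting on the flattened list joins the end of one sentence to the next.
flat = [t for toks in sentences for t in toks]
across = Counter(ngrams(flat, 2))

spurious = set(across) - set(per_sentence)
print(len(per_sentence), len(across))
print(spurious)
```

A related caveat: stopwords are removed before the n-grams are built, so a bigram such as `tag search` can join two words that were separated by `and` in the original sentence.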

In [None]:
...  # TODO: call plot_bar on top_bi with a title and axis labels

In [None]:
trigram_counts = Counter()
for toks in df['tokens']:
    tris = ...  # TODO: build the 3-grams from toks with make_ngrams
    trigram_counts.update(tris)

top_tri = top_n(trigram_counts, 15)
top_tri[:5]

In [None]:
...  # TODO: call plot_bar on top_tri with a title and axis labels

## 5) KWIC — Keyword in Context

KWIC prints each occurrence of a keyword together with a few words around it (the *context window*). It lets you inspect *how* a word is used.

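**Task:** implement `kwic(term, window=3)` (a reference version follows) and try words like `wants`, `search`, `weekly`. Matching is done on the cleaned text, so `term` must be a single lowercase word.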

In [None]:
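def kwic(term, window=3):
    """Print every occurrence of `term` in `texts` with up to `window` words on each side."""
    term = term.lower()
    for sent in texts:
        # Clean but do not tokenize(): stopwords stay in, so the context reads naturally.
        cleaned = clean_text(sent)
        toks = cleaned.split()
        for i, t in enumerate(toks):
            if t == term:
                # max(0, ...) keeps the left slice from wrapping around at the sentence start.
                left = ' '.join(toks[max(0, i - window):i])
                right = ' '.join(toks[i + 1:i + 1 + window])
                print(f"... {left} [{t}] {right} ...")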
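
Printing is handy for eyeballing, but returning the matches makes them easier to sort, count, or load into a DataFrame. A minimal sketch of such a variant follows. `kwic_rows` is a hypothetical name, and the function takes the sentences as an argument instead of reading the global `texts`.

```python
import re

def kwic_rows(sentences, term, window=3):
    """Return (left context, keyword, right context) tuples instead of printing."""
    term = term.lower()
    rows = []
    for sent in sentences:
        # Same cleaning idea as clean_text: lowercase, non-letters -> spaces.
        toks = re.sub(r"[^a-z\s]", " ", sent.lower()).split()
        for i, tok in enumerate(toks):
            if tok == term:
                left = " ".join(toks[max(0, i - window):i])
                right = " ".join(toks[i + 1:i + 1 + window])
                rows.append((left, tok, right))
    return rows

# Two sentences from the lesson dataset.
sample = [
    "Legal needs a quick search for similar contract clauses.",
    "The HR portal search is poor; people can't find policies quickly.",
]
rows = kwic_rows(sample, "search", window=2)
for left, kw, right in rows:
    print(f"{left:>12} [{kw}] {right}")
```

The right-aligned left context lines the keywords up in a column, which is the usual way KWIC output is displayed.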
