
## Part I – Distributional Hypothesis

| Sentence context clues | Correct word | Why? |
|------------------------|--------------|------|
| “piece of …”, “eat …”, “cut with knife”, “made from milk” | **cheese** | All four collocations are prototypical of *cheese*. *Cake* is not made from milk, and *butter* rarely follows “a piece of”. |
| “parked in driveway”, “bought”, “drive fast”, “wash on weekends” | **car** | Fits every clue; a *motorcycle* is less typically parked in a driveway or washed on weekends. |
| “read”, “enjoy before bed”, “has chapters & cover”, “borrow from library” | **book** | Only *book* unites all four contexts. |

**Summary.** Under the Distributional Hypothesis, words appearing in similar contexts have related meanings. Matching each set of shared contexts selects **cheese**, **car**, and **book** as the correct completions.
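The matching step used in the table can be sketched as a set overlap: each candidate word is described by the contexts it typically occurs in, and the candidate sharing the most contexts with the clues wins. This is only an illustrative sketch. The `contexts` dictionary and the `best_match` helper are hypothetical, and the context sets for *cake* and *butter* are assumptions read off the "Why?" column rather than corpus counts.

```python
# Hypothetical context inventories (assumed, not measured from a corpus):
# the contexts each candidate word typically appears in.
contexts = {
    'cheese': {'piece of', 'eat', 'cut with knife', 'made from milk'},
    'cake':   {'piece of', 'eat', 'cut with knife'},
    'butter': {'eat', 'cut with knife', 'made from milk'},
}


def best_match(clues, contexts):
    """Return the candidate whose context set overlaps most with the clues."""
    return max(contexts, key=lambda word: len(clues & contexts[word]))


clues = {'piece of', 'eat', 'cut with knife', 'made from milk'}
print(best_match(clues, contexts))  # cheese (4 shared contexts vs. 3 and 3)
```

With real data the sets would be replaced by co-occurrence counts, and the overlap by a similarity measure such as cosine similarity.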


In [None]:

import pandas as pd, numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Load dataset
df = pd.read_csv('/content/data_stories_one_shot.csv')

# Map Stage to binary label: Stage 1 = Show (0), Stage 2/3 = Tell (1)
df['Label'] = df['Stage'].apply(lambda x: 0 if x == 1 else 1)

X = df['Sentence'].values
y = df['Label'].values
groups = df['Plot_Name'].values

# TF‑IDF vectoriser with English stop‑words and alphabetic tokens only (≥2 chars)
vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words='english',
                             token_pattern=r'\b[a-z]{2,}\b')

classifiers = {
    'LogReg': LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
    'LinSVM': LinearSVC(class_weight='balanced', random_state=42),
    'MultNB': MultinomialNB(),
    'RandForest': RandomForestClassifier(n_estimators=300,
                                         class_weight='balanced',
                                         random_state=42)
}

def eval_model(pipe, cv, **kwargs):
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy', **kwargs)
    return scores.mean(), scores.std()

# 5‑fold Stratified CV (ignores Plot_Name, so sentences from one plot can fall in both train and test folds)
print("=== 5‑fold Stratified CV ===")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in classifiers.items():
    pipe = make_pipeline(vectorizer, clf)
    mean, std = eval_model(pipe, skf)
    print(f"{name:<9}: {mean:.3f} ± {std:.3f}")

# Leave‑One‑Plot‑Out: with one fold per unique plot, GroupKFold holds out exactly one plot at a time
print("\n=== Leave‑One‑Plot‑Out (GroupKFold) ===")
gkf = GroupKFold(n_splits=len(np.unique(groups)))
for name, clf in classifiers.items():
    pipe = make_pipeline(vectorizer, clf)
    mean, std = eval_model(pipe, gkf, groups=groups)
    print(f"{name:<9}: {mean:.3f} ± {std:.3f}")


=== 5‑fold Stratified CV ===
LogReg   : 0.815 ± 0.075
LinSVM   : 0.800 ± 0.045
MultNB   : 0.777 ± 0.066
RandForest: 0.792 ± 0.062

=== Leave‑One‑Plot‑Out (GroupKFold) ===
LogReg   : 0.804 ± 0.186
LinSVM   : 0.805 ± 0.183
MultNB   : 0.711 ± 0.142
RandForest: 0.642 ± 0.118
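
The leave‑one‑plot‑out block sets `n_splits` of `GroupKFold` to the number of unique plots. A minimal sketch on toy data, checking that this produces the same held‑out sets as scikit‑learn's `LeaveOneGroupOut`, is given below. The group names and the `held_out_sets` helper are hypothetical stand‑ins, not part of the assignment data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Toy stand-in for df['Plot_Name']: three hypothetical plots of unequal size.
toy_groups = np.array(['plot_a', 'plot_a', 'plot_b', 'plot_c', 'plot_c', 'plot_c'])
toy_X = np.zeros((len(toy_groups), 1))  # feature values do not affect the split


def held_out_sets(splitter):
    """Collect the held-out index sets produced by a group-aware splitter."""
    return {frozenset(test_idx.tolist())
            for _, test_idx in splitter.split(toy_X, groups=toy_groups)}


gkf_folds = held_out_sets(GroupKFold(n_splits=len(np.unique(toy_groups))))
logo_folds = held_out_sets(LeaveOneGroupOut())

# Compare as sets, because the two splitters may order their folds differently.
same_folds = gkf_folds == logo_folds
print(same_folds)
```

If the two agree on the real data as well, `LeaveOneGroupOut()` could be passed as `cv` in place of `gkf`, which states the intent more directly.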



### Results & Discussion

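* Linear models (*Logistic Regression* and *Linear SVM*) score highest under both CV schemes (0.815 and 0.800 stratified; 0.804 and 0.805 leave‑one‑plot‑out), echoing Figure 6 of the assignment paper. Under stratified CV their margin over Naïve Bayes (0.777) and Random Forest (0.792) is within one standard deviation, so that ranking is only suggestive.
* Under **leave‑one‑plot‑out** the gap widens: Naïve Bayes drops to 0.711 and Random Forest to 0.642, while the linear models stay near 0.80. The larger fold‑to‑fold spread (up to ±0.186) suggests that some plots differ stylistically from the rest; the linear models generalise better to unseen plots.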
* For the optional “bonus”, try replacing the TF‑IDF step with Sentence‑BERT embeddings (`sentence-transformers` library) and re‑running the CV blocks. The code is modular enough to allow that swap.
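
One possible shape for the Sentence‑BERT swap is sketched below. Because the encoder is pretrained and frozen, the sentences can be embedded once and the classifier cross‑validated on the resulting dense features. The helper names `load_sbert_encoder` and `eval_on_embeddings` are hypothetical, and the model name `all-MiniLM-L6-v2` is an assumed example choice, not something the assignment specifies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def load_sbert_encoder(model_name='all-MiniLM-L6-v2'):
    """Return an encode function; the model name is an assumed example."""
    # Imported lazily so the rest of the sketch runs without the library.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode


def eval_on_embeddings(encode_fn, sentences, labels, cv, clf=None, **kwargs):
    """Embed all sentences once, then cross-validate a classifier on them."""
    features = np.asarray(encode_fn(list(sentences)))
    if clf is None:
        clf = LogisticRegression(max_iter=1000, class_weight='balanced',
                                 random_state=42)
    scores = cross_val_score(clf, features, labels, cv=cv,
                             scoring='accuracy', **kwargs)
    return scores.mean(), scores.std()


# Intended use with the notebook's variables (downloads the model on first call):
# encode = load_sbert_encoder()
# print(eval_on_embeddings(encode, X, y, skf))
# print(eval_on_embeddings(encode, X, y, gkf, groups=groups))
```

Sentence embeddings contain negative values, so `MultinomialNB` cannot be used on them directly; the other three classifiers accept dense real‑valued input.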


