# Misra–Gries (Algorithm 3.1)

Purpose: approximately detect **heavy hitters** (frequent items) in a stream, using at most `k-1` counters of memory.

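Procedure, for each incoming item:
1. If the item is already a key, increment its count by 1  
2. Otherwise, if a slot is free (fewer than `k-1` keys), add the item as a new key with count 1  
3. Otherwise, **decrement every counter by 1** and delete any key whose count reaches 0 (the incoming item itself is not stored)  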

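Property: for every item `x` with true frequency `f(x)`, the stored count `c(x)` (taken as 0 when `x` is not a key) satisfies `f(x) - ⌊ n/k ⌋ ≤ c(x) ≤ f(x)`. The counts therefore underestimate by at most `⌊ n/k ⌋`, and every item with `f(x) > n/k` is guaranteed to appear among the output candidates. The output may still contain items that are not frequent, so the candidates need a second pass if exact counts are required.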
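
The bound above can be checked on a small example. The following is a minimal sketch, not part of the original notebook: the helper name `mg_summary` and the toy stream are made up for illustration, and the helper only restates the three steps listed earlier in compact form.

```python
from collections import Counter

def mg_summary(stream, k):
    """Hypothetical compact Misra-Gries summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free slot: decrement every counter, drop zeros, discard x
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# toy stream (made up for illustration): "a" dominates, "b" is second
stream = list("abacabadabaeabafabgb")
k = 4
n = len(stream)
summary = mg_summary(stream, k)
exact = Counter(stream)

# check f(x) - floor(n/k) <= c(x) <= f(x) for every distinct item
bound_holds = all(
    exact[x] - n // k <= summary.get(x, 0) <= exact[x] for x in exact
)
# every item with f(x) > n/k must have survived as a candidate
heavy = {x for x in exact if exact[x] > n / k}
print(bound_holds, heavy <= set(summary))
```

Here `n = 20` and `k = 4`, so the counts may be low by at most 5. Both `a` (9 occurrences) and `b` (6 occurrences) exceed `n/k` and are retained, while a single-occurrence item such as `g` can also remain as a false positive.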


In [None]:

from collections import Counter

def misra_gries(stream, k):
    """Return (counters, n): at most k-1 approximate counts and the stream length."""
    counters = {}
    n = 0
    for x in stream:
        n += 1
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free slot: decrement every counter and drop those that reach 0;
            # the incoming item x itself is not stored
            for j in list(counters):
                counters[j] -= 1
                if counters[j] == 0:
                    del counters[j]
    return counters, n

# demo on chat authors
import pandas as pd
df = pd.read_csv("/mnt/data/Brighton v Man City LIVE Watchalong!_chat_log.csv", encoding="utf-8", engine="python")
stream = df["author"].astype(str).tolist()
k = 20
counters, n = misra_gries(stream, k)

# compare with exact counts
true_counts = Counter(stream)
candidates = {u: true_counts[u] for u in counters.keys()}
top_true = true_counts.most_common(20)

print("n =", n, "unique =", len(true_counts))
print("MG candidates (|T|={}):".format(len(counters)))
# each tuple is (author, MG counter, true count), sorted by true count descending
print(sorted([(u, counters[u], candidates[u]) for u in counters], key=lambda t: -t[2])[:20])
print("\nTop-20 ground truth:")
print(top_true)
