<a href="https://colab.research.google.com/github/soohyoen/artificial-intelligence/blob/main/Copy_of_2_A_cluster_words_class_1_ipynb%EC%9D%98_%EC%82%AC%EB%B3%B8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2-A Lab: Clustering Words with a Co-occurrence Matrix
- author: Eu-Bin KIM
- @likelion
- tlrndk123@gmail.com
- 9th of August 2021

## Overview
### What does `eval_clusters` do?
The goal of this lab is to cluster `target_words` and `target_words_reversed`.
For example, given
```python3
target_words = ["abstraction", "actually", "add"]
target_words_reversed = ["noitcartsba", "yllautca", "dda"]
```
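we first obtain vector representations of these words by building a co-occurrence matrix from a corpus. When we cluster these 6 vector representations, the co-occurrence matrix can be considered good if each reversed word ends up in the same cluster as its original word: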
```
# comat performs very well: accuracy = 3 / 3 = 1
[["abstraction","noitcartsba"],  
 ["actually", "yllautca"],
 ["add", "dda"]]

# comat performs very poorly: accuracy = 0 / 3 = 0
[["abstraction","yllautca"],  
 ["actually", "noitcartsba", "dda"],
 ["add"]]
```
With this pseudo-evaluation, we can measure the quality of the current co-occurrence matrix
quantitatively, as an `accuracy`. The function that computes it is `eval_clusters`.
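
The accuracy described above can be sketched as follows. This is only one possible approach, not the lab's reference solution: the helper name `pair_accuracy` is hypothetical, and it assumes a pair counts as correct only when the word and its reversed form share a cluster.

```python
from typing import List


def pair_accuracy(clusters: List[List[str]],
                  targets: List[str],
                  targets_reversed: List[str]) -> float:
    """Fraction of (word, reversed word) pairs that share a cluster."""
    # map every word to the index of the cluster that contains it
    word2cluster = {
        word: cluster_idx
        for cluster_idx, cluster in enumerate(clusters)
        for word in cluster
    }
    correct = 0
    for target, target_reversed in zip(targets, targets_reversed):
        cluster_idx = word2cluster.get(target)
        # a pair is correct only if both forms were clustered together
        if cluster_idx is not None and cluster_idx == word2cluster.get(target_reversed):
            correct += 1
    return correct / len(targets)


target_words = ["abstraction", "actually", "add"]
target_words_reversed = ["noitcartsba", "yllautca", "dda"]

good = [["abstraction", "noitcartsba"], ["actually", "yllautca"], ["add", "dda"]]
bad = [["abstraction", "yllautca"], ["actually", "noitcartsba", "dda"], ["add"]]

print(pair_accuracy(good, target_words, target_words_reversed))  # 1.0
print(pair_accuracy(bad, target_words, target_words_reversed))   # 0.0
```

Building the word-to-cluster lookup once avoids scanning every cluster for every pair.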

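### What does `reverse_half` do?

To evaluate in this way, we need vector representations of the reversed words. The corpus, however,
contains no reversed words. We therefore reverse half of the occurrences of `target_words` in the corpus. The function that does this is `reverse_half`.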

For example,
```python3
CORPUS = "actually, I actually like the way you actually speak. you actually seem to be a nice person."
```
given a small corpus like this, we reverse half of the occurrences of `actually` in `CORPUS`, turning them into `yllautca`:
```python3
CORPUS = "actually, I yllautca like the way you actually speak. you yllautca seem to be a nice person."
```

If we build a co-occurrence matrix from this preprocessed corpus, we obtain vector representations for both `actually` and `yllautca`. With a good co-occurrence matrix, the two vectors should be very similar (e.g. both co-occur with verbs such as `like` and `speak`).
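
The reversal step can be sketched on the small corpus above. This is a hedged illustration rather than the expected solution to the TODO: the name `reverse_half_sketch` and the `seed` parameter are hypothetical, punctuation is dropped so the corpus splits cleanly on whitespace, and half of the occurrences are sampled per target word, whereas the lab skeleton may sample over all target occurrences at once.

```python
import random
from typing import List


def reverse_half_sketch(words: List[str], targets: List[str], seed: int = 318) -> None:
    """Reverse half of the occurrences of each target word, in place."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    for target in targets:
        # positions at which this target occurs in the corpus
        occ_idxs = [idx for idx, word in enumerate(words) if word == target]
        # pick half of those positions at random, without replacement
        sub_idxs = rng.sample(occ_idxs, len(occ_idxs) // 2)
        for idx in sub_idxs:
            words[idx] = words[idx][::-1]


corpus_words = ("actually i actually like the way you actually speak "
                "you actually seem to be a nice person").split()
reverse_half_sketch(corpus_words, targets=["actually"])
print(corpus_words.count("actually"), corpus_words.count("yllautca"))  # 2 2
```

Which two occurrences get reversed depends on the seed, but the split is always two originals and two reversed forms.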


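### What does `build_count_comat` do?

So what is the co-occurrence matrix mentioned above? It does not differ much from the Document-Term Matrix we already know: when both the documents and the terms are set to the vocabulary extracted from the corpus, the result is called a co-occurrence matrix.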

For example, consider the following sentence:
> Roses are red. bees are yellow.

With a window size of 2, this sentence yields the following bigrams:
```
windows = [(Roses, are), (are, red), (red, bees), (bees, are), (are, yellow)]
```
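
The windows listed above can be reproduced with a plain sliding window. The lab itself calls `nltk.util.ngrams` for this step; `sliding_windows` below is a hypothetical, dependency-free stand-in used only for illustration.

```python
from typing import List, Tuple


def sliding_windows(words: List[str], window_size: int) -> List[Tuple[str, ...]]:
    """All runs of `window_size` consecutive words, in corpus order."""
    return [
        tuple(words[start:start + window_size])
        for start in range(len(words) - window_size + 1)
    ]


words = ["Roses", "are", "red", "bees", "are", "yellow"]
windows = sliding_windows(words, window_size=2)
print(windows)
# [('Roses', 'are'), ('are', 'red'), ('red', 'bees'), ('bees', 'are'), ('are', 'yellow')]
```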

Going through each window, we can then build the following co-occurrence matrix:
```
      |  Roses | are | red | bees | yellow
Roses |    1   |  1  |  0  |  0   |    0
are   |    1   |  1  |  1  |  1   |    1
red   |    0   |  1  |  1  |  1   |    0
bees  |    0   |  1  |  1  |  1   |    0
yellow|    0   |  1  |  0  |  0   |    1
```

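Roses and are co-occur in only one window, so `comat[0, 1]` is 1. are co-occurs with every word in the vocabulary, so every entry of `comat[1, :]` is 1. (The diagonal is shown as 1 in this illustration; `build_count_comat` itself sets the diagonal to zero.)
The function that builds `comat` from the given `windows` in this way is `build_count_comat`.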
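
As a rough sketch of the counting step (one possible approach, not the reference solution to the TODO), the code below counts co-occurrences for the bigram windows of the example sentence. The name `count_comat_sketch` is hypothetical, and nested lists are used in place of an `np.ndarray` to keep the sketch dependency-free. Unlike the illustration, the diagonal is cleared here, as `build_count_comat` does.

```python
from typing import Dict, List, Tuple


def count_comat_sketch(word2idx: Dict[str, int],
                       windows: List[Tuple[str, ...]]) -> List[List[int]]:
    """Count how many windows each pair of vocabulary words shares."""
    size = len(word2idx)
    comat = [[0] * size for _ in range(size)]
    for window in windows:
        for term_1 in window:
            for term_2 in window:
                # skip pairs that involve an out-of-vocabulary word
                if term_1 in word2idx and term_2 in word2idx:
                    comat[word2idx[term_1]][word2idx[term_2]] += 1
    # a word trivially co-occurs with itself, so clear the diagonal
    for idx in range(size):
        comat[idx][idx] = 0
    return comat


word2idx = {"Roses": 0, "are": 1, "red": 2, "bees": 3, "yellow": 4}
windows = [("Roses", "are"), ("are", "red"), ("red", "bees"),
           ("bees", "are"), ("are", "yellow")]
comat = count_comat_sketch(word2idx, windows)
for row in comat:
    print(row)
# [0, 1, 0, 0, 0]
# [1, 0, 1, 1, 1]
# [0, 1, 0, 1, 0]
# [0, 1, 1, 0, 0]
# [0, 1, 0, 0, 0]
```

The result is symmetric because every window contributes both `(term_1, term_2)` and `(term_2, term_1)`.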



In [None]:
!pip3 install nltk
!pip3 install scikit-learn
from typing import List, Optional, Tuple
import nltk
from nltk.corpus import brown, product_reviews_2, stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
import random
import numpy as np
from sklearn.cluster import KMeans
from tqdm import tqdm
from pprint import PrettyPrinter

# --- constants --- #
# the target words to perform clustering on (total of 50 words)
TARGET_WORDS: str = \
"""abstraction
actually
add
address
answer
argument
arguments
back
call
car
case
cdr
computer
course
dictionary
different
evaluator
function
general
got
idea
kind
lambda
machine
mean
object
operator
order
pair
part
particular
pattern
place
problem
process
product
program
reason
register
result
set
simple
structure
system
they
together
using
variable
why
zero"""

BROWN_NAME = "brown"
PR2_NAME = "product_reviews_2"
RAND_STATE = 318  # to be used for k-means clustering
random.seed(RAND_STATE)

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

nltk.download('wordnet')  # for lemmatisation.
nltk.download('stopwords') # for stopwords filtering
nltk.download(BROWN_NAME)
nltk.download(PR2_NAME)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package product_reviews_2 to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/product_reviews_2.zip.


True

In [None]:
# the first 10 words of the brown corpus
print(list(brown.words())[:10])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']


In [None]:
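def reverse_half(words: List[str], targets: List[str]) -> None:
    """
    reverse half of the occurrences of the target words in `words` (in-place)
    :param words: the tokenised corpus; modified in place
    :param targets: the target words whose occurrences are to be reversed
    :return: None
    """
    ### TODO 1 ###
    # Reverse half of the occurrences of the target words in the corpus,
    # e.g. car -> rac.
    # use random.sample()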
    occ_idxs = ...
    sub_idxs = ...
    for idx in sub_idxs:
        words[idx] = "".join(reversed(words[idx]))
    ##############

- Example:
- Input:


In [None]:
class Vocab:
    def __init__(self, words: List[str]):
        # given an index, returns the corresponding word
        self.idx2word = list(set(words))
        # given a word, returns the corresponding index
        self.word2idx = {
            word: idx
            for idx, word in enumerate(self.idx2word)
        }

    def __contains__(self, item: str):
        return item in self.word2idx  # O(1) dict lookup instead of a list scan

    def __len__(self):
        return len(self.idx2word)

In [None]:
# the stemming and lemmatisation helpers are already implemented for you :)
def stem(words: List[str]):
    global stemmer
    for idx, word in enumerate(words):
        words[idx] = stemmer.stem(word)

def lemmatise(words: List[str]):
    global lemmatiser
    for idx, word in enumerate(words):
        words[idx] = lemmatiser.lemmatize(word)

In [None]:
def build_count_comat(vocab: Vocab, windows: List[Tuple[str, ...]]) -> np.ndarray:
    """
    count the frequencies of the co-occurrences.
    # dtm. documents = vocab. terms = vocab.
    :param vocab:
    :param windows:
    :return:
    """

    num_words = len(vocab)
    comat = np.zeros(shape=(num_words, num_words))
    for window in tqdm(windows, desc="building count comat..."):
        for term_1 in window:
            for term_2 in window:
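                ### TODO 2 ###
                # use vocab.word2idx, Python's try/except pattern, or `word in vocab`.
                # If both term_1 and term_2 are in the vocabulary, increment the
                # corresponding co-occurrence count in comat by 1.
                pass  # placeholder so the cell parses; replace with your solution
                ##############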
    # set the diagonal to zero (useless)
    comat[range(num_words), range(num_words)] = 0
    ###########
    return comat

In [None]:
def cluster_target_words(n_clusters: int, tfidf_mat: np.ndarray, vocab: Vocab) -> List[List[str]]:
    """
    Cluster the target words with the K-means algorithm.
    """
    clusters: List[List[str]] = [list() for _ in range(n_clusters)]  # a bucket to collect clusters
    k_means = KMeans(n_clusters=n_clusters, random_state=RAND_STATE)
    result = k_means.fit(tfidf_mat)
    for word_idx, cluster_idx in enumerate(result.labels_):
        word = vocab.idx2word[word_idx]
        # append the word to the cluster
        clusters[cluster_idx].append(word)
    return clusters

In [None]:
def eval_clusters(clusters: List[List[str]], targets: List[str], targets_reversed: List[str]) -> float:
    """
    returns the accuracy.
    :param clusters:
    :param targets
    :param targets_reversed
    :return:
    """
    pairs = list(zip(targets, targets_reversed))
    total = len(pairs)
    correct = 0
    ### TODO 3 ###
    # Check whether each target (before reversal) and its targets_reversed
    # counterpart (after reversal) are in the same cluster; if so, correct += 1.

    ##############
    return correct / total


In [None]:
def run_experiment(corpus_name: str,
                   lower_case: bool,
                   remove_stopwords: bool,
                   norm_mode: Optional[str],
                   window_size: int):
    if corpus_name == BROWN_NAME:
        # corpus reader provided by nltk
        corpus = brown
    elif corpus_name == PR2_NAME:
        corpus = product_reviews_2
    else:
        raise ValueError(f"unknown corpus_name: {corpus_name}")
    targets = TARGET_WORDS.split("\n")
    # all the words in the corpus
    words: List[str] = list(corpus.words())
    # from here on, preprocess the corpus according to the parameters.
    # --- case folding --- #
    if lower_case:
        words = [word.lower() for word in words]
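    # --- stopwords filtering --- #
    if remove_stopwords:
        # build the stopword set once; calling stopwords.words() per word is very slow
        stop_set = set(stopwords.words("english"))
        words = [word for word in words if word not in stop_set]
    # --- normalisation: stemming or lemmatisation --- #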
    if norm_mode == "stem":
        # then.. you must stem the vocab.
        stem(words)  # stem words in-place
        stem(targets)
    elif norm_mode == "lemmatise":
        lemmatise(words)  # lemmatise words in-place
        lemmatise(targets)
    # --- reverse half of the target occurrences (for pseudo-eval) --- #
    reverse_half(words, targets)
    # --- build context windows with ngrams --- #
    windows = list(ngrams(words, window_size))
    # --- build the vocabulary --- #
    targets_reversed = ["".join(reversed(target)) for target in targets]
    vocab = Vocab(words=targets + targets_reversed)
    # --- build a word2word co-occurrence matrix (dtm), where both documents & terms are target words --- #
    comat = build_count_comat(vocab, windows)
    # --- cluster the target words using their co-occurrence count vectors --- #
    n_clusters = len(vocab) // 2
    clusters = cluster_target_words(n_clusters, comat, vocab)
    # --- evaluate the clusters;check if the reversed form is included in the same cluster --- #
    acc = eval_clusters(clusters, targets, targets_reversed)
    # --- report the results --- #
    print("### REPORT ###")
    print("vocab_size:{}, n_clusters:{}, lower_case: {}, remove_stopwords:{}, corpus_name:{}, stem_or_lemmatise:{}, window_size:{}"
          .format(len(vocab), n_clusters, lower_case, remove_stopwords, corpus_name, norm_mode, window_size))
    pprinter = PrettyPrinter(compact=True)
    pprinter.pprint(clusters)
    print("accuracy:", acc)

In [None]:
  
  run_experiment(corpus_name=BROWN_NAME, lower_case=False, remove_stopwords=False, norm_mode=None,  window_size=8)
  # the effect of lemmatisation on the performance
  run_experiment(corpus_name=BROWN_NAME, lower_case=False, remove_stopwords=False, norm_mode="lemmatise", window_size=8)
  # the effect of stemming on the performance
  run_experiment(corpus_name=BROWN_NAME, lower_case=False, remove_stopwords=False,  norm_mode="stem", window_size=8)  # 이게 아마도 성능이 제일 좋을 겁니다.
  # the effect of case folding on the performance
  run_experiment(corpus_name=BROWN_NAME,  lower_case=True, remove_stopwords=False, norm_mode="stem", window_size=8)
  # the effect of switching corpus on the performance
  run_experiment(corpus_name=PR2_NAME,  lower_case=False, remove_stopwords=False, norm_mode="stem", window_size=8)
  # the effect of removing stopwords
  run_experiment(corpus_name=BROWN_NAME,  lower_case=False, remove_stopwords=True, norm_mode="stem", window_size=8)

You should get results similar to the following (your exact results may differ):
```
building count comat...: 100%|██████████| 1161185/1161185 [00:09<00:00, 119558.38it/s]
### REPORT ###
vocab_size:100, n_clusters:50, lower_case: False, remove_stopwords:False, corpus_name:brown, stem_or_lemmatise:None, window_size:8
[['redro'], ['yeht'], ['they'], ['program', 'yllautca'], ['rehtegot'],
 ['rotarepo', 'elpmis'], ['yhw'], ['reason'], ['llac'], ['got'], ['kcab'],
 ['tnereffid'], ['esruoc'], ['tes'], ['metsys'], ['different'], ['margorp'],
 ['back'], ['system'], ['course'], ['rac'], ['why'], ['kind'], ['nosaer'],
 ['together'], ['place'], ['tog'], ['ecalp'], ['ralucitrap'], ['order'],
 ['part'], ['aedi'],
 ['machine', 'retsiger', 'operator', 'computer', 'stnemugra', 'enihcam',
  'tcejbo', 'dda', 'arguments', 'variable', 'cdr', 'sserdda', 'address',
  'tcudorp', 'gnisu', 'noitcartsba', 'rdc', 'pair', 'abstraction', 'evaluator',
  'register', 'elbairav', 'retupmoc', 'yranoitcid', 'dictionary', 'zero',
  'structure', 'erutcurts', 'rotaulave', 'adbmal', 'nrettap', 'orez',
  'function', 'product', 'result', 'lambda', 'noitcnuf', 'riap', 'melborp',
  'argument'],
 ['set'], ['dnik'], ['object'], ['ssecorp'], ['trap'], ['problem'], ['process'],
 ['tluser'], ['lareneg'], ['call', 'actually', 'esac', 'add'], ['car'],
 ['using'], ['particular'],
 ['naem', 'case', 'pattern', 'tnemugra', 'answer', 'general', 'idea'],
 ['simple'], ['mean'], ['rewsna']]
accuracy: 0.32
building count comat...: 100%|██████████| 1161185/1161185 [00:09<00:00, 120675.27it/s]
### REPORT ###
vocab_size:98, n_clusters:49, lower_case: False, remove_stopwords:False, corpus_name:brown, stem_or_lemmatise:lemmatise, window_size:8
[['rac'], ['yeht'],
 ['machine', 'rotarepo', 'retsiger', 'operator', 'computer', 'enihcam', 'dda',
  'simple', 'pattern', 'actually', 'variable', 'cdr', 'sserdda', 'address',
  'tcudorp', 'gnisu', 'noitcartsba', 'rdc', 'pair', 'abstraction', 'evaluator',
  'object', 'register', 'using', 'elbairav', 'retupmoc', 'yranoitcid',
  'dictionary', 'tnemugra', 'zero', 'structure', 'erutcurts', 'rotaulave',
  'adbmal', 'orez', 'function', 'product', 'lambda', 'noitcnuf', 'riap',
  'argument'],
 ['they'], ['case'], ['why'], ['tes'], ['esruoc'], ['got'], ['reason'],
 ['back'], ['kcab'], ['car'], ['tog'], ['aedi'],
 ['call', 'tcejbo', 'redro', 'yllautca'], ['rehtegot', 'together'], ['system'],
 ['different'], ['nosaer'], ['tnereffid'], ['melborp'], ['order'], ['metsys'],
 ['kind'], ['problem'], ['set'], ['idea'], ['place'],
 ['answer', 'rewsna', 'tluser'], ['ecalp'], ['part'], ['margorp'], ['process'],
 ['yhw'], ['mean'], ['dnik'], ['naem'], ['ralucitrap'], ['elpmis'],
 ['esac', 'particular', 'add', 'llac'], ['trap'], ['lareneg'], ['ssecorp'],
 ['result'], ['program'], ['nrettap'], ['course'], ['general']]
accuracy: 0.42
building count comat...: 100%|██████████| 1161185/1161185 [00:09<00:00, 120763.07it/s]
### REPORT ###
vocab_size:98, n_clusters:49, lower_case: False, remove_stopwords:False, corpus_name:brown, stem_or_lemmatise:stem, window_size:8
[['tes'], ['yeht'], ['they'], ['ulave', 'dda', 'answer', 'ralucitrap', 'lpmis'],
 ['reason'], ['esu'], ['tog', 'htegot'], ['sruoc', 'dnik'], ['whi'],
 ['set', 'idea'], ['reffid'],
 ['case', 'pattern', 'comput', 'tnemugra', 'structur', 'nrettap', 'evalu',
  'simpl', 'rac', 'tupmoc'],
 ['use'], ['result', 'tluser'], ['program', 'trap'], ['gener'], ['naem'],
 ['reneg'], ['problem'], ['call'], ['differ'], ['system'], ['kcab'], ['ihw'],
 ['redro', 'esac', 'particular', 'rewsna', 'actual'], ['metsys'], ['oper'],
 ['back'], ['process', 'product'], ['object'], ['function', 'noitcnuf'],
 ['ssecorp'], ['place'], ['cours', 'lautca'], ['part'], ['aedi'],
 ['got', 'togeth', 'llac'], ['car'], ['mean'], ['tcejbo'], ['margorp'],
 ['ecalp'], ['kind'], ['repo'],
 ['machin', 'tsiger', 'regist', 'variabl', 'cdr', 'sserdda', 'abstract',
  'address', 'rdc', 'pair', 'tcartsba', 'iranoitcid', 'rutcurts', 'nihcam',
  'zero', 'adbmal', 'add', 'orez', 'lbairav', 'lambda', 'riap', 'argument',
  'dictionari'],
 ['melborp'], ['tcudorp'], ['nosaer'], ['order']]
accuracy: 0.28
building count comat...: 100%|██████████| 1161185/1161185 [00:09<00:00, 120562.30it/s]
### REPORT ###
vocab_size:98, n_clusters:49, lower_case: True, remove_stopwords:False, corpus_name:brown, stem_or_lemmatise:stem, window_size:8
[['machin', 'variabl', 'esac', 'nihcam', 'lbairav', 'simpl'], ['yeht'],
 ['they'], ['htegot', 'togeth'], ['use'], ['program'], ['case', 'object'],
 ['nosaer'], ['ihw'], ['aedi', 'ralucitrap', 'lpmis', 'rewsna', 'dnik'],
 ['gener'], ['mean'], ['got', 'tog'], ['back'],
 ['pattern', 'particular', 'rutcurts', 'structur', 'idea', 'nrettap', 'result'],
 ['differ'], ['cours', 'redro'], ['reneg'], ['esu'], ['tluser'],
 ['tsiger', 'regist', 'ulave', 'dda', 'cdr', 'sserdda', 'abstract', 'address',
  'rdc', 'pair', 'tcartsba', 'iranoitcid', 'comput', 'answer', 'tnemugra',
  'zero', 'adbmal', 'add', 'evalu', 'orez', 'lambda', 'noitcnuf', 'riap',
  'argument', 'rac', 'tupmoc', 'dictionari'],
 ['melborp'], ['reffid'], ['metsys'], ['kind'], ['whi'], ['set', 'place'],
 ['oper'], ['part'], ['margorp', 'process'], ['tcudorp'], ['trap'], ['naem'],
 ['kcab'], ['order'], ['call'], ['reason'], ['ssecorp'],
 ['sruoc', 'lautca', 'actual'], ['car'], ['repo'], ['product'], ['llac'],
 ['problem'], ['tcejbo'], ['system'], ['tes'], ['ecalp'], ['function']]
accuracy: 0.4
building count comat...: 100%|██████████| 84612/84612 [00:00<00:00, 119893.90it/s]
### REPORT ###
vocab_size:98, n_clusters:49, lower_case: False, remove_stopwords:False, corpus_name:product_reviews_2, stem_or_lemmatise:stem, window_size:8
[['simpl'], ['redro', 'process', 'answer'],
 ['cours', 'machin', 'tsiger', 'regist', 'aedi', 'ulave', 'tcejbo', 'dda',
  'pattern', 'cdr', 'abstract', 'address', 'rdc', 'pair', 'tcartsba', 'object',
  'particular', 'ecalp', 'htegot', 'iranoitcid', 'rutcurts', 'tnemugra',
  'structur', 'zero', 'idea', 'reneg', 'adbmal', 'nrettap', 'evalu', 'orez',
  'lbairav', 'result', 'dnik', 'lambda', 'part', 'noitcnuf', 'riap', 'order',
  'argument', 'dictionari', 'tluser'],
 ['esu'], ['use'], ['repo'], ['they'], ['yeht'], ['llac'], ['actual'],
 ['problem'], ['tes'], ['tcudorp'], ['system'], ['call'], ['melborp'],
 ['comput'], ['mean', 'case', 'car', 'nihcam', 'function'], ['metsys'], ['set'],
 ['tupmoc'], ['program'], ['ihw'], ['margorp'], ['kcab'], ['product'], ['got'],
 ['oper'], ['back'], ['differ'], ['nosaer'], ['whi'], ['lautca'], ['sruoc'],
 ['tog'], ['kind'], ['add'], ['ssecorp'], ['rac'], ['esac'], ['sserdda'],
 ['naem', 'togeth', 'trap', 'rewsna'], ['reffid'], ['gener'], ['ralucitrap'],
 ['lpmis'], ['reason'], ['place'], ['variabl']]
accuracy: 0.3
building count comat...: 100%|██████████| 727494/727494 [00:06<00:00, 120995.04it/s]
### REPORT ###
vocab_size:98, n_clusters:49, lower_case: False, remove_stopwords:True, corpus_name:brown, stem_or_lemmatise:stem, window_size:8
[['ihw', 'tsiger', 'regist', 'variabl', 'ulave', 'dda', 'cdr', 'sserdda',
  'abstract', 'whi', 'rdc', 'pair', 'tcartsba', 'iranoitcid', 'answer',
  'tnemugra', 'ralucitrap', 'structur', 'zero', 'adbmal', 'add', 'orez',
  'lpmis', 'rewsna', 'lbairav', 'lambda', 'riap', 'simpl', 'argument', 'tupmoc',
  'dictionari'],
 ['reneg'], ['melborp'], ['set'], ['tluser'], ['got', 'car', 'tog'], ['esu'],
 ['use'], ['metsys'], ['kcab'], ['gener'], ['repo'], ['ecalp'],
 ['sruoc', 'esac'], ['object'], ['program'], ['reffid'], ['system'],
 ['ssecorp'], ['mean'], ['kind', 'case', 'reason'], ['llac'], ['differ'],
 ['trap'], ['process'], ['yeht'], ['result'], ['noitcnuf'], ['dnik'], ['rac'],
 ['order'], ['product'], ['redro'], ['margorp'], ['tcudorp'], ['naem'],
 ['problem'], ['tes'], ['part'], ['call'], ['back'], ['they'],
 ['machin', 'aedi', 'pattern', 'address', 'particular', 'rutcurts', 'comput',
  'lautca', 'nihcam', 'nrettap', 'evalu', 'actual', 'nosaer'],
 ['idea'], ['oper'], ['htegot', 'togeth'], ['cours'], ['tcejbo', 'function'],
 ['place']]
accuracy: 0.38
```

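## Answer the following questions.

>  How does performance differ with and without stemming? Why?

>  How does performance differ with and without case folding? Why?

>  How does performance differ between the BROWN corpus and the other corpus? Why?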