### **Problem 1) Building a Tokenizer**

**1-1. `preprocessing()`**

This function preprocesses text.

- input: a list of English sentences. ex) ['I go to school.', 'I LIKE pizza!']
- output: the tokenized sentences as a nested list. ex) [['i', 'go', 'to', 'school'], ['i', 'like', 'pizza']]
- Condition 1: Convert each input sentence to lowercase and remove special characters.
- Condition 2: Tokenize on whitespace.
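
The two conditions can be sketched for a single sentence as follows. This is a minimal illustration, not the required solution: `preprocess_sentence` is a hypothetical helper name, and it assumes that "special characters" means anything other than ASCII letters, digits, and whitespace.

```python
import re

# Hypothetical helper: keeps only ASCII letters, digits, and whitespace.
def preprocess_sentence(sentence):
    # Lowercase first, then delete every character outside the allowed set.
    cleaned = re.sub(r"[^a-z0-9\s]", "", sentence.lower())
    # str.split() with no argument splits on any run of whitespace.
    return cleaned.split()

print(preprocess_sentence("I LIKE pizza!"))     # ['i', 'like', 'pizza']
print(preprocess_sentence("It wasn’t cheap."))  # ['it', 'wasnt', 'cheap']
```

Under this assumption, a typographic apostrophe is removed as well, so a contraction such as "wasn’t" collapses into the single token `wasnt`.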
    
    

**1-2. `fit()`**

This function builds the vocabulary.

- input: a list of English sentences. ex) ['I go to school.', 'I LIKE pizza!']
- Condition 1: Tokenize each sentence with the `preprocessing` function defined above.
- Condition 2: Build a vocabulary (`self.word_dict`) that maps each token to an integer index.
    - Use the `self.word_dict` already present in the provided code.
    

**1-3. `transform()`**

This function converts input sentences to integer indices using the vocabulary.

- input: a list of English sentences. ex) ['I go to school.', 'I LIKE pizza!']
- output: the integer-indexed sentences as a nested list. ex) [[1, 2, 3, 4], [1, 5, 6]]
- Condition 1: Words missing from the vocabulary (`self.word_dict`) are converted to the index of 'oov'.
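
One way to express the out-of-vocabulary fallback is `dict.get` with a default value. The sketch below assumes a small toy vocabulary; `encode` is a hypothetical helper, not part of the provided class.

```python
# Assumed toy vocabulary; index 0 is reserved for out-of-vocabulary tokens.
word_dict = {'oov': 0, 'i': 1, 'go': 2, 'to': 3, 'school': 4}

def encode(tokens, vocab):
    # dict.get returns the fallback value when the token is not a key.
    return [vocab.get(token, vocab['oov']) for token in tokens]

print(encode(['i', 'go', 'to', 'work'], word_dict))  # [1, 2, 3, 0]
```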

In [16]:
import re

class Tokenizer():
  def __init__(self):
    self.word_dict = {'oov': 0}
    self.fit_checker = False
  
  def preprocessing(self, sequences):
    result = []
    # Problem 1-1.
    for sentence in sequences:
      # Lowercase, drop everything except letters, digits, Hangul, and whitespace, then split on whitespace.
      result.append(re.sub(r"[^a-zA-Zㄱ-힣0-9\s]", "", sentence.lower()).split())
    return result
  
  def fit(self, sequences):
    self.fit_checker = False
    # Problem 1-2.
    # Reset the vocabulary so that refitting starts from a clean state.
    self.word_dict = {'oov': 0}
    num = 1
    for tokens in self.preprocessing(sequences):
      for token in tokens:
        if token not in self.word_dict:
          # Indices are assigned in order of first appearance.
          self.word_dict[token] = num
          num += 1
    self.fit_checker = True
  
  def transform(self, sequences):
    result = []
    tokens = self.preprocessing(sequences)
    if self.fit_checker:
      # Problem 1-3.
      for sentence in tokens:
        # Tokens missing from the vocabulary fall back to the 'oov' index.
        result.append([self.word_dict.get(token, self.word_dict['oov']) for token in sentence])
      return result
    else:
      raise Exception("Tokenizer instance is not fitted yet.")
      
  def fit_transform(self, sequences):
    self.fit(sequences)
    result = self.transform(sequences)
    return result

In [22]:
example='In the 1600s the Dutch East India Company employed hundreds of ships to trade gold, porcelain, spices, and silks around the globe. But running this massive operation wasn’t cheap. In order to fund their expensive voyages, the company turned to private citizens– individuals who could invest money to support the trip in exchange for a share of the ship’s profits. This practice allowed the company to afford even grander voyages, increasing profits for both themselves and their savvy investors.'.split('.')
example.pop()
for sentence in example:
  print(sentence)

In the 1600s the Dutch East India Company employed hundreds of ships to trade gold, porcelain, spices, and silks around the globe
 But running this massive operation wasn’t cheap
 In order to fund their expensive voyages, the company turned to private citizens– individuals who could invest money to support the trip in exchange for a share of the ship’s profits
 This practice allowed the company to afford even grander voyages, increasing profits for both themselves and their savvy investors


In [17]:
tok=Tokenizer()
tok.preprocessing(example)

[['in',
  'the',
  '1600s',
  'the',
  'dutch',
  'east',
  'india',
  'company',
  'employed',
  'hundreds',
  'of',
  'ships',
  'to',
  'trade',
  'gold',
  'porcelain',
  'spices',
  'and',
  'silks',
  'around',
  'the',
  'globe'],
 ['but', 'running', 'this', 'massive', 'operation', 'wasnt', 'cheap'],
 ['in',
  'order',
  'to',
  'fund',
  'their',
  'expensive',
  'voyages',
  'the',
  'company',
  'turned',
  'to',
  'private',
  'citizens',
  'individuals',
  'who',
  'could',
  'invest',
  'money',
  'to',
  'support',
  'the',
  'trip',
  'in',
  'exchange',
  'for',
  'a',
  'share',
  'of',
  'the',
  'ships',
  'profits'],
 ['this',
  'practice',
  'allowed',
  'the',
  'company',
  'to',
  'afford',
  'even',
  'grander',
  'voyages',
  'increasing',
  'profits',
  'for',
  'both',
  'themselves',
  'and',
  'their',
  'savvy',
  'investors']]

In [19]:
tok.fit_transform(example)

[[1, 2, 3, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20],
 [21, 22, 23, 24, 25, 26, 27],
 [1,
  28,
  12,
  29,
  30,
  31,
  32,
  2,
  7,
  33,
  12,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  12,
  41,
  2,
  42,
  1,
  43,
  44,
  45,
  46,
  10,
  2,
  11,
  47],
 [23, 48, 49, 2, 7, 12, 50, 51, 52, 32, 53, 47, 44, 54, 55, 17, 30, 56, 57]]

In [20]:
tok.word_dict

{'1600s': 3,
 'a': 45,
 'afford': 50,
 'allowed': 49,
 'and': 17,
 'around': 19,
 'both': 54,
 'but': 21,
 'cheap': 27,
 'citizens': 35,
 'company': 7,
 'could': 38,
 'dutch': 4,
 'east': 5,
 'employed': 8,
 'even': 51,
 'exchange': 43,
 'expensive': 31,
 'for': 44,
 'fund': 29,
 'globe': 20,
 'gold': 14,
 'grander': 52,
 'hundreds': 9,
 'in': 1,
 'increasing': 53,
 'india': 6,
 'individuals': 36,
 'invest': 39,
 'investors': 57,
 'massive': 24,
 'money': 40,
 'of': 10,
 'oov': 0,
 'operation': 25,
 'order': 28,
 'porcelain': 15,
 'practice': 48,
 'private': 34,
 'profits': 47,
 'running': 22,
 'savvy': 56,
 'share': 46,
 'ships': 11,
 'silks': 18,
 'spices': 16,
 'support': 41,
 'the': 2,
 'their': 30,
 'themselves': 55,
 'this': 23,
 'to': 12,
 'trade': 13,
 'trip': 42,
 'turned': 33,
 'voyages': 32,
 'wasnt': 26,
 'who': 37}

In [21]:
tok.transform(['I go to school.', 'I LIKE pizza!'])

[[0, 0, 12, 0], [0, 0, 0]]

In [24]:
tok2=Tokenizer()
print(tok2.fit_transform(['I go to school.', 'I LIKE pizza!']))
print(tok2.word_dict)

[[1, 2, 3, 4], [1, 5, 6]]
{'oov': 0, 'i': 1, 'go': 2, 'to': 3, 'school': 4, 'like': 5, 'pizza': 6}


In [45]:
for i in tok2.word_dict.values():
  print(i)

0
1
2
3
4
5
6


### **Problem 2) Building a TfidfVectorizer**

**2-1. `fit()`**

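This function builds the IDF matrix from the input sentences.

- input: a list of English sentences. ex) ['I go to school.', 'I LIKE pizza!']
- Condition 1: The IDF matrix is a list.
    - ex) [IDF value of token 1, IDF value of token 2, .... ]
- Condition 2: Compute each IDF value with the formula below.
    
    $$
    idf(d,t) = \log_e\left(\frac{n}{1+df(d,t)}\right)
    $$
    
    - $df(d,t)$ : the number of sentences d that contain the word t
    - $n$ : the total number of input sentences
- Condition 3: Tokenize the input sentences with the Tokenizer built in Problem 1.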
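
The definition of $df(d,t)$ is easy to confuse with a total occurrence count. The sketch below illustrates the difference on assumed toy data (the variable names and the repeated `'pizza'` token are made up for illustration) and uses the natural logarithm, as the formula specifies.

```python
import math

# Assumed toy input: sentences that are already tokenized.
tokenized = [['i', 'go', 'to', 'school'], ['i', 'like', 'pizza', 'pizza']]
n = len(tokenized)

vocab = sorted({token for sentence in tokenized for token in sentence})

# df counts the sentences that contain the token, so repeated occurrences
# inside one sentence ('pizza' appears twice) still count only once.
df = {token: sum(token in sentence for sentence in tokenized) for token in vocab}
idf = {token: math.log(n / (1 + df[token])) for token in vocab}

print(df['pizza'], df['i'])    # 1 2
print(round(idf['pizza'], 6))  # 0.0
print(round(idf['i'], 6))      # -0.405465
```

Because of the `1 +` in the denominator, a token that appears in every sentence gets a negative IDF, and a token that appears in all but one sentence of a two-sentence corpus gets exactly 0.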
    
    

**2-2. `transform()`**

This function builds the TF-IDF matrix from the input sentences.

- input: a list of English sentences. ex) ['I go to school.', 'I LIKE pizza!']
- output : a nested list.
    
    ex) [[tf-idf(1, 1), tf-idf(1, 2), tf-idf(1, 3)], [tf-idf(2, 1), tf-idf(2, 2), tf-idf(2, 3)]]
    
    |  | Token 1 | Token 2 | Token 3 |
    | --- | --- | --- | --- |
    | Sentence 1 | tf-idf(1,1) | tf-idf(1,2) | tf-idf(1,3) |
    | Sentence 2 | tf-idf(2,1) | tf-idf(2,2) | tf-idf(2,3) |
- Condition 1: Build the TF matrix from the input sentences.
    - $tf(d, t)$ : the number of times the word t appears in sentence d
- Condition 2: Build the TF-IDF matrix from the IDF matrix created in Problem 2-1 (`fit()`) and the formula below.
    
    $$
    \text{tf-idf}(d,t) = tf(d,t) \times idf(d,t)
    $$
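
As a rough sketch of the product above, one row of the TF-IDF matrix can be built by counting tokens and multiplying each count by the matching IDF value. The vocabulary order and the IDF numbers here are illustrative assumptions, and `tfidf_row` is a hypothetical helper.

```python
from collections import Counter

# Assumed inputs: a fixed token order and made-up IDF values.
vocab = ['i', 'go', 'pizza']
idf = [-0.5, 0.0, 0.25]

def tfidf_row(tokens):
    counts = Counter(tokens)  # tf(d, t) for every token in this sentence
    # A Counter returns 0 for tokens that do not occur in the sentence.
    return [counts[token] * weight for token, weight in zip(vocab, idf)]

print(tfidf_row(['i', 'like', 'pizza', 'pizza']))  # [-0.5, 0.0, 0.5]
```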

In [51]:
import numpy as np

class TfidfVectorizer:
  def __init__(self, tokenizer):
    self.tokenizer = tokenizer()
    self.fit_checker = False
  
  def fit(self, sequences):
    tokenized = self.tokenizer.fit_transform(sequences)
    # Problem 2-1.
    n_doc = len(sequences)
    # tf[d][t]: number of times token index t occurs in sentence d.
    tf = np.array([[sentence.count(token) for token in self.tokenizer.word_dict.values()] for sentence in tokenized])
    # df[t]: number of sentences that contain token t (not the total number of occurrences).
    df = np.count_nonzero(tf, axis=0)
    # Natural log, as the IDF formula specifies; stored as a list (Condition 1).
    self.idf = np.log(n_doc / (1 + df)).tolist()

    self.fit_checker = True

  def transform(self, sequences):
    if self.fit_checker:
      tokenized = self.tokenizer.transform(sequences)
      # Problem 2-2.
      tf = np.array([[sentence.count(token) for token in self.tokenizer.word_dict.values()] for sentence in tokenized])
      # Multiply every TF row by the IDF vector and return a nested list.
      self.tfidf_matrix = (tf * np.array(self.idf)).tolist()
      return self.tfidf_matrix
    else:
      raise Exception("TfidfVectorizer instance is not fitted yet.")

  def fit_transform(self, sequences):
    self.fit(sequences)
    return self.transform(sequences)

In [37]:
tfidf=TfidfVectorizer(Tokenizer)
tfidf.fit(example)
matrix=tfidf.transform(example)
import pandas as pd
df=pd.DataFrame(matrix,columns=tfidf.tokenizer.word_dict.values())
df.head()

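|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.287682 | 0.0 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.0 | 0.693147 | 0.693147 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.575364 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 | 0.693147 |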


In [52]:
tfidf2=TfidfVectorizer(Tokenizer)
tfidf2.fit(['I go to school.', 'I LIKE pizza!'])
matrix=tfidf2.transform(['I go to school.', 'I LIKE pizza!'])
import pandas as pd
df2=pd.DataFrame(matrix,columns=tfidf2.tokenizer.word_dict.values())
df2.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | -0.405465 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | -0.405465 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Hand calculation (log denotes the natural logarithm $\log_e$, as in the IDF formula)   
sentence1 : i go to school   
1 2 3 4   
sentence2 : i like pizza   
1 5 6   
oov : 0   
in sentence1,   
tf1 = {0:0, 1:1, 2:1, 3:1, 4:1, 5:0, 6:0}   
in sentence2,   
tf2 = {0:0, 1:1, 2:0, 3:0, 4:0, 5:1, 6:1}   
df = {0:0, 1:2, 2:1, 3:1, 4:1, 5:1, 6:1}   
n (total number of sentences) = 2   
idf = {   
  0:log(2/(1+0))=log2,   
  1:log(2/(1+2))=log(2/3),   
  2:log(2/(1+1))=0,   
  3:log(2/(1+1))=0,   
  4:log(2/(1+1))=0,   
  5:log(2/(1+1))=0,   
  6:log(2/(1+1))=0   
  }   
tf-idf1 = {   
  0:0xlog2=0,   
  1:1xlog(2/3)=log(2/3),   
  2:1x0=0,   
  3:1x0=0,   
  4:1x0=0,   
  5:0x0=0,   
  6:0x0=0   
  }   
tf-idf2 = {   
  0:0xlog2=0,   
  1:1xlog(2/3)=log(2/3),   
  2:0x0=0,   
  3:0x0=0,   
  4:0x0=0,   
  5:1x0=0,   
  6:1x0=0   
  }   
   
log(2/3) ≈ -0.4055 (a base-10 log would instead give about -0.1761)   
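
The hand calculation can be cross-checked with a few lines of plain Python. This is only a sketch under stated assumptions: it works directly on the token indices listed above, the variable names are illustrative, and it uses the natural logarithm that the IDF formula specifies.

```python
import math

# Token-index sequences taken from the hand calculation.
sentences = [[1, 2, 3, 4], [1, 5, 6]]
vocab_indices = range(7)  # index 0 is 'oov'
n = len(sentences)

# Document frequency: number of sentences that contain each index.
df = [sum(index in sentence for sentence in sentences) for index in vocab_indices]
# Natural log, per the IDF formula.
idf = [math.log(n / (1 + count)) for count in df]
tfidf = [[sentence.count(index) * idf[index] for index in vocab_indices]
         for sentence in sentences]

print(df)  # [0, 2, 1, 1, 1, 1, 1]
for row in tfidf:
    print([round(value, 4) for value in row])
# [0.0, -0.4055, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, -0.4055, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Only index 1 (`i`) is non-zero: it is the one token whose IDF is not 0 and that actually occurs in the sentences.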