Classify Kaggle's [SMS Spam dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) using pretrained Word2Vec embedding vectors.<br/>
Apply the method from the session notes that classifies a text by averaging its word embedding vectors.

In [1]:
!pip install gensim --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


In [2]:
import gensim

gensim.__version__

'4.2.0'

In [3]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')



In [4]:
wv['data'][1]


-0.14257812

In [5]:
wv['science'][1]


0.12158203

### 1. Word2Vec and cosine similarity

Compute the cosine similarity between the Word2Vec embeddings of 'data' and 'science', using sklearn's `cosine_similarity`.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(wv['data'].reshape(1,-1), wv['science'].reshape(1,-1))


array([[0.1575913]], dtype=float32)

Report the cosine similarity computed above, rounded to three decimal places.
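As a cross-check on what `cosine_similarity` computes, the sketch below evaluates the same quantity by hand, the dot product divided by the product of the two vector norms, and rounds it to three decimals. The vectors `u` and `v` and the helper `cosine_sim` are hypothetical toy stand-ins, not the 300-dimensional Word2Vec vectors. Applied to the value reported above, the same rounding turns 0.1575913 into 0.158.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; real Word2Vec vectors here have 300 dimensions
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 1.0])

score = cosine_sim(u, v)
print(round(score, 3))  # 4 / (3 * sqrt(5)), rounded to three decimals
```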

In [7]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer

### 2. Text classification

### 1) Data preprocessing

- Read the dataset into a DataFrame, using `encoding = 'latin-1'`
- Drop the unneeded columns
- Preprocess the labels with LabelEncoder
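The steps listed above can be sketched on a tiny in-memory table. The rows below are made up for illustration; only the column layout (`v1`, `v2`, and the empty `Unnamed` columns) follows the Kaggle file. The names `raw`, `clean`, and `encoder` are assumptions of this sketch.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for spam.csv (hypothetical rows, same column layout)
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["see you at lunch", "WIN a free prize now", "ok call me later"],
    "Unnamed: 2": [None, None, None],
    "Unnamed: 3": [None, None, None],
    "Unnamed: 4": [None, None, None],
})

# Keep only the label and text columns; .copy() gives an independent
# DataFrame, so assigning to a column does not raise SettingWithCopyWarning
clean = raw[["v1", "v2"]].copy()

# LabelEncoder assigns integer codes in sorted class order: ham -> 0, spam -> 1
encoder = LabelEncoder()
clean["v1"] = encoder.fit_transform(clean["v1"])

print(list(encoder.classes_))
print(clean["v1"].tolist())
```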

In [8]:
from google.colab import files

file = files.upload()

Saving spam.csv to spam.csv


In [9]:
df = pd.read_csv("spam.csv", encoding = 'latin-1')
df.head(10)


,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
5,spam,FreeMsg Hey there darling it's been 3 week's n...,,,
6,ham,Even my brother is not like to speak with me. ...,,,
7,ham,As per your request 'Melle Melle (Oru Minnamin...,,,
8,spam,WINNER!! As a valued network customer you have...,,,
9,spam,Had your mobile 11 months or more? U R entitle...,,,


In [10]:
dfc = df[['v1','v2']]


In [11]:
dfc['v1'].value_counts()


ham     4825
spam     747
Name: v1, dtype: int64

In [12]:
dfc.head(10)


,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [13]:
enc = LabelEncoder()
# Take an explicit copy first so the column assignment below
# does not trigger pandas' SettingWithCopyWarning
dfc = dfc.copy()
dfc['v1'] = enc.fit_transform(dfc['v1'])


In [14]:
dfc


,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


### 2) Run the text classification

- When splitting the dataset, set the test_size ratio to 15% and `random_state = 42`
- Set the Tokenizer's `num_words = 1000`
- Set `maxlen=150` for pad_sequences
- For training, use `batch_size=64, epochs=10, validation_split=0.2`
- Report the loss and accuracy returned by evaluate in the form [loss, acc]
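To make two of these settings concrete, the sketch below checks the split sizes that a 15% test share produces for 5572 items, and shows what post-padding to a fixed length does to a sequence of token ids. `items` is a placeholder list, and `pad_post` is a hypothetical helper written for illustration, not the Keras function. Its front truncation is assumed here to mirror the Keras default `truncating='pre'`.

```python
from sklearn.model_selection import train_test_split

# 5572 placeholder items stand in for the 5572 messages in the dataset
items = list(range(5572))
train_items, test_items = train_test_split(items, test_size=0.15, random_state=42)
print(len(train_items), len(test_items))

def pad_post(seq, maxlen):
    """Hypothetical helper: keep the last `maxlen` ids, then right-pad with zeros."""
    kept = list(seq)[-maxlen:]
    return kept + [0] * (maxlen - len(kept))

print(pad_post([5, 12, 7], maxlen=6))
```

Index 0 is reserved for padding, which is why the notebook sizes the vocabulary as `len(tokenizer.word_index) + 1`.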

In [15]:
np.random.seed(42)
tf.random.set_seed(42)

In [16]:
target = 'v1'
features = 'v2'
X = dfc[features]
y = dfc[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 42)


In [17]:
print('train_shape: ',X_train.shape)
print('test_shape: ',X_test.shape)

train_shape:  (4736,)
test_shape:  (836,)


In [18]:
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(X_train)

In [19]:
vocab_size = len(tokenizer.word_index) + 1
X_encoded = tokenizer.texts_to_sequences(X_train)

In [20]:
vocab_size

8210

In [21]:
X_train=pad_sequences(X_encoded, maxlen=150, padding='post')

In [22]:
# Note: this matrix is all zeros when it is passed to the Embedding layer below
embedding_matrix = np.zeros((vocab_size, 300))

np.shape(embedding_matrix)

(8210, 300)

In [23]:
model = Sequential()
model.add(Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=150, trainable=False))
model.add(GlobalAveragePooling1D())
model.add(Dense(1, activation='sigmoid'))

In [24]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(X_train,y_train, batch_size=64, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fdaaf17f1d0>

In [25]:
X_test_encoded = tokenizer.texts_to_sequences(X_test)
X_test=pad_sequences(X_test_encoded, maxlen=150, padding='post')

In [26]:
model.evaluate(X_test, y_test)



[0.5296106934547424, 0.8708133697509766]
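One way to read this result: the embedding matrix created in In [22] is still all zeros when the model is built, and the Embedding layer is frozen with `trainable=False`. Every message therefore averages to the zero vector, and the final Dense layer can only output sigmoid(bias), the same probability for every input. The NumPy sketch below reproduces that behaviour with toy sizes and arbitrary weights, all of which are assumptions of the sketch. The reported accuracy of 0.8708 corresponds to 728 of the 836 test messages, which is consistent with predicting the majority class for everything, though the notebook itself does not state this.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim, maxlen = 50, 8, 10       # toy sizes (hypothetical)
embedding = np.zeros((vocab_size, dim))   # same state as the matrix from In [22]

# Two different padded "messages" of token ids
batch = rng.integers(0, vocab_size, size=(2, maxlen))

# Embedding lookup followed by averaging over the sequence axis
pooled = embedding[batch].mean(axis=1)    # shape (2, dim), all zeros

# Dense(1, activation='sigmoid') with arbitrary weights w and bias b
w = rng.normal(size=dim)
b = -1.5
probs = 1.0 / (1.0 + np.exp(-(pooled @ w + b)))
print(probs)  # identical for both messages
```

For the pretrained vectors to have any effect, the matrix would need to be filled with Word2Vec vectors before the model is built.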

### 3) The OOV problem in Word2Vec

```
def get_vector(word):
    """
    Return the embedding vector if the word is in the Word2Vec vocabulary
    """
    if word in wv:
        return wv[word]
    else:
        return None
 
for word, i in tokenizer.word_index.items():
    temp = get_vector(word)
    if temp is not None:
        embedding_matrix[i] = temp
```
Adapt the code above from the Lecture Note to count the OOV (out-of-vocabulary) words
- Reuse the tokenizer fitted above as-is
- Tip: use a dictionary or a Counter
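The Counter approach from the tip might look like the sketch below. The `word_index` dictionary and the `known_words` set are hypothetical stand-ins for `tokenizer.word_index` and the Word2Vec vocabulary. With the real objects, the membership test would be `word in wv`, as in the code above.

```python
from collections import Counter

# Hypothetical stand-ins for tokenizer.word_index and the Word2Vec vocabulary
word_index = {"free": 1, "call": 2, "u": 3, "txt": 4, "wkly": 5}
known_words = {"free", "call", "u"}

# Tally each word as in-vocabulary or out-of-vocabulary
counts = Counter("in_vocab" if word in known_words else "oov" for word in word_index)

# Keep the OOV words themselves for inspection
oov_words = [word for word in word_index if word not in known_words]

print(counts["oov"], oov_words)
```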

In [27]:
oov = []
def get_vector(word):
    """Return the word's embedding vector, or record the word as OOV and return None."""
    if word in wv:
        return wv[word]
    oov.append(word)
    return None

In [28]:
for word, i in tokenizer.word_index.items():
    temp = get_vector(word)
    # Copy the pretrained vector only for in-vocabulary words; OOV rows stay zero
    if temp is not None:
        embedding_matrix[i] = temp

In [29]:
print(len(oov))

2419
