# Morphological Analysis

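## [Morpheme](https://ko.wikipedia.org/wiki/%ED%98%95%ED%83%9C%EC%86%8C)
In linguistics, a morpheme is (under the usual definition) the smallest unit of language that carries meaning and can be separated out within an utterance.    
In other words, it is a unit that loses its meaning if it is analyzed any further.     
  
Example: 한나가 책을 보았다. ("Hanna read a book.")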

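### Classification by meaning/function
- Lexical morphemes (실질형태소)
  > Morphemes with lexical meaning: they refer to an object, state, or action. Nouns, verbs, adjectives, and adverbs generally belong here.    
  > In the example above: "한나", "책", "보".  
- Grammatical morphemes (형식형태소)
  > Morphemes with grammatical meaning: they attach to lexical morphemes and express the relationships between them. In Korean, particles (조사) and endings (어미) belong here.   
  > In the example above: "가", "을", "았", "다".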

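### Classification by dependence
- Free morphemes (자립형태소)
    > Morphemes that can form an eojeol (a space-delimited unit) on their own, without another morpheme. In Korean these are generally nouns, pronouns, numerals, determiners, adverbs, and interjections.     
    > In the example above: "한나", "책".  
- Bound morphemes (의존형태소)
    > Morphemes that must combine with another morpheme to form an eojeol. In Korean these include particles and endings, as well as the stems of predicates, that is, verb and adjective stems.     
    > In the example above: "가", "을", "보", "았", "다". 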
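The two classifications are independent of each other, which is easiest to see by tabulating the example sentence. The sketch below simply hard-codes the analysis given in the text; it does not run an analyzer, and the `MORPHEMES` table is an illustrative name.

```python
# Hand-written analysis of "한나가 책을 보았다." taken from the text above.
# Each entry: (morpheme, is_lexical, is_free)
MORPHEMES = [
    ("한나", True, True),
    ("가", False, False),
    ("책", True, True),
    ("을", False, False),
    ("보", True, False),   # verb stem: lexical meaning, but cannot stand alone
    ("았", False, False),
    ("다", False, False),
]

lexical = [m for m, is_lexical, _ in MORPHEMES if is_lexical]
grammatical = [m for m, is_lexical, _ in MORPHEMES if not is_lexical]
free = [m for m, _, is_free in MORPHEMES if is_free]
bound = [m for m, _, is_free in MORPHEMES if not is_free]

print("lexical:    ", lexical)
print("grammatical:", grammatical)
print("free:       ", free)
print("bound:      ", bound)
```

The verb stem "보" is the only morpheme here that is lexical but bound, which is why the two groupings do not coincide.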

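## Morphological analyzers
A morphological analyzer is a library that tags parts of speech, that is, it marks which part of speech each unit belongs to.    
In English, parts of speech attach to whitespace-delimited words and are largely determined by a word's position in the sentence, so the tool is called a POS (part-of-speech) tagger. In Korean, each space-delimited unit must first be split into its morphemes before tags can be assigned, so the tool is called a morphological analyzer.   
  
Example:
- 우리는 한국인이다 -> 우리 (pronoun), 는 (particle)
- We are Korean -> We (pronoun)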
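A small illustration of why whitespace splitting is not enough for Korean. The morpheme split in `KOREAN_MORPHEMES` is written by hand for this example (an assumption, not analyzer output).

```python
english = "We are Korean"
korean = "우리는 한국인이다"

# Splitting on whitespace already yields word-level units for English.
print(english.split())

# For Korean, the same split yields eojeol (space-delimited units) that still
# bundle a stem with its particles or endings.
print(korean.split())

# Hand-written morpheme split, for illustration only.
KOREAN_MORPHEMES = {
    "우리는": ["우리", "는"],
    "한국인이다": ["한국인", "이", "다"],
}
for eojeol in korean.split():
    print(eojeol, "->", KOREAN_MORPHEMES[eojeol])
```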

### [nltk](https://www.nltk.org/)    
One of the oldest and best-known natural language processing libraries for Python (no Korean support)

In [1]:
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
sentence = """
At eight o'clock on Thursday morning
Arthur didn't feel very good """
sentence

"\nAt eight o'clock on Thursday morning\nArthur didn't feel very good "

Tokenization

In [3]:
tokens = nltk.word_tokenize(sentence) # tokenize with the punkt model downloaded above
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good']

Tag each word in the sentence with its part of speech

In [4]:
tagged = nltk.pos_tag(tokens)
tagged

[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 ('Arthur', 'NNP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ')]

Keep only nouns and verbs

In [5]:
[token for token,pos in tagged if pos.startswith("N") or pos.startswith("V")]

["o'clock", 'Thursday', 'morning', 'Arthur', 'did', 'feel']
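The prefix test works because Penn Treebank noun tags begin with `NN` and verb tags begin with `VB`. As a rough sketch, the same tagged output can be grouped by coarse class without calling NLTK again; `TAGGED` is copied from the output above and `coarse_class` is a hypothetical helper.

```python
# Tagged tokens copied from the nltk.pos_tag output shown above.
TAGGED = [
    ("At", "IN"), ("eight", "CD"), ("o'clock", "NN"), ("on", "IN"),
    ("Thursday", "NNP"), ("morning", "NN"), ("Arthur", "NNP"),
    ("did", "VBD"), ("n't", "RB"), ("feel", "VB"), ("very", "RB"),
    ("good", "JJ"),
]

def coarse_class(tag):
    """Collapse a Penn Treebank tag into 'noun', 'verb', or 'other'."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    return "other"

groups = {}
for token, tag in TAGGED:
    groups.setdefault(coarse_class(tag), []).append(token)

print(groups["noun"])
print(groups["verb"])
```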

### [spacy](https://spacy.io/)
An open-source Python library for natural language processing

In [7]:
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download ko_core_news_sm

#### English

In [6]:
import spacy
from spacy.lang.en.examples import sentences 



In [8]:
nlp = spacy.load("en_core_web_sm") # load the English pipeline
doc = nlp(sentences[0]) 
print(doc.text)
print('-'*80)
print("text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop", sep="\t")
for token in doc:
    print(
        token.text # surface form
        , token.lemma_ # base form (lemma)
        , token.pos_ # coarse part of speech
        , token.tag_ # fine-grained tag
        , token.dep_ # dependency relation
        , token.shape_ # word shape
        , token.is_alpha # alphabetic characters only?
        , token.is_stop # stop word?
        , sep='\t')

Apple is looking at buying U.K. startup for $1 billion
--------------------------------------------------------------------------------
text	lemma	pos	tag	dep	shape	is_alpha	is_stop
Apple	Apple	PROPN	NNP	nsubj	Xxxxx	True	False
is	be	AUX	VBZ	aux	xx	True	True
looking	look	VERB	VBG	ROOT	xxxx	True	False
at	at	ADP	IN	prep	xx	True	True
buying	buy	VERB	VBG	pcomp	xxxx	True	False
U.K.	U.K.	PROPN	NNP	dobj	X.X.	False	False
startup	startup	NOUN	NN	dobj	xxxx	True	False
for	for	ADP	IN	prep	xxx	True	True
$	$	SYM	$	quantmod	$	False	False
1	1	NUM	CD	compound	d	False	False
billion	billion	NUM	CD	pobj	xxxx	True	False


#### Korean

In [9]:
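import locale
# Override the preferred encoding to UTF-8 (a common Colab workaround before running `!` shell commands).
def getpreferredencoding(do_setlocale = True):
  return "UTF-8"
locale.getpreferredencoding = getpreferredencoding  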

In [10]:
!python -m spacy download ko_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ko-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/downloa

In [11]:
import spacy
from spacy.lang.ko.examples import sentences

In [12]:
nlp = spacy.load("ko_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
print('-'*80)
print("text", "lemma", "pos", "tag", "dep", "shape", "is_alpha", "is_stop", sep="\t")
for token in doc:
  print(
      token.text
      , token.lemma_
      , token.pos_
      , token.tag_
      , token.dep_
      , token.shape_
      , token.is_alpha
      , token.is_stop
      , sep='\t'
  )

애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다.
--------------------------------------------------------------------------------
text	lemma	pos	tag	dep	shape	is_alpha	is_stop
애플이	애플이	CCONJ	nq+jcj	dislocated	xxx	True	False
영국의	영국+의	PROPN	nq+jcm	nmod	xxx	True	False
스타트업을	스타트업+을	NOUN	ncn+jcs	dislocated	xxxx	True	False
10억	10+억	NUM	nnc+nnc	nummod	ddx	False	False
달러에	달러+에	ADV	nbu+jca	obl	xxx	True	False
인수하는	인수+하+는	VERB	ncpa+xsv+etm	acl	xxxx	True	False
것을	것+을	NOUN	nbn+jco	obj	xx	True	False
알아보고	알+아+보+고	AUX	pvg+ecx+px+ecx	ROOT	xxxx	True	False
있다	있+다	AUX	px+ef	aux	xx	True	False
.	.	PUNCT	sf	punct	.	False	False


### [Konlpy](https://konlpy-ko.readthedocs.io/ko/v0.4.3/)

KoNLPy bundles the following morphological analysis and tagging libraries so that they can be used easily from Python.
- Hannanum: developed at the KAIST Semantic Web Research Center.
  - http://semanticweb.kaist.ac.kr/hannanum/

- Kkma: developed at the IDS (Intelligent Data Systems) lab, Seoul National University.
  - http://kkma.snu.ac.kr/

- Komoran: developed by Shineware.
  - https://github.com/shin285/KOMORAN

- Mecab: a Japanese morphological analyzer adapted for use with Korean.
  - https://bitbucket.org/eunjeon/mecab-ko

- Open Korean Text: an open-source Korean text analyzer, formerly the Twitter morphological analyzer.
  - https://github.com/open-korean-text/open-korean-text

In [13]:
!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
!bash /content/Mecab-ko-for-Google-Colab/install_mecab-ko_on_colab190912.sh

Cloning into 'Mecab-ko-for-Google-Colab'...
remote: Enumerating objects: 115, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 115 (delta 11), reused 10 (delta 3), pack-reused 91
Receiving objects: 100% (115/115), 1.27 MiB | 14.62 MiB/s, done.
Resolving deltas: 100% (50/50), done.
Installing konlpy.....
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting JPype1>=0.7.0
  Downloading JPype1-1.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.4.1 

#### Okt (formerly Twitter)

In [16]:
from konlpy.tag import Okt

In [18]:
okt = Okt()  # instantiate the tagger; assigning the class itself makes okt.pos(txt) raise TypeError

In [23]:
txt = "아버지가방에들어가신다."

Morphological analysis

In [26]:
okt.pos(txt) 

In [21]:
[token[0] for token in okt.pos(txt) if token[1][0] in "NYJ"]

Nouns only

In [27]:
okt.nouns(txt)

#### Mecab

In [28]:
from konlpy.tag import Mecab

In [29]:
mec = Mecab()

In [30]:
txt = "아버지가방에들어가신다."

Morphological analysis

In [31]:
mec.pos(txt)

[('아버지', 'NNG'),
 ('가', 'JKS'),
 ('방', 'NNG'),
 ('에', 'JKB'),
 ('들어가', 'VV'),
 ('신다', 'EP+EF'),
 ('.', 'SF')]

In [32]:
[ token[0] for token in mec.pos(txt) if token[1][0] in "NVJ" ]

['아버지', '가', '방', '에', '들어가']

Nouns only

In [33]:
mec.nouns(txt)

['아버지', '방']

# Load Review Dataset

In [34]:
import numpy as np
import pandas as pd
import torch
from torchtext.vocab import build_vocab_from_iterator

from tqdm.auto import tqdm

In [35]:
# Mount Google Drive (needed to load the data)
from google.colab import drive
drive.mount('/content/data')

Mounted at /content/data


In [36]:
DATA_PATH = "/content/data/MyDrive/dev/2.deep learning/4. NLP/data"

## English Dataset

In [38]:
df_en = pd.read_csv(DATA_PATH+"/IMDB/IMDB-Dataset.csv")

print(f'{df_en.isnull().sum().sum()} / {df_en.shape}') 
df_en.head()

0 / (50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [39]:
df_en = df_en[:1000].copy()  # copy the slice so the assignment below does not trigger SettingWithCopyWarning
df_en['sentiment'] = df_en['sentiment'].map({'positive':1, 'negative':0})

print(f'{df_en.isnull().sum().sum()} / {df_en.shape}')
df_en.head()

0 / (1000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


### Building the vocabulary

In [40]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [41]:
def tokenizer(text):
  doc = nlp(text)
  return [token.lemma_ for token in doc if token.tag_[0] in "NYJ"]

In [42]:
tokenizer(df_en["review"][0])[:5]

['other', 'reviewer', 'oz', 'episode', 'right']

In [43]:
def yield_tokens(data, tokenizer):
  for text in tqdm(data):
    yield tokenizer(text)

In [44]:
gen = yield_tokens(df_en["review"],tokenizer)
vocab = build_vocab_from_iterator(gen, specials=["<pad>", "<unk>"])
vocab.set_default_index(vocab["<unk>"])
len(vocab)

  0%|          | 0/1000 [00:00<?, ?it/s]

15185

In [45]:
vocab(["a", "very", "karns"])

[1836, 638, 1]

In [46]:
vocab.lookup_tokens([1879, 756, 1, 0])

['festival', 'cartoon', '<unk>', '<pad>']
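To make the vocabulary step less opaque, here is a simplified pure-Python sketch of what a frequency-ordered vocabulary with special tokens and a default index does. It is not torchtext's implementation: `build_vocab` and `encode` are hypothetical helpers, and the alphabetical tie-breaking is an assumption.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Ties are broken alphabetically here; torchtext's tie-breaking may differ.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi, unk="<unk>"):
    """Look up each token, falling back to the <unk> id (the 'default index')."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
stoi, itos = build_vocab(docs)
print(itos)
print(encode(["good", "movie", "karns"], stoi))
```

Putting `<pad>` at index 0 and `<unk>` at index 1 matches the notebook's convention, where unseen tokens such as "karns" map to 1.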

## Korean Dataset

In [48]:
df_ko = pd.read_csv(DATA_PATH+"/naver_review/naver_review_train.csv", sep="\t")

print(f'{df_ko.isnull().sum().sum()} / {df_ko.shape}') 
df_ko.head()

5 / (150000, 3)


Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


In [50]:
df_ko = df_ko.dropna().reset_index(drop=True)
# df_ko = df_ko[:1000]
df_ko['document'] = df_ko['document'].map(lambda x: x.strip())

print(f'{df_ko.isnull().sum().sum()} / {df_ko.shape}')
df_ko.head()

0 / (149995, 3)


Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1


### Building the vocabulary

In [51]:
from konlpy.tag import Mecab

mecab = Mecab()

In [52]:
def tokenizer(text):
  tokens = mecab.pos(text)
  return [token[0] for token in tokens if token[1][0] in "NYJ"]

In [53]:
tokenizer(df_ko["document"][0])[:5]

['짜증', '나', '목소리']

In [54]:
def yield_tokens(data, tokenizer):
  for text in tqdm(data):
    yield tokenizer(text)

In [56]:
gen = yield_tokens(df_ko["document"],tokenizer)
vocab = build_vocab_from_iterator(gen, specials=["<pad>", "<unk>"])
vocab.set_default_index(vocab["<unk>"])
len(vocab)

  0%|          | 0/149995 [00:00<?, ?it/s]

32242

In [57]:
vocab(["짜증", "나", "네요", "목소리", "karns"])

[99, 20, 4856, 350, 1]

In [58]:
vocab.lookup_tokens([77, 17, 147, 621, 1, 0])

['부터', '에서', '대사', '꼴', '<unk>', '<pad>']

# Make a Dataset

Text encoding

In [59]:
train = [vocab(tokenizer(text)) for text in df_ko["document"].tolist()]
len(train)

149995

In [60]:
train[:2]

[[99, 20, 350], [233, 250, 321, 2, 62, 837, 22, 440]]

Maximum sentence length

In [61]:
max_len = max(len(lst) for lst in train)
max_len

68

In [62]:
train = [ lst + [0] * (max_len - len(lst))  if len(lst) < max_len else lst for lst in train]
train = np.array(train)
train.shape

(149995, 68)

In [63]:
train[:2]

array([[ 99,  20, 350,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0],
       [233, 250, 321,   2,  62, 837,  22, 440,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0]])
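The padding step can be written as a small reusable function. This is a sketch rather than the notebook's code: `pad_sequences` is a hypothetical helper, and the optional truncation to `max_len` is an addition that the notebook does not perform (it pads to the longest review).

```python
def pad_sequences(seqs, pad_id=0, max_len=None):
    """Right-pad every sequence with pad_id to a common length."""
    if max_len is None:
        max_len = max(len(seq) for seq in seqs)
    padded = []
    for seq in seqs:
        clipped = list(seq[:max_len])  # truncate only if a cap was given
        padded.append(clipped + [pad_id] * (max_len - len(clipped)))
    return padded

encoded = [[99, 20, 350], [233, 250, 321, 2, 62, 837, 22, 440]]
padded = pad_sequences(encoded)
for row in padded:
    print(row)
```

Using 0 as `pad_id` relies on `<pad>` being the first special token in the vocabulary.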

target

In [64]:
target = df_ko["label"].to_numpy()
target.shape

(149995,)

In [65]:
target[:2]

array([0, 1])

In [66]:
target = target.reshape(-1, 1)
target.shape

(149995, 1)

In [67]:
target[:2]

array([[0],
       [1]])

In [68]:
class ReviewDataset(torch.utils.data.Dataset):
  def __init__(self, x, y=None):
    self.x = x
    self.y = y
  def __len__(self):
    return self.x.shape[0]
  def __getitem__(self, idx):
    item = {}
    item["x"] = torch.LongTensor(self.x[idx])
    if self.y is not None:
      item["y"] = torch.Tensor(self.y[idx])
    return item      

In [73]:
dt = ReviewDataset(train, target)

# Make a DataLoader

In [74]:
dl = torch.utils.data.DataLoader(dt,batch_size=2,shuffle=False)

In [75]:
batch = next(iter(dl))
batch

{'x': tensor([[ 99,  20, 350,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
         [233, 250, 321,   2,  62, 837,  22, 440,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]]),
 'y': tensor([[0.],
         [1.]])}

# Embedding

In [76]:
len(vocab)

32242

In [77]:
emb_out_size = 4
emb_layer = torch.nn.Embedding(len(vocab), emb_out_size)
emb_out = emb_layer(batch["x"])
emb_out.shape

torch.Size([2, 68, 4])
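Conceptually, an embedding layer is a lookup table with one row per token id. The following pure-Python sketch (no PyTorch; `table` and `embed` are illustrative names) shows how a batch of ids of shape (batch, seq_len) becomes (batch, seq_len, emb_dim).

```python
import random

random.seed(0)
vocab_size, emb_dim = 6, 4
# One row of emb_dim numbers per token id, randomly initialised.
table = [[random.gauss(0.0, 1.0) for _ in range(emb_dim)] for _ in range(vocab_size)]

def embed(batch_ids):
    """Replace every id with its row: (batch, seq_len) -> (batch, seq_len, emb_dim)."""
    return [[table[token_id] for token_id in seq] for seq in batch_ids]

batch_ids = [[2, 3, 0], [4, 1, 0]]
out = embed(batch_ids)
print(len(out), len(out[0]), len(out[0][0]))
```

Every occurrence of the same id, including the padding id 0, receives the same vector.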

# CNN (Convolutional Neural Network)

## 1D CNN
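- in_channels: number of feature dimensions of the input
- out_channels: number of feature dimensions of the output
- kernel_size: how many time steps of the input the kernel covers at once
- stride: how far the kernel moves at each step
- padding: how much padding is added on both sides    

Input tensor shape    
  >  batch, feature dim, time step (input length)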


In [78]:
emb_out.shape # (batch size, max sentence length, embedding dim)

torch.Size([2, 68, 4])

In [79]:
emb_out.permute(0, 2, 1).shape

torch.Size([2, 4, 68])

In [80]:
conv1 = torch.nn.Conv1d(in_channels=4, out_channels=8, kernel_size=5, stride=1)
conv1_out = conv1(emb_out.permute(0, 2, 1))
conv1_out.shape

torch.Size([2, 8, 64])

In [81]:
conv1.weight.shape

torch.Size([8, 4, 5])

In [82]:
conv1.bias.shape

torch.Size([8])
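The shapes above follow from the standard output-length formula for a 1D convolution, L_out = floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1. A minimal sketch that reproduces the numbers without PyTorch; the helper names are hypothetical.

```python
def conv1d_out_len(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution along the time axis."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def conv1d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Weight tensor is (out_channels, in_channels, kernel_size), plus one bias per output channel."""
    return out_channels * in_channels * kernel_size + (out_channels if bias else 0)

print(conv1d_out_len(68, kernel_size=5))             # length after conv1
print(conv1d_out_len(68, kernel_size=5, padding=2))  # padding=2 keeps the length
print(conv1d_param_count(4, 8, 5))                   # weights + biases of conv1
```

With `kernel_size=5` and no padding, a sequence of 68 steps shrinks to 64, which matches `conv1_out.shape`.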

## pooling layer
Used to reduce the output size of a convolutional layer or to emphasize particular parts of the output

In [83]:
conv1_out.shape

torch.Size([2, 8, 64])

- Average Pooling

In [84]:
avg_pool1d = torch.nn.AvgPool1d(2)
avg_pool1d(conv1_out).shape

torch.Size([2, 8, 32])

- Max Pooling

In [85]:
max_pool1d = torch.nn.MaxPool1d(2)
max_pool1d(conv1_out).shape

torch.Size([2, 8, 32])
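Both pooling variants slide a window over the time axis and keep one number per window. The sketch below works on a single channel as a plain list; `pool1d` is a hypothetical helper that assumes the stride equals the kernel size (the default for the PyTorch pooling layers used above).

```python
def pool1d(values, kernel_size, reduce):
    """Apply `reduce` to consecutive non-overlapping windows of `values`."""
    # A trailing partial window is dropped.
    n_windows = len(values) // kernel_size
    return [
        reduce(values[i * kernel_size:(i + 1) * kernel_size])
        for i in range(n_windows)
    ]

def mean(window):
    return sum(window) / len(window)

signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]
print(pool1d(signal, 2, max))   # max pooling
print(pool1d(signal, 2, mean))  # average pooling
```

A window of 2 halves the length, which is why 64 time steps become 32 in both outputs above.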

## Global Pooling layer
- Reduces dimensionality more aggressively than an ordinary pooling layer.
- Used to reshape the output so that it can be fed to a linear layer.

In [86]:
conv1_out.shape

torch.Size([2, 8, 64])

- Average Pooling

In [89]:
avg_adpt = torch.nn.AdaptiveAvgPool1d(1)
avg_adpt(conv1_out).shape

torch.Size([2, 8, 1])

- Max Pooling

In [90]:
max_adpt = torch.nn.AdaptiveMaxPool1d(1)
max_adpt(conv1_out).shape

torch.Size([2, 8, 1])
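Global pooling with an output size of 1 is ordinary pooling whose window spans the whole time axis, so each channel collapses to a single value. A minimal sketch on nested lists; `global_pool` is an illustrative name.

```python
def global_pool(feature_map, reduce):
    """feature_map: list of channels, each a list over time -> one value per channel."""
    return [reduce(channel) for channel in feature_map]

def mean(values):
    return sum(values) / len(values)

# Two channels, four time steps each.
feature_map = [[1.0, 3.0, 2.0, 5.0], [0.0, -1.0, 4.0, 1.0]]
print(global_pool(feature_map, max))
print(global_pool(feature_map, mean))
```

Because the result has one number per channel regardless of sequence length, it can be flattened and passed to a linear layer with `in_features` equal to the channel count.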

# CNN Model

In [91]:
class Conv1dModel(torch.nn.Module):
    def __init__(self,vocab_size,embedding_dim):
        super().__init__()
        self.emb_layer = torch.nn.Embedding(vocab_size,embedding_dim)

        self.seq = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=embedding_dim,out_channels=embedding_dim*2,kernel_size=3),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(2),
            torch.nn.Conv1d(in_channels=embedding_dim*2,out_channels=embedding_dim*4,kernel_size=3),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(2),
            torch.nn.AdaptiveAvgPool1d(1),
            torch.nn.Flatten(),
            torch.nn.Linear(embedding_dim*4,1)
        )
    def forward(self,x):
        x = self.emb_layer(x)
        x = x.permute(0,2,1)
        return self.seq(x)

In [92]:
model = Conv1dModel(len(vocab), 128)
model(batch["x"])

tensor([[-0.0648],
        [-0.0641]], grad_fn=<AddmmBackward0>)
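It can help to trace how the sequence length and channel count change through `Conv1dModel.seq`. The sketch below does the arithmetic only (no PyTorch), assuming stride 1 and no padding for the convolutions and the default pooling stride; `trace_shapes` is a hypothetical helper.

```python
def conv_len(length, kernel_size):
    # Conv1d with stride 1 and no padding
    return length - kernel_size + 1

def pool_len(length, kernel_size):
    # MaxPool1d with its default stride (equal to kernel_size)
    return length // kernel_size

def trace_shapes(seq_len, embedding_dim):
    """Return (stage, channels, length) after each stage of the model."""
    shapes = [("embedding + permute", embedding_dim, seq_len)]
    length = conv_len(seq_len, 3)
    shapes.append(("conv1", embedding_dim * 2, length))
    length = pool_len(length, 2)
    shapes.append(("maxpool1", embedding_dim * 2, length))
    length = conv_len(length, 3)
    shapes.append(("conv2", embedding_dim * 4, length))
    length = pool_len(length, 2)
    shapes.append(("maxpool2", embedding_dim * 4, length))
    shapes.append(("global avg pool", embedding_dim * 4, 1))
    return shapes

for name, channels, length in trace_shapes(seq_len=68, embedding_dim=128):
    print(f"{name:20s} channels={channels:4d} length={length}")
```

After flattening, the linear layer therefore receives `embedding_dim * 4` features per review, matching `torch.nn.Linear(embedding_dim*4, 1)`.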

## Train

In [93]:
def train_loop(dataloader,model,loss_fn,optimizer,device):
    epoch_loss = 0 
    model.train()
    for batch in dataloader:
        pred = model(batch["x"].to(device)) 
        loss = loss_fn(pred, batch["y"].to(device)) 
        
        optimizer.zero_grad() 
        loss.backward()  
        optimizer.step()
        
        epoch_loss += loss.item()

    epoch_loss /= len(dataloader) 

    return epoch_loss 

In [94]:
@torch.no_grad()
def test_loop(dataloader, model, loss_fn, device):
  epoch_loss = 0
  model.eval()

  pred_list = []
  sig = torch.nn.Sigmoid()

  for batch in dataloader:

    pred = model(batch["x"].to(device))
    if batch.get("y") is not None:
      loss = loss_fn(pred, batch["y"].to(device))
      epoch_loss += loss.item()

    pred = sig(pred)
    pred = pred.to("cpu").numpy()
    pred_list.append(pred)

  epoch_loss /= len(dataloader)

  pred = np.concatenate(pred_list)
  return epoch_loss, pred    

In [95]:
n_splits = 5
vocab_size = len(vocab) 
embedding_dim = 32 
batch_size = 16 
epochs = 100
loss_fn = torch.nn.BCEWithLogitsLoss()
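`BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy, which is why the model's last layer outputs raw logits and `test_loop` applies `Sigmoid` separately before thresholding. A per-sample sketch in plain Python, using the usual numerically stable rearrangement; the function names are illustrative, not PyTorch API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    """Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    """Direct formula; fine for moderate logits, unstable for very large ones."""
    s = sigmoid(logit)
    return -(target * math.log(s) + (1 - target) * math.log(1 - s))

for logit, target in [(-0.0648, 0.0), (2.0, 1.0), (-3.0, 1.0)]:
    print(logit, target, bce_with_logits(logit, target), bce_naive(logit, target))
```

The two functions agree for moderate logits; the stable form also stays finite for extreme values where the naive version would overflow or take the log of zero.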

In [97]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

SEED = 42
cv = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
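For orientation, the fold sizes produced by K-fold cross-validation can be computed by hand: the first `n_samples % n_splits` folds get one extra sample, and shuffling changes which indices land in a fold but not the sizes. `kfold_sizes` below is a hypothetical helper, not part of scikit-learn.

```python
def kfold_sizes(n_samples, n_splits):
    """Validation-fold sizes for K-fold cross-validation."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_samples = 149995  # number of Korean reviews after dropna()
sizes = kfold_sizes(n_samples, 5)
for fold, valid_size in enumerate(sizes):
    print(f"fold {fold}: train={n_samples - valid_size}, valid={valid_size}")
```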

In [99]:
import random
import os

def reset_seeds(seed):
  random.seed(seed)
  os.environ['PYTHONHASHSEED'] = str(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.backends.cudnn.deterministic = True

In [100]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cpu'

In [101]:
is_holdout = False
reset_seeds(SEED)
best_score_list = []
for i,(tri,vai) in enumerate(cv.split(train)):
    
    model = Conv1dModel(vocab_size, embedding_dim).to(device)
    optimizer = torch.optim.Adam(model.parameters())
    
    train_dt = ReviewDataset(train[tri],target[tri])
    valid_dt = ReviewDataset(train[vai],target[vai])
    train_dl = torch.utils.data.DataLoader(train_dt, batch_size=batch_size, shuffle=True)
    valid_dl = torch.utils.data.DataLoader(valid_dt, batch_size=batch_size,shuffle=False)

    best_score = 0
    patience = 0

    for epoch in tqdm(range(epochs)):
        
        train_loss = train_loop(train_dl, model, loss_fn,optimizer,device )
        valid_loss , pred = test_loop(valid_dl, model, loss_fn,device  )
        pred = (pred > 0.5).astype(int) # threshold probabilities to 0/1 so accuracy can be computed

        score = accuracy_score(target[vai],pred )
        print(train_loss,valid_loss,score)
        patience += 1
        if best_score < score:
            patience = 0
            best_score = score
            torch.save(model.state_dict(),f"model_{i}.pth")

        if patience == 10:
            break
    print(f"Fold ({i}), BEST ACC: {best_score}")
    best_score_list.append(best_score)

    if is_holdout:
        break

  0%|          | 0/100 [00:00<?, ?it/s]

0.5525113476534684 0.5028668185631434 0.7396913230441015
0.48049956862131754 0.487162170513471 0.7547584919497317
0.4496775400499503 0.48662335891723635 0.754291809726991
0.42628330094019573 0.4864603516459465 0.75989199639988
0.4043539083490769 0.4911536521911621 0.7570585686189539
0.38250495334267615 0.5085851617336273 0.7567252241741391
0.36012808148413894 0.5249261368513107 0.755125170839028
0.3379973836789529 0.550366781938076 0.748491616387213
0.31626875471845267 0.6359826544801395 0.7327910930364345
0.2946427116299669 0.6350877775549889 0.746424880829361
0.27427945059078435 0.6737477670709292 0.7413913797126571
0.25571866687300304 0.7171670526603857 0.7411580386012867
0.23833444584086538 0.7803063420693079 0.7350245008166939
0.2227412830884258 0.8464120841483275 0.7368578952631755
Fold (0), BEST ACC: 0.75989199639988


  0%|          | 0/100 [00:00<?, ?it/s]

0.5501799285213153 0.5104565979719162 0.7334911163705456
0.48114579297502835 0.49152734568913775 0.7478915963865462
0.45065465648969016 0.48769556121826174 0.753958465282176
0.42625877969364323 0.4989502735217412 0.7517250575019168
0.4052484239518642 0.5037132786273957 0.7533584452815094
0.38319221910933654 0.522134176671505 0.7496916563885463
0.36173578707476456 0.5361479721784592 0.750291676389213
0.33966128230889636 0.5589052951931953 0.7494583152771759
0.3179738293796778 0.5916460032761097 0.7480582686089536
0.2966735167168081 0.6384836228013039 0.7383246108203607
0.2759185331794123 0.6727876788258552 0.7388912963765459
0.25685731439950565 0.7447910060584545 0.7402913430447682
0.23886446814884743 0.7939270779550075 0.7299576652555085
Fold (1), BEST ACC: 0.753958465282176


  0%|          | 0/100 [00:00<?, ?it/s]

0.5528224817057451 0.5077419278701146 0.7372245741524718
0.4781507106264432 0.4936536385933558 0.7478915963865462
0.44729696314930917 0.4879290278673172 0.7521917397246575
0.42472822663585347 0.4969512304186821 0.7532917763925464
0.4037029864658912 0.5018782220840454 0.7503916797226574
0.38256137080192565 0.5177587728857994 0.7539917997266575
0.3613752509236336 0.538661958527565 0.7471249041634721
0.339795585582157 0.5513314785838127 0.7461582052735091
0.3179539546181758 0.5975763408144316 0.7439247974932498
0.2972161465284725 0.6231249711672465 0.7429247641588053
0.2763703972344597 0.6573498006979624 0.7380579352645088
0.2571154546894133 0.7186043729603291 0.736824560818694
0.2398200916721175 0.7763821992595991 0.7356245208173606
0.22274880633372812 0.8434879186769326 0.7302576752558418
0.20839646542205786 0.9646043695251147 0.7266242208073602
0.19546970620863138 1.00072686667641 0.7307910263675456
Fold (2), BEST ACC: 0.7539917997266575


  0%|          | 0/100 [00:00<?, ?it/s]

0.5505910356581211 0.5098765072425206 0.7379245974865829
0.4796308892766635 0.49000442215601603 0.7528584286142871
0.4500232215722402 0.4861355239788691 0.7566585552851762
0.4267758373171091 0.4895526692469915 0.7589586319543985
0.40604481822649635 0.4996431033174197 0.7568918963965465
0.3851930143694083 0.5050788683176041 0.7561918730624354
0.36342078316013016 0.5198921120166778 0.7520250675022501
0.34141560539950927 0.5377559976021449 0.7476582552751758
0.31915994978298745 0.5835223104039828 0.7467915597186573
0.2983802059101562 0.6179807533144951 0.7479582652755092
0.2776218880126874 0.6476867487510045 0.7440581352711757
0.2574213234080623 0.7167178417464097 0.7413580452681756
0.2400107562566797 0.7683559271097183 0.7403913463782126
0.22423188440774877 0.8272006655852 0.7388579619320644
Fold (3), BEST ACC: 0.7589586319543985


  0%|          | 0/100 [00:00<?, ?it/s]

0.5519413048088551 0.5032767887751262 0.7378579285976199
0.4798200606505076 0.4860539707104365 0.75112503750125
0.45058147825102013 0.48536626737912497 0.7538917963932131
0.427942340400815 0.4841530523061752 0.7565252175072502
0.4070147623469432 0.49507167967955273 0.7580586019533985
0.38631907615065575 0.5067853724559148 0.7523917463915464
0.3656268970489502 0.5187118004401525 0.7544251475049168
0.34364489415238303 0.5399104031244913 0.7506583552785093
0.3212720488364498 0.5736760919650395 0.7495249841661389
0.30016426875044905 0.5955510288476944 0.7465582186072869
0.27930774680674075 0.6609777319629987 0.7463248774959166
0.25962524409517646 0.7159370522816976 0.7394246474882497
0.2419336069467167 0.788536017048359 0.7417913930464349
0.22608783895329881 0.8098831251502037 0.7399246641554719
0.210979889538077 0.8942530708789825 0.7399913330444348
Fold (4), BEST ACC: 0.7580586019533985
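The early-stopping rule in the loop can be replayed on the logged scores to see why each fold stops where it does. The sketch below applies the same patience logic to the fold 0 validation accuracies (rounded to four decimals from the log above); `epochs_until_stop` is a hypothetical helper.

```python
def epochs_until_stop(scores, patience_limit=10):
    """Replay the patience rule on a list of per-epoch validation scores.

    Returns (epochs_run, best_epoch_index, best_score).
    """
    best, best_epoch, patience = 0.0, None, 0
    for epoch, score in enumerate(scores):
        patience += 1
        if score > best:
            best, best_epoch, patience = score, epoch, 0
        if patience == patience_limit:
            return epoch + 1, best_epoch, best
    return len(scores), best_epoch, best

fold0 = [0.7397, 0.7548, 0.7543, 0.7599, 0.7571, 0.7567, 0.7551,
         0.7485, 0.7328, 0.7464, 0.7414, 0.7412, 0.7350, 0.7369]
print(epochs_until_stop(fold0))
```

In the log, the best accuracy arrives within the first few epochs, after which the training loss keeps falling while the validation loss rises; ten epochs without improvement then end the fold.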


In [None]:
np.mean(best_score_list)