
In [0]:
%matplotlib inline


Classifying Names with a Character-Level RNN
*********************************************
**Author**: `Sean Robertson <https://github.com/spro/practical-pytorch>`_

We will be building and training a basic character-level RNN to classify
words. A character-level RNN reads words as a series of characters -
outputting a prediction and "hidden state" at each step, feeding its
previous hidden state into each next step. We take the final prediction
to be the output, i.e. which class the word belongs to.

Specifically, we'll train on a few thousand surnames from 18 languages
of origin, and predict which language a name is from based on the
spelling:

::

    $ python predict.py Hinton
    (-0.47) Scottish
    (-1.52) English
    (-3.57) Irish

    $ python predict.py Schmidhuber
    (-0.19) German
    (-2.48) Czech
    (-2.68) Dutch


**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  http://pytorch.org/ for installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user

It would also be useful to know about RNNs and how they work:

-  `The Unreasonable Effectiveness of Recurrent Neural
   Networks <http://karpathy.github.io/2015/05/21/rnn-effectiveness/>`__
   shows a bunch of real life examples
-  `Understanding LSTM
   Networks <http://colah.github.io/posts/2015-08-Understanding-LSTMs/>`__
   is about LSTMs specifically but also informative about RNNs in
   general

Preparing the Data
==================

.. Note::
   Download the data from
   `here <https://download.pytorch.org/tutorial/data.zip>`_
   and extract it to the current directory.

Included in the ``data/names`` directory are 18 text files named as
"[Language].txt". Each file contains a bunch of names, one name per
line, mostly romanized (but we still need to convert from Unicode to
ASCII).

We'll end up with a dictionary of lists of names per language,
``{language: [names ...]}``. The generic variables "category" and "line"
(for language and name in our case) are used for later extensibility.
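
The Unicode-to-ASCII conversion mentioned above relies on Unicode
normalization. As a minimal sketch (the helper name ``strip_accents`` is
hypothetical and not part of the tutorial code), NFD normalization splits an
accented character into a base letter plus a combining mark of category
``Mn``, which can then be filtered out:

```python
import unicodedata

# NFD splits a precomposed character into base letter + combining mark.
decomposed = unicodedata.normalize('NFD', '\u00e9')  # 'é'
print([unicodedata.name(c) for c in decomposed])
print([unicodedata.category(c) for c in decomposed])

def strip_accents(s):
    # Hypothetical helper: drop combining marks (category 'Mn') after NFD.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

print(strip_accents('Ślusàrski'))
```

Characters that have no decomposition pass through this step unchanged,
which is presumably why the tutorial code also filters on ``all_letters``.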



In [0]:
from __future__ import unicode_literals, print_function, division  # make Python 2 behave like Python 3
from io import open
import glob  # returns a list of file names matching a pattern
import os  # use `import os` rather than `from os import *` so os.open() does not shadow the built-in open()

def findFiles(path): return glob.glob(path)  # list the files matching `path`

print(findFiles('data/names/*.txt'))  # print the .txt files in data/names

import unicodedata  # Unicode character database
import string

all_letters = string.ascii_letters + " .,;'"  # ASCII letters plus a few punctuation marks
n_letters = len(all_letters)  # size of the alphabet

# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(  # decompose accented characters (NFD), then drop the combining marks
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))  # print the ASCII form of 'Ślusàrski'

# Build the category_lines dictionary, a list of names per language
category_lines = {}  # maps each category (language) to a list of lines (names)
all_categories = []  # list of languages

# Read a file and split it into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as f:  # read as UTF-8 so non-ASCII names decode correctly
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):  # one file per language
    category = os.path.splitext(os.path.basename(filename))[0]  # language = file name without extension
    all_categories.append(category)
    lines = readLines(filename)  # ASCII-normalized names from this file
    category_lines[category] = lines

n_categories = len(all_categories)  # number of languages

Now we have ``category_lines``, a dictionary mapping each category
(language) to a list of lines (names). We also kept track of
``all_categories`` (just a list of languages) and ``n_categories`` for
later reference.




In [0]:
print(category_lines['Italian'][:5])  # first five Italian names

Turning Names into Tensors
--------------------------

Now that we have all the names organized, we need to turn them into
Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size
``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1
at index of the current letter, e.g. ``"b" = <0 1 0 0 0 ...>``.

To make a word we join a bunch of those into a 3D tensor of size
``<line_length x 1 x n_letters>``.

That extra 1 dimension is because PyTorch assumes everything is in
batches - we're just using a batch size of 1 here.




In [0]:
import torch

# Find the index of a letter in all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)  # all-zero tensor of the given size
    tensor[0][letterToIndex(letter)] = 1  # one-hot: 1 at the letter's index, 0 elsewhere
    return tensor

# Turn a line (name) into a <line_length x 1 x n_letters> Tensor,
# i.e. an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)  # all-zero tensor of the given size
    for li, letter in enumerate(line):  # set the one-hot entry for each letter position
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))  # one-hot tensor for the letter 'J'

print(lineToTensor('Jones').size())  # shape of the tensor for the name 'Jones'
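
As an alternative to filling the tensor in a loop, the same one-hot encoding
can be sketched with ``torch.nn.functional.one_hot``. The function name
``line_to_tensor_onehot`` is hypothetical, and the sketch assumes every
character of the line is already in ``all_letters`` (``str.find`` returns -1
otherwise, which ``one_hot`` rejects):

```python
import string
import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def line_to_tensor_onehot(line):
    # Hypothetical helper: indices of each letter, shape <line_length>
    indices = torch.tensor([all_letters.find(ch) for ch in line], dtype=torch.long)
    # one_hot returns integers, so convert to float and add the batch dimension
    return F.one_hot(indices, num_classes=n_letters).float().unsqueeze(1)

t = line_to_tensor_onehot('Jones')
print(t.size())  # <line_length x 1 x n_letters>
```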

Creating the Network
====================

Before autograd, creating a recurrent neural network in Torch involved
cloning the parameters of a layer over several timesteps. The layers
held hidden state and gradients which are now entirely handled by the
graph itself. This means you can implement an RNN in a very "pure" way,
as regular feed-forward layers.

This RNN module (mostly copied from `the PyTorch for Torch users
tutorial <http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html#example-2-recurrent-net>`__)
is just 2 linear layers which operate on an input and hidden state, with
a LogSoftmax layer after the output.

.. figure:: https://i.imgur.com/Z2xbySO.png
   :alt:





In [0]:
import torch.nn as nn

# Two linear layers that operate on the input and hidden state,
# followed by a LogSoftmax layer on the output
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()  # initialize the parent class (nn.Module)

        self.hidden_size = hidden_size  # remember the hidden size for initHidden()
        # both layers take the concatenated input and hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)  # concatenate input and hidden state
        hidden = self.i2h(combined)  # next hidden state
        output = self.i2o(combined)  # unnormalized category scores
        output = self.softmax(output)  # log-probabilities over categories
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)  # zeroed initial hidden state

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

To run a step of this network we need to pass an input (in our case, the
Tensor for the current letter) and a previous hidden state (which we
initialize as zeros at first). We'll get back the output (log-probability
of each language) and a next hidden state (which we keep for the next
step).




In [0]:
input = letterToTensor('A')  # one-hot tensor for the letter 'A'
hidden = torch.zeros(1, n_hidden)  # zeroed initial hidden state

output, next_hidden = rnn(input, hidden)  # run one step with the input and the previous hidden state

For the sake of efficiency we don't want to be creating a new Tensor for
every step, so we will use ``lineToTensor`` instead of
``letterToTensor`` and use slices. This could be further optimized by
pre-computing batches of Tensors.
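
One way the batch pre-computation mentioned above might look is sketched
here with ``torch.nn.utils.rnn.pad_sequence``, which pads variable-length
names with zeros to a common length. The helper ``line_to_matrix`` and the
sample names are assumptions for illustration only:

```python
import string
import torch
from torch.nn.utils.rnn import pad_sequence

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def line_to_matrix(line):
    # Hypothetical helper: <line_length x n_letters> one-hot matrix (no batch dimension)
    matrix = torch.zeros(len(line), n_letters)
    for li, letter in enumerate(line):
        matrix[li][all_letters.find(letter)] = 1
    return matrix

names = ['Albert', 'Li', 'Jones']
lengths = torch.tensor([len(name) for name in names])  # true lengths, needed to ignore padding later

# pad_sequence stacks the matrices into <max_length x batch_size x n_letters>
batch = pad_sequence([line_to_matrix(name) for name in names])
print(batch.size())
print(lengths)
```

Feeding such a batch through the RNN would also require a hidden state of
size ``<batch_size x n_hidden>`` and some care to skip the padded steps,
which this sketch does not cover.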




In [0]:
input = lineToTensor('Albert')  # build the whole name once, then slice out one letter per step
hidden = torch.zeros(1, n_hidden)  # zeroed initial hidden state

output, next_hidden = rnn(input[0], hidden)  # pass the first letter and the hidden state
print(output)

As you can see the output is a ``<1 x n_categories>`` Tensor, where
every item is the log-likelihood of that category (higher is more likely).
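
Because the last layer is a LogSoftmax, the values are log-probabilities
(all zero or negative). A small sketch with random scores standing in for
the network output shows that exponentiating them recovers probabilities
that sum to 1, and that the ranking of categories is unchanged:

```python
import torch

torch.manual_seed(0)
scores = torch.randn(1, 18)                   # stand-in for raw category scores
log_probs = torch.log_softmax(scores, dim=1)  # what the LogSoftmax layer produces
probs = log_probs.exp()                       # back to ordinary probabilities

print(probs.sum().item())  # approximately 1
print(log_probs.argmax(dim=1).item() == probs.argmax(dim=1).item())
```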




Training
========
Preparing for Training
----------------------

Before going into training we should make a few helper functions. The
first interprets the output of the network, which we know to be a
log-likelihood for each category. We can use ``Tensor.topk`` to get the
index of the greatest value:




In [0]:
def categoryFromOutput(output):  # interpret the output as a log-likelihood per category
    top_n, top_i = output.topk(1)  # largest value and its index
    category_i = top_i[0].item()
    return all_categories[category_i], category_i  # category name and its index

print(categoryFromOutput(output))

We will also want a quick way to get a training example (a name and its
language):




In [0]:
import random

def randomChoice(l):  # pick a random element of a list
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():  # quick way to get one training example
    category = randomChoice(all_categories)  # random language
    line = randomChoice(category_lines[category])  # random name from that language
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor  # names plus their tensor forms

for i in range(10):  # print ten random (language, name) pairs
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

Training the Network
--------------------

Now all it takes to train this network is show it a bunch of examples,
have it make guesses, and tell it if it's wrong.

For the loss function ``nn.NLLLoss`` is appropriate, since the last
layer of the RNN is ``nn.LogSoftmax``.




In [0]:
criterion = nn.NLLLoss()  # the last layer of the RNN is nn.LogSoftmax, so the loss is nn.NLLLoss
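
To illustrate why ``nn.NLLLoss`` pairs with ``nn.LogSoftmax``, the sketch
below uses random scores and an arbitrary target index (both are stand-ins,
not tutorial data). It shows that the combination matches
``nn.CrossEntropyLoss`` on the raw scores, and that the loss is simply the
negated log-probability of the target class:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(1, 18)   # stand-in for raw category scores
target = torch.tensor([3])    # arbitrary target category index

nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), target)
ce = nn.CrossEntropyLoss()(scores, target)
picked = -torch.log_softmax(scores, dim=1)[0, 3]  # negated log-probability of the target

print(nll.item(), ce.item(), picked.item())
```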

Each loop of training will:

-  Create input and target tensors
-  Create a zeroed initial hidden state
-  Read each letter in and

   -  Keep hidden state for next letter

-  Compare final output to target
-  Back-propagate
-  Return the output and loss




In [0]:
learning_rate = 0.005  # if set too high it might explode; if too low it might not learn

def train(category_tensor, line_tensor):  # target and input tensors
    hidden = rnn.initHidden()  # zeroed initial hidden state

    rnn.zero_grad()  # reset the gradients of all parameters

    for i in range(line_tensor.size()[0]):  # read each letter in
        output, hidden = rnn(line_tensor[i], hidden)  # keep the hidden state for the next letter

    loss = criterion(output, category_tensor)  # compare the final output to the target
    loss.backward()  # back-propagate

    # Add each parameter's gradient, scaled by -learning_rate, to its value
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()  # return the output and the loss
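
The manual parameter update in ``train`` is plain stochastic gradient
descent. As a sketch, using a small ``nn.Linear`` model and random data as
stand-ins for the RNN and a training example, the same step can be taken
with ``torch.optim.SGD`` and yields identical parameters:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 0.005

model_manual = nn.Linear(4, 3)              # stand-in for the RNN
model_optim = copy.deepcopy(model_manual)   # identical starting parameters

x = torch.randn(1, 4)
target = torch.tensor([2])
criterion = nn.CrossEntropyLoss()

# Manual update: p <- p - learning_rate * grad
model_manual.zero_grad()
criterion(model_manual(x), target).backward()
with torch.no_grad():
    for p in model_manual.parameters():
        p.add_(p.grad, alpha=-learning_rate)

# Same step using the optimizer (no momentum or weight decay)
optimizer = torch.optim.SGD(model_optim.parameters(), lr=learning_rate)
optimizer.zero_grad()
criterion(model_optim(x), target).backward()
optimizer.step()

same = all(torch.allclose(a, b)
           for a, b in zip(model_manual.parameters(), model_optim.parameters()))
print(same)
```

Using an optimizer object makes it easier to later swap in a different
update rule without touching the training loop.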

Now we just have to run that with a bunch of examples. Since the
``train`` function returns both the output and loss we can print its
guesses and also keep track of loss for plotting. Since there are 1000s
of examples we print only every ``print_every`` examples, and take an
average of the loss.




In [0]:
import time
import math

n_iters = 100000  # number of training examples to run
print_every = 5000  # print progress every 5000 iterations
plot_every = 1000  # record the average loss every 1000 iterations

# Keep track of losses for plotting
current_loss = 0  # running loss since the last plot point
all_losses = []  # average loss per plot interval

def timeSince(since):  # format the time elapsed since `since`
    now = time.time()  # current time
    s = now - since  # elapsed seconds
    m = math.floor(s / 60)  # whole minutes
    s -= m * 60  # remaining seconds
    return '%dm %ds' % (m, s)

start = time.time()  # training start time

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()  # draw a random training example
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iteration number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category  # mark the guess as right or wrong
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add the current average loss to the list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

Plotting the Results
--------------------

Plotting the historical loss from ``all_losses`` shows the network
learning:




In [0]:
import matplotlib.pyplot as plt  # plotting with matplotlib
import matplotlib.ticker as ticker

plt.figure()  # create a new figure
plt.plot(all_losses)  # line plot of the averaged losses

Evaluating the Results
======================

To see how well the network performs on different categories, we will
create a confusion matrix, indicating for every actual language (rows)
which language the network guesses (columns). To calculate the confusion
matrix a bunch of samples are run through the network with
``evaluate()``, which is the same as ``train()`` minus the backprop.




In [0]:
# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):  # same as train() minus the backprop
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):  # 10000 random samples
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1  # row = actual language, column = guess

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()  # create a new figure
ax = fig.add_subplot(111)  # a single set of axes in the figure
cax = ax.matshow(confusion.numpy())  # draw on these axes (ax.matshow rather than plt.matshow)
fig.colorbar(cax)  # color scale for the matrix values

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()  # display all figures created so far

You can pick out bright spots off the main diagonal that show which
languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish
for Italian. It seems to do very well with Greek, and very poorly with
English (perhaps because of overlap with other languages).
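
To put numbers on such observations, per-language accuracy can be read off
the diagonal of the row-normalized confusion matrix. The sketch below uses a
made-up 3x3 count matrix (not results from the network) and also shows a
vectorized form of the row normalization:

```python
import torch

# Made-up counts for illustration: rows = actual class, columns = guessed class
counts = torch.tensor([[8., 1., 1.],
                       [2., 6., 2.],
                       [0., 5., 5.]])

# Divide every row by its sum in one step
normalized = counts / counts.sum(dim=1, keepdim=True)

per_class_accuracy = normalized.diag()                 # fraction correct per actual class
overall_accuracy = counts.diag().sum() / counts.sum()  # fraction correct over all samples

print(per_class_accuracy)
print(overall_accuracy.item())
```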




Running on User Input
---------------------




In [0]:
def predict(input_line, n_predictions=3):  # print the top predictions for a name
    print('\n> %s' % input_line)
    with torch.no_grad():  # no gradient tracking is needed for inference
        output = evaluate(lineToTensor(input_line))

        # Get the top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):  # for each of the top predictions
            value = topv[0][i].item()  # i-th highest log-likelihood
            category_index = topi[0][i].item()  # index of the corresponding category
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])  # collect the prediction

predict('Dovesky')
predict('Jackson')
predict('Satoshi')

The final versions of the scripts `in the Practical PyTorch
repo <https://github.com/spro/practical-pytorch/tree/master/char-rnn-classification>`__
split the above code into a few files:

-  ``data.py`` (loads files)
-  ``model.py`` (defines the RNN)
-  ``train.py`` (runs training)
-  ``predict.py`` (runs ``predict()`` with command line arguments)
-  ``server.py`` (serve prediction as a JSON API with bottle.py)

Run ``train.py`` to train and save the network.

Run ``predict.py`` with a name to view predictions:

::

    $ python predict.py Hazaki
    (-0.42) Japanese
    (-1.39) Polish
    (-3.51) Czech

Run ``server.py`` and visit http://localhost:5533/Yourname to get JSON
output of predictions.




Exercises
=========

-  Try with a different dataset of line -> category, for example:

   -  Any word -> language
   -  First name -> gender
   -  Character name -> writer
   -  Page title -> blog or subreddit

-  Get better results with a bigger and/or better shaped network

   -  Add more linear layers
   -  Try the ``nn.LSTM`` and ``nn.GRU`` layers
   -  Combine multiple of these RNNs as a higher level network
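
As a starting point for the ``nn.GRU`` exercise, here is a minimal sketch of
a classifier built on PyTorch's GRU layer. The class name ``GRUClassifier``
and the sizes (57 letters, 128 hidden units, 18 categories) are assumptions
chosen to mirror the tutorial, and the model is untrained:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):  # hypothetical name, not part of the tutorial
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size)  # expects <seq_len x batch x input_size>
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        # The GRU loops over the letters internally; h_n is <1 x batch x hidden_size>
        _, h_n = self.gru(line_tensor)
        # Classify from the final hidden state, giving <batch x output_size>
        return self.softmax(self.out(h_n[0]))

model = GRUClassifier(57, 128, 18)
output = model(torch.zeros(5, 1, 57))  # a dummy 5-letter name
print(output.size())
```

Because the layer consumes the whole ``<line_length x 1 x n_letters>`` tensor
at once, the per-letter loop in ``train`` and ``evaluate`` would no longer be
needed with a model like this.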


