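[Assumptions]
- Languages written with the 26-letter alphabet can be told apart.
- Each language uses the letters with different frequencies.

[Model]
- Given alphabetic input, output the corresponding language.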

<hr>

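[Approach]
- Compute the letter frequencies for each language.
    - Method: read each file, split it into words, and count the letters in each word.
- Compare the letter frequencies of the input against each language's frequencies and output the most similar language.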
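
The approach above can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: the sample sentences, the helper names `letter_freq`, `cosine`, and `closest_language`, and the choice of cosine similarity as the "most similar" measure are all assumptions.

```python
import math
import string
from collections import Counter

LETTERS = string.ascii_lowercase  # the 26 letters a-z

def letter_freq(text):
    """Return the relative frequency of each letter a-z as a 26-element list."""
    counts = Counter(ch for ch in text.lower() if ch in LETTERS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 26
    return [counts[ch] / total for ch in LETTERS]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_language(text, profiles):
    """Return the language whose frequency profile is most similar to the text."""
    query = letter_freq(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))

# Tiny made-up corpora (assumption); the notebook reads files from disk instead.
samples = {
    'en': 'the school is located on the campus of the museum',
    'fr': 'le musée est situé dans le quartier de la ville',
}
profiles = {lang: letter_freq(text) for lang, text in samples.items()}
print(closest_language('the museum is on the campus', profiles))
```

With corpora this small the result is not reliable; the point is only the shape of the computation: one 26-element profile per language, then a similarity comparison.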

[Model]
- Uses deep learning
- Input data: letter frequencies
- Labels: 4 languages - en, fr, id, tl
- Activation function: softmax
- Loss function: categorical_crossentropy
- Optimizer: adam
- Evaluation metric: accuracy
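
To make the softmax and categorical cross-entropy pairing concrete, here is a small hand computation in plain Python. The logits are made-up numbers and the label order in `LANGS` is an assumption; it only illustrates the arithmetic, not the trained model.

```python
import math

LANGS = ['en', 'fr', 'id', 'tl']  # label order is an assumption

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(lang):
    """One-hot encode a language code against LANGS."""
    return [1.0 if name == lang else 0.0 for name in LANGS]

def categorical_crossentropy(target, probs):
    """-sum(target * log(probs)); only the true class contributes."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

logits = [2.0, 0.5, 0.1, -1.0]  # made-up scores for one sample
probs = softmax(logits)
loss = categorical_crossentropy(one_hot('en'), probs)
print(probs, loss)
```

The loss is just the negative log of the probability assigned to the true language, so it approaches 0 as that probability approaches 1.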

[Outline]
1. Data preparation
2. Model definition
3. Training
4. Evaluation
5. Prediction
6. Saving the model

In [1]:
# 1. Load Data
train_dir = '../DATA/lang_data/train/'
test_dir = '../DATA/lang_data/test/'

import os
import numpy as np

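def load_data(data_dir, start):
    # Select files whose names start with the given two-letter language code
    data = []
    for file in os.listdir(data_dir):
        if file.startswith(start):
            # Use a context manager so each file handle is closed after reading
            with open(os.path.join(data_dir, file), 'r', encoding='utf-8') as f:
                data.append(f.read())
    return data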

# en, fr, id, tl
en_tr_data = load_data(train_dir, 'en')
fr_tr_data = load_data(train_dir, 'fr')
id_tr_data = load_data(train_dir, 'id')
tl_tr_data = load_data(train_dir, 'tl')

print('en:', len(en_tr_data), 'fr:', len(fr_tr_data), 'id:', len(id_tr_data), 'tl:', len(tl_tr_data))
print(en_tr_data[:2])

en: 10 fr: 10 id: 10 tl: 10
['\n\n\n\nThe main Henry Ford Museum building houses some of the classrooms for the Henry Ford Academy\n\n\nHenry Ford Academy is the first charter school in the United States to be developed jointly by a global corporation, public education, and a major nonprofit cultural institution. The school is sponsored by the Ford Motor Company, Wayne County Regional Educational Service Agency and The Henry Ford Museum and admits high school students. It is located in Dearborn, Michigan on the campus of the Henry Ford museum. Enrollment is taken from a lottery in the area and totaled 467 in 2010.[1]\nFreshman meet inside the main museum building in glass walled classrooms, while older students use a converted carousel building and Pullman cars on a siding of the Greenfield Village railroad. Classes are expected to include use of the museum artifacts, a tradition of the original Village Schools. When the Museum was established in 1929, it included a school which served

In [2]:
# 2. Preprocessing
# 2-1. Split into words

def split_words(lang_data):
    total = []
    for text in lang_data:
        # Split each document on whitespace
        for word in text.split():
            total.append(word)
    return total

en_tr_split = split_words(en_tr_data)
fr_tr_split = split_words(fr_tr_data)
id_tr_split = split_words(id_tr_data)
tl_tr_split = split_words(tl_tr_data)

print(en_tr_split)



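Next, use these words as features and attach labels to them (the dataset).  
Then train a deep learning model on this dataset.  
For training, split each word into its letters and compute the frequency of each letter.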
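
One way to read the plan above is "one row per word: 26 letter frequencies plus a language label". The sketch below follows that reading; the helper names `word_letter_counts` and `build_rows` are hypothetical, and restricting the features to a-z is an assumption.

```python
from collections import Counter

def word_letter_counts(word):
    """Count occurrences of a-z in one word; returns a 26-element list."""
    counts = Counter(ch for ch in word.lower() if 'a' <= ch <= 'z')
    return [counts[chr(ord('a') + i)] for i in range(26)]

def build_rows(words, label):
    """Turn words into (feature_vector, label) pairs with relative letter frequencies."""
    rows = []
    for word in words:
        counts = word_letter_counts(word)
        n = sum(counts)
        if n:  # skip tokens with no a-z letters, e.g. '2010.[1]'
            rows.append(([c / n for c in counts], label))
    return rows

# First few English words from the training text, plus one non-word token.
en_words = ['The', 'main', 'Henry', 'Ford', 'Museum', '2010.[1]']
rows = build_rows(en_words, label=0)
print(len(rows), rows[0][1])
```

Every feature vector has the same length regardless of word length, which is what a fixed-size input layer needs.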

In [3]:
len(en_tr_split)

40058

In [4]:
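# 2-2. Convert words to numbers with ord()

def word_to_num(lang_data):
    num_data = []
    for word in lang_data:
        # Keep alphabetic words only; note that str.isalpha() also accepts non-ASCII letters such as 'é'
        if word.isalpha():
            num_data.append([ord(char.lower()) for char in word])
    return num_data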

en_tr_num = word_to_num(en_tr_split)
en_tr_num[:5]

[[116, 104, 101],
 [109, 97, 105, 110],
 [104, 101, 110, 114, 121],
 [102, 111, 114, 100],
 [109, 117, 115, 101, 117, 109]]
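
`str.isalpha()` is true for any Unicode letter, not only a-z, so accented or non-Latin words pass the filter too. If only the 26 basic letters are wanted, one option is to combine it with `str.isascii()` (Python 3.7+). The sketch below uses hypothetical helper names and a made-up word list.

```python
def is_ascii_word(word):
    """True only for non-empty words made entirely of the letters a-z / A-Z."""
    return word.isascii() and word.isalpha()

def word_to_codes(words):
    """Convert ASCII-only alphabetic words to lists of lowercase code points."""
    return [[ord(ch) for ch in w.lower()] for w in words if is_ascii_word(w)]

words = ['The', 'musée', 'Ford', '2010']  # made-up sample
print([w for w in words if w.isalpha()])  # 'musée' passes isalpha()
print(word_to_codes(words))               # but is dropped by the ASCII check
```

Dropping whole words is one design choice; another would be to keep the word and skip only the non-ASCII characters.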

In [5]:
fr_to_num = word_to_num(fr_tr_split)
id_to_num = word_to_num(id_tr_split)
tl_to_num = word_to_num(tl_tr_split)

print(id_to_num[:5])

[[103, 111, 114, 105, 108, 97], [106, 97, 110, 116, 97, 110], [98, 101, 116, 105, 110, 97], [100, 101, 110, 103, 97, 110], [97, 110, 97, 107, 110, 121, 97]]


In [6]:
# Flatten into a single list
def l_to_num(lang_data):
    num_data = []
    for codes in lang_data:
        for num in codes:
            num_data.append(num)
    return num_data

en_list = l_to_num(en_tr_num)
fr_list = l_to_num(fr_to_num)
id_list = l_to_num(id_to_num)
tl_list = l_to_num(tl_to_num)

print(en_list[:10])
print(type(en_list[0]))


[116, 104, 101, 109, 97, 105, 110, 104, 101, 110]
<class 'int'>


In [7]:
# 2-3. Create the Dataset
# Remove duplicates? On hold for now
import torch
from torch.utils.data import Dataset, DataLoader

class LangDataset(Dataset):
    def __init__(self, lang_data, lang_label):
        # lang_data: list of int; 
        # lang_label: 0, 1, 2, 3
        super(LangDataset, self).__init__()
        self.features = torch.tensor(lang_data, dtype=torch.long)
        self.labels = torch.tensor([lang_label]*len(lang_data), dtype=torch.long)
        
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

en_tr_dataset = LangDataset(en_list, 0)
fr_tr_dataset = LangDataset(fr_list, 1)
id_tr_dataset = LangDataset(id_list, 2)
tl_tr_dataset = LangDataset(tl_list, 3)

print(en_tr_dataset.features[:5], en_tr_dataset.labels[:5])

tensor([116, 104, 101, 109,  97]) tensor([0, 0, 0, 0, 0])


In [8]:
# DataLoader
Batch = 2048

# Combine the training data; features and labels are kept as-is
from torch.utils.data import ConcatDataset
train_dataset = ConcatDataset([en_tr_dataset, fr_tr_dataset, id_tr_dataset, tl_tr_dataset])
print(len(train_dataset), train_dataset[1])

train_loader = DataLoader(train_dataset, batch_size=Batch, shuffle=True, drop_last=True)
print(len(train_loader), train_loader.dataset[1])
print(train_loader.dataset[1][0])

493629 (tensor(104), tensor(0))
241 (tensor(104), tensor(0))
tensor(104)
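
The batch count printed above follows directly from the dataset size and `drop_last=True`. A quick check of the arithmetic in plain Python:

```python
import math

n_samples = 493629   # len(train_dataset), from the output above
batch_size = 2048

full_batches = n_samples // batch_size            # batches kept with drop_last=True
dropped = n_samples - full_batches * batch_size   # samples in the discarded partial batch
all_batches = math.ceil(n_samples / batch_size)   # batches with drop_last=False

print(full_batches, dropped, all_batches)  # 241 61 242
```

So each epoch skips 61 samples; because `shuffle=True`, a different 61 are skipped each time.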


In [9]:
# 3. Model definition
# - Deep learning model; activation: relu, loss: cross entropy, optimizer: adam
import torch.nn as nn

class NewClassiModel(nn.Module):
    def __init__(self, IN, OUT):
        super(NewClassiModel, self).__init__()
        self.input = nn.Linear(IN, 256)
        self.hidden1 = nn.Linear(256, 128)
        self.hidden2 = nn.Linear(128, 64)
        self.output = nn.Linear(64, OUT)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        y = self.input(x)
        y = self.relu(y)
        y = self.hidden1(y)
        y = self.relu(y)
        y = self.hidden2(y)
        y = self.relu(y)
        y = self.output(y)
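        # NOTE: nn.CrossEntropyLoss already applies log-softmax internally, so
        # returning the raw logits (skipping this softmax) is the usual choice.
        y = self.softmax(y)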
        return y

In [10]:
model = NewClassiModel(1, 4).to('cpu')
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
pred = model(train_loader.dataset[1][0].view(1, -1).float())
label = train_loader.dataset[1][1]
loss = loss_fn(pred, label.view(1))
print(loss)

tensor(1.6164, grad_fn=<NllLossBackward0>)
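
`nn.CrossEntropyLoss` combines log-softmax and negative log-likelihood, so when the model's `forward` already ends in a softmax, the loss effectively sees softmax applied twice. The plain-Python sketch below reproduces only that math for a single 4-class sample (it is not PyTorch code) and shows that the loss is then confined to a narrow range.

```python
import math

def softmax(values):
    """Standard softmax with max-subtraction for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Per-sample math of cross-entropy on logits: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])

# The "logits" here are already probabilities, i.e. outputs of the model's own softmax.
# Best case: all probability on the true class.
best = cross_entropy_from_logits([1.0, 0.0, 0.0, 0.0], target=0)
# Worst case: all probability on one wrong class.
worst = cross_entropy_from_logits([0.0, 1.0, 0.0, 0.0], target=0)

print(round(best, 4), round(worst, 4))  # 0.7437 1.7437
```

The loss printed above (1.6164) lies inside this range, and under this setup the loss cannot fall below about 0.74 even for a perfect prediction.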


In [11]:
for i, (x, y) in enumerate(train_loader):
    print(x.view(1,-1), y)
    break

tensor([[105, 116, 115,  ..., 111, 116, 109]]) tensor([2, 1, 3,  ..., 0, 0, 1])


In [12]:
# Features: 32 planned (each sample below still holds a single value)
for i, (x, y) in enumerate(train_loader):
    x = x.view(Batch, -1).float()
    y = y.view(Batch)
    print(x, y)
    break

tensor([[116.],
        [112.],
        [110.],
        ...,
        [114.],
        [114.],
        [ 98.]]) tensor([2, 3, 3,  ..., 1, 0, 0])


In [13]:
# 4. Model training
# - loss, optimizer

# Fix the input dimensions of the model
# IN, OUT = 32, 32

# model = NewClassiModel(IN, OUT)

# loss_fn = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# # Training
# EPOCHS = 10
# for epoch in range(EPOCHS): 
#     for data, label in train_loader:
#         data = data.view(Batch, -1).float()
#         optimizer.zero_grad()
#         pred = model(data.float())
#         print(pred, label)
#         loss = loss_fn(pred, label.float())
#         loss.backward()
#         optimizer.step()
        
#     print(f'Epoch [{epoch+1}/{EPOCHS}], Loss: {loss.item():.4f}')


In [14]:
# 5. Model evaluation


Starting over from en_list

In [15]:
print(en_list)
print(len(en_list))

[116, 104, 101, 109, 97, 105, 110, 104, 101, 110, 114, 121, 102, 111, 114, 100, 109, 117, 115, 101, 117, 109, 98, 117, 105, 108, 100, 105, 110, 103, 104, 111, 117, 115, 101, 115, 115, 111, 109, 101, 111, 102, 116, 104, 101, 99, 108, 97, 115, 115, 114, 111, 111, 109, 115, 102, 111, 114, 116, 104, 101, 104, 101, 110, 114, 121, 102, 111, 114, 100, 97, 99, 97, 100, 101, 109, 121, 104, 101, 110, 114, 121, 102, 111, 114, 100, 97, 99, 97, 100, 101, 109, 121, 105, 115, 116, 104, 101, 102, 105, 114, 115, 116, 99, 104, 97, 114, 116, 101, 114, 115, 99, 104, 111, 111, 108, 105, 110, 116, 104, 101, 117, 110, 105, 116, 101, 100, 115, 116, 97, 116, 101, 115, 116, 111, 98, 101, 100, 101, 118, 101, 108, 111, 112, 101, 100, 106, 111, 105, 110, 116, 108, 121, 98, 121, 97, 103, 108, 111, 98, 97, 108, 112, 117, 98, 108, 105, 99, 97, 110, 100, 97, 109, 97, 106, 111, 114, 110, 111, 110, 112, 114, 111, 102, 105, 116, 99, 117, 108, 116, 117, 114, 97, 108, 116, 104, 101, 115, 99, 104, 111, 111, 108, 105, 115, 1

In [16]:
# Count how often each number (code point) occurs
def mod_func(en_list):
    en_data = {}
    for num in en_list:
        if num in en_data:
            en_data[num] += 1
        else:
            en_data[num] = 1
    return en_data

mod_func(en_list)

{116: 14033,
 104: 8045,
 101: 19708,
 109: 4046,
 97: 13982,
 105: 12498,
 110: 12049,
 114: 10358,
 121: 2390,
 102: 3741,
 111: 12397,
 100: 6545,
 117: 4890,
 115: 10749,
 98: 2466,
 108: 6935,
 103: 2895,
 99: 5530,
 118: 1866,
 112: 3040,
 106: 249,
 119: 2732,
 107: 769,
 120: 297,
 122: 204,
 113: 196,
 237: 4,
 233: 71,
 225: 1,
 243: 3,
 250: 1,
 363: 1,
 333: 2,
 699: 2,
 257: 1,
 12525: 1,
 12540: 1,
 12464: 1,
 228: 13,
 1506: 1,
 1501: 1,
 231: 7,
 224: 3,
 235: 4,
 252: 1,
 227: 1,
 230: 10,
 229: 8,
 246: 14,
 244: 2,
 417: 1,
 232: 26,
 658: 1,
 239: 2,
 226: 4,
 238: 5}

In [22]:
en_data = mod_func(en_list)
fr_data = mod_func(fr_list)
id_data = mod_func(id_list)
tl_data = mod_func(tl_list)

# Sort by key, then keep only the values as a list
en_data = list(dict(sorted(en_data.items())).values())
fr_data = list(dict(sorted(fr_data.items())).values())
id_data = list(dict(sorted(id_data.items())).values())
tl_data = list(dict(sorted(tl_data.items())).values())

print(en_data[:10])

[13982, 2466, 5530, 6545, 19708, 3741, 2895, 8045, 12498, 249]
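
Sorting by key and keeping only the values gives a list whose length depends on which code points happened to occur: the English dictionary above has 56 keys, and another language will generally have a different set, so position i may not mean the same letter in every list. A fixed 26-slot vector avoids this. The helper name `to_fixed_vector` is hypothetical, and discarding everything outside a-z is an assumption.

```python
def to_fixed_vector(code_counts):
    """Map {code_point: count} to a 26-element list indexed by letter (a=0 ... z=25)."""
    vec = [0] * 26
    for code, count in code_counts.items():
        idx = code - ord('a')
        if 0 <= idx < 26:  # ignore code points outside a-z
            vec[idx] += count
    return vec

# A few entries copied from the mod_func(en_list) output; the rest are omitted.
en_counts = {116: 14033, 101: 19708, 97: 13982, 233: 71, 12525: 1}
en_vec = to_fixed_vector(en_counts)

print(len(en_vec), en_vec[0], en_vec[4], en_vec[19])  # 26 13982 19708 14033
```

With this layout every language yields a vector of the same length, and index 0 always refers to 'a'.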


In [18]:
# Create the dataset

numbering = {'en':1, 'fr':2, 'id':3, 'tl':4}

class new_dataset(Dataset):
    def __init__(self, data):
        super(new_dataset, self).__init__()
        self.features = data

Why is there no training or testing yet??

In [None]:
# training
def training(data):
    