# Recurrent Neural Network 

## Classifying Names with a Character-Level RNN 

- Reference: [RNN PyTorch Tutorial](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)

You can download the data from [this link](https://download.pytorch.org/tutorial/data.zip).

## Load the Dataset

- The dataset contains people's names from 18 different nations, one text file per nation.
- We want to train a neural network that predicts which nation a given name comes from.

In [2]:
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

def findFiles(path):
    # Return every file path matching the glob pattern
    return glob.glob(path)

print(findFiles('data/names/*.txt'))

['data/names/Czech.txt', 'data/names/German.txt', 'data/names/Arabic.txt', 'data/names/Japanese.txt', 'data/names/Chinese.txt', 'data/names/Vietnamese.txt', 'data/names/Russian.txt', 'data/names/French.txt', 'data/names/Irish.txt', 'data/names/English.txt', 'data/names/Spanish.txt', 'data/names/Greek.txt', 'data/names/Italian.txt', 'data/names/Portuguese.txt', 'data/names/Scottish.txt', 'data/names/Dutch.txt', 'data/names/Korean.txt', 'data/names/Polish.txt']


## Data Preprocessing

In [3]:
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

In [5]:
print('Letters:', all_letters)
print('Number of letters:', n_letters)

Letters: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'
Number of letters: 57
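
The alphabet above fixes the input vocabulary at 57 symbols, so each character can be identified by its position in `all_letters`. The following is a minimal sketch of that lookup, assuming a one-hot encoding is what the network will consume later; the helper names `letter_to_index` and `one_hot` are hypothetical and not part of the code above.

```python
import string

# Same alphabet as defined above
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def letter_to_index(letter):
    # Hypothetical helper: position of the letter in all_letters
    index = all_letters.find(letter)
    if index == -1:
        raise ValueError(f'{letter!r} is not in all_letters')
    return index

def one_hot(letter):
    # Hypothetical helper: list of n_letters zeros with a single 1
    vec = [0] * n_letters
    vec[letter_to_index(letter)] = 1
    return vec

print(letter_to_index('a'), letter_to_index('Z'), letter_to_index("'"))
print(sum(one_hot('b')), one_hot('b').index(1))
```

Raising an error for unknown characters is a design choice made for this sketch; it makes it obvious when a name has not been normalized to the 57-symbol alphabet first.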


#### Turn a Unicode string into plain ASCII (thanks to https://stackoverflow.com/a/518232/2809427)
- ASCII: [Link](https://ko.wikipedia.org/wiki/ASCII)

In [6]:
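def unicodeToAscii(s):
    # Decompose accented characters (NFD), drop the combining marks ('Mn'),
    # and keep only the characters that appear in all_letters
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )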

print(unicodeToAscii('Ślusàrski'))

Slusarski
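
To see why this works, it can help to look at what NFD normalization does to the string before filtering. The sketch below is illustrative only: `strip_to_ascii` is a hypothetical stand-in that mirrors `unicodeToAscii`, and the second test name is an assumed example rather than one taken from the dataset.

```python
import string
import unicodedata

# Same alphabet as defined earlier in this notebook
all_letters = string.ascii_letters + " .,;'"

name = '\u015alus\u00e0rski'  # 'Ślusàrski' written with explicit code points
decomposed = unicodedata.normalize('NFD', name)

# NFD splits each accented letter into a base letter plus a combining mark,
# so the decomposed string is longer than the original.
print(len(name), len(decomposed))

# Inspect the first two code points: the base letter and its accent.
for c in decomposed[:2]:
    print(repr(c), unicodedata.category(c))

def strip_to_ascii(s):
    # Hypothetical helper mirroring unicodeToAscii above
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

# Caveat: letters without a canonical decomposition (such as U+0141, 'Ł')
# are not in all_letters, so they are dropped rather than mapped to 'L'.
print(strip_to_ascii('\u0141ukasz'))
```

The caveat in the last lines means some names may lose a letter entirely, which is worth keeping in mind when inspecting the cleaned data.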


In [7]:
# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    # Use a context manager so the file is closed after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

In [8]:
# Target classes: one category per nation

all_categories

['Czech',
 'German',
 'Arabic',
 'Japanese',
 'Chinese',
 'Vietnamese',
 'Russian',
 'French',
 'Irish',
 'English',
 'Spanish',
 'Greek',
 'Italian',
 'Portuguese',
 'Scottish',
 'Dutch',
 'Korean',
 'Polish']

In [10]:
# Example: the names stored under the 'Czech' category

category_lines['Czech']

['Abl',
 'Adsit',
 'Ajdrna',
 'Alt',
 'Antonowitsch',
 'Antonowitz',
 'Bacon',
 'Ballalatak',
 'Ballaltick',
 'Bartonova',
 'Bastl',
 'Baroch',
 'Benesch',
 'Betlach',
 'Biganska',
 'Bilek',
 'Blahut',
 'Blazek',
 'Blazek',
 'Blazejovsky',
 'Blecha',
 'Bleskan',
 'Blober',
 'Bock',
 'Bohac',
 'Bohunovsky',
 'Bolcar',
 'Borovka',
 'Borovski',
 'Borowski',
 'Borovsky',
 'Brabbery',
 'Brezovjak',
 'Brousil',
 'Bruckner',
 'Buchta',
 'Cablikova',
 'Camfrlova',
 'Cap',
 'Cerda',
 'Cermak',
 'Chermak',
 'Cermak',
 'Cernochova',
 'Cernohous',
 'Cerny',
 'Cerney',
 'Cerny',
 'Cerv',
 'Cervenka',
 'Chalupka',
 'Charlott',
 'Chemlik',
 'Chicken',
 'Chilar',
 'Chromy',
 'Cihak',
 'Clineburg',
 'Klineberg',
 'Cober',
 'Colling',
 'Cvacek',
 'Czabal',
 'Damell',
 'Demall',
 'Dehmel',
 'Dana',
 'Dejmal',
 'Dempko',
 'Demko',
 'Dinko',
 'Divoky',
 'Dolejsi',
 'Dolezal',
 'Doljs',
 'Dopita',
 'Drassal',
 'Driml',
 'Duyava',
 'Dvorak',
 'Dziadik',
 'Egr',
 'Entler',
 'Faltysek',
 'Faltejsek',
 'Fencl',

In [11]:
# Count the names per nation and the total number of names in the dataset

num_data = 0

for nation in all_categories:
    n = len(category_lines[nation])
    print('The number of names in ' + nation + ' is', n)
    num_data += n

print('='*40)
print('Total number of data is', num_data)

The number of names in Czech is 519
The number of names in German is 724
The number of names in Arabic is 2000
The number of names in Japanese is 991
The number of names in Chinese is 268
The number of names in Vietnamese is 73
The number of names in Russian is 9408
The number of names in French is 277
The number of names in Irish is 232
The number of names in English is 3668
The number of names in Spanish is 298
The number of names in Greek is 203
The number of names in Italian is 709
The number of names in Portuguese is 74
The number of names in Scottish is 100
The number of names in Dutch is 297
The number of names in Korean is 94
The number of names in Polish is 139
========================================
Total number of data is 20074
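
The counts above are far from uniform, which may matter when training and evaluating a classifier. As a rough sketch, the share of each nation can be computed directly from the printed numbers; the `counts` dictionary below is copied by hand from that output rather than read from the files, and the variable names are hypothetical.

```python
# Per-nation name counts, copied from the output above
counts = {
    'Czech': 519, 'German': 724, 'Arabic': 2000, 'Japanese': 991,
    'Chinese': 268, 'Vietnamese': 73, 'Russian': 9408, 'French': 277,
    'Irish': 232, 'English': 3668, 'Spanish': 298, 'Greek': 203,
    'Italian': 709, 'Portuguese': 74, 'Scottish': 100, 'Dutch': 297,
    'Korean': 94, 'Polish': 139,
}

total = sum(counts.values())

# Sort nations from most to least frequent and print each share
shares = sorted(
    ((nation, n / total) for nation, n in counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for nation, share in shares:
    print(f'{nation:<12}{share:6.1%}')
```

Russian names alone make up a little under half of the data, so a model that always predicts the largest class would already reach that accuracy. One possible response, not taken from the source, is to sample training examples uniformly over categories instead of uniformly over names.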
