
##Simple sequence classification

Let’s start by creating a many-to-one classification problem. 

What do I mean by this?

We will have many inputs (one at every time step) but only one output: the class label we are trying to predict.

As a simple sequence classification problem, we will identify the language a name comes from.

For example, “Steven” is an English name.
Note that this problem can’t be solved perfectly: “Frank”, for example, could be English
or German. We should therefore expect some errors, both from this kind of ambiguity and from the simplifications we make.

<img src='https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/inside-deep-learing/04-recurrent-neural-networks/images/2.png?raw=1' width='600'/>

RNN process for classifying a name’s source language. The individual characters of a name
make the sequence that is fed into the RNN. 

We learn how to convert each character into a vector and
how to get an RNN to process that sequence and return a final activation $h_T$. A linear
layer at the end turns $h_T$ into a prediction.
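
The pipeline in the figure can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model built later in the notebook: the class name `LastStepClassifier` and the sizes (57 characters, 18 languages, 64-dimensional vectors) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class LastStepClassifier(nn.Module):
    """Hypothetical many-to-one model: characters -> vectors -> RNN -> h_T -> linear layer."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, n_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # one vector per character
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # processes the sequence
        self.out = nn.Linear(hidden_dim, n_classes)                 # prediction from h_T

    def forward(self, x):
        # x: (batch, T) tensor of integer character indices
        vectors = self.embed(x)     # (batch, T, embed_dim)
        _, h_n = self.rnn(vectors)  # h_n: (num_layers, batch, hidden_dim)
        h_T = h_n[-1]               # final activation of the last layer: (batch, hidden_dim)
        return self.out(h_T)        # (batch, n_classes) logits, one row per name

model = LastStepClassifier(vocab_size=57, embed_dim=64, hidden_dim=64, n_classes=18)
batch = torch.randint(0, 57, (2, 6))  # two made-up names, six characters each
logits = model(batch)
print(logits.shape)  # torch.Size([2, 18])
```

However long the input sequence is, the model returns a single vector of scores per name, which is what makes the problem many-to-one.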



##Setup

In [None]:
!wget https://github.com/EdwardRaff/Inside-Deep-Learning/raw/main/idlmam.py

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision 
from torchvision import transforms

from torch.utils.data import Dataset, DataLoader

from tqdm.autonotebook import tqdm

import requests, zipfile, io
import unicodedata
import string

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

from sklearn.metrics import accuracy_score

from scipy.signal import convolve

import time

from idlmam import train_simple_network, set_seed, Flatten, weight_reset

In [3]:
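%matplotlib inline
# IPython.display.set_matplotlib_formats is deprecated; matplotlib_inline provides the same function
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')

# Seeds PyTorch and NumPy for reproducibility (shadows the set_seed imported from idlmam)
def set_seed(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)

torch.backends.cudnn.deterministic = True
set_seed(42)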

In [4]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Let's download the dataset and extract all the files.

In [5]:
zip_file_url = "https://download.pytorch.org/tutorial/data.zip"

# Zip file is organized as data/names/[LANG].txt , where [LANG] is a specific language
req = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(req.content))
z.extractall()

##Preparing dataset

Since this dataset is pretty small, we load all of it into memory.

In [8]:
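# Dictionary that maps each language name to its list of names
namge_language_data = {}

# Vocabulary: ASCII letters plus a few punctuation characters
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
# Maps each character to its integer index in all_letters
alphabet = {}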

for i in range(n_letters):
  alphabet[all_letters[i]] = i
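
As a quick check of the mapping, the sketch below rebuilds the same `alphabet` dictionary and encodes one name as a list of integer indices. The helper `name_to_indices` is a hypothetical name introduced only for this illustration.

```python
import string

all_letters = string.ascii_letters + " .,;'"
alphabet = {letter: i for i, letter in enumerate(all_letters)}

def name_to_indices(name):
    # Hypothetical helper: looks up each character's position in all_letters
    return [alphabet[c] for c in name]

print(len(alphabet))              # 57
print(name_to_indices("steven"))  # [18, 19, 4, 21, 4, 13]
```

Lowercase letters occupy indices 0 to 25 because `string.ascii_letters` lists them before the uppercase letters.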

In [9]:
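def unicode_to_ascii(s):
  # NFD splits accented characters into base + combining mark ("Mn"); keep only base characters found in all_letters
  return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn" and c in all_letters)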
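
A few inputs show what the function does in practice. The sample strings are made up for illustration, and the function is repeated here (with the same logic) so the block runs on its own.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def unicode_to_ascii(s):
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in all_letters
    )

print(unicode_to_ascii("Ślusàrski"))  # Slusarski (accents stripped)
print(unicode_to_ascii("O'Néàl"))     # O'Neal    (the apostrophe is in all_letters, so it stays)
print(unicode_to_ascii("Smith3"))     # Smith     (digits are not in all_letters, so they are dropped)
```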

In [10]:
# Loops through every language file in the zip archive and reads all of its lines
for zip_path in z.namelist():
  if "data/names/" in zip_path and zip_path.endswith(".txt"):
    lang = zip_path[len("data/names/"): -len(".txt")]
    with z.open(zip_path) as name_file:
      # Turns each Unicode name into plain ASCII and lowercases it
      lang_names = [unicode_to_ascii(line).lower() for line in name_file.read().decode("utf-8").strip().split("\n")]
      namge_language_data[lang] = lang_names
    print(f"{lang}: {len(lang_names)}")

Arabic: 2000
Chinese: 268
Czech: 519
Dutch: 297
English: 3668
French: 277
German: 724
Greek: 203
Irish: 232
Italian: 709
Japanese: 991
Korean: 94
Polish: 139
Portuguese: 74
Russian: 9408
Scottish: 100
Spanish: 298
Vietnamese: 73


Now we have created a dataset, which you may notice is not well balanced: there are far
more Russian names (9,408) than names from any other language. English comes second with 3,668.
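
To put a number on the imbalance, the sketch below recomputes the class shares from the counts printed above. The share of the largest class is also the accuracy of a trivial classifier that always answers with that class. This majority-class baseline is a common sanity check introduced here for illustration; it is not part of the original notebook.

```python
# Name counts per language, copied from the output above
counts = {
    "Arabic": 2000, "Chinese": 268, "Czech": 519, "Dutch": 297,
    "English": 3668, "French": 277, "German": 724, "Greek": 203,
    "Irish": 232, "Italian": 709, "Japanese": 991, "Korean": 94,
    "Polish": 139, "Portuguese": 74, "Russian": 9408, "Scottish": 100,
    "Spanish": 298, "Vietnamese": 73,
}

total = sum(counts.values())
majority = max(counts, key=counts.get)
share = counts[majority] / total  # also the accuracy of always predicting the majority class

print(total)           # 20074
print(majority)        # Russian
print(f"{share:.1%}")  # 46.9%
```

A model trained on this data therefore needs to score well above roughly 47% accuracy before it can be said to have learned anything beyond the class frequencies.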