# Thinking in tensors in PyTorch

Hands-on training by [Piotr Migdał](https://p.migdal.pl) (2019). Version 0.4 for Uniwersytet Śląski.

**Work in progress**

## RNN: Text one-hot encoding, names part 1

We use [US Baby Names - Kaggle Dataset](https://www.kaggle.com/kaggle/us-baby-names).
If you need to download it, you can use `!wget -O NationalNames.csv.zip "https://www.dropbox.com/s/s14l44ptqevgech/NationalNames.csv.zip?dl=1"` and then unzip the file into `./data/`.

See also:

* [The Most Unisex Names in US History](https://flowingdata.com/2013/09/25/the-most-unisex-names-in-us-history/)
* [Why Most European Names Ending in A Are Female](http://blog-en.namepedia.org/2015/11/why-most-european-names-ending-in-a-are-female/)

And for Polish names and surnames:

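* [Najpopularniejsze imiona w Polsce - Otwarte Dane](https://dane.gov.pl/dataset/219) (most popular first names in Poland, Polish open data portal)
* [Nazwiska występujące w rejestrze PESEL - Otwarte Dane](https://dane.gov.pl/dataset/568) (surnames appearing in the PESEL registry, Polish open data portal)
* [nazwiska-polskie.pl](https://nazwiska-polskie.pl/) (Polish surnames)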
* [List of polish first and last names - Kaggle Dataset](https://www.kaggle.com/djablo/list-of-polish-first-and-last-names/home)

In [None]:
%matplotlib inline
from collections import Counter
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
import h5py

In [None]:
names = pd.read_csv("./data/NationalNames.csv")

In [None]:
names.info()

In [None]:
names.head()

In [None]:
names['Year'].max()

In [None]:
names2014 = names.loc[lambda df: df['Year'] == 2014]

In [None]:
names2014.shape

In [None]:
names2014.sample(5)

In [None]:
names2014['Gender'].value_counts()

In [None]:
names2014['Name'].apply(len).value_counts().sort_index()

In [None]:
y = names2014['Gender'].map({'F': 0, 'M': 1}).values.astype('int64')

In [None]:
y[:5]

In [None]:
X_text = list(names2014['Name'])

In [None]:
X_text[:5]

In [None]:
char_count = Counter()
for name in X_text:
    char_count.update(name)

In [None]:
char_count.most_common(5)

In [None]:
char_count.keys()

In [None]:
char_count_lower = Counter()
for name in X_text:
    char_count_lower.update(name.lower())

In [None]:
chars = sorted(char_count_lower.keys())
"".join(chars)

In [None]:
char2id = {c: i for i, c in enumerate(chars)}

In [None]:
char2id

In [None]:
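# One-hot encode each name as a (characters x positions) matrix.
# Names shorter than max_len are padded with all-zero columns;
# max_len must be at least the length of the longest name, or indexing fails.
max_len = 16
X = np.zeros((len(X_text), len(chars), max_len), dtype='float32')
for i, name in enumerate(X_text):
    for j, c in enumerate(name.lower()):
        X[i, char2id[c], j] = 1.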
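
The encoding above can be checked on a tiny example. The following is a minimal sketch, not part of the original notebook: the names `toy_names`, `one_hot_encode` and `one_hot_decode` are hypothetical helpers, and truncating names longer than `max_len` is an assumption (the cell above would raise an `IndexError` instead).

```python
import numpy as np

# Hypothetical toy data; the notebook uses the 2014 baby names instead.
toy_names = ["Ada", "Bob", "Eve"]
toy_chars = sorted(set("".join(toy_names).lower()))
toy_char2id = {c: i for i, c in enumerate(toy_chars)}
toy_max_len = 4


def one_hot_encode(names, char2id, max_len):
    """Return a float32 array of shape (n_names, n_chars, max_len)."""
    X = np.zeros((len(names), len(char2id), max_len), dtype="float32")
    for i, name in enumerate(names):
        # Assumption: names longer than max_len are truncated.
        for j, c in enumerate(name.lower()[:max_len]):
            X[i, char2id[c], j] = 1.0
    return X


def one_hot_decode(x, chars):
    """Invert the encoding for a single (n_chars, max_len) matrix."""
    used = x.sum(axis=0) > 0          # positions that hold a character
    ids = x.argmax(axis=0)[used]      # character index at each used position
    return "".join(chars[k] for k in ids)


toy_X = one_hot_encode(toy_names, toy_char2id, toy_max_len)
print(toy_X.shape)
print(one_hot_decode(toy_X[0], toy_chars))
```

Each name contributes exactly one `1` per character, so `toy_X.sum()` equals the total number of characters, and decoding the padded matrix recovers the lowercased name.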

In [None]:
sns.heatmap(pd.DataFrame(X[1], index=chars))

In [None]:
len(X)

In [None]:
len(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=137)

In [None]:
with h5py.File("data/names_dense.h5", "w") as f:  # explicit write mode; h5py 3.x opens read-only by default
    f.create_dataset('X_train', data=X_train)
    f.create_dataset('y_train', data=y_train)
    f.create_dataset('X_test', data=X_test)
    f.create_dataset('y_test', data=y_test)
    f.create_dataset('characters', data=np.array(chars, dtype='S1'))
    f.create_dataset('categories', data=np.array(['F', 'M'], dtype='S1'))
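
To read such a file back, for example in a later notebook, something along these lines should work. This is a minimal sketch that writes and reads a small temporary file rather than `data/names_dense.h5`; the file name `toy_names.h5` and the variable names are hypothetical. Note that strings stored with `dtype='S1'` come back as bytes and need decoding.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file standing in for data/names_dense.h5.
path = os.path.join(tempfile.mkdtemp(), "toy_names.h5")

# Write a tiny file with the same kinds of datasets as above.
with h5py.File(path, "w") as f:
    f.create_dataset("X_train", data=np.zeros((2, 3, 4), dtype="float32"))
    f.create_dataset("characters", data=np.array(["a", "b", "c"], dtype="S1"))

# Read it back: slicing with [:] loads a dataset into a NumPy array.
with h5py.File(path, "r") as f:
    X_loaded = f["X_train"][:]
    chars_loaded = [c.decode("ascii") for c in f["characters"][:]]

print(X_loaded.shape, X_loaded.dtype)
print(chars_loaded)
```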