# TOC

__Chapter 6 - Deep learning with sequence data and text__

1. [Import](#Import)
1. [Word embedding](#Word-embedding)
    1. [Training word embedding by building a sentiment classifier](#Training-word-embedding-by-building-a-sentiment-classifier)
    1. [torchtext.datasets](#torchtextdatasets)
    1. [Building vocabulary](#Building-vocabulary)
    1. [Generate batches of vectors](#Generate-batches-of-vectors)

# Import

<a id = 'Import'></a>

In [1]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))

# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.6f}'.format

# pytorch tools
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch.autograd import Variable
from torchvision import datasets, models, transforms

# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('whitegrid')


# Word embedding

Word embedding is a popular way of representing text data in problems that are solved by deep learning algorithms. This technique provides a dense representation of each word as a vector of floats. The dimension of the vector is chosen based on the vocabulary size. It is common to use word embeddings of dimension 50, 100, 256, 300 and occasionally 1,000. The dimension is a hyperparameter.

Contrast this with one-hot encoding: with a vocabulary of 20,000 words, we end up with 20,000 x 20,000 numbers, the vast majority of which are zero. The same vocabulary can be represented as a word embedding matrix of size 20,000 x (dimension size).
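To make the size difference concrete, the following sketch counts the numbers stored by each representation. The embedding dimension of 300 is an assumed value picked from the common sizes listed above.

```python
# Compare how many numbers each representation stores for the same vocabulary.
vocab_size = 20_000
embedding_dim = 300  # assumed for illustration; any of the common sizes works

one_hot_values = vocab_size * vocab_size       # one vocab_size-long vector per word
embedding_values = vocab_size * embedding_dim  # one embedding_dim-long vector per word

print(f"one-hot:   {one_hot_values:,} values")
print(f"embedding: {embedding_values:,} values")
print(f"ratio:     {one_hot_values / embedding_values:.1f}x")
```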

One method for creating word embeddings is to start with dense vectors of random numbers for each token, then train a model (such as a document classifier or sentiment classifier). The floating point numbers in the vectors, which collectively represent the tokens, are adjusted during training so that semantically 'close' words end up with similar representations.
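One common way to measure how 'close' two representations are is cosine similarity. The sketch below uses hypothetical 4-dimensional vectors whose values are made up for illustration. Real embeddings would come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; the numbers are invented, not learned.
embeddings = {
    "good":  [0.9, 0.1, 0.3, 0.0],
    "great": [0.8, 0.2, 0.4, 0.1],
    "awful": [-0.7, 0.1, -0.5, 0.2],
}

sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["awful"])
print(f"good vs great: {sim_close:.3f}")
print(f"good vs awful: {sim_far:.3f}")
```

In this toy setup, the pair with similar vectors scores close to 1, while the pair pointing in opposite directions scores below 0.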

Training word embeddings may not be feasible if there isn't enough data. In these cases, embeddings trained by some other machine learning algorithm can be used.

<a id = 'Word-embedding'></a>

## Training word embedding by building a sentiment classifier

Using a dataset called IMDB (which contains movie reviews), we will build a sentiment classifier. In the process of training the model, we will also train word embeddings for the words in the IMDB dataset. This will be done using a library called torchtext.

The torchtext.data module has a class called Field, which defines how the data needs to be read and tokenized. Below, we define two Field objects, one for the text itself and a second for the labels. The Field constructor also accepts a tokenize argument, which by default uses the str.split function. We can override this by passing in a tokenizer of our choice.

<a id = 'Training-word-embedding-by-building-a-sentiment-classifier'></a>

In [None]:
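# define how the review text and the labels are read and tokenized
from torchtext import data

# lower-case the text, put the batch dimension first, pad or truncate each review to 20 tokens
text = data.Field(lower = True, batch_first = True, fix_length = 20)
# labels are single values rather than token sequences
label = data.Field(sequential = False)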
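The effect of the tokenize, lower and fix_length arguments can be sketched in plain Python. This is not torchtext's implementation. `regex_tokenize` and `preprocess` are hypothetical helpers, and the `<pad>` token is an assumption about how short sequences are filled.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace, like the default str.split tokenizer."""
    return text.split()

def regex_tokenize(text):
    """Hypothetical alternative tokenizer that separates punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text, tokenize, fix_length, pad_token="<pad>"):
    """Tokenize, lower-case, then truncate or pad to exactly fix_length tokens."""
    tokens = [token.lower() for token in tokenize(text)][:fix_length]
    return tokens + [pad_token] * (fix_length - len(tokens))

review = "A great film, truly!"
print(preprocess(review, whitespace_tokenize, fix_length=6))
print(preprocess(review, regex_tokenize, fix_length=6))
```

The whitespace tokenizer leaves punctuation attached to words (`film,`), which is one reason to pass in a different tokenizer.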


## torchtext.datasets

torchtext.datasets provides wrappers for several different datasets, such as IMDB. This utility abstracts away the process of downloading, tokenizing and splitting the datasets.

<a id = 'torchtextdatasets'></a>

In [None]:
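# download IMDB and split it into train and test sets
# (imported under an alias so it does not shadow the torchvision datasets imported earlier)
from torchtext import datasets as text_datasets
train, test = text_datasets.IMDB.splits(text, label)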


In [None]:
# inspect the fields attached to the training set
print('train.fields', train.fields)

# inspect the first training example
print(vars(train[0]))


## Building vocabulary

We can use the build_vocab method to build a vocabulary from a dataset object. Below, we pass in the train object and, using the vectors argument, initialize the word vectors with pretrained GloVe embeddings of dimension 300. The max_size argument limits the number of words in the vocabulary, and min_freq removes any word that occurs fewer than 10 times.

Once the vocabulary is built we can obtain different values such as frequency, word index and the vector representation of each word.

<a id = 'Building-vocabulary'></a>

In [None]:
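# build the vocabulary, initializing word vectors from pretrained GloVe embeddings
from torchtext.vocab import GloVe
text.build_vocab(train, vectors = GloVe(name = '6B', dim = 300), max_size = 10000, min_freq = 10)
label.build_vocab(train)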


In [None]:
# print word frequencies
print(text.vocab.freqs)


In [None]:
# print word vectors, which displays the 300 dimension vector for each word
print(text.vocab.vectors)


In [None]:
# print words and their indexes
print(text.vocab.stoi)
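The roles of max_size and min_freq, and of the frequency and word-index lookups, can be sketched with a Counter. This is a simplified stand-in, not torchtext's own code. The `build_vocab` function below is hypothetical, and placing the special tokens `<unk>` and `<pad>` at the first two indexes is an assumption.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, min_freq, specials=("<unk>", "<pad>")):
    """Return (freqs, itos, stoi) for a list of token lists."""
    freqs = Counter(token for tokens in tokenized_texts for token in tokens)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically), capped at max_size entries.
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    kept = [token for token, count in ranked if count >= min_freq][:max_size]
    itos = list(specials) + kept                                # index -> token
    stoi = {token: index for index, token in enumerate(itos)}   # token -> index
    return freqs, itos, stoi

docs = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["the", "plot", "was", "good"],
]
freqs, itos, stoi = build_vocab(docs, max_size=3, min_freq=2)
print(freqs.most_common(3))
print(itos)
print(stoi.get("plot", stoi["<unk>"]))  # words outside the vocabulary map to <unk>
```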


## Generate batches of vectors

BucketIterator is a tool that helps to batch the text and replace each word with its index number in the vocabulary. The following code creates iterators that generate batches for the train and test objects.

<a id = 'Generate-batches-of-vectors'></a>

In [None]:
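# create iterators that yield batches of 18 examples for the train and test sets
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_size = 18, device = -1, shuffle = True)

# pull one batch and inspect the word indexes and the labels
batch = next(iter(train_iter))
print(batch.text)

print(batch.label)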
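The two steps described above, replacing words with their indexes and grouping examples into batches, can be sketched in plain Python. The vocabulary and examples are hypothetical toy data. The sketch leaves out shuffling and tensor conversion, both of which the real iterator handles.

```python
def numericalize(tokens, stoi, unk_index=0):
    """Replace each token with its vocabulary index (unknown words get unk_index)."""
    return [stoi.get(token, unk_index) for token in tokens]

def make_batches(examples, stoi, batch_size):
    """Yield lists of index sequences, batch_size examples at a time."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        yield [numericalize(tokens, stoi) for tokens in chunk]

# Hypothetical toy vocabulary and already padded examples.
stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "movie": 3, "was": 4, "good": 5}
examples = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["good", "movie", "<pad>", "<pad>"],
]
batches = list(make_batches(examples, stoi, batch_size=2))
for batch_indexes in batches:
    print(batch_indexes)
```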


