In [12]:
import pandas as pd

#### **Load Dataset files**

Each class contains 30,000 training samples and 1,900 test samples, so the four classes give 120,000 training samples and 7,600 test samples in total.

In [14]:
train = pd.read_csv('train.csv', header=None) 
test = pd.read_csv('test.csv', header=None) 

### .....

### .....

**Vocabulary and Dictionary Creation**

- Having transformed the sentences into an analyzable form, we build the vocabulary and the corresponding dictionary, which assigns an index to each word of the vocabulary.
- To create the vocabulary and the dictionary, we use the function **vocabulary_builder** from the vocabulary module.

In [70]:
# Usage: words, word2idx, idx2word = vocabulary_builder(df, 'sent')
from vocabulary import vocabulary_builder

In [53]:
words, word2idx, idx2word =  vocabulary_builder(train, 'Sentence')
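The vocabulary module itself is not shown in this notebook. As a rough illustration only, a builder of this kind could look like the sketch below. The name `build_vocabulary_sketch`, the frequency ordering, and the assumption that each sentence is already a list of tokens are all hypothetical and not taken from the actual module.

```python
from collections import Counter


def build_vocabulary_sketch(sentences, specials=("_PAD", "_UNK")):
    """Hypothetical stand-in for vocabulary_builder (not the real implementation).

    Assumes `sentences` is an iterable of token lists, e.g. train['Sentence'].
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    # Special tokens come first, then words in descending frequency (assumed order).
    words = list(specials) + [word for word, _ in counts.most_common()]
    word2idx = {word: idx for idx, word in enumerate(words)}
    idx2word = {idx: word for idx, word in enumerate(words)}
    return words, word2idx, idx2word


demo_sentences = [["new", "game", "today"], ["new", "year"]]
demo_words, demo_word2idx, demo_idx2word = build_vocabulary_sketch(demo_sentences)
print(demo_words[:3])  # ['_PAD', '_UNK', 'new']
```

Placing the special tokens first means the padding token gets index 0, which matches padding the sequences with 0s later on.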

To account for padding and unknown words, the vocabulary also needs the special tokens '_PAD' and '_UNK'. They appear as its first two entries in the listing of the first few words below.

In [55]:
len(words)

39568

Each word has an index. For example, the index of the word 'game' is:

In [56]:
word2idx['game']

27

In [71]:
words[:10]

['_PAD', '_UNK', 's', 'new', 'said', 'reuters', 'ap', 'gt', 'lt', 'year']

**Having created the vocabulary, we continue by replacing the words with the corresponding indexes**

In [57]:
# replace words with indices
train_sentences = word2idx_(train, 'Sentence', word2idx)
test_sentences = word2idx_(test, 'Sentence', word2idx)
valid_sentences = word2idx_(valid, 'Sentence', word2idx)
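The helper `word2idx_` is not defined in the cells shown here. The sketch below illustrates what such a mapping step might do, under the assumption that words missing from the vocabulary fall back to the '_UNK' index. The function name `encode_sentences_sketch` and its signature are hypothetical.

```python
def encode_sentences_sketch(sentences, word2idx, unk_token="_UNK"):
    """Hypothetical stand-in for word2idx_: replace each token by its index.

    Assumes unseen tokens map to the index of `unk_token`.
    """
    unk_idx = word2idx[unk_token]
    return [[word2idx.get(token, unk_idx) for token in sentence]
            for sentence in sentences]


toy_word2idx = {"_PAD": 0, "_UNK": 1, "new": 2, "game": 3}
encoded = encode_sentences_sketch([["new", "game"], ["old", "game"]], toy_word2idx)
print(encoded)  # [[2, 3], [1, 3]]
```

The fallback matters for the validation and test sets, which can contain words that never occurred in the training data used to build the vocabulary.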

**Then we pad short sentences with 0s and truncate long ones so that every sequence has the same length, which lets the data be processed in batches to speed up training.**

In [58]:
seq_len = 60

In [59]:
train_sentences_ = pad_input(train_sentences, seq_len)
valid_sentences_ = pad_input(valid_sentences, seq_len)
test_sentences_ = pad_input(test_sentences, seq_len)
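The definition of `pad_input` is not included above. A minimal sketch of the pad-and-truncate step might look as follows. It assumes zeros are appended on the right and that long sentences keep their first `seq_len` tokens; the actual helper may pad or cut on the other side. The name `pad_input_sketch` is hypothetical.

```python
import numpy as np


def pad_input_sketch(sentences, seq_len, pad_idx=0):
    """Hypothetical stand-in for pad_input: return an array of shape
    (len(sentences), seq_len), right-padded with pad_idx (assumed side)."""
    features = np.full((len(sentences), seq_len), pad_idx, dtype=np.int64)
    for row, sentence in enumerate(sentences):
        trimmed = sentence[:seq_len]  # keep at most seq_len tokens
        if trimmed:
            features[row, :len(trimmed)] = trimmed
    return features


padded = pad_input_sketch([[5, 6, 7], [1, 2, 3, 4, 5, 6]], seq_len=4)
print(padded.tolist())  # [[5, 6, 7, 0], [1, 2, 3, 4]]
```

Because every row has the same length, the result can be stacked into fixed-size batches.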

**Loading Pretrained Word Embeddings (GloVe) and Creating the embedding_matrix**

In [67]:
from pretrained_vectors import creating_matrix

In [69]:
FILE = 'glove.6B.200d.txt'
embedding_matrix = creating_matrix(FILE, 200, words)

------------------------------------------------------------
Loading glove.6B.200d.txt file ...
A total of 400000 word vectors loaded.
Pre-trained word embeddings cover 96.27% of vocabulary
Creating embedding matrix ...
Total absent words are 1477 which is 3.73 % of total words
Embeding matrix created of shape: (39569, 200)
-------------------------------------------------------------
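The `pretrained_vectors` module is not shown, so the following is only a sketch of how an embedding matrix can be assembled from GloVe-style text lines (a word followed by its space-separated vector components). The function names, the random initialisation for words without a pretrained vector, and the one-row-per-word layout are assumptions and may differ from what `creating_matrix` does.

```python
import io

import numpy as np


def load_vectors_sketch(lines, dim):
    """Parse 'word v1 ... v_dim' lines into a dict (hypothetical helper)."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def build_embedding_matrix_sketch(vectors, words, dim, seed=0):
    """Row i holds the vector of words[i]; words without a pretrained
    vector keep a small random vector (assumed initialisation)."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(words), dim)).astype(np.float32)
    found = 0
    for idx, word in enumerate(words):
        vector = vectors.get(word)
        if vector is not None:
            matrix[idx] = vector
            found += 1
    return matrix, found / len(words)


# Tiny in-memory stand-in for a GloVe file with 2-dimensional vectors.
toy_file = io.StringIO("new 0.1 0.2\ngame 0.3 0.4\n")
toy_vectors = load_vectors_sketch(toy_file, dim=2)
toy_matrix, coverage = build_embedding_matrix_sketch(
    toy_vectors, ["_PAD", "_UNK", "new", "game"], dim=2
)
print(toy_matrix.shape, coverage)  # (4, 2) 0.5
```

The returned coverage corresponds to the kind of percentage reported in the log above: the share of vocabulary words for which a pretrained vector was found.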
