<a href="https://colab.research.google.com/github/rojinadeuja/NLP-Model-Implementations/blob/master/GloVe-using-Custom-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## How to train GloVe on a custom corpus
If the web datasets above don't match the semantics of your end use case, you can train word vectors on your own corpus.

1. Clone the GloVe repository from GitHub:
> `$ git clone http://github.com/stanfordnlp/glove`

2. Build the GloVe tools:
> `$ cd glove && make`

3. To train it on your own corpus, you need to make changes to the *demo.sh* file:
> The demo.sh script downloads a small corpus *text8*, consisting of the first 100M characters of Wikipedia. It collects unigram counts, constructs and shuffles cooccurrence data, and trains a simple version of the GloVe model. It also runs a word analogy evaluation script in python to verify word vector quality. 

4. Remove the block from *if* to *fi* that follows '*make*'. It downloads the *text8* corpus, which you no longer need.
5. Replace the CORPUS value with your corpus file name, for example 'filename.txt'.
6. At the end of the file, there is another *if* statement. Replace text8 with your corpus file name:
> `if [ "$CORPUS" = 'text8' ]; then`

7. Run *demo.sh* once the changes are made
> `$ ./demo.sh`

8. Note: You can also change other model parameters inside the *demo.sh* file, for example *vector size*, *window size*, and *minimum count*.
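
Steps 5 and 8 come down to changing variable assignments in *demo.sh*. The sketch below shows one way to do that from Python, assuming the script defines its settings as plain `NAME=value` lines. The helper `set_shell_var` and the sample script text are hypothetical, so check the variable names against your own copy of *demo.sh*.

```python
import re


def set_shell_var(script_text, name, value):
    """Replace the first `NAME=...` assignment that starts a line."""
    pattern = re.compile(rf"^{re.escape(name)}=.*$", flags=re.MULTILINE)
    # A function is used as the replacement so that backslashes in `value`
    # are written literally instead of being read as regex escapes.
    new_text, count = pattern.subn(lambda m: f"{name}={value}", script_text, count=1)
    if count == 0:
        raise KeyError(f"{name} is not assigned in the script")
    return new_text


# Illustrative excerpt only; the real demo.sh may differ.
sample = "CORPUS=text8\nVOCAB_MIN_COUNT=5\nVECTOR_SIZE=50\nWINDOW_SIZE=15\n"
edited = set_shell_var(sample, "CORPUS", "filename.txt")
edited = set_shell_var(edited, "VECTOR_SIZE", 100)
print(edited)
```

This only rewrites assignments. The *if* test from step 6 still has to be edited by hand.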

## How to create and use the corpus file
- To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. 

- If your corpus has multiple documents, the documents (only) should be separated by new line characters. Co-occurrence contexts for words do not extend past newline characters.

To create your corpus file in the correct format, follow the instructions in my notebook **[CSV-to-Corpus](https://github.com/rojinadeuja/Data-Processing-Utilities/blob/main/CSV-to-Corpus.ipynb)**.

- Once you get the file, move it to the root folder inside your *glove* directory.
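
The corpus format described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked notebook: `write_corpus` and the sample documents are hypothetical, and lowercasing is an assumption made here because the Keras `Tokenizer` used later lowercases text by default.

```python
import os
import tempfile


def write_corpus(documents, path, lower=True):
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf8") as f:
        for doc in documents:
            if lower:
                doc = doc.lower()
            # split() collapses spaces, tabs and newlines inside a document,
            # so a newline in the output only ever marks a document boundary.
            tokens = doc.split()
            if tokens:  # skip empty documents
                f.write(" ".join(tokens) + "\n")


docs = ["A wonderful  little\nproduction", "   ", "Basically there's a family"]
corpus_path = os.path.join(tempfile.mkdtemp(), "filename.txt")
write_corpus(docs, corpus_path)

with open(corpus_path, encoding="utf8") as f:
    corpus_lines = f.read().splitlines()
print(corpus_lines)
```

In practice you would probably apply the same cleaning to the corpus that the classifier input receives, so that the words in both vocabularies match.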

## How to get GloVe embeddings
- After you run the *demo.sh* script, a *vectors.txt* file will be generated. Each line holds a word followed by the values of its embedding vector.

# Use your custom word embeddings

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Modules

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords

from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
# Import layers from keras.layers directly; the keras.layers.core and
# keras.layers.embeddings submodules are not available in newer Keras releases.
from keras.layers import Activation, Dropout, Dense, Flatten, GlobalMaxPooling1D, Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer

## Load Dataset
The IMDb dataset can also be downloaded using the Keras file utility.
 
```
import os
import tensorflow as tf

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)
```
The *train/* directory has *pos* and *neg* folders with movie reviews labelled as positive and negative respectively. You can use the reviews from the *pos* and *neg* folders to train a binary classification model.
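
As a rough sketch of how the *pos* and *neg* folders could be turned into texts and labels, the hypothetical helper `load_reviews` below reads every `.txt` file and assigns 1 to positive and 0 to negative reviews, matching the encoding used in the rest of this notebook. It builds a tiny stand-in directory so that it runs without the download.

```python
import os
import tempfile


def load_reviews(split_dir):
    """Return (texts, labels) read from the pos/ and neg/ subfolders."""
    texts, labels = [], []
    for folder, label in (("pos", 1), ("neg", 0)):
        folder_path = os.path.join(split_dir, folder)
        for name in sorted(os.listdir(folder_path)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(folder_path, name), encoding="utf8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels


# Tiny stand-in for aclImdb/train; file names and texts are made up.
demo_dir = tempfile.mkdtemp()
for folder, text in (("pos", "A wonderful little production."),
                     ("neg", "Basically a waste of time.")):
    os.makedirs(os.path.join(demo_dir, folder))
    with open(os.path.join(demo_dir, folder, "0_1.txt"), "w", encoding="utf8") as f:
        f.write(text)

texts, labels = load_reviews(demo_dir)
print(labels)  # [1, 0]
```

With the real download, the call would be `load_reviews(os.path.join(dataset_dir, 'train'))`. Any other folders inside *train/* are ignored.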

In [3]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/IMDB.csv')
df.replace(['positive', 'negative'], [1, 0], inplace=True)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


## Data Pre-processing

In [4]:
def preprocess(s):
    '''Strip HTML tags, non-letter characters, isolated single letters and extra whitespace.'''
    # Removing html tags
    TAG_RE = re.compile(r'<[^>]+>')
    sentence = TAG_RE.sub('', s)
    
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

## Create X and y matrices

In [5]:
# Create feature matrix
X = []
sentences = list(df['review'])
for sentence in sentences:
    X.append(preprocess(sentence))

# Create target vector
y = df['sentiment']

## Split dataset into train and test

In [6]:
# Create train-test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Convert into NumPy arrays for processing with TensorFlow
y_train = np.array(y_train)
y_test = np.array(y_test)

## Tokenize the data

In [8]:
# Tokenize the text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [9]:
# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
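
It helps to know what `pad_sequences` does to each review here. With `padding='post'`, zeros are appended to short sequences. Truncation is controlled separately by the `truncating` argument, which defaults to `'pre'`, so reviews longer than `maxlen` lose their beginning rather than their end. The pure-Python function below is a simplified imitation for a single sequence, not the Keras implementation, and `pad_post` is a hypothetical name.

```python
def pad_post(sequence, maxlen, value=0):
    """Imitate pad_sequences(padding='post', truncating='pre') for one list."""
    # truncating='pre' keeps the LAST maxlen tokens (assumes maxlen > 0).
    truncated = sequence[-maxlen:]
    # padding='post' appends the fill value until the length is maxlen.
    return truncated + [value] * (maxlen - len(truncated))


short_seq = pad_post([7, 2, 9], maxlen=5)
long_seq = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq)  # [7, 2, 9, 0, 0]
print(long_seq)   # [3, 4, 5, 6, 7]
```

If the start of a review matters more for your task, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.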

## Create Embeddings
Keras makes it easy to use word embeddings. We will use our custom embeddings rather than randomly initializing the embedding layer.

The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer. 

For this experiment, we have set the dimensionality to 100.

In [10]:
# Create embeddings
embeddings_dictionary = dict()
# Each line of the file is a word followed by its vector components
with open('/content/drive/My Drive/Colab Notebooks/glove.imdb.50k.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

In [11]:
embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
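
Words that have no entry in the embeddings dictionary keep an all-zero row in the matrix, so it can be worth checking how much of the vocabulary is actually covered. The sketch below uses toy stand-ins for `tokenizer.word_index` and `embeddings_dictionary`; the helper `embedding_coverage` is hypothetical.

```python
def embedding_coverage(word_index, embeddings):
    """Return the share of vocabulary words with a vector, plus the missing words."""
    missing = [word for word in word_index if word not in embeddings]
    covered = len(word_index) - len(missing)
    share = covered / len(word_index) if word_index else 0.0
    return share, missing


# Toy stand-ins for tokenizer.word_index and embeddings_dictionary.
toy_word_index = {"the": 1, "movie": 2, "wonderful": 3, "zzyzx": 4}
toy_embeddings = {"the": [0.1, 0.2], "movie": [0.3, 0.4], "wonderful": [0.5, 0.6]}

share, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(share, missing)  # 0.75 ['zzyzx']
```

A low share usually suggests that the corpus used to train the vectors was cleaned or tokenized differently from the classifier input.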

## Classification using a simple Neural Network

In [12]:
# Text Classification with a Simple Neural Network
model = Sequential()
embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen, trainable=False)
model.add(embedding_layer)

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

In [13]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 100)          9254700   
_________________________________________________________________
flatten (Flatten)            (None, 10000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 10001     
Total params: 9,264,701
Trainable params: 10,001
Non-trainable params: 9,254,700
_________________________________________________________________
None
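
The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm that the frozen embedding layer accounts for all the non-trainable parameters. The figures below are taken from the summary above; the variable names are only for illustration.

```python
embedding_params = 9_254_700  # from the summary
embedding_dim = 100
maxlen = 100

# The embedding layer stores one 100-d row per vocabulary index.
vocab_size = embedding_params // embedding_dim
# Flatten turns the (100, 100) output into a single vector.
flatten_units = maxlen * embedding_dim
# Dense(1): one weight per input plus one bias.
dense_params = flatten_units * 1 + 1
total_params = embedding_params + dense_params

print(vocab_size, flatten_units, dense_params, total_params)
# 92547 10000 10001 9264701
```

Only the 10,001 dense-layer parameters are trained, because the embedding layer is created with `trainable=False`.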


## Train the model

In [14]:
# Train the model
history = model.fit(X_train, y_train, batch_size=128, epochs=6, verbose=1, validation_split=0.2)

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


## Evaluate model on test data

In [15]:
# Evaluate the model
score = model.evaluate(X_test, y_test, verbose=1)
print("\nTest Accuracy:", score[1])


Test Accuracy: 0.8027999997138977


## Results
The test accuracy of our model trained on custom word embeddings was approximately 80.3%, as reported by the evaluation output above.