<a href="https://colab.research.google.com/github/vikramkrishnan9885/MyColab/blob/master/CNNSentenceClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

CNNs differ considerably from fully connected neural networks and have achieved state-of-the-art performance in numerous tasks, including image classification, object detection, speech recognition, and sentence classification. One of their main advantages is that a convolution layer has far fewer parameters than a fully connected layer over the same input, because the same filter weights are reused at every position. This lets us build deeper models without exhausting memory, and deeper models often perform better.
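
To make the parameter saving concrete, the sketch below counts the weights of a single layer applied to a one-hot-encoded sentence. The sizes (33 words per sentence, a vocabulary of 3,369 words, 3 output units or filters) are assumptions chosen for illustration, and `dense_params` and `conv1d_params` are hypothetical helper names.

```python
def dense_params(sent_length, vocab_size, units):
    # A fully connected layer sees the flattened sentence, so every unit
    # needs one weight per (position, word) pair, plus one bias.
    return sent_length * vocab_size * units + units

def conv1d_params(filter_width, vocab_size, filters):
    # A 1D convolution filter spans only `filter_width` words and is reused
    # at every position, so its size does not depend on sentence length.
    return filter_width * vocab_size * filters + filters

sent_length, vocab_size = 33, 3369   # assumed sizes for illustration
dense = dense_params(sent_length, vocab_size, units=3)
conv = sum(conv1d_params(w, vocab_size, filters=1) for w in (3, 5, 7))
print(dense, conv)
```

Under these assumptions the three convolution filters together need roughly one sixth to one seventh of the weights of a three-unit fully connected layer, and the gap widens as sentences get longer.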

First, we will define the inputs and outputs. The input will be a batch of sentences
in which each word is represented by a one-hot-encoded vector. Word embeddings
would likely deliver better performance than one-hot encoding; however, we will use
the one-hot-encoded representation for simplicity:


In [1]:
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import bz2
from matplotlib import pylab
import urllib.request # urlretrieve lives in the urllib.request submodule
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from scipy.sparse import lil_matrix
import nltk # standard preprocessing
import operator # sorting items in dictionary by value
nltk.download('punkt') #tokenizers/punkt/PY3/english.pickle
from math import ceil
import zipfile

Instructions for updating:
non-resource variables are not supported in the long term
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
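# DOWNLOAD DATA

def download_data(url, filename):
  '''Download url to filename unless the file already exists; return the local path.'''
  if not os.path.exists(filename):
    print('Downloading file:','\t',url)
    filename, _ = urllib.request.urlretrieve(url,filename)
  else:
    # Re-running the cell should reuse the existing download instead of failing
    print('File already exists:','\t',filename)
  return filename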

url = 'http://cogcomp.org/Data/QA/QC/train_1000.label'
filename = 'train_1000.label'
filename = download_data(url,filename)

test_url = 'http://cogcomp.org/Data/QA/QC/TREC_10.label'
test_filename = 'TREC_10.label'
test_filename = download_data(test_url, test_filename)

Downloading file: 	 http://cogcomp.org/Data/QA/QC/train_1000.label
Downloading file: 	 http://cogcomp.org/Data/QA/QC/TREC_10.label


In [7]:
filenames = ['train_1000.label','TREC_10.label']
num_files = len(filenames)
dir_name = "./"

In [8]:
# LOADING DATA
# Records the maximum length of the sentences
# as we need to pad shorter sentences accordingly
max_sent_length = 0 

def read_data(filename):
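  '''
  Read questions and labels from a file with the given filename.
  Returns (questions, labels): questions is a list of lower-cased token lists,
  labels is a list of coarse class labels (e.g. 'DESC').
  Also updates the global max_sent_length.
  '''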
  global max_sent_length
  questions = []
  labels = []
  with open(filename,'r',encoding='latin-1') as f:        
    for row in f:
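        # Split 'COARSE:fine question text' at the colon; the fine-grained
        # label (e.g. 'manner') stays in q as its first token
        row_str = row.split(":")
        lb,q = row_str[0],row_str[1]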
        q = q.lower()
        labels.append(lb)
        questions.append(q.split())        
        if len(questions[-1])>max_sent_length:
            max_sent_length = len(questions[-1])
  return questions,labels

# Process train and Test data
for i in range(num_files):    
    print('\nProcessing file %s'%os.path.join(dir_name,filenames[i]))
    if i==0:
        # Processing training data
        train_questions,train_labels = read_data(os.path.join(dir_name,filenames[i]))
        # Making sure we got all the questions and corresponding labels
        assert len(train_questions)==len(train_labels)
    elif i==1:
        # Processing testing data
        test_questions,test_labels = read_data(os.path.join(dir_name,filenames[i]))
        # Making sure we got all the questions and corresponding labels.
        assert len(test_questions)==len(test_labels)
        
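    # Print some data from the file just read to check that everything is okay
    sample_questions,sample_labels = (train_questions,train_labels) if i==0 else (test_questions,test_labels)
    for j in range(5):
        print('\tQuestion %d: %s' %(j,sample_questions[j]))
        print('\tLabel %d: %s\n'%(j,sample_labels[j]))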
        
print('Max Sentence Length: %d'%max_sent_length)
print('\nNormalizing all sentences to same length')


Processing file ./train_1000.label
	Question 0: ['manner', 'how', 'did', 'serfdom', 'develop', 'in', 'and', 'then', 'leave', 'russia', '?']
	Label 0: DESC

	Question 1: ['cremat', 'what', 'films', 'featured', 'the', 'character', 'popeye', 'doyle', '?']
	Label 1: ENTY

	Question 2: ['manner', 'how', 'can', 'i', 'find', 'a', 'list', 'of', 'celebrities', "'", 'real', 'names', '?']
	Label 2: DESC

	Question 3: ['animal', 'what', 'fowl', 'grabs', 'the', 'spotlight', 'after', 'the', 'chinese', 'year', 'of', 'the', 'monkey', '?']
	Label 3: ENTY

	Question 4: ['exp', 'what', 'is', 'the', 'full', 'form', 'of', '.com', '?']
	Label 4: ABBR


Processing file ./TREC_10.label
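
Each line of the label files has the form `COARSE:fine question text`, which is why the fine-grained label (for example `manner`) shows up as the first token of the questions printed above. A minimal sketch of a parser that separates the three parts follows; `parse_trec_line` is a hypothetical helper and is not used by the rest of this notebook, which keeps the fine label as an ordinary token.

```python
def parse_trec_line(line):
    # Split off the coarse label at the first colon only, so colons
    # inside the question text are preserved.
    coarse, rest = line.strip().split(":", 1)
    # The first whitespace-separated token of the remainder is the fine label.
    fine, *tokens = rest.lower().split()
    return coarse, fine, tokens

coarse, fine, tokens = parse_trec_line(
    "DESC:manner How did serfdom develop in and then leave Russia ?\n")
print(coarse, fine, tokens)
```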

In [9]:
# PADDING SENTENCES

# Padding training data
for qi,que in enumerate(train_questions):
    for _ in range(max_sent_length-len(que)):
        que.append('PAD')
    assert len(que)==max_sent_length
    train_questions[qi] = que
print('Train questions padded')

# Padding testing data
for qi,que in enumerate(test_questions):
    for _ in range(max_sent_length-len(que)):
        que.append('PAD')
    assert len(que)==max_sent_length
    test_questions[qi] = que
print('\nTest questions padded')  

# Printing a test question to see if everything is correct
print('\nSample test question: %s',test_questions[0])

Train questions padded

Test questions padded

Sample test question: %s ['dist', 'how', 'far', 'is', 'it', 'from', 'denver', 'to', 'aspen', '?', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']
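
The two padding loops above do the same work, so they could be folded into one helper. The sketch below is an assumed refactoring rather than the notebook's own code: `pad_questions` is a hypothetical name, and it returns new lists instead of modifying the input in place.

```python
def pad_questions(questions, length, pad_token='PAD'):
    # Append pad_token to each tokenized question until it has `length` tokens.
    padded = []
    for que in questions:
        if len(que) > length:
            raise ValueError('question longer than target length')
        padded.append(que + [pad_token] * (length - len(que)))
    return padded

sample = [['how', 'far', '?'], ['why', '?']]
padded_sample = pad_questions(sample, 4)
print(padded_sample)
```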


In [10]:
# BUILD THE DATASET

def build_dataset(questions):
    words = []
    data_list = []
    count = []
    
    # First create a large list with all the words in all the questions
    for d in questions:
        words.extend(d)
    print('%d Words found.'%len(words))    
    print('Found %d words in the vocabulary. '%len(collections.Counter(words).most_common()))
    
    # Sort words by their frequency
    count.extend(collections.Counter(words).most_common())
    
    # Create an ID for each word by giving the current length of the dictionary
    # And adding that item to the dictionary
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    
    # Traverse through all the text and 
    # replace the string words with the ID 
    # of the word found at that index
    for d in questions:
        data = list()
        for word in d:
            index = dictionary[word]        
            data.append(index)
            
        data_list.append(data)
        
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
    
    return data_list, count, dictionary, reverse_dictionary

# Create a dataset with both train and test questions
all_questions = list(train_questions)
all_questions.extend(test_questions)

# Use the above created dataset to build the vocabulary
all_question_ind, count, dictionary, reverse_dictionary = build_dataset(all_questions)

# Print some statistics about the processed data
print('All words (count)', count[:5])
print('\n0th entry in dictionary: %s'%reverse_dictionary[0])
print('\nSample data', all_question_ind[0])
print('\nSample data', all_question_ind[1])
print('\nVocabulary: ',len(dictionary))
vocabulary_size = len(dictionary)

print('\nNumber of training questions: ',len(train_questions))
print('Number of testing questions: ',len(test_questions))

49500 Words found.
Found 3369 words in the vocabulary. 
All words (count) [('PAD', 34407), ('?', 1454), ('the', 999), ('what', 963), ('is', 587)]

0th entry in dictionary: PAD

Sample data [38, 12, 19, 1006, 1007, 6, 28, 1008, 1009, 544, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Sample data [44, 3, 545, 1010, 2, 163, 1011, 1012, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Vocabulary:  3369

Number of training questions:  1000
Number of testing questions:  500
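
Because IDs are assigned in order of decreasing frequency, the most common token receives ID 0; in this dataset that is `PAD`. The toy example below reproduces the ID assignment on two made-up padded questions (the sentences and the `toy_` names are illustrative, not taken from the dataset).

```python
import collections

toy_questions = [['what', 'is', 'it', '?', 'PAD', 'PAD'],
                 ['what', '?', 'PAD', 'PAD', 'PAD', 'PAD']]

# Flatten all questions into one word list, then rank words by frequency
words = [w for q in toy_questions for w in q]
toy_dictionary = {}
for word, _ in collections.Counter(words).most_common():
    # The next free ID is the current size of the dictionary
    toy_dictionary[word] = len(toy_dictionary)

# Replace every word with its ID
toy_ids = [[toy_dictionary[w] for w in q] for q in toy_questions]
print(toy_dictionary)
print(toy_ids)
```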


In [11]:
# GENERATING BATCHES OF DATA
batch_size = 16 # We process 16 questions at a time
sent_length = max_sent_length

num_classes = 6 # Number of classes
# All the types of question that are in the dataset
all_labels = ['NUM','LOC','HUM','DESC','ENTY','ABBR'] 

class BatchGenerator(object):
    '''
    Generates a batch of data
    '''
    def __init__(self,batch_size,questions,labels):
        self.questions = questions
        self.labels = labels
        self.text_size = len(questions)
        self.batch_size = batch_size
        self.data_index = 0
        assert len(self.questions)==len(self.labels)
        
    def generate_batch(self):
        '''
        Data generation function. This outputs two matrices
        inputs: a batch of questions where each question is a tensor of size
        [sent_length, vocabulary_size] with each word one-hot-encoded
        labels_ohe: one-hot-encoded labels corresponding to the questions in inputs
        '''
        global sent_length,num_classes
        global dictionary, all_labels
        
        # Numpy arrays holding input and label data
        inputs = np.zeros((self.batch_size,sent_length,vocabulary_size),dtype=np.float32)
        labels_ohe = np.zeros((self.batch_size,num_classes),dtype=np.float32)
        
        # When we reach the end of the dataset
        # start from beginning
        if self.data_index + self.batch_size >= self.text_size:
            self.data_index = 0
            
        # For each question in the dataset
        for qi,que in enumerate(self.questions[self.data_index:self.data_index+self.batch_size]):
            # For each word in the question
            for wi,word in enumerate(que): 
                # Set the element at the word ID index to 1
                # this gives the one-hot-encoded vector of that word
                inputs[qi,wi,dictionary[word]] = 1.0
            
            # Set the index corresponding to that particular class to 1
            labels_ohe[qi,all_labels.index(self.labels[self.data_index + qi])] = 1.0
        
        # Update the data index to get the next batch of data
        self.data_index = (self.data_index + self.batch_size)%self.text_size
            
        return inputs,labels_ohe
    
    def return_index(self):
        # Get the current index of data
        return self.data_index

# Test our batch generator
sample_gen = BatchGenerator(batch_size,train_questions,train_labels)
# Generate a single batch
sample_batch_inputs,sample_batch_labels = sample_gen.generate_batch()
# Generate another batch
sample_batch_inputs_2,sample_batch_labels_2 = sample_gen.generate_batch()

# Make sure that we in fact have question 0 as the 0th element of our batch
assert np.all(np.asarray([dictionary[w] for w in train_questions[0]],dtype=np.int32) 
              == np.argmax(sample_batch_inputs[0,:,:],axis=1))

# Print some data labels we obtained
print('Sample batch labels')
print(np.argmax(sample_batch_labels,axis=1))
print(np.argmax(sample_batch_labels_2,axis=1))

Sample batch labels
[3 4 3 4 5 2 2 2 3 2 0 3 2 2 4 1]
[3 0 3 3 0 4 2 3 3 4 2 1 4 1 5 4]
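
The layout that `generate_batch` produces for each question can be seen on a tiny example. The sketch below one-hot encodes a single question with a made-up four-word vocabulary (`tiny_vocab` and `one_hot_question` are hypothetical names) and uses plain lists so that it runs without NumPy.

```python
tiny_vocab = {'PAD': 0, 'what': 1, 'is': 2, '?': 3}   # assumed toy vocabulary
question = ['what', 'is', '?', 'PAD']

def one_hot_question(tokens, vocab):
    # One row per word; each row has a single 1.0 at the word's ID.
    rows = []
    for word in tokens:
        row = [0.0] * len(vocab)
        row[vocab[word]] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_question(question, tiny_vocab)
for row in encoded:
    print(row)
```

The result has shape [sent_length, vocabulary_size]; a batch stacks such matrices along a new first axis.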


In [12]:
# HYPERPARAMETERS
tf.reset_default_graph()

batch_size = 32
# Different filter sizes we use in a single convolution layer
filter_sizes = [3,5,7] 

# inputs and labels
sent_inputs = tf.placeholder(shape=[batch_size,sent_length,vocabulary_size],dtype=tf.float32,name='sentence_inputs')
sent_labels = tf.placeholder(shape=[batch_size,num_classes],dtype=tf.float32,name='sentence_labels')

In [17]:
# PARAMETERS
# Our model has the following parameters:
# 3 sets of convolution layer weights and biases (one for each parallel layer)
# 1 fully connected output layer


# 3 filters with different context window sizes (3,5,7)
# Each filter spans the context window width and the full one-hot-encoded length of each word

# Weights of the first parallel layer
w1 = tf.Variable(tf.truncated_normal([filter_sizes[0],vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_1')
b1 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_1')

# Weights of the second parallel layer
w2 = tf.Variable(tf.truncated_normal([filter_sizes[1],vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_2')
b2 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_2')

# Weights of the third parallel layer
w3 = tf.Variable(tf.truncated_normal([filter_sizes[2],vocabulary_size,1],stddev=0.02,dtype=tf.float32),name='weights_3')
b3 = tf.Variable(tf.random_uniform([1],0,0.01,dtype=tf.float32),name='bias_3')

# Fully connected layer
w_fc1 = tf.Variable(tf.truncated_normal([len(filter_sizes),num_classes],stddev=0.5,dtype=tf.float32),name='weights_fulcon_1')
b_fc1 = tf.Variable(tf.random_uniform([num_classes],0,0.01,dtype=tf.float32),name='bias_fulcon_1')

In [20]:
# Defining Inference of the CNN
# Here we define the CNN inference logic. 
# First compute the convolution output for each parallel layer within the convolution layer. 
# Then perform pooling-over-time over all the convolution outputs. 
# Finally feed the output of the pooling layer to a fully connected layer to obtain the output logits.

# Calculate the output for all the filters with a stride 1
# We use relu activation as the activation function
h1_1 = tf.nn.relu(tf.nn.conv1d(sent_inputs,w1,stride=1,padding='SAME') + b1)
h1_2 = tf.nn.relu(tf.nn.conv1d(sent_inputs,w2,stride=1,padding='SAME') + b2)
h1_3 = tf.nn.relu(tf.nn.conv1d(sent_inputs,w3,stride=1,padding='SAME') + b3)

# Pooling over time operation

# This performs the max pooling. There are two ways to do it:
# 1. Use the tf.nn.max_pool operation on a tensor made by concatenating h1_1,h1_2,h1_3 and converting that tensor to 4D
# (because tf.nn.max_pool expects a 4D tensor)
# 2. Do the max pooling separately for each filter output and combine them using tf.concat 
# (this is the one used in the code)

h2_1 = tf.reduce_max(h1_1,axis=1)
h2_2 = tf.reduce_max(h1_2,axis=1)
h2_3 = tf.reduce_max(h1_3,axis=1)

h2 = tf.concat([h2_1,h2_2,h2_3],axis=1)

# Calculate the fully connected layer output (no activation)
# Note: since h2 is 2D [batch_size, number of parallel filters],
# reshaping the output is not required, as it usually is in CNNs
logits = tf.matmul(h2,w_fc1) + b_fc1
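
To see what the convolution and pooling-over-time steps compute, the sketch below reimplements them in plain Python for one sentence and one filter. It assumes stride 1 and zero padding that keeps the output as long as the input, which is what `padding='SAME'` gives here; `conv1d_same` and `max_over_time` are hypothetical helper names, and the toy input and weights are made up.

```python
def conv1d_same(sentence, filt, bias):
    # sentence: [sent_length][channels], filt: [filter_width][channels]
    # Zero-pad so the output has one value per word (stride 1).
    width, channels = len(filt), len(filt[0])
    left = (width - 1) // 2
    right = width - 1 - left
    zero = [0.0] * channels
    padded = [zero] * left + sentence + [zero] * right
    out = []
    for start in range(len(sentence)):
        total = bias
        for k in range(width):
            for c in range(channels):
                total += padded[start + k][c] * filt[k][c]
        out.append(max(total, 0.0))   # ReLU activation
    return out

def max_over_time(feature_map):
    # Keep only the strongest response along the sentence
    return max(feature_map)

# Toy input: 4 words, vocabulary of 3, one-hot encoded (assumed values)
sentence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
# One filter of width 3 (assumed weights)
filt = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
feature_map = conv1d_same(sentence, filt, bias=0.0)
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

Each filter contributes a single pooled number per sentence, which is why `h2` has one column per parallel filter regardless of sentence length.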

In [21]:
# LOSS, OPTIMIZER AND PREDICTIONS

# Loss (Cross-Entropy)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=sent_labels,logits=logits))

# Momentum Optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,momentum=0.9).minimize(loss)

predictions = tf.argmax(tf.nn.softmax(logits),axis=1)
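
For intuition about the loss and the prediction, the sketch below computes softmax cross-entropy and the predicted class for a single example in plain Python. The logits and label are made-up values, and `softmax` and `cross_entropy` are illustrative helpers, not the TensorFlow implementations.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, logits):
    # -sum(label * log(softmax(logits))) for a single example
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)

logits = [2.0, 1.0, 0.1, 0.0, -1.0, 0.5]   # assumed scores for the 6 classes
label = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # true class is index 0
probs = softmax(logits)
loss_value = cross_entropy(label, logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(round(loss_value, 4), predicted_class)
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits directly would give the same prediction.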

In [22]:
# RUN THE MODEL

# With filter widths [3,5,7] and batch_size 32 the algorithm 
# achieves around 90% accuracy on the test dataset (50 epochs). 
# Of the batch sizes [16,32,64], I found 32 to give the best performance

session = tf.InteractiveSession()

num_steps = 50 # Number of epochs the algorithm runs for

# Initialize all variables
tf.global_variables_initializer().run()
print('Initialized\n')

# Define data batch generators for train and test data
train_gen = BatchGenerator(batch_size,train_questions,train_labels)
test_gen = BatchGenerator(batch_size,test_questions,test_labels)

# How often do we compute the test accuracy
test_interval = 1

# Compute accuracy for a given set of predictions and labels
def accuracy(labels,preds):
    return np.sum(np.argmax(labels,axis=1)==preds)/labels.shape[0]

# Running the algorithm
for step in range(num_steps):
    avg_loss = []
    
    # A single traverse through the whole training set
    for tr_i in range((len(train_questions)//batch_size)-1):
        # Get a batch of data
        tr_inputs, tr_labels = train_gen.generate_batch()
        # Optimize the network and compute the loss
        l,_ = session.run([loss,optimizer],feed_dict={sent_inputs: tr_inputs, sent_labels: tr_labels})
        avg_loss.append(l)

    # Print average loss
    print('Train Loss at Epoch %d: %.2f'%(step,np.mean(avg_loss)))
    test_accuracy = []
    
    # Compute the test accuracy
    if (step+1)%test_interval==0:        
        for ts_i in range((len(test_questions)-1)//batch_size):
            # Get a batch of test data
            ts_inputs,ts_labels = test_gen.generate_batch()
            # Get predictions for that batch
            preds = session.run(predictions,feed_dict={sent_inputs: ts_inputs, sent_labels: ts_labels})
            # Compute test accuracy
            test_accuracy.append(accuracy(ts_labels,preds))
        
        # Display the mean test accuracy
        print('Test accuracy at Epoch %d: %.3f'%(step,np.mean(test_accuracy)*100.0))

Initialized

Train Loss at Epoch 0: 1.76
Test accuracy at Epoch 0: 20.833
Train Loss at Epoch 1: 1.67
Test accuracy at Epoch 1: 24.375
Train Loss at Epoch 2: 1.58
Test accuracy at Epoch 2: 29.375
Train Loss at Epoch 3: 1.47
Test accuracy at Epoch 3: 35.417
Train Loss at Epoch 4: 1.35
Test accuracy at Epoch 4: 43.333
Train Loss at Epoch 5: 1.24
Test accuracy at Epoch 5: 45.000
Train Loss at Epoch 6: 1.15
Test accuracy at Epoch 6: 45.417
Train Loss at Epoch 7: 1.08
Test accuracy at Epoch 7: 46.250
Train Loss at Epoch 8: 1.01
Test accuracy at Epoch 8: 49.792
Train Loss at Epoch 9: 0.97
Test accuracy at Epoch 9: 60.833
Train Loss at Epoch 10: 0.92
Test accuracy at Epoch 10: 66.667
Train Loss at Epoch 11: 0.88
Test accuracy at Epoch 11: 67.917
Train Loss at Epoch 12: 0.84
Test accuracy at Epoch 12: 72.083
Train Loss at Epoch 13: 0.80
Test accuracy at Epoch 13: 75.625
Train Loss at Epoch 14: 0.76
Test accuracy at Epoch 14: 76.042
Train Loss at Epoch 15: 0.73
Test accuracy at Epoch 15: 77.292
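
The evaluation loop above runs `(len(test_questions)-1)//batch_size` batches, and the generator restarts from index 0 when a full batch no longer fits, so not every test question is scored. The arithmetic below uses the sizes from this notebook (500 test questions, batch size 32) to show how many are covered; it is a side calculation, not part of the training code.

```python
num_test = 500      # number of testing questions reported earlier
batch_size = 32

# Number of full batches the evaluation loop draws per epoch
batches_per_eval = (num_test - 1) // batch_size
evaluated = batches_per_eval * batch_size
skipped = num_test - evaluated
print(batches_per_eval, evaluated, skipped)
```

With these sizes the reported test accuracies are averages over the first 480 questions; the last 20 are never scored.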