# Deep learning for NLP

Deep learning (DL) is a vital tool for applications in Natural Language Processing (NLP), computer vision, and many related fields. DL has a long history, but its real capability became apparent only recently, with advances in computing power driven by graphics processing units (GPUs) and the large-scale availability of training data. Research in this area arguably took a new direction after a significant breakthrough in image classification. Today, most state-of-the-art solutions to NLP and computer vision problems are based on deep learning methods.

A good number of deep learning tutorials are available online.
A __[GitHub repository for deep learning](https://github.com/ChristosChristofidis/awesome-deep-learning)__ maintains an extensive list of tutorials, online courses, datasets, etc. Please refer to it for an introduction to the convolutional neural network, which is discussed in this notebook. More specifically, these __[slides](http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture05.pdf)__ from the Stanford CS231n course give a quick overview of neural networks and convolutional neural networks (CNNs).


*DL has various applications in natural language processing, such as __[Sentence Classification](http://www.aclweb.org/anthology/D14-1181)__ using CNNs, __[Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)__, __[Named Entity Recognition](http://www.aclweb.org/anthology/N16-1030)__ using neural networks, __[Natural Language Generation, Paraphrasing and Summarization of User Reviews with Recurrent Neural Networks](http://www.meanotek.ru/files/TarasovDS%282%292015-Dialogue.pdf)__ using RNNs (Recurrent Neural Networks), and __[Detecting Semantically Equivalent Questions](https://www.aclweb.org/anthology/K15-1013)__ using CNNs.*


**In this notebook, I first describe two recently published works on sentence classification using deep learning. The last part of the notebook then walks through sentence classification (positive or negative sentiment on the __[IMDB dataset](https://www.kaggle.com/c/sentiment-analysis-on-imdb-movie-reviews#evaluation)__) using a neural network, and gives two links for further exploration of the CNN-based approach.**


# Description of the two recently published works on Sentence Classification using CNN:


# 1. Sentence Classification using CNN (__[Yoon Kim 2014](http://www.aclweb.org/anthology/D14-1181)__)


Before deep learning, most methods in the sentence classification literature focused on designing the best possible features to represent sentences for a particular task and then choosing the best classifier to make the prediction. The work by Yoon Kim uses Convolutional Neural Networks (CNNs) for __[Sentence Classification](http://www.aclweb.org/anthology/D14-1181)__.
CNN models were originally invented for computer vision tasks, but they have since shown good results on various NLP tasks such as sentence modeling, search query retrieval, and semantic parsing. In this work, a CNN is trained on top of pretrained word vectors for sentence-level classification. Word vectors, as we know from our NLP class, are low-dimensional vector representations obtained by training Continuous Bag of Words (CBOW) and skip-gram models. Both models have one hidden layer. These two neural network based models project the one-hot representation of a word $w_i$ (a $V$-dimensional vector, where $V$ is the vocabulary size) to a lower-dimensional space (say of dimension $d$, so $w_i\in \mathbb{R}^d$). This representation has the property that the vectors of two semantically close words $w_i$ and $w_j$ have high cosine similarity (or a small Euclidean distance) in the $d$-dimensional space. To summarize, the dimensions of a word vector encode semantic features of the word.
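
The projection and similarity ideas above can be illustrated with a small pure-Python sketch. The three-word vocabulary and the embedding values are invented for illustration (they are not real word2vec outputs), and the helper names `one_hot`, `project`, and `cosine_similarity` are hypothetical. The sketch shows that multiplying a one-hot vector by a $V \times d$ matrix simply selects one row, and that two related words can end up with a higher cosine similarity than two unrelated ones.

```python
import math

# Toy V x d embedding matrix (V = 3 words, d = 3 dimensions).
# The numbers are made up for illustration only.
vocab = ["good", "great", "terrible"]
embeddings = [
    [0.9, 0.8, 0.1],     # good
    [0.8, 0.9, 0.2],     # great
    [-0.7, -0.9, 0.3],   # terrible
]

def one_hot(index, size):
    """V-dimensional indicator vector with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(one_hot_vec, matrix):
    """Multiply a one-hot row vector by the V x d matrix (selects one row)."""
    d = len(matrix[0])
    return [sum(one_hot_vec[v] * matrix[v][j] for v in range(len(matrix)))
            for j in range(d)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

good = project(one_hot(0, len(vocab)), embeddings)
great = project(one_hot(1, len(vocab)), embeddings)
terrible = project(one_hot(2, len(vocab)), embeddings)

sim_close = cosine_similarity(good, great)
sim_far = cosine_similarity(good, terrible)
print("good vs great:", sim_close)
print("good vs terrible:", sim_far)
```

In practice the matrix is learned by CBOW or skip-gram training rather than written by hand, and $d$ is typically in the hundreds.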

* The paper addresses sentence classification on the following datasets:
    1. 'MR' (Movie Reviews): detecting the sentiment of a sentence, positive or negative (two-class classification problem)
    2. 'CR' (Customer Reviews): detecting the sentiment of a sentence, positive or negative (two-class classification problem)
    3. 'SST-1' (Stanford Sentiment Treebank): a multiclass (very positive, positive, neutral, negative, very negative) classification task
    4. 'SST-2' (Stanford Sentiment Treebank): the same data as SST-1 with the neutral reviews removed and binary labels (positive or negative)
    5. 'Subj': classifying sentences as subjective or objective
    6. 'TREC': classifying questions into six question types


For evaluation, the results of the proposed algorithm on all of the above datasets are compared with previous state-of-the-art methods.


A CNN with one convolutional layer is trained on top of the word vectors.
These word vectors are available for download (__[word2vec](https://code.google.com/archive/p/word2vec/)__) and were trained on 100 billion words of Google News.
The main takeaways of this study are:
* **Pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks:**
    * This is supported by an experiment (the model named 'CNN-static' in the paper) in which the input word vectors are kept 'static' (the vectors trained on 100 billion words of Google News) and only the other parameters of the model, such as the weights of the convolutional filters, are learned. Even so, the model performs well on multiple benchmark datasets.
* **Learning task-specific vectors through fine-tuning results in further improvements:**
    * Task-specific fine-tuning of the pretrained vectors improved classification accuracy.
* **The proposed method performed better than previously proposed methods on multiple datasets**
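
To make the "one convolutional layer" concrete, here is a minimal pure-Python sketch of a single filter sliding over a sentence's word vectors, followed by max-over-time pooling. The toy word vectors, the filter weights, the window size `h = 2`, and the choice of `tanh` as the non-linearity are all illustrative assumptions; a real model learns many filters of several window sizes and feeds the pooled values to a classifier.

```python
import math

def conv_feature_map(sentence, filt, bias, h):
    """Apply one filter to every window of h consecutive word vectors."""
    feature_map = []
    for i in range(len(sentence) - h + 1):
        # Concatenate the h word vectors in the window into one flat vector.
        window = [x for word in sentence[i:i + h] for x in word]
        activation = sum(w * x for w, x in zip(filt, window)) + bias
        feature_map.append(math.tanh(activation))
    return feature_map

def max_over_time(feature_map):
    """Keep only the strongest response of the filter over the sentence."""
    return max(feature_map)

# A 4-word sentence with d = 2 dimensional (made-up) word vectors.
sentence = [[0.1, 0.3], [0.8, 0.9], [-0.5, -0.2], [0.7, 0.6]]
filt = [0.5, -0.5, 1.0, 1.0]   # one filter with h * d = 4 weights
bias = 0.0

feature_map = conv_feature_map(sentence, filt, bias, h=2)
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

Pooling over time gives one value per filter regardless of sentence length, which is what lets sentences of different lengths share the same classifier.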


<div class="alert alert-block alert-success">
 A step-by-step Python notebook (using the Keras deep learning library) based on this paper can be found at this __[link](https://github.com/rouseguy/DeepLearning-NLP/blob/master/notebooks/3.%20CNN%20-%20Text.ipynb)__.
</div>





# 2. Character-level Text Classification using CNN (__[Zhang et al. 2015](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)__)


The authors present this as the first work to apply CNNs to text classification using only character-level inputs. The experimental results show that deep ConvNets do not require knowledge of words or of any other syntactic or semantic structure of the language.



Text classification is a classic topic in natural language processing, in which one needs to assign predefined categories to free-text documents. Inspired by the ability of CNNs to learn from raw data in computer vision tasks, the authors treat text as 'raw data' at the character level.


A one-dimensional convolutional neural network is used, and a classification task is performed to demonstrate the CNN's ability to understand text.


## Character-level Convolutional Networks

The critical steps of the proposed system are:
* Input: a sequence of encoded characters
    * For the encoding, an alphabet of size $m$ is chosen for the language. Each character is then one-hot encoded (referred to as 'quantization').
    * The alphabet used consists of 70 characters (26 English letters, ten digits, 33 other characters, and the newline character).
* With a fixed, predefined length $l_0$, the string is converted into an $l_0\times m$ matrix. The input feature length $l_0$ is fixed to 1014. Characters that are not in the predefined alphabet of $m$ characters, including blank characters, are quantized as all-zero vectors.
* Proposed model overview:
![cnn_nlp.png](attachment:cnn_nlp.png)
*Illustration of the proposed model (source: https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf)*



The input to the network is an $l_0\times m$ matrix. The architecture consists of 6 convolutional layers and 3 fully connected layers.
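
The quantization step described earlier (one-hot rows, all-zero rows for characters outside the alphabet, fixed length $l_0$) can be sketched in pure Python as follows. To keep the example small, it assumes a hypothetical reduced alphabet of lowercase letters and digits ($m = 36$) and a frame length of $l_0 = 8$ rather than the 70-character alphabet and $l_0 = 1014$ from the paper. Lowercasing the input and dropping characters beyond $l_0$ are also assumptions of this sketch, and `quantize` is a made-up helper name.

```python
import string

# Hypothetical reduced alphabet (m = 36); the paper's alphabet has 70 characters.
ALPHABET = string.ascii_lowercase + string.digits
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def quantize(text, l0, char_to_index=CHAR_TO_INDEX):
    """Return an l0 x m matrix (list of rows), one one-hot row per character."""
    m = len(char_to_index)
    matrix = [[0] * m for _ in range(l0)]      # rows past the text stay all-zero
    for position, ch in enumerate(text.lower()[:l0]):
        index = char_to_index.get(ch)
        if index is not None:                  # unknown characters stay all-zero
            matrix[position][index] = 1
    return matrix

encoded = quantize("CNN 4 nlp", l0=8)
print(len(encoded), "x", len(encoded[0]))
for row in encoded:
    print("".join(str(v) for v in row))
```

Each row has at most a single 1, so the matrix can be fed to a one-dimensional convolution in the same way an image is fed to a two-dimensional one.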
* Datasets used for the evaluation of the proposed algorithm:
    1. AG’s news corpus: the 4 largest classes were chosen from this corpus
    2. Sogou news corpus: 5 classes – “sports”, “finance”, “entertainment”, “automobile” and “technology”
    3. DBPedia ontology dataset: 14 ontology classes from DBpedia 201
    4. Yelp Review Polarity: 2 classes
    5. Yelp Review Full: 5 classes
    6. Yahoo! Answers: 10 classes
    7. Amazon Review Full: 5 classes
    8. Amazon Review Polarity: 2 classes

Comparisons are made with traditional methods such as 'Bag-of-words and its TFIDF', 'Bag-of-ngrams and its TFIDF', and 'Bag-of-means on word embedding', as well as deep learning based methods such as 'Word-based ConvNets' (__[Yoon Kim 2014](http://www.aclweb.org/anthology/D14-1181)__) and an LSTM-based method.

The main takeaways of this study are:
* A ConvNet trained on characters only is a suitable method for text understanding. For text classification, there is no need for words. Character-level ConvNets perform competitively on various text classification tasks, though they are not the best method on every dataset.
* This implies that natural language can be treated as a signal like any other.
* The experimental results show that character-level ConvNets tend to perform better on larger datasets.
* Character-level ConvNets work well for user-generated text that is not carefully curated. This suggests the approach is applicable to real-world cases, since data generated in day-to-day use is not well curated.
* The choice of alphabet makes a difference.
* The experimental results re-emphasize the established belief that no single machine learning or deep learning algorithm works well on every dataset, even for the same task. For a particular task ($T_1$), a learning algorithm ($A_1$) may work very well on one dataset ($D_1$), but there is no guarantee that the same algorithm ($A_1$) will work well for the same task ($T_1$) on a different dataset ($D_2$).



<div class="alert alert-block alert-success">
The __[code](https://github.com/zhangxiangxiao/Crepe)__ and __[dataset](https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M)__ for this paper are publicly available for download (provided by the authors).
</div>


## The rest of this notebook describes a neural network for text classification.
We will look at sentiment classification (positive or negative) of movie reviews from the __[IMDB dataset](https://www.kaggle.com/c/sentiment-analysis-on-imdb-movie-reviews#evaluation)__. This dataset consists of 25k sentiment-annotated reviews.

## Please note that 'labeledTrainData.tsv' is not uploaded to GitHub. Please download the dataset from __[here](https://github.com/jhfjhfj1/autokeras/blob/master/examples/text_cnn/labeledTrainData.tsv)__.

### Data loading, preprocessing, and train test split

In [9]:
# import the required libraries
import re
import pandas as pd
from sklearn.model_selection import train_test_split
# load the tab-separated data; quoting=3 is csv.QUOTE_NONE, so quotes inside reviews are kept as-is
df = pd.read_csv('labeledTrainData.tsv', sep='\t', quoting=3) # columns include 'review' and 'sentiment'
# Split into train and test data (10% for testing)
train_data, test_data, train_label, test_label = train_test_split(df['review'], df['sentiment'], shuffle=True, test_size=0.1)
# print one sample; use positional indexing because the shuffled split keeps the original row labels
train_data.iloc[2]

'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against 

In [10]:
print('Total number of samples in IMDB dataset are ',len(df['review']))
print('Total number of samples in train data are ',len(train_data))
print('Total number of samples in test data are ',len(test_data))

Total number of samples in IMDB dataset are  25000
Total number of samples in train data are  22500
Total number of samples in test data are  2500


In [11]:
# Convert a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # for removing the stopwords of the english language


[nltk_data] Downloading package stopwords to /home/vinay/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# Remove stopwords, lowercase the text, and build a vocabulary
# that only considers the top max_features terms ordered by term frequency across the corpus.
# binary=True records the presence or absence (1/0) of each term rather than its count.
count_vectors = CountVectorizer(binary=True, stop_words=stopwords.words('english'),
                                lowercase=True, max_features=10000)
# Apply above properties on the train_data
train_data_onehot = count_vectors.fit_transform(train_data)


In [13]:
import numpy as np
np.shape(train_data_onehot)[1]

10000
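
To make clear what these binary bag-of-words features contain, here is a pure-Python sketch of the idea behind `CountVectorizer(binary=True, max_features=...)`. It is not scikit-learn's implementation: the whitespace tokenizer, the toy documents, the tiny stopword set, and the helper names `build_vocabulary` and `binary_vector` are all assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(documents, stop_words, max_features):
    """Keep the max_features most frequent non-stopword tokens in the corpus."""
    counts = Counter(
        token
        for doc in documents
        for token in doc.lower().split()
        if token not in stop_words
    )
    top_terms = sorted(term for term, _ in counts.most_common(max_features))
    # Map each kept term to a column index (alphabetical order).
    return {term: column for column, term in enumerate(top_terms)}

def binary_vector(document, vocabulary):
    """1 if the term occurs in the document at least once, else 0."""
    row = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:
            row[vocabulary[token]] = 1
    return row

docs = [
    "the movie was great great fun",
    "the plot was dull",
    "great cast and a fun plot",
]
stop_words = {"the", "was", "and", "a"}

vocabulary = build_vocabulary(docs, stop_words, max_features=3)
features = [binary_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(features)
```

With `max_features=10000` on the real data, each review becomes a 10,000-dimensional vector of zeros and ones, which matches the shape printed above.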

### Now build a neural network using Keras to classify the test data

Let's use a three-layer neural network (the number of layers is a hyperparameter of the network, so different combinations need to be tried to find the optimum number of layers). The input is a 10k-dimensional feature vector (set by max_features in count_vectors), and the first, second, and third layers have 500, 300, and 300 neurons respectively, followed by a single output neuron. The 'sigmoid' activation function is used. As this is a two-class classification task, the 'binary_crossentropy' loss is computed and the __[Adam optimizer](https://arxiv.org/pdf/1412.6980.pdf)__ is used for the weight updates.
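
As a rough illustration of the two ingredients named above, the following pure-Python sketch computes the sigmoid function and the mean binary cross-entropy for a tiny batch. It is only a sketch of the formulas, not Keras's implementation; the example logits and labels are made up, and the clipping constant `eps` is an arbitrary choice to avoid `log(0)`.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

logits = [2.0, -1.0, 0.0]   # made-up outputs of the last layer before the sigmoid
labels = [1, 0, 1]          # 1 = positive review, 0 = negative review
probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(labels, probs)
print(probs, loss)
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident wrong predictions, which is what the optimizer minimizes during training.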

In [14]:
from keras.models import Sequential
from keras.layers import Dense

input_dim_feat=np.shape(train_data_onehot)[1]
model = Sequential()
 
model.add(Dense(units=500, activation='sigmoid', input_dim=input_dim_feat))
model.add(Dense(units=300, activation='sigmoid'))
model.add(Dense(units=300, activation='sigmoid'))
model.add(Dense(units=1, activation='sigmoid'))
 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 500)               5000500   
_________________________________________________________________
dense_6 (Dense)              (None, 300)               150300    
_________________________________________________________________
dense_7 (Dense)              (None, 300)               90300     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 301       
Total params: 5,241,401
Trainable params: 5,241,401
Non-trainable params: 0
_________________________________________________________________
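
The parameter counts in the summary can be checked by hand: a dense layer with $n_{in}$ inputs and $n_{out}$ units has $n_{in} \times n_{out}$ weights plus $n_{out}$ biases. A short sketch of that arithmetic, using a hypothetical helper named `dense_params`:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input dimension followed by the number of units in each Dense layer.
layer_sizes = [10000, 500, 300, 300, 1]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because of the 10,000-dimensional input.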


In [15]:
# save the picture of the model
from keras.utils import plot_model
plot_model(model, to_file='model.png')

In [None]:
# hold out 5% of train_data for validation and train the model
model.fit(train_data_onehot,train_label,batch_size=256, epochs=5, validation_split=0.05, shuffle=True)

Train on 21375 samples, validate on 1125 samples
Epoch 1/5


In [None]:
# Now check the accuracy on the test data on the trained model
scores = model.evaluate(count_vectors.transform(test_data), test_label, verbose=1)
print("Accuracy:", scores[1])  # Accuracy: 0.875

## Now we want to use a CNN instead of the neural network above

The Python notebook referred to above under 'Sentence Classification using CNN' (__[Yoon Kim 2014](http://www.aclweb.org/anthology/D14-1181)__) can be used for this. A more simplified implementation is available at https://nlpforhackers.io/keras-intro/.

To summarize, the two implementations available for the above task are:
* __[link1](https://github.com/rouseguy/DeepLearning-NLP/blob/master/notebooks/3.%20CNN%20-%20Text.ipynb)__
* __[link2](https://nlpforhackers.io/keras-intro/)__

## References

* http://www.aclweb.org/anthology/D14-1181
* https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
* https://nlpforhackers.io/keras-intro/


Note:
    __[Keras installation](https://keras.io/#installation)__