# Project Plan and Steps

**Due Dates:**

* Presentation (**Dec 15th - 19th**)
* Report (**Dec 19th**)


### 1. Implementation 

#### Paragraph to Vec (target: Nov 26th)

* Finalize the model; its output dimension must match the dimension of the CNN output

#### Word Embedding

* Pretrained Google word2vec

* Word embedding without pretraining, trained together with the CNN

#### CNN (need to follow the paper)

* The paper's CNN output has a distinctive 2-hidden-layer structure; we will use only a softmax output layer

* Region sizes, filter size, and filter width mirror the paper

#### Open Classification

* 1-vs-Rest

* Clustering approach
    * Gaussian Mixture Model
    * Infinite Dirichlet process

### 2. Training and Cross Validation

#### Full combinations: can start in pipeline or parallel once the implementation is ready

* paragraph vec + 1-vs-rest
* paragraph vec + GMM
* paragraph vec + IDP
* CNN + 1-vs-rest
* CNN + GMM
* CNN + IDP

### 3. Experiment Evaluation

* Use Macro F1 Score

### 4. Report write-up


# Language Model Architecture

### Paragraph Vector Model

#### Ref: (Le and Mikolov, 2014) https://arxiv.org/pdf/1405.4053v2.pdf

* Max vocab size = 20000
* Output feature vector size = 450

### CNN (Convolutional Neural Network)

#### Ref:

**Model Structure and Parameters:**

* Embedding Layer:

    * batch_size: N
    * number of class: M
    * word embedding dimension: d = 300
    * Sentence length: s = 500 (median length of the documents)
    * A sentence matrix has size of $s \times d$
    
    
* CNN Layer

    * TensorFlow CNN: _tf.nn.conv2d_
    * Region sizes: [3, 4, 5], i.e. 3 region sizes (***TensorFlow defines this as the filter height***)
    * Width of the filter: d = 300 (***TensorFlow defines this as the filter width; usually equal to the word embedding dimension***)
    * Number of filters per region size: f = 150 (***this equals the number of feature maps for each region size***)
    * 1-max pooling is applied to each feature map
    
    The image below is only an example.

<img src="cnn.png" width = "80%">

* Output Layer: the softmax output layer, where:
    * o: feature vector from the CNN layer after max-pooling, of dimension $k$
    * W': $K \times r$
    * y: $M$
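
As a sanity check on the dimensions listed above, the following sketch computes the feature-map length for each region size and the size of the pooled feature vector. It assumes stride 1 and no padding (`VALID` padding in TensorFlow terms), which these notes do not state explicitly; the function names are illustrative only.

```python
def feature_map_length(sentence_length, region_size):
    """Number of positions a filter of height `region_size` visits
    (assumes stride 1 and no padding)."""
    return sentence_length - region_size + 1


def pooled_feature_size(region_sizes, filters_per_region):
    """1-max pooling keeps one value per feature map, so the sizes add up."""
    return len(region_sizes) * filters_per_region


s, d = 500, 300              # sentence length and word embedding dimension
region_sizes = [3, 4, 5]
f = 150                      # filters per region size

for h in region_sizes:
    # Each filter has shape (h, d) and produces one feature map of this length.
    print(h, feature_map_length(s, h))

print(pooled_feature_size(region_sizes, f))
```

Under these assumptions the pooled feature vector has 3 × 150 = 450 entries, which matches the output feature vector size chosen for the paragraph vector model.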

# Open Classification

### Approach to Implement

* 1-vs-Rest Layer of DOC (This is the method of reference paper)

    * M (number of classes) sigmoid functions, N (batch_size)
    * The objective function for training is the sum of the log losses (cross-entropy) of all M sigmoid functions over the training data: $$loss = -\sum_{i=1}^{M} \sum_{n=1}^{N} \left[ y_{n,i} \log p_{n,i} + (1 - y_{n,i}) \log(1 - p_{n,i}) \right]$$ where $y_{n,i} = 1$ if sample $n$ belongs to class $i$ (0 otherwise) and $p_{n,i}$ is the output of the $i$-th sigmoid for sample $n$.
    * At prediction, reject the sample if every predicted probability is below its class threshold $t_i$; otherwise predict $argmax(Sigmoid(d))$
    * The threshold is determined by outlier detection. (***We can use a fixed number such as 0.95 to validate our model implementation***)
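
The loss and the rejection rule described above can be sketched in plain Python. This is a minimal illustration, not the notebook implementation: it is numerically naive (no clipping of probabilities), it uses the fixed threshold of 0.95 mentioned above rather than thresholds found by outlier detection, and the function names and the `reject_label` value are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def one_vs_rest_loss(logits, labels):
    """Summed binary cross-entropy over N samples and M sigmoid outputs.

    logits: N lists of M raw scores; labels: N class indices in [0, M).
    """
    total = 0.0
    for scores, label in zip(logits, labels):
        for i, z in enumerate(scores):
            p = sigmoid(z)
            y = 1.0 if i == label else 0.0
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


def predict_or_reject(scores, thresholds, reject_label=-1):
    """Return the argmax class, or `reject_label` (hypothetical marker) if
    every sigmoid probability is below its class threshold."""
    probs = [sigmoid(z) for z in scores]
    if all(p < t for p, t in zip(probs, thresholds)):
        return reject_label
    return max(range(len(probs)), key=probs.__getitem__)


thresholds = [0.95, 0.95, 0.95]                            # fixed value for validation
print(predict_or_reject([4.0, -3.0, -2.0], thresholds))    # class 0 passes its threshold
print(predict_or_reject([0.5, -1.0, 0.2], thresholds))     # all below 0.95: rejected
```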

We use a 1-vs-Rest methodology similar to that of the DOC paper.

   * After training the language model (paragraph2vec or CNN), we take its feature vectors and run 1-vs-Rest classification with scikit-learn
   * We remove some class labels from the training data as unseen during training and cross-validation
       * 5 classes labeled + 1 unseen
       * 5 classes labeled + 2 unseen
       * 5 classes labeled + 3 unseen
   * We calibrate the probabilities and define a probability threshold for each class in the training set, i.e. without the unseen classes
   * We predict the test set classes using the probability thresholds: if a sample falls below the threshold for every class, it is assigned to the unseen classes
   * We evaluate our predictions using the macro-F1 score
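
The step of holding out classes as unseen can be illustrated as follows. This is a sketch under assumptions: the helper names and the single `"unseen"` label are hypothetical, and the toy data uses 2 seen classes plus 1 unseen class instead of the 5 + 1 to 5 + 3 setups above.

```python
def drop_unseen(train_samples, unseen_classes):
    """Remove every training sample whose label is a held-out class."""
    unseen = set(unseen_classes)
    return [(x, y) for x, y in train_samples if y not in unseen]


def relabel_unseen(test_samples, unseen_classes, unseen_label="unseen"):
    """Map all held-out class labels in the test set to one `unseen` label."""
    unseen = set(unseen_classes)
    return [(x, unseen_label if y in unseen else y) for x, y in test_samples]


# Toy (text, label) pairs; the document texts are placeholders.
train = [("doc a", "sci.space"), ("doc b", "rec.autos"), ("doc c", "sci.med")]
test = [("doc d", "sci.space"), ("doc e", "sci.med")]
held_out = ["sci.med"]

print(drop_unseen(train, held_out))
print(relabel_unseen(test, held_out))
```

The classifier then only ever sees the seen classes during training, while the relabeled test set lets the rejected samples be scored as one extra class.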


* Clustering Approach
    * Gaussian Mixture Model
    * Infinite Dirichlet process
    * We remove some class labels as unseen from the training data during the training and cross-validation
       * 5 classes labeled + 1 unseen
       * 5 classes labeled + 2 unseen
       * 5 classes labeled + 3 unseen
    * We use an outlier detection approach to identify the unseen classes
    * We evaluate our prediction using macro-F1 score
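
To make the density-based outlier idea concrete, here is a deliberately simplified stand-in: one diagonal Gaussian per seen class instead of a Gaussian mixture or Dirichlet process model, with a sample flagged as unseen when its log-density falls below a low quantile of the training log-densities for every class. The quantile rule, the variance floor, and all names are assumptions made for illustration, not the method used in the notebooks.

```python
import math


def fit_diag_gaussian(vectors):
    """Per-dimension mean and variance (with a small floor) of the vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    var = [sum((v[j] - mean[j]) ** 2 for v in vectors) / n + 1e-6
           for j in range(dim)]
    return mean, var


def log_density(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
               for xi, m, s in zip(x, mean, var))


def fit_open_classifier(features_by_class, quantile=0.05):
    """One Gaussian per seen class; threshold = low quantile of training scores."""
    models = {}
    for label, vectors in features_by_class.items():
        mean, var = fit_diag_gaussian(vectors)
        scores = sorted(log_density(v, mean, var) for v in vectors)
        threshold = scores[int(quantile * (len(scores) - 1))]
        models[label] = (mean, var, threshold)
    return models


def predict_open(x, models, unseen_label="unseen"):
    """Best-scoring class, or `unseen_label` if x is an outlier for every class."""
    best_label, best_score = unseen_label, -math.inf
    for label, (mean, var, threshold) in models.items():
        score = log_density(x, mean, var)
        if score >= threshold and score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy 2-D feature vectors for two seen classes.
toy_features = {
    "a": [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0]],
    "b": [[5.0, 5.0], [5.1, 4.9], [4.9, 5.1], [5.0, 5.05]],
}
toy_models = fit_open_classifier(toy_features)
print(predict_open([0.0, 0.0], toy_models))
print(predict_open([50.0, -50.0], toy_models))
```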

# Test and Evaluation Methods and Metrics

### Data Setup

* Use the 20 newsgroups data
* The DOC paper uses the top **_20000_** most frequent words as its vocabulary.

We use the same basic CNN parameters as the DOC paper, such as region sizes, number of filters, and filter size. However, we removed the additional hidden layer and reduced the document length from 2000 to 500 words, close to the median document length (489) of the 20 newsgroups data set. We made these changes to reduce the complexity of the network so that it fits the computing resources we have.

### Word embedding

Google word2vec vs. training the word embedding on the fly

* Google word2vec outperforms a word embedding trained on the fly:

Using data from 5 random classes, we compared pretrained Google word2vec against a word embedding layer trained on the fly. Google word2vec outperforms the trained embedding layer by 11% in test-set accuracy. We attribute this to Google word2vec being trained on about 100 billion words of Google News data, with a vocabulary of 3 million words and phrases, whereas our own embedding is limited to the 20 newsgroups data set.

### Compare language models

We compared the baseline performance of the paragraph2vec and CNN models, although their setups differ.

* We train the CNN model with Google word2vec, using a softmax output layer to classify the 20 newsgroups data.
* We train the paragraph2vec model on the 20 newsgroups data. After training, we extract the feature vectors and connect them to the same softmax output layer used for the CNN.
* We keep a feature length of 450 (3 region sizes × 150 filters) for both the CNN and paragraph2vec models

We learned the following:

* The CNN is difficult to train with limited computing resources.
    * We use the median document length of 500 because of the slow computing speed, whereas the DOC paper uses 2000.
    * We use fewer classes for the same reason
* paragraph2vec outperforms the CNN (with limited hyperparameter tuning)
    * Paragraph2vec yields 85% test accuracy on the 5-class classification task, whereas the CNN yields only 71%.

   
### Open classification Experiments

* Randomly sample 64% of the documents for training, 16% for validation, and 20% for testing
* Vary number of training classes
    * 5 labelled + 1 open
    * 5 labelled + 2 open
    * 5 labelled + 3 open
    * Use Macro-F1 scores
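
For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so the open (unseen) class counts as much as each labelled class. The sketch below is a plain-Python illustration; treating all rejected samples as one `"unseen"` class is an assumption about how the open class is scored.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn          # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["a", "a", "b", "unseen"]
y_pred = ["a", "b", "b", "unseen"]
# Per-class F1: a = 2/3, b = 2/3, unseen = 1, so the macro average is 7/9.
print(macro_f1(y_true, y_pred))
```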

# Implementation Notebooks

### Paragraph vector

* Paragraph vector 1-vs-rest notebook: **Text_Open_Classification_Paragraph2vec_workbook.ipynb**
* Paragraph vector GMM (5 seen + 1 unseen classes): **ParagraphVec_Clustering_5_plus_1.ipynb**
* Paragraph vector GMM (5 seen + 2 unseen classes): **ParagraphVec_Clustering_5_plus_2.ipynb**
* Paragraph vector GMM (5 seen + 3 unseen classes): **ParagraphVec_Clustering_5_plus_3.ipynb**
* Paragraph vector IDP (5 seen + 1 unseen classes): **ParagraphVec_Clustering_5_plus_1-IDP.ipynb**
* Paragraph vector IDP (5 seen + 2 unseen classes): **ParagraphVec_Clustering_5_plus_2-IDP.ipynb**
* Paragraph vector IDP (5 seen + 3 unseen classes): **ParagraphVec_Clustering_5_plus_3-IDP.ipynb**

### CNN 

* CNN Implementation and Training Notebook: **Text_Open_Classification_CNN_workbook.ipynb**
* CNN Open classification 1-vs-rest, GMM, IDP (5 seen + 1 unseen classes): **CNN_open_classification_5_plus_1.ipynb**
* CNN Open classification 1-vs-rest, GMM, IDP (5 seen + 2 unseen classes): **CNN_open_classification_5_plus_2.ipynb**
* CNN Open classification 1-vs-rest, GMM, IDP (5 seen + 3 unseen classes): **CNN_open_classification_5_plus_3.ipynb**