In [1]:
# set working directory to parent
import os
os.chdir("..")

# import packages
import pandas as pd
pd.options.display.max_colwidth = 100


import boto3
import numpy as np
from sklearn.model_selection import train_test_split

# tensorflow
import tensorflow as tf
import tensorflow_hub as hub
from tensorboard import notebook
%load_ext tensorboard

from src.data import process_data 

# check tf version
print("TF Version: ", tf.__version__)


TF Version:  2.0.0


# Quora Insincere Question Classification

[Kaggle Competition](https://www.kaggle.com/c/quora-insincere-questions-classification/notebooks)

This project is meant to help me explore some of the theory behind neural network models, as well as the methodologies behind designing their architectures and training them. 

Here is an excerpt from the problem statement on Kaggle:

"An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world...**A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers**... Help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge."





All of the approaches in this notebook will involve neural networks written in the Tensorflow 2 framework. The following sections detail the approaches and methodologies that will be tested across the different models I implement, listed roughly in the order in which they appear in the pipeline. 




The first thing you notice about this data set is the imbalance in class proportions. The data set contains about 1.3 million examples, of which only 6.2% are "insincere" questions. Whether this proportion is a good estimate of the 'true' class distribution or an anomalous sample matters less here than understanding how to build a classifier under class imbalance.


Below I pull the training data from AWS S3 and then split the full data set into a training set (80%) and a test set (20%). I use scikit-learn's stratified sampling option to preserve the class proportions of the original data set in both the train and test sets.

In [3]:
# load data from S3
data = process_data.retrieve_training(bucket = "quora-questions", file_name = "data/train.csv")

# Use a utility from sklearn to split and shuffle our dataset
train_df, test_df = train_test_split(data, test_size=.2, random_state=42, stratify = data['target'].values)

# Measure data imbalance in training and test set 
for k,v in {"training set":train_df, "test set":test_df}.items():
    neg, pos = np.bincount(v['target'].values)
    total = neg + pos
    print('{}:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
        k,total, pos, 100 * pos / total))


**Training set:**
- Total: 1044897 observations
- Positive: 64648 (6.19% of total)

**Test set:**
- Total: 261225 observations
- Positive: 16162 (6.19% of total)



In [2]:
# print a sample of sincere questions (printing every row is impractical at this scale)
for sent in train_df[train_df['target'] == 0]['question_text'].head(20):
    print(sent)
    print(" ")


In [None]:
# print a sample of insincere questions
for sent in train_df[train_df['target'] == 1]['question_text'].head(20):
    print(sent)
    print(" ")

Looking at these examples of "insincere" questions, it's clear that the criteria are not as simple as targeting explicitly hateful or derogatory expressions. In fact, the pattern seems to center on filtering out any question that betrays a point of view. Furthermore, the question must have value as a **general** resource, must not solicit the personal experiences or personally held beliefs of respondents or public figures, and must not require familiarity with


Quora lists the following characteristics of an insincere question:

- Has a non-neutral tone
    - Has an exaggerated tone to underscore a point about a group of people
    - Is rhetorical and meant to imply a statement about a group of people
- Is disparaging or inflammatory
    - Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    - Makes disparaging attacks/insults against a specific person or group of people
    - Based on an outlandish premise about a group of people
    - Disparages against a characteristic that is not fixable and not measurable
- Isn't grounded in reality
    - Based on false information, or contains absurd assumptions
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

A few examples of insincere questions:



# Word Embeddings

Quora provides four different pre-trained embedding files:
- [GoogleNews-vectors-negative300, word2vec](https://code.google.com/archive/p/word2vec/)
- [glove.840B.300d ](https://nlp.stanford.edu/projects/glove/)
- [paragram_300_sl999](https://cogcomp.seas.upenn.edu/page/resource_view/106)
- [wiki-news-300d-1M](https://fasttext.cc/docs/en/english-vectors.html)
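
As a rough illustration of how one of the text-format files above might be read, the sketch below parses lines of the form `word v1 v2 ... vd` into a dictionary and measures how much of a vocabulary it covers. The function names, the `dim` parameter, and the toy vectors are my own placeholders, not part of the competition data. The sketch does not apply to the binary word2vec file, which needs a dedicated reader.

```python
import numpy as np


def load_embedding_index(lines, dim=300, max_words=None):
    """Parse text-format embedding lines ("word v1 ... vd") into {word: vector}.

    `lines` can be any iterable of strings, for example an open file handle.
    The last `dim` fields are treated as the vector, so a token that itself
    contains spaces is still read as a single word.
    """
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            # Skips blank lines and fastText-style "count dim" header lines.
            continue
        word = " ".join(parts[:-dim])
        index[word] = np.asarray(parts[-dim:], dtype="float32")
        if max_words is not None and len(index) >= max_words:
            break
    return index


def vocabulary_coverage(tokens, index):
    """Fraction of distinct tokens that have a pre-trained vector."""
    vocab = set(tokens)
    if not vocab:
        return 0.0
    return sum(word in index for word in vocab) / len(vocab)


# Toy two-dimensional vectors standing in for a real embedding file.
toy_lines = ["2 2", "hello 0.1 0.2", "world 0.3 0.4"]
toy_index = load_embedding_index(toy_lines, dim=2)
coverage = vocabulary_coverage(["hello", "world", "quora"], toy_index)
print("coverage: {:.2f}".format(coverage))
```

With a real file, the call would look something like `load_embedding_index(open(path, encoding="utf8"), dim=300)`, where `path` is wherever the embedding file was unpacked.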






# Text Data Preprocessing when Applying Word Embeddings

When applying a pre-trained word embedding to a training set, issues arise 

The following text preprocessing methods are specific to a context that uses pre-trained word embeddings as part of a model:
- removing special characters with no value in word embedding
- replacing numbers with "#"
- replacing contractions

- using the word embedding to correct misspellings probabilistically
- handling out-of-vocabulary words

- Tokenizer
- Pad Sequence
- Embedding Enrichment 
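
The first few steps in the list above can be sketched in plain Python. This is only an illustration under my own assumptions: the contraction map is a tiny hypothetical sample, each digit is replaced by one `#`, and `pad_sequence` is a simplified stand-in for the Keras tokenizer and padding utilities rather than their actual behaviour (it pads on the right, for instance).

```python
import re

# Hypothetical, deliberately tiny contraction map; a real one would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}
CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded (lower-case) form."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def replace_numbers(text):
    """Replace every digit with '#', so '2019' becomes '####'."""
    return re.sub(r"\d", "#", text)


def strip_special_characters(text):
    """Keep letters, '#', apostrophes and whitespace; collapse repeated spaces."""
    text = re.sub(r"[^A-Za-z#'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_question(text):
    """Apply the cleaning steps in a fixed order."""
    text = expand_contractions(text)
    text = replace_numbers(text)
    return strip_special_characters(text)


def pad_sequence(ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_len entries."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


example = clean_question("Why can't I withdraw $2000 from my account?!")
print(example)
```

The order matters: contractions are expanded before special characters are stripped, otherwise the apostrophes they rely on could already be gone.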

**====================================**<br>
**Resources:**
- https://mlwhiz.com/blog/2019/01/17/deeplearning_nlp_preprocess/
- https://medium.com/@b.terryjack/nlp-everything-about-word-embeddings-9ea21f51ccfe


## Class Imbalance

Thinking about data-gathering processes in general, it's quite possible that phenomena that are 'naturally' imbalanced appear balanced in a sample, and vice versa. Knowing whether the imbalance is intrinsic or extrinsic is outside the scope of the problem here. Further, the process that generates sincere and insincere questions can evolve over time as the user base or judgment standards of a platform like Quora evolve. Therefore, I try to read little into the fact that the classes are imbalanced and focus on how to account for it in the context of a model.

Given highly imbalanced data, most learners will exhibit bias towards the majority class and, in more extreme cases, ignore the minority class altogether. From a probabilistic point of view this is often logical for the learner, because the prior probability of the majority class outweighs the evidence.
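
To make the bias concrete, the sketch below scores a hypothetical classifier that labels every question as sincere, using the test-set counts reported earlier in this notebook. The helper name and the convention of reporting undefined precision as 0.0 are my own choices.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 by convention here.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Test-set counts from the split above: 261225 questions, 16162 insincere.
TEST_TOTAL, TEST_POSITIVE = 261225, 16162

# A classifier that always predicts "sincere" never produces a positive.
always_sincere = binary_metrics(
    tp=0, fp=0, fn=TEST_POSITIVE, tn=TEST_TOTAL - TEST_POSITIVE
)
for name, value in always_sincere.items():
    print("{}: {:.4f}".format(name, value))
```

Accuracy comes out at roughly 94% while recall on the insincere class is zero, which is why accuracy alone is a poor guide on this data set.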

In [Survey on Deep Learning with Class Imbalance](https://link.springer.com/article/10.1186/s40537-019-0192-5), published in the Journal of Big Data, authors Johnson and Khoshgoftaar group methods for handling class imbalance into three categories. The first, data-level techniques, attempt to reduce imbalance through resampling methods. The second, algorithm-level methods, implement a cost or weight schema on the underlying learner. The third, hybrid approaches, combine both sampling and weighting methods.
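
A minimal sketch of one technique from each of the first two categories, using the training-set counts reported earlier. The weighting formula (each class scaled so that both contribute equally to the total loss) is one common choice rather than the only one, and the function names are placeholders of mine.

```python
import numpy as np


def balanced_class_weights(n_negative, n_positive):
    """Algorithm-level: weight each class so both carry half the total weight."""
    total = n_negative + n_positive
    return {0: total / (2.0 * n_negative), 1: total / (2.0 * n_positive)}


def random_oversample_indices(labels, seed=42):
    """Data-level: row indices in which the positive (minority) class is
    resampled with replacement until both classes are the same size.
    Assumes label 1 is the minority class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
    return rng.permutation(np.concatenate([neg_idx, pos_idx, extra]))


# Training-set counts from the split above: 1044897 questions, 64648 insincere.
N_TOTAL, N_POSITIVE = 1044897, 64648
class_weight = balanced_class_weights(N_TOTAL - N_POSITIVE, N_POSITIVE)
print(class_weight)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, while the oversampled indices would be used to reorder the training rows before batching.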


- https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758
- https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
- http://203.170.84.89/~idawis33/DataScienceLab/publication/IJCNN15.wang.final.pdf
- https://towardsdatascience.com/handling-imbalanced-data-4fb691e23fe9
- http://di.ulb.ac.be/map/adalpozz/pdf/Racing_unbalanced_IDEAL.pdf
- https://rikunert.com/SMOTE_explained


**Resampling methods to test:**<br>
**1. no resampling** <br>

**2. SMOTE:**
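
As a rough illustration of the SMOTE idea, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified NumPy version written for this note, not the imbalanced-learn implementation, and it computes all pairwise distances, so it only suits small arrays. SMOTE operates on numeric feature vectors, so applying it to questions would mean applying it to some vector representation of the text, which is an assumption on my part.

```python
import numpy as np


def smote_sketch(minority, n_synthetic, k=5, seed=42):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: array of shape (n_samples, n_features), n_samples >= 2.
    k: number of nearest minority neighbours considered for each sample.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for each synthetic point.
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Move a random fraction of the way from the base sample to the neighbour.
    gap = rng.random((n_synthetic, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(corners, n_synthetic=10, k=2)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.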

## Model Cost functions & Evaluation


Notes on binary cross-entropy, drawn from [Michael Nielsen's online NN guide](http://neuralnetworksanddeeplearning.com/chap3.html)

http://www.jussihuotari.com/2018/01/17/why-loss-and-accuracy-metrics-conflict/
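
A small worked example may help here. Binary cross-entropy averages `-(y * log(p) + (1 - y) * log(1 - p))` over the examples, so it rewards confident correct probabilities, whereas accuracy only checks which side of a threshold each probability falls on. The sketch below, with made-up predictions, shows two sets of outputs that share the same accuracy but have different losses.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); p is clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and two sets of predicted probabilities.
y = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
hesitant = [0.6, 0.4, 0.55, 0.45]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y, probs), round(binary_cross_entropy(y, probs), 3))
```

Both prediction sets classify every example correctly, yet the hesitant one has a noticeably higher loss, which is one way the two metrics can move independently during training.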





## Model Architecture

Three main architectural decisions:

1) How to represent the text?
2) How many layers to use in the model?
3) How many hidden units to use for each layer?
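
One way to keep these three decisions explicit is to expose them as arguments of a model-building function. The sketch below is a hypothetical baseline, not one of the models evaluated in this notebook: text is represented by a trainable embedding averaged over the tokens, and the number of layers and units per layer come from the `hidden_units` tuple. All default values are placeholders.

```python
import tensorflow as tf


def build_baseline_model(vocab_size=50000, embedding_dim=300, hidden_units=(64, 32)):
    """Embedding -> average pooling -> dense layers -> sigmoid output."""
    model = tf.keras.Sequential()
    # Decision 1: represent each token id as a dense vector.
    model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim))
    model.add(tf.keras.layers.GlobalAveragePooling1D())
    # Decisions 2 and 3: one Dense layer per entry, sized by that entry.
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    # Single sigmoid unit for the sincere / insincere probability.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# A small instance, just to show the call.
small_model = build_baseline_model(vocab_size=100, embedding_dim=8, hidden_units=(4,))
```

Swapping the embedding layer for pre-trained vectors or a TensorFlow Hub text module would change the first decision without touching the rest of the function.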

## Model Hyperparameters

## Training Strategies

## Overfitting Considerations

## Model Results: TensorBoard

Evaluating and comparing the different model architectures, hyperparameter combinations, and training strategies requires a suite of accuracy metrics in addition to the loss function used in the training process.

TensorBoard offers a good interface for plotting these metrics at each training epoch for both training and testing data.
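
A minimal sketch of how such a metric suite and a TensorBoard callback might be set up with Keras. The particular metrics, the `logs/fit` directory, and the helper name are assumptions for illustration, not settings taken from this project.

```python
import datetime
import os

import tensorflow as tf

# Metrics tracked in addition to the training loss.
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name="accuracy"),
    tf.keras.metrics.Precision(name="precision"),
    tf.keras.metrics.Recall(name="recall"),
    tf.keras.metrics.AUC(name="auc"),
]


def make_tensorboard_callback(log_root="logs/fit"):
    """Create a TensorBoard callback that logs to a timestamped run directory."""
    run_dir = os.path.join(
        log_root, datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    )
    return tf.keras.callbacks.TensorBoard(log_dir=run_dir, histogram_freq=1)


# Typical use, assuming a compiled-from-scratch Keras `model` and data arrays:
#   model.compile(optimizer="adam", loss="binary_crossentropy", metrics=METRICS)
#   model.fit(x_train, y_train, validation_data=(x_val, y_val),
#             epochs=5, callbacks=[make_tensorboard_callback()])
# and then, in a notebook cell:
#   %tensorboard --logdir logs/fit
```

Writing each run to its own timestamped directory lets TensorBoard overlay the curves of different architectures or hyperparameter settings for comparison.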


## Model Evaluation

## Model 1: 

### Instructions to train, save, load model, and predict test set


In [None]:
https://www.datacamp.com/community/tutorials/tensorboard-tutorial
https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy

In [None]:
https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub
https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
https://www.mlq.ai/transfer-learning-tensorflow-2-0/
https://medium.com/@karanaryan/a-beginners-guide-to-data-pipelines-in-tensorflow-2-0-a291535bd5c3

In [None]:
https://github.com/Amin-Tgz/awesome-tensorflow-2#GitHub-tutorials

In [None]:
https://nbviewer.jupyter.org/github/timotheechauvin/NNDL-solutions/blob/master/notebooks/chap-3-improving-the-way-neural-networks-learn.ipynb