# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Irrelevant/inappropriate Questions Classification using Deep Neural Networks.


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings - Word2Vec/GloVe
* build deep neural networks to classify questions as irrelevant/inappropriate or not


## Dataset

The challenge in this competition is to predict whether a question asked on a well-known public forum/platform is irrelevant/inappropriate or not.

An irrelevant/inappropriate question is defined as a question intended to make a statement rather than to look for helpful/meaningful answers. The following are some of the characteristics that can signify that a question is irrelevant/inappropriate:

* Based on false information, or contains absurd assumptions
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory against an individual or a group of people
* Uses sexual content (such as incest, pedophilia), and not to seek genuine answers
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Based on an unrealistic premise about a group of people
* Is not grounded in reality

The training dataset includes 1,044,897 questions that were asked, along with whether each was identified as irrelevant/inappropriate (target = 1) or as relevant/appropriate (target = 0). The test dataset consists of approximately 261,000 questions.

The training data may be imbalanced or noisy, and the labels are not guaranteed to be perfect. Take this into account while building the model.
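
One common way to account for class imbalance is to weight each class inversely to its frequency. The sketch below computes such weights with NumPy on a small made-up label array. The helper name `balanced_class_weights` is hypothetical and not part of the notebook.

```python
import numpy as np

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels (illustrative only): 8 negatives and 2 positives
toy_labels = [0] * 8 + [1] * 2
class_weights = balanced_class_weights(toy_labels)
print(class_weights)
```

With the real data, the same helper could be called on the training labels and the resulting dictionary passed to Keras `fit` through its `class_weight` argument. Whether this helps on this dataset would need to be checked on the validation set.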
 

## Description

This dataset has the following information:

1. **qid** - unique question identifier
2. **question_text** - the text of the question asked in the well known public forum/platform
3. **target** - a question labeled "irrelevant/inappropriate" has a value of 1, otherwise 0



## Problem Statement

Classify approximately 261,000 questions asked on a well-known public forum as 'irrelevant/inappropriate' or 'relevant/appropriate' using deep neural networks such as RNN/CNN/LSTM/BERT.

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle (https://www.kaggle.com/t/bde6f23028154933a99e4b4ca8a3dff2), click on the user icon, and then click on your profile as shown below. Then click Account.

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP.PNG)

### 2. Next, scroll down to the API access section and click on **Create New Token** to download an API key (kaggle.json). 

![alt text](https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Capture-NLP_1.PNG)

### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle==1.5.8

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab
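
For environments where shell magics are not available, the copy and permission steps above can also be done from Python. This is a minimal sketch: the helper `install_kaggle_token` is hypothetical, and the demonstration runs against a throwaway directory with a dummy token rather than the real home directory.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path

def install_kaggle_token(token_path, home_dir):
    """Copy kaggle.json into <home_dir>/.kaggle and make it owner read/write only."""
    target_dir = Path(home_dir) / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copy(token_path, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 600
    return target

# Demonstration in a temporary directory with a dummy (not real) token file
with tempfile.TemporaryDirectory() as tmp:
    dummy = Path(tmp) / "kaggle.json"
    dummy.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(dummy, Path(tmp) / "home")
    installed_name = installed.name
    installed_mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed, oct(installed_mode))
```

On Colab, the equivalent call would use the uploaded `kaggle.json` and `Path.home()` as the home directory.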

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c toxic-questions-classification

In [None]:
!unzip /content/toxic-questions-classification.zip

## YOUR CODING STARTS FROM HERE

## Import required packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix,classification_report
import re

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords # to get collection of stopwords
from nltk.tokenize import word_tokenize
import string
import gensim
import matplotlib.pyplot as plt

##   **Stage 1**: Data Loading and Exploratory Data Analysis (1 Point)

In [None]:
data = pd.read_csv('train_dataset.csv')
data_t = pd.read_csv('test_dataset.csv')

In [None]:
data.head()

In [None]:
data.groupby('target').count()

In [None]:
data.groupby('target').target.count().plot.bar()

In [None]:
max(data['question_text'].str.len()), min(data['question_text'].str.len())

In [None]:
data[data['question_text'].str.len() <= 10]
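
Beyond character counts, the number of words per question is often a useful guide when choosing a padding length later. The following is a minimal sketch using NumPy on a made-up list of questions. The helper `length_summary` is hypothetical.

```python
import numpy as np

def length_summary(texts, percentiles=(50, 95, 99)):
    """Return word-count percentiles and the maximum for an iterable of strings."""
    lengths = np.array([len(t.split()) for t in texts])
    summary = {f"p{p}": float(np.percentile(lengths, p)) for p in percentiles}
    summary["max"] = int(lengths.max())
    return summary

# Made-up questions with 1 to 5 words each (illustrative only)
toy_questions = [
    "why",
    "is it",
    "how does this",
    "what is deep learning",
    "can we pad these sequences",
]
toy_summary = length_summary(toy_questions)
print(toy_summary)
```

On the real data, the helper could be applied to `data['question_text']` before preprocessing, since that column still holds plain strings at this point.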

##   **Stage 2**: Data Pre-Processing (1 Point)

####  Clean and Transform the data into a specified format


In [None]:
# Data preprocessing
STOP_WORDS = set(stopwords.words('english'))  # build the stop-word set once, not on every call
PUNCT_TABLE = str.maketrans('', '', string.punctuation)  # translation table that deletes punctuation

def preprocess_text(sen):
    """Return the lower-cased, alphabetic, non-stop-word tokens of a question."""
    sen = re.sub('<.*?>', ' ', sen)  # remove HTML tags

    tokens = word_tokenize(sen)  # split into word tokens
    tokens = [w.lower() for w in tokens]  # lower-case

    stripped = [w.translate(PUNCT_TABLE) for w in tokens]  # strip punctuation
    words = [w for w in stripped if w.isalpha()]  # keep alphabetic tokens only
    words = [w for w in words if w not in STOP_WORDS]  # remove stop words

    # words = [w for w in words if len(w) > 2]  # optionally drop words of 2 characters or fewer

    return words

In [None]:
x=preprocess_text('Shankar is <> looking for a job in 90 days.')
x

In [None]:
data_lines = data.copy()  # work on a copy so the raw questions in `data` stay intact
data_lines['question_text'] = data_lines['question_text'].apply(preprocess_text)


In [None]:
data_lines.head()

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)



In [None]:
EMBEDDING_DIM = 100
# Train a Word2Vec model on the preprocessed questions (trained from scratch on this corpus, not a pretrained model)
model = gensim.models.Word2Vec(sentences=data_lines["question_text"], vector_size=EMBEDDING_DIM, window=5, workers=4, min_count=1)

In [None]:
print(model)

In [None]:
words = list(model.wv.index_to_key)
print('Vocabulary size: %d' % len(words))

In [None]:
# Save model
filename = "questions_embedding_word2vec.txt"
model.wv.save_word2vec_format(filename, binary=False)

In [None]:
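import os

embeddings_index = {}
with open(os.path.join('', 'questions_embedding_word2vec.txt'), encoding="utf-8") as f:
    next(f)  # skip the "<vocab_size> <dim>" header line of the word2vec text format
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')  # parse the vector as floats, not strings
        embeddings_index[word] = coefs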

In [None]:
embeddings_index["bjp"]
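
The cells above train Word2Vec on the competition questions themselves. If pretrained vectors such as GloVe are used instead, they are usually distributed as a text file with one `word v1 v2 ...` entry per line. The sketch below parses that layout from an in-memory toy file. The helper `load_text_embeddings` and the toy vectors are hypothetical, and the header check is a simple heuristic, not part of any library.

```python
import io
import numpy as np

def load_text_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into {word: float32 vector}.

    A first line with exactly two fields is treated as a word2vec-style
    '<vocab_size> <dim>' header and skipped; GloVe text files have no header.
    """
    index = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2:
            continue  # header line, not a word vector
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

# Toy file in word2vec text format: header, then two 3-dimensional vectors
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_embeddings = load_text_embeddings(toy_file)
print(sorted(toy_embeddings), toy_embeddings["hello"].shape)
```

For a real file, the same function could be given an open file handle (opened with `encoding="utf-8"`). `EMBEDDING_DIM` would then have to match the dimension of the downloaded vectors.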

##   **Stage 4**: Build and Train the Deep networks model using Pytorch/Keras (5 Points)



In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data_lines["question_text"])
X = tokenizer.texts_to_sequences(data_lines["question_text"])

X = pad_sequences(X, padding='post', truncating='post')
Y = data_lines["target"]

print('Shape of X tensor:', X.shape)
print('Shape of Y tensor', Y.shape)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

In [None]:
# Add 1 because index 0 is reserved for padding
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

for word, index in word_index.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    # Words without an embedding keep their all-zero row
    embedding_matrix[index, :] = embedding_vector
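
A quick sanity check when building such a matrix is to measure what fraction of the vocabulary actually received a vector, since every uncovered word is fed to the network as an all-zero row. The sketch below wraps the same loop in a hypothetical helper, `build_embedding_matrix`, and runs it on toy inputs.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings, dim):
    """Return (matrix, coverage); row 0 is left as zeros for the padding index."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    found = 0
    for word, idx in word_index.items():
        vec = embeddings.get(word)
        if vec is not None:
            matrix[idx] = vec
            found += 1
    coverage = found / max(len(word_index), 1)
    return matrix, coverage

# Toy vocabulary: "zzz" has no embedding and therefore keeps a zero row
toy_word_index = {"good": 1, "bad": 2, "zzz": 3}
toy_vectors = {
    "good": np.array([0.5, 0.1], dtype="float32"),
    "bad": np.array([-0.5, 0.2], dtype="float32"),
}
toy_matrix, toy_coverage = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(toy_matrix.shape, round(toy_coverage, 3))
```

Because the Word2Vec model in this notebook is trained on the same questions with `min_count=1`, coverage here should be close to complete. It would typically be lower with pretrained vectors.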

In [None]:
# Define Model
model = Sequential()
embedding_layer = Embedding(vocab_size,
                            EMBEDDING_DIM, 
                            weights = [embedding_matrix],
                            trainable=False)
model.add(embedding_layer)
model.add(LSTM(32, dropout=0.3, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()  # summary() prints the table itself; wrapping it in print() also prints "None"

In [None]:
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape

In [None]:
history = model.fit(X_train, Y_train, batch_size=1280, verbose=1, epochs=5, validation_split=0.2)

In [None]:
acc_train = history.history['accuracy']
acc_val = history.history['val_accuracy']
epochs = range(1, len(acc_train) + 1)  # one point per completed epoch
plt.plot(epochs, acc_train, 'g', label='Training accuracy')
plt.plot(epochs, acc_val, 'b', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In [None]:
# Save the model to disk using Keras' own serialization;
# pickling a Keras model is not reliable across Keras/TensorFlow versions
filename = 'finalized_model.h5'
model.save(filename)
# Load the model from disk
# from keras.models import load_model
# loaded_model = load_model(filename)

In [None]:
model.evaluate(X_test, Y_test)
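
Accuracy alone can be misleading when one class dominates, so it is worth also looking at precision, recall, and F1 for the positive class. The following is a minimal NumPy sketch on made-up labels and probabilities. The helper `binary_metrics` is hypothetical, and the 0.5 threshold is an assumption that may need tuning.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for class 1 after thresholding probabilities."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predicted probabilities (illustrative only)
toy_true = [0, 0, 0, 0, 1, 1]
toy_prob = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]
toy_metrics = binary_metrics(toy_true, toy_prob)
print(toy_metrics)
```

With the trained network, the probabilities would come from `model.predict(X_test).ravel()` and the labels from `Y_test`. The already imported `classification_report` from scikit-learn reports the same quantities per class.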

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)








In [None]:
# YOUR CODE HERE