# Tagging System of Questions using Transfer Learning

## Overview 

In this challenge, we provide the titles, text, and tags of Stack Exchange questions from six different sites. We then ask for tag predictions on unseen physics questions. Solving this problem via a standard machine learning approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm predict appropriate physics tags after learning from biology, chemistry or mathematics data? Let's find out!

## Objective 

The main goal of this task is to train a model on questions from domains outside physics and then use it to predict the tags of physics questions. These tags describe the topic of each question.

## Dataset
In this dataset, you are provided with question titles, content, and tags for Stack Exchange sites on a variety of topics (biology, cooking, cryptography, diy, robotics, and travel). The content of each question is given as HTML. The tags are words or phrases that describe the topic of the question. The test set consists of questions from [physics.stackexchange.com](https://physics.stackexchange.com). For each question in the test set, you should use the title and question content to generate potential tags.

**Importing Relevant libraries**

In [None]:
import tensorflow as tf # to create deep neural networks
import pandas as pd # to handle the spreadsheets
from bs4 import BeautifulSoup # to process html tags
import os
import glob
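import nltk
nltk.download('stopwords') # stop-word list used in tokenize
nltk.download('punkt') # tokenizer models required by word_tokenize (newer NLTK releases use 'punkt_tab')
import string
import random
import copy
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords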
import numpy as np

**Setting the hyper parameters**

1. *neg_sample_size* : for every question, we create *neg_sample_size* negative examples.

2. *embedding_size* : determines the embedding dimension of both our word embeddings and hidden layers of our recurrent neural networks

3. *epochs* : determines the number of epochs for training

4. *batch_size* : determines the size of batch for every training iteration

5. *data_dir* : specifies the location of training data

6. *test_data_dir* : specifies the location of testing data

**Note**

Because of time, memory, and compute limits, *epochs* is set to 10, *neg_sample_size* to 4, and the train and test data directories point to the sample train and sample test folders.

If you wish to conduct full training, change these values accordingly.

In [None]:
neg_sample_size = 4
embedding_size = 128
batch_size = 32
epochs = 10
data_dir = './Dataset/sample_train/'
test_data_dir = './Dataset/sample_test/'

**Extracting data from the files**

In [None]:
list_of_columns = ['title','content','tags']
data_frame = pd.DataFrame(columns=list_of_columns)
# DataFrame.append was removed in pandas 2.0, so collect the files and concatenate once
frames = [pd.read_csv(file) for file in glob.glob(data_dir + '*')]
if frames :
    data_frame = pd.concat(frames,ignore_index=True)

**Defining the tokenize functions**

In [None]:
def tokenize(text) :
    

    stop_words = set(stopwords.words('english'))

    # name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(text,'html.parser')
    content = soup.get_text()
    
    raw_tokens = word_tokenize(content)
    raw_tokens = [w.lower() for w in raw_tokens]
    
    tokens = list()
    
    for tk in raw_tokens :
        tkns = tk.split('-')
        for tkn in tkns :
            tokens.append(tkn)
    
    old_punctuation = string.punctuation
    new_punctuation = old_punctuation.replace('-','')
    
    table = str.maketrans('','',new_punctuation)
    #table = str.maketrans('','',string.punctuation)
    
    stripped = [w.translate(table) for w in tokens]
    
    words = [word for word in stripped if word.isalpha()]
    words = [w for w in words if not w in stop_words]
    
    return words
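
The punctuation-stripping and filtering steps of `tokenize` can be tried in isolation. The following is a minimal sketch using only the standard library; the token list is a hypothetical stand-in for `word_tokenize` output, and HTML parsing and stop-word removal are left out on purpose.

```python
import string

# Hypothetical tokens, standing in for lower-cased word_tokenize output.
raw_tokens = ["what", "is", "spin-orbit", "coupling", "?", "e=mc^2", "(see", "fig.", "2)"]

# Split hyphenated tokens into their parts, as tokenize does.
tokens = [part for tk in raw_tokens for part in tk.split('-')]

# Translation table that deletes every punctuation character.
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# Keep purely alphabetic tokens; this drops empty strings and anything containing digits.
words = [w for w in stripped if w.isalpha()]
print(words)  # ['what', 'is', 'spin', 'orbit', 'coupling', 'see', 'fig']
```

Words such as "what" and "is" survive here only because this sketch skips the stop-word filter.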

In [None]:
def tokenize_candidates(text) :
    
    initial_candidates = text.lower().split()
    final_candidates = list()
    
    for candidate in initial_candidates :
        # a hyphenated tag such as 'quantum-mechanics' becomes ['quantum','mechanics']
        intermediate_candidates = candidate.split('-')
        final_candidates.append(intermediate_candidates)
        
    return final_candidates 
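
To make the shape of the result concrete, here is a compact equivalent of `tokenize_candidates` applied to a hypothetical tag string. Each tag becomes one inner list, which is why the labels are later treated as lists of word lists.

```python
def tokenize_candidates(text):
    # One inner list per tag; a hyphenated tag is split into its words.
    return [candidate.split('-') for candidate in text.lower().split()]

example_tags = "quantum-mechanics homework-and-exercises energy"  # hypothetical tag string
candidates = tokenize_candidates(example_tags)
print(candidates)
# [['quantum', 'mechanics'], ['homework', 'and', 'exercises'], ['energy']]
```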

**Creating the tokenized version of all the train data**

In [None]:
vocab_set = set()
train_set = list()
labels_set = list()
tags_set = set()

max_sentence_length = -9999
for index, row in data_frame.iterrows() :
    raw_content = row['content']
    raw_title = row['title']
    raw_tags = row['tags']
    
    content_words = tokenize(raw_content)
    title_words = tokenize(raw_title)
    tag_words = tokenize(raw_tags)
    
    candidate_words = tokenize_candidates(raw_tags)
    
    sequence = list()
    sequence.extend(title_words)
    sequence.extend(content_words)
    
    train_set.append(sequence)
    labels_set.append(candidate_words)
    
    for tag in tag_words :
        tags_set.add(tag)
        
    new_set_of_words = content_words + title_words + tag_words
    for w in new_set_of_words :
        vocab_set.add(w)

**Creating the vocabulary dictionary**

In [None]:
word_idx = dict()
# index 0 is reserved for padding and unknown words, so real words start at 1
vocab_size = len(vocab_set) + 1
for i,word in enumerate(sorted(vocab_set),start=1) :
    word_idx[word] = i

**Preparing the train data with positive and negative samples**

In [None]:
train_data = list()
max_sentence_size = -9999


for i in range(len(train_set)) :
    train_x = train_set[i]
    neg_tags = copy.deepcopy(tags_set)
    
    # drop this question's true tag words so they can never be drawn as negatives
    for tags_list in labels_set[i] :
        for tag in tags_list :
            neg_tags.discard(tag)
    
    for j in range(neg_sample_size) :
        # random.sample needs a sequence, not a set; draw between 0 and 2 tag words
        pool = list(neg_tags)
        neg_tag = random.sample(pool,min(random.randint(0,2),len(pool)))
        
        # append a negative sample
        x = list()
        x.extend(train_x)
        x.extend(neg_tag)
        
        # finding the maximum sentence size, this will be used to set the padding limit
        max_sentence_size = max(max_sentence_size,len(x))
        train_data.append([x,0])
        
        # do not reuse these tag words in later negatives for the same question
        neg_tags.difference_update(neg_tag)
        
    # append the true sample
    for true_label in labels_set[i] :
        x = list()
        x.extend(train_x)
        x.extend(true_label)
        # finding the maximum sentence size
        max_sentence_size = max(max_sentence_size,len(x))
        train_data.append([x,1])
random.shuffle(train_data)
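
The sampling scheme can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's exact cell: `build_samples`, the toy question, and the toy tag words are hypothetical, and the sketch always draws one or two tag words per negative, whereas the cell above also allows zero.

```python
import random

def build_samples(question_words, true_tags, all_tag_words, neg_sample_size, rng):
    """Return (sequence, label) pairs for one question.

    true_tags is a list of word lists, e.g. [['food', 'science'], ['boiling']].
    """
    true_words = {w for tag in true_tags for w in tag}
    # Candidate negatives: every known tag word that is not a true tag word.
    pool = sorted(all_tag_words - true_words)
    samples = []
    for _ in range(neg_sample_size):
        k = min(rng.randint(1, 2), len(pool))
        neg = rng.sample(pool, k)
        samples.append((question_words + neg, 0))
        # Remove the drawn words so later negatives differ from this one.
        pool = [w for w in pool if w not in neg]
    # One positive sample per true tag.
    for tag in true_tags:
        samples.append((question_words + tag, 1))
    return samples

question = ['boil', 'egg', 'altitude']
true_tags = [['boiling'], ['food', 'science']]
all_tag_words = {'boiling', 'food', 'science', 'baking', 'knife', 'travel', 'visa'}

samples = build_samples(question, true_tags, all_tag_words, 2, random.Random(0))
for seq, label in samples:
    print(label, seq)
```

Each sample is the question's words followed by the tag words being judged, so the model learns to score a (question, tags) pair rather than to emit tags directly.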

**Creating vectorized version of train data**

In [None]:
vec_train_data = list()
vec_train_labels = list()

# use a fresh loop variable so the train_set list built earlier is not overwritten
for sample in train_data :
    train_seq = sample[0]
    train_label = sample[1]
    lq = max(0,max_sentence_size-len(train_seq))
    
    # unknown words and padding both map to index 0
    vec_seq = [word_idx.get(w,0) for w in train_seq] + lq*[0]
    
    vec_train_data.append(vec_seq)
    vec_train_labels.append([train_label])

train_x = np.array(vec_train_data)
train_y = np.array(vec_train_labels)
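
The vectorization step amounts to an id lookup followed by right-padding. Here is a small sketch with a hypothetical vocabulary; the helper name `vectorize` and the choice to start word ids at 1 (keeping 0 free for padding and unknown words) are assumptions of this example.

```python
def vectorize(sequence, word_idx, max_len, pad_id=0):
    """Map words to ids (unknown words share pad_id) and right-pad to max_len."""
    ids = [word_idx.get(w, pad_id) for w in sequence]
    return ids + [pad_id] * max(0, max_len - len(ids))

# Hypothetical vocabulary; ids start at 1 so that 0 stays free for padding.
word_idx = {w: i for i, w in enumerate(['boil', 'egg', 'altitude', 'boiling'], start=1)}

vec = vectorize(['boil', 'egg', 'unknownword', 'boiling'], word_idx, max_len=6)
print(vec)  # [1, 2, 0, 4, 0, 0]
```

Note that a sequence longer than `max_len` is returned unpadded and untruncated, so it would not fit a fixed-width input.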

**Creating the tensorflow model**

In [None]:
tf.reset_default_graph()
x = tf.placeholder(tf.int32,[None,max_sentence_size],name="input_seq")
y = tf.placeholder(tf.int32,[None,1],name="answers")

A = tf.get_variable("word_embeddings",[vocab_size,embedding_size])

word_embeddings = tf.nn.embedding_lookup(A,x)

rnn_cell = tf.nn.rnn_cell.LSTMCell(embedding_size)

outputs, states = tf.nn.dynamic_rnn(rnn_cell,word_embeddings,dtype=tf.float32)
# single sigmoid unit on the last LSTM output: the score that the appended tags fit the question
logit_outputs = tf.layers.dense(outputs[:,-1,:],1,activation=tf.nn.sigmoid)

loss = tf.losses.mean_squared_error(y,logit_outputs)
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

**Running the tensorflow model**

In [None]:
session = tf.Session()
init_op = tf.global_variables_initializer()
# (start, end) index pairs covering every example; the last batch may be smaller
batches = [(start,start+batch_size) for start in range(0,len(train_x),batch_size)]
session.run(init_op)
for epoch in range(epochs) :
    for start,end in tqdm(batches) :
        train_batch = train_x[start:end]
        train_ans = train_y[start:end]
        
        feed_dict = {x : train_batch, y : train_ans}
        _, l = session.run([train_op,loss],feed_dict=feed_dict)
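
Mini-batching here is only a matter of slicing the arrays by index ranges. One way to generate slices that cover every example, including a smaller final batch, is sketched below; `batch_bounds` is a hypothetical helper, not part of the notebook.

```python
def batch_bounds(n_examples, batch_size):
    """Yield (start, end) index pairs covering range(n_examples) in order."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)

bounds = list(batch_bounds(100, 32))
print(bounds)  # [(0, 32), (32, 64), (64, 96), (96, 100)]
```

Because the placeholders leave the batch dimension as `None`, a shorter final batch can be fed without changing the graph.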

**Preparing the test data**

In [None]:
test_set = list()
max_test_sentence_length = -9999


test_list_of_columns = ['title','content']
test_data_frame = pd.DataFrame(columns=test_list_of_columns)

# read the test files from test_data_dir, not from the training directory
test_frames = [pd.read_csv(file) for file in glob.glob(test_data_dir + '*')]
if test_frames :
    test_data_frame = pd.concat(test_frames,ignore_index=True)

    
for index, row in test_data_frame.iterrows() :
    raw_content = row['content']
    raw_title = row['title']
    
    content_words = tokenize(raw_content)
    title_words = tokenize(raw_title)
    
    sequence = list()
    sequence.extend(title_words)
    sequence.extend(content_words)
    
    max_test_sentence_length = max(max_test_sentence_length,len(sequence))
    
    test_set.append(sequence)

**Making and Storing Predictions**

For each test example, we make predictions, rank them and store the best 4 in the *submission.csv* file.

In [None]:
list_of_test_columns = ['id','tags']
test_data_frame = pd.DataFrame(columns=list_of_test_columns)
tag_list = list(tags_set)
id_count = 0
for initial_seq in test_set :
    test_seq_raw = initial_seq
    score_list = list()
    tag_identification_list = list()
    id_count += 1
    response_dict = dict()
    response_dict['id'] = id_count
    for i in tqdm(range(len(tag_list))) :
        
        first_tag = tag_list[i]
        test_seq = copy.deepcopy(test_seq_raw)
        test_seq.extend([first_tag])
        #print(max_sentence_size)
        lq = max(0,max_sentence_size-len(test_seq))
        vec_seq = [word_idx[w] if w in word_idx.keys() else 0 for w in test_seq] + lq*[0]
        test_case = np.array([vec_seq])
        score = session.run([logit_outputs],feed_dict={x:test_case})
        score_list.append(score[0][0])
        tag_identification_list.append(first_tag)
        
        for j in range(i+1,len(tag_list)) :
            
            second_tag = tag_list[j]
            test_seq = copy.deepcopy(test_seq_raw)
            test_seq.extend([first_tag,second_tag])
            lq = max(0,max_sentence_size-len(test_seq))
            vec_seq = [word_idx[w] if w in word_idx.keys() else 0 for w in test_seq] + lq*[0]
            test_case = np.array([vec_seq])
            score = session.run([logit_outputs],feed_dict={x : test_case})
            # store the same shape as for single tags so the scores stack into one array
            score_list.append(score[0][0])
            tag_identification_list.append('{}-{}'.format(first_tag,second_tag))
            
    index_array = np.array(score_list)
    sorted_index_array = np.argsort(index_array.reshape(index_array.shape[0],))
    index_list = sorted_index_array.tolist()[-4:]
    #print(index_array)
    list_of_correct_responses = list()
    for index in index_list :
        print(tag_identification_list[index])
        list_of_correct_responses.append(tag_identification_list[index])
    tag_ans = ' '.join(list_of_correct_responses)
    response_dict['tags'] = tag_ans
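    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    test_data_frame = pd.concat([test_data_frame,pd.DataFrame([response_dict])],ignore_index=True)
    print('\n')
# index=False keeps the pandas row index out of the submission file
test_data_frame.to_csv('submission.csv',sep=',',encoding='utf-8',index=False)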
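
The prediction loop scores every single tag and every unordered pair of tags, then keeps the four best. A standard-library sketch of that enumeration and ranking follows; the tag list and scores are hypothetical placeholders for the model's outputs, and the helper names are not from the notebook.

```python
from itertools import combinations

def candidate_tag_sets(tag_list):
    """All single tags followed by all unordered pairs of distinct tags."""
    singles = [(t,) for t in tag_list]
    pairs = list(combinations(tag_list, 2))
    return singles + pairs

def top_k(candidates, scores, k=4):
    """Return the k highest-scoring candidates, best first, joined with '-'."""
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return ['-'.join(c) for _, c in ranked[:k]]

tags = ['energy', 'optics', 'gravity']    # hypothetical tag vocabulary
cands = candidate_tag_sets(tags)          # 3 singles + 3 pairs = 6 candidates
scores = [0.2, 0.9, 0.1, 0.7, 0.3, 0.8]   # hypothetical model scores, one per candidate

best = top_k(cands, scores, k=4)
print(best)  # ['optics', 'optics-gravity', 'energy-optics', 'energy-gravity']
```

With n tags this produces n + n(n-1)/2 candidates per question, so the number of forward passes grows quadratically with the tag vocabulary. Unlike `np.argsort(...)[-4:]`, which lists the best candidate last, this sketch returns the best candidate first.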