# **Aim of this Challenge:** 

Create intelligent question and answer systems that can reliably predict context without relying on complicated and opaque rating guidelines.

# The Business Problem:


The goal is a more human-like question-answering system that answers a question with an intuitive understanding of what is being asked. Such a system can address users' questions in a more natural way, which in turn can increase user participation on question-answering forums and support human-like conversational chatbots.


# Exploring dataset

In [None]:
# importing the required libraries 

import pandas as pd
import  numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
train_dataset = pd.read_csv('/kaggle/input/google-quest-challenge/train.csv')
test_dataset = pd.read_csv('/kaggle/input/google-quest-challenge/test.csv')
sample_submission_dataset = pd.read_csv('/kaggle/input/google-quest-challenge/sample_submission.csv')

print("Train shape:", train_dataset.shape)
print("Test shape:", test_dataset.shape)
print("Sample submission shape:", sample_submission_dataset.shape)

### Observations:
* The train dataset has 41 columns and 6079 rows (instances/training points).
* The test dataset has only 11 columns and 476 rows (instances/test points).
* The sample submission dataset has 31 columns and 476 rows.
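
The difference between the train and test column counts corresponds to the target labels, which only the training file carries. A minimal sketch of how that could be checked, using two tiny hypothetical frames (`toy_train`, `toy_test`) in place of the real CSV files:

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical data, not the real files).
toy_train = pd.DataFrame({
    "qa_id": [0, 1],
    "question_title": ["a", "b"],
    "question_well_written": [1.0, 0.5],
    "answer_helpful": [0.9, 0.7],
})
toy_test = pd.DataFrame({"qa_id": [2], "question_title": ["c"]})

# Columns present only in the training frame are the target labels.
label_columns = [c for c in toy_train.columns if c not in toy_test.columns]
print(label_columns)
```

Applied to the real frames, the same list comprehension would use `train_dataset` and `test_dataset`.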

In [None]:
# Check for train data samples
train_dataset.head(2)

In [None]:
# getting basic info from training data
train_dataset.info()

> **Observations:** There are 10 features, all of dtype `object`, and 30 labels of dtype `float64`; no column contains null values.

### Features:
 1   question_title                         
 2   question_body                           
 3   question_user_name                      
 4   question_user_page                     
 5   answer                                 
 6   answer_user_name                      
 7   answer_user_page                        
 8   url                                     
 9   category                                
 10  host      

In [None]:
# Describing the train data
train_dataset.describe()

### **Observations:** 
* Of the 41 columns above, 10 are features, 30 are class labels, and one column, `qa_id`, is the unique ID of each instance.
* **21 class** labels describe **questions**; these are the labels that start with "question_...".
* **9 class** labels describe **answers**; these are the labels that start with "answer_...".

* In total we have **30 class labels**.

In [None]:
# Let's see the list of column names

list(train_dataset.columns[1:])

In [None]:
train_dataset.head()

## Checking density of words & characters present in the `question_title` feature

In [None]:
import seaborn as sns


def word_count(sentense):
    sentense = sentense.strip()

    return len(sentense.split(" "))


fig, ax = plt.subplots(1,2, figsize = ( 20 , 5))


question_title_lengths_train = train_dataset['question_title'].apply(len)
question_title_lengths_test = test_dataset['question_title'].apply(len)
question_title_lengths_train_words = train_dataset['question_title'].apply(word_count)
question_title_lengths_test_words = test_dataset['question_title'].apply(word_count)


sns.histplot(question_title_lengths_train, label="Train", kde=True, stat="density", linewidth=0,  color="red", ax=ax[0])
sns.histplot(question_title_lengths_test, label="Test", kde=True, stat="density", linewidth=0,  color="blue", ax=ax[0])
sns.histplot(question_title_lengths_train_words, label="Train", kde=True, stat="density", linewidth=0,  color="red", ax=ax[1])
sns.histplot(question_title_lengths_test_words, label="Test", kde=True, stat="density", linewidth=0,  color="blue", ax=ax[1])

# Set label for x-axis
ax[0].set_xlabel( "No. of characters" , size = 12 )
  
# Set label for y-axis
ax[0].set_ylabel( "Density of character" , size = 12 )
  
# Set title for plot
ax[0].set_title( "Density of characters in 'question_title' feature\n" , size = 15 )

ax[0].legend()


# Set label for x-axis
ax[1].set_xlabel( "No. of Words" , size = 12 )
  
# Set label for y-axis
ax[1].set_ylabel( "Density of Words" , size = 12 )
  
# Set title for plot
ax[1].set_title( "Density of Words in 'question_title' feature\n" , size = 15 )

ax[1].legend()



plt.show();


### Observation: 
* Train and test have the same distribution of characters and words.
* Most titles contain 5-10 words in both train and test.
* Most titles contain 40-60 characters in both train and test.
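
One caveat about the word counts: the `word_count` helper splits on a single space, so consecutive spaces produce empty tokens and newlines are not treated as separators. A small sketch of the difference, where `word_count_whitespace` is a hypothetical alternative that this notebook does not use:

```python
def word_count(sentence):
    # Same logic as the helper used for the plots in this notebook.
    return len(sentence.strip().split(" "))


def word_count_whitespace(sentence):
    # Hypothetical variant: split on any run of whitespace (spaces, tabs, newlines).
    return len(sentence.split())


sample = "How  do I\nsort   a list?"
print(word_count(sample), word_count_whitespace(sample))
```

For short titles the two usually agree; the gap is more likely to matter for long bodies that contain code blocks or blank lines.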

## Checking density of words & characters present in the `question_body` feature

In [None]:
import seaborn as sns


def word_count(sentense):
    sentense = sentense.strip()

    return len(sentense.split(" "))


fig, ax = plt.subplots(1,2, figsize = ( 20 , 5))


question_body_lengths_train = train_dataset['question_body'].apply(len)
question_body_lengths_test = test_dataset['question_body'].apply(len)
question_body_lengths_train_words = train_dataset['question_body'].apply(word_count)
question_body_lengths_test_words = test_dataset['question_body'].apply(word_count)


sns.histplot(question_body_lengths_train, label="Train", kde=True, stat="density", linewidth=0,  color="red", ax=ax[0])
sns.histplot(question_body_lengths_test, label="Test", kde=True, stat="density", linewidth=0,  color="blue", ax=ax[0])
sns.histplot(question_body_lengths_train_words, label="Train", kde=True, stat="density", linewidth=0,  color="red", ax=ax[1])
sns.histplot(question_body_lengths_test_words, label="Test", kde=True, stat="density", linewidth=0,  color="blue", ax=ax[1])

# Set label for x-axis
ax[0].set_xlabel( "No. of characters" , size = 12 )
  
# Set label for y-axis
ax[0].set_ylabel( "Density of character" , size = 12 )
  
# Set title for plot
ax[0].set_title( "Density of characters in 'question_body' feature\n" , size = 15 )

ax[0].legend()


# Set label for x-axis
ax[1].set_xlabel( "No. of Words" , size = 12 )
  
# Set label for y-axis
ax[1].set_ylabel( "Density of Words" , size = 12 )
  
# Set title for plot
ax[1].set_title( "Density of Words in 'question_body' feature\n" , size = 15 )

ax[1].legend()



plt.show();


### Observation:
* The distributions of both words and characters are strongly right-skewed.
* Most question bodies contain fewer than 2500 characters.
* Most question bodies contain fewer than 1000 words.

## Similarly we will check for `answer` feature

In [None]:
import seaborn as sns


def word_count(sentense):
    sentense = sentense.strip()
    return len(sentense.split(" "))


fig, ax = plt.subplots(1,2, figsize = ( 20 , 5))


answer_lengths_train = train_dataset['answer'].apply(len)
answer_lengths_test = test_dataset['answer'].apply(len)
answer_lengths_train_words = train_dataset['answer'].apply(word_count)
answer_lengths_test_words = test_dataset['answer'].apply(word_count)


sns.histplot(answer_lengths_train, label="Train", kde=True, stat="density", linewidth=0,  color="red", ax=ax[0])
sns.histplot(answer_lengths_test, label="Test", kde=True, stat="density", linewidth=0,  color="blue", ax=ax[0])
sns.histplot(answer_lengths_train_words, label="Train", kde=True, stat="density", linewidth=0,  color="red", ax=ax[1])
sns.histplot(answer_lengths_test_words, label="Test", kde=True, stat="density", linewidth=0,  color="blue", ax=ax[1])

# Set label for x-axis
ax[0].set_xlabel( "No. of characters" , size = 12 )
  
# Set label for y-axis
ax[0].set_ylabel( "Density of character" , size = 12 )
  
# Set title for plot
ax[0].set_title( "Density of characters in 'answer' feature\n" , size = 15 )

ax[0].legend()


# Set label for x-axis
ax[1].set_xlabel( "No. of Words" , size = 12 )
  
# Set label for y-axis
ax[1].set_ylabel( "Density of Words" , size = 12 )
  
# Set title for plot
ax[1].set_title( "Density of Words in 'answer' feature\n" , size = 15 )

ax[1].legend()



plt.show();


### Observation:
* As with question_body, the answer distribution is also skewed.
* There may be some extreme outliers whose word/character lengths are very high in both the question_body and answer features.

## Analyzing `question_body` and `answer` features sequence length

In [None]:
for i in range(0,101,10):
    print(f'{i}th percentile of question_body input sequence {np.percentile(question_body_lengths_train_words, i)}')
print()
for i in range(90,101):
    print(f'{i}th percentile of question_body input sequence {np.percentile(question_body_lengths_train_words, i)}')
print()
for i in [99.1,99.2,99.3,99.4,99.5,99.6,99.7,99.8,99.9,100]:
    print(f'{i}th percentile of question_body input sequence {np.percentile(question_body_lengths_train_words, i)}')

## **Observation:** 99.9% of question bodies contain fewer than **3220** words

In [None]:
for i in range(0,101,10):
    print(f'{i}th percentile of answer input sequence {np.percentile(answer_lengths_train_words, i)}')
print()
for i in range(90,101):
    print(f'{i}th percentile of answer input sequence {np.percentile(answer_lengths_train_words, i)}')
print()
for i in [99.1,99.2,99.3,99.4,99.5,99.6,99.7,99.8,99.9,100]:
    print(f'{i}th percentile of answer input sequence {np.percentile(answer_lengths_train_words, i)}')

## **Observation:** 99.9% of answers contain fewer than **2200** words
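
Percentiles like these are commonly used to pick a padding or truncation length that covers almost all rows without being dictated by a few outliers. A minimal sketch on synthetic lengths; the `lengths` array and the choice of the 99th percentile are illustrative assumptions, not values from this dataset:

```python
import numpy as np

# Synthetic word counts: mostly short texts plus a few very long outliers.
lengths = np.array([50] * 95 + [200] * 4 + [5000])

# Candidate maximum sequence length taken from a high percentile.
max_len = int(np.percentile(lengths, 99))

# Fraction of texts that fit without truncation.
coverage = (lengths <= max_len).mean()
print(max_len, coverage)
```

The trade-off is between memory (shorter padded sequences) and the small share of texts that would be truncated.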

# Analyzing `category` Feature

In [None]:
train_dataset['category'].unique()

In [None]:
train_category_feature_count = train_dataset['category'].value_counts()
test_category_feature_count = test_dataset['category'].value_counts()

print("Train category:\n",train_category_feature_count)
print()
print("Test category:\n",test_category_feature_count)

In [None]:
figure, ax = plt.subplots(1,2, figsize=(12, 6))

train_category_feature_count.plot(kind='bar', ax=ax[0])
test_category_feature_count.plot(kind='bar', ax=ax[1])

ax[0].set_title('Train')
ax[0].set_xlabel( "unique category" , size = 12 )
ax[0].set_ylabel( "count" , size = 12 )

ax[1].set_title('Test')
ax[1].set_xlabel( "unique category" , size = 12 )
ax[1].set_ylabel( "count" , size = 12 )

plt.show()

In [None]:
# Sample Stack Overflow question and answer
train_dataset[train_dataset['category'] == 'STACKOVERFLOW'].values[11]

In [None]:
# sample science question and answer 
train_dataset[train_dataset['category'] == 'SCIENCE'].values[11]

In [None]:
# sample life art and culture question and answer
train_dataset[train_dataset['category'] == 'LIFE_ARTS'].values[11]

In [None]:
# sample culture question and answer
train_dataset[train_dataset['category'] == 'CULTURE'].values[11]

### Observation:
* The category feature has five unique categories.
* **Technology** and **Stackoverflow** have the highest counts, and the two are related topics.
* **Life_arts** is the category with the lowest count.
* The category distribution is the same in train and test.
* **Life_arts & culture** follow general English syntax and structure.
* **Science** uses LaTeX, with expressions enclosed in the symbol `$`.
* **Technology & stackoverflow** contain code snippets and logs.
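
The LaTeX observation could be turned into a simple indicator feature. The sketch below is a heuristic under stated assumptions: the pattern `LATEX_INLINE` and the helper `has_inline_latex` are hypothetical names, and the pattern will also fire on text that contains two dollar amounts on one line.

```python
import re

# Inline LaTeX delimited by single dollar signs, e.g. $E = mc^2$.
# Heuristic only: currency such as "$5 and $10" would match as well.
LATEX_INLINE = re.compile(r"\$[^$\n]+\$")


def has_inline_latex(text):
    # True when at least one $...$ span is found on a single line.
    return LATEX_INLINE.search(text) is not None


print(has_inline_latex("Energy is $E = mc^2$ in SI units."))
print(has_inline_latex("How do I reverse a list in Python?"))
```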

# Word cloud

In [None]:
from wordcloud import WordCloud


def plot_wordcloud(text, ax, title=None):
    wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
    ax.imshow(wordcloud)
    if title is not None:
        ax.set_title(title, size = 15)
    ax.axis("off")

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# word cloud for train data
text = ' '.join(train_dataset['question_title'].values)
plot_wordcloud(text, axes[0][0], 'Train Question title')

text = ' '.join(train_dataset['question_body'].values)
plot_wordcloud(text, axes[0][1], 'Train Question body')

text = ' '.join(train_dataset['answer'].values)
plot_wordcloud(text, axes[0][2], 'Train Answer')


# word cloud for Test data
text = ' '.join(test_dataset['question_title'].values)
plot_wordcloud(text, axes[1][0], 'Test Question title')

text = ' '.join(test_dataset['question_body'].values)
plot_wordcloud(text, axes[1][1], 'Test Question body')

text = ' '.join(test_dataset['answer'].values)
plot_wordcloud(text, axes[1][2], 'Test Answer')

plt.tight_layout()
fig.show()

### Observation:
* Several of the most frequent words appear in both the train and test sets.

Reference: https://www.kaggle.com/corochann/google-quest-first-data-introduction?scriptVersionId=23910525&cellId=34

# Analyzing labels 

In [None]:
for label in train_dataset.columns[11:]:
    print(f"{label:.20}: no. of unique label values: {len(train_dataset[label].unique())}")

### Observation:
* The output labels are real-valued (regression) targets, but their distribution is not continuous.
* Except for the `answer_satisfaction` label, every label takes only a small set of distinct values; some have 9 unique values and some have 5.
* Using this insight, we can post-process the predictions to get a better score.
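
One possible form of such post-processing is to snap each prediction to the nearest value that actually occurs in the training labels. This is a sketch of that idea, not necessarily the post-processing used later; `snap_to_observed` and the toy arrays are assumptions for illustration.

```python
import numpy as np


def snap_to_observed(predictions, observed_values):
    # Replace each prediction with the closest value seen in the training labels.
    observed = np.sort(np.unique(observed_values))
    nearest = np.abs(predictions[:, None] - observed[None, :]).argmin(axis=1)
    return observed[nearest]


train_label = np.array([0.0, 1 / 3, 2 / 3, 1.0, 1 / 3])  # toy label column
raw_preds = np.array([0.05, 0.30, 0.70, 0.95])           # toy model outputs
print(snap_to_observed(raw_preds, train_label))
```

Whether snapping helps depends on the evaluation metric: it introduces ties between predictions, which a rank-based metric may penalise, so it is worth validating per label.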

In [None]:
for label in train_dataset.columns[11:]:
    sns.histplot(train_dataset[label], label=label, kde=False)
    plt.show()

### Observation:
* **Label values are imbalanced**: for some labels, e.g. **question_type_spelling** and **question_not_really_a_question**, almost all instances share a single value, so the label distributions are very dissimilar.

### correlation between target variables

In [None]:
fig, ax = plt.subplots(figsize=(20,20))   
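# select the label columns (column 11 onwards); train_dataset[11:] would slice rows instead
sns.heatmap(train_dataset.iloc[:, 11:].corr(), linewidths=1, ax=ax, annot_kws={"fontsize":40})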
plt.show();

### Observations:
From the correlation heatmap above we can observe that `answer_helpful`, `answer_level_of_information`, `answer_plausible`, `answer_relevance` and `answer_satisfaction` are correlated with one another.
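
A heatmap of 30 labels is hard to read precisely, so it can help to list the most correlated pairs explicitly. A minimal sketch on a toy label frame (the column names `a`, `b`, `c` and their values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy label frame: "a" and "b" move together, "c" does not.
labels = pd.DataFrame({
    "a": [0.1, 0.4, 0.6, 0.9],
    "b": [0.2, 0.5, 0.7, 1.0],
    "c": [0.9, 0.1, 0.8, 0.2],
})

corr = labels.corr()

# Keep only the upper triangle so each pair appears once and self-pairs are dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

On the real data the same steps would start from the correlation matrix of the 30 label columns.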

## Analyzing `host` feature

In [None]:
print(f"Total unique host present in the dataset {len(train_dataset['host'].unique())}")

In [None]:
train_host_feature_count = train_dataset['host'].value_counts()


figure, ax = plt.subplots( figsize=(20, 5))

train_host_feature_count.plot(kind='bar', ax=ax)

ax.set_title('Train dataset - count of Q&A collected from each website', size=20)
ax.set_xlabel( "Host" , size = 12 )
ax.set_ylabel( "Count" , size = 12 )

plt.show()

In [None]:
test_host_feature_count = test_dataset['host'].value_counts()
figure, ax = plt.subplots( figsize=(20, 5))
test_host_feature_count.plot(kind='bar', ax=ax)
ax.set_title('Test dataset - count of Q&A collected from each website', size=20)
ax.set_xlabel( "Host" , size = 12 )
ax.set_ylabel( "Count" , size = 12 )
plt.show()


### Observation:
* All questions and answers in the dataset are extracted from **63 websites**.
* Most of the questions and answers are from **stackoverflow.com**, which agrees with the `category` feature analysis, where most instances fall under **technology and stackoverflow**.

# Splitting the data into train and validation sets

In [None]:
y_columns = ['question_asker_intent_understanding',
       'question_body_critical', 'question_conversational',
       'question_expect_short_answer', 'question_fact_seeking',
       'question_has_commonly_accepted_answer',
       'question_interestingness_others', 'question_interestingness_self',
       'question_multi_intent', 'question_not_really_a_question',
       'question_opinion_seeking', 'question_type_choice',
       'question_type_compare', 'question_type_consequence',
       'question_type_definition', 'question_type_entity',
       'question_type_instructions', 'question_type_procedure',
       'question_type_reason_explanation', 'question_type_spelling',
       'question_well_written', 'answer_helpful',
       'answer_level_of_information', 'answer_plausible', 'answer_relevance',
       'answer_satisfaction', 'answer_type_instructions',
       'answer_type_procedure', 'answer_type_reason_explanation',
       'answer_well_written']

y = train_dataset[y_columns]
X = train_dataset.drop(y_columns,axis=1)

In [None]:
X.shape, y.shape

In [None]:
from sklearn.model_selection import train_test_split


X_train_dataset, X_valid_dataset, y_train_dataset, y_valid_dataset = train_test_split(X,y, test_size=0.10)

In [None]:
X_train_dataset.shape, X_valid_dataset.shape, y_train_dataset.shape, y_valid_dataset.shape

In [None]:
X_train_dataset

#  **Preprocessing Text Feature**

In [None]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    phrase = re.sub(r"(W|w)on(\'|\’)t ", "will not ", phrase)
    phrase = re.sub(r"(C|c)an(\'|\’)t ", "can not ", phrase)
    phrase = re.sub(r"(Y|y)(\'|\’)all ", "you all ", phrase)
    phrase = re.sub(r"(Y|y)a(\'|\’)ll ", "you all ", phrase)
    phrase = re.sub(r"(I|i)(\'|\’)m ", "i am ", phrase)
    phrase = re.sub(r"(A|a)isn(\'|\’)t ", "is not ", phrase)
    phrase = re.sub(r"n(\'|\’)t ", " not ", phrase)
    phrase = re.sub(r"(\'|\’)re ", " are ", phrase)
    phrase = re.sub(r"(\'|\’)d ", " would ", phrase)
    phrase = re.sub(r"(\'|\’)ll ", " will ", phrase)
    phrase = re.sub(r"(\'|\’)t ", " not ", phrase)
    phrase = re.sub(r"(\'|\’)ve ", " have ", phrase)
    
    return phrase


def clean_text(x):

    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x

def clean_numbers(x):

    x = re.sub('[0-9]{5,}', '12345', x)
    x = re.sub('[0-9]{4}', '1234', x)
    x = re.sub('[0-9]{3}', '123', x)
    x = re.sub('[0-9]{2}', '12', x)
    return x

In [None]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]

In [None]:
# Combining all the above steps
from tqdm import tqdm
def preprocess_text(text_data):
    preprocessed_text = []
    # tqdm is for printing the status bar
    for sentance in tqdm(text_data):
        # each step works on the output of the previous one
        sent = decontracted(sentance)
        sent = clean_text(sent)
        sent = clean_numbers(sent)
        sent = sent.replace('\\r', ' ')
        sent = sent.replace('\\n', ' ')
        sent = sent.replace('\\"', ' ')
        sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
        # https://gist.github.com/sebleier/554280
        sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
        preprocessed_text.append(sent.lower().strip())
    return preprocessed_text
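
Cleaning steps compose only when each one receives the output of the previous one. The sketch below illustrates that chaining with two simplified, hypothetical helpers (`expand_contractions`, `strip_punctuation`); they are not the exact functions defined in this notebook.

```python
import re


def expand_contractions(text):
    # Simplified: handles "can't" and generic "n't" endings only.
    text = re.sub(r"can't", "can not", text)
    return re.sub(r"n't", " not", text)


def strip_punctuation(text):
    # Replace every run of non-alphanumeric characters with one space.
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()


def preprocess(text):
    # Each step receives the output of the previous one.
    for step in (expand_contractions, strip_punctuation, str.lower):
        text = step(text)
    return text


print(preprocess("I can't open file_01.txt, why isn't it there?"))
```

The order matters: contractions have to be expanded before punctuation is stripped, otherwise the apostrophes they rely on are already gone.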

In [None]:
X_train_dataset['preprocessed_question_title'] = preprocess_text(X_train_dataset['question_title'].values)
X_train_dataset['preprocessed_question_body'] = preprocess_text(X_train_dataset['question_body'].values)
X_train_dataset['preprocessed_answer'] = preprocess_text(X_train_dataset['answer'].values)


X_valid_dataset['preprocessed_question_title'] = preprocess_text(X_valid_dataset['question_title'].values)
X_valid_dataset['preprocessed_question_body'] = preprocess_text(X_valid_dataset['question_body'].values)
X_valid_dataset['preprocessed_answer'] = preprocess_text(X_valid_dataset['answer'].values)

In [None]:
test_dataset['preprocessed_question_title'] = preprocess_text(test_dataset['question_title'].values)
test_dataset['preprocessed_question_body'] = preprocess_text(test_dataset['question_body'].values)
test_dataset['preprocessed_answer'] = preprocess_text(test_dataset['answer'].values)

### question_title text after preprocessing

In [None]:
# Text before preprocessing
X_train_dataset['question_title'].values[0]

In [None]:
# Text after preprocessing
X_train_dataset['preprocessed_question_title'].values[0]

### question_body after preprocessing

In [None]:
# Text before preprocessing
X_train_dataset['question_body'].values[0]

In [None]:
# Text after preprocessing
X_train_dataset['preprocessed_question_body'].values[0]

### Answer after preprocessing

In [None]:
# Text before preprocessing
X_train_dataset['answer'].values[0]

In [None]:
# Text after preprocessing
X_train_dataset['preprocessed_answer'].values[0]

# **Feature engineering:**

## Text count based features:

1. Number of characters in the **question_title**
2. Number of characters in the **question_body**
3. Number of characters in the **answer**
4. Number of words in the **question_title**
5. Number of words in the **question_body**
6. Number of words in the **answer**
7. Number of unique words in the **question_title**
8. Number of unique words in the **question_body**
9. Number of unique words in the **answer**
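
The nine features follow one pattern: three counting functions applied to three text columns. A compact sketch of that pattern on a one-row toy frame (the rows are hypothetical; the resulting column names follow the `<column>_num_<kind>` convention):

```python
import pandas as pd


def word_count(text):
    return len(text.strip().split(" "))


def unique_word_count(text):
    return len(set(text.strip().split(" ")))


# Toy frame with the three text columns.
df = pd.DataFrame({
    "question_title": ["How to sort a list"],
    "question_body": ["I have a list and I want it sorted"],
    "answer": ["Use sorted or list sort"],
})

feature_funcs = {
    "num_chars": len,
    "num_words": word_count,
    "num_unique_words": unique_word_count,
}

# 3 columns x 3 functions = 9 count-based features.
for column in ["question_title", "question_body", "answer"]:
    for suffix, func in feature_funcs.items():
        df[f"{column}_{suffix}"] = df[column].apply(func)

print(df.filter(like="_num_").iloc[0].to_dict())
```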


In [None]:
def word_count(sentense):
    sentense = sentense.strip()

    return len(sentense.split(" "))

def unique_word_count(sentense):
    sentense = sentense.strip()

    return len(set(sentense.split(" ")))


In [None]:
import warnings
warnings.filterwarnings('ignore')

# Number of characters in the text
X_train_dataset["question_title_num_chars"] = X_train_dataset["question_title"].apply(len)
X_train_dataset["question_body_num_chars"] = X_train_dataset["question_body"].apply(len)
X_train_dataset["answer_num_chars"] = X_train_dataset["answer"].apply(len)

# Feature engineering for validation dataset
X_valid_dataset["question_title_num_chars"] = X_valid_dataset["question_title"].apply(len)
X_valid_dataset["question_body_num_chars"] = X_valid_dataset["question_body"].apply(len)
X_valid_dataset["answer_num_chars"] = X_valid_dataset["answer"].apply(len)

test_dataset["question_title_num_chars"] = test_dataset["question_title"].apply(len)
test_dataset["question_body_num_chars"] = test_dataset["question_body"].apply(len)
test_dataset["answer_num_chars"] = test_dataset["answer"].apply(len)

#########################################################################################################
# Number of words in the text
X_train_dataset["question_title_num_words"] = X_train_dataset["question_title"].apply(word_count)
X_train_dataset["question_body_num_words"] = X_train_dataset["question_body"].apply(word_count)
X_train_dataset["answer_num_words"] = X_train_dataset["answer"].apply(word_count)

# validation dataset features
X_valid_dataset["question_title_num_words"] = X_valid_dataset["question_title"].apply(word_count)
X_valid_dataset["question_body_num_words"] = X_valid_dataset["question_body"].apply(word_count)
X_valid_dataset["answer_num_words"] = X_valid_dataset["answer"].apply(word_count)

test_dataset["question_title_num_words"] = test_dataset["question_title"].apply(word_count)
test_dataset["question_body_num_words"] = test_dataset["question_body"].apply(word_count)
test_dataset["answer_num_words"] = test_dataset["answer"].apply(word_count)


#######################################################################################################
# Number of unique words in the text
X_train_dataset["question_title_num_unique_words"] = X_train_dataset["question_title"].apply(unique_word_count)
X_train_dataset["question_body_num_unique_words"] = X_train_dataset["question_body"].apply(unique_word_count)
X_train_dataset["answer_num_unique_words"] = X_train_dataset["answer"].apply(unique_word_count)

# Validation dataset
X_valid_dataset["question_title_num_unique_words"] = X_valid_dataset["question_title"].apply(unique_word_count)
X_valid_dataset["question_body_num_unique_words"] = X_valid_dataset["question_body"].apply(unique_word_count)
X_valid_dataset["answer_num_unique_words"] = X_valid_dataset["answer"].apply(unique_word_count)

test_dataset["question_title_num_unique_words"] = test_dataset["question_title"].apply(unique_word_count)
test_dataset["question_body_num_unique_words"] = test_dataset["question_body"].apply(unique_word_count)
test_dataset["answer_num_unique_words"] = test_dataset["answer"].apply(unique_word_count)

## TF-IDF based features:

* Word Level N-Gram TF-IDF of **question_title**
* Word Level N-Gram TF-IDF of **question_body**
* Word Level N-Gram TF-IDF of **answer**
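
A minimal sketch of word-level n-gram TF-IDF followed by dimensionality reduction, on toy texts. The texts, `ngram_range=(1, 2)` and `n_components=2` are assumptions chosen for illustration; `TfidfVectorizer` defaults to unigrams unless `ngram_range` is set.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "sort a python list",
    "reverse a python list",
    "install numpy with pip",
    "upgrade pip on windows",
]
valid_texts = ["sort a numpy array", "install pip"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
svd = TruncatedSVD(n_components=2, random_state=0)

# Vocabulary, IDF weights and projection are learned from the training texts only.
train_vectors = svd.fit_transform(vectorizer.fit_transform(train_texts))

# The fitted objects are then reused unchanged for unseen texts.
valid_vectors = svd.transform(vectorizer.transform(valid_texts))

print(train_vectors.shape, valid_vectors.shape)
```

Reusing the fitted vectorizer and SVD keeps train and validation vectors in the same feature space, which a downstream model needs in order to generalise.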

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# One vectorizer/SVD pair per text column, fitted on the training split only.
qt_vectorizer, qt_tsvd = TfidfVectorizer(min_df=2), TruncatedSVD(n_components=128, n_iter=5)
qb_vectorizer, qb_tsvd = TfidfVectorizer(min_df=2), TruncatedSVD(n_components=128, n_iter=5)
ans_vectorizer, ans_tsvd = TfidfVectorizer(min_df=2), TruncatedSVD(n_components=128, n_iter=5)


qt_tfidf = qt_vectorizer.fit_transform(X_train_dataset['preprocessed_question_title'].values)
tfidf_question_title_train = qt_tsvd.fit_transform(qt_tfidf)


qb_tfidf = qb_vectorizer.fit_transform(X_train_dataset['preprocessed_question_body'].values)
tfidf_question_body_train = qb_tsvd.fit_transform(qb_tfidf)


ans_tfidf = ans_vectorizer.fit_transform(X_train_dataset['preprocessed_answer'].values)
tfidf_answer_train = ans_tsvd.fit_transform(ans_tfidf)

In [None]:
# Validation data: reuse the objects fitted on the training split (transform, not fit_transform)
# so that the features live in the same space as the training features.
qt_tfidf = qt_vectorizer.transform(X_valid_dataset['preprocessed_question_title'].values)
tfidf_question_title_valid = qt_tsvd.transform(qt_tfidf)


qb_tfidf = qb_vectorizer.transform(X_valid_dataset['preprocessed_question_body'].values)
tfidf_question_body_valid = qb_tsvd.transform(qb_tfidf)


ans_tfidf = ans_vectorizer.transform(X_valid_dataset['preprocessed_answer'].values)
tfidf_answer_valid = ans_tsvd.transform(ans_tfidf)

In [None]:
# Test data: same fitted objects as above.
qt_tfidf = qt_vectorizer.transform(test_dataset['preprocessed_question_title'].values)
tfidf_question_title_test = qt_tsvd.transform(qt_tfidf)


qb_tfidf = qb_vectorizer.transform(test_dataset['preprocessed_question_body'].values)
tfidf_question_body_test = qb_tsvd.transform(qb_tfidf)


ans_tfidf = ans_vectorizer.transform(test_dataset['preprocessed_answer'].values)
tfidf_answer_test = ans_tsvd.transform(ans_tfidf)

In [None]:
X_train_dataset["tfidf_question_title"] = list(tfidf_question_title_train)
X_train_dataset["tfidf_question_body"] = list(tfidf_question_body_train)
X_train_dataset["tfidf_answer"] = list(tfidf_answer_train)

In [None]:
X_valid_dataset["tfidf_question_title"] = list(tfidf_question_title_valid)
X_valid_dataset["tfidf_question_body"] = list(tfidf_question_body_valid)
X_valid_dataset["tfidf_answer"] = list(tfidf_answer_valid)

In [None]:
test_dataset["tfidf_question_title"] = list(tfidf_question_title_test)
test_dataset["tfidf_question_body"] = list(tfidf_question_body_test)
test_dataset["tfidf_answer"] = list(tfidf_answer_test)

## Features using web scraping 


## `answer_user_page` features:


In [None]:
!pip install bs4

In [None]:

from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
from urllib import request


def get_user_rating(url):
    """Return [reputation, gold, silver, bronze] scraped from a user page, or zeros on failure."""
    try:
        page = request.urlopen(url).read()
        src = BeautifulSoup(page, 'html.parser')
        reputation = int(src.find_all("div", class_='fs-body3 fc-dark')[0].text.strip().replace(',', ''))

        # badge counts appear in the order gold, silver, bronze
        badge_divs = src.find_all('div', class_='fs-title fw-bold fc-black-800')
        badges = []
        for i in range(3):
            try:
                badges.append(int(badge_divs[i].text.strip().replace(',', '')))
            except Exception:
                # badge missing or not parsable
                badges.append(0)

        output = [reputation] + badges
    except Exception:
        # network error, missing reputation element, or unparsable number
        output = [0]*4

    return output
'''
data = []
for url in tqdm(X_train_dataset['answer_user_page']):
    #print(url)
    data.append(get_user_rating(url))
    columns = ['reputation', 'gold', 'silver', 'bronze']  
scraped = pd.DataFrame(data, columns=columns)
scraped.to_csv(f'train_web_scrap_features.csv', index=False)

data = []
for url in tqdm(X_valid_dataset['answer_user_page']):
    #print(url)
    data.append(get_user_rating(url))
    columns = ['reputation', 'gold', 'silver', 'bronze']  
scraped = pd.DataFrame(data, columns=columns)
scraped.to_csv(f'valid_web_scrap_features.csv', index=False)
'''

train_web_scraping_feature = pd.read_csv('../input/feature-engineering/train_web_scrap_features.csv')
valid_web_scraping_feature = pd.read_csv('../input/feature-engineering/valid_web_scrap_features.csv')

In [None]:
train_web_scraping_feature

In [None]:
valid_web_scraping_feature
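
The scraped reputation and badge counts arrive as text that may contain thousands separators (for example `1,234`). A minimal sketch of a tolerant parser; `parse_count` is a hypothetical helper and not part of this notebook:

```python
def parse_count(text, default=0):
    # Convert strings such as " 12,345 " to int; fall back to `default` if parsing fails.
    try:
        return int(text.strip().replace(",", ""))
    except (AttributeError, ValueError):
        return default


print(parse_count(" 12,345 "), parse_count("n/a"), parse_count(None))
```

Returning a default instead of raising keeps one malformed page from aborting a long scraping run, at the cost of hiding which rows failed.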

### References for feature engineering:
* https://www.kaggle.com/c/google-quest-challenge/discussion/130041 - meta features.
* https://www.kaggle.com/codename007/start-from-here-quest-complete-eda-fe?scriptVersionId=25618132&cellId=65 - tfidf, count based features.
* https://towardsdatascience.com/hands-on-transformers-kaggle-google-quest-q-a-labeling-affd3dad7bcb - web scraping features

# Converting **`question_title`**  Text -> Vector

In [None]:
import tensorflow as tf

question_title_tk = tf.keras.preprocessing.text.Tokenizer(filters = " ")
question_title_tk.fit_on_texts(X_train_dataset['question_title'].values)

vocab_size_question_title = len(question_title_tk.word_index) + 1
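
The `+ 1` in the vocabulary size accounts for index 0, which the Keras tokenizer never assigns to a word and which is later used as the padding value. A pure-Python sketch of the idea; `build_word_index` is a hypothetical simplification that breaks frequency ties alphabetically:

```python
def build_word_index(texts):
    # Assign indices starting at 1, most frequent word first; 0 stays free for padding.
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


word_index = build_word_index(["how to sort", "how to reverse"])
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(word_index, vocab_size)
```

An embedding layer sized with `vocab_size` then has one row per word plus one row for the padding index.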

Converting **`question_title`** text feature to tokens 

In [None]:
tokenized_question_title_train = question_title_tk.texts_to_sequences(X_train_dataset['question_title'].values)
tokenized_question_title_valid = question_title_tk.texts_to_sequences(X_valid_dataset['question_title'].values)
tokenized_question_title_test = question_title_tk.texts_to_sequences(test_dataset['question_title'].values)

In [None]:
print("max length in the question title feature",max([(len(title)) for title in tokenized_question_title_train]))

Padding **`question_title`** tokens so that every question_title has the same length (i.e. 30)

In [None]:
tokenized_question_title_train = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question_title_train, maxlen=30, dtype='int32', padding='post')
tokenized_question_title_valid = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question_title_valid, maxlen=30, dtype='int32', padding='post')
tokenized_question_title_test = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question_title_test, maxlen=30, dtype='int32', padding='post')

In [None]:
tokenized_question_title_train.shape, tokenized_question_title_valid.shape, tokenized_question_title_test.shape
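
What `pad_sequences` does with `padding='post'` can be sketched in plain NumPy. Note that its `truncating` argument defaults to `'pre'`, so a sequence longer than `maxlen` loses tokens from its beginning. The helper `pad_post` below is a hypothetical re-implementation for illustration only:

```python
import numpy as np


def pad_post(sequences, maxlen):
    # Mimics padding='post' with the Keras default truncating='pre':
    # short sequences get trailing zeros, long ones keep their last `maxlen` tokens.
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]
        padded[row, :len(kept)] = kept
    return padded


print(pad_post([[5, 3], [7, 8, 9, 4, 2]], maxlen=4))
```

If the start of a text matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.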

# Converting **`question_body`** Feature Text -> Vector

In [None]:
question_body_tk = tf.keras.preprocessing.text.Tokenizer(filters = " ")
question_body_tk.fit_on_texts(X_train_dataset['question_body'].values)

vocab_size_question_body = len(question_body_tk.word_index) + 1

 Converting **`question_body`** text feature to tokens 

In [None]:
tokenized_question_body_train = question_body_tk.texts_to_sequences(X_train_dataset['question_body'].values)
tokenized_question_body_valid = question_body_tk.texts_to_sequences(X_valid_dataset['question_body'].values)
tokenized_question_body_test = question_body_tk.texts_to_sequences(test_dataset['question_body'].values)

In [None]:
question_body_max_len = max([(len(body)) for body in tokenized_question_body_train])
print("max length in the question body feature",question_body_max_len)

Padding the **`question_body`** token sequences so that every question body has the same length (i.e. 1397)

In [None]:
tokenized_question_body_train = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question_body_train, maxlen=question_body_max_len, dtype='int32', padding='post')
tokenized_question_body_valid = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question_body_valid, maxlen=question_body_max_len, dtype='int32', padding='post')
tokenized_question_body_test = tf.keras.preprocessing.sequence.pad_sequences(tokenized_question_body_test, maxlen=question_body_max_len, dtype='int32', padding='post')

In [None]:
tokenized_question_body_train.shape, tokenized_question_body_valid.shape, tokenized_question_body_test.shape

# Converting **`answer`** Feature Text -> Vector

In [None]:
answer_tk = tf.keras.preprocessing.text.Tokenizer(filters = " ")
answer_tk.fit_on_texts(X_train_dataset['answer'].values)

vocab_size_answer = len(answer_tk.word_index) + 1

Converting **`answer`** text feature to tokens 

In [None]:
tokenized_answer_train = answer_tk.texts_to_sequences(X_train_dataset['answer'].values)
tokenized_answer_valid = answer_tk.texts_to_sequences(X_valid_dataset['answer'].values)
tokenized_answer_test = answer_tk.texts_to_sequences(test_dataset['answer'].values)

In [None]:
answer_max_len = max([(len(answer)) for answer in tokenized_answer_train])
print("max length in the answer feature",answer_max_len)

Padding the **`answer`** token sequences so that every answer has the same length (i.e. 2332)

In [None]:
tokenized_answer_train = tf.keras.preprocessing.sequence.pad_sequences(tokenized_answer_train, maxlen=answer_max_len, dtype='int32', padding='post')
tokenized_answer_valid = tf.keras.preprocessing.sequence.pad_sequences(tokenized_answer_valid, maxlen=answer_max_len, dtype='int32', padding='post')
tokenized_answer_test = tf.keras.preprocessing.sequence.pad_sequences(tokenized_answer_test, maxlen=answer_max_len, dtype='int32', padding='post')

In [None]:
tokenized_question_title_train.shape, tokenized_question_body_train.shape, tokenized_answer_train.shape

In [None]:
tokenized_question_title_valid.shape, tokenized_question_body_valid.shape, tokenized_answer_valid.shape

In [None]:
tokenized_question_title_test.shape, tokenized_question_body_test.shape, tokenized_answer_test.shape

In [None]:
y_train_dataset.shape

In [None]:
y_valid_dataset.shape

# Embedding weights for **question_title**

In [None]:
import numpy as np
from tqdm.notebook import tqdm

glove_file = open('../input/glove-embeddings/glove.6B.300d.txt', encoding='utf8')

embeddings_index = dict()   
for line in tqdm(glove_file):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
glove_file.close()
print('Loaded %s word vectors.' % len(embeddings_index))

In [None]:
# create a weight matrix for words in training docs

embedding_matrix_question_title = np.zeros((vocab_size_question_title, 300))
for word, i in question_title_tk.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix_question_title[i] = embedding_vector

In [None]:
embedding_matrix_question_title.shape

# Embedding weights for **question_body**

In [None]:
# create a weight matrix for words in training docs

embedding_matrix_question_body = np.zeros((vocab_size_question_body, 300))
for word, i in question_body_tk.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix_question_body[i] = embedding_vector

In [None]:
embedding_matrix_question_body.shape

# Embedding weights for **answer**

In [None]:
# create a weight matrix for words in training docs

embedding_matrix_answer = np.zeros((vocab_size_answer, 300))
# rows are indexed by the answer tokenizer's word ids
for word, i in answer_tk.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix_answer[i] = embedding_vector

In [None]:
embedding_matrix_answer.shape
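
The three weight-matrix cells above repeat the same loop, so a small helper that takes the tokenizer's `word_index` as an argument keeps each matrix paired with its own vocabulary. The sketch below is one possible shape for such a helper; `build_embedding_matrix` and the toy inputs are hypothetical names, not part of the original notebook. It also returns the share of vocabulary words that were found in the pretrained vectors.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim):
    # row i holds the pretrained vector of the word with id i;
    # row 0 (padding) and words without a pretrained vector stay zero
    matrix = np.zeros((len(word_index) + 1, dim))
    found = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
            found += 1
    coverage = found / max(len(word_index), 1)
    return matrix, coverage

# toy inputs for illustration only
toy_word_index = {"python": 1, "install": 2, "qwertyzz": 3}
toy_embeddings = {
    "python": np.ones(4, dtype="float32"),
    "install": np.full(4, 0.5, dtype="float32"),
}
toy_matrix, toy_coverage = build_embedding_matrix(toy_word_index, toy_embeddings, dim=4)
print(toy_matrix.shape, round(toy_coverage, 3))
```

With the notebook's objects, a call would look like `build_embedding_matrix(answer_tk.word_index, embeddings_index, 300)`.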


# Building Base LSTM Model 

In [None]:
############################ INPUT 1 - Question title ############################################################
input_1 = tf.keras.layers.Input(shape=(30,))
embed_1 = tf.keras.layers.Embedding(vocab_size_question_title, 
                                    300, 
                                    weights=[embedding_matrix_question_title],
                                    input_length=30,
                                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23), 
                                    trainable=False)(input_1)
lstm1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
          return_sequences=True))(embed_1)
print("lstm_1 :",lstm1.shape)
flat_1 = tf.keras.layers.Flatten()(lstm1)


############################ INPUT 2 - Question body ##############################################################
input_2 = tf.keras.layers.Input(shape=(question_body_max_len,))
embed_2 = tf.keras.layers.Embedding(vocab_size_question_body,
                    300, 
                    weights=[embedding_matrix_question_body], 
                    input_length=question_body_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=24), 
                    trainable=False)(input_2)
lstm2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=27),
          return_sequences=True))(embed_2)
print("lstm_2 :",lstm2.shape)
flat_2 = tf.keras.layers.Flatten()(lstm2)


############################ INPUT 3 - Answer ################################################################
input_3 = tf.keras.layers.Input(shape=(answer_max_len,))
embed_3 = tf.keras.layers.Embedding(vocab_size_answer, 300, weights=[embedding_matrix_answer], 
                    input_length=answer_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=25), 
                    trainable=False)(input_3)
lstm3 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=28),
          return_sequences=True))(embed_3)
print("lstm_3 :",lstm3.shape)
flat_3 = tf.keras.layers.Flatten()(lstm3)




print("Flat_1 shape :",flat_1.shape)
print("Flat_2 shape :",flat_2.shape)
print("Flat_3 shape :",flat_3.shape)

concat = tf.keras.layers.Concatenate()([flat_1,flat_2,flat_3])
print(concat.shape)
dense = tf.keras.layers.Dense(32,activation = 'relu',kernel_initializer=tf.keras.initializers.he_normal(seed=40))(concat)

output = tf.keras.layers.Dense(30,activation = 'sigmoid',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45))(dense)

model = tf.keras.Model(inputs = [input_1, input_2, input_3], outputs = output)

model.summary()
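
The shapes printed by the cell above follow from simple arithmetic, sketched below. A bidirectional LSTM with 24 units and `return_sequences=True` emits 48 values per time step (forward and backward outputs concatenated, which is the Keras default merge mode), and `Flatten` multiplies that by the sequence length. The lengths 1397 and 2332 are the padded lengths quoted earlier in this notebook; the variable names are illustrative.

```python
lstm_units = 24
features_per_step = 2 * lstm_units  # forward + backward outputs are concatenated

seq_lengths = {"question_title": 30, "question_body": 1397, "answer": 2332}
flat_sizes = {name: length * features_per_step for name, length in seq_lengths.items()}
concat_size = sum(flat_sizes.values())
dense_params = concat_size * 32 + 32  # weights + biases of the Dense(32) layer

print(flat_sizes)
print(concat_size, dense_params)
```

Under these assumptions the `Dense(32)` layer after the concatenation holds several million weights, far more than the LSTM layers themselves, which may be worth keeping in mind when reading the overfitting observations later on.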

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
from scipy.stats import spearmanr

list_1 = [1,2,3,4,5]
list_2 = [2,3,4,5,6]
spearmanr(list_1, list_2)
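
Spearman correlation only looks at the ordering of the values: it is the Pearson correlation of the two rank vectors. The pure-Python sketch below (hypothetical helpers `average_ranks` and `spearman`, not SciPy's implementation) illustrates this, and also shows that the result is undefined when one input is constant. That is presumably why the callback defined further down adds a tiny amount of noise to the predictions before calling `spearmanr`.

```python
def average_ranks(values):
    # rank 1 = smallest value; tied values share the average of their positions
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman(x, y):
    # Pearson correlation of the two rank vectors
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one input is constant
    return cov / (var_x * var_y) ** 0.5

print(spearman([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))            # shifted copy
print(spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))        # monotone but non-linear
print(spearman([1, 2, 3, 4, 5], [0.5, 0.5, 0.5, 0.5, 0.5]))  # constant column
```

Because only the ranks matter, any strictly increasing transformation of the predictions leaves the score unchanged.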

In [None]:
from scipy.stats import spearmanr

class SpearmanCallback(tf.keras.callbacks.Callback):
    """Prints the mean column-wise Spearman correlation on the validation set after each epoch."""
    def __init__(self, validation_data):
        super().__init__()
        self.x_val = validation_data[0]
        self.y_val = validation_data[1]

    def on_epoch_end(self, epoch, logs=None):
        y_pred_val = self.model.predict(self.x_val)
        # tiny noise keeps spearmanr from returning NaN when a prediction column is constant
        rho_val = np.mean([spearmanr(self.y_val[:, ind], y_pred_val[:, ind] + np.random.normal(0, 1e-7, y_pred_val.shape[0])).correlation for ind in range(y_pred_val.shape[1])])
        print('\nval_spearman-corr: %s' % (str(round(rho_val, 6))), end=100*' '+'\n')
        return rho_val

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                   optimizer=tf.keras.optimizers.Adam()
            )

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
y_train_dataset = np.array(y_train_dataset)
custom_callback = SpearmanCallback(validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid ], y_valid_dataset))

In [None]:
#history = model.fit([tokenized_question_title_train, tokenized_question_body_train, tokenized_answer_train], y_train_dataset, 
#           epochs=10,  
#           validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid ], y_valid_dataset), 
#           callbacks=[custom_callback])

In [None]:
#pd.DataFrame(history.history).plot()
#plt.title("Base model with 3 features")

## **Observation:** 
After 11 epochs the model achieved a Spearman score of **0.27447** using only the three text features: question_title, question_body and answer.

# Base LSTM model applying 18 Feature Engineering features

* **3 text features** - question_title, question_body, answer
* **9 Feature engineering features (9 dim)** -  question_title_num_chars, question_body_num_chars, answer_num_chars,  question_title_num_words, question_body_num_words, answer_num_words, question_title_num_unique_words, question_body_num_unique_words, answer_num_unique_words
* **3 TF-IDF features (384 dim)** - TF-IDF question_title, TF-IDF question_body, TF-IDF answer.
* **4 Web scraping features (4 dim)** - reputation, gold, silver, bronze.
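
The nine count-based features in the list above can be sketched with a small helper. This is only an illustration of what the column names suggest: `meta_features` is a hypothetical function, and whitespace tokenization is an assumption, so the cells that actually created these columns may count words differently.

```python
def meta_features(text, prefix):
    # whitespace tokenization is an assumption made for this sketch
    words = text.split()
    return {
        f"{prefix}_num_chars": len(text),
        f"{prefix}_num_words": len(words),
        f"{prefix}_num_unique_words": len(set(words)),
    }

# toy question/answer pair for illustration only
row = {}
row.update(meta_features("How do I pad a sequence", "question_title"))
row.update(meta_features("I want to pad a sequence to a fixed length", "question_body"))
row.update(meta_features("Use pad_sequences", "answer"))
print(len(row), row["question_body_num_unique_words"])
```

Three counts for each of the three text fields give the 9-dimensional input used by the models below.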

In [None]:
feature_engineer_columns = ['question_title_num_chars', 'question_body_num_chars',
       'answer_num_chars', 'question_title_num_words',
       'question_body_num_words', 'answer_num_words',
       'question_title_num_unique_words', 'question_body_num_unique_words',
       'answer_num_unique_words']


tfidf_features = ['tfidf_question_title','tfidf_question_body', 'tfidf_answer']


train_web_scraping_feature.shape, valid_web_scraping_feature.shape

In [None]:
X_train_feature_eng = np.array(X_train_dataset[feature_engineer_columns])
X_valid_feature_eng = np.array(X_valid_dataset[feature_engineer_columns])

In [None]:
X_train_feature_eng.shape

In [None]:
X_valid_feature_eng.shape

###  TF-IDF Features

In [None]:
tfidf_question_title_train.shape, tfidf_question_body_train.shape, tfidf_answer_train.shape

In [None]:
tfidf_question_title_valid.shape, tfidf_question_body_valid.shape, tfidf_answer_valid.shape

In [None]:
tfidf_train_features = np.hstack((tfidf_question_title_train, tfidf_question_body_train, tfidf_answer_train))
tfidf_valid_features = np.hstack((tfidf_question_title_valid, tfidf_question_body_valid, tfidf_answer_valid))

In [None]:
tfidf_train_features.shape, tfidf_valid_features.shape
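
`np.hstack` places the three TF-IDF blocks side by side, so the row count stays the same and the column counts add up. The sketch below uses placeholder arrays; the width of 128 columns per field is an assumption chosen so that the total matches the 384 dimensions stated above, since the individual widths are not shown in this section.

```python
import numpy as np

n_rows = 5
# placeholder arrays standing in for the three per-field TF-IDF blocks
# (128 columns each is an assumption for this sketch)
title_block = np.zeros((n_rows, 128))
body_block = np.ones((n_rows, 128))
answer_block = np.full((n_rows, 128), 2.0)

stacked = np.hstack((title_block, body_block, answer_block))
print(stacked.shape)
# columns 0-127 come from the title, 128-255 from the body, 256-383 from the answer
```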

In [None]:
############################ INPUT 1 - Question title ############################################################
input_1 = tf.keras.layers.Input(shape=(30,))
embed_1 = tf.keras.layers.Embedding(vocab_size_question_title, 
                                    300, 
                                    weights=[embedding_matrix_question_title],
                                    input_length=30,
                                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23), 
                                    trainable=False)(input_1)
lstm1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
          return_sequences=True))(embed_1)
print("lstm_1 :",lstm1.shape)
flat_1 = tf.keras.layers.Flatten()(lstm1)


############################ INPUT 2 - Question body ##############################################################
input_2 = tf.keras.layers.Input(shape=(question_body_max_len,))
embed_2 = tf.keras.layers.Embedding(vocab_size_question_body,
                    300, 
                    weights=[embedding_matrix_question_body], 
                    input_length=question_body_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=24), 
                    trainable=False)(input_2)
lstm2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=27),
          return_sequences=True))(embed_2)
print("lstm_2 :",lstm2.shape)
flat_2 = tf.keras.layers.Flatten()(lstm2)


############################ INPUT 3 - Answer ################################################################
input_3 = tf.keras.layers.Input(shape=(answer_max_len,))
embed_3 = tf.keras.layers.Embedding(vocab_size_answer, 300, weights=[embedding_matrix_answer], 
                    input_length=answer_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=25), 
                    trainable=False)(input_3)
lstm3 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=28),
          return_sequences=True))(embed_3)
print("lstm_3 :",lstm3.shape)
flat_3 = tf.keras.layers.Flatten()(lstm3)

#########################  INPUT 4 - Feature Engineering features ############################################
input_4 = tf.keras.layers.Input(shape=(9,))
x = tf.keras.layers.Dense(64, activation='relu')(input_4)
flat_4 = tf.keras.layers.Dense(32, activation='relu')(x)


####################### INPUT 5 - TF-IDF Features ###########################################################
input_5 = tf.keras.layers.Input(shape=(384,))
x = tf.keras.layers.Dense(128, activation='relu')(input_5)
x = tf.keras.layers.Dense(64, activation='relu')(x)
flat_5 = tf.keras.layers.Dense(32, activation='relu')(x)


##################### Input 6 - Web scraping feature ##########################################################
input_6 = tf.keras.layers.Input(shape=(4,))
x = tf.keras.layers.Dense(128, activation='relu')(input_6)
x = tf.keras.layers.Dense(64, activation='relu')(x)
flat_6 = tf.keras.layers.Dense(32, activation='relu')(x)






print("Flat_1 shape :",flat_1.shape)
print("Flat_2 shape :",flat_2.shape)
print("Flat_3 shape :",flat_3.shape)
print("Flat_4 shape :",flat_4.shape)
print("Flat_5 shape :",flat_5.shape)
print("Flat_6 shape :",flat_6.shape)


concat = tf.keras.layers.Concatenate()([flat_1,flat_2,flat_3, flat_4, flat_5, flat_6])
print(concat.shape)
dense = tf.keras.layers.Dense(32,activation = 'relu',kernel_initializer=tf.keras.initializers.he_normal(seed=40))(concat)

output = tf.keras.layers.Dense(30,activation = 'sigmoid',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45))(dense)

model = tf.keras.Model(inputs = [input_1, input_2, input_3, input_4, input_5, input_6], outputs = output)

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
from scipy.stats import spearmanr

class SpearmanCallback(tf.keras.callbacks.Callback):
    """Prints the mean column-wise Spearman correlation on the validation set after each epoch."""
    def __init__(self, validation_data):
        super().__init__()
        self.x_val = validation_data[0]
        self.y_val = validation_data[1]

    def on_epoch_end(self, epoch, logs=None):
        y_pred_val = self.model.predict(self.x_val)
        # tiny noise keeps spearmanr from returning NaN when a prediction column is constant
        rho_val = np.mean([spearmanr(self.y_val[:, ind], y_pred_val[:, ind] + np.random.normal(0, 1e-7, y_pred_val.shape[0])).correlation for ind in range(y_pred_val.shape[1])])
        print('\nval_spearman-corr: %s' % (str(round(rho_val, 6))), end=100*' '+'\n')
        return rho_val

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                   optimizer=tf.keras.optimizers.Adam()
            )

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng, tfidf_valid_features, valid_web_scraping_feature ], y_valid_dataset))

In [None]:
#history = model.fit([tokenized_question_title_train, tokenized_question_body_train, tokenized_answer_train, X_train_feature_eng, tfidf_train_features, train_web_scraping_feature], y_train_dataset, 
#           epochs=10,  
#           validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng, tfidf_valid_features, valid_web_scraping_feature], y_valid_dataset), 
#           callbacks=[custom_callback])

## **Observation:** 
The model above was trained with all 18 feature-engineering features. With every feature included, performance dropped considerably compared with the base model that uses only the three text features.

In [None]:
#pd.DataFrame(history.history).plot()
#plt.title("Base Model with 18 FE features")

# Base LSTM model applying 13 Feature Engineering features

> The TF-IDF features are removed here because the previous model performed worse than the base model with only the three text features, and the TF-IDF input has the most dimensions (384) of the added features.


* **3 text features** - question_title, question_body, answer
* **9 Feature engineering features (9 dim)** -  question_title_num_chars, question_body_num_chars, answer_num_chars,  question_title_num_words, question_body_num_words, answer_num_words, question_title_num_unique_words, question_body_num_unique_words, answer_num_unique_words
* **4 Web scraping features (4 dim)** - reputation, gold, silver, bronze.

In [None]:
############################ INPUT 1 - Question title ############################################################
input_1 = tf.keras.layers.Input(shape=(30,))
embed_1 = tf.keras.layers.Embedding(vocab_size_question_title, 
                                    300, 
                                    weights=[embedding_matrix_question_title],
                                    input_length=30,
                                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23), 
                                    trainable=False)(input_1)
lstm1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
          return_sequences=True))(embed_1)
print("lstm_1 :",lstm1.shape)
flat_1 = tf.keras.layers.Flatten()(lstm1)


############################ INPUT 2 - Question body ##############################################################
input_2 = tf.keras.layers.Input(shape=(question_body_max_len,))
embed_2 = tf.keras.layers.Embedding(vocab_size_question_body,
                    300, 
                    weights=[embedding_matrix_question_body], 
                    input_length=question_body_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=24), 
                    trainable=False)(input_2)
lstm2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=27),
          return_sequences=True))(embed_2)
print("lstm_2 :",lstm2.shape)
flat_2 = tf.keras.layers.Flatten()(lstm2)


############################ INPUT 3 - Answer ################################################################
input_3 = tf.keras.layers.Input(shape=(answer_max_len,))
embed_3 = tf.keras.layers.Embedding(vocab_size_answer, 300, weights=[embedding_matrix_answer], 
                    input_length=answer_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=25), 
                    trainable=False)(input_3)
lstm3 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=28),
          return_sequences=True))(embed_3)
print("lstm_3 :",lstm3.shape)
flat_3 = tf.keras.layers.Flatten()(lstm3)

#########################  INPUT 4 - Feature Engineering features ############################################
input_4 = tf.keras.layers.Input(shape=(9,))
x = tf.keras.layers.Dense(64, activation='relu')(input_4)
flat_4 = tf.keras.layers.Dense(32, activation='relu')(x)


##################### INPUT 5 - Web scraping feature ##########################################################
input_5 = tf.keras.layers.Input(shape=(4,))
x = tf.keras.layers.Dense(128, activation='relu')(input_5)
x = tf.keras.layers.Dense(64, activation='relu')(x)
flat_5 = tf.keras.layers.Dense(32, activation='relu')(x)

print("Flat_1 shape :",flat_1.shape)
print("Flat_2 shape :",flat_2.shape)
print("Flat_3 shape :",flat_3.shape)
print("Flat_4 shape :",flat_4.shape)
print("Flat_5 shape :",flat_5.shape)


concat = tf.keras.layers.Concatenate()([flat_1,flat_2,flat_3, flat_4, flat_5])
print(concat.shape)
dense = tf.keras.layers.Dense(32,activation = 'relu',kernel_initializer=tf.keras.initializers.he_normal(seed=40))(concat)

output = tf.keras.layers.Dense(30,activation = 'sigmoid',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45))(dense)

model = tf.keras.Model(inputs = [input_1, input_2, input_3, input_4, input_5], outputs = output)

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                   optimizer=tf.keras.optimizers.Adam()
            )

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng, valid_web_scraping_feature ], y_valid_dataset))

In [None]:
#history = model.fit([tokenized_question_title_train, tokenized_question_body_train, tokenized_answer_train, X_train_feature_eng, train_web_scraping_feature], y_train_dataset, 
#           epochs=10,  
#           validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng, valid_web_scraping_feature], y_valid_dataset), 
#           callbacks=[custom_callback])

In [None]:
#pd.DataFrame(history.history).plot()
#plt.title("Base model with 13 FE features")

## **Observation:**
Even after removing the **TF-IDF** features, the model still did not outperform the model with only the three basic text features.

# Base LSTM model applying 9 Feature Engineering features

> The **TF-IDF features** are removed because the model that included them performed worse than the base model with only the three text features, and the TF-IDF input has the most dimensions.

> The **web scraping features** are removed as well.


* **3 text features** - question_title, question_body, answer
* **9 Feature engineering features (9 dim)** -  question_title_num_chars, question_body_num_chars, answer_num_chars,  question_title_num_words, question_body_num_words, answer_num_words, question_title_num_unique_words, question_body_num_unique_words, answer_num_unique_words


In [None]:
############################ INPUT 1 - Question title ############################################################
input_1 = tf.keras.layers.Input(shape=(30,))
embed_1 = tf.keras.layers.Embedding(vocab_size_question_title, 
                                    300, 
                                    weights=[embedding_matrix_question_title],
                                    input_length=30,
                                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23), 
                                    trainable=False)(input_1)
lstm1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
          return_sequences=True))(embed_1)
print("lstm_1 :",lstm1.shape)
flat_1 = tf.keras.layers.Flatten()(lstm1)


############################ INPUT 2 - Question body ##############################################################
input_2 = tf.keras.layers.Input(shape=(question_body_max_len,))
embed_2 = tf.keras.layers.Embedding(vocab_size_question_body,
                    300, 
                    weights=[embedding_matrix_question_body], 
                    input_length=question_body_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=24), 
                    trainable=False)(input_2)
lstm2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=27),
          return_sequences=True))(embed_2)
print("lstm_2 :",lstm2.shape)
flat_2 = tf.keras.layers.Flatten()(lstm2)


############################ INPUT 3 - Answer ################################################################
input_3 = tf.keras.layers.Input(shape=(answer_max_len,))
embed_3 = tf.keras.layers.Embedding(vocab_size_answer, 300, weights=[embedding_matrix_answer], 
                    input_length=answer_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=25), 
                    trainable=False)(input_3)
lstm3 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=28),
          return_sequences=True))(embed_3)
print("lstm_3 :",lstm3.shape)
flat_3 = tf.keras.layers.Flatten()(lstm3)

#########################  INPUT 4 - Feature Engineering features ############################################
input_4 = tf.keras.layers.Input(shape=(9,))
x = tf.keras.layers.Dense(64, activation='relu')(input_4)
flat_4 = tf.keras.layers.Dense(32, activation='relu')(x)




print("Flat_1 shape :",flat_1.shape)
print("Flat_2 shape :",flat_2.shape)
print("Flat_3 shape :",flat_3.shape)
print("Flat_4 shape :",flat_4.shape)


concat = tf.keras.layers.Concatenate()([flat_1,flat_2,flat_3, flat_4])
print(concat.shape)
dense = tf.keras.layers.Dense(32,activation = 'relu',kernel_initializer=tf.keras.initializers.he_normal(seed=40))(concat)

output = tf.keras.layers.Dense(30,activation = 'sigmoid',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45))(dense)

model = tf.keras.Model(inputs = [input_1, input_2, input_3, input_4], outputs = output)

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                   optimizer=tf.keras.optimizers.Adam()
            )

In [None]:
custom_callback = SpearmanCallback(validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng ], y_valid_dataset))

In [None]:
#history = model.fit([tokenized_question_title_train, tokenized_question_body_train, tokenized_answer_train, X_train_feature_eng], y_train_dataset, 
#           epochs=10,  
#           validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng], y_valid_dataset), 
#           callbacks=[custom_callback])

In [None]:
#pd.DataFrame(history.history).plot()
#plt.title("Base model with 9 FE features")

## **Observation:**
The model with the meta (feature-engineering) features performed better than the models with TF-IDF and web scraping features, reaching its highest Spearman score of **0.2871** at the 14th epoch. The model started overfitting the training data after the 5th epoch: the training loss kept decreasing while the validation loss increased.

# Overall Observations:

* The model with the three basic text features (question_title, question_body, answer) plus the meta features achieved the best performance, ahead of the models with TF-IDF and web scraping features.
* The best base model achieved a Spearman score of **0.2871**.
* There is no need to train for 30 epochs: after 10 epochs the validation loss increases, so the model is overfitting the training data.
* The TF-IDF and web scraping features are not needed for the best performance; the meta features are the important ones.
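
The overfitting pattern described above (validation loss rising while training loss keeps falling) is what patience-based early stopping is designed to catch. The sketch below shows the logic in plain Python; `early_stopping_epoch` is a hypothetical helper and the loss values are made up for illustration, not taken from this notebook's runs. In Keras the same idea is available as `tf.keras.callbacks.EarlyStopping` with its `monitor`, `patience` and `restore_best_weights` arguments.

```python
def early_stopping_epoch(val_losses, patience=2):
    # returns (best_epoch, stop_epoch), both 1-based;
    # stop_epoch is None if the loss never stalls for `patience` epochs
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None

# made-up validation losses for illustration only, not results from this notebook
illustrative_val_losses = [0.45, 0.41, 0.39, 0.385, 0.39, 0.40, 0.42]
best_epoch, stop_epoch = early_stopping_epoch(illustrative_val_losses, patience=2)
print(best_epoch, stop_epoch)
```

In this toy run training would stop two epochs after the best validation loss, rather than running for a fixed number of epochs.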

# Base Model with 100 dim GloVe


# Embedding 100 dim weights for **question_title**

In [None]:
import numpy as np
from tqdm.notebook import tqdm

embeddings_index_100 = dict()
# the context manager closes the file even if parsing fails
with open('../input/glove-embeddings/glove.6B.100d.txt', encoding='utf8') as glove_file:
    for line in tqdm(glove_file):
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index_100[word] = coefs  # 100-dim vectors go into their own index
print('Loaded %s word vectors.' % len(embeddings_index_100))

In [None]:
# create a weight matrix for words in training docs

embedding_100_matrix_question_title = np.zeros((vocab_size_question_title, 100))
for word, i in question_title_tk.word_index.items():
    embedding_vector = embeddings_index_100.get(word)
    if embedding_vector is not None:
        embedding_100_matrix_question_title[i] = embedding_vector

In [None]:
embedding_100_matrix_question_title.shape

# Embedding 100 dim weights for **question_body**

In [None]:
# create a weight matrix for words in training docs

embedding_100_matrix_question_body = np.zeros((vocab_size_question_body, 100))
for word, i in question_body_tk.word_index.items():
    embedding_vector = embeddings_index_100.get(word)
    if embedding_vector is not None:
        embedding_100_matrix_question_body[i] = embedding_vector

In [None]:
embedding_100_matrix_question_body.shape

# Embedding 100 dim weights for **answer**

In [None]:
# create a weight matrix for words in training docs

embedding_100_matrix_answer = np.zeros((vocab_size_answer, 100))
# rows are indexed by the answer tokenizer's word ids
for word, i in answer_tk.word_index.items():
    embedding_vector = embeddings_index_100.get(word)
    if embedding_vector is not None:
        embedding_100_matrix_answer[i] = embedding_vector

In [None]:
embedding_100_matrix_answer.shape
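
After building each weight matrix it can be worth checking how many rows actually received a pretrained vector. A fraction of 0 would mean that no word was matched, for example because the lookup dictionary is empty. The helper below is a hypothetical sanity check, not part of the original notebook, and the toy matrix is for illustration only.

```python
import numpy as np

def nonzero_row_fraction(matrix):
    # share of rows (excluding the padding row 0) with at least one non-zero value
    rows = matrix[1:]
    if len(rows) == 0:
        return 0.0
    return float(np.any(rows != 0, axis=1).mean())

# toy matrix: 4 vocabulary rows, 2 of which received a vector
toy = np.zeros((5, 100))
toy[1] = 0.1
toy[3] = -0.2
print(nonzero_row_fraction(toy))
```

Running it on `embedding_100_matrix_question_title`, `embedding_100_matrix_question_body` and `embedding_100_matrix_answer` before building the model gives a quick indication of how much of each vocabulary the pretrained vectors cover.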

# Building base model with 100 dim embeddings



In [None]:
############################ INPUT 1 - Question title ############################################################
input_1 = tf.keras.layers.Input(shape=(30,))
embed_1 = tf.keras.layers.Embedding(vocab_size_question_title, 
                                    100, 
                                    weights=[embedding_100_matrix_question_title],
                                    input_length=30,
                                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23), 
                                    trainable=False)(input_1)
lstm1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
          return_sequences=True))(embed_1)
print("lstm_1 :",lstm1.shape)
flat_1 = tf.keras.layers.Flatten()(lstm1)


############################ INPUT 2 - Question body ##############################################################
input_2 = tf.keras.layers.Input(shape=(question_body_max_len,))
embed_2 = tf.keras.layers.Embedding(vocab_size_question_body,
                    100, 
                    weights=[embedding_100_matrix_question_body], 
                    input_length=question_body_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=24), 
                    trainable=False)(input_2)
lstm2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=27),
          return_sequences=True))(embed_2)
print("lstm_2 :",lstm2.shape)
flat_2 = tf.keras.layers.Flatten()(lstm2)


############################ INPUT 3 - Answer ################################################################
input_3 = tf.keras.layers.Input(shape=(answer_max_len,))
embed_3 = tf.keras.layers.Embedding(vocab_size_answer, 100, weights=[embedding_100_matrix_answer], 
                    input_length=answer_max_len, 
                    embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=25), 
                    trainable=False)(input_3)
lstm3 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(24,activation='tanh',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=28),
          return_sequences=True))(embed_3)
print("lstm_3 :",lstm3.shape)
flat_3 = tf.keras.layers.Flatten()(lstm3)

#########################  INPUT 4 - Feature Engineering features ############################################
input_4 = tf.keras.layers.Input(shape=(9,))
x = tf.keras.layers.Dense(64, activation='relu')(input_4)
x = tf.keras.layers.Dropout(0.2)(x)
flat_4 = tf.keras.layers.Dense(32, activation='relu')(x)


##################### Input 5 - Web scraping feature ##########################################################
input_5 = tf.keras.layers.Input(shape=(4,))
x = tf.keras.layers.Dense(128, activation='relu')(input_5)
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dropout(0.2)(x)
flat_5 = tf.keras.layers.Dense(32, activation='relu')(x)

print("Flat_1 shape :",flat_1.shape)
print("Flat_2 shape :",flat_2.shape)
print("Flat_3 shape :",flat_3.shape)
print("Flat_4 shape :",flat_4.shape)
print("Flat_5 shape :",flat_5.shape)


concat = tf.keras.layers.Concatenate()([flat_1,flat_2,flat_3, flat_4, flat_5])
print(concat.shape)
dense = tf.keras.layers.Dense(32,activation = 'relu',kernel_initializer=tf.keras.initializers.he_normal(seed=40))(concat)

output = tf.keras.layers.Dense(30,activation = 'sigmoid',kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45))(dense)

model = tf.keras.Model(inputs = [input_1, input_2, input_3, input_4, input_5], outputs = output)

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                   optimizer=tf.keras.optimizers.Adam()
            )

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng, valid_web_scraping_feature ], y_valid_dataset))

In [None]:
#history = model.fit([tokenized_question_title_train, tokenized_question_body_train, tokenized_answer_train, X_train_feature_eng, train_web_scraping_feature], y_train_dataset, 
#           epochs=10,  
#           validation_data=([tokenized_question_title_valid, tokenized_question_body_valid, tokenized_answer_valid, X_valid_feature_eng, valid_web_scraping_feature], y_valid_dataset), 
#           callbacks=[custom_callback])

In [None]:
#pd.DataFrame(history.history).plot()
#plt.title("Base model with 100 dim embeddings")

In [None]:
from prettytable import PrettyTable


myTable = PrettyTable(["Base Model", "Features", "Spearman score"])


myTable.add_row(["Bi-LSTM", "Three basic features", "0.2756"])
myTable.add_row(["Bi-LSTM", "Three basic features + 18 FE Features(meta, TF-IDF, Web scraping)", "0.00256"])
myTable.add_row(["Bi-LSTM", "Three basic features + 13 FE Features(meta, Web scraping)", "0.012556"])
myTable.add_row(["Bi-LSTM", "Three basic features + 9 FE Features(meta features)", "0.2865"])
myTable.add_row(["Bi-LSTM", "Three basic features + 13 FE features with 100 dim embeddings(meta, Web scraping)", "-0.004126"])

print(myTable)

# Observations:

* Among the Bi-LSTM experiments above, the base model with the three basic features plus the 9 meta features gives the highest score, **0.2865**.
* The model with 100-dimensional embeddings plus meta and web scraping features does not even reach the score of the base model with only the basic features.
* So, of the models we have experimented with so far, the base model with meta features scores best.

* The training set is small, and neural networks trained from scratch generally need far more data, so next we will try transfer learning with pretrained models such as BERT, ALBERT, XLNet, etc.

# Universal Sentence Encoder

The Universal Sentence Encoder makes getting sentence level embeddings as easy as it has historically been to lookup the embeddings for individual words. The sentence embeddings can then be trivially used to compute sentence level meaning similarity as well as to enable better performance on downstream classification tasks using less supervised training data.

> **Reference:** https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder

In [None]:
import tensorflow as tf

import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns

In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
uni_sen_embed = hub.load(module_url)

In [None]:
#@title Compute a representation for each message, showing various lengths supported.
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how long the paragraph is. Roughly, the longer "
    "the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]

message_embeddings = uni_sen_embed(messages)
for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
  print("Message: {}".format(messages[i]))
  print("Embedding size: {}".format(len(message_embedding)))
  message_embedding_snippet = ", ".join(
      (str(x) for x in message_embedding[:3]))
  print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

**Observation:** The embedding size is 512 for a single word, and it stays 512 for sentences and paragraphs of varying length.

In [None]:
def plot_similarity(labels, features, rotation):
    corr = np.inner(features, features)
    sns.set(font_scale=1.2)
    g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
    message_embeddings_ = uni_sen_embed(messages_)
    plot_similarity(messages_, message_embeddings_, 90)

    
################ Sample example of sentence-level embeddings using USE ################
messages = [
    # Smartphones
    "I like my phone",
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",
    "Global warming is real",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
    "Is paleo better than keto?",

    # Asking about age
    "How old are you?",
    "what is your age?",
]

run_and_plot(messages)

**Observation:** Similar sentences are grouped together: sentences on the same topic show higher similarity to each other than to sentences on other topics.
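
To read the same grouping off as numbers rather than colours, here is a minimal sketch. The 3-dimensional vectors and the labels are made-up stand-ins for the 512-dimensional USE embeddings (an assumption for illustration, not real encoder output). Once each row is scaled to unit length, the inner product is the cosine similarity, and the nearest neighbour of each sentence is the largest off-diagonal entry in its row.

```python
import numpy as np

# Toy stand-ins for sentence embeddings (assumed values, not real USE output).
labels = ["phone A", "phone B", "weather A", "weather B"]
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.3],
])

# Normalise each row to unit length so the inner product is the cosine similarity.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = np.inner(unit, unit)

# Mask the diagonal (self-similarity) and pick the closest other sentence per row.
masked = similarity.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)

for i, j in enumerate(nearest):
    print(f"{labels[i]} -> {labels[j]} (cosine similarity {similarity[i, j]:.3f})")
```

In this toy data each "phone" vector pairs with the other "phone" vector and each "weather" vector with the other "weather" vector, which is the block pattern the heatmap shows visually.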

In [None]:
X_train_dataset.shape, X_valid_dataset.shape, test_dataset.shape

In [None]:
embeddings_train = {}
embeddings_valid = {}
embeddings_test = {}


for text in ['question_title', 'question_body', 'answer']:
    print(text)
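    # Replace '?' and '!' with '.'; regex=False treats them as literal characters, not regex syntax
    train_text = X_train_dataset[text].str.replace('?', '.', regex=False).str.replace('!', '.', regex=False).tolist()
    valid_text = X_valid_dataset[text].str.replace('?', '.', regex=False).str.replace('!', '.', regex=False).tolist()
    test_text = test_dataset[text].str.replace('?', '.', regex=False).str.replace('!', '.', regex=False).tolist()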
    
    curr_train_emb = []
    curr_valid_emb = []
    curr_test_emb = []

    batch_size = 4
    ind = 0
    while ind*batch_size < len(train_text):
        curr_train_emb.append(uni_sen_embed(train_text[ind*batch_size: (ind + 1)*batch_size]).numpy())
        ind += 1
        
    ind = 0
    while ind*batch_size < len(test_text):
        curr_test_emb.append(uni_sen_embed(test_text[ind*batch_size: (ind + 1)*batch_size]).numpy())
        ind += 1    
        
    ind = 0
    while ind*batch_size < len(valid_text):
        curr_valid_emb.append(uni_sen_embed(valid_text[ind*batch_size: (ind + 1)*batch_size]).numpy())
        ind += 1    
    
    embeddings_train[text + '_USE_embedding'] = np.vstack(curr_train_emb)
    embeddings_valid[text + '_USE_embedding'] = np.vstack(curr_valid_emb)
    embeddings_test[text + '_USE_embedding'] = np.vstack(curr_test_emb)
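
The three `while` loops above repeat the same batching pattern for the train, test and validation texts. A minimal sketch of how that pattern could be factored into one helper is shown below. The names `embed_in_batches` and `fake_encoder` are hypothetical and not part of the notebook; `fake_encoder` is a stand-in so the sketch runs without downloading USE. In the notebook the encoder argument would be `uni_sen_embed`.

```python
import numpy as np

def embed_in_batches(texts, encoder, batch_size=4):
    """Encode `texts` in chunks of `batch_size` and stack the results row-wise."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # np.asarray converts array-like encoder outputs to a NumPy array
        # (for an eager TensorFlow tensor this should behave like `.numpy()`).
        chunks.append(np.asarray(encoder(batch)))
    return np.vstack(chunks)

# Hypothetical stand-in encoder: maps each text to a 2-dimensional vector
# (character count, number of spaces). It is NOT a real sentence encoder.
def fake_encoder(batch):
    return np.array([[len(t), t.count(" ")] for t in batch], dtype=float)

demo = embed_in_batches(["a b", "hello", "x y z", "one two", "last"],
                        fake_encoder, batch_size=2)
print(demo.shape)  # (5, 2): one row per input text
```

The last batch is simply shorter when the number of texts is not a multiple of `batch_size`, which matches how the slicing in the loops above behaves.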

In [None]:
embeddings_test['question_title_USE_embedding']

In [None]:
X_train = np.hstack([item for k, item in embeddings_train.items()])
X_valid = np.hstack([item for k, item in embeddings_valid.items()])
X_test = np.hstack([item for k, item in embeddings_test.items()])

In [None]:
# Building the model

input_1 = tf.keras.layers.Input(shape=(X_train.shape[1],))
x = tf.keras.layers.Dense(512, activation='elu')(input_1)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(218, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
output = tf.keras.layers.Dense(30, activation='sigmoid')(x)


model = tf.keras.Model(inputs=input_1, outputs=output)

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=['binary_crossentropy']
    )

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=(X_valid, y_valid_dataset))

In [None]:
history = model.fit(X_train, y_train_dataset,
          epochs =100,
          validation_data=(X_valid, y_valid_dataset),
                    callbacks=[custom_callback]
         )

In [None]:
pd.DataFrame(history.history).plot()
plt.title("USE with basic Three features")

## **Observations:** 
* By epoch 12 we achieved a Spearman score of **0.33029**; after that the model overfits, so there is no need to train for more than about 15 epochs.
* With only the three basic features (question_title, question_body, answer) encoded as USE embeddings, the Spearman score of 0.33029 already beats the Bi-LSTM base + 9 meta features model (0.2865).
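
The `SpearmanCallback` used above is defined earlier in the notebook and is not shown in this section. As a rough sketch of the kind of score it reports, assuming it averages the per-label Spearman rank correlation over the 30 target columns (as the competition metric does), the computation could look like this. The function name and the toy arrays are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_columnwise_spearman(y_true, y_pred):
    """Average Spearman rank correlation over the label columns."""
    scores = []
    for col in range(y_true.shape[1]):
        rho, _ = spearmanr(y_true[:, col], y_pred[:, col])
        # A constant column has no defined rank correlation (NaN); skip it here.
        if not np.isnan(rho):
            scores.append(rho)
    return float(np.mean(scores))

# Tiny made-up example with two label columns and four rows.
y_true = np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.3], [1.0, 0.0]])
y_pred = np.array([[0.2, 0.8], [0.3, 0.7], [0.9, 0.1], [0.7, 0.2]])
score = mean_columnwise_spearman(y_true, y_pred)
print(score)
```

Because the metric depends only on the ranking within each column, predictions do not need to match the target values themselves, only their order.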


## Let's run some more experiments with USE, as below
    1. USE + 9 Meta features
    2. USE + L2 distance similarity features
    3. USE + cosine similarity features
    4. USE + all distance similarity features + 9 Meta features


### **USE + 9 Meta features Model**

In [None]:
# Building the model


######################## USE Embeddings #########################
input_1 = tf.keras.layers.Input(shape=(X_train.shape[1],), name='USE_Embeddings')

####################### 9 Meta features #########################
input_2 = tf.keras.layers.Input(shape=(9,))
# Small dense tower for the meta features: Dense(64) -> Dropout -> Dense(32)
x = tf.keras.layers.Dense(64, activation='relu')(input_2)
x = tf.keras.layers.Dropout(0.2)(x)
flat_2 = tf.keras.layers.Dense(32, activation='relu')(x)

concat = tf.keras.layers.Concatenate()([input_1,flat_2])


x = tf.keras.layers.Dense(512, activation='elu')(concat)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(218, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
output = tf.keras.layers.Dense(30, activation='sigmoid')(x)


model = tf.keras.Model(inputs=[input_1,input_2], outputs=output)

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=['binary_crossentropy']
    )

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([X_valid,X_valid_feature_eng] , y_valid_dataset))

In [None]:
history = model.fit([X_train, X_train_feature_eng], y_train_dataset,
          epochs =100,
          validation_data=([X_valid, X_valid_feature_eng],y_valid_dataset),
                    callbacks=[custom_callback]
         )

In [None]:
pd.DataFrame(history.history).plot()
plt.title("USE with basic Three features + 9 Meta features")

## **Observation:** 
* With USE + 9 meta features we achieved a Spearman score of **0.3713**, which is better than USE with only the basic features.
* The loss decreases smoothly, and after the 90th epoch the model starts to overfit.
* Let's try L2 distance similarity features next, which can be computed from the USE embeddings.

# **USE + L2 distance similarity + 9 Meta Features Model**

In [None]:
# Computing the squared L2 (Euclidean) distance between paired embeddings, one value per row

l2_distance_similarity = lambda x, y: np.power(x - y, 2).sum(axis=1)


sim_distance_features_train = np.array([
    l2_distance_similarity(embeddings_train['question_title_USE_embedding'], embeddings_train['answer_USE_embedding']),
    l2_distance_similarity(embeddings_train['question_body_USE_embedding'], embeddings_train['answer_USE_embedding']),
    l2_distance_similarity(embeddings_train['question_body_USE_embedding'], embeddings_train['question_title_USE_embedding']),
    ]).T


sim_distance_features_valid = np.array([
    l2_distance_similarity(embeddings_valid['question_title_USE_embedding'], embeddings_valid['answer_USE_embedding']),
    l2_distance_similarity(embeddings_valid['question_body_USE_embedding'], embeddings_valid['answer_USE_embedding']),
    l2_distance_similarity(embeddings_valid['question_body_USE_embedding'], embeddings_valid['question_title_USE_embedding']),
    ]).T


sim_distance_features_test = np.array([
    l2_distance_similarity(embeddings_test['question_title_USE_embedding'], embeddings_test['answer_USE_embedding']),
    l2_distance_similarity(embeddings_test['question_body_USE_embedding'], embeddings_test['answer_USE_embedding']),
    l2_distance_similarity(embeddings_test['question_body_USE_embedding'], embeddings_test['question_title_USE_embedding']),
    ]).T


In [None]:
X_train = np.hstack([item for k, item in embeddings_train.items()] + [sim_distance_features_train])
X_valid = np.hstack([item for k, item in embeddings_valid.items()] + [sim_distance_features_valid])
X_test = np.hstack([item for k, item in embeddings_test.items()] + [sim_distance_features_test])

In [None]:
X_train.shape

In [None]:
# Building the model


######################## USE Embeddings #########################
input_1 = tf.keras.layers.Input(shape=(X_train.shape[1],), name='USE_Embeddings')

####################### 9 Meta features #########################
input_2 = tf.keras.layers.Input(shape=(9,))
# Small dense tower for the meta features: Dense(64) -> Dropout -> Dense(32)
x = tf.keras.layers.Dense(64, activation='relu')(input_2)
x = tf.keras.layers.Dropout(0.2)(x)
flat_2 = tf.keras.layers.Dense(32, activation='relu')(x)

concat = tf.keras.layers.Concatenate()([input_1,flat_2])


x = tf.keras.layers.Dense(512, activation='elu')(concat)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(218, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
output = tf.keras.layers.Dense(30, activation='sigmoid')(x)


model = tf.keras.Model(inputs=[input_1,input_2], outputs=output)

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=['binary_crossentropy']
    )

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([X_valid,X_valid_feature_eng] , y_valid_dataset))

In [None]:
history = model.fit([X_train, X_train_feature_eng], y_train_dataset,
          epochs =100,
          validation_data=([X_valid, X_valid_feature_eng],y_valid_dataset),
                    callbacks=[custom_callback]
         )

In [None]:
pd.DataFrame(history.history).plot()
plt.title("USE with basic Three features + L2 distance similarity features + 9 Meta features")

## **Observation:**
* Adding the L2 distance features did not improve on the USE + 9 meta features model above; the two models score about the same. This model reached its best score of **0.36124** at the 22nd epoch, and beyond roughly 20 epochs it started overfitting to the training data, as seen in the curve above.
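
Since the validation score peaks early and then degrades, one option is to stop on the validation score rather than always training for 100 epochs. The pure-Python sketch below illustrates the usual patience rule on made-up scores; the function name and the numbers are assumptions for illustration, not results from this notebook. Keras offers the same idea as `tf.keras.callbacks.EarlyStopping`, which could be used if the monitored quantity is available in the training logs.

```python
def best_epoch_with_patience(scores, patience=5):
    """Return (best_epoch, best_score, stop_epoch) for a list of per-epoch scores.

    Epochs are 1-indexed. Training is considered stopped once `patience`
    consecutive epochs have passed without beating the best score so far.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, best_score, epoch
    return best_epoch, best_score, len(scores)

# Made-up validation scores that rise, peak, then drift down.
val_scores = [0.30, 0.33, 0.35, 0.36, 0.355, 0.35, 0.34, 0.33]
result = best_epoch_with_patience(val_scores, patience=3)
print(result)  # (4, 0.36, 7)
```

Here the best score occurs at epoch 4 and training would stop at epoch 7, three epochs after the last improvement.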

# **USE + cosine distance + 9 Meta features**
Now let's repeat the same experiment, using cosine distance instead of L2 distance.
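
Before running it, it helps to see how the two features relate. For unit-length vectors, the squared L2 distance and the dot product are tied by the identity ||x - y||^2 = 2 - 2 (x . y), so the two carry essentially the same information. The sketch below checks this on random unit vectors, which are illustrative stand-ins and not USE output. The TensorFlow Hub tutorial referenced earlier describes USE embeddings as approximately normalized, so the identity should hold only approximately for the real features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random unit-length vectors standing in for USE embeddings (illustrative only).
x = rng.normal(size=(5, 8))
y = rng.normal(size=(5, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y /= np.linalg.norm(y, axis=1, keepdims=True)

dot = (x * y).sum(axis=1)                    # cosine similarity for unit vectors
squared_l2 = np.power(x - y, 2).sum(axis=1)  # squared Euclidean distance

# For unit vectors: ||x - y||^2 = 2 - 2 * (x . y)
identity_holds = np.allclose(squared_l2, 2 - 2 * dot)
print(identity_holds)  # True
```

This may be one reason the L2 and cosine variants end up with similar scores: each feature is close to a linear function of the other.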

In [None]:
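# Row-wise dot product of paired embeddings.
# Note: this equals the cosine similarity (a similarity, not a distance) only when
# the embeddings are unit-length; USE embeddings are described as approximately normalized.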

cos_dist = lambda x, y: (x*y).sum(axis=1)

cos_distance_features_train = np.array([
    cos_dist(embeddings_train['question_title_USE_embedding'], embeddings_train['answer_USE_embedding']),
    cos_dist(embeddings_train['question_body_USE_embedding'], embeddings_train['answer_USE_embedding']),
    cos_dist(embeddings_train['question_body_USE_embedding'], embeddings_train['question_title_USE_embedding']),
    ]).T


cos_distance_features_valid = np.array([
    cos_dist(embeddings_valid['question_title_USE_embedding'], embeddings_valid['answer_USE_embedding']),
    cos_dist(embeddings_valid['question_body_USE_embedding'], embeddings_valid['answer_USE_embedding']),
    cos_dist(embeddings_valid['question_body_USE_embedding'], embeddings_valid['question_title_USE_embedding']),
    ]).T


cos_distance_features_test = np.array([
    cos_dist(embeddings_test['question_title_USE_embedding'], embeddings_test['answer_USE_embedding']),
    cos_dist(embeddings_test['question_body_USE_embedding'], embeddings_test['answer_USE_embedding']),
    cos_dist(embeddings_test['question_body_USE_embedding'], embeddings_test['question_title_USE_embedding']),
    ]).T


In [None]:
X_train = np.hstack([item for k, item in embeddings_train.items()] + [cos_distance_features_train])
X_valid = np.hstack([item for k, item in embeddings_valid.items()] + [cos_distance_features_valid])
X_test = np.hstack([item for k, item in embeddings_test.items()] + [cos_distance_features_test])

In [None]:
X_train.shape

In [None]:
# Building the model


######################## USE Embeddings #########################
input_1 = tf.keras.layers.Input(shape=(X_train.shape[1],), name='USE_Embeddings')

####################### 9 Meta features #########################
input_2 = tf.keras.layers.Input(shape=(9,))
# Small dense tower for the meta features: Dense(64) -> Dropout -> Dense(32)
x = tf.keras.layers.Dense(64, activation='relu')(input_2)
x = tf.keras.layers.Dropout(0.2)(x)
flat_2 = tf.keras.layers.Dense(32, activation='relu')(x)

concat = tf.keras.layers.Concatenate()([input_1,flat_2])


x = tf.keras.layers.Dense(512, activation='elu')(concat)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(218, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
output = tf.keras.layers.Dense(30, activation='sigmoid')(x)


model = tf.keras.Model(inputs=[input_1,input_2], outputs=output)

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=['binary_crossentropy']
    )

model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([X_valid,X_valid_feature_eng] , y_valid_dataset))

In [None]:
history = model.fit([X_train, X_train_feature_eng], y_train_dataset,
          epochs =100,
          validation_data=([X_valid, X_valid_feature_eng],y_valid_dataset),
                    callbacks=[custom_callback]
         )

In [None]:
pd.DataFrame(history.history).plot()
plt.title("USE with basic Three features + Cosine distance features + 9 Meta features")

## **Observation:**
* The model achieved a Spearman score of **0.371537** after training for 50 epochs, similar to the two USE models above; training for more epochs leaves the score unchanged.

# **USE + All distance features + 9 Meta features**

In [None]:
X_train = np.hstack([item for k, item in embeddings_train.items()] + [cos_distance_features_train, sim_distance_features_train])
X_valid = np.hstack([item for k, item in embeddings_valid.items()] + [cos_distance_features_valid, sim_distance_features_valid])
X_test = np.hstack([item for k, item in embeddings_test.items()] + [cos_distance_features_test, sim_distance_features_test])

In [None]:
X_train.shape

In [None]:
# Building the model


######################## USE Embeddings #########################
input_1 = tf.keras.layers.Input(shape=(X_train.shape[1],), name='USE_Embeddings')

####################### 9 Meta features #########################
input_2 = tf.keras.layers.Input(shape=(9,))
# Small dense tower for the meta features: Dense(64) -> Dropout -> Dense(32)
x = tf.keras.layers.Dense(64, activation='relu')(input_2)
x = tf.keras.layers.Dropout(0.2)(x)
flat_2 = tf.keras.layers.Dense(32, activation='relu')(x)

concat = tf.keras.layers.Concatenate()([input_1,flat_2])


x = tf.keras.layers.Dense(512, activation='elu')(concat)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(218, activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)
output = tf.keras.layers.Dense(30, activation='sigmoid')(x)


model = tf.keras.Model(inputs=[input_1,input_2], outputs=output)

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=['binary_crossentropy']
    )

model.summary()

In [None]:
y_valid_dataset = np.array(y_valid_dataset)
custom_callback = SpearmanCallback(validation_data=([X_valid,X_valid_feature_eng] , y_valid_dataset))

In [None]:
history = model.fit([X_train, X_train_feature_eng], y_train_dataset,
          epochs =100,
          validation_data=([X_valid, X_valid_feature_eng],y_valid_dataset),
                    callbacks=[custom_callback]
         )

In [None]:
pd.DataFrame(history.history).plot()
plt.title("USE with basic Three features + L2 + Cosine distance features + 9 Meta features")

## **Observations:**

* With USE embeddings + 9 meta features + all distance similarity features, we were able to achieve a maximum score of **0.37149**.

In [None]:
from prettytable import PrettyTable


myTable = PrettyTable(["USE Model", "Features", "Spearman score"])


myTable.add_row(["USE", "Three basic features", "0.33029"])
myTable.add_row(["USE", "Three basic features + 9 Meta Features", "0.37133"])
myTable.add_row(["USE", "Three basic features + L2 distance feature + 9 meta Features", "0.36575"])
myTable.add_row(["USE", "Three basic features + cosine distance + 9 Meta features", "0.37153"])
myTable.add_row(["USE", "Three basic features + L2 distance +cosine distance + 9 Meta features", "0.37061"])


print(myTable)

# **USE Model Observations:**
* Training time for the USE models is very short.
* They take more epochs to reach the best score.
* Across all the USE experiments above, the maximum Spearman score we achieved is about **0.3712**, with both cosine + L2 distance features + 9 meta features.


# Overall experiment insights

In [None]:
from prettytable import PrettyTable


myTable = PrettyTable(["Model", "Features", "Spearman score"])


myTable.add_row(["Bi-LSTM", "Three basic features", "0.27561"])
myTable.add_row(["Bi-LSTM", "Three basic features + 18 FE Features(meta, TF-IDF, Web scraping)", "0.00253"])
myTable.add_row(["Bi-LSTM", "Three basic features + 13 FE Features(meta, Web scraping)", "0.01255"])
myTable.add_row(["Bi-LSTM", "Three basic features + 9 FE Features(meta features)", "0.28656"])
myTable.add_row(["Bi-LSTM", "Three basic features + 13 FE features with 100 dim embeddings(meta, Web scraping)", "-0.0041"])
myTable.add_row(["USE", "Three basic features", "0.33029"])
myTable.add_row(["USE", "Three basic features + 9 Meta Features", "0.37133"])
myTable.add_row(["USE", "Three basic features + L2 distance feature + 9 meta Features", "0.36575"])
myTable.add_row(["USE", "Three basic features + cosine distance + 9 Meta features", "0.37153"])
myTable.add_row(["USE", "Three basic features + L2 distance +cosine distance + 9 Meta features", "0.37061"])


print(myTable)

References used for USE:
* https://www.kaggle.com/abazdyrev/use-features-oof
* https://medium.com/analytics-vidhya/google-quest-challenge-q-a-labelling-9df4aff317d5
* https://www.kaggle.com/abhishek/distilbert-use-features-oof

> # Note: Further experiments will be performed using transformer based model like bert, roberta, albert, XLnet etc 