# Quora Question Pairs Similarity

## 1. Business Problem

Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

## Problem Statement

*   Identify which questions asked on Quora are duplicates of questions that have already been asked.
*   This could be useful to instantly provide answers to questions that have already been answered.
*   We are tasked with predicting whether a pair of questions are duplicates or not.

### Data Overview

- The data is in a single file, train.csv
- train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate
- Size of train.csv: 60MB
- Number of rows in train.csv: 404,290

In [None]:
# Standard library
import gc
import math
import os
import re
import sys
import warnings
from collections import Counter, defaultdict
from os import path
from subprocess import check_output

# Numerical and plotting libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
from tqdm import tqdm

# Text processing
import nltk
import spacy
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from scipy.sparse import hstack

# scikit-learn and related modelling tools
# (accuracy_score and log_loss are imported from sklearn.metrics directly;
#  the sklearn.metrics.classification module is not available in newer releases)
from sklearn import model_selection
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, log_loss,
                             normalized_mutual_info_score,
                             precision_recall_curve, roc_curve)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize
from sklearn.svm import SVC
from mlxtend.classifier import StackingClassifier

warnings.filterwarnings("ignore")

In [None]:
import zipfile

zf = zipfile.ZipFile('../input/quora-question-pairs/train.csv.zip')
quora_df = pd.read_csv(zf.open('train.csv'))

In [None]:
quora_df.head(3)

In [None]:
quora_df.info()

In [None]:
quora_df.groupby("is_duplicate")['id'].count().plot.bar()

In [None]:
quora_df['is_duplicate'].value_counts()

In [None]:
print('Question pairs are not Similar (is_duplicate = 0):\n   {}%'.format(100 - round(quora_df['is_duplicate'].mean()*100, 2)))
print('Question pairs are Similar (is_duplicate = 1):\n   {}%'.format(round(quora_df['is_duplicate'].mean()*100, 2)))

In [None]:
question_ids=pd.Series(quora_df['qid1'].tolist() + quora_df['qid2'].tolist())

unique_questions=len(np.unique(question_ids))
questions_morethan1=np.sum(question_ids.value_counts() > 1)


print('Total No of Unique questions :{} \n'.format(unique_questions))

print ('Number of unique questions that appear more than one time: {} ({}%)\n'.format(questions_morethan1,questions_morethan1/unique_questions*100))

print ('Max number of times a single question is repeated: {}\n'.format(max(question_ids.value_counts()))) 

q_vals=question_ids.value_counts()

q_vals=q_vals.values

In [None]:
type(question_ids)

In [None]:
question_ids[:5]

In [None]:
nan_rows=quora_df[quora_df.isnull().any(axis=1)]

print (nan_rows)

In [None]:
quora_df=quora_df.fillna('')

nan_rows=quora_df[quora_df.isnull().any(axis=1)]

print (nan_rows)

## 3.3 Basic Feature Extraction (before cleaning)



1.   freq_qid1 = Frequency of qid1 (number of rows in which that qid1 appears)
2.   freq_qid2 = Frequency of qid2 (number of rows in which that qid2 appears)
3.   q1len = Length of question 1 in characters
4.   q2len = Length of question 2 in characters
5.   q1_n_words = Number of words in question 1
6.   q2_n_words = Number of words in question 2
7.   word_Common = Number of unique words common to question 1 and question 2
8.   word_Total = Number of unique words in question 1 + number of unique words in question 2
9.   word_share = word_Common / word_Total
10.  freq_q1+q2 = Sum of the frequencies of qid1 and qid2
11.  freq_q1-q2 = Absolute difference of the frequencies of qid1 and qid2
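
As a quick check of the word-overlap features listed above, the following sketch computes word_Common, word_Total and word_share for a single made-up pair, following the set-based computation used in the notebook's code. The helper name `word_share_features` and the sample questions are illustrative assumptions, not part of the original notebook.

```python
def word_share_features(q1, q2):
    """Return (word_Common, word_Total, word_share) for one question pair."""
    # Lower-case, strip and de-duplicate the space-separated tokens.
    w1 = {w.lower().strip() for w in q1.split(" ")}
    w2 = {w.lower().strip() for w in q2.split(" ")}
    common = len(w1 & w2)        # unique words present in both questions
    total = len(w1) + len(w2)    # unique words in q1 plus unique words in q2
    return common, total, common / total


# Hypothetical example pair, for illustration only.
common, total, share = word_share_features(
    "How do I learn Python", "How can I learn Python quickly"
)
print(common, total, round(share, 3))  # 4 11 0.364
```

Because word_Total adds the two set sizes, word_share can be at most 0.5, which is reached when both questions contain exactly the same set of words.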

In [None]:
if os.path.isfile('feature_engg_preprocessing_train.csv'):
    # Reuse the cached features if they were already computed
    quora_df = pd.read_csv("feature_engg_preprocessing_train.csv", encoding='latin-1')
else:
    # How often each question id occurs in its column
    quora_df['freq_qid1'] = quora_df.groupby('qid1')['qid1'].transform('count')
    quora_df['freq_qid2'] = quora_df.groupby('qid2')['qid2'].transform('count')
    # Question length in characters and in space-separated words
    quora_df['q1len'] = quora_df['question1'].str.len()
    quora_df['q2len'] = quora_df['question2'].str.len()
    quora_df['q1_n_words'] = quora_df['question1'].apply(lambda row: len(row.split(" ")))
    quora_df['q2_n_words'] = quora_df['question2'].apply(lambda row: len(row.split(" ")))

    def word_sets(row):
        # Lower-cased, stripped, de-duplicated words of both questions
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
        return w1, w2

    def normalized_word_Common(row):
        w1, w2 = word_sets(row)
        return 1.0 * len(w1 & w2)
    quora_df['word_Common'] = quora_df.apply(normalized_word_Common, axis=1)

    def normalized_word_Total(row):
        w1, w2 = word_sets(row)
        return 1.0 * (len(w1) + len(w2))
    quora_df['word_Total'] = quora_df.apply(normalized_word_Total, axis=1)

    def normalized_word_share(row):
        w1, w2 = word_sets(row)
        return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
    quora_df['word_share'] = quora_df.apply(normalized_word_share, axis=1)

    quora_df['freq_q1+q2'] = quora_df['freq_qid1']+quora_df['freq_qid2']
    quora_df['freq_q1-q2'] = abs(quora_df['freq_qid1']-quora_df['freq_qid2'])

    # Cache the result so later runs can skip the computation
    quora_df.to_csv("feature_engg_preprocessing_train.csv", index=False)

quora_df.head()

In [None]:
print ("Minimum length of the questions in question1 : " , min(quora_df['q1_n_words']))

print ("Minimum length of the questions in question2 : " , min(quora_df['q2_n_words']))

print ("Number of Questions with minimum length [question1] :", quora_df[quora_df['q1_n_words']== 1].shape[0])
print ("Number of Questions with minimum length [question2] :", quora_df[quora_df['q2_n_words']== 1].shape[0])

In [None]:
plt.figure(figsize=(12, 8))

# Left: word_share distribution per class as violin plots
plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_share', data = quora_df)

# Right: overlaid word_share distributions for duplicates (red) and non-duplicates (blue)
plt.subplot(1,2,2)
sns.distplot(quora_df[quora_df['is_duplicate'] == 1.0]['word_share'] , label = "1", color = 'red')
sns.distplot(quora_df[quora_df['is_duplicate'] == 0.0]['word_share'] , label = "0" , color = 'blue' )
plt.legend()
plt.show()

### Looking at the violin plot, we can make two observations:

1. The distributions of the normalized word_share for the two classes overlap on the far right-hand side, i.e., there are quite a lot of question pairs with high word similarity in both classes.
2. The average word share and the number of common words of qid1 and qid2 are higher when the pair is a duplicate (similar).

In [None]:
pip install fuzzywuzzy

In [None]:
if os.path.isfile('feature_engg_preprocessing_train.csv'):
    quora_df = pd.read_csv("feature_engg_preprocessing_train.csv",encoding='latin-1')
    quora_df = quora_df.fillna('')
    quora_df.head()
else:
    print("get feature_engg_preprocessing_train.csv from drive or run the previous notebook")

In [None]:
quora_df.head(2)

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

# Small constant added to denominators to avoid division by zero
SAFE_DIV = 0.0001

STOP_WORDS = stopwords.words("english")


def preprocess(x):
    # Lower-case, then expand contractions and spell out currency and percent symbols
    x = str(x).lower()
    x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
                           .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
                           .replace("€", " euro ").replace("'ll", " will")
    # Abbreviate round numbers: 1000000 -> 1m, 1000 -> 1k
    x = re.sub(r"([0-9]+)000000", r"\1m", x)
    x = re.sub(r"([0-9]+)000", r"\1k", x)

    # Replace every non-word character with a space
    x = re.sub(r'\W', ' ', x)

    # Note: PorterStemmer.stem() treats its argument as a single word,
    # so this call does not stem each token of the question separately
    porter = PorterStemmer()
    x = porter.stem(x)

    # Strip any remaining HTML markup
    x = BeautifulSoup(x, "html.parser").get_text()

    return x

In [None]:
def get_token_features(q1, q2):
    token_features = [0.0]*10
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()

    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return token_features
    # Get the non-stopwords in Questions
    q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
    q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
    
    #Get the stopwords in Questions
    q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
    q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
    
    # Get the common non-stopwords from Question pair
    common_word_count = len(q1_words.intersection(q2_words))
    
    # Get the common stopwords from Question pair
    common_stop_count = len(q1_stops.intersection(q2_stops))
    
    # Get the common Tokens from Question pair
    common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
    
    
    token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    
    # Last word of both question is same or not
    token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
    
    # First word of both question is same or not
    token_features[7] = int(q1_tokens[0] == q2_tokens[0])
    
    # Absolute difference in the number of tokens
    token_features[8] = abs(len(q1_tokens) - len(q2_tokens))
    
    # Average number of tokens across both questions
    token_features[9] = (len(q1_tokens) + len(q2_tokens))/2
    return token_features

# get the Longest Common sub string

def get_longest_substr_ratio(a, b):
    strs = list(distance.lcsubstrings(a, b))
    if len(strs) == 0:
        return 0
    else:
        return len(strs[0]) / (min(len(a), len(b)) + 1)

def extract_features(df):
    # preprocessing each question
    df["question1"] = df["question1"].fillna("").apply(preprocess)
    df["question2"] = df["question2"].fillna("").apply(preprocess)

    print("token features...")
    
    # Merging Features with dataset
    
    token_features = df.apply(lambda x: get_token_features(x["question1"], x["question2"]), axis=1)
    
    df["cwc_min"]       = list(map(lambda x: x[0], token_features))
    df["cwc_max"]       = list(map(lambda x: x[1], token_features))
    df["csc_min"]       = list(map(lambda x: x[2], token_features))
    df["csc_max"]       = list(map(lambda x: x[3], token_features))
    df["ctc_min"]       = list(map(lambda x: x[4], token_features))
    df["ctc_max"]       = list(map(lambda x: x[5], token_features))
    df["last_word_eq"]  = list(map(lambda x: x[6], token_features))
    df["first_word_eq"] = list(map(lambda x: x[7], token_features))
    df["abs_len_diff"]  = list(map(lambda x: x[8], token_features))
    df["mean_len"]      = list(map(lambda x: x[9], token_features))
   
    #Computing Fuzzy Features and Merging with Dataset       
    print("fuzzy features..")

    df["token_set_ratio"]       = df.apply(lambda x: fuzz.token_set_ratio(x["question1"], x["question2"]), axis=1)
    # The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and 
    # then joining them back into a string We then compare the transformed strings with a simple ratio().
    df["token_sort_ratio"]      = df.apply(lambda x: fuzz.token_sort_ratio(x["question1"], x["question2"]), axis=1)
    df["fuzz_ratio"]            = df.apply(lambda x: fuzz.QRatio(x["question1"], x["question2"]), axis=1)
    df["fuzz_partial_ratio"]    = df.apply(lambda x: fuzz.partial_ratio(x["question1"], x["question2"]), axis=1)
    df["longest_substr_ratio"]  = df.apply(lambda x: get_longest_substr_ratio(x["question1"], x["question2"]), axis=1)
    return df

In [None]:
pip install distance

In [None]:
import distance
if os.path.isfile('nlp_features_train.csv'):
    quora_df = pd.read_csv("nlp_features_train.csv",encoding='latin-1')
    quora_df = quora_df.fillna('')
else:
    print("Extracting features for train:")
    quora_df = pd.read_csv('../input/quora-question-pairs/train.csv.zip')
    quora_df = extract_features(quora_df)
    quora_df.to_csv("nlp_features_train.csv", index=False)
quora_df.head(2)

## GENERATING WORD CLOUDS OF DUPLICATE AND NON-DUPLICATE QUESTION PAIRS TO OBSERVE THE MOST FREQUENTLY OCCURRING WORDS

In [None]:
df_duplicate = quora_df[quora_df['is_duplicate'] == 1]
df_nonduplicate = quora_df[quora_df['is_duplicate'] == 0]

# Converting 2d array of q1 and q2 and flatten the array: like {{1,2},{3,4}} to {1,2,3,4}
p = np.dstack([df_duplicate["question1"], df_duplicate["question2"]]).flatten()
n = np.dstack([df_nonduplicate["question1"], df_nonduplicate["question2"]]).flatten()

# p and n hold individual questions, i.e. two entries per pair
print ("Number of questions in class 1 (duplicate pairs) :",len(p))
print ("Number of questions in class 0 (non duplicate pairs) :",len(n))

#Saving the np array into a text file
np.savetxt('train_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('train_n.txt', n, delimiter=' ', fmt='%s')

In [None]:
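d = path.dirname('.')

# Read the saved question text back in as one long string per class
with open(path.join(d, 'train_p.txt')) as f:
    textp_w = f.read()
with open(path.join(d, 'train_n.txt')) as f:
    textn_w = f.read()

# Start from the WordCloud stop words, then adjust the set
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
# Keep negations and "like" visible in the clouds
stopwords.remove("not")
stopwords.remove("no")
stopwords.remove("like")

# len() of a string counts characters, not words
print ("Total number of characters in duplicate pair questions :",len(textp_w))
print ("Total number of characters in non duplicate pair questions :",len(textn_w))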

In [None]:
wc = WordCloud(background_color="white", max_words=len(textp_w), stopwords=stopwords)
wc.generate(textp_w)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
wc = WordCloud(background_color="white", max_words=len(textn_w),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

## TAKE 100K DATAPOINTS AND SPLIT THEM INTO TRAIN AND TEST

## FEATURIZING TEXT DATA USING TF-IDF 

In [None]:
quora_df['question1']=quora_df['question1'].apply(lambda x:str(x))
quora_df['question2']=quora_df['question2'].apply(lambda x:str(x))

quora_df.head()

In [None]:
if os.path.isfile('nlp_features_train.csv'):
    dfnlp = pd.read_csv("nlp_features_train.csv",encoding='latin-1')
else:
    print("download nlp_features_train.csv from drive or run previous notebook")

if os.path.isfile('feature_engg_preprocessing_train.csv'):
    dfppro = pd.read_csv("feature_engg_preprocessing_train.csv",encoding='latin-1')
else:
    print("download ./feature_engg_preprocessing_train.csv from drive or run previous notebook")

In [None]:
df1 = dfnlp.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df2 = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)

In [None]:
df3=dfnlp[['id','question1','question2']]
duplicate=dfnlp.is_duplicate

In [None]:
df3 = df3.fillna(' ')

In [None]:
new_dataframe = pd.DataFrame()

# Concatenate both questions into a single text field per pair
new_dataframe['questions']=df3.question1 + ' ' + df3.question2
new_dataframe['id']=df1['id']
df2['id']=df1['id']
# Join the NLP features, the basic features and the combined text on the pair id
final_df = df1.merge(df2, on='id',how='left')
X_Final  = final_df.merge(new_dataframe, on='id',how='left')

In [None]:
X_Final=X_Final.drop('id',axis=1)

In [None]:
X_Final.shape

In [None]:
X_Final.head(2)

In [None]:
X_Final.columns

In [None]:
Y_Final=np.array(duplicate)

In [None]:
X_Final_100K = X_Final[0:100000]
Y_Final_100K = Y_Final[0:100000]

In [None]:
X_Train,X_Test,Y_Train,Y_Test = train_test_split(X_Final_100K,Y_Final_100K,test_size=0.2,random_state=0)

In [None]:
print(X_Train.shape)
print(X_Test.shape)
print(Y_Train.shape)
print(Y_Test.shape)


In [None]:
X_train_ques=X_Train['questions']
X_test_ques=X_Test['questions']

X_Train=X_Train.drop('questions',axis=1)
X_Test=X_Test.drop('questions',axis=1)

## FEATURIZING DATA USING TF-IDF WEIGHTED WORD2VEC
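
The idea behind this featurization is to represent a question by combining its word vectors, giving more weight to rare (high-IDF) words. The sketch below illustrates that idea with toy two-dimensional embeddings and made-up IDF scores; the function name `idf_weighted_vector` and all values are assumptions for illustration. It normalises by the sum of the weights, which is one common variant, whereas the notebook's loop accumulates the weighted sum without that normalisation.

```python
import numpy as np


def idf_weighted_vector(tokens, embeddings, idf, dim):
    """IDF-weighted average of token embeddings; unknown tokens are skipped."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        if tok in embeddings and tok in idf:
            total += idf[tok] * embeddings[tok]
            weight_sum += idf[tok]
    # Return the zero vector when no token could be looked up.
    return total / weight_sum if weight_sum > 0 else total


# Toy embeddings and IDF scores, made up for illustration.
embeddings = {"learn": np.array([1.0, 0.0]), "python": np.array([0.0, 1.0])}
idf = {"learn": 1.0, "python": 3.0}

vec = idf_weighted_vector("how to learn python".split(), embeddings, idf, dim=2)
print(vec)  # [0.25 0.75]
```

The rarer word ("python", with the higher IDF score) dominates the resulting vector, which is the intended effect of the weighting.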

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

tfidf_vector=TfidfVectorizer(lowercase=False)
tfidf_vector.fit(X_train_ques)

# Map each vocabulary term to its IDF score
word2Vectfidf = {term: tfidf_vector.idf_[idx] for term, idx in tfidf_vector.vocabulary_.items()}

In [None]:
pip install spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

vecs1 = []

for qu1 in tqdm(list(X_train_ques)):
    doc1 = nlp(qu1)    

In [None]:
type(doc1)

In [None]:
len(doc1)

In [None]:
doc1[0].vector

In [None]:
len(doc1[0].vector)

In [None]:
nlp = spacy.load('en_core_web_sm')

vecs1 = []

for qu1 in tqdm(list(X_train_ques)):
    # Parse the current question (without this, every row would reuse the last parsed doc)
    doc1 = nlp(qu1)
    # One row per token; every row accumulates the same IDF-weighted sum
    mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])
    for word1 in doc1:
        # spaCy token vector
        vec1 = word1.vector
        # Look up the IDF score; tokens unseen by the vectorizer get weight 0
        idf = word2Vectfidf.get(str(word1), 0)
        # Add the weighted vector
        mean_vec1 += vec1 * idf
    mean_vec1 = mean_vec1.mean(axis=0)
    vecs1.append(mean_vec1)

In [None]:
for qu2 in tqdm(list(X_test_ques)):    
    doc2 = nlp(qu2) 

In [None]:
type(doc2)

In [None]:
len(doc2)

In [None]:
doc2[0].vector

In [None]:
len(doc2[0].vector)

In [None]:
vecs2 = []

for qu2 in tqdm(list(X_test_ques)):
    # Parse the current question (without this, every row would reuse the last parsed doc)
    doc2 = nlp(qu2)
    # One row per token; every row accumulates the same IDF-weighted sum
    mean_vec2 = np.zeros([len(doc2), len(doc2[0].vector)])
    for word2 in doc2:
        # spaCy token vector
        vec2 = word2.vector
        # Look up the IDF score; tokens unseen by the vectorizer get weight 0
        idf = word2Vectfidf.get(str(word2), 0)
        # Add the weighted vector
        mean_vec2 += vec2 * idf
    mean_vec2 = mean_vec2.mean(axis=0)
    vecs2.append(mean_vec2)

In [None]:
train_df=pd.DataFrame(vecs1)
test_df = pd.DataFrame(vecs2)

In [None]:
X_Train.head(5)

In [None]:
X_Train.values

In [None]:
train_df.head(4)

In [None]:
from scipy.sparse import hstack
X_Train = hstack((X_Train.values,train_df))
X_Test= hstack((X_Test.values,test_df))
print(X_Train.shape)
print(X_Test.shape)

In [None]:
type(X_Train)

In [None]:
# This function plots the confusion, precision and recall matrices given the true and predicted labels.
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    # C is a 2x2 matrix here; cell (i, j) is the number of points of class i predicted as class j
    
    A =(((C.T)/(C.sum(axis=1))).T)
    # Recall matrix: divide each element of the confusion matrix by the sum of its row
    
    # C = [[1, 2],
    #     [3, 4]]
    # C.T = [[1, 3],
    #        [2, 4]]
    # C.sum(axis=1) sums along axis 1 and gives one total per row of a two-dimensional array
    # C.sum(axis=1) = [3, 7]
    # ((C.T)/(C.sum(axis=1))) = [[1/3, 3/7]
    #                           [2/3, 4/7]]

    # ((C.T)/(C.sum(axis=1))).T = [[1/3, 2/3]
    #                           [3/7, 4/7]]
    # sum of row elements = 1
    
    B =(C/C.sum(axis=0))
    # Precision matrix: divide each element of the confusion matrix by the sum of its column
    # C = [[1, 2],
    #     [3, 4]]
    # C.sum(axis=0) sums along axis 0 and gives one total per column of a two-dimensional array
    # C.sum(axis=0) = [4, 6]
    # (C/C.sum(axis=0)) = [[1/4, 2/6],
    #                      [3/4, 4/6]] 
    # sum of column elements = 1
    plt.figure(figsize=(20,4))
    
    # Class labels: 0 = not duplicate, 1 = duplicate
    labels = [0,1]
    # representing C (raw counts) in heatmap format
    cmap=sns.light_palette("blue")
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")
    
    plt.subplot(1, 3, 2)
    # representing B (precision) in heatmap format
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")
    
    plt.subplot(1, 3, 3)
    # representing A (recall) in heatmap format
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")
    
    plt.show()
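
The row and column normalisation used in `plot_confusion_matrix` can be checked on the same tiny matrix that appears in its comments. This is a minimal sketch with made-up counts; `keepdims=True` is used here instead of the transpose trick, which should give the same result.

```python
import numpy as np

# Rows: actual class, columns: predicted class (made-up counts).
C = np.array([[1, 2],
              [3, 4]])

# Recall matrix: divide each row by its row sum (per actual class).
recall = C / C.sum(axis=1, keepdims=True)
# Precision matrix: divide each column by its column sum (per predicted class).
precision = C / C.sum(axis=0, keepdims=True)

print(recall)     # each row sums to 1
print(precision)  # each column sums to 1
```

The diagonal of the recall matrix holds the per-class recall and the diagonal of the precision matrix holds the per-class precision.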

### BUILDING A RANDOM MODEL AND FINDING THE WORST CASE LOG LOSS
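
A random model gives a reference point: any useful classifier should reach a lower log loss than it does. The sketch below, which uses synthetic labels and a hand-written `binary_log_loss` helper (both assumptions for illustration, not the notebook's data or code), shows that always predicting 0.5 yields a log loss of ln 2, and compares it with uniformly random probabilities.

```python
import numpy as np


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of binary labels given P(class = 1)."""
    p = np.clip(p_pos, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)  # synthetic labels, illustration only

constant = binary_log_loss(y, np.full(y.shape, 0.5))  # always predict 0.5
random_p = binary_log_loss(y, rng.random(y.shape))    # uniform random probabilities

print(round(constant, 4))  # 0.6931
print(random_p > constant)
```

Confident random guesses are penalised heavily when they are wrong, so the random probabilities are expected to score worse than the constant 0.5 prediction.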

In [None]:
test_len = len(Y_Test)

In [None]:
predicted_y = np.zeros((test_len,2))
for i in range(test_len):
    rand_probs = np.random.rand(1,2)
    predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Test Data using Random Model",log_loss(Y_Test, predicted_y, eps=1e-15))

predicted_y =np.argmax(predicted_y, axis=1)
plot_confusion_matrix(Y_Test, predicted_y)

### LOGISTIC REGRESSION WITH HYPERPARAMETER TUNING

In [None]:
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(X_Train, Y_Train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_Train, Y_Train)
    predict_y = sig_clf.predict_proba(X_Test)
    log_error_array.append(log_loss(Y_Test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(Y_Test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array,c='g')
for i, txt in enumerate(np.round(log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(X_Train, Y_Train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_Train, Y_Train)

predict_y = sig_clf.predict_proba(X_Train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(Y_Train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_Test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(Y_Test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(Y_Test, predicted_y)

### LINEAR SVM WITH HYPERPARAMETER TUNING 

In [None]:
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1', loss='hinge', random_state=42)
    clf.fit(X_Train, Y_Train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_Train, Y_Train)
    predict_y = sig_clf.predict_proba(X_Test)
    log_error_array.append(log_loss(Y_Test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(Y_Test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array,c='g')
for i, txt in enumerate(np.round(log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l1', loss='hinge', random_state=42)
clf.fit(X_Train, Y_Train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_Train, Y_Train)

predict_y = sig_clf.predict_proba(X_Train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(Y_Train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_Test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(Y_Test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(Y_Test, predicted_y)

### XGBoost

In [None]:
import xgboost as xgb
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 4

d_train = xgb.DMatrix(X_Train, label=Y_Train)
d_test = xgb.DMatrix(X_Test, label=Y_Test)

watchlist = [(d_train, 'train'), (d_test, 'valid')]

bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=20, verbose_eval=10)

# Predicted probability of the positive class (duplicate) for each test pair
predict_y = bst.predict(d_test)
print("The test log loss is:",log_loss(Y_Test, predict_y, labels=[0, 1], eps=1e-15))

In [None]:
predicted_y =np.array(predict_y>0.5,dtype=int)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(Y_Test, predicted_y)

### TF-IDF VECTORIZATION ON QUORA QUESTION PAIR SIMILARITY
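
In this part the hand-crafted features are combined with sparse TF-IDF features of the question text. A minimal sketch of that combination is shown here on made-up sentences and feature values (all names and numbers are illustrative assumptions): the vectorizer is fitted on the training text only, and the dense features are stacked next to the sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up text and hand-crafted features, for illustration only.
train_text = ["how do i learn python", "what is machine learning"]
test_text = ["how do i learn machine learning"]
train_dense = np.array([[0.5, 3.0], [0.1, 7.0]])
test_dense = np.array([[0.4, 2.0]])

vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_text)  # learn the vocabulary on train only
test_tfidf = vectorizer.transform(test_text)        # reuse the same vocabulary for test

# Stack dense and sparse features column-wise, keeping the result sparse.
X_tr = hstack((csr_matrix(train_dense), train_tfidf)).tocsr()
X_te = hstack((csr_matrix(test_dense), test_tfidf)).tocsr()

print(X_tr.shape[1] == X_te.shape[1])  # True
```

Fitting the vectorizer on the training split alone keeps the train and test matrices aligned column by column and avoids leaking test vocabulary into the features.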

In [None]:
if os.path.isfile('nlp_features_train.csv'):
    dfnlp = pd.read_csv("nlp_features_train.csv",encoding='latin-1')
else:
    print("download nlp_features_train.csv from drive or run previous notebook")

if os.path.isfile('feature_engg_preprocessing_train.csv'):
    dfppro = pd.read_csv("feature_engg_preprocessing_train.csv",encoding='latin-1')
else:
    print("download ./feature_engg_preprocessing_train.csv from drive or run previous notebook")

In [None]:
X_Final.shape

In [None]:
X_final_100K = X_Final[0:100000]
Y_final_100K = Y_Final[0:100000]

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X_final_100K,Y_final_100K,test_size=0.2,random_state=0)

In [None]:
X_train_questions = X_train['questions']
X_test_questions = X_test['questions']

In [None]:
X_train = X_train.drop('questions',axis=1)
X_test = X_test.drop('questions',axis=1)

In [None]:
X_train.shape

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector=TfidfVectorizer(ngram_range=(1,3),min_df=5)

X_train_data_tfidf= tfidf_vector.fit_transform(X_train_questions)
X_test_data_tfidf= tfidf_vector.transform(X_test_questions)

In [None]:
X_train = hstack((X_train.values,X_train_data_tfidf))
X_test= hstack((X_test.values,X_test_data_tfidf))
print(X_train.shape)
print(X_test.shape)

## Logistic Regression with Hyperparameter Tuning

In [None]:
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(X_train, Y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_train, Y_train)
    predict_y = sig_clf.predict_proba(X_test)
    log_error_array.append(log_loss(Y_test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(Y_test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array,c='g')
for i, txt in enumerate(np.round(log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(X_train, Y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_train, Y_train)

predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(Y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(Y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(Y_test, predicted_y)

## Linear SVM

In [None]:
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
log_error_array=[]
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1', loss='hinge', random_state=42)
    clf.fit(X_train, Y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_train, Y_train)
    predict_y = sig_clf.predict_proba(X_test)
    log_error_array.append(log_loss(Y_test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:",log_loss(Y_test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array,c='g')
for i, txt in enumerate(np.round(log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l1', loss='hinge', random_state=42)
clf.fit(X_train, Y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_train, Y_train)

predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(Y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(Y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y =np.argmax(predict_y,axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(Y_test, predicted_y)

## Hyperparameter tuning using RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

param = {"max_depth":[1,5,10,50,100,500,1000],"n_estimators":[20,40,60,80,100]}

xgb_classifier=xgb.XGBClassifier(n_jobs=-1,random_state=25)

model = RandomizedSearchCV(xgb_classifier,param,n_iter=30,scoring='neg_log_loss',cv=3,n_jobs=-1)

model.fit(X_train,Y_train)
model.best_params_

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

clf=xgb.XGBClassifier(n_jobs=-1,random_state=25,max_depth=10,n_estimators=100)
clf.fit(X_train,Y_train)
y_pred_test=clf.predict_proba(X_test)
y_pred_train=clf.predict_proba(X_train)
log_loss_train = log_loss(Y_train, y_pred_train, eps=1e-15)
log_loss_test=log_loss(Y_test,y_pred_test,eps=1e-15)
print('Train log loss = ',log_loss_train,' Test log loss = ',log_loss_test)
predicted_y=np.argmax(y_pred_test,axis=1)
plot_confusion_matrix(Y_test,predicted_y)

## Pretty Table 

In [None]:
pip install -U PTable

# DRAW CONCLUSION

In [None]:
from prettytable import PrettyTable
x= PrettyTable()

x.field_names = ["VECTORIZER","TYPE OF MODEL","TRAIN LOG LOSS","TEST LOG LOSS"]
x.add_row(['TF-IDF WEIGHTED W2V','LOGISTIC REGRESSION(ALPHA=0.001)','0.4314','0.9163'])
x.add_row(['TF-IDF WEIGHTED W2V','LINEAR SVM(ALPHA = 0.1)','0.5209','0.5377'])
x.add_row(['TF-IDF WEIGHTED W2V','XGBOOST','0.3545','0.3532'])
x.add_row(['TF-IDF','LOGISTIC REGRESSION(ALPHA=0.0001)','0.4034','0.4014'])
x.add_row(['TF-IDF','LINEAR SVM(ALPHA=0.00001)','0.4336','0.4329'])
x.add_row(['TF-IDF','XGBOOST','0.2131','0.3163'])

print(x)

### LOOKING AT THE TABLE, XGBOOST ON THE TF-IDF FEATURES PERFORMS BEST, WITH THE LOWEST TRAIN LOG LOSS = 0.2131 AND TEST LOG LOSS = 0.3163