# Kaggle Quora Question Pairs [competition](https://www.kaggle.com/c/quora-question-pairs/)

Solving using the Small ConvNet described in the paper [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf) by Xiang Zhang, Junbo Zhao and Yann LeCun.

In [3]:
# Pre-requisites
import numpy as np
import pandas as pd
import os
#import cv2

# To clear print buffer
from IPython.display import clear_output

In [56]:
# Keras
from keras import backend as K
from keras.models import Model, Sequential
from keras.layers import Input, Conv1D, MaxPooling1D
from keras.layers import Flatten, Dense, Dropout
from keras.layers.merge import Concatenate
from keras.layers.embeddings import Embedding
from keras.optimizers import SGD
from keras.initializers import RandomNormal
from keras.callbacks import LearningRateScheduler
from keras.utils import np_utils
from keras.engine.topology import Layer

In [5]:
# Loading saved variable
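# qsDict was saved as a pickled dict, so allow_pickle is needed on newer NumPy versions
qsDict = np.load("qsDict.npy", allow_pickle=True).item()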
#charCorpus = np.load("charCorpus.npy")
#charCorpusCount = np.load("charCorpusCount.npy")
alphabet = list(np.load("alphabet.npy"))

# Load data

In [6]:
# Load training and test data
# Download train.csv and test.csv from https://www.kaggle.com/c/quora-question-pairs/
trainDf = pd.read_csv('kaggleQuoraTrain.csv',sep=',')
testDf = pd.read_csv('kaggleQuoraTest.csv',sep=',')

# Convert into np array
trainData = np.array(trainDf)
testData = np.array(testDf)

question2 is missing in trainData[105780] and trainData[201841]. Both rows have qid2 174364, and there is no other mention of that question ID in the data. Replacing the missing question text with "Who is Dumbledore?".

In [36]:
# Column 4 is question2; assign only that cell, not the whole row
trainData[105780, 4] = "Who is Dumbledore?"
trainData[201841, 4] = "Who is Dumbledore?"

In [7]:
trainDf

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [8]:
testDf

Unnamed: 0,test_id,question1,question2
0,0,How does the Surface Pro himself 4 compare wit...,Why did Microsoft choose core m3 and not core ...
1,1,Should I have a hair transplant at age 24? How...,How much cost does hair transplant require?
2,2,What but is the best way to send money from Ch...,What you send money to China?
3,3,Which food not emulsifiers?,What foods fibre?
4,4,"How ""aberystwyth"" start reading?",How their can I start reading?
5,5,How are the two wheeler insurance from Bharti ...,I admire I am considering of buying insurance ...
6,6,How can I reduce my belly fat through a diet?,How can I reduce my lower belly fat in one month?
7,7,"By scrapping the 500 and 1000 rupee notes, how...",How will the recent move to declare 500 and 10...
8,8,What are the how best books of all time?,What are some of the military history books of...
9,9,After 12th years old boy and I had sex with a ...,Can a 14 old guy date a 12 year old girl?


In [7]:
trainData.shape

(404290, 6)

In [8]:
testData.shape

(2345796, 3)

# Idea

The idea is to construct a Character-level CNN, i.e. a CNN that takes a sentence as a fixed-length frame of individual one-hot encoded characters as the input.

We shall feed the two questions through two branches of the same network, and then merge the two branches. After the merge, a single sigmoid neuron classifies whether the two questions are duplicates or not.
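
As a minimal sketch of the input representation described above (not the notebook's actual pipeline), the following encodes a short string as a fixed-length frame of one-hot vectors. The toy alphabet `toyAlphabet`, the frame length of 6 and the helper name `oneHotFrame` are assumptions made for illustration. Characters outside the alphabet and unused positions become all-zero vectors.

```python
def oneHotFrame(text, alphabet, frameLength):
    """Encode text as frameLength one-hot rows over alphabet (all zeros if unknown or padding)."""
    frame = [[0.0] * len(alphabet) for _ in range(frameLength)]
    for position, char in enumerate(text[:frameLength]):
        if char in alphabet:
            frame[position][alphabet.index(char)] = 1.0
    return frame

toyAlphabet = ['a', 'b', 'c']      # hypothetical 3-character alphabet
frame = oneHotFrame("cab?", toyAlphabet, 6)
for row in frame:
    print(row)
```

The notebook additionally reverses the character order of each question before encoding; that step is left out here to keep the sketch short.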

In [37]:
# Get list of question IDs and questions in  Question1 and Question2
trainQIds1 = trainData[:,1]
trainQs1 = trainData[:,3]
trainQIds2 = trainData[:,2]
trainQs2 = trainData[:,4]

testQs1 = testData[:,1]
testQs2 = testData[:,2]

# Database of questions

To build an alphabet of the most frequently used characters, let us first build a database of all the questions keyed by their question IDs, to encode and use later.

In [73]:
# Get list of question IDs and questions in training data
qsDict = {}
for data in trainData:
    qsDict[data[1]] = data[3].lower()
    qsDict[data[2]] = data[4].lower()

In [75]:
# Save qsDict
np.save("qsDict", qsDict)

In [10]:
# Extract question IDs and questions
qIds = list(qsDict.keys())
questions = list(qsDict.values())

In [5]:
qsDict[1]

'what is the step by step guide to invest in share market in india?'

In [6]:
len(qsDict)

537933

In [7]:
questions

array(['what is the step by step guide to invest in share market in india?',
       'what is the step by step guide to invest in share market?',
       'what is the story of kohinoor (koh-i-noor) diamond?', ...,
       'i am having little hairfall problem but i want to use hair styling product. which one should i prefer out of gel, wax and clay?',
       'what is like to have sex with cousin?',
       'what is it like to have sex with your cousin?'], 
      dtype='<U1169')

For curiosity's sake, let's check how many questions are actually unique, counting duplicates as the same question.

In [19]:
# Number of questions, counting duplicates as same
data = np.array(trainData)
data[data[:,5]==1, 2] = 0
uniqueQs = np.unique(np.array([[data[:, 1]], [data[:, 2]]]))[1:]
print(len(uniqueQs))

484549


# Alphabet

Let us make a corpus of characters in the questions database, find out the number of times each character occurs in the database, and choose only the most frequent characters as our alphabet.

In [27]:
# MAKE CORPUS OF CHARACTERS

# Append all characters from training data into list
charFullCorpus = []
for (q, question) in enumerate(questions):
    # Printing status (makes it slow)
    #clear_output(); print(str(q)+" of "+str(len(questions)))
    for char in list(question):
        charFullCorpus.append(char)

# Extract unique characters
charCorpus = np.unique(charFullCorpus)

In [None]:
# Save charCorpus
np.save("charCorpus", charCorpus)

In [28]:
print(charCorpus)
for c in charCorpus:
    print(c)

['\n' '\r' ' ' ..., '？' '￦' '￼']



 
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~






¢
£
¤
¥
§
¨
©
«
¬
­
®
°
±
²
³
´
µ
·
¹
º
»
¼
½
¾
¿
×
à
á
â
ã
ä
å
æ
ç
è
é
ê
ë
ì
í
î
ï
ð
ñ
ò
ó
ô
ö
÷
ø
ù
ú
û
ü
ā
ă
ą
ć
ē
ė
ę
ğ
ĩ
ī
ı
ĺ
ł
ń
ň
ŋ
ō
œ
ŕ
ś
ş
š
ŧ
ũ
ū
ƒ
ƕ
ƫ
ǎ
ǐ
ǔ
ǡ
ț
ɐ
ɖ
ə
ɜ
ɡ
ɪ
ɮ
ɽ
ʉ
ʊ
ʌ
ʛ
ʻ
ʾ
ː
́
̇
̱
ά
ί
α
β
γ
δ
ε
η
θ
ι
κ
λ
μ
ν
ξ
ο
π
ρ
ς
σ
τ
υ
φ
χ
ψ
ω
ό
ώ
ϟ
ϵ
а
б
в
г
д
е
ж
з
и
й
к
л
н
о
п
р
с
т
у
ф
х
ц
ч
щ
ы
ь
ю
я
ҿ
ְ
ִ
ָ
ֹ
א
ב
ד
ה
ו
י
כ
ם
מ
נ
ס
פ
ק
ר
ש
ת
؟
ء
أ
ؤ
إ
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ظ
ع
غ
ـ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ک
ی
ँ
ं
ः
अ
आ
इ
ई
उ
ऊ
ऋ
ए
ऐ
औ
क
ख
ग
च
ज
ट
ठ
ड
ण
त
थ
द
ध
न
प
फ
ब
भ
म
य
र
ल
व
श
ष
स
ह
ा
ि
ी
ु
ू
ृ
े
ै
ो
्
ॐ
ड़
फ़
।
३
६
জ
ড
ত
ন
প
য
স
হ
া
ি
্
ਬ
ਰ
ਸ
਼
ੇ
ੱ
ଁ
ଇ
କ
ଧ
ର
ା
ି
அ
ஆ
இ
எ
க
ங
ச
ட
ண
த
ந
ன
ப
ம
ய
ர
ற
ல
ள
ழ
வ
ா
ி
ு
ெ
ே
ை
ொ
ோ
்
ం
అ
జ
ణ
న
మ
ర
వ
ా
ి
ు
ూ
ಕ
ಗ
ಟ
ಣ
ತ
ದ
ಬ
ಭ
ಮ
ರ
ವ
ಸ
ಾ
ಿ
ು
ೂ
ೆ
್
ഏ
ക
ച
ഛ
ജ
ട
ത
ധ
ന
മ
ര
റ
ഴ
സ
ാ
ി
േ
ോ
്
ൺ
ർ
ൾ
ก
ค
ง
จ
ช
ซ
ต
ถ
ท
น
บ
ป
พ
ฟ
ม
ย
ร
ล
ว
ส
ห
อ
ั
า


In [29]:
# Count the number of times each character occurs
charCorpusCount = [charFullCorpus.count(c) for c in charCorpus]

In [33]:
# Save charCorpusCount
np.save("charCorpusCount", charCorpusCount)
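
The list-based count above calls `charFullCorpus.count` once per unique character, so it rescans the whole corpus each time. A possible alternative, sketched here on a made-up list `toyQuestions` standing in for `questions`, is `collections.Counter`, which tallies every character in a single pass.

```python
from collections import Counter

toyQuestions = ["who is it?", "what is it?"]   # made-up stand-in for `questions`

# Tally every character across all questions in one pass
charCounts = Counter()
for question in toyQuestions:
    charCounts.update(question)

# Characters ordered from most to least frequent
for char, count in charCounts.most_common(3):
    print(repr(char), count)
```

`charCounts.most_common()` would also replace the separate sorting step, since it returns the characters already ordered by count.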

In [126]:
charCorpusCount

[9,
 5597518,
 791,
 28190,
 565,
 1861,
 1908,
 3259,
 58906,
 20466,
 20576,
 643,
 4991,
 74589,
 27846,
 58584,
 16495,
 50661,
 40395,
 31848,
 14437,
 11219,
 16858,
 12008,
 8386,
 6780,
 6617,
 10551,
 560,
 201,
 1653,
 219,
 569043,
 257,
 80427,
 41573,
 81021,
 53542,
 37225,
 22631,
 31069,
 150929,
 272597,
 16668,
 14080,
 22136,
 53223,
 25825,
 19777,
 49175,
 11066,
 27308,
 87200,
 52707,
 25430,
 12772,
 327754,
 3616,
 6756,
 2166,
 2592,
 1747,
 2590,
 1840,
 655,
 42,
 2231226,
 378081,
 770431,
 916602,
 2962952,
 527576,
 512635,
 1263458,
 1820534,
 34113,
 215294,
 947268,
 670485,
 1741657,
 2169674,
 518462,
 22526,
 1522162,
 1603700,
 2169901,
 663429,
 249062,
 508386,
 58145,
 528143,
 20174,
 1510,
 266,
 1504,
 90,
 8,
 4,
 1,
 1,
 1,
 1,
 1,
 85,
 1,
 4,
 3,
 4,
 3,
 3,
 2,
 6,
 11,
 97,
 6,
 35,
 8,
 45,
 2,
 2,
 2,
 3,
 3,
 1,
 7,
 2,
 5,
 1,
 1,
 1,
 3,
 4,
 1,
 2,
 1,
 1,
 1,
 76,
 6,
 1,
 20,
 42,
 6,
 18,
 26,
 1,
 2,
 16,
 20,
 860,
 5,
 5,
 5

In [14]:
# Sort charCorpus by number of occurrences (ascending)
charCorpusCountSorted = sorted(charCorpusCount)
charCorpusSorted = [y for (x, y) in sorted(zip(charCorpusCount, charCorpus))]

In [15]:
charCorpusCountSorted[-200:]

[8,
 8,
 8,
 8,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 9,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 11,
 11,
 11,
 11,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 15,
 15,
 16,
 16,
 16,
 16,
 16,
 16,
 17,
 17,
 17,
 17,
 17,
 18,
 18,
 18,
 18,
 18,
 18,
 18,
 20,
 20,
 21,
 21,
 21,
 21,
 22,
 23,
 23,
 24,
 25,
 25,
 25,
 25,
 26,
 26,
 27,
 27,
 28,
 28,
 30,
 31,
 31,
 32,
 32,
 33,
 34,
 35,
 40,
 40,
 42,
 43,
 45,
 45,
 46,
 49,
 52,
 54,
 55,
 62,
 64,
 76,
 85,
 88,
 90,
 97,
 103,
 182,
 201,
 219,
 238,
 257,
 266,
 301,
 425,
 560,
 565,
 643,
 655,
 791,
 864,
 962,
 1002,
 1504,
 1510,
 1653,
 1747,
 1840,
 1861,
 1908,
 2590,
 2592,
 2618,
 3259,
 4991,
 6617,
 6780,
 8386,
 10551,
 11219,
 12008,
 14437,
 16495,
 16858,
 20466,
 20576,
 22340,
 27846,
 28190,
 31848,
 33592,
 40395,
 50661,
 50781,
 58584,
 58906,
 61761,
 74589,
 229374,
 261834,
 419654,
 53

In [16]:
charCorpusSorted[-200:]

['了',
 '会',
 '大',
 '文',
 '\n',
 '\r',
 'û',
 'ō',
 'ρ',
 'ு',
 'ง',
 '•',
 '→',
 '∫',
 'の',
 '天',
 '是',
 '有',
 'в',
 'р',
 'ؤ',
 'ب',
 'د',
 'ब',
 'ल',
 'த',
 'ம',
 '™',
 '∀',
 '∆',
 '∪',
 '≥',
 '®',
 'и',
 'च',
 'ா',
 '⚡',
 '\ufeff',
 'ø',
 'ú',
 'д',
 'к',
 'श',
 'ல',
 'ร',
 '℅',
 '我',
 'ر',
 'य',
 'و',
 'า',
 '∩',
 '的',
 'ó',
 'μ',
 'с',
 'س',
 'ए',
 '∅',
 '一',
 'ç',
 'т',
 'ع',
 'ग',
 '≤',
 '人',
 'ن',
 'ज',
 'ो',
 '\u200b',
 'い',
 'ã',
 'م',
 'ं',
 'े',
 '∂',
 '⚪',
 '不',
 'è',
 '÷',
 'à',
 'а',
 'е',
 'ي',
 'ñ',
 'द',
 '′',
 'ل',
 'प',
 'व',
 'ी',
 '″',
 'स',
 '，',
 'ä',
 'ि',
 'ह',
 '்',
 'π',
 'н',
 'न',
 'ü',
 'о',
 'त',
 'म',
 '²',
 'í',
 '€',
 '`',
 'á',
 '´',
 'र',
 'ا',
 'क',
 '−',
 'ö',
 '्',
 '—',
 '？',
 '×',
 '£',
 'ा',
 '~',
 '°',
 '√',
 '–',
 '<',
 '>',
 '‘',
 '@',
 '|',
 '₹',
 '…',
 ';',
 '#',
 '*',
 '_',
 '!',
 'é',
 '“',
 '”',
 '}',
 '{',
 '=',
 '\\',
 '^',
 '$',
 '%',
 ']',
 '[',
 '’',
 '&',
 '+',
 '9',
 '8',
 '7',
 ':',
 '4',
 '6',
 '3',
 '/',
 '5',
 '(',
 ')',
 'z

[Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf) uses an alphabet of 70 characters, which excludes capital letters (all text was converted to lowercase) and blank spaces. Let us use 100.

In [11]:
# Setting alphabet size
alphabetSize = 100

In [18]:
# Assign the most frequent #alphabetSize number of characters as the alphabet for the network
# Also, remove blank space (the most frequent character) from alphabet
alphabet = charCorpusSorted[-alphabetSize-1:-1]
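
To see what the slice `[-alphabetSize-1:-1]` does, here is a toy example with made-up characters and counts. Sorting in ascending order of count puts the most frequent character last, and the slice keeps the next `toyAlphabetSize` most frequent characters while dropping that last one.

```python
toyChars = ['a', 'b', ' ', 'c', 'd']
toyCounts = [5, 2, 9, 7, 1]          # made-up counts; ' ' is the most frequent
toyAlphabetSize = 3

# Sort characters by ascending count, as the notebook does
toySorted = [char for (count, char) in sorted(zip(toyCounts, toyChars))]
# Keep the top toyAlphabetSize characters, excluding the single most frequent one
toyAlphabet = toySorted[-toyAlphabetSize-1:-1]
print(toySorted)
print(toyAlphabet)
```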

In [19]:
# Save alphabet
np.save("alphabet", alphabet)

In [12]:
alphabet

['н',
 'न',
 'ü',
 'о',
 'त',
 'म',
 '²',
 'í',
 '€',
 '`',
 'á',
 '´',
 'र',
 'ا',
 'क',
 '−',
 'ö',
 '्',
 '—',
 '？',
 '×',
 '£',
 'ा',
 '~',
 '°',
 '√',
 '–',
 '<',
 '>',
 '‘',
 '@',
 '|',
 '₹',
 '…',
 ';',
 '#',
 '*',
 '_',
 '!',
 'é',
 '“',
 '”',
 '}',
 '{',
 '=',
 '\\',
 '^',
 '$',
 '%',
 ']',
 '[',
 '’',
 '&',
 '+',
 '9',
 '8',
 '7',
 ':',
 '4',
 '6',
 '3',
 '/',
 '5',
 '(',
 ')',
 'z',
 '-',
 '"',
 '2',
 'q',
 '1',
 '0',
 'j',
 '.',
 "'",
 'x',
 ',',
 'k',
 'v',
 'b',
 'y',
 'g',
 'f',
 'p',
 '?',
 'u',
 'm',
 'w',
 'c',
 'l',
 'd',
 'h',
 'r',
 's',
 'n',
 'i',
 'o',
 't',
 'a',
 'e']

In [11]:
# Making one-hot encoded alphabets
encodedAlphabet = np.eye(alphabetSize).astype('float32')

In [12]:
encodedAlphabet

array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.]], dtype=float32)

In [14]:
# Checking existing alphabet
char = 'f'
if char in alphabet:
    print(str(char)+" is at "+str(alphabet.index(char)))
    print(encodedAlphabet[alphabet.index(char)])
else:
    print(str(char)+" not found.")
    print(np.zeros((1, alphabetSize)))
char = '∂'
if char in alphabet:
    print(str(char)+" is at "+str(alphabet.index(char)))
    print(encodedAlphabet[alphabet.index(char)])
else:
    print(str(char)+" not found.")
    print(np.zeros((1, alphabetSize)))

f is at 82
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
∂ not found.
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]


In [190]:
# Find max length of a question (for first layer of CNN)
maxQLength = np.max([len(q) for q in questions])

In [191]:
maxQLength

1169

# Model

I shall make the Small ConvNet described in the paper [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf) by Xiang Zhang, Junbo Zhao and Yann LeCun.

The number of characters per question is set to 1200, because the maximum question length ($maxQLength$) was found to be 1169, and the input must be at least that long. Positions beyond the end of a question are filled with zeros.

In [12]:
# Params
inputDim = alphabetSize #number of letters (characters) in alphabet
inputLength = 1200 #input feature length (the paper used 1014)

## Encoding questions

Each question needs to be encoded as an array of $inputLength$ characters, each character itself being encoded as a $1{\times}alphabetSize$-dimensional vector.

In [None]:
# DO NOT RUN THIS!!!!

# ENCODE n QUESTIONS
n = len(questions)

# Initialize encoded questions array
encodedQs = np.zeros((n, inputLength, inputDim)).astype('float32')

# For each question
for (q, question) in enumerate(questions[:n]):
    # For each character in question, in reversed order (so latest character is first)
    for (c, char) in enumerate(reversed(question[:inputLength])):
        if char in alphabet:
            encodedQs[q][c] = encodedAlphabet[alphabet.index(char)]
        else:
            encodedQs[q][c] = np.zeros((alphabetSize))

In [None]:
# DO NOT RUN THIS!!!!
#np.save("encodedQs", encodedQs)

The above takes too long, and the encoded array takes up ~30GB.

In [16]:
## DO NOT RUN THIS!!!!!

def oneHotEncodeQs(questions, inputLength, alphabet):
    alphabetSize = len(alphabet)
    # Initialize encoded questions array
    encodedQs = np.zeros((len(questions), inputLength, alphabetSize)).astype('float32')
    # For each question
    for (q, question) in enumerate(questions):
        # For each character in question, in reversed order (so latest character is first)
        for (c, char) in enumerate(reversed(question[:inputLength])):
            if char in alphabet:
                encodedQs[q][c] = encodedAlphabet[alphabet.index(char)].astype('float32')
            else:
                encodedQs[q][c] = np.zeros((alphabetSize)).astype('float32')
    return encodedQs

# Make encoded questions out of training questions 1 and 2
encodedQ1s = oneHotEncodeQs(trainQs1, inputLength, list(alphabet))
encodedQ2s = oneHotEncodeQs(trainQs2, inputLength, list(alphabet))

The above stops the kernel.

In [38]:
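def encodeQs(questions, inputLength, alphabet):
    # Map each character to its index in the alphabet (dict lookup instead of list search)
    charToIndex = {char: i for (i, char) in enumerate(alphabet)}
    # Initialize encoded questions array (zero-padded)
    encodedQs = np.zeros((len(questions), inputLength))
    # For each question
    for (q, question) in enumerate(questions):
        # Lowercase to match the alphabet, which was built from lowercased questions
        question = question.lower()[:inputLength]
        # For each character in question, in reversed order (so latest character is first)
        for (c, char) in enumerate(reversed(question)):
            # NOTE: unknown characters are encoded as 0, the same value as alphabet[0] and padding
            encodedQs[q][c] = charToIndex.get(char, 0)
    return encodedQs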
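
One thing to note about `encodeQs` is that unknown characters and padding are both written as 0, which is also the index of the first alphabet character. A possible variant, shown here only as a sketch, shifts every alphabet index up by one so that 0 is reserved. The helper name `encodeWithReservedZero` and the toy alphabet are hypothetical, and an `Embedding` layer fed with such indices would need `input_dim=len(alphabet) + 1`.

```python
def encodeWithReservedZero(questions, inputLength, alphabet):
    """Index-encode questions; 0 means padding/unknown, alphabet[i] maps to i + 1."""
    charToIndex = {char: i + 1 for (i, char) in enumerate(alphabet)}
    encoded = []
    for question in questions:
        # Reverse the (truncated) question, as encodeQs does
        indices = [charToIndex.get(char, 0) for char in reversed(question[:inputLength])]
        # Zero-pad on the right up to inputLength
        encoded.append(indices + [0] * (inputLength - len(indices)))
    return encoded

toyEncoded = encodeWithReservedZero(["abc", "b?"], 5, ['a', 'b', 'c'])
print(toyEncoded)
```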

In [41]:
# Make encoded questions out of training questions 1 and 2
encodedQ1s = encodeQs(trainQs1, inputLength, list(alphabet))
encodedQ2s = encodeQs(trainQs2, inputLength, list(alphabet))

In [43]:
from sys import getsizeof
getsizeof(encodedQ1s)

3881184112

encodedQ1s and encodedQ2s are each around 3.6GB. So let's not save them to file.
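
The reported size follows from the array shape: `np.zeros` defaults to 64-bit floats, i.e. 8 bytes per entry. The arithmetic below reproduces the data portion of that figure (the few extra bytes reported by `getsizeof` are presumably the array object's own overhead). As a what-if, it also estimates the footprint if the indices were stored in a 1-byte integer type, which is enough for 100 distinct indices; that option is a suggestion, not something the notebook does.

```python
numQuestions = 404290    # rows in trainData
inputLength = 1200       # characters per encoded question

bytesFloat64 = numQuestions * inputLength * 8   # default dtype of np.zeros
bytesUint8 = numQuestions * inputLength * 1     # hypothetical 1-byte index dtype

print(bytesFloat64 / 2**30)   # size in GiB with 8-byte entries
print(bytesUint8 / 2**30)     # size in GiB with 1-byte entries
```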

In [47]:
len(encodedQ1s[0])

1200

In [60]:
# Outputs (whether duplicate or not)
duplicateOrNot = trainData[:,5]

# IGNORE

One-hot encoding every question up front would build a very large NumPy array $encodedQs$. So instead, let's try a custom layer that does the encoding at run time for each input.

In [53]:
# LAYER TO ENCODE QUESTIONS
class EncodeQuestions(Layer):
    
    def __init__(self, alphabet, input_length, **kwargs):
        self.alphabet = alphabet
        self.inputLength = input_length
        super(EncodeQuestions, self).__init__(**kwargs)
    
    def build(self, input_shape):
        # Use the layer's own alphabet rather than the global variables
        self.alphabetSize = len(self.alphabet)
        self.encodedAlphabet = np.eye(self.alphabetSize)
        super(EncodeQuestions, self).build(input_shape)
    
    def call(self, question):
        encodedQ = np.zeros((self.inputLength, self.alphabetSize))
        # For each character in question, upto #inputLength number of characters,
        #in reversed order (so latest character is first)
        i = 0
        for (c, char) in enumerate(reversed(question)):
            if i == self.inputLength:
                break
            if char in self.alphabet:
                encodedQ[c] = self.encodedAlphabet[self.alphabet.index(char)]
            else:
                encodedQ[c] = np.zeros((self.alphabetSize))
            i += 1
        return encodedQ
    
    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.inputLength, self.alphabetSize)

In [54]:
# TESTING ENCODEQUESTIONS

modelEncodeQs = Sequential()
modelEncodeQs.add(EncodeQuestions(alphabet=alphabet, input_length=10, input_shape=(10,alphabetSize)))
modelEncodeQs.compile()

TypeError: object of type 'Tensor' has no len()

# MODEL

## IGNORE

In [108]:
# MODEL

# Model for Q1
modelQ1 = Sequential()
modelQ1.add(Conv1D(256, 7, strides=1, padding='valid', activation='relu', input_shape=(inputLength, dimOfEachInput), kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ1.add(MaxPooling1D(pool_size=3, strides=3))
modelQ1.add(Conv1D(256, 7, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ1.add(MaxPooling1D(pool_size=3, strides=3))
modelQ1.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ1.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ1.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ1.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ1.add(MaxPooling1D(pool_size=3, strides=3))
modelQ1.add(Flatten())
modelQ1.add(Dense(1024, activation='relu'))
modelQ1.add(Dropout(0.5))
modelQ1.add(Dense(1024, activation='relu'))
modelQ1.add(Dropout(0.5))

# Model for Q2
modelQ2 = Sequential()
modelQ2.add(Conv1D(256, 7, strides=1, padding='valid', activation='relu', input_shape=(inputLength, dimOfEachInput), kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ2.add(MaxPooling1D(pool_size=3, strides=3))
modelQ2.add(Conv1D(256, 7, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ2.add(MaxPooling1D(pool_size=3, strides=3))
modelQ2.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ2.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ2.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ2.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
modelQ2.add(MaxPooling1D(pool_size=3, strides=3))
modelQ2.add(Flatten())
modelQ2.add(Dense(1024, activation='relu'))
modelQ2.add(Dropout(0.5))
modelQ2.add(Dense(1024, activation='relu'))
modelQ2.add(Dropout(0.5))

# Merge 
model = Sequential()
model.add(Merge([modelQ1, modelQ2], mode = 'concat'))
model.add(Dense(1, activation = 'sigmoid'))



## SIAMESE NETWORK

Building a Siamese network from the [MNIST Siamese Network example](https://github.com/fchollet/keras/blob/master/examples/mnist_siamese_graph.py)

In [50]:
def createBaseNetwork(inputDim, inputLength):
    baseNetwork = Sequential()
    baseNetwork.add(Embedding(input_dim=inputDim, output_dim=inputDim, input_length=inputLength))
    baseNetwork.add(Conv1D(256, 7, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
    baseNetwork.add(MaxPooling1D(pool_size=3, strides=3))
    baseNetwork.add(Conv1D(256, 7, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
    baseNetwork.add(MaxPooling1D(pool_size=3, strides=3))
    baseNetwork.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
    baseNetwork.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
    baseNetwork.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
    baseNetwork.add(Conv1D(256, 3, strides=1, padding='valid', activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None), bias_initializer=RandomNormal(mean=0.0, stddev=0.05, seed=None)))
    baseNetwork.add(MaxPooling1D(pool_size=3, strides=3))
    baseNetwork.add(Flatten())
    baseNetwork.add(Dense(1024, activation='relu'))
    baseNetwork.add(Dropout(0.5))
    baseNetwork.add(Dense(1024, activation='relu'))
    baseNetwork.add(Dropout(0.5))
    return baseNetwork
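
To get a feel for the size of the flattened feature vector that feeds the first `Dense` layer, the sequence length can be traced through the stack by hand. The sketch below assumes the usual output-length formulas for 'valid' convolution (stride 1) and 'valid' max-pooling; the helper names are made up for illustration.

```python
def convOut(length, kernelSize):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return length - kernelSize + 1

def poolOut(length, poolSize=3, stride=3):
    """Output length of a 'valid' 1D max-pooling layer."""
    return (length - poolSize) // stride + 1

def baseNetworkFrameLength(inputLength):
    """Trace the sequence length through the conv/pool stack of createBaseNetwork."""
    length = poolOut(convOut(inputLength, 7))   # conv(7) + pool
    length = poolOut(convOut(length, 7))        # conv(7) + pool
    for _ in range(4):                          # four conv(3) layers
        length = convOut(length, 3)
    return poolOut(length)                      # final pool

for l0 in (1014, 1200):
    frames = baseNetworkFrameLength(l0)
    print(l0, frames, frames * 256)   # flattened size = frames * 256 filters
```

For the paper's input length of 1014 this gives 34 frames, and for the 1200 used here it gives 40 frames, so `Flatten` outputs 40 * 256 = 10240 features per question.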

In [57]:
baseNetwork = createBaseNetwork(inputDim, inputLength)

# Inputs
inputA = Input(shape=(inputLength,))
inputB = Input(shape=(inputLength,))

# because we re-use the same instance `baseNetwork`,
# the weights of the network
# will be shared across the two branches
processedA = baseNetwork(inputA)
processedB = baseNetwork(inputB)

# Concatenate
conc = Concatenate()([processedA, processedB])

# Add a sigmoid
predictions = Dense(1, activation='sigmoid')(conc)

# This creates a model that includes the Input and Dense layers
model = Model(inputs=[inputA, inputB], outputs=predictions)

In [58]:
# Compile options
sgd = SGD(lr=0.01, momentum=0.9, decay=0, nesterov=False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics =['accuracy'])

In [68]:
# Halve the learning rate every 3 epochs
def stepDecay(epoch):
    initLR = 0.01
    newLR = float(initLR/np.power(2, (int(epoch/3))))
    print("stepDecay: Epoch "+str(epoch)+" ; lr: "+str(newLR))
    return newLR
lRate = LearningRateScheduler(stepDecay)
callbacks = [lRate]
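
To check what this schedule produces, the same halving rule can be evaluated on its own for the first few epochs. This is a standalone sketch using integer division in place of `np.power` and `int(epoch/3)`; the function name `stepDecaySchedule` and its keyword arguments are introduced here for illustration only.

```python
def stepDecaySchedule(epoch, initLR=0.01, epochsPerHalving=3):
    """Learning rate after halving initLR once every epochsPerHalving epochs."""
    return initLR / (2 ** (epoch // epochsPerHalving))

# Epochs are 0-indexed, matching what LearningRateScheduler passes to its schedule function
for epoch in range(0, 9):
    print(epoch, stepDecaySchedule(epoch))
```

Epochs 0 to 2 keep the initial rate, epochs 3 to 5 use half of it, and so on; over 30 epochs the rate is halved 9 times.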

In [None]:
model.fit([encodedQ1s, encodedQ2s], duplicateOrNot, batch_size=128, epochs=30, verbose=1, callbacks=callbacks)

stepDecay: Epoch 0 ; lr: 0.01
Epoch 1/30
  2816/404290 [..............................] - ETA: 79339s - loss: 0.6649 - acc: 0.6303