# Recognizing textual entailment
* Recognizing textual entailment (RTE) is the decision problem of whether a given (coherent) text T entails a given text H (in this context often called a hypothesis).
* Since RTE is a binary decision problem, a negative result (T does not entail H) says nothing about how "distant" H is from another hypothesis H′ that T does entail. From a different point of view, this setting cannot express that H is "almost entailed" by T.
* Many approaches and refinements have been considered, including word embeddings, logical models, graphical models, rule systems, contextual focusing, and machine learning.
* Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.
* Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe, Gensim, Indra and Deeplearning4j.
* Principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and to visualize word embeddings and clusters.
* Reference: Wikipedia
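
The mapping "from a space with one dimension per word to a continuous vector space" can be illustrated with a minimal sketch. The vocabulary and the embedding values below are made up for illustration (a real model learns them from a corpus), and `one_hot` and `embed` are hypothetical helper names, not part of any library.

```python
# Toy vocabulary and a hypothetical 3-dimensional embedding table.
# Real models learn these values; they are invented here for illustration.
vocab = ["king", "queen", "man", "woman"]
embedding = [
    [0.9, 0.8, 0.1],  # king
    [0.9, 0.1, 0.8],  # queen
    [0.1, 0.9, 0.1],  # man
    [0.1, 0.1, 0.9],  # woman
]


def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]


def embed(word):
    """Dense representation: the one-hot vector times the embedding table."""
    hot = one_hot(word)
    dims = len(embedding[0])
    return [
        sum(hot[i] * embedding[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


print(one_hot("queen"))  # 4 dimensions, one per word
print(embed("queen"))    # 3 dimensions, the "queen" row of the table
```

Multiplying a one-hot vector by the table simply selects one row, which is why embedding layers are usually implemented as a lookup rather than a matrix product.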


In [1]:
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Load the unzipped corpus from http://mattmahoney.net/dc/text8.zip
sentences = word2vec.Text8Corpus('text8')
# Train with gensim's defaults (CBOW, window=5); pass sg=1 for skip-gram.
# Note: gensim >= 4.0 renames `size` to `vector_size`.
model = word2vec.Word2Vec(sentences, size=200)
# "man" is to "king" as "woman" is to ...?
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

 

2017-11-03 13:14:02,099 : INFO : collecting all words and their counts
2017-11-03 13:14:02,102 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-11-03 13:14:08,519 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2017-11-03 13:14:08,520 : INFO : Loading a fresh vocabulary
2017-11-03 13:14:08,889 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2017-11-03 13:14:08,890 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2017-11-03 13:14:09,069 : INFO : deleting the raw counts dictionary of 253854 items
2017-11-03 13:14:09,112 : INFO : sample=0.001 downsamples 38 most-common words
2017-11-03 13:14:09,113 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2017-11-03 13:14:09,114 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
2017-11-03 13:14:09,367 : INFO : resetting la

2017-11-03 13:15:20,707 : INFO : PROGRESS: at 68.41% examples, 606249 words/s, in_qsize 5, out_qsize 0
2017-11-03 13:15:21,707 : INFO : PROGRESS: at 69.39% examples, 606450 words/s, in_qsize 5, out_qsize 0
2017-11-03 13:15:22,708 : INFO : PROGRESS: at 70.37% examples, 606507 words/s, in_qsize 5, out_qsize 0
2017-11-03 13:15:23,732 : INFO : PROGRESS: at 71.37% examples, 606615 words/s, in_qsize 5, out_qsize 0
2017-11-03 13:15:24,750 : INFO : PROGRESS: at 72.36% examples, 606637 words/s, in_qsize 5, out_qsize 0
2017-11-03 13:15:25,755 : INFO : PROGRESS: at 73.35% examples, 606758 words/s, in_qsize 6, out_qsize 0
2017-11-03 13:15:26,768 : INFO : PROGRESS: at 74.34% examples, 606927 words/s, in_qsize 4, out_qsize 0
2017-11-03 13:15:27,779 : INFO : PROGRESS: at 75.36% examples, 607022 words/s, in_qsize 6, out_qsize 0
2017-11-03 13:15:28,793 : INFO : PROGRESS: at 76.33% examples, 606933 words/s, in_qsize 5, out_qsize 1
2017-11-03 13:15:29,795 : INFO : PROGRESS: at 77.33% examples, 607107 wor

[(u'queen', 0.6264169216156006)]
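
The analogy query above can be read as vector arithmetic followed by a cosine-similarity ranking. The following is a rough sketch of that idea in plain Python, not gensim's actual implementation: the 3-dimensional vectors are invented for illustration, and `unit`, `cosine` and `analogy` are hypothetical helper names.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [
        (w, cosine(query, v)) for w, v in vectors.items() if w not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; a trained model would supply real ones.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.7, 0.2],
}
print(analogy(toy, positive=["woman", "king"], negative=["man"]))
```

With these toy values the top-ranked word is "queen", mirroring the result from the trained model; the score itself is not comparable to the one logged above.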

In [2]:
import gensim
# Pickle the entire model to disk so we can load it and resume training later
model.save('text8.model')
# Store the learned weights in a format the original C tool understands
model.wv.save_word2vec_format('text8.model.bin', binary=True)
# Or import word weights created by the (faster) C word2vec tool;
# this way you can switch between the C and Python toolkits easily
# model = gensim.models.KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
 
# "boy" is to "father" as "girl" is to ...?
print(model.wv.most_similar(['girl', 'father'], ['boy'], topn=3))
# Sample output from a different training run; neighbours and scores vary:
# [('mother', 0.61849487), ('wife', 0.57972813), ('daughter', 0.56296098)]
more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.wv.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))

 
# Which word doesn't go with the others?
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))
print(model.wv.doesnt_match("table chair bed wall".split()))
print(model.wv.doesnt_match("wine beer milk orange".split()))

# Reference: https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/


2017-11-03 13:15:52,831 : INFO : saving Word2Vec object under text8.model, separately None
2017-11-03 13:15:52,838 : INFO : not storing attribute syn0norm
2017-11-03 13:15:52,842 : INFO : storing np array 'syn0' to text8.model.wv.syn0.npy
2017-11-03 13:15:53,067 : INFO : storing np array 'syn1neg' to text8.model.syn1neg.npy
2017-11-03 13:15:53,329 : INFO : not storing attribute cum_table
2017-11-03 13:15:54,040 : INFO : saved text8.model
2017-11-03 13:15:54,041 : INFO : storing 71290x200 projection weights into text8.model.bin


[(u'mother', 0.7772825360298157), (u'wife', 0.7233453392982483), (u'grandmother', 0.7137593030929565)]
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
cereal
chair
orange
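
The odd-one-out answers can be understood as picking the word whose vector points least in the direction of the group's average. The sketch below illustrates that idea in plain Python under made-up 2-dimensional vectors; `unit` and `odd_one_out` are hypothetical names, and this is an approximation of the approach rather than gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit(
        [sum(u[d] for u in units.values()) / len(words) for d in range(dims)]
    )
    # Dot product of unit vectors is the cosine similarity to the mean.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: three "meal" words and one "food item" word.
toy = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.2, 0.9],
}
print(odd_one_out(toy, "breakfast cereal dinner lunch".split()))
```

Because the mean is dominated by the majority of the group, a single outlier ends up with the lowest similarity. With only four words the outlier also pulls the mean towards itself, which may be one reason some answers on real data look surprising.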


In [None]:
# Empty proposal abstract
# Word2vec types
# Look deeper into word embeddings
# Different types of gensim on YouTube
# SNLP dataset info
# Converting vectors to sentence entailment
# Brief overview of the top 3 methods
# LSTMs
# Zotero tool
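
As a possible starting point for the note on converting vectors to sentence entailment, here is a crude sketch of a graded, directional score: for each hypothesis word, take its best cosine match among the text's words, then average. This is only a baseline under stated assumptions, not an established RTE method; the toy vectors are invented, and `soft_coverage` is a hypothetical name. Unlike the binary decision, it yields a number that can express "almost entailed".

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def soft_coverage(vectors, text, hypothesis):
    """Average, over hypothesis words, of the best match found in the text.

    Directional: swapping text and hypothesis generally changes the score.
    """
    best = [
        max(cosine(vectors[h], vectors[t]) for t in text) for h in hypothesis
    ]
    return sum(best) / len(best)


# Hypothetical toy vectors; a trained model would supply real ones.
toy = {
    "dog": [0.9, 0.1],
    "animal": [0.8, 0.3],
    "runs": [0.1, 0.9],
    "moves": [0.2, 0.8],
}

full = soft_coverage(toy, ["dog", "runs"], ["animal", "moves"])
partial = soft_coverage(toy, ["dog"], ["animal", "moves"])
print(full, partial)  # the text that covers both hypothesis words scores higher
```

A threshold on this score would turn it back into a yes/no decision, but word overlap in vector space ignores negation, word order and quantifiers, so it should be treated as a rough signal only.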
