
## First let's load the data

In [0]:
import urllib.request
import re 
import numpy as np
import numpy
import string

!python -m spacy download en_core_web_md
import spacy
import en_core_web_md

Collecting en_core_web_md==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.1.0/en_core_web_md-2.1.0.tar.gz (95.4MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.1.0-cp36-none-any.whl size=97126237 sha256=38500ec5b897229336082b821716b893d447b59b1c4fa1feaa41d2a0b39af292
  Stored in directory: /tmp/pip-ephem-wheel-cache-pd6weatu/wheels/c1/2c/5f/fd7f3ec336bf97b0809c86264d2831c5dfb00fc2e239d1bb01
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.1.0
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


We'll start with the Microsoft Research Paraphrase Corpus and build a list of n paraphrase pairs.


In [0]:
def load_data(n=5):
  '''
  Function to load sentence pairs from the Microsoft Research Paraphrase Corpus.
  Consecutive rows of the data file are treated as one paraphrase pair.

  Args:
      n (int): Number of paraphrase pairs to load.

  Returns:
      new_list: A nested list of n paraphrase pairs.
  '''
  target_url = 'https://raw.githubusercontent.com/wasiahmad/paraphrase_identification/master/dataset/msr-paraphrase-corpus/msr_paraphrase_data.txt'
  i = 0; data = []
  for sentence in urllib.request.urlopen(target_url):
      # skip first sentence
      if i == 0: i += 1; continue
      sentence = sentence.decode()
      sentence =  re.split(r'\t+', sentence)
      data.append(sentence[1])
      # count the sentences read so far and stop after n pairs
      i += 1
      if i > n*2: break
  # turn into nested list
  new_list = []
  for i in range(0, len(data)-1, 2):
    new_list.append([data[i], data[i+1]])

  print("Data loaded")

  return new_list

In [0]:
data = load_data()

Data loaded
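
The pairing logic in `load_data` can be tried without a network connection. The sketch below is a minimal stand-in, not the function above: `pair_sentences` and `sample_lines` are hypothetical names, and the rows are made up for illustration. It keeps the same assumptions as `load_data`: the first line is skipped, the sentence sits in the second tab-separated column, and consecutive rows form one pair.

```python
import re

def pair_sentences(lines, n=5):
    """Skip the first line, keep column 1 of each row, and group rows into n pairs."""
    sentences = []
    for i, line in enumerate(lines):
        if i == 0:
            continue  # first line is skipped, as in load_data
        if len(sentences) >= 2 * n:
            break  # enough sentences for n pairs
        fields = re.split(r'\t+', line.rstrip('\n'))
        sentences.append(fields[1])
    # consecutive sentences form one pair
    return [[sentences[i], sentences[i + 1]]
            for i in range(0, len(sentences) - 1, 2)]

# Made-up rows for illustration only; the column layout is an assumption.
sample_lines = [
    "id\ttext\tsource\n",
    "1\tThe cat sat.\tx\n",
    "2\tA cat was sitting.\tx\n",
    "3\tIt rained.\ty\n",
    "4\tRain fell.\ty\n",
    "5\tUnpaired extra row.\tz\n",
]
pairs = pair_sentences(sample_lines, n=2)
print(pairs)
```

With `n=2` the fifth data row is never read, so every returned item is a complete pair.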


## Now let's chunk the sentences into phrases.

Let's load a model from spaCy...

In [0]:
def get_model():
  '''
  Function to load spaCy's medium English model (en_core_web_md),
  which supplies the word vectors used to compare phrases.

  Args:
      None

  Returns:
      nlp: A loaded spaCy language model.
  '''
  nlp = en_core_web_md.load()
  print("Model loaded")
  return nlp 

In [0]:
nlp = get_model()

Model loaded


Let's build a Traverse object to extract phrases from our sentences. The object takes a parse tree from the Stanford Parser and traverses it to collect the noun phrases (NP) and prepositional phrases (PP).

In [0]:
import nltk
from nltk.parse.stanford import StanfordParser
import os
from os import path

# Create an object to recursively traverse the tree and collect phrases
class Traverse():
  '''
  Traverse object that walks a parse tree to find noun and prepositional phrases.
  To use, call traverse_tree() with the output of the StanfordParser,
  then call phrase_strings() with self.phrases to get the phrases of
  the input sentence.
  '''
  def __init__(self):
    self.phrases = []
    
  def traverse_phrase(self, tree, phrases): 
      for subtree in tree:
          if type(subtree) == nltk.tree.Tree:
              self.traverse_phrase(subtree, phrases)
          else:
              phrases.append(subtree)

  def traverse_tree(self, tree):
      for subtree in tree:
          if type(subtree) == nltk.tree.Tree:
              if subtree.label() == 'NP' or subtree.label() == 'PP':
                  self.traverse_phrase(subtree, self.phrases)
                  self.phrases.append('\n')
              else :
                  self.traverse_tree(subtree)

  def phrase_strings(self, phrase_list):
    # put noun phrases in list
    a = " ".join(phrase_list).split("\n")
    a = [i.strip() for i in a if i]
    return a


# Load Stanford Statistical Parser
def load_parser():
  '''
  Function to load the Stanford Parser, downloading it first if it is not already present.

  Args:
      None

  Returns:
      parser: return parser object
  '''
  path_exists = os.path.exists('/content/stanford-parser-full-2018-10-17')
  if path_exists:
    print(True)
  else:
    print("Load and configure StanfordParser")
    !wget https://nlp.stanford.edu/software/stanford-parser-full-2018-10-17.zip
    !unzip stanford-parser-full-2018-10-17.zip

  stanford_parser_dir = '/content/stanford-parser-full-2018-10-17'
  path_to_models = stanford_parser_dir  + "/stanford-parser-3.9.2-models.jar"
  path_to_jar = stanford_parser_dir  + "/stanford-parser.jar"
  parser = StanfordParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models)
  return parser

load_parser()

True


<nltk.parse.stanford.StanfordParser at 0x7f54ee16a8d0>
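
The traversal rule used by `Traverse` can be illustrated without installing the Stanford Parser. The sketch below is an assumption-laden stand-in rather than the class above: it represents a parse tree as nested Python lists of the form `[label, child, ...]` instead of `nltk.tree.Tree`, and `collect_phrases` is a hypothetical helper. The bracketing is hand-written for illustration and is not actual parser output. As in `traverse_tree`, the walk stops at the outermost NP or PP, so phrases nested inside it are kept as part of that phrase.

```python
def collect_phrases(tree, targets=("NP", "PP")):
    """Return the words of each outermost subtree whose label is in targets."""
    phrases = []

    def leaves(node):
        # a leaf is a plain string; an inner node is [label, child, ...]
        if isinstance(node, str):
            return [node]
        words = []
        for child in node[1:]:
            words.extend(leaves(child))
        return words

    def walk(node):
        for child in node[1:]:
            if isinstance(child, str):
                continue  # a bare word outside any target phrase
            if child[0] in targets:
                # collect the whole phrase and do not descend further
                phrases.append(" ".join(leaves(child)))
            else:
                walk(child)

    walk(tree)
    return phrases

# Hand-written bracketing for illustration, not actual parser output.
tree = ["ROOT",
        ["S",
         ["NP", ["NNP", "Charlie"], ["NNP", "Chan"]],
         ["VP", ["VBZ", "is"],
          ["PP", ["IN", "off"],
           ["NP", ["DT", "the"], ["NN", "case"]]]]]]
print(collect_phrases(tree))
```

Because the inner NP sits inside a PP, it is not reported separately. This is why the phrase lists printed later in this notebook contain long phrases with other phrases embedded in them.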

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Let's write a function to calculate the average word vector for each phrase
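
As a brief restatement of what the code below computes (the symbols are introduced here only for illustration): for a phrase $p$ with tokens $w_1, \dots, w_n$ and word vectors $v(w_i)$, the phrase vector is the mean of its token vectors, and two phrases $p$ and $q$ are compared by the Euclidean distance between their phrase vectors.

$$
\bar{v}(p) = \frac{1}{n} \sum_{i=1}^{n} v(w_i),
\qquad
d(p, q) = \lVert \bar{v}(p) - \bar{v}(q) \rVert_2 = \sqrt{\sum_{k} \bigl(\bar{v}(p)_k - \bar{v}(q)_k\bigr)^2}
$$

Identical phrases have identical phrase vectors, so their distance is 0.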

In [0]:
def wva(text):
    '''
    Finds the vector of a phrase by averaging the vectors of its tokens.

    Args: 
      text (str): Input phrase or sentence

    Returns:
      array: Word vector average
    '''
    doc = nlp(text)
    wvs = np.array([token.vector for token in doc])
    return np.mean(wvs, axis=0)
  
def match_wv_pair(phrasesA, phrasesB):
  '''
  Takes the phrase lists of two sentences and, for each phrase in the shorter list, finds the phrase in the longer list whose average word vector is closest in Euclidean distance.

  Args:
  phrasesA (list of str): List of parsed phrases from sentence A
  phrasesB (list of str): List of parsed phrases from sentence B to compare with sentence A

  Returns:
  matches (list of str): One formatted match per phrase in the shorter list. Matching is non-exclusive, i.e. several phrases can share the same match.

  '''
  # get word vectors
  wva_a = [wva(p) for p in phrasesA]
  wva_b = [wva(p) for p in phrasesB]
  label_a = "Sentence A"; label_b = "Sentence B"

  # swap so that the shorter list is on the outer for loop;
  # swap the labels too so each phrase stays attributed to its own sentence
  if len(wva_a) > len(wva_b):
    wva_a, wva_b = wva_b, wva_a
    phrasesA, phrasesB = phrasesB, phrasesA
    label_a, label_b = label_b, label_a

  matches = []
  for i in range(len(wva_a)):
    # distance from phrase i to every phrase of the other sentence
    distances = [np.linalg.norm(wva_a[i] - vec_b) for vec_b in wva_b]
    best = int(np.argmin(distances))
    matches.append(label_a + ": " + phrasesA[i] + "\n " + label_b + ":" + phrasesB[best] + "\n Euclidean Distance:" + str(distances[best]) + "\n")

  return matches
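
The averaging and nearest-match steps can be checked by hand on a toy example. The sketch below is an illustration under stated assumptions, not the notebook's method: `TOY_VECTORS` holds made-up 2-dimensional vectors instead of spaCy's, `average_vector` and `nearest_phrase` are hypothetical helpers, and `math.dist` (Python 3.8+) stands in for `np.linalg.norm`.

```python
import math

# Toy 2-dimensional "word vectors", made up for illustration.
TOY_VECTORS = {
    "fox": [1.0, 0.0],
    "channel": [0.0, 1.0],
    "case": [4.0, 4.0],
    "the": [0.0, 0.0],
}

def average_vector(phrase):
    """Average the toy vectors of the words in a phrase, component by component."""
    vectors = [TOY_VECTORS[word] for word in phrase.lower().split()]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def nearest_phrase(phrase, candidates):
    """Return the candidate closest to phrase in Euclidean distance, and that distance."""
    target = average_vector(phrase)
    distances = [math.dist(target, average_vector(c)) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return candidates[best], distances[best]

match, distance = nearest_phrase("the fox channel", ["fox channel", "the case"])
print(match, round(distance, 4))
```

Here "the fox channel" averages to (1/3, 1/3) and "fox channel" to (0.5, 0.5). The extra word moves the average, so the two phrases are close but not at distance 0. The same effect shows up in the notebook's output, where only identical phrases reach a distance of exactly 0.0.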


## Now let's iterate through the sentence pairs, calculate the word vectors of their phrases, and use the smallest Euclidean distance to match phrases between the two sentences.

In [0]:
def find_similar_phrases(data, parser):
  '''
  Parses each sentence with the Stanford Parser, uses a Traverse object to collect its noun and prepositional phrases, and prints the closest phrase matches for each sentence pair.

  Args: 
  data (list of lists of str): Sentence pairs in the format [[a,b],[c,d]] to find similarity between each pair.
  parser: A loaded StanfordParser, e.g. the object returned by load_parser().

  Returns:
  Nothing (prints out original sentences, a phrases, b phrases, matching phrases, and their Euclidean distance)
  '''
  for a, b in data:
    print("Original sentences:")
    print("Sentence A: ", a)
    print("Sentence B: ", b)
    print()

    ta = Traverse()
    phrasesA = parser.raw_parse(a)
    ta.traverse_tree(phrasesA)
    a = ta.phrases
    a = ta.phrase_strings(a)
    
    tb = Traverse()
    phrasesB = parser.raw_parse(b)
    tb.traverse_tree(phrasesB)
    b = tb.phrases
    b = tb.phrase_strings(b)
    
    print("A phrases:", a)
    print("B phrases:", b)
    print()

    matches = match_wv_pair(a,b)
    print("Similar phrases:")
    for i in matches:
      print(i)
    print()
    print()

parser = load_parser()
find_similar_phrases(data, parser)

True
Original sentences:
Sentence A:  Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.
Sentence B:  Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.

A phrases: ['Amrozi', "his brother , whom he called `` the witness '' , of deliberately distorting his evidence"]
B phrases: ["to him as only `` the witness ''", 'Amrozi', 'his brother of deliberately distorting his evidence']

Similar phrases:
Sentence A: Amrozi
 Sentence B:Amrozi
 Euclidean Distance:0.0

Sentence A: his brother , whom he called `` the witness '' , of deliberately distorting his evidence
 Sentence B:to him as only `` the witness ''
 Euclidean Distance:1.6207646



Original sentences:
Sentence A:  Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.
Sentence B:  Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.

A phrases: ['Yucaipa

In [0]:
a = "Charlie Chan is off the case for the Fox Movie Channel."
b = "The Fox Movie Channel has banned Charlie Chan."
c = "Feelings about current business conditions improved substantially from the first quarter, jumping from 40 to 55."
d = "Assessment of current business conditions improved substantially, the Conference Board said, jumping to 55 from 40 in the first quarter."

samples = [[a,b], [c,d]]

In [0]:

find_similar_phrases(samples, parser)

Original sentences:
Sentence A:  Charlie Chan is off the case for the Fox Movie Channel.
Sentence B:  The Fox Movie Channel has banned Charlie Chan.

A phrases: ['Charlie Chan', 'the case for the Fox Movie Channel']
B phrases: ['The Fox Movie Channel', 'Charlie Chan']

Similar phrases:
Sentence A: Charlie Chan
 Sentence B:Charlie Chan
 Euclidean Distance:0.0

Sentence A: the case for the Fox Movie Channel
 Sentence B:The Fox Movie Channel
 Euclidean Distance:1.7385955



Original sentences:
Sentence A:  Feelings about current business conditions improved substantially from the first quarter, jumping from 40 to 55.
Sentence B:  Assessment of current business conditions improved substantially, the Conference Board said, jumping to 55 from 40 in the first quarter.

A phrases: ['Feelings about current business conditions', 'from the first quarter', 'from 40 to 55']
B phrases: ['Assessment of current business conditions', 'the Conference Board', 'to 55', 'from 40 in the first quarter']

Sim