# Phrase extraction

Find common multiword phrases using the method of Mikolov et al. (2013).
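
For orientation, the scoring idea behind the method can be sketched in plain Python. Mikolov et al. score an adjacent word pair as (count(ab) − δ) / (count(a) × count(b)) and treat pairs scoring above a threshold as phrases. The function name `bigram_scores`, the toy sentences, and the extra scaling by vocabulary size (modeled loosely on gensim's default scorer) are assumptions made for illustration; this is not the code the notebook runs.

```python
from collections import Counter


def bigram_scores(sentences, delta=1):
    """Score adjacent word pairs in the style of Mikolov et al. (2013).

    `delta` is the discount that keeps very rare pairs from scoring high.
    Multiplying by the vocabulary size is an assumption borrowed from
    gensim's default scorer; the paper's formula omits it.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # adjacent pairs only

    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - delta) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n_ab in bigrams.items()
    }


# Hypothetical toy corpus, not taken from the CellarTracker data
toy = [
    ["black", "currant", "and", "oak"],
    ["black", "currant", "finish"],
    ["long", "finish"],
]
scores = bigram_scores(toy)
```

In this toy corpus only `("black", "currant")` occurs more than once, so it is the only pair with a positive score; every pair seen once is discounted to zero.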

In [39]:
import pandas as pd
import spacy
from cytoolz import take
from gensim.models.phrases import Phrases, Phraser

## Load and tokenize data

In [24]:
# Only the tokenizer is used below, so the tagger, NER, and parser are disabled
nlp = spacy.load('en', disable=['tagger', 'ner', 'parser'])

In [25]:
df = pd.read_csv('http://bulba.sdsu.edu/cellartracker.csv.gz', na_filter=False)
df = df.sample(50000)

In [26]:
def tokenize(text):
    # Run only spaCy's tokenizer and keep each token's original text
    return [tok.orth_ for tok in nlp.tokenizer(text)]

df['tokens'] = df['review_text'].apply(tokenize)

## Find bigram phrases

In [30]:
# common_terms lets connector words such as 'of' appear inside a phrase (e.g. a_b_of_c)
P1 = Phraser(Phrases(df['tokens'], threshold=20, common_terms={'of', 'to', '-'}))

In [44]:
for s in take(5, df['tokens']):
    print(' '.join(P1[s]))
    print()

Popped & poured at local_restaurant . Some muted cherries on the nose . Not a great_deal of action on the palate . Cherries , some tartness on the finish . Not much body .

This wine was superb . Different than pinot 's from the Williamette Valley , which I 'm a big_fan . superior to anything from CA . Was comparing it to my_favorite pinot from Oregon , Shea 's Block 23 . Lighter , with much more subtle complexity . Hints of cherry & strawberry , with a marvelous mid_palette and finish . I just bought more of this to lay_down , though it probably_wo n't make it , I 'll easily consume this sometime soon :)

The nose had good dark fruit aromas and finished with some leather and coffee . The palate had a bunch of black_currant mixed with chocolate , and expresso , and finished with a hint of leather .

absolutely divine . hands_down the best wine i 've_had in a very long time . rich , lush and full - bodied . major fruit with a slight undertone of oak on the nose . simultaneously fruity (

## Find (up to) trigram phrases

In [32]:
P2 = Phraser(Phrases(P1[list(df['tokens'])], threshold=15, common_terms={'of', 'to', '-'}))

In [45]:
for s in take(5, df['tokens']):
    print(' '.join(P2[P1[s]]))
    print()

Popped & poured at local_restaurant . Some muted cherries on the nose . Not a great_deal of action on the palate . Cherries , some tartness on the finish . Not much body .

This wine was superb . Different than pinot 's from the Williamette Valley , which I_'m a big_fan . superior to anything from CA . Was comparing it to my_favorite pinot from Oregon , Shea 's Block 23 . Lighter , with much more subtle complexity . Hints of cherry & strawberry , with a marvelous mid_palette and finish . I just bought more of this to lay_down , though it probably_wo_n't make it , I 'll easily consume this sometime soon :)

The nose had good dark fruit aromas and finished with some leather and coffee . The palate had a bunch of black_currant mixed with chocolate , and expresso , and finished with a hint of leather .

absolutely divine . hands_down the best wine i 've_had in a very long time . rich , lush and full_-_bodied . major fruit with a slight undertone of oak on the nose . simultaneously fruity (
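
The second pass can build longer phrases because it sees the first pass's merged tokens as single units. A minimal sketch of that stacking, assuming a greedy left-to-right merge and hand-picked phrase sets: `merge_pass` and both phrase sets are hypothetical, and the sketch ignores the `common_terms` handling that gensim performs.

```python
def merge_pass(tokens, phrases, sep="_"):
    """Greedily join adjacent tokens whose pair appears in `phrases`."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # both tokens are consumed by the merge
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["it", "probably", "wo", "n't", "make", "it"]

# Pass 1: hypothetical bigram set learned from raw tokens
first = merge_pass(tokens, {("probably", "wo")})

# Pass 2: hypothetical set learned from pass-1 output, where
# "probably_wo" already counts as one token
second = merge_pass(first, {("probably_wo", "n't")})

print(" ".join(first))
print(" ".join(second))
```

Each pass only ever joins two adjacent units, so longer phrases appear only when an earlier pass has already fused part of them.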

## Find (up to) four-gram phrases

In [34]:
P3 = Phraser(Phrases(P2[P1[list(df['tokens'])]], threshold=10, common_terms={'of', 'to', '-'}))

In [48]:
for s in take(5, df['tokens']):
    print(' '.join(P3[P2[P1[s]]]))
    print()

Popped_& poured at local_restaurant . Some muted cherries on the nose . Not a great_deal of action on the palate . Cherries , some tartness on the finish . Not_much body .

This wine was superb . Different than pinot 's from the Williamette Valley , which I_'m a big_fan . superior to anything from CA . Was comparing it to my_favorite pinot from Oregon , Shea 's Block 23 . Lighter , with much more subtle complexity . Hints of cherry & strawberry , with a marvelous mid_palette and finish . I just bought more of this to lay_down , though it probably_wo_n't make it , I_'ll easily consume this sometime soon :)

The nose had good dark fruit aromas and finished with some leather and coffee . The palate had a bunch of black_currant mixed with chocolate , and expresso , and finished with a hint of leather .

absolutely divine . hands_down the best wine i_'ve_had in a very long time . rich , lush and full_-_bodied . major fruit with a slight undertone of oak on the nose . simultaneously fruity (

In [36]:
print(' '.join(P3[P2[P1[list(df['tokens'].iloc[2])]]]))

The nose had good dark fruit aromas and finished with some leather and coffee . The palate had a bunch of black_currant mixed with chocolate , and expresso , and finished with a hint of leather .
