# Topic Modeling with Winemag Data: Term Extraction Benchmarks
<b>10/8/2018</b><br>
A scratch space for setting example benchmarks for term extraction on the [WineMag data](https://www.kaggle.com/zynicide/wine-reviews/home).
<hr>

In [91]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [92]:
# load data
df = pd.read_csv('./data/winemag-data-130k-v2.csv', index_col=0)

In [3]:
df.head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


## Summary

Our goal is to examine the topics that exist within the wine tasting descriptions. To prepare the text, we'll first remove unnecessary stopwords (such as 'the', 'for', and 'when') to reduce noise. We'll then lemmatize the remaining words (reduce them to their base forms) so that the topic indicators are more consistent. Finally, we'll use TF-IDF vectors and Latent Dirichlet Allocation (LDA) to separate the descriptions into topic groups.
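
As a rough illustration of the last step, here is a minimal sketch of TF-IDF followed by LDA using scikit-learn. The library choice, the toy sentences, and `n_components=2` are all assumptions made for this example; the notebook itself does not show which implementation it uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for wine descriptions (not taken from the dataset).
docs = [
    "ripe cherry and plum with soft tannins",
    "bright lime and green apple with crisp acidity",
    "dark cherry, plum and firm tannins",
    "crisp apple, lime zest and lively acidity",
]

# Build TF-IDF vectors, dropping scikit-learn's built-in English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model; n_components=2 is an arbitrary choice for toy data.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tfidf)  # one row of topic weights per document

# Map column indices back to terms and print the top terms for each topic.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
for topic_id, weights in enumerate(lda.components_):
    top_terms = [index_to_term[i] for i in weights.argsort()[::-1][:3]]
    print(topic_id, top_terms)
```

Each row of `doc_topics` is a distribution over topics, so the largest entry in a row can be read as that description's dominant topic.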

In [64]:
# setup for stopword removal, topic word extraction
import spacy

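# 'en' is a shortcut name from older spaCy releases; spaCy v3+ requires the
# full package name instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')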

def is_valid(tk):
    """ Return True if token is not a stopword, punctuation character, blank space or digit."""
    invalid_conditions = (
        tk.is_stop,
        tk.pos_ == 'PUNCT',
        tk.lemma_ == ' ',
        tk.lemma_.isdigit()
    )
    return not any(invalid_conditions)
    
def extract_lemmatized_topic_words(doc):
    """ Extract valid topic words from Spacy doc """
    return [tk.lemma_ for tk in nlp(doc) if is_valid(tk)]

def extract_tw_from_row(row):
    """ Row-wise wrapper, for use with DataFrame.apply(..., axis=1) """
    return extract_lemmatized_topic_words(row['description'])
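
The filter can be sanity-checked without loading a spaCy model. The sketch below repeats the same four checks and feeds them hypothetical stand-in objects (`fake_token` is not part of spaCy) that expose only the attributes the filter reads.

```python
from types import SimpleNamespace

def is_valid(tk):
    """Same checks as the notebook's filter, repeated so this runs standalone."""
    return not any((
        tk.is_stop,
        tk.pos_ == 'PUNCT',
        tk.lemma_ == ' ',
        tk.lemma_.isdigit(),
    ))

def fake_token(lemma, pos='NOUN', is_stop=False):
    """Hypothetical stand-in exposing only the attributes is_valid reads."""
    return SimpleNamespace(lemma_=lemma, pos_=pos, is_stop=is_stop)

tokens = [
    fake_token('the', pos='DET', is_stop=True),  # stopword
    fake_token('wine'),
    fake_token(',', pos='PUNCT'),                # punctuation
    fake_token('2013', pos='NUM'),               # digits
    fake_token(' ', pos='SPACE'),                # blank space
    fake_token('ripe', pos='ADJ'),
]

kept = [tk.lemma_ for tk in tokens if is_valid(tk)]
print(kept)
```

Only `wine` and `ripe` pass all four checks; every other stand-in trips exactly one condition.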

In [5]:
# extract topic words for full dataset (disabled; the sample below is used instead)
# df['topic_words'] = df.description.apply(extract_lemmatized_topic_words)

# get sample to extract topic words from
df_sample = df.sample(n=100, replace=False, random_state=42)
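
Since the aim is to benchmark term extraction, one simple way to time an `apply` call over a sample is sketched below. The helper `time_apply`, the toy series, and the whitespace tokenizer are placeholders introduced for this example; in the notebook they would correspond to `df_sample.description` and `extract_lemmatized_topic_words`.

```python
import time
import pandas as pd

def time_apply(series, func):
    """Apply func to every element and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = series.apply(func)
    return result, time.perf_counter() - start

# Stand-in data and extractor, used here so the sketch runs without spaCy.
toy = pd.Series(["Tart and snappy", "Ripe and fruity wine"])
words, elapsed = time_apply(toy, lambda text: text.lower().split())
print(f"{len(toy)} rows in {elapsed:.4f}s")
```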

### Create Parallelized Version

In [6]:
import dask.dataframe as dd
from multiprocessing import cpu_count

In [84]:
# load data lazily with dask; the unnamed first CSV column is the row index
ddf = dd.read_csv('./data/winemag-data-130k-v2.csv')
ddf = ddf.drop('Unnamed: 0', axis=1)

# get sample: dask's sample takes a fraction, and 100 / 129971 rows is about 0.0007694
ddf_sample = ddf.sample(frac=0.0007694, random_state=42, replace=False)
ddf_sample = ddf_sample.repartition(npartitions=4)
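
The sampling fraction appears to be chosen to match the 100-row pandas sample. A quick check of that arithmetic, assuming the row count of 129971 that `len(ddf)` reports in this notebook:

```python
total_rows = 129971   # row count reported by len(ddf) in this notebook
target_rows = 100     # matches n=100 used for the pandas sample

frac = target_rows / total_rows
print(round(frac, 7))
```

Because the fraction is applied to the partitioned data, the resulting sample may not contain exactly 100 rows.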

In [87]:
# add topic words; dask evaluates this lazily, so `meta` declares the output name and dtype
ddf_sample['topic_words'] = ddf_sample.description.apply(extract_lemmatized_topic_words,
                                                         meta=pd.Series(dtype='object', name='topic_words'))

In [88]:
ddf_sample.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,topic_words
33857,US,A blend of nearly two-thirds Cabernet Sauvigno...,Claret,87,20.0,Washington,Horse Heaven Hills,Columbia Valley,Sean P. Sullivan,@wawinereport,Robert Karl 2010 Claret Red (Horse Heaven Hills),Bordeaux-style Red Blend,Robert Karl,"[a, blend, nearly, third, cabernet, sauvignon,..."
117319,US,Block One is the original (1972) planting at C...,Champoux Vineyards Block One,96,72.0,Washington,Columbia Valley (WA),Columbia Valley,Paul Gregutt,@paulgwine,Sineann 2012 Champoux Vineyards Block One Cabe...,Cabernet Sauvignon,Sineann,"[block, one, original, plant, champoux, anchor..."
39310,Austria,"Fresh, superripe as well as candied pineapple ...",Eiswein Jungherrn,94,,Vienna,,,Anne Krebiehl MW,@AnneInVino,Stift Klosterneuburg 2011 Eiswein Jungherrn Ch...,Chardonnay,Stift Klosterneuburg,"[fresh, superripe, candy, pineapple, combine, ..."
56509,US,The fruit on this Syrah is so ripe that the bl...,Lafond Vineyard,82,40.0,California,Sta. Rita Hills,Central Coast,,,Lafond 2010 Lafond Vineyard Syrah (Sta. Rita H...,Syrah,Lafond,"[the, fruit, syrah, ripe, blackberry, venture,..."
118702,Chile,"Soft aromas of sea shell, stone and mango appe...",Culpeo Made with Organic Grapes,82,10.0,Curicó Valley,,,Michael Schachner,@wineschach,Viña La Fortuna 2012 Culpeo Made with Organic ...,Chardonnay,Viña La Fortuna,"[soft, aroma, sea, shell, stone, mango, appear..."
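
The `topic_words` shown above still include `a` and `the`, each coming from a capitalized sentence-initial token ("A blend", "The fruit"). One possible explanation, offered as an assumption rather than a diagnosis, is that the stopword lookup in the installed spaCy version was case-sensitive. A minimal sketch of a case-insensitive check follows, using a deliberately tiny, hypothetical stopword set (`STOP_WORDS_DEMO`) in place of spaCy's own list.

```python
# Hypothetical, deliberately tiny stopword set for illustration only.
STOP_WORDS_DEMO = {'a', 'the', 'for', 'when', 'of', 'is', 'on', 'this'}

def is_stopword(text, stop_words=STOP_WORDS_DEMO):
    """Case-insensitive stopword test: lowercase the text before the lookup."""
    return text.lower() in stop_words

tokens = ['The', 'fruit', 'on', 'this', 'Syrah', 'is', 'ripe']
kept = [t.lower() for t in tokens if not is_stopword(t)]
print(kept)
```

In the notebook's filter, the analogous change would be to test the lowercased token text (for example `tk.lower_`) against spaCy's stopword set, rather than relying on `tk.is_stop` alone.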


In [95]:
len(ddf)

129971