# Topic Modeling: Winemag Data
<b>9/14/2018</b><br>
Space to explore [WineMag data](https://www.kaggle.com/zynicide/wine-reviews/home) via topic modeling.

<hr>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# load data
df = pd.read_csv('./data/winemag-data-130k-v2.csv', index_col=0)

In [3]:
df.head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


## Summary

Our goal is to examine the topics that exist within the wine tasting descriptions. To perform the analysis, we'll first remove unnecessary stopwords (such as 'the', 'for', 'when', etc.) to reduce noise. We'll then lemmatize the remaining words (reduce them to their base forms) to make the topic indicators more consistent. Finally, we'll use TF-IDF vectors and Latent Dirichlet Allocation (LDA) to separate the descriptions into topic groups.
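To make the TF-IDF step concrete, here is a minimal sketch of the weighting using only the standard library. It assumes the descriptions are already reduced to lists of lemmas; the `tfidf` helper and the toy `docs` are hypothetical and are not part of this notebook. It uses the plain `log(N / df)` form of inverse document frequency, whereas vectorizer libraries typically apply a smoothed variant, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Term frequency is the count divided by document length;
    inverse document frequency is log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# hypothetical, already-lemmatized descriptions (not taken from the dataset)
docs = [
    ["ripe", "fruit", "smooth", "tannin"],
    ["tart", "lime", "fruit", "crisp"],
    ["ripe", "cherry", "oak", "tannin"],
]
weights = tfidf(docs)
```

Terms that appear in fewer documents (such as `lime`) receive a higher weight than terms shared across documents (such as `fruit`), which is what lets distinctive vocabulary stand out when the descriptions are grouped into topics.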

### Extract Topic Words

In [4]:
import spacy

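# 'en' was a shortcut link in spaCy 2.x; spaCy 3 requires the full package name
# (assumes the small English model: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')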

def is_valid(tk):
    """ Return True if token is not a stopword, punctuation character, blank space or digit."""
    invalid_conditions = (
        tk.is_stop,
        tk.pos_ == 'PUNCT',
        tk.is_space,  # any whitespace-only token, not just a single space
        tk.lemma_.isdigit()
    )
    return not any(invalid_conditions)
    
def extract_lemmatized_topic_words(doc):
    return [tk.lemma_ for tk in nlp(doc) if is_valid(tk)]

def add_topic_words(df):
    df_tw = df.copy()
    df_tw['topic_words'] = df_tw.description.apply(extract_lemmatized_topic_words)
    return df_tw

In [5]:
from multiprocessing import Pool

# set up partition and worker counts
num_partitions = 16
num_cores = 8

# partition the dataframe, process the chunks in parallel, then reassemble
df_split = np.array_split(df, num_partitions)
with Pool(num_cores) as pool:
    # the pool is cleaned up automatically when the block exits
    df = pd.concat(pool.map(add_topic_words, df_split))

In [9]:
df.to_csv('winemag_topic_words.csv', index=False)
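One thing to keep in mind when reloading this file: CSV has no list type, so each `topic_words` list is written as its string form. The sketch below shows the round trip with the standard-library `csv` module on hypothetical rows (the titles and words are made up for illustration); pandas is expected to serialize list cells the same way.

```python
import ast
import csv
import io

# hypothetical rows standing in for the dataframe's 'topic_words' column
rows = [
    {"title": "Wine A", "topic_words": ["ripe", "fruit"]},
    {"title": "Wine B", "topic_words": ["tart", "lime"]},
]

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "topic_words"])
writer.writeheader()
writer.writerows(rows)  # non-string values are written via str()

# read the rows back: every field is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["topic_words"]

# ast.literal_eval safely parses the string form back into a list
parsed = [ast.literal_eval(row["topic_words"]) for row in loaded]
```

When reading the saved file with pandas, passing `converters={'topic_words': ast.literal_eval}` to `pd.read_csv` should restore the lists in the same way.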