<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Libraries-and-Data" data-toc-modified-id="Load-Libraries-and-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Libraries and Data</a></span></li><li><span><a href="#Extract-Review-Text" data-toc-modified-id="Extract-Review-Text-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Extract Review Text</a></span></li><li><span><a href="#Save-to-File" data-toc-modified-id="Save-to-File-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Save to File</a></span></li></ul></div>

# Notebook Purpose
This notebook splits each review's text into the categories we need: Nose, Taste, and Finish.
It works by splitting the text into lines and matching each line against regular expressions for the category headers.
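As a rough illustration of the idea, here is a minimal sketch on a made-up review. The sample text and the simplified patterns in `header_res` are assumptions for demonstration only; the patterns used in the notebook cover more header variants. The 15-character cutoff for "header on its own line" mirrors the heuristic used later in the notebook.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Great dram overall.\n\nNose: vanilla and oak.\n\nFinish:\nlong and warm.\n\n85/100"

# Simplified header patterns, for illustration only
header_res = {"nose": r"nose.{,4}[:-]", "finish": r"finish.{,4}[:-]"}

# Split on runs of newlines so blank lines are dropped
lines = re.split(r"\n+", sample.lower())

found = {}
for i, line in enumerate(lines):
    for category, pattern in header_res.items():
        if re.search(pattern, line):
            # A short line is treated as a bare header whose text is on the next line
            if len(line) < 15 and i < len(lines) - 1:
                found[category] = lines[i + 1]
            else:
                found[category] = line

print(found)
```

Here the nose text sits on the same line as its header, while the finish text is picked up from the line after the bare `finish:` header.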

## Load Libraries and Data

In [1]:
import pandas as pd
import re

import multiprocessing as mp

In [2]:
rdb = pd.read_parquet('data/db_reviews.parquet')



Take a look at a review:


In [3]:
rdb.iloc[0]['review']

"My wife and I are on a trip to Thailand to meet her family.  I've seen plenty of whisky here, mostly JW, but this one stood out from the rest.  100 pipers is not something I've seen before and it seems to have quite the following here.  It is a blend at 40% alcohol by volume and 35cl was 220 baht or about $8 Canadian.  I got it more as a novelty as I suspect it is the Thai equivalent of chivas or glenfiddich 12.\n\nColour: caramel, I suspect it is artificially coloured.\n\nNose: (I had some tiger balm on my hands so this may be *way* off) alcohol, little bit of leather and some hints of sweetness.\n\nPalate: very bland, I taste almost nothing really, a bit of woody flavour, the promise of leather and sweetness from the nose is gone.\n\nFinish: short and devoid of anything but alcohol.\n\nThis reminds me of a JW red or the cheap rye my Dad drank when I was a kid.  I bought it primarily for the novelty so I don't think it was a waste.  it is just not something I'd seek out again.\n\n68/

We can parse this by splitting the text on newlines and checking each line for a category header:

In [123]:
def extract_categories(text,name):
    # Raw strings avoid invalid-escape warnings for \* and \s
    nose_re = r'.*nose.{,4}[:-]|\*nose\*?|[\*\s]n[:-]'
    taste_re = r'.*taste.{,4}[:-]|\*taste\*?|\*palate\*|.*palate.{,6}[:-]|[\*\s]t[:-]|[\*\s]p[:-]'
    finish_re = r'.*finish.{,4}[:-]|\*finish\*?|[\*\s]f[:-]'

    # Initialize Collections
    review_categories = ['','','']
    review_count = [0,0,0]
    
    review = re.split("\n+", text.lower())
    
    for i, line in enumerate(review):
        # for each line split
        nose = re.findall(nose_re, line)
        taste = re.findall(taste_re, line)
        finish = re.findall(finish_re, line)
              
        searchresults = [nose, taste, finish]
        for idx, result in enumerate(searchresults):
            if result:
                review_count[idx] += 1
                # if the category title is on a line before the review:
                if len(line) < 15 and i<len(review)-1:
                    review_categories[idx] += review[i+1]
                else:
                    # otherwise it's on the same line
                    review_categories[idx] += line

    # Strip the category labels and markup characters from the collected text
    return {'nose': re.sub(r"(nose)|\*|:|>", '', review_categories[0]),
            'taste': re.sub(r"(taste)|(palate)|\*|:|>", '', review_categories[1]),
            'finish': re.sub(r"(finish)|\*|:|>", '', review_categories[2])}
                

# Function to multiprocess an entire dataframe
def extract_categories_dataframe(df, columnname):
    # create dataframe to hold results
    global results
    results = pd.DataFrame(columns=['reviewID','results'])
    
    # select only the column we want and make unique to save some time
    reviewlist = df[['reviewID','whisky', columnname]].drop_duplicates()
    pool = mp.Pool(mp.cpu_count())
    
    # submit one extraction task per review, reading the text from the requested column
    for review in reviewlist.itertuples():
        pool.apply_async(extract_categories_row, args=(review.reviewID, getattr(review, columnname), review.whisky), callback=collect_result)
    pool.close()
    pool.join()
    
    # break out dictionary from results
    results = pd.concat([results.drop(['results'], axis=1), results['results'].apply(pd.Series)], axis=1)
    
    # join back on original dataframe
    return (df.set_index('reviewID')
              .join(results.set_index('reviewID'))
              .reset_index()
              .rename({'index':'reviewID'}, axis='columns')
           )
    
# Function to be run in a worker process on each item
def extract_categories_row(reviewID, text, name):
    newitem = {}
    newitem['reviewID'] = reviewID
    newitem['results'] = extract_categories(text, name)
    return newitem
    
# Function to collect results from multiprocess
def collect_result(result):
    global results
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    results = pd.concat([results, pd.DataFrame([result])], ignore_index=True)
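
The submit-and-collect pattern used above can be sketched in isolation. This is a minimal illustration, not the notebook's implementation: `process_row`, `collect`, and the sample `rows` are hypothetical, and it uses `multiprocessing.pool.ThreadPool` (which exposes the same `apply_async` interface as `mp.Pool`) only so that the snippet runs anywhere without a `__main__` guard. It also gathers results in a plain list rather than growing a DataFrame one row at a time.

```python
from multiprocessing.pool import ThreadPool

collected = []  # filled by the callback, which runs in the parent process

def process_row(review_id, text):
    # Hypothetical stand-in for the per-review extraction
    return {"reviewID": review_id, "results": {"length": len(text)}}

def collect(result):
    # Called once per finished task with that task's return value
    collected.append(result)

# Made-up input rows: (reviewID, text)
rows = [(0, "nose: oak"), (1, "finish: short"), (2, "")]

pool = ThreadPool(2)
for review_id, text in rows:
    pool.apply_async(process_row, args=(review_id, text), callback=collect)
pool.close()
pool.join()  # all tasks and callbacks have finished after this returns

# Completion order is not guaranteed, so sort before use
collected.sort(key=lambda item: item["reviewID"])
print([item["reviewID"] for item in collected])
```

One thing to keep in mind with `apply_async`: if a task raises and no `error_callback` is given (and `.get()` is never called on the returned handle), the exception goes unnoticed and that row is simply missing from the results.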

## Extract Review Text

Reload the data, filter out reviews without any text, and run the extraction on the rest:

In [126]:
results = None
rdb = pd.read_parquet('data/db_reviews.parquet').reset_index().rename({'index':'reviewID'}, axis='columns')
rdb = rdb[rdb['review'] != '']
rdb = extract_categories_dataframe(rdb, 'review')

## Save to File

In [129]:
rdb.to_parquet('data/db_reviews_split.parquet')

