# Beer Reviews 

This example analyzes beer reviews to find the most common words used in positive and negative reviews.
The original example can be found [here](https://medium.com/rapids-ai/real-data-has-strings-now-so-do-gpus-994497d55f8e).

### Notes on running these queries:

By default, the runs use Bodo, so the data is distributed in chunks across processes.
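
To make "distributed in chunks" concrete, the sketch below splits a row count into contiguous, near-equal blocks, one per process. The helper `chunk_bounds` is hypothetical and is not part of Bodo's API; Bodo's actual partitioning may differ in detail.

```python
def chunk_bounds(n_rows: int, n_ranks: int, rank: int) -> tuple[int, int]:
    """Return the [start, stop) row range owned by `rank`.

    Hypothetical helper for illustration only; not a Bodo function.
    """
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example: 10 rows split across 4 processes.
bounds = [chunk_bounds(10, 4, r) for r in range(4)]
print(bounds)
```

Each process then works only on its own row range, which is why per-rank output appears separately in the cells that follow.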

`reviews_sample.csv` is 23.1 MB.

The full dataset is available at `s3://bodo-examples-data/beer/reviews.csv` and is 2.2 GB.

To run the code:
1. Make sure you [add your AWS account credentials to Saturn Cloud](https://saturncloud.io/docs/examples/python/load-data/qs-load-data-s3/#create-aws-credentials) to access the data.
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the magic expression (`%%px`) and the Bodo decorator (`@bodo.jit`) in all the code cells.
    2. Then re-run the cells from the beginning.
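
As an alternative to commenting out each decorator by hand, one possible pattern is a small wrapper that applies `bodo.jit` only when a flag is set. This is a sketch under assumptions: `USE_BODO` and `maybe_jit` are hypothetical names introduced here, not part of Bodo, and the `%%px` magic would still need to be removed from the cells for a pandas-only run.

```python
USE_BODO = False  # assumption: set to True on a machine where Bodo is installed


def maybe_jit(*jit_args, **jit_kwargs):
    """Apply bodo.jit when USE_BODO is True, otherwise leave the function as-is.

    Hypothetical helper; supports both `@maybe_jit` and `@maybe_jit(...)`.
    """
    if USE_BODO:
        import bodo  # imported lazily so pandas-only runs do not need Bodo
        return bodo.jit(*jit_args, **jit_kwargs)
    # Bare usage: @maybe_jit receives the function itself.
    if len(jit_args) == 1 and callable(jit_args[0]) and not jit_kwargs:
        return jit_args[0]
    # Parameterized usage: @maybe_jit(distributed=[...]) returns a pass-through decorator.
    return lambda func: func


@maybe_jit(distributed=["reviews"])
def shout(reviews):
    return [r.upper() for r in reviews]


@maybe_jit
def count(reviews):
    return len(reviews)


print(shout(["great beer"]), count(["a", "b"]))
```

With `USE_BODO = False`, both functions run as ordinary Python, so the same cell body can be compared with and without compilation.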

### Start an IPyParallel cluster
Run the following code in a cell to start an IPyParallel cluster. This example uses one engine per physical core, up to a maximum of 8.

In [None]:
import ipyparallel as ipp
import psutil

# Use the physical core count, capped at 8 engines
n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|███████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:07<00:00,  1.12engine/s]


### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [None]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

[stdout:6] Hello World from rank 6. Total ranks=8
[stdout:0] Hello World from rank 0. Total ranks=8
[stdout:7] Hello World from rank 7. Total ranks=8
[stdout:3] Hello World from rank 3. Total ranks=8
[stdout:5] Hello World from rank 5. Total ranks=8
[stdout:4] Hello World from rank 4. Total ranks=8
[stdout:1] Hello World from rank 1. Total ranks=8
[stdout:2] Hello World from rank 2. Total ranks=8

## Importing the Packages

These are the main packages used in this example:
 - Bodo to parallelize Python code automatically
 - pandas to work with data

The cell below also imports scikit-learn and XGBoost classes, although the word-count analysis shown here does not call them.
In [None]:
%%px
import warnings
warnings.filterwarnings("ignore")

import bodo
import pandas as pd
import time
from sklearn.preprocessing import StandardScaler # data normalization
from sklearn.model_selection import train_test_split # data split
from sklearn.linear_model import LogisticRegression # Logistic regression algorithm
from sklearn.ensemble import RandomForestClassifier # Random forest tree algorithm
from xgboost import XGBClassifier # XGBoost algorithm
from sklearn.svm import LinearSVC # SVM classification algorithm
from sklearn.metrics import accuracy_score # evaluation metric

## Data Processing and EDA
1. Load the dataset.
2. Preprocess the review text.
3. Split the reviews into high-score and low-score groups and count the most common words in each.

## Preprocessing
1. Create lists of the stopwords and punctuation to remove.
2. Define the regular expressions used to strip that punctuation and those stopwords from the reviews.
3. Use the `lower` and `strip` string methods to convert all letters to lowercase and remove leading and trailing whitespace.
4. Remove the punctuation and stopwords.
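
The four steps above can be sketched on a single string with the standard `re` module. This is an illustration under assumptions, not the notebook's implementation: `SAMPLE_STOPWORDS` is a made-up stand-in for the stopword file, `clean` is a hypothetical helper, and the final `split()` is added only to make the result easy to read.

```python
import re

# Assumption: a tiny stand-in for the stopword file used by the notebook.
SAMPLE_STOPWORDS = ["the", "is", "a", "and"]
PUNCT = [r"\.", r"\-", r"\?", ":", "!", "&", "'", ","]

# One alternation per category: any punctuation mark, or any whole-word stopword.
punc_pattern = re.compile("|".join(f"({p})" for p in PUNCT))
stopword_pattern = re.compile(
    "|".join(rf"\b({re.escape(s)})\b" for s in SAMPLE_STOPWORDS)
)


def clean(review: str) -> list[str]:
    """Lowercase, strip, drop punctuation and stopwords, then split into words."""
    text = review.lower().strip()
    text = punc_pattern.sub("", text)
    text = stopword_pattern.sub("", text)
    return text.split()


print(clean("  The head is GREAT, and the aroma: a perfect malt!  "))
```

The `\b` word boundaries matter here: without them, removing the stopword `a` would also eat the letters inside `aroma` and `malt`.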

In [3]:
%%px
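# Load the stopword list, one word per line
with open("nltk-stopwords.txt", "r") as fh:
    STOPWORDS = list(map(str.strip, fh.readlines()))

# Raw strings keep the regex escapes intact
PUNCT_LIST = [r"\.", r"\-", r"\?", ":", "!", "&", "'", ","]
# Alternations matching any punctuation mark or any whole-word stopword
punc_regex = "|".join([f"({p})" for p in PUNCT_LIST])
stopword_regex = "|".join([f"\\b({s})\\b" for s in STOPWORDS])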

In [4]:
%%px
@bodo.jit(distributed=["reviews"])
def preprocess(reviews):
    # lowercase and strip
    reviews = reviews.str.lower()
    reviews = reviews.str.strip()

    # remove punctuation and stopwords
    reviews = reviews.str.replace(punc_regex, "", regex=True)
    reviews = reviews.str.replace(stopword_regex, "", regex=True)
    return reviews

## Find the Most Common Words
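
The function in this section filters reviews by score, splits each review into words, and counts word frequencies per group. As a rough plain-Python sketch of that same idea, the example below uses `collections.Counter` in place of the pandas calls; the review data is made up for illustration, and the constant names are assumptions that mirror the thresholds used in the notebook cell.

```python
from collections import Counter

# Assumption: made-up (score, cleaned text) pairs standing in for the review data.
reviews = [
    (5.0, "best beer ever"),
    (5.0, "perfect beer"),
    (1.0, "bad beer"),
    (1.5, "bad smell"),
    (3.0, "decent beer"),
]

LOW_THRESHOLD = 1.5
HIGH_THRESHOLD = 4.95

# Select the two groups, mirroring `score > high_threshold` and `score <= low_threshold`.
high_texts = [text for score, text in reviews if score > HIGH_THRESHOLD]
low_texts = [text for score, text in reviews if score <= LOW_THRESHOLD]

# Split each review into words and flatten, the role played by str.split() and explode().
high_words = [word for text in high_texts for word in text.split()]
low_words = [word for text in low_texts for word in text.split()]

# Count occurrences and keep the most frequent word, like value_counts().head(1).
top_high = Counter(high_words).most_common(1)
top_low = Counter(low_words).most_common(1)
print(top_high, top_low)
```

Reviews with middling scores, such as the 3.0 entry, fall into neither group and do not contribute to either count.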

In [7]:
%%px
@bodo.jit
def find_top_words(review_filename):
    # Load in the data
    t_start = time.time()
    df = pd.read_csv(review_filename, parse_dates=[2])
    print("read time", time.time() - t_start)

    score = df.score
    reviews = df.text

    t1 = time.time()
    reviews = preprocess(reviews)
    print("preprocess time", time.time() - t1)

    t1 = time.time()
    # create low and high score series
    low_threshold = 1.5
    high_threshold = 4.95
    high_reviews = reviews[score > high_threshold]
    low_reviews = reviews[score <= low_threshold]
    high_reviews = high_reviews.dropna()
    low_reviews = low_reviews.dropna()

    high_colsplit = high_reviews.str.split()
    low_colsplit = low_reviews.str.split()
    print("high/low time", time.time() - t1)

    t1 = time.time()
    high_words = high_colsplit.explode()
    low_words = low_colsplit.explode()

    top_words = high_words.value_counts().head(25)
    low_words = low_words.value_counts().head(25)
    print("value_counts time", time.time() - t1)
    print("total time", time.time() - t_start)

    print(top_words)
    print(low_words)
    
find_top_words("s3://bodo-examples-data/beer/reviews_sample.csv")

[stdout:0] 
'coroutine' object is not subscriptable.
Will use the value defined in the AWS_DEFAULT_REGION environment variable (or us-east-1 if that is not provided either).
read time 0.8933188915252686
preprocess time 6.139484882354736
high/low time 0.0021250247955322266
value_counts time 0.006670951843261719
total time 7.042609930038452
beer         333
one          158
taste        140
head         119
like         117
best         102
chocolate     90
dark          90
great         86
perfect       80
good          79
sweet         77
smell         73
bottle        72
ive           70
flavor        68
glass         65
well          65
ever          65
aroma         64
nice          64
malt          63
bourbon       62
hops          62
beers         62
dtype: int64
beer           239
like           109
taste          104
head            69
light           65
one             65
smell           57
bad             53
bottle          52
really          49
good            41
would       