# Japanese Whisky Sentiment Analysis (unsupervised learning)

****Introduction****

Here, we explore the binary sentiment (positive or negative) of Japanese whisky reviews using an unsupervised, lexicon-based method. We then count the positive and negative reviews for the whole dataset and for each whisky brand to generate business insights.

****Dataset****

This dataset contains 1,130 customer reviews of 4 Japanese whisky brands. The data is read from the CSV file "japanese_whisky_review.csv".

****Initialization****

Set the path to the folder containing the data:

In [57]:
path = '/Users/93754/OneDrive/documents/MGT561 text mining/team project/'

Run the functions in Text_Normalization_Function.ipynb for text pre-processing:

In [58]:
%run ./Text_Normalization_Function.ipynb

Collecting html.parser
Installing collected packages: html.parser
Successfully installed html.parser


You are using pip version 19.0.3, however version 19.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.






Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  ['<', 'p', '>', 'The', 'circus', 'dog', 'in', 'a', 'plissé', 'skirt', 'jumped', 'over', 'Python', 'who', 'was', "n't", 'that', 'large', ',', 'just', '3', 'feet', 'long.', '<', '/p', '>']
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  <p>The circus dog in a plissé skirt jumped over Python who was not that large, just 3 feet long.</p>
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  [('<', 'a'), ('p', 'n'), ('>', 'v'), ('the', None), ('circus', 'n'), ('dog', 'n'), ('in', None), ('a', None), ('plissé', 'n'), ('skirt', 'n'), ('jumped', 'v'), ('over', None), ('python', 'n'), ('who', None), ('was', 'v'), ("n't", 'r'), ('that', None), ('large', 'a'), (',', None), ('just', 'r'), ('3', None), ('feet', 'n'), ('long.', 'a'), 
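The contents of Text_Normalization_Function.ipynb are not shown here, so the following is only a minimal standard-library sketch of the three pre-processing steps applied later (accent normalization, HTML entity unescaping, tag stripping). The function names ending in `_sketch` are hypothetical stand-ins, not the notebook's own helpers.

```python
import html
import re
import unicodedata


def normalize_accents_sketch(text):
    """Replace accented letters with their base letters (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop only the combining marks so other characters are left alone.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_html_sketch(text):
    """Remove anything that looks like an HTML tag (a crude regex approach)."""
    return re.sub(r"<[^>]+>", "", text)


def preprocess_review_sketch(text):
    """Apply the steps in the same order as the modeling function."""
    text = normalize_accents_sketch(text)
    text = html.unescape(text)  # turn entities such as '&amp;' into '&'
    return strip_html_sketch(text)


print(preprocess_review_sketch("<p>A plissé skirt &amp; a dog</p>"))
```

A regex is enough for simple review markup; heavily nested or malformed HTML would call for a real parser.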

Import the required modules:

In [59]:
import pandas as pd
import numpy as np
import sys
import nltk
import warnings
warnings.simplefilter(action='ignore')

Use the VADER lexicon available in the NLTK module: download the lexicon and set up the sentiment analyzer that uses it.

In [60]:
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\93754\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
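The "compound" score used in the rest of this notebook is a normalized value between -1 and 1. As a rough sketch of how that bound arises: the VADER reference implementation sums the adjusted valence of the words in a text and squashes the sum with x / sqrt(x² + α), with α = 15. The function name below is illustrative, and the input sums are made-up numbers rather than values taken from the reviews.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Illustrative valence sums only; real values come from the lexicon and rules.
for total in (-4.0, 0.0, 1.5, 4.0):
    print(total, round(normalize_compound(total), 4))
```

Because of the squashing, long and strongly worded reviews saturate near ±1, while short reviews with a single mild word stay close to 0.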


Read in the data and extract the review text:

In [61]:
data = pd.read_csv("japanese_whisky_review.csv")
reviews = np.array(data["Review_Content"])
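Note that the `path` variable set during initialization is not used in the cell above, so the read succeeds only when the CSV sits in the notebook's working directory. A small sketch of joining the folder and file name explicitly, using the folder string from the initialization cell; the name `csv_path` is introduced here for illustration.

```python
from pathlib import Path

# Folder value copied from the initialization cell.
path = '/Users/93754/OneDrive/documents/MGT561 text mining/team project/'

# Join folder and file name instead of relying on the working directory.
csv_path = Path(path) / "japanese_whisky_review.csv"
print(csv_path.name)
```

`pd.read_csv` accepts path objects, so `pd.read_csv(csv_path)` could then replace the bare file name.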

****Define Modeling Function****

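Define a function that scores text with the VADER lexicon and returns the binary sentiment label (positive or negative) together with the polarity score.

If the "compound" polarity/intensity score is >= 0.1, the review is labeled positive; otherwise it is labeled negative. The threshold of 0.1 is the default used throughout this notebook. The VADER documentation suggests ±0.05 as typical cutoffs, with a neutral band in between, so the value is worth revisiting for a given dataset.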

In [62]:
def analyze_sentiment_vader_lexicon(review,
                                    threshold=0.1,
                                    verbose=False):

    # Pre-process the text; these helpers are assumed to be loaded by the
    # %run of Text_Normalization_Function.ipynb
    review = normalize_accented_characters(review)
    review = html_parser.unescape(review)
    review = strip_html(review)

    # Score the review with VADER
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)

    # Binary sentiment: positive when the compound score reaches the threshold
    binary_sentiment = ('positive' if scores['compound'] >= threshold
                        else 'negative')

    if verbose:
        # Display the sentiment label and the rounded polarity score.
        # `codes` replaces the deprecated `labels` argument (pandas >= 0.24).
        sentiment_frame = pd.DataFrame(
            [[binary_sentiment, round(scores['compound'], 2)]],
            columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                          ['Binary Sentiment ', 'Polarity Score']],
                                  codes=[[0, 0], [0, 1]]))
        print(sentiment_frame.to_string(index=False))

    return binary_sentiment, scores['compound']
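The thresholding rule can be isolated from the scoring step. The sketch below shows the binary rule used in this notebook next to a three-way variant with a neutral band; both function names and the ±0.05 band are illustrative choices, not part of the notebook's code.

```python
def label_binary(compound, threshold=0.1):
    """Binary rule from this notebook: positive at or above the threshold."""
    return 'positive' if compound >= threshold else 'negative'


def label_three_way(compound, band=0.05):
    """Hypothetical variant that keeps near-zero scores as 'neutral'."""
    if compound >= band:
        return 'positive'
    if compound <= -band:
        return 'negative'
    return 'neutral'


for score in (-0.7964, 0.0, 0.1, 0.4318):
    print(score, label_binary(score), label_three_way(score))
```

Under the binary rule, a review with a compound score of exactly 0 (for example, one with no lexicon words) is counted as negative, which can inflate the negative total.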

Test the function and display sample results

In [63]:
for doc_index in range(8):
    print('Review:-')
    print(reviews[doc_index])   
    print()    
    final_sentiment = analyze_sentiment_vader_lexicon(reviews[doc_index],
                                                        threshold=0.1,
                                                        verbose=True)
    print('-'*60)  

Review:-
Dull taste. High price. No finish. Over-hyped and disappointing.

SENTIMENT STATS:               
Binary Sentiment  Polarity Score
        negative           -0.8
------------------------------------------------------------
Review:-
Delicious! sugared red fruits and sweet with a morish, cinnamon, aromatic depth.

SENTIMENT STATS:               
Binary Sentiment  Polarity Score
        positive           0.79
------------------------------------------------------------
Review:-
I am not a whisky expert but i really love the taste. The experience i had i would describe as very comfortable like being snug under a blanket on a cold rainy day. i sip this with my friend through the night and just enjoyed the flavors. That being said, it is more expensive and i feel that for that price, you could get a better whisky. i am not saying this because of the no age but rather, i think you can get more from other scotch for that price. But if you have it in your vault already, its really en

****Make Prediction****

Predict the sentiment of all reviews and display the first 5 rows of the results:

In [64]:
predicted_sentiment = pd.DataFrame([analyze_sentiment_vader_lexicon(review, threshold=0.1)
                     for review in reviews],columns = ['binary sentiment','raw score'])

In [65]:
predicted_sentiment.head()

| | binary sentiment | raw score |
|---|---|---|
| 0 | negative | -0.7964 |
| 1 | positive | 0.7901 |
| 2 | positive | 0.9854 |
| 3 | positive | 0.9341 |
| 4 | positive | 0.4318 |


Count the number of positive & negative reviews

In [66]:
predicted_sentiment["binary sentiment"].value_counts()

positive    930
negative    200
Name: binary sentiment, dtype: int64
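To put the counts in proportion, here is a short sketch that converts them to shares of the whole dataset. The numbers are copied from the output above; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {'positive': 930, 'negative': 200}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

With these counts, roughly 82% of the 1,130 reviews are labeled positive and about 18% negative.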

Count the number of positive and negative reviews for each brand:

In [67]:
df = pd.concat([data,predicted_sentiment],axis=1)
df

| | Unnamed: 0 | Bottle_name | Brand | Title | Review_Content | binary sentiment | raw score |
|---|---|---|---|---|---|---|---|
| 0 | 1 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Overpriced dissapointment | Dull taste. High price. No finish. Over-hyped ... | negative | -0.7964 |
| 1 | 2 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Delicious | Delicious! sugared red fruits and sweet with a... | positive | 0.7901 |
| 2 | 3 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Good for beginners. i know cos i am a beginner | I am not a whisky expert but i really love the... | positive | 0.9854 |
| 3 | 4 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Yamazaki Tutorial | This is a terrible Yamazaki. Very young, unsh... | positive | 0.9341 |
| 4 | 5 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Very Nice | First time and I like it - fresh but not thin ... | positive | 0.4318 |
| 5 | 6 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Unworthy of al the hype over Yamazaki | I can’t believe all these commenters who are f... | negative | -0.7691 |
| 6 | 7 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Japanese Yamasaki better than one made somewhe... | My friend brought bottle of this from Japan in... | positive | 0.6742 |
| 7 | 8 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Delicious. | Nice. I like some vanilla and this has the goo... | positive | 0.6486 |
| 8 | 9 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Yamazaki Distiller's Reserve | Stands against Glenfiddich 12 in it's similari... | positive | 0.2161 |
| 9 | 10 | The Yamazaki Single Malt Whisky - Distiller’s ... | Yamazaki | Can't Wait | Recommended by my friend. | positive | 0.6124 |


In [68]:
def calnum(group):
    # Per-brand total plus the positive and negative review counts
    result = {'Total_Review': group['binary sentiment'].shape[0],
              'Num_of_Positive': np.sum(group['binary sentiment'] == 'positive'),
              'Num_of_Negative': np.sum(group['binary sentiment'] == 'negative')}
    return pd.Series(result)

df1 = df.groupby(by=['Brand']).apply(calnum)
df1

| Brand | Total_Review | Num_of_Positive | Num_of_Negative |
|---|---|---|---|
| Hakushu | 85 | 73 | 12 |
| Hibiki | 196 | 163 | 33 |
| Nikka | 392 | 329 | 63 |
| Yamazaki | 457 | 365 | 92 |
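Because the brands differ widely in review volume, the raw counts are easier to compare as rates. The sketch below derives each brand's positive share from the counts in the table above; the dictionary layout and variable names are illustrative, and the labels are VADER predictions rather than verified ground truth.

```python
# (total reviews, positive reviews) per brand, copied from the table above.
brand_counts = {
    'Hakushu': (85, 73),
    'Hibiki': (196, 163),
    'Nikka': (392, 329),
    'Yamazaki': (457, 365),
}

positive_rate = {brand: positive / total
                 for brand, (total, positive) in brand_counts.items()}

# List brands from the highest to the lowest share of positive reviews.
ranked = sorted(positive_rate.items(), key=lambda item: item[1], reverse=True)
for brand, rate in ranked:
    print(f"{brand:<9} {rate:.1%}")
```

On these counts Hakushu has the highest positive share and Yamazaki the lowest, although Hakushu's rate rests on only 85 reviews.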
