# Text Analytics on Reviews of California Cabernet Sauvignon

## Introduction

Cabernet Sauvignon is the world’s foremost red wine-grape variety. It is most commonly associated with the red wines of Bordeaux, though it is widely cultivated throughout the world.

In the vineyard, the Cabernet Sauvignon grape can be distinguished by its small, thick-skinned and decidedly blue-colored berries with a high pip-to-pulp ratio. Its thick skin results in wines of profoundly deep color, and the pips give the wine a high level of tannin.

The grape ripens late, which is advantageous in warmer climates like Bordeaux and California, and decidedly disadvantageous in cooler climates. In colder growing environments, the Cabernet Sauvignon can easily fail to ripen properly. Unripe Cabernet Sauvignon can show a lot of the undesired aromas of unripe Cabernet Franc, notably a green or herbaceous character. This may not be entirely surprising, as DNA profiling has shown Cabernet Sauvignon’s parents are Cabernet Franc and Sauvignon Blanc.

The flavor profile of Cabernet Sauvignon can vary from one region or subregion to another. The expression of Cabernet Sauvignon produced in Margaux varies considerably from that further north in Pauillac, for example. The best Cabernet Sauvignon wines tend to have deep color, good structure and a full body. They are tannic in youth, especially when matured in oak, and often require a few years to soften before they become enjoyable to drink. Typical flavor descriptors used may include black fruits like blackcurrant or blackberry, as well as fragrant cigar box, tobacco, and coffee.

**Bordeaux** <br>
Cabernet Sauvignon is at home in Bordeaux, where it is the key red variety in the left bank regions of the Médoc and Pessac-Léognan. Here, it is the main component in a blend with Merlot, as well as Cabernet Franc and Petit Verdot. The top wines from Médoc Appellation d’Origine Protégée (AOP) regions like Margaux, Saint-Julien, Pauillac and Saint-Estèphe are arguably the greatest expressions of the grape. These are deep-colored reds with very high tannin and the capacity to age for decades.

**California** <br>
If one other region could be said to compete with Bordeaux on Cabernet Sauvignon, it’s surely California, where Cabernet Sauvignon has become ubiquitous. Cabernet Sauvignon has found a home away from home here, and the region is famous for its Bordeaux-style red blends and Cult Cabernets, whose prices can often meet and exceed the first growths of Bordeaux. Other North American regions producing quality Cabernet Sauvignon include Washington State and British Columbia in Canada.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#about_dataset">About the dataset</a></li>
        <li><a href="#preprocessing">Text reviews pre-processing [POS, Stop words removal and Stemming]</a></li>
        <li><a href="#modeling">Build Term/Doc matrix</a></li>
        <li><a href="#modeling">Use TF-IDF weighting for term/doc matrix</a></li>
        <li><a href="#modeling">Latent Dirichlet Allocation with TF-IDF to form 9 topic clusters</a></li>
        <li><a href="#evaluation">Store Predicted Topic Assignment in List Topics</a></li>
        <li><a href="#evaluation">Merge Topics List into DataFrame</a></li>
        <li><a href="#evaluation">Initialize Containers</a></li>
        <li><a href="#evaluation">Calculate and Display Average Points and Price by Region</a></li>
        <li><a href="#evaluation">Adding topic cluster to the original dataframe</a></li>
        <li><a href="#evaluation">Region-wise contribution to each topic cluster</a></li>
    </ol>
</div>
<br>
<hr>

## Dataset
The dataset contains over 13K reviews of California Cabernet Sauvignon. The review text is in the column labeled ‘description’.

The full data dictionary is:
 - Review: A number unique for each review (an ID)
 - description: The actual review (text)
 - year: Year the wine was bottled. This is missing for some wines.
 - points: The points assigned by the reviewer to the wine. These range from 80 to 100. Better reviews have higher points.
 - price: The retail price for a bottle of the wine ($0-$3000).
 - winery: The winery where the wine was bottled. (a text label)
 - Region: Region of California (text) where wine was produced.

There are no outliers in these data, but many of the years are missing.
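
The data dictionary above can be turned into a few quick sanity checks. The sketch below is only illustrative: the three sample rows and the helper name check_reviews are made up for the example, and the column names are assumed to match the dictionary.

```python
import pandas as pd

# Hypothetical sample rows that follow the data dictionary (not real data)
sample = pd.DataFrame({
    "Review": [1, 2, 3],
    "description": ["Juicy cherry.", "Firm tannins.", "Soft and ripe."],
    "year": [2012, None, 2014],
    "points": [96, 88, 79],
    "price": [235.0, None, 30.0],
})

def check_reviews(frame):
    """Return simple data-quality counts for a reviews frame."""
    return {
        "missing_year": int(frame["year"].isna().sum()),
        "missing_price": int(frame["price"].isna().sum()),
        # points are documented as ranging from 80 to 100 (inclusive)
        "points_out_of_range": int((~frame["points"].between(80, 100)).sum()),
        "duplicate_ids": int(frame["Review"].duplicated().sum()),
    }

report = check_reviews(sample)
print(report)
```

Running the same helper on the full DataFrame would show how many years and prices are actually missing before any averages are computed.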

In [19]:
# In case the required NLTK resources are not installed, uncomment and run:
# import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')
# nltk.download('wordnet')

### 1. Import packages

In [16]:
import pandas as pd
import string
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
import warnings
warnings.filterwarnings("ignore")

In [2]:
def my_analyzer(s):
    # Map contractions and run-together tokens to their expanded forms
    syns = {"n't":'not', 'to30':'to 30','wont':'will not', 'cant':'can not',\
            'cannot':'can not','couldnt':'could not', 'shouldnt':'should not',\
            'wouldnt':'would not'}
    
   
    s = s.lower()
    s = s.replace(',', '. ')
     
    tokens = word_tokenize(s)
    tokens = [word.replace(',','') for word in tokens ]
    tokens = [word for word in tokens if ('*' not in word) and \
              ("''" != word) and ("``" != word) and \
              (word!='description') and (word !='dtype') \
              and (word != 'object') and (word!="'s")]
    
  
    for i in range(len(tokens)):
        if tokens[i] in syns:
            tokens[i] = syns[tokens[i]]
            
    # Removing stop words
    punctuation = list(string.punctuation)+['..', '...']
    pronouns = ['i', 'he', 'she', 'it', 'him', 'they', 'we', 'us', 'them']
    stop = stopwords.words('english') + punctuation + pronouns
    filtered_terms = [word for word in tokens if (word not in stop) and \
                  (len(word)>1) and (not word.replace('.','',1).isnumeric()) \
                  and (not word.replace("'",'',2).isnumeric())]
    
    
    tagged_words = pos_tag(filtered_terms, lang='eng')
    
    stemmer = SnowballStemmer("english")
    wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
    wnl = WordNetLemmatizer()
    stemmed_tokens = []
    for tagged_token in tagged_words:
        term = tagged_token[0]
        pos  = tagged_token[1]
        pos  = pos[0]
        try:
            pos   = wn_tags[pos]
            stemmed_tokens.append(wnl.lemmatize(term, pos=pos))
        except KeyError:
            # No WordNet tag for this part of speech: fall back to stemming
            stemmed_tokens.append(stemmer.stem(term))
    return stemmed_tokens

def my_preprocessor(s):
    s = s.lower()
    s = s.replace(',', '. ')
    print("preprocessor")
    return(s)
    
def my_tokenizer(s):
  
    print("Tokenizer")
    tokens = word_tokenize(s)
    tokens = [word.replace(',','') for word in tokens ]
    tokens = [word for word in tokens if '*' not in word and \
              word != "''" and word !="``" and word!='description' \
              and word !='dtype']
    return tokens
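
The filtering step inside my_analyzer is easier to follow on a handful of tokens. The sketch below mirrors only its filter conditions in plain Python; the tiny stop list and the helper name keep_token are stand-ins for this example (the notebook itself uses NLTK's full English stop list plus pronouns).

```python
import string

# Small stand-in stop list; the notebook uses NLTK's full English list
STOP = {"the", "and", "is", "of", "it", "this", "a", "with"}
PUNCT = set(string.punctuation) | {"..", "..."}

def keep_token(word):
    """Mirror the filter conditions used in my_analyzer."""
    if word in STOP or word in PUNCT:
        return False
    if len(word) <= 1:
        return False
    # Drop plain numbers and decimals such as '2012' or '14.5'
    if word.replace(".", "", 1).isnumeric():
        return False
    # Drop numbers written with apostrophes such as "'09"
    if word.replace("'", "", 2).isnumeric():
        return False
    return True

tokens = ["this", "2012", "cabernet", "is", "14.5", "%",
          "alcohol", "'09", "x", "ripe"]
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['cabernet', 'alcohol', 'ripe']
```

Only the descriptive words survive, which is what the POS tagging and lemmatization steps then operate on.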

### 2. Read reviews
The following code reads the Excel file into the DataFrame df and places the review text into discussions.

In [3]:
# Increase the pandas display width so long review text is not truncated
pd.set_option('display.max_colwidth', 32000)
# Read the California Cabernet Sauvignon reviews
file_path = 'C:/Users/vasu.kumar/Desktop/Wine_Desc_Data/'
df = pd.read_excel(file_path + "CaliforniaCabernet.xlsx")
# Setup simple constants
n_docs = len(df['description'])
n_samples = n_docs
m_features = None
s_words = 'english'
# Keep the review text in the Series 'discussions'
discussions = df['description']

### 3. Create Term/Doc Matrix using Custom Analyzer
CountVectorizer builds the term/document matrix, but instead of its built-in analyzer it calls the custom text analysis function my_analyzer defined above, which follows the analyzer in the TextAnalytics class of AdvancedAnalytics.

In [4]:
# Create Word Frequency by Review Matrix using Custom Analyzer
cv = CountVectorizer(max_df=0.95, min_df=2, max_features=m_features,\
analyzer=my_analyzer)
tf = cv.fit_transform(discussions)
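
The max_df=0.95 and min_df=2 settings prune the vocabulary by document frequency: a float is read as a proportion of documents and an integer as an absolute count. A minimal pure-Python sketch of that pruning rule follows; the toy documents and the helper name prune_vocabulary are assumptions for illustration, not part of scikit-learn.

```python
from collections import Counter

# Toy corpus, already tokenized
docs = [
    ["wine", "cherry", "oak"],
    ["wine", "tannin", "oak"],
    ["wine", "cherry", "plum"],
    ["wine", "graphite"],
]

def prune_vocabulary(tokenized_docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_count = max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)

vocab = prune_vocabulary(docs)
print(vocab)  # ['cherry', 'oak']
```

Here "wine" is dropped because it appears in every document, and the terms seen only once are dropped as too rare, which is the same effect the two thresholds have on the review corpus.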

### 4. Prepare TFIDF Term Weighting

In [5]:
# Construct the TF/IDF matrix from term-doc matrix created by CountVectorizer
tf_idf = TfidfTransformer(norm=None, use_idf=True)
print("\nTF-IDF Parameters\n", tf_idf.get_params(),"\n")
tf = tf_idf.fit_transform(tf)


TF-IDF Parameters
 {'norm': None, 'smooth_idf': True, 'sublinear_tf': False, 'use_idf': True} 
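
With the parameters printed above (smooth_idf=True, sublinear_tf=False, norm=None), scikit-learn documents the weighting as tf × idf with idf(t) = ln((1 + n) / (1 + df(t))) + 1 and no row normalization. The sketch below reproduces that formula with NumPy on a made-up 3 × 3 count matrix, purely to make the arithmetic visible.

```python
import numpy as np

# Toy term-count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [3, 0, 1],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0   # smoothed IDF
tfidf = counts * idf                                # norm=None: rows are not rescaled

print(np.round(idf, 4))
print(np.round(tfidf, 4))
```

A term that occurs in every document gets an IDF of exactly 1, so its count passes through unchanged, while rarer terms are scaled up.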



### 5. Latent Dirichlet Allocation with TF-IDF
The topic model is fit to the Term-Frequency/Inverse Document Frequency (TF-IDF) matrix built in the previous step.

In [6]:
n_topics        = 9
max_iter        =  5
learning_offset = 20.
learning_method = 'online'
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=max_iter,\
                                learning_method=learning_method, \
                                learning_offset=learning_offset, \
                                random_state=12345)
U = lda.fit_transform(tf)
print('{:.<22s}{:>6d}'.format("Number of Reviews", tf.shape[0]))
print('{:.<22s}{:>6d}'.format("Number of Terms",     tf.shape[1]))
print("\nTopics Identified using LDA with TF_IDF")
tf_features = cv.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
max_words = 15
topic_description=[]
for topic_idx, topic in enumerate(lda.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([tf_features[i]
                             for i in topic.argsort()[:-max_words - 1:-1]])
        topic_description.append(message[10:])
        print(message)
        print()

Number of Reviews..... 13135
Number of Terms.......  6263

Topics Identified using LDA with TF_IDF
Topic #0: blend merlot verdot petit wrap cabernet franc sauvignon malbec small wine soften find black tannin

Topic #1: valley get year young astringent acid many wine begin age still tannin tannic ageability blackberry

Topic #2: palate aroma finish body wine full black nose fruit dark leather plum red juicy texture

Topic #3: grape char palate nose black wine wood graphite paso open vanilla vineyard blueberry build tobacco

Topic #4: flavor cherry blackberry sweet little drink soft oak like dry good ripe jammy lot green

Topic #5: nice feel cabernet softly wine dry tannin show flavor cedar refine currant black blackberry gentle

Topic #6: alcohol high flavor drinkable want mellow blackberry real hot dry oak cabernet new wine despit

Topic #7: hard style power appeal new blackberry oak pretty wine modern make tannin year big cocoa

Topic #8: year fine long develop currant next balance ta
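
The topic listings above come from sorting each row of lda.components_ and reading off the highest-weighted terms. A small NumPy sketch of that slicing idiom, using a hypothetical five-word vocabulary and made-up weights:

```python
import numpy as np

# Hypothetical vocabulary and topic-term weights (2 topics x 5 terms)
features = np.array(["blend", "cherry", "merlot", "oak", "tannin"])
components = np.array([
    [9.0, 0.5, 7.0, 1.0, 3.0],
    [0.2, 8.0, 0.1, 6.0, 4.0],
])

def top_terms(weights, names, n_top):
    """Return the n_top terms with the largest weights, largest first."""
    # argsort is ascending, so walk backwards from the end of the index array
    order = weights.argsort()[:-n_top - 1:-1]
    return [str(names[i]) for i in order]

topic_words = [top_terms(row, features, 3) for row in components]
for k, words in enumerate(topic_words):
    print("Topic #%d: %s" % (k, " ".join(words)))
```

The slice `[:-n_top - 1:-1]` is what the notebook uses with max_words = 15; it takes the last n_top indices of the ascending sort in reverse order.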

### 6. Store Predicted Topic Assignment in List Topics

In [7]:
# Store the highest-weighted topic for each doc in topics[]
n_reviews = tf.shape[0]
topics = U.argmax(axis=1).tolist()

### 7. Merge Topics List into DataFrame

In [8]:
df_top = pd.DataFrame(topics, columns=["p_topic"])
df = df.join(df_top)
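
Steps 6 and 7 together pick the highest-weighted topic per review and attach it to the DataFrame by row position. The sketch below shows the same idea on toy values; U_toy and reviews are made-up stand-ins, and it assumes the DataFrame keeps its default 0..n-1 index, since join aligns on the index.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights for 4 reviews and 3 topics
U_toy = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],   # tie: argmax keeps the first maximum
])
reviews = pd.DataFrame({"points": [96, 95, 90, 91]})

# Reusing the DataFrame's index guarantees row-by-row alignment in join
p_topic = pd.Series(U_toy.argmax(axis=1), index=reviews.index, name="p_topic")
merged = reviews.join(p_topic)
print(merged)
```

If the reviews had been filtered or re-sorted before this step, building the Series on the DataFrame's own index (as above) avoids silently pairing topics with the wrong rows.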

### 8. Initialize Containers
In order to print the average price and average points for each topic and region, the containers for
these statistics must first be created and set to zero.

The lists avg_points and avg_price will contain the average of these attributes for each topic.
t_counts will contain the number of documents associated with each topic.

The dictionary region will contain four statistics for each of the 18 regions:
1. Total Points
2. Number of reviews with points
3. Total Price
4. Number of reviews with price

This dictionary will be used to display the average points and average price by region.

In [9]:
attribute_map = {
    'Review':[3,(1, 14000),[0,0]],
    'description':[3,(''),[0,0]],
    'year':[3,(1900,2020),[0,0]],
    'points':[0,(1, 100),[0,0]],
    'price':[0,(1, 3000),[0,0]],
    'winery':[3,(''),[0,0]],
    'Region':[2,('South Coast', 'Sonoma', 'Sierra Foothills', \
    'Redwood Valley', 'Red Hills Lake County', \
    'North Coast', 'Napa-Sonoma', 'Napa', \
    'Mendocino/Lake Counties', 'Mendocino Ridge', \
    'Mendocino County', 'Mendocino', 'Lake County', \
    'High Valley', 'Clear Lake', 'Central Valley', \
    'Central Coast', 'California Other'),[0,0]]
}
avg_points = [0] * n_topics
avg_price = [0] * n_topics
t_counts = [0] * n_topics
# region is a dictionary of lists by region
# Each list has 4 values: sum_points, npoints, sum_price, nprice
region = {}
for r in attribute_map['Region'][1]:
    region[r] = [0, 0, 0, 0]

### 9. Calculate and Display Average Points and Price by Region

In [10]:
for i in range(n_reviews):
    j = int(df['p_topic'].iloc[i])
    t_counts[j] += 1
    avg_points[j] += df['points'].iloc[i]
    region[df['Region'].iloc[i]][0] += df['points'].iloc[i]
    region[df['Region'].iloc[i]][1] += 1
    if pd.isnull(df['price'].iloc[i])==True:
        continue
    avg_price [j] += df['price' ].iloc[i]
    region[df['Region'].iloc[i]][2] += df['price'].iloc[i]
    region[df['Region'].iloc[i]][3] += 1

In [11]:
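# Print Avg Points and Price by Topic
# Topics are printed 1-based here: TOPIC 1 corresponds to LDA Topic #0
print('{:<6s}{:>7s}{:>8s}{:>8s}'.format("TOPIC", "N", "POINTS", "PRICE"))
for i in range(n_topics):
    if t_counts[i]>0:
        avg_points[i] = avg_points[i]/t_counts[i]
        # The price sum is divided by all reviews in the topic, including
        # those with a missing price, so this slightly understates the mean
        # over priced reviews (compare the pivot table in section 10)
        avg_price [i] = avg_price [i]/t_counts[i]
    print('{:>3d}{:>10d}{:>8.2f}{:>8.2f}'.format((i+1), t_counts[i], avg_points[i], avg_price[i]))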
# Print Avg Points and Price by Region
print("")
print('{:<24s}{:>5s}{:>9s}{:>8s}'.format("REGION", "N", "POINTS", "PRICE"))
for r in attribute_map['Region'][1]:
    # Guard against regions with no reviews or no priced reviews
    if region[r][1] > 0:
        region[r][0] = region[r][0]/region[r][1] # Avg points
    if region[r][3] > 0:
        region[r][2] = region[r][2]/region[r][3] # Avg price
    print('{:<24s}{:>6d}{:>8.2f}{:>8.2f}'.format(r, region[r][1], region[r][0], region[r][2]))

TOPIC       N  POINTS   PRICE
  1      1074   89.18   59.12
  2      1110   89.38   65.08
  3      1824   89.07   57.83
  4       918   89.62   60.48
  5      2821   86.09   34.93
  6      1192   89.80   57.34
  7      1075   86.55   51.57
  8      1071   89.99   68.47
  9      2050   90.94   64.20

REGION                      N   POINTS   PRICE
South Coast                 52   87.04   61.37
Sonoma                    2277   88.09   41.81
Sierra Foothills           126   87.20   28.77
Redwood Valley               3   87.67   23.00
Red Hills Lake County       37   88.78   35.30
North Coast                183   86.07   21.11
Napa-Sonoma                 84   90.08   60.30
Napa                      7348   89.97   72.23
Mendocino/Lake Counties    196   86.22   27.63
Mendocino Ridge              3   86.00   40.00
Mendocino County            29   87.62   22.97
Mendocino                   30   87.27   24.80
Lake County                 34   87.74   30.50
High Valley                  3   88.67   
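
The same by-region averages can be cross-checked with a pandas groupby, which skips missing prices when taking the mean. This is a sketch on made-up rows, not a replacement for the loop above; the frame toy and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical reviews with some missing prices
toy = pd.DataFrame({
    "Region": ["Napa", "Napa", "Sonoma", "Sonoma", "Sonoma"],
    "points": [96, 90, 88, 86, 90],
    "price": [235.0, None, 40.0, 50.0, None],
})

summary = toy.groupby("Region").agg(
    N=("points", "size"),            # all reviews in the region
    avg_points=("points", "mean"),
    avg_price=("price", "mean"),     # mean() ignores missing prices
    n_price=("price", "count"),      # reviews that actually have a price
)
print(summary.round(2))
```

Keeping N and n_price side by side makes it clear which denominator each average uses.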

### 10. Adding topic cluster to the original dataframe

In [12]:
for i in range(len(topic_description)):
    topic_description[i]=topic_description[i].split(' ')

In [13]:
temp=lda.transform(tf)
temp1=[]
for i in range(len(temp)):
    temp1.append(temp[i].argmax())
temp1=pd.DataFrame(temp1,columns=['Topic#'])
df=df.join(temp1)

In [14]:
df.head()

Unnamed: 0,Review,description,year,points,price,winery,Region,p_topic,Topic#
0,1,"This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.",,96,235.0,Heitz,Napa,8,8
1,17,"This blockbuster, powerhouse of a wine suggests blueberry pie and chocolate as it opens in the glass. On the palate, it's smooth and seductively silky, offering complex cedar, peppercorn and peppery oak seasonings amidst its dense richness. It finishes with finesse and spice.",,95,325.0,Hall,Napa,4,4
2,48,"Blended with 9% Malbec, 9% Cabernet Franc and 5% Petit Verdot, this is a perennial classic for the winery, the sister brand of Cuvaison. Juicy in cherry and cassis, it sustains big, pillowy tannins and tar, suggesting more time for the fruit to match up with the structure. Drink through 2020.",,90,60.0,Brandlin,Napa,5,5
3,68,"From the producer's monumental Atlas Peak vineyard, this is a tightly wound, solidly constructed mountain Cab, blended with a handful of Petit Verdot. Tobacco, black tea and a sliver of coconut intermingle around a medium-bodied whole that will benefit from cellaring, through 2021.",,91,85.0,Michael Mondavi Family Estate,Napa,0,0
4,70,"A juiciness of cherry and vanilla spark the opening of this wine, a celebration of the vintage, appellation and in this case, fruit-forwardness of the variety. With a backbone of oak and cedar, it has smooth tannins and medium weight, finishing in mocha chocolate. Drink now through 2022.",,91,60.0,Provenance Vineyards,Napa,3,3


In [17]:
table1=df.pivot_table(['points','price'],index='Topic#')
table1=table1.join(pd.DataFrame(topic_description))
table1=table1.rename(columns={'points':'avg_points','price':'avg_price'})
table1.T

Topic#,0,1,2,3,4,5,6,7,8
avg_points,89.1844,89.3757,89.0718,89.6176,86.0893,89.802,86.5535,89.9935,90.938
avg_price,59.2838,65.1434,58.1179,61.6226,35.0021,57.487,51.6192,68.7226,64.3919
0,blend,valley,palate,grape,flavor,nice,alcohol,hard,year
1,merlot,get,aroma,char,cherry,feel,high,style,fine
2,verdot,year,finish,palate,blackberry,cabernet,flavor,power,long
3,petit,young,body,nose,sweet,softly,drinkable,appeal,develop
4,wrap,astringent,wine,black,little,wine,want,new,currant
5,cabernet,acid,full,wine,drink,dry,mellow,blackberry,next
6,franc,many,black,wood,soft,tannin,blackberry,oak,balance
7,sauvignon,wine,nose,graphite,oak,show,real,pretty,tannin


### 11. Region-wise contribution to each topic cluster

In [18]:
table2=df.pivot_table('Review',index='Region',columns='Topic#',\
                      aggfunc='count',\
                      fill_value=0,margins=True)
def percent_convert(x):
    # Express each cell as a percentage of its row total (the 'All' column)
    return x.div(x['All'], axis=0).mul(100).round(2)
table2 = percent_convert(table2)
print(table2)

Topic#                       0      1      2      3       4      5      6  \
Region                                                                      
California Other          4.95   4.69  14.59   4.15   50.47   4.95  10.58   
Central Coast             3.65   5.17  19.51  12.25   30.35   6.18   9.72   
Central Valley            9.36   3.94  26.60   7.88   28.57   3.94  11.82   
Clear Lake                0.00   0.00   0.00   0.00  100.00   0.00   0.00   
High Valley               0.00   0.00  66.67  33.33    0.00   0.00   0.00   
Lake County               5.88   2.94  23.53   5.88   41.18  11.76   0.00   
Mendocino                 3.33   3.33  30.00  16.67   30.00   3.33  10.00   
Mendocino County          0.00   3.45  51.72   6.90   24.14   6.90   3.45   
Mendocino Ridge           0.00   0.00   0.00  33.33    0.00   0.00  66.67   
Mendocino/Lake Counties   9.18   9.18  14.29   4.08   31.12   8.67  12.24   
Napa                      9.46   9.25  12.25   6.56   14.43   9.38   7.62   
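
Row percentages like those above can also be obtained directly with pd.crosstab and normalize="index". The following sketch uses a made-up six-row frame to show the idea; it is an alternative route under that assumption, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical region/topic assignments
toy = pd.DataFrame({
    "Region": ["Napa", "Napa", "Napa", "Napa", "Sonoma", "Sonoma"],
    "Topic#": [0, 0, 4, 8, 4, 4],
})

# Row percentages: share of each region's reviews assigned to each topic
pct = (pd.crosstab(toy["Region"], toy["Topic#"], normalize="index") * 100).round(2)
print(pct)
```

Each row sums to 100, so regions with very different review counts can be compared on the same scale; regions with only a few reviews will still show extreme percentages.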