# Using Python and Pandas in Pursuit of a Good Scotch

In [None]:
from IPython.display import Image
Image("../input/images/Header.jpg")

## Introduction 

This notebook was created as an assignment for the University of California San Diego course <i>"Basic Data Processing and Visualization"</i>, part of the <i>"Python Data Products for Predictive Analytics Specialization"</i>.  The assignment brief was: <i>For this project, you will load a real-world dataset of consumer activities (e.g. product reviews) from the web, compute basic statistics about the data, and perform some visualizations of the data.</i>

For my assignment, I decided to analyze reviews of Scotch whiskies, using the dataset <a href='https://www.kaggle.com/koki25ando/22000-scotch-whisky-reviews'>2,2k+ Scotch Whisky Reviews</a> on Kaggle.  I noticed that the dataset was over 3 years old, but the data-scraping script was still available, so I ran it to produce an up-to-date listing of reviews as my source.

## Preliminaries 

Import the standard libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import re 

## Load and Process Data

In [None]:
df = pd.read_csv('../input/scotch-whiskey-reviews-update-2020/scotch_review2020.csv', index_col='id')
display(df.head())

The column title "review.point", with its period, is going to cause heartache; let us just make it "points".  Also clean up the name of the description column.

In [None]:
df.rename(columns={"review.point": "points","description.1.2247.":"description"},inplace=True)
df.head()
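
One thing worth knowing about `rename`: by default it silently ignores keys that do not match an existing column, so a one-character typo leaves the frame unchanged. The sketch below uses a made-up two-row frame (the values are illustrative, not from the dataset) to show how passing `errors="raise"` turns such a typo into a `KeyError`.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({"review.point": [92, 88], "price": ["$50", "$75"]})

# A misspelled key is ignored by default, so nothing is renamed.
unchanged = toy.rename(columns={"review.points": "points"})
print(list(unchanged.columns))

# With errors="raise", the same typo fails loudly instead.
try:
    toy.rename(columns={"review.points": "points"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

# The correctly spelled key renames as expected.
renamed = toy.rename(columns={"review.point": "points"}, errors="raise")
print(list(renamed.columns), typo_caught)
```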

Let us look at the shape of our data

In [None]:
dimensions = df.shape
print(f"The dataframe has {dimensions[1]} columns and {dimensions[0]} rows")

Look for missing data

In [None]:
df.info()
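
A more direct way to count gaps is `isna().sum()`, which reports the number of missing values per column. This is only a sketch on a made-up frame with deliberately planted gaps; on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap planted in each column (illustrative values).
toy = pd.DataFrame({
    "name": ["Glen Example 12 year old", "Loch Sample", None],
    "points": [90, np.nan, 85],
})

# Count of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```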

Very good, no missing data.  Let's look at the categoricals, starting with currency

In [None]:
df.currency.unique()

OK, it seems that everything is in dollars, so that column is not providing us much. Let us drop it and strip the dollar signs and thousands separators from "price", so that the column holds plain USD amounts.

In [None]:
df.drop(['currency'], axis=1,inplace=True)
df['price'] = df['price'].replace(r'[\$,]', '', regex=True)  # raw string avoids an invalid-escape warning

Let us look at the price column for nonnumeric values

In [None]:
symbol_idx = pd.to_numeric(df['price'], errors = 'coerce').isnull() # errors = 'coerce' results in NaNs for non-numeric values
df[symbol_idx][['name','price']].head(50)
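
To see why this works, here is a minimal sketch on invented price strings: `pd.to_numeric` with `errors='coerce'` turns anything it cannot parse into `NaN`, so `isna()` on the result is a mask of the non-numeric entries. The sample strings are assumptions for illustration.

```python
import pandas as pd

# Invented price strings; two of them carry a non-numeric suffix.
prices = pd.Series(["150", "60000/set", "44/liter", "89"])

# Unparseable entries become NaN instead of raising an error.
as_numbers = pd.to_numeric(prices, errors="coerce")

# The NaN mask selects exactly the entries that need manual attention.
non_numeric = prices[as_numbers.isna()]
print(non_numeric.tolist())
```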

Looking back at the original data source, the default bottle size is 750ml; these entries represent larger or smaller containers.  So that we can compare fairly, let's normalize these prices to 750ml bottles. Given that there are only a handful and no consistent pattern, we will just brute-force the adjustment

In [None]:
# .loc (not .at) is needed to assign to several row labels at once
df.loc[[34, 187, 740, 1549, 1815], 'price'] = 15000   # instances with '60,000/set' which equals 15000 dollars per bottle

In [None]:
# Manual per-bottle adjustments to a 750ml-equivalent price
df.loc[93, 'price'] = 300
df.loc[[95, 360], 'price'] = 100  # two rows need the same fix: a clue to check for duplicates
df.loc[779, 'price'] = 200
df.loc[1011, 'price'] = 44 * .75
df.loc[1281, 'price'] = 132 * 1.07
df.loc[1826, 'price'] = 39 * .4285
df.loc[2028, 'price'] = 35 * .75
df.loc[2201, 'price'] = 18 * .4285
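
The multipliers above are consistent with a simple linear scaling, price × 750 / bottle size. The sketch below spells that out with a hypothetical helper, `price_per_750ml`. The bottle sizes passed in (1 liter, 700ml, 1.75 liters, and a set of four 750ml bottles) are my assumptions, inferred from the factors 0.75, 1.07, 0.4285 and from the 60,000-to-15,000 adjustment; they are not read from the dataset.

```python
def price_per_750ml(price_usd, bottle_ml):
    """Scale a price linearly to its 750 ml equivalent (hypothetical helper)."""
    return price_usd * 750 / bottle_ml

# Assumed sizes, inferred from the multipliers used in the notebook.
print(price_per_750ml(44, 1000))             # 1 liter bottle   -> factor 0.75
print(round(price_per_750ml(132, 700), 2))   # 700 ml bottle    -> factor ~1.07
print(round(price_per_750ml(39, 1750), 2))   # 1.75 liter bottle -> factor ~0.4285
print(price_per_750ml(60000, 4 * 750))       # set of four 750 ml bottles
```

Linear scaling ignores any volume discount or premium on unusual sizes, so it is a rough normalization rather than an exact one.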


In [None]:
# Parse the remaining strings as numbers first, then truncate to whole dollars
df['price'] = pd.to_numeric(df['price']).astype(int)

Let us look for duplicate entries

In [None]:
dups=df[df.duplicated(subset='name', keep=False)]
dups.head()

Let's be generous. We will sort by points and then drop the duplicates, keeping the highest rating wherever the scores differ

In [None]:
df.sort_values('points', ascending=False,inplace=True)
df.drop_duplicates(subset='name', keep='first',inplace=True)
df.head()
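
The sort-then-drop idiom can be checked on a tiny example. In this sketch (names and scores are made up), one whisky appears twice with different scores; sorting descending by points and keeping the first occurrence retains its higher score.

```python
import pandas as pd

# Made-up reviews: "Glen Example 12" was reviewed twice.
toy = pd.DataFrame({
    "name": ["Glen Example 12", "Glen Example 12", "Loch Sample 18"],
    "points": [88, 91, 90],
})

# After a descending sort, the first row per name is its best score.
best = (toy.sort_values("points", ascending=False)
           .drop_duplicates(subset="name", keep="first"))
print(best)
```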


Confirm we got all the duplicates

In [None]:
dups=df[df.duplicated(subset='name', keep=False)]
dups.head()

From the name column we can extract some additional data, such as age and alcohol percentage

In [None]:
# Thanks to Orfanakis Konstantinos' Notebook for this bit of code

df['age'] = df['name'].str.extract(r'(\d+) year')[0].astype(float) # extract age and convert to float

df['name'] = df['name'].str.replace(' ABV ', '')
df['alcohol%'] = df['name'].str.extract(r"([\(\,\,\'\"\’\”\$] ? ?\d+(\.\d+)?%)")[0]
df['alcohol%'] = df['alcohol%'].str.replace(r"[^\d\.]", "", regex=True).astype(float) # keep only numerics and convert to float (regex=True is required in pandas 2.x)

df[['name', 'age', 'alcohol%']].sample(10, random_state = 42)
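
To make the extraction step easier to follow, here is a stripped-down sketch on two invented names. It uses the same `(\d+) year` pattern for age, but a simplified ABV pattern of my own, `(\d+(?:\.\d+)?)%`, rather than the fuller one above, so it will not handle every format in the real data.

```python
import pandas as pd

# Invented names in the general style of the dataset.
names = pd.Series([
    "Glen Example 12 year old, 43%",
    "Loch Sample Cask Strength, 57.1%",
])

# Age: digits directly before " year"; names without an age give NaN.
age = names.str.extract(r"(\d+) year")[0].astype(float)

# ABV (simplified pattern): a number immediately followed by "%".
abv = names.str.extract(r"(\d+(?:\.\d+)?)%")[0].astype(float)

print(pd.DataFrame({"name": names, "age": age, "alcohol%": abv}))
```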

Finally, let's create a metric of how much we are paying per review point

In [None]:
df['price_per_point'] = df['price']/df['points']
df.head()

OK, we have wrangled the data to the point where we can ask some questions

## Questions

<ol>
<li> How do reviews differ between the categories of Scotch?  Are Single Malts superior to Blends?</li>
<li> What is the relation of the Scotch's age to its final review rating?</li>
<li> Which Scotch is the best value, where value is defined as the highest review at the lowest price per point?</li>
</ol>

## Analysis

Let us take a high-level view of our data

In [None]:
df.describe()

Wow, look at the range of values and the standard deviation on price. We will need to factor that in when we do plotting.  A graph ranging from \\$7.00 to \\$1,570,000 is not going to be pretty

Let us look at the Distribution of Review Scores

In [None]:
sns.set_style('whitegrid')
# No separate plt.figure() call: passing figsize to .plot() already creates the figure
ax = df['points'].plot(kind='hist', xticks=[65,70,75,80,85,90,95,100], figsize=(13, 5), title="Frequency of Review Grades")
ax.set_xlabel("Review Points")
ax.set_ylabel("Number of Whiskies Receiving Grade")

Either we have some very generous reviewers, or we are to believe there is no bad Scotch whisky.  According to Whiskey Review, the review score ranges are as follows:
<ul>
<li>95-100 points—Classic: a great whisky</li>
<li>90-94 points—Outstanding: a whisky of superior character and style</li>
<li>85-89 points—Very good: a whisky with special qualities</li>
<li>80-84 points—Good: a solid, well-made whisky</li>
<li>75-79—Mediocre: a drinkable whisky that may have minor flaws</li>
<li>50-74—Not recommended</li>
</ul>

Per the above, we have no mediocre whiskies, which is hard to believe. Let's look at a boxplot of review scores by category of whisky.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10, 5))
ax = sns.boxplot(x="category", y="points", data=df)
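
The box edges and center line in a boxplot are the quartiles and the median, so the same information can be read off numerically with a grouped `quantile`. The sketch below uses a made-up frame; the category labels and scores are placeholders, not values from the dataset.

```python
import pandas as pd

# Placeholder categories and scores, for illustration only.
toy = pd.DataFrame({
    "category": ["Single Malt"] * 4 + ["Blend"] * 4,
    "points": [84, 87, 89, 93, 85, 88, 90, 92],
})

# One row per category; columns are the 25th, 50th and 75th percentiles.
quartiles = (toy.groupby("category")["points"]
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)
```

On the real data, the equivalent call would group `df` by its `category` column.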
 

Hmmm, for all three classes of whisky, 75% of the whiskies fall in the "Very good: a whisky with special qualities" band (85-89 points).  If that is the case, then logically those "special qualities" are not that special. I realize that Raymond Chandler once said  <i>"There is no bad whiskey. There are only some whiskeys that aren’t as good as others"</i>, but this seems a bit much.  We will continue our analysis, but concern has to be raised as to the objectivity of the reviews.

That said, we really don't see a marked review superiority between the different classes of whisky; the median for all three is a score of about 89, with Single Malt scoring slightly lower.

We next look at price.  From the summary metrics, we know that prices vary wildly, and that is going to distort graphs.  Again, we use a boxplot to look at the distribution of the data, this time with outliers turned off.

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(10, 5))
ax = sns.boxplot(x='category', y='price', data=df,showfliers = False)
ax.set_ylabel('Price (USD)')

From this, we see that a large majority of our whiskies are under $600, so let us cull our data set down to whiskies under that value.  

In [None]:
dr=df[df['price'] <600].copy()
dr.reset_index(drop=True,inplace=True)
dr.shape
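
Before cutting, it can be useful to quantify how much of the data a price cap keeps. A boolean mask makes this easy: its sum is the number of rows kept and its mean is the fraction kept. The prices below are invented for the sketch.

```python
import pandas as pd

# Invented prices, including one extreme outlier.
toy = pd.DataFrame({"price": [45, 80, 120, 599, 600, 15000]})

# True where the bottle is strictly under the cap, as in the filter above.
under_cap = toy["price"] < 600

print(f"kept {under_cap.sum()} of {len(toy)} rows ({under_cap.mean():.0%})")
```

Note that the comparison is strict, so a bottle priced at exactly 600 is dropped.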

We lost some data, but this may be a far more practical exercise (anyone who can spend over \\$600 for a bottle of Scotch is likely not concerned with a price analysis)

We examine the correlation between the price and the age of the whisky, per category

In [None]:
g = sns.lmplot(x="age", y="price",  col="category",hue="category",
               data=dr, height=6)

We see there is a correlation, and it is most pronounced for single malts.

We look at the relation of the age of the whisky to review score

In [None]:
g = sns.lmplot(x="age", y="points",  col="category",hue="category",
               data=dr, height=6)

Again there is a correlation, and again it is most definite for the single malts.
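
The regression plots give a visual impression; a Pearson coefficient per category puts a number on it. This is a sketch on a made-up frame (the category labels and all values are placeholders), using `Series.corr`, which computes Pearson's r by default.

```python
import pandas as pd

# Placeholder data; on the real data this would be the filtered frame `dr`.
toy = pd.DataFrame({
    "category": ["Single Malt"] * 4 + ["Blend"] * 4,
    "age": [10, 12, 18, 25, 12, 15, 21, 30],
    "price": [40, 55, 120, 300, 30, 45, 90, 200],
    "points": [86, 88, 91, 93, 85, 88, 87, 90],
})

# Pearson correlation of age with price and with points, per category.
age_correlations = {}
for category, group in toy.groupby("category"):
    age_correlations[category] = {
        "price": group["age"].corr(group["price"]),
        "points": group["age"].corr(group["points"]),
    }

for category, r in age_correlations.items():
    print(f"{category}: age vs price r={r['price']:.2f}, "
          f"age vs points r={r['points']:.2f}")
```

Rows with a missing age are left out of each pairwise calculation, which matters here because not every name states an age.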

## So what is the Best Scotch Whisky Value? 

To find the best value, we select all whiskies with a review score above 90 and then sort by price per point

In [None]:
d90=df[df['points'] >90].copy()
d90.reset_index(drop=True,inplace=True)
d90.sort_values(by=['price_per_point'],ascending=[True],inplace=True)
d90.head(10)

The winner is:

In [None]:
from IPython.display import Image
Image("../input/images/black_bottle.jpg")

I think I will need to conduct some field research to validate my finding ;-)