Hello, I'm learning data analysis and Python, so here is my notebook using this dataset. Feel free to comment!

In [None]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS

### The dataset

As you can see below, the dataset contains duplicates; dropping them reduces it from 150k entries to almost 100k. Some columns also have missing values, but I chose not to handle them for now.

In [None]:
# Import the dataset 150k
df = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv', index_col=0)

# Drop the duplicates from 150k --> almost 100k rows 
df = df.drop_duplicates()
df = df.replace('US-France', 'US') # only 1 row classified as US-France but it's actually US
df.head()

In [None]:
df.info()
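
To put numbers on the duplicates and missing values mentioned above, a small check like the one below could help. This is only a sketch: it runs on an invented four-row sample rather than the real file, and `quality_report` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def quality_report(frame):
    """Count fully duplicated rows and missing values per column."""
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
    }

# Toy sample standing in for the wine reviews (values are invented)
sample = pd.DataFrame({
    "country": ["US", "US", "France", None],
    "points": [90, 90, 88, 85],
    "price": [20.0, 20.0, np.nan, 15.0],
})

report = quality_report(sample)
print(report)
```

On the real data, calling the same helper on `df` before and after `drop_duplicates()` should show how many rows were removed and which columns still have gaps.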

### EDA

In [None]:
# Top 10 wine countries 
country = df.groupby('country').size().reset_index(name='count').sort_values('count', ascending=False)
px.bar(country.head(10), x='country', y='count', template='simple_white')

Almost all the wines in the dataset come from the US and Europe (especially France, Italy and Spain). The third region is South America, with Chile and Argentina.

In [None]:
# World map 
px.choropleth(
    country,
    locations='country',
    locationmode='country names',
    color='count',
    template='simple_white',
    range_color=[2000, 15000],
    color_continuous_scale='Viridis',
)

**Price distribution**

In [None]:
# Price distribution, you can move the bar below to zoom in for some price ranges
fig = px.histogram(df, x="price", nbins = 250, template='simple_white')
fig.update_layout(xaxis=dict(rangeslider=dict(visible=True), type="linear"))

**Price distribution of top 10 countries**

As you can see, wines start at around $5 regardless of their origin, but only the US and France have a bottle priced above $2,000.

In [None]:
df.groupby('country').price.agg(['count', 'min', 'max', 'mean']).reset_index().sort_values('count', ascending=False).head(10)


**Correlation between price and points**

Suppose we want to stay under $30 for a bottle, which is a reasonable budget. In that range, scores start at 80 points.

You can find a bottle at any price, but for a bottle rated at least 94 points you have to pay at least $15.

In [None]:
px.scatter(df[df.price < 30], x = 'points', y = 'price', template='simple_white')
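
The scatter plot shows the relationship visually. To quantify it, one option is a Pearson correlation coefficient, plus the cheapest price seen at each score. The sketch below uses invented numbers, not values from the dataset; on the real data the same two calls could be applied to `df[df.price < 30]`.

```python
import pandas as pd

# Invented sample: points and price for a handful of bottles
sample = pd.DataFrame({
    "points": [80, 85, 85, 90, 94, 94],
    "price": [8.0, 12.0, 10.0, 20.0, 25.0, 15.0],
})

# Pearson correlation between price and points (pandas default method)
corr = sample["price"].corr(sample["points"])

# Cheapest bottle available at each score
min_price_by_points = sample.groupby("points")["price"].min()

print(round(corr, 2))
print(min_price_by_points)
```

The per-score minimum is what backs a statement like "a 94-point bottle costs at least $15", while the coefficient summarizes how strongly price and points move together.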

**Zoom in to French provinces**

Although Bordeaux is the most popular province for wine, France has quite a list of good wines throughout the country.

In [None]:
france = df[df.country == 'France'].groupby('province').size().reset_index(name='count').sort_values('count', ascending=False)

px.pie(france, names = 'province', values = 'count')

**US provinces**

California is the place to be if you want to produce wine in the US, with over 70% of the wines. Together with Washington and Oregon, it covers over 90% of the market.

In [None]:
us = df[df.country == 'US'].groupby('province').size().reset_index(name='count').sort_values('count', ascending=False)

px.pie(us.head(10), names = 'province', values = 'count')
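
The percentages quoted above can also be computed directly instead of being read off the pie chart. Here is a minimal sketch with invented counts (the numbers are assumptions for illustration). Note that the pie chart only uses the top 10 provinces, so its shares are relative to those 10.

```python
import pandas as pd

# Invented counts per province, for illustration only
us_sample = pd.DataFrame({
    "province": ["California", "Washington", "Oregon", "New York"],
    "count": [70, 15, 10, 5],
})

us_sample = us_sample.sort_values("count", ascending=False)

# Share of each province and running total from the largest down
us_sample["share"] = us_sample["count"] / us_sample["count"].sum()
us_sample["cumulative_share"] = us_sample["share"].cumsum()

print(us_sample)
```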

### Wine description analysis

In [None]:
df.description

First, I will clean the descriptions.

In [None]:
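# Keep only ASCII letters (this drops numbers, punctuation and accented characters),
# then lowercase the text and collapse repeated whitespace
def cleaning(words):
    words = re.sub("[^a-zA-Z]", " ", str(words))
    text = words.lower().split()
    return " ".join(text)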

df['text'] = df.description.apply(cleaning)
df[['description', 'text']].head()
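
To see what this cleaning step does to a single description, here is a small standalone sketch. The function is a copy of the one above, and the sample sentences are invented.

```python
import re

def cleaning(words):
    # Replace every character that is not an ASCII letter with a space
    words = re.sub("[^a-zA-Z]", " ", str(words))
    # Lowercase and collapse repeated whitespace
    return " ".join(words.lower().split())

cleaned = cleaning("Aromas of 100% Cabernet; it's full-bodied!")
accented = cleaning("Rosé")

print(cleaned)
print(accented)
```

Two side effects are worth knowing: contractions are split at the apostrophe, and accented letters are removed, so a word like "Rosé" loses its last letter.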

In [None]:
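# Remove stopwords
nltk.download('stopwords')

# A set makes the membership test faster than a list
stop_words = set(stopwords.words('english'))

# Named remove_stopwords so it does not shadow the imported nltk `stopwords` corpus
def remove_stopwords(text):
    words = [word for word in text.lower().split() if word not in stop_words]
    return " ".join(words)

df['text_ready'] = df.text.apply(remove_stopwords)
df[['description', 'text_ready']].head()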

**Word clouds**

"Black cherry" and "full bodied" appear many times, and we can also spot words like "rich", "fruit", "sweet" and "fresh".

In [None]:
text = " ".join(review for review in df.text_ready)

# Create a custom stopwords list
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors", 'finish', 'palate', 'show', 'nose', 'note', 'taste', 'shows', 'notes'])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

# Display the generated image the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
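
A word cloud gives an impression of frequency but no actual counts. One way to check what it suggests is to count single words and two-word phrases directly. The sketch below uses only the standard library and three invented descriptions standing in for `df.text_ready`.

```python
from collections import Counter

# Invented, already-cleaned descriptions standing in for df.text_ready
reviews = [
    "rich black cherry fruit full bodied",
    "fresh fruit black cherry sweet",
    "full bodied rich fruit",
]

word_counts = Counter()
bigram_counts = Counter()
for review in reviews:
    tokens = review.split()
    word_counts.update(tokens)
    # Pair each token with its successor to count two-word phrases
    bigram_counts.update(zip(tokens, tokens[1:]))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

On the full dataset, the top entries of `bigram_counts` would show whether phrases such as "black cherry" really dominate.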

Let's see if the words differ by variety.

In [None]:
# Top 5 varieties
df.groupby('variety').size().reset_index(name='count').sort_values('count', ascending=False).head(5)

In [None]:
pinot = " ".join(review for review in df[df.variety == 'Pinot Noir'].text_ready)

wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(pinot)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
chardonnay = " ".join(review for review in df[df.variety == 'Chardonnay'].text_ready)

wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(chardonnay)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
sauvignon = " ".join(review for review in df[df.variety == 'Cabernet Sauvignon'].text_ready)

wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(sauvignon)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
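
Word clouds for different varieties tend to share many frequent words, which makes the differences hard to spot by eye. A possible complement is to compare relative word frequencies between two varieties. The sketch below is only an illustration: the descriptions are invented and `relative_freq` is a hypothetical helper.

```python
from collections import Counter

# Invented cleaned descriptions per variety
texts = {
    "Pinot Noir": ["cherry raspberry silky", "cherry earthy silky"],
    "Chardonnay": ["apple butter oak", "pear apple cherry"],
}

def relative_freq(reviews):
    """Return each word's share of all words in the given reviews."""
    counts = Counter(word for review in reviews for word in review.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

pinot_freq = relative_freq(texts["Pinot Noir"])
chard_freq = relative_freq(texts["Chardonnay"])

# Words that take a larger share in Pinot Noir than in Chardonnay
diff = {w: pinot_freq[w] - chard_freq.get(w, 0.0) for w in pinot_freq}
top = sorted(diff, key=diff.get, reverse=True)[:2]
print(top)
```

A word that is common in both varieties gets a small difference, so it drops out of the top of the list even if it is large in both word clouds.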