# Predicting Wine From User Sentiment

Project Summary
This project aims to predict the type of wine a user would like to drink, given text describing how they feel. Using Machine Learning, we'd like to be able to classify potential types of wine. The dataset was found on Kaggle.

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
raw_data = pd.read_csv('winemag-data-130k-v2.csv')

## Data Exploration

In [3]:
# Use only first 5000 lines of CSV file because actual file is too large
df = raw_data.head(5000)

In [4]:
# Get column names
df.columns

Index(['id', 'country', 'description', 'designation', 'points', 'price',
       'province', 'region_1', 'region_2', 'taster_name',
       'taster_twitter_handle', 'title', 'variety', 'winery'],
      dtype='object')

In [5]:
# Check how many NaN values there are
df.isna().sum()

id                          0
country                     3
description                 0
designation              1477
points                      0
price                     343
province                    3
region_1                  792
region_2                 3044
taster_name              1031
taster_twitter_handle    1206
title                       0
variety                     0
winery                      0
dtype: int64

In [6]:
df.nunique()

id                       5000
country                    27
description              4985
designation              2823
points                     21
price                     138
province                  176
region_1                  567
region_2                   17
taster_name                18
taster_twitter_handle      14
title                    4980
variety                   272
winery                   3411
dtype: int64

## Data Cleaning + Preprocessing for Modeling

In [7]:
# Drop columns, set index, drop NaNs and duplicates
df.drop(columns=["taster_name", "taster_twitter_handle"], inplace=True)
df = df.set_index('id')
df.dropna(axis=0, inplace=True)
df.drop_duplicates(inplace=True)
df = df.reset_index(drop=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=["taster_name", "taster_twitter_handle"], inplace=True)


In [8]:
df.head(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,title,variety,winery
0,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
1,US,"Soft, supple plum envelopes an oaky structure ...",Mountain Cuvée,87,19.0,California,Napa Valley,Napa,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature
2,US,This wine from the Geneseo district offers aro...,Signature Selection,87,22.0,California,Paso Robles,Central Coast,Bianchi 2011 Signature Selection Merlot (Paso ...,Merlot,Bianchi
3,US,Oak and earth intermingle around robust aromas...,King Ridge Vineyard,87,69.0,California,Sonoma Coast,Sonoma,Castello di Amorosa 2011 King Ridge Vineyard P...,Pinot Noir,Castello di Amorosa
4,US,"Rustic and dry, this has flavors of berries, c...",Puma Springs Vineyard,86,50.0,California,Dry Creek Valley,Sonoma,Envolve 2010 Puma Springs Vineyard Red (Dry Cr...,Red Blend,Envolve


In [9]:
# Check shape to check number of entries available
df.shape

(1303, 11)

In [10]:
# Import NLTK Corpus for NLP and RegEx library
from nltk.corpus import stopwords
import re

# Get stop words and add any other common ones from text file into the set
stopWords = set(stopwords.words('english'))

# Remove \n characters
with open('Common English Words.txt') as file:
    cleaned_cwords = [word[:len(word)-1] for word in file.readlines()]

# Add text file common words to stop words set
for c_words in cleaned_cwords:
    stopWords.add(c_words)
    
# Remove wine variety names in description list by adding variety names to stop word list
for wine_variety in df['variety']:
    stopWords.add(wine_variety)

In [11]:
# Take the description column and take out stop words
processed_sentences = []
for desc in df['description']:
    preprocessed = [word.lower().strip() for word in desc.split() if (word.lower() not in stopWords)]
    
    # Perform regex to remove digits and % signs from list
    for index, word in enumerate(preprocessed):
        match = re.search("[\d%,.]",word)
        
        # Check if there is a Match object and remove that item from list
        if match is not None:
            preprocessed.pop(index)
        
    # Combine all cleaned words into string for storage
    processed_sentences.append(" ".join(preprocessed))

In [12]:
# Store as new column in dataframe
for i in range(df.shape[0]):
    df.loc[i,"processed_description"] = processed_sentences[i]

In [13]:
# Drop original description column
df = df.drop(['description'],axis=1)

In [14]:
# Check if added column is present in df
df.head(5)

Unnamed: 0,country,designation,points,price,province,region_1,region_2,title,variety,winery,processed_description
0,US,Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"regular bottling comes across rough rustic, he..."
1,US,Mountain Cuvée,87,19.0,California,Napa Valley,Napa,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature,supple plum envelopes oaky structure supported...
2,US,Signature Selection,87,22.0,California,Paso Robles,Central Coast,Bianchi 2011 Signature Selection Merlot (Paso ...,Merlot,Bianchi,wine geneseo district offers aromas sour plums...
3,US,King Ridge Vineyard,87,69.0,California,Sonoma Coast,Sonoma,Castello di Amorosa 2011 King Ridge Vineyard P...,Pinot Noir,Castello di Amorosa,oak intermingle around robust aromas wet viney...
4,US,Puma Springs Vineyard,86,50.0,California,Dry Creek Valley,Sonoma,Envolve 2010 Puma Springs Vineyard Red (Dry Cr...,Red Blend,Envolve,"rustic flavors currants, licorice cabernet fra..."


## Data Visualization

In [25]:
import plotly.express as px


In [21]:
# Get data decription based on state that wine is sold in
price_point_state_comp = df.groupby(['province'],axis=0).describe()
price_point_state_comp

Unnamed: 0_level_0,points,points,points,points,points,points,points,points,price,price,price,price,price,price,price,price
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
province,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
California,864.0,88.921296,3.124799,81.0,87.0,89.0,91.0,99.0,864.0,43.027778,27.532459,9.0,25.0,37.0,52.0,300.0
New York,70.0,87.014286,2.003568,83.0,85.25,87.0,88.75,90.0,70.0,21.485714,9.324648,13.0,16.25,18.0,22.0,75.0
Oregon,127.0,89.094488,3.143229,82.0,87.0,89.0,92.0,95.0,127.0,43.858268,22.163435,10.0,27.5,39.0,54.5,120.0
Washington,242.0,88.983471,2.340279,83.0,88.0,89.0,90.0,98.0,242.0,32.301653,14.753715,9.0,23.0,30.0,40.0,95.0


## Natural Language Processing

There is this API available on RapidAPI for free that analyzes sentiment level and positivity score. The API docs can be found here: https://rapidapi.com/twinword/api/sentiment-analysis/