# Predicting Wine From User Sentiment

## Project Summary
This project aims to predict the type of wine a user would like to drink from a short text describing how they feel. We treat this as a machine learning classification problem over wine types. The dataset (`winemag-data-130k-v2.csv`) was found on Kaggle.

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px

In [2]:
raw_data = pd.read_csv('winemag-data-130k-v2.csv')

## Data Exploration

In [3]:
# Use only the first 3000 rows of the CSV file because the full file is too large
df = raw_data.head(3000)
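
Because only the first 3000 rows are used, the read can also stop early instead of loading the whole file and slicing it afterwards. A minimal sketch, assuming the same CSV layout; the helper name `load_sample` and the inline toy CSV are hypothetical and stand in for `winemag-data-130k-v2.csv`:

```python
import io

import pandas as pd


def load_sample(source, n_rows=3000):
    """Read at most n_rows data rows from a CSV path or file-like object."""
    # nrows makes pandas stop after n_rows rows, so the rest of a large
    # file is never turned into a DataFrame.
    return pd.read_csv(source, nrows=n_rows)


# Toy stand-in for the real file (hypothetical data, for illustration only).
toy_csv = io.StringIO("id,country,points\n0,US,87\n1,Italy,88\n2,France,90\n")
sample = load_sample(toy_csv, n_rows=2)
print(sample.shape)
```

The result is a fresh DataFrame rather than a slice of `raw_data`, so later in-place edits do not run into copy-versus-view warnings.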

In [4]:
# Get column names
df.columns

Index(['id', 'country', 'description', 'designation', 'points', 'price',
       'province', 'region_1', 'region_2', 'taster_name',
       'taster_twitter_handle', 'title', 'variety', 'winery'],
      dtype='object')

In [5]:
# Count the NaN values in each column
df.isna().sum()

id                          0
country                     1
description                 0
designation               904
points                      0
price                     194
province                    1
region_1                  468
region_2                 1885
taster_name               626
taster_twitter_handle     725
title                       0
variety                     0
winery                      0
dtype: int64

## Data Cleaning + Preprocessing for Modeling

In [6]:
# Drop the taster columns, move id into the index, then drop rows with NaNs and duplicate rows.
# Reassigning instead of using inplace=True avoids the SettingWithCopyWarning that pandas
# raises when a slice such as raw_data.head(3000) is modified in place.
df = df.drop(columns=["taster_name", "taster_twitter_handle"])
df = df.set_index('id')
df = df.dropna(axis=0)
df = df.drop_duplicates()
df = df.reset_index(drop=True)

In [7]:
df.head(5)

|   | country | description | designation | points | price | province | region_1 | region_2 | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
| 1 | US | Soft, supple plum envelopes an oaky structure ... | Mountain Cuvée | 87 | 19.0 | California | Napa Valley | Napa | Kirkland Signature 2011 Mountain Cuvée Caberne... | Cabernet Sauvignon | Kirkland Signature |
| 2 | US | This wine from the Geneseo district offers aro... | Signature Selection | 87 | 22.0 | California | Paso Robles | Central Coast | Bianchi 2011 Signature Selection Merlot (Paso ... | Merlot | Bianchi |
| 3 | US | Oak and earth intermingle around robust aromas... | King Ridge Vineyard | 87 | 69.0 | California | Sonoma Coast | Sonoma | Castello di Amorosa 2011 King Ridge Vineyard P... | Pinot Noir | Castello di Amorosa |
| 4 | US | Rustic and dry, this has flavors of berries, c... | Puma Springs Vineyard | 86 | 50.0 | California | Dry Creek Valley | Sonoma | Envolve 2010 Puma Springs Vineyard Red (Dry Cr... | Red Blend | Envolve |


In [8]:
# Check the shape to see how many rows remain after cleaning
df.shape

(742, 11)
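
Only 742 of the 3000 rows survive, largely because `dropna(axis=0)` discards a row when any column is missing, and `region_2` alone is empty in 1885 rows. If the model only needs `description` and `variety` (an assumption about the later modeling step, not something this notebook states), restricting the check with `subset` would keep far more rows. A sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the missing-value pattern above.
toy = pd.DataFrame({
    "description": ["dry and rustic", "soft plum", "sour cherry", "smoky oak"],
    "variety": ["Red Blend", "Merlot", "Pinot Noir", "Pinot Noir"],
    "region_2": ["Sonoma", np.nan, np.nan, "Napa"],
    "price": [50.0, 19.0, np.nan, 69.0],
})

# Checking every column loses each row that has a gap anywhere.
strict = toy.dropna(axis=0)

# Checking only the columns the model needs keeps the other rows.
lenient = toy.dropna(axis=0, subset=["description", "variety"])

print(len(strict), len(lenient))
```

Whether the extra rows are worth keeping depends on which columns end up as model features.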

## Natural Language Processing

In [9]:
# Import the NLTK stop-word corpus and the regex library
# (the corpus requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords
import re

# Start from NLTK's English stop words
stopWords = set(stopwords.words('english'))

# Read extra common words from a text file, one per line, stripping the trailing newline
with open('Common English Words.txt') as file:
    cleaned_cwords = [line.rstrip('\n') for line in file]

# Add the common words from the text file to the stop-word set
stopWords.update(cleaned_cwords)

# Add the variety names to the stop-word set so they can be filtered out of descriptions.
# Names are added as-is (capitalized, possibly multi-word), while the filter in the next cell
# compares lowercased single tokens, so tokens such as 'pinot' still pass through.
stopWords.update(df['variety'])
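
The stop-word check later compares lowercased single tokens, so capitalized, multi-word variety names such as `Pinot Noir` never match a token when they are added unchanged. One way to make the variety filter effective is to lowercase each name and split it into words before adding it. A minimal sketch; `build_stop_words` and its inputs are hypothetical, and the small base list stands in for the NLTK stop words:

```python
def build_stop_words(base_words, varieties):
    """Return a lowercase stop-word set that also covers variety names."""
    stop_words = {word.lower() for word in base_words}
    for name in varieties:
        # "Pinot Noir" -> {"pinot", "noir"}, so each token can match on its own.
        stop_words.update(name.lower().split())
    return stop_words


# Hypothetical inputs for illustration.
stop_words = build_stop_words(["the", "and"], ["Pinot Noir", "Merlot", "Red Blend"])
print(sorted(stop_words))
```

Splitting also adds generic words such as `red` and `blend`, which may or may not be desirable for the later model.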

In [20]:
# Lowercase the words of each description and drop stop words
for desc in df['description']:
    preprocessed = [word.lower().strip() for word in desc.split() if word.lower() not in stopWords]

    # Remove tokens that contain a digit, %, comma, or period
    for index, word in enumerate(preprocessed):
        match = re.search(r"[\d%,.]", word)

        # If the pattern matched, remove that token from the list.
        # Caveat: popping while enumerating skips the token right after each removal.
        if match is not None:
            preprocessed.pop(index)
    print(preprocessed, "\n")

['regular', 'bottling', 'comes', 'across', 'rough', 'rustic,', 'herbal', 'nonetheless,', 'pleasantly', 'unfussy', 'companion', 'hearty'] 

['supple', 'plum', 'envelopes', 'oaky', 'structure', 'supported', 'merlot.', 'coffee', 'chocolate', 'finishing', 'resulting', 'value-priced', 'wine', 'attractive', 'flavor', 'immediate'] 

['wine', 'geneseo', 'district', 'offers', 'aromas', 'sour', 'plums', 'cigar', 'tempt', 'flavors', 'acidity', 'tension', 'sour', 'cherries', 'emerges', 'bolstered'] 

['oak', 'intermingle', 'around', 'robust', 'aromas', 'wet', 'vineyard-designated', 'pinot', 'hails', 'high-elevation', 'production,', 'offers', 'full-bodied', 'raspberry', 'blackberry', 'steeped', 'smoky', 'spice', 'smooth'] 

['rustic', 'flavors', 'currants,', 'licorice', 'cabernet', 'franc', 'cabernet'] 

['erath', 'vineyard', 'strongly', 'notes', 'leaf', 'herb', 'somewhat', 'unripe', 'flavor', 'bitterness', 'passes', 'ripeness', 'sweet'] 

['shows', 'jelly-like', 'flavors', 'orange', 'earthy', 'mou
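
The loop above pops items from `preprocessed` while enumerating it, so the token that follows each removed one is never checked; this is why tokens such as `'rustic,'` and `'merlot.'` remain in the output. The pattern also discards whole words that merely end in a comma or period. A sketch of an alternative that strips surrounding punctuation and builds a new list instead; `clean_description` is a hypothetical helper, not part of the notebook:

```python
import re
import string

# Tokens containing a digit or a percent sign are treated as noise.
TOKEN_NOISE = re.compile(r"[\d%]")


def clean_description(desc, stop_words):
    """Lowercase, strip edge punctuation, and drop stop words and noisy tokens."""
    tokens = []
    for raw in desc.split():
        # Strip punctuation from both ends only, so "value-priced" stays whole.
        word = raw.lower().strip(string.punctuation)
        if not word or word in stop_words or TOKEN_NOISE.search(word):
            continue
        tokens.append(word)
    return tokens


# Hypothetical sentence and stop-word set for illustration.
tokens = clean_description("Rustic, earthy 2012 wine with 14% alcohol.", {"with"})
print(tokens)
```

Building a new list avoids mutating the sequence being iterated, so every token is checked exactly once.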

## Data Visualization

## Natural Language Processing

A free API on RapidAPI (Twinword Sentiment Analysis) analyzes the sentiment of a text and returns a positivity score. The API docs can be found here: https://rapidapi.com/twinword/api/sentiment-analysis/
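
A rough sketch of how such a request might be assembled with the standard library. The host, the `/analyze/` path, and the `text` query parameter are assumptions that should be checked against the API docs linked above; the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers follow RapidAPI's usual convention. Both function names are hypothetical:

```python
import json
import urllib.parse
import urllib.request

# Assumed values: verify the real host and path in the API docs.
ASSUMED_HOST = "twinword-sentiment-analysis.p.rapidapi.com"
ASSUMED_PATH = "/analyze/"


def build_sentiment_request(text, api_key, host=ASSUMED_HOST, path=ASSUMED_PATH):
    """Return (url, headers) for a GET request carrying the text to analyze."""
    query = urllib.parse.urlencode({"text": text})
    url = f"https://{host}{path}?{query}"
    headers = {"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": host}
    return url, headers


def fetch_sentiment(text, api_key):
    """Send the request and decode the JSON body (needs network and a valid key)."""
    url, headers = build_sentiment_request(text, api_key)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the request is built here; nothing is sent over the network.
url, headers = build_sentiment_request("I feel great today", "YOUR_KEY")
print(url)
```

Keeping request construction separate from the network call makes the first part easy to check without an API key.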