# Wine pairing
Zuzanna Gawrysiak 148255, Agata Żywot 148258


Project goals:
1. Add flavor information (sweet, acid, salt, piquant, fat, bitter) derived from https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews to FlavorGraph.
2. Use the resulting embeddings and flavor information to pair food with wines from https://www.kaggle.com/datasets/roaldschuring/wine-reviews.

### Import necessary libraries

In [3]:
import pandas as pd
from gensim.models.phrases import Phrases, Phraser
from nltk.tokenize import sent_tokenize

from src.data_preprocessing.text_preprocessing import normalize_text, normalize_sentences, extract_phrases

## EDA

### [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews)

In [4]:
reviews_df = pd.read_csv('data/Reviews.csv')
display(reviews_df.head())
print(reviews_df.shape)

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


(568454, 10)


In [3]:
# reviews_df.isnull().sum()
reviews_df.describe()

| | Id | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 284227.5 | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 164098.679298 | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 142114.25 | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 284227.5 | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 426340.75 | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 568454.0 | 866.0 | 923.0 | 5.0 | 1351210000.0 |


In [4]:
reviews_df["Score"].value_counts()

Score
5    363122
4     80655
1     52268
3     42640
2     29769
Name: count, dtype: int64

In [5]:
# Raw whitespace tokens are dominated by stop words, so the reviews need to be cleaned up
top_words = pd.Series(' '.join(reviews_df["Text"]).split()).value_counts()
top_words[:10]

the    1628045
I      1388076
and    1228666
a      1163164
to      992367
of      789652
is      714264
it      631252
for     519983
in      512394
Name: count, dtype: int64
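
The counts above are case-sensitive and consist almost entirely of function words. A minimal sketch of a cleaner count follows, assuming a small hand-picked stop-word set. `STOP_WORDS` and `top_content_words` are illustrative names and not part of the project code; a real run would likely use a fuller stop-word list.

```python
import re
from collections import Counter

# Illustrative stop-word set (assumption); a fuller list would be used in practice.
STOP_WORDS = {"the", "i", "and", "a", "to", "of", "is", "it", "for", "in", "this", "was"}


def top_content_words(texts, n=10):
    """Count lowercase word tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase first so "The" and "the" are counted together.
        tokens = re.findall(r"[a-z']+", str(text).lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


sample = [
    "Great taffy at a great price.",
    "The taffy is soft and the price is fair.",
]
print(top_content_words(sample, n=3))
```

On the full dataset the same function could be called as `top_content_words(reviews_df["Text"])`.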

In [5]:
reviews_list = reviews_df['Text'].astype(str).tolist()
reviews_list[:5]

['I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
 'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.',
 'If you are looking f

In [12]:
# Split the reviews into sentences with the sentence tokenizer;
# only the first 20,000 characters of the joined corpus are tokenized here
food_reviews_corpus = ' '.join(reviews_list)
food_reviews_corpus_tokenized = sent_tokenize(food_reviews_corpus[:20000])
food_reviews_corpus_tokenized[:5]

['I have bought several of the Vitality canned dog food products and have found them all to be of good quality.',
 'The product looks more like a stew than a processed meat and it smells better.',
 'My Labrador is finicky and she appreciates this product better than  most.',
 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted.',
 'Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".']

In [13]:
reviews_list_normalized = normalize_sentences(food_reviews_corpus_tokenized)
reviews_list_normalized[:3]

[['bought',
  'sever',
  'vital',
  'dog',
  'food',
  'product',
  'found',
  'good',
  'qualiti'],
 ['product', 'look', 'like', 'stew', 'process', 'meat', 'smell', 'better'],
 ['labrador', 'finicki', 'appreci', 'product', 'better']]

In [15]:
# Extract the most relevant bigrams and trigrams
reviews_phrases = extract_phrases(reviews_list_normalized, save_path='out/')
reviews_phrases[0]

['bought',
 'sever',
 'vital',
 'dog',
 'food',
 'product',
 'found',
 'good',
 'qualiti']
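
`extract_phrases` is a project helper whose implementation is not shown in this notebook. Given the `Phrases` and `Phraser` imports, it presumably wraps gensim's collocation detection. The idea can be sketched without dependencies: score each adjacent token pair by how much more often it occurs than its parts would suggest, then join pairs that score above a threshold. The scoring formula below is loosely modeled on a count-based collocation score and is simplified. `score_bigrams`, `merge_bigrams`, the toy corpus, and the threshold are all assumptions made for illustration.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent token pairs; a higher score means a stronger collocation."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }


def merge_bigrams(sentence, scores, threshold):
    """Join adjacent tokens whose score exceeds the threshold, scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2  # both tokens were consumed by the merged phrase
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus of already normalized sentences (hypothetical data).
toy = [
    ["dog", "food", "good", "qualiti"],
    ["dog", "food", "smell", "better"],
    ["product", "better", "dog", "food"],
]
scores = score_bigrams(toy)
print([merge_bigrams(s, scores, threshold=1.0) for s in toy])
```

Running the merge a second time on its own output would allow trigrams such as a bigram plus a following token to form. The first sentence in the real output above is unchanged, which suggests none of its pairs passed the threshold on the 20,000-character sample.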

### [Wine Reviews](https://www.kaggle.com/datasets/roaldschuring/wine-reviews)
TODO