
# Recipe Labelling

## Set up for the work

In [0]:
! pip install snorkel

In [0]:
! pip install spacy

In [0]:
!python -m spacy download en_core_web_sm

In [0]:
! pip install nltk



In [0]:
import nltk  # must be imported before the downloads below can run
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/faculty/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

[nltk_data] Downloading package wordnet to /home/faculty/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/faculty/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import re
import string

In [0]:
# Display full output rather than just the last line of output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [0]:
# Load dataset
import pandas as pd

indian = pd.read_csv("/project/data_indian.csv")
indian['label'] = 'indian'
italian = pd.read_csv("/project/data_italian.csv")
italian['label'] = 'italian'
mexican = pd.read_csv("/project/data_mexican.csv")
mexican['label'] = 'mexican'

In [0]:
len(indian)
len(mexican)
len(italian)

480

620

400

In [0]:
# Concat them into one dataset
recipe = pd.concat([indian, italian, mexican],ignore_index = True)
recipe

Unnamed: 0,Title,Description,label
0,Indian Peanut Stew,"This is an easy, authentic dish from South Asi...",indian
1,Roomali Roti,"There is no leavening in this simple, tender I...",indian
2,Spicy Sweet Potato Salad,It's important to use good mayonnaise in this ...,indian
3,Chicken Saag,The classic Indian chicken and spinach dish ge...,indian
4,Paleo Slow Cooker Pork Loin,Boneless pork loin slowly cooks in a curried f...,indian
...,...,...,...
1495,Taco Stew,Ground beef and onions sauteed with a packet o...,mexican
1496,Chicken Tortilla Soup in the Slow Cooker,Everyone loves using their slow cooker to make...,mexican
1497,Bountiful Garden Zucchini Enchiladas,Fresh zucchini and Monterey Jack cheese filled...,mexican
1498,Bean and Honey Burrito Casserole,Here's a great way to feed burritos to a crowd...,mexican


In [0]:
## Clean dataset

In [0]:
# Clean the dataset
# Lowercase every column (this assumes all columns hold strings)
recipe = recipe.apply(lambda col: col.str.lower())

text_cols = ['Title', 'Description']

# Remove digits
for col in text_cols:
    recipe[col] = recipe[col].apply(lambda s: ''.join(ch for ch in s if not ch.isdigit()))

# Remove punctuation
for col in text_cols:
    recipe[col] = recipe[col].apply(lambda s: ''.join(ch for ch in s if ch not in string.punctuation))

# Remove stop words (a set makes the membership test fast)
stop = set(stopwords.words('english'))
for col in text_cols:
    recipe[col] = recipe[col].apply(lambda s: ' '.join(w for w in s.split() if w not in stop))
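
The four cleaning steps can also be read as one function applied to a single string. The following is a minimal sketch, not the notebook's own code: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for the NLTK list used above.

```python
import string

# Tiny illustrative stop-word set (an assumption for this sketch);
# the notebook uses nltk.corpus.stopwords.words('english') instead.
DEMO_STOPWORDS = {"this", "is", "an", "from", "in", "the", "a", "to"}

def clean_text(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, drop digits and punctuation, then drop stop words."""
    text = text.lower()
    text = "".join(ch for ch in text if not ch.isdigit())
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(w for w in text.split() if w not in stopwords)

cleaned = clean_text("This is an easy, authentic dish from South Asia in 30 minutes!")
print(cleaned)  # easy authentic dish south asia minutes
```

A function like this could be applied to a column with `recipe['Title'].apply(clean_text)`, which keeps the order of the steps in one place.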

In [0]:
recipe.head()

Unnamed: 0,Title,Description,label
0,indian peanut stew,easy authentic dish south asia appeals wide ra...,indian
1,roomali roti,leavening simple tender indian flatbread bread...,indian
2,spicy sweet potato salad,important use good mayonnaise recipe let cooke...,indian
3,chicken saag,classic indian chicken spinach dish gets richn...,indian
4,paleo slow cooker pork loin,boneless pork loin slowly cooks curried fruit ...,indian


## Pattern Exploration

Before splitting the dataset and writing labelling functions, it helps to see what the recipes under each target label look like. This gives us a starting point for building the labelling functions.

In [0]:
# Patterns from different recipes

# Filter out different recipes
recipe_ind = recipe[recipe['label'] == 'indian']
recipe_ita = recipe[recipe['label'] == 'italian']
recipe_mex = recipe[recipe['label'] == 'mexican']

# Word frequency in 'Title'
top_N = 15

title1 = recipe_ind.Title.str.cat(sep=' ')
words_in_title1 = nltk.tokenize.word_tokenize(title1)
word_dist_title1 = nltk.FreqDist(words_in_title1)

title2 = recipe_ita.Title.str.cat(sep=' ')
words_in_title2 = nltk.tokenize.word_tokenize(title2)
word_dist_title2 = nltk.FreqDist(words_in_title2)

title3 = recipe_mex.Title.str.cat(sep=' ')
words_in_title3 = nltk.tokenize.word_tokenize(title3)
word_dist_title3 = nltk.FreqDist(words_in_title3)

ind_freq = pd.DataFrame(word_dist_title1.most_common(top_N),
                    columns=['Indian', 'Frequency'])
ita_freq = pd.DataFrame(word_dist_title2.most_common(top_N),
                    columns=['Italian', 'Frequency'])
mex_freq = pd.DataFrame(word_dist_title3.most_common(top_N),
                    columns=['Mexican', 'Frequency'])

title_freq = pd.concat([ind_freq,ita_freq,mex_freq],axis = 1)
title_freq

Unnamed: 0,Indian,Frequency,Italian,Frequency.1,Mexican,Frequency.2
0,chicken,92,italian,76,chicken,118
1,curry,82,chicken,52,mexican,100
2,indian,72,pasta,45,enchiladas,57
3,masala,30,sauce,38,taco,41
4,rice,28,lasagna,33,soup,39
5,spicy,25,sausage,23,bean,37
6,curried,21,pizza,23,salsa,36
7,paneer,20,ii,23,casserole,36
8,chutney,20,spaghetti,19,beef,31
9,soup,20,bread,18,rice,30


Displaying the most frequent title words of the three cuisines side by side makes shared patterns and differences easier to spot. The same goes for the descriptions.

**Ideas of building Labelling functions:**

- Single, cuisine-specific words, such as curry, masala, paneer and chutney for Indian recipes. These words label one type of recipe well because they are distinctive (they are unlikely to appear in other cuisines' recipes). Beyond the words in the top 15 frequency list, there must be other distinctive words, which we might need to go through the whole dataset to find.

- Word combos, such as **curry + chicken = indian** :) Chicken is used a lot in both Indian and Mexican recipes, so on its own it cannot separate them; one way to label such recipes is a **word combo** (function = special word + main ingredient).
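
The word-combo idea can be sketched as a small predicate. This is an illustration under assumptions: `has_word` and `combo_match` are hypothetical helper names, and the check is order-independent and matches whole words only, which is a design choice of this sketch rather than something the notebook prescribes.

```python
import re

def has_word(text, word):
    """True if `word` occurs as a whole word in `text` (case-insensitive)."""
    return re.search(rf"\b{re.escape(word)}\b", text, flags=re.I) is not None

def combo_match(text, special_word, ingredient):
    """True if both the special word and the ingredient occur, in any order."""
    return has_word(text, special_word) and has_word(text, ingredient)

print(combo_match("chicken curry", "curry", "chicken"))       # True
print(combo_match("chicken enchiladas", "curry", "chicken"))  # False
```

Because the two words are tested separately, "chicken curry" and "curry with chicken" both match, whereas a single pattern such as `curry.*chicken` only matches when "curry" comes first.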

In [0]:
# Word frequency in 'Description'
des1 = recipe_ind.Description.str.cat(sep=' ')
words_in_des1 = nltk.tokenize.word_tokenize(des1)
word_dist_des1 = nltk.FreqDist(words_in_des1)

des2 = recipe_ita.Description.str.cat(sep=' ')
words_in_des2 = nltk.tokenize.word_tokenize(des2)
word_dist_des2 = nltk.FreqDist(words_in_des2)

des3 = recipe_mex.Description.str.cat(sep=' ')
words_in_des3 = nltk.tokenize.word_tokenize(des3)
word_dist_des3 = nltk.FreqDist(words_in_des3)

d1_freq = pd.DataFrame(word_dist_des1.most_common(top_N),
                    columns=['Ind_description', 'Frequency'])
d2_freq = pd.DataFrame(word_dist_des2.most_common(top_N),
                    columns=['Ita_description', 'Frequency'])
d3_freq = pd.DataFrame(word_dist_des3.most_common(top_N),
                    columns=['Mex_description', 'Frequency'])

des_freq = pd.concat([d1_freq,d2_freq,d3_freq],axis=1)
des_freq

Unnamed: 0,Ind_description,Frequency,Ita_description,Frequency.1,Mex_description,Frequency.2
0,indian,119,italian,79,chicken,126
1,curry,101,sauce,63,beef,77
2,dish,83,cheese,61,recipe,77
3,chicken,80,pasta,51,cheese,76
4,recipe,61,chicken,49,mexican,69
5,rice,53,recipe,43,corn,68
6,spicy,51,easy,41,make,68
7,spices,50,garlic,36,sauce,64
8,made,48,tomatoes,35,beans,62
9,sauce,44,dish,35,salsa,58


<div class="alert alert-success">

As you might have noticed, Indian and Mexican food share some similarities, such as spicy-related words, sauce-related words, rice, etc. This is something we need to keep an eye on.
</div>
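
One way to make this overlap concrete is to intersect the frequency lists. The sketch below hard-codes the description words shown in the table above; `shared_words` and `distinctive_words` are hypothetical helper names introduced only for illustration.

```python
# Description words per cuisine, copied from the frequency table above
top_words = {
    "indian": ["indian", "curry", "dish", "chicken", "recipe",
               "rice", "spicy", "spices", "made", "sauce"],
    "italian": ["italian", "sauce", "cheese", "pasta", "chicken",
                "recipe", "easy", "garlic", "tomatoes", "dish"],
    "mexican": ["chicken", "beef", "recipe", "cheese", "mexican",
                "corn", "make", "sauce", "beans", "salsa"],
}

def shared_words(a, b):
    """Words that appear in the lists of both cuisines, in the order of `a`."""
    return [w for w in top_words[a] if w in top_words[b]]

def distinctive_words(target):
    """Words that appear only in the list of `target`."""
    others = {w for name, words in top_words.items() if name != target for w in words}
    return [w for w in top_words[target] if w not in others]

print(shared_words("indian", "mexican"))  # ['chicken', 'recipe', 'sauce']
print(distinctive_words("indian"))        # ['indian', 'curry', 'rice', 'spicy', 'spices', 'made']
```

Note that "rice" and "spicy" look distinctive here only because the comparison is limited to the rows displayed; a longer list would likely reveal more shared words, so candidates from this kind of check still deserve a manual look.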


## Split the dataset

As discussed in the group meeting, we split the dataset into training, validation, development and test sets.

Because this is a multi-class labelling task, we need to make sure that all of the sets above contain the same proportion of the 3 cuisines. I decided to keep 30% of the data labelled: 10% for the dev set, 10% for the validation set, and the remaining 10% for the test set. The other 70% forms the training set.
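
As a quick check of the arithmetic, the per-cuisine sizes implied by a 70/10/10/10 split can be computed from the row counts reported earlier (480 Indian, 400 Italian, 620 Mexican). This is a minimal sketch; `split_sizes` is a hypothetical helper, and integer arithmetic is used to avoid floating-point rounding.

```python
# Row counts per cuisine, as reported by len(...) earlier in the notebook
class_sizes = {"indian": 480, "italian": 400, "mexican": 620}

def split_sizes(n, held_out_percent=10):
    """Sizes of train/val/dev/test when three equal parts are held out."""
    held = n * held_out_percent // 100
    return {"train": n - 3 * held, "val": held, "dev": held, "test": held}

sizes = {cuisine: split_sizes(n) for cuisine, n in class_sizes.items()}
for cuisine, s in sizes.items():
    print(cuisine, s)
# indian {'train': 336, 'val': 48, 'dev': 48, 'test': 48}
# italian {'train': 280, 'val': 40, 'dev': 40, 'test': 40}
# mexican {'train': 434, 'val': 62, 'dev': 62, 'test': 62}

total_train = sum(s["train"] for s in sizes.values())
total_dev = sum(s["dev"] for s in sizes.values())
print(total_train, total_dev)  # 1050 150
```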

In [0]:
# Split the dataset
# Each cuisine is split separately with StratifiedShuffleSplit,
# so every subset keeps the same proportion of the 3 cuisines
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

# Get the rows for each label, re-indexed from 0
ind = recipe[recipe['label'] == 'indian'].reset_index(drop=True)
ita = recipe[recipe['label'] == 'italian'].reset_index(drop=True)
mex = recipe[recipe['label'] == 'mexican'].reset_index(drop=True)

# Split function (leave 70% for training)
def shuffle_split(df, sss):
    X = df[['Title', 'Description']]
    y = df['label']
    # Only the first of the n_splits splits is used
    train_index, test_index = next(sss.split(X, y))
    # The indices returned by split() are positional, hence iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    return X_train, X_test, y_train, y_test

ind_X_train, ind_X_test, ind_y_train, ind_y_test = shuffle_split(ind, sss)
ita_X_train, ita_X_test, ita_y_train, ita_y_test = shuffle_split(ita, sss)
mex_X_train, mex_X_test, mex_y_train, mex_y_test = shuffle_split(mex, sss)

In [0]:
print('indian: ',len(ind_X_train),len(ind_X_test))
print('italian: ',len(ita_X_train),len(ita_X_test))
print('mexican: ',len(mex_X_train),len(mex_X_test))

indian:  336 144
italian:  280 120
mexican:  434 186


In [0]:
# Combine training and test dataset
X_train = pd.concat([ind_X_train,ita_X_train,mex_X_train],axis=0)
y_train = pd.concat([ind_y_train,ita_y_train,mex_y_train],axis=0)
X_test =  pd.concat([ind_X_test,ita_X_test,mex_X_test],axis=0)
y_test =  pd.concat([ind_y_test,ita_y_test,mex_y_test],axis=0)

In [0]:
# Combine training dataset
train = pd.concat([X_train,y_train],axis=1)

In [0]:
# Combine the test dataset for next splitting
test = pd.concat([X_test,y_test],axis=1)
test = test.reset_index(drop=True)

Next, carve the validation and development sets out of the held-out 30%; the remainder becomes the final test set.

In [0]:
# From randomly sampled test set get dev set and validation set.

ind_val, ind_dev = test[:48], test[48:96]
ita_val, ita_dev = test[144:184], test[184:224]
mex_val, mex_dev = test[264:326], test[326:388]
ind_test, ita_test, mex_test = test[96:144],test[224:264],test[388:450]
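
The hard-coded slice boundaries above follow from the held-out sizes per cuisine (144, 120 and 186 rows, stacked in that order). As a sketch of how they could be derived instead of typed by hand, assuming each cuisine's block is cut into three equal consecutive parts (`block_slices` is a hypothetical helper):

```python
def block_slices(held_out_sizes, parts=("val", "dev", "test")):
    """Return {label: {part: (start, stop)}} for consecutive per-class blocks."""
    slices, offset = {}, 0
    for label, size in held_out_sizes.items():
        per_part = size // len(parts)
        slices[label] = {
            part: (offset + i * per_part, offset + (i + 1) * per_part)
            for i, part in enumerate(parts)
        }
        offset += size  # the next cuisine's block starts where this one ends
    return slices

held_out = {"indian": 144, "italian": 120, "mexican": 186}
bounds = block_slices(held_out)
print(bounds["indian"])   # {'val': (0, 48), 'dev': (48, 96), 'test': (96, 144)}
print(bounds["italian"])  # {'val': (144, 184), 'dev': (184, 224), 'test': (224, 264)}
print(bounds["mexican"])  # {'val': (264, 326), 'dev': (326, 388), 'test': (388, 450)}
```

Each `(start, stop)` pair can be used directly as `test[start:stop]`, and the values reproduce the slices written out in the cell above.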

In [0]:
# Combine val, dev and test set

val = pd.concat([ind_val,ita_val,mex_val],axis=0)
dev = pd.concat([ind_dev,ita_dev,mex_dev],axis=0)
test_n = pd.concat([ind_test,ita_test,mex_test],axis=0)

Because we split the dataset cuisine by cuisine, each set is still ordered by cuisine, so we need to shuffle it before training.

In [0]:
from sklearn.utils import shuffle
train = shuffle(train, random_state = 42)
test = shuffle(test_n, random_state = 42)
val = shuffle(val, random_state = 42)
dev = shuffle(dev, random_state = 42)

To apply LFAnalysis, we need to convert the labels to numbers.

In [0]:
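# Change labels to numbers
LABEL_TO_NUM = {'indian': 0, 'italian': 1, 'mexican': 2}

def label_to_num(df):
    # Work on a copy and map explicitly, so unexpected labels become NaN
    # instead of being silently treated as 'mexican'
    df = df.copy()
    df['label'] = df['label'].map(LABEL_TO_NUM)
    return df

test_n = label_to_num(test_n)
test = label_to_num(test)  # the shuffled test set is a separate copy
val = label_to_num(val)
dev = label_to_num(dev)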

In [0]:
# Prepare for later training
df_train = train.iloc[:,:2]
df_val = val.iloc[:,:2]
df_dev = dev.iloc[:,:2]
Y_val = val.iloc[:,-1].values
Y_dev = dev.iloc[:,-1].values

# Labelling functions

In [0]:
from snorkel.labeling import labeling_function

In [0]:
# For clarity, we define constants to represent the class labels and abstaining.
ABSTAIN = -1
INDIAN = 0
ITALIAN = 1
MEXICAN = 2

## Keywords LFs

In [0]:
ind_keywords = ['curry','indian','masala','paneer','chutney']

@labeling_function()
def indian_keywords(x):
    # Substring match against the cleaned, lowercased title
    if any(word in x.Title for word in ind_keywords):
        return INDIAN
    return ABSTAIN

In [0]:
import re

@labeling_function()
def curchicken(x):
    return INDIAN if re.search(r"curry.*chicken", x.Description, flags=re.I) else ABSTAIN
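
The same keyword idea could be extended to the other two cuisines. The sketch below is an assumption-laden illustration, not part of the notebook: `make_keyword_lf` is a hypothetical factory, the keyword lists are taken from the title frequency table above, and the functions are plain Python so they run without Snorkel. They could be wrapped with Snorkel's `LabelingFunction` class before being passed to an applier.

```python
import re
from types import SimpleNamespace

ABSTAIN, INDIAN, ITALIAN, MEXICAN = -1, 0, 1, 2

def make_keyword_lf(keywords, label, field="Title"):
    """Build a function that returns `label` if any keyword is a whole word in `field`."""
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")

    def lf(x):
        return label if pattern.search(getattr(x, field)) else ABSTAIN

    lf.__name__ = f"keywords_{label}"
    return lf

# Keyword lists chosen from the title frequency table (illustrative only)
italian_keywords = make_keyword_lf(["pasta", "lasagna", "pizza", "spaghetti"], ITALIAN)
mexican_keywords = make_keyword_lf(["enchiladas", "taco", "salsa"], MEXICAN)

row = SimpleNamespace(Title="chicken enchiladas", Description="")
print(mexican_keywords(row))  # 2
print(italian_keywords(row))  # -1
```

Whole-word matching means "taco" does not fire on "tacos"; whether plural forms should be listed separately depends on how the titles were cleaned.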

In [0]:
from snorkel.labeling import PandasLFApplier

lfs = [indian_keywords, curchicken]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_dev = applier.apply(df=df_dev)





100%|██████████| 1050/1050 [00:00<00:00, 7795.12it/s]
100%|██████████| 150/150 [00:00<00:00, 12384.02it/s]


In [0]:
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
indian_keywords,0,[0],0.144762,0.00381,0.0
curchicken,1,[0],0.004762,0.00381,0.0
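
To make the columns of this summary concrete, here is a minimal pure-Python sketch of how coverage, overlaps and conflicts can be computed for one labelling function from a label matrix. `lf_stats` is a hypothetical helper and the small matrix is made up for illustration; it is not Snorkel's implementation.

```python
ABSTAIN = -1

def lf_stats(L, j):
    """Coverage, overlaps and conflicts of LF `j` as fractions of all rows in L."""
    n = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Votes of the other LFs on the same row
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n

# Toy label matrix: 4 data points, 2 labelling functions
L = [[0, 0], [0, -1], [-1, -1], [0, 2]]
print(lf_stats(L, 0))  # (0.75, 0.5, 0.25)
print(lf_stats(L, 1))  # (0.5, 0.5, 0.25)
```

Read this way, the coverage of 0.144762 for `indian_keywords` is consistent with 152 of the 1050 training rows receiving a label, and the overlap of 0.00381 with 4 rows on which both functions fire.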


In [0]:
L_dev

array([[-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [ 0, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [ 0, -1],
       [ 0, -1],
       [ 0, -1],
       [-1, -1],
       [-1, -1],
       [ 0, -1],
       [ 0, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [ 0, -1],
       [ 0, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1],
       [ 0, -1

In [0]:
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
indian_keywords,0,[0],0.12,0.0,0.0,18,0,1.0
curchicken,1,[],0.0,0.0,0.0,0,0,0.0
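
The last three columns compare each function's votes with the gold labels. A minimal sketch of that calculation follows, assuming abstentions are ignored and that an LF with no votes is reported with accuracy 0.0, as in the `curchicken` row above; `empirical_accuracy` is a hypothetical helper, not Snorkel's code.

```python
ABSTAIN = -1

def empirical_accuracy(votes, gold):
    """Return (correct, incorrect, accuracy) over the non-abstaining votes."""
    correct = sum(1 for v, g in zip(votes, gold) if v != ABSTAIN and v == g)
    incorrect = sum(1 for v, g in zip(votes, gold) if v != ABSTAIN and v != g)
    voted = correct + incorrect
    return correct, incorrect, (correct / voted if voted else 0.0)

print(empirical_accuracy([0, -1, 0, -1], [0, 1, 2, 0]))  # (1, 1, 0.5)
print(empirical_accuracy([-1, -1, -1], [0, 1, 2]))       # (0, 0, 0.0)
```

An accuracy of 0.0 for an LF that never fires therefore says nothing about its quality; it only reflects that there were no votes to score on the dev set.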
