# Amazon Customer Reviews Task

### You will work with Amazon Customer Reviews, writing labeling functions that classify each review as positive (1) or negative (0) in sentiment.

First, import necessary libraries:

In [None]:
from data.preparer import load_amazon_dataset
from snorkel.labeling import labeling_function
from snorkel.labeling import LabelModel
from snorkel.labeling import LFAnalysis
from snorkel.labeling import PandasLFApplier
from analyzer import train_model
import re
import pandas as pd
pd.set_option('display.max_colwidth', -1)

## The Data

The reviews are available [via Amazon](https://s3.amazonaws.com/amazon-reviews-pds/readme.html).
You may download them there, or provide a password to unzip the file below.

For simplicity, only 1 star and 5 star reviews are included.

You must replace `PASSWORD` with the password to unzip the data.

In [None]:
!unzip -P PASSWORD data/data.zip

In [None]:
DELIMITER = "#"
df_train, df_dev, df_valid, df_test = load_amazon_dataset(delimiter=DELIMITER)
print("{} training examples".format(len(df_train)))
print("{} development examples".format(len(df_dev)))
print("{} validation examples".format(len(df_valid)))
print("{} test examples".format(len(df_test)))

Define the labels for this task:

In [None]:
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

In [None]:
print("Some labeled examples: ")
display(df_dev[df_dev.label==NEGATIVE].sample(5))
display(df_dev[df_dev.label==POSITIVE].sample(5))

## Labeling Instructions

All reviews were submitted with either a 1 star (negative) or a 5 star (positive) rating. Your task is to create labeling functions that take the text of a review as input and output a `NEGATIVE`, `POSITIVE`, or `ABSTAIN` label.

## Writing Labeling Functions
Time to write some labeling functions! Below is an example. Be sure to add each function you write to the list `lfs`.

In [None]:
lfs = []

In [None]:
@labeling_function()
def my_first_labeling_function(x):
    return NEGATIVE if "awful" in x.text.lower() else ABSTAIN

lfs.append(my_first_labeling_function)
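
Keyword checks like the one above can be generalized with regular expressions. The following is a minimal sketch, not part of the original task: the name `lf_regex_positive` and the word list are hypothetical, and the function is left undecorated so that it runs without Snorkel installed. To use it in this notebook, decorate it with `@labeling_function()` and append it to `lfs`.

```python
import re
from types import SimpleNamespace

ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

# Hypothetical word list; \b keeps "great" from matching inside "greater".
POSITIVE_PATTERN = re.compile(r"\b(love|great|excellent|perfect)\b", re.IGNORECASE)


def lf_regex_positive(x):
    """Vote POSITIVE when a praise word appears as a whole word, else abstain."""
    return POSITIVE if POSITIVE_PATTERN.search(x.text) else ABSTAIN


# Stand-ins for DataFrame rows: any object with a `.text` attribute works.
print(lf_regex_positive(SimpleNamespace(text="Great product, works well")))
print(lf_regex_positive(SimpleNamespace(text="Arrived broken")))
```

Returning `ABSTAIN` rather than guessing `NEGATIVE` when nothing matches keeps the function precise: it only votes when it has evidence.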

## Applying Functions
We obtain training labels in two steps: first apply every labeling function to every example, then train a model that combines their noisy outputs.
`L_train` and `L_dev` are label matrices for the training and development sets, with one row per example and one column per labeling function. Each entry is the label that function returned for that example.

In [None]:
# Apply the LFs to the unlabeled training data, and the development data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)
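
To make the shape of a label matrix concrete, here is a small sketch that builds one by hand with plain Python lists. It is an illustration only: the toy functions take a string directly rather than a DataFrame row, and `lf_awful`, `lf_love`, `toy_lfs`, and `L_toy` are hypothetical names that do not appear elsewhere in this notebook.

```python
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1


def lf_awful(text):
    return NEGATIVE if "awful" in text.lower() else ABSTAIN


def lf_love(text):
    return POSITIVE if "love" in text.lower() else ABSTAIN


texts = ["Awful quality", "I love it", "It is a phone case"]
toy_lfs = [lf_awful, lf_love]

# One row per example, one column per labeling function.
L_toy = [[lf(text) for lf in toy_lfs] for text in texts]
print(L_toy)  # [[0, -1], [-1, 1], [-1, -1]]
```

The last row is all `ABSTAIN`: no function fired on that review, so it would stay unlabeled.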

Train the Snorkel label model to combine these noisy labels.

In [None]:
# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")
display(df_train.sample(5))
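
For intuition about what "combining" votes means, the sketch below implements plain majority voting over the rows of a label matrix. This is a simpler baseline and not what `LabelModel` does: the label model estimates how accurate each labeling function is and weights its votes accordingly. The tie handling here is meant to mirror the idea behind `tie_break_policy="abstain"`, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Return the most common non-abstain vote in a row, or ABSTAIN on a tie."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN  # every labeling function abstained
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # equal support for both labels
    return ranked[0][0]


L_toy = [[0, -1, 0], [1, 0, -1], [-1, -1, -1], [1, 1, 0]]
print([majority_vote(row) for row in L_toy])  # [0, -1, -1, 1]
```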

## View Unlabeled Examples
You can use these to brainstorm new labeling functions. You may try filtering or sorting them in other ways.

In [None]:
# Keep only the examples on which the label model abstained (no label assigned)
df_unlabeled = df_train[df_train.label == ABSTAIN]
display(df_unlabeled.sample(5))

## Analyze Results
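Evaluate the labeling functions on both sets. On the training set, each function is compared against the labels estimated by the label model, since no ground truth is available there. On the development set, each function is compared against the ground truth labels.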

In [None]:
Y_train = df_train.label.values
train_analysis = LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=Y_train)
display("Training set results:", train_analysis)

In [None]:
Y_dev = df_dev.label.values
dev_analysis = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
display("Dev set results:", dev_analysis)
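
Two of the most useful numbers in a summary like this are a labeling function's coverage and its empirical accuracy. The sketch below shows how these can be computed for a single column of a label matrix. It is an illustration under stated assumptions, not Snorkel's implementation; `coverage`, `empirical_accuracy`, and the toy data are hypothetical.

```python
ABSTAIN = -1


def coverage(votes):
    """Fraction of examples on which the labeling function did not abstain."""
    return sum(v != ABSTAIN for v in votes) / len(votes)


def empirical_accuracy(votes, gold):
    """Fraction of non-abstain votes that match the reference label."""
    labeled = [(v, y) for v, y in zip(votes, gold) if v != ABSTAIN]
    if not labeled:
        return None  # undefined when the function never votes
    return sum(v == y for v, y in labeled) / len(labeled)


votes = [0, -1, 0, 1, -1]  # one labeling function's output on five examples
gold = [0, 1, 1, 1, 0]     # reference labels for the same examples

print(coverage(votes))
print(empirical_accuracy(votes, gold))
```

A function with high accuracy but low coverage is a good candidate for broadening; one with high coverage but low accuracy likely needs a stricter condition.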

## Train Model
Lastly, train a simple bag-of-words model on these labels and report its test accuracy.

(This step may take a while.)

In [None]:
train_model(label_model, df_train, df_valid, df_test, L_train)
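
For readers unfamiliar with the term, a bag-of-words representation turns each review into a vector of word counts, ignoring word order. The sketch below shows the idea with the standard library only. It is not the implementation inside `train_model`, whose internals are not shown in this notebook; `tokenize` and `bag_of_words` are hypothetical helpers, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(texts):
    """Return a sorted vocabulary and one count vector per text."""
    vocab = sorted({tok for t in texts for tok in tokenize(t)})
    vectors = []
    for t in texts:
        counts = Counter(tokenize(t))
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["good good phone", "bad phone"])
print(vocab)    # ['bad', 'good', 'phone']
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```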