In [1]:
import pandas as pd

In [5]:
import string

# Load Amazon Dataset

Load the dataset consisting of baby product reviews on Amazon.com. Store the data in a data frame **products**.

In [44]:
products = pd.read_csv('./data/amazon_baby.csv')

In [45]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [46]:
products.shape

(183531, 3)

# Perform Text Cleaning

We start by removing punctuation so that, for example, "cake." and "cake!" are counted as the same word.

- Write a function **remove_punctuation** that strips punctuation from a line of text
- Apply this function to every element in the **review** column of **products**, and save the result to a new column **review_clean**.

**Aside.** In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach would preserve the apostrophes in contractions such as "I'd", "would've", "hadn't" and so forth.
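As a rough illustration of the aside above, the sketch below keeps an apostrophe only when it sits between two word characters and drops every other punctuation mark. The helper name `remove_punctuation_keep_contractions` and the regular expression are assumptions made for this example; they are not part of the original notebook, which uses the simpler approach throughout.

```python
import re

# Hypothetical pattern (an assumption for this sketch). It matches:
#   - an apostrophe that is NOT preceded by a word character,
#   - an apostrophe that is NOT followed by a word character,
#   - any character that is not a word character, whitespace, or apostrophe,
#   - the underscore (which \w would otherwise treat as a word character).
_PUNCT_EXCEPT_INNER_APOSTROPHE = re.compile(r"(?<!\w)'|'(?!\w)|[^\w\s']|_")

def remove_punctuation_keep_contractions(text):
    """Strip punctuation but keep apostrophes inside words such as "hadn't"."""
    return _PUNCT_EXCEPT_INNER_APOSTROPHE.sub('', text)

example = remove_punctuation_keep_contractions(
    "I'd say it's 'great', but it hadn't arrived!"
)
print(example)
```

Unlike `string.punctuation`, this pattern also removes non-ASCII punctuation, so the two approaches are not interchangeable on every input.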

In [47]:
def remove_punctuation(text):
    # Translation table that deletes every character listed in string.punctuation
    tran_tab = str.maketrans('', '', string.punctuation)
    return text.translate(tran_tab)
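A quick sanity check of the function can be sketched as follows. The definition is repeated so that the snippet runs on its own; the only difference is that the translation table is built once in a module-level constant, `PUNCT_TABLE`, which is a hypothetical name introduced here rather than something the notebook defines.

```python
import string

# Built once and reused on every call (hypothetical constant for this sketch).
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    """Delete every character that appears in string.punctuation."""
    return text.translate(PUNCT_TABLE)

# "cake." and "cake!" should collapse to the same token.
print(remove_punctuation("cake."))
print(remove_punctuation("cake!"))

# Apostrophes and hyphens are punctuation too, so contractions are merged.
print(remove_punctuation("I'd buy non-stop!"))
```

The last call shows the trade-off mentioned in the aside: "I'd" becomes "Id" and "non-stop" becomes "nonstop".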

**IMPORTANT.** Make sure to fill n/a values in the **review** column with empty strings (if applicable) before applying **remove_punctuation**. The n/a values indicate empty reviews. For instance, the pandas fillna() method lets you replace all n/a values in the **review** column.

In [48]:
products.fillna({'review':''}, inplace=True)
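To see why this step matters, here is a minimal sketch on a toy data frame (the frame `toy` and its contents are made up for illustration). A missing review is stored as a null value rather than a string, so a string method such as `translate` would not be available on it; filling with `''` makes every entry a string.

```python
import pandas as pd

# Toy data, invented for this example: the second review is missing.
toy = pd.DataFrame({
    'name': ['Product A', 'Product B'],
    'review': ['Great!', None],
})

missing_before = int(toy['review'].isna().sum())
print(missing_before)

# Replace missing values in the 'review' column only; other columns are untouched.
toy = toy.fillna({'review': ''})

missing_after = int(toy['review'].isna().sum())
print(missing_after)
print(toy['review'].tolist())
```

This version reassigns the result instead of using `inplace=True`; both forms give the same filled column.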

In [49]:
products['review_clean'] = products.review.apply(remove_punctuation)

# Extract Sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.

In [52]:
idx = (products.rating != 3)
# Copy the filtered rows so that adding columns later does not raise SettingWithCopyWarning
products = products[idx].copy()

Next, we label reviews with a rating of 4 or higher as *positive* and those with a rating of 2 or lower as *negative*. In the **sentiment** column, we use +1 for the positive class label and -1 for the negative class label. One convenient way to do this is to write an anonymous function that converts a rating into a class label and apply it to every element of the **rating** column.

In [54]:
products['sentiment'] = products.rating.apply(
    lambda rating: +1 if rating > 3 else -1
)
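As an alternative to `apply` with a lambda, the same labels can be computed in one vectorized step. The sketch below uses `numpy.where` on a small made-up frame (`toy` is hypothetical) and also repeats the neutral-rating filter so that it runs on its own; it is offered as an illustration, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy ratings, invented for this example.
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

# Drop neutral reviews (rating == 3), copying so new columns can be added safely.
toy = toy[toy['rating'] != 3].copy()

# +1 where rating > 3, otherwise -1.
toy['sentiment'] = np.where(toy['rating'] > 3, 1, -1)

print(toy)
```

Because rating 3 has already been removed, `rating > 3` is equivalent to "4 or higher" and everything else is "2 or lower".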

The dataset now contains an extra column, **sentiment**, whose value is either +1 (positive) or -1 (negative).

In [55]:
products.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1


# Train/Test Split