# Implementing logistic regression from scratch

The goal of this assignment is to implement your own logistic regression classifier. You will:

- Extract features from Amazon product reviews.
- Convert a data frame into a NumPy array.
- Implement the link function for logistic regression.
- Write a function to compute the derivative of the log likelihood function with respect to a single coefficient.
- Implement gradient ascent.
- Given a set of coefficients, predict sentiments.
- Compute classification accuracy for the logistic regression model.

In [30]:
import string
import pandas as pd
import numpy as np

# Load review dataset

For this assignment, we will use a subset of the Amazon product review dataset. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.

Load the dataset into a data frame named products. The sentiment column holds the class label: +1 for a review with positive sentiment and -1 for a review with negative sentiment.

In [21]:
products = pd.read_csv('./data/amazon_baby_subset.csv')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53072 entries, 0 to 53071
Data columns (total 4 columns):
name         52982 non-null object
review       52831 non-null object
rating       53072 non-null int64
sentiment    53072 non-null int64
dtypes: int64(2), object(2)
memory usage: 1.6+ MB


Let us quickly explore more of this dataset. The name column holds the name of the product. Try listing the names of the first 10 products in the dataset.

In [8]:
products.name[:10]

0    Stop Pacifier Sucking without tears with Thumb...
1      Nature's Lullabies Second Year Sticker Calendar
2      Nature's Lullabies Second Year Sticker Calendar
3                          Lamaze Peekaboo, I Love You
4    SoftPlay Peek-A-Boo Where's Elmo A Children's ...
5                            Our Baby Girl Memory Book
6    Hunnt&reg; Falling Flowers and Birds Kids Nurs...
7    Blessed By Pope Benedict XVI Divine Mercy Full...
8    Cloth Diaper Pins Stainless Steel Traditional ...
9    Cloth Diaper Pins Stainless Steel Traditional ...
Name: name, dtype: object

After that, try counting the number of positive and negative reviews.

Note: For this assignment, we eliminated class imbalance by choosing a subset of the data with a similar number of positive and negative reviews.

In [10]:
np.sum(products.sentiment == 1)

26579

In [11]:
np.sum(products.sentiment == -1)

26493
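Both counts can also be obtained in a single call. The following is a minimal sketch, assuming a small hypothetical frame `toy` that stands in for products (the names `toy`, `counts`, and `positive_fraction` are illustrative and not part of the assignment):

```python
import pandas as pd

# Hypothetical stand-in for `products`; only the sentiment column matters here.
toy = pd.DataFrame({'sentiment': [1, -1, 1, 1, -1]})

# value_counts() tallies every distinct label in one pass.
counts = toy['sentiment'].value_counts()
print(counts)

# Fraction of positive reviews, a quick check on class balance.
positive_fraction = (toy['sentiment'] == 1).mean()
print(positive_fraction)
```

On the real data, a fraction close to 0.5 would confirm that the two classes are roughly balanced.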

# Apply text cleaning on the review data

In this section, we will perform some simple feature cleaning using data frames. The last assignment used all words in building bag-of-words features, but here we limit ourselves to 193 words for simplicity. We compiled the 193 most frequent words into a JSON file named important_words.json. Load the words into a list named important_words.

In [17]:
important_words = pd.read_json('./data/important_words.json')
important_words.columns = ['words']
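Note that pd.read_json returns a data frame rather than a plain Python list. If a list is preferred, the standard json module is one option. The sketch below assumes the file is a flat JSON array of strings; it writes a hypothetical three-word sample file (`important_words_sample.json`) so that it can run on its own:

```python
import json
import os
import tempfile

# Hypothetical three-word sample standing in for important_words.json.
sample_words = ['baby', 'one', 'great']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'important_words_sample.json')
    with open(path, 'w') as f:
        json.dump(sample_words, f)

    # json.load returns a plain Python list for a JSON array.
    with open(path) as f:
        words_list = json.load(f)

print(words_list)
```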

Let us perform 2 simple data transformations:

- Remove punctuation
- Compute word counts (only for important_words)

We start with the first item as follows:

- If your tool supports it, fill n/a values in the review column with empty strings. The n/a values indicate empty reviews. For instance, the pandas fillna() method lets you replace all n/a values in the review column as follows:

In [27]:
products = products.fillna({'review':''})  # fill in N/A's in the review column
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53072 entries, 0 to 53071
Data columns (total 4 columns):
name         52982 non-null object
review       53072 non-null object
rating       53072 non-null int64
sentiment    53072 non-null int64
dtypes: int64(2), object(2)
memory usage: 1.6+ MB


- Write a function remove_punctuation that takes a line of text and removes all punctuation from that text. The function should be analogous to the following Python code:

In [38]:
def remove_punctuation(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)
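A quick check of the function on a sample string, together with an illustration of why the empty reviews were filled in first: pandas reads a missing review as a float NaN, and a float has no translate() method. The sample sentence and the variable names `cleaned` and `nan_failed` are illustrative only.

```python
import string

def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

cleaned = remove_punctuation("Wow!!! It's great, isn't it?")
print(cleaned)

# A float NaN (how pandas represents a missing review) cannot be translated.
try:
    remove_punctuation(float('nan'))
    nan_failed = False
except AttributeError:
    nan_failed = True
print(nan_failed)
```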

- Apply the remove_punctuation function to every element of the review column and assign the result to a new column named review_clean. Note: many data frame packages support an apply operation for this type of task. Consult the appropriate manual.

In [41]:
products['review_clean'] = products.review.apply(remove_punctuation)

In [47]:
print(products.review_clean[5])
print(products.review[5])

Beautiful book I love it to record cherished times in my great granddaughters life with the beautiful pastel pink color
Beautiful book, I love it to record cherished times in my great granddaughters life with the beautiful pastel pink color.
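The second transformation listed earlier, computing word counts for the important words, could be sketched along the following lines. This is only one possible approach, shown on hypothetical miniature data (`toy` and `toy_words` are stand-ins for products and the 193 important words). It splits on whitespace and counts exact, case-sensitive matches.

```python
import string
import pandas as pd

def remove_punctuation(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

# Hypothetical miniature data; the assignment itself uses products and 193 words.
toy = pd.DataFrame({'review': ['Great book, great price!', 'Not great.', '']})
toy_words = ['great', 'book']

toy['review_clean'] = toy['review'].apply(remove_punctuation)

# One count column per word: split on whitespace and count exact matches.
# Matching is case-sensitive, so 'Great' is not counted as 'great'.
for word in toy_words:
    toy[word] = toy['review_clean'].apply(lambda s, w=word: s.split().count(w))

print(toy[['review_clean', 'great', 'book']])
```

Binding the loop variable through the default argument `w=word` keeps each lambda tied to its own word.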
