# Logistic regression from scratch

## Objectives:
- Implement the link function for logistic regression.
- Write a function to compute the derivative of the log likelihood function with respect to a single coefficient.
- Implement gradient ascent.
- Compute classification accuracy for the logistic regression model.

## Import library and load data

In [1]:
import pandas as pd
from sframe import SFrame

In [2]:
products = SFrame('amazon_baby_subset.gl/')



In [3]:
products = pd.DataFrame(products.to_numpy(), columns=['name', 'review', 'rating', 'sentiment'])

## EDA

In [4]:
products.head(5)

| | name | review | rating | sentiment |
|---|---|---|---|---|
| 0 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 | 1 |
| 1 | Nature's Lullabies Second Year Sticker Calendar | We wanted to get something to keep track of ou... | 5 | 1 |
| 2 | Nature's Lullabies Second Year Sticker Calendar | My daughter had her 1st baby over a year ago. ... | 5 | 1 |
| 3 | Lamaze Peekaboo, I Love You | One of baby's first and favorite books, and it... | 4 | 1 |
| 4 | SoftPlay Peek-A-Boo Where's Elmo A Children's ... | Very cute interactive book! My son loves this ... | 5 | 1 |


In [5]:
# Number of reviews with positive and negative sentiment
products[products['sentiment'] == 1].shape[0], products.shape[0] - products[products['sentiment'] == 1].shape[0]

(26579, 26493)

### Load important words

In [6]:
import json

In [7]:
with open('important_words.json','r') as f:
    important_words = json.load(f)

In [8]:
important_words = [str(s) for s in important_words]

### Clean reviews

In [9]:
# fillna returns a new Series, so the result has to be assigned back
products['review'] = products['review'].fillna('')

In [10]:
def remove_punc(text):
    import string
    return text.translate(None, string.punctuation)

In [11]:
products['review_clean'] = products['review'].apply(remove_punc)
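The two-argument `str.translate(None, ...)` call used above only works on Python 2 byte strings. A minimal sketch of an equivalent for Python 3, assuming the reviews are ordinary `str` values; `remove_punc_py3` is a hypothetical name that does not appear in the original notebook:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punc_py3(text):
    """Return text with ASCII punctuation removed (Python 3 only)."""
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punc_py3("Very cute, interactive book!")
```

Building the table once at module level avoids rebuilding it for every review.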

### Generate word count

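Before counting the important words, the reviews should be converted to lower case so that capitalized occurrences are not missed. The conversion is left commented out in the cell below, so the counts that follow are case-sensitive.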

In [12]:
# Lowercase the reviews so that capitalized occurrences are not missed (left disabled here)
#products['review_clean'] = products['review_clean'].apply(lambda s: s.lower())

Then we generate the count for each important word.

In [13]:
for word in important_words:
    # str.count is case-sensitive and matches substrings, so 'perfect' also counts 'perfectly'
    products[word] = products.review_clean.str.count(word)
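`Series.str.count` treats its argument as a regular expression and is case-sensitive by default, so it counts substring matches: `perfect` is found inside `perfectly`, while `Perfect` is missed. The difference can be sketched with the standard `re` module on a made-up review. `count_substring` and `count_whole_word` are hypothetical helpers for illustration only, and the whole-word variant is an alternative, not what the notebook computes:

```python
import re

def count_substring(text, word):
    """Case-sensitive substring count (the matching rule str.count applies to a plain word)."""
    return len(re.findall(re.escape(word), text))

def count_whole_word(text, word):
    """Case-insensitive count of word as a whole token."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up review, already stripped of punctuation.
review = "Perfect gift it fits perfectly and is perfectly sized"
substring_hits = count_substring(review, 'perfect')    # matches inside both 'perfectly'
whole_word_hits = count_whole_word(review, 'perfect')  # matches only 'Perfect'
```

Which rule is appropriate depends on whether inflected forms such as `perfectly` should contribute to the `perfect` feature.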

``Quiz:`` Number of reviews that contain 'perfect'

In [14]:
(products['perfect'] > 0).sum()

4246

In [15]:
def get_feature_matrix(df,features,label):
    """
    Extract the feature matrix and the target matrix from the dataframe
    
    Parameters
    ----------
    df: The dataframe(SFrame or Dataframe) containing the data points
    
    features: list of features picked to form the feature matrix
    
    label: The target feature of the model
    
    Return
    ------
    feature_matrix : A 2D array containing all features for all data points
    
    label_array : A 1D array containing the class labels for all instances
    """
    df['intercept'] = 1
    features = ['intercept'] + features
    # .values works across pandas versions; as_matrix() is deprecated and removed in pandas 1.0
    feature_matrix = df[features].values
    label_array = df[label].values
    return(feature_matrix, label_array)
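Note that `get_feature_matrix` writes an `intercept` column into the dataframe it receives. A sketch of an alternative that leaves the dataframe untouched by prepending the constant column on the NumPy side; `add_intercept` is a hypothetical helper and the small array stands in for the real word counts:

```python
import numpy as np

def add_intercept(feature_values):
    """Prepend a column of ones to a 2D feature array."""
    feature_values = np.asarray(feature_values, dtype=float)
    ones = np.ones((feature_values.shape[0], 1))
    return np.hstack([ones, feature_values])

# Toy stand-in for the word-count columns of two reviews.
word_counts = np.array([[2., 3.], [-1., -1.]])
with_intercept = add_intercept(word_counts)
```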

In [16]:
import numpy as np

In [17]:
feature_matrix, sentiment = get_feature_matrix(products, important_words, 'sentiment')

### The predict method

In [18]:
def predict_probability(feature_matrix, weights):
    """
    Predict the probability that each instance belongs to class 1
    
    Parameters
    ----------
    feature_matrix: The 2D array containing all selected features of all data points
    
    weights: The learnt coefficients of the linear score function
    
    Return
    ------
    prob: The probability of class 1 for each instance
    """
    scores = np.dot(feature_matrix,weights)
    prob = 1 / (1+np.exp(-scores))
    
    return prob
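For strongly negative scores, `np.exp(-scores)` overflows to `inf` and NumPy emits a warning, even though the resulting probability still comes out as 0. One common workaround, sketched here as an assumption rather than something the notebook does, evaluates the sigmoid in a form that never exponentiates a positive number. `stable_sigmoid` is a hypothetical name:

```python
import numpy as np

def stable_sigmoid(scores):
    """Sigmoid of a 1D array of scores without overflow in np.exp."""
    scores = np.asarray(scores, dtype=float)
    out = np.empty_like(scores)
    pos = scores >= 0
    # For non-negative scores, exp(-s) is at most 1.
    out[pos] = 1.0 / (1.0 + np.exp(-scores[pos]))
    # For negative scores, use the equivalent form exp(s) / (1 + exp(s)).
    exp_s = np.exp(scores[~pos])
    out[~pos] = exp_s / (1.0 + exp_s)
    return out

probabilities = stable_sigmoid(np.array([4., -1., -1000.]))
```

The first two entries agree with the checkpoint values of `predict_probability` for scores 4 and -1.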

## Checkpoint: A test of the previous methods

In [19]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])

correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),          1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )

print 'The following outputs must match '
print '------------------------------------------------'
print 'correct_predictions           =', correct_predictions
print 'output of predict_probability =', predict_probability(dummy_feature_matrix, dummy_coefficients)

The following outputs must match 
------------------------------------------------
correct_predictions           = [ 0.98201379  0.26894142]
output of predict_probability = [ 0.98201379  0.26894142]


#### The derivative with respect to a single coefficient

In [20]:
def feature_derivative(errors, feature):
    """
    Compute the derivative of the log-likelihood with respect to a single coefficient
    
    Parameters
    ----------
    errors: A 1D array of differences between the indicator (1 if the label is +1, else 0) and the predicted probability of class +1
    
    feature: A 1D array containing values of a single feature for all data points
    
    Return
    ------
    The derivative of the log-likelihood with respect to the coefficient w_j of that feature.
    """
    return np.dot(errors,feature)
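Because each derivative is a dot product of `errors` with one column of the feature matrix, all of them can be obtained in a single matrix-vector product. A minimal sketch on the dummy data from the checkpoint; `full_gradient` is a hypothetical name, not part of the notebook:

```python
import numpy as np

def full_gradient(feature_matrix, errors):
    """Derivatives of the log-likelihood for all coefficients at once."""
    return np.dot(feature_matrix.T, errors)

feature_matrix = np.array([[1., 2., 3.], [1., -1., -1.]])
weights = np.array([1., 3., -1.])
indicator = np.array([0., 1.])  # 1 where the label is +1

probabilities = 1. / (1. + np.exp(-np.dot(feature_matrix, weights)))
errors = indicator - probabilities

# Per-coefficient derivatives, computed one column at a time.
per_feature = np.array([np.dot(errors, feature_matrix[:, j])
                        for j in range(feature_matrix.shape[1])])
gradient = full_gradient(feature_matrix, errors)
```

Both `per_feature` and `gradient` hold the same three numbers; the vectorized form simply avoids the Python-level loop.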

### The log-likelihood function

In [21]:
def compute_log_likelihood(feature_matrix,sentiment,weights):
    """
    Compute the log-likelihood of the observed labels under the current weights.
    
    Parameters
    ----------
    feature_matrix: A 2D array containing all selected features of all instances
    
    sentiment: A 1D array containing class labels of all instances
    
    weights: The coefficients of all features
    
    Return
    ------
    log_like : The log likelihood for the current iteration
    """
    indicator = sentiment == 1
    scores = np.dot(feature_matrix,weights)
    log_like = np.sum((indicator-1)*scores - np.log(1.+np.exp(-scores)))
    
    return log_like
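The term `np.log(1. + np.exp(-scores))` can overflow when a score is very negative. A hedged sketch of the same quantity using `np.logaddexp`, which evaluates `log(exp(a) + exp(b))` in a numerically stable way; `compute_log_likelihood_stable` is a hypothetical name and this variant is an assumption, not the notebook's implementation:

```python
import numpy as np

def compute_log_likelihood_stable(feature_matrix, sentiment, weights):
    """Same log-likelihood formula, with log(1 + exp(-s)) computed stably."""
    indicator = (sentiment == 1).astype(float)
    scores = np.dot(feature_matrix, weights)
    # logaddexp(0, -s) == log(exp(0) + exp(-s)) == log(1 + exp(-s))
    return np.sum((indicator - 1.) * scores - np.logaddexp(0., -scores))

dummy_feature_matrix = np.array([[1., 2., 3.], [1., -1., -1.]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([-1, 1])
ll = compute_log_likelihood_stable(dummy_feature_matrix, dummy_sentiment,
                                   dummy_coefficients)

# An extreme score that would overflow np.exp in the direct formula.
extreme_ll = compute_log_likelihood_stable(np.array([[1.]]), np.array([-1]),
                                           np.array([-1000.]))
```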

## Checkpoint: A test of the log-likelihood function

In [22]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([-1, 1])

correct_indicators  = np.array( [ -1==+1,1==+1 ] )
correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_first_term  = np.array( [ (correct_indicators[0]-1)*correct_scores[0],  (correct_indicators[1]-1)*correct_scores[1] ] )
correct_second_term = np.array( [ np.log(1. + np.exp(-correct_scores[0])),np.log(1. + np.exp(-correct_scores[1])) ] )

correct_ll = sum( [ correct_first_term[0]-correct_second_term[0], correct_first_term[1]-correct_second_term[1] ] ) 

print 'The following outputs must match '
print '------------------------------------------------'
print 'correct_log_likelihood           =', correct_ll
print 'output of compute_log_likelihood =', compute_log_likelihood(dummy_feature_matrix, dummy_sentiment, dummy_coefficients)

The following outputs must match 
------------------------------------------------
correct_log_likelihood           = -5.33141161544
output of compute_log_likelihood = -5.33141161544


### Implementation of logistic regression

In [24]:
def logistic_regression(feature_matrix, sentiment, 
                        initial_coefficients, step_size, max_iter):
    """
    Learn the optimal weights for the logistic regression model using gradient ascent.
    
    Parameters
    ----------
    feature_matrix: A 2D array of features for all instances
    
    sentiment: A 1D array of true class labels
    
    initial_coefficients: The initial weights for all features
    
    step_size: The step size for gradient ascent
    
    max_iter: The maximum number of iterations
    
    Return
    ------
    The final coefficients for all features (a 1D array)
    """
    weights = np.array(initial_coefficients)
    
    for itr in range(max_iter):
        prediction = predict_probability(feature_matrix, weights)
        indicator = sentiment == 1
        errors = indicator - prediction
        
        for j in range(len(weights)):
            derivative = feature_derivative(errors, feature_matrix[:,j])
            weights[j] += step_size * derivative
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, weights)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)      
    return weights
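Since `errors` is computed once per iteration and then reused for every coefficient, the inner loop over `j` amounts to one vector update. A compact sketch of that idea on toy data, without the progress printing; `gradient_ascent` and `log_likelihood` are hypothetical names, and the toy step size and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def log_likelihood(feature_matrix, indicator, weights):
    """Log-likelihood of the observed labels for the given weights."""
    scores = np.dot(feature_matrix, weights)
    return np.sum((indicator - 1.) * scores - np.logaddexp(0., -scores))

def gradient_ascent(feature_matrix, sentiment, initial_coefficients,
                    step_size, max_iter):
    """Gradient ascent with all coefficients updated in one vector step."""
    weights = np.array(initial_coefficients, dtype=float)
    indicator = (sentiment == 1).astype(float)
    for _ in range(max_iter):
        scores = np.dot(feature_matrix, weights)
        probabilities = 1. / (1. + np.exp(-scores))
        errors = indicator - probabilities
        # One matrix-vector product replaces the loop over coefficients.
        weights += step_size * np.dot(feature_matrix.T, errors)
    return weights

# Toy data: the dummy matrix and labels from the checkpoints.
X = np.array([[1., 2., 3.], [1., -1., -1.]])
y = np.array([-1, 1])
indicator = (y == 1).astype(float)

ll_before = log_likelihood(X, indicator, np.zeros(3))
learned = gradient_ascent(X, y, np.zeros(3), step_size=0.1, max_iter=20)
ll_after = log_likelihood(X, indicator, learned)
```

On this toy problem `ll_after` is larger than `ll_before`, which is the behaviour the progress printout in `logistic_regression` is meant to confirm.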

### Train the model to learn the weights

In [25]:
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

iteration   0: log likelihood of observed labels = -36775.36728268
iteration   1: log likelihood of observed labels = -36764.12455183
iteration   2: log likelihood of observed labels = -36752.97527093
iteration   3: log likelihood of observed labels = -36741.91590009
iteration   4: log likelihood of observed labels = -36730.94305575
iteration   5: log likelihood of observed labels = -36720.05350394
iteration   6: log likelihood of observed labels = -36709.24415380
iteration   7: log likelihood of observed labels = -36698.51205139
iteration   8: log likelihood of observed labels = -36687.85437371
iteration   9: log likelihood of observed labels = -36677.26842295
iteration  10: log likelihood of observed labels = -36666.75162105
iteration  11: log likelihood of observed labels = -36656.30150438
iteration  12: log likelihood of observed labels = -36645.91571871
iteration  13: log likelihood of observed labels = -36635.59201438
iteration  14: log likelihood of observed labels = -36625.3282

## Own experiment: tuning step_size

In [26]:
coefficients3 = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-4, max_iter=301)

iteration   0: log likelihood of observed labels = -40801.83940538
iteration   1: log likelihood of observed labels = -305491.31527983
iteration   2: log likelihood of observed labels = -287437.70722149
iteration   3: log likelihood of observed labels = -225046.99200230
iteration   4: log likelihood of observed labels = -319964.77159845
iteration   5: log likelihood of observed labels = -149401.85327191
iteration   6: log likelihood of observed labels = -331743.80178196
iteration   7: log likelihood of observed labels = -104425.51723589
iteration   8: log likelihood of observed labels = -280117.07484287
iteration   9: log likelihood of observed labels = -121174.06258823
iteration  10: log likelihood of observed labels = -265335.82234271
iteration  11: log likelihood of observed labels = -112545.67693600
iteration  12: log likelihood of observed labels = -233343.17991458
iteration  13: log likelihood of observed labels = -119707.94841859
iteration  14: log likelihood of observed labels 

In [27]:
## own experiment on step_size
coefficients2 = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-5, max_iter=301)

iteration   0: log likelihood of observed labels = -35892.79887963
iteration   1: log likelihood of observed labels = -35138.30167197
iteration   2: log likelihood of observed labels = -34498.09230738
iteration   3: log likelihood of observed labels = -33926.93024038
iteration   4: log likelihood of observed labels = -33426.47718771
iteration   5: log likelihood of observed labels = -32955.63297414
iteration   6: log likelihood of observed labels = -32534.54152791
iteration   7: log likelihood of observed labels = -32130.66610131
iteration   8: log likelihood of observed labels = -31767.29274598
iteration   9: log likelihood of observed labels = -31423.30318664
iteration  10: log likelihood of observed labels = -31112.92753696
iteration  11: log likelihood of observed labels = -30824.43165046
iteration  12: log likelihood of observed labels = -30561.97850986
iteration  13: log likelihood of observed labels = -30319.45682083
iteration  14: log likelihood of observed labels = -30096.0665

In [28]:
coefficients4 = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1.25e-5, max_iter=301)

iteration   0: log likelihood of observed labels = -35741.41177794
iteration   1: log likelihood of observed labels = -35037.43761085
iteration   2: log likelihood of observed labels = -34798.91220270
iteration   3: log likelihood of observed labels = -34811.33611600
iteration   4: log likelihood of observed labels = -35378.06925980
iteration   5: log likelihood of observed labels = -34868.59096597
iteration   6: log likelihood of observed labels = -35211.95945269
iteration   7: log likelihood of observed labels = -33899.30545839
iteration   8: log likelihood of observed labels = -33970.79361608
iteration   9: log likelihood of observed labels = -32723.41084611
iteration  10: log likelihood of observed labels = -32644.16029290
iteration  11: log likelihood of observed labels = -31661.01385026
iteration  12: log likelihood of observed labels = -31485.91194390
iteration  13: log likelihood of observed labels = -30745.57951567
iteration  14: log likelihood of observed labels = -30517.9647
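In the runs above, the log likelihood rises at every printed iteration for `step_size=1e-5`, dips between iterations 2 and 3 for `step_size=1.25e-5`, and drops sharply after the first iteration for `step_size=1e-4`. A small helper can flag this automatically. `first_decrease` is a hypothetical name, and the traces are the first few values copied from the outputs above:

```python
def first_decrease(trace):
    """Index of the first value lower than its predecessor, or None."""
    for i in range(1, len(trace)):
        if trace[i] < trace[i - 1]:
            return i
    return None

# First printed log likelihoods for three of the step sizes tried above.
trace_1e5 = [-35892.79887963, -35138.30167197, -34498.09230738,
             -33926.93024038, -33426.47718771]
trace_125e5 = [-35741.41177794, -35037.43761085, -34798.91220270,
               -34811.33611600, -35378.06925980]
trace_1e4 = [-40801.83940538, -305491.31527983, -287437.70722149]

results = {
    '1e-5': first_decrease(trace_1e5),
    '1.25e-5': first_decrease(trace_125e5),
    '1e-4': first_decrease(trace_1e4),
}
```

A decrease in the log likelihood is a sign that the step overshot, so a check like this could be used to reject a step size early.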

In [29]:
products['predic_prob1'] = predict_probability(feature_matrix,coefficients)
products['predic_prob2'] = predict_probability(feature_matrix,coefficients2)

### Predict sentiment using the learnt coefficients

In [30]:
scores = np.dot(feature_matrix,coefficients)
scores2 = np.dot(feature_matrix,coefficients2)
scores4 = np.dot(feature_matrix,coefficients4)

In [31]:
products['predic_scores1'] = scores
products['predic_scores2'] = scores2
products['predic_scores4'] = scores4

In [32]:
products['predic_sentiment1'] = products['predic_scores1'].apply(lambda s: 1 if s>0 else -1)
products['predic_sentiment2'] = products['predic_scores2'].apply(lambda s: 1 if s>0 else -1)
products['predic_sentiment4'] = products['predic_scores4'].apply(lambda s: 1 if s>0 else -1)
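Thresholding the score at 0 is the same as thresholding the predicted probability at 0.5, because the sigmoid of 0 is 0.5. A minimal vectorized sketch with `np.where` on made-up scores, as an alternative to the row-wise `apply`:

```python
import numpy as np

# Made-up scores; a score of exactly 0 falls into the negative class,
# matching the `1 if s>0 else -1` rule used above.
scores = np.array([4., -1., 0.])
from_scores = np.where(scores > 0, 1, -1)

probabilities = 1. / (1. + np.exp(-scores))
from_probabilities = np.where(probabilities > 0.5, 1, -1)
```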

In [33]:
# Number of instances predicted as positive
(products['predic_sentiment1'] == 1).sum(),(products['predic_sentiment2'] == 1).sum()

(24423, 26068)

In [34]:
# accuracy of coefficients and coefficients2
round((products['predic_sentiment1'] == products['sentiment']).sum()*1./products.shape[0],2),round((products['predic_sentiment2'] == products['sentiment']).sum()*1./products.shape[0],4)

(0.74, 0.7993)

In [35]:
# Accuracy of coefficients4, which is believed to be near the optimum with respect to step_size
round((products['predic_sentiment4'] == products['sentiment']).sum()*1./products.shape[0],4)

0.8002
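To put these accuracies in context, it helps to compare them with the majority-class baseline, which follows directly from the class counts reported earlier (26579 positive, 26493 negative). A short sketch; `accuracy` is a hypothetical helper and the four-element arrays are made up:

```python
import numpy as np

def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    return np.mean(np.asarray(predictions) == np.asarray(labels))

# Always predicting the larger class gives this accuracy.
n_positive, n_negative = 26579, 26493
majority_baseline = max(n_positive, n_negative) / float(n_positive + n_negative)

toy_accuracy = accuracy([1, -1, 1, 1], [1, -1, -1, 1])
```

The baseline is close to 0.50 because the subset is nearly balanced, so the reported accuracies of 0.74 to 0.80 are well above chance.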

### Find the most positive and negative words

In [36]:
coefficients = list(coefficients)[1:]
word_coefficient_paris = [(word,coef) for word,coef in zip(important_words,coefficients)]


In [37]:
word_coef_dict = {a:b for a,b in zip(important_words,coefficients)}

In [38]:
word_coef_dict['work']

-0.03583028769555744

#### Ten most positive words

In [39]:
sorted(word_coefficient_paris,key=lambda x: x[1],reverse=True)[:10]

[('love', 0.11485512972899151),
 ('great', 0.070255174595001635),
 ('easy', 0.06728827346724775),
 ('little', 0.048671064877459043),
 ('loves', 0.043117605892266361),
 ('perfect', 0.043094103127603575),
 ('old', 0.039751364733730123),
 ('well', 0.034230940747470302),
 ('able', 0.028612995130438346),
 ('daughter', 0.025786498659130356)]

#### Ten most negative words

In [40]:
sorted(word_coefficient_paris,key=lambda x: x[1])[:10]

[('would', -0.055627299480496724),
 ('return', -0.053075214662185677),
 ('product', -0.042461104690891809),
 ('money', -0.037007585179380666),
 ('work', -0.03583028769555744),
 ('one', -0.032410014190202972),
 ('disappointed', -0.028521910126023561),
 ('get', -0.027573075630982241),
 ('us', -0.027328738685513963),
 ('even', -0.026471612072450475)]
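Both rankings can be read off a single sort: the head of a descending sort gives the most positive words and the reversed tail gives the most negative ones. A minimal sketch on five of the coefficients shown above; the dictionary is a small excerpt for illustration, not the full `word_coef_dict`:

```python
# Excerpt of word/coefficient pairs copied from the outputs above.
word_coef = {
    'love': 0.11485512972899151,
    'great': 0.070255174595001635,
    'would': -0.055627299480496724,
    'return': -0.053075214662185677,
    'work': -0.03583028769555744,
}

# Sort once, from most positive to most negative coefficient.
ranked = sorted(word_coef.items(), key=lambda pair: pair[1], reverse=True)

most_positive = ranked[:2]
most_negative = ranked[-2:][::-1]  # reversed so the most negative comes first
```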