# Logistic Regression

In this notebook, we explore [*Logistic Regression*](https://en.wikipedia.org/wiki/Logistic_regression) and its application to sentiment analysis.

The goal is to use [amazon_baby_subset.csv](../Data/amazon_baby_subset.csv), which contains 4 columns: product name, customer review, rating, and sentiment. The rating goes from 1 (worst) to 5 (best), and the sentiment is -1 if the rating is low (< 3) and 1 if it is good (>= 3). Here, logistic regression is used to assign a weight to each important word in the reviews (the important words are given for now; in the future, we will see how to select them) and to build a model that predicts whether a future review is good or bad.

First, let's load the packages we need and the dataset.

In [1]:
import turicreate as tc
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
products = tc.SFrame('../Data/amazon_baby_subset.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
products.head()

name,review,rating,sentiment
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,We wanted to get something to keep track ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,My daughter had her 1st baby over a year ago. ...,5,1
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4,1
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,Very cute interactive book! My son loves this ...,5,1
Our Baby Girl Memory Book,"Beautiful book, I love it to record cherished t ...",5,1
Hunnt&reg; Falling Flowers and Birds Kids ...,"Try this out for a spring project !Easy ,fun and ...",5,1
Blessed By Pope Benedict XVI Divine Mercy Full ...,very nice Divine Mercy Pendant of Jesus now on ...,5,1
Cloth Diaper Pins Stainless Steel ...,We bought the pins as my 6 year old Autistic son ...,4,1
Cloth Diaper Pins Stainless Steel ...,It has been many years since we needed diaper ...,5,1


We can count how many positive and negative reviews the data set has.

In [4]:
print 'Number of positive reviews =', len(products[products['sentiment']==1])
print 'Number of negative reviews =', len(products[products['sentiment']==-1])

Number of positive reviews = 26579
Number of negative reviews = 26493


Pretty close numbers!

The reviews contain punctuation. Let's remove it and also create columns with the count of each important word. The important words are in the file [important_words.json](../Data/important_words.json).

## Cleaning the data

First, load the important words from the *json* file:

In [5]:
with open('../Data/important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

Now, let's remove the punctuation from the reviews:

In [6]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)
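The cell above uses the two-argument form of `str.translate`, which exists only in Python 2. As a minimal sketch, assuming the notebook were run under Python 3 instead, the same cleaning could be done with a translation table. The name `remove_punctuation_py3` and the sample sentence are illustrative and not part of the original notebook.

```python
import string

def remove_punctuation_py3(text):
    # str.maketrans('', '', chars) builds a table that deletes every character in `chars`
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

sample = "Very cute, interactive book! My son loves this."
cleaned = remove_punctuation_py3(sample)
```

Here `cleaned` keeps the words and spaces and drops the comma, exclamation mark, and period.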

The next step is to count each important word and create new columns with the results.

In [7]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
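To see what this counting rule does and does not match, here is a small sketch in plain Python. The word list and the review text are made up for illustration; only the `split().count(word)` rule comes from the cell above.

```python
toy_words = ['baby', 'love', 'great']   # hypothetical short word list
toy_review = 'great seat my baby loves it and my other baby does too'

# Same rule as above: split on whitespace, then count exact token matches
toy_counts = {word: toy_review.split().count(word) for word in toy_words}
```

The match is exact and case-sensitive, so `loves` does not count toward `love`. That is consistent with `important_words` listing `love` and `loves` as separate entries.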

In [8]:
products[important_words[0]] # This is the count of the word 'baby' for each review.

dtype: int
Rows: 53072
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, ... ]

In [9]:
len(products[products[important_words[0]] > 0]) # Number of reviews that contain the word 'baby'

12174

How many times does the word *baby* appear across all the reviews?

In [10]:
sum(products['baby'])

18715

## Starting the logistic regression

Logistic regression deals with categorical outcomes. Here, in sentiment analysis, the outcome takes only 2 values: 1 for a good review and -1 for a bad review.

The whole idea is to estimate the probability that a review is good or bad. This probability is computed using the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) (also known as the sigmoid function, which here links the score $\mathbf{w}^T h(\mathbf{x}_i)$ to a probability):

$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w}^T h(\mathbf{x}_i)}} $$

where, in our code, the feature vector $h(\mathbf{x}_i)$ holds the counts of the **important_words** in the review $\mathbf{x}_i$. For a given weight vector $\mathbf{w}$ and review $\mathbf{x}_i$, the equation gives the probability that the sentiment $y_i$ is positive (equal to 1). The values range from 0 to 1. We can choose a threshold (usually 0.5): a probability above it is classified as positive sentiment, and a probability at or below it as negative.
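As a quick worked example of the formula, the sketch below evaluates it for one toy review. The weights, the two words, and the counts are invented for illustration and are not values from the dataset.

```python
import numpy as np

def sigmoid(score):
    # Logistic function: maps any real score to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([0.0, 1.0, -2.0])   # hypothetical weights: intercept, 'great', 'disappointed'
h = np.array([1.0, 2.0, 1.0])    # intercept term, 'great' twice, 'disappointed' once

score = np.dot(w, h)              # 0*1 + 1*2 - 2*1 = 0
prob_positive = sigmoid(score)    # a score of 0 gives a probability of exactly 0.5
label = 1 if prob_positive > 0.5 else -1
```

This review sits exactly on the decision boundary, so with a strict `> 0.5` rule it is labeled negative.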

The best predictions come from optimized weights, and a good way to obtain such weights is by [maximizing the likelihood function](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). The [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function) is written as:

$$\ell(\mathbf{w}) = \prod_{i = 1}^{N}P(y_i | \mathbf{x}_i,\mathbf{w})$$

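The optimization can be done with gradient ascent (equivalently, gradient descent on the negative of the function). To find the optimal weights, we need the derivative of the likelihood function with respect to $\mathbf{w}$. However, differentiating a product of $N$ terms is cumbersome. A good strategy is to work with the [natural logarithm](https://en.wikipedia.org/wiki/Natural_logarithm) of the likelihood function, called the **log-likelihood**. Because the logarithm is monotonically increasing, the weights that maximize the log-likelihood also maximize the likelihood.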

$$\ell\ell(\mathbf{w}) = \ln\prod_{i = 1}^{N}P(y_i | \mathbf{x}_i,\mathbf{w}) = \sum_{i = 1}^{N}\ln P(y_i | \mathbf{x}_i,\mathbf{w})$$
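Besides simplifying the derivative, the sum of logarithms is also safer numerically. The sketch below uses a made-up case (2000 reviews, each with probability 0.5) to show the raw product underflowing in double precision while the log-likelihood stays finite.

```python
import numpy as np

probs = np.full(2000, 0.5)   # hypothetical: 2000 reviews, each with P(y_i | x_i, w) = 0.5

likelihood = np.prod(probs)              # 0.5**2000 is below the smallest float64, so this is 0.0
log_likelihood = np.sum(np.log(probs))   # 2000 * ln(0.5), a finite negative number
```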

The log-likelihood can be computed using the following formula:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \{ (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln[1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))]\} $$
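To see where this formula comes from, write $s_i = \mathbf{w}^T h(\mathbf{x}_i)$ (a shorthand introduced only for this derivation) and treat the two possible labels separately. For a positive review:

$$\ln P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \ln\frac{1}{1 + e^{-s_i}} = -\ln(1 + e^{-s_i})$$

For a negative review:

$$\ln P(y_i = -1 | \mathbf{x}_i,\mathbf{w}) = \ln\left(1 - \frac{1}{1 + e^{-s_i}}\right) = \ln\frac{e^{-s_i}}{1 + e^{-s_i}} = -s_i - \ln(1 + e^{-s_i})$$

The two cases differ only by the term $-s_i$, which appears when $y_i = -1$. The factor $(\mathbf{1}[y_i = +1] - 1)$ is 0 for positive reviews and $-1$ for negative ones, so it switches that term on and off, which gives the single expression above.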

The derivative of the log-likelihood is:

$$\frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)$$

where $\mathbf{1}[y_i = +1]$ is the indicator function: it equals 1 if the sentiment is positive (+1) and 0 if the sentiment is negative (-1). Now that we can compute the gradient of the log-likelihood, the gradient method can be applied. Since we are maximizing, each step moves the weights in the direction of the gradient (gradient ascent):

$$w_j^{(t+1)} = w_j^{(t)} + \eta\frac{\partial\ell\ell}{\partial w_j}$$

where $\eta$ is the step size and $t$ is the iteration number.
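A common sanity check for a hand-derived gradient is to compare it with a finite-difference approximation. The sketch below does that on a small random dataset. The helper names `log_likelihood` and `gradient`, the data, and the weights are all illustrative assumptions, not part of the notebook.

```python
import numpy as np

def log_likelihood(H, y, w):
    # Same formula as in the text: sum of (1[y=+1] - 1) * s - ln(1 + exp(-s))
    scores = H.dot(w)
    return np.sum(((y == 1) - 1) * scores - np.log(1.0 + np.exp(-scores)))

def gradient(H, y, w):
    # Analytic derivative: sum_i h_j(x_i) * (1[y_i=+1] - P(y_i=+1 | x_i, w))
    prob_positive = 1.0 / (1.0 + np.exp(-H.dot(w)))
    return H.T.dot((y == 1) - prob_positive)

rng = np.random.RandomState(0)
H = np.hstack([np.ones((20, 1)), rng.randint(0, 3, size=(20, 2))])  # intercept + 2 count features
y = rng.choice([-1, 1], size=20)
w = np.array([0.1, -0.2, 0.3])

# Central differences, one coefficient at a time
eps = 1e-6
numeric = np.array([
    (log_likelihood(H, y, w + eps * e) - log_likelihood(H, y, w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
max_gap = np.max(np.abs(numeric - gradient(H, y, w)))
```

If the derivative formula is right, `max_gap` should be tiny compared with the gradient entries themselves.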

Great! Now let's implement the equations above to find the optimal weights $\mathbf{w}$. Then we can use them to predict the sentiment of the reviews.

The first step is to convert the Turicreate SFrame data to NumPy arrays to perform the math. The following function receives the SFrame, the list of desired features (the important word counts), and the class label (in this case, 'sentiment'). It returns two outputs: a matrix with the features (word counts) plus an intercept column (constant 1), and an array with the sentiment (-1 or +1) of each review.

In [11]:
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1 # Constant feature (always 1) that multiplies the intercept coefficient
    features = ['intercept'] + features # Including 'intercept' in the features array
    features_sframe = data_sframe[features] # Saving a sframe with only the desired features 
    feature_matrix = features_sframe.to_numpy() # Converting the features sframe to a numpy array (matrix)
    label_sarray = data_sframe[label] # Picking the desired label
    label_array = label_sarray.to_numpy() # Converting the label to a numpy array
    return(feature_matrix, label_array)
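For readers without Turicreate, the intercept step can be sketched in plain NumPy. The word counts below are invented for illustration; the point is only that a column of ones is placed in front of the count columns, which is why the matrix ends up with one more column than there are important words.

```python
import numpy as np

counts = np.array([[0, 2],
                   [1, 0],
                   [3, 1]])                    # hypothetical word counts: 3 reviews x 2 words
intercept = np.ones((counts.shape[0], 1))      # one constant feature per review
toy_feature_matrix = np.hstack([intercept, counts])
```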

In [12]:
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')
feature_matrix.shape

(53072, 194)

In [13]:
print len(products) # Number of reviews
print len(important_words) + 1 # Number of features plus the intercept

53072
194


Okay, the *feature_matrix* matches in size the number of reviews and the number of features (plus the intercept).

Let's now create a function to calculate the probability of a positive sentiment (the probability of a negative sentiment is just one minus it), given the feature matrix and an array of coefficients. The output is an array with one predicted probability per review (one per row of the feature matrix).

In [14]:
def predict_probability(feature_matrix, coefficients):
    # Computing P(y_i = +1 | x_i, w), using the logistic function.
    # The two arguments must be dot-compatible in this order, e.g. shapes (N, D) and (D,).
    predictions = 1/(1 + np.exp(-np.dot(feature_matrix, coefficients)))
    return predictions
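One caveat with the direct formula is that `np.exp` overflows for large negative scores. The sketch below shows one common workaround: pick, per element, the algebraically equivalent form whose exponent is never positive. The name `stable_sigmoid` is an assumption for illustration, and the function expects a 1-D array-like of scores.

```python
import numpy as np

def stable_sigmoid(scores):
    scores = np.asarray(scores, dtype=float)
    out = np.empty_like(scores)
    pos = scores >= 0
    # For non-negative scores exp(-s) <= 1, so the usual form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-scores[pos]))
    # For negative scores use exp(s) / (1 + exp(s)), where exp(s) < 1
    exp_s = np.exp(scores[~pos])
    out[~pos] = exp_s / (1.0 + exp_s)
    return out

probs = stable_sigmoid([-1000.0, 0.0, 1000.0])
```

With word counts and small coefficients the scores in this notebook stay moderate, so this is a precaution rather than a required change.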

We will now write a function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:

* **errors** vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* **feature** vector containing $h_j(\mathbf{x}_i)$  for all $i$. 

The derivative is just the dot product of the errors and the feature.

In [15]:
def feature_derivative(errors, feature):     
    derivative = np.dot(errors, feature)
    return derivative
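Calling this function once per feature gives the same numbers as a single matrix-vector product over the whole feature matrix. The sketch below checks that on random toy data; the arrays are placeholders, not values from the reviews.

```python
import numpy as np

rng = np.random.RandomState(1)
toy_features = rng.randint(0, 4, size=(6, 3)).astype(float)   # 6 reviews, 3 features
toy_errors = rng.uniform(-1, 1, size=6)                        # stand-in for indicator - probability

# One dot product per feature, as in feature_derivative
per_feature = np.array([np.dot(toy_errors, toy_features[:, j]) for j in range(3)])

# The same gradient from a single matrix-vector product
all_at_once = toy_features.T.dot(toy_errors)
```

The vectorized form avoids a Python-level loop over the coefficients, which can matter when there are many features.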

We will also create a function to compute the log-likelihood over the whole dataset. It is useful as a quality-control (QC) tool: with a suitable step size, the log-likelihood should increase at each iteration until a maximum is reached.

The function has three inputs: the feature matrix, the sentiment array, and the computed (or initial) coefficients. The output is a scalar with the value of the log-likelihood.

In [16]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    isone = (sentiment==+1) # Saving an array with positive sentiments as 1 and others as 0
    
    # Computing each part of the log-likelihood formula
    dotfc = np.dot(feature_matrix, coefficients)
    lnexp = np.log(1. + np.exp(-dotfc))
    
    # Avoiding infinite results: where exp(-dotfc) overflows, ln(1 + exp(-dotfc)) is approximately -dotfc
    mask = np.isinf(lnexp)
    lnexp[mask] = -dotfc[mask]
    
    # log-likelihood
    ll = np.sum((isone-1)*dotfc - lnexp)
    return ll
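An alternative to patching infinite values afterwards is NumPy's `np.logaddexp`, which evaluates $\ln(e^a + e^b)$ without overflow. With $a = 0$ it gives $\ln(1 + e^{-s})$ directly. The sketch below is one possible variant; the name `stable_log_likelihood` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def stable_log_likelihood(feature_matrix, sentiment, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    indicator = (sentiment == +1)
    # np.logaddexp(0, -s) = ln(exp(0) + exp(-s)) = ln(1 + exp(-s)), computed without overflow
    return np.sum((indicator - 1) * scores - np.logaddexp(0, -scores))

# Two extreme, correctly classified toy reviews with scores +800 and -800
toy_matrix = np.array([[1.0, 800.0], [1.0, -800.0]])
toy_sentiment = np.array([+1, -1])
toy_ll = stable_log_likelihood(toy_matrix, toy_sentiment, np.array([0.0, 1.0]))
```

Both toy reviews are classified with near certainty, so the log-likelihood is finite and essentially zero, even though `np.exp(800)` on its own would overflow.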

Now we can use the three functions above to fit the logistic regression with the gradient method.
The next function receives the feature matrix, the sentiment array, the initial guess for the coefficients, the step size, and the maximum number of iterations. The output is the array of coefficients, one for each feature.

In [17]:
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    
    # Start the gradient method and stop it in the maximum iteration
    for itr in xrange(max_iter):

        # Making the predictions with the initial or updated coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for +1 (positive sentiment)
        indicator = (sentiment==+1)
        
        # Compute the errors with the initial or updated coefficients
        errors = indicator - predictions
        
        # Apply the gradient method for each coefficient
        for j in xrange(len(coefficients)):
            
            # Update the coefficient for feature j
            coefficients[j] = coefficients[j] + step_size * np.sum(feature_derivative(errors, feature_matrix[:,j]))
            
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print 'Iteration %*d: log-likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
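Because the errors are computed once per iteration, before the inner loop, the per-coefficient updates above amount to one simultaneous update of the whole weight vector. The sketch below writes that update in vectorized form on synthetic data and records the log-likelihood at each step. The data, the "true" weights, the step size, and the helper name `log_likelihood` are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
H = np.hstack([np.ones((200, 1)), rng.randint(0, 3, size=(200, 2))])  # intercept + 2 features
true_w = np.array([-1.0, 2.0, -1.5])                                   # hypothetical generating weights
y = np.where(rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-H.dot(true_w))), 1, -1)

def log_likelihood(H, y, w):
    scores = H.dot(w)
    return np.sum(((y == 1) - 1) * scores - np.logaddexp(0, -scores))

w = np.zeros(3)
history = []
for _ in range(200):
    errors = (y == 1) - 1.0 / (1.0 + np.exp(-H.dot(w)))
    w = w + 1e-3 * H.T.dot(errors)        # all coefficients updated together
    history.append(log_likelihood(H, y, w))
```

With a step size this small relative to the data, `history` should never decrease, which is the same behavior the iteration log of `logistic_regression` is meant to confirm.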

Now, let's run `logistic_regression` on the review data and see it working.

In [18]:
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

Iteration   0: log-likelihood of observed labels = -36780.91768478
Iteration   1: log-likelihood of observed labels = -36775.13434712
Iteration   2: log-likelihood of observed labels = -36769.35713564
Iteration   3: log-likelihood of observed labels = -36763.58603240
Iteration   4: log-likelihood of observed labels = -36757.82101962
Iteration   5: log-likelihood of observed labels = -36752.06207964
Iteration   6: log-likelihood of observed labels = -36746.30919497
Iteration   7: log-likelihood of observed labels = -36740.56234821
Iteration   8: log-likelihood of observed labels = -36734.82152213
Iteration   9: log-likelihood of observed labels = -36729.08669961
Iteration  10: log-likelihood of observed labels = -36723.35786366
Iteration  11: log-likelihood of observed labels = -36717.63499744
Iteration  12: log-likelihood of observed labels = -36711.91808422
Iteration  13: log-likelihood of observed labels = -36706.20710739
Iteration  14: log-likelihood of observed labels = -36700.5020

Now, with the coefficients in hand, let's predict the sentiments.

For this analysis, I am choosing to classify probabilities larger than 0.5 as positive and probabilities of 0.5 or less as negative. So:

$$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) > 0.5 \\
      -1 & P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) \leq 0.5 \\
\end{array} 
\right.
$$
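Since the logistic function crosses 0.5 exactly at a score of 0, the same rule can be applied to the scores $\mathbf{w}^T h(\mathbf{x}_i)$ directly. The sketch below checks this on three made-up scores.

```python
import numpy as np

toy_scores = np.array([-2.0, 0.0, 0.5])                  # hypothetical w^T h(x_i) values
toy_probs = 1.0 / (1.0 + np.exp(-toy_scores))

by_probability = np.where(toy_probs > 0.5, 1, -1)        # rule as written above
by_score = np.where(toy_scores > 0, 1, -1)               # equivalent rule on the raw score
```

Both arrays hold the same labels, and the middle review (score 0, probability 0.5) falls on the negative side.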

Time to compute the predictions.

In [19]:
pred = predict_probability(feature_matrix,coefficients)

In [20]:
classify_predictions = tc.SArray(pred).apply(lambda x: 1 if x > 0.5 else -1)

In [21]:
len(classify_predictions[classify_predictions == 1])

25126

In [22]:
len(sentiment[sentiment == 1])

26579

The number of predicted positive reviews (25126) is close to the true number (26579). But this does not tell us whether the -1's and 1's are assigned to the correct reviews. So, let's compare the predictions with the true sentiments and compute the **accuracy** of our logistic regression.

In [23]:
num_total = len(products) # Total number of reviews
num_correct = (classify_predictions == products['sentiment']).sum() # Each match counts as 1 (True), each mismatch as 0 (False)
num_wrong = num_total - num_correct
accuracy = 1.0*num_correct/num_total

print 'Number of correct reviews:', num_correct
print 'Number of wrong reviews:', num_wrong
print 'Number of reviews:', num_total
print 'Accuracy: %.2f' % accuracy

Number of correct reviews: 39903
Number of wrong reviews: 13169
Number of reviews: 53072
Accuracy: 0.75
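Accuracy alone does not say which kind of mistake the model makes. As a sketch, the four counts of a confusion matrix can be computed with boolean masks. The two arrays below are invented labels, not the notebook's predictions.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1, -1])     # hypothetical true sentiments
y_pred = np.array([1, -1, -1, 1, 1, -1])     # hypothetical predictions

toy_accuracy = np.mean(y_true == y_pred)
true_pos = np.sum((y_pred == 1) & (y_true == 1))      # predicted positive, actually positive
false_pos = np.sum((y_pred == 1) & (y_true == -1))    # predicted positive, actually negative
false_neg = np.sum((y_pred == -1) & (y_true == 1))    # predicted negative, actually positive
true_neg = np.sum((y_pred == -1) & (y_true == -1))    # predicted negative, actually negative
```

The four counts add up to the number of reviews, and accuracy is `(true_pos + true_neg)` divided by that total.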


Let's compare our result with the built-in logistic classifier of Turicreate:

In [24]:
modeltc = tc.logistic_classifier.create(products, target='sentiment', features = important_words)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [28]:
predictionstc = modeltc.predict(products, output_type = 'class')
resultstc = modeltc.evaluate(products)
print 'Accuracy of our code: %.2f' % accuracy
print 'Accuracy of Turicreate: %.2f' % resultstc['accuracy']
print classify_predictions.head()
print predictionstc.head()
print products.head()['sentiment']

Accuracy of our code: 0.75
Accuracy of Turicreate: 0.79
[1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
[1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Our accuracy (0.75) is fairly close to that of the Turicreate package (0.79). We can try to improve it by penalizing large coefficients, as we did with the [polynomial regression](../Polynomial_Regression/Polynomial_Regression.ipynb).

## Logistic regression with L2 penalization (regularization)

As in the [polynomial regression](../Polynomial_Regression/Polynomial_Regression.ipynb), we will include a term in the cost function to penalize large coefficients, which helps to reduce overfitting. For the "simple" logistic regression, the cost function is the log-likelihood:

$$Cost(\mathbf{w}) = \ell\ell(\mathbf{w})$$

Now, we include the penalization term: the squared L2 norm of the coefficients multiplied by the tuning parameter $\lambda$:

$$Cost(\mathbf{w}) = \ell\ell(\mathbf{w}) - \lambda \|\mathbf{w}\|_{2}^{2}$$

To find the best coefficients, we have to take the derivative of the cost function. The derivative of the log-likelihood is already known. The derivative of the penalty term with respect to the j-th coefficient is:

$$\frac{\partial \|\mathbf{w}\|_{2}^{2}}{\partial w_j} = \frac{\partial (w_0^2 + w_1^2 + w_2^2 + \dots + w_j^2 + \dots + w_N^2)}{\partial w_j} = 2w_j$$

For the gradient method, the update for each coefficient becomes:

$$w_j^{(t+1)} = w_j^{(t)} + \eta \big( \frac{\partial\ell\ell}{\partial w_j} - 2\lambda w_j^{(t)} \big)$$

where $t$ is the iteration number, $\eta$ is the step size, and:

$$\frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)$$

Usually the intercept is not penalized, so let's keep that in mind when we create our functions.

The first step is to split the data into the train and validation sets.

In [29]:
train_data, validation_data = products.random_split(.8, seed = 1)

print 'Training set   : %d data points' % len(train_data)
print 'Validation set : %d data points' % len(validation_data)

Training set   : 42474 data points
Validation set : 10598 data points


Convert both sets to NumPy arrays.

In [30]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')

The gradient must now include the $-2\lambda w_j$ term for all the coefficients except the intercept.

In [31]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, intercept): 
    derivative = np.dot(errors, feature)

    # Add the L2 penalty to all coefficients but the intercept
    if not intercept:
        derivative -= 2 * l2_penalty * coefficient
        
    return derivative
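The same penalized derivative can be sketched for all coefficients at once by zeroing the penalty entry of the intercept. The function name `gradient_with_L2` and the toy numbers are assumptions for illustration; the sketch also assumes, as in this notebook, that the intercept is column 0.

```python
import numpy as np

def gradient_with_L2(feature_matrix, errors, coefficients, l2_penalty):
    gradient = feature_matrix.T.dot(errors)       # unpenalized derivative for every coefficient
    penalty = 2.0 * l2_penalty * coefficients     # new array, so the input is not modified
    penalty[0] = 0.0                              # column 0 is the intercept: no penalty
    return gradient - penalty

toy_matrix = np.array([[1.0, 2.0], [1.0, 0.0]])
toy_errors = np.array([0.5, -0.5])
toy_grad = gradient_with_L2(toy_matrix, toy_errors, np.array([3.0, 4.0]), l2_penalty=10.0)
```

In this toy case the unpenalized gradient is `[0, 1]`; the penalty leaves the intercept entry alone and subtracts `2 * 10 * 4 = 80` from the second entry.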

The cost function (the penalized log-likelihood) also needs the L2 term.

In [32]:
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    isone = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients) # w^T h(x_i) for every review
    # Log-likelihood minus the L2 penalty (the intercept, coefficients[0], is not penalized)
    lp = np.sum((isone-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    
    return lp

Now, the logistic regression:

In [43]:
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array

    # Start the gradient method and stop it in the maximum iteration
    for itr in xrange(max_iter):

        # Making the predictions with the initial or updated coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for +1 (positive sentiment)
        indicator = (sentiment==+1)
        
        # Compute the errors with the initial or updated coefficients
        errors = indicator - predictions

        # Apply the gradient method for each coefficient
        for j in xrange(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)

            # Computing the derivative
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            
            # Updating the coefficient
            coefficients[j] += step_size * derivative
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
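The split into training and validation sets suggests one way to compare penalties: fit on the training part and measure accuracy on the held-out part. The sketch below does this on synthetic data with a compact, vectorized version of the update. The helper names `fit_l2` and `accuracy`, the data, the step size, and the list of penalties are all assumptions for illustration, and no particular winner is implied.

```python
import numpy as np

def fit_l2(H, y, l2_penalty, step_size=1e-3, max_iter=300):
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        errors = (y == 1) - 1.0 / (1.0 + np.exp(-H.dot(w)))
        penalty = 2.0 * l2_penalty * w
        penalty[0] = 0.0                               # leave the intercept unpenalized
        w = w + step_size * (H.T.dot(errors) - penalty)
    return w

def accuracy(H, y, w):
    # A score above 0 corresponds to a probability above 0.5
    return np.mean(np.where(H.dot(w) > 0, 1, -1) == y)

rng = np.random.RandomState(0)
H = np.hstack([np.ones((300, 1)), rng.randint(0, 3, size=(300, 4))])
y = np.where(H.dot(np.array([-1.0, 2.0, -2.0, 0.0, 0.0])) + rng.normal(size=300) > 0, 1, -1)
H_train, y_train, H_valid, y_valid = H[:240], y[:240], H[240:], y[240:]

# Validation accuracy for each candidate penalty
valid_accuracy = {l2: accuracy(H_valid, y_valid, fit_l2(H_train, y_train, l2))
                  for l2 in [0, 1, 10, 100]}
best_l2 = max(valid_accuracy, key=valid_accuracy.get)
```

The same pattern would apply to `feature_matrix_valid` and `sentiment_valid` with coefficients returned by `logistic_regression_with_L2`.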

Let's check different values for the l2_penalty:

In [45]:
# L2_penalty = 0
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=1e-7, l2_penalty=0, max_iter=501)

iteration   0: log likelihood of observed labels = -29437.00830657
iteration   1: log likelihood of observed labels = -29433.28661525
iteration   2: log likelihood of observed labels = -29429.56826389
iteration   3: log likelihood of observed labels = -29425.85324331
iteration   4: log likelihood of observed labels = -29422.14154439
iteration   5: log likelihood of observed labels = -29418.43315805
iteration   6: log likelihood of observed labels = -29414.72807530
iteration   7: log likelihood of observed labels = -29411.02628719
iteration   8: log likelihood of observed labels = -29407.32778483
iteration   9: log likelihood of observed labels = -29403.63255941
iteration  10: log likelihood of observed labels = -29399.94060215
iteration  11: log likelihood of observed labels = -29396.25190435
iteration  12: log likelihood of observed labels = -29392.56645735
iteration  13: log likelihood of observed labels = -29388.88425256
iteration  14: log likelihood of observed labels = -29385.2052

In [46]:
# L2_penalty = 4
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=1e-7, l2_penalty=4, max_iter=501)

iteration   0: log likelihood of observed labels = -29437.00830806
iteration   1: log likelihood of observed labels = -29433.28662418
iteration   2: log likelihood of observed labels = -29429.56828622
iteration   3: log likelihood of observed labels = -29425.85328496
iteration   4: log likelihood of observed labels = -29422.14161129
iteration   5: log likelihood of observed labels = -29418.43325612
iteration   6: log likelihood of observed labels = -29414.72821043
iteration   7: log likelihood of observed labels = -29411.02646528
iteration   8: log likelihood of observed labels = -29407.32801177
iteration   9: log likelihood of observed labels = -29403.63284105
iteration  10: log likelihood of observed labels = -29399.94094437
iteration  11: log likelihood of observed labels = -29396.25231299
iteration  12: log likelihood of observed labels = -29392.56693826
iteration  13: log likelihood of observed labels = -29388.88481156
iteration  14: log likelihood of observed labels = -29385.2059

In [47]:
# L2_penalty = 1
coefficients_1_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=1e-7, l2_penalty=1, max_iter=501)

iteration   0: log likelihood of observed labels = -29437.00830694
iteration   1: log likelihood of observed labels = -29433.28661748
iteration   2: log likelihood of observed labels = -29429.56826947
iteration   3: log likelihood of observed labels = -29425.85325372
iteration   4: log likelihood of observed labels = -29422.14156111
iteration   5: log likelihood of observed labels = -29418.43318257
iteration   6: log likelihood of observed labels = -29414.72810908
iteration   7: log likelihood of observed labels = -29411.02633171
iteration   8: log likelihood of observed labels = -29407.32784157
iteration   9: log likelihood of observed labels = -29403.63262982
iteration  10: log likelihood of observed labels = -29399.94068771
iteration  11: log likelihood of observed labels = -29396.25200651
iteration  12: log likelihood of observed labels = -29392.56657758
iteration  13: log likelihood of observed labels = -29388.88439231
iteration  14: log likelihood of observed labels = -29385.2054

In [48]:
# L2_penalty = 1e2
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                       initial_coefficients=np.zeros(194),
                                                       step_size=1e-7, l2_penalty=1e2, max_iter=501)

iteration   0: log likelihood of observed labels = -29437.00834383
iteration   1: log likelihood of observed labels = -29433.28683866
iteration   2: log likelihood of observed labels = -29429.56882208
iteration   3: log likelihood of observed labels = -29425.85428462
iteration   4: log likelihood of observed labels = -29422.14321690
iteration   5: log likelihood of observed labels = -29418.43560958
iteration   6: log likelihood of observed labels = -29414.73145339
iteration   7: log likelihood of observed labels = -29411.03073911
iteration   8: log likelihood of observed labels = -29407.33345759
iteration   9: log likelihood of observed labels = -29403.63959975
iteration  10: log likelihood of observed labels = -29399.94915655
iteration  11: log likelihood of observed labels = -29396.26211902
iteration  12: log likelihood of observed labels = -29392.57847825
iteration  13: log likelihood of observed labels = -29388.89822538
iteration  14: log likelihood of observed labels = -29385.2213

In [49]:
# L2_penalty = 1e3
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                       initial_coefficients=np.zeros(194),
                                                       step_size=1e-7, l2_penalty=1e3, max_iter=501)

iteration   0: log likelihood of observed labels = -29437.00867915
iteration   1: log likelihood of observed labels = -29433.28884909
iteration   2: log likelihood of observed labels = -29429.57384441
iteration   3: log likelihood of observed labels = -29425.86365268
iteration   4: log likelihood of observed labels = -29422.15826153
iteration   5: log likelihood of observed labels = -29418.45765867
iteration   6: log likelihood of observed labels = -29414.76183189
iteration   7: log likelihood of observed labels = -29411.07076903
iteration   8: log likelihood of observed labels = -29407.38445803
iteration   9: log likelihood of observed labels = -29403.70288687
iteration  10: log likelihood of observed labels = -29400.02604364
iteration  11: log likelihood of observed labels = -29396.35391646
iteration  12: log likelihood of observed labels = -29392.68649354
iteration  13: log likelihood of observed labels = -29389.02376315
iteration  14: log likelihood of observed labels = -29385.3657

In [50]:
# L2_penalty = 1e5
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                       initial_coefficients=np.zeros(194),
                                                       step_size=1e-7, l2_penalty=1e5, max_iter=501)

iteration   0: log likelihood of observed labels = -29437.04556452
iteration   1: log likelihood of observed labels = -29433.50706443
iteration   2: log likelihood of observed labels = -29430.11179834
iteration   3: log likelihood of observed labels = -29426.85396307
iteration   4: log likelihood of observed labels = -29423.72799081
iteration   5: log likelihood of observed labels = -29420.72853953
iteration   6: log likelihood of observed labels = -29417.85048384
iteration   7: log likelihood of observed labels = -29415.08890618
iteration   8: log likelihood of observed labels = -29412.43908842
iteration   9: log likelihood of observed labels = -29409.89650370
iteration  10: log likelihood of observed labels = -29407.45680871
iteration  11: log likelihood of observed labels = -29405.11583625
iteration  12: log likelihood of observed labels = -29402.86958804
iteration  13: log likelihood of observed labels = -29400.71422790
iteration  14: log likelihood of observed labels = -29398.6460