This code implements logistic regression from scratch on Twitter data to classify tweets as positive or negative. The code covers:

- Extracting features for logistic regression from text
- Implementing logistic regression from scratch
- Applying logistic regression to a natural language processing task
- Testing the logistic regression model
- Performing error analysis

In [55]:
#importing the necessary libraries
import numpy as np
import nltk
from os import getcwd
import w1_unittest

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\udayr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\udayr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [56]:
import pandas as pd
from nltk.corpus import twitter_samples
from utils import process_tweet, build_freqs


In [57]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [58]:
test_pos=all_positive_tweets[4000:]
train_pos=all_positive_tweets[:4000]
train_neg=all_negative_tweets[:4000]
test_neg=all_negative_tweets[4000:]

train_x=train_pos+train_neg
test_x=test_pos+test_neg

In [59]:
train_y=np.append(np.ones((len(train_pos),1)),np.zeros((len(train_neg),1)),axis=0)
test_y=np.append(np.ones((len(test_pos),1)),np.zeros((len(test_neg),1)),axis=0)

In [60]:
train_y.shape
test_y.shape

(2000, 1)

In [61]:
freqs=build_freqs(train_x,train_y)

print(type(freqs))
print(len(freqs.keys()))

<class 'dict'>
11427
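
`build_freqs` is imported from the local `utils` module, whose source is not shown here. The sketch below illustrates what such a function might do, assuming it counts how often each (word, label) pair occurs in the training data. The names `simple_tokenize` and `build_freqs_sketch` are hypothetical, and the whitespace tokenizer is only a stand-in for `process_tweet`:

```python
import numpy as np

def simple_tokenize(tweet):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace.
    return tweet.lower().split()

def build_freqs_sketch(tweets, labels, tokenize=simple_tokenize):
    # Map each (word, label) pair to the number of times it occurs.
    freqs = {}
    flat_labels = np.asarray(labels, dtype=float).ravel().tolist()
    for tweet, label in zip(tweets, flat_labels):
        for word in tokenize(tweet):
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

toy_tweets = ["happy happy day", "sad day"]
toy_labels = np.array([[1.0], [0.0]])
toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(toy_freqs)
```

Keying the dictionary on the pair, rather than on the word alone, is what later allows a separate positive and negative count per word.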


## Processing tweet

In [62]:
print(f'Sample positive tweet: {train_x[0]}')
print(f'Example of processed version: {process_tweet(train_x[0])}')

Sample positive tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Example of processed version: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


In [63]:
# Implementing sigmoid function
# z=x*theta
# sigmoid of z gives a probability between 0 and 1

def sigmoid(z):

    h=1/(1+np.exp(-z))

    return h
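
For large negative `z`, `np.exp(-z)` can overflow and emit a runtime warning, even though the result still rounds to 0. An optional, numerically stable variant branches on the sign of `z`. This is a sketch only; `stable_sigmoid` is not part of the original notebook, and unlike `sigmoid` it always returns an array:

```python
import numpy as np

def stable_sigmoid(z):
    # Work on a float array with at least one dimension so that
    # boolean masking behaves the same for scalars and arrays.
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```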

In [64]:
# Testing the sigmoid function
if (sigmoid(0) == 0.5):
    print('SUCCESS!')
else:
    print('Oops!')

if (sigmoid(4.92) == 0.9927537604041685):
    print('CORRECT!')
else:
    print('Oops again!')

SUCCESS!
CORRECT!


In [65]:
# Test your function
w1_unittest.test_sigmoid(sigmoid)

 All tests passed


The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right]$$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of training example 'i'.
* $h(z^{(i)})$ is the model's prediction for the training example 'i'.
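
To make the formula concrete, the sketch below evaluates the log loss elementwise for three made-up labels and predictions. The values are illustrative and are not taken from the tweet data:

```python
import numpy as np

# Illustrative labels and predicted probabilities (not from the tweet data).
y = np.array([1.0, 0.0, 1.0])
h = np.array([0.9, 0.2, 0.6])

# Log loss of each example: only one of the two terms is non-zero per example.
per_example = -(y * np.log(h) + (1 - y) * np.log(1 - h))

# The cost J is the mean of the per-example losses.
J = per_example.mean()
print(per_example)
print(J)
```

The third example, whose prediction of 0.6 is the least confident about the correct label, contributes the largest loss.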

## Gradient Descent

* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1)
    * $\mathbf{\theta}$ has dimensions (n+1, 1)
    * $\mathbf{z}$ has dimensions (m, 1)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)', and adding the dot product of '1-y' and 'log(1-h)'.  Since both 'y' and 'h' are column vectors (m,1), transpose the vector on the left, so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + (\mathbf{1}-\mathbf{y})^T \cdot \log(\mathbf{1}-\mathbf{h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
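
One way to sanity-check the vectorized update is to compare the analytic gradient $\frac{1}{m}\mathbf{x}^T(\mathbf{h}-\mathbf{y})$ with a finite-difference estimate of the cost. The sketch below does this on random toy data. The helper names `cost` and `analytic_grad` are introduced here for illustration and do not appear in the notebook:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(x, y, theta):
    # Vectorized log loss, matching the formula for J above.
    m = x.shape[0]
    h = sigmoid(x @ theta)
    J = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m
    return float(np.squeeze(J))

def analytic_grad(x, y, theta):
    # Gradient used in the update rule: x^T (h - y) / m, shape (n+1, 1).
    m = x.shape[0]
    return x.T @ (sigmoid(x @ theta) - y) / m

rng = np.random.default_rng(0)
x = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # (m, n+1) with bias column
y = (rng.random((5, 1)) > 0.5).astype(float)               # (m, 1)
theta = rng.normal(size=(3, 1)) * 0.1                      # (n+1, 1)

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(x, y, theta + step) - cost(x, y, theta - step)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad(x, y, theta)))
print(max_diff)
```

A very small `max_diff` suggests that the transposes and the sign of the update are consistent with the cost function.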

In [66]:
def gradient_descent(x,y,theta,alpha,num_iters):

    m=x.shape[0]

    for i in range(num_iters):

        z=np.dot(x,theta)

        h=sigmoid(z)

        J=(-1/m)*(np.dot(np.transpose(y),np.log(h))+np.dot(np.transpose(1-y),np.log(1-h)))

        theta=theta-(alpha/m)*(np.dot(np.transpose(x),(h-y)))
    
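    # J has shape (1,1); squeeze it to a scalar before converting,
    # which avoids NumPy's deprecation warning for float() on arrays
    J=float(np.squeeze(J))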
    return J,theta


In [67]:
# Check the function
# Construct a synthetic test case using numpy PRNG functions
np.random.seed(1)

tmp_x=np.append(np.ones((10,1)),np.random.rand(10,2)*2000,axis=1)

tmp_y=(np.random.rand(10,1)>0.35).astype(float)

tmp_J,tmp_theta=gradient_descent(tmp_x,tmp_y,np.zeros((3,1)),1e-8,700)
print(tmp_J)
print(tmp_theta)

0.6709497038162118
[[4.10713435e-07]
 [3.56584699e-04]
 [7.30888526e-05]]




In [68]:
# Test your function
w1_unittest.test_gradientDescent(gradient_descent)

 All tests passed




## Extracting features

Implement the extract_features function. 
* This function takes in a single tweet.
* Process the tweet using the imported `process_tweet` function and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the 'freqs' dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)


In [69]:
def extract_features(tweet,freqs,process_tweet=process_tweet):

    processed_tweet=process_tweet(tweet)

    x=np.zeros(3)

    # bias term
    x[0]=1

    for word in processed_tweet:

        # increment the word count for the positive label 1
        x[1]+=freqs.get((word,1.0),0)

        # increment the word count for the negative label 0
        x[2]+=freqs.get((word,0.0),0)

    x=x[None,:]
    assert(x.shape==(1,3))
    return x


In [70]:
# Check your function
# test 1
# test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

# test 2:
# check for when the words are not in the freqs dictionary
tmp2 = extract_features('blorb bleeeeb bloooob', freqs)
print(tmp2)

[[1.000e+00 3.133e+03 6.100e+01]]
[[1. 0. 0.]]
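
The arithmetic behind the feature vector can be followed on a hand-made dictionary. In the sketch below, `toy_freqs` and `toy_features` are illustrative names with made-up counts, and the words are assumed to be already processed:

```python
import numpy as np

# Made-up counts keyed on (word, label), in the same form as freqs.
toy_freqs = {
    ("happi", 1.0): 10, ("happi", 0.0): 1,
    ("sad", 1.0): 2,    ("sad", 0.0): 8,
}

def toy_features(words, freqs):
    # Feature vector: [bias, sum of positive counts, sum of negative counts].
    x = np.zeros((1, 3))
    x[0, 0] = 1
    for w in words:
        x[0, 1] += freqs.get((w, 1.0), 0)  # unseen words contribute 0
        x[0, 2] += freqs.get((w, 0.0), 0)
    return x

features = toy_features(["happi", "sad", "unknown"], toy_freqs)
print(features)
```

Because the loop runs over every token, a word that is repeated in a tweet adds its counts once per occurrence.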


## Training the model

To train the model:
* Stack the features for all training examples into a matrix X. 
* Call `gradient_descent`.

In [71]:
X=np.zeros((len(train_x),3))
for i in range(len(train_x)):
    X[i,:]=extract_features(train_x[i],freqs)

Y=train_y

# Applying GD
J,theta=gradient_descent(X,Y,np.zeros((3,1)),1e-9,2000)
print(f"Cost after training is : {J}")
print(f"theta after training is : {[round(t,8) for t in np.squeeze(theta)]}")

Cost after training is : 0.19442795968100662
theta after training is : [8e-08, 0.00063388, -0.00063676]
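
The learned weights define a linear decision boundary in the (positive count, negative count) plane: the model predicts positive when $\theta_0 + \theta_1 x_1 + \theta_2 x_2 > 0$. The sketch below solves for the boundary using the rounded weights printed above, so the result is approximate. The helper name `boundary_neg_count` is introduced here for illustration:

```python
import numpy as np

# Rounded weights copied from the training output.
theta = np.array([8e-08, 0.00063388, -0.00063676])

def boundary_neg_count(pos_count, theta):
    # Solve theta0 + theta1 * pos + theta2 * neg = 0 for neg.
    return -(theta[0] + theta[1] * pos_count) / theta[2]

neg_at_boundary = boundary_neg_count(1000.0, theta)
print(neg_at_boundary)
```

Since $\theta_1 \approx -\theta_2$ and $\theta_0$ is close to zero, the boundary lies close to the line where the positive and negative counts are equal.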




## Testing Logistic Regression

Predict whether a tweet is positive or negative.

* Given a tweet, process it, then extract the features.
* Apply the model's learned weights on the features to get the logits.
* Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).


In [72]:
def predict_tweet(tweet,freqs,theta):

    x=extract_features(tweet,freqs,process_tweet=process_tweet)

    y_pred=sigmoid(np.dot(x,theta))

    return y_pred


In [73]:
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    # predict_tweet returns a (1,1) array; .item() extracts the scalar for formatting
    print('%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta).item()))

I am happy -> 0.522791
I am bad -> 0.493623
this movie should have been great. -> 0.518984
great -> 0.518997
great great -> 0.537938
great great great -> 0.556771
great great great great -> 0.575442
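
The steady rise for repeated "great" follows from the feature extraction: each repetition adds the same counts to the features, so the logit grows by a constant amount. The sketch below checks this by inverting the sigmoid on the four probabilities printed above, which are rounded to six decimals:

```python
import numpy as np

# Predicted probabilities for 'great' repeated 1 to 4 times, copied from the output.
probs = np.array([0.518997, 0.537938, 0.556771, 0.575442])

# Inverse of the sigmoid: logit(p) = log(p / (1 - p)).
logits = np.log(probs / (1 - probs))

# Differences between consecutive logits.
steps = np.diff(logits)
print(logits)
print(steps)
```

The steps are nearly equal, which is consistent with a logit that is linear in the number of repetitions while the probability itself saturates.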




In [74]:
my_tweet ='I am not learning'
predict_tweet(my_tweet, freqs, theta)


array([[0.50047328]])

## Checking performance on test set


In [75]:
def test_logistic_regression(test_x,test_y,freqs,theta,predict_tweet=predict_tweet):
    
    y_hat=[]

    for tweet in test_x:
        y_pred=predict_tweet(tweet,freqs,theta)

        if y_pred>0.5:
            y_hat.append(1.0)
        else:
            y_hat.append(0.0)

    test_y=np.squeeze(test_y)
    equal_counts=test_y==y_hat
    accuracy=np.sum(equal_counts)/len(test_y)

    return accuracy

In [76]:
test_logistic_regression(test_x, test_y, freqs, theta)

0.9945
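
Accuracy alone does not show which class the errors fall in. The sketch below counts true and false positives and negatives from thresholded predictions, using made-up arrays rather than the notebook's test set:

```python
import numpy as np

# Made-up labels and predicted probabilities for six examples.
y_true = np.array([1, 1, 1, 0, 0, 0], dtype=float)
y_prob = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.1])

# Same 0.5 threshold as test_logistic_regression.
y_hat = (y_prob > 0.5).astype(float)

tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # positive, predicted positive
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # positive, predicted negative
tn = int(np.sum((y_hat == 0) & (y_true == 0)))  # negative, predicted negative
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # negative, predicted positive

accuracy = (tp + tn) / len(y_true)
print(tp, fn, tn, fp, accuracy)
```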

In [78]:
# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(test_x,test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        # y and y_hat are single-element arrays; .item() extracts scalars for formatting
        print('%d\t%0.8f\t%s' % (y.item(), y_hat.item(), ' '.join(process_tweet(x)).encode('ascii', 'ignore')))
    

Label Predicted Tweet
THE TWEET IS: @MarkBreech Not sure it would be good thing 4 my bottom daring 2 say 2 Miss B but Im gonna be so stubborn on mouth soaping ! #NotHavingit :p
THE PROCESSED TWEET IS: ['sure', 'would', 'good', 'thing', '4', 'bottom', 'dare', '2', 'say', '2', 'miss', 'b', 'im', 'gonna', 'stubborn', 'mouth', 'soap', 'nothavingit', ':p']
1	0.49138820	b'sure would good thing 4 bottom dare 2 say 2 miss b im gonna stubborn mouth soap nothavingit :p'
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots
http://t.co/UGQzOx0huu
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48298649	b"i'm play brain dot braindot"
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/aOKldo3GMj http://t.co/xWCM9qyRG5
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48298649	b"i'm play brain dot braindot"




THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/R2JBO8iNww http://t.co/ow5BBwdEMY
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48298649	b"i'm play brain dot braindot"
THE TWEET IS: off to the park to get some sunlight : )
THE PROCESSED TWEET IS: ['park', 'get', 'sunlight']
1	0.49669547	b'park get sunlight'
THE TWEET IS: @msarosh Uff Itna Miss karhy thy ap :p
THE PROCESSED TWEET IS: ['uff', 'itna', 'miss', 'karhi', 'thi', 'ap', ':p']
1	0.48065997	b'uff itna miss karhi thi ap :p'
THE TWEET IS: @phenomyoutube u probs had more fun with david than me : (
THE PROCESSED TWEET IS: ['u', 'prob', 'fun', 'david']
0	0.51254922	b'u prob fun david'
THE TWEET IS: pats jay : (
THE PROCESSED TWEET IS: ['pat', 'jay']
0	0.50047543	b'pat jay'
THE TWEET IS: my beloved grandmother : ( https://t.co/wt4oXq5xCf
THE PROCESSED TWEET IS: ['belov', 'grandmoth']
0	0.50000002	b'belov grandmoth'
THE TWEET IS: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co

## Predicting with own tweet

In [82]:
# Feel free to change the tweet below
# my_tweet = 'This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!'
my_tweet="I'm playing Brain Dots : ) #BrainDots"
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

["i'm", 'play', 'brain', 'dot', 'braindot']
[[0.48298649]]
Negative sentiment
