<a href="https://colab.research.google.com/github/the-headliner/Battle_ship/blob/main/Copy_of_LogisticRegression_SentimentAnalysis_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1: Logistic Regression
Welcome to week one of the specialization. You will learn about logistic regression. Concretely, you will be implementing logistic regression for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will:

* Learn how to extract features for logistic regression given some text
* Implement logistic regression from scratch
* Apply logistic regression on a natural language processing task
* Test your logistic regression
* Perform error analysis

Before starting, make sure that you follow the assignment instructions below.

## Assignment Instructions
Create a Copy in Your Google Drive:
* Before you begin working on this assignment, you must create a copy of this Colab file in your own Google Drive.
* To do this, go to the menu bar at the top of this page, click on File > Save a copy in Drive.... This will save a copy of this file in your Google Drive under the name Copy of <Original File Name>. Rename the file.
* Ensure you are logged into your msitprogram.net account when doing this.

Work on Your Copy:
* Do not edit this original file.
* All your work must be done on the copy saved in your Google Drive. Any work done on this original file will not be saved and may be lost.

Saving Your Work:
* Google Colab automatically saves your progress in the copy stored in your Drive. However, it's a good practice to manually save your work periodically by clicking on File > Save.


Let's get started!!!

## Table of Contents

- [Import Functions and Data](#0)
- [1 - Logistic Regression](#1)
    - [1.1 - Sigmoid](#1-1)
        - [Exercise 1 - sigmoid](#ex-1)
    - [1.2 - Cost function and Gradient](#1-2)
        - [Exercise 2 - gradientDescent](#ex-2)
- [2 - Extracting the Features](#2)
    - [Exercise 3 - extract_features](#ex-3)
- [3 - Training Your Model](#3)
- [4 - Test your Logistic Regression](#4)
    - [Exercise 4 - predict_tweet](#ex-4)
    - [4.1 - Check the Performance using the Test Set](#4-1)
        - [Exercise 5 - test_logistic_regression](#ex-5)
- [5 - Error Analysis](#5)
- [6 - Predict with your own Tweet](#6)

<a name='0'></a>
## Import Functions and Data

In [64]:
# run this cell to import nltk
import nltk
from os import getcwd
import re
import string
import numpy as np
#from collections import defaultdict

nltk.download('twitter_samples')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#### Helper functions for data processing:
* process_tweet: cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.


In [65]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean
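
The regex clean-up steps inside `process_tweet` can be tried on their own, without NLTK. The following is a minimal sketch: `strip_tweet_markup` and the sample tweet are hypothetical names introduced here for illustration, and the tokenizing, stopword removal, and stemming steps are deliberately left out.

```python
import re

def strip_tweet_markup(tweet):
    """Apply only the regex clean-up steps used in process_tweet (illustrative helper)."""
    tweet = re.sub(r'\$\w*', '', tweet)                # stock tickers such as $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # leading old-style retweet marker
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # drop '#', keep the hashtag word
    return tweet

# made-up sample tweet, not taken from the dataset
sample = "RT @user: $GE is up! #happy https://example.com/x :)"
cleaned = strip_tweet_markup(sample)
print(cleaned.split())
```

The handle `@user:` survives this step because handles are removed later by `TweetTokenizer(strip_handles=True)`, not by the regexes.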

### Prepare the Data
* The `twitter_samples` contains subsets of five thousand positive_tweets, five thousand negative_tweets, and the full set of 10,000 tweets.  
    * If you used all three datasets, we would introduce duplicates of the positive tweets and negative tweets.  
    * You will select just the five thousand positive tweets and five thousand negative tweets.

In [66]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

* Train test split: 20% will be in the test set, and 80% in the training set.


In [67]:
# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]   # 1000 positive test tweets
train_pos = all_positive_tweets[:4000]  # 4000 positive training tweets
test_neg = all_negative_tweets[4000:]   # 1000 negative test tweets
train_neg = all_negative_tweets[:4000]  # 4000 negative training tweets

train_x = train_pos + train_neg
test_x = test_pos + test_neg

* Create the numpy array of positive labels and negative labels.

In [68]:
# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

In [69]:
# print the shapes of the train and test label arrays
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)


* Create the frequency dictionary by completing the build_freqs function: this counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the 'freqs' dictionary, where each key is the (word,label) tuple, and the value is the count of its frequency within the corpus of tweets.
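
To make the shape of the `freqs` dictionary concrete, here is a small sketch on two made-up tweets. It assumes a plain whitespace split as a stand-in for `process_tweet`; `build_freqs_toy` and the toy data are hypothetical and are not part of the assignment.

```python
from collections import Counter

def build_freqs_toy(tweets, labels, tokenize=str.split):
    """Count (word, label) pairs; `tokenize` stands in for process_tweet."""
    freqs = Counter()
    for tweet, label in zip(tweets, labels):
        for word in tokenize(tweet):
            freqs[(word, int(label))] += 1
    return dict(freqs)

toy_tweets = ["happy happy day", "sad day"]
toy_labels = [1, 0]
toy_freqs = build_freqs_toy(toy_tweets, toy_labels)
print(toy_freqs)
```

The word `day` appears under both labels, so it gets two separate keys, `('day', 1)` and `('day', 0)`.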


In [70]:
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    ### START CODE HERE ###

    #Initialized an empty dictionary
    freqs = {}

    # Loop through each index in the tweets and ys lists
    for i in range(len(tweets)):
        tweet = tweets[i]
        # Convert the label to a plain int; np.squeeze avoids the NumPy
        # deprecation warning raised by calling int() on a 1-element array
        y = int(np.squeeze(ys[i]))

        # Process the tweet to get a list of cleaned words
        words = process_tweet(tweet)
        # For each word in the tweet
        for word in words:
            # Create the key as a tuple (word, sentiment)
            pair = (word, y)

            # Increment the frequency count for the (word, sentiment) pair
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    ### END CODE HERE ###

    return freqs

In [71]:
# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))


type(freqs) = <class 'dict'>
len(freqs) = 11427


#### Expected output
```
type(freqs) = <class 'dict'>
len(freqs) = 11427
```

### Process Tweet
The given function 'process_tweet' tokenizes the tweet into individual words, removes stop words and applies stemming.

In [72]:
# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


#### Expected output
```
This is an example of a positive tweet:
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet:
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
```

<a name='1'></a>
## 1 - Logistic Regression

<a name='1-1'></a>
### 1.1 - Sigmoid
You will learn to use logistic regression for text classification.
* The sigmoid function is defined as:

$$ h(z) = \frac{1}{1+\exp(-z)} \tag{1}$$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability.
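
As a quick numerical illustration of this range, the sketch below evaluates equation (1) at a few points. `sigmoid_sketch` is a hypothetical name for a direct transcription of the formula; it is not meant as the graded solution.

```python
import numpy as np

def sigmoid_sketch(z):
    """Direct transcription of equation (1); illustrative only."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
h = sigmoid_sketch(z)
print(h)  # every value lies strictly between 0 and 1; h(0) is exactly 0.5
```

Note the symmetry $h(-z) = 1 - h(z)$: the values at $-5$ and $5$ add up to 1.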

<div style="font-size:100%; text-align:center;"><img src='https://drive.google.com/file/d/1bbsYdwwkA1LVkON7lafLd0CEnfLDxljA/view?usp=share_link' alt="Figure 1" style="width:300px;height:200px;" /> Figure 1 </div>

<a name='ex-1'></a>
### Exercise 1 -  sigmoid
Implement the sigmoid function.
* You will want this function to work if z is a scalar as well as if it is an array.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li><a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html" > numpy.exp </a> </li>

</ul>
</p>
</details>



In [73]:
def sigmoid(z):
    """
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    """

    ### START CODE HERE ###
    # calculate the sigmoid of z
    # np.where picks the formula element-wise, so this works for scalars and arrays.
    # Both branches are evaluated, so NumPy may still emit overflow warnings
    # for very large |z|, but the selected values are correct.
    h = np.where(z < 0, np.exp(z) / (1 + np.exp(z)), 1 / (1 + np.exp(-z)))
    ### END CODE HERE ###

    return h

In [74]:
# Test cases for sigmoid function
def test_sigmoid(target):
    successful_cases = 0
    failed_cases = []

    test_cases = [
        {"name": "default_check", "input": {"z": 0}, "expected": 0.5},
        {
            "name": "positive_check",
            "input": {"z": 4.92},
            "expected": 0.9927537604041685,
        },
        {"name": "negative_check", "input": {"z": -1}, "expected": 0.2689414213699951},
        {
            "name": "larger_neg_check",
            "input": {"z": -20},
            "expected": 2.0611536181902037e-09,
        },
    ]

    for test_case in test_cases:
        result = target(**test_case["input"])

        try:
            assert np.isclose(result, test_case["expected"])
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"],
                    "got": result,
                }
            )
            print(
                f"Wrong output from sigmoid function. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

    if len(failed_cases) == 0:
        print("\033[92m All tests passed")
    else:
        print("\033[92m", successful_cases, " Tests passed")
        print("\033[91m", len(failed_cases), " Tests failed")

In [75]:
# Test your function
test_sigmoid(sigmoid)

All tests passed


#### Logistic Regression: Regression and a Sigmoid

Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_N x_N$$
Note that the $\theta$ values are "weights". In the Deep Learning Specialization, the weights were denoted by the vector 'w'. In this course, we use a different variable, $\theta$, to refer to the weights.

Logistic regression:
$$ h(z) = \frac{1}{1+\exp(-z)}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.
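
The two steps (a linear combination followed by a sigmoid) can be sketched for a single example as follows. The feature values and weights are made up for illustration and have nothing to do with the trained model.

```python
import numpy as np

x = np.array([[1.0, 3.0, 1.0]])           # one example: [bias, feature 1, feature 2]
theta = np.array([[0.0], [0.5], [-1.0]])  # hypothetical weights, shape (3, 1)

z = x @ theta                  # logit: 0*1 + 0.5*3 + (-1)*1 = 0.5
h = 1.0 / (1.0 + np.exp(-z))   # sigmoid of the logit
print(z, h)
```

Because the logit is positive, the sigmoid output is above 0.5, which would be read as a prediction of the positive class.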

<a name='1-2'></a>
### 1.2 - Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of training example 'i'.
* $h(z^{(i)})$ is the model's prediction for the training example 'i'.
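
As a sanity check on this cost, the sketch below evaluates it on three made-up examples, once with predictions that mostly agree with the labels and once with predictions that mostly disagree. `log_loss_cost` and the numbers are hypothetical and only serve to illustrate the formula.

```python
import numpy as np

def log_loss_cost(y, h):
    """Average log loss over all examples; y and h have shape (m, 1)."""
    m = y.shape[0]
    return float(-np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m)

y = np.array([[1.0], [0.0], [1.0]])
h_good = np.array([[0.9], [0.1], [0.8]])  # predictions close to the labels
h_bad = np.array([[0.1], [0.9], [0.2]])   # predictions far from the labels

cost_good = log_loss_cost(y, h_good)
cost_bad = log_loss_cost(y, h_bad)
print(cost_good, cost_bad)
```

The cost for `h_bad` comes out much larger than for `h_good`, which is the behavior gradient descent exploits.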

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label 'y' is also 1, the loss for that training example is 0.
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0.
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss.

In [76]:
# verify that when the model predicts close to 1, but the actual label is 0, the loss is a large positive value
-1 * (1 - 0) * np.log(1 - 0.9999) # loss is about 9.2

9.210340371976294

* Likewise, if the model predicts close to 0 ($h(z) = 0.0001$) but the actual label is 1, the first term in the loss function becomes a large negative number, which the factor of -1 turns into a large positive loss: $-1 \times \log(0.0001) \approx 9.2$. The closer the prediction is to zero, the larger the loss.

In [77]:
# verify that when the model predicts close to 0 but the actual label is 1, the loss is a large positive value
-1 * np.log(0.0001) # loss is about 9.2

9.210340371976182

#### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x^{(i)}_j \tag{5}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x^{(i)}_j$ is the feature associated with weight $\theta_j$

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
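
One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below does this on a small random problem; all names (`sigmoid_sketch`, `cost`, `gradient`) and the data are assumptions made for illustration.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, x, y):
    h = sigmoid_sketch(x @ theta)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

def gradient(theta, x, y):
    # analytic gradient: (1/m) * x^T (h - y)
    h = sigmoid_sketch(x @ theta)
    return x.T @ (h - y) / x.shape[0]

rng = np.random.default_rng(0)
x = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])  # bias column + 2 features
y = np.array([[1.0], [0.0], [1.0], [1.0], [0.0], [0.0]])
theta = rng.normal(size=(3, 1))

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step, x, y) - cost(theta - step, x, y)) / (2 * eps)

analytic = gradient(theta, x, y)
print(np.max(np.abs(numeric - analytic)))  # should be very small

# a single update with a small learning rate should lower the cost
alpha = 0.1
theta_new = theta - alpha * analytic
print(cost(theta, x, y), cost(theta_new, x, y))
```

If the two gradients disagree noticeably, the usual culprits are a missing $1/m$ factor or a sign error in $(h - y)$.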


<a name='ex-2'></a>
### Exercise 2 - gradientDescent
Implement the gradient descent function.
* The number of iterations 'num_iters' is the number of times that you'll use the entire training set.
* For each iteration, you'll calculate the cost function using all training examples (there are 'm' training examples), and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\
\theta_2
\\
\vdots
\\
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1)
    * $\mathbf{\theta}$ has dimensions (n+1, 1)
    * $\mathbf{z}$ has dimensions (m, 1)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)'.  Since both 'y' and 'h' are column vectors (m,1), transpose the vector to the left, so that matrix multiplication of a row vector with column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$
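
The shapes described above can be checked on a tiny example. This is a sketch with made-up data, assuming $m = 5$ examples and $n = 2$ features; it also shows that the dot-product form of $J$ agrees with the element-wise sum.

```python
import numpy as np

m, n = 5, 2
rng = np.random.default_rng(1)
x = np.hstack([np.ones((m, 1)), rng.uniform(size=(m, n))])  # (m, n+1), first column is the bias
y = np.array([[1.0], [0.0], [1.0], [0.0], [1.0]])           # (m, 1)
theta = np.zeros((n + 1, 1))                                # (n+1, 1)
alpha = 0.1

z = x @ theta                    # (m, 1)
h = 1.0 / (1.0 + np.exp(-z))     # (m, 1)

# cost via row-vector times column-vector, which yields a (1, 1) array
J_dot = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m
# the same cost via an element-wise product and a sum, which yields a scalar
J_sum = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

theta_new = theta - (alpha / m) * (x.T @ (h - y))  # (n+1, 1)
print(z.shape, h.shape, J_dot.shape, theta_new.shape)
```

With all weights at zero, every prediction is 0.5 and the cost equals $\log 2 \approx 0.693$, a handy value to check against at iteration 0.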

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>use numpy.dot for matrix multiplication.</li>
    <li>To ensure that the fraction -1/m is a decimal value, cast either the numerator or denominator (or both), like `float(1)`, or write `1.` for the float version of 1. </li>
</ul>
</p>
</details>



In [78]:
def gradientDescent(x, y, theta, alpha, num_iters):
    """
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    """
    ### START CODE HERE ###
    # get 'm', the number of rows in matrix x
    m = x.shape[0]

    for i in range(0, num_iters):

        # get z, the dot product of x and theta
        z = np.matmul(x, theta)

        # get the sigmoid of z (sigmoid is defined in Exercise 1)
        h = sigmoid(z)

        # calculate the cost function
        J = (-1/m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

        # update the weights theta
        gradient = (1/m) * np.matmul(x.T, (h - y))
        theta = theta - alpha * gradient

        # optionally, print the cost every 100 iterations to check that it decreases
        if i % 100 == 0:
            print(f"Iteration {i}: Cost {J}")

    ### END CODE HERE ###
    J = float(J)
    return J, theta




In [79]:
# Test cases for gradient descent function
import numpy as np

def test_gradientDescent(target):
    successful_cases = 0
    failed_cases = []

    test_cases = [
        {
            "name": "default_check",
            "input": {
                "random_seed": 1,
                "input_dict": {
                    "x": np.array(
                        [
                            [1.00000000e00, 8.34044009e02, 1.44064899e03],
                            [1.00000000e00, 2.28749635e-01, 6.04665145e02],
                            [1.00000000e00, 2.93511782e02, 1.84677190e02],
                            [1.00000000e00, 3.72520423e02, 6.91121454e02],
                            [1.00000000e00, 7.93534948e02, 1.07763347e03],
                            [1.00000000e00, 8.38389029e02, 1.37043900e03],
                            [1.00000000e00, 4.08904499e02, 1.75623487e03],
                            [1.00000000e00, 5.47751864e01, 1.34093502e03],
                            [1.00000000e00, 8.34609605e02, 1.11737966e03],
                            [1.00000000e00, 2.80773877e02, 3.96202978e02],
                        ]
                    ),
                    "y": np.array(
                        [
                            [1.0],
                            [1.0],
                            [0.0],
                            [1.0],
                            [1.0],
                            [1.0],
                            [0.0],
                            [0.0],
                            [0.0],
                            [1.0],
                        ]
                    ),
                    "theta": np.zeros((3, 1)),
                    "alpha": 1e-8,
                    "num_iters": 700,
                },
            },
            "expected": {
                "J": 0.6709497038162118,
                "theta": np.array(
                    [[4.10713435e-07], [3.56584699e-04], [7.30888526e-05]]
                ),
            },
        },
        {
            "name": "larger_check",
            "input": {
                "random_seed": 2,
                "input_dict": {
                    "x": np.array(
                        [
                            [1.0, 435.99490214, 25.92623183, 549.66247788],
                            [1.0, 435.32239262, 420.36780209, 330.334821],
                            [1.0, 204.64863404, 619.27096635, 299.65467367],
                            [1.0, 266.8272751, 621.13383277, 529.14209428],
                            [1.0, 134.57994534, 513.57812127, 184.43986565],
                            [1.0, 785.33514782, 853.97529264, 494.23683738],
                            [1.0, 846.56148536, 79.64547701, 505.24609012],
                            [1.0, 65.28650439, 428.1223276, 96.53091566],
                            [1.0, 127.1599717, 596.74530898, 226.0120006],
                            [1.0, 106.94568431, 220.30620707, 349.826285],
                            [1.0, 467.78748458, 201.74322626, 640.40672521],
                            [1.0, 483.06983555, 505.23672002, 386.89265112],
                            [1.0, 793.63745444, 580.00417888, 162.2985985],
                            [1.0, 700.75234661, 964.55108009, 500.00836117],
                            [1.0, 889.52006395, 341.61365267, 567.14412763],
                            [1.0, 427.5459633, 436.74726303, 776.559185],
                            [1.0, 535.6041735, 953.74222694, 544.20816015],
                            [1.0, 82.09492228, 366.34240168, 850.850504],
                            [1.0, 406.27504305, 27.20236589, 247.177239],
                            [1.0, 67.14437074, 993.85201142, 970.58031338],
                        ]
                    ),
                    "y": np.array(
                        [
                            [1.0],
                            [1.0],
                            [1.0],
                            [0.0],
                            [0.0],
                            [1.0],
                            [0.0],
                            [0.0],
                            [1.0],
                            [0.0],
                            [1.0],
                            [0.0],
                            [0.0],
                            [0.0],
                            [1.0],
                            [1.0],
                            [0.0],
                            [0.0],
                            [1.0],
                            [0.0],
                        ]
                    ),
                    "theta": np.zeros((4, 1)),
                    "alpha": 1e-4,
                    "num_iters": 30,
                },
            },
            "expected": {
                "J": 6.5044107216556135,
                "theta": np.array(
                    [
                        [9.45211976e-05],
                        [2.40577958e-02],
                        [-1.77876847e-02],
                        [1.35674845e-02],
                    ]
                ),
            },
        },
    ]

    for test_case in test_cases:
        # Run the target function on the stored inputs (the random_seed entry is not used here)
        result_J, result_theta = target(**test_case["input"]["input_dict"])

        try:
            assert isinstance(result_J, float)
            successful_cases += 1
        except:
            failed_cases.append(
                {"name": test_case["name"], "expected": float, "got": type(result_J),}
            )
            print(
                f"Wrong output type for loss function. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

        try:
            assert np.isclose(result_J, test_case["expected"]["J"])
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"]["J"],
                    "got": result_J,
                }
            )
            print(
                f"Wrong output for the loss function. Check how you are implementing the matrix multiplications. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

        try:
            assert result_theta.shape == test_case["input"]["input_dict"]["theta"].shape
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["input"]["input_dict"]["theta"].shape,
                    "got": result_theta.shape,
                }
            )
            print(
                f"Wrong shape for weights matrix theta. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

        try:
            assert np.allclose(
                np.squeeze(result_theta), np.squeeze(test_case["expected"]["theta"]),
            )
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"]["theta"],
                    "got": result_theta,
                }
            )
            print(
                f"Wrong values for weight's matrix theta. Check how you are updating the matrix of weights. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

    if len(failed_cases) == 0:
        print("\033[92m All tests passed")
    else:
        print("\033[92m", successful_cases, " Tests passed")
        print("\033[91m", len(failed_cases), " Tests failed")

In [62]:
# Test your function
test_gradientDescent(gradientDescent)


<a name='2'></a>
## 2 - Extracting the Features

* Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
    * The first feature is the number of positive words in a tweet.
    * The second feature is the number of negative words in a tweet.
* Then train your logistic regression classifier on these features.
* Test the classifier on a validation set.

<a name='ex-3'></a>
### Exercise 3 - extract_features
Implement the extract_features function.
* This function takes in a single tweet.
* Process the tweet using the imported `process_tweet` function and save the list of tweet words.
* Loop through each word in the list of processed words
    * For each word, check the 'freqs' dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
    * Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)

**Note:** In the implementation described above, a word that appears more than once in a tweet is counted each time it appears, so duplicate words contribute repeatedly to the feature vector. This differs from the approach shown in the lecture videos.
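
The effect of counting duplicates can be seen on a toy example. The sketch below assumes the tweet has already been processed into a list of stems and uses a hand-made frequency dictionary; `extract_features_toy` and `toy_freqs` are hypothetical and independent of the graded function.

```python
import numpy as np

def extract_features_toy(words, freqs):
    """Build [bias, positive count, negative count] from already-processed words."""
    x = np.zeros((1, 3))
    x[0, 0] = 1.0  # bias term
    for word in words:
        x[0, 1] += freqs.get((word, 1), 0)  # 0 if the pair was never seen
        x[0, 2] += freqs.get((word, 0), 0)
    return x

toy_freqs = {("happi", 1): 10, ("happi", 0): 1, ("sad", 0): 8, ("day", 1): 3, ("day", 0): 4}

repeated = extract_features_toy(["happi", "happi", "day"], toy_freqs)
unknown = extract_features_toy(["blorb"], toy_freqs)
print(repeated, unknown)
```

Because `happi` occurs twice, its counts are added twice: the positive feature is 10 + 10 + 3 = 23 and the negative feature is 1 + 1 + 4 = 6. A tweet made only of unseen words yields `[1, 0, 0]`.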

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Make sure you handle cases when the (word, label) key is not found in the dictionary. </li>
    <li> Search the web for hints about using the 'get' function of a Python dictionary.  Here is an <a href="https://www.programiz.com/python-programming/methods/dictionary/get" > example </a> </li>
</ul>
</p>
</details>


In [80]:
def extract_features(tweet, freqs, process_tweet=process_tweet):
    """
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    """
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements for [bias, positive, negative] counts
    x = np.zeros(3)

    # bias term is set to 1
    x[0] = 1

    ### START CODE HERE ###

    # loop through each word in the list of words
    for word in word_l:

        # increment the word count for the positive label 1
        x[1] += freqs.get((word, 1.0), 0)

        # increment the word count for the negative label 0
        x[2] += freqs.get((word, 0.0), 0)

    ### END CODE HERE ###

    x = x[None, :]  # adding batch dimension for further processing
    assert(x.shape == (1, 3))
    return x

In [81]:
# Test cases for extract_features
def test_extract_features(target, freqs):
    successful_cases = 0
    failed_cases = []

    test_cases = [
        {
            "name": "default_check",
            "input": {
                "tweet": "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)",
                "freqs": freqs,
            },
            "expected": np.array(
                [[1.00e00, 3.133e03, 6.10e01]]
            ),
        },
        {
            "name": "unk_words_check",
            "input": {"tweet": "blorb bleeeeb bloooob", "freqs": freqs},
            "expected": np.array([[1.0, 0.0, 0.0]]),
        },
        {
            "name": "good_words_check",
            "input": {"tweet": "Hello world! All's good!", "freqs": freqs},
            "expected": np.array([[1.0, 263.0, 106.0]]),
        },
        {
            "name": "bad_words_check",
            "input": {"tweet": "It is so sad!", "freqs": freqs},
            "expected": np.array([[1.0, 5.0, 100.0]]),
        },
    ]

    for test_case in test_cases:
        result = target(**test_case["input"])

        try:
            assert result.shape == test_case["expected"].shape
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"].shape,
                    "got": result.shape,
                }
            )
            print(
                f"Wrong output shape. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

        try:
            assert np.allclose(result, test_case["expected"])
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"],
                    "got": result,
                }
            )
            print(
                f"Wrong output values. Check how you are computing the positive or negative word count. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

    if len(failed_cases) == 0:
        print("\033[92m All tests passed")
    else:
        print("\033[92m", successful_cases, " Tests passed")
        print("\033[91m", len(failed_cases), " Tests failed")

In [82]:
# Test your function
test_extract_features(extract_features, freqs)

All tests passed


<a name='3'></a>
## 3 - Training Your Model

To train the model:
* Stack the features for all training examples into a matrix X.
* Call `gradientDescent`, which you've implemented above.

This section is given to you.  Please read it for understanding and run the cell.

In [83]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")


**Expected Output**:

```
The cost after training is 0.22522315.
The resulting vector of weights is [6e-08, 0.00053818, -0.0005583]
```
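
For readers who want to see roughly what the training call does, here is a minimal sketch of vectorized batch gradient descent for logistic regression. It is not the notebook's `gradientDescent`: the name `gradient_descent_sketch`, the toy data, and the hyperparameters are assumptions made up for illustration. The sketch computes the cost once at the end instead of printing intermediate values on every iteration.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def gradient_descent_sketch(X, y, theta, alpha, num_iters):
    """Hypothetical helper. X is (m, n), y is (m, 1), theta is (n, 1)."""
    m = X.shape[0]
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                         # predictions, shape (m, 1)
        theta = theta - (alpha / m) * (X.T @ (h - y))  # one gradient step

    # Binary cross-entropy cost at the final weights.
    h = sigmoid(X @ theta)
    eps = 1e-12  # guards against log(0)
    cost = -(y.T @ np.log(h + eps) + (1 - y).T @ np.log(1 - h + eps)) / m
    return cost.item(), theta


# Toy data (assumed): a bias column plus two count-like features.
X_toy = np.array([[1.0, 2.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [1.0, 0.0, 2.0],
                  [1.0, 1.0, 3.0]])
y_toy = np.array([[1.0], [1.0], [0.0], [0.0]])

J_toy, theta_toy = gradient_descent_sketch(X_toy, y_toy, np.zeros((3, 1)), 0.1, 200)
print(f"cost: {J_toy:.6f}")
print("weights:", np.squeeze(theta_toy))
```

With all-zero starting weights the cost begins at ln 2 (about 0.693), so any final cost below that value indicates the updates are moving in the right direction.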

<a name='4'></a>
## 4 -  Test your Logistic Regression

It is time for you to test your logistic regression function on some new input that your model has not seen before.
<a name='ex-4'></a>
### Exercise 4 - predict_tweet
Implement `predict_tweet`.
Predict whether a tweet is positive or negative.

* Given a tweet, process it, then extract the features.
* Apply the model's learned weights on the features to get the logits.
* Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).

$$y_{pred} = \mathrm{sigmoid}(\mathbf{x} \cdot \theta)$$
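
As a worked illustration of the formula, the sketch below plugs in the feature vector from the `default_check` test case for `extract_features` and the weights listed in the expected training output. The names `x_example` and `theta_example` are made up for this example, and your own trained `theta` may differ slightly.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1 / (1 + np.exp(-z))


# Features: [bias, positive word count, negative word count], shape (1, 3).
x_example = np.array([[1.0, 3133.0, 61.0]])

# Weights taken from the expected training output, shape (3, 1).
theta_example = np.array([[6e-08], [0.00053818], [-0.0005583]])

z = np.dot(x_example, theta_example)  # logit, shape (1, 1)
y_pred = sigmoid(z)                   # probability, shape (1, 1)

print("logit:", z.item())
print("prediction:", y_pred.item())
```

The positive-count term dominates the logit here, so the prediction lands well above 0.5 (roughly 0.84) and the tweet would be classified as positive.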

In [84]:
import numpy as np

def sigmoid(z):
    """
    Apply sigmoid function to scalar, vector, or matrix.
    """
    return 1 / (1 + np.exp(-z))

def predict_tweet(tweet, freqs, theta):
    """
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability that the tweet is positive, as a (1,1) array
    """
    ### START CODE HERE ###

    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)

    # compute the logit as the dot product of the features and the weights
    z = np.dot(x, theta)

    # apply the sigmoid to the logit to get the prediction
    y_pred = sigmoid(z)


    ### END CODE HERE ###

    return y_pred

In [85]:
# Test cases for predict_tweet
def test_predict_tweet(target, freqs, theta):
    successful_cases = 0
    failed_cases = []

    test_cases = [
        {
            "name": "default_check1",
            "input": {"tweet": "I am happy", "freqs": freqs, "theta": theta},
            "expected": np.array([[0.5192746]]),
        },
        {
            "name": "default_check2",
            "input": {"tweet": "I am bad", "freqs": freqs, "theta": theta},
            "expected": np.array([[0.49434685]]),
        },
        {
            "name": "default_check3",
            "input": {
                "tweet": "this movie should have been great",
                "freqs": freqs,
                "theta": theta,
            },
            "expected": np.array([[0.5159792]]),
        },
        {
            "name": "default_check5",
            "input": {"tweet": "It is a good day", "freqs": freqs, "theta": theta,},
            "expected": np.array([[0.52320595]]),
        },
        {
            "name": "default_check6",
            "input": {"tweet": "It is a bad bad day", "freqs": freqs, "theta": theta,},
            "expected": np.array([[0.49780224]]),
        },
        {
            "name": "default_check7",
            "input": {
                "tweet": "It is a good day",
                "freqs": freqs,
                "theta": np.array([[5.0000e-04], [-3.4e-02], [3.2e-02]]),
            },
            "expected": np.array([[0.00147813]]),
        },
        {
            "name": "default_check8",
            "input": {
                "tweet": "It is a bad bad day",
                "freqs": freqs,
                "theta": np.array([[5.0000e-04], [-3.4e-02], [3.2e-02]]),
            },
            "expected": np.array([[0.45673348]]),
        },
        {
            "name": "default_check9",
            "input": {
                "tweet": "this movie should have been great",
                "freqs": freqs,
                "theta": np.array([[5.0000e-04], [-3.4e-02], [3.2e-02]]),
            },
            "expected": np.array([[0.01561938]]),
        },
    ]

    for test_case in test_cases:
        result = target(**test_case["input"])

        try:
            assert result.shape == test_case["expected"].shape
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"].shape,
                    "got": result.shape,
                }
            )
            print(
                f"Wrong output shape. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

        try:
            assert np.allclose(result, test_case["expected"])
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"],
                    "got": result,
                }
            )
            print(
                f"Wrong predicted values. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

    if len(failed_cases) == 0:
        print("\033[92m All tests passed")
    else:
        print("\033[92m", successful_cases, " Tests passed")
        print("\033[91m", len(failed_cases), " Tests failed")

In [86]:
# Test your function
test_predict_tweet(predict_tweet, freqs, theta)

All tests passed


<a name='4-1'></a>
### 4.1 -  Check the Performance using the Test Set
After training your model using the training set above, check how your model might perform on real, unseen data, by testing it against the test set.

<a name='ex-5'></a>
### Exercise 5 - test_logistic_regression
Implement `test_logistic_regression`.
* Given the test data and the weights of your trained model, calculate the accuracy of your logistic regression model.
* Use your `predict_tweet` function to make predictions on each tweet in the test set.
* If the prediction is > 0.5, set the model's classification `y_hat` to 1; otherwise set it to 0.
* A prediction is accurate when `y_hat` equals `test_y`. Sum up all the instances where they are equal and divide by `m`, the number of tweets in the test set.
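
The thresholding and accuracy steps above can be sketched on a handful of made-up numbers. The arrays `probs` and `labels` below are hypothetical and stand in for the outputs of `predict_tweet` and for `test_y`. The sketch also shows why squeezing the labels matters: comparing an `(m,)` array with an `(m, 1)` array broadcasts to an `(m, m)` matrix, which would give a misleading mean.

```python
import numpy as np

# Hypothetical predicted probabilities for five tweets, shape (5,).
probs = np.array([0.91, 0.62, 0.48, 0.07, 0.55])

# Hypothetical true labels stored as a column vector, shape (5, 1).
labels = np.array([[1.0], [1.0], [1.0], [0.0], [0.0]])

# Threshold at 0.5 to get hard classifications.
y_hat = (probs > 0.5).astype(float)

# Pitfall: (5,) == (5, 1) broadcasts to a (5, 5) comparison matrix.
broadcast_shape = (y_hat == labels).shape

# Squeeze the labels to (5,) so the comparison is element by element.
accuracy = np.mean(y_hat == np.squeeze(labels))

print("broadcast shape without squeeze:", broadcast_shape)
print("accuracy:", accuracy)
```

Three of the five toy predictions match their labels, so the accuracy here is 0.6.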


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Use np.asarray() to convert a list to a numpy array</li>
    <li>Use numpy.squeeze() to make an (m,1) dimensional array into an (m,) array </li>
</ul>
</p>
</details>

In [87]:
def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """

    m = len(test_x)  # Number of tweets in the test set

    # Initialize the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # Get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        # Classify as 1.0 if the probability > 0.5, else classify as 0.0
        if y_pred > 0.5:
            y_hat.append(1.0)
        else:
            y_hat.append(0.0)

    # Convert y_hat and test_y to numpy arrays for comparison
    y_hat = np.array(y_hat)
    test_y = np.squeeze(test_y)  # Convert (m, 1) to (m,)

    # Calculate accuracy
    accuracy = np.mean(y_hat == test_y)

    return accuracy

In [88]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.9950


#### Expected Output:
```0.9950```  
Pretty good!

In [None]:
# Unit tests for test_logistic_regression
def unittest_test_logistic_regression(target, freqs, theta):
    successful_cases = 0
    failed_cases = []

    test_cases = [
        {
            "name": "default_check1",
            "input": {
                "test_x": [
                    "Bro:U wan cut hair anot,ur hair long Liao bo\nMe:since ord liao,take it easy lor treat as save $ leave it longer :)\nBro:LOL Sibei xialan",
                    "@heyclaireee is back! thnx God!!! i'm so happy :)",
                    "@BBCRadio3 thought it was my ears which were malfunctioning, thank goodness you cleared that one up with an apology :-)",
                    "@HumayAG 'Stuck in the centre right with you. Clowns to the right, jokers to the left...' :) @orgasticpotency @ahmedshaheed @AhmedSaeedGahaa",
                    "Happy Friday :-) http://t.co/iymPIlWXFY",
                    "I wanna change my avi but uSanele :(",
                    "MY PUPPY BROKE HER FOOT :(",
                    "where's all the jaebum baby pictures :((",
                    "But but Mr Ahmad Maslan cooks too :( https://t.co/ArCiD31Zv6",
                    "@eawoman As a Hull supporter I am expecting a misserable few weeks :-(",
                ],
                "test_y": np.array(
                    [
                        [1.0],
                        [1.0],
                        [1.0],
                        [1.0],
                        [1.0],
                        [0.0],
                        [0.0],
                        [0.0],
                        [0.0],
                        [0.0],
                    ]
                ),
                "freqs": freqs,
                "theta": theta,
            },
            "expected": 1.0,
        },
        {
            "name": "default_check2",
            "input": {
                "test_x": [
                    "Bro:U wan cut hair anot,ur hair long Liao bo\nMe:since ord liao,take it easy lor treat as save $ leave it longer :)\nBro:LOL Sibei xialan",
                    "@heyclaireee is back! thnx God!!! i'm so happy :)",
                    "@BBCRadio3 thought it was my ears which were malfunctioning, thank goodness you cleared that one up with an apology :-)",
                    "@HumayAG 'Stuck in the centre right with you. Clowns to the right, jokers to the left...' :) @orgasticpotency @ahmedshaheed @AhmedSaeedGahaa",
                    "Happy Friday :-) http://t.co/iymPIlWXFY",
                    "I wanna change my avi but uSanele :(",
                    "MY PUPPY BROKE HER FOOT :(",
                    "where's all the jaebum baby pictures :((",
                    "But but Mr Ahmad Maslan cooks too :( https://t.co/ArCiD31Zv6",
                    "@eawoman As a Hull supporter I am expecting a misserable few weeks :-(",
                ],
                "test_y": np.array(
                    [
                        [1.0],
                        [1.0],
                        [1.0],
                        [1.0],
                        [1.0],
                        [0.0],
                        [0.0],
                        [0.0],
                        [0.0],
                        [0.0],
                    ]
                ),
                "freqs": freqs,
                "theta": np.array([[5.0000e-04], [-3.4e-02], [3.2e-02]]),
            },
            "expected": 0.0,
        },
    ]

    for test_case in test_cases:
        result = target(**test_case["input"])

        try:
            assert isinstance(result, np.float64)
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": np.float64,
                    "got": type(result),
                }
            )
            print(
                f"Wrong output type. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

        try:
            assert np.isclose(result, test_case["expected"])
            successful_cases += 1
        except:
            failed_cases.append(
                {
                    "name": test_case["name"],
                    "expected": test_case["expected"],
                    "got": result,
                }
            )
            print(
                f"Wrong accuracy value. \n\tExpected: {failed_cases[-1].get('expected')}.\n\tGot: {failed_cases[-1].get('got')}."
            )

    if len(failed_cases) == 0:
        print("\033[92m All tests passed")
    else:
        print("\033[92m", successful_cases, " Tests passed")
        print("\033[91m", len(failed_cases), " Tests failed")

In [None]:
# Test your function
unittest_test_logistic_regression(test_logistic_regression, freqs, theta)

<a name='5'></a>
## 5 -  Error Analysis

In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically, what kinds of tweets does your model misclassify?

In [89]:
# Some error analysis done for you
print('Label Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = predict_tweet(x, freqs, theta)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', process_tweet(x))
        # pull plain Python scalars out of the arrays before %-formatting
        print('%d\t%0.8f\t%s' % (y.item(), y_hat.item(), ' '.join(process_tweet(x)).encode('ascii', 'ignore')))

Label Predicted Tweet
THE TWEET IS: @MarkBreech Not sure it would be good thing 4 my bottom daring 2 say 2 Miss B but Im gonna be so stubborn on mouth soaping ! #NotHavingit :p
THE PROCESSED TWEET IS: ['sure', 'would', 'good', 'thing', '4', 'bottom', 'dare', '2', 'say', '2', 'miss', 'b', 'im', 'gonna', 'stubborn', 'mouth', 'soap', 'nothavingit', ':p']
1	0.48942981	b'sure would good thing 4 bottom dare 2 say 2 miss b im gonna stubborn mouth soap nothavingit :p'
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots
http://t.co/UGQzOx0huu
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418981	b"i'm play brain dot braindot"
THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/aOKldo3GMj http://t.co/xWCM9qyRG5
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418981	b"i'm play brain dot braindot"


THE TWEET IS: I'm playing Brain Dots : ) #BrainDots http://t.co/R2JBO8iNww http://t.co/ow5BBwdEMY
THE PROCESSED TWEET IS: ["i'm", 'play', 'brain', 'dot', 'braindot']
1	0.48418981	b"i'm play brain dot braindot"
THE TWEET IS: off to the park to get some sunlight : )
THE PROCESSED TWEET IS: ['park', 'get', 'sunlight']
1	0.49636406	b'park get sunlight'
THE TWEET IS: @msarosh Uff Itna Miss karhy thy ap :p
THE PROCESSED TWEET IS: ['uff', 'itna', 'miss', 'karhi', 'thi', 'ap', ':p']
1	0.48250522	b'uff itna miss karhi thi ap :p'
THE TWEET IS: @phenomyoutube u probs had more fun with david than me : (
THE PROCESSED TWEET IS: ['u', 'prob', 'fun', 'david']
0	0.50988296	b'u prob fun david'
THE TWEET IS: pats jay : (
THE PROCESSED TWEET IS: ['pat', 'jay']
0	0.50040366	b'pat jay'
THE TWEET IS: my beloved grandmother : ( https://t.co/wt4oXq5xCf
THE PROCESSED TWEET IS: ['belov', 'grandmoth']
0	0.50000002	b'belov grandmoth'
THE TWEET IS: Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co
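
One way to start answering the question is to check how far each misclassified prediction sits from the 0.5 threshold. The sketch below hard-codes the label and probability pairs printed in the output above; it covers only the rows visible here, and the array names are assumptions for illustration.

```python
import numpy as np

# True labels and predicted probabilities copied from the printed rows above.
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])
probs = np.array([
    0.48942981, 0.48418981, 0.48418981, 0.48418981, 0.49636406,
    0.48250522, 0.50988296, 0.50040366, 0.50000002,
])

# Distance of each prediction from the decision threshold.
margins = np.abs(probs - 0.5)

# Positive tweets predicted negative, and negative tweets predicted positive.
false_negatives = int(np.sum((labels == 1) & (probs <= 0.5)))
false_positives = int(np.sum((labels == 0) & (probs > 0.5)))

print("false negatives:", false_negatives)
print("false positives:", false_positives)
print("largest margin:", margins.max())
```

Every error listed sits within 0.02 of the threshold, which suggests the model has little evidence either way for these tweets. In several of them the emoticon is written with a space (`: )` or `: (`) and does not appear in the processed tokens, so the strongest sentiment cue seems to be lost before feature extraction.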

Later in this specialization, we will see how we can use deep learning to improve the prediction performance.

<a name='6'></a>
## 6 - Predict with your own Tweet

In [90]:
# Feel free to change the tweet below
my_tweet = 'This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')

['ridicul', 'bright', 'movi', 'plot', 'terribl', 'sad', 'end']
[[0.48125421]]
Negative sentiment
