# Logistic regression

- Used for binary classification. 
- Fits an S-shaped (sigmoid) curve to the data. It is still considered a linear model because the output depends on a weighted sum of the inputs and parameters, $w^{T}x + b$.
    - Given $x$, want $\hat{y} = P(y=1|x)$ 
        - Where $x \in {\rm I\!R^{n_{x}}}$ and $0 \le \hat{y} \le 1$
- Parameters
    - $w \in {\rm I\!R^{n_{x}}}$
    - $b \in {\rm I\!R}$
- Output $\hat{y} = \sigma{(w^{T}x + b)}$
    - Where $\sigma{(z)} = \dfrac{1}{1+e^{-z}}$
        - If $z$ is large positive, $\sigma{(z)} \approx 1$
        - If $z$ is large negative, $\sigma{(z)} \approx 0$
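The limiting behaviour of $\sigma$ can be checked numerically. The sketch below is one possible NumPy implementation; the notebook later imports `sigmoid` from `dl_utils`, whose actual code is not shown here, so treat this version as an assumption.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

values = sigmoid(np.array([-10.0, 0.0, 10.0]))
print(values)  # close to 0, exactly 0.5, close to 1
```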
        
## Collinearity
- One feature can be (approximately) predicted as a linear function of the other features
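A quick way to spot collinear features is to look at their pairwise correlations. The data below is synthetic and made up for illustration: `x2` is built as an exact linear function of `x1`, while `x3` is unrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 3.0 * x1 + 1.0            # exact linear function of x1 (collinear)
x3 = rng.normal(size=200)      # unrelated feature

# Rows are features, so corr[i, j] is the correlation between feature i and j.
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
```

A correlation close to 1 or -1 between two features, as between `x1` and `x2` here, suggests that one of them is redundant.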
     
## Loss function

- $L(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
    - If $y = 1$, $L(\hat{y}, y) = -\log\hat{y}$
        - We want $\hat{y}$ as large as possible ($\hat{y} \approx 1$)
    - If $y = 0$, $L(\hat{y}, y) = -\log(1-\hat{y})$
        - We want $\hat{y}$ as small as possible ($\hat{y} \approx 0$)
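To make the two cases concrete, here is a small sketch that evaluates the loss for a few hand-picked predictions (the numbers are illustrative only):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example with label y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))  # small: confident and correct
print(loss(0.1, 1))  # large: confident and wrong
print(loss(0.1, 0))  # small: confident and correct
```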
        
## Cost function

$J(w,b) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\left(y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\right)$
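The cost is just the average of the per-example losses. A minimal NumPy sketch, assuming predictions `A` and labels `Y` are stored as row vectors of shape `(1, m)` (the example values are made up):

```python
import numpy as np

def cost(A, Y):
    """Mean cross-entropy over m examples; A and Y have shape (1, m)."""
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.sum(losses) / m)

Y = np.array([[1, 0, 1]])
A = np.array([[0.9, 0.2, 0.6]])
print(cost(A, Y))
```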

## Gradient descent

- We want $w,b$ that minimize $J(w,b)$, so we repeatedly update them with learning rate $\alpha$
    - $w := w - \alpha\dfrac{\partial J(w,b)}{\partial w}$
    - $b := b - \alpha\dfrac{\partial J(w,b)}{\partial b}$
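The partial derivatives in the update rule follow from the chain rule. A short derivation for a single example, writing $z = w^{T}x + b$ and $a = \hat{y} = \sigma(z)$ (the symbol $a$ is the one used in the vectorized section, and $\sigma'(z) = a(1-a)$ is the standard derivative of the sigmoid):

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a)$$
$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial z} = -y(1-a) + (1-y)a = a - y$$
$$\frac{\partial L}{\partial w} = (a - y)\,x, \qquad \frac{\partial L}{\partial b} = a - y$$

Averaging over the $m$ training examples gives the gradients of the cost:

$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(a^{(i)} - y^{(i)})\,x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(a^{(i)} - y^{(i)})$$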
    
## With n features and m training examples

$X = 
\begin{bmatrix}
    \vdots & \vdots & \vdots \\
    \vdots & \vdots & \vdots \\ 
    x^{(1)} & x^{(2)} \ldots & x^{(m)} \\
    \vdots & \vdots & \vdots \\
    \vdots & \vdots & \vdots \\ 
\end{bmatrix}$

$Z = 
\begin{bmatrix}
    z^{(1)} & z^{(2)} \ldots & z^{(m)} \\
\end{bmatrix} = w^{T}X + \begin{bmatrix}
    b & b \ldots & b \\
\end{bmatrix} = \begin{bmatrix}
    w^{T}x^{(1)}+b & w^{T}x^{(2)}+b \ldots & w^{T}x^{(m)}+b \\
\end{bmatrix}$

In NumPy this is `np.dot(w.T, X) + b`; the scalar `b` is broadcast across all $m$ columns.

$A = 
\begin{bmatrix}
    a^{(1)} & a^{(2)} \ldots & a^{(m)} \\
\end{bmatrix} = \sigma(Z)$

$Y = 
\begin{bmatrix}
    y^{(1)} & y^{(2)} \ldots & y^{(m)} \\
\end{bmatrix}$

$dZ = 
\begin{bmatrix}
    dz^{(1)} & dz^{(2)} \ldots & dz^{(m)} \\
\end{bmatrix} = A - Y =
\begin{bmatrix}
    a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} \ldots & a^{(m)} - y^{(m)} \\
\end{bmatrix}$

$dw = \dfrac{1}{m}XdZ^{T}$

$db = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}dz^{(i)}$ (in NumPy, `np.sum(dZ) / m`)

$w := w - \alpha dw$

$b := b - \alpha db$
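The vectorized equations above translate almost line by line into NumPy. The sketch below runs one gradient descent step on a tiny made-up dataset; the data, the learning rate and the local `sigmoid` helper are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed for illustration): n_x = 2 features, m = 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])   # shape (n_x, m)
Y = np.array([[0, 0, 1, 1]])           # shape (1, m)
w = np.zeros((2, 1))
b = 0.0
alpha = 0.1
m = X.shape[1]

Z = np.dot(w.T, X) + b       # (1, m); b is broadcast across columns
A = sigmoid(Z)               # (1, m)
dZ = A - Y                   # (1, m)
dw = np.dot(X, dZ.T) / m     # (n_x, 1)
db = np.sum(dZ) / m          # scalar

# One gradient descent update.
w = w - alpha * dw
b = b - alpha * db
print(w.ravel(), b)
```

No explicit loop over the $m$ examples is needed: the matrix product `np.dot(X, dZ.T)` sums the per-example contributions.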

## Example

<img src="img/LogReg_kiank.png" style="width:650px;height:400px;">

### Packages

- [numpy](http://www.numpy.org) for scientific computing with Python.
- [matplotlib](http://matplotlib.org) to plot graphs in Python.
- [h5py](http://www.h5py.org) to interact with a dataset that is stored in an H5 file.
- [PIL](http://www.pythonware.com/products/pil/) and [scipy](https://www.scipy.org/) to test your model with your own picture.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy

from PIL import Image
from scipy import ndimage
from dl_utils import *

%matplotlib inline

### Data

Dataset ("data.h5") contains:
- A training set of `m_train` images labeled as cat $(y=1)$ or non-cat $(y=0)$.
- A test set of `m_test` images labeled as cat $(y=1)$ or non-cat $(y=0)$.
- Each image is of shape `(num_px, num_px, 3)`, where $3$ is for the $3$ channels (RGB).

In [None]:
# Load the data. (cat/non-cat)
train_set_location = 'data/train_catvnoncat.h5'
test_set_location = 'data/test_catvnoncat.h5'
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset(train_set_location, test_set_location)

# Reshape the training and test examples such that matrices are flattened into vectors.
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T

# Scale pixel values to [0, 1] by dividing by 255, the maximum value of a pixel channel.
train_set_x = train_set_x_flatten / 255
test_set_x = test_set_x_flatten / 255
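The reshape step is easier to follow on a small array. The sketch below uses a hypothetical random batch of 5 images of 4 x 4 pixels instead of the real dataset, and checks that each image ends up as one column.

```python
import numpy as np

# Hypothetical batch: 5 images of 4 x 4 pixels with 3 channels.
images = np.random.randint(0, 256, size=(5, 4, 4, 3), dtype=np.uint8)

flat = images.reshape(images.shape[0], -1).T   # (4*4*3, 5): one column per image
scaled = flat / 255                            # floats in [0, 1]

print(flat.shape)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```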

In [None]:
dim = 2
w, b = initialize_with_zeros(dim)
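`initialize_with_zeros` comes from `dl_utils`, which is not shown in this notebook. A plausible stand-in, assuming it returns a zero column vector of shape `(dim, 1)` and a scalar bias of 0, could look like this:

```python
import numpy as np

def initialize_with_zeros(dim):
    """Hypothetical stand-in for the dl_utils helper: zero weights and zero bias."""
    w = np.zeros((dim, 1))
    b = 0.0
    return w, b

w, b = initialize_with_zeros(2)
print(w.shape, b)
```

Starting from zeros is fine for logistic regression because the cost is convex, so gradient descent does not depend on a lucky starting point.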

### Forward and Backward propagation

Forward Propagation:
- Get 
$$X$$
- Compute 
$$A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)})$$
- Calculate the cost function
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right]$$

Backward Propagation: 
$$\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T$$
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$$
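One way to gain confidence in the backward propagation formulas is to compare them with finite differences of the cost. The sketch below does this on random made-up data; `cost_fn`, the data and the step size `eps` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(w, b, X, Y):
    A = sigmoid(np.dot(w.T, X) + b)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))            # 3 features, 5 examples
Y = np.array([[1, 0, 1, 1, 0]])
w = rng.normal(size=(3, 1)) * 0.1
b = 0.2
m = X.shape[1]

# Analytic gradients from the formulas above.
A = sigmoid(np.dot(w.T, X) + b)
dw = np.dot(X, (A - Y).T) / m
db = float(np.sum(A - Y) / m)

# Central finite differences of the cost.
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    dw_num[j, 0] = (cost_fn(w + step, b, X, Y) - cost_fn(w - step, b, X, Y)) / (2 * eps)
db_num = (cost_fn(w, b + eps, X, Y) - cost_fn(w, b - eps, X, Y)) / (2 * eps)

print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```

Both printed differences should be tiny if the analytic gradients are correct.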

In [None]:
def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above.

    Arguments:
    w -- Weights, a numpy array of size (num_px * num_px * 3, 1).
    b -- Bias, a scalar.
    X -- Data of size (num_px * num_px * 3, number of examples).
    Y -- True "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples).

    Return:
    grads -- Dictionary with "dw" and "db", the gradients of the cost with respect to w and b (same shapes as w and b).
    cost -- Negative log-likelihood cost for logistic regression.
    """
    
    # FORWARD PROPAGATION (FROM X TO COST)
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)  # compute activation
    cost = -(np.dot(Y, np.log(A.T)) + np.dot((1-Y), np.log(1-A.T))) / m  # compute cost
    
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw = np.dot(X, (A - Y).T) / m
    db = np.sum(A - Y) / m

    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    
    grads = {"dw": dw, "db": db}
    
    return grads, cost

In [None]:
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
    """
    This function optimizes w and b by running a gradient descent algorithm.
    
    Arguments:
    w -- Weights, a numpy array of size (num_px * num_px * 3, 1).
    b -- Bias, a scalar.
    X -- Data of shape (num_px * num_px * 3, number of examples).
    Y -- True "label" vector (containing 0 if non-cat, 1 if cat) of shape (1, number of examples).
    num_iterations -- Number of iterations of the optimization loop.
    learning_rate -- Learning rate of the gradient descent update rule.
    print_cost -- True to print the cost every 100 steps.
    
    Returns:
    params -- Dictionary containing the weights w and bias b.
    grads -- Dictionary containing the gradients of the cost with respect to the weights and bias.
    costs -- List of the costs recorded every 100 iterations; used to plot the learning curve.
    """
    
    costs = []
    
    for i in range(num_iterations):
    
        # Cost and gradient calculation.
        grads, cost = propagate(w, b, X, Y)
        
        # Retrieve derivatives from grads.
        dw = grads["dw"]
        db = grads["db"]
        
        # Update rule.
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record the costs.
        if i % 100 == 0:
            costs.append(cost)
        
        # Print the cost every 100 training iterations.
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    params = {"w": w, "b": b}
    
    grads = {"dw": dw, "db": db}
    
    return params, grads, costs

### Inference

- Calculate $\hat{Y} = A = \sigma(w^T X + b)$
- Convert the entries of $A$ into 0 (if activation <= 0.5) or 1 (if activation > 0.5) and store the predictions in a vector `Y_prediction`.

In [None]:
def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b).
    
    Arguments:
    w -- Weights, a numpy array of size (num_px * num_px * 3, 1).
    b -- Bias, a scalar.
    X -- Data of size (num_px * num_px * 3, number of examples).
    
    Returns:
    Y_prediction -- A numpy array (vector) containing all predictions (0/1) for the examples in X.
    '''
    
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector "A" predicting the probabilities of a cat being present in the picture.
    A = sigmoid(np.dot(w.T, X) + b) 
    
    for i in range(A.shape[1]):
        
        # Convert probabilities A[0, i] to actual predictions Y_prediction[0, i].
        if A[0, i] <= 0.5:
            Y_prediction[0, i] = 0
        else:
            Y_prediction[0, i] = 1
    
    assert(Y_prediction.shape == (1, m))
    
    return Y_prediction
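The thresholding loop in `predict` can also be written as a single vectorized comparison. A small sketch with made-up activations, keeping the same convention that 0.5 itself maps to 0:

```python
import numpy as np

A = np.array([[0.1, 0.5, 0.51, 0.9]])     # example activations, shape (1, m)
Y_prediction = (A > 0.5).astype(float)    # 0.5 itself maps to 0, as in the loop
print(Y_prediction)  # [[0. 0. 1. 1.]]
```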

### Model
- `Y_prediction_test` for the predictions on the test set.
- `Y_prediction_train` for the predictions on the train set.
- `parameters, grads, costs` for the outputs of `optimize()`.

In [None]:
def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
    """
    Builds the logistic regression model by calling the functions implemented previously.
    
    Arguments:
    X_train -- Training set represented by a numpy array of shape (num_px * num_px * 3, m_train).
    Y_train -- Training labels represented by a numpy array of shape (1, m_train).
    X_test -- Test set represented by a numpy array of shape (num_px * num_px * 3, m_test).
    Y_test -- Test labels represented by a numpy array of shape (1, m_test).
    num_iterations -- Hyperparameter representing the number of iterations to optimize the parameters.
    learning_rate -- Hyperparameter representing the learning rate used in the update rule of optimize().
    print_cost -- Set to True to print the cost every 100 iterations.
    
    Returns:
    d -- Dictionary containing information about the model.
    """
    
    # initialize parameters with zeros.
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent.
    # Pass print_cost through so the caller's setting is respected.
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = print_cost)
    
    # Retrieve parameters w and b from dictionary "parameters".
    w = parameters["w"]
    b = parameters["b"]
    
    # Predict test/train set examples.
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test accuracy.
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))
    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d

In [None]:
d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)

# Plot learning curve.
costs = np.squeeze(d['costs'])
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate = " + str(d["learning_rate"]))
plt.show()

### Learning rate

- The learning rate $\alpha$ determines how rapidly we update the parameters.
    - If it is too large, the cost may oscillate up and down or even diverge.
    - If it is too small, we will need too many iterations to converge to the best values.
- In deep learning, the usual recommendation is to:
    - Choose the learning rate that best minimizes the cost function.
    - If the model overfits, use other techniques to reduce overfitting.
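The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes $J(w) = w^2$, which is not the notebook's cost function and is only an assumed stand-in, with three different values of $\alpha$.

```python
def run_gradient_descent(alpha, steps=20, w0=1.0):
    """Minimize J(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

for alpha in (0.01, 0.4, 1.1):
    print(alpha, run_gradient_descent(alpha))
```

With $\alpha = 0.01$ the parameter moves only slowly toward the minimum at 0, with $\alpha = 0.4$ it converges quickly, and with $\alpha = 1.1$ every step overshoots and the iterates grow in magnitude.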

In [None]:
learning_rates = [0.01, 0.001, 0.0001]
models = {}
for i in learning_rates:
    print ("learning rate is: " + str(i))
    models[str(i)] = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 1500, learning_rate = i, print_cost = False)
    print ('\n' + "-------------------------------------------------------" + '\n')

for i in learning_rates:
    plt.plot(np.squeeze(models[str(i)]["costs"]), label= str(models[str(i)]["learning_rate"]))

plt.ylabel('cost')
plt.xlabel('iterations (hundreds)')

legend = plt.legend(loc='upper center', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
plt.show()

## Example
- Feature extraction with frequencies

Positive
- I am happy because I am learning NLP
- I am happy

Negative
- I am sad, I am not learning NLP
- I am sad

<table>
<tr>
    <td>Vocabulary</td>
    <td>PosFreq(1)</td>
    <td>NegFreq(0)</td>
</tr>
<tr>
    <td>I</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>am</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>happy</td>
    <td>2</td>
    <td>0</td>
</tr>
<tr>
    <td>because</td>
    <td>1</td>
    <td>0</td>
</tr>
<tr>
    <td>learning</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>NLP</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>sad</td>
    <td>0</td>
    <td>2</td>
</tr>
<tr>
    <td>not</td>
    <td>0</td>
    <td>1</td>
</tr>
</table>

For the sentence "I am sad, I am not learning NLP":
- Freq(w,1) = 3+3+1+1 = 8 ("sad" and "not" never occur in the positive sentences; "happy" and "because" do not appear in the sentence)
- Freq(w,0) = 3+3+1+1+2+1 = 11 ("happy" and "because" do not appear in the sentence)

The feature vector becomes [1,8,11] 
- 1 is the bias
- 8 is the positive feature
- 11 is the negative feature
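The frequency counts and the feature vector can be reproduced with a few lines of Python. This is a simplified sketch: `tokenize` and `features` are hypothetical helpers, no stop-word removal or stemming is applied, and each unique word of the sentence is counted once.

```python
from collections import Counter

corpus = [
    ("I am happy because I am learning NLP", 1),
    ("I am happy", 1),
    ("I am sad, I am not learning NLP", 0),
    ("I am sad", 0),
]

def tokenize(sentence):
    # Simplified for illustration: drop commas and split on whitespace.
    return sentence.replace(",", "").split()

# freqs[(word, label)] = number of times word occurs in sentences with that label.
freqs = Counter()
for sentence, label in corpus:
    for word in tokenize(sentence):
        freqs[(word, label)] += 1

def features(sentence):
    words = set(tokenize(sentence))          # each unique word counted once
    pos = sum(freqs[(w, 1)] for w in words)
    neg = sum(freqs[(w, 0)] for w in words)
    return [1, pos, neg]

print(features("I am sad, I am not learning NLP"))  # [1, 8, 11]
```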

### Preprocessing

- Eliminate stop words such as "and", "is", "are", "at", "has", "for", "a"
- Eliminate punctuation
- Eliminate handles (starting with @) and URLs
- Stemming: reduce each word to its stem. For example, "tune", "tuned" and "tuning" all become "tun"
- Convert all words to lowercase
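Most of these steps can be sketched with the standard library alone. The function below is a simplified illustration, not the notebook's `process_tweet`: the stop-word list is a tiny made-up set (including "because" as an assumption), and stemming is left out, so "learning" is not reduced to "learn".

```python
import re
import string

# Tiny illustrative stop-word list, not the one used by the notebook.
STOP_WORDS = {"i", "am", "and", "is", "are", "at", "has", "for", "a", "because"}

def simple_process(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet)   # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)           # remove handles
    tokens = tweet.lower().split()               # lowercase and split on whitespace
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(simple_process("I am happy Because i am learning NLP @DeepLearning"))
# ['happy', 'learning', 'nlp']
```

A real pipeline would typically add a stemmer on top of this, which is what turns "learning" into "learn".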

For the tweet "I am happy Because i am learning NLP @DeepLearning", preprocessing gives
- [happy, learn, nlp]

Then the feature vector becomes [1,4,2]
- 1 is the bias
- In the positive class, happy appears twice and learn and nlp appear once each, thus 4 is the positive feature
- In the negative class, learn and nlp appear once each, thus 2 is the negative feature

With $m$ sentences, the feature vectors are stacked into a matrix with one row per sentence:
$\begin{bmatrix}
    1 & X_{1}^{(1)} & X_{2}^{(1)} \\
    1 & X_{1}^{(2)} & X_{2}^{(2)} \\
    \vdots & \vdots & \vdots \\
    1 & X_{1}^{(m)} & X_{2}^{(m)}
\end{bmatrix}$
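Stacking the per-sentence features into this matrix might look as follows. The frequency sums are made-up values for four hypothetical sentences; only the shapes matter here.

```python
import numpy as np

# Hypothetical (positive, negative) frequency sums for m = 4 sentences.
sums = [(8, 11), (4, 2), (6, 6), (3, 7)]

X = np.zeros((len(sums), 3))
X[:, 0] = 1                      # bias column
X[:, 1:] = np.array(sums)        # positive and negative features

theta = np.zeros((3, 1))
print(X.shape, np.dot(X, theta).shape)  # (4, 3) (4, 1)
```

With this layout, `np.dot(X, theta)` yields one score per sentence, which matches the `(m, n+1)` convention used by `gradientDescent`.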

In [1]:
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''

    # get 'm', the number of rows in matrix x
    num_rows, num_cols = x.shape
    m = num_rows
    
    for i in range(0, num_iters):
        
        # get z, the dot product of x and theta
        z = np.dot(x, theta)
        
        # get the sigmoid of z
        h = sigmoid(z)
        
        # calculate the cost function
        J = (-1/m) * ( np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)) )
                      
        # update the weights theta
        theta = theta - (alpha/m) * np.dot(x.T, h-y)
        
    # J is a (1, 1) array; squeeze it to a scalar before converting to float.
    J = float(np.squeeze(J))
    return J, theta

In [2]:
def extract_features(tweet, freqs):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output: 
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)
    
    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3)) 
    
    #bias term is set to 1
    x[0,0] = 1 
    
    # loop through each word in the list of words
    for word in word_l:
        
        # increment the word count for the positive label 1
        key_pos = (word,1.0)
        count_pos = freqs[key_pos] if key_pos in freqs else 0
        x[0,1] += count_pos
        
        # increment the word count for the negative label 0
        key_neg = (word,0.0)
        count_neg = freqs[key_neg] if key_neg in freqs else 0
        x[0,2] += count_neg
        
    assert(x.shape == (1, 3))
    return x


In [3]:
def predict_tweet(tweet, freqs, theta):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the predicted probability that the tweet is positive
    '''
    
    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    
    return y_pred

In [4]:
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    
    # the list for storing predictions
    y_hat = []
    num_row, num_col = test_y.shape
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0)

    # y_hat is a list, but test_y is an (m, 1) array;
    # convert y_hat to an (m, 1) array so the two can be compared element-wise with '=='
    y_hat_matrix = np.array(y_hat, ndmin=2).T
    compare_result = (y_hat_matrix == test_y)
    count_true = np.count_nonzero(compare_result)
    
    accuracy = count_true / num_row

    return accuracy