### Feature extraction with frequencies

Example

Positive
- I am happy because I am learning NLP
- I am happy

Negative
- I am sad, I am not learning NLP
- I am sad

<table>
<tr>
    <th>Vocabulary</th>
    <th>PosFreq(1)</th>
    <th>NegFreq(0)</th>
</tr>
<tr>
    <td>I</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>am</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>happy</td>
    <td>2</td>
    <td>0</td>
</tr>
<tr>
    <td>because</td>
    <td>1</td>
    <td>0</td>
</tr>
<tr>
    <td>learning</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>NLP</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>sad</td>
    <td>0</td>
    <td>2</td>
</tr>
<tr>
    <td>not</td>
    <td>0</td>
    <td>1</td>
</tr>
</table>

For the sentence "I am sad, I am not learning NLP", sum the frequencies of its unique words
- Freq(w,1) = 3+3+1+1 = 8 (I, am, learning, NLP; "happy" and "because" do not appear in the sentence, and "sad" and "not" never occur in a positive sentence)
- Freq(w,0) = 3+3+1+1+2+1 = 11 (I, am, learning, NLP, sad, not)

The feature vector becomes [1,8,11]
- 1 is the bias
- 8 is the positive feature (sum of the positive frequencies)
- 11 is the negative feature (sum of the negative frequencies)
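The computation above can be sketched in a few lines of Python. This is only an illustration: `tokenize`, `build_freqs`, and `features` are hypothetical helper names, the tokenizer is a naive whitespace split, and the sum runs over the unique words of the sentence to match the arithmetic above.

```python
from collections import Counter

# Toy corpus from the example above; labels: 1 = positive, 0 = negative.
corpus = [
    ("I am happy because I am learning NLP", 1),
    ("I am happy", 1),
    ("I am sad, I am not learning NLP", 0),
    ("I am sad", 0),
]


def tokenize(sentence):
    """Hypothetical helper: split on whitespace and strip commas/periods."""
    return [tok.strip(",.") for tok in sentence.split()]


def build_freqs(labelled_sentences):
    """Count how often each (word, label) pair occurs in the corpus."""
    freqs = Counter()
    for sentence, label in labelled_sentences:
        for word in tokenize(sentence):
            freqs[(word, label)] += 1
    return freqs


def features(sentence, freqs):
    """Return [bias, sum of positive counts, sum of negative counts]."""
    unique_words = set(tokenize(sentence))
    pos = sum(freqs[(w, 1)] for w in unique_words)
    neg = sum(freqs[(w, 0)] for w in unique_words)
    return [1, pos, neg]


freqs = build_freqs(corpus)
print(features("I am sad, I am not learning NLP", freqs))  # [1, 8, 11]
```

A `Counter` returns 0 for pairs it has never seen, so words that are missing from one class need no special handling.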

### Preprocessing

- Eliminate stop words such as "and", "is", "are", "at", "has", "for", "a"
- Eliminate punctuation
- Eliminate handles (starting with @) and URLs
- Stem words: "tune", "tuned", and "tuning" all reduce to the stem "tun"
- Convert all words to lowercase

For the tweet "I am happy Because i am learning NLP @DeepLearning", preprocessing gives
- [happy, learn, nlp]

Then the feature vector becomes [1,4,2]
- 1 is the bias
- happy has positive frequency 2, and learn and nlp have positive frequency 1 each, thus 4 is the positive feature
- learn and nlp have negative frequency 1 each (happy never occurs in a negative sentence), thus 2 is the negative feature
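A minimal sketch of these preprocessing steps, using only the standard library. The stop-word list and `toy_stem` are assumptions made for this illustration. In particular, `toy_stem` only strips a few suffixes and is a stand-in for a real stemming algorithm, which a practical pipeline would take from an NLP library.

```python
import re
import string

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"i", "am", "and", "is", "are", "at", "has", "for", "a", "because"}


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, drop URLs/handles/punctuation/stop words, then stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove handles such as @user
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return [toy_stem(tok) for tok in tokens]


print(preprocess("I am happy Because i am learning NLP @DeepLearning"))
# ['happy', 'learn', 'nlp']
```

Handles and URLs are removed before punctuation is stripped, because stripping punctuation first would delete the `@` and `://` markers that identify them.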

For $m$ sentences, the feature vectors stack into an $m \times 3$ matrix with one row per sentence,
$\begin{bmatrix}
    1 & X_{1}^{(1)} & X_{2}^{(1)} \\
    1 & X_{1}^{(2)} & X_{2}^{(2)} \\
    \vdots & \vdots & \vdots \\
    1 & X_{1}^{(m)} & X_{2}^{(m)}
\end{bmatrix}$

```python
import numpy as np


def sigmoid(z):
    # Assumed helper, not defined in these notes: the logistic function 1 / (1 + e^-z)
    return 1 / (1 + np.exp(-z))


def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the cost computed in the final iteration
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''

    # get 'm', the number of rows in matrix x
    m = x.shape[0]

    for _ in range(num_iters):

        # get z, the dot product of x and theta
        z = np.dot(x, theta)

        # get the sigmoid of z
        h = sigmoid(z)

        # calculate the cost function
        J = (-1 / m) * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))

        # update the weights theta
        theta = theta - (alpha / m) * np.dot(x.T, h - y)

    # J is a (1,1) array; squeeze it before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta
```


```python
def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    '''
    # process_tweet (assumed helper, defined elsewhere) tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0, 0] = 1

    # loop through each word in the list of words
    for word in word_l:

        # add the word's count for the positive label 1 (0 if the pair is unseen)
        x[0, 1] += freqs.get((word, 1.0), 0)

        # add the word's count for the negative label 0 (0 if the pair is unseen)
        x[0, 2] += freqs.get((word, 0.0), 0)

    assert x.shape == (1, 3)
    return x
```


```python
def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of the tweet being positive, as a (1,1) array
    '''

    # extract the features of the tweet and store them in x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))

    return y_pred
```


```python
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """

    # the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0.0 to the list
            y_hat.append(0.0)

    # y_hat is a list, but test_y is an (m,1) array:
    # convert y_hat to an (m,1) column so the two can be compared element-wise with '=='
    y_hat_matrix = np.array(y_hat, ndmin=2).T
    compare_result = (y_hat_matrix == test_y)
    count_true = np.count_nonzero(compare_result)

    accuracy = count_true / test_y.shape[0]

    return accuracy
```

### Naive Bayes
- **assumes that the words in a sentence are independent of one another**

Conditional probability $P(B|A)$
- probability of B given that A happened, or
- looking only at the elements of A, the probability that they also belong to B

Bayes rule
- $P(X|Y) = P(Y|X) \times \dfrac{P(X)}{P(Y)}$
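
As a quick numeric check of Bayes' rule, the sketch below uses made-up tweet counts (hypothetical numbers chosen only for illustration, not taken from these notes) and compares the value given by the rule with the value obtained by counting directly.

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration.
total_tweets = 20
positive_tweets = 13
tweets_with_happy = 4
positive_with_happy = 3

p_pos = Fraction(positive_tweets, total_tweets)                     # P(Pos)
p_happy = Fraction(tweets_with_happy, total_tweets)                 # P(happy)
p_happy_given_pos = Fraction(positive_with_happy, positive_tweets)  # P(happy|Pos)

# Bayes rule: P(Pos|happy) = P(happy|Pos) * P(Pos) / P(happy)
p_pos_given_happy = p_happy_given_pos * p_pos / p_happy

# Direct count for comparison: positive tweets among those containing "happy"
direct = Fraction(positive_with_happy, tweets_with_happy)

print(p_pos_given_happy, direct)  # 3/4 3/4
```

`Fraction` keeps the arithmetic exact, so the two results can be compared with `==`.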

Frequency table with the class totals $N$ added in the last row (here "happy", "sad", and "not" each occur in both classes)
<table>
<tr>
    <th>Vocabulary</th>
    <th>PosFreq(1)</th>
    <th>NegFreq(0)</th>
</tr>
<tr>
    <td>I</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>am</td>
    <td>3</td>
    <td>3</td>
</tr>
<tr>
    <td>happy</td>
    <td>2</td>
    <td>1</td>
</tr>
<tr>
    <td>because</td>
    <td>1</td>
    <td>0</td>
</tr>
<tr>
    <td>learning</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>NLP</td>
    <td>1</td>
    <td>1</td>
</tr>
<tr>
    <td>sad</td>
    <td>1</td>
    <td>2</td>
</tr>
<tr>
    <td>not</td>
    <td>1</td>
    <td>2</td>
</tr>
<tr>
    <td>N</td>
    <td>13</td>
    <td>12</td>
</tr>
</table>

Compute conditional probabilities, for example
- $P(I|Pos) = \dfrac{3}{13}, P(I|Neg) = \dfrac{3}{12}$

Then fill a new table with these conditional probabilities

<table>
<tr>
    <th>Vocabulary</th>
    <th>Pos</th>
    <th>Neg</th>
</tr>
<tr>
    <td>I</td>
    <td>0.23</td>
    <td>0.25</td>
</tr>
<tr>
    <td>am</td>
    <td>0.23</td>
    <td>0.25</td>
</tr>
<tr>
    <td>happy</td>
    <td>0.15</td>
    <td>0.08</td>
</tr>
<tr>
    <td>because</td>
    <td>0.08</td>
    <td>0</td>
</tr>
<tr>
    <td>learning</td>
    <td>0.08</td>
    <td>0.08</td>
</tr>
<tr>
    <td>NLP</td>
    <td>0.08</td>
    <td>0.08</td>
</tr>
<tr>
    <td>sad</td>
    <td>0.08</td>
    <td>0.17</td>
</tr>
<tr>
    <td>not</td>
    <td>0.08</td>
    <td>0.17</td>
</tr>
<tr>
    <td>Sum</td>
    <td>1</td>
    <td>1</td>
</tr>
</table>

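Naive Bayes binary classification rule: predict positive when the product of the likelihood ratios exceeds 1
- $\displaystyle\prod_{i=1}^{m}\dfrac{P({w_{i}}|POS)}{P({w_{i}}|NEG)} \gt 1$

For an example sentence "I am happy today; I am learning" (using the smoothed probabilities from the Laplacian smoothing table, with 0.19 and 0.20 rounded to 0.2; "today" is not in the vocabulary and is skipped)
- $\dfrac{0.2}{0.2} \times \dfrac{0.2}{0.2} \times \dfrac{0.14}{0.10} \times \dfrac{0.2}{0.2} \times \dfrac{0.2}{0.2} \times \dfrac{0.1}{0.1} = 1.4$, which is greater than 1, so the sentence is classified as positive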
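
The worked example above can be reproduced with a short sketch. The dictionary holds the same rounded probabilities used in the example. `likelihood_ratio` is a hypothetical helper name, and skipping words that are missing from the table is the assumption that handles "today".

```python
import math

# Rounded (P(w|Pos), P(w|Neg)) pairs as used in the worked example above.
probs = {
    "I": (0.2, 0.2),
    "am": (0.2, 0.2),
    "happy": (0.14, 0.10),
    "learning": (0.1, 0.1),
}


def likelihood_ratio(sentence, probs):
    """Multiply P(w|Pos) / P(w|Neg) over the words found in the table."""
    words = [w.strip(";,.") for w in sentence.split()]
    ratios = [probs[w][0] / probs[w][1] for w in words if w in probs]
    return math.prod(ratios)


score = likelihood_ratio("I am happy today; I am learning", probs)
print(round(score, 2), "positive" if score > 1 else "negative")
```

Only "happy" contributes a ratio different from 1 here, which is why it alone decides the classification.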

### Laplacian smoothing
- avoids zero probabilities for words that never occur in one of the classes (for example, $P(because|Neg) = 0$ in the table above)
- $P(w_{i}|class) = \dfrac{freq(w_{i},class)+1}{(N_{class}+V)}$
    - $N_{class}$ : frequency of all words in class
    - $V$ : number of unique words in vocabulary
- for example, $P(I|POS) = \dfrac{3+1}{13+8} = 0.19$
- then, the table becomes

<table>
<tr>
    <th>Vocabulary</th>
    <th>Pos</th>
    <th>Neg</th>
</tr>
<tr>
    <td>I</td>
    <td>0.19</td>
    <td>0.20</td>
</tr>
<tr>
    <td>am</td>
    <td>0.19</td>
    <td>0.20</td>
</tr>
<tr>
    <td>happy</td>
    <td>0.14</td>
    <td>0.10</td>
</tr>
<tr>
    <td>because</td>
    <td>0.10</td>
    <td>0.05</td>
</tr>
<tr>
    <td>learning</td>
    <td>0.10</td>
    <td>0.10</td>
</tr>
<tr>
    <td>NLP</td>
    <td>0.10</td>
    <td>0.10</td>
</tr>
<tr>
    <td>sad</td>
    <td>0.10</td>
    <td>0.15</td>
</tr>
<tr>
    <td>not</td>
    <td>0.10</td>
    <td>0.15</td>
</tr>
<tr>
    <td>Sum</td>
    <td>1</td>
    <td>1</td>
</tr>
</table>
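
The smoothing formula is small enough to check by hand or with a few lines of code. In this sketch `laplace_prob` is a hypothetical helper name, and the class totals and vocabulary size are the values read from the tables above.

```python
def laplace_prob(freq, n_class, vocab_size):
    """P(w|class) with add-one (Laplacian) smoothing."""
    return (freq + 1) / (n_class + vocab_size)


# Values read from the tables above: class totals and vocabulary size.
N_POS, N_NEG, V = 13, 12, 8

# Positive-class counts from the frequency table.
pos_counts = {"I": 3, "am": 3, "happy": 2, "because": 1,
              "learning": 1, "NLP": 1, "sad": 1, "not": 1}

p_i_pos = laplace_prob(pos_counts["I"], N_POS, V)  # (3 + 1) / (13 + 8)
p_because_neg = laplace_prob(0, N_NEG, V)          # (0 + 1) / (12 + 8), no longer zero
pos_total = sum(laplace_prob(c, N_POS, V) for c in pos_counts.values())

print(round(p_i_pos, 2), round(p_because_neg, 2), round(pos_total, 2))
```

Adding $V$ to the denominator is what keeps each smoothed column summing to 1 after 1 has been added to each of the $V$ counts.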

### Log likelihood

To do inference, we can compute
- $\dfrac{P(pos)}{P(neg)}\displaystyle\prod_{i=1}^{m}\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)} \gt 1$

To avoid numerical underflow as $m$ gets larger (a product of many small probabilities approaches zero), we take the log, which turns the product into a sum
- $\log\left(\dfrac{P(pos)}{P(neg)}\displaystyle\prod_{i=1}^{m}\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)}\right) = \log\left(\dfrac{P(pos)}{P(neg)}\right) + \displaystyle\sum_{i=1}^{m}\log\left(\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)}\right)$

Define $\lambda$ for each word as
- $\lambda(w) = \log\left(\dfrac{P(w|pos)}{P(w|neg)}\right)$

The log likelihood is given by
- $\displaystyle\sum_{i=1}^{m}\lambda(w_{i}) = \displaystyle\sum_{i=1}^{m}\log\left(\dfrac{P({w_{i}}|pos)}{P({w_{i}}|neg)}\right)$

A sentence is classified as positive when the log prior plus the log likelihood is greater than $0$.
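
A small sketch of the log-space computation, using the smoothed values from the Laplacian smoothing table. `log_score` is a hypothetical helper name, and a log prior of 0 (balanced classes) is an assumption made for this illustration.

```python
import math

# Smoothed (P(w|pos), P(w|neg)) values taken from the Laplacian smoothing table.
table = {
    "I": (0.19, 0.20), "am": (0.19, 0.20), "happy": (0.14, 0.10),
    "because": (0.10, 0.05), "learning": (0.10, 0.10), "NLP": (0.10, 0.10),
    "sad": (0.10, 0.15), "not": (0.10, 0.15),
}

# lambda(w) = log(P(w|pos) / P(w|neg))
lambdas = {w: math.log(p_pos / p_neg) for w, (p_pos, p_neg) in table.items()}


def log_score(words, lambdas, log_prior=0.0):
    """Log prior plus the sum of lambda(w) over words known to the table."""
    return log_prior + sum(lambdas[w] for w in words if w in lambdas)


words = ["I", "am", "happy", "because", "I", "am", "learning", "NLP"]
score = log_score(words, lambdas)

# The same quantity computed as a plain product, for comparison.
product = math.prod(table[w][0] / table[w][1] for w in words)

print(score > 0, abs(score - math.log(product)) < 1e-9)
```

Summing logs gives the same decision as multiplying ratios, but the running value stays in a numerically safe range even for long inputs.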

```python
def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''

    for y, tweet in zip(ys, tweets):
        # process_tweet is an assumed helper, defined elsewhere
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word, y)

            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1

    return result
```


```python
import numpy as np


def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior, log(D_pos / D_neg)
        loglikelihood: a dictionary mapping each word to log(P(w|pos) / P(w|neg))
    '''
    loglikelihood = {}

    # calculate V, the number of unique words in the vocabulary
    vocab = {pair[0] for pair in freqs}
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair, count in freqs.items():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # increment the number of positive words by the count for this (word, label) pair
            N_pos += count

        # else, the label is negative
        else:
            # increment the number of negative words by the count for this (word, label) pair
            N_neg += count

    # calculate D, the number of documents
    D = len(train_y)

    # calculate D_pos, the number of positive documents
    D_pos = sum(1 for label in train_y if label == 1)

    # calculate D_neg, the number of negative documents
    D_neg = D - D_pos

    # calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # for each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word (a missing pair counts as 0)
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)

        # calculate the smoothed probability of the word given each class
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    return logprior, loglikelihood
```


```python
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the log likelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)
    '''

    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # start from the logprior
    p = logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the score
            p += loglikelihood[word]

    return p
```


```python
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: a list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute values of the differences between y_hats and test_y
    # (both are flattened to 1-D arrays so the subtraction is element-wise)
    error = np.mean(np.absolute(np.asarray(y_hats) - np.ravel(test_y)))

    # accuracy is 1 minus the error
    accuracy = 1 - error

    return accuracy
```

### Cosine similarity

$\cos (\theta)=\dfrac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\dfrac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}$


### PCA

1. Mean-normalize your data
2. Compute the covariance matrix $\Sigma$
3. Compute the SVD of the covariance matrix: $[U, S, V] = svd(\Sigma)$, where the columns of $U$ are the eigenvectors and the diagonal of $S$ holds the eigenvalues
4. Use the first $n$ columns of $U$ to project the mean-normalized data $X$ and get the new data: $XU[:, 0:n]$
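
The four steps can be sketched with NumPy as follows. This is a minimal illustration on synthetic data. The function name `pca`, the random data, and the choice of `np.cov` for step 2 are assumptions, not part of the notes.

```python
import numpy as np


def pca(X, n_components):
    """Project X (rows = samples) onto its first n_components principal axes."""
    # 1. mean-normalize each column
    X_centered = X - X.mean(axis=0)
    # 2. covariance matrix of the features (columns)
    sigma = np.cov(X_centered, rowvar=False)
    # 3. SVD of the covariance matrix; singular values come back in descending order
    U, S, _ = np.linalg.svd(sigma)
    # 4. keep the first n_components columns of U and project
    return X_centered @ U[:, :n_components], S


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.1])  # synthetic data
X_reduced, eigenvalues = pca(X, 2)
print(X_reduced.shape, eigenvalues)
```

Because the covariance matrix is symmetric and positive semi-definite, its singular values equal its eigenvalues, so the sample variance of each projected column matches the corresponding entry of `eigenvalues`.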

```python
import numpy as np


def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: a numpy array which corresponds to a word vector
    Output:
        cos: a number representing the cosine similarity between A and B.
    '''

    dot = np.dot(A, B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma * normb)

    return cos
```


```python
def get_country(city1, country1, city2, embeddings):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of city1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings
    Output:
        country: a tuple (most likely country, its similarity score)
    """

    # store city1, country1, and city2 in a set called group
    group = {city1, country1, city2}

    # get embedding of city 1
    city1_emb = embeddings[city1]

    # get embedding of country 1
    country1_emb = embeddings[country1]

    # get embedding of city 2
    city2_emb = embeddings[city2]

    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb

    # initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string (returned unchanged if no candidate word is found)
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # first check that the word is not already in the 'group'
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(vec, word_emb)

            # if the cosine similarity is higher than the previously best similarity...
            if cur_similarity > similarity:

                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word, cur_similarity)

    return country
```


```python
def get_accuracy(word_embeddings, data):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas dataframe containing all the country and capital city pairs,
              with columns ordered city1, country1, city2, country2

    Output:
        accuracy: the accuracy of the model
    '''

    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # read the four fields by position (.iloc avoids label-based lookup on the row)
        city1 = row.iloc[0]
        country1 = row.iloc[1]
        city2 = row.iloc[2]
        country2 = row.iloc[3]

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1, country1, city2, word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m

    return accuracy
```

### Transform word vectors

Suppose
- $X$ is the matrix of English word vectors
- $Y$ is the matrix of French word vectors
- $R$ is the mapping matrix, so that $XR \approx Y$

Steps to learn $R$
- initialize $R$
- in a loop
    - $Loss = \|{XR-Y}\|_{F}$ (Frobenius norm: the square root of the sum of all squared elements of the matrix)
    - $g = \dfrac{d}{dR}Loss$
    - $R = R - \alpha g$
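
The loop can be sketched with NumPy on synthetic data. As an assumption made for this illustration, the sketch minimizes the squared Frobenius norm divided by the number of rows $m$, whose gradient has the simple closed form $\frac{2}{m}X^{T}(XR-Y)$. The function name `learn_mapping`, the learning rate, and the iteration count are likewise illustrative choices.

```python
import numpy as np


def learn_mapping(X, Y, alpha=0.1, num_iters=500, seed=0):
    """Gradient descent on the squared Frobenius loss ||XR - Y||_F^2 / m."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, Y.shape[1]))  # random initialization
    losses = []
    for _ in range(num_iters):
        diff = X @ R - Y
        losses.append(np.sum(diff ** 2) / m)
        g = (2 / m) * X.T @ diff  # gradient of the loss with respect to R
        R = R - alpha * g
    return R, losses


# Synthetic check: Y is an exact linear map of X, so R should approach R_true.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
R_true = rng.normal(size=(4, 4))
Y = X @ R_true
R, losses = learn_mapping(X, Y)
print(losses[0], losses[-1])
```

Minimizing the squared norm gives the same minimizer as the norm itself, and it avoids differentiating through the square root.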