<a href="https://colab.research.google.com/github/xlzuvekas/Machine-Learning/blob/main/COMP_3703_Assignment_6_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6: Logistic Regression
Course: COMP 3703 - Natural Language Processing

Due: Friday, February 25 at 11:59pm Mountain time

---

YOUR NAME HERE:Xavier Zuvekas

---

# Question 1: Preprocessing

Although the title of this assignment is "Logistic Regression", most of the code you will write will deal with preprocessing and the implementation of a binary perception (question 2).

As usual, the necessary package imports and set up of the `movie_reviews` corpus is implemented here for you. Run the code cell below, namely to create the variables `raw_docs` and `lexicon` in addition to running the import statements.

In [None]:
### RUN THIS ###
import nltk
import random
from math import log

nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

from nltk.sentiment import vader
from nltk.corpus import movie_reviews

raw_docs = []                                     # Our custom dataset list
for label in movie_reviews.categories():          # Categories are 'neg' and 'pos'
  files = movie_reviews.fileids(label)            # Get list of file names associated with label
  for file in files:
    words_from_file = movie_reviews.words(file)   # Get words in file (one movie review)
    words_with_label = (words_from_file, label)   # Place raw words with label in a tuple
    raw_docs.append(words_with_label)             # Add tuple to our docs list

random.shuffle(raw_docs)                          # Randomize document positions

lexicon = vader.SentimentIntensityAnalyzer().make_lex_dict() # Get lexicon dictionary

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Part (a)

A critical step for the classifiers in this assignment (as opposed to Naive Bayes) is **feature extraction**. For `nltk` classifiers, features are defined in dictionary (or dictionary-like) structures such that each key in the dictionary is the name of the feature and its corresponding value is the quantity associated with it. The name doesn't actually change anything mathematically about the result; the name is just there so we know what the feature is.

Previously we used word counts as features, where a word/token was its own feature name and the frequency of that word in a given document was the value. Here you will define a handful of functions, each of which extracts a single feature value from a document. (The `f_` prefix at the beginning of each function name is just to indicate that these are supposed to be used as features. This is not a standard nomenclature; it's just for readability purposes.) These features overlap with ones used in the textbook, but a couple are slightly tweaked for this implementation.

Define the following functions:
* `f_log_doc_length(doc)`: returns the logarithm of the length of the document
  * The `log()` function is already imported for you above from the `math` package.
* `f_contains_exclamation(doc)`: returns 1 if the string `"!"` appears in `doc`, 0 otherwise
* `f_positive_sentiment_sum(doc)`: returns the sum of the sentiment scores of the positive words in `doc` based on the scores in `lexicon`
* `f_negative_sentiment_sum(doc)`: returns the sum of the sentiment scores of the negative words in `doc` based on the scores in `lexicon`
* `extract_features(doc)`: returns a dictionary with 4 key-value pairs, one for each feature as described

For example, the (made up) review
```
This movie is the worst. There were no dogs, no wizards, no fun. I am not happy.
```
would actually be represented in `raw_docs` as a list of lowercased tokens like
```
['this', 'movie', 'is', 'the', 'worst', '.', 'there', 'were', 'no', 'dogs', ',', 'no', 'wizards', ',', 'no', 'fun', '.', 'i', 'am', 'not', 'happy', '.']
```
Calling `extract_features` on this list of tokens would return the dictionary
```
{'log_doc_length': 3.091042453358316,  # Log of 22, the length of the doc
 'contains_exclamation': 0,  # No exclamation marks present in doc
 'positive_word_sentiment_sum': 5.0, # 'fun'+'happy' = 2.3+2.7 = 5.0
 'negative_word_sum': -6.7} # 'worst'+'no'+'no'+'no' = (-3.1)+(-1.2)+(-1.2)+(-1.2) = -6.7
```
The keys are hard-coded for each key-value pair in the dictionary. The values are returned by the corresponding functions you define.

In [None]:
import math

def f_log_doc_length(doc):
    return math.log(len(doc))

def f_contains_exclamation(doc):
    return 1 if '!' in doc else 0

def f_positive_sentiment_sum(doc, lexicon):
    positive_sentiment_sum = 0
    positive_words = []
    for word in doc[0]:
      if word in lexicon:
        if lexicon[word] > 0 :
          positive_sentiment_sum += lexicon[word]
          positive_words.append(lexicon[word])
    return (positive_sentiment_sum, positive_words)
def f_negative_sentiment_sum(doc, lexicon):
    negative_sentiment_sum = 0
    negative_words = []
    for word in doc[0]:
      if word in lexicon:
        if lexicon[word] < 0 :
          negative_sentiment_sum += lexicon[word]
          negative_words.append(lexicon[word])
    return (negative_sentiment_sum, negative_words)

def extract_features(doc, lexicon):
    return {
        'log_doc_length': f_log_doc_length(doc),
        'contains_exclamation': f_contains_exclamation(doc),
        'positive_word_sentiment_sum': f_positive_sentiment_sum(doc, lexicon)[0],
        'negative_word_sum': f_negative_sentiment_sum(doc, lexicon)[0]
    }


Try testing your `extract_features` function by calling it on the first few documents in `raw_docs` and printing the resulting dictionaries.

In [None]:
### YOUR CODE HERE ###
feature_sets = []

f_negative_sentiment_sum(raw_docs[0], lexicon)
f_positive_sentiment_sum(raw_docs[0], lexicon)
for doc in raw_docs:
    features = extract_features(doc, lexicon)
    feature_sets.append((features, label))
print(feature_sets)

[({'log_doc_length': 0.6931471805599453, 'contains_exclamation': 0, 'positive_word_sentiment_sum': 101.30000000000004, 'negative_word_sum': -76.40000000000002}, 'pos'), ({'log_doc_length': 0.6931471805599453, 'contains_exclamation': 0, 'positive_word_sentiment_sum': 52.89999999999999, 'negative_word_sum': -34.300000000000004}, 'pos'), ({'log_doc_length': 0.6931471805599453, 'contains_exclamation': 0, 'positive_word_sentiment_sum': 25.599999999999998, 'negative_word_sum': -30.9}, 'pos'), ({'log_doc_length': 0.6931471805599453, 'contains_exclamation': 0, 'positive_word_sentiment_sum': 35.6, 'negative_word_sum': -39.3}, 'pos'), ({'log_doc_length': 0.6931471805599453, 'contains_exclamation': 0, 'positive_word_sentiment_sum': 38.39999999999999, 'negative_word_sum': -45.3}, 'pos'), ({'log_doc_length': 0.6931471805599453, 'contains_exclamation': 0, 'positive_word_sentiment_sum': 45.199999999999996, 'negative_word_sum': -42.300000000000004}, 'pos'), ({'log_doc_length': 0.6931471805599453, 'con

## Part (b)

With `extract_features` defined, let's actually extract the features from each document as well as partition the corpus into training and test sets.

Create a new list of tuples based on `raw_docs` such that for each tuple:
* the first element is the dictionary returned by `extract_features` on the document tokens
* the second element is still the label associated with the document.

Then split the newly created list into training and test sets with 90% train and 10% test. You may reference code that you've written before in previous assignments to answer this question.

In [None]:
### YOUR CODE HERE ###
def test_train_split(raw_docs,split):
    features = []
    for (doc, label) in raw_docs:
        features.append((extract_features(doc, lexicon), label))

    train_set_size = int(len(features) * split)
    train_set = features[:train_set_size]
    test_set = features[train_set_size:]

    return (train_set, test_set)

train_set, test_set = test_train_split(raw_docs, 0.9)

# Question 2: Binary Perceptron

Since the math for the implementation of a binary perceptron is relatively simple (at least compared to logistic regression), you will manually implement a binary perceptron classifier. For this implementation we will ignore the bias term. As such, the weight vector $w$ is the binary perception model itself. Each weight in the vector corresponds to a feature in the extracted feature vectors of the documents.

To facilitate vector operations, we will utilize an extremely commonly used Python library called `numpy` (traditionally imported as `np`). `numpy` most notably provides the NumPy array data structure which provides many useful functionalities.

A NumPy array can be created simply by passing in a Python list into `np.array()`. An example of this is included in the following code cell along with the import statement. Run this cell to make `numpy` available.

In [None]:
### RUN THIS ###
import numpy as np

np.array([1, 2, 3, 4])

array([1, 2, 3, 4])

NumPy arrays provide convenient syntax for element-wise adding and subtracting.

In [None]:
array_1 = np.array([1, 1, 2, 3])
array_2 = np.array([5, 8, 13, 21])

print(array_1 + array_2)
print(array_1 - array_2)

[ 6  9 15 24]
[ -4  -7 -11 -18]


NumPy arrays also have the dot product operation already defined via the `dot` function.

In [None]:
array_1.dot(array_2) # 1*5 + 1*8 + 2*13 + 3*21 = 102

102

## Part (a)

The bulk of the work for the binary perceptron occurs in training. This is where you will implement the training portion of this classifier.

Create the weight vector $w$ as a NumPy array of all zeroes. The length of the array should match the number of features that `extract_features` has (you may hardcode this value).

Then, train the model. The overall process for training a binary perceptron model is as follows:
```
repeat until X iterations have occurred:
  for each document in documents:
    classify the document using weight vector
    update weight vector if necessary
```

One "iteration" is one pass over all the documents in the training set. Training takes some time so depending on your internet speed you can set the max number of iterations accordingly, but you should perform **at least 15 iterations** so there is a reasonable amount of change in $w$.

Remember that "classifying a document" entails the following calculations:
$$
y_{predicted} = \cases{positive & if $w \cdot f(x) \ge 0$ \cr
           negative & if $w \cdot f(x) < 0$ }
$$

If the prediction is incorrect, the weight vector is updated like so:
$$
w_{new} = \cases{w_{current} + f(x) & if $y_{actual} = positive$ \cr
           w_{current} - f(x) & if $y_{actual} = negative$ }
$$

$f(x)$ refers to the *values* from `extract_features`. To get the values of the dictionary into a NumPy array, you have to call `values()` on the dictionary, cast the result to a list, then give that list to `np.array()`.

**Print the weight vector at the end** to see how the weights have been updated. Considering what feature each weight corresponds to may give some insight as to why each weight is as positive or negative as it is.

*(A formal implementation of a binary perceptron would include a check to see if the training has converged before the max number of iterations passes, at which point you can terminate early. You may optionally include this in your code but it is not necessary since you are not likely to see your model converge.)*

In [None]:
#@title Default title text
import numpy as np

def make_array(feature): ## make empty list and for each value append to empty, return np.array(list)
### YOUR CODE HERE ###
    out = []
    for val in feature.values():
      out.append(val)
    return out



def predict(weight_vector, feature):## if weight dot and make array feature >= 0 return respective result

    if weight_vector.dot(make_array(feature)) >= 0:
      return 'pos'
    return 'neg'


def train_model(feature_label_set, iterations):
  ## make weight vectore w np.array([0,0,0,0])
  correctCount = 0

  weights = np.array([0,0,0,0])
  for i in range(iterations) :
    for doc in feature_label_set:
      if predict(weights, doc[0]) != doc[1]:
        if doc[1] == 'pos':
          weights = weights + make_array(doc[0])
        else:
          weights = weights - make_array(doc[0])
      else:
        correctCount += 1
  accuracy = correctCount / (len(feature_label_set) * iterations)

  return weights, accuracy     
      
        
  #for each iteration 
  ## for each doc in the feature set 
  ##if predict is false update values as shown above 
  ### YOUR CODE HERE ###


  ## if true add to correct count 
  # return 







train_docs, test_docs = test_train_split(raw_docs, 0.9)
weights, accuracy = train_model(train_docs, 100)

for w in weights:
  print(w)
print(weights)
print(accuracy)

-2.883047305899362
-16.0
-8.0
0.0
[ -2.88304731 -16.          -8.           0.        ]
0.5125555555555555


## Part (b)

With the classifier trained, let's evaluate its performance. To keep things simple, we will use accuracy to score the test set. Recall that accuracy is the ratio of correctly classified test documents over the total number of test documents.

Iterate over the test set, classify each document using the trained $w$ from part (a), and print the resulting accuracy of your binary perceptron classifier.

In [None]:
### YOUR CODE HERE ###
for w in weights:
  print(w)

print('Accuracy:', accuracy)

-2.883047305899362
-16.0
-8.0
0.0
Accuracy: 0.5125555555555555


# Question 3: Logistic Regression

You may be relieved to hear you do not have to manually train a logistic regression classifier for this question. Instead we will use the `LogisticRegression` class from another popular Python package called Sci-Kit Learn (`sklearn`). `nltk` conveniently has a `SklearnClassifier` which makes it so that we can use the `sklearn` model in the same manner that we would use the models that `nltk` provides, such as `NaiveBayesClassifier` from the previous assignments.

Run the following code cell to create an "empty", untrained logistic regression classifier.

In [None]:
### RUN THIS ###
from nltk.classify import SklearnClassifier
from sklearn.linear_model import LogisticRegression

log_reg = SklearnClassifier(LogisticRegression())

With the classifier already implemented for us in `sklearn` and our documents already formatted with the extracted features from earlier, there is not much to do in order to train a logistic regression classifier on our data.

`log_reg` has two notable functions:
* `train(training_set)`: trains the classifier on the given training set, presumed to be in the format we created in question 1
* `classify(document_features)`: classifies the given document, presumed to be a dictionary of features

Train the logistic regression classifier, then calculate and print the accuracy of the classifier on the test set.

In [None]:
from nltk.classify import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from nltk.classify.util import accuracy

# Create logistic regression classifier
log_reg = SklearnClassifier(LogisticRegression())

# Train classifier on the training set
log_reg.train(train_set)

# Predict labels for test set using log_reg.classify()
y_pred = [log_reg.classify(document) for document, label in test_set]

# Calculate accuracy
acc = accuracy(log_reg, test_set)

print("Logistic Regression Accuracy:", acc)


Logistic Regression Accuracy: 0.64


# Question 4: Analysis

You will likely find that the resulting accuracy is not great for either classifier. For this question you will explore why.

## Part (a)

What may have contributed to the accuracy of your binary perceptron classifier (from question 2b) being relatively poor? Consider what might be true about the dataset, the weight vector $w$ and how it was trained, the features we used, etc.

*Provide a meaningful analysis of your binary perceptron classifier. Your answer does not have to be long, but you should highlight specific reasons to support your conclusion(s).*

YOUR ANSWER HERE
-2.883047305899362
-16.0
-8.0
0.0
Accuracy: 0.5125555555555555


The algorithm is sensitive to the order in which the training examples are presented, and may converge to a suboptimal solution if the examples are not presented in a random order. Additionally, the number of iterations used for training may not have been sufficient to reach a good solution.


## Part (b)

Does the accuracy of your logistic regression classifier (from question 3) surprised you? Was the accuracy higher than that of the binary perceptron classifier? Why do you think the logistic regression classifier performed better/worse? What do you think could be changed to increase the logistic regression classifier's accuracy?

*You do not have to answer each question individually. You should provide a meaningful analysis of your logistic regression classifier and the setbacks/properties therein.*

YOUR ANSWER HERE: 

The accuracy of the logistic regression classifier is 0.64, which is higher than the binary perceptron classifier's accuracy. This is not surprising since logistic regression is often more accurate than a perceptron for classification tasks.

One possible reason why the logistic regression classifier outperformed the perceptron is because it is a probabilistic model that estimates the probability of each class label. In contrast, the perceptron is a deterministic model that simply assigns a binary output to each input.

Another factor that may have contributed to the logistic regression classifier's accuracy is the set of features used in the model. The feature set extracted in question 1 was relatively simple and did not take into account more complex linguistic features such as n-grams or part-of-speech tags. A more sophisticated feature set could potentially improve the accuracy of the logistic regression classifier.