<a href="https://colab.research.google.com/github/sachi097/Agrisense/blob/master/Assignment2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Text Classification with Logistic Regression

This time, we will dive into **text classification**, one of the most popular NLP tasks!

In simple terms, **supervised** text classification is the task of taking a string and assigning it a **label** or a **class**. These labels can be as varied as we can possibly imagine. For example, is a string spam or not? Does it contain hate speech or not? Does it communicate a positive emotion or a negative one?

As you can see, text classification has many potential applications in this digital age. Given the massive number of strings produced every day, no human is capable of reading and assigning them all a label. Therefore, we want to come up with ways to classify them automatically. This is where **logistic regression (LR)** comes in handy, since it is a **machine learning algorithm** that predicts labels on **unseen inputs**.

In this assignment, we will train an LR model that will help us determine whether a Reddit post contains a "controversial" opinion or not.

*Some of the text in this Assignment is taken from the [book](https://web.stanford.edu/~jurafsky/slp3/5.pdf)*

## Text classification

This NLP task can be subdivided into the following **subtasks**: sentiment analysis, spam detection, language identification, and authorship attribution.

**Sentiment analysis** classifies a text as reflecting the positive or negative orientation (sentiment) that a writer expresses toward some object.

There are many classification algorithms. The most popular ones are naive Bayes, logistic regression, random forests, and support vector machines. In this assignment we will only make use of logistic regression.

Classifiers are **trained** on a training set, tuned on a development (dev) set, and **evaluated** on a held-out test set using various metrics. The most popular ones are **precision**, **recall**, **accuracy**, and the **F1 metric**.

Statistical significance tests should be used to determine whether we can be confident that one version of a classifier is better than another.

## Logistic regression (LR)

Logistic regression can be used to classify an observation into one of two classes (like ‘positive sentiment’ and ‘negative sentiment’), or into one of many classes.
  
The main idea behind LR is to compute the probability that a document $d$ belongs to a class $c$. That is,
$$
P(c \mid d) =P(y \mid x)
$$

LR (and other probabilistic machine learning classifiers) have the following components:
  1.  A **feature representation** of the input. For each input observation $x^{(i)}$, this will be a vector of features $\left[x_1, x_2, \ldots, x_n\right]$. We will generally refer to feature $i$ for input $x^{(j)}$ as $x_i^{(j)}$, sometimes simplified as $x_i$, but we will also see the notation $f_i, f_i(x)$, or, for multiclass classification, $f_i(c, x)$.
  2.  A **classification function** that computes $\hat{y}$, the estimated class, via $p(y \mid x)$. In the next section we will introduce the **sigmoid** and **softmax** tools for classification.
  3.  An **objective function** for learning, usually involving minimizing error on training examples. We will use the **cross-entropy loss** function.
  4.  An algorithm for **optimizing** the objective function. We will use the **stochastic gradient descent** algorithm.


## 0. Let's look at our data

Before the feature representation step, we need to load our *corpus* and see what's in it.

In [None]:
# this will install Polars in our notebook.
# Polars is a useful data wrangling library.
! pip install polars

# this will install PyTorch, a popular ML framework.
! pip install torch torchvision torchaudio

# numpy is a popular scientific computing library.
! pip install numpy

# tqdm makes your loops show a smart progress meter
# source: https://pypi.org/project/tqdm/
! pip install tqdm



The following block of code will read our *corpus* into memory and print a brief summary of the data contained in it.

In [None]:
import polars as pl

df = pl.read_csv("data.csv")
df.describe()

describe,comment_id,score,self_text,subreddit,created_time,post_id,author_name,controversiality,ups,downs,user_is_verified,user_account_created_time,user_awardee_karma,user_awarder_karma,user_link_karma,user_comment_karma,user_total_karma,post_score,post_self_text,post_title,post_upvote_ratio,post_thumbs_ups,post_total_awards_received,post_created_time
str,str,f64,str,str,str,str,str,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,str
"""count""","""10000""",10000.0,"""10000""","""10000""","""10000""","""10000""","""10000""",10000.0,10000.0,10000.0,"""10000""","""10000""",10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,"""10000""","""10000""",10000.0,10000.0,10000.0,"""10000"""
"""null_count""","""0""",0.0,"""0""","""0""","""0""","""0""","""0""",0.0,0.0,0.0,"""0""","""0""",0.0,0.0,0.0,0.0,0.0,0.0,"""0""","""0""",0.0,0.0,0.0,"""0"""
"""mean""",,13.6226,,,,,,0.5,13.6226,0.0,,,828.2566,314.1331,20593.8092,76800.3118,98536.5107,2223.6118,,,0.810165,2223.6118,0.0,
"""std""",,116.758572,,,,,,0.500025,116.758572,0.0,,,3522.184763,1818.285791,227349.618559,162506.111621,302778.131495,4348.545872,,,0.202442,4348.545872,0.0,
"""min""","""eeyxlnh""",-199.0,""" &gt;1. Ameri…","""AskReddit""","""2019-01-25 23:…","""1005d6u""","""----Dongers""",0.0,-199.0,0.0,"""False""","""""",0.0,0.0,0.0,-100.0,-99.0,0.0,"""""",""""".... and now …",0.04,0.0,0.0,"""2019-01-25 22:…"
"""25%""",,0.0,,,,,,0.0,0.0,0.0,,,0.0,0.0,31.0,5362.0,6406.0,99.0,,,0.72,99.0,0.0,
"""50%""",,2.0,,,,,,1.0,2.0,0.0,,,140.0,0.0,544.0,22565.0,26363.0,457.0,,,0.9,457.0,0.0,
"""75%""",,6.0,,,,,,1.0,6.0,0.0,,,598.0,57.0,4278.0,76263.0,88463.0,2465.0,,,0.96,2465.0,0.0,
"""max""","""ki6s9k7""",5129.0,"""🦀🦀🦀🦀🦀🦀🦀🦀🦀🦀🦀🦀🦀🦀…","""uspolitics""","""2024-01-16 21:…","""zziysm""","""zyzzogeton""",1.0,5129.0,0.0,"""True""","""2024-01-16 16:…",149281.0,57047.0,14549215.0,3765650.0,15517137.0,55746.0,"""💩💩💩💩💩""","""🚨 NEW: Trump's…",1.0,55746.0,0.0,"""2024-01-16 20:…"


As we can see, our DataFrame object contains a lot of data. We could create very complex **features** using all the columns, but for this assignment we will only consider the Reddit posts (strings) and their labels.

### + 0.5 points - Create a subsample of the DataFrame created above

Using the previously created `df` object, create a `df_subsampled` object that contains only the data in the columns "self_text" and "controversiality". Then, rename the column "controversiality" to "label". Finally, print the value distribution in the column "label". This will help us get an idea of our label distribution.

In [None]:
# Write your solution here.
df_subsampled = df.select(["self_text", "controversiality"])
df_subsampled = df_subsampled.rename({"controversiality": "label"})
print(df_subsampled['label'].value_counts())

shape: (2, 2)
┌───────┬───────┐
│ label ┆ count │
│ ---   ┆ ---   │
│ i64   ┆ u32   │
╞═══════╪═══════╡
│ 0     ┆ 5000  │
│ 1     ┆ 5000  │
└───────┴───────┘


*The expected output is a table of counts of each type of label in the dataset, i.e.:*

![image.png](attachment:image.png)

Let's take a look at two texts in our *corpus*. It's always a good idea to know what's in it!

In [None]:
def print_samples(df_subsampled):
    sample_0 = df_subsampled.filter(pl.col("label") == 0).sample(n=1)
    sample_1 = df_subsampled.filter(pl.col("label") == 1).sample(n=1)

    print("--- Sample with label 0 ---")
    print(sample_0["self_text"].to_list()[0])
    print("\n--- Sample with label 1 ---")
    print(sample_1["self_text"].to_list()[0])

print_samples(df_subsampled)

--- Sample with label 0 ---
Don't forget that they have to obey every order given them by a Trump, no matter how illegal it is.

--- Sample with label 1 ---
Care to elaborate on which Trump policies Biden stopped using and how it affected the border?


### 0.5 pts - Train, dev and test sets

Using the `df_subsampled` object, create three sets: training, development/validation, and test. The test set will be used in the evaluation phase, while the first two partitions will be used during the training phase. Remember to keep the test set entirely separate from the training and validation process. The train-dev-test sizes must be $80\%$, $10\%$, and $10\%$ respectively. Feel free to use the sklearn library to help you with this task: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
! pip install -U scikit-learn



In [None]:
# Write your solution here. Store your three sets in three
# different variables called train, validation, test.
from sklearn.model_selection import train_test_split
trainSet, test = train_test_split(df_subsampled, test_size=0.1, stratify=df_subsampled['label'])

train, validation = train_test_split(trainSet, test_size=0.1111, stratify=trainSet['label'])

In [None]:
# Let's see the dimensions of our partitions.
# DO NOT CHANGE THIS CODE.

print(train.shape, validation.shape, test.shape)

(8000, 2) (1000, 2) (1000, 2)


## 1. Feature representation

Now, we are ready to implement the first step in our learning pipeline.

Since we need numerical inputs for our learning function, we have to transform our strings to numerical representations. Essentially, we need to find a way to make numbers encode some aspect of word or sentence meaning. There are many ways to do this. In this assignment, we will use the **TF-IDF algorithm** and the **CountVectorizer** method to create two sets of features (both sets will represent the same data). You can learn more about TF-IDF [here](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/#:~:text=Term%20Frequency%20%2D%20Inverse%20Document%20Frequency%20(TF%2DIDF)%20is,%2C%20relative%20to%20a%20corpus). The method involves multiplying two quantities: the term frequency and the inverse document frequency.

The term frequency is the number of times a term appears in a document, divided by the total number of terms in the document. The inverse document frequency is the logarithm of the number of documents in the corpus divided by the number of documents that contain the term.

$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$

Where:

$TF(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$

$IDF(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right)$

Where:
- $f_{t,d}$ is the number of times term $t$ appears in document $d$
- $\sum_{t' \in d} f_{t',d}$ is the sum of the number of times each term appears in document $d$, which is basically the total number of terms in the document
- $N$ is the number of documents in the corpus
- $|\{d \in D : t \in d\}|$ is the number of documents in the corpus that contain term $t$


This allows us to get a numerical representation that assigns higher values to the most important words in our *corpus*. These are not necessarily the most frequent words: think of the definite article "the". It might appear frequently in a corpus, but it does not provide a lot of semantic content.
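
To make the formulas above concrete, here is a minimal sketch that applies them to a tiny corpus. The corpus and the helper name `toy_tf_idf` are hypothetical and only serve as an illustration; the code uses plain Python so it does not depend on anything else in the notebook.

```python
import math
from collections import Counter

def toy_tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total_terms = len(doc)
        scores.append({
            term: (count / total_terms) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, tokenized on whitespace.
toy_docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
toy_scores = toy_tf_idf(toy_docs)
print(toy_scores[0])
```

Because "the" occurs in every document, its IDF is $\log(3/3)=0$ and its TF-IDF score is $0$, while the rarer "sat" scores higher than "cat", which appears in two of the three documents.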

### + 1.0 point - Complete the following function and implement the TF-IDF algorithm

In [None]:
from typing import List, Dict
import numpy as np
from tqdm import tqdm
import copy

def find_vocabulary(rows: List[str]) -> List[str]:
  vocabulary_list = list()
  for row in rows:
    tokens = row.split(" ")
    tokens = [ele for ele in tokens if ele.strip()]
    vocabulary_list.extend(tokens)
  vocabulary = set(vocabulary_list)
  return sorted(vocabulary)

def generate_docs(rows: List[str]) -> List[List[str]]:
  docs = list()
  for row in rows:
    tokens = row.split(" ")
    tokens = [ele for ele in tokens if ele.strip()]
    docs.append(tokens)
  return docs

def compute_tf_idf(docs: List[List[str]]) -> Dict[str, Dict[str, float]]:
    token_hash = {}
    token_frequency = {}
    inverse_doc_frequency = {}

    # Compute TF
    i = 0
    for doc in docs:
      docKey = "doc_"+str(i)
      token_frequency.update({docKey: {}})
      doc_token_frequency = token_frequency[docKey]
      total_tokens = len(doc)
      for token in doc:
        token_count = doc.count(token)
        doc_token_frequency.update({token: (token_count / total_tokens)})
        # preprocessing for IDF
        if token not in token_hash:
          token_hash[token] = set()
          token_hash[token].add(i)
        else:
          token_hash[token].add(i)
      i = i + 1

    # Compute IDF
    N = len(docs)
    for doc in docs:
      for token in doc:
        inverse_doc_frequency.update({token: np.log(N / len(token_hash[token]))})

    # Compute TF-IDF
    tf_idf = copy.deepcopy(token_frequency)
    for doc, doc_tf_idf in tf_idf.items():
      for token in doc_tf_idf.keys():
        doc_tf_idf[token] = doc_tf_idf[token] * inverse_doc_frequency[token]
      tf_idf[doc] = doc_tf_idf

    return tf_idf

In [None]:
vocabulary = find_vocabulary(df_subsampled['self_text'])
docs = generate_docs(df_subsampled['self_text'])
tf_idf = compute_tf_idf(docs)

### + 1.0 point - Second set of features

Let's generate a second set of features. This time, we will use `CountVectorizer`, a sklearn class that generates a matrix of token counts from the input text.
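
To illustrate what a matrix of token counts looks like, the following plain-Python sketch builds one by hand for two made-up sentences. It is a simplified stand-in, not sklearn's implementation: the helper `toy_count_matrix` is hypothetical, and it only assumes the lowercasing and the "two or more word characters" token rule that `CountVectorizer` applies by default.

```python
import re
from collections import Counter

def toy_count_matrix(texts):
    """Build a sorted vocabulary and a dense matrix of token counts."""
    # Lowercase each text and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column_of = {token: column for column, token in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[column_of[token]] = count
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical example sentences.
toy_texts = ["I agree, I really agree!", "I do not agree."]
toy_vocabulary, toy_matrix = toy_count_matrix(toy_texts)
print(toy_vocabulary)  # ['agree', 'do', 'not', 'really']
print(toy_matrix)      # [[2, 0, 0, 1], [1, 1, 1, 0]]
```

Each row is one document and each column is one vocabulary entry, which is the same layout as the feature matrices created below.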

In [None]:
# Use the count vectorizer to convert the text data to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
count_vectorizer = CountVectorizer()

In [None]:
from typing import Tuple

def transform_features(count_vectorizer: CountVectorizer, train: pl.DataFrame, validation: pl.DataFrame, test: pl.DataFrame) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    # Fit the CountVectorizer to the train, validation and test sets.
    # The three partitions are concatenated so that the function only
    # depends on its arguments and not on the global df_subsampled.
    all_text = pl.concat([train, validation, test])['self_text']
    count_vectorizer.fit(all_text)
    # transform() returns sparse matrices; they are made dense further below.
    train_features = count_vectorizer.transform(train['self_text'])
    validation_features = count_vectorizer.transform(validation['self_text'])
    test_features = count_vectorizer.transform(test['self_text'])
    return train_features, validation_features, test_features

In [None]:
train_features, validation_features, test_features = transform_features(count_vectorizer, train, validation, test)

In [None]:
# Convert the features to PyTorch tensors.
# Tensors are a useful data structure in machine learning.
# Their usefulness stems from the fact that they are n-dimensional arrays, which allows us
# to conveniently store the features generated from text corpora.
import torch

def convert_to_tensors(train_features: np.ndarray, validation_features: np.ndarray, test_features: np.ndarray) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Convert the sparse count matrices to dense float PyTorch tensors
    train_features_tensor = torch.from_numpy(train_features.toarray()).float()
    validation_features_tensor = torch.from_numpy(validation_features.toarray()).float()
    test_features_tensor = torch.from_numpy(test_features.toarray()).float()
    return train_features_tensor, validation_features_tensor, test_features_tensor

In [None]:
train_features_tensor, validation_features_tensor, test_features_tensor = convert_to_tensors(train_features, validation_features, test_features)

In [None]:
# Convert the labels to PyTorch tensors

def convert_labels_to_tensors(train: pl.DataFrame, validation: pl.DataFrame, test: pl.DataFrame) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Convert the labels to PyTorch tensors
    train_labels = torch.from_numpy(np.array(train['label']))
    validation_labels = torch.from_numpy(np.array(validation['label']))
    test_labels = torch.from_numpy(np.array(test['label']))
    return train_labels, validation_labels, test_labels

In [None]:
train_labels, validation_labels, test_labels = convert_labels_to_tensors(train, validation, test)

In [None]:
# This will give us an idea of the dimensions of our generated features.
print(train_features_tensor.shape, validation_features_tensor.shape, test_features_tensor.shape)

torch.Size([8000, 19825]) torch.Size([1000, 19825]) torch.Size([1000, 19825])


## 2. The objective function - cross-entropy loss

To optimize the learning process in our classifier, we need to implement a **loss function** or objective function. This function measures how far the probabilities produced by the classifier are from the correct labels.

For a string $x$, we want to learn weights that maximize the probability of the correct label, $p(y \mid x)$. Since there are only two discrete outcomes ($1$ or $0$), we can make use of the **Bernoulli distribution**, and we can express the probability, $p(y \mid x)$, that our classifier produces for one observation as the following:
$$
p(y \mid x)=\hat{y}^y(1-\hat{y})^{1-y} \quad \text{where} \quad \hat{y}:\text{predicted probability that the label is } 1, \quad y:\text{correct label}
$$


Next, we take the log of both sides. This will turn out to be handy mathematically, and doesn't hurt us; whatever values maximize a probability will also maximize the $\log$ of the probability:
$$
\begin{aligned}
\log p(y \mid x) & =\log \left[\hat{y}^y(1-\hat{y})^{1-y}\right] \\
& =y \log \hat{y}+(1-y) \log (1-\hat{y})
\end{aligned}
$$

To turn this into a loss function (something that we need to minimize), we'll just flip the sign on the equation. The result is the cross-entropy loss $L_{\mathrm{CE}}$:
$$
\begin{aligned}
L_{\mathrm{CE}}(\hat{y}, y) & = -\log p(y \mid x) \\
& =-[y \log \hat{y}+(1-y) \log (1-\hat{y})]
\end{aligned}
$$

Finally, we substitute $\hat{y}=\sigma(\mathbf{w} \cdot \mathbf{x}+b)$ into the loss function to obtain:
$$
L_{\mathrm{CE}}(\hat{y}, y)=-[y \log \sigma(\mathbf{w} \cdot \mathbf{x}+b)+(1-y) \log (1-\sigma(\mathbf{w} \cdot \mathbf{x}+b))]
$$

**For a given input, $x$, we want the loss to be smaller if the model's estimate is close to correct, and bigger if the model is confused.**

The equation above is generalized to:
$$
L=-\frac{1}{m} \sum_{i=1}^m \left[y_i \cdot \log \left(\hat{y}_i\right)+\left(1-y_i\right) \cdot \log \left(1-\hat{y}_i\right)\right]\quad \text{where} \quad m:\text{number of samples in the corpus}
$$

### + 1.0 point - Cross-entropy loss implementation

Complete the following function and implement the cross-entropy loss function.

In [None]:
# Step 2: Cross-Entropy Loss
def cross_entropy_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    m = len(y_true)
    cross_entropy_sum = 0
    for y_i_true, y_hat_i_pred in zip(y_true, y_pred):
      cross_entropy_sum += ( (y_i_true * np.log(y_hat_i_pred)) + ((1 - y_i_true) * np.log(1 - y_hat_i_pred)) )
    cross_entropy_loss_value = ( -1 / m) *  cross_entropy_sum
    return cross_entropy_loss_value
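
The behaviour described earlier (a small loss for good estimates, a large loss for confused ones) can be checked on a few made-up probabilities. The sketch below is plain Python and is not the graded solution; the helper `toy_binary_cross_entropy` and the `eps` clipping are assumptions added here so that `log(0)` can never occur.

```python
import math

def toy_binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy; eps keeps log() away from 0 (an added safeguard)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities.
toy_labels = [1, 0, 1]
loss_good = toy_binary_cross_entropy(toy_labels, [0.9, 0.1, 0.8])
loss_uninformed = toy_binary_cross_entropy(toy_labels, [0.5, 0.5, 0.5])
loss_bad = toy_binary_cross_entropy(toy_labels, [0.1, 0.9, 0.2])
print(loss_good, loss_uninformed, loss_bad)
```

A model that always outputs $0.5$ has a loss of $\log 2 \approx 0.693$, which is the value a model with all-zero weights starts from.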


## 3. The sigmoid function

$$
\sigma(z)=\frac{1}{1+e^{-z}}=\frac{1}{1+\exp (-z)}
$$

- This function is also called the **logistic function**. It takes a real-valued number and maps it into the range $(0,1)$, which is what we want because we want to model probabilities.
- The sigmoid function takes real-valued numbers as inputs, which is why we needed to create the features beforehand.
- Because it is nearly linear around $0$ but flattens toward the ends, the sigmoid function tends to squash outlier values toward $0$ or $1$.
- It’s differentiable, which is convenient for the learning process (this process is just a series of function optimizations).
- Given an input $x$, we want to assign it a label. In the case of our *corpus*, we can only make two decisions: controversial or not controversial.
- The sigmoid function is written in terms of $z$ because its input is not the raw feature vector but the weighted sum of the features plus a bias term, $z=\mathbf{w} \cdot \mathbf{x}+b$.

### + 1.0 point - Sigmoid function implementation

Complete the following function and implement the sigmoid function.

In [None]:
from typing import Union

def sigmoid(x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:

    return 1 / (1 + np.exp(-x))
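
For very negative inputs, `np.exp(-x)` can overflow, which NumPy reports as a runtime warning. One common way around this is to pick an algebraically equivalent form depending on the sign of the input. The following scalar sketch in plain Python illustrates the idea; `stable_sigmoid` is a hypothetical helper and is not required for the assignment.

```python
import math

def stable_sigmoid(z: float) -> float:
    """Logistic function that never calls exp() on a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # z < 0, so exp(z) < 1 and cannot overflow
    return exp_z / (1.0 + exp_z)

for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, stable_sigmoid(z))
```

The output stays within $[0, 1]$ even for extreme inputs, and `stable_sigmoid(0.0)` is exactly $0.5$, the decision boundary used later on.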

## 4. Gradient descent

This algorithm will help us **optimize** our loss function. Since our loss function is differentiable, we can optimize it by calculating the derivatives.

We now introduce **weights** into our predictions. These should minimize the loss function averaged over all examples:

$$
L=-\frac{1}{m} \sum_{i=1}^m \left[y_i \cdot \log \left(\sigma\left(\mathrm{X}_{\mathrm{i}} w+b\right)\right)+\left(1-y_i\right) \cdot \log \left(1-\sigma\left(\mathrm{X}_{\mathrm{i}} w+b\right)\right)\right]
$$

By taking the gradient of $L$ with respect to $w$, you get the following:
$$
\frac{\partial L}{\partial w}=\frac{1}{m} \mathrm{X}^{\top}\left(\sigma\left(\mathrm{X} w+b\right)-y\right)
$$

By taking the gradient of $L$ with respect to $b$, you get the following:
$$
\frac{\partial L}{\partial b}=\frac{1}{m} \sum_{i=1}^m \left(\sigma\left(\mathrm{X}_{\mathrm{i}} w+b\right)-y_i\right)
$$

[source](https://www.tensorflow.org/guide/core/logistic_regression_core)

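Stochastic gradient descent is an algorithm that minimizes the loss function by computing its gradient after each training example, and nudging the parameters $\theta$ (here, $w$ and $b$) in the right direction (the opposite direction of the gradient). The implementation in the training section below computes the gradient over the entire training set at each step, a variant known as batch gradient descent.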

[source](https://web.stanford.edu/~jurafsky/slp3/5.pdf)


The learning rate $\eta$ is a hyperparameter, meaning that it is a value that we have to choose manually and adjust later on depending on the optimization results. If it is set too low, the algorithm will take very long to find a minimum; if it is set too high, the updates can overshoot the minimum.
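
The update rule can be sketched on a toy problem with a single feature. Everything in this example (the data, the helper names, the learning rate of $0.1$ and the 50 steps) is a made-up illustration in plain Python; it uses the full toy dataset for every update, so it is batch rather than stochastic gradient descent.

```python
import math

def toy_sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_gradients(xs, ys, w, b):
    """Mean cross-entropy and its partial derivatives for a one-feature model."""
    m = len(xs)
    loss = dw = db = 0.0
    for x, y in zip(xs, ys):
        y_hat = toy_sigmoid(w * x + b)
        loss -= (y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)) / m
        dw += (y_hat - y) * x / m
        db += (y_hat - y) / m
    return loss, dw, db

# Hypothetical, linearly separable toy data with a single feature.
toy_xs = [-2.0, -1.0, 1.0, 2.0]
toy_ys = [0, 0, 1, 1]

w_toy, b_toy, eta = 0.0, 0.0, 0.1
toy_losses = []
for step in range(50):
    loss, dw, db = loss_and_gradients(toy_xs, toy_ys, w_toy, b_toy)
    toy_losses.append(loss)
    # Move against the gradient, scaled by the learning rate.
    w_toy -= eta * dw
    b_toy -= eta * db

print(f"first loss: {toy_losses[0]:.4f}, last loss: {toy_losses[-1]:.4f}")
```

With this small learning rate the loss decreases at every step; trying a much larger `eta` is a quick way to see the overshooting behaviour.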

## 5. Training

Now that we have created all the necessary building blocks for our logistic regression model, we will perform the **training phase**, in which we will **learn** the weights for our particular *corpus*.

### + 1.0 point - Complete the following function:

In [None]:
from typing import Tuple

# hyperparameters:
# lr: learning rate
# epochs: number of passes through the entire corpus.
def logistic_regression(X: np.ndarray, y: np.ndarray, lr: float = 0.1, epochs: int = 100) -> Tuple[np.ndarray, float]:
    m, n = X.shape
    w = np.zeros(n)
    b = 0

    for epoch in tqdm(range(epochs)):
        z = np.dot(X, w) + b
        y_pred = sigmoid(z)
        loss = cross_entropy_loss(y, y_pred)

        # Gradient computation
        dw = np.dot(X.T, (y_pred - y)) / m
        db = np.sum(y_pred - y) / m

        # Update weights
        w -= lr * dw
        b -= lr * db

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss}")

    return w, b

In [None]:
y = np.array(train_labels)
w, b = logistic_regression(train_features_tensor, y)

Epoch 0, Loss: 0.6931471805600389
Epoch 10, Loss: 0.6874562403268131
Epoch 20, Loss: 0.6842026284281115
Epoch 30, Loss: 0.6815564430997848
Epoch 40, Loss: 0.6792761469283639
Epoch 50, Loss: 0.6772473744494814
Epoch 60, Loss: 0.6754064689990353
Epoch 70, Loss: 0.6737128590671474
Epoch 80, Loss: 0.6721383577437882
Epoch 90, Loss: 0.6706623396952428

100%|██████████| 100/100 [01:51<00:00,  1.11s/it]


### 0.5 points - Predict the final probabilities

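Now, we will create a function that outputs the final predictions of our model. The inputs will be the `features` of the test dataset and the parameters of the trained model (the weights `w` and the bias `b`).

Fill in the missing code and use a decision boundary of $0.5$: predict label $1$ when the probability is $\geq 0.5$, and label $0$ otherwise.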

In [None]:
def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    z = np.dot(X, w) + b
    y_pred = sigmoid(z)
    # Apply the decision boundary: 1.0 if the probability is >= 0.5, else 0.0
    return (y_pred >= 0.5).astype(float)

In [None]:
predictions = predict(test_features_tensor, w, b)
print("Predictions:", predictions)

Predictions: [0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.
 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1.
 0. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1.
 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1.
 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0.
 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0.
 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1

## 6. Evaluation metrics

A very important step in every ML pipeline is the evaluation, since we always want to know how well or badly our model will predict the labels of unseen examples.

In NLP we make use of various evaluation metrics for our models. The most common ones are:

- Accuracy: It is the ratio of the number of correct predictions to the total number of predictions. This gives us an idea of how many correct outputs our model will generate when predicting labels on unseen examples. The formula is:
$$
\begin{aligned}
\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
\end{aligned}
$$
- Precision: This metric is the ratio of true positive predictions to the total number of positive predictions (i.e., the sum of true positive and false positive predictions). It tells us about the quality of our model's positive predictions, since it gives us a notion of how many of the predictions in one class actually belong to that class.
$$
\begin{aligned}
\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
\end{aligned}
$$
- Recall: It calculates the proportion of true positive predictions to the total number of actual positive items (i.e., the sum of true positive and false negative predictions). Intuitively, it tells us how many of the actual positive instances of a class the model correctly identified.
$$
\begin{aligned}
\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\end{aligned}
$$
- F1 metric: This metric is the harmonic mean of precision and recall, which weights both equally. It is used to see how balanced the precision and recall metrics are, and it is especially useful when there is an uneven class distribution. In our case, we do not have an unbalanced proportion of labels. What do you expect then to see here for our corpus?
$$
\begin{aligned}
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}
\end{aligned}
$$
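
As a quick worked example, the four metrics can be computed directly from the counts of true/false positives and negatives, taking precision as true positives over all positive predictions. The counts below are hypothetical and `metrics_from_counts` is an illustrative helper, not part of the assignment's required interface.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything predicted positive
    recall = tp / (tp + fn)      # of everything actually positive
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for 100 predictions.
toy_metrics = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

Here accuracy is $0.7$, precision is $0.8$, recall is about $0.667$, and F1 is about $0.727$, which lies between precision and recall but closer to the lower of the two.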

### + 1.0 pts Evaluation metrics implementation

Complete the following functions that generate the evaluation metrics introduced before:

In [None]:
# Layout of the contingency matrix, indexed as contingency_matrix[system][gold]:
#
#                   | gold negative   | gold positive   |
#   system negative |  true negative  |  false negative |
#   system positive |  false positive |  true positive  |

def calculate_contingency_matrix(y_true: List[int], y_pred: List[int]) -> np.ndarray:
  # Rows index the system's prediction, columns index the gold (true) label.
  # The matrix is recomputed on every call so that counts from one model
  # are never reused for another.
  contingency_matrix = np.zeros((2, 2))
  for y_i_true, y_i_pred in zip(y_true, y_pred):
    contingency_matrix[int(y_i_pred)][int(y_i_true)] += 1
  return contingency_matrix

def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    # Compute accuracy of the model based on the true and predicted labels
    contingency_matrix = calculate_contingency_matrix(y_true, y_pred)
    correct_predictions = contingency_matrix[0][0] + contingency_matrix[1][1]
    return correct_predictions / len(y_pred)

def precision(y_true: List[int], y_pred: List[int]) -> float:
    # Compute precision of the model based on the true and predicted labels
    contingency_matrix = calculate_contingency_matrix(y_true, y_pred)
    true_positives = contingency_matrix[1][1]
    false_positives = contingency_matrix[1][0]  # predicted 1, gold label 0
    return true_positives / (true_positives + false_positives)

def recall(y_true: List[int], y_pred: List[int]) -> float:
    # Compute recall of the model based on the true and predicted labels
    contingency_matrix = calculate_contingency_matrix(y_true, y_pred)
    true_positives = contingency_matrix[1][1]
    false_negatives = contingency_matrix[0][1]  # predicted 0, gold label 1
    return true_positives / (true_positives + false_negatives)

def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    # Compute F1 score of the model based on the true and predicted labels
    precision_value = precision(y_true, y_pred)
    recall_value = recall(y_true, y_pred)
    return 2 * ((precision_value * recall_value) / (precision_value + recall_value))

## 7. Evaluation

Now, we will evaluate the performance of our classifier. During this step we will make use of the test set that we created in the beginning; we will use the test set during this phase because we want to simulate a real-world scenario. That is, if we train a model and we deploy it in a real-world application, we expect that the users will pass new (unseen) text samples to that trained model. In our case, those new samples are in the test set. By "hiding" the test set during the training phase, we make sure that our evaluation measures how well the model generalizes to unseen data.

### + 0.5 pts - Evaluation

Complete the following lines of code and print the classification metrics the trained model achieves.

In [None]:
### DO NOT EDIT ###
### Test set results
y_true = np.array(test_labels)
print('Logistic regression with CountVectorizer features - Results:')
print('Accuracy:', accuracy(y_true, predictions))
print('Precision:', precision(y_true, predictions))
print('Recall:', recall(y_true, predictions))
print('F1-score', f1_score(y_true, predictions))

Logistic regression with CountVectorizer features - Results:
Accuracy: 0.56
Precision: 0.5707547169811321
Recall: 0.484
F1-score 0.5238095238095237


## 8. Regularization

Even though we want our machine learning models to learn as much as possible from our data, we don't want them to fit the training inputs perfectly. Why? Because when we show them new samples, they won't be able to predict their labels accurately, given that they will only be optimized for a specific set of training samples. This problem is called **overfitting**. Therefore, we want our models to have some room for error.

To avoid overfitting, a new regularization term, $R(\theta)$, is added to the objective function, resulting in the following objective for a batch of $m$ examples (slightly rewritten to be maximizing log probability rather than minimizing loss, and removing the $\frac{1}{m}$ term which doesn't affect the argmax):
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}} \sum_{i=1}^m \log P\left(y^{(i)} \mid x^{(i)}\right)-\alpha R(\theta)
$$

The new regularization term, $R(\theta)$, is used to penalize large weights. Thus a setting of the weights that matches the training data perfectly, but uses many weights with high values to do so, will be penalized more than a setting that matches the data a little less well, but does so using smaller weights. There are two common ways to compute this regularization term $R(\theta)$. L2 regularization is a quadratic function of the weight values, named because it uses the (square of the) L2 norm of the weight values. The L2 norm, $\|\theta\|_2$, is the same as the Euclidean distance of the vector $\theta$ from the origin. If $\theta$ consists of $n$ weights, then:
$$
R(\theta)=\|\theta\|_2^2=\sum_{j=1}^n \theta_j^2
$$

The L2 regularized objective function becomes:
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}}\left[\sum_{i=1}^m \log P\left(y^{(i)} \mid x^{(i)}\right)\right]-\alpha \sum_{j=1}^n \theta_j^2
$$

L1 regularization is a linear function of the weight values, named after the L1 norm $\|\theta\|_1$, the sum of the absolute values of the weights, or Manhattan distance (the Manhattan distance is the distance you'd have to walk between two points in a city with a street grid like New York):
$$
R(\theta)=\|\theta\|_1=\sum_{j=1}^n\left|\theta_j\right|
$$

The L1 regularized objective function becomes:
$$
\hat{\theta}=\underset{\theta}{\operatorname{argmax}}\left[\sum_{i=1}^m \log P\left(y^{(i)} \mid x^{(i)}\right)\right]-\alpha \sum_{j=1}^n\left|\theta_j\right|
$$
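
To make the two penalty terms concrete, here is a minimal sketch that evaluates $R(\theta)$ for a small weight vector. The function names `l1_penalty` and `l2_penalty` and the example vector are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def l1_penalty(theta: np.ndarray) -> float:
    # L1 norm: sum of the absolute values of the weights
    return float(np.sum(np.abs(theta)))

def l2_penalty(theta: np.ndarray) -> float:
    # Squared L2 norm: sum of the squared weights
    return float(np.sum(theta ** 2))

theta = np.array([0.5, -2.0, 0.0, 1.5])  # hypothetical weights
print("L1 penalty:", l1_penalty(theta))  # 0.5 + 2.0 + 0.0 + 1.5 = 4.0
print("L2 penalty:", l2_penalty(theta))  # 0.25 + 4.0 + 0.0 + 2.25 = 6.5
```

Note how the single large weight (-2.0) accounts for most of the L2 penalty, while the L1 penalty grows only linearly with it.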

### 1.0 pts - Implement and test the L1 regularization

Using the code provided next, train another LR classifier. The difference in this new LR model will be that now we will use the L1 regularization.

Evaluate this new LR model and compare the results obtained by this model and the previous one.

In [None]:
def cross_entropy_loss_l1(y_true: np.ndarray, y_pred: np.ndarray, w: np.ndarray, lambda_: float) -> float:
    m = len(y_true)
    cross_entropy_sum = 0
    for y_i_true, y_hat_i_pred in zip(y_true, y_pred):
      cross_entropy_sum += ( (y_i_true * np.log(y_hat_i_pred)) + ((1 - y_i_true) * np.log(1 - y_hat_i_pred)) )
    cross_entropy_loss_value = (( -1 / m) * cross_entropy_sum) + (lambda_ * np.sum(np.abs(w)))

    return cross_entropy_loss_value

def logistic_regression_l1(X: np.ndarray, y: np.ndarray, lr: float = 0.1, epochs: int = 100, lambda_: float = 0.1) -> Tuple[np.ndarray, float]:
    m, n = X.shape
    w = np.zeros(n)
    b = 0

    for epoch in tqdm(range(epochs)):
        z = np.dot(X, w) + b
        y_pred = sigmoid(z)
        loss =  cross_entropy_loss_l1(y, y_pred, w, lambda_)
        # Gradient computation
        dw = ((1 / m) * np.dot(X.T, (y_pred - y))) + (lambda_ * np.sign(w))

        db = np.sum(y_pred - y) / m

        # Update weights
        w -= lr * dw
        b -= lr * db

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss}")

    return w, b
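
The loss above loops over the samples one by one and takes `np.log` of the raw predictions. A possible vectorized variant is sketched below; the function name and the clipping constant `eps` are assumptions added here so that a prediction of exactly 0 or 1 does not produce `log(0)`. Apart from the clipping, it is meant to compute the same value.

```python
import numpy as np

def cross_entropy_loss_l1_vectorized(y_true, y_pred, w, lambda_, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    # Mean binary cross-entropy over the batch
    data_term = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # L1 penalty on the weights (the bias is not penalized)
    return float(data_term + lambda_ * np.sum(np.abs(w)))

# With zero weights every prediction is 0.5, so the loss equals ln 2
y_demo = np.array([1, 0, 1, 0])
p_demo = np.full(4, 0.5)
loss_demo = cross_entropy_loss_l1_vectorized(y_demo, p_demo, np.zeros(3), lambda_=0.1)
print(loss_demo)
```

This is the situation at epoch 0 of training, where the weights are initialised to zero, which is why the first reported loss is about 0.6931.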

In [None]:
y = np.array(train_labels)
w_l1, b_l1 = logistic_regression_l1(train_features_tensor, y, lr = 0.1, epochs = 100, lambda_ = 0.1)

  1%|          | 1/100 [00:01<01:44,  1.06s/it]

Epoch 0, Loss: 0.6931471805600389


 11%|█         | 11/100 [00:11<01:30,  1.02s/it]

Epoch 10, Loss: 16.409680530387266


 21%|██        | 21/100 [00:22<01:29,  1.14s/it]

Epoch 20, Loss: 16.140654550654826


 31%|███       | 31/100 [00:33<01:18,  1.14s/it]

Epoch 30, Loss: 15.899459211534705


 41%|████      | 41/100 [00:44<01:04,  1.09s/it]

Epoch 40, Loss: 15.678296850317624


 51%|█████     | 51/100 [00:54<00:52,  1.07s/it]

Epoch 50, Loss: 15.463073894605348


 61%|██████    | 61/100 [01:05<00:39,  1.02s/it]

Epoch 60, Loss: 15.252058387860185


 71%|███████   | 71/100 [01:15<00:29,  1.01s/it]

Epoch 70, Loss: 15.047153109482828


 81%|████████  | 81/100 [01:26<00:19,  1.03s/it]

Epoch 80, Loss: 14.85189497643821


 91%|█████████ | 91/100 [01:36<00:09,  1.07s/it]

Epoch 90, Loss: 14.661483128990806


100%|██████████| 100/100 [01:45<00:00,  1.06s/it]


In [None]:
predictions_l1 = predict(test_features_tensor, w_l1, b_l1)
print("Predictions:", predictions_l1)

Predictions: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0

In [None]:
y_true = np.array(test_labels)
print('Logistic regression with l1 regularization lamda_ 0.1 - Results:')
print('Accuracy:', accuracy(y_true, predictions_l1))
print('Precision:', precision(y_true, predictions_l1))
print('Recall:', recall(y_true, predictions_l1))
print('F1-score', f1_score(y_true, predictions_l1))

Logistic regression with l1 regularization lamda_ 0.1 - Results:
Accuracy: 0.472
Precision: 0.074
Recall: 0.3627450980392157
F1-score 0.12292358803986711


Comparison: LR vs LR with L1

| learning rate = 0.1 | Accuracy | Precision | Recall | F1-score | loss (after 100 epochs) |
| --- | --- | --- | --- | --- | --- |
| LR | 0.560 | 0.484 | 0.5707547169811321 | 0.5238095238095237 | 0.6706623396952428 |
| LR with L1 (lambda_ = 0.1) | 0.472 | 0.074 | 0.3627450980392157 | 0.12292358803986711 | 14.661483128990806 |

Note: values in the table may change for every run

The loss is higher for **LR with L1** than for **LR** because the L1 penalty, $\lambda \sum_j |w_j|$, is added to the cross-entropy term in the loss function.

The term `lambda_ * np.sign(w)` in the weight gradient penalises large weights: at each update it moves positive weights down and negative weights up, so every non-zero weight is pushed towards zero, which promotes sparsity.

The accuracy, precision, recall and F1-score of **LR with L1** are all lower than those of **LR**. Its loss is still decreasing after 100 epochs (from about 16.41 at epoch 10 to about 14.66 at epoch 90), so the model has not yet reached its convergence point; training for more epochs may improve these metrics.
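
The effect of the penalty term `lambda_ * np.sign(w)` on the weights can be seen in isolation with a small sketch. The weight vector below is hypothetical, and the data term of the gradient is deliberately left out so that only the penalty acts on the weights.

```python
import numpy as np

lr, lambda_ = 0.1, 0.1
w = np.array([2.0, -2.0, 0.0])  # hypothetical weights: positive, negative, zero

# Penalty part of the gradient only; the cross-entropy part is omitted on purpose
penalty_grad = lambda_ * np.sign(w)

# One gradient-descent step using only the penalty
w_next = w - lr * penalty_grad
print(w_next)
```

The positive weight decreases, the negative weight increases, and the zero weight stays where it is, because `np.sign(0)` is 0. Each non-zero weight moves by the same amount, `lr * lambda_`, regardless of its size.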

### + 1.0 point - Comparison with another set of features

Using the TF-IDF features we created in the beginning, train a third LR model and compare its performance to the other two.

What can you say about these results? Write a brief paragraph where you explain your findings.

In [None]:
vocabulary = find_vocabulary(df_subsampled['self_text'])

train_docs = generate_docs(train['self_text'])
train_tf_idf = compute_tf_idf(train_docs)

validation_docs = generate_docs(validation['self_text'])
validation_tf_idf = compute_tf_idf(validation_docs)

test_docs = generate_docs(test['self_text'])
test_tf_idf = compute_tf_idf(test_docs)

# Map each token to its column once; vocabulary.index would rescan the list for every token
token_to_index = {token: index for index, token in enumerate(vocabulary)}

def generate_feature_matrix(docs_tf_idf: Dict[str, Dict[str, float]]) -> torch.Tensor:
  # One row per document, one column per vocabulary token
  feature_matrix = np.zeros((len(docs_tf_idf), len(vocabulary)))
  for i, doc_tf_idf in enumerate(docs_tf_idf.values()):
    for token, feature_value in doc_tf_idf.items():
      feature_matrix[i][token_to_index[token]] = feature_value
  return torch.from_numpy(feature_matrix).float()

train_features_tf_idf_tensor = generate_feature_matrix(train_tf_idf)
validation_features_tf_idf_tensor = generate_feature_matrix(validation_tf_idf)
test_features_tf_idf_tensor = generate_feature_matrix(test_tf_idf)
print(train_features_tf_idf_tensor.shape, validation_features_tf_idf_tensor.shape, test_features_tf_idf_tensor.shape)

torch.Size([8000, 42593]) torch.Size([1000, 42593]) torch.Size([1000, 42593])


In [None]:
y = np.array(train_labels)
w_tf_idf, b_tf_idf = logistic_regression(train_features_tf_idf_tensor, y)

  1%|          | 1/100 [00:03<05:15,  3.19s/it]

Epoch 0, Loss: 0.6931471805600389


 11%|█         | 11/100 [00:24<03:11,  2.15s/it]

Epoch 10, Loss: 0.6930118049027276


 21%|██        | 21/100 [00:46<02:53,  2.19s/it]

Epoch 20, Loss: 0.6928767110625578


 31%|███       | 31/100 [01:08<02:30,  2.18s/it]

Epoch 30, Loss: 0.6927418973396542


 41%|████      | 41/100 [01:29<02:05,  2.13s/it]

Epoch 40, Loss: 0.6926073622602261


 51%|█████     | 51/100 [01:51<01:45,  2.14s/it]

Epoch 50, Loss: 0.6924731044873189


 61%|██████    | 61/100 [02:13<01:26,  2.22s/it]

Epoch 60, Loss: 0.692339122767364


 71%|███████   | 71/100 [02:34<01:00,  2.09s/it]

Epoch 70, Loss: 0.6922054158981544


 81%|████████  | 81/100 [02:55<00:40,  2.16s/it]

Epoch 80, Loss: 0.6920719827096138


 91%|█████████ | 91/100 [03:17<00:19,  2.21s/it]

Epoch 90, Loss: 0.6919388220523094


100%|██████████| 100/100 [03:37<00:00,  2.17s/it]


In [None]:
predictions_tf_idf = predict(test_features_tf_idf_tensor, w_tf_idf, b_tf_idf)
print("Predictions:", predictions_tf_idf)

Predictions: [1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1.
 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1.
 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1.
 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0.
 1. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1.
 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1.
 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0.
 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1

In [None]:
y_true = np.array(test_labels)
print('Logistic regression with tf-idf features - Results:')
print('Accuracy:', accuracy(y_true, predictions_tf_idf))
print('Precision:', precision(y_true, predictions_tf_idf))
print('Recall:', recall(y_true, predictions_tf_idf))
print('F1-score', f1_score(y_true, predictions_tf_idf))

Logistic regression with tf-idf features - Results:
Accuracy: 0.595
Precision: 0.772
Recall: 0.5701624815361891
F1-score 0.6559048428207306


*Write here a few lines explaining your observations from these new results (for example, are the classification metrics higher or lower than the ones previously obtained? Why do you think this happens? Which ones were easier to implement? Which set of features seems more intuitive to you? etc)*

...

Observations:

| learning rate = 0.1 | Accuracy | Precision | Recall | F1-score | loss (after 100 epochs) |
| --- | --- | --- | --- | --- | --- |
| LR | 0.560 | 0.484 | 0.5707547169811321 | 0.5238095238095237 | 0.6706623396952428 |
| LR with L1 (lambda_ = 0.1) | 0.472 | 0.074 | 0.3627450980392157 | 0.12292358803986711 | 14.661483128990806 |
| LR with tf_idf | 0.595 | 0.772 | 0.5701624815361891 |  0.6559048428207306 | 0.6919388220523094 |

Note: values may change for each run

Explanations for **LR** and **LR with L1** are given in the previous section, so only **LR with tf_idf** is discussed below.

1. The accuracy, precision and F1-score of **LR with tf_idf** are higher than those of the other two models; its recall (0.570) is about the same as that of **LR** (0.571).

2. This model ends training with a higher loss than **LR** (0.692 vs 0.671), which suggests that it fits the training data less tightly and may therefore be less overfitted. Note, however, that its loss has barely moved from the initial value of $\ln 2 \approx 0.693$, so the model is also far from converged. In addition, tf_idf weights each token by its frequency in the document and by how rare it is across the whole dataset (the idf factor ensures this), whereas CountVectorizer only counts how often a token occurs in a particular row/doc. The tf_idf value therefore assigns higher weights to the most distinctive words in our corpus, while CountVectorizer cannot distinguish more important words from less important ones. Neither representation captures relationships between tokens, such as similarity in meaning.

3. **LR** and **LR with L1** were easier to implement because they use features built with CountVectorizer. **LR with tf_idf** involved a lot more pre-processing, so it seemed a bit more complicated than the rest.

4. The feature set generated from tf_idf seemed more intuitive to me than the one from CountVectorizer for the following reasons:

    - It considers the whole dataset when computing a token's weight, not just the abundance of the token in a particular doc/row.
    - It down-weights tokens that appear in most documents, so very common words contribute little.
    - It can identify more important and less important words. We can then remove the words that are less important for the analysis, which makes model building less complex by reducing the input dimensions.
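
The difference between raw counts and tf_idf weights can be illustrated on a toy corpus. This is a minimal sketch that assumes the common definition tf = count / document length and idf = log(N / document frequency); the `compute_tf_idf` helper used in this notebook may use a different variant, and the three documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

def tf_idf(doc):
    counts = Counter(doc)
    # tf = relative frequency in this document, idf = log(N / document frequency)
    return {t: (c / len(doc)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

counts_doc0 = Counter(docs[0])
weights_doc0 = tf_idf(docs[0])
print("counts :", dict(counts_doc0))
print("tf-idf :", weights_doc0)
```

In the first document every token has a count of 1, so counts alone cannot tell the tokens apart. Under tf_idf, "the" gets a weight of 0 because it occurs in every document, while "cat" and "sat" keep a positive weight.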

### + extra:  1.0 point - LR in sklearn

Implement a fourth LR model using the sklearn [implementation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Use the same hyperparameters that you used in the previous trained models. Finally, compare your results with those obtained by the sklearn implementation.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

clf = LogisticRegression(penalty=None, random_state=0, max_iter=100)
y = np.array(train_labels)
y_true = np.array(test_labels)
clf.fit(train_features_tensor, y)
y_pred = clf.predict(test_features_tensor)
total_predictions_sklearn = len(y_pred)

y_pred_prob = clf.predict_proba(test_features_tensor)

ce_loss = log_loss(y_true, y_pred_prob)
accuracy_sklearn = accuracy_score(y_true, y_pred)
conf_mat = confusion_matrix(y_true, y_pred)
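# confusion_matrix rows are true labels and columns are predicted labels, so conf_mat[0][0]
# counts class-0 samples predicted as class 0: both ratios below score class 0 as the positive class.
precision_sklearn = conf_mat[0][0] / (conf_mat[0][0] + conf_mat[0][1])  # denominator: row 0, all true class-0 samples
recall_sklearn = conf_mat[0][0] / (conf_mat[0][0] + conf_mat[1][0])  # denominator: column 0, all samples predicted as class 0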
f1_score_sklearn = 2 * ((precision_sklearn * recall_sklearn) / (precision_sklearn + recall_sklearn))


print("Predictions:", y_pred)
print('\nLogistic regression with sklearn - Results:')
print("Loss: " + str(ce_loss))
print("Accuracy: " + str(accuracy_sklearn))
print("Precision: " + str(precision_sklearn))
print("Recall: " + str(recall_sklearn))
print("F1-score: " + str(f1_score_sklearn))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Predictions: [1 0 0 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0
 1 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1
 0 1 0 1 0 1 0 0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0
 1 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1
 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 0 1 0 1 1
 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 0 0 0 1
 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0
 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 1 0 1 1 1 0 1 1 1 1 1
 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 1 1
 1 0 1 1 1 0 1 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0
 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0
 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1
 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 1 0
 0 1 1 1 0 1

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss


clf = LogisticRegression(penalty='l1', solver="liblinear", C = 0.1, random_state=0, max_iter=100)
y = np.array(train_labels)
y_true = np.array(test_labels)
clf.fit(train_features_tensor, y)
y_pred = clf.predict(test_features_tensor)
total_predictions_sklearn = len(y_pred)

y_pred_prob = clf.predict_proba(test_features_tensor)

ce_loss = log_loss(y_true, y_pred_prob)
accuracy_sklearn = accuracy_score(y_true, y_pred)
conf_mat = confusion_matrix(y_true, y_pred)
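# Same convention as in the previous cell: rows are true labels, columns are predicted labels,
# and class 0 is scored as the positive class.
precision_sklearn = conf_mat[0][0] / (conf_mat[0][0] + conf_mat[0][1])  # denominator: row 0, all true class-0 samples
recall_sklearn = conf_mat[0][0] / (conf_mat[0][0] + conf_mat[1][0])  # denominator: column 0, all samples predicted as class 0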
f1_score_sklearn = 2 * ((precision_sklearn * recall_sklearn) / (precision_sklearn + recall_sklearn))


print("Predictions:", y_pred)
print('\nLogistic regression L1 with sklearn - Results:')
print("Loss: " + str(ce_loss))
print("Accuracy: " + str(accuracy_sklearn))
print("Precision: " + str(precision_sklearn))
print("Recall: " + str(recall_sklearn))
print("F1-score: " + str(f1_score_sklearn))

Predictions: [0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1
 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1
 1 0 1 0 1 1 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 0 0 1 0
 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1
 1 0 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0 1
 1 1 0 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1
 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0
 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0
 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1
 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1
 0 0 1 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1 0
 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1 0 0
 0 1 0 0 0 0

Observations:

| learning rate = 0.1 | Accuracy | Precision | Recall | F1-score | loss (after 100 epochs) |
| --- | --- | --- | --- | --- | --- |
| LR | 0.560 | 0.484 | 0.5707547169811321 | 0.5238095238095237 | 0.6706623396952428 |
| LR with L1 (lambda_ = 0.1) | 0.472 | 0.074 | 0.3627450980392157 | 0.12292358803986711 | 14.661483128990806 |
| **LR with tf_idf** | **0.595** | **0.772** | **0.5701624815361891** |  **0.6559048428207306** | **0.6919388220523094** |
| LR with Sklearn | 0.588 | 0.584 | 0.5887096774193549 | 0.5863453815261045 | 5.7472093150643815 |
| LR with L1 Sklearn | 0.565 | 0.644 | 0.5561312607944733 | 0.5968489341983317 | 0.6778823276553914 |

Note: values may change for each run

Explanation

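The two sklearn models above correspond to **LR** and **LR with L1**: `penalty` is set to `None` and `'l1'` respectively, and the L1 model uses `C = 0.1`. Note that `C` in sklearn is the inverse of the regularization strength and that sklearn scales its objective differently from our implementation, so `C = 0.1` does not apply the same penalty as `lambda_ = 0.1`.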
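
As a rough guide to how the two settings relate, the sketch below converts between them. It assumes that sklearn's L1 objective has the documented form $\|w\|_1 + C \sum_i \text{loss}_i$, while our loss is $\frac{1}{m}\sum_i \text{loss}_i + \lambda \|w\|_1$; equating the two gives $C = 1 / (\lambda m)$. The function name is hypothetical, and the match is only approximate because the `liblinear` solver also penalizes the intercept and optimizes differently from our gradient descent.

```python
def sklearn_c_from_lambda(lambda_: float, n_samples: int) -> float:
    # Our loss:      (1/m) * sum(loss_i) + lambda * ||w||_1
    # sklearn (L1):  ||w||_1 + C * sum(loss_i)
    # Both have the same minimizer when C = 1 / (lambda * m)
    return 1.0 / (lambda_ * n_samples)

# 8000 training samples, as in the feature matrix shapes printed earlier
c_equivalent = sklearn_c_from_lambda(0.1, 8000)
print(c_equivalent)
```

Under these assumptions, `lambda_ = 0.1` on 8000 training samples corresponds to a `C` of about 0.00125, which is a much stronger penalty than `C = 0.1`.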

**LR with Sklearn** has the second-highest accuracy, but its precision and F1-score are lower than those of its counterpart, **LR with L1 Sklearn**. Its loss of about 5.75 is also much higher; the unregularized model stopped at the iteration limit before converging, as the warning above shows. Note that the sklearn losses are computed on the test set, whereas the losses of our own models are training losses.

To conclude, **LR with tf_idf** is the most accurate of all the models and also has the highest precision and F1-score. The second-best model by precision and F1-score is **LR with L1 Sklearn**. **LR with L1** is the least accurate of all and has the highest loss.
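
The sklearn cells above derive precision and recall by indexing the confusion matrix by hand. As a cross-check, `sklearn.metrics` also provides `precision_score`, `recall_score` and `f1_score`, which score class 1 as the positive class by default. The sketch below uses a small made-up label vector rather than the assignment data, so its numbers are not comparable with the tables above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels, only to show the layout of the confusion matrix
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_demo = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall_demo = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1_demo = f1_score(y_true_demo, y_pred_demo)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision_demo)
print("Recall:", recall_demo)
print("F1-score:", f1_demo)
```

To score class 0 as the positive class instead, these functions accept `pos_label=0`.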