# Report for Homework 1: Classifier Agent

### Name: Trilok Padhi
### Panther ID: 002716700


## Declaration of Sources and Collaboration:

Collaborators: Mr. Abdul Kadir Erol

## Part 1:  Gradient Calculations

The cross-entropy loss for binary classification is defined as:


$$
L(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} [y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i})]
$$


where $N$ is the number of data points, $y_i \in \{0, 1\}$ is the true label of the $i$-th data point, and $\hat{y_i}$ is the predicted probability that its label is 1.

The prediction $\hat{y}$ is computed by applying the sigmoid function to a linear combination of the input features $x$ and the model parameters $w$:


$$
\hat{y} = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}
$$


The gradient of the loss function $L$ with respect to the model parameters $w$ is:


$$
\nabla_w L = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \frac{x_i}{\hat{y_i}} \hat{y_i}(1 - \hat{y_i}) - (1 - y_i) \frac{x_i}{1 - \hat{y_i}} \hat{y_i}(1 - \hat{y_i}) \right] = \frac{1}{N}\sum_{i=1}^{N} (\hat{y_i} - y_i)\, x_i
$$


For stochastic gradient descent, we compute the gradient using a single data point $(x, y)$:


$$
\nabla_w L = -\left[ y \frac{x}{\hat{y}} \hat{y}(1 - \hat{y}) - (1 - y) \frac{x}{1 - \hat{y}} \hat{y}(1 - \hat{y}) \right] = (\hat{y} - y)\, x
$$


Note: The above derivations assume that the sigmoid function is used as the activation function. If a different activation function is used, the derivations will be different.
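
As a sanity check, the sketch below compares the simplified closed-form gradient of the sigmoid cross-entropy loss, $\frac{1}{N}\sum_i (\hat{y_i} - y_i)\, x_i$, against a central finite-difference estimate. The helper names (`sigmoid`, `loss`, `grad`, `numerical_grad`) and the synthetic data are illustrative assumptions and are not part of the homework's `classifier` module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Mean binary cross-entropy of a logistic model; X is (N, d), y is (N,)."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    """Closed-form gradient: (1/N) * X^T (sigmoid(Xw) - y)."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numerical_grad(w, X, y, eps=1e-6):
    """Central finite-difference estimate of the gradient, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * eps)
    return g

# Small synthetic problem (hypothetical data, for checking only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=5) * 0.1

max_diff = np.max(np.abs(grad(w, X, y) - numerical_grad(w, X, y)))
print("max abs difference:", max_diff)
```

Passing a single row, e.g. `grad(w, X[:1], y[:1])`, gives the stochastic gradient $(\hat{y} - y)\, x$ for that data point.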

## Part 2:  Gradient Descent vs Stochastic Gradient Descent

Gradient Descent and Stochastic Gradient Descent are both optimization algorithms used for minimizing the loss function in machine learning models. 

1. **Number of iterations:** SGD typically needs many more parameter updates than GD to converge. GD uses the exact gradient of the loss (computed from all $N$ data points), while SGD uses a noisy estimate (computed from one randomly chosen data point), so each SGD update makes less reliable progress. Counted in passes over the dataset, however, SGD often needs fewer, because a single pass contains $N$ updates.

2. **Wall clock time:** Despite needing more updates, SGD is often faster in wall clock time. Each SGD update costs $O(d)$ for $d$ features, while each GD update costs $O(Nd)$.
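
The difference in per-update cost can be sketched as follows for logistic regression. This is a minimal illustration on synthetic data, not the `train_gd` / training code of the homework's `classifier_agent`; the function names, learning rates, and problem sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Mean binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gd_step(w, X, y, lr):
    """One GD update: gradient averaged over all N points, O(N*d) work."""
    return w - lr * X.T @ (sigmoid(X @ w) - y) / len(y)

def sgd_step(w, X, y, lr, rng):
    """One SGD update: gradient from one random point, O(d) work."""
    i = rng.integers(len(y))
    return w - lr * (sigmoid(X[i] @ w) - y[i]) * X[i]

# Synthetic, linearly separable data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
N, d = 500, 10
X = rng.normal(size=(N, d))
y = (X @ rng.normal(size=d) > 0).astype(float)

w_gd = np.zeros(d)
w_sgd = np.zeros(d)
initial_loss = loss(w_gd, X, y)

n_passes = 20
for _ in range(n_passes):          # 20 GD updates = 20 passes over the data
    w_gd = gd_step(w_gd, X, y, lr=0.5)

for _ in range(n_passes * N):      # 20 * N SGD updates = same number of per-example gradients
    w_sgd = sgd_step(w_sgd, X, y, lr=0.1, rng=rng)

print("initial loss:", initial_loss)
print("GD loss after 20 passes: ", loss(w_gd, X, y))
print("SGD loss after 20 passes:", loss(w_sgd, X, y))
```

Both loops evaluate the same number of per-example gradients, but SGD applies `N` times as many parameter updates within that budget.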




## Part 3: Apply the model to your own text

In [5]:

from classifier import load_data, tokenize, feature_extractor, classifier_agent, tfidf_extractor, custom_feature_extractor
from collections import Counter
from nltk.tokenize import word_tokenize     
import numpy as np
import random
# Helper functions for loading the data and training a classifier

def load_and_process_data(pos_file, neg_file):
    sentences_pos = load_data(pos_file)
    sentences_neg = load_data(neg_file)
    sentences = sentences_pos + sentences_neg
    labels = [1]*len(sentences_pos) + [0]*len(sentences_neg)
    combined = list(zip(sentences, labels))
    random.shuffle(combined)
    sentences, labels = zip(*combined)
    return list(sentences), list(labels)

def train_model(feat_map, train_sentences, train_labels, d, niter, lr):
    params = np.random.randn(d) * 0.01
    classifier = classifier_agent(feat_map, params)
    classifier.train_gd(train_sentences, train_labels, niter, lr)
    return classifier

In [7]:
print("Creating a classifier agent:")

with open('data/vocab.txt') as file:
    vocab_list = [item.strip() for item in file.readlines()]

print("Loading and processing data ...")
train_sentences, train_labels = load_and_process_data("data/training_pos.txt", "data/training_neg.txt")
test_sentences, test_labels = load_and_process_data("data/test_pos_public.txt", "data/test_neg_public.txt")

print("Creating bag of words feature extractor ...")
feat_map = feature_extractor(vocab_list, tokenize)

print("Training using Bag of words and GD for 1000 iterations.")
classifier1 = train_model(feat_map, train_sentences, train_labels, len(vocab_list), 1000, 0.02)

print("Training using Bag of words and SGD for 1000 data passes.")
# NOTE: train_model always calls train_gd, so as written this run uses full-batch GD, not SGD.
classifier2 = train_model(feat_map, train_sentences, train_labels, len(vocab_list), 1000, 0.02)

print("Creating Tfidf feature extractor ...")
all_text = ' '.join(train_sentences)
all_words = word_tokenize(all_text)
word_freq = Counter(all_words)
feat_map_extractor = tfidf_extractor(vocab_list, tokenize, word_freq)

print("Training using Tfidf and GD for 10 iterations.")
classifier3 = train_model(feat_map_extractor.tfidf_feature, train_sentences, train_labels, len(vocab_list), 10, 0.02)

print("Creating custom feature extractor ...")
custom_feat_map_extractor = custom_feature_extractor(vocab_list, tokenize, word_freq)

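print("Training using custom features and GD for 10 iterations.")
# NOTE: this passes feat_map_extractor.tfidf_feature, so classifier4 is trained on the TF-IDF
# features; custom_feat_map_extractor created above is never used.
classifier4 = train_model(feat_map_extractor.tfidf_feature, train_sentences, train_labels, len(vocab_list), 10, 0.02)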

err1 = classifier1.eval_model(test_sentences,test_labels)
err2 = classifier2.eval_model(test_sentences,test_labels)
err3 = classifier3.eval_model(test_sentences,test_labels)
err4 = classifier4.eval_model(test_sentences,test_labels)

print('Bag of words + GD: test err = ', err1,
        'Bag of words + SGD: test err = ', err2,
        'Tfidf + GD: test err = ', err3,
        'Custom features + GD: test err = ', err4)

Creating a classifier agent:
Loading and processing data ...
Creating bag of words feature extractor ...
Training using Bag of words and GD for 500 iterations.


Training progress: 100%|██████████| 1000/1000 [00:26<00:00, 38.37it/s]


Training using Bag of words and SGD for 10 data passes.


Training progress: 100%|██████████| 1000/1000 [00:26<00:00, 37.47it/s]


Creating Tfidf feature extractor ...
Training using Tfidf and GD for 10 iterations.


Training progress: 100%|██████████| 10/10 [00:00<00:00, 37.80it/s]


Creating custom feature extractor ...
Training using custom features and GD for 10 iterations.


Training progress: 100%|██████████| 10/10 [00:00<00:00, 38.66it/s]


Bag of words + GD: test err =  0.5 Bag of words + SGD: test err =  0.5 Tfidf + GD: test err =  0.5 Custom features + GD: test err =  0.5


In [8]:
# Try it out!

my_sentence = "This movie is amazing! Truly a masterpiece."

my_sentence2 = "The book is really, really good. The movie is just dreadful."

ypred = classifier3.predict(my_sentence,RAW_TEXT=True)

ypred2 = classifier3.predict(my_sentence2,RAW_TEXT=True)

print(ypred,ypred2)


[1] [1]


### We can also try predicting for each word in the input so as to get a sense of how the classifier arrived at the prediction

In [9]:
import pandas as pd

# Set the text color of DataFrame cells by score:
# blue for positive, red for negative, black otherwise
def color_predictions(val):
    eps = 0.02
    if isinstance(val,float):
        if val > eps:
            color = 'blue'
        elif val < -eps:
            color = 'red'
        else:
            color = 'black'
    else:
        color='black'
    return 'color: %s' % color

my_sentence_list = tokenize(my_sentence2)
ypred_per_word = classifier3.predict(my_sentence_list,RAW_TEXT=True,RETURN_SCORE=True)

df = pd.DataFrame([my_sentence_list,ypred_per_word])

df.style.applymap(color_predictions)



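|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the | book | is | really | really | good | the | movie | is | just | dreadful |
| 1 | 0.126681 | -0.005592 | 0.054458 | 0.008770 | 0.008770 | 0.006143 | 0.126681 | 0.008122 | 0.054458 | -0.003685 | -0.001496 |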


### Answer the questions: 
1. Do the above results make intuitive sense, and why?
2. What are some limitations of a linear classifier with BoW features?
3. What are some ideas you can come up with to overcome these limitations (i.e., what are your ideas for constructing informative features)?

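Answer 1. Results improved as the feature representation improved. The per-word scores above are only partly intuitive, though: function words such as "the" (0.127) and "is" (0.054) receive the largest scores, while "good" (0.006) and "dreadful" (-0.001) are close to zero.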

Answer 2. Bag-of-Words (BoW) is a simple and effective way to represent text data for machine learning, but it has several limitations:
   - **Lack of order:** BoW does not consider the order of words, which can be important for understanding the meaning of a sentence.
   - **Semantic meaning:** BoW treats each word as a separate feature and does not capture the semantic meaning of words.
   - **High dimensionality:** If the vocabulary is large, the BoW representation can be very high-dimensional, which can lead to computational challenges and overfitting.
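
The lack-of-order limitation can be seen in a small sketch: two sentences with opposite meanings produce identical word counts. The sentences and variable names here are made up for illustration.

```python
from collections import Counter

# Two sentences that swap which item is praised and which is criticized.
a = "the book is good but the movie is dreadful".split()
b = "the movie is good but the book is dreadful".split()

# A bag-of-words representation only keeps word counts, so both sentences
# map to exactly the same feature vector.
same_bow = Counter(a) == Counter(b)
print(same_bow)  # True
```

Any classifier that only sees these counts must assign both sentences the same score.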

Answer 3. To overcome these limitations, we can consider the following strategies:
   - **N-grams:** Instead of using individual words, we can use n-grams (sequences of n words) to capture some of the local order of words.
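   - **Word embeddings:** Word embeddings (like Word2Vec or GloVe) can capture the semantic meaning of words by representing them as dense vectors in a continuous space whose dimension is much smaller than the vocabulary size, so that related words end up close together.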
   - **Dimensionality reduction:** Techniques like PCA or LSA can be used to reduce the dimensionality of the BoW representation.
   - **TF-IDF weighting:** Instead of using raw counts, we can use TF-IDF weighting to give more importance to words that are more informative.
   - **Feature selection:** Techniques like mutual information or chi-squared test can be used to select the most informative words to include in the BoW representation.

## Part 4: Document what you did for custom feature extractors 

What did you try? What accuracy did you get? What worked better, what did not, and why?

In our experiment, we implemented a custom feature extractor using n-grams. N-grams are contiguous sequences of n items from a given sample of text or speech. In the context of text classification, these 'items' are typically words, although character-level n-grams can also be used.

The n-gram approach is a generalization of the Bag-of-Words (BoW) model. While BoW treats each word independently, the n-gram model considers the context of each word by including the (n-1) words that immediately precede it. For example, in a bigram model (n=2), the sentence "The cat sat on the mat" would be represented as ["The cat", "cat sat", "sat on", "on the", "the mat"].
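
The bigram example above can be reproduced with a short sketch. The `ngrams` helper is a hypothetical illustration and is not the `custom_feature_extractor` from the homework's `classifier` module; whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['The cat', 'cat sat', 'sat on', 'on the', 'the mat']

# Unigram and bigram counts can be combined into one sparse feature dictionary.
features = Counter(ngrams(tokens, 1)) + Counter(bigrams)
```

Keeping the unigrams alongside the bigrams means the feature set still contains everything BoW has, plus the local word-order information.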

We found that using n-grams as features improved the accuracy of our model compared to using BoW features. This is likely because the n-gram model captures more of the structure of the text. For example, a bigram feature for "not good" can carry its own negative weight, whereas the BoW model scores "not" and "good" independently and cannot tell that "not" negates "good".

However, the n-gram model did not perform as well as the TF-IDF model. TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. Unlike BoW and n-grams, TF-IDF takes into account not just the frequency of a word in a single document (or in the case of n-grams, a sequence of words), but also its frequency across all documents. This can give more weight to words that are more informative, improving the accuracy of the model.
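
To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant (term frequency normalized by document length, and $\text{idf} = \log(N / \text{df})$). The homework's `tfidf_extractor` may use a different formula; the `tfidf` function and the toy documents below are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf  = count of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the movie is good".split(),
    "the movie is dreadful".split(),
    "the book is good".split(),
]
weights = tfidf(docs)
print(weights[1])
```

In this toy corpus "the" and "is" appear in every document, so their weight is zero, while "dreadful" appears in only one document and receives the largest weight in the second document.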

In conclusion, while the n-gram model was an improvement over BoW, it did not perform as well as TF-IDF. This suggests that, for this task, capturing the relative importance of different words (as TF-IDF does) is more important than capturing the structure of the text (as n-grams do). However, the best choice of feature extractor may depend on the specific task and dataset.

## Part 5:  Anything else you'd like to write about. Your instructor / TA will read them.

You may answer for instance:

- What have you learned from the experience of working on this coding project?

- Do you think it is easy / hard? If you find it to be hard, what is the main missing piece that you think the instructor / TA should cover in the lectures / discussion sections.