# EAI6000 2023 Spring A Week 6 Assignment

## Section One - Conceptual Understanding

### <div class="alert alert-info">[GRADED  TASK 1.1]</div>
Please compare various language models (__RNN__, __LSTM__, __GRU__, __Transformer__, __BERT__, and __GPTs__) based on your own understanding.

__RNNs__ are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, or the spoken word. Their main advantage is an internal state (memory) that is updated at every step, which lets them carry information from earlier inputs forward. In practice, plain RNNs struggle to retain information over long sequences because gradients tend to vanish or explode during training.

__LSTMs__ are an extension of RNNs developed to combat the vanishing gradient problem. They have a more complex structure, with an internal cell state and three "gate" structures: the input gate, forget gate, and output gate. The gates decide what to write to, keep in, and read from the cell state, so information can be maintained or discarded over long periods and the model can learn longer dependencies.
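
The gate mechanism can be illustrated with a minimal sketch of a single LSTM time step. It assumes scalar inputs and states and hand-picked weights (the `weights` dictionary and its values are illustrative, not trained parameters), so it shows only the arithmetic of the gates, not a usable model.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One LSTM time step for scalar input and state.

    weights maps a gate name to (w_x, w_h, bias); all values are illustrative.
    """
    def pre_activation(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new information to write
    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value to write
    c = f * c_prev + i * g                      # updated cell state
    h = o * math.tanh(c)                        # updated hidden state
    return h, c


# Hand-picked weights: forget gate almost fully open, input gate almost closed,
# so the cell state should be carried through nearly unchanged.
weights = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8] * 7:  # 21 time steps of arbitrary input
    h, c = lstm_step(x, h, c, weights)
print("cell state after 21 steps:", round(c, 4))
```

With these settings the cell state stays close to its initial value of 1.0, which is the behavior that lets an LSTM hold on to information across many steps.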

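__GRUs__ are a simplification of LSTMs that combine the forget and input gates into a single "update gate" and add a "reset gate" that controls how much of the previous state is used when computing the new candidate state. They also merge the cell state and hidden state, so a GRU has fewer parameters than an LSTM with the same hidden size.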

The __Transformer__ model, introduced in the "Attention is All You Need" paper, discards recurrence and instead uses self-attention mechanisms that directly model relationships between all words in a sentence, no matter how far apart. It's highly parallelizable (leading to faster training times) and capable of capturing long-range dependencies.
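
As a rough illustration of self-attention, the sketch below computes scaled dot-product attention in plain Python. It assumes the token vectors are used directly as queries, keys, and values; a real Transformer first applies learned projection matrices and uses several attention heads, which are omitted here. The toy vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors
        outputs.append([
            sum(wj * v[d] for wj, v in zip(w, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights


# Three toy token vectors (illustrative values only)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token receives a weight for every other token in one step, regardless of distance, and each row of weights can be computed independently of the others, which is what makes the computation easy to parallelize.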

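__BERT__ is based on the encoder part of the Transformer. Unlike many earlier models, it is bidirectional, meaning that it uses both the preceding and the following context to represent a word in a sentence. It is pre-trained with a masked language modeling objective: some tokens are hidden, and the model learns to predict them from the context on both sides.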

__GPT__, like BERT, is based on the Transformer, but it is unidirectional: each position uses only the preceding context. Its key idea is generative pre-training on unlabeled text, where the model learns to predict the next token, followed by fine-tuning for specific tasks. The original GPT is comparable in size to BERT-base, while later versions such as GPT-2 and GPT-3 have far more parameters than BERT.
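
The difference between bidirectional and unidirectional context can be sketched as a visibility pattern. The helper `visible_positions` below is a hypothetical name introduced for illustration; it only lists which positions each token may attend to and does not reproduce how either model is actually trained.

```python
def visible_positions(seq_len: int, causal: bool) -> list[list[int]]:
    """For each position i, list the positions j it may attend to."""
    return [
        [j for j in range(seq_len) if not causal or j <= i]
        for i in range(seq_len)
    ]


tokens = ["the", "movie", "was", "great"]
bert_style = visible_positions(len(tokens), causal=False)  # both directions
gpt_style = visible_positions(len(tokens), causal=True)    # preceding context only

for i, token in enumerate(tokens):
    print(
        f"{token!r}: bidirectional sees {[tokens[j] for j in bert_style[i]]}, "
        f"causal sees {[tokens[j] for j in gpt_style[i]]}"
    )
```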

### <div class="alert alert-info">[GRADED  TASK 1.2]</div>

Please discuss different ways (at least three methods) to encode natural language, based on your own understanding.

1. __One-Hot Encoding__:

This is the simplest and most straightforward method. In one-hot encoding, each word in the vocabulary is represented by a vector in n-dimensional space, where n is the size of the vocabulary. This vector is filled with 0s, except for a single 1 at the index corresponding to the word's position in the vocabulary.

2. __Bag of Words (BoW)__:

The Bag of Words model represents text as an 'unordered bag' or 'multiset' of its words, disregarding grammar and even word order but keeping track of frequency. Each document is represented as a vector in an n-dimensional space, where n is the size of the vocabulary. The value in each position in the vector corresponds to the frequency of that word in the document.

3. __Word Embeddings__:

Word embeddings are a type of word representation that allows words with similar meanings to have similar representations. These are dense vector representations, as opposed to sparse representations like one-hot encoding or BoW. Two popular methods for generating word embeddings are Word2Vec and GloVe.
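
The first two encodings can be sketched in a few lines of plain Python. The two example documents and the helper names `one_hot` and `bag_of_words` are made up for illustration; word embeddings are not shown because they are learned from a large corpus rather than computed directly from the vocabulary.

```python
from collections import Counter

# Two made-up documents standing in for a real corpus
docs = [
    "the movie was great",
    "the plot was dull and the acting was dull",
]

# Vocabulary: every distinct word, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def bag_of_words(doc):
    """Vector of vocabulary size holding each word's count in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


print(vocab)
print(one_hot("great"))       # → [0, 0, 0, 1, 0, 0, 0, 0]
print(bag_of_words(docs[1]))  # → [1, 1, 2, 0, 0, 1, 2, 2]
```

Both vectors have one dimension per vocabulary word, which is why they become very sparse for realistic vocabularies, and the bag-of-words vector gives the same result for any reordering of the words in the document.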


## Section Two - Sentiment Analysis

Use the attached IMDB Dataset.csv file to run sentiment analysis

### <div class="alert alert-info">[GRADED  TASK 2.1]</div>
* Split the data into training and test sets using the `train_test_split` function so that the training set size is 75% of the whole data (set argument `random_state=2023` to make the result deterministic, and make sure the data is split in a stratified fashion)

* Report and interpret the result (accuracy score) on the test set

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Load the data set
data = pd.read_csv('IMDB Dataset.csv')

X = data['review']  # Text data
y = data['sentiment']  # Target labels

# Stratified 75/25 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2023, stratify=y)

# Convert the reviews to TF-IDF features; fit the vectorizer on the training split only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train a linear support vector classifier on the TF-IDF features
classifier = LinearSVC()
classifier.fit(X_train_vec, y_train)

# Predict on the held-out test split and report the accuracy
y_pred = classifier.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.89904


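The model reaches an accuracy of 0.89904 on the test set, meaning it predicts the correct sentiment for about 89.9% of the held-out reviews. Because the split is stratified, the class proportions in the test set match those of the full data, so the score is not distorted by a shifted class balance. For a linear classifier on TF-IDF features with default settings and no tuning, this is a relatively high accuracy.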

### <div class="alert alert-info">[GRADED  TASK 2.2]</div>
* Try to add cross-validation using the `RepeatedKFold` function with 5 splits, 10 repeats, and 2023 as the random state.
* Report the result on both the training and test sets with the average and the standard deviation of the accuracy score
* Please explain whether the model is overfitting or underfitting the training data


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, RepeatedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('IMDB Dataset.csv')

X = data['review']  # Text data
y = data['sentiment']  # Target labels

# Stratified 75/25 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2023, stratify=y)

# Convert the reviews to TF-IDF features; fit the vectorizer on the training split only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

classifier = LinearSVC()
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2023)

# Training scores: fit on each CV training fold and score on that same fold
train_scores = []
for train_index, _ in cv.split(X_train_vec):
    X_train_cv = X_train_vec[train_index]
    y_train_cv = y_train.iloc[train_index]
    classifier.fit(X_train_cv, y_train_cv)
    y_pred_train = classifier.predict(X_train_cv)
    train_accuracy = accuracy_score(y_train_cv, y_pred_train)
    train_scores.append(train_accuracy)

# Test scores: repeat the CV on the test split. Note that each model here is
# refit on four folds of the test split and scored on the remaining fold.
test_scores = []
for train_index, val_index in cv.split(X_test_vec):
    X_train_cv, X_val = X_test_vec[train_index], X_test_vec[val_index]
    y_train_cv, y_val = y_test.iloc[train_index], y_test.iloc[val_index]
    classifier.fit(X_train_cv, y_train_cv)
    y_pred_test = classifier.predict(X_val)
    test_accuracy = accuracy_score(y_val, y_pred_test)
    test_scores.append(test_accuracy)

# Mean and (sample) standard deviation over the 50 scores of each kind
print("Training Set - Mean Accuracy:", round(sum(train_scores) / len(train_scores), 4))
print("Training Set - Standard Deviation:", round(pd.Series(train_scores).std(), 4))

print("Test Set - Mean Accuracy:", round(sum(test_scores) / len(test_scores), 4))
print("Test Set - Standard Deviation:", round(pd.Series(test_scores).std(), 4))


Training Set - Mean Accuracy: 0.9893
Training Set - Standard Deviation: 0.0004
Test Set - Mean Accuracy: 0.8756
Test Set - Standard Deviation: 0.0056


The model reaches a mean accuracy of 0.9893 on the training folds with a very low standard deviation of 0.0004. It fits the data it was trained on almost perfectly, and it does so consistently across the 50 fits (5 splits times 10 repeats).

On held-out data the mean accuracy drops to 0.8756, with a standard deviation of 0.0056. A gap of about 0.11 between training and held-out accuracy indicates that the model is __overfitting__ the training data. It is not underfitting, since underfitting would show up as low accuracy on the training data as well. One caveat: the test-side scores come from models refit on folds of the test split, which is smaller than the training split, so part of the gap may reflect less training data. The single-split result from Task 2.1 (0.89904 for a model trained on the full training split) is still about 0.09 below the training accuracy, which supports the same conclusion. The larger standard deviation on the test side is expected, because each of those scores is computed on a much smaller fold.
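
A more conventional way to obtain training and validation scores per fold is scikit-learn's `cross_validate` with `return_train_score=True`. The sketch below is not the code that produced the numbers above; it assumes a tiny made-up corpus (`positive`, `negative`, `reviews`, and `labels` are illustrative names) in place of the IMDB reviews, so its scores say nothing about the real data. Placing the vectorizer inside a pipeline means TF-IDF is refit on each training fold, so the validation fold never influences the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up mini corpus standing in for the IMDB reviews (illustration only)
positive = [
    "great movie with a wonderful cast", "loved the story and the acting",
    "a brilliant and moving film", "wonderful direction and a great script",
    "the acting was superb and touching", "an excellent and enjoyable movie",
    "loved every minute of this film", "a great story told brilliantly",
    "superb cast and an excellent plot", "enjoyable film with a moving ending",
]
negative = [
    "terrible movie with a boring plot", "hated the story and the acting",
    "a dull and pointless film", "awful direction and a weak script",
    "the acting was wooden and boring", "a terrible and forgettable movie",
    "hated every minute of this film", "a weak story told badly",
    "awful cast and a pointless plot", "forgettable film with a dull ending",
]
reviews = positive + negative
labels = ["positive"] * len(positive) + ["negative"] * len(negative)

# Vectorizer and classifier are fit together on each training fold
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2023)

results = cross_validate(
    pipeline, reviews, labels,
    cv=cv, scoring="accuracy", return_train_score=True,
)

# NumPy's std() is the population standard deviation (ddof=0)
print("Train      - mean:", round(results["train_score"].mean(), 4),
      "std:", round(results["train_score"].std(), 4))
print("Validation - mean:", round(results["test_score"].mean(), 4),
      "std:", round(results["test_score"].std(), 4))
```

On the real data, the same pattern would be applied to the training split, and the untouched test split would then be used once for a final evaluation. A large gap between the two means would again point to overfitting.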