# Natural Language Project - Finding the Genre of Movie Plots - Group 11

## Project Setup
If you're running this project in **Google Colab**, make sure to execute the following commands to properly configure the environment.
(These steps are not required if you're running the project locally on your machine.)

In [1]:
!git clone https://github.com/rspecker/NLP.git
%cd NLP

Cloning into 'NLP'...
remote: Enumerating objects: 279, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 279 (delta 8), reused 12 (delta 5), pack-reused 258 (from 1)
Receiving objects: 100% (279/279), 16.95 MiB | 16.29 MiB/s, done.
Resolving deltas: 100% (159/159), done.
/content/NLP


In [12]:
# !git pull

remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 2.61 KiB | 669.00 KiB/s, done.
From https://github.com/rspecker/NLP
   fd31546..23ab11c  main       -> origin/main
Updating fd31546..23ab11c
Fast-forward
 reviews.ipynb | 134 +++++++++++++++++++++++++++++++++++++++---------------------------------------
 1 file changed, 67 insertions(+), 67 deletions(-)


In [2]:
# !pip install -r requirements.txt

In [2]:
! pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.2.0-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.2.0-py3-none-any.whl (255 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.2.0


In [13]:
# import project dependencies
import pandas as pd
from modelling.grid import classifiers
from modelling.grid_search import perform_grid_search
from modelling.evaluation import (
    evaluate_model,
    save_confusion_matrix,
    save_results_to_file,
    save_test_data_with_predictions,
)
from preprocessing.embeddings import create_sentence_embeddings
from preprocessing.preproc import create_preprocesssed_dataset
from preprocessing.tfidf import create_train_data_tfidf
from utils import create_train_test_sets
from modelling.information_ret import score
# NOTE: apply_model_to_unlabeled_data is called in the training loops below but
# is not imported here; import it from the module that defines it before running them.

## Data Import

In [14]:
# Import data and set column names
df = pd.read_table(
    'train.txt',
    names=['title', 'from', 'genre', 'director', 'plot']
    )

In [15]:
# Import data without labels and set column names
df_no_labels = pd.read_table(
    'test_no_labels.txt',
    names=['title', 'from', 'director', 'plot']
    )

## Data Pre-Processing

In [5]:
# Split data into training and testing sets
x_train, x_test, y_train, y_test = create_train_test_sets(
    df, test_size=0.2, random_state=0, y_column='genre'
)

### TF-IDF
TF-IDF converts text into sparse vectors that weight each word by its importance across documents, ignoring word order and context.
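
To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF. It is not the project's `create_train_data_tfidf` implementation; the function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors), with one sparse {term: weight} dict per document."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)  # term frequency within this document
            # Smoothed inverse document frequency: rarer terms get larger weights.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = tf * idf
        vectors.append(vector)  # only non-zero entries are stored (sparse)
    return vocabulary, vectors


plots = ["a detective hunts a killer", "a robot hunts humans"]
vocabulary, vectors = tfidf_vectors(plots)
```

In this toy corpus, "detective" occurs in only one plot and therefore receives a higher weight than "hunts", which occurs in both. Because each vector is built from counts alone, shuffling the words of a plot leaves its vector unchanged.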

In [6]:
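# Create the TF-IDF features for the training data (no vectorizer passed in)
vectorizer, x_train_tfidf = create_train_data_tfidf(x_train)

In [7]:
# Reuse the vectorizer from the training data for the test split
vectorizer, x_test_tfidf = create_train_data_tfidf(x_test, vectorizer)

In [None]:
# Reuse the same vectorizer for the unlabelled data
vectorizer, no_labels_tfidf = create_train_data_tfidf(df_no_labels, vectorizer)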

### Feature-based Transfer Learning with Sentence Embeddings (Sentence-BERT)
Sentence-BERT generates dense, fixed-length embeddings that capture the semantic meaning of entire sentences, enabling more context-aware comparisons.
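
The "fixed-length" property can be illustrated without downloading a model. The sketch below assumes mean pooling over per-token vectors, which is a common Sentence-BERT pooling strategy, and compares the pooled vectors with cosine similarity. The helper names and the three-dimensional token vectors are invented for illustration; real embeddings come from the model loaded by `create_sentence_embeddings`.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single fixed-length sentence vector."""
    dim = len(token_vectors[0])
    n_tokens = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token vectors (made up): two plots with similar content but different
# lengths, and one unrelated plot.
short_plot = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
long_plot = [[0.8, 0.2, 0.1], [0.9, 0.0, 0.1], [0.6, 0.4, 0.0], [0.7, 0.2, 0.0]]
other_plot = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

short_vec = mean_pool(short_plot)
long_vec = mean_pool(long_plot)
other_vec = mean_pool(other_plot)

similar_score = cosine_similarity(short_vec, long_vec)
different_score = cosine_similarity(short_vec, other_vec)
```

Both `short_vec` and `long_vec` have the same dimensionality even though the plots differ in token count, which is what lets a downstream classifier consume them directly.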

In [9]:
# create sentence embeddings for training data
x_train_sentence_embeddings = create_sentence_embeddings(
    sentences=x_train["plot"].to_list(),
    model="all-MiniLM-L6-v2")


In [10]:
# create sentence embeddings for testing data
x_test_sentence_embeddings = create_sentence_embeddings(
    sentences=x_test["plot"].to_list(),
    model="all-MiniLM-L6-v2")

In [None]:
# create sentence embeddings for unlabelled data
no_labels_sentence_embeddings = create_sentence_embeddings(
    sentences=df_no_labels["plot"].to_list(),
    model="all-MiniLM-L6-v2")

## Training

### TF-IDF

In [8]:
del classifiers["MultinomialNB"]

In [11]:
for model_name, (model, param_grid) in classifiers.items():
    model_type = "word_embeddings"
    # Perform GridSearchCV
    best_model, best_params, best_score = perform_grid_search(
        model, param_grid, model_type, model_name,
        x_train_tfidf, y_train)

    # Evaluate the model
    y_pred, test_accuracy, classification_rep, cm = evaluate_model(
        best_model, x_test_tfidf, y_test)

    # Save the results
    save_results_to_file(
        model_type, model_name, best_params, best_score, test_accuracy, classification_rep)

    save_test_data_with_predictions(x_test, y_test, y_pred, model_type, model_name)

    # Save confusion matrix as image
    save_confusion_matrix(model_type, model_name, cm, y_test)

    # Apply model to unlabeled data
    apply_model_to_unlabeled_data(best_model, no_labels_tfidf, model_type, model_name)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


KeyboardInterrupt: 

### Feature-based Transfer Learning with Sentence Embeddings (Sentence-BERT)

In [None]:
# del classifiers["MultinomialNB"]

In [None]:
for model_name, (model, param_grid) in classifiers.items():
    model_type = "sentence_embeddings"
    # Perform GridSearchCV
    best_model, best_params, best_score = perform_grid_search(
        model, param_grid, model_type, model_name,
        x_train_sentence_embeddings, y_train)

    # Evaluate the model (y_pred is needed below when saving the predictions)
    y_pred, test_accuracy, classification_rep, cm = evaluate_model(
        best_model, x_test_sentence_embeddings, y_test)

    # Save the results
    save_results_to_file(
        model_type, model_name, best_params, best_score, test_accuracy, classification_rep)

    save_test_data_with_predictions(x_test, y_test, y_pred, model_type, model_name)

    # Save confusion matrix as image
    save_confusion_matrix(model_type, model_name, cm, y_test)

    # Apply model to unlabeled data
    apply_model_to_unlabeled_data(best_model, no_labels_sentence_embeddings, model_type, model_name)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 324 candidates, totalling 1620 fits
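
The "Fitting 5 folds for each of N candidates" lines follow from simple arithmetic: each candidate is one combination of hyperparameter values, and each candidate is fitted once per cross-validation fold. The sketch below reproduces that count for a single dictionary-style grid. The helper `count_fits` and the grid contents are hypothetical; the values were picked only so that the sizes match the first log line, and they are not the grids defined in `modelling.grid`.

```python
from itertools import product


def count_fits(param_grid, n_folds):
    """Return (number of candidates, total fits) for one dict-style parameter grid."""
    # Every candidate is one element of the Cartesian product of the value lists.
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds


# Hypothetical grid: 4 * 2 * 3 = 24 candidates.
example_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "max_iter": [100, 500, 1000],
}

n_candidates, n_fits = count_fits(example_grid, n_folds=5)
print(f"Fitting 5 folds for each of {n_candidates} candidates, totalling {n_fits} fits")
```

The same arithmetic explains the second log line: 324 candidates across 5 folds give 1620 fits, so the larger grid takes roughly 13.5 times as many fits as the smaller one.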
