# **Movie Genre Predictor using ANN**

**Name:** **Sachin Singh**  
**Roll Number:** **2023BCS0064**  
**Course:** **CSE 311 Artificial Intelligence**  


### **Installing Required Dependencies**

This step installs all the essential libraries needed for the project:

- **kaggle** → for downloading the dataset directly from Kaggle  
- **scikit-learn** → preprocessing, metrics, and utilities  
- **tensorflow** → building and training the ANN model  

In [None]:
!pip install -q kaggle scikit-learn tensorflow

### **Importing Required Libraries**

In this step, we load all essential Python libraries needed for building the movie genre prediction model.  
Key modules include:

- `os` and `ast` — for file handling and parsing JSON-like fields  
- `numpy` and `pandas` — for numerical operations and structured data processing  
- `MultiLabelBinarizer` and `StandardScaler` from `sklearn` — for label encoding and feature scaling  
- `train_test_split` — to divide the dataset into training, validation, and testing sets  
- `TfidfVectorizer` — for converting text descriptions into numerical features  
- `classification_report`, `f1_score`, `precision_score`, `recall_score` — for evaluating multi-label performance  
- `Sequential`, `Dense`, `Dropout`, `BatchNormalization` — core components of our ANN model from `tensorflow.keras`  
- `EarlyStopping` and `ReduceLROnPlateau` — callbacks to prevent overfitting and improve training stability  
- `google.colab.files` — for uploading files like `kaggle.json` when running on Google Colab  
- `warnings` — to suppress unwanted warnings for a cleaner notebook


In [None]:
import os
import ast
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from google.colab import files
import warnings
warnings.filterwarnings('ignore')

### **Kaggle Authentication**

To enable direct dataset downloads from Kaggle, we authenticate using Kaggle API credentials.  
These are passed to the Kaggle CLI through environment variables:

- `KAGGLE_USERNAME` — your Kaggle account username  
- `KAGGLE_KEY` — your Kaggle API key (must be kept private)

**Note:** Never share your API key publicly or commit it to version control.  
Regenerate the key from your Kaggle account if it is ever exposed.
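
One way to keep the key out of the notebook itself is to prompt for it at run time. The following is a minimal sketch of that idea, not part of the original workflow: `set_kaggle_credentials` and its `read_secret` parameter are hypothetical names, while `getpass.getpass` is the standard-library prompt that hides typed input.

```python
import getpass
import os


def set_kaggle_credentials(username, read_secret=getpass.getpass):
    """Store Kaggle credentials in environment variables without hardcoding the key.

    `read_secret` is injectable so the function can be exercised without a prompt.
    """
    key = read_secret("Kaggle API key: ").strip()
    if not key:
        raise ValueError("Kaggle API key must not be empty")
    os.environ["KAGGLE_USERNAME"] = username
    os.environ["KAGGLE_KEY"] = key


# Interactive use in a notebook (prompts for the key, input is not echoed):
# set_kaggle_credentials("your_kaggle_username")
```

With this approach the key never appears in a saved cell or its output.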


In [None]:
os.environ['KAGGLE_USERNAME'] = "sachinsingh070"
os.environ['KAGGLE_KEY'] = "your_kaggle_key_here"


### **Downloading the Dataset from Kaggle**

This step downloads the required movie dataset directly from Kaggle using the API.  
The dataset is stored inside the `movies_dataset/` directory and automatically unzipped.  
Once downloaded, all necessary CSV files become available for preprocessing, including the metadata, credits, and keywords files.


In [None]:
dataset = 'rounakbanik/the-movies-dataset'
!kaggle datasets download -d {dataset} -p movies_dataset --unzip

Dataset URL: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
License(s): CC0-1.0
Downloading the-movies-dataset.zip to movies_dataset
 58% 131M/228M [00:00<00:00, 1.36GB/s]
100% 228M/228M [00:00<00:00, 717MB/s] 


### **Loading the Movie Metadata**

In this step, the primary dataset file **`movies_metadata.csv`** is loaded into a pandas DataFrame.  
This file contains core information about each movie, including:

- basic identifiers (title, ID, release details)
- descriptive fields (overview, tagline)
- production-related attributes (budget, revenue, runtime)
- preliminary genre information in a structured format

After loading, the shape of the dataset is printed to confirm that the file was read successfully and to understand the initial size of the data before preprocessing.


In [None]:
movies = pd.read_csv('movies_dataset/movies_metadata.csv', low_memory=False)
print("Dataset shape:", movies.shape)

Dataset shape: (45466, 24)


### **Loading the Credits Information**

The `credits.csv` file is loaded at this stage to obtain additional details about each movie, specifically the **cast** and **crew** information.  
These fields are essential because they provide structured lists of actors and production members, which later help in enriching the feature set used for genre prediction.

Displaying the first few rows allows for a quick inspection of the data format and confirms that the file has been read correctly.


In [None]:
credits = pd.read_csv('movies_dataset/credits.csv')
credits.head()


| | cast | crew | id |
|---|---|---|---|
| 0 | `[{'cast_id': 14, 'character': 'Woody (voice)',...` | `[{'credit_id': '52fe4284c3a36847f8024f49', 'de...` | 862 |
| 1 | `[{'cast_id': 1, 'character': 'Alan Parrish', '...` | `[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...` | 8844 |
| 2 | `[{'cast_id': 2, 'character': 'Max Goldman', 'c...` | `[{'credit_id': '52fe466a9251416c75077a89', 'de...` | 15602 |
| 3 | `[{'cast_id': 1, 'character': "Savannah 'Vannah...` | `[{'credit_id': '52fe44779251416c91011acb', 'de...` | 31357 |
| 4 | `[{'cast_id': 1, 'character': 'George Banks', '...` | `[{'credit_id': '52fe44959251416c75039ed7', 'de...` | 11862 |


### **Preparing and Aligning the Core Dataset Files**

At this stage, the three essential components of the Movies Dataset are loaded:  
- the main `movies_metadata` file  
- the `credits` file containing cast and crew information  
- the `keywords` file containing descriptive tags  

All three files rely on a common movie identifier, but the `id` field in these files may appear in different formats, including non-numeric entries.  
To ensure consistent merging later, the `id` columns are converted into numeric values, with invalid entries coerced into `NaN`.  
Rows with missing or invalid IDs are then removed, since they cannot be reliably matched across files.  

This step lays the foundation for accurate merging of metadata, credits, and keyword information into a single unified dataset.
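
The effect of coercion can be seen on a tiny example. The sketch below uses hypothetical values (two valid ids and one malformed entry) rather than the real files; the final cast back to integers is an optional extra that the notebook does not perform.

```python
import pandas as pd

# Hypothetical ids: two valid values and one malformed entry (a date string).
raw = pd.DataFrame({"id": ["862", "8844", "1997-08-20"], "title": ["A", "B", "C"]})

raw["id"] = pd.to_numeric(raw["id"], errors="coerce")  # malformed value becomes NaN
clean = raw.dropna(subset=["id"])                      # drop rows with unusable ids
clean = clean.assign(id=clean["id"].astype("int64"))   # optional: restore integer dtype

print(clean)
```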

In [None]:
# Load all relevant files
movies = pd.read_csv("movies_dataset/movies_metadata.csv", low_memory=False)
credits = pd.read_csv("movies_dataset/credits.csv")
keywords_df = pd.read_csv("movies_dataset/keywords.csv")

# Convert IDs to numeric
movies["id"] = pd.to_numeric(movies["id"], errors="coerce")
credits["id"] = pd.to_numeric(credits["id"], errors="coerce")
keywords_df["id"] = pd.to_numeric(keywords_df["id"], errors="coerce")

# Drop bad IDs
movies = movies.dropna(subset=["id"])
credits = credits.dropna(subset=["id"])
keywords_df = keywords_df.dropna(subset=["id"])

### **Merging Metadata, Credits, and Keywords**

With all datasets cleaned and aligned by their numeric `id` fields, the next step combines them into a single unified DataFrame.  
First, the movie metadata is merged with the credits file, adding detailed **cast** and **crew** information for each movie.  
Next, the keywords file is merged in, contributing additional descriptive tags that can later be used as textual features.

Using a left join ensures that every movie from the main metadata file is preserved, even if corresponding credits or keywords are missing.  
Note that if an `id` occurs more than once in `credits` or `keywords`, the join repeats the matching movie row once per occurrence.  
The result is a consolidated dataset that brings together all relevant information required for feature extraction and model training.
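
How a left join treats repeated keys can be checked on miniature tables. This is an illustrative sketch with made-up data; deduplicating the right-hand table before merging is one possible safeguard and is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature tables; id 2 appears twice on the right-hand side.
meta = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
extra = pd.DataFrame({"id": [1, 2, 2], "cast": ["x", "y", "y"]})

merged = meta.merge(extra, on="id", how="left")
print(len(meta), "->", len(merged))   # the repeated id adds a row

# Deduplicating the right-hand table first keeps one row per movie.
merged_unique = meta.merge(extra.drop_duplicates(subset="id"), on="id", how="left")
print(len(merged_unique))
```

Movie 3 has no match and is kept with a missing `cast` value, which is the behavior the left join is chosen for.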


In [None]:
# Merge: metadata + credits (adds the cast and crew columns)
movies = movies.merge(credits, on="id", how="left")

# Merge: metadata + keywords
movies = movies.merge(keywords_df, on="id", how="left")

### **Extracting Structured Information from JSON-Like Fields**

Several columns in the merged dataset—such as `genres`, `cast`, and `keywords`—store information in a JSON-like string format.  
To make these fields usable for feature engineering, they must be converted into clean Python lists.

A helper function is defined to safely parse these entries and extract specific attributes, such as the `name` field from each JSON object.  
Using this function:

- `genres_list` is created to represent the list of genres assigned to each movie  
- `cast_list` extracts the actors associated with the movie, with the list restricted to the top five for consistency  
- `keywords_list` captures descriptive tags that may improve the model’s understanding of thematic elements  

Finally, movies that do not contain any valid genre information are removed, since they cannot serve as labeled examples during training.  
The resulting dataset is now fully structured and ready for the next stage of preprocessing and feature construction.
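
These fields use Python-style single quotes, which is why a literal parser is used rather than a JSON parser. The sketch below illustrates the difference on a shortened, hypothetical cell value.

```python
import ast
import json

# Shortened, hypothetical example of how a `genres` cell is stored as text.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False      # single quotes are not valid JSON

parsed = ast.literal_eval(raw)               # parses Python literals without executing code
names = [d["name"] for d in parsed if isinstance(d, dict)]
print(json_ok, names)
```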

In [None]:
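def parse_json_list(x, key_name='name'):
    """Parse a stringified list of dicts and return the `key_name` value of each dict."""
    try:
        lst = ast.literal_eval(x)
        return [item.get(key_name, '').strip() for item in lst if isinstance(item, dict)]
    except (ValueError, SyntaxError, TypeError, AttributeError):
        # Malformed or non-list entries are treated as empty
        return []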

# Parse genres
movies['genres_list'] = movies['genres'].fillna('[]').apply(parse_json_list)

# Parse cast (use top 5 actors)
movies['cast_list'] = movies['cast'].fillna('[]').apply(parse_json_list)
movies['cast_list'] = movies['cast_list'].apply(lambda x: x[:5])   # keep top 5 actors

# Parse keywords
movies['keywords_list'] = movies['keywords'].fillna('[]').apply(parse_json_list)

# Keep only movies with valid genres
movies = movies[movies['genres_list'].map(len) > 0].reset_index(drop=True)

print("FINAL SHAPE:", movies.shape)
print("COLUMNS:", movies.columns.tolist())
print("SAMPLE ROW:")
print(movies[['title','genres_list','cast_list','keywords_list']].head(5))

FINAL SHAPE: (44104, 30)
COLUMNS: ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'cast', 'crew', 'keywords', 'genres_list', 'cast_list', 'keywords_list']
SAMPLE ROW:
                         title                   genres_list  \
0                    Toy Story   [Animation, Comedy, Family]   
1                      Jumanji  [Adventure, Fantasy, Family]   
2             Grumpier Old Men             [Romance, Comedy]   
3            Waiting to Exhale      [Comedy, Drama, Romance]   
4  Father of the Bride Part II                      [Comedy]   

                                           cast_list  \
0  [Tom Hanks, Tim Allen, Don Rickles, Jim Varney...   
1  [Robin Williams, Jonathan Hyde, Kirsten Du

### **Constructing a Unified Text Feature**

To create a rich textual representation for each movie, multiple descriptive fields are combined into a single consolidated feature.  
This unified text includes:

- the movie's **title**  
- the **overview**, which provides a narrative description  
- the extracted **keywords**, representing thematic tags  
- the top five actors from the **cast**  

Each component is cleaned and joined into a continuous text string, ensuring that all available descriptive information is captured.  
This combined text feature serves as the input to the TF-IDF vectorizer in a later stage of the pipeline.


In [None]:
movies['text_combined'] = (
    movies['title'].fillna('') + " . " +
    movies['overview'].fillna('') + " . " +
    movies['keywords_list'].apply(lambda l: " ".join(l)) + " . " +
    movies['cast_list'].apply(lambda l: " ".join(l[:5]))
)

### **Preparing Target Labels and Auxiliary Numeric Features**

To train a multi-label classifier, the list of genres associated with each movie must be converted into a machine-readable format.  
This is accomplished using `MultiLabelBinarizer`, which transforms each movie’s genre list into a binary vector, where each position corresponds to a specific genre.  
The resulting matrix `Y` becomes the target output for the model, and the total number of genre classes is recorded for constructing the ANN output layer.

In addition to textual information, several numeric attributes are incorporated to enrich the feature set.  
Fields such as `popularity`, `vote_average`, `vote_count`, and `runtime` provide quantitative signals that may correlate with genre tendencies.  
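These values are converted to numeric types with invalid entries coerced to `NaN`, missing entries are filled with zeros, and the resulting features are standardized using `StandardScaler` so that they have comparable magnitudes during training.  
Note that the scaler is fitted on the full dataset here, before the train/validation/test split; fitting it on the training portion only would avoid leaking validation and test statistics into the features.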
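
A toy example makes the binary encoding concrete. The genre lists below are made up for illustration; `toy_mlb` and `toy_Y` are hypothetical names chosen to avoid clashing with the notebook's `mlb` and `Y`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_genres = [["Comedy", "Romance"], ["Drama"], ["Comedy"]]

toy_mlb = MultiLabelBinarizer()
toy_Y = toy_mlb.fit_transform(toy_genres)

print(list(toy_mlb.classes_))            # columns are the sorted genre names
print(toy_Y)                             # one row per movie, one column per genre
print(toy_mlb.inverse_transform(toy_Y))  # maps binary rows back to genre tuples
```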


In [None]:
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(movies['genres_list'])
n_classes = Y.shape[1]
print("Genres:", list(mlb.classes_))

# Numeric features
num_cols = ['popularity', 'vote_average', 'vote_count', 'runtime']
for c in num_cols:
    if c in movies.columns:
        movies[c] = pd.to_numeric(movies[c], errors='coerce').fillna(0.0)
    else:
        movies[c] = 0.0
num_features = StandardScaler().fit_transform(movies[num_cols].values)

Genres: ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']


### **Generating TF-IDF Text Features**

To convert the combined text descriptions into a structured numerical form, a TF-IDF (Term Frequency–Inverse Document Frequency) vectorizer is applied.  
This method assigns higher weights to terms that are informative yet not overly common across the dataset, making it well-suited for representing movie descriptions.

Using a vocabulary capped at 5,000 terms and standard English stop-word removal, the TF-IDF vectorizer transforms the unified text field into a sparse numerical matrix, which is then converted to a dense array with `.toarray()`.  
Each movie is thus represented by a vector capturing the relative importance of key words and phrases within its description.  
The resulting TF-IDF matrix forms the primary textual input for the neural network.


In [None]:
print("Computing TF-IDF features...")
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X_text = tfidf.fit_transform(movies['text_combined']).toarray()
print("TF-IDF shape:", X_text.shape)

Computing TF-IDF features...
TF-IDF shape: (44104, 5000)


### **Combining Textual and Numeric Features**

The TF-IDF matrix derived from the combined text fields captures which terms characterize each movie, while the standardized numeric attributes provide complementary quantitative signals.  
To construct a complete feature representation for each movie, these two components are horizontally concatenated into a single input matrix.

This unified feature set integrates both descriptive text-based information and structured numerical data, enabling the neural network to learn from multiple modalities simultaneously.  
The final shape of the input matrix confirms the total dimensionality that will be fed into the ANN model.
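
Because the matrix is dense, its memory footprint follows directly from its shape. The estimate below is a rough sketch that assumes the default `float64` dtype; casting to `float32` is shown only as a possible option, not something the notebook does.

```python
import numpy as np

n_rows, n_cols = 44104, 5004   # shape reported by the notebook output


def dense_size_gb(dtype):
    """Size in gigabytes (10**9 bytes) of a dense n_rows x n_cols array."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9


print(f"float64: {dense_size_gb(np.float64):.2f} GB")
print(f"float32: {dense_size_gb(np.float32):.2f} GB")
```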


In [None]:
X = np.hstack([X_text, num_features])
print("Final input shape:", X.shape)

Final input shape: (44104, 5004)


### **Splitting the Dataset into Training, Validation, and Test Sets**

To evaluate the model reliably and prevent overfitting, the dataset is divided into three distinct subsets:

- **Training set:** used to fit the neural network and learn underlying patterns  
- **Validation set:** used during training for tuning hyperparameters and monitoring model performance  
- **Test set:** held out completely to measure the final generalization ability of the model  

An initial split allocates 10% of the data for testing.  
The remaining 90% is then split again with `test_size=0.1111`, so that the validation set also represents approximately 10% of the overall dataset.  
Printing the shapes of these splits confirms that the data partitions are correctly sized and ready for model training.
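
The arithmetic behind the two-stage split can be reproduced by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples; under that assumption the sizes agree with the shapes printed by the notebook.

```python
import math

n_samples = 44104                 # rows remaining after genre filtering
test_frac, val_frac = 0.10, 0.1111

# Assumption: a float test_size is rounded up to a whole number of samples.
n_test = math.ceil(test_frac * n_samples)
n_trainval = n_samples - n_test
n_val = math.ceil(val_frac * n_trainval)
n_train = n_trainval - n_val

print(n_train, n_val, n_test)
print(round(n_val / n_samples, 4))   # validation share of the full dataset
```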


In [None]:
X_trainval, X_test, Y_trainval, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_trainval, Y_trainval, test_size=0.1111, random_state=42)
print("Shapes -> train:", X_train.shape, "val:", X_val.shape, "test:", X_test.shape)

Shapes -> train: (35283, 5004) val: (4410, 5004) test: (4411, 5004)


### **Addressing Genre Imbalance with Sample Weights**

Movie genres in this dataset are highly imbalanced: some genres occur frequently, while others appear only in a small number of films.  
Training a model directly on such data may cause it to favor common genres and ignore the rare ones.

To counter this, a weighting strategy is applied:

- The frequency of each genre in the training set is calculated.
- Each genre receives a **class weight** equal to the median genre frequency divided by its own frequency, giving more importance to underrepresented genres.
- These class weights are then converted into **sample weights**: each movie's weight is 1 plus the sum of the class weights of its genres.
- Finally, the sample weights are divided by their mean, so the average weight is 1 and the loss scale stays comparable to unweighted training.

This approach ensures the neural network pays sufficient attention to less frequent genres, improving the balance and fairness of the model’s predictions.
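
The weighting steps can be traced on a tiny label matrix. The data below is hypothetical, and the variable names are chosen for illustration; the formulas follow the description above.

```python
import numpy as np

# Hypothetical label matrix: genre 0 is on every movie, genre 1 on only one.
toy_Y = np.array([[1, 0],
                  [1, 0],
                  [1, 0],
                  [1, 1]])

freq = toy_Y.mean(axis=0)                         # per-genre frequency
median_freq = np.median(freq[freq > 0])
class_w = np.where(freq > 0, median_freq / np.where(freq > 0, freq, 1.0), 1.0)
sample_w = 1.0 + (toy_Y * class_w).sum(axis=1)    # rarer genres add more weight
sample_w /= sample_w.mean()                       # rescale so the mean weight is 1

print(class_w)
print(sample_w)
```

The only movie carrying the rare genre ends up with the largest weight, while the mean weight stays at 1.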


In [None]:
freq = Y_train.sum(axis=0)/Y_train.shape[0]
median_freq = np.median(freq[freq>0])
class_weight_arr = np.array([median_freq/f if f>0 else 1.0 for f in freq])
sample_weights = 1.0 + (Y_train * class_weight_arr).sum(axis=1)
sample_weights /= np.mean(sample_weights)

### **Building the Artificial Neural Network Model**

A fully connected Artificial Neural Network (ANN) is constructed to perform multi-label genre prediction.  
The architecture is designed to learn from the combined TF-IDF and numeric feature representation created earlier.

Key characteristics of the model:

- The input layer matches the dimensionality of the final feature set.
- Several dense layers with ReLU activation are used to capture non-linear relationships in the data.
- `BatchNormalization` layers help stabilize and accelerate training.
- `Dropout` layers are included to reduce overfitting by randomly deactivating neurons during training.
- The output layer uses a `sigmoid` activation function, enabling the model to produce independent probability scores for each genre (suitable for multi-label classification).

The model is compiled with the `adam` optimizer and a `binary_crossentropy` loss function, which is appropriate for predicting multiple independent labels.  
A summary of the architecture is displayed to verify the layout and parameter counts before training begins.
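
As a cross-check on the summary, the parameter counts can be worked out by hand. This is a back-of-the-envelope sketch that assumes the layer sizes used in this notebook (512, 256, 128 units), an input dimension of 5004, and 20 genres; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Scale, shift, moving mean and moving variance, one of each per feature."""
    return 4 * n_features


input_dim, n_classes = 5004, 20   # taken from the notebook outputs

layer_params = [
    dense_params(input_dim, 512),
    batchnorm_params(512),
    dense_params(512, 256),
    batchnorm_params(256),
    dense_params(256, 128),
    dense_params(128, n_classes),
]
total = sum(layer_params)
non_trainable = 2 * (512 + 256)    # BatchNormalization moving statistics
print(total, total - non_trainable)
```

Almost all parameters sit in the first dense layer, a direct consequence of the 5,000-term TF-IDF vocabulary.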


In [None]:
input_dim = X_train.shape[1]
import tensorflow as tf
tf.keras.backend.clear_session()
model = Sequential([
    Dense(512, activation='relu', input_shape=(input_dim,)),
    BatchNormalization(),
    Dropout(0.5),
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dropout(0.35),
    Dense(128, activation='relu'),
    Dropout(0.25),
    Dense(n_classes, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

### **Training the Neural Network with Early Stopping and Learning Rate Scheduling**

The model is trained using the prepared training data, with additional mechanisms to improve stability and prevent overfitting:

- **EarlyStopping** monitors the validation loss and halts training if it has not improved for 6 consecutive epochs (`patience=6`).  
  With `restore_best_weights=True`, the weights from the best epoch are restored, which avoids unnecessary training and preserves the best-performing model.

- **ReduceLROnPlateau** halves the learning rate (`factor=0.5`) when the validation loss has not improved for 3 epochs, down to a floor of `1e-6`.  
  This allows the optimizer to take smaller, more precise steps during the later stages of training.

During training, the model uses:
- a batch size of 256,
- up to 40 epochs,
- the previously computed `sample_weights` to correct for class imbalance.

Validation performance is evaluated at each epoch, and the training history is stored for later analysis.
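
The patience rule can be illustrated without training a model. The function below is a simplified sketch of the stopping logic, not the Keras implementation, and the loss values are hypothetical rather than taken from the actual run.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss; stop_epoch is None if that never happens.
    """
    best, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation losses, not the notebook's actual run.
losses = [0.29, 0.25, 0.22, 0.20, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26]
print(simulate_early_stopping(losses, patience=6))
```

In this made-up sequence the best loss occurs at epoch 5, and training stops six epochs later; restoring the best weights then rolls the model back to epoch 5.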


In [None]:
callbacks = [
    EarlyStopping(monitor='val_loss', patience=6, restore_best_weights=True, verbose=1),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6, verbose=1)
]

history = model.fit(
    X_train, Y_train,
    validation_data=(X_val, Y_val),
    epochs=40,
    batch_size=256,
    sample_weight=sample_weights,
    callbacks=callbacks,
    verbose=2
)

Epoch 1/40
138/138 - 16s - 115ms/step - accuracy: 0.2516 - loss: 0.3962 - val_accuracy: 0.3002 - val_loss: 0.2905 - learning_rate: 1.0000e-03
Epoch 2/40
138/138 - 11s - 83ms/step - accuracy: 0.4284 - loss: 0.2623 - val_accuracy: 0.3145 - val_loss: 0.2522 - learning_rate: 1.0000e-03
Epoch 3/40
138/138 - 21s - 153ms/step - accuracy: 0.4870 - loss: 0.2240 - val_accuracy: 0.4154 - val_loss: 0.2204 - learning_rate: 1.0000e-03
Epoch 4/40
138/138 - 23s - 165ms/step - accuracy: 0.5102 - loss: 0.1998 - val_accuracy: 0.4873 - val_loss: 0.1959 - learning_rate: 1.0000e-03
Epoch 5/40
138/138 - 12s - 89ms/step - accuracy: 0.5275 - loss: 0.1800 - val_accuracy: 0.5050 - val_loss: 0.1917 - learning_rate: 1.0000e-03
Epoch 6/40
138/138 - 13s - 91ms/step - accuracy: 0.5356 - loss: 0.1634 - val_accuracy: 0.4930 - val_loss: 0.1975 - learning_rate: 1.0000e-03
Epoch 7/40
138/138 - 20s - 146ms/step - accuracy: 0.5504 - loss: 0.1487 - val_accuracy: 0.4955 - val_loss: 0.2059 - learning_rate: 1.0000e-03
Epoch 8/4

### **Optimizing Decision Thresholds for Multi-Label Classification**

Since the model outputs a probability for each genre independently, a fixed threshold of 0.5 may not yield the best multi-label performance.  
Different genres have different frequency distributions, and some require lower or higher thresholds to achieve optimal detection.

To address this, a threshold-tuning procedure is applied:

- For each genre, a range of candidate thresholds is evaluated on the validation set.
- Each threshold is tested by converting probabilities into binary predictions and computing the corresponding F1-score.
- The threshold that produces the highest F1-score for that genre is selected.
- The resulting vector of genre-specific thresholds is then applied to the model's predictions on the test set.

This approach adapts the final predictions to the characteristics of each genre and typically improves F1-scores compared to a uniform threshold. Because the thresholds are selected on the validation set, the test set remains untouched for the final evaluation.


In [None]:
def find_best_thresholds(Y_true, Y_prob):
    """Pick, for each genre, the threshold that maximizes F1 on the given data."""
    thresholds = np.full(Y_true.shape[1], 0.5)   # default when a genre has no positives
    candidates = np.arange(0.05, 0.95, 0.01)
    for c in range(Y_true.shape[1]):
        ytrue = Y_true[:, c]
        if ytrue.sum() == 0:
            continue
        best_t, best_f1 = 0.5, -1
        for t in candidates:
            f1 = f1_score(ytrue, (Y_prob[:, c] >= t).astype(int), zero_division=0)
            if f1 > best_f1:
                best_f1 = f1
                best_t = t
        thresholds[c] = best_t
    return thresholds

Y_val_prob = model.predict(X_val)
thresholds = find_best_thresholds(Y_val, Y_val_prob)

def apply_thresholds(Y_prob, thresholds):
    return (Y_prob>=thresholds).astype(int)

Y_test_pred = apply_thresholds(model.predict(X_test), thresholds)

138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step
138/138 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step


### **Evaluating Multi-Label Classification Performance**

Standard accuracy is not an appropriate measure for multi-label problems, where each movie can belong to multiple genres simultaneously.  
Instead, performance is assessed using metrics that better capture the nature of multi-label prediction:

- **Micro F1-score:** evaluates overall performance by aggregating contributions from all labels, giving more weight to common genres.
- **Macro F1-score:** computes the F1-score for each genre independently and then averages them with equal weight, so performance on rare genres counts as much as on frequent ones.
- **Micro precision and recall:** measure the model’s ability to correctly identify genres across all predictions.

A comprehensive **classification report** is also generated, showing precision, recall, and F1-score for each individual genre.  
Together, these metrics provide a balanced and detailed evaluation of the model’s predictive ability.
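
The difference between micro and macro averaging shows up clearly on a small example. The sketch below computes both by hand with NumPy on hypothetical predictions; it is meant to illustrate the definitions, while the notebook itself relies on `f1_score` from scikit-learn.

```python
import numpy as np


def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts (0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Hypothetical predictions: the common genre (column 0) is always right,
# the rare genre (column 1) is missed on its only positive example.
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)

macro_f1 = np.mean([f1_from_counts(*c) for c in zip(tp, fp, fn)])  # average of per-genre F1
micro_f1 = f1_from_counts(tp.sum(), fp.sum(), fn.sum())            # F1 of pooled counts
print(round(micro_f1, 3), round(macro_f1, 3))
```

Missing the rare genre entirely barely dents the micro score but halves the macro score, which is why both are reported.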

In [None]:
def multilabel_metrics(ytrue, ypred):
    return {
        'micro_f1': f1_score(ytrue, ypred, average='micro', zero_division=0),
        'macro_f1': f1_score(ytrue, ypred, average='macro', zero_division=0),
        'micro_precision': precision_score(ytrue, ypred, average='micro', zero_division=0),
        'micro_recall': recall_score(ytrue, ypred, average='micro', zero_division=0)
    }

print("Test metrics:", multilabel_metrics(Y_test, Y_test_pred))
print("\nClassification report:\n", classification_report(Y_test, Y_test_pred, target_names=mlb.classes_, zero_division=0))

Test metrics: {'micro_f1': 0.621989735491512, 'macro_f1': 0.5520358230078146, 'micro_precision': 0.576736524206095, 'micro_recall': 0.6749491271286281}

Classification report:
                  precision    recall  f1-score   support

         Action       0.56      0.64      0.60       637
      Adventure       0.49      0.48      0.48       337
      Animation       0.53      0.69      0.60       191
         Comedy       0.63      0.71      0.67      1375
          Crime       0.49      0.63      0.55       449
    Documentary       0.78      0.76      0.77       375
          Drama       0.70      0.84      0.76      2104
         Family       0.58      0.55      0.57       282
        Fantasy       0.34      0.61      0.44       234
        Foreign       0.17      0.34      0.23       167
        History       0.32      0.44      0.37       154
         Horror       0.72      0.72      0.72       500
          Music       0.61      0.51      0.56       167
        Mystery       0.

### **Predicting Genres for New Movie Descriptions**

A custom prediction function is defined to classify the genres of a new, unseen movie based solely on its textual description.  
The function performs the following steps:

1. **Text vectorization:**  
   The input description is transformed using the previously fitted TF-IDF model to ensure consistency with the training features.

2. **Feature construction:**  
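   The text vector is combined with placeholder numeric features set to zero, matching the dimensional structure of the original training data. Because the numeric features were standardized, zero corresponds to the dataset mean of each feature.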

3. **Probability prediction:**  
   The trained ANN model outputs a probability score for each genre.

4. **Threshold-based decision:**  
   Each probability is converted into a binary prediction using the genre-specific thresholds optimized earlier.

5. **Top-k genre ranking:**  
   The highest-scoring genres are returned along with their corresponding probabilities.

This function enables practical genre prediction on arbitrary movie summaries and serves as the interface for applying the trained model beyond the dataset.


In [None]:
def predict_genres(text, top_k=5):
    X_new_text = tfidf.transform([text]).toarray()
    X_new = np.hstack([X_new_text, np.zeros((1,len(num_cols)))])  # numeric features as zeros
    probs = model.predict(X_new)[0]
    preds = (probs>=thresholds).astype(int)
    top_idx = probs.argsort()[-top_k:][::-1]
    return [(mlb.classes_[i], float(probs[i])) for i in top_idx], preds

### **Demonstrating the Model on Example Descriptions**

To illustrate how the trained model performs on unseen movie summaries, a series of example descriptions is provided.  
For each example, the prediction function outputs:

- the top predicted genres along with their probability scores  
- the full binary prediction mask indicating which genres were activated  

These examples help verify the model’s practical behavior and demonstrate its ability to identify multiple relevant genres from natural language descriptions.  
Below, several test descriptions are evaluated to showcase the model’s genre prediction capabilities.

In [None]:
example = "A ragtag team of mercenaries attempt a heist, but personal conflicts spiral into chaos and dark comedy."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step

Top predicted genres & probs: [('Comedy', 0.9414530992507935), ('Action', 0.3839086592197418), ('Thriller', 0.20027127861976624), ('Drama', 0.15489935874938965), ('Crime', 0.11070700734853745)]
Binary mask for all genres: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [None]:
example = "A team of astronauts is sent on a dangerous mission to explore a distant planet, but they encounter hostile alien lifeforms that threaten their survival."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step

Top predicted genres & probs: [('Science Fiction', 0.9885070323944092), ('Action', 0.6739530563354492), ('Thriller', 0.4483526647090912), ('Horror', 0.37217339873313904), ('Adventure', 0.13695670664310455)]
Binary mask for all genres: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0]


In [None]:
example = "Two rival chefs in a small town are forced to compete in a cooking contest, but unexpected romance blooms amidst the chaos."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step

Top predicted genres & probs: [('Comedy', 0.9777870774269104), ('Romance', 0.7659969925880432), ('Family', 0.1861397922039032), ('Music', 0.18331356346607208), ('Drama', 0.1773184835910797)]
Binary mask for all genres: [0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]


In [None]:
example = "A detective races against time to stop a serial killer who leaves cryptic clues at each crime scene."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step

Top predicted genres & probs: [('Crime', 0.8949007987976074), ('Thriller', 0.874430775642395), ('Mystery', 0.6925297975540161), ('Drama', 0.38880160450935364), ('Action', 0.3847935199737549)]
Binary mask for all genres: [1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0]


In [None]:
example = "During World War II, a young soldier struggles with the horrors of battle and the moral dilemmas of loyalty and survival."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step

Top predicted genres & probs: [('War', 0.9778497815132141), ('Drama', 0.8548486232757568), ('Action', 0.4780610203742981), ('Thriller', 0.17443959414958954), ('History', 0.09370403736829758)]
Binary mask for all genres: [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0]


In [None]:
example = "A family moves into a remote old mansion, only to discover it is haunted by vengeful spirits from the past."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step

Top predicted genres & probs: [('Horror', 0.9759166240692139), ('Thriller', 0.2501448094844818), ('Comedy', 0.19881510734558105), ('Mystery', 0.16861572861671448), ('Fantasy', 0.13059157133102417)]
Binary mask for all genres: [0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0]


In [None]:
example = "A young dragon befriends a boy and together they embark on a magical journey to save their kingdom from dark forces."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step

Top predicted genres & probs: [('Family', 0.8784245252609253), ('Adventure', 0.7780735492706299), ('Animation', 0.7720664739608765), ('Fantasy', 0.745086669921875), ('Drama', 0.13261695206165314)]
Binary mask for all genres: [0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]


In [None]:
example = "A group of bumbling thieves accidentally steals a gangster’s treasure and must outwit both the police and the mob to survive."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step

Top predicted genres & probs: [('Comedy', 0.9134494662284851), ('Action', 0.869952917098999), ('Crime', 0.6738135814666748), ('Thriller', 0.3067455589771271), ('Adventure', 0.11784035712480545)]
Binary mask for all genres: [1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [None]:
example = "A chosen hero must unite a group of warriors to defeat an evil sorcerer threatening their enchanted world."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 72ms/step

Top predicted genres & probs: [('Adventure', 0.8646023273468018), ('Fantasy', 0.8638072609901428), ('Action', 0.8083404302597046), ('Science Fiction', 0.3182853162288666), ('Animation', 0.14917849004268646)]
Binary mask for all genres: [1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]


In [None]:
example = "A struggling musician falls in love with a dancer, and together they try to achieve their dreams on the stage of a big city."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step

Top predicted genres & probs: [('Drama', 0.9093378186225891), ('Music', 0.8335316181182861), ('Romance', 0.7966499328613281), ('Comedy', 0.4257875680923462), ('Fantasy', 0.037110645323991776)]
Binary mask for all genres: [0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0]


In [None]:
example = "A detailed look into the life of a famous civil rights leader and the events that shaped their legacy."
topk, mask = predict_genres(example, top_k=5)
print("\nTop predicted genres & probs:", topk)
print("Binary mask for all genres:", mask)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step

Top predicted genres & probs: [('Documentary', 0.8276681900024414), ('Drama', 0.34598904848098755), ('History', 0.13752266764640808), ('Comedy', 0.07212508469820023), ('Foreign', 0.04408174380660057)]
Binary mask for all genres: [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
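
The binary masks printed above are easier to read once they are mapped back to genre names. The helper below is a small sketch with a hypothetical name, `mask_to_genres`; it hardcodes the genre order printed earlier by `mlb.classes_` so that it runs on its own.

```python
# Genre order as printed by `mlb.classes_` earlier in the notebook.
GENRES = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
          'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music',
          'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller',
          'War', 'Western']


def mask_to_genres(mask, classes=GENRES):
    """Return the names of the genres whose mask entry is 1."""
    return [name for name, flag in zip(classes, mask) if flag]


# Mask printed for the first example description (the mercenary heist).
first_mask = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(mask_to_genres(first_mask))
```

Inside the notebook, the same result could be obtained with `mlb.classes_[mask.astype(bool)]`, since `mask` is a NumPy array there.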
