### **Task 1: Evaluating Model Usefulness**

---

## **1️⃣ Dictionary-Based Approach**
### ✅ **Advantages**
- **Simple & Fast**: No training is needed; just check whether predefined genre words (e.g., "sci-fi", "romantic", "thriller") appear in the movie summary.
- **Interpretable**: The logic is transparent: if words related to "horror" appear, classify the movie as "Horror".

### ❌ **Limitations**
- **Limited Generalization**: If the summary does not contain explicit genre-related words, it may fail.
- **No Context Understanding**: The word *"alien"* may appear in both *sci-fi* and *horror* summaries, so keyword matching alone cannot tell the two apart and may misclassify.
- **Fixed Vocabulary**: Cannot detect **new genres** or **evolving terminology**.

### 🔥 **Best Use Case**
- **Baseline classifier** when no labeled data is available.
- **Quick filtering** (e.g., exclude children's movies by detecting words like *"murder"*, *"violence"*).
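
The keyword lookup described above can be sketched in a few lines. This is a minimal illustration, not a tuned classifier: the `GENRE_KEYWORDS` lists and the `classify_by_keywords` helper are hypothetical names invented here, and a real lexicon would need to be much larger.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real lexicon would be larger and curated.
GENRE_KEYWORDS = {
    "Horror": {"haunted", "ghost", "demon", "zombie"},
    "Sci-Fi": {"spaceship", "alien", "robot", "galaxy"},
    "Romance": {"love", "wedding", "romantic", "heart"},
}


def classify_by_keywords(summary, keywords=GENRE_KEYWORDS):
    """Return the genre with the most keyword hits, or None if nothing matches."""
    tokens = re.findall(r"[a-z]+", summary.lower())
    hits = Counter()
    for genre, vocab in keywords.items():
        hits[genre] = sum(1 for token in tokens if token in vocab)
    genre, count = hits.most_common(1)[0]
    return genre if count > 0 else None


print(classify_by_keywords("A robot crew flies a spaceship across the galaxy."))
print(classify_by_keywords("Two friends open a bakery in Paris."))
```

The second summary contains no listed keyword, so the function returns `None`, which illustrates the limited-generalization problem noted above. Ties are broken by the order of the dictionary, which is an arbitrary choice in this sketch.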

---

## **2️⃣ Latent Dirichlet Allocation (LDA) - Topic Modeling**
### ✅ **Advantages**
- **Unsupervised Learning**: Can find **hidden themes (topics)** in the summaries without labeled data.
- **Genre Discovery**: Automatically groups similar movies based on word distribution.
- **Interpretable Topics**: Each topic can be manually mapped to a genre.

### ❌ **Limitations**
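- **Less Accurate for Classification**: LDA outputs a **mixture of topics** per summary, not an **explicit genre label**; someone still has to interpret each topic and map it to a genre.
- **Number of Topics Must Be Predefined**: **K (number of topics)** has to be chosen before training, and there is no single correct value.
- **Sensitive to Stopwords**: Frequent generic words can dominate the topics if they are not removed during preprocessing.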

### 🔥 **Best Use Case**
- **Exploratory Analysis**: Useful when trying to **discover** genres in **unlabeled datasets**.
- **Genre Distribution**: Helps in understanding the structure of the dataset.
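
Because LDA only produces topic mixtures, turning them into genre labels takes an extra, manual step. The sketch below assumes each summary's topic distribution is available as a list of `(topic_id, probability)` pairs (the shape gensim's `LdaModel.get_document_topics` returns). The `TOPIC_TO_GENRE` mapping, the `dominant_genre` helper, and the `min_prob` threshold are all hypothetical choices made for illustration.

```python
# Hypothetical mapping, written by hand after reading each topic's top words.
TOPIC_TO_GENRE = {0: "Horror", 1: "Sci-Fi", 2: "Romance"}


def dominant_genre(topic_distribution, topic_to_genre=TOPIC_TO_GENRE, min_prob=0.5):
    """Map a list of (topic_id, probability) pairs to a genre label.

    Returns None when no topic is confident enough or the topic is unmapped.
    """
    if not topic_distribution:
        return None
    topic_id, prob = max(topic_distribution, key=lambda pair: pair[1])
    if prob < min_prob:
        return None
    return topic_to_genre.get(topic_id)


# Made-up distributions for two summaries.
print(dominant_genre([(0, 0.08), (1, 0.85), (2, 0.07)]))
print(dominant_genre([(0, 0.40), (1, 0.35), (2, 0.25)]))
```

The second distribution has no topic above the threshold, so no label is assigned; this reflects the point that LDA suggests themes rather than deciding genres.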

---

## **3️⃣ Word2Vec - Neural Embeddings for Similarity**
### ✅ **Advantages**
- **Captures Word Relationships**: Unlike dictionaries, it learns from the contexts words appear in, so related words end up close together (e.g., *"thriller"* and *"suspense"* are closer in vector space).
- **Works Without Explicit Genre Words**: Even if a summary **doesn’t mention the genre**, the vectors of its other words might still place it near summaries of the same genre.
- **Adaptable**: Embedding quality generally improves when the model is trained on a larger corpus.

### ❌ **Limitations**
- **Requires Training Data**: Needs a **large** text corpus to learn good representations; a small set of summaries alone is unlikely to be enough.
- **Computationally Expensive**: Training embeddings takes noticeably more time than a keyword lookup.
- **Does Not Directly Classify**: Needs an additional **supervised model** like **SVM or Logistic Regression** to predict genres.

### 🔥 **Best Use Case**
- **Finding Similar Movies**: Example → "Find all movies **similar to Inception**".
- **Genre Classification with ML**: Summary vectors serve as input features for a **supervised classifier** (e.g., SVM or Logistic Regression).
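
To make the similarity idea concrete, the sketch below represents a summary as the average of its word vectors and compares summaries with cosine similarity. The 3-dimensional vectors in `TOY_VECTORS` are invented for illustration only; real Word2Vec embeddings are learned from a corpus, and the helper names are assumptions of this example.

```python
import math

# Toy 3-dimensional vectors invented for illustration; real Word2Vec
# embeddings are learned from a corpus and have many more dimensions.
TOY_VECTORS = {
    "thriller": [0.9, 0.1, 0.0],
    "suspense": [0.8, 0.2, 0.1],
    "heist": [0.7, 0.3, 0.0],
    "wedding": [0.0, 0.1, 0.9],
    "love": [0.1, 0.0, 0.8],
}


def average_vector(tokens, vectors=TOY_VECTORS):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


heist_movie = average_vector(["suspense", "heist"])
romance_movie = average_vector(["wedding", "love"])
query = average_vector(["thriller"])

# The query should sit closer to the heist summary than to the romance one.
print(cosine_similarity(query, heist_movie) > cosine_similarity(query, romance_movie))
```

Averaging discards word order, so it is a simple baseline rather than the only way to build a summary vector. The same averaged vectors could be passed to a supervised classifier as features.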

---

## **Final Verdict: Which Model Should You Use?**
| **Scenario**  | **Best Model** | **Why?** |
|--------------|--------------|-----------------------------------|
| No labeled data?  | **LDA** | Finds **themes** in the dataset. |
| Need fast classification?  | **Dictionary-Based** | Quick but limited in accuracy. |
| Want **high accuracy**?  | **Word2Vec + ML** | Learns **context** from data. |

### **📌 Conclusion**
- **Use LDA** if you **don't have labeled data** and want to explore **genres**.
- **Use Dictionary-Based Classification** if you need a **quick and simple method**.
- **Use Word2Vec** embeddings with a supervised classifier if you have **labeled training data** and want a **more accurate ML-based classifier**.

### **Task 2: Preprocessing Movie Data**
The summaries in `movies.txt` must be cleaned before any model is applied. The Python script below lowercases each summary, removes digits and punctuation, tokenizes the text, and drops English stopwords:

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
nltk.download('stopwords')

# Load movie summaries
file_path = "movies.txt"

# Read the file, assuming one summary per line; skip blank lines
with open(file_path, "r", encoding="utf-8") as file:
    movie_summaries = [line.strip() for line in file if line.strip()]

# Build the stopword set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words("english"))

# Define a function to clean text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = word_tokenize(text)  # Tokenization
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(words)

# Apply preprocessing to each movie summary
cleaned_summaries = [preprocess_text(summary) for summary in movie_summaries]

# Create a DataFrame
df = pd.DataFrame({"Original Summary": movie_summaries, "Processed Summary": cleaned_summaries})

# Display the first few processed rows
print(df.head())

print("✅ Data preprocessing completed!")


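### **Task 3: Unsupervised Genre Detection**
We can apply Latent Dirichlet Allocation (LDA) for topic modeling to infer possible genres. Each learned topic is a weighted list of words, and the top words of a topic hint at the genre it may correspond to.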

In [None]:
from gensim import corpora
from gensim.models import LdaModel

# Convert processed summaries into token lists
tokenized_summaries = [summary.split() for summary in cleaned_summaries]

# Create a dictionary
dictionary = corpora.Dictionary(tokenized_summaries)

# Convert each summary to a bag-of-words list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in tokenized_summaries]

# Train LDA model (num_topics=5 is a starting guess; the fixed seed makes runs repeatable)
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10, random_state=42)

# Display topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
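
The bag-of-words step above can be hard to picture, so here is a dependency-free sketch of the same idea: give every distinct token an integer ID, then describe each document as `(token_id, count)` pairs. The helpers `build_vocabulary` and `to_bag_of_words` are hypothetical stand-ins written for this illustration; gensim's `Dictionary` and `doc2bow` do the equivalent work in the script, though the IDs they assign may differ.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(tokens, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


docs = [["alien", "ship", "alien"], ["ghost", "ship"]]
vocab = build_vocabulary(docs)
print(vocab)
print([to_bag_of_words(doc, vocab) for doc in docs])
```

Word order is discarded in this representation; LDA only sees how often each word occurs in each summary.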
