**Reasoning**:
Preview the first five lines of each data file, reporting any file that is missing, then load the training and test sets into DataFrames and display their head and column info.



In [15]:
import os

file_paths = [
    "/content/train_data.txt",
    "/content/test_data_solution.txt",
    "/content/description.txt",
    "/content/test_data.txt"
]

for file_path in file_paths:
    if os.path.exists(file_path):
        print(f"--- Content of {file_path} ---")
        with open(file_path, 'r') as f:
            for i in range(5): # Read first 5 lines
                try:
                    print(f.readline().strip())
                except Exception as e:
                    print(f"Error reading line: {e}")
                    break
    else:
        print(f"--- File not found: {file_path} ---")

--- Content of /content/train_data.txt ---
1 ::: Oscar et la dame rose (2009) ::: drama ::: Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue.
2 ::: Cupid (1997) ::: thriller ::: A brother and sister with a past incestuous relationship have a current murderous relationship. He murders the women who reject him and she murders the women who get too close to him.
3 ::: Young, Wild and Wonderful (1980) ::: adult ::: As the bus empties the students for their field trip to the Museum of Natural History, li

In [16]:
import pandas as pd

# Load the training data
train_df = pd.read_csv('/content/train_data.txt', sep=':::', names=['ID', 'Title', 'Genre', 'Description'], engine='python')

# Load the test data
test_df = pd.read_csv('/content/test_data.txt', sep=':::', names=['ID', 'Title', 'Description'], engine='python')

# Display the first few rows and column names of both dataframes
print("--- Training Data ---")
display(train_df.head())
print("\n--- Test Data ---")
display(test_df.head())

print("\n--- Training Data Info ---")
train_df.info()

print("\n--- Test Data Info ---")
test_df.info()

--- Training Data ---


|   | ID | Title | Genre | Description |
|---|---|---|---|---|
| 0 | 1 | Oscar et la dame rose (2009) | drama | Listening in to a conversation between his do... |
| 1 | 2 | Cupid (1997) | thriller | A brother and sister with a past incestuous r... |
| 2 | 3 | Young, Wild and Wonderful (1980) | adult | As the bus empties the students for their fie... |
| 3 | 4 | The Secret Sin (1915) | drama | To help their unemployed father make ends mee... |
| 4 | 5 | The Unrecovered (2007) | drama | The film's title refers not only to the un-re... |



--- Test Data ---


|   | ID | Title | Description |
|---|---|---|---|
| 0 | 1 | Edgar's Lunch (1998) | L.R. Brane loves his life - his car, his apar... |
| 1 | 2 | La guerra de papá (1977) | Spain, March 1964: Quico is a very naughty ch... |
| 2 | 3 | Off the Beaten Track (2010) | One year in the life of Albin and his family ... |
| 3 | 4 | Meu Amigo Hindu (2015) | His father has died, he hasn't spoken with hi... |
| 4 | 5 | Er nu zhai (1955) | Before he was known internationally as a mart... |



--- Training Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54214 entries, 0 to 54213
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           54214 non-null  int64 
 1   Title        54214 non-null  object
 2   Genre        54214 non-null  object
 3   Description  54214 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.7+ MB

--- Test Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54200 entries, 0 to 54199
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           54200 non-null  int64 
 1   Title        54200 non-null  object
 2   Description  54200 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB
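
The reasoning for this step mentions handling a missing file, which the cells above do only in the preview loop. A minimal sketch of a loader that also covers the `read_csv` call could look like this; `load_plot_file` is a hypothetical helper name, not part of the notebook, and it reuses the same `:::` separator and parser engine.

```python
import pandas as pd


def load_plot_file(path, columns):
    """Hypothetical helper: read a ':::'-delimited file, or return None if it is missing."""
    try:
        # Same separator and engine as the loading cell; text fields keep
        # their surrounding spaces, exactly as in the original load.
        return pd.read_csv(path, sep=":::", names=columns, engine="python")
    except FileNotFoundError:
        print(f"File not found: {path}")
        return None


train_columns = ["ID", "Title", "Genre", "Description"]

# A path that does not exist triggers the fallback branch.
missing = load_plot_file("/no/such/dir/train_data.txt", train_columns)
print(missing)
```

Returning `None` keeps the sketch short; raising a clearer error or stopping the notebook would be an equally reasonable choice.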


## Preprocess Text Data

### Subtask:
Clean and preprocess the text data, including tokenization, removing stop words, and stemming or lemmatization.

**Reasoning**:
Define a function that strips punctuation and digits, lowercases the text, splits it on whitespace, removes English stop words, and stems the remaining tokens with the Porter stemmer. Apply this function to the 'Description' column of both the training and testing dataframes.

In [17]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download the NLTK resources if not already present
# (the whitespace tokenizer below does not actually use punkt)
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = text.split()
    # Remove stop words and stem
    processed_tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(processed_tokens)

# Apply preprocessing to the 'Description' column
train_df['Processed_Description'] = train_df['Description'].apply(preprocess_text)
test_df['Processed_Description'] = test_df['Description'].apply(preprocess_text)

print("--- Processed Training Data ---")
display(train_df[['Description', 'Processed_Description']].head())

print("\n--- Processed Test Data ---")
display(test_df[['Description', 'Processed_Description']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


KeyboardInterrupt: 
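
One detail of the cleaning regex is worth seeing on a concrete string: because non-letter characters are deleted rather than replaced, hyphenated words are fused into one token. The sketch below uses only the standard library and a made-up sample sentence; the second variant, which substitutes a space, is an assumption for comparison and not what `preprocess_text` does.

```python
import re

sample = "Listening in, 10-year-old Oscar learns the truth."

# Pattern from preprocess_text: every character that is neither a letter
# nor whitespace is deleted outright.
deleted = re.sub(r"[^a-zA-Z\s]", "", sample).lower().split()

# Variant (an assumption, not the notebook's behavior): substitute a space
# so that hyphenated words stay separate tokens.
spaced = re.sub(r"[^a-zA-Z\s]", " ", sample).lower().split()

print(deleted)  # ['listening', 'in', 'yearold', 'oscar', 'learns', 'the', 'truth']
print(spaced)   # ['listening', 'in', 'year', 'old', 'oscar', 'learns', 'the', 'truth']
```

Which behavior is preferable depends on the vocabulary; either way, the same choice has to be applied to training, test, and any new text.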

## Feature Extraction

### Subtask:
Convert the preprocessed text data into numerical features using techniques like TF-IDF or word embeddings.

**Reasoning**:
Use TF-IDF to convert the preprocessed text data into numerical features. Initialize a TfidfVectorizer and fit it on the processed training descriptions, then transform both the training and testing descriptions.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
# Consider using a limited number of features to avoid a very large matrix
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vectorizer on the training data and transform both training and testing data
tfidf_train_features = tfidf_vectorizer.fit_transform(train_df['Processed_Description'])
tfidf_test_features = tfidf_vectorizer.transform(test_df['Processed_Description'])

print("TF-IDF features shape (training):", tfidf_train_features.shape)
print("TF-IDF features shape (testing):", tfidf_test_features.shape)
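
To illustrate why the vectorizer is fitted on the training text only, here is a small sketch on toy strings (the strings and variable names are invented for illustration). Terms that appear only in the test text are simply ignored at transform time, so both matrices share the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for processed descriptions (hypothetical data).
toy_train = ["detect murder town", "love wed comedi"]
toy_test = ["detect wed spaceship"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(toy_train)  # learns 6 terms and their IDF weights
test_matrix = vectorizer.transform(toy_test)        # reuses that vocabulary unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)  # (2, 6) (1, 6)
print(test_matrix.nnz)  # 2: "spaceship" was never seen in training, so it is dropped
```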

## Split Data

### Subtask:
Split the data into training and testing sets.

**Reasoning**:
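The dataset already ships as separate training and test files, so no random split is performed here. Separate the features (TF-IDF representations) and the target variable (Genre) from the training data, and keep the test features for later prediction.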

In [None]:
X_train = tfidf_train_features
y_train = train_df['Genre']

# For the test set, we only have features (X_test) and will predict the genres later
X_test = tfidf_test_features

print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
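
If a validation set is wanted for tuning without touching the test file, one option is to hold out part of the training data. The sketch below runs on toy lists; the 25% hold-out size and the random seed are arbitrary assumptions, and `stratify` keeps the genre proportions similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for descriptions and genres (hypothetical data).
toy_texts = [f"plot {i}" for i in range(8)]
toy_genres = ["drama", "comedy"] * 4

texts_fit, texts_val, genres_fit, genres_val = train_test_split(
    toy_texts,
    toy_genres,
    test_size=0.25,       # hold out a quarter of the rows
    stratify=toy_genres,  # preserve the class balance in both parts
    random_state=42,
)

print(len(texts_fit), len(texts_val))  # 6 2
```

The same call accepts the sparse `X_train` matrix and the `y_train` series in place of the toy lists.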

## Choose a Model

### Subtask:
Select a suitable machine learning model for classification (e.g., Naive Bayes, Logistic Regression, Support Vector Machine).
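
One way to ground this choice is to cross-validate each candidate on the training data and compare mean accuracy. The following is a sketch on a tiny invented corpus, so the scores themselves mean nothing; the candidate list, `cv=3`, and `max_iter=1000` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented corpus: six plots per genre.
toy_plots = [
    "detective hunts killer in dark city",
    "police chase murderer through night streets",
    "killer stalks detective in empty town",
    "murder mystery shocks quiet city police",
    "detective finds killer hiding in town",
    "night murder leads police to killer",
    "friends plan funny wedding party",
    "clumsy groom ruins wedding with jokes",
    "funny roommates throw silly party",
    "silly jokes save awkward family dinner",
    "family wedding turns into funny chaos",
    "roommates tell jokes at family party",
]
toy_labels = ["thriller"] * 6 + ["comedy"] * 6

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

mean_scores = {}
for name, classifier in candidates.items():
    # Putting the vectorizer inside the pipeline refits it on each training
    # fold, so the held-out fold never influences the vocabulary.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(pipeline, toy_plots, toy_labels, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```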

## Train the Model

### Subtask:
Train the chosen model on the training data.

**Reasoning**:
Import the Multinomial Naive Bayes classifier and train it using the training data.

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Multinomial Naive Bayes model
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)

print("Model training complete.")

## Evaluate the Model

### Subtask:
Evaluate the performance of the trained model on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score).

**Reasoning**:
Load the test data solution, make predictions on the test features, and evaluate the model's performance using classification metrics.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

# Load the test data solution
test_solution_df = pd.read_csv('/content/test_data_solution.txt', sep=':::', names=['ID', 'Title', 'Genre', 'Description'], engine='python')

# Make predictions on the test features
y_pred = model.predict(X_test)

# Get the true genres from the test solution
y_true = test_solution_df['Genre']

# Evaluate the model
accuracy = accuracy_score(y_true, y_pred)
# These are multi-class metrics on an imbalanced label set, so an averaging
# strategy is required. 'weighted' averages the per-class scores by class
# support; zero_division=0 scores any class that is never predicted as 0
# instead of emitting a warning.
precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)


print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1-score (weighted): {f1:.4f}")
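
To make the `'weighted'` average concrete, the sketch below recomputes it by hand on six invented labels: each class's F1 is multiplied by its support (the number of true examples of that class) and the sum is divided by the total number of examples.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented labels for illustration only.
toy_true = ["drama", "drama", "drama", "comedy", "comedy", "horror"]
toy_pred = ["drama", "drama", "comedy", "comedy", "drama", "horror"]
labels = ["comedy", "drama", "horror"]

# Per-class F1, in the order given by `labels`.
per_class_f1 = f1_score(toy_true, toy_pred, labels=labels, average=None)

# Support = number of true examples of each class.
support = np.array([toy_true.count(label) for label in labels])

manual_weighted = float(np.sum(per_class_f1 * support) / support.sum())
library_weighted = float(f1_score(toy_true, toy_pred, labels=labels, average="weighted"))

print(dict(zip(labels, per_class_f1.round(3))))
print(f"manual: {manual_weighted:.4f}  library: {library_weighted:.4f}")
```

Because frequent genres dominate this average, a per-class breakdown is a useful companion when rare genres matter.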

## Predict Genres

### Subtask:
Use the trained model to predict the genre of new, unseen movie plot summaries.

In [None]:
print(model)

## Predict Genre for New Data

### Subtask:
Demonstrate how to use the trained model to predict the genre of a new movie plot summary.

In [None]:
# Example of a new movie description
new_movie_description = "A detective investigates a series of mysterious disappearances in a small town."

# Preprocess the new description using the same function used for training data
processed_new_description = preprocess_text(new_movie_description)
print(f"Processed new description: {processed_new_description}")

# Transform the preprocessed description using the fitted TF-IDF vectorizer
new_description_features = tfidf_vectorizer.transform([processed_new_description])

# Predict the genre using the trained model
predicted_genre = model.predict(new_description_features)

print(f"The predicted genre for the new movie is: {predicted_genre[0]}")
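
Beyond the single predicted label, `MultinomialNB` also exposes class probabilities through `predict_proba`, which can be used to list the most likely genres. The sketch below trains a throwaway model on six invented plots so that it runs on its own; the data and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training plots, two per genre.
toy_plots = [
    "detective investigates murder in small town",
    "police hunt killer after mysterious disappearance",
    "friends plan funny wedding party",
    "clumsy groom tells silly jokes",
    "family struggles after father loses job",
    "young woman cares for dying mother",
]
toy_genres = ["thriller", "thriller", "comedy", "comedy", "drama", "drama"]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_plots), toy_genres)

features = toy_vectorizer.transform(
    ["detective investigates mysterious disappearance in small town"]
)
probabilities = toy_model.predict_proba(features)[0]

# Indices of the two highest probabilities, best first.
top = np.argsort(probabilities)[::-1][:2]
ranked = [(toy_model.classes_[i], float(probabilities[i])) for i in top]

for genre, probability in ranked:
    print(f"{genre}: {probability:.3f}")
```

Naive Bayes probabilities tend to be poorly calibrated, so they are better read as a ranking than as confidence levels.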