
Objective:



Create a machine learning model that can predict the genre of a
movie based on its plot summary or other textual information. You
can use techniques like TF-IDF or word embeddings with classifiers
such as Naive Bayes, Logistic Regression, or Support Vector
Machines.

Steps to Follow:

1. Load and examine the training data.
2. Preprocess the text data.
3. Create features using TF-IDF.
4. Train a classifier (e.g., Logistic Regression).
5. Evaluate the model on test data.






Each line of the training data represents one movie, with four fields separated by " ::: ":
ID ::: Title ::: Genre ::: Plot summary

Lines in the test data omit the genre and have three fields:
ID ::: Title ::: Plot summary
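As a quick illustration of that layout, the sketch below splits a single line on the " ::: " separator, the same way the preprocessing code in Step 1 does. The sample line is made up for this example and is not taken from the dataset.

```python
# Hypothetical sample line in the "ID ::: Title ::: Genre ::: Plot" layout
# (invented for illustration, not a real row from train_data.txt).
sample_line = "1 ::: Example Movie (2000) ::: drama ::: A hypothetical plot summary.\n"

# Strip the trailing newline, then split on the ' ::: ' separator.
fields = sample_line.strip().split(" ::: ")

# A well-formed training line yields exactly four fields.
movie_id, title, genre, plot = fields

print(fields)
```

A line that does not split into exactly four fields (or three, for test data) is skipped by `preprocess_data` rather than parsed.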


Steps to Proceed:

1. Load and preprocess the training data to separate the ID, title, genre, and plot summary.
2. Vectorize the plot summaries using TF-IDF.
3. Train a classifier (Logistic Regression) on the training data.
4. Evaluate the model using the test data.



Step 1: Load and Preprocess the Data

In [7]:
import pandas as pd

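# Parse raw lines into a DataFrame. Training lines have four ' ::: '-separated
# fields (id, title, genre, plot); test lines have three (id, title, plot).
# Lines with any other number of fields are reported and skipped.
def preprocess_data(data, is_train=True):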
    ids = []
    titles = []
    genres = []
    plots = []

    for line in data:
        parts = line.strip().split(' ::: ')
        if is_train:
            if len(parts) == 4:
                ids.append(parts[0])
                titles.append(parts[1])
                genres.append(parts[2])
                plots.append(parts[3])
            else:
                print(f"Skipping line due to incorrect format: {line}")
        else:
            if len(parts) == 3:
                ids.append(parts[0])
                titles.append(parts[1])
                plots.append(parts[2])
            else:
                print(f"Skipping line due to incorrect format: {line}")

    if not ids:
        raise ValueError("No valid data found after preprocessing. Please check the input format.")

    if is_train:
        return pd.DataFrame({
            'id': ids,
            'title': titles,
            'genre': genres,
            'plot': plots
        })
    else:
        return pd.DataFrame({
            'id': ids,
            'title': titles,
            'plot': plots
        })

# Read and preprocess the training data
# (explicit UTF-8 so non-ASCII titles decode the same way on every platform)
with open('/content/train_data.txt', 'r', encoding='utf-8') as file:
    train_data = file.readlines()

train_df = preprocess_data(train_data, is_train=True)


Step 2: Vectorize the Plot Summaries Using TF-IDF

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the plot summaries using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = tfidf.fit_transform(train_df['plot'])
y_train = train_df['genre']
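To make the TF-IDF weighting more concrete, here is a minimal pure-Python sketch of the calculation on a toy corpus. It assumes the smoothed IDF formula that scikit-learn documents for its default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The corpus is hypothetical, tokenization is a plain whitespace split, and stop-word removal and the `max_features` cap are left out, so this is an illustration of the idea rather than a reproduction of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); whitespace tokenization only, no stop-word removal.
docs = [
    "detective hunts killer",
    "detective falls in love",
    "love story in paris",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed to match scikit-learn's default).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    raw = {term: freq * idf[term] for term, freq in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "detective" appears in two of the three documents while "hunts" and "killer" appear in only one, so the rarer terms receive the larger weights.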


Step 3: Model Training

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

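# Split the data into training and validation sets.
# Note: the TF-IDF vectorizer was fit on all rows before this split, so the
# validation score may be slightly optimistic.
X_train, X_val, y_train, y_val = train_test_split(X_train_tfidf, y_train, test_size=0.2, random_state=42)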

# Train a Logistic Regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train, y_train)

# Evaluate the model on the validation set
y_val_pred = log_reg.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print("Validation Classification Report:")
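# zero_division=0 reports 0.00 for classes that are never predicted instead of emitting a warning
print(classification_report(y_val, y_val_pred, zero_division=0))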


Validation Accuracy: 0.5793599557318085
Validation Classification Report:
              precision    recall  f1-score   support

      action       0.52      0.26      0.34       263
       adult       0.73      0.21      0.33       112
   adventure       0.43      0.14      0.22       139
   animation       0.62      0.10      0.17       104
   biography       0.00      0.00      0.00        61
      comedy       0.51      0.58      0.55      1443
       crime       0.29      0.02      0.04       107
 documentary       0.66      0.84      0.74      2659
       drama       0.54      0.78      0.64      2697
      family       0.39      0.07      0.12       150
     fantasy       0.00      0.00      0.00        74
   game-show       0.94      0.42      0.59        40
     history       0.00      0.00      0.00        45
      horror       0.64      0.56      0.60       431
       music       0.62      0.47      0.54       144
     musical       1.00      0.02      0.04        50
     mystery       0.00      0.00      0.00        56
        news       1.00    

Step 4: Prediction on Test Data

In [10]:
# Read and preprocess the test data
# (explicit UTF-8 so non-ASCII titles decode the same way on every platform)
test_file_path = '/content/test_data.txt'
with open(test_file_path, 'r', encoding='utf-8') as file:
    test_data = file.readlines()

test_df = preprocess_data(test_data, is_train=False)

# Ensure the test data is not empty
if test_df.empty:
    raise ValueError("Test data is empty. Please check the test data file.")

# Transform the test data using the same TF-IDF vectorizer
X_test_tfidf = tfidf.transform(test_df['plot'])

# Make predictions on the test data
y_test_pred = log_reg.predict(X_test_tfidf)

# Add predictions to the test DataFrame
test_df['predicted_genre'] = y_test_pred

# Display the test DataFrame with predictions
print(test_df[['id', 'title', 'predicted_genre']])

# Save the results to a CSV file
test_df.to_csv('/content/predicted_test_data.csv', index=False)


          id                           title predicted_genre
0          1            Edgar's Lunch (1998)           short
1          2        La guerra de papá (1977)           drama
2          3     Off the Beaten Track (2010)     documentary
3          4          Meu Amigo Hindu (2015)           drama
4          5               Er nu zhai (1955)           drama
...      ...                             ...             ...
54195  54196  "Tales of Light & Dark" (2013)           drama
54196  54197     Der letzte Mohikaner (1965)          action
54197  54198             Oliver Twink (2007)          comedy
54198  54199               Slipstream (1973)           drama
54199  54200       Curitiba Zero Grau (2010)     documentary

[54200 rows x 3 columns]


## Conclusion:


## Model Performance:

* Validation Accuracy: 57.9%
* Performance Variability: F1 is highest for the most frequent genres, such as documentary (0.74) and drama (0.64), while rare genres such as biography and fantasy score 0.00 on precision, recall, and F1.
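
The 0.00 rows are what you get when a class is never predicted, or never predicted correctly. The sketch below computes per-class precision, recall, and F1 by hand on a small set of hypothetical labels in which "fantasy" is never predicted. It is a plain-Python illustration of the metric definitions, not the scikit-learn implementation, and it substitutes 0.0 wherever a ratio would be 0/0.

```python
from collections import Counter

# Hypothetical toy labels: the rare class "fantasy" is never predicted.
y_true = ["drama", "drama", "drama", "comedy", "comedy", "fantasy"]
y_pred = ["drama", "drama", "comedy", "comedy", "drama", "drama"]

def per_class_scores(y_true, y_pred):
    """Compute (precision, recall, F1) per class, using 0.0 when a ratio is 0/0."""
    labels = sorted(set(y_true) | set(y_pred))
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # how often each class was predicted
    actual = Counter(y_true)     # how often each class really occurs (support)
    scores = {}
    for label in labels:
        precision = true_positives[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positives[label] / actual[label] if actual[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label:>8}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

Because overall accuracy is dominated by the large classes, a model can reach a moderate accuracy while scoring zero on several small ones, which is why the per-class report is worth reading alongside the single accuracy figure.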



## Test Data Predictions:

* The model predicts a range of genres; drama, documentary, and short are among the most common predictions.
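
One way to check a claim like this is to count the predicted labels. The sketch below does so with `collections.Counter` on only the ten `predicted_genre` values visible in the printed preview in Step 4 (the first five and last five rows), so it says nothing about the other 54,190 rows.

```python
from collections import Counter

# The ten predicted_genre values shown in the printed preview in Step 4.
# The remaining rows of the 54,200-row output are not reproduced here.
preview_predictions = [
    "short", "drama", "documentary", "drama", "drama",
    "drama", "action", "comedy", "drama", "documentary",
]

# Count how often each genre appears, most frequent first.
counts = Counter(preview_predictions)
for genre, count in counts.most_common():
    print(f"{genre}: {count}")
```

On the full DataFrame, `test_df['predicted_genre'].value_counts()` would give the same kind of summary across all rows.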

