# Part 2: Problem Statement and Dataset

## Movie Genre Classification

### Problem Statement

*I will build a multi-class classification model that automatically predicts movie genres from the MovieLens dataset, using Support Vector Machines (SVM) or Naive Bayes. Success will be measured by accuracy (at least 80%) and by a balance between precision and recall. Within the designated timeframe, the project aims to support better movie recommendations for streaming services, targeted marketing campaigns for studios, and genre-based movie discovery for users.*

### Proposed Methods and Models

**Data Preprocessing and Cleaning:**

- Address missing values and inconsistencies.

- Preprocess textual data (plot keywords) for machine learning compatibility (tokenization, stemming).
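The text preprocessing step can be sketched with the standard library alone. This is a minimal illustration, not the final pipeline: the `SUFFIXES` list and the `crude_stem` helper are hypothetical stand-ins for a proper stemmer (for example, a Porter stemmer), and missing values are simply mapped to an empty token list.

```python
import re

# Hypothetical, deliberately crude suffix list; a real project would
# more likely rely on an established stemmer than on this stand-in.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    if not isinstance(text, str):
        return []  # missing value (None, NaN): nothing to tokenize
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize a keyword string and stem each token."""
    return [crude_stem(tok) for tok in tokenize(text)]


print(preprocess("Toys rescued, friends competing"))
# → ['toy', 'rescu', 'friend', 'compet']
```

The stems are not always real words (`rescu`, `compet`); that is normal for stemming, since the goal is only to map related word forms to the same token.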

**Feature Engineering (Optional):**

- Create new features from existing ones to potentially improve model performance (e.g., combining budget and revenue as a proxy for production scale).
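One way to combine budget and revenue is sketched below on a toy DataFrame. The feature names `log_budget` and `revenue_to_budget` are hypothetical, and two assumptions are made: that `budget` may arrive as text, and that a value of 0 means "not reported" rather than a true zero.

```python
import numpy as np
import pandas as pd

# Toy rows using budget/revenue values from the first rows of movies_metadata.csv.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "budget": ["30000000", "65000000", "0"],  # assumed to arrive as strings
    "revenue": [373554033.0, 262797249.0, 0.0],
})

# Coerce to numbers; unparseable entries become NaN instead of raising.
budget = pd.to_numeric(toy["budget"], errors="coerce")
revenue = pd.to_numeric(toy["revenue"], errors="coerce")

# Treat non-positive values as missing (an assumption about the data).
budget = budget.where(budget > 0)
revenue = revenue.where(revenue > 0)

# Hypothetical production-scale features.
toy["log_budget"] = np.log10(budget)
toy["revenue_to_budget"] = revenue / budget

print(toy[["title", "log_budget", "revenue_to_budget"]])
```

Rows with an unreported budget or revenue end up with NaN features, so they can be imputed or dropped explicitly later instead of silently producing a ratio of zero or infinity.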

**Model Selection and Training:**

- Implement classification algorithms like Support Vector Machines or Naive Bayes.

- Train the model on the preprocessed MovieLens dataset.
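As a rough sketch of what training could look like, the snippet below fits both candidate algorithms with scikit-learn on a tiny invented corpus. The texts and labels are made up for illustration and are not taken from the dataset. The sketch also assumes one genre label per movie, which is a simplification if a movie lists several genres.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented corpus, for illustration only (not taken from the dataset).
texts = [
    "toys come alive friendship adventure",
    "talking animals musical kingdom",
    "detective murder investigation city",
    "serial killer police chase",
    "wedding couple love misunderstanding",
    "roommates prank party laughs",
]
labels = ["Animation", "Animation", "Crime", "Crime", "Comedy", "Comedy"]

# Each pipeline turns raw text into TF-IDF vectors, then fits a classifier.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    predictions[name] = model.predict(["police investigation murder"])[0]
    print(name, predictions[name])
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary learned from the training data only, which avoids leaking test-set information once a train/test split is introduced.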

**Model Evaluation:**

- Evaluate model performance using accuracy, precision, and recall.
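To make the three metrics concrete, here is a small pure-Python sketch. The helper name `evaluate` is hypothetical, and macro averaging (every genre counts equally) is an assumption about how precision and recall would be aggregated across genres.

```python
def evaluate(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # times class c was predicted
        actual = sum(t == c for t in y_true)     # times class c truly occurs
        # Convention (an assumption): score 0.0 when the denominator is zero.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return {
        "accuracy": correct / len(y_true),
        "macro_precision": sum(precisions) / len(classes),
        "macro_recall": sum(recalls) / len(classes),
    }


# Invented labels, for illustration only.
y_true = ["Comedy", "Comedy", "Drama", "Drama", "Action"]
y_pred = ["Comedy", "Drama", "Drama", "Drama", "Action"]
report = evaluate(y_true, y_pred)
print(report)
```

In this toy case accuracy is 0.8, while macro recall (about 0.83) and macro precision (about 0.89) differ, because one Comedy movie was predicted as Drama.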

### Data Source and Risks/Assumptions

**Data Source:** *Kaggle (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)*

**Risks and Assumptions:**

- The dataset may be biased toward certain genres. If some genres are underrepresented, I will need to explore data balancing techniques.

- Data quality and accuracy might require additional cleaning and validation steps.

- The model's performance might be influenced by the chosen features and algorithms. I might need to explore alternative options for improvement.
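One common balancing technique is to weight each class inversely to its frequency. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`; the helper name `balanced_class_weights` and the label counts are invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Invented, deliberately imbalanced label distribution.
labels = ["Drama"] * 6 + ["Comedy"] * 3 + ["Western"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare genres receive larger weights (here Western gets about 3.33 versus about 0.56 for Drama), so errors on them cost more during training. Resampling the data is an alternative that this sketch does not cover.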

### Success Criteria Refinement

*Based on the chosen metrics, success will be determined by:*

- Achieving at least 80% accuracy in genre prediction.

- Maintaining a balance between precision and recall to ensure the model accurately identifies movies within each genre and avoids misclassifications.

### Data Source Documentation

*The MovieLens dataset provides the primary data source for this project. Key components include:*

- movies_metadata.csv: Contains information on movies (e.g., budget, revenue, genres, overview, release date, runtime).

- keywords.csv: Provides details about movie plot keywords.

- credits.csv (Optional): Offers details about cast and crew (potentially relevant for genre prediction).

*I will utilize the CSV files and potentially explore the full dataset (including user ratings) if time allows, considering the additional processing required for sentiment analysis.*
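Combining the files requires a shared key. The sketch below joins toy stand-ins for `movies_metadata.csv` and `keywords.csv` and rests on several assumptions: that both files expose an `id` column, that the metadata ids may load as text and include corrupt entries, and that the keyword strings shown are invented placeholders.

```python
import pandas as pd

# Toy stand-ins for the two files; column names are assumed, not verified.
metadata = pd.DataFrame({
    "id": ["862", "8844", "not-a-number"],  # ids assumed to load as strings
    "title": ["Toy Story", "Jumanji", "Corrupt row"],
})
keywords = pd.DataFrame({
    "id": [862, 8844],
    "keywords": ["toy friendship", "board game"],  # invented placeholders
})

# Coerce ids to numbers and drop rows whose id cannot be parsed.
metadata["id"] = pd.to_numeric(metadata["id"], errors="coerce")
metadata = metadata.dropna(subset=["id"]).astype({"id": "int64"})

# Left join keeps every movie, even one without keywords.
merged = metadata.merge(keywords, on="id", how="left")
print(merged)
```

Aligning the key's dtype before merging matters because pandas refuses to merge a text column with an integer column.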

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Loading a CSV into a DataFrame

In [4]:
movie = pd.read_csv('movies_metadata.csv')

In [5]:
type(movie)

pandas.core.frame.DataFrame

### Exploratory Data Analysis (EDA)

In [6]:
movie.head()

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
| 3 | False | | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
| 4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |
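The `genres` column in the output above holds each movie's genres as a string that looks like a Python list of dictionaries, so it needs parsing before it can serve as a label. A minimal sketch, assuming every cell is either such a string or a missing value (the helper name `genre_names` is hypothetical):

```python
import ast


def genre_names(raw):
    """Return the genre names from a stringified list of {'id', 'name'} dicts."""
    if not isinstance(raw, str):
        return []  # missing value (e.g. NaN)
    try:
        parsed = ast.literal_eval(raw)  # safe parsing of Python literals only
    except (ValueError, SyntaxError):
        return []  # malformed cell
    if not isinstance(parsed, list):
        return []
    return [g["name"] for g in parsed if isinstance(g, dict) and "name" in g]


# Value copied from row 4 of the output above.
print(genre_names("[{'id': 35, 'name': 'Comedy'}]"))  # → ['Comedy']

# Hand-built multi-genre string, for illustration only.
example = "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}]"
print(genre_names(example))  # → ['Animation', 'Adventure']
```

On the full DataFrame this could be applied as `movie['genres'].apply(genre_names)`. Because a movie can list several genres, a single-label classifier would need a rule for choosing one of them, such as the first listed genre; that rule is an assumption to validate, not something the data dictates.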
