# Movies and Shows Data Analysis

## Project Overview

This project analyzes a dataset of movies and shows using Python and pandas. The analysis includes data cleaning, exploration, and the creation of custom functions to extract insights about actors, genres, decades, and IMDb ratings.

## Dataset Description

The dataset `movies_and_shows.csv` contains information about various movies and shows, including:

- **name**: The name of the actor or actress
- **Character**: The character they played
- **role**: The role type (e.g., ACTOR)
- **title**: The title of the movie or show
- **type**: Whether it's a MOVIE or SHOW
- **release_year**: The year it was released
- **genres**: A list of genres the movie or show belongs to
- **imdb_score**: The IMDb score of the movie or show
- **imdb_votes**: The number of votes on IMDb

---

## 1. Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd

In [None]:
# Read the dataset into a DataFrame
df = pd.read_csv('data/movies_and_shows.csv')

In [None]:
# Display the first few rows to understand the structure
df.head()

## 2. Data Cleaning

The dataset contains inconsistent column names with mixed cases and special characters. Let's standardize them.

In [None]:
# View current column names
print("Original column names:")
print(df.columns.tolist())

In [None]:
# Standardize column names: trim surrounding whitespace, lowercase,
# replace spaces with underscores, and fix a zero typed in place of the letter 'o'
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('0', 'o')

print("\nCleaned column names:")
print(df.columns.tolist())

In [None]:
# Verify the changes
df.head()

## 3. Data Exploration

In [None]:
# Display basic information about the dataset
df.info()

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

In [None]:
# Display summary statistics
df.describe()

## 4. Filtering and Analysis

### 4.1 Filtering by Genre

In [None]:
# Filter for movies/shows that contain 'drama' in their genres
drama_df = df[df['genres'].str.contains('drama', case=False, na=False)]

print(f"Found {len(drama_df)} entries containing 'drama'")
drama_df.head()
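
A substring match also catches any genre name that merely contains the search term. The sketch below shows one way to test for exact membership instead. It assumes `genres` is stored as a list-like string such as `"['drama', 'crime']"`, which this notebook does not confirm; the `sample` frame and the `parse_genres` helper are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical rows; assumes genres are stored as list-like strings
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ["['drama', 'crime']", "['melodrama']", None],
})

def parse_genres(value):
    """Turn a list-like string into a Python list; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)

genre_lists = sample['genres'].apply(parse_genres)

# Exact membership test: 'melodrama' does not count as 'drama'
is_drama = genre_lists.apply(lambda genres: 'drama' in genres)
exact_drama_titles = sample.loc[is_drama, 'title'].tolist()
print(exact_drama_titles)
```

If the real column uses a different encoding (for example, comma-separated text), the parsing step would need to change accordingly.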

### 4.2 Filtering by Decade

In [None]:
# Function to filter movies/shows by decade
def filter_by_decade(df, start_year):
    """
    Filter the DataFrame for movies/shows released in a specific decade.
    
    Parameters:
    - df: DataFrame to filter
    - start_year: Starting year of the decade (e.g., 1980 for 1980s)
    
    Returns:
    - Filtered DataFrame
    """
    end_year = start_year + 9
    return df[(df['release_year'] >= start_year) & (df['release_year'] <= end_year)]

In [None]:
# Example: Filter movies from the 1990s
movies_90s = filter_by_decade(df, 1990)
print(f"Movies/shows from the 1990s: {len(movies_90s)}")
movies_90s.head()
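
Calling `filter_by_decade` once per decade becomes repetitive when comparing several decades at once. As a minimal sketch, assuming `release_year` holds numeric years, integer division can label every row with its decade in one step. The `sample` frame and the `decade` column are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'release_year': [1989, 1990, 1999, 2000],
})

# Floor-divide by 10, then multiply back: 1994 -> 1990
sample['decade'] = (sample['release_year'] // 10) * 10

# Count rows per decade in chronological order
decade_counts = sample['decade'].value_counts().sort_index()
print(decade_counts)
```

The same column could then feed a `groupby` to summarize scores or genres per decade.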

### 4.3 High-Rated Movies

In [None]:
# Filter for highly-rated content (IMDb score >= 8.0)
high_rated = df[df['imdb_score'] >= 8.0]

print(f"Number of highly-rated titles (>=8.0): {len(high_rated)}")
(
    high_rated[['title', 'imdb_score', 'release_year']]
    .drop_duplicates()
    .sort_values('imdb_score', ascending=False)
    .head(10)
)

## 5. Custom Functions

### 5.1 Get Actors for a Title

In [None]:
def get_actors_for_title(title):
    """
    Retrieve a comma-separated list of actors for a given movie/show title.
    
    Parameters:
    - title: The title of the movie or show
    
    Returns:
    - String of actor names separated by commas
    """
    # Filter for rows with the specified title and role as 'ACTOR'
    title_actors_df = df[(df['title'] == title) & (df['role'] == 'ACTOR')]
    
    # Extract actor names, skipping missing values so join() does not fail
    actor_names = title_actors_df['name'].dropna()
    
    # Combine names into a single string
    return ', '.join(actor_names)

In [None]:
# Example usage
print("Actors in 'Taxi Driver':")
print(get_actors_for_title("Taxi Driver"))

### 5.2 Categorize Movies by IMDb Score

In [None]:
def categorize_imdb_score(title):
    """
    Categorize a movie/show based on its IMDb score.
    
    Categories:
    - Excellent: >= 9.0
    - Good: >= 7.0 and < 9.0
    - Average: >= 5.0 and < 7.0
    - Low: < 5.0
    
    Parameters:
    - title: The title of the movie or show
    
    Returns:
    - Category string or 'Title not found'
    """
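    # Collect the non-missing IMDb scores for rows with the specified title
    imdb_scores = df[df['title'] == title]['imdb_score'].dropna().tolist()
    
    # Check if the title exists (a title with no recorded score is treated the same way)
    if not imdb_scores:
        return 'Title not found'
    
    # Use the first score; a title can span several rows, one per cast member
    imdb_score = float(imdb_scores[0])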
    
    # Categorize and return
    if imdb_score >= 9.0:
        return 'Excellent'
    elif imdb_score >= 7.0:
        return 'Good'
    elif imdb_score >= 5.0:
        return 'Average'
    else:
        return 'Low'

In [None]:
# Test the categorization function
test_titles = ["Taxi Driver", "The Godfather", "Plan 9 from Outer Space"]

for title in test_titles:
    category = categorize_imdb_score(title)
    print(f"{title}: {category}")
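
Looking up one title at a time works for spot checks, but labeling every row calls for a vectorized approach. The sketch below uses `pd.cut` with left-closed bins that mirror the thresholds above. It assumes scores are non-negative; the `scores` series is hypothetical sample data, not values from the dataset.

```python
import pandas as pd

# Hypothetical scores; the real values would come from df['imdb_score']
scores = pd.Series([9.3, 8.9, 7.0, 6.9, 5.0, 4.2, None], name='imdb_score')

# Left-closed bins [0, 5), [5, 7), [7, 9), [9, inf) mirror the if/elif thresholds
bins = [0, 5.0, 7.0, 9.0, float('inf')]
labels = ['Low', 'Average', 'Good', 'Excellent']

# right=False makes each bin include its lower edge and exclude its upper edge
categories = pd.cut(scores, bins=bins, labels=labels, right=False)
print(categories.tolist())
```

Unlike a chain of comparisons, `pd.cut` leaves missing scores as missing rather than assigning them a tier.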

## 6. Summary and Insights

### Key Findings:

1. **Data Quality**: The dataset contained inconsistent column names, which were standardized for easier use.

2. **Content Distribution**: The dataset includes both movies and shows across multiple decades.

3. **Rating Analysis**: Titles can be categorized into quality tiers by IMDb score, which helps identify highly rated content.

4. **Actor Information**: Custom functions allow for easy retrieval of cast information for any title.

### Potential Next Steps:

- Analyze genre trends over time
- Identify most prolific actors or highest-rated actors
- Compare movie vs. show ratings
- Investigate correlation between number of votes and IMDb scores
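
Two of these steps can be sketched with a `groupby` and a correlation call. This is only an illustration on hypothetical data: the `titles` frame below is invented, and the real dataset would likely need reducing to one row per title first (for example with `drop_duplicates`), since each title appears once per cast member.

```python
import pandas as pd

# Hypothetical title-level rows, one row per title
titles = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'type': ['MOVIE', 'MOVIE', 'SHOW', 'SHOW'],
    'imdb_score': [6.0, 8.0, 7.0, 9.0],
    'imdb_votes': [1000, 3000, 2000, 4000],
})

# Mean IMDb score per content type
mean_by_type = titles.groupby('type')['imdb_score'].mean()
print(mean_by_type)

# Pearson correlation between vote count and score
votes_score_corr = titles['imdb_votes'].corr(titles['imdb_score'])
print(f"Correlation between votes and score: {votes_score_corr:.2f}")
```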