# Data Preprocessing and Feature Selection
### Objective:

Perform data preprocessing tasks and select relevant features for further analysis.
### Steps:

##### CSV Data Loading:
Load the existing movie data from the CSV file into a DataFrame (data).
##### Missing Values Check:
Print the count of missing values in each column of the DataFrame to identify potential data gaps.
##### Feature Selection:
Drop unused or irrelevant features to streamline the dataset.
Check for the presence of each feature before dropping it to avoid errors.

### Code Summary:

The code first loads the existing movie data from the CSV file into a DataFrame.
It then checks for missing values in the DataFrame and prints the count of missing values for each column.
Unused features such as 'backdrop_path', 'poster_path', 'video', 'overview', and 'original_title' are dropped from the DataFrame if they exist.
Each feature is dropped individually, and error messages are printed if a feature is not found in the DataFrame.
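
The presence check and the drop can also be folded into a single call. The sketch below is one possible way to do that, not the notebook's own approach: it uses a small made-up DataFrame (`sample` and its values are illustrative, not taken from the movie CSV) and relies on the `errors='ignore'` option of `DataFrame.drop`, which skips labels that are not present.

```python
import pandas as pd

# Hypothetical stand-in for the movie data; values are illustrative only
sample = pd.DataFrame({
    "id": [1, 2],
    "title": ["A", "B"],
    "original_title": ["A", "B"],
    "video": [False, False],
})

unused_features = ["backdrop_path", "poster_path", "video", "overview", "original_title"]

# Report which of the unused features are absent from the frame
missing = [col for col in unused_features if col not in sample.columns]
for col in missing:
    print(f"'{col}' column not found.")

# errors='ignore' drops the labels that exist and silently skips the rest
trimmed = sample.drop(columns=unused_features, errors="ignore")
print(list(trimmed.columns))
```

Listing the absent columns separately keeps the diagnostic messages while the drop itself stays a single statement.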

In [1]:
import pandas as pd


# Load the existing CSV file into a DataFrame
data = pd.read_csv('movie_data.csv')

# Check missing values in data
print("Missing values in dataframe:")
print(data.isnull().sum())
print("data shape: ", data.shape)

# Drop unused features one at a time, checking that each exists first
unused_features = ['backdrop_path', 'poster_path', 'video', 'overview', 'original_title']
for feature in unused_features:
    if feature in data.columns:
        data = data.drop(columns=[feature])
    else:
        print(f"'{feature}' column not found.")



Missing values in dataframe:
adult                   0
genre_ids               0
id                      0
original_language       0
original_title          0
popularity              0
release_date            0
title                   0
vote_average            0
vote_count              0
revenue                 0
production_companies    0
budget                  0
dtype: int64
data shape:  (9089, 13)
'backdrop_path' column not found.
'poster_path' column not found.
'video' column not found.
'overview' column not found.


In [2]:
# Confirm the remaining columns and their non-null counts
data.count()

adult                   9089
genre_ids               9089
id                      9089
original_language       9089
popularity              9089
release_date            9089
title                   9089
vote_average            9089
vote_count              9089
revenue                 9089
production_companies    9089
budget                  9089
dtype: int64

# Feature Engineering
### Objective:

Create new features from existing data to enhance the predictive power of the model or provide additional insights into the dataset.
### Steps:

##### Release Year Extraction:
Extract the release year from the 'release_date' column and create a new feature named 'release_year'.
##### Release Month Extraction:
Extract the release month from the 'release_date' column and create a new feature named 'release_month'.
##### Genre Count:
Count the number of genres associated with each movie and create a new feature named 'num_genres'.
##### Release Month to Season Mapping:
Map each release month to its corresponding season (e.g., December, January, February -> Winter) and create a new feature named 'release_season'.
##### Budget-to-Revenue Ratio Calculation:
Calculate the budget-to-revenue ratio for each movie and create a new feature named 'budget_revenue_ratio'.
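
Since revenue can be zero for some movies, a plain division can produce infinite or undefined ratios. The following is a minimal sketch, using made-up numbers in a hypothetical `toy` frame, of one way to guard against that: zero revenue is replaced with NaN before dividing, so those rows get a missing ratio instead of `inf`. Whether a missing ratio is the right treatment depends on the analysis; that choice is an assumption here.

```python
import numpy as np
import pandas as pd

# Made-up budgets and revenues covering every zero / non-zero combination
toy = pd.DataFrame({"budget": [100, 50, 0, 0], "revenue": [200, 0, 80, 0]})

# Plain division: x / 0 gives inf and 0 / 0 gives NaN
naive = toy["budget"] / toy["revenue"]

# Guarded division: treat zero revenue as missing so the ratio is NaN
safe = toy["budget"] / toy["revenue"].replace(0, np.nan)

print(pd.DataFrame({"naive": naive, "safe": safe}))
```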

In [3]:
#FEATURE ENGINEERING

# Parse release_date once, then extract the release year and month
release_dates = pd.to_datetime(data['release_date'])
data['release_year'] = release_dates.dt.year
data['release_month'] = release_dates.dt.month

# Count the number of genres
data['num_genres'] = data['genre_ids'].apply(lambda x: len(x.split(',')))

# Map release month to season
season_map = {1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring', 6: 'Summer',
              7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall', 11: 'Fall', 12: 'Winter'}
data['release_season'] = data['release_month'].map(season_map)

# Calculate budget-to-revenue ratio
# Note: rows with zero revenue yield inf (or NaN when budget is also zero)
data['budget_revenue_ratio'] = data['budget'] / data['revenue']
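
One detail of the genre count is worth checking: splitting on commas returns 1 even for an entry with no genres, because splitting a string with no comma yields a single piece. The sketch below assumes each 'genre_ids' value is the string form of a Python list such as `"[28, 12]"`; that format is an assumption, as are the `toy` frame and the `count_genres` helper. Under that assumption, parsing the list gives a count of 0 for empty entries.

```python
import ast
import pandas as pd

# Assumed format: each genre_ids entry is the string form of a Python list
toy = pd.DataFrame({"genre_ids": ["[28, 12, 878]", "[18]", "[]"]})

def count_genres(raw: str) -> int:
    """Parse the list literal and return how many IDs it holds."""
    return len(ast.literal_eval(raw))

# Comma splitting counts pieces of text, so "[]" still counts as one
toy["num_genres_split"] = toy["genre_ids"].apply(lambda x: len(x.split(",")))

# Parsing the literal counts actual list elements
toy["num_genres_parsed"] = toy["genre_ids"].apply(count_genres)

print(toy)
```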

# Segmentation Analysis
### Objective:

Segment the dataset based on the presence or absence of zero values for revenue and budget to identify patterns or anomalies within the data.
### Steps:

##### Zero Value Masks Creation:
Create boolean masks to identify rows with zero values for revenue and budget.
##### Segmentation:
Segment the dataset into four categories. The first two overlap with each other, and both are subsets of the last:
Rows with zero values for revenue (revenue_zero_data).
Rows with zero values for budget (budget_zero_data).
Rows with non-zero values for both revenue and budget (non_zero_data).
Rows with zero values for either revenue or budget (zero_data).
##### Segment Sizes Display:
Print the sizes of each segment to understand the distribution of data across different categories.
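
The steps above can be sketched on a tiny made-up table to show how the boolean masks combine. The `toy` frame and its values are illustrative only; the `both_zero` segment is an extra not described in the steps, added to make the overlap between the two single-column segments visible.

```python
import pandas as pd

# Made-up rows covering every zero / non-zero combination
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "revenue": [0, 120, 0, 300, 45],
    "budget": [0, 0, 60, 150, 0],
})

# Boolean masks: True where the column holds a zero
revenue_zero = toy["revenue"] == 0
budget_zero = toy["budget"] == 0

either_zero = toy[revenue_zero | budget_zero]      # zero in at least one column
neither_zero = toy[~(revenue_zero | budget_zero)]  # complement of either_zero
both_zero = toy[revenue_zero & budget_zero]        # overlap of the two masks

print("zero revenue:", int(revenue_zero.sum()))
print("zero budget:", int(budget_zero.sum()))
print("zero in either:", len(either_zero))
print("zero in neither:", len(neither_zero))
print("zero in both:", len(both_zero))
```

Only the "either" and "neither" segments partition the table; the two single-column segments share every row where both values are zero.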

In [4]:
#SEGMENTATION

# Create masks to identify rows with zero values for revenue and budget
revenue_zero_mask = data['revenue'] == 0
budget_zero_mask = data['budget'] == 0

# Segment the data based on zero values for revenue and budget
revenue_zero_data = data[revenue_zero_mask]
budget_zero_data = data[budget_zero_mask]
non_zero_data = data[~(revenue_zero_mask | budget_zero_mask)]
zero_data = data[(revenue_zero_mask | budget_zero_mask)]

# Display the sizes of each segment
print("Number of rows with zero revenue:", len(revenue_zero_data))
print("Number of rows with zero budget:", len(budget_zero_data))
print("Number of rows with non-zero revenue and budget:", len(non_zero_data))
print("Number of rows with zero revenue or budget:", len(zero_data))

Number of rows with zero revenue: 2371
Number of rows with zero budget: 2773
Number of rows with non-zero revenue and budget: 5745
Number of rows with zero revenue or budget: 3344
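
The printed sizes can be cross-checked with a little arithmetic. The sketch below copies the counts from the output above (the variable names are illustrative) and applies inclusion-exclusion to derive how many rows have a zero in both columns, a figure the notebook does not print directly.

```python
# Segment sizes copied from the printed output; total from the data shape
total_rows = 9089
zero_revenue = 2371
zero_budget = 2773
non_zero = 5745
zero_either = 3344

# The "either" and "non-zero" segments are complements, so they sum to the total
assert non_zero + zero_either == total_rows

# Inclusion-exclusion: rows counted in both single-column segments
zero_both = zero_revenue + zero_budget - zero_either
print("Rows with zero revenue and zero budget:", zero_both)
```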


### Saving the Segments:

Each segment is saved to its own CSV file:
##### revenue_zero_data.csv:
Contains rows with zero values for revenue.
##### budget_zero_data.csv:
Contains rows with zero values for budget.
##### non_zero_data.csv:
Contains rows with non-zero values for both revenue and budget.
##### zero_data.csv:
Contains rows with zero values for either revenue or budget.

In [5]:
revenue_zero_data.to_csv('revenue_zero_data.csv', index=False)
budget_zero_data.to_csv('budget_zero_data.csv', index=False)
non_zero_data.to_csv('non_zero_data.csv', index=False)
zero_data.to_csv('zero_data.csv', index=False)
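
A quick way to confirm that a saved file matches its segment is to read it back and compare. The sketch below does this with a made-up segment written to a temporary directory; `toy_segment` and the temporary path are illustrative assumptions, not the notebook's files. Passing `index=False` keeps the row index out of the file, so the reloaded frame has the same columns as the original.

```python
import os
import tempfile

import pandas as pd

# Made-up segment standing in for one of the saved DataFrames
toy_segment = pd.DataFrame({"title": ["A", "B"], "revenue": [0, 0], "budget": [10, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "revenue_zero_data.csv")
    # index=False avoids writing the row index as an extra column
    toy_segment.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare column names, shape, and cell values after the round trip
same_columns = list(reloaded.columns) == list(toy_segment.columns)
same_shape = reloaded.shape == toy_segment.shape
same_values = reloaded.values.tolist() == toy_segment.values.tolist()
print(same_columns, same_shape, same_values)
```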