# Tollywood Movie Data Cleaning

## 1. Load Data

In [1]:
!pip install openpyxl



In [21]:
import pandas as pd
import numpy as np
# openpyxl does not need to be imported here; pandas loads it as the engine for .xlsx files

# --- IMPORTANT: Make sure the path to your file is correct ---
file_path = '/home/nineleaps/Documents/Python_Training/Final_test/movies_data/raw_data/tollywood_movies.xlsx'

try:
    df = pd.read_excel(file_path, engine='openpyxl')
    print(f"Successfully loaded data from {file_path}")
    # Display the first 5 rows
    print(df.head())
    
    # Display column names, non-null counts, and data types
    df.info()
except FileNotFoundError:
    print(f"ERROR: File not found at {file_path}")
    print("Please check that file_path points to the Excel file on this machine.")
    # Set df to None so subsequent cells know loading failed
    df = None
except Exception as e:
    print(f"An error occurred while loading the Excel file: {e}")
    df = None

Successfully loaded data from /home/nineleaps/Documents/Python_Training/Final_test/movies_data/raw_data/tollywood_movies.xlsx
   Unnamed: 0 MovieID                        Title          Director  \
0           0  MOV004  Baahubali 2: The Conclusion   S. S. Rajamouli   
1           1  MOV021     Baahubali: The Beginning   S. S. Rajamouli   
2           2  MOV023      Sye Raa Narasimha Reddy    Surender Reddy   
3           3  MOV025                       Jersey  Gowtam Tinnanuri   
4           4  MOV027              Geetha Govindam         Parasuram   

                 Genre  ReleaseYear  Budget (Crores)  BoxOffice (Crores)  \
0  Epic Fantasy Action          NaN              250                1810   
1  Epic Fantasy Action       2015.0              180                 650   
2    Historical Action       2019.0              200                 265   
3         Sports Drama       2019.0               20                  45   
4      Romantic Comedy       2018.0               10         

## 2. Initial Data Exploration

The first rows and the column data types were printed when the file was loaded. Here we count the missing values in each column.

In [22]:
# Check for missing values only if df was loaded successfully
if df is not None:
    print('\nMissing values before cleaning:')
    print(df.isnull().sum())
else:
    print('DataFrame df not loaded. Skipping missing value check.')


Missing values before cleaning:
Unnamed: 0            0
MovieID               0
Title                 0
Director              0
Genre                 0
ReleaseYear           2
Budget (Crores)       0
BoxOffice (Crores)    0
Rating                1
Duration (minutes)    1
LeadActor             0
LeadActress           0
Language              3
ProductionCompany     1
dtype: int64
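
The raw counts above can also be read as a share of rows, which helps when deciding between dropping and imputing. The following is a minimal sketch on a small made-up frame; `toy` and `missing_summary` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "ReleaseYear": [2015, np.nan, 2019, np.nan],
    "Rating": [8.1, 7.1, np.nan, 7.4],
    "Language": ["Telugu", None, "Telugu", "Telugu"],
})

# Count and percentage of missing values per column, worst column first.
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```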


## 3. Data Cleaning Steps

We will perform the following steps:
1. Drop redundant columns (`Unnamed: 0`).
2. Handle missing `ReleaseYear` by **dropping** rows.
3. Convert `ReleaseYear` to integer type.
4. Impute the remaining missing values: mean for `Rating`, median for `Duration (minutes)`, and mode for the categorical `Language` and `ProductionCompany` columns.
5. Correct data types (e.g., `Duration (minutes)` to integer).
6. Remove rows with duplicate titles (compared case-insensitively, ignoring leading and trailing whitespace).
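
To show how the steps fit together in order, here is a compact sketch that applies them to a tiny made-up frame. The helper name `clean_movies` and the sample values are assumptions for illustration. Following the notebook's cells, duplicates are identified by normalized title.

```python
import numpy as np
import pandas as pd

def clean_movies(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the six steps to a copy of raw."""
    out = raw.drop(columns=["Unnamed: 0"], errors="ignore")            # step 1
    out["ReleaseYear"] = pd.to_numeric(out["ReleaseYear"], errors="coerce")
    out = out.dropna(subset=["ReleaseYear"]).copy()                    # step 2
    out["ReleaseYear"] = out["ReleaseYear"].astype("Int64")            # step 3
    out["Rating"] = out["Rating"].fillna(out["Rating"].mean())         # step 4
    duration = out["Duration (minutes)"]
    out["Duration (minutes)"] = (
        duration.fillna(duration.median()).round().astype("Int64")     # steps 4 and 5
    )
    for col in ["Language", "ProductionCompany"]:
        modes = out[col].mode()
        out[col] = out[col].fillna(modes.iloc[0] if not modes.empty else "Unknown")
    key = out["Title"].str.lower().str.strip()                         # step 6
    return out[~key.duplicated(keep="first")].reset_index(drop=True)

# Made-up rows: one missing year, one near-duplicate title, scattered gaps.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Title": ["A", "B", "C", " c "],
    "ReleaseYear": [2015.0, np.nan, 2019.0, 2019.0],
    "Rating": [8.0, 7.0, np.nan, 6.0],
    "Duration (minutes)": [150.0, 160.0, np.nan, 170.0],
    "Language": ["Telugu", "Telugu", None, "Telugu"],
    "ProductionCompany": ["X", "Y", "X", None],
})

cleaned = clean_movies(raw)
print(cleaned)
```

Because rows without a `ReleaseYear` are dropped before imputation, the mean, median, and mode are computed only from the rows that survive.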

### 3.1 Drop Redundant Columns

In [23]:
if df is not None:
    if 'Unnamed: 0' in df.columns:
        df = df.drop('Unnamed: 0', axis=1)
        print("Dropped 'Unnamed: 0' column.")
    else:
        print("'Unnamed: 0' column not found.")
else:
    print('DataFrame df not loaded.')

Dropped 'Unnamed: 0' column.
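
The explicit membership check above can also be expressed with the `errors="ignore"` option of `DataFrame.drop`, which makes the call a no-op when the column is absent. A small sketch on made-up data:

```python
import pandas as pd

# Made-up frame with a leftover index column.
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "Title": ["Jersey", "Bheeshma"]})

# errors="ignore" skips labels that are not present,
# so the cell can be re-run without raising a KeyError.
toy = toy.drop(columns=["Unnamed: 0"], errors="ignore")
toy = toy.drop(columns=["Unnamed: 0"], errors="ignore")  # second call changes nothing

print(list(toy.columns))
```

A column named `Unnamed: 0` usually appears when a file was saved with its index. If that is the case here (an assumption, since the notebook does not say how the file was produced), passing `index_col=0` to `pd.read_excel` would avoid creating the column in the first place.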


### 3.2 Handle Missing Values and Data Types

**ReleaseYear: Drop Missing Rows and Convert Type**

In [24]:
# Drop rows with missing ReleaseYear and convert type

if df is not None and 'ReleaseYear' in df.columns:
    initial_rows = len(df)
    print(f"Initial rows: {initial_rows}")

    # --- Step 1: Convert to numeric (important to identify NaNs correctly) ---
    # Coerce errors will turn non-numeric entries (like unexpected strings) into NaN
    df['ReleaseYear'] = pd.to_numeric(df['ReleaseYear'], errors='coerce')

    # --- Step 2: Check how many NaNs are in ReleaseYear ---
    missing_year_count = df['ReleaseYear'].isnull().sum()
    print(f"Found {missing_year_count} missing values in 'ReleaseYear'.")

    # --- Step 3: Drop rows where 'ReleaseYear' is NaN ---
    df.dropna(subset=['ReleaseYear'], inplace=True) # inplace=True modifies df directly

    rows_dropped = initial_rows - len(df)
    if rows_dropped > 0:
        print(f"Dropped {rows_dropped} row(s) due to missing 'ReleaseYear'.")
        print(f"Rows remaining: {len(df)}")
    else:
        print("No rows dropped due to missing 'ReleaseYear'.")

    # --- Step 4: Attempt to convert ReleaseYear to Integer ---
    # Now that NaNs are removed, we can try converting to a nullable integer type.
    # This handles cases where the original data might have had years like 2015.0
    if not df['ReleaseYear'].empty:  # Check that rows remain after dropping
        try:
            # Check that all remaining years are whole numbers (e.g., 2015.0)
            if (df['ReleaseYear'] == np.floor(df['ReleaseYear'])).all():
                df['ReleaseYear'] = df['ReleaseYear'].astype('Int64')
                print("Successfully converted 'ReleaseYear' column to Int64.")
            else:
                print("Warning: Some 'ReleaseYear' values are not whole numbers. Keeping as float.")
                # Optionally, round them before converting:
                # df['ReleaseYear'] = df['ReleaseYear'].round().astype('Int64')
                # print("Rounded and converted 'ReleaseYear' to Int64.")

        except Exception as e:
            print(f"ERROR converting ReleaseYear to Int64 after dropping NaNs: {e}. Keeping original type.")
    else:
        print("DataFrame is empty after dropping rows with missing 'ReleaseYear'.")

else:
    # Handle the case where df is None or the column doesn't exist
    if df is None:
        print('DataFrame df not loaded.')
    else:
        print('ReleaseYear column missing.')

# Display info to see the effect immediately (optional)
if df is not None:
    print("\nDataFrame info after handling ReleaseYear:")
    df.info()

Initial rows: 10
Found 2 missing values in 'ReleaseYear'.
Dropped 2 row(s) due to missing 'ReleaseYear'.
Rows remaining: 8
Successfully converted 'ReleaseYear' column to Int64.

DataFrame info after handling ReleaseYear:
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, 1 to 9
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MovieID             8 non-null      object 
 1   Title               8 non-null      object 
 2   Director            8 non-null      object 
 3   Genre               8 non-null      object 
 4   ReleaseYear         8 non-null      Int64  
 5   Budget (Crores)     8 non-null      int64  
 6   BoxOffice (Crores)  8 non-null      int64  
 7   Rating              7 non-null      float64
 8   Duration (minutes)  7 non-null      float64
 9   LeadActor           8 non-null      object 
 10  LeadActress         8 non-null      object 
 11  Language            5 non-null      objec
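
The coerce, drop, and convert sequence in the cell above can be seen in isolation on a short series. The values below are made up to cover a float year, a numeric string, a junk string, and a missing entry.

```python
import pandas as pd

# Made-up values standing in for a messy ReleaseYear column.
years = pd.Series([2015.0, "2019", "unknown", None], name="ReleaseYear")

numeric = pd.to_numeric(years, errors="coerce")  # "unknown" and None become NaN
kept = numeric.dropna()                          # drop the entries with no usable year
as_int = kept.astype("Int64")                    # 2015.0 becomes 2015

print(as_int.tolist())
```

Coercing first means that unexpected strings are treated the same way as genuinely missing values, so they are dropped rather than causing the integer conversion to fail.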

**Rating: Impute Missing with Mean**

In [25]:
if df is not None and 'Rating' in df.columns:
    # Ensure column is numeric before calculating mean
    df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
    mean_rating = df['Rating'].mean()
    if pd.notna(mean_rating):
        missing_before = df['Rating'].isnull().sum()
        df['Rating'] = df['Rating'].fillna(mean_rating)
        if missing_before > 0:
            print(f"Imputed {missing_before} missing 'Rating' with mean ({mean_rating:.2f}).")
        else:
            print("No missing 'Rating' values found to impute.")
    else:
        print("Could not calculate a valid mean rating (perhaps all values were missing?). Imputation skipped.")
elif df is not None:
    print("'Rating' column not found.")
else:
    print('DataFrame df not loaded.')

Imputed 1 missing 'Rating' with mean (7.69).
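
Filling with the unrounded mean leaves the imputed row with more decimal places than the observed ratings. One possible variation, sketched here on made-up values, rounds the mean to one decimal and records which rows were imputed. The `was_imputed` flag is an illustrative addition, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one gap.
ratings = pd.Series([8.1, 7.1, np.nan, 7.8], name="Rating")

was_imputed = ratings.isna()                 # remember which rows get a filled value
mean_rating = round(ratings.mean(), 1)       # mean() skips NaN by default
filled = ratings.fillna(mean_rating)

print(filled.tolist(), was_imputed.tolist())
```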


**Duration (minutes): Impute Missing with Median and Convert Type**

In [26]:
if df is not None and 'Duration (minutes)' in df.columns:
    # Ensure the column is numeric, coercing errors
    df['Duration (minutes)'] = pd.to_numeric(df['Duration (minutes)'], errors='coerce')
    median_duration = df['Duration (minutes)'].median()

    if pd.notna(median_duration):
        missing_before = df['Duration (minutes)'].isnull().sum()
        df['Duration (minutes)'] = df['Duration (minutes)'].fillna(median_duration)
        # Convert to nullable integer type
        try:
            df['Duration (minutes)'] = df['Duration (minutes)'].astype('Int64')
            if missing_before > 0:
                print(f"Imputed {missing_before} missing 'Duration (minutes)' with median ({median_duration}) and converted to Int64.")
            else:
                print("No missing 'Duration (minutes)' found. Converted column to Int64.")
        except Exception as e:
            print(f"ERROR converting Duration (minutes) to Int64: {e}")
    else:
        print("Could not calculate a valid median duration. Imputation and conversion skipped.")
elif df is not None:
    print("'Duration (minutes)' column not found.")
else:
    print('DataFrame df not loaded.')

Imputed 1 missing 'Duration (minutes)' with median (159.0) and converted to Int64.
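
With an even number of observed values the median can fall halfway between two of them, and pandas generally refuses to cast fractional floats to `Int64`. A hedged sketch of one way to guard against that, using made-up durations, is to round before converting:

```python
import numpy as np
import pandas as pd

# Made-up durations with four observed values, so the median is fractional.
durations = pd.Series([148.0, 159.0, 160.0, 171.0, np.nan])

median_duration = durations.median()         # halfway between 159 and 160
filled = durations.fillna(median_duration)

# Rounding first keeps the Int64 cast valid even when the median is fractional.
as_int = filled.round().astype("Int64")

print(median_duration, as_int.tolist())
```

In the notebook's run the median was a whole number (159.0), so the direct cast succeeded. The rounding step only matters for other inputs.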


**Language: Impute Missing with Mode**

In [27]:
if df is not None and 'Language' in df.columns:
    # Calculate mode - might return multiple values if tied
    mode_language = df['Language'].mode()
    # Use the first mode if available, otherwise use 'Unknown'
    fill_value = mode_language[0] if not mode_language.empty else 'Unknown'
    missing_before = df['Language'].isnull().sum()
    df['Language'] = df['Language'].fillna(fill_value)
    if missing_before > 0:
        print(f"Imputed {missing_before} missing 'Language' with mode ('{fill_value}').")
    else:
        print("No missing 'Language' values found to impute.")
elif df is not None:
    print("'Language' column not found.")
else:
    print('DataFrame df not loaded.')

Imputed 3 missing 'Language' with mode ('Telugu').
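
As the comment in the cell notes, `mode()` can return more than one value. The sketch below, on made-up data, shows what taking the first entry means when there is a tie: the tied values come back sorted, so the fill value is the alphabetically first one.

```python
import pandas as pd

# Made-up column where two languages are equally frequent.
languages = pd.Series(["Telugu", "Tamil", "Telugu", "Tamil", None])

modes = languages.mode()   # all tied values, sorted; missing entries are ignored
fill_value = modes.iloc[0] if not modes.empty else "Unknown"
filled = languages.fillna(fill_value)

print(modes.tolist(), fill_value)
```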


**ProductionCompany: Impute Missing with Mode**

In [28]:
if df is not None and 'ProductionCompany' in df.columns:
    mode_company = df['ProductionCompany'].mode()
    fill_value = mode_company[0] if not mode_company.empty else 'Unknown'
    missing_before = df['ProductionCompany'].isnull().sum()
    df['ProductionCompany'] = df['ProductionCompany'].fillna(fill_value)
    if missing_before > 0:
        print(f"Imputed {missing_before} missing 'ProductionCompany' with mode ('{fill_value}').")
    else:
        print("No missing 'ProductionCompany' values found to impute.")
elif df is not None:
    print("'ProductionCompany' column not found.")
else:
    print('DataFrame df not loaded.')

Imputed 1 missing 'ProductionCompany' with mode ('Arka Media Works').
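
The `Language` and `ProductionCompany` cells follow the same pattern, so they could be folded into one small function. This is a sketch only; `fill_with_mode` is a hypothetical helper name, and the data is made up to include a column with no observed values.

```python
import pandas as pd

def fill_with_mode(frame, columns, default="Unknown"):
    """Return a copy of frame with gaps in columns filled by each column's mode."""
    out = frame.copy()
    for col in columns:
        modes = out[col].mode()
        # An all-missing column has no mode, so fall back to the default label.
        out[col] = out[col].fillna(modes.iloc[0] if not modes.empty else default)
    return out

toy = pd.DataFrame({
    "Language": ["Telugu", None, "Telugu"],
    "ProductionCompany": [None, None, None],
})

filled = fill_with_mode(toy, ["Language", "ProductionCompany"])
print(filled)
```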


### 3.3 Remove Duplicate Rows

In [29]:
if df is not None and 'Title' in df.columns:
    print("\nStep 6: Checking for duplicate movie titles (case-insensitive, ignoring spaces)...")

    # Normalize titles by lowercasing and stripping leading/trailing whitespace
    df['Title_normalized'] = df['Title'].str.lower().str.strip()

    # Count duplicates based on the normalized title
    duplicate_count = df.duplicated(subset=['Title_normalized']).sum()

    if duplicate_count > 0:
        print(f"Found {duplicate_count} duplicate movie title(s) (after normalization).")
        print("Dropping duplicates based on normalized title...")
        df.drop_duplicates(subset=['Title_normalized'], keep='first', inplace=True)
        # The shape printed here still counts the temporary 'Title_normalized' column
        print(f"{duplicate_count} duplicate title(s) dropped. New shape: {df.shape}")
    else:
        print("No duplicate movie titles found (after normalization).")

    # Drop the helper column
    df.drop(columns='Title_normalized', inplace=True)

    print("\n--- Data Cleaning Complete ---")
    print("-" * 30)
else:
    print("DataFrame not loaded or 'Title' column missing.")


Step 6: Checking for duplicate movie titles (case-insensitive, ignoring spaces)...
Found 1 duplicate movie title(s) (after normalization).
Dropping duplicates based on normalized title...
1 duplicate title(s) dropped. New shape: (7, 14)

--- Data Cleaning Complete ---
------------------------------
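
The cell above trims only the ends of each title, so titles that differ by internal spacing would still count as distinct. A sketch of a slightly stricter key follows, on made-up titles. It also collapses internal whitespace and uses the key directly instead of a temporary column, which is a variation on the notebook's approach rather than what it does.

```python
import pandas as pd

# Made-up titles with case, edge-space, and internal-space differences.
titles = pd.DataFrame({
    "Title": ["Jersey", " jersey ", "Dear  Comrade", "Dear Comrade"],
})

# Lowercase, trim the ends, and collapse runs of whitespace to a single space.
key = (
    titles["Title"]
    .str.lower()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

# Filtering with the key avoids adding and then removing a helper column.
deduped = titles[~key.duplicated(keep="first")]

print(deduped["Title"].tolist())
```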


## 4. Final Check

In [30]:
if df is not None:
    # Display cleaned data info
    print('\nCleaned DataFrame Info:')
    df.info()
    
    # Display first 5 rows of cleaned data
    print('\nCleaned DataFrame Head:')
    print(df.head())
else:
    print('DataFrame df not loaded. Cannot display final check.')


Cleaned DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 1 to 9
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MovieID             7 non-null      object 
 1   Title               7 non-null      object 
 2   Director            7 non-null      object 
 3   Genre               7 non-null      object 
 4   ReleaseYear         7 non-null      Int64  
 5   Budget (Crores)     7 non-null      int64  
 6   BoxOffice (Crores)  7 non-null      int64  
 7   Rating              7 non-null      float64
 8   Duration (minutes)  7 non-null      Int64  
 9   LeadActor           7 non-null      object 
 10  LeadActress         7 non-null      object 
 11  Language            7 non-null      object 
 12  ProductionCompany   7 non-null      object 
dtypes: Int64(2), float64(1), int64(2), object(8)
memory usage: 798.0+ bytes

Cleaned DataFrame Head:
  MovieID                     Title       
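
Beyond reading the `info()` output by eye, the same expectations can be written as explicit checks. The sketch below uses a two-row stand-in built from values shown in this notebook; the `checks` dictionary and its labels are illustrative.

```python
import pandas as pd

# Two-row stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    "MovieID": ["MOV025", "MOV036"],
    "Title": ["Jersey", "Bheeshma"],
    "ReleaseYear": pd.array([2019, 2020], dtype="Int64"),
    "Duration (minutes)": pd.array([159, 145], dtype="Int64"),
})

checks = {
    "no missing values": not cleaned.isna().any().any(),
    "no leftover helper columns": not ({"Unnamed: 0", "Title_normalized"} & set(cleaned.columns)),
    "titles are unique": cleaned["Title"].str.lower().str.strip().is_unique,
    "ReleaseYear is Int64": str(cleaned["ReleaseYear"].dtype) == "Int64",
}

for name, passed in checks.items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```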

## 5. Save Cleaned Data

In [32]:
if df is not None:
    try:
        # Define the output path
        output_file = '/home/nineleaps/Documents/Python_Training/Final_test/movies_data/cleaned_data_csv/cleaned_tollywood_movies.csv'
        # Save the DataFrame to CSV without the index
        df.to_csv(output_file, index=False)
        print(f"Cleaned data saved to {output_file}")
    except Exception as e:
        print(f"Error saving cleaned data: {e}")
else:
    print('DataFrame df not loaded. Cannot save.')

Cleaned data saved to /home/nineleaps/Documents/Python_Training/Final_test/movies_data/cleaned_data_csv/cleaned_tollywood_movies.csv
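
`to_csv` does not create missing folders, so the save fails if the output directory does not exist yet. A minimal sketch of building the path with `pathlib` and creating the folder first is shown below. A temporary directory stands in for the real project folder, and the one-row frame is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder frame; in the notebook this would be the cleaned df.
toy = pd.DataFrame({"MovieID": ["MOV025"], "Title": ["Jersey"]})

# A temporary directory stands in for the project folder in this sketch.
base_dir = Path(tempfile.mkdtemp())
output_file = base_dir / "cleaned_data_csv" / "cleaned_tollywood_movies.csv"

# Create the output folder (and any parents) before writing.
output_file.parent.mkdir(parents=True, exist_ok=True)
toy.to_csv(output_file, index=False)

# Reading the file back is a cheap check that the save worked.
reloaded = pd.read_csv(output_file)
print(reloaded)
```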


In [31]:
df

| | MovieID | Title | Director | Genre | ReleaseYear | Budget (Crores) | BoxOffice (Crores) | Rating | Duration (minutes) | LeadActor | LeadActress | Language | ProductionCompany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MOV021 | Baahubali: The Beginning | S. S. Rajamouli | Epic Fantasy Action | 2015 | 180 | 650 | 8.1 | 159 | Prabhas | Tamannaah | Telugu | Arka Media Works |
| 2 | MOV023 | Sye Raa Narasimha Reddy | Surender Reddy | Historical Action | 2019 | 200 | 265 | 7.1 | 167 | Chiranjeevi | Nayanthara | Telugu | Konidela Production Company |
| 3 | MOV025 | Jersey | Gowtam Tinnanuri | Sports Drama | 2019 | 20 | 45 | 7.8 | 159 | Nani | Shraddha Srinath | Telugu | Sithara Entertainments |
| 4 | MOV027 | Geetha Govindam | Parasuram | Romantic Comedy | 2018 | 10 | 130 | 7.685714 | 148 | Vijay Deverakonda | Rashmika Mandanna | Telugu | GA2 Pictures |
| 5 | MOV029 | Dear Comrade | Bharat Kamma | Romantic Drama | 2019 | 15 | 35 | 7.1 | 170 | Vijay Deverakonda | Rashmika Mandanna | Telugu | Mythri Movie Makers |
| 7 | MOV036 | Bheeshma | Venky Kudumula | Romantic Comedy | 2020 | 20 | 50 | 7.4 | 145 | Nithiin | Rashmika Mandanna | Telugu | Sithara Entertainments |
| 9 | MOV052 | Baahubali 2: The Conclusion | S. S. Rajamouli | Epic Fantasy Action | 2017 | 250 | 1810 | 8.2 | 171 | Prabhas | Anushka Shetty | Telugu | Arka Media Works |
