
In [1]:
import os

# List all files and folders in /kaggle/input
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/netflix-titles-dataset-for-visualization-practise/netflix_titles.csv


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Correct file path based on the output
file_path = "/kaggle/input/netflix-titles-dataset-for-visualization-practise/netflix_titles.csv"

# Load the CSV file
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully.")
    print(df.head())
except FileNotFoundError:
    print(f"File not found: {file_path}. Please check the file path.")

Dataset loaded successfully.
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  September 24, 2021          2021  TV-MA  2 Seasons   


📌 Day 2 - Data Analysis & Visualization Questions (Netflix Dataset)


1. Data Cleaning & Handling Missing Values
Q1: Identify Missing Values
Find the total missing values per column in the dataset.


Q2: Handle Missing Values in the cast Column
Replace missing values for cast with "Not Available".


Q3: Normalize Ratings
Some ratings may be inconsistent.
Merge similar categories (e.g., TV-MA and MA should be considered the same).


Q4: Detect & Remove Duplicate Titles
Check if there are duplicate titles in the dataset.
If duplicates exist, drop them while keeping the first occurrence.


2. Feature Engineering (Creating New Insights)
Q5: Extract the Primary Genre
The listed_in column contains multiple genres.
Extract only the first genre (e.g., "Dramas, International Movies" → "Dramas").


Q6: Categorize Titles by Decade
Create a new column that categorizes titles into decades (2000s, 2010s, etc.).


Q7: Identify the Top 10 Directors with Most Releases
Find the top 10 directors who have created the most content.


Q8: Add a Column for "Old vs. New"
Titles before 2000 → "Old"
Titles from 2000 onwards → "New"


3. Data Aggregation & Trend Analysis
Q9: Find the Most Common Movie Length Category
Among "Short", "Medium", and "Long", which category appears most frequently?

Q10: Which Countries Have Produced the Most Content?
Count the number of titles per country and display the top 10.

Q11: What is the Trend of Movies vs. TV Shows Over Time?
How has the ratio of Movies vs TV Shows changed over the years?
Visualize this trend using a line plot.

Q12: What are the Most Popular Genres?
Count the number of titles per primary genre and display the top 5.


4. Data Visualization & Insights
Q13: Bar Plot - Top 10 Countries Producing Content
Plot a bar chart showing the top 10 countries with the highest number of titles.

Q14: Heatmap - Correlation Between Numerical Features
Create a heatmap to check if release_year, duration, and added_year are correlated.

Q15: Box Plot - Duration Distribution for Movies & TV Shows
Compare the duration of Movies vs TV Shows using a boxplot.

Q16: Pie Chart - Content Distribution by Rating
Visualize the proportion of each rating category using a pie chart.

Q17: Line Plot - Trend of Movie vs. TV Show Releases Over Time
Create a line plot to visualize the number of Movies vs. TV Shows released each year.

Q1: Identify Missing Values
Find the total missing values per column in the dataset.

In [3]:
df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
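
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the notebook) that reports both the count and the percentage of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "R", np.nan, "G"],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(2),
})

# Keep only columns with at least one missing value, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to the real `df`, the same pattern would show at a glance that `director` is the column with the most gaps.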

In [4]:
df.describe().T

               count         mean       std     min     25%     50%     75%     max
release_year  8807.0  2014.180198  8.819312  1925.0  2013.0  2017.0  2019.0  2021.0


In [5]:
for x in df.columns:
    print(x)

show_id
type
title
director
cast
country
date_added
release_year
rating
duration
listed_in
description


In [6]:
df.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
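
The dtypes show `date_added` stored as `object`, while Q14 refers to an `added_year` column. One possible way to derive it is sketched below on toy data. The date format string and the whitespace stripping are assumptions based on the sample values shown in `df.head()`, not something the notebook confirms:

```python
import pandas as pd

# Toy values mimicking the 'Month day, year' strings seen in the head() output.
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " November 1, 2019", None],
})

# Strip stray whitespace, then parse; unparseable or missing entries become NaT.
parsed = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Nullable integer dtype keeps the year as an integer even when some rows are missing.
toy["added_year"] = parsed.dt.year.astype("Int64")
print(toy)
```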

Q2: Handle Missing Values in the cast Column
Replace missing values for cast with "Not Available".
Additionally, replace missing values in director with "Unknown" and in country with "Not Specified".

In [7]:
df['cast'] = df['cast'].fillna('Not Available')
df['director'] = df['director'].fillna('Unknown')
df['country'] = df['country'].fillna('Not Specified')
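
The same three replacements can also be expressed in a single call by passing a column-to-value mapping to `fillna`. This is a sketch on toy data (the `toy` frame and `fill_values` name are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same three columns.
toy = pd.DataFrame({
    "cast": ["Actor One", np.nan],
    "director": [np.nan, "Director Two"],
    "country": [np.nan, np.nan],
})

# One placeholder per column; columns not listed are left untouched.
fill_values = {
    "cast": "Not Available",
    "director": "Unknown",
    "country": "Not Specified",
}
toy = toy.fillna(value=fill_values)
print(toy)
```

Keeping the placeholders in one dictionary makes it easier to see, and later change, which label each column uses.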

Q3: Normalize Ratings
Some ratings may be inconsistent.
Merge similar categories (e.g., TV-MA and MA should be considered the same).

Issues Found

1️⃣ Invalid Ratings (Contain Duration)

'74 min', '84 min', '66 min' → These are durations, not ratings; the three affected rows are dropped.

2️⃣ Similar Categories That Need Merging

'TV-Y7-FV' should be merged with 'TV-Y7'.
'PG', 'G', and 'TV-G' are merged with 'TV-PG'.
Hyphens are replaced with spaces throughout, e.g. 'NC-17' → 'NC 17', 'PG-13' → 'PG 13', 'TV-MA' → 'TV MA'.
'UR' should be renamed to 'Unrated'.
'NR' should be renamed to 'Not Rated'.

3️⃣ Missing Values (nan)

Missing ratings are filled with 'Not Rated'.

In [8]:
print(df['rating'].unique())

['PG-13' 'TV-MA' 'PG' 'TV-14' 'TV-PG' 'TV-Y' 'TV-Y7' 'R' 'TV-G' 'G'
 'NC-17' '74 min' '84 min' '66 min' 'NR' nan 'TV-Y7-FV' 'UR']


In [9]:
# Drop rows whose rating is really a duration (e.g. '74 min').
df = df[~df['rating'].astype(str).str.contains('min', na=False)].copy()
# Alternative: keep those rows and relabel the rating instead.
# df['rating'] = df['rating'].apply(lambda x: 'Not Rated' if 'min' in str(x) else x)

# Standardize labels: hyphens become spaces, and PG, G and TV-G are merged into 'TV PG'.
df.loc[:, 'rating'] = df['rating'].replace({
    'TV-Y7-FV' : 'TV Y7',
    'NC-17' : 'NC 17',
    'PG-13' : 'PG 13',
    'UR' : 'Unrated',
    'NR' : 'Not Rated',
    'TV-14' : 'TV 14',
    'PG' : 'TV PG',
    'G' : 'TV PG',
    'TV-G' : 'TV PG',
    'TV-MA' : 'TV MA',
    'TV-PG' : 'TV PG',
    'TV-Y' : 'TV Y',
    'TV-Y7' : 'TV Y7'
    })

# Treat missing ratings as 'Not Rated'.
df.loc[:, 'rating'] = df['rating'].fillna('Not Rated')

In [10]:
df['rating'].value_counts()

rating
TV MA        3207
TV 14        2160
TV PG        1411
R             799
PG 13         490
TV Y7         340
TV Y          307
Not Rated      84
NC 17           3
Unrated         3
Name: count, dtype: int64
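
Dropping the rows with duration-like ratings discards three titles. A possible alternative, sketched here on toy data, keeps those rows: it moves the misplaced value into `duration` when that field is empty, then blanks the rating. That the misplaced value belongs in `duration` is an assumption, and the shorter `rating_map` below is illustrative rather than the mapping used in the notebook:

```python
import numpy as np
import pandas as pd

# Toy rows: one rating that is really a duration, one missing rating.
toy = pd.DataFrame({
    "rating": ["PG-13", "74 min", np.nan, "TV-Y7-FV", "UR"],
    "duration": ["90 min", np.nan, "1 Season", "2 Seasons", "88 min"],
})

# Flag ratings that look like '<number> min'.
misplaced = toy["rating"].str.contains(r"^\d+ min$", na=False)

# Assumption: if duration is empty, the misplaced value is the real duration.
move = misplaced & toy["duration"].isna()
toy.loc[move, "duration"] = toy.loc[move, "rating"]
toy.loc[misplaced, "rating"] = np.nan

# Illustrative mapping only; the notebook uses a longer one.
rating_map = {"TV-Y7-FV": "TV-Y7", "UR": "Unrated", "NR": "Not Rated"}
toy["rating"] = toy["rating"].replace(rating_map).fillna("Not Rated")
print(toy)
```

The anchored pattern `^\d+ min$` is stricter than a plain substring test for `'min'`, so it would not match a legitimate label that merely contains those letters.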

Q4: Detect & Remove Duplicate Titles
Check if there are duplicate titles in the dataset.
If duplicates exist, drop them while keeping the first occurrence.

In [11]:
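# If duplicates existed, they could be dropped while keeping the first occurrence:
# df = df.drop_duplicates(subset=['title'], keep='first')

# Count titles that repeat an earlier title.
df['title'].duplicated().sum()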


0
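
An exact comparison treats titles that differ only in case or surrounding whitespace as distinct. The sketch below, on invented toy titles, compares an exact check with a normalized one. Whether such near-duplicates exist in the real dataset is not established here:

```python
import pandas as pd

# Toy titles: 'zoom ' differs from 'Zoom' only by case and a trailing space.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "title": ["Zoom", "zoom ", "Zodiac", "Zoom"],
})

# Normalized key used only for comparison; the original title column is kept as-is.
normalized = toy["title"].str.strip().str.lower()

exact_dupes = int(toy["title"].duplicated().sum())
loose_dupes = int(normalized.duplicated().sum())

# Keep the first occurrence of each normalized title.
deduped = toy[~normalized.duplicated(keep="first")]
print(exact_dupes, loose_dupes)
print(deduped)
```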

Q5: Extract the Primary Genre
The listed_in column contains multiple genres.
Extract only the first genre (e.g., "Dramas, International Movies" → "Dramas").

In [12]:
df['Primary Genre'] = df['listed_in'].apply(lambda x : x.split(',')[0] if pd.notna(x) else 'Unknown')
df['Primary Genre']

0                  Documentaries
1         International TV Shows
2                 Crime TV Shows
3                     Docuseries
4         International TV Shows
                  ...           
8802                 Cult Movies
8803                    Kids' TV
8804                    Comedies
8805    Children & Family Movies
8806                      Dramas
Name: Primary Genre, Length: 8804, dtype: object
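
The same extraction can be written with pandas string methods instead of `apply`, which also makes it easy to trim stray whitespace. A minimal sketch on toy data (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Toy genre strings, including a missing entry.
toy = pd.DataFrame({
    "listed_in": ["Dramas, International Movies", "Kids' TV", None],
})

# Split on commas, take the first piece, trim whitespace, label missing rows.
toy["Primary Genre"] = (
    toy["listed_in"].str.split(",").str[0].str.strip().fillna("Unknown")
)
print(toy["Primary Genre"].tolist())
```

Missing entries pass through the string methods as missing values, so the final `fillna` plays the role of the `pd.notna` check in the lambda version.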

Q6: Categorize Titles by Decade
Create a new column that categorizes titles into decades (2000s, 2010s, etc.).

In [13]:
df['Decade Year'] = (df['release_year'] // 10) * 10
df['Decade Year'] = df['Decade Year'].astype(str) + "'s"
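
The question writes decade labels as "2000s" and "2010s", without an apostrophe. A small variant that produces that form is sketched below on toy years (the `decade` column name is an illustrative assumption, distinct from the notebook's `Decade Year`):

```python
import pandas as pd

# Toy release years spanning several decades.
toy = pd.DataFrame({"release_year": [1925, 1999, 2000, 2019, 2021]})

# Floor each year to the start of its decade, then append 's'.
decade_start = toy["release_year"] // 10 * 10
toy["decade"] = decade_start.astype(str) + "s"
print(toy)
```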

In [14]:
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Primary Genre,Decade Year
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Not Available,United States,"September 25, 2021",2020,PG 13,90 min,Documentaries,"As her father nears the end of his life, filmm...",Documentaries,2020's
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",International TV Shows,2020's
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Not Specified,"September 24, 2021",2021,TV MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,Crime TV Shows,2020's
3,s4,TV Show,Jailbirds New Orleans,Unknown,Not Available,Not Specified,"September 24, 2021",2021,TV MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",Docuseries,2020's
4,s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,International TV Shows,2020's
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",Cult Movies,2000's
8803,s8804,TV Show,Zombie Dumb,Unknown,Not Available,Not Specified,"July 1, 2019",2018,TV Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",Kids' TV,2010's
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,Comedies,2000's
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,TV PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",Children & Family Movies,2000's
