## 📌 Project Objective

This project investigates Netflix movie data, focusing on key trends and patterns such as durations, genres, and release decades.  
We aim to perform Exploratory Data Analysis (EDA) and answer specific questions related to movies released in the 1990s.

### 🎯 Key Questions

1. **What was the most frequent movie duration in the 1990s?**  
   - Save an approximate answer as an integer called `duration`.  
   - Use **1990** as the decade's start year.

2. **How many short action movies (less than 90 mins) were released in the 1990s?**  
   - Save this integer as `short_movie_count`.
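
One way to approach both questions is sketched below. It is a minimal illustration, not the project's reference solution: it assumes the column names `type`, `release_year`, `duration` and `genre` from the notebook's `df.head()` output, treats the 1990s as 1990 through 1999 inclusive, and assumes action movies carry the genre label `"Action"`. The helper name `answer_key_questions` and the toy rows are made up for the example.

```python
import pandas as pd

# Toy rows, made up for illustration only; the real notebook reads netflix_data.csv.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "Movie"],
    "release_year": [1991, 1995, 1999, 1997, 1994, 2005],
    "duration": [94, 94, 85, 120, 1, 80],
    "genre": ["Dramas", "Action", "Action", "Action", "International TV", "Action"],
})


def answer_key_questions(df):
    """Return (duration, short_movie_count) for movies released 1990-1999."""
    is_movie = df["type"] == "Movie"
    in_1990s = df["release_year"].between(1990, 1999)  # inclusive on both ends
    movies_90s = df[is_movie & in_1990s]

    # Most frequent duration; mode() may return ties, so take the first (smallest).
    duration = int(movies_90s["duration"].mode().iloc[0])

    # Action movies strictly shorter than 90 minutes.
    is_short_action = (movies_90s["genre"] == "Action") & (movies_90s["duration"] < 90)
    short_movie_count = int(is_short_action.sum())
    return duration, short_movie_count


duration, short_movie_count = answer_key_questions(toy)
print(duration, short_movie_count)
```

On the real data the same function would be called as `answer_key_questions(df)`. Whether TV shows should be excluded is a judgment call that the questions imply but do not state outright.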

---

## 🗂️ Project Structure

1. 📦 Setup & Data Loading  
2. 🔍 Handling Missing Data  
3. 📊 Exploratory Data Analysis (EDA)  
4. 🛠️ Feature Engineering  
5. ❓ Answering Key Questions  
6. 📈 Visualizations & Insights  
7. 📌 Summary & Conclusion

---

In [16]:
# Import the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
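# Load the dataset (file name from the Data Source line) and check the first 5 rows
df = pd.read_csv("netflix_data.csv")
df.head()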

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,duration,description,genre
0,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,93,After a devastating earthquake hits Mexico Cit...,Dramas
1,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,78,"When an army recruit is found dead, his fellow...",Horror Movies
2,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,80,"In a postapocalyptic world, rag-doll robots hi...",Action
3,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,123,A brilliant group of students become card-coun...,Dramas
4,s6,TV Show,46,Serdar Akar,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,"July 1, 2017",2016,1,A genetics professor experiments with a treatm...,International TV


In [18]:
# Basic info (Columns, Data types, missing values)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4812 entries, 0 to 4811
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       4812 non-null   object
 1   type          4812 non-null   object
 2   title         4812 non-null   object
 3   director      4812 non-null   object
 4   cast          4812 non-null   object
 5   country       4812 non-null   object
 6   date_added    4812 non-null   object
 7   release_year  4812 non-null   int64 
 8   duration      4812 non-null   int64 
 9   description   4812 non-null   object
 10  genre         4812 non-null   object
dtypes: int64(2), object(9)
memory usage: 413.7+ KB


In [19]:
# Shape of data
print("Rows:", df.shape[0], "\nColumns:", df.shape[1])

Rows: 4812 
Columns: 11


In [20]:
# Observations:
# - There are no missing values.
# - show_id and the row index both identify a row, so one of them can be dropped.

In [21]:
# Inspect the 'show_id' feature
print(df["show_id"])
# The IDs do not match the row index, and the index already identifies each row,
# so drop the column
df.drop(columns="show_id", inplace=True)

0          s2
1          s3
2          s4
3          s5
4          s6
        ...  
4807    s7779
4808    s7781
4809    s7782
4810    s7783
4811    s7784
Name: show_id, Length: 4812, dtype: object
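
The claim that `show_id` does not line up with the row index can be checked rather than read off by eye. The sketch below uses six (row index, `show_id`) pairs copied from the printed Series above and assumes every ID follows the `s<number>` pattern seen there. The variable names are illustrative.

```python
import pandas as pd

# Six (row index, show_id) pairs copied from the printed Series above.
sample = pd.Series(
    ["s2", "s3", "s4", "s5", "s6", "s7779"],
    index=[0, 1, 2, 3, 4, 4807],
    name="show_id",
)

# Strip the leading "s" and compare the numeric part with the row index.
id_numbers = sample.str[1:].astype(int)
offsets = id_numbers.to_numpy() - sample.index.to_numpy()

ids_are_unique = bool(sample.is_unique)
# A single constant offset would mean the ID simply mirrors the index.
mirrors_index = len(set(offsets.tolist())) == 1
print(ids_are_unique, mirrors_index)
```

Here the offset jumps from 2 to 2972, so the IDs are unique but do not track the index. Dropping the column loses the original catalogue ID, which only matters if rows later need to be matched back to another source.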


In [22]:
df.head()

Unnamed: 0,type,title,director,cast,country,date_added,release_year,duration,description,genre
0,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,93,After a devastating earthquake hits Mexico Cit...,Dramas
1,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,78,"When an army recruit is found dead, his fellow...",Horror Movies
2,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,80,"In a postapocalyptic world, rag-doll robots hi...",Action
3,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,123,A brilliant group of students become card-coun...,Dramas
4,TV Show,46,Serdar Akar,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,"July 1, 2017",2016,1,A genetics professor experiments with a treatm...,International TV


In [23]:
# =============================================================================
# Exploratory Data Analysis (EDA)
# =============================================================================

In [24]:
# Numerical Data Analysis

In [25]:
[feature for feature in df.columns if df[feature].dtype != "O"]

['release_year', 'duration']
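
An alternative to comparing each dtype with `"O"` is pandas' `select_dtypes`, sketched here on a made-up toy frame whose column names follow the `df.info()` output. It does not depend on text columns being stored as `object`, which may not hold if a dedicated string dtype is in use.

```python
import pandas as pd

# Toy frame, made up for illustration; column names follow the df.info() output.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "release_year": [1994, 2009],
    "duration": [93, 80],
    "genre": ["Dramas", "Action"],
})

# Keep only columns with a numeric dtype (integers, floats, ...).
numeric_features = toy.select_dtypes(include="number").columns.tolist()
print(numeric_features)
```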

# DataCamp's Investigating Netflix Movies Project

**Author**: Vimalathas Vithusan  
**GitHub**: [github.com/thasvithu](https://github.com/thasvithu)  
**Last Updated**: April 21, 2025  
**Data Source**: `netflix_data.csv`

---

## Objective

The goal of this project is to analyze Netflix movie data, specifically focusing on titles released during the 1990s.  
We will explore trends in movie durations and genres through structured exploratory data analysis (EDA) and answer the following:

---

## Key Questions

1. **What was the most frequent movie duration in the 1990s?**  
   - Store the answer in a variable called `duration`.

2. **How many short action movies (less than 90 minutes) were released in the 1990s?**  
   - Store this as an integer called `short_movie_count`.

---

## Project Structure

1. **Data Loading & Initial Inspection**  
2. **Handling Missing Data**  
3. **Exploratory Data Analysis (EDA)**  
4. **Feature Engineering**  
5. **Answering Key Business Questions**  
6. **Visualizations**  
7. **Summary of Insights**

In [26]:
# =============================================================================
# Step 1: Load Dataset and Initial Exploration
# =============================================================================

In [27]:
# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [28]:
# Load the dataset
df = pd.read_csv("netflix_data.csv")

# Preview the first few rows
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,duration,description,genre
0,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,93,After a devastating earthquake hits Mexico Cit...,Dramas
1,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,78,"When an army recruit is found dead, his fellow...",Horror Movies
2,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,80,"In a postapocalyptic world, rag-doll robots hi...",Action
3,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,123,A brilliant group of students become card-coun...,Dramas
4,s6,TV Show,46,Serdar Akar,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,"July 1, 2017",2016,1,A genetics professor experiments with a treatm...,International TV


In [30]:
# view the dataset structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4812 entries, 0 to 4811
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       4812 non-null   object
 1   type          4812 non-null   object
 2   title         4812 non-null   object
 3   director      4812 non-null   object
 4   cast          4812 non-null   object
 5   country       4812 non-null   object
 6   date_added    4812 non-null   object
 7   release_year  4812 non-null   int64 
 8   duration      4812 non-null   int64 
 9   description   4812 non-null   object
 10  genre         4812 non-null   object
dtypes: int64(2), object(9)
memory usage: 413.7+ KB


The DataFrame has 11 columns and no null values.
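
The null count can also be confirmed directly instead of reading it off `df.info()`. A minimal sketch on a made-up toy frame, with one value deliberately left missing so the check has something to find:

```python
import pandas as pd

# Made-up toy frame with one deliberately missing value, for illustration only.
toy = pd.DataFrame({"title": ["A", "B", None], "duration": [93, 78, 80]})

missing_per_column = toy.isna().sum()          # nulls counted per column
total_missing = int(missing_per_column.sum())  # nulls across the whole frame
n_columns = toy.shape[1]

print(missing_per_column)
print("columns:", n_columns, "total missing:", total_missing)
```

Applied to the notebook's `df`, a `total_missing` of 0 would match the non-null counts reported by `df.info()`.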

In [32]:
df.duplicated().sum()

0