# Business Case: Netflix - Data Exploration and Visualisation
## Context
Netflix is one of the most popular media and video streaming platforms. It has over 10,000 movies and TV shows available on its platform and, as of mid-2021, over 222M subscribers globally. This tabular dataset consists of listings of all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

## 1. Problem Statement & EDA
### 1.1 Problem
Analyze the data and generate insights that could help Netflix decide which types of shows and movies to produce and how to grow the business in different countries.

In [15]:
# data:
# !wget -nc https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/000/940/original/netflix.csv -O ../temp/netflix.csv

--2025-02-01 05:47:00--  https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/000/940/original/netflix.csv
Resolving d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)... 108.158.41.203, 108.158.41.222, 108.158.41.226, ...
Connecting to d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)|108.158.41.203|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3399671 (3.2M) [text/plain]
Saving to: ‘../temp/netflix.csv’


2025-02-01 05:47:00 (21.1 MB/s) - ‘../temp/netflix.csv’ saved [3399671/3399671]



### 1.2 Shape of data
Import the dataset and run the usual exploratory data analysis steps, such as checking the structure and characteristics of the dataset.

In [1]:
import pandas as pd
df = pd.read_csv('../temp/netflix.csv')
df.shape

(8807, 12)

The dataset has 8807 records and 12 features.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
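
The non-null counts above imply how many values are missing in each column. The following is a minimal sketch of that arithmetic, using only the counts printed by `df.info()`. On the actual DataFrame, `df.isna().sum()` should give the same counts without hard-coding anything. The names `TOTAL_ROWS`, `non_null`, `missing`, and `missing_pct` are illustrative and not part of the notebook.

```python
# Non-null counts copied from the df.info() output above
# (only the columns that have at least one missing value).
TOTAL_ROWS = 8807
non_null = {
    "director": 6173,
    "cast": 7982,
    "country": 7976,
    "date_added": 8797,
    "rating": 8803,
    "duration": 8804,
}

# Missing values = total rows minus non-null count.
missing = {col: TOTAL_ROWS - n for col, n in non_null.items()}

# Share of rows missing, as a percentage rounded to one decimal place.
missing_pct = {col: round(100 * m / TOTAL_ROWS, 1) for col, m in missing.items()}

# Print columns from most to least missing.
for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:<12} {missing[col]:>5} {missing_pct[col]:>5}%")
```

By this count, `director` is the column with the most gaps (2634 rows), while `date_added`, `rating`, and `duration` are each missing in 10 rows or fewer.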


In [4]:
# Count distinct values per column
df.nunique()

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64
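
One way to read these counts is as a share of the 8807 rows: a column whose distinct values are a small fraction of the row count is a natural candidate for a categorical type. The sketch below works from the printed counts only. The 10% cutoff (`THRESHOLD`) and the helper `cardinality_ratio` are assumptions made for illustration, not a rule from the notebook or from pandas.

```python
# Distinct-value counts copied from the output above (a subset of columns).
TOTAL_ROWS = 8807
nunique = {
    "type": 2,
    "rating": 17,
    "country": 748,
    "director": 4528,
    "title": 8807,
}

# Hypothetical cutoff: flag a column as a category candidate
# when fewer than 10% of the rows hold distinct values.
THRESHOLD = 0.10


def cardinality_ratio(n_unique, n_rows=TOTAL_ROWS):
    """Fraction of rows that carry a distinct value in the column."""
    return n_unique / n_rows


candidates = {col: cardinality_ratio(n) < THRESHOLD for col, n in nunique.items()}

for col, n in nunique.items():
    print(f"{col:<10} ratio={cardinality_ratio(n):.3f} candidate={candidates[col]}")
```

Under this assumed cutoff, `type`, `rating`, and `country` qualify, while `director` (about half of the rows are distinct) and `title` do not. A categorical type still works for `director`, but the benefit is likely smaller.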

* show_id is unique per row, so it can serve as the index
* type, director, country, rating can be categories
* date_added can be parsed as a datetime; duration can be converted to a numerical variable
* title, cast, listed_in, description can be text
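
Turning `duration` into a number requires splitting the value from its unit. The sketch below assumes the strings look like `"90 min"` or `"2 Seasons"`. That format is an assumption, since the notebook has not printed any `duration` values at this point, and `parse_duration` is a hypothetical helper rather than something defined in the notebook.

```python
import re

# Assumed format: "<number> <unit>", e.g. "90 min" or "2 Seasons".
_DURATION_RE = re.compile(r"^\s*(\d+)\s+(\w+)\s*$")


def parse_duration(text):
    """Return (value, unit), or None when the input is missing or does not match."""
    if not isinstance(text, str):
        # Missing values typically arrive as NaN (a float) or None.
        return None
    match = _DURATION_RE.match(text)
    if match is None:
        return None
    value, unit = match.groups()
    # Normalise the unit: lower-case and drop a trailing plural "s".
    return int(value), unit.lower().rstrip("s")


print(parse_duration("90 min"))
print(parse_duration("2 Seasons"))
print(parse_duration(None))
```

If the assumption holds, the helper could be applied with `df['duration'].map(parse_duration)`. Because the two units are not comparable, minutes and seasons are probably best kept in separate columns.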

In [31]:
df['date_added'].sample(4)

7023     October 22, 2017
1184       March 19, 2021
2894    February 21, 2020
7724        April 1, 2019
Name: date_added, dtype: object
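
The sampled values follow a "month name, day, four-digit year" pattern, which corresponds to the format string `%B %d, %Y`. As a quick check, the sketch below parses the four sampled strings with the standard library's `datetime.strptime`. The names `DATE_FORMAT`, `samples`, and `parsed` are illustrative, and `%B` is assumed to match English month names, which is the case under the default C locale.

```python
from datetime import datetime

# %B = full month name, %d = day of month, %Y = four-digit year.
DATE_FORMAT = "%B %d, %Y"

# The four values shown in the sample above.
samples = [
    "October 22, 2017",
    "March 19, 2021",
    "February 21, 2020",
    "April 1, 2019",
]

# strip() guards against stray leading or trailing whitespace.
parsed = [datetime.strptime(s.strip(), DATE_FORMAT) for s in samples]

for raw, dt in zip(samples, parsed):
    print(f"{raw!r} -> {dt.date().isoformat()}")
```

Note that `%d` also accepts a day without a leading zero, so `"April 1, 2019"` parses without special handling.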

In [44]:
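# Columns to store as the pandas 'category' dtype.
# Named category_dtypes to avoid confusion with the 'cast' column.
category_dtypes = {
    'type': 'category',
    'director': 'category',
    'country': 'category',
    'rating': 'category'
}

df_cast = df.astype(category_dtypes)
# Parse dates such as "October 22, 2017"; strip stray whitespace first and fail loudly on bad values
df_cast['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format='%B %d, %Y', errors='raise')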

In [46]:
df_cast.dtypes

show_id                 object
type                  category
title                   object
director              category
cast                    object
country               category
date_added      datetime64[ns]
release_year             int64
rating                category
duration                object
listed_in               object
description             object
dtype: object