# EDA on Netflix Data

In [1]:
# libraries 
import pandas as pd
import numpy as np
import altair as alt

# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')

DataTransformerRegistry.enable('data_server')

## Tasks
1. Basic Data Wrangling Tasks (including understanding the dataset characteristics)
2. Summary views (both visual and numerical)
3. Generate questions about the data
4. Search for answers by visualizing the data

## 1. Load the Dataset
Understanding the dataset characteristics
 - What is the size of the dataset?
 - What are the column names?
 - Is the data in an appropriate form for encoding with Altair? Adjust as necessary.

In [2]:
url = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/netflix_data_edited.csv'
data = pd.read_csv(url, parse_dates= ['release_year', 'year_added'])
data['release_year'] = pd.DatetimeIndex(data['release_year']).year
data['year_added'] = pd.DatetimeIndex(data['year_added']).year
data.head()

Unnamed: 0,show_id,title,director,cast,country,release_year,rating,duration,listed_in,description,month_added,day_added,year_added
0,s7104,Tinker Bell and the Legend of the NeverBeast,Steve Loter,"Ginnifer Goodwin, Mae Whitman, Rosario Dawson,...",United States,2014,G,78,Children & Family Movies,When suspicious scout fairies scheme to captur...,January,1,2008
1,s1764,Dilan 1991,"Fajar Bustomi, Pidi Baiq","Iqbaal Ramadhan, Vanesha Prescilla, Ira Wibowo...",Indonesia,2019,TV-14,118,"Dramas, International Movies, Romantic Movies",Dilan's involvement in the motorbike gang impe...,February,4,2008
2,s3244,Jumping the Broom,Salim Akil,"Angela Bassett, Paula Patton, Laz Alonso, Lore...",United States,2011,PG-13,113,"Comedies, Romantic Movies","After a whirlwind romance, a couple rushes to ...",May,5,2009
3,s3834,Mac & Devin Go to High School,Dylan C. Brown,"Snoop Dogg, Wiz Khalifa, Mike Epps, Teairra Ma...",United States,2012,R,76,Comedies,Devin Overstreet may be the class valedictoria...,November,1,2010
4,s6836,The Rover,David Michôd,"Guy Pearce, Robert Pattinson, Scoot McNairy, D...","Australia, United States",2014,R,103,"International Movies, Thrillers","Set in a chaotic future, this Outback saga fol...",October,1,2011
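
The characteristics listed at the start of this section can be checked with a few pandas attributes. The following is a minimal sketch, assuming a small hypothetical frame named `sample` that stands in for `data` (only four of the columns are reproduced):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame (values taken from the preview)
sample = pd.DataFrame({
    "title": ["Dilan 1991", "Jumping the Broom", "The Rover"],
    "release_year": [2019, 2011, 2014],
    "duration": [118, 113, 103],
    "rating": ["TV-14", "PG-13", "R"],
})

n_rows, n_cols = sample.shape            # size of the dataset
column_names = sample.columns.tolist()   # column names
print(n_rows, n_cols)
print(column_names)
print(sample.dtypes)                     # dtypes guide the choice of Altair encoding types
```

On the real dataset the same three calls would be applied to `data`.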


### Data Wrangling
Split the listed_in column so that each category gets its own column, and name the resulting columns genre_1, genre_2, and genre_3.
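
One possible way to do the split is sketched below. It assumes the categories are comma-separated and that at most three are kept per title; `sample` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`, using listed_in values from the preview
sample = pd.DataFrame({
    "listed_in": [
        "Children & Family Movies",
        "Dramas, International Movies, Romantic Movies",
        "Comedies, Romantic Movies",
    ]
})

# Split on commas; titles with fewer categories get missing values
genres = sample["listed_in"].str.split(",", expand=True)
genres = genres.reindex(columns=range(3))            # keep exactly three columns
genres = genres.apply(lambda col: col.str.strip())   # remove the space after each comma
genres.columns = ["genre_1", "genre_2", "genre_3"]

sample = pd.concat([sample, genres], axis=1)
print(sample[["genre_1", "genre_2", "genre_3"]])
```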

## 2. Summary views (both visual and numerical)
 - Univariate Numerical Summaries
 - Univariate Visual Idioms
 - Multivariate Numerical Summaries
 - Multivariate Visual Idioms

### Univariate Numerical Summaries

#### Quantitative
- range (i.e., min, max)
- central tendency (i.e., mean, median)
- spread (i.e., standard deviation)

#### Categorical
 - Frequency of each value (i.e., frequency table)

First determine which attributes you are interested in exploring (check data.columns):
'rating', 'month_added', 'genre_1', 'genre_2', 'genre_3'.
'director', 'cast', and 'country' are more diverse and less interesting at this point.


In [3]:
cat_attr = ['rating', 'month_added', 'genre_1', 'genre_2', 'genre_3']  # 'director', 'cast', 'country' are more diverse and less interesting

Iterate over the list and print out the frequency for each attribute
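
A minimal sketch of that loop, using `value_counts` on a hypothetical `sample` frame with only two of the attributes:

```python
import pandas as pd

# Hypothetical stand-in for `data`, using values from the preview
sample = pd.DataFrame({
    "rating": ["G", "TV-14", "PG-13", "R", "R"],
    "month_added": ["January", "February", "May", "November", "October"],
})
cat_attr = ["rating", "month_added"]

freq_tables = {}
for attr in cat_attr:
    # dropna=False keeps missing values visible in the frequency table
    freq_tables[attr] = sample[attr].value_counts(dropna=False)
    print(freq_tables[attr], end="\n\n")
```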

#### Data Munging

Do we want to combine TV-G with G and TV-PG with PG? Let's also drop the rows whose rating is missing.
As we come to understand the data, we refine the dataset and perform additional transformations.
Reference for the US TV ratings: https://movielabs.com/md/ratings/v2.3/html/US_TVPG_Ratings.html
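
If the answer is yes, the recoding could look like the sketch below. The mapping and the `sample` frame are assumptions made for illustration; whether the TV and film ratings are really equivalent is a judgment call.

```python
import pandas as pd

# Hypothetical stand-in for `data`
sample = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "rating": ["TV-G", "G", "TV-PG", "PG", None],
})

# Assumed mapping: fold the TV ratings into their film counterparts
rating_map = {"TV-G": "G", "TV-PG": "PG"}

cleaned = sample.dropna(subset=["rating"]).copy()        # drop rows with a missing rating
cleaned["rating"] = cleaned["rating"].replace(rating_map)
print(cleaned["rating"].value_counts())
```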

### Categorical Univariate Visual Summaries
 - use bar charts for categorical attributes,
 e.g., genre_1, month_added, rating, country, cast

### Quantitative Univariate Visual Summaries
Use histograms and density plots, e.g., for duration and year_added.

Adjust the number of bins in the histogram so that its shape is similar to the density plot above.

### Multivariate Numerical Summaries

#### Categorical
- rating and genre_1
- rating and month_added
- rating and genre_2
HINT: use crosstab
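
A minimal sketch of the hint with `pd.crosstab`, shown for the first pair only and run on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical stand-in for `data`
sample = pd.DataFrame({
    "rating": ["R", "R", "PG-13", "G", "PG-13"],
    "genre_1": ["Comedies", "International Movies", "Comedies",
                "Children & Family Movies", "Dramas"],
})

# Rows are ratings, columns are primary genres, cells are title counts
rating_by_genre = pd.crosstab(sample["rating"], sample["genre_1"])
print(rating_by_genre)
```

The other pairs follow by swapping the second argument for `month_added` or `genre_2`.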

#### Quantitative
- Correlation Matrix for quantitative attributes

What if we wanted to explore whether there is a strong correlation between the quantitative attributes for a specific genre?
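
One way to approach this is to filter on the genre before calling `corr`. The sketch below uses a hypothetical `sample` frame with made-up numbers, and the choice of "Comedies" is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the numbers are invented for illustration
sample = pd.DataFrame({
    "genre_1": ["Comedies"] * 4 + ["Dramas"] * 2,
    "release_year": [2010, 2012, 2014, 2016, 2011, 2019],
    "year_added": [2015, 2016, 2018, 2019, 2017, 2020],
    "duration": [90, 95, 100, 105, 120, 110],
})
quant_attr = ["release_year", "year_added", "duration"]

# Correlation matrix over all titles
corr_all = sample[quant_attr].corr()

# Correlation matrix restricted to one primary genre
corr_comedies = sample.loc[sample["genre_1"] == "Comedies", quant_attr].corr()
print(corr_comedies)
```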

### Multivariate Visual Summaries

#### Stacked Bar Charts  - month and genre_1

#### Overlapping Density Plots - duration and rating (keep 3 ratings you care about)

#### Bivariate Outlier Exploration
- use a scatter plot to depict the values for one categorical and one quantitative attribute


## Additional Analysis

Now that we have an overview of the data, we can start exploring additional questions of interest.
First summarize the questions that you have been able to answer with the EDA before formulating additional questions of interest.
The questions should be diverse (use the Stasko classification of low-level tasks, e.g., Retrieve Value, Filter, Find Extremum):
- Retrieve Value - Find the longest movie, what is its name, genre, and length?
- Filter - Present the 20 longest movies released after 2005 that have a PG-13 rating
- Compute Derived Value - What percentage of movies added to the Netflix catalogue in 2018 were Documentaries? 
- Compute Derived Value - What is the average length of the movies in a given primary genre?
- Find Extremum - Which genre has the longest movie?
- Sort - Rank movies by their length
- Determine Range - What is the duration range for movies released in 2000?
- Characterize Distribution - What is the distribution by Genre for movies in a given rating group?
- Find Anomalies - What outliers exist for a given genre and rating in terms of movie length?
- Correlate - Is there a relationship between film duration and year of release for a given genre?
- Does Netflix typically add movies on a specific day of the month?

### Retrieve Value - Find the longest movie, what is its name, genre, and length?
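
A minimal numerical sketch for this question, assuming a hypothetical `sample` frame in which `genre_1` has already been derived:

```python
import pandas as pd

# Hypothetical stand-in for `data`, using rows from the preview
sample = pd.DataFrame({
    "title": ["Jumping the Broom", "Mac & Devin Go to High School", "The Rover"],
    "genre_1": ["Comedies", "Comedies", "International Movies"],
    "duration": [113, 76, 103],
})

# idxmax returns the index label of the row with the largest duration
longest = sample.loc[sample["duration"].idxmax(), ["title", "genre_1", "duration"]]
print(longest)
```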

### Filter - What are the 20 longest movies released after 2005 that have a PG-13 rating?
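
The filter can be sketched with a boolean mask followed by `nlargest`. The `sample` frame and its titles are hypothetical; on a frame with fewer than 20 matches, `nlargest(20, ...)` simply returns all of them.

```python
import pandas as pd

# Hypothetical stand-in for `data`; titles and values are invented
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [2004, 2011, 2015, 2019],
    "rating": ["PG-13", "PG-13", "R", "PG-13"],
    "duration": [130, 113, 140, 125],
})

# Keep PG-13 titles released after 2005, then take the 20 longest
mask = (sample["release_year"] > 2005) & (sample["rating"] == "PG-13")
top = sample[mask].nlargest(20, "duration")
print(top)
```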

### Compute Derived Value

### What percentage of movies added to the Netflix catalogue in 2018 were Documentaries? 
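
Before charting, the percentage itself can be computed directly. The sketch below assumes a title counts as a documentary whenever "Documentaries" appears anywhere in listed_in; counting only genre_1 would be an equally reasonable definition. `sample` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; values are invented
sample = pd.DataFrame({
    "year_added": [2018, 2018, 2018, 2018, 2017],
    "listed_in": [
        "Documentaries",
        "Comedies",
        "Dramas, International Movies",
        "Documentaries, Music & Musicals",
        "Documentaries",
    ],
})

added_2018 = sample[sample["year_added"] == 2018]
# The mean of a boolean Series is the fraction of True values
is_doc = added_2018["listed_in"].str.contains("Documentaries", regex=False)
pct_docs = 100 * is_doc.mean()
print(f"{pct_docs:.1f}% of titles added in 2018 were documentaries")
```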

#### Sorted Bar Chart - Attempt 1

#### Compute Derived Value and then use Pie Chart or Stacked Single Bar - Attempt 2

#### Layered View to Create Proportional Single Bar Chart - Attempt 3

### Compute Derived Value - What is the average length of the movies for each rating
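
A minimal sketch of the aggregation with `groupby`, run on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical stand-in for `data`; values are invented
sample = pd.DataFrame({
    "rating": ["R", "R", "PG-13", "PG-13", "G"],
    "duration": [76, 103, 113, 117, 78],
})

# Mean duration per rating, longest first
mean_by_rating = (
    sample.groupby("rating")["duration"].mean().sort_values(ascending=False)
)
print(mean_by_rating)
```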

### Find Extremum - Which genre has the longest movie?
We have already done this above, but we can do it again here, or even use squares as marks.
We differentiate between squares and circles because a traditional scatter plot has a specific purpose in statistics.

### Sort - Categorize the catalogue by arranging the movies in each primary genre for each rating 

### Determine Range - What is the range from when a movie was released to when it was added to Netflix's catalogue?
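
The range can be derived from the difference between year_added and release_year. The sketch below uses the two columns from the five preview rows shown earlier; `sample` and `lag_years` are names introduced here for illustration.

```python
import pandas as pd

# release_year and year_added copied from the five preview rows
sample = pd.DataFrame({
    "release_year": [2014, 2019, 2011, 2012, 2014],
    "year_added": [2008, 2008, 2009, 2010, 2011],
})

# Years between release and addition to the catalogue
sample["lag_years"] = sample["year_added"] - sample["release_year"]
lag_min, lag_max = sample["lag_years"].min(), sample["lag_years"].max()
print(lag_min, lag_max)
```

In these five rows the year added precedes the release year, so the lag is negative. That suggests checking how the edited dataset defines year_added before interpreting the range.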

### Characterize Distribution - What is the distribution by Genre for movies in a given rating group?

### Find Anomalies - What outliers exist for a given genre and rating in terms of movie length?

### Correlate - Is there a relationship between film duration and year of release for a given genre?

### Does Netflix typically add movies on a specific day of the month?
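
A numerical starting point is a frequency table of day_added. The sketch below uses the day_added values from the five preview rows; a bar chart of the same counts would be the visual counterpart.

```python
import pandas as pd

# day_added copied from the five preview rows
sample = pd.DataFrame({"day_added": [1, 4, 5, 1, 1]})

# Count titles per day of the month, ordered by day
day_counts = sample["day_added"].value_counts().sort_index()
day_share = day_counts / day_counts.sum()   # proportion of titles per day
print(day_counts)
print(day_counts.idxmax())                  # most common day in this sample
```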