## DS 4200 - Analyzing Netflix Trends Through IMDB Scores and Entertainment Characteristics

### Alissa Agnelli, Paulina Acosta, Regina Rabkina

In [114]:
# Importing libraries
import pandas as pd
import numpy as np 
import altair as alt

In [115]:
# Read the CSV file 
imdb_df = pd.read_csv('Netflix TV Shows and Movies.csv')
netflix_df = pd.read_csv('netflix_titles.csv', encoding='ISO-8859-1')

In [116]:
# Looking at variables' datatypes for imdb dataframe
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5283 entries, 0 to 5282
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              5283 non-null   int64  
 1   id                 5283 non-null   object 
 2   title              5283 non-null   object 
 3   type               5283 non-null   object 
 4   description        5278 non-null   object 
 5   release_year       5283 non-null   int64  
 6   age_certification  2998 non-null   object 
 7   runtime            5283 non-null   int64  
 8   imdb_id            5283 non-null   object 
 9   imdb_score         5283 non-null   float64
 10  imdb_votes         5267 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 454.1+ KB


In [117]:
# Looking at variables' datatypes for netflix dataframe
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8809 entries, 0 to 8808
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       8809 non-null   object 
 1   type          8809 non-null   object 
 2   title         8809 non-null   object 
 3   director      6175 non-null   object 
 4   cast          7984 non-null   object 
 5   country       7978 non-null   object 
 6   date_added    8799 non-null   object 
 7   release_year  8809 non-null   int64  
 8   rating        8805 non-null   object 
 9   duration      8806 non-null   object 
 10  listed_in     8809 non-null   object 
 11  description   8809 non-null   object 
 12  Unnamed: 12   0 non-null      float64
 13  Unnamed: 13   0 non-null      float64
 14  Unnamed: 14   0 non-null      float64
 15  Unnamed: 15   0 non-null      float64
 16  Unnamed: 16   0 non-null      float64
 17  Unnamed: 17   0 non-null      float64
 18  Unnamed: 18   0 non-null    

### EDA

In [118]:
# Selecting relevant columns
netflix_df = netflix_df[['title', 'country', 'date_added', 'release_year', 'duration', 'listed_in', 'rating']]

We kept these columns from the Netflix dataframe because they are the most relevant to our analysis of Netflix trends.

In [119]:
# Merging the two dataframes on matching titles and release years
data = pd.merge(imdb_df, netflix_df, on=['title', 'release_year'], how='inner')
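An inner merge on title and release year silently drops unmatched titles and repeats rows when a key appears more than once in either file. The sketch below shows one way to check both effects, assuming small hypothetical toy frames (`imdb_toy`, `netflix_toy`) rather than the real CSVs.

```python
import pandas as pd

# Hypothetical toy frames standing in for imdb_df and netflix_df
imdb_toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "release_year": [2019, 2020, 2021],
    "imdb_score": [7.1, 6.4, 8.0],
})
netflix_toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Beta", "Delta"],
    "release_year": [2019, 2020, 2020, 2018],
    "country": ["United States", "India", "India", "Japan"],
})

# An outer merge with indicator=True labels each row by its source:
# "both", "left_only", or "right_only"
check = pd.merge(imdb_toy, netflix_toy, on=["title", "release_year"],
                 how="outer", indicator=True)
match_counts = check["_merge"].value_counts()

# Keys repeated in one frame multiply the matching rows in the merged result
dup_keys = netflix_toy.duplicated(subset=["title", "release_year"]).sum()
inner = pd.merge(imdb_toy, netflix_toy, on=["title", "release_year"], how="inner")

print(match_counts)
print("duplicate keys:", dup_keys, "| inner rows:", len(inner))
```

Running the same checks on the real frames would show how many titles fail to match and whether any title is counted twice.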

In [120]:
# Dropping columns unrelated to our analysis
data = data.drop(['id', 'imdb_id', 'index', 'description', 'age_certification', 'duration'], axis = 1)

#### Ensuring appropriate data types

In [121]:
# Merged the data sets, kept columns that are relevant to analysis
# Now going to make sure the data types are correct

data['date_added'] = pd.to_datetime(data['date_added'], format='%B %d, %Y', errors='coerce')
data['type'] = data['type'].astype('category')
data['imdb_votes'] = data['imdb_votes'].astype('Int64')
data['rating'] = data['rating'].astype('category')
data['country'] = data['country'].astype('category')
data['listed_in'] = data['listed_in'].astype('category')
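With `errors='coerce'`, any string that does not match the format becomes `NaT` instead of raising an error, so it is worth counting how many values failed to parse. A minimal sketch, assuming a few hypothetical date strings:

```python
import pandas as pd

# Hypothetical raw values; the last one does not match the expected format
raw_dates = pd.Series(["September 25, 2021", "August 4, 2017", "not a date"])

# Unparseable strings are converted to NaT rather than raising an error
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")

# Count the values that could not be parsed
n_failed = parsed.isna().sum()
print(parsed)
print("failed to parse:", n_failed)
```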


In [122]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2984 entries, 0 to 2983
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         2984 non-null   object        
 1   type          2984 non-null   category      
 2   release_year  2984 non-null   int64         
 3   runtime       2984 non-null   int64         
 4   imdb_score    2984 non-null   float64       
 5   imdb_votes    2979 non-null   Int64         
 6   country       2742 non-null   category      
 7   date_added    2984 non-null   datetime64[ns]
 8   listed_in     2984 non-null   category      
 9   rating        2984 non-null   category      
dtypes: Int64(1), category(4), datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 205.5+ KB


In [123]:
data.isna().sum()

title             0
type              0
release_year      0
runtime           0
imdb_score        0
imdb_votes        5
country         242
date_added        0
listed_in         0
rating            0
dtype: int64

#### Handling null values

In [124]:
# Fill missing imdb_votes with the median (truncated to an integer to fit the Int64 dtype)
data['imdb_votes'] = data['imdb_votes'].fillna(int(data['imdb_votes'].median()))
data.isna().sum()

title             0
type              0
release_year      0
runtime           0
imdb_score        0
imdb_votes        0
country         242
date_added        0
listed_in         0
rating            0
dtype: int64

Vote counts tend to be heavily skewed (some movies get millions of votes, others only a few), so the median is a more robust fill value than the mean, which a handful of very popular titles can pull upward. We therefore filled the missing imdb_votes values with the median.
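The difference between the two statistics is easy to see on a small example. The sketch below uses hypothetical vote counts, not values from the dataset, with one very popular title and one missing entry.

```python
import pandas as pd

# Hypothetical vote counts: four modest titles, one blockbuster, one missing value
votes = pd.Series([120, 450, 900, 1500, 2_000_000, pd.NA], dtype="Int64")

mean_votes = votes.mean()      # pulled far upward by the single blockbuster
median_votes = votes.median()  # stays close to a typical title

# Truncate to an integer so the fill value fits the nullable Int64 dtype
filled = votes.fillna(int(median_votes))

print("mean:", mean_votes, "| median:", median_votes)
print(filled)
```

Here the mean is above 400,000 even though four of the five known titles have 1,500 votes or fewer, while the median is 900.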

Because country is a categorical column, "Unknown" must be added to its category list before it can be used to fill missing values.

In [125]:
# Add "Unknown" as a valid category
data['country'] = data['country'].cat.add_categories("Unknown")

In [126]:
# Fill missing country values with "Unknown"
data['country'] = data['country'].fillna("Unknown")
data.isna().sum()

title           0
type            0
release_year    0
runtime         0
imdb_score      0
imdb_votes      0
country         0
date_added      0
listed_in       0
rating          0
dtype: int64

We replaced null values with "Unknown" to mark missing country data. This keeps titles with no recorded country in the dataset while still letting us explore regional trends among the titles that have one.
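The two-step approach is needed because a categorical column only accepts values from its category list. A minimal sketch on a hypothetical three-row series; the exact exception type may vary between pandas versions, so both likely types are caught:

```python
import pandas as pd

# Hypothetical categorical column with one missing value
country = pd.Series(["India", None, "Japan"], dtype="category")

# Filling with a value outside the category list is typically rejected
try:
    country.fillna("Unknown")
    fill_rejected = False
except (TypeError, ValueError):
    fill_rejected = True

# Registering the new category first makes the fill valid
country = country.cat.add_categories("Unknown").fillna("Unknown")

print("direct fill rejected:", fill_rejected)
print(country)
```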

### Visualizations

#### IMDB Score Trends Over Time - Altair Line Chart
We want to analyze Netflix's content strategy. By looking at when Netflix added each title to its platform, we can see whether Netflix is acquiring more highly rated content over time. We can also see whether certain years had a surge in high- or low-rated content being added.

In [None]:
# Extract the year Netflix added the content
data['year_added'] = data['date_added'].dt.year

# Aggregate average IMDB scores per year
df_avg = data.groupby('year_added', as_index=False)['imdb_score'].mean()

# Create Altair Line Chart
score_chart = alt.Chart(df_avg).mark_line(point=True).encode(
    x=alt.X('year_added:O', title="Year Content Added to Netflix"),
    y=alt.Y('imdb_score:Q', title="Average IMDB Score"),
    tooltip=['year_added', 'imdb_score']
).properties(
    title="IMDB Score Trends for Content Added to Netflix"
).interactive()

score_chart

The average IMDB score of content added between 2009 and 2021 fluctuates from year to year. Scores started above 7, dipped to 6.5 in 2011, then rose back above 7 in 2013–2014. The decline to 6.5 in 2015 may reflect a shift toward increasing volume, possibly in response to competition from other streaming platforms, although the chart alone cannot confirm the cause. Scores stabilized around 6.5 from 2017 to 2021, which is consistent with a balance between quantity and quality in Netflix’s content strategy.
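A yearly average can rest on very few titles, so it helps to read the chart alongside the number of titles added each year. A minimal sketch, assuming a hypothetical toy frame with the same column names as `data`:

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use the merged `data` frame
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "year_added": [2011, 2019, 2019, 2019, 2020, 2020],
    "imdb_score": [6.5, 7.0, 6.0, 6.5, 7.2, 6.8],
})

# Average score and number of titles per year, side by side
per_year = toy.groupby("year_added", as_index=False).agg(
    avg_score=("imdb_score", "mean"),
    n_titles=("title", "count"),
)
print(per_year)
```

Years with only a handful of titles would then stand out as less reliable points on the line chart.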