# Exploratory Data Analysis

[OTT-Content-Analysis](https://github.com/sangeethankumar/OTT-Content-Analysis)

In [1]:
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [12, 8]
mpl.rcParams['figure.dpi'] = 150 # 200 e.g. is really fine, but slower
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import imdb_scrape_country as imdbC
from IPython.display import clear_output

Reading data

In [3]:
netflix = pd.read_csv('data/netflix/netflix_titles.csv')
prime = pd.read_csv('data/amazonprime/amazon_prime_titles.csv')
disneyplus = pd.read_csv('data/disneyplus/disney_plus_titles.csv')

In [4]:
print('Number of columns in Netflix : {ncols}'.format(ncols=len(netflix.columns)))
print('Number of columns in Prime   : {ncols}'.format(ncols=len(prime.columns)))
print('Number of columns in Disney+ : {ncols}'.format(ncols=len(disneyplus.columns)))

Number of columns in Netflix : 12
Number of columns in Prime   : 12
Number of columns in Disney+ : 12


In [5]:
# check that every column name matches, position by position, across all three datasets
(netflix.columns == prime.columns).all() & (netflix.columns == disneyplus.columns).all()

True

All three OTT datasets have the same columns.
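As a small illustration of the column comparison, the sketch below uses toy frames with made-up column names (they are assumptions, not the real datasets). Note that `.any()` would pass as soon as a single column name matched, so `.all()` or `Index.equals` is the stricter test. The helper `same_columns` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Toy frames standing in for the real datasets (hypothetical columns).
a = pd.DataFrame(columns=['show_id', 'type', 'title'])
b = pd.DataFrame(columns=['show_id', 'type', 'title'])
c = pd.DataFrame(columns=['show_id', 'genre', 'name'])

def same_columns(*frames):
    """Return True only if every frame has identical column labels in the same order."""
    first = frames[0].columns
    return all(first.equals(f.columns) for f in frames[1:])

loose_check = (a.columns == c.columns).any()   # passes because 'show_id' matches
strict_check = same_columns(a, c)              # fails because two labels differ
all_match = same_columns(a, b)                 # identical labels
```

`Index.equals` also handles frames with a different number of columns, where the element-wise `==` comparison would raise an error.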

In [6]:
print('Number of content in Netflix : {ncontent}'.format(ncontent=len(netflix)))
print('Number of content in Prime   : {ncontent}'.format(ncontent=len(prime)))
print('Number of content in Disney+ : {ncontent}'.format(ncontent=len(disneyplus)))

Number of content in Netflix : 8807
Number of content in Prime   : 9668
Number of content in Disney+ : 1450


Prime has the most titles and Disney+ has the fewest.

In [7]:
print("Columns in the three datasets")
netflix.columns

Columns in the three datasets


Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

Merging the datasets together

In [8]:
netflix['platform'] = 'Netflix'
prime['platform'] = 'Prime'
disneyplus['platform'] = 'Disney+'

In [9]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
ott = pd.concat([netflix, prime, disneyplus])

In [10]:
ott.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19925 entries, 0 to 1449
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       19925 non-null  object
 1   type          19925 non-null  object
 2   title         19925 non-null  object
 3   director      14736 non-null  object
 4   cast          17677 non-null  object
 5   country       9879 non-null   object
 6   date_added    10399 non-null  object
 7   release_year  19925 non-null  int64 
 8   rating        19581 non-null  object
 9   duration      19922 non-null  object
 10  listed_in     19925 non-null  object
 11  description   19925 non-null  object
 12  platform      19925 non-null  object
dtypes: int64(1), object(12)
memory usage: 2.1+ MB
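The index line in this summary (19925 entries, labelled 0 to 1449) indicates that each platform's original row labels were kept when the frames were stacked, so labels repeat across platforms. A minimal sketch with toy frames (not the real data) shows the effect and one possible way to renumber, assuming unique row labels are wanted:

```python
import pandas as pd

# Two toy frames, each with its own default 0..n-1 index.
left = pd.DataFrame({'title': ['A', 'B']})
right = pd.DataFrame({'title': ['C']})

stacked = pd.concat([left, right])                        # keeps labels 0, 1, 0
renumbered = pd.concat([left, right], ignore_index=True)  # fresh labels 0, 1, 2

has_duplicate_labels = stacked.index.duplicated().any()
```

Repeated labels are harmless for the boolean-mask selections used in this notebook, but they matter for label-based lookups such as `ott.loc[0]`, which would return one row per platform.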


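Several columns contain missing values: director, cast, country, date_added, rating and duration. The country column is the sparsest, with only 9879 of 19925 entries present.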

In [11]:
ott['platform'][ott.country.isnull()].value_counts()

Prime      8996
Netflix     831
Disney+     219
Name: platform, dtype: int64

Most of the titles with no country are from Prime (8996 of its 9668 titles, about 93%), followed by Netflix (831 of 8807, about 9%) and Disney+ (219 of 1450, about 15%).
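Raw counts depend on catalogue size, so the share of missing values per platform can be a useful companion figure. The sketch below computes it on a small made-up frame (the rows are assumptions for illustration only); the same expression should apply to `ott` unchanged.

```python
import numpy as np
import pandas as pd

# Toy data: 3 Prime rows (2 missing), 2 Netflix rows (1 missing), 1 Disney+ row.
toy = pd.DataFrame({
    'platform': ['Prime', 'Prime', 'Prime', 'Netflix', 'Netflix', 'Disney+'],
    'country':  [np.nan, np.nan, 'India', 'France', np.nan, 'United States'],
})

# The mean of a boolean mask is the fraction of True values in each group.
missing_share = toy['country'].isnull().groupby(toy['platform']).mean()
```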

In [12]:
# list of titles whose country is missing
null_country_title = ott['title'][ott.country.isnull()].to_list()

In [13]:
null_country_title[:10]

['Ganglands',
 'Jailbirds New Orleans',
 'Midnight Mass',
 'My Little Pony: A New Generation',
 'Vendetta: Truth, Lies and The Mafia',
 'Bangkok Breaking',
 'Confessions of an Invisible Girl',
 'Crime Stories: India Detectives',
 "Europe's Most Dangerous Man: Otto Skorzeny in Spain",
 'Intrusion']

I will use a script (`imdb_scrape_country`, imported above as `imdbC`) to scrape IMDb for the country of origin of these titles.

In [18]:
def country2df(title):
    # Look up the country of origin for one title; fall back to "" on any failure
    try:
        vals = imdbC.get_country(title)
    except Exception:
        vals = ""
    return vals

In [19]:
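# np.vectorize is a convenience wrapper around a Python loop, not a speed-up;
# called on a single title it returns a 0-d array
fast_c2df = np.vectorize(country2df)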

In [22]:
missing_countries = []
tot_missing = len(null_country_title)
for ind,nc in enumerate(null_country_title):
    rem = tot_missing - ind
    clear_output(wait=True)
    print(rem,nc)
    missing_countries.append(fast_c2df(nc))

1 Tree Climbing Lions


In [21]:
missing_countries

[array('France', dtype='<U6'),
 array('United States', dtype='<U13'),
 array('United States', dtype='<U13')]
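Since each lookup is a network request, it may help to query every distinct title only once and to keep plain strings rather than 0-d arrays. The following is a minimal sketch of that idea, not the notebook's actual method: `lookup_countries` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for the real scraper so the example runs offline.

```python
def lookup_countries(titles, fetch):
    """Look up each distinct title once; store "" when the lookup fails."""
    cache = {}
    for title in titles:
        if title in cache:
            continue  # already resolved, skip the repeated request
        try:
            cache[title] = fetch(title)
        except Exception:
            cache[title] = ""
    return cache

# Hypothetical stand-in for the real scraper, used only for this example.
def fake_fetch(title):
    known = {'Ganglands': 'France'}
    return known[title]  # raises KeyError for unknown titles

result = lookup_countries(['Ganglands', 'Ganglands', 'Unknown Show'], fake_fetch)
```

With the real scraper, the call would presumably look like `lookup_countries(null_country_title, imdbC.get_country)`, assuming `get_country` takes a title string and returns a string.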

In [24]:
# with open('list_missing_countries','w') as fwrite:
#     for i in range(len(null_country_title)):
#         fwrite.write("%s:%s\n"%(null_country_title[i],missing_countries[i]))

In [31]:
for title,miss_country in zip(null_country_title,missing_countries):
    print(title,miss_country)
    ott.loc[ott.title == title, "country"] = miss_country
    clear_output(wait=True)

Tree Climbing Lions United States
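One caveat with matching on `ott.title == title`: if a title appears more than once, the assignment writes to every matching row, including rows that already had a country. A possible alternative, sketched here on toy data with made-up values, fills only the missing entries by mapping titles to the scraped countries. The dictionary `found` is a hypothetical stand-in for the scraped results.

```python
import numpy as np
import pandas as pd

# 'Ganglands' appears twice: once without a country, once with one already set.
toy = pd.DataFrame({
    'title':   ['Ganglands', 'Intrusion', 'Ganglands'],
    'country': [np.nan, np.nan, 'Belgium'],
})
found = {'Ganglands': 'France', 'Intrusion': 'United States'}

# fillna only touches missing entries, so the existing 'Belgium' is preserved.
toy['country'] = toy['country'].fillna(toy['title'].map(found))
```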


In [32]:
ott.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | platform |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... | Netflix |
| 1 | s2 | TV Show | Blood & Water | | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... | Netflix |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | France | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... | Netflix |
| 3 | s4 | TV Show | Jailbirds New Orleans | | | United States | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... | Netflix |
| 4 | s5 | TV Show | Kota Factory | | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... | Netflix |


In [33]:
# ott.to_csv('data/ott_fillcountries.csv')
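If the export is enabled, note that `to_csv` writes the row index by default, and reading the file back then yields an extra `Unnamed: 0` column. A small round-trip sketch on toy data (in-memory buffers instead of a real file, both assumptions for illustration) shows the difference that `index=False` makes:

```python
import io
import pandas as pd

toy = pd.DataFrame({'show_id': ['s1', 's2'], 'platform': ['Netflix', 'Prime']})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
```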