## Movies in OTT platforms

In the 21st century, technology plays a vital role all over the world. OTT platforms are among the most prominent emerging technologies.

Over the past decade, streaming companies have completely shaken up the distribution method that reigned supreme for the better part of a century. OTT stands for over-the-top, a type of streaming media service. These internet-based services go “over-the-top” of traditional programming distributors like cable, satellite, and broadcasting.

Nowadays, most people have their own smart devices, such as laptops, smartphones, and tablets, with round-the-clock Wi-Fi access. This has made it easy for filmmakers to bring their movies to customers through OTT platforms such as 'Netflix' and 'Amazon'.

This dataset contains information about the movies available on various OTT platforms. For each movie, it provides details such as director, genre, language, runtime, and rating.

In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
## libraries for data wrangling and visualisation are imported
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
Movies = pd.read_csv('/kaggle/input/movies-on-netflix-prime-video-hulu-and-disney/MoviesOnStreamingPlatforms_updated.csv')
Movies.head()

In [None]:
Movies.info()

* There are 10 numerical columns and 7 categorical columns present in the dataset.
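
The split between numerical and categorical columns can also be checked in code rather than read off the `info()` output. Here is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the Movies frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Year': [2001, 2010, 2019],
    'IMDb': [6.1, None, 7.4],
    'Netflix': [1, 0, 1],
    'Genres': ['Drama', 'Comedy,Drama', None],
})

# Columns with a numeric dtype versus everything else.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(len(numeric_cols), 'numerical:', numeric_cols)
print(len(categorical_cols), 'categorical:', categorical_cols)
```

Running the same two `select_dtypes` calls on `Movies` should give the counts listed above.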

## Data Cleaning

#### --> Columns Removal

In [None]:
Movies.drop(['Unnamed: 0','ID','Rotten Tomatoes'],axis = 1,inplace = True)

In [None]:
Movies.columns

In [None]:
print('No of features:',Movies.shape[1],'\nNo of Movies:',Movies.shape[0])

#### --> Missing Value Treatment

In [None]:
Movies.isnull().sum()

In [None]:
Movies[['Directors','Genres','Country','Language','Runtime','Age']] = Movies[['Directors','Genres','Country','Language','Runtime','Age']].fillna('NA')

In [None]:
Movies['Age'].value_counts()

In [None]:
# regex=False treats '+' as a literal character rather than a regex quantifier
Movies['Age'] = Movies['Age'].str.replace('+','',regex = False)
Movies['Age'] = Movies['Age'].replace('NA','0')
Movies['Age'] = Movies['Age'].replace('all','1')

In [None]:
Movies['Age'].value_counts()

* 1 - represents all age groups
* 0 - represents a missing value
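
The cleaned labels are still strings, so casting them to integers makes the column sortable and comparable. A small sketch, assuming the same 0/1 encoding as above; `age` is an invented sample, not the real column.

```python
import pandas as pd

# Invented sample of raw 'Age' labels, including a missing value.
age = pd.Series(['18+', '7+', 'all', None, '13+', '16+'])

age_num = (
    age.fillna('0')                          # 0 marks a missing label
       .replace('all', '1')                  # 1 marks "all ages"
       .str.replace('+', '', regex=False)    # strip the literal plus sign
       .astype(int)
)
print(age_num.tolist())
```

With integer values, filters such as `age_num >= 13` work directly.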

In [None]:
# fill replaces the deprecated shade argument (seaborn 0.11 and later)
sns.kdeplot(Movies['IMDb'],fill = True)
plt.axvline(x = Movies['IMDb'].median(),color = 'red',label = 'median')
plt.legend()

In [None]:
# Assign the result back instead of calling fillna in place on a column selection
Movies['IMDb'] = Movies['IMDb'].fillna(Movies['IMDb'].median())

* About 3% of the IMDb values are missing, so they are filled with the median rating.
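
The share of missing values and the effect of a median fill can be sketched on a handful of invented ratings (the numbers below are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Invented ratings with two gaps; not taken from the dataset.
imdb = pd.Series([5.0, 6.0, np.nan, 7.0, 9.5, np.nan, 6.5, 6.0])

missing_pct = imdb.isna().mean() * 100      # share of missing values in percent
filled = imdb.fillna(imdb.median())         # median() skips NaN by default

print(f'{missing_pct:.1f}% missing, median = {imdb.median()}')
```

The median is less sensitive than the mean to a few very high or very low ratings, which is one reason to prefer it for a skewed distribution.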

## Distribution of Movies

Let us understand the distribution of movies based on the features such as 'Directors','Genres','Country','Language','Age' etc.

#### OTT Platforms

In [None]:
m_count = {'platform':['Netflix','Hulu','Prime Video','Disney+'],
            'MCount':[Movies['Netflix'].sum(),Movies['Hulu'].sum(),Movies['Prime Video'].sum(),Movies['Disney+'].sum()]}

m_count = pd.DataFrame(m_count)

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x='platform',y='MCount',data = m_count)
plt.xlabel('OTT platform',labelpad = 20)
plt.ylabel('count',labelpad = 20)
plt.show()

Based on the graph,

* 'Prime Video' has the most movies of all the platforms.
* It is followed by 'Netflix'.
* 'Hulu' and 'Disney+' have the fewest movies.
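
Because each platform column is a 0/1 flag, the counts can also be obtained by summing the flag columns in one call. A sketch on invented flags (`flags` is a hypothetical stand-in for the four platform columns):

```python
import pandas as pd

# Invented 0/1 availability flags, one row per movie.
flags = pd.DataFrame({
    'Netflix':     [1, 0, 1, 0],
    'Hulu':        [0, 0, 1, 0],
    'Prime Video': [1, 1, 0, 1],
    'Disney+':     [0, 0, 0, 1],
})

# Summing each indicator column gives the number of titles per platform.
platform_counts = flags.sum().sort_values(ascending=False)
print(platform_counts)
```

Note that a title available on several platforms is counted once per platform, so the counts can add up to more than the number of movies.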

#### **Directors**

In [None]:
Dir = Movies.drop('Directors', axis=1).join(
    Movies['Directors'].str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('Director'))
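# Drop the 'NA' placeholder by label, then keep the 15 most frequent directors.
# to_frame('Director') gives the same column name on pandas 1.x and 2.x.
D_count = Dir['Director'].value_counts().drop('NA',errors = 'ignore').head(15).rename_axis(None).to_frame('Director')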


In [None]:
plt.figure(figsize=(8,8))
sns.barplot(x=D_count.index,y=D_count.Director,data = D_count)
plt.xticks(rotation =90)
plt.xlabel('Director')
plt.ylabel('count')
plt.show()

_Directors_ play a vital role in movies.

Here,

* Jay Chapman stands first with a count of 35+ movies.
* Joseph Kane and Cheh Chang come next with 27+ movies each.
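
The counts rely on giving every director in a comma-separated cell a row of their own. The same reshaping can be sketched with `explode`, shown here on invented rows (`toy` and the names in it are hypothetical):

```python
import pandas as pd

# Invented rows; 'Directors' holds comma-separated names as in the dataset.
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Directors': ['Ann Lee,Bo Kim', 'Ann Lee', 'NA'],
})

# Split each cell into a list, then give every name its own row.
per_director = (
    toy.assign(Director=toy['Directors'].str.split(','))
       .explode('Director')
       .drop(columns='Directors')
)

# Count movies per director, leaving out the 'NA' placeholder.
counts = per_director.loc[per_director['Director'] != 'NA', 'Director'].value_counts()
print(counts)
```

A movie with two directors contributes one row to each of them, so it is counted for both.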

#### **Language**

In [None]:
Lang = Movies.drop('Language', axis=1).join(
    Movies['Language'].str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('Language'))

# to_frame('Language') gives the same column name on pandas 1.x and 2.x
Lang_count = Lang['Language'].value_counts().head(25).rename_axis(None).to_frame('Language')

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(x=Lang_count.index,y=Lang_count.Language)
plt.xticks(rotation =90)
plt.xlabel('Language')
plt.ylabel('count')
plt.show()

* Most of the movies are in English.

* A likely reason is that English is a global language, so English movies are given priority over regional-language movies.

#### Genre

In [None]:
Genre = Movies.drop('Genres', axis=1).join(
    Movies['Genres'].str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('Genre'))
# to_frame('Genre') gives the same column name on pandas 1.x and 2.x
Genre_count = Genre['Genre'].value_counts().rename_axis(None).to_frame('Genre')

In [None]:
plt.figure(figsize=(8,8))
sns.barplot(x=Genre_count.index,y=Genre_count.Genre)
plt.xticks(rotation =90)
plt.xlabel('Genre')
plt.ylabel('count')
plt.show()

* Most of the movies fall under the 'Drama' genre.

* It is prominent on all the platforms, possibly because it appeals to family audiences.

* It is followed by 'Comedy', 'Thriller', and 'Action'.

In [None]:
# Count genre occurrences by column name instead of a hard-coded column position
G_count = Genre['Genre'].value_counts().to_dict()

from wordcloud import WordCloud
wc = WordCloud(background_color='white')
wc.generate_from_frequencies(G_count)
plt.figure(figsize=(12,10))
plt.imshow(wc,interpolation='bilinear')
plt.axis('off')
plt.show()

## Drill Down Analysis

We have seen how the movies are distributed across platforms, languages, genres, and directors. Now let's analyse each of these features in more detail.

#### Language

##### -->Language vs platforms

In [None]:
# Count languages per platform, dropping the 'NA' placeholder.
# to_frame('Language') gives the same column name on pandas 1.x and 2.x.
L_Netflix = Lang.loc[Lang['Netflix'] == 1,'Language'].value_counts().drop('NA',errors = 'ignore').to_frame('Language')
L_Prime =  Lang.loc[Lang['Prime Video'] == 1,'Language'].value_counts().drop('NA',errors = 'ignore').to_frame('Language')
L_Hulu = Lang.loc[Lang['Hulu'] == 1,'Language'].value_counts().drop('NA',errors = 'ignore').to_frame('Language')
L_Disney = Lang.loc[Lang['Disney+'] == 1,'Language'].value_counts().drop('NA',errors = 'ignore').to_frame('Language')

In [None]:
fig , axes = plt.subplots(2,2,figsize = (12,12))
 
plt.subplots_adjust(hspace = 0.6,wspace = 0.5)    
    
L_Netflix.head(10).plot(kind = 'bar',ax = axes[0,0])
axes[0,0].set_title('Netflix')
axes[0,0].set_xlabel('')
axes[0,0].set_ylabel('')

L_Prime.head(10).plot(kind = 'bar',ax = axes[0,1])
axes[0,1].set_title('Prime Video')
axes[0,1].set_xlabel('')
axes[0,1].set_ylabel('')


L_Hulu.head(10).plot(kind = 'bar',ax = axes[1,0])
axes[1,0].set_title('Hulu')
axes[1,0].set_xlabel('')
axes[1,0].set_ylabel('')



L_Disney.head(10).plot(kind = 'bar',ax = axes[1,1])
axes[1,1].set_title('Disney')

axes[1,1].set_xlabel('')
axes[1,1].set_ylabel('')

fig.text(0.5, 0.004, 'Language', ha='center',fontsize = 'large')
fig.text(0.004, 0.5, 'Count', va='center', rotation='vertical',fontsize = 'large')
plt.show()

All the platforms have more movies in English than in any other language.
* __Netflix__: After English, Hindi and Spanish movies come next.

* __Prime Video__: After English, French and Spanish movies come next, with Hindi behind these three.

* __Hulu__: After English, French and Spanish movies come next, with German behind these three.

* __Disney+__: After English, French and Spanish movies come next, followed by German.
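
The four per-platform counts follow the same pattern, so they can be written as one loop. A sketch on invented long-format rows, where `toy` and the dictionary `top_languages` are hypothetical names:

```python
import pandas as pd

# Invented long-format rows: one row per (movie, language) pair.
toy = pd.DataFrame({
    'Language': ['English', 'Hindi', 'English', 'French', 'NA', 'English'],
    'Netflix':  [1, 1, 0, 0, 1, 1],
    'Hulu':     [0, 0, 1, 1, 0, 1],
})

top_languages = {}
for platform in ['Netflix', 'Hulu']:
    on_platform = toy.loc[toy[platform] == 1, 'Language']
    # Drop the 'NA' placeholder, then keep the most frequent languages.
    top_languages[platform] = (
        on_platform[on_platform != 'NA'].value_counts().head(10)
    )

print(top_languages['Netflix'])
```

Keeping the results in a dictionary keyed by platform also makes it easy to loop over them when drawing the subplots.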

##### -->Language vs ratings

In [None]:
L_ratings = Lang.groupby('Language')['IMDb'].median()
L_ratings = L_ratings.reset_index().set_index('Language')

In [None]:
Top_10_lang = L_ratings.loc[['English','Hindi','Spanish','French','German','Italian'
                                                      ,'Japanese','Korean','Mandarin','Russian'],'IMDb']
Top_10_lang

In [None]:
Top_10_lang = Top_10_lang.reset_index().set_index('Language')
English = Lang.loc[Lang['Language']=='English','IMDb'].reset_index().set_index('index')

In [None]:
fig,axes = plt.subplots(1,2,figsize = (18,6))

# fill replaces the deprecated shade argument; draw the median line on the same axes
sns.kdeplot(English['IMDb'],ax = axes[1],fill = True)
axes[1].axvline(English['IMDb'].median(),color = 'red')

sns.barplot(x=Top_10_lang.index,y=Top_10_lang['IMDb'],ax = axes[0])
plt.show()

* The top 10 languages were chosen based on the number of movies, and the median rating was calculated for each.

* Since most movies are in English, the rating distribution of English movies is plotted separately.

* English movies have a median rating of about 6.0.
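
A median based on only a few movies is less reliable than one based on thousands, so it helps to look at the median and the movie count side by side. A sketch on invented ratings (`toy` is a hypothetical stand-in for the long-format language table):

```python
import pandas as pd

# Invented (language, rating) pairs in the same long format as Lang.
toy = pd.DataFrame({
    'Language': ['English', 'English', 'English', 'Hindi', 'Hindi', 'Korean'],
    'IMDb':     [5.0, 6.0, 7.0, 6.5, 7.5, 8.0],
})

summary = (
    toy.groupby('Language')['IMDb']
       .agg(['median', 'count'])              # rating level and sample size
       .sort_values('count', ascending=False)
)
print(summary)
```

In this toy data, the language with the highest median is also the one with a single movie, which shows why the count column matters.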

#### Genre

At its core, a film's genre is a simple tool for categorizing how it compares to other films. It is a broad bucket of similar elements that groups films together, which makes them easier to sell and makes it easier to convey the general experience of a film.

##### -->Genre vs platforms

In [None]:
# to_frame('Genre') gives the same column name on pandas 1.x and 2.x
G_Netflix = Genre.loc[Genre['Netflix'] == 1,'Genre'].value_counts().to_frame('Genre')
G_Prime =  Genre.loc[Genre['Prime Video'] == 1,'Genre'].value_counts().to_frame('Genre')
G_Hulu = Genre.loc[Genre['Hulu'] == 1,'Genre'].value_counts().to_frame('Genre')
G_Disney = Genre.loc[Genre['Disney+'] == 1,'Genre'].value_counts().to_frame('Genre')

In [None]:
fig , axes = plt.subplots(2,2,figsize = (12,12))

 
G_Netflix.head(10).plot(kind = 'bar',ax = axes[0,0],color = 'brown')
axes[0,0].set_title('Netflix')
axes[0,0].set_xlabel('')
axes[0,0].set_ylabel('')

G_Prime.head(10).plot(kind = 'bar',ax = axes[0,1],color = 'green')
axes[0,1].set_title('Prime Video')
axes[0,1].set_xlabel('')

G_Hulu.head(10).plot(kind = 'bar',ax = axes[1,0],color = 'gray')
axes[1,0].set_title('Hulu')
axes[1,0].set_xlabel('')
axes[1,0].set_ylabel('')


G_Disney.head(10).plot(kind = 'bar',ax = axes[1,1])
axes[1,1].set_title('Disney')
axes[1,1].set_xlabel('')

plt.tight_layout()
fig.text(0.5, 0.004, 'Genre', ha='center',fontsize = 'large')
fig.text(0.004, 0.5, 'Count', va='center', rotation='vertical',fontsize = 'large')
plt.show()

From the plots, let's examine the relationship between genre and platforms.

__Netflix:__
* 'Drama' has 1500+ movies.
* It is followed by 'Comedy' with 1300+ movies.
* 'Thriller', 'Romance', 'Action', and 'Documentary' have roughly equal counts.

__Prime Video:__
* 'Drama' has 5000+ movies.
* It is followed by 'Comedy' and 'Thriller' with 2500+ movies each.
* 'Action' comes next.

__Hulu:__
* Here too, 'Drama' comes first with 450+ movies.
* 'Comedy' and 'Thriller' follow 'Drama'.

__Disney+:__

* Unlike the other three platforms, 'Family' holds the leading position with 450+ movies.
* 'Comedy' and 'Adventure' come next, followed by 'Fantasy'.
    
From this we infer that,
*  __Disney+__ is more suitable for kids and families.
*  __Netflix__ and __Prime Video__ are more suitable for people who enjoy 'Drama' and 'Comedy'.
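
Since the platforms differ greatly in size, raw counts make small catalogues look unimportant. Comparing the share of each genre within a platform is one way around this. A sketch on invented rows, with `toy` and `genre_share` as hypothetical names:

```python
import pandas as pd

# Invented long-format rows: one row per (movie, genre) pair with platform flags.
toy = pd.DataFrame({
    'Genre':   ['Drama', 'Comedy', 'Family', 'Family', 'Drama', 'Drama'],
    'Netflix': [1, 1, 0, 0, 1, 1],
    'Disney+': [0, 0, 1, 1, 1, 0],
})

shares = {}
for platform in ['Netflix', 'Disney+']:
    genres = toy.loc[toy[platform] == 1, 'Genre']
    # normalize=True returns fractions of the platform's genre tags, not raw counts.
    shares[platform] = genres.value_counts(normalize=True)

# Genres that never appear on a platform get a share of 0.
genre_share = pd.DataFrame(shares).fillna(0)
print(genre_share.round(2))
```

Each column sums to 1, so the platforms can be compared directly regardless of how many movies they carry.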

##### -->Genre vs ratings

In [None]:
G_ratings = Genre.groupby('Genre')['IMDb'].median()
G_ratings = G_ratings.reset_index().set_index('Genre')

In [None]:
Top_10_genre = G_ratings.loc[['Drama','Comedy','Thriller','Action','Romance','Crime','Adventure','Horror','Family','Mystery'],'IMDb']
Top_10_genre = Top_10_genre.reset_index().set_index('Genre')

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x=Top_10_genre.index,y=Top_10_genre['IMDb'])
plt.xticks(rotation = 90)
plt.show()

Based on the movie count, the top 10 genres have been chosen.

* All of these genres have a median rating of around 5.8 to 6.2.

* 'Family' and 'Drama' have the highest median rating, 6.2.
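
The list of top genres does not have to be typed by hand; it can be derived from the counts and then used to look up the medians. A sketch on invented rows (`toy`, `top_genres`, and `top_ratings` are hypothetical names):

```python
import pandas as pd

# Invented long-format rows: one row per (movie, genre) pair.
toy = pd.DataFrame({
    'Genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Family'],
    'IMDb':  [6.0, 7.0, 6.5, 5.5, 6.5, 6.2],
})

# Most frequent genres, derived from the data instead of typed by hand.
top_genres = toy['Genre'].value_counts().head(2).index

# Median rating for just those genres, in order of frequency.
top_ratings = toy.groupby('Genre')['IMDb'].median().loc[top_genres]
print(top_ratings)
```

The same pattern would apply to the language and director lists, with `head(10)` in place of `head(2)`.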

#### Directors


A film director controls a film's artistic and dramatic aspects and visualizes the screenplay (or script) while guiding the technical crew and actors in the fulfilment of that vision. The director has a key role in choosing the cast members, production design and all the creative aspects of filmmaking.

##### -->Director vs platforms

In [None]:
# Count directors per platform, dropping the 'NA' placeholder.
# to_frame('Director') gives the same column name on pandas 1.x and 2.x.
D_Netflix = Dir.loc[Dir['Netflix'] == 1,'Director'].value_counts().drop('NA',errors = 'ignore').to_frame('Director')
D_Prime =  Dir.loc[Dir['Prime Video'] == 1,'Director'].value_counts().drop('NA',errors = 'ignore').to_frame('Director')
D_Hulu = Dir.loc[Dir['Hulu'] == 1,'Director'].value_counts().drop('NA',errors = 'ignore').to_frame('Director')
D_Disney = Dir.loc[Dir['Disney+'] == 1,'Director'].value_counts().drop('NA',errors = 'ignore').to_frame('Director')

In [None]:
fig,axes = plt.subplots(2,2,figsize=(12,12))
D_Netflix.head(10).plot(kind = 'bar',ax = axes[0,0],color = 'brown')
axes[0,0].set_title('Netflix')
axes[0,0].set_xlabel('')
axes[0,0].set_ylabel('')


D_Prime.head(10).plot(kind = 'bar',ax = axes[0,1],color = 'green')
axes[0,1].set_title('Prime Video')
axes[0,1].set_xlabel('')
axes[0,1].set_ylabel('')


D_Hulu.head(10).plot(kind = 'bar',ax = axes[1,0],color = 'gray')
axes[1,0].set_title('Hulu')
axes[1,0].set_xlabel('')
axes[1,0].set_ylabel('')



D_Disney.head(10).plot(kind = 'bar',ax = axes[1,1])
axes[1,1].set_title('Disney')
axes[1,1].set_xlabel('')
axes[1,1].set_ylabel('')

fig.tight_layout()
fig.text(0.5, 0.004, 'Director', ha='center',fontsize = 'large')
fig.text(0.004, 0.5, 'Count', va='center', rotation='vertical',fontsize = 'large')
plt.show()

The relationship between directors and platforms is shown above.

Netflix:

* 'Jan Suter' and 'Raul Campos' lead on Netflix with 20+ movies each.

Prime Video:

* 'Joseph Kane' leads on Prime Video with 30+ movies.
* He is followed by 'Jay Chapman'.

Hulu:

* 'Tyler Perry' and 'Steve Holland' each have 5+ movies on Hulu.

Disney+:

* 'Paul Hoen' leads on Disney+ with 15+ movies.
* 'James Algar' follows him with 11+ movies.

##### -->Director vs ratings

In [None]:
D_ratings = Dir.groupby('Director')['IMDb'].median()
D_ratings = D_ratings.reset_index().set_index('Director')

In [None]:
Top_10_dir = D_ratings.loc[['Jay Chapman','Joseph Kane','Cheh Chang','Jim Wynorski','William Beaudine','Sam Newfield','David DeCoteau','Jay Karas','Marcus Raboy','William Witney'],'IMDb']
Top_10_dir = Top_10_dir.reset_index().set_index('Director')

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x=Top_10_dir.index,y=Top_10_dir['IMDb'])
plt.xticks(rotation = 90)
plt.xlabel('Director',labelpad= 20)
plt.ylabel('IMDb',labelpad = 20)
plt.show()

Based on the total movie count, the top 10 directors have been chosen and their median ratings are displayed. 'Jay Karas' takes first position on ratings, followed by 'Jay Chapman'.

#### Understanding the ratings

A high IMDb rating can make a significant contribution to a movie's success. The rating indicates how good a movie is and helps viewers decide whether to watch it.

In [None]:
f,ax=plt.subplots(1,2,figsize=(20,7))
# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(Movies['IMDb'],bins=20,kde=True,color='r',ax=ax[0])
sns.boxplot(x=Movies['IMDb'],ax=ax[1],color='r',saturation=0.5)
plt.show()

* Most ratings lie between 5 and 7, and some outliers are present.
* A movie is considered extraordinary when its rating is above 8 and good when its rating lies between 6 and 8.
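
These thresholds can be turned into labelled bands and counted. The sketch below uses invented ratings and assumes that ratings of exactly 6.0 and 8.0 both count as "good", which the text above does not specify.

```python
import numpy as np
import pandas as pd

# Invented ratings; not taken from the dataset.
imdb = pd.Series([4.2, 5.9, 6.0, 7.3, 8.0, 8.6])

# Assumed boundary handling: 6.0 and 8.0 both fall in the "good" band.
band = np.select(
    [imdb > 8, imdb >= 6],            # conditions are checked in this order
    ['extraordinary', 'good'],
    default='below 6',
)
band_counts = pd.Series(band).value_counts()
print(band_counts)
```

Applying the same conditions to `Movies['IMDb']` would show how many movies fall into each band.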

## Movies in India

#### Indian Language Movies

In [None]:
country = Movies[Movies['Country']=='India']
country = country.drop('Language', axis=1).join(
    country['Language'].str.split(',', expand=True).stack().reset_index(drop=True, level=1).rename('Language'))

In [None]:
# to_frame('Language') gives the same column name on pandas 1.x and 2.x
Top_10_lang_india = country['Language'].value_counts().head(10).rename_axis(None).to_frame('Language')

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x=Top_10_lang_india.index,y=Top_10_lang_india['Language'])
plt.xticks(rotation = 90)
plt.xlabel('Language',labelpad= 20)
plt.ylabel('count',labelpad = 20)
plt.show()

When it comes to Indian movies,

* 'Hindi' leads every other language with a count of 650+ movies.
* 'Tamil' and 'English' come next with 150+ movies each.

##### -->Indian Movies vs OTT platforms

In [None]:
Im_count = {'platform':['Netflix','Hulu','Prime Video','Disney+'],
            'ImCount':[country['Netflix'].sum(),country['Hulu'].sum(),country['Prime Video'].sum(),country['Disney+'].sum()]}

Im_count = pd.DataFrame(Im_count)

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(x='platform',y='ImCount',data = Im_count)
plt.xlabel('OTT platform')
plt.ylabel('count')
plt.show()

* Prime Video and Netflix are close when it comes to Indian movies.
* Prime Video has 800+ and Netflix has 700+ movies.
* These two platforms are the preferred choice for Indian movies, particularly in Hindi.
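
The filter `Movies['Country']=='India'` keeps only movies whose country is exactly 'India'. Assuming the `Country` column can hold several comma-separated names, like the other multi-valued columns, co-productions would be left out. The sketch below contrasts the two filters on invented rows (`toy` is hypothetical):

```python
import pandas as pd

# Invented rows; 'Country' may list several comma-separated countries.
toy = pd.DataFrame({
    'Title':   ['A', 'B', 'C', 'D'],
    'Country': ['India', 'India,United States', 'United States', 'NA'],
})

# Exact match: only movies produced solely in India.
only_india = toy[toy['Country'] == 'India']

# Membership match: any movie whose country list includes India.
includes_india = toy[toy['Country'].str.split(',').apply(lambda c: 'India' in c)]

print(len(only_india), len(includes_india))
```

Which filter is appropriate depends on whether co-productions should count as Indian movies.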

## Conclusion

This kernel is an attempt to understand the distribution of movies across OTT platforms. It looks at how the choice of platform relates to director, rating, genre, language, and so on. Many more insights can be extracted through further exploration.