This notebook contains an exploratory data analysis of the movies available on four major OTT platforms: Netflix, Prime Video, Disney+ and Hulu.

I hope you find this kernel helpful and some **<span style="color:red">UPVOTES</span>** would be very much appreciated.

### **Importing the required libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings("ignore")

### **Load the dataset**

In [None]:
df = pd.read_csv('/kaggle/input/movies-on-netflix-prime-video-hulu-and-disney/MoviesOnStreamingPlatforms_updated.csv')
df.head()

The dataset contains three unwanted columns: **Unnamed: 0**, **ID** and **Type**.


Removing the unnecessary columns from the dataset

In [None]:
df = df.drop(['Unnamed: 0', 'ID', 'Type'], axis='columns')
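
Running the cell above a second time raises a `KeyError`, because the columns are already gone. A minimal sketch on a made-up frame (the `toy` data is illustrative, not taken from the dataset) shows how `errors='ignore'` makes the step safe to re-run:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'ID': [1, 2],
    'Title': ['Movie A', 'Movie B'],
    'Type': [0, 0],
})

unwanted = ['Unnamed: 0', 'ID', 'Type']

# errors='ignore' skips labels that are not present instead of raising KeyError
toy = toy.drop(columns=unwanted, errors='ignore')
toy = toy.drop(columns=unwanted, errors='ignore')  # second run changes nothing

print(toy.columns.tolist())
```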

Now the dataset contains the following columns



1. **Title:** The Title of the Movie
2. **Year:** The Year in which the Movie was released
3. **Age:** Minimum age required to watch the movie
4. **IMDb:** The IMDb Score of the Movie (out of 10)
5. **Rotten Tomatoes:** The Rotten Tomatoes Score of the Movie (out of 100)
6. **Netflix:** Whether the movie is present on Netflix or not (1 for True, 0 for False)
7. **Hulu:** Whether the movie is present on Hulu or not (1 for True, 0 for False)
8. **Prime Video:** Whether the movie is present on Prime Video or not (1 for True, 0 for False)
9. **Disney+:** Whether the movie is present on Disney+ or not (1 for True, 0 for False)
10. **Directors:** Director(s) of the Movie
11. **Genres:** Genres of the Movies
12. **Country:** Countries in which the movie was directed
13. **Language:** Language(s) in which the movie is available

### **Features of columns in the dataset**

In [None]:
df.info()

It looks like the dataset has null values. Let's check how much is missing in each column.

### **Checking for Null values in the dataset**

In [None]:
null_values = pd.DataFrame(df.isnull().sum() / df.shape[0] * 100).reset_index()
null_values = null_values.rename(columns={'index':'Column Name', 0:'Percentage Missing'})
null_values = null_values[null_values['Percentage Missing'] > 0].sort_values(by='Percentage Missing', ascending=False)
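
The same percentages can also be computed with `isnull().mean()`, since the mean of a boolean column is the fraction of `True` entries. A small sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values are made up)
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Age': ['18+', np.nan, np.nan, '7+'],
    'IMDb': ['7.1/10', '6.0/10', np.nan, '8.2/10'],
})

# Fraction of nulls per column, scaled to a percentage
missing_pct = (
    toy.isnull().mean().mul(100)
       .rename_axis('Column Name')
       .reset_index(name='Percentage Missing')
)

# Keep only columns that actually have missing values, largest first
missing_pct = (
    missing_pct[missing_pct['Percentage Missing'] > 0]
    .sort_values(by='Percentage Missing', ascending=False)
    .reset_index(drop=True)
)

print(missing_pct)
```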

In [None]:
plt.figure(figsize=(12, 10))
sns.set_style("white")

plt.title("Percentage of Missing Values in the dataset", fontsize=25)

labels = null_values['Column Name'].tolist()
sizes = null_values['Percentage Missing'].tolist()
colors = ['#845EC2', '#00C9A7','#C4FCEF','#4D8076',"#B39CD0","#FBEAFF","#F3C5FF","#FEFEDF"]

plt.pie(sizes,labels=labels, startangle=180, autopct='%1.1f%%',
        colors=colors,
        wedgeprops = { 'linewidth' : 3, 'edgecolor' : 'white' },
        labeldistance=1.15)

plt.show()

The IMDb and Rotten Tomatoes scores are stored as strings such as `x/10` and `x/100`. Let's strip the suffixes and convert them to numeric values.

In [None]:
#IMDb
df['IMDb'] = df['IMDb'].str.replace("/10", "")
df['IMDb'] = pd.to_numeric(df["IMDb"])

#Rotten Tomatoes
df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.replace("/100", "")
df['Rotten Tomatoes'] = pd.to_numeric(df["Rotten Tomatoes"])
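
The conversion above assumes every non-null score ends in `/10` or `/100`. A hedged sketch on made-up values: `regex=False` treats the suffix as a literal string, and `errors='coerce'` turns anything that still cannot be parsed into `NaN` instead of raising. The sample values, including the malformed `'N/A'`, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy scores in the same "value/scale" format (values are made up)
toy = pd.DataFrame({
    'IMDb': ['8.8/10', '7.4/10', np.nan, 'N/A'],
    'Rotten Tomatoes': ['87/100', np.nan, '64/100', '100/100'],
})

for column, suffix in [('IMDb', '/10'), ('Rotten Tomatoes', '/100')]:
    # Remove the literal suffix, then parse; unparseable entries become NaN
    cleaned = toy[column].str.replace(suffix, '', regex=False)
    toy[column] = pd.to_numeric(cleaned, errors='coerce')

print(toy.dtypes)
print(toy)
```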

In [None]:
# Let's have a look at the dataset again
df.head()

### **Distribution of Ratings on IMDb and Rotten Tomatoes**

In [None]:
fig,ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))

a = sns.histplot(df['IMDb'], ax=ax[0])
b = sns.histplot(df['Rotten Tomatoes'], color='red', ax=ax[1])
a.text(x=1, y=650,s='median = ' + str(df['IMDb'].median()), fontname = 'monospace', fontsize = 16, color = '#32384D')
b.text(x=10, y=550,s='median = ' + str(df['Rotten Tomatoes'].median()), fontname = 'monospace', fontsize = 16, color = '#32384D')

for graph in [a, b]:
    graph.grid(color='black', linestyle = ':', axis='y', alpha=1, zorder=0,
            dashes= (1, 7))

for graph in [a, b]:
    for w in ['right', 'top','left','bottom']:
        graph.spines[w].set_linewidth(1.2)


plt.figtext(0.14,0.98, 'Distribution of Ratings on IMDb and Rotten Tomatoes', fontsize=28,
           fontname='monospace')
fig.tight_layout(pad=4)
plt.show()

### **Distribution of Age Groups**

In [None]:
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
age_groups = df['Age'].value_counts().rename_axis('Age Group').reset_index(name='Count')
age_groups

In [None]:
plt.figure(figsize = (12, 8))

a = sns.barplot(x='Age Group', y='Count', data = age_groups, palette='copper',linewidth=1.5)

plt.figtext(x=0.14, y=0.95,
            s='Distribution of Movies based on Age Groups', 
            fontsize=25, fontname='monospace')

plt.xticks(fontsize=15, fontname='monospace')
plt.yticks(fontsize=15, fontname='monospace')
plt.xlabel('Age Group', fontsize=14)
plt.ylabel('Count', fontsize=14)

plt.grid(axis='y', color='black', linestyle = ':', alpha=0.5)

for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(1.5)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)
        
plt.show()

### **Finding all the unique genres present in the dataset**

In [None]:
def get_unique_values(value_list):
    '''
    Takes an iterable of comma-separated strings (e.g. genres or languages)
    and returns a list of all the unique values, the number of entries
    with more than one value and the number of entries with exactly one value.
    Missing (non-string) entries are skipped.
    '''
    more_than_one = 0
    only_one = 0
    unique_values = []
    for entry in value_list:
        # NaN entries are floats, not strings, so skip them explicitly
        if not isinstance(entry, str):
            continue
        values = entry.split(",")
        if len(values) > 1:
            more_than_one += 1
        else:
            only_one += 1
        for value in values:
            if value not in unique_values:
                unique_values.append(value)

    return unique_values, more_than_one, only_one

In [None]:
# Pass one entry per movie (not .unique()), so the counts below refer to movies
unique_genres, more_than_one_genre, only_one_genre = get_unique_values(df['Genres'].dropna())

print('Total Number of Unique Genres are: ', len(unique_genres))
print('Movies having more than one genre: ', more_than_one_genre)
print('Movies having only one genre: ', only_one_genre)
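
The same three numbers can be obtained without an explicit loop. A sketch using pandas string methods on made-up data, assuming comma-separated values as in the `Genres` column (`toy_genres` and the `toy_*` names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy genre column, one entry per movie (values are made up)
toy_genres = pd.Series([
    'Action,Adventure,Sci-Fi',
    'Comedy',
    np.nan,
    'Comedy,Drama',
    'Drama',
])

# Work on non-null rows only: one list of genres per movie
genre_lists = toy_genres.dropna().str.split(',')

# Number of genres listed for each movie
genres_per_movie = genre_lists.str.len()
toy_more_than_one = int((genres_per_movie > 1).sum())
toy_only_one = int((genres_per_movie == 1).sum())

# explode() yields one row per (movie, genre) pair
toy_unique = sorted(genre_lists.explode().str.strip().unique())

print(toy_unique, toy_more_than_one, toy_only_one)
```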

### **Let's find the number of movies in each genre**

In [None]:
genre_dict = {}

for val in unique_genres:
    genre_dict[val] = 0

In [None]:
# Removing all the null values from genres
new_df = df[df['Genres'].notna()]

In [None]:
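# Count exact genre labels; a substring match would also count "Musical" as "Music"
genre_token_counts = new_df['Genres'].str.split(',').explode().value_counts()
for genres in unique_genres:
    genre_dict[genres] = int(genre_token_counts.get(genres, 0))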
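
Substring matching with `str.contains` over-counts whenever one label is contained in another. A small sketch on made-up data (the `demo` series is an assumption for illustration) compares it with counting exact labels:

```python
import pandas as pd

# Toy genre column (values are made up)
demo = pd.Series([
    'Music,Drama',
    'Musical,Comedy',
    'Drama',
    'Musical',
])

# Substring match: 'Music' is also found inside 'Musical'
substring_count = int(demo.str.contains('Music', regex=False).sum())

# Exact match: split into labels first, then count each label
token_counts = demo.str.split(',').explode().value_counts()
exact_count = int(token_counts['Music'])

print('substring match:', substring_count)
print('exact match:', exact_count)
```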

In [None]:
genre_count = pd.DataFrame(columns=['Genre', 'Count'], 
                           data = {'Genre':[val for val in genre_dict.keys()],
                                    'Count': [val for val in genre_dict.values()]}).sort_values(by='Count', ascending=False).reset_index(drop=True)

In [None]:
plt.figure(figsize=(12,10))
plt.grid(axis='x',color='black', linestyle = ':', alpha=0.5)
plt.title('Top 10 Movie Genres', fontname='monospace', fontsize=25, y=1.05)
a = sns.barplot(x='Count', y='Genre', data=genre_count[:10], palette='rocket')

genres = genre_count['Genre'][:10].tolist()
for i, val in enumerate(genres):
    x_val = genre_count[genre_count['Genre'] == val]['Count'].values[0]
    a.text(y=i, x= x_val -300, 
           s=str(x_val),
          fontsize=14, fontname='monospace', color='white')
    
for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(1.5)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)

plt.xlabel('Count', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show()

### **Let's find the top languages in which the movies are available**

In [None]:
# Pass one entry per movie (not .unique()), so the counts below refer to movies
unique_languages, more_than_one_language, only_one_language = get_unique_values(df['Language'].dropna())
print('Total Number of Unique Languages are: ', len(unique_languages))
print('Movies available in more than one language: ', more_than_one_language)
print('Movies available in only one language: ', only_one_language)

In [None]:
language_dict = {}

for val in unique_languages:
    language_dict[val] = 0

In [None]:
# Removing all the null values from languages
lang_df = df[df['Language'].notna()]

In [None]:
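# Count exact language labels; a substring match would also count names that merely contain the label
language_token_counts = lang_df['Language'].str.split(',').explode().value_counts()
for language in unique_languages:
    language_dict[language] = int(language_token_counts.get(language, 0))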

In [None]:
language_count = pd.DataFrame(columns=['Language', 'Count'], 
                           data = {'Language':[val for val in language_dict.keys()],
                                    'Count': [val for val in language_dict.values()]}).sort_values(by='Count', ascending=False).reset_index(drop=True)

In [None]:
plt.figure(figsize=(12,10))
plt.grid(axis='x',color='black', linestyle = ':', alpha=0.5)
plt.title('Top 10 Movie Languages', fontname='monospace', fontsize=25, y=1.05)
a = sns.barplot(x='Count', y='Language', data=language_count[:10], palette='viridis')

languages = language_count['Language'][:10].tolist()
for i, val in enumerate(languages):
    x_val = language_count[language_count['Language'] == val]['Count'].values[0]
    a.text(y=i, x= x_val +270, 
           s=str(x_val),
          fontsize=14, fontname='monospace', color='black')
    
for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(1.5)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)

plt.xlabel('Count', fontsize=15)
plt.ylabel('Language', fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show()

#### What's next?


Analysis and insights for each individual streaming platform.

Kindly **<span style="color:red">UPVOTE</span>** if you found the notebook helpful.



**Suggestions are welcome**