# Netflix Movies And TV Shows

Special thanks to [Dmitryuarov](https://www.kaggle.com/dmitryuarov) for writing clean and visually appealing code that I could refer to when in doubt.

The goals are listed below; visualisation is the common thread throughout:
- Understanding the total shows and movies present.
- Knowing which countries produce the most new movies/shows.
- Knowing the distribution of the uploads among various age groups.
- Finding out the distribution between the years and the upload count.
- Understanding the general plot distribution using descriptions.
- Comparing the popular genres for American and Indian uploads.

## Imports
Most of the plotting here is done with ```plotly```. You could use ```matplotlib``` instead, but I find plotly more interactive and visually pleasing. Do note that I found its documentation hard to get around.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
data = data.drop(['director','cast'],axis=1)
data.head(5)

## Cleaning

In [None]:
data[data['rating'].isnull()]

In [None]:
print(pd.unique(data['rating']))

We replace the null <em>rating</em> values with "NR" and the null <em>country</em> values with "United States", as it is generally the country that uploads the most Movies and TV Shows.

In [None]:
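# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply
data['rating'] = data['rating'].fillna('NR')
data['country'] = data['country'].fillna('United States')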
print(pd.unique(data['rating']))
data.info()
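
A minimal sketch of the fill step on made-up rows, assuming a dict is passed to `DataFrame.fillna` so both columns are handled in one call. The names `toy`, `missing_before`, `filled` and `missing_after` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "rating": ["PG", np.nan, "R"],
    "country": [np.nan, "India", "Japan"],
})

# Count the missing values per column before filling
missing_before = toy.isna().sum()

# Fill each column with its own default; returns a new DataFrame
filled = toy.fillna({"rating": "NR", "country": "United States"})

# Confirm nothing is missing afterwards
missing_after = filled.isna().sum()
print(missing_before, missing_after, sep="\n")
```

Checking the counts before and after makes it easy to confirm that the fill touched only the rows that were actually empty.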

## Preprocessing

One of our goals is to compare the popular genres, so we first work out which genres each title belongs to. The <em>listed_in</em> field can hold several genres in a single comma-separated string, so we split it and, for each title, check six popular genres that are present in the dataset, adding one new 0/1 column per genre.


As we need to compare the releases of the United States and India at the end, we also need to know whether a movie or show belongs to either of them. As with the genres, a title can list multiple countries in a single string, so we split it and check each entry.

In [None]:
genres = {'Action & Adventure': [],'Dramas' : [],'Documentaries': [],'Comedies': [],'Sci-Fi & Fantasy': [],'Thrillers': []}
indices = {'Action & Adventure': 0,'Dramas' : 1,'Documentaries': 2,'Comedies': 3,'Sci-Fi & Fantasy': 4,'Thrillers': 5}
india = []
usa = []


for i,row in data.iterrows():
    done = [False] * 6
    for g in row['listed_in'].split(','):
        if g.strip() in genres and done[indices[g.strip()]] == False:
            genres[g.strip()].append(1)
            done[indices[g.strip()]] = True
    for k in genres.keys():
        if done[indices[k]] == False:
            genres[k].append(0)
    ind = us = False
    for c in row['country'].split(','):
        c = c.strip()  # entries after the first carry a leading space
        if c == "India":
            ind = True
        if c == "United States":
            us = True
    india.append(1 if ind else 0)
    usa.append(1 if us else 0)

data.insert(9, "Action & Adventure",genres['Action & Adventure'], True)
data.insert(9, "Dramas",genres['Dramas'], True)
data.insert(9, "Documentaries",genres['Documentaries'], True)
data.insert(9, "Comedies",genres['Comedies'], True)
data.insert(9, "Sci-Fi & Fantasy",genres['Sci-Fi & Fantasy'], True)
data.insert(9, "Thrillers",genres['Thrillers'], True)
data.insert(4, "USA",usa, True)
data.insert(4, "India",india, True)
data.head()
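
The flagging logic described above can also be sketched without an explicit row loop. This is an alternative under stated assumptions, not the notebook's method: `flag_membership` is a hypothetical helper and `toy` holds made-up rows. It splits each comma-separated string, strips the whitespace, and builds one 0/1 column per label of interest.

```python
import pandas as pd

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "listed_in": ["Dramas, Thrillers", "Comedies", "Dramas, International Movies"],
    "country": ["United States, India", "India", "France"],
})

def flag_membership(column, wanted):
    """Return one 0/1 column per wanted label for a comma-separated text column."""
    # One row per individual entry, keeping the original row index
    parts = column.str.split(",").explode().str.strip()
    # Indicator columns per entry, collapsed back to one row per title
    dummies = pd.get_dummies(parts).groupby(level=0).max()
    # Keep only the labels we care about; absent labels become all-zero columns
    return dummies.reindex(columns=wanted, fill_value=0).astype(int)

genre_flags = flag_membership(toy["listed_in"], ["Dramas", "Comedies", "Thrillers"])
country_flags = flag_membership(toy["country"], ["United States", "India"])
print(genre_flags)
print(country_flags)
```

Stripping each entry matters because values after the first comma start with a space, so `" India"` would otherwise not match `"India"`.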

## Movies vs TV Shows
Now let us try to understand how the data is distributed between Movies and TV Shows.

In [None]:
print(pd.unique(data['type']))

In [None]:
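# Name the columns explicitly so the result does not depend on the pandas version
typeFrame = data['type'].value_counts().rename_axis('type').reset_index(name='count')

In [None]:
fig = px.pie(typeFrame,values = 'count',names = 'type',template = 'plotly_dark')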
fig.update_traces(hole = 0.6, pull = [0.05,0.05], title = 'Movies<br> VS <br>TV Shows', opacity = 0.7,
text = ["Movie", "TV Show"],
hovertemplate = "%{text} <br>Count: %{value}",
textfont=dict(
        family="sans serif",
        size=18,
        color="black"
    ),
marker= dict(
    line= dict(
        color='black', width=1.5
        )
    ),
)
fig.show()

## Upload Count by Country

In [None]:
data.info()

In [None]:
# Name the columns explicitly so the result does not depend on the pandas version
country_data = data['country'].value_counts().rename_axis('country').reset_index(name='count')
top10 = country_data[:10]
top10

In [None]:
fig = px.bar(top10,x = 'country', y = 'count',labels = {'country' : 'Countries', 'count' : 'Total Uploads'},title='Uploads by Country',template = 'plotly_dark')
fig.update_traces(
    hovertemplate = '%{x} <br>Total : %{y}',
)
fig.show()
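
Note that counting the raw <em>country</em> strings treats a combination such as "United States, India" as its own category. A minimal sketch of counting each country separately, on made-up rows and with the illustrative names `toy` and `per_country`:

```python
import pandas as pd

# Made-up rows; a title can list several countries in one string
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United States", "India, France"],
})

# Split the string, give each country its own row, strip the leading spaces,
# then count how often each individual country appears
per_country = (
    toy["country"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(per_country)
```

With this approach a co-production contributes one count to every country it lists, so the totals can exceed the number of titles.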

## Sentiment Analysis

Using textblob, we tag the words in the descriptions and keep only the adjectives, verbs and adverbs. We then build a WordCloud from them and plot it over a masked Netflix logo image.

In [None]:
from textblob import TextBlob
from wordcloud import WordCloud
from PIL import Image
import random

# Flatten all descriptions into one string and strip the list punctuation
deets = str(list(data['description'])).replace(',', '').replace('[', '').replace("'", '').replace(']', '').replace('.', '')
def get_descriptive_words(text):
    # Keep adjectives (JJ), base-form verbs (VB) and adverbs (RB)
    blob = TextBlob(text)
    return [word for (word, tag) in blob.tags if tag in ("JJ", "VB", "RB")]

deets = get_descriptive_words(deets)
text = str(deets).replace(',', '').replace('[', '').replace("'", '').replace(']', '').replace('.', '')

def red_color_func(word, font_size, position, orientation, random_state = None, **kwargs):
    return 'rgb(241,114,109)'

mask = np.array(Image.open('/kaggle/input/netflix-logo/netflixLogo.png'))
mask = mask[::2,::2]
def transform_format(val):
    if val == 0:
        return 255
    else:
        return val

newMask = np.ndarray((mask.shape[0],mask.shape[1]), np.int32)
for i in range(len(mask)):
    newMask[i] = list(map(transform_format, mask[i]))
    
    
plt.rcParams['figure.figsize'] = (20, 20)
wordcloud = WordCloud(background_color = 'black',mask = newMask, stopwords = ['s']).generate(text)

wordcloud.recolor(color_func = red_color_func)

plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()
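
The mask preparation above turns every 0 pixel into 255, since WordCloud treats pure white entries as masked out. A minimal sketch of the same transformation done in one vectorised step, assuming a single-channel mask; `toy_mask` and `new_mask` are illustrative names and the pixel values are made up.

```python
import numpy as np

# Made-up 2x2 single-channel mask standing in for the logo image
toy_mask = np.array([[0, 120],
                     [255, 0]], dtype=np.uint8)

# Replace every 0 with 255 and leave all other values unchanged
new_mask = np.where(toy_mask == 0, 255, toy_mask).astype(np.int32)
print(new_mask)
```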

## Upload Count by Year
Since some movies and shows have no <em>date_added</em> value, I decided to replace the missing year with the mean <em>addedYear</em> of the movies and TV shows that do have one.

In [None]:
from statistics import mean
data['date_added'] = data['date_added'].fillna('Nothingness')
addedYear = []

for i,row in data.iterrows():
    year = row['date_added'][-4:]
    if year.isnumeric():
        addedYear.append(int(year))
    else:
        addedYear.append(-1)
value = int(mean([x for x in addedYear if x != -1]))
addedYear = [value if x == -1 else x for x in addedYear]
data.insert(7, "addedYear",addedYear, True)
data.head()
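
As an alternative to slicing the last four characters, the year can be sketched with `pd.to_datetime`. This assumes the dates follow a "Month day, year" layout as in the made-up rows below; `toy`, `years`, `fill_value` and `added_year` are illustrative names.

```python
import pandas as pd

# Made-up rows, including a leading space and a missing value
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2017", None, "January 1, 2019"],
})

# Parse the dates; anything unparseable or missing becomes NaT
parsed = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce")
years = parsed.dt.year

# Fill the gaps with the (truncated) mean of the known years
fill_value = int(years.mean())
added_year = years.fillna(fill_value).astype(int)
print(added_year.tolist())
```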

In [None]:
# sort_index() orders the counts by year and behaves the same across pandas versions
movie_data = data[data['type'] == 'Movie']['addedYear'].value_counts().sort_index()
tv_data = data[data['type'] == 'TV Show']['addedYear'].value_counts().sort_index()
overall_data = data['addedYear'].value_counts().sort_index()
overall_data

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(x = list(movie_data.index), y = list(movie_data.values),mode = 'lines+markers',name = 'Movies')
)
fig.add_trace(
    go.Scatter(x = list(tv_data.index), y = list(tv_data.values),mode = 'lines+markers',name = 'TV Shows')
)
fig.add_trace(
    go.Scatter(x = list(overall_data.index), y = list(overall_data.values),mode = 'lines+markers',name = 'Total')
)
fig.update_layout(title = 'Upload Count by Years',plot_bgcolor = 'black',template = 'plotly_dark')
fig.update_xaxes(showline = True,title = 'Years')
fig.update_yaxes(showline = True,title = 'Upload Count')

## Rating Distribution

We first map each rating to a target audience group to reduce the number of categories.

In [None]:
from collections import defaultdict
ratingTags ={'TV-MA': 'Adults',
          'R': 'Adults',
          'PG-13': 'Teens',
          'TV-14': 'Young Adults',
          'TV-PG': 'Older Kids',
          'NR': 'Adults',
          'TV-G': 'Kids',
          'TV-Y': 'Kids',
          'TV-Y7': 'Older Kids',
          'PG': 'Older Kids',
          'G': 'Kids',
          'NC-17': 'Adults',
          'TV-Y7-FV': 'Older Kids',
          'UR': 'Adults'}
ratingDict = defaultdict(int)
movieRating = defaultdict(int)
tvRating = defaultdict(int)
for i,row in data.iterrows():
    ratingDict[ratingTags[row['rating']]] += 1
    if row['type'] == 'Movie':
        movieRating[ratingTags[row['rating']]] += 1
    else:
        tvRating[ratingTags[row['rating']]] += 1
ratingDict
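
The same tallies can be sketched with `Series.map` and `pd.crosstab` instead of a row loop. The rows and the shortened `tags` mapping below are made up for illustration, and `overall` and `by_type` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "rating": ["R", "TV-MA", "PG", "TV-Y"],
})

# Shortened version of the rating-to-audience mapping
tags = {"R": "Adults", "TV-MA": "Adults", "PG": "Older Kids", "TV-Y": "Kids"}

# Translate each rating to its audience group
toy["audience"] = toy["rating"].map(tags)

# Overall counts, and counts split by Movie / TV Show
overall = toy["audience"].value_counts()
by_type = pd.crosstab(toy["type"], toy["audience"])
print(overall)
print(by_type)
```

Unlike a dictionary lookup, `map` leaves a NaN for any rating that is missing from the mapping rather than raising an error, so unexpected rating values are easy to spot.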

In [None]:
y = ratingDict.values()
labs = ratingDict.keys()
fig = px.pie(values = y, title="Overall Rating Distribution",template = 'plotly_dark')
fig.update_traces(
    text = list(labs),
    hovertemplate = 'Total : %{value}'
)
fig.show()

In [None]:
y = movieRating.values()
labs = movieRating.keys()
fig = px.pie(values = y, title="Movies Rating Distribution",template = 'plotly_dark')
fig.update_traces(
    text = list(labs),
    hovertemplate = 'Total : %{value}'
)
fig.show()

In [None]:
y = tvRating.values()
labs = tvRating.keys()
fig = px.pie(values = y, title="TV Shows Rating Distribution",template = 'plotly_dark')
fig.update_traces(
    text = list(labs),
    hovertemplate = 'Total : %{value}'
)
fig.show()

## Comparison between genre uploads of USA and India

In [None]:
splitDf = data[(data['India'] == 1) | (data["USA"] == 1)]
compareDf = pd.DataFrame(columns = ['country', 'genre', 'count'], index = range(12))
compareDf.iloc[0:6, 0] = 'United States'
compareDf.iloc[6:12, 0] = 'India'
compareDf.iloc[[0, 6], 1] = 'Dramas'
compareDf.iloc[[1, 7], 1] = 'Comedies'
compareDf.iloc[[2, 8], 1] = 'Action & Adventure'
compareDf.iloc[[3, 9], 1] = 'Documentaries'
compareDf.iloc[[4, 10], 1] = 'Thrillers'
compareDf.iloc[[5,11], 1] = 'Sci-Fi & Fantasy'
compareDf.iloc[:,2] = 0
compareDf

In [None]:
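# Use the same genre order as the labels assigned to compareDf above,
# otherwise the counts end up under the wrong genre
gs = ['Dramas','Comedies','Action & Adventure','Documentaries','Thrillers','Sci-Fi & Fantasy']
for i in range(6):
    cat = gs[i]
    compareDf.iloc[i,2] = len(splitDf.query('USA == 1 & `{0}` == 1'.format(cat)))
for i in range(6,12):
    cat = gs[i % 6]
    compareDf.iloc[i,2] = len(splitDf.query('India == 1 & `{0}` == 1'.format(cat)))
compareDf['count'] = compareDf['count'].astype(int)  # plotly needs numeric values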
compareDf
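
A minimal sketch of building the same kind of country/genre/count table directly from the 0/1 flag columns, which keeps each label and its count together in one step. The rows are made up, only two genres are shown, and `compare` is an illustrative name.

```python
import pandas as pd

# Made-up flag columns standing in for the preprocessed dataset
toy = pd.DataFrame({
    "USA":      [1, 1, 0, 1],
    "India":    [0, 1, 1, 0],
    "Dramas":   [1, 0, 1, 1],
    "Comedies": [0, 1, 1, 0],
})
genre_cols = ["Dramas", "Comedies"]

rows = []
for country_col, label in [("USA", "United States"), ("India", "India")]:
    # Titles attributed to this country
    subset = toy[toy[country_col] == 1]
    for genre in genre_cols:
        # Summing a 0/1 column counts the titles in that genre
        rows.append({"country": label, "genre": genre, "count": int(subset[genre].sum())})

compare = pd.DataFrame(rows)
print(compare)
```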

In [None]:
fig = px.sunburst(compareDf, path=['country','genre'], values = 'count', width = 700, height = 700, color = 'country', title = 'Comparison Between The Top 2 Uploading Countries',template = 'plotly_dark',color_discrete_map = {'United States': '#e6705e', 'India': '#7dfff2'})
fig.update_traces(textinfo = 'label+percent parent')
fig.show()
