## Netflix Dataset Review

Netflix, Inc. is an American over-the-top content platform and production company headquartered in Los Gatos, California. Netflix was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California.

* Stock price: NFLX (NASDAQ) USD 516.39 5 Mar 2021
* Founded: 29 August 1997, Scotts Valley, California, United States
* Employees: 9,400 (2020)

### Purpose

Ideas: explore the data and attempt some of the items below:
1. Top actors or directors by year
2. Movie ratings
3. Descriptions by genre
4. Which shows run for the most seasons? (e.g. a predictive model)
5. Recommender engine: similar movies/shows based on cast, director, title and description

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Read Data

In [None]:
df = pd.read_csv("../input/netflix-shows/netflix_titles.csv")


### Explore Data

In [None]:
df.head()

In [None]:
df.shape
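
Later cells fill missing values before working with text, so it can help to see which columns have gaps first. The following is a minimal sketch on a made-up stand-in frame (`sample` and its rows are hypothetical, not taken from the dataset); in the notebook the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for netflix_titles.csv; these rows are invented for illustration
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Jane Doe", None, None, "John Roe"],
    "cast": ["X, Y", None, "Z", "W"],
    "rating": ["TV-MA", "TV-14", None, "PG"],
})

# Count missing values per column, largest first
missing = sample.isna().sum().sort_values(ascending=False)
print(missing)
```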

### Ratings vs Type

In [None]:
# Stacked bar chart of title counts per rating, split by type (Movie / TV Show)
df.groupby(['type', 'rating']).size().unstack().plot(kind='bar', stacked=True)


Based on this plot, Netflix's catalogue appears to be aimed mainly at viewers aged about 14 and older.

### Top 25 Directors

In [None]:
df.groupby(['director']).size().sort_values(ascending=False).head(25).plot(kind='bar')

Some well-known directors appear here.
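
Grouping on the raw `director` column treats a comma-separated pair of directors as one combined value. A possible refinement, sketched here on a made-up frame (`sample` and the names in it are hypothetical), is to split each entry and count directors individually.

```python
import pandas as pd

# Hypothetical rows; the real column stores co-directors as "Name One, Name Two"
sample = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "director": ["Ann Lee, Bob Ray", "Ann Lee", None],
})

director_counts = (
    sample["director"]
    .dropna()              # ignore titles without a listed director
    .str.split(", ")       # one list of names per title
    .explode()             # one row per (title, director) pair
    .str.strip()
    .value_counts()
)
print(director_counts.head(25))
```

In the notebook, `director_counts.head(25).plot(kind='bar')` would give a per-person version of the chart above.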

### Release Year

In [None]:
# Bin edges; titles released after 2020 fall outside the last bin and are not plotted
bins = [1900, 1950, 1970, 1980, 1990, 2000, 2010, 2015, 2018, 2020]

plt.hist(df.release_year, bins=bins, edgecolor="k")
plt.xticks(bins, rotation='vertical')
plt.show()


Not surprisingly, most content is fairly recent.

### Description

In [None]:
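# Explore common terms by genre
def wc(genre, col):
    # Select the rows whose genre list contains the given substring
    text = df[df['listed_in'].str.contains(genre, regex=False)][col]
    # Join every row into one string; str(text) would only pass the truncated Series repr
    joined = ' '.join(text.dropna().astype(str))

    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40,
        random_state=42
    ).generate(joined)

    fig = plt.figure(1)
    plt.imshow(wordcloud)
    plt.axis('off')
    # Save before showing so the written image is not empty outside notebooks
    fig.savefig("word1.png", dpi=900)
    plt.show()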

In [None]:
#Horror Movies WordCloud:
wc('Horror','description')


No major surprises here or in the word clouds below.

In [None]:
#Sci-Fi Movies WordCloud:
wc('Sci-Fi','description')

In [None]:
#Action Movies WordCloud:
wc('Action & Adventure','description')


In [None]:
#Drama Movies WordCloud:
wc('Drama','description')

In [None]:
#Thriller Movies WordCloud:
wc('Thriller','description')

### Top Actors

In [None]:
# Replace missing values as a prerequisite for working with text
df['cast'] = df['cast'].fillna('-999')


In [None]:
# Replace spaces inside names with underscores so that each actor's
# first and last names stay together as a single token
df['cast'] = (
    df['cast']
    .str.replace(', ', '#', regex=False)   # protect the separators between names
    .str.replace(' ', '_', regex=False)    # join the parts of each name
    .str.replace('#', ', ', regex=False)   # restore the separators
)

In [None]:
#Comedy - Actor WordCloud ('Comed' matches both "Comedies" and "Stand-Up Comedy" labels):
wc('Comed','cast')

In [None]:
#Sci-Fi Movies - Actor WordCloud:
wc('Sci-Fi','cast')

In [None]:
#Action Movies - Actor WordCloud:
wc('Action & Adventure','cast')
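
The first idea in the Purpose list, top actors by year, can also be answered with plain counts rather than a word cloud. This is a minimal sketch on made-up rows (`sample` and the names are hypothetical), and it assumes the original comma-separated `cast` strings, before spaces are replaced with underscores.

```python
import pandas as pd

# Hypothetical rows shaped like the release_year / cast columns
sample = pd.DataFrame({
    "release_year": [2019, 2019, 2020],
    "cast": ["Ann Lee, Bob Ray", "Ann Lee", "Cy Poe, Bob Ray"],
})

# One row per (title, actor) pair
actors = sample.assign(actor=sample["cast"].str.split(", ")).explode("actor")

# Number of titles per actor per release year
counts = (
    actors.groupby(["release_year", "actor"])
    .size()
    .rename("titles")
    .reset_index()
)

# Keep the most frequent actor for each year (ties are broken arbitrarily)
top_per_year = (
    counts.sort_values(["release_year", "titles"], ascending=[True, False])
    .groupby("release_year")
    .head(1)
)
print(top_per_year)
```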

### Recommender Engine

In [None]:
vectorizer = CountVectorizer()

In [None]:
vectors = vectorizer.fit_transform(df.cast)

In [None]:
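# Compute the cosine similarity matrix.
# linear_kernel on raw counts is only an unnormalised dot product, so use cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(vectors, vectors)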
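
A note on the similarity measure: `linear_kernel` returns plain dot products, which equal cosine similarity only when the row vectors are L2-normalised (as TF-IDF vectors are by default). On raw counts the two differ, as this small sketch on made-up cast strings suggests.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical cast strings in the underscore-joined form used above
docs = [
    "tom_hanks meg_ryan",
    "tom_hanks meg_ryan tom_hanks meg_ryan",  # same names repeated: same direction, longer vector
    "keanu_reeves",
]
counts = CountVectorizer().fit_transform(docs)

dot = linear_kernel(counts, counts)      # unnormalised dot products
cos = cosine_similarity(counts, counts)  # dot products divided by the vector lengths

# With raw dot products, document 0 scores higher against document 1 than against itself
print(dot[0, 0], dot[0, 1])
print(cos[0, 0], cos[0, 1])
```

With dot products, titles with long cast lists tend to score highly against everything, which is one reason to prefer a normalised measure here.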

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    # (indices is the title-to-row mapping built in a later cell)
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [None]:
# Fill missing values in the remaining text columns
df['rating'] = df['rating'].fillna('-999')
df['country'] = df['country'].fillna('-999')
df['director'] = df['director'].fillna('-999')

In [None]:
# Join multiple columns into one string per title
def create_soup(x):
    return ' '.join([x['listed_in'], x['description'], x['cast'], x['title'], x['rating'], x['country'], x['director']])


In [None]:
df['soup'] = df.apply(create_soup, axis=1)

In [None]:
count_matrix = vectorizer.fit_transform(df.soup)

In [None]:
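# Compute the cosine similarity matrix.
# linear_kernel on raw counts is only an unnormalised dot product, so use cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)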

In [None]:
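# Reset the index of the main DataFrame and build a reverse mapping from title to row position
df = df.reset_index(drop=True)
indices = pd.Series(df.index, index=df['title'])
# Keep the first row for any repeated title so that indices[title] returns a single position
indices = indices[~indices.index.duplicated(keep='first')]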

In [None]:
# df[df['title'].str.contains('Terminator')]#[1000:1030]

In [None]:
# See what kind of recommendations are coming through for movies with sequels
get_recommendations('Rocky', cosine_sim2)

In [None]:
get_recommendations('Terminator Salvation', cosine_sim2)

In [None]:
get_recommendations('The Lord of the Rings: The Return of the King', cosine_sim2)

The recommendations are reasonable for such a simple model. Possible improvements include TF-IDF weighting (word counts that are down-weighted for terms appearing in many documents) and more advanced text-similarity techniques.
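
As a rough illustration of the TF-IDF idea, the sketch below swaps `CountVectorizer` for scikit-learn's `TfidfVectorizer` on a few made-up "soup" strings (the `soups` list is hypothetical). Because TF-IDF rows are L2-normalised by default, `linear_kernel` gives cosine similarity in this case.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical soup strings standing in for df['soup']
soups = [
    "boxing drama underdog fighter philadelphia",
    "boxing drama fighter comeback",
    "robots future war machines",
    "romantic comedy wedding",
]

# Terms that appear in many documents get lower weights than rare terms
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(soups)

# Rows are unit length, so the dot product is the cosine similarity
sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Most similar other document to document 0 (position 0 of the ranking is the document itself)
best_match = sim[0].argsort()[::-1][1]
print(best_match, sim[0])
```

In the notebook, the same change would amount to fitting `TfidfVectorizer` on `df.soup` and passing the resulting matrix to `get_recommendations` in place of `cosine_sim2`.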

Please don't forget to upvote if you found this useful! :) Thank you