<a href="https://colab.research.google.com/github/sravyagadam/ML_RecommenderSystems/blob/main/NETFLIX_MOVIES_AND_TV_SHOWS_CLUSTERING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes scores could also yield interesting findings.

## <b>In this project, you are required to do the following</b>
1. Exploratory data analysis

2. Understand what type of content is available in different countries

3. Determine whether Netflix has increasingly focused on TV rather than movies in recent years

4. Cluster similar content by matching text-based features



# **Attribute Information**

1. show_id : Unique ID for every movie / TV show

2. type : Identifier - a Movie or a TV Show

3. title : Title of the movie / TV show

4. director : Director of the movie

5. cast : Actors involved in the movie / show

6. country : Country where the movie / show was produced

7. date_added : Date it was added to Netflix

8. release_year : Actual release year of the movie / show

9. rating : TV rating of the movie / show

10. duration : Total duration - in minutes or number of seasons

11. listed_in : Genre

12. description : Summary description
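
Two of the attributes above, `duration` and `date_added`, are stored as text and usually need parsing before analysis. The following is a minimal sketch of one way to do that, assuming the string formats visible in the data preview ("4 Seasons", "93 min", "August 14, 2020"). The `sample` frame and the derived column names (`duration_value`, `date_added_parsed`, `year_added`) are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the columns described above.
sample = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie"],
    "date_added": ["August 14, 2020", "December 23, 2016", None],
    "duration": ["4 Seasons", "93 min", "78 min"],
})

# Leading integer of `duration`: minutes for movies, seasons for TV shows.
sample["duration_value"] = (
    sample["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Parse the date string; missing or unparseable entries become NaT.
sample["date_added_parsed"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
sample["year_added"] = sample["date_added_parsed"].dt.year

print(sample[["type", "duration_value", "year_added"]])
```

Because `duration_value` mixes two units, it is probably best interpreted together with `type` rather than on its own.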

In [57]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

import warnings
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import MinMaxScaler

from time import time
import keras.backend as K

from tensorflow.keras.layers import Layer, InputSpec
from keras.layers import Dense, Input, Embedding
from keras.models import Model
from tensorflow.keras.optimizers import SGD
from keras import callbacks
from keras.initializers import VarianceScaling
from sklearn.cluster import KMeans

%matplotlib inline

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [58]:


netflix = pd.read_csv("/content/drive/MyDrive/Book_Recommendation/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")


In [59]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [60]:
print(f"Shape of data {netflix.shape}")
print(f"data types in data \n {netflix.dtypes}")

Shape of data (7787, 12)
data types in data 
 show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [71]:
netflix_copy=netflix.copy()

In [74]:
netflix.drop(columns=['show_id','cast', 'date_added','duration'])

Unnamed: 0,type,title,director,country,release_year,rating,listed_in,description
0,TV Show,3%,,Brazil,2020,TV-MA,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,Movie,7:19,Jorge Michel Grau,Mexico,2016,TV-MA,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,Movie,23:59,Gilbert Chan,Singapore,2011,R,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,Movie,9,Shane Acker,United States,2009,PG-13,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,Movie,21,Robert Luketic,United States,2008,PG-13,Dramas,A brilliant group of students become card-coun...
...,...,...,...,...,...,...,...,...
7782,Movie,Zozo,Josef Fares,"Sweden, Czech Republic, United Kingdom, Denmar...",2005,TV-MA,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,Movie,Zubaan,Mozez Singh,India,2015,TV-14,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,Movie,Zulu Man in Japan,,,2019,TV-MA,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,TV Show,Zumbo's Just Desserts,,Australia,2019,TV-PG,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...


In [84]:
x = netflix[['type','director','country','release_year','rating']]
x.head()


Unnamed: 0,type,director,country,release_year,rating
0,TV Show,,Brazil,2020,TV-MA
1,Movie,Jorge Michel Grau,Mexico,2016,TV-MA
2,Movie,Gilbert Chan,Singapore,2011,R
3,Movie,Shane Acker,United States,2009,PG-13
4,Movie,Robert Luketic,United States,2008,PG-13


In [None]:
x['type'].unique()

In [85]:
from sklearn import preprocessing
 
# LabelEncoder maps each distinct value in a column to an integer code.
label_encoder = preprocessing.LabelEncoder()

# Work on an explicit copy: x was sliced from netflix, and assigning into
# that slice triggers pandas' SettingWithCopyWarning.
x = x.copy()

# Encode each column independently; the encoder is refit for every column.
for column in x.columns:
    x[column] = label_encoder.fit_transform(x[column])

x.head()



Unnamed: 0,type,director,country,release_year,rating
0,1,4049,39,71,8
1,0,1840,308,67,8
2,0,1289,379,62,5
3,0,3445,549,60,4
4,0,3176,549,59,4


In [89]:
from sklearn import preprocessing

# scale the data for better results
x_scaled = preprocessing.scale(x)
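
As a rough illustration of what `preprocessing.scale` does, the sketch below standardizes a small array so that every column ends up with zero mean and unit variance. The `encoded` array is an assumption for demonstration only; it reuses the first five encoded `type` and `release_year` values shown above.

```python
import numpy as np
from sklearn import preprocessing

# Encoded `type` and `release_year` values for the first five rows.
encoded = np.array([[1, 71], [0, 67], [0, 62], [0, 60], [0, 59]], dtype=float)

# Subtract each column's mean and divide by its standard deviation.
scaled = preprocessing.scale(encoded)

print(scaled.mean(axis=0))  # approximately zero for every column
print(scaled.std(axis=0))   # approximately one for every column
```

Note that label-encoded codes carry an arbitrary ordering, so distances computed on the scaled codes should be interpreted with care.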

In [61]:
print(f"null data sum \n {netflix.isna().sum()}")

null data sum 
 show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64
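
The counts above show that `director`, `cast`, and `country` have many missing values. One possible way to handle this, sketched here on a hypothetical `toy` frame rather than the real data, is to inspect the share of missing values per column and fill categorical gaps with an explicit placeholder. The "Unknown" label is an assumption, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with gaps similar to the real columns.
toy = pd.DataFrame({
    "director": ["Jorge Michel Grau", np.nan, "Shane Acker", np.nan],
    "country": ["Mexico", "Brazil", np.nan, "Australia"],
    "rating": ["TV-MA", "TV-MA", "PG-13", "TV-PG"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Replace missing categorical values with an explicit placeholder label.
filled = toy.fillna({"director": "Unknown", "country": "Unknown"})
print(filled)
```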


In [62]:
netflix_d = netflix.drop_duplicates("title")
print(f"shape after dropping duplicates in data set{netflix_d.shape}")

shape after dropping duplicates in data set(7787, 12)


# As the shape is the same before and after dropping duplicates on title, the dataset contains no duplicate titles

**Clustering Section**

In [88]:
df_token = netflix["listed_in"]
maxlen = 1500  # presumably the maximum sequence length for padding; unused in this cell
training_samples = 800
validation_samples = 450
max_words = 10000  # only use this number of most frequent words

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df_token)  # builds the word index from the genre strings
sequences = tokenizer.texts_to_sequences(df_token)  # transforms strings into lists of integers
word_index = tokenizer.word_index  # word -> integer index mapping
word_index


{'action': 7,
 'adventure': 8,
 'anime': 29,
 'british': 27,
 'children': 13,
 'classic': 35,
 'comedies': 6,
 'comedy': 20,
 'crime': 15,
 'cult': 38,
 'documentaries': 10,
 'docuseries': 21,
 'dramas': 4,
 'faith': 42,
 'family': 14,
 'fantasy': 26,
 'features': 44,
 'fi': 25,
 'horror': 17,
 'independent': 11,
 'international': 3,
 "kids'": 16,
 'korean': 31,
 'language': 34,
 'lgbtq': 37,
 'movies': 1,
 'music': 22,
 'musicals': 23,
 'mysteries': 36,
 'nature': 40,
 'reality': 28,
 'romantic': 9,
 'sci': 24,
 'science': 39,
 'series': 32,
 'shows': 5,
 'spanish': 33,
 'spirituality': 43,
 'sports': 30,
 'stand': 18,
 'talk': 45,
 'teen': 41,
 'thrillers': 12,
 'tv': 2,
 'up': 19}
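
The word index above splits multi-word genres such as "Sci-Fi" into separate tokens ('sci', 'fi'). As an alternative sketch, not the notebook's method, the genre strings could be split on commas so that each genre stays intact, vectorized with `TfidfVectorizer`, and clustered with `KMeans`. The `genres` list reuses a few `listed_in` values from the preview; `split_genres` and the choice of two clusters are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# A few `listed_in` values taken from the data preview.
genres = [
    "Dramas, International Movies",
    "Horror Movies, International Movies",
    "Dramas",
    "Dramas, International Movies, Music & Musicals",
    "International TV Shows, Reality TV",
]

def split_genres(text):
    """Split a comma-separated genre string into whole-genre tokens."""
    return [g.strip().lower() for g in text.split(",") if g.strip()]

# token_pattern=None because the custom tokenizer replaces the default regex.
vectorizer = TfidfVectorizer(
    tokenizer=split_genres, lowercase=False, token_pattern=None
)
genre_matrix = vectorizer.fit_transform(genres)

# Cluster titles by their TF-IDF genre vectors (2 clusters, chosen arbitrarily).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(genre_matrix)

print(sorted(vectorizer.vocabulary_))
print(labels)
```

On the full dataset the number of clusters would need to be chosen with a diagnostic such as the elbow method or silhouette score.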

In [70]:
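from sklearn.cluster import Birch
# Birch needs a numeric matrix, so fit on the scaled, label-encoded features
# (x_scaled) rather than the raw DataFrame, whose string columns raise a ValueError.
model = Birch(branching_factor=30, n_clusters=5, threshold=2.5)
model.fit(x_scaled)
pred = model.predict(x_scaled)
# If Birch finds fewer than 5 subclusters at this threshold it emits a warning;
# lowering `threshold` is the usual remedy.
# Plot encoded country (column 2) against release_year (column 3), coloured by cluster.
plt.scatter(x_scaled[:, 2], x_scaled[:, 3], c=pred, cmap='rainbow', alpha=0.5, edgecolors='b')
plt.show()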

In [22]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

In [25]:
# Note: this rebinds `netflix` from the loaded DataFrame to a synthetic (1000, 2) array of blob points.
netflix, clusters = make_blobs(n_samples=1000, centers=12, cluster_std=0.50, random_state=0)


In [29]:
netflix.shape

(1000, 2)

In [23]:
model = Birch(branching_factor = 50, n_clusters = None, threshold = 1.5)

In [26]:
model.fit(netflix)

Birch(n_clusters=None, threshold=1.5)

In [28]:
pred = model.predict(netflix)
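
To inspect what the fitted Birch model produced, the predicted labels can be counted and plotted. The sketch below repeats the synthetic-blob experiment in one self-contained cell using the same parameters as above; the names `points`, `birch`, `pred_labels`, and `n_found` are assumptions introduced to avoid reusing `netflix`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 12 generated centers, as in the cells above.
points, true_labels = make_blobs(
    n_samples=1000, centers=12, cluster_std=0.50, random_state=0
)

# n_clusters=None skips the final global clustering step, so each
# Birch subcluster is returned as its own label.
birch = Birch(branching_factor=50, n_clusters=None, threshold=1.5)
pred_labels = birch.fit_predict(points)

n_found = len(np.unique(pred_labels))
print(f"labels found: {n_found} (generated centers: 12)")

# Colour each point by its predicted label.
plt.scatter(points[:, 0], points[:, 1], c=pred_labels, cmap="rainbow",
            alpha=0.5, edgecolors="b")
plt.show()
```

The number of labels depends on `threshold`: a larger radius merges nearby blobs, while a smaller one splits them.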