<h1><b>Project Type - Unsupervised Machine Learning
<h1><b>Contribution - Individual

<h1><b>GitHub Link



<h1><b>Project Summary



* This project involves analyzing the content available on Netflix, which is a popular streaming service for movies and TV shows. The dataset used in this project includes information about different shows and movies on Netflix as of 2019. The goal of the project is to categorize and group these shows based on their attributes like genre, cast, director, rating, country, and description.

* The dataset consisted of 7787 records and 12 attributes.

* To start, the project focuses on cleaning the data and performing exploratory data analysis. This helps to ensure the data is organized and any issues are addressed.

* Next, the attributes are processed and transformed into a numerical format using a technique called TF-IDF vectorization. This allows the project to analyze the textual information effectively. Principal Component Analysis (PCA) is then used to handle the high dimensionality of the data, making it easier to work with.

* Two different clustering algorithms, namely K-Means Clustering and Agglomerative Hierarchical Clustering, are employed to group the Netflix content based on similarities between their attributes. The number of clusters is determined using techniques such as the elbow method, silhouette score, and dendrogram.

* Lastly, a content-based recommender system is built using the cosine similarity between shows/movies. This system recommends 10 shows/movies to users based on their previous viewing history. By analyzing the attributes of the shows/movies and comparing them to the user's preferences, the recommender system suggests similar content that the user might enjoy.

* Overall, this project provides insights into the Netflix dataset and helps users discover shows/movies that align with their preferences, making it easier to find content they will enjoy.

<h1><b>Business Context


* The business context of this project is to improve the user experience on the Netflix platform by providing personalized content recommendations based on user viewing behavior and preferences. By clustering TV shows and movies into similar groups and integrating external datasets like IMDB ratings and Rotten Tomatoes, the project aims to provide valuable insights into the content's quality and popularity, further enhancing the viewing experience. Ultimately, the project's goal is to drive user engagement and satisfaction on the Netflix platform.

<h1><b>Problem Statement


* The problem statement of this project is to explore and analyze the Netflix dataset to identify patterns and similarities among TV shows and movies. The goal is to cluster the content into groups of similar shows and movies using attributes such as director, cast, country, genre, rating, and description. Additionally, the project aims to integrate external datasets like IMDB ratings and Rotten Tomatoes to provide insights into the content's quality and popularity. The ultimate objective is to build a content-based recommendation system that provides personalized recommendations to users based on their viewing behavior and preferences, ultimately enhancing the user experience on the Netflix platform.

# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import matplotlib.cm as cm

# Word Cloud library
from wordcloud import WordCloud, STOPWORDS

# libraries used for textual data preprocessing
import string, unicodedata           # string: constants such as string.punctuation; unicodedata: Unicode normalization
import nltk
from nltk.corpus import stopwords    # stopwords corpus, which contains a list of commonly used stop words in the English language.
nltk.download('stopwords')           # download the stopwords corpus from the NLTK library to the local machine
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA

# library used for building the recommendation system
from sklearn.metrics.pairwise import cosine_similarity

# libraries used for cluster implementation
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples
import scipy.cluster.hierarchy as shc


import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv("/content/drive/MyDrive/capstone project 4/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")
df

### Dataset First View

In [None]:
# Dataset First Look
df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Number of rows {} \n Number of columns {}'.format(df.shape[0],df.shape[1]))

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(len(df[df.duplicated()]))
     

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum().sort_values(ascending= False).reset_index().rename(columns={'index':'Columns',0:'Null values'})

In [None]:
# Visualizing the missing values
plt.figure(figsize=(14, 5),dpi=100)
msno.bar(df, color = 'blue')
plt.title("Missing Values in Column",fontweight="bold",size=16,color='red')
plt.show()
     


### What did you know about your dataset?

>The dataset contains information about specific movies.

>There are NaN values present in the director, cast, country, date_added, and rating columns.

>It is not possible to impute missing values using any method, as the data is specific to each movie.

>To avoid losing any data, the decision has been made to impute NaN values with empty space. This approach may not always be the best option, as external sources could potentially provide missing information for some of the columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns
     

In [None]:
# Dataset Describe
df.describe(include='all').transpose()


### Variables Description 

>show_id : Unique ID for every Movie/Show

>type : Identifier - Movie/Show

>title : Title of the Movie/Show

>director : Director of the Movie/Show

>cast : Actors involved in the Movie/Show

>country : Country where the Movie/Show was produced

>date_added : Date it was added on Netflix

>release_year : Actual Release year of the Movie/Show

>rating : TV Rating of the Movie/Show

>duration : Total Duration - in minutes or number of seasons

>listed_in : Genre

>description : Summary description of the Movie/Show

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique())

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

<h1><B>EDA (Exploratory Data Analysis)

>Exploratory Data Analysis (EDA) is a crucial initial step before making any modifications to a dataset or creating a statistical model to address business problems. The EDA process involves summarizing, visualizing, and gaining a deep understanding of the significant characteristics of a dataset. In essence, EDA is aimed at exploring and discovering insights from the data to inform subsequent data processing, modeling, and decision-making activities.

<h2>Type Column

In [None]:
# Number of Movies and TV Shows in the dataset
print(df.type.value_counts())
print(" ")

In [None]:
# visualization of Movies and TV Shows in the dataset
plt.figure(figsize=(10,5),dpi=100)
df['type'].value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.ylabel('')
plt.title('Movies and TV Shows in the dataset',fontsize=16,color='red');

<h1>Title Column

In [None]:
# Creating and Displaying a Word Cloud Based on Titles in a Pandas Dataframe
text = " ".join(word for word in df['title'])

# Create the WordCloud object and generate the word cloud
wordcloud = WordCloud(stopwords=STOPWORDS).generate(text)

# Display the word cloud using matplotlib.pyplot
plt.imshow(wordcloud,  interpolation='bilinear')
plt.axis("off")
plt.show()

>Words such as Christmas, Love, World, Man and Story are the most common words in the title column.

<h1><b>Director Column

In [None]:
# Printing the number of distinct director entries for Movies and TV Shows separately
# (nunique() counts each distinct value once; value_counts().sum() would count titles that have a director instead)

print(f"Number of distinct director entries for Movies : {df[df['type']=='Movie']['director'].nunique()}")
print(f"Number of distinct director entries for TV Shows : {df[df['type']=='TV Show']['director'].nunique()}")

In [None]:
#defining fig size and axis
fig,ax = plt.subplots(1,2, figsize=(15,5),dpi=100)

# top 10 director who directed TV show
show = df[df['type']=='TV Show']['director'].value_counts()[:10].plot(kind='barh', ax=ax[0])
show.set_title('top 10 director who directed TV Show', size=16,color='red')

# top 10 director who directed movie
movie = df[df['type']=='Movie']['director'].value_counts()[:10].plot(kind='barh', ax=ax[1])
movie.set_title('top 10 director who directed Movie', size=16,color='red')

plt.tight_layout()
plt.show()
     

>The director Alastair Fothergill has directed three TV shows, which is the highest number of TV shows directed by any director in the dataset.

>Raul Campos and Jan Suter have co-directed 18 movies, which is the highest number for any director or director pair in the dataset. Following them are Marcus Raboy, Jay Karas, and Cathy Garcia-Molina.

<h1><b>Cast Column

In [None]:
# defining fig size and axis
fig,ax = plt.subplots(1,2, figsize=(15,5),dpi=100)

# top 10 TV shows actor 
TV_shows = df[df['type']=='TV Show']['cast'].str.split(', ', expand=True).stack().reset_index(level=1, drop=True).value_counts()[:10].plot(kind='barh', ax=ax[0])
TV_shows.set_title('Top 10 actors who appeared in Tv shows', size=16,color='red')

# top 10 Movie actor 
movies = df[df['type']=='Movie']['cast'].str.split(', ', expand=True).stack().reset_index(level=1, drop=True).value_counts()[:10].plot(kind='barh', ax=ax[1])
movies.set_title('Top 10 actors who appeared in movie', size=16,color='red')

plt.tight_layout()
plt.show()
     

>Takahiro Sakurai, Yuki Kaji and Daisuke Ono appear in the highest number of TV shows.

>Anupam Kher, Shahrukh Khan and Om Puri appear in the highest number of movies.

In [None]:
# Top 10 countries with the highest number movies / TV shows in the dataset
plt.figure(figsize=(15,5),dpi=100)
df.country.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 countries with the highest number of movies / TV shows',fontsize=16,color='red')

>The highest number of movies / TV shows were based out of the US, followed by India and UK.
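One caveat with the chart above: if a title's `country` value lists several comma-separated countries, counting the raw strings treats each combination as its own category. A minimal sketch of counting individual countries instead, using made-up rows (the sample values and the helper name `count_individual_countries` are assumptions for illustration, not part of the notebook):

```python
from collections import Counter

# Hypothetical rows standing in for the `country` column; not taken from the dataset.
countries = [
    "United States",
    "India",
    "United States, India",
    "United Kingdom, United States",
    "",  # a title whose country is missing
]

def count_individual_countries(values):
    """Count each country once per title, splitting multi-country entries on commas."""
    counts = Counter()
    for value in values:
        for name in value.split(","):
            name = name.strip()
            if name:  # skip blanks left by missing values
                counts[name] += 1
    return counts

country_counts = count_individual_countries(countries)
print(country_counts.most_common(3))
```

The same idea is what the cast and genre cells do with `str.split(', ', expand=True).stack()`.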

<h1><b>Released Year Column

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,5),dpi=100)

# Univariate analysis
hist = sns.histplot(df['release_year'], ax=ax[0])
hist.set_title('distribution by released year',fontsize=16,color='red')

# Bivariate analysis
count = sns.countplot(x="release_year", hue='type', data=df, order=range(2008,2022), ax=ax[1])
count.set_title('Number of shows released each year since 2008 that are on Netflix',fontsize=16,color='red')
plt.xticks(rotation=90)
for p in count.patches:  #adding value count on the top of bar
   count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

plt.tight_layout()
plt.show()
     

>Netflix has more new movies and TV shows than old ones.

>The company has a consistent focus on adding new shows to its platform.

>Among titles released in 2020, the number of movies decreased, but the number of TV shows did not. This could indicate a shift towards introducing more TV series rather than movies on Netflix.

<h1><b>Rating Column

In [None]:
# Top 10 Rating 
fig,ax = plt.subplots(1,2, figsize=(15,5),dpi=100)
plt.suptitle('Top 10 rating given for movie and shows', size=16,color='red', y=1.01)

# univariate analysis
df['rating'].value_counts()[:10].plot(kind='barh',ax=ax[0])

# bivariate analysis
graph = sns.countplot(x="rating", data=df, hue='type', order=df['rating'].value_counts().index[0:10], ax=ax[1])
plt.xticks(rotation=90)
for p in graph.patches:  #adding value count on the top of bar
   graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

plt.tight_layout()
plt.show()

>Most movies and TV shows are rated TV-MA (Mature Audience), followed by TV-14 (may be unsuitable for children under 14).

<h1><b>Duration Column

In [None]:
# duration column
df['duration']

In [None]:

# Creating separate datasets for TV shows and movies
# (.copy() so that later column assignments do not modify a slice of df)

netflix_shows=df[df['type']=='TV Show'].copy()
netflix_movies=df[df['type']=='Movie'].copy()

<h1><b>Netflix Movie Duration

In [None]:
# movie duration 
netflix_movies['duration']=netflix_movies['duration'].str.replace(' min','')
netflix_movies['duration']=netflix_movies['duration'].astype(str).astype(int)

# Average movie length over the years
plt.figure(figsize=(15,5),dpi=100)
netflix_movies.groupby('release_year')['duration'].mean().plot(kind='line')
plt.title('Average movie length over the years',fontsize=16,color='red')
plt.ylabel('Length of movie in minutes')
plt.xlabel('Year')
     

>Netflix offers a range of movies on its platform, including those from as far back as 1942.

>According to the plot, movies made in the 1940s had a relatively short duration.

>On average, movies made in the 1960s are the longest in length.

>The average length of movies has been decreasing steadily since the 2000s.

<h2><B> Netflix TV show Duration

In [None]:
# TV show duration: strip the " Season" / " Seasons" suffix and keep the season count as an integer
netflix_shows['duration']=netflix_shows['duration'].str.replace(r' Seasons?', '', regex=True)
netflix_shows['duration']=netflix_shows['duration'].astype(int)

# Seasons in each TV show
plt.figure(figsize=(15,5),dpi=100)
p = sns.countplot(x='duration',data=netflix_shows)
plt.title('Number of seasons per TV show distribution',fontsize=16,color='red')

for i in p.patches:
    p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()),
               ha='center', va='center', xytext=(0, 5), textcoords='offset points', fontsize=10)

>The TV series in the dataset have a maximum of 16 seasons, but the majority only have one season.

>This could suggest that many of the TV shows are relatively new and additional seasons may be in the works.

>There are very few TV shows in the dataset with more than 8 seasons.

In [None]:
# separating genre from listed_in columns for analysis purpose
genres = df['listed_in'].str.split(', ', expand=True).stack().reset_index(level=1, drop=True)

# top 10 genre in listed movie/show
plt.figure(figsize=(15,5),dpi=100)
genres = genres.value_counts()[:10].plot(kind='barh')
plt.title('Top 10 genres',fontsize=16,color='red')     

>International Movies is the most common genre, followed by Dramas and Comedies.

<h1><b>Description

In [None]:
# text documents
text = " ".join(word for word in df['description'])

# create the word cloud
wordcloud = WordCloud(stopwords=STOPWORDS).generate(text)

# plot the word cloud
plt.imshow(wordcloud,  interpolation='bilinear')
plt.axis("off")
plt.show()

>The most common words in the description column are family, find, life, love, new, world, friend.

<h1><b>Feature Engineering & Data Pre-processing

<h1>Handling Missing Values

In [None]:
# Missing Data %

round(df.isna().mean().sort_values(ascending=False)*100,2)

>For the missing values in the director, cast, and country attributes, the 'empty string' can be used as a replacement.

>The percentage of null values in the rating and date_added columns is small, and dropping these values may not significantly impact model building.

In [None]:
# Handling Missing Values & Missing Value Imputation

df[['director','cast','country']] = df[['director','cast','country']].fillna(' ')
df.dropna(axis=0, inplace=True)

In [None]:
# checking for null values after treating them.

df.isna().sum()

<h1><b>Handling Outlier

In [None]:
# Handling Outliers & Outlier treatments
plt.figure(figsize=(15,5),dpi=100)
sns.boxplot(data=df,orient='h');

>Outlier handling may not be necessary for textual data as outliers are typically defined in numerical data.

>Data cleaning and preprocessing steps are still necessary to ensure the data is ready for model building.

<h1><b>Textual Data Pre-processing

>Select the attributes that will be used to cluster the shows.

>Perform text preprocessing by removing stopwords and punctuation marks, and converting all textual data to lowercase.

>Use stemming to reduce each word in the corpus to its root form.

>Tokenize the corpus and perform word vectorization.

>Apply dimensionality reduction techniques to reduce the dimensionality of the dataset.

>Use different algorithms to cluster the movies and determine the optimal number of clusters using various techniques such as the elbow method or silhouette score.

>Build the optimal number of clusters and visualize the contents of each cluster using word clouds to gain insights about the characteristics of each cluster.

<h1><b>Clustering Attributes

We will cluster the movie/shows on Netflix based on the following attributes:

 >Director

 >Cast

 >Country
 
 >Rating
 
 >Listed in (genres)

 >Description

In [None]:
# Copying the original dataset for clustering as it does not contain any missing values to handle

df1 = df.copy()    

In [None]:
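# creating the clustering_attributes column from all text columns used for model building.
# Fields are joined with a space so the last word of one field does not fuse with the first word of the next;
# director is included to match the list of clustering attributes above.

df1['clustering_attributes'] = df1['description'] + ' ' + df1['listed_in'] + ' ' + df1['rating'] + ' ' + df1['cast'] + ' ' + df1['country'] + ' ' + df1['director']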

In [None]:
df1.clustering_attributes[0]

In [None]:
df1['clustering_attributes'].head(10)

<h1><b>Removing Non-ASCII Characters

In [None]:
# function to remove non-ascii characters

def remove_non_ascii(words):
    """Function to remove non-ASCII characters"""
    new_words = []
    for word in words:
        # unicodedata.normalize = convert each string to NFKD form
        # encode = convert string to ASCII format, ignoring characters with no ASCII equivalent
        # decode = convert resulting byte string to regular string format
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

In [None]:
# remove non-ascii characters

df1['clustering_attributes'] = remove_non_ascii(df1['clustering_attributes'])

In [None]:
df1['clustering_attributes'][0]     

In [None]:
df1['clustering_attributes'].head(5)
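To make the effect of the normalize, encode, decode chain concrete, here is a small standalone sketch. The helper name `to_ascii` and the sample strings are assumptions for illustration; the chain itself is the same one used in `remove_non_ascii`.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD) and drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Amélie"))        # the accent is stripped, the base letter survives
print(to_ascii("Señor Café"))
print(repr(to_ascii("東京")))     # characters with no ASCII base letter are dropped entirely
```

Accented Latin letters keep their base letter, while names written in non-Latin scripts disappear completely, which is worth keeping in mind for cast and director names.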

<h1><b>Removing Stopwords And Converting To Lower Case

In [None]:
# Download the stop words list if it hasn't been downloaded already
nltk.download('stopwords')

# Create a list of English stop words
stop_words = stopwords.words('english')

# Display the stop words
print(stop_words)

In [None]:
# Text Preprocessing: removing stopwords and converting to lowercase.
# The function is named remove_stopwords so that it does not shadow the nltk `stopwords` module imported earlier.
def remove_stopwords(text):
    '''a function for removing stopwords and lowercasing each word'''
    text = [word.lower() for word in text.split() if word.lower() not in stop_words]
    # joining the list of words with space separator
    return " ".join(text)

In [None]:
# Removing stop words
df1['clustering_attributes'] = df1['clustering_attributes'].apply(remove_stopwords)

In [None]:
df1['clustering_attributes'][0]

<h1><b>Removing Punctuation

>Removing punctuation is a common preprocessing step in natural language processing (NLP) tasks. Punctuation marks such as periods, commas, and exclamation points can add noise to the data and can sometimes be treated as separate tokens, which can impact the performance of NLP models.
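A short sketch of why this matters, using a made-up sentence (the sentence and variable names are assumptions for illustration): before cleaning, a word followed by a comma is a different token from the bare word.

```python
import string

sentence = "Love, loss and love again: a family's story!"

# Tokens before cleaning: "love," and "love" count as different words.
raw_tokens = sentence.lower().split()

# Delete every ASCII punctuation character, then tokenize again.
table = str.maketrans("", "", string.punctuation)
clean_tokens = sentence.lower().translate(table).split()

print(len(set(raw_tokens)), sorted(set(raw_tokens)))
print(len(set(clean_tokens)), sorted(set(clean_tokens)))
```

Deleting punctuation outright also turns "family's" into "familys", and any two words separated only by punctuation would be fused into one token, so fields should be separated by whitespace before this step.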

In [None]:
# function to remove punctuations

def remove_punctuation(text):
    '''a function for removing punctuation'''
    # replacing the punctuations with no space, which deletes the punctuation marks.
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
     

In [None]:
# Removing punctuation marks
df1['clustering_attributes'] = df1['clustering_attributes'].apply(remove_punctuation)

In [None]:
df1['clustering_attributes'][0]

<h1><b>Stemming

>Stemming bundles together words with the same root. For example, the stemmer reduces "running" and "runs" to the common stem "run".

>The SnowballStemmer has been used to reduce each word in the corpus to its stem; a stem is not always a dictionary word.

In [None]:
# create an object of stemming function
stemmer = SnowballStemmer("english")

def stemming(text):    
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in text.split()]
    return " ".join(text) 

In [None]:
#performing stemming operation

df1['clustering_attributes'] = df1['clustering_attributes'].apply(stemming)

In [None]:
df1['clustering_attributes'][0]  

<h1><b>Text Vectorization

In [None]:
# extract the TF-IDF representation matrix of the text data
tfid_vectorizer= TfidfVectorizer(stop_words='english', lowercase=False, max_features = 10000)  # max_features = 10000 caps the vocabulary size to limit memory use
tfid_matrix = tfid_vectorizer.fit_transform(df1['clustering_attributes'])        

# convert the sparse TF-IDF matrix to a dense numpy array
array = tfid_matrix.toarray()  

In [None]:
#  Print Shape and Data Type of a NumPy Array

print(array)
print(f'shape of the vector : {array.shape}')
print(f'datatype : {type(array)}')
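To show what the numbers in this matrix mean, here is a from-scratch sketch on three made-up documents. It follows the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; the documents and the helper name `tfidf` are assumptions for illustration, and tokenisation is reduced to a whitespace split.

```python
import math
from collections import Counter

docs = [
    "crime drama detective",
    "romantic comedy drama",
    "stand comedy special",
]

def tfidf(documents):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = {term: sum(term in tokens for tokens in tokenized) for term in vocab}
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the first document, "crime" and "detective" receive a higher weight than "drama", because "drama" also appears in another document and is therefore less distinctive.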

<h1><b>Dimensionality Reduction

>Dimensionality reduction is the process of reducing the number of features or dimensions in a dataset while preserving as much information as possible.

>PCA (Principal Component Analysis) can be used to reduce the dimensionality of the data.

In [None]:
# using PCA to reduce dimensionality

pca = PCA(random_state=0)
pca.fit(array)  

In [None]:
# Explained variance for different number of components

plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - Cumulative explained variance vs number of components',fontsize=16,color='red')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

>After performing PCA, it was found that ~7600 components can explain 100% of the variance in the data.

>More than 80% of the variance can be explained by just 4000 components.

>Selecting the top 4000 components can help simplify the model and reduce dimensionality while still capturing more than 80% of the variance.
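The selection described above can also be done programmatically rather than by reading the plot. A minimal sketch, assuming a list of explained variance ratios sorted from largest to smallest; the ratios below are made up and the helper name `components_for_variance` is hypothetical:

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for count, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold - 1e-12:  # small tolerance for float rounding
            return count
    return len(explained_variance_ratio)

# Hypothetical ratios, sorted from largest to smallest as PCA reports them.
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

print(components_for_variance(ratios, 0.80))
print(components_for_variance(ratios, 0.95))
```

With the fitted model from this notebook, the argument would be `pca.explained_variance_ratio_`. scikit-learn's `PCA` can also make this choice itself when `n_components` is given as a fraction between 0 and 1 together with `svd_solver='full'`.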

In [None]:
# reducing the dimensions to 4000 using pca
pca = PCA(n_components=4000,random_state=0)
pca.fit(array)

In [None]:
# transformed features
X = pca.transform(array)

# shape of transformed vectors
X.shape

>The dimensionality of the data has been successfully reduced using PCA.

<h1><B>ML Model Implementation

<h2><b>1. K-Means Clustering

>K-means clustering is a popular unsupervised machine learning algorithm that divides a dataset into a predefined number of clusters. Since it is an unsupervised algorithm, it does not rely on labeled examples to learn about the data. To determine the optimal number of clusters for the K-means algorithm, we can use the elbow curve and Silhouette score visualization techniques.
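As an illustration of what the algorithm does internally, here is a plain-Python sketch of the basic assign-then-average loop (Lloyd's iterations) on six made-up 2-D points. It is not scikit-learn's implementation: the starting centroids are passed in by hand instead of using k-means++ initialisation, and the function names are assumptions for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The three points near the origin end up in one cluster and the three points near (8.5, 8.5) in the other.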

<b>Elbow method to find best value of k



>The elbow curve is a plot of the sum of squared distances between each point and the centroid in a cluster against the number of clusters. As the number of clusters increases, the sum of squared distances generally decreases. The "elbow" point on the curve represents the optimal number of clusters, beyond which the decrease in sum of squared distances is not significant.
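A toy calculation makes the shape of the curve easier to see. The sketch below computes the sum of squared distances for nine made-up 1-D points that form three obvious groups; the centroids for each k are hand-picked for illustration rather than fitted, and the function name `inertia` is an assumption that simply mirrors the `inertia_` attribute used in the next cell.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in points)

points = [1, 2, 3, 11, 12, 13, 21, 22, 23]

# Hand-picked centroids for each k (assumed for illustration, not fitted).
centroids_by_k = {
    1: [12],
    2: [7, 22],
    3: [2, 12, 22],
    4: [1.5, 3, 12, 22],
}

sse = {k: inertia(points, c) for k, c in centroids_by_k.items()}
for k, value in sse.items():
    print(k, value)
```

The values fall from 606 to 156 to 6 and then only to 4.5, so the curve bends sharply at k = 3, which matches the three groups in the data. On real data the bend is usually much less pronounced.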

In [None]:
# The Elbow Method for Determining Optimal Number of Clusters

sum_of_sq_dist =[]
for i in range(1,20):
  # Initialize the k-means model with the current value of i
  kmeans = KMeans(n_clusters=i,init='k-means++',random_state=0)
  # Fit the model to the data
  kmeans.fit(X)
  # Compute the sum of squared errors for the model
  sum_of_sq_dist.append(kmeans.inertia_)

# Plot the value of SSE
number_clusters = range(1,20)
plt.figure(figsize=(15,5),dpi=100)
plt.plot(number_clusters,sum_of_sq_dist)
plt.title('The Elbow Method for optimal K',fontsize=16,color='red')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Sum of squared distances')
plt.show()

>Select the number of clusters as 10, as no drastic difference is visible after that.

In [None]:
# KMeans Clustering Visualization of Data Points
plt.figure(figsize=(10,6), dpi=150)

kmeans= KMeans(n_clusters=10, init= 'k-means++', random_state=0)

# fit the model and predict the cluster labels in a single step
label = kmeans.fit_predict(X)
#Getting unique labels
unique_labels = np.unique(label)
 
#plotting the results:
for i in unique_labels:
    plt.scatter(X[label == i , 0] , X[label == i , 1] , label = i)
plt.legend()
plt.title('KMeans Clustering Visualization',fontsize=16,color='red')
plt.show()

<h2><b>Silhouette score method to find the optimal value of k

In [None]:
# Initialize a list to store the silhouette score for each value of k
silhouette_scr = []

for k in range(2, 15):
  # Initialize the k-means model with the current value of k
  kmeans = KMeans(n_clusters=k, init='k-means++', random_state=0)
  # Fit the model to the data
  kmeans.fit(X)
  # Predict the cluster labels for each point in the data
  labels = kmeans.labels_
  # silhouette score for the model
  score = silhouette_score(X, labels)
  silhouette_scr.append(score)
  
# Plot the Silhouette analysis
plt.figure(figsize=(15,5),dpi=100)
plt.plot(range(2,15), silhouette_scr)
plt.xlabel('Number of clusters (K)') 
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k',fontsize=16,color='red')
plt.show()

>The highest Silhouette score is obtained for 6 clusters.
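For reference, the silhouette coefficient of a single point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; the score is the average over all points. A from-scratch sketch on six made-up points follows. The function name `silhouette` and the data are assumptions for illustration, and it requires Python 3.8+ for `math.dist`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)                                 # cohesion within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation from nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good_score = silhouette(points, [0, 0, 0, 1, 1, 1])  # labels that follow the two visible groups
bad_score = silhouette(points, [0, 1, 0, 1, 0, 1])   # labels that cut across the groups
print(round(good_score, 3), round(bad_score, 3))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below indicate overlapping or misassigned clusters.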

<h2><b>Building clusters using the k-means algorithm:

In [None]:
# Clustering the data into 6 clusters

kmeans = KMeans(n_clusters=6, init='k-means++', random_state=0)
kmeans.fit(X)

In [None]:
# Evaluation metrics - distortion, Silhouette score

kmeans_distortion = kmeans.inertia_
kmeans_silhouette_score = silhouette_score(X, kmeans.labels_)

print((kmeans_distortion, kmeans_silhouette_score))
     

In [None]:
# Adding a kmeans cluster number attribute
df1['kmeans_cluster'] = kmeans.labels_

In [None]:
# Number of movies and tv shows in each cluster

plt.figure(figsize=(15,5),dpi=100)
graph = sns.countplot(x='kmeans_cluster',data=df1, hue='type')
plt.title('Number of movies and TV shows in each cluster',fontsize=16,color='red')

# adding value count on the top of bar
for p in graph.patches:
   graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))
     

>Successfully built 6 clusters using the k-means clustering algorithm.

<h2><b>Building wordclouds for different clusters built:

In [None]:
# Building a wordcloud for the movie descriptions

def kmeans_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the titles in this cluster
  for val in df1[df1['kmeans_cluster']==cluster_num].description.values:
      
      # typecast each val to string
      val = str(val)

      # split the value into tokens and convert each token to lowercase
      tokens = [token.lower() for token in val.split()]
      
      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)


  # plot the WordCloud image                      
  plt.figure(figsize = (10,5), facecolor = None,dpi=100)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)

In [None]:
# Wordcloud for cluster 0

kmeans_worldcloud(0)

>Keywords observed in cluster 0: comedian, stage, first, special, life, deliver, funny, humor, share

In [None]:
# Wordcloud for cluster 1
kmeans_worldcloud(1)

Keywords observed in cluster 1: life, love, family, word, find, new, young, three, team

In [None]:
# Wordcloud for cluster 2

kmeans_worldcloud(2)    

Keywords observed in cluster 2:   life, new, find, word, family, become, girl, learn, school

In [None]:
# Wordcloud for cluster 3

kmeans_worldcloud(3)

Keywords observed in cluster 3: musical, friend, love, band, music, documentary, one, young, life

In [None]:
# Wordcloud for cluster 4

kmeans_worldcloud(4)

Keywords observed in cluster 4: documentary, series, family, live, animal, explore, filmmaker, world

In [None]:
# Wordcloud for cluster 5

kmeans_worldcloud(5)

Keywords observed in cluster 5: find, life, family, man, woman, friend, brother, young, three

<h1><b>Hierarchical Clustering

>The agglomerative (hierarchical) clustering algorithm is employed to construct clusters. This approach involves merging clusters that are similar, starting with each sample as a single-sample cluster, and building a hierarchy of clusters from the bottom up. To determine the optimal number of clusters, a dendrogram can be visualized when using the agglomerative (hierarchical) clustering algorithm.
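The bottom-up merging can be sketched in a few lines. The example below uses Ward's criterion, the linkage used later in this notebook, which merges the pair of clusters whose union increases the within-cluster sum of squares the least. It works on five made-up 1-D values, recomputes every pair at each step for clarity rather than speed, and its function names are assumptions for illustration.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (mean_a - mean_b) ** 2

def agglomerate(values, n_clusters):
    """Start with one cluster per value and merge the cheapest pair until n_clusters remain."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))  # record the merge history
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

values = [1.0, 1.2, 5.0, 5.3, 9.0]
clusters, merges = agglomerate(values, n_clusters=3)
print(sorted(sorted(c) for c in clusters))
print(merges)
```

The recorded merge history is the information a dendrogram draws: the closest pair (1.0 and 1.2) is joined first, then 5.0 and 5.3, and stopping at three clusters leaves 9.0 on its own.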

In [None]:
# Building a dendrogram to decide on the number of clusters

plt.figure(figsize=(15,5),dpi=100)  
dend = shc.dendrogram(shc.linkage(X, method='ward'))
plt.title('Dendrogram',fontsize=16,color='red')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 4.1, color='r', linestyle='--')

>Using the agglomerative clustering algorithm, it is possible to construct 7 clusters at a distance of 4.1 units.

In [None]:
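# Fitting hierarchical clustering model
# (Ward linkage only supports Euclidean distance, so the distance argument is left at its default)

hierarchical = AgglomerativeClustering(n_clusters=7, linkage='ward')

# Adding a hierarchical cluster number attribute, which the plots below rely on
df1['hierarchical_cluster'] = hierarchical.fit_predict(X)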

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(15,5),dpi=100)
graph = sns.countplot(x='hierarchical_cluster',data=df1, hue='type')
plt.title('Number of movies and tv shows in each cluster - Hierarchical Clustering',fontsize=16,color='red')

# adding value count on the top of bar
for p in graph.patches:
   graph.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

>The Agglomerative (hierarchical) clustering algorithm was utilized to construct 7 clusters successfully.

<H2><B>Building wordclouds for different clusters built:

In [None]:
# Building a wordcloud for the movie descriptions
def hierarchical_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the titles in this cluster
  for val in df1[df1['hierarchical_cluster']==cluster_num].description.values:
      
      # typecast each val to string
      val = str(val)

      # split the value into tokens and convert each token to lowercase
      tokens = [token.lower() for token in val.split()]
      
      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)


  # plot the WordCloud image                      
  plt.figure(figsize = (15,5), facecolor = None,dpi=100)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)

In [None]:
# Wordcloud for cluster 0
hierarchical_worldcloud(0)

Keywords observed in cluster 0: find, life, family, new, take, friend, become, love, live

In [None]:
# Wordcloud for cluster 1
hierarchical_worldcloud(1) 

Keywords observed in cluster 1: life, love, family, world, find, friend, young, must, crime

In [None]:
# Wordcloud for cluster 2
hierarchical_worldcloud(2)

Keywords observed in cluster 2: life, man, women, young, group, family, find, police

In [None]:
# Wordcloud for cluster 3
hierarchical_worldcloud(3)

Keywords observed in cluster 3: father, love, family, man, find, friend, indian, women, india

In [None]:
# Wordcloud for cluster 4
hierarchical_worldcloud(4)     

Keywords observed in cluster 4: natural, creature, examine, planet, series, world, earth, explore

In [None]:
# Wordcloud for cluster 5
hierarchical_worldcloud(5)

Keywords observed in cluster 5: new, love, life, korean, women, two, group, help, world

In [None]:
# Wordcloud for cluster 6
hierarchical_worldcloud(6)

Keywords observed in cluster 6: young, years, world, must, life, new, demon, group, battle
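
The keywords above were read off the word clouds by eye. A minimal sketch of how the same kind of list could be tabulated by counting tokens per cluster is shown below. The `rows` data, the small `STOP` set and the helper `top_keywords` are hypothetical and exist only for illustration; the notebook itself uses `STOPWORDS` from the `wordcloud` package.

```python
from collections import Counter

# Hypothetical stand-in for (cluster label, description) pairs
rows = [
    (0, "A family finds a new life in the city"),
    (0, "A new friend changes the life of a family"),
    (1, "Explorers examine creatures on a distant planet"),
    (1, "A series on the planet and its creatures"),
]

# Small illustrative stopword list (assumption, not the wordcloud STOPWORDS set)
STOP = {"a", "the", "in", "of", "on", "and", "its"}

def top_keywords(rows, cluster, n=3):
    """Return the n most frequent non-stopword tokens in one cluster."""
    counts = Counter()
    for label, text in rows:
        if label == cluster:
            counts.update(w for w in text.lower().split() if w not in STOP)
    return [word for word, _ in counts.most_common(n)]

print(top_keywords(rows, 0))
print(top_keywords(rows, 1, n=2))
```

Counting tokens this way makes the per-cluster keyword lists reproducible, although the ranking can differ from what a word cloud emphasizes visually.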

<H1><B>Content-based recommender system:

<B>Content-based recommendation systems suggest items to a user by exploiting the similarity between items. The degree of similarity between two items is determined from their descriptions or features, so a user who liked one item is shown the items that resemble it most.
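
To make the idea concrete, here is a minimal sketch that uses only the standard library. It assumes a tiny made-up catalogue and plain bag-of-words counts instead of the TF-IDF and PCA features used in this notebook; the names `catalogue`, `to_vector`, `cosine` and `most_similar` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy catalogue: title -> short description (not from the Netflix dataset)
catalogue = {
    "Space Quest": "astronauts explore a distant planet",
    "Planet Life": "scientists explore life on a distant planet",
    "Kitchen Wars": "chefs compete in a cooking contest",
}

def to_vector(text):
    """Represent a description as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(title, k=1):
    """Return the k titles whose descriptions are closest to the given title's."""
    target = to_vector(catalogue[title])
    scores = [(other, cosine(target, to_vector(desc)))
              for other, desc in catalogue.items() if other != title]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:k]]

print(most_similar("Space Quest"))
```

The two space-themed descriptions share most of their words, so they score higher with each other than with the cooking show.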

In [None]:
# verifying index
df1[['show_id', 'title', 'clustering_attributes']]     

As the dataframe above shows, df1 contains 7770 rows, yet the last index label is 7786. The gap arises because some rows were dropped while handling null values and the index was not renumbered afterwards.
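
The effect can be reproduced on a small frame. The sketch below uses a hypothetical four-row dataframe named `toy`; it is an illustration under that assumption, not the notebook's own data.

```python
import pandas as pd

# Hypothetical toy frame standing in for df1
toy = pd.DataFrame({"title": ["A", "B", "C", "D"],
                    "director": ["x", None, "y", "z"]})

# Dropping the row with a null value leaves a gap in the index labels
dropped = toy.dropna()
print(len(dropped), list(dropped.index))

# After resetting, index labels and row positions agree again
reset = dropped.reset_index(drop=True)
print(list(reset.index))
```

Here `drop=True` discards the old labels directly; without it, pandas keeps them in a new `index` column that has to be dropped separately.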

In [None]:
# defining a new df for building a recommender system
recommender_df = df1.copy()     

In [None]:
# resetting index
recommender_df.reset_index(inplace=True)

# checking reset index 
recommender_df[['show_id', 'title', 'clustering_attributes']]     

The index has been successfully reset, and the dataset is now ready to be used for building a content-based recommendation system.

In [None]:
# dropping show-id and index column
recommender_df.drop(columns=['index', 'show_id'], inplace=True)  

In [None]:
# displaying the transformed array obtained after PCA for dimensionality reduction
X   

In [None]:
# calculate cosine similarity
similarity = cosine_similarity(X)
similarity     

In [None]:
# Function to list the top 10 recommended titles on the basis of cosine similarity score.
def recommend(movie):
  '''
  Returns the ten titles most similar to the given title, ranked by cosine similarity score.
  '''
  try:
    # find the row position of the given title
    index = recommender_df[recommender_df['title'] == movie].index[0]
  except IndexError:
    # the title is not present in the dataset
    return 'Invalid Entry'
  # sort all titles by their similarity to the given one, most similar first
  distances = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])
  print(f"If you liked '{movie}', you may also enjoy: \n")
  # skip the first entry (normally the title itself) and keep the next ten
  return [recommender_df.iloc[i]['title'] for i, _ in distances[1:11]]

In [None]:
recommend('Naruto')

In [None]:
recommend('A Man Called God')

In [None]:
recommend('Avenger')

In [None]:
recommend('Phir Hera Pheri')

<H1><B>Conclusion:

* In this project, I worked on a text clustering problem in which I clustered Netflix shows into groups with similar attributes. Here's a summary of my findings and actions taken:

* The dataset had 7787 records and 11 attributes. I performed exploratory data analysis and handled missing values.

* I discovered that Netflix has more movies than TV shows, and the number of shows on the platform is growing rapidly. Most shows are produced in the United States.

* I chose to cluster the data based on attributes such as director, cast, country, genre, rating, and description. I tokenized, preprocessed, and vectorized these attributes using a TF-IDF vectorizer, resulting in 10,000 features.

* To reduce dimensionality, I utilized Principal Component Analysis (PCA), with 4,000 components capturing over 80% of variance.

* I used K-Means Clustering to build the initial clusters, with the optimal number of clusters being 6 as determined by the elbow method and Silhouette score analysis.

* Agglomerative clustering was then used to build a second set of clusters, with 7 being the optimal number as determined by visualizing the dendrogram.

* Finally, I developed a content-based recommender system using a cosine similarity matrix computed from the PCA-transformed features. Given a title the user has watched, the system recommends the 10 most similar titles.
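
The steps summarized above can be sketched end to end on a toy corpus. Everything below is an assumption made for illustration: the six documents, the variable names, and the small values for the number of components and clusters, which stand in for the 4,000 components and the 6 and 7 clusters reported for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the combined clustering attributes
docs = [
    "crime drama detective police city",
    "police detective murder crime thriller",
    "nature documentary planet earth animals",
    "earth planet wildlife nature series",
    "romantic comedy love wedding family",
    "family love romance comedy friends",
]

# 1. Text -> TF-IDF features (dense array, because PCA does not accept sparse input here)
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# 2. PCA for dimensionality reduction (toy value of 3 components)
X_toy = PCA(n_components=3, random_state=0).fit_transform(tfidf)

# 3. Two clustering models fitted on the reduced features
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_toy)

# 4. Pairwise cosine similarity matrix used by the recommender
sim = cosine_similarity(X_toy)
print(kmeans_labels, agglo_labels, sim.shape)
```

On real data the number of components and clusters would be chosen from the explained variance, the elbow and silhouette plots, and the dendrogram, rather than fixed in advance as they are here.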
