<a href="https://colab.research.google.com/github/yogesh966/In-process...-Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering/blob/main/M6_Project_Unsupervised_ML_Netflix_Movies_and_TV_Shows_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Unsupervised ML - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised ML - Netflix Movies and TV Shows Clustering
##### **Contribution**    - Individual


# **Project Summary -**

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset. Integrating it with external datasets such as IMDb ratings or Rotten Tomatoes scores could also yield interesting findings.

# **GitHub Link -**
https://github.com/yogesh966/In-process...-Unsupervised-ML---Netflix-Movies-and-TV-Shows-Clustering


# **Problem Statement**
Netflix is one of the world's leading entertainment services, offering TV series, films and games across a wide variety of genres and languages. It is the most-subscribed video-on-demand streaming service, with 260.28 million paid memberships in more than 190 countries as of January 2024.

It is crucial that Netflix effectively clusters the shows hosted on its platform in order to enhance the user experience and thereby reduce subscriber churn.

We will be able to understand the shows that are similar to and different from one another by creating clusters, which may be leveraged to offer the consumers personalized show suggestions depending on their preferences.

The goal of this project is to classify/group the Netflix shows into certain clusters such that the shows within a cluster are similar to each other and the shows in different clusters are dissimilar to each other.

## **Project Objective**
This project covers:

1. Exploratory Data Analysis

2. Understanding what type of content is available in different countries

3. Checking whether Netflix has been increasingly focusing on TV shows rather than movies in recent years

4. Clustering similar content by matching text-based features





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import re, string, unicodedata
import warnings
from datetime import datetime as dt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS

# NLP utilities
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# Vectorization, dimensionality reduction and clustering
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
import scipy.cluster.hierarchy as shc

warnings.filterwarnings('ignore')

# Download the NLTK resources used for stopword removal and lemmatization
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

%matplotlib inline
sns.set()

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive/')


In [None]:
data='/content/drive/My Drive/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'

#Read the data
original_df=pd.read_csv(data)
df=original_df.copy()
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information
1. show_id : Unique ID for every movie / TV show

2. type : Identifier - a Movie or a TV Show

3. title : Title of the movie / TV show

4. director : Director of the movie / TV show

5. cast : Actors involved in the movie / TV show

6. country : Country where the movie / TV show was produced

7. date_added : Date it was added on Netflix

8. release_year : Actual release year of the movie / TV show

9. rating : TV rating of the movie / TV show

10. duration : Total duration - in minutes (movies) or number of seasons (TV shows)

11. listed_in : Genre

12. description : The summary description

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# df.duplicated() flags each row that repeats an earlier row; summing the flags counts them
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

## **Hence, there are no duplicate rows in the dataset.**

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum().reset_index().rename(columns={'index':"column", 0:"count"}).sort_values(by='count',ascending=False)

In [None]:
print(df.isnull().sum().sum())

In [None]:
# Visualizing the missing values
#Columns with null values
null_col=df.columns[df.isnull().any()]
null_col


In [None]:
#visualize null values count

plt.bar(null_col,df[null_col].isnull().sum())
plt.title("Missing/null value Count")
plt.xlabel("Columns")
plt.ylabel("No of missing values")


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()


In [None]:
# Check Unique Values for each variable.
df.nunique()


## 3. ***Data Wrangling***

### Handling the missing values

In [None]:
df.columns[df.isnull().any()]

In [None]:
#filling the null values with appropriate value
df[['director','cast','country']] = df[['director','cast','country']].fillna('Unknown')
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])



In [None]:
df.dropna(inplace=True)
df.isnull().sum()

In [None]:
df.shape

## **Country, listed_in:**

In [None]:
#Top countries
df.country.value_counts()

In [None]:
#Genre of the shows
df.listed_in.value_counts()

### What all manipulations have you done and insights you found?


Some movies / TV shows were filmed in multiple countries or have multiple genres associated with them.

To simplify the analysis, let's consider only the primary country where that respective movie / TV show was filmed.

Also, let's consider only the primary genre of the respective movie / TV show.

In [None]:
#Choosing the primary country and primary genre to simplify the analysis
df['country'] = df['country'].apply(lambda x: x.split(',')[0])
df['listed_in'] = df['listed_in'].apply(lambda x: x.split(',')[0])


In [None]:
# primary country in which a movie / TV show was produced
df.country.value_counts()

In [None]:
# genre of shows
df.listed_in.value_counts()

## **Typecasting 'duration' from string to integer**

In [None]:
# Splitting the duration column, and changing the datatype to integer

df['duration'] = df['duration'].apply(lambda x : int(x.split()[0]))


In [None]:
# Number of seasons for tv shows
df[df['type']=='TV Show'].duration.value_counts()


In [None]:
# Movie length in minutes
df[df['type'] == 'Movie'].duration.unique()

In [None]:
# Datatype of duration
df.duration.dtype

***Successfully converted the datatype of duration column to int.***



## **Typecasting 'date_added' from string to datetime:**


In [None]:
df.date_added.dtype

In [None]:
# Typecasting 'date_added' from string to datetime

df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format="%B %d, %Y")

In [None]:
df.date_added.min(),df.date_added.max()


***The shows were added on Netflix between 1st January 2008 and 16th January 2021.***
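The `.str.strip()` call in the conversion above presumably guards against stray whitespace around the date strings, since `strptime`-style formats do not tolerate a leading space. A minimal sketch using only the standard library; the sample strings and the helper name `parse_date_added` are made up for illustration and are not rows from the dataset:

```python
from datetime import datetime

FORMAT = "%B %d, %Y"  # full month name, day of month, four-digit year


def parse_date_added(raw: str) -> datetime:
    """Parse a 'Month day, year' string after trimming surrounding whitespace."""
    return datetime.strptime(raw.strip(), FORMAT)


# Hypothetical sample values, one of them with a leading space
print(parse_date_added("August 14, 2020"))
print(parse_date_added(" January 1, 2008"))

# Without strip(), the leading space makes strptime reject the string
try:
    datetime.strptime(" January 1, 2008", FORMAT)
except ValueError as err:
    print("unstripped value failed:", err)
```

The month name is matched using the current locale, so this sketch assumes an English locale.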

## **Adding new attributes/columns:**

In [None]:
# Adding new attributes month and year of date added
df['month_added'] = df['date_added'].dt.month
df['year_added'] = df['date_added'].dt.year

# **Rating:**


## **Age ratings for shows in the dataset**

In [None]:
plt.figure(figsize = (10,5))
sns.countplot(x='rating', data=df)

***The highest number of shows on Netflix are rated TV-MA, followed by TV-14 and TV-PG.***



In [None]:
# Age ratings
df.rating.unique()

In [None]:
# Changing the values in the rating column
rating_map = {'TV-MA':'Adults',
              'R':'Adults',
              'PG-13':'Teens',
              'TV-14':'Young Adults',
              'TV-PG':'Older Kids',
              'NR':'Adults',
              'TV-G':'Kids',
              'TV-Y':'Kids',
              'TV-Y7':'Older Kids',
              'PG':'Older Kids',
              'G':'Kids',
              'NC-17':'Adults',
              'TV-Y7-FV':'Older Kids',
              'UR':'Adults'}

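# assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply
df['rating'] = df['rating'].replace(rating_map)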
df['rating'].unique()

## **Age groups for shows in the dataset (after remapping the ratings)**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=df)

Most shows on Netflix are produced for an adult audience, followed by young adults, older kids and kids. Shows produced specifically for teenagers form the smallest group.


# **Exploratory data analysis**


Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often employing visual methods. Its primary goal is to gain insights into the data, identify patterns, trends, anomalies, and relationships between variables. EDA helps researchers or analysts understand the underlying structure of the data and formulate hypotheses for further investigation.

## **Univariate analysis**

Univariate analysis examines a single variable at a time to understand its distribution, central tendency, dispersion, and other summary statistics.



## **1. Number of Movies and TV Shows in the dataset :**

In [None]:
plt.figure(figsize=(7,7))
df.type.value_counts().plot(kind='pie',autopct='%1.2f%%')
plt.ylabel('')
plt.title('Movies and TV Shows in the dataset')

## **2.Top 10 directors in the dataset:**

In [None]:
plt.figure(figsize=(10,5))
df[~(df['director']=='Unknown')].director.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 directors by number of shows directed')

***Raul Campos and Jan Suter together have directed 18 movies / TV shows, higher than anyone in the dataset.***

## **3.Top 10 countries with the highest number movies / TV shows in the dataset**

In [None]:
plt.figure(figsize=(10,5))
df[~(df['country']=='Unknown')].country.value_counts().nlargest(10).plot(kind='barh')
plt.title(' Top 10 countries with the highest number of shows')


***The United States boasts the highest count of movies and TV shows, with India and the UK following closely behind.***

In [None]:
#% share of movies / tv shows by top 3 countries
df.country.value_counts().nlargest(3).sum()/len(df)*100

In [None]:
#% share of movies / tv shows by top 10 countries
df.country.value_counts().nlargest(10).sum()/len(df)*100


***The top three countries collectively contribute to approximately 56% of all movies and TV shows within the dataset, while this proportion escalates to around 78% for the top ten countries.***

## **4. Visualizing the year in which the movie / TV show was released**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['release_year'])
plt.title('distribution by released year')

***The histogram shows the distribution of movie and TV show release years, with a noticeable increase in the number of titles released after 2000. Gaps in earlier years may reflect shifts in content creation or limited dataset coverage.***

## **5. Top 10 genres**

In [None]:
plt.figure(figsize = (10,5))
df.listed_in.value_counts().nlargest(10).plot(kind='barh')
plt.title('top 10 genres')


In [None]:
# %Share of top 3 genres
df.listed_in.value_counts().nlargest(3).sum()/len(df)*100



In [None]:
# %Share of top 10 genres
df.listed_in.value_counts().nlargest(10).sum()/len(df)*100

***The plot shows the 10 most frequent primary genres in the dataset. Dramas are the most common, followed by comedies and documentaries; together these three account for approximately 41% of all movies and TV shows. The top 10 genres cover around 82% of the content. These shares describe the composition of the catalog, not viewer demand.***

## **6. Number of shows on Netflix for different age groups**

In [None]:
plt.figure(figsize=(10,5))
df.rating.value_counts().plot(kind='barh')
plt.title('Number of shows on Netflix for different age groups')


***The majority of the shows on Netflix are catered to the needs of adult and young adult population.***


## **Bivariate analysis**

Bivariate analysis examines the relationship between two variables to uncover patterns, correlations, or associations.

## **1.Number of movies and TV shows added over the years**

In [None]:
plt.figure(figsize=(10,5))
p=sns.countplot(x='year_added', data=df, hue='type')
plt.title('Number of movies and tv shows added over the years')
plt.xlabel('')
for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')


***Over the years, Netflix has maintained a consistent emphasis on expanding its library of shows on its platform. While there was a decline in the number of movies added in 2020, a similar trend was not observed in the addition of TV shows during the same period. This might signal that Netflix is increasingly concentrating on introducing more TV series to its platform rather than movies.***

## **2.Seasons in each TV show**

In [None]:
plt.figure(figsize = (10,5))
p = sns.countplot(x='duration', data=df[df['type']=='TV Show'])
plt.title('Number of seasons per TV show distribution')
for i in p.patches:
  p.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')



In [None]:
# % of tv shows with just 1 season
len(df[(df['type']== 'TV Show') & (df['duration']==1)]) / len(df[df['type']=='TV Show'])*100


***The TV series in the dataset range up to 16 seasons, yet the majority of them consist of only one season. This observation could imply that most TV shows are relatively new, with potential for additional seasons in the future. Additionally, there are very few TV shows with more than 8 seasons.***

## **3.length of movie analysis**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(x='duration',data=df[df['type']=='Movie'])
plt.title("Movie duration distribution")

In [None]:
# Movie statistics
df[df['type']== 'Movie'].duration.describe()

***Movie durations in the dataset range from 3 minutes to 312 minutes, and their distribution is approximately normal.***

## **4. Average movie length over the years**

In [None]:
plt.figure(figsize=(10,5))
df[df['type']=='Movie'].groupby('release_year').duration.mean().plot(kind='line')
plt.title('Average movie length over the years')
plt.ylabel('Length of movie in minutes')
plt.xlabel('Year')


In [None]:
# Movie release year statistics
df[df['type']== 'Movie'].release_year.describe()

***Netflix offers a diverse selection of movies, spanning from classics dating back to 1942 to contemporary releases. Interestingly, films from the 1940s tend to have relatively shorter durations, while those from the 1960s boast the longest average lengths. Notably, there has been a consistent decrease in the average movie length since the 2000s.***

## **5.Top 10 genre for movies**

In [None]:
plt.figure(figsize=(10,5))
df[df['type']=='Movie'].listed_in.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 genres for movies')


***Dramas, comedies, and documentaries are the most common primary genres among the movies on Netflix.***

## **6.Top 10 genre for tv shows**

In [None]:
plt.figure(figsize=(10,5))
df[df['type']=='TV Show'].listed_in.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 genres for TV Shows')


***International, crime, and kids' TV are the most common primary genres among the TV shows on Netflix.***

## **7.Top 10 movie directors**

In [None]:
# Top 10 movie directors
plt.figure(figsize=(10,5))
df[~(df['director']=='Unknown') & (df['type']=='Movie')].director.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 movie directors')

***Raul Campos and Jan Suter have co-directed 18 movies together, more than any other director or directing team in the dataset. They are followed by Marcus Raboy, Jay Karas, and Cathy Garcia-Molina.***

## **8.Top 10 TV show directors**

In [None]:
plt.figure(figsize=(10,5))
df[~(df['director']=='Unknown') & (df['type']=='TV Show')].director.value_counts().nlargest(10).plot(kind='barh')
plt.title('Top 10 TV show directors')

***Alastair Fothergill leads with the distinction of directing three TV shows, marking the highest count among all directors. Additionally, only a select group of six directors have helmed more than a single television show.***

## **9.Top 10 actors for movies**

In [None]:
plt.figure(figsize=(10,9))
df[~(df['cast']=='Unknown') & (df['type']=='Movie')].cast.value_counts().nlargest(10).plot(kind='barh')
plt.title('Actors who have appeared in highest number of movies')

***Because the `cast` column is counted as a whole string, this plot ranks complete cast lists rather than individual actors. Samuel West is the sole credited cast member of 10 movies, followed by Jeff Dunham with 7.***

## **10.Top 10 actors for TV shows**



In [None]:
plt.figure(figsize=(10,5))
df[~(df['cast']=='Unknown') & (df['type']=='TV Show')].cast.value_counts().nlargest(10).plot(kind='barh')
plt.title('Actors who have appeared in highest number of TV shows')

***David Attenborough is the sole credited cast member of 13 TV shows. The next most frequent cast list (Michela Luci, Jamie Watson, Anna Claire Bartlam, Dante Zee, Eric Peterson) appears in 4 TV shows.***
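Calling `value_counts()` directly on `cast`, as the two plots above do, counts complete comma-separated cast strings. To rank individual actors, each string would first have to be split into one row per actor. A minimal sketch on a small made-up DataFrame; the names, the `toy` frame, and the `actor` label are illustrative and not taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's df
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "cast": ["Actor A, Actor B", "Actor B", "Unknown", "Actor A, Actor C"],
})

# Keep movies with a known cast, mirroring the filter used in the notebook
movies = toy[(toy["type"] == "Movie") & (toy["cast"] != "Unknown")]

# Split each cast string on commas, then explode to one row per actor
actors = (
    movies["cast"]
    .str.split(",")
    .explode()
    .str.strip()
    .rename("actor")
)

actor_counts = actors.value_counts()
print(actor_counts)
```

Applied to the real data, the resulting ranking would likely differ from the one plotted above, because actors who appear in ensemble casts would then be counted as well.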

# **Building a wordcloud for the movie descriptions**

In [None]:
# Create comment_words, which joins every word of the description column into one lowercase string
stopwords = set(STOPWORDS)

# lowercase each description, normalize its whitespace, and join everything with spaces
comment_words = " ".join(
    " ".join(str(val).lower().split()) for val in df.description.values
)

# preview the first 500 characters instead of printing the whole corpus
comment_words[:500]


In [None]:
#Building a wordcloud
wordcloud= WordCloud(width=1200, height=700,
                     background_color='white',
                     stopwords=stopwords,
                     min_font_size=10).generate(comment_words)


In [None]:
# plot the WordCloud image
plt.figure(figsize = (10,5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

***The most prominent keywords in Netflix show descriptions include: life, family, new, love, young, world, group, death, man, woman, murder, son, girl, documentary, and secret.***

# **Data preprocessing**

**Data preprocessing is a fundamental step in data analysis and machine learning pipelines. It involves transforming raw data into a clean, organized format suitable for further analysis or modeling. Its primary objectives are to enhance data quality, resolve inconsistencies, and reduce noise.**



## **Modeling Approach**



1. Select the attributes on which to cluster the shows
2. Text preprocessing: remove all non-ASCII characters, stopwords and punctuation marks, and convert all text to lowercase
3. Lemmatization to reduce each word to its base form
4. Tokenization of the corpus
5. Word vectorization
6. Dimensionality reduction
7. Cluster the shows using different algorithms, and find the optimal number of clusters using different techniques
8. Build the optimal number of clusters and visualize the contents of each cluster using wordclouds


We will cluster the shows on Netflix based on the following attributes:

Director

Cast

Country

Listed in (genres)

Description
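The steps listed above can be sketched end to end on a toy corpus with a scikit-learn pipeline. This is only an illustration under stated assumptions, not the notebook's implementation: the four documents are made up, and `TruncatedSVD` is swapped in for PCA because it accepts the sparse TF-IDF matrix directly, whereas the notebook converts the matrix to a dense array and uses `PCA`.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: two sci-fi and two romance descriptions
docs = [
    "space alien robot galaxy",
    "alien galaxy space ship",
    "love wedding romance kiss",
    "romance love kiss date",
]

pipeline = make_pipeline(
    TfidfVectorizer(),                                  # word vectorization
    TruncatedSVD(n_components=2, random_state=42),      # dimensionality reduction
    KMeans(n_clusters=2, n_init=10, random_state=42),   # clustering
)

# fit every step and return the cluster label of each document
labels = pipeline.fit_predict(docs)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that documents sharing vocabulary end up with the same label.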

In [None]:
# Using the original dataset for clustering; its missing values
# only need to be filled with empty strings before the text is combined
df1 = original_df.copy()

In [None]:
df1.fillna('',inplace=True)

## ***Combining clustering attributes into a single column***


In [None]:
df1['clustering_attributes'] = (df1['director'] + ' ' +
                                df1['cast'] +' ' +
                                df1['country'] +' ' +
                                df1['listed_in'] +' ' +
                                df1['description'])


In [None]:
# Check whether a particular row contains all the combined data
df1['clustering_attributes'][40] # prints the combined text of the row with index 40


## ***Removing non-ASCII characters***

ASCII (American Standard Code for Information Interchange) is a character encoding that covers only 128 characters, mainly unaccented English letters, digits and punctuation. Non-ASCII characters, such as accented letters, are represented using other encoding schemes such as UTF-8 (Unicode Transformation Format, 8-bit), which supports a much wider range of characters.
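A common way to strip non-ASCII characters is to decompose accented characters with Unicode NFKD normalization and then discard whatever cannot be encoded as ASCII. A minimal standard-library sketch; the helper name `to_ascii` and the sample strings are illustrative:

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# The accent is removed but the base letter survives
print(to_ascii("Amélie"))

# Characters with no ASCII base letter are dropped entirely
print(to_ascii("Pokémon: 東京"))
```

Note the trade-off: names written in non-Latin scripts disappear completely, so titles or cast members in those scripts contribute nothing to the clustering text.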

In [None]:
# function to remove non-ascii characters

def remove_non_ascii(words):
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words


In [None]:
# remove non-ascii characters
df1['clustering_attributes'] = remove_non_ascii(df1['clustering_attributes'])

In [None]:
df1['clustering_attributes'][40]

# **Remove stopwords and convert to lower case:**

Stopwords are commonly used words in natural language that are often filtered out or ignored during text processing and analysis because they typically do not carry significant meaning or context.



In [None]:
# extracting the stopwords from nltk library
import nltk
from nltk.corpus import stopwords
sw = stopwords.words('english')
# displaying the stopwords
np.array(sw)

In [None]:
# function to remove stop words
# (named remove_stopwords so that it does not shadow the nltk `stopwords` import)
def remove_stopwords(text):
    # lowercase each word and keep it only if it is not a stop word
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(text)


In [None]:
# Removing stop words
df1['clustering_attributes'] = df1['clustering_attributes'].apply(remove_stopwords)
df1['clustering_attributes'][40]


***We have successfully removed all the stopwords and converted the corpus to lowercase.***



## ***Remove punctuation***


In [None]:
# function to remove punctuations
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
# Removing punctuation marks
df1['clustering_attributes'] = df1['clustering_attributes'].apply(remove_punctuation)
df1['clustering_attributes'][40]

***We have successfully dropped all the punctuation marks from the corpus.***

# **Lemmatization:**

Lemmatization is often used as a preprocessing step in natural language processing (NLP) tasks to normalize text data. Lemmatization helps reduce words to their base or dictionary form (lemmas), which can improve the performance of ML models by reducing the vocabulary size and capturing the essential meaning of words.



In [None]:
# function to lemmatize every word of one document, treating each word as a verb
lemmatizer = WordNetLemmatizer()

def lemmatize_verbs(text):
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in text.split()]
    return " ".join(lemmas)

In [None]:
# Lemmatization, applied row by row so that individual words
# (rather than whole rows of text) are passed to the lemmatizer
df1['clustering_attributes'] = df1['clustering_attributes'].apply(lemmatize_verbs)
df1['clustering_attributes'][40]

***We have lemmatized the corpus***.

# **Tokenization**

Tokenization is the process of breaking down a sequence of text into smaller units, called tokens. These tokens can be words, phrases, symbols, or other meaningful elements, depending on the context and the task at hand.



In [None]:
tokenizer = TweetTokenizer()


In [None]:
df1['clustering_attributes'] = df1['clustering_attributes'].apply(lambda x: tokenizer.tokenize(x))


In [None]:
df1['clustering_attributes'][40]

***The corpus is converted to tokens.***


In [None]:
df1

In [None]:
# save the preprocessed data; to_csv returns None when given a path, so nothing is assigned
df1.to_csv('rec_df',index=True)


# **Vectorization**

Vectorization is the process of converting data into numerical vectors or arrays, enabling mathematical operations and analysis. It's a fundamental step in preparing data for machine learning algorithms that typically require numerical input.
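To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It follows what is, to the best of my knowledge, scikit-learn's default formulation (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each document vector). The function name `tfidf` and the toy documents are illustrative only:

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: count * idf[t] for t, count in counts.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical tokenized documents
docs = [["crime", "drama", "murder"], ["comedy", "drama"], ["comedy", "special"]]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In the first document, "murder" receives a higher weight than "drama" because it occurs in fewer documents, which is exactly the behavior that makes TF-IDF useful for separating shows by their distinctive words.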



In [None]:
# clustering tokens saved in a variable
clustering_data = df1['clustering_attributes']

In [None]:
# The corpus is already tokenized, so the vectorizer's tokenizer simply returns its input
def identity_tokenizer(text):
    return text

## **Using TFIDF vectorizer to vectorize the corpus**

In [None]:
# max features = 20000 to prevent system from crashing
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False,max_features = 20000)
X = tfidf.fit_transform(clustering_data)

In [None]:
X

In [None]:
X.shape

In [None]:
# data type of vector
type(X)


In [None]:
# convert X from a sparse matrix into a dense array, which PCA requires
X = X.toarray()

## **Dimensionality reduction using PCA**

We can use PCA (Principal component Analysis) to reduce the dimensionality of data.



In [None]:
# using PCA to reduce dimensionality
pca = PCA(random_state=42)
pca.fit(X) #It will take about 13 min to execute due to larger dataset


## **Explained variance for different number of components**

In [None]:
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('PCA - Cumulative explained variance vs number of components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

We find that 100% of the variance is explained by about 7,500 components.

Also, more than 80% of the variance is explained by just 4,000 components.

Hence, to simplify the model and reduce dimensionality, we can keep the top 4,000 components, which still capture more than 80% of the variance.
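Rather than reading the component count off the plot, it can be computed from the cumulative explained variance. A minimal sketch with NumPy; the ratios below are made up for illustration, and in the notebook the input would be `pca.explained_variance_ratio_` from the PCA fitted on all components. The helper name `components_for_variance` is hypothetical:

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index at which cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)


# Hypothetical explained variance ratios, sorted in decreasing order
ratios = np.array([0.40, 0.25, 0.20, 0.10, 0.05])

print(components_for_variance(ratios, 0.80))
print(components_for_variance(ratios, 0.90))
```

As far as I know, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps just enough components to reach that share of variance, which would avoid the separate exploratory fit.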


## **Reducing the dimensions to 4000 using PCA:**


In [None]:
pca = PCA(n_components=4000,random_state=42)
pca.fit(X)

In [None]:
# transformed features
x_pca = pca.transform(X)


In [None]:
# shape of transformed vectors
x_pca.shape


***We have successfully reduced the dimensionality of data using PCA.***


# **Clusters implementation**



## **K-Means Clustering**


Building clusters using the K-means clustering algorithm.

Visualizing the elbow curve and Silhouette score to decide on the optimal number of clusters for K-means clustering algorithm

In [None]:
# Elbow method to find the optimal value of k
wcss=[]
for i in range(1,31):
  kmeans = KMeans(n_clusters=i,init='k-means++',random_state=33)
  kmeans.fit(x_pca)
  wcss_iter = kmeans.inertia_
  wcss.append(wcss_iter)

print(wcss)
number_clusters = range(1,31)
plt.figure(figsize=(10,5))
plt.plot(number_clusters,wcss)
plt.title('The Elbow Method - KMeans clustering')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')


***The sum of squared distance between each point and the centroid in a cluster (WCSS) decreases with the increase in the number of clusters.***
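WCSS is the quantity that `KMeans` exposes as `inertia_`: for each cluster, the squared Euclidean distances from its members to the cluster centroid, summed over all clusters. A minimal NumPy sketch on four made-up points; the helper name `wcss` is illustrative:

```python
import numpy as np


def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)


# Hypothetical points forming two tight groups far apart
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = np.zeros(4, dtype=int)
two_clusters = np.array([0, 0, 1, 1])

print(wcss(points, one_cluster))   # everything assigned to a single centroid
print(wcss(points, two_clusters))  # one centroid per group
```

Because adding a cluster can only keep WCSS the same or lower it, the curve always falls as k grows; the elbow method looks for the point where the drop flattens out rather than for a minimum.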



In [None]:
# Plotting Silhouette score for different number of clusters
range_n_clusters = range(2,31)
silhouette_avg = []
for num_clusters in range_n_clusters:
  # initialize kmeans
  kmeans = KMeans(n_clusters=num_clusters,init='k-means++',random_state=33)
  kmeans.fit(x_pca)
  cluster_labels = kmeans.labels_

  # silhouette score
  silhouette_avg.append(silhouette_score(x_pca, cluster_labels))

plt.figure(figsize=(10,5))
plt.plot(range_n_clusters,silhouette_avg)
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k - KMeans clustering')
plt.show()

***The highest Silhouette score is obtained for 6 clusters.***
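For reference, the silhouette coefficient of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is its mean distance to the members of the nearest other cluster. The score plotted above is the average over all points. A minimal NumPy sketch on made-up one-dimensional data; the helper name `silhouette` is illustrative and Euclidean distance is assumed:

```python
import numpy as np


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    scores = []
    for i, (point, own) in enumerate(zip(points, labels)):
        dists = np.linalg.norm(points - point, axis=1)
        same = labels == own
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = dists[same].mean()
        b = min(
            dists[labels == other].mean()
            for other in np.unique(labels)
            if other != own
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Hypothetical data: two well-separated groups on a line
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

score = silhouette(points, labels)
print(round(score, 4))
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.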

***Building 6 clusters using the k-means clustering algorithm:***



In [None]:
# Clustering the data into 6 clusters
kmeans = KMeans(n_clusters=6,init='k-means++',random_state=33)
kmeans.fit(x_pca)

In [None]:
# Evaluation metrics - distortion, Silhouette score
kmeans_distortion = kmeans.inertia_
kmeans_silhouette_score = silhouette_score(x_pca, kmeans.labels_)

print((kmeans_distortion,kmeans_silhouette_score))

In [None]:
# Adding a kmeans cluster number attribute
df1['kmeans_cluster'] = kmeans.labels_

## **Number of movies and tv shows in each cluster**

In [None]:
plt.figure(figsize=(10,5))
q = sns.countplot(x='kmeans_cluster',data=df1, hue='type')
plt.title('Number of movies and TV shows in each cluster - Kmeans Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')


**Successfully built 6 clusters using the k-means clustering algorithm.**





# **Building wordclouds for different clusters built**


In [None]:
# Building a wordcloud from the descriptions of the shows in one k-means cluster
def kmeans_worldcloud(cluster_num):
  stopwords = set(STOPWORDS)

  # join the lowercased descriptions of every show in the cluster
  descriptions = df1[df1['kmeans_cluster']==cluster_num].description.values
  comment_words = " ".join(str(val).lower() for val in descriptions)

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)


  # plot the WordCloud image
  plt.figure(figsize = (10,5), facecolor = None)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)

In [None]:
# Wordcloud for cluster 0
kmeans_worldcloud(0)

Keywords observed in cluster 0: documentary, world, life, history, film, year, family, follow.



In [None]:
# Wordcloud for cluster 1
kmeans_worldcloud(1)

Keywords observed in cluster 1: world, adventure, help, new, life, take, family, must, friend.

In [None]:
# Wordcloud for cluster 2
kmeans_worldcloud(2)

Keywords observed in cluster 2: man, find, life, love, family, woman, young.

In [None]:
# Wordcloud for cluster 4
kmeans_worldcloud(4)

Keywords observed in cluster 4: comedy, special, stand, comic, life, show, take, stage, share.

In [None]:
# Wordcloud for cluster 5
kmeans_worldcloud(5)

Keywords observed in cluster 5:find, family, new, young, help, two, must, take.
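
The keywords listed for each cluster were read off the word clouds by eye. As a rough cross-check, the most frequent words in each cluster can also be counted directly. The following is a minimal sketch on a tiny made-up DataFrame. The helper name `top_keywords`, the toy rows, and the use of scikit-learn's `ENGLISH_STOP_WORDS` list (instead of the `STOPWORDS` set used for the word clouds) are assumptions made for illustration only.

```python
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Toy stand-in for df1 (hypothetical rows, not taken from the Netflix dataset)
toy_df = pd.DataFrame({
    'description': [
        'A documentary about the history of the world',
        'This documentary follows a family through history',
        'A comedy special with a stand-up comic on stage',
        'The comic shares stories in a new comedy special',
    ],
    'kmeans_cluster': [0, 0, 1, 1],
})

def top_keywords(df, cluster_col, cluster_num, n=3):
    """Return the n most frequent non-stopword tokens in one cluster."""
    counts = Counter()
    for text in df.loc[df[cluster_col] == cluster_num, 'description']:
        tokens = str(text).lower().split()
        counts.update(t for t in tokens if t not in ENGLISH_STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

print(top_keywords(toy_df, 'kmeans_cluster', 0, n=2))
print(top_keywords(toy_df, 'kmeans_cluster', 1, n=3))
```

On the real data, the same helper could be called with `df1` and either `'kmeans_cluster'` or `'hierarchical_cluster'` as the cluster column.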

## **Hierarchical clustering**


Building clusters using the agglomerative (hierarchical) clustering algorithm.

Visualizing the dendrogram to decide on the optimal number of clusters for the agglomerative (hierarchical) clustering algorithm:

In [None]:
# Building a dendrogram to decide on the number of clusters
plt.figure(figsize=(10, 7))
dend = shc.dendrogram(shc.linkage(x_pca, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Netflix Shows')
plt.ylabel('Distance')
plt.axhline(y= 3.8, color='r', linestyle='--')

***At a distance of 3.8 units, 12 clusters can be built using the agglomerative clustering algorithm.***
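
Counting the branches that a horizontal line crosses in a dense dendrogram is error-prone, so the count can also be obtained programmatically. The sketch below uses SciPy's `fcluster` with `criterion='distance'` on synthetic points. The data, the threshold of 5.0, and the helper name `clusters_at_distance` are assumptions for illustration and are unrelated to the 3.8 cut-off chosen for `x_pca`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for x_pca: three tight, well-separated groups of points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_points = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Same Ward linkage as used for the dendrogram
Z = linkage(toy_points, method='ward')

def clusters_at_distance(linkage_matrix, threshold):
    """Count the flat clusters obtained by cutting the tree at `threshold`."""
    labels = fcluster(linkage_matrix, t=threshold, criterion='distance')
    return len(np.unique(labels))

print(clusters_at_distance(Z, threshold=5.0))
```

Applied to the notebook's data, the equivalent call would pass the linkage matrix computed from `x_pca` together with the chosen cut-off distance.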


## **Building 12 clusters using the Agglomerative clustering algorithm:**



In [None]:
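# Fitting hierarchical clustering model
# Ward linkage only supports Euclidean distance, so the `affinity` argument
# (renamed `metric` in newer scikit-learn releases) is omitted
hierarchical = AgglomerativeClustering(n_clusters=12, linkage='ward')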
hierarchical.fit_predict(x_pca)

In [None]:
# Adding a hierarchical cluster number attribute
df1['hierarchical_cluster'] = hierarchical.labels_

In [None]:
# Number of movies and tv shows in each cluster
plt.figure(figsize=(10,5))
q = sns.countplot(x='hierarchical_cluster',data=df1, hue='type')
plt.title('Number of movies and tv shows in each cluster - Hierarchical Clustering')
for i in q.patches:
  q.annotate(format(i.get_height(), '.0f'), (i.get_x() + i.get_width() / 2., i.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')



Successfully built 12 clusters using the Agglomerative (hierarchical) clustering algorithm.

In [None]:
# Building a wordcloud for the movie descriptions
def hierarchical_worldcloud(cluster_num):
  comment_words = ''
  stopwords = set(STOPWORDS)

  # iterate through the descriptions of the shows in this cluster
  for val in df1[df1['hierarchical_cluster']==cluster_num].description.values:

      # typecast each value to string
      val = str(val)

      # split the value
      tokens = val.split()

      # Converts each token into lowercase
      for i in range(len(tokens)):
          tokens[i] = tokens[i].lower()

      comment_words += " ".join(tokens)+" "

  wordcloud = WordCloud(width = 700, height = 700,
                  background_color ='white',
                  stopwords = stopwords,
                  min_font_size = 10).generate(comment_words)


  # plot the WordCloud image
  plt.figure(figsize = (10,5), facecolor = None)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)

In [None]:
# Wordcloud for cluster 0
hierarchical_worldcloud(0)

Keywords observed in cluster 0:

In [None]:
# Wordcloud for cluster 1
hierarchical_worldcloud(1)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 2
hierarchical_worldcloud(2)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 3
hierarchical_worldcloud(3)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 4
hierarchical_worldcloud(4)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 5
hierarchical_worldcloud(5)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 6
hierarchical_worldcloud(6)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 7
hierarchical_worldcloud(7)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 8
hierarchical_worldcloud(8)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 9
hierarchical_worldcloud(9)

In [None]:
# Wordcloud for cluster 10
hierarchical_worldcloud(10)

Keywords observed in cluster

In [None]:
# Wordcloud for cluster 11
hierarchical_worldcloud(11)

Keywords observed in cluster



# **Content based recommender system**



We can build a simple content-based recommender system from the similarity between shows. If a person has watched a show on Netflix, the system should recommend a list of similar shows that they are likely to enjoy. To score the similarity of two shows, we use cosine similarity: the dot product of the two vectors (A and B) divided by the product of their magnitudes (the formula is given in the cosine similarity section below). The cosine similarity score of two vectors increases as the angle between them decreases.

In [None]:
# Defining a new df for building a recommender system
# rec_df.csv is a saved copy of df1
data2='/content/drive/My Drive/rec_df.csv'

#Read the data
df3=pd.read_csv(data2)
recommender_df=df3.copy()
recommender_df

In [None]:
# Overwriting show_id with the integer row index (the title becomes the index in a later cell)
recommender_df['show_id'] = recommender_df.index

In [None]:
# converting tokens to string
def convert(lst):
  return ' '.join(lst)

In [None]:
#recommender_df['clustering_attributes'] = recommender_df['clustering_attributes'].apply(lambda x: convert(x))


In [None]:
recommender_df

In [None]:
# Setting the title of each movie/TV show as the index
recommender_df.set_index('title',inplace=True)


In [None]:
recommender_df

**Count Vectorizer**

CountVectorizer is a tool in the scikit-learn library used to convert a collection of text documents into a matrix of token counts. It is a popular method for converting textual data into a format that machine learning algorithms can process.


In [None]:
# Count vectorizer
CV = CountVectorizer()
converted_matrix = CV.fit_transform(recommender_df['clustering_attributes'])


**Cosine similarity:**
It is a metric used to measure how similar two vectors are by calculating the cosine of the angle between them. It is widely used in text analysis, recommendation systems, and information retrieval to compare documents or items based on their features.

Formula:
The formula for cosine similarity between two vectors A and B is:

$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}$$

Where:

$A \cdot B$ is the dot product of vectors A and B.

$\|A\|$ and $\|B\|$ are the magnitudes (or lengths) of vectors A and B.

1 = most similar (the vectors point in the same direction)

0 = not similar (the vectors share no tokens)
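
As a worked illustration of the definition, the sketch below computes the score by hand with NumPy and compares it with scikit-learn's pairwise function. The three count vectors and the helper name `cosine_sim_manual` are made up for this example. The scikit-learn function is imported under the alias `sk_cosine_similarity` only to keep the names distinct here.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

def cosine_sim_manual(a, b):
    """Cosine similarity from the definition: dot product over product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up token-count vectors for three shows
show_a = [1, 1, 0, 2]
show_b = [2, 2, 0, 4]   # same direction as show_a, so the similarity is 1
show_c = [0, 0, 3, 0]   # shares no tokens with show_a, so the similarity is 0

print(cosine_sim_manual(show_a, show_b))
print(cosine_sim_manual(show_a, show_c))

# The scikit-learn function returns the full pairwise similarity matrix
pairwise = sk_cosine_similarity([show_a, show_b, show_c])
print(pairwise)
```

Because token counts are never negative, the scores for count vectors always fall between 0 and 1.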

In [None]:
# Cosine similarity (stored under a new name so the imported cosine_similarity function is not overwritten)
cosine_sim_matrix = cosine_similarity(converted_matrix)


In [None]:
cosine_sim_matrix.shape


In [None]:
# Developing a function to get 10 recommendations for a show
indices = pd.Series(recommender_df.index)

def recommend_10(title, cosine_sim = cosine_sim_matrix):
  try:
    # row position of the requested title; raises IndexError if the title is not in the data
    idx = indices[indices == title].index[0]
  except IndexError:
    return 'Invalid Entry'

  # similarity of every show to the requested one, highest first
  series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
  # skip the first entry (normally the show itself) and keep the next 10
  top10 = list(series.iloc[1:11].index)
  # list with the titles of the best 10 matching shows
  recommend_content = [recommender_df.index[i] for i in top10]
  print("If you liked '"+title+"', you may also enjoy:\n")
  return recommend_content

In [None]:
# Recommendations for 'A Man Called God'
recommend_10('A Man Called God')


In [None]:
# Recommendations for 'Stranger Things'
recommend_10('Stranger Things')


In [None]:
# Recommendations for 'Peaky Blinders'
recommend_10('Peaky Blinders')

In [None]:
# Recommendations for 'Lucifer'
recommend_10('Lucifer')

In [None]:
# Recommendations for 'xyz' (a title that is not in the dataset)
recommend_10('xyz')

The entry is invalid because no show titled 'xyz' exists in the dataset.


# **Conclusions**

In this project, we worked on a text clustering problem in which we had to group the Netflix shows into clusters such that the shows within a cluster are similar to each other and the shows in different clusters are dissimilar.

The dataset contained 7787 records and 11 attributes. We began by handling the dataset's missing values and performing exploratory data analysis (EDA).

It was found that Netflix hosts more movies than TV shows on its platform, and the total number of shows added to Netflix is growing exponentially. Also, the majority of the shows were produced in the United States, and most shows on Netflix are aimed at adult and young-adult audiences.

It was decided to cluster the data based on the attributes: director, cast, country, genre, and description. The values in these attributes were tokenized, preprocessed, and then vectorized using a TF-IDF vectorizer.

Through TF-IDF vectorization, we created a total of 20000 attributes. We used Principal Component Analysis (PCA) to reduce the number of dimensions and mitigate the curse of dimensionality. 4000 components captured more than 80% of the variance, so the number of components was restricted to 4000.
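
The component count for a given variance target can be read from the cumulative explained variance ratio. Below is a minimal sketch on random synthetic data. The array `toy_features` and the name `n_components_80` are assumptions for illustration and do not reproduce the 20000-attribute matrix or the 4000-component result.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the TF-IDF matrix (random data, illustration only)
rng = np.random.default_rng(42)
toy_features = rng.standard_normal((200, 50)) * np.linspace(5.0, 0.1, 50)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA(random_state=42)
pca_full.fit(toy_features)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%
n_components_80 = int(np.argmax(cumulative_variance >= 0.80)) + 1
print(n_components_80, cumulative_variance[n_components_80 - 1])
```

scikit-learn's `PCA` also accepts a fraction such as `n_components=0.80` together with `svd_solver='full'`, which selects the component count from the variance target directly.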

We first built clusters using the k-means clustering algorithm, and the optimal number of clusters came out to be 6. This was obtained through the elbow method and Silhouette score analysis.
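
The elbow method and silhouette analysis both amount to fitting k-means for a range of k values and comparing a score. The sketch below shows the loop on synthetic blobs. The data, the range of k, and the variable names are assumptions for illustration, so the selected k here says nothing about the 6 clusters found for the Netflix data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for x_pca: three well-separated blobs
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

inertias = {}      # within-cluster sum of squares, plotted for the elbow method
silhouettes = {}   # mean silhouette coefficient, higher is better

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(toy_X, model.labels_)

# The k with the highest mean silhouette coefficient
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Plotting `inertias` against k gives the elbow curve, while `silhouettes` gives a single number per k that can be compared directly.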

Then clusters were built using the Agglomerative clustering algorithm, and the optimal number of clusters came out to be 12. This was obtained after visualizing the dendrogram.

A content-based recommender system was built using the cosine similarity matrix. For a given show, it returns the 10 most similar shows as recommendations.

## ***Successfully completed the project.***