<a href="https://colab.research.google.com/github/sakshiharde/Netflix_Movie_And_Tvshows_RecommendationSystem/blob/main/NETFLIX_RECOMMENDATION_SYSTEM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Netflix Movies and TV shows Recommendation System



# **Project Summary -**

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

I start by understanding the dataset, then clean the data to make it analysis-ready.

Next, I explore the data to understand its behaviour.

I then prepare the dataset for clustering by removing stop words, white space, numbers, and so on, so that only the important words remain and clusters can be formed from them.

Finally, I use the silhouette method and the k-means elbow method to find the optimal number of clusters, and build a recommender system based on cosine similarity that recommends the top ten movies.

# **GitHub Link -**

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix=pd.read_csv('/content/drive/MyDrive/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
netflix.head()

In [None]:
netflix.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
netflix.shape

### Dataset Information

In [None]:
# Dataset Info
netflix.info()

In [None]:
netflix.describe()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
netflix.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
netflix.isnull().sum()

In the dataset, the director column has 2389 missing values, cast has 718, country has 507, date_added has 10, and rating has 7.
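
The raw counts above are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not the Netflix data); the same two expressions could be applied to `netflix`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "cast": ["x", "y", np.nan, "z"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Count of missing values and percentage of rows missing, per column
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

A column with a high missing percentage (such as director here) is usually a candidate for a placeholder fill, while columns with only a handful of missing rows can simply have those rows dropped.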

In [None]:
# Visualizing the missing values
netflix.isnull().sum().plot(kind='bar')


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix.columns

In [None]:
# Dataset Describe
netflix.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
netflix.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
netflix.isnull().sum()


In [None]:
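# Handling null values: assign the result back instead of calling fillna
# with inplace=True on a single column, which recent pandas versions warn about
netflix['cast'] = netflix['cast'].fillna('No cast')
netflix['country'] = netflix['country'].fillna(netflix['country'].mode()[0])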



In [None]:
netflix.dropna(subset=['date_added','rating'],inplace=True)


In [None]:
# Replace the null values in director.
netflix['director']=netflix['director'].fillna('')


In [None]:
netflix.isnull().sum()

# Exploratory Data Analysis

# 1. How many TV shows and movies are on Netflix?

In [None]:
sns.countplot(x='type',data=netflix,hue='type')
plt.show()

Netflix has 5372 movies and 2398 TV shows, so there are more movies than TV shows on the platform.

# What is the most common rating for movies and TV shows on Netflix?

In [None]:
plt.figure(figsize=(10,10))
rating_counts = netflix['rating'].value_counts()
plt.pie(rating_counts, labels=rating_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Ratings')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', data=netflix)
plt.xticks(rotation=90)
plt.show()

By analyzing these charts, you can gain insights into the content available on Netflix and the preferences of its audience. For example:


*   Content Strategy: If you find that certain rating categories are underrepresented, Netflix could consider adding more content with those ratings to cater to a wider audience.
*   Parental Controls: The distribution of ratings can inform the development of parental control features, ensuring that children are only exposed to age-appropriate content.

*   Marketing and Recommendations: Understanding the popularity of different rating categories can help Netflix tailor its marketing campaigns and recommendation algorithms to better target specific audience segments.





# How does the distribution of release years for movies compare to that of TV shows?

In [None]:
import plotly.express as px

# For movies
movie_year_counts = netflix[netflix['type'] == 'Movie']['release_year'].value_counts().reset_index()
movie_year_counts.columns = ['release_year', 'count']
fig_movie = px.bar(movie_year_counts.head(10), x='release_year', y='count',
                     title='Top 10 Most Common Release Years for Movies',
                     labels={'release_year': 'Release Year', 'count': 'Count'})
fig_movie.show()

# For TV shows
tvshow_year_counts = netflix[netflix['type'] == 'TV Show']['release_year'].value_counts().reset_index()
tvshow_year_counts.columns = ['release_year', 'count']
fig_tvshow = px.bar(tvshow_year_counts.head(10), x='release_year', y='count',
                      title='Top 10 Most Common Release Years for TV Shows',
                      labels={'release_year': 'Release Year', 'count': 'Count'})
fig_tvshow.show()

In [None]:
import plotly.express as px

# Combine movie and TV show data
movie_year_counts = netflix[netflix['type'] == 'Movie']['release_year'].value_counts().reset_index()
movie_year_counts.columns = ['release_year', 'count']
movie_year_counts['type'] = 'Movie'

tvshow_year_counts = netflix[netflix['type'] == 'TV Show']['release_year'].value_counts().reset_index()
tvshow_year_counts.columns = ['release_year', 'count']
tvshow_year_counts['type'] = 'TV Show'

combined_counts = pd.concat([movie_year_counts, tvshow_year_counts])

# Create line plot with different colors
fig = px.line(combined_counts, x='release_year', y='count', color='type',
              title='Most Common Release Years for Movies and TV Shows',
              labels={'release_year': 'Release Year', 'count': 'Count', 'type': 'Type'})
fig.show()

# Which month sees the highest number of content additions? What factors might contribute to this peak?

In [None]:
netflix

In [None]:

netflix['month'] = pd.DatetimeIndex(netflix['date_added']).month
netflix.head()

In [None]:
# Plotting the Countplot
plt.figure(figsize=(10,10))
ax=sns.countplot(x='month',data=netflix,palette='viridis')
plt.xlabel("Month")
plt.ylabel("Count")
plt.title("Count of Movies and TV Shows Added to Netflix by Month")
plt.show()

In [None]:

fig, ax = plt.subplots(figsize=(15,6))

sns.countplot(x='month', hue='type',lw=5, data=netflix, ax=ax)



*   Peak Release Months: This could indicate strategic release periods targeted at specific viewer behaviors or industry trends. For example, if you see a spike in releases during the holiday season (November/December), it could suggest Netflix capitalizes on increased viewership during those times.
*   Content Slumps: Look for any months with significantly lower content additions. These periods might be due to production cycles, industry events, or strategic decisions to focus releases elsewhere.



# Who are the most frequently appearing actors on Netflix?

In [None]:
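# Split the 'cast' column and flatten the list
all_actors = netflix['cast'].str.split(', ').explode()

# Drop missing entries and the 'No cast' placeholder used to fill nulls,
# otherwise the placeholder would rank as the most frequent "actor"
all_actors = all_actors[all_actors.notna() & (all_actors != 'No cast')]

# Count the occurrences of each actor
actor_counts = all_actors.value_counts()

# Display the top 10 actors
print(actor_counts.head(10))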

In [None]:
# Create a bar chart
plt.figure(figsize=(12, 6))
actor_counts.head(10).plot(kind='bar')
plt.title('Top 10 Actors on Netflix')
plt.xlabel('Actor')
plt.ylabel('Number of Appearances')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
from wordcloud import WordCloud

# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(actor_counts)

# Display the word cloud
plt.figure(figsize=(10, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

# Does the dominance of US-produced content reflect global viewer preferences, or is it a result of Netflix's origins and initial focus?

In [None]:
# Split, explode, and count country occurrences
country_counts = netflix['country'].str.split(', ').explode().value_counts()

# Display the top 10 countries
print(country_counts.head(10))


In [None]:

plt.figure(figsize=(12, 6))
country_counts.head(10).sort_values(ascending=True).plot(kind='barh', color=plt.cm.Paired(np.arange(10)))
plt.title('Top 10 Countries Producing Content on Netflix')
plt.ylabel('Country')
plt.xlabel('Number of Movies and TV Shows')
plt.show()

In [None]:
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99', '#c2c2f0','#ffb3e6', '#c2d6d6', '#e6b3b3', '#b3e6cc', '#ffff99']
# Create a pie chart
plt.figure(figsize=(8, 8))
country_counts.head(10).plot(kind='pie', autopct='%1.1f%%', startangle=90, colors=colors)
plt.title('Top 10 Countries Producing Content on Netflix')
plt.ylabel('')  # Hide the default y-axis label
plt.show()

The United States dominates Netflix content production with a staggering 2555 titles, followed by India with 923 and the United Kingdom with 397. Other countries contribute significantly less, indicating a concentrated production landscape led by these major players.

#### Chart - 7

In [None]:
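# Checking the distribution of movie durations (in minutes)
movie_minutes = pd.to_numeric(
    netflix.loc[netflix['type'] == 'Movie', 'duration'].str.extract(r'(\d+)', expand=False)
)
plt.figure(figsize=(10,7))
# histplot replaces the deprecated distplot
sns.histplot(movie_minutes, kde=False, color='red')
plt.title('Distribution of Movie Durations',fontweight="bold")
plt.show()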


In [None]:
tv_shows=netflix[netflix['type']=='TV Show']
movies=netflix[netflix['type']=='Movie']

In [None]:
#Checking the distribution of TV SHOWS
plt.figure(figsize=(30,6))
plt.title("Distribution of TV Shows duration",fontweight='bold')
sns.countplot(x=tv_shows['duration'],data=tv_shows,order = tv_shows['duration'].value_counts().index,palette='viridis')
plt.show()


# How can Netflix leverage this genre data to improve its recommendation algorithms and personalize user experiences?

In [None]:
# Split, explode, and count genre occurrences
genre_counts = netflix['listed_in'].str.split(', ').explode().value_counts()

# Display the top 10 genres
print(genre_counts.head(10))

In [None]:

plt.figure(figsize=(12, 6))
genre_counts.head(10).plot(kind='bar', color='skyblue')
plt.title('Top 10 Genres on Netflix')
plt.xlabel('Genre')
plt.ylabel('Number of Movies and TV Shows')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.tight_layout()
plt.show()

International Movies, Dramas, and Comedies are the most prevalent genres on Netflix, suggesting a strong viewer preference for these categories.
The popularity of international movies and TV shows suggests a strong focus on acquiring and producing content from diverse regions, catering to a global audience.

# What are the most common words used in Netflix show titles?

In [None]:
from collections import Counter

# Combine all titles into a single string
all_titles = ' '.join(netflix['title'].astype(str))

# Split the string into individual words
words = all_titles.lower().split()

# Count the frequency of each word
word_counts = Counter(words)

# Display the most common words
print(word_counts.most_common(10))

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download('punkt')
ps = PorterStemmer()

In [None]:
# Build the stopword set under its own name so the nltk `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))

# Lowercase every title and join them all into a single string
comment_words = ' '.join(netflix['title'].astype(str).str.lower()) + ' '

# Set Parameters
wordcloud = WordCloud(width = 1000, height = 500,
                background_color ='white',
                stopwords = stop_words,
                min_font_size = 10,
                max_words = 1000,
                colormap = 'gist_heat_r').generate(comment_words)

plt.figure(figsize = (6,6), facecolor = None)
plt.title('Most Used Words In Shows Title', fontsize = 15, pad=20)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

# Display Chart
plt.show()

The most common words in Netflix titles heavily emphasize personal relationships and everyday experiences, suggesting a focus on relatable and emotionally resonant stories.
Action and thriller titles frequently use words like 'devil', 'dead', and 'war', indicating a focus on high-stakes plots and suspense.

# Who are the top 10 movie and TV show actors on Netflix?

In [None]:
cast = netflix['cast'].str.split(', ', expand=True).stack()

# Count how many titles each actor appears in
cast.value_counts()


In [None]:
cast =cast[cast != 'No cast']
cast.head()

In [None]:
#visualization of top 10 actors of movie and tv show on netflix
fig,ax = plt.subplots(1,2, figsize=(14,5))

# Separate the TV show actors from the cast column
top_TVshows_actor = netflix[netflix['type']=='TV Show']['cast'].str.split(', ', expand=True).stack()
top_TVshows_actor =top_TVshows_actor[top_TVshows_actor != 'No cast']
# plotting actor who appeared in highest number of TV Show
a = top_TVshows_actor.value_counts().head(10).plot(kind='barh', ax=ax[0],color='purple')
a.set_title('Top 10 TV shows actors', size=15)

# Separate the movie actors from the cast column
top_movie_actor = netflix[netflix['type']=='Movie']['cast'].str.split(', ', expand=True).stack()
top_movie_actor =top_movie_actor[top_movie_actor != 'No cast']
# plotting actor who appeared in highest number of Movie
b = top_movie_actor.value_counts().head(10).plot(kind='barh', ax=ax[1],color='blue')
b.set_title('Top 10 Movie actors', size=15)

plt.tight_layout(pad=1.2, rect=[0, 0, 0.95, 0.95])
plt.show()

In the TV shows category, the actor with the most appearances is Takahiro Sakurai. In the movies category, it is Anupam Kher.

# Top 15 directors who directed the highest number of movies and TV shows on Netflix

In [None]:
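# Exclude the empty-string placeholder used for missing directors, then name the
# columns explicitly so the result does not depend on the pandas version
directors_list = (netflix.loc[netflix['director'] != '', 'director']
                  .value_counts().head(15)
                  .rename_axis('directors name').reset_index(name='count'))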

# Create a bar chart using Plotly
fig = px.bar(directors_list, x='directors name', y='count', text_auto=True)

# Generate a list of 15 unique color codes using seaborn
color_palette = sns.color_palette('bright', n_colors=15).as_hex()
fig.update_traces(marker_color=color_palette)

# Add a title and adjust the layout
fig.update_layout(
    title={
        'text': 'Top 15 directors with highest number of Movies and TV Shows.',
        'y': 0.95,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    autosize=False,
    width=1200,
    height=500
)

# Show the plot
fig.show()

# Correlation Heatmap

In [None]:
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}
netflix['target_ages'] = netflix['rating'].replace(ratings)

In [None]:
netflix['count'] = 1
data = netflix.groupby('country')[['count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
data = data['country']

netflix_heatmap = netflix.loc[netflix['country'].isin(data)]
netflix_heatmap = pd.crosstab(netflix_heatmap['country'],netflix_heatmap['target_ages'],normalize = "index").T
netflix_heatmap

In [None]:
#Plotting the heatmap
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

country_order2 = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain']

age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(netflix_heatmap.loc[age_order,country_order2],cmap="PRGn",square=True, linewidth=2.5,cbar=False,
            annot=True,fmt='1.0%',vmax=.6,vmin=0.05,ax=ax,annot_kws={"fontsize":12})
plt.show()

In [None]:
netflix['count'] = 1
data1 = netflix.groupby('listed_in')[[ 'count']].sum().sort_values(by='count', ascending=False).reset_index()[:10]
data1 = data1['listed_in']


In [None]:
df_heatmap1 = netflix.loc[netflix['listed_in'].isin(data1)]
df_heatmap1 = pd.crosstab(df_heatmap1['listed_in'],df_heatmap1['target_ages'],normalize = "index").T
df_heatmap1

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

top=['Documentaries', 'Stand-Up Comedy', 'Dramas, International Movies',
       'Comedies, Dramas, International Movies',
       'Dramas, Independent Movies, International Movies', "Kids' TV",
       'Children & Family Movies', 'Documentaries, International Movies',
       'Children & Family Movies, Comedies',
       'Comedies, International Movies']
age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

sns.heatmap(data=df_heatmap1.loc[age_order, top],
            cmap='YlGnBu',
            square=True,
            linewidth=2.5,
            cbar=False,
            annot=True,
            fmt='1.0%',
            vmax=.6,
            vmin=0.05,
            ax=ax,
            annot_kws={"fontsize": 12})
plt.show()

These visualisations show the content's country of origin, covering both movies and TV shows. The US and India top the list of nations. A few countries, including Australia, Taiwan, and Brazil, produce little Netflix content.

From the heatmap, the US and UK have very similar target age group distributions, while both differ greatly from countries such as India or Japan.
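
The heatmaps above rely on `pd.crosstab(..., normalize="index")`, which turns raw counts into per-row shares so that countries with very different catalogue sizes can be compared. The following is a minimal sketch on hypothetical toy data (the rows in `toy` are invented for illustration only).

```python
import pandas as pd

# Hypothetical toy data: one row per title
toy = pd.DataFrame({
    "country": ["United States", "United States",
                "India", "India", "India", "India"],
    "target_ages": ["Adults", "Kids", "Teens", "Teens", "Adults", "Teens"],
})

# normalize="index" divides each country's counts by that country's total,
# so every country sums to 1 across the age groups; .T puts countries in columns
shares = pd.crosstab(toy["country"], toy["target_ages"], normalize="index").T

print(shares)
```

In this toy example 75% of the titles for India target teens, while the titles for the United States are split evenly between adults and kids. Because each column sums to 1, the colours in the heatmap reflect the mix within a country rather than the size of its catalogue.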

#  Pair Plot

In [None]:


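# Select relevant numeric columns for the pair plot.
# 'duration' is stored as text (for example "90 min"), so extract the number first.
# Note that the unit differs between movies (minutes) and TV shows (seasons).
pairplot_df = netflix[['release_year']].copy()
pairplot_df['duration'] = pd.to_numeric(netflix['duration'].str.extract(r'(\d+)', expand=False))

# Create the pair plot
sns.pairplot(pairplot_df, diag_kind='kde')
plt.suptitle("Pair Plot of Selected Netflix Features", y=1.02)
plt.show()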

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

In [None]:
# Make a copy of the cleaned DataFrame
netflix_hypothesis=netflix.copy()
# First rows of the copy
netflix_hypothesis.head()

In [None]:
# Keep only movies by filtering on the 'type' column
netflix_hypothesis = netflix_hypothesis[netflix_hypothesis["type"] == "Movie"]

In [None]:
#with respect to each ratings assigning it into group of categories
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

netflix_hypothesis['target_ages'] = netflix_hypothesis['rating'].replace(ratings_ages)
#let's see unique target ages
netflix_hypothesis['target_ages'].unique()


In [None]:
netflix_hypothesis['target_ages'] = pd.Categorical(netflix_hypothesis['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])

netflix_hypothesis['duration'] = netflix_hypothesis['duration'].astype(str)  # Convert to string type
netflix_hypothesis['duration'] = netflix_hypothesis['duration'].str.extract(r'(\d+)', expand=False)
netflix_hypothesis['duration'] = pd.to_numeric(netflix_hypothesis['duration'])

netflix_hypothesis.head(3)



In [None]:
#group_by duration and target_ages
group_by_= netflix_hypothesis[['duration','target_ages']].groupby(by='target_ages')
#mean of group_by variable
group=group_by_.mean().reset_index()
group


In [None]:
# Split the durations into the two groups being compared
# (group_by_ was built in the previous cell; 'duration' is already numeric)
A = group_by_.get_group('Kids')['duration']
B = group_by_.get_group('Older Kids')['duration']

# Mean and standard deviation for the Kids and Older Kids groups
M1 = A.mean()
S1 = A.std()

M2 = B.mean()
S2 = B.std()

print('Mean for movies rated for Kids {} \n Mean for movies rated for Older Kids {}'.format(M1,M2))
print('Std for movies rated for Kids {} \n Std for movies rated for Older Kids {}'.format(S1,S2))

In [None]:
#import stats
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

# Pooled variance: each group's variance is weighted by its own degrees of freedom
sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val)

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

The t-value does not fall within the range bounded by the two critical values above, so the null hypothesis is rejected.

As a result, movies rated for kids and older kids are not at least two hours long.
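
The manual calculation above is a pooled two-sample t-test, in which each group's variance is weighted by that group's own degrees of freedom. As a cross-check, the same statistic can be compared with `scipy.stats.ttest_ind`. The following is a minimal sketch on synthetic data: the arrays `a` and `b` and their parameters are hypothetical stand-ins for the two duration groups, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two duration groups (hypothetical values)
rng = np.random.default_rng(0)
a = rng.normal(loc=70, scale=20, size=200)
b = rng.normal(loc=95, scale=25, size=300)

n1, n2 = len(a), len(b)
s1, s2 = a.std(ddof=1), b.std(ddof=1)  # sample std, as pandas .std() computes it
dof = n1 + n2 - 2

# Pooled standard deviation and t statistic
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / dof)
t_manual = (a.mean() - b.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))

# Library equivalent: equal_var=True selects the pooled-variance test
t_scipy, p_value = stats.ttest_ind(a, b, equal_var=True)

# Two-sided decision at the 5% level using the critical values
lower, upper = stats.t.ppf(0.025, dof), stats.t.ppf(0.975, dof)
reject = not (lower <= t_manual <= upper)

print(t_manual, t_scipy, p_value, reject)
```

Rejecting when the statistic falls outside the two critical values is equivalent to rejecting when the two-sided p-value is below 0.05, so either form of the decision rule can be reported.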

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H1: Titles with a duration of more than 90 minutes are movies.

H0: Titles with a duration of more than 90 minutes are not movies.

#### 2. Perform an appropriate statistical test.

In [None]:
# Make a fresh copy of the cleaned DataFrame
netflix_hypothesis=netflix.copy()
# First rows of the copy
netflix_hypothesis.head()

In [None]:
netflix_hypothesis['duration'] = netflix_hypothesis['duration'].str.extract(r'(\d+)', expand=False)
netflix_hypothesis['duration'] = pd.to_numeric(netflix_hypothesis['duration'])

In [None]:
netflix_hypothesis['type'] = pd.Categorical(netflix_hypothesis['type'], categories=['Movie','TV Show'])
# 'duration' was already reduced to its numeric part in the previous cell
netflix_hypothesis.head(3)


In [None]:
# Perform Statistical Test to obtain P-Value
#group_by duration and TYPE
group_by_= netflix_hypothesis[['duration','type']].groupby(by='type')
#mean of group_by variable
group1=group_by_.mean().reset_index()
group1

In [None]:
#In A and B variable grouping values
A= group_by_.get_group('Movie')['duration'] # Select only the 'duration' column
B= group_by_.get_group('TV Show')['duration'] # Select only the 'duration' column

#mean and std
M1 = A.mean()
S1 = A.std()

M2= B.mean()
S2 = B.std()

print('Mean of Movie durations: {}'.format(M1)) # Format the output for clarity
print('Mean of TV Show durations: {}'.format(M2))
print('Std of Movie durations: {}'.format(S1))
print('Std of TV Show durations: {}'.format(S2))

In [None]:
#import stats
from scipy import stats
#length of groups and DOF
n1 = len(A)
n2= len(B)
print(n1,n2)

dof = n1+n2-2
print('dof',dof)

# Pooled variance: each group's variance is weighted by its own degrees of freedom
sp_2 = ((n1-1)*S1**2  + (n2-1)*S2**2) / dof
print('SP_2 =',sp_2)

sp = np.sqrt(sp_2)
print('SP',sp)

#tvalue
t_val = (M1-M2)/(sp * np.sqrt(1/n1 + 1/n2))
print('tvalue',t_val)

t-distribution
     

In [None]:
#t-distribution
stats.t.ppf(0.025,dof)

In [None]:
#t-distribution
stats.t.ppf(0.975,dof)

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Combining all the clustering attributes into a single column
netflix['clustering'] = (netflix['director'] + ' ' + netflix['cast'] +' ' +
                                 netflix['country'] +' ' + netflix['listed_in'] +
                                 ' ' + netflix['description'])

In [None]:
netflix['clustering'][25]

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Lower Casing
# Remove Punctuations
# Remove URLs & Remove words and digits contain digits
# Remove Stopwords
# Remove White spaces
# Rephrase Text
# Tokenization
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from PIL import Image

nltk.download('all',quiet=True)

# Build these once instead of on every call to transform_text
stopwords_set = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def transform_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Tokenize text into words
    words = nltk.word_tokenize(text)

    # Keep only alphanumeric tokens (this also drops punctuation) that are not stopwords
    words = [word for word in words if word.isalnum() and word not in stopwords_set]

    # Lemmatize words and join them back into a string
    return ' '.join(lemmatizer.lemmatize(word) for word in words)

In [None]:
netflix['Clean_Text'] = netflix['clustering'].apply(transform_text)


In [None]:
netflix["Clean_Text"][50]

#### 10. Text Vectorization

TF-IDF combines two metrics: Term frequency (TF) and inverse document frequency (IDF).

Term Frequency (TF): This metric measures the frequency of a term in a document. It assumes that the more often a term appears in a document, the more relevant it is to that document. It is calculated using the formula:

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

Inverse Document Frequency (IDF): This metric measures the importance of a term across a collection of documents. It gives higher weight to terms that appear less frequently in the entire collection. It is calculated using the formula:

IDF(t) = log_e(Total number of documents / Number of documents containing term t)
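
The two formulas above can be worked through by hand on a tiny corpus. The following is a minimal sketch: the three documents and the helper names `tf`, `idf`, and `tf_idf` are hypothetical and exist only for illustration, and the helpers assume the term occurs in at least one document.

```python
import math
from collections import Counter

# Hypothetical three-document corpus
docs = [
    "crime drama detective",
    "romantic comedy drama",
    "stand up comedy special",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Share of the document's terms that are `term`
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # Natural log of (number of documents / documents containing `term`)
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "drama" appears in two documents, "detective" in only one
drama_score = tf_idf("drama", tokenized[0], tokenized)
detective_score = tf_idf("detective", tokenized[0], tokenized)

print(drama_score, detective_score)
```

Both terms have the same term frequency in the first document, but "detective" scores higher because it is rarer across the corpus. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values differ from this textbook formula even though the ranking logic is the same.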

In [None]:
bag_of_words = netflix.Clean_Text



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the cleaned text, keeping at most 20000 features
t_vectorizer = TfidfVectorizer(max_features=20000)
X = t_vectorizer.fit_transform(bag_of_words)

In [None]:
print(X.shape)

In [None]:
t_vectorizer.get_feature_names_out()

### 7. Dimensionality Reduction

PCA is used to reduce the dimensionality of the dataset. It identifies the directions (principal components) along which the data varies the most. These components are ordered by the amount of variance they explain in the data.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
transformer = PCA()
transformer.fit(X.toarray())

In [None]:
from sklearn.decomposition import IncrementalPCA
n_batches = 10  # Adjust based on your dataset size
inc_pca = IncrementalPCA()
for X_batch in np.array_split(X.toarray(), n_batches):
    inc_pca.partial_fit(X_batch)
X_transformed = inc_pca.transform(X.toarray())

In [None]:
# Plot cumulative explained variance against the number of components
# to decide how many components to keep.
import matplotlib.pyplot as plt
plt.figure(figsize=(15,5), dpi=120)
plt.plot(np.cumsum(inc_pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axhline(y=0.95, color='r', linestyle='--',linewidth=2, label='95% Explained Variance')
plt.grid()
plt.show()

The plot helps in determining the number of components to consider for dimensionality reduction. You can select the number of components where the cumulative explained variance reaches a satisfactory threshold, such as 95%. The point where the curve intersects or is closest to the threshold line can guide you in choosing the appropriate number of components for your analysis.
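
Instead of reading the crossing point off the plot, the component count can be computed from the cumulative explained variance. The following is a minimal sketch on synthetic data: the array `data` and its per-feature scales are hypothetical, and with the notebook's own fitted model the same two lines would use its `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 10 features with decreasing variance
rng = np.random.default_rng(0)
scales = np.array([10, 8, 6, 4, 2, 1, 1, 0.5, 0.5, 0.1])
data = rng.normal(size=(200, 10)) * scales

pca = PCA().fit(data)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first component at which the cumulative ratio reaches 95%,
# converted to a count by adding 1
n_components_95 = int(np.argmax(cumulative >= 0.95) + 1)

print(n_components_95, cumulative[n_components_95 - 1])
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make this selection internally, which is a convenient alternative once the threshold has been chosen.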

In [None]:
# Import the necessary libraries
from sklearn.decomposition import PCA
# Create an instance of PCA with the desired explained variance ratio
pca_tuned = PCA(n_components=0.95)
# Fit the PCA model on the input data, X, which is converted to a dense array
pca_tuned.fit(X.toarray())
# Transform the input data, X, to its reduced dimensional representation
X_transformed = pca_tuned.transform(X.toarray())
# Print the shape of the transformed data to see the number of samples and transformed features
print(X_transformed.shape)


In [None]:
X_transformed

## ***7. ML Model Implementation***

In [None]:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.utils import resample

# Sample the data (adjust n_samples as needed)
X_sample = resample(X_transformed, n_samples=1000, random_state=5)

# Initialize KMeans with a fixed random state for reproducibility
model = KMeans(random_state=5)

# Narrow down the k range
visualizer = KElbowVisualizer(model, k=(8, 12), metric='silhouette', timings=False, locate_elbow=True)

# Fit on the sampled data
visualizer.fit(X_sample)
visualizer.show()

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.metrics import silhouette_score, silhouette_samples

def silhouette_score_analysis(n):

  for n_clusters in range(2,n):
      km = KMeans (n_clusters=n_clusters, random_state=5)
      preds = km.fit_predict(X_transformed)

      score = silhouette_score(X_transformed, preds, metric='euclidean')
      print ("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

      visualizer = SilhouetteVisualizer(km)

      visualizer.fit(X_transformed) # Fit the training data to the visualizer
      visualizer.show() # Draw and display the plot (poof() is the deprecated name)


In [None]:
silhouette_score_analysis(15)

In [None]:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create a figure with a specific size and resolution
plt.figure(figsize=(10, 6), dpi=120)

# Initialize an empty list to store the within-cluster sum of squares (WCSS)
wcss = []

# Iterate over different numbers of clusters
for i in range(1, 22):
    # Initialize the KMeans algorithm with specific parameters
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)

    # Fit the KMeans algorithm to the transformed data
    kmeans.fit(X_transformed)

    # Append the WCSS to the list
    wcss.append(kmeans.inertia_)

# Plot the number of clusters against the WCSS
plt.plot(range(1, 22), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np

# Create a figure with a larger size and resolution
plt.figure(figsize=(20, 8), dpi=120)

# Initialize a KMeans model with 15 clusters
kmeans = KMeans(n_clusters=15, init='k-means++', random_state=9)

# Fit the KMeans algorithm to the transformed data and get the cluster labels in one step
label = kmeans.fit_predict(X_transformed)

# Get unique labels from the predictions
unique_labels = np.unique(label)

# Plot the results
for i in unique_labels:
    # Scatter plot the points belonging to each cluster
    plt.scatter(X_transformed[label == i, 0], X_transformed[label == i, 1], label=i)

# Display a legend to identify the clusters
plt.legend()
plt.show()

In [None]:
netflix['cluster_number'] = kmeans.labels_

In [None]:
netflix.head(1)

In [None]:
# Count the number of movies or TV shows in each cluster
cluster_content_count = (
    netflix['cluster_number']
    .value_counts()
    .rename_axis('clusters')
    .reset_index(name='Movies/TV_Shows')
)

# Print the cluster content count
print(cluster_content_count)


In [None]:
# Word cloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def word_count(category):
    """Draw one word cloud per text column for the given cluster number."""
    print("Exploring Cluster", category)
    col_names = ['type', 'title', 'country', 'rating', 'listed_in', 'description']
    # Create the stop word list once; it is the same for every column
    stopwords = set(STOPWORDS)
    for i in col_names:
        df_word_cloud = netflix[['cluster_number', i]].dropna()
        df_word_cloud = df_word_cloud[df_word_cloud['cluster_number'] == category]
        text = " ".join(df_word_cloud[i].astype(str))

        # Generate a word cloud image
        wordcloud = WordCloud(stopwords=stopwords, background_color="#FFC0CB", width=500, height=500).generate(text)

        # Display the generated image with matplotlib
        print("Looking for insights from", i, "Movies/TV Shows")
        plt.figure(figsize=(10, 10))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()


In [None]:
word_count(9)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Build a TF-IDF vectorizer that ignores English stop words
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
netflix['description'] = netflix['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(netflix['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape


In [None]:
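# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
# Compute the cosine similarity matrix. TfidfVectorizer L2-normalises each row by
# default, so the plain dot product from linear_kernel equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)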

In [None]:
cosine_sim
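
The similarity matrix above is computed with `linear_kernel`, which is a plain dot product. A small sketch, using hypothetical term weights rather than real TF-IDF values, shows why that gives the same result as cosine similarity once every vector has been scaled to unit length. The helper names are assumptions made for this illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity_manual(u, v):
    """Cosine similarity from its definition: u.v / (|u| |v|)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical raw term weights for two documents
doc_a = [3.0, 0.0, 4.0]
doc_b = [1.0, 2.0, 2.0]

cos_raw = cosine_similarity_manual(doc_a, doc_b)
dot_normalised = dot(l2_normalise(doc_a), l2_normalise(doc_b))

print(cos_raw, dot_normalised)
```

Both numbers agree up to floating-point rounding, which is why the cheaper dot product is sufficient when the rows are already normalised.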

In [None]:
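indices = pd.Series(range(len(netflix)), index=netflix['title'])  # title -> row position in cosine_sim
indices = indices[~indices.index.duplicated(keep='first')]  # keep the first row for repeated titles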


In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    """Return the 10 titles whose descriptions are most similar to `title`."""
    idx = indices[title]

    # Get the pairwise similarity scores of all titles with the given title
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the titles by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (normally the title itself) and keep the next 10
    sim_scores = sim_scores[1:11]

    # Get the row positions of those titles
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar titles
    return netflix['title'].iloc[movie_indices]




In [None]:
netflix['title'][1:70]


In [None]:
get_recommendations('14 Cameras', cosine_sim)
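
To make the mechanics of the recommender above concrete, here is a from-scratch sketch of the same idea on a hypothetical four-title catalogue: build TF-IDF vectors from the descriptions, compare them by cosine similarity, and return the closest titles. It is an illustration under stated assumptions, not the notebook's pipeline: the titles and descriptions are invented, tokenisation is a plain whitespace split with no stop-word removal, and the weighting follows the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalisation.

```python
import math
from collections import Counter

# Hypothetical toy catalogue (not taken from the Netflix dataset)
catalogue = {
    "Space Rescue": "astronaut crew stranded in space attempts daring rescue",
    "Orbit": "astronaut alone in space fights to survive",
    "Baking Battle": "home bakers compete in a baking contest",
    "Cake Wars": "bakers compete to build the best cake",
}


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` (illustrative helper)."""
    if title not in catalogue:
        raise KeyError(f"Unknown title: {title!r}")
    titles = list(catalogue)
    vectors = tfidf_vectors(list(catalogue.values()))
    idx = titles.index(title)
    # Score every other title against the query title
    scores = [(cosine(vectors[idx], vec), other)
              for i, (other, vec) in enumerate(zip(titles, vectors))
              if i != idx]
    scores.sort(reverse=True)
    return [other for _, other in scores[:top_n]]


print(recommend("Space Rescue", top_n=1))
print(recommend("Cake Wars", top_n=1))
```

Unlike `get_recommendations`, this sketch excludes the query title explicitly instead of relying on it sorting first, and it raises a clear error for titles that are not in the catalogue.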

# **Conclusion**



1.   The majority of the content available on Netflix consists of movies. However, in recent years the platform has been focusing more on TV shows.
2.   Most of these shows are released either at the end or at the beginning of the year.
3.   The United States and India are among the top five content-producing countries on the platform. Additionally, six of the top ten actors with the most content are from India.
4.   When it comes to content ratings, TV-MA is the most common, indicating that content for mature audiences makes up the largest share of the catalogue.
5.   The value k=15 was found to be optimal for clustering the data, and it was used to group the content into 15 distinct clusters.
6.   Using this data, a content-based recommender system was created using cosine similarity, which provided recommendations for movies and TV shows.







### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***