## Introduction

### Objective 
- Take a closer look at the structure of the Netflix data and the relationships between its features, in order to construct a recommendation system

### Key Goals
- Identify any trends or interesting relationships between features
- Construct a reliable recommendation system

### Approach 
- Conduct Exploratory Analysis
- Identify Correlations (if any)
- Construct a recommendation system based on content-based filtering.




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#import Data
df = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')

## EDA
- First look at the dataset
- Identify the features  
- Identify their unique counts 
- Identify any missing values


In [None]:
df.head(10)

In [None]:
print("Shape:",df.shape)
print(df.columns)

In [None]:
# Count the unique values in each column (NaN is not counted)
unique_counts = df.nunique()

print(unique_counts.to_frame(name="unique count"))



In [None]:
# missing values
print('Table of missing values: ')
print(df.isnull().sum())

### So far, the analysis shows:
- show_id is the primary key of the dataset.
- There are only two Netflix content types, whereas the other features take a wide range of values and need further analysis with graphs.
- There are a significant number of missing values in director, cast and country. This can affect the predictions of a content-based filtering recommender.
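
To make "significant" concrete, it can help to express missing values as a share of all rows and view them next to the unique counts. The sketch below uses a small made-up frame (`toy`, with invented values) in place of the real data; the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are invented for illustration)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "director": ["A", np.nan, np.nan, "B"],
    "country": ["US", "US", np.nan, "IN"],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean().mul(100).round(1)

# Combine with the unique counts for a single overview table
summary = pd.DataFrame({
    "unique count": toy.nunique(),
    "missing %": missing_pct,
})
print(summary)
```

A column whose unique count equals the number of rows and whose missing share is zero, like `show_id` here, is a reasonable primary key candidate.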

### Graphs Analysis
Next, we look at some graphs showing the distribution of each feature.

In [None]:
# Top 10 countries by number of titles
Netflix_top_country = df['country'].value_counts().head(10)
# to_frame avoids an empty result when the Series name differs across pandas versions
df2 = Netflix_top_country.to_frame(name='count')

print(df2)

# Titles released after 2010
Last_ten_years = df[df['release_year'] > 2010]
Last_ten_years.head()

In [None]:
#Look at Count of type, rating , country and top 10 country 

fig = plt.figure(figsize=(20,20))
gs = fig.add_gridspec(2,2)
gs.update(wspace = 0.3 , hspace = 0.3)

sns.set(style = "darkgrid")
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])

#Set titles
ax0.set_title("TV Show vs Movie")
ax1.set_title("Distribution of Rating")
ax2.set_title("Distribution of Top 10 Countries")
ax3.set_title("Distribution of release_year (after 2010)")

#Construct subplots 
sns.countplot(ax = ax0 , x = "type" , data = df , palette="Set2")
sns.countplot(ax = ax1 , x = "rating" ,hue = "type", data = df)
sns.countplot(ax = ax2 , x = "country" ,hue = "type", data = df, order=df.country.value_counts().iloc[:10].index)
sns.countplot(ax = ax3 , x = "release_year" ,hue = "type", data = Last_ten_years)

#Rotate the x tick labels after plotting so the category names are kept
ax1.tick_params(axis = 'x', labelrotation = 90)
ax2.tick_params(axis = 'x', labelrotation = 90)
plt.show()

### Graph analysis:
The graphs above indicate that the data is skewed in the features country, type and rating.
- There are more movies than TV shows
- Most titles are made in the United States
- Most ratings are TV-MA and TV-14
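
The skew read off the plots can also be quantified as proportions. This is a minimal sketch on a made-up frame (`toy`, invented values); on the real data the same call would be applied to `df['type']`, `df['rating']` and `df['country']`.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are invented for illustration)
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "rating": ["TV-MA", "TV-14", "TV-MA", "PG"],
})

# normalize=True returns each category's share of the rows instead of raw counts
type_share = toy["type"].value_counts(normalize=True)
rating_share = toy["rating"].value_counts(normalize=True)

print(type_share)
print(rating_share)
```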


### EDA Conclusion
There is a trend of TV shows becoming more common in recent years: more TV shows are being made, whereas the number of movies starts to decline in 2019. This makes sense: as online streaming grows, more TV shows air on platforms like Netflix instead of traditional TV. We can also see how strongly the United States dominates the catalogue, with by far the largest share of titles made there.

# Netflix Recommendation 
Here we will try to build a Netflix recommender using content-based filtering.

### Objective 
- Construct a Netflix recommendation system that can recommend similar Netflix content when given an item's details, such as its name, director and cast.

### Approach
The dataset contains no user ratings, so our recommendation approach will be based on content-based filtering. This approach uses metadata from an item's features to find similar items.
1. Data Processing 
2. Construct a Similarity Score
3. Validate Similarity Score 


### Option 1 - Description-Based Recommender
We will use the description to compute the pairwise similarity between all Netflix items and recommend titles based on that similarity score.

In [None]:
df['description'].head()

In [None]:
#Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['description'] = df['description'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape


### Compute Similarity Score
Because TfidfVectorizer L2-normalizes each row by default, calculating the dot product directly gives us the cosine similarity score.
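
As a quick sanity check of this shortcut, the sketch below compares `linear_kernel` with `cosine_similarity` on three made-up descriptions. It assumes the vectorizer keeps its default `norm="l2"` setting, as in this notebook; the `docs` list is purely illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions, for illustration only
docs = [
    "a starship crew explores deep space",
    "a young chef opens a restaurant in paris",
    "the crew of a starship faces a new enemy",
]

vec = TfidfVectorizer(stop_words="english")  # norm="l2" is the default
matrix = vec.fit_transform(docs)

# Each row has unit length, so the dot product equals the cosine similarity
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
dot = linear_kernel(matrix, matrix)
cos = cosine_similarity(matrix, matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the safer choice.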

In [None]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
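#Construct a reverse map from movie titles to row indices
indices = pd.Series(df.index, index=df['title'])
#Keep the first row for any duplicated title so a lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]
indices.head()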

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

In [None]:
get_recommendations('Star Trek')

### Validation
The recommendations above look promising: the system is able to identify all the relevant Star Trek movies.
However, beyond those it goes on to recommend unrelated movies that seem random.

Possible problems:
- The results are limited by the size of the dataset: there may be no more movies relevant to 'Star Trek'.
- The current recommendation approach is not able to find more relevant movies.

To investigate further, we will run some tests.
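
One possible test is to return the similarity score next to each recommended title, so it is visible where the scores fall off. The sketch below is an assumption about how such a check could look, not part of the original notebook: `get_recommendations_with_scores` is a hypothetical helper, and the `toy` frame with its titles and descriptions is invented so the example runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue standing in for the Netflix data (invented for illustration)
toy = pd.DataFrame({
    "title": ["Space Saga", "Space Saga II", "Paris Kitchen", "Ocean Life"],
    "description": [
        "a starship captain leads her crew across the galaxy",
        "the starship crew returns to defend the galaxy",
        "a chef opens a small restaurant in paris",
        "a documentary about whales and coral reefs",
    ],
})

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["description"])
toy_sim = linear_kernel(toy_matrix, toy_matrix)
toy_indices = pd.Series(toy.index, index=toy["title"])

def get_recommendations_with_scores(title, sim, indices, titles, top_n=10):
    """Hypothetical helper: return the top_n most similar titles with their scores."""
    idx = indices[title]
    # Drop the queried item itself by position, then rank the rest
    scores = pd.Series(sim[idx]).drop(idx).sort_values(ascending=False).head(top_n)
    return pd.DataFrame({
        "title": titles.iloc[scores.index.to_numpy()].to_numpy(),
        "score": scores.to_numpy(),
    })

result = get_recommendations_with_scores("Space Saga", toy_sim, toy_indices, toy["title"])
print(result)
```

On the toy data only the sequel shares vocabulary with the query, so every other title scores zero and its rank is arbitrary. If the real Star Trek results show a similar drop in score after the related titles, that would support the first explanation above.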

## Conclusion

Unfortunately, the given dataset offers limited means of validating our results, for example by checking for similar genres or ratings. The results show that this recommender can only identify highly similar items, such as titles in the same movie series, and fails to identify other similar movies.

This result is expected, as the system is based on the plot description of each item. Plots vary widely, and only titles in the same TV or movie series will have highly similar plots.

Overall, the system works well for what it is, but it is not a good general-purpose recommendation system for Netflix.