<a href="https://colab.research.google.com/github/shelarumesh/DA01_Hospitality_Analysis_codebasics/blob/main/Netflix_clusterung_and_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Define the Problem Statement

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes scores could also yield interesting findings.

In this project, you are required to do the following:

- Exploratory Data Analysis
- Understand what type of content is available in different countries
- Determine whether Netflix has increasingly focused on TV rather than movies in recent years
- Cluster similar content by matching text-based features
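
The last task, clustering on text-based features, can be sketched as follows. This is a minimal illustration and not the method used later in this notebook: it assumes TF-IDF vectors built from the `description` and `listed_in` columns, and the four toy rows and the choice of `k=2` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy rows standing in for the `description` and `listed_in` columns
toy = pd.DataFrame({
    "description": [
        "A detective hunts a serial killer in a dark city.",
        "Two detectives investigate a murder in a small town.",
        "A stand-up comedian jokes about family life.",
        "A comedian performs a live stand-up special.",
    ],
    "listed_in": ["Crime, Thrillers", "Crime, Dramas", "Stand-Up Comedy", "Stand-Up Comedy"],
})

# Combine the text-based features into one string per title
text = toy["description"] + " " + toy["listed_in"]

# TF-IDF turns each title into a sparse vector of weighted term frequencies
tfidf = TfidfVectorizer(stop_words="english")
features = tfidf.fit_transform(text)

# Cluster the TF-IDF vectors; k=2 is chosen only because the toy data has two themes
km = KMeans(n_clusters=2, n_init=10, random_state=42)
toy["cluster"] = km.fit_predict(features)
print(toy[["listed_in", "cluster"]])
```

On the real dataset, the number of clusters would need to be chosen with a method such as the elbow plot rather than fixed in advance.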

# Data Collection

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Dataset Loading

In [None]:
path = '/content/sample_data/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
df = pd.read_csv(path)
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.shape

# Data Preprocessing

### 3.1 Missing Values and Their Handling

In [None]:
# % of missing values in dataset
df.isnull().sum()/len(df)*100

In [None]:
# About 9.22% of the values in `cast` are null; fill them with the mode
df['cast'].value_counts()
df['cast'] = df['cast'].fillna(df['cast'].mode()[0])

In [None]:
df['director'].value_counts()
df['director'] = df['director'].fillna(df['director'].mode()[0])

In [None]:
df['country'].value_counts()
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
df['rating'].value_counts()
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
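
Mode imputation assigns every title with a missing `director` or `cast` to the single most frequent value, which inflates that value in later counts. One alternative, sketched here as an assumption rather than as this notebook's method, is to use an explicit placeholder for high-cardinality text columns and keep the mode only for low-cardinality ones such as `rating`. The toy rows and the `"Unknown"` label are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with missing values
toy = pd.DataFrame({
    "director": ["A. Director", np.nan, "B. Director", np.nan],
    "rating": ["TV-MA", "TV-MA", np.nan, "PG"],
})

# High-cardinality text column: mark missing entries explicitly
toy["director"] = toy["director"].fillna("Unknown")

# Low-cardinality categorical column: the mode is a reasonable default
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])

print(toy)
```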

In [None]:
df.head()

In [None]:
df = df.dropna()
df.shape

In [None]:
df.isnull().sum()

In [None]:
df = df.drop(columns=['title'])

### 3.2 Handling Outliers

In [None]:
df.describe()

In [None]:
def Check_categorical_wrongData(dataframe):
  """Print the number of unique values in each categorical (object) column."""
  cat = dataframe.select_dtypes(include='object')
  for col in cat.columns:
    print(col, cat[col].nunique(dropna=False))

In [None]:
Check_categorical_wrongData(df)

### Check Duplicates

In [None]:
df[df.duplicated()]

In [None]:
df = df.drop_duplicates()

In [None]:
df.shape

### Explore the date column

In [None]:
# Strip stray leading/trailing spaces so the split yields exactly month, day, year
df[['month', 'day', 'year']] = df['date_added'].str.strip().str.split(' ', expand=True).iloc[:, :3]

In [None]:
df = df.drop(columns=['date_added'])

In [None]:
df.head()
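
Splitting the date string on spaces depends on every value having exactly the same layout. As a hedged alternative, the date can be parsed with `pd.to_datetime` and the parts read from the result. The sketch assumes dates formatted like `"August 14, 2020"`, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy values; the second one has a stray leading space
toy = pd.DataFrame({"date_added": ["August 14, 2020", " December 1, 2019", "January 5, 2018"]})

# Strip whitespace, then parse "Month day, year" into datetimes
parsed = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")

# Numeric date parts come straight from the datetime accessor
toy["month"] = parsed.dt.month
toy["day"] = parsed.dt.day
toy["year"] = parsed.dt.year

print(toy)
```

Parsing this way yields numeric `month`, `day`, and `year` columns directly, so no later string cleanup or type conversion is needed.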

### Handle the duration column, which stores movie length in minutes and TV show length in seasons

In [None]:
df[['time_and_count', 'duration_type']] = df['duration'].str.split(' ', expand=True)

In [None]:
df = df.drop(columns=['duration'])
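
A regular expression is a slightly more defensive way to separate the number from the unit. The sketch below assumes values shaped like `"90 min"` or `"2 Seasons"`. The toy rows are hypothetical, and merging `Season` and `Seasons` into one label is a design choice that the notebook itself does not make.

```python
import pandas as pd

# Hypothetical toy values in the same shape as the `duration` column
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season"]})

# Capture the leading number and the unit that follows it
parts = toy["duration"].str.extract(r"(?P<time_and_count>\d+)\s+(?P<duration_type>\w+)")

# Store the number as an integer instead of a string
parts["time_and_count"] = parts["time_and_count"].astype(int)

# Collapse "Season" and "Seasons" into one category
parts["duration_type"] = parts["duration_type"].str.replace(r"Seasons?$", "season", regex=True)

print(parts)
```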

# Exploratory Data Analysis (EDA)

### Chart 1 : Count of TV shows and movies

In [None]:
count = df.groupby('type')[['director']].count()
count

In [None]:
count.index

In [None]:
y = count['director']
y

In [None]:
sns.barplot(x=y.index, y=y.values)
plt.xlabel('Type')
plt.ylabel('Number of titles')
plt.title('Count of Movies and TV Shows')
plt.show()
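
The overall count does not show whether the mix has shifted toward TV over time. A minimal sketch of that check is a cross-tabulation of `type` by the year a title was added. It assumes a numeric `year` column, and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: content type and the year each title was added
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year": [2016, 2016, 2016, 2017, 2017, 2017],
})

# Titles added per year, split by type
counts = pd.crosstab(toy["year"], toy["type"])

# Share of each year's additions that are TV shows
tv_share = counts["TV Show"] / counts.sum(axis=1)

print(counts)
print(tv_share)
```

A rising `tv_share` across years would support the claim that the catalog is shifting toward TV.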

### Chart 2 : Top 10 directors

In [None]:
director = df.groupby('director')[['year', 'rating']].count()
data = np.log(director.sort_values(by='year').tail(10)['year'])

In [None]:
# `data` holds log-transformed counts, so label the axis accordingly
sns.barplot(x=data.index, y=data.values)
plt.ylabel('log(number of titles)')
plt.title('Top 10 directors')
plt.xticks(rotation=90)
plt.show()

### Chart 3 : Top ratings

In [None]:
rating = df.groupby('rating')[['type', 'country']].count()
y= rating.sort_values('type')

In [None]:
# Plot the sorted counts (`y`) rather than the unsorted frame
sns.barplot(x=y.index, y=y['country'])

### Chart 4 : Top 10 countries

In [None]:
y = df.groupby('country')[['rating','type']].count().sort_values('rating').tail(10)
y = np.log(y)  # counts are log-transformed before plotting
plt.bar(y.index, y['type'])
plt.ylabel('log(number of titles)')
plt.title('Top 10 countries by number of titles')
plt.xticks(rotation=90)
plt.show()

### Chart 5 : Pair plot

In [None]:
sns.pairplot(df)

### Encoding

In [None]:
oe = OrdinalEncoder()
ohe = OneHotEncoder()
le = LabelEncoder()

In [None]:
df['type'] = le.fit_transform(df['type'])
df['director'] = le.fit_transform(df['director'])
df['country'] = le.fit_transform(df['country'])
df['description'] = le.fit_transform(df['description'])
df['duration_type'] = le.fit_transform(df['duration_type'])
df['listed_in'] = le.fit_transform(df['listed_in'])
df['cast'] = le.fit_transform(df['cast'])

In [None]:
df['month'] = oe.fit_transform(df[['month']])
df['rating'] = oe.fit_transform(df[['rating']])

In [None]:
df.head()

In [None]:

df['time_and_count'] = df['time_and_count'].astype('float')

In [None]:
df['day'] = df['day'].str.replace(',','')

In [None]:
df['day'].unique()

In [None]:
# Drop any rows where `day` holds a month name (a misaligned date split)
months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']
df = df[~df['day'].isin(months)]

In [None]:
df['day'].value_counts()

In [None]:
# Convert day and year from strings to numbers
df['day'] = pd.to_numeric(df['day'])
df['year'] = pd.to_numeric(df['year'])

In [None]:
df.info()

### Chart 6 : Distribution

In [None]:
sns.histplot(df['day'])
plt.xticks(np.linspace(0,31,4))
plt.show()

In [None]:
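# time_and_count is unscaled, so let matplotlib pick the ticks instead of forcing 0..1
sns.histplot(df['time_and_count'])
plt.xlabel('Duration value (minutes or seasons)')
plt.show()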

In [None]:
df.columns

### Cluster technique and target calculation

Based on show ID, we will segment the data using the following:
- Count of movies per year
- Sum of time and count
- Countries

In [None]:
df.head()

In [None]:
df.year.value_counts()

In [None]:
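# Each lambda returns one scalar per group; release_year becomes the title's age relative to 2024
new_df = df.groupby('show_id').agg({'year': lambda x: x.mode().iloc[0], 'type': lambda x: x.mode().iloc[0],
                                    'rating': lambda x: x.max(), 'release_year': lambda x: 2024 - x.max(),
                                    'time_and_count': lambda x: x.sum(), 'country': lambda x: x.mode().iloc[0]})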

In [None]:
new_df.head()

# Model Selection, Training & Evaluation

In [None]:
X=new_df
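
KMeans is distance-based, so a feature with a large numeric range (for example a label-encoded `country`) can dominate the clustering. `StandardScaler` is imported at the top of the notebook but never applied. The sketch below shows how it could be used, under the assumption that standardizing all features is appropriate here. The toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
toy = pd.DataFrame({
    "release_year": [4, 10, 25, 1],
    "time_and_count": [90.0, 2.0, 120.0, 1.0],
    "country": [300, 12, 450, 300],
})

# Standardize every column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

print(X_scaled.round(2))
```

If scaling is adopted, the same scaled frame should be passed to both the elbow search and the final model.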

In [None]:
# Elbow method to choose k
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

# Record the within-cluster sum of squares (inertia) for each k
SSE = {}
for k in range(1, 15):
  km = KMeans(n_clusters=k, init='k-means++', max_iter=1000)
  km = km.fit(X)
  SSE[k] = km.inertia_

# Plot distortion against the number of clusters
visualizer = KElbowVisualizer(KMeans(init='k-means++', max_iter=1000), k=(1, 15), metric='distortion', timings=False)
visualizer.fit(X)
visualizer.show()  # show() replaces the deprecated poof()

In [None]:
# Fit KMeans with three clusters on X and assign each row a label
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_km = kmeans.predict(X)
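
The elbow plot can be ambiguous, so a silhouette score is often used as a second check on the number of clusters. The sketch below is not part of the original analysis. It uses hypothetical synthetic data with three well-separated groups to show the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated blobs in 2-D
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 5, 10)])

# Silhouette score for each candidate k (it is undefined for k=1)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy)
    scores[k] = silhouette_score(toy, labels)

# The k with the highest score gives the best-separated clustering
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Applied to `X`, the same loop would indicate whether three clusters is also favored by the silhouette criterion.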

In [None]:
# Plot the clusters
plt.figure(figsize=(10, 6))
plt.title('Clusters by duration value and country code')
plt.scatter(X['time_and_count'], X['country'], c=y_km, s=50, cmap='Set1', label='Clusters')

# Plot the centers, using the coordinates that match the plotted columns
centers = kmeans.cluster_centers_
ix, iy = X.columns.get_loc('time_and_count'), X.columns.get_loc('country')
plt.scatter(centers[:, ix], centers[:, iy], c='black', s=200, alpha=0.5, marker='x')
plt.xlabel('time_and_count')
plt.ylabel('country (label-encoded)')
plt.show()

# Conclusion

The dataset is divided into three clusters:

- Cluster 0: Represents the oldest movies and TV shows. The cluster is small, meaning fewer titles were released in this period than in the other clusters.
- Cluster 1: Represents movies and TV shows released in the middle years.
- Cluster 2: Represents the most recently released movies and TV shows. The cluster is large, meaning more titles were released in this period than in the other clusters.
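
Statements about cluster size and age can be checked with a per-cluster summary. The sketch below assumes a `cluster` column holding the KMeans labels and a `title_age` column holding years since release. Both column names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: title age in years and the assigned cluster label
toy = pd.DataFrame({
    "title_age": [40, 35, 12, 10, 2, 1, 3],
    "cluster": [0, 0, 1, 1, 2, 2, 2],
})

# Size and mean age of each cluster
profile = toy.groupby("cluster")["title_age"].agg(n_titles="count", mean_age="mean")

print(profile)
```

A table like this makes it possible to confirm which cluster is the oldest and which is the largest, instead of reading both off the scatter plot.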