# STAT 430: Final Project

By: W. Jonas Reger and Atharv Pathak

## 1. Introduction and Dataset Research


### Motivation

For this project we wanted to perform an analysis on a dataset that we found interesting but also relevant to our personal lives. We decided to use the "Movies on Netflix, Prime Video, Hulu and Disney+" Dataset from Kaggle, which can be found at https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney. The Netflix Prize brought more attention to how important recommender systems are and how they may be used in many applications. While algorithms designed to recommend movies, products, or other services to customers are very beneficial to businesses and customers alike, sometimes they can perform poorly due to insufficient data or skewed analysis. Industry practitioners want these recommendations to be fast, simple, and accurate, so that it is as easy as possible for customers to see an item and decide that it is something they want to buy (Technophilo, 2012). Recommendation systems can work very well because they provide customers a way to give feedback about what they liked or disliked. This feedback helps businesses predict which other movies or products the customer is most likely to watch or buy, which can increase sales (Aggarwal, 2016). While our end goal for the project isn't to build a recommendation system, this is a big part of what drew us to this dataset, which contains data not just from Netflix but from three other streaming services as well. We were also excited to look for relationships between movie ratings, production details, age, target audience, and streaming services that we may or may not already know. Someone who analyzes this kind of data more extensively would likely be motivated by building a recommendation system for one of these streaming services or for other movie retailers. These businesses would be able to advertise or recommend movies to the customers who are most likely to watch or pay for them.
For instance, Netflix and Prime Video are likely to have a large selection of movies available for all age ratings and would benefit from being able to recommend the best movie for a viewer who prefers 18+ rated Horror and Thriller movies vs another viewer who prefers Comedy and Drama but has no preference regarding age ratings. Another application is to evaluate how well movies in different clusters are rated across different geographic regions, so Netflix could make decisions on what movies to add or remove from different regions of their service. This dataset could even be expanded by including social media ratings and responses to movies, movie revenue, how much of each movie was watched, and user-specific information that could be more useful for a recommendation system. This kind of data can also be useful to the movie production industry in general, since it can help measure the success of a movie based on ratings, where it is most popular, what genres or other characteristics are trending, and even what streaming service might add the movie to its library. This dataset could also be expanded by including movie rating changes over time to be even more useful for the movie industry (Moon et al., 2010).

1. Aggarwal C.C. (2016) An Introduction to Recommender Systems. In: Recommender Systems. _Springer, Cham_. https://doi.org/10.1007/978-3-319-29659-3_1
2. Moon, S., Bergey, P. K., & Iacobucci, D. (2010). Dynamic Effects among Movie Ratings, Movie Revenues, and Viewer Satisfaction. _Journal of Marketing, 74_(1), 108–121. https://doi.org/10.1509/jmkg.74.1.108
3. Technophilo. (2012, October). Recommender Systems: Pros and Cons. _Technophilo_. http://www.technophilo.in/2012/10/recommender-systems-pros-and-cons.html

### Dataset Information

We retrieved this dataset from Kaggle, where it is maintained by Ruchi Bhatia. Part of the data, containing attribute information for movies that were available on the streaming platforms, was scraped. The remaining data came from the IMDb dataset. The Kaggle user was inspired to make this dataset by the following questions:

1. Which streaming platform(s) can I find this movie on?
2. Average IMDb rating of movies produced in a country?
3. Target age group movies vs the streaming \[platform\] they can be found on.
4. The year during which a movie was produced and the streaming platform they can be found on.
5. Analysis of the popularity of a movie vs directors.

She provides a link to her own data visualization that answers these questions, but we will not focus on or be limited to these questions for our own analysis. The dataset has 16744 observations (movies) and 17 columns for row index, movie ID, title, year of production, age rating, IMDb rating, Rotten Tomatoes rating, Netflix Dummy Variable (DV), Hulu DV, Prime Video DV, Disney+ DV, Type (movie or tv show), Directors (names), Genres, Country(s) available in, Language(s), and Runtime (movie length). 

No further details were provided, including what preprocessing (if any) was used, but some binning has likely been applied, since movies are usually region-specific on most if not all platforms. For instance, the movie 'Inception' is listed on Netflix only but might only be available in the United States. Another possible issue is that the 'Country' column could refer either to these region-specific availabilities or to where the movie was produced; neither was explicitly stated (similar issues apply to 'Genres', 'Language', and 'Directors'). We are not sure exactly which attributes come from IMDb and which from the streaming services, but we think it is safe to assume that the attributes we mentioned likely came from the streaming services.

## 2. Preliminary Exploratory Data Analysis (EDA)

### Package Imports

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline
import statistics as st

import time
import math

from sklearn.cluster import KMeans, MiniBatchKMeans, Birch, DBSCAN, AgglomerativeClustering
from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs, load_digits
from sklearn.metrics import adjusted_rand_score, silhouette_score, silhouette_samples
# , calinski_harabaz_score
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from kmodes.kmodes import KModes

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet

from pyclustertend import hopkins
from sklearn import preprocessing

The above packages are necessary for the exploratory data analysis and unsupervised learning algorithms needed for the dataset. These functions will be used throughout the report and are pivotal for mining important insights from the data.

### Loading Dataset

In [None]:
# Load the data set
df = pd.read_csv('MoviesOnStreamingPlatforms_updated.csv') 
print(df.shape)
df.head()

The data shown above closely resembles the raw data file given on Kaggle. Although the dataset was briefly described in the introduction of the report, there are still some key features that should be emphasized. First, it is important to note that most features in the dataset are categorical. As a result, many of the clustering/unsupervised learning techniques mentioned in class may not yield substantial results on the raw data, due to the lack of numerical attributes. There are several ways to combat this issue so that the data becomes usable for unsupervised learning. One method, which is implemented in this report, is the use of dummy variables. Instead of keeping strings in each column, such as the lists of genres and the names of the directors, it can be more helpful to convert those features into numerical 0/1 indicators. This process is outlined later in the Preliminary Exploratory Data Analysis section; with it, unsupervised learning techniques can be better applied to the dataset.
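As a minimal sketch of the dummy-variable idea, the snippet below encodes a toy comma-separated column with the pandas method `Series.str.get_dummies`. The toy values and the variable names (`genres`, `genre_dummies`, `labels_per_row`) are made up for illustration and are not taken from the dataset; the report itself builds its dummy variables with a custom function later on.

```python
import pandas as pd

# Toy stand-in for a multi-valued column such as `Genres` (values are made up).
genres = pd.Series(["Action,Comedy", "Drama", "Comedy,Drama,Family"])

# One 0/1 column per distinct label found in the comma-separated cells.
genre_dummies = genres.str.get_dummies(sep=",")

# Row sums give the number of labels attached to each observation.
labels_per_row = genre_dummies.sum(axis=1)

print(genre_dummies)
print(labels_per_row.tolist())
```

Each row now holds only numbers, so distance-based methods can be applied, at the cost of one extra column per distinct label.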

### Summary Statistics of Numerical Attributes

In [None]:
df.describe()

Once the dataset is loaded into a pandas dataframe, summary statistics can be obtained with the `.describe()` method. Only the numerical features appear in the summary because the categorical data cannot be described in this way. One thing to note is that the count differs between attributes, hinting that there are missing values in the dataset that must be cleaned. Next, the summary statistics for the `Netflix`, `Hulu`, `Prime Video` and `Disney+` columns are of limited use, because those columns are encoded as dummy variables (1s or 0s) that indicate whether or not a movie is available on the platform. Their means can still be read as proportions: the averages are quite small for `Netflix`, `Hulu` and `Disney+`, meaning that relatively few movies in the dataset are on those platforms. On the other hand, `Prime Video` has a mean (0.74) relatively close to 1, meaning that most observations are `Prime Video` movies. Another feature worth noting is the `IMDb` score. The movies have an average score of about 5.9, with a fairly large standard deviation of 1.35. If the scores are roughly bell-shaped, most of them fall between 4.55 and 7.25, i.e. within one standard deviation of the mean. Lastly, the `Year` values span many decades. The oldest movie was made in 1902 and the most recent in 2020, a range of more than 100 years. These attributes may provide interesting insights on consumer behavior when unsupervised learning algorithms are applied to the dataset.
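The one-standard-deviation interval quoted above (5.9 - 1.35 = 4.55 and 5.9 + 1.35 = 7.25) can be checked directly by counting how many scores fall inside it. The sketch below does this on a small made-up set of ratings, not on the actual `IMDb` column; the names `scores` and `share_within` are illustrative.

```python
import pandas as pd

# Made-up ratings, chosen so the mean and standard deviation are round numbers.
scores = pd.Series([3.0, 5.0, 6.0, 6.0, 7.0, 9.0])

mean = scores.mean()
std = scores.std()  # pandas reports the sample standard deviation (ddof=1)

# Interval of one standard deviation either side of the mean.
lower, upper = mean - std, mean + std

# Fraction of observations inside the interval (bounds included).
share_within = scores.between(lower, upper).mean()

print(lower, upper, share_within)
```

Applying the same three lines to `df['IMDb'].dropna()` would show how well "most of the data" describes the real column.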

### Number of observations per unique value in each categorical variable

In [None]:
df['Age'].value_counts()

Using the `.value_counts()` method, the counts for each value of `Age` can be found. Most of the movies are rated 18+, which isn't too surprising. Additionally, since the dataset includes family-oriented streaming platforms such as Disney+, it makes sense that there are 1462 observations with 7+ movies and 1255 observations with 13+ movies. The `Age` variable should provide insight into the types of movies present in each of the clusters.

`catSeparate` is a function used to separate categorical values in text cells for `Country`, `Language`, `Directors`, and `Genres` columns.

In [None]:
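def catSeparate(X):
    # Split every comma-separated cell into its individual category labels.
    parts = [cell.split(",") for cell in X]
    # Flatten to one entry per (observation, label) pair.
    lst_ct = [label for row in parts for label in row]

    # Largest number of labels listed for a single observation.
    print(max((len(row) for row in parts), default=0))
    # Number of distinct labels across the whole column.
    print(len(set(lst_ct)))
    return pd.Series(lst_ct)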

In [None]:
X = df.copy()
C = X['Country'].replace(np.nan, "Unknown")
catSeparate(C).value_counts().head(25)

From the output above, the `Country` variable has about 169 unique countries and a maximum of 27 countries listed for a single movie. It is also not surprising that the top 3 countries, United States, United Kingdom and India, have the highest counts, since each of those countries has a large film industry.

In [None]:
L = X['Language'].replace(np.nan, "Unknown")
catSeparate(L).value_counts().head(30)

From the output above, the `Language` variable has 179 unique languages and a maximum of 10 languages listed for a single movie. It is also no surprise that English has the highest number of observations, as it is the most widely spoken language in the world.

In [None]:
D = X['Directors'].replace(np.nan, "Unknown")
catSeparate(D).value_counts().head(25)

Next, the `Directors` column has a very large number of distinct values. Specifically, there are 12,454 unique directors and a maximum of 28 directors listed for a single movie. Although this variable is very diverse, it does not seem to be a great option for clustering, as there are many `Unknown` directors and the known directors are spread very thin.

In [None]:
G = X['Genres'].replace(np.nan, "Unknown")
catSeparate(G).value_counts().head(25)

Lastly, the `Genres` feature provides a lot of insight into the types of movies that are on the streaming platforms. There are 28 unique genres spread across the dataset and a maximum of 9 genres listed for a single movie. This feature will likely be the most interesting to analyze, since movies within a cluster can be expected to share similar genres.

### Checking datatypes of each column

In [None]:
df.dtypes

### Preprocessing and Data Cleaning

In [None]:
df.head()

In [None]:
df.isnull().sum(axis = 0)

There is a considerable amount of missing data in some columns, the largest being `Rotten Tomatoes` with 11586 missing values out of 16744 observations. Hence, the `Rotten Tomatoes` variable will be dropped. Missing `Runtime` values will be replaced with the column mean, and the other variables will be cleaned by dropping the rows that contain NA values. Additionally, `ID`, `Unnamed: 0`, and `Type` will be dropped since they contain no relevant information for the analysis.

In [None]:
DF = df.copy()
# statistics.mean returns NaN when the column contains NaN, so use the pandas mean, which skips NaN.
DF['Runtime'] = DF['Runtime'].fillna(DF['Runtime'].mean())
DF = DF.drop(['Rotten Tomatoes', 'ID', 'Unnamed: 0', 'Type'], axis=1)
DF = DF.dropna()
DF.reset_index(drop=True, inplace=True)
print(DF.shape)
DF.head()

`dummyDF` is a function that separates the categories in text cells and creates a dummy variable dataframe for the specified column.

In [None]:
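def dummyDF(X):
    temp = pd.DataFrame(X)
    lst = []  # labels that already have a dummy column
    n = len(X)
    for i in range(0, n):
        # Split the comma-separated cell into its individual labels.
        c = X[i].split(",")
        for j in range(0, len(c)):
            if (c[j] not in lst):
                # First time this label appears: add an all-zero column for it.
                lst.append(c[j])
                temp[c[j]] = 0
            # Flag the label as present for observation i.
            temp.iloc[i, temp.columns.get_loc(c[j])] = 1
    # Drop the original text column so only the 0/1 indicators remain.
    COL = temp.drop(temp.columns[0], axis=1)
    print(COL.shape)
    return(COL)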

# pd.set_option('display.max_columns', None)
# print(GENRE.shape)
# GENRE.head(15)

In [None]:
Ldf = dummyDF(DF['Language'])
Gdf = dummyDF(DF['Genres'])
Ddf = dummyDF(DF['Directors'])
Cdf = dummyDF(DF['Country'])

Given the large number of categorical levels in these dummy variable dataframes, it is more practical to summarize each multi-valued column by the total number of values listed per observation. Beyond these totals, the `Genres` dummy variable dataframe will be the only one used in full. The `Language` and `Country` dummy variable dataframes will be dimensionally reduced by combining levels into larger groups (e.g. merging all European levels of `Country` into one column called `Europe`), and `Directors` will be represented by its total alone.

In [None]:
CX = Cdf.copy()

NA = ['United States', 'Mexico', 'Canada', 'Bermuda', 'Costa Rica', 'Guatemala', 'Panama',
'Bahamas', 'Dominican Republic', 'Haiti', 'Puerto Rico', 'Cayman Islands', 'Cuba', 'Jamaica']

SA = ['Aruba', 'Brazil', 'Argentina', 'Chile', 'Colombia', 'Peru', 'Ecuador', 'Uruguay', 'Paraguay',
'Trinidad and Tobago']

EU = ['United Kingdom', 'Italy', 'Spain', 'West Germany', 'France', 'Poland', 'Germany', 'Ireland',
'Belgium', 'East Germany', 'Sweden', 'Finland', 'Denmark', 'Luxembourg', 'Greece', 'Netherlands',
'Switzerland', 'Hungary', 'Norway', 'Romania', 'Iceland', 'Russia', 'Croatia', 'Holy See (Vatican City State)',
'Bulgaria', 'Malta', 'Latvia', 'Slovenia', 'Austria', 'Albania', 'Portugal', 'Serbia', 'Czech Republic',
'Federal Republic of Yugoslavia', 'Monaco', 'Lithuania', 'Ukraine', 'Czechoslovakia', 'Estonia',
'Soviet Union', 'Yugoslavia', 'Belarus', 'Slovakia', 'Bosnia and Herzegovina', 'Montenegro']

AS = ['Hong Kong', 'China', 'Japan', 'South Korea', 'Taiwan', 'Afghanistan', 'Bangladesh', 
'India', 'Kazakhstan', 'Kyrgyzstan', 'Nepal', 'Pakistan', 
'Thailand', 'Cambodia', 'Laos', 'Malaysia', 'Vietnam', 'Singapore', 'Indonesia', 'Philippines', 
'Bahrain', 'Iran', 'Iraq', 'Israel', 'Jordan', 'Lebanon', 'Qatar', 'Saudi Arabia', 
'Syria', 'Turkey', 'Palestine', 'United Arab Emirates']

AU = ['Australia', 'New Zealand', 'Papua New Guinea']

AF = ['Senegal', 'Nigeria', 'Ghana', 'Burkina Faso', 'Egypt', 'Libya', 'Morocco', 'Tunisia', 
'Angola', 'Congo', 'Ethiopia', 'Kenya', 'Malawi', 'Rwanda', 'Somalia', 'Tanzania', 'Uganda', 
'Zambia', 'Botswana', 'Namibia', 'South Africa']

Regions = NA + SA + EU + AS + AU + AF

CX['North America'] = CX[NA].sum(axis=1)
CX['South America'] = CX[SA].sum(axis=1)
CX['Europe'] = CX[EU].sum(axis=1)
CX['Asia'] = CX[AS].sum(axis=1)
CX['Australia and Pacific'] = CX[AU].sum(axis=1)
CX['Africa'] = CX[AF].sum(axis=1)

CX = CX.drop(Regions, axis=1)

# Cap the regional sums at 1 so every column stays a 0/1 indicator
# (vectorized replacement for the element-by-element chained assignment).
CX = (CX > 0).astype(int)

CX.head()

In [None]:
# English       13233
# Spanish         872
# French          799
# Arabic          190
# Portuguese      108

LX = Ldf.copy()

NA = ['North American Indian', 'East-Greenlandic', 'Cheyenne', 'Navajo', 'Maya', 'Middle English',
'Inuktitut', 'Sioux', 'Creek', 'Athapascan languages', 'Apache languages', 'Micmac']

SA = ['Quechua', 'Mapudungun', 'Papiamento', 'Tupi', 'Guarani']

EU = ['Basque', 'Finnish', 'Serbo-Croatian', 'Swiss German', 'Croatian', 'Aramaic', 'Saami',
'Hungarian', 'Estonian', 'Serbian', 'Bosnian', 'Lithuanian', 'Latin', 'Greek', 'Irish', 'Yiddish', 
'Norwegian', 'Romanian', 'Scottish Gaelic', 'Danish', 'Flemish', 'Slovenian', 'Catalan',
'Icelandic', 'Ukrainian', 'Dutch', 'Polish', 'Czech', 'Welsh', 'Luxembourgish', 'Cornish', 
'Maltese', 'Scots', 'Slovak', 'Low German', 'Bulgarian', 'Swedish', 'German', 'Italian', 'Russian']

AS = ['Kudmali', 'Bhojpuri', 'Sanskrit', 'Pushto', 'Sinhalese', 'Awadhi', 'Vietnamese', 'Marathi',
'Armenian', 'Kannada', 'Nepali', 'Urdu', 'Persian', 'Kurdish', 'Bengali', 'Thai', 'Indonesian', 
'Khmer', 'Tagalog', 'Turkmen', 'Tibetan', 'Min Nan', 'Chinese', 'Malayalam', 'Mongolian', 'Punjabi',
'Malay', 'Hakka', 'Shanghainese', 'Gujarati', 'Assyrian Neo-Aramaic', 'Dari', 'Filipino', 'Turkish',
'Cantonese', 'Korean', 'Hindi', 'Tamil', 'Telugu', 'Japanese', 'Mandarin', 'Hebrew']

AU = ['Polynesian', 'Aboriginal', 'Maori', 'Hawaiian']

AF = ['Masai', 'Dyula', 'Nama', 'Amharic', 'Southern Sotho', 'Lingala', 'Berber languages', 'Wolof',
'Nyanja', 'Afrikaans', 'Zulu', 'Xhosa', 'Swahili', 'Yoruba', 'Kinyarwanda']

# 'None' is listed under OT only, so it is not counted in both groups.
SL = ['Japanese Sign Language', 'American Sign Language', 'Sign Languages',
'French Sign Language', 'Brazilian Sign Language']

OT = ['Esperanto', 'Klingon', 'None']

Regions = NA + SA + EU + AS + AU + AF + OT + SL

LX['North American'] = LX[NA].sum(axis=1)
LX['South American'] = LX[SA].sum(axis=1)
LX['European'] = LX[EU].sum(axis=1)
LX['Asian'] = LX[AS].sum(axis=1)
LX['Australian and Pacific'] = LX[AU].sum(axis=1)
LX['African'] = LX[AF].sum(axis=1)
LX['Sign Lang.'] = LX[SL].sum(axis=1)
LX['Other'] = LX[OT].sum(axis=1)

LX = LX.drop(Regions, axis=1)

# Cap the grouped sums at 1 so every column stays a 0/1 indicator
# (vectorized replacement for the element-by-element chained assignment).
LX = (LX > 0).astype(int)

LX.head()

In [None]:
sns.histplot(Ldf.sum(axis=1))

In [None]:
sns.histplot(Gdf.sum(axis=1))

In [None]:
sns.histplot(Ddf.sum(axis=1))

In [None]:
sns.histplot(Cdf.sum(axis=1))

In [None]:
X = DF.copy()
X = X.drop(['Genres', 'Country', 'Language', 'Directors'], axis=1)
X['Total Genres'] = Gdf.sum(axis=1)
X['Total Countries'] = Cdf.sum(axis=1)
X['Total Languages'] = Ldf.sum(axis=1)
X['Total Directors'] = Ddf.sum(axis=1)
print(X.shape)
X.head()

We will also replace the `Year` column with `movie_age` for better interpretation.

In [None]:
X['movie_age'] = 2020 - X['Year']
X = X.drop('Year', axis=1)

Age dummy variable:

In [None]:
age = pd.get_dummies(X, columns=['Age'])
mdf = pd.concat([age, Gdf, CX, LX], axis=1)

Below is the modified dataset:

In [None]:
mdf.head()

In [None]:
mdf.describe()

In [None]:
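# Drop the text column first: recent pandas versions raise an error when corr() meets non-numeric data.
Cmat_mdf = mdf.drop('Title', axis=1).corr()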

In [None]:
lst = []
point = []
for i in Cmat_mdf.columns:
    for j in Cmat_mdf.columns:
        if abs(Cmat_mdf[str(i)][str(j)]) >= 0.4:
            if i==j: 
                pass
            else:
                temp = str(i) + "," + str(j)
                if (str(j) + "," + str(i)) in lst:
                    pass
                else:
                    lst.append(temp)
                    point.append(Cmat_mdf[str(i)][str(j)])
        else:
            pass

### Pairwise Relationships

In [None]:
pair_mdf = pd.DataFrame({'row,col': lst, 'corr': point})
print(pair_mdf.shape)
pair_mdf.head(16)

The output above shows the feature pairs whose correlation is at least 0.4 or at most -0.4, i.e. the features that are at least moderately correlated with one another in either direction. Some pairs, like `Asia` and `Asian`, are expected to be highly correlated, while others, like `Netflix` and `Prime Video`, are much more interesting. The negative correlation between these two streaming platforms indicates an inverse relationship: a movie that is on `Prime Video` is less likely to be on `Netflix`, and vice versa. Additionally, the positive correlations between `Animation` and `Family`, and between `Disney+` and `Family`, are reassuring, since it makes sense that animated films as well as films on Disney+ cater to the family genre. It is also worth noting the high correlations between `North America` and `English`, and between `North America` and `Asian`, as such relationships are not as readily explained. Overall, there seem to be some interesting pairwise relationships within the dataset that may be brought to light as unsupervised learning is applied.
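The same kind of filtering can be sketched without explicit loops by masking the upper triangle of the correlation matrix, so that each pair appears once and the diagonal is skipped. The toy dataframe and the names `toy`, `pairs` and `strong_pairs` below are assumptions made for illustration; the 0.4 threshold is the one used in the report.

```python
import numpy as np
import pandas as pd

# Toy data: b rises with a, c falls as a rises, d is unrelated to a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 4, 3, 2, 1],
    "d": [1, -1, 0, -1, 1],
})
corr = toy.corr()

# Keep only the strict upper triangle: one entry per pair, no diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose correlation is at least 0.4 in absolute value.
strong_pairs = pairs[pairs.abs() >= 0.4]
print(strong_pairs)
```

The result is a Series indexed by (row, column) pairs, which plays the same role as the `row,col` and `corr` columns built by the loop.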

In [None]:
# Strong/Interesting relationships visualized.
mdf_rel = mdf[['Netflix', 'Prime Video', 'Disney+', 'Family', 'Total Countries', 'Europe', 
'Total Languages', 'European', 'North America', 'English', 'Asia', 'Asian']]
sns.pairplot(mdf_rel)

Shown above are pairplots of some of the features within the dataset. Unfortunately, since most of the data is encoded as dummy variables, these plots do not provide much information about the relationships between features. Most panels only show points at 0 or 1, so no conventional correlation pattern is visible. However, the correlation coefficients found in the previous block of code provide enough preliminary exploratory data analysis to help drive the unsupervised learning in the next sections.

### Scaling of Dataset

In [None]:
Xs = mdf.copy()
Xs = Xs.drop('Title', axis=1)
Xs_col = Xs.columns
Xs.reset_index(drop=True, inplace=True)
Xs = StandardScaler().fit_transform(Xs)
Xs = pd.DataFrame(Xs, columns=Xs_col)
Xs.describe()

One main issue in the original dataset was that the features differ widely in spread: the numerical attributes and the created dummy variables are on very different scales. Because of this, the dataset was unbalanced, and it was much more difficult to derive insights through clustering. To combat this, the features were standardized so that all variables have comparable standard deviations. In the table above, most, if not all, features have a standard deviation of about 1, meaning that the dataset was scaled appropriately.
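A small sketch of what standardization does, on made-up values rather than the real features: `StandardScaler` subtracts each column's mean and divides by its population standard deviation. This is also why `.describe()`, which reports the sample standard deviation, shows values slightly above 1 rather than exactly 1. The names `toy`, `scaled`, `pop_std` and `sample_std` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (values are made up).
toy = pd.DataFrame({
    "Runtime": [80.0, 95.0, 110.0, 140.0],
    "Netflix": [0.0, 1.0, 0.0, 0.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled.std(ddof=0)     # population std: 1 for every column
sample_std = scaled.std(ddof=1)  # sample std, as shown by describe()

print(scaled.mean().round(6).tolist())
print(pop_std.tolist(), sample_std.tolist())
```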

### Final Data Preparation

The code in this section transforms the `mdf` dataframe into a fully categorical dataset via binning and encoding.
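As a small illustration of the binning step, the sketch below applies `pd.cut` with the runtime bin edges used in this section to a handful of made-up runtimes. It shows two details that are easy to miss: the intervals are closed on the right (a runtime of 87 lands in `0-87`), and values beyond the last edge receive no bin at all. The names `runtime`, `runtime_bin` and `codes` are illustrative.

```python
import pandas as pd

# Made-up runtimes in minutes; 300 lies beyond the last bin edge.
runtime = pd.Series([45, 87, 88, 120, 300])

runtime_bin = pd.cut(runtime, [0, 87, 95, 106, 260],
                     labels=['0-87', '87-95', '95-106', '106-260'],
                     include_lowest=True)

# Integer code per bin, in label order; -1 marks a value that fell outside all bins.
codes = runtime_bin.cat.codes

print(runtime_bin.tolist())
print(codes.tolist())
```

Checking for missing bins after `pd.cut` is a cheap way to confirm that the chosen edges cover the whole range of the data.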

In [None]:
mdf_bin = mdf.copy()
mdf_bin = mdf_bin.dropna()

mdf_bin['runtime_bin'] = pd.cut(mdf_bin['Runtime'], [0, 87, 95, 106, 260],
labels=['0-87', '87-95', '95-106', '106-260'], include_lowest=True)
mdf_bin['move_age_bin'] = pd.cut(mdf_bin['movie_age'], [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
labels=['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100', '100-110', '110-120'], include_lowest=True)
mdf_bin['imdb_bin'] = pd.cut(mdf_bin['IMDb'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
labels=['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10'], include_lowest=True)

mdf_cat = mdf_bin.drop(['Title', 'IMDb', 'Runtime', 'movie_age'], axis=1)

lab_enc = preprocessing.LabelEncoder()
mdf_cat = mdf_cat.apply(lab_enc.fit_transform)
mdf_cat.head()

In [None]:
mdf_cat.describe()

In [None]:
pd.concat([mdf_bin['runtime_bin'],mdf_cat['runtime_bin']], axis=1).drop_duplicates()

In [None]:
pd.concat([mdf_bin['move_age_bin'],mdf_cat['move_age_bin']], axis=1).drop_duplicates()

In [None]:
pd.concat([mdf_bin['imdb_bin'],mdf_cat['imdb_bin']], axis=1).drop_duplicates()

## 3. Pre-Analysis Questions

### Is the Dataset Clusterable?

In [None]:
num_trials=5
hopkins_stats=[]
for i in range(0,num_trials):
    n = len(Xs)
    p = int(0.1 * n)
    hopkins_stats.append(hopkins(Xs,p))
print(hopkins_stats)

In [None]:
X = mdf.copy()
X = X.drop('Title', axis=1)
num_trials=5
hopkins_stats=[]
for i in range(0,num_trials):
    n = len(X)
    p = int(0.1 * n)
    hopkins_stats.append(hopkins(X,p))
print(hopkins_stats)

In [None]:
X = mdf_cat.copy()
num_trials=5
hopkins_stats=[]
for i in range(0,num_trials):
    n = len(X)
    p = int(0.1 * n)
    hopkins_stats.append(hopkins(X,p))
print(hopkins_stats)

In [None]:
def tsne_df(XDF, DF, PR, RS):
    tsne = TSNE(n_components=2, perplexity=PR, random_state=RS)
    data_tsne = tsne.fit_transform(XDF)
    df_tsne = pd.DataFrame(data_tsne, columns=['x', 'y'], index=XDF.index)
    dff = pd.concat([DF, df_tsne], axis=1)
    return(dff)

def tsne_plot(DFF, W, H, FS):
    fig, ax = plt.subplots(figsize=(W, H))
    with sns.plotting_context("notebook", font_scale=FS):
        sns.scatterplot(x='x', 
                        y='y', 
                        sizes=(30, 400),
                        data=DFF,
                        ax=ax)
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
    plt.show()

def tsne_plot_hue(DFF, W, H, FS, HUE):
    fig, ax = plt.subplots(figsize=(W, H))
    with sns.plotting_context("notebook", font_scale=FS):
        sns.scatterplot(x='x', 
                        y='y', 
                        sizes=(30, 400),
                        hue=HUE,
                        data=DFF,
                        ax=ax)
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
    plt.show()


In [None]:
X_cat = mdf_cat.copy()
tdf35 = tsne_df(XDF=X_cat, DF=mdf_cat, PR=35, RS=1000)

In [None]:
tdf45 = tsne_df(XDF=X_cat, DF=mdf_cat, PR=45, RS=1000)

In [None]:
tsne_plot(DFF=tdf35, W=12, H=8, FS=1.5)

In [None]:
Xu = mdf.copy()
Xu = Xu.drop('Title', axis=1)
tdf35u = tsne_df(XDF=Xu, DF=Xu, PR=35, RS=1000)
tsne_plot(DFF=tdf35u, W=12, H=8, FS=1.5)

From the analysis above, this dataset is clusterable. The main metric used to assess the clusterability of the datasets used in the analysis is the Hopkins statistic. After running the algorithm, the Hopkins statistic took values ranging from 0.04 to 0.16. Under the convention of the `pyclustertend` implementation used here, the closer the Hopkins statistic is to 0, the more clusterable the dataset is (some references define the statistic the other way around, with values near 1 indicating cluster structure), which suggests that the datasets are suitable for unsupervised learning. As stated earlier, there are 2 main datasets used in this report: `mdf` and `mdf_cat`. The t-SNE plot for `mdf_cat` is shown first and the t-SNE plot for `mdf` is shown second. For `mdf_cat`, there seem to be 2 main large clusters with many smaller clusters hovering around them, while `mdf` looks like one large cluster with a few less spherical clusters attached to it. In both plots the clusters appear approximately spherical, with very little separation and cohesion. Additionally, both plots show that the clusters are not balanced in size, as some are larger than others, and due to the lack of separation there is a large chance that many of the clusters overlap with one another. Although these plots do not present promising results, the different unsupervised learning algorithms may be able to identify some hidden cluster structures that can shed light on possible findings in the data.
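To make the direction of the statistic concrete, here is a rough sketch of a Hopkins-style computation under the convention described above (values near 0 for clustered data, around 0.5 for uniform data). It is an illustrative approximation written for this explanation, not the `pyclustertend` source: the function name `hopkins_sketch`, the bounding-box sampling and the use of plain (unexponentiated) distance sums are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_sketch(X, m, seed=0):
    """Hopkins-style statistic: near 0 for clustered data, near 0.5 for uniform data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from m sampled real points to their nearest *other* real point
    # (column 0 is the point itself at distance 0, so column 1 is used).
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx])[0][:, 1]

    # u: distance from m uniform points in the bounding box to the nearest real point.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]

    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.05, (250, 2)), rng.normal(5, 0.05, (250, 2))])
uniform = rng.uniform(0, 1, (500, 2))

h_clustered = hopkins_sketch(clustered, m=50)
h_uniform = hopkins_sketch(uniform, m=50)
print(h_clustered, h_uniform)
```

In clustered data the real points sit much closer to each other than random locations sit to the data, which pushes the ratio towards 0.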

## 4. Algorithm Selection Motivation

This dataset was heavily dependent on categorical variables, with very few numerical attributes. Additionally, there was a large amount of missing data, and considerable pre-processing had to be done for the dataset to be usable under any unsupervised model. During the processing phase, many of the features were converted from categorical values to dummy variables, the numerical attributes were encoded into bins, and a scaled version of the data was created so that all features are on a comparable scale. This made the dataset ready for categorical-based clustering. In particular, the dummy variables and standardized data make the dataset a natural candidate for agglomerative hierarchical clustering. Additionally, since the dataset is heavily categorical, the k-modes algorithm was also chosen to cluster it, as k-modes is designed for categorical attributes. This is why these two algorithms were chosen and implemented in the next section.
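For readers unfamiliar with k-modes, the idea can be sketched in a few lines: it mirrors k-means, but replaces Euclidean distance with the number of mismatching attributes and replaces cluster means with per-attribute modes. The function below is a simplified illustration and not the `kmodes` package implementation; the name `kmodes_sketch` and the initialization (the k most frequent distinct rows) are assumptions made to keep the example deterministic.

```python
import numpy as np

def kmodes_sketch(X, k, n_iter=10):
    X = np.asarray(X)
    # Simplified initialisation (assumption of this sketch): the k most frequent distinct rows.
    rows, counts = np.unique(X, axis=0, return_counts=True)
    modes = rows[np.argsort(-counts, kind="stable")[:k]].copy()

    for _ in range(n_iter):
        # Dissimilarity: number of attributes on which a row differs from each mode.
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)

        # Update each mode to the most frequent value of every attribute in its cluster.
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue  # keep the previous mode for an empty cluster
            for j in range(X.shape[1]):
                vals, cnt = np.unique(members[:, j], return_counts=True)
                modes[c, j] = vals[cnt.argmax()]
    return labels, modes

# Two obvious groups of label-encoded rows plus a few noisy rows.
toy = np.array([[0, 0, 0]] * 10 + [[1, 1, 1]] * 10 + [[0, 0, 1]] * 2 + [[1, 1, 0]] * 2)
labels, modes = kmodes_sketch(toy, k=2)
print(labels.tolist())
print(modes.tolist())
```

Because only matches and mismatches are counted, the integer codes produced by label encoding are treated as names rather than magnitudes, which is what makes the approach suitable for `mdf_cat`.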

## 5. Algorithm Results

### Agglomerative Hierarchical Clustering Algorithm

In [None]:
hierarch_df = mdf.copy()
hierarch_df = hierarch_df.drop(['Title'], axis = 1)
X = mdf.copy()
X = X.drop(['Title'], axis = 1)

In [None]:
nb_clusters = range(2,15)
linkages = ['single', 'complete', 'ward', 'average']

silhouette_scores = np.zeros(shape = (len(linkages),len(nb_clusters)))

for i,l in enumerate(linkages):
    for j,nbc in enumerate(nb_clusters):
        # Euclidean distance is the default, so the `affinity` argument (removed in recent scikit-learn releases) is omitted.
        ag = AgglomerativeClustering(n_clusters=nbc, linkage=l)
        Y_pred = ag.fit_predict(hierarch_df)
        sls = silhouette_score(hierarch_df,Y_pred,random_state=1002)
        silhouette_scores[i,j] = sls

for i in range(len(nb_clusters)):
    plt.plot(silhouette_scores[:,i])
    plt.ylabel('Silhouette Score', fontsize = 14)
    plt.title('Number of Clusters:' + str(nb_clusters[i]), fontsize = 14)
    plt.xticks(np.arange(len(linkages)), linkages)
    plt.show()


In [None]:
sil_df = pd.DataFrame(silhouette_scores)
sil_df = sil_df.rename(columns={0:'2 Clusters', 1:'3 Clusters', 2:'4 Clusters', 3:'5 Clusters', 4:'6 Clusters', 5:'7 Clusters', 6:'8 Clusters', 7:'9 Clusters', 8:'10 Clusters', 9:'11 Clusters', 10:'12 Clusters', 11:'13 Clusters', 12:'14 Clusters'})
sil_df = sil_df.rename(index={0:'Single', 1:'Complete', 2:'Ward', 3:'Average'})
print(np.max(sil_df, axis=1))
print(sil_df.idxmax(axis=0, skipna=True))
sil_df.head()

From the results shown above, silhouette scores were computed for 4 different linkages: single, complete, ward and average. From the summary table, it is apparent that single linkage achieved the highest scores for the agglomerative hierarchical clustering. Taken at face value, these high scores indicate the best linkage option as well as the best number of clusters: based on the silhouette scores alone, it seems that single linkage with 2 clusters is the best choice for the agglomerative hierarchical clustering algorithm. The next best option was average linkage, followed by complete linkage. However, a high silhouette score for single linkage can simply reflect a very small group of points being split off from the rest, so the score alone should not be treated as decisive. Each linkage has its pros and cons for clustering, so a good next step is to visualize the most prominent linkages on a t-SNE plot over a range of cluster counts to see which option is best for the agglomerative hierarchical clustering.
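The comparison above can be reproduced in miniature on synthetic data. The sketch below scores the four linkages on three well-separated blobs and also records the cluster sizes, since reporting sizes next to the silhouette score helps to spot the case where a linkage isolates only a handful of points. The blob centers and the names `X_toy`, `scores` and `sizes` are assumptions for illustration and have nothing to do with the movie data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs with 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

scores, sizes = {}, {}
for link in ["single", "complete", "ward", "average"]:
    # Euclidean distance is the default metric for every linkage.
    pred = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X_toy)
    scores[link] = silhouette_score(X_toy, pred)
    sizes[link] = np.bincount(pred).tolist()

for link in scores:
    print(link, round(scores[link], 3), sizes[link])
```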

In [None]:
dm = pdist(hierarch_df,metric='euclidean')
Z = linkage(dm, method='complete')
fig,ax = plt.subplots(figsize = (25,20))
d = dendrogram(Z,orientation = 'right', truncate_mode='lastp', p=80, no_labels=True, ax=ax)
ax.set_xlabel('Dissimilarity', fontsize = 18)
ax.set_ylabel('Samples (80 Leaves)', fontsize = 18)
plt.show()

In [None]:
tdf35 = tsne_df(XDF=X, DF=mdf, PR=35, RS=1000)
for n in range(2,15):
    ag = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage='complete')
    Y_pred = ag.fit_predict(hierarch_df)
    df_pred = pd.Series(Y_pred, name='Cluster', index = hierarch_df.index)
    pdff = pd.concat([tdf35,df_pred], axis = 1)

    sns.scatterplot(x = 'x', y = 'y', hue='Cluster', palette=sns.color_palette('husl', n), data = pdff)
    plt.title('T-SNE Plot Color Coded by Complete Linkage Hierarchical Clustering')
    plt.legend(bbox_to_anchor = (1,1))
    plt.show()


The first option explored was complete linkage with 2 to 14 clusters. First, a dendrogram is created from the data (`mdf`) used for the hierarchical clustering. The dendrogram shows branches that are fairly evenly spread out, hinting at a balanced cluster structure. Beyond about 4 clusters, the dendrogram produces very small clusters, which may be more useful for identifying outlier groups in the dataset. The t-SNE plots show that agglomerative hierarchical clustering with complete linkage tends to break the main large cluster into smaller subsets, which means the algorithm may not be able to identify the main clusters in the dataset.

In [None]:
dm = pdist(hierarch_df,metric='euclidean')
Z = linkage(dm, method='single')
fig,ax = plt.subplots(figsize = (25,20))
d = dendrogram(Z,orientation = 'right', truncate_mode='lastp', p=80, no_labels=True, ax=ax)
ax.set_xlabel('Dissimilarity', fontsize = 18)
ax.set_ylabel('Samples (80 Leaves)', fontsize = 18)
plt.show()

In [None]:
tdf35 = tsne_df(XDF=X, DF=mdf, PR=35, RS=1000)
for n in range(2,15):
    ag = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage='single')
    Y_pred = ag.fit_predict(hierarch_df)
    df_pred = pd.Series(Y_pred, name='Cluster', index = hierarch_df.index)
    pdff = pd.concat([tdf35,df_pred], axis = 1)

    sns.scatterplot(x = 'x', y = 'y', hue='Cluster', palette=sns.color_palette('husl', n), data = pdff)
    plt.title('T-SNE Plot Color Coded by Single Linkage Hierarchical Clustering')
    plt.legend(bbox_to_anchor = (1,1))
    plt.show()

The second option explored was single linkage with 2 to 14 clusters. First, a dendrogram is created from the data (`mdf`) used for the hierarchical clustering. The dendrogram shows that the clusters are not well balanced at the 2-4 cluster levels. Beyond about 4 clusters, the dendrogram produces very small clusters, which may be more useful for identifying outlier groups in the dataset. The t-SNE plots show that agglomerative hierarchical clustering with single linkage tends to assign only a few observations to each new cluster. In other words, single linkage hierarchical clustering is finding outlier groups within the dataset, which means the algorithm may not be able to identify the main clusters.

In [None]:
dm = pdist(hierarch_df,metric='euclidean')
Z = linkage(dm, method='average')
fig,ax = plt.subplots(figsize = (25,20))
d = dendrogram(Z,orientation = 'right', truncate_mode='lastp', p=80, no_labels=True, ax=ax)
ax.set_xlabel('Dissimilarity', fontsize = 18)
ax.set_ylabel('Samples (80 Leaves)', fontsize = 18)
plt.show()

In [None]:
tdf35 = tsne_df(XDF=X, DF=mdf, PR=35, RS=1000)
for n in range(2,15):
    ag = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage='average')
    Y_pred = ag.fit_predict(hierarch_df)
    df_pred = pd.Series(Y_pred, name='Cluster', index = hierarch_df.index)
    pdff = pd.concat([tdf35,df_pred], axis = 1)

    sns.scatterplot(x = 'x', y = 'y', hue='Cluster', palette=sns.color_palette('husl', n), data = pdff)
    plt.title('T-SNE Plot Color Coded by Average Linkage Hierarchical Clustering')
    plt.legend(bbox_to_anchor = (1,1))
    plt.show()

The last option explored was average linkage with 2 to 14 clusters. First, a dendrogram is created from the data (`mdf`) used for the hierarchical clustering. The dendrogram shows branches that are more evenly spread out than those of single linkage, hinting at a more balanced cluster structure. Beyond about 4 clusters, the dendrogram produces very small clusters, which may be more useful for identifying outlier groups in the dataset. The t-SNE plots show that agglomerative hierarchical clustering with average linkage tends to assign only a few observations to each new cluster. In other words, average linkage hierarchical clustering is also finding outlier groups within the dataset. However, as the number of clusters increases to about 10, the clusters made by average linkage start to get larger and the underlying cluster structure of the dataset comes to light. Even so, this means the algorithm may not identify the main clusters efficiently.

In [None]:
ag = AgglomerativeClustering(n_clusters= 5, affinity='euclidean', linkage='complete')
Y_pred = ag.fit_predict(hierarch_df)
df_pred = pd.Series(Y_pred, name='Cluster', index = hierarch_df.index)
pdff = pd.concat([tdf35,df_pred], axis = 1)

sns.scatterplot(x = 'x', y = 'y', hue='Cluster', palette=sns.color_palette('husl', 5), data = pdff)
plt.title('T-SNE Plot Color Coded by Complete Linkage Hierarchical Clustering')
plt.legend(bbox_to_anchor = (1,1))
plt.show()

Overall, the t-SNE plots and the dendrograms provided a good deal of insight into choosing an adequate linkage. After comparing them, complete linkage with 5 clusters appears to produce reasonably sized clusters for this dataset. The complete linkage dendrogram had the most balanced branches, hinting that it would perform more effectively than the other linkages. Additionally, each cluster contains a large number of observations, so more concrete conclusions can be drawn from the unsupervised learning method. All in all, agglomerative hierarchical clustering with complete linkage and 5 clusters produced the best visual t-SNE and dendrogram results and will be used to derive insights from the data.
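
The visual judgment about how balanced each linkage is can also be checked numerically by cutting each dendrogram at a fixed number of clusters and tabulating the cluster sizes. The sketch below does this with SciPy's `fcluster` on synthetic data; the array `X_demo` and the choice of 3 clusters are assumptions for illustration, not the project's data or settings.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Synthetic stand-in for the clustered data: three Gaussian blobs of 100 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(6, 0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(3, 6), scale=1.0, size=(100, 2)),
])

dm = pdist(X_demo, metric='euclidean')
n_clusters = 3

size_table = {}
for method in ['single', 'complete', 'average']:
    Z = linkage(dm, method=method)
    # Cut the tree so that at most n_clusters flat clusters are formed (labels start at 1)
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    sizes = np.sort(np.bincount(labels)[1:])[::-1]
    size_table[method] = pd.Series(sizes)

# One column per linkage, cluster sizes sorted from largest to smallest
size_df = pd.DataFrame(size_table)
print(size_df)
```

A linkage whose column is dominated by one very large cluster and several tiny ones is mostly isolating outliers, while similar sizes down a column suggest a more balanced partition.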

### K-modes Clustering Algorithm

In [None]:
cost = []
for num_clusters in list(range(1,10)):
    print(num_clusters)
    cost_sub_list=[]
    for j in range(0,3):
        kmode = KModes(n_clusters=num_clusters)
        kmode.fit_predict(mdf_cat)
        cost_sub_list.append(kmode.cost_)
    cost.append(np.average(cost_sub_list))

In [None]:
plt.plot(list(range(1,10)),cost)
plt.xlabel('Number of Clusters')
plt.ylabel('Cost')
plt.title('Elbow Method for k-Modes')
plt.show()

Moving on to the k-modes clustering algorithm, it is first necessary to find a suitable number of clusters. Because k-modes is designed for categorical data, it should be well suited to clustering this dataset. Using a `for` loop, the cost of the k-modes algorithm is computed for a range of cluster counts, averaged over three runs per count, and then plotted to produce the elbow plot shown above. Over several iterations, the elbow fell between 4 and 6 clusters, hinting that any value in that range should yield reasonable results. Hence, 5 clusters were chosen for the k-modes algorithm.
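
For reference, the `kmodes` documentation describes `cost_` as the sum of the distances of all points to their respective cluster centroids. The following is a minimal NumPy sketch of that quantity under the simple matching (mismatch count) dissimilarity. `kmodes_cost` is a hypothetical helper written for illustration, not part of the `kmodes` package, and the toy array is made up.

```python
import numpy as np

def kmodes_cost(X, labels):
    """Sum of simple-matching dissimilarities between rows and their cluster mode."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    total = 0
    for c in np.unique(labels):
        members = X[labels == c]
        # Column-wise mode of the cluster (ties go to the smallest value)
        mode = np.empty(members.shape[1], dtype=members.dtype)
        for col in range(members.shape[1]):
            values, counts = np.unique(members[:, col], return_counts=True)
            mode[col] = values[np.argmax(counts)]
        # Count the attributes on which each member disagrees with the mode
        total += int((members != mode).sum())
    return total

# Toy integer-coded categorical data (hypothetical)
X_toy = np.array([[0, 0], [0, 1], [1, 1], [1, 1]])
labels_toy = np.array([0, 0, 1, 1])
print(kmodes_cost(X_toy, labels_toy))
```

Because every extra cluster gives observations a closer mode to match, this cost tends to fall as the number of clusters grows, which is why the elbow, rather than the minimum, is used to pick the cluster count.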

In [None]:
km = KModes(n_clusters=5,random_state=1000)
cluster_labels = km.fit_predict(mdf_cat)
mdf_cat['predicted_cluster']=cluster_labels
mdf_cat['predicted_cluster'].value_counts()

In [None]:
X_cat = mdf_cat.copy()
X_cat = X_cat.drop('predicted_cluster', axis=1)
tdf_kmode = tsne_df(XDF=X_cat, DF=mdf_cat, PR=35, RS=1000)

In [None]:
tsne_plot_hue(DFF=tdf_kmode, W=12, H=8, FS=1.5, HUE = 'predicted_cluster')

From the t-SNE plot above, the k-modes algorithm did a suboptimal job of clustering the dataset. For the most part, the clusters are quite spread out, with observations protruding into other cluster groups. Additionally, there is little cohesion and separation among the clusters, which could cause observations to be assigned to the wrong cluster. Overall, however, the clusters are still reasonably defined and each contains a large number of observations, which could yield insightful results.

## 6. Post-Analysis Questions

### Separation and Cohesion

In [None]:
def create_silhouette_plot(X, cluster_labels):
    #------------------------------------------------------
    #INPUT:
    #-----------------------------------------------------
    #X=dataframe of objects you clustered
    #cluster_labels=cluster labels of each of the objects in the dataset X that you just clustered 

    #Gets the unique labels in the cluster_labels
    clabels=np.unique(cluster_labels)
    #Gets the number of unique labels
    k=len(clabels)
    
    #-------------------------------------------------------
    #SETTING UP THE PLOT SPACE
    # Create a subplot with 1 row and 1 columns
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (k+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (k + 1) * 10])

    #FINDS THE SILHOUETTE SCORE FOR EACH OBJECT
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10

    
    for i in clabels:
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / k)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    plt.show()

In [None]:
ag = AgglomerativeClustering(n_clusters= 5, affinity='euclidean', linkage='complete')
Y_pred = ag.fit_predict(hierarch_df)
create_silhouette_plot(hierarch_df, Y_pred)

From the silhouette plot above, it is apparent that the clusters from agglomerative hierarchical clustering with complete linkage and 5 clusters had very good silhouette scores. Most of the scores were positive and close to 1 (most were greater than or equal to 0.5), which hints at good separation and cohesion in the assigned clusters. However, a large number of points in cluster 4 have negative silhouette scores, meaning that they are less cohesive and less separated from the observations in the other clusters. Overall, agglomerative hierarchical clustering with complete linkage provided cohesive and well-separated clusters.
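
To make the reading of these plots concrete, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster (cohesion) and b is its mean distance to the points of the nearest other cluster (separation). The sketch below computes this by hand on a made-up toy dataset and checks it against `silhouette_samples`; the arrays `X_toy` and `labels_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Toy data (hypothetical): two tight, well-separated groups of three points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X_toy)  # Euclidean by default
manual = np.zeros(len(X_toy))
for i in range(len(X_toy)):
    same = labels_toy == labels_toy[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = D[i, same].mean()  # cohesion: mean distance within the cluster
    b = min(D[i, labels_toy == c].mean()  # separation: nearest other cluster
            for c in np.unique(labels_toy) if c != labels_toy[i])
    manual[i] = (b - a) / max(a, b)

print(np.round(manual, 3))
print(np.allclose(manual, silhouette_samples(X_toy, labels_toy)))
```

A negative value therefore means a > b: the point is, on average, closer to another cluster than to its own.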

In [None]:
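# Fit on X_cat (mdf_cat without 'predicted_cluster') so earlier cluster labels are not used as a feature
km = KModes(n_clusters=5,random_state=1000)
cluster_labels = km.fit_predict(X_cat)
create_silhouette_plot(X_cat, cluster_labels)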

From the silhouette plot above, it is apparent that the clusters from the k-modes algorithm with 5 clusters had mediocre silhouette scores. Most of the scores were positive but much closer to 0 than to 1 (most were less than 0.4), which means the clusters were only moderately cohesive and separated. Each cluster also had numerous observations with negative silhouette scores, which is indicative of poor cohesion and separation. This is not too surprising, as the t-SNE plot showed how spread out each cluster was and how much it overlapped with other clusters. Overall, k-modes clustering with 5 clusters provided only average cohesion and separation.

### Cluster Attributes

In [None]:
ag = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='complete')
hierarch_df2 = mdf.copy()
hierarch_df2 = hierarch_df2.drop(['Title'], axis = 1)
Y_pred = ag.fit_predict(hierarch_df2)
cluster_labels = Y_pred
mdf_cat['predicted_cluster']=cluster_labels

for i in mdf_cat.columns:
    ctab = pd.crosstab(mdf_cat[i], mdf_cat['predicted_cluster'])
    print(ctab)
    print(ctab / ctab.sum())
    ctab.plot.bar()
    plt.title('{}'.format(i))
    plt.xlabel(i)
    plt.show()

In [None]:
# KModes plots
# Fit on X_cat so the agglomerative labels stored in 'predicted_cluster' are not used as a feature
km = KModes(n_clusters=5,random_state=1000)
cluster_labels = km.fit_predict(X_cat)
mdf_cat['predicted_cluster']=cluster_labels

for i in mdf_cat.columns:
    ctab = pd.crosstab(mdf_cat[i], mdf_cat['predicted_cluster'])
    print(ctab)
    print(ctab / ctab.sum())
    ctab.plot.bar()
    plt.title('{}'.format(i))
    plt.xlabel(i)
    plt.show()

# ctab = pd.crosstab(mdf_cat['Netflix'], mdf_cat['predicted_cluster'])
# print(ctab)
# print(ctab / ctab.sum())
# ctab.plot.bar()
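
The expression `ctab / ctab.sum()` in the cells above divides each column of the crosstab by its column total, so the shares are taken within each cluster and every cluster column sums to 1. The sketch below shows this on a made-up DataFrame (the `toy` data is hypothetical) and checks that `pd.crosstab(..., normalize='columns')` gives the same table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a binary platform flag and a cluster label per movie
toy = pd.DataFrame({
    'Netflix': [1, 0, 1, 1, 0, 0],
    'predicted_cluster': [0, 0, 0, 1, 1, 1],
})

# Counts of each attribute value per cluster
ctab = pd.crosstab(toy['Netflix'], toy['predicted_cluster'])

# Column-wise shares: the fraction of each cluster that has each attribute value
manual_share = ctab / ctab.sum()
builtin_share = pd.crosstab(toy['Netflix'], toy['predicted_cluster'], normalize='columns')

print(manual_share)
print(np.allclose(manual_share.values, builtin_share.values))
```

Read this way, a value such as 0.40 in the row for `1` means that 40% of the movies in that cluster have the attribute, not that 40% of the movies with the attribute fall in that cluster.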

### Description of Clustering Attributes for both Algorithms

#### Agglomerative Hierarchical Clustering

##### Highlight Summary

* Cluster 0:
    * Mostly available on Netflix and Prime Video. There are more genres per movie. The movies are mostly available in 1 country and 1-2 languages, and mostly directed by 1 director. Movies in this cluster target the Young/Adolescent/Adult audience. Drama, Romance, and Action are popular genres. These movies are mostly available in Asia, in Asian languages and some English. All movies are in the Regular-Short (1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 6-8 stars range.
* Cluster 1:
    * Mostly available on Netflix and Prime Video. There are more genres per movie. The movies are mostly available in 1 country and 1-2 languages, and mostly directed by 1 director. Movies in this cluster target the Young/Adolescent/Adult audience. Drama, Comedy, and Action are popular genres. These movies are mostly available in North America, Europe, and Asia, in English, European, and Asian languages. All movies are in the Regular-Short (1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 6-8 stars range.
* Cluster 2:
    * Mostly available on Prime Video. There are fewer genres per movie. The movies are mostly available in 1 country and language, and mostly directed by 1 director. Movies in this cluster target the Young/All ages audience. Western, Drama, and Comedy are popular genres. These movies are mostly available in North America and in English. Most movies are in the Short (0-1.5 hours) range. Most movies were made 60-90 years ago. Average IMDb ratings are in the 5-7 stars range.
* Cluster 3:
    * Mostly available on Netflix and Prime Video. There are fewer genres per movie. The movies are mostly available in 1 country and language, and mostly directed by 1 director. Movies in this cluster target the Young/All/Adult audience. Comedy, Documentary, and Short are popular genres. These movies are mostly available in North America and in English. All movies are in the Short (0-1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 6-8 stars range.
* Cluster 4:
    * Mostly available on Netflix and Prime Video. There are fewer genres per movie. The movies are mostly available in 1 country and language, and mostly directed by 1 director. Movies in this cluster target the Adult audience and some younger audiences. Drama, Comedy, and Thriller are popular genres. These movies are mostly available in North America and Europe, and in English. There is a wide range of movie lengths, with slightly more in the Regular-Long (1.5-1.75 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 5-7 stars range.

##### Exhaustive Summary

Cluster 0:
 
* Streaming Platforms
    * Netflix (40.2%), Hulu (2.4%), Prime Video (61.2%), Disney+ (2.9%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (2: 23.4%, 3: 38.8%, 4: 21.1%).
    * Total Countries (1: 89.0%, 2: 7.7%).
    * Total Languages (1: 62.2%, 2: 20.1%).
    * Total Directors (1: 94.3%, 2: 4.3%).
* Age ratings
    * All ages (19.1%), 7+ (31.1%), 13+ (25.8%), 16+ (2.4%), 18+ (21.5%).
* Genres
    * Action (34.5%), Adventure (8.6%), Sci-Fi (4.3%), Thriller (20.6%), Comedy (31.1%), Western (1.9%), Animation (0.5%), Family (7.7%), Biography (7.7%), Drama (78.0%), Music (2.9%), War (7.2%), Crime (12.9%), Fantasy (5.7%), Romance (36.8%), History (9.1%), Mystery (4.8%), Sport (3.3%), Documentary (1.0%), Musical (14.4%), News (0%), Horror (1.9%), Short (0%), Film-Noir (0%), Reality TV (0%).
* Available Regions
    * North America (22.5%), South America (0%), Europe (14.4%), Asia (71.3%), Australia & Pacific (1.0%), Africa (0%).
* Available Languages
    * English (39.2%), French (10.0%), Spanish (4.3%), Arabic (1.0%), Portuguese (0.5%), North American (0%), South American (0%), European (15.3%), Asian (75.1%), Australian & Pacific (0%), African (1.0%), Sign Language (1.0%), Other (0.5%).
* Runtime Length
    * Short (0%), Regular-Short (100%), Regular-Long (0%), Long (0%).
* Movie Age
    * 0-10 years (44.0%), 10-20 years (37.3%).
* IMDb rating
    * 6-7 stars (27.3%), 7-8 stars (31.6%).

Cluster 1:

* Streaming Platforms
    * Netflix (41.4%), Hulu (7.9%), Prime Video (51.2%), Disney+ (8.0%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (2: 27.2%, 3: 34.9%, 4: 17.5%).
    * Total Countries (1: 71.3%, 2: 14.4%).
    * Total Languages (1: 63.0%, 2: 19.8%).
    * Total Directors (1: 95.6%, 2: 3.5%).
* Age ratings
    * All ages (9.4%), 7+ (18.6%), 13+ (27.0%), 16+ (2.3%), 18+ (42.7%).
* Genres
    * Action (25.4%), Adventure (13.1%), Sci-Fi (9.4%), Thriller (23.0%), Comedy (28.2%), Western (4.1%), Animation (1.1%), Family (6.5%), Biography (13.3%), Drama (69.3%), Music (3.8%), War (9.4%), Crime (14.8%), Fantasy (6.3%), Romance (24.5%), History (10.4%), Mystery (8.5%), Sport (4.2%), Documentary (2.9%), Musical (3.6%), News (0.3%), Horror (4.1%), Short (0%), Film-Noir (0%), Reality TV (0%).
* Available Regions
    * North America (56.6%), South America (1.2%), Europe (28.7%), Asia (34.3%), Australia & Pacific (1.8%), Africa (1.2%).
* Available Languages
    * English (70.4%), French (11.8%), Spanish (9.5%), Arabic (1.7%), Portuguese (1.2%), North American (0.5%), South American (0.3%), European (18.1%), Asian (36.6%), Australian & Pacific (0.5%), African (0.8%), Sign Language (0.6%), Other (0.2%).
* Runtime Length
    * Short (0%), Regular-Short (100%), Regular-Long (0%), Long (0%).
* Movie Age
    * 0-10 years (56.0%), 10-20 years (28.4%).
* IMDb rating
    * 6-7 stars (34.6%), 7-8 stars (33.2%).

Cluster 2:
 
* Streaming Platforms
    * Netflix (0%), Hulu (0%), Prime Video (83.5%), Disney+ (16.5%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (1: 14.7%, 2: 25.7%, 3: 24.8%, 4: 18.3%).
    * Total Countries (1: 94.5%, 2: 4.6%).
    * Total Languages (1: 92.7%, 2: 6.4%).
    * Total Directors (1: 86.2%, 2: 7.3%).
* Age ratings
    * All ages (67.0%), 7+ (28.4%), 13+ (3.7%), 16+ (0%), 18+ (0.9%).
* Genres
    * Action (22.0%), Adventure (23.9%), Sci-Fi (7.3%), Thriller (13.8%), Comedy (30.3%), Western (35.8%), Animation (11.0%), Family (22.0%), Biography (1.8%), Drama (34.9%), Music (11.9%), War (3.7%), Crime (10.1%), Fantasy (12.8%), Romance (20.2%), History (1.8%), Mystery (9.2%), Sport (1.8%), Documentary (6.4%), Musical (7.3%), News (0%), Horror (8.3%), Short (1.8%), Film-Noir (2.8%), Reality TV (0%).
* Available Regions
    * North America (92.7%), South America (0%), Europe (8.3%), Asia (0.9%), Australia & Pacific (0%), Africa (0.9%).
* Available Languages
    * English (96.3%), French (3.7%), Spanish (1.8%), Arabic (0%), Portuguese (0%), North American (0%), South American (0%), European (4.6%), Asian (0.9%), Australian & Pacific (0%), African (0%), Sign Language (0.9%), Other (0.9%).
* Runtime Length
    * Short (89.0%), Regular-Short (0%), Regular-Long (5.5%), Long (5.5%).
* Movie Age
    * 60-70 years (27.5%), 70-80 years (28.4%), 80-90 years (30.3%).
* IMDb rating
    * 5-6 stars (26.6%), 6-7 stars (32.1%).

Cluster 3:
 
* Streaming Platforms
    * Netflix (26.3%), Hulu (4.1%), Prime Video (62.5%), Disney+ (9.9%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (1: 39.6%, 2: 31.4%).
    * Total Countries (1: 90.1%, 2: 8.5%).
    * Total Languages (1: 92.5%, 2: 5.8%).
    * Total Directors (1: 83.6%, 2: 14.3%).
* Age ratings
    * All ages (25.3%), 7+ (22.2%), 13+ (10.9%), 16+ (5.5%), 18+ (36.2%).
* Genres
    * Action (6.5%), Adventure (8.5%), Sci-Fi (2.4%), Thriller (3.4%), Comedy (46.4%), Western (0.3%), Animation (19.1%), Family (19.8%), Biography (3.4%), Drama (9.9%), Music (4.8%), War (1.4%), Crime (3.4%), Fantasy (7.2%), Romance (2.0%), History (4.8%), Mystery (1.4%), Sport (2.0%), Documentary (46.4%), Musical (3.8%), News (0%), Horror (0%), Short (24.2%), Film-Noir (0%), Reality TV (0%).
* Available Regions
    * North America (83.6%), South America (0.3%), Europe (16.7%), Asia (5.5%), Australia & Pacific (0%), Africa (0%).
* Available Languages
    * English (95.9%), French (1.0%), Spanish (0.7%), Arabic (1.0%), Portuguese (0.3%), North American (0%), South American (0%), European (4.1%), Asian (3.8%), Australian & Pacific (0.3%), African (0%), Sign Language (1.4%), Other (1.0%).
* Runtime Length
    * Short (100%), Regular-Short (0%), Regular-Long (0%), Long (0%).
* Movie Age
    * 0-10 years (67.6%), 10-20 years (23.9%).
* IMDb rating
    * 6-7 stars (29.0%), 7-8 stars (33.8%).

Cluster 4:
 
* Streaming Platforms
    * Netflix (20.3%), Hulu (8.6%), Prime Video (69.6%), Disney+ (6.6%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (1: 22.7%, 2: 29.4%, 3: 25.4%).
    * Total Countries (1: 77.9%, 2: 15.6%).
    * Total Languages (1: 82.0%, 2: 12.5%).
    * Total Directors (1: 90.5%, 2: 8.1%).
* Age ratings
    * All ages (9.4%), 7+ (19.3%), 13+ (16.0%), 16+ (4.7%), 18+ (50.7%).
* Genres
    * Action (19.5%), Adventure (14.9%), Sci-Fi (10.3%), Thriller (26.1%), Comedy (32.4%), Western (2.1%), Animation (6.3%), Family (14.4%), Biography (4.5%), Drama (44.0%), Music (3.4%), War (2.6%), Crime (13.2%), Fantasy (9.9%), Romance (13.7%), History (2.8%), Mystery (8.8%), Sport (2.8%), Documentary (9.1%), Musical (2.3%), News (0.5%), Horror (17.9%), Short (0%), Film-Noir (0.1%), Reality TV (0.0%).
* Available Regions
    * North America (80.2%), South America (1.0%), Europe (22.7%), Asia (8.7%), Australia & Pacific (3.0%), Africa (0.9%).
* Available Languages
    * English (91.5%), French (4.8%), Spanish (5.8%), Arabic (1.0%), Portuguese (0.5%), North American (0.3%), South American (0.0%), European (8.5%), Asian (8.9%), Australian & Pacific (0.1%), African (0.3%), Sign Language (0.3%), Other (0.1%).
* Runtime Length
    * Short (24.2%), Regular-Short (14.9%), Regular-Long (34.0%), Long (26.8%).
* Movie Age
    * 0-10 years (51.7%), 10-20 years (21.7%).
* IMDb rating
    * 5-6 stars (27.0%), 6-7 stars (29.2%).


#### KModes Clustering Attributes

##### Highlight Summary

* Cluster 0:
    * Mostly available on Prime Video. There are fewer genres per movie. The movies are mostly available in 1 country and language, and mostly directed by 1 director. Movies in this cluster target Adult audiences (18+). Drama, Thriller, and Comedy are popular genres. These movies are mostly available in North America and in English. There is a wide range of movie lengths, with slightly more in the Regular-Long (1.5-1.75 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 5-7 stars range.
* Cluster 1:
    * Mostly available on Disney+. There are more genres per movie. The movies are mostly available in 1 country and language, and mostly directed by 1 director. Movies in this cluster target the Young/All audience. Family, Comedy, and Adventure are popular genres. These movies are mostly available in North America and in English. There is a wide range of movie lengths, with slightly more in the Short (0-1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 5-6 stars range.
* Cluster 2:
    * Mostly available on Prime Video. There are more genres per movie. The movies are mostly available in 1-2 countries and languages, and mostly directed by 1 director. Movies in this cluster target the Young/Adolescent/Adult audience. Drama, Thriller, and Romance are popular genres. These movies are mostly available in Europe and some in North America, in English and some French. There is a wide range of movie lengths, with slightly more in the Regular-Short (1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 6-7 stars range.
* Cluster 3:
    * Mostly available on Netflix. There are more genres per movie. The movies are mostly available in 1-2 countries and 1 language, and mostly directed by 1-2 directors. Movies in this cluster target the Young/All audience. Family, Adventure, and Animation are popular genres. These movies are mostly available in North America and a little in Europe, in English and a little French. Most movies are in the Short (0-1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 6-8 stars range.
* Cluster 4:
    * Mostly available on Netflix and Prime Video. There are fewer genres per movie. The movies are mostly available in 1 country and language, and mostly directed by 1 director. Movies in this cluster target the Young/Adolescent/Adult audience. Drama, Action, and Comedy are popular genres. These movies are mostly available in Asia, in Asian languages and some English. Most movies are in the Regular-Short (1.5 hours) range. Most movies were made in the last 20 years. Average IMDb ratings are in the 7-8 stars range.

##### Exhaustive Summary

Cluster 0:

* Streaming Platforms
    * Netflix (17.3%), Hulu (7.7%), Prime Video (79.2%), Disney+ (0.7%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (1: 27.7%, 2: 37.3%, 3: 19.6%).
    * Total Countries (1: 84.0%, 2: 11.9%).
    * Total Languages (1: 85.3%, 2: 10.1%).
    * Total Directors (1: 91.9%, 2: 7.3%).
* Age ratings
    * All ages (5.7%), 7+ (11.6%), 13+ (13.7%), 16+ (4.6%), 18+ (64.4%).
* Genres
    * Action (19.7%), Adventure (8.4%), Sci-Fi (10.3%), Thriller (28.7%), Comedy (28.7%), Western (3.1%), Animation (1.1%), Family (3.6%), Biography (4.0%), Drama (38.9%), Music (3.2%), War (2.1%), Crime (14.0%), Fantasy (4.8%), Romance (11.4%), History (2.3%), Mystery (8.4%), Sport (2.2%), Documentary (12.3%), Musical (0.8%), News (0.5%), Horror (20.8%), Short (0.7%), Film-Noir (0.1%), Reality TV (0.0%).
* Available regions
    * North America (92.0%), South America (0.9%), Europe (12.5%), Asia (4.1%), Australia & Pacific (2.2%), Africa (0.7%).
* Available languages
    * English (97.0%), French (3.5%), Spanish (5.4%), Arabic (0.5%), Portuguese (0.5%), North American (0.3%), South American (0.0%), European (6.3%), Asian (4.6%), Australian & Pacific (0.0%), African (0.2%), Sign Language (0.4%), Other (0.2%).
* Runtime length
    * Short (26.9%), Regular-Short (13.9%), Regular-Long (36.5%), Long (22.7%).
* Movie age
    * 0-10 years (52.1%), 10-20 years (21.6%).
* IMDb rating
    * 5-6 stars (25.5%), 6-7 stars (27.8%).

Cluster 1:

* Streaming Platforms
    * Netflix (17.0%), Hulu (9.3%), Prime Video (18.3%), Disney+ (57.6%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (3: 38.5%, 4: 17.3%).
    * Total Countries (1: 82.4%, 2: 13.5%).
    * Total Languages (1: 85.9%, 2: 10.1%).
    * Total Directors (1: 88.0%, 2: 8.7%).
* Age ratings
    * All ages (38.9%), 7+ (51.4%), 13+ (7.2%), 16+ (2.0%), 18+ (0.5%).
* Genres
    * Action (13.5%), Adventure (32.0%), Sci-Fi (12.6%), Thriller (3.3%), Comedy (64.2%), Western (2.6%), Animation (15.6%), Family (72.5%), Biography (3.0%), Drama (29.9%), Music (6.2%), War (0.8%), Crime (4.8%), Fantasy (19.7%), Romance (17.0%), History (1.2%), Mystery (3.9%), Sport (7.5%), Documentary (6.2%), Musical (11.0%), News (0.2%), Horror (2.9%), Short (2.0%), Film-Noir (0%), Reality TV (0%).
* Available regions
    * North America (97.4%), South America (0%), Europe (7.8%), Asia (2.3%), Australia & Pacific (3.2%), Africa (0.8%).
* Available languages
    * English (98.9%), French (3.6%), Spanish (4.4%), Arabic (0.6%), Portuguese (0.3%), North American (0%), South American (0%), European (6.0%), Asian (3.2%), Australian & Pacific (0.2%), African (0.6%), Sign Language (0.6%), Other (0.2%).
* Runtime length
    * Short (41.7%), Regular-Short (16.4%), Regular-Long (22.6%), Long (19.7%).
* Movie age
    * 0-10 years (38.5%), 10-20 years (29.9%).
* IMDb rating
    * 5-6 stars (40.6%), 6-7 stars (27.1%).


Cluster 2:

* Streaming Platforms
    * Netflix (18.8%), Hulu (9.3%), Prime Video (77.1%), Disney+ (1.3%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (1: 16.1%, 2: 15.1%, 3: 43.2%, 4: 17.6%).
    * Total Countries (1: 53.4%, 2: 27.7%).
    * Total Languages (1: 64.2%, 2: 21.8%).
    * Total Directors (1: 93.7%, 2: 5.4%).
* Age ratings
    * All ages (8.6%), 7+ (20.3%), 13+ (29.4%), 16+ (5.8%), 18+ (36.0%).
* Genres
    * Action (14.6%), Adventure (12.9%), Sci-Fi (6.5%), Thriller (28.5%), Comedy (18.4%), Western (3.5%), Animation (2.1%), Family (5.1%), Biography (13.0%), Drama (77.1%), Music (5.0%), War (9.6%), Crime (14.8%), Fantasy (6.7%), Romance (21.3%), History (10.9%), Mystery (12.6%), Sport (3.3%), Documentary (9.2%), Musical (2.0%), News (0.5%), Horror (9.9%), Short (1.4%), Film-Noir (0%), Reality TV (0%).
* Available regions
    * North America (40.8%), South America (1.8%), Europe (78.2%), Asia (6.5%), Australia & Pacific (5.1%), Africa (2.1%).
* Available languages
    * English (84.6%), French (14.0%), Spanish (9.9%), Arabic (2.9%), Portuguese (1.0%), North American (0.2%), South American (0.3%), European (25.7%), Asian (7.6%), Australian & Pacific (0.5%), African (1.2%), Sign Language (0.3%), Other (0.3%).
* Runtime length
    * Short (14.6%), Regular-Short (46.5%), Regular-Long (12.7%), Long (26.3%).
* Movie age
    * 0-10 years (52.2%), 10-20 years (23.0%).
* IMDb rating
    * 5-6 stars (21.3%), 6-7 stars (40.7%), 7-8 stars (20.9%).


Cluster 3:

* Streaming Platforms
    * Netflix (53.5%), Hulu (13.5%), Prime Video (19.1%), Disney+ (20.9%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (4: 36.9%, 5: 30.1%, 6: 17.0%).
    * Total Countries (1: 62.1%, 2: 23.8%).
    * Total Languages (1: 74.8%, 2: 16.7%).
    * Total Directors (1: 61.7%, 2: 28.0%).
* Age ratings
    * All ages (33.7%), 7+ (57.1%), 13+ (8.2%), 16+ (7.1%), 18+ (0.4%).
* Genres
    * Action (24.1%), Adventure (81.9%), Sci-Fi (16.3%), Thriller (2.8%), Comedy (66.3%), Western (1.8%), Animation (77.7%), Family (86.2%), Biography (0.4%), Drama (15.2%), Music (2.5%), War (0.4%), Crime (3.2%), Fantasy (71.6%), Romance (7.8%), History (0.4%), Mystery (5.3%), Sport (0.4%), Documentary (0%), Musical (15.2%), News (0%), Horror (2.5%), Short (4.3%), Film-Noir (0%), Reality TV (0%).
* Available regions
    * North America (87.6%), South America (0.7%), Europe (27.0%), Asia (15.2%), Australia & Pacific (2.5%), Africa (0.4%).
* Available languages
    * English (94.0%), French (9.9%), Spanish (6.4%), Arabic (0.4%), Portuguese (1.1%), North American (0.7%), South American (0%), European (10.3%), Asian (11.0%), Australian & Pacific (0.4%), African (0%), Sign Language (0%), Other (0%).
* Runtime length
    * Short (52.1%), Regular-Short (11.3%), Regular-Long (17.7%), Long (18.8%).
* Movie age
    * 0-10 years (56.4%), 10-20 years (23.0%).
* IMDb rating
    * 6-7 stars (27.7%), 7-8 stars (30.9%).


Cluster 4:

* Streaming Platforms
    * Netflix (59.9%), Hulu (4.1%), Prime Video (44.5%), Disney+ (0.2%).
* Total Genres, Countries, Languages, and Directors
    * Total Genres (2: 27.8%, 3: 36.2%).
    * Total Countries (1: 88.0%, 2: 8.1%).
    * Total Languages (1: 71.1%, 2: 18.4%).
    * Total Directors (1: 93.1%, 2: 5.6%).
* Age ratings
    * All ages (14.4%), 7+ (25.0%), 13+ (30.1%), 16+ (4.1%), 18+ (26.5%).
* Genres
    * Action (36.2%), Adventure (9.4%), Sci-Fi (5.0%), Thriller (20.4%), Comedy (35.9%), Western (0.3%), Animation (6.9%), Family (5.8%), Biography (5.3%), Drama (66.8%), Music (1.7%), War (4.2%), Crime (14.5%), Fantasy (7.0%), Romance (29.0%), History (5.3%), Mystery (6.4%), Sport (2.8%), Documentary (2.0%), Musical (5.0%), News (0%), Horror (6.4%), Short (0.3%), Film-Noir (0%), Reality TV (0%).
* Available regions
    * North America (10.8%), South America (0.9%), Europe (3.9%), Asia (92.0%), Australia & Pacific (0.6%), Africa (0.3%).
* Available languages
    * English (22.5%), French (3.3%), Spanish (2.3%), Arabic (1.6%), Portuguese (0.6%), North American (0.2%), South American (0%), European (3.9%), Asian (92.2%), Australian & Pacific (0%), African (0.2%), Sign Language (0.2%), Other (0%).
* Runtime length
    * Short (5.8%), Regular-Short (71.6%), Regular-Long (8.3%), Long (14.4%).
* Movie age
    * 0-10 years (59.6%), 10-20 years (20.1%).
* IMDb rating
    * 6-7 stars (27.0%), 7-8 stars (38.5%).
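
As a rough sketch of how per-cluster percentages like those listed above could be tabulated, the snippet below averages 0/1 dummy columns within each cluster label. The toy DataFrame and its column names (`Netflix`, `Family`, `cluster`) are made up for illustration and are not the project's data, and this is not necessarily how the figures above were produced.

```python
import pandas as pd

# Hypothetical toy data: 0/1 dummy columns plus a cluster label per movie.
toy = pd.DataFrame({
    "Netflix": [1, 1, 0, 0, 1, 1],
    "Family":  [0, 1, 1, 1, 0, 1],
    "cluster": [0, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column within a cluster is the share of that cluster's
# movies having the attribute; multiply by 100 to express it as a percentage.
profile = toy.groupby("cluster").mean().mul(100).round(1)
print(profile)
```

Each row of `profile` is one cluster and each column is one attribute, which matches the layout of the bullet lists above.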


### Cluster Distances

Since this dataset did not include pre-assigned class labels, the clustering results could not be scored against ground truth. Instead, cluster distances and other internal cluster metrics were used to assess the quality of each algorithm's clustering.

In [None]:
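# Agglomerative clustering with complete linkage and Euclidean distance.
# Note: newer scikit-learn releases rename `affinity` to `metric`.
ag = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='complete')
Y_pred = ag.fit_predict(hierarch_df)
# Condensed Euclidean distance matrix and the matching complete-linkage tree.
dm = pdist(hierarch_df, metric='euclidean')
Z = linkage(dm, method='complete')
# Cophenetic distances implied by the dendrogram.
coph = cophenet(Z)
# Correlation between original and cophenetic distances, then the silhouette score.
print(np.corrcoef(dm, coph))
print(silhouette_score(hierarch_df, Y_pred))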

The first algorithm assessed was agglomerative hierarchical clustering with complete linkage. Because this is a hierarchical method, the cophenetic correlation coefficient can be used to judge how faithfully the dendrogram represents the dataset. The Euclidean distance matrix was computed for the dataset, and the cophenetic distances were obtained from the complete-linkage tree. The correlation between these two sets of distances is the off-diagonal entry of the correlation matrix printed above. That value is quite close to 1, which suggests that the dendrogram preserves the pairwise distances of the original dataset well. Additionally, the average silhouette score for complete linkage with 5 clusters was about 0.35, which suggests that the clusters are reasonably cohesive and well separated. The silhouette score also serves as an indirect measure of cluster distances, since more cohesive and better-separated clusters imply larger distances between clusters.
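
To make the cophenetic correlation concrete, here is a minimal sketch on synthetic data (the two Gaussian groups are an assumption standing in for `hierarch_df`). It shows that the coefficient returned by SciPy's `cophenet` equals the off-diagonal entry of the `np.corrcoef` matrix used in the cell above.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in data (an assumption): two well-separated 2-D groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

dm = pdist(X, metric="euclidean")    # condensed pairwise distances
Z = linkage(dm, method="complete")   # complete-linkage merge tree

# Passing the original distances makes cophenet return the coefficient
# together with the condensed cophenetic distances.
c, coph = cophenet(Z, dm)

# The coefficient is the Pearson correlation between the two vectors,
# i.e. the off-diagonal entry of the 2x2 matrix from np.corrcoef.
print(c, np.corrcoef(dm, coph)[0, 1])
```

Reading the single coefficient `c` avoids having to pick the right entry out of the 2x2 correlation matrix.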

In [None]:
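# Fit k-modes with 5 clusters (fixed seed for reproducibility).
km = KModes(n_clusters=5, random_state=1000)
cluster_labels = km.fit_predict(mdf_cat)
# Average silhouette score across all observations.
silhouette_avg = silhouette_score(mdf_cat, cluster_labels)
print(silhouette_avg)
# Per-cluster silhouette plot.
create_silhouette_plot(mdf_cat, cluster_labels)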

The cophenetic correlation coefficient is not available for k-modes, because k-modes is not hierarchical and produces no dendrogram. Instead, the average silhouette score and the silhouette plot were examined once again. The average silhouette score is about 0.1, which indicates only weak cohesion and separation among the clusters. Hence, the distances between clusters must be relatively small, since poorly separated, loosely cohesive clusters overlap with one another.
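
For reference, the silhouette value of a single observation is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes it by hand on made-up categorical records. It uses the simple matching dissimilarity (the number of mismatched attributes), which is an assumption made here because that is the dissimilarity k-modes itself works with; the `silhouette_score` call above uses scikit-learn's default Euclidean metric instead. The function names are illustrative only.

```python
def matching_distance(a, b):
    """Count the positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def silhouette_values(rows, labels, dist):
    """Return s(i) = (b - a) / max(a, b) for every row."""
    values = []
    for i, (row, lab) in enumerate(zip(rows, labels)):
        # Distances from row i to every other row, grouped by cluster label.
        by_cluster = {}
        for j, (other, other_lab) in enumerate(zip(rows, labels)):
            if i != j:
                by_cluster.setdefault(other_lab, []).append(dist(row, other))
        own = by_cluster.pop(lab, [])
        if not own or not by_cluster:
            # Singleton cluster or no other cluster: score 0 in this sketch.
            values.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


# Hypothetical records: (platform, genre, age rating).
rows = [
    ("Disney+", "Family", "7+"),
    ("Disney+", "Family", "all"),
    ("Netflix", "Drama", "18+"),
    ("Netflix", "Drama", "16+"),
]
labels = [0, 0, 1, 1]
values = silhouette_values(rows, labels, matching_distance)
print(sum(values) / len(values))
```

In this toy case every record differs from its cluster mate in one attribute and from the other cluster in all three, so each record scores (3 - 1) / 3, about 0.67.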

## 7. Analysis Summary

### Algorithm Comparison Summary

In [None]:
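# KModes plots
km = KModes(n_clusters=5, random_state=1000)
cluster_labels = km.fit_predict(mdf_cat)
# Capture the feature columns before adding the label column, so the
# labels are not cross-tabulated against themselves.
feature_cols = list(mdf_cat.columns)
mdf_cat['predicted_cluster'] = cluster_labels

for i in feature_cols:
    # Counts of each category level per cluster.
    ctab = pd.crosstab(mdf_cat[i], mdf_cat['predicted_cluster'])
    print(ctab)
    # Column-normalized: share of each cluster falling in each level.
    print(ctab / ctab.sum())
    ctab.plot.bar()
    plt.title('{}'.format(i))
    plt.xlabel(i)
    plt.show()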

From the results throughout this report, it is apparent that k-modes did a much better job of identifying groups within the data than agglomerative hierarchical clustering. In terms of the research motivations, several intuitive relationships were expected before the analysis and needed to be substantiated with unsupervised learning. For instance, a cluster dominated by the `Disney+` streaming service should also be dominated by the `family` and `animation` genres and by ratings aimed at younger age groups. This specific example was confirmed more clearly by k-modes clustering than by agglomerative hierarchical clustering. The research goal of this project was to understand the underlying groups within streaming platforms in order to better cater to them. With this understanding, recommendations, such as different movie genres, can be made to provide a more enjoyable viewing experience. Overall, the k-modes algorithm better served the research goals of this analysis and provided informative results.

### Insights Summary

In conclusion, this project provided a variety of insights into not only the dataset, but also the ways in which unsupervised learning methods can be applied and interpreted. At the beginning of the report, exploratory data analysis and a data processing/cleaning phase were necessary to prepare for the unsupervised learning analysis. First, descriptive statistics were calculated for all the numerical data in the dataset. From the output, it was evident that each feature had a large standard deviation. Additionally, after the creation of dummy variables, the standard deviations varied widely from feature to feature. To address this, the data was scaled/standardized to ensure a fairer and more holistic analysis. Furthermore, some basic pairwise relationships were examined to see if there were any correlations between variables of interest. Specifically, there was a strong positive correlation between `Disney+` and the `Family` genre, which makes sense intuitively.
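
The summary does not restate which scaler was used, so the following is only a minimal sketch of one common choice, z-score standardization, on a made-up feature matrix. It illustrates why a numeric column and a 0/1 dummy column have very different spreads before scaling and equal spreads after.

```python
import numpy as np

# Hypothetical feature matrix: column 0 is a runtime in minutes,
# column 1 is a 0/1 dummy variable. The values are made up.
X = np.array([
    [ 90.0, 1.0],
    [120.0, 0.0],
    [150.0, 0.0],
    [ 60.0, 1.0],
])

# Z-score each column: subtract its mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.std(axis=0))         # spreads differ widely before scaling
print(X_scaled.std(axis=0))  # every column now has unit spread
```

Without this step, a distance-based method such as Euclidean agglomerative clustering would be driven almost entirely by the large-valued column.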

After the preliminary exploratory data analysis, the structure of the dataset was examined. The two datasets, `mdf` and `mdf_cat`, were each plotted with t-SNE to discover their shape. Unfortunately, both datasets appeared to consist of only 1-2 clusters and were approximately spherical in shape. Additionally, there was a large imbalance in cluster sizes, with many clusters overlapping one another. Because of the shape and type of the dataset, two main clustering algorithms were applied: agglomerative hierarchical clustering and k-modes clustering. From further analysis using silhouette scores, dendrograms, elbow plots, and t-SNE plots, it was found that 5 clusters for the k-modes algorithm and 5 clusters with complete linkage for the agglomerative hierarchical clustering algorithm provided the best results.
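
As an illustration of the silhouette part of such a comparison, the sketch below sweeps the number of clusters for complete-linkage agglomerative clustering and keeps the value with the highest average silhouette score. The three synthetic blobs are an assumption used only to make the example self-contained; this is not the project's data, and it is not claimed to be the exact procedure followed in the report.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption): three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    # Complete linkage with the default Euclidean distance.
    labels = AgglomerativeClustering(n_clusters=k, linkage="complete").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real data the maximum is rarely this clear-cut, which is why the silhouette sweep is usually read alongside dendrograms and elbow plots rather than on its own.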

After choosing the final clustering, the separation and cohesion of the clusters were assessed using silhouette scores and plots. From this analysis, it was found that agglomerative hierarchical clustering had much better separation and cohesion than the k-modes algorithm. However, this does not mean that hierarchical clustering performed better than k-modes. When looking at the cluster attributes, it was apparent that many of the clusters created by k-modes made sense. In other words, k-modes was much better at identifying streaming platforms as cluster groups, while hierarchical clustering tended to muddle the streaming platforms together. Additionally, some of the clusters from hierarchical clustering didn't make sense, such as a group dominated by both younger and older audiences who watch comedies and documentaries. This is not intuitive and suggests inaccurate groupings. On the other hand, k-modes provided cluster groups that were interpretable. For instance, there was a `Disney+` cluster group that had younger audiences and adventure, animation, and family genre movies. This result makes intuitive sense and is more accurate. Lastly, looking at the cluster distances of each method, the higher average silhouette score of agglomerative hierarchical clustering indicates that its cluster distances were larger than those of k-modes clustering.

All in all, this clustering analysis addressed the research motivation and shed light on underlying groups within the streaming platforms dataset, which will help in catering and recommending genres to different groups.
