# INTRODUCTION

In a nutshell, I will use Principal Component Analysis (PCA) to analyse the TMDB dataset and then apply KMeans clustering to visualise any clusters that show up in it. This notebook is purely exploratory: a hopefully concise attempt to explain the idea behind PCA and to use a clustering method (KMeans) to extract meaningful relationships from the projected data.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.decomposition import PCA # Principal Component Analysis module
from sklearn.cluster import KMeans # KMeans clustering 
import matplotlib.pyplot as plt # Python's de facto plotting library
import seaborn as sns # More snazzy plotting library
%matplotlib inline 

Let's load the movie dataset into a dataframe called `tmdb_movie` (the credits file goes into `tmdb_credits`) and inspect the first 5 rows of the dataframe with `.head()`.

In [None]:
tmdb_movie = pd.read_csv('../input/tmdb_5000_movies.csv')
tmdb_credits = pd.read_csv('../input/tmdb_5000_credits.csv')

In [None]:
tmdb_movie.head()

In [None]:
tmdb_movie.tail()

# 1. DATA FILTERING AND CLEANSING
**Filtering for Numerical values only**

As observed from the dataframe above, some columns contain numbers while others, words. Let's do some filtering to extract only the numbered columns and not the ones with words.

In [None]:
str_list = [] # empty list to contain columns with strings (words)
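for colname, colvalue in tmdb_movie.items(): # .items() replaces .iteritems(), which was removed in pandas 2.0
    if isinstance(colvalue.iloc[1], str): # inspect one sample value (second row) per column
        str_list.append(colname)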
# Get to the numeric columns by inversion            
num_list = tmdb_movie.columns.difference(str_list)         
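The loop above decides whether a column is text by looking at a single sample value. As an alternative sketch (not what this notebook does), pandas can select columns by dtype directly with `select_dtypes`. The toy dataframe and its values below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tmdb_movie (column names and values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 250, 0],
    "vote_average": [6.1, 7.4, np.nan],
    "genres": ["[]", "[]", "[]"],
})

# Keep every column whose dtype is numeric (ints and floats)
toy_num = toy.select_dtypes(include="number")
num_cols = list(toy_num.columns)
print(num_cols)
```

Unlike `columns.difference`, which returns the column names sorted, `select_dtypes` keeps the original column order.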

Now create a new dataframe (`movie_num`) containing just the numeric columns:

In [None]:
movie_num = tmdb_movie[num_list]
#del movie # Get rid of movie df as we won't need it now
movie_num.head()

In [None]:
movie_num.info()

**Removal of Null values**

Here I will do the naive thing and simply replace the NaNs with zeros:

In [None]:
movie_num = movie_num.fillna(value=0) # a scalar fill replaces every NaN, so no axis argument is needed
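Filling with zeros is simple, but it drags a column's mean down whenever zero is not a plausible value. The sketch below uses a made-up `runtime` column to compare zero-filling with median-filling. The median variant is only an illustrative alternative, not something this notebook uses:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
toy = pd.DataFrame({"runtime": [90.0, 120.0, np.nan, 150.0]})

n_missing = int(toy["runtime"].isna().sum())  # count NaNs before filling
zero_filled = toy["runtime"].fillna(0)        # the naive approach used above
median_filled = toy["runtime"].fillna(toy["runtime"].median())  # alternative: column median

print(n_missing, zero_filled.mean(), median_filled.mean())
```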

**Standardisation** 

Finally, PCA is sensitive to the scale of each feature, so we need to standardise the data (zero mean and unit variance per column). For this, we use sklearn's StandardScaler.

In [None]:
from sklearn.preprocessing import StandardScaler
X = movie_num.values
# Standardise each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
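As a quick sanity check on what standardisation does, the sketch below runs `StandardScaler` on synthetic data (two made-up features on very different scales) and confirms that each column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (values are illustrative only)
X_toy = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))

X_toy_std = StandardScaler().fit_transform(X_toy)

col_means = X_toy_std.mean(axis=0)  # should be ~0 for every column
col_stds = X_toy_std.std(axis=0)    # population std (ddof=0), which StandardScaler uses
print(col_means, col_stds)
```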

Let's look at some hexbin visualisations first to get a feel for how pairs of features relate to one another. In a hexbin plot, the colour of each hexagonal bin encodes how many data points fall inside it; with the cubehelix colormap used here, lighter bins hold more points. The shape of the densely populated region then hints at how correlated the two features are.

In [None]:
tmdb_movie.plot(y= 'vote_average', x ='runtime',kind='hexbin',gridsize=35, sharex=False, colormap='cubehelix', title='Hexbin of vote_average and runtime',figsize=(12,8))
tmdb_movie.plot(y= 'vote_average', x ='revenue',kind='hexbin',gridsize=45, sharex=False, colormap='cubehelix', title='Hexbin of vote_average and revenue',figsize=(12,8))

Now it is time for the customary heatmap, per the tradition of most notebooks on Principal Component Analysis. The heatmap shows how strongly the dataframe's columns are correlated with one another.

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 10))
plt.title('Pearson Correlation of Movie Features')
# Draw the heatmap using seaborn
sns.heatmap(movie_num.astype(float).corr(),linewidths=0.25,vmax=1.0, square=True, cmap="YlGnBu", linecolor='black', annot=True)

As we can see from the heatmap, there are regions (groups of features) with fairly strong positive linear correlations, indicated by the darker shades: the top left-hand corner and the bottom right quarter. This is a good sign, as it means there are linearly correlated features that PCA can project onto a smaller number of components.
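Reading the strongest correlations off a heatmap by eye can be error-prone. As a rough sketch, the pairs can also be ranked programmatically by taking the upper triangle of the correlation matrix. The toy columns below are synthetic and only borrow TMDB-like names for readability:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
budget = rng.normal(size=200)
# Synthetic data: "revenue" is built to track "budget" closely, "runtime" is independent
toy = pd.DataFrame({
    "budget": budget,
    "revenue": budget * 2 + 0.1 * rng.normal(size=200),
    "runtime": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation matrix
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]
print(pairs)
```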

# 2. EXPLAINED VARIANCE MEASURE

In [None]:
# Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

In [None]:
mean_vec

In [None]:
cov_mat

Having obtained the eigenvalues and eigenvectors, we group them together by creating a list of (eigenvalue, eigenvector) tuples. We then sort the list from highest to lowest eigenvalue and use the eigenvalues to calculate both the individual explained variance and the cumulative explained variance for visualisation.

In [None]:
# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort from high to low
eig_pairs.sort(key = lambda x: x[0], reverse= True)

# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)] # Individual explained variance
cum_var_exp = np.cumsum(var_exp) # Cumulative explained variance
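The explained-variance calculation can be cross-checked against sklearn. The sketch below uses synthetic data and, as a deliberate swap, `np.linalg.eigh` instead of `np.linalg.eig`, since `eigh` is designed for symmetric matrices such as a covariance matrix and always returns real eigenvalues. Each eigenvalue divided by the sum of all eigenvalues should match `PCA().explained_variance_ratio_`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
# Synthetic data with one nearly redundant column (illustrative only)
X_toy = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * base[:, 1],
    base[:, 1],
    rng.normal(size=300),
])
X_toy_std = StandardScaler().fit_transform(X_toy)

cov = np.cov(X_toy_std.T)
eig_vals = np.linalg.eigh(cov)[0]  # real eigenvalues in ascending order
manual_ratio = np.sort(eig_vals)[::-1] / eig_vals.sum()  # fractions, highest first

sk_ratio = PCA().fit(X_toy_std).explained_variance_ratio_
print(manual_ratio)
print(sk_ratio)
```

Note that these are fractions summing to 1, whereas `var_exp` above is expressed in percent.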

In [None]:
cum_var_exp

Now it is time to plot the explained variance to see how much each component contributes. The cumulative explained variance is drawn as a blue step plot, while the individual explained variance is drawn as green bars:

In [None]:
# PLOT OUT THE EXPLAINED VARIANCES SUPERIMPOSED 
plt.figure(figsize=(10, 5))
plt.bar(range(len(var_exp)), var_exp, alpha=0.3333, align='center', label='individual explained variance', color = 'g')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance (%)') # var_exp is expressed in percent
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()
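One common way to turn a cumulative explained variance curve into a component count is to pick the smallest number of components that reaches a chosen threshold. The sketch below uses made-up percentages and a 90% threshold. Both are assumptions for illustration, not values taken from this dataset or the rule used in this notebook:

```python
import numpy as np

# Hypothetical individual explained variances in percent, sorted high to low
var_exp_toy = np.array([40.0, 25.0, 15.0, 10.0, 6.0, 4.0])
cum = np.cumsum(var_exp_toy)  # running total per component

threshold = 90.0  # assumed cut-off, chosen only for this example
# searchsorted returns the first index where cum >= threshold; +1 converts index to count
n_components = int(np.searchsorted(cum, threshold) + 1)
print(n_components)
```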

In [None]:
movie_num.describe()

# 3. PRINCIPAL COMPONENT ANALYSIS 
Having roughly identified how many components/dimensions we would like to project onto, let's now implement sklearn's PCA module.

The first line of the code sets the parameter "n_components", which states how many principal components we want to project the dataset onto. Since we are going to use 7 components, we set n_components = 7.

The second line of the code calls the "fit_transform" method, which fits the PCA model to the standardised movie data X_std and applies the dimensionality reduction to it. The result is stored in x_9d; despite the name, it has 7 columns, one per component.

In [None]:
pca = PCA(n_components=7)
x_9d = pca.fit_transform(X_std)
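To make the output of `fit_transform` concrete, here is a small sketch on random synthetic data (not the movie data). It checks two things: the projection has one column per requested component, and the projected columns are mutually uncorrelated, which is what PCA is designed to produce:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 10))  # 100 samples, 10 synthetic features
X_toy[:, 1] = X_toy[:, 0] + 0.05 * rng.normal(size=100)  # make two features nearly collinear

pca_toy = PCA(n_components=7)
proj = pca_toy.fit_transform(X_toy)
print(proj.shape)  # (100, 7)

# Correlation between projected columns: off-diagonal entries should be ~0
corr = np.corrcoef(proj.T)
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())
```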

Awesome. Having fitted the PCA model to the movie dataset, let's visualise the first 2 projection components as a 2D scatter plot to see if we can get a quick feel for the underlying data.

In [None]:
plt.figure(figsize = (9,7))
plt.scatter(x_9d[:,0],x_9d[:,1], c='goldenrod',alpha=0.5)
plt.ylim(-10,30)
plt.show()

# 4. VISUALISATIONS WITH KMEANS CLUSTERING
A simple KMeans will now be applied to the PCA projection data. Each cluster will be visualised with a different colour so hopefully we will be able to pick out clusters by eye. 

To start off, we set up a KMeans clustering with sklearn's KMeans() and call the "fit_predict" method to compute cluster centers and predict a cluster index for every movie. Note that the clustering is fitted on all 7 PCA projections; the scatter plot then shows only the first and third projections (to see if we can observe any appreciable clusters). We define our own colour scheme and plot the scatter diagram as follows:

In [None]:
# Set up KMeans with 3 clusters
kmeans = KMeans(n_clusters=3)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0 : 'r',1 : 'g',2 : 'b'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize = (7,7))
plt.scatter(x_9d[:,0],x_9d[:,2], c= label_color, alpha=0.5) 
plt.show()
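To illustrate what `fit_predict` returns, the sketch below clusters three well-separated synthetic blobs in 7 dimensions. The blob positions, `n_init=10` and `random_state=0` are assumptions chosen to make the example reproducible; the cells in this notebook use sklearn's defaults:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs in 7 dimensions (illustrative only)
centers = np.array([[0.0] * 7, [10.0] * 7, [-10.0] * 7])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 7)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_toy)  # one integer label per row, in 0..n_clusters-1

print(np.bincount(labels))        # number of points assigned to each cluster
print(km.cluster_centers_.shape)  # one center per cluster, in all 7 dimensions
```

The cluster centers live in the full 7-dimensional space, so a 2D scatter of two projections only shows a slice of the clustering.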

In [None]:
# Set up KMeans with 4 clusters
kmeans = KMeans(n_clusters=4)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0 : 'r',1 : 'g',2 : 'b',3:'y'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize = (7,7))
plt.scatter(x_9d[:,0],x_9d[:,3], c= label_color, alpha=0.5) 
plt.show()

In [None]:
# Set up KMeans with 5 clusters
kmeans = KMeans(n_clusters=5)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0 : 'r',1 : 'g',2 : 'b',3:'y',4:'m'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize = (7,7))
plt.scatter(x_9d[:,0],x_9d[:,4], c= label_color, alpha=0.5) 
plt.show()

In [None]:
# Set up KMeans with 6 clusters
kmeans = KMeans(n_clusters=6)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0 : 'r',1 : 'g',2 : 'b',3:'y',4:'m',5:'c'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize = (7,7))
plt.scatter(x_9d[:,0],x_9d[:,5], c= label_color, alpha=0.5) 
plt.show()

In [None]:
# Set up KMeans with 7 clusters
kmeans = KMeans(n_clusters=7)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0 : 'r',1 : 'g',2 : 'b',3:'y',4:'m',5:'c',6:'k'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize = (7,7))
plt.scatter(x_9d[:,0],x_9d[:,6], c= label_color, alpha=0.5) 
plt.show()
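The cells above try 3 to 7 clusters and judge the result by eye. A rough numerical complement is to compare the silhouette score (higher is better) and the inertia (within-cluster sum of squares) across values of k. The sketch below does this on synthetic 2D blobs; the data, `n_init=10` and `random_state=0` are assumptions for illustration and not part of the original analysis:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three synthetic, well-separated 2D blobs (illustrative only)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X_toy = np.vstack([c + rng.normal(scale=1.0, size=(60, 2)) for c in centers])

scores = {}
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = km.inertia_                         # within-cluster sum of squares
    scores[k] = silhouette_score(X_toy, km.labels_)   # mean silhouette over all points

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k, scores)
```

On real data such as the PCA projection, the scores are usually far less clear-cut than on these toy blobs, so they are best read alongside the plots rather than instead of them.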

Seaborn's pairplot automatically plots every pair of columns in a dataframe against each other. I will pairplot the first 3 projections against one another, coloured by the cluster labels from the last (7-cluster) KMeans run. The resulting plot is given below:

In [None]:
# Create a temp dataframe from our PCA projection data "x_9d"
df = pd.DataFrame(x_9d)
df = df[[0,1,2]] # only want to visualise relationships between first 3 projections
df['X_cluster'] = X_clustered

In [None]:
# Call Seaborn's pairplot to visualize our KMeans clustering on the PCA projected data
sns.pairplot(df, hue='X_cluster', palette='Dark2', diag_kind='kde', height=1.85) # `size` was renamed to `height` in seaborn 0.9