# 1. Introduction: Business Goal & Problem Definition

The goal of this project is to cluster movies and analyze the resulting groups, so that the movie industry can better understand customer segments according to their movie preferences and adapt its marketing strategy to each segment, bringing more revenue to the business. For that we'll use the Movie Industry dataset available on Kaggle, containing 6820 movies (220 movies per year, 1986-2016). Each movie has the following attributes:

* budget: the budget of a movie. Some movies don't have this, so it appears as 0
* company: the production company
* country: country of origin
* director: the director
* genre: main genre of the movie.
* gross: revenue of the movie
* name: name of the movie
* rating: rating of the movie (R, PG, etc.)
* released: release date (YYYY-MM-DD)
* runtime: duration of the movie
* score: IMDb user rating
* votes: number of user votes
* star: main actor/actress
* writer: writer of the movie
* year: year of release

# 2. Importing Basic Libraries

In [None]:
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 3. Data Collection

In [None]:
movies_ds = pd.read_csv('../input/movies/movies.csv', encoding='latin1', sep=",")

movies_ds

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format
movies_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

movies_ds.info(verbose=True, show_counts=True) #show_counts replaces null_counts, which was removed in pandas 2.0

In [None]:
#Checking the existence of zeros per feature

(movies_ds==0).sum(axis=0).to_excel("zeros_per_feature.xlsx")
(movies_ds==0).sum(axis=0)

In [None]:
#Checking the existence of duplicated rows

movies_ds.duplicated().sum()

In [None]:
#Checking basic statistical data by feature

movies_ds.describe(include="all")

# 5. Data Cleaning

We'll perform the following:

1. Budget is zero in 2182 observations. We'll treat those zeros as missing values and fill them with a value proportional to "gross", since the two features are correlated
2. There are 309 ratings recorded as NOT RATED, UNRATED or Not specified. Since that is not a significant amount, those rows will be deleted
3. Create a feature (gross_to_budget_ratio) to analyze the relevance of the revenue to budget ratio in the model
4. Keep only the most relevant features for our clustering purpose (budget, country, genre, gross, rating, runtime, score, year, plus the new gross_to_budget_ratio). This makes the model easier to interpret, reduces training time, mitigates the curse of dimensionality and reduces overfitting (Occam's razor)
5. Convert categorical variables to dummies (genre and rating; country is skipped in the code to keep the number of columns small)

* No duplications found
* No outliers found
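
To make step 1 concrete, here is a minimal sketch of the proportional fill on a toy table. The values are hypothetical, and the sketch estimates the budget-to-gross ratio only from movies whose budget is known, which is an assumption about how the ratio should be computed rather than a statement of what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two movies with a known budget, one recorded as 0
toy = pd.DataFrame({
    "budget": [10.0, 30.0, 0.0],
    "gross": [20.0, 60.0, 50.0],
})

# Treat 0 as "missing" rather than as a real budget
toy["budget"] = toy["budget"].replace(0, np.nan)

# Overall budget-to-gross ratio, estimated only from rows with a known budget
known = toy["budget"].notna()
ratio = toy.loc[known, "budget"].sum() / toy.loc[known, "gross"].sum()

# Fill each missing budget with ratio * gross of the same movie
toy["budget"] = toy["budget"].fillna(ratio * toy["gross"])

print(ratio)                   # 0.5
print(toy["budget"].tolist())  # [10.0, 30.0, 25.0]
```

One side effect worth keeping in mind: every movie filled this way ends up with the same gross to budget ratio (1 / ratio, so 2.0 in the toy table), which can influence how those rows cluster.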

In [None]:
#1

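movies_ds["budget"] = movies_ds["budget"].replace(0, np.nan) #zeros mean "budget unknown"
known_budget = movies_ds["budget"].notna()
budget_to_gross = movies_ds.loc[known_budget, "budget"].sum() / movies_ds.loc[known_budget, "gross"].sum() #ratio estimated only from movies with a known budget
movies_ds["budget"] = movies_ds["budget"].fillna(budget_to_gross * movies_ds["gross"])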

#2

movies_ds = movies_ds[~movies_ds["rating"].isin(["NOT RATED", "UNRATED", "Not specified"])].copy() #copy avoids SettingWithCopyWarning when adding columns below

#3

movies_ds["gross_to_budget_ratio"] = movies_ds["gross"] / movies_ds["budget"] #feature engineering

#4

movies_ds = movies_ds[["budget", "country", "genre", "gross", "rating", "runtime", "score", "year", "gross_to_budget_ratio"]] #keeping only the most relevant features

#5

# movies_ds = pd.concat([movies_ds, pd.get_dummies(movies_ds["country"])], axis=1) #country dummy coding (we're skipping this line since it would generate an 87-column dataset and we don't want to complicate the explanation of the problem to the business in this example)
movies_ds = pd.concat([movies_ds, pd.get_dummies(movies_ds["genre"])], axis=1) #genre dummy coding
movies_ds = pd.concat([movies_ds, pd.get_dummies(movies_ds["rating"])], axis=1) #rating dummy coding

movies_ds.to_excel("movies_ds_clean.xlsx")
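
As a small illustration of the dummy coding in step 5, the sketch below encodes two categorical columns of a toy table. The toy values and the `genre_` / `rating_` prefixes are assumptions made for readability; the notebook itself concatenates unprefixed dummies.

```python
import pandas as pd

# Toy table with hypothetical values
toy = pd.DataFrame({
    "genre": ["Action", "Comedy", "Action"],
    "rating": ["R", "PG", "PG"],
    "score": [7.1, 6.4, 8.0],
})

# One indicator column per category; the prefix avoids name clashes between features
encoded = pd.get_dummies(toy, columns=["genre", "rating"], prefix=["genre", "rating"], dtype=int)

print(encoded.columns.tolist())
# ['score', 'genre_Action', 'genre_Comedy', 'rating_PG', 'rating_R']
```

Passing `columns=` drops the original categorical columns in the same call, so no separate `drop` step is needed before scaling.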

# 6. Data Exploration

In [None]:
#Plotting Categorical Variables

for feature in ["country", "genre", "rating"]:
    counts = movies_ds[feature].value_counts()
    fig, ax = plt.subplots(1, 2)
    counts.plot.bar(color="purple", ax=ax[0])
    counts.plot.pie(autopct='%1.1f%%', shadow=True, textprops={"fontsize": 10}, ax=ax[1])
    fig.suptitle(f"{feature.capitalize()} Frequency", fontsize=15)
    ax[0].tick_params(axis="x", labelrotation=90) #rotating the tick labels of the bar chart
    ax[0].tick_params(axis="y", labelrotation=45)


#Plotting Numerical Variables

numerical_titles = {
    "budget": "Budget Distribution",
    "gross": "Gross Revenue Distribution",
    "runtime": "Runtime Distribution",
    "score": "Score Distribution",
    "year": "Year Distribution",
    "gross_to_budget_ratio": "Gross to Budget Distribution",
}

for feature, title in numerical_titles.items():
    fig, ax = plt.subplots(1, 3)
    fig.suptitle(title, fontsize=15)
    sns.histplot(movies_ds[feature], kde=True, stat="density", ax=ax[0]) #histplot replaces distplot, which is deprecated in seaborn
    sns.boxplot(x=movies_ds[feature], ax=ax[1])
    sns.violinplot(x=movies_ds[feature], ax=ax[2])

# 7. Correlations Analysis & Features Selection

In [None]:
#Deleting original categorical columns

movies_ds2 = movies_ds.drop(["country", "genre", "rating"], axis=1)

# #Plotting a Heatmap

# fig, ax = plt.subplots(1, figsize=(25,25))
# sns.heatmap(movies_ds2.corr(), annot=True, fmt=",.2f")
# plt.title("Heatmap Correlation", fontsize=20)
# plt.tick_params(labelsize=12)
# plt.xticks(rotation=90)
# plt.yticks(rotation=45)

# #Plotting a Pairplot

# sns.pairplot(movies_ds2)

# 8. Data Modelling

In [None]:
#Defining Xs

X_orig = movies_ds
X = movies_ds2

#Scaling all features

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_scaled = sc_X.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)
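
To show what the scaling step does, the following sketch compares `StandardScaler` with the z-score computed by hand on a toy array. The numbers are hypothetical and only meant to mimic one large-scale feature (such as budget) next to one small-scale feature (such as score).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 on a "budget" scale, column 1 on a "score" scale
X_toy = np.array([[1_000_000.0, 5.0],
                  [3_000_000.0, 7.0],
                  [5_000_000.0, 9.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so a distance-based method such as K-means is not dominated by the feature with the largest raw magnitude.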

# 9. Dimensionality Reduction

In [None]:
#Applying PCA

from sklearn.decomposition import PCA

#Creating a model
pca = PCA(n_components=X_scaled.shape[1], random_state=0) #keeping as many components as there are features

#Fitting to the model
pca.fit(X_scaled)

#Generating all components in an array
X_pca = pca.transform(X_scaled)
# X_pca_output = pd.DataFrame(X_pca)
# X_pca_output.to_excel("X_pca_file.xlsx",index=False)

#Displaying the explained variance by number of components
for n in range(0, X_scaled.shape[1]):
    print(f"Variance explained by the first {n+1} principal components = {np.cumsum(pca.explained_variance_ratio_ *100)[n]:.1f}%")
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of components")
plt.ylabel("Explained variance")

#Creating a model with the chosen number of components (20 components explain about 75% of the variance)
pca_selected = PCA(n_components=20, random_state=0)
pca_selected.fit(X_scaled)
X_pca_selected = pca_selected.transform(X_scaled)
# X_pca_selected_output = pd.DataFrame(X_pca_selected)
# X_pca_selected_output.to_excel("X_pca_selected_file.xlsx",index=False)
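
The number of components for a target share of explained variance can also be read off the cumulative curve programmatically. The sketch below does this on synthetic data (not the movies dataset); the 0.75 target mirrors the threshold mentioned above, and the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated data (not the movies dataset): 200 rows, 10 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(200, 10))

pca_toy = PCA(svd_solver="full").fit(X_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

target = 0.75  # share of variance to retain
# Index of the first cumulative value that reaches the target, plus 1 to turn it into a count
n_needed = int(np.searchsorted(cumulative, target) + 1)

# scikit-learn can make the same choice when n_components is a float in (0, 1)
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_toy)

print(n_needed, pca_auto.n_components_)
```

Deriving the count from the data this way avoids hard-coding a number that may go stale if the feature set changes.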

# 10. Machine Learning Algorithms Implementation & Assessment

# 10.1 K-means

In [None]:
#Creating a K-means model and checking its Metrics

from sklearn.cluster import KMeans

#Applying the Elbow Method to calculate distortion for a range of numbers of clusters

distortions = []
for i in range(1, 21):
    km = KMeans(n_clusters=i, init="random", n_init=10, max_iter=300, tol=1e-04, random_state=0)
    km.fit(X_pca_selected)
    distortions.append(km.inertia_)

#Plotting

plt.plot(range(1, 21), distortions, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Distortion")
plt.show()

#Applying the Silhouette Method to interpret and validate the consistency within clusters of data

from sklearn.metrics import silhouette_score
silhouette_coefficients = []
for j in range(2, 21):
    km = KMeans(n_clusters=j, init="random", n_init=10, max_iter=300, tol=1e-04, random_state=0)
    km.fit(X_pca_selected)
    score = silhouette_score(X_pca_selected, km.labels_)
    silhouette_coefficients.append(score)

#Plotting

plt.style.use("fivethirtyeight")
plt.plot(range(2, 21), silhouette_coefficients)
plt.xticks(range(2, 21))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

#Choosing number of clusters

n_clusters = 17
print('Estimated number of clusters: %d' % n_clusters)
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0) #fixed seed so the labels are reproducible
km.fit(X_pca_selected)
print("Silhouette Coefficient: %0.3f" % silhouette_score(X_pca_selected, km.labels_)) #reusing the fitted labels instead of fitting again

#Plotting chosen number of clusters

from yellowbrick.cluster import silhouette_visualizer
silhouette_visualizer(KMeans(n_clusters=n_clusters, random_state=0), X_pca_selected)

#Visualizing clusters in the dataset
X_orig = pd.DataFrame(X_orig)
X_orig["cluster"] = km.labels_
X_orig.to_excel("model_km.xlsx")
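
As a complement to reading the silhouette plot by eye, the sketch below picks the number of clusters with the highest silhouette coefficient. It runs on synthetic blobs with four well-separated groups (an assumption made so the answer is unambiguous), not on the movies data, where the curve may be much flatter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (not the movies dataset)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette coefficient for each candidate number of clusters
scores = {}
for k in range(2, 9):
    km_toy = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    scores[k] = silhouette_score(X_toy, km_toy.labels_)

# Candidate with the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data it is usually worth checking this automatic choice against the elbow plot and against how interpretable the resulting groups are for the business.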

In [None]:
# #Plotting scatter graph per pair features

# #Mapping every individual cluster to a color

# colors = ['royalblue', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy']

# vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

# #Plotting

# for i in range(1, X_pca_selected.shape[1]-1):
#     plt.scatter(X_pca_selected.iloc[:,0], X_pca_selected.iloc[:,i], c=vectorizer(clusters))
#     plt.xlabel(X_pca_selected.columns[0])
#     plt.ylabel(X_pca_selected.columns[i])
#     plt.show()

# 10.2 DBSCAN

In [None]:
#Creating a DBSCAN model and checking its Metrics
#OBS: we're exploring DBSCAN only as a study exercise in this project - we'll adopt K-Means

from sklearn.neighbors import NearestNeighbors

#We can calculate the distance from each point to its closest neighbour using the NearestNeighbors. The point itself is included in n_neighbors. The kneighbors method returns two arrays, one which contains the distance to the closest n_neighbors points and the other which contains the index for each of those points

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca_selected)
distances, indices = nbrs.kneighbors(X_pca_selected)

#Sorting and plotting results

distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
plt.xlabel("Points sorted by distance to the nearest neighbour")
plt.ylabel("Distance to the nearest neighbour (eps candidate)")
plt.show()

from sklearn.cluster import DBSCAN

#Selecting the best eps (the optimal value for epsilon will be found at the point of maximum curvature)

dbs = DBSCAN(eps=10)
dbs.fit(X_pca_selected)

#The labels_ property contains the list of clusters and their respective points

clusters = dbs.labels_

from sklearn import metrics

#Number of clusters in labels, ignoring noise (outlier) (-1) if present

n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise_ = list(clusters).count(-1)
print('Estimated number of clusters: %d' % n_clusters)
print('Estimated number of noise points: %d' % n_noise_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X_pca_selected, clusters))

#Visualizing clusters in the dataset
X_orig = pd.DataFrame(X_orig)
X_orig["cluster_dbscan"] = dbs.labels_ #separate column, so the adopted K-Means labels in "cluster" are not overwritten
X_orig.to_excel("model_dbs.xlsx")
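
To illustrate how DBSCAN labels clusters and noise, the sketch below runs it on a tiny synthetic set with two dense groups and one isolated point. The data, `eps` and `min_samples` values are assumptions chosen for the toy example, not tuned values for the movies dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two tight groups plus one far-away point (synthetic, hypothetical values)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(30, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(30, 2))
outlier = np.array([[20.0, 20.0]])
X_toy = np.vstack([group_a, group_b, outlier])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)

# Noise points get the label -1, so they are excluded from the cluster count
n_found = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(n_found, n_noise)  # 2 1
```

Unlike K-means, the number of clusters is an output here rather than an input, and it depends strongly on `eps`: a value that is too large merges everything into one cluster, while a value that is too small marks most points as noise.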

# 11. Conclusions

In this exercise we went through the whole process: collecting data, exploring features and distributions, treating data, understanding correlations, selecting relevant features, modelling the data and presenting a clustering model. The model indicates groups of similar movies to be further developed and explored, so that the movie industry can better understand customer segments according to their movie preferences and adapt its marketing strategy to each segment, bringing more revenue to the business.