## SIADS591 - Milestone I - Project
### Exploring the Factors Impacting a Movie's Profitability
#### Machine Learning - Data Exploration

The needed libraries are imported:

In [1]:
import utilities as utils
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

The prepared data is imported:

In [None]:
save_dir = "../../Output/"
imdb_dummies_df = pd.read_csv(save_dir + "imdb_dummies_df_2020_09_15.csv", sep="|")
print("imdb_dummies_df.shape :", imdb_dummies_df.shape)
cols = list(imdb_dummies_df.columns)
print("cols :", cols[:10])

A movie's budget and revenue are the numeric values that will be used to create the output that we wish to predict. The output could be a continuous variable or a categorical variable, but either way it will be derived from these two numeric values. Let's see what the budget and revenue look like for this data set.

In [None]:
plt.figure(figsize=(7,7))
plt.scatter(imdb_dummies_df["budget"], imdb_dummies_df["revenue"], marker="o", color="red")
plt.xlabel("Budget")
plt.ylabel("Revenue")
plt.savefig(save_dir + "Budget_vs_Revenue.png")  # save_dir already ends with "/"

In [None]:
budget_revenue_corr = imdb_dummies_df[["budget", "revenue"]].corr()
budget_revenue_corr

A mild correlation exists between budget and revenue. So, in general, as the budget values get larger, the revenue values tend to get larger too. This raises a question: **would we know the budget when we want to predict the revenue, or do we intend to predict the revenue just from features like actors and directors?** We will keep this in mind. For now, let's also see what correlations might exist between budget, revenue, and the other features.
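For reference, `DataFrame.corr()` returns the Pearson coefficient by default. The sketch below recomputes it by hand on a small made-up table to show what the number measures. The values in `toy_df` are invented for illustration and are not taken from the data set.

```python
import pandas as pd

# Toy values, invented for illustration only (not from the IMDb data).
toy_df = pd.DataFrame({
    "budget":  [1.0, 5.0, 10.0, 20.0, 50.0],
    "revenue": [0.5, 12.0, 8.0, 60.0, 90.0],
})

# Pearson r = covariance / (std_x * std_y), using sample statistics.
x = toy_df["budget"]
y = toy_df["revenue"]
manual_r = x.cov(y) / (x.std() * y.std())

# pandas computes the same quantity; method="pearson" is the default.
pandas_r = toy_df.corr(method="pearson").loc["budget", "revenue"]

print("manual_r :", manual_r)
print("pandas_r :", pandas_r)
```

Both values should agree up to floating-point error. Pearson's coefficient only captures linear association, so a mild value does not rule out a stronger non-linear relationship.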

We define a function that returns the correlation values of interest. The results can be filtered by lower and upper threshold values and, optionally, by a list of substrings matched against the column names (e.g. 'revenue', 'budget').

In [None]:
def get_corr_values_above_threshold(corr_df, threshold_lower=0.5, threshold_upper=1.0, filter_values=None):
    """Return {(label_a, label_b): corr} for pairs with threshold_lower < corr < threshold_upper.

    If filter_values is given, keep only pairs where either label contains one of those substrings.
    """
    corr_values = corr_df.values
    corr_labels = list(corr_df.columns)

    # Visit each unordered pair once (upper triangle) and skip the diagonal.
    corr_dict = {}
    for row_idx in range(len(corr_labels)):
        for col_idx in range(row_idx + 1, len(corr_labels)):
            corr_dict[(corr_labels[col_idx], corr_labels[row_idx])] = corr_values[row_idx, col_idx]

    # Keep only values strictly inside the threshold band (NaN fails the comparison and is dropped).
    corr_dict_filtered = {key: value for key, value in corr_dict.items()
                          if threshold_lower < value < threshold_upper}

    # Optionally keep only pairs where either label contains one of the filter strings.
    if filter_values:
        corr_dict_filtered = {(key1, key2): value for (key1, key2), value in corr_dict_filtered.items()
                              if any(x in key1 or x in key2 for x in filter_values)}

    return corr_dict_filtered
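The same filtering can also be written without explicit loops. The sketch below is one possible vectorized variant, not a replacement for the function above. The name `get_corr_pairs` and the toy columns are hypothetical, and it returns a pandas Series indexed by (row label, column label) rather than a dictionary.

```python
import numpy as np
import pandas as pd

def get_corr_pairs(corr_df, threshold_lower=0.5, threshold_upper=1.0, filter_values=None):
    """Hypothetical vectorized variant: return a Series of pairwise correlations."""
    # Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr_df.shape, dtype=bool), k=1)
    pairs = corr_df.where(mask).stack()
    # Strict threshold band; NaN entries fail both comparisons and are removed here.
    pairs = pairs[(pairs > threshold_lower) & (pairs < threshold_upper)]
    if filter_values:
        # Keep pairs where either label contains one of the filter substrings.
        keep = [any(f in a or f in b for f in filter_values) for a, b in pairs.index]
        pairs = pairs[np.array(keep, dtype=bool)]
    return pairs

# Toy values, invented for illustration only.
toy_df = pd.DataFrame({
    "budget":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "revenue": [2.0, 4.1, 5.9, 8.2, 9.9],
    "actor_a": [1, 0, 1, 0, 1],
})
toy_corr = toy_df.corr(method="pearson")
result = get_corr_pairs(toy_corr, filter_values=["revenue"])
print(result)
```

On this toy table only the budget and revenue pair clears the 0.5 threshold. Note that negative correlations are never returned with the default thresholds, which matches the behaviour of the function above.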

The correlation matrix will be very large and most of its values appear to be low, so we filter out the low values and only look at feature pairs whose correlation coefficient is larger than a threshold value.

In [None]:
print("imdb_dummies_df.shape :", imdb_dummies_df.shape)
corr_df = imdb_dummies_df.copy()

corr_df["startYear"] = corr_df["startYear"].astype(int).astype(str)
corr_df["profitability"] = (corr_df["revenue"] - corr_df["budget"]) / corr_df["budget"]
corr_df = corr_df.drop(["primaryTitle", "originalTitle", "imdb_id"], axis="columns")
corr_df.set_index("tconst", inplace=True)
corr_cols = list(corr_df.columns)
corr_cols.remove("profitability")
corr_cols_len = len(corr_cols)
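The `profitability` column defined above is the relative return on budget, (revenue - budget) / budget. The sketch below works through that formula on made-up numbers and shows one possible way to derive a categorical target from it. The zero-budget masking and the `is_profitable` column are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only.
toy_df = pd.DataFrame({
    "budget":  [10.0, 20.0, 0.0, 50.0],
    "revenue": [30.0, 10.0, 5.0, 50.0],
})

# Continuous target: relative return on budget.
# A zero budget would produce inf, so it is masked as NaN here (an assumption).
budget = toy_df["budget"].replace(0, np.nan)
toy_df["profitability"] = (toy_df["revenue"] - budget) / budget

# Hypothetical categorical target: did the movie earn back more than its budget?
# Rows with NaN profitability compare as False.
toy_df["is_profitable"] = toy_df["profitability"] > 0

print(toy_df)
```

A movie that exactly breaks even has a profitability of 0 and is labelled as not profitable under this definition. A different cut-off or several bins could be used instead.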

We check correlations for a small sample of the features (columns).

In [None]:
# Keep only the first 100 feature columns (plus "profitability") so the matrix stays small.
drop_cols = corr_cols[100:]
corr_df_temp = corr_df.drop(drop_cols, axis=1)
print("list(corr_df_temp.columns) :", list(corr_df_temp.columns))
correlation_matrix = corr_df_temp.corr(method="pearson")
high_corr_values = get_corr_values_above_threshold(corr_df=correlation_matrix, threshold_lower=0.5, threshold_upper=1.0)

high_corr_values

Some correlations come through on the small sample. Next we try a larger sample.

In [None]:
# Keep the first 200 feature columns (plus "profitability") for a larger sample.
drop_cols = corr_cols[200:]
corr_df_temp = corr_df.drop(drop_cols, axis=1)
print("list(corr_df_temp.columns)[-10:] :", list(corr_df_temp.columns)[-10:])
correlation_matrix = corr_df_temp.corr(method="pearson")
# Only report pairs that involve budget, revenue, or profitability.
high_corr_values = get_corr_values_above_threshold(corr_df=correlation_matrix, threshold_lower=0.5, threshold_upper=1.0,
                                                   filter_values=['budget', 'revenue', 'profitability'])

high_corr_values

This shows that there is some correlation between certain principals and profitability.
Next we will try to train a machine learning model on these features and see whether profitability can be reliably predicted.

In [None]:
# imdb_dummies_df["startYear"] = imdb_dummies_df["startYear"].astype(int).astype(str)
# imdb_dummies_df.drop(["primaryTitle", "originalTitle", "imdb_id", "revenue", "budget"], axis="columns", inplace=True)
# imdb_dummies_df.set_index("tconst", inplace=True)
# print(imdb_dummies_df[list(imdb_dummies_df.columns)[:5]].head())