__Objective__: Load the models trained by the script `run_pipeline.py` and assess their quality

__TODO__: add the figures and models to git by removing them from `.gitignore`

In [1]:
%load_ext blackcellmagic

In [10]:
import os
import pdb
import numpy as np
import pandas as pd
import seaborn as sns
from train_model import train_model, compare_methods

# Parameters

In [3]:
TARGET_CUTOFF = 7.5
PROJECT_BASE_DIR = "/home/rohail/projects/imdb_ratings/"
model_save_dir = "models/"
plot_write_dir = "reports/figures/"
idx_columns = ["imdb_title_id", "title", "original_title"]
target_columns = ["avg_vote", "avg_vote_flag"]
# determined by inspecting the diagnostic plot
classification_threshold = 0.8

In [4]:
df_model = pd.read_csv(os.path.join(PROJECT_BASE_DIR, model_save_dir, "df_model.csv"))
y_test = pd.read_csv(os.path.join(PROJECT_BASE_DIR, model_save_dir, "y_test.csv"))
x_test = pd.read_csv(os.path.join(PROJECT_BASE_DIR, model_save_dir, "x_test.csv"))

In [5]:
# regression
parameters = {
    "plot_write_dir": os.path.join(PROJECT_BASE_DIR, plot_write_dir),
    "model_save_dir": os.path.join(PROJECT_BASE_DIR, model_save_dir),
    "model_type": "regression",  # or "classification"
    "idx_columns": idx_columns,
    "test_set_size": 0.1,
    "training_parameters": {
        "class_weight": "balanced",  # vs. passing sample_weight to fit: does it make a difference?
        "n_jobs": -1,
        "max_iter": 10000,
        "scoring": "balanced_accuracy",
    },
}

reg_model, df_reg_coefs, _, _ = train_model(
    df_model,
    parameters=parameters,
    load_from_disk="regression_2020_03_04_13_56.joblib",
)

# classification
parameters.update({"model_type": "classification"})
clf_model, df_clf_coefs, _, _ = train_model(
    df_model,
    parameters=parameters,
    load_from_disk="classification_2020_03_04_14_58.joblib",
)

Diagnostic plots for this model can be found in the following directory: 
/home/rohail/projects/imdb_ratings/reports/figures/
The model itself is saved in the following directory: /home/rohail/projects/imdb_ratings/models/

Loading model from /home/rohail/projects/imdb_ratings/models/regression_2020_03_04_13_56.joblib
Diagnostic plots for this model can be found in the following directory: 
/home/rohail/projects/imdb_ratings/reports/figures/
The model itself is saved in the following directory: /home/rohail/projects/imdb_ratings/models/

Loading model from /home/rohail/projects/imdb_ratings/models/classification_2020_03_04_14_58.joblib
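
The comment next to `class_weight` in the training parameters asks whether passing sample weights to `fit` would make a difference. As a rough sketch, and assuming the classifier is a scikit-learn estimator (this notebook does not confirm that), scikit-learn documents the `"balanced"` heuristic as `n_samples / (n_classes * np.bincount(y))`. The helper below, `balanced_sample_weights`, is hypothetical and only reproduces that formula per sample; the label array is made up for illustration.

```python
import numpy as np


def balanced_sample_weights(y):
    """Per-sample weights following the 'balanced' heuristic:
    n_samples / (n_classes * count_of_that_class)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weights = len(y) / (len(classes) * counts)
    weight_by_class = dict(zip(classes.tolist(), class_weights.tolist()))
    return np.array([weight_by_class[label] for label in y.tolist()])


# Hypothetical labels: 1 = rating above the cutoff, 0 = otherwise
y_example = [0, 0, 0, 1]
weights = balanced_sample_weights(y_example)
print(weights)
```

With these weights each class contributes the same total weight, which is the intent of `"balanced"`. Whether the two routes give identical fits in this pipeline (for example under cross-validation) would still need to be checked empirically.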


# Validate model

In [6]:
parameters = {
    "target_columns": target_columns,
    "classification_threshold": classification_threshold,  # determined from the diagnostic plot
    "regression_threshold": TARGET_CUTOFF,
    "idx_columns": idx_columns,
}

train_movies_sample = {
    "The Dark Knight",
    "Anchorman: The Legend of Ron Burgundy",
    "The Big Lebowski",
    "Batman v Superman: Dawn of Justice",
    "Black Panther",
    "Kabhi Khushi Kabhie Gham...",
    "3 Idiots",
    "The Intouchables",
    "Amélie",
    "The Matrix",
    "The Matrix Reloaded",
    "V for Vendetta",
    "Kill Bill: Vol. 1",
    "La vita è bella",
    "Die Hard",
    "Requiem for a Dream",
    "Terminator 3: Rise of the Machines",
    "The Terminator",
    "Terminator 2: Judgment Day",
    "Titanic",
    "The Departed",
    "Groundhog Day",
    "Love in Kilnerry",
    "Jinnah",
    "Jawani Phir Nahi Ani",
    "Bol",
    "Das letzte Mahl",
    "The Lives of Others",
    "Das Experiment",
}

# predict with both models on the held-out test set and on a sample of titles from the training data
df_predict_test, df_predict_train = compare_methods(
    df_model, reg_model, clf_model, x_test, y_test, train_movies_sample, parameters
)

Making predictions on test data
       reg_rating_prediction  clf_prob_prediction
count            6651.000000          6651.000000
mean                7.317829             0.541427
std                 3.213174             0.308058
min                 1.286682             0.066496
25%                 5.755538             0.226881
50%                 6.915588             0.550601
75%                 8.014299             0.853454
max                65.003010             1.000000
Regression and classification predictions the same? False
Balanced accuracy for regression:  0.6700273187586469
Balanced accuracy for classification:  0.6280635284037839
Making predictions on sample data from train data
       reg_rating_prediction  clf_prob_prediction
count              32.000000            32.000000
mean               22.449798             0.892645
std                14.794780             0.218369
min                 5.461126             0.230528
25%                 9.657111             0.91192

> Regression seems to perform better in general, but this needs further assessment, which is best done by continuing to add more features. I kept the number of features low to keep model training time reasonable: the current classification model, with its optimizations, takes about 45 minutes to fit.

> Going forward, I'd try two things. First, fit a more complex regression model, since I'm currently only fitting a simple OLS model; the same could be done for classification. Second, and I think this is the better way forward, focus on engineering more features and see how performance improves.
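
The balanced accuracy figures reported above compare two different kinds of output: a predicted rating and a predicted probability. A minimal sketch of how both could be reduced to the same binary flag is shown below. It assumes the regression prediction is cut at `TARGET_CUTOFF` (7.5) and the classification probability at `classification_threshold` (0.8), and that the comparison is `>=`; the internals of `compare_methods` are not shown in this notebook, so these are assumptions, and all arrays are made-up values.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive and negative class (binary labels)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    sensitivity = (y_true & y_pred).sum() / y_true.sum()
    specificity = (~y_true & ~y_pred).sum() / (~y_true).sum()
    return (sensitivity + specificity) / 2


# Hypothetical values for illustration only
true_rating = np.array([8.1, 6.0, 7.9, 5.2])
reg_pred = np.array([7.8, 6.4, 7.0, 8.0])
clf_prob = np.array([0.9, 0.3, 0.85, 0.5])

REGRESSION_THRESHOLD = 7.5  # TARGET_CUTOFF in this notebook
CLASSIFICATION_THRESHOLD = 0.8  # classification_threshold in this notebook

# Reduce every output to the same "highly rated" flag before scoring
y_true = true_rating >= REGRESSION_THRESHOLD
reg_flag = reg_pred >= REGRESSION_THRESHOLD
clf_flag = clf_prob >= CLASSIFICATION_THRESHOLD

reg_bal_acc = balanced_accuracy(y_true, reg_flag)
clf_bal_acc = balanced_accuracy(y_true, clf_flag)
print(reg_bal_acc, clf_bal_acc)
```

Because the classification score depends on the chosen probability threshold, the comparison between the two models is only as good as that threshold; sweeping it over a range would give a fuller picture.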

In [13]:
len(df_reg_coefs)

312

In [12]:
np.abs(df_reg_coefs.coefficients).rank()

0      312.0
1      132.0
2       45.0
3       11.0
4       34.0
5       20.0
6       19.0
7       50.0
8       63.0
9      148.0
10      52.0
11     299.0
12     102.0
13     146.0
14     263.0
15     227.0
16     101.0
17     171.0
18     115.0
19     296.0
20      31.0
21     125.0
22       8.0
23      86.0
24     104.0
25     216.0
26      62.0
27     143.0
28     194.0
29      61.0
       ...  
282     18.0
283    144.0
284    193.0
285    170.0
286    106.0
287    231.0
288     94.0
289    195.0
290    253.0
291      4.0
292     78.0
293     60.0
294    145.0
295    276.0
296    259.0
297     82.0
298     89.0
299    112.0
300    176.0
301    197.0
302    211.0
303     84.0
304      1.0
305    131.0
306     25.0
307    250.0
308     27.0
309    189.0
310    214.0
311    122.0
Name: coefficients, Length: 312, dtype: float64

In [19]:
df_reg_coefs.loc[:, "importance"] = np.abs(df_reg_coefs.coefficients).rank(ascending=False)
df_reg_coefs.sort_values("importance", ascending=True).head(10)

                                      index  coefficients  importance
0                                 intercept      7.199475         1.0
95                    primary_country_Nepal      2.429067         2.0
81            primary_country_Liechtenstein     -2.234611         3.0
257                primary_language_Quechua      2.213293         4.0
146                  primary_country_Uganda      2.026378         5.0
165               primary_language_Assamese      2.019303         6.0
166                 primary_language_Aymara      1.981734         7.0
239                primary_language_Marathi      1.897182         8.0
218               primary_language_Kashmiri      1.875165         9.0
263  primary_language_Russian Sign Language      1.842207        10.0
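
In the ranking above the intercept takes rank 1, although it is not a feature. A small sketch of the same ranking with the intercept row dropped first is given below. The frame `df_coefs` and its values are hypothetical stand-ins that only mirror the `index` and `coefficients` columns of `df_reg_coefs`.

```python
import pandas as pd

# Hypothetical coefficient table with the same columns as df_reg_coefs
df_coefs = pd.DataFrame(
    {
        "index": ["intercept", "feature_a", "feature_b", "feature_c"],
        "coefficients": [7.2, -2.2, 2.4, 0.1],
    }
)

# Drop the intercept so that only real features are ranked
features_only = df_coefs[df_coefs["index"] != "intercept"].copy()

# Rank by absolute coefficient size, largest first
features_only["importance"] = features_only["coefficients"].abs().rank(ascending=False)

top = features_only.sort_values("importance").head(3)
print(top)
```

Ranking by raw coefficient magnitude is only comparable across features that share a scale, which holds among one-hot columns such as the `primary_country_*` and `primary_language_*` indicators but not necessarily between those and numeric features.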
