# Project 3


# Movie Genre Classification

Classify a movie genre based on its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- Probability of the movie belonging to each genre



### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

# Movie genre classification

This notebook attempts to predict the genres of a set of movies from their plot summary or synopsis. The features considered are `plot`, `year`, and `title`; the text of each record is transformed by vectorizing the words that describe the movie.




In [1]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
# Load the data

dataTraining = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
# Structure of the training data

dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
# Structure of the test data to be scored

dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


### Building the count vectorization

To represent text numerically, each plot is turned into a vector of word counts (a bag-of-words encoding). The vocabulary is limited to the 1,000 most frequent terms (`max_features=1000`).

Applying `fit_transform` to the __plot__ column yields a matrix with 7,895 rows and 1,000 columns.

In [4]:
vect = CountVectorizer(max_features=1000) 
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [5]:
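# First 50 terms of the fitted vocabulary. They are listed alphabetically, not by frequency.
# Note: scikit-learn 1.2 removed get_feature_names(); newer versions provide get_feature_names_out().

print(vect.get_feature_names()[:50])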

['able', 'about', 'accepts', 'accident', 'accidentally', 'across', 'act', 'action', 'actor', 'actress', 'actually', 'adam', 'adult', 'adventure', 'affair', 'after', 'again', 'against', 'age', 'agent', 'agents', 'ago', 'agrees', 'air', 'alan', 'alex', 'alice', 'alien', 'alive', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'america', 'american', 'among', 'an', 'and', 'angeles', 'ann', 'anna', 'another', 'any', 'anyone', 'anything', 'apartment']


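* The 50 terms printed above, such as _able_, _about_, and _accepts_, are the first entries of the vocabulary in alphabetical order. They belong to the 1,000 most frequent terms in the summaries, but they are not necessarily the 50 most frequent ones.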
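To actually rank terms by frequency, the columns of the document-term matrix can be summed. The following is a minimal sketch on a made-up corpus (`toy_plots`, `toy_vect`, and `toy_dtm` are hypothetical names, not part of the notebook); the same idea should carry over to `X_dtm` and `vect`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for dataTraining['plot'] (hypothetical data)
toy_plots = [
    "a detective hunts a killer in the city",
    "a killer hides in the city",
    "the detective falls in love",
]

toy_vect = CountVectorizer(max_features=5)
toy_dtm = toy_vect.fit_transform(toy_plots)

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

# Total count of each term across all documents (column sums of the sparse matrix)
totals = np.asarray(toy_dtm.sum(axis=0)).ravel()

# Sort terms from most to least frequent (stable sort keeps alphabetical order on ties)
order = np.argsort(-totals, kind="stable")
top_terms = [(terms[i], int(totals[i])) for i in order]
print(top_terms)
```

Note that the default tokenizer of `CountVectorizer` ignores single-character tokens, so the word "a" never enters the vocabulary.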

### Creating the target variable y


Since the target is the set of genres of each movie, __y__ must be expressed as a binary matrix with one column per genre, which makes the labels easier to process. Note that a movie can have one or more genres.


In [3]:

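import ast

# Target variable: parse the string form of each genre list into a Python list.
# ast.literal_eval only accepts literals, so it is safer than eval on file contents.
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)

# Binarize the multiple labels (genres) that a movie can have
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])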

In [8]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
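To make the binary matrix easier to read, here is a small sketch on three made-up rows (the `toy_` names are hypothetical). It parses the genre strings with `ast.literal_eval`, a safer alternative to `eval` for this kind of input, and shows which genre each column stands for.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists stored as strings, as in the 'genres' column of the CSV (toy rows)
raw_genres = ["['Short', 'Drama']", "['Drama']", "['Action', 'Crime', 'Thriller']"]

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed = [ast.literal_eval(s) for s in raw_genres]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(parsed)

print(list(toy_mlb.classes_))  # column order: genres sorted alphabetically
print(toy_y)                   # one row per movie, one 0/1 column per genre
```

A row can contain several ones, which is what distinguishes this multi-label encoding from a one-hot encoding of a single class.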

In [9]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

### Training the multi-label model

The One-vs-Rest strategy is used for multi-label classification: the multi-label problem is split into one binary classification problem per label (genre).
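The decomposition can be sketched by hand. The example below uses synthetic data and `LogisticRegression` as a stand-in base model (both are assumptions made for illustration, not what the notebook trains), and checks that fitting one binary classifier per label column gives the same probabilities as `OneVsRestClassifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Three hypothetical labels; each depends on one feature so the problem is learnable
Y_toy = (X_toy[:, :3] + 0.5 * rng.normal(size=(200, 3)) > 0).astype(int)

# Manual one-vs-rest: one independent binary classifier per label column
manual_proba = np.column_stack([
    LogisticRegression().fit(X_toy, Y_toy[:, k]).predict_proba(X_toy)[:, 1]
    for k in range(Y_toy.shape[1])
])

# The scikit-learn wrapper fits one clone of the base estimator per label
ovr = OneVsRestClassifier(LogisticRegression()).fit(X_toy, Y_toy)
ovr_proba = ovr.predict_proba(X_toy)

print(np.allclose(manual_proba, ovr_proba))
```

Because every label has its own classifier, the probabilities in a row are independent of each other and do not need to sum to one.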


### 01. Model evaluation: Random Forest for genre classification



In [10]:
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))

In [11]:
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=RandomForestClassifier(max_depth=10, n_jobs=-1,
                                                     random_state=42))

In [12]:
y_pred_genres = clf.predict_proba(X_test)

In [13]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7812262183677007

* `predict_proba` is used instead of `predict` so that the output for each genre is not a hard 0/1 decision but the estimated probability that the movie belongs to that genre.

* The macro-averaged AUC on the validation split is 0.7812. This is not considered a strong result: with this model, the genres assigned to a movie could be wrong fairly often.
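For reference, the macro average is the unweighted mean of the AUC computed separately for each label. A minimal sketch with made-up labels and probabilities (`y_true` and `y_prob` are hypothetical arrays, not notebook results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (3 labels) and predicted probabilities for 6 movies
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
y_prob = np.array([[0.9, 0.2, 0.6], [0.3, 0.7, 0.4], [0.8, 0.6, 0.1],
                   [0.4, 0.1, 0.7], [0.6, 0.3, 0.5], [0.2, 0.8, 0.3]])

# AUC of each label taken on its own column
per_label_auc = [roc_auc_score(y_true[:, k], y_prob[:, k]) for k in range(y_true.shape[1])]

# Macro average as computed by scikit-learn for multi-label input
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

print(per_label_auc, macro_auc)
```

Since every genre counts equally in this average, rare genres weigh as much as frequent ones; an AUC of 0.5 corresponds to random ranking.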

### Predicting on the test data

Having validated the model on the held-out part of the training data and obtained an AUC estimate, we now vectorize the plots of the test set with the already fitted vectorizer and predict the probability of each genre, one column per genre.

In [14]:
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [15]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)

In [16]:
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585
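The column names do not have to be typed by hand: they can be derived from the classes learned by the binarizer, which guarantees that each name matches the column order of `predict_proba`. A sketch on toy data (the `toy_` names and the probabilities are hypothetical); in the notebook the analogous expression would presumably be `['p_' + g for g in le.classes_]`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

toy_mlb = MultiLabelBinarizer().fit([["Drama", "Short"], ["Action", "Crime"]])

# Prefix each learned class name instead of maintaining the list manually
toy_cols = ["p_" + genre for genre in toy_mlb.classes_]

# Hypothetical probabilities for two movies, one column per learned class
toy_proba = np.array([[0.1, 0.2, 0.7, 0.3], [0.8, 0.6, 0.2, 0.0]])
toy_res = pd.DataFrame(toy_proba, index=[1, 4], columns=toy_cols)
print(toy_res)
```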


In [17]:
res.to_csv('pred_genres_text_RF.csv', index_label='ID')

In [18]:
X_test.shape

(2606, 1000)

## 02. LightGBM model for genre prediction

In [19]:
from lightgbm import LGBMClassifier

In [20]:
clf = OneVsRestClassifier( LGBMClassifier( random_seed=42 ) )
clf.fit(X_train.astype('float64'), y_train_genres.astype('float64'))

OneVsRestClassifier(estimator=LGBMClassifier(random_seed=42))

In [21]:
y_pred_genres = clf.predict_proba(X_test.astype('float64'))
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7978773764704655

* Compared by macro AUC, the Random Forest and LGBM models score __0.7812__ and __0.7979__ respectively; the second is slightly better, but the difference is small.

## 03. Using a pipeline to avoid data leakage

As seen above, the LGBM model predicts better, but we still need to search for the hyperparameters that best fit the task. To do so we wrap the preprocessing and the classifier in a pipeline: the transformations are fitted only on the data passed to `fit`, and the same steps are then applied, in the same order, to any new data.
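The leakage argument can be illustrated on a tiny example. The sketch below uses made-up sentences and `LogisticRegression` as a placeholder classifier (assumptions for illustration only): the vocabulary is learned from the training rows alone, so a word that appears only in new data never influences the fitted model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_text = ["space ship attack", "funny wedding party",
              "alien space war", "wedding comedy night"]
train_label = [1, 0, 1, 0]           # 1 = sci-fi, 0 = comedy (hypothetical labels)
new_text = ["zombie space wedding"]  # 'zombie' never appears in the training text

toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])
toy_pipe.fit(train_text, train_label)

# The vocabulary was learned from the training rows only
vocab = toy_pipe.named_steps["vectorizer"].vocabulary_
print("zombie" in vocab)

# New rows go through the already fitted vectorizer, then the classifier
new_proba = toy_pipe.predict_proba(new_text)
print(new_proba.shape)
```

When such a pipeline is passed to a cross-validated search, the vectorizer is refitted inside every fold, so the validation rows of a fold never contribute to the vocabulary used to train on that fold.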

In [23]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

In [24]:
X = dataTraining.drop(columns=['genres','rating']).reset_index(drop=True)
y = dataTraining['genres']

In [25]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
le = MultiLabelBinarizer()
y_train_genres_mlb = le.fit_transform(y_train_genres)
y_test_genres_mlb = le.transform(y_test_genres)

In [27]:
text_transformer = Pipeline(steps=[('vectorizer', CountVectorizer())])

preprocessor = ColumnTransformer(
    transformers=[('txt_title', text_transformer, 'title'),
                 ('txt_plot', text_transformer, 'plot')], remainder='passthrough', sparse_threshold = 0 )

clf = Pipeline([('preprocessor', preprocessor ),
               ('classifier', OneVsRestClassifier( LGBMClassifier( random_seed=42 )))])

In [28]:
clf.fit(X_train,y_train_genres_mlb)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                                   transformers=[('txt_title',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer())]),
                                                  'title'),
                                                 ('txt_plot',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer())]),
                                                  'plot')])),
                ('classifier',
                 OneVsRestClassifier(estimator=LGBMClassifier(random_seed=42)))])

In [29]:
y_pred_genres = clf.predict_proba(X_test)
roc_auc_score(y_test_genres_mlb, y_pred_genres, average='macro')

0.8695510005598357

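Compared with the LGBM trained in step __2__, the prediction improves markedly: the AUC rises from 0.80 to 0.87, a relative gain of about 9%. Note that this setup differs from step 2 in more than the use of a Pipeline: it also adds `title` and `year` as features, removes the 1,000-term vocabulary limit, and uses a 20% validation split instead of 33%, so the two AUC values are not measured on the same rows.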

### 04. Hyperparameter fine-tuning to improve the score

We define a grid with several candidate values for each hyperparameter and search for the best combination among them, as shown below.

The LGBM pipeline is then tuned with `HalvingRandomSearchCV`, using the following settings:

* `clf` = the pipeline containing the LGBM model
* `param_grid` = the grid of candidate hyperparameter values
* `n_candidates` = 'exhaust', which samples as many candidates as needed for the last iteration to use as many resources as possible
* `factor` = 4, so that at each iteration only about one quarter of the candidates is kept and each survivor receives four times more resources (training samples)
* `scoring` = 'neg_log_loss', the score used to compare the candidates
* `cv` = 5, the number of cross-validation folds
* `random_state` = 0, the seed used
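The effect of `factor` can be sketched with a simplified successive-halving plan. The function `halving_schedule` and the numbers passed to it are hypothetical and only illustrate the mechanism; scikit-learn derives the starting resources and the number of iterations from the data and its own rules, which are not reproduced here.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Simplified successive-halving plan: (candidates, samples per candidate) per round."""
    schedule = []
    n, r = n_candidates, min_resources
    while n >= 1 and r <= max_resources:
        schedule.append((n, r))
        if n == 1:
            break
        n = math.ceil(n / factor)  # keep roughly the best 1/factor of the candidates
        r = r * factor             # give each survivor factor times more samples
    return schedule

# Hypothetical numbers, chosen only to illustrate the mechanism
for round_no, (n, r) in enumerate(halving_schedule(64, 100, 6400, 4)):
    print(f"round {round_no}: {n} candidates x {r} samples")
```

Most candidates are therefore evaluated only on small samples, and the expensive fits on many rows are reserved for the few configurations that survive.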

In [30]:
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
import pickle

In [33]:
param_grid = {
    "classifier__estimator__num_leaves": [20, 31, 50, 100],
    "classifier__estimator__max_depth": [-1, 3, 6, 15, 20, 25, 50],
    "classifier__estimator__learning_rate": [0.05, 0.075, 0.1, 0.2, 0.3, 0.45, 0.4, 0.6],
    "classifier__estimator__n_estimators": [50, 100, 1000, 1500],
    "classifier__estimator__subsample_for_bin": [100000, 200000, 250000, 300000],
    "classifier__estimator__min_split_gain": [0.0, 0.3, 0.4, 0.1, 0.5, 0.7, 0.9],
    "classifier__estimator__min_child_samples": [10, 20, 30, 50],
    "classifier__estimator__subsample": [0.7, 0.8, 0.9, 1],
    "classifier__estimator__reg_alpha": [0.0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
    "classifier__estimator__reg_lambda": [0.0, 0.01, 0.03, 0.05, 0.1, 0.2, 0.3]
}

search = HalvingRandomSearchCV(clf,
                               param_grid,
                               n_candidates='exhaust',
                               factor=4,
                               scoring='neg_log_loss',
                               n_jobs=2,
                               cv=5,
                               random_state=0).fit(X_train, y_train_genres_mlb)

In [34]:
with open("LGBMClassifierSearch2.pkl", "wb") as model:
    pickle.dump(search, model)

In [35]:
y_pred_genres = search.predict_proba(X_test)
roc_auc_score(y_test_genres_mlb, y_pred_genres, average='macro')

0.8671279762876454

After fine-tuning, the AUC on the held-out validation split does not improve over the untuned Pipeline (0.8671 versus 0.8696). However, on the __Kaggle__ test data the tuned model does better than the untuned Pipeline, going from an AUC of 0.86 to 0.88.

## 05. Training on the full dataset

Once the best approach and hyperparameters have been identified, the model should be trained on as much data as possible. The best estimator is therefore refitted on all the rows of the original training set.

In [45]:
search.best_params_

{'classifier__estimator__subsample_for_bin': 100000,
 'classifier__estimator__subsample': 1,
 'classifier__estimator__reg_lambda': 0.3,
 'classifier__estimator__reg_alpha': 0.2,
 'classifier__estimator__num_leaves': 31,
 'classifier__estimator__n_estimators': 1000,
 'classifier__estimator__min_split_gain': 0.9,
 'classifier__estimator__min_child_samples': 30,
 'classifier__estimator__max_depth': 15,
 'classifier__estimator__learning_rate': 0.075}

In [38]:
best_clf = search.best_estimator_

In [39]:
le = MultiLabelBinarizer()
y_genres_mlb = le.fit_transform(y)

In [40]:
best_clf.fit(X, y_genres_mlb)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                                   transformers=[('txt_title',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer())]),
                                                  'title'),
                                                 ('txt_plot',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer())]),
                                                  'plot')])),
                ('classifier',
                 OneVsRestClassifier(estimator=LGBMClassifier(learning_rate=0.075,
                                                              max_depth=15,
                                                              min_child_samples=30,
                          

In [42]:
X_test_dtm = dataTesting

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = best_clf.predict_proba(X_test_dtm)

In [43]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.080779,0.08101,0.01331,0.015201,0.482351,0.043803,0.001697,0.466669,0.048681,0.22776,...,0.008372,0.009357,0.000158,0.681506,0.011398,0.001237,0.003685,0.207888,0.002959,0.001729
4,0.0759,0.012322,0.005079,0.219639,0.154151,0.199398,0.037461,0.768848,0.01497,0.012125,...,0.024042,0.030153,0.000158,0.034155,0.027519,0.007203,0.007469,0.157489,0.019946,0.008266
5,0.002244,0.002133,0.000566,0.013261,0.068823,0.864988,0.001203,0.829131,0.000712,0.006282,...,0.00093,0.663614,0.000176,0.118779,0.02172,0.000669,0.001207,0.506973,0.002633,0.000687
6,0.011534,0.038582,0.001378,0.013702,0.061144,0.045907,0.000747,0.808748,0.006504,0.012686,...,0.00819,0.098987,0.000158,0.039352,0.185578,0.00043,0.002071,0.322705,0.019989,0.00403
7,0.060829,0.108544,0.001746,0.023578,0.130385,0.018912,0.000884,0.143484,0.054584,0.307141,...,0.010135,0.035613,0.000158,0.132674,0.679945,0.0009,0.001601,0.179092,0.002772,0.004142


In [44]:
# Separate file name so that the Random Forest predictions saved earlier are not overwritten
res.to_csv('pred_genres_text_LGBM.csv', index_label='ID')

# Comments

* The LGBM model gives good predictions with its default settings. Hyperparameter tuning did not improve the validation AUC (0.8671 versus 0.8696), but it did raise the AUC on the Kaggle test data from 0.86 to 0.88.

* Following the steps in the order described above and finally training on all the available data yields a model with better predictive power, as reflected in the improved AUC on the test data of the Kaggle competition.
