![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, NLP predictive models, and model deployment. While developing it, follow the instructions given in the "Project 2 guide: Movie genre classification".

**Submission**: The project must be submitted during week 8. However, it is important that you make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report as a PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a movie genre dataset. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
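
Because a movie can carry several genres at once, the target is naturally a binary indicator matrix with one column per genre. The following is a minimal sketch of that encoding with scikit-learn's `MultiLabelBinarizer`, using the genre list from the example and a toy single-genre movie; the variable names are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two toy movies: the example above plus a made-up single-genre entry
genre_lists = [
    ['Comedy', 'Crime', 'Horror'],  # 'How to Be a Serial Killer'
    ['Drama'],                      # toy second movie
]

binarizer = MultiLabelBinarizer()
indicator = binarizer.fit_transform(genre_lists)  # shape: (n_movies, n_genres)
classes = list(binarizer.classes_)                # column order (sorted labels)

print(classes)
print(indicator)
```

Each row has a 1 in every column whose genre applies to that movie, so a model can output one probability per column.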

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: test-set prediction for the Kaggle submission
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.

In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
# Library imports
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

import nltk
from nltk.stem import WordNetLemmatizer
from imblearn.over_sampling import BorderlineSMOTE, SMOTE, ADASYN, SMOTENC, RandomOverSampler

wordnet_lemmatizer = WordNetLemmatizer()

# Download the WordNet data used by the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /Users/yovany/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/yovany/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [12]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [13]:
dataTraining.head()

| | year | title | plot | genres | rating |
|---|---|---|---|---|---|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [14]:
def split_into_lemmas(text):
    text = text.lower()
    words = text.split()
    return [wordnet_lemmatizer.lemmatize(word) for word in words]

In [15]:
# Define the predictor variables (X)
vec = CountVectorizer(tokenizer=split_into_lemmas, max_features=1000)
X_dtm = vec.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)
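
The shape `(7895, 1000)` means one row per training plot and one column per retained token. The sketch below reproduces the idea on a toy corpus. It assumes a plain whitespace tokenizer (a hypothetical `split_on_whitespace`) in place of the WordNet lemmatizer so that it runs without downloading NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for dataTraining['plot']
toy_plots = [
    'a serial killer teaches a video store clerk',
    'a detective hunts a serial killer in los angeles',
    'a single father takes his son to work',
]

def split_on_whitespace(text):
    # Simplified stand-in for split_into_lemmas (no lemmatization)
    return text.lower().split()

# Keep only the 5 most frequent tokens across the corpus
toy_vec = CountVectorizer(tokenizer=split_on_whitespace, max_features=5)
toy_dtm = toy_vec.fit_transform(toy_plots)

print(toy_dtm.shape)                 # (number of plots, number of kept tokens)
print(sorted(toy_vec.vocabulary_))   # the tokens that became columns
```

`max_features` caps the vocabulary by corpus frequency, which is why the real matrix has exactly 1000 columns.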

In [16]:
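# Define the target variable (y)
import ast

# Each entry is a string such as "['Short', 'Drama']"; parse it into a list
# with ast.literal_eval, which only accepts Python literals (safer than eval)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])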
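
The `genres` column stores each list as text, for example `"['Comedy', 'Crime', 'Horror']"`, so it has to be parsed before it can be binarized. A short sketch of doing this with the standard library's `ast.literal_eval`, which accepts only Python literals and rejects arbitrary expressions:

```python
import ast

raw_genres = "['Comedy', 'Crime', 'Horror']"

# Parse the string into a real Python list
parsed = ast.literal_eval(raw_genres)
print(parsed, type(parsed))

# Anything that is not a plain literal is rejected instead of being executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print('expression rejected:', rejected)
```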

In [17]:
# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [18]:
# Define the base model: one RandomForestClassifier per genre (one-vs-rest)
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=1000, max_depth=15, random_state=42))


# Random search over the number of trees and the maximum depth
from sklearn.model_selection import RandomizedSearchCV
clf_rd_search = RandomizedSearchCV(estimator=clf, param_distributions={"estimator__n_estimators": list(range(100, 1001, 100)),
                                                                 "estimator__max_depth": list(range(3, 21))},
                                n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)
clf_rd_search.fit(X_train, y_train_genres)

# Predicted probabilities on the held-out split
y_pred_genres = clf_rd_search.predict_proba(X_test)

# Model performance (macro-averaged AUC). The value is not displayed here because
# this is not the last expression in the cell; it is printed in a later cell.
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

# Transform the predictors (X) of the Kaggle test set
X_test_dtm = vec.transform(dataTesting['plot'])

# Predict on the Kaggle test set
y_pred_test_genres = clf_rd_search.predict_proba(X_test_dtm)


Fitting 5 folds for each of 10 candidates, totalling 50 fits
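
The metric used here, `roc_auc_score(..., average='macro')`, computes one AUC per genre column and takes their unweighted mean. The sketch below illustrates that definition in plain Python by counting correctly ranked (positive, negative) pairs. The helper names `binary_auc` and `macro_auc` are hypothetical, and this is an illustration of the definition, not scikit-learn's implementation.

```python
def binary_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs (one label per column)."""
    n_labels = len(Y_true[0])
    per_label = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels

# Toy data: 4 movies, 2 genres
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.3], [0.2, 0.8], [0.6, 0.2], [0.4, 0.5]]

toy_macro_auc = macro_auc(Y_true, Y_score)
print(toy_macro_auc)
```

In this toy case the first genre is ranked perfectly (AUC 1.0) and the second only half the time (AUC 0.5), so rare genres weigh as much as frequent ones in the average.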


In [None]:
# Random Search V2
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint


# Define the base model: one RandomForestClassifier per genre (one-vs-rest)
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=1000, max_depth=15, random_state=42))

# Search range for each hyperparameter: a list or a distribution
param_dist = {"estimator__max_depth": [3, None],                     # given list
              "estimator__max_features": sp_randint(1, 11),          # distribution (integers 1 to 10)
              "estimator__min_samples_split": sp_randint(2, 11),     # distribution (integers 2 to 10)
              "estimator__bootstrap": [True, False],                 # given list
              "estimator__criterion": ["gini", "entropy"]}           # given list

# Use random search + cross-validation to select hyperparameters
n_iter_search = 20

clf_rd_search2 = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=n_iter_search, cv=5)

# Fit the model
clf_rd_search2.fit(X_train, y_train_genres)

# Predicted probabilities on the held-out split
y_pred_genres = clf_rd_search2.predict_proba(X_test)

# Transform the predictors (X) of the Kaggle test set
X_test_dtm = vec.transform(dataTesting['plot'])

# Predict on the Kaggle test set (with the V2 search, not the first one)
y_pred_test_genres = clf_rd_search2.predict_proba(X_test_dtm)
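
To see what a mixed specification of lists and distributions yields, the candidates can be drawn directly with scikit-learn's `ParameterSampler`, the utility that `RandomizedSearchCV` uses to generate its candidates. This is a small sketch on a reduced, illustrative parameter dictionary; note that `sp_randint(1, 11)` samples integers from 1 to 10, since the upper bound is exclusive.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Reduced version of the search space: one list and one distribution
toy_param_dist = {
    "estimator__max_depth": [3, None],              # sampled uniformly from the list
    "estimator__max_features": sp_randint(1, 11),   # integers 1..10 (upper bound exclusive)
}

# Draw 5 candidate settings, as a random search with n_iter=5 would
candidates = list(ParameterSampler(toy_param_dist, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```

Passing `random_state` makes the draw reproducible, which the V2 search does not do.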

In [19]:
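# Print model performance (macro-averaged AUC)
print(roc_auc_score(y_test_genres, y_pred_genres, average='macro'))

# best_estimator_, best_score_ and best_params_ exist only after
# clf_rd_search2.fit(...) has been run (the Random Search V2 cell)
print(clf_rd_search2.best_estimator_)
print(clf_rd_search2.best_score_)
print(clf_rd_search2.best_params_)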

0.7919369333077283


AttributeError: 'RandomizedSearchCV' object has no attribute 'best_estimator_'
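
This `AttributeError` is what scikit-learn raises when `best_estimator_` is requested from a search object that has not been fitted (or that was created with `refit=False`). The cell that fits `clf_rd_search2` appears without an execution count, which suggests it had not been run. The sketch below reproduces the behavior on a toy problem; the decision tree and the synthetic data are assumptions chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary problem
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [1, 2, 3]},
    n_iter=2, cv=3, random_state=0,
)

# Before fit(): the fitted attributes do not exist yet
has_best_before = hasattr(search, "best_estimator_")

search.fit(X_toy, y_toy)

# After fit() (with the default refit=True): they are available
has_best_after = hasattr(search, "best_estimator_")

print(has_best_before, has_best_after)
print(search.best_params_)
```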

In [20]:
# Save the predictions in the format required by the Kaggle competition
cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

ID,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.095046,0.083828,0.023028,0.032895,0.375159,0.09949,0.042927,0.515594,0.053036,0.110213,...,0.028819,0.057682,0.0,0.430129,0.041709,0.008139,0.013013,0.172867,0.01289,0.015997
4,0.121104,0.069533,0.020147,0.102682,0.339917,0.244224,0.149664,0.502987,0.052238,0.05179,...,0.01926,0.047196,0.010492,0.1164,0.046694,0.015195,0.01447,0.170684,0.024201,0.012633
5,0.21355,0.128769,0.019785,0.060135,0.288934,0.501559,0.022521,0.638764,0.081063,0.107783,...,0.022827,0.357159,0.0,0.349164,0.112691,0.021754,0.046069,0.455046,0.064613,0.02986
6,0.149434,0.106181,0.019578,0.056674,0.306812,0.11467,0.009882,0.596499,0.076071,0.05521,...,0.186012,0.097967,0.0,0.185916,0.127264,0.00191,0.041131,0.268309,0.119922,0.016668
7,0.153125,0.189816,0.023326,0.021764,0.262214,0.229032,0.007543,0.436897,0.092158,0.233228,...,0.024522,0.104939,0.0,0.180202,0.451825,0.005705,0.02124,0.257249,0.012129,0.015932
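
The file written by `to_csv(..., index_label='ID')` has the observation index as its first column, labeled `ID`, followed by one probability column per genre. A reduced sketch with three toy columns and made-up probabilities, written to an in-memory buffer so the resulting text can be inspected:

```python
import io

import pandas as pd

# Toy stand-ins for the 24 genre columns and the predicted probabilities
toy_cols = ['p_Action', 'p_Comedy', 'p_Drama']
toy_pred = [[0.10, 0.38, 0.52],
            [0.12, 0.34, 0.50]]

toy_res = pd.DataFrame(toy_pred, index=[1, 4], columns=toy_cols)

# Write to a buffer instead of a file to look at the exact CSV text
buffer = io.StringIO()
toy_res.to_csv(buffer, index_label='ID')
csv_text = buffer.getvalue()

print(csv_text)
```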


### Save the tuned model

In [None]:
# Export the model to a binary .pkl file
import joblib

joblib.dump(clf_rd_search, 'modelo_predice_generos.pkl', compress=3)
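
The dump above stores only the fitted search object, while new plots also need the fitted vectorizer before they can be scored. How `predice_generos.py` handles this is not shown. One common option, sketched here as an assumption and not as the notebook's method, is to persist the vectorizer and classifier together as a scikit-learn `Pipeline`; the toy data, the logistic regression, and the file name are all illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy binary task: 1 = crime-like plot, 0 = romance-like plot
toy_plots = [
    'a killer stalks a small town',
    'two friends fall in love',
    'a detective hunts a killer',
    'a couple fall in love in paris',
]
toy_labels = [1, 0, 1, 0]

# Vectorizer and classifier travel together in one object
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(toy_plots, toy_labels)

# Round trip through a compressed .pkl file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(pipeline, path, compress=3)
    restored = joblib.load(path)

# The restored pipeline accepts raw text and gives the same probabilities
new_plot = ['a killer hides in paris']
same_output = np.allclose(pipeline.predict_proba(new_plot),
                          restored.predict_proba(new_plot))
print(same_output)
```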

## This section is independent, so it can be run without executing the previous cells.

- Only the .pkl file of the generated model is required
- the predice_generos.py library
- both placed in the same location as this notebook

### Usage example

In [None]:

# Import the model and the prediction function
from sklearn.feature_extraction.text import CountVectorizer
from predice_generos import predice_GENEROS

corpus=['most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender .  a day before ,  the boy meets a woman boarding a train ,  a drug abuser .  at the bridge ,  the father goes into the engine room ,  and tells his son to stay at the edge of the nearby lake .  a ship comes ,  and the bridge is lifted .  though it is supposed to arrive an hour later ,  the train happens to arrive .  the son sees this ,  and tries to warn his father ,  who is not able to see this .  just as the oncoming train approaches ,  his son falls into the drawbridge gear works while attempting to lower the bridge ,  leaving the father with a horrific choice .  the father then lowers the bridge ,  the gears crushing the boy .  the people in the train are completely oblivious to the fact a boy died trying to save them ,  other than the drug addict woman ,  who happened to look out her train window .  the movie ends ,  with the man wandering a new city ,  and meets the woman ,  no longer a drug addict ,  holding a small baby .  other relevant narratives run in parallel ,  namely one of the female drug - addict ,  and they all meet at the climax of this tumultuous film .']

# Genre prediction
y_pred_test_genres=predice_GENEROS(corpus) 

y_pred_test_genres
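
`predice_generos.py` is not included here, so the return type of `predice_GENEROS` is unknown. The sketch below is purely hypothetical: it shows one way a helper (named `probabilities_to_dict` for illustration) could map a single row of 24 predicted probabilities to the `p_`-prefixed keys used in the Kaggle file, so that no genre is left out by hand.

```python
# The 24 genres, in the column order of the Kaggle submission
GENRES = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History',
          'Horror', 'Music', 'Musical', 'Mystery', 'News', 'Romance',
          'Sci-Fi', 'Short', 'Sport', 'Thriller', 'War', 'Western']

def probabilities_to_dict(row):
    """Hypothetical helper: pair each probability with its 'p_<genre>' key."""
    if len(row) != len(GENRES):
        raise ValueError(f'expected {len(GENRES)} probabilities, got {len(row)}')
    return {f'p_{genre}': float(p) for genre, p in zip(GENRES, row)}

# Made-up probabilities for one movie: everything 0 except Drama
example_row = [0.0] * len(GENRES)
example_row[GENRES.index('Drama')] = 0.52

result = probabilities_to_dict(example_row)
print(result['p_Drama'], len(result))
```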

## Serve the model with Flask

In [None]:
import joblib
import werkzeug
from werkzeug.utils import cached_property
from flask import Flask

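# flask_restplus imports cached_property from the top level of werkzeug, which
# newer Werkzeug versions no longer expose; patch it and retry if the import fails
try:
    from flask_restplus import Api, Resource, fields, marshal
except ImportError:
    werkzeug.cached_property = werkzeug.utils.cached_property
    from flask_restplus import Resource, Api, fields, marshal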

from predice_generos import predice_GENEROS

# Flask application
app = Flask(__name__)

# Flask API
api = Api(
    app, 
    version='1.0', 
    title='Movie Genre Prediction API',
    description='Movie Genre Prediction API')

ns = api.namespace('predict', 
     description='Movie genre prediction')

# API arguments (query parameters)
parser = api.parser()

parser.add_argument(
    'Plot', 
    type=str, 
    required=True, 
    help='', 
    location='args')


resource_fields = api.model('Resource', {
    'p_Action': fields.String,
    'p_Adventure': fields.String,
    'p_Animation': fields.String,
    'p_Biography': fields.String,
    'p_Comedy': fields.String,
    'p_Crime': fields.String,
    'p_Documentary': fields.String,
    'p_Drama': fields.String,
    'p_Family': fields.String,
    'p_Fantasy': fields.String,
    'p_Film-Noir': fields.String,
    'p_History': fields.String,
    'p_Horror': fields.String,
    'p_Music': fields.String,
    'p_Musical': fields.String,  # present in the Kaggle columns, missing from the original list
    'p_Mystery': fields.String,
    'p_News': fields.String,
    'p_Romance': fields.String,
    'p_Sci-Fi': fields.String,
    'p_Short': fields.String,
    'p_Sport': fields.String,
    'p_Thriller': fields.String,
    'p_War': fields.String,
    'p_Western': fields.String
    })


# Resource class that serves the predictions
@ns.route('/')
class GenerosApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        print(args)
        generos=predice_GENEROS([args['Plot']])
        
        return {
        'p_Action': generos['p_Action'],
        'p_Adventure': generos['p_Adventure'],
        'p_Animation': generos['p_Animation'],
        'p_Biography': generos['p_Biography'],
        'p_Comedy': generos['p_Comedy'],
        'p_Crime': generos['p_Crime'],
        'p_Documentary': generos['p_Documentary'],
        'p_Drama': generos['p_Drama'],
        'p_Family': generos['p_Family'],
        'p_Fantasy': generos['p_Fantasy'],
        'p_Film-Noir': generos['p_Film-Noir'],
        'p_History': generos['p_History'],
        'p_Horror': generos['p_Horror'],
        'p_Music': generos['p_Music'],
        # Assumes predice_GENEROS also returns 'p_Musical', as in the Kaggle columns
        'p_Musical': generos['p_Musical'],
        'p_Mystery': generos['p_Mystery'],
        'p_News': generos['p_News'],
        'p_Romance': generos['p_Romance'],
        'p_Sci-Fi': generos['p_Sci-Fi'],
        'p_Short': generos['p_Short'],
        'p_Sport': generos['p_Sport'],
        'p_Thriller': generos['p_Thriller'],
        'p_War': generos['p_War'],
        'p_Western': generos['p_Western']
        }, 200

# Run the application, which serves the model locally on port 5002
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5002)
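
Once the application is running, the endpoint is queried with a GET request to the `predict` namespace, passing the synopsis in the `Plot` query parameter. The sketch below only builds the request URL with the standard library. It assumes the server is reachable at `localhost:5002`, and the helper name `build_predict_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the API is running on this machine, on the port set in app.run
BASE_URL = 'http://localhost:5002/predict/'

def build_predict_url(plot, base_url=BASE_URL):
    """Return the GET URL for one synopsis, with the text safely URL-encoded."""
    return base_url + '?' + urlencode({'Plot': plot})

plot = ('a serial killer decides to teach the secrets of his satisfying '
        'career to a video store clerk.')
url = build_predict_url(plot)
print(url)

# With the server running, the response could then be fetched with, for example:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     probabilities = json.load(response)
```

Encoding the synopsis with `urlencode` matters because plots contain spaces, commas, and other characters that are not valid in a raw query string.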