![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While developing it, follow the instructions given in the "Project 2 guide: Movie genre classification".

**Submission**: The project must be submitted during week 8. However, it is important to make progress in week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains the title of a movie, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting the test set for the Kaggle submission
This section shows the format in which the prediction results must be saved so that they can be uploaded to the Kaggle competition.

In [6]:
import warnings
warnings.filterwarnings('ignore')

In [12]:
!pip install --upgrade tensorflow



In [15]:
# Library imports
import os
import re
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from xgboost import XGBClassifier
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
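# NOTE: the TensorFlow/Keras imports below are commented out. The neural-network
# cells further down use tf, Sequential, Dense, EarlyStopping, K and KerasRegressor,
# and raise NameError unless these imports are restored.
# tensorflow.keras.wrappers.scikit_learn is not available in recent TensorFlow releases.
#import tensorflow as tf
#from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import Dense, Input
#from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
#from tensorflow.keras.callbacks import EarlyStopping
#from tensorflow.keras import backend as K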
from livelossplot import PlotLossesKeras

In [16]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [17]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [18]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [19]:
# Define the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [20]:
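# Define the target variable (y)
import ast
# Parse the genre lists stored as strings; ast.literal_eval only accepts Python literals,
# so it is safer than eval on data read from a file
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])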
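
A minimal sketch of what the label binarization above produces, written in plain Python for illustration only (the notebook itself uses `MultiLabelBinarizer`). The helper name `multi_hot` and the three sample movies are assumptions made for this example; the sorted class order mirrors how `MultiLabelBinarizer` orders its `classes_`.

```python
def multi_hot(label_lists):
    """Encode lists of labels as 0/1 rows, one column per distinct label."""
    # Distinct labels in sorted order define the columns
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


# Hypothetical sample: genre lists like those in dataTraining['genres']
sample_genres = [
    ['Short', 'Drama'],
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
]
classes, y = multi_hot(sample_genres)
print(classes)  # ['Comedy', 'Crime', 'Drama', 'Horror', 'Short']
print(y)        # [[0, 0, 1, 0, 1], [1, 1, 0, 1, 0], [0, 0, 1, 0, 0]]
```

Each row can contain several ones, which is what makes this a multi-label problem rather than a multi-class one.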

In [21]:
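# Hold out 33% of the training data to evaluate the model.
# The StandardScaler is fitted here, but the scaled matrices are not used below
# (the transform lines are commented out).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X_dtm)
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)
#X_train = pd.DataFrame(data=scaler.transform(X_train))
#X_test = pd.DataFrame(data=scaler.transform(X_test))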

In [22]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [23]:
# Prediction with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print the model performance (macro-averaged ROC AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7812262183677007
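
The score above is a macro-averaged ROC AUC: one AUC per genre column, averaged with equal weight. The following plain-Python sketch illustrates that definition using the pairwise interpretation of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as one half). The function names `auc_binary` and `macro_auc` and the toy data are assumptions for illustration, and the sketch assumes every column contains both classes.

```python
def auc_binary(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs."""
    n_labels = len(Y_true[0])
    per_label = [
        auc_binary([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data: 4 movies, 2 genres (hypothetical values)
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
print(macro_auc(Y_true, Y_score))  # 0.875
```

Because every genre counts equally, rare genres influence the macro average as much as frequent ones.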

In [24]:
y_test_genres

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [25]:
# Transform the predictor variables (X) of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Prediction on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [26]:
# Save the predictions in the format required by the Kaggle competition

res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585


# Project development

In [27]:

output_var = y_train_genres.shape[1]
print(output_var, ' output variables')
dims = X_train.shape[1]
print(dims, 'input variables')
dense_shape = X_train.shape
indices = np.column_stack(X_train.nonzero())
values = X_train.data
X_train_sparse = tf.sparse.SparseTensor(
    indices=indices,
    values=values,
    dense_shape=dense_shape
)
X_train_sparse_ordered = tf.sparse.reorder(X_train_sparse)
X_train_dense = tf.sparse.to_dense(X_train_sparse_ordered).numpy()
X_train2, X_val, Y_train2, Y_val = train_test_split(X_train_dense, y_train_genres, test_size=0.15, random_state=42)
def nn_model_params(optimizer ,
                    neurons,
                    batch_size,
                    epochs,
                    activation,
                    patience,
                    loss):
    
    K.clear_session()

    # Define the neural network with Sequential()
    model = Sequential()
    
    # Define the layers with the number of neurons and the activation function passed to nn_model_params
    # (note: the output layer uses the same activation; a sigmoid output is the usual choice for multi-label probabilities)
    model.add(Dense(neurons, input_shape=(dims,), activation=activation))
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(output_var, activation=activation))

    # Compile with the optimizer and loss function passed to nn_model_params
    model.compile(optimizer = optimizer, loss=loss)
    
    # Define EarlyStopping with the patience passed to nn_model_params
    early_stopping = EarlyStopping(monitor="val_loss", patience = patience)

    # Train the neural network with the parameters passed to nn_model_params
    model.fit(X_train2, Y_train2,
              validation_data = (X_val, Y_val),
              epochs=epochs,
              batch_size=batch_size,
              callbacks=[early_stopping, PlotLossesKeras()],
              verbose=True
              )
     
    return model

24  output variables
1000 input variables


In [28]:
nn_params = {
    'optimizer': ['adam','sgd'],
    'activation': ['relu','sigmoid','softmax'],
    'batch_size': [64,512],
    'neurons':[64,512],
    'epochs':[20,50],
    'patience':[2,5],
    'loss':['mean_squared_error','binary_crossentropy','categorical_crossentropy']
}
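
To get a sense of the cost of searching this grid exhaustively, the sketch below counts the combinations with `itertools.product`. The dictionary is copied from the cell above; `cv_folds = 3` is taken from the grid search call in the next cell. This is only an illustrative count, not part of the original pipeline.

```python
from itertools import product

nn_params = {
    'optimizer': ['adam', 'sgd'],
    'activation': ['relu', 'sigmoid', 'softmax'],
    'batch_size': [64, 512],
    'neurons': [64, 512],
    'epochs': [20, 50],
    'patience': [2, 5],
    'loss': ['mean_squared_error', 'binary_crossentropy', 'categorical_crossentropy'],
}

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*nn_params.values())))
cv_folds = 3  # value of cv used in the grid search

print(n_combinations)             # 288
print(n_combinations * cv_folds)  # 864 cross-validation fits
```

With 864 network trainings (plus a final refit, which `GridSearchCV` performs by default), a smaller grid or a randomized search may be worth considering.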

## Hyperparameter tuning with grid search (Grid Search)

In [29]:



nn_model = KerasRegressor(build_fn=nn_model_params)
gs = GridSearchCV(nn_model, nn_params, cv=3)

gs.fit(X_train2, Y_train2)

print('Best parameters according to Grid Search:', gs.best_params_)

NameError: name 'KerasRegressor' is not defined

In [30]:
model = nn_model_params(optimizer='adam', neurons=512, batch_size=64, epochs=50, activation='relu', patience=5, loss='mean_squared_error')


NameError: name 'K' is not defined

In [None]:
dense_shape = X_test.shape
indices = np.column_stack(X_test.nonzero())
values = X_test.data
X_test_sparse = tf.sparse.SparseTensor(
    indices=indices,
    values=values,
    dense_shape=dense_shape
)
X_test_sparse_ordered = tf.sparse.reorder(X_test_sparse)
X_test_dense = tf.sparse.to_dense(X_test_sparse_ordered).numpy()
Y_predict_neuronal = model.predict(X_test_dense)

In [None]:
roc_auc_score(y_test_genres, Y_predict_neuronal, average='macro')

In [None]:
X_test_dtm = vect.transform(dataTesting['plot'])
dense_shape = X_test_dtm.shape
indices = np.column_stack(X_test_dtm.nonzero())
values = X_test_dtm.data
X_test_dtm_sparse = tf.sparse.SparseTensor(
    indices=indices,
    values=values,
    dense_shape=dense_shape
)
X_test_dtm_sparse_ordered = tf.sparse.reorder(X_test_dtm_sparse)
X_test_dtm_dense = tf.sparse.to_dense(X_test_dtm_sparse_ordered).numpy()
Y_predict_neuronal = model.predict(X_test_dtm_dense)
# Save the test-set predictions
res = pd.DataFrame(Y_predict_neuronal, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_neuronal.csv', index_label='ID')
res.head()

In this case a neural network was trained. First, the data were standardized so that all features were on the same scale. Then the hyperparameters were tuned with GridSearchCV, which selected the following parameters: {'activation': 'relu', 'batch_size': 64, 'epochs': 50, 'loss': 'mean_squared_error', 'neurons': 512, 'optimizer': 'adam', 'patience': 5}. Finally, the model's predictive power was evaluated: it reaches a ROC AUC of 0.6, which indicates low predictive power.

## Second method

In [31]:

nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\perezs5\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\perezs5\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [32]:
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

# Data preprocessing
import ast
# Parse the genre lists stored as strings (literal_eval only accepts Python literals)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
dataTraining.dropna(subset=['plot'], inplace=True)
dataTesting.dropna(subset=['plot'], inplace=True)


In [33]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'\b\w{1,2}\b', '', text)  # Remove short words (1-2 characters)
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words])
    return text
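
To make the effect of these cleaning steps concrete, here is a minimal standard-library sketch that mirrors the regular expressions of `clean_text` on the example plot from the introduction. It is an illustration under assumptions: `STOP_WORDS` is a tiny hypothetical subset standing in for the NLTK English stopword list, and lemmatization is omitted so that the example runs without NLTK data.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {'the', 'and', 'his'}


def clean_text_sketch(text):
    text = re.sub(r'\b\w{1,2}\b', '', text)  # drop words of 1-2 characters
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation
    text = text.lower()
    # Drop stopwords (lemmatization is left out of this sketch)
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)


plot = 'A serial killer decides to teach the secrets of his career.'
print(clean_text_sketch(plot))
# serial killer decides teach secrets career
```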

In [34]:


# Download the required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Define the stopword set and the lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the text-cleaning function
def clean_text(text):
    # Remove words with fewer than 3 characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # Convert the text to lowercase
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords and lemmatize the remaining words
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
    return text

# Apply clean_text to the 'plot' columns
dataTraining['clean_plot'] = dataTraining['plot'].apply(clean_text)
dataTesting['clean_plot'] = dataTesting['plot'].apply(clean_text)

# Text vectorization (using TF-IDF)
vect = TfidfVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['clean_plot'])

# Add an extra feature: release year
X_additional = dataTraining[['year']].values
scaler = StandardScaler()
X_additional = scaler.fit_transform(X_additional)

# Concatenate the features
X = np.hstack((X_dtm.toarray(), X_additional))

# Binarize the genre labels
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

# Split into training and test sets
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X, y_genres, test_size=0.33, random_state=42)

# Model definition
clf = OneVsRestClassifier(XGBClassifier(n_jobs=-1, random_state=42))

# Parameters for GridSearchCV
param_grid = {
    'estimator__max_depth': [3, 5],
    'estimator__n_estimators': [100, 200],
    'estimator__learning_rate': [0.1, 0.01]
}

# Hyperparameter search with GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train_genres)

# Get the best model
best_clf = grid_search.best_estimator_

# Prediction with the tuned model
y_pred_genres = best_clf.predict_proba(X_test)

# Evaluate the model performance
score = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print(f'ROC AUC Score: {score:.4f}')

# Transform the predictor variables of the test set
X_test_dtm = vect.transform(dataTesting['clean_plot'])
X_test_additional = scaler.transform(dataTesting[['year']].values)
X_test_final = np.hstack((X_test_dtm.toarray(), X_test_additional))

# Prediction on the test set
y_pred_test_genres = best_clf.predict_proba(X_test_final)

# Save the predictions (cols is the list of column names defined earlier in the notebook)
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_optimized.csv', index_label='ID')
res.head()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\perezs5\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\perezs5\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\perezs5\AppData\Roaming\nltk_data...


ROC AUC Score: 0.8443


Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.092402,0.084652,0.01294,0.035859,0.250526,0.063461,0.008995,0.563048,0.038448,0.129786,...,0.011297,0.059565,2.7e-05,0.641219,0.049179,0.001708,0.008198,0.249507,0.003604,0.003297
4,0.111633,0.011719,0.007698,0.13059,0.299,0.373507,0.036538,0.720126,0.012663,0.015727,...,0.03051,0.049601,0.000169,0.048189,0.035659,0.01054,0.014539,0.135603,0.014654,0.011281
5,0.104538,0.028352,0.001959,0.013698,0.153403,0.895682,0.003289,0.78274,0.003807,0.046089,...,0.00488,0.70388,5.7e-05,0.159485,0.021397,0.000181,0.014431,0.523088,0.016002,0.001365
6,0.075964,0.050919,0.001829,0.057074,0.099848,0.080855,0.001344,0.69318,0.037175,0.01965,...,0.0224,0.131353,2.8e-05,0.254997,0.052519,0.000857,0.01459,0.257001,0.07434,0.024038
7,0.097777,0.060696,0.005795,0.039584,0.189369,0.046483,0.004461,0.138044,0.093255,0.37892,...,0.015328,0.032513,2.6e-05,0.042444,0.8918,0.002538,0.004779,0.140894,0.028046,0.028572


In this case a model is built with XGBClassifier. It achieves a ROC AUC of 0.84, which shows better predictive power than the neural network.

In [35]:
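import joblib
# Note: this trains and saves a new XGBClassifier with default hyperparameters,
# not the tuned best_clf found by the grid search above
modelo = XGBClassifier()
modelo.fit(X_train, y_train_genres)
joblib.dump(modelo, 'regresion.pkl', compress=3)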

['regresion.pkl']

In [44]:
import joblib

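scaler = StandardScaler()
# Fit on the raw year values: X_additional was already standardized in an earlier cell,
# so refitting on it would produce a scaler that leaves raw years almost unchanged
X_additional = scaler.fit_transform(dataTraining[['year']].values)

# Save the fitted scaler to a pkl file
joblib.dump(scaler, 'scaler.pkl')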


['scaler.pkl']
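
A scaler that is saved for serving needs to be fitted on the raw `year` values, because the API receives raw years. The plain-Python sketch below illustrates what would happen if a standard scaler were fitted on values that are already standardized: the fitted mean is about 0 and the standard deviation about 1, so a raw year would pass through nearly unchanged. The helpers `fit_scaler` and `transform` are hypothetical stand-ins for `StandardScaler` (which also uses the population standard deviation), and the years are the five shown in the training-data preview.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return (mean, population standard deviation), as a standard scaler would learn."""
    return mean(values), pstdev(values)


def transform(value, params):
    mu, sigma = params
    return (value - mu) / sigma


years = [1941, 1954, 1990, 2003, 2008]

raw_params = fit_scaler(years)                      # fitted on raw years
scaled_years = [transform(y, raw_params) for y in years]
refit_params = fit_scaler(scaled_years)             # fitted on already scaled values

print(raw_params)                     # mean near 1979, std of a few decades
print(refit_params)                   # approximately (0.0, 1.0)
print(transform(1999, raw_params))    # a small standardized value
print(transform(1999, refit_params))  # approximately 1999: not standardized at all
```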

In [46]:
# Choose the vectorizer (CountVectorizer or TfidfVectorizer)
vect = TfidfVectorizer(max_features=1000)  # max_features can be adjusted as needed

# Fit on and transform the training data
X_dtm = vect.fit_transform(dataTraining['clean_plot'])

# Save the fitted vectorizer to a pkl file
joblib.dump(vect, 'vectorizer.pkl')


['vectorizer.pkl']

In [52]:


# Binarize the genre labels
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])


# Save the label encoder
joblib.dump(le, 'label_binarizer.pkl')

['label_binarizer.pkl']

In [None]:
from flask import Flask
from flask_restx import Api, Resource
import joblib
from flask_cors import CORS
import numpy as np

app = Flask(__name__)
CORS(app)
api = Api(app)

parser = api.parser()
parser.add_argument('year', type=int, required=True, help='year', location='args')
parser.add_argument('title', type=str, required=True, help='title', location='args')
parser.add_argument('plot', type=str, required=True, help='Plot', location='args')

# Load the trained model, the vectorizer, the scaler and the label binarizer
vectorizer = joblib.load('vectorizer.pkl')
scaler = joblib.load('scaler.pkl')
modelo_regresion = joblib.load('regresion.pkl')
label_binarizer = joblib.load('label_binarizer.pkl')

def process_request(year, title, plot):
    # Vectorize the plot
    # (note: the vectorizer was fitted on text cleaned with clean_text; the raw plot is used here,
    # and title is accepted but not used as a feature)
    plot_vectorized = vectorizer.transform([plot])
    
    # Standardize the year
    year_normalized = scaler.transform([[year]])
    
    # Concatenate the features
    features = np.hstack((plot_vectorized.toarray(), year_normalized))
    
    # Predict the genres
    prediccion = modelo_regresion.predict(features)
    
    # Decode the prediction into genre names
    prediccion_decoded = label_binarizer.inverse_transform(prediccion)
    
    return prediccion_decoded[0]

class GenresInfo(Resource):
    @api.expect(parser)
    def get(self):
        args = parser.parse_args()
        year = args['year']
        title = args['title']
        plot = args['plot']
        
        result = process_request(year, title, plot)
        return {
            'description': f"The genres are: {', '.join(result)}",
            'result': result
        }

api.add_resource(GenresInfo, '/genres_info')

if __name__ == '__main__':
    app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.181.203.70:5000
Press CTRL+C to quit
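
Once the service is running, a client can query the `/genres_info` endpoint by passing `year`, `title` and `plot` as query-string arguments, as declared in the parser above. The following sketch only builds the request URL with the standard library; the helper name `build_genres_url` is an assumption for illustration, and the base URL is the local address shown in the server log.

```python
from urllib.parse import urlencode

# Local address reported by the Flask server log above
BASE_URL = 'http://127.0.0.1:5000/genres_info'


def build_genres_url(year, title, plot, base_url=BASE_URL):
    """Build the GET URL with the three arguments the parser expects."""
    query = urlencode({'year': year, 'title': title, 'plot': plot})
    return f'{base_url}?{query}'


url = build_genres_url(
    2008,
    'How to Be a Serial Killer',
    'a serial killer decides to teach',
)
print(url)
# http://127.0.0.1:5000/genres_info?year=2008&title=How+to+Be+a+Serial+Killer&plot=a+serial+killer+decides+to+teach
```

With the server running, the URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`.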
