![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. Follow the instructions given in the "Guía del proyecto 2: Clasificación de género de películas" (Project 2 guide: movie genre classification).

**Submission**: The project is due during week 8. However, it is important that you make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
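Because a movie can carry several genres at once, the target is a multi-hot vector rather than a single class. The following plain-Python sketch illustrates the idea behind the `MultiLabelBinarizer` step used later in the notebook; the toy label lists and the helper name `multi_hot` are assumptions made for illustration (only the first entry comes from the example above).

```python
# Hypothetical toy labels; the first entry is the example movie from the text.
movies = [
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
    ['Crime', 'Drama'],
]

def multi_hot(label_lists):
    """Return (classes, rows): one 0/1 column per genre, sorted alphabetically."""
    classes = sorted({genre for labels in label_lists for genre in labels})
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_lists]
    return classes, rows

classes, rows = multi_hot(movies)
print(classes)
for row in rows:
    print(row)
```

Each column then becomes its own binary prediction problem, which is why the models below output one probability per genre.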

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting the test set for submission to Kaggle
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.
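As a rough sketch of that layout, the snippet below writes a tiny file with an `ID` column followed by one `p_<Genre>` column per genre, which is the shape the notebook produces with `to_csv(..., index_label='ID')`. It uses only the standard library so the layout is explicit; the three genres, the probabilities, and the helper name `write_submission` are made up for illustration, and the real file needs all 24 genre columns.

```python
import csv
import io

# Hypothetical subset of genres and made-up probabilities, for illustration only.
classes = ['Action', 'Comedy', 'Drama']
ids = [1, 4]
probs = [[0.10, 0.35, 0.50], [0.12, 0.37, 0.52]]

def write_submission(ids, probs, classes, handle):
    """Write one row per movie: its ID, then one probability per genre."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['ID'] + ['p_' + c for c in classes])
    for movie_id, row in zip(ids, probs):
        writer.writerow([movie_id] + list(row))

buffer = io.StringIO()
write_submission(ids, probs, classes, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```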

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%pip install livelossplot

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [99]:
# Library imports
import os
import re
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from xgboost import XGBClassifier
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K
from livelossplot import PlotLossesKeras

In [116]:
# Load data from .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [117]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,"most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender . a day before , the boy meets a woman boarding a train , a drug abuser . at the bridge , the father goes into the engine room , and tells his son to stay at the edge of the nearby lake . a ship comes , and the bridge is lifted . though it is supposed to arrive an hour later , the train happens to arrive . the son sees this , and tries to warn his father , who is not able to see this . just as the oncoming train approaches , his son falls into the drawbridge gear works while attempting to lower the bridge , leaving the father with a horrific choice . the father then lowers the bridge , the gears crushing the boy . the people in the train are completely oblivious to the fact a boy died trying to save them , other than the drug addict woman , who happened to look out her train window . the movie ends , with the man wandering a new city , and meets the woman , no longer a drug addict , holding a small baby . other relevant narratives run in parallel , namely one of the female drug - addict , and they all meet at the climax of this tumultuous film .","['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets of his satisfying career to a video store clerk .,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfiguring facial scar meets a gentleman who lives beyond his means . they become accomplices in blackmail , and she falls in love with him , bitterly resigned to the impossibility of his returning her affection . her life changes when one of her victims proves to be the wife of a plastic surgeon , who catches her in his apartment , but believes her to be a jewel thief rather than a blackmailer . he offers her the chance to look like a normal woman again , and she accepts , despite the agony of multiple operations . meanwhile , her gentleman accomplice forms an evil scheme to rid himself of the one person who stands in his way to a fortune - his four - year - old - nephew .","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the president of the tredway corporation avery bullard has just had a meeting with investment bankers and sends a telegram scheduling a meeting at the furniture factory in millburgh , pennsylvania , at six pm with his executives . bullard has never appointed an executive vice - president for the corporation after the death of the previous one but when he is getting a taxi , he has a stroke and dies on the street . a thief steals his wallet to get his money and his body goes to the morgue without identification . the investment banker george nyle caswell sees bullard ' s body from his window and decides to use the information to make money , asking a broker to sell as much tredway stocks as possible until the end of the day , with the intention of buying them back monday morning by a lower price making profit . meanwhile the executives unsuccessfully wait for bullard in the meeting room . when they learn that bullard is dead , the ambitions accountant vp and controller loren phineas shaw releases to the press the balance of tredway showing profit and assumes temporarily the leadership of the company , expecting to be elected the next president by the seven - member board . however , the vp for design and development mcdonald "" don "" walling and the vp and treasurer frederick y . alderson oppose to shaw . there is a struggle in the corporation for the position of president and shaw blackmails the vp for sales josiah walter dudley that is married and has a mistress , his secretary eva bardeman , to get his vote . caswell needs to cover the N , N stocks he sold and shaw promises to give to him the stocks for the price he sold if he is elected president . the vp for manufacturing jesse q . grimm is near to retire but is a close friend of frederick and supports him . therefore the heir of tredway and bullard ' s mistress julia o . tredway will be responsible to give the casting vote . 
but she is disenchanted with the corporation . who will be elected the next president ?",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing house carol hunnicut goes to a blind date with the lawyer michael tarlow , who has embezzled the powerful mobster leo watts . carol accidentally witnesses the murder of michel by leo ' s hitman . the scared carol sneaks out of michael ' s room and hides in an isolated cabin in canada . meanwhile the deputy district attorney robert caulfield and sgt . dominick benti discover that carol is a witness of the murder and they report the information to caulfield ' s chief martin larner and they head by helicopter to canada to convince carol to testify against leo . however they are followed and the pilot and benti are murdered by the mafia . caulfield and carol flees and they take a train to vancouver . caulfield hides carol in his cabin and he discloses that there are three hitman in the train trying to find carol and kill her . but they do not know her and caulfield does not know who might be the third killer from the mafia and who has betrayed him in his office .","['Action', 'Crime', 'Thriller']",6.6


In [118]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate . theresa osborne is running along the beach when she stumbles upon a bottle washed up on the shore . inside is a message , reading the letter she feels so moved and yet she felt as if she has violated someone ' s thoughts . in love with a man she has never met , theresa tracks down the author of the letter to a small town in wilmington , two lovers with crossed paths . but yet one can ' t let go of their past ."
4,1978,Midnight Express,"the true story of billy hayes , an american college student who is caught smuggling drugs out of turkey and thrown into prison ."
5,1996,Primal Fear,"martin vail left the chicago da ' s office to become a successful criminal lawyer , that success predicated on working on high profile cases . as such , he fights to get the case of naive nineteen year old rural kentuckian aaron stampler , an altar boy accused of the vicious bludgeoning death of archbishop rushman of chicago . the story that aaron tells marty is that he , abused by his father , was in the room when the murder was committed by a third party , a shadowy figure he did not see , before he blacked out , which commonly happens to him . not remembering anything during the blackout period , he awoke covered in the archbishop ' s blood , his fright the reason he ran from the police . he also states that he had no reason to kill the archbishop , who he loved as the father he wished he had . marty doesn ' t care if he is guilty or innocent , but needs to know the truth to defend him adequately . unlike the rest of the world , marty does believe his story , he who hopes he can use aaron ' s general appearance of being an innocent to his advantage . the powerful state attorney , john shaughnessy , who marty has had many a moral run - in , wants a first degree murder conviction and the death penalty in this case . he appoints to the case janet venable , who still has bad feelings toward marty , an ex - lover , their six month relationship which ended badly . although the case looks to be a slam dunk for janet , her career may be made or broken by its outcome . in building his case , marty comes across some major pieces of information , some pertaining to the archbishop himself , and one uncovered by dr . molly arrington about aaron , she a psychiatrist hired by marty to assess aaron ' s mental state . these pieces of information as a collective pose a problem for marty in how to mount a credible and legitimate defense for his client . 
it is more of a moral dilemma for marty if only because he believes the life of a young man , who he believes in , is at stake ."
6,1950,Crisis,"husband and wife americans dr . eugene and mrs . helen ferguson - he a renowned neurosurgeon - are traveling through latin america for a vacation . when they make the decision to return to new york earlier than expected , they find they are being detained by the military in the country they are in . ultimately , they learn the reason is that president raoul farrago , the tyrannical military dictator of the country , has been diagnosed with a brain tumor and will die without an operation to remove it , farrago choosing gene as the doctor to lead the surgical team . because of the volatile politics within the country and for his own safety as revolutionary forces would like to see him dead , farrago refuses to go to a hospital for the operation , instead it to be done at his home . despite not particularly liking farrago or his ways , gene agrees purely in his oath as a doctor . however , he ends up being caught in the middle between farrago / his brutal regime and the revolutionaries , each side who is willing to use him and helen to get what they want , namely the life or death of farrago ."
7,1959,The Tingler,"the coroner and scientist dr . warren chapin is researching the shivering effect of fear with his assistant david morris . dr . warren is introduced to ollie higgins , the relative of a criminal sentenced to the electric chair , while making the autopsy of the corpse , and he makes a comment about the tingler - effect to him . ollie asks for a lift to dr . warner , and introduces his deaf - mute wife martha higgins , who manages a theater of their own . dr . warner returns home , where he lives with his unfaithful and evil wife isabel stevens chapin and her sweet sister lucy stevens . dr . warner , upset with the situation with his wife , threatens and uses her as a subject of his experiment . when martha dies of fear , dr . warner makes her autopsy and finds a creature that lives inside every human being , feeds with fear and is controlled by the scream . once martha was not able to scream , the tingler was not rendered harmless and became enormous . when the living being escapes , dr . warner and ollie chase it in a crowded movie theater ."


In [119]:
# Define the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [120]:
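# Define the target variable (y)
import ast
# 'genres' holds the string form of a Python list; literal_eval parses it without executing arbitrary code
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)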
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [121]:
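from sklearn.preprocessing import StandardScaler
# The scaler is fit here, but the transform calls below are commented out,
# so the models in this notebook are trained on the unscaled counts
scaler = StandardScaler(with_mean=False)
scaler.fit(X_dtm)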
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)
#X_train = pd.DataFrame(data=scaler.transform(X_train))
#X_test = pd.DataFrame(data=scaler.transform(X_test))

In [56]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [57]:
# Predictions from the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7812262183677007

In [58]:
y_test_genres

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [59]:
# Transform the predictor variables (X) of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [60]:
# Save predictions in the format required by the Kaggle competition

res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585


# Project development

In [67]:

output_var = y_train_genres.shape[1]
print(output_var, ' output variables')
dims = X_train.shape[1]
print(dims, 'input variables')
# Convert the sparse document-term matrix to a dense NumPy array for Keras
X_train_dense = X_train.toarray()
X_train2, X_val, Y_train2, Y_val = train_test_split(X_train_dense, y_train_genres, test_size=0.15, random_state=42)
def nn_model_params(optimizer ,
                    neurons,
                    batch_size,
                    epochs,
                    activation,
                    patience,
                    loss):
    
    K.clear_session()

    # Define the neural network with Sequential()
    model = Sequential()
    
    # Define the layers with the number of neurons and the activation function passed to nn_model_params
    model.add(Dense(neurons, input_shape=(dims,), activation=activation))
    model.add(Dense(neurons, activation=activation))
    # Note: the output layer reuses the same activation, so with 'relu' the outputs are not bounded to [0, 1]
    model.add(Dense(output_var, activation=activation))

    # Compile with the optimizer and loss function passed to nn_model_params
    model.compile(optimizer = optimizer, loss=loss)
    
    # Define the EarlyStopping callback with the patience passed to nn_model_params
    early_stopping = EarlyStopping(monitor="val_loss", patience = patience)

    # Train the network with the parameters passed to nn_model_params
    model.fit(X_train2, Y_train2,
              validation_data = (X_val, Y_val),
              epochs=epochs,
              batch_size=batch_size,
              callbacks=[early_stopping, PlotLossesKeras()],
              verbose=True
              )
     
    return model

24  output variables
1000 input variables


In [68]:
nn_params = {
    'optimizer': ['adam','sgd'],
    'activation': ['relu','sigmoid','softmax'],
    'batch_size': [64,512],
    'neurons':[64,512],
    'epochs':[20,50],
    'patience':[2,5],
    'loss':['mean_squared_error','binary_crossentropy','categorical_crossentropy']
}
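To get a feel for the cost of searching this grid exhaustively, the sketch below counts the combinations it defines. The `cv_folds = 3` value mirrors the `cv=3` passed to `GridSearchCV` in this notebook; the count does not include the final refit on the best parameters.

```python
from itertools import product

# Same grid as in the notebook cell above
nn_params = {
    'optimizer': ['adam', 'sgd'],
    'activation': ['relu', 'sigmoid', 'softmax'],
    'batch_size': [64, 512],
    'neurons': [64, 512],
    'epochs': [20, 50],
    'patience': [2, 5],
    'loss': ['mean_squared_error', 'binary_crossentropy', 'categorical_crossentropy'],
}

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*nn_params.values())))
cv_folds = 3  # assumption taken from the cv=3 argument used with GridSearchCV
n_fits = n_combinations * cv_folds
print(n_combinations, 'combinations,', n_fits, 'model fits')
```

With several hundred fits of a neural network, a smaller grid or a randomized search may be worth considering when compute time is limited.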

## Grid search method

In [69]:



nn_model = KerasRegressor(build_fn=nn_model_params)
gs = GridSearchCV(nn_model, nn_params, cv=3)

gs.fit(X_train2, Y_train2)

print('Best parameters according to Grid Search:', gs.best_params_)

Loss
	training         	 (min:    0.040, max:    0.105, cur:    0.040)
	validation       	 (min:    0.086, max:    0.101, cur:    0.086)
Epoch 1/50
...
Epoch 50/50
Best parameters according to Grid Search: {'activation': 'relu', 'batch_size': 64, 'epochs': 50, 'loss': 'mean_squared_error', 'neurons': 512, 'optimizer': 'adam', 'patience': 5}


In [122]:
model = nn_model_params(optimizer='adam', neurons=512, batch_size=64, epochs=50, activation='relu', patience=5, loss='mean_squared_error')


Loss
	training         	 (min:    0.031, max:    0.096, cur:    0.031)
	validation       	 (min:    0.083, max:    0.088, cur:    0.085)


In [123]:
# Convert the sparse hold-out matrix to a dense NumPy array for Keras
X_test_dense = X_test.toarray()
Y_predict_neuronal = model.predict(X_test_dense)

In [124]:
roc_auc_score(y_test_genres, Y_predict_neuronal, average='macro')

0.6019548999447989

In [125]:
X_test_dtm = vect.transform(dataTesting['plot'])
# Convert the sparse Kaggle test matrix to a dense NumPy array for Keras
X_test_dtm_dense = X_test_dtm.toarray()
Y_predict_neuronal = model.predict(X_test_dtm_dense)
# Save the test-set predictions
res = pd.DataFrame(Y_predict_neuronal, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_neuronal.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.0,0.0,0.0,0.0,0.281182,0.0,0.0,0.426335,0.0,0.0,...,0.0,0.0,0.0,0.88893,0.0,0.0,0.0,0.0,0.0,0.0
4,0.18873,0.0,0.0,0.0,0.116675,0.144052,0.0,1.014342,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.165793,0.0,0.91908,0.0,0.0,...,0.0,1.0556,0.0,0.0,0.0,0.0,0.0,1.006275,0.0,0.0
6,0.0,0.0,0.0,0.0,0.127128,0.0,0.0,0.48107,0.0,0.0,...,0.0,0.0,0.0,0.106217,0.0,0.0,0.0,0.230854,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.477212,0.0,0.322055,...,0.0,0.0,0.0,0.166783,0.530611,0.0,0.0,0.016101,0.0,0.0


In this first approach we trained a neural network. A `StandardScaler` was fit so that all features would share the same scale, although in the cells shown its transform calls are commented out, so the network receives the raw counts. The hyperparameters were then tuned with GridSearchCV, which selected: {'activation': 'relu', 'batch_size': 64, 'epochs': 50, 'loss': 'mean_squared_error', 'neurons': 512, 'optimizer': 'adam', 'patience': 5}. Evaluated on the hold-out split, this configuration yields a macro ROC AUC of about 0.60, which indicates low predictive power. Note also that with a `relu` output layer some predicted values exceed 1 (for example 1.0556 for p_Mystery), so they cannot be read directly as probabilities.
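The out-of-range predictions follow from the choice of output activation. The small sketch below, using hypothetical pre-activation values, contrasts ReLU (unbounded above) with the sigmoid, which maps every value into (0, 1) and is a common choice for the output layer in multi-label problems. It is an illustration of the two functions only, not a retraining of the model above.

```python
import math

def relu(z):
    """ReLU: zero for negative inputs, identity otherwise (no upper bound)."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values of the output layer for one movie
logits = [-2.0, 0.0, 0.5, 3.0]

relu_out = [relu(z) for z in logits]
sigmoid_out = [sigmoid(z) for z in logits]

print('relu   :', relu_out)
print('sigmoid:', [round(s, 3) for s in sigmoid_out])
```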

## Second method

In [94]:

nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/richard/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/richard/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [95]:
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

# Data preprocessing
import ast
# Parse the string form of each genre list without executing arbitrary code
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
dataTraining.dropna(subset=['plot'], inplace=True)
dataTesting.dropna(subset=['plot'], inplace=True)


In [96]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'\b\w{1,2}\b', '', text)  # Remove short words (1-2 characters)
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words])
    return text

In [98]:


# Download the required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Define the stopword set and the lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define the text-cleaning function
def clean_text(text):
    # Remove words shorter than 3 characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # Convert the text to lowercase
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords and lemmatize the remaining words
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
    return text

# Apply clean_text to the 'plot' columns
dataTraining['clean_plot'] = dataTraining['plot'].apply(clean_text)
dataTesting['clean_plot'] = dataTesting['plot'].apply(clean_text)

# Vectorize the text (using TF-IDF)
vect = TfidfVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['clean_plot'])

# Add an extra feature: release year
X_additional = dataTraining[['year']].values
scaler = StandardScaler()
X_additional = scaler.fit_transform(X_additional)

# Concatenate the features
X = np.hstack((X_dtm.toarray(), X_additional))

# Binarize the genre labels
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

# Split into training and test sets
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X, y_genres, test_size=0.33, random_state=42)

# Model definition
clf = OneVsRestClassifier(XGBClassifier(n_jobs=-1, random_state=42))

# Parameters for GridSearchCV
param_grid = {
    'estimator__max_depth': [3, 5],
    'estimator__n_estimators': [100, 200],
    'estimator__learning_rate': [0.1, 0.01]
}

# Hyperparameter search with GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train_genres)

# Get the best model
best_clf = grid_search.best_estimator_

# Predictions from the tuned model
y_pred_genres = best_clf.predict_proba(X_test)

# Evaluate model performance
score = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print(f'ROC AUC Score: {score:.4f}')

# Transform the predictor variables of the test set
X_test_dtm = vect.transform(dataTesting['clean_plot'])
X_test_additional = scaler.transform(dataTesting[['year']].values)
X_test_final = np.hstack((X_test_dtm.toarray(), X_test_additional))

# Predict the test set
y_pred_test_genres = best_clf.predict_proba(X_test_final)

res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_optimized.csv', index_label='ID')
res.head()


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/richard/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/richard/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/richard/nltk_data...


ROC AUC Score: 0.8443


Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.092402,0.084652,0.01294,0.035859,0.250526,0.063461,0.008995,0.563048,0.038448,0.129786,...,0.011297,0.059565,2.7e-05,0.641219,0.049179,0.001708,0.008198,0.249507,0.003604,0.003297
4,0.111633,0.011719,0.007698,0.13059,0.299,0.373507,0.036538,0.720126,0.012663,0.015727,...,0.03051,0.049601,0.000169,0.048189,0.035659,0.01054,0.014539,0.135603,0.014654,0.011281
5,0.104538,0.028352,0.001959,0.013698,0.153403,0.895682,0.003289,0.78274,0.003807,0.046089,...,0.00488,0.70388,5.7e-05,0.159485,0.021397,0.000181,0.014431,0.523088,0.016002,0.001365
6,0.075964,0.050919,0.001829,0.057074,0.099848,0.080855,0.001344,0.69318,0.037175,0.01965,...,0.0224,0.131353,2.8e-05,0.254997,0.052519,0.000857,0.01459,0.257001,0.07434,0.024038
7,0.097777,0.060696,0.005795,0.039584,0.189369,0.046483,0.004461,0.138044,0.093255,0.37892,...,0.015328,0.032513,2.6e-05,0.042444,0.8918,0.002538,0.004779,0.140894,0.028046,0.028572


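In this second approach we fit an XGBClassifier (wrapped in OneVsRestClassifier, using TF-IDF features plus the release year) and obtain a macro ROC AUC of 0.84 on the hold-out split. This shows better predictive power than both the neural network (about 0.60) and the baseline random forest (0.78).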
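For reference, the macro-averaged ROC AUC used to compare the models is the plain mean of one AUC per genre. The sketch below computes it in plain Python through the pairwise-ranking interpretation of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as one half). The toy labels and scores and the helper names `roc_auc` and `macro_roc_auc` are assumptions for illustration; the notebook itself relies on `roc_auc_score(..., average='macro')` from scikit-learn.

```python
def roc_auc(y_true, y_score):
    """AUC for one label: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError('ROC AUC needs at least one positive and one negative example')
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_roc_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs, plus the per-label values."""
    n_labels = len(Y_true[0])
    per_label = [
        roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label

# Hypothetical toy data: 4 movies, 2 genres
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.4, 0.5]]

macro, per_label = macro_roc_auc(Y_true, Y_score)
print(per_label, macro)
```

Because every genre contributes equally to the mean, rare genres weigh as much as frequent ones such as Drama in this metric.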