# Classifying film opinions with Machine Learning
## Problem context
The rise of streaming platforms and film-enthusiast communities has generated a large volume of data about movies, including technical information, synopses, and audience reactions. This work uses a dataset from the film domain to solve two classification tasks: a binary classification that predicts an opinion issued by a demanding cinephile audience, and a multiclass classification that predicts a movie's main genre.

In [14]:
# Imports
# Load the datasets

import pandas as pd
df = pd.read_csv('./datasets/movies_train.csv')  # Load the training set

## Data exploration and preprocessing

In [12]:
df.head() # Show the first rows of the DataFrame to check that it loaded correctly
df.info() # Show general information about the DataFrame, including dtypes and non-null counts
df.isna().sum() # Count missing values per column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3602 entries, 0 to 3601
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   budget             3602 non-null   int64  
 1   keywords           3298 non-null   object 
 2   original_language  3602 non-null   object 
 3   original_title     3602 non-null   object 
 4   overview           3600 non-null   object 
 5   popularity         3602 non-null   float64
 6   release_date       3601 non-null   object 
 7   revenue            3602 non-null   int64  
 8   runtime            3600 non-null   float64
 9   status             3602 non-null   object 
 10  vote_count         3602 non-null   int64  
 11  director           3579 non-null   object 
 12  main_genre_top10   3602 non-null   object 
 13  opinion            3602 non-null   int64  
dtypes: float64(2), int64(4), object(8)
memory usage: 394.1+ KB


budget                 0
keywords             304
original_language      0
original_title         0
overview               2
popularity             0
release_date           1
revenue                0
runtime                2
status                 0
vote_count             0
director              23
main_genre_top10       0
opinion                0
dtype: int64
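
The raw counts above are easier to judge as a share of the rows. The following is a minimal sketch on a tiny made-up frame (the `toy` DataFrame and its values are assumptions, not the movies data) that expresses missing values as a percentage per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movies data (all values are made up).
toy = pd.DataFrame({
    "keywords": ["space", None, "war", None],
    "runtime": [120.0, 95.0, np.nan, 101.0],
    "budget": [1, 2, 3, 4],
})

# isna().mean() is the fraction of missing values in each column.
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, the same expression would show how small the affected share is for columns such as `runtime` or `overview` compared with `keywords`.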

In [57]:
from sklearn.preprocessing import LabelEncoder

def clean_data(df, is_train=True):
    df = df.copy()

    # --- Drop rows with critical nulls, ONLY for the training set ---
    if is_train:
        df.dropna(subset=['runtime', 'overview', 'release_date'], inplace=True)

    # --- Fill remaining nulls ---
    df['keywords'] = df['keywords'].fillna('')
    df['director'] = df['director'].fillna('unknown')

    # --- Release date as a year (0 when the date is missing or unparseable) ---
    df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
    df['release_year'] = df['release_date'].dt.year.fillna(0).astype(int)

    # --- Encode categorical columns ---
    # Note: a new encoder is fitted on every call, so the integer codes produced
    # for the training set and for the test set are not guaranteed to match.
    label_cols = ['original_language', 'status', 'director', 'main_genre_top10']
    for col in label_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))  # Cast to str so NaN cannot break the encoder

    return df
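
Because `clean_data` fits a fresh `LabelEncoder` on whatever frame it receives, the same director or language can get different integer codes in the training and test sets. One possible alternative, sketched here on made-up frames (the `train`/`test` data and the choice of `-1` for unseen categories are assumptions, not part of the notebook), is to fit scikit-learn's `OrdinalEncoder` once on the training data and reuse it:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames (made-up values); "Z" never appears in training.
train = pd.DataFrame({"original_language": ["en", "fr", "en"],
                      "director": ["A", "B", "C"]})
test = pd.DataFrame({"original_language": ["fr", "en"],
                     "director": ["B", "Z"]})

cat_cols = ["original_language", "director"]

# Fit once on the training data; categories unseen at fit time map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = train.copy()
test_enc = test.copy()
train_enc[cat_cols] = encoder.fit_transform(train[cat_cols])
test_enc[cat_cols] = encoder.transform(test[cat_cols])

print(test_enc)
```

With this approach "fr" and "B" receive the same code in both frames, and the unseen director is flagged with `-1` instead of silently reusing another director's code.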


## Split Data

In [23]:
from sklearn.model_selection import train_test_split

def split_data(df, feature_cols, target_col, test_size=0.2, random_state=42):
    X = df[feature_cols]
    y = df[target_col]

    return train_test_split(X, y, test_size=test_size, random_state=random_state)
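
`split_data` performs a plain random split. When class balance matters, `train_test_split` also accepts a `stratify` argument. A small sketch on made-up labels (the 80/20 imbalance is an assumption chosen only to make the effect visible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 zeros and 20 ones.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```

For the roughly balanced `opinion` target the difference is likely small; it would matter more for the ten-class `main_genre_top10` target.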


## Model definitions

### Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def train_random_forest(X_train, X_test, y_train, y_test):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_test)

    print("Reporte de clasificación:\n")
    print(classification_report(y_test, y_pred))

    return rf, y_pred


### Logistic Regression

In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

def train_logistic_regression(X_train, X_test, y_train, y_test):
    lr = LogisticRegression(penalty=None, max_iter=1000, random_state=42, solver='lbfgs')
    lr.fit(X_train, y_train)

    y_pred = lr.predict(X_test)

    print("Reporte de clasificación - Logistic Regression:\n")
    print(classification_report(y_test, y_pred))

    return lr, y_pred





### Decision Tree

In [65]:
def train_decision_tree(X_train, X_test, y_train, y_test):
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X_train, y_train)

    y_pred = dt.predict(X_test)

    print("Reporte de clasificación - Decision Tree:\n")
    print(classification_report(y_test, y_pred))

    return dt, y_pred

### Gradient Boosting

In [66]:
def train_gradient_boosting(X_train, X_test, y_train, y_test):
    gbc = GradientBoostingClassifier(random_state=42)
    gbc.fit(X_train, y_train)

    y_pred = gbc.predict(X_test)

    print("Reporte de clasificación - Gradient Boosting:\n")
    print(classification_report(y_test, y_pred))

    return gbc, y_pred
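
Each training function above scores its model on a single hold-out split. As a complement, here is a hedged sketch of k-fold cross-validation with `cross_val_score`, run on synthetic data from `make_classification` (the synthetic data and `cv=5` are assumptions; the notebook itself does not cross-validate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary problem used only to keep the sketch self-contained.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Five folds: each sample is used for validation exactly once.
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much a one- or two-point accuracy gap between models can be trusted.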

## Using the models

### Random Forest

In [63]:
df_clean = clean_data(df)

feature_cols = ['budget', 'popularity', 'revenue', 'runtime', 'vote_count',
                'original_language', 'status', 'director', 'main_genre_top10', 'release_year']

target_col = 'opinion'

X_train, X_test, y_train, y_test = split_data(df_clean, feature_cols, target_col)

model, y_pred = train_random_forest(X_train, X_test, y_train, y_test)

# Upload to Kaggle:
# kaggle_submission(model)


Reporte de clasificación:

              precision    recall  f1-score   support

           0       0.70      0.72      0.71       365
           1       0.70      0.68      0.69       356

    accuracy                           0.70       721
   macro avg       0.70      0.70      0.70       721
weighted avg       0.70      0.70      0.70       721



In [None]:
# Logistic Regression
model_lr, y_pred_lr = train_logistic_regression(X_train, X_test, y_train, y_test)
kaggle_submission(model_lr)


Reporte de clasificación - Logistic Regression:

              precision    recall  f1-score   support

           0       0.67      0.75      0.71       365
           1       0.71      0.62      0.66       356

    accuracy                           0.68       721
   macro avg       0.69      0.68      0.68       721
weighted avg       0.69      0.68      0.68       721

In [72]:
# Decision Tree
model_dt, y_pred_dt = train_decision_tree(X_train, X_test, y_train, y_test)
kaggle_submission(model_dt)

Reporte de clasificación - Decision Tree:

              precision    recall  f1-score   support

           0       0.65      0.62      0.64       365
           1       0.63      0.66      0.64       356

    accuracy                           0.64       721
   macro avg       0.64      0.64      0.64       721
weighted avg       0.64      0.64      0.64       721



In [73]:
# Random Forest (already trained above)
model_rf, y_pred_rf = train_random_forest(X_train, X_test, y_train, y_test)
kaggle_submission(model_rf)

Reporte de clasificación:

              precision    recall  f1-score   support

           0       0.70      0.72      0.71       365
           1       0.70      0.68      0.69       356

    accuracy                           0.70       721
   macro avg       0.70      0.70      0.70       721
weighted avg       0.70      0.70      0.70       721



In [74]:
# Gradient Boosting
model_gbc, y_pred_gbc = train_gradient_boosting(X_train, X_test, y_train, y_test)
kaggle_submission(model_gbc)

Reporte de clasificación - Gradient Boosting:

              precision    recall  f1-score   support

           0       0.70      0.74      0.72       365
           1       0.72      0.68      0.70       356

    accuracy                           0.71       721
   macro avg       0.71      0.71      0.71       721
weighted avg       0.71      0.71      0.71       721
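
To compare the four classifiers side by side, the figures can be gathered into one table. The sketch below copies the accuracy and macro-averaged F1 values from the classification reports in this section by hand; nothing is recomputed:

```python
import pandas as pd

# Accuracy and macro-averaged F1 copied from the classification reports.
summary = pd.DataFrame(
    {
        "accuracy": [0.70, 0.68, 0.64, 0.71],
        "macro_f1": [0.70, 0.68, 0.64, 0.71],
    },
    index=["Random Forest", "Logistic Regression",
           "Decision Tree", "Gradient Boosting"],
)

# Best model first.
summary = summary.sort_values("accuracy", ascending=False)
print(summary)
```

Gradient Boosting and Random Forest are separated by one point of accuracy on 721 test rows, so the ranking between them should be read with caution.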



## Uploading to Kaggle

In [76]:

def kaggle_submission(model):
    # Load and clean the test dataset
    df_test = pd.read_csv('./datasets/movies_test.csv')
    df_test = df_test.dropna(subset=['id'])  # drop possible blank rows
    df_test = clean_data(df_test, False)  # apply the same transformations as for train, without dropping rows

    # Select the columns the model expects
    feature_cols = ['budget', 'popularity', 'revenue', 'runtime', 'vote_count',
                    'original_language', 'status', 'director', 'main_genre_top10', 'release_year']
    X_test_final = df_test[feature_cols]

    # Predict
    predictions = model.predict(X_test_final)

    # Build the submission DataFrame
    submission_df = pd.DataFrame({
        'id': df_test['id'],        # column name must match the Kaggle submission format
        'Rating': predictions       # column name must match the Kaggle submission format
    })

    # Check the number of rows
    assert len(submission_df) == 1201, f"Submission has {len(submission_df)} rows, expected 1201"

    # Save without the index
    submission_df.to_csv('submission.csv', index=False, header=True)
    return
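
Since `clean_data(df_test, False)` does not drop rows, any missing `runtime` in the test file would reach `model.predict` as `NaN`, which several of these estimators reject. Whether the test file actually contains such gaps is not shown here. Under that assumption, one way to handle it is to impute with statistics learned from the training data, sketched on made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric features; the test frame has gaps, the training frame does not.
train_num = pd.DataFrame({"runtime": [90.0, 100.0, 110.0],
                          "budget": [1.0, 2.0, 3.0]})
test_num = pd.DataFrame({"runtime": [np.nan, 95.0],
                         "budget": [5.0, np.nan]})

# Learn the medians from the training data only, then apply them to test.
imputer = SimpleImputer(strategy="median")
imputer.fit(train_num)
test_filled = pd.DataFrame(imputer.transform(test_num), columns=test_num.columns)

print(test_filled)
```

The median strategy is a choice made for this sketch; the important part is that the fill values come from the training set, so all 1201 test rows are kept.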


# PART 2: Neural Networks

In [77]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

def train_neural_network_for_genre(df, feature_cols=None, target_col='main_genre_top10',
                                   test_size=0.2, random_state=42, epochs=30, batch_size=32):
    """
    Train a neural network to predict the categorical variable 'main_genre_top10'.
    
    Parameters:
    - df: cleaned DataFrame with the data.
    - feature_cols: list of columns to use as features (if None, a default list is used).
    - target_col: name of the target column.
    - test_size: fraction of the data held out for testing.
    - random_state: seed for reproducibility.
    - epochs: number of training epochs.
    - batch_size: batch size used during training.
    
    Returns:
    - model: trained Keras model.
    - history: training history.
    - accuracy: accuracy on the test set.
    """

    if feature_cols is None:
        feature_cols = ['budget', 'popularity', 'revenue', 'runtime', 'vote_count',
                        'original_language', 'status', 'director', 'release_year']

    X = df[feature_cols]
    y = df[target_col]

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # One-hot encode target
    num_classes = len(np.unique(y))
    y_cat = to_categorical(y, num_classes=num_classes)

    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_cat,
                                                        test_size=test_size,
                                                        random_state=random_state)

    model = Sequential([
        Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(num_classes, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    history = model.fit(X_train, y_train,
                        epochs=epochs,
                        batch_size=batch_size,
                        validation_split=0.2,
                        verbose=2)

    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"Test Accuracy: {accuracy:.4f}")

    return model, history, accuracy
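
In the function above, `StandardScaler` is fitted on all rows before the split, so the test rows contribute to the scaling statistics. A variant that avoids this, sketched with random data in place of the movie features (the data and shapes are assumptions), splits first and fits the scaler on the training portion only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for the feature matrix and ten-class target.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 10, size=200)

# Split first so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

The scaled training features have zero mean and unit variance by construction, while the test features are only approximately standardized, which mirrors what happens with unseen data.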


In [78]:
model, history, acc = train_neural_network_for_genre(df_clean)


Epoch 1/30
72/72 - 2s - 24ms/step - accuracy: 0.2765 - loss: 2.1619 - val_accuracy: 0.3085 - val_loss: 2.0331
Epoch 2/30
72/72 - 0s - 4ms/step - accuracy: 0.3377 - loss: 1.9727 - val_accuracy: 0.3189 - val_loss: 1.9898
Epoch 3/30
72/72 - 0s - 4ms/step - accuracy: 0.3507 - loss: 1.9406 - val_accuracy: 0.3137 - val_loss: 1.9709
Epoch 4/30
72/72 - 0s - 4ms/step - accuracy: 0.3459 - loss: 1.9208 - val_accuracy: 0.3224 - val_loss: 1.9609
Epoch 5/30
72/72 - 0s - 4ms/step - accuracy: 0.3459 - loss: 1.8934 - val_accuracy: 0.3206 - val_loss: 1.9456
Epoch 6/30
72/72 - 0s - 5ms/step - accuracy: 0.3546 - loss: 1.8887 - val_accuracy: 0.3154 - val_loss: 1.9452
Epoch 7/30
72/72 - 0s - 5ms/step - accuracy: 0.3607 - loss: 1.8727 - val_accuracy: 0.3224 - val_loss: 1.9351
Epoch 8/30
72/72 - 0s - 4ms/step - accuracy: 0.3650 - loss: 1.8640 - val_accuracy: 0.3241 - val_loss: 1.9330
Epoch 9/30
72/72 - 0s - 5ms/step - accuracy: 0.3715 - loss: 1.8724 - val_accuracy: 0.3224 - val_loss: 1.9239
Epoch 10/30
72/72 - 0s - 5ms/