## Movie Genre Classification

Classify a movie's genres based on its plot.

<img src="https://raw.githubusercontent.com/sergiomora03/AdvancedTopicsAnalytics/main/notebooks/img/moviegenre.png"
     style="float: left; margin-right: 10px;" />



### Data

Input:
- movie plot

Output:
- Probability that the movie belongs to each genre


### Evaluation

- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 30% Code with the data processing and models developed that support the reported results.
- 30% Presentation of no more than 15 minutes with the main results of the project.
- 10% Model performance achieved. Metric: "AUC".

- The project must be carried out in groups of 4 people.
- Use clear and rigorous procedures.
- The project is due on September 8th, 2023, at 11:59 pm, by email.
- Projects will not be accepted after the deadline or through any other channel.




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

In [None]:
pip install spacy_transformers

In [None]:
!python -m spacy download en_core_web_trf

In [None]:
# Standard library
import multiprocessing
import os
import re
from collections import Counter

# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# NLP
import nltk
import spacy
import spacy_transformers
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

# Modeling
import xgboost as xgb
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, r2_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC


In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stopwords_en = stopwords.words('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Data Loading


## Base Text

In [None]:
dataTraining = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [None]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [None]:
print("Training shape -> ", dataTraining.shape, " test shape -> ", dataTesting.shape)

Training shape ->  (7895, 5)  test shape ->  (3383, 3)


In [None]:
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


Null Check

In [None]:
# Check whether the 'plot' column contains null values
nulos_en_plot = dataTraining['plot'].isnull().sum()

# Print the number of null values
print("Number of null values in the 'plot' column:", nulos_en_plot)

Number of null values in the 'plot' column: 0


Genre Validation

In [None]:
conteo_genres = dataTraining['genres'].value_counts().head(10)

conteo_genres

['Drama']                         429
['Comedy']                        368
['Comedy', 'Drama', 'Romance']    306
['Comedy', 'Romance']             291
['Comedy', 'Drama']               287
['Drama', 'Romance']              282
['Documentary']                   154
['Crime', 'Drama', 'Thriller']    125
['Horror']                        115
['Drama', 'Thriller']             115
Name: genres, dtype: int64

Most Frequent Words in the Plots

In [None]:
# Join all plots into a single string
text_combined = ' '.join(dataTraining['plot'])

# Tokenize the text into words
words = nltk.word_tokenize(text_combined)

# Lowercase every token
words = [word.lower() for word in words]

# Remove English stop words
# Note: punctuation tokens are not filtered here, so they show up in the counts
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# Count the frequency of each token
word_counts = Counter(words)

# Show the 10 most common tokens
most_common_words = word_counts.most_common(10)
for word, count in most_common_words:
    print(f'{word}: {count}')

,: 57327
.: 46614
': 15571
-: 10466
``: 4776
n: 3832
one: 3010
life: 2721
new: 2255
(: 2072
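
The counts above are dominated by punctuation because the tokens are never filtered for alphabetic content. A minimal sketch of one way to restrict the count to words follows; it uses a regular expression instead of `nltk.word_tokenize`, and the `plots` list, the tiny `stop_words` set and the `top_words` helper are hypothetical stand-ins, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for dataTraining['plot'].
plots = [
    "in sweden , a female blackmailer meets a surgeon .",
    "a surgeon ' s life changes in new york .",
]

# Tiny illustrative stop-word set; the notebook uses NLTK's English list.
stop_words = {"a", "in", "s", "the"}


def top_words(texts, stop_words, n=10):
    """Count alphabetic tokens only, skipping stop words."""
    counts = Counter()
    for text in texts:
        # [a-z]+ keeps runs of letters, so commas, periods and quotes
        # never become tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


top = top_words(plots, stop_words, n=3)
print(top)
```

With the real data, the same idea could be applied by passing `dataTraining['plot']` and the NLTK stop-word set to the helper.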


In [None]:
import ast

# Empty set to collect the genres
unique_genres = set()

# Iterate over the first three rows (a quick sample) to extract their genres.
for genres_string in dataTraining['genres'][:3]:
    # Parse the string representation of the list into a Python list.
    # ast.literal_eval only accepts literals, unlike eval, which runs arbitrary code.
    genres_list = ast.literal_eval(genres_string)
    unique_genres.update(genres_list)

print(sorted(unique_genres))

['Comedy', 'Crime', 'Drama', 'Film-Noir', 'Horror', 'Short', 'Thriller']
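
The list above reflects only the first three rows. A sketch of the same extraction over every row is shown here on a hypothetical `genre_strings` list that mimics the string-encoded `genres` column; with the real data one would iterate over `dataTraining['genres']` instead.

```python
import ast

# Hypothetical rows mimicking the string-encoded 'genres' column.
genre_strings = [
    "['Short', 'Drama']",
    "['Comedy', 'Crime', 'Horror']",
    "['Drama', 'Film-Noir', 'Thriller']",
    "['Action', 'Crime', 'Thriller']",
]

unique_genres = set()
for genres_string in genre_strings:
    # literal_eval parses Python literals only; it does not execute code.
    unique_genres.update(ast.literal_eval(genres_string))

all_genres = sorted(unique_genres)
print(all_genres)
```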


## Text Preprocessing


In [None]:
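import ast

# Parse each string-encoded genre list into a Python list
# (literal_eval avoids executing arbitrary code, unlike eval).
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)

le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])  # Binarize the training genres into a multi-label indicator matrix.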
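
To make the target encoding concrete, here is a small sketch of what `MultiLabelBinarizer` produces. The three genre lists are made-up examples, and the variable names `mlb` and `y` are chosen only for this illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up genre lists for three movies.
genres = [["Short", "Drama"], ["Comedy", "Crime", "Horror"], ["Drama"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)

# classes_ holds the sorted label names; each column of y matches one label.
print(list(mlb.classes_))
print(y)
```

Each row is a movie and each column a genre, so a model that outputs one probability per column matches the required output format.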

In [None]:
# Lemmatizer
lemmatizer = WordNetLemmatizer()
# Stemmers (Porter and Snowball)
stemmer = PorterStemmer()
stemmersnow = SnowballStemmer('english')

In [None]:
# spaCy English model
# Note: an earlier cell downloads en_core_web_trf, but the small model is the one loaded here.
nlp = spacy.load("en_core_web_sm")

In [None]:
# Add any extra stop words considered necessary
stopwords_adicionales = ["is", "the","huw","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
stopwords_en.extend(stopwords_adicionales)

# Print the English stop-word list
print(list(stopwords_en))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
def quitar_nombres(text):
    # Tokenize and tag the text with spaCy
    doc = nlp(text)

    # Drop tokens that belong to a PERSON entity.
    # ('PERSON') is a plain string, so `in` would perform a substring test
    # (and match the empty entity type); an equality check avoids that.
    filtro = [token.text for token in doc if token.ent_type_ != 'PERSON']

    # Join the remaining tokens
    text = ' '.join(filtro)

    return text
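
When filtering by entity type, it helps to remember that `('PERSON')` is a parenthesized string rather than a tuple. The sketch below illustrates the difference in plain Python; the `ent_types` list is a made-up sample of values that `token.ent_type_` can take, with the empty string standing for tokens outside any entity.

```python
# Made-up sample of entity types; "" means the token is not part of an entity.
ent_types = ["", "PERSON", "GPE", "ORG"]

# ('PERSON') is just the string 'PERSON', so `in` is a substring test.
substring_hits = [e for e in ent_types if e in ('PERSON')]

# ('PERSON',) is a one-element tuple, so `in` is a membership test.
tuple_hits = [e for e in ent_types if e in ('PERSON',)]

print(substring_hits)
print(tuple_hits)
```

Because the empty string is a substring of every string, the substring form also matches every token that has no entity type.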

In [None]:
# Preprocessing function for the plot column.
def preprocesamiento(text):
  # Convert everything to lowercase
  text = text.str.lower()
  # Remove stop words
  text = text.apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords_en]))
  # Strip every character that is not a letter (special characters and digits)
  text = text.apply(lambda x: ' '.join([re.sub(r'[^a-zA-Z]', '', word) for word in x.split()]))
  # Collapse repeated whitespace
  text = text.apply(lambda x: ' '.join(x.split()))
  # Lemmatization
  text = text.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
  # Porter stemming (disabled)
  #text = text.apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
  # Snowball stemming (disabled)
  #text = text.apply(lambda x: ' '.join([stemmersnow.stem(word) for word in x.split()]))

  # Person-name removal (disabled)
  #text = text.apply(quitar_nombres)

  return text

In [None]:
dataTraining['plot'] = preprocesamiento(dataTraining['plot'])

# 1. NN (Neural Networks)

In [None]:
vect_rn = TfidfVectorizer(binary=True)
X_dtm_rn = vect_rn.fit_transform(dataTraining['plot'])  # Vectorize the movie plots after applying preprocesamiento
X_dtm_rn.shape

(7895, 34430)
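
With `binary=True`, `TfidfVectorizer` clips each term count to 1 before applying the IDF weighting, so the resulting matrix still contains real-valued weights rather than 0/1 entries. The sketch below compares both settings on two made-up documents; all names in it are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents; "drama" is repeated in the first one.
docs = ["drama drama love", "drama war"]

vect_counts = TfidfVectorizer()             # term frequency uses raw counts
vect_binary = TfidfVectorizer(binary=True)  # term frequency is clipped to 0/1

X_counts = vect_counts.fit_transform(docs)
X_binary = vect_binary.fit_transform(docs)

drama = vect_binary.vocabulary_["drama"]
love = vect_binary.vocabulary_["love"]

# Weights of "drama" and "love" in the first document under each setting.
w_counts = (X_counts[0, drama], X_counts[0, love])
w_binary = (X_binary[0, drama], X_binary[0, love])
print(w_counts, w_binary)
```

With raw counts the repeated word gets the larger weight; with binary term frequency the rarer word does, because only the IDF factor differs between the two terms.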

In [None]:
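# Models
# ==============================================================================
# Notes on the settings below:
# - activation="identity" keeps every hidden layer linear.
# - learning_rate and learning_rate_init are only used by the 'sgd'/'adam'
#   solvers, so they have no effect with solver='lbfgs'.
modelo_1 = MLPClassifier(
                hidden_layer_sizes=(50,10,5),
                activation="identity",
                learning_rate="adaptive",
                learning_rate_init=0.01,
                solver = 'lbfgs',
                max_iter = 1000,
                random_state = 123
            )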

In [None]:
X_train3, X_test3, y_train_genres3, y_test_genres3 = train_test_split(X_dtm_rn, y_genres, test_size=0.33, random_state=42)

modelo_1.fit(X=X_train3, y=y_train_genres3)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [None]:
# Predict on the test data.
# Note: predict returns hard 0/1 labels. AUC is a ranking metric, so computing it
# from hard labels typically understates what predict_proba scores would give.
predicciones_rn_test = modelo_1.predict(X=X_test3)

rn_auc_test = roc_auc_score(y_test_genres3, predicciones_rn_test, average='macro')

In [None]:
# Predict on the train data (hard labels, as above)
predicciones_rn_train = modelo_1.predict(X=X_train3)

rn_auc_train = roc_auc_score(y_train_genres3, predicciones_rn_train, average='macro')
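
Since the project metric is AUC, the kind of prediction passed to `roc_auc_score` matters. The sketch below, built on made-up labels and probabilities for six movies and two genres, shows how thresholding the scores into hard labels can lower the macro AUC compared with passing the probabilities directly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth for 6 movies and 2 genres.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

# Made-up predicted probabilities (the kind of output predict_proba returns).
y_proba = np.array([
    [0.90, 0.2],
    [0.40, 0.8],
    [0.60, 0.7],
    [0.30, 0.1],
    [0.45, 0.4],
    [0.20, 0.6],
])

# Hard labels obtained by thresholding at 0.5 (the kind of output predict returns).
y_hard = (y_proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, y_proba, average="macro")
auc_hard = roc_auc_score(y_true, y_hard, average="macro")
print(auc_proba, auc_hard)
```

In this toy case the probabilities rank every positive above every negative, while thresholding creates ties for the first genre and reduces its AUC.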

# 2. XGBoost

In [None]:
#vect3 = CountVectorizer(max_features=10200, min_df=4, binary = True) # Alternative with a cap on the number of features
vectxg = CountVectorizer(min_df=3, binary=True)
X_dtmxg = vectxg.fit_transform(dataTraining['plot'])  # Vectorize the movie plots
X_dtmxg.shape

(7895, 13974)

Train/Test Split

In [None]:
X_trainxg, X_testxg, y_train_genresxg, y_test_genresxg = train_test_split(X_dtmxg, y_genres, test_size=0.33, random_state=42)

In [None]:
columns_de=['Modelo','N_Arboles','Profundidad_Maxima','Learning_Rate','AUC_Train','AUC_Test']
results=pd.DataFrame(columns=columns_de)

Parameters for Multiple Runs

In [None]:
# Fixed parameters taken from the literature (all others keep the library defaults)
min_child_weight_fx=1
reg_lambda_fx=10
subsample_fx=0.9

# Variable parameters
n_estimators=[160] #default:50/100
max_depth=[11] #default:1/3
learning_rate=[0.1] #default:0.1/1

In [None]:
# Set the random seed for reproducibility
np.random.seed(2023)

for es in n_estimators:
  for dp in max_depth:
    for lr in learning_rate:
      # Model
      model_exp=xgb.XGBClassifier(booster='gbtree', max_depth=dp, learning_rate=lr,
                                  n_estimators=es, min_child_weight=min_child_weight_fx,
                                  subsample=subsample_fx, reg_lambda=reg_lambda_fx)

      # Training
      model_exp.fit(X_trainxg, y_train_genresxg)

      # Predicted probabilities on train
      y_train_pred_genresxg = model_exp.predict_proba(X_trainxg)
      # Predicted probabilities on test
      y_test_pred_genresxg = model_exp.predict_proba(X_testxg)

      # AUC on train
      AUC_train_xg=roc_auc_score(y_train_genresxg, y_train_pred_genresxg, average='macro')
      # AUC on test
      AUC_test_xg=roc_auc_score(y_test_genresxg, y_test_pred_genresxg, average='macro')

      # Store the run as a new row of the results DataFrame
      results_to_append = pd.Series(['XGBoost', es, dp, lr, AUC_train_xg, AUC_test_xg], index=columns_de)
      results = pd.concat([results, results_to_append.to_frame().T], ignore_index=True)
      print("Trees: ",es,", Depth: ", dp,", Learning Rate: ", lr,", AUC Train: ", AUC_train_xg,", AUC Test: ", AUC_test_xg)

Trees:  160 , Depth:  11 , Learning Rate:  0.1 , AUC Train:  0.9885075681316845 , AUC Test:  0.8510928159577179


# 3. SVM (Support Vector Machine)

In [None]:
# Create an SVM classifier (you can adjust hyperparameters here)
svm_classifier = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr')

X_train3, X_test3, y_train_genres3, y_test_genres3 = train_test_split(X_dtm_rn, y_genres, test_size=0.33, random_state=42)

# Wrap the SVM in a one-vs-rest classifier (one binary SVM per genre)
ovr_classifier = OneVsRestClassifier(svm_classifier)

# Train the SVM classifier on the training data
ovr_classifier.fit(X_train3, y_train_genres3)


In [None]:
# Predict on the test data.
# Note: predict returns hard 0/1 labels; AUC computed from them typically
# understates what continuous scores (e.g. decision_function) would give.
y_pred_svm_test = ovr_classifier.predict(X_test3)

svm_auc_test = roc_auc_score(y_test_genres3, y_pred_svm_test, average='macro')


In [None]:
# Predict on the train data (hard labels, as above)
y_pred_svm_train = ovr_classifier.predict(X_train3)

svm_auc_train = roc_auc_score(y_train_genres3, y_pred_svm_train, average='macro')
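
An `SVC` without `probability=True` has no `predict_proba`, but a one-vs-rest wrapper around it still exposes `decision_function`, whose continuous scores can be passed to `roc_auc_score`. The sketch below illustrates this on a tiny made-up dataset; it scores the training points only for brevity, whereas a real evaluation would use a held-out split as in the cells above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Tiny made-up dataset: 8 samples, 2 features, 2 labels.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2], [2, 2], [3, 1]],
    dtype=float,
)
# Label 0 is on when the first feature is >= 1, label 1 when the second is >= 1.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [0, 1], [1, 1], [1, 1]])

clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))
clf.fit(X, Y)

# One continuous score per sample and label (signed distance to the margin).
scores = clf.decision_function(X)
auc = roc_auc_score(Y, scores, average="macro")
print(scores.shape, auc)
```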

# 4. Logistic Regression Model

This scenario follows the same steps as the provided baseline exercise; the only change is raising the limit on the number of features, with `binary=True`.

## TF-IDF - Preprocessing

### TF-IDF vectorizer

In [None]:
vectlr_p = TfidfVectorizer(binary=True)
X_dtmlr_p = vectlr_p.fit_transform(dataTraining['plot'])  # Vectorize the movie plots after applying preprocesamiento
X_dtmlr_p.shape

(7895, 34430)

In [None]:
X_trainlr_p, X_testlr_p, y_train_genreslr_p, y_test_genreslr_p = train_test_split(X_dtmlr_p, y_genres, test_size=0.3, random_state=42)

Models

In [None]:
modelo_lr = LogisticRegression(n_jobs=-1, C=1.0, multi_class='multinomial', solver='lbfgs')

Logistic Regression Model

In [None]:
clf_lr_tfidf_preproc = OneVsRestClassifier(modelo_lr)
clf_lr_tfidf_preproc.fit(X_trainlr_p, y_train_genreslr_p)

In [None]:
y_pred_genreslr_p = clf_lr_tfidf_preproc.predict_proba(X_testlr_p)

In [None]:
AUC_logistc_regression_preproc = roc_auc_score(y_test_genreslr_p, y_pred_genreslr_p, average='macro')

In [None]:
round(AUC_logistc_regression_preproc,2)

0.89

In [None]:
y_pred_genreslr_p_train = clf_lr_tfidf_preproc.predict_proba(X_trainlr_p)

In [None]:
AUC_logistc_regression_preproc_train = roc_auc_score(y_train_genreslr_p, y_pred_genreslr_p_train, average='macro')

In [None]:
round(AUC_logistc_regression_preproc_train,3)

0.997

# AUC Comparison


## Overall Comparison: Train

In [None]:
print("AUC XGBoost             - with text cleaning        ->", round(AUC_train_xg,3))
print("AUC Neural Network      - TF-IDF with text cleaning ->", round(rn_auc_train,3))
print("AUC Logistic Regression - TF-IDF with text cleaning ->", round(AUC_logistc_regression_preproc_train,3))
print("AUC SVM                 - TF-IDF with text cleaning ->", round(svm_auc_train,3))

AUC XGBoost             - with text cleaning        -> 0.989
AUC Neural Network      - TF-IDF with text cleaning -> 0.877
AUC Logistic Regression - TF-IDF with text cleaning -> 0.997
AUC SVM                 - TF-IDF with text cleaning -> 0.826


## Overall Comparison: Test

In [None]:
print("AUC XGBoost             - with text cleaning        ->", round(AUC_test_xg,2))
print("AUC Neural Network      - TF-IDF with text cleaning ->", round(rn_auc_test,3))
print("AUC SVM                 - TF-IDF with text cleaning ->", round(svm_auc_test,3))
print("AUC Logistic Regression - TF-IDF with text cleaning ->", round(AUC_logistc_regression_preproc,2))

AUC XGBoost             - with text cleaning        -> 0.85
AUC Neural Network      - TF-IDF with text cleaning -> 0.588
AUC SVM                 - TF-IDF with text cleaning -> 0.537
AUC Logistic Regression - TF-IDF with text cleaning -> 0.89
