# Lab 4: Natural Language Processing

__Due date: May 16, 2025__

The goal of this lab is to apply the theoretical concepts covered in class in the NLP module.

The most important part of this lab is not the Python code, but the analysis of the data and models you build and the reasoned explanation of every decision you make. __Code snippets or plots without any context or explanation will not be graded__.

Finally, remember to set the `random_state` parameter in every function that makes random decisions so that the results are reproducible (that is, they do not change between runs).
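As a minimal sketch of what a fixed seed buys, the snippet below uses Python's standard `random` module rather than scikit-learn. The helper `shuffled_indices` is hypothetical and only illustrates the idea: a generator created from the same seed always produces the same sequence of "random" decisions.

```python
import random

SEED = 1234  # any fixed integer works; this mirrors the value used for RANDOM_STATE


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order driven by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first_run = shuffled_indices(10, SEED)
second_run = shuffled_indices(10, SEED)

print(first_run == second_run)  # the same seed gives the same order
```

Passing `random_state` to a scikit-learn function plays the same role as `seed` here.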

In [1]:
RANDOM_STATE = 1234

# 1) Loading the dataset

The files `fake.csv` and `true.csv` contain news articles labeled as fake or true, respectively. Each article has the following attributes:

*   Title: headline of the article
*   Text: body of the article
*   Subject: topic of the article
*   Date: publication date of the article

Show one example of each class.

Study the dataset. Which words appear most often? Would it make sense to normalize the corpus in some way?

Create a partition of the data with 60% for training, 20% for validation, and the remaining 20% for test. Check that the class distribution is similar across the partitions.

In [2]:
import sys
!{sys.executable} -m pip install nltk



In [3]:
# All the imports needed for this section
import pandas as pd
import numpy as np
import re
import nltk



In [4]:
# Read the CSV files and add a column to each DataFrame indicating the type of news
fake_df = pd.read_csv('fake.csv')
true_df = pd.read_csv('true.csv')
fake_df['type'] = 0  # Fake news
true_df['type'] = 1  # True news

In [5]:
fake_df.head()  # Show the first rows of the fake news DataFrame

Unnamed: 0,title,text,subject,date,type
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [6]:
true_df.head()  # Show the first rows of the true news DataFrame

Unnamed: 0,title,text,subject,date,type
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [7]:
# Merge both DataFrames into a single one
df = pd.concat([fake_df, true_df], ignore_index=True)

In [8]:
# Convert the "text" and "type" columns of the merged DataFrame to arrays
news = np.array(df["text"])
tipo = np.array(df["type"])

Normalizing makes sense: otherwise we would be taking into account special characters, whitespace, and the so-called stopwords (prepositions, articles, ...), which have no discriminative power for our goal of interpreting and classifying each text based on its words. We do this with the function `normalize_document`, which, in addition to the above, lowercases every word so that words differing only in a capitalized first letter are not treated as distinct.

In [9]:
wpt = nltk.WordPunctTokenizer()
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
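    # remove special characters (flags go by keyword; the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    # lower case and strip leading/trailing whitespace
    doc = doc.lower()
    doc = doc.strip()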
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/_sergiio8_/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
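A note on `re.sub`, which `normalize_document` relies on: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag combination passed as the fourth positional argument is read as `count`. The sketch below shows the difference on a made-up sample string; `text` and the helper `remaining_digits` are illustrative names, not part of the lab code.

```python
import re

pattern = r'[^a-zA-Z\s]'
text = "x1 " * 300  # made-up sample containing 300 digits that should be stripped

limit = int(re.I | re.A)  # 258
# Passing re.I | re.A in fourth position is equivalent to count=258:
# only the first 258 matches are replaced.
stripped_as_count = re.sub(pattern, '', text, count=limit)
# Passing the flags by keyword replaces every match.
stripped_as_flags = re.sub(pattern, '', text, flags=re.I | re.A)


def remaining_digits(s):
    """Count the digits left in a string."""
    return sum(ch.isdigit() for ch in s)


print(limit, remaining_digits(stripped_as_count), remaining_digits(stripped_as_flags))
```

With the flags passed by keyword, long articles are cleaned in full instead of only up to the first 258 special characters.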


In [10]:
norm_news = normalize_corpus(news)  # Normalize every article in news

In [11]:
from collections import Counter
all_text = ' '.join(norm_news)  # Join all the texts

words = all_text.split()  # Split the joined text into words

conteo = Counter(words)  # Count the occurrences of each word

# Show the 10 most frequent words together with their counts. Some of them are to be expected
# (said, us, would, also, ...): generic verbs, pronouns, auxiliaries and adverbs. The most notable among the rest
# are trump, president, state and people, which makes sense because all the texts revolve around the
# United States and its politics.
print(conteo.most_common(10))

[('said', 130100), ('trump', 116723), ('us', 62643), ('would', 54958), ('president', 51202), ('people', 41107), ('one', 35661), ('state', 31362), ('also', 31202), ('new', 30976)]


In [27]:
from sklearn.model_selection import train_test_split

X_train, X_resto, y_train, y_resto = train_test_split(news, tipo, test_size=0.4, random_state=RANDOM_STATE, stratify = tipo)
X_val, X_test, y_val, y_test = train_test_split(X_resto, y_resto, test_size=0.5, random_state=RANDOM_STATE, stratify = y_resto)

To split the data as the assignment requires, we first set aside 60% of the data as the training set (X_train, y_train) by holding out 40% (`test_size=0.4`), and then split the held-out data (X_resto, y_resto) in half (`test_size=0.5`) to obtain the validation and test sets, each with 20% of the original data.
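The arithmetic behind the two-step split can be sketched as follows. The total of 44898 texts is the sum of the partition sizes reported in this notebook; rounding the held-out share up is an assumption about how the split sizes are computed.

```python
import math

n_samples = 44898  # 26938 + 8980 + 8980, the partition sizes reported in this notebook

# First split: hold out 40% of the data (assumption: the held-out share is rounded up)
n_resto = math.ceil(0.4 * n_samples)
n_train = n_samples - n_resto

# Second split: half of the held-out data goes to test, the rest to validation
n_test = math.ceil(0.5 * n_resto)
n_val = n_resto - n_test

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples, {n / n_samples:.1%} of the data")
```

Splitting 40% in half is what yields two partitions of 20% each: 0.4 × 0.5 = 0.2.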

In [28]:
print("Training texts:", len(y_train))
print("Test texts:", len(y_test))
print("Validation texts:", len(y_val))

print("Proporción de fake en train: " , np.count_nonzero(y_train == 0)/np.size(y_train))
print("Proporción de fake en val: " , np.count_nonzero(y_val == 0)/np.size(y_val))
print("Proporción de fake en test: " , np.count_nonzero(y_test == 0)/np.size(y_test))

Training texts: 26938
Test texts: 8980
Validation texts: 8980
Proporción de fake en train:  0.5229786918108249
Proporción de fake en val:  0.5230512249443207
Proporción de fake en test:  0.5229398663697105


A quick check confirms that the 60%, 20% and 20% proportions for training, validation and test, respectively, are correct (26938, 8980 and 8980 texts out of 44898), and that the share of fake news is practically the same in every partition (about 52.3%), because we passed `stratify=tipo` and `stratify=y_resto`.

# 2) Bag-of-words representation

Choose a bag-of-words representation, justify your choice, and apply it.
Show an example before and after applying the representation. Explain the changes.

In [29]:
X_train_norm = normalize_corpus(X_train)
X_val_norm = normalize_corpus(X_val)
X_test_norm = normalize_corpus(X_test)

In [30]:
X_train_norm  # The training texts as an array in which each position holds one text.

array(['release nude pictures might meant distraction trump disastrous two weeks beginning democratic national convention revealed melania trump may modeling without proper documentation words gasp may illegal immigrant appears strange things happening melania immigration status might married another american four years becoming mrs trumpthe story broken univision proving taking away reporter access backfire candidate claim immigration attorney worked trump organization said melania obtained green card based marriage four years married trumpbut michael wildes immigration attorney worked trump organization told univision investigative unit obtained green card four years earlier based marriage melania donald trump married jan bethesdabythe sea episcopal church palm beach floridawildes asked trump campaign far cricketswhen asked explain marriage discrepancy wildes said would seek clarification presumably trump organization later sent email saying hear back sorry wildes someone know immigr

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer is one bag-of-words option; it gives the number of occurrences of each word in each text.

# These CountVectorizer matrices (without a max_features limit) are declared in case they are needed later on;
# to visualize them we must set max_features so that the dense array fits in the computer's memory.
cv = CountVectorizer()
cv_train_matrix = cv.fit_transform(X_train_norm)

cv_test_matrix = cv.transform(X_test_norm)

In [21]:
# These CountVectorizer matrices are used to visualize the matrix, so we set max_features
# to keep the dense array within the computer's memory.
# We obtain the vocabulary (the set of distinct words in the texts) and convert the CountVectorizer matrix
# to an array so that it can be displayed.
cv2 = CountVectorizer(max_features = 10000)
cv_train_matrix_visualize = cv2.fit_transform(X_train_norm)
cv_train_matrix_visualize = cv_train_matrix_visualize.toarray()
vocab = cv2.get_feature_names_out()
pd.DataFrame(cv_train_matrix_visualize, columns=vocab)

Unnamed: 0,000,10,11,12,13,14,15,17,18,19,...,zika,zimbabwe,zimbabwean,zimmerman,zinke,zip,zone,zones,zuckerberg,zuma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26933,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26934,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26935,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26936,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# To better illustrate how CountVectorizer works, we show the counts of the 20 most frequent words in the corpus
# for the first texts; otherwise almost every cell of the matrix would be zero.

# Sum the counts of each word over all documents
frecuencia_global = np.sum(cv_train_matrix_visualize, axis=0)

# Pick the 20 most frequent words
N = 20
indices_top = np.argsort(frecuencia_global)[::-1][:N]

# Build a DataFrame with the 20 most common words
df_top = pd.DataFrame(cv_train_matrix_visualize[:, indices_top], columns=vocab[indices_top])
df_top.head()  # Show the first texts with the counts of the 20 most common words.

Unnamed: 0,said,trump,us,would,president,people,one,state,also,new,reuters,donald,states,house,government,clinton,obama,republican,could,told
0,3,17,0,2,0,1,2,0,0,0,0,1,0,0,0,2,0,0,0,1
1,0,0,0,1,0,0,0,0,0,1,1,0,0,2,0,0,0,0,0,0
2,1,0,6,2,1,0,1,7,0,1,0,0,2,0,5,0,1,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF is another bag-of-words option; it gives the TF-IDF weight of each word in each text (we discuss it
# in more detail later).

# These first matrices (without a max_features limit) are declared in case they are needed later in the lab.

tv = TfidfVectorizer()
tv_train_matrix = tv.fit_transform(X_train_norm)
tv_test_matrix = tv.transform(X_test_norm)

# To visualize how TF-IDF works, we build matrices with max_features set, to stay within the computer's memory,
# just as we did with CountVectorizer.
tv2 = TfidfVectorizer(max_features = 10000)
tv_matrix_visualize = tv2.fit_transform(X_train_norm)
vocab = tv2.get_feature_names_out()
tv_matrix_visualize = tv_matrix_visualize.toarray()
pd.DataFrame(np.round(tv_matrix_visualize, 2), columns=vocab)



Unnamed: 0,000,10,11,12,13,14,15,17,18,19,...,zika,zimbabwe,zimbabwean,zimmerman,zinke,zip,zone,zones,zuckerberg,zuma
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26934,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# To better illustrate how TF-IDF works, we show the TF-IDF values of the 20 words with the highest average TF-IDF
# for the first texts; otherwise almost every cell of the matrix would be zero.

# Compute the average TF-IDF of each word over the whole corpus
tfidf_promedio = np.mean(tv_matrix_visualize, axis=0)

# Get the 20 words with the highest average
N = 20
top_indices = np.argsort(tfidf_promedio)[::-1][:N]

# Show only those words for the first texts
df_tfidf_top = pd.DataFrame(
    np.round(tv_matrix_visualize[:10, top_indices], 2),
    columns=vocab[top_indices]
)
df_tfidf_top


Unnamed: 0,trump,said,us,president,would,people,clinton,house,state,obama,reuters,one,donald,republican,new,white,states,government,united,also
0,0.34,0.05,0.0,0.0,0.04,0.02,0.07,0.0,0.0,0.0,0.0,0.04,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.17,0.0,0.0,0.06,0.0,0.0,0.0,0.08,0.09,0.0,0.0,0.0,0.0
2,0.0,0.01,0.11,0.02,0.03,0.0,0.0,0.0,0.16,0.03,0.0,0.02,0.0,0.0,0.02,0.0,0.05,0.12,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.1,0.0,0.0
6,0.0,0.03,0.0,0.0,0.03,0.02,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.02,0.0
7,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06
8,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.02,0.01,0.01,0.0,0.06,0.0,0.0,0.07,0.0,0.0,0.01,0.0,0.02,0.03,0.0,0.0,0.0,0.0,0.03


Between CountVectorizer and TF-IDF, we choose to represent the texts as a matrix with the TF-IDF value of each word in each text. This value is more informative than the raw frequency, because it downweights words that appear often in all the texts (adverbs, auxiliary verbs, ...), even after normalization, and gives more weight to words that are frequent in only a few texts. For example, in text 2 would appears 2 times and obama appears once, yet both have the same TF-IDF value, 0.03. Something similar happens with government: it appears 5 times, only 2.5 times as often as would, yet its TF-IDF value is four times higher (0.12 versus 0.03). This is because would, being an auxiliary verb, appears frequently in all the texts, so TF-IDF downweights it.

On the other hand, either bag-of-words representation is clearly better than the array of texts. Comparing the two, as the assignment asks, the bag of words gives a much more organized and readable picture of each text in terms of its words, which is easy to study by looking at rows and columns, whereas if we simply store the texts in an array (X_train_norm) we cannot conclude or infer anything about them at a glance; that would require a careful, detailed reading.
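To make the downweighting concrete, here is a small hand-computed sketch on a made-up three-text corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by scaling each text to unit Euclidean length, which is, to my understanding, what `TfidfVectorizer` does with its default settings. The corpus and the helper names `idf` and `tfidf` are illustrative only.

```python
import math

# Toy corpus (made up for illustration): "would" appears in every text,
# "government" in only one of them.
docs = [
    "would government tax",
    "would trump said",
    "would people said",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(tokens):
    """TF-IDF weights of one text, scaled to unit Euclidean length."""
    raw = {term: tokens.count(term) * idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf(tokenized[0])
for term in tokenized[0]:
    print(f"{term:<11} count={tokenized[0].count(term)}  tf-idf={weights[term]:.2f}")
```

In the first text every word appears exactly once, yet "government" receives a higher weight than "would" because "would" occurs in all three texts.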

# 3) Apply 3 machine learning algorithms to solve the task

Justify why you chose them.
Tune each model with respect to one hyperparameter that you consider appropriate. Justify your choice.
Explain the results obtained.

As already mentioned, for all 3 algorithms we use TF-IDF as the bag-of-words representation.

In [17]:
from sklearn import tree
import numpy as np

depth_values = [3, 5, 10, 15, 20]
# We tune max_depth because it sets the maximum depth of the decision tree and
# therefore has a large effect on overfitting/underfitting. Other options were min_samples_split and
# min_samples_leaf, which control the minimum number of samples needed to split a node and to form a leaf.

for depth in depth_values:
    # random_state makes the choice between equally good splits reproducible
    tree_classifier = tree.DecisionTreeClassifier(max_depth=depth, random_state=RANDOM_STATE)
    tree_classifier.fit(tv_train_matrix, y_train)

    tree_train_predictions = tree_classifier.predict(tv_train_matrix)
    tree_test_predictions = tree_classifier.predict(tv_test_matrix)

    print("Árbol, porcentaje de aciertos en entrenamiento con max_depth = ", depth, " :", np.mean(tree_train_predictions == y_train))
    print("Árbol, porcentaje de aciertos en test con max_depth = ", depth, " :", np.mean(tree_test_predictions == y_test))

Árbol, porcentaje de aciertos en entrenamiento con max_depth =  3  : 0.9944316578810602
Árbol, porcentaje de aciertos en test con max_depth =  3  : 0.9922048997772829
Árbol, porcentaje de aciertos en entrenamiento con max_depth =  5  : 0.9962506496399138
Árbol, porcentaje de aciertos en test con max_depth =  5  : 0.9938752783964365
Árbol, porcentaje de aciertos en entrenamiento con max_depth =  10  : 0.9980325191179746
Árbol, porcentaje de aciertos en test con max_depth =  10  : 0.994097995545657
Árbol, porcentaje de aciertos en entrenamiento con max_depth =  15  : 0.9988120870146262
Árbol, porcentaje de aciertos en test con max_depth =  15  : 0.993652561247216
Árbol, porcentaje de aciertos en entrenamiento con max_depth =  20  : 0.9992204321033484
Árbol, porcentaje de aciertos en test con max_depth =  20  : 0.9935412026726058


As the output shows, the algorithm performs very well for every value of max_depth tried, with accuracy above 99% on both training and test.
Looking at the third decimal place, training accuracy keeps growing with max_depth (from 99.44% to 99.92%), whereas test accuracy peaks at max_depth = 10 (99.41%) and then drops slightly, which points to mild overfitting in the deepest trees.
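The gap between training and test accuracy can be read directly from the recorded output. The sketch below copies those numbers into a dictionary (the name `results` is illustrative) and picks the depth with the highest test accuracy. Since the notebook reports test accuracy, the sketch uses it; selecting the value on the validation split instead would keep the test set untouched for the final comparison.

```python
# (train accuracy, test accuracy) per max_depth, copied from the recorded output
results = {
    3: (0.9944316578810602, 0.9922048997772829),
    5: (0.9962506496399138, 0.9938752783964365),
    10: (0.9980325191179746, 0.994097995545657),
    15: (0.9988120870146262, 0.993652561247216),
    20: (0.9992204321033484, 0.9935412026726058),
}

gaps = {}
for depth, (train_acc, test_acc) in results.items():
    gaps[depth] = train_acc - test_acc  # a widening gap is a common sign of overfitting
    print(f"max_depth={depth:>2}  train={train_acc:.4f}  test={test_acc:.4f}  gap={gaps[depth]:.4f}")

best_depth = max(results, key=lambda depth: results[depth][1])
print("max_depth with the highest test accuracy:", best_depth)
```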

In [18]:
from sklearn import neighbors

k_values = [2, 4, 6, 8, 10, 12]
# As in previous labs, we tune the number of neighbors in k-NN, since it is the parameter with the
# largest influence on the algorithm's performance. Besides, most of the other parameters are not numeric.

for k in k_values:
    knn_classifier = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(tv_train_matrix, y_train)

    knn_train_predictions = knn_classifier.predict(tv_train_matrix)
    knn_test_predictions = knn_classifier.predict(tv_test_matrix)

    print("k-NN, porcentaje de aciertos en entrenamiento con k = ", k, " :", np.mean(knn_train_predictions == y_train))
    print("k-NN, porcentaje de aciertos en test con k = ", k, " :", np.mean(knn_test_predictions == y_test))

k-NN, porcentaje de aciertos en entrenamiento con k =  2  : 0.7230677852847279
k-NN, porcentaje de aciertos en test con k =  2  : 0.6478841870824054
k-NN, porcentaje de aciertos en entrenamiento con k =  4  : 0.6437003489494395
k-NN, porcentaje de aciertos en test con k =  4  : 0.6140311804008909
k-NN, porcentaje de aciertos en entrenamiento con k =  6  : 0.6114782092211746
k-NN, porcentaje de aciertos en test con k =  6  : 0.5938752783964365
k-NN, porcentaje de aciertos en entrenamiento con k =  8  : 0.5923973568936075
k-NN, porcentaje de aciertos en test con k =  8  : 0.5809576837416481
k-NN, porcentaje de aciertos en entrenamiento con k =  10  : 0.5803697379166975
k-NN, porcentaje de aciertos en test con k =  10  : 0.5740534521158129
k-NN, porcentaje de aciertos en entrenamiento con k =  12  : 0.5721285915806668
k-NN, porcentaje de aciertos en test con k =  12  : 0.56815144766147


For k-NN, the larger the number of neighbors, the worse the algorithm performs: test accuracy falls from 64.79% with k = 2 to 56.82% with k = 12, close to the 52.3% that would be obtained by always predicting the majority class (fake). A possible explanation is that, in a TF-IDF space as sparse and high-dimensional as this one, distances between texts are not very informative, so adding neighbors mostly adds noise. The best performance is reached with k = 2, with 72.31% accuracy on training and 64.79% on test, which leaves considerable room for improvement.

In [31]:
from sklearn.naive_bayes import MultinomialNB
alpha_values = [0, 0.1, 0.3, 1, 2, 6]
# We tune alpha, which controls how strongly the conditional probabilities are smoothed, since
# Naive Bayes relies on this smoothing to avoid zero probabilities for words not seen in a class.
# The default value is 1, and we include values both below and above it.
for alpha in alpha_values:
    mnb_classifier = MultinomialNB(alpha=alpha)

    mnb_classifier.fit(tv_train_matrix, y_train)

    mnb_train_predictions = mnb_classifier.predict(tv_train_matrix)
    mnb_test_predictions = mnb_classifier.predict(tv_test_matrix)

    print("Multinomial Naive Bayes, porcentaje de aciertos en entrenamiento con alpha = ", alpha, np.mean(mnb_train_predictions == y_train))
    print("Multinomial Naive Bayes, porcentaje de aciertos en test on alpha = ", alpha, np.mean(mnb_test_predictions == y_test))

  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(


Multinomial Naive Bayes, porcentaje de aciertos en entrenamiento con alpha =  0 0.99784690771401
Multinomial Naive Bayes, porcentaje de aciertos en test on alpha =  0 0.8407572383073497
Multinomial Naive Bayes, porcentaje de aciertos en entrenamiento con alpha =  0.1 0.9725295122132304
Multinomial Naive Bayes, porcentaje de aciertos en test on alpha =  0.1 0.9536748329621381
Multinomial Naive Bayes, porcentaje de aciertos en entrenamiento con alpha =  0.3 0.9649565669314722
Multinomial Naive Bayes, porcentaje de aciertos en test on alpha =  0.3 0.9501113585746103
Multinomial Naive Bayes, porcentaje de aciertos en entrenamiento con alpha =  1 0.956381320068305
Multinomial Naive Bayes, porcentaje de aciertos en test on alpha =  1 0.9454342984409799
Multinomial Naive Bayes, porcentaje de aciertos en entrenamiento con alpha =  2 0.95315168163932
Multinomial Naive Bayes, porcentaje de aciertos en test on alpha =  2 0.9436525612472161
Multinomial Naive Bayes, porcentaje de aciertos en entren

For MultinomialNB, the algorithm performs well whenever smoothing is applied: for the values of alpha greater than 0 shown above, test accuracy stays between 94.4% and 95.4%. The trend is that the larger alpha is, the lower the accuracy. The exception is alpha = 0: it reaches the highest training accuracy (99.78%) but only 84.08% on test, a clear sign of overfitting when no smoothing is applied. The best test accuracy is 95.37%, obtained with alpha = 0.1.
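The role of alpha can be illustrated with the usual additive smoothing estimate, P(word | class) = (count + alpha) / (total + alpha * V), where V is the vocabulary size. The function `smoothed_probability` and the numbers below are hypothetical and only meant to show the effect on a word never seen in a class.

```python
def smoothed_probability(word_count, class_total, vocab_size, alpha):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)


# Hypothetical numbers: a word never seen in a class that has 10 word
# occurrences in total, with a vocabulary of 5 distinct words.
for alpha in [0, 0.1, 1, 6]:
    p = smoothed_probability(word_count=0, class_total=10, vocab_size=5, alpha=alpha)
    print(f"alpha={alpha:<3}  P(unseen word | class) = {p:.4f}")
```

With alpha = 0 the unseen word gets probability zero, which rules out the class for any text containing it; as alpha grows, every word drifts toward the uniform value 1 / V and the estimates depend less on the training counts.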

# 4) Build neural networks with Keras using word embeddings in different ways

Justify your decisions and explain the results obtained.

In [102]:
import tensorflow
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences


max_words = 1500
max_comment_length = 20

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df.text)

sequences = tokenizer.texts_to_sequences(df.text)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
max_words = len(word_index)

data = pad_sequences(sequences, maxlen=max_comment_length)

ImportError: Traceback (most recent call last):
  File "C:\Users\usuario_local\AppData\Roaming\Python\Python311\site-packages\tensorflow\python\pywrap_tensorflow.py", line 73, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: DLL load failed while importing _pywrap_tensorflow_internal: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).


Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors for some common causes and solutions.
If you need help, create an issue at https://github.com/tensorflow/tensorflow/issues and include the entire stack trace above this error message.

In [103]:
from sklearn.model_selection import train_test_split

# data is a NumPy array of padded sequences, so the labels come from the tipo array
X_train, X_resto, y_train, y_resto = train_test_split(data, tipo, test_size=0.4, random_state=RANDOM_STATE, stratify = tipo)
X_val, X_test, y_val, y_test = train_test_split(X_resto, y_resto, test_size=0.5, random_state=RANDOM_STATE, stratify = y_resto)

# Set the embedding size to 50 dimensions
embedding_dim = 50



NameError: name 'data' is not defined

In [18]:
# MODEL 1. WITHOUT PRE-TRAINED EMBEDDINGS

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding

model1 = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
model1.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
# After the Embedding layer, our activations have shape `(batch_size, max_comment_length, embedding_dim)`.

# We flatten the 3D tensor of embeddings into a 2D tensor of shape `(batch_size, max_comment_length * embedding_dim)`
model1.add(Flatten())

# We add the classifier on top
model1.add(Dense(1, activation='sigmoid'))

model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model1.summary()

# Validate on the validation split so that the test set is only used for the final evaluation
history = model1.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(X_val, y_val))

score1 = model1.evaluate(X_test, y_test)

print("Accuracy: %.2f%%" % (score1[1]*100))
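The shapes involved in Model 1 can be checked with plain NumPy. This is a sketch under the assumption that an embedding layer behaves like a lookup table indexed by word id; the random table and batch below are made up, and the sizes mirror the values used above (1500 words, 50 dimensions, sequences of length 20).

```python
import numpy as np

rng = np.random.default_rng(1234)

vocab_size, embedding_dim, seq_length, batch_size = 1500, 50, 20, 4

# Hypothetical embedding table: one row of `embedding_dim` values per word index
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Hypothetical batch of padded sequences: integer word indices in [0, vocab_size)
batch = rng.integers(low=0, high=vocab_size, size=(batch_size, seq_length))

embedded = embedding_table[batch]             # lookup: (batch, length, dim)
flattened = embedded.reshape(batch_size, -1)  # flatten: (batch, length * dim)

print(embedded.shape, flattened.shape)
```

The first dimension is always the batch size, not the vocabulary size; the final dense layer then receives one vector of length 20 × 50 = 1000 per text.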

In [None]:
# MODEL 2. FROZEN PRE-TRAINED EMBEDDINGS

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# embedding_matrix is assumed to have been built in an earlier step from the pre-trained vectors,
# with shape (max_words, embedding_dim); it is not defined in the cells shown here.
print(f"max_words: {max_words}, embedding_dim: {embedding_dim}, max_comment_length: {max_comment_length}")
print(f"embedding_matrix shape: {embedding_matrix.shape}")

model2 = Sequential()
model2.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model2.add(Flatten())
model2.add(Dense(1, activation='sigmoid'))

# Build the model so that the Embedding layer has weights, then load the
# pre-trained vectors and freeze them
model2.build(input_shape=(None, max_comment_length))
model2.layers[0].set_weights([embedding_matrix])
model2.layers[0].trainable = False
model2.summary()

model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model2.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(X_val, y_val))

score2 = model2.evaluate(X_test, y_test)

In [None]:
# MODEL 3. PRE-TRAINED EMBEDDINGS, NOT FROZEN

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model3.add(Flatten())
model3.add(Dense(1, activation='sigmoid'))
model3.summary()

# Load the pre-trained vectors and let them be fine-tuned during training
model3.build(input_shape=(None, max_comment_length))
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = True

model3.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model3.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(X_val, y_val))

score3 = model3.evaluate(X_test, y_test)

# 5) Apply the models you built to the test data and compare them.

Compute the recall, precision and F1 metrics.
Discuss which model is the best, which is the worst, and why.

In [None]:
from sklearn.metrics import classification_report

# Threshold the sigmoid outputs at 0.5 and flatten them to a 1D array of class labels
y_pred1 = (model1.predict(X_test) > 0.5).astype("int32").ravel()

print("=== Model 1 metrics ===")
print(classification_report(y_test, y_pred1, digits=4))


In [None]:
y_pred2 = (model2.predict(X_test) > 0.5).astype("int32").ravel()

print("=== Model 2 metrics ===")
print(classification_report(y_test, y_pred2, digits=4))

In [None]:
y_pred3 = (model3.predict(X_test) > 0.5).astype("int32").ravel()

print("=== Model 3 metrics ===")
print(classification_report(y_test, y_pred3, digits=4))
