# **Preprocesamiento**
---

## **Pasos del preprocesamiento del texto de los subtítulos**
---

- Convertir las frases en minúsculas
- Eliminar caracteres especiales y números presentes en el texto
- Eliminar los espacios sobrantes
- Eliminar caracteres sueltos
- Añadir una etiqueta de inicio y otra de fin a las frases para indicar el principio y el final de una frase

In [21]:
import os
import numpy as np
import json
import pandas as pd # type: ignore
from tqdm import tqdm # type: ignore
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array # type: ignore
from tensorflow.keras.preprocessing.text import Tokenizer # type: ignore
from tensorflow.keras.applications import DenseNet201 # type: ignore
from tensorflow.keras.models import Model # type: ignore

image_path = '../../src/database/flickr8k/Images'
captions_txt = '../../src/database/flickr8k/captions.txt'
data = pd.read_csv('../../src/database/flickr8k/captions.txt')

In [16]:
def text_preprocessing(data):
    data['caption'] = data['caption'].apply(lambda x: x.lower())
    data['caption'] = data['caption'].apply(lambda x: x.replace("[^A-Za-z]",""))
    data['caption'] = data['caption'].apply(lambda x: x.replace("\s+"," "))
    data['caption'] = data['caption'].apply(lambda x: " ".join([word for word in x.split() if len(word)>1]))
    data['caption'] = "startseq "+data['caption']+" endseq"
    return data

  data['caption'] = data['caption'].apply(lambda x: x.replace("\s+"," "))


In [17]:
data = text_preprocessing(data)
captions = data['caption'].tolist()
captions[:10]

['startseq child in pink dress is climbing up set of stairs in an entry way endseq',
 'startseq girl going into wooden building endseq',
 'startseq little girl climbing into wooden playhouse endseq',
 'startseq little girl climbing the stairs to her playhouse endseq',
 'startseq little girl in pink dress going into wooden cabin endseq',
 'startseq black dog and spotted dog are fighting endseq',
 'startseq black dog and tri-colored dog playing with each other on the road endseq',
 'startseq black dog and white dog with brown spots are staring at each other in the street endseq',
 'startseq two dogs of different breeds looking at each other on the road endseq',
 'startseq two dogs on pavement moving toward each other endseq']

## **Tokenización y representación codificada**
---

- Las palabras de una frase se separan/tokenizan y se codifican en una representación en caliente.
- Estas codificaciones se pasan a la capa de incrustación para generar incrustaciones de palabras.

In [18]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(caption.split()) for caption in captions)

images = data['image'].unique().tolist()
nimages = len(images)

split_index = round(0.85*nimages)
train_images = images[:split_index]
val_images = images[split_index:]

train = data[data['image'].isin(train_images)]
test = data[data['image'].isin(val_images)]

train.reset_index(inplace=True,drop=True)
test.reset_index(inplace=True,drop=True)

tokenizer.texts_to_sequences([captions[1]])[0]

[1, 18, 315, 63, 195, 116, 2]

## **Extracción de características de la imagen**
---

- Se utiliza la arquitectura DenseNet 201 para extraer las características de las imágenes
- También se puede utilizar cualquier otra arquitectura preentrenada para extraer características de estas imágenes.
- Dado que se ha seleccionado la capa Global Average Pooling como capa final del modelo DenseNet201 para la extracción de características, las imágenes incrustadas serán un vector de tamaño 1920

In [19]:
model = DenseNet201()
fe = Model(inputs=model.input, outputs=model.layers[-2].output)

img_size = 224
features = {}
for image in tqdm(data['image'].unique().tolist()):
    img = load_img(os.path.join(image_path,image),target_size=(img_size,img_size))
    img = img_to_array(img)
    img = img/255.
    img = np.expand_dims(img,axis=0)
    feature = fe.predict(img, verbose=0)
    features[image] = feature

2024-12-05 20:55:16.194663: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/densenet/densenet201_weights_tf_dim_ordering_tf_kernels.h5
[1m82524592/82524592[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 0us/step


100%|██████████| 8091/8091 [17:21<00:00,  7.77it/s]


## **Guardar preprocesamientos**
---

In [22]:
data.to_csv('../../src/database/flickr8k/captions_preprocessed.txt', index=False)

In [27]:
output_file = '../../src/database/flickr8k/images_features.json'
features = {k: v.tolist() for k, v in features.items()}
with open(output_file, 'w') as json_file:
    json.dump(features, json_file, indent=4)