## Data Preparation and Data Wrangling

In this notebook we clean and prepare the raw data for later analysis and modeling. This includes cleaning the text, renaming columns, handling missing values, and applying natural language processing techniques.

Import the libraries.

In [36]:
import json 
import pandas as pd
import os

Define the path to the JSON file and check that it exists.

In [37]:
file_path = '../data/raw_data/tickets_classification_eng.json'

In [38]:
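# Stop here if the file is missing; otherwise `datos` would be undefined in the next cell
if not os.path.exists(file_path):
    raise FileNotFoundError(f"The file {file_path} does not exist")
with open(file_path, "r", encoding="utf-8") as file:
    datos = json.load(file)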

In [39]:
df = pd.json_normalize(datos)
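
To make the dotted column names used later easier to follow, here is a minimal sketch of how `pd.json_normalize` flattens nested records. The two sample records are hypothetical and only mimic the `_source` nesting that the column names suggest; the real file has many more fields.

```python
import pandas as pd

# Hypothetical records that imitate the nested layout of the raw file
sample = [
    {"_id": "1", "_source": {"product": "Mortgage",
                             "sub_product": "Conventional home mortgage"}},
    {"_id": "2", "_source": {"product": "Debt collection",
                             "sub_product": "Credit card debt"}},
]

# Nested keys are joined with a dot, e.g. `_source.product`
flat = pd.json_normalize(sample)
print(flat.columns.tolist())
```

Each nested key becomes its own column, which is why the later cells select names such as `_source.product`.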

In [40]:
df_clean = df.copy()

Select and rename the columns.

In [41]:
# Keep only the needed columns and rename them; chaining avoids renaming a slice in place
df_clean = df_clean[['_source.complaint_what_happened', '_source.product', '_source.sub_product']].rename(columns={
        '_source.complaint_what_happened': 'complaint_what_happened',
        '_source.product': 'category',
        '_source.sub_product': 'sub_product'
    })
df_clean

Unnamed: 0,complaint_what_happened,category,sub_product
0,,Debt collection,Credit card debt
1,Good morning my name is XXXX XXXX and I apprec...,Debt collection,Credit card debt
2,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Credit card or prepaid card,General-purpose credit card or charge card
3,,Mortgage,Conventional home mortgage
4,,Credit card or prepaid card,General-purpose credit card or charge card
...,...,...,...
78308,,Checking or savings account,Checking account
78309,"On Wednesday, XX/XX/XXXX I called Chas, my XXX...",Credit card or prepaid card,General-purpose credit card or charge card
78310,I am not familiar with XXXX pay and did not un...,Checking or savings account,Checking account
78311,I have had flawless credit for 30 yrs. I've ha...,Credit card or prepaid card,General-purpose credit card or charge card


Create the `ticket_classification` column, which combines the `category` and `sub_product` columns to represent the full classification of each ticket.

In [42]:
df_clean['ticket_classification'] = df_clean['category'] + ' + ' + df_clean['sub_product']

In [43]:
df_clean.drop(['category', 'sub_product'], axis=1, inplace=True)
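
One detail of the concatenation is worth a quick check: when either part is missing, the combined label is missing as well. The sketch below uses a hypothetical two-row frame (not the real data) to show this, which may explain why `ticket_classification` is included in the later `dropna` call.

```python
import pandas as pd

# Hypothetical rows: the second one has no sub_product
toy = pd.DataFrame({
    "category": ["Mortgage", "Debt collection"],
    "sub_product": ["Conventional home mortgage", None],
})

# String concatenation propagates missing values
toy["ticket_classification"] = toy["category"] + " + " + toy["sub_product"]
print(toy["ticket_classification"].isna().tolist())
```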

Replace empty strings with `pd.NA` and drop the rows that contain missing values.

In [45]:
# Assign the result back: replace(..., inplace=True) on a selected column acts on a copy
df_clean['complaint_what_happened'] = df_clean['complaint_what_happened'].replace('', pd.NA)
df_clean


Unnamed: 0,complaint_what_happened,ticket_classification
0,,Debt collection + Credit card debt
1,Good morning my name is XXXX XXXX and I apprec...,Debt collection + Credit card debt
2,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Credit card or prepaid card + General-purpose ...
3,,Mortgage + Conventional home mortgage
4,,Credit card or prepaid card + General-purpose ...
...,...,...
78308,,Checking or savings account + Checking account
78309,"On Wednesday, XX/XX/XXXX I called Chas, my XXX...",Credit card or prepaid card + General-purpose ...
78310,I am not familiar with XXXX pay and did not un...,Checking or savings account + Checking account
78311,I have had flawless credit for 30 yrs. I've ha...,Credit card or prepaid card + General-purpose ...


In [48]:
df_clean.dropna(subset=['complaint_what_happened', 'ticket_classification'], inplace=True)
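
As a small illustration of the replace-then-drop step, the sketch below runs it on a hypothetical three-row frame. It also strips whitespace first, on the assumption (not verified against the real data) that some complaints might contain only spaces and would otherwise survive the `''` check.

```python
import pandas as pd

# Hypothetical rows: one empty, one whitespace-only, one real complaint
toy = pd.DataFrame({
    "complaint_what_happened": ["", "   ", "Card was charged twice"],
    "ticket_classification": ["A + a", "B + b", "C + c"],
})

# Strip whitespace, turn empty strings into missing values, and assign back
toy["complaint_what_happened"] = (
    toy["complaint_what_happened"].str.strip().replace("", pd.NA)
)

# Drop incomplete rows and renumber the index from zero
toy = toy.dropna(
    subset=["complaint_what_happened", "ticket_classification"]
).reset_index(drop=True)
print(len(toy))
```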

In [49]:
df_clean

Unnamed: 0,complaint_what_happened,ticket_classification
1,Good morning my name is XXXX XXXX and I apprec...,Debt collection + Credit card debt
2,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Credit card or prepaid card + General-purpose ...
10,Chase Card was reported on XX/XX/2019. However...,"Credit reporting, credit repair services, or o..."
11,"On XX/XX/2018, while trying to book a XXXX XX...","Credit reporting, credit repair services, or o..."
14,my grand son give me check for {$1600.00} i de...,Checking or savings account + Checking account
...,...,...
78301,My husband passed away. Chase bank put check o...,Checking or savings account + Checking account
78303,After being a Chase Card customer for well ove...,Credit card or prepaid card + General-purpose ...
78309,"On Wednesday, XX/XX/XXXX I called Chas, my XXX...",Credit card or prepaid card + General-purpose ...
78310,I am not familiar with XXXX pay and did not un...,Checking or savings account + Checking account


Reset the index.

In [51]:
df_clean.reset_index(drop=True, inplace=True)

In [52]:
df_clean

Unnamed: 0,complaint_what_happened,ticket_classification
0,Good morning my name is XXXX XXXX and I apprec...,Debt collection + Credit card debt
1,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Credit card or prepaid card + General-purpose ...
2,Chase Card was reported on XX/XX/2019. However...,"Credit reporting, credit repair services, or o..."
3,"On XX/XX/2018, while trying to book a XXXX XX...","Credit reporting, credit repair services, or o..."
4,my grand son give me check for {$1600.00} i de...,Checking or savings account + Checking account
...,...,...
18958,My husband passed away. Chase bank put check o...,Checking or savings account + Checking account
18959,After being a Chase Card customer for well ove...,Credit card or prepaid card + General-purpose ...
18960,"On Wednesday, XX/XX/XXXX I called Chas, my XXX...",Credit card or prepaid card + General-purpose ...
18961,I am not familiar with XXXX pay and did not un...,Checking or savings account + Checking account


Create a directory for the clean data and save the DataFrame there as a CSV file.

In [53]:
clean_data_dir = '../data/clean_data'
os.makedirs(clean_data_dir, exist_ok=True)

In [54]:
output_path = os.path.join(clean_data_dir, 'clean_tickets.csv')
df_clean.to_csv(output_path, index=False)

print(f"Clean data saved to {output_path}")

Clean data saved to ../data/clean_data/clean_tickets.csv


## Data Wrangling

Next, we process the cleaned data, applying text-cleaning techniques to prepare it for modeling.

Import the required libraries.

In [61]:
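import nltk
import re
from nltk.corpus import stopwords
import contractions
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
# Tokenizer models used by nltk.word_tokenize below (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')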

In [57]:
# Reload the clean data under the same name that the following cells use
df_clean = pd.read_csv('../data/clean_data/clean_tickets.csv')
df_clean

Unnamed: 0,complaint_what_happened,ticket_classification
0,Good morning my name is XXXX XXXX and I apprec...,Debt collection + Credit card debt
1,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Credit card or prepaid card + General-purpose ...
2,Chase Card was reported on XX/XX/2019. However...,"Credit reporting, credit repair services, or o..."
3,"On XX/XX/2018, while trying to book a XXXX XX...","Credit reporting, credit repair services, or o..."
4,my grand son give me check for {$1600.00} i de...,Checking or savings account + Checking account
...,...,...
18958,My husband passed away. Chase bank put check o...,Checking or savings account + Checking account
18959,After being a Chase Card customer for well ove...,Credit card or prepaid card + General-purpose ...
18960,"On Wednesday, XX/XX/XXXX I called Chas, my XXX...",Credit card or prepaid card + General-purpose ...
18961,I am not familiar with XXXX pay and did not un...,Checking or savings account + Checking account


To clean the text we apply several language-processing techniques. We start with a function that expands contractions.

In [59]:
def expand_contractions(text):
    return contractions.fix(text)

We define a text-cleaning function that performs:
+ Expansion of contractions.
+ Conversion to lowercase.
+ Removal of runs of two or more `x` characters, such as the `XXXX` placeholders that mask redacted details in the complaints.
+ Removal of extra whitespace.

In [66]:
def clean_text(text):
    text = expand_contractions(text)
    text = text.lower()
    text = re.sub(r'xx+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
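
To see what the regular expressions do in isolation, here is a minimal sketch that leaves out contraction expansion so it depends only on the standard library. The helper names `strip_masks` and `strip_masks_bounded` are hypothetical, and the word-boundary variant is an optional alternative, not what the notebook uses.

```python
import re

def strip_masks(text):
    """Lowercase, drop runs of two or more 'x', and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'xx+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def strip_masks_bounded(text):
    """Variant that only drops runs of 'x' standing alone as a token."""
    text = text.lower()
    text = re.sub(r'\bx{2,}\b', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_masks("I upgraded my XXXX XXXX card in XX/XX/2018"))
# → i upgraded my card in //2018

# The plain pattern also removes 'xx' inside ordinary words
print(strip_masks("Exxon"))          # → eon
print(strip_masks_bounded("Exxon"))  # → exxon
```

The leftover `//` from masked dates is visible in the processed output further down; whether to remove it as well is a modeling decision.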

Finally, we create a function called `preprocess_text` that does the following:
+ Applies the `clean_text` function.
+ Tokenizes the text, that is, splits it into individual words.
+ Removes stopwords, the common words that carry little meaning on their own.
+ Lemmatizes the text, reducing each word to its base form so that variants of the same word can be grouped and analyzed more easily.

In [67]:
# Build the stopword set and the lemmatizer once rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Clean the text
    text = clean_text(text)
    # Tokenize
    words = nltk.word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Lemmatize
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the words back into a single string
    text = ' '.join(words)
    return text
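
The tokenizer keeps punctuation marks as separate tokens, and they are not stopwords, so they remain in the processed text. If that is unwanted, one possible tweak is to keep only tokens that contain a letter or digit. The sketch below shows the idea with stand-ins: a regex tokenizer instead of `nltk.word_tokenize` and a deliberately tiny, hypothetical stopword set instead of NLTK's English list. It does not lemmatize.

```python
import re

# Hypothetical, very small stopword set used only for this illustration
TOY_STOP_WORDS = {"i", "it", "my", "the", "on", "was", "a", "and"}

def filter_tokens(text):
    # Stand-in tokenizer: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Drop stopwords and tokens without any letter or digit
    return [
        token for token in tokens
        if token not in TOY_STOP_WORDS and any(ch.isalnum() for ch in token)
    ]

print(filter_tokens("Chase card was reported on //2019 . However , it was fraudulent"))
# → ['chase', 'card', 'reported', '2019', 'however', 'fraudulent']
```

In `preprocess_text` the same condition could be added to the stopword list comprehension, if punctuation turns out not to help the model.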

In [68]:
df_clean['clean_complaint'] = df_clean['complaint_what_happened'].apply(preprocess_text)

In [69]:
df_clean

Unnamed: 0,complaint_what_happened,ticket_classification,clean_complaint
0,Good morning my name is XXXX XXXX and I apprec...,Debt collection + Credit card debt,good morning name appreciate could help put st...
1,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Credit card or prepaid card + General-purpose ...,upgraded card //2018 told agent upgrade annive...
2,Chase Card was reported on XX/XX/2019. However...,"Credit reporting, credit repair services, or o...","chase card reported //2019 . however , fraudul..."
3,"On XX/XX/2018, while trying to book a XXXX XX...","Credit reporting, credit repair services, or o...","//2018 , trying book ticket , came across offe..."
4,my grand son give me check for {$1600.00} i de...,Checking or savings account + Checking account,grand son give check { $ 1600.00 } deposit cha...
...,...,...,...
18958,My husband passed away. Chase bank put check o...,Checking or savings account + Checking account,husband passed away . chase bank put check hol...
18959,After being a Chase Card customer for well ove...,Credit card or prepaid card + General-purpose ...,"chase card customer well decade , offered mult..."
18960,"On Wednesday, XX/XX/XXXX I called Chas, my XXX...",Credit card or prepaid card + General-purpose ...,"wednesday , // called chas , visa credit card ..."
18961,I am not familiar with XXXX pay and did not un...,Checking or savings account + Checking account,familiar pay understand great risk provides co...


Select only the columns needed for modeling.

In [71]:
df_processed_clean = df_clean[['clean_complaint', 'ticket_classification']]
df_processed_clean

Unnamed: 0,clean_complaint,ticket_classification
0,good morning name appreciate could help put st...,Debt collection + Credit card debt
1,upgraded card //2018 told agent upgrade annive...,Credit card or prepaid card + General-purpose ...
2,"chase card reported //2019 . however , fraudul...","Credit reporting, credit repair services, or o..."
3,"//2018 , trying book ticket , came across offe...","Credit reporting, credit repair services, or o..."
4,grand son give check { $ 1600.00 } deposit cha...,Checking or savings account + Checking account
...,...,...
18958,husband passed away . chase bank put check hol...,Checking or savings account + Checking account
18959,"chase card customer well decade , offered mult...",Credit card or prepaid card + General-purpose ...
18960,"wednesday , // called chas , visa credit card ...",Credit card or prepaid card + General-purpose ...
18961,familiar pay understand great risk provides co...,Checking or savings account + Checking account


In [73]:
df_processed_clean = df_processed_clean.rename(columns={'clean_complaint': 'complaint_what_happened'})

Save the data in a new directory called `processed_data`.

In [74]:
processed_data_dir = '../data/processed_data'
os.makedirs(processed_data_dir, exist_ok=True)
df_processed_clean.to_csv(os.path.join(processed_data_dir, 'processed_tickets.csv'), index=False)
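
One thing to keep in mind when this file is read back: a text that became an empty string during preprocessing is written as an empty field and loaded as a missing value. Whether any complaint actually ends up empty is an assumption here; the sketch uses a hypothetical two-row frame and an in-memory buffer instead of a file to show the effect.

```python
import io
import pandas as pd

# Hypothetical rows: the second text is empty after preprocessing
toy = pd.DataFrame({
    "complaint_what_happened": ["card charged twice", ""],
    "ticket_classification": ["A + a", "B + b"],
})

# Round-trip through CSV using an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as a missing value
print(reloaded["complaint_what_happened"].isna().sum())
```

A quick `isna().sum()` on the reloaded file before modeling would reveal whether any such rows exist.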
