This section describes how the data were cleaned and prepared for analysis: selecting the relevant columns, building a combined classification label, and removing rows with missing values.

In [1]:
import json
import pandas as pd

file_path = "../data/tickets_classification_eng.json"

# Read the JSON file
with open(file_path, "r", encoding="utf-8") as file:
    datos = json.load(file)

df = pd.json_normalize(datos)
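
As a rough illustration of why the column names used later carry a `_source.` prefix, the sketch below runs `pd.json_normalize` on two made-up records. The record layout (an `_id` field plus a nested `_source` dictionary) is an assumption chosen to mirror the column names in this notebook, not a sample of the real file.

```python
import pandas as pd

# Hypothetical records: the real export may contain more fields.
records = [
    {
        "_id": "1",
        "_source": {
            "product": "Mortgage",
            "sub_product": None,
            "complaint_what_happened": "",
        },
    },
    {
        "_id": "2",
        "_source": {
            "product": "Debt collection",
            "sub_product": "Credit card debt",
            "complaint_what_happened": "Sample complaint text",
        },
    },
]

# Nested dictionaries are flattened into dot-separated column names.
flat = pd.json_normalize(records)
flat_columns = sorted(flat.columns)
print(flat_columns)
```

Each key inside `_source` becomes its own column, which is why the selection step refers to names such as `_source.product`.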

In [2]:
columns_to_select = [
    "_source.complaint_what_happened",
    "_source.product",
    "_source.sub_product"
]

df_selected = df[columns_to_select]

In [3]:
# Rename columns for clarity; reassigning the result avoids modifying a slice in place
df_selected = df_selected.rename(columns={
    "_source.complaint_what_happened": "complaint_what_happened",
    "_source.product": "category",
    "_source.sub_product": "sub_product"
})
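
Selecting columns with `df[columns]` and then modifying the result with `inplace=True` can make pandas emit a `SettingWithCopyWarning`. A minimal sketch of one common way to avoid it, using a made-up frame (`raw` and `selected` are illustrative names, not objects from this notebook): take an explicit `.copy()` of the selection and reassign the result of each transformation.

```python
import pandas as pd

# Hypothetical stand-in for the normalized DataFrame.
raw = pd.DataFrame({
    "_id": ["1", "2"],
    "_source.product": ["Mortgage", "Debt collection"],
    "_source.sub_product": ["Conventional home mortgage", None],
})

# .copy() makes the selection an independent DataFrame, so later changes
# are not treated as writes to a slice of `raw`.
selected = raw[["_source.product", "_source.sub_product"]].copy()

# rename returns a new DataFrame; reassigning it replaces inplace=True.
selected = selected.rename(columns={
    "_source.product": "category",
    "_source.sub_product": "sub_product",
})
print(list(selected.columns))
```

The original frame keeps its column names, so the raw data stays available for other selections.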


In [4]:
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78313 entries, 0 to 78312
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   complaint_what_happened  78313 non-null  object
 1   category                 78313 non-null  object
 2   sub_product              67742 non-null  object
dtypes: object(3)
memory usage: 1.8+ MB


### Creating a New Column:

Add a new column called ticket_classification that concatenates the values of the category and sub_product columns, separated by a plus sign with a space on each side. For example, if category contains "Bank" and sub_product contains "Checking Account", then ticket_classification should be "Bank + Checking Account".

In [5]:
# assign returns a new DataFrame, so no value is set on a slice.
# Rows with a missing sub_product get a missing ticket_classification.
df_selected = df_selected.assign(
    ticket_classification=df_selected["category"] + " + " + df_selected["sub_product"]
)


In [6]:
df_selected.head()

| | complaint_what_happened | category | sub_product | ticket_classification |
|---|---|---|---|---|
| 0 | | Debt collection | Credit card debt | Debt collection + Credit card debt |
| 1 | Good morning my name is XXXX XXXX and I apprec... | Debt collection | Credit card debt | Debt collection + Credit card debt |
| 2 | I upgraded my XXXX XXXX card in XX/XX/2018 and... | Credit card or prepaid card | General-purpose credit card or charge card | Credit card or prepaid card + General-purpose ... |
| 3 | | Mortgage | Conventional home mortgage | Mortgage + Conventional home mortgage |
| 4 | | Credit card or prepaid card | General-purpose credit card or charge card | Credit card or prepaid card + General-purpose ... |
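
The `info()` output earlier shows that sub_product has fewer non-null values than category (67,742 of 78,313). The sketch below, on made-up data, illustrates what string concatenation does in that case: a missing operand makes the whole result missing, so those rows end up without a ticket_classification.

```python
import pandas as pd

# Hypothetical rows: the second one has no sub_product.
demo = pd.DataFrame({
    "category": ["Mortgage", "Debt collection"],
    "sub_product": ["Conventional home mortgage", None],
})

# Concatenation propagates missing values instead of raising an error.
demo["ticket_classification"] = demo["category"] + " + " + demo["sub_product"]

missing_mask = demo["ticket_classification"].isna()
print(demo["ticket_classification"].iloc[0])
print(missing_mask.tolist())
```

If rows without a sub_product should be kept, one option (not used in this notebook) would be to fill the gap first, for example with `fillna`, before concatenating.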


### Dropping Redundant Columns:

After creating the ticket_classification column, drop the sub_product and category columns, since their information is now captured in the new column.

In [7]:
# drop returns a new DataFrame; reassigning avoids modifying a slice in place
df_selected = df_selected.drop(columns=["sub_product", "category"])


### Cleaning Data in Specific Columns:

Here we make sure that the complaint_what_happened column contains no empty fields. Replace empty strings with a value that marks the data as missing (such as NaN).

In [10]:
# Replace empty strings with a missing-value marker.
# Calling replace(..., inplace=True) on df_selected[col] acts on an intermediate
# object, so the result is assigned back to the column instead.
df_selected = df_selected.assign(
    complaint_what_happened=df_selected["complaint_what_happened"].replace("", pd.NA)
)
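
The replacement above targets exactly empty strings. As a hedged variation, the sketch below also treats whitespace-only text as missing. That extra rule is an assumption, not something the original cell does, and the data are made up.

```python
import pandas as pd

# Hypothetical complaints: empty, whitespace-only, and real text.
demo = pd.DataFrame({
    "complaint_what_happened": ["", "   ", "I was charged twice"],
})

text = demo["complaint_what_happened"]

# mask replaces values where the condition is True and leaves the rest untouched.
is_blank = text.str.strip() == ""
demo["complaint_what_happened"] = text.mask(is_blank, pd.NA)

blank_flags = demo["complaint_what_happened"].isna().tolist()
print(blank_flags)
```

Using `mask` with a stripped comparison leaves non-blank complaints exactly as they were, including any leading or trailing spaces.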


### Removing Rows with Missing Data:

Remove every row that has missing data in the critical columns, that is, complaint_what_happened and ticket_classification.

In [11]:
# dropna returns a new DataFrame; reassigning avoids modifying a slice in place
df_selected = df_selected.dropna(subset=["complaint_what_happened", "ticket_classification"])
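
When dropping rows it can help to record how many were removed. The following is a minimal sketch on made-up data. The names `demo`, `cleaned`, and `rows_dropped` are illustrative only.

```python
import pandas as pd

# Hypothetical rows: one complete, one without text, one without a label.
demo = pd.DataFrame({
    "complaint_what_happened": ["Complaint A", pd.NA, "Complaint C"],
    "ticket_classification": [
        "Mortgage + Conventional home mortgage",
        "Debt collection + Credit card debt",
        None,
    ],
})

critical_columns = ["complaint_what_happened", "ticket_classification"]

rows_before = len(demo)
# A row is removed if any of the critical columns is missing.
cleaned = demo.dropna(subset=critical_columns)
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```

With `subset`, missing values in other columns would not cause a row to be dropped. Only the listed columns are checked.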


In [12]:
df_selected

| | complaint_what_happened | ticket_classification |
|---|---|---|
| 1 | Good morning my name is XXXX XXXX and I apprec... | Debt collection + Credit card debt |
| 2 | I upgraded my XXXX XXXX card in XX/XX/2018 and... | Credit card or prepaid card + General-purpose ... |
| 10 | Chase Card was reported on XX/XX/2019. However... | Credit reporting, credit repair services, or o... |
| 11 | On XX/XX/2018, while trying to book a XXXX XX... | Credit reporting, credit repair services, or o... |
| 14 | my grand son give me check for {$1600.00} i de... | Checking or savings account + Checking account |
| ... | ... | ... |
| 78301 | My husband passed away. Chase bank put check o... | Checking or savings account + Checking account |
| 78303 | After being a Chase Card customer for well ove... | Credit card or prepaid card + General-purpose ... |
| 78309 | On Wednesday, XX/XX/XXXX I called Chas, my XXX... | Credit card or prepaid card + General-purpose ... |
| 78310 | I am not familiar with XXXX pay and did not un... | Checking or savings account + Checking account |


In [13]:
# Now that the data is cleaner, reset the index before saving
df_selected = df_selected.reset_index(drop=True)

In [16]:
output_path = "../data/processed/tickets_cleaned.csv"

In [17]:
df_selected.to_csv(output_path, index=False)
print(f"DataFrame saved successfully to: {output_path}")

DataFrame saved successfully to: ../data/processed/tickets_cleaned.csv
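
`to_csv` does not create missing directories, so saving fails if the target folder does not exist yet. A minimal sketch of a more defensive save, using a temporary directory and made-up data instead of the real `../data/processed` path:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data standing in for df_selected.
demo = pd.DataFrame({
    "complaint_what_happened": ["Complaint A", "Complaint B, with a comma"],
    "ticket_classification": [
        "Mortgage + Conventional home mortgage",
        "Debt collection + Credit card debt",
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "processed" / "tickets_cleaned.csv"

    # Create the parent folder if needed; exist_ok avoids an error on reruns.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(output_path, index=False)

    # Read the file back as a quick sanity check on shape and columns.
    reloaded = pd.read_csv(output_path)

print(reloaded.shape)
```

Reading the file back is optional, but it is a cheap way to confirm that text containing commas survives the round trip.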
