# Data cleaning techniques

1. Remove duplicates
`drop_duplicates()`
2. Remove unwanted characters
`str.replace()`
3. Fix the data type
`df["campo"].astype(float)`
4. Handle missing data
`dropna()`
`fillna()`
5. Normalize data
`.str.strip().str.lower()`
6. Filter data
`df[df["campo"] <condition>]`
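
As a rough illustration of how the six techniques chain together, here is a minimal sketch on a made-up DataFrame. The data and the `nombre` / `monto` column names are assumptions for the example and do not come from the course datasets.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "nombre": ["  Ana ", "LUIS", "LUIS", "Eva"],
    "monto": ["$10.5", "$20", "$20", None],
})

# 1. Drop exact duplicate rows
df = df.drop_duplicates()

# 2. Remove unwanted characters (regex=False treats "$" literally)
df["monto"] = df["monto"].str.replace("$", "", regex=False)

# 3. Fix the data type (missing values become NaN)
df["monto"] = df["monto"].astype(float)

# 4. Handle missing values
df["monto"] = df["monto"].fillna(0.0)

# 5. Normalize text: trim whitespace, then lowercase
df["nombre"] = df["nombre"].str.strip().str.lower()

# 6. Filter rows with a boolean condition
resultado = df[df["monto"] > 5]
print(resultado)
```

Note that both `strip` and `lower` go through the `.str` accessor; calling `.lower()` directly on a Series raises an `AttributeError`.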


## Importing some datasets to explore

In [1]:
# Import the pandas library
import pandas as pd

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Check that the CSV files are in the datasets folder
import os
os.listdir("/content/drive/MyDrive/datasets")

In [2]:
# Load the satis (customer satisfaction) dataset
df_satis = pd.read_csv("https://raw.githubusercontent.com/v0ltax/TT-2C2025-Data-Analitycs-Notebooks/refs/heads/main/Clase_4/Datasets/satis_clientes.csv")
df_satis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1128 non-null   int64  
 1   Empresa       1128 non-null   object 
 2   Fecha         1128 non-null   object 
 3   Calificación  904 non-null    float64
 4   Comentarios   818 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 44.2+ KB


In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

print("Path to dataset files:", path)


Path to dataset files: C:\Users\sebas\.cache\kagglehub\datasets\shivamb\netflix-shows\versions\5


In [4]:
# Load the Netflix dataset
df_netflix = pd.read_csv(f"{path}/netflix_titles.csv")
df_netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [17]:
# Load the patients dataset with their temperature readings
df_pacientes = pd.read_csv('https://docs.google.com/spreadsheets/d/1-rUn4TUwpGrLE1DH8moeiR5eSyeF-jOOTpfhRgZxVIQ/gviz/tq?tqx=out:csv&sheet=')
df_pacientes.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3018 entries, 0 to 3017
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   nombre  3018 non-null   object 
 1   d1      2832 non-null   float64
 2   d2      2871 non-null   float64
 3   d3      2822 non-null   float64
 4   d4      2839 non-null   float64
 5   d5      2855 non-null   float64
 6   d6      2820 non-null   float64
 7   d7      2864 non-null   float64
 8   d8      2834 non-null   float64
 9   d9      2816 non-null   float64
 10  d10     2845 non-null   float64
dtypes: float64(10), object(1)
memory usage: 259.5+ KB


## Identifying duplicate data

[duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)
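
The cells below rely on the `keep` argument of `duplicated()`. A minimal sketch of its three options, using a made-up one-column DataFrame (the data is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy data: "Acme" appears three times
df = pd.DataFrame({"Empresa": ["Acme", "Beta", "Acme", "Acme"]})

# keep="first": flags every repeat except the first occurrence
primera = df.duplicated(subset=["Empresa"], keep="first")
# keep="last": flags every repeat except the last occurrence
ultima = df.duplicated(subset=["Empresa"], keep="last")
# keep=False: flags every row that has at least one twin
todas = df.duplicated(subset=["Empresa"], keep=False)

print(primera.tolist())  # [False, False, True, True]
print(ultima.tolist())   # [True, False, True, False]
print(todas.tolist())    # [True, False, True, True]
```

`keep=False` is the convenient choice for inspection, because it shows the original row next to its copies.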

### Satis DataFrame

In [6]:
df_satis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1128 non-null   int64  
 1   Empresa       1128 non-null   object 
 2   Fecha         1128 non-null   object 
 3   Calificación  904 non-null    float64
 4   Comentarios   818 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 44.2+ KB


In [7]:
# Count fully duplicated rows
df_satis.duplicated().sum()

np.int64(128)

In [8]:
# Count duplicates considering only the Empresa and Calificación columns
df_satis.duplicated(subset=["Empresa", "Calificación"]).sum()

np.int64(134)

In [9]:
# Count how many times each (Empresa, Calificación) combination appears
df_satis.value_counts(subset=["Empresa", "Calificación"]).sort_values(ascending=False)

Empresa                       Calificación
Johnson Inc                   4.0             2
Green, Nienow and Emard       2.0             2
Cremin-Ruecker                3.0             2
Huel Group                    1.0             2
Hintz, White and Kuphal       5.0             2
                                             ..
Zieme, Hintz and Cronin       4.0             1
Zieme-Kohler                  4.0             1
Bergnaum Inc                  2.0             1
Zulauf, Considine and Wisozk  1.0             1
Abbott LLC                    4.0             1
Name: count, Length: 798, dtype: int64

In [10]:
# Show the duplicates using different arguments
df_satis[df_satis.duplicated(subset=['Empresa'], keep=False)].sort_values(by='Empresa')

Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios
1122,995,Altenwerth LLC,13/09/2024,,
1121,995,Altenwerth LLC,13/09/2024,,
811,726,"Altenwerth, Blanda and Waelchi",28/09/2024,1.0,"Proin leo odio, porttitor id, consequat in, co..."
812,726,"Altenwerth, Blanda and Waelchi",28/09/2024,1.0,"Proin leo odio, porttitor id, consequat in, co..."
186,172,"Altenwerth, Reichert and Mills",07/08/2024,3.0,"Proin eu mi. Nulla ac enim. In tempor, turpis ..."
...,...,...,...,...,...
469,423,Zemlak Group,13/12/2024,5.0,Duis consequat dui nec nisi volutpat eleifend....
974,865,Zemlak Inc,29/03/2024,3.0,Fusce consequat. Nulla nisl. Nunc nisl.\n\nDui...
973,865,Zemlak Inc,29/03/2024,3.0,Fusce consequat. Nulla nisl. Nunc nisl.\n\nDui...
512,459,Zulauf LLC,29/04/2024,5.0,


In [19]:
# Apply a filter
df_satis[df_satis['Empresa'] == "Keebler Inc"]

Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios
4,5,Keebler Inc,12/01/2024,4.0,Integer ac leo. Pellentesque ultrices mattis o...
636,570,Keebler Inc,28/07/2024,5.0,Vestibulum ac est lacinia nisi venenatis trist...


In [11]:
df_satis[df_satis.duplicated(subset=["Empresa", "Fecha"],keep=False)].sort_values(by='id')

Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios
9,10,Legros-Olson,12/11/2024,5.0,
10,10,Legros-Olson,12/11/2024,5.0,
11,11,Harris-Davis,13/11/2024,1.0,Nullam sit amet turpis elementum ligula vehicu...
12,11,Harris-Davis,13/11/2024,1.0,Nullam sit amet turpis elementum ligula vehicu...
17,16,"White, Balistreri and Daugherty",29/06/2024,5.0,
...,...,...,...,...,...
1109,984,"Strosin, Raynor and Oberbrunner",14/02/2024,1.0,Morbi porttitor lorem id ligula. Suspendisse o...
1117,992,Donnelly-Bashirian,14/03/2024,,Nullam porttitor lacus at turpis. Donec posuer...
1118,992,Donnelly-Bashirian,14/03/2024,,Nullam porttitor lacus at turpis. Donec posuer...
1121,995,Altenwerth LLC,13/09/2024,,


In [12]:
df_satis

Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios
0,1,Mitchell Group,11/12/2024,1.0,Integer ac leo. Pellentesque ultrices mattis o...
1,2,Kuhn-Fay,25/01/2024,4.0,Vestibulum ac est lacinia nisi venenatis trist...
2,3,Moen-Blick,11/11/2024,3.0,Aliquam quis turpis eget elit sodales sceleris...
3,4,McDermott Inc,01/12/2024,1.0,
4,5,Keebler Inc,12/01/2024,4.0,Integer ac leo. Pellentesque ultrices mattis o...
...,...,...,...,...,...
1123,996,Schiller-Armstrong,19/04/2024,2.0,In hac habitasse platea dictumst. Etiam faucib...
1124,997,"Schimmel, Gleichner and O'Keefe",25/12/2024,,In hac habitasse platea dictumst. Morbi vestib...
1125,998,"Strosin, Tromp and Dicki",23/05/2024,4.0,
1126,999,"Schultz, Vandervort and Mosciski",14/11/2024,4.0,Nullam sit amet turpis elementum ligula vehicu...


In [13]:
df_satis[df_satis.duplicated(subset=df_satis.columns[1:],keep=False)].sort_values(by='Empresa')

Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios
1122,995,Altenwerth LLC,13/09/2024,,
1121,995,Altenwerth LLC,13/09/2024,,
811,726,"Altenwerth, Blanda and Waelchi",28/09/2024,1.0,"Proin leo odio, porttitor id, consequat in, co..."
812,726,"Altenwerth, Blanda and Waelchi",28/09/2024,1.0,"Proin leo odio, porttitor id, consequat in, co..."
187,172,"Altenwerth, Reichert and Mills",07/08/2024,3.0,"Proin eu mi. Nulla ac enim. In tempor, turpis ..."
...,...,...,...,...,...
469,423,Zemlak Group,13/12/2024,5.0,Duis consequat dui nec nisi volutpat eleifend....
973,865,Zemlak Inc,29/03/2024,3.0,Fusce consequat. Nulla nisl. Nunc nisl.\n\nDui...
974,865,Zemlak Inc,29/03/2024,3.0,Fusce consequat. Nulla nisl. Nunc nisl.\n\nDui...
512,459,Zulauf LLC,29/04/2024,5.0,


### Patients DataFrame

In [18]:
# Count fully duplicated rows
df_pacientes.duplicated().sum()

np.int64(18)

In [19]:
# If there are any, list them (keep accepts "first", "last" or False)
df_pacientes[df_pacientes.duplicated(subset=["nombre"], keep=False)].sort_values(by='nombre')

Unnamed: 0,nombre,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10
2200,Ailyn Hexter,36.67,36.1,36.36,36.59,36.99,37.22,36.34,37.51,36.72,37.59
85,Ailyn Hexter,36.67,36.1,36.36,36.59,36.99,37.22,36.34,37.51,36.72,37.59
1352,Andriana Mossman,37.4,36.81,37.46,36.89,36.22,36.4,37.97,36.07,37.86,
1528,Andriana Mossman,37.4,36.81,37.46,36.89,36.22,36.4,37.97,36.07,37.86,
847,Bernarr D'Arrigo,36.83,36.94,,36.01,37.8,37.94,36.3,37.97,37.75,37.35
306,Bernarr D'Arrigo,36.83,36.94,,36.01,37.8,37.94,36.3,37.97,37.75,37.35
453,Celeste Gooch,36.11,36.87,36.68,,37.92,36.72,37.93,37.29,36.15,36.6
2954,Celeste Gooch,36.11,36.87,36.68,,37.92,36.72,37.93,37.29,36.15,36.6
136,Dagny Burree,37.58,36.17,37.0,37.84,37.29,37.54,36.07,36.75,,36.53
73,Dagny Burree,37.58,36.17,37.0,37.84,37.29,37.54,36.07,36.75,,36.53


### Netflix DataFrame

In [20]:
df_netflix.duplicated().sum()

np.int64(0)

In [21]:
df_netflix[df_netflix.duplicated()]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


## Handling duplicate data
[Pandas drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
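
After `drop_duplicates()` the surviving rows keep their original index labels, so the index ends up with gaps. A minimal sketch on made-up data (an assumption for illustration) showing the default behavior and the `ignore_index=True` option that renumbers the result:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share Empresa and Calificación
df = pd.DataFrame({
    "Empresa": ["Acme", "Acme", "Beta", "Acme"],
    "Calificación": [4.0, 4.0, 2.0, 5.0],
})

# Default: original index labels are preserved
sin_dup = df.drop_duplicates(subset=["Empresa", "Calificación"], keep="first")
print(sin_dup.index.tolist())  # [0, 2, 3]

# ignore_index=True relabels the result 0..n-1
renum = df.drop_duplicates(
    subset=["Empresa", "Calificación"], keep="first", ignore_index=True
)
print(renum.index.tolist())  # [0, 1, 2]
```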

### Satis dataset

In [22]:
df_satis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1128 non-null   int64  
 1   Empresa       1128 non-null   object 
 2   Fecha         1128 non-null   object 
 3   Calificación  904 non-null    float64
 4   Comentarios   818 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 44.2+ KB


In [23]:
df_satis_pp1 = df_satis.drop_duplicates(subset=["Empresa", "Calificación"], keep="first")
df_satis_pp1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 994 entries, 0 to 1127
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            994 non-null    int64  
 1   Empresa       994 non-null    object 
 2   Fecha         994 non-null    object 
 3   Calificación  798 non-null    float64
 4   Comentarios   728 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 46.6+ KB


In [24]:
df_netflix.duplicated().sum()

np.int64(0)

In [25]:
df_satis_pp1[df_satis_pp1.duplicated(subset=["Empresa", "Calificación"], keep=False)].sort_values(by='Empresa')


Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios


## Exploring null data

In [26]:
df_satis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1128 non-null   int64  
 1   Empresa       1128 non-null   object 
 2   Fecha         1128 non-null   object 
 3   Calificación  904 non-null    float64
 4   Comentarios   818 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 44.2+ KB


In [27]:
# Check for null values, overall or in specific columns
#df_satis.isnull().sum()
#df_satis[["Comentarios", "Calificación"]].isnull().sum()
df_satis.isnull().any(axis=1).sum()

np.int64(477)
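
The three ways of counting nulls answer different questions. A minimal sketch on a made-up DataFrame (data invented for illustration) that contrasts nulls per column, rows with at least one null, and rows that are entirely null:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data
df = pd.DataFrame({
    "Calificación": [1.0, np.nan, 3.0, np.nan],
    "Comentarios": ["ok", None, None, "bien"],
})

por_columna = df.isnull().sum()                  # nulls in each column
filas_con_nulo = df.isnull().any(axis=1).sum()   # rows with at least one null
filas_todo_nulo = df.isnull().all(axis=1).sum()  # rows where every column is null

print(por_columna.to_dict())  # {'Calificación': 2, 'Comentarios': 2}
print(filas_con_nulo)         # 3
print(filas_todo_nulo)        # 1
```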

In [28]:
# Show the records with null / NaN cells
df_satis[df_satis.isnull().any(axis=1)][["Comentarios", "Calificación"]]

Unnamed: 0,Comentarios,Calificación
3,,1.0
6,,
7,"Donec diam neque, vestibulum eget, vulputate u...",
8,,
9,,5.0
...,...,...
1121,,
1122,,
1124,In hac habitasse platea dictumst. Morbi vestib...,
1125,,4.0


In [46]:
df_satis[df_satis.isnull().any(axis=1)]

Unnamed: 0,id,Empresa,Fecha,Calificación,Comentarios
3,4,McDermott Inc,01/12/2024,1.0,
6,7,Moen-Hartmann,11/03/2024,,
7,8,Lubowitz and Sons,27/01/2024,,"Donec diam neque, vestibulum eget, vulputate u..."
8,9,Waters-Lakin,04/09/2024,,
9,10,Legros-Olson,12/11/2024,5.0,
...,...,...,...,...,...
1121,995,Altenwerth LLC,13/09/2024,,
1122,995,Altenwerth LLC,13/09/2024,,
1124,997,"Schimmel, Gleichner and O'Keefe",25/12/2024,,In hac habitasse platea dictumst. Morbi vestib...
1125,998,"Strosin, Tromp and Dicki",23/05/2024,4.0,


## Handling null data

### drop

`dropna` removes the records (rows) that contain null values
<BR>
[Pandas dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)

* how: whether a row is dropped when any of its cells is NaN ("any", the default) or only when all of them are ("all")
* thresh: minimum number of non-null values a row must have to be kept (cannot be combined with how)
* subset: the columns to evaluate
* inplace: whether the DataFrame is modified in place or a new one is returned

Analyze and compare what happens when dropna is applied to all columns versus only some specific ones.
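
To make the parameters above concrete, here is a minimal sketch on a made-up DataFrame. The data and the column names `a`, `b`, `c` are assumptions for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: row 0 is complete, row 2 is entirely NaN
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

completas = df.dropna()                   # how="any": keeps only row 0
no_vacias = df.dropna(how="all")          # drops only row 2
minimo_dos = df.dropna(thresh=2)          # keeps rows with >= 2 non-null values
por_subset = df.dropna(subset=["a", "c"]) # looks only at columns a and c

print(len(completas), len(no_vacias), len(minimo_dos), len(por_subset))  # 1 3 2 2
```

`thresh` counts non-null values, so a higher threshold is stricter.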

In [49]:
# Drop a record if any of its columns contains NaN
df_satis_pp2 = df_satis.dropna()
df_satis_pp2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 651 entries, 0 to 1126
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            651 non-null    int64  
 1   Empresa       651 non-null    object 
 2   Fecha         651 non-null    object 
 3   Calificación  651 non-null    float64
 4   Comentarios   651 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 30.5+ KB


In [50]:
# Drop a record only if all of its columns contain NaN
df_satis_pp2 = df_satis.dropna(how="all")
df_satis_pp2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1128 non-null   int64  
 1   Empresa       1128 non-null   object 
 2   Fecha         1128 non-null   object 
 3   Calificación  904 non-null    float64
 4   Comentarios   818 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 44.2+ KB


In [51]:
# Drop a record evaluating NaN only in the given columns
df_satis_pp2 = df_satis.dropna(subset=["Comentarios", "Calificación"])
df_satis_pp2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 651 entries, 0 to 1126
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            651 non-null    int64  
 1   Empresa       651 non-null    object 
 2   Fecha         651 non-null    object 
 3   Calificación  651 non-null    float64
 4   Comentarios   651 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 30.5+ KB


In [52]:
# Drop a column if it contains NaN
df_satis_pp2 = df_satis.dropna(axis=1)  # with axis=1, subset would refer to row labels, not column names
df_satis_pp2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       1128 non-null   int64 
 1   Empresa  1128 non-null   object
 2   Fecha    1128 non-null   object
dtypes: int64(1), object(2)
memory usage: 26.6+ KB


### fill

`fillna` fills in missing values
<BR>
[Pandas fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
<BR>
* df.fillna(0) fills with a fixed value
* df.ffill() fills with the value of the previous record (the older spelling df.fillna(method="ffill") is deprecated since pandas 2.1)
* df.bfill() fills with the value of the next record (the older spelling df.fillna(method="bfill") is deprecated since pandas 2.1)
* df["col"].fillna(df["col"].mean()) fills with the column mean
* df.fillna({"col1": 0, "col2": "desconocido"}) fills each column with its own value
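
The difference between these options is easiest to see on a tiny Series. A minimal sketch with made-up values (an assumption for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with gaps at positions 1 and 3
s = pd.Series([1.0, np.nan, 3.0, np.nan])

fijo = s.fillna(0)           # fixed value
previo = s.ffill()           # forward fill: copy the previous valid value
siguiente = s.bfill()        # backward fill: copy the next valid value
media = s.fillna(s.mean())   # mean of the non-null values (2.0)

print(fijo.tolist())       # [1.0, 0.0, 3.0, 0.0]
print(previo.tolist())     # [1.0, 1.0, 3.0, 3.0]
print(siguiente.tolist())  # [1.0, 3.0, 3.0, nan]
print(media.tolist())      # [1.0, 2.0, 3.0, 2.0]
```

A trailing gap survives `bfill()` because there is no later value to copy, just as a leading gap survives `ffill()`.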

1. Fill with a fixed value

In [None]:
# use the value parameter with a fixed value of 0
df_satis_pp3 = df_satis.fillna(value = 0)
df_satis_pp3.info()

In [None]:
# We could also use a dictionary
df_satis_pp3 = df_satis.fillna({"Calificación": 0, "Comentarios": "Sin dato"})

In [None]:
# list the first records
df_satis_pp3.head(10)

In [None]:
# Adjust the data type if needed
df_satis_pp3 = df_satis_pp3.astype({"Comentarios": "string"})
df_satis_pp3.info()

In [None]:
# We can apply filters (missing comments were filled with "Sin dato" in the dictionary cell)
df_satis_pp3[df_satis_pp3["Comentarios"] == "Sin dato"]

2. Fill with the previous value (forward fill)

In [None]:
df_satis_pp3 = df_satis.ffill()  # replaces the deprecated fillna(method="ffill")
df_satis_pp3.info()

3. Fill with the next value (backward fill)

In [None]:
df_satis_pp3 = df_satis.bfill()  # replaces the deprecated fillna(method="bfill")
df_satis_pp3.info()

4. Fill with the mean, median or mode

In [None]:
df_satis_pp3 = df_satis.fillna(df_satis.mean(numeric_only=True))
df_satis_pp3.info()

In [None]:
# compute the mode of the Calificación column (mean() or median() would also work)
moda_calif = df_satis["Calificación"].mode()[0]

# apply fillna with a dictionary
df_satis_pp3 = df_satis.fillna({
    "Calificación": moda_calif,
    "Comentarios": "Sin dato"
})

In [None]:
df_satis_pp3.head(5)

Let's analyze the patients dataset

In [None]:
df_pacientes.info()

In [None]:
# Fill with the column mean
df_pacientes_pp1 = df_pacientes.fillna(df_pacientes.mean(numeric_only=True))
df_pacientes_pp1.info()

In [None]:
# Fill with the row mean using a lambda

# select only the columns d1...d10
cols = df_pacientes.columns[1:]   # all except 'nombre'

# apply the mean per row
df_pacientes_pp1[cols] = df_pacientes[cols].apply(
    lambda row: row.fillna(row.mean()), axis=1
)

In [None]:
# Fill with the row mean using T (transpose)
df_pacientes_pp1[cols] = df_pacientes[cols].T.fillna(df_pacientes[cols].mean(axis=1)).T

In [None]:
#df_pacientes.head()
df_pacientes_pp1.head()

Let's look at the concept of the transpose

In [None]:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})
print("Original:")
print(df)

print("\nTransposed:")
print(df.T)

## Data normalization

## Adjusting data types

### astype()

* df["col"] = df["col"].astype(int)
* df["col"] = df["col"].astype(float)
* df["col"] = df["col"].astype(str)
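
One caveat with `astype(int)`: it fails when the column contains NaN. A minimal sketch on made-up data (an assumption for illustration) that shows the failure and uses pandas' nullable `"Int64"` dtype as one possible workaround:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing value
s = pd.Series([1.0, 2.0, np.nan])

fallo = False
try:
    s.astype(int)  # NaN has no integer representation
except ValueError as exc:
    fallo = True
    print("astype(int) failed:", exc)

# The nullable integer dtype keeps the gap as <NA>
enteros = s.astype("Int64")
print(enteros.dtype)         # Int64
print(enteros.isna().sum())  # 1
```

Filling or dropping the nulls first and then calling `astype(int)` is the other common route.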

In [None]:
df_satis_pp4 = df_satis.copy()
df_satis_pp4["Comentarios"] = df_satis["Comentarios"].astype("string")

In [None]:
df_satis_pp4.info()

### to_numeric()

In [None]:
pd.to_numeric(df["col"], errors="coerce")
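
The line above is a template: `"col"` stands for any column that holds numbers stored as text. A minimal runnable sketch with a made-up Series (data invented for illustration) shows what `errors="coerce"` does with values that cannot be parsed:

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text, plus one unparseable entry
s = pd.Series(["10", "3.5", "n/a"])

# errors="coerce" turns unparseable values into NaN instead of raising
numeros = pd.to_numeric(s, errors="coerce")

print(numeros.dtype)         # float64
print(numeros.isna().sum())  # 1
```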

### to_datetime()

In [None]:
type(df_satis["Fecha"][0])
df_satis["Fecha"][0]

In [None]:
# pd.to_datetime(df_satis.Fecha)
pd.to_datetime(df_satis["Fecha"], format="%d/%m/%Y")
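
Passing an explicit `format` matters for day-first dates such as `01/12/2024`, which could otherwise be read month-first. A minimal sketch on made-up dates (an assumption for illustration), adding `errors="coerce"` so that impossible dates become `NaT` instead of raising:

```python
import pandas as pd

# Hypothetical toy data in day/month/year order; the last date does not exist
fechas = pd.Series(["13/09/2024", "01/12/2024", "31/02/2024"])

convertidas = pd.to_datetime(fechas, format="%d/%m/%Y", errors="coerce")

print(convertidas.iloc[1].day, convertidas.iloc[1].month)  # 1 12
print(convertidas.isna().sum())                            # 1
```

Once converted, the `.dt` accessor exposes components such as `convertidas.dt.month`.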


## Filtering data

In [None]:
df_satis.info()

In [None]:
# A single column
df_satis["Empresa"]

In [None]:
# Several columns
df_satis[["Empresa", "Fecha"]]

In [None]:
df_satis[df_satis["Empresa"] == "Kuhn-Fay"]

In [None]:
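# With combined conditions: wrap each one in parentheses and join them with &
# (the numeric column Calificación is assumed here, since Comentarios holds free text)
df_satis[(df_satis["Empresa"] == "Alpha") & (df_satis["Calificación"] > 3)]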
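
Boolean filters combine with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on a made-up DataFrame (data invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the numeric column
df = pd.DataFrame({
    "Empresa": ["Alpha", "Alpha", "Beta", "Gamma"],
    "Calificación": [5.0, 2.0, 4.0, None],
})

# Both conditions must hold
ambas = df[(df["Empresa"] == "Alpha") & (df["Calificación"] > 3)]
# At least one condition must hold
alguna = df[(df["Empresa"] == "Beta") | (df["Calificación"] > 4)]
# Membership in a list of values
lista = df[df["Empresa"].isin(["Beta", "Gamma"])]

print(ambas.index.tolist())   # [0]
print(alguna.index.tolist())  # [0, 2]
print(lista.index.tolist())   # [2, 3]
```

Comparisons against NaN evaluate to False, so the row without a rating never passes a numeric condition.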