# Data cleaning

Data cleaning is a crucial step in the data analysis process. It involves removing missing, duplicated, or irrelevant data. The goal is to ensure the accuracy and consistency of the data so that it does not hurt the performance of ML models. Better data beats fancier algorithms.

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the data

df = pd.read_csv('datasets/titanic_ds.csv')
df.head()

Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived
0,1,22.0,7.25,0,1,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,1
2,3,26.0,7.925,1,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,1
3,4,35.0,53.1,1,1,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
4,5,35.0,8.05,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0


In [6]:
#df.duplicated()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Passengerid  1309 non-null   int64  
 1   Age          1309 non-null   float64
 2   Fare         1309 non-null   float64
 3   Sex          1309 non-null   int64  
 4   sibsp        1309 non-null   int64  
 5   zero         1309 non-null   int64  
 6   zero.1       1309 non-null   int64  
 7   zero.2       1309 non-null   int64  
 8   zero.3       1309 non-null   int64  
 9   zero.4       1309 non-null   int64  
 10  zero.5       1309 non-null   int64  
 11  zero.6       1309 non-null   int64  
 12  Parch        1309 non-null   int64  
 13  zero.7       1309 non-null   int64  
 14  zero.8       1309 non-null   int64  
 15  zero.9       1309 non-null   int64  
 16  zero.10      1309 non-null   int64  
 17  zero.11      1309 non-null   int64  
 18  zero.12      1309 non-null   int64  
 19  zero.1
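
The cell above leaves `df.duplicated()` commented out, so duplicates are never actually checked. As a minimal sketch of that step, the snippet below uses a small made-up frame (`toy` is hypothetical, not the Titanic file) to count repeated rows and then drop them.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file: the third row repeats the first.
toy = pd.DataFrame({
    "Passengerid": [1, 2, 1, 3],
    "Age": [22.0, 38.0, 22.0, 26.0],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicated rows:", n_duplicates)
print("Shape after dropping duplicates:", deduped.shape)
```

On the real data the same two calls would be applied to `df`. Whether a repeated row is a true duplicate or a legitimate repeat depends on the dataset, so it is worth inspecting the flagged rows before dropping them.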

In [8]:
# Split the columns into categorical and numeric

cat_col = [col for col in df.columns if df[col].dtype=="object"]
print("Categorical: ", cat_col)
num_col = [col for col in df.columns if df[col].dtype!="object"]
print("Numeric: ", num_col)

Categorical:  []
Numeric:  ['Passengerid', 'Age', 'Fare', 'Sex', 'sibsp', 'zero', 'zero.1', 'zero.2', 'zero.3', 'zero.4', 'zero.5', 'zero.6', 'Parch', 'zero.7', 'zero.8', 'zero.9', 'zero.10', 'zero.11', 'zero.12', 'zero.13', 'zero.14', 'Pclass', 'zero.15', 'zero.16', 'Embarked', 'zero.17', 'zero.18', '2urvived']


In [9]:
# Total number of unique values in each categorical column
df[cat_col].nunique()

Series([], dtype: float64)
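
The result is empty because this file has no text columns: every column is already numeric. To show what the split looks like when text is present, here is a minimal sketch on a made-up frame (`toy` and its `Name` column are hypothetical). It uses `select_dtypes`, an alternative to comparing each column's dtype by hand.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
})

# Numeric columns are selected directly; everything else is treated as categorical here.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", cat_cols)
print("Numeric:", num_cols)

# Number of distinct values in each categorical column.
print(toy[cat_cols].nunique())
```

Treating every non-numeric column as categorical is a simplification that suits this toy frame. Date or boolean columns would need their own handling.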

In [11]:
# We can drop any irrelevant information from the DataFrame. Models cannot work with raw text values, so text columns have to be converted or dropped.

df1 = df.drop(columns=['zero'])

In [12]:
# Percentage of null values per column
round((df1.isnull().sum()/df1.shape[0])*100,2)

Passengerid    0.00
Age            0.00
Fare           0.00
Sex            0.00
sibsp          0.00
zero.1         0.00
zero.2         0.00
zero.3         0.00
zero.4         0.00
zero.5         0.00
zero.6         0.00
Parch          0.00
zero.7         0.00
zero.8         0.00
zero.9         0.00
zero.10        0.00
zero.11        0.00
zero.12        0.00
zero.13        0.00
zero.14        0.00
Pclass         0.00
zero.15        0.00
zero.16        0.00
Embarked       0.15
zero.17        0.00
zero.18        0.00
2urvived       0.00
dtype: float64

When information is missing, the usual options are to drop the affected column or to replace the missing values.
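
A third option is to drop only the rows that contain the missing value. With 0.15% of `Embarked` missing in 1309 rows (two rows), that costs far less data than removing the whole column. The sketch below compares both on a made-up frame (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing Embarked value.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": [2.0, 0.0, np.nan, 2.0],
})

# Option A: drop the whole column (all rows stay, one feature is lost).
without_column = toy.drop(columns="Embarked")

# Option B: drop only the rows where Embarked is missing (the feature stays).
without_rows = toy.dropna(subset=["Embarked"])

print("Drop column:", without_column.shape)
print("Drop rows:  ", without_rows.shape)
```

Both calls return new DataFrames and leave `toy` unchanged, so the two results can be compared side by side.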

In [15]:
df2 = df1.drop(columns='Embarked')
#df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape

(1309, 26)

In [16]:
# Fill the missing values with the column mean so the data is distorted as little as possible.
# Only Embarked has nulls, so the fill is restricted to that column.
df3 = df1.fillna({'Embarked': df1['Embarked'].mean()})
df3.isnull().sum()

Passengerid    0
Age            0
Fare           0
Sex            0
sibsp          0
zero.1         0
zero.2         0
zero.3         0
zero.4         0
zero.5         0
zero.6         0
Parch          0
zero.7         0
zero.8         0
zero.9         0
zero.10        0
zero.11        0
zero.12        0
zero.13        0
zero.14        0
Pclass         0
zero.15        0
zero.16        0
Embarked       0
zero.17        0
zero.18        0
2urvived       0
dtype: int64
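
One caveat about filling with the mean: the values of `Embarked` shown in `df.head()` (0.0 and 2.0) look like integer category codes. If that assumption holds, the mean can produce a value that matches no real category, and the most frequent value (the mode) may be a safer fill. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column of category codes with one missing entry.
toy = pd.DataFrame({"Embarked": [2.0, 0.0, np.nan, 2.0, 1.0]})

# Mean imputation: the filled value may fall between the category codes.
mean_filled = toy["Embarked"].fillna(toy["Embarked"].mean())

# Mode imputation: the filled value is always an existing code.
# mode() returns a Series because ties are possible; take the first entry.
mode_value = toy["Embarked"].mode().iloc[0]
mode_filled = toy["Embarked"].fillna(mode_value)

print("Mean fill:", mean_filled.iloc[2])
print("Mode fill:", mode_filled.iloc[2])
```

For continuous columns such as `Age` or `Fare`, the mean or median is the more natural choice.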