Import the required packages.

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
# Project-specific cleaning helpers (the limpieza_* functions used below)
from src.cleaning_functions import *

Load the data as a pandas DataFrame.

In [3]:
sharks = pd.read_csv("data/attacks.csv", encoding="ISO-8859-1")

Explore the data.

In [4]:
sharks.shape

(25723, 24)

In [5]:
sharks.sample()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,...,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
1952,2001.04.28,28-Apr-2001,2001.0,Unprovoked,USA,Florida,"New Smyrna Beach, Volusia County",Boogie boarding,male,M,...,,"S. Petersohn, GSAF",2001.04.28-NV-NewSmyrnaBeach.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2001.04.28,2001.04.28,4351.0,,


In [6]:
sharks.isna().sum()

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
Unnamed: 22               25722
Unnamed: 23               25721
dtype: int64

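The DataFrame turns out to be very dirty, with a large number of nulls: most columns are missing in more than 19,000 of the 25,723 rows.

Rename all the columns of the DataFrame to simplify them: lower case, with spaces and dots replaced by underscores.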
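Raw null counts are easier to compare as a fraction of the rows. A minimal sketch on a toy frame (the values are made up and only stand in for the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Case Number": ["2001.04.28", np.nan, np.nan, np.nan],
    "Age": [np.nan, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values in each column.
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

On the real data the same expression would be `sharks.isna().mean()`.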

In [7]:
# Lower-case names; spaces and dots become underscores (a trailing space is dropped).
def simplificar(columna):
    nombre = columna.replace(" ", "_").replace(".", "_").lower()
    return nombre[:-1] if columna.endswith(" ") else nombre

sharks.rename(columns=simplificar, inplace=True)
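The same renaming rule can also be written with pandas' vectorised string methods. This is only a sketch: `normalize_columns` is a hypothetical helper, and it strips all trailing whitespace rather than a single trailing space, which gives the same result for names like the ones shown here.

```python
import pandas as pd

def normalize_columns(columns):
    """Return lower-case column names with spaces and dots turned into underscores."""
    idx = pd.Index(columns)
    # Drop trailing whitespace first so it does not become a trailing underscore.
    return idx.str.rstrip().str.lower().str.replace(r"[ .]", "_", regex=True)

raw = ["Case Number.1", "Sex ", "Fatal (Y/N)", "Unnamed: 22"]
print(list(normalize_columns(raw)))
```

Applied to the DataFrame, this would be `sharks.columns = normalize_columns(sharks.columns)`.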

In [8]:
sharks.sample()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,...,species,investigator_or_source,pdf,href_formula,href,case_number_1,case_number_2,original_order,unnamed:_22,unnamed:_23
23157,,,,,,,,,,,...,,,,,,,,,,


Next comes the data cleaning.

The first step is to remove all duplicated rows.

In [9]:
sharks.drop_duplicates(inplace=True)

Next, drop every row in which all cells are null and every column in which all values are null.

In [10]:
sharks.dropna(axis=0, how='all', inplace=True)
sharks.dropna(axis=1, how='all', inplace=True)

In [11]:
sharks.shape

(6311, 24)

This removes a large number of rows: the DataFrame goes from 25,723 rows to 6,311, while the number of columns stays at 24.
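The column count does not change because `how='all'` only drops a row or column when every value in it is null. A small sketch on made-up data shows the difference:

```python
import numpy as np
import pandas as pd

# Toy data: the second row is entirely null, and so is the "notes" column.
toy = pd.DataFrame({
    "case_number": ["2001.04.28", np.nan, "2006.05.27"],
    "country": ["USA", np.nan, np.nan],
    "notes": [np.nan, np.nan, np.nan],
})

# A row is dropped only if all of its cells are null; the third row stays
# even though its "country" is missing.
rows_kept = toy.dropna(axis=0, how="all")

# Likewise, a column is dropped only if it has no non-null value at all.
cols_kept = rows_kept.dropna(axis=1, how="all")
print(rows_kept.shape, list(cols_kept.columns))
```

A column with even one non-null value survives, so a column that is almost entirely null is kept at this step.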

In [12]:
sharks[(sharks["case_number"]=="0")].isna().sum()

case_number               0
date                      8
year                      8
type                      8
country                   8
area                      8
location                  8
activity                  8
name                      8
sex                       8
age                       8
injury                    8
fatal_(y/n)               8
time                      8
species                   8
investigator_or_source    8
pdf                       8
href_formula              8
href                      8
case_number_1             8
case_number_2             8
original_order            1
unnamed:_22               8
unnamed:_23               8
dtype: int64

In the rows where "case_number" is "0", every other column except "original_order" is null, so these rows carry no useful information and are removed.
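The same check can be made programmatically before dropping the rows. The following is a sketch under assumptions: `carries_no_data` is a hypothetical helper, not part of the notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def carries_no_data(df, mask, ignore):
    """True if the rows selected by mask are null in every column outside ignore."""
    others = df.loc[mask].drop(columns=ignore)
    # Note: this is also True when the mask selects no rows at all.
    return bool(others.isna().to_numpy().all())

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "case_number": ["0", "0", "1961.01.06.a"],
    "date": [np.nan, np.nan, "06-Jan-1961"],
    "original_order": [6304.0, np.nan, 2372.0],
})

placeholder = toy["case_number"] == "0"
if carries_no_data(toy, placeholder, ignore=["case_number", "original_order"]):
    toy = toy[~placeholder]  # safe to discard: nothing else is filled in
print(toy)
```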

In [13]:
sharks.drop(sharks[sharks["case_number"] == "0"].index, inplace=True)

The DataFrame information shows that in the columns "unnamed:_22" and "unnamed:_23" almost all values are NaN. The non-null values are inspected first, and then both columns are dropped entirely.

In [14]:
sharks[sharks["unnamed:_22"].notna()].shape

(1, 24)

In [15]:
sharks[sharks["unnamed:_22"].notna()].head()

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,...,species,investigator_or_source,pdf,href_formula,href,case_number_1,case_number_2,original_order,unnamed:_22,unnamed:_23
1478,2006.05.27,27-May-2006,2006.0,Unprovoked,USA,Hawaii,"North Shore, O'ahu",Surfing,Bret Desmond,M,...,,R. Collier,2006.05.27-Desmond.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2006.05.27,2006.05.27,4825.0,stopped here,


In [16]:
sharks[sharks["unnamed:_23"].notna()].shape

(2, 24)

In [17]:
sharks[sharks["unnamed:_23"].notna()].head(2)

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,...,species,investigator_or_source,pdf,href_formula,href,case_number_1,case_number_2,original_order,unnamed:_22,unnamed:_23
4415,1952.03.30,30-Mar-1952,1952.0,Unprovoked,NETHERLANDS ANTILLES,Curacao,,Went to aid of child being menaced by the shark,A.J. Eggink,M,...,"Bull shark, 2.7 m [9'] was captured & dragged ...","J. Randall, p.352 in Sharks & Survival; H.D. B...",1952.03.30-Eggink.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1952.03.30,1952.03.30,1888.0,,Teramo
5840,1878.09.14.R,Reported 14-Sep-1878,1878.0,Provoked,USA,Connecticut,"Branford, New Haven County",Fishing,Captain Pattison,M,...,,"St. Joseph Herald, 9/14/1878",1878.09.14.R-Pattison.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,1878.09.14.R,1878.09.14.R,463.0,,change filename


There is no valuable data in those columns, so they are dropped.

In [18]:
sharks.drop(columns=["unnamed:_22", "unnamed:_23"], inplace=True)

Now examine the differences between "case_number", "case_number_1" and "case_number_2".

In [19]:
sharks[sharks["case_number_1"]!=sharks["case_number_2"]].head(21)[["case_number","case_number_1","case_number_2"]]

Unnamed: 0,case_number,case_number_1,case_number_2
34,2018.04.03,2018.04.02,2018.04.03
117,2017.07.20.a,2017/07.20.a,2017.07.20.a
144,2017.05.06,2017.06.06,2017.05.06
217,2016.09.15,2016.09.16,2016.09.15
314,2016.01.24.b,2015.01.24.b,2016.01.24.b
334,2015.12.23,2015.11.07,2015.12.23
339,2015.10.28.a,2015.10.28,2015.10.28.a
560,2014.05.04,2013.05.04,2014.05.04
3522,1967.07.05,1967/07.05,1967.07.05
3795,"1962,08.30.b",1962.08.30.b,"1962,08.30.b"


Comparing "case_number_1" and "case_number_2" with "case_number" shows that "case_number_1" holds the same information as "case_number_2" except in a small number of rows, and in those rows "case_number" agrees with "case_number_2". The column "case_number_1" is therefore assumed to add nothing of value and is dropped.
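One caveat when comparing columns with `!=`: NaN never equals NaN, so rows where both columns are missing are reported as differences. A sketch of a comparison that excludes them, where `disagreements` is a hypothetical helper and the toy rows reuse a few values from the output above:

```python
import numpy as np
import pandas as pd

def disagreements(df, left, right):
    """Rows where two columns differ, ignoring rows where both are missing."""
    both_missing = df[left].isna() & df[right].isna()
    return df[(df[left] != df[right]) & ~both_missing]

toy = pd.DataFrame({
    "case_number": ["2018.04.03", "2017.07.20.a", "2001.04.28", np.nan],
    "case_number_1": ["2018.04.02", "2017/07.20.a", "2001.04.28", np.nan],
    "case_number_2": ["2018.04.03", "2017.07.20.a", "2001.04.28", np.nan],
})

n_1_vs_2 = len(disagreements(toy, "case_number_1", "case_number_2"))
n_0_vs_2 = len(disagreements(toy, "case_number", "case_number_2"))
print(n_1_vs_2, n_0_vs_2)
```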

In [20]:
sharks.drop(columns=["case_number_1"], inplace=True)

In [21]:
sharks[sharks["case_number"]!=sharks["case_number_2"]][["case_number","case_number_2","date"]].sample(5)

Unnamed: 0,case_number,case_number_2,date
5488,,1905.09.06.R,Reported 06-Sep-1905
4949,1934.01.08.R,1934.02.08.R,Reported 08-Feb-1934
390,2015.07-10,2015.07.10,10-Jul-2015
5944,1864.05,1864.05.00,May-1864
25722,xx,,


The columns "case_number" and "case_number_2" are equal except in a few rows where, checking against the "date" column, it is "case_number" that is wrong. "case_number" is therefore overwritten with "case_number_2", which is then dropped.

In [22]:
sharks["case_number"] = sharks["case_number_2"]
sharks.drop(columns=["case_number_2"], inplace=True)

In [23]:
sharks.sample(1)

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal_(y/n),time,species,investigator_or_source,pdf,href_formula,href,original_order
3931,1961.01.06.a,06-Jan-1961,1961.0,Provoked,AUSTRALIA,Northern Territory,"Stokes Hill Wharf, Darwin",Fishing,Arthur Hopkins,M,,Finger bitten by hooked shark PROVOKED INCIDENT,N,,"Hammerhead shark, 1.8 m [6']","Darwin Northern Territory News, 1/10/1961",1961.01.06.a-Hopkins.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2372.0


Up to this point, all rows consisting only of NaN values and all columns that add no useful information have been removed from the DataFrame. The modified DataFrame is now exported to a .csv file.

In [24]:
sharks.to_csv("data/sharks_limpio_raw.csv", index=False)

In [25]:
# to_csv wrote the file as UTF-8 (its default), so it is read back with the default encoding
sharks = pd.read_csv("data/sharks_limpio_raw.csv")
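Because `to_csv` writes UTF-8 unless told otherwise, the encoding used to read the file back has to match, or non-ASCII characters come back garbled. A minimal in-memory sketch (the sample value is illustrative):

```python
import io
import pandas as pd

toy = pd.DataFrame({"country": ["Réunion"]})

# to_csv without a path returns a string; encode it the way to_csv
# would write it to disk by default (UTF-8).
csv_bytes = toy.to_csv(index=False).encode("utf-8")

as_latin1 = pd.read_csv(io.BytesIO(csv_bytes), encoding="ISO-8859-1")
as_utf8 = pd.read_csv(io.BytesIO(csv_bytes), encoding="utf-8")

print(as_latin1["country"][0])  # mis-decoded text
print(as_utf8["country"][0])    # original text
```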

Drop all the columns that will not be used to draw conclusions.

In [26]:
sharks.drop(columns=["investigator_or_source", "pdf", "href_formula", "href", "case_number",
                     "original_order", "name", "area", "location"], inplace=True)

Now each useful column is cleaned, using the functions from the functions file (src/cleaning_functions.py).

In [27]:
sharks["sex"] = sharks["sex"].apply(limpieza_sex)
sharks["age"] = sharks["age"].apply(limpieza_age)
sharks["year"] = sharks["year"].apply(limpieza_year)
sharks["fatal_(y/n)"] = sharks["fatal_(y/n)"].apply(limpieza_fatal)
sharks["hour"] = sharks["time"].apply(limpieza_time)
sharks.drop(columns=["time"], inplace=True)
sharks["type"] = sharks["type"].apply(limpieza_type)
sharks["months_code"] = sharks["date"].apply(limpieza_meses)
sharks["day"] = sharks["date"].apply(limpieza_dias)
sharks.drop(columns=["date"], inplace=True)
sharks["species"] = sharks["species"].apply(limpieza_species)
sharks["injury"] = sharks["injury"].apply(limpieza_injury)
sharks["country"] = sharks["country"].apply(limpieza_country)
sharks["activity"] = sharks["activity"].apply(limpieza_activity)
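The contents of src/cleaning_functions.py are not shown here. Purely as an illustration of the kind of per-cell function that `apply` expects, here is a hypothetical sketch: the names `clean_sex_sketch` and `extract_hour_sketch`, the rules they apply, and the assumed `18h00` time format are all assumptions, not the author's implementation.

```python
import re

import numpy as np
import pandas as pd

def clean_sex_sketch(value):
    """Map a raw cell to 'M', 'F' or NaN (hypothetical rule)."""
    if not isinstance(value, str):
        return np.nan
    value = value.strip().upper()
    return value if value in {"M", "F"} else np.nan

def extract_hour_sketch(value):
    """Extract the hour from strings assumed to look like '18h00'; NaN otherwise."""
    if not isinstance(value, str):
        return np.nan
    match = re.search(r"\b(\d{1,2})h\d{2}\b", value.strip().lower())
    if match is None:
        return np.nan
    hour = int(match.group(1))
    return float(hour) if hour < 24 else np.nan

# Each function takes one cell and returns one cleaned cell,
# which is the contract Series.apply relies on.
sex = pd.Series(["M ", "lli", np.nan]).apply(clean_sex_sketch)
hour = pd.Series(["18h00", "Afternoon", np.nan]).apply(extract_hour_sketch)
print(sex.tolist(), hour.tolist())
```

Returning NaN for anything unrecognised keeps unparseable cells out of later aggregations instead of silently guessing a value.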

Reorder the columns of the DataFrame.

In [28]:
sharks = sharks.iloc[:,[0,10,11,9,2,4,5,3,8,1,6,7]]
sharks.sample()

Unnamed: 0,year,months_code,day,hour,country,sex,age,activity,species,type,injury,fatal_(y/n)
2941,1983.0,5.0,24.0,,United States of America,M,15.0,surf,,Provoked,no injury,NO
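Reordering by position depends on the exact column order left by the previous cells. Selecting by name gives the same result and does not break if that order changes. A sketch on an empty toy frame, whose starting column order is inferred from the preceding steps and is therefore an assumption:

```python
import pandas as pd

# Column order assumed to result from the preceding cells (inferred, not shown).
current = ["year", "type", "country", "activity", "sex", "age", "injury",
           "fatal_(y/n)", "species", "hour", "months_code", "day"]
toy = pd.DataFrame(columns=current)

# Target order, taken from the output above.
ordered = ["year", "months_code", "day", "hour", "country", "sex", "age",
           "activity", "species", "type", "injury", "fatal_(y/n)"]

by_position = toy.iloc[:, [0, 10, 11, 9, 2, 4, 5, 3, 8, 1, 6, 7]]
by_name = toy[ordered]  # raises KeyError if a column name is missing

print(list(by_position.columns) == list(by_name.columns))
```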


Remove duplicated rows, along with any rows or columns that are entirely null.

In [29]:
sharks.drop_duplicates(inplace=True)
sharks.dropna(axis=0, how='all', inplace=True)
sharks.dropna(axis=1, how='all', inplace=True)

In [30]:
sharks.shape

(6263, 12)

Export the clean .csv file.

In [31]:
sharks.to_csv("shark_attacks.csv", index=False)