# Data cleaning

We will apply the [World Bank checklist](https://dimewiki.worldbank.org/wiki/Checklist:_Data_Cleaning) in Python using 
the [SciPy stack](https://www.scipy.org/stackspec.html), mainly pandas.

## 0. Prerequisites

Note: the next cell exists only for Colab compatibility. Running pip install from a notebook is NOT RECOMMENDED.

In [12]:
import importlib.util
import sys

in_colab = 'google.colab' in sys.modules

if in_colab:
    # No trailing slash here: the paths below already start with '/'
    BASE_DIR = "https://github.com/DiploDatos/AnalisisYCuracion/raw/master"
    file_16 = BASE_DIR + '/input/kickstarter-projects/ks-projects-201612.csv'
    file_18 = BASE_DIR + '/input/kickstarter-projects/ks-projects-201801.csv'

else:
    BASE_DIR = ".."
    file_16 = BASE_DIR + '/input/ks-projects-201612.csv'
    file_18 = BASE_DIR + '/input/ks-projects-201801.csv'

# sys.modules only lists modules that were already imported, so check
# whether the package is installed instead
if importlib.util.find_spec('ftfy') is None:
    !pip install 'ftfy<5.6'


## 1. Importing the data

### 1.1. Check that there are no problems with the import

In [13]:
import pandas as pd

pd.options.display.float_format = '{:.2f}'.format

In [14]:
kickstarter_2016 = pd.read_csv(file_16)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte

Let's try to import project data from Kickstarter, the crowdfunding 
platform.

By default, pandas raises an error when it cannot read the data: 
https://pandas.pydata.org/pandas-docs/stable/io.html#error-handling

For now we switch to a more recent file; we will come back to this error 
later.
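
As a rough illustration of how this kind of error arises, the sketch below builds a tiny CSV in memory that contains the byte `0x99` and reads it twice. The sample bytes and the choice of `cp1252` are assumptions made for the example; the encoding of the actual 2016 file is not established here.

```python
from io import BytesIO

import pandas as pd

# Toy bytes (not taken from the real file): 0x99 is the trademark sign in
# Windows-1252, but on its own it is not a valid UTF-8 sequence.
raw_bytes = b"ID,name\n1,Widget\x99\n"

try:
    pd.read_csv(BytesIO(raw_bytes))  # pandas assumes UTF-8 by default
    failed = False
except UnicodeDecodeError:
    failed = True

# Passing an explicit encoding lets pandas decode the same bytes.
decoded = pd.read_csv(BytesIO(raw_bytes), encoding="cp1252")
print(failed, decoded.loc[0, "name"])
```

Guessing an encoding can silently produce wrong characters, so the result should be inspected before trusting it.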

In [15]:
kickstarter_2018 = pd.read_csv(file_18)

Let's look at the data loaded into the dataframe

In [16]:
kickstarter_2018

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.00,2015-08-11 12:12:28,0.00,failed,0,GB,0.00,0.00,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.00,2017-09-02 04:43:57,2421.00,failed,15,US,100.00,2421.00,30000.00
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.00,2013-01-12 00:20:50,220.00,failed,3,US,220.00,220.00,45000.00
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.00,2012-03-17 03:24:11,1.00,failed,1,US,1.00,1.00,5000.00
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.00,2015-07-04 08:35:03,1283.00,canceled,14,US,1283.00,1283.00,19500.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378656,999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,50000.00,2014-09-17 02:35:30,25.00,canceled,1,US,25.00,25.00,50000.00
378657,999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,1500.00,2011-06-22 03:35:14,155.00,failed,5,US,155.00,155.00,1500.00
378658,999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,15000.00,2010-07-01 19:40:30,20.00,failed,1,US,20.00,20.00,15000.00
378659,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.00,2016-01-13 18:13:53,200.00,failed,6,US,200.00,200.00,15000.00


By default we only see the values at the beginning and at the end of the file.

Let's take a random sample to see values that are more spread out

In [18]:
import numpy as np

# set seed for reproducibility
np.random.seed(0)
kickstarter_2018.sample(5)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
338862,796196901,10G Christmas Tree,Art,Art,USD,2010-12-26,10526.0,2010-12-08 08:44:04,0.0,failed,0,US,0.0,0.0,10526.0
277871,483825010,Gliff,Gaming Hardware,Games,USD,2016-03-28,10000.0,2016-01-28 04:56:18,51.0,failed,5,US,51.0,51.0,10000.0
47000,123916947,STUFFED Food Truck,Food Trucks,Food,USD,2015-01-06,60000.0,2014-11-07 02:24:36,25.0,failed,1,US,25.0,25.0,60000.0
111338,1565733636,NeoExodus Adventure: Origin of Man for Pathfin...,Tabletop Games,Games,USD,2012-05-01,500.0,2012-03-15 01:16:10,585.0,successful,17,US,585.0,585.0,500.0
53743,1273544891,NAPOLEON IN NEW YORK! an original TV Series,Comedy,Film & Video,USD,2016-07-26,25000.0,2016-05-27 00:07:25,25.0,failed,1,US,25.0,25.0,25000.0


No obvious problem is visible at first glance.

Note that we all saw the same results. Because the seed was fixed, there was 
no real randomness; this is necessary when we want to "reproduce random values".
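
A minimal sketch of two ways to get a reproducible sample, using a toy dataframe invented for the example. The first seeds NumPy's global generator as in the cell above; the second passes `random_state` to `sample`, which does not depend on how many random draws happened earlier in the session.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"value": range(100)})  # toy data, not the Kickstarter file

# Option 1: seed the global generator right before each draw.
np.random.seed(0)
a = toy.sample(5)
np.random.seed(0)
c = toy.sample(5)  # same rows as `a` because the seed was reset

# Option 2: give the seed to sample() itself.
first = toy.sample(5, random_state=0)
second = toy.sample(5, random_state=0)

print(a.equals(c), first.equals(second))
```

With option 1, a second `sample` call without resetting the seed continues from the generator's current state and will generally return different rows.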

Let's check whether the dataset description matches what we loaded: https://www.kaggle.com/kemical/kickstarter-projects/data

In [19]:
pd.DataFrame([
    ["ID", "No description provided", "Numeric"],
    ["name", "No description provided", "String"],
    ["category", "No description provided", "String"],
    ["main_category", "No description provided", "String"],
    ["currency", "No description provided", "String"],
    ["deadline", "No description provided", "DateTime"],
    ["goal", "Goal amount in project currency", "Numeric"],
    ["launched", "No description provided", "DateTime"],
    ["pledged", "Pledged amount in the project currency", "Numeric"],
    ["state", "No description provided", "String"],
    ["backers", "No description provided", "Numeric"],
    ["country", "No description provided", "String"],
    ["usd pledged", "Pledged amount in USD (conversion made by KS)", "Numeric"],
    ["usd_pledged_real", 
     "Pledged amount in USD (conversion made by fixer.io api)", "Numeric"],
    ["usd_goal_real", "Goal amount in USD", "Numeric"]], 
    columns=["Field name","Field description", "Type"]
)

Unnamed: 0,Field name,Field description,Type
0,ID,No description provided,Numeric
1,name,No description provided,String
2,category,No description provided,String
3,main_category,No description provided,String
4,currency,No description provided,String
5,deadline,No description provided,DateTime
6,goal,Goal amount in project currency,Numeric
7,launched,No description provided,DateTime
8,pledged,Pledged amount in the project currency,Numeric
9,state,No description provided,String


Now let's look at the data types that pandas detected

In [20]:
kickstarter_2018.dtypes

ID                    int64
name                 object
category             object
main_category        object
currency             object
deadline             object
goal                float64
launched             object
pledged             float64
state                object
backers               int64
country              object
usd pledged         float64
usd_pledged_real    float64
usd_goal_real       float64
dtype: object

Fields of type object are usually strings, so it seems pandas did not 
recognize **deadline** and **launched** as dates :(

Let's look at a summary of the data

In [21]:
kickstarter_2018.describe()

Unnamed: 0,ID,goal,pledged,backers,usd pledged,usd_pledged_real,usd_goal_real
count,378661.0,378661.0,378661.0,378661.0,374864.0,378661.0,378661.0
mean,1074731191.99,49080.79,9682.98,105.62,7036.73,9058.92,45454.4
std,619086204.32,1183391.26,95636.01,907.19,78639.75,90973.34,1152950.06
min,5971.0,0.01,0.0,0.0,0.0,0.0,0.01
25%,538263516.0,2000.0,30.0,2.0,16.98,31.0,2000.0
50%,1075275634.0,5200.0,620.0,12.0,394.72,624.33,5500.0
75%,1610148624.0,16000.0,4076.0,56.0,3034.09,4050.0,15500.0
max,2147476221.0,100000000.0,20338986.27,219382.0,20338986.27,20338986.27,166361390.71


By default only the numeric columns are shown; let's look at the rest.

In [22]:
kickstarter_2018.describe(include=['object'])

Unnamed: 0,name,category,main_category,currency,deadline,launched,state,country
count,378657,378661,378661,378661,378661,378661,378661,378661
unique,375764,159,15,14,3164,378089,6,23
top,New EP/Music Development,Product Design,Film & Video,USD,2014-08-08,1970-01-01 01:00:00,failed,US
freq,41,22314,63585,295365,705,7,197719,292627


Let's operate a bit on the launch dates

In [23]:
kickstarter_2018['launched'].min()

'1970-01-01 01:00:00'

It seems to work (strings in this format sort in chronological order), but now let's compute the date range of the projects

In [24]:
kickstarter_2018['launched'].max() - kickstarter_2018['launched'].min()

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Let's specify which columns are dates, as the [documentation](https://pandas.pydata.org/pandas-docs/stable/io.html#datetime-handling) describes

In [26]:
kickstarter_2018 = pd.read_csv(file_18, parse_dates=["deadline","launched"])
kickstarter_2018.dtypes

ID                           int64
name                        object
category                    object
main_category               object
currency                    object
deadline            datetime64[ns]
goal                       float64
launched            datetime64[ns]
pledged                    float64
state                       object
backers                      int64
country                     object
usd pledged                float64
usd_pledged_real           float64
usd_goal_real              float64
dtype: object

Now we see that those columns were recognized as dates
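
The same idea on a toy CSV, as a small sketch: the two rows below reuse launch timestamps shown earlier in this notebook, and the `pd.to_datetime` call is an alternative (not used in the cells above) for converting a column that was already loaded as text.

```python
from io import StringIO

import pandas as pd

toy_csv = "ID,launched\n1,2015-08-11 12:12:28\n2,2017-09-02 04:43:57\n"

as_text = pd.read_csv(StringIO(toy_csv))                            # strings
as_dates = pd.read_csv(StringIO(toy_csv), parse_dates=["launched"])  # datetimes

# Alternative: convert after loading instead of re-reading the file.
converted = pd.to_datetime(as_text["launched"])

# Subtraction only works once the values are real datetimes.
span = as_dates["launched"].max() - as_dates["launched"].min()
print(as_text["launched"].dtype, as_dates["launched"].dtype, span)
```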

Let's take a sample again. It is not the same sample as before, because the seed was not reset before this call.

In [27]:
kickstarter_2018.sample(5)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
299667,595962034,Button Poetry Live!,Mixed Media,Art,USD,2015-09-18,10000.0,2015-08-19 19:34:20,18216.27,successful,455,US,18216.27,18216.27,10000.0
181674,1924707671,"C STREET 2012 : Tbilisi, Georgia",World Music,Music,USD,2012-06-07,5000.0,2012-05-08 18:22:59,7210.69,successful,82,US,7210.69,7210.69,5000.0
137583,1698707842,Dérive's Next Project,Punk,Music,USD,2014-07-06,1200.0,2014-06-08 17:58:37,1255.66,successful,33,US,1255.66,1255.66,1200.0
296861,581269566,Photo Book - World Santa Claus Congress,Photobooks,Photography,DKK,2017-04-14,110000.0,2017-03-14 23:45:35,462.0,failed,5,DK,0.0,66.46,15823.47
66362,1337585114,Kickstart CLE Brewing to greatness!,Drinks,Food,USD,2017-08-19,6500.0,2017-07-20 21:22:43,250.0,failed,5,US,75.0,250.0,6500.0


And let's look at the summary of the data

In [28]:
kickstarter_2018.describe(include='all')

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
count,378661.0,378657,378661,378661,378661,378661,378661.0,378661,378661.0,378661,378661.0,378661,374864.0,378661.0,378661.0
unique,,375764,159,15,14,3164,,378089,,6,,23,,,
top,,New EP/Music Development,Product Design,Film & Video,USD,2014-08-08 00:00:00,,1970-01-01 01:00:00,,failed,,US,,,
freq,,41,22314,63585,295365,705,,7,,197719,,292627,,,
first,,,,,,2009-05-03 00:00:00,,1970-01-01 01:00:00,,,,,,,
last,,,,,,2018-03-03 00:00:00,,2018-01-02 15:02:31,,,,,,,
mean,1074731191.99,,,,,,49080.79,,9682.98,,105.62,,7036.73,9058.92,45454.4
std,619086204.32,,,,,,1183391.26,,95636.01,,907.19,,78639.75,90973.34,1152950.06
min,5971.0,,,,,,0.01,,0.0,,0.0,,0.0,0.0,0.01
25%,538263516.0,,,,,,2000.0,,30.0,,2.0,,16.98,31.0,2000.0


We can see that the summary now includes first and last for the date columns.

Now we should be able to compute the range of launch dates

In [29]:
kickstarter_2018['launched'].max() - kickstarter_2018['launched'].min()

Timedelta('17533 days 14:02:31')
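
A range of roughly 48 years is larger than expected for this dataset: the minimum, `1970-01-01 01:00:00`, sits at the start of the Unix epoch, which often signals a missing or zero timestamp rather than a real launch date. The sketch below shows one possible way to look at the range without such values; the three timestamps are copied from outputs above, and the cutoff date is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

launched = pd.Series(pd.to_datetime([
    "1970-01-01 01:00:00",  # suspicious epoch-like value
    "2010-07-01 19:40:30",
    "2018-01-02 15:02:31",
]))

cutoff = pd.Timestamp("2000-01-01")  # hypothetical plausibility threshold
plausible = launched[launched >= cutoff]

span_all = launched.max() - launched.min()
span_plausible = plausible.max() - plausible.min()
print(span_all, span_plausible)
```

Whether such rows should be dropped, corrected, or kept is a decision for the analysis; the sketch only makes them visible.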

### 1.2. Make sure IDs/keys are unique

Check that there is no duplicated data

In [30]:
kickstarter_2018.shape

(378661, 15)

pandas supports indexes on DataFrames, so let's reload the dataset using ID as the index

In [31]:
kickstarter_2018 = pd.read_csv(
    file_18, parse_dates=["deadline","launched"], index_col=['ID']
)

In [32]:
kickstarter_2018

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.00,2015-08-11 12:12:28,0.00,failed,0,GB,0.00,0.00,1533.95
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.00,2017-09-02 04:43:57,2421.00,failed,15,US,100.00,2421.00,30000.00
1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.00,2013-01-12 00:20:50,220.00,failed,3,US,220.00,220.00,45000.00
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.00,2012-03-17 03:24:11,1.00,failed,1,US,1.00,1.00,5000.00
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.00,2015-07-04 08:35:03,1283.00,canceled,14,US,1283.00,1283.00,19500.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,50000.00,2014-09-17 02:35:30,25.00,canceled,1,US,25.00,25.00,50000.00
999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,1500.00,2011-06-22 03:35:14,155.00,failed,5,US,155.00,155.00,1500.00
999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,15000.00,2010-07-01 19:40:30,20.00,failed,1,US,20.00,20.00,15000.00
999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.00,2016-01-13 18:13:53,200.00,failed,6,US,200.00,200.00,15000.00


In [33]:
kickstarter_2018.shape

(378661, 14)

This way we can look up rows by index

In [34]:
kickstarter_2018.loc[999988282]

name                Nou Renmen Ayiti!  We Love Haiti!
category                              Performance Art
main_category                                     Art
currency                                          USD
deadline                          2011-08-16 00:00:00
goal                                          2000.00
launched                          2011-07-19 09:07:47
pledged                                        524.00
state                                          failed
backers                                            17
country                                            US
usd pledged                                    524.00
usd_pledged_real                               524.00
usd_goal_real                                 2000.00
Name: 999988282, dtype: object

We can also check whether there are rows with duplicated content

In [35]:
kickstarter_2018[kickstarter_2018.duplicated()]

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


Since pandas accepts duplicated values in an index, we must check there too

In [36]:
pd.Series(kickstarter_2018.index, dtype=str).describe()

count       378661
unique      378661
top       16727979
freq             1
Name: ID, dtype: object

In [37]:
kickstarter_2018[kickstarter_2018.index.duplicated()]

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1


#### Reviewing with a toy example

In [38]:
csv='1,2\n3,3\n1,3'
print(csv)

1,2
3,3
1,3


In [39]:
from io import StringIO

df = pd.read_csv(StringIO(csv), names=['id','value'], index_col='id')
df

Unnamed: 0_level_0,value
id,Unnamed: 1_level_1
1,2
3,3
1,3


In [40]:
df[df.duplicated()]

Unnamed: 0_level_0,value
id,Unnamed: 1_level_1
1,3


In [41]:
df[df.index.duplicated(keep=False)]

Unnamed: 0_level_0,value
id,Unnamed: 1_level_1
1,2
1,3
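
The two checks above answer different questions, which the following sketch on the same toy CSV tries to make explicit: `DataFrame.duplicated()` compares only the column values and ignores the index, while `Index.duplicated()` compares only the index labels. The variable names are chosen for the example.

```python
from io import StringIO

import pandas as pd

toy = pd.read_csv(StringIO("1,2\n3,3\n1,3"), names=["id", "value"], index_col="id")

# Column values only: the third row repeats value 3 from the second row.
content_dups = toy.duplicated()

# Index labels only: id 1 appears twice.
index_dups = toy.index.duplicated(keep=False)

# Quick yes/no answer for key uniqueness.
ids_are_unique = toy.index.is_unique

# Bringing the index back as a column compares id and value together.
full_row_dups = toy.reset_index().duplicated()

print(content_dups.tolist(), index_dups.tolist(), ids_are_unique, full_row_dups.tolist())
```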


#### Exercise 1:

Build a table with all projects that have duplicated names, sorted so that 
they can be reviewed in groups. 

In [44]:
cols = ['name', 'main_category', 'state', 'goal']

kickstarter_2018[
    kickstarter_2018[cols].duplicated(keep=False)
].sort_values('name')

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
422509694,"""American Sports Stories"" - An Athletic Quest",Television,Film & Video,USD,2017-05-14,100000.00,2017-04-14 00:08:52,1.00,failed,1,US,25.00,1.00,100000.00
1880084695,"""American Sports Stories"" - An Athletic Quest",Shorts,Film & Video,USD,2015-08-26,100000.00,2015-06-27 02:02:00,100.00,failed,1,US,100.00,100.00,100000.00
1023301684,"""Pulse""- a new album from ""Blind Focus"" to sup...",Music,Music,CAD,2016-12-17,4000.00,2016-11-17 09:45:59,0.00,canceled,0,CA,0.00,0.00,2959.89
306461885,"""Pulse""- a new album from ""Blind Focus"" to sup...",Music,Music,CAD,2016-12-08,4000.00,2016-11-08 03:01:36,270.00,canceled,3,CA,44.77,204.90,3035.59
683862511,2nd Life,Web,Technology,USD,2017-05-23,75000.00,2017-04-19 00:55:37,925.00,canceled,7,US,175.00,925.00,75000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
616190169,Zero Hour,Tabletop Games,Games,USD,2017-06-30,28000.00,2017-05-30 16:59:40,7440.00,canceled,160,US,5583.00,7440.00,28000.00
1217993841,iHelp,Apps,Technology,USD,2017-06-13,2750.00,2017-05-18 22:23:40,31.00,canceled,2,US,30.00,31.00,2750.00
1106384724,iHelp,Apps,Technology,USD,2017-05-22,2750.00,2017-05-21 18:55:04,0.00,canceled,0,US,0.00,0.00,2750.00
1612055887,x (Canceled),Fiction,Publishing,USD,2012-02-22,15000.00,2012-01-12 01:07:01,0.00,canceled,0,US,0.00,0.00,15000.00
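
The cell above treats two projects as duplicates only when name, main category, state, and goal all match. As a point of comparison, here is a sketch of the looser criterion (same name only) on a toy frame; the rows are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["iHelp", "Zero Hour", "iHelp", "Zero Hour"],
    "state": ["canceled", "canceled", "canceled", "failed"],
    "goal": [2750.0, 28000.0, 2750.0, 28000.0],
})

# Same name, regardless of the other columns.
by_name = toy[toy["name"].duplicated(keep=False)].sort_values("name")

# Same name, state, and goal: a stricter criterion.
by_cols = toy[toy[["name", "state", "goal"]].duplicated(keep=False)].sort_values("name")

print(len(by_name), len(by_cols))
```

The stricter criterion returns fewer rows, so the choice of columns determines which projects end up in the review table.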


### 1.3. De-identify the data and save it to a new file

There are many techniques for de-identifying data.

To illustrate, here are the ones offered by Google 
https://cloud.google.com/dlp/docs/transformations-reference:

* **Replacement**: Replaces each input value with a given value.
* **Redaction**: Removes a value and hides it.
* **Character masking**: Masks a string fully or partially by replacing a 
    given number of characters with a specified fixed character.
* **Pseudonymization by replacing an input value with a cryptographic 
    hash**: 
    Replaces input values with a 32-byte hexadecimal string generated using 
    a data encryption key.
* **Date shifting**: Shifts dates by a random number of days, with the 
    option to be consistent within the same context.
* **Pseudonymization by replacing with a cryptographic format-preserving 
    token**: Replaces an input value with a token, or surrogate value, of 
    the same length using format-preserving encryption (FPE) with the FFX 
    mode of operation. This allows the output to be used in systems that 
    validate format or that need values to look real even though the 
    information is not revealed.
* **Bucketing values based on fixed-size ranges**: Masks input values by 
    replacing them with buckets, or ranges within which the input value 
    falls.
* **Bucketing values based on custom-size ranges**: 
    Buckets input values based on user-configurable ranges and replacement 
    values.
* **Time extraction**: 
    Extracts or preserves a portion of Date, Timestamp, and TimeOfDay values.
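
Three of the techniques listed above (character masking, bucketing, and time extraction) can be approximated with plain pandas. The sketch below is only an illustration on a toy frame; it does not use or reproduce Google's API, and the bucket edges and labels are arbitrary choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Where is Hank?", "The Tribe"],
    "goal": [1500.0, 42000.0],
    "launched": pd.to_datetime(["2013-01-12 00:20:50", "2011-06-22 03:35:14"]),
})

# Character masking: keep the first character, replace the rest with '#'.
toy["name_masked"] = toy["name"].apply(lambda s: s[0] + "#" * (len(s) - 1))

# Bucketing with custom ranges: report only the range the goal falls in.
toy["goal_bucket"] = pd.cut(
    toy["goal"],
    bins=[0, 1000, 10000, 100000],
    labels=["0-1k", "1k-10k", "10k-100k"],
)

# Time extraction: keep only the year of the launch timestamp.
toy["launch_year"] = toy["launched"].dt.year

print(toy[["name_masked", "goal_bucket", "launch_year"]])
```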

In [45]:
from hashlib import md5

kickstarter_2018['name'].apply(md5)

TypeError: Unicode-objects must be encoded before hashing

In [46]:
def hashit(val):
    return md5(val.encode('utf-8'))

kickstarter_2018['name'].apply(hashit)

AttributeError: 'float' object has no attribute 'encode'

In [47]:
def hashit(val):
    try:
        return md5(val.encode('utf-8'))
    except Exception as e:
        print(val, type(val))
        raise(e)

kickstarter_2018['name'].apply(hashit)

nan <class 'float'>


AttributeError: 'float' object has no attribute 'encode'

In [51]:
def hashit(val):
    if isinstance(val, float): 
        return str(val)
    return md5(val.encode('utf-8')).hexdigest()


kickstarter_2018['nhash'] = kickstarter_2018['name'].apply(hashit)
kickstarter_2018[['nhash']]

Unnamed: 0_level_0,nhash
ID,Unnamed: 1_level_1
1000002330,a6828ae8a2eca25f0dd7035efc0af0a0
1000003930,81609b3bdc0b96f429672d69702f2524
1000004038,c12f5c3bace2f0213cdb2679a265dca0
1000007540,4dbdcf09c86bbf5683ec39bc57b77f81
1000011046,9c01404a2ef702811c2088ce139042ad
...,...
999976400,d89228576343394467096843057f3aa4
999977640,bbcb30bd9bd4f9bff0a96fc44d0001f0
999986353,6c3094666e1a315b6e179566fe3972d9
999987933,887be409ad8b93f26084845a41d4c178
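
Two caveats about the last version of `hashit`: missing names become the literal string `'nan'`, and an unkeyed MD5 of a short public string can be reversed by hashing candidate names. A possible variant, sketched under the assumption that a secret key is available (the key below is a placeholder), keeps missing values as missing and uses a keyed SHA-256 digest from the standard library.

```python
import hashlib
import hmac

import pandas as pd

# Placeholder secret for the example; a real key would be stored outside the notebook.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(val):
    """Return a keyed SHA-256 hex digest, or pd.NA when the value is missing."""
    if pd.isna(val):
        return pd.NA
    return hmac.new(SECRET_KEY, str(val).encode("utf-8"), hashlib.sha256).hexdigest()


names = pd.Series(["Where is Hank?", None, "Where is Hank?"])
hashed = names.apply(pseudonymize)
print(hashed)
```

Equal names still map to equal digests, so duplicates remain detectable, while the original name cannot be recomputed without the key.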


#### Exercise 2:

Verify that projects with duplicated names also have a duplicated name hash

In [55]:
cols = ['name', 'nhash']

kickstarter_2018[
    kickstarter_2018[cols].duplicated(keep=False)
].sort_values('name')[cols]

Unnamed: 0_level_0,name,nhash
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
816998285,"""...The Last shall be first..."" LODB Lifestyle...",0c7a251ffe4c7834cbc4b04906952ff1
815783250,"""...The Last shall be first..."" LODB Lifestyle...",0c7a251ffe4c7834cbc4b04906952ff1
1010584633,"""A Fresh Start""",67554ab4203d95f2f2f05365f768206e
713417995,"""A Fresh Start""",67554ab4203d95f2f2f05365f768206e
1880084695,"""American Sports Stories"" - An Athletic Quest",19a82bc4c5961834282575d07d9b5f7c
...,...,...
329580179,xxx (Canceled),930857c212f21166427b23d4a7fe52a3
1848699072,,
634871725,,
648853978,,
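
Inspecting the table by eye works for a handful of rows. As a complementary sketch, the check can also be expressed as a boolean: every name should map to exactly one hash and every hash to exactly one name. The toy frame and its short placeholder hashes are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["iHelp", "Zero Hour", "iHelp"],
    "nhash": ["h1", "h2", "h1"],  # placeholder values, not real digests
})

hashes_per_name = toy.groupby("name")["nhash"].nunique()
names_per_hash = toy.groupby("nhash")["name"].nunique()

# True when names and hashes are in one-to-one correspondence.
consistent = bool((hashes_per_name == 1).all() and (names_per_hash == 1).all())

# The duplicate masks computed from either column should also agree.
same_mask = bool(
    (toy["name"].duplicated(keep=False) == toy["nhash"].duplicated(keep=False)).all()
)
print(consistent, same_mask)
```

Note that `groupby` skips missing names by default, so rows without a name need a separate check.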


### 1.4. Never modify the raw or original data


In [56]:
kickstarter_2018.to_csv("../output/ks-projects-201801-for-pandas.csv")
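
The cell above writes the processed table to a separate `output` folder, leaving the files under `input` untouched. A small sketch of the same pattern on toy data follows; it uses a temporary directory and a made-up file name so that it can run anywhere, and it reloads the file to confirm that the saved copy can be read back with the same options.

```python
import tempfile
from pathlib import Path

import pandas as pd

cleaned = pd.DataFrame(
    {
        "name": ["Where is Hank?", "The Tribe"],
        "launched": pd.to_datetime(["2013-01-12 00:20:50", "2011-06-22 03:35:14"]),
    },
    index=pd.Index([1000004038, 999977640], name="ID"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"          # separate from the raw inputs
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "toy-cleaned.csv"  # hypothetical file name

    cleaned.to_csv(out_file)
    reloaded = pd.read_csv(out_file, parse_dates=["launched"], index_col="ID")

same_ids = reloaded.index.tolist() == cleaned.index.tolist()
same_dates = bool((reloaded["launched"] == cleaned["launched"]).all())
print(same_ids, same_dates)
```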