In [2]:
import pandas as pd
import numpy as np
from io import StringIO


Imagina que eres un investigador en el mundo de Pokémon y has recibido una base de datos con información desactualizada y desordenada sobre avistamientos de Pokémon. Tu misión es limpiar y actualizar esta base de datos para que pueda ser utilizada en un estudio sobre la población de Pokémon en la región.

Datos Iniciales

Los datos iniciales contienen las siguientes columnas:

    SightingDate: Fecha del avistamiento.
    TrainerID: Identificación del entrenador que reportó el avistamiento.
    PokemonName: Nombre del Pokémon avistado.
    CP: Puntos de combate del Pokémon reportado.
    HP: Puntos de salud del Pokémon reportado.
    Type: Tipo del Pokémon.
    Weather: Clima durante el avistamiento.

# 1 Carga de Datos

In [23]:
path_datos = 'pokemon.csv'
# TODO completa el codigo para cargar los datos
df = pd.read_csv(path_datos, encoding= 'latin1')
df

Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather
0,2023-11-08T14:00:00+0000,2023-11-08T14:00Z,TR123,Pikachu,500,35,Electric,Clear
1,2023-07-12T09:30:00+0100,2023-07-12T08:30Z,TR456,Eevee,MISSING,55,Normal,Cloudy
2,2023/02/23T22:15:00+0900,2023-02-23T13:15Z,TR789,Magicarp,1000,10,Water,Rain
3,2023-04-30T06:45:00-0400,2023-04-30T10:45Z,TR101,Gengar,800,45,Ghost,PARTLY_CLOUDY
4,2023-08-15T16:00:00+1000,,TR102,Bulbasaur,750,50,Grass/Poison,Sunny


# Limpieza de Datos

## 1 Normalizacion de Zonas Horarias

Normaliza la columna `'SightingTimeUTC'` a la zona horaria UTC y convierte `'SightingDate'` al mismo formato de tiempo.


In [30]:
import pytz

#Vemos que en que formato estan las fechas
print(df['SightingTimeUTC'].dtype)
#convertimos a date time
df['SightingTimeUTC'] = pd.to_datetime(df['SightingDate'], utc = True)

#hacemos exactamente lo mismo para los sighting dates
print(df['SightingDate'].dtype)
df['SightingDate'] = pd.to_datetime(df['SightingDate'],  utc = True)

#si cargas el cell dos veces puedes verificar que el dtype cambio de object a datetime64[ns, UTC]

df


datetime64[ns, UTC]
datetime64[ns, UTC]


Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500,35,Electric,Clear
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,MISSING,55,Normal,Cloudy
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,1000,10,Water,Rain
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800,45,Ghost,PARTLY_CLOUDY
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750,50,Grass/Poison,Sunny


Compara si la fecha de la columna `'SightingDate'` coincide con la fecha en `'SightingTimeUTC'` una vez normalizada.


In [32]:
# Pista: Puedes crear una columna nueva para el resultado de la comparación.

df['DateMatch'] = ((df['SightingDate'])==(df['SightingTimeUTC']))

df


Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500,35,Electric,Clear,True
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,MISSING,55,Normal,Cloudy,True
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,1000,10,Water,Rain,True
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800,45,Ghost,PARTLY_CLOUDY,True
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750,50,Grass/Poison,Sunny,True


Ajusta `'SightingTimeUTC'` a la zona horaria local de cada entrenador y crea una columna `'SightingTimeLocal'`.

In [42]:
#localizamos. es una propiedad de UTC chida :)
df['SightingTimeLocal'] = df['SightingTimeUTC'].dt.tz_localize(None)

df

Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch,SightingTimeLocal
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500,35,Electric,Clear,True,2023-11-08 14:00:00
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,MISSING,55,Normal,Cloudy,True,2023-07-12 08:30:00
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,1000,10,Water,Rain,True,2023-02-23 13:15:00
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800,45,Ghost,PARTLY_CLOUDY,True,2023-04-30 10:45:00
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750,50,Grass/Poison,Sunny,True,2023-08-15 06:00:00


Calcula el tiempo transcurrido desde el momento del avistamiento hasta `'ahora'` (tu hora local) y crea una columna `'TimeSinceSighting'`.

In [49]:
from datetime import datetime #este import me parece ridiculo, pero sin el no funciona la fucion now()
df['TimeSinceSighting'] = datetime.now()-df['SightingTimeLocal'] 

df


Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch,SightingTimeLocal,TimeSinceSighting
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500,35,Electric,Clear,True,2023-11-08 14:00:00,2 days 05:11:18.258807
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,MISSING,55,Normal,Cloudy,True,2023-07-12 08:30:00,121 days 10:41:18.258807
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,1000,10,Water,Rain,True,2023-02-23 13:15:00,260 days 05:56:18.258807
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800,45,Ghost,PARTLY_CLOUDY,True,2023-04-30 10:45:00,194 days 08:26:18.258807
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750,50,Grass/Poison,Sunny,True,2023-08-15 06:00:00,87 days 13:11:18.258807


## 2. Limpeiza de IDs

Llena los valores faltantes en `'TrainerID'` con el ID `'UNKNOWN'`.


In [55]:
#como vimos en clase, usamos fill na
df['TrainerID'] = df['TrainerID'].fillna('UNKNOWN')

print(df[df['TrainerID']=='UNKNOWN'])
#a chiquita. Todos son conocidos


Empty DataFrame
Columns: [SightingDate, SightingTimeUTC, TrainerID, PokémonName, CP, HP, Type, Weather, DateMatch, SightingTimeLocal, TimeSinceSighting]
Index: []


## 3. Corrección de Nombres de Pokémon

Asegúrate de que los nombres de Pokémon estén capitalizados correctamente.


In [57]:
df['PokemonName'] = (df['PokémonName'].str.lower()).str.capitalize()

df



Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch,SightingTimeLocal,TimeSinceSighting,PokemonName
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500,35,Electric,Clear,True,2023-11-08 14:00:00,2 days 05:11:18.258807,Pikachu
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,MISSING,55,Normal,Cloudy,True,2023-07-12 08:30:00,121 days 10:41:18.258807,Eevee
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,1000,10,Water,Rain,True,2023-02-23 13:15:00,260 days 05:56:18.258807,Magicarp
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800,45,Ghost,PARTLY_CLOUDY,True,2023-04-30 10:45:00,194 days 08:26:18.258807,Gengar
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750,50,Grass/Poison,Sunny,True,2023-08-15 06:00:00,87 days 13:11:18.258807,Bulbasaur


## 4. Conversión de 'CP' y 'HP' a Numéricos

Convierte `'CP'` y `'HP'` a valores numéricos, manejando los `'MISSING'` y comas como separadores de miles.

In [59]:
df['CP'] = pd.to_numeric(df['CP'], errors='coerce')
df['HP'] = pd.to_numeric(df['HP'], errors='coerce')

df

Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch,SightingTimeLocal,TimeSinceSighting,PokemonName
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500.0,35,Electric,Clear,True,2023-11-08 14:00:00,2 days 05:11:18.258807,Pikachu
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,,55,Normal,Cloudy,True,2023-07-12 08:30:00,121 days 10:41:18.258807,Eevee
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,,10,Water,Rain,True,2023-02-23 13:15:00,260 days 05:56:18.258807,Magicarp
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800.0,45,Ghost,PARTLY_CLOUDY,True,2023-04-30 10:45:00,194 days 08:26:18.258807,Gengar
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750.0,50,Grass/Poison,Sunny,True,2023-08-15 06:00:00,87 days 13:11:18.258807,Bulbasaur


## 5. Estandarización de 'Type'

Divide la columna `'Type'` en `'PrimaryType'` y `'SecondaryType'` cuando hay dos tipos.


In [60]:
#usamos la función split y de separador el /

df[['PrimaryType', 'SecondaryType']] = df['Type'].str.split('/', expand = True)
df

Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch,SightingTimeLocal,TimeSinceSighting,PokemonName,PrimaryType,SecondaryType
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500.0,35,Electric,Clear,True,2023-11-08 14:00:00,2 days 05:11:18.258807,Pikachu,Electric,
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,,55,Normal,Cloudy,True,2023-07-12 08:30:00,121 days 10:41:18.258807,Eevee,Normal,
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,,10,Water,Rain,True,2023-02-23 13:15:00,260 days 05:56:18.258807,Magicarp,Water,
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800.0,45,Ghost,PARTLY_CLOUDY,True,2023-04-30 10:45:00,194 days 08:26:18.258807,Gengar,Ghost,
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750.0,50,Grass/Poison,Sunny,True,2023-08-15 06:00:00,87 days 13:11:18.258807,Bulbasaur,Grass,Poison


## 6. Corrección del Clima

Estándariza la columna `'Weather'` para que todos los valores sean mayúsculas.


In [62]:
df['Weather'] = df['Weather'].str.upper()

df


Unnamed: 0,SightingDate,SightingTimeUTC,TrainerID,PokémonName,CP,HP,Type,Weather,DateMatch,SightingTimeLocal,TimeSinceSighting,PokemonName,PrimaryType,SecondaryType
0,2023-11-08 14:00:00+00:00,2023-11-08 14:00:00+00:00,TR123,Pikachu,500.0,35,Electric,CLEAR,True,2023-11-08 14:00:00,2 days 05:11:18.258807,Pikachu,Electric,
1,2023-07-12 08:30:00+00:00,2023-07-12 08:30:00+00:00,TR456,Eevee,,55,Normal,CLOUDY,True,2023-07-12 08:30:00,121 days 10:41:18.258807,Eevee,Normal,
2,2023-02-23 13:15:00+00:00,2023-02-23 13:15:00+00:00,TR789,Magicarp,,10,Water,RAIN,True,2023-02-23 13:15:00,260 days 05:56:18.258807,Magicarp,Water,
3,2023-04-30 10:45:00+00:00,2023-04-30 10:45:00+00:00,TR101,Gengar,800.0,45,Ghost,PARTLY_CLOUDY,True,2023-04-30 10:45:00,194 days 08:26:18.258807,Gengar,Ghost,
4,2023-08-15 06:00:00+00:00,2023-08-15 06:00:00+00:00,TR102,Bulbasaur,750.0,50,Grass/Poison,SUNNY,True,2023-08-15 06:00:00,87 days 13:11:18.258807,Bulbasaur,Grass,Poison


# 3 Analisis de Datos

## Agrupaciones

Agrupa el DataFrame por `'Type'` y calcula la suma de `'CP'` para cada grupo.

In [69]:
df_byType= df.groupby(['Type'])

res = df_byType.agg({'CP': 'sum'})

res


Unnamed: 0_level_0,CP
Type,Unnamed: 1_level_1
Electric,500.0
Ghost,800.0
Grass/Poison,750.0
Normal,0.0
Water,0.0


Después de la suma, agrega una columna que calcule la media de `'HP'` por cada `'Type'`, pero solo para aquellos Pokémon cuyo `'CP'` sea mayor que el promedio de `'CP'` de todo el DataFrame.

In [83]:
#filtramos
prom = df_byType.agg({'CP': 'sum','HP': 'mean'})
print(prom)

res_2=prom[prom['CP']>=df['CP'].mean()]
res_2

                 CP    HP
Type                     
Electric      500.0  35.0
Ghost         800.0  45.0
Grass/Poison  750.0  50.0
Normal          0.0  55.0
Water           0.0  10.0


Unnamed: 0_level_0,CP,HP
Type,Unnamed: 1_level_1,Unnamed: 2_level_1
Ghost,800.0,45.0
Grass/Poison,750.0,50.0
