## Example 2: String Manipulation

### 1. Objectives:
    - Learn to use `replace`, `strip`, `title`, `upper`, `lower`, and `split` to transform `string` data
 
### 2. Walkthrough:

In [None]:
import pandas as pd

We start by reading the dataset. This time we'll use the file `new_york_times_bestsellers-dirty.csv`, or the one produced in the previous example:

In [None]:
df = pd.read_csv(..., index_col=0)
df.head(3)

Let's start with the `description` column, where every text begins with 'Descr:'. To remove that prefix we can use the `replace` method of the `str` accessor of that `Series`, in the form:

`dataframe[-column-].str.replace("source", "target")`
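As a minimal sketch of this pattern, the snippet below applies `str.replace` to a small toy `Series`. The values are invented for illustration and are not taken from the dataset; `regex=False` is passed so the pattern is treated as plain text.

```python
import pandas as pd

# Toy data standing in for the `description` column (invented values).
descriptions = pd.Series(["Descr: A spy thriller.", "Descr: A family saga."])

# Replace the literal prefix with an empty string.
# regex=False treats "Descr:" as plain text, not as a regular expression.
cleaned = descriptions.str.replace("Descr:", "", regex=False)

print(cleaned.tolist())
# → [' A spy thriller.', ' A family saga.']
```

Note that the space that followed the prefix is still there; `replace` only removes the text it was asked to match.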

In [None]:
df["description"] = ...

df.head(3)

We can look up a single row of the `description` column to check the result with:

`dataframe.loc[-row-, -column-]`
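A short sketch of this lookup on a toy `DataFrame` (column values invented for illustration). `.loc` selects by label, so `0` here is the row label of the default integer index.

```python
import pandas as pd

# Toy DataFrame with a default integer index (invented values).
toy = pd.DataFrame(
    {
        "title": ["BOOK ONE", "BOOK TWO"],
        "description": ["First text", "Second text"],
    }
)

# Row label 0, column label "description" returns a single cell value.
value = toy.loc[0, "description"]

print(value)
# → First text
```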

In [None]:
...

We can also see blank spaces at the beginning and end of our strings. Let's remove them with `strip`, which removes leading and trailing whitespace, in the form:

`dataframe[-column-].str.strip()`
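The following sketch shows `str.strip` on toy strings (invented values). Called without arguments, it removes spaces, tabs, and newlines at both ends while leaving whitespace inside the text untouched.

```python
import pandas as pd

# Toy strings padded with spaces, a tab, and a newline (invented values).
padded = pd.Series(["  A spy thriller. ", "\tA family saga.\n"])

# strip() with no arguments removes leading and trailing whitespace.
stripped = padded.str.strip()

print(stripped.tolist())
# → ['A spy thriller.', 'A family saga.']
```

When only one side needs cleaning, `str.lstrip()` and `str.rstrip()` work the same way for the left and right ends respectively.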

In [None]:
df['description'] = ...

df.loc[0, "description"]

Perfect, let's examine the DataFrame again:

In [None]:
df.head(3)

Now let's look at the `title` column, whose texts are in all caps. That isn't very pleasant to read, so we can change the capitalization pattern with the `title` method of the `str` accessor, in the form `dataframe[-column-].str.title()`:
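The objectives also mention `upper` and `lower`, which follow the same pattern. A minimal sketch comparing the three on toy titles (invented values):

```python
import pandas as pd

# Toy titles with inconsistent capitalization (invented values).
titles = pd.Series(["THE LAST HARBOR", "a quiet river"])

# title(): first letter of every word uppercase, the rest lowercase.
title_case = titles.str.title()
# upper() / lower(): every letter uppercase / lowercase.
upper_case = titles.str.upper()
lower_case = titles.str.lower()

print(title_case.tolist())  # → ['The Last Harbor', 'A Quiet River']
print(upper_case.tolist())  # → ['THE LAST HARBOR', 'A QUIET RIVER']
print(lower_case.tolist())  # → ['the last harbor', 'a quiet river']
```

Keep in mind that `title` capitalizes every word, including short words such as "the" or "of".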

In [None]:
df["title"] = ...

df.head(3)

Now, say we want to split our `author` column into two columns, `author_first_name` and `author_last_name`. We can do that with the `split` method, in the form:

`dataframe[-column-].str.split(-separator-)`

In [None]:
...

We can turn the result into two columns like this:

`dataframe[-column-].str.split(-separator-, expand=True)`
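To make the difference concrete, here is a small sketch on toy author names (invented values, each with exactly one space). Without `expand`, `split` returns a `Series` of lists; with `expand=True`, it returns a `DataFrame` with one column per piece.

```python
import pandas as pd

# Toy author names, each made of exactly two words (invented values).
authors = pd.Series(["Ana Lopez", "Juan Perez"])

# Default: each element becomes a Python list of pieces.
as_lists = authors.str.split(" ")
# expand=True: the pieces are spread across columns 0, 1, ...
as_columns = authors.str.split(" ", expand=True)

print(as_lists.tolist())
# → [['Ana', 'Lopez'], ['Juan', 'Perez']]
print(as_columns.shape)
# → (2, 2)
```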

In [None]:
...

And we can assign the new columns compactly using the form:

`df[-list of new columns-] = -dataframe with the values of the new columns-`
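A minimal sketch of this assignment on a toy `DataFrame` (invented names). It assumes every value splits into exactly two pieces, so the number of resulting columns matches the number of new column names.

```python
import pandas as pd

# Toy DataFrame with an `author` column (invented values).
toy = pd.DataFrame({"author": ["Ana Lopez", "Juan Perez"]})

# The right-hand side has two columns, so it can be assigned
# to a list of two new column names in a single statement.
toy[["author_first_name", "author_last_name"]] = toy["author"].str.split(
    " ", expand=True
)

print(toy.columns.tolist())
# → ['author', 'author_first_name', 'author_last_name']
```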

In [None]:
...

df.head(3)

Success!

---
---

## Challenge 2: String Manipulation

### 1. Objectives:
    - Practice manipulating `strings` using methods such as `split`, `title`, `strip`, etc.
 
---
    
### 2. Walkthrough:

#### a) Cleaning text

We'll work with the version of the dataset that you saved in the previous challenge. The steps to take in this challenge are:

1. Replace the hyphens in the `strings` of the `orbit_class_description` column with spaces.
2. Remove the blank spaces at the beginning and end of the `strings` in the same column.
3. There is a column called `id_name` that contains the 'id' and the name of each object separated by a hyphen. Split this data into two columns called `id` and `name`.
4. Make the `strings` in the `orbiting_body` column start with a capital letter.
5. Assign the resulting `DataFrame` to the variable `df_reto_2`.
6. Save your result to a .csv file.

In [1]:
# import
import pandas as pd

In [2]:
df_reto_2 = pd.read_csv("near_earth_objects-jan_feb_1995-reto_1.csv", index_col=0)
df_reto_2.head(3)

Unnamed: 0,id_name,is_potentially_hazardous_asteroid,estimated_diameter.meters.estimated_diameter_min,estimated_diameter.meters.estimated_diameter_max,close_approach_date,epoch_date_close_approach,orbiting_body,relative_velocity.kilometers_per_second,relative_velocity.kilometers_per_hour,orbit_class_description
0,2154652-154652 (2004 EP20),False,483.676488,1081.533507,1995-01-07,1995-01-07 08:33:00,earth,16.142864,58114.308667,Near-Earth-asteroid-orbits-similar-to-that-o...
1,3153509-(2003 HM),True,96.506147,215.794305,1995-01-07,1995-01-07 15:09:00,earth,12.351044,44463.757734,Near-Earth-asteroid-orbits-which-cross-the-E...
2,3837644-(2019 AY3),False,46.190746,103.285648,1995-01-07,1995-01-07 21:25:00,earth,22.478615,80923.015021,Near-Earth-asteroid-orbits-similar-to-that-o...


In [3]:
df_reto_2.shape

(301, 10)

In [4]:
df_reto_2.dtypes

id_name                                              object
is_potentially_hazardous_asteroid                      bool
estimated_diameter.meters.estimated_diameter_min    float64
estimated_diameter.meters.estimated_diameter_max    float64
close_approach_date                                  object
epoch_date_close_approach                            object
orbiting_body                                        object
relative_velocity.kilometers_per_second             float64
relative_velocity.kilometers_per_hour               float64
orbit_class_description                              object
dtype: object

In [9]:
df_reto_2["orbit_class_description"] = \
    df_reto_2["orbit_class_description"].str.replace("-", " ").str.strip()

In [10]:
df_reto_2["orbit_class_description"]

0      Near Earth asteroid orbits similar to that of ...
1      Near Earth asteroid orbits which cross the Ear...
2      Near Earth asteroid orbits similar to that of ...
3      Near Earth asteroid orbits similar to that of ...
4      An asteroid orbit contained entirely within th...
                             ...                        
296    Near Earth asteroid orbits similar to that of ...
297    Near Earth asteroid orbits similar to that of ...
298    Near Earth asteroid orbits which cross the Ear...
299    An asteroid orbit contained entirely within th...
300    Near Earth asteroid orbits similar to that of ...
Name: orbit_class_description, Length: 301, dtype: object

In [15]:
df_reto_2[["id", "name"]] = df_reto_2["id_name"].str.split("-", expand=True)
df_reto_2.head()

Unnamed: 0,id_name,is_potentially_hazardous_asteroid,estimated_diameter.meters.estimated_diameter_min,estimated_diameter.meters.estimated_diameter_max,close_approach_date,epoch_date_close_approach,orbiting_body,relative_velocity.kilometers_per_second,relative_velocity.kilometers_per_hour,orbit_class_description,id,name
0,2154652-154652 (2004 EP20),False,483.676488,1081.533507,1995-01-07,1995-01-07 08:33:00,earth,16.142864,58114.308667,Near Earth asteroid orbits similar to that of ...,2154652,154652 (2004 EP20)
1,3153509-(2003 HM),True,96.506147,215.794305,1995-01-07,1995-01-07 15:09:00,earth,12.351044,44463.757734,Near Earth asteroid orbits which cross the Ear...,3153509,(2003 HM)
2,3837644-(2019 AY3),False,46.190746,103.285648,1995-01-07,1995-01-07 21:25:00,earth,22.478615,80923.015021,Near Earth asteroid orbits similar to that of ...,3837644,(2019 AY3)
3,3843493-(2019 PY),False,22.108281,49.435619,1995-01-07,1995-01-07 02:45:00,earth,4.998691,17995.288355,Near Earth asteroid orbits similar to that of ...,3843493,(2019 PY)
4,3765015-(2016 WR48),False,160.160338,358.129403,1995-01-08,1995-01-08 12:46:00,earth,7.465089,26874.321682,An asteroid orbit contained entirely within th...,3765015,(2016 WR48)
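The split on `"-"` relies on each `id_name` containing exactly one hyphen. If a name itself contained a hyphen, `expand=True` would yield more than two columns and the two-name assignment would no longer match. A minimal sketch of a more defensive variant, using the `n` parameter of `Series.str.split` to split only at the first hyphen; the second value in the toy data is hypothetical and not taken from the dataset:

```python
import pandas as pd

# First value mirrors the format seen in the dataset; the second is a
# hypothetical name that contains an extra hyphen.
id_names = pd.Series(["2154652-154652 (2004 EP20)", "1234567-(1999 AB-1)"])

# n=1 limits the operation to a single split, so at most two pieces result.
parts = id_names.str.split("-", n=1, expand=True)
parts.columns = ["id", "name"]

print(parts["id"].tolist())
# → ['2154652', '1234567']
print(parts["name"].tolist())
# → ['154652 (2004 EP20)', '(1999 AB-1)']
```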


In [19]:
df_reto_2["id_name"].str.split?

Signature: str.split(self, /, sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits (starting from the left).
    -1 (the default value) means no limit.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      method_descriptor

In [None]:
# split on hyphens
...

df_reto_2.head(3)

In [16]:
# convert to title case
df_reto_2["orbiting_body"] = df_reto_2["orbiting_body"].str.title()
df_reto_2["orbiting_body"]

0      Earth
1      Earth
2      Earth
3      Earth
4      Earth
       ...  
296    Earth
297    Earth
298    Earth
299    Earth
300    Earth
Name: orbiting_body, Length: 301, dtype: object

In [22]:
# save to the file near_earth_objects-jan_feb_1995-reto_2.csv
df_reto_2.to_csv("near_earth_objects-jan_feb_1995-reto_2.csv")
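As a sketch of what happens on the way out and back in, the snippet below writes a toy `DataFrame` (invented subset of columns) to an in-memory buffer instead of a file. `to_csv` writes the index as the first, unnamed column, which is why the file is later read with `index_col=0`. The buffer and the toy values are assumptions made for illustration.

```python
import io

import pandas as pd

# Toy DataFrame whose `id` values are strings, as produced by str.split.
toy = pd.DataFrame({"id": ["2154652", "3153509"], "orbiting_body": ["Earth", "Earth"]})

# Write to an in-memory text buffer; the index goes into the first column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns that first column back into the index on reading.
restored = pd.read_csv(buffer, index_col=0)

print(restored.columns.tolist())
# → ['id', 'orbiting_body']
print(pd.api.types.is_integer_dtype(restored["id"]))
# → True
```

The `id` strings are parsed as integers when read back, since CSV files do not store column types. Passing `dtype={"id": str}` to `read_csv` is one way to keep them as text.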

Compare with your trusted Data Engineer and, if your results match, submit your answer!

In [20]:
df_reto_2.drop(columns=['id_name'], inplace=True)

In [21]:
df_reto_2.head()

Unnamed: 0,is_potentially_hazardous_asteroid,estimated_diameter.meters.estimated_diameter_min,estimated_diameter.meters.estimated_diameter_max,close_approach_date,epoch_date_close_approach,orbiting_body,relative_velocity.kilometers_per_second,relative_velocity.kilometers_per_hour,orbit_class_description,id,name
0,False,483.676488,1081.533507,1995-01-07,1995-01-07 08:33:00,Earth,16.142864,58114.308667,Near Earth asteroid orbits similar to that of ...,2154652,154652 (2004 EP20)
1,True,96.506147,215.794305,1995-01-07,1995-01-07 15:09:00,Earth,12.351044,44463.757734,Near Earth asteroid orbits which cross the Ear...,3153509,(2003 HM)
2,False,46.190746,103.285648,1995-01-07,1995-01-07 21:25:00,Earth,22.478615,80923.015021,Near Earth asteroid orbits similar to that of ...,3837644,(2019 AY3)
3,False,22.108281,49.435619,1995-01-07,1995-01-07 02:45:00,Earth,4.998691,17995.288355,Near Earth asteroid orbits similar to that of ...,3843493,(2019 PY)
4,False,160.160338,358.129403,1995-01-08,1995-01-08 12:46:00,Earth,7.465089,26874.321682,An asteroid orbit contained entirely within th...,3765015,(2016 WR48)
