## Example 2: String Manipulation

### 1. Objectives:
- Learn to use `replace`, `strip`, `title`, `upper`, `lower`, and `split` to transform `string` data
 
---
    
### 2. Walkthrough:

In [1]:
import pandas as pd

In [8]:
df = pd.read_csv('../../Datasets/new_york_times_bestsellers-dirty.csv', index_col=0)
df["description"]

0       Descr: Aliens have taken control of the minds ...
1       Descr: A woman's happy marriage is shaken when...
2       Descr: A Massachusetts state investigator and ...
3       Descr: An aging porn queens aims to cap her ca...
5       Descr: The Minneapolis detective Lucas Davenpo...
                              ...                        
3027    Descr: The New York lawyer Stone Barrington di...
3028    Descr: Jake Fisher discovers that neither the ...
3029    Descr: Six friends meet in the 1970s at a summ...
3030    Descr: Bernie Gunther, the Berlin cop, is sent...
3031    Descr: A New Hampshire baker finds herself in ...
Name: description, Length: 2266, dtype: object

Let's start with the `description` column, where every text begins with 'Descr:'. To remove that prefix, we can use the `replace` method of the `Series`' `str` accessor:

In [11]:
# remove the 'Descr:' prefix and assign the result back to the column
df["description"] = df["description"].str.replace("Descr:", "")
df.head(3)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
0,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,Aliens have taken control of the minds and bo...,"Little, Brown",THE HOST,5b4aa4ead3089013507db18c,2008-05-24 00:00:00,1212883200000,2,1,3,25.99
1,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,A woman's happy marriage is shaken when she e...,St. Martin's,LOVE THE ONE YOU'RE WITH,5b4aa4ead3089013507db18d,2008-05-24 00:00:00,1212883200000,3,2,2,24.95
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his te...,Putnam,THE FRONT,5b4aa4ead3089013507db18e,2008-05-24 00:00:00,1212883200000,4,0,1,22.95
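As a self-contained illustration of the same step, here is a minimal sketch on a toy `Series` (the texts are made up, not taken from the dataset). It passes `regex=False`, which is an addition to the cell above: it asks pandas to treat `"Descr:"` as literal text rather than as a regular expression.

```python
import pandas as pd

# Toy data standing in for the `description` column (made up for illustration).
descriptions = pd.Series([
    "Descr: A first description. ",
    "Descr: A second description.   ",
])

# regex=False treats "Descr:" as literal text, not as a regular expression.
cleaned = descriptions.str.replace("Descr:", "", regex=False)
print(cleaned.tolist())
```

The prefix is gone, but the whitespace around each text is left untouched.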


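`str.replace` returns a new `Series`, so the change only persists because we assigned the result back to the column. We can confirm it by checking a row: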

In [12]:
# look up one row to check the result
df.loc[0, "description"]

' Aliens have taken control of the minds and bodies of most humans, but one woman won’t surrender.     '

As you can see, our strings also have blank spaces at the beginning and end. Let's remove them with `strip`:

In [14]:
df['description'] = df['description'].str.strip()
df.loc[0, "description"]

'Aliens have taken control of the minds and bodies of most humans, but one woman won’t surrender.'

Perfect.
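For comparison, a small sketch on made-up strings shows `strip` next to its one-sided variants `lstrip` and `rstrip`. The variable names are illustrative only.

```python
import pandas as pd

# Made-up strings padded with spaces, a tab, and a newline.
padded = pd.Series(["  hello  ", "\tworld\n"])

both = padded.str.strip()    # removes whitespace on both ends
left = padded.str.lstrip()   # removes leading whitespace only
right = padded.str.rstrip()  # removes trailing whitespace only

print(both.tolist())
print(left.tolist())
print(right.tolist())
```

With no arguments, all three remove any whitespace character, not just spaces.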

Now let's look at the `title` column, whose values are all uppercase. That is hard to read, so we can use a few methods to change the casing:

In [16]:
# titles in lowercase
df["title"].str.lower()

0                       the host
1       love the one you're with
2                      the front
3                          snuff
5                   phantom prey
                  ...           
3027     unintended consequences
3028                   six years
3029            the interestings
3030        a man without breath
3031             the storyteller
Name: title, Length: 2266, dtype: object

In [17]:
# titles in title case
df["title"].str.title()

0                       The Host
1       Love The One You'Re With
2                      The Front
3                          Snuff
5                   Phantom Prey
                  ...           
3027     Unintended Consequences
3028                   Six Years
3029            The Interestings
3030        A Man Without Breath
3031             The Storyteller
Name: title, Length: 2266, dtype: object

The latter reads better, so let's keep it (note that `title` also capitalizes the letter after an apostrophe, as in `You'Re`):

In [18]:
# overwrite the column
df["title"] = df["title"].str.title()
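If the capital letter after an apostrophe is a problem, one possible alternative is the standard library's `string.capwords`, which only capitalizes the first letter of each whitespace-separated word. This is a sketch on toy titles and is not part of the original notebook.

```python
import string

import pandas as pd

# Two titles in the same all-caps style as the dataset.
titles = pd.Series(["LOVE THE ONE YOU'RE WITH", "THE HOST"])

# str.title() capitalizes after every non-letter, including apostrophes.
with_title = titles.str.title()

# string.capwords splits on whitespace and capitalizes each word once.
with_capwords = titles.map(string.capwords)

print(with_title.tolist())
print(with_capwords.tolist())
```

`map` applies the plain Python function element by element, so on large columns it will typically be slower than the vectorized `str.title`.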

Now suppose we want to split the `author` column into two columns, `author_first_name` and `author_last_name`. We can do that with the `split` method:

In [22]:
# split author into first and last name
df["author"].str.split(" ")  # splits on a single space; split() with no argument splits on any whitespace (spaces, tabs, newlines)

0         [Stephenie, Meyer]
1            [Emily, Giffin]
2       [Patricia, Cornwell]
3         [Chuck, Palahniuk]
5           [John, Sandford]
                ...         
3027         [Stuart, Woods]
3028         [Harlan, Coben]
3029         [Meg, Wolitzer]
3030          [Philip, Kerr]
3031         [Jodi, Picoult]
Name: author, Length: 2266, dtype: object
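The difference between splitting on a literal space and splitting on any whitespace shows up once the data is irregular. Here is a sketch on made-up names with a double space and a tab:

```python
import pandas as pd

# Made-up names: one with a double space, one separated by a tab.
names = pd.Series(["Stephenie  Meyer", "Emily\tGiffin"])

# A literal " " separator yields an empty string between consecutive spaces
# and does not split on the tab at all.
on_space = [list(parts) for parts in names.str.split(" ")]

# With no separator, runs of any whitespace count as a single delimiter.
on_whitespace = [list(parts) for parts in names.str.split()]

print(on_space)
print(on_whitespace)
```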

We can turn the result into two columns like this:

In [23]:
# expand the split into separate columns
df["author"].str.split(" ", expand=True)

Unnamed: 0,0,1
0,Stephenie,Meyer
1,Emily,Giffin
2,Patricia,Cornwell
3,Chuck,Palahniuk
5,John,Sandford
...,...,...
3027,Stuart,Woods
3028,Harlan,Coben
3029,Meg,Wolitzer
3030,Philip,Kerr


In [24]:
# assign the new columns
df[["author_first_name", "author_last_name"]] = \
    df["author"].str.split(" ", expand=True)

In [25]:
# check the result with head()
df.head(3)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble,author_first_name,author_last_name
0,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,Aliens have taken control of the minds and bod...,"Little, Brown",The Host,5b4aa4ead3089013507db18c,2008-05-24 00:00:00,1212883200000,2,1,3,25.99,Stephenie,Meyer
1,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,A woman's happy marriage is shaken when she en...,St. Martin's,Love The One You'Re With,5b4aa4ead3089013507db18d,2008-05-24 00:00:00,1212883200000,3,2,2,24.95,Emily,Giffin
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his tea...,Putnam,The Front,5b4aa4ead3089013507db18e,2008-05-24 00:00:00,1212883200000,4,0,1,22.95,Patricia,Cornwell


Success!
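The assignment above relies on the split producing exactly two columns. For a name with more than one space, `expand=True` would return extra columns. One way to guard against that, sketched here on a toy frame whose second name is hypothetical, is the `n` parameter, which caps the number of splits:

```python
import pandas as pd

# Toy frame; "Ana Maria Lopez" is a hypothetical three-word name.
authors = pd.DataFrame({"author": ["Stephenie Meyer", "Ana Maria Lopez"]})

# n=1 splits only at the first space, so there are at most two columns.
authors[["author_first_name", "author_last_name"]] = (
    authors["author"].str.split(" ", n=1, expand=True)
)

print(authors)
```

Everything after the first space lands in `author_last_name`. Whether that is the right convention depends on the data, and `str.rsplit` with `n=1` would instead split at the last space.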