# Python for Data Analysis - UNAV 2020-2021
---

# Notebook 6: Pandas, indexes and methods

## Contents  <a name="indice"></a>

- [Indexes](#pandas_indices)
  - [Indexing with *.loc[]* and *.iloc[]*](#indexar_loc_iloc)
  - [Renaming index and column labels](#renombrar_etiquetas)
- [Advanced methods](#pandas_metodos_avanzados)
  - [Methods *.apply()* and *.map()*](#metodos_apply_map)
  - [Method *.copy()*](#metodo_copy)
  - [Method *.groupby()*](#metodo_groupby)
  - [Method *.agg()*](#metodo_agg)
  
- [Multi-indexes](#pandas_multiindices)
  - [Method *.set_index()*](#pandas_multiindices_set_index)
  - [Method *.get_level_values()*](#pandas_multiindices_get_level_values)
  - [Method *.set_names()*](#pandas_multiindices_set_names)
  - [Indexing with *.loc[]*](#pandas_multiindices_loc)
  
- [Exercises](#ejercicios)

## Indexes<a name="pandas_indices"></a> 
[Back to contents](#indice)

In the first section of this session we focus on different ways of indexing _DataFrames_. To illustrate them we use the James Bond films dataset, which contains the following information:

- Film: title of the film.
- Year: release year.
- Actor: lead actor.
- Director: director of the film.
- Box Office: box-office takings in millions of dollars.
- Budget: budget in millions of dollars.
- Bond Actor Salary: the actor's salary in millions of dollars.

In [1]:
import pandas as pd
import numpy as np

df_bond = pd.read_csv("datos/jamesbond.csv")
df_bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


### Methods *.set_index()* and *.reset_index()*

By default, a _DataFrame_ gets a numeric index that runs from $0$ to $n-1$, where $n$ is the number of rows. We can replace it with an index of our own using the *.set_index()* method. The parameter *inplace=True* applies the change to the same _DataFrame_, which is equivalent to reassigning the result of *.set_index()* to the variable *df_bond*.

In [2]:
df_bond.set_index("Film", inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


We can see that our index is now the Film column. To return to the previous state of our _DataFrame_ we can reset the index with the *.reset_index()* method.

In [3]:
df_bond.reset_index(drop=False).head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


The *drop* parameter tells the method whether to discard the index being reset (*drop=True*) or to put it back into the _DataFrame_ as a column (*drop=False*). The default is *drop=False*.

In [4]:
df_bond.sort_values('Budget', ascending=False).reset_index(drop=True).head()

Unnamed: 0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,2015,Daniel Craig,Sam Mendes,726.7,206.3,
1,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
2,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
3,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
4,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


In [5]:
df_bond.reset_index(drop=False, inplace=True)
df_bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


We can switch from one index to another:

In [6]:
df_bond.set_index("Film", inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


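It is very important to reset the index before calling *.set_index()* with a different column; otherwise the column that currently serves as the index (here "Film") is dropped from the _DataFrame_.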

In [7]:
df_bond.reset_index(inplace=True)
df_bond.set_index("Year", inplace=True)
df_bond.head()

Unnamed: 0_level_0,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1962,Dr. No,Sean Connery,Terence Young,448.8,7.0,0.6
1963,From Russia with Love,Sean Connery,Terence Young,543.8,12.6,1.6
1964,Goldfinger,Sean Connery,Guy Hamilton,820.4,18.6,3.2
1965,Thunderball,Sean Connery,Terence Young,848.1,41.9,4.7
1967,Casino Royale,David Niven,Ken Hughes,315.0,85.0,
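
The behaviour described in this section can be checked on a small table. The following is a minimal sketch that assumes a made-up three-row _DataFrame_ (not the Bond dataset); the variable names are hypothetical. It shows the effect of *drop* and what is lost when *.set_index()* is called twice without resetting in between.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
films = pd.DataFrame({
    "Film": ["A", "B", "C"],
    "Year": [1962, 1963, 1964],
    "Budget": [7.0, 12.6, 18.6],
})

# Reassigning the result has the same effect as inplace=True.
by_film = films.set_index("Film")

# drop=False (the default) moves the index back into a column...
restored = by_film.reset_index()
# ...while drop=True throws the index values away.
dropped = by_film.reset_index(drop=True)

# Without resetting first, the old index ("Film") is lost.
lost_film = by_film.set_index("Year")
kept_film = by_film.reset_index().set_index("Year")

print(list(restored.columns))   # Film is a column again
print(list(dropped.columns))    # Film is gone
print(list(lost_film.columns))  # Film is gone
print(list(kept_film.columns))  # Film is kept
```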


### Indexing with *.loc[]* and *.iloc[]*<a name="indexar_loc_iloc"></a> 
[Back to contents](#indice)

#### Indexing by label with *.loc[]*

We load the dataset again, telling pandas to use the "Film" column as the index.

In [8]:
df_bond = pd.read_csv("datos/jamesbond.csv", index_col="Film")

df_bond.sort_index(inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


We can check whether an element is in the index in the same way as we did with lists:

In [9]:
'Goldfinger' in df_bond.index

True

We can access the rows of the _DataFrame_ through the index using the *.loc[]* indexer. For example, this lets us extract the information for a single film. The result is a _Series_ object whose index is made up of the column names of the _DataFrame_.

In [10]:
s_bond_goldfinger = df_bond.loc["Goldfinger"]

print(type(s_bond_goldfinger))
s_bond_goldfinger

<class 'pandas.core.series.Series'>


Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: Goldfinger, dtype: object

We can also access those elements as we did in the _Series_ section:

In [11]:
df_bond.loc["Goldfinger"]["Year"], df_bond.loc["Goldfinger"]["Budget"]

(1964, 18.6)

In [12]:
df_bond.loc["Goldfinger"][["Year", "Budget"]]

Year      1964
Budget    18.6
Name: Goldfinger, dtype: object

We can extract two or more rows, or even a range of rows. In that case, when the result has more than one row, the returned object is a _DataFrame_ with the requested information.

In [13]:
df_bond_films = df_bond.loc[["Octopussy", "Moonraker"]]

print(type(df_bond_films))
df_bond_films

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,


In [14]:
df_bond.loc[:"Dr. No"]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6


**Careful! The slice above works as expected only because the _DataFrame_ is sorted alphabetically by its index; if it were not, the behaviour could be unexpected.**
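
To make the warning concrete, here is a minimal sketch on a made-up three-row _DataFrame_ whose index is unique but unsorted (the data and names are hypothetical). A label slice follows the physical order of the rows and includes the end label, so the unsorted version returns more rows than one might expect.

```python
import pandas as pd

# Hypothetical toy data with a unique but unsorted index.
unsorted = pd.DataFrame({"value": [1, 2, 3]}, index=["b", "a", "c"])
ordered = unsorted.sort_index()

# Label slices include the end label and follow the physical row order.
print(unsorted.loc[:"a"])  # rows "b" and "a": everything up to where "a" sits
print(ordered.loc[:"a"])   # only row "a"

# Checking whether the index is sorted before slicing avoids surprises.
print(unsorted.index.is_monotonic_increasing)
print(ordered.index.is_monotonic_increasing)
```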

#### Indexing by position with *.iloc[]*

In [15]:
df_bond = pd.read_csv("datos/jamesbond.csv")
df_bond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


Just as with access through the index, we can access a row by its position using *.iloc[]*. Again, accessing a single row returns a _Series_ object.

In [16]:
print(type(df_bond.iloc[0]))
df_bond.iloc[0]

<class 'pandas.core.series.Series'>


Film                        Dr. No
Year                          1962
Actor                 Sean Connery
Director             Terence Young
Box Office                   448.8
Budget                           7
Bond Actor Salary              0.6
Name: 0, dtype: object

In [17]:
df_bond.iloc[5]

Film                 You Only Live Twice
Year                                1967
Actor                       Sean Connery
Director                   Lewis Gilbert
Box Office                         514.2
Budget                              59.9
Bond Actor Salary                    4.4
Name: 5, dtype: object

Indexing works as it does for a list, so we can use _slicing_ to obtain ranges of rows.

In [18]:
df_bond.iloc[10:15]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
11,Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
12,For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
13,Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
14,Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8


In [19]:
df_bond.iloc[-5:]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


When should we use _.loc[]_ and when _.iloc[]_? The basic rules are:
* _.loc[]_ for indexing by index label.
* _.iloc[]_ for indexing by row position.
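
The two rules are easiest to tell apart when the index labels are integers that do not coincide with the row positions. The sketch below assumes a made-up _DataFrame_ with labels 10, 20 and 30 (hypothetical data) and also shows that label slices include the end point while position slices do not.

```python
import pandas as pd

# Hypothetical toy data: integer labels that do not match the positions.
df = pd.DataFrame({"Film": ["A", "B", "C"]}, index=[10, 20, 30])

by_label = df.loc[10, "Film"]   # label 10 -> first row
by_position = df.iloc[0, 0]     # position 0 -> first row as well

# .loc includes the end label, .iloc excludes the end position.
label_slice = df.loc[10:20]     # rows labelled 10 and 20
position_slice = df.iloc[0:1]   # only the first row

print(by_label, by_position)
print(label_slice)
print(position_slice)
```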

Both indexers can also be used to select a row and a column at the same time. We can think of it as accessing a cell in Excel (or a range of cells).

In [20]:
df_bond.set_index('Film', inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


To get the year and the actor of a film:

In [21]:
df_bond.loc['Dr. No', ['Year', 'Actor']]

Year             1962
Actor    Sean Connery
Name: Dr. No, dtype: object

In [22]:
df_bond.iloc[0, :2]

Year             1962
Actor    Sean Connery
Name: Dr. No, dtype: object

To get the films at positions 10 and 11 and the columns (Director, Box Office, Budget):

In [23]:
df_bond.iloc[10:12, 2:5]

Unnamed: 0_level_0,Director,Box Office,Budget
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
The Spy Who Loved Me,Lewis Gilbert,533.0,45.1
Moonraker,Lewis Gilbert,535.0,91.5


We can also modify elements of our _DataFrame_ with the assignment operator.

In [24]:
df_bond.loc['A View to a Kill', 'Director'] = 'Juan Fernandez'
df_bond[df_bond.index == 'A View to a Kill']

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,Juan Fernandez,275.2,54.5,9.1


In [25]:
df_bond.iloc[1, 4] = 350
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,350.0,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


Several changes can be made at once:

In [26]:
df_bond.loc["Dr. No", ["Box Office", "Budget", "Bond Actor Salary"]] = [448800000, 7000000, 600000]
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448800000.0,7000000.0,600000.0
From Russia with Love,1963,Sean Connery,Terence Young,543.8,350.0,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


We can change several cells at once using boolean indexing. For example, let us add Sean Connery's knighthood to his name:

In [27]:
df_bond = pd.read_csv("datos/jamesbond.csv")


mask = df_bond["Actor"] == "Sean Connery"
df_bond.loc[mask, "Actor"] = "Sir Sean Connery"

df_bond.loc[mask]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sir Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sir Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sir Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
5,You Only Live Twice,1967,Sir Sean Connery,Lewis Gilbert,514.2,59.9,4.4
7,Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
13,Never Say Never Again,1983,Sir Sean Connery,Irvin Kershner,380.0,86.0,


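**Careful! This modifies the _DataFrame_ itself; it does not return a copy. Assigning through _.loc[]_ writes directly into the original _DataFrame_, whereas selecting rows first (for example by slicing or filtering) and then assigning to the result may operate on a copy and leave the original unchanged.**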

In [28]:
df_bond[df_bond['Actor'] == 'Sir Sean Connery']

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sir Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sir Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sir Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sir Sean Connery,Terence Young,848.1,41.9,4.7
5,You Only Live Twice,1967,Sir Sean Connery,Lewis Gilbert,514.2,59.9,4.4
7,Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
13,Never Say Never Again,1983,Sir Sean Connery,Irvin Kershner,380.0,86.0,
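
When the original should stay untouched, one option is to filter first and take an explicit copy. The following is a minimal sketch on made-up data (the rows and the "Sir Roger Moore" value are purely illustrative) that contrasts the two situations.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"Actor": ["Sean Connery", "Roger Moore", "Sean Connery"],
                   "Year": [1962, 1973, 1964]})

# Filter and copy: changes to `connery` do not reach `df`.
connery = df[df["Actor"] == "Sean Connery"].copy()
connery["Actor"] = "Sir Sean Connery"

# Assignment through .loc on `df` itself does change `df`.
mask = df["Year"] > 1970
df.loc[mask, "Actor"] = "Sir Roger Moore"

print(df)
print(connery)
```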


### Renaming index and column labels<a name="renombrar_etiquetas"></a> 
[Back to contents](#indice)

In [29]:
df_bond = pd.read_csv("datos/jamesbond.csv", index_col="Film")

df_bond.sort_index(inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


We can easily rename columns using the _.rename()_ method:

In [30]:
df_bond.rename(columns={"Year" : "Release Date", "Box Office" : "Revenue"}, inplace=True)
df_bond.head()

Unnamed: 0_level_0,Release Date,Actor,Director,Revenue,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


In [31]:
df_bond.rename(index={"Dr. No" : "Doctor No", 
                      "GoldenEye" : "Golden Eye",
                      "The World Is Not Enough" : "Best Bond Movie Ever"},
               inplace=True)

df_bond

Unnamed: 0_level_0,Release Date,Actor,Director,Revenue,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Doctor No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Golden Eye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


The *.columns* attribute returns the column labels of the _DataFrame_. The labels can also be replaced by assigning a new list, which is matched to the existing columns by position:

In [32]:
df_bond.columns

Index(['Release Date', 'Actor', 'Director', 'Revenue', 'Budget',
       'Bond Actor Salary'],
      dtype='object')

In [33]:
# The list is assigned by position, so it must follow the current column order.
df_bond.columns = ["Year of Release", "Actor", "Director", "Gross Revenue", "Cost", "Salary"]
df_bond.head()

Unnamed: 0_level_0,Year of Release,Actor,Director,Gross Revenue,Cost,Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
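
Because assigning to *.columns* is positional, a list in the wrong order silently mislabels the data. As a hedged comparison, the sketch below uses a made-up one-row _DataFrame_ (hypothetical data) to show that *.rename()* with a dictionary does not depend on column order, while direct assignment does.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"Year": [1962], "Actor": ["Sean Connery"], "Budget": [7.0]})

# rename() maps old names to new ones, so the order of the keys does not matter.
renamed = df.rename(columns={"Budget": "Cost", "Year": "Release Year"})

# Assigning to .columns is positional: the list must follow the current
# column order and contain exactly one name per column.
relabelled = df.copy()
relabelled.columns = ["Release Year", "Actor", "Cost"]

print(list(renamed.columns))
print(list(relabelled.columns))
```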


## Advanced methods<a name="pandas_metodos_avanzados"></a> 
[Back to contents](#indice)

### Methods *.apply()* and *.map()*<a name="metodos_apply_map"></a> 
[Back to contents](#indice)

In [34]:
df_bond = pd.read_csv("datos/jamesbond.csv", index_col="Film")

df_bond.sort_index(inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


As we saw with _Series_ objects, we can use the _.map()_ method to map a function over the values. With a _DataFrame_ we can take advantage of this to create a new column. Like the built-in *map()* function, it accepts lambda functions or named functions.

In [35]:
df_bond['above_5M'] = df_bond['Bond Actor Salary'].map(lambda x: True if x > 5 else False)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,above_5M
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1,True
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3,False
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,False
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,True
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9,True


For _DataFrames_ there is a method called _.apply()_ that lets us work at row level when it is called with *axis="columns"*. The idea is similar to _.map()_, except that the function now has access to every field of the row.

In [36]:
def film_review(row):
    
    actor = row['Actor']
    budget = row['Budget']
    
    if actor == "Pierce Brosnan":
        return "Cool!"
    elif actor == "Roger Moore" and budget > 40:
        return "Okish"
    else:
        return "No idea"

In [37]:
df_bond["film_review"] = df_bond.apply(film_review, axis="columns")

In [38]:
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,above_5M,film_review
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1,True,Okish
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3,False,No idea
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,False,No idea
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,True,No idea
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9,True,Cool!


With this new method we have gained a lot of flexibility. The general rule would be:
- _.map()_ to apply a function to a column.
- _.apply()_ to apply a function to a row.


*.apply()* can also be applied to columns, although *.map()* is generally faster.
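
The sketch below puts the two rules side by side on made-up data (the values and the `Salary` column name are hypothetical). It also shows a vectorized comparison, which gives the same result as the element-wise *.map()* for a simple threshold test, and how a missing value behaves in each case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing salary.
df = pd.DataFrame({"Actor": ["Roger Moore", "David Niven", "Pierce Brosnan"],
                   "Salary": [9.1, np.nan, 4.0],
                   "Budget": [54.5, 85.0, 76.9]})

# Element-wise with .map(): NaN > 5 evaluates to False.
mapped = df["Salary"].map(lambda x: x > 5)

# The same test as a vectorized comparison, without a Python-level loop.
vectorized = df["Salary"] > 5

# Row-wise with .apply(): each call receives one row as a Series.
ratio = df.apply(lambda row: row["Salary"] / row["Budget"], axis="columns")

print(mapped.tolist())
print(vectorized.tolist())
print(ratio)
```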

### Method *.copy()*<a name="metodo_copy"></a> 
[Back to contents](#indice)

In [39]:
df_bond = pd.read_csv("datos/jamesbond.csv", index_col="Film")

df_bond.sort_index(inplace=True)
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


As we have seen on many occasions, to use memory efficiently Python does not copy the values of a _DataFrame_ when we assign it to another variable; it creates a new reference to the same object. When we need an independent copy of the _DataFrame_, we use the _.copy()_ method.

In [40]:
df_bond_copied = df_bond.copy()

Now, if we modify something, only one of the _DataFrames_ is affected.

In [41]:
df_bond.loc[df_bond['Actor'] == 'Roger Moore', 'Actor'] = 'Jeremy Irons'
df_bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Jeremy Irons,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


In [42]:
df_bond_copied.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
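
The difference between a second reference and a real copy can be verified directly. This is a minimal sketch on a made-up two-row _DataFrame_; the names `alias` and `independent` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
original = pd.DataFrame({"Actor": ["Roger Moore", "Sean Connery"]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

original.loc[0, "Actor"] = "Jeremy Irons"

print(alias is original)            # both names point to one DataFrame
print(alias.loc[0, "Actor"])        # sees the change
print(independent.loc[0, "Actor"])  # keeps the old value
```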


### Method *.groupby()*<a name="metodo_groupby"></a> 
[Back to contents](#indice)

Much of the power of Pandas lies in how easy it makes splitting the data, applying functions and combining the results again. This is known as the _split-apply-combine_ strategy. We will study how to apply it with a _DataFrame_ of American companies. The dataset contains:

- Rank: position of the company in the Fortune 1000 list.
- Company: name of the company.
- Sector: sector of the company.
- Industry: industry of the company.
- Location: city where the headquarters are located.
- Revenue: revenue in millions of dollars.
- Profits: profits in millions of dollars.
- Employees: number of employees.

In [43]:
df_fortune = pd.read_csv("datos/fortune1000.csv", index_col="Rank")

df_fortune.head()

Unnamed: 0_level_0,Company,Sector,Industry,Location,Revenue,Profits,Employees
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [44]:
df_fortune.shape

(1000, 7)

We group by sector and obtain a *groupby* object:

In [45]:
sectors = df_fortune.groupby("Sector")

type(sectors)

pandas.core.groupby.generic.DataFrameGroupBy

Now calculation functions can be applied to each group very easily.

In [46]:
sectors.sum()

Unnamed: 0_level_0,Revenue,Profits,Employees
Sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aerospace & Defense,357940,28742,968057
Apparel,95968,8236,346397
Business Services,272195,28227,1361050
Chemicals,243897,22628,463651
Energy,1517809,-73447,1188927
Engineering & Construction,153983,5304,406708
Financials,2217159,260209,3359948
Food and Drug Stores,483769,16759,1395398
"Food, Beverages & Tobacco",555967,51417,1211632
Health Care,1614707,106114,2678289


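The _.sum()_ function is applied to each group for every numeric column, which is why we get the three columns above. Depending on the pandas version, non-numeric columns may not be dropped automatically, in which case passing `numeric_only=True` to _.sum()_ restricts the result to the numeric columns.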

In [47]:
sectors["Revenue"].sum().head()

Sector
Aerospace & Defense     357940
Apparel                  95968
Business Services       272195
Chemicals               243897
Energy                 1517809
Name: Revenue, dtype: int64
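
To see what _split-apply-combine_ means in practice, the sketch below performs the split and apply steps by hand and compares them with the single *.groupby()* call. It assumes a made-up four-row _DataFrame_ (hypothetical companies and figures).

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"Sector": ["Energy", "Retailing", "Energy", "Retailing"],
                   "Company": ["E1", "R1", "E2", "R2"],
                   "Revenue": [100, 50, 30, 20]})

# Split and apply by hand: iterating over a groupby object yields
# (key, sub-DataFrame) pairs.
manual = {}
for sector, group in df.groupby("Sector"):
    manual[sector] = group["Revenue"].sum()

# pandas performs split, apply and combine in one call.
# numeric_only=True restricts the sum to numeric columns explicitly.
totals = df.groupby("Sector").sum(numeric_only=True)

print(manual)
print(totals)
```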

If we want to know which 5 sectors have the highest total revenue, we can simply sort the resulting _Series_ and take the first rows.

In [48]:
sectors['Revenue'].sum().sort_values(ascending=False).head()

Sector
Financials     2217159
Health Care    1614707
Energy         1517809
Retailing      1465076
Technology     1377600
Name: Revenue, dtype: int64

The _.head()_ and _.tail()_ methods are useful on grouped data, because they return the first/last $n$ rows of each group. If we sort the _DataFrame_ by revenue before grouping, applying _.head(1)_ after grouping gives, in a simple way, the company with the highest revenue in each sector.

In [49]:
sectors = df_fortune.sort_values('Revenue', ascending=False).groupby("Sector")
sectors.head(1).sort_values("Sector")

Unnamed: 0_level_0,Company,Sector,Industry,Location,Revenue,Profits,Employees
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
24,Boeing,Aerospace & Defense,Aerospace and Defense,"Chicago, IL",96114,5176,161400
91,Nike,Apparel,Apparel,"Beaverton, OR",30601,3273,62600
144,ManpowerGroup,Business Services,Temporary Help,"Milwaukee, WI",19330,419,27000
56,Dow Chemical,Chemicals,Chemicals,"Midland, MI",48778,7685,49495
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
155,Fluor,Engineering & Construction,"Engineering, Construction","Irving, TX",18114,413,38758
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
7,CVS Health,Food and Drug Stores,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
41,Archer Daniels Midland,"Food, Beverages & Tobacco",Food Production,"Chicago, IL",67702,1849,32300
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400
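
The sort-then-head pattern can be checked against an alternative that looks up the row label of each group's maximum with *.idxmax()*. This is only a sketch on made-up data (hypothetical companies and figures); both approaches should select the same rows as long as there are no ties.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"Company": ["E1", "R1", "E2", "R2"],
                   "Sector": ["Energy", "Retailing", "Energy", "Retailing"],
                   "Revenue": [100, 50, 30, 80]})

# Approach from the text: sort, group, keep the first row of each group.
top_sorted = (df.sort_values("Revenue", ascending=False)
                .groupby("Sector")
                .head(1)
                .sort_values("Sector"))

# Alternative: idxmax() gives the row label of the maximum in each group.
top_idxmax = df.loc[df.groupby("Sector")["Revenue"].idxmax()]

print(top_sorted)
print(top_idxmax)
```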


We can group by more than one field and perform the calculations in the same way.

In [50]:
sectors = df_fortune.groupby(['Sector','Industry'])
sectors.median()

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue,Profits,Employees
Sector,Industry,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aerospace & Defense,Aerospace and Defense,6832.0,566.5,24803.0
Apparel,Apparel,3963.0,233.0,13500.0
Business Services,"Advertising, marketing",11374.0,774.5,62050.0
Business Services,Diversified Outsourcing Services,3037.0,202.5,18550.0
Business Services,Education,2591.0,30.0,11770.0
...,...,...,...,...
Transportation,"Trucking, Truck Leasing",3321.0,198.0,18415.0
Wholesalers,Miscellaneous,8982.0,17.0,9200.0
Wholesalers,Wholesalers: Diversified,5305.0,128.0,5839.0
Wholesalers,Wholesalers: Electronics and Office Equipment,18310.0,212.0,13750.0


The result of an aggregation over several keys is a _DataFrame_ whose index is a multi-index built from the grouping keys. To turn those keys back into ordinary columns, it is enough to reset the index.

In [51]:
sectors.median().reset_index()

Unnamed: 0,Sector,Industry,Revenue,Profits,Employees
0,Aerospace & Defense,Aerospace and Defense,6832.0,566.5,24803.0
1,Apparel,Apparel,3963.0,233.0,13500.0
2,Business Services,"Advertising, marketing",11374.0,774.5,62050.0
3,Business Services,Diversified Outsourcing Services,3037.0,202.5,18550.0
4,Business Services,Education,2591.0,30.0,11770.0
...,...,...,...,...,...
74,Transportation,"Trucking, Truck Leasing",3321.0,198.0,18415.0
75,Wholesalers,Miscellaneous,8982.0,17.0,9200.0
76,Wholesalers,Wholesalers: Diversified,5305.0,128.0,5839.0
77,Wholesalers,Wholesalers: Electronics and Office Equipment,18310.0,212.0,13750.0
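
As a hedged alternative to resetting the index afterwards, *.groupby()* accepts `as_index=False`, which keeps the grouping keys as columns from the start. The sketch below compares both routes on a made-up three-row _DataFrame_ (hypothetical data).

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"Sector": ["Energy", "Energy", "Retailing"],
                   "Industry": ["Oil", "Oil", "Apparel"],
                   "Revenue": [10, 30, 8]})

# The grouping keys end up in a MultiIndex...
indexed = df.groupby(["Sector", "Industry"]).median()
# ...and reset_index() moves them back into ordinary columns.
flat = indexed.reset_index()

# as_index=False asks groupby to produce the flat layout directly.
flat_direct = df.groupby(["Sector", "Industry"], as_index=False).median()

print(indexed)
print(flat)
print(flat_direct)
```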


### Method *.agg()*<a name="metodo_agg"></a> 
[Back to contents](#indice)

In [52]:
df_fortune = pd.read_csv("datos/fortune1000.csv", index_col="Rank")

sectors = df_fortune.groupby("Sector")
df_fortune.head()

Unnamed: 0_level_0,Company,Sector,Industry,Location,Revenue,Profits,Employees
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


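The *.agg()* method lets us apply different calculations to different columns in a single call. We pass a dictionary that maps each column name to an aggregation function, or to a list of them: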

In [53]:
sectors.agg({"Revenue" : ["sum", "mean"],
             "Profits" : "sum",
             "Employees" : "mean"})

Unnamed: 0_level_0,Revenue,Revenue,Profits,Employees
Unnamed: 0_level_1,sum,mean,sum,mean
Sector,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Aerospace & Defense,357940,17897.0,28742,48402.85
Apparel,95968,6397.866667,8236,23093.133333
Business Services,272195,5337.156863,28227,26687.254902
Chemicals,243897,8129.9,22628,15455.033333
Energy,1517809,12441.057377,-73447,9745.303279
Engineering & Construction,153983,5922.423077,5304,15642.615385
Financials,2217159,15950.784173,260209,24172.28777
Food and Drug Stores,483769,32251.266667,16759,93026.533333
"Food, Beverages & Tobacco",555967,12929.465116,51417,28177.488372
Health Care,1614707,21529.426667,106114,35710.52
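
The dictionary form produces a two-level column header. A possible alternative, sketched here on a toy table (the data and the output column names such as `revenue_sum` are assumptions chosen for illustration), is named aggregation, which yields flat column names:

```python
import pandas as pd

# Toy data (hypothetical values, only for illustration)
df_toy = pd.DataFrame({
    "Sector": ["Tech", "Tech", "Energy"],
    "Revenue": [10, 30, 5],
    "Profits": [2, 4, -1],
})

# Named aggregation: new_column=(source_column, function)
summary = df_toy.groupby("Sector").agg(
    revenue_sum=("Revenue", "sum"),
    revenue_mean=("Revenue", "mean"),
    profits_sum=("Profits", "sum"),
)

print(summary)
```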


We can also use _.apply()_ on groups. Suppose we want to find the company with the highest profit in each sector. We define a function, _ranker()_, that labels the rows of a group from $1$ to $n$, where $n$ is the number of companies in that sector. We then sort the data by profit, group it with _.groupby()_, and call _.apply()_ to run the function on each group (here, each sector).

In [54]:
def ranker(df):
    """Assign each company a rank within its group according to its profit,
    1 being the company with the highest profit.
    Assumes the rows are already sorted by profit in descending order."""

    df = df.copy()  # work on a copy so the group passed in is not modified
    df['sector_profit_rank'] = np.arange(1, len(df) + 1)
    return df

In [55]:
df_fortune = pd.read_csv("datos/fortune1000.csv")

df_fortune = df_fortune.sort_values('Profits', ascending=False)
df_fortune.head()

Unnamed: 0,Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
22,23,J.P. Morgan Chase,Financials,Commercial Banks,"New York, NY",101006,24442,234598
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
26,27,Wells Fargo,Financials,Commercial Banks,"San Francisco, CA",90033,22894,264700
85,86,Gilead Sciences,Health Care,Pharmaceuticals,"Foster City, CA",32639,18108,8000


In [56]:
# group_keys=False keeps the original row index instead of prepending the sector
df_fortune = df_fortune.groupby('Sector', group_keys=False).apply(ranker)
df_fortune[df_fortune['sector_profit_rank'] == 1].head()

Unnamed: 0,Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees,sector_profit_rank
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000,1
22,23,J.P. Morgan Chase,Financials,Commercial Banks,"New York, NY",101006,24442,234598,1
85,86,Gilead Sciences,Health Care,Pharmaceuticals,"Foster City, CA",32639,18108,8000,1
12,13,Verizon,Telecommunications,Telecommunications,"New York, NY",131620,17879,177700,1
1,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600,1
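
The same ranking can probably be obtained without a custom function. The sketch below uses *.rank()* on the grouped column, on a toy table (the companies and profits are invented for illustration); unlike _ranker()_, it does not require the rows to be sorted beforehand:

```python
import pandas as pd

# Toy data (hypothetical values, only for illustration)
df_toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Tech", "Tech", "Energy", "Energy", "Tech"],
    "Profits": [50, 20, 7, 9, 35],
})

# Rank profits inside each sector, 1 = highest profit.
# method="first" breaks ties by order of appearance.
df_toy["sector_profit_rank"] = (
    df_toy.groupby("Sector")["Profits"]
          .rank(method="first", ascending=False)
          .astype(int)
)

# Companies with the highest profit of their sector
top = df_toy[df_toy["sector_profit_rank"] == 1]
print(top)
```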


## Multi-indexes<a name="pandas_multiindices"></a> 
[Back to index](#indice)

A multi-index (_MultiIndex_) lets a _DataFrame_ have more than one index level. This is useful for organizing data hierarchically, with each additional level acting as another layer of categorization.

To see multi-indexes in action, we load a file with the price of the Big Mac hamburger in several countries:

In [57]:
df_bigmac = pd.read_csv("datos/bigmac.csv")

df_bigmac.sort_values('Price in US Dollars').head()

Unnamed: 0,Date,Country,Price in US Dollars
42,1/2016,Venezuela,0.66
98,7/2015,Venezuela,0.67
151,1/2015,Ukraine,1.2
139,1/2015,Russia,1.36
297,7/2013,India,1.5


In [58]:
df_bigmac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 652 entries, 0 to 651
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 652 non-null    object 
 1   Country              652 non-null    object 
 2   Price in US Dollars  652 non-null    float64
dtypes: float64(1), object(2)
memory usage: 15.4+ KB


The "Date" column was read as plain text (*object*). We tell Pandas to treat it as a date by passing it to *parse_dates*, which converts it to *datetime* type:

In [59]:
df_bigmac = pd.read_csv("datos/bigmac.csv", parse_dates=["Date"])
df_bigmac.head()

Unnamed: 0,Date,Country,Price in US Dollars
0,2016-01-01,Argentina,2.39
1,2016-01-01,Australia,3.74
2,2016-01-01,Brazil,3.35
3,2016-01-01,Britain,4.22
4,2016-01-01,Canada,4.14


In [60]:
df_bigmac.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 652 entries, 0 to 651
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 652 non-null    datetime64[ns]
 1   Country              652 non-null    object        
 2   Price in US Dollars  652 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 15.4+ KB
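
If the column has already been loaded as text, a similar conversion can be done afterwards with *pd.to_datetime()*. This is a minimal sketch on a few sample strings; the explicit format `"%m/%Y"` is an assumption based on the month/year values shown above:

```python
import pandas as pd

# A few date strings in the month/year layout seen in the raw file
dates = pd.Series(["1/2016", "7/2015", "1/2015"])

# An explicit format avoids ambiguity; the day defaults to the 1st of the month
parsed = pd.to_datetime(dates, format="%m/%Y")

print(parsed)
```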


### *.set_index()* method<a name="pandas_multiindices_set_index"></a> 
[Back to index](#indice)

We have already used *.set_index()* to set a new index on a _DataFrame_. Here we will use it to create a multi-level index, but first let us review how it works with a single column:

In [61]:
df_bigmac_dates = df_bigmac.set_index(keys=["Date"])
df_bigmac_dates.head()

Unnamed: 0_level_0,Country,Price in US Dollars
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2016-01-01,Argentina,2.39
2016-01-01,Australia,3.74
2016-01-01,Brazil,3.35
2016-01-01,Britain,4.22
2016-01-01,Canada,4.14


In [62]:
df_bigmac_dates.loc['2016-01-01', 'Price in US Dollars'].mean()

3.303928571428571

We can see that the "Date" column has become the index: it moves to the left and is shown in bold. Now let us create a MultiIndex from the "Date" and "Country" columns:

In [63]:
df_bigmac = pd.read_csv("datos/bigmac.csv", parse_dates=["Date"])

df_bigmac.set_index(keys=["Date", "Country"], inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2016-01-01,Argentina,2.39
2016-01-01,Australia,3.74
2016-01-01,Brazil,3.35
2016-01-01,Britain,4.22
2016-01-01,Canada,4.14


In [64]:
df_bigmac.loc[('2016-01-01','Argentina')]

Price in US Dollars    2.39
Name: (2016-01-01 00:00:00, Argentina), dtype: float64

In [65]:
df_bigmac.sort_index(ascending=[True, False], inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2010-01-01,Uruguay,3.32
2010-01-01,United States,3.58
2010-01-01,Ukraine,1.83
2010-01-01,UAE,2.99
2010-01-01,Turkey,3.83
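
In the cell above, `ascending=[True, False]` gives one sort direction per index level: dates ascending, countries descending within each date. A small sketch with a toy MultiIndex (the labels and prices are made up for illustration) shows the difference from the default sort:

```python
import pandas as pd

# Toy MultiIndex (hypothetical values, only for illustration)
idx = pd.MultiIndex.from_tuples(
    [("2016-01", "Chile"), ("2010-01", "Brazil"),
     ("2010-01", "Argentina"), ("2016-01", "Brazil")],
    names=["Date", "Country"],
)
prices = pd.DataFrame({"Price": [3.0, 4.8, 1.8, 3.4]}, index=idx)

# Default: every level in ascending order
asc = prices.sort_index()

# One flag per level: Date ascending, Country descending
mixed = prices.sort_index(ascending=[True, False])

print(mixed)
```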


Let us see what happens when we sort the index with the default arguments of the *.sort_index()* method:

In [66]:
df_bigmac.sort_index(inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2010-01-01,Argentina,1.84
2010-01-01,Australia,3.98
2010-01-01,Brazil,4.76
2010-01-01,Britain,3.67
2010-01-01,Canada,3.97


As we can see, both levels, "Date" and "Country", are sorted in ascending order. Next, let us see how to get information about the index.

In [67]:
df_bigmac.index[:5]

MultiIndex([('2010-01-01', 'Argentina'),
            ('2010-01-01', 'Australia'),
            ('2010-01-01',    'Brazil'),
            ('2010-01-01',   'Britain'),
            ('2010-01-01',    'Canada')],
           names=['Date', 'Country'])

In [68]:
df_bigmac.index.names

FrozenList(['Date', 'Country'])

In [69]:
type(df_bigmac.index)

pandas.core.indexes.multi.MultiIndex

### *.get_level_values()* method<a name="pandas_multiindices_get_level_values"></a> 
[Back to index](#indice)

This method returns the values of one particular level of the index. We create the MultiIndex again, this time in a different way: when reading the dataset with *pd.read_csv()*, we pass the columns we want as index directly through *index_col*.

In [70]:
df_bigmac = pd.read_csv("datos/bigmac.csv", parse_dates=["Date"],
                        index_col=["Date", "Country"])

df_bigmac.sort_index(inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2010-01-01,Argentina,1.84
2010-01-01,Australia,3.98
2010-01-01,Brazil,4.76
2010-01-01,Britain,3.67
2010-01-01,Canada,3.97


In [71]:
df_bigmac.index.get_level_values("Date")

DatetimeIndex(['2010-01-01', '2010-01-01', '2010-01-01', '2010-01-01',
               '2010-01-01', '2010-01-01', '2010-01-01', '2010-01-01',
               '2010-01-01', '2010-01-01',
               ...
               '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01',
               '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01',
               '2016-01-01', '2016-01-01'],
              dtype='datetime64[ns]', name='Date', length=652, freq=None)

In [72]:
df_bigmac.index.get_level_values("Country")

Index(['Argentina', 'Australia', 'Brazil', 'Britain', 'Canada', 'Chile',
       'China', 'Colombia', 'Costa Rica', 'Czech Republic',
       ...
       'Switzerland', 'Taiwan', 'Thailand', 'Turkey', 'UAE', 'Ukraine',
       'United States', 'Uruguay', 'Venezuela', 'Vietnam'],
      dtype='object', name='Country', length=652)
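
Because the result has one entry per row, it is often combined with *.unique()* or used as a row filter. The following sketch rebuilds a tiny MultiIndex by hand (dates are kept as plain strings to keep the example short; the four prices are the ones shown in the tables above):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2010-01", "2016-01"], ["Argentina", "Brazil"]],
    names=["Date", "Country"],
)
prices = pd.DataFrame({"Price": [1.84, 4.76, 2.39, 3.35]}, index=idx)

# A level can be requested by name or by position
countries = prices.index.get_level_values("Country")
by_position = prices.index.get_level_values(1)

# Distinct labels of the level
unique_countries = countries.unique()

# Boolean filter built from one level of the index
brazil = prices[countries == "Brazil"]
print(brazil)
```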

### *.set_names()* method<a name="pandas_multiindices_set_names"></a> 
[Back to index](#indice)

In [73]:
df_bigmac = pd.read_csv("datos/bigmac.csv", parse_dates=["Date"],
                        index_col=["Date", "Country"])

df_bigmac.sort_index(inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2010-01-01,Argentina,1.84
2010-01-01,Australia,3.98
2010-01-01,Brazil,4.76
2010-01-01,Britain,3.67
2010-01-01,Canada,3.97


This method changes the names of the index levels of an existing _DataFrame_.

In [74]:
df_bigmac.index.set_names(["Fecha", "Pais"], inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Fecha,Pais,Unnamed: 2_level_1
2010-01-01,Argentina,1.84
2010-01-01,Australia,3.98
2010-01-01,Brazil,4.76
2010-01-01,Britain,3.67
2010-01-01,Canada,3.97
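
When only one level needs a new name, or when we prefer not to modify the index in place, two variants may be handy. This is a sketch on a toy MultiIndex (dates kept as strings for brevity): *.set_names()* with the `level` argument returns a new index, and *.rename_axis()* returns a new _DataFrame_:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2010-01", "2016-01"], ["Argentina", "Brazil"]],
    names=["Date", "Country"],
)
prices = pd.DataFrame({"Price": [1.84, 4.76, 2.39, 3.35]}, index=idx)

# Rename a single level; without inplace=True a new index is returned
renamed_index = prices.index.set_names("Pais", level="Country")

# rename_axis works on the DataFrame and accepts an old-name -> new-name mapping
prices_renamed = prices.rename_axis(index={"Date": "Fecha"})

print(list(renamed_index.names), list(prices_renamed.index.names))
```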


### Indexing with *.loc[]*<a name="pandas_multiindices_loc"></a> 
[Back to index](#indice)

Let us see how to select rows from a _DataFrame_ that has a MultiIndex.

In [75]:
df_bigmac = pd.read_csv("datos/bigmac.csv", parse_dates=["Date"],
                        index_col=["Date", "Country"])

df_bigmac.sort_index(inplace=True)
df_bigmac.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Price in US Dollars
Date,Country,Unnamed: 2_level_1
2010-01-01,Argentina,1.84
2010-01-01,Australia,3.98
2010-01-01,Brazil,4.76
2010-01-01,Britain,3.67
2010-01-01,Canada,3.97


The *.loc[]* indexer selects rows by label. Since our index has two levels, we pass a tuple with one label (or a list of labels) per level. As a second argument we give the column we want to extract; if we omit it, all columns are returned for the selected rows.

In [76]:
df_bigmac.loc[("2010-01-01", ["Brazil", "Argentina"]), "Price in US Dollars"]

Date        Country  
2010-01-01  Argentina    1.84
            Brazil       4.76
Name: Price in US Dollars, dtype: float64

In [77]:
df_bigmac.loc[("2015-07-01", "Chile"), "Price in US Dollars"]

Date        Country
2015-07-01  Chile      3.27
Name: Price in US Dollars, dtype: float64
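
Some other common selections on a MultiIndex are sketched below on a toy frame (dates kept as strings for brevity; the prices are the ones shown in the tables above): all rows of one first-level label, all rows of one second-level label with *.xs()*, and a selection across the first level with `pd.IndexSlice`. The index must be sorted for the slice to work, which is why the cells above call *.sort_index()*:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2010-01", "2016-01"], ["Argentina", "Brazil"]],
    names=["Date", "Country"],
)
prices = pd.DataFrame({"Price": [1.84, 4.76, 2.39, 3.35]}, index=idx)

# Only the first level: returns the rows of that date, indexed by Country
all_2010 = prices.loc["2010-01"]

# Only the second level: xs() takes a cross-section by level name
brazil_all = prices.xs("Brazil", level="Country")

# Every date, selected countries: ":" on the first level via IndexSlice
subset = prices.loc[pd.IndexSlice[:, ["Brazil"]], "Price"]

# One label per level plus a column: a single value
single = prices.loc[("2016-01", "Argentina"), "Price"]
print(single)
```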

## Exercises<a name="ejercicios"></a> 
[Back to index](#indice)

#### Block 1: salaries in the city of Chicago (chicago.csv)

Columns:
- Name: name of the person.
- Position Title: title of the job the person does.
- Department: department where the person works.
- Employee Annual Salary: annual salary of the person.

1 - Compute the mean salary of the people in the Chicago dataset.

In [6]:
import pandas as pd

df = pd.read_csv("S6_datos/chicago.csv")
df.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00
3,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$89880.00
4,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$106836.00


In [79]:
# regex=False treats '$' as a literal character, not as the end-of-string anchor
df["Employee Annual Salary"] = df["Employee Annual Salary"].str.replace('$', '', regex=False).astype(float)
df["Employee Annual Salary"].mean()

80204.178633899
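
The cleaning step can be checked in isolation on a few values. This is a minimal sketch using three of the salary strings shown above; `raw` and `salary` are names chosen for the example:

```python
import pandas as pd

raw = pd.Series(["$90744.00", "$84450.00", "$106836.00"])

# Remove the literal dollar sign, then convert the text to numbers
salary = raw.str.replace("$", "", regex=False).astype(float)

print(salary.mean())
```

Passing `regex=False` matters because `$` has a special meaning in regular expressions.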

2 - Count how many people earn more than the mean. (_count()_ may help.)

In [80]:
df[df["Employee Annual Salary"] > df["Employee Annual Salary"].mean()].shape[0]

20013

In [81]:
# count() returns one value per column; take the first one by position
df[df["Employee Annual Salary"] > df["Employee Annual Salary"].mean()].count().iloc[0]

20013

3 - Which department employs the largest number of people?

In [82]:
df[["Name", "Department"]].groupby(
    "Department").count().sort_values("Name", ascending=False).index[0]

'POLICE'

4 - Which department has the highest mean salary?

In [83]:
df[["Employee Annual Salary", "Department"]].groupby(
    "Department").mean().sort_values("Employee Annual Salary", ascending=False).index[0]

'DoIT'

5 - Find the 5 departments with the highest mean salary.

In [84]:
df[["Employee Annual Salary", "Department"]].groupby(
    "Department").mean().sort_values("Employee Annual Salary", ascending=False).head()

Unnamed: 0_level_0,Employee Annual Salary
Department,Unnamed: 1_level_1
DoIT,96727.294118
BUILDINGS,96313.738626
FIRE,95700.627306
BUDGET & MGMT,91989.230769
HUMAN RELATIONS,91065.75


6 - Which department has the most distinct job titles?

In [85]:
df[["Position Title", "Department"]].groupby(
    "Department").nunique().sort_values("Position Title", ascending=False).index[0]

'TRANSPORTN'

7 - Which is the best-paid job, and in which department is it done?

In [86]:
# ascending=False puts the highest salary first. With the default ascending
# sort, iloc[0] is the lowest-paid row instead (KOCH, STEVEN, 0.96).
df.sort_values("Employee Annual Salary", ascending=False).iloc[0]

8 - For each department, find the job title that employs the largest number of people.

In [87]:
dfaux = df[['Department', 'Position Title', 'Name']].groupby(
    ['Department','Position Title']).count().reset_index()

dfaux = dfaux.sort_values('Name', ascending=False).groupby('Department').head(1)
dfaux.head(5)

Unnamed: 0,Department,Position Title,Name
1418,POLICE,POLICE OFFICER,9184
824,FIRE,FIREFIGHTER-EMT,1208
1620,STREETS & SAN,SANITATION LABORER,673
1285,OEMC,CROSSING GUARD,560
1844,WATER MGMNT,CONSTRUCTION LABORER,399


9 - Compute the 95% confidence interval for the mean salary in the city of Chicago.

In [88]:
from scipy import stats

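# Critical value of the normal distribution divided by sqrt(n);
# multiplied by the standard deviation it gives the margin of error of the mean
t = stats.norm.ppf(0.975) / df.shape[0] ** 0.5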
std = df["Employee Annual Salary"].std()
mean = df["Employee Annual Salary"].mean()

(mean - t * std, mean + t * std)

(79929.45828053026, 80478.89898726775)
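
The cell above applies the normal approximation $\bar{x} \pm z_{0.975}\, s / \sqrt{n}$, where $s$ is the sample standard deviation and $n$ the number of rows. As a sketch, the same calculation can be wrapped in a small function; `mean_confidence_interval` is a hypothetical helper written for this example, and it uses `statistics.NormalDist` from the standard library instead of SciPy:

```python
from statistics import NormalDist

import numpy as np


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation, like pandas .std()
    z = NormalDist().inv_cdf(0.5 + level / 2)  # 0.975 quantile for level=0.95
    half_width = z * std / np.sqrt(n)
    return mean - half_width, mean + half_width


# Toy sample (hypothetical values, only for illustration)
low, high = mean_confidence_interval([10.0, 12.0, 14.0, 16.0, 18.0])
print(low, high)
```

For small samples a Student's t critical value would usually be preferred; with tens of thousands of rows, as in the salary data, both give practically the same interval.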

10 - Which department has the highest salary variability?

In [89]:
df[["Department", "Employee Annual Salary"]].groupby(
    "Department").std().sort_values("Employee Annual Salary", ascending=False).index[0]

"MAYOR'S OFFICE"

11 - Raise by 10% the salary of the people who are among the 5 lowest salaries of their department.

In [8]:
# Assumes "Employee Annual Salary" was already converted to float (exercise 1);
# on the raw "$..." strings the in-place multiplication raises a TypeError.
dfaux = df.sort_values(['Department','Employee Annual Salary']).reset_index(drop=True)

# Distinct salaries per department, in ascending order because dfaux is sorted
d_depart = dfaux.groupby('Department')['Employee Annual Salary'].unique().to_dict()

for depart, salaries in d_depart.items():
    # Rows of this department whose salary is one of its 5 lowest distinct values
    dfaux.loc[(dfaux['Department'] == depart) & 
              (dfaux['Employee Annual Salary'].isin(salaries[:5])),
              'Employee Annual Salary'] *= 1.1
dfaux.head()

12 - Compute the mean salary of the departments 'POLICE', 'POLICE BOARD' and 'FIRE'.

In [91]:
# numeric_only=True skips the text columns, which cannot be averaged
df[df["Department"].isin(["POLICE", "POLICE BOARD", "FIRE"])].mean(numeric_only=True)

Employee Annual Salary    89112.865248
dtype: float64

#### Block 2: cancer treatment study (sanity.csv)

Columns:
- idsub: identifier of the person in the hospital.
- sexo: 0 (male), 1 (female).
- edad: age of the person.
- altura: height of the person in cm.
- peso: weight of the person in kg.
- tratamiento: 0 (chemotherapy), 1 (radiotherapy).
- supervivencia: years of survival since the treatment.

14 - Compute the mean survival of the study participants.

In [92]:
df = pd.read_csv("datos/sanity.csv")
df.head()

Unnamed: 0,idsub,sexo,edad,altura,peso,tratamiento,supervivencia
0,1000.0,0.0,61.0,166.0,74.0,1.0,2.0
1,1001.0,0.0,21.0,179.0,89.0,1.0,4.0
2,1002.0,0.0,59.0,166.0,59.0,1.0,2.0
3,1003.0,0.0,27.0,162.0,81.0,0.0,2.0
4,1004.0,0.0,23.0,164.0,69.0,1.0,2.0


In [93]:
df["supervivencia"].mean()

3.21

15 - Count how many participants survive longer than the mean. And longer than the median? What percentage does each case represent?

In [94]:
import numpy as np

In [95]:
above_avg = df["supervivencia"] > df["supervivencia"].mean()
(np.count_nonzero(above_avg), f"{np.mean(above_avg):.2%}")

(71, '35.50%')

In [96]:
above_median = df["supervivencia"] > df["supervivencia"].median()
(np.count_nonzero(above_median), f"{np.mean(above_median):.2%}")

(71, '35.50%')

16 - Which group of people (men or women) has the longest survival?

In [97]:
d_sex = {0: "hombre", 1: "mujer"}

d_sex[df.groupby("sexo").mean().sort_values(
    "supervivencia", ascending=False).index[0]]

'mujer'

17 - Which treatment extends survival more, chemotherapy or radiotherapy?

In [98]:
d_treatment = {0: "quimioterapia", 1: "radioterapia"}

d_treatment[df.groupby("tratamiento").mean().sort_values(
    "supervivencia", ascending=False).index[0]]

'quimioterapia'

18 - Add a new column that labels people older than 60 as elderly ("anciano") and everyone else as young ("joven").

In [99]:
df["grupo_edad"] = df["edad"].map(lambda edad: "anciano" if edad > 60 else "joven")

In [100]:
df.head()

Unnamed: 0,idsub,sexo,edad,altura,peso,tratamiento,supervivencia,grupo_edad
0,1000.0,0.0,61.0,166.0,74.0,1.0,2.0,anciano
1,1001.0,0.0,21.0,179.0,89.0,1.0,4.0,joven
2,1002.0,0.0,59.0,166.0,59.0,1.0,2.0,joven
3,1003.0,0.0,27.0,162.0,81.0,0.0,2.0,joven
4,1004.0,0.0,23.0,164.0,69.0,1.0,2.0,joven


19 - Can you tell whether women who weigh more than 60 kg survive longer than those who weigh less?

In [101]:
grupo1 = df["supervivencia"][(df["peso"] > 60) & (df["sexo"] == 1)].mean()
grupo2 = df["supervivencia"][(df["peso"] <= 60) & (df["sexo"] == 1)].mean()

grupo1 > grupo2

True

20 - Create a new column that classifies people by their body mass index (imc) according to the following criteria:

- Below 18.5: underweight.
- 18.5 to 24.9: healthy.
- 25.0 to 29.9: overweight.
- 30.0 to 39.9: obese.
- 40 or more: extreme or high-risk obesity.

In [102]:
def imclabel(imc):
    # Upper bounds are exclusive, so values such as 24.95 are still "saludable"
    if imc < 18.5:
        return "debajo del peso"
    elif imc < 25:
        return "saludable"
    elif imc < 30:
        return "sobrepeso"
    elif imc < 40:
        return "obeso"
    else:
        return "obesidad extrema"
    

df["imc"] = df["peso"] / (df["altura"] * 0.01) ** 2
df["imclabel"] = df["imc"].map(imclabel)

In [103]:
df.head()

Unnamed: 0,idsub,sexo,edad,altura,peso,tratamiento,supervivencia,grupo_edad,imc,imclabel
0,1000.0,0.0,61.0,166.0,74.0,1.0,2.0,anciano,26.854406,sobrepeso
1,1001.0,0.0,21.0,179.0,89.0,1.0,4.0,joven,27.776911,sobrepeso
2,1002.0,0.0,59.0,166.0,59.0,1.0,2.0,joven,21.410945,saludable
3,1003.0,0.0,27.0,162.0,81.0,0.0,2.0,joven,30.864198,obeso
4,1004.0,0.0,23.0,164.0,69.0,1.0,2.0,joven,25.654372,sobrepeso
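
A possible alternative to a hand-written function is *pd.cut()*, which assigns each value to a bin. This is a sketch under the assumption that the category boundaries are 18.5, 25, 30 and 40 with the lower bound included; the sample values in `imc` are illustrative:

```python
import numpy as np
import pandas as pd

# Sample body mass index values (illustrative)
imc = pd.Series([17.0, 21.41, 24.95, 26.85, 30.86, 41.0])

bins = [0, 18.5, 25, 30, 40, np.inf]
labels = ["debajo del peso", "saludable", "sobrepeso", "obeso", "obesidad extrema"]

# right=False makes each bin closed on the left: [18.5, 25), [25, 30), ...
imclabel = pd.cut(imc, bins=bins, labels=labels, right=False)

print(imclabel)
```

The result is a categorical column, which also keeps the categories in their natural order for sorting or plotting.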
