# DataFrames

DataFrames are the workhorse of Pandas and are directly inspired by the R programming language. We can think of a DataFrame as a collection of Series that have been joined so that they share the same index. Let's use Pandas to explore this topic!
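As a minimal sketch of the "collection of Series sharing one index" idea, the snippet below builds a small DataFrame from two Series. The names `w`, `x`, and `demo` and their values are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Two Series with the same index labels (illustrative data)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each dict key becomes a column; the shared labels become the row index
demo = pd.DataFrame({'W': w, 'X': x})

print(demo)
print(type(demo['W']))  # pulling a column back out yields a Series
```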

In [12]:
import pandas as pd
import numpy as np

In [13]:
from numpy.random import randn
np.random.seed(101)

In [14]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

To display the DataFrame, just evaluate its name:

In [15]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


## Selection and Indexing

Let's look at several ways to get data out of a DataFrame.

In [16]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [17]:
# Pass a list of column names
df[['W','Z']]

,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [18]:
# Attribute (dot) access also works, but is NOT recommended: it can clash with DataFrame methods
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
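To illustrate why bracket access is the safer choice, here is a small sketch. The frame `demo` and its column names `count` and `my col` are hypothetical, chosen only because one collides with a DataFrame method and the other contains a space.

```python
import pandas as pd

# Hypothetical frame; the column names are chosen to show the pitfalls
demo = pd.DataFrame({'count': [3, 5], 'my col': [1, 2]})

# Bracket access always returns the column
print(demo['count'])

# Dot access finds the DataFrame.count method first, not the column
print(callable(demo.count))

# Names that are not valid Python identifiers only work with brackets
print(demo['my col'])
```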

The columns of a DataFrame are simply Series:

In [19]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [20]:
df['nueva'] = df['W'] + df['Y']

In [21]:
df

,W,X,Y,Z,nueva
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


**Removing columns**

In [22]:
df.drop('nueva',axis=1)

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [23]:
# Note: drop does not modify the DataFrame itself unless inplace=True is specified
df

,W,X,Y,Z,nueva
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [24]:
df.drop('nueva',axis=1,inplace=True)

In [25]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
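An alternative to `inplace=True` is to keep the result that `drop` returns. The sketch below uses a small made-up frame (`demo` and its values are assumptions) and the `columns=` keyword, which is equivalent to passing `axis=1`.

```python
import pandas as pd

# Small illustrative frame with an extra column to remove
demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'nueva': [4, 6]})

# drop returns a new DataFrame; the original is left untouched
trimmed = demo.drop(columns='nueva')  # same as demo.drop('nueva', axis=1)

print(list(demo.columns))
print(list(trimmed.columns))
```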


Rows can be removed the same way:

In [26]:
df.drop('E',axis=0)

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


**Selecting rows**

In [27]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select by position instead of by label:

In [28]:
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
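The difference between `loc` and `iloc` is easiest to see when the row labels are themselves integers. The following sketch uses a made-up frame `demo` whose labels 10, 20, 30 are assumptions for illustration.

```python
import pandas as pd

# Integer labels that do not match the row positions
demo = pd.DataFrame({'W': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_label = demo.loc[10]     # the row whose label is 10
by_position = demo.iloc[0]  # the first row, whatever its label is

print(by_label['W'], by_position['W'])
```

Here `demo.loc[0]` would raise a `KeyError`, because no row carries the label 0.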

**Selecting a subset of rows and columns**

In [29]:
df.loc['B','Y']

-0.84807698340363147

In [30]:
df.loc[['A','B'],['W','Y']]

,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
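Besides lists of labels, both indexers accept slices. A minimal sketch follows, using a made-up integer-valued frame `demo` rather than the random data above: label slices with `loc` include both endpoints, while position slices with `iloc` exclude the stop, as in plain Python.

```python
import numpy as np
import pandas as pd

# Illustrative 5x4 frame with the same labels as the notebook's df
demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                    index=list('ABCDE'), columns=list('WXYZ'))

by_label = demo.loc['A':'C', 'W':'Y']  # label slices include both endpoints
by_position = demo.iloc[0:3, 0:3]      # position slices exclude the stop

print(by_label)
print(by_label.equals(by_position))
```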


### Conditional selection

An important Pandas feature is conditional selection using bracket notation, much like in NumPy:

In [31]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [32]:
df>0

,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [33]:
df[df>0]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [34]:
df[df['W']>0]

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [35]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [36]:
df[df['W']>0][['Y','X']]

,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757
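The chained form above filters rows first and then picks columns in a second step. As a sketch, the same selection can be written in one step with `loc`, passing the boolean mask as the row selector. The frame `demo` and its values are illustrative assumptions.

```python
import pandas as pd

# Illustrative data; row C has a negative W and should be filtered out
demo = pd.DataFrame({'W': [2.7, -2.0, 0.2],
                     'X': [0.6, 0.7, 2.0],
                     'Y': [0.9, 0.5, 2.6]},
                    index=['A', 'C', 'E'])

chained = demo[demo['W'] > 0][['Y', 'X']]      # two steps
single = demo.loc[demo['W'] > 0, ['Y', 'X']]   # one step: mask for rows, list for columns

print(single)
print(single.equals(chained))
```

The single-step form is generally the one to prefer when the selection will later be assigned to, since chained indexing may operate on a temporary copy.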


To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [37]:
df[(df['W']>0) & (df['Y'] > 1)]

,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
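The sketch below shows the element-wise operators side by side, and why Python's `and` keyword cannot be used here: it asks for a single truth value of a whole Series, which Pandas refuses to give. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Illustrative data covering every sign combination of W and Y
demo = pd.DataFrame({'W': [2.7, 0.65, -2.0, -1.0],
                     'Y': [0.9, -0.85, 0.5, -1.0]},
                    index=['A', 'B', 'C', 'D'])

both = demo[(demo['W'] > 0) & (demo['Y'] > 0)]        # element-wise AND
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]      # element-wise OR
neither = demo[~((demo['W'] > 0) | (demo['Y'] > 0))]  # ~ negates a mask

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())

# The plain `and` keyword needs one True/False per operand, so it fails
and_failed = False
try:
    demo[(demo['W'] > 0) and (demo['Y'] > 0)]
except ValueError as err:
    and_failed = True
    print('and fails:', err)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`.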


## More Index Details

Let's look at some more indexing features, including resetting the index or setting it to something else. We'll also cover index hierarchy!

In [38]:
df

,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [39]:
# Reset to the default integer index 0, 1, ..., n-1
df.reset_index()

,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [40]:
nuevoInd = 'Izabal Peten Solola Zacapa Suchitepequez'.split()

In [41]:
df['Deptos.'] = nuevoInd

In [42]:
df

,W,X,Y,Z,Deptos.
A,2.70685,0.628133,0.907969,0.503826,Izabal
B,0.651118,-0.319318,-0.848077,0.605965,Peten
C,-2.018168,0.740122,0.528813,-0.589001,Solola
D,0.188695,-0.758872,-0.933237,0.955057,Zacapa
E,0.190794,1.978757,2.605967,0.683509,Suchitepequez


In [43]:
df.set_index('Deptos.')

Deptos.,W,X,Y,Z
Izabal,2.70685,0.628133,0.907969,0.503826
Peten,0.651118,-0.319318,-0.848077,0.605965
Solola,-2.018168,0.740122,0.528813,-0.589001
Zacapa,0.188695,-0.758872,-0.933237,0.955057
Suchitepequez,0.190794,1.978757,2.605967,0.683509


In [44]:
df

,W,X,Y,Z,Deptos.
A,2.70685,0.628133,0.907969,0.503826,Izabal
B,0.651118,-0.319318,-0.848077,0.605965,Peten
C,-2.018168,0.740122,0.528813,-0.589001,Solola
D,0.188695,-0.758872,-0.933237,0.955057,Zacapa
E,0.190794,1.978757,2.605967,0.683509,Suchitepequez


In [45]:
df.set_index('Deptos.',inplace=True)

In [46]:
df

Deptos.,W,X,Y,Z
Izabal,2.70685,0.628133,0.907969,0.503826
Peten,0.651118,-0.319318,-0.848077,0.605965
Solola,-2.018168,0.740122,0.528813,-0.589001
Zacapa,0.188695,-0.758872,-0.933237,0.955057
Suchitepequez,0.190794,1.978757,2.605967,0.683509
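As a compact recap of `set_index` and `reset_index`, the sketch below runs a round trip on a two-row frame. The frame `demo` is an illustrative assumption that reuses the `Deptos.` column name from above.

```python
import pandas as pd

# Illustrative frame with string row labels and a candidate index column
demo = pd.DataFrame({'W': [1.0, 2.0], 'Deptos.': ['Izabal', 'Peten']},
                    index=['A', 'B'])

# set_index returns a new frame; the previous labels A, B are discarded
indexed = demo.set_index('Deptos.')

# reset_index turns the index back into a column and restores 0..n-1
restored = indexed.reset_index()

print(indexed.index.tolist())
print(restored.columns.tolist())
print(demo.index.tolist())  # the original is unchanged without inplace=True
```

Note that the old labels are lost once the index is replaced; calling `reset_index()` first would preserve them as a column.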


## Multi-Index and Index Hierarchy

Let's go over how to work with a MultiIndex. First we create an example of what a multi-indexed DataFrame looks like:

In [47]:
# Index levels
externo = ['G1','G1','G1','G2','G2','G2']
interno = [1,2,3,1,2,3]
indice_jerarquico = list(zip(externo, interno))
indice_jerarquico = pd.MultiIndex.from_tuples(indice_jerarquico)

In [48]:
indice_jerarquico

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
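When the index contains every combination of the outer and inner levels, as it does here, `pd.MultiIndex.from_product` offers a shorter construction. This is a sketch of an equivalent alternative, not the approach used in the cell above.

```python
import pandas as pd

# The construction used above: explicit (outer, inner) pairs
externo = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
interno = [1, 2, 3, 1, 2, 3]
from_tuples = pd.MultiIndex.from_tuples(list(zip(externo, interno)))

# Alternative: the Cartesian product of the two level lists
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])

print(from_product.equals(from_tuples))
```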

In [49]:
df = pd.DataFrame(np.random.randn(6,2),index=indice_jerarquico,columns=['A','B'])
df

,,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


Now let's see how to index this! For an index hierarchy we use df.loc[]; if the hierarchy were on the columns axis, we would just use the normal bracket notation df[]. Selecting one level of the index returns a sub-DataFrame:

In [50]:
df.loc['G1']

,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [51]:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64
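The two chained `loc` calls can also be collapsed into a single lookup by passing a tuple with one label per level. The sketch below uses a made-up integer-valued frame `demo` with the same index structure so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same two-level index shape as df
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

chained = demo.loc['G1'].loc[1]  # two lookups, one level at a time
direct = demo.loc[('G1', 1)]     # one lookup with a tuple of labels
cell = demo.loc[('G1', 1), 'A']  # tuple for the row, label for the column

print(direct)
print(cell)
```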

In [52]:
df.index.names

FrozenList([None, None])

In [53]:
df.index.names = ['Grupo','Num']

In [54]:
df

Grupo,Num,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [55]:
df.xs('G1')

Num,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [56]:
# Pass a tuple with one label per level (list keys are rejected by recent pandas versions)
df.xs(('G1',1))

A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64

In [57]:
df.xs(1,level='Num')

Grupo,A,B
G1,0.302665,1.693723
G2,0.166905,0.184502
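What sets `xs` apart from `loc` is the ability to select on an inner level directly. The sketch below repeats that cross-section on a made-up integer-valued frame `demo` and also shows the `drop_level=False` option, which keeps the selected level in the result.

```python
import numpy as np
import pandas as pd

# Illustrative frame with named index levels, mirroring df above
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Grupo', 'Num'])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

# Rows where Num == 1; the Num level is dropped from the result
inner = demo.xs(1, level='Num')

# Same rows, but both index levels are kept
kept = demo.xs(1, level='Num', drop_level=False)

print(inner)
print(kept.index.tolist())
```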


# Good job!