___

<a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html'> <img src='https://www.python.org/static/img/python-logo.png' /></a>

___

# DataFrames

**DataFrames** are the workhorse of the pandas library and are directly inspired by the R programming language. We can think of a DataFrame as a group of Series objects put together so that they share the same index. Let's use pandas to explore this topic!
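To make the idea of Series sharing one index concrete, here is a minimal sketch. The names height, weight and people, and all the values, are made up for illustration only.

```python
import pandas as pd

# Two illustrative Series with the same index labels (values are made up)
height = pd.Series([1.60, 1.75, 1.82], index=['a', 'b', 'c'])
weight = pd.Series([55.0, 70.5, 80.2], index=['a', 'b', 'c'])

# Each dict key becomes a column; the shared labels become the row index
people = pd.DataFrame({'height': height, 'weight': weight})

print(people)
print(type(people['height']))  # each column is still a Series
```

Building the frame from a dict of Series is only one of several constructors; the notebook cells that follow use a NumPy array instead.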

In [1]:
import pandas as pd
import numpy as np

In [2]:
from numpy.random import randn
np.random.seed(101)

In [3]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [4]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


## Selection and Indexing

Let's learn the various ways to get data out of a DataFrame.

In [5]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [6]:
# Pass a list of column names
df[['W','Z']]

|   | W | Z |
|---|---|---|
| A | 2.70685 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |


In [7]:
# Attribute (dot) access (NOT RECOMMENDED: it can clash with DataFrame method names)
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
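One reason dot access is discouraged can be sketched as follows. The frame demo and its column names are hypothetical; the point is that a column whose name matches an existing DataFrame attribute is shadowed by that attribute.

```python
import pandas as pd

# Illustrative frame: the column 'count' has the same name as DataFrame.count
demo = pd.DataFrame({'W': [1.0, 2.0], 'count': [10, 20]})

print(demo.W.tolist())         # works, because 'W' is not an existing attribute
print(callable(demo.count))    # demo.count is the method, not the column
print(demo['count'].tolist())  # bracket notation always returns the column
```

Bracket notation also works for column names that contain spaces or are not valid Python identifiers, which dot access cannot handle.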

The columns of a DataFrame are just Series:

In [8]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [9]:
df['nueva'] = df['W'] + df['Y']

In [10]:
df

|   | W | X | Y | Z | nueva |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


**Removing columns and/or rows**

axis=1 (or axis='columns') refers to the columns, and axis=0 (or axis='index') refers to the rows. With the pandas drop method, passing axis=1 removes columns, while passing axis=0 (the default) removes rows from the dataset.

**Dropping columns:**
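As a small sketch of the axis convention, the example below compares the axis argument with the columns= and index= keywords of drop. The frame demo is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

# Dropping a column: axis=1 and columns= select the same thing
by_axis = demo.drop('X', axis=1)
by_keyword = demo.drop(columns='X')
print(by_axis.equals(by_keyword))

# Dropping a row: axis=0 (the default) and index= select the same thing
rows_by_axis = demo.drop('B', axis=0)
rows_by_keyword = demo.drop(index='B')
print(rows_by_axis.equals(rows_by_keyword))
```

The keyword forms can be easier to read because they name the thing being removed instead of an axis number.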

In [11]:
# axis=1 is equivalent to passing columns=labels
df.drop('nueva',axis=1)

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [12]:
# Not done in place unless specified: df still has the 'nueva' column!
df

|   | W | X | Y | Z | nueva |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


In [13]:
df.drop('nueva',axis=1,inplace=True)

In [14]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


You can also drop rows this way:

In [15]:
df.drop('D',axis=0, inplace = True)

In [16]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
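An alternative to inplace=True is to keep the DataFrame that drop returns. The sketch below uses a made-up frame demo; it also shows the errors='ignore' option, which may help when a cell is re-run and the label has already been removed (by default drop raises a KeyError for a missing label).

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

# drop returns a new DataFrame, so calls can be chained and assigned
trimmed = demo.drop('B', axis=0).drop('X', axis=1)
print(trimmed)

# The original frame is untouched
print(demo.shape)

# Dropping a label that does not exist raises KeyError unless errors='ignore'
safe = trimmed.drop('B', axis=0, errors='ignore')
print(safe.equals(trimmed))
```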


**Selecting rows**

In [17]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select by position instead of by label:

In [18]:
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
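The difference between label-based and position-based selection is easiest to see when the labels are themselves integers. This is a small sketch with a hypothetical frame demo whose index is deliberately out of order.

```python
import pandas as pd

# Integer labels that do not match the row positions
demo = pd.DataFrame({'val': ['x', 'y', 'z']}, index=[2, 0, 1])

by_label = demo.loc[0, 'val']    # the row whose label is 0
by_position = demo.iloc[0, 0]    # the first row, whatever its label

print(by_label, by_position)
```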

**Selecting a subset of rows and columns**

In [19]:
df.loc['B','Y']

-0.8480769834036315

In [20]:
df.loc[['A','B'],['W','Y']]

|   | W | Y |
|---|---|---|
| A | 2.70685 | 0.907969 |
| B | 0.651118 | -0.848077 |
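Besides single labels and lists, loc and iloc also accept slices. A minimal sketch on a made-up frame demo shows one difference worth remembering: label slices include both endpoints, while positional slices exclude the stop position.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    index=['A', 'B', 'C', 'E'], columns=['W', 'X', 'Y'])

by_label = demo.loc['A':'C', 'W':'X']  # rows A, B, C and columns W, X
by_position = demo.iloc[0:2, 0:2]      # first two rows and first two columns

print(by_label.shape, by_position.shape)
```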


### Conditional selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [21]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [22]:
df>0

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | True | True | True | True |
| B | True | False | False | True |
| C | False | True | True | False |
| E | True | True | True | True |


In [23]:
df[df>0]

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | NaN | NaN | 0.605965 |
| C | NaN | 0.740122 | 0.528813 | NaN |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [24]:
df[df['W']>0]

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [25]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
E    2.605967
Name: Y, dtype: float64

In [26]:
df[df['W']>0][['Y','X']]

|   | Y | X |
|---|---|---|
| A | 0.907969 | 0.628133 |
| B | -0.848077 | -0.319318 |
| E | 2.605967 | 1.978757 |
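The two-step selection above (filter rows, then pick columns) can also be written as a single loc call. The sketch below uses a made-up frame demo; the single-call form is generally the safer choice when you want to assign values, since assigning through chained brackets may act on a temporary copy.

```python
import pandas as pd

demo = pd.DataFrame({'W': [2.0, -1.0, 0.5], 'X': [1, 2, 3], 'Y': [4, 5, 6]},
                    index=['A', 'B', 'C'])

# Same rows and columns, selected in two steps and in one step
chained = demo[demo['W'] > 0][['Y', 'X']]
single = demo.loc[demo['W'] > 0, ['Y', 'X']]
print(chained.equals(single))

# Assigning through one loc call updates the original frame
demo.loc[demo['W'] > 0, 'Y'] = 0
print(demo['Y'].tolist())
```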


To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [27]:
df[(df['W']>0) & (df['Y'] > 1)]

|   | W | X | Y | Z |
|---|---|---|---|---|
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
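A short sketch of the other element-wise operators, on a made-up frame demo: | for "or" and ~ for "not". It also shows why the Python keyword and cannot be used here: it asks for a single truth value of a whole Series, which pandas refuses with a ValueError.

```python
import pandas as pd

demo = pd.DataFrame({'W': [2.0, -1.0, 0.5], 'Y': [0.5, 0.2, 2.0]},
                    index=['A', 'B', 'C'])

both = demo[(demo['W'] > 0) & (demo['Y'] > 1)]    # both conditions hold
either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]  # at least one holds
negated = demo[~(demo['W'] > 0)]                  # condition does not hold

# The keyword 'and' needs one True/False value, so it fails on a Series
try:
    demo[(demo['W'] > 0) and (demo['Y'] > 1)]
    and_failed = False
except ValueError:
    and_failed = True

print(list(both.index), list(either.index), list(negated.index), and_failed)
```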


## More details on the Index

Let's look at a few more indexing features, including resetting the index or setting it to something else. We will also talk about index hierarchy!

In [28]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [29]:
# Reset to the default integer index 0, 1, ..., n-1 (the old index becomes a column)
df.reset_index()

|   | index | W | X | Y | Z |
|---|---|---|---|---|---|
| 0 | A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| 1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
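By default reset_index keeps the old labels as a new column named index, as in the output above. If the old labels are not needed, the drop=True option discards them. A minimal sketch with a made-up frame demo:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

kept = demo.reset_index()               # old labels move into an 'index' column
dropped = demo.reset_index(drop=True)   # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(demo.index))                 # demo itself is unchanged
```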


In [30]:
newind = 'LIM AQP CUZ PIU'.split()

In [31]:
df['Regiones'] = newind

In [32]:
df

|   | W | X | Y | Z | Regiones |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | LIM |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | AQP |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | CUZ |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | PIU |


In [33]:
df.set_index('Regiones')

| Regiones | W | X | Y | Z |
|---|---|---|---|---|
| LIM | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| AQP | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| CUZ | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| PIU | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [34]:
df

|   | W | X | Y | Z | Regiones |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | LIM |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | AQP |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | CUZ |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | PIU |


In [35]:
df.set_index('Regiones',inplace=True)

In [36]:
df

| Regiones | W | X | Y | Z |
|---|---|---|---|---|
| LIM | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| AQP | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| CUZ | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| PIU | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
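Note that set_index replaces the existing index, so the previous labels are lost. If they should be kept, one option is to call reset_index first so they become a regular column. This sketch uses a made-up two-row frame demo.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'Regiones': ['LIM', 'AQP']}, index=['A', 'B'])

# The old labels 'A' and 'B' are discarded
replaced = demo.set_index('Regiones')

# The old labels are first moved into a column named 'index', then kept
preserved = demo.reset_index().set_index('Regiones')

print(list(replaced.columns))
print(list(preserved.columns))
```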


## Index Hierarchy and MultiIndex

Let's go over how to work with a MultiIndex. First we will create a quick example of what a multi-indexed DataFrame looks like:

In [37]:
# Index levels: outer (exterior) and inner (interior)
exterior = ['G1','G1','G1','G2','G2','G2']
interior = [1,2,3,1,2,3]
jerarquia_index = list(zip(exterior,interior))
jerarquia_index = pd.MultiIndex.from_tuples(jerarquia_index)

In [38]:
jerarquia_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

In [39]:
df = pd.DataFrame(np.random.randn(6,2),index=jerarquia_index,columns=['A','B'])
df

|   |   | A | B |
|---|---|---|---|
| G1 | 1 | 0.302665 | 1.693723 |
| G1 | 2 | -1.706086 | -1.159119 |
| G1 | 3 | -0.134841 | 0.390528 |
| G2 | 1 | 0.166905 | 0.184502 |
| G2 | 2 | 0.807706 | 0.07296 |
| G2 | 3 | 0.638787 | 0.329646 |
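When the index is a full cross of the outer and inner labels, as it is here, MultiIndex.from_product is a shorter way to build it than zipping two lists. The sketch below is an alternative to the cell above, not what the notebook itself runs; the level names can also be given up front with the names argument.

```python
import pandas as pd

# Built as in the notebook: explicit (outer, inner) tuples
via_tuples = pd.MultiIndex.from_tuples(
    list(zip(['G1', 'G1', 'G1', 'G2', 'G2', 'G2'], [1, 2, 3, 1, 2, 3])))

# Built from the cartesian product of the two label lists
via_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
print(list(via_product) == list(via_tuples))

# Level names can be set at construction time
named = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Grupo', 'Num'])
print(list(named.names))
```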


Now let's see how to index this! For an index hierarchy we use df.loc[]; if the hierarchy were on the columns axis, you would simply use normal bracket notation, df[]. Selecting one label of the outer level returns the corresponding sub-DataFrame:

In [40]:
df.loc['G1']

|   | A | B |
|---|---|---|
| 1 | 0.302665 | 1.693723 |
| 2 | -1.706086 | -1.159119 |
| 3 | -0.134841 | 0.390528 |


In [41]:
df.loc['G1'].loc[1]

A    0.302665
B    1.693723
Name: 1, dtype: float64
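The chained call above can also be written as a single loc lookup by passing the labels of both levels as a tuple. This is a minimal sketch on a made-up frame demo with the same index shape; the values come from np.arange so they are easy to check by eye.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

chained = demo.loc['G1'].loc[1]   # two lookups, one per level
direct = demo.loc[('G1', 1)]      # one lookup with a tuple of labels
cell = demo.loc[('G2', 3), 'B']   # a tuple for the row plus a column label

print(chained.tolist(), direct.tolist(), cell)
```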

In [42]:
df.index.names

FrozenList([None, None])

In [43]:
df.index.names = ['Grupo','Num']

In [44]:
df

| Grupo | Num | A | B |
|---|---|---|---|
| G1 | 1 | 0.302665 | 1.693723 |
| G1 | 2 | -1.706086 | -1.159119 |
| G1 | 3 | -0.134841 | 0.390528 |
| G2 | 1 | 0.166905 | 0.184502 |
| G2 | 2 | 0.807706 | 0.07296 |
| G2 | 3 | 0.638787 | 0.329646 |


In [45]:
df.xs('G1')

| Num | A | B |
|---|---|---|
| 1 | 0.302665 | 1.693723 |
| 2 | -1.706086 | -1.159119 |
| 3 | -0.134841 | 0.390528 |


In [46]:
# Pass the key as a tuple (passing a list to xs is deprecated)
df.xs(('G1',1))

A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64

In [47]:
df.xs(1,level='Num')

| Grupo | A | B |
|---|---|---|
| G1 | 0.302665 | 1.693723 |
| G2 | 0.166905 | 0.184502 |
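As the output above shows, xs removes the level it selects on. A brief sketch, on a made-up frame demo, of two related options: the drop_level=False argument of xs, and selecting the same rows with loc and pd.IndexSlice.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Grupo', 'Num'])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

dropped_level = demo.xs(1, level='Num')                 # 'Num' is removed
kept_level = demo.xs(1, level='Num', drop_level=False)  # both levels stay

# The same rows selected with loc: every 'Grupo', and 'Num' equal to 1
via_loc = demo.loc[pd.IndexSlice[:, 1], :]

print(dropped_level.index.tolist())
print(kept_level.index.nlevels)
print(via_loc['A'].tolist())
```

Slicing a MultiIndex with loc in this way generally expects the index to be sorted, which is the case for an index built with from_product from sorted lists.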


# Great job!