# Introduction to the pandas module
## Brief description and examples
## Part 1. Creating Series and DataFrames

### Marco Arieli Herrera-Valdez$^1$
#### $^1$ Laboratorio de Fisiología de Sistemas, Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México

(Based on the tutorial https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)


Imports

In [1]:
import numpy as np
import pandas as pd

Creating a Series by passing a list of values; pandas assigns a default integer index

In [7]:
s = pd.Series([2,4,5,'np.pi',3])
print(s)

0        2
1        4
2        5
3    np.pi
4        3
dtype: object
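
Because `'np.pi'` is quoted in the cell above, pandas stores it as a string and the Series falls back to `dtype: object`. As a minimal sketch of the numeric alternative (the name `s_num` is introduced here for illustration and is not part of the original notebook), passing the unquoted constant should give a floating-point Series:

```python
import numpy as np
import pandas as pd

# Unquoted np.pi is a float, so the integers are upcast and the
# whole Series is stored as float64
s_num = pd.Series([2, 4, 5, np.pi, 3])
print(s_num.dtype)
```

With a numeric dtype, vectorized arithmetic such as `s_num * 2` works element-wise on all values.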


Arrays of dates over a specified range

In [48]:
n=17
fechas = pd.date_range('20200221',periods=n)
print(fechas)

DatetimeIndex(['2020-02-21', '2020-02-22', '2020-02-23', '2020-02-24',
               '2020-02-25', '2020-02-26', '2020-02-27', '2020-02-28',
               '2020-02-29', '2020-03-01', '2020-03-02', '2020-03-03',
               '2020-03-04', '2020-03-05', '2020-03-06', '2020-03-07',
               '2020-03-08'],
              dtype='datetime64[ns]', freq='D')
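
The same range can also be described by its two endpoints instead of a number of periods. A short sketch, where `fechas_alt` is a hypothetical name introduced here:

```python
import pandas as pd

n = 17
fechas = pd.date_range('20200221', periods=n)

# Give start and end dates; the default frequency is daily ('D')
fechas_alt = pd.date_range(start='2020-02-21', end='2020-03-08')

# Both calls should describe the same 17 consecutive days
print(fechas.equals(fechas_alt))
```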


Creating a `DataFrame` from a NumPy array, with dates as the row index and labeled columns

In [49]:
df= pd.DataFrame(np.random.randn(len(fechas),4),index=fechas,columns=list('ABCD'))
print(df)

                   A         B         C         D
2020-02-21 -0.586041  0.663774 -1.622344  1.092950
2020-02-22 -0.173809 -0.919900 -0.542429  1.439191
2020-02-23 -0.669590  1.845015 -0.624577 -0.130137
2020-02-24  0.976177  1.707507 -1.566449  1.471524
2020-02-25 -0.458236 -0.805270 -0.651588 -0.017644
2020-02-26 -0.236260 -1.197829  0.828102  1.540503
2020-02-27  1.296408  0.013231 -1.069230  1.256078
2020-02-28 -0.001822 -0.545049  0.847808  1.402061
2020-02-29 -1.325373  1.135355  0.996185 -0.734947
2020-03-01 -1.333509 -1.235313 -0.041608 -1.201182
2020-03-02  0.596006  0.347821  1.525043  1.381117
2020-03-03  0.759615  1.007199  0.664399 -0.571112
2020-03-04  1.164883  1.022742 -0.818324  1.483369
2020-03-05  1.142191 -1.948610  1.119062  0.147237
2020-03-06 -0.706992 -0.913163 -0.054579 -0.830121
2020-03-07  0.358952  0.675526 -1.300356 -2.441243
2020-03-08 -0.363817 -0.440014 -0.872687 -1.120922
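
`np.random.randn` draws new values on every run, so the numbers printed above will differ between sessions. A sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed `0` and the name `df_rep` are arbitrary choices made for illustration:

```python
import numpy as np
import pandas as pd

fechas = pd.date_range('20200221', periods=17)

# A seeded generator returns the same samples on every run
rng = np.random.default_rng(0)
df_rep = pd.DataFrame(rng.standard_normal((len(fechas), 4)),
                      index=fechas, columns=list('ABCD'))
print(df_rep.shape)
```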


Timestamps created from a date string

In [50]:
pd.Timestamp('20200212')

Timestamp('2020-02-12 00:00:00')

Categorical array

In [51]:
pd.Categorical(["test","train","test","train"])

[test, train, test, train]
Categories (2, object): [test, train]
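
Internally, a categorical array keeps each distinct label once and represents the values as integer codes that point into that list. A brief sketch (the variable name `cat` is introduced here for illustration):

```python
import pandas as pd

cat = pd.Categorical(["test", "train", "test", "train"])

# The distinct labels
print(cat.categories.tolist())   # ['test', 'train']
# Position of each value within the categories
print(cat.codes.tolist())        # [0, 1, 0, 1]
```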

A DataFrame built from a dictionary: each key becomes a column, scalar values are repeated to match the index, and the resulting data can be sorted along either axis

In [52]:
df2= pd.DataFrame({'A': 1.0,
                  'B': pd.Timestamp('20200212'),
                  'C': pd.Series(1,index=list(range(4)),dtype='float32'),
                  'D': np.array([3]*4, dtype='int32'),
                  'E': pd.Categorical(["test","train","test","train"]),
                  'F': 'foo'})
print(df2)

     A          B    C  D      E    F
0  1.0 2020-02-12  1.0  3   test  foo
1  1.0 2020-02-12  1.0  3  train  foo
2  1.0 2020-02-12  1.0  3   test  foo
3  1.0 2020-02-12  1.0  3  train  foo


The columns have different data types

In [53]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
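
Since the columns differ in type, it can be useful to pick columns by dtype. A minimal sketch using `DataFrame.select_dtypes`; the name `numeric` is hypothetical:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.0,
                    'B': pd.Timestamp('20200212'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# Keep only the numeric columns (here float64, float32 and int32)
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))   # ['A', 'C', 'D']
```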

In an IPython session, the column names and other attributes of a DataFrame can be listed with tab completion

### Viewing data

In [54]:
df.head()

                   A         B         C         D
2020-02-21 -0.586041  0.663774 -1.622344  1.092950
2020-02-22 -0.173809 -0.919900 -0.542429  1.439191
2020-02-23 -0.669590  1.845015 -0.624577 -0.130137
2020-02-24  0.976177  1.707507 -1.566449  1.471524
2020-02-25 -0.458236 -0.805270 -0.651588 -0.017644


In [55]:
df.tail(3)

                   A         B         C         D
2020-03-06 -0.706992 -0.913163 -0.054579 -0.830121
2020-03-07  0.358952  0.675526 -1.300356 -2.441243
2020-03-08 -0.363817 -0.440014 -0.872687 -1.120922


In [56]:
df.index

DatetimeIndex(['2020-02-21', '2020-02-22', '2020-02-23', '2020-02-24',
               '2020-02-25', '2020-02-26', '2020-02-27', '2020-02-28',
               '2020-02-29', '2020-03-01', '2020-03-02', '2020-03-03',
               '2020-03-04', '2020-03-05', '2020-03-06', '2020-03-07',
               '2020-03-08'],
              dtype='datetime64[ns]', freq='D')

In [57]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

### Converting data

`DataFrame.to_numpy()` returns a NumPy representation of the data, without the row (index) or column labels

In [58]:
df.to_numpy()

array([[-5.86041163e-01,  6.63773508e-01, -1.62234396e+00,
         1.09294991e+00],
       [-1.73809231e-01, -9.19899505e-01, -5.42428975e-01,
         1.43919115e+00],
       [-6.69590218e-01,  1.84501458e+00, -6.24577215e-01,
        -1.30137422e-01],
       [ 9.76176665e-01,  1.70750724e+00, -1.56644870e+00,
         1.47152388e+00],
       [-4.58236326e-01, -8.05269595e-01, -6.51587940e-01,
        -1.76444707e-02],
       [-2.36260155e-01, -1.19782864e+00,  8.28102230e-01,
         1.54050273e+00],
       [ 1.29640758e+00,  1.32305992e-02, -1.06922992e+00,
         1.25607777e+00],
       [-1.82185528e-03, -5.45049332e-01,  8.47808071e-01,
         1.40206104e+00],
       [-1.32537263e+00,  1.13535467e+00,  9.96185352e-01,
        -7.34946676e-01],
       [-1.33350922e+00, -1.23531251e+00, -4.16084074e-02,
        -1.20118182e+00],
       [ 5.96006028e-01,  3.47820915e-01,  1.52504304e+00,
         1.38111721e+00],
       [ 7.59614732e-01,  1.00719862e+00,  6.64399249e-01,
      

In [59]:
df2.to_numpy()

array([[1.0, Timestamp('2020-02-12 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-02-12 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-02-12 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-02-12 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
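
The `dtype=object` result above arises because the columns of `df2` share no common numeric type. As a sketch, restricting the frame to its numeric columns first should give a floating-point array; the column subset and the names `arr_all` and `arr_num` are chosen here for illustration:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.0,
                    'B': pd.Timestamp('20200212'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# Mixed columns: NumPy has to fall back to generic Python objects
arr_all = df2.to_numpy()
# Numeric columns only: the values are upcast to a common float type
arr_num = df2[['A', 'C', 'D']].to_numpy()

print(arr_all.dtype, arr_num.dtype)
```

Object arrays lose most of NumPy's vectorized speed, so selecting homogeneous columns before converting is usually worthwhile.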

### Describing and rearranging data

In [60]:
df.describe()

               A          B          C          D
count  17.000000  17.000000  17.000000  17.000000
mean    0.025811   0.024295  -0.187269   0.245101
std     0.851461   1.125119   1.008867   1.242169
min    -1.333509  -1.948610  -1.622344  -2.441243
25%    -0.586041  -0.913163  -0.872687  -0.734947
50%    -0.173809   0.013231  -0.542429   0.147237
75%     0.759615   1.007199   0.828102   1.402061
max     1.296408   1.845015   1.525043   1.540503
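
Each row of this summary corresponds to a method that can also be called directly on the DataFrame. A sketch using a seeded random frame as stand-in data (the seed and the names `df_demo` and `summary` are assumptions, so the values differ from those shown above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.standard_normal((17, 4)), columns=list('ABCD'))

summary = df_demo.describe()

# Individual statistics agree with the matching rows of describe()
print(np.allclose(summary.loc['mean'], df_demo.mean()))
print(np.allclose(summary.loc['std'], df_demo.std()))
print(np.allclose(summary.loc['50%'], df_demo.median()))
```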


Transposing a DataFrame

In [61]:
df.T

   2020-02-21  2020-02-22  2020-02-23  2020-02-24  2020-02-25  2020-02-26  2020-02-27  2020-02-28  2020-02-29  2020-03-01  2020-03-02  2020-03-03  2020-03-04  2020-03-05  2020-03-06  2020-03-07  2020-03-08
A   -0.586041   -0.173809   -0.669590    0.976177   -0.458236   -0.236260    1.296408   -0.001822   -1.325373   -1.333509    0.596006    0.759615    1.164883    1.142191   -0.706992    0.358952   -0.363817
B    0.663774   -0.919900    1.845015    1.707507   -0.805270   -1.197829    0.013231   -0.545049    1.135355   -1.235313    0.347821    1.007199    1.022742   -1.948610   -0.913163    0.675526   -0.440014
C   -1.622344   -0.542429   -0.624577   -1.566449   -0.651588    0.828102   -1.069230    0.847808    0.996185   -0.041608    1.525043    0.664399   -0.818324    1.119062   -0.054579   -1.300356   -0.872687
D    1.092950    1.439191   -0.130137    1.471524   -0.017644    1.540503    1.256078    1.402061   -0.734947   -1.201182    1.381117   -0.571112    1.483369    0.147237   -0.830121   -2.441243   -1.120922
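
Transposition swaps the row and column labels, so a frame with 17 rows and 4 columns becomes one with 4 rows and 17 columns. A small sketch with stand-in data (the frame `df_demo` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

fechas = pd.date_range('20200221', periods=17)
df_demo = pd.DataFrame(np.arange(17 * 4, dtype=float).reshape(17, 4),
                       index=fechas, columns=list('ABCD'))

print(df_demo.shape, df_demo.T.shape)   # (17, 4) (4, 17)

# The old column labels become the new row index
print(list(df_demo.T.index))            # ['A', 'B', 'C', 'D']

# Transposing twice should recover the original frame
print(df_demo.T.T.equals(df_demo))
```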


A DataFrame can be sorted along either axis, in ascending or descending order, and by specific criteria such as the values of a column

In [62]:
df.sort_index(axis=1,ascending=False)

                   D         C         B         A
2020-02-21  1.092950 -1.622344  0.663774 -0.586041
2020-02-22  1.439191 -0.542429 -0.919900 -0.173809
2020-02-23 -0.130137 -0.624577  1.845015 -0.669590
2020-02-24  1.471524 -1.566449  1.707507  0.976177
2020-02-25 -0.017644 -0.651588 -0.805270 -0.458236
2020-02-26  1.540503  0.828102 -1.197829 -0.236260
2020-02-27  1.256078 -1.069230  0.013231  1.296408
2020-02-28  1.402061  0.847808 -0.545049 -0.001822
2020-02-29 -0.734947  0.996185  1.135355 -1.325373
2020-03-01 -1.201182 -0.041608 -1.235313 -1.333509


In [63]:
df.sort_values(by='B')

                   A         B         C         D
2020-03-05  1.142191 -1.948610  1.119062  0.147237
2020-03-01 -1.333509 -1.235313 -0.041608 -1.201182
2020-02-26 -0.236260 -1.197829  0.828102  1.540503
2020-02-22 -0.173809 -0.919900 -0.542429  1.439191
2020-03-06 -0.706992 -0.913163 -0.054579 -0.830121
2020-02-25 -0.458236 -0.805270 -0.651588 -0.017644
2020-02-28 -0.001822 -0.545049  0.847808  1.402061
2020-03-08 -0.363817 -0.440014 -0.872687 -1.120922
2020-02-27  1.296408  0.013231 -1.069230  1.256078
2020-03-02  0.596006  0.347821  1.525043  1.381117
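
`sort_values` also accepts `ascending=False` and a list of columns, which covers the "different orders" and "specific criteria" mentioned earlier. A sketch with small stand-in data (the frame `df_demo` and its values are invented for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3, 4],
                        'B': [0.5, -1.0, 0.5, 2.0]})

# Largest values of B first
desc = df_demo.sort_values(by='B', ascending=False)
print(desc['B'].tolist())      # [2.0, 0.5, 0.5, -1.0]

# Sort by B ascending; ties in B are broken by A in descending order
multi = df_demo.sort_values(by=['B', 'A'], ascending=[True, False])
print(multi['A'].tolist())     # [2, 3, 1, 4]
```

Both calls return a new DataFrame and leave `df_demo` unchanged.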
