# Introduction to Pandas

In this notebook we explore a new library for working with structured data, with many features that are useful for machine learning and data science tasks.

Pandas lets us import, process, and explore data in a variety of formats, building on several concepts and utilities already familiar from NumPy.


In [1]:
import numpy as np
import pandas as pd

## Data structures

Pandas offers two main data structures that its operations are built around: DataFrames and Series.
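
As a quick illustration of how the two structures relate, the sketch below builds a small DataFrame and pulls out one column, which comes back as a Series sharing the same row labels. The table contents and the names `table` and `column` are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical toy table; the labels and values are illustrative only.
table = pd.DataFrame(
    {"height": [1.70, 1.82], "weight": [65.0, 80.0]},
    index=["ana", "luis"],
)

# Selecting a single column yields a Series with the same row labels.
column = table["height"]

print(type(table).__name__)
print(type(column).__name__)
print(column.index.equals(table.index))
```
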


### Series

A Series object is a one-dimensional array capable of holding any data type. The labels of its elements are collectively called the index.

To create a `Series` object, you can use:

```python
s = pd.Series(data, index=index)
```

where `data` can be a Python dictionary, a NumPy array, or a scalar value.

`index` is a list of labels.
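
The three kinds of `data` can be compared side by side in a minimal sketch like the following. The variable names (`labels`, `from_scalar`, and so on) are assumptions made for this example. A scalar is repeated once per label, and a dictionary is looked up by key, so its values follow the order of `index`.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c"]

# Scalar: the value is repeated for every label in the index.
from_scalar = pd.Series(5.0, index=labels)

# NumPy array: values are paired with labels by position.
from_array = pd.Series(np.array([1, 2, 3]), index=labels)

# Dictionary: values are looked up by key, so they follow the order of `labels`.
from_dict = pd.Series({"b": 1, "a": 0, "c": 2}, index=labels)

print(from_scalar)
print(from_array)
print(from_dict)
```
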


In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])


In [3]:
s

a   -0.904461
b    1.307372
c   -0.155950
d    1.198162
e    1.316672
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [6]:
s.iloc[0]  # positional access; plain s[0] on a label-indexed Series is deprecated in recent pandas

-0.9044607672905638

In [7]:
s[:3]

a   -0.904461
b    1.307372
c   -0.155950
dtype: float64

In [8]:
s[s > s.median()]

b    1.307372
e    1.316672
dtype: float64

In [9]:
s.median()

1.1981619779372439

In [10]:
s.iloc[[4, 3, 1]]  # positional selection; integer keys via s[[...]] are deprecated on a label-indexed Series

e    1.316672
d    1.198162
b    1.307372
dtype: float64

In [11]:
np.exp(s)

a    0.404760
b    3.696446
c    0.855602
d    3.314020
e    3.730986
dtype: float64

In [12]:
s['a']

-0.9044607672905638

In [14]:
s['e'] = 12.
s

a    -0.904461
b     1.307372
c    -0.155950
d     1.198162
e    12.000000
dtype: float64

In [15]:
'e' in s

True

In [16]:
'f' in s

False

In [17]:
s + s

a    -1.808922
b     2.614744
c    -0.311900
d     2.396324
e    24.000000
dtype: float64

In [18]:
s[1:] + s[:-1]

a         NaN
b    2.614744
c   -0.311900
d    2.396324
e         NaN
dtype: float64

In [19]:
s = pd.Series(np.random.randn(5), name='something')

In [20]:
s.name

'something'

### DataFrame

A DataFrame is a two-dimensional data structure whose columns may hold different data types. It is often compared to an Excel spreadsheet or a database table.

The DataFrame is generally the most used pandas object. It accepts several kinds of input:
  - a dictionary of 1-D arrays, lists, dicts, or Series
  - a 2-D NumPy array
  - a Series
  - another DataFrame

Along with the data, you can specify an `index` (row labels) and `columns` (column labels).
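
The input kinds not covered by the dictionary-based cells can be sketched as follows: a 2-D NumPy array, a Series, and another DataFrame. The variable names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# 2-D NumPy array: row and column labels come from `index` and `columns`.
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr, index=["x", "y", "z"], columns=["one", "two"])

# A named Series becomes a single-column DataFrame; its name is the column label.
ser = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"], name="one")
from_series = pd.DataFrame(ser)

# Another DataFrame: passing `columns` keeps only the requested columns.
from_frame = pd.DataFrame(from_array, columns=["two"])

print(from_array)
print(from_series)
print(from_frame)
```
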

In [24]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [25]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [26]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


In [27]:
df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
df.index

Index(['d', 'b', 'a'], dtype='object')

In [28]:
df.columns

Index(['two', 'three'], dtype='object')

In [29]:
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [30]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


In [31]:
data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10' is a 10-byte string; the older 'a10' alias is deprecated in NumPy 2.0
data

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [32]:
data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
pd.DataFrame(data)

   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'


In [33]:
pd.DataFrame(data, index=['first', 'second'])

        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'


In [34]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0


In [35]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [36]:
df

                   A         B         C         D
2013-01-01 -0.707569  1.214420 -1.968239 -0.836022
2013-01-02 -0.426260 -0.819122 -2.466912 -1.329321
2013-01-03  0.492334 -0.428974 -0.273841  0.839468
2013-01-04 -2.427030 -0.015659 -0.627664  0.544265
2013-01-05  0.002603  0.696186  1.130312  1.292582
2013-01-06  0.064660 -1.207287  0.269181 -0.726179


## Operations



In [38]:
df.head(2)

                   A         B         C         D
2013-01-01 -0.707569  1.214420 -1.968239 -0.836022
2013-01-02 -0.426260 -0.819122 -2.466912 -1.329321


In [39]:
df.tail(3)

                   A         B         C         D
2013-01-04 -2.427030 -0.015659 -0.627664  0.544265
2013-01-05  0.002603  0.696186  1.130312  1.292582
2013-01-06  0.064660 -1.207287  0.269181 -0.726179


In [40]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.500210 -0.093406 -0.656194 -0.035868
std    1.031324  0.918685  1.356104  1.063705
min   -2.427030 -1.207287 -2.466912 -1.329321
25%   -0.637242 -0.721585 -1.633095 -0.808561
50%   -0.211829 -0.222316 -0.450752 -0.090957
75%    0.049145  0.518225  0.133426  0.765668
max    0.492334  1.214420  1.130312  1.292582


In [41]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.707569   -0.426260    0.492334   -2.427030    0.002603    0.064660
B    1.214420   -0.819122   -0.428974   -0.015659    0.696186   -1.207287
C   -1.968239   -2.466912   -0.273841   -0.627664    1.130312    0.269181
D   -0.836022   -1.329321    0.839468    0.544265    1.292582   -0.726179


In [42]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01 -0.836022 -1.968239  1.214420 -0.707569
2013-01-02 -1.329321 -2.466912 -0.819122 -0.426260
2013-01-03  0.839468 -0.273841 -0.428974  0.492334
2013-01-04  0.544265 -0.627664 -0.015659 -2.427030
2013-01-05  1.292582  1.130312  0.696186  0.002603
2013-01-06 -0.726179  0.269181 -1.207287  0.064660


In [43]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-06  0.064660 -1.207287  0.269181 -0.726179
2013-01-02 -0.426260 -0.819122 -2.466912 -1.329321
2013-01-03  0.492334 -0.428974 -0.273841  0.839468
2013-01-04 -2.427030 -0.015659 -0.627664  0.544265
2013-01-05  0.002603  0.696186  1.130312  1.292582
2013-01-01 -0.707569  1.214420 -1.968239 -0.836022


In [44]:
df.to_numpy()

array([[-0.70756876,  1.21441993, -1.96823863, -0.83602192],
       [-0.42625984, -0.81912215, -2.46691202, -1.32932092],
       [ 0.49233387, -0.42897435, -0.27384076,  0.83946849],
       [-2.42703046, -0.01565853, -0.62766354,  0.54426477],
       [ 0.00260252,  0.69618613,  1.13031153,  1.2925817 ],
       [ 0.06465975, -1.20728717,  0.26918105, -0.72617943]])