# Introduction to Pandas

In this notebook we explore pandas, a library for working with structured data that offers many features useful for machine learning and data science tasks.

Pandas lets us import, process, and explore data in a variety of formats, building on several concepts and utilities already familiar from NumPy.


In [4]:
import numpy as np
import pandas as pd

## Data structures

Pandas provides two main data structures that most operations are built around: the DataFrame and the Series.
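
As a rough illustration of how the two structures relate, the sketch below builds a small DataFrame and pulls one column out of it, which comes back as a Series. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
table = pd.DataFrame({"height": [1.6, 1.8], "weight": [55.0, 80.0]})

# Selecting a single column returns a Series that shares the row index.
column = table["height"]

print(type(table).__name__)
print(type(column).__name__)
```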


### Series

A Series is a one-dimensional labeled array that can hold any data type. The labels of its elements are collectively called the index.

To create a `Series`, use:

```python
s = pd.Series(data, index=index)
```

where `data` can be a Python dict, a NumPy array, or a scalar value.

`index` is a list of labels.
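
A minimal sketch of the three kinds of `data` mentioned above. The variable names and values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c"]

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array: the index must have the same length as the data.
from_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=labels)

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5.0, index=labels)

print(from_dict)
print(from_array)
print(from_scalar)
```

With a scalar, the `index` argument determines the length of the resulting Series.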


In [11]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])


In [12]:
s

a   -0.748200
b    0.527469
c   -0.307823
d    0.402765
e   -0.073102
dtype: float64

In [13]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [15]:
d = {'b': 1, 'a': 0, 'c': 2, 'x': 9}
pd.Series(d)

b    1
a    0
c    2
x    9
dtype: int64

In [16]:
s.iloc[0]  # positional access; plain s[0] on a labeled index is deprecated in recent pandas

-0.7482000031762714

In [17]:
s[:3]

a   -0.748200
b    0.527469
c   -0.307823
dtype: float64

In [18]:
s[s > s.median()]

b    0.527469
d    0.402765
dtype: float64

In [19]:
s.median()

-0.07310173254155264

In [20]:
s.iloc[[4, 3, 1]]  # select by position, in the given order

e   -0.073102
d    0.402765
b    0.527469
dtype: float64

In [21]:
np.exp(s)

a    0.473218
b    1.694638
c    0.735046
d    1.495955
e    0.929506
dtype: float64

In [22]:
s['a']

-0.7482000031762714

In [24]:
s['e'] = 12.0
s

a    -0.748200
b     0.527469
c    -0.307823
d     0.402765
e    12.000000
dtype: float64

In [25]:
'e' in s

True

In [26]:
'f' in s

False

In [27]:
s + s

a    -1.496400
b     1.054939
c    -0.615645
d     0.805529
e    24.000000
dtype: float64

In [29]:
s[1:]

b     0.527469
c    -0.307823
d     0.402765
e    12.000000
dtype: float64

In [30]:
s[:-1]

a   -0.748200
b    0.527469
c   -0.307823
d    0.402765
dtype: float64

In [28]:
s[1:] + s[:-1]

a         NaN
b    1.054939
c   -0.615645
d    0.805529
e         NaN
dtype: float64
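
The `NaN` entries in the result above appear because arithmetic between two Series matches values by index label, not by position, so a label present in only one operand yields a missing value. A minimal sketch with made-up values follows; `Series.add` with `fill_value` is shown as one possible way to avoid the `NaN`, not as the only option.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Labels 'a' and 'd' exist in only one operand, so the sum is NaN there.
aligned = left + right

# fill_value treats a label missing from one side as 0 before adding.
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```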

In [31]:
s = pd.Series(np.random.randn(5), name='something')

In [32]:
s.name

'something'

### DataFrame

A DataFrame is a two-dimensional data structure whose columns may hold different data types. It is often compared to an Excel spreadsheet or a database table.
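
To make the point about per-column types concrete, here is a small sketch. The records are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical records; each column holds a different type.
people = pd.DataFrame({
    "name": ["Ana", "Luis"],
    "age": [31, 45],
    "height_m": [1.65, 1.80],
})

# dtypes reports one type per column.
print(people.dtypes)
```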

The DataFrame is generally the most commonly used pandas object. It accepts several kinds of input:
  - a dict of 1-D arrays, lists, dicts, or Series
  - a 2-D NumPy array
  - a Series
  - another DataFrame

Along with the data, you can specify an `index` (row labels) and `columns` (column labels).
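
The sketch below tries a few of these input types with small made-up values. The labels `r1`, `x`, `value`, and so on are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array, with explicit row and column labels.
from_array = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
)

# From a Series: the Series name becomes the single column label.
from_series = pd.DataFrame(pd.Series([1.0, 2.0], name="value"))

# From another DataFrame: labels and values carry over.
from_frame = pd.DataFrame(from_array)

print(from_array)
print(from_series)
print(from_frame.equals(from_array))
```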

In [33]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
d

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64,
 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [34]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [37]:
df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['one', 'two', 'three'])

In [38]:
df.index

Index(['d', 'b', 'a'], dtype='object')

In [39]:
df.columns

Index(['one', 'two', 'three'], dtype='object')

In [40]:
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [41]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


In [42]:
# 'S10' is a 10-byte string field; the older 'a10' alias was removed in NumPy 2.0
data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [43]:
data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
pd.DataFrame(data)

   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'


In [44]:
pd.DataFrame(data, index=['first', 'second'])

        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'


In [45]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0


In [46]:
pd.date_range('20130101', periods=6)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [47]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [48]:
df

                   A         B         C         D
2013-01-01  1.521969 -2.682575  0.467486  1.530601
2013-01-02  1.340206  1.066603 -0.658575 -0.144330
2013-01-03  0.056583 -0.932226  2.008403 -0.600352
2013-01-04 -1.266713  0.418023 -0.451682  0.726455
2013-01-05 -1.866548 -0.223139  1.678463 -0.187838
2013-01-06 -1.483031 -0.424513 -1.658431 -0.393625


## Operations



In [50]:
df.head(3)

                   A         B         C         D
2013-01-01  1.521969 -2.682575  0.467486  1.530601
2013-01-02  1.340206  1.066603 -0.658575 -0.144330
2013-01-03  0.056583 -0.932226  2.008403 -0.600352


In [51]:
df.tail(3)

                   A         B         C         D
2013-01-04 -1.266713  0.418023 -0.451682  0.726455
2013-01-05 -1.866548 -0.223139  1.678463 -0.187838
2013-01-06 -1.483031 -0.424513 -1.658431 -0.393625


In [52]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.282922 -0.462971  0.230944  0.155152
std    1.478067  1.290084  1.423909  0.812099
min   -1.866548 -2.682575 -1.658431 -0.600352
25%   -1.428952 -0.805298 -0.606852 -0.342178
50%   -0.605065 -0.323826  0.007902 -0.166084
75%    1.019300  0.257732  1.375719  0.508759
max    1.521969  1.066603  2.008403  1.530601


In [53]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    1.521969    1.340206    0.056583   -1.266713   -1.866548   -1.483031
B   -2.682575    1.066603   -0.932226    0.418023   -0.223139   -0.424513
C    0.467486   -0.658575    2.008403   -0.451682    1.678463   -1.658431
D    1.530601   -0.144330   -0.600352    0.726455   -0.187838   -0.393625


In [55]:
df.sort_index(axis=0, ascending=False)

                   A         B         C         D
2013-01-06 -1.483031 -0.424513 -1.658431 -0.393625
2013-01-05 -1.866548 -0.223139  1.678463 -0.187838
2013-01-04 -1.266713  0.418023 -0.451682  0.726455
2013-01-03  0.056583 -0.932226  2.008403 -0.600352
2013-01-02  1.340206  1.066603 -0.658575 -0.144330
2013-01-01  1.521969 -2.682575  0.467486  1.530601


In [56]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-01  1.521969 -2.682575  0.467486  1.530601
2013-01-03  0.056583 -0.932226  2.008403 -0.600352
2013-01-06 -1.483031 -0.424513 -1.658431 -0.393625
2013-01-05 -1.866548 -0.223139  1.678463 -0.187838
2013-01-04 -1.266713  0.418023 -0.451682  0.726455
2013-01-02  1.340206  1.066603 -0.658575 -0.144330


In [59]:
df.to_numpy()

array([[ 1.52196867, -2.68257515,  0.46748612,  1.53060118],
       [ 1.34020618,  1.06660345, -0.65857535, -0.14433016],
       [ 0.05658286, -0.93222572,  2.00840298, -0.60035237],
       [-1.26671313,  0.41802262, -0.45168211,  0.72645521],
       [-1.8665482 , -0.22313924,  1.67846325, -0.18783837],
       [-1.48303102, -0.4245131 , -1.65843076, -0.39362451]])