# CHAPTER 5
---
# Getting Started with pandas

In [77]:
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures <font color='green'>[Essential]</font>
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

### Series <font color='green'>[Essential]</font><font color='blue'>[INTRANT]</font>
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The simplest
Series is formed from only an array of data:

In [78]:
obj = pd.Series([4, 7, -5, 3])

In [79]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [80]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [81]:
obj.index

RangeIndex(start=0, stop=4, step=1)

We can choose the index by passing it as an argument.

In [82]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [83]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

We use the index to access the values.

In [84]:
obj2['a']

-5

In [85]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

The index <-> value link is preserved when we filter or apply mathematical operations.

In [86]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64
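The cell above shows a mathematical operation; the filtering case can be sketched the same way. This is a minimal illustration that rebuilds `obj2` with the values it holds after the assignment `obj2['d'] = 6`; the name `positives` is introduced here only for the example.

```python
import pandas as pd

# Same data as obj2 above, after the assignment obj2['d'] = 6
obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering keeps every surviving value attached to its label
positives = obj2[obj2 > 0]
print(positives)

# Membership tests look at the index labels, as they would for dict keys
print('b' in obj2)
print('e' in obj2)
```

The label `a` disappears together with its negative value, while the remaining labels keep their original order.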

We can build a Series from a dictionary.

In [87]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
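When a dictionary is combined with an explicit `index`, pandas keeps only the keys that appear in that index and marks the others as missing. The sketch below assumes a hypothetical label list `states` that is not part of the original notebook.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Hypothetical label list: 'California' has no entry in sdata, 'Utah' is left out
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)
print(obj4)

# isnull() flags the label that had no matching key in the dictionary
print(obj4.isnull())
```

Because a missing value is stored as NaN, the resulting Series has a floating-point dtype even though the dictionary holds integers.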

### DataFrame <font color='green'>[Essential]</font><font color='blue'>[INTRANT]</font>

- A DataFrame represents a table.
- It is comparable to a table whose rows and columns both have an index.
- It is like a Series of Series: each column is itself a Series.
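The "Series of Series" idea can be sketched by building a DataFrame from a dictionary of Series. The names `pop`, `debt` and `table`, and the values they hold, are made up for this illustration.

```python
import pandas as pd

# Two Series with partially overlapping indexes (illustrative values)
pop = pd.Series([1.5, 1.7, 3.6], index=[2000, 2001, 2002])
debt = pd.Series([0.4, 0.9], index=[2001, 2002])

# Each dictionary key becomes a column; the row index is the union of both indexes
table = pd.DataFrame({'pop': pop, 'debt': debt})
print(table)

print(table.index.tolist())    # row labels
print(table.columns.tolist())  # column labels
```

The year 2000 has no `debt` entry, so that cell is filled with NaN rather than raising an error.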

In [88]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

In [89]:
frame

|     | state  | year | pop |
|-----|--------|------|-----|
| 0   | Ohio   | 2000 | 1.5 |
| 1   | Ohio   | 2001 | 1.7 |
| 2   | Ohio   | 2002 | 3.6 |
| 3   | Nevada | 2001 | 2.4 |
| 4   | Nevada | 2002 | 2.9 |


- The Series of a DataFrame can be accessed like the values of a dictionary.
- What we get back is generally a view (a reference) onto these Series, not a copy.
- All the Series of a DataFrame (df) share the same index.
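A small sketch of the last two points, using a two-row frame invented for the example. Since view behaviour depends on the pandas version, the sketch only demonstrates the safe direction: an explicit `.copy()` is always independent of the DataFrame.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# A column is a Series that carries the DataFrame's own row index
states = frame['state']
print(states.index.equals(frame.index))

# An explicit copy can be modified without touching the DataFrame
independent = frame['pop'].copy()
independent.loc[0] = 99.0
print(frame['pop'].tolist())
```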

In [90]:
frame['state'] # access a column

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [91]:
frame.loc[2] # access a row by its index label

state    Ohio
year     2002
pop       3.6
Name: 2, dtype: object

In [92]:
frame['debt'] = 10
frame['emp'] = [1, 2, 3, 4, 5]
frame

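|     | state  | year | pop | debt | emp |
|-----|--------|------|-----|------|-----|
| 0   | Ohio   | 2000 | 1.5 | 10   | 1   |
| 1   | Ohio   | 2001 | 1.7 | 10   | 2   |
| 2   | Ohio   | 2002 | 3.6 | 10   | 3   |
| 3   | Nevada | 2001 | 2.4 | 10   | 4   |
| 4   | Nevada | 2002 | 2.9 | 10   | 5   |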


In [93]:
# the values of a DataFrame are stored in a two-dimensional array
frame.values

array([['Ohio', 2000, 1.5, 10, 1],
       ['Ohio', 2001, 1.7, 10, 2],
       ['Ohio', 2002, 3.6, 10, 3],
       ['Nevada', 2001, 2.4, 10, 4],
       ['Nevada', 2002, 2.9, 10, 5]], dtype=object)

## Essential Functionality

### Dropping entries from an axis <font color='green'>[Essential]</font>

In [94]:
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)), 
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)
frame

|     | Ohio | Texas | California |
|-----|------|-------|------------|
| a   | 0    | 1     | 2          |
| c   | 3    | 4     | 5          |
| d   | 6    | 7     | 8          |


In [95]:
frame.drop('California', axis=1)

|     | Ohio | Texas |
|-----|------|-------|
| a   | 0    | 1     |
| c   | 3    | 4     |
| d   | 6    | 7     |


In [96]:
frame.drop('a')

|     | Ohio | Texas | California |
|-----|------|-------|------------|
| c   | 3    | 4     | 5          |
| d   | 6    | 7     | 8          |
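Both calls above return a new object and leave `frame` untouched. The sketch below rebuilds the same frame and uses the `index=` and `columns=` keywords of `drop` to remove a row and a column in one call; the name `trimmed` is introduced only for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)

# Drop row 'a' and column 'California' together; the result is a new DataFrame
trimmed = frame.drop(index=['a'], columns=['California'])
print(trimmed)

# The original frame still has all three rows and three columns
print(frame.shape)
```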


### Indexing, selection, and filtering <font color='green'>[Essential]</font><font color='blue'>[INTRANT]</font>

In [97]:
obj = pd.Series([2, 3, 5, 9], index=['a', 'b', 'c', 'd'])
obj

a    2
b    3
c    5
d    9
dtype: int64

In [98]:
# indexing
obj.iloc[[1, 3]]

b    3
d    9
dtype: int64

In [99]:
# filtering
obj.loc[obj < 6]

a    2
b    3
c    5
dtype: int64

In [100]:
# slicing
obj.loc['b':'c']

b    3
c    5
dtype: int64

In [101]:
obj.loc['a':'b'] = 4

In [102]:
obj

a    4
b    4
c    5
d    9
dtype: int64
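One detail of the slicing cells above is worth isolating: a label slice with `loc` includes its end point, while a positional slice with `iloc` follows the usual Python convention and excludes it. A minimal sketch on the original `obj` values:

```python
import pandas as pd

obj = pd.Series([2, 3, 5, 9], index=['a', 'b', 'c', 'd'])

# Label-based slice: both 'b' and 'c' are included
by_label = obj.loc['b':'c']
print(by_label)

# Position-based slice: position 1 is included, position 2 is not
by_position = obj.iloc[1:2]
print(by_position)
```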

In [103]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

In [104]:
data

|          | one | two | three | four |
|----------|-----|-----|-------|------|
| Ohio     | 0   | 1   | 2     | 3    |
| Colorado | 4   | 5   | 6     | 7    |
| Utah     | 8   | 9   | 10    | 11   |
| New York | 12  | 13  | 14    | 15   |


In [105]:
data.iloc[:2]

|          | one | two | three | four |
|----------|-----|-----|-------|------|
| Ohio     | 0   | 1   | 2     | 3    |
| Colorado | 4   | 5   | 6     | 7    |


In [106]:
data.loc[data['three'] > 5]

|          | one | two | three | four |
|----------|-----|-----|-------|------|
| Colorado | 4   | 5   | 6     | 7    |
| Utah     | 8   | 9   | 10    | 11   |
| New York | 12  | 13  | 14    | 15   |
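`loc` and `iloc` also accept a second argument that selects columns, so rows and columns can be picked in a single expression. The sketch below reuses the same `data` frame; the names `subset`, `cell` and `filtered` are introduced only for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

# Rows and columns selected by label
subset = data.loc[['Colorado', 'Utah'], ['two', 'three']]
print(subset)

# A single cell selected by position: third row, second column
cell = data.iloc[2, 1]
print(cell)

# A boolean row filter combined with a column label
filtered = data.loc[data['three'] > 5, 'one']
print(filtered)
```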


### Function application and mapping <font color='green'>[Essential]</font><font color='blue'>[INTRANT]</font>

In [107]:
frame = pd.DataFrame(
    np.random.randn(3, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas']
)

In [108]:
frame

|       | b         | d         | e         |
|-------|-----------|-----------|-----------|
| Utah  | -1.773187 | 0.74978   | 0.370691  |
| Ohio  | -1.900352 | 0.526072  | -0.075212 |
| Texas | -1.139199 | -0.849878 | -0.45326  |


NumPy functions work with DataFrames and Series.

In [109]:
np.abs(frame)

|       | b        | d        | e        |
|-------|----------|----------|----------|
| Utah  | 1.773187 | 0.74978  | 0.370691 |
| Ohio  | 1.900352 | 0.526072 | 0.075212 |
| Texas | 1.139199 | 0.849878 | 0.45326  |


In [110]:
def series_min(series):
    return series.min() # return a single value

# calling the function on every column of the DataFrame produces a Series
frame.apply(series_min)

b   -1.900352
d   -0.849878
e   -0.453260
dtype: float64

In [111]:
def fois_deux(series):
    return series * 2 # return a Series

# calling the function on every column; since each call returns a Series, the result is a DataFrame
frame.apply(fois_deux)

|       | b         | d         | e         |
|-------|-----------|-----------|-----------|
| Utah  | -3.546373 | 1.499561  | 0.741382  |
| Ohio  | -3.800705 | 1.052144  | -0.150425 |
| Texas | -2.278398 | -1.699756 | -0.90652  |


In [112]:
def plus_un(valeur):
    return 'valeur: %.2f' % valeur # format one value as a string

# the function is applied to every cell (applymap is named DataFrame.map from pandas 2.1 onward)
frame.applymap(plus_un)

|       | b             | d             | e             |
|-------|---------------|---------------|---------------|
| Utah  | valeur: -1.77 | valeur: 0.75  | valeur: 0.37  |
| Ohio  | valeur: -1.90 | valeur: 0.53  | valeur: -0.08 |
| Texas | valeur: -1.14 | valeur: -0.85 | valeur: -0.45 |
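The cells above apply functions column by column and cell by cell. The sketch below adds the row-wise case (`axis=1`) on a small fixed frame, so the numbers are reproducible. The `spread` helper and the `elementwise` fallback are assumptions made for this example: the fallback picks `DataFrame.map` when the installed pandas provides it and `applymap` otherwise.

```python
import pandas as pd

frame = pd.DataFrame(
    {'b': [1.0, -2.0], 'd': [3.0, 0.5]},
    index=['Utah', 'Ohio']
)

def spread(series):
    # Hypothetical helper: difference between the largest and smallest value
    return series.max() - series.min()

per_column = frame.apply(spread)        # one result per column
per_row = frame.apply(spread, axis=1)   # one result per row
print(per_column)
print(per_row)

# Element-wise application that works across pandas versions
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(lambda v: '%.1f' % v)
print(formatted)
```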


### Sorting and ranking <font color='green'>[Essential]</font>

In [113]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [114]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [115]:
frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)), 
    index=['three', 'one'],
    columns=['d', 'a', 'b', 'c']
)

In [116]:
frame

|       | d | a | b | c |
|-------|---|---|---|---|
| three | 0 | 1 | 2 | 3 |
| one   | 4 | 5 | 6 | 7 |


In [117]:
# we can sort along the rows or along the columns
sorted_frame = frame.sort_index(axis=1)
sorted_frame

|       | a | b | c | d |
|-------|---|---|---|---|
| three | 1 | 2 | 3 | 0 |
| one   | 5 | 6 | 7 | 4 |
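The cells above sort by index labels only. As a complement, here is a minimal sketch of sorting by value with `sort_values` and of ranking with `rank`; the Series values are invented for the example and include a tie so that the default ranking rule (ties receive the average of their positions) is visible.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4], index=['d', 'a', 'b', 'c'])

# Sort by value instead of by label
by_value = obj.sort_values()
descending = obj.sort_values(ascending=False)
print(by_value)
print(descending)

# rank() assigns positions starting at 1; the two 7s share the average of ranks 3 and 4
ranks = obj.rank()
print(ranks)
```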


## Summarizing and Computing Descriptive Statistics <font color='green'>[Essential]</font>  <font color='#D22328'>[Advanced]</font><font color='blue'>[INTRANT]</font>
pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data.

In [118]:
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
) # nan : not a number 

In [119]:
df

|     | one  | two  |
|-----|------|------|
| a   | 1.4  | NaN  |
| b   | 7.1  | -4.5 |
| c   | NaN  | NaN  |
| d   | 0.75 | -1.3 |


In [120]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [121]:
df.cumsum() # cumulative sum

|     | one  | two  |
|-----|------|------|
| a   | 1.4  | NaN  |
| b   | 8.5  | -4.5 |
| c   | NaN  | NaN  |
| d   | 9.25 | -5.8 |


In [122]:
df.idxmax() # index label of the maximum value

one    b
two    d
dtype: object

In [123]:
df.sum(axis=1) # we can sum across the columns (axis=1) or down the rows (axis=0)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [124]:
df.describe()

|       | one      | two      |
|-------|----------|----------|
| count | 3.0      | 2.0      |
| mean  | 3.083333 | -2.9     |
| std   | 3.493685 | 2.262742 |
| min   | 0.75     | -4.5     |
| 25%   | 1.075    | -3.7     |
| 50%   | 1.4      | -2.9     |
| 75%   | 4.25     | -2.1     |
| max   | 7.1      | -1.3     |


In [125]:
df['one'].describe()

count    3.000000
mean     3.083333
std      3.493685
min      0.750000
25%      1.075000
50%      1.400000
75%      4.250000
max      7.100000
Name: one, dtype: float64

In [126]:
strings = df.astype(str) # change the type of the values of df
strings['one'].describe() 

count       4
unique      4
top       1.4
freq        1
Name: one, dtype: object

## Handling Missing Data <font color='green'>[Essential]</font><font color='blue'>[INTRANT]</font>

In [127]:
df.isnull()

|     | one   | two   |
|-----|-------|-------|
| a   | False | True  |
| b   | False | False |
| c   | True  | True  |
| d   | False | False |


In [128]:
df.dropna(subset=['two'])

|     | one  | two  |
|-----|------|------|
| b   | 7.1  | -4.5 |
| d   | 0.75 | -1.3 |
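Beyond `subset`, `dropna` has a `how` parameter that decides whether a row is dropped when any value is missing (the default) or only when all values are missing, and `fillna` replaces missing values instead of dropping them. A minimal sketch on the same `df`, with variable names chosen only for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

# Default: drop every row that contains at least one NaN
any_missing_dropped = df.dropna()
print(any_missing_dropped)

# how='all': drop only the rows in which every value is NaN
all_missing_dropped = df.dropna(how='all')
print(all_missing_dropped)

# fillna: keep every row and replace the missing values with 0
filled = df.fillna(0)
print(filled)
```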
