# Data Science Libraries
## Pandas

First, import pandas, along with NumPy, which some of the examples below also use.

In [1]:
import pandas as pd
import numpy as np

### Series

In [2]:
s = pd.Series([1, 3, 5, 6, 8])  # Series built from a list
s


0    1
1    3
2    5
3    6
4    8
dtype: int64

In [3]:
type(s)

pandas.core.series.Series

In [4]:
dates = pd.date_range('20190101', periods=6)
dates

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
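The two objects shown above can be combined: `pd.Series` accepts an `index` argument, so a date range can label the values. This is a minimal sketch; the names `dates_idx` and `ts` and the values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Six consecutive days, built the same way as in the cell above
dates_idx = pd.date_range('20190101', periods=6)

# A Series whose labels are the dates instead of the default 0..n-1
ts = pd.Series([10, 20, 30, 40, 50, 60], index=dates_idx)

# Label-based lookup using a Timestamp
first_value = ts.loc[pd.Timestamp('2019-01-01')]
print(first_value)
```

Using dates as the index is optional; the default integer index from the earlier example works the same way for positional data.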

### DataFrame

#### Creating a DataFrame

* Lists and Dictionaries

In [5]:
# creating a DataFrame from a NumPy array (values are random, so they change on every run)
df = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df


Unnamed: 0,A,B,C,D
0,0.287012,0.460435,0.80842,0.176701
1,0.61999,0.23981,0.650158,0.92029
2,0.220045,0.790086,0.024839,0.966303
3,0.925009,0.476258,0.037144,0.012367


In [6]:
# creating a DataFrame from a dictionary (scalar values are repeated to match the length of 'D')
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20190101'),
    'C': 'foo',
    'D': np.array([1,2,3,4])
})
df2

Unnamed: 0,A,B,C,D
0,1.0,2019-01-01,foo,1
1,1.0,2019-01-01,foo,2
2,1.0,2019-01-01,foo,3
3,1.0,2019-01-01,foo,4
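A DataFrame can also be built from plain Python lists. The sketch below shows two common layouts, a list of rows and a list of dictionaries; the names `from_lists` and `from_dicts` and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately
from_lists = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'letter'])

# Each dictionary is one row; its keys become the column names
from_dicts = pd.DataFrame([
    {'num': 1, 'letter': 'a'},
    {'num': 2, 'letter': 'b'},
])

# Both constructions describe the same table
print(from_lists.equals(from_dicts))
```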


* CSV

To import a CSV file with pandas, we use the function ```pandas.read_csv("path_to_file")```.

Because this DataFrame has many rows, it helps to view only the first few. For that, we use ```df.head(n)```, which shows the first n rows of df.

In [7]:
iris = pd.read_csv('./Iris.csv')
iris.head(n = 5)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


Alternatively, we can view the last n rows of df with ```df.tail(n)```.

In [8]:
iris.tail(n = 5)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica
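To try `read_csv`, `head`, and `tail` without the `Iris.csv` file on disk, the same calls can be run against CSV text held in memory. This is a sketch under assumptions: the text below reuses three rows and two columns from the output above, and `csv_text` and `sample` are hypothetical names.

```python
import io
import pandas as pd

# A tiny CSV with the same layout as part of the Iris file
csv_text = """Id,SepalLengthCm,Species
1,5.1,Iris-setosa
2,4.9,Iris-setosa
3,4.7,Iris-setosa
"""

# read_csv accepts a file-like object as well as a path;
# index_col uses the Id column as the row labels
sample = pd.read_csv(io.StringIO(csv_text), index_col='Id')

print(sample.head(2))  # first two rows
print(sample.tail(1))  # last row
```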


#### Describing a DataFrame

* Describe

In [9]:
df.describe()

Unnamed: 0,A,B,C,D
count,4.0,4.0,4.0,4.0
mean,0.513014,0.491647,0.38014,0.518915
std,0.325624,0.226347,0.408338,0.494961
min,0.220045,0.23981,0.024839,0.012367
25%,0.27027,0.405279,0.034068,0.135618
50%,0.453501,0.468347,0.343651,0.548495
75%,0.696245,0.554715,0.689724,0.931793
max,0.925009,0.790086,0.80842,0.966303


* Info

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null float64
B    4 non-null float64
C    4 non-null float64
D    4 non-null float64
dtypes: float64(4)
memory usage: 208.0 bytes
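By default `describe()` summarizes only the numeric columns; passing `include='all'` adds the non-numeric ones as well. The sketch below uses a small fixed table so the numbers are reproducible; the `animals` frame and its `kind` column are assumptions made for illustration.

```python
import pandas as pd

animals = pd.DataFrame({
    'vel_max': [1, 4, 7],
    'kind': ['reptile', 'reptile', 'bird'],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = animals.describe()

# All columns: text columns get count, unique, top, and freq
full_summary = animals.describe(include='all')

print(numeric_summary.loc['mean', 'vel_max'])
print(full_summary.loc['unique', 'kind'])
```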


#### Selecting parts of a DataFrame
* Columns

In [11]:
# creating a DataFrame with labelled rows
df = pd.DataFrame([[1,2],[4,5],[7,8]], index=['cobra','tartaruga','pomba'], columns=['vel_max','peso'])
df

Unnamed: 0,vel_max,peso
cobra,1,2
tartaruga,4,5
pomba,7,8


To access a column of a DataFrame, index it with the column name:

In [12]:
df['vel_max']

cobra        1
tartaruga    4
pomba        7
Name: vel_max, dtype: int64

Note, however, that this returns a pandas.Series.

With an extra pair of brackets, that is, passing a list of column names, the result is a pandas.DataFrame.

In [13]:
df[['vel_max']]

Unnamed: 0,vel_max
cobra,1
tartaruga,4
pomba,7
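The difference between the two forms can be checked directly. A minimal sketch, rebuilding the same table as above; the names `as_series`, `as_frame`, and `both` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'tartaruga', 'pomba'],
                  columns=['vel_max', 'peso'])

as_series = df['vel_max']    # single label: one-dimensional Series
as_frame = df[['vel_max']]   # list of labels: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# A list can also select several columns, in the order given
both = df[['peso', 'vel_max']]
```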


* Rows

One way to access the rows of a DataFrame is the loc indexer, which selects by label

In [14]:
df.loc['cobra'] # output is a pandas.Series

vel_max    1
peso       2
Name: cobra, dtype: int64

In [15]:
df.loc[['tartaruga']] # output is a pandas.DataFrame

Unnamed: 0,vel_max,peso
tartaruga,4,5


In [16]:
df.loc[['cobra','pomba']] # output is a pandas.DataFrame

Unnamed: 0,vel_max,peso
cobra,1,2
pomba,7,8


Another way is iloc, which selects by integer position

In [17]:
df.iloc[0] # output is a pandas.Series

vel_max    1
peso       2
Name: cobra, dtype: int64

In [18]:
df.iloc[[0]] # output is a pandas.DataFrame

Unnamed: 0,vel_max,peso
cobra,1,2


In [19]:
df.iloc[:2] # returns the first two rows of the DataFrame

Unnamed: 0,vel_max,peso
cobra,1,2
tartaruga,4,5
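One detail worth seeing side by side is how slices behave: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing. Both indexers also accept a row and a column together. A short sketch on the same table; the result names are assumptions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'tartaruga', 'pomba'],
                  columns=['vel_max', 'peso'])

by_label = df.loc['cobra':'tartaruga']  # end label included: 2 rows
by_position = df.iloc[0:1]              # end position excluded: 1 row

# Row and column together return a single value
cell_by_label = df.loc['pomba', 'peso']
cell_by_position = df.iloc[2, 1]

print(len(by_label), len(by_position))
print(cell_by_label, cell_by_position)
```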


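#### Filtering a DataFrame

Let's illustrate with the Iris dataset. Comparing a column with a value produces a boolean Series, and indexing the DataFrame with it keeps only the rows where the condition is True. Conditions are combined with & (and), and each one must be wrapped in parentheses.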

In [20]:
iris_virginica = iris[iris['Species'] == 'Iris-virginica']
iris_virginica.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
100,101,6.3,3.3,6.0,2.5,Iris-virginica
101,102,5.8,2.7,5.1,1.9,Iris-virginica
102,103,7.1,3.0,5.9,2.1,Iris-virginica
103,104,6.3,2.9,5.6,1.8,Iris-virginica
104,105,6.5,3.0,5.8,2.2,Iris-virginica


In [22]:
iris_filtered = iris[(iris['SepalLengthCm'] > 7.5) & (iris['Species'] == 'Iris-virginica')]
iris_filtered.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
105,106,7.6,3.0,6.6,2.1,Iris-virginica
117,118,7.7,3.8,6.7,2.2,Iris-virginica
118,119,7.7,2.6,6.9,2.3,Iris-virginica
122,123,7.7,2.8,6.7,2.0,Iris-virginica
131,132,7.9,3.8,6.4,2.0,Iris-virginica
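Boolean masks can also be combined with `|` (or) and negated with `~`, and `isin` tests membership in a list of values. The sketch below uses a four-row table so it runs without the CSV file; the `flowers` frame takes its values from rows shown above, and the mask and result names are assumptions.

```python
import pandas as pd

flowers = pd.DataFrame({
    'SepalLengthCm': [5.1, 6.3, 7.6, 7.7],
    'Species': ['Iris-setosa', 'Iris-virginica',
                'Iris-virginica', 'Iris-virginica'],
})

# Masks can be stored in variables and reused
is_long = flowers['SepalLengthCm'] > 7.5
is_setosa = flowers['Species'] == 'Iris-setosa'

either = flowers[is_long | is_setosa]      # rows matching at least one condition
neither = flowers[~(is_long | is_setosa)]  # rows matching none

# isin keeps rows whose value appears in the given list
by_list = flowers[flowers['Species'].isin(['Iris-setosa'])]

print(list(either.index), list(neither.index), len(by_list))
```

Naming the masks first tends to keep longer filters readable, and it avoids the precedence problems that come from leaving out the parentheses around each comparison.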
