In [1]:
import pandas as pd

## Series

A Series is a one-dimensional pandas array used to store data. By default, each element has an integer index running from 0 to n-1, and we can access any element through its index.

In [2]:
s = pd.Series([1, 2, 3, 4, 5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64
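As a small sketch of element access on the default index, the snippet below reads the first and last elements. The variable names `first`, `last` and `index_labels` are introduced here only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

first = s[0]                  # label 0 is also the first position with the default index
last = s.iloc[-1]             # iloc is purely positional, so -1 means "last element"
index_labels = list(s.index)  # the default index, materialized as a plain list

print(first, last, index_labels)
```

With the default index, labels and positions coincide, which is why both styles return what one would expect here.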

However, it is sometimes important to specify the index labels ourselves. To do this, we pass the index argument to the pandas Series constructor, with as many labels as there are observations in the data (here, Portuguese month abbreviations):

In [4]:
s = pd.Series(
    [12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34],
    index=['jan', 'fev', 'mar', 'abr', 'mai', 'jun',
           'jul', 'ago', 'set', 'out', 'nov', 'dez'],
)
s

jan    12
fev    14
mar    16
abr    18
mai    20
jun    22
jul    24
ago    26
set    28
out    30
nov    32
dez    34
dtype: int64

In [5]:
s['jan']

12

Even though we used custom index labels for the Series object, we can still access its values by position with slicing. For example, taking the first six values of the series:

In [6]:
s[0:6]

jan    12
fev    14
mar    16
abr    18
mai    20
jun    22
dtype: int64
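The difference between slicing by position and slicing by label can be made explicit with the `iloc` and `loc` indexers. This is a minimal sketch on a shortened, three-month version of the series (an assumption made to keep the example small). Note that a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

# Shortened version of the monthly series, for illustration only
s = pd.Series([12, 14, 16], index=['jan', 'fev', 'mar'])

by_position = s.iloc[0:2]      # positions 0 and 1; position 2 is excluded
by_label = s.loc['jan':'mar']  # labels 'jan' through 'mar'; 'mar' is included

print(by_position)
print(by_label)
```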

We can also apply various statistical functions to the series, such as **mean, max, min, std**:

In [11]:
print(f'Mean: {s.mean()};\nMin: {s.min()};\nMax: {s.max()}\nStandard Deviation: {s.std()}\n')

Mean: 23.0;
Min: 12;
Max: 34
Standard Deviation: 7.211102550927978



In [12]:
s.describe()  # summary statistics, including quartiles

count    12.000000
mean     23.000000
std       7.211103
min      12.000000
25%      17.500000
50%      23.000000
75%      28.500000
max      34.000000
dtype: float64
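The quartiles reported by describe can also be requested one at a time. The sketch below uses the `quantile` and `median` methods on the same twelve values. The interquartile range `iqr` is an extra quantity computed here for illustration and does not appear in the output above.

```python
import pandas as pd

s = pd.Series([12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34])

q1 = s.quantile(0.25)  # first quartile, the "25%" row of describe()
median = s.median()    # second quartile, the "50%" row
q3 = s.quantile(0.75)  # third quartile, the "75%" row
iqr = q3 - q1          # interquartile range

print(q1, median, q3, iqr)
```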

In [13]:
s.sum()

276

An important detail is that slicing a Series returns another Series, so the same methods are available on the result.

In [14]:
s[0:6].mean()

17.0

In [15]:
print(f'Sum of accidents in the first quarter: {s[0:3].sum()}\nMean of accidents in the first quarter: {s[0:3].mean()}')

Sum of accidents in the first quarter: 42
Mean of accidents in the first quarter: 14.0


We can also access all the values stored in a Series object through its **values** attribute:

In [16]:
s.values

array([12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34], dtype=int64)
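As a complementary sketch, the `to_numpy` method returns the same underlying values as a NumPy array, and the labels can be retrieved separately from the index. The shortened series and the variable names below are assumptions for illustration.

```python
import pandas as pd

s = pd.Series([12, 14, 16], index=['jan', 'fev', 'mar'])

arr = s.to_numpy()         # the values as a NumPy array, without the labels
labels = s.index.tolist()  # the labels as a plain Python list

print(arr, labels)
```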

## Data Frame

A data frame is a pandas data structure that lets us build tabular data. It is created with the **DataFrame** constructor, which accepts a dictionary whose keys are the column names (variables) and whose values are lists of observations for the corresponding variable. By default, the index of the observations is a continuous ascending sequence starting at 0, shown as the leftmost column of the data frame.

All the lists must have the same length; otherwise pandas raises a ValueError.

In [19]:
df = pd.DataFrame({
    'Aluno': ['Marina', 'Felipe', 'Cleyton', 'Isabel'],
    'Créditos cursados': [20, 64, 32, 24],
    'Rendimento acadêmico': [8.55, 7.88, 8.17, 9.04],
    'Mês de nascimento': ['Novembro', 'Setembro', 'Janeiro', 'Julho'],
    'Curso': ['Computação', 'Estatística', 'Computação', 'Matemática'],
})

df

|   | Aluno | Créditos cursados | Rendimento acadêmico | Mês de nascimento | Curso |
|---|---|---|---|---|---|
| 0 | Marina | 20 | 8.55 | Novembro | Computação |
| 1 | Felipe | 64 | 7.88 | Setembro | Estatística |
| 2 | Cleyton | 32 | 8.17 | Janeiro | Computação |
| 3 | Isabel | 24 | 9.04 | Julho | Matemática |
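To illustrate the equal-length requirement, the sketch below deliberately passes lists of different lengths and catches the resulting exception. The two-column dictionary is a made-up reduction of the example above, and the exact error message may vary between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

try:
    # 'Aluno' has two observations but 'Créditos cursados' has only one
    pd.DataFrame({'Aluno': ['Marina', 'Felipe'], 'Créditos cursados': [20]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```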


As with a Series, we can use index values to access observations. For this we use the **iloc** indexer, which takes an integer position and returns that row as a Series mapping each column name (variable) to the row's value:

In [22]:
df.iloc[1]

Aluno                        Felipe
Créditos cursados                64
Rendimento acadêmico           7.88
Mês de nascimento          Setembro
Curso                   Estatística
Name: 1, dtype: object
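Beyond a single row, `iloc` also accepts slices and a (row, column) pair of positions. The following is a minimal sketch on a two-column reduction of the data frame above, which is an assumption made for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'Aluno': ['Marina', 'Felipe', 'Cleyton'],
    'Curso': ['Computação', 'Estatística', 'Computação'],
})

row = df.iloc[1]      # one row, returned as a Series
rows = df.iloc[0:2]   # a slice of rows, returned as a DataFrame
cell = df.iloc[1, 0]  # row position 1, column position 0: a single value

print(row, rows, cell, sep='\n\n')
```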

A data frame is structured like a two-dimensional collection of Series objects: we can access the **rows** (observations) with the **iloc** indexer and the **columns** with dictionary-style indexing, where the keys are the column names.

For instance, suppose we want all the values of the column "Rendimento acadêmico" (academic performance) as a Series object:

In [24]:
df['Rendimento acadêmico']

0    8.55
1    7.88
2    8.17
3    9.04
Name: Rendimento acadêmico, dtype: float64
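Indexing with a single column name returns a Series, while indexing with a list of names returns a smaller DataFrame. The sketch below shows both on a reduced version of the data frame; the names `one_column` and `two_columns` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Aluno': ['Marina', 'Felipe'],
    'Créditos cursados': [20, 64],
    'Rendimento acadêmico': [8.55, 7.88],
})

one_column = df['Rendimento acadêmico']             # a string key gives a Series
two_columns = df[['Aluno', 'Rendimento acadêmico']]  # a list of keys gives a DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
```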

The advantage is that, as mentioned before, Series objects support various mathematical and statistical functions.

In [25]:
df['Rendimento acadêmico'].describe()

count    4.000000
mean     8.410000
std      0.501664
min      7.880000
25%      8.097500
50%      8.360000
75%      8.672500
max      9.040000
Name: Rendimento acadêmico, dtype: float64

We can also call describe on the whole data frame to compute these statistics for all quantitative variables at once.

In [26]:
df.describe()

|   | Créditos cursados | Rendimento acadêmico |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 35.0 | 8.41 |
| std | 19.966639 | 0.501664 |
| min | 20.0 | 7.88 |
| 25% | 23.0 | 8.0975 |
| 50% | 28.0 | 8.36 |
| 75% | 40.0 | 8.6725 |
| max | 64.0 | 9.04 |
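By default, describe leaves out the qualitative columns. One possible way to summarize such a column, sketched here rather than taken from the original notebook, is the `value_counts` method, which counts how often each category appears.

```python
import pandas as pd

df = pd.DataFrame({
    'Aluno': ['Marina', 'Felipe', 'Cleyton', 'Isabel'],
    'Curso': ['Computação', 'Estatística', 'Computação', 'Matemática'],
})

counts = df['Curso'].value_counts()  # frequency of each course, most frequent first

print(counts)
```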


## Read CSV

``df = pd.read_csv(file_path or url)``

In [29]:
df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv')
df

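|   | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 |
| 1 | Albania | 89 | 132 | 54 | 4.9 |
| 2 | Algeria | 25 | 0 | 14 | 0.7 |
| 3 | Andorra | 245 | 138 | 312 | 12.4 |
| 4 | Angola | 217 | 57 | 45 | 5.9 |
| ... | ... | ... | ... | ... | ... |
| 188 | Venezuela | 333 | 100 | 3 | 7.7 |
| 189 | Vietnam | 111 | 2 | 1 | 2.0 |
| 190 | Yemen | 6 | 0 | 0 | 0.1 |
| 191 | Zambia | 32 | 19 | 4 | 2.5 |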


Let's understand this data frame:
- It has 193 rows, which means 193 observations, one for each country.
- It has one qualitative variable, "country", whose value in each observation is the name of a country.
- It has 4 quantitative variables: servings of beer, servings of spirits, servings of wine, and total litres of pure alcohol consumed per person in each country per year.
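The structure described above can be checked programmatically. The sketch below avoids the network by reading three rows copied from the output above out of an in-memory string; `csv_text` and `sample` are names introduced here for illustration. The `shape` attribute gives the number of rows and columns, and `dtypes` shows which columns are numeric.

```python
import io

import pandas as pd

# Three rows copied from the table above, used as an offline stand-in for the file
csv_text = """country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Afghanistan,0,0,0,0.0
Albania,89,132,54,4.9
Algeria,25,0,14,0.7
"""

sample = pd.read_csv(io.StringIO(csv_text))  # read_csv also accepts file-like objects
n_rows, n_cols = sample.shape                # (number of rows, number of columns)

print(n_rows, n_cols)
print(sample.dtypes)                         # one dtype per column
```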

## Querying Data

Querying pandas data frames is simple: specify the data frame and pass a condition between [ ]. Pandas builds a boolean mask for the data frame and returns only the observations for which the mask is True.

In [30]:
df[df['country'] == 'Brazil']

|   | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|---|
| 23 | Brazil | 245 | 145 | 16 | 7.2 |
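Boolean masks can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below builds a five-row stand-in from values shown earlier in this notebook; the thresholds of 50 servings are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Five rows copied from the drinks table shown earlier
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'beer_servings': [0, 89, 25, 245, 217],
    'wine_servings': [0, 54, 14, 312, 45],
})

# Each condition needs its own parentheses because & binds tighter than >
mask = (sample['beer_servings'] > 50) & (sample['wine_servings'] > 50)
selected = sample[mask]

print(selected)
```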


You can access the values of a column either by putting its name between [ ] or by using the name as an attribute:

In [34]:
print(df[df['country'] == 'Brazil'].spirit_servings)
print(df[df['country'] == 'Brazil']['spirit_servings'])

23    145
Name: spirit_servings, dtype: int64
23    145
Name: spirit_servings, dtype: int64
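Attribute access is convenient but has limits: a name containing spaces cannot be written as an attribute, and a column whose name matches an existing DataFrame method is shadowed by that method. The sketch below uses a hypothetical column named `count` to show the second case; bracket indexing works in both situations.

```python
import pandas as pd

# 'count' is a made-up column name that collides with the DataFrame.count method
df = pd.DataFrame({'country': ['Brazil', 'Angola'], 'count': [245, 217]})

via_attribute = df.country       # works: 'country' is a valid, unused attribute name
shadowed = df.count              # this is the count method, not the column
via_brackets = df['count']       # brackets always return the column

print(callable(shadowed), via_brackets.tolist())
```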


## Sorting Data Frames

We can sort the rows of a data frame with the **sort_values** method, naming the column to sort on with the **by** parameter. Passing ascending=False sorts from largest to smallest.

In [38]:
df.sort_values(by='total_litres_of_pure_alcohol', ascending=False)

|   | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|---|
| 15 | Belarus | 142 | 373 | 42 | 14.4 |
| 98 | Lithuania | 343 | 244 | 56 | 12.9 |
| 3 | Andorra | 245 | 138 | 312 | 12.4 |
| 68 | Grenada | 199 | 438 | 28 | 11.9 |
| 45 | Czech Republic | 361 | 170 | 134 | 11.8 |
| ... | ... | ... | ... | ... | ... |
| 79 | Iran | 0 | 0 | 0 | 0.0 |
| 90 | Kuwait | 0 | 0 | 0 | 0.0 |
| 128 | Pakistan | 0 | 0 | 0 | 0.0 |
| 97 | Libya | 0 | 0 | 0 | 0.0 |
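Two details are worth sketching: sort_values returns a new data frame and leaves the original order untouched, and chaining `head` keeps only the top rows. The five-row `sample` below is a stand-in built from values shown earlier, not the full dataset.

```python
import pandas as pd

# Five rows copied from the drinks table shown earlier
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'total_litres_of_pure_alcohol': [0.0, 4.9, 0.7, 12.4, 5.9],
})

# Sort in descending order and keep the two highest values
top_two = sample.sort_values(by='total_litres_of_pure_alcohol', ascending=False).head(2)

print(top_two)
print(sample['country'].iloc[0])  # the original data frame is still in its initial order
```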
