# pandas


## 0) What have we done so far?

- Bash / Command Line
- Git / GitHub


## 1) pandas

- "pandas is a [...] open source data analysis and manipulation tool, built on top of the Python programming language." - https://pandas.pydata.org/
- Could be compared to Excel, loosely speaking


## 2) What are the main functionalities of pandas?

| Functionality                                                             | Covered in Course   |
| ------------------------------------------------------------------------- | ------------------- |
| Reading and writing data                                                  | Today               |
| Selecting subsets of a DataFrame / filtering DataFrames                   | Today               |
| Creating new columns                                                      | Today               |
| Plotting data                                                             | Tomorrow            |
| Combining DataFrames                                                      | Tomorrow            |
| Reshaping DataFrames                                                      | Thursday            |
| Quick calculations / descriptive statistics                               | Thursday            |
| Data aggregation                                                          | Friday              |
| Data transformation                                                       | Friday              |
| Handling missing data                                                     | Next week           |
| Handling time data                                                        | Week 3              |


## 3) Reading and Writing Data

The main data type in pandas is the pandas.DataFrame (DataFrame for short).

A DataFrame can be compared to a table in Excel: it has rows, named columns, and a row index. So how do we create one?

### 3.1 Create a DataFrame from a List

In [1]:
import pandas as pd
import numpy as np

In [2]:
germany = [82_000_000, 1.9, "Europe"]

In [3]:
type(germany)

list

In [4]:
denmark = [ 5_500_000, 1.8, "Europe"]

In [7]:
countries = pd.DataFrame(data=[germany, denmark],
                columns=['population', 'fertility_rate', 'continent'])
# Create a DataFrame from a list of lists:
# data is a list that contains the lists germany and denmark, one inner list per row

In [8]:
countries

Unnamed: 0,population,fertility_rate,continent
0,82000000,1.9,Europe
1,5500000,1.8,Europe


In [10]:
type(82_000_000)

int
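
The Python types of the list entries decide which dtype each column receives. The following minimal sketch rebuilds the `countries` DataFrame from above so that it runs on its own; the variable names `population_is_int` and `fertility_is_float` are made up for illustration.

```python
import pandas as pd

# Same rows as in the cells above, rebuilt so this sketch is self-contained
germany = [82_000_000, 1.9, "Europe"]
denmark = [5_500_000, 1.8, "Europe"]
countries = pd.DataFrame(
    data=[germany, denmark],
    columns=["population", "fertility_rate", "continent"],
)

# Every column has exactly one dtype, inferred from the values it holds
print(countries.dtypes)

# Helper functions in pd.api.types check which family a dtype belongs to
population_is_int = pd.api.types.is_integer_dtype(countries["population"])
fertility_is_float = pd.api.types.is_float_dtype(countries["fertility_rate"])
print(population_is_int, fertility_is_float)
```

The text column is left out of the check on purpose, because the dtype pandas reports for strings depends on the installed version.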

### 3.2 Create a DataFrame from a Dictionary

In [11]:
data = {'spices': ['parsley', 'sage', 'rosemary', 'thyme'],
        'value': [1.2, 3.4, np.nan, 5.6],
        'good_for_spaghetti': [False, False, True, True]}

In [13]:
type(data)

dict

In [14]:
pd.DataFrame(data=data)

Unnamed: 0,spices,value,good_for_spaghetti
0,parsley,1.2,False
1,sage,3.4,False
2,rosemary,,True
3,thyme,5.6,True
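
The empty cell in the `value` column is the `np.nan` from the dictionary. As a small sketch (the names `spices`, `missing_mask` and `n_missing` are chosen here, not taken from the course), missing values can be located with `isna()`:

```python
import numpy as np
import pandas as pd

# Same dictionary as above: keys become column names, lists become columns
data = {'spices': ['parsley', 'sage', 'rosemary', 'thyme'],
        'value': [1.2, 3.4, np.nan, 5.6],
        'good_for_spaghetti': [False, False, True, True]}
spices = pd.DataFrame(data=data)

# isna() returns True wherever a value is missing
missing_mask = spices['value'].isna()

# Summing a boolean Series counts its True entries
n_missing = int(missing_mask.sum())
print(n_missing)

# Show which spice has no value
print(spices.loc[missing_mask, 'spices'].tolist())
```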


### 3.3 Read a DataFrame from a File

In [28]:
df = pd.read_csv('./penguins_simple.csv',
                sep=';') # read a CSV file from its path; sep=';' because this file separates columns with semicolons

In [29]:
df.head() # first 5 rows of a DataFrame

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


In [38]:
df.tail() # last 5 rows of a DataFrame

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
332,Gentoo,49.9,16.1,213.0,5400.0,MALE


In [35]:
# Inspect the DataFrame
df.size # number of entries (rows * columns)

# size is an attribute of the DataFrame; it describes what the DataFrame looks like
# Descriptive attributes like this are accessed without parentheses

1998

In [36]:
df.shape # (# of rows, # of columns)

(333, 6)

In [33]:
df.describe()

# describe is a method of the DataFrame; it runs calculations on the data
# By default it summarizes only the numeric columns
# Methods have to be called with parentheses ()

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g)
count,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057
std,5.468668,1.969235,14.015765,805.215802
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.5,17.3,197.0,4050.0
75%,48.6,18.7,213.0,4775.0
max,59.6,21.5,231.0,6300.0
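
The difference between attributes and methods can be checked on a tiny example. This is only a sketch; the DataFrame `small` and its columns `a` and `b` are invented for illustration.

```python
import pandas as pd

# A tiny DataFrame with 3 rows and 2 columns
small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Attributes are looked up without parentheses
n_rows, n_cols = small.shape
print(small.size == n_rows * n_cols)

# Without parentheses, a method is not executed; we only get the method object
print(callable(small.describe))

# With parentheses, the method runs and returns a new DataFrame of statistics
summary = small.describe()
print(summary.loc["mean", "a"])
```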


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              333 non-null    object 
 1   Culmen Length (mm)   333 non-null    float64
 2   Culmen Depth (mm)    333 non-null    float64
 3   Flipper Length (mm)  333 non-null    float64
 4   Body Mass (g)        333 non-null    float64
 5   Sex                  333 non-null    object 
dtypes: float64(4), object(2)
memory usage: 15.7+ KB
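
Writing works along the same lines as reading. The sketch below assumes a small stand-in DataFrame called `penguins` and writes to an in-memory buffer instead of a real file, so that it runs without `penguins_simple.csv`; passing a file path to `to_csv` instead of the buffer would write to disk.

```python
import io

import pandas as pd

# Stand-in data (two rows copied from the head() output above)
penguins = pd.DataFrame({
    "Species": ["Adelie", "Adelie"],
    "Body Mass (g)": [3750.0, 3800.0],
})

# Write the DataFrame as semicolon-separated text
# index=False leaves out the row index, so no extra column is created
buffer = io.StringIO()
penguins.to_csv(buffer, sep=";", index=False)
print(buffer.getvalue())

# Rewind the buffer and read it back with the same separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored)
```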


## 5) Selecting subsets and filtering DataFrames

In [39]:
df.columns # Returns an Index object that holds the column names

Index(['Species', 'Culmen Length (mm)', 'Culmen Depth (mm)',
       'Flipper Length (mm)', 'Body Mass (g)', 'Sex'],
      dtype='object')

In [45]:
# Select one column
df['Species'] # df['name_of_column']

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
328    Gentoo
329    Gentoo
330    Gentoo
331    Gentoo
332    Gentoo
Name: Species, Length: 333, dtype: object

In [46]:
# Select one column - option 2
df.Species # same result as df['Species']; only works if the column name is a valid Python identifier (no spaces or brackets)

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
328    Gentoo
329    Gentoo
330    Gentoo
331    Gentoo
332    Gentoo
Name: Species, Length: 333, dtype: object

In [50]:
# Select multiple columns
selected_columns = ['Species', 'Body Mass (g)']
df[selected_columns]

Unnamed: 0,Species,Body Mass (g)
0,Adelie,3750.0
1,Adelie,3800.0
2,Adelie,3250.0
3,Adelie,3450.0
4,Adelie,3650.0
...,...,...
328,Gentoo,4925.0
329,Gentoo,4850.0
330,Gentoo,5750.0
331,Gentoo,5200.0


In [None]:
type(df[['Species', 'Body Mass (g)']]) # Selecting with a list of column names returns a DataFrame

In [53]:
selected_column = ['Species']
type(selected_column)

list

In [55]:
type(df[selected_column])

pandas.core.frame.DataFrame
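
The kind of brackets decides what comes back: a single column name gives a Series, a list of names gives a DataFrame, even if the list has only one entry. A minimal sketch with a made-up two-row DataFrame `df_demo`:

```python
import pandas as pd

# Small stand-in for the penguin data
df_demo = pd.DataFrame({
    "Species": ["Adelie", "Gentoo"],
    "Body Mass (g)": [3750.0, 4925.0],
})

# Single brackets with one name: a one-dimensional Series
as_series = df_demo["Species"]

# Double brackets (a list of names): a two-dimensional DataFrame
as_frame = df_demo[["Species"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```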

In [85]:
df[['Species', 'Body Mass (g)']]

Unnamed: 0,Species,Body Mass (g)
0,Adelie,3750.0
1,Adelie,3800.0
2,Adelie,3250.0
3,Adelie,3450.0
4,Adelie,3650.0
...,...,...
328,Gentoo,4925.0
329,Gentoo,4850.0
330,Gentoo,5750.0
331,Gentoo,5200.0


In [62]:
# Q: Can you select a column by its position instead of its name?
df.iloc[:,0]

# .iloc[rows, columns] selects rows and columns by their integer position (starting at 0)
# .iloc[:,:] would mean: give me all the rows and all the columns

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
328    Gentoo
329    Gentoo
330    Gentoo
331    Gentoo
332    Gentoo
Name: Species, Length: 333, dtype: object

In [66]:
# Select rows - iloc
df.iloc[0:5]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


In [67]:
# Select rows and columns
df.iloc[0:5, 0:2]

Unnamed: 0,Species,Culmen Length (mm)
0,Adelie,39.1
1,Adelie,39.5
2,Adelie,40.3
3,Adelie,36.7
4,Adelie,39.3


In [74]:
# Select rows and columns - loc: lets you select rows and columns by their labels (names); label slices include the end label
df.loc[:, 'Species':'Body Mass (g)']

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g)
0,Adelie,39.1,18.7,181.0,3750.0
1,Adelie,39.5,17.4,186.0,3800.0
2,Adelie,40.3,18.0,195.0,3250.0
3,Adelie,36.7,19.3,193.0,3450.0
4,Adelie,39.3,20.6,190.0,3650.0
...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0
329,Gentoo,46.8,14.3,215.0,4850.0
330,Gentoo,50.4,15.7,222.0,5750.0
331,Gentoo,45.2,14.8,212.0,5200.0
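
One difference between the two selectors is easy to miss: slices in `.iloc` exclude the end position, while slices in `.loc` include the end label. The following sketch uses an invented three-row DataFrame `demo` to show this.

```python
import pandas as pd

# Three rows with the default row labels 0, 1, 2
demo = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo"],
    "Culmen Length (mm)": [39.1, 39.5, 47.2],
    "Body Mass (g)": [3750.0, 3800.0, 4925.0],
})

# iloc works with positions; the end of the slice is excluded
by_position = demo.iloc[0:2, 0:2]

# loc works with labels; the end of the slice is included
by_label = demo.loc[0:2, "Species":"Culmen Length (mm)"]

print(by_position.shape)  # 2 rows, 2 columns
print(by_label.shape)     # 3 rows, 2 columns
```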


In [78]:
condition = df['Culmen Length (mm)'] > 37

In [83]:
condition

0       True
1       True
2       True
3      False
4       True
       ...  
328     True
329     True
330     True
331     True
332     True
Name: Culmen Length (mm), Length: 333, dtype: bool

In [82]:
# Filter by column value
# We want to reduce the DataFrame to only show penguins with a Culmen Length of more than 37 mm
df[condition] # keeps only the rows where condition is True

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
5,Adelie,38.9,17.8,181.0,3625.0,FEMALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
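
Because a condition is just a Series of `True` and `False` values, it can also be used for counting. A small sketch, using the five culmen lengths from the `head()` output above in a stand-in DataFrame `demo`:

```python
import pandas as pd

# Culmen lengths of the first five penguins shown above
demo = pd.DataFrame({"Culmen Length (mm)": [39.1, 39.5, 40.3, 36.7, 39.3]})

# Comparing a column with a number gives one boolean per row
mask = demo["Culmen Length (mm)"] > 37
print(mask.tolist())

# True counts as 1, so the sum is the number of matching rows
n_matching = int(mask.sum())

# Filtering with the mask keeps exactly those rows
filtered = demo[mask]
print(n_matching, len(filtered))
```

Note that the filtered DataFrame keeps the original row labels, which is why index 3 is missing in the output above.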


In [87]:
df[df['Culmen Length (mm)']>37] # Exactly the same thing

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
5,Adelie,38.9,17.8,181.0,3625.0,FEMALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


In [92]:
# Filter by multiple conditions
# We want to reduce the DataFrame to only show penguins with a Culmen Length of more than 37 mm
# At the same time we want to filter out penguins with a Culmen Length of 40 mm or more
lower_bound = df['Culmen Length (mm)'] > 37
upper_bound = df['Culmen Length (mm)'] < 40
df[lower_bound & upper_bound] # & means "and": both conditions have to be True
# Equivalent one-liner (each condition needs its own parentheses):
# df[(df['Culmen Length (mm)'] > 37) & (df['Culmen Length (mm)'] < 40)]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
5,Adelie,38.9,17.8,181.0,3625.0,FEMALE
6,Adelie,39.2,19.6,195.0,4675.0,MALE
8,Adelie,38.6,21.2,191.0,3800.0,MALE
11,Adelie,38.7,19.0,195.0,3450.0,FEMALE
15,Adelie,37.8,18.3,174.0,3400.0,FEMALE
16,Adelie,37.7,18.7,180.0,3600.0,MALE
18,Adelie,38.2,18.1,185.0,3950.0,MALE
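
Besides `&` ("and"), conditions can be combined with `|` ("or") and negated with `~` ("not"). The sketch below reuses the first five culmen lengths in a stand-in DataFrame `demo`; the variable names are chosen for illustration only.

```python
import pandas as pd

demo = pd.DataFrame({"Culmen Length (mm)": [39.1, 39.5, 40.3, 36.7, 39.3]})
lengths = demo["Culmen Length (mm)"]

# "and": more than 37 mm and less than 40 mm
# The parentheses are required because & binds more tightly than > and <
within = (lengths > 37) & (lengths < 40)

# "not": flips every True to False and the other way round
outside = ~within

# "or": at most 37 mm or at least 40 mm, the same rows as `outside`
either_end = (lengths <= 37) | (lengths >= 40)

print(demo[within])
print(demo[outside])
print(outside.tolist() == either_end.tolist())
```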


In [94]:
df[df['Culmen Length (mm)'].between(37, 40)] # between() includes both bounds by default (37 <= x <= 40)

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
5,Adelie,38.9,17.8,181.0,3625.0,FEMALE
6,Adelie,39.2,19.6,195.0,4675.0,MALE
8,Adelie,38.6,21.2,191.0,3800.0,MALE
11,Adelie,38.7,19.0,195.0,3450.0,FEMALE
15,Adelie,37.8,18.3,174.0,3400.0,FEMALE
16,Adelie,37.7,18.7,180.0,3600.0,MALE
18,Adelie,38.2,18.1,185.0,3950.0,MALE
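
`between()` and the pair of comparisons above only differ for values that sit exactly on a bound. The following sketch uses three invented lengths to make the difference visible; the string values of the `inclusive` parameter assume pandas 1.3 or newer.

```python
import pandas as pd

# Invented values: two of them lie exactly on the bounds
lengths = pd.Series([37.0, 38.5, 40.0], name="Culmen Length (mm)")

# Default: both bounds are included
inclusive_mask = lengths.between(37, 40)

# inclusive="neither" excludes both bounds (pandas >= 1.3)
exclusive_mask = lengths.between(37, 40, inclusive="neither")

# The comparison version from the previous cell behaves like "neither"
comparison_mask = (lengths > 37) & (lengths < 40)

print(inclusive_mask.tolist())
print(exclusive_mask.tolist())
print(comparison_mask.tolist())
```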


## 6) Create a new column

In [95]:
df['Culmen Area (mm^2)'] = df['Culmen Length (mm)'] * df['Culmen Depth (mm)']

In [96]:
df.head()

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Culmen Area (mm^2)
0,Adelie,39.1,18.7,181.0,3750.0,MALE,731.17
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE,687.3
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE,725.4
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE,708.31
4,Adelie,39.3,20.6,190.0,3650.0,MALE,809.58


In [97]:
# General syntax of creating a new column
# df['new_column_name'] = new values
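
The "new values" on the right-hand side can be a calculation on other columns, a comparison, or a single value. A minimal sketch with a stand-in DataFrame `demo`; the column names `Body Mass (kg)`, `is_heavy` and `source`, as well as the 3700 g threshold, are invented for this example.

```python
import pandas as pd

demo = pd.DataFrame({
    "Species": ["Adelie", "Adelie"],
    "Body Mass (g)": [3750.0, 3800.0],
})

# Calculation: applied to every row of the column at once
demo["Body Mass (kg)"] = demo["Body Mass (g)"] / 1000

# Comparison: gives a boolean column (3700 g is an arbitrary threshold)
demo["is_heavy"] = demo["Body Mass (g)"] > 3700

# Single value: repeated for every row
demo["source"] = "penguins_simple.csv"

print(demo)
```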