## Pandas

### What is Pandas?
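* Open-source Python library for data analysis and manipulation
* Fast, labeled data containers: `Series` (one-dimensional) and `DataFrame` (two-dimensional)
* Built-in methods for reading, cleaning, selecting, grouping, and summarizing data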

In [1]:
import pandas as pd

### DataFrame

A DataFrame is a two-dimensional table. It holds individual *entries*, each with a *value*, and every entry is located by its row (or *record*) and its *column*.

For example, consider the following simple DataFrame:

In [44]:
# Each key is a column name and each list holds that column's values;
# None marks a missing entry
data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
        'city': ['Austin', 'Dallas', None, 'Dallas'],
        'visitors': [139, 237, 326, 456],
        'signups': [7, 12, 3, 5]}

In [45]:
df = pd.DataFrame(data)

In [9]:
df

| | weekday | city | visitors | signups |
|---|---|---|---|---|
| 0 | Sun | Austin | 139 | 7 |
| 1 | Sun | Dallas | 237 | 12 |
| 2 | Mon | None | 326 | 3 |
| 3 | Mon | Dallas | 456 | 5 |


In [10]:
# Column names, non-null counts, dtypes, and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   weekday   4 non-null      object
 1   city      3 non-null      object
 2   visitors  4 non-null      int64 
 3   signups   4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes


In [12]:
# Count the missing values in each column
df.isna().sum()

weekday     0
city        1
visitors    0
signups     0
dtype: int64

In [6]:
# (number of rows, number of columns)
df.shape

(4, 4)
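
To make the row-and-column picture concrete, here is a minimal sketch that rebuilds the same small table and reads individual entries from it. `DataFrame.at` and `DataFrame.loc` are standard pandas accessors; the variable names (`first_city`, `rows_with_gaps`, and so on) are illustrative only.

```python
import pandas as pd

# Same small table as in the cells above
data = {
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', None, 'Dallas'],
    'visitors': [139, 237, 326, 456],
    'signups': [7, 12, 3, 5],
}
df = pd.DataFrame(data)

# One entry is addressed by a row label plus a column name
first_city = df.at[0, 'city']            # 'Austin'
last_visitors = df.loc[3, 'visitors']    # 456

# A single column and a single row are both returned as a Series
visitors = df['visitors']
row_1 = df.loc[1]

# Keep only the rows that have a missing value in any column
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Only row 2 is printed, which matches the single missing `city` reported by `df.isna().sum()`.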

### Reading a CSV file into a DataFrame

In [13]:
# Read CSV
df2 = pd.read_csv('biostats_cleaned.csv')

In [14]:
# First five rows (the default for head())
df2.head()

| | Name | Sex | Age | Height | Weight |
|---|---|---|---|---|---|
| 0 | Alex | M | 41 | 74 | 170 |
| 1 | Bert | M | 42 | 68 | 166 |
| 2 | Carl | M | 32 | 70 | 155 |
| 3 | Dave | M | 39 | 72 | 167 |
| 4 | Elly | F | 30 | 66 | 124 |


In [15]:
# Summary statistics for the numeric columns
df2.describe()

| | Age | Height | Weight |
|---|---|---|---|
| count | 18.0 | 18.0 | 18.0 |
| mean | 34.666667 | 69.055556 | 146.722222 |
| std | 7.577055 | 3.52257 | 22.540958 |
| min | 23.0 | 62.0 | 98.0 |
| 25% | 30.0 | 66.25 | 132.0 |
| 50% | 32.5 | 69.5 | 150.0 |
| 75% | 38.75 | 71.75 | 165.25 |
| max | 53.0 | 75.0 | 176.0 |


In [16]:
df2.columns

Index(['Name', 'Sex', 'Age', 'Height', 'Weight'], dtype='object')

In [18]:
# Use the Name column as the row index
df2 = df2.set_index('Name')

In [19]:
df2.head()

| Name | Sex | Age | Height | Weight |
|---|---|---|---|---|
| Alex | M | 41 | 74 | 170 |
| Bert | M | 42 | 68 | 166 |
| Carl | M | 32 | 70 | 155 |
| Dave | M | 39 | 72 | 167 |
| Elly | F | 30 | 66 | 124 |
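
As a possible shortcut, `pd.read_csv` accepts an `index_col` argument that sets the index while the file is being read, which would fold the load and `set_index` steps into one call. The sketch below is an illustration under assumptions: `biostats_cleaned.csv` is not reproduced here, so it reads only the five rows printed by `head()` from an in-memory string, and `csv_text` and `sample` are hypothetical names.

```python
import io
import pandas as pd

# Stand-in for biostats_cleaned.csv: only the five rows shown by head()
csv_text = """Name,Sex,Age,Height,Weight
Alex,M,41,74,170
Bert,M,42,68,166
Carl,M,32,70,155
Dave,M,39,72,167
Elly,F,30,66,124
"""

# index_col='Name' uses the Name column as the row index at load time
sample = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(sample.index.name)     # Name
print(list(sample.columns))  # ['Sex', 'Age', 'Height', 'Weight']
```

With a real file, the same call would take the file path in place of the `io.StringIO` object.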


### Index-based selection

In [20]:
# Rows at positions 2 and 3 (the end of an iloc slice is excluded), all columns
df2.iloc[2:4, :]

| Name | Sex | Age | Height | Weight |
|---|---|---|---|---|
| Carl | M | 32 | 70 | 155 |
| Dave | M | 39 | 72 | 167 |


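### Label-based selection

In [32]:
# Rows labeled Bert through Dave (a loc slice includes its end label),
# restricted to the Height and Age columns
df2.loc['Bert':'Dave', ['Height', 'Age']]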

| Name | Height | Age |
|---|---|---|
| Bert | 68 | 42 |
| Carl | 70 | 32 |
| Dave | 72 | 39 |


### Conditional selection

In [33]:
# Selecting people over age 35
df2[df2['Age'] > 35]

| Name | Sex | Age | Height | Weight |
|---|---|---|---|---|
| Alex | M | 41 | 74 | 170 |
| Bert | M | 42 | 68 | 166 |
| Dave | M | 39 | 72 | 167 |
| Ivan | M | 53 | 72 | 175 |
| Kate | F | 47 | 69 | 139 |
| Neil | M | 36 | 75 | 160 |
| Omar | M | 38 | 70 | 145 |


In [39]:
# Select only females
df2[df2['Sex'] == 'F']

| Name | Sex | Age | Height | Weight |
|---|---|---|---|---|
| Elly | F | 30 | 66 | 124 |
| Fran | F | 33 | 66 | 115 |
| Gwen | F | 26 | 64 | 121 |
| Kate | F | 47 | 69 | 139 |
| Myra | F | 23 | 62 | 98 |
| Page | F | 31 | 67 | 135 |
| Ruth | F | 28 | 65 | 131 |
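
Conditions can also be combined. The sketch below is an illustration, not part of the original notebook: it uses a six-row subset of the rows printed above (under the hypothetical name `sample`), so its results describe that subset rather than the full 18-row file. Each comparison goes in its own parentheses and is joined with `&` (and) or `|` (or), because Python's `and` and `or` do not operate element by element on a Series.

```python
import pandas as pd

# Subset of the rows printed above, indexed by Name (not the full file)
sample = pd.DataFrame(
    {
        'Sex': ['M', 'F', 'F', 'F', 'M', 'F'],
        'Age': [41, 30, 33, 47, 53, 23],
        'Height': [74, 66, 66, 69, 72, 62],
        'Weight': [170, 124, 115, 139, 175, 98],
    },
    index=pd.Index(['Alex', 'Elly', 'Fran', 'Kate', 'Ivan', 'Myra'], name='Name'),
)

# Both conditions must hold: female AND older than 30
females_over_30 = sample[(sample['Sex'] == 'F') & (sample['Age'] > 30)]

# Either condition may hold: younger than 30 OR older than 50
youngest_or_oldest = sample[(sample['Age'] < 30) | (sample['Age'] > 50)]

print(females_over_30.index.tolist())     # ['Fran', 'Kate']
print(youngest_or_oldest.index.tolist())  # ['Ivan', 'Myra']
```

The parentheses matter: `&` and `|` bind more tightly than `==` and `>`, so leaving them out changes the meaning of the expression and typically raises an error.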


In [41]:
# Minimum height within each Sex group
df2.groupby('Sex').Height.min()

Sex
F    62
M    68
Name: Height, dtype: int64
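
A grouped column can be summarized with several statistics at once by passing a list of function names to `agg`. This is a sketch under assumptions: it runs on a six-row subset of the rows printed above (hypothetical name `sample`), so the numbers describe that subset and are not guaranteed to match the full file for other statistics.

```python
import pandas as pd

# Subset of the rows printed above, indexed by Name (not the full file)
sample = pd.DataFrame(
    {
        'Sex': ['M', 'M', 'M', 'F', 'F', 'F'],
        'Age': [41, 42, 36, 30, 47, 23],
        'Height': [74, 68, 75, 66, 69, 62],
        'Weight': [170, 166, 160, 124, 139, 98],
    },
    index=pd.Index(['Alex', 'Bert', 'Neil', 'Elly', 'Kate', 'Myra'], name='Name'),
)

# One row per group, one column per statistic
height_range = sample.groupby('Sex')['Height'].agg(['min', 'max'])
print(height_range)

# Number of rows in each group
group_sizes = sample.groupby('Sex').size()
print(group_sizes)
```

Selecting the column with brackets, `['Height']`, behaves the same as the attribute form `.Height` and also works for column names that contain spaces or clash with method names.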