In [1]:
import pandas as pd

# pandas package

In [2]:
pd.__version__

'2.1.4'

---

pandas is designed to make working with “relational” or “labeled” data both easy and intuitive.

---

pandas is a dependency of __<u>statsmodels</u>__, making it an important part of the statistical computing ecosystem in Python.

https://www.statsmodels.org/stable/index.html

---

__<u>Data structures</u>__:

- Series - 1D labeled homogeneously-typed array
- DataFrame - general 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
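
As a minimal sketch of the two structures in the list above (the names `s` and `frame` and all values are made up for illustration):

```python
import pandas as pd

# A Series: one labeled, one-dimensional array with a single dtype
s = pd.Series([1.5, 2.5, 3.5], name="price")

# A DataFrame: a two-dimensional table whose columns may have different dtypes
frame = pd.DataFrame({"item": ["a", "b", "c"], "price": [1.5, 2.5, 3.5]})

print(s.ndim, s.dtype)   # 1 float64
print(frame.ndim)        # 2
print(frame.dtypes)      # one dtype per column
```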



---

In [None]:
# `df` is any DataFrame, e.g. the one created in the next cell
for col in df.columns:
    series = df[col]  # each column is a Series
    # do something with series

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable.
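
A small sketch of the difference, using a hypothetical `scores` table: overwriting a cell changes a value, while adding a column changes the size of the DataFrame.

```python
import pandas as pd

scores = pd.DataFrame({"math": [70, 80, 90]})

# Value mutation: overwrite a single cell in place
scores.loc[0, "math"] = 75

# Size mutation: a DataFrame can grow by inserting a new column
scores["physics"] = [60, 65, 85]

print(scores.shape)  # (3, 2)
```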

---

In [3]:
df = pd.DataFrame(
    {'Name': [
        "Braund, Mr. Owen Harris",
        "Allen, Mr. William Henry",
        "Bonnell, Miss. Elizabeth",
    ],
    'Age': [22, 35, 78],
    'Sex': ["male", "male", "female"]
    }
)

In [4]:
df

| | Name | Age | Sex |
|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | 22 | male |
| 1 | Allen, Mr. William Henry | 35 | male |
| 2 | Bonnell, Miss. Elizabeth | 78 | female |


Each column in a DataFrame is a Series. When selecting a single column of a pandas DataFrame, the result is a pandas Series.


In [5]:
df['Age']

0    22
1    35
2    78
Name: Age, dtype: int64

---

You can create a Series from scratch as well:


In [6]:
# name= sets the name of the Series
ages = pd.Series([22, 35, 78], name='Age')
ages

0    22
1    35
2    78
Name: Age, dtype: int64

A pandas Series has no column labels, but does have row labels.
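
The row labels live in the Series index. A minimal sketch, assuming made-up labels passed through the `index=` argument (without it, pandas numbers the rows 0, 1, 2, ...):

```python
import pandas as pd

# index= supplies the row labels of the Series
ages_by_name = pd.Series(
    [22, 35, 78],
    index=["Owen", "William", "Elizabeth"],
    name="Age",
)

print(ages_by_name.index.tolist())  # ['Owen', 'William', 'Elizabeth']
print(ages_by_name["William"])      # 35
```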

---

In [7]:
# I want to know the maximum Age of the passengers

df['Age'].max()

78

In [8]:
ages.max()

78

__max() method__

As methods are functions, do not forget to use parentheses ().

---

The describe() method gives some basic statistics of the numerical data in the table:

In [9]:
df.describe()

| | Age |
|---|---|
| count | 3.0 |
| mean | 45.0 |
| std | 29.308702 |
| min | 22.0 |
| 25% | 28.5 |
| 50% | 35.0 |
| 75% | 56.5 |
| max | 78.0 |


---
---
---

## Titanic

Reading tabular data: `read_*` functions

Writing tabular data: `to_*` methods
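
A minimal round-trip sketch of the `to_*` / `read_*` pairing, using an in-memory buffer instead of a file so that it runs without network or disk access (the `people` table is made up):

```python
import io

import pandas as pd

people = pd.DataFrame({"Name": ["Owen", "William"], "Age": [22, 35]})

# to_csv without a path returns the CSV text; index=False leaves out the row labels
csv_text = people.to_csv(index=False)

# read_csv accepts a path, a URL, or a file-like object
restored = pd.read_csv(io.StringIO(csv_text))

print(restored.equals(people))  # True
```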

In [3]:
titanic = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')

In [4]:
titanic

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


---

__HDF5__ - The Hierarchical Data Format version 5 is an open source file format that supports large, complex, heterogeneous data. pandas reads and writes it with `read_hdf` and `to_hdf`.

---

__.head()__ method: returns the first rows of a DataFrame (5 by default)
<br> __.tail()__ method: returns the last rows (5 by default)
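
A short sketch of both methods on a made-up 20-row table; the optional argument sets how many rows are returned:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(20)})

first_rows = numbers.head()    # first 5 rows by default
last_three = numbers.tail(3)   # last 3 rows

print(first_rows["n"].tolist())  # [0, 1, 2, 3, 4]
print(last_three["n"].tolist())  # [17, 18, 19]
```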

---

A check on how pandas interpreted __each of the column data types__ can be done by requesting the pandas __dtypes__ <u>attribute</u>:


In [5]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

---

#### <span style="color:red">__* Note__</span>

    When asking for the dtypes, no parentheses are used! dtypes is an attribute of a DataFrame and Series. Attributes of a DataFrame or Series do not need parentheses. Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses) do something with the DataFrame/Series.
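
The attribute/method distinction in the note can be sketched on a small made-up table:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Owen", "William"], "Age": [22, 35]})

# Attributes describe the object and are accessed without parentheses
print(people.shape)   # (2, 2)
print(people.dtypes)

# Methods do something and are called with parentheses
print(people["Age"].max())  # 35

# Leaving the parentheses off returns the method object itself, not the result
print(callable(people["Age"].max))  # True
```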

---

__.info()__ method - a technical summary of a DataFrame

In [6]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


---

Each column in a DataFrame is a Series. When a single column is selected, the returned object is a pandas Series.

In [7]:
type(titanic['Age'])

pandas.core.series.Series

In [8]:
titanic['Age'].shape 

# a vector
# A pandas Series is 1-dimensional and only the number of rows is returned.

(891,)

In [9]:
type(titanic[['Age', 'Sex']])

# To select multiple columns, use a list of column names within the selection brackets [].

pandas.core.frame.DataFrame

---
---

### How do I filter specific rows from a DataFrame?

To select rows based on a conditional expression, use a condition inside the selection brackets __[ ]__.


In [14]:
above_35 = titanic[titanic['Age'] > 35]

In [15]:
titanic['Age'] > 35 # element-wise

0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

The output of the conditional expression (`>`, but also `==`, `!=`, `<`, `<=`, … would work) is a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame.
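
A minimal sketch of how such a boolean Series acts as a row filter, on a made-up table with one missing age. Note that a comparison against a missing value evaluates to False, so that row is dropped by the filter:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, float("nan"), 54.0]}
)

mask = people["Age"] > 35      # element-wise comparison; NaN > 35 is False
print(mask.tolist())           # [False, True, False, True]

older = people[mask]           # keeps only the rows where the mask is True
print(older["Name"].tolist())  # ['B', 'D']
```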

In [17]:
# Titanic passengers from cabin class 2 and 3
titanic[titanic['Pclass'].isin([2, 3])].head()

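| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |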


In [18]:
# the above is equivalent to
titanic[(titanic['Pclass'] == 2) | (titanic['Pclass'] == 3)].head()

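| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |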


__!!! Note__
    
When combining multiple conditional statements, each condition must be surrounded by parentheses `()`. Moreover, you cannot use Python's `or`/`and` keywords; use the element-wise operators `|` (or) and `&` (and) instead.
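
A small sketch of the rule in this note, on a made-up table. The last part shows what goes wrong with the `and` keyword: Python asks for a single truth value of a whole Series, and pandas raises a `ValueError`.

```python
import pandas as pd

people = pd.DataFrame({"Pclass": [1, 2, 3, 3], "Age": [40, 30, 50, 20]})

# Each condition sits in its own parentheses and is combined element-wise
third_class_over_35 = people[(people["Pclass"] == 3) & (people["Age"] > 35)]
print(third_class_over_35.index.tolist())  # [2]

second_or_third = people[(people["Pclass"] == 2) | (people["Pclass"] == 3)]
print(second_or_third.index.tolist())      # [1, 2, 3]

# The `and` keyword needs one True/False for the whole Series, which is ambiguous
try:
    people[(people["Pclass"] == 3) and (people["Age"] > 35)]
except ValueError as err:
    print(type(err).__name__)  # ValueError
```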



---

In [21]:
# passenger data for which the age is known

age_no_na = titanic[titanic['Age'].notna()]
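
What `notna()` returns can be seen on a small made-up table: a boolean Series that is True wherever a value is present, which then works as a row filter like any other condition.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, float("nan"), 54.0]})

known = people["Age"].notna()   # True where the age is not missing
print(known.tolist())           # [True, False, True]

age_known = people[known]
print(age_known.shape)          # (2, 2)
```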

---

### How do I select specific rows and columns from a DataFrame?


In [25]:
# the names of the passengers older than 35 years

titanic.loc[titanic['Age'] > 35, 'Name']

1      Cumings, Mrs. John Bradley (Florence Briggs Th...
6                                McCarthy, Mr. Timothy J
11                              Bonnell, Miss. Elizabeth
13                           Andersson, Mr. Anders Johan
15                      Hewlett, Mrs. (Mary D Kingcome) 
                             ...                        
865                             Bystrom, Mrs. (Karolina)
871     Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
873                          Vander Cruyssen, Mr. Victor
879        Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
885                 Rice, Mrs. William (Margaret Norton)
Name: Name, Length: 217, dtype: object

---

- In this case, __a subset of both rows and columns__ is made in one go, and __selection brackets [] alone are no longer sufficient__.
- The `loc`/`iloc` operators are required in front of the selection brackets [].
- When using `loc`/`iloc`, the part before the comma selects the rows and the part after the comma selects the columns.
- For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns. With `iloc`, use integer positions instead of labels.
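
The selector forms listed above can be sketched on a made-up table with string row labels, which makes the difference between labels (`loc`) and positions (`iloc`) visible:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22, 38, 26, 54], "Sex": ["m", "f", "f", "m"]},
    index=["w", "x", "y", "z"],
)

# loc is label-based; a label slice includes its end label
print(people.loc["x":"y", "Name"].tolist())   # ['B', 'C']

# A conditional expression for the rows, a list of labels for the columns
subset = people.loc[people["Age"] > 30, ["Name", "Sex"]]
print(subset.shape)                           # (2, 2)

# A colon selects all rows
print(people.loc[:, "Age"].tolist())          # [22, 38, 26, 54]

# iloc is position-based; the end position of a slice is excluded
print(people.iloc[1:3, 0].tolist())           # ['B', 'C']
```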





---

In [26]:
# I’m interested in rows 10 till 25 and columns 3 to 5.

titanic.iloc[9:25, 2:5]

| | Pclass | Name | Sex |
|---|---|---|---|
| 9 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female |
| 10 | 3 | Sandstrom, Miss. Marguerite Rut | female |
| 11 | 1 | Bonnell, Miss. Elizabeth | female |
| 12 | 3 | Saundercock, Mr. William Henry | male |
| 13 | 3 | Andersson, Mr. Anders Johan | male |
| 14 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female |
| 15 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female |
| 16 | 3 | Rice, Master. Eugene | male |
| 17 | 2 | Williams, Mr. Charles Eugene | male |
| 18 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female |


---

When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For example, to assign the name 'anonymous' to the first 3 elements of the fourth column:


In [27]:
titanic.iloc[:3, 3] = 'anonymous'

In [29]:
titanic.head()

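| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | anonymous | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | anonymous | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | anonymous | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |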
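
Assignment works the same way through `loc`. A minimal sketch on a made-up table, where a condition picks the rows and a label picks the column:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22, 38, 54]})

# Overwrite Name only in the rows where the condition holds
people.loc[people["Age"] > 35, "Name"] = "anonymous"

print(people["Name"].tolist())  # ['A', 'anonymous', 'anonymous']
```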


---