In [41]:
import pandas as pd

<div class="alert alert-warning">
    
This tutorial uses the titanic data set, stored as CSV. For details on the titanic columns and how to read the data with pandas, see [tutorial 2 on read/write operations](./2_read_write.ipynb).

</div>

In [42]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# How do I select a subset of data in a `DataFrame`? 

### How do I select specific columns from a `DataFrame`?

![](../schemas/03_subset_columns.png)

  > I'm interested in the age of the titanic passengers.

In [43]:
ages = titanic["Age"]
ages.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

To select a single column, use square brackets `[]` with the column name of the column of interest.

Each column in a `DataFrame` is a `Series`. As a single column is selected, the returned object is a pandas `Series`. We can verify this by checking the type of the output:

In [65]:
type(titanic["Age"])

pandas.core.series.Series

And have a look at the `shape` of the output:

In [64]:
titanic["Age"].shape

(891,)

`shape` is an attribute of a pandas `Series` and `DataFrame` (remember from the [previous tutorial](./2_read_write.ipynb): no parentheses for attributes) containing the number of rows and columns: _(nrows, ncolumns)_. A pandas `Series` is 1-dimensional, so only the number of rows is returned.

  > I'm interested in the age and sex of the titanic passengers.

In [66]:
age_sex = titanic[["Age", "Sex"]]
age_sex.head()

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male


To select multiple columns, use a list of column names within the selection brackets `[]`. 

<div class="alert alert-info">
    
__Note:__ The inner square brackets define a :ref:`Python list <python:tut-morelists>` with column names, whereas the outer brackets are used to select the data from a pandas `DataFrame`. The previous example can therefore also be written as:

```python
columns_to_select = ["Age", "Sex"]
titanic[columns_to_select]
```

</div>

The returned data type is a pandas `DataFrame`:

In [67]:
type(titanic[["Age", "Sex"]])

pandas.core.frame.DataFrame

In [68]:
titanic[["Age", "Sex"]].shape

(891, 2)

The selection returned a `DataFrame` with 891 rows and 2 columns. A `DataFrame` is 2-dimensional with both a row and column dimension.
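
The difference between single and double brackets can be checked on a small table. The sketch below uses a hypothetical three-row `toy` table (not the titanic data) and selects the same column both ways:

```python
import pandas as pd

# Hypothetical toy table, made up for illustration (not the titanic data set).
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]})

single = toy["Age"]    # a column name -> 1-dimensional Series
double = toy[["Age"]]  # a list with one column name -> 2-dimensional DataFrame

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```

Even with a single column name, passing a list keeps the result 2-dimensional.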

__To user guide:__ For basic information on indexing, see :ref:`indexing.basics`

### How do I filter specific rows from a `DataFrame`?

![](../schemas/03_subset_rows.png)

> I'm interested in the passengers older than 35 years.

In [73]:
above_35 = titanic[titanic["Age"] > 35]
above_35.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S


To select rows based on a conditional expression, use a condition inside the selection brackets `[]`. The condition inside the selection brackets `titanic["Age"] > 35` checks for which rows the `Age` column has a value larger than 35:

In [70]:
titanic["Age"] > 35

0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

The output of the conditional expression (`>`, but also `==`, `!=`, `<`, `<=`,... would work) is actually a pandas `Series` of boolean values (either `True` or `False`) with the same number of rows as the original `DataFrame`. Such a `Series` of boolean values can be used to filter the `DataFrame` by putting it in between the selection brackets `[]`. Only rows for which the value is `True` will be selected.
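
The two steps (build the boolean `Series`, then use it as a filter) can also be written separately. This is a minimal sketch on a hypothetical `toy` table; the names `mask` and `older` are only illustrative:

```python
import pandas as pd

# Hypothetical toy table; one age is missing on purpose.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cara", "Dan"], "Age": [22.0, 54.0, None, 39.0]}
)

mask = toy["Age"] > 35  # boolean Series, one value per row
print(mask.tolist())    # [False, True, False, True]

older = toy[mask]       # keep only the rows where mask is True
print(older["Name"].tolist())  # ['Bob', 'Dan']

# Summing a boolean Series counts the True values.
print(mask.sum())  # 2
```

Note that the comparison evaluates to `False` for the missing age, so that row is dropped by the filter.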

We know from before that the original titanic `DataFrame` consists of 891 rows. Let's have a look at the number of rows which satisfy the condition by checking the `shape` attribute of the resulting `DataFrame` `above_35`:

In [75]:
above_35.shape

(217, 12)

> I'm interested in the titanic passengers from cabin classes 2 and 3.

In [76]:
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


Similar to the conditional expression, the `isin` conditional function returns `True` for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets `[]`. In this case, the condition inside the selection brackets `titanic["Pclass"].isin([2, 3])` checks for which rows the `Pclass` column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an `|` (or) operator:

In [58]:
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


<div class="alert alert-info">
    
__Note:__ When combining multiple conditional statements, each condition must be surrounded by parentheses `()`. Moreover, you cannot use `or`/`and` but need to use the "or" operator `|` and the "and" operator `&`.

</div>
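
As a small illustration of these rules, the sketch below (on a hypothetical `toy` table) compares `isin` with the `|` form, combines two conditions with `&`, and shows that the plain Python keyword `and` raises an error on a `Series`:

```python
import pandas as pd

# Hypothetical toy table, made up for illustration.
toy = pd.DataFrame(
    {"Pclass": [1, 2, 3, 3, 2], "Age": [38.0, 27.0, 22.0, 40.0, 55.0]}
)

via_isin = toy[toy["Pclass"].isin([2, 3])]
via_or = toy[(toy["Pclass"] == 2) | (toy["Pclass"] == 3)]
print(via_isin.equals(via_or))  # True

# "and" of two conditions: class 3 AND older than 35.
both = toy[(toy["Pclass"] == 3) & (toy["Age"] > 35)]
print(both.index.tolist())  # [3]

# The Python keyword `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess.
raised = False
try:
    toy[(toy["Pclass"] == 3) and (toy["Age"] > 35)]
except ValueError:
    raised = True
print(raised)  # True
```

The parentheses matter because `|` and `&` bind more tightly than comparison operators such as `==` and `>`.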

__To user guide:__ For conditional (boolean) indexing, see :ref:`indexing.boolean`. For specific information on `isin`, see :ref:`indexing.basics.indexing_isin`.

> I want to work with passenger data for which the age is known.

In [59]:
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `notna` conditional function returns `True` for each row whose value is not a null (missing) value. As such, it can be combined with the selection brackets `[]` to filter the data table.

You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:

In [78]:
age_no_na.shape

(714, 12)
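
Besides comparing shapes, the number of dropped rows can be cross-checked against the count of missing values. A minimal sketch on a hypothetical `toy` table, using `isna` as the counterpart of `notna`:

```python
import pandas as pd

# Hypothetical toy table with two missing ages.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cara", "Dan"], "Age": [22.0, None, 26.0, None]}
)

known = toy[toy["Age"].notna()]    # rows where Age is present
n_missing = toy["Age"].isna().sum()  # number of rows where Age is missing

print(known.shape)  # (2, 2)
print(n_missing)    # 2
print(len(toy) - len(known) == n_missing)  # True
```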

__To user guide:__ For more dedicated functions on missing values, see :ref:`missing-data`.

### How do I select specific rows and columns from a `DataFrame`? 

![](../schemas/03_subset_columns_rows.png)

> I'm interested in the names of the passengers older than 35 years.

In [60]:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names.head()

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

In this case, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore. The `loc`/`iloc` operators are required in front of the selection brackets `[]`. When using `loc`/`iloc`, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

When using the column names, row labels or a conditional expression, use the `loc` operator in front of the selection brackets `[]`. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies that you want to select all rows or columns.
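
The different kinds of selectors accepted by `loc` can be tried side by side. The sketch below assumes a hypothetical `toy` table with the default integer row labels 0 to 3:

```python
import pandas as pd

# Hypothetical toy table, made up for illustration.
toy = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cara", "Dan"],
        "Age": [22.0, 54.0, 26.0, 39.0],
        "Sex": ["female", "male", "female", "male"],
    }
)

one_value = toy.loc[1, "Name"]                  # single row label, single column label
by_lists = toy.loc[[0, 2], ["Name", "Age"]]     # list of row labels, list of column labels
by_slices = toy.loc[1:2, "Name":"Age"]          # label slices: both end points are included
by_condition = toy.loc[toy["Age"] > 35, :]      # conditional expression, colon = all columns

print(one_value)           # Bob
print(by_lists.shape)      # (2, 2)
print(by_slices.shape)     # (2, 2)
print(by_condition.shape)  # (2, 3)
```

Unlike ordinary Python slicing, a label slice with `loc` includes its end label, which is why `1:2` returns two rows here.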

> I'm interested in rows 10 through 25 and columns 3 through 5.

In [61]:
titanic.iloc[9:25, 2:5]

Unnamed: 0,Pclass,Name,Sex
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,3,"Sandstrom, Miss. Marguerite Rut",female
11,1,"Bonnell, Miss. Elizabeth",female
12,3,"Saundercock, Mr. William Henry",male
13,3,"Andersson, Mr. Anders Johan",male
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female
16,3,"Rice, Master. Eugene",male
17,2,"Williams, Mr. Charles Eugene",male
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


Again, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the `iloc` operator in front of the selection brackets `[]`.
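
Positions used with `iloc` are zero-based and the end of a slice is excluded, which is why the tenth row starts at position 9 and the slice stops at 25. A minimal sketch on a hypothetical 30-row table with made-up column names `col1` to `col6`:

```python
import pandas as pd

# Hypothetical table: 30 rows, six columns named col1..col6.
toy = pd.DataFrame({f"col{i}": range(30) for i in range(1, 7)})

subset = toy.iloc[9:25, 2:5]  # positions 9..24 and 2..4

print(subset.shape)                        # (16, 3)
print(list(subset.columns))                # ['col3', 'col4', 'col5']
print(subset.index[0], subset.index[-1])   # 9 24
```

The result has 16 rows (the 10th through the 25th) and the 3rd through 5th columns.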

When selecting specific rows and/or columns with `loc` or `iloc`, new values can be assigned to the selected data. For example, to assign the name `anonymous` to the first 3 elements of the fourth column (position 3, the `Name` column):

In [40]:
titanic.iloc[0:3, 3] = "anonymous"
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,anonymous,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,anonymous,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,anonymous,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
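
Assignment works the same way with `loc` and a conditional expression. The following sketch, on a hypothetical `toy` table, overwrites the name only for rows that satisfy a condition:

```python
import pandas as pd

# Hypothetical toy table, made up for illustration.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cara", "Dan"], "Age": [22.0, 54.0, 26.0, 39.0]}
)

# Rows are chosen by the condition, the column by its label, in a single loc call.
toy.loc[toy["Age"] > 35, "Name"] = "anonymous"

print(toy["Name"].tolist())  # ['Ann', 'anonymous', 'Cara', 'anonymous']
```

Doing the row and column selection in a single `loc` call is generally the safer choice; a chained form such as `toy[toy["Age"] > 35]["Name"] = ...` may act on a temporary copy and leave `toy` unchanged.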


__To user guide:__ For a more detailed description of selecting subsets of a data table, see :ref:`indexing.choice`.

## REMEMBER

- When selecting subsets of data, square brackets `[]` are used.
- Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Select specific rows and/or columns using `loc` when using the row and column names.
- Select specific rows and/or columns using `iloc` when using the positions in the table.
- You can assign new values to a selection based on `loc`/`iloc`.

__To user guide:__ Further details about indexing are provided in :ref:`indexing`.