Objectives

- Extract and manipulate data using column headings
- Query / select a subset of data using boolean indexing
- Understand the difference between loc and iloc
- Drop rows with NaN values in a given column

Content to cover

- `df["COLUMN_NAME"]` and `df[["COLUMN_NAME_1", "COLUMN_NAME_2"]]`
- assign a new value to a selection
- `df[df["Name"] < 18]` conditional expression
- `df[df["Name"].isin([...])]` conditional function
- `loc`/`iloc`
- `df["column"].dropna()` or `df.dropna(subset=["column"])`


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Select the data you need

### Select specific columns

![](../schemas/03_subset_columns.png)

> I'm interested in the age of the Titanic passengers

In [20]:
ages = titanic["Age"]
ages.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

To select a single column, use square brackets `[]` with the name of the column of interest.

Because a single column is selected, the returned object is a pandas `Series`.

In [5]:
type(titanic["Age"])

pandas.core.series.Series

> I'm interested in the age and sex of the Titanic passengers

In [23]:
age_sex = titanic[["Age", "Sex"]]
age_sex.head()

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male


To select multiple columns, use a list of column names within the selection brackets `[]`. Note that the inner square brackets define the list of column names, while the outer brackets select data from the pandas `DataFrame`, as in the previous example.

The returned object is a pandas `DataFrame`:

In [24]:
type(titanic[["Age", "Sex"]])

pandas.core.frame.DataFrame
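
The difference between single and double brackets also shows up in the shape of the result. The following is a minimal sketch that assumes a small made-up table (`sample`, with invented passenger names) in place of the Titanic file, so that it runs on its own:

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic table (not the real data file)
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, 38.0, None],
        "Sex": ["male", "female", "female"],
    }
)

single = sample["Age"]    # a column name: one-dimensional Series
double = sample[["Age"]]  # a list holding one name: two-dimensional DataFrame

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```

A `Series` has only a row dimension, whereas a `DataFrame` keeps both rows and columns even when it holds a single column.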

__To user guide:__ For basic information on indexing, see :ref:`indexing.basics`

### Filter rows of a table

![](../schemas/03_subset_rows.png)

> I'm interested in the passengers older than 18 years

In [25]:
adults = titanic[titanic["Age"] > 18]
adults.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


To select rows based on a conditional expression, put the condition inside the selection brackets `[]`. The condition `titanic["Age"] > 18` checks for which rows the `Age` column has a value larger than 18. Only the rows for which the condition is `True` are selected.
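
It can help to look at the condition on its own before using it as a filter. The sketch below assumes a small made-up table (`sample`) rather than the Titanic file. The condition evaluates to a boolean `Series` with one `True`/`False` value per row, and a missing age compares as `False`:

```python
import pandas as pd

# Hypothetical four-row stand-in for the Titanic table (not the real data file)
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, 2.0, None, 35.0],
    }
)

# The condition alone: a boolean Series aligned with the rows of the table
mask = sample["Age"] > 18
print(mask.tolist())  # [True, False, False, True]

# Using the boolean Series inside [] keeps only the rows marked True
adults = sample[mask]
print(adults)
```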

> I'm interested in the Titanic passengers from cabin class 2 and 3

In [26]:
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


Similar to a conditional expression, the `isin` conditional function returns `True` for each row whose value is in the provided list. To filter rows based on such a function, use it inside the selection brackets `[]`. Here, the condition `titanic["Pclass"].isin([2, 3])` checks for which rows the `Pclass` column is either 2 or 3.

The above is equivalent to testing separately whether the class is 2 or 3 and combining the two conditions with the `|` (or) operator:

In [27]:
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


__To user guide:__ For conditional (boolean) indexing, see :ref:`indexing.boolean`. For specific information on `isin`, see :ref:`indexing.basics.indexing_isin`.
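
When combining conditions, each one needs its own parentheses, because Python's `|` and `&` operators bind more tightly than comparisons such as `==`. The sketch below assumes a small made-up table (`sample`) in place of the Titanic file. It checks that the `isin` form and the `|` form select the same rows, and shows `&` (and) for requiring two conditions at once:

```python
import pandas as pd

# Hypothetical five-row stand-in for the Titanic table (not the real data file)
sample = pd.DataFrame(
    {"Pclass": [3, 1, 2, 3, 1], "Age": [22.0, 38.0, 14.0, 2.0, 54.0]}
)

# Two ways to select class 2 or 3
with_isin = sample[sample["Pclass"].isin([2, 3])]
with_or = sample[(sample["Pclass"] == 2) | (sample["Pclass"] == 3)]
print(with_isin.equals(with_or))  # True

# & keeps only rows where both conditions hold: class 2 or 3 AND older than 18
adults_23 = sample[sample["Pclass"].isin([2, 3]) & (sample["Age"] > 18)]
print(adults_23)
```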

> I want to work with passenger data for which the age is known

In [28]:
age_nonull = titanic[titanic["Age"].notnull()]
age_nonull.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `notnull` conditional function returns `True` for each row whose value is not a null (missing) value. As such, it can be combined with the selection brackets `[]` to filter the data table.
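
Dropping the rows with a missing value in a given column can also be done with `dropna` and its `subset` argument. The sketch below assumes a small made-up table (`sample`) in place of the Titanic file and compares the two approaches:

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic table (not the real data file)
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, None, 26.0],
        "Cabin": [None, "C85", None],
    }
)

# Boolean indexing: keep rows where Age is known
known_age = sample[sample["Age"].notna()]

# dropna restricted to the Age column; missing Cabin values are ignored
dropped = sample.dropna(subset=["Age"])

print(known_age.equals(dropped))  # True
print(dropped)
```

Without `subset`, `dropna` would remove every row that has a missing value in any column, which in this sample would leave no rows at all.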

__To user guide:__ For more dedicated functions on missing values, see :ref:`missing-data`

### Select specific rows and/or columns

![](../schemas/03_subset_columns_rows.png)

> I'm interested in the names of the passengers older than 18 years

In [34]:
adult_names = titanic.loc[titanic["Age"] > 18, "Name"]
adult_names.head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

When selecting with column names, row labels or a conditional expression, use the `loc` operator in front of the selection brackets `[]`. The part before the comma selects the rows; the part after the comma selects the columns.
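
The same `loc` selection can also be the target of an assignment, which is a convenient way to assign a new value to a subset of a table. The sketch below assumes a small made-up table (`sample`) and a hypothetical `Group` column that is not part of the Titanic data:

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic table (not the real data file)
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, 2.0, 35.0],
    }
)

# Add an illustrative column with a default value for every row
sample["Group"] = "child"

# Overwrite the value only in the rows where the condition holds
sample.loc[sample["Age"] > 18, "Group"] = "adult"

print(sample["Group"].tolist())  # ['adult', 'child', 'adult']
```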

> I'm interested in rows 10 to 25 and columns 3 to 5

In [35]:
titanic.iloc[9:25, 2:5]

Unnamed: 0,Pclass,Name,Sex
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,3,"Sandstrom, Miss. Marguerite Rut",female
11,1,"Bonnell, Miss. Elizabeth",female
12,3,"Saundercock, Mr. William Henry",male
13,3,"Andersson, Mr. Anders Johan",male
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female
16,3,"Rice, Master. Eugene",male
17,2,"Williams, Mr. Charles Eugene",male
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


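When specifically interested in certain rows and/or columns based on their position in the table, use the `iloc` operator in front of the selection brackets `[]`. Positions are zero-based and the end of a slice is excluded, so `9:25` covers rows 10 to 25 and `2:5` covers columns 3 to 5.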
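
The difference between `loc` and `iloc` is easiest to see when the row labels differ from the row positions. The sketch below assumes a small made-up table (`sample`) with row labels 10 to 40, chosen only for illustration. A `loc` slice works on labels and includes the end label, while an `iloc` slice works on positions and excludes the end position:

```python
import pandas as pd

# Hypothetical four-row table whose row labels differ from the row positions
sample = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, 38.0, 26.0, 35.0],
    },
    index=[10, 20, 30, 40],
)

# loc: labels 10 up to and including 30 -> three rows
by_label = sample.loc[10:30, "Name"]
print(by_label.index.tolist())  # [10, 20, 30]

# iloc: positions 0 and 1 (position 2 is excluded) -> two rows
by_position = sample.iloc[0:2, 0]
print(by_position.index.tolist())  # [10, 20]
```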

__To user guide:__ For more detailed description on selecting subsets of a data table, see :ref:`indexing.choice`

## REMEMBER

- When selecting subsets of data, square brackets `[]` are used.
- Inside these brackets, you can use a single column name, a list of column names, a conditional expression or a conditional function
- Select specific rows and/or columns using `loc` when using the row and column names
- Select specific rows and/or columns using `iloc` when using the positions in the table

__To user guide:__ Further details about indexing are provided in :ref:`indexing`