Objectives

- Extract and manipulate data using column headings
- Query / select a subset of data using boolean indexing
- Understand the difference between loc and iloc
- Drop rows with NaN values in a given column

Content to cover

- df["COLUMN_NAME"] and df[["COLUMN_NAME_1", "COLUMN_NAME_2"]]
- assign new value to selection
- df[df["NAME"] < 18] conditional expression
- df[df["NAME"].isin([...])] conditional function
- loc/iloc
- df["column"].dropna() or df.dropna(subset=["column"])


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Select specific columns of a table

![](../schemas/03_subset_columns.png)

  > I'm interested in the ages of the Titanic passengers

In [4]:
titanic["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5       NaN
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17      NaN
18     31.0
19      NaN
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26      NaN
27     19.0
28      NaN
29      NaN
       ... 
861    21.0
862    48.0
863     NaN
864    24.0
865    42.0
866    27.0
867    31.0
868     NaN
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878     NaN
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

To select a single column, use square brackets `[]` with the name of the column of interest.

Because a single column is selected, the returned object is a pandas Series.

In [5]:
type(titanic["Age"])

pandas.core.series.Series

  > I'm interested in the ages and sex of the Titanic passengers

In [6]:
titanic[["Age", "Sex"]]

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
5,,male
6,54.0,male
7,2.0,male
8,27.0,female
9,14.0,female


To select multiple columns, use a list of column names within the selection brackets `[]`. Note that the inner square brackets define a Python list of column names, while the outer brackets select data from the pandas DataFrame.

The returned data type is a Pandas DataFrame.

In [7]:
type(titanic[["Age", "Sex"]])

pandas.core.frame.DataFrame

__To user guide:__ For basic information on indexing, see :ref:`indexing.basics`

### Filter rows of a table using conditions

![](../schemas/03_subset_rows.png)

> I'm interested in the passengers older than 18 years

In [8]:
titanic[titanic["Age"] > 18]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S


To select rows based on a conditional expression, put the condition inside the selection brackets `[]`. The condition `titanic["Age"] > 18` checks for which rows the `Age` column has a value larger than 18. Each row for which the condition is `True` is selected.
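To make the mechanism visible, the condition can be evaluated on its own before it is used for selection. The sketch below uses a small hypothetical table (`passengers`, with made-up short names) rather than the full Titanic file, so the names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the titanic table
passengers = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Moran", "Palsson"],
    "Age": [22.0, 38.0, np.nan, 2.0],
})

# The comparison returns a boolean Series with one True/False per row
older_than_18 = passengers["Age"] > 18
print(older_than_18.tolist())

# Passing that boolean Series to [] keeps only the rows where it is True
adults = passengers[older_than_18]
print(adults["Name"].tolist())
```

In this float column, the row with a missing age compares as `False`, so it is left out of the result along with the passengers aged 18 or younger.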

> I'm interested in the Titanic passengers from cabin class 2 and 3

In [9]:
titanic[titanic["Pclass"].isin([2, 3])].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


To select rows based on a conditional function, use that function inside the selection brackets `[]`. The condition `titanic["Pclass"].isin([2, 3])` checks for which rows the `Pclass` column is either 2 or 3. Each row for which the condition is `True` is selected.
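As a rough illustration, `isin` gives the same rows as combining two separate comparisons with the `|` (or) operator. The table below is a hypothetical four-row stand-in for the Titanic data; note that each comparison needs its own parentheses when combined.

```python
import pandas as pd

# Hypothetical miniature stand-in for the titanic table
passengers = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Nasser", "Moran"],
    "Pclass": [3, 1, 2, 3],
})

# Membership test against a list of allowed values
with_isin = passengers[passengers["Pclass"].isin([2, 3])]

# Equivalent selection: two comparisons joined with | (element-wise "or")
with_or = passengers[(passengers["Pclass"] == 2) | (passengers["Pclass"] == 3)]

print(with_isin.equals(with_or))
print(with_isin["Name"].tolist())
```

With more than two or three allowed values, the `isin` form tends to stay shorter and easier to read.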

__To user guide:__ Conditional (boolean) indexing, see :ref:`indexing.boolean`. Specific information on `isin`, see :ref:`indexing.basics.indexing_isin`. 

> I want to work with passenger data for which the age is known

In [10]:
titanic[titanic["Age"].notnull()].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


__To user guide:__ For more dedicated functions on missing values, see :ref:`missing-data`
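The `notnull()` function returns `True` for every row where the value is not missing, so it can be used as a filter like any other condition. A minimal sketch of the alternatives listed under "Content to cover", using a hypothetical three-row table instead of the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the titanic table
passengers = pd.DataFrame({
    "Name": ["Braund", "Moran", "Palsson"],
    "Age": [22.0, np.nan, 2.0],
    "Cabin": [np.nan, np.nan, np.nan],
})

# Boolean filtering on the Age column only
known_age_mask = passengers[passengers["Age"].notnull()]

# dropna restricted to the Age column via subset; NaN in Cabin is ignored
known_age_dropna = passengers.dropna(subset=["Age"])
print(known_age_mask.equals(known_age_dropna))

# Calling dropna on a single column returns a Series, not the full table
ages_only = passengers["Age"].dropna()
print(type(ages_only).__name__, len(ages_only))
```

Without `subset`, `passengers.dropna()` would drop every row that has a missing value in any column, which here would remove all rows because `Cabin` is always empty.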

### Filter specific rows and/or columns

![](../schemas/03_subset_columns_rows.png)

> I'm interested in the names of the passengers older than 18 years

In [11]:
titanic.loc[titanic["Age"] > 18, "Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
6                                McCarthy, Mr. Timothy J
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
15                      Hewlett, Mrs. (Mary D Kingcome) 
18     Vander Planke, Mrs. Julius (Emelia Maria Vande...
20                                  Fynney, Mr. Joseph J
21                                 Beesley, Mr. Lawrence
23                          Sloper, Mr. William Thompson
25     Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...
27                        Fortune, Mr. Charles Alexander
30                             

When selecting by column names, row labels or a condition, use the `loc` operator in front of the selection brackets `[]`. The part before the comma selects the rows, and the part after the comma selects the columns.
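A selection made with `loc` can also be the target of an assignment, which covers the "assign new value to selection" item in the content list. The sketch below works on a copy of a small hypothetical table so the original values stay untouched; the replacement string `"anonymous"` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the titanic table
passengers = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Moran", "Palsson"],
    "Age": [22.0, 38.0, np.nan, 2.0],
})

# Row condition before the comma, column label after it
adult_names = passengers.loc[passengers["Age"] > 18, "Name"]
print(adult_names.tolist())

# The same kind of selection on the left-hand side assigns new values
anonymised = passengers.copy()
anonymised.loc[anonymised["Age"] > 18, "Name"] = "anonymous"
print(anonymised["Name"].tolist())
```

Doing the row and column selection in a single `loc` call, rather than chaining two sets of brackets, is what lets the assignment reach the underlying table.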

> I'm interested in rows 10 to 25 and columns 3 to 5

In [12]:
titanic.iloc[9:25, 2:5]

Unnamed: 0,Pclass,Name,Sex
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,3,"Sandstrom, Miss. Marguerite Rut",female
11,1,"Bonnell, Miss. Elizabeth",female
12,3,"Saundercock, Mr. William Henry",male
13,3,"Andersson, Mr. Anders Johan",male
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female
16,3,"Rice, Master. Eugene",male
17,2,"Williams, Mr. Charles Eugene",male
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


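When specifically interested in certain rows and/or columns based on their position in the table, use the `iloc` operator in front of the selection brackets `[]`. Positions are counted from 0 and the end of a slice is excluded, so `9:25` covers the 10th to the 25th row and `2:5` covers the 3rd to the 5th column.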
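The difference between the two operators is easiest to see when the row labels do not start at 0. In this sketch the table, its values and the index labels 10 to 13 are hypothetical and chosen only to separate labels from positions.

```python
import pandas as pd

# Hypothetical table whose row labels (10..13) differ from positions (0..3)
passengers = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1],
        "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
        "Sex": ["male", "female", "female", "female"],
    },
    index=[10, 11, 12, 13],
)

# iloc: integer positions, end of the slice excluded
by_position = passengers.iloc[0:2, 0:2]
print(by_position.shape)

# loc: labels, both ends of the slice included
by_label = passengers.loc[10:12, "Pclass":"Name"]
print(by_label.shape)
```

Here `iloc[0:2]` returns two rows, while `loc[10:12]` returns three, because label-based slices include the end label.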

**REMEMBER**

- When selecting subsets of data, square brackets `[]` are used.
- Inside these brackets, you can use a single column name, a list of column names, a conditional expression or a conditional function.
- Select specific rows and/or columns using `loc` when using the row and column names.
- Select specific rows and/or columns using `iloc` when using the positions in the table.

__To user guide:__ see :ref:`indexing`