In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv("data/titanic.csv")

In [3]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


To select a single column, use square brackets `[]` with the name of the column of interest.

In [4]:
ages = titanic["Age"]

In [5]:
ages

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

Each column in a `DataFrame` is a `Series`. Because a single column is selected, the returned object is a pandas `Series`.

In [6]:
type(ages)

pandas.core.series.Series

`shape` is an attribute of both a pandas `Series` and a `DataFrame` that holds the number of rows and columns as *(nrows, ncolumns)*. A pandas `Series` is 1-dimensional, so only the number of rows is returned.

In [7]:
ages.shape

(891,)
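
As a minimal sketch of the difference in `shape`, the snippet below builds a small toy table (the names and values are made up for illustration and are not taken from the Titanic file) and compares the shape of a single column with the shape of the whole table.

```python
import pandas as pd

# Toy data standing in for the Titanic table (illustrative values only).
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Age": [22.0, 38.0, None],
        "Sex": ["male", "female", "female"],
    }
)

ages = df["Age"]  # single brackets with one column name return a Series

print(type(ages).__name__)  # Series
print(ages.shape)           # (3,)   -> rows only, because a Series is 1-dimensional
print(df.shape)             # (3, 3) -> (nrows, ncolumns)
```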

To select multiple columns, use a list of column names within the selection brackets `[]`.
- the inner square brackets define a *Python list* with column names
- the outer brackets are used to select the data from a pandas `DataFrame`

In [8]:
age_sex = titanic[["Age", "Sex"]]

In [9]:
age_sex.head()

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male


In [10]:
type(age_sex)

pandas.core.frame.DataFrame

In [11]:
age_sex.shape

(891, 2)
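
The role of the inner list can be illustrated with a short sketch on toy data (hypothetical values, not the Titanic file): a list with a single column name still returns a `DataFrame`, whereas the bare column name returns a `Series`.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["male", "female", "female"],
        "Fare": [7.25, 71.2833, 7.925],
    }
)

as_series = df["Age"]        # column name          -> Series
as_frame = df[["Age"]]       # list with one name   -> DataFrame with one column
both = df[["Age", "Sex"]]    # list with two names  -> DataFrame with two columns

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
print(both.shape)       # (3, 2)
```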

The conditional expression `titanic["Age"] > 35` (`==`, `!=`, `<`, `<=` and the other comparison operators work the same way) returns a pandas `Series` of boolean values (`True` or `False`) with the same number of rows as the original `DataFrame`.

In [12]:
age_above_35 = titanic["Age"] > 35

In [13]:
age_above_35

0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [14]:
type(age_above_35)

pandas.core.series.Series

Such a `Series` of boolean values can be used to **filter** the `DataFrame` by putting it between the selection brackets `[]`. Only rows for which the value is `True` are selected.

In [15]:
titanic[age_above_35]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
865,866,1,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0000,,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
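
A small sketch on toy data (hypothetical values) shows the two steps together. Note that a comparison against a missing value evaluates to `False`, so rows with an unknown age are dropped by this filter.

```python
import pandas as pd

# Toy data with one missing age (illustrative values only).
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 38.0, None, 54.0],
    }
)

mask = df["Age"] > 35   # boolean Series, one value per row
older = df[mask]        # keeps only the rows where the mask is True

print(mask.tolist())         # [False, True, False, True]
print(older.index.tolist())  # [1, 3] -> the original row labels are preserved
```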


Similar to the conditional expression, the `isin()` conditional function returns `True` for each row whose value is in the provided list.

In [16]:
class_23 = titanic["Pclass"].isin([2, 3])

In [17]:
class_23

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Name: Pclass, Length: 891, dtype: bool

In [18]:
type(class_23)

pandas.core.series.Series

In [19]:
titanic[class_23]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
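
For comparison, the same filter can be written by combining two conditions with the `|` (or) operator. The sketch below uses toy data (hypothetical values); each condition must be wrapped in parentheses, and `|` / `&` are used instead of Python's `or` / `and`.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2]})

with_isin = df[df["Pclass"].isin([2, 3])]
with_or = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]

print(with_isin.index.tolist())   # [0, 2, 4]
print(with_isin.equals(with_or))  # True
```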


The `notna()` conditional function returns `True` for each row whose value is not a null (missing) value. As such, it can be combined with the selection brackets `[]` to filter the data table.

In [20]:
age_not_null = titanic["Age"].notna()

In [21]:
age_not_null

0       True
1       True
2       True
3       True
4       True
       ...  
886     True
887     True
888    False
889     True
890     True
Name: Age, Length: 891, dtype: bool

In [22]:
type(age_not_null)

pandas.core.series.Series

In [23]:
titanic[age_not_null]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
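
As a hedged aside, filtering with `notna()` on one column gives the same rows as `DataFrame.dropna()` restricted to that column via `subset`. The sketch below checks this on toy data (hypothetical values).

```python
import pandas as pd

# Toy data with one missing age (illustrative values only).
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Age": [22.0, None, 26.0],
    }
)

known_age = df[df["Age"].notna()]       # boolean filter
same_rows = df.dropna(subset=["Age"])   # drop rows where "Age" is missing

print(known_age.index.tolist())     # [0, 2]
print(known_age.equals(same_rows))  # True
```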


To select a subset of both rows and columns in one go, the *selection brackets* `[]` alone are no longer sufficient. The `loc` accessor selects by label or boolean mask (`iloc` selects by integer position) and lets you specify both the rows and the columns you want: `DataFrame.loc[rows, columns]`.

In [24]:
titanic.loc[class_23, "Name"]

0                       Braund, Mr. Owen Harris
2                        Heikkinen, Miss. Laina
4                      Allen, Mr. William Henry
5                              Moran, Mr. James
7                Palsson, Master. Gosta Leonard
                         ...                   
884                      Sutehall, Mr. Henry Jr
885        Rice, Mrs. William (Margaret Norton)
886                       Montvila, Rev. Juozas
888    Johnston, Miss. Catherine Helen "Carrie"
890                         Dooley, Mr. Patrick
Name: Name, Length: 675, dtype: object
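
The difference between `loc` and `iloc` can be sketched on toy data (hypothetical values). One detail worth noting: a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it, like ordinary Python slicing.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Pclass": [3, 1, 3, 2],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

# Boolean mask for the rows, column label for the columns.
names = df.loc[df["Pclass"].isin([2, 3]), "Name"]

# Label-based slice: rows labelled 1 and 2 (end label included).
by_label = df.loc[1:2, ["Name", "Age"]]

# Position-based slice: row at position 1 only, first two columns (end excluded).
by_pos = df.iloc[1:2, 0:2]

print(names.tolist())   # ['A', 'C', 'D']
print(by_label.shape)   # (2, 2)
print(by_pos.shape)     # (1, 2)
```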

When selecting specific rows and/or columns with `loc` or `iloc`, new values can be assigned to the selected data. For example, to assign the name `anonymous` to the first 3 elements of the fourth column (position 3, because `iloc` counts from 0):

In [25]:
titanic.iloc[0:3, 3] = "anonymous"

In [26]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,anonymous,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,anonymous,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,anonymous,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
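
Because such an assignment modifies the table in place, one cautious pattern is to work on a copy and to look up the column position by name rather than hard-coding it. The sketch below uses toy data (hypothetical values); the variable names `masked` and `name_pos` are introduced here only for illustration.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4],
        "Name": ["A", "B", "C", "D"],
    }
)

masked = df.copy()                          # leave the original table untouched
name_pos = masked.columns.get_loc("Name")   # integer position of the "Name" column
masked.iloc[0:3, name_pos] = "anonymous"    # first three rows of that column

print(masked["Name"].tolist())  # ['anonymous', 'anonymous', 'anonymous', 'D']
print(df["Name"].tolist())      # ['A', 'B', 'C', 'D']
```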
