In [None]:
import numpy as np
import pandas as pd

# Data Selection

In order to explore a dataset and perform data analysis, we need to be able to select specific parts of our data. Selecting data allows us to clean/transform data to prepare for analysis, and create the appropriate calculations to get the results we need.



The Titanic passenger data is available as a CSV file online, so we can load it directly from its URL with the Pandas `read_csv` function.

In [None]:
# Load titanic CSV data
csv_path = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'
titanic = pd.read_csv(csv_path)
titanic


# Selecting Columns

![Selecting Columns](https://drive.google.com/uc?id=17_YMZ8CxJOPUyBIV1pHThMwIo-AwY9qB)


What if we want to analyze the passengers' **age**?

In [None]:
# Select the age column Series
ages = titanic['Age']
ages

As expected, this creates a Pandas Series.

In [None]:
type(ages)

The shape of our output verifies that we have a single dimension.

In [None]:
ages.shape

If we want to analyze both **Age** and **Sex** of the passengers, we can select multiple columns, creating a new dataframe.

In [None]:
# Selecting multiple columns creates a new dataframe
age_sex = titanic[["Age", "Sex"]]
age_sex.head()

Notice the double brackets: `titanic[["Age", "Sex"]]`

This time we pass a list of column names as a parameter, rather than a single column name.
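The difference between single and double brackets can be sketched without the Titanic file. The `toy` dataframe below and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0],
                    "Sex": ["male", "female", "female"],
                    "Fare": [7.25, 71.28, 7.93]})

single = toy["Age"]           # one column name -> Series
double = toy[["Age"]]         # list with one name -> DataFrame
both = toy[["Age", "Sex"]]    # list with two names -> DataFrame

print(type(single).__name__, single.shape)   # Series (3,)
print(type(double).__name__, double.shape)   # DataFrame (3, 1)
print(type(both).__name__, both.shape)       # DataFrame (3, 2)
```

Note that a list with a single name still produces a dataframe: it is the list, not the number of columns, that decides the return type.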

In [None]:
# Verify that we have a new dataframe
type(age_sex)

Because we have a dataframe, `shape` now outputs two dimensions.

In [None]:
age_sex.shape

# Selecting Rows

![Selecting Rows](https://drive.google.com/uc?id=1NOsEk5vaDLkvCFLYDUHmci8ojat3Hmpl)


### Conditional Expressions


What if we only want to analyze passengers who are above a certain age?

In [None]:
above_35 = titanic[titanic["Age"] > 35]
above_35.head()

Let's deconstruct what we just did, because there are a couple of important parts.

First, inspect the conditional expression where we select the age of the passengers: `titanic["Age"] > 35`

Evaluating this expression by itself, we see that it returns a Series of boolean values.

In [None]:
titanic["Age"] > 35

A conditional expression can be used to select rows from a dataframe, because when we pass a boolean Series to a dataframe, Pandas will only select rows where the value is `True`.

Let's double check that it worked by comparing the shape of the original to the shape of the new dataframe. We should see fewer rows in the new dataframe.



In [None]:
# Original titanic dataframe shape
titanic.shape

In [None]:
# New above_35 dataframe shape has fewer rows
above_35.shape
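The same mechanics can be sketched on a tiny made-up dataframe (the names and ages below are assumptions for illustration). It also shows that a missing age compares as `False`, so rows with unknown ages are dropped by the filter.

```python
import pandas as pd

# Hypothetical ages, including one missing value
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "Age": [22.0, 38.0, None, 54.0]})

mask = toy["Age"] > 35    # boolean Series aligned with toy's index
older = toy[mask]         # keeps only the rows where mask is True

print(mask.tolist())            # [False, True, False, True]
print(older["Name"].tolist())   # ['B', 'D']
```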

### Multiple Conditions

What if we want to select only rows where the passenger class is 2 or 3?

In [None]:
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23.head()

This time we have two conditional expressions, combined with the or `|` operator. Thus, we select rows where the passenger class is either 2 or 3.


-------------------------
NOTE:

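Because we are combining boolean Series element by element, we need to use the `&` (and) and `|` (or) operators rather than Python's `and` and `or`. Each condition also needs its own parentheses, because `&` and `|` bind more tightly than comparisons such as `==`.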

-------------------------
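A short sketch of why `or` does not work here, using a made-up `Pclass` column: Python's `or` asks each Series for a single truth value, which Pandas refuses to give.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [1, 2, 3, 2]})   # hypothetical values

is_2 = toy["Pclass"] == 2
is_3 = toy["Pclass"] == 3

# Element-wise OR works on boolean Series
either = is_2 | is_3
print(either.tolist())   # [False, True, True, True]

# Python's `or` needs one True/False per operand, so Pandas raises an error
try:
    is_2 or is_3
    raised = False
except ValueError:
    raised = True
print(raised)   # True
```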


We can deconstruct these expressions as well in order to get a feel for what is going on.

In [None]:
# Rows where passenger class is 2
(titanic["Pclass"] == 2)

In [None]:
# Rows where passenger class is 3
(titanic["Pclass"] == 3)

In [None]:
# Rows where passenger class is 2 or 3
(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)

We can also use the Pandas `isin` method to perform the same task:

In [None]:
# Use `isin()` to select multiple passenger classes
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()

Evaluating the inner expression by itself, we see that the `isin` method returns a boolean Series, just like our previous conditional expressions!

In [None]:
# `isin()` returns a boolean Series
titanic["Pclass"].isin([2, 3])
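As a quick check that the two approaches agree, here is a sketch on made-up class values. It also shows the `~` operator, which inverts a boolean Series and is a common way to express "not in".

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [1, 2, 3, 2, 1]})   # hypothetical values

with_or = (toy["Pclass"] == 2) | (toy["Pclass"] == 3)
with_isin = toy["Pclass"].isin([2, 3])

print(with_or.equals(with_isin))   # True
print((~with_isin).tolist())       # [True, False, False, False, True]
```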

# Conditions on Multiple Columns

What if we want to compare the number of female passengers who survived to the number of male passengers who survived?

We can do this by splitting our data into two dataframes. First let's select the female passengers who survived.

In [None]:
# Select female passengers who survived
female_survived = titanic[(titanic['Survived']==1) & (titanic['Sex'] == 'female')]
female_survived.head()

In [None]:
# See how many females survived
female_survived.shape

In [None]:
# Select male passengers who survived
male_survived = titanic[(titanic['Survived']==1) & (titanic['Sex'] == 'male')]
male_survived.head()

In [None]:
# See how many males survived
male_survived.shape

Compared to males, more than twice as many females survived!

In [None]:
print('Females vs males:')
len(female_survived) / len(male_survived)
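If only the counts are needed, one alternative is to sum the boolean mask directly, since `True` counts as 1. The sketch below uses a small made-up dataframe, so its numbers say nothing about the real Titanic data.

```python
import pandas as pd

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({"Survived": [1, 0, 1, 1, 0, 1],
                    "Sex": ["female", "male", "female", "male", "female", "female"]})

survived = toy["Survived"] == 1
female_count = (survived & (toy["Sex"] == "female")).sum()
male_count = (survived & (toy["Sex"] == "male")).sum()

print(female_count, male_count)     # 3 1
print(female_count / male_count)    # 3.0
```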

# Selecting Known Data

It is often the case that a dataset will contain blank or NA values. Inspecting the age column, we see that some ages were unknown:

In [None]:
titanic['Age']

(Notice the `NaN` values above)

We can select only known values using the Pandas `notna` method, which returns `True` for each row where the value is not missing (`NaN`).

In [None]:
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()

By comparing shapes, it looks like we removed over 100 rows that contained `NaN`.

In [None]:
# Original dataframe shape
titanic.shape

In [None]:
# New dataframe shape
age_no_na.shape
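For comparison, here is a sketch on made-up data showing that filtering with `notna` gives the same result as `dropna` restricted to one column with `subset`. Which one to use is mostly a matter of taste.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one unknown age
toy = pd.DataFrame({"Name": ["A", "B", "C"],
                    "Age": [22.0, np.nan, 54.0]})

known = toy[toy["Age"].notna()]          # boolean-mask approach
dropped = toy.dropna(subset=["Age"])     # dropna limited to the Age column

print(toy["Age"].isna().sum())    # 1
print(known.equals(dropped))      # True
```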

# Selecting Rows and Columns

![Selecting Rows and Columns](https://drive.google.com/uc?id=1iWzdqmL2IUYwHS3ZXdKDRqiq2ait31qk)

Pandas has two indexers, `iloc` and `loc`, that enable selection of rows and columns at the same time.

- `iloc` selects rows and columns based on integer positions (0, 1, 2, ...)
- `loc` selects rows and columns based on labels (index values and column names)

Both locators select using the format `[rows, columns]`.
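The distinction matters most when the row labels are themselves numbers. In the sketch below, the made-up dataframe has a shuffled integer index, so position 0 and label 0 point at different rows.

```python
import pandas as pd

# Hypothetical data with a shuffled integer index
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_position = toy.iloc[0, 0]       # first row, first column
by_label = toy.loc[0, "value"]     # row whose label is 0

print(by_position)   # 10
print(by_label)      # 20
```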

# Select with `iloc`

When we know the **positions** (integer indexes) of the data we're interested in, we can use `iloc`.

In [None]:
# Remember what our titanic dataframe looks like
titanic.head()

What if we want to inspect only the first 3 rows and the first 4 columns?

In [None]:
# Select with iloc using the format [row_range, column_range]
subset1 = titanic.iloc[0:3, 0:4]
subset1

Select the last 4 rows and the last 2 columns:

In [None]:
# Select with iloc using the format [row_range, column_range]
subset2 = titanic.iloc[-4:, -2:]
subset2

Select rows and columns in the middle of the dataframe:

In [None]:
# Select with iloc using the format [row_range, column_range]
subset2 = titanic.iloc[33:41, 5:7]
subset2

Select specific rows and columns by passing a list of numbers to `iloc`.


In [None]:
# Select with iloc using the format [rows_list, columns_list]
titanic.iloc[[1,3,8,12], [1,4,6]]

# Select with `loc`

When we know the **labels** of the data we're interested in, we can use `loc`.

In [None]:
# Create a sample Dataframe with row labels
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=5),
                   'numbers': [np.nan,1,8,5,1],
                   'fractions': [0.481236, 0.758691, 0.977380, 0.992931, np.nan],
                   'category': pd.Categorical(["test", "train", "test", "train", "test"]),
                   'boolean': pd.array([True, False, False, False, True], dtype='bool')},
                  index=['a','b','c','d', 'e'])
df

Selecting a single row and a single column returns the **value** at that location.

In [None]:
df.loc['a','category']

In [None]:
df.loc['d','numbers']

Selecting a range of rows and a single column returns a `Series`.

In [None]:
df.loc['c':'e', 'fractions']

Selecting a single row and a range of columns also returns a `Series`.

In [None]:
df.loc['b', 'fractions':'boolean']

Selecting a range of rows and columns returns a `DataFrame`.

In [None]:
df.loc['a':'c', 'numbers':'category']
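One detail worth noticing in these ranges: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

# Hypothetical data with text row labels
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]},
                   index=["a", "b", "c", "d"])

label_slice = toy.loc["a":"c", "x"]    # includes the row labelled 'c'
position_slice = toy.iloc[0:2, 0]      # stops before position 2

print(label_slice.tolist())      # [1, 2, 3]
print(position_slice.tolist())   # [1, 2]
```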

We can also select specific rows and columns by passing lists of labels to `loc`.

In [None]:
df.loc[['b', 'e'], ['date', 'numbers', 'boolean']]

# Chaining Locators

Locators can be chained together in a single expression.

In [None]:
# Remember what our titanic dataframe looks like
titanic.head()

Select a list of rows by position with `iloc`, then select all of those rows for a given range of column labels with `loc`.

In [None]:
titanic.iloc[[1,3,5]].loc[:, 'Name':'Age']

# Conditional Locators

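We can use conditional expressions with `loc` just as we did previously with dataframe selectors. `iloc` is stricter: it accepts a boolean array but not a boolean Series, so the condition has to be converted first (for example with `.to_numpy()`).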

We'll use `loc` as an example because it is a bit more common to conditionally select rows, then choose columns based on labels.
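A minimal sketch of both forms on made-up data: `loc` takes the boolean Series directly, while for `iloc` the mask is first converted to a plain array with `to_numpy()` and the columns are given by position.

```python
import pandas as pd

# Hypothetical passengers, for illustration only
toy = pd.DataFrame({"Pclass": [1, 3, 1],
                    "Age": [38.0, 22.0, 54.0],
                    "Fare": [71.28, 7.25, 51.86]})

mask = toy["Pclass"] == 1

by_label = toy.loc[mask, ["Age", "Fare"]]           # boolean Series + column labels
by_position = toy.iloc[mask.to_numpy(), [1, 2]]     # boolean array + column positions

print(by_label.equals(by_position))   # True
```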

In [None]:
# Remember what our titanic dataframe looks like
titanic.head()

What if we want to analyze the **age** and **fare** of passengers in **Pclass 1**?

In [None]:
titanic.loc[titanic['Pclass']==1, ['Age', 'Fare']]

Now select the cabin of male passengers who survived.

In [None]:
titanic.loc[((titanic['Survived']==1) & (titanic['Sex'] == 'male')), 'Cabin']

There are a lot of people who don't have cabins. We can filter them out by including the `notna` method in our conditional expression.

In [None]:
titanic.loc[((titanic['Survived']==1) & (titanic['Sex'] == 'male') & titanic['Cabin'].notna()), 'Cabin']

# Replacing Data

Locators are also great for replacing data. Let's make a new column called *age_replace* and try it.

In [None]:
titanic['age_replace'] = 'age_replace'
titanic.head()

We will label all passengers with `age < 30` as young.

In [None]:
titanic.loc[titanic['Age']<30, 'age_replace'] = 'young'
titanic.head()

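And that means passengers with `age >= 30` are old. Passengers whose age is unknown (`NaN`) match neither condition, so they keep the placeholder value.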

In [None]:
titanic.loc[titanic['Age']>=30, 'age_replace'] = 'old'
titanic.head()
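The whole replacement pattern can be sketched on a tiny made-up dataframe. The column name `age_group` and the `'unknown'` default are assumptions chosen for this example. The row with a missing age matches neither condition and keeps the default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan]})

toy["age_group"] = "unknown"                      # default for every row
toy.loc[toy["Age"] < 30, "age_group"] = "young"   # overwrite matching rows only
toy.loc[toy["Age"] >= 30, "age_group"] = "old"

print(toy["age_group"].tolist())   # ['young', 'old', 'unknown']
```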

# Summary

- When selecting subsets of data, square brackets `[]` are used.
- Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Select specific rows and/or columns using `iloc` when using the positions in the table.
- Select specific rows and/or columns using `loc` when using the row and column names.
- Use locators to replace data.

