# 2: Selecting data using loc and iloc

Last session, we looked at how to import an external dataset into our Jupyter notebook or create our own dataframe from scratch using pandas.

In this week's tutorial, we are going to look at how we can select a subset of our dataframe, whether that is an entire row, a column or a specific cell. We will learn how to use the loc and iloc indexers to accomplish this task.

We will use the [titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle for this week's tutorial.

## Import pandas

In [None]:
 #from pandas.core.computation.check import NUMEXPR_INSTALLED

In [None]:
import pandas as pd

## Import data

In [None]:
data = pd.read_csv("train.csv")
data.head()

## Selecting a series/column in a dataframe

There are two ways you can select a column of a dataframe.
1. data.Name
2. data['Name']

What is the difference between the two? They both return the same column, but the second one is more robust. For example, if I rename the 'PassengerId' column to 'Passenger ID', data.Passenger ID would not work, because a name containing a space is not a valid Python attribute name.

Let's see it in action.

In [None]:
# Let's first try it out on the Name feature
data.Name

In [None]:
data['Name']

In [None]:
#data['Passenger ID']

So both ways are able to give us the Name column without any issues.

In [None]:
# Rename 'PassengerId' column

data.rename(columns = {'PassengerId': 'Passenger ID'}, inplace = True)
data.head()

The PassengerId column has now been renamed to Passenger ID.

In [None]:
data['Passenger ID']

In [None]:
# Uncomment and run this line, it will show an error
#data.Passenger ID

Personally, I like to use method 2 because it handles all cases. If for some reason you prefer method 1, just bear in mind that it has this limitation.
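
The limitation of method 1 can be reproduced without the Titanic file. The sketch below uses a small made-up dataframe (the name toy and its values are hypothetical) and checks whether each column name is a valid Python identifier, which attribute-style access requires.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({"Passenger ID": [1, 2, 3], "Name": ["Ann", "Bob", "Cy"]})

# Method 2 (brackets) works for any column label, spaces included
ids = toy["Passenger ID"]

# Method 1 (attribute access) needs the label to be a valid Python identifier
name_ok = "Name".isidentifier()            # valid, so toy.Name works
spaced_ok = "Passenger ID".isidentifier()  # not valid, so toy.Passenger ID is a syntax error

# When the label is a valid identifier, both methods return the same Series
same = toy.Name.equals(toy["Name"])

print(ids.tolist(), name_ok, spaced_ok, same)
```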

## Index-based selection 

We use iloc to select data based on its numerical position in the dataframe.

iloc takes two arguments: the row position first, followed by the column position. Positions start at 0, so 0 is the first, 1 is the second, 2 is the third and so on.
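
As a minimal sketch of the row-then-column order, the example below uses a small made-up dataframe (toy and its values are assumptions for illustration, not the Titanic data).

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 3 columns
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [30, 41, 25], "Fare": [7.25, 71.28, 8.05]}
)

# A single position selects a whole row (returned as a Series)
first_row = toy.iloc[0]

# Row position first, then column position: second row, third column
cell = toy.iloc[1, 2]

# A bare colon means "everything": all rows of the second column
second_col = toy.iloc[:, 1]

print(first_row["Name"], cell, second_col.tolist())
```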

In [None]:
data.head()

In [None]:
data.iloc[:-886]

In [None]:
data.iloc[886:]

In [None]:
data.iloc[-5:]

In [None]:
data.iloc[-889:-888, -9:-6]

In [None]:
data.tail()

In [None]:
data.iloc[2:3 , 3:6]

In [None]:
data.iloc[1:3,0:1]

In [None]:
data.iloc[1:4,3:6]

In [None]:
# First row and all columns
data.iloc[0:1,:]

In [None]:
data.iloc[1:3,5:6]

#print(data.iloc[-890:-888,-7])


In [None]:
data.iloc[:5]

In [None]:
data.iloc[3, -2:]

In [None]:
data.iloc[4,8]

In [None]:
# Fourth column that is the Name column and all rows
# Since starting index is 0 fourth column corresponds to index number 3

data.iloc[:,3]

Suppose we want to select a range of values.

iloc includes the first number of the range but excludes the last. For example, if we want the second and third rows of the first column, the code is as follows:

In [None]:
data.head()

In [None]:
# Second and third rows of the first column

data.iloc[1:3, 0]

We can also pass a list of integer positions into iloc.
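
A list is useful when the positions are not contiguous. The sketch below, on a made-up dataframe (toy is hypothetical), also shows that a list of consecutive positions gives the same result as the matching slice.

```python
import pandas as pd

# Hypothetical toy frame with five rows
toy = pd.DataFrame({"Name": list("ABCDE"), "Age": [22, 38, 26, 35, 54]})

# Pick the first, third and fifth rows with a list of positions
by_list = toy.iloc[[0, 2, 4], :]

# A slice and the equivalent list of consecutive positions match
by_slice = toy.iloc[0:3, :]
by_consecutive_list = toy.iloc[[0, 1, 2], :]

print(by_list["Name"].tolist())
print(by_slice.equals(by_consecutive_list))
```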

In [None]:
# First five rows and all columns
data.iloc[[0, 1, 2,3,4], :]

In [None]:
data.head()

In [None]:
data.iloc[0 : 5 ,:]

In [None]:
data.iloc[-5:,:]

We can also count from the bottom of the dataframe by using negative positions.
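
Here is a small self-contained sketch of negative positions, using a made-up dataframe (toy is hypothetical) and comparing the result with tail.

```python
import pandas as pd

# Hypothetical toy frame with five rows
toy = pd.DataFrame({"Age": [22, 38, 26, 35, 54]})

# -2: means "from the second-to-last row to the end"
last_two = toy.iloc[-2:]

# -1 refers to the last row and the last column
last_cell = toy.iloc[-1, -1]

print(last_two["Age"].tolist(), last_cell)
print(last_two.equals(toy.tail(2)))
```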

In [None]:
# Bottom five rows of the dataframe

data.iloc[-5:, :]

In [None]:
# This is the same as using the tail function

data.tail()

## Label-based selection 

With loc we select by label: the index label of the row and the actual name of the column.
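
The Titanic data has a default integer index, so row labels and row positions look alike. To make the label-based behaviour easier to see, the sketch below uses a made-up dataframe with string row labels (toy and its index are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy frame with string row labels instead of 0, 1, 2
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Sex": ["female", "male", "male"], "Age": [30, 41, 25]},
    index=["a", "b", "c"],
)

# Row label first, then column name
one_cell = toy.loc["b", "Age"]

# Lists of labels select several rows and columns at once
sub = toy.loc[["a", "c"], ["Name", "Age"]]

print(one_cell)
print(sub)
```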

In [None]:
data

In [None]:
data.loc[[1,10],['Name', 'Sex', 'Age', 'Embarked']]

In [None]:
data.loc[1:10,'Name':'Age']

In [None]:
data.loc[[1,2,3],["Name","Sex", "Age"]]

In [None]:
data.loc[1:3,"Name":"Age"]

In [None]:
data.head()

In [None]:
data.loc[0:4, "Passenger ID" : "Embarked"]

In [None]:
# Rows 0 to 4 of the Name through Cabin columns
data.loc[0:4, "Name":"Cabin"]

In [None]:
#Another Way
data.loc[[1,2,3,4,5,10], ["Name","Age","SibSp", "Parch", "Ticket", "Fare","Cabin"]]

In [None]:
data.loc[0:4,:]

Unlike iloc, when we select a range of values, loc includes both the start and the end of the range.

For example, with the default integer index, to get the first 5 rows with iloc we would write data.iloc[:5], whereas with loc we write data.loc[:4] instead.
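
The difference in endpoints can be checked on a small made-up dataframe (toy is hypothetical and uses the default integer index, like the Titanic data).

```python
import pandas as pd

# Hypothetical toy frame with seven rows and the default index 0..6
toy = pd.DataFrame(
    {
        "Name": list("ABCDEFG"),
        "Sex": ["f", "m", "f", "f", "m", "m", "f"],
        "Age": [22, 38, 26, 35, 54, 2, 27],
    }
)

# iloc excludes the end position: positions 0, 1, 2, 3, 4
by_position = toy.iloc[:5]

# loc includes the end label: labels 0, 1, 2, 3, 4
by_label = toy.loc[:4]

# The same rule applies to columns
cols_by_label = toy.loc[:, "Name":"Sex"].columns.tolist()
cols_by_position = toy.iloc[:, 0:1].columns.tolist()

print(len(by_position), len(by_label), by_position.equals(by_label))
print(cols_by_label, cols_by_position)
```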

In [None]:
# First 5 rows of the Name, Sex and Age column

data.loc[:4, ['Name', 'Sex', 'Age']]

In [None]:
data.loc[886:, "Passenger ID" : "Embarked"]

In [None]:
data.loc[886:]

## Conditional Selection

We can select rows that satisfy certain conditions. In this section, we will look at how that works.
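
Under the hood, a condition such as data['Age'] == 50 produces a boolean series, and loc keeps the rows where it is True. The sketch below shows this on a made-up dataframe (toy and its values are hypothetical). Each condition is wrapped in parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame(
    {
        "Age": [50, 50, 30, 19],
        "Sex": ["female", "male", "female", "male"],
        "Fare": [10.0, 250.0, 300.0, 8.0],
    }
)

# A comparison returns one True/False value per row
is_50 = toy["Age"] == 50

# & means AND: both conditions must hold
both = toy.loc[is_50 & (toy["Sex"] == "female"), :]

# | means OR: at least one condition must hold
either = toy.loc[is_50 | (toy["Fare"] >= 200), :]

print(is_50.tolist())
print(both.index.tolist(), either.index.tolist())
```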

In [None]:
# Which passengers are 50 years old?
# Rows with age 50
data.loc[data['Age'] == 50, :]

In [None]:
# Rows with age 50 AND female
# This is a subset of the above dataframe, keeping only the females

data.loc[(data['Age'] == 50) & (data['Sex'] == 'female'), :]

In [None]:
data.loc[(data["Age"] >=18) & (data["Age"] <= 50),:]

In [None]:
# Rows with age 50 OR have fare greater than or equal to 200

data.loc[(data['Age'] == 50) | (data['Fare'] >= 200), :]

In [None]:
data.isnull().sum()

In [None]:
data['Cabin'].isnull().sum()

In [None]:
# All the rows with null cabin column

data.loc[data['Cabin'].isnull(), :]

The opposite of the isnull function is the notnull function, which returns True for every value that is not null.
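
A minimal sketch of isnull and notnull side by side, using a made-up Cabin column (toy and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with two missing cabins
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# True where the value is missing
missing = toy["Cabin"].isnull()

# True where the value is present; the exact inverse of isnull
present = toy["Cabin"].notnull()

# Keep only the rows that have a cabin
known_cabins = toy.loc[present, "Cabin"]

print(missing.tolist(), int(missing.sum()))
print(known_cabins.tolist())
```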

In [None]:
data['Embarked'].isin(["Q","C"])

In [None]:
# All rows with C or Q in Embarked column
data.loc[data['Embarked'].isin(["Q","C"]), :]

In [None]:
data.loc[~(data['Embarked'].isin(['Q','C']))]

In [None]:
# Same result as the isin(['Q', 'C']) selection, written with the or operator (|)
data.loc[(data['Embarked'] == 'C') | (data['Embarked'] == 'Q'), :]

In [None]:
data.loc[(data['Embarked'] != 'C') & (data['Embarked'] != 'Q'), :]
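
The equivalences used in the last few cells can be verified on a small made-up column (toy is hypothetical and contains no missing values): isin matches the chain of == conditions joined with |, and ~isin matches the chain of != conditions joined with &.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# Membership test in one call
in_qc = toy["Embarked"].isin(["Q", "C"])

# The same test spelled out with OR
or_mask = (toy["Embarked"] == "C") | (toy["Embarked"] == "Q")

# The negation spelled out with AND
not_mask = (toy["Embarked"] != "C") & (toy["Embarked"] != "Q")

print(in_qc.equals(or_mask), (~in_qc).equals(not_mask))
print(toy.loc[~in_qc, "Embarked"].tolist())
```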