# SLU2 - Subsetting data in Pandas: Learning notebook

In this notebook we will cover the following topics: 
    - Setting pandas Dataframe index
    - Selecting columns (brackets and dot notation)
    - Selecting rows (loc and iloc)
    - Chain indexing (not good) vs Multi-axis indexing (good)
    - Masks
    - Where
    - Subsetting on conditions
    - Basic math operations: adding columns and rows
    - Removing columns and rows

First, we import pandas, like we learned in the previous unit. It will be the only think that we will need for this learning unit as well.

In [1]:
import pandas as pd

# This is an option to preview less rows in the notebook's cells' outputs
pd.options.display.max_rows = 10

Now, we read the data that we'll use in this unit from the file __airbnb_input.csv__, which is located in the __data/__ directory.

For this, we'll use function __read_csv( )__, which was alreay shown in the previous unit.
We want to use column __room_id__ as the DataFrame index, and for that we use the argument __index_col__.

In [2]:
# Read the data in file airbnb_input.csv into a pandas DataFrame and use column room_id as the DataFrame index.
df = pd.read_csv('data/airbnb_input.csv', index_col='room_id')

# Preview the first rows of the DataFrame.
df.head()

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


## Sorting the index

With the [sort_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html) function, we can sort the DataFrame along the index.

For instance, our DataFrame df was already sorted along the index, but we can resort it from bigger to smaller rooms ids, using the __ascending=False__ parameter.

In [3]:
# Original df
df.head()

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


In [4]:
# df with the index sorted from bigger to smaller room_id
df.sort_index(ascending=False)

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
19400722,28219108,Entire home/apt,Areeiro,0,0.0,5,3.0,75.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
...,...,...,...,...,...,...,...,...
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0


## Resetting the index

We can reset the index of a DataFrame with function [reset_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html).
This will convert the index into a range from 0 to the length of the DataFrame minus 1.

Regarding the old index, we can either keep it by adding it as a column in the DataFrame (__drop=False__, this is the default behaviour) or reset it completely (__drop=True__).

In [5]:
# Resetting the index and keeping it as a new column room_id
df.reset_index()

Unnamed: 0,room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
0,6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...,...,...
13227,19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
13228,19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
13229,19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
13230,19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0


In [6]:
# Resetting the index and dropping it -> no new column is added
df.reset_index(drop=True)

Unnamed: 0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
0,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...,...
13227,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
13228,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
13229,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
13230,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0


## Setting the index

With function [set_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html), we can set a new index for our DataFrame.

The old index is dropped.

In this function, the __drop=True__ parameter deletes the column to be used as the new index, which is the default behaviour, and __drop=False__ keeps the column unchanged.

In [7]:
# Setting column neighborhood as the new index
# The neighborhood column is dropped from the DataFrame, this is the default behaviour
df.set_index('neighborhood', drop=True)

Unnamed: 0_level_0,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Belém,14455,Entire home/apt,8,5.0,2,1.0,57.0
Alvalade,66015,Entire home/apt,0,0.0,2,1.0,46.0
Santa Maria Maior,107347,Entire home/apt,63,5.0,3,1.0,69.0
Santa Maria Maior,125768,Entire home/apt,225,4.5,4,1.0,58.0
Santa Maria Maior,126415,Entire home/apt,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...
São Vicente,135915593,Entire home/apt,0,0.0,6,3.0,415.0
Santa Maria Maior,5376796,Entire home/apt,0,0.0,3,1.0,50.0
Santo António,6115933,Entire home/apt,0,0.0,6,4.0,138.0
São Vicente,97139334,Entire home/apt,0,0.0,4,1.0,56.0


## Selecting columns

### Selecting columns by name - dot notation

Using __dot notation__, you can select a column from a DataFrame, obtaining a Series with the column values.

This is how you can select the room_type column using dot notation:

In [8]:
df.room_type

room_id
6499        Entire home/apt
17031       Entire home/apt
25659       Entire home/apt
29248       Entire home/apt
29396       Entire home/apt
                 ...       
19388006    Entire home/apt
19393935    Entire home/apt
19396300    Entire home/apt
19397373    Entire home/apt
19400722    Entire home/apt
Name: room_type, Length: 13232, dtype: object

### Selecting columns by name - brackets notation

Using __brackets__, you can select one or more columns from the DataFrame.

This is how you can select the room_type column using brackets. Note that the output is a Series:

In [9]:
df['room_type']

room_id
6499        Entire home/apt
17031       Entire home/apt
25659       Entire home/apt
29248       Entire home/apt
29396       Entire home/apt
                 ...       
19388006    Entire home/apt
19393935    Entire home/apt
19396300    Entire home/apt
19397373    Entire home/apt
19400722    Entire home/apt
Name: room_type, Length: 13232, dtype: object

This is how you can select the room_type and neighborhood columns using brackets. Note that the output is a DataFrame:

In [10]:
df[['room_type', 'neighborhood']]

Unnamed: 0_level_0,room_type,neighborhood
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1
6499,Entire home/apt,Belém
17031,Entire home/apt,Alvalade
25659,Entire home/apt,Santa Maria Maior
29248,Entire home/apt,Santa Maria Maior
29396,Entire home/apt,Santa Maria Maior
...,...,...
19388006,Entire home/apt,São Vicente
19393935,Entire home/apt,Santa Maria Maior
19396300,Entire home/apt,Santo António
19397373,Entire home/apt,São Vicente


## Selecting rows

### Selecting rows by index position - iloc

With function [iloc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) you can select specific rows from a DataFrame.

In order to specify the rows you want to select you can use the row position (integer starting from 0), a list, or an array slice.

This is how you can select the first row (remember that Python starts indexing with a 0). Note that the output is a Series:

In [11]:
df.iloc[0]

host_id                           14455
room_type               Entire home/apt
neighborhood                      Belém
reviews                               8
overall_satisfaction                  5
accommodates                          2
bedrooms                              1
price                                57
Name: 6499, dtype: object

This is how you select rows 0, 2, 4 and 6. Note that the output is a DataFrame:

In [12]:
df.iloc[[0, 2, 4, 6]]

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0


This is how you select the first 3 rows:

In [13]:
df.iloc[:3]

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0


We can pick __slices__ using the slice notation below:

__$start:stop:step$__


In [14]:
df.iloc[1:10:2,:]

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
29891,128792,Entire home/apt,Misericórdia,28,5.0,3,1.0,49.0
33312,144398,Entire home/apt,Misericórdia,24,4.5,4,1.0,66.0


### Selecting rows by index name - loc

With function [loc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) you can select specific rows from a DataFrame, like with iloc.

The difference here is that you specify the rows to select using the rows' indexes instead of the rows' positions in the DataFrame.

This is how you select room 29396:

In [15]:
df.loc[29396]

host_id                            126415
room_type                 Entire home/apt
neighborhood            Santa Maria Maior
reviews                               132
overall_satisfaction                    5
accommodates                            4
bedrooms                                1
price                                  67
Name: 29396, dtype: object

Note that if you search for an index that doesn't exist, you'll get a KeyError:

In [16]:
df.loc[100]

KeyError: 100

We can also pass an array with multiple indexes just as we did for the the iloc function.

In [17]:
df.loc[[29396,17031]]

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0


We can also apply the slice operation using the loc function. It is important to sort the index before doing this operation.

In [39]:
df.loc[50000:70000,:]

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
50108,229376,Entire home/apt,Estrela,33,4.5,4,1.0,69.0
56906,270457,Entire home/apt,São Vicente,86,4.5,3,1.0,52.0
59227,184400,Entire home/apt,Misericórdia,25,4.5,7,3.0,138.0
65878,322145,Entire home/apt,Misericórdia,51,5.0,4,1.0,76.0


One cool thing about the slice notation is that we can even use it with an index that is not integer. For example if our index is a string we can get a slice based on the alphabetic order of the index. Lets set the _neighborhood_ column as index and select all the rows whose _neighborhood_ is between 'Alvalade' and 'Belém'

In [43]:
df.set_index('neighborhood').sort_index().loc['Alvalade':'Belém',:]

Unnamed: 0_level_0,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alvalade,31062136,Entire home/apt,8,5.0,8,3.0,83.0
Alvalade,2381723,Entire home/apt,116,5.0,4,2.0,52.0
Alvalade,8104036,Entire home/apt,0,0.0,2,1.0,64.0
Alvalade,11440809,Private room,0,0.0,2,1.0,40.0
Alvalade,62222594,Private room,0,0.0,2,1.0,52.0
...,...,...,...,...,...,...,...
Belém,8048828,Entire home/apt,122,5.0,4,1.0,58.0
Belém,21277737,Entire home/apt,0,0.0,4,2.0,230.0
Belém,3168004,Entire home/apt,13,4.5,3,2.0,46.0
Belém,4132746,Private room,49,4.5,3,1.0,22.0


## Multi-axis indexing

One nice thing about loc and iloc is that we can select columns and rows at the same time. 

Let's use the iloc to select based on the position of the rows and columns to pick the last five rows and the first 3 columns.

In [18]:
df.iloc[-5:,:3]

Unnamed: 0_level_0,host_id,room_type,neighborhood
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
19388006,135915593,Entire home/apt,São Vicente
19393935,5376796,Entire home/apt,Santa Maria Maior
19396300,6115933,Entire home/apt,Santo António
19397373,97139334,Entire home/apt,São Vicente
19400722,28219108,Entire home/apt,Areeiro


Let's now use the loc to select based on the names of the rows and columns to pick the _neighborhood_ and _price_ of the rooms 29396 and 17031.

In [19]:
df.loc[[29396,17031],['neighborhood','price']]

Unnamed: 0_level_0,neighborhood,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1
29396,Santa Maria Maior,67.0
17031,Alvalade,46.0


## A performance remark!

### Chain indexing vs Multi-axis indexing

Imagine you are asked to select the neighborhood of room 17031.

When we want to select a specific value in a DataFrame, given the row and the column, we might be tempted to do the following:

In [20]:
%%time
# This command is used to count the time that the code in this cell took to run

df['neighborhood'][17031]

CPU times: user 199 µs, sys: 12 µs, total: 211 µs
Wall time: 219 µs


'Alvalade'

However, as you can see, this is a faster solution:

In [21]:
%%time
df.loc[17031, 'neighborhood']

CPU times: user 70 µs, sys: 1e+03 ns, total: 71 µs
Wall time: 74.1 µs


'Alvalade'

But why?

When we select a row or column in a DataFrame using brackets, the Python bellow Pandas is calling the \__getitem\__ method to return the requested data.

Well, when we chain two sets of brackets, as in the first example, we are calling the \__getitem\__ method twice! This is called __chain indexing__, and __should be avoided__!

On the other hand, when we use loc to select a value given a row and column at the same time, Python is only calling the \__getitem\__ method once. This is called __multi-axis indexing__, and should be used.

## Masks

The [mask](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mask.html) function can be used to "hide" the rows that verify a certain condition.

The rows where the condition holds will have the values replaced by NaN:

In [23]:
df.mask(df.overall_satisfaction == 5.0)

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,,,,,,,,
17031,66015.0,Entire home/apt,Alvalade,0.0,0.0,2.0,1.0,46.0
25659,,,,,,,,
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0
29396,,,,,,,,
...,...,...,...,...,...,...,...,...
19388006,135915593.0,Entire home/apt,São Vicente,0.0,0.0,6.0,3.0,415.0
19393935,5376796.0,Entire home/apt,Santa Maria Maior,0.0,0.0,3.0,1.0,50.0
19396300,6115933.0,Entire home/apt,Santo António,0.0,0.0,6.0,4.0,138.0
19397373,97139334.0,Entire home/apt,São Vicente,0.0,0.0,4.0,1.0,56.0


## Where

And the [where](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) function can be used to hide the rows that __don't__ verify a certain condition.

The rows where the condition doesn't hold will have the values replaced by NaN:

In [24]:
df.where(df.overall_satisfaction == 5.0)

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,14455.0,Entire home/apt,Belém,8.0,5.0,2.0,1.0,57.0
17031,,,,,,,,
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0
29248,,,,,,,,
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0
...,...,...,...,...,...,...,...,...
19388006,,,,,,,,
19393935,,,,,,,,
19396300,,,,,,,,
19397373,,,,,,,,


Basically __mask__ and __where__ do the opposite of each other!

## Subsetting data on conditions

Using brackets notation, we can use conditions to subset data from the DataFrame.

By doing this, we get a DataFrame that (most likelly) has a different shape from the initial one, i.e, it's only a subset of it's rows.

Note that this is different from what we saw in the mask/filter functions: these functions don't change the DataFame shape, instead, they just replace the values that we don't want with NaNs.

Here we're subsetting the DataFrame to get all the rooms in the Alvalade neighborhood.

Note the DataFrame shape!

In [25]:
df[df.neighborhood == 'Alvalade']

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
72807,378525,Private room,Alvalade,1,0.0,1,1.0,29.0
108058,557442,Private room,Alvalade,24,5.0,1,1.0,27.0
143882,697596,Entire home/apt,Alvalade,0,0.0,3,2.0,577.0
172014,820718,Private room,Alvalade,77,4.5,4,1.0,45.0
...,...,...,...,...,...,...,...,...
19206346,131826194,Entire home/apt,Alvalade,0,0.0,5,2.0,87.0
19225159,84062304,Private room,Alvalade,0,0.0,2,1.0,54.0
19227195,134599148,Entire home/apt,Alvalade,0,0.0,4,2.0,56.0
19266319,15462808,Entire home/apt,Alvalade,0,0.0,2,1.0,52.0


As another example, we're selecting the rooms in Alvalade, that have more than 10 reviews.

Note the parenthesis around each condition, they're required!

In [26]:
df[(df.neighborhood == 'Alvalade') & (df.reviews > 10)]

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
108058,557442,Private room,Alvalade,24,5.0,1,1.0,27.0
172014,820718,Private room,Alvalade,77,4.5,4,1.0,45.0
172248,507901,Private room,Alvalade,38,4.5,2,1.0,29.0
216881,1119812,Entire home/apt,Alvalade,22,4.5,6,2.0,74.0
333919,507901,Private room,Alvalade,48,4.0,3,1.0,40.0
...,...,...,...,...,...,...,...,...
15044690,18671578,Entire home/apt,Alvalade,33,5.0,3,1.0,52.0
15786593,102115202,Entire home/apt,Alvalade,22,5.0,5,2.0,57.0
15839689,16844987,Entire home/apt,Alvalade,24,5.0,4,2.0,58.0
16690259,63598544,Entire home/apt,Alvalade,16,5.0,2,1.0,58.0


## Basic Math Operations: Adding Rows & Columns

### Adding a row

We can use the __loc__ indexing operation to add a new row in a dataframe with a specific index. If dataframe already has any row with that index name then this will replace the contents of that row.

In [27]:
# Add new room to our dataframe with roomid=737800
df.loc[737800] = [123456,'Private room','Entrecampos',82,4.5,2,1,29.0]

# Show our new room
df.loc[737800]

host_id                       123456
room_type               Private room
neighborhood             Entrecampos
reviews                           82
overall_satisfaction             4.5
accommodates                       2
bedrooms                           1
price                             29
Name: 737800, dtype: object

We can use __iloc__ to replace the row at a given index position:

In [28]:
df.iloc[-1] = [56787,'Private room','Avenidas Novas',82,4.5,2,1,29.0]

# Show our new room
df.tail()

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
19400722,28219108,Entire home/apt,Areeiro,0,0.0,5,3.0,75.0
737800,56787,Private room,Avenidas Novas,82,4.5,2,1.0,29.0


### Adding a column

We can add a column to a daframe using the same notations that we used to select them, i.e., dot notation, brackets notation, or loc operator.

In [44]:
# Creates a new column in the DataFrame (price_per_week), where each row is equal to the price * 7
df['price_per_week'] = df.price * 7
df.head()

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,price_per_week
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0,399.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0,322.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0,483.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0,406.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0,469.0


In [45]:
# Creates a new column in the DataFrame (price_per_week), where each row is equal to the price * 7
df.loc['price_per_week'] = df.price * 7
df.head()

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,price_per_week
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
6499,14455.0,Entire home/apt,Belém,8.0,5.0,2.0,1.0,57.0,399.0
17031,66015.0,Entire home/apt,Alvalade,0.0,0.0,2.0,1.0,46.0,322.0
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0,483.0
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0,406.0
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0,469.0


There is also a cool function called [assign](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html). This function returns a new dataframe with all the columns plus the one that we creating.

In [4]:
# Creates new columns in the DataFrame (people_per_bedroom, price_per_month), 
# where each row is equal to the value of the accommodates column divided by the bedrooms column
df = df.assign(people_per_bedroom = df['accommodates']/df['bedrooms'],
          price_per_month = df['price'] * 31) 
df.head()

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,people_per_bedroom,price_per_month
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0,2.0,1767.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0,2.0,1426.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0,3.0,2139.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0,4.0,1798.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0,4.0,2077.0


## Removing Rows & Columns

In order to drop rows and columns from a DataFrame, we can use function [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html).

In order to drop a row, we do the following:

In [32]:
# This drops the row with index 17031. This is the same than doing drop(17031,axis=0)
df.drop(labels=17031)

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,price_per_week
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
6499,14455.0,Entire home/apt,Belém,8.0,5.0,2.0,1.0,57.0,399.0
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0,483.0
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0,406.0
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0,469.0
29720,128075.0,Entire home/apt,Estrela,14.0,5.0,16.0,9.0,1154.0,8078.0
...,...,...,...,...,...,...,...,...,...
19396300,6115933.0,Entire home/apt,Santo António,0.0,0.0,6.0,4.0,138.0,966.0
19397373,97139334.0,Entire home/apt,São Vicente,0.0,0.0,4.0,1.0,56.0,392.0
19400722,28219108.0,Entire home/apt,Areeiro,0.0,0.0,5.0,3.0,75.0,525.0
737800,56787.0,Private room,Avenidas Novas,82.0,4.5,2.0,1.0,29.0,203.0


In order to drop a column, we do the following:

In [33]:
# This drops column neighborhood. This is the same than doing drop('neighborhood',axis=1)
df.drop(columns='neighborhood')

Unnamed: 0_level_0,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price,price_per_week
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6499,14455.0,Entire home/apt,8.0,5.0,2.0,1.0,57.0,399.0
17031,66015.0,Entire home/apt,0.0,0.0,2.0,1.0,46.0,322.0
25659,107347.0,Entire home/apt,63.0,5.0,3.0,1.0,69.0,483.0
29248,125768.0,Entire home/apt,225.0,4.5,4.0,1.0,58.0,406.0
29396,126415.0,Entire home/apt,132.0,5.0,4.0,1.0,67.0,469.0
...,...,...,...,...,...,...,...,...
19396300,6115933.0,Entire home/apt,0.0,0.0,6.0,4.0,138.0,966.0
19397373,97139334.0,Entire home/apt,0.0,0.0,4.0,1.0,56.0,392.0
19400722,28219108.0,Entire home/apt,0.0,0.0,5.0,3.0,75.0,525.0
737800,56787.0,Private room,82.0,4.5,2.0,1.0,29.0,203.0


If we want to drop multiple rows (or columns), we can use lists:

In [34]:
df.drop(labels=[6499, 17031])

Unnamed: 0_level_0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,price_per_week
room_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0,483.0
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0,406.0
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0,469.0
29720,128075.0,Entire home/apt,Estrela,14.0,5.0,16.0,9.0,1154.0,8078.0
29872,128698.0,Entire home/apt,Alcântara,25.0,5.0,2.0,1.0,75.0,525.0
...,...,...,...,...,...,...,...,...,...
19396300,6115933.0,Entire home/apt,Santo António,0.0,0.0,6.0,4.0,138.0,966.0
19397373,97139334.0,Entire home/apt,São Vicente,0.0,0.0,4.0,1.0,56.0,392.0
19400722,28219108.0,Entire home/apt,Areeiro,0.0,0.0,5.0,3.0,75.0,525.0
737800,56787.0,Private room,Avenidas Novas,82.0,4.5,2.0,1.0,29.0,203.0
