# Pandas Advanced: Learning notebook

In this notebook we will cover the following: 

    - Selecting columns (brackets and dot notation)
    - Selecting rows (loc and iloc)
    - Subsetting on conditions
    - Select Dtypes
    - nlargest & nsmallest
    - groupby
    - pandas plotting

First, we import pandas, like we learned in the previous unit, together with matplotlib, which we'll need for the plots at the end:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Now, we read the data that we'll use in this unit from the file __healthcare_costs.csv__, which is located in the __data/__ directory of the workshop repository.

For this, we'll use function __read_csv( )__, which was already shown in the previous unit.
We want to use column __patient_id__ as the DataFrame index, and for that we call __set_index( )__ on the DataFrame returned by read_csv( ).

In [None]:
# Read the data in file healthcare_costs.csv into a pandas DataFrame and use column patient_id as the DataFrame index.
df = pd.read_csv('https://raw.githubusercontent.com/vohcolab/PandaViz-Workshop/main/Pandas/Pandas%20Advanced/data/healthcare_costs.csv')
df = df.set_index('patient_id')

# Preview the first rows of the DataFrame.
df.head()
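
As a side note, the same result can usually be obtained in one step with the __index_col__ argument of read_csv( ). The sketch below compares both routes on a tiny in-memory CSV; the three rows are hypothetical values made up for illustration, not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical three-row extract; the real file has more rows and columns.
csv_text = """patient_id,age,sex,bmi
1,19,female,27.9
2,18,male,33.8
3,28,male,33.0
"""

# Option 1: read first, then promote a column to the index.
df_a = pd.read_csv(io.StringIO(csv_text)).set_index('patient_id')

# Option 2: let read_csv set the index while parsing.
df_b = pd.read_csv(io.StringIO(csv_text), index_col='patient_id')

# Both DataFrames should hold the same data, indexed by patient_id.
print(df_a.equals(df_b))
```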

## Selecting columns

### Selecting columns by name - dot notation

Using __dot notation__, you can select a column from a DataFrame, obtaining a Series with the column values.

This is how you can select the region column using dot notation:

In [None]:
df.region

### Selecting columns by name - brackets notation

Using __brackets__, you can select one or more columns from the DataFrame.

This is how you can select the region column using brackets. Note that the output is a Series:

In [None]:
df['region']

This is how you can select the sex and region columns using brackets. Note that the output is a DataFrame:

In [None]:
df[['sex', 'region']]
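
To make the Series/DataFrame distinction concrete, here is a minimal sketch on a toy DataFrame. The values and the extra column named `size` are hypothetical; `size` was chosen only because it clashes with an existing DataFrame attribute, which is one case where dot notation does not return the column.

```python
import pandas as pd

# Hypothetical toy data; the 'size' column does not exist in the real dataset.
toy = pd.DataFrame(
    {
        'sex': ['female', 'male', 'male'],
        'region': ['southwest', 'southeast', 'southeast'],
        'size': [1, 2, 3],
    },
    index=pd.Index([1, 2, 3], name='patient_id'),
)

one_col = toy['region']             # a single label returns a Series
one_col_df = toy[['region']]        # a list with one label returns a DataFrame
two_cols = toy[['sex', 'region']]   # a list with two labels returns a DataFrame

print(type(one_col).__name__, type(one_col_df).__name__)

# Dot notation resolves DataFrame attributes first:
# toy.size is the number of cells (rows * columns), not the 'size' column.
print(toy.size)
print(toy['size'].tolist())
```

Brackets always work, so they are the safer choice when a column name contains spaces or matches a DataFrame attribute or method.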

## Selecting rows

### Selecting rows by index position - iloc

With the [iloc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) indexer you can select specific rows from a DataFrame by position.

In order to specify the rows you want to select you can use the row position (integer starting from 0), a list of positions, or a slice.

This is how you can select the first row (remember that Python starts indexing with a 0). Note that the output is a Series:

In [None]:
df.iloc[0]

This is how you select rows 0, 2, 4 and 6. Note that the output is a DataFrame:

In [None]:
df.iloc[[0, 2, 4, 6]]

This is how you select the first 3 rows:

In [None]:
df.iloc[0:3]
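
The sketch below repeats these iloc patterns on a toy DataFrame whose index labels (hypothetical patient ids) differ from the row positions, to show that iloc only looks at positions.

```python
import pandas as pd

# Hypothetical toy data: the index labels are deliberately not 0, 1, 2, 3.
toy = pd.DataFrame(
    {'age': [19, 18, 28, 33], 'bmi': [27.9, 33.8, 33.0, 22.7]},
    index=pd.Index([10, 13, 14, 17], name='patient_id'),
)

first_row = toy.iloc[0]       # Series for the row at position 0 (label 10)
last_row = toy.iloc[-1]       # negative positions count from the end
first_two = toy.iloc[0:2]     # positions 0 and 1; the end of the slice is excluded
every_other = toy.iloc[::2]   # positions 0 and 2

print(first_row.name, last_row.name)
print(list(first_two.index), list(every_other.index))
```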

### Selecting rows by index name - loc

With the [loc](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) indexer you can select specific rows from a DataFrame, like with iloc.

The difference here is that you specify the rows to select using the rows' index labels instead of the rows' positions in the DataFrame.

This is how you select the patient whose patient_id is 17:

In [None]:
df.loc[17]

Note that if you search for an index that doesn't exist, you'll get a KeyError:

In [None]:
df.loc[9000]
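
If you are not sure whether a label exists, you can check the index first or catch the KeyError. This is a minimal sketch on hypothetical toy data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data indexed by patient_id.
toy = pd.DataFrame(
    {'age': [19, 18, 28, 33], 'bmi': [27.9, 33.8, 33.0, 22.7]},
    index=pd.Index([10, 13, 14, 17], name='patient_id'),
)

wanted = 9000

# Option 1: test membership in the index before selecting.
row = toy.loc[wanted] if wanted in toy.index else None

# Option 2: attempt the lookup and handle the failure.
caught = False
try:
    toy.loc[wanted]
except KeyError:
    caught = True
    print(f'patient_id {wanted} is not in the index')
```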

### Selecting rows & columns

We can use either loc or iloc to select rows and columns at the same time!

A - Index or list of indexes <br>
B - Column or list of columns <br>

df.loc[A,B]

Here are a few examples:

In [None]:
df.loc[17,'bmi']

In [None]:
df.loc[[10,13,14],['bmi','children']]

In [None]:
df.iloc[:10,:2] # first 10 rows and first 2 columns
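
One detail worth keeping in mind: a slice in loc includes its end label, while a slice in iloc excludes its end position. The sketch below shows the difference on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data indexed by patient_id.
toy = pd.DataFrame(
    {
        'bmi': [27.9, 33.8, 33.0, 22.7],
        'children': [0, 1, 3, 0],
        'age': [19, 18, 28, 33],
    },
    index=pd.Index([10, 13, 14, 17], name='patient_id'),
)

single_value = toy.loc[17, 'bmi']                # one label + one column gives a scalar
by_label = toy.loc[10:14, ['bmi', 'children']]   # labels 10 through 14, end included
by_position = toy.iloc[0:2, 0:2]                 # positions 0 and 1, end excluded

print(single_value)
print(list(by_label.index))
print(list(by_position.index))
```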

## Subsetting data on conditions

Using brackets notation, we can use conditions to subset data from the DataFrame.

By doing this, we get a DataFrame that (most likely) has a different shape from the initial one, i.e., it's only a subset of its rows.

Note that this is different from what we saw in the mask/where functions: these functions don't change the DataFrame shape; instead, they just replace the values that we don't want with NaNs.

The condition on its own doesn't subset anything yet: it returns a boolean Series, with True for each male patient.

In [None]:
df.sex == 'male'

Here we're subsetting the DataFrame to get all the male patients.

Note the DataFrame shape!

In [None]:
df[df.sex == 'male']

As another example, we're selecting the males older than 50.

Note the parentheses around each condition. They're required, because & binds more tightly than == and >!

In [None]:
df[(df.sex == 'male') & (df.age > 50)]
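
Conditions can also be combined with `|` (or), negated with `~` (not), or built with `isin` for membership tests. The following is a small sketch on hypothetical toy data, not the workshop dataset.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame(
    {
        'sex': ['female', 'male', 'male', 'female', 'male'],
        'age': [19, 62, 28, 55, 51],
        'region': ['southwest', 'southeast', 'northwest', 'northeast', 'southeast'],
    },
    index=pd.Index([1, 2, 3, 4, 5], name='patient_id'),
)

male_and_over_50 = toy[(toy.sex == 'male') & (toy.age > 50)]   # both conditions hold
male_or_over_50 = toy[(toy.sex == 'male') | (toy.age > 50)]    # at least one holds
not_male = toy[~(toy.sex == 'male')]                           # negation
in_south = toy[toy.region.isin(['southwest', 'southeast'])]    # membership test

print(list(male_and_over_50.index))
print(list(male_or_over_50.index))
print(list(not_male.index))
print(list(in_south.index))
```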

### Also using columns

If we don't want all columns, we can simply use loc, with the condition as the row selector and the column name as the column selector:

In [None]:
df.loc[df.sex == 'male', 'smoker']
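
A boolean Series works directly with loc. With iloc, the condition first has to be converted to a plain boolean array, for example with `to_numpy()`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data indexed by patient_id.
toy = pd.DataFrame(
    {
        'sex': ['female', 'male', 'male'],
        'smoker': ['yes', 'no', 'yes'],
        'bmi': [27.9, 33.8, 33.0],
    },
    index=pd.Index([1, 2, 3], name='patient_id'),
)

is_male = toy.sex == 'male'

smoker_col = toy.loc[is_male, 'smoker']             # one column gives a Series
two_cols = toy.loc[is_male, ['smoker', 'bmi']]      # a list of columns gives a DataFrame

# iloc is purely positional: pass a boolean array and column positions.
by_position = toy.iloc[is_male.to_numpy(), [1, 2]]

print(smoker_col.tolist())
print(by_position.equals(two_cols))
```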

## Data Types

DataFrames have an attribute that shows us the data type of each column. It's called __dtypes__ and can be used like this:

In [None]:
df.dtypes

Note that string columns show up with the dtype __object__ here (recent pandas versions may report a dedicated string dtype instead).

__Dtypes__ can also be used to subset DataFrames. For instance, this is how we select all the float64 columns from the DataFrame:

In [None]:
df.select_dtypes(include=['float64'])
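
select_dtypes also accepts broader categories and an __exclude__ argument. The sketch below, on hypothetical toy data, uses `'number'` to pick every numeric column regardless of whether it is an integer or a float.

```python
import pandas as pd

# Hypothetical toy data mixing integer, float and string columns.
toy = pd.DataFrame(
    {
        'age': [19, 18, 28],
        'sex': ['female', 'male', 'male'],
        'bmi': [27.9, 33.8, 33.0],
        'children': [0, 1, 3],
    }
)

numeric = toy.select_dtypes(include='number')       # integer and float columns
floats_only = toy.select_dtypes(include=['float64'])
non_numeric = toy.select_dtypes(exclude='number')   # everything else

print(list(numeric.columns))
print(list(floats_only.columns))
print(list(non_numeric.columns))
```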

## nlargest and nsmallest

[nlargest](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html) is a function that can be used to select the n rows that have the largest values in certain column(s).

For instance, this is how we select the two patients with the highest bmi.

In [None]:
df.nlargest(n=2, columns='bmi')

[nsmallest](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nsmallest.html)... 
Well, I'll let you extrapolate :)

But here's an example, where we select the 5 patients with the lowest bmi:

In [None]:
df.nsmallest(n=5, columns='bmi')
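
One way to think about these two functions is as a shortcut for sorting and then taking the first rows. The sketch below checks that on hypothetical toy data without tied values; with ties, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Hypothetical toy data with distinct bmi values.
toy = pd.DataFrame(
    {'age': [19, 18, 28, 33, 32], 'bmi': [27.9, 33.8, 33.0, 22.7, 28.9]},
    index=pd.Index([1, 2, 3, 4, 5], name='patient_id'),
)

top2 = toy.nlargest(n=2, columns='bmi')
bottom2 = toy.nsmallest(n=2, columns='bmi')

# Same rows, obtained by sorting in descending order and keeping the first two.
top2_sorted = toy.sort_values(by='bmi', ascending=False).head(2)

print(list(top2.index), list(bottom2.index))
print(top2.equals(top2_sorted))
```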

## Group by

We can also represent subgroups of our data with a single datapoint.

Let's say I want to know the "average body mass index of each gender".

We could do it in a boring way..

In [None]:
df.loc[df.sex == 'male',:]

In [None]:
df.loc[df.sex == 'male','bmi'].mean()

And repeat for female

In [None]:
df.loc[df.sex == 'female','bmi'].mean()

But what if we want to do the same for all ages for example? We would need to create a bunch of lines of code by hand. Here's an easier way: **introducing groupby**

In [None]:
df.groupby(by='sex').bmi.mean()

That easy huh..

Let's count the number of people in our dataset by age:

In [None]:
df.groupby(by='age').size().head(5)
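
groupby is not limited to a single statistic. As a sketch on hypothetical toy data, `agg` computes several summaries per group in one call; the column is selected with brackets here, which behaves the same as the dot notation used above.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame(
    {
        'sex': ['female', 'male', 'male', 'female', 'male'],
        'age': [19, 18, 18, 19, 18],
        'bmi': [27.9, 33.8, 33.0, 22.7, 28.9],
    }
)

mean_bmi = toy.groupby(by='sex')['bmi'].mean()                          # one value per group
summary = toy.groupby(by='sex')['bmi'].agg(['mean', 'min', 'max', 'count'])
people_per_age = toy.groupby(by='age').size()                           # rows per group

print(mean_bmi)
print(summary)
print(people_per_age)
```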

So there's 69 people aged 18, 68 people aged 19, etc..

## Extra material - Pandas plotting

Pandas allows out-of-the-box plotting to look at your data in an easy way.

Let's go back to the previous example:

In [None]:
df.groupby(by='age').size().head(3)

It would be nice to have a bar plot so that we can see the distribution of people by age in our dataset.

In [None]:
df.groupby(by='age').size().plot(kind='bar') # that easy

Hmm, the plot is a bit condensed. We can change that!

In [None]:
plt.figure(figsize=(10,6)) # (width, height) = (10,6)
df.groupby(by='age').size().plot(kind='bar')
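
An alternative sketch: the plot function itself accepts a __figsize__ argument and returns the matplotlib Axes, which can then be used to add labels. The data below is hypothetical, and the `matplotlib.use('Agg')` line is only there so the sketch runs as a standalone script without a display; leave it out inside a notebook.

```python
import matplotlib

matplotlib.use('Agg')  # assumption: running outside a notebook, with no display

import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({'age': [18, 18, 19, 19, 19, 20]})

counts = toy.groupby(by='age').size()

# figsize is (width, height) in inches; plot returns the Axes it drew on.
ax = counts.plot(kind='bar', figsize=(10, 6))
ax.set_xlabel('age')
ax.set_ylabel('number of people')

fig = ax.get_figure()
print(fig.get_size_inches())
```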

What if we want to have a notion of the charges' distribution?

In [None]:
df.charges.plot(kind='hist')

Interesting, most charges are below $10k

----