# Introduction

## About the notebook

This notebook was created for a friend, to show her some of the most used [pandas](https://pandas.pydata.org/) functions and answer her questions. It is basically a pandas tutorial on an interesting dataset, so if you are new to pandas too, I hope you can learn something from it.

## About the dataset

I will use the [penguin dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) as it resembles the data of my friend. Instead of laboratory samples, it describes three species of penguins (Adelie, Chinstrap and Gentoo) living on three islands (Biscoe, Dream and Torgersen). For each penguin, the culmen length and depth, flipper length, body mass and sex were recorded. You can find more about the dataset in the linked description.

In [None]:
# Standard libraries
import glob

# Specific imports from the standard library
from pathlib import Path

# Basic imports
import numpy as np
import pandas as pd

In [None]:
# Load the data
df = pd.read_csv("/kaggle/input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv")
df = df.dropna()   # Drop rows with nulls, so we have easier work

In [None]:
# Check what the dataframe looks like
df

## Questions

Here are some questions my friend had, which I will try to answer. I grouped some of them, as they can be solved using the same method.

- Any other useful functions? (This was the last question, but I will begin with it)
- How to select columns?
- How to drop columns?
- How to filter rows?
    - How does `.loc` work?
    - How to filter using values in cells?
    - How to divide a dataframe?
- How to create sub-dataframes?
    - How to filter using a for loop?
- How to concatenate dataframes?
- How does groupby work?
- How does joining work?
    - Is there anything else apart from concat to join dataframes?
- How to color cells by the value?

# Tutorial

# Some useful functions to check the dataframe

First, you can get the list of all columns in a dataframe using `.columns`:

In [None]:
df.columns

You can select the first `n` rows from the top of the dataframe using `.head(n)`:

In [None]:
df.head(5)

You can do the same with the bottom of the dataframe using `.tail`:

In [None]:
df.tail(5)

To get the size of the dataframe use `.shape`. It returns a tuple, where the first number corresponds to the number of rows and the second to the number of columns.

In [None]:
df.shape

It is also handy to adjust the default pandas display settings, because you may often want to see all the rows, or just the first hundred or two. It is also usually not necessary to show six decimal digits.

In [None]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 20)
pd.set_option("display.float_format", '{:,.2f}'.format)
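If you only need different display settings for a single output, the change can be kept temporary. The sketch below uses `pd.option_context` on a small made-up dataframe (`toy` is only an illustration, not part of the penguin data).

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"value": [1.23456, 2.34567, 3.45678]})

default_max_rows = pd.get_option("display.max_rows")

# Settings passed to option_context apply only inside the "with" block
with pd.option_context("display.max_rows", 5,
                       "display.float_format", "{:,.1f}".format):
    max_rows_inside = pd.get_option("display.max_rows")
    print(toy)

# After the block, the previous settings are restored
max_rows_after = pd.get_option("display.max_rows")
```

This can be useful when one cell needs a long listing, but you do not want every later cell to print hundreds of rows.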

# How to select columns?

The selection of columns depends on whether you want to select by column name or by position. If you want to select using column names, you can either use `.loc` or simply write a list in square brackets. The following four selections are all equivalent:

In [None]:
df[["species", "flipper_length_mm"]]

In [None]:
selected_columns = ["species", "flipper_length_mm"]
df[selected_columns]

In [None]:
df.loc[:, ["species", "flipper_length_mm"]]

In [None]:
selected_columns = ["species", "flipper_length_mm"]
df.loc[:, selected_columns]

If you want to select using the positions of the columns (counted from 0), you need to use `.iloc` instead.

In [None]:
df.iloc[:, 2:]    # Select all but the first two columns

In [None]:
df.iloc[:, :2]    # Select the first two columns

In [None]:
df.iloc[:, -3:]   # Select last three columns
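One difference between the two selectors is easy to miss when slicing: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. A minimal sketch on a made-up dataframe (the column names `a` to `d` are only illustrative):

```python
import pandas as pd

# Hypothetical toy data with four columns
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

by_label = toy.loc[:, "a":"c"]    # label slice: the end label "c" is included
by_position = toy.iloc[:, 0:2]    # position slice: position 2 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```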

# How to drop columns?

Columns can be dropped using the `.drop` function, to which you pass a list of the columns you want to drop. You also need to specify `axis=1` to tell pandas that you are dropping columns. The same function can also drop rows, but it is rarely used that way; I have never seen anybody use it to drop rows, as there are easier methods to do it, e.g. using masks and `.loc`.

In [None]:
cols_to_be_dropped = ["body_mass_g", "sex"]
df_with_less_columns = df.drop(cols_to_be_dropped, axis=1)
df_with_less_columns

Note that dropping a few columns gives the same result as selecting all the other columns.

In [None]:
selected_cols = ["species", "island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]
df_with_less_columns = df[selected_cols]
df_with_less_columns
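As a small aside, `.drop` also accepts a `columns=` keyword, which some people find easier to read than `axis=1`. The sketch below, on a made-up dataframe, also shows that `.drop` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "body_mass_g": [3750.0, 5000.0],
    "sex": ["MALE", "FEMALE"],
})

dropped_axis = toy.drop(["body_mass_g", "sex"], axis=1)
dropped_columns = toy.drop(columns=["body_mass_g", "sex"])   # same result, different spelling

print(list(dropped_columns.columns))
print(list(toy.columns))   # the original dataframe still has all its columns
```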

# How to filter rows

You may have noticed that when we used `.loc`, we passed it two things: a colon and a list of selected columns. The colon means "select everything", and as the first argument is for rows and the second for columns, a colon as the first argument means "select all rows". Instead of a colon, we can pass a mask to select only the rows that fulfill some condition. A mask is a boolean Series with one value per row of the dataframe. Basically, the mask says for each row whether it fulfills the condition and should be selected or not.

For example, we can see that at the beginning of the dataframe are Adelie penguins and at the end there are Gentoo penguins, so let's create a mask for selecting only Gentoo penguins.

In [None]:
mask_is_gentoo = df["species"]=="Gentoo"
mask_is_gentoo

You can see that the first values are `False`, because there are no Gentoo penguins at the beginning of the dataframe. However, there are Gentoo penguins at the end of the dataframe, so the values there are `True`. Now we can apply the mask to get only the rows with Gentoo penguins.

In [None]:
df.loc[mask_is_gentoo, :]   # We use colon as second argument, because we want to select all columns

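Note that masks can be combined using the `&` (AND) and `|` (OR) operators. When you write the comparisons inline, wrap each of them in parentheses, because `&` and `|` bind more tightly than `==`. For example, we can create another mask to select only female penguins and combine it with the Gentoo mask: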

In [None]:
mask_is_female = df["sex"] == "FEMALE"
mask_is_female_gentoo = mask_is_female & mask_is_gentoo
# mask_is_female_gentoo = (df["species"]=="Gentoo") & (df["sex"]=="FEMALE")   # Identical definition as on the row above

In [None]:
df.loc[mask_is_female_gentoo]   # You can also skip the colon for columns and all columns will be selected automatically
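Two details about combining masks are worth seeing in isolation: the parentheses around inline comparisons, and the `~` operator, which negates a mask. A minimal sketch on made-up data (`toy` is not the penguin dataframe):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "sex": ["FEMALE", "FEMALE", "MALE", "MALE"],
})

# Each comparison gets its own parentheses before being combined with &
mask_female_gentoo = (toy["species"] == "Gentoo") & (toy["sex"] == "FEMALE")

# ~ flips True and False: everything that is NOT a Gentoo
mask_not_gentoo = ~(toy["species"] == "Gentoo")

female_gentoo = toy.loc[mask_female_gentoo]
not_gentoo = toy.loc[mask_not_gentoo]
print(not_gentoo)
```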

Of course, it is also possible to use `>`, `<`, `>=` and `<=` apart from `==` for numerical values. For string values, it is handy to know about other methods apart from the exact match (`==`) we used. For example, we may want to select penguins that live on either of two islands. This can be done using two masks and the `|` (OR) operator, or we can use `.isin` and pass it a list.

In [None]:
# mask_island_TD = (df["island"]=="Torgersen") | (df["island"]=="Dream")   # Equivalent to the row below
mask_island_TD = df["island"].isin(["Torgersen", "Dream"])
df.loc[mask_island_TD, :]
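The numerical comparisons mentioned above work the same way as `==`. The sketch below uses a made-up dataframe and arbitrary thresholds, and also shows `.between`, which is a shortcut for a range check with both bounds included by default.

```python
import pandas as pd

# Hypothetical toy data; the thresholds below are arbitrary
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700.0, 5000.0, 4500.0, 3800.0],
})

mask_heavy = toy["body_mass_g"] >= 4500
mask_medium = toy["body_mass_g"].between(3750, 4500)   # same as (>= 3750) & (<= 4500)

heavy = toy.loc[mask_heavy]
medium = toy.loc[mask_medium]
print(medium)
```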

If you work with a lot of strings, e.g. every row has multiple tags separated by commas, you can also create a mask that checks whether each value contains a substring, using `.str.contains`, to which you pass the substring. By default, the pattern you pass is interpreted as a regular expression (`regex=True`), so if you know regular expressions you can use them directly; pass `regex=False` if you want the text to be matched literally.

In [None]:
mask_rea_in_island_name = df["island"].str.contains("rea")   # In our case only Dream island contains "rea" substring

df.loc[mask_rea_in_island_name]
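To make the difference between a regular expression and a literal match more concrete, here is a small sketch on a made-up Series of tags (the tag values are invented for illustration):

```python
import pandas as pd

# Hypothetical tags, only for illustration
tags = pd.Series(["adult,tagged", "juvenile", "adult", "tagged (2008)"])

# Regular expression (the default): "^" anchors the match to the start of the string
mask_starts_with_adult = tags.str.contains(r"^adult")

# Literal match: parentheses are special in regular expressions,
# so regex=False is the simplest way to search for them as plain text
mask_has_year = tags.str.contains("(2008)", regex=False)

print(tags[mask_starts_with_adult])
print(tags[mask_has_year])
```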

# How to create sub-dataframes?

We may, for example, want to split the penguins by species and island into separate dataframes and save each of them separately. Currently, the best way to handle paths in Python is the `Path` object from the `pathlib` library.

In [None]:
path_to_output_folder = Path("/kaggle/out/")   # Define path to folder
path_to_output_folder.mkdir(parents=True, exist_ok=True)   # Create the folder if it does not exist
for species in df["species"].unique():
    for island in df["island"].unique():
        mask = (df["species"] == species) & (df["island"] == island)
        sub_df = df.loc[mask]
        if not sub_df.empty:
            filename = f"{species}_on_{island}.csv"
            path_to_file = path_to_output_folder / filename   # "/" sign joins parts of the path together
            sub_df.to_csv(path_to_file, index=False)

We can check which files were created, e.g. using the `glob` library. This library is used to traverse the filesystem and search for files fulfilling some criteria. The most common use is shown below, where we pass the path to the folder `/kaggle/out/` followed by `*.csv`, which means *any file with the suffix .csv*.

In [None]:
glob.glob("/kaggle/out/*.csv")
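The same split can also be sketched without nested loops and masks, by iterating over a `.groupby` object, which yields only the combinations that actually occur in the data. The example below is self-contained: it uses a made-up dataframe and writes into a temporary folder (both are assumptions for the demo, not the paths used above).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Dream", "Torgersen", "Biscoe"],
    "body_mass_g": [3700.0, 3800.0, 5000.0],
})

output_folder = Path(tempfile.mkdtemp())   # temporary folder just for this demo

# Grouping by two columns yields a (species, island) tuple and the matching rows
for (species, island), sub_df in toy.groupby(["species", "island"]):
    sub_df.to_csv(output_folder / f"{species}_on_{island}.csv", index=False)

# Path objects have their own glob method
created_files = sorted(path.name for path in output_folder.glob("*.csv"))
print(created_files)
```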

# How to concatenate dataframes

If we were given dataframes with individual species on individual islands (e.g. like the ones we created in the previous part), we can join them using the `pd.concat` function, to which we pass a list of dataframes to concatenate. Note that the dataframes should have the same columns: `pd.concat` aligns columns by name, and any column missing from one of the dataframes is filled with `NaN`.

In [None]:
# Read all files first, then concatenate them in a single call
sub_dfs = [pd.read_csv(path) for path in glob.glob("/kaggle/out/*.csv")]
new_df = pd.concat(sub_dfs, ignore_index=True)   # ignore_index=True renumbers the rows from 0
new_df
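To see what happens when the columns do not match, here is a minimal sketch with two made-up dataframes that share only the `species` column:

```python
import pandas as pd

# Hypothetical toy data: the two dataframes share only the "species" column
first = pd.DataFrame({"species": ["Adelie"], "body_mass_g": [3700.0]})
second = pd.DataFrame({"species": ["Gentoo"], "island": ["Biscoe"]})

# Columns are aligned by name; values that do not exist are filled with NaN
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```

The result has all three columns, so a typo in a column name of one file would silently create an extra, mostly empty column. Checking `combined.columns` after concatenating is a cheap way to notice that.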

# .groupby

To compute statistics for each group, use the `.groupby` function followed by `.agg`. I also always call `.reset_index` afterwards, because otherwise the columns you grouped by become the index of the result.

In [None]:
df.groupby(by=["species"]).agg(**{"avg_flipper_length_mm":("flipper_length_mm", "mean"),
                                  "median_flipper_length_mm":("flipper_length_mm", "median"),
                                  "std_flipper_length_mm":("flipper_length_mm", "std")}).reset_index() 

In [None]:
df.groupby(by=["island", "species"]).agg(**{"cnt_penguins":("species", "count")}).reset_index()
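The effect of `.reset_index` is easier to see on a tiny example. The sketch below uses a made-up dataframe and also shows `as_index=False`, an alternative that keeps the grouping column as a regular column from the start.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "flipper_length_mm": [180.0, 190.0, 220.0],
})

# Without reset_index, "species" becomes the index of the result
indexed = toy.groupby("species").agg(avg_flipper_length_mm=("flipper_length_mm", "mean"))

# Two ways to get "species" back as a regular column
flat_reset = indexed.reset_index()
flat_as_index = toy.groupby("species", as_index=False).agg(
    avg_flipper_length_mm=("flipper_length_mm", "mean")
)

print(indexed)
print(flat_reset)
```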

# How does joining work

To answer the question of whether there is another way to join dataframes: there is a function called `pd.merge`, which is usually used with indexed data, or when the information about samples or groups is stored in different dataframes.

To demonstrate how this works, let me first load a more detailed dataframe about penguins into `df_detailed`. Then I split it into the measurements and the information about the penguins, such as where they live or what species they are.

In [None]:
df_detailed = pd.read_csv("/kaggle/input/palmer-archipelago-antarctica-penguin-data/penguins_lter.csv")
df_detailed = df_detailed.loc[df_detailed["studyName"]=="PAL0708"]
df_detailed

In [None]:
cols_indices = ['Individual ID']
cols_about_penguins = ['Species', 'Region', 'Island', 'Stage', 'Clutch Completion', 'Date Egg']
cols_measurements = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 
                     'Sex', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments']

df_about_penguins = df_detailed[cols_indices+cols_about_penguins]
df_measurements = df_detailed[cols_indices+cols_measurements]

In [None]:
df_about_penguins

In [None]:
df_measurements

Now we have the information about penguins saved in two different dataframes. This is a common situation when you work with databases. It is not practical to save all the information in a single table, and it is a much better solution to split the data and make the tables joinable through IDs.

In both dataframes, each penguin can be identified by `Individual ID`. If we want to use `pd.merge`, we should first check whether this identifier is unique. If, for example, each `Individual ID` appeared only once in the first table but had duplicates in the second table, every duplicate would be matched, rows would be repeated in the result and the final statistics would not be correct.

In [None]:
print(f"{df_measurements.shape[0]} == {df_measurements.drop_duplicates(subset=['Individual ID']).shape[0]}")
print(f"{df_measurements.shape[0] == df_measurements.drop_duplicates(subset=['Individual ID']).shape[0]}")
print()
print(f"{df_about_penguins.shape[0]} == {df_about_penguins.drop_duplicates(subset=['Individual ID']).shape[0]}")
print(f"{df_about_penguins.shape[0] == df_about_penguins.drop_duplicates(subset=['Individual ID']).shape[0]}")
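The same check can be written more briefly with `.is_unique`, and `.duplicated` can show which rows are the offending ones. A minimal sketch on a made-up dataframe that contains one duplicated ID on purpose (the ID values are invented):

```python
import pandas as pd

# Hypothetical toy data with a deliberately duplicated ID
toy = pd.DataFrame({
    "Individual ID": ["N1A1", "N1A2", "N1A2"],
    "value": [1, 2, 3],
})

ids_are_unique = toy["Individual ID"].is_unique

# keep=False marks every occurrence of a duplicated ID, not only the repeats
duplicated_rows = toy.loc[toy["Individual ID"].duplicated(keep=False)]

print(ids_are_unique)
print(duplicated_rows)
```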

Once we are sure that there are no duplicates and the dataframes will join correctly, we can use `pd.merge`.

In [None]:
joined_df = pd.merge(df_about_penguins, df_measurements, how="inner", on=["Individual ID"])
joined_df

To the `on` argument we pass the common column (the key) through which the dataframes should be joined. To `how` we pass the type of join. There are five types of join:
- Inner - Join and keep only the rows whose key is present in both dataframes.
- Left - Same as inner, but also keeps the records from the left dataframe that have no corresponding key in the right dataframe. For those rows, the columns from the right dataframe are filled with missing values (`NaN`).
- Right - The same as left, but keeps the records from the right dataframe instead.
- Outer - Keeps the records from both dataframes; if there is no corresponding key on the left or right side, the columns from that side are filled with missing values (`NaN`).
- Cross - Cartesian product: every row of the left dataframe is combined with every row of the right dataframe, so no key is used (and `on` must not be passed).
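The join types listed above can be compared on two tiny made-up dataframes that share only some IDs (the ID values and measurements are invented). The sketch also uses two optional arguments of `pd.merge`: `indicator=True`, which adds a `_merge` column saying where each row came from, and `validate="one_to_one"`, which raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Hypothetical toy data: "N1A1" is only on the left, "N3A1" only on the right
left = pd.DataFrame({
    "Individual ID": ["N1A1", "N1A2", "N2A1"],
    "Island": ["Torgersen", "Torgersen", "Dream"],
})
right = pd.DataFrame({
    "Individual ID": ["N1A2", "N2A1", "N3A1"],
    "Body Mass (g)": [3800.0, 3250.0, 4100.0],
})

joins = {
    how: pd.merge(left, right, how=how, on="Individual ID",
                  indicator=True, validate="one_to_one")
    for how in ["inner", "left", "right", "outer"]
}
cross = pd.merge(left, right, how="cross")   # no "on": every left row with every right row

for how, joined in joins.items():
    print(how, len(joined))
print("cross", len(cross))
print(joins["outer"])
```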

# How to color cells by the value?

Styling and coloring cells is done via the `.style` attribute. Through it you access the [Styler](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html), where you can, for example, set a background gradient or draw bar plots inside the dataframe.

In [None]:
df.head(20).style.background_gradient(subset=["flipper_length_mm", "body_mass_g"], cmap='viridis')

In [None]:
df.head(20).style.bar(subset=["culmen_length_mm", "culmen_depth_mm"], align='mid')
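If the built-in gradients and bars are not enough, you can also color cells with your own rule. The sketch below is one possible approach using `Styler.apply`, which calls a function on each selected column and expects one CSS string per cell. The function names `highlight_above_mean` and `style_heavy_penguins`, the color and the "above the column mean" rule are all made up for illustration.

```python
import pandas as pd


def highlight_above_mean(column: pd.Series) -> list:
    """Return one CSS string per cell; cells above the column mean get a background."""
    return ["background-color: lightgreen" if is_above else ""
            for is_above in column > column.mean()]


def style_heavy_penguins(frame: pd.DataFrame):
    """Build a Styler that highlights above-average values in body_mass_g."""
    return frame.style.apply(highlight_above_mean, subset=["body_mass_g"])


# Hypothetical toy data to show what the helper returns
toy = pd.DataFrame({"body_mass_g": [3700.0, 5000.0, 4500.0]})
css = highlight_above_mean(toy["body_mass_g"])
print(css)
```

In a notebook cell, `style_heavy_penguins(df.head(20))` as the last line would then render the colored table.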