### Welcome to Google Colab!

**Instructions**: Make a copy of this notebook by clicking **File -> Save a copy in Drive**.

To run a cell, press **Shift + Enter** or click the run button at the top left of the cell.

### Data Manipulation with Pandas

**DataFrames** are the primary data structure in pandas. You can think of a DataFrame as a table with rows and columns, where each column is a list of values and all columns have the same length.

Pandas documentation: https://pandas.pydata.org/docs/user_guide/index.html#user-guide

In [None]:
# Imports
import numpy as np
import pandas as pd

np.random.seed(1)

# Create a DataFrame
df = pd.DataFrame({
    "Name":["Alice", "Bob", "Chloe"],
    "Age":[20, 23, 21],
    "FavoriteColor":["Blue", "Red", "Green"]
})

# When you run a cell, the value of its last expression is displayed automatically,
# which is why df is shown when this cell runs.
# To show a DataFrame explicitly (for example, from the middle of a cell), use display(df)
df

In [None]:
# TODO 1: Output the dataframe using display

### Filtering By Column

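To select a single column from a DataFrame, use:

`df["colName"]` or `df.colName` (the dot form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)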

To view multiple columns:

`df[["col1", "col2"]]` (note the double brackets as this is a list of column names)
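As a rough illustration of the difference between the two forms, the sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not part of this notebook's data). Single brackets return a Series, while double brackets return a DataFrame.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [0.1, 0.2, 0.3],
})

# Single brackets with one column name return a pandas Series
single = toy_df["col1"]

# Double brackets (a list of column names) return a DataFrame
multiple = toy_df[["col1", "col2"]]

print(type(single).__name__)    # Series
print(type(multiple).__name__)  # DataFrame
print(list(multiple.columns))   # ['col1', 'col2']
```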

In [None]:
# TODO 2a: Select and display only the "Name" column from the dataframe

# TODO 2b: Select and display only the "Name" and "FavoriteColor" columns from the dataframe

### Now For A More Complex Dataset!

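The penguins dataset, loaded through seaborn, records the species, island, sex, and body measurements (bill length, bill depth, flipper length, body mass) of individual penguins.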

In [None]:
import seaborn as sns

# Load the Penguins dataset
penguins_df = sns.load_dataset("penguins")

# Display the first 5 rows
display(penguins_df.head(5))

### Some Useful Commands

Pandas has many commands that can help you get a better understanding of a DataFrame. Try out these commands in the cell below and figure out what they do! Make sure you run them on the penguins DataFrame by replacing `df` with `penguins_df` and `"col1"` with a real column name.

`df.shape`

`df.columns`

`df.describe()`

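`df.corr(numeric_only=True)` (the `numeric_only=True` argument skips text columns such as `species`; without it, pandas 2.0 and later raise an error on this dataset)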

`df["col1"].unique()`

In [None]:
# TODO 3: Try out the above commands and use them to answer these questions:
"""
How many rows are in the penguins DataFrame?
What is the mean "bill_length_mm"?
What is the minimum "flipper_length_mm"?
What are the unique values in the "species" column?
What are the unique values in the "island" column?
"""

### Filtering By Row

In [None]:
# Get all rows where the species is "Adelie"
penguins_df[penguins_df["species"] == "Adelie"]

In [None]:
# Get all rows where the species is "Adelie" AND the bill length is over 39mm
penguins_df[(penguins_df["species"] == "Adelie") & (penguins_df["bill_length_mm"] > 39)]
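The filters above work by building a boolean mask: the comparison produces a Series of True/False values, and indexing the DataFrame with that mask keeps only the rows where it is True. Here is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [38.0, 47.5, 40.2, 49.0],
})

# A comparison returns a boolean Series with one value per row
is_adelie = toy_df["species"] == "Adelie"
print(is_adelie.tolist())  # [True, False, True, False]

# Combine conditions with & (and) or | (or). Wrap each condition in
# parentheses because & and | bind more tightly than == and >
mask = is_adelie & (toy_df["bill_length_mm"] > 39)

# Indexing with the mask keeps only the rows where it is True
filtered = toy_df[mask]
print(len(filtered))  # 1
```

Calling `len()` on a filtered DataFrame gives the number of matching rows.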

In [None]:
# TODO 4: Answer these questions:
"""
How many entries are there for each penguin species?
How many entries are there for Adelie penguins with a body mass below 3400 grams?
For all penguins with a bill depth above 18, how many are Female and how many are Male?
"""

### Groupby

`groupby` is a method that groups rows by the values in one or more columns and then applies an aggregate function to each group.

General form:

`df.groupby("col1")["col2"].aggregateFunction()`

Examples of aggregate functions:

`mean()`, `count()`, `sum()`, `min()`, `max()`
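To make the general form concrete, here is a minimal sketch on a small made-up DataFrame (`toy_df`, its column names, and its values are hypothetical). Each distinct value in the grouping column becomes one row of the result.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "score": [10, 20, 6, 12, 30],
})

# Group rows by "team", select the "score" column, then average each group
mean_by_team = toy_df.groupby("team")["score"].mean()
print(mean_by_team["red"])   # 15.0
print(mean_by_team["blue"])  # 16.0

# The same pattern works with any other aggregate function
max_by_team = toy_df.groupby("team")["score"].max()
print(max_by_team["blue"])   # 30
```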


In [None]:
# Group the dataset by species, then generate the average bill length for each species
penguins_df.groupby("species")["bill_length_mm"].mean()

In [None]:
# Grouping by multiple columns with multiple aggregations
penguins_df.groupby(["species", "sex"])["bill_length_mm"].agg(["mean", "min", "max"])

In [None]:
# TODO 5: Answer the below questions:
"""
What is the average body mass for each penguin species?
Which species has the highest average bill depth?
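How many entries are there for each species? Hint: an aggregate can be called directly on the grouped DataFrame without selecting "col2". size() counts rows per group; count() counts non-null values in each column.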
"""

### Data Visualization With Seaborn

Seaborn is a Python data visualization library built on top of matplotlib.

Seaborn documentation: https://seaborn.pydata.org/tutorial.html

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Apply seaborn's default theme to all subsequent plots
sns.set_theme()

In [None]:
# Scatterplot of bill_length_mm vs. bill_depth_mm
sns.scatterplot(data=penguins_df, x="bill_length_mm", y="bill_depth_mm")
plt.title("Bill Length (mm) vs Bill Depth (mm)")
plt.show()

In [None]:
# Color the points by species and style them by sex
sns.scatterplot(data=penguins_df, x="bill_length_mm", y="bill_depth_mm", hue="species", style="sex")
plt.title("Bill Length (mm) vs Bill Depth (mm)")
plt.show()

In [None]:
# Histogram of the distribution of flipper length
sns.histplot(penguins_df["flipper_length_mm"], bins=20)
plt.title("Distribution of Flipper Length")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Count")
plt.show()

### More Visualizations

Go to the link below and scroll down to the "Plotting functions" section. Try out some graphs! What conclusions can you draw from them?

https://seaborn.pydata.org/tutorial.html

### Suggestions:

Scatterplot - Flipper Length vs. Body Mass (colored by species): https://seaborn.pydata.org/tutorial/relational.html#relating-variables-with-scatter-plots

Box Plot - Distribution of Body Mass by Species: https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn-boxplot

In [None]:
# TODO 6: Generate at least three more plots! What observations can you make?