# Data Loading

In [None]:
import pandas as pd

In [None]:
df = pd.read_feather(
    # Local copy of the preprocessed data; swap in the URL below to load it from GitHub instead
    '../data/preprocessed/adult.feather',
#     'https://raw.githubusercontent.com/sesise0307/pydata2021-eda/main/data/preprocessed/adult.feather',
)

In [None]:
df.head()

# Matplotlib

![Matplotlib](../image/matplotlib.svg)

- [GitHub](https://github.com/matplotlib/matplotlib)
- [Documentation](https://matplotlib.org/stable/contents.html)

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

It is the basic building block of higher-level Python plotting tools such as `seaborn` and the plotting methods of `pandas`.
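As a small illustration of that layering, the sketch below plots a toy `pandas` Series (made-up values, not the notebook's dataset) and checks that the object returned by `.plot()` is a regular `matplotlib` Axes, so the usual `matplotlib` methods can be applied to it afterwards.

```python
import matplotlib.axes
import pandas as pd

# Toy data for illustration only (hypothetical values)
s = pd.Series([3, 1, 2], index=['a', 'b', 'c'])

# pandas draws the bars through matplotlib and returns the Axes it drew on
ax = s.plot(kind='bar')
print(isinstance(ax, matplotlib.axes.Axes))

# Because it is a plain matplotlib Axes, matplotlib methods work on it
ax.set_ylabel('Value')
ax.axhline(2, color='r', ls='--')
```

This is why knowing `matplotlib` pays off even when the plot itself is created by a higher-level library.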

It is essential to know `matplotlib` if you're using Python for data analysis.

I assume that you already have some experience with `matplotlib`.

In [None]:
# IPython magic: render figures inline in the notebook
%matplotlib inline

import matplotlib.pyplot as plt

## Scatter Plot

In [None]:
plt.scatter(df['hours_per_week'], df['fake_income'])

In [None]:
plt.scatter(df['hours_per_week'], df['fake_income'], c=df['age'], alpha=0.3, s=10)
plt.axhline(50000, color='r', ls='--');

## Histogram

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(df['age'])
plt.xticks(rotation=15);

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(df['race'].dropna())
plt.xticks(rotation=15);

In [None]:
plt.hist(df['fake_income']);

In [None]:
plt.hist(df['fake_income'], bins=range(0, 400000, 10000));

- The shape of a histogram depends on the choice of bins.
- To avoid that dependence, plot the empirical cumulative distribution function (ECDF) instead.
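The ECDF mentioned above can be sketched with `numpy` and `matplotlib` alone. The helper `ecdf` and the synthetic sample are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt


def ecdf(values):
    """Return the sorted values and, for each, the fraction of observations <= it."""
    x = np.sort(np.asarray(values, dtype=float))
    x = x[~np.isnan(x)]  # drop missing values, if any
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Synthetic right-skewed sample standing in for an income-like column (hypothetical data)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=10.5, sigma=0.6, size=1_000)

x, y = ecdf(sample)

# Step plot: no bins to choose, every observation contributes one step
fig, ax = plt.subplots(figsize=(10, 5))
ax.step(x, y, where='post')
ax.set_xlabel('Value')
ax.set_ylabel('Fraction of observations <= value')
```

With the notebook's data, calling `ecdf(df['fake_income'])` should work the same way, assuming the column is numeric.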

## Boxplot

In [None]:
plt.boxplot(df['fake_income']);

In [None]:
plt.boxplot(df['age']);

## Pie Chart

# Pandas

- [GitHub](https://github.com/pandas-dev/pandas)
- [Documentation](https://pandas.pydata.org/docs/index.html)
- [Visualization User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)

`pandas` is a Python package that provides fast, flexible, and expressive data structures
designed to make working with "relational" or "labeled" data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

> `pandas` is used mainly for manipulating DataFrames, but it also provides handy methods for creating decent-looking plots with one line of code.

Supported plot types:

![Pandas Plot Kind](../image/pandas_plot_kind.png)

Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

In [None]:
(
    df
    .groupby('age_group')
    ['fake_income']
    .mean()
#     .plot(kind='bar')
)

In [None]:
(
    df
    .groupby('age_group')
    ['fake_income']
    .mean()
    .plot(kind='bar')
)

plt.ylabel('Fake Income')
plt.xlabel('Age Group');

In [None]:
(
    df
    .groupby(['age_group', 'sex'])
    ['fake_income']
    .mean()
    .unstack()
    .plot(kind='bar')
)

plt.ylabel('Fake Income')
plt.xlabel('Age Group');

# Seaborn

- [GitHub](https://github.com/mwaskom/seaborn)
- [Documentation](https://seaborn.pydata.org/index.html)

`Seaborn` is a library for making statistical graphics in Python.

It builds on top of `matplotlib` and integrates closely with `pandas` data structures.

`Seaborn` helps you explore and understand your data.
Its plotting functions operate on `dataframes` and `arrays` containing whole datasets
and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean,
rather than on the details of how to draw them.
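To make the "statistical aggregation" concrete, here is a minimal sketch on a toy DataFrame (made-up values and column names, not the notebook's dataset). It relies on `sns.barplot` using the mean as its default estimator and compares the drawn bar heights with a manual `groupby` mean.

```python
import pandas as pd
import seaborn as sns

# Toy data (hypothetical, not the notebook's dataset): three observations per group
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Declarative call: name the columns; seaborn groups the rows by `group`
# and plots the mean of `value` for each group (its default estimator)
ax = sns.barplot(data=toy, x='group', y='value')

# The same aggregation written out by hand with pandas
means = toy.groupby('group')['value'].mean()

# Bar heights read back from the underlying matplotlib Axes
heights = [patch.get_height() for patch in ax.patches]
print(means.tolist(), heights)
```

The two printed lists should agree, which shows that the grouping and averaging happen inside the plotting call rather than in a separate preprocessing step.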

> This is my go-to library for starting data analysis.

In [None]:
import seaborn as sns

In [None]:
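# ci='sd' draws standard-deviation error bars (seaborn >= 0.12 deprecates `ci` in favor of errorbar='sd')
sns.barplot(data=df, x='age_group', y='fake_income', ci='sd', hue='sex')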

In [None]:
sns.barplot(data=df, x='age_group', y='fake_income', ci='sd', hue='sex')

plt.ylabel('Fake Income')
plt.xlabel('Age Group');

# Comparison