<h1>How to derive basic insights from data?</h1>
<br>
We often work with tabular, spreadsheet-like data files, for which Pandas is a well-suited library. Pandas plotting is built on top of Matplotlib and offers a higher-level API, so common plots take less code.


<p class="lead"> 
Table of Contents: 

- <a href="#Understanding-Pandas-and-loading-data">Understanding Pandas and loading data</a>
- <a href="#Pandas-plotting-API">Pandas plotting API</a>    
- <a href="#Pattern-of-a-continuous-variable">Pattern of a continuous variable</a>
- <a href="#Pattern-of-a-categorical-variable">Pattern of a categorical variable</a>
- <a href="#Relationship-between-two-variables">Relationship between two variables</a>
- <a href="#Time-series">Time series</a>
    
</p>





<div>
<h2 class="breadcrumb">Understanding Pandas and loading data</h2><p>
</div>

> Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. (pandas.pydata.org)

In [None]:
import pandas as pd
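
The "dict of Series" description in the quote above can be made concrete with a small sketch. The row labels and values below are made up for illustration and are not taken from `assets/mpg.csv`.

```python
import pandas as pd

# Two Series that share the same row labels (hypothetical values).
labels = ["car_a", "car_b", "car_c"]
mpg = pd.Series([18.0, 24.0, 31.5], index=labels)
origin = pd.Series(["usa", "europe", "japan"], index=labels)

# A DataFrame behaves like a dict of Series: the keys become column names.
toy = pd.DataFrame({"mpg": mpg, "origin": origin})

print(toy)
print(toy.dtypes)          # each column keeps its own type
print(type(toy["mpg"]))    # selecting one column gives back a Series
```

Each column can hold a different type, which is what separates a DataFrame from a plain 2-dimensional array.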

In [None]:
df = pd.read_csv('assets/mpg.csv')

In [None]:
df.head()

In [None]:
len(df)

In [None]:
df.describe()

The row and column labels are available through the `index` and `columns` attributes, respectively:

In [None]:
df.columns

In [None]:
df.index

Use `df.column_name` or `df['column_name']` to get a column:

In [None]:
df.weight

In [None]:
df['weight']
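
The two access styles are not always interchangeable. The sketch below uses a made-up DataFrame (the column names `model year` and `count` are chosen only to show the edge cases) to illustrate when bracket access is the safer choice.

```python
import pandas as pd

# Hypothetical data: one column name contains a space,
# another one has the same name as a DataFrame method.
toy = pd.DataFrame({
    "weight": [3504, 2372],
    "model year": [70, 76],
    "count": [1, 2],
})

same = toy.weight.equals(toy["weight"])  # both forms return the same Series
by_label = toy["model year"]             # a name with a space needs brackets
method = toy.count                       # this is DataFrame.count, not the column
column = toy["count"]                    # brackets always return the column

print(same, callable(method), list(column))
```

Attribute access is convenient for quick exploration, while brackets work for every column name.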

<div class="alert alert-info">
<h4>Exercise</h4>

Try loading another dataset from 'assets/penguins.csv' and call it `dfe`.  
    
<details><summary><i><u>(Solution)</u></i></summary><br>
    
```python
dfe = pd.read_csv('assets/penguins.csv')
```

</details>
</div>

<div class="alert alert-info">
<h4>Exercise</h4>

Explore the Penguins dataset.  
</div>

<div>
<h2 class="breadcrumb">Pandas plotting API</h2><p>
</div>

In [None]:
df.plot();

<div class="alert alert-success">
<h4>Tips</h4>

To understand how `df.plot` works, try `df.plot?`.
</div>

In [None]:
df.plot?
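
Because Pandas plotting sits on top of Matplotlib, a plotting call hands back a Matplotlib `Axes` object that can be customised further. A minimal sketch, using made-up numbers rather than the mpg data:

```python
import pandas as pd
import matplotlib.axes

# Hypothetical values, only for illustration.
toy = pd.DataFrame({
    "mpg": [18.0, 24.0, 31.5, 27.0],
    "weight": [3504, 2372, 1985, 2130],
})

# The same plot type can be requested in two ways.
ax = toy.plot(x="weight", y="mpg", kind="scatter")   # keyword form
ax2 = toy.plot.scatter(x="weight", y="mpg")          # accessor form

# The returned object is a regular Matplotlib Axes.
print(isinstance(ax, matplotlib.axes.Axes))
ax.set_title("Customised through the Matplotlib Axes")
ax.set_xlabel("weight (made-up values)")
```

Anything the Pandas API does not expose directly can usually be adjusted on the returned `Axes`.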

<div>
<h2 class="breadcrumb">Pattern of a continuous variable</h2><p>
</div>

In [None]:
df.head()

### Histogram

> A `histogram` is a representation of the distribution of data.


In [None]:
df['mpg'].hist(grid=False);

In [None]:
df['mpg'].plot(kind='hist');
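
To see what a histogram computes before it is drawn, the binning can be done by hand with NumPy. This is a sketch on made-up values, not the mpg column; `bins=5` is an arbitrary choice for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements.
values = pd.Series([12, 15, 15, 18, 21, 22, 22, 23, 30, 35])

# Split the value range into 5 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}): {count}")
```

The bar heights in a histogram plot are these counts. The number of bins can be changed in the Pandas calls above with the `bins` argument, for example `df['mpg'].plot(kind='hist', bins=20)`.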

Create histogram plots for two variables:

In [None]:
df[['mpg', 'horsepower']].plot(kind='hist');

In [None]:
df[['mpg', 'horsepower']].plot(kind='hist', subplots=True);

What is the distribution of mpg by origin?

In [None]:
df.groupby('origin')['mpg'].plot(kind='hist', alpha=0.3, legend=True);

### Kernel Density Estimate plot

> In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. This function uses Gaussian kernels and includes automatic bandwidth determination.


In [None]:
df['mpg'].plot(kind='kde');

In [None]:
df.groupby('origin')['mpg'].plot(kind='kde', legend=True);
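
The phrases "Gaussian kernels" and "automatic bandwidth determination" in the quote above can be unpacked with a short sketch. It assumes SciPy is installed and compares a hand-written estimate with `scipy.stats.gaussian_kde`, whose default bandwidth follows Scott's rule. The sample is randomly generated, not the mpg data.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic sample standing in for a continuous variable.
rng = np.random.default_rng(0)
sample = rng.normal(loc=25.0, scale=5.0, size=200)
grid = np.linspace(5, 45, 101)

# Scott's rule in one dimension: bandwidth = std * n^(-1/5).
n = sample.size
h = sample.std(ddof=1) * n ** (-1 / 5)

# Place one Gaussian bump on every observation and average them.
z = (grid[:, None] - sample[None, :]) / h
manual = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# SciPy's estimator with its default bandwidth.
reference = gaussian_kde(sample)(grid)
print(np.allclose(manual, reference))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely. In the Pandas calls above, the `bw_method` argument controls this.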

<div class="alert alert-info">
<h4>Exercise</h4>

In the previous exercise, you loaded the Penguins dataset. In this exercise, try exploring the distribution of penguin body mass and see how the distributions differ by species and sex. 
    
<details><summary><i><u>(Solution)</u></i></summary><br>
    
```python
dfe['body_mass_g'].plot(kind='kde');
dfe.groupby('species')['body_mass_g'].plot(kind='kde', legend=True);
dfe.groupby('sex')['body_mass_g'].plot(kind='kde', legend=True);
dfe.groupby(['sex','species'])['body_mass_g'].plot(kind='kde', legend=True);

```

</details>
</div>

<div>
<h2 class="breadcrumb">Pattern of a categorical variable</h2><p>
</div>

### Bar chart

In [None]:
df.head()

In [None]:
df['origin'].value_counts()

In [None]:
df['origin'].value_counts().plot(kind='bar');

In [None]:
df['origin'].value_counts().plot(kind='barh');

In [None]:
# Select 'mpg' before averaging so that non-numeric columns are not passed to mean()
table = df.groupby(['model_year', 'origin'])['mpg'].mean().unstack('origin')
table

In [None]:
table.plot(kind='bar', stacked=True);
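
The group-then-unstack step used to build the table can be easier to follow on a tiny example. The sketch below uses made-up rows that only borrow the column names from the mpg data.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "origin": ["usa", "usa", "japan", "usa", "japan"],
    "mpg": [18.0, 16.0, 24.0, 20.0, 30.0],
})

# Step 1: one mean per (model_year, origin) pair, as a Series with a two-level index.
long = toy.groupby(["model_year", "origin"])["mpg"].mean()
print(long)

# Step 2: move the 'origin' level from the rows to the columns.
wide = long.unstack("origin")
print(wide)
```

In the wide table each row becomes one group of bars and each column becomes one bar colour, which is the layout `plot(kind='bar')` expects.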

### Pie chart

In [None]:
df['origin'].value_counts().plot(kind='pie');

<div class="alert alert-info">
<h4>Exercise</h4>

Create two plots with the Penguins dataset:
1) Bar plot showing the counts of species. 
    
2) Bar plot showing the mean values of body mass by species and by sex. 
    
    
<details><summary><i><u>(Solution)</u></i></summary><br>
    
```python
dfe['species'].value_counts().plot(kind='bar');
dfe.groupby(['species', 'sex'])['body_mass_g'].mean().unstack('sex').plot(kind='bar');
```

</details>
</div>

<div>
<h2 class="breadcrumb">Relationship between two variables</h2><p>
</div>

Relationship between weight and mpg:

In [None]:
df.plot(x='weight', y='mpg', kind='scatter');

In [None]:
df.plot(x='weight', y='mpg', kind='scatter', title='Relationship between weight and mpg');

Relationship between weight and mpg by origin:

There are often many ways to create the same plot. Here is one; we will see another in the next notebook. 

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
df[df.origin=='usa'].plot(x='weight', y='mpg', kind='scatter', ax=ax, c='r', label='USA');
df[df.origin=='japan'].plot(x='weight', y='mpg', kind='scatter', ax=ax, c='g', label='Japan');
df[df.origin=='europe'].plot(x='weight', y='mpg', kind='scatter', ax=ax, c='b', label='Europe');
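
The three near-identical calls above can also be written as a loop over groups. This is only a sketch of one possible variation, not necessarily the approach taken in the next notebook. The data and the `colors` mapping are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows that borrow the column names from the mpg data.
toy = pd.DataFrame({
    "weight": [3504, 3693, 2372, 2130, 1985, 2670],
    "mpg": [18.0, 15.0, 24.0, 27.0, 31.5, 25.0],
    "origin": ["usa", "usa", "japan", "japan", "europe", "europe"],
})

# Assumed colour per group, matching the colours used above.
colors = {"usa": "r", "japan": "g", "europe": "b"}

fig, ax = plt.subplots()
for origin, group in toy.groupby("origin"):
    # Draw every group on the same Axes so the points share one plot.
    group.plot(x="weight", y="mpg", kind="scatter", ax=ax,
               c=colors[origin], label=origin)
```

The loop keeps the plotting call in one place, so adding a fourth group only needs a new entry in `colors`.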


<div class="alert alert-info">
<h4>Exercise</h4>

Create two plots with the Penguins dataset:
1) Scatter plot showing the relationship between flipper length and body mass. 
    
2) Scatter plot showing this relationship by species. 
    
    
<details><summary><i><u>(Solution)</u></i></summary><br>
    
```python
dfe.plot(x='flipper_length_mm', y='body_mass_g', kind='scatter');

fig, ax = plt.subplots()
dfe[dfe.species=='Adelie'].plot(x='flipper_length_mm', y='body_mass_g', kind='scatter', ax=ax, c='r', label='Adelie');
dfe[dfe.species=='Gentoo'].plot(x='flipper_length_mm', y='body_mass_g', kind='scatter', ax=ax, c='g', label='Gentoo');
dfe[dfe.species=='Chinstrap'].plot(x='flipper_length_mm', y='body_mass_g', kind='scatter', ax=ax, c='b', label='Chinstrap');

```

</details>
</div>

<div>
<h2 class="breadcrumb">Time series</h2><p>
</div>

In [None]:
dft = pd.read_csv(
    'assets/air_quality_no2.csv', 
    index_col=0, 
    parse_dates=True
)

In [None]:
dft.head()
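
The combination of `index_col=0` and `parse_dates=True` turns the first column into a `DatetimeIndex`, which is what makes the time axis work in the plots. A small sketch with an in-memory CSV follows. The column names `station_a` and `station_b` and all values are made up and do not come from `assets/air_quality_no2.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content with a timestamp in the first column.
csv_text = """datetime,station_a,station_b
2019-05-07 01:00:00,20.0,30.0
2019-05-07 02:00:00,22.0,
2019-05-08 01:00:00,24.0,34.0
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(toy.index))        # the index holds timestamps, not strings
print(toy.loc["2019-05-07"])  # select all rows of one day by its date string

# Average the measurements per calendar day; missing values are skipped.
daily = toy.resample("D").mean()
print(daily)
```

Without `parse_dates=True` the index would stay as plain strings, and date-based selection and resampling would not be available.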

In [None]:
dft.plot();

In [None]:
dft.plot(subplots=True, layout=(1,3), figsize=(15,4));

In [None]:
dft.plot.area(figsize=(12, 4), subplots=True);

In [None]:
dft.plot.area(figsize=(12, 4));