# Setting Matplotlib Defaults

In [None]:
# Setup plotting
import matplotlib.pyplot as plt

# Set Matplotlib defaults
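# Matplotlib 3.6 renamed the seaborn styles to "seaborn-v0_8-*"; use whichever name is available
style_name = "seaborn-v0_8-whitegrid"
plt.style.use(style_name if style_name in plt.style.available else "seaborn-whitegrid")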
plt.rc("figure", autolayout=True, figsize=(11, 4))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False, 
)
%config InlineBackend.figure_format = 'retina'
# annotations: https://stackoverflow.com/a/49238256/5769929
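
The `plot_params` dictionary defined above is meant to be unpacked into a pandas plotting call. The sketch below shows one way it might be used; `toy_series` is a hypothetical series invented for illustration, not data from this notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same keyword arguments as in the setup cell above
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)

# Hypothetical toy series, used only for illustration
toy_series = pd.Series(
    np.sin(np.linspace(0, 6, 40)),
    index=pd.date_range("2020-01-01", periods=40, freq="D"),
    name="toy",
)

# Unpack the dictionary into the pandas plotting call
ax = toy_series.plot(**plot_params)
ax.set_title("Toy series drawn with plot_params")
```

Keeping the styling in one dictionary means every time plot in the notebook can share the same grey line and dark markers.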

We'll look at the linear model, our simplest neural network. Having only a single weight and a bias, it's easier to see what effect a change of parameter has.

The next cell will generate an animation like the one in the tutorial. Change the values for `learning_rate`, `batch_size`, and `num_examples` (how many data points) and then run the cell. (It may take a moment or two.) Try the following combinations, or try some of your own:

In [None]:
# YOUR CODE HERE: Experiment with different values for the learning rate, batch size, and number of examples
from learntools.deep_learning_intro.dltools import animate_sgd

learning_rate = .1
batch_size = 128
num_examples = 256

animate_sgd(
    learning_rate=learning_rate,
    batch_size=batch_size,
    num_examples=num_examples,
    # You can also change these, if you like
    steps=50, # total training steps (batches seen)
    true_w=3.0, # the slope of the data
    true_b=2.0, # the bias of the data
)
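
For intuition about what `learning_rate`, `batch_size`, and `num_examples` control, here is a minimal NumPy sketch of minibatch SGD on a one-weight linear model. It is not the implementation behind `animate_sgd`; the function name `sgd_linear`, the noise level, and the zero initialization are assumptions made for illustration.

```python
import numpy as np

def sgd_linear(learning_rate=0.1, batch_size=128, num_examples=256,
               steps=50, true_w=3.0, true_b=2.0, seed=0):
    """Fit y = w*x + b with minibatch SGD on synthetic data (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=num_examples)
    y = true_w * x + true_b + 0.1 * rng.normal(size=num_examples)

    w, b = 0.0, 0.0
    for _ in range(steps):
        # Sample one minibatch without replacement
        idx = rng.choice(num_examples, size=min(batch_size, num_examples), replace=False)
        err = (w * x[idx] + b) - y[idx]
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2.0 * np.mean(err * x[idx])
        grad_b = 2.0 * np.mean(err)
        # Step against the gradient, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = sgd_linear()
print(f"w = {w:.2f}, b = {b:.2f}")
```

A smaller batch makes each gradient estimate noisier, while a larger learning rate takes bigger steps per batch; varying both in this sketch gives a rough feel for what the animation shows.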

# Pandas Plot
## Plot 2 columns on single graph

In [None]:
ax = flu_trends.plot(
    y=["FluCough", "FluVisits"],
    secondary_y="FluCough",
)
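
The cell above relies on a `flu_trends` DataFrame that is not defined in this notebook. The sketch below reproduces the same call on made-up data so the effect of `secondary_y` can be seen in isolation; `toy_flu` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `flu_trends`: two weekly series on very different scales
rng = np.random.default_rng(0)
weeks = pd.date_range("2015-01-04", periods=52, freq="W")
season = np.sin(np.linspace(0, 2 * np.pi, 52))
toy_flu = pd.DataFrame(
    {
        "FluCough": 50 + 30 * season + rng.normal(0, 3, 52),
        "FluVisits": 1000 + 800 * season + rng.normal(0, 50, 52),
    },
    index=weeks,
)

# "FluCough" is drawn against a second y-axis on the right-hand side
ax = toy_flu.plot(y=["FluCough", "FluVisits"], secondary_y="FluCough")
ax.set_ylabel("FluVisits")
ax.right_ax.set_ylabel("FluCough")
```

The returned `ax` is the left-hand axes; pandas attaches the right-hand one as `ax.right_ax`, which is how each side gets its own label.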

# Pandas display df in tabs

In [None]:
import ipywidgets as widgets

datasets = load_multistep_data()

data_tabs = widgets.Tab([widgets.Output() for _ in datasets])
for i, df in enumerate(datasets):
    data_tabs.set_title(i, f'Dataset {i+1}')
    with data_tabs.children[i]:
        display(df)

display(data_tabs)

# Seaborn

<img src="https://imgur.com/LPWH19I.png" height="500" width="1000" usemap="#plottingmap" />
<map name="plottingmap">
  <area shape="rect" coords="262,342,402,476" href="https://www.kaggle.com/alexisbcook/hello-seaborn" title="EXAMPLE: sns.lineplot(data=my_data)">
  <area shape="rect" coords="8,75,154,200" href="https://www.kaggle.com/alexisbcook/bar-charts-and-heatmaps" title="EXAMPLE: sns.swarmplot(x=my_data['Column 1'], y=my_data['Column 2'])">
  <area shape="rect" coords="8,200,154,350" href="https://www.kaggle.com/alexisbcook/bar-charts-and-heatmaps" title="EXAMPLE: sns.regplot(x=my_data['Column 1'], y=my_data['Column 2'])">
  <area shape="rect" coords="8,350,154,500" href="https://www.kaggle.com/alexisbcook/bar-charts-and-heatmaps" title='EXAMPLE: sns.lmplot(x="Column 1", y="Column 2", hue="Column 3", data=my_data)'>
  <area shape="rect" coords="229,10,393,160" href="https://www.kaggle.com/alexisbcook/bar-charts-and-heatmaps" title="EXAMPLE: sns.scatterplot(x=my_data['Column 1'], y=my_data['Column 2'], hue=my_data['Column 3'])">
  <area shape="rect" coords="397,10,566,160" href="https://www.kaggle.com/alexisbcook/line-charts" title="EXAMPLE: sns.heatmap(data=my_data)">
  <area shape="rect" coords="565,10,711,160" href="https://www.kaggle.com/alexisbcook/line-charts" title="EXAMPLE: sns.barplot(x=my_data.index, y=my_data['Column'])">
  <area shape="rect" coords="780,55,940,210" href="https://www.kaggle.com/alexisbcook/scatter-plots" title="EXAMPLE: sns.jointplot(x=my_data['Column 1'], y=my_data['Column 2'], kind='kde')">
  <area shape="rect" coords="780,210,940,350" href="https://www.kaggle.com/alexisbcook/scatter-plots" title="EXAMPLE: sns.kdeplot(data=my_data['Column'], fill=True)">
  <area shape="rect" coords="780,360,1000,500" href="https://www.kaggle.com/alexisbcook/scatter-plots" title="EXAMPLE: sns.histplot(data=my_data['Column'])">
</map>


Since it's not always easy to decide how to best tell the story behind your data, we've broken the chart types into three broad categories to help with this.
- **Trends** - A trend is defined as a pattern of change.
    - `sns.lineplot` - **Line charts** are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group.
- **Relationship** - There are many different chart types that you can use to understand relationships between variables in your data.
    - `sns.barplot` - **Bar charts** are useful for comparing quantities corresponding to different groups.
    - `sns.heatmap` - **Heatmaps** can be used to find color-coded patterns in tables of numbers.
    - `sns.scatterplot` - **Scatter plots** show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third [categorical variable](https://en.wikipedia.org/wiki/Categorical_variable).
    - `sns.regplot` - Including a **regression line** in the scatter plot makes it easier to see any linear relationship between two variables.
    - `sns.lmplot` - This command is useful for drawing multiple regression lines, if the scatter plot contains multiple, color-coded groups.
    - `sns.swarmplot` - **Categorical scatter plots** show the relationship between a continuous variable and a categorical variable.
- **Distribution** - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.
    - `sns.histplot` - **Histograms** show the distribution of a single numerical variable.
    - `sns.kdeplot` - **KDE plots** (or **2D KDE plots**) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
    - `sns.jointplot` - This command is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.

In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Set up code checking
import os
if not os.path.exists("../input/fifa.csv"):
    os.symlink("../input/data-for-datavis/fifa.csv", "../input/fifa.csv")  

## Lineplot

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,6))

# Add title
plt.title("Daily Global Streams of Popular Songs in 2017-2018")

# Line chart showing daily global streams of 'Shape of You'
sns.lineplot(data=spotify_data['Shape of You'], label="Shape of You")

# Line chart showing daily global streams of 'Despacito'
sns.lineplot(data=spotify_data['Despacito'], label="Despacito")

# Add label for horizontal axis
plt.xlabel("Date")

As you can see above, each `sns.lineplot` call is relatively short and has two main components:
- `sns.lineplot` tells the notebook that we want to create a line chart. 
    - _Every command that you learn about in this course will start with `sns`, which indicates that the command comes from the [seaborn](https://seaborn.pydata.org/) package. For instance, we use `sns.lineplot` to make line charts.  Soon, you'll learn that we use `sns.barplot` and `sns.heatmap` to make bar charts and heatmaps, respectively._
- `data=spotify_data['Shape of You']` selects the data that will be used to create the chart (here, a single column, with `label=` naming it in the legend). Passing the whole DataFrame with `data=spotify_data` draws one line per column instead.

## Bar chart

The commands for customizing the text (title and vertical axis label) and size of the figure are familiar from the previous tutorial.  The code that creates the bar chart is new:

```python
# Bar chart showing average arrival delay for Spirit Airlines flights by month
sns.barplot(x=flight_data.index, y=flight_data['NK'])
```
It has three main components:
- `sns.barplot` - This tells the notebook that we want to create a bar chart.
    - _Remember that `sns` refers to the [seaborn](https://seaborn.pydata.org/) package, and all of the commands that you use to create charts in this course will start with this prefix._
- `x=flight_data.index` - This determines what to use on the horizontal axis.  In this case, we have selected the column that **_index_**es the rows (in this case, the column containing the months).
- `y=flight_data['NK']` - This sets the column in the data that will be used to determine the height of each bar.  In this case, we select the `'NK'` column.

> **Important Note**: You must select the indexing column with `flight_data.index`, and it is not possible to use `flight_data['Month']` (_which will return an error_).  This is because when we loaded the dataset, the `"Month"` column was used to index the rows.  **We always have to use this special notation to select the indexing column.**

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(10,6))

# Add title
plt.title("Average Arrival Delay for Spirit Airlines Flights, by Month")

# Bar chart showing average arrival delay for Spirit Airlines flights by month
sns.barplot(x=flight_data.index, y=flight_data['NK'])
# Horizontal variant (different dataset, run in its own cell):
# sns.barplot(x=ign_data['Racing'], y=ign_data.index)

# Add label for vertical axis
plt.ylabel("Arrival delay (in minutes)")
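
The note about the indexing column can be checked on a small frame. In this sketch, `toy_flights` and its values are hypothetical; the only assumption carried over from the text is that `"Month"` was used as the index when the data was loaded.

```python
import pandas as pd

# Hypothetical stand-in for `flight_data`, with "Month" used as the index
toy_flights = pd.DataFrame(
    {"Month": [1, 2, 3], "NK": [11.4, 5.2, 7.9]}
).set_index("Month")

print(toy_flights.index.tolist())      # the months live in the index
print("Month" in toy_flights.columns)  # "Month" is no longer a regular column

try:
    toy_flights["Month"]
except KeyError as exc:
    print("KeyError:", exc)

# reset_index() turns the index back into an ordinary column
as_column = toy_flights.reset_index()
print(as_column["Month"].tolist())
```

If selecting by column name is preferred, calling `reset_index()` first is one option; otherwise `.index` is the way to reach the months.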

## Heatmap

The relevant code to create the heatmap is as follows:
```python
# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True)
```
This code has three main components:
- `sns.heatmap` - This tells the notebook that we want to create a heatmap.
- `data=flight_data` - This tells the notebook to use all of the entries in `flight_data` to create the heatmap.
- `annot=True` - This ensures that the values for each cell appear on the chart.  (_Leaving this out removes the numbers from each of the cells!_)

_What patterns can you detect in the table?  For instance, if you look closely, the months toward the end of the year (especially months 9-11) appear relatively dark for all airlines.  This suggests that airlines are better (on average) at keeping schedule during these months!_ 

In [None]:
# Set the width and height of the figure
plt.figure(figsize=(14,7))

# Add title
plt.title("Average Arrival Delay for Each Airline, by Month")

# Heatmap showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True)

# Add label for horizontal axis
plt.xlabel("Airline")

## Hist plot

We can create three different histograms (one for each species) of petal length by using the `sns.histplot` command (_as above_).  
- `data=` provides the name of the variable that we used to read in the data
- `x=` sets the name of the column with the data we want to plot
- `hue=` sets the column we'll use to split the data into different histograms 

In [None]:
# Histograms for each species
sns.histplot(data=iris_data, x='Petal Length (cm)', hue='Species')

# Add title
plt.title("Histogram of Petal Lengths, by Species")

We can also create a KDE plot for each species by using `sns.kdeplot` (_as above_).  The functionality for `data`, `x`, and `hue` is identical to when we used `sns.histplot` above.  Additionally, we set `fill=True` to color the area below each curve (older seaborn releases called this option `shade`).

In [None]:
# KDE plots for each species
sns.kdeplot(data=iris_data, x='Petal Length (cm)', hue='Species', fill=True)

# Add title
plt.title("Distribution of Petal Lengths, by Species")

## Scatter plot

To create a simple **scatter plot**, we use the `sns.scatterplot` command and specify the values for:
- the horizontal x-axis (`x=insurance_data['bmi']`), and 
- the vertical y-axis (`y=insurance_data['charges']`).

The resulting scatter plot suggests that [body mass index](https://en.wikipedia.org/wiki/Body_mass_index) (BMI) and insurance charges are **positively correlated**: customers with higher BMI tend to pay more in insurance costs.  (_This pattern makes sense, since high BMI is typically associated with higher risk of chronic disease._)

To double-check the strength of this relationship, you might like to add a **regression line**, or the line that best fits the data.  We do this by changing the command to `sns.regplot`.

In [None]:
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])

In [None]:
sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])
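
A regression line shows the trend visually; the same relationship can also be summarized numerically. The sketch below computes a Pearson correlation and an ordinary least-squares line with NumPy, which is the kind of straight-line fit `sns.regplot` draws by default. The `bmi` and `charges` arrays here are synthetic, generated with an assumed slope of 400, and are not the insurance dataset.

```python
import numpy as np

# Hypothetical data with a known linear trend plus noise
rng = np.random.default_rng(0)
bmi = rng.uniform(18, 45, size=500)
charges = 400.0 * bmi + rng.normal(0, 6000, size=500)

# Pearson correlation: sign and strength of the linear relationship
r = np.corrcoef(bmi, charges)[0, 1]

# Ordinary least-squares line (degree-1 polynomial fit)
slope, intercept = np.polyfit(bmi, charges, deg=1)

print(f"r = {r:.2f}, slope = {slope:.1f}, intercept = {intercept:.1f}")
```

A positive `r` and a positive slope both indicate that charges rise with BMI; how close `r` is to 1 indicates how tightly the points follow the line.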

**Color-coded scatter plots**

We can use scatter plots to display the relationships between (_not two, but..._) three variables!  One way of doing this is by color-coding the points.  

For instance, to understand how smoking affects the relationship between BMI and insurance costs, we can color-code the points by `'smoker'`, and plot the other two columns (`'bmi'`, `'charges'`) on the axes.

In [None]:
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])

This scatter plot shows that while nonsmokers tend to pay slightly more with increasing BMI, smokers pay MUCH more.

To further emphasize this fact, we can use the `sns.lmplot` command to add two regression lines, corresponding to smokers and nonsmokers.  (_You'll notice that the regression line for smokers has a much steeper slope, relative to the line for nonsmokers!_)

In [None]:
sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)

The `sns.lmplot` command above works slightly differently than the commands you have learned about so far:
- Instead of setting `x=insurance_data['bmi']` to select the `'bmi'` column in `insurance_data`, we set `x="bmi"` to specify the name of the column only.  
- Similarly, `y="charges"` and `hue="smoker"` also contain the names of columns.  
- We specify the dataset with `data=insurance_data`.
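
The difference in calling style reflects that `sns.lmplot` is a figure-level function: it creates its own figure and returns a `FacetGrid`, whereas `sns.regplot` draws on an existing axes. A small sketch on hypothetical data (`toy_insurance`, with made-up slopes for the two groups) makes the contrast concrete.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data shaped like `insurance_data`
rng = np.random.default_rng(0)
n = 200
smoker = rng.choice(["yes", "no"], size=n)
bmi = rng.uniform(18, 45, size=n)
slope = np.where(smoker == "yes", 1400.0, 100.0)  # steeper line for smokers
charges = 5000 + slope * bmi + rng.normal(0, 3000, size=n)
toy_insurance = pd.DataFrame({"bmi": bmi, "charges": charges, "smoker": smoker})

# Axes-level function: accepts the columns directly and draws on the current axes
ax = sns.regplot(x=toy_insurance["bmi"], y=toy_insurance["charges"])

# Figure-level function: takes column names plus data=, and creates its own figure
grid = sns.lmplot(x="bmi", y="charges", hue="smoker", data=toy_insurance)
print(type(grid).__name__)  # FacetGrid
```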

Finally, there's one more plot that you'll learn about, that might look slightly different from how you're used to seeing scatter plots.  Usually, we use scatter plots to highlight the relationship between two continuous variables (like `"bmi"` and `"charges"`).  However, we can adapt the design of the scatter plot to feature a categorical variable (like `"smoker"`) on one of the main axes.  We'll refer to this plot type as a **categorical scatter plot**, and we build it with the `sns.swarmplot` command.

In [None]:
sns.swarmplot(x=insurance_data['smoker'],
              y=insurance_data['charges'])

### Regression Plot Seaborn

In [None]:
fig, ax = plt.subplots()
sns.regplot(x='Lag_1',
            y='Hardcover',
            data=df,
            ci=None,   # ci=None turns off the confidence band
            scatter_kws=dict(color='0.25'),
            ax=ax,     # draw on the axes created above
            )
ax.set_aspect('equal')  # square aspect ratio of plot
ax.set_title('Lag Plot of Hardcover Sales');

In [None]:
fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=book_sales, color='0.75')
sns.regplot(x='Time',
            y='Hardcover',
            data=book_sales,
            ci=None,   # no confidence band
            scatter_kws=dict(color='0.25'),
            ax=ax)     # draw on the same axes as the line
ax.set_title('Time Plot of Hardcover Sales');
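
The two regression cells above use `'Time'` and `'Lag_1'` columns that are not created anywhere in this notebook. One common way to build them, shown here as an assumption rather than the original preprocessing, is a running counter for the time dummy and `Series.shift` for the lag. `toy_sales` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `book_sales`
toy_sales = pd.DataFrame(
    {"Hardcover": [139, 128, 172, 139, 191]},
    index=pd.date_range("2000-04-01", periods=5, freq="D"),
)

# Time dummy: 0, 1, 2, ... counts steps since the start of the series
toy_sales["Time"] = np.arange(len(toy_sales))

# Lag feature: the previous observation, shifted down by one step
toy_sales["Lag_1"] = toy_sales["Hardcover"].shift(1)

# The first row has no previous value, so drop it before fitting or plotting
lagged = toy_sales.dropna(subset=["Lag_1"])
print(lagged)
```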

## KDE Plots

To make a KDE plot, we use the `sns.kdeplot` command.  Setting `fill=True` colors the area below the curve (_and `data=` chooses the column we would like to plot_).  Older seaborn releases called this option `shade`.

In [None]:
# KDE plot 
sns.kdeplot(data=iris_data['Petal Length (cm)'], fill=True)

### 2D KDE plots

We're not restricted to a single column when creating a KDE plot.  We can create a **two-dimensional (2D) KDE plot** with the `sns.jointplot` command.

In the plot below, the color-coding shows us how likely we are to see different combinations of sepal width and petal length, where darker parts of the figure are more likely. 

Note that in addition to the 2D KDE plot in the center,
- the curve at the top of the figure is a KDE plot for the data on the x-axis (in this case, `iris_data['Petal Length (cm)']`), and
- the curve on the right of the figure is a KDE plot for the data on the y-axis (in this case, `iris_data['Sepal Width (cm)']`).

In [None]:
# 2D KDE plot
sns.jointplot(x=iris_data['Petal Length (cm)'], y=iris_data['Sepal Width (cm)'], kind="kde")

## Styling

Seaborn has five different themes: (1) `"darkgrid"`, (2) `"whitegrid"`, (3) `"dark"`, (4) `"white"`, and (5) `"ticks"`. To switch themes, you need only use a command similar to the one in the code cell below, with the chosen theme filled in.

In [None]:
# Change the style of the figure to the "dark" theme
sns.set_style("dark")

# Line chart 
plt.figure(figsize=(12,6))
sns.lineplot(data=spotify_data)
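
To compare the five themes without changing the global style each time, `sns.axes_style` can be used as a context manager so that a theme applies only to the axes created inside the block. This is a sketch with a trivial placeholder line, not data from the notebook.

```python
import matplotlib.pyplot as plt
import seaborn as sns

themes = ["darkgrid", "whitegrid", "dark", "white", "ticks"]
fig = plt.figure(figsize=(15, 3))

for i, theme in enumerate(themes, start=1):
    # The theme is applied only to axes created inside the with-block
    with sns.axes_style(theme):
        ax = fig.add_subplot(1, len(themes), i)
    ax.plot([0, 1, 2], [0, 1, 0])
    ax.set_title(theme)

# The "grid" themes switch the axes grid on; the others leave it off
print({theme: sns.axes_style(theme)["axes.grid"] for theme in themes})
```

`sns.set_style` changes the theme for every later figure, so the context-manager form is handy when only one chart should look different.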