# Seaborn Unit 01 - Introduction

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Seaborn lesson is made up of 5 units.**
* By the end of this lesson, you should be able to:
  * Load Seaborn datasets for exploring its multiple plot types
  * Combine Matplotlib and Seaborn capabilities
  * Manage Seaborn Plot Style
  * Create distinct plot types using Seaborn

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Get familiar with Seaborn datasets
* Combine Matplotlib and Seaborn capabilities
* Understand Axes and Figure level functions



---

Seaborn is a Python library for making statistical graphics, built on top of Matplotlib.

* Seaborn helps with two common pain points in Matplotlib:
  * The default Matplotlib styling parameters
  * Working with Pandas DataFrames

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png"> **Why do we study Seaborn?**
  * Seaborn has very important qualities:
    * It offers built-in themes for styling matplotlib graphics
    * It visualises univariate and bivariate data
    * It fits and visualises linear regression models
    * It works well with NumPy and Pandas data structures
    * Its syntax is simple. It offers shorter and more intuitive syntax to create plots.



## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your comments** in the cells. It can help you to consolidate your learning. 

* Parameters in a given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values; some are optional. We will cover the most common parameters used in Data Science for a particular function/method. 
  * However, you may seek additional information in the respective package documentation, where you will find instructions on how to use a given function/method. The studied packages are open source, so this documentation is public.
  * **For Seaborn the link is [here](https://seaborn.pydata.org/)**.

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

By convention, `Seaborn` is imported with the alias `sns`

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  Seaborn Introduction

**Matplotlib** has a wide range of plots, but it can be complex to plot non-basic plots or adjust the plots to look nice.
  * **Seaborn** provides a higher-level interface for Matplotlib plots, with less code and typically with a nicer design.

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Seaborn Datasets oriented API 

Seaborn offers built-in datasets for exploring its plotting capabilities.
* You can get the dataset names using `sns.get_dataset_names()`

sns.get_dataset_names()

Pass the dataset name to `sns.load_dataset()` as an argument and assign the result to a DataFrame variable

df = sns.load_dataset('tips')
df = df.head(50)
df.head(3)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> 


**PRACTICE**: Load other datasets from Seaborn so you can get used to this process

df = sns.load_dataset(..............)
df = df.head(50)
df.head()

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Combine Matplotlib and Seaborn

**Seaborn is built on top of Matplotlib**; therefore, they share many aspects. You can combine functions from each library when plotting
* Consider the following dataset
  * It has records for three different species of penguins collected from three islands in the Palmer Archipelago, Antarctica


df = sns.load_dataset('penguins').sample(50, random_state=1)
df.head(3)

You can initialise a **Figure with one Axes** in Matplotlib and draw a Seaborn plot on it. 
  * Then, you can write the title for this Seaborn plot using Matplotlib notation
  * We are not focused on the Seaborn code itself. The idea is to present an example of how both are used together



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The main takeaway is that you **can use the range of functions learned in Matplotlib and Seaborn!**

fig, axes = plt.subplots(figsize=(8,8))

sns.scatterplot(data=df, x='bill_depth_mm', y='bill_length_mm', hue='species')   # Seaborn code to draw a scatter plot
plt.title("Seaborn Plot!!!")
plt.xlabel('X-Axis: bill_depth_mm ')
plt.legend(loc='upper left', title='Legend', frameon=False)
plt.show()

Now you are interested in initialising a **Figure with three Axes** in Matplotlib and drawing a Seaborn plot on each. Then, you can write the titles for these Seaborn plots using Matplotlib notation
  *  You will notice the parameter ``ax`` of the Seaborn function relates to the ``axes`` from your Matplotlib Figure.

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

sns.scatterplot(data=df,
                x='bill_depth_mm',
                y='bill_length_mm',
                hue="species",
                ax=axes[0]) # you use the Axes from Matplotlib line of code above

axes[0].set_title('Nice Title')
sns.histplot(data=df, x="flipper_length_mm", ax=axes[1])
sns.histplot(data=df, x="bill_length_mm", ax=axes[2])

plt.tight_layout()
fig.suptitle('Super title for the Figure', fontsize=16, y=1.1)
plt.show()


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Axes-level functions and Figure-level functions

Now we introduce you to **Axes-level functions** and **Figure-level functions** in Seaborn.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> 1 - An **Axes-level function** plots data on a single Matplotlib Axes
  * You can recognise one by the `ax` argument in the Seaborn function.
  * Axes-level functions include kdeplot, histplot, scatterplot, boxplot, countplot, heatmap etc.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> 2 - A **Figure-level function** can't (easily) be arranged with other Axes/plots
  * You can recognise one by the **absence** of an `ax` argument in the Seaborn function.
  * The difference is that a Figure-level function creates its own Figure and subplots behind the scenes. 
    * For example, Pairplot is a Figure-level function: it outputs a scatter plot for each pair of numerical variables and a histogram for each numerical variable, all arranged in one Figure. We will study Pairplot soon.
  * Figure-level functions include lmplot, pairplot, jointplot etc.




<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> This Seaborn documentation [link](https://seaborn.pydata.org/tutorial/function_overview.html) also gives an overview of how Axes-level and Figure-level functions work within Seaborn.
* It takes practice to get familiar with these functions. As a rule of thumb, you typically will not try to fit a Figure-level function into an existing Axes, since that can split the plot; it is better to let it have a single Figure of its own.
* The example below shows a Pairplot, so you can see in practical terms the use case where a single function outputs a set of plots.

sns.pairplot(data=df)
plt.show()

# Seaborn - Unit 02 - Managing Plot Style

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn Seaborn capabilities for styling your plots, including plot design and axis layout



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Managing Plot Style

According to Seaborn [documentation](http://seaborn.pydata.org/tutorial/aesthetics.html), there are five preset seaborn themes: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You will set the style using the `sns.set_style()` function to define the style you are interested in.
  * The function documentation is found [here](https://seaborn.pydata.org/generated/seaborn.set_style.html). Once you set the style, in a Jupyter Notebook session, the style remains until you close the session or change the style
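
If you only want a style for a single plot rather than for the rest of the session, Seaborn also provides `sns.axes_style()`, which can be used in a `with` block to apply a style temporarily. The following is a minimal sketch: instead of drawing anything, it inspects Matplotlib's `axes.grid` setting to show when the grid style is active. The variable names are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Session-wide style: stays in place until changed
sns.set_style("white")
grid_before = plt.rcParams["axes.grid"]

# Temporary style: only applies inside the with block
with sns.axes_style("whitegrid"):
    grid_inside = plt.rcParams["axes.grid"]

grid_after = plt.rcParams["axes.grid"]
print(grid_before, grid_inside, grid_after)
```

Any plot created inside the `with` block uses the temporary style; plots created afterwards fall back to the style set with `sns.set_style()`.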



<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Let's plot using the default style
  * We will use the ``tips`` dataset. It holds records for waiter tips based on the day of the week, time of day, total bill, gender, if it is a table of smokers or not, and how many people were at the table.  
  * We will do a scatter plot for tip levels and total bill levels. We will not focus on explaining the Seaborn scatter plot function; the focus of this exercise is on the style

df = sns.load_dataset('tips')
df = df.head(50)
sns.scatterplot(data=df, y='tip',x='total_bill')
plt.show()

Let's set the style to `whitegrid` and plot again
  * This style has a technical advantage: its grid helps us read which value each point represents. It makes small differences easier to perceive and helps you narrow your focus to a specific area of the plot
  * `darkgrid` also shows a grid, so it has the same technical advantage

sns.set_style("whitegrid")
df = sns.load_dataset('tips')
df = df.head(50)
sns.scatterplot(data=df, y='tip',x='total_bill')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: try out all the different styles and figure out your favourite.

# Write your code here.


---

Depending on your use case, you can adjust the axis layout using `sns.despine()`.
  * The function documentation is found [here](https://seaborn.pydata.org/generated/seaborn.despine.html)

sns.set_style("white")
sns.scatterplot(data=df, y='tip',x='total_bill')
sns.despine()

You can remove the top, right, left, or bottom spine. The respective arguments are `top`, `right`, `left`, and `bottom`; each takes `True` (remove the spine) or `False` (keep it)

sns.scatterplot(data=df, y='tip',x='total_bill')
sns.despine(left=True)
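
To check which spines a `sns.despine()` call actually removed, you can inspect the Axes afterwards. This is a small sketch on made-up data (the `toy` DataFrame is hypothetical); it passes the Axes explicitly through the `ax` argument of `despine`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"total_bill": [10.0, 15.5, 21.0, 30.2],
                    "tip": [1.5, 2.0, 3.5, 4.0]})

fig, ax = plt.subplots()
sns.scatterplot(data=toy, x="total_bill", y="tip", ax=ax)

# top and right are removed by default; left is removed on request
sns.despine(ax=ax, left=True)

visible = {side: ax.spines[side].get_visible()
           for side in ["top", "right", "left", "bottom"]}
print(visible)
plt.close(fig)
```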

---

Alternatively, you can set the colours, or the palette, of your plots with `sns.set_theme()` by setting the `palette` argument.
  * The `palette` options can be found [here](https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette) and [here](https://matplotlib.org/stable/tutorials/colors/colormaps.html)
  * The `sns.set_theme()` function documentation can be found in this [link](https://seaborn.pydata.org/generated/seaborn.set_theme.html)

sns.set_theme(palette="pastel")

df = sns.load_dataset('tips')
df = df.head(50)
sns.scatterplot(data=df, y='tip',x='total_bill',hue='tip')
plt.show()
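
To preview which colours a named palette contains before applying it, `sns.color_palette()` returns the colours themselves. A small sketch (the variable name `pastel` and the choice of three colours are just for illustration):

```python
import seaborn as sns

# Ask for the first three colours of the "pastel" palette
pastel = sns.color_palette("pastel", n_colors=3)

# Each colour is an (R, G, B) tuple with values between 0 and 1
for colour in pastel:
    print(colour)

# The same colours as hex strings
print(pastel.as_hex())
```

In a notebook, leaving `pastel` as the last expression in a cell displays the colours as swatches.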

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: Try out using the despine method and the set_theme method.

df_practice = sns.load_dataset('tips')
df_practice = df_practice.head(50)

# Write your code here.


# Seaborn - Unit 03 - Seaborn Plots: Part 01

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and deliver Histograms, Displot, KDE, boxplot and swarm plot in Seaborn



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 03 - Seaborn Plots: Part 01

We will cover in this unit how to deliver the following plots in Seaborn:
* Histograms
* Displot
* KDE
* Boxplot
* Swarmplot

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Histogram

Consider the following dataset
  * It has records for three different species of penguins collected from 3 islands in the Palmer Archipelago, Antarctica

df = sns.load_dataset('penguins')
df = df.sample(n=50, random_state=1)
df.head(3)

We will set the theme to 'whitegrid' and draw a histogram with sns.histplot(). The function documentation is found [here](https://seaborn.pydata.org/generated/seaborn.histplot.html)
* The arguments are data, where you pass your dataset, and x, where you pass the variable you are interested in drawing.

sns.set_theme(style="whitegrid")
sns.histplot(data=df, x='bill_length_mm')
plt.show()

If you want to have more than one variable in your plot, you can subset the variables and pass them to the `data` argument
* In this case, we used bracket notation to subset `['bill_length_mm','bill_depth_mm']`.

sns.set_theme(style="whitegrid")
sns.histplot(data=df[['bill_length_mm','bill_depth_mm']])
plt.show()

You can add a kde line to the histogram by adding the argument `kde=True`

sns.histplot(data=df, kde=True, x='bill_length_mm')
plt.show()

You can understand the distribution of a given category by adding the argument ``hue`` 
* In this case, we want to know bill_length_mm distribution per species
* We can note that the most frequent levels from Adelie look different from the remaining species. We notice the distribution shape of Chinstrap is different to Gentoo

sns.histplot(data=df, kde=True, x='bill_length_mm', hue='species')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the iris dataset.

df_practice = sns.load_dataset('iris')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in using a histogram to plot the distribution of sepal_length across the species.

# Write your code here.


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Displot

This is a Figure-level function. It plots histograms, with the added ability to facet them by subsets of the data. The function documentation is found [here](https://seaborn.pydata.org/generated/seaborn.displot.html)
* The arguments for the plot below are similar to the previous section: data and x
* Let's consider the same dataset from before

sns.displot(data=df, x="flipper_length_mm")
plt.show()

Now we add a ``col`` argument. It will plot on different column facets according to the variable, in this case, species

sns.displot(df, x="flipper_length_mm", col="species")
plt.show()

We can add a ``row`` argument and facet by sex

sns.displot(df, x="flipper_length_mm", col="species", row="sex")
plt.show()

In addition, we can add ``hue`` and check the distribution per island. 
* We see that Gentoo species live on Biscoe island and the Chinstrap species live on Dream island. Adelie species live on all islands. Also, note the difference in distribution between the sexes.

sns.displot(df, x="flipper_length_mm", col="species", row="sex", hue='island')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the tips dataset.

df_practice = sns.load_dataset('tips')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in using a Displot to display the distribution of the total_bill. You decide on using the col parameter with day and the row parameter with time.

# Write your code here.


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> KDE

Consider the tips dataset. It holds records for waiter tips based on the day of the week, time of day, total bill, gender, if it is a table of smokers or not, and how many people were at the table.

df = sns.load_dataset('tips')
df = df.sample(n=50, random_state=1)
df.head(3)

As we studied in the previous lesson, a kde plot represents the distribution of a numeric variable.
* It plots the data using a probability density curve; therefore, the y-axis is called density. You will be interested in checking in which range the distribution is denser and the shape of the distribution.
* Compared to a histogram, a kde can produce a smoother, often more interpretable plot. It needs no bins argument; instead it relies on a smoothing bandwidth, which Seaborn determines automatically by default.

We use `sns.kdeplot()`; the documentation is [here](https://seaborn.pydata.org/generated/seaborn.kdeplot.html). We pass the data, filtered to the variables we are interested in. We add ``fill=True``, so the area under each curve is coloured.
* As we may expect, tip and total_bill sit at different levels. Tips stay in the range of 0 to 10, while total_bill goes up to 50, with the majority seeming to be between 10 and 25.

fig,ax = plt.subplots(figsize =(8,6))
sns.kdeplot(data=df.filter(['tip','total_bill'], axis=1),fill=True)
plt.show()

This is an Axes-level function since it has ``ax`` as an argument
* We create a grid of four Axes with plt.subplots().
  * Then we draw an individual kde plot for the tip on each Axes, colouring by a given variable: sex, time, day, size
  * We added ``palette='Set1'`` to help distinguish the different category levels. Once again, we use a matplotlib [palette](https://matplotlib.org/stable/tutorials/colors/colormaps.html)

* We note more males than females: the Male curve is higher than the Female curve because, by default, each density is scaled by the number of observations in its group. The tip levels for males and females look to have a similar shape
* We note more dinners than lunches in the dataset. It looks like there are higher tips at dinner than at lunch
* We note few Sunday meals. The tip levels across days look to have similar ranges, even though they have slightly different shapes.
* We note a table of 2 is the most common. It looks like the tables with more people tend to offer more tips.

fig, axes = plt.subplots(nrows=2, ncols=2, figsize =(11,6))

sns.kdeplot(data=df, x='tip',hue='sex',ax=axes[0,0],fill=True,  palette='Set1')
sns.kdeplot(data=df, x='tip',hue='time',ax=axes[0,1],fill=True,  palette='Set1')
sns.kdeplot(data=df, x='tip',hue='day',ax=axes[1,0],fill=True,  palette='Set1')
sns.kdeplot(data=df, x='tip',hue='size',ax=axes[1,1],fill=True, palette='Set1')

plt.tight_layout()
plt.show()


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the iris dataset.




df_practice = sns.load_dataset('iris')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in creating two plots with plt.subplots().
* You draw an individual kde plot for sepal_length and sepal_width, colouring by species.

# Write your code here.


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Boxplot

Let's still consider the tips dataset

df.head()

We can create a boxplot with `sns.boxplot()`; the documentation is [here](https://seaborn.pydata.org/generated/seaborn.boxplot.html). This is an Axes-level function
* If you pass the DataFrame as the ``data`` argument, it draws one box per numerical column.
  * We know in practical terms that ``size`` reflects a categorical ordinal variable, but in this dataset, its data type is an integer. That is the reason it is shown in the boxplot. This is fine; we just need to be aware of it when interpreting the plot


sns.boxplot(data=df)
plt.show()

Another approach is to set x as the categorical variable whose levels we want to compare, and y as the numerical variable whose distribution we want to assess

* We note that the IQR (the box, which covers the middle 50% of the values) sits higher for sizes 4, 5 and 6. Size 6 has less variability, so a waiter might prefer to serve a table with more people.

sns.boxplot(data=df, x='size', y='tip')
plt.show()
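
To connect the boxes to numbers, you can compute the quartiles per group directly with Pandas. The sketch below uses made-up data (the `toy` DataFrame is hypothetical) and derives the IQR as the distance between the 75th and 25th percentiles, which is roughly what the height of each box represents.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "size": [2, 2, 2, 2, 2, 4, 4, 4, 4, 4],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 3.5, 4.0, 4.5, 5.0],
})

# 25th and 75th percentiles of tip for each table size
quartiles = toy.groupby("size")["tip"].quantile([0.25, 0.75]).unstack()

# IQR: the spread of the middle 50% of the values
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

A smaller IQR means a shorter box, that is, less variability within the group.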

We can draw a grouped boxplot by adding ``hue``, which will be categorical data
* We notice that a table of 6 typically happens at lunch, and a table of 5 happens at dinner

sns.boxplot(data=df, x="size", y="tip", hue="time")
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the iris dataset.

df_practice = sns.load_dataset('iris')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in creating a Boxplot.
* You use species to inspect the levels across.
* And sepal_length to assess the distribution

# Write your code here.


---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Swarm Plot

Swarmplot is like a scatterplot for categorical variables.
* This helps to give a better representation of the distribution of values. However, it does not scale well enough for large numbers of data points. It effectively complements a boxplot when you want to show all observations. The documentation is found [here](https://seaborn.pydata.org/generated/seaborn.swarmplot.html)

* The arguments are ``data``, ``x`` and ``y``: you can interpret them similarly to the boxplot.
* You will notice that tables with sizes 1, 5 and 6 are not typical. So do not expect to serve many tables of 5 or 6.

sns.swarmplot(data=df, x='size', y='tip')
plt.show()

We can draw a grouped swarmplot by adding the parameter hue, which will be categorical data. We add ``dodge=True``, so the dots will not be cluttered

sns.swarmplot(data=df, x='size', y='tip', hue='time', dodge=True)
plt.show()

You can combine a boxplot and a swarmplot in your analysis. The benefit you take from a boxplot is to assess the distribution levels. The benefit you get from a swarmplot is to understand how frequent the levels are.
* In this example, we create a figure with plt.subplots() with 1 row and 2 columns, then place a boxplot and swarmplot side by side

x, y, hue = 'size', 'tip', 'time'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,6))

sns.boxplot(data=df, x=x, y=y, hue=hue, ax=axes[0])
sns.swarmplot(data=df, x=x, y=y, hue=hue, dodge=True, ax=axes[1])
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the penguins dataset.

df_practice = sns.load_dataset('penguins')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in creating two plots with plt.subplots().

* You draw a boxplot and a swarmplot using island and bill_length_mm and colouring by species.


# Write your code here.


---

# Seaborn - Unit 04 - Seaborn Plots: Part 02

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn and deliver jointplot, LM plot, Scatter plot and Pair plot in Seaborn



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Unit 04 - Seaborn Plots: Part 02


We will cover in this unit the following plots:
* Jointplot
* LM plot
* Scatter plot
* Pair plot

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Jointplot

Consider the tips dataset. It holds records for waiter tips based on the day of the week, time of day, total bill, gender, if it is a table of smokers or not, and how many people were at the table.

df = sns.load_dataset('tips')
df = df.sample(n=50, random_state=1)
df.head(3)

You are interested in drawing a plot of two variables with bivariate and univariate graphs at the same time. You can use ``sns.jointplot()``; the documentation is [here](https://seaborn.pydata.org/generated/seaborn.jointplot.html)
* The result is a scatter plot and two histograms, so you can visualise how the variables correlate and are distributed
* The arguments are ``data``, plus ``x`` and ``y``, the numerical variables you are interested in studying
* You will notice a general trend where the tips tend to increase when the total bill increases

sns.set_theme(style="whitegrid")
sns.jointplot(data=df, y='tip', x='total_bill')
plt.show()

You can include an additional dimension with ``hue`` to set a variable that maps the colour of the data points and the distribution

sns.jointplot(data=df, y='tip', x='total_bill', hue='smoker')
plt.show()

There is a parameter called `kind`. When `kind='hex'`, it creates the "scatter plot" using hexagonal bins. The darker a hexagonal bin, the more (x, y) pairs fall into it. It is helpful to assess in which ranges the pairs x and y are more frequent
* The plot says bills from 10 to almost 20 with tips from 2 to 3 are more common in this restaurant

sns.jointplot(data=df, x='total_bill', y='tip', kind='hex')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the 'penguins' dataset.

df_practice = sns.load_dataset('penguins')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in using a jointplot. You decide on using body_mass_g and flipper_length_mm, colouring by sex.

# write your code here

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> LM Plot

Consider the following dataset
  * It has records for three different species of penguins collected from three islands in the Palmer Archipelago, Antarctica

df = sns.load_dataset('penguins')
df = df.sample(n=50, random_state=1)
df.head(3)

You are interested in plotting a Scatter plot and fitting and visualising a Regression model of your data. You will not assess the fitted model itself; the idea is to visualise how a linear model would fit.
* You can use `sns.lmplot()`; the documentation is [here](https://seaborn.pydata.org/generated/seaborn.lmplot.html). This is a Figure level plot.
* The arguments are ``data``, plus ``x`` and ``y``, the variables you are interested in. 

* You will notice a positive and linear trend between the variables

sns.lmplot(data=df, x="body_mass_g", y="bill_length_mm" )
plt.show()

`ci` is the size of the confidence interval for the regression estimate (95 by default). It is drawn as a translucent band around the regression line. You may set it to `None` to hide the band. 
* In the plot above, you will notice the band is visible mainly at the ends of the line, since the data points are, in general, tightly clustered around it

sns.lmplot(data=df, x="body_mass_g", y="bill_length_mm", ci=None )
plt.show()

You can facet your data with the `col` and `row` arguments. That will create sub-plots and facet the graphs according to the new variable
* In this case, both the Male and Female subsets show linear behaviour between the studied variables

sns.lmplot(data=df, x="body_mass_g", y="flipper_length_mm", col="sex", ci=None)
plt.show()

You can also facet by row by adding the ``row`` parameter
* Note that in the subset of Torgersen island and Female, the trend between the studied variables changes

sns.lmplot(data=df, x="body_mass_g", y="flipper_length_mm", col="sex",row='island')
plt.show()

You can fit several linear regressions in the same plot. Add the ``hue`` parameter to split your data into subsets; a separate linear model is then fitted and drawn for each subset of the studied variables.

sns.lmplot(data=df,  x="body_mass_g", y="flipper_length_mm", hue="species")
plt.show()
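
To confirm that `hue` produces one fitted line per subset, you can count the lines drawn on the Axes. The sketch below uses made-up data with two hypothetical species labels, `A` and `B`, and sets `ci=None` so that only the regression lines are drawn.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "body_mass_g": [3000, 3400, 3800, 4200, 4500, 5000, 5500, 6000],
    "flipper_length_mm": [180, 186, 191, 197, 208, 214, 221, 229],
    "species": ["A"] * 4 + ["B"] * 4,
})

g = sns.lmplot(data=toy, x="body_mass_g", y="flipper_length_mm",
               hue="species", ci=None)

# With no col/row faceting there is a single Axes in the grid
ax = g.axes.flat[0]
n_lines = len(ax.lines)
print(n_lines)
plt.close("all")
```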

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the tips dataset.

df_practice = sns.load_dataset('tips')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas or use the following suggestion.

You are interested in using an lmplot. You decide on using tip and total_bill, colouring by sex.

# write your code here

---

You can also fit models that are not a straight line. That is useful when you notice the relationship between the variables is not linear.
* We use a toy dataset from Seaborn to illustrate this.

df = sns.load_dataset("anscombe")
df.head(3)

We subset the dataset with `.query()` and pass `x` and `y`
* We notice the fitted line is not a good representation of the data

sns.lmplot(data=df.query("dataset == 'II'"), x="x", y="y")
plt.show()

We can use the `order` argument, which sets the order of the polynomial regression. When `order=1`, which is the default value, it is a linear regression. When it is greater than 1, Seaborn uses `numpy.polyfit` to estimate a polynomial regression
* We notice the fitted line is now a better representation of the data. That gives you insight into which type of algorithm could potentially fit your data

sns.lmplot(x="x", y="y", data=df.query("dataset == 'II'"), order=2)
plt.show()
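
You can also compare the two fits numerically. The sketch below hard-codes dataset II of Anscombe's quartet so it runs offline, fits a polynomial of order 1 and of order 2 with `numpy.polyfit`, and compares the sum of squared residuals. The helper `sum_squared_residuals` is a name made up for this example.

```python
import numpy as np

# Anscombe's quartet, dataset II, hard-coded so the sketch runs offline
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def sum_squared_residuals(order):
    """Fit a polynomial of the given order and return its total squared error."""
    coefficients = np.polyfit(x, y, deg=order)
    predictions = np.polyval(coefficients, x)
    return float(np.sum((y - predictions) ** 2))

sse_linear = sum_squared_residuals(1)
sse_quadratic = sum_squared_residuals(2)

print(f"order=1: {sse_linear:.3f}")
print(f"order=2: {sse_quadratic:.6f}")
```

The order 2 error is close to zero because this dataset follows a quadratic curve almost exactly, while the straight line leaves a clearly larger error.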

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Scatter Plot

We use the 'penguins' dataset. It has records for three different species of penguins collected from three islands in the Palmer Archipelago, Antarctica

df = sns.load_dataset('penguins')
df = df.sample(n=50, random_state=1)
df.head(3)

We use `sns.scatterplot()`; the documentation is [here](https://seaborn.pydata.org/generated/seaborn.scatterplot.html). This is an Axes level function.
* The main arguments are ``data`` (the DataFrame) and ``x`` and ``y`` (the variables you are interested in studying)

fig, axes = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df, x='body_mass_g', y='bill_length_mm')
plt.show()

Notice you can also pass a categorical variable to `x` or `y`. It may not be the ideal way to plot the data, but the function accepts it.
* As an alternative, you could consider a box plot or swarm plot to study 'body_mass_g' and 'island'

fig, axes = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df, x='body_mass_g', y='island')
plt.show()

You can add the parameter `hue` to colour the data points by the levels of a given category

fig, axes = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df, x='body_mass_g', y='flipper_length_mm', hue='species')
plt.show()

``hue`` also works with a numerical variable. It creates a colour range based on the ``hue`` variable and colours the data points accordingly.
* In this case, we selected `bill_length_mm` for `hue`
* You may notice a pattern: `bill_length_mm` tends to be low where `body_mass_g` and `flipper_length_mm` are low

fig, axes = plt.subplots(figsize=(12,6))
sns.scatterplot(data=df, x='body_mass_g', y='flipper_length_mm', hue='bill_length_mm')
plt.show()

You can add the parameter `size`. It maps a variable to the size of the points. It is generally not recommended, since encoding both colour and size in the same plot can make it hard to read.

fig, axes = plt.subplots(figsize=(14,8))
sns.scatterplot(data=df, x='body_mass_g', y='flipper_length_mm',
                hue='species',
                size='bill_length_mm')
plt.show()

The `sizes` argument determines how sizes are chosen when `size` is used. It can be a tuple specifying the minimum and maximum size to use, so the values are scaled within this range.
 * We arbitrarily selected 40 and 200 as the min and max dot sizes. It is a matter of trial and error to find what is suitable for your plot
 * We also added the parameter `alpha` to give a transparency effect when data points overlap. `alpha` takes values from 0.0 to 1.0

fig, axes = plt.subplots(figsize=(14,8))
sns.scatterplot(data=df, x='body_mass_g', y='flipper_length_mm',
                hue='species',
                size='island', sizes=(40, 200),
                alpha=0.8)
plt.show()
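
For a numeric `size` variable, scaling within a `(min, max)` range can be pictured as a linear rescaling. The helper `rescale_to_range` below is hypothetical and written only for illustration; it is not Seaborn's internal code, and the input values are made up.

```python
import numpy as np

def rescale_to_range(values, min_size, max_size):
    """Linearly map values so the smallest gets min_size and the largest max_size."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: fall back to the midpoint of the range
        return np.full(values.shape, (min_size + max_size) / 2)
    unit = (values - values.min()) / span   # now between 0 and 1
    return min_size + unit * (max_size - min_size)

bill_lengths = [35.0, 40.0, 45.0, 55.0]   # made-up example values in mm
marker_sizes = rescale_to_range(bill_lengths, 40, 200)
print(marker_sizes)
```

The smallest value gets the minimum dot size and the largest gets the maximum, with the rest spread proportionally in between.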

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the mpg dataset. It holds records for different car models and their attributes, like miles per gallon, model year, weight etc

df_practice = sns.load_dataset('mpg')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas using a scatter plot on this dataset

# write your code here

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Pair Plot

A pair plot shows pairwise relationships in a dataset. It creates a grid of plots in which each numeric variable is plotted against every other numeric variable in a scatter plot. The diagonal plots are univariate distribution plots of each numerical variable
* Consider the iris dataset. It contains records of three species or classes of iris plants, with petal and sepal measurements

df = sns.load_dataset("iris")
df = df.sample(n=50, random_state=1)
df.head(3)

You are interested in evaluating how your numerical variables relate to each other. You can use ``sns.pairplot()``; the documentation is [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html). It is a Figure level plot.
* The arguments are ``data`` and ``plot_kws={'alpha': 0.8}``, which sets the transparency of the data points since they tend to overlap in this plot. By default, it plots only the numerical variables

sns.pairplot(data=df, plot_kws={'alpha':0.8})
plt.show()

You will notice the upper part of the plot mirrors the lower part, so we added a piece of code to hide it. The lower part is typically enough to analyze the data
* Notice we stored the result in a variable called `fig`. `sns.pairplot()` returns a `PairGrid` object, and its `.axes` attribute holds the grid of sub-plots
* This additional piece of code loops over the sub-plots in the upper triangle and sets their visibility to false

fig = sns.pairplot(data=df, plot_kws={'alpha':0.8})
for i, j in zip(*np.triu_indices_from(fig.axes, 1)):  # the loop hides the upper triangle part
    fig.axes[i, j].set_visible(False)
plt.show()
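
If the loop looks cryptic, the sketch below shows what `np.triu_indices_from()` returns on its own. It uses an empty 4x4 array as a stand-in for the grid of sub-plots (the iris data has four numeric columns), so no plotting is needed.

```python
import numpy as np

# Stand-in for the 4x4 grid of sub-plots built for four numeric columns
grid = np.empty((4, 4), dtype=object)

# k=1 skips the diagonal, so only cells strictly above it are returned
rows, cols = np.triu_indices_from(grid, k=1)
upper_cells = [(int(i), int(j)) for i, j in zip(rows, cols)]
print(upper_cells)
```

Each `(row, column)` pair is one sub-plot above the diagonal, which the loop then hides. Recent Seaborn versions also accept `corner=True` in `sns.pairplot()`, which draws only the lower triangle and the diagonal.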

There will be use cases where you are interested in seeing how the data changes across the levels of a categorical variable.
  * You can use a pair plot and set the parameter `hue` to your categorical variable

fig = sns.pairplot(data=df, plot_kws={'alpha':0.8}, hue="species")
for i, j in zip(*np.triu_indices_from(fig.axes, 1)):
    fig.axes[i, j].set_visible(False)
plt.show()

As a reinforcement, in Seaborn, you can use [Matplotlib palette](https://matplotlib.org/stable/tutorials/colors/colormaps.html) colours to customise your plot. Seaborn also has its own [palette options](https://seaborn.pydata.org/tutorial/color_palettes.html)
* In this example, we considered `palette='crest'`

sns.pairplot(data=df, plot_kws={'alpha':0.8}, hue="species", palette='crest')
plt.show()
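
If you want to inspect the colours a palette will use before plotting, one option is `sns.color_palette()`. The sketch below asks for three colours, assuming one per species, and prints them as hex codes.

```python
import seaborn as sns

# Three colours from the 'crest' palette, one per species in the iris sample
colours = sns.color_palette("crest", n_colors=3).as_hex()
print(colours)
```

In a notebook, evaluating `sns.color_palette("crest", n_colors=3)` as the last line of a cell shows the colours as swatches.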

An alternative way to discover the available palette options is to pass any invalid name, such as `'xxxxx'`.
* It will raise an error whose message lists the valid options you can pick from
* You can try it by replacing `'crest'` in the code cell below

```
ValueError: 'xxxxx' is not a valid value for name; supported values are 'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', ....
```

sns.pairplot(data=df, hue="species", palette='crest')
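
If you prefer not to trigger an error, a gentler sketch is to list the colormap names registered with Matplotlib. These are Matplotlib's names only; Seaborn's own palette options are described in its palette tutorial.

```python
import matplotlib.pyplot as plt

names = plt.colormaps()   # names of the colormaps registered with Matplotlib
print(len(names), "colormaps, for example:", names[:5])
print("xxxxx" in names)   # an invalid name is not in the list
```

Names ending in `_r` are the reversed version of the same colormap.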

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: We will use the DataFrame below, which uses the mpg dataset. It holds records for different car models and their attributes, like miles per gallon, model year, weight etc

df_practice = sns.load_dataset('mpg')
df_practice = df_practice.sample(n=50, random_state=1)
df_practice.head(3)

Feel free to try out your ideas using a pair plot on this dataset

# write your code here

---

Sometimes the feature space is so big that a pair plot of every variable adds no value. You would then subset the data to the variables you are most interested in to analyse them better.
* One criterion for selecting variables is correlation analysis, which we will study in upcoming lessons
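
As a rough preview of that idea, the sketch below ranks columns by their absolute correlation with a variable of interest and keeps the top two. The data is synthetic and the column names (`target`, `strong`, `moderate`, `noise_*`) are hypothetical. Picking the top two is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200

# Synthetic data: two informative columns and three pure-noise columns
target = rng.normal(size=n)
toy = pd.DataFrame({
    "target": target,
    "strong": target + rng.normal(0, 0.3, size=n),
    "moderate": target + rng.normal(0, 1.0, size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
    "noise_3": rng.normal(size=n),
})

# Absolute correlation of every column with the target, strongest first
correlations = (toy.corr()["target"]
                .drop("target")
                .abs()
                .sort_values(ascending=False))
selected = correlations.head(2).index.tolist()

print(correlations.round(2))
print(selected)
```

You could then pass only those columns to the pair plot, for example `sns.pairplot(data=toy[selected + ["target"]])`.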

