# **Data Visualization** 📈

Greetings, challengers! In this station, you will learn how to convert data and information into beautiful graphics using Python. Visual elements such as charts, graphs and maps provide an accessible way to see and understand trends, outliers and patterns in data. This step is especially important when data scientists perform exploratory data analysis (EDA) before creating a machine learning model. It is also used to present findings in a more visually appealing way, because who likes to read a bunch of lengthy sentences when a beautiful graph can summarise them all at once?

<br>

![Introduction%20Picture.png](Assets/Introduction%20Picture.png)

---

## **Know Your Tools 🔎**
Below are the tools and libraries that will be used for this tutorial and challenges.

<br>

> **Jupyter Notebook**<br><br>
Number one go-to tool for data scientists. It offers an interactive web interface that can be used for data visualization, easy analysis, modelling and collaboration
<br><br><br>
> **Pandas**<br><br>
An open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools
<br><br><br>
> **Matplotlib**<br><br>
A comprehensive library for creating static, animated and interactive visualizations in Python
<br><br><br>
> **Seaborn**<br><br>
A library for making statistical graphics in Python. It is built on top of matplotlib and integrates closely with pandas data structures

---

## **Gear Up Yourself 🧰**
*If you are using your own laptop and have not installed the tools mentioned above, please go through the following instructions. If you are using a machine from the station, you can skip this part.*

1. Install Jupyter by entering the following code into the command-line <br>
```
$ pip install jupyter
```

2. Next, install pandas, matplotlib and seaborn libraries <br>
```
$ pip install pandas 
$ pip install matplotlib
$ pip install seaborn
```

3. After the installations are completed, start the Jupyter Notebook server. Type the command below in your terminal. <br>
```
$ jupyter notebook
```

    A Jupyter page showing files in the current directory will open in your computer's default browser. <br><br>
    **Note: <font color=red>Do not close the terminal window that you run this command in. Your server will stop if you do so.</font>**
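
To confirm that the installation steps worked, a small check along these lines can be run in a notebook cell or a Python shell. This is only a sketch: the helper name `installed_version` is made up for illustration, and the package names are assumed to be the same ones passed to `pip install` above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names as used with `pip install` in the steps above
report = {name: installed_version(name)
          for name in ("jupyter", "pandas", "matplotlib", "seaborn")}

for name, found in report.items():
    print(f"{name}: {found if found else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be installed again with the matching `pip install` command.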

<br>
Great! That's all for the setup. Let's begin the fun stuff!

---

# **Tutorial Section 👨‍🏫**
This section will teach you the basics of the matplotlib and seaborn libraries. It does not award any marks, but it gives hands-on experience to those who are unfamiliar with these tools. If you are already an expert in these libraries, you may go straight to the challenges.

---
---

## **Matplotlib**

Matplotlib is the "grandfather" library of data visualization with Python. It was created by John Hunter, and is an excellent 2D and 3D graphics library for generating scientific figures.

Some of the major Pros of Matplotlib are:
- Generally easy to get started for simple plots
- Support for custom labels and texts
- Great control of every element in a figure
- High-quality output in many formats
- Very customizable in general

### Importing
Run the following cell to import the **pyplot** graphic library from the **matplotlib** API with the name `plt` by convention.

In [None]:
import matplotlib.pyplot as plt

### Plot a Basic Graph
Run the following cell:

In [None]:
# Define the data to be plotted as a graph
x = [1,2,3,4,5,6,7,8]
y = [2,4,6,8,10,12,14,16]

# Plot the graph
plt.plot(x, y)

# Name the X and Y axes
plt.xlabel('X Axis Title Here')
plt.ylabel('Y Axis Title Here')

# Name the title of the graph
plt.title('String Title Here')

# Display the graph
plt.show()

You should get a graph like this after running the code:

![basic_plot_1.png](Assets/basic_plot_1.png)

Matplotlib automatically chooses the minimum and maximum values shown along the x and y axes (and the z axis in the case of a 3D plot). However, you can set explicit limits with the xlim() and ylim() functions.

Run the following code:

In [None]:
plt.plot(x, y) 
plt.xlabel('X Axis Title Here')
plt.ylabel('Y Axis Title Here')
plt.title('String Title Here')
plt.xlim(1, 8) # Bound the x-axis within the range [1, 8]
plt.ylim(2, 16) # Bound the y-axis within the range [2, 16]
plt.show()

Now the graph should look like this:

![basic_plot_2.png](Assets/Basic_plot_2.png)

### Matplotlib Figure Object
Now that we've seen the basics, let's move on to a more formal introduction to Matplotlib's object-oriented API, in which we instantiate a **figure** object and then call methods or attributes on that object. This approach is better when dealing with a canvas that has multiple plots, or with more complicated plots in general.

In [None]:
# Defines data to be plotted as a graph
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

To begin with, let's use a figure object to plot a graph using the data that we defined above.

In [None]:
# Create Figure (empty canvas)
fig = plt.figure()

# Add set of axes to figure
axes = fig.add_axes([0, 0, 1, 1]) # [left, bottom, width, height] (range 0 to 1)

# Plot on that set of axes
axes.plot(x, y)

# Display the plot
plt.show()

The *matplotlib.figure.Figure.add_axes()* method adds a set of axes to the figure by taking a list argument, **rect**.

The **rect** parameter is the dimension [left, bottom, width, height] of the new axes, where:
- **left**: Horizontal distance from the lower left corner of the figure.
- **bottom**: Vertical distance from the lower left corner of the figure.
- **width**: Width of the subplot.
- **height**: Height of the subplot.

Note that all quantities are fractions of the figure width and height, i.e. they range from 0 to 1.
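
To make the fractions concrete, here is a small sketch that adds one set of axes and reads its position back. The figure size and the `rect` values are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside Jupyter; omit in a notebook
import matplotlib.pyplot as plt

# A figure that is 6 inches wide and 4 inches tall
fig = plt.figure(figsize=(6, 4))

# Start 10% from the left and 15% from the bottom,
# then span 80% of the figure width and 70% of its height
axes = fig.add_axes([0.1, 0.15, 0.8, 0.7])

# Read the position back as (left, bottom, width, height) fractions
left, bottom, width, height = axes.get_position().bounds

# Convert the fractions into inches using the figure size
width_in_inches = width * fig.get_figwidth()
height_in_inches = height * fig.get_figheight()
print(left, bottom, width, height)
print(width_in_inches, height_in_inches)
```

Because the values are fractions, the same `rect` keeps its proportions if the figure size changes.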

### Multiple plots in a figure
Now that you have a better understanding of the *add_axes()* method, you can easily add more than one set of axes to the figure. In this section you will learn how to create multiple plots in a figure!

In [None]:
# Create Figure (empty canvas)
fig = plt.figure()

# Create the first set of axes
axes1 = fig.add_axes([0, 0, 1, 1]) # Large figure
axes1.plot(x, y)
axes1.set_xlabel('X Label')
axes1.set_ylabel('Y Label')
axes1.set_title('First Figure')

# Create the second set of axes
axes2 = fig.add_axes([0.2, 0.4, 0.3, 0.3]) # Small figure
axes2.plot(x, y)
axes2.set_title('Second Figure')

# Display the plot
plt.show()

The order in which the axes are added matters! The first set of axes added to the figure is rendered first, so it appears behind the second set of axes.

### Matplotlib Sub Plots
The main focus here is the *plt.subplots()* method. This method acts like a more automatic axis manager. This makes it much easier to show multiple plots side by side.

The *matplotlib.pyplot.subplots()* method takes two arguments that describe the layout of the figure, in which:

- **nrows** represents the number of rows of the figure, and
- **ncols** represents the number of columns of the figure.

It will then return **fig** and **ax**, where:

- **fig** is a figure object with an empty canvas, and
- **ax** can be either a single Axes object or an array of Axes objects if more than one subplot was created.
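
The shape of the returned **ax** depends on the layout that was requested. The sketch below compares three layouts; the variable names are chosen only for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside Jupyter; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.axes import Axes

fig_a, single = plt.subplots()      # no layout given: a single Axes object
fig_b, row = plt.subplots(1, 2)     # 1 row, 2 columns: 1-D array of Axes
fig_c, grid = plt.subplots(2, 2)    # 2 rows, 2 columns: 2-D array of Axes

print(isinstance(single, Axes))
print(row.shape)
print(grid.shape)

# `.flat` visits every Axes of a 2-D array in turn, which is handy for shared settings
for axis in grid.flat:
    axis.set_xlabel("X")
```

This is why a 1 x 2 layout is indexed as `ax[0]` while a 2 x 2 layout needs two indices such as `ax[0][1]`.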

Now let's try to create a figure with two side-by-side subplots using the *plt.subplots()* method!

In [None]:
# Defines data to be plotted as a graph
x1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

x2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y2 = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# Create empty canvas with 1 row and 2 columns of subplot
fig, ax = plt.subplots(1, 2)

In [None]:
# Ax is an array of axes to plot on
ax

Just as before, we simply use the plot() method to plot on each of the axes that are created by the *subplots()* method.

In [None]:
# Plot on the first set of axes
ax[0].plot(x1, y1)
ax[0].set_title("Y = X Graph")
ax[0].set_xlabel("X")
ax[0].set_ylabel("Y")

# Plot on the second set of axes
ax[1].plot(x2, y2)
ax[1].set_title("Y = X^2 Graph")
ax[1].set_xlabel("X")
ax[1].set_ylabel("Y")

# Display the figure
fig

A common issue with matplotlib is overlapping subplots or figures. We can use the *fig.tight_layout()* or *plt.tight_layout()* method, which automatically adjusts the positions of the axes on the figure canvas so that the subplots do not overlap.


In [None]:
# Automatically adjust the axes to prevent subplots from overlapping with each other
fig.tight_layout()

# Display the figure after adjusting
fig

### Matplotlib Styling
In the following section, you will learn how to customize your plot by changing the line color, line type, line width, and a lot more!

First, let's learn about legends. A legend describes the elements of the graph. To add a legend to your figure, you need to label every plot in the figure using the **label** keyword argument.

In [None]:
# Defines data to be plotted as a graph
x1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

x2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y2 = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# Create an empty figure
fig = plt.figure()

# Add a set of axes to the figure
ax = fig.add_axes([0, 0, 1, 1])

# Plot twice on the axes
ax.plot(x1, y1, label="x")
ax.plot(x2, y2, label="x^2")

# Add legend to the figure
ax.legend();

The **legend** function takes an optional keyword argument **loc** that specifies where in the figure the legend is drawn:

- **legend(loc=0)** places the legend in the optimal location, avoiding overlap with the actual plot
- **legend(loc=1)** places the legend in the upper right corner
- **legend(loc=2)** places the legend in the upper left corner
- **legend(loc=3)** places the legend in the lower left corner
- **legend(loc=4)** places the legend in the lower right corner

The value of the **loc** argument is 0 by default. Run the following cell to see the differences!

In [None]:
# Create empty canvas with 2 rows and 2 columns of subplot
fig, ax = plt.subplots(2,2)

# Legend at top right corner
ax[0][0].plot(x1, y1, label="x")
ax[0][0].legend(loc=1)

# Legend at top left corner
ax[0][1].plot(x1, y1, label="x")
ax[0][1].legend(loc=2)

# Legend at lower left corner
ax[1][0].plot(x1, y1, label="x")
ax[1][0].legend(loc=3)

# Legend at lower right corner
ax[1][1].plot(x1, y1, label="x")
ax[1][1].legend(loc=4)
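
Matplotlib also accepts location names as strings, which tend to be easier to read than the integer codes: `'best'` (0), `'upper right'` (1), `'upper left'` (2), `'lower left'` (3) and `'lower right'` (4). A minimal sketch with made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside Jupyter; omit in a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
ax.plot(x, y, label="x^2")

# 'upper left' is the string form of loc=2
legend = ax.legend(loc="upper left")

# The legend entries come from the `label` given to each plot call
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```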

Next, let's learn how to change the line color, line width and line type! Matplotlib gives you a lot of options for customizing colors, line widths and line types.

To change the line color of the plots, you need to declare the color of the plot using the **color** keyword argument, right when you are plotting the graph. You can also change the transparency of your plot using the **alpha** keyword argument.

Run the following cell.

In [None]:
# Create empty canvas with 2 rows and 2 columns of subplot
fig, ax = plt.subplots(2,2)

# Red line color
ax[0][0].plot(x1, y1, color="red")

# Blue line color
ax[0][1].plot(x1, y1, color="blue")

# Blue line color with half-transparency
ax[1][0].plot(x1, y1, color="blue", alpha=0.5)

# RGB hex code
ax[1][1].plot(x1, y1, color="#FF8C00")
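
Whichever way a color is written, matplotlib resolves it to red, green, blue and alpha values between 0 and 1. The sketch below uses `matplotlib.colors.to_rgba` to show what the color arguments from the cell above resolve to; the variable names are only illustrative.

```python
from matplotlib.colors import to_rgba

# A named color becomes an (R, G, B, A) tuple with values between 0 and 1
named = to_rgba("blue")

# `alpha` sets the last component, i.e. the opacity
half_transparent = to_rgba("blue", alpha=0.5)

# A hex code is split into pairs: FF -> 255, 8C -> 140, 00 -> 0, each divided by 255
orange = to_rgba("#FF8C00")

print(named)
print(half_transparent)
print(orange)
```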

To change the line width of the plots, you need to declare the line width of the plot using the **linewidth** or **lw** keyword argument.

Run the following cell.

In [None]:
# Create empty canvas with 2 rows and 2 columns of subplot
fig, ax = plt.subplots(2,2)

# Line width 0.25 (widths are passed as numbers)
ax[0][0].plot(x1, y1, linewidth=0.25)

# Line width 0.5
ax[0][1].plot(x1, y1, linewidth=0.5)

# Line width 1
ax[1][0].plot(x1, y1, lw=1)

# Line width 5
ax[1][1].plot(x1, y1, lw=5)

To change the line style of the plots, you need to declare the line style of the plot using the **linestyle** or **ls** keyword argument.

Run the following cell.

In [None]:
# Some possible line style options include '-', '--', '-.' and ':'

# Create empty canvas with 2 rows and 2 columns of subplot
fig, ax = plt.subplots(2,2)

# Solid line 
ax[0][0].plot(x1, y1, linestyle='-')

# Dash and dot line
ax[0][1].plot(x1, y1, linestyle="-.")

# Dots line
ax[1][0].plot(x1, y1, ls=":")

# Dashes line
ax[1][1].plot(x1, y1, ls="--")

Lastly, let's learn about the markers. You can use the **marker** keyword argument to emphasize each point with a specified marker, and the **markersize** or **ms** keyword argument to declare the marker size.

Run the following cell.

In [None]:
# Some possible marker type options include '+', 'o', 's' and '1'

# Create empty canvas with 2 rows and 2 columns of subplot
fig, ax = plt.subplots(2,2)

#  '+' type marker
ax[0][0].plot(x1, y1, marker='+', markersize=10)

# 'o' type marker
ax[0][1].plot(x1, y1, marker="o", markersize=5)

# 's' type marker
ax[1][0].plot(x1, y1, marker="s", ms=5)

# '1' type marker
ax[1][1].plot(x1, y1, marker="1", ms=10)
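
The styling keywords can all be combined in a single `plot()` call. The following sketch uses made-up data and arbitrary style values, then reads the settings back from the returned line object.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside Jupyter; omit in a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()

# plot() returns a list of line objects; unpack the only one
(line,) = ax.plot(
    x, y,
    color="red", alpha=0.7,          # color and transparency
    linewidth=2, linestyle="--",     # line width and line style
    marker="o", markersize=8,        # marker type and marker size
    label="x^2",                     # text shown in the legend
)
ax.legend(loc="upper left")

# The line object remembers the style it was given
print(line.get_linestyle(), line.get_marker(), line.get_linewidth())
```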

In [None]:
# CODE HERE

---
---
## **Seaborn**

Seaborn is the child of Matplotlib. It builds on top of Matplotlib and integrates closely with pandas data structures. With just a few lines of Python code, the user can create beautiful graphs for data visualization and exploratory data analysis.

Some of the major pros of Seaborn are:
- Its default aesthetics are much more visually appealing than matplotlib
- Easy to customize the aesthetics
- Great control of every element in a figure
- High-quality output in many formats
- Very customizable in general

### Importing
Run the following cell to import the **seaborn** and **pandas** libraries under the names `sns` and `pd` respectively, by convention.

In [None]:
import seaborn as sns
import pandas as pd

### Reading Dataset
Run the following cell to gain some information regarding the dataset.

In [None]:
# Read the csv (comma-separated values) file using pandas and store it inside a variable 'df'
df = pd.read_csv("Data/dm_office_sales.csv")

# Show the first five rows of the dataset
df.head()

To obtain the size of the dataset, access the **shape** attribute of the dataframe. Note that the value is a tuple (A, B), where

`A` - Number of rows  
`B` - Number of columns

In [None]:
df.shape

Hence, this dataset consists of 1000 rows and 6 columns, which are division, level of education, training level, work experience, salary and sales.

---
### **Scatter Plot**
Scatter plots show how different features are related to one another. The main theme of all relational plot types is that they display how features (also known as columns) are connected to each other. There are many different types of plots that can show this, so let's explore `scatterplot()` as well as general seaborn parameters that apply to other plot types.  

Run the following cell.

In [None]:
# This code will plot the scatterplot between salary and sales, using the data that is stored inside 'df' variable
sns.scatterplot(x='salary', y='sales', data=df)

### Connecting to Figure in Matplotlib
For your information, seaborn is built on matplotlib and creates its figures through matplotlib calls, even if you never import `matplotlib.pyplot` yourself. We can therefore import `matplotlib.pyplot` and make calls that directly affect the seaborn figure.  

Run the following code.

In [None]:
# Import the pyplot graphic library from the matplotlib
import matplotlib.pyplot as plt

In [None]:
# Set the figure size to (12,8)
plt.figure(figsize=(12,8))

# Plot scatterplot using seaborn
sns.scatterplot(x='salary', y='sales', data=df)
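
Since `sns.scatterplot()` returns the matplotlib Axes it drew on, that object can be adjusted with the same methods used in the Matplotlib section. The sketch below uses a tiny made-up dataframe as a stand-in for the real dataset; its values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside Jupyter; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Hypothetical stand-in data, only for illustration
df = pd.DataFrame({
    "salary": [60000, 75000, 90000, 105000, 120000],
    "sales": [200000, 310000, 280000, 420000, 390000],
})

plt.figure(figsize=(8, 5))

# seaborn draws on the current matplotlib figure and returns its Axes
ax = sns.scatterplot(x="salary", y="sales", data=df)

# From here on these are ordinary matplotlib calls
ax.set_title("Sales vs. Salary")
ax.set_xlim(50000, 130000)
```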

---
### **Seaborn Parameters**
The `hue` and `palette` parameters are available in many seaborn plotting functions. Other parameters, such as `size`, `alpha`, `linewidth` and `style`, are specific to certain plots only.

#### **Hue**
Colors the points based on a categorical feature in the dataframe. Basically, it allows you to display a third feature on the plot.

Run the following cell.

In [None]:
# Plot the scatterplot. Each marker (data point) is colored based on its 'division' value.
sns.scatterplot(x='salary',y='sales',data=df,hue='division')

If the feature given to the `hue` parameter is numerical, the markers are drawn in the same color with different intensities.

Run the following cell.

In [None]:
# Plot the scatterplot. Each marker (data point) is colored based on its 'work experience' value.
sns.scatterplot(x='salary',y='sales',data=df,hue='work experience')

#### **Palette**
Instead of using the default colors, you can customize them using the `palette` parameter. There are lots of palette [options](https://matplotlib.org/stable/tutorials/colors/colormaps.html) to choose from.  

Run the following cell.

In [None]:
# Plot the scatterplot using 'work experience' as the hue parameter, and set the palette parameter to 'viridis'.
sns.scatterplot(x='salary',y='sales',data=df,hue='work experience',palette='viridis')

#### **Size**
Parameter `s` allows you to change the marker size.

#### **Alpha**
Sometimes the markers are overlapped with one another, making it harder to visualize the amount of data for a particular value. Hence, parameter `alpha` is used to control the transparency of plots. 

#### **Linewidth**
The `linewidth` parameter adjusts the width of the edge line drawn around each marker.

Run the following cell.

In [None]:
# Set the figure size to (12,8).
plt.figure(figsize=(12,8))

# Plot a scatterplot. The marker size is set to 200, line width is set to 0 and the alpha (transparency) is set to 0.2.
sns.scatterplot(x='salary',y='sales',data=df,s=200,linewidth=0,alpha=0.2)

#### **Style**
The `style` parameter automatically chooses marker styles based on the given categorical feature in the dataset. To put it simply, it allows you to display a fourth feature on the plot if you have already used the `hue` parameter.  

Optionally, use the `markers` parameter to pass a list of matplotlib marker choices, for example: ['*','o','s']. Avoid mixing filled markers (such as 'o') with line-art markers (such as '+') in the same list, as seaborn rejects that combination.

Run the following cell.

In [None]:
# Set the figure size to (12,8).
plt.figure(figsize=(12,8))

# Plot a scatterplot with the intended style, hue and marker size.
sns.scatterplot(x='salary',y='sales',data=df,style='level of education',hue='level of education',s=100)
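
The `markers` parameter also accepts a dictionary, so each category can be given a specific marker. This is a sketch on a small made-up dataframe: the category names and values are hypothetical and will differ from the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this also runs outside Jupyter; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data, only for illustration
df = pd.DataFrame({
    "salary": [60000, 65000, 80000, 85000, 110000, 115000],
    "sales": [210000, 250000, 300000, 280000, 400000, 430000],
    "level of education": ["high school", "high school", "college",
                           "college", "masters", "masters"],
})

# One filled marker per category (every category needs an entry)
markers = {"high school": "o", "college": "s", "masters": "*"}

plt.figure(figsize=(8, 5))
ax = sns.scatterplot(x="salary", y="sales", data=df,
                     hue="level of education", style="level of education",
                     markers=markers, s=100)

# Collect the text of every legend entry
legend_labels = {text.get_text() for text in ax.get_legend().get_texts()}
print(legend_labels)
```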

---
### **Distribution Plots**

#### **Histogram**
Histograms represent the data distribution by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin (range). This allows you to see the frequency distribution of a dataset.
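
The binning step can be reproduced by hand with NumPy, which is roughly what happens before the bars are drawn. This is a minimal sketch with a made-up list of values, not the seaborn implementation itself.

```python
import numpy as np

# Made-up observations, only for illustration
values = [1, 2, 2, 3, 5, 7, 8, 9]

# Split the range 1..9 into 4 equal-width bins and count the values in each one
counts, edges = np.histogram(values, bins=4)

# Print each bin's range together with its count
for start, end, count in zip(edges[:-1], edges[1:], counts):
    print(f"{start:.0f} to {end:.0f}: {count} observation(s)")
```

Each bin includes its left edge but not its right edge, except for the last bin, which includes both.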

Run the following cell.

In [None]:
# Plot the frequency distribution for the feature 'salary' using the dataset from variable 'df'
sns.displot(data=df, x='salary')

In [None]:
# Another way to plot the exact same histogram
sns.histplot(data=df,x='salary')

To adjust the number of bins, just set the `bins` parameter.

Run the following cell.

In [None]:
# Plot histogram with 10 bins
sns.histplot(data=df,x='salary',bins=10)

In [None]:
# Plot histogram with 100 bins
sns.histplot(data=df,x='salary',bins=100)

And of course, you can customize the histogram based on your preferences.

Run the following cell.

In [None]:
# Bins number is set to 20
# Bar is colored red
# Line edge is colored black
# Line width is set to 4
# Line style is set to '--'
sns.displot(data=df,x='salary',bins=20, color='red',edgecolor='black', lw=4, ls='--')

Kernel density estimation (KDE) is a useful technique for estimating a probability density function, as it lets data scientists analyse the underlying distribution more smoothly than a traditional histogram.

A kernel (a continuous curve) is drawn at every individual data point, and all these curves are then added together to form a single smooth density estimate. A KDE is also easier to read than a histogram when comparing the distribution of a single variable across multiple categories, where overlapping bars quickly become cluttered.
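
The idea of summing one kernel per data point can be sketched directly with NumPy. This is an illustration under stated assumptions (a Gaussian kernel, a fixed bandwidth, made-up sample values), not seaborn's actual implementation; the function name `gaussian_kde_1d` is hypothetical.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at every point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance of every grid point to every sample, measured in bandwidths
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, evaluated on the whole grid
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Add the bumps together and normalise so the total area is 1
    return kernels.sum(axis=1) / (len(samples) * bandwidth)


samples = [2.0, 3.0, 3.5, 7.0]          # made-up observations
grid = np.linspace(-5, 15, 2001)        # points at which the curve is evaluated
density = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# Approximate the area under the curve; a density should integrate to about 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```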

To show kde on top of the histogram, just set the `kde` parameter to **`True`**.

Run the following cell.

In [None]:
# Plot the histogram with kde
sns.displot(data=df,x='salary',kde=True)

#### **Kernel Density Estimation Plot**
A KDE can also be plotted on its own graph using the `kdeplot()` function.

Run the following cell.

In [None]:
# Plot kdeplot for 'salary' from the dataset stored in the variable 'df'
sns.kdeplot(data=df, x='salary')

If you want to show kde for a specific range of values, just put the range inside `clip` parameter.

Run the following cell.

In [None]:
# Plot kdeplot for 'salary' from the range 60000 to 120000
sns.kdeplot(data=df, x='salary', clip=[60000, 120000])

A kernel has a property called **`bandwidth`**, which controls how wide each kernel is. Reducing the bandwidth makes the KDE more sensitive to the data. Notice how, with a smaller bandwidth, the kernels don't stretch so wide, meaning we don't need the cut-off anymore. This is analogous to increasing the number of bins in a histogram. To change it, pass a value to the `bw_adjust` parameter, which multiplies the default bandwidth.

Run the following cell.

In [None]:
# Plot kdeplot for 'salary' with the default bandwidth scaled by 0.1
sns.kdeplot(data=df, x='salary', bw_adjust=0.1)

In [None]:
# Plot kdeplot for 'salary' with the default bandwidth scaled by 0.5
sns.kdeplot(data=df, x='salary', bw_adjust=0.5)

In [None]:
# Plot kdeplot for 'salary' with the default bandwidth (scale factor of 1)
sns.kdeplot(data=df, x='salary', bw_adjust=1)

---
### **Categorical Plots**

#### Statistical Estimation 

Often we have categorical data, meaning the data is in distinct groupings, such as Countries or Companies. There is no country value "between" USA and France and there is no company value "between" Google and Apple, unlike continuous data where we know values can exist between data points, such as age or price.

To begin with categorical plots, we'll focus on **statistical estimation** within categories. Basically this means we will visually report back some statistic (such as mean or count) in a plot.

#### **Count Plot**
It is a simple plot that shows the total count of rows per category.

Run the following cell.

In [None]:
# Set figure size to (10,4) with a higher resolution
plt.figure(figsize=(10,4), dpi=200)

# Plot the countplot for the category in the feature 'level of education' using the dataset from the variable 'df'
sns.countplot(x='level of education', data=df)

Also, each category within a feature (A) can be broken down by the categories of another feature (B) using the `hue` parameter that we learnt about earlier.

Run the following cell.

In [None]:
# Set figure size to (10,4) and generate graph with a higher quality
plt.figure(figsize=(10,4), dpi=200)

# Plot the countplot for the category in the feature 'level of education', in which these category will be separated based on the categories of 'training level'
sns.countplot(x='level of education', data=df, hue='training level')

#### **Bar Plot**
Barplot is one of the most common types of graphics. It shows the relationship between a numeric and a categoric variable. Each entity of the categoric variable is represented as a bar. The size of the bar represents its numeric value.  

In seaborn, if the plotted feature has multiple values for a single category, it will plot the mean for that particular category.
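
The bar heights can be checked by hand with a pandas `groupby`. This is a sketch on a tiny made-up dataframe, so the category names and salaries are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data, only for illustration
df = pd.DataFrame({
    "level of education": ["high school", "high school",
                           "college", "college", "college"],
    "salary": [60000, 70000, 80000, 90000, 100000],
})

# The mean salary of each category is the height of its bar in the default barplot
bar_heights = df.groupby("level of education")["salary"].mean()
print(bar_heights)
```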

Run the following cell.

In [None]:
# Set figure size to (12,6) and generate graph with a high quality
plt.figure(figsize=(12,6), dpi=100)

# Plot the barplot for the 'level of education' versus 'salary', using the dataset from the variable 'df'
sns.barplot(x='level of education', y='salary', data=df)

On top of each bar, you can see a black line. This is the error bar. It displays either a confidence interval or the standard deviation, which is quite useful when doing statistical analysis. The black line is not shown if the `ci` parameter is set to 0 or `None`. Note that from seaborn 0.12 onward `ci` is deprecated in favour of the `errorbar` parameter, where `errorbar=None` hides the line.

For more information, please select the links below:  
1. [What is the black line](https://stackoverflow.com/questions/58362473/what-does-black-lines-on-a-seaborn-barplot-mean)
2. [Parameter of the barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html)

To remove the black line and separate 'level of education' based on 'division' feature, run the following cell.

In [None]:
# Set figure size to (12,6) and generate graph with a high quality
plt.figure(figsize=(12,6), dpi=100)

# Plot the barplot for the 'level of education' versus 'salary', in which each category within 'level of education' will be separated based on categories of 'division'
sns.barplot(x='level of education', y='salary', data=df, ci=0, hue='division')

# Move the legend away from the graph
plt.legend(bbox_to_anchor=(1.05, 1))

You will notice that the legend is not inside the graph. This is because it was moved using the `bbox_to_anchor` keyword argument of `plt.legend()`. The value must be in the form (a, b), in which 'a' is the x position and 'b' is the y position of the legend, expressed as fractions of the axes, so a value greater than 1 places the legend outside the plot area. For more information, you can visit this [link](https://stackoverflow.com/questions/30490740/move-legend-outside-figure-in-seaborn-tsplot). Feel free to play around with it to get the hang of it.

---
#### Distribution within Categories

So far we've seen how to apply a statistical estimation (like mean or count) to categories and compare them to one another. Let's now explore how to visualize the distribution within categories. We already know about `displot()`, which lets us view the distribution of a single feature; now we will break down that same distribution per category.

Run the following cell to import new dataset `StudentsPerformance` and check its data.

In [None]:
# Read the csv (comma-separated values) file using pandas and store it inside a variable 'df'
df = pd.read_csv("Data/StudentsPerformance.csv")

# Show the first five rows of the dataset
df.head()

#### **Box Plot**
A boxplot displays a distribution through its quartiles and uses the interquartile range (IQR) to identify outliers.

> **Quartile**<br><br>
A statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations (data). Each interval contains 25% of the total observations.
<br><br><br>
> **Interquartile Range**<br><br>
It measures the spread of the middle half of the total observations, allowing us to assess the variability where most of the data lies.


Run the following cell.

In [None]:
# Set the figure size to (12,6)
plt.figure(figsize=(12,6))

# Plot a box plot for 'parental level of education' versus 'math score' using the data from the variable 'df'
sns.boxplot(x='parental level of education',y='math score',data=df)

The boxplot is used to show the following information:


![box_plot_anatomy](Assets/box_plot_anatomy.png)

Boxplots are important for data scientists because they reveal outliers in a dataset, which can then be investigated and, where appropriate, removed before further steps such as building a machine learning model.
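
As a rough sketch of how a box plot flags outliers: by default seaborn extends the whiskers up to 1.5 times the IQR beyond the quartiles (its `whis` parameter), and points beyond that are drawn as outliers. The scores below are made up for illustration.

```python
import numpy as np

# Hypothetical math scores, only for illustration
scores = np.array([35, 52, 58, 61, 64, 66, 70, 73, 77, 99])

# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(scores, [25, 75])

# The interquartile range covers the middle half of the data
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, q3, iqr)
print(outliers)
```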

Note that you can add the `hue` parameter to the boxplot as well to further separate the categories. Run the following cell.

In [None]:
# Set the figure size to (12,6)
plt.figure(figsize=(12,6))

# Plot a box plot with 'gender' as the hue parameter 
sns.boxplot(x='parental level of education',y='math score',data=df,hue='gender')

# Move the legend outside
plt.legend(bbox_to_anchor=(1.05, 1))

#### **Violin Plot**
It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

Run the following cell.

In [None]:
# Set the figure size to (12,6)
plt.figure(figsize=(12,6))

# Plot a violin plot for 'parental level of education' versus 'math score' using the data from the variable 'df',
# and separate the categories within x values based on 'gender'
sns.violinplot(x='parental level of education',y='math score',data=df,hue='gender')

---
### **Comparison Plots**

#### **Joint Plot** 

![joint_plot](Assets/Joint_plot.png)

A jointplot comprises three plots. 

Out of the three, one plot displays a bivariate graph which shows how the dependent variable(Y) varies with the independent variable(X). Another plot is placed horizontally at the top of the bivariate graph and it shows the distribution of the independent variable(X). The third plot is placed on the right margin of the bivariate graph with the orientation set to vertical and it shows the distribution of the dependent variable(Y). 

It is very helpful to have univariate and bivariate plots together in one figure. Univariate analysis focuses on one variable: it describes, summarizes and shows any patterns in that variable. Bivariate analysis explores the relationship between two variables and describes the strength of that relationship.

Run the following cell.

In [None]:
# Plot a joint plot between 'math score' versus 'reading score' using the data stored in the variable 'df'
sns.jointplot(x='math score',y='reading score',data=df)

Instead of a scatterplot, the jointplot can also display the data as a `hexbin` or a `bivariate kde` plot. Just change the **`kind`** parameter when calling jointplot(). Run the following cells to see the differences.

In [None]:
# Plot a joint plot with the marker 'scatter'
sns.jointplot(x='math score',y='reading score',data=df,kind='scatter')

In [None]:
# Plot a joint plot with the marker 'hex'
sns.jointplot(x='math score',y='reading score',data=df,kind='hex')

In [None]:
# Plot a joint plot with the marker 'kde'
sns.jointplot(x='math score',y='reading score',data=df,kind='kde')

#### **Pair Plot** 

To plot multiple pairwise bivariate distributions in a dataset, you can use the `pairplot()` function. It shows the relationship for every combination of two variables in a dataframe (n choose 2 pairs) as a matrix of plots, with univariate plots on the diagonal. Only numeric columns are plotted by default; a categorical column can be brought in through the `hue` parameter.
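
The number of distinct pairs can be worked out before plotting. The sketch below uses a small made-up dataframe with one categorical and three numeric columns; the values are hypothetical.

```python
import math
from itertools import combinations

import pandas as pd

# Hypothetical stand-in data, only for illustration
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
    "writing score": [74, 88, 93, 44],
})

# Keep only the numeric columns
numeric_columns = df.select_dtypes(include="number").columns

# Every distinct pair of numeric columns
pairs = list(combinations(numeric_columns, 2))

print(len(numeric_columns), "numeric columns give", len(pairs), "distinct pairs")
for first, second in pairs:
    print(first, "vs", second)
```

With n numeric columns there are n choose 2 such pairs, which is also what `math.comb(n, 2)` computes.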

Run the following cell.

In [None]:
# Plot the pairplot for the dataset
sns.pairplot(df)

As with the other plots, you can give the pairplot a `hue` parameter and customize its colors using the `palette` option.

Run the following cell.

In [None]:
# Plot a pairplot for the dataset, in which each continuous variables are separated based on 'gender' category and colored with 'viridis' style
sns.pairplot(df,hue='gender',palette='viridis')

You will notice that the scatterplots above the diagonal essentially mirror the ones below it. To remove these duplicated scatterplots, just set the `corner` parameter to **`True`**.

Run the following cell.

In [None]:
# Plot the pairplot without upper diagonal
sns.pairplot(df,hue='gender',palette='viridis',corner=True)

---
---

# **End of Tutorials 🔚**
Congratulations, you have reached the end of the tutorial section! Throughout these notes, you have learned the basic techniques for using the **matplotlib** and **seaborn** libraries. Of course, what we have covered here is only a small part, as we left out many specific details of the methods mentioned above. For those who are interested in learning more about these libraries, just browse Google, YouTube, StackOverflow or any learning website to get more familiar with them.

---
---

# **Challenge Ahead 🐱‍💻**
The challenges are broken down into 3 parts, each with a different difficulty. The harder the challenge, the higher the mark you will obtain ***if and only if*** you manage to solve it. If you encounter any problems or technical issues, feel free to contact the staff members. Remember, Google and StackOverflow are always there for you. Good luck and let's start the challenges!

---