## Visualization: `matplotlib`

### Programming for Data Science
### Created: April 11, 2023
---  

### PREREQUISITES
- variables
- data types
- numpy
- pandas

### SOURCES 
- https://matplotlib.org/

### OBJECTIVES
- Introduce some basic functionality of the `matplotlib` package

### CONCEPTS
- creating various visualizations (e.g., scatterplot, boxplot, histogram)
- changing xlabel, ylabel, and title
- changing figure styles
- saving plots
- creating subplots to display multiple plots in a single figure 
---

### `matplotlib`

`matplotlib` is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Below are some quick examples. See the documentation for more details and examples.

Calling `plt.[figtype]` for some plot type `[figtype]` (for example, `plt.plot` or `plt.hist`) creates the figure.

You can layer on details such as titles, a legend, etc.

#### Load packages and import some data

In [None]:
import matplotlib.pyplot as plt # `plt` is an alias for the `matplotlib.pyplot` module.
import numpy as np
import pandas as pd
import seaborn as sns # (wrapper of matplotlib)

The ``inline`` flag allows us to display plots inline in the notebook.

In [None]:
%matplotlib inline

Import a wine quality dataset from UCI. Note the separator `';'`; otherwise, the data doesn't parse into columns.

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
df = pd.read_csv(url, sep=';') # load wine data
df.head() # view the first 5 rows of the dataframe

Check whether the data contain missing values (NaNs), which are difficult to plot. `df.info()` reports the number of non-null entries in each column.

In [None]:
df.info()
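
As a complement to `df.info()`, the sketch below counts missing values per column directly. It uses a small made-up table (the name `toy` and its values are assumptions for illustration, not the wine data) so that it runs without a download.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine table.
toy = pd.DataFrame({
    "pH": [3.0, np.nan, 3.2, 3.1],
    "density": [0.99, 1.00, np.nan, 1.01],
})

missing_per_column = toy.isna().sum()  # number of NaNs in each column
print(missing_per_column)

clean = toy.dropna()  # keep only rows without any NaN
print(len(clean))
```

Applied to the notebook's data, the same calls would be `df.isna().sum()` and `df.dropna()`.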

---

### Basics

#### `plot` method

Plot *pH* on the x axis and *density* on the y axis. 

In [None]:
help(plt.plot)

In [None]:
plt.plot(df.pH, df.density) # plot x and y using default line style and color

#### Change Marker and Color

**Marker**

string | description
--- | ---
'.' | point marker
',' | pixel marker
'o' | circle marker
'v' | triangle_down marker
'^' | triangle_up marker
'<' | triangle_left marker
'>' | triangle_right marker
's' | square marker
'\*' | star marker
'+' | plus marker
'x' | x marker

**Color**

string | color
--- | ---
'b' | blue
'blue' | blue
'g' | green
'green' | green
'r' | red
'red' | red
'c' | cyan
'cyan' | cyan
'm' | magenta
'magenta' | magenta
'y' | yellow
'yellow' | yellow
'k' | black
'black' | black
'w' | white
'white' | white

The parameter can also recognize hexadecimal strings. For example, the color green is represented as '#008000'. It is also possible to use RGB tuples.
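
To see how the different color specifications relate, the following sketch converts a named color, a hexadecimal string, and an RGB tuple with the helpers in `matplotlib.colors`. The variable name `same_green` is chosen only for this illustration.

```python
from matplotlib import colors as mcolors

# Each specification is converted to an (r, g, b) tuple of floats in [0, 1].
print(mcolors.to_rgb("green"))    # named color
print(mcolors.to_rgb("#008000"))  # hexadecimal string
print(mcolors.to_rgb((0, 0, 1)))  # RGB tuple (blue)
print(mcolors.to_hex((0, 0, 1)))  # convert a tuple back to a hex string

# The name 'green' and the hex string '#008000' describe the same color.
same_green = mcolors.to_rgb("green") == mcolors.to_rgb("#008000")
print(same_green)
```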

A format string combines optional parts for the marker, line style, and color:
 
`fmt = '[marker][line][color]'`

In [None]:
plt.plot(df.pH, df.density, 'or') # plot x and y using red circle markers

In [None]:
plt.plot(df.pH, df.density, color = 'r', marker = 'o', linestyle='') # the same figure with multiple parameters

In [None]:
plt.plot(df.pH, df.density, '.b') # plot x and y using blue point markers

In [None]:
plt.plot(df.pH, df.density, '.-b') 
# plot x and y using blue point markers with lines
# Adding lines is not useful for this figure.

In [None]:
plt.plot(df.pH, df.density, c = (0, 0, 0), marker = '.', linestyle='', markersize = 10)
# Use rgb tuples: (0, 0, 0) is black and (0, 0, 1) is blue
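
The format string and the keyword arguments are two ways of setting the same line properties. The sketch below, which uses made-up points and the object-oriented interface (`fig, ax = plt.subplots()`) rather than the `plt.plot` calls above, draws one line each way and prints the resulting properties so they can be compared.

```python
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

x = [1, 2, 3]  # made-up data
y = [2, 4, 8]

fig, ax = plt.subplots()
(line_fmt,) = ax.plot(x, y, ".-b")  # format string: point marker, solid line, blue
(line_kw,) = ax.plot(x, y, marker=".", linestyle="-", color="b")  # same settings as keywords

for line in (line_fmt, line_kw):
    # marker, line style, and color (as an RGB tuple) of each line
    print(line.get_marker(), line.get_linestyle(), mcolors.to_rgb(line.get_color()))
```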

#### Setting plot titles, labels, and limits

- Set xlim and ylim.

In [None]:
plt.plot(df.pH, df.density, 'ok')
plt.xlim(left=2.6, right=3.9)
plt.ylim(bottom=0.98, top=1.02) 

- Set xlabel, ylabel, and title.

In [None]:
plt.plot(df.pH, df.density, 'ok')
plt.xlabel('pH')
plt.ylabel('Density')
plt.title('Relationship between pH and Density')

- Change fontsize.

In [None]:
plt.plot(df.pH, df.density, 'ok')
plt.xlabel('pH', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.title('Relationship between pH and Density', fontsize = 20)

In [None]:
# Change tick font size
plt.plot(df.pH, df.density, 'ok')
plt.xlabel('pH', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.title('Relationship between pH and Density', fontsize = 20)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()
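
The same settings can also be applied through an `Axes` object, which some readers find easier to follow once a figure has several plots. This is only a sketch of that alternative style with made-up points; the notebook itself keeps using the `plt.` functions.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([2.8, 3.2, 3.6], [0.99, 1.00, 1.01], "ok")  # made-up points

ax.set_xlim(2.6, 3.9)    # same role as plt.xlim
ax.set_ylim(0.98, 1.02)  # same role as plt.ylim
ax.set_xlabel("pH", fontsize=15)
ax.set_ylabel("Density", fontsize=15)
ax.set_title("Relationship between pH and Density", fontsize=20)
ax.tick_params(labelsize=15)  # tick label font size on both axes

print(ax.get_xlim(), ax.get_ylim())  # read the limits back
```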

#### Figure Style

`plt.style.available` offers us a selection of figure styles that we can use. The style that is set as default is not very visually appealing.

In [None]:
plt.style.available

In [None]:
plt.style.use('classic') # reproduces the look of matplotlib before version 2.0
plt.plot(df.pH, df.density, '.') 

In [None]:
plt.style.use('ggplot') # `ggplot` style from R
plt.plot(df.pH, df.density, '.') 

In [None]:
plt.style.use('seaborn-v0_8') # seaborn-like style; it was named 'seaborn' before matplotlib 3.6
plt.plot(df.pH, df.density, '.') 
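
`plt.style.use` changes the style for every later plot. When a style is needed for one figure only, `plt.style.context` applies it temporarily. The sketch below also checks `plt.style.available` before choosing a name, because style names differ between matplotlib versions; the fallback to `'ggplot'` is an arbitrary choice made for this example.

```python
import matplotlib.pyplot as plt

# Pick a seaborn-like style under whichever name this installation provides.
candidates = ["seaborn-v0_8", "seaborn"]
style_name = next((s for s in candidates if s in plt.style.available), "ggplot")

with plt.style.context(style_name):  # the style applies only inside this block
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9], ".")  # made-up points
    ax.set_title(f"Drawn with style: {style_name}")
```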

---

### Grids

In [None]:
plt.style.use('default') # use the default style. Note that it's not listed under plt.style.available

In [None]:
plt.plot(df.pH, df.density, 'ok')
plt.grid()

# To change the figure size, call plt.figure(figsize=(4, 3)) before plt.plot.

In [None]:
# add vertical grid lines only
plt.plot(df.pH, df.density, 'ok')
plt.grid(axis = 'x')

In [None]:
# add horizontal grid lines only
plt.plot(df.pH, df.density, 'ok')
plt.grid(axis = 'y')

In [None]:
# change the color, transparency, and linestyle of grid lines
plt.plot(df.pH, df.density, 'ok')
plt.grid(color = 'r', 
         alpha = 0.5,
         linestyle = 'dotted')

---

### Legends

Legends help to give meaning to the various elements in your plots.

The `loc` parameter, which determines the location of the legend, can take in strings, integers, and tuples.

string | int
--- | ---
'best' | 0
'upper right' | 1
'upper left' | 2
'lower left' | 3
'lower right' | 4
'right' | 5
'center left' | 6
'center right' | 7
'lower center' | 8
'upper center' | 9
'center' | 10

If a tuple with two elements `(x, y)` is used, it places the lower-left corner of the legend at that position in axes coordinates, where (0, 0) is the lower-left corner of the axes and (1, 1) is the upper-right corner.

In [None]:
# plot a second variable on the same graph, and add a legend.
plt.plot(df.pH, df["fixed acidity"], '.k', label = "Fixed acidity")
plt.plot(df.pH, df["volatile acidity"], '.r', label = "Volatile acidity")
plt.legend()

In [None]:
# change the position/location of the legend
plt.plot(df.pH, df["fixed acidity"], '.k', label = "Fixed acidity")
plt.plot(df.pH, df["volatile acidity"], '.r', label = "Volatile acidity")
plt.legend(loc="upper left")

In [None]:
# Move the legend outside the plot area.
plt.plot(df.pH, df["fixed acidity"], '.k', label = "Fixed acidity")
plt.plot(df.pH, df["volatile acidity"], '.r', label = "Volatile acidity")
plt.legend(loc=(1.03,0))
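
Another way to place a legend outside the plot area is to combine a `loc` string with the `bbox_to_anchor` parameter: `loc` names the corner of the legend box, and `bbox_to_anchor` gives the point in axes coordinates where that corner is pinned. The sketch uses made-up values rather than the wine data.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [7.0, 6.8, 6.5], ".k", label="Fixed acidity")        # made-up values
ax.plot([1, 2, 3], [0.30, 0.28, 0.25], ".r", label="Volatile acidity")  # made-up values

# Pin the legend's upper-left corner just right of the axes, level with the top edge.
legend = ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0))
print([text.get_text() for text in legend.get_texts()])
```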

---

### Saving Plots

It is essential to save your visualizations outside of your Jupyter notebook so that you can share them with others. It is equally important to check the saved visualization because the graph may appear differently in the image file compared to how it looks in the notebook.

In [None]:
plt.style.use('seaborn-v0_8') # change plot style (named 'seaborn' before matplotlib 3.6)

# The appearance of an image in a notebook may differ from how it appears when saved.
plt.plot(df.pH, df["fixed acidity"], '.k', label = "Fixed acidity")
plt.plot(df.pH, df["volatile acidity"], '.r', label = "Volatile acidity")
plt.xlabel('pH')
plt.ylabel('Acidity')
plt.title('Relationship between pH and two types of acidity')
plt.legend(loc=(1.01,0))

plt.savefig('figure1.png', dpi = 300)

In [None]:
# tight_layout() automatically adjusts the subplot parameters
# so that the plot elements fit within the figure area.
plt.plot(df.pH, df["fixed acidity"], '.k', label = "Fixed acidity")
plt.plot(df.pH, df["volatile acidity"], '.r', label = "Volatile acidity")
plt.xlabel('pH')
plt.ylabel('Acidity')
plt.title('Relationship between pH and two types of acidity')
plt.legend(loc=(1.03,0))

plt.tight_layout()
plt.savefig('figure2.png', dpi = 300)
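
When a legend sits outside the axes, passing `bbox_inches='tight'` to `savefig` is another option: the saved area is expanded to include everything drawn. The sketch below uses made-up data and writes to a temporary folder; the file name `figure_demo.png` and the folder are assumptions for this example only.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9], ".k", label="Example series")  # made-up data
ax.legend(loc=(1.03, 0))  # legend placed outside the axes

# Hypothetical output location; replace with any path you like.
out_path = os.path.join(tempfile.mkdtemp(), "figure_demo.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")  # 'tight' expands the saved area to fit the legend

print(os.path.exists(out_path))
```

As with the other saved figures, it is worth opening the file to confirm that nothing is cut off.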

---

#### TRY FOR YOURSELF
Do the following:

- Use the `plot` function from `matplotlib` to create a scatter plot based on the wine data.
   - Change the figure style. 
   - Create a scatter plot of any two numeric variables from the data, using the `plot` function.
   - Add x-label, y-label, and title.
   - Save the plot as a `.png` file.

In [None]:
plt.style.use('fivethirtyeight') # change figure style
plt.plot(df.pH, df.alcohol, "ob")
plt.xlabel("pH")
plt.ylabel("Alcohol")
plt.title("Relationship between pH and Alcohol")
plt.tight_layout()
plt.savefig('figure3.png', dpi = 300)

---
### Different Types of Plots

Import `iris` data

In [None]:
import seaborn as sns

iris = sns.load_dataset('iris')

In [None]:
iris.head()

In [None]:
iris.species.value_counts()

#### Barplot

**Using matplotlib**

The `matplotlib` library provides a basic set of tools and requires more code to perform complex tasks.

In [None]:
plt.bar(iris.species.value_counts().index, iris.species.value_counts(), color ='green',
        width = 0.4)
plt.xlabel("Species")
plt.ylabel("Count")

**Using pandas**

You can use `pandas` as a layer over `matplotlib`. One benefit of plotting with `pandas` is that it requires less code to produce a plot.

In [None]:
iris.species.value_counts().plot(kind='bar', rot=0)
plt.xlabel("Species")
plt.ylabel("Count")

**Using seaborn**

`seaborn` is a visualization library built on top of matplotlib. Its default graphs are often considered more visually appealing than matplotlib's. Refer to [seaborn's website](https://seaborn.pydata.org/tutorial/introduction) for more details. 

In [None]:
sns.countplot(x="species", data=iris)
plt.xlabel("Species")
plt.ylabel("Count")

In [None]:
sns.barplot(x=iris.species.value_counts().index, y=iris.species.value_counts())
plt.xlabel("Species")
plt.ylabel("Count")

#### Boxplot

A boxplot is a widely used way to summarize the distribution of data with five numbers: the minimum, first quartile, median (second quartile), third quartile, and maximum. It shows outliers, the symmetry of the data, and how tightly the values are grouped.
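
The numbers behind a boxplot can be computed by hand. The sketch below does this with NumPy for a made-up sample and also applies the 1.5 × IQR rule, which matplotlib uses by default to decide which points are drawn as outliers. The variable names are chosen for this illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])  # made-up sample with one extreme value

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points farther than 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(data.min(), q1, median, q3, data.max())  # five-number summary
print(outliers)
```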

**Using matplotlib**


In [None]:
sl_setosa = iris.loc[iris['species']=='setosa','sepal_length'].values
sl_versicolor = iris.loc[iris['species']=='versicolor','sepal_length'].values
sl_virginica = iris.loc[iris['species']=='virginica','sepal_length'].values

plt.boxplot([sl_setosa,sl_versicolor,sl_virginica], labels=['setosa', 'versicolor', 'virginica'])
plt.xlabel('Species')
plt.ylabel('Sepal Length')
plt.title('Distributions of Sepal Length by Species')

**Using pandas**


In [None]:
iris.boxplot(column = 'sepal_length', by = 'species')

In [None]:
# remove the automatic suptitle and title, and add a ylabel
iris.boxplot(column = 'sepal_length', by = 'species')
plt.title('')
plt.suptitle('')
plt.ylabel('Sepal Length')

**Using seaborn**


In [None]:
import seaborn as sns

sns.boxplot(x='species', y='sepal_length', data=iris)
plt.xlabel('Species')
plt.ylabel('Sepal Length')
plt.title('Distributions of Sepal Length by Species')

#### Histogram

A histogram is a way to represent the distribution of a variable. It involves dividing the entire range of the variable into a set of intervals or bins and counting how many values fall within each bin. 
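
The binning step can be inspected without drawing anything. The sketch below uses `np.histogram` on a handful of made-up values to show the bin edges and the count in each bin; these are the numbers a histogram plot turns into bars.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])  # made-up measurements

# Split the range from min to max into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    # every bin includes its left edge; only the last bin also includes its right edge
    print(f"{left:.1f} to {right:.1f}: {count}")
```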

In [None]:
# using matplotlib
plt.hist(iris.sepal_length)
plt.xlabel('Sepal Length')
plt.title('Distributions of Sepal Length')

In [None]:
# using pandas
iris['sepal_length'].hist()
plt.xlabel('Sepal Length')
plt.title('Distributions of Sepal Length')

In [None]:
# Customizing the settings: using 20 bins
iris['sepal_length'].hist(bins = 20)
plt.xlabel('Sepal Length')
plt.title('Distributions of Sepal Length')

In [None]:
# Customizing the settings: selecting a portion of the data and assigning the edgecolor as black.
sl_filter = iris.loc[:, 'sepal_length'] <= 6
iris.loc[sl_filter, 'sepal_length'].hist(bins = 20, edgecolor='black')

In [None]:
# using seaborn
sns.histplot(iris, x='sepal_length') # try with , bins=20, kde=True

#### Scatter Plot

A scatter plot is a type of plot that shows the relationship between two variables, and it typically displays the data points as a collection of dots. We used the `plot` function above, and here we use the `scatter` function to create a scatter plot. 

In [None]:
# using matplotlib
plt.scatter(iris.sepal_length, iris.sepal_width)
plt.title('Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length') 
plt.ylabel('Sepal Width') 
plt.grid()

In [None]:
# using seaborn
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width")
plt.title('Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length') 
plt.ylabel('Sepal Width') 
plt.grid()

In [None]:
# Use parameter `hue` to vary the color of points. 
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species")

In [None]:
# Additionally, use parameter `style` to vary the markers. 
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species", style="species")
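
For comparison, a similar color-by-category plot can be built with matplotlib alone by calling `scatter` once per group. This is a sketch under the assumption of a small made-up table (`toy`) that reuses the iris column names; it is not how seaborn implements `hue` internally.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small made-up table with the same column names as the iris data.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

fig, ax = plt.subplots()
for name, group in toy.groupby("species"):
    # one scatter call per category, so each gets its own color and legend entry
    ax.scatter(group["sepal_length"], group["sepal_width"], label=name)

ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.legend(title="species")
```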

#### Subplot

It's often useful to compare different subsets of data side by side.

In [None]:
# generate some data
x1 = np.linspace(0.0, 5.0)
x2 = np.linspace(0.0, 5.0)

y1 = np.sin(2 * np.pi * x1) * np.exp(-x1)
y2 = np.sin(2 * np.pi * x2)

# two rows, one column, first figure
plt.subplot(2, 1, 1)
plt.plot(x1, y1, 'o-')
plt.title('Two subplots') # title of the top subplot; use plt.suptitle() for a figure-wide title
plt.ylabel('Damped oscillation')

# two rows, one column, second figure
plt.subplot(2, 1, 2)
plt.plot(x2, y2, '.-')
plt.xlabel('time (s)')
plt.ylabel('Undamped')
plt.show()
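
The same two-panel figure can be written with `plt.subplots`, which creates the figure and all axes in one call. This sketch is an alternative to the `plt.subplot` calls above, not a replacement; the names `ax_top` and `ax_bottom` are chosen for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 5.0)

# Two rows, one column; sharex=True gives both panels the same x axis.
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)

ax_top.plot(x, np.sin(2 * np.pi * x) * np.exp(-x), "o-")
ax_top.set_ylabel("Damped oscillation")

ax_bottom.plot(x, np.sin(2 * np.pi * x), ".-")
ax_bottom.set_xlabel("time (s)")
ax_bottom.set_ylabel("Undamped")

fig.suptitle("Two subplots")  # title for the whole figure
fig.tight_layout()            # avoid overlapping labels
```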

---  

#### TRY FOR YOURSELF
Do the following:

- Use `matplotlib` to create your own visualizations based on the wine data
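- Challenge: Given response variable y and a list of X variables, write a *loop* that builds graphs of y versus each x in X.  
  Hint: draw each graph in its own subplot and call `plt.show()` once after the loop to show them all together.

In [None]:
y = df.pH
X = ['chlorides', 'citric acid', 'density']

nrows = len(X)

plt.figure(figsize=(6, 4 * nrows)) # one tall figure with room for every subplot
for row, x_name in enumerate(X):
    plt.subplot(nrows, 1, row + 1)
    plt.scatter(df[x_name], y)
    plt.title('pH vs {}'.format(x_name))
    plt.xlabel(x_name)
    plt.ylabel('pH')

plt.tight_layout() # keep titles and axis labels from overlapping
plt.show() # calling show() inside the loop would end the figure after the first subplot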