# Matplotlib

Visualization plays an essential role in both analyzing data and sharing results. One widely used Python library for visualization is Matplotlib. If you have time, the full documentation for Matplotlib is available at https://matplotlib.org/contents.html, but this notebook should serve as a tutorial for the basic features relevant to Medlytics.

## Simple First Graph
As with any Python library, the first step is to import the required modules. We will start with the submodule "pyplot," which provides an interface for basic plotting. We will also import numpy, a scientific computing module that is used a lot in this course, and pandas, which we will use later in this notebook to load a dataset.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

For our first figure, we will plot a few arbitrary points and then apply some labels. Note that the arguments for "plot" are two lists of equal size, representing X and Y values. The notebook will automatically render the figure for us, but in stand-alone python code, the call to "plt.show()" becomes necessary.

In [None]:
# Plot some (X,Y) data points, defined manually
plt.plot([0,1,2,3,4], [1, 1.5, 2, 2.5, 3])
# Add labels to the current figure
plt.xlabel('X label')
plt.ylabel('Y label')
plt.title('Plot Points')
# Display the figure - not necessary in a jupyter notebook
plt.show()

## Figures and Subplots
You may have noticed the lack of reference to a figure object. Figures in Matplotlib are tracked behind the scenes, and pyplot functions such as "plt.title()" always act on the "current" figure. Since we didn't specify a figure, the default figure number (1) was used. We can create a new figure at a specific number with:

In [None]:
plt.figure(1)

Notice that this time, nothing was rendered. This is because the new figure 1 doesn't have any axes. Calling something like "plot" will generate axes automatically. Another option is to generate axes explicitly, which will give us greater control. For example, the "plt.subplot()" function allows us to have multiple sets of axes on the same figure. Note that each call to subplot will only create one set of axes.

For our example subplots, we will be using some arbitrary numpy array data. Note that for the first subplot axes we are passing multiple sets of (X,Y) data, and that each data argument pair is followed by a formatting argument.

In [None]:
plt.subplot(1,2,1) # 1 row, 2 columns, select subaxes number 1
# Pass multiple numpy arrays as arguments to plt.plot
x = np.arange(0, 10)
y = np.arange(0, 10)
y2 = np.arange(2, 12)
plt.plot(x, y, 'r-', x, y2, 'bs') # formatting: r:red -:line, b:blue s:square
plt.title("Fig Num 1")

plt.subplot(1,2,2)  # 1 row, 2 columns, select subaxes number 2
plt.plot(x, y, 'bo')  # formatting: b:blue o:circle
plt.title("Fig Num 2")

plt.show()
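The "current figure" and "current axes" bookkeeping described in this section can be inspected directly. The following is a minimal sketch using plt.gcf(), plt.gca(), and plt.sca(); the figure numbers and the title text are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Create two figures; the most recently created one becomes "current"
fig_a = plt.figure(1)
fig_b = plt.figure(2)
current_number = plt.gcf().number        # number of the current figure

# Calling plt.figure with an existing number switches back to that figure
plt.figure(1)
back_on_first = plt.gcf() is fig_a

# Each subplot call creates one set of axes and makes it the current axes
ax_left = plt.subplot(1, 2, 1)
ax_right = plt.subplot(1, 2, 2)
current_is_right = plt.gca() is ax_right

# plt.sca() re-selects an earlier set of axes, so plt.title() now targets it
plt.sca(ax_left)
plt.title("left panel")
left_title = ax_left.get_title()

print(current_number, back_on_first, current_is_right, left_title)
plt.close("all")  # release the figures created by this sketch
```

Keeping references such as ax_left avoids having to remember which axes are current at any given moment.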

In our final example, we will use a different function to generate multiple axes: plt.subplots(), which generates all axes at the same time. In order to track these axes, we will save the output of the function to some variables. We could have been doing this in previous examples, but so far we haven't needed to modify anything other than the current axes. Also note that the first output returned from subplots is actually the figure itself. We can use this saved figure to call "show()," which will ensure only this figure is rendered.


In [None]:
# Generate some numpy data to plot
x = np.arange(0,5, .05)
y = np.sin(x ** 2)
# Create a new set of axes, and a new figure in the process
fig, (ax1, ax2) = plt.subplots(1, 2)  # 1 row, 2 columns
ax1.plot(x, y)
ax1.set_title('ax1')
ax2.plot(x, y, 'ro')
ax2.set_title('ax2')

fig.show()
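When plt.subplots() is asked for more than one row and more than one column, it returns the axes as a 2D numpy array rather than a tuple. The sketch below assumes a 2x2 grid and arbitrary format strings, and loops over the axes with the array's "flat" iterator.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 5, 0.05)

# 2 rows, 2 columns: axes is a numpy array of shape (2, 2)
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(8, 6))

styles = ["b-", "ro", "g--", "ks"]   # arbitrary format strings for illustration
powers = [1, 2, 3, 4]

# axes.flat walks the grid row by row
for ax, style, power in zip(axes.flat, styles, powers):
    ax.plot(x, np.sin(x ** power), style, markersize=3)
    ax.set_title(f"sin(x^{power})")

fig.tight_layout()  # prevent titles and tick labels from overlapping
```

Individual axes can still be addressed by position, for example axes[0, 1] for the top-right panel.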

The Matplotlib documentation mentioned above contains a lot more tutorial content, such as for pyplot:

https://matplotlib.org/users/pyplot_tutorial.html

# Seaborn

Seaborn is another data visualization library, built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Here are a few examples of how you can use it:

In [None]:
import seaborn as sns

data = pd.read_csv("https://raw.githubusercontent.com/BeaverWorksMedlytics2020/Data_Public/master/NotebookExampleData/Week1/BWSI_Dataset1.csv")
data.head()

In [None]:
# distribution of age using seaborn
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Age'], kde=True, stat='density') #More info on seaborn.histplot(...) here: https://seaborn.pydata.org/generated/seaborn.histplot.html
plt.xlabel('Age')
plt.ylabel('Density')

In [None]:
# boxplot of age 
sns.boxplot(x = 'Age', data = data) #More info on seaborn.boxplot(...) here: https://seaborn.pydata.org/generated/seaborn.boxplot.html
plt.xlabel('Age')

In [None]:
# frequency of sex using seaborn
sns.countplot(x = 'Sex', data = data) 
plt.xticks(np.arange(2), labels = ('Male', 'Female'), rotation = 0)
plt.xlabel('Sex')

In [None]:
sns.set(rc={'figure.figsize':(18, 18)}) # bigger figures
sns.heatmap(data.corr(numeric_only = True), annot = True) # creating a heatmap of all numeric variables in the data set
#More info on seaborn.heatmap(...) here: https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
g = sns.FacetGrid(data, col = 'Sex') # facet distribution plots by sex
g = g.map(sns.kdeplot, 'Age')

In [None]:
# pairwise plots only for selected variables
sns.pairplot(data, vars =['Age', 'ParentEducation', 'HouseIncome', 'BMI', 'BloodPres', 'HoursTV'], hue="Sex")
#More info on seaborn.pairplot(...) here: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
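# saving figures: call savefig in the same cell that draws the plot,
# because the notebook closes each figure once its cell has been rendered
sns.boxplot(x = 'Age', data = data)
plt.savefig("figs.png") #You can access the saved image on the left-hand side of your screen, where the Colab files are located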
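Saving tends to be more predictable when you keep a reference to the figure object and call its savefig() method. The sketch below is only an illustration: the output file name and the dpi value are hypothetical choices, and the file is written to the system's temporary directory.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

# Draw a small figure and keep a handle to it
fig, ax = plt.subplots(figsize=(4, 3))
x = np.linspace(0, 5, 100)
ax.plot(x, np.sin(x ** 2))
ax.set_title("saved figure")

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "example_fig.png")

# dpi controls resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print("saved to", out_path)
```

Because fig.savefig() is tied to one figure object, it does not depend on which figure happens to be current.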

## Exploratory Data Analysis

Seaborn and Matplotlib are very effective tools for conducting **Exploratory Data Analysis (EDA)**. EDA and data preparation are critical for building more effective ML models.
EDA can fall under three categories: Univariate, Bivariate, or Multivariate. These analyses can provide insight when manually choosing the best features during Feature Selection.

**Univariate analysis** means the analysis of one variable at a time. It relies on summary statistics such as the mean, variance, standard deviation, and quartiles, among others. Learning about the distribution of each variable through univariate analysis is a necessary step before measuring the effectiveness of an ML algorithm's predictions.

In [None]:
data.describe() #provides (typically continuous) summary statistics for univariate analysis 

In [None]:
#To find the descriptive statistics of categorical variables (which we don't have here, but an example is still provided): pandas.Series.value_counts()
#value_counts() tells you how many times each distinct value occurs in a single Series.

print(data.Age.value_counts()) #data.Age refers to the Age Series from the data Dataframe
Years = data["Year"]
print(Years.value_counts())
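Since the dataset above has no text-valued categorical column, here is a minimal sketch of value_counts() on a made-up categorical Series. The column name "BloodType" and its values are hypothetical and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical categorical data, invented for illustration only
blood_type = pd.Series(["A", "O", "O", "B", "O", "A"], name="BloodType")

# Absolute counts per category, sorted from most to least frequent
counts = blood_type.value_counts()

# normalize=True returns the share of each category instead of the count
shares = blood_type.value_counts(normalize=True)

print(counts)
print(shares)
```

The shares always sum to 1, which makes them convenient for comparing categories across datasets of different sizes.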

Next, let's make sense of these summary statistics by visualizing them. Univariate analysis is commonly visualized with bar charts, line charts, and histograms.

In [None]:
#Histograms are often used to understand and visualize frequency distribution
data.hist()
plt.show()

There are a lot of different shapes to these histograms. Some can be explained easily: Subject ID is linear, and Height is more or less a Normal Distribution. If you didn't know, Normal Distribution is capitalized because it refers to a specific shape: the so-called bell curve. It is symmetric about its mean, and its width is set by its variance.

Another thing you might notice is that some plots, such as MaritalStatus and HoursTV, seem to only have one or two bars, and they are clustered at 0. This suggests the presence of an **outlier** in these variables: a few values that are far outside the rest of the distribution. In the case of HoursTV, what do you notice about the axis?

If we return to the table above, we see that 75% of the values for HoursTV are 4 or less, yet the maximum is 77. We might surmise that HoursTV should never exceed 24, and that the outlier of 77 is actually an invalid datum. Findings like this can help us **"clean"** our data by removing invalid entries, which will improve our modeling.
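A minimal sketch of this kind of cleaning step is shown below. It uses a small made-up DataFrame instead of the course dataset, and the 0 to 24 hour range is the assumption stated above about what counts as a valid HoursTV value.

```python
import pandas as pd

# Made-up values for illustration; 77 plays the role of the invalid entry
toy = pd.DataFrame({"HoursTV": [0, 1, 2, 4, 3, 77, 2]})

# Boolean mask: True where the value lies in the assumed valid range
is_valid = toy["HoursTV"].between(0, 24)

# Keep only the valid rows and count how many were dropped
cleaned = toy[is_valid]
n_removed = int((~is_valid).sum())

print(f"removed {n_removed} row(s); new maximum is {cleaned['HoursTV'].max()}")
```

Counting the removed rows is a useful sanity check: if a filter drops far more rows than expected, the validity rule itself may be wrong.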

For the Age variable, the histogram is skewed to the right, such that the mean is greater than the median. Because Age doesn't fit a Normal Distribution, we need to **normalize** the data before inputting it to a machine learning algorithm. Once the data is Normal (like the Height variable is already), the next step is to **standardize** the data, so different variables are comparable. We will talk more about these steps later on.
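As a preview, the sketch below applies a log transform to reduce right skew and then a z-score to standardize. The data is synthetic, and the log transform is an assumption chosen for illustration; it is one common option, not necessarily the method used later in the course.

```python
import numpy as np

def skewness(values):
    """Sample skewness: positive for right-skewed data, near 0 if symmetric."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Synthetic right-skewed values (not the course dataset)
rng = np.random.default_rng(0)
ages = rng.lognormal(mean=3.5, sigma=0.5, size=1000)

# Step 1: a log transform pulls in the long right tail
log_ages = np.log(ages)
skew_before = skewness(ages)
skew_after = skewness(log_ages)

# Step 2: standardize to zero mean and unit standard deviation (z-score)
z_scores = (log_ages - log_ages.mean()) / log_ages.std()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

After standardization, a value of 1.0 means "one standard deviation above the mean" regardless of the variable's original units.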

**Bivariate analysis** is conducted to determine the relationship between two variables. It aims to investigate how one variable impacts another. Examples include Scatterplots, Stacked Bar Charts, and Bivariate Line Charts. It is also very important for identifying which features to use. 

In [None]:
data.head(10)

In [None]:
"""
Pick the variables from the heatmap that have some of the highest correlation values.
(For now, let's analyze linear relationships. Here is an intro to curvilinear relationships:
https://www.analyticsvidhya.com/blog/2020/03/polynomial-regression-python/). 

For example, Protein and TotalFat have a correlation of 0.74, which is the highest r value
among the variables. But remember to check the correlation of other variables with
a scatterplot to see how spread out the data is.

REMEMBER: Correlation DOES NOT equal Causation

""" 
x = data["Protein"]
y = data["TotalFat"]
plt.scatter(x, y)
plt.xlabel("Protein")
plt.ylabel("TotalFat")
plt.title("Scatter plot of Protein vs. TotalFat")
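Instead of reading the strongest correlation off the heatmap by eye, the pairs can be ranked programmatically. The sketch below uses synthetic data with hypothetical relationships between the columns; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: TotalFat is built to depend on Protein, HoursTV is independent
rng = np.random.default_rng(0)
n = 200
protein = rng.normal(60, 15, n)
toy = pd.DataFrame({
    "Protein": protein,
    "TotalFat": 0.8 * protein + rng.normal(0, 8, n),
    "HoursTV": rng.normal(3, 1, n),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by the absolute value of r, strongest first
ranked = pairs.sort_values(key=np.abs, ascending=False)
strongest_pair = ranked.index[0]

print(ranked)
```

Ranking by absolute value matters because a strong negative correlation is just as informative as a strong positive one.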

From the graph there is an apparent strong positive association between TotalFat and Protein. However, many of the points overlap (overplotting), which hides how densely the data is packed. Fortunately, we have a solution: create a hex plot!

In a hex plot, points are aggregated into hexagonal bins, and the shading of each hexagon represents the number of points that fall inside it.

In [None]:
data.plot.hexbin(x = "Protein", y = "TotalFat", gridsize = 15, ec = "r", sharex=False)
#For Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html 

That's better. Now we can see that the data is concentrated around 50 grams of TotalFat and around 60 grams of Protein. If the data were more spread out, that would improve the effectiveness of the ML output!
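The same binning is also available directly in Matplotlib through Axes.hexbin(). The sketch below uses synthetic data in place of the course dataset, and the colormap and colorbar label are arbitrary choices; the colorbar makes the meaning of the shading explicit.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the Protein and TotalFat columns
rng = np.random.default_rng(0)
n = 2000
protein = rng.normal(60, 15, n)
total_fat = 0.8 * protein + rng.normal(0, 8, n)

fig, ax = plt.subplots(figsize=(5, 4))

# Each hexagon is colored by how many points fall inside it
hb = ax.hexbin(protein, total_fat, gridsize=15, cmap="viridis")
fig.colorbar(hb, ax=ax, label="points per hexagon")
ax.set_xlabel("Protein")
ax.set_ylabel("TotalFat")

# The per-hexagon counts are available from the returned collection
counts = hb.get_array()
busiest = int(counts.max())
print("most crowded hexagon holds", busiest, "points")
```

A smaller gridsize gives larger hexagons and a smoother picture, while a larger gridsize shows finer detail at the cost of noisier counts.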

That's all for now! Here is more info if you want an intro to Multivariate Analysis: https://machinelearningmastery.com/visualize-machine-learning-data-python-pandas/