# Multivariate Data

In this notebook we are going to look at some simple multivariate data.

## What is Multivariate Data?

As the name implies, multivariate data is data that consists of more than two variables. The idea is the same as it is with bivariate data, except we have more variables to work with.

### Examples

Again, let's consider data from the *U.S. Standard Atmosphere 1976*. The original data and model, provided by NASA, can be found [here](https://ntrs.nasa.gov/api/citations/19770009539/downloads/19770009539.pdf). We, however, are using a stripped-down version sourced from [here](https://www.engineeringtoolbox.com/standard-atmosphere-d_604.html).

In [None]:
import pandas as pd
atmos = pd.read_csv('_resources/atmosphere.csv')
atmos

We can plot any one of these variables against altitude (generally our independent variable).

In [None]:
atmos.plot(
    x='altitude',
    y='atmospheric_pressure',
    xlabel='Altitude (m)',
    ylabel='Atmospheric Pressure ($10^{4} N/m^{2}$)',
    title='Altitude vs. Atmospheric Pressure',
)

In [None]:
atmos.plot(
    x='altitude',
    y='temperature',
    xlabel='Altitude (m)',
    ylabel=r'Temperature ($^{\circ}C$)',
    title='Altitude vs. Temperature',
)

Or we can plot them simultaneously on the same figure:

In [None]:
ax = atmos.plot(
    x='altitude',
    y=['atmospheric_pressure', 'temperature'],
    xlabel='Altitude (m)',
    title='Altitude vs. Atmospheric Pressure & Temperature',
)


The problem with putting both of these variables on the same plot is that they span very different numeric ranges. Temperature runs from about 20 down to nearly -80, while pressure only ranges from a little over 10.0 to nearly 0.0. It would help if we could place these variables on separate axes within our figure, so that they do not need to share the same range on our canvas.

In [None]:
ax = atmos.plot(
    x='altitude',
    y='atmospheric_pressure',
    xlabel='Altitude (m)',
    ylabel='Atmospheric Pressure ($10^{4} N/m^{2}$)',
    title='Altitude vs. Atmospheric Pressure & Temperature',
)
ax = atmos.plot(
    x='altitude',
    y='temperature',
    secondary_y=True,
    ax=ax,
)
ax.set_ylabel(r'Temperature ($^{\circ}C$)')


Now we can see both data sets in better detail (specifically atmospheric pressure), as temperature is added to a secondary axis. This allows us to avoid visually squashing the pressure data. We can even try to throw the rest of our metrics onto this dual-axis figure.

In [None]:
%matplotlib widget
ax = atmos.plot(
    x='altitude',
    y=['gravity', 'atmospheric_pressure', 'air_density', 'dynamic_viscosity'],
    xlabel='Altitude (m)',
    title='How Altitude affects Atmospheric Characteristics',
)
ax = atmos.plot(
    x='altitude',
    y='temperature',
    secondary_y=True,
    ax=ax,
)

Sadly, this is not an improvement, as some of our other variables are getting squished by the relatively large scale of the others. This plotting interface only gives us one secondary y-axis (we could add a z-axis, but then we cannot have a secondary y-axis), and so we are left to make multiple plots and figures, or to rescale some of our variables so that they fall within similar ranges. Rescaling data like this is OK so long as it is clear what scaling was done (usually by modifying how the units are presented).
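As a rough sketch of the rescaling idea, the cell below applies min-max scaling so that every chosen column runs from 0 to 1. The helper `min_max_scale` and the `toy` table with its `metric_a` and `metric_b` columns are hypothetical names with made-up values, standing in for `atmos`; other scalings (for example dividing by a power of ten) may suit a given dataset better.

```python
import pandas as pd


def min_max_scale(df, columns):
    """Return a copy of df with each listed column rescaled linearly to [0, 1]."""
    scaled = df.copy()
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        # Assumes the column is not constant (hi > lo).
        scaled[col] = (df[col] - lo) / (hi - lo)
    return scaled


# Toy stand-in for `atmos`; these numbers are made up for illustration only.
toy = pd.DataFrame({
    'altitude': [0, 1000, 2000, 3000],
    'metric_a': [10.0, 8.0, 6.0, 2.0],     # a "large-scale" variable
    'metric_b': [0.50, 0.45, 0.40, 0.30],  # a "small-scale" variable
})

scaled = min_max_scale(toy, ['metric_a', 'metric_b'])

# State the scaling in the axis label so the reader knows what was done.
ax = scaled.plot(
    x='altitude',
    y=['metric_a', 'metric_b'],
    xlabel='Altitude (m)',
    ylabel='Value (min-max scaled to 0-1)',
    title='Rescaled metrics on a shared axis',
)
```

With the real table, the same helper could presumably be called as `min_max_scale(atmos, [...])` with the column names used above. If separate panels are preferred over rescaling, passing `subplots=True` to `DataFrame.plot` draws each column on its own axes.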

A 3D plot gives us a single curve that shows how our data flows in all three dimensions, but it is not especially useful, and we cannot add secondary axes to any of its dimensions.

In [None]:
%matplotlib widget
import matplotlib.pyplot as plt
# Only needed on older matplotlib versions, where this import registers the '3d' projection
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(atmos.altitude, atmos.atmospheric_pressure, atmos.temperature)
ax.set_xlabel('Altitude (m)')
ax.set_ylabel('Atmospheric Pressure ($10^{4} N/m^{2}$)')
ax.set_zlabel(r'Temperature ($^{\circ}C$)')

One solution to encoding multiple variables of our data into our visualization is to employ some of the basic techniques we learned about last week: using markers and colors to represent other facets of our data, perhaps even keeping it as a 3D plot.

In [None]:
%matplotlib widget
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))
p = ax.scatter3D(
    atmos.altitude,
    atmos.atmospheric_pressure,
    atmos.temperature,
    c=atmos.air_density,  # color each point by air density
    cmap='rainbow',
)

ax.set_xlabel('Altitude (m)')
ax.set_ylabel('Atmospheric Pressure ($10^{4} N/m^{2}$)')
ax.set_zlabel(r'Temperature ($^{\circ}C$)')

cb = fig.colorbar(p, shrink=0.8, pad=0.15)
cb.set_label('Density of Air ($kg/m^{3}$)')

There are many ways for us to work with and visualize multivariate data, but the right choice is highly dependent on the data. Data that forms clusters can be visualized using methods like parallel coordinates or Andrews curves; categorical data may be representable with a radar plot; and sometimes the data just needs to be manipulated to reduce what an analyst is looking at. We will soon be looking in more detail at how to cluster and group data, especially when working with data that is not directly categorical.
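To give a feel for the cluster-oriented methods mentioned above, here is a minimal sketch using `parallel_coordinates` and `andrews_curves` from `pandas.plotting`. The data is synthetic: the helper `make_cluster`, the columns `w`, `x`, `y`, `z`, and the labels `A` and `B` are all invented for this illustration and have nothing to do with the atmosphere table.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves, parallel_coordinates

rng = np.random.default_rng(0)  # fixed seed so the figure is repeatable


def make_cluster(center, label, n=30):
    """Draw n made-up points scattered around `center` and tag them with `label`."""
    values = rng.normal(loc=center, scale=0.5, size=(n, len(center)))
    frame = pd.DataFrame(values, columns=['w', 'x', 'y', 'z'])
    frame['cluster'] = label
    return frame


# Two synthetic clusters in four variables.
data = pd.concat(
    [make_cluster([0, 2, 4, 1], 'A'), make_cluster([3, 0, 1, 4], 'B')],
    ignore_index=True,
)

colors = ['tab:blue', 'tab:orange']
fig, (ax_pc, ax_ac) = plt.subplots(1, 2, figsize=(10, 4))

# One line per row, crossing a vertical axis for each variable.
parallel_coordinates(data, 'cluster', ax=ax_pc, color=colors)
ax_pc.set_title('Parallel coordinates')

# One smooth curve per row, built from that row's values.
andrews_curves(data, 'cluster', ax=ax_ac, color=colors)
ax_ac.set_title('Andrews curves')
```

In both panels, rows from the same cluster should trace similar shapes, which is what makes these plots handy for spotting groups across more than three variables at once.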