# Bivariate Data

In this notebook we are going to look at some simple bivariate data.

## What is Bivariate Data?

As the name implies, bivariate data is data that consists of two variables. We compare and analyze the two variables with respect to one another to find and explain the relationship between them. It is also possible that one of these variables depends on the other, in which case we have an independent variable and a dependent variable.

Some examples of bivariate data:

* altitude and air density
* ice cream sales and temperature throughout a day
* mana cost of an MTG card and the turns remaining in a game

When we have bivariate data, there are some simple things we can do to help us understand what sort of relationship our variables have with one another. We want to be able to visualize our data as well as visualize the relationship (if there is one) and quantify it. We can use a combination of `pandas`, `numpy`, and `matplotlib`/`seaborn`/`bokeh` to handle the analysis and visualization.
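
One common way to quantify a relationship is a correlation coefficient. The following is a minimal sketch using made-up values, not any dataset from this notebook; the column names `x` and `y` are placeholders. It assumes that Pearson's coefficient (linear association) and Spearman's coefficient (monotonic association) are the measures of interest.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data: y falls off quickly as x grows (not real measurements).
df = pd.DataFrame({'x': np.arange(1, 11, dtype=float)})
df['y'] = 1.0 / df['x']

# Pearson's r measures how well a straight line describes the relationship.
pearson = df['x'].corr(df['y'])

# Spearman's coefficient is Pearson's r computed on the ranks of the values,
# so it measures whether the relationship is monotonic, linear or not.
spearman = df['x'].rank().corr(df['y'].rank())

print(f'Pearson:  {pearson:.3f}')
print(f'Spearman: {spearman:.3f}')
```

Because `y` always decreases as `x` increases, the rank-based coefficient reports a perfect monotonic relationship, while Pearson's coefficient is weaker because the relationship is not a straight line.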

Let's take a look at a few different datasets that contain bivariate data.

### Examples

First, let's consider data from the *U.S. Standard Atmosphere 1976*. We simply want to look at how the density of air changes as we move higher into the atmosphere. Our simple data set (a full one has also been provided) consists of altitudes in meters and densities in kilograms per cubic meter.

The original data and model, provided by NASA, can be found [here](https://ntrs.nasa.gov/api/citations/19770009539/downloads/19770009539.pdf). We, however, are using a stripped-down version sourced from [here](https://www.engineeringtoolbox.com/standard-atmosphere-d_604.html).

In [None]:
import pandas as pd
atmos = pd.read_csv('_resources/atmosphere_simple.csv')
atmos

In [None]:
atmos.plot.scatter(
    x='altitude',
    y='air_density',
    xlabel='Altitude (m)',
    ylabel='Density of Air ($kg/m^{3}$)',
    title='U.S. Standard Atmosphere, Altitude vs. Density of Air'
)

We can see that there is a pretty strong relationship between these two variables (in the direction that we expect - at higher altitudes air is much thinner!). We can also see that this relationship is non-linear - air density decreases rapidly as we ascend to about 20 km in altitude. It then seems to asymptotically approach a density of 0 as we continue upward (in reality it is not asymptotic, as by around 100,000 m, or 100 km, atmospheric pressure, and thus the density of air, is effectively 0).
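
One rough way to check whether a decay like this is approximately exponential is to take the logarithm of the dependent variable and see whether the relationship becomes linear. The sketch below uses synthetic densities generated from a simplified isothermal model, `rho = rho_0 * exp(-h / H)`, not the values in `atmosphere_simple.csv`. The sea-level density `RHO_0` and the scale height `H` are assumed round numbers chosen for illustration.

```python
import numpy as np
import pandas as pd

RHO_0 = 1.225  # assumed sea-level air density in kg/m^3 (illustrative)
H = 8500.0     # assumed scale height in m (illustrative)

# Synthetic stand-in for the notebook's data, using the same column names.
synthetic = pd.DataFrame({'altitude': np.arange(0, 50001, 5000, dtype=float)})
synthetic['air_density'] = RHO_0 * np.exp(-synthetic['altitude'] / H)

# If density decays exponentially, log(density) is a straight line in altitude.
synthetic['log_density'] = np.log(synthetic['air_density'])

r_raw = synthetic['altitude'].corr(synthetic['air_density'])
r_log = synthetic['altitude'].corr(synthetic['log_density'])

# Straight-line fit to log-density; in this model the slope equals -1 / H.
slope, intercept = np.polyfit(synthetic['altitude'], synthetic['log_density'], deg=1)
estimated_scale_height = -1.0 / slope

print(f'r (raw density): {r_raw:.3f}')
print(f'r (log density): {r_log:.3f}')
print(f'estimated scale height: {estimated_scale_height:.0f} m')
```

The real atmosphere is not isothermal, so the actual data would not be expected to fall exactly on a straight line after the transform; the sketch only shows the idea.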

We can also take a look at [data from a study](https://www.kaggle.com/tunguz/drug-use-by-age) on drug use of individuals across multiple age groups. While the data presented has many variables, we are going to treat any drug-usage statistic and the age group as a set of bivariate data.

In [None]:
drug_use = pd.read_csv('_resources/drug_use_by_age.csv')
drug_use

We are going to look at how age and alcohol use trend with one another. We are going to add some axis labels, a title, and a rotation to the x-axis tick labels.

Note that the age group is actually stored as string data, as some of our values represent ranges instead of individual values.

In [None]:
drug_use.plot.scatter(
    x='age',
    y='alcohol-use',
    rot=45,
    xlabel='Age Group',
    ylabel='Percentage That Consumed Alcohol',
    title='Age Group and Alcohol Use'
)

We can see that our data exhibits a roughly quadratic pattern. Alcohol use increases with age to a maximum around 22 years old and then decreases.
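
A rise-then-fall pattern like this can be summarized by fitting a second-degree polynomial and reading off its vertex. The following is a minimal sketch on made-up values, not the survey data; the arrays `x` and `y` are hypothetical, and it assumes the horizontal axis is numeric.

```python
import numpy as np

# Made-up values shaped like a rise-then-fall pattern (not the survey data).
x = np.arange(12, 33, dtype=float)   # hypothetical numeric ages
y = 80.0 - 0.5 * (x - 22.0) ** 2     # hypothetical percentages peaking at 22

# Fit y ~ a*x^2 + b*x + c; np.polyfit returns coefficients from highest degree down.
a, b, c = np.polyfit(x, y, deg=2)

# A parabola with a < 0 has its maximum at the vertex x = -b / (2a).
peak_x = -b / (2 * a)

print(f'a = {a:.3f}, estimated peak at x = {peak_x:.1f}')
```

A negative leading coefficient `a` confirms the curve opens downward, and the vertex gives an estimate of where the maximum occurs.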

Note that our age axis is not to scale - the first half of the axis increases by 1 year at every step, but the groups then span more and more years, with our final data point encompassing all ages 65 and greater.
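
If a true-to-scale age axis is wanted, one option is to convert each age-group label to a number before plotting. Below is a minimal sketch. The helper `age_group_midpoint` is hypothetical, the label formats (`'12'`, `'22-23'`, `'65+'`) are assumed from the description above, and treating an open-ended group such as `'65+'` as its lower bound is an arbitrary choice.

```python
import pandas as pd

def age_group_midpoint(label: str) -> float:
    """Convert an age-group label to a single representative age (hypothetical helper)."""
    label = label.strip()
    if label.endswith('+'):
        # Open-ended group such as '65+': use the lower bound (arbitrary choice).
        return float(label[:-1])
    if '-' in label:
        # Range such as '22-23': use the midpoint of the two bounds.
        low, high = label.split('-')
        return (float(low) + float(high)) / 2
    # Single age such as '12'.
    return float(label)

# Example labels in the assumed formats (not read from the CSV).
example_labels = pd.Series(['12', '21', '22-23', '26-29', '35-49', '65+'])
age_numeric = example_labels.map(age_group_midpoint)

print(age_numeric.tolist())
```

With the notebook's data this could be applied as `drug_use['age'].map(age_group_midpoint)`, assuming every label follows one of these three formats.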