# Data visualisation with Python
### Yngve Mardal Moe

## Agenda:

 1. Creating simple plots with Pandas
 2. Creating informative visualisations with Seaborn
 3. Customising our visualisations with Matplotlib
 4. Key concepts in creating useful visualisations

## Part 1: Creating simple plots with Pandas
We'll start by using Pandas to load our data and create some simple visualisations.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_columns', 100)  # full option name; the bare 'max_columns' pattern is rejected by newer Pandas

In [2]:
%matplotlib notebook

In [3]:
weather = pd.read_excel('weather_data.xlsx', index_col='dato')

In [4]:
weather.head()

dato,albedo,balanse,diffus,fd,fluxm,fluxs,global,grmin,irød,jt010,jt100,jt002,jt020,jt005,jt050,lp,lt,ltmax,ltmin,nb,par,rf,sd,sdman,synlig,uv,vh,vhmax,vr
1988-01-01,,,,,,,,,,,,,,,,,5.1,5.9,3.0,,,,,,,,,,
1988-01-02,,,,,,,,,,,,,,,,,3.7,5.3,1.5,20.2,,,,,,,,,
1988-01-03,,,,,,,,,,,,,,,,,1.3,2.1,0.2,,,,,,,,,,
1988-01-04,,,,,,,,,,,,,,,,,0.1,0.4,-0.1,,,,,,,,,,
1988-01-05,,,,,,,,,,,,,,,,,-0.1,0.6,-0.5,0.0,,,,,,,,,


In [5]:
weather.tail()

dato,albedo,balanse,diffus,fd,fluxm,fluxs,global,grmin,irød,jt010,jt100,jt002,jt020,jt005,jt050,lp,lt,ltmax,ltmin,nb,par,rf,sd,sdman,synlig,uv,vh,vhmax,vr
2018-12-27 00:10:00,0.648644,-1.037384,0.520922,,-3.85984,-0.33349,0.542782,-7.409,38.36279,0.694146,3.838215,0.449583,1.094819,0.664396,2.068958,1000.777,-1.244958,0.695,-4.085,1.8,1.339207,99.99931,,13.0,55.943656,5.693554,0.776569,2.105,NØ
2018-12-28 00:10:00,0.458021,-1.520098,0.991158,,-4.974732,-0.429817,1.719382,-12.72,57.93696,0.775146,3.810514,0.444639,1.082417,0.668576,2.049465,1005.838,-6.232451,-4.125,-8.2,0.0,2.614267,98.92361,,,39.267257,2.795783,,,
2018-12-29 00:10:00,0.879624,-0.61726,0.57752,,-5.268014,-0.455156,0.61319,-11.05,46.90142,0.660903,3.7595,0.332375,1.023354,0.542597,1.984889,1003.835,-4.474945,-1.8,-7.296,0.1,1.723672,99.4375,,,47.482347,5.616233,,,
2018-12-30 00:10:00,0.476756,-0.901978,0.896532,,-5.038892,-0.43536,1.807672,-11.05,58.87096,0.611986,3.728389,0.261771,0.933201,0.470632,1.91184,1009.307,-3.461583,-1.738,-5.724,0.3,3.238451,99.72916,,,38.155978,2.973062,,,
2018-12-31 22:10:00,0.605755,-1.960404,0.422131,,-4.760121,-0.411274,0.454753,-2.048,24.69542,0.509465,3.67366,0.23116,0.871861,0.429556,1.876417,1002.851,1.882986,4.524,-1.915,1.9,1.215902,99.46528,,13.0,67.885875,7.418705,4.38793,10.2,S


Now, we are ready to create our first visualisations. We start with a simple line plot of the ``synlig`` column, the percentage of visible light in the radiation.

In [6]:
weather.plot(y='synlig')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0cd754da0>

The plot above is very difficult to read because of the small fluctuations. We can easily compute the monthly averages using the ``groupby`` function we introduced in the last lecture.

In [7]:
monthly_weather = weather.groupby(pd.Grouper(freq='M')).mean()
monthly_weather.plot(y='synlig')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0cdc4d128>
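
To make the monthly aggregation concrete, here is a minimal sketch on made-up data (the ``daily`` frame and its ``value`` column are hypothetical, not part of the weather dataset). It groups on ``index.to_period('M')``, which is an assumption of this sketch: it yields the same calendar-month buckets as the ``pd.Grouper`` call above, but labels each group with a month period instead of a month-end timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data for 2018: a slow seasonal cycle plus a fast wiggle.
days = pd.date_range('2018-01-01', '2018-12-31', freq='D')
t = np.arange(len(days))
daily = pd.DataFrame(
    {'value': 10 * np.sin(2 * np.pi * t / 365) + 3 * np.sin(2.5 * t)},
    index=days,
)

# Group the rows by calendar month and average every column within each group.
monthly = daily.groupby(daily.index.to_period('M')).mean()

print(len(daily), len(monthly))  # 365 12
```

Each row of ``monthly`` summarises one month, which is why the monthly line plot is so much smoother than the daily one.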

We can easily plot multiple columns in the same figure.

In [8]:
monthly_weather.plot(y=['synlig', 'irød'])

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0cd8d6080>

Alternatively, we can choose which columns to put on the x- and y-axes.

In [9]:
monthly_weather.plot(x='global', y='synlig', kind='scatter')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0ccd56470>

In [10]:
monthly_weather.plot(x='global', y='synlig', kind='scatter', s=10)  # We can change the marker area with the s argument

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x28f6b5933c8>

## Exercise
Try to recreate the following plot, where we have plotted the minimum temperature 2 cm above grass level (``grmin``) against time.

In [10]:
weather.plot(y='grmin')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0cd821f28>

## Let us remove those outliers
First, we find them.

In [11]:
grmin = weather['grmin']
print(grmin[grmin.abs() > 100])

dato
2010-09-07 00:10:00    6999.0
2010-09-08 00:10:00    6999.0
2010-09-09 00:10:00    6999.0
2010-09-10 00:10:00    6999.0
2010-09-11 00:10:00    6999.0
2010-09-12 00:10:00    6999.0
2010-09-13 00:10:00    6999.0
2010-09-14 00:10:00    6999.0
2010-09-17 00:10:00    6999.0
2010-09-18 00:10:00    6999.0
2010-09-19 00:10:00    6999.0
2010-09-20 00:10:00    6999.0
2010-09-21 00:10:00    6999.0
2010-09-22 00:10:00    6999.0
2010-09-23 00:10:00    6999.0
2010-09-24 00:10:00    6999.0
2010-09-25 00:10:00    6999.0
2010-09-26 00:10:00    6999.0
2010-09-27 00:10:00    6999.0
2010-09-28 00:10:00    6999.0
2010-09-29 00:10:00    6999.0
2010-09-30 00:10:00    6999.0
2010-10-01 00:10:00    6999.0
2010-10-02 00:10:00    6999.0
2010-10-03 00:10:00    6999.0
2010-10-04 00:10:00    6999.0
2010-10-05 00:10:00    6999.0
2012-08-12 00:10:00   -6999.0
Name: grmin, dtype: float64


Then we replace them with NaN. We assign through ``.loc`` so that Pandas modifies the data frame itself rather than a copy of a slice.

In [12]:
weather.loc[weather['grmin'].abs() > 100, 'grmin'] = np.nan

weather.plot(y='grmin')


<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0cd949390>
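
Why NaN rather than, say, zero? Pandas skips NaN when it aggregates, so the placeholder readings stop distorting averages. The sketch below illustrates this on a tiny made-up frame (``readings`` and its values are hypothetical; only the ±6999 placeholder is taken from the output above).

```python
import numpy as np
import pandas as pd

# Hypothetical readings containing the +/-6999 placeholder values.
readings = pd.DataFrame(
    {'grmin': [1.5, 6999.0, -2.0, -6999.0, 0.3]},
    index=pd.date_range('2010-09-06', periods=5, freq='D'),
)

# Boolean mask of physically implausible temperatures.
is_placeholder = readings['grmin'].abs() > 100

# Assign through .loc so the frame itself is modified.
readings.loc[is_placeholder, 'grmin'] = np.nan

n_missing = int(readings['grmin'].isna().sum())
mean_without_placeholders = readings['grmin'].mean()  # NaN values are skipped
print(n_missing)  # 2
```

The mean is now computed from the three plausible readings only.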

Now, we must create our monthly_weather dataframe once more.

In [13]:
monthly_weather = weather.groupby(pd.Grouper(freq='M')).mean()

## Changing column names

We see that the column names decide the plot labels, so it can be useful to rename them to something more informative.

To do that, we need something called a dictionary, or a key-value mapping. Let us create one now.

In [32]:
name_lookup = {
    'lt': 'Air temperature',
    'lp': 'Air pressure',
    'dato': 'Date',
    'grmin': 'Minimum grass temperature'
}

print(name_lookup)

{'lt': 'Air temperature', 'lp': 'Air pressure', 'dato': 'Date', 'grmin': 'Minimum grass temperature'}


To explain how a dictionary works, we simply use an example:

In [33]:
print(name_lookup['lt'])

Air temperature


A dictionary is, in other words, akin to a lookup table. We have keys, which are the elements before the colons, and values, which are the elements after the colons. We can then use a key to get the corresponding value.

It is important to know that a dictionary points just one way, keys -> values. The following lookup therefore fails:

In [34]:
name_lookup['Air temperature']

KeyError: 'Air temperature'
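
If a lookup in the other direction is needed, one option is to build a second dictionary with keys and values swapped. The sketch below is an illustration only; ``reverse_lookup`` is a hypothetical name, and the approach assumes that the values are unique.

```python
name_lookup = {
    'lt': 'Air temperature',
    'lp': 'Air pressure',
}

# Swap keys and values to get a dictionary that points the other way.
reverse_lookup = {new: old for old, new in name_lookup.items()}
print(reverse_lookup['Air temperature'])  # lt

# .get returns a fallback value instead of raising KeyError for unknown keys.
fallback = name_lookup.get('Air temperature', 'not a key')
print(fallback)  # not a key
```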

With dictionaries explained, it is time to rename the columns. To do this, we simply use the rename function of the data frame, and supply it with a dictionary whose keys are original column names and values are new column names.

In [35]:
weather = weather.rename(name_lookup, axis=1)
weather.index.name = 'Date'

In [36]:
weather.plot(y='Air temperature')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d7f58a58>
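
Two details of ``rename`` are worth seeing on a small example: it returns a new data frame rather than changing the original, and keys that match no column are ignored by default. The latter is why the ``'dato'`` entry in ``name_lookup`` is harmless even though ``dato`` is the index and not a column. The ``frame`` below is made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame using the original column names.
frame = pd.DataFrame({'lt': [5.1, 3.7], 'lp': [1000.8, 1005.8]})

# 'dato' matches no column, so that entry is silently skipped.
renamed = frame.rename({'lt': 'Air temperature', 'dato': 'Date'}, axis=1)

print(list(renamed.columns))  # ['Air temperature', 'lp']
print(list(frame.columns))    # ['lt', 'lp']
```

This is also why the notebook assigns the result back to ``weather``.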

## Exercises

 1. Create a line plot of the air pressure against the date.
 2. Are there any missing values for the air pressure? If so, find them, set them equal to NaN and plot the air pressure once more.
 3. Rename the "synlig" column to "Percentage visible light", the "irød" column to "Percentage infra-red light" and the "uv" column to "Percentage UV light".
 4. Create a scatter plot where the percentage visible light is on the x-axis and the percentage UV light is on the y-axis.
 5. Aggregate weather into a monthly_weather data frame containing the monthly averages for each column.

In [37]:
weather.plot(y='Air pressure')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d4b2c518>

In [None]:
air_pressure = weather['Air pressure']
air_pressure[air_pressure > 1200] = np.nan
weather['Air pressure'] = air_pressure

In [38]:
weather.plot(y='Air pressure')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d4be7b38>

In [27]:
weather = weather.rename(
    {
        'synlig': 'Percentage visible light',
        'irød': 'Percentage infra-red light',
        'uv': 'Percentage UV light',
        'global': 'Total amount of EM radiation'
    },
    axis=1
)

In [39]:
weather

Date,albedo,balanse,diffus,fd,fluxm,fluxs,Total amount of EM radiation,Minimum grass temperature,Percentage infra-red light,jt010,jt100,jt002,jt020,jt005,jt050,Air pressure,Air temperature,ltmax,ltmin,nb,par,rf,sd,sdman,Percentage visible light,Percentage UV light,vh,vhmax,vr
1988-01-01 00:00:00,,,,,,,,,,,,,,,,,5.100000,5.900,3.000,,,,,,,,,,
1988-01-02 00:00:00,,,,,,,,,,,,,,,,,3.700000,5.300,1.500,20.2,,,,,,,,,
1988-01-03 00:00:00,,,,,,,,,,,,,,,,,1.300000,2.100,0.200,,,,,,,,,,
1988-01-04 00:00:00,,,,,,,,,,,,,,,,,0.100000,0.400,-0.100,,,,,,,,,,
1988-01-05 00:00:00,,,,,,,,,,,,,,,,,-0.100000,0.600,-0.500,0.0,,,,,,,,,
1988-01-06 00:00:00,,,,,,,,,,,,,,,,,-0.500000,0.200,-1.400,3.5,,,,,,,,,
1988-01-07 00:00:00,,,,,,,,,,,,,,,,,-1.300000,0.200,-5.600,0.6,,,,,,,,,
1988-01-08 00:00:00,,,,,,,,,,,,,,,,,-6.200000,-4.500,-10.700,0.1,,,,,,,,,
1988-01-09 00:00:00,,,,,,,,,,,,,,,,,0.500000,2.900,-4.900,,,,,,,,,,
1988-01-10 00:00:00,,,,,,,,,,,,,,,,,4.200000,5.800,1.900,5.5,,,,,,,,,


In [40]:
weather.plot(
    x='Percentage visible light',
    y='Percentage UV light',
    kind='scatter',
    s=1,
    alpha=0.1
)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d4cd9978>

In [43]:
monthly_weather = weather.groupby(pd.Grouper(freq='M')).mean()

## Creating publication-ready visualisations with Seaborn and Matplotlib
The plot function in Pandas makes it easy to explore high-dimensional datasets. However, the resulting plots require much customisation to be publication ready. A very good tool for this is seaborn, and I recommend always starting with seaborn when you want to create beautiful visualisations.

In [44]:
import seaborn as sns

In [45]:
plt.figure()
sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', data=monthly_weather)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d4d498d0>

## Changing plot style
Thus far, we have used the default settings for seaborn. However, we can easily adapt our plots to fit a paper or a poster using the ``set_context`` function.

In [46]:
sns.set_context('poster')

In [47]:
plt.figure()
sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', data=monthly_weather)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d7cc5198>

We see that the figure is too small to fit a poster. We still need to change the figure size, which is easily done.

In [48]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', data=monthly_weather)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d7d03400>

Note that the ``set_context`` function affects all of matplotlib, and therefore also Pandas plotting.

In [49]:
monthly_weather.plot(y='Percentage UV light')

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d7d65da0>

If we want a figure for a paper, we set the corresponding context.

In [50]:
sns.set_context('paper')

In [51]:
plt.figure()
sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', data=monthly_weather)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0d7df91d0>

## Labelling our graphs

Our plots are still not publication ready. The remaining issue is the axis labels and the plot title. How these work ties in with the Matplotlib figure model, which we will discuss in depth later.

In [52]:
plt.figure()
axes = sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', data=monthly_weather)
axes.set_xlabel('Percentage UV radiation')  # x-axis label (was mistakenly set with set_ylabel)
axes.set_ylabel('Global radiation')
axes.set_title('Percentage UV radiation versus global radiation', size=12)

<IPython.core.display.Javascript object>

Text(0.5, 1.0, 'Percentage UV radiation versus global radiation')
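
The labelling calls are ordinary Matplotlib ``Axes`` methods, so they work on any axes object, not only on those returned by seaborn. Below is a minimal standalone sketch with made-up data points; the ``Agg`` backend is an assumption chosen so that the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [10, 20, 15])  # made-up data points

# One call per label: x-axis, y-axis and title.
ax.set_xlabel('Percentage UV radiation')
ax.set_ylabel('Global radiation')
ax.set_title('Percentage UV radiation versus global radiation', size=12)

print(ax.get_xlabel(), '|', ax.get_ylabel())
```

Each setter returns a ``Text`` object, which is what the notebook prints as the output of the cell's last line.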

## Encoding more information in our visualisations
Sometimes, we want to encode more information in our visualisations. It is very simple to do this in Seaborn, as we'll demonstrate below where we encode the temperature as the circle area.

In [53]:
plt.figure()
axes = sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', size='Air temperature', data=monthly_weather)
axes.set_xlabel('Percentage UV radiation')  # x-axis label (was mistakenly set with set_ylabel)
axes.set_ylabel('Global radiation')
axes.set_title('Percentage UV radiation versus global radiation', size=12)

<IPython.core.display.Javascript object>

Text(0.5, 1.0, 'Percentage UV radiation versus global radiation')

It is still somewhat difficult to see the patterns in our data. The reason for this is that humans are not very good at comparing different areas. Therefore, it might be a good idea to use two different encoding methods to visualise the air temperature. Let us try size and colour.

In [54]:
plt.figure()
axes = sns.scatterplot(x='Percentage UV light', y='Total amount of EM radiation', size='Air temperature', hue='Air temperature', data=monthly_weather)
axes.set_xlabel('Percentage UV radiation')  # x-axis label (was mistakenly set with set_ylabel)
axes.set_ylabel('Global radiation')
axes.set_title('Percentage UV radiation versus global radiation', size=12)

<IPython.core.display.Javascript object>

Text(0.5, 1.0, 'Percentage UV radiation versus global radiation')

We might want to use a more descriptive colourmap as well. To find colourmaps, check out [this](https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html) webpage. We'll use the "coolwarm" colourmap.

In [55]:
plt.figure()
axes = sns.scatterplot(
    x='Percentage UV light',
    y='Total amount of EM radiation',
    size='Air temperature',
    hue='Air temperature',
    palette='coolwarm',
    data=monthly_weather
)
axes.set_xlabel('Percentage UV radiation')  # x-axis label (was mistakenly set with set_ylabel)
axes.set_ylabel('Global radiation')
axes.set_title('Percentage UV radiation versus global radiation', size=12)

<IPython.core.display.Javascript object>

Text(0.5, 1.0, 'Percentage UV radiation versus global radiation')

## The seaborn data model
What we've just demonstrated is the seaborn data model. When we visualise data with seaborn, we always follow three steps:

 1. Choose the overarching visualisation method (e.g. scatter plot, box plot, line plot).
 2. Choose which columns to encode using which method (e.g. position, size, colour).
 3. Customise the visualisation (e.g. setting the colour palette).

By understanding these three steps, we can easily create our own powerful visualisations.

### Exercises

 1. Create a scatter plot with Air temperature on the x-axis and Albedo on the y-axis (you might want to rename the albedo column to get a pretty graph).
 2. Create a scatter plot with Air temperature on the x-axis and Total amount of EM radiation on the y-axis
 3. Create a scatter plot with Air temperature on the x-axis, temperature 1 meter below ground level (jt100) on the y-axis (you might want to rename this column)
 4. Create a scatter plot with Air temperature on the x-axis, temperature 1 meter below ground level (jt100) on the y-axis (you might want to rename this column). Use colour to represent month.

In [60]:
monthly_weather['Month'] = monthly_weather.index.month
monthly_weather = monthly_weather.rename(
    {
        'albedo': 'Albedo',
        'jt100': 'Temperature 1 meter below the ground'
    },
    axis=1
)

In [66]:
plt.figure()

ax = sns.scatterplot(
    x='Air temperature',
    y='Temperature 1 meter below the ground',
    hue=monthly_weather.index.month,
    data=monthly_weather,
)



<IPython.core.display.Javascript object>

## Colourmaps
Do you see a problem with the visualisation above? There is a sharp jump in colour between January and December, even though these months are adjacent. Let us fix this problem by changing our colourmap. We can find a list of all available colourmaps [here](https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html).

There are three kinds of good colourmaps: sequential, diverging and cyclic. Each of these has its own use case.

### Sequential colourmaps
A sequential colourmap is ideal for presenting positive numeric values and numeric values on an unbounded scale. Sequential colourmaps are defined by their monotonic change from one colour to another. One can, for example, start at white and become blue (such as the ``Blues`` colourmap), or start at dark purple and become light yellow (such as the ``viridis`` colourmap).

When choosing a sequential colourmap, I recommend choosing from the list of perceptually uniform colourmaps or the list of sequential colourmaps, not from the list of sequential (2) colourmaps.

### Diverging colourmaps
A diverging colourmap is ideal for presenting numeric values that have a clear midpoint at zero. Correlation coefficients are one example. Data that have both positive and negative values but no clear centre can also benefit from a diverging colourmap, so long as the midpoint of the colourmap is placed at zero (which means that the full dynamic range of the colourmap might not be used). Diverging colourmaps are defined by three colours, $a$, $b$ and $c$. The first half of the colourmap changes monotonically from $a$ to $b$ and the second half changes monotonically from $b$ to $c$.
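
One simple way to place the midpoint of a diverging colourmap at zero is to make the colour limits symmetric around zero. The sketch below uses Matplotlib's ``Normalize`` on made-up values; the variable names are hypothetical. It also shows the cost mentioned above: when the data is not symmetric, part of the colourmap goes unused.

```python
from matplotlib.colors import Normalize

values = [-1.0, 0.0, 3.0]             # made-up data, not symmetric around zero
limit = max(abs(v) for v in values)   # largest magnitude in the data

# Symmetric limits put zero exactly in the middle of the colourmap.
norm = Normalize(vmin=-limit, vmax=limit)

# Position of each value along the colourmap, from 0 (one end) to 1 (other end).
positions = [float(norm(v)) for v in values]
print(positions)
```

Zero lands at position 0.5, the neutral colour $b$, while the smallest value only reaches about a third of the way along the colourmap, so the lower third is never shown.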

### Cyclic colourmaps
A cyclic colourmap is ideal when the data is continuous and periodic, so that it has no natural start or end point. Examples of such data are angles and time (e.g. time of day or month of the year); essentially, anything that is best presented using a polar coordinate system should be presented using a cyclic colourmap. These colourmaps have the same start and end colour, and are therefore what we will use to fix the example above.

In [72]:
plt.figure()

ax = sns.scatterplot(
    x='Air temperature',
    y='Temperature 1 meter below the ground',
    # style='season',
    hue=monthly_weather.index.month,
    data=monthly_weather,
    palette='twilight_shifted'
)



<IPython.core.display.Javascript object>
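
The claim that a cyclic colourmap starts and ends at the same colour can be checked numerically. The sketch below compares the first and last colours of the cyclic ``twilight`` colourmap with those of the sequential ``viridis`` colourmap; ``endpoint_gap`` is a hypothetical helper written for this illustration.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def endpoint_gap(name):
    """Euclidean distance in RGB space between a colourmap's first and last colour."""
    cmap = plt.get_cmap(name)
    first = np.array(cmap(0.0)[:3])  # drop the alpha channel
    last = np.array(cmap(1.0)[:3])
    return float(np.linalg.norm(first - last))


cyclic_gap = endpoint_gap('twilight')
sequential_gap = endpoint_gap('viridis')
print(cyclic_gap < sequential_gap)  # True
```

For month data, this means that December and January get nearly the same colour, which removes the artificial jump.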

## Let us talk about data encoding

When we visualise data, we try to encode the numerical information in a dataset using visual cues. However, too often we don't stop and think about what the goal of a visualisation is, namely communicating or finding insight in the data at hand. As such, we should think thoroughly about how we want to encode our information to best communicate our findings.

Below, we have a good visualisation from Peter Aldhous that demonstrates the strength of different encoding methods. From it, we can see that we should encode the most important variables as positions along an axis. Less important variables can be displayed using other means, such as colour or shape.

![Visual hierarchy](https://paldhous.github.io/ucb/2016/dataviz/img/class2_2.jpg)
From: https://paldhous.github.io/ucb/2016/dataviz/week2.html

## Categorical grouping of variables

Let us now move back to some hands-on visualisation. Specifically, we want to visualise our data with one of the variables being categorical. To start, we'll create a boxplot of the air temperature for each month.

In [76]:
plt.figure()
sns.boxplot(
    x='Month',
    y='Air temperature',
    data=monthly_weather
)



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0e853d198>

Boxplots are often used to visualise data distributions. However, they are not always ideal. To illustrate this point, see the animation below.

![Boxplot vs violinplot](https://d2f99xq7vri1nk.cloudfront.net/BoxViolinSmaller.gif)
From https://www.autodeskresearch.com/publications/samestats

Here we see a dataset changing while its summary statistics stay constant. This illustrates that boxplots can hide the true distribution of our data. Alternatives to boxplots are swarm plots, jitter (or strip) plots and violin plots. Look at the seaborn documentation, and try one of them.
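
The same effect can be reproduced with a few numbers. A boxplot is drawn from roughly the five-number summary (minimum, quartiles and maximum), so two datasets that share it get the same box even if their shapes differ. The two arrays below are made up for this illustration.

```python
import numpy as np

# Two made-up datasets: one with a single peak at 2, one with peaks at 1 and 3.
single_peak = np.array([0, 1, 1, 2, 2, 2, 3, 3, 4])
two_peaks = np.array([0, 1, 1, 1, 2, 3, 3, 3, 4])

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
quantiles = [0, 25, 50, 75, 100]
summary_single = np.percentile(single_peak, quantiles)
summary_two = np.percentile(two_peaks, quantiles)

print(summary_single)
print(summary_two)
print(np.array_equal(summary_single, summary_two))  # True
```

A swarm or strip plot draws every observation and would therefore reveal the two peaks that the boxplot hides.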

In [77]:
plt.figure()
sns.swarmplot(
    x='Month',
    y='Air temperature',
    data=monthly_weather
)



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0e8ce5cc0>

In [79]:
plt.figure()
sns.stripplot(
    x='Month',
    y='Air temperature',
    data=monthly_weather
)



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0e92a4470>

In [80]:
plt.figure()
sns.violinplot(
    x='Month',
    y='Air temperature',
    data=monthly_weather
)



<IPython.core.display.Javascript object>



<matplotlib.axes._subplots.AxesSubplot at 0x1c0e8cdac18>

Are there any weaknesses with violin plots? Yes, namely that the data distribution is estimated using kernel density estimation (KDE), which smooths away sharp changes. Thus, even if all your data is positive, a violin plot might show a small probability of the data being negative.
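
To see where the leakage below zero comes from, here is a hand-written Gaussian KDE in NumPy. It is a sketch under stated assumptions: the ``gaussian_kde_at`` helper, the samples and the fixed bandwidth of 0.2 are all made up for illustration, whereas seaborn chooses the bandwidth automatically.

```python
import numpy as np


def gaussian_kde_at(x, samples, bandwidth):
    """Estimated density at x: the average of Gaussian bumps centred on the samples."""
    z = (x - samples) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return float(bumps.mean())


samples = np.array([0.1, 0.2, 0.3, 0.5, 1.0])  # strictly positive, made-up data

# Every bump extends in both directions, so some density ends up below zero.
density_below_zero = gaussian_kde_at(-0.1, samples, bandwidth=0.2)
print(samples.min() > 0, density_below_zero > 0)  # True True
```

Because each observation is replaced by a bump that extends in both directions, observations close to zero spread some of their density onto negative values.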

## The Matplotlib model

Until now, we have used both Pandas and Seaborn to generate data visualisations, and you might have noticed that the end results share several similarities. The reason for this is that most plotting in Python uses a tool called Matplotlib as a backend. However, Matplotlib can be tedious to work with directly, so we often use wrappers instead.

Let us now show how the matplotlib model works.

In [87]:
figure = plt.figure()  # Create a new figure
ax1 = figure.add_axes([0.05, 0.05, 0.4, 0.9])  # left, bottom, width, height. Fraction of the figure
ax2 = figure.add_axes([0.55, 0.05, 0.4, 0.9])

ax1.plot(monthly_weather.index, monthly_weather['Air temperature'])
ax2.scatter(monthly_weather['Air temperature'], monthly_weather['Albedo'], s=1)



<IPython.core.display.Javascript object>

<matplotlib.collections.PathCollection at 0x1c0ecf32668>

In [88]:
figure = plt.figure()
ax1 = figure.add_subplot(1, 2, 1)  # Create the first set of axes for a two-faceted plot
ax2 = figure.add_subplot(1, 2, 2)

ax1.plot(monthly_weather.index, monthly_weather['Air temperature'])
ax2.scatter(monthly_weather['Air temperature'], monthly_weather['Albedo'], s=1)



<IPython.core.display.Javascript object>

<matplotlib.collections.PathCollection at 0x1c0ed54dcf8>

In [94]:
figure = plt.figure()

for i in range(12):
    ax = figure.add_subplot(3, 4, i+1)
    ax.text(0.5, 0.5, f'{i+1}')
    ax.set_xticks([])
    ax.set_yticks([])




<IPython.core.display.Javascript object>
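
The grid built with repeated ``add_subplot`` calls can also be created in one go with ``plt.subplots``, which returns the figure together with an array of axes. This is an alternative sketch, not the approach used in the cells above, and the ``Agg`` backend is an assumption chosen so that it runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Create the figure and a 3 x 4 grid of axes in a single call.
fig, axes = plt.subplots(3, 4)

# axes is a 2D array; .flat visits the axes row by row.
for number, ax in enumerate(axes.flat, start=1):
    ax.text(0.5, 0.5, f'{number}')
    ax.set_xticks([])
    ax.set_yticks([])

print(axes.shape, len(fig.axes))  # (3, 4) 12
```

Either way, the model is the same: a figure is a container, and each axes object inside it is a separate plot with its own plotting methods.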

If you want to learn more about Matplotlib, I recommend that you look at examples. The benefit of learning Matplotlib is that you can create any visualisation that you can imagine. However, it can be time-consuming to do so, which is why you should start with seaborn.

## Heatmaps in Seaborn

Heatmaps are a useful way to display correlation matrices, and seaborn has some very nice functions that allow us to create them. Let us start with a simple correlation matrix.

In [106]:
plt.figure()
sns.heatmap(
    weather.corr()
)



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0efda89b0>

Remember what we said about colourmaps. Let us use a diverging colourmap and ensure that the colour scale ranges from -1 to 1.

In [107]:
plt.figure()
sns.heatmap(
    weather.corr(),
    cmap='coolwarm',
    vmin=-1,  # fix the colour scale to the full correlation range
    vmax=1
)



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0ea4d5d68>

We see that there are some patterns. However, if we cluster the rows and columns of this heatmap, it will be much easier to see what is correlated with what.

In [114]:
sns.clustermap(
    weather.drop('fd', axis=1).corr(),  # the fd column has too many missing values
    cmap='coolwarm',
    figsize=(8, 8)
)



<IPython.core.display.Javascript object>

<seaborn.matrix.ClusterGrid at 0x1c0ed647be0>

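The same visualisation can be applied to the original data frame. Here we cluster only the columns (``row_cluster=False``), so the rows stay in chronological order. First, we use the ``missingno`` package to see where values are missing.

In [121]:
import missingno

missingno.matrix(weather.drop(['fd', 'vr', 'sd', 'sdman', 'Minimum grass temperature'], axis=1))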



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0a5014278>

In [126]:
sns.clustermap(
    weather.drop(['fd', 'vr', 'sd', 'sdman'], axis=1).dropna(axis=0),  # drop rows with missing values
    figsize=(8, 8),
    row_cluster=False,
    z_score=True
)



<IPython.core.display.Javascript object>

<seaborn.matrix.ClusterGrid at 0x1c0b0b5ee10>

In [131]:
weather.plot(y='Air pressure')



<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1c0a51a8710>

In [132]:
weather.loc[np.abs(weather['Air pressure']) > 1200, 'Air pressure'] = np.nan

In [133]:
sns.clustermap(
    weather.drop(['fd', 'vr', 'sd', 'sdman'], axis=1).dropna(axis=0),  # drop rows with missing values
    figsize=(8, 8),
    row_cluster=False,
    z_score=True
)



<IPython.core.display.Javascript object>

<seaborn.matrix.ClusterGrid at 0x1c0a452b278>

## Bonus: Interactive plots

In [95]:
import ipywidgets as widgets

In [97]:
fig = plt.figure()

ax = fig.add_subplot(111)  # Short for 1, 1, 1

def replot(column):
    ax.clear()
    sns.scatterplot(
        x=monthly_weather.index,
        y=column,
        data=monthly_weather,
        ax=ax  # draw on the axes we just cleared
    )

replot('Air temperature')



<IPython.core.display.Javascript object>

In [98]:
monthly_weather.columns

Index(['Albedo', 'balanse', 'diffus', 'fd', 'fluxm', 'fluxs',
       'Total amount of EM radiation', 'Minimum grass temperature',
       'Percentage infra-red light', 'jt010',
       'Temperature 1 meter below the ground', 'jt002', 'jt020', 'jt005',
       'jt050', 'Air pressure', 'Air temperature', 'ltmax', 'ltmin', 'nb',
       'par', 'rf', 'sd', 'sdman', 'Percentage visible light',
       'Percentage UV light', 'vh', 'vhmax', 'month', 'Month'],
      dtype='object')

In [101]:
fig = plt.figure()

ax = fig.add_subplot(111)  # Short for 1, 1, 1

@widgets.interact(column=monthly_weather.columns)
def replot(column):
    ax.clear()
    sns.scatterplot(
        x=monthly_weather.index,
        y=column,
        data=monthly_weather,
        ax=ax  # draw on the axes we just cleared
    )




<IPython.core.display.Javascript object>

interactive(children=(Dropdown(description='column', options=('Albedo', 'balanse', 'diffus', 'fd', 'fluxm', 'f…

In [102]:
fig = plt.figure()

ax = fig.add_subplot(111)  # Short for 1, 1, 1

@widgets.interact(x=monthly_weather.columns, y=monthly_weather.columns)
def replot(x, y):
    ax.clear()
    sns.scatterplot(
        x=x,
        y=y,
        data=monthly_weather,
        ax=ax  # draw on the axes we just cleared
    )




<IPython.core.display.Javascript object>

interactive(children=(Dropdown(description='x', options=('Albedo', 'balanse', 'diffus', 'fd', 'fluxm', 'fluxs'…

In [105]:
fig = plt.figure()

ax = fig.add_subplot(111)  # Short for 1, 1, 1

@widgets.interact(
    x=monthly_weather.columns,
    y=monthly_weather.columns,
    hue=monthly_weather.columns,
    size=monthly_weather.columns
)
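def replot(x, y, hue, size):  # one parameter per widget created by interact
    ax.clear()
    sns.scatterplot(
        x=x,
        y=y,
        hue=hue,
        size=size,
        data=monthly_weather,
        ax=ax  # draw on the axes we just cleared
    )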




<IPython.core.display.Javascript object>

interactive(children=(Dropdown(description='x', options=('Albedo', 'balanse', 'diffus', 'fd', 'fluxm', 'fluxs'…