# Visualisation

## Exploratory Data Analysis and Pre-processing

In this module you will learn how to load an online retail dataset in Python and visualise it. 

You will learn how to plot data which will help you in the future when building models.

For this module, you will need to import:

* `pandas` (as `pd`) [link to documentation](https://pandas.pydata.org/pandas-docs/stable/)
* `matplotlib.pyplot` (as `plt`) [link to documentation](https://matplotlib.org/contents.html)
* `seaborn` (as `sns`) [link to documentation](https://seaborn.pydata.org)
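
The imports listed above can be written as follows. This is only a sketch using the conventional aliases named in the list; it assumes all three libraries are already installed in your environment.

```python
# Conventional aliases used throughout this module
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```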


To display `matplotlib` figures inline in the notebook, you need a cell containing
```python
%matplotlib inline
```

In [None]:
# add your code here to load the libraries


### Loading the dataset


Load the dataset `online_retail_customer_data.csv` with `pandas`. Recall that you need to:

* use the `read_csv()` function from `pandas`
* point to the location of the dataset
* choose a name under which to store the resulting data frame (we suggest the name `customers`)
* specify that the `CustomerID` column is the index column using the `index_col` option

Use the `head` method to display the first few rows of the dataset (you can specify how many rows).
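
The loading steps above might look like the sketch below. So that it runs without the real file, it reads a few made-up rows from an in-memory string; the values are purely illustrative and are not taken from the dataset. In the notebook you would pass the path to `online_retail_customer_data.csv` instead (the commented line assumes the file sits in the working directory).

```python
import io

import pandas as pd

# Made-up stand-in rows (NOT real data), used only to keep the example runnable.
sample_csv = io.StringIO(
    "CustomerID,balance,n_orders,time_between_orders,max_spent\n"
    "1,120.0,2,30.0,80.0\n"
    "2,560.5,7,12.5,150.0\n"
    "3,75.2,1,0.0,75.2\n"
    "4,980.0,12,8.0,210.0\n"
)

# index_col makes CustomerID the row index instead of an ordinary column.
customers = pd.read_csv(sample_csv, index_col="CustomerID")

# With the real file (path is an assumption):
# customers = pd.read_csv("online_retail_customer_data.csv", index_col="CustomerID")

# Show the first three rows; head() defaults to five when no number is given.
first_rows = customers.head(3)
print(first_rows)
```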


### About the dataset

The dataset comes from an online retailer selling gifts and is derived from a dataset available [here](https://archive.ics.uci.edu/ml/datasets/Online+Retail#).

We have taken the original data and done some preprocessing (filled in missing elements, removed non-numerical columns) to create a 'profile' for each customer, which includes a number of features such as:

* `balance`: Amount of money spent at the store (purchases minus refunds).
* `n_orders`: Total number of orders from the online retailer.
* `time_between_orders`: Average time (in days) between orders.
* `max_spent`: Largest amount of money the customer spent on a single order.

### Scaling

The different numerical variables may have completely different scales. This can be easily checked with a boxplot. You'll use the `seaborn` wrapper around `matplotlib` that is great for producing clear plots.
Have a look [here](https://stanford.edu/~mwaskom/software/seaborn/examples/index.html) for a gallery of plots possible with `seaborn`.

* create a figure with the `figure()` function of `matplotlib.pyplot` (you can pass a figure size)
* use the `boxplot` function of `seaborn`, passing it the appropriate data frame

In [None]:
# add your code to plot a sns.boxplot() of the customer dataframe


As we can see, the variables have completely different scales. This can normally be resolved by scaling, but that is beyond the objectives of this module so we will skip it. The current state of the data will do just fine for all of our plotting exercises.

## Relationship between input features

An important tool for the exploratory data analysis step is the **scatter plot**. 

This plot helps visualise the relationship between two input features. It may also give you a first indication of which model to apply.

Create a scatter plot of `n_orders` vs `balance` using the `lmplot` function of `seaborn`. Once you are comfortable with this, try looking at the scatter plots for other pairs of features.
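
One possible version of this plot is sketched below, again on a small made-up stand-in for `customers` (illustrative values only). Putting `n_orders` on the x-axis and `balance` on the y-axis is an assumption; swap them if you prefer the other orientation.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the customers data frame (NOT real data).
customers = pd.DataFrame(
    {
        "balance": [120.0, 560.5, 75.2, 980.0, 310.4, 45.0],
        "n_orders": [2, 7, 1, 12, 4, 1],
        "time_between_orders": [30.0, 12.5, 0.0, 8.0, 21.0, 0.0],
        "max_spent": [80.0, 150.0, 75.2, 210.0, 120.0, 45.0],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="CustomerID"),
)

# lmplot draws the scatter points and, by default, a fitted regression line.
# It creates its own figure and returns a FacetGrid.
g = sns.lmplot(data=customers, x="n_orders", y="balance")
```

If you only want the points, `fit_reg=False` turns the regression line off.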

### Grid/scatterplot matrix

A scatterplot matrix shows a grid of all scatterplots where each attribute is plotted against all other attributes.
This can be applied when there aren't too many variables (otherwise it quickly becomes impractical). 

You can find further information on how to create a scatterplot matrix with seaborn using the `pairplot()` function [here](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.pairplot.html).


### Correlation matrix and heatmap of correlations between the input features

It is often of great interest to investigate whether any of the variables in a multivariate dataset are significantly correlated. 
As previously shown, the different features (variables) in `customers` are not independent from each other. 
To quickly identify which features are related and to what degree, it is useful to compute a correlation matrix that shows the correlation coefficient for each pair of variables. 
You can do this by calling the `corr()` method of the `pandas` data frame.
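
A short sketch of this computation on a made-up stand-in for `customers` (illustrative values only). By default, `corr()` computes the Pearson correlation coefficient for every pair of numerical columns.

```python
import pandas as pd

# Made-up stand-in for the customers data frame (NOT real data).
customers = pd.DataFrame(
    {
        "balance": [120.0, 560.5, 75.2, 980.0, 310.4, 45.0],
        "n_orders": [2, 7, 1, 12, 4, 1],
        "time_between_orders": [30.0, 12.5, 0.0, 8.0, 21.0, 0.0],
        "max_spent": [80.0, 150.0, 75.2, 210.0, 120.0, 45.0],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="CustomerID"),
)

# The result is a square, symmetric data frame indexed by column name,
# with values between -1 and 1 and ones on the diagonal.
corr_matrix = customers.corr()
print(corr_matrix)
```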

To visualise the degree of correlation between variables, you can use a heatmap. Use the `heatmap` function from `seaborn`, pass it the correlation matrix calculated in the previous exercise, and set the `center` option to 0.
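
The heatmap might be drawn as in the sketch below, which recomputes the correlation matrix on a made-up stand-in for `customers` (illustrative values only). The `annot=True` option, which prints each coefficient in its cell, is an optional extra and not part of the exercise.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the customers data frame (NOT real data).
customers = pd.DataFrame(
    {
        "balance": [120.0, 560.5, 75.2, 980.0, 310.4, 45.0],
        "n_orders": [2, 7, 1, 12, 4, 1],
        "time_between_orders": [30.0, 12.5, 0.0, 8.0, 21.0, 0.0],
        "max_spent": [80.0, 150.0, 75.2, 210.0, 120.0, 45.0],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="CustomerID"),
)

corr_matrix = customers.corr()

# center=0 anchors the colour map at zero, so positive and negative
# correlations are shown in visually distinct colours.
ax = sns.heatmap(corr_matrix, center=0, annot=True)
```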