# Numpy, Pandas and Plotly basics

Numpy is one of the most useful packages for engineering calculations, and mastering some of its syntax can greatly improve your coding efficiency. Formulae written with Numpy are often more compact and readable than the equivalent formulae in Excel. It may not seem like much, but Numpy will make your code faster and easier to read. Guess which one Numpy is in the picture below.

<img src="Images/faster_slower.jpg">

Together with the Pandas package for data manipulation and Plotly for data visualisation, Numpy provides a strong foundation for data science in Python. This notebook presents some highlights of the three packages.

## The advantages of using Numpy illustrated

Numpy arrays are sequences of values which share a single data type. They can be indexed and looped over like ordinary Python lists, but have the added advantage that you can use them to do array operations with a single-line statement. Have a look at the following example:

In [None]:
import numpy as np # Always import numpy first

In [None]:
a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
a ** b  # Raises a TypeError: the ** operator is not defined for lists

In [None]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])
a ** b

We have two arrays and we want to raise the first array to the power of the second, element by element. If we use Python lists, we get a ```TypeError``` because this behaviour is not implemented for lists. But if we define the lists as Numpy arrays (just use the ```np.array()``` function, which takes a list as argument), Numpy applies the operation element by element.
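
To make the comparison concrete, the sketch below computes the same element-wise power twice: once with a plain Python list comprehension and once with Numpy arrays. The variable names ```powers_list``` and ```powers_array``` are chosen here for illustration only.

```python
import numpy as np

a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]

# Pure Python: loop over both lists in parallel
powers_list = [x ** y for x, y in zip(a, b)]

# Numpy: one vectorised expression on the two arrays
powers_array = np.array(a) ** np.array(b)

print(powers_list)   # [1.0, 4.0, 3.0]
print(powers_array)  # [1. 4. 3.]
```

Both approaches give the same numbers. The Numpy version stays close to the mathematical notation and avoids the explicit loop.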

Numpy arrays also have a number of useful methods which you can call on the array:

In [None]:
a = np.array([1.0, 2.0, 3.0])
a.sum()

In [None]:
a.std()

In [None]:
a.mean()

In [None]:
a.max()

In [None]:
b = np.array([3.0, 2.0, 1.0])
b.sort()  # Sorts the array in place and returns None
b

and [many more](https://docs.scipy.org/doc/numpy-1.15.4/reference/arrays.ndarray.html#calculation).
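
Two details of these methods are easy to miss, so here is a short sketch. By default ```std``` divides by the number of elements (the population standard deviation), and the ```ddof``` argument switches to the sample version. On a two-dimensional array, the ```axis``` argument selects the direction of the calculation. The array ```m``` is a made-up example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

population_std = a.std()      # Divides by N (ddof=0 is the default)
sample_std = a.std(ddof=1)    # Divides by N - 1

# A made-up 2D array with two rows and three columns
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
column_sums = m.sum(axis=0)   # One value per column
row_means = m.mean(axis=1)    # One value per row

print(population_std, sample_std)
print(column_sums, row_means)
```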

## Creating Numpy arrays

Creating Numpy arrays is often a first step in a workflow and there are a number of standard ways to do this.

   a) ```np.linspace(<min>, <max>, <noelements>)```: Create an array of given length with values linearly increasing between a minimum and a maximum (both included):

In [None]:
np.linspace(0.0, 10.0, 11)

   b) ```np.arange(<min>, <max>, <step>)```: Create an array of values between the minimum (included) and the maximum (not included), spaced by the given step. The following example returns the even values from 0 to 8; the value 10 is not included.

In [None]:
np.arange(0, 10, 2)

This also works with dates, which is pretty cool! We just need to tell Numpy that we are working with days (```datetime64[D]``` objects) by specifying the ```dtype``` keyword argument.

In [None]:
np.arange('2019-02-01', '2019-03-01', dtype='datetime64[D]')

   c) ```np.logspace(<log_min>, <log_max>, <noelements>)```: Create an array of given length with values logarithmically spaced between 10 to the power ```<log_min>``` and 10 to the power ```<log_max>```. The arguments are therefore the base-10 logarithms of the minimum and maximum. This is very useful when there is a linear dependency on the logarithm of the variable under investigation.

In [None]:
np.logspace(0.0, 2.0, 21)

   d) Creating a Numpy array from a given list:

In [None]:
a = [0.0, 1.0, 2.2, 3.4]
np.array(a)

   e) Create a Numpy array containing a given number of zeros:

In [None]:
np.zeros(10)

   f) Create a Numpy array containing a given number of ones:

In [None]:
np.ones(10)
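
The end-point behaviour of these functions is a common source of off-by-one mistakes, so the following sketch checks it explicitly. It also shows that ```np.logspace``` gives the same result as raising 10 to a ```np.linspace``` array, and that ```np.zeros``` accepts a shape tuple for a two-dimensional array. The variable names are chosen for illustration.

```python
import numpy as np

# linspace includes the end point, arange excludes it
lin = np.linspace(0.0, 10.0, 11)   # 0, 1, ..., 10
ar = np.arange(0, 10, 2)           # 0, 2, 4, 6, 8

# logspace(a, b, n) is 10 raised to linspace(a, b, n)
log = np.logspace(0.0, 2.0, 21)
log_by_hand = 10.0 ** np.linspace(0.0, 2.0, 21)

# zeros (and ones) also accept a shape tuple: 2 rows, 3 columns
grid = np.zeros((2, 3))

print(lin[-1], ar[-1])
print(np.allclose(log, log_by_hand))
print(grid.shape)
```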

## Working with Pandas

Numpy is often used hand-in-hand with Pandas, a Python package aimed specifically at data manipulation. Pandas is the go-to package for importing data from various file formats such as .csv, .txt and .xlsx.

The Pandas ``DataFrame`` is essential for working with data. You can think of it as a table with rows and columns.
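
As a minimal sketch of the table idea, a ``DataFrame`` can be built directly from a dictionary in which each key becomes a column. The column names and all values below are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only
layers = pd.DataFrame({
    'Location': ['BH-1', 'BH-1', 'BH-2'],
    'Depth from [m]': [0.0, 2.5, 0.0],
    'Depth to [m]': [2.5, 6.0, 4.0],
    'Soil unit': ['A', 'B', 'A'],
})

print(layers.shape)             # (3, 4): three rows, four columns
print(layers.columns.tolist())  # The column names
print(layers.dtypes)            # The data type of each column
```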

### Data import

As an example, we will import an Excel file with soil descriptions from the Borssele offshore wind farm area (RVO.nl) using the ``read_excel`` function.

In [None]:
import pandas as pd # Import the library first

In [None]:
soildescriptions = pd.read_excel('Data/soil_descriptions.xlsx')

The data has now been imported into the variable ``soildescriptions``. We can check its content by printing the first five rows to the notebook using the ``head`` method:

In [None]:
soildescriptions.head()

We can see that the soil descriptions are organised per layer and per location. Pandas allows us to manipulate this data with an easy and straightforward syntax.

### Data manipulations

#### Location count

We can display the distinct locations in the dataframe. We access the ``Location`` column by writing the column name between square brackets and then apply the ``unique`` method.

In [None]:
soildescriptions['Location'].unique()

We can count the number of locations with the built-in Python ``len`` function, which gives the length of the resulting array:

In [None]:
len(soildescriptions['Location'].unique())

This shows that there are 16 locations in the dataframe.
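
Pandas also offers the ``nunique`` and ``value_counts`` methods for this kind of count. The sketch below uses a small hypothetical dataframe (the location names are made up) so that it can be run without the Excel file.

```python
import pandas as pd

# Hypothetical location names, one row per soil layer
df = pd.DataFrame({
    'Location': ['BH-1', 'BH-1', 'BH-2', 'BH-3', 'BH-3', 'BH-3']})

# Number of distinct locations, same result as len(df['Location'].unique())
n_locations = df['Location'].nunique()

# Number of rows (layers) for each location
layers_per_location = df['Location'].value_counts()

print(n_locations)
print(layers_per_location)
```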

### Data slicing

We can retrieve the data for a selected location (e.g. BH-WFS2-5). We retrieve the data where the Location column equals the selected location.

In [None]:
soildescriptions[soildescriptions['Location'] == 'BH-WFS2-5']

Slicing can also be based on other properties. For example, we can show the layers which belong to Soil unit A for all locations:

In [None]:
soildescriptions[soildescriptions['Soil unit'] == 'A']

From this data, we can retrieve the minimum and maximum depth of Soil unit A. We access the ``Depth to [m]`` column from the previous result and use the ``min`` and ``max`` methods to obtain this result.

In [None]:
soildescriptions[soildescriptions['Soil unit'] == 'A']['Depth to [m]'].min()

In [None]:
soildescriptions[soildescriptions['Soil unit'] == 'A']['Depth to [m]'].max()
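
Conditions can also be combined. The sketch below, on a small hypothetical dataframe, joins two conditions with ``&`` (each condition needs its own parentheses) and uses ``.loc`` to select rows and a column in one step. The values are made up and the variable names are chosen for illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only
df = pd.DataFrame({
    'Location': ['BH-1', 'BH-1', 'BH-2', 'BH-2'],
    'Soil unit': ['A', 'B', 'A', 'B'],
    'Depth to [m]': [2.5, 6.0, 4.0, 9.0],
})

# Rows where both conditions hold
mask = (df['Location'] == 'BH-2') & (df['Soil unit'] == 'A')
unit_a_at_bh2 = df[mask]

# Select rows and one column with .loc, then get min and max together
depth_range = df.loc[df['Soil unit'] == 'A', 'Depth to [m]'].agg(['min', 'max'])

print(unit_a_at_bh2)
print(depth_range)
```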

### Calculations with Pandas dataframes

We can perform calculations with Pandas columns in the same way as with Numpy arrays. For example, the center depth of each layer can be calculated and assigned to a new column ``Depth center [m]``:

In [None]:
soildescriptions['Depth center [m]'] = 0.5 * (
    soildescriptions['Depth from [m]'] + soildescriptions['Depth to [m]'])
soildescriptions.head()

This shows how an equation can be applied to each row by just writing the formula for the columns. This works for any formula built from element-wise operations.
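
As a further sketch of column arithmetic, the layer thickness can be computed as the difference of two columns. The dataframe below is hypothetical and the column name ``Thickness [m]`` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical layering at a single location
df = pd.DataFrame({
    'Depth from [m]': [0.0, 2.5, 6.0],
    'Depth to [m]': [2.5, 6.0, 10.0],
})

# Element-wise difference of two columns gives one thickness per row
df['Thickness [m]'] = df['Depth to [m]'] - df['Depth from [m]']

# Column methods such as sum work as they do on Numpy arrays
total_thickness = df['Thickness [m]'].sum()

print(df)
print(total_thickness)
```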

## Plotting with Plotly

Plotly is a Python package for data visualisation which allows the creation of interactive plots which are rendered in the browser. Plotly can plot line diagrams and scatterplots with a simple and compact syntax.

In [None]:
from plotly import subplots
import plotly.graph_objects as go

The following examples show the creation of the plot with comments explaining each step. Note that an even more compact syntax is available using Plotly Express. The longer syntax is used here since it will be used in the machine learning demos.

### Line plot

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False) # Create a Plotly Figure with 1 row and 1 column
_data = go.Scatter(       # Create a Scatter trace with x and y data
    x=[0.0, 1.0, 2.0],    # Specify the x-coordinates of the points, a list is used but Pandas columns are allowed
    y=[0.0, 1.0, 2.0],    # Specify the y-coordinates of the points, a list is used but Pandas columns are allowed
    showlegend=True,      # Select False for hiding the legend, True for showing it
    mode='lines',         # Use the mode 'lines' to show a line connecting the points
    name='Trace 1')       # Name the trace
fig.add_trace(_data, row=1, col=1)   # Add the trace to the first row and first column of the figure
fig['layout']['xaxis1'].update(title='X axis title', range=(0, 2)) # Specify the x-axis title and range
fig['layout']['yaxis1'].update(title='Y axis title', range=(0, 2)) # Specify the y-axis title and range
fig['layout'].update(title='Figure title')           # Specify the figure title
fig.show()                # Display the figure

### Scatterplot

The syntax for showing the points without a connecting line is almost identical.

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False) # Create a Plotly Figure with 1 row and 1 column
_data = go.Scatter(       # Create a Scatter trace with x and y data
    x=[0.0, 1.0, 2.0],    # Specify the x-coordinates of the points, a list is used but Pandas columns are allowed
    y=[0.0, 1.0, 2.0],    # Specify the y-coordinates of the points, a list is used but Pandas columns are allowed
    showlegend=True,      # Select False for hiding the legend, True for showing it
    mode='markers',       # Use the mode 'markers' to only show the points
    name='Trace 1')       # Name the trace
fig.add_trace(_data, row=1, col=1)   # Add the trace to the first row and first column of the figure
fig['layout']['xaxis1'].update(title='X axis title', range=(0, 2)) # Specify the x-axis title and range
fig['layout']['yaxis1'].update(title='Y axis title', range=(0, 2)) # Specify the y-axis title and range
fig['layout'].update(title='Figure title')           # Specify the figure title
fig.show()                # Display the figure

You can hover over the figure to inspect the data and zoom by clicking and dragging on the figure.

## Acknowledgements

The data from the Borssele offshore wind farm area was obtained from RVO.nl and is used under a Creative Commons license.