# NumPy and Pandas

## Introduction

NumPy and Pandas are Python libraries for data manipulation.

* The **NumPy** library adds support for large, multi-dimensional arrays and matrices, together with a large collection of high-level mathematical functions that operate on these arrays. Compared with Python lists, NumPy arrays use less memory to store the same data and let you specify the data type, which makes the code more efficient. To install it, see [NumPy-installation](https://numpy.org/install/).


* **Pandas** is a high-level data manipulation tool built on NumPy. It offers data structures and operations for manipulating numerical tables and time series. Its main data structure is the DataFrame, which stores tabular data in rows and columns. To install it, see [Pandas-installation](https://pandas.pydata.org/getting_started.html).
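The point about memory and data types can be illustrated with a minimal sketch. The array contents and variable names below are made up for illustration; `dtype` and `nbytes` are standard attributes of NumPy arrays.

```python
import numpy as np

# 1000 values stored as 64-bit floats (NumPy's default float type).
values64 = np.arange(1000, dtype=np.float64)

# The same values stored as 32-bit floats use half the memory.
values32 = values64.astype(np.float32)

print(values64.dtype, values64.nbytes)  # float64 8000
print(values32.dtype, values32.nbytes)  # float32 4000
```

Choosing a smaller data type saves memory at the cost of precision, so it is a trade-off rather than a free optimization.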

**Table of contents:**

* [Libraries import](#Libraries-import)
* [Basics of NumPy](#Basics-of-NumPy)
* [Reading files with NumPy](#Reading-files-with-NumPy)
* [Basics of Pandas](#Basics-of-Pandas)
* [Reading files with Pandas](#Reading-files-with-Pandas)

## Libraries import

Libraries are imported exactly like modules. Both libraries are commonly imported under a shortened name:

In [None]:
import numpy as np
import pandas as pd

## Basics of NumPy

As mentioned before, NumPy includes a large number of mathematical functions and methods. The [NumPy routines](https://numpy.org/doc/stable/reference/routines.html) page lists all the routines available in NumPy. For example, to calculate the square root of a value, we can simply do:

In [None]:
print(np.sqrt(8))

Or we could perform a calculation over many values. Let's use `numpy.linspace` to create a sequence of evenly spaced values.

In [None]:
x = np.linspace(1,10,5) # Sequence starts at 1, ends at 10, and has 5 evenly spaced samples.

print(np.sin(x))
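To see exactly which samples `numpy.linspace` produces, a small sketch can help. It uses the optional `retstep=True` argument, which also returns the spacing between samples.

```python
import numpy as np

# retstep=True returns the samples together with the spacing between them.
x, step = np.linspace(1, 10, 5, retstep=True)

print(x)     # the five samples: 1, 3.25, 5.5, 7.75 and 10
print(step)  # 2.25
```

Note that, unlike Python's `range`, `linspace` includes the end value by default.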

In NumPy, the data structure is the **array**. An array is a very powerful data structure as we can perform calculations directly on it. 

In [None]:
list_of_values = [3.8, 4.5, 6.2, 3.7, 5.2, 7.1] 
corrections_list = [0.8, 0.9, 0.95, 0.87, 0.6, 0.94]

# We redefine each list as an array. (We can also use the same name and the variable will be overwritten)
values = np.array(list_of_values)
corrections = np.array(corrections_list)
print('The values array is:', values)
print('The corrections array is:', corrections)
#Now we can perform operations directly on the arrays instead of iterating through the whole list.

corrected = values * corrections
print('The corrected values are:', corrected)
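For comparison, here is a minimal sketch of the same correction done with plain lists and with arrays, using a shortened copy of the values above.

```python
import numpy as np

values = [3.8, 4.5, 6.2]
corrections = [0.8, 0.9, 0.95]

# With plain lists we have to loop over the pairs explicitly.
corrected_list = [v * c for v, c in zip(values, corrections)]

# With arrays the multiplication is applied element by element.
corrected_array = np.array(values) * np.array(corrections)

print(np.allclose(corrected_list, corrected_array))  # True
```

Multiplying the two lists directly (`values * corrections`) would raise a `TypeError`, which is why the conversion to arrays matters.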

We can also create multidimensional arrays:

In [None]:
values_to_correct = [[3.8, 0.8],[4.5,0.9],[6.2,0.95],[3.7,0.87],[5.2,0.6],[7.1,0.94]] #List of lists

values_to_correct_array = np.array(values_to_correct)
print(values_to_correct_array.shape) #This shows the dimensions of the array
print(values_to_correct_array) #This shows a two dimensional array

In [None]:
#To extract only the first row:
print('The first row is:',values_to_correct_array[0])
#To extract the first column:
print('The first column is:', values_to_correct_array[:,0])
#To extract the element in the 3rd row and 2nd column:
print('The element is:', values_to_correct_array[2,1]) #Remember that Python counts from 0!
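Beyond single rows, columns and elements, arrays can also be indexed with slices and with boolean conditions. The sketch below uses a shortened copy of the array above; the threshold of 4 is an arbitrary choice for illustration.

```python
import numpy as np

arr = np.array([[3.8, 0.8], [4.5, 0.9], [6.2, 0.95], [3.7, 0.87]])

# Rows 2 and 3 (indices 1 and 2); the end of a slice is excluded.
middle_rows = arr[1:3]

# Rows whose first column is greater than 4.
large_values = arr[arr[:, 0] > 4]

print(middle_rows.shape)   # (2, 2)
print(large_values[:, 0])  # first column of the selected rows: 4.5 and 6.2
```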

### NumPy's file format

When we need to store and read back a NumPy array, it is useful to know the functions `numpy.save` and `numpy.load`:

In [None]:
np.save('data/values_to_correct.npy', values_to_correct_array)
test = np.load('data/values_to_correct.npy')
print(test)
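`numpy.save` does not create missing folders, so the cell above assumes that a `data` folder already exists. A self-contained sketch of the same round trip, assuming a temporary folder and a hypothetical file name, could look like this:

```python
import os
import tempfile

import numpy as np

array = np.array([[3.8, 0.8], [4.5, 0.9]])

# Write to a temporary folder so the example does not depend on `data/`.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.npy')  # hypothetical file name
    np.save(path, array)
    loaded = np.load(path)

print(np.array_equal(array, loaded))  # True
print(loaded.dtype)                   # float64
```

The `.npy` format stores the shape and data type together with the values, so the loaded array matches the original exactly.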


NumPy offers many more options than we can review in this pre-course. Please check the [NumPy documentation](https://numpy.org/) to learn more.

## Reading files with NumPy

A very common task is reading text files that contain data. NumPy offers several functions for this.

* When there are no missing values, the simplest option is `numpy.loadtxt`:

In [None]:
#Let's read a file containing lon, lat and slip for the 2010 Maule earthquake
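#unpack=True returns one array per column, so each column can be assigned to its own variable
#lon, lat, slip = np.loadtxt('data/slip_maule.xyz', unpack=True)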


* If there are missing values in the file, we can use `numpy.genfromtxt`:

In [None]:
#lon, lat, slip = np.genfromtxt('data/slip_maule.xyz', unpack=True)
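The difference between the two functions can be sketched without any file on disk. The numbers below are hypothetical and are not taken from the Maule file; `io.StringIO` only stands in for a text file.

```python
import io

import numpy as np

# Hypothetical whitespace-separated data with columns lon, lat and slip.
complete = io.StringIO("-72.5 -35.0 1.2\n-72.6 -35.1 3.4\n")
lon, lat, slip = np.loadtxt(complete, unpack=True)  # one array per column

# Hypothetical comma-separated data where one latitude is missing.
incomplete = io.StringIO("-72.5,-35.0,1.2\n-72.6,,3.4\n")
table = np.genfromtxt(incomplete, delimiter=',')  # missing entries become nan

print(slip)                   # the slip column: 1.2 and 3.4
print(np.isnan(table[1, 1]))  # True
```

`numpy.loadtxt` would raise an error on the second text, because it cannot convert the empty field to a number.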

## Basics of Pandas

The **DataFrame**, the main data structure of Pandas, is very powerful. It is similar to a two-dimensional NumPy array, with the advantage that each column can hold a different data type.
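As a quick sketch of the mixed-type point, the hypothetical table below combines text, integers and floats, and `dtypes` reports the type of each column. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical table mixing text, integers and floats.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'count': [6, 2, 1],
    'height': [3.3, 1.2, 0.9],
})

# Each column has its own data type.
print(df.dtypes)
```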

* Creating a DataFrame from a **dictionary**:

In [None]:
# The following dictionary contains the number of active volcanoes per region in 2019 in Italy
volcanoes_italy = {'region': ['sicily', 'campania', 'lazio'],
                   'active_volcanoes': [6, 2, 1]
                  }                   
print('The dict looks like:', volcanoes_italy)

volcanoes_df = pd.DataFrame(volcanoes_italy)
print('The DataFrame looks like:', volcanoes_df)

As you can see, the DataFrame looks much nicer: it resembles a spreadsheet and is easy to read.

* Let's check how a DataFrame built from a **list of lists** (`coordinates`) looks:

In [None]:
coordinates = [[38.14,73.41],[60.91,147.34],[3.30,95.98],[38.30,142.37],[52.62,159.78]]
pandas_coord = pd.DataFrame(coordinates)
print(pandas_coord)

If we want to have a clearer data structure, we can also add names to the columns:

In [None]:
pandas_coord = pd.DataFrame(coordinates, columns = ['latitude','longitude'])
print(pandas_coord)

* DataFrame from a **NumPy array**:

In [None]:
data = np.array([[9.5,38.14,73.41],[9.2,60.91,147.34],[9.1,3.30,95.98],[9.1,38.30,142.37],[9.0,52.62,159.78]])
columns = ['magnitude', 'latitude', 'longitude']
pd.DataFrame(data, columns=columns) 

As you may have noticed, a DataFrame always shows an index. To make it more readable, we can also **assign labels to the index**:

In [None]:
index = ['Valdivia1960', 'Alaska1964', 'Sumatra2004', 'Tohoku2011', 'Kamchatka1952']
data_with_index = pd.DataFrame(data, columns=columns, index=index) 
data_with_index
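Index labels are not only cosmetic: they can be used to look up rows. A minimal sketch with two of the events above, using the `loc` property:

```python
import numpy as np
import pandas as pd

data = np.array([[9.5, 38.14, 73.41], [9.1, 38.30, 142.37]])
events = pd.DataFrame(data,
                      columns=['magnitude', 'latitude', 'longitude'],
                      index=['Valdivia1960', 'Tohoku2011'])

# Select a whole row by its index label.
print(events.loc['Tohoku2011'])

# Select a single value by row label and column name.
print(events.loc['Valdivia1960', 'magnitude'])  # 9.5
```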

## Reading files with Pandas

A nice feature of Pandas is that it can read and write Excel, CSV and other types of files. If we have a `.csv` file, we can easily read it with `pandas.read_csv()`. We just need to pass the filename and the delimiter:

In [None]:
ingv = pd.read_csv('data/ingv_seismic.csv', delimiter=';')
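The sketch below shows reading and writing with a `;` delimiter without needing the file on disk. The rows are hypothetical; the real `ingv_seismic.csv` may contain other columns and values. `io.StringIO` stands in for a file.

```python
import io

import pandas as pd

# Hypothetical content of a semicolon-separated file.
text = "Station;Municipality;Lat (N)\nAAA;Town A;36.5\nBBB;Town B;38.1\n"
stations = pd.read_csv(io.StringIO(text), delimiter=';')
print(stations.shape)  # (2, 3)

# Writing works the other way round: `sep` sets the delimiter of the output.
buffer = io.StringIO()
stations.to_csv(buffer, sep=';', index=False)

# Reading the written text back gives the same table.
roundtrip = pd.read_csv(io.StringIO(buffer.getvalue()), delimiter=';')
print(stations.equals(roundtrip))  # True
```

Passing `index=False` to `to_csv` avoids writing the row index as an extra column.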

The following examples show what we can do with the DataFrame returned by `read_csv`.

* To visualize the first 5 rows:

In [None]:
ingv.head()

* To visualize the last 5 rows:

In [None]:
ingv.tail()

* To check 3 random rows:

In [None]:
ingv.sample(3)

* We can select certain columns:

In [None]:
ingv[['Station', 'Municipality']]

Single brackets tell pandas that we are selecting columns. Here we use double brackets because we pass a list of columns as the argument.

* To extract certain columns and certain rows:

In [None]:
ingv[['Municipality', 'Station']][1:4]
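The same selection can be written in a single step with `iloc` (by position) or `loc` (by label). The sketch below uses a hypothetical table with the default integer index; the station and town names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    'Station': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'Municipality': ['Town A', 'Town B', 'Town C', 'Town D', 'Town E'],
})

# Two-step selection: columns first, then rows 1 to 3.
chained = df[['Municipality', 'Station']][1:4]

# iloc selects by position; the end of the slice is excluded.
by_position = df.iloc[1:4, [1, 0]]

# loc selects by label; the end label is included, so 1:3 covers rows 1, 2 and 3.
by_label = df.loc[1:3, ['Municipality', 'Station']]

print(chained.equals(by_position) and chained.equals(by_label))  # True
```

The different treatment of the slice end in `iloc` and `loc` is a common source of off-by-one mistakes.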

* The `loc` property is very useful for accessing a group of rows and columns by label or by a boolean array. To access the rows where the latitude is below 37, we could do:

In [None]:
ingv.loc[ingv['Lat (N)'] < 37 ]

* Or if we want to visualize only the information for a certain `Station`:

In [None]:
ingv.loc[ingv['Station'] == 'ISPIC']
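Conditions can also be combined. The sketch below assumes a hypothetical table with the same column names; `&` means "and", and `loc` accepts a list of columns as a second argument.

```python
import pandas as pd

df = pd.DataFrame({
    'Station': ['AAA', 'BBB', 'CCC'],
    'Lat (N)': [36.5, 38.1, 36.9],
})

# Build each condition separately, then combine them with & ("and").
south = df['Lat (N)'] < 37
not_aaa = df['Station'] != 'AAA'

# Keep the matching rows and only the listed columns.
selected = df.loc[south & not_aaa, ['Station', 'Lat (N)']]
print(selected['Station'].tolist())  # ['CCC']
```

When the conditions are written inline, each one needs its own parentheses, for example `df.loc[(df['Lat (N)'] < 37) & (df['Station'] != 'AAA')]`.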

## Summary

* You learned about the main **differences** between NumPy and Pandas
* You learned the **basics** of NumPy and Pandas
* You learned to read files with NumPy using **`loadtxt`** and **`genfromtxt`**
* You learned to read files with Pandas using **`read_csv`**