# Pandas practicals
***Data analysis for geosciences with python***

*Atelier Numérique de l'OMP*

# **Part I: Introduction to pandas objects**
In this part, we will create a simple dataframe containing information about the planets of the solar system. We will then open a more complex dataframe containing more information.

### Question 1.1
Import the pandas library using the `pd` alias

In [10]:
import numpy as np
import matplotlib.pyplot as plt

# write your answer here


### Question 1.2
Here is a list of data about planets in our solar system. The mass and distance are given relative to the Earth. Create a `Series` for the mass of the planets. The index should be the planets' names.
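
As a hint, here is a minimal sketch of how a labelled `Series` can be built from two lists. The values and the names `labels`, `values` and `toy_series` are made up for illustration and are unrelated to the planet data.

```python
import pandas as pd

# Toy values, not the exercise data
labels = ["a", "b", "c"]
values = [1.0, 2.5, 4.0]

# `index` sets the row labels; `name` is optional
toy_series = pd.Series(values, index=labels, name="value")

# Label-based lookup of a single element
print(toy_series["b"])
```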

In [2]:
name_planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
distance_planets = [ 0.38709893, 0.72333199, 1.0, 1.52366231, 5.20336301, 9.6, 19.19126393, 30.0]
mass_planets =[0.055, 0.815, 1.0, 0.107, 317.8, 95.152, 14.536, 17.147]

# write your answer here


### Question 1.3
Now create a `DataFrame` that contains the mass and the distance of the planets.
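
One possible approach, sketched on toy data, is to pass a dictionary of lists to `pd.DataFrame`, with one key per column. The column names `height` and `width` and the variable names are hypothetical. The last line also shows how `.loc` retrieves a single cell by row and column label.

```python
import pandas as pd

# Each dictionary key becomes a column; `index` sets the row labels
toy_df = pd.DataFrame(
    {"height": [1.2, 3.4, 5.6], "width": [10, 20, 30]},
    index=["a", "b", "c"],
)

# .loc[row_label, column_label] returns one cell
cell = toy_df.loc["b", "width"]
print(cell)
```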

In [3]:
# write your answer here


### Question 1.4
Print the mass of Saturn

In [4]:
# write your answer here


### Question 1.5

In the `'../Data'` directory, there is a more complete file called `sol_data.csv`. 
Open the file and store it in a variable called `df` (as in *dataframe*). Use column 0 as the index. 
Once the file is open, add a line containing `df` at the end of the cell before running it. This lets you check what the dataframe looks like. You can also type `df.columns` to see the list of available columns.
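
The sketch below illustrates the `index_col` argument of `pd.read_csv` on a small in-memory CSV. The text in `csv_text` is invented for the example; with a real file you would pass its path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# A tiny CSV held in memory (made-up content)
csv_text = "name,height,width\na,1.2,10\nb,3.4,20\n"

# index_col=0 uses the first column as the row labels
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The remaining columns are listed in .columns
cols = list(toy.columns)
print(cols)
```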

In [5]:
# write your answer here


### Question 1.6 
Add a new column containing the surface area of each body, computed as $S = 4\pi r^2$, where $r$ is the mean radius.
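
Assigning to a new column name creates the column. Here is a minimal sketch on a toy dataframe, where `radius` and `disc_area` are hypothetical column names and the formula is the area of a disc rather than the sphere surface asked for above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"radius": [1.0, 2.0]}, index=["small", "large"])

# Arithmetic on a column is applied element by element;
# assigning the result to a new key adds a column
toy["disc_area"] = np.pi * toy["radius"] ** 2
print(toy)
```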

In [6]:
# write your answer here
mean_radius = 
surface = 4*np.pi*mean_radius**2

# Add column to dataframe here



eName
Moon            1.368478e+04
Phobos          1.368478e+04
Deimos          1.368478e+04
Io              4.169349e+07
Europa          3.061289e+07
                    ...     
S/2017 J 8      3.141593e+00
S/2017 J 9      2.827433e+01
Ersa            2.827433e+01
Ultima Thule    1.368478e+04
101955 Bennu    1.368478e+04
Name: surface, Length: 265, dtype: float64

-----------------------------------------------
# **Part II: First analyses**
In this part, we will analyse the solar system data from the file we opened in the first part. Most of the columns contain information about celestial bodies (mean radius, mass, volume, inclination, ...) and their orbits (semi-major axis, perihelion, ...). The following schematic shows the different orbit parameters that will be used (from https://www.astronomynotes.com/history).

![image.png](attachment:image.png)

### Question 2.1
Create a new dataframe containing only planets (where `isPlanet` is `True`), called `df_planet`
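
A boolean column can be used directly as a row filter. The following sketch uses invented data and a hypothetical `is_fruit` column to show the idea.

```python
import pandas as pd

toy = pd.DataFrame(
    {"is_fruit": [True, False, True], "weight": [0.2, 1.5, 0.1]},
    index=["apple", "cabbage", "cherry"],
)

# Indexing with a boolean Series keeps only the rows where it is True
fruits = toy[toy["is_fruit"]]
print(fruits)
```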

In [7]:
# write your answer here


### Question 2.2
Sort `df_planet` by the mass in descending order to find the heaviest planets
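
Sorting is typically done with `sort_values`. This is a small sketch on toy data; the column name `weight` is an assumption made for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {"weight": [0.2, 1.5, 0.1]},
    index=["apple", "cabbage", "cherry"],
)

# ascending=False puts the largest values first;
# a new dataframe is returned, the original is unchanged
by_weight = toy.sort_values("weight", ascending=False)
print(by_weight)
```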

In [8]:
# write your answer here



### Question 2.3
Here is a new series called `type_of_planets` which contains the type of each of the planets.
Add a new column to `df_planet` containing the information of `type_of_planets`.

In [None]:
type_of_planets = pd.Series(['dwarf','dwarf','gaz','dwarf',
                             'gaz', 'dwarf','dwarf','gaz',
                             'terrestrial','terrestrial', 
                             'gaz','terrestrial','terrestrial'],
                             index = df_planet.index)
type_of_planets

In [12]:
# write your answer here ...


### Question 2.4
Group the planets by their type and compute the mean density.

In [13]:
# write your answer here ...



### Question 2.5
Now using the same group and the `agg` method, compute the mean, minimum, maximum and standard deviation of the mean radius.
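
As an illustration of the two grouping questions, the sketch below groups a toy dataframe by a hypothetical `kind` column, first with a single aggregator and then with a list of aggregators passed to `agg`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"kind": ["x", "x", "y", "y"], "size": [1.0, 3.0, 10.0, 20.0]}
)

# One statistic per group
mean_size = toy.groupby("kind")["size"].mean()

# Several statistics at once: one output column per aggregator name
summary = toy.groupby("kind")["size"].agg(["mean", "min", "max", "std"])
print(summary)
```

Selecting the column before aggregating keeps the result small; without it, the aggregators are applied to every remaining column.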

In [14]:
# write your answer here ...



### <span style="color:red">Question 2.6:</span>
<span style="color:red">*More open question*</span>


The column *orbit* specifies which other body a given body is orbiting. Find which planet has the most satellites.
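
One of several possible routes is to count how often each value appears in a column. The sketch below does this with `value_counts` on toy data, where the `parent` column is a made-up stand-in; whether this is sufficient for the real dataframe depends on how its columns are encoded.

```python
import pandas as pd

toy = pd.DataFrame(
    {"parent": ["A", "B", "A", "A", None]},
    index=["m1", "m2", "m3", "m4", "A"],
)

# value_counts ignores missing values by default
counts = toy["parent"].value_counts()

# idxmax returns the label with the highest count
top = counts.idxmax()
print(top, counts[top])
```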

In [16]:
# write your answer here ...


-----------------------------------------------
# **Part III: Plotting**

### Question 3.1
Make a scatter plot of the planets' density (y axis) against their semi-major axis (x axis).


In [17]:
# write your answer here ...


### Question 3.2:
Using the full dataframe `df` used in Part II, plot a histogram of the eccentricity of all bodies in the solar system.

In [18]:
# write your answer here ...


### Question 3.3:
Using the full dataframe `df` used in Part II, plot a boxplot of the aphelion and perihelion of all bodies in the solar system.

In [19]:
# write your answer here ...



### Question 3.4:
Using the `df_planet` used in Part II, plot the mass of the planets in descending order using a horizontal barplot. 
You can use a log scale on the x axis by adding `plt.xscale('log')` for better visibility.

In [20]:
# write your answer here ...


-----------------------------------------------
# **Part IV: Timeseries**
In this part, we move away from planets to focus on ocean temperature. We will analyse data from two temperature loggers that recorded the ocean temperature on a coral reef in the tropical Indian Ocean, one at 10 m depth and the other at 25 m.

### Question 4.1

Open the following files
- `../data/PerosBanhos-NorthPB-IleDiamant_71.76952_-5.24626_10_.csv`
- `../data/PerosBanhos-NorthPB-IleDiamant_71.76952_-5.24626_25_.csv`

**Warning!** Use only the first 300,000 rows, as the files are very large.

Use the keyword `parse_dates=True` to automatically convert the dates in the index into a `DatetimeIndex`.
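
The sketch below shows how `nrows`, `index_col` and `parse_dates` combine in `pd.read_csv`. It reads an invented in-memory CSV; the column names `time` and `temperature` are assumptions and may not match the layout of the logger files.

```python
import io
import pandas as pd

# Made-up CSV with a timestamp in the first column
csv_text = (
    "time,temperature\n"
    "2020-01-01 00:00:00,28.1\n"
    "2020-01-01 00:01:00,28.2\n"
    "2020-01-01 00:02:00,28.3\n"
    "2020-01-01 00:03:00,28.4\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,       # use the first column as the index
    parse_dates=True,  # parse that index as dates
    nrows=3,           # read only the first 3 data rows
)
print(type(toy.index))
```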

In [21]:
# write your answer here


### Question 4.2

Combine the two timeseries into a single dataframe, where the columns are `T10m` and `T25m`.
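
Two series that share a time index can be placed side by side with `pd.concat`. This is a minimal sketch with invented values; the names `shallow` and `deep` are placeholders.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="min")
shallow = pd.Series([28.0, 28.1, 28.2], index=idx)
deep = pd.Series([27.0, 27.1, 27.2], index=idx)

# axis=1 aligns the series on their index and stacks them as columns;
# the dictionary keys become the column names
both = pd.concat({"shallow": shallow, "deep": deep}, axis=1)
print(both)
```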

In [22]:
# write your answer here


### Question 4.3

Plot the two timeseries on April 24th 2019 as line plots.
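
With a `DatetimeIndex`, a whole day can be selected by passing a date string to `.loc`. The sketch below demonstrates this on toy hourly data; calling `.plot()` on the selection would then draw one line per column.

```python
import pandas as pd

# Two days of hourly toy data
idx = pd.date_range("2020-01-01", periods=48, freq="60min")
toy = pd.DataFrame({"value": list(range(48))}, index=idx)

# A date string selects every row that falls on that day
one_day = toy.loc["2020-01-02"]
print(len(one_day))
```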

In [23]:
# write your answer here



### Question 4.4
Get the key statistics of the dataframe `df` using the `describe` method.

In [24]:
# write your answer here



### Question 4.5 
Resample the data to get one point every 10 minutes, and plot the same day as in Question 4.3.

You can try to use a mean aggregator or a minimum.
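
As an illustration, the sketch below resamples one-minute toy data into 10-minute bins with two different aggregators. The data are invented.

```python
import pandas as pd

# 20 one-minute samples with values 0..19
idx = pd.date_range("2020-01-01 00:00", periods=20, freq="min")
toy = pd.DataFrame({"value": list(range(20))}, index=idx)

# Each 10-minute bin is reduced to a single value
ten_min_mean = toy.resample("10min").mean()
ten_min_min = toy.resample("10min").min()
print(ten_min_mean)
```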

In [25]:
# write your answer here



### Question 4.6
Plot the data for the day with the largest temperature difference between 10 m and 25 m.
To do so, you could use the `idxmax` method.
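
Here is a sketch of how `idxmax` can be chained with date-string selection, using toy columns `a` and `b` with made-up values.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2020-01-01 06:00", "2020-01-01 18:00",
     "2020-01-02 06:00", "2020-01-02 18:00"]
)
toy = pd.DataFrame(
    {"a": [5.0, 6.0, 9.0, 6.5], "b": [4.0, 5.5, 5.0, 6.0]},
    index=idx,
)

# Difference between the two columns at each time step
diff = toy["a"] - toy["b"]

# idxmax returns the index label (here a Timestamp) of the maximum
when = diff.idxmax()

# Format the timestamp as a date string to select the whole day
that_day = toy.loc[when.strftime("%Y-%m-%d")]
print(when, len(that_day))
```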

In [26]:
# write your answer here ...


### Question 4.7
Using the full dataset, extract the data recorded between 12:00 and 12:59 only. Plot a histogram of the data.
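
Two equivalent ways of selecting by time of day are sketched below on toy data: a boolean mask on `index.hour`, and the `between_time` method.

```python
import pandas as pd

# Samples at 11:30, 12:00, 12:30 and 13:00
idx = pd.date_range("2020-01-01 11:30", periods=4, freq="30min")
toy = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Option 1: boolean mask on the hour of each timestamp
noon_hour = toy[toy.index.hour == 12]

# Option 2: between_time, which includes both end times by default
same = toy.between_time("12:00", "12:59")
print(noon_hour)
```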

In [28]:
# write your answer here



### Question 4.8
Using the 10-minute resampled dataset, compute the mean diurnal cycle of the data, grouping data by hour.
Plot the results as line plots.

In [29]:
# write your answer here



### Question 4.9
Using the same 10-minute resampled dataset, compute the mean diurnal cycle of the data, grouping data by hour and minute.
Plot the results as line plots.
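
Grouping by components of the time index is one way to build such a cycle. The sketch below uses two days of invented 30-minute data and groups first by hour, then by hour and minute.

```python
import pandas as pd

# Two toy days, each sampled at 00:00, 00:30, 01:00 and 01:30
day1 = pd.date_range("2020-01-01 00:00", periods=4, freq="30min")
day2 = pd.date_range("2020-01-02 00:00", periods=4, freq="30min")
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0],
    index=day1.append(day2),
)

# One mean value per hour of the day
by_hour = toy.groupby(toy.index.hour).mean()

# One mean value per (hour, minute) pair; the result has a MultiIndex
by_hour_minute = toy.groupby([toy.index.hour, toy.index.minute]).mean()
print(by_hour_minute)
```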

In [30]:
# write your answer here



### <span style="color:red">Question 4.10:</span>
<span style="color:red">*More open question*</span>

Compute the 0.25 and 0.75 quantiles of `T10m` and `T25m` to estimate the uncertainty of the diurnal cycle, and add them to the plot using the matplotlib function `plt.fill_between(x, y1, y2)`. You will need to use reshaping functions such as `df.unstack()`. 
You can also change the aggregator used for the line from `mean` to `median` for better results.
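
The reshaping step can be sketched as follows on toy data. Asking a grouped series for several quantiles returns a result indexed by (group, quantile); `unstack` then moves the quantile level into columns, which is a convenient shape for `plt.fill_between`. The variable names are hypothetical and the plotting call is left as a comment.

```python
import pandas as pd

# Two toy days sampled every 30 minutes during the first two hours
day1 = pd.date_range("2020-01-01 00:00", periods=4, freq="30min")
day2 = pd.date_range("2020-01-02 00:00", periods=4, freq="30min")
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0],
    index=day1.append(day2),
)

# Result is indexed by (hour, quantile)
q = toy.groupby(toy.index.hour).quantile([0.25, 0.75])

# Move the quantile level into columns: one row per hour
bands = q.unstack()
print(bands)

# The band could then be drawn with, for example:
# plt.fill_between(bands.index, bands[0.25], bands[0.75], alpha=0.3)
```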


In [31]:
# write your answer here
