In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv('../csv/volcanoes_2021.csv', delimiter=';', encoding = "ISO-8859-1")
df.head()

Unnamed: 0,Volcano Number,Volcano Name,Country,Primary Volcano Type,Activity Evidence,Last Known Eruption,Region,Subregion,Latitude,Longitude,Elevation (m),Dominant Rock Type,Tectonic Setting
0,210010,West Eifel Volcanic Field,Germany,Maar(s),Eruption Dated,8300 BCE,Mediterranean and Western Asia,Western Europe,5017,685,600,Foidite,Rift zone / Continental crust (>25 km)
1,210020,Chaine des Puys,France,Lava dome(s),Eruption Dated,4040 BCE,Mediterranean and Western Asia,Western Europe,45775,297,1464,Basalt / Picro-Basalt,Rift zone / Continental crust (>25 km)
2,210030,Olot Volcanic Field,Spain,Pyroclastic cone(s),Evidence Credible,Unknown,Mediterranean and Western Asia,Western Europe,4217,253,893,Trachybasalt / Tephrite Basanite,Intraplate / Continental crust (>25 km)
3,210040,Calatrava Volcanic Field,Spain,Pyroclastic cone(s),Eruption Dated,3600 BCE,Mediterranean and Western Asia,Western Europe,3887,-402,1117,Basalt / Picro-Basalt,Intraplate / Continental crust (>25 km)
4,211003,Vulsini,Italy,Caldera,Eruption Observed,104 BCE,Mediterranean and Western Asia,Italy,426,1193,800,Trachyte / Trachydacite,Subduction zone / Continental crust (>25 km)


In [3]:
df.shape[0]

1356

### Search for missing values 


Let's check if there are any null values.

In [4]:
df.isnull().sum()

Volcano Number           0
Volcano Name             0
Country                  0
Primary Volcano Type     0
Activity Evidence        0
Last Known Eruption      0
Region                   0
Subregion                0
Latitude                 0
Longitude                0
Elevation (m)            0
Dominant Rock Type      21
Tectonic Setting         5
dtype: int64

Only ```Dominant Rock Type``` (21) and ```Tectonic Setting``` (5) have null values, a small fraction of the 1356 rows (about 1.5% and 0.4% respectively). <br><br>
However, some columns use the string ```Unknown``` as a value, which is another way of marking missing data.<br>
<br>
Let's check how many:

In [5]:
for col in df.columns:
    # Count the rows where this column holds the literal string 'Unknown'
    unknowns = (df[col] == 'Unknown').sum()
    print(f"{col} = {unknowns}")

Volcano Number = 0
Volcano Name = 0
Country = 0
Primary Volcano Type = 0
Activity Evidence = 0
Last Known Eruption = 490
Region = 0
Subregion = 0
Latitude = 0
Longitude = 0
Elevation (m) = 0
Dominant Rock Type = 0
Tectonic Setting = 1
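The same count can be obtained without an explicit loop, together with the share of rows it represents. The sketch below is only an illustration: the `toy` frame and its values are made up for the example and are not the volcano data.

```python
import pandas as pd

# Toy frame standing in for the volcano data (values are illustrative only).
toy = pd.DataFrame({
    "Last Known Eruption": ["8300 BCE", "Unknown", "Unknown", "104 BCE"],
    "Tectonic Setting": ["Rift zone", "Unknown", "Intraplate", "Subduction zone"],
    "Elevation (m)": [600, 1464, 893, 800],
})

# Element-wise comparison against the sentinel string, then a column-wise sum.
unknown_counts = toy.eq("Unknown").sum()

# Share of rows per column, as a percentage of all observations.
unknown_share = (unknown_counts / len(toy) * 100).round(1)

summary = pd.DataFrame({"count": unknown_counts, "percent": unknown_share})
print(summary)
```

Applied to `df`, the `percent` column would give the proportion of ```Unknown``` entries per column directly.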


```'Last Known Eruption'``` has **490** ```'Unknown'``` values, which is about **36%** of the observations in the dataframe. <br>
Since we have a separate dataset specific to eruptions, which contains more information, we can drop this column.

In [6]:
df = df.drop(columns=['Last Known Eruption'])

### Latitude and Longitude exploration

In [7]:
df['Latitude'].head()

0     50,17
1    45,775
2     42,17
3     38,87
4      42,6
Name: Latitude, dtype: object

Latitude and Longitude are of type **object** and use ',' as the decimal separator.<br>
We will convert both columns to numeric.


In [8]:
# Swap the decimal comma for a dot (literal match), then parse as float
df['Latitude'] = df['Latitude'].str.replace(',', '.', regex=False)
df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')

In [9]:
# Same conversion for Longitude
df['Longitude'] = df['Longitude'].str.replace(',', '.', regex=False)
df['Longitude'] = pd.to_numeric(df['Longitude'], errors='coerce')
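Because `errors='coerce'` silently turns unparseable values into `NaN`, it can be worth checking whether the conversion lost anything. The helper below is a minimal sketch of that idea. The name `decimal_comma_to_float` is hypothetical, and the sample mixes latitudes printed earlier with one deliberately malformed entry (`'n/a'`) that is not in the dataset.

```python
import pandas as pd

def decimal_comma_to_float(series: pd.Series) -> pd.Series:
    """Convert strings such as '50,17' to floats and report values that fail to parse."""
    # regex=False treats ',' as a literal character rather than a pattern.
    cleaned = series.str.replace(",", ".", regex=False)
    converted = pd.to_numeric(cleaned, errors="coerce")
    # Present before the conversion but NaN afterwards: the value was coerced.
    coerced = converted.isna() & series.notna()
    if coerced.any():
        print(f"{series.name}: {coerced.sum()} value(s) could not be parsed")
    return converted

# Three latitudes from the output above, plus one made-up malformed entry.
sample = pd.Series(["50,17", "45,775", "42,17", "n/a"], name="Latitude")
latitude = decimal_comma_to_float(sample)
print(latitude)
```

An alternative, if the file is loaded again, would be to pass `decimal=','` to `pd.read_csv` so that such columns are parsed as numbers at load time.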

### 'Primary Volcano Type' exploration

In [10]:
df['Primary Volcano Type'].value_counts()

Stratovolcano          554
Stratovolcano(es)      116
Submarine              115
Shield                 102
Pyroclastic cone(s)     87
Volcanic field          80
Caldera                 76
Complex                 52
Lava dome(s)            29
Shield(s)               23
Pyroclastic cone        16
Fissure vent(s)         15
Compound                12
Maar(s)                 10
Caldera(s)              10
Tuff cone(s)             9
Pyroclastic shield       8
Lava dome                8
Crater rows              6
Subglacial               5
Maar                     5
Fissure vent             3
Tuff cone                3
Stratovolcano?           2
Lava cone                2
Submarine(es)            2
Explosion crater(s)      1
Complex(es)              1
Cone(s)                  1
Tuff ring(s)             1
Lava cone(s)             1
Lava cone(es)            1
Name: Primary Volcano Type, dtype: int64

We have both ```Stratovolcano``` and ```Stratovolcano(es)```, each with a significant count.<br>
We also have ```Stratovolcano?```. <br><br>
Let's merge all of them into ```Stratovolcano```.


In [11]:
df['Primary Volcano Type'].loc[lambda x: x.str.startswith('Stratovolcano', na=False)].value_counts()

Stratovolcano        554
Stratovolcano(es)    116
Stratovolcano?         2
Name: Primary Volcano Type, dtype: int64

In [12]:
df['Primary Volcano Type'] = df['Primary Volcano Type'].replace(['Stratovolcano(es)','Stratovolcano?'], 'Stratovolcano')

In [13]:
df['Primary Volcano Type'].loc[lambda x: x.str.startswith('Stratovolcano', na=False)].value_counts()

Stratovolcano    672
Name: Primary Volcano Type, dtype: int64
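The counts above show the same singular/plural split for other types (for example ```Shield``` and ```Shield(s)```, ```Lava cone``` and ```Lava cone(es)```). This notebook only merges the Stratovolcano variants. If one wanted to normalize every type the same way, a regular expression could strip the trailing ```(s)```, ```(es)``` or ```?```. The following is a sketch under that assumption, using a few labels from the output above. Whether uncertain (```?```) and plural entries should be merged is a judgment call.

```python
import pandas as pd

# A few labels taken from the value_counts() output above.
types = pd.Series([
    "Stratovolcano(es)", "Stratovolcano?", "Shield(s)",
    "Lava cone(es)", "Caldera", "Crater rows",
])

# Remove a trailing "(s)", "(es)" or "?" and any leftover whitespace.
# Labels without such a suffix (e.g. "Crater rows") are left unchanged.
normalized = (
    types.str.replace(r"(\((?:s|es)\)|\?)$", "", regex=True)
         .str.strip()
)

print(normalized.value_counts())
```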

### 'Dominant Rock Type' exploration

In [14]:
df['Dominant Rock Type'].value_counts()

Andesite / Basaltic Andesite                559
Basalt / Picro-Basalt                       427
Dacite                                       89
Trachybasalt / Tephrite Basanite             65
Rhyolite                                     61
No Data (checked)                            45
Trachyte / Trachydacite                      36
Trachyandesite / Basaltic Trachyandesite     28
Foidite                                      14
Phonolite                                     8
Phono-tephrite /  Tephri-phonolite            3
Name: Dominant Rock Type, dtype: int64

We have 45 ```No Data (checked)``` values. Let's change them to ```NaN``` for consistency.

In [15]:
df['Dominant Rock Type'].isnull().sum()

21

In [16]:
# np.nan is the supported spelling (the np.NaN alias was removed in NumPy 2.0)
df['Dominant Rock Type'] = df['Dominant Rock Type'].replace(['No Data (checked)'], np.nan)
df['Dominant Rock Type'].value_counts()

Andesite / Basaltic Andesite                559
Basalt / Picro-Basalt                       427
Dacite                                       89
Trachybasalt / Tephrite Basanite             65
Rhyolite                                     61
Trachyte / Trachydacite                      36
Trachyandesite / Basaltic Trachyandesite     28
Foidite                                      14
Phonolite                                     8
Phono-tephrite /  Tephri-phonolite            3
Name: Dominant Rock Type, dtype: int64

Let's check the new NaN count. It should now include the former ```No Data (checked)``` rows: 21 + 45 = 66.


In [17]:
df['Dominant Rock Type'].isnull().sum()

66
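The same before/after check can be written as an assertion, so that a mismatch fails loudly instead of relying on reading the numbers by eye. This is a small sketch on a made-up series, not the real column.

```python
import numpy as np
import pandas as pd

# Made-up sample: one real NaN and two sentinel strings.
rock = pd.Series(
    ["Dacite", "No Data (checked)", np.nan, "Rhyolite", "No Data (checked)"],
    name="Dominant Rock Type",
)

nulls_before = rock.isna().sum()
sentinels = (rock == "No Data (checked)").sum()

# Replace the sentinel string with a proper missing value.
rock = rock.replace("No Data (checked)", np.nan)

# Every sentinel should now be counted as missing, on top of the original NaNs.
assert rock.isna().sum() == nulls_before + sentinels
print(nulls_before, sentinels, rock.isna().sum())  # prints: 1 2 3
```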

In [18]:
df.to_csv('../csv/cleaned/volcanoes_cleaned.csv', index=False)