# Volcanic Eruptions - Cleaning

## Setup

In [1]:
# import libraries
import pandas as pd
import numpy as np
import re
from itertools import compress

In [2]:
# project root
here = "../"

# read data
eruptions = pd.read_csv(here + 'data/extdata/smithsonian-volcanic-eruptions.csv')

## Show data info

In [3]:
eruptions.head()

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
0,210010,West Eifel Volcanic Field,Germany,Mediterranean and Western Asia,Maar(s),Eruption Dated,8300 BCE,50.17,6.85,600,Foidite,Rift Zone / Continental Crust (>25 km)
1,210020,Chaine des Puys,France,Mediterranean and Western Asia,Lava dome(s),Eruption Dated,4040 BCE,45.775,2.97,1464,Basalt / Picro-Basalt,Rift Zone / Continental Crust (>25 km)
2,210030,Olot Volcanic Field,Spain,Mediterranean and Western Asia,Pyroclastic cone(s),Evidence Credible,Unknown,42.17,2.53,893,Trachybasalt / Tephrite Basanite,Intraplate / Continental Crust (>25 km)
3,210040,Calatrava Volcanic Field,Spain,Mediterranean and Western Asia,Pyroclastic cone(s),Eruption Dated,3600 BCE,38.87,-4.02,1117,Basalt / Picro-Basalt,Intraplate / Continental Crust (>25 km)
4,211001,Larderello,Italy,Mediterranean and Western Asia,Explosion crater(s),Eruption Observed,1282 CE,43.25,10.87,500,No Data,Subduction Zone / Continental Crust (>25 km)


In [4]:
eruptions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1508 entries, 0 to 1507
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Number               1508 non-null   int64  
 1   Name                 1508 non-null   object 
 2   Country              1508 non-null   object 
 3   Region               1508 non-null   object 
 4   Type                 1508 non-null   object 
 5   Activity Evidence    1507 non-null   object 
 6   Last Known Eruption  1508 non-null   object 
 7   Latitude             1508 non-null   float64
 8   Longitude            1508 non-null   float64
 9   Elevation (Meters)   1508 non-null   int64  
 10  Dominant Rock Type   1455 non-null   object 
 11  Tectonic Setting     1501 non-null   object 
dtypes: float64(2), int64(2), object(8)
memory usage: 141.5+ KB


## Cleaning

In [5]:
# Standardise column names

# drop (, ); replace spaces; to lower
eruptions.columns = [re.sub(r"(\(|\))", "", C.lower().replace(" ", "_")) for C in eruptions.columns]

# shorten column names
eruptions.rename(columns = {'latitude': 'lat', 'longitude': 'lon'}, inplace=True)
eruptions.rename(
    columns = {
        'number': 'id', 
        'elevation_meters': 'elevation', 
        'dominant_rock_type': 'dominant_rock'
    }, 
    inplace = True
)

### Dates

The dates of the last known eruption for each volcano are given in the BCE/CE system. Before they can be converted to a more usable system, we need to check for values that don't match a consistent format.

A quick inspection shows that at least some missing values are encoded as `'Unknown'`.

In [6]:
eruptions[['id', 'name', 'last_known_eruption']].head()

Unnamed: 0,id,name,last_known_eruption
0,210010,West Eifel Volcanic Field,8300 BCE
1,210020,Chaine des Puys,4040 BCE
2,210030,Olot Volcanic Field,Unknown
3,210040,Calatrava Volcanic Field,3600 BCE
4,211001,Larderello,1282 CE


We can also use regex matching to check for values that don't match a sequence of up to four digits followed by a space and either `'BCE'` or `'CE'`.

In [7]:
# Get list of unique values in the `last_known_eruption` column
unique_dates = eruptions['last_known_eruption'].sort_values().unique()

# Filter the list for values that don't match standard date
list(compress(unique_dates, [re.search(r"^\d{1,4}\s(BCE|CE)$", date) is None for date in unique_dates]))

['10450 BCE', 'Unknown']

This reveals that only `'Unknown'` and a single five-digit date fail to
match, so minimal cleaning will be needed before converting the dates.
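
The same check can also be run directly on the column with pandas string methods. The sketch below is illustrative only: it uses a small made-up series rather than the real data, and the names `sample_dates`, `is_standard` and `non_standard` are my own.

```python
import pandas as pd

# Toy stand-in for eruptions['last_known_eruption'] (values are illustrative)
sample_dates = pd.Series(['8300 BCE', '1282 CE', 'Unknown', '10450 BCE', '0 CE'])

# fullmatch() anchors the pattern at both ends, so no ^ or $ is needed
is_standard = sample_dates.str.fullmatch(r"\d{1,4}\s(?:BCE|CE)")

# Keep only the values that fail the format check
non_standard = sample_dates[~is_standard].unique().tolist()
print(non_standard)
```

Note that a value such as `'0 CE'` passes a purely format-based check, so impossible years still have to be looked for separately.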

One record gives the last known eruption as "0 CE". As there is no 'year 0' 
in the BC/AD or BCE/CE system, we need to remove this row.

In [8]:
eruptions[eruptions['last_known_eruption'] == '0 CE']

Unnamed: 0,id,name,country,region,type,activity_evidence,last_known_eruption,lat,lon,elevation,dominant_rock,tectonic_setting
874,305011,Arshan,China,Kamchatka and Mainland Asia,Pyroclastic cone(s),Eruption Dated,0 CE,47.5,120.7,0,Basalt / Picro-Basalt,Intraplate / Continental Crust (>25 km)


In [9]:
eruptions.query('last_known_eruption != "0 CE"', inplace = True)

Next we can define a function to convert our BCE/CE string dates into integers 
that will be much easier to plot and analyse. 

One of the difficulties of the BCE/CE system is that most data analysis tools 
don't support BCE dates, and it can be difficult to work around this by 
representing them with negative numbers (and CE dates with positive numbers) 
because there is no 'year 0'. As a simpler solution for this project, we can
convert the dates to 'bp' (Before Present), which is usually reserved for 
radiocarbon dating but will suit our purposes here.

I'll use the 'bp' format rather than 'BP' (which is used to indicate calibrated
radiocarbon dates) and follow the standard convention of using 1950 CE as the 
definition of 'present'. Any eruptions post-dating 1950 will therefore appear as
negative numbers on our new scale.
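
To make the 'no year 0' problem concrete, here is a small worked example. It is only a sketch: the helper names `bce_to_bp` and `ce_to_bp` are hypothetical and exist purely to show the arithmetic.

```python
REFDATE = 1950  # conventional 'present' for bp dates

def bce_to_bp(year):
    # Hypothetical helper: one year is subtracted from the reference
    # date because the calendar jumps straight from 1 BCE to 1 CE
    return (REFDATE - 1) + year

def ce_to_bp(year):
    # Hypothetical helper: CE years are a simple difference from 1950
    return REFDATE - year

# Naive signed years (-1 for 1 BCE, +1 for 1 CE) suggest a two-year gap
naive_gap = 1 - (-1)

# On the bp scale the same two years are correctly one year apart
bp_gap = bce_to_bp(1) - ce_to_bp(1)

# A date after 1950 comes out negative
after_present = ce_to_bp(2000)

print(naive_gap, bp_gap, after_present)
```

The naive signed-year representation overstates the gap between 1 BCE and 1 CE by a year, whereas the 'bp' values keep every interval consistent.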

We can define a function that will take in a BCE/CE string and split it into 
two, converting the numeric part into an integer and using the BCE/CE indicator 
to determine what arithmetic to apply to convert the date to 'bp' format. We'll 
also use this function to replace `'Unknown'` with a true missing value, `None`.

In [10]:
# Define function
def era_date_to_bp(date, refdate = 1950, missing = ('Unknown',)):
    
    # Convert any input in `missing` to `None`
    if date in missing:
        return None
    
    # Split string into year and era, and convert year to integer
    year, era = re.split(r"\s", date)
    year = int(year)
    
    # Raise exception if year is 0 or era is unrecognised, otherwise convert
    if year == 0:
        raise ValueError("There is no year '0' in the BCE/CE system")
    elif era == "BCE":
        return (refdate - 1) + year # `refdate - 1` as there is no year 0
    elif era == "CE":
        return refdate - year
    else:
        raise ValueError(f"Unrecognised era in date: {date!r}")

In [11]:
eruptions['last_eruption_bp'] = [era_date_to_bp(date) for date in eruptions['last_known_eruption']]
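
For larger datasets, the same conversion could be expressed with vectorised pandas operations instead of a list comprehension. This is a sketch on made-up values, not the code used in this notebook; `sample_dates`, `parts` and `bp` are illustrative names.

```python
import pandas as pd

REFDATE = 1950

# Toy stand-in for eruptions['last_known_eruption']
sample_dates = pd.Series(['8300 BCE', '1282 CE', 'Unknown', '10450 BCE', '2000 CE'])

# Split each string into a year and an era; non-matching rows become missing
parts = sample_dates.str.extract(r"^(?P<year>\d+)\s(?P<era>BCE|CE)$")
year = pd.to_numeric(parts['year'])

# CE rows use refdate - year; all other rows use (refdate - 1) + year,
# which stays missing wherever the year itself is missing
bp = (REFDATE - year).where(parts['era'] == 'CE', (REFDATE - 1) + year)
print(bp.tolist())
```

Unlike the function above, this version does not raise an error for a year of 0, so that check would still need to be made beforehand.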

Converting the dates to 'bp', however, reveals a large number of unknown 
(now missing) values. We only have 870 dates out of 1507 rows.

In [12]:
eruptions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1507 entries, 0 to 1507
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   1507 non-null   int64  
 1   name                 1507 non-null   object 
 2   country              1507 non-null   object 
 3   region               1507 non-null   object 
 4   type                 1507 non-null   object 
 5   activity_evidence    1506 non-null   object 
 6   last_known_eruption  1507 non-null   object 
 7   lat                  1507 non-null   float64
 8   lon                  1507 non-null   float64
 9   elevation            1507 non-null   int64  
 10  dominant_rock        1454 non-null   object 
 11  tectonic_setting     1500 non-null   object 
 12  last_eruption_bp     870 non-null    float64
dtypes: float64(3), int64(2), object(8)
memory usage: 164.8+ KB
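
The gap between 1507 rows and 870 non-null dates means 637 values are missing. A more direct way to count them than reading `info()` is sketched below on a toy frame (the values are illustrative).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `eruptions`
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'last_eruption_bp': [10249.0, np.nan, 668.0, np.nan],
})

# Count and proportion of missing dates
n_missing = int(toy['last_eruption_bp'].isna().sum())
share_missing = float(toy['last_eruption_bp'].isna().mean())
print(n_missing, share_missing)
```

The missing values are also why the new column is stored as `float64` rather than as integers: pandas represents them as `NaN`, which is a floating-point value.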


### Types

#### Overlapping categories

Examining the unique values in the `type` column shows that we have overlapping 
categories. Generally, categories should be mutually exclusive for most types
of analysis and a large number of categories makes visualisations less readable.
We can make this variable less problematic and more useful by reducing overlaps
and the total number of categories.

Ideally, consolidating categories down should be done using domain knowledge,
but for now we'll rely on some reasonable suppositions about issues we can
observe just by looking at the unique values.

In [13]:
for i in eruptions.type.sort_values().unique():
    print(f'- {i}')

- Caldera
- Caldera(s)
- Complex
- Complex(es)
- Compound
- Cone(s)
- Crater rows
- Explosion crater(s)
- Fissure vent
- Fissure vent(s)
- Lava cone
- Lava cone(s)
- Lava dome
- Lava dome(s)
- Maar
- Maar(s)
- Pyroclastic cone
- Pyroclastic cone(s)
- Pyroclastic shield
- Shield
- Shield(s)
- Stratovolcano
- Stratovolcano(es)
- Stratovolcano?
- Subglacial
- Submarine
- Submarine(es)
- Tuff cone
- Tuff cone(s)
- Tuff ring(s)
- Unknown
- Volcanic field
- Volcanic field(s)


##### **Singular vs Plural Distinctions**

Firstly, we have cases where a 'singular' type is distinguished from a 
'*possible* plural' type, such as 'Caldera' and 'Caldera(s)'. We have:

- No information about the criteria that defined this distinction
- No way of quantifying a plural type numerically
- No information about the potential range or scale of a plural quantity (e.g. 
  could 'Caldera(s)' refer to 10 or 100 calderas?). As we don't know the 
  potential scale, we don't know how meaningful it is to differentiate between 
  singular and plural. 

Therefore, it seems justified to remove the overlap between these categories 
and between other pairs that make the same distinction.

We'll start by defining a list of the values that need to be changed, and use
regular expressions to compile a dictionary of what the new values should be.

In [14]:
# Define a dictionary to handle `type` replacement

## Start with all types
new_types = eruptions.type.sort_values().unique()

## Use regex to map originals to replacements in a list of tuples
new_types = [(t, re.sub(r"\((es|s)\)$", "", t)) for t in new_types]

## Drop values where original and replacement are the same; for speed
new_types = [t for t in new_types if t[0] != t[1]]

## Dictionary comprehension
new_types = {k: v for k, v in new_types}

We'll hold back on applying this replacement dictionary to the data for now,
as we may add to it later.
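
As a quick illustration of what the regular expression does, the sketch below builds the same kind of mapping in a single dictionary comprehension. It runs on a hand-picked sample of type labels, and `sample_types`, `suffix` and `replacements` are illustrative names.

```python
import re

# A few labels taken from the list of unique types
sample_types = ['Caldera', 'Caldera(s)', 'Stratovolcano(es)', 'Crater rows', 'Stratovolcano?']

# Matches a trailing '(s)' or '(es)' only
suffix = re.compile(r"\((?:es|s)\)$")

# Map each label that has the suffix to its singular form
replacements = {t: suffix.sub("", t) for t in sample_types if suffix.search(t)}
print(replacements)
```

Labels without the suffix, including 'Crater rows' and 'Stratovolcano?', are left out of the mapping and so would be untouched by the replacement.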

##### **Overlapping categories (uncertainty)**

We have a category whose identification is uncertain, `'Stratovolcano?'`, in 
addition to `'Stratovolcano'` and `'Stratovolcano(es)'`. Only four volcanoes
have this uncertain type, however, so we'll handle this just by deleting the
rows.

In [15]:
# Number of rows with Stratovolcano? type
eruptions.query("type == 'Stratovolcano?'").shape[0]

4

In [16]:
# Drop rows with uncertain type
eruptions.query('type != "Stratovolcano?"', inplace = True)

#### Other issues

- `'Unknown'` can be replaced with `None`. We'll add this to our dictionary of
   type replacements.

In [17]:
new_types['Unknown'] = None

#### Apply Replacements

Now our `new_types` dictionary is complete, so we can apply that to our data:

In [18]:
eruptions.replace({'type': new_types}, inplace=True)

#### Further Technical Overlaps

Examining the remaining categories makes it easier to spot some more potential
issues:

In [19]:
for i in eruptions.type.sort_values().unique():
    print(f"- {i}")

- Caldera
- Complex
- Compound
- Cone
- Crater rows
- Explosion crater
- Fissure vent
- Lava cone
- Lava dome
- Maar
- Pyroclastic cone
- Pyroclastic shield
- Shield
- Stratovolcano
- Subglacial
- Submarine
- Tuff cone
- Tuff ring
- Volcanic field
- None


We have two 'pyroclastic' categories, and they both overlap with other types 
('Shield' and 'Cone'). There are also 'Lava cone', 'Tuff cone' and 'Tuff ring' 
categories. These are examples where domain expertise is even more important
for deciding whether non-mutually exclusive categories might be a problem.

It seems clear that the remaining categories are split predominantly between
'shape'-based classifications (cone, dome, shield, ring), 'environmental'
classifications ('submarine', 'subglacial'), and some other form of 
vulcanological or geographic context for where the volcano is located 
('crater rows', 'explosion crater', 'volcanic field').

The additional 'lava \[x\]', 'pyroclastic \[x\]', and 'tuff \[x\]' categories
add some additional information about their volcanoes' typology, but it's
unclear whether 'lava cone' is radically different from 'cone', for example.

I contemplated exploring the potential impact of separating 'lava', 
'pyroclastic', and 'tuff' from the categories they're attached to by creating
new boolean variables:

- `type_is_lava`
- `type_is_tuff`
- `type_is_pyro`

This would then allow the multiple 'cone' and 'shield' categories to be merged,
creating a new 'dome' category in `type` from 'lava dome' as a side-effect.

Ultimately, I decided against this as it would introduce more complexity than
I wanted for this short project, and because these new boolean variables 
would have some artificially strong correlations with `type` that might distort
key results.

I'm including the code drafted to implement this change below, however:

```python
def detect_vol_type(x, cat):
    
    # Compile regex
    regex = re.compile(cat)
    
    # Catch nulls
    if x is None:
        out = None
    # True if match, false if no match
    elif re.search(regex, x) is not None:
        out = True
    else:
        out = False
    
    # Return
    return out

# Create new boolean columns
eruptions['type_is_lava'] = [detect_vol_type(t, "^Lava") for t in eruptions['type']]
eruptions['type_is_tuff'] = [detect_vol_type(t, "^Tuff") for t in eruptions['type']]
eruptions['type_is_pyro'] = [detect_vol_type(t, "^Pyroclastic") for t in eruptions['type']]

# Define replacements for original types
simple_types = eruptions['type'].sort_values().unique()
simple_types = [st for st in simple_types if st != None and re.search(r"(Lava|Tuff|Pyroclastic)", st)]
simple_types = {st: re.sub(r"(Lava|Tuff|Pyroclastic)\s", "", st).title() for st in simple_types}

# Implement replacements
eruptions.replace({'type': simple_types}, inplace = True)
```

### Dominant Rock

We next need to examine the given categories for `dominant_rock`. We can 
immediately make an improvement, however, by merging two values, `nan` and 
`'No Data'`, which both represent missing values.

In [20]:
# Replace values that should be None
eruptions['dominant_rock'].replace({np.nan: None, 'No Data': None}, inplace=True)
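
An alternative would be to declare such placeholder strings as missing when the file is first read, using the `na_values` argument of `pd.read_csv`. The sketch below uses a tiny in-memory CSV with made-up rows rather than the real file.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the raw data file
raw = io.StringIO(
    "Name,Dominant Rock Type\n"
    "A,Foidite\n"
    "B,No Data\n"
    "C,\n"
)

# Treat 'No Data' as missing, but only in the named column;
# empty fields are already read as missing by default
toy = pd.read_csv(raw, na_values={'Dominant Rock Type': ['No Data']})

n_missing = int(toy['Dominant Rock Type'].isna().sum())
print(n_missing)
```

I've kept the explicit replacement in this notebook because it leaves the raw values visible for inspection before they are changed.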

`dominant_rock` is another column with multiple data points in the same variable:

In [21]:
for e in eruptions['dominant_rock'].sort_values().unique():
    print(f'- {e}')

- Andesite / Basaltic Andesite
- Basalt / Picro-Basalt
- Dacite
- Foidite
- Phono-tephrite /  Tephri-phonolite
- Phonolite
- Rhyolite
- Trachyandesite / Basaltic Trachyandesite
- Trachybasalt / Tephrite Basanite
- Trachyte / Trachydacite
- None


Multiple categories are 'either or' types, such as 'Andesite / Basaltic 
Andesite'. We have some overlap between 'Trachyte / Trachydacite' and 'Dacite';
between 'Phono-tephrite / Tephri-phonolite' and 'Phonolite' and so on. It's 
unclear whether 'either or' categories have been used to consolidate a longer 
list of more specific categories or because it can be difficult for 
vulcanologists to categorise a volcano's dominant geology.

Handling these categories properly for analysis purposes requires more expert
knowledge. As with `type`, I contemplated converting
these to booleans using a regex-based approach where any category containing
`'basalt'` would return `True` for a new `dr_is_basaltic` column, and any value
containing `'trachy'` would return `True` for a new `dr_is_trachytic` column.

This would remove the problem with overlapping categories, but I decided that
it would simply replace it with a problem of artificial correlation between
the new boolean columns, effectively replicating the overlap problem in a new
form. It would also undermine the concept behind a '*dominant* rock' variable,
which fundamentally recognises that a volcano's geological makeup is not
monolithic, forcing the data collector to pick a dominant rock type.

As before, I'm nevertheless including code drafted for this purpose below. It
re-uses the `detect_vol_type()` function I created for `type`.

```python
# Create boolean columns for overlapping rock types

## Dictionary of adjectives (keys) and regex (values) to match
rock_type_booleans = {
    'basaltic': 'basalt',
    'andesitic': 'andesite',
    'dacitic': 'dacite',
    'phonolitic': 'phono',
    'tephritic': 'tephri',
    'trachytic': 'trachy',
    'basanitic': 'basan'    
}

## Loop over dictionary to add boolean column by regex matching with `detect_vol_type()`

for key, value in rock_type_booleans.items():
    column_name = f'dr_is_{key}' # add a prefix to the column name
    regex = f'(?i){value}' # make regex case-insensitive
    
    eruptions[column_name] = [detect_vol_type(rock, regex) for rock in eruptions['dominant_rock']]
```

### Tectonic Setting

The `tectonic_setting` column is also made up of 'either or' categories, but they are more consistent than the `dominant_rock` values.

In [22]:
for i in eruptions['tectonic_setting'].sort_values().unique():
    print(f'- {i}')

- Intraplate / Continental Crust (>25 km)
- Intraplate / Intermediate Crust (15-25 km)
- Intraplate / Oceanic Crust (< 15 km)
- Rift Zone / Continental Crust (>25 km)
- Rift Zone / Intermediate Crust (15-25 km)
- Rift Zone / Oceanic Crust (< 15 km)
- Subduction Zone / Continental Crust (>25 km)
- Subduction Zone / Crust Thickness Unknown
- Subduction Zone / Intermediate Crust (15-25 km)
- Subduction Zone / Oceanic Crust (< 15 km)
- Unknown
- nan


As these values seem to be consistently in a 'Zone / Crust (Crust Thickness)' format, we can split them into separate columns rather than creating boolean variables.

In [23]:
# Replace 'Unknown' and 'nan' with 'None'
eruptions['tectonic_setting'].replace({'Unknown': None, np.nan: None}, inplace=True)

# Split tectonic setting
eruptions[['ts_zone', 'ts_crust_type']] = eruptions['tectonic_setting'].str.split(r'\s/\s', expand=True)
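
Before relying on a split like this, it can be worth confirming that every non-missing value contains exactly one separator. The sketch below shows that check and the split itself on three made-up rows; `settings` and `parts` are illustrative names.

```python
import pandas as pd

# Toy stand-in for eruptions['tectonic_setting']
settings = pd.Series([
    'Rift Zone / Continental Crust (>25 km)',
    'Subduction Zone / Crust Thickness Unknown',
    None,
])

# Every non-missing value should contain the separator exactly once
one_separator = bool(settings.dropna().str.count(' / ').eq(1).all())

# Split into two columns; missing inputs stay missing in both
parts = settings.str.split(' / ', expand=True)
parts.columns = ['ts_zone', 'ts_crust_type']
print(one_separator)
print(parts)
```

If the check failed, `expand=True` would produce more than two columns and the assignment to two column names would raise an error.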

Selecting our new columns and dropping the duplicates shows that the measurements in the crust type column simply indicate how a measure of crust thickness has been binned. It also reveals that only one zone category ('Subduction Zone') pairs with 'Crust Thickness Unknown'. We can therefore probably safely replace 'Crust Thickness Unknown' with `None`.

In [24]:
eruptions[['ts_zone', 'ts_crust_type']].drop_duplicates().sort_values('ts_crust_type')

Unnamed: 0,ts_zone,ts_crust_type
0,Rift Zone,Continental Crust (>25 km)
2,Intraplate,Continental Crust (>25 km)
4,Subduction Zone,Continental Crust (>25 km)
278,Subduction Zone,Crust Thickness Unknown
48,Rift Zone,Intermediate Crust (15-25 km)
343,Subduction Zone,Intermediate Crust (15-25 km)
1462,Intraplate,Intermediate Crust (15-25 km)
44,Rift Zone,Oceanic Crust (< 15 km)
201,Intraplate,Oceanic Crust (< 15 km)
233,Subduction Zone,Oceanic Crust (< 15 km)


In [25]:
eruptions['ts_crust_type'].replace({'Crust Thickness Unknown': None}, inplace = True)

Having identified that `ts_crust_type` is an ordered bin, we can strip out the measurements in kilometres and redefine the column as a pandas Category data type.

In [26]:
# Strip out km measurement bins
eruptions['ts_crust_type'] = [re.sub(r'\s\(.*km\)$', '', ct) if ct != None else ct for ct in eruptions['ts_crust_type']]

In [27]:
# Reverse alphabetical order of the categories happens to match ascending crust thickness, so use the sorted unique values (excluding None) to define the category
ts_crust_cat = eruptions['ts_crust_type'].dropna().unique().tolist()
ts_crust_cat.sort(reverse=True)
ts_crust_cat

['Oceanic Crust', 'Intermediate Crust', 'Continental Crust']

In [28]:
# Redefine the column as a category
eruptions['ts_crust_type'] = pd.Categorical(eruptions['ts_crust_type'], categories = ts_crust_cat, ordered = True)
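
The benefit of an *ordered* category is that comparisons and sorting follow crust thickness rather than the alphabet. The sketch below demonstrates this on a few made-up values. It assumes the order is written out explicitly as `crust_order`, rather than derived by sorting as above.

```python
import pandas as pd

# Thinnest to thickest, written out explicitly (an assumption of this sketch)
crust_order = ['Oceanic Crust', 'Intermediate Crust', 'Continental Crust']

crust = pd.Series(pd.Categorical(
    ['Continental Crust', 'Oceanic Crust', 'Intermediate Crust', None],
    categories = crust_order,
    ordered = True
))

# Comparisons use the category order; missing values compare as False
thinner_than_continental = (crust < 'Continental Crust').tolist()

# Sorting also follows the category order
sorted_crust = crust.sort_values().dropna().tolist()
print(thinner_than_continental)
print(sorted_crust)
```

Writing the order out by hand is a little more verbose, but it doesn't depend on the category names happening to sort in the right direction.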

As we'll export the final cleaned version of this data to CSV later, and CSV
files don't store the category data type, we'll generalise the code that converts
the column to a category into a function so it can easily be repeated in other
notebooks. This will be saved as a function in src/.

```python
def ts_crust_type_to_cat(data):
    
    # Get categories
    categories = data['ts_crust_type'].dropna().unique().tolist()
    categories.sort(reverse = True)
    
    # Define category
    data['ts_crust_type'] = pd.Categorical(
        data['ts_crust_type'],
        categories = categories,
        ordered = True
    )
    
    # Return
    return data
```
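
To show why the conversion has to be repeated, the sketch below writes a toy frame to an in-memory CSV and reads it back. The data and the names `crust_order`, `toy` and `reloaded` are illustrative.

```python
import io
import pandas as pd

crust_order = ['Oceanic Crust', 'Intermediate Crust', 'Continental Crust']

toy = pd.DataFrame({
    'ts_crust_type': pd.Categorical(
        ['Continental Crust', 'Oceanic Crust'],
        categories = crust_order,
        ordered = True
    )
})

# Round-trip through CSV using an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index = False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The category dtype is lost on the way through
was_categorical = isinstance(toy['ts_crust_type'].dtype, pd.CategoricalDtype)
still_categorical = isinstance(reloaded['ts_crust_type'].dtype, pd.CategoricalDtype)
print(was_categorical, still_categorical)

# Re-applying the conversion restores the ordered category
reloaded['ts_crust_type'] = pd.Categorical(
    reloaded['ts_crust_type'],
    categories = crust_order,
    ordered = True
)
```

Any notebook that reads the exported CSV would need a step like the last one before it can rely on the category ordering.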

## Finalise Cleaning

To finish up our cleaning, we'll drop the columns we no longer need. These are
the columns we're replacing with new ones.

In [29]:
eruptions.drop(columns = ['last_known_eruption', 'tectonic_setting'], inplace = True)

In [30]:
eruptions.head()

Unnamed: 0,id,name,country,region,type,activity_evidence,lat,lon,elevation,dominant_rock,last_eruption_bp,ts_zone,ts_crust_type
0,210010,West Eifel Volcanic Field,Germany,Mediterranean and Western Asia,Maar,Eruption Dated,50.17,6.85,600,Foidite,10249.0,Rift Zone,Continental Crust
1,210020,Chaine des Puys,France,Mediterranean and Western Asia,Lava dome,Eruption Dated,45.775,2.97,1464,Basalt / Picro-Basalt,5989.0,Rift Zone,Continental Crust
2,210030,Olot Volcanic Field,Spain,Mediterranean and Western Asia,Pyroclastic cone,Evidence Credible,42.17,2.53,893,Trachybasalt / Tephrite Basanite,,Intraplate,Continental Crust
3,210040,Calatrava Volcanic Field,Spain,Mediterranean and Western Asia,Pyroclastic cone,Eruption Dated,38.87,-4.02,1117,Basalt / Picro-Basalt,5549.0,Intraplate,Continental Crust
4,211001,Larderello,Italy,Mediterranean and Western Asia,Explosion crater,Eruption Observed,43.25,10.87,500,,668.0,Subduction Zone,Continental Crust


Next we'll reorder the remaining columns so variables with similar meanings are 
clustered together for easy reading:

In [31]:
# Define column order
eruptions_col_order = [
    # Identifiers
    'id', 
    'name', 
    # Geography in increasing specificity
    'region', 
    'country', 
    'lon', 
    'lat', 
    'elevation', 
    # Taxonomy
    'type',
    'dominant_rock',
    'ts_zone',
    'ts_crust_type',
    # Activity
    'last_eruption_bp',
    'activity_evidence',
]

# Note: no `type_*` boolean columns to add, as that approach was not adopted

# Set column order
eruptions = eruptions[eruptions_col_order]
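
Selecting columns by a hand-written list will silently drop any column that was left out of it. A small guard like the sketch below (on a toy frame, with illustrative names) can catch that before reordering.

```python
import pandas as pd

# Toy frame standing in for `eruptions`
toy = pd.DataFrame({'name': ['Larderello'], 'lat': [43.25], 'id': [211001]})
col_order = ['id', 'name', 'lat']

# Columns named in the order but absent from the data, and vice versa
not_in_data = set(col_order) - set(toy.columns)
not_in_order = set(toy.columns) - set(col_order)

if not_in_data or not_in_order:
    raise ValueError(f"Column mismatch: {not_in_data | not_in_order}")

# Safe to reorder: nothing is added or lost
toy = toy[col_order]
print(list(toy.columns))
```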

In [32]:
eruptions.head()

Unnamed: 0,id,name,region,country,lon,lat,elevation,type,dominant_rock,ts_zone,ts_crust_type,last_eruption_bp,activity_evidence
0,210010,West Eifel Volcanic Field,Mediterranean and Western Asia,Germany,6.85,50.17,600,Maar,Foidite,Rift Zone,Continental Crust,10249.0,Eruption Dated
1,210020,Chaine des Puys,Mediterranean and Western Asia,France,2.97,45.775,1464,Lava dome,Basalt / Picro-Basalt,Rift Zone,Continental Crust,5989.0,Eruption Dated
2,210030,Olot Volcanic Field,Mediterranean and Western Asia,Spain,2.53,42.17,893,Pyroclastic cone,Trachybasalt / Tephrite Basanite,Intraplate,Continental Crust,,Evidence Credible
3,210040,Calatrava Volcanic Field,Mediterranean and Western Asia,Spain,-4.02,38.87,1117,Pyroclastic cone,Basalt / Picro-Basalt,Intraplate,Continental Crust,5549.0,Eruption Dated
4,211001,Larderello,Mediterranean and Western Asia,Italy,10.87,43.25,500,Explosion crater,,Subduction Zone,Continental Crust,668.0,Eruption Observed


In [33]:
eruptions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1503 entries, 0 to 1507
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   id                 1503 non-null   int64   
 1   name               1503 non-null   object  
 2   region             1503 non-null   object  
 3   country            1503 non-null   object  
 4   lon                1503 non-null   float64 
 5   lat                1503 non-null   float64 
 6   elevation          1503 non-null   int64   
 7   type               1500 non-null   object  
 8   dominant_rock      1393 non-null   object  
 9   ts_zone            1495 non-null   object  
 10  ts_crust_type      1412 non-null   category
 11  last_eruption_bp   868 non-null    float64 
 12  activity_evidence  1502 non-null   object  
dtypes: category(1), float64(3), int64(2), object(7)
memory usage: 154.2+ KB


## Export

In [34]:
eruptions.to_csv(here + 'data/processed/eruptions.csv', index = False)