In this notebook, we explore the dataset with Exploratory Data Analysis (EDA) and then model it with a Light Gradient Boosting Machine (LGBM).

In [None]:
import pandas as pd
import matplotlib as mp
import seaborn as sb
import glob

We imported *pandas*; below we load the *train* and *sample_submission* files.

In [None]:
train = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train.csv")
sample_submission = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/sample_submission.csv")

In [None]:
train

The train file has 2 columns and 4431 rows.
* **segment_id** indicates the file name from the train set
* **time_to_eruption** is the time from the end of the file until the first eruption.

=> we have 4431 files in the train set.

In [None]:
sample_submission

Interestingly, sample_submission.csv has 4520 rows, exactly the number of files in the test set.
This is most likely the results file that we have to fill in with a time_to_eruption value for each test file.
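
To make the idea of "filling in" the submission file concrete, here is a minimal sketch of a constant baseline that writes the median training target into every row. The frames `toy_train` and `toy_submission` are made-up stand-ins, and the sketch assumes that sample_submission.csv has the same two columns as train.csv.

```python
import pandas as pd

# Toy frames with made-up ids and values. Assumption: the submission file
# has the same two columns as train.csv.
toy_train = pd.DataFrame(
    {"segment_id": [1, 2, 3], "time_to_eruption": [10, 20, 60]}
)
toy_submission = pd.DataFrame(
    {"segment_id": [7, 8], "time_to_eruption": [0, 0]}
)

# Constant baseline: predict the median training target for every test file.
baseline = toy_submission.copy()
baseline["time_to_eruption"] = toy_train["time_to_eruption"].median()

print(baseline)
```

A baseline like this is only a reference point for judging later models, not a useful prediction in itself.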

Our dataset consists of the following:
* a test set with 4520 test files
* a train set with 4431 train files
* a train.csv file that contains the metadata for train set (time_to_eruption for each train file)
* a sample_submission.csv file
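
One way to cross-check this inventory is to compare the segment_id values in train.csv with the file names in the train folder. The sketch below assumes each file is named `<segment_id>.csv`, which the path `train/800654756.csv` read further down suggests but does not prove; the helper name `compare_ids_to_files` is hypothetical.

```python
import os

def compare_ids_to_files(segment_ids, file_paths):
    """Return (ids without a file, files without an id).

    Assumes every file is named "<segment_id>.csv"; this naming is an
    assumption made for the sketch.
    """
    ids = {str(s) for s in segment_ids}
    # Strip the directory and the extension to recover the id from the path.
    stems = {os.path.splitext(os.path.basename(p))[0] for p in file_paths}
    return ids - stems, stems - ids

# Toy example with made-up ids and paths.
missing_files, unlisted_files = compare_ids_to_files(
    [101, 102, 103],
    ["train/101.csv", "train/102.csv", "train/999.csv"],
)
print(missing_files)   # ids listed in the metadata but with no file
print(unlisted_files)  # files present in the folder but not listed
```

In the notebook this could be called as `compare_ids_to_files(train["segment_id"], paths)`, where `paths` is the list of train files returned by `glob.glob`; two empty sets would mean the metadata and the files agree.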

Verify the number of files in the train and test sets.

In [None]:
train_set = glob.glob("../input/predict-volcanic-eruptions-ingv-oe/train/*")
len(train_set)

In [None]:
test_set = glob.glob("../input/predict-volcanic-eruptions-ingv-oe/test/*")
len(test_set)

The file counts of the two sets are confirmed.

In [None]:
train_set[0]

In [None]:
test_set[0]

In [None]:
train_file1 = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train/800654756.csv")
train_file1

The train file above has 10 columns, one for each of the 10 sensors installed on an active volcano.
Each file contains 10 minutes of readings from these 10 sensors.

The data is numerical, but it also contains NaN (not a number) values.
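
To see how many readings are missing, the NaN values can be counted per sensor. The sketch below uses a small made-up frame in place of a real segment file; the column names `sensor_1` to `sensor_3` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one segment file; the column names are hypothetical.
segment = pd.DataFrame({
    "sensor_1": [1.0, np.nan, 3.0, 4.0],
    "sensor_2": [np.nan, np.nan, np.nan, np.nan],
    "sensor_3": [5.0, 6.0, 7.0, 8.0],
})

# Number of missing readings per sensor.
nan_counts = segment.isna().sum()
# Share of missing readings per sensor, between 0 and 1.
nan_share = segment.isna().mean()
# Sensors that recorded nothing at all in this segment.
dead_sensors = nan_share[nan_share == 1.0].index.tolist()

print(nan_counts)
print(dead_sensors)
```

The same three lines can be applied to a frame loaded with `pd.read_csv`, such as `train_file1`.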

In [None]:
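# Collect the distinct column counts and row counts seen across all train files.
sensors = set()
observations = set()

for file in train_set:
    f = pd.read_csv(file)

    sensors.add(len(f.columns))   # number of sensor columns in this file
    observations.add(len(f))      # number of rows (observations) in this file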

In [None]:
print("Unique number of sensors: ", sensors)
print("Unique number of observations: ", observations)

From the output above we conclude that all files in the train set are identical in structure: each has 10 columns and 60001 rows.

In [None]:
sensors = set()
observations = set()

for file in test_set:
    f = pd.read_csv(file)
    
    sensors.add(len(f.columns))
    observations.add(len(f))

In [None]:
print("Number of sensors: ", sensors)
print("Number of observations: ", observations)

We did the same for the test set. It also has 10 sensors and 60001 observations per file, so the dataset is consistent in this respect.

All the files from the test and train sets have the same structure: 10 columns and 60001 rows each.
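
Since the same loop runs once for the train set and once for the test set, it could be wrapped in a small helper. The sketch below is one possible form; the function name `unique_shapes` is hypothetical, and the demo writes a few tiny temporary CSV files instead of reading the competition data.

```python
import os
import tempfile

import pandas as pd

def unique_shapes(paths):
    """Return the set of (rows, columns) shapes found across CSV files."""
    return {pd.read_csv(p).shape for p in paths}

# Toy demo: write three small CSV files, one with a different length.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, n_rows in enumerate([3, 3, 5]):
        path = os.path.join(tmp, f"{i}.csv")
        toy = pd.DataFrame({"a": range(n_rows), "b": range(n_rows)})
        toy.to_csv(path, index=False)
        paths.append(path)
    shapes = unique_shapes(paths)

print(shapes)
```

A single-element result means every file has the same structure; in the notebook the calls would be `unique_shapes(train_set)` and `unique_shapes(test_set)`.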

In [None]:
time2eruption = train["time_to_eruption"]
time2eruption.plot(kind="hist")

In [None]:
train["time_to_eruption"].describe()