## Imports
To visualize the UCI Epileptic Seizure dataset, we will be using Bokeh, NumPy, and pandas. Because the graphs should render inline in this notebook, we call output_notebook() instead of output_file().

In [1]:
import pandas as pd
import numpy as np
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
import bokeh.colors
from bokeh.palettes import Spectral6
from bokeh.models import Span
from sklearn.preprocessing import MinMaxScaler

output_notebook()

## Data Processing
After loading in our imports, we can clean the data a bit to make sure it's ready for visualizing.

In [2]:
dataset = pd.read_csv('seizure_data.csv') # load in seizure dataset
dataset = dataset.drop(columns=['Unnamed: 0']) # clean unwanted columns (columns= already implies axis=1)

full_dataset = dataset.copy() # independent copy, so columns added to it later do not show up in dataset
X = dataset.drop(columns=['y'])
Y = dataset['y']

## Visualizing
We can take a quick look at our data graphically by simply representing each row of the dataset as a line:

In [3]:
colormap = {1: 'lightsteelblue', 2: 'deepskyblue', 3: 'dodgerblue', 4: 'steelblue', 5: 'darkblue'}
colors = [colormap[x] for x in Y]

plot = figure(title="EEG Values over Time", x_axis_label='Timestamps', y_axis_label='EEG Value', x_range=(0, 178), y_range=(-1800,1000))
for index, row in X.head(20).iterrows():
    plot.line(np.linspace(0, 178, 178), row, line_color=colors[index], line_width=2, legend_label=str(Y[index])) # legend_label replaces the older legend keyword

show(plot)

As the color legend shows, only class 1 is visually recognizable. Since class 1 corresponds to having a seizure, it is expected that its EEG values fluctuate more. The other classes are not easy to tell apart at this scale.
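
One way to put a number on "fluctuates more" is the standard deviation of each row, averaged within each class. The sketch below uses a small synthetic frame rather than the real dataset; the values, the two-class layout, and the name `toy` are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for the dataset: 4 rows of 6 readings plus a label column 'y'.
# These numbers are invented for illustration only.
toy = pd.DataFrame(
    [[-400, 350, -300, 420, -380, 310, 1],
     [-500, 450, -420, 480, -460, 440, 1],
     [-20, 15, -10, 25, -15, 5, 5],
     [-30, 20, -25, 30, -20, 25, 5]],
    columns=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'y'],
)

features = toy.drop(columns=['y'])
row_std = features.std(axis=1)                         # spread of each sequence
mean_std_per_class = row_std.groupby(toy['y']).mean()  # average spread per class
print(mean_std_per_class)
```

On the real data the same three lines would apply to `X` and `Y` directly; a much larger value for class 1 would back up what the plot suggests.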

We can further investigate the differences between the classes by drawing the average line for each class. This makes any obvious differences in shape or amplitude easier to spot.

In [4]:
# average the first 200 rows within each class: one mean curve per class
mean_series = X.head(200).groupby(Y.head(200)).mean()

plot = figure(title="Mean EEG Values over Time (first 200 rows)", x_axis_label='Timestamps (178 total)', y_axis_label='EEG Value', x_range=(0, 178), y_range=(-150,150))
for key, value in mean_series.iterrows():
    plot.line(np.linspace(0, 178, 178), value, line_color=colormap[key], line_width=2, legend_label=str(key))

show(plot)

From the average values, we can see that certain classes have a more distinct and recognizable shape than others. Class 1 in particular is very noticeable, as it has a much higher amplitude. The maximum peaks seem to decrease as the class number increases from 2 to 5, with class 5 being the flattest of them all.
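
The per-class average can also be written as a groupby, which makes it easy to compare the highest point of each mean curve numerically. This is a minimal sketch on synthetic data; the frame `toy` and its values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Invented readings: two rows per class, three timestamps each.
toy = pd.DataFrame({
    'X1': [100, 300, 10, 30],
    'X2': [-200, -400, -20, -40],
    'X3': [500, 700, 50, 70],
    'y':  [1, 1, 2, 2],
})

# Mean curve per class: one row per class, one column per timestamp.
class_means = toy.drop(columns=['y']).groupby(toy['y']).mean()
# Highest point of each mean curve.
class_peaks = class_means.max(axis=1)
print(class_means)
print(class_peaks)
```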

Another characteristic to consider is how many sequences of each class peak above a given value. This tells us whether a class is predisposed to higher EEG peaks than the others, and by how much.

To find out, we compute the share of each class at every threshold from 0 to 2000. At each threshold we count, per class, the sequences whose peak is at or above that threshold. For example, at the value 1000, a sequence adds to its class's count only if its EEG values peak at or above 1000. We then divide each count by the total number of sequences counted at that threshold to get a percentage. This shows whether certain classes are more common at different peak levels.

In [5]:
stored_percents = {}
full_dataset['max'] = X.max(axis=1) # peak EEG value of each row (the label column 'y' is excluded)
for val in range(0, 2000):
    full_dataset_filtered = full_dataset[full_dataset['max'] >= val]
    # number of remaining sequences per class; classes with none left count as 0
    counts = full_dataset_filtered['y'].value_counts().reindex(range(1, 6), fill_value=0)
    total = counts.sum()
    # share of each class among the sequences that peak at or above val
    percents = dict((k, v / total if total else 0.0) for k, v in counts.items())
    stored_percents[val] = percents

We start by adding a new column to the full dataset that holds the maximum EEG value of each row. Then we iterate through every value from 0 to 2000. At each value we keep only the rows whose max is at or above it and record how many rows of each class remain in a counts dictionary. Finally, we calculate the share of the remaining sequences that each class takes up and store it under that value's key in the stored_percents dictionary.
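
To make the calculation concrete, here is the same idea on a hand-made set of six peaks. The peaks and labels are invented, and `class_shares` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

# Invented peaks and class labels, for illustration only.
toy = pd.DataFrame({
    'max': [900, 650, 400, 120, 80, 60],
    'y':   [1,   1,   2,   2,   3,  3],
})

def class_shares(frame, threshold, classes=(1, 2, 3)):
    """Share of each class among the rows whose peak is >= threshold."""
    kept = frame[frame['max'] >= threshold]
    counts = kept['y'].value_counts().reindex(list(classes), fill_value=0)
    total = counts.sum()
    # With no rows left, report 0 for every class instead of dividing by zero.
    return {int(k): (float(v) / total if total else 0.0) for k, v in counts.items()}

print(class_shares(toy, 0))
print(class_shares(toy, 100))
print(class_shares(toy, 500))
```

At a threshold of 100 only the four largest peaks remain, so classes 1 and 2 each hold half of them and class 3 holds none. At 500 only the two class 1 rows are left.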

Now that we have the percentages of classes at each peak value, we can separate the percentages into one series per class to prepare for graphing.

In [6]:
classes = {}

for class_val in range(1, 6):
    classes[class_val] = dict((k, v[class_val]) for k, v in stored_percents.items())

# print(classes)

The graph is defined below, with tooltips that show the share each class holds among the sequences peaking at or above that point. Since the hover mode is 'vline', the tooltips for all classes pop up at once for the x position under the cursor, instead of requiring us to hover over each line separately.

In [7]:
from bokeh.models import HoverTool

plot = figure(title="Class Percentages", tools='box_zoom,reset', x_axis_label='Maximum EEG value', y_axis_label='Percent of Class', x_range=(0, 500), y_range=(0, 1))
for class_val in range(1, 6):
    points = sorted(classes[class_val].items()) # (threshold, share) pairs in ascending threshold order
    source = ColumnDataSource(data = dict(
        x = [k for k, _ in points],
        y = [v for _, v in points],
    ))
    plot1 = plot.line('x', 'y', line_color=colormap[class_val], line_width=3, legend_label=str(class_val), source=source)
    plot.add_tools(HoverTool(renderers=[plot1], mode='vline', tooltips=[('Percent', '@{y}{%0.03f}')]))

show(plot)

If we look only at peaks from 0 to 500, we can see that the share of class 1 (confirmed epileptic seizure) rises sharply, reaching a maximum of approximately 92% of all sequences. Of the other classes, only class 2 competes with class 1 past an EEG value of 300; all the others drop to 0, meaning they do not appear at all past this point.

This helps us partition our dataset more cleanly: we can safely say that classes 1 and 2 are the only ones that peak at values higher than 300, with all other classes staying below this value. If we look at higher peaks, however, we notice a clear change in the percentage share:

In [8]:
plot = figure(title="Class Percentages", tools='box_zoom,reset', x_axis_label='Maximum EEG value', y_axis_label='Percent of Class', x_range=(500, 2000), y_range=(0, 1))
for class_val in range(1, 6):
    points = sorted(classes[class_val].items()) # (threshold, share) pairs in ascending threshold order
    source = ColumnDataSource(data = dict(
        x = [k for k, _ in points],
        y = [v for _, v in points],
    ))
    plot1 = plot.line('x', 'y', line_color=colormap[class_val], line_width=3, legend_label=str(class_val), source=source)
    plot.add_tools(HoverTool(renderers=[plot1], mode='vline', tooltips=[('Percent', '@{y}{%0.03f}')]))

show(plot)

This graph shows a clear shift in the percentage shares of classes 1 and 2 at a peak of 1550. At peaks past 1550, class 2, the non-seizure class, actually occurs more often. This tells us that a simple threshold rule (e.g., 'class 1 occurs if a sequence has a peak over 500') does not work, as there is more complexity to this dataset.
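
A crossover like this does not have to be read off the plot. Assuming the per-threshold shares of two classes are available as aligned arrays, the first threshold at which one overtakes the other can be located as in the sketch below. The arrays are invented for illustration and `first_crossover` is a hypothetical helper.

```python
import numpy as np

thresholds = np.array([0, 100, 200, 300, 400, 500])
share_a = np.array([0.6, 0.7, 0.8, 0.6, 0.4, 0.3])  # invented values
share_b = np.array([0.4, 0.3, 0.2, 0.4, 0.6, 0.7])  # invented values

def first_crossover(x, a, b):
    """Return the first x at which b exceeds a, or None if it never does."""
    ahead = np.flatnonzero(b > a)
    return int(x[ahead[0]]) if ahead.size else None

crossover = first_crossover(thresholds, share_a, share_b)
print(crossover)
```

With the notebook's data, the two arrays could be built from `classes[1]` and `classes[2]` in ascending threshold order.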

At this level we notice more plateaus as well. Higher peaks are less common than lower ones, so fewer EEG sequences remain above each threshold, and the shares only change when one of those few sequences drops out.

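Next, we can look at the distribution of EEG peaks across classes. We do this by storing the maxes of each class in their own Series and then plotting them by class. Each marker is sized by how many peaks of the same class lie within 10 units of it, which the close_values helper counts.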

In [9]:
def close_values(dataset, row, proximity):
    # count the rows whose peak lies strictly within `proximity` of this row's peak
    peak = row['max']
    nearby = dataset[(dataset['max'] < peak + proximity) & (dataset['max'] > peak - proximity)]
    return nearby.shape[0]

plot = figure(title="EEG Peaks", x_axis_label='Class', y_axis_label='EEG Peak')
sets = {}  # peak values per class
sizes = {} # marker sizes per class

for i in range(1, 6):
    # .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning
    peaks_dataset = full_dataset[full_dataset['y'] == i].copy()
    peaks_dataset['grouped'] = peaks_dataset.apply(lambda row: close_values(peaks_dataset, row, 10), axis=1)
    sizes[i] = peaks_dataset['grouped']
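
The count computed by `close_values` compares every row of a class against every other row of that class. The same counts can be obtained from a sorted copy of the peaks with `np.searchsorted`, which may be worth considering for larger classes. This is a sketch of an alternative, not what the notebook runs; `neighbour_counts` is a hypothetical name and the peaks are made up.

```python
import numpy as np

def neighbour_counts(peaks, proximity):
    """For each peak, count the peaks strictly within `proximity` of it (itself included)."""
    peaks = np.asarray(peaks)
    ordered = np.sort(peaks)
    # Number of values <= peak - proximity: these fall outside on the low side.
    low = np.searchsorted(ordered, peaks - proximity, side='right')
    # Number of values < peak + proximity: everything below the upper bound.
    high = np.searchsorted(ordered, peaks + proximity, side='left')
    return high - low

peaks = [100, 105, 109, 110, 300]  # invented peak values
counts = neighbour_counts(peaks, 10)
print(counts)
```

The strict inequalities match the `<` and `>` comparisons in `close_values`, so a peak exactly 10 away is not counted.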


Here we rescale these counts to marker sizes between 3 and 25 so that they are reasonable for the plot.

In [12]:
for i in range(1, 6):
    sizes[i] = np.interp(sizes[i], (sizes[i].min(), sizes[i].max()), (3, 25))

# print(sizes)
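
`np.interp` acts as a linear rescale here: the smallest count maps to size 3, the largest to 25, and everything else falls proportionally in between. A small worked example with made-up counts:

```python
import numpy as np

counts = np.array([1, 10, 5.5, 4])  # invented neighbour counts
sizes = np.interp(counts, (counts.min(), counts.max()), (3, 25))
print(sizes)
```

The midpoint 5.5 lands halfway between 3 and 25, at 14.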

In [11]:
for i in range(1, 6):
    filtered_dataset = full_dataset[full_dataset['y'] == i]
    sets[i] = filtered_dataset['max'] # store series of maxes in dict
    source = ColumnDataSource(data = dict(
        x = [i] * len(sets[i]), # one x value per row of this class, instead of a hard-coded 2300
        y = list(sets[i]),
        size = sizes[i]
    ))
    plot.circle('x', 'y', color=colormap[i], legend_label=str(i), size='size', source=source)

show(plot)