# Liverpool Data EDA

In this competition, you’ll use ion channel data to better model automatic identification methods. If successful, you’ll be able to detect individual ion channel events in noisy raw signals. The data is simulated and injected with real world noise to emulate what scientists observe in laboratory experiments.
![](https://www.nature.com/scitable/content/ne0000/ne0000/ne0000/ne0000/14707004/U4CP3-1_IonChannel_ksm.jpg)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle

plt.style.use('ggplot')
color_pal = [x['color'] for x in plt.rcParams['axes.prop_cycle']]
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

`#smalldata`
(hashtag "small data") - the training and test sets are 87 MB and 32 MB, respectively!

In [None]:
!ls -GFlash ../input/liverpool-ion-switching/

## Data
In this competition, you will be predicting the number of open_channels present, based on electrophysiological signal data.

**IMPORTANT: While the time series appears continuous, the data is from discrete batches of 50 seconds long 10 kHz samples (500,000 rows per batch). In other words, the data from 0.0001 - 50.0000 is a different batch than 50.0001 - 100.0000, and thus discontinuous between 50.0000 and 50.0001.**
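
The batch a row belongs to can also be recovered from the `time` column itself. The following is a minimal sketch, assuming timestamps start at 0.0001 s and advance in 0.0001 s steps as described above; `batch_from_time` is a hypothetical helper written for illustration, not part of the competition data or any library.

```python
import numpy as np

def batch_from_time(time, sample_rate_hz=10_000, rows_per_batch=500_000):
    """Map timestamps in seconds to 1-based batch numbers (illustrative helper)."""
    # Convert to integer sample numbers first so that float rounding at the
    # 50.0000 / 50.0001 boundary cannot push a row into the wrong batch.
    sample = np.rint(np.asarray(time, dtype=float) * sample_rate_hz).astype(np.int64)
    return (sample - 1) // rows_per_batch + 1

# Timestamps on either side of the first two batch boundaries.
times = [0.0001, 50.0000, 50.0001, 100.0000, 100.0001]
batches = batch_from_time(times)
print(batches)
```

Working on integer sample numbers rather than dividing the raw float timestamps is a deliberate choice here: it keeps 50.0000 in the first batch and 50.0001 in the second.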

In [None]:
train = pd.read_csv('../input/liverpool-ion-switching/train.csv')
test = pd.read_csv('../input/liverpool-ion-switching/test.csv')
ss = pd.read_csv('../input/liverpool-ion-switching/sample_submission.csv')

In [None]:
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

In [None]:
# Training
train.describe() \
    .T.round(4) \
    .style \
    .applymap(color_negative_red)

In [None]:
# Test
test.describe() \
    .T.round(4) \
    .style \
    .applymap(color_negative_red)

## Training "Batch" of 500,000 rows
- The training set has 5,000,000 rows, so we can treat it as 10 "batches" of 500,000 rows each

In [None]:
train['batch'] = (train.index // 500_000) + 1

## Plotting the signal and open_channels for each "batch"

In [None]:
fig, axs = plt.subplots(10, 2, figsize=(15, 30))
axs = axs.flatten()
i = 0
for b, d in train.groupby('batch'):
    d.set_index('time')['signal'].plot(style='.',
                                       title=f'train batch {b:0.0f} - signal',
                                       ax=axs[i],
                                       alpha=0.2,
                                       color=next(color_cycle))
    d.set_index('time')['open_channels'].plot(style='.',
                                              title=f'train batch {b:0.0f} - open_channels',
                                              ax=axs[i+1],
                                              alpha=0.5,
                                              color=next(color_cycle))
    i += 2
plt.tight_layout()

## Closer look at signal/channels

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Rows 510,000-520,000 fall early in batch 2 (about one second of signal).
closeup = train.loc[510000:520000].set_index('time')
closeup['signal'].plot(style='.',
                       title='train - close look - signal',
                       ax=ax1,
                       alpha=0.2,
                       color=next(color_cycle))
closeup['open_channels'].plot(style='.',
                              title='train - close look - open_channels',
                              ax=ax2,
                              color=next(color_cycle))
plt.show()

## Distribution of our target "Open Channels"

In [None]:
ax = train.groupby('open_channels') \
    .count()['time'] \
    .plot(figsize=(15, 5),
          kind='bar',
          title='Count of Target (Open Channels) in Training Set',
          color=color_pal[0])
plt.show()
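
The bar chart shows raw counts; the same information as class shares is often easier to compare. A small sketch on a toy column (the values below are made up for illustration and stand in for `train['open_channels']`):

```python
import pandas as pd

# Toy stand-in for train['open_channels']; the real column comes from train.csv.
toy = pd.Series([0, 0, 0, 1, 1, 2, 0, 3, 1, 0], name='open_channels')

# Share of rows per class, ordered by class label rather than by frequency.
share = toy.value_counts(normalize=True).sort_index()
print(share.round(3))
```

On the real data the equivalent call would be `train['open_channels'].value_counts(normalize=True).sort_index()`.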

In [None]:
train['signal'].plot(kind='hist',
                     figsize=(15, 5),
                     bins=55, label='train', alpha=0.5)
test['signal'].plot(kind='hist',
                    bins=55,
                    label='test',
                    alpha=0.5,
                    title='Signal distribution in train vs test')
plt.legend()
plt.show()
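
Because the train and test sets differ in size, overlaid count histograms can be hard to compare by eye. One possible complement is a side-by-side quantile table. The sketch below uses synthetic series (random draws, not competition data) in place of `train['signal']` and `test['signal']`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two signal columns, with a deliberate offset.
toy_train = pd.Series(rng.normal(0.0, 1.0, 1_000), name='train')
toy_test = pd.Series(rng.normal(0.5, 1.0, 1_000), name='test')

quantiles = [0.01, 0.25, 0.5, 0.75, 0.99]
# Each .quantile() call returns a Series indexed by the requested quantiles.
comparison = pd.concat([toy_train.quantile(quantiles),
                        toy_test.quantile(quantiles)], axis=1)
comparison['diff'] = comparison['test'] - comparison['train']
print(comparison.round(3))
```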

## Distribution of "Signal" for different numbers of open channels

In [None]:
fig, axs = plt.subplots(4, 3, figsize=(15, 12))
axs = axs.flat
# One histogram per open_channels value, each on its own axis.
for idx, (i, d) in enumerate(train.groupby('open_channels')):
    d['signal'].plot(kind='hist',
                     ax=axs[idx],
                     title=f'Distribution of Signal for {i} Open Channels',
                     bins=50,
                     color=next(color_cycle))
plt.tight_layout()

In [None]:
# Overlay all classes on one axis; the legend identifies each class.
for i, d in train.groupby('open_channels'):
    d['signal'].plot(kind='hist',
                     title='Distribution of Signal by Number of Open Channels',
                     bins=50,
                     figsize=(15, 5),
                     alpha=0.5,
                     color=next(color_cycle),
                     label=i)
plt.legend()
plt.show()
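
The histograms can be backed up with a small numeric summary of the signal per class. This is a sketch on a toy frame with made-up values; the column names match the training data, but the numbers do not come from it:

```python
import pandas as pd

# Toy stand-in for train[['signal', 'open_channels']].
toy = pd.DataFrame({
    'signal': [-2.7, -2.5, -2.6, -1.3, -1.2, 0.1, 0.0],
    'open_channels': [0, 0, 0, 1, 1, 2, 2],
})

# Row count, mean and standard deviation of the signal for each class.
summary = toy.groupby('open_channels')['signal'].agg(['count', 'mean', 'std'])
print(summary.round(3))
```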

# Viewing Channels vs Signal by batch.

In [None]:
fig, axs = plt.subplots(5, 2, figsize=(15, 20))
axs = axs.flatten()
i = 0
for b, d in train.groupby('batch'):
    sns.violinplot(x='open_channels', y='signal', data=d, ax=axs[i])
    axs[i].set_title(f'Batch {b:0.0f}')
    i += 1
plt.tight_layout()
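
To see which `open_channels` values occur in which batch without reading them off the violin plots, a cross-tabulation may help. A minimal sketch on a toy frame (values invented for illustration):

```python
import pandas as pd

# Toy stand-in for train[['batch', 'open_channels']].
toy = pd.DataFrame({
    'batch': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'open_channels': [0, 0, 1, 0, 1, 3, 2, 5, 5],
})

# Rows: batch, columns: open_channels, cells: number of rows.
counts = pd.crosstab(toy['batch'], toy['open_channels'])
# Largest number of simultaneously open channels seen in each batch.
max_open = toy.groupby('batch')['open_channels'].max()
print(counts)
print(max_open)
```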

# Test Data Signal Attributes
- The test data only has 4 "batches" of 500,000 rows (2,000,000 rows in total)

In [None]:
# Same integer-division approach as for train. Label-based .loc slices include
# both endpoints, so 500000-row ranges written that way overlap at the boundaries.
test['batch'] = (test.index // 500_000) + 1

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs = axs.flatten()
i = 0
for b, d in test.groupby('batch'):    
    d.set_index('time')['signal'].plot(style='.',
                                       title=f'test set batch {b:0.0f} - signal',
                                       ax=axs[i],
                                       alpha=0.2,
                                      color=next(color_cycle))
    i += 1
plt.tight_layout()

## Test batches may be smaller?

In [None]:
test['batch2'] = (test.index // 100_000) + 1

In [None]:
fig, axs = plt.subplots(10, 2, figsize=(15, 30))
axs = axs.flatten()
i = 0
for b, d in test.groupby('batch2'):    
    d.set_index('time')['signal'].plot(style='.',
                                       title=f'test set batch {b:0.0f} - signal',
                                       ax=axs[i],
                                       alpha=0.2,
                                      color=next(color_cycle))
    i += 1
plt.tight_layout()
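
A rough numeric check of whether the signal level shifts between segments is to compare segment means. The sketch below runs on a synthetic signal with three artificial regimes, and the jump threshold of 1.0 is an arbitrary assumption chosen for the toy data, not a value derived from the competition set:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for test['signal']: three regimes of 200 rows each.
toy_signal = pd.Series(np.concatenate([
    rng.normal(-2.0, 0.3, 200),
    rng.normal(0.0, 0.3, 200),
    rng.normal(-2.0, 0.3, 200),
]))

segment_len = 100  # the notebook uses 100_000 rows on the real test set
segment = np.arange(len(toy_signal)) // segment_len + 1
seg_mean = toy_signal.groupby(segment).mean()

# Flag segments whose mean differs from the previous segment by more than
# the (hypothetical) threshold.
jump = seg_mean.diff().abs()
changed = jump[jump > 1.0].index.tolist()
print(seg_mean.round(2))
print(changed)
```

A flagged segment only indicates that the average level moved; it does not by itself show where a true batch boundary lies.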