<h1>Exploratory Data Analysis</h1>

Here is my crack at EDA for this competition. Thanks to the University of Liverpool for hosting this competition!

In [None]:
import pandas as pd
import numpy as np

from scipy import stats
import math
import random

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train_df = pd.read_csv('/kaggle/input/liverpool-ion-switching/train.csv')
test_df = pd.read_csv('/kaggle/input/liverpool-ion-switching/test.csv')

In [None]:
train_df.head()

The target variable in this dataset is the number of open ion channels. The only feature is the signal strength over time, recorded in discrete 'chunks' of 50 seconds.

In [None]:
train_df.tail()

In [None]:
train_df.info(show_counts=True)  # null_counts was renamed to show_counts in pandas 1.2

In [None]:
test_df.info(show_counts=True)  # null_counts was renamed to show_counts in pandas 1.2

In [None]:
train_df.describe()

In [None]:
test_df.describe()

In [None]:
plt.hist(train_df.time)
plt.hist(test_df.time)
plt.legend(labels=['Train', 'Test'])
plt.title('Time Distribution (Just Checking)')
plt.show()

In [None]:
train_df['open_channels'].value_counts().plot(kind='bar')
plt.title('Open channels distribution')
plt.show()

In [None]:
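# Simulate 10 independent on/off switches, one million times.
# random.randint(0, 10) is inclusive on both ends (11 outcomes),
# so each switch is "on" with probability 1/11.
scores = []
for iteration in range(1000000):
    total = 0
    for choice in range(10):
        if random.randint(0, 10) > 9:
            total += 1
    scores.append(total)
plt.hist(scores, bins=22)
plt.title('Binomial Distribution')
plt.show()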

The target variable is not uniformly distributed: the count for each number of open channels falls off almost linearly as the number goes up. Is the number of open channels controlled by the researchers, or are they just observing it? If each channel were an independent on/off switch with some fixed probability, I would expect the distribution to look more like the simulated Binomial distribution in the cell above. Since it looks more linear to me, I would guess that the data was picked and chosen. Some domain knowledge that I don't possess would be helpful here.
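
For a quicker version of that Binomial comparison, the same experiment can be drawn in one vectorized NumPy call. This is only a sketch: `simulate_open_counts` is a name I made up, and `p_open = 1 / 11` is my reading of `random.randint(0, 10) > 9`, which is inclusive on both ends.

```python
import numpy as np

def simulate_open_counts(n_trials, n_switches=10, p_open=1 / 11, seed=0):
    """Draw n_trials totals of n_switches independent on/off switches."""
    rng = np.random.default_rng(seed)
    return rng.binomial(n=n_switches, p=p_open, size=n_trials)

scores = simulate_open_counts(1_000_000)
# Fraction of trials that landed on each possible count, 0 through 10
freqs = np.bincount(scores, minlength=11) / len(scores)
print(np.round(freqs, 4))
```

The `freqs` array can be passed to `plt.bar(range(11), freqs)` to sit next to the open channels bar chart.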

In [None]:
plt.figure(figsize=(6,6))
plt.hist(train_df.signal, bins=20)
plt.hist(test_df.signal, bins=20)
plt.title('Signal Distribution for Test and Train')
plt.legend(labels=['Train', 'Test'])
plt.show()
print('Train mean {}, median {}, standard deviation {}'.format(np.mean(train_df.signal), np.median(train_df.signal), np.std(train_df.signal)))
print('Test mean {}, median {}, standard deviation {}'.format(np.mean(test_df.signal), np.median(test_df.signal), np.std(test_df.signal)))
print('\nTrain:', stats.normaltest(train_df.signal))
print('Test:', stats.normaltest(test_df.signal))

Train and test data appear to have similar distributions, but the mean, median, and standard deviation are all different. Even though neither is normally distributed, both samples are large enough for a t-test to be valid.

In [None]:
stats.ttest_ind(train_df.signal, test_df.signal)

The t-test rejects the hypothesis that the train and test signals share the same mean, which suggests they are drawn from different populations of signal data. This is a little concerning when making predictions.
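
One caveat worth checking: with millions of rows, a t-test will flag even a tiny difference in means as significant, so an effect size makes a useful companion number. Here is a sketch on synthetic data standing in for the two signal columns. `cohens_d` is a helper I wrote for illustration (it is not a scipy function), and the 0.02 mean shift is made up.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
# Synthetic stand-ins: the means differ by only 0.02 standard deviations
a = rng.normal(0.00, 1.0, size=1_000_000)
b = rng.normal(0.02, 1.0, size=1_000_000)

t_stat, p_value = stats.ttest_ind(a, b)
d = cohens_d(a, b)
print('p-value:', p_value)
print("Cohen's d:", d)
```

On the real data, `cohens_d(train_df.signal, test_df.signal)` would show whether the difference is large in practical terms or just statistically detectable.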

In [None]:
plt.figure(figsize=(18,18))
plt.plot(train_df.signal[train_df.time < 2])
plt.plot(train_df.open_channels[train_df.time < 2])
plt.show()

This zoomed-in look at the signal next to open_channels shows that the two are clearly dependent. I'm going to zoom in even more at different times to try to capture more exactly how the number of open channels affects the signal.

In [None]:
start = 0.72
end = 0.727
plt.figure(figsize=(14,14))
plt.plot(train_df.signal[(train_df.time > start) & (train_df.time < end)])
plt.plot(train_df.open_channels[(train_df.time > start) & (train_df.time < end)])
plt.legend(['Signal Strength', 'Number of Open Channels'])
plt.show()

In [None]:
start = 200.07
end = 200.08
plt.figure(figsize=(14,14))
plt.plot(train_df.signal[(train_df.time > start) & (train_df.time < end)])
plt.plot(train_df.open_channels[(train_df.time > start) & (train_df.time < end)])
plt.legend(['Signal Strength', 'Number of Open Channels'])
plt.show()

Even though the number of open channels obviously directly affects the signal strength, there are some times when the signal 'wiggles' despite the number of open channels not changing and vice versa. Is the strength only dependent on the number of channels plus some noise, or are there other time-related features?

In [None]:
start = 310.07
end = 310.08
plt.figure(figsize=(14,14))
plt.plot(train_df.signal[(train_df.time > start) & (train_df.time < end)])
plt.plot(train_df.open_channels[(train_df.time > start) & (train_df.time < end)])
plt.legend(['Signal Strength', 'Number of Open Channels'])
plt.show()

Here the number of open channels fluctuates between 0 and 1, like the first graph, but the signal strength is much higher on average. Maybe some sort of RNN is needed for this problem, since it looks like long-term trends matter as well as immediate ones. The amount of data could pose a limitation for those of us on simple machines.
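
To look at the long-term trend separately from the fast switching, one option is a wide rolling mean. This is a sketch on a synthetic signal with a slow drift added, not the competition data. The window of 1001 samples is an arbitrary choice, and with the real data `train_df.signal` would take the place of `signal`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
# Synthetic stand-in: a channel flipping between 0 and 1, plus noise and a slow upward drift
channels = rng.integers(0, 2, size=n)
drift = pd.Series(np.linspace(0.0, 3.0, n))
signal = pd.Series(channels + rng.normal(0, 0.2, size=n)) + drift

# A wide, centered rolling mean follows the slow trend and averages out the switching
trend = signal.rolling(window=1001, center=True, min_periods=1).mean()
detrended = signal - trend

print('corr(signal, drift):   ', signal.corr(drift))
print('corr(detrended, drift):', detrended.corr(drift))
```

If the detrended signal lines up better with open_channels than the raw one does, that would support the idea that a slow component sits on top of the channel-driven part.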

The data is recorded in batches of 50 seconds, which works out to 500,000 rows per batch. Thank you https://www.kaggle.com/artgor/eda-and-model-qwk-optimization for the following little batch code:

In [None]:
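# Label each block of 500,000 rows (one 50-second batch) with a batch number.
# Column positions 3 (train) and 2 (test) point at the new 'batch' column.
train_df['batch'] = 0
for i in range(0, 10):
    train_df.iloc[i * 500000: 500000 * (i + 1), 3] = i

test_df['batch'] = 0
for i in range(0, 4):
    test_df.iloc[i * 500000: 500000 * (i + 1), 2] = i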

Here is a macro view of the different time chunks. I'm imagining researchers playing with a bunch of on-off switches in different patterns to generate weird-looking signal graphs.

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(16, 8))
num_batches = train_df.batch.nunique()
fig.suptitle('Signal and Open Channels by Batch. Blue == signal, Orange == open_channels', fontsize=16)
# axes.flat walks the grid row by row, so panel i holds batch i
for i, ax in enumerate(axes.flat[:num_batches]):
    ax.plot(train_df.signal[train_df.batch == i])
    ax.plot(train_df.open_channels[train_df.batch == i])
    if i == 0:
        # Only the first panel keeps its ticks
        ax.set_yticks(range(-4, 13))
    else:
        ax.set_xticks([])
        ax.set_yticks([])
plt.show()

Let's check out the test distributions by chunk real quick:

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
num_batches = test_df.batch.nunique()
fig.suptitle('Test Distributions', fontsize=16)
# axes.flat walks the grid row by row, so panel i holds batch i
for i, ax in enumerate(axes.flat[:num_batches]):
    ax.plot(test_df.signal[test_df.batch == i])
    if i == 0:
        # Only the first panel keeps its ticks
        ax.set_yticks(range(-4, 13))
    else:
        ax.set_xticks([])
        ax.set_yticks([])
plt.show()

They sure gave us some wild and crazy signal patterns! Maybe the best model first looks at the overall distribution of the 50-second chunk and then chooses from a collection of smaller, more specialized models.
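
As a first step toward that idea, each chunk could be boiled down to a few summary statistics that a routing step might use to pick a model. This is a sketch on a toy frame with made-up signal levels. With the real data, `train_df` and its `batch` column would take the place of `toy_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for train_df: 3 batches of 1000 rows with different signal levels
toy_df = pd.DataFrame({
    'batch': np.repeat([0, 1, 2], 1000),
    'signal': np.concatenate([
        rng.normal(-2.5, 0.3, 1000),
        rng.normal(0.0, 1.0, 1000),
        rng.normal(4.0, 2.0, 1000),
    ]),
})

# One row of summary statistics per batch
batch_stats = toy_df.groupby('batch')['signal'].agg(['mean', 'std', 'min', 'max'])
print(batch_stats)
```

Batches with similar rows in `batch_stats` are candidates for sharing one model.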

In [None]:
fig, axes = plt.subplots(4, 3, figsize=(15, 20))
fig.suptitle('Signal Distributions at Number of Open Channels', fontsize=16)
for i in range(11):
    ax = axes[i // 3, i % 3]
    level_signal = train_df.signal[train_df.open_channels == i]
    n, bins, patches = ax.hist(level_signal, bins=40)
    # The 'bin mode' is the center of the tallest histogram bin
    ind = int(np.argmax(n))
    binned_mode = (bins[ind] + bins[ind + 1]) / 2
    mean = round(np.mean(level_signal), 2)
    ax.set_title('Channels {}, BinMode {}, Mean {}'.format(i, round(binned_mode, 2), mean))
    ax.set_xticks([-5, -2.5, 0, 2.5, 5, 7.5, 10, 12.5])
    ax.axvline(binned_mode, color='orange')
    ax.axvline(mean, color='green')
plt.show()

These are the overall distributions of signal strength, broken up by the number of open channels. I also included vertical lines for the 'bin mode', which is just the location of the histogram peak, as well as the overall mean. As expected, both the bin mode and the mean go up as the number of open channels goes up, but not perfectly. The weird distributions show that there is definitely more to the signal strength than a one-to-one dependency on the number of open channels plus random noise.
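
To put numbers on "not perfectly", one could tabulate the mean signal per channel count along with the step from one count to the next. Under a clean one-to-one relationship plus noise, the steps should come out roughly constant. This is a sketch on toy data where that clean relationship holds by construction. The slope of 1.2 and offset of -2.7 are made up, and with the real data `train_df` would replace `toy_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for train_df: signal = 1.2 * open_channels - 2.7 + noise (made-up numbers)
open_channels = rng.integers(0, 4, size=4000)
toy_df = pd.DataFrame({
    'open_channels': open_channels,
    'signal': 1.2 * open_channels - 2.7 + rng.normal(0, 0.25, size=4000),
})

per_level = toy_df.groupby('open_channels')['signal'].agg(['mean', 'std', 'count'])
# Step in mean signal from one channel count to the next
per_level['step'] = per_level['mean'].diff()
print(per_level)
```

Uneven steps or a `std` that changes a lot between levels on the real data would back up the impression from the histograms.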

That's it for my EDA notebook. Any feedback is very welcome, as this is my first Kaggle notebook.