# University of Liverpool - Ion Switching Tutorial

**Hello everybody, in this tutorial I will work through this competition:** https://www.kaggle.com/c/liverpool-ion-switching

**You can now only submit to this competition through the "late submission" option, so it will NOT be shown in your profile!**

**The results of this notebook so far look like this (the ranks vary as time goes on):**

* **public score:   0.92874**
* **public rank:  1888/2618**

* **private score:  0.91612**
* **private rank: 1711/2618**

**My approach closely follows this notebook:** https://www.kaggle.com/cdeotte/one-feature-model-0-930

**I will explain certain steps in more detail; that is the main purpose of this tutorial.**

**The focus of this notebook is on modifying the measured signal curves: the drift they contain has to be removed before we can train models on them.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Load and analyze data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('/kaggle/input/liverpool-ion-switching/train.csv')
test = pd.read_csv('/kaggle/input/liverpool-ion-switching/test.csv')


print("loading successful!")

In [None]:
print(train.shape, "\n")
print(train.info(), "\n")
print(train.columns, "\n")
print(train.index, "\n")

In [None]:
print(test.shape, "\n")
print(test.info(), "\n")
print(test.columns, "\n")
print(test.index, "\n")

# 2. Check for missing values

In [None]:
for i in train.columns:
    print(i, train[i].isnull().sum())

In [None]:
for i in test.columns:
    print(i, test[i].isnull().sum())

**No missing values, that is very good.**

# 3. Analyze distribution of target

In [None]:
print(train.open_channels.value_counts())

**Ok, there are 0 to 10 open channels, but we have to plot and look at the data to really understand it.**

## 3.1 Plot the data

In [None]:
plt.figure(figsize=(20,5))
plt.plot(train.time[::100], train.signal[::100])

plt.show()

In [None]:
plt.figure(figsize=(20,5))
plt.plot(train.time[::1000], train.open_channels[::1000], color = 'red')
plt.show()

## 3.2 Analyze correlation between time and signal

In [None]:
corr_dataframe = train[["time", "signal"]]

corr_mat = corr_dataframe.corr()

print(corr_mat)

**Ok, the correlation between time and signal is 0.831, which is quite high. That looks good at first, since these are the only 2 features we have.**

**But this value does not say much, because it is the correlation over the entire time series. As the plot shows, there are many local differences, and the signal curve has shapes that the open_channel curve does not follow. Hence we must remove these shapes from the signal curve.**

**The open_channel curve only seems to correlate with the height of the plateaus of the signal curve and with its rapid ups and downs.**
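**To illustrate why a single correlation value over the whole series can mislead, here is a small sketch on synthetic data (the two flat "batches", their levels and the noise are made up, not taken from the competition data). The correlation over everything is high only because the second batch sits higher than the first; inside each batch, time and signal are essentially uncorrelated:**

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # rows per synthetic batch (the real batches have 500000 rows)

# Two flat batches at different levels, plus measurement noise
time = 0.01 * np.arange(2 * n)
signal = np.concatenate([np.full(n, -2.0), np.full(n, 3.0)]) + rng.normal(0, 0.5, 2 * n)

# Correlation over the whole series
global_corr = np.corrcoef(time, signal)[0, 1]

# Correlation inside each batch
per_batch_corr = [
    np.corrcoef(time[i * n:(i + 1) * n], signal[i * n:(i + 1) * n])[0, 1]
    for i in range(2)
]

print("global:", round(global_corr, 3))
print("per batch:", [round(c, 3) for c in per_batch_corr])
```

**On the real data the same loop could run over the 10 batches of 500000 rows each.**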


**We can see that there are 10 distinct parts of our time series, hence let's look at the plots in these 10 parts:**

In [None]:
a = 500000
dist = 100

for i in range(0,10):
    
    print(i, "min: ", min(train.signal[0+i*a:(i+1)*a:dist].values), "max: ", max(train.signal[0+i*a:(i+1)*a:dist]))
    plt.figure(figsize=(20,5))
    plt.plot(train.time[0+i*a:(i+1)*a:dist], train.signal[0+i*a:(i+1)*a:dist])
    plt.plot(train.time[0+i*a:(i+1)*a:dist], train.open_channels[0+i*a:(i+1)*a:dist], color = 'red')
    plt.show()

**As we can see, there are some distinct shapes that we have to remove from the signal curve:**

* **In the second plot there is a short linear increase between 50 and 60 that the open_channel curve does not follow.**
* **In plots 7, 8, 9 and 10 the signal curve has a parabolic shape, but the open_channel curve stays horizontal.**
* **The sharp peaks and dips standing out of the parabola in the 8th part of the signal curve do not correlate at all with the open_channel curve, hence we should remove them as well.**


**When we remove these shapes from the train signal curve, we have to remove them from the test signal curve as well, so let's look at the test signal curve:**

In [None]:
plt.figure(figsize=(20,5))
plt.plot(test.time[::100], test.signal[::100])
plt.show()

**As we can see, there are 10 distinct parts as well, but they look different from the train signal curve, and they are not equally long: the last 2 parts are much longer than the first 8.**

**The test signal curve has many linearly increasing segments, whereas the train signal curve has only one short linearly increasing segment, in the second part between 50 and 60.**

**Since the open_channel curve did not react at all to the linearly increasing segment in the train signal curve, we must remove all linearly increasing segments from the test signal curve.**

**Besides that, we must remove the parabolic shape in the 9th part of the test signal curve and the 4 parabolic shapes in the train signal curve.**


**These two tasks (removing the linear increase and removing the parabolic shape) could be solved with a lot of elegance and mathematics: you could fit the slope of the linear segments as well as the parabolic shapes and then automate everything. I will try it quick and dirty by hand, and let's see if it works and how fast we can progress.**
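**For reference, the "fit it" route could look roughly like the sketch below. It uses `np.polyfit` on synthetic data; the ramp, the parabola and the noise level are made-up stand-ins. On the real signal the switching channels also move the signal up and down, so a plain least-squares fit like this is only a starting point and not what this notebook does:**

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 10 s ramp with slope 0.3, sampled every 1 ms, plus noise
t_lin = 50.0 + 0.001 * np.arange(10000)
y_lin = -2.7 + 0.3 * (t_lin - 50.0) + rng.normal(0, 0.3, t_lin.size)
slope, intercept = np.polyfit(t_lin, y_lin, deg=1)

# Synthetic 50 s parabolic drift peaking at t = 325 with height 5, plus noise
t_par = 300.0 + 0.01 * np.arange(5000)
y_par = -(5.0 / 625.0) * (t_par - 325.0) ** 2 + 5.0 + rng.normal(0, 0.3, t_par.size)

# Centre the time axis before fitting to keep the fit well conditioned
coeffs = np.polyfit(t_par - 325.0, y_par, deg=2)
fitted_drift = np.polyval(coeffs, t_par - 325.0)

print("fitted slope:", round(slope, 3))
print("fitted parabola coefficients:", np.round(coeffs, 4))
```

**The fitted line or parabola would then be subtracted from the signal, just like the hand-measured values in the following sections.**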

# 4. Correct linear increasing slope
## 4.1 Correct linear increasing slope of train

**To remove the linear increase, a simple formula should suffice.**

**The slope of the linearly increasing segment of the train curve in the 2nd part between 50 and 60 can be measured as roughly 3 y-units per 10 x-units, hence our slope is roughly 3/10.**

**We want to modify the train.signal values by subtracting a multiple of the train.time values. The train.time values themselves form a perfect straight line, so we can use them directly and do not have to implement a separate linear function.**

**The only 2 values we need to correct the linearly increasing segment are the slope and the time at which the segment starts, which is 50 in this case.**

**Hence let's try the simple formula slope * (train.time - 50). Subtracting 50 shifts the time values so that the ramp starts at 0, and multiplying by the slope 0.3 scales them to the height of the ramp in the signal. Subtracting the result from the signal should flatten the segment.**
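**Written out, the correction is a one-line identity. This is a sketch under the assumption that the segment can be modelled as a plateau $p$, a linear ramp with slope $m$ starting at $t_0$, and everything else $x(t)$ (channel switching and noise); these symbols are my own and are not columns of the data:**

$$
s(t) = p + m\,(t - t_0) + x(t)
\quad\Longrightarrow\quad
s(t) - m\,(t - t_0) = p + x(t),
\qquad m = 0.3,\; t_0 = 50 .
$$

**So subtracting $m\,(t - t_0)$ removes only the ramp and leaves the plateau and the switching untouched.**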

**Let's see if it works:**

In [None]:
a = 500000 
b = 600000 


plt.plot(train.time[0+1*a:(1+1)*a:dist], train.signal[0+1*a:(1+1)*a:dist])
plt.plot(train.time[0+1*a:(1+1)*a:dist], train.open_channels[0+1*a:(1+1)*a:dist], color = 'red')
plt.show()

#####################################
train2 = train.copy()

c = 0.3
d = 50

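# assign through .loc; chained indexing such as train2.signal[a:b] = ... triggers pandas warnings
train2.loc[train2.index[a:b], 'signal'] = train2.signal[a:b].values - c*(train2.time[a:b].values - d)
train.loc[train.index[a:b], 'signal'] = train2.signal[a:b].values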

**Let's see if it worked:**

In [None]:
a = 500000
dist = 100

plt.plot(train.time[0+1*a:(1+1)*a:dist], train.signal[0+1*a:(1+1)*a:dist])
plt.plot(train.time[0+1*a:(1+1)*a:dist], train.open_channels[0+1*a:(1+1)*a:dist], color = 'red')
plt.show()

## 4.2 Correct linear increasing slope of test

**Beautiful. Sadly, this method cannot be used for the parabolic shapes, since they are not linear.**

**And now we have to remove all the linear increasing slopes of the test data.**

**Let's look at the test data again:**


![](https://i.imgur.com/C3WPCew.png)


**Above we have already split this test signal curve up into the 10 distinct parts.**

**The following parts contain linear increasing slopes, which we have to compensate in order for our models to predict accurately:**

* **part 1**
* **part 2**
* **part 5**
* **part 7**
* **part 8**
* **part 9**


**We will use the same procedure as for the train data.**

**We will measure the slope by hand; it seems to be the same for the entire test signal curve anyway.**

**The offset value of the test.time feature can be read off the x-axis.**

**With these 2 values we can correct the slopes and end up with horizontal plateaus at different heights.**
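**Since the same correction is applied to several parts with only the row range and the time offset changing, the procedure can also be written as a loop over a small table. The sketch below runs on a synthetic three-part signal; the DataFrame `demo`, the list `ramps` and all numbers in it are made up for illustration:**

```python
import numpy as np
import pandas as pd

rows_per_part = 1000
dt = 0.01                                   # made-up sampling step, so each part spans 10 s
idx = np.arange(3 * rows_per_part)
time = 500 + dt * idx
part = idx // rows_per_part

# Parts 0 and 2 ramp with slope 0.3 from their own start, part 1 is flat
ramp = 0.3 * (time - (500 + 10 * part))
signal = np.where(part == 1, 0.0, ramp) - 2.7
demo = pd.DataFrame({"time": time, "signal": signal})

slope = 0.3
# (first row, last row exclusive, time at which the ramp starts)
ramps = [(0, 1000, 500), (2000, 3000, 520)]

for a, b, start in ramps:
    rows = demo.index[a:b]
    demo.loc[rows, "signal"] = (
        demo["signal"].values[a:b] - slope * (demo["time"].values[a:b] - start)
    )

print(demo["signal"].min(), demo["signal"].max())
```

**After the loop every part of the synthetic signal sits on the same flat plateau.**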

In [None]:
a = 100000
dist = 100

print("part 1")
plt.figure(figsize=(20,5))
plt.plot(test.time[0:1*a:dist], test.signal[0:1*a:dist])
plt.show()

print("part 2")
plt.figure(figsize=(20,5))
plt.plot(test.time[1*a:2*a:dist], test.signal[1*a:2*a:dist])
plt.show()

print("part 3")
plt.figure(figsize=(20,5))
plt.plot(test.time[2*a:3*a:dist], test.signal[2*a:3*a:dist])
plt.show()

print("part 4")
plt.figure(figsize=(20,5))
plt.plot(test.time[3*a:4*a:dist], test.signal[3*a:4*a:dist])
plt.show()

print("part 5")
plt.figure(figsize=(20,5))
plt.plot(test.time[4*a:5*a:dist], test.signal[4*a:5*a:dist])
plt.show()

print("part 6")
plt.figure(figsize=(20,5))
plt.plot(test.time[5*a:6*a:dist], test.signal[5*a:6*a:dist])
plt.show()

print("part 7")
plt.figure(figsize=(20,5))
plt.plot(test.time[6*a:7*a:dist], test.signal[6*a:7*a:dist])
plt.show()

print("part 8")
plt.figure(figsize=(20,5))
plt.plot(test.time[7*a:8*a:dist], test.signal[7*a:8*a:dist])
plt.show()

print("part 9")
plt.figure(figsize=(20,5))
plt.plot(test.time[8*a:9*a:dist], test.signal[8*a:9*a:dist])
plt.show()

print("part 10")
plt.figure(figsize=(20,5))
plt.plot(test.time[9*a:10*a:dist], test.signal[9*a:10*a:dist])
plt.show()

print("part 11")
plt.figure(figsize=(20,5))
plt.plot(test.time[10*a:15*a:dist], test.signal[10*a:15*a:dist])
plt.show()

print("part 12")
plt.figure(figsize=(20,5))
plt.plot(test.time[15*a:20*a:dist], test.signal[15*a:20*a:dist])
plt.show()

In [None]:
################

test2 = test.copy()

################
# part 1:

a = 0
b = 100000

c = 0.3
d = 500

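# assign through .loc; chained indexing such as test2.signal[a:b] = ... triggers pandas warnings
test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - c*(test2.time[a:b].values - d)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values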
################
# part 2:

a = 100000
b = 200000

d =  510

test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - c*(test2.time[a:b].values - d)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values
################
# part 5:

a = 400000
b = 500000

d =  540

test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - c*(test2.time[a:b].values - d)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values
################
# part 7:

a = 600000
b = 700000

d =  560

test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - c*(test2.time[a:b].values - d)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values
################
# part 8:

# slope  =  3/10

a = 700000
b = 800000

d =  570

test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - c*(test2.time[a:b].values - d)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values
################
# part 9:

a = 800000
b = 900000

d =  580

test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - c*(test2.time[a:b].values - d)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values
################

print("correcting linear slopes in test successful!")

**Let's see if it worked:**

In [None]:
plt.figure(figsize=(20,5))
plt.plot(test.time[::100], test.signal[::100])
plt.show()

# 5. Correct the parabolic shape

**I looked up this notebook to see how other people removed the parabolic shape:** https://www.kaggle.com/cdeotte/one-feature-model-0-930#Remove-Training-Data-Drift

**We will construct a parabola; for this we need 3 values: minimum, maximum and middle.**

**I printed the minimum and maximum values when I plotted the 10 parts of the train signal curve. The middle can easily be found by hand, since it is just the time stamp in the middle of the parabolic shape, where it has its peak.**
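**To make the construction concrete, here is a small sketch of such a parabola. The function name `parabolic_drift` and the parameter `half_width` are my own; the value 25 assumes a batch of 50 time units with the peak in its middle (25 squared gives 625). The low and high values are the ones quoted from the referenced notebook for part 7. The parabola is zero at both edges of the batch and reaches high minus low at the middle:**

```python
import numpy as np

def parabolic_drift(t, low, high, middle, half_width=25.0):
    """Parabola with peak height (high - low) at `middle`, zero at middle +/- half_width."""
    amplitude = high - low
    return -(amplitude / half_width**2) * (t - middle) ** 2 + amplitude

# Left edge, middle and right edge of the batch that runs from t = 300 to t = 350
t = np.array([300.0, 325.0, 350.0])
drift = parabolic_drift(t, low=-1.817, high=3.186, middle=325.0)
print(drift)
```

**Subtracting this curve from the signal of that batch is what flattens the parabolic shape.**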

## 5.1 Correct parabolic shape in train data

In [None]:
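def remove_parabolic_shape(values, minimum, middle, maximum):
    # Returns the parabola that has to be subtracted from the signal (it does not subtract it itself).
    # The parabola peaks at `middle` with height a = maximum - minimum and is zero
    # 25 time units to either side of the peak (625 = 25**2), i.e. at the edges of a batch.
    a = maximum - minimum
    return -(a/625)*(values - middle)**2+a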

################################################

# I really want to find out, how he found these perfectly working
# numbers, because I can't imagine, that he sat around for hours,
# tweaking these low and high values until it worked.

# idea1: get the min and max value by calculating the mean
# of a certain window at the beginning of the batch
# and at the middle of the batch.


################################################
# part 7 goes from 3000k to 3500k

#his values
#low  = -1.817
#high =  3.186

#my values
#min:   -2.9517 
#max:    4.366

a = 3000000
b = 3500000
minimum = -1.817
middle = 325
maximum = 3.186

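# assign through .loc; chained indexing such as train2.signal[a:b] = ... triggers pandas warnings
train2.loc[train2.index[a:b], 'signal'] = train2.signal[a:b].values - remove_parabolic_shape(train2.time[a:b].values, minimum, middle, maximum)
train.loc[train.index[a:b], 'signal'] = train2.signal[a:b].values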

################################################
# part 8 goes from 3500k to 4000k

#his values
#low  = -0.094
#high =  4.936

#my values
#min:   -3.0399 
#max:    9.9986

a = 3500000
b = 4000000
minimum = -0.094
middle = 375
maximum = 4.936

train2.loc[train2.index[a:b], 'signal'] = train2.signal[a:b].values - remove_parabolic_shape(train2.time[a:b].values, minimum, middle, maximum)
train.loc[train.index[a:b], 'signal'] = train2.signal[a:b].values

################################################
# part 9 goes from 4000k to 4500k

#his values
#low  =  1.715
#high =  6.689

#my values
#min:   -2.0985 
#max:    9.0889

a = 4000000
b = 4500000
minimum = 1.715
middle = 425
maximum = 6.689

train2.loc[train2.index[a:b], 'signal'] = train2.signal[a:b].values - remove_parabolic_shape(train2.time[a:b].values, minimum, middle, maximum)
train.loc[train.index[a:b], 'signal'] = train2.signal[a:b].values

################################################
# part10 goes from 4500k to 5000k

#his values
#low  =  3.361
#high =  8.45

#my values
#min:   -1.5457 
#max:   12.683

a = 4500000
b = 5000000
minimum = 3.361
middle = 475
maximum = 8.45

train2.loc[train2.index[a:b], 'signal'] = train2.signal[a:b].values - remove_parabolic_shape(train2.time[a:b].values, minimum, middle, maximum)
train.loc[train.index[a:b], 'signal'] = train2.signal[a:b].values

################################################

In [None]:
a = 500000
dist = 100

for i in range(6,10):    
    plt.figure(figsize=(20,5))
    plt.plot(train.time[0+i*a:(i+1)*a:dist], train.signal[0+i*a:(i+1)*a:dist])
    plt.plot(train.time[0+i*a:(i+1)*a:dist], train.open_channels[0+i*a:(i+1)*a:dist], color = 'red')
    plt.show()

**As we can see, this method worked quite well, all parabolic shapes are gone.**
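**The idea noted in the code comments (estimating the low and high values from window means at the beginning and at the middle of a batch) can be sketched as follows. This runs on a synthetic batch without any channel switching, so it only shows the mechanics; the window size, the baseline and the noise level are made up, and I do not know whether the referenced notebook obtained its values this way:**

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 50 s batch: baseline -2.0 plus a parabolic drift of height 5.0 plus noise
t = 300 + 0.01 * np.arange(5000)
true_low, true_height, middle = -2.0, 5.0, 325.0
signal = (
    true_low
    - (true_height / 625) * (t - middle) ** 2
    + true_height
    + rng.normal(0, 0.2, t.size)
)

window = 50  # samples averaged at the start and around the middle of the batch
low = signal[:window].mean()
mid_idx = int(np.argmin(np.abs(t - middle)))
high = signal[mid_idx - window // 2 : mid_idx + window // 2].mean()

# Build the parabola from the two estimates and subtract it
drift = -((high - low) / 625) * (t - middle) ** 2 + (high - low)
corrected = signal - drift

print("estimated height:", round(high - low, 3))
print("mean after correction:", round(corrected.mean(), 3))
```

**The window has to be short compared with the batch, because the drift already rises inside the first window and biases the low estimate upwards. On the real data the windows would also need comparable channel activity, which is an assumption that has to be checked.**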

**Now we have to do the same thing for the big parabolic shape in the test signal curve.**

## 5.2 Correct parabolic shape in test data

In [None]:
#######################################################
# his magical function full of magical numbers

def f(x):
    return -(0.00788)*(x-625)**2+2.345 +2.58


#test2.loc[test2.index[a:b],'signal'] = test2.signal.values[a:b] - f(test2.time[a:b].values)
#######################################################

test2 = test.copy()


a = 1000000
b = 1500000

plt.figure(figsize=(20,5))
plt.plot(test.time[a:b], test.signal[a:b])
plt.show()

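# assign through .loc; chained indexing such as test2.signal[a:b] = ... triggers pandas warnings
test2.loc[test2.index[a:b], 'signal'] = test2.signal[a:b].values - f(test2.time[a:b].values)
#test2.signal[a:b] = test2.signal[a:b].values - remove_parabolic_shape(test2.time[a:b].values, minimum, middle, maximum)
test.loc[test.index[a:b], 'signal'] = test2.signal[a:b].values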

plt.figure(figsize=(20,5))
plt.plot(test.time[a:b], test.signal[a:b])
plt.show()

# 6. Choose and train models

**In this notebook:**  https://www.kaggle.com/cdeotte/one-feature-model-0-930#Make-Five-Simple-Models

**he identified 5 different kinds of parts in the signal curves by the number of open channels and how fast that number switches.**

* **1 slow open channel**
* **1 fast open channel**
* **3 open channels**
* **5 open channels**
* **10 open channels**


**The differentiation between slow and fast will only be made for the parts where there is only 1 or 0 open channels.**

**In all the other parts the switching is always fast compared to the first part from 0 to 100 where the one open channel switches slowly.**

**This can be seen in this picture:**

![](https://i.imgur.com/gdoz3nE.png)

**He then wisely chooses 5 different models, optimizes the parameters for that exact model and then trains the model with the correct part of the training data.**

**He chooses DecisionTreeClassifier models and mainly adjusts the max_depth parameter to the number of open channels.**

**For the training of the models we will only use the signal feature, since the time feature does not contain any valuable info, it is simply the linear increasing time that belongs to the measurement of the signal curve.**

**Let's go:**

## 6.1 1 slow open channel

**Only 0 or 1 channels are open in this time window, hence we only need max_depth = 1.**
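**A tree with max_depth = 1 is a single threshold on the signal. The sketch below shows this on synthetic one-feature data; the two signal levels (-2.7 for closed, -1.5 for open) and the noise are made up for illustration:**

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic signal values: closed channel around -2.7, one open channel around -1.5
closed = rng.normal(-2.7, 0.25, 500)
opened = rng.normal(-1.5, 0.25, 500)
X = np.concatenate([closed, opened]).reshape(-1, 1)
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# The root node holds the only split; everything below it predicts 0, everything above 1
threshold = stump.tree_.threshold[0]
print("leaves:", stump.get_n_leaves(), "threshold:", round(threshold, 3))
```

**The learned threshold lands between the two levels, which is all a two-class, one-feature model needs here.**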

In [None]:
from sklearn.metrics import f1_score
import graphviz
from sklearn import tree

In [None]:
# 1 slow open channel

a =  0
b =  500000
c =  500000
d = 1000000

X_train = np.concatenate([train.signal.values[a:b],train.signal.values[c:d]]).reshape((-1,1))
# keep y one-dimensional; a column vector triggers a scikit-learn DataConversionWarning
y_train = np.concatenate([train.open_channels.values[a:b],train.open_channels.values[c:d]])

print('Training model_1_slow_open_channel...')
model_1_slow_channel = tree.DecisionTreeClassifier(max_depth=1)
model_1_slow_channel.fit(X_train,y_train)

preds = model_1_slow_channel.predict(X_train)

# note: this score is computed on the training data itself, not on held-out data
print('has f1 training score =', f1_score(y_train,preds, average='macro'))


#tree_graph = tree.export_graphviz(model_1_slow_channel, out_file=None, max_depth = 10,
#    impurity = False, feature_names = ['signal'], class_names = ['0', '1'],
#    rounded = True, filled= True )
#graphviz.Source(tree_graph)  

## 6.2 1 fast open channel

**Again only 0 or 1 channels are open in that time window, hence we use max_depth = 1 again.**

In [None]:
a = 1000000
b = 1500000

c = 3000000 
d = 3500000

X_train = np.concatenate([train.signal.values[a:b],train.signal.values[c:d]]).reshape((-1,1))
# keep y one-dimensional; a column vector triggers a scikit-learn DataConversionWarning
y_train = np.concatenate([train.open_channels.values[a:b],train.open_channels.values[c:d]])

print('Training model_1_fast_channel...')
model_1_fast_channel = tree.DecisionTreeClassifier(max_depth=1)
model_1_fast_channel.fit(X_train, y_train)

preds = model_1_fast_channel.predict(X_train)

# note: this score is computed on the training data itself, not on held-out data
print('has f1 training score =',f1_score(y_train,preds,average='macro'))

#tree_graph = tree.export_graphviz(model_1_fast_channel, out_file=None, max_depth = 10,
#    impurity = False, feature_names = ['signal'], class_names = ['0', '1'],
#    rounded = True, filled= True )
#graphviz.Source(tree_graph) 

## 6.3 3 open channels

**In these time windows there are 0, 1, 2 or 3 open channels, hence we use max_depth = 4.**

**He uses max_leaf_nodes = 4 and gets an F1 score of 0.9321.**

**I replaced max_leaf_nodes = 4 with max_depth = 4 and got 0.9454, so I stick with max_depth.**
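**Note that the two parameters limit different things: max_leaf_nodes = 4 allows at most 4 leaves, while max_depth = 4 allows up to 2^4 = 16 leaves. Part of the higher training score may therefore simply come from a larger tree. The sketch below compares both settings on synthetic four-class data (the class levels and the noise are made up):**

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic one-feature data with 4 overlapping classes
levels = [0.0, 1.0, 2.0, 3.0]            # made-up signal level per class
X = np.concatenate([rng.normal(m, 0.4, 2000) for m in levels]).reshape(-1, 1)
y = np.repeat(np.arange(4), 2000)

by_leaves = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)
by_depth = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

print("max_leaf_nodes=4:", by_leaves.get_n_leaves(), "leaves, depth", by_leaves.get_depth())
print("max_depth=4:     ", by_depth.get_n_leaves(), "leaves, depth", by_depth.get_depth())
```

**Whether the extra leaves help on unseen data is a separate question that the training score cannot answer.**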

In [None]:
a = 1500000 
b = 2000000

c = 3500000 
d = 4000000

X_train = np.concatenate([train.signal.values[a:b],train.signal.values[c:d]]).reshape((-1,1))
# keep y one-dimensional; a column vector triggers a scikit-learn DataConversionWarning
y_train = np.concatenate([train.open_channels.values[a:b],train.open_channels.values[c:d]])

print('Training model_3_open_channels')
model_3_channels = tree.DecisionTreeClassifier(max_depth=4)
model_3_channels.fit(X_train,y_train)

preds = model_3_channels.predict(X_train)
# note: this score is computed on the training data itself, not on held-out data
print('has f1 training score =',f1_score(y_train,preds,average='macro'))

#tree_graph = tree.export_graphviz(model_3_channels, out_file=None, max_depth = 10,
#    impurity = False, feature_names = ['signal'], class_names = ['0', '1','2','3'],
#    rounded = True, filled= True )
#graphviz.Source(tree_graph) 

## 6.4 5 open channels

**In these time windows there are 0,1,2,3,4,5 or 6 open channels,  he uses max_leaf_nodes = 6, probably because there is only one dip down to 0 channels in the second time window from 400 to 450.**

**I again tried replacing max_leaf_nodes with max_depth, and again my F1 score got slightly better when using max_depth.**

**Here I tried max_depth = 6 and max_depth = 7, and the F1 score with max_depth = 7 was better only in the 4th decimal place.**

**But so far we are only calculating the F1-score with the data we used to train the model,  hence using max_depth = 7 instead of 6 can lead to overfitting, which results in a better score here on the X_train data, but will result in a worse score later on the unknown test data.**

**Hence I will use the smaller value of max_depth to prevent overfitting.**
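**One way to check for overfitting without submitting is to hold out part of the training data and compare the F1 score on the fitted part with the score on the held-out part. The sketch below does this for max_depth = 6 and 7 on synthetic data (levels and noise are made up). It uses a random split for brevity; for a time series like this one, holding out a contiguous block may give a more honest estimate:**

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic one-feature data for 0..5 open channels
levels = np.arange(6) * 1.2              # made-up signal level per channel count
y = rng.integers(0, 6, 6000)
X = (levels[y] + rng.normal(0, 0.5, y.size)).reshape(-1, 1)

X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scores = {}
for depth in (6, 7):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    scores[depth] = (
        f1_score(y_fit, model.predict(X_fit), average="macro"),   # score on fitted data
        f1_score(y_val, model.predict(X_val), average="macro"),   # score on held-out data
    )

for depth, (fit_score, val_score) in scores.items():
    print(depth, round(fit_score, 4), round(val_score, 4))
```

**If the held-out score stops improving or drops while the fitted score keeps rising, the deeper tree is probably overfitting.**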

In [None]:
a = 2500000
b = 3000000

c = 4000000 
d = 4500000


X_train = np.concatenate([train.signal.values[a:b],train.signal.values[c:d]]).reshape((-1,1))
# keep y one-dimensional; a column vector triggers a scikit-learn DataConversionWarning
y_train = np.concatenate([train.open_channels.values[a:b],train.open_channels.values[c:d]])

print('Training model_5_open_channels')
model_5_channels = tree.DecisionTreeClassifier(max_depth=6)
model_5_channels.fit(X_train, y_train)
preds = model_5_channels.predict(X_train)
# note: this score is computed on the training data itself, not on held-out data
print('has f1 training score =',f1_score(y_train,preds,average='macro'))

#tree_graph = tree.export_graphviz(model_5_channels, out_file=None, max_depth = 10,
#    impurity = False, feature_names = ['signal'], class_names = ['0', '1','2','3','4','5'],
#    rounded = True, filled= True )
#graphviz.Source(tree_graph) 

## 6.5 10 open channels

**Again the same effect was observable:  using max_depth instead of max_leaf_nodes yields a slightly better F1 score.**

**He used max_leaf_nodes = 8,  I tried max_depth = 9, because there are 2,3,4,5,6,7,8,9,10 open channels, but it only dips down to 2 channels one time per time window.**

**Using max_depth = 9 yielded F1: 0.8597**

**Using max_depth = 8 yielded F1: 0.7674**

**Because the F1 score with max_depth = 9 is so much better here, I will use max_depth = 9 and hope it doesn't cause much overfitting.**

**I can still play with these parameters and submit, to see what the public/private score will be in that competition.**

In [None]:
a = 2000000
b = 2500000

c = 4500000 
d = 5000000

X_train = np.concatenate([train.signal.values[a:b],train.signal.values[c:d]]).reshape((-1,1))
# keep y one-dimensional; a column vector triggers a scikit-learn DataConversionWarning
y_train = np.concatenate([train.open_channels.values[a:b],train.open_channels.values[c:d]])

print('Training model_10_open_channels')
model_10_channels = tree.DecisionTreeClassifier(max_depth=9)  # max_depth = 9 may be overfitting, try 8 and see if priv/pub score gets better
model_10_channels.fit(X_train, y_train)

preds = model_10_channels.predict(X_train)
# note: this score is computed on the training data itself, not on held-out data
print('has f1 training score =',f1_score(y_train,preds,average='macro'))

#tree_graph = tree.export_graphviz(model_10_channels, out_file=None, max_depth = 10,
#    impurity = False, feature_names = ['signal'], class_names = [str(x) for x in range(11)],
#    rounded = True, filled= True )
#graphviz.Source(tree_graph) 

# 7. Predict with test data

**I liked his way of loading sample_submission.csv, replacing all the values in that dataframe, and then submitting it.**

**This way we don't have to construct a new dataframe with the correct column names etc.**

**To use the 5 separate models correctly, we have to assign each model to the right parts of the test signal.**

**For the corrected train signal curve we have trained the following 5 models for the different parts of the train signal curve:**

![](https://i.imgur.com/0kcC2xQ.png)

**Hence we look at our test signal curve, try to identify parts that look similar to the train signal curve, and then assign that corresponding model to that part of the test signal curve.**

**Then we end up with the following image:**

![](https://i.imgur.com/sE6vmMj.png)
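**The assignment of models to parts can also be kept in a small table and applied in a loop, which makes it easy to check that every row is covered exactly once. The sketch below uses stand-in models and a made-up layout of 50 rows; `ConstantModel`, `layout` and the keys are hypothetical names for illustration only:**

```python
import numpy as np

class ConstantModel:
    """Stand-in for a trained model: any object with a predict(X) method would do."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

models = {"1s": ConstantModel(1), "3": ConstantModel(3), "5": ConstantModel(5)}

# (first row, last row exclusive, model key), a made-up layout for illustration
layout = [(0, 10, "1s"), (10, 20, "3"), (20, 30, "5"), (30, 50, "1s")]

signal = np.zeros(50)
predictions = np.empty(50, dtype=int)
for a, b, key in layout:
    predictions[a:b] = models[key].predict(signal[a:b].reshape(-1, 1))

# Every row should be covered exactly once
covered = sum(b - a for a, b, _ in layout)
print(covered, predictions[::10])
```

**With the real models, the table would list the row ranges of the test signal and the name of the model assigned to each of them.**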

In [None]:
sub = pd.read_csv('../input/liverpool-ion-switching/sample_submission.csv')

a = 100000

# part 1
sub.iloc[0*a:1*a,1] = model_1_slow_channel.predict(test.signal.values[0*a:1*a].reshape((-1,1)))

# part 2
sub.iloc[1*a:2*a,1] = model_3_channels.predict(test.signal.values[1*a:2*a].reshape((-1,1)))

# part 3
sub.iloc[2*a:3*a,1] = model_5_channels.predict(test.signal.values[2*a:3*a].reshape((-1,1)))

# part 4
sub.iloc[3*a:4*a,1] = model_1_slow_channel.predict(test.signal.values[3*a:4*a].reshape((-1,1)))

# part 5
sub.iloc[4*a:5*a,1] = model_1_fast_channel.predict(test.signal.values[4*a:5*a].reshape((-1,1)))

# part 6
sub.iloc[5*a:6*a,1] = model_10_channels.predict(test.signal.values[5*a:6*a].reshape((-1,1)))

# part 7
sub.iloc[6*a:7*a,1] = model_5_channels.predict(test.signal.values[6*a:7*a].reshape((-1,1)))

# part 8
sub.iloc[7*a:8*a,1] = model_10_channels.predict(test.signal.values[7*a:8*a].reshape((-1,1)))

# part 9
sub.iloc[8*a:9*a,1] = model_1_slow_channel.predict(test.signal.values[8*a:9*a].reshape((-1,1)))

# part 10
sub.iloc[9*a:10*a,1] = model_3_channels.predict(test.signal.values[9*a:10*a].reshape((-1,1)))

# part 11
sub.iloc[10*a:20*a,1] = model_1_slow_channel.predict(test.signal.values[10*a:20*a].reshape((-1,1)))

print("prediction successful!")

**Let's plot the predictions to see if it worked:**

In [None]:
plt.figure(figsize=(20,5))
res = 1000
let = ['A','B','C','D','E','F','G','H','I','J']
plt.plot(range(0,test.shape[0],res),sub.open_channels[0::res])
for i in range(5): plt.plot([i*500000,i*500000],[-5,12.5],'r')
for i in range(21): plt.plot([i*100000,i*100000],[-5,12.5],'r:')
for k in range(4): plt.text(k*500000+250000,10,str(k+1),size=20)
for k in range(10): plt.text(k*100000+40000,7.5,let[k],size=16)
plt.title('Test Data Predictions',size=16)
plt.show()

**And now let's save the predictions and submit :)**

In [None]:
sub.to_csv('submission.csv', index = False, float_format='%.4f')

print("submission.csv saved successfully!")



#################################################
# result so far:

# public uses 30% of the test data
# public score:   0.92874  
# public rank:  1888/2618


# private uses 70% of the test data
# private score:  0.91612
# private rank: 1711/2618

#################################################

In [None]:
# things to do:

# 1.) remove all these warnings, maybe try the df.loc[df.column, 'signal']  alternative
# 2.) understand how he got the parabola values to work so nicely

# things to improve performance:

# 1.) tweak max_depth  of the decisiontree models, then submit, and see if the private/public score improves