# How I Got a Perfect Prediction by Deleting a Hyphen
## Watch out when using `shift()`!

More like this over at [datasciencehorrorstories.com](http://datasciencehorrorstories.com)

In [1]:
import pandas as pd
import numpy as np

Create a dataset of random integers and a binary outcome

In [2]:
np.random.seed(142) # for reproducibility

df = pd.DataFrame()
df["time_column"] = pd.Series(pd.date_range("20161024 00:00:00", "20161031 00:00:00", freq="H"))
df["my_value"] = pd.Series(np.random.randint(1, 10, size=len(df)))

# I want to preserve the original data for use later, so I'll take a copy
df2 = df.copy()

df2["my_previous_value"] = df2.my_value.shift(1)
df2["my_next_value"] = df2.my_value.shift(-1)

# drop the NaN rows that shifting created at the front and back
df2 = df2.dropna()

# create outcome column
df2["outcome"] = df2.my_next_value > 5

df2.head()

|   | time_column | my_value | my_previous_value | my_next_value | outcome |
|---|---|---|---|---|---|
| 1 | 2016-10-24 01:00:00 | 6 | 6.0 | 1.0 | False |
| 2 | 2016-10-24 02:00:00 | 1 | 6.0 | 2.0 | False |
| 3 | 2016-10-24 03:00:00 | 2 | 1.0 | 3.0 | False |
| 4 | 2016-10-24 04:00:00 | 3 | 2.0 | 8.0 | True |
| 5 | 2016-10-24 05:00:00 | 8 | 3.0 | 5.0 | False |
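
If the direction of `shift()` is hard to keep straight, a tiny example helps. This is just a sketch on a made-up series (the values 10, 20, 30, 40 are mine, not from the dataset above): `shift(1)` pulls each value down one row, so a row sees the previous value, while `shift(-1)` pulls values up, so a row sees the next one.

```python
import pandas as pd

# Toy series with made-up values, only to show the direction of shift()
s = pd.Series([10, 20, 30, 40], name="my_value")

demo = pd.DataFrame({
    "my_value": s,
    "shift_plus_1": s.shift(1),    # value from the previous row (NaN in the first row)
    "shift_minus_1": s.shift(-1),  # value from the next row (NaN in the last row)
})
print(demo)
```

The only difference between "previous" and "next" is the sign of the argument, which is exactly why this is so easy to get wrong.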


Train a binary classifier on this random data

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, accuracy_score

# extract features and outcome
X = df2[["my_value", "my_previous_value"]]
y = df2.outcome

# train-test split
# use 70-30, so for 167 rows use 116-51
X_train = X.iloc[:116]
X_test = X.iloc[116:]
y_train = y.iloc[:116]
y_test = y.iloc[116:]

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print("Accuracy: {0:.2f}, Precision: {1:.2f}".format(accuracy, precision))

Accuracy: 0.57, Precision: 0.50


As expected, this is pretty terrible: the numbers are random, so the features tell the model nothing about the outcome!
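
To put a number on "terrible", here is a rough sketch of the baseline you'd get by always guessing the more common class. It assumes the values are uniform over 1 to 9, which is what `np.random.randint(1, 10)` draws from (the upper bound is excluded).

```python
import numpy as np

# np.random.randint(1, 10) draws integers 1..9, each equally likely
values = np.arange(1, 10)

p_true = np.mean(values > 5)                 # 4 of the 9 values exceed 5
majority_baseline = max(p_true, 1 - p_true)  # accuracy of always guessing False

print("P(outcome is True) = {0:.3f}, majority baseline = {1:.3f}".format(p_true, majority_baseline))
```

An accuracy of 0.57 is about what you'd get from always predicting `False`, so the model has learned essentially nothing.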

But what happens if we forget the hyphen?

In [None]:
df3 = df.copy()

df3["my_previous_value"] = df3.my_value.shift(1)
df3["my_next_value"] = df3.my_value.shift(1)

At this point, the "previous" and "next" columns are identical!

So when we create our binary outcome, we're accidentally deriving it from one of our features!
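
A cheap guard against this kind of mistake is to check whether any feature column is an exact copy of the column the outcome is derived from. The helper below is a hypothetical sketch (`find_leaky_features` is my own name, not a pandas or scikit-learn function), run on a small made-up frame rather than the data above. It only catches exact duplicates, not subtler leakage.

```python
import numpy as np
import pandas as pd

def find_leaky_features(frame, feature_cols, target_source_col):
    """Return the feature columns that are exact copies of the target source column."""
    # Series.equals treats NaNs in the same position as equal
    return [col for col in feature_cols if frame[col].equals(frame[target_source_col])]

# Small made-up frame that reproduces the bug
rng = np.random.default_rng(0)
toy = pd.DataFrame({"my_value": rng.integers(1, 10, size=20)})
toy["my_previous_value"] = toy.my_value.shift(1)
toy["my_next_value"] = toy.my_value.shift(1)  # the bug: should be shift(-1)

leaky = find_leaky_features(toy, ["my_value", "my_previous_value"], "my_next_value")
print(leaky)
```

Running a check like this before training would flag `my_previous_value` straight away.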

In [4]:
# drop NaN values created by shifting (only the first row this time, as both shifts go the same way)
df3 = df3.dropna()

# create outcome column
df3["outcome"] = df3.my_next_value > 5

# extract features and outcome
X = df3[["my_value", "my_previous_value"]]
y = df3.outcome

# train-test split
# reuse the same cut point as before; df3 has 168 rows here, so the split is 116-52
X_train = X.iloc[:116]
X_test = X.iloc[116:]
y_train = y.iloc[:116]
y_test = y.iloc[116:]

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print("Accuracy: {0:.2f}, Precision: {1:.2f}".format(accuracy, precision))

Accuracy: 0.94, Precision: 0.91


Well, that's one way to improve our accuracy...