____

In this kernel I've made a small simulation framework for testing your features (based on https://www.kaggle.com/c/trackml-particle-identification/discussion/62804#latest-368156), and I share my own attempts to find the best features.

___

I have been struggling to solve this clustering competition. I understand that we need to find the proper features to get beyond 0.6, but without public kernels I can't get more than 0.2. I believe that I'm missing something very obvious.

I've read almost all the forum topics. I think these are the most valuable posts (maybe this list will help someone):

https://www.kaggle.com/c/trackml-particle-identification/discussion/62804#latest-368156
https://www.kaggle.com/khahuras/0-53x-clustering-using-hough-features-basic/notebook
https://www.kaggle.com/c/trackml-particle-identification/discussion/61590#latest-368105
https://www.kaggle.com/c/trackml-particle-identification/discussion/61081#latest-367422

_____

I'm afraid I won't have time to build and run a top-scoring solution. So let's try to win at least a kernel medal :)

Let's assume that a helix is described by these equations:

$$x = -R \sin{(\omega \cdot t + \theta_0)} + (R - D)\sin{(\theta_0)}$$
$$y = -R \cos{(\omega \cdot t + \theta_0)} + (R - D)\cos{(\theta_0)}$$
$$z = z_0 + v \cdot t $$

As we can see, the helix has 4 variables: $x,y,z,t$ and 6 parameters: $\omega, v, R, \theta_0, D, z_0$.

Let's fix 3 of them: $R, \theta_0, z_0 $. We will iterate over their values from a certain range to reduce the number of variables in the equation.

Pseudocode:
``` python
final_clusters = None  # nothing merged yet
for theta0 in np.linspace(theta0_from, theta0_to, theta0_iterations):
    for z0 in np.linspace(z0_from, z0_to, z0_iterations):
        for R in np.linspace(R_from, R_to, R_iterations):
            # features that should be constant along a helix with these parameters
            features = calculate_features(hits_data, theta0, z0, R)
            clusters = clusterize(features)
            print(score(clusters))
            final_clusters = merge(clusters=clusters, to=final_clusters)

print("Final score: {}".format(score(final_clusters)))
```

Now we have three equations with only three unknowns: $\omega, v, D$. I'll show you that we can combine $\omega$ and $v$ into $\frac{\omega}{v}$, so we will have only 2 unknowns.

These quantities are constant for each helix. Let's derive expressions for them:

$$D = f(x,y,z,t,R, \theta_0, z_0)$$
$$\frac{\omega}{v} = g(x,y,z,t,R, \theta_0, z_0)$$

_____

# The equation for $D$

$$R \sin{(\omega \cdot t + \theta_0)} =  (R - D)\sin{(\theta_0)} - x$$
$$R \cos{(\omega \cdot t + \theta_0)} = (R - D)\cos{(\theta_0)} - y$$


Square both sides:

$$R^2 \sin^2{(\omega \cdot t + \theta_0)} = \left((R - D)\sin{(\theta_0)} - x\right)^2 $$
$$R^2 \cos^2{(\omega \cdot t + \theta_0)} = \left((R - D)\cos{(\theta_0)} - y\right)^2 $$

Sum:

$$R^2 = (R-D)^2 + x^2 + y^2 - 2(R-D)\cdot(x\sin\theta_0 + y\cos\theta_0)$$

Let's denote the expression

$$\gamma = x\sin\theta_0  +  y\cos\theta_0$$

Then

$$D^2 - 2RD + x^2 + y^2 - 2R\gamma + 2D\gamma = 0$$

$$D^2 - 2D(R-\gamma) + x^2 +y^2 - 2R\gamma = 0$$

Solve the quadratic equation:

$$D = R - \gamma ± \sqrt{R^2 + \gamma^2 - x^2 -y^2} $$

Or, since $\gamma^2 - x^2 - y^2 = -(x\cos\theta_0 - y\sin\theta_0)^2$ (expand $\gamma^2$ and use $\sin^2\theta_0 + \cos^2\theta_0 = 1$),

$$D = R - (x\sin\theta_0  +  y\cos\theta_0) \pm \sqrt{R^2 - (x\cos\theta_0  -  y\sin\theta_0)^2} $$

Notice that there are two possible solutions.
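
A quick numerical sanity check of this solution can be sketched as follows. It uses the form $D = R - \gamma \pm \sqrt{R^2 + \gamma^2 - x^2 - y^2}$ on a test helix whose parameter values are arbitrary; `d_roots` is just a hypothetical helper name. Substituting the helix equations gives $R - \gamma = D + R\cos(\omega t)$ and a square root equal to $R\,|\cos(\omega t)|$, so I'd expect one of the two roots to reproduce $D$ at every point, with the correct root switching whenever $\cos(\omega t)$ changes sign.

```python
import numpy as np

def d_roots(x, y, R, theta0):
    """Both roots of D^2 - 2D(R - gamma) + x^2 + y^2 - 2R*gamma = 0."""
    gamma = x * np.sin(theta0) + y * np.cos(theta0)
    # Clip tiny negative values that can appear from floating-point rounding.
    disc = np.clip(R ** 2 + gamma ** 2 - x ** 2 - y ** 2, 0.0, None)
    root = np.sqrt(disc)
    return R - gamma + root, R - gamma - root

# Test helix with made-up parameters (assumption: any values would do).
R, w, theta0, D = 1000.0, 3 * np.pi, -0.6, 500.0
t = np.linspace(0.0, 1.0, 200)
x = -R * np.sin(w * t + theta0) + (R - D) * np.sin(theta0)
y = -R * np.cos(w * t + theta0) + (R - D) * np.cos(theta0)

d_plus, d_minus = d_roots(x, y, R, theta0)
# Pick, point by point, the root that is closer to the true D.
closest = np.where(np.abs(d_plus - D) < np.abs(d_minus - D), d_plus, d_minus)
max_error = np.max(np.abs(closest - D))
print("max |closest root - D|:", max_error)
print("spread of the '+' root alone:", np.ptp(d_plus))
```

If this holds, neither root is constant along the helix on its own, but the pair always contains $D$.
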
_____

# The equation for $\frac{\omega}{v}$
Let's find the equation for $\frac{\omega}{v}$ too.

Let's get rid of $D$ using a simple proportion:
$$\frac{x + R\sin(\omega t + \theta_0)}{\sin(\theta_0)} = R - D = \frac{y + R\cos(\omega t + \theta_0)}{\cos(\theta_0)}$$

$$x\cos(\theta_0) - y\sin(\theta_0) = R[\sin(\theta_0)\cdot \cos(\omega t + \theta_0) - 
\cos(\theta_0) \cdot \sin(\omega t + \theta_0)] = R \sin(-\omega t)$$

Express the time from the equation for the $z$ coordinate:

$$ t = \frac{z-z_0}{v}$$

And finally


$$\frac{\omega}{v} = \frac{1}{z-z_0} \cdot \arcsin \frac{y\sin( \theta_0) - x\cos( \theta_0)}{R}$$
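
One caveat worth checking: $\arcsin$ only returns values in $[-\pi/2, \pi/2]$, so this formula can recover $\frac{\omega}{v}$ only while $|\omega t| \le \pi/2$. Below is a minimal sketch of that check on a test helix. The parameter values and the evenly spaced `t` grid are my own assumptions for the illustration.

```python
import numpy as np

# Test helix (assumed parameter values, chosen only for the illustration).
R, w, v, theta0, D, z0 = 2000.0, 3 * np.pi, 3000.0, np.pi / 3, 500.0, 100.0
t = np.linspace(0.01, 1.0, 100)  # start above 0 so that z - z0 is never 0
x = -R * np.sin(w * t + theta0) + (R - D) * np.sin(theta0)
y = -R * np.cos(w * t + theta0) + (R - D) * np.cos(theta0)
z = z0 + v * t

# The arcsin argument equals sin(w * t); clip guards against rounding past +-1.
arg = np.clip((y * np.sin(theta0) - x * np.cos(theta0)) / R, -1.0, 1.0)
ratio = np.arcsin(arg) / (z - z0)

inside = np.abs(w * t) <= np.pi / 2   # principal branch of arcsin
err = np.abs(ratio - w / v)
print("max error on the principal branch:", err[inside].max())
print("min error outside of it:", err[~inside].min())
```

On the principal branch the feature matches $\frac{\omega}{v}$ up to rounding; outside of it the value drifts away, so only the first part of a strongly curved track would share the same feature value.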

______

#### Let's test these features on perfect helices

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
def get_helix(R=2000, w=np.pi * 3, v=3000, theta0=np.pi / 3, D=500, z0=100):
    # Generates perfect helix data for feature testing
    t = np.random.rand(1000)
    df = pd.DataFrame(index=t)
    df["z"] = z0 + v * t
    df["x"] = -R * np.sin(w * t + theta0)+(R - D) * np.sin(theta0)
    df["y"] = -R * np.cos(w * t + theta0)+(R - D) * np.cos(theta0)
    return df

def plot_helix(df):
    # 3D plot
    from mpl_toolkits.mplot3d import Axes3D
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(df.x, df.y, df.z)
    ax.set_xlabel('X Label')
    ax.set_ylabel('Y Label')
    ax.set_zlabel('Z Label')
    plt.show()

Let's visualize a generated helix.

In [None]:
df = get_helix()
df.plot.scatter(x="x", y="y")
plot_helix(df)

In [None]:
def add_features(df, theta0=0, R=2000, z0=0):
    # Adds both roots of the quadratic for D (d1, d2) and the omega / v feature
    x, y, z = df.x, df.y, df.z

    x_sin_y_cos = x * np.sin(theta0) + y * np.cos(theta0)  # gamma
    # Note the minus sign: gamma^2 - x^2 - y^2 = -(x cos(theta0) - y sin(theta0))^2
    x_cos_y_sin = x * np.cos(theta0) - y * np.sin(theta0)
    root = np.sqrt(R ** 2 - x_cos_y_sin ** 2)  # NaN if the point can't lie on such a helix
    df["d1"] = R - x_sin_y_cos + root
    df["d2"] = R - x_sin_y_cos - root
    df["w_div_v"] = np.arcsin((y * np.sin(theta0) - x * np.cos(theta0)) / R) / (z - z0)
    return df

Let's generate several helixes with different parameters to check our features.

In [None]:
h1 = get_helix(theta0=0, R=2000, z0=0)
h2 = get_helix(theta0=-0.6, R=1000, z0=250)
h3 = get_helix(theta0=0.6, R=1500, z0=50)

Let's assume that we have found the parameters of one helix (here `h2`) and calculate the features of all three helices using these parameters.

In [None]:
theta0 = -0.6
R = 1000
z0 = 250

h1 = add_features(h1, theta0=theta0, R=R, z0=z0)
h2 = add_features(h2, theta0=theta0, R=R, z0=z0)
h3 = add_features(h3, theta0=theta0, R=R, z0=z0)

Ooops! We got some warnings from inside the arcsin: for some points its argument falls outside $[-1, 1]$. That means the $(x,y,z)$ point is definitely not on the helix with params $(\theta_0, R, z_0)$.

Let's plot the features for every helix. Remember that the green helix is the right one.

In [None]:
ax = h1.plot.scatter(x="x", y="d1", c="r")
ax = h2.plot.scatter(x="x", y="d1", c="g", ax=ax)
ax = h3.plot.scatter(x="x", y="d1", c="b", ax=ax, title="Feature: D1")

In [None]:
ax = h1.plot.scatter(x="x", y="d2", c="r")
ax = h2.plot.scatter(x="x", y="d2", c="g", ax=ax)
ax = h3.plot.scatter(x="x", y="d2", c="b", ax=ax, title="Feature: D2")

In [None]:
ax = h1.plot.scatter(x="x", y="w_div_v", c="r")
ax = h2.plot.scatter(x="x", y="w_div_v", c="g", ax=ax)
ax = h3.plot.scatter(x="x", y="w_div_v", c="b", ax=ax, title="Feature: w/v")

The D1 and D2 features failed. Maybe we should try some combination. Does it mean that the $D$ of a helix is not constant?

Let's see the `value_counts` of the last feature, which seems to be the best:

In [None]:
h2.w_div_v.value_counts().head(20)

We can see that a lot of values are the same. That means I have chosen at least one right feature. Right, @yuval?
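
One thing to keep in mind: `value_counts` compares floats exactly, so values that are equal in principle can be split by rounding noise. A minimal sketch of a more tolerant count is below. It assumes a hypothetical tolerance of 9 decimal places and uses an evenly spaced `t` instead of random samples so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Test helix evaluated with its own (assumed) parameters.
R, w, v, theta0, D, z0 = 2000.0, 3 * np.pi, 3000.0, np.pi / 3, 500.0, 100.0
t = np.linspace(0.01, 1.0, 100)
x = -R * np.sin(w * t + theta0) + (R - D) * np.sin(theta0)
y = -R * np.cos(w * t + theta0) + (R - D) * np.cos(theta0)
z = z0 + v * t

arg = np.clip((y * np.sin(theta0) - x * np.cos(theta0)) / R, -1.0, 1.0)
w_div_v = pd.Series(np.arcsin(arg) / (z - z0))

# Round first, then count: nearly equal floats fall into the same bucket.
counts = w_div_v.round(9).value_counts()
mode, rounded_top = counts.index[0], counts.iloc[0]
print("most common value:", mode, "count:", rounded_top)
print("true w / v:", w / v)
```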