### TODO List

* [ ] Clustering analysis on test dataset to take advantage of those observations.
* [ ] Impute missing incidence angle information.

This specific competition seems a perfect use case for CNNs/image detection. But right now I am working on my feature engineering capabilities and am improving my understanding of logistic regression, so I want to see what I can accomplish with logistic regression alone. Let's begin.

## Data Exploration

In [1]:
# import some packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [6]:
# let's load in the data and look around
df_train = pd.read_json('../input/train.json')
df_train.info()

It looks like we have 1604 labeled training images. The only features are the HH and HV bands, the image id, the incidence angle (note: it has missing values, which is why it shows up as an object instead of a numeric column), and the label saying whether or not the image contains an iceberg.
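
To illustrate why the incidence angle shows up as an object column, here is a minimal sketch on made-up data. It assumes the missing values are encoded as the string `'na'`; the `toy` frame and the `inc_angle_filled` column are hypothetical. The median fill at the end is only one possible way to tackle the imputation item on the TODO list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the inc_angle column; the string 'na' marking a
# missing value is an assumption about how the raw JSON encodes it.
toy = pd.DataFrame({'inc_angle': [43.92, 'na', 38.16, 'na', 45.29]})
print(toy['inc_angle'].dtype)  # mixed floats and strings give an object column

# Coerce to float; anything that cannot be parsed becomes NaN
toy['inc_angle'] = pd.to_numeric(toy['inc_angle'], errors='coerce')
n_missing = int(toy['inc_angle'].isna().sum())
print(n_missing, 'missing values')

# One simple (hypothetical) imputation: fill the gaps with the column median
toy['inc_angle_filled'] = toy['inc_angle'].fillna(toy['inc_angle'].median())
```

A median fill ignores any relationship between the angle and the image content, so it is best treated as a baseline.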

Now let's look at the test dataset as well.

In [7]:
df_test = pd.read_json('../input/test.json')
df_test.info()

Roughly five times as many test images! With this ratio, some unsupervised clustering on the test dataset could well help improve our model. I'll add that to the TODO list.
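
As a rough idea of what that clustering step might look like, the sketch below fits k-means on the combined (train plus test) features and appends the cluster id as an extra column. Everything here is an assumption for illustration: the data are synthetic blobs, and `X_all`, `cluster_id`, and `X_aug` are hypothetical names. Whether this helps on the real data is untested.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic features: two well-separated blobs standing in for the
# per-image summary statistics of the train and test images together.
X_all = np.vstack([
    rng.normal(0, 1, (100, 3)),
    rng.normal(6, 1, (100, 3)),
])

# Fit k-means without using any labels, so test images can take part
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_id = km.fit_predict(X_all)

# Append the cluster id as one more feature column
X_aug = np.column_stack([X_all, cluster_id])
print(X_aug.shape)
```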

There is no lat/lon information, time of year, or other metadata that might have proved useful. All of the features to engineer will have to come from the radar echoes.

Thanks to [MuonNeutrino](https://www.kaggle.com/muonneutrino/exploration-transforming-images-in-python), an excellent starting point is investigating the statistical properties of the two separate bands.

In [11]:
# coerce incidence angle to numeric
df_train['inc_angle'] = pd.to_numeric(df_train['inc_angle'], errors='coerce')

# combine training and test set for feature engineering
df_full = pd.concat([df_train, df_test], axis=0, ignore_index=True)

In [14]:
def get_stats(df, label=1):
    df['max'+str(label)] = [np.max(np.array(x)) for x in df['band_'+str(label)] ]
    df['maxpos'+str(label)] = [np.argmax(np.array(x)) for x in df['band_'+str(label)] ]
    df['min'+str(label)] = [np.min(np.array(x)) for x in df['band_'+str(label)] ]
    df['minpos'+str(label)] = [np.argmin(np.array(x)) for x in df['band_'+str(label)] ]
    df['med'+str(label)] = [np.median(np.array(x)) for x in df['band_'+str(label)] ]
    df['std'+str(label)] = [np.std(np.array(x)) for x in df['band_'+str(label)] ]
    df['mean'+str(label)] = [np.mean(np.array(x)) for x in df['band_'+str(label)] ]
    df['p25_'+str(label)] = [np.sort(np.array(x))[int(0.25*75*75)] for x in df['band_'+str(label)] ]
    df['p75_'+str(label)] = [np.sort(np.array(x))[int(0.75*75*75)] for x in df['band_'+str(label)] ]
    df['mid50_'+str(label)] = df['p75_'+str(label)]-df['p25_'+str(label)]

    return df

df_full = get_stats(df_full, 1)
df_full = get_stats(df_full, 2)
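
The per-image statistics above can also be computed in one pass by stacking the band into a 2-D array. This is a sketch on synthetic data, assuming every image has the same number of pixels; `get_stats_vectorized` and `toy` are hypothetical names. Note that `np.percentile` interpolates between neighbouring values, so its quartiles can differ slightly from the sort-and-index version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Three fake 75x75 images, flattened to 5625 values like band_1 / band_2
toy = pd.DataFrame(
    {'band_1': [rng.normal(-20, 5, 75 * 75).tolist() for _ in range(3)]}
)

def get_stats_vectorized(df, label=1):
    """Return summary statistics for one band, one row per image."""
    s = str(label)
    band = np.array(df['band_' + s].tolist())  # shape (n_images, n_pixels)
    out = pd.DataFrame(index=df.index)
    out['max' + s] = band.max(axis=1)
    out['min' + s] = band.min(axis=1)
    out['med' + s] = np.median(band, axis=1)
    out['std' + s] = band.std(axis=1)
    out['mean' + s] = band.mean(axis=1)
    # Both quartiles in one call; the result has shape (2, n_images)
    p25, p75 = np.percentile(band, [25, 75], axis=1)
    out['p25_' + s] = p25
    out['p75_' + s] = p75
    out['mid50_' + s] = p75 - p25  # interquartile range
    return out

stats = get_stats_vectorized(toy, 1)
print(stats.round(2))
```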

In [19]:
def plot_var(name, nbins=50):
    minval = df_full[name].min()
    maxval = df_full[name].max()
    # is_iceberg == 1 marks icebergs and 0 marks boats, so label accordingly
    plt.hist(df_full.loc[df_full.is_iceberg==1,name],range=[minval,maxval],
             bins=nbins,color='b',alpha=0.5,label='Iceberg')
    plt.hist(df_full.loc[df_full.is_iceberg==0,name],range=[minval,maxval],
             bins=nbins,color='r',alpha=0.5,label='Boat')
    plt.legend()
    plt.xlim([minval, maxval])
    plt.xlabel(name)
    plt.ylabel('Number')
    plt.show()
    
for col in ['inc_angle','min1','max1','std1','med1','mean1','mid50_1', 'p25_1', 'p75_1', 'minpos1', 'maxpos1']:
    plot_var(col)

It looks like the minimum, maximum, median, and mean reflectivities are useful features, particularly the maximum reflectivity! This makes intuitive sense: icebergs and boats scatter different amounts of energy back to the radar. The first and third quartile features are useful as well.

Now let's look at the second band.

In [20]:
for col in ['min2','max2','std2','med2','mean2','mid50_2','p25_2', 'p75_2']:
    plot_var(col)

Interesting. The minimum reflectivity doesn't appear useful at all for band_2. However, the maximum reflectivity is! Additionally, images whose band_2 standard deviation is over 3 fall almost entirely into a single class.

So far, it seems that the most useful features are:
* min1, max1, std1, med1, mean1, p25_1, p75_1 (from band_1)
* max2, std2 (from band_2)

Now let's look at the correlation between these features:

In [21]:
df_full_stats = df_full.drop(['id','is_iceberg','band_1','band_2', 'inc_angle'], axis=1)
corr = df_full_stats.corr()
fig = plt.figure(1, figsize=(10,10))
plt.imshow(corr,cmap='inferno')
labels = np.arange(len(df_full_stats.columns))
plt.xticks(labels,df_full_stats.columns,rotation=90)
plt.yticks(labels,df_full_stats.columns)
plt.title('Correlation Matrix of Global Variables')
cbar = plt.colorbar(shrink=0.85,pad=0.02)
plt.show()

Unfortunately, max1 and max2 are highly correlated, so including both might not add much information. The minimums aren't strongly correlated with the maximums in either band, however, so each might be useful.
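
Reading correlated pairs off a heatmap is easy to get wrong, so it can help to list them explicitly. The sketch below does this on synthetic data; the column values are made up, and the 0.9 cutoff is an arbitrary assumption rather than anything taken from the analysis above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic features: max2 is nearly a copy of max1, min1 is independent
toy = pd.DataFrame({
    'max1': a,
    'max2': a + rng.normal(scale=0.1, size=200),
    'min1': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair
# appears once and self-correlations are excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)

# Pairs above the (assumed) cutoff are candidates for dropping one member
high = pairs[pairs > 0.9]
print(high)
```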

Going back to the value of the maximum reflectivity, let's see if the number of pixels above some threshold could be useful.

In [43]:
np_test = np.array(df_full.loc[1, 'band_1'])

In [45]:
len(np_test[np_test > -10])

In [59]:
def get_threshold_size(df, label=1, threshold=-15):
    # count the pixels in each image of the passed-in frame that exceed the threshold
    df['gt_'+str(threshold)+'_'+str(label)] = [int((np.array(x) > threshold).sum()) for x in df['band_'+str(label)]]
    
    return df

In [60]:
for lbl in range(1, 3):
    for thr in range(-15, 35, 5):
        get_threshold_size(df_full, label=lbl, threshold=thr)

In [61]:
for col in ['gt_-15_1', 'gt_-10_1', 'gt_-5_1', 'gt_0_1', 'gt_5_1', 'gt_10_1', 'gt_15_1',
            'gt_20_1', 'gt_25_1', 'gt_30_1']:
    plot_var(col)

Looks like the thresholds above 10 aren't useful for band 1.

In [63]:
df_full.drop(['gt_15_1', 'gt_20_1', 'gt_25_1', 'gt_30_1'], axis=1, inplace=True)

In [62]:
for col in ['gt_-15_2', 'gt_-10_2', 'gt_-5_2', 'gt_0_2', 'gt_5_2', 'gt_10_2', 'gt_15_2',
            'gt_20_2', 'gt_25_2', 'gt_30_2']:
    plot_var(col)

For band 2, the thresholds above -5 aren't useful.

In [64]:
df_full.drop(['gt_0_2', 'gt_5_2', 'gt_10_2', 'gt_15_2', 'gt_20_2', 'gt_25_2', 'gt_30_2'], 
             axis=1, inplace=True)

At this point, let's remind ourselves of all the columns we have and drop the features we have deemed not useful.

In [66]:
df_full.info()

In [67]:
df_full.drop(['inc_angle', 'maxpos1', 'minpos1', 'maxpos2', 'minpos2'], axis=1, inplace=True)

In [68]:
df_full.info()

## Logistic Regression

Now that we have a bunch of numeric features, let's fit a logistic regression model.

In [78]:
X = df_full.drop(['band_1', 'band_2', 'id', 'is_iceberg'], axis=1)[:1604]
y = df_full['is_iceberg'][:1604]

In [93]:
# sklearn imports
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report

# 3-fold split that preserves the class balance in each fold
skf = StratifiedKFold(n_splits=3)

In [94]:
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    logmodel = LogisticRegression()
    logmodel.fit(X_train, y_train)
    # hard labels for the report, probabilities for the log loss
    predictions = logmodel.predict(X_test)
    predictions_prob = logmodel.predict_proba(X_test)[:,1]
    print(classification_report(y_test, predictions))
    print(log_loss(y_test, predictions_prob))

Looks like we should expect a loss near 0.4. Let's submit it and see what happens!
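
For a sense of scale, the sketch below computes the binary log loss by hand on a few made-up predictions. `binary_log_loss` is a hypothetical helper, not part of sklearn, though it computes the same quantity as `log_loss` does for binary labels. Always predicting 0.5 scores ln 2 ≈ 0.693, so that is the number an uninformed model would get.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood of binary labels under probabilities p."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) can never occur
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_toy = [1, 0, 1, 0]

# Uninformed baseline: predict 0.5 for everything
uninformed = binary_log_loss(y_toy, [0.5, 0.5, 0.5, 0.5])

# Reasonably confident and correct predictions score much lower
confident = binary_log_loss(y_toy, [0.9, 0.1, 0.8, 0.3])

print(uninformed, confident)
```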

In [96]:
X_test = df_full.drop(['band_1', 'band_2', 'id', 'is_iceberg'], axis=1)[1604:]
logmodel = LogisticRegression()
logmodel.fit(X, y)
predictions_prob = logmodel.predict_proba(X_test)[:,1]

In [102]:
df_predictions = pd.DataFrame({'id' : df_full['id'][1604:], 'is_iceberg' : predictions_prob})
df_predictions.to_csv('logistic_regression_submission.csv', index=False)
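
Before uploading, a quick sanity check on the submission frame can catch silly mistakes. This is a minimal sketch on a made-up frame; `check_submission` is a hypothetical helper, and the expected columns are simply the two written out above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_predictions; ids and probabilities are made up
sub = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'is_iceberg': [0.02, 0.97, 0.5]})

def check_submission(df):
    """Raise AssertionError if the frame is not a plausible submission."""
    assert list(df.columns) == ['id', 'is_iceberg'], 'unexpected columns'
    assert df['id'].is_unique, 'duplicate ids'
    assert df['is_iceberg'].notna().all(), 'missing predictions'
    assert df['is_iceberg'].between(0, 1).all(), 'probabilities out of range'
    return True

ok = check_submission(sub)
print('submission looks sane:', ok)
```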