# Tiny Training Trial
This notebook supports the [Tiny Training Trial](https://w.amazon.com/bin/view/MLSciences/Community/ML_Challenge/TinyTraining/). The associated Leaderboard is available [here](https://leaderboard.corp.amazon.com/tasks/292).

## Loading the Data into Eider:
Several people asked about the best way to import the data into Eider, so here is a snippet that loads the dataset directly from S3 and takes a first look at it. Using this method is highly recommended, since it avoids a needless local import.

First, make sure your credentials are set to ```ml-eider-shared-1```, then load the data.

In [0]:
### Download Training Data and Test Features ###

import pandas as pd
from sklearn.utils import shuffle

eider.s3.download('s3://eider-datasets/mlu/projects/DontOverFitChallenge/TTT_train.csv', '/tmp/TTT_train.csv')
eider.s3.download('s3://eider-datasets/mlu/projects/DontOverFitChallenge/TTT_test_features.csv', '/tmp/TTT_test_features.csv')

train = pd.read_csv('/tmp/TTT_train.csv')
test = pd.read_csv('/tmp/TTT_test_features.csv')

train_data = train
unlabeled_data = test

train_data = shuffle(train_data, random_state=11)

### Let's See What We Are Up Against ###
pd.options.display.max_columns = None
# train.describe()
# train_described = train.copy()
# import numpy as np
# train_described = train_described.replace(0, np.NaN)
# train_described.describe().iloc[[0]]

# Separate features and labels from the training set
labels = train_data[['label']].copy()
features = train_data.loc[:, train_data.columns != 'label'].copy()

# Separate ids and features from the unlabeled set
unlabeled_ids = unlabeled_data[['ID']].copy()
unlabeled_features = unlabeled_data.loc[:, unlabeled_data.columns != 'ID'].copy()

# Removing the collisions from the training set

Let's remove from the training set ($T_{old}$) all "very similar" documents that carry **different labels**.     
Two documents are "very similar" if the cosine similarity of their TF-IDF vectors is high **and** the Euclidean distance between them is small.

$T_{new} = T_{old} \setminus T_{collisions}$    
      
where $T_{collisions} \subset T_{old}$ contains every $x_i$ for which some $x_j \in T_{old}$ satisfies all of the following:
- $\left\Vert x_i - x_j \right\Vert < 1$
- ${x_i \cdot x_j^T \over \left\Vert x_i \right\Vert \cdot \left\Vert x_j \right\Vert} > 0.85$
- $label(x_i) \neq label(x_j)$
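
The three conditions above can also be checked without an explicit double loop. The following is a minimal sketch on toy data, not the notebook's own implementation; the helper name `find_collisions` and the toy arrays are hypothetical, and the thresholds are taken from the definition above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def find_collisions(X, y, cos_threshold=0.85, dist_threshold=1.0):
    """Return sorted row indices that take part in at least one collision."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise "very similar" mask: high cosine similarity AND small distance
    close = (cosine_similarity(X) > cos_threshold) & (euclidean_distances(X) < dist_threshold)
    # Pairwise "different label" mask
    different = y[:, None] != y[None, :]
    # Keep the strict upper triangle so each unordered pair is visited once
    rows, cols = np.where(np.triu(close & different, k=1))
    return sorted(set(rows.tolist()) | set(cols.tolist()))

# Toy data (hypothetical): rows 0 and 1 are nearly identical but labeled differently
X_toy = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 0.0, 5.0],
                  [0.0, 4.0, 0.0]])
y_toy = np.array([0, 1, 1, 0])
collisions = find_collisions(X_toy, y_toy)
print(collisions)
```

Returning a de-duplicated index list means a row that collides with several neighbours is still dropped only once.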

In [0]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

cos_sim = cosine_similarity(features, features)
euc_dists = euclidean_distances(features, features)

from collections import defaultdict

lab = labels.values.ravel()
cnt = 0

removed = defaultdict(lambda: 0)
removed_indices = []

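# Iterate over every unordered pair of rows; the size is taken from the data instead of being hard-coded
n_rows = len(lab)
for i in range(n_rows - 1):
    for j in range(i + 1, n_rows):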
        if cos_sim[i][j] > 0.85 and euc_dists[i][j] < 1 and lab[i] != lab[j]:
            print('{} <==> {} ({} <=> {})'.format(i, j, lab[i], lab[j]))
            cnt += 1
            removed[lab[i]] += 1
            removed[lab[j]] += 1
            removed_indices.append(i)
            removed_indices.append(j)

print(cnt)
print(removed)
print(sum(removed.values()))
print(sorted(removed_indices))

features1 = features.copy()
label1 = labels.copy()

features1.drop(features1.index[removed_indices], inplace=True)
label1.drop(label1.index[removed_indices], inplace=True)
print(features1.shape)
print(features.shape)

print(label1.shape)
print(labels.shape)

labels = label1
features = features1

train_set_size = labels.shape[0]
print(train_set_size)

# Creating the matrix $X$, which contains the features of all documents (classified and unclassified)

The matrix $X$ has 24'863 rows and 1'256 columns.

In [0]:
# Append all features together (from the training set and from the unlabeled set)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
X = pd.concat([features, unlabeled_features])

print(features.shape)
print(unlabeled_features.shape)
print(X.shape)

# PCA visualization of the training set

In [0]:
# Visualize the features of the training set in 2D using PCA (fitted on all rows of X)

def pca_2d(X):
    
    import numpy as np
    X = np.array(X)
    
    features = X[:train_set_size,:]
    print(X.shape)
    print(features.shape)
    
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import random
    
    random.seed(17)
    colors = [(random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)) for _ in range(0, 10)]
    
    X2 = features.copy()
    y = labels.values.ravel()
    
    # lda = LinearDiscriminantAnalysis(n_components=2)
    # X_r2 = lda.fit(X2, y).transform(X2)
    pca = PCA(n_components=2)
    pca.fit(X)
    X_r2 = pca.transform(X2)
    
    plt.figure()
    for color, i in zip(colors, range(0, 10)):
        plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.2, color=color)
    
    plt.show()
    
# Visualize the training set (with the extracted features) in 3D using PCA

def pca_3d(X, elev=60, azim=40):
    
    import numpy as np
    X = np.array(X)
    
    features = X[:train_set_size,:]
    print(X.shape)
    print(features.shape)
    
    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.decomposition import PCA
    
    np.random.seed(1)
    
    centers = [[1, 1], [-1, -1], [1, -1]]
    
    X3 = features.copy()
    y = labels.values.ravel()
    
    fig = plt.figure(1, figsize=(7, 7))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, 1, 1], elev=elev, azim=azim)
    plt.cla()
    
    # lda = LinearDiscriminantAnalysis(n_components=3)
    # X3 = lda.fit(X3, y).transform(X3)
    pca = PCA(n_components=3)
    pca.fit(X)
    X3 = pca.transform(X3)
    
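    y = np.choose(y, range(0, 10)).astype(float)  # np.float was removed in NumPy 1.24
    ax.scatter(X3[:, 0], X3[:, 1], X3[:, 2], c=y, cmap=plt.cm.nipy_spectral, edgecolor='none') #, alpha=.2)
    
    # Hide the tick labels (the w_xaxis/w_yaxis/w_zaxis aliases are gone from recent Matplotlib releases)
    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])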
    
    plt.show()
    
pca_2d(X)
pca_3d(X, elev=40)

# Binarization of the feature vectors

Almost all documents have fewer than 10 words (many of them have 1-5 words), so we keep only whether a word is present in a document.
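
As a small illustration of what binarization does, here is a sketch on a hypothetical 2x4 matrix, assuming the feature values are non-negative token weights (the array `counts` is made up for the example):

```python
import numpy as np

# Hypothetical token weights for two short documents
counts = np.array([[0.0, 2.0, 0.0, 1.0],
                   [3.0, 0.0, 0.0, 0.0]])

# Any positive weight becomes 1, zeros stay 0
binary = (counts > 0).astype(int)

# After binarization the row sum is the number of distinct tokens per document
words_per_doc = binary.sum(axis=1)
print(binary)
print(words_per_doc)
```

The row sums of the binarized matrix give a quick way to check the document-length claim on the real data.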

In [0]:
# Binarize
import numpy as np 
X1 = np.array(X.copy())
X1[X1 > 0] = 1

pca_2d(X1)
pca_3d(X1, elev=40)

X = X1

# Finding the structure in the data

      
The matrix $X$ has 24'863 rows and 1'256 columns.     
Let's apply a **lossy compression** to $X$ in order to reveal the presence of internal structure in the data.     

We apply the lossy compression to all feature vectors (labeled and unlabeled).       
Afterwards we will check whether the detected internal structure correlates with the labels from the training set.

# Non-negative matrix factorization     

Let's approximate $X$ as a product of two low-dimensional matrices: $X \approx W \times H$    

Where:     
- the matrix $W$ has 24'863 rows and 40 columns
- the matrix $H$ has 40 rows and 1'256 columns

Let's solve the following optimization problem:    
- $min \left\Vert X - W H \right\Vert_F$     
- subject to: $W \geq 0$, $H \geq 0$     
   
Where $\left\Vert A \right\Vert_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$ is the Frobenius norm of an $m \times n$ matrix $A$.
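
To make the objective concrete, the sketch below factorizes a small random non-negative matrix and evaluates $\left\Vert X - W H \right\Vert_F$ directly. The data and the names `X_toy`, `W_toy`, `H_toy` are hypothetical, and the regularization settings of the real model are deliberately left out.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X_toy = rng.rand(20, 8)  # hypothetical non-negative data: 20 rows, 8 columns

nmf = NMF(n_components=3, init='nndsvd', random_state=1, max_iter=500)
W_toy = nmf.fit_transform(X_toy)  # shape (20, 3)
H_toy = nmf.components_           # shape (3, 8)

# Frobenius norm of the residual, i.e. the quantity being minimized
frob_error = np.linalg.norm(X_toy - W_toy @ H_toy, ord='fro')
print(W_toy.shape, H_toy.shape, frob_error)
```

Comparing `frob_error` with `np.linalg.norm(X_toy, 'fro')` shows how much of the matrix the low-rank product retains.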

In [0]:
# Perform the matrix factorization with 40 hidden features
from sklearn.decomposition import NMF
model = NMF(n_components=40, init='nndsvd', random_state=1, # verbose=2, 
            alpha=0.1, l1_ratio=0.5,
            beta_loss='frobenius', solver='cd', max_iter=500, tol=0.000001)

W = model.fit_transform(X)
H = model.components_

W_orig = W.copy()

# Print the shape of the documents matrix
print(W.shape)

# Extract from the documents matrix - the items, which correspond to the training set
features_2 = W[:train_set_size,:]
print(features_2.shape)

# NMF results visualization

Given the approximation: $X \approx W \times H$

For every 1'256-dimensional row-vector from the matrix $X$ - there is a corresponding 40-dimensional row-vector from the matrix $W$.

Let's visualize the row-vectors of $W$, which correspond to the vectors of the training set.    
Also, let's group the vectors, which have the same labels.    

In [0]:
# Visualize (Documents x Hidden features) matrix (where documents are grouped by the class)
# With 40 hidden features

def visuzlize_latent_features(W):
    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    
    fig = plt.figure()
    fig.set_size_inches(5, 12)
    ax = fig.add_subplot(111)
    
    max_for_all_W = np.amax(W)
    
    features_2_head = W[:train_set_size,:].copy()
    
    tuples = []
    for row_label, row, lab in zip(range(features_2_head.shape[0]), features_2_head, labels.values.ravel()):
        tuples.append((row, lab))
    
    # tuples.sort(key=lambda x: ' '.join(str(i) for i in x[1]))
    tuples.sort(key=lambda x: x[1])
    
    width = 20
    prev_lab = None
    row_num = 0
    for row, lab in tuples:
        if lab != prev_lab:
            col_num = 0
            for col in row:
                rect1 = matplotlib.patches.Rectangle(
                    (col_num * width, row_num * width), 
                    width, width * 2, 
                    color=(1, 0, 0, 1))
                ax.add_patch(rect1)
                col_num += 1
            row_num += 2
        # Normalization of the row
        # norm = 0
        # for col in row:
        #     norm += col
        norm = max_for_all_W
        col_num = 0
        for col in row:
            rect1 = matplotlib.patches.Rectangle(
                (col_num * width, row_num * width), 
                width, width, 
                color=(0, 0, 1, 1.0 / norm * col))
            ax.add_patch(rect1)
            col_num += 1
        prev_lab = lab
        row_num += 1
        if row_num % 100 == 0:
            print(row_num)
    
    plt.xlim([0, features_2_head.shape[1] * width])
    plt.ylim([0, (features_2_head.shape[0] + 20) * width])
    plt.show()
    
visuzlize_latent_features(W)

# For every* class of documents we can observe the prevalent latent features

\* **Except the class 0**
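
One way to back this visual impression with numbers is to average the latent-feature rows per class and look at the largest entries. This is only a sketch on made-up data; `dominant_features`, `W_demo` and `y_demo` are hypothetical names, not part of the notebook.

```python
import numpy as np

def dominant_features(W_train, y, top_k=2):
    """For each class, return the indices of the latent features with the largest mean activation."""
    W_train = np.asarray(W_train, dtype=float)
    y = np.asarray(y)
    result = {}
    for cls_label in np.unique(y):
        mean_activation = W_train[y == cls_label].mean(axis=0)
        # argsort is ascending, so reverse it to list the strongest features first
        result[int(cls_label)] = np.argsort(mean_activation)[::-1][:top_k].tolist()
    return result

# Hypothetical latent features for four documents from two classes
W_demo = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.0, 0.1],
                   [0.0, 0.2, 0.7],
                   [0.1, 0.0, 0.9]])
y_demo = np.array([0, 0, 1, 1])
top = dominant_features(W_demo, y_demo, top_k=1)
print(top)
```

On the real data, the same call could be applied to `W[:train_set_size, :]` and `labels.values.ravel()`.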

# Additional visualization + K-Means clustering of similar vectors of the latent features

In [0]:
# Visualize (Documents x Hidden features) matrix (where documents are grouped by the class) 
# and cluster the documents within the same class according to the hidden feature vectors.
# With 40 hidden features

def cluster_and_visuzlize_latent_features(W):
    
    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.preprocessing import normalize, scale
    import math
    
    max_for_all_W = np.amax(W)
    
    features_2_head = W[:train_set_size,:].copy()
    
    tuples = []
    
    lab_to_row = dict()
    for row, lab in zip(features_2_head, labels.values.ravel()):
        if lab not in lab_to_row:
            lab_to_row[lab] = list()
        lab_to_row[lab].append(row)
        
    centroids = dict() # dict[label_class, list[tuple[vector, count]]]
    lab_to_clust_cnt = {
        0:3,
        1:3,
        2:3,
        3:3,
        4:3,
        5:3,
        6:3,
        7:3,
        8:3,
        9:3
    }
    for lab in sorted(lab_to_row.keys()):
        
        centroids[lab] = list()
        
        arr = np.array(lab_to_row[lab])
        
        arr = normalize(arr.copy(), axis=1, norm='l2')
        clust = KMeans(n_clusters=lab_to_clust_cnt[lab], random_state=0, max_iter=500).fit(arr)
        # print("Label: {}".format(lab))
        center_to_assigned_points = list()
        cluster_indices_count = dict()
        for c, cluster_index in zip(clust.cluster_centers_, range(lab_to_clust_cnt[lab])):
            # center_vector_length = math.sqrt(sum([x**2 for x in c]))
            cluster_index_count = sum([1 if cluster_index == ll else 0 for ll in clust.labels_])
            cluster_indices_count[cluster_index] = cluster_index_count
            center_to_assigned_points.append((
                ["{:1.5f}".format(x) for x in c], 
                cluster_index_count))
            centroids[lab].append((c, cluster_index_count))
        center_to_assigned_points.sort(key=lambda x: -x[1])
        print("{}: [".format(lab))
        for center, points_cnt in center_to_assigned_points:
            # print("[{}] \t {}".format(', '.join(center), points_cnt))
            print("([{}], {}),".format(', '.join(center), points_cnt))
        print("],")
        print()
        
        # clust = AgglomerativeClustering(n_clusters=5, affinity='cosine', linkage='average').fit(arr)
        rows_and_cluster_indices = list(zip(lab_to_row[lab], clust.labels_))
        # rows_and_cluster_indices.sort(key=lambda x: x[1])
        rows_and_cluster_indices.sort(key=lambda x: -cluster_indices_count[x[1]])
        sorted_rows = [(row, lab) for row, _ in rows_and_cluster_indices]
        tuples.extend(sorted_rows)
    
    fig = plt.figure()
    fig.set_size_inches(5, 12)
    ax = fig.add_subplot(111)
        
    width = 20
    prev_lab = None
    row_num = 0
    for row, lab in tuples:
        if lab != prev_lab:
            col_num = 0
            for col in row:
                rect1 = matplotlib.patches.Rectangle(
                    (col_num * width, row_num * width), 
                    width, width * 2, 
                    color=(1, 0, 0, 1))
                ax.add_patch(rect1)
                col_num += 1
            row_num += 2
        # Normalization of the row
        # norm = 0
        # for col in row:
        #     norm += col**2
        norm = max_for_all_W
        col_num = 0
        for col in row:
            rect1 = matplotlib.patches.Rectangle(
                (col_num * width, row_num * width), 
                width, width, 
                color=(0, 0, 1, 1.0 / norm * col))
            ax.add_patch(rect1)
            col_num += 1
        prev_lab = lab
        row_num += 1
        if row_num % 100 == 0:
            print(row_num)
    
    plt.xlim([0, features_2_head.shape[1] * width])
    plt.ylim([0, (features_2_head.shape[0] + 20) * width])
    plt.show()
    
    # Collisions
    print("\n Collisions of centroids: \n")
    from sklearn.metrics.pairwise import cosine_similarity
    for label1 in centroids.keys():
        if label1 == 0:
            continue
        for label2 in centroids.keys():
            if label2 <= label1:
                continue
            for centroid_1, cnt_1 in centroids[label1]:
                for centroid_2, cnt_2 in centroids[label2]:    
                    sim = cosine_similarity([centroid_1], [centroid_2])[0][0]
                    if sim > 0.7:
                        print("{} ({}) and {} ({}) ==> {} ==> {} and {}".format(label1, cnt_1, label2, cnt_2, sim, centroid_1, centroid_2))
                        
cluster_and_visuzlize_latent_features(W)

# Visualization of the similarity of the latent features ("topics")

"Association" (similarity) between the latent features: $S = W^T W$    
The matrix $S$ has 40 rows and 40 columns.
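
A toy example may help to read $S$: entry $S_{ij}$ is the dot product of columns $i$ and $j$ of $W$, so it is large when two latent features are active in the same documents. The sketch below uses a hypothetical 3x3 matrix and also shows a cosine-normalized variant, which is an addition for illustration and not something the notebook computes.

```python
import numpy as np

# Hypothetical W: 3 documents x 3 latent features
W_demo = np.array([[1.0, 0.0, 0.5],
                   [0.0, 2.0, 0.0],
                   [1.0, 0.0, 0.5]])

S = W_demo.T @ W_demo  # (features x features), symmetric

# Optional: divide by the column norms to get cosine similarities in [0, 1]
norms = np.sqrt(np.diag(S))
S_cos = S / np.outer(norms, norms)
print(S)
print(S_cos)
```

Here features 0 and 2 always fire together, so their cosine similarity is 1, while feature 1 is unrelated to both.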

In [0]:
import matplotlib.pyplot as plt
import numpy as np

def display_matrices(W, H):

    # features_2_head = W[:train_set_size,:].copy()
    
    # tuples = []
    # for row_label, row, lab in zip(range(features_2_head.shape[0]), features_2_head, labels.values.ravel()):
    #     tuples.append((row, lab))
    
    # # tuples.sort(key=lambda x: ' '.join(str(i) for i in x[1]))
    # tuples.sort(key=lambda x: x[1])
    
    # # 2D matrix of documents
    # rows_sorted_by_label = []
    # for row, lab in tuples:
    #     rows_sorted_by_label.append(row)
        
    # dim = len(rows_sorted_by_label)
        
    # rows_sorted_by_label = np.array(rows_sorted_by_label)  
    
    # from sklearn.preprocessing import normalize
    # # rows_sorted_by_label = normalize(rows_sorted_by_label, norm='l2', axis=1)
    
    # displayed = rows_sorted_by_label.dot(rows_sorted_by_label.T)
    
    # plt.figure(figsize=(10,10))
    # plt.imshow(displayed, interpolation='none', alpha=1)
    # plt.colorbar()
    # plt.title("Association of Documents")
    # plt.xlabel('Document index')
    
    # # Display delimiters
    # prev_lab = None
    # row_idx = 0
    # for row, lab in tuples:
    #     if prev_lab is not None and lab != prev_lab:
    #         plt.plot([0,dim], [row_idx, row_idx], color='y',linewidth=0.5)
    #         plt.plot([row_idx, row_idx], [0,dim], color='y',linewidth=0.5)
    #     row_idx += 1
    #     prev_lab = lab
    
    # plt.show()
    
    # http://jpfairbanks.net/2017/07/15/email-topics-with-nmf/
    
    plt.figure(figsize=(7,7))
    plt.imshow(W.T.dot(W), interpolation='none')
    plt.colorbar()
    plt.title("Association of Topics (W'W)")
    plt.xlabel('Topic index')
    
    # plt.figure(figsize=(7,7))
    # plt.imshow(H.dot(H.T), interpolation='none')
    # plt.colorbar()
    # plt.title("Association of Topics (HH')")
    # plt.xlabel('Topic index')

W = W_orig.copy()
display_matrices(W, H)

# Appending the extracted latent features to the existing feature-vectors

$X_{new} = [ \  W_{norm} \ \  \   X \  ]$, where $W_{norm}$ is $W$ with every row scaled to unit L2 norm.
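
The column-wise concatenation can be sketched on toy data as follows. The arrays `X_demo` and `W_demo` are hypothetical; the row-wise L2 normalization and the column order (latent features first) follow the cell below.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical original (binarized) features: 2 documents x 4 tokens
X_demo = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
# Hypothetical latent features: 2 documents x 2 topics
W_demo = np.array([[3.0, 4.0],
                   [0.0, 2.0]])

# Scale every row of W to unit L2 norm so it is on a scale comparable to the 0/1 features
W_unit = normalize(W_demo, norm='l2', axis=1)

# Stack the columns side by side; the number of rows is unchanged
X_new = np.concatenate((W_unit, X_demo), axis=1)
print(X_new.shape)
```

Both inputs must have the same number of rows in the same document order, otherwise latent features would be attached to the wrong documents.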

In [0]:
# Append columns with extracted features to the columns with original features

import numpy as np

W = W_orig.copy()

from sklearn.preprocessing import normalize
W = normalize(W, norm='l2', axis=1)

# Print the shape of the documents matrix
print("Extracted features (labeled + unlabeled) matrix dimension: {}".format(W.shape))
print("Original features (labeled + unlabeled) matrix dimension: {}".format(X.shape))

# W = np.concatenate((W_orig.copy(), X), axis=1)

W = np.concatenate((W, X), axis=1)

print("Concatenated (extracted + original) features matrix dimension: {}".format(W.shape))
# Extract from the documents matrix - the items, which correspond to the training set
features_2 = W[:train_set_size,:]
print("New labeled features dimension: {}".format(features_2.shape))

# PCA visualization of $X_{new}$

In [0]:
pca_2d(W)
pca_3d(W, elev=40)

# Defining the ensemble of classifiers

In [0]:
# Define the soft-voting ensemble (linear SVM + random forest)

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn import svm
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.semi_supervised import LabelPropagation
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import RobustScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

cls = Pipeline([
  ('cls', VotingClassifier(estimators=[
        ('svm', svm.SVC(random_state=11, 
                        decision_function_shape='ovr',
                        kernel='linear',
                        class_weight='balanced',
                        C=1,
                        cache_size=300,
                        probability=True)), 
        ('rf', RandomForestClassifier(random_state=11, 
                                      bootstrap=True,
                                      max_depth=None,
                                      max_features='log2',
                                      min_samples_leaf=1,
                                      n_estimators=1500,
                                      class_weight='balanced',
                                      criterion='gini',
                                      n_jobs=8)),
    ], 
    voting='soft',
    ))
])
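
With `voting='soft'`, the ensemble averages the class probabilities of its members and predicts the class with the highest average. The sketch below reproduces that by hand on synthetic data; the dataset, the reduced `n_estimators` and the variable names are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the real features
X_demo, y_demo = make_classification(n_samples=120, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

svm_clf = SVC(kernel='linear', probability=True, random_state=11)
rf_clf = RandomForestClassifier(n_estimators=50, random_state=11)
ensemble = VotingClassifier(estimators=[('svm', svm_clf), ('rf', rf_clf)], voting='soft')
ensemble.fit(X_demo, y_demo)

# Soft voting by hand: average the per-class probabilities of the fitted members
member_probas = [est.predict_proba(X_demo) for est in ensemble.estimators_]
avg_proba = np.mean(member_probas, axis=0)
manual_pred = ensemble.classes_[np.argmax(avg_proba, axis=1)]
print((manual_pred == ensemble.predict(X_demo)).all())
```

This is also why the SVM needs `probability=True`: without it, `SVC` exposes no `predict_proba` for the ensemble to average.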

# Cross-validation

In [0]:
# Split labeled features into the training and validation set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_2, labels, test_size=0.4, random_state=109)
X_train.shape

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn import svm
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.semi_supervised import LabelSpreading

# Cross-validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(cls, features_2, labels.values.ravel(), cv=5, n_jobs=5, verbose=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

cls.fit(X_train, y_train.values.ravel())

#Predict the response for test dataset
y_pred = cls.predict(X_test)

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Training the classifier on 60% of the training set and evaluation on the rest 40%

In [0]:
# Confusion matrix
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

def plot_confusion_matrix(y_true, y_pred,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = unique_labels(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    fig.set_size_inches(8, 8)
    
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=2)

# Plot confusion matrices

plot_confusion_matrix(y_test, y_pred, normalize=False,
                      title='Confusion matrix, without normalization')

plot_confusion_matrix(y_test, y_pred, normalize=True,
                      title='Confusion matrix, with normalization')
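
For reference, the normalized variant divides every row of the confusion matrix by the number of true samples in that class, so the diagonal holds the per-class recall. A minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Divide each row by its sum (number of true samples of that class)
cm_norm = cm / cm.sum(axis=1, keepdims=True)

# The diagonal of the row-normalized matrix is the recall of each class
recall_per_class = np.diag(cm_norm)
print(cm)
print(recall_per_class)
```

A class with no true samples in the evaluation split would produce a division by zero here, which is worth keeping in mind with a small validation set.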

# Training the classifier on the whole training set

In [0]:
cls.fit(features_2, labels.values.ravel())

In [0]:
unlabeled_features = W[train_set_size:,:]
unlabeled_features.shape

# Producing the final results

In [0]:
classifier = cls

#Predict the response for test dataset
unlabeled_pred = classifier.predict(unlabeled_features)

results_path = '/tmp/submission_NB44R392BHPZ_demo.csv'
with open(results_path, 'w') as f:
    f.write('ID,label\n')
    processed = 0
    for row_id, row_pred in zip(unlabeled_ids['ID'], unlabeled_pred):
        f.write("{},{}\n".format(row_id, row_pred))
        processed += 1
        
print('Done!')

# Top-10 most relevant tokens for every topic

In [0]:
def top_words(topic, n_top_words):
    return topic.argsort()[:-n_top_words - 1:-1]
def topic_table(model, feature_names, n_top_words):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        t = ("topic_%d:" % topic_idx)
        topics[t] = [feature_names[i] for i in top_words(topic, n_top_words)]
    return pd.DataFrame(topics)

topic_table(model, features.columns.tolist(), 20).head(10)
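
The slice `argsort()[:-n_top_words - 1:-1]` used in `top_words` walks the ascending sort order backwards and stops after `n_top_words` entries. A small sketch with a hypothetical topic vector and vocabulary:

```python
import numpy as np

# Hypothetical weights of one topic over a five-token vocabulary
topic = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
vocab = ['a', 'b', 'c', 'd', 'e']
n_top = 3

# argsort is ascending; the reversed slice picks the n_top largest, strongest first
top_idx = topic.argsort()[:-n_top - 1:-1]
top_tokens = [vocab[i] for i in top_idx]
print(top_idx, top_tokens)
```

The result is ordered from the most to the least relevant token, which is why `head(10)` on the table keeps the ten strongest tokens per topic.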

## Getting our model output out of Eider and into Leaderboard
Great. Now we have a dummy sample submission in Eider, which we need to export locally and then upload to Leaderboard. To export it:
1. Within the Eider console top bar, select [Files](https://eider.corp.amazon.com/file)
2. You should now see 'Files', 'TMP' and 'Exported notebooks' tabs. 
3. Select 'TMP' then select 'Connect to workspace'. You should now see any files from your last run of your workspace. If there was no 'Connect to workspace' option, your files from the last run should already be present. *Files in the 'TMP' should be considered temporary as they will expire after an hour's worth of idle time.*
4. Go to the submission file written by the cell above (```submission_NB44R392BHPZ_demo.csv```) and select Save
5. This file will now be permanently saved to your Eider account and available for local download.
6. Go to the 'Files' tab, and click 'download' to save it to your local machine.

We now have our model's output .csv and are ready to upload it to Leaderboard:
1. Search for your [Leaderboard instance](https://leaderboard.corp.amazon.com/tasks/292) and go to the 'Make a Submission' section
2. Upload your local file and include your notebook version URL for tracking.
3. Your score on the public leaderboard should now appear. 
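
Before uploading, it can be worth sanity-checking the file locally. The sketch below assumes that Leaderboard expects the `ID,label` header that the notebook writes; the helper `validate_submission` and the demo file are hypothetical.

```python
import csv
import os
import tempfile

def validate_submission(path, expected_rows=None):
    """Return a list of problems found in a submission CSV (empty list means it looks fine)."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    problems = []
    if header != ['ID', 'label']:
        problems.append('unexpected header: {}'.format(header))
    ids = [row[0] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append('duplicate IDs')
    if expected_rows is not None and len(rows) != expected_rows:
        problems.append('expected {} rows, found {}'.format(expected_rows, len(rows)))
    return problems

# Demo on a throwaway file (hypothetical content)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_submission.csv')
with open(demo_path, 'w') as f:
    f.write('ID,label\n1,3\n2,7\n')
issues = validate_submission(demo_path, expected_rows=2)
print(issues)
```

For the real file, `expected_rows` could be set to the number of rows in the unlabeled set.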

The private leaderboard contains the vast majority of the data, and so your final rankings in this competition will be a bit of a surprise! Take care and avoid overfitting!