Removing all information about the model and brand gave me a score of **2.37**. That was clearly a very bad model, so let's see how to improve it.

In [1]:
import pandas as pd
import numpy as np

from datetime import datetime
from IPython.display import display, HTML, clear_output
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn import linear_model, svm, neighbors, ensemble, naive_bayes
%matplotlib inline

df_train = pd.DataFrame.from_csv("train.csv")

Explanation of the features:
 - `phone_brand` - phone brand
 - `device_model` - device model
 - `e_num` - number of events
 - `h_X` - how many events happened at hour X. X is from 0 to 23
 - `d_X` - how many events happened at day X. X is from 0 to 6
 - `lat/lng` - position of the device
 - `pos_n` - number of unique clusters of positions
 - `app_num_all` - total number of installed applications
 - `app_num_active` - number of active applications
 - `c_X` - number of categories of applications

In [2]:
print 'No events :', sum(df_train['e_num'] == 0)
print '1 event   :', sum(df_train['e_num'] == 1)
print '2 events  :', sum(df_train['e_num'] == 2)
print '3 events  :', sum(df_train['e_num'] == 3)
print '4 and more:', sum(df_train['e_num'] >  3)

No events : 51336
1 event   : 2375
2 events  : 1467
3 events  : 1076
4 and more: 18391


In [3]:
print "Num of brands:", df_train['phone_brand'].nunique()
print "Num of models:", df_train['device_model'].nunique()
print "Num of b&m   :", (df_train['phone_brand'] + ' ' + df_train['device_model']).nunique()

Num of brands: 104
Num of models: 1438
Num of b&m   : 1486


-----

The majority of the rows (51,336 of 74,645, about 69%) have no events, so they carry nothing except the brand, model, and label. The number of distinct brands and models is large (104 and 1438), and many of them appear only once or twice. A few things make sense:
 - train two models: one for the rows with no events and another for the rest of the data
 - mark infrequent brands/models as `other`. This reduces the feature space, but it is not clear which frequency to use as the threshold. My approach is to try various thresholds and pick the one that performs best.
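
The second idea can be sketched on a toy column. This is only an illustration: the helper name `collapse_rare` and the sample values are made up, and the code is written for a current pandas on Python 3 rather than the Python 2 setup used in the cells here.

```python
import pandas as pd

def collapse_rare(series, threshold, other_label="other"):
    """Replace values seen at most `threshold` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts <= threshold].index
    # Series.where keeps a value where the condition is True and
    # substitutes `other_label` where it is False.
    return series.where(~series.isin(rare), other_label)

# Hypothetical brand column: 'a' x3, 'b' x2, 'c' x1.
brands = pd.Series(["a", "a", "a", "b", "b", "c"])
collapsed = collapse_rare(brands, threshold=1)
print(collapsed.tolist())  # ['a', 'a', 'a', 'b', 'b', 'other']
```

Raising the threshold to 2 would also fold `b` into `other`, which is exactly the trade-off the threshold search has to settle.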
 
Here are a few helper functions that help me try different thresholds.

In [4]:
def score(X, Y, clf, num_classes):
    # usage: score(X, Y, ensemble.GradientBoostingClassifier(n_estimators=300), num_classes)
    pred = np.zeros((Y.shape[0], num_classes))
    for itrain, itest in StratifiedKFold(Y, n_folds=5, shuffle=True, random_state=0):
        Xtr, Xte = X[itrain, :], X[itest, :]
        ytr, yte = Y[itrain],    Y[itest]
        clf.fit(Xtr, ytr)
        pred[itest,:] = clf.predict_proba(Xte)

    return log_loss(Y, pred)
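
A note for anyone rerunning this on a newer scikit-learn: `sklearn.cross_validation` was removed in version 0.20, and `StratifiedKFold` now lives in `sklearn.model_selection`, taking the labels in `split()` instead of the constructor. Below is a rough equivalent of `score()` under that API. The name `oof_log_loss` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def oof_log_loss(X, y, clf, n_splits=5, seed=0):
    """Out-of-fold multi-class log loss, mirroring score() above."""
    classes = np.unique(y)
    pred = np.zeros((len(y), len(classes)))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        # Each row is predicted exactly once, by a model that never saw it.
        pred[test_idx] = clf.predict_proba(X[test_idx])
    return log_loss(y, pred, labels=classes), pred

# Synthetic stand-in data (not train.csv): 3 classes, 50 rows each.
rng = np.random.RandomState(0)
y_demo = np.repeat(np.arange(3), 50)
X_demo = rng.normal(size=(150, 4)) + y_demo[:, None]
loss, pred = oof_log_loss(X_demo, y_demo, LogisticRegression(max_iter=1000))
print(loss)
```

Passing `labels=classes` to `log_loss` keeps the column order explicit, which is the role `num_classes` plays in the original function.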

In [5]:
def encode(df, col):
    le = LabelEncoder()
    le.fit(df[col])
    df[col] = le.transform(df[col])
    return le

def get_modified_dataframe(pred_cat, brand_threshold, model_threshold):
    cols = ['phone_brand', 'device_model'] + [pred_cat]
    df = df_train[df_train['e_num'] == 0][cols].copy()

    infrequent = df['phone_brand'].value_counts()
    df.replace(infrequent[infrequent <= brand_threshold].index, 'other', inplace=True)

    infrequent = df['device_model'].value_counts()
    df.replace(infrequent[infrequent <= model_threshold].index, 'other', inplace=True)
    
    n_brands = df['phone_brand'].nunique()
    n_models = df['device_model'].nunique()
    
    le_brand = encode(df, 'phone_brand')
    le_model = encode(df, 'device_model')
    
    le = encode(df, pred_cat)
    
    dummies_brand = pd.get_dummies(df['phone_brand'],  prefix="brand")
    dummies_model = pd.get_dummies(df['device_model'], prefix="model")

    df = pd.concat([dummies_brand, dummies_model, df], axis=1)
    df.drop(['phone_brand', 'device_model'], inplace=True, axis=1)
    return df, le, n_brands, n_models
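
One thing to keep in mind about `df.replace(...)` in the function above: it is applied to the whole frame, so a value listed as an infrequent brand is also rewritten wherever it appears in other columns. I have not checked whether any brand and model names actually collide in `train.csv`, so the toy frame below is purely hypothetical. It only shows the difference between a frame-wide and a column-scoped replace.

```python
import pandas as pd

# Made-up data in which the string 'zed' is both a brand and a model name.
toy = pd.DataFrame({
    "phone_brand":  ["acme", "acme", "zed"],
    "device_model": ["zed",  "zed",  "m1"],
})
rare_brands = ["zed"]  # pretend 'zed' is an infrequent brand

# Frame-wide replace touches every column, so the model 'zed' changes too.
frame_wide = toy.replace(rare_brands, "other")

# Column-scoped replace only rewrites the brand column.
scoped = toy.copy()
scoped["phone_brand"] = scoped["phone_brand"].replace(rare_brands, "other")

print(frame_wide["device_model"].tolist())  # ['other', 'other', 'm1']
print(scoped["device_model"].tolist())      # ['zed', 'zed', 'm1']
```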

Now I will try various thresholds and pick the one that gives the best score.

In [6]:
def find_best_threshold(category_to_predict, brand_threshold_arr, model_threshold_arr):
    data, already_done = [], {}
    for brand_threshold in brand_threshold_arr:
        for model_threshold in model_threshold_arr:
            startTime = datetime.now()
            df, le, n_brands, n_models = get_modified_dataframe(category_to_predict, brand_threshold, model_threshold)
            
            if (n_brands, n_models) in already_done:
                continue
                
            X, Y = df.values[:,:-1], df.values[:,-1]

            score_val = score(X, Y, linear_model.LogisticRegression(), len(le.classes_))
            time_took = datetime.now() - startTime
            already_done[(n_brands, n_models)] = score_val

            print 'Brand threshold:      ', brand_threshold
            print 'Model threshold:      ', model_threshold
            print 'Num brands/model left:', (n_brands, n_models)
            print 'Score:                ', score_val
            print 'Time:                 ', time_took
            print
            
            data.append((brand_threshold, model_threshold, n_brands, n_models, score_val, time_took))  

    clear_output()
    cols = ['brand_threshold', 'model_threshold', 'n_brands', 'n_models', 'Logistic', 'time']
    return pd.DataFrame(data, columns=cols).set_index(['brand_threshold', 'model_threshold'])

#res_df = find_best_threshold('group', xrange(1, 4), xrange(160, 220))
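
The `already_done` check skips threshold pairs that leave the same number of brands and models as a pair already scored. The same idea can be sketched for a single column by working from the value counts alone, before building any dataframe. `distinct_thresholds` and the toy series are hypothetical names and data, and this is a simplification of what `find_best_threshold` does.

```python
import pandas as pd

def distinct_thresholds(series, thresholds):
    """Keep only the first threshold for each resulting number of categories."""
    counts = series.value_counts()
    seen, kept = set(), []
    for t in thresholds:
        n_rare = int((counts <= t).sum())
        # Categories kept as-is, plus one 'other' bucket if anything collapsed.
        n_left = int((counts > t).sum()) + (1 if n_rare else 0)
        if n_left not in seen:
            seen.add(n_left)
            kept.append((t, n_left))
    return kept

# Toy model column: 'a' x5, 'b' x5, 'c' x2, 'd' x1.
models = pd.Series(["a"] * 5 + ["b"] * 5 + ["c"] * 2 + ["d"])
kept = distinct_thresholds(models, range(1, 6))
print(kept)  # [(1, 4), (2, 3), (5, 1)]
```

Thresholds 3 and 4 are dropped here because they leave the same categories as threshold 2, so refitting for them would only repeat the same score.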

The best value, **2.405190**, is obtained with `brand_threshold=1` and `model_threshold=180`. Similar values are obtained for `brand_threshold` from 1 to 4 and `model_threshold` between 160 and 200. Each logistic regression run takes approximately 13 seconds.

I first tried `find_best_threshold('group', xrange(1, 20), xrange(5, 200, 5))`, which took about 6 hours (without cache). I then refined the search with `find_best_threshold('group', xrange(1, 6), xrange(150, 210))`, which took about 1 hour.

I also had an idea to **predict the group based on age and gender**. This led me nowhere because the female and male baskets have different ranges.

----

So now I have a model that predicts the group when the data have 0 events. I need to build another one for the rows that do have events.
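
Once both models exist, their predictions have to be stitched back together row by row. Here is a minimal sketch of that routing, assuming each model is wrapped as a function that returns class probabilities for a slice of the frame. `route_predictions` and the two stand-in predictors are hypothetical, not something defined in this notebook.

```python
import numpy as np
import pandas as pd

def route_predictions(df, predict_no_events, predict_with_events, n_classes):
    """Fill one probability matrix from two models, chosen per row by e_num."""
    proba = np.zeros((len(df), n_classes))
    no_events = (df["e_num"] == 0).to_numpy()
    if no_events.any():
        proba[no_events] = predict_no_events(df[no_events])
    if (~no_events).any():
        proba[~no_events] = predict_with_events(df[~no_events])
    return proba

# Hypothetical stand-ins for the two fitted models (3 classes for brevity).
uniform = lambda part: np.full((len(part), 3), 1.0 / 3)
peaked = lambda part: np.tile([0.8, 0.1, 0.1], (len(part), 1))

toy = pd.DataFrame({"e_num": [0, 4, 0, 1]})
proba = route_predictions(toy, uniform, peaked, n_classes=3)
print(proba)
```

Keeping the routing in one place means the combined matrix can be passed to `log_loss` directly, so the two models are judged together on the full data.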

In [32]:
def get_modified_dataframe_2(brand_threshold, model_threshold):
    df = df_train[df_train['e_num'] > 0].copy().drop(['gender', 'age'], axis=1)
    df['lat'].fillna(0, inplace=True)
    df['lng'].fillna(0, inplace=True)

    infrequent = df['phone_brand'].value_counts()
    df.replace(infrequent[infrequent <= brand_threshold].index, 'other', inplace=True)

    infrequent = df['device_model'].value_counts()
    df.replace(infrequent[infrequent <= model_threshold].index, 'other', inplace=True)
    
    n_brands = df['phone_brand'].nunique()
    n_models = df['device_model'].nunique()
    
    le_brand = encode(df, 'phone_brand')
    le_model = encode(df, 'device_model')
    
    le = encode(df, 'group')
    
    dummies_brand = pd.get_dummies(df['phone_brand'],  prefix="brand")
    dummies_model = pd.get_dummies(df['device_model'], prefix="model")

    df = pd.concat([dummies_brand, dummies_model, df], axis=1)
    df.drop(['phone_brand', 'device_model'], inplace=True, axis=1)
    return df, le, n_brands, n_models

In [35]:
df, le, n_brands, n_models = get_modified_dataframe_2(1, 180)

X, Y = df.values[:,:-1], df.values[:,-1]
score_val = score(X, Y, linear_model.LogisticRegression(), len(le.classes_))
print score_val

2.48495693622
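
A useful reference point for multi-class log loss is the uniform guess: predicting probability 1/K for each of K classes gives a loss of ln K. The number of groups is not stated in this section (the code uses `len(le.classes_)`), so the value 12 below is an assumption. If it holds, the uniform baseline is about 2.4849, and the score above is essentially equal to it, so it may be worth checking whether this model is learning anything from the event features as they are currently fed in.

```python
import numpy as np

def uniform_log_loss(n_classes):
    """Log loss when every class is given probability 1 / n_classes."""
    return -np.log(1.0 / n_classes)

# n_classes=12 is an assumption; the notebook itself uses len(le.classes_).
baseline = uniform_log_loss(12)
print(baseline)
```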
