In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics, model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [2]:
def cv_loop(X, y, model, K):
    '''
    Repeated hold-out validation: for K iterations, draw a random 80/20
    train/validation split, fit the model, and score it on the held-out
    part. Returns the mean AUC over the K splits.
    '''

    SEED = 15
    mean_auc = 0.
    for i in range(K):
        # sklearn.cross_validation was removed in scikit-learn 0.20;
        # train_test_split now lives in sklearn.model_selection
        X_train, X_cv, y_train, y_cv = model_selection.train_test_split(
            X, y, test_size=0.2,
            random_state=i*SEED)
        model.fit(X_train, y_train)
        preds = model.predict_proba(X_cv)[:,1]
        auc = metrics.roc_auc_score(y_cv, preds)
        print("AUC (fold %d/%d): %f" % (i + 1, K, auc))
        mean_auc += auc
    return mean_auc/K
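
The loop above draws K independent random splits, so the held-out sets can overlap from one iteration to the next. For comparison, here is a minimal sketch of stratified K-fold cross validation, where every sample is held out exactly once. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `folds`, and `fold_aucs` are assumptions for illustration; they are not part of the Kaggle data or the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the Kaggle data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Ten stratified folds: each sample is held out exactly once
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)
fold_aucs = cross_val_score(GaussianNB(), X_demo, y_demo, cv=folds, scoring="roc_auc")

print("Per-fold AUC:", np.round(fold_aucs, 3))
print("Mean AUC: %f" % fold_aucs.mean())
```

Either approach gives a usable estimate; the K-fold variant simply guarantees that each row contributes to the validation score once.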


### Kaggle StumbleUpon Evergreen Data

First let's read in the data downloaded from the Kaggle competition.

In [7]:
train = pd.read_csv('/Users/youngtodd/Documents/utkml_examples/stumble_evergreen/data/train.csv')
test = pd.read_csv('/Users/youngtodd/Documents/utkml_examples/stumble_evergreen/data/test.csv')

train.head()

Unnamed: 0.1,Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,...,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,2351,http://ciaobella50.com/?p=4679,2407,"{""title"":""Funky Tie Dyed and Fabulous "",""body""...",arts_entertainment,0.608835,1.38,0.43,0.19,0.04,...,1,0,28,1,1511,100,0,0.81,0.124031,0
1,1115,http://freefashioninternships.com/,2605,"{""title"":""Free Fashion Internships com Fashion...",business,0.827129,2.558824,0.529412,0.352941,0.029412,...,1,0,9,1,4448,34,0,0.441176,0.115254,0
2,1049,http://www.manjulaskitchen.com/,1202,"{""title"":""Manjula s Kitchen Indian Vegetarian ...",business,0.768478,1.210145,0.401408,0.070423,0.007042,...,1,0,47,1,864,142,0,0.021127,0.041667,1
3,2790,http://ohsheglows.com/recipage?recipe_id=6002030,6418,"{""title"":""Recipage Oh She Glows "",""body"":""I ve...",?,?,1.35,0.608696,0.103261,0.032609,...,1,1,8,0,13117,184,2,0.326087,0.101562,1
4,2829,http://www.perpetualkid.com/index.asp?PageActi...,1159,"{""title"":""HEART SWALLOW BUSINESS CARD HOLDER ""...",business,0.735727,1.909091,0.533333,0.238095,0.047619,...,?,0,63,?,530,105,6,0.133333,0.089286,0


Note: there is some missing data within the training and test sets encoded with '?'. Let's take the naive approach and set all missing data to 0.

In [8]:
train = train.replace('?', value=0)
test = test.replace('?', value=0)
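
One thing to watch: a column that contained '?' is typically read in as text, so it may still hold strings after the replacement. A possible alternative, sketched here on a tiny made-up CSV (the values and the names `raw`, `demo`, and `demo_filled` are hypothetical), is to have `pd.read_csv` treat '?' as missing via `na_values` and then fill with 0, which keeps the columns numeric.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for train.csv (hypothetical values)
raw = io.StringIO(
    "urlid,alchemy_category_score,is_news,label\n"
    "1,0.61,1,0\n"
    "2,?,?,1\n"
    "3,0.77,1,1\n"
)

# na_values='?' makes pandas parse '?' as NaN, so the columns stay numeric
demo = pd.read_csv(raw, na_values="?")
print(demo.dtypes)

# Same naive imputation as above: missing values become 0
demo_filled = demo.fillna(0)
print(demo_filled)
```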

## Starting Off

Some of the best information in this dataset lives in its natural language features, e.g. 'boilerplate', the text description of the website to be promoted. While we ultimately want to make use of this information, modeling it can be a bit tricky. Here we will first see what we can do with only the numeric features of the data. This will serve as a benchmark for all of our future modeling.

In [9]:
# Our first model will only make use of numeric features
# Let's also separate the label from the training features
X_train = train[train.columns[5:-1]]
y_train = train['label']

# Note: the test set does not have a label (we need to predict this)
X_test = test[test.columns[5:]]
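
Slicing by position depends on the exact column order of the file. As a hedged alternative, the sketch below selects features by name on a toy frame. The frame `demo`, its values, and the choice of columns in `non_feature_cols` are assumptions for illustration, not the real dataset's full schema.

```python
import pandas as pd

# Toy frame with a few column names borrowed from the dataset (values made up)
demo = pd.DataFrame({
    "url": ["http://a.example", "http://b.example"],
    "urlid": [1, 2],
    "boilerplate": ["{...}", "{...}"],
    "avglinksize": [1.38, 2.56],
    "numberOfLinks": [100, 34],
    "label": [0, 1],
})

# Hypothetical list of columns that are identifiers, text, or the target
non_feature_cols = ["url", "urlid", "boilerplate", "label"]

X_demo = demo.drop(columns=non_feature_cols)
y_demo = demo["label"]
print(list(X_demo.columns))
```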

### Preprocessing

Many models work better when the features are represented on a common scale. Standardizing each feature to zero mean and unit variance keeps features with large raw values (e.g. character counts) from dominating features with small ones (e.g. ratios).

In [10]:
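# Learn the scaling statistics from the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)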
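
To make the effect of standardization concrete, here is a minimal sketch on made-up data with two features on very different scales. It also shows one common convention: learning the mean and standard deviation from the training rows and reusing them on new rows. The arrays `train_demo` and `test_demo` are synthetic assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up features on very different scales
train_demo = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(200, 2))
test_demo = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(50, 2))

scaler = StandardScaler().fit(train_demo)   # learn mean/std from training rows only
train_scaled = scaler.transform(train_demo)
test_scaled = scaler.transform(test_demo)   # reuse the training statistics

print("train mean:", train_scaled.mean(axis=0).round(3))
print("train std: ", train_scaled.std(axis=0).round(3))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only close to that, since they are scaled with the training statistics.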

### On With It, Train The Model!

In [22]:
# Gaussian NB for real valued features.
model = GaussianNB()
K = 10 # Number of iterations for cross validation

print("AUC score for 10 fold cross validation:\n")
score = cv_loop(X_train, y_train, model, K)

print("\nMean AUC: %f" %score)

AUC score for 10 fold cross validation:

AUC (fold 1/10): 0.604088
AUC (fold 2/10): 0.600203
AUC (fold 3/10): 0.613043
AUC (fold 4/10): 0.653669
AUC (fold 5/10): 0.637583
AUC (fold 6/10): 0.632930
AUC (fold 7/10): 0.631132
AUC (fold 8/10): 0.611367
AUC (fold 9/10): 0.622946
AUC (fold 10/10): 0.614334

Mean AUC: 0.622129


### First thoughts

That is not too bad for an incredibly quick model. A classifier that guesses randomly scores an AUC of about 0.5 regardless of class balance, so our model sits roughly 0.12 AUC above chance. With some work, we should be able to improve that score. For now, let's formally fit this model using all the training data (above, each iteration trained on a random 80% of it).

In [25]:
# Fitting the Gaussian NB
model.fit(X_train, y_train)

# Let's make predictions on the unseen test set
preds = model.predict_proba(X_test)[:,1]
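
As a sanity check on the chance-level baseline used to judge the score above, the following sketch computes the AUC of scores that are unrelated to the labels, once for balanced and once for imbalanced classes. All data here is randomly generated, and the names `chance_aucs`, `y_true`, and `random_scores` are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20000
chance_aucs = []
for positive_rate in (0.5, 0.1):
    y_true = (rng.random(n) < positive_rate).astype(int)
    random_scores = rng.random(n)  # scores carry no information about y_true
    auc = roc_auc_score(y_true, random_scores)
    chance_aucs.append(auc)
    print("positive rate %.1f -> AUC %.3f" % (positive_rate, auc))
```

Both values should land near 0.5, which is why an AUC in the low 0.6 range counts as a modest but real improvement over guessing.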

## Kaggle Submission

Our submissions to the Kaggle competition are in the form of .csv files with two columns: 1. the URL ID feature of the test set; 2. our predictions for that URL ID (these will be real valued estimates for the probability that our URL is an 'evergreen' [label=1])

In [27]:
# Get the url ID from the test set
urlid = test['urlid']

# Combine the predictions and url IDs into a pandas dataframe.
# Note: setting the index to url ID means urlid is always written as the
# first column of the file, whatever order the dataframe columns end up in
pred_df = pd.DataFrame({'label':preds, 'urlid': urlid}).set_index('urlid')
pred_df.to_csv('first_evergreen_submission.csv')
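
Another way to pin down the layout of the file is to pass the column order to `to_csv` explicitly. This is a minimal sketch with hypothetical IDs and probabilities (`demo_urlid`, `demo_preds`), written to an in-memory buffer so the result can be inspected before anything is uploaded.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical IDs and predicted probabilities
demo_urlid = pd.Series([5865, 782, 6962], name="urlid")
demo_preds = np.array([0.91, 0.12, 0.55])

submission = pd.DataFrame({"urlid": demo_urlid, "label": demo_preds})

# Write to a buffer with an explicit column order and no index column
buffer = io.StringIO()
submission.to_csv(buffer, columns=["urlid", "label"], index=False)
print(buffer.getvalue())
```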