# Kaggle Competition: Stack Overflow Closed Discussions

Currently about 6% of all new questions on Stack Overflow end up "closed". Questions can be closed as off-topic, not constructive, not a real question, or too localized. Your goal is to build a classifier that predicts whether or not a question will be closed given the question as submitted. Additional data about the user at question creation time is also available.

## 1. Reading the data

Let's load the data, go through it and prepare it a little.

In [None]:
import pandas as pd

In [None]:
train = pd.read_csv('./data/train.csv', index_col=0)

In [None]:
train.head()

In [None]:
# Shape of the train set
train.shape

In [None]:
# Balance of the train set
train.OpenStatus.value_counts()

The dataset is perfectly balanced. That's going to be convenient.

In [None]:
# Types of the train set
train.dtypes

In [None]:
# Dates columns to datetime format
train.PostCreationDate = pd.to_datetime(train.PostCreationDate)
train.OwnerCreationDate = pd.to_datetime(train.OwnerCreationDate)
# train.PostClosedDate = pd.to_datetime(train.PostClosedDate)  -- will be dropped afterwards

In [None]:
# Missing values
train.isnull().sum()

PostClosedDate has as many missing values as open posts, which is pretty logical. This column is the only one which we will not use, for two reasons. First, it contains "future information", i.e. information which will not be available on the test set. Then, the only other information it contains is the date of closing, which coes not interest us.

Other that that, only tags contain missing values, which is also going to be convenient.

In [None]:
# Removing PostClosedDate
train = train.drop('PostClosedDate', axis=1)

In [None]:
# Closer look at the timeframes considered
print('Min post creation date: ', train.PostCreationDate.min())
print('Max post creation date: ', train.PostCreationDate.max())
print('Min owner creation date: ', train.OwnerCreationDate.min())
print('Max owner creation date: ', train.OwnerCreationDate.max())

## 2. Feature Engineering

In this step we are trying to create new numeric columns which might be helpful to train our model predict the closure of a post.

In [None]:
%matplotlib inline

### 2.a) Post Creation Date features

In [None]:
# Post Creation time values
train['PostCreationDate_hour'] = train.PostCreationDate.dt.hour
train['PostCreationDate_day'] = train.PostCreationDate.dt.dayofweek
train['PostCreationDate_month'] = train.PostCreationDate.dt.month
train['PostCreationDate_year'] = train.PostCreationDate.dt.year

In [None]:
train.head()

Let us now check one by one these new features and check if some of them seem to be good at predicting OpenStatus. To do so, we look at the mean of OpenStatus, grouping by every of these features. This gives us the probability of a post being Open, per hour/dayofweek/month/year.

In [None]:
train.groupby('PostCreationDate_hour').OpenStatus.mean().plot()

One can notice that posts created in the night (hour 4 to 7) are less likely to be open.

In [None]:
train.groupby('PostCreationDate_day').OpenStatus.mean().plot()

One can notice that posts created during the weekend (days 5 and 6) are less likely to be open.

In [None]:
train.groupby('PostCreationDate_month').OpenStatus.mean().plot()

One can notive that posts published in july are the most unlikely to be open.

In [None]:
train.groupby('PostCreationDate_year').OpenStatus.mean().plot()

One can notice than oldest (2008) newest posts (at least 2011) are more likely to be closed than the others.

**We can now create or new columns for post creation date, which will help us predict if a post is closed or not.**

In [None]:
train['PostCreationDate_night'] = ((train.PostCreationDate_hour) >= 4 & (train.PostCreationDate_hour <= 7)).astype(int)
train['PostCreationDate_weekend'] = ((train.PostCreationDate_day) >= 5 & (train.PostCreationDate_day <= 6)).astype(int)
train['PostCreationDate_july'] = (train.PostCreationDate_month == 7).astype(int)
train['PostCreationDate_recent'] = ((train.PostCreationDate_year >= 2011) | (train.PostCreationDate_year == 2008)).astype(int)

In [None]:
train.head()

### 2.b) Owner Creation Date features

In [None]:
# Owner Creation time values
train['OwnerCreationDate_hour'] = train.OwnerCreationDate.dt.hour
train['OwnerCreationDate_day'] = train.OwnerCreationDate.dt.dayofweek
train['OwnerCreationDate_month'] = train.OwnerCreationDate.dt.month
train['OwnerCreationDate_year'] = train.OwnerCreationDate.dt.year

In [None]:
train.tail()

Let us now check one by one these new features and check if some of them seem to be good at predicting OpenStatus. We proceed exactly as in the section before.

In [None]:
train.groupby('OwnerCreationDate_hour').OpenStatus.mean().plot()

One can notice that posts by owners created in the night (hour 4 to 6) are less likely to be open.

In [None]:
train.groupby('OwnerCreationDate_day').OpenStatus.mean().plot()

One can notice that posts by owners created during the weekend (days 5 and 6) are less likely to be open.

In [None]:
train.groupby('OwnerCreationDate_month').OpenStatus.mean().plot()

One can notive that posts by owners published in july are the most unlikely to be open.

In [None]:
train.groupby('OwnerCreationDate_year').OpenStatus.mean().plot()

One can notice than posts by newest owners (at least 2011) are more likely to be closed than the others.

**We can now create or new columns for owner creation date, which will help us predict if a post is closed or not.**

In [None]:
train['OwnerCreationDate_night'] = ((train.OwnerCreationDate_hour) >= 4 & (train.OwnerCreationDate_hour <= 6)).astype(int)
train['OwnerCreationDate_weekend'] = ((train.OwnerCreationDate_day) >= 5 & (train.OwnerCreationDate_day <= 6)).astype(int)
train['OwnerCreationDate_july'] = (train.OwnerCreationDate_month == 7).astype(int)
train['OwnerCreationDate_recent'] = (train.OwnerCreationDate_year >= 2011).astype(int)

### 2.c) Number of tags

In [None]:
tags = ['Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5']
train['nb_tags'] = 5 - train[tags].isnull().sum(axis=1)

In [None]:
train[tags + ['nb_tags']].tail()

In [None]:
train.groupby('nb_tags').OpenStatus.mean()

One can see that the lower the number of tags, the higher the likelyhood of being closed. This feature might thus be useful in the future.

### 2.d) Length of title

In [None]:
train['len_title'] = ((train.Title.str.len())//5)*5
train.groupby('len_title').OpenStatus.mean()

One can notice that very short titles have a high probability of being closed. Let us set a threshold at 20.

In [None]:
train['short_title'] = (train.Title.str.len() < 25).astype(int)
train.short_title.value_counts()

Around 10% of the posts are among the short-titled posts.

### 2.e) Length of post

In [None]:
import numpy as np

In [None]:
train['len_post'] = train.BodyMarkdown.str.len().apply(np.log).astype(int)
train.groupby('len_post').OpenStatus.mean().plot()

In [None]:
train.len_post.value_counts().sort_index()

One can notice that for the lowest and highest values of log(len(post)), the likelihood of being closed is much higher. Lets create a feature for taking this into account.

In [None]:
train['len_post_extreme'] = ((train.BodyMarkdown.str.len().apply(np.log) < 6) | (train.BodyMarkdown.str.len().apply(np.log) > 9)).astype(int)

In [None]:
train[['len_post', 'len_post_extreme']].tail()

### 2.f) Avg length of tags

In [None]:
def get_len_tags(tags):
    if str(tags[0]) == 'nan':
        return 0
    else:
        return (np.mean([len(str(tag)) for tag in tags if str(tag) != 'nan'])//1)

In [None]:
train['len_tags'] = train[tags].apply(get_len_tags, axis=1)

In [None]:
train[tags + ['len_tags']].head()

In [None]:
train.groupby('len_tags').OpenStatus.mean().plot()

It does not seem that this feature is very predictive of the status of the post.

### 2.g) Number of posts of the author

In [None]:
nb_posts_author = train.groupby('OwnerUserId').OwnerUserId.count()

In [None]:
train['nb_posts_author'] = (train.OwnerUserId.apply(lambda OwnerUserId: nb_posts_author[OwnerUserId]))

In [None]:
train.groupby('nb_posts_author').OpenStatus.mean().plot()

In [None]:
train.nb_posts_author.value_counts()

It does not seem that this feature is very predictive of the status of the post.

### 2.h) Number of closed posts of the author

In [None]:
ClosedPosts = pd.concat([train.OwnerUserId, (train.OpenStatus == 0).astype(int)], axis=1).groupby('OwnerUserId').sum()
ClosedPosts.columns=(['NumberOfClosedPostsAtSubmissionTime'])
ClosedPosts['HasClosedPostsAtSubmissionTime'] = (ClosedPosts.NumberOfClosedPostsAtSubmissionTime > 0).astype(int)
ClosedPosts.head()

In [None]:
ClosedPosts.NumberOfClosedPostsAtSubmissionTime.value_counts()

In [None]:
train['nb_closed_posts_author'] = train.OwnerUserId.apply(lambda x: ClosedPosts.NumberOfClosedPostsAtSubmissionTime[x] if x in ClosedPosts.index else 0)

In [None]:
train.groupby('nb_closed_posts_author').OpenStatus.mean().plot()

Let's separate those with at lease one closed post and those without.

In [None]:
train['has_closed_posts_author'] = train.OwnerUserId.apply(lambda x: ClosedPosts.HasClosedPostsAtSubmissionTime[x] if x in ClosedPosts.index else 0)

In [None]:
train.groupby('has_closed_posts').OpenStatus.mean()

## 3) Learning on numerical features only

### 3.a) Preparing train data

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('./data/train.csv', index_col=0)
train.PostCreationDate = pd.to_datetime(train.PostCreationDate)
train.OwnerCreationDate = pd.to_datetime(train.OwnerCreationDate)

In [2]:
train.columns

Index(['PostId', 'PostCreationDate', 'OwnerUserId', 'OwnerCreationDate',
       'ReputationAtPostCreation', 'OwnerUndeletedAnswerCountAtPostTime',
       'Title', 'BodyMarkdown', 'Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5',
       'PostClosedDate', 'OpenStatus'],
      dtype='object')

In [3]:
ClosedPosts = pd.concat([train.OwnerUserId, (train.OpenStatus == 0).astype(int)], axis=1).groupby('OwnerUserId').sum()
ClosedPosts.columns=(['NumberOfClosedPostsAtSubmissionTime'])
ClosedPosts['HasClosedPostsAtSubmissionTime'] = (ClosedPosts.NumberOfClosedPostsAtSubmissionTime > 0).astype(int)
ClosedPosts.head()

Unnamed: 0_level_0,NumberOfClosedPostsAtSubmissionTime,HasClosedPostsAtSubmissionTime
OwnerUserId,Unnamed: 1_level_1,Unnamed: 2_level_1
3,0,0
4,1,1
5,1,1
9,0,0
13,2,1


In [4]:
def create_features(df):
    
    new_features = []
    
    # dates columns to datetime format
    df.PostCreationDate = pd.to_datetime(train.PostCreationDate)
    df.OwnerCreationDate = pd.to_datetime(train.OwnerCreationDate)
    
    # post dates features
    df['PostCreationDate_night'] = ((df.PostCreationDate.dt.hour) >= 4 & (df.PostCreationDate.dt.hour <= 7)).astype(int)
    df['PostCreationDate_weekend'] = ((df.PostCreationDate.dt.dayofweek) >= 5 & (df.PostCreationDate.dt.dayofweek <= 6)).astype(int)
    df['PostCreationDate_july'] = (df.PostCreationDate.dt.month == 7).astype(int)
    #df['PostCreationDate_recent'] = ((df.PostCreationDate.dt.year >= 2011) | (df.PostCreationDate.dt.year == 2008)).astype(int)
    new_features += ['PostCreationDate_night', 'PostCreationDate_weekend', 'PostCreationDate_july']
    
    # owner dates features
    df['OwnerCreationDate_night'] = ((df.OwnerCreationDate.dt.hour) >= 4 & (df.OwnerCreationDate.dt.hour <= 6)).astype(int)
    df['OwnerCreationDate_weekend'] = ((df.OwnerCreationDate.dt.dayofweek) >= 5 & (df.OwnerCreationDate.dt.dayofweek <= 6)).astype(int)
    df['OwnerCreationDate_july'] = (df.OwnerCreationDate.dt.month == 7).astype(int)
    #df['OwnerCreationDate_recent'] = (df.OwnerCreationDate.dt.year >= 2011).astype(int)
    new_features += ['OwnerCreationDate_night', 'OwnerCreationDate_weekend', 'OwnerCreationDate_july']
    
    # number of tags
    tags = ['Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5']
    df['nb_tags'] = 5 - df[tags].isnull().sum(axis=1)
    new_features += ['nb_tags']
    
    # length of title
    df['short_title'] = (df.Title.str.len() < 25).astype(int)
    new_features += ['short_title']
    
    # length of post
    df['len_post_extreme'] = ((df.BodyMarkdown.str.len().apply(np.log) < 6) | (df.BodyMarkdown.str.len().apply(np.log) > 9)).astype(int)
    new_features += ['len_post_extreme']
    
    # has the author closed posts at post submission
    # df['has_closed_posts'] = df.OwnerUserId.apply(lambda x: ClosedPosts.HasClosedPostsAtSubmissionTime[x] if x in ClosedPosts.index else 0)
    # new_features += ['has_closed_posts']
    
    return df, new_features

In [5]:
train, new_features = create_features(train)
numerical_features = ['ReputationAtPostCreation', 'OwnerUndeletedAnswerCountAtPostTime'] + new_features

In [6]:
numerical_features

['ReputationAtPostCreation',
 'OwnerUndeletedAnswerCountAtPostTime',
 'PostCreationDate_night',
 'PostCreationDate_weekend',
 'PostCreationDate_july',
 'OwnerCreationDate_night',
 'OwnerCreationDate_weekend',
 'OwnerCreationDate_july',
 'nb_tags',
 'short_title',
 'len_post_extreme']

In [7]:
X_num = train[numerical_features]
y_num = train.OpenStatus

In [8]:
X_num.head(10)

Unnamed: 0,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,PostCreationDate_night,PostCreationDate_weekend,PostCreationDate_july,OwnerCreationDate_night,OwnerCreationDate_weekend,OwnerCreationDate_july,nb_tags,short_title,len_post_extreme
0,1,2,1,1,0,1,1,0,1,0,0
1,192,24,1,1,0,1,1,0,3,0,0
2,1,0,1,1,1,1,1,1,3,1,1
3,4,1,1,1,0,1,1,1,2,0,0
4,334,14,1,1,0,1,1,0,2,0,0
5,20,0,1,1,0,1,1,0,2,0,0
6,95,10,1,0,0,1,1,0,4,0,0
7,32,0,1,1,0,1,1,0,4,1,1
8,1,0,1,1,0,1,1,0,1,0,1
9,1,0,1,1,0,1,1,0,1,0,0


### 3.b) Choosing a model

In [9]:
from sklearn import metrics
from sklearn.pipeline import make_pipeline

import scipy as sp
from sklearn.grid_search import GridSearchCV
from sklearn.grid_search import RandomizedSearchCV
from sklearn.cross_validation import cross_val_score

from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.naive_bayes import MultinomialNB as MNB
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

In [None]:
Models = [LogReg, KNN, LinearSVC, LDA]

In [None]:
# For MNB:
# X['ReputationAtPostCreation'] = X.ReputationAtPostCreation + np.absolute(np.min(X.ReputationAtPostCreation, axis=0))
# Result: 0.558094286125

In [None]:
import warnings
from datetime import datetime

In [None]:
warnings.filterwarnings('ignore')

for Model in Models:
    start = datetime.now()
    model = Model()
    scores = cross_val_score(model, X_num, y_num, scoring='accuracy', cv=5)
    end = datetime.now()
    delta = end-start
    print('Accuracy: {0} - Computation time: {1} seconds.'.format(np.mean(scores), delta.seconds))
    
warnings.filterwarnings('default')

One can notice that very good results are reached, even without using vectorizers and without using the text data. This is very promising for the future. Let us choose Logistic Regression, which seems to provide the best results.

### 3.c) Tuning the model

In [None]:
# Checking the distribution chosen
warnings.filterwarnings('ignore')

%matplotlib inline
rayl_data = sp.stats.rayleigh.rvs(size=1000)
pd.DataFrame(rayl_data).plot(kind='density')

warnings.filterwarnings('default')

In [10]:
# LOGISTIC REGRESSION
logreg_num = LogReg()
param_grid_num = {'penalty':['l1', 'l2'], 'C':sp.stats.rayleigh()}
rand_num = RandomizedSearchCV(logreg_num, param_grid_num, n_iter=10, scoring='log_loss', cv=5)

In [None]:
%time rand_num.fit(X_num,y_num)

In [None]:
print(rand_num.best_score_, ' — ', rand_num.best_params_)

### 3.d) Predicting on test data

In [11]:
test = pd.read_csv('./data/test.csv')
test.PostCreationDate = pd.to_datetime(test.PostCreationDate)
test.OwnerCreationDate = pd.to_datetime(test.OwnerCreationDate)

In [12]:
test.head()

Unnamed: 0,PostId,PostCreationDate,OwnerUserId,OwnerCreationDate,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,Title,BodyMarkdown,Tag1,Tag2,Tag3,Tag4,Tag5
0,11768878,2012-08-01 23:10:12,756422,2011-05-16 21:49:59,155,11,Maven & yui-compressor Plugin issues,I'm using the yui-compressor plugin for maven ...,maven,maven-3,yui-compressor,,
1,11768880,2012-08-01 23:10:21,1569892,2012-08-01 22:24:37,1,0,Inconsistent behaviour of html select dropdowns,I have written a javascript-generated web page...,html,select,drop-down-menu,scrollbar,
2,11803678,2012-08-03 21:40:49,1301879,2012-03-29 21:01:29,781,37,Why Does MSFT C# Compiler Compile fixed Statem...,The .NET c# compiler (.NET 4.0) compiles the `...,c#,.net,compiler,il,
3,11803496,2012-08-03 21:24:02,1196150,2012-02-08 02:20:44,538,0,Dump sql file to ClearDB in Heroku,I have a sql file that I want to be dumped int...,mysql,ruby-on-rails,heroku,,
4,11803700,2012-08-03 21:43:13,772581,2009-11-13 16:24:05,70,2,mysql query to get rows with conditions,"\r\nI have a table called ""articles"" on the da...",mysql,query,,,


In [13]:
test, new_features = create_features(test)

In [14]:
X_test_num = test[numerical_features]
X_test_num.head(20)

Unnamed: 0,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,PostCreationDate_night,PostCreationDate_weekend,PostCreationDate_july,OwnerCreationDate_night,OwnerCreationDate_weekend,OwnerCreationDate_july,nb_tags,short_title,len_post_extreme
0,155,11,1,1,0,1,1,0,3,0,0
1,1,0,1,1,0,1,1,0,4,0,0
2,781,37,1,1,1,1,1,1,4,0,1
3,538,0,1,1,0,1,1,1,3,0,0
4,70,2,1,1,0,1,1,0,2,0,0
5,176,0,1,1,0,1,1,0,1,1,1
6,2657,161,1,0,0,1,1,0,3,0,0
7,11,0,1,1,0,1,1,0,3,0,0
8,202,4,1,1,0,1,1,0,2,0,1
9,70,5,1,1,0,1,1,0,5,0,0


In [None]:
y_pred_num = rand_num.predict_proba(X_test_num)[:,1]

In [None]:
output = pd.DataFrame({'id':X_test_num.index, 'OpenStatus':y_pred_num})

In [None]:
output[['id','OpenStatus']].to_csv('./out/submission.csv', index=False)

In [None]:
output.OpenStatus.value_counts()

## 4) Learning on text data

In [46]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

In [47]:
train.columns

Index(['PostId', 'PostCreationDate', 'OwnerUserId', 'OwnerCreationDate',
       'ReputationAtPostCreation', 'OwnerUndeletedAnswerCountAtPostTime',
       'Title', 'BodyMarkdown', 'Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5',
       'PostClosedDate', 'OpenStatus', 'PostCreationDate_night',
       'PostCreationDate_weekend', 'PostCreationDate_july',
       'OwnerCreationDate_night', 'OwnerCreationDate_weekend',
       'OwnerCreationDate_july', 'nb_tags', 'short_title', 'len_post_extreme',
       'Text'],
      dtype='object')

In [48]:
def create_text_feature(df):
    
    df['Text'] = df.Title.str.cat(df.BodyMarkdown, sep=' ')
    df.Text = df.Text.str.cat(df.Tag1, sep=' ', na_rep='')
    df.Text = df.Text.str.cat(df.Tag2, sep=' ', na_rep='')
    df.Text = df.Text.str.cat(df.Tag3, sep=' ', na_rep='')
    df.Text = df.Text.str.cat(df.Tag4, sep=' ', na_rep='')
    df.Text = df.Text.str.cat(df.Tag5, sep=' ', na_rep='')

    # Let's add the title another time
    df.Text = df.Text.str.cat(df.Title, sep=' ', na_rep='')
    
    return df

In [49]:
train = create_text_feature(train)

In [50]:
train.Text[1]

'How to insert schemalocation in a xml document via DOM i create a xml document with JAXP and search a way to insert the schemalocation.\r\nAt the moment my application produces:\r\n\r\n    <?xml version="1.0" encoding="UTF-8"?>\r\n    <root>\r\n    ...\r\n    </root>\r\n\r\nBut i need:\r\n\r\n    <?xml version="1.0" encoding="UTF-8"?>\r\n    <root xmlns="namespaceURL" \r\n    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"\r\n    xs:schemaLocation="namespaceURL pathToMySchema.xsd">\r\n    ...\r\n    </root>\r\n\r\nMy code:\r\n\r\n    StreamResult result = new StreamResult(writer);\r\n    Document doc = getDocument();\r\n\r\n    Transformer trans = transfac.newTransformer();\r\n    trans.setOutputProperty(OutputKeys.INDENT, "yes");\r\n    trans.setOutputProperty(OutputKeys.METHOD, "xml");\r\n    trans.setOutputProperty(OutputKeys.VERSION, "1.0");\r\n    trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");\r\n\r\n    DOMSource source = new DOMSource(depl.getAsElement(doc));\r\n  

In [51]:
X_txt = train.Text
y_txt = train.OpenStatus

In [52]:
vect_txt = CountVectorizer()
logreg_txt = LogReg()

pipe_txt = make_pipeline(vect_txt, logreg_txt)

In [53]:
rand_param_txt = {'countvectorizer__min_df':[1,2,3],
                  'countvectorizer__stop_words':[None,'english'],
                  'countvectorizer__max_df':sp.stats.uniform(loc=0.1,scale=0.8),
                  'logisticregression__penalty':['l1','l2'],
                  'logisticregression__C':sp.stats.rayleigh()}
rand_txt = RandomizedSearchCV(pipe_txt, rand_param_txt, n_iter=25, cv=5, scoring='log_loss')

In [None]:
%time rand_txt.fit(X_txt,y_txt)

In [None]:
print(rand_txt.best_score_, ' — ', rand_txt.best_params_)

In [None]:
rand_txt.best_params_

It's done! Now we apply our best model to the test set and export to CSV.

In [54]:
test = create_text_feature(test)

In [None]:
X_test_txt = test.Text
y_pred_txt = rand_txt.predict_proba(X_test_txt)[:,1]

In [None]:
output = pd.DataFrame({'id':X_test_txt.index, 'OpenStatus':y_pred_txt})

In [None]:
output[['id','OpenStatus']].to_csv('./out/submission.csv', index=False)

## 4) Learning on both

### 4.a) Model Stacking

In [None]:
y_pred = (y_pred_txt + y_pred_num)/2
output = pd.DataFrame({'id':X_test_txt.index, 'OpenStatus':y_pred})
output[['id','OpenStatus']].to_csv('./out/stacking-1-1.csv', index=False)

In [None]:
y_pred = (2*y_pred_txt + y_pred_num)/3
output = pd.DataFrame({'id':X_test_txt.index, 'OpenStatus':y_pred})
output[['id','OpenStatus']].to_csv('./out/stacking-2-1.csv', index=False)

In [None]:
y_pred = (y_pred_txt + 2*y_pred_num)/3
output = pd.DataFrame({'id':X_test_txt.index, 'OpenStatus':y_pred})
output[['id','OpenStatus']].to_csv('./out/stacking-1-2.csv', index=False)

### 4.b) Text/Num Data Combination

In [75]:
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

First step is to create two transformers, which will extract the relevant parts of the dataframe: the engineered new features on the one hand and the Text feature on the other hand.

In [76]:
# function returning new engineered features
def get_new_features(df):
    return df.loc[:,numerical_features]

In [77]:
get_new_features(train).head(3)

Unnamed: 0,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,PostCreationDate_night,PostCreationDate_weekend,PostCreationDate_july,OwnerCreationDate_night,OwnerCreationDate_weekend,OwnerCreationDate_july,nb_tags,short_title,len_post_extreme
0,1,2,1,1,0,1,1,0,1,0,0
1,192,24,1,1,0,1,1,0,3,0,0
2,1,0,1,1,1,1,1,1,3,1,1


In [78]:
get_new_features_ft = FunctionTransformer(get_new_features, validate=False)

In [79]:
get_new_features_ft.transform(train).head(3)

Unnamed: 0,ReputationAtPostCreation,OwnerUndeletedAnswerCountAtPostTime,PostCreationDate_night,PostCreationDate_weekend,PostCreationDate_july,OwnerCreationDate_night,OwnerCreationDate_weekend,OwnerCreationDate_july,nb_tags,short_title,len_post_extreme
0,1,2,1,1,0,1,1,0,1,0,0
1,192,24,1,1,0,1,1,0,3,0,0
2,1,0,1,1,1,1,1,1,3,1,1


In [80]:
# function returning grouped text feature
def get_text_feature(df):
    return df.Text

In [81]:
get_text_feature(train).head(3)

0    For Mongodb is it better to reference an objec...
1    How to insert schemalocation in a xml document...
2    Too many lookup tables  What are the adverse e...
Name: Text, dtype: object

In [82]:
get_text_feature_ft = FunctionTransformer(get_text_feature, validate=False)

In [83]:
get_text_feature_ft.transform(train).head(3)

0    For Mongodb is it better to reference an objec...
1    How to insert schemalocation in a xml document...
2    Too many lookup tables  What are the adverse e...
Name: Text, dtype: object

Now, let us create a union. This will enable us to run both data preparation separately and concatenate horizontaly the result, so that we can learn in the same time on the data coming from feature engineer and vectorized from the text.

In [84]:
vect_ft = CountVectorizer()
logreg_ft = LogReg()

In [85]:
union_ft = make_union(get_new_features_ft, make_pipeline(get_text_feature_ft, vect_ft))
pipe_ft = make_pipeline(union_ft, logreg_ft)

In [86]:
pipe_ft.named_steps

{'featureunion': FeatureUnion(n_jobs=1,
        transformer_list=[('functiontransformer', FunctionTransformer(accept_sparse=False,
           func=<function get_new_features at 0x207bfe6a8>, pass_y=False,
           validate=False)), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(accept_sparse=False,
           func=<functi...strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None))]))],
        transformer_weights=None),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False)}

In [87]:
param_grid_ft = {'logisticregression__C':sp.stats.rayleigh(),
                 'logisticregression__penalty':['l1', 'l2'],
                 'featureunion__pipeline__countvectorizer__min_df':[1,2,3],
                 'featureunion__pipeline__countvectorizer__stop_words':[None],
                 'featureunion__pipeline__countvectorizer__max_df':sp.stats.uniform(loc=0.1,scale=0.8)}

In [88]:
rand_ft = RandomizedSearchCV(pipe_ft, param_distributions=param_grid_ft, cv=5, n_iter=30, scoring='log_loss')

In [89]:
%time rand_ft.fit(train, train.OpenStatus)

CPU times: user 4h 4min, sys: 4min 32s, total: 4h 8min 32s
Wall time: 12h 33min 35s


RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('functiontransformer', FunctionTransformer(accept_sparse=False,
          func=<function get_new_features at 0x207bfe6a8>, pass_y=False,
          validate=False)), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionT...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
          fit_params={}, iid=True, n_iter=30, n_jobs=1,
          param_distributions={'featureunion__pipeline__countvectorizer__min_df': [1, 2, 3], 'featureunion__pipeline__countvectorizer__max_df': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1733efcc0>, 'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11feb3630>, 'featureunion__pipeline__countvectorizer__stop_words': [None]},
          pre_dispatch='2*n_jobs', random_state

In [90]:
print(rand_ft.best_score_, ' — ', rand_ft.best_params_)

-0.52521883304  —  {'featureunion__pipeline__countvectorizer__max_df': 0.7967374734652921, 'featureunion__pipeline__countvectorizer__min_df': 2, 'logisticregression__penalty': 'l2', 'logisticregression__C': 0.95937344250408318, 'featureunion__pipeline__countvectorizer__stop_words': None}


In [91]:
test = create_text_feature(test)
y_pred_ft = rand_ft.predict_proba(test)[:,1]

In [92]:
output = pd.DataFrame({'id':test.PostId, 'OpenStatus':y_pred_ft})

In [93]:
output[['id','OpenStatus']].to_csv('./out/submission.csv', index=False)