### Objective:
The goal of this notebook is to explore the dataset and build a baseline logistic regression model, comparing different optimization methods (solvers).

### References:
1. https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc
2. https://www.kaggle.com/sudalairajkumar/simple-feature-engg-notebook-spooky-author


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import string
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['embeddings', 'sample_submission.csv', 'test.csv', 'train.csv']


In [2]:
from sklearn import model_selection, preprocessing, metrics, ensemble, naive_bayes, linear_model
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import f1_score
import lightgbm as lgb

In [3]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ", train_df.shape)
print("Test shape : ", test_df.shape)

Train shape :  (1306122, 3)
Test shape :  (375806, 2)


In [4]:
train_df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


## Data distribution

In [5]:
# Count rows per class with a boolean sum (avoids positional indexing on the result of count())
cnt_pos = (train_df['target'] == 1).sum()
cnt_neg = (train_df['target'] == 0).sum()
print('There are %d insincere questions and %d sincere questions.' % (cnt_pos, cnt_neg))
print('%f percent of questions are insincere.' % (cnt_pos / (cnt_pos + cnt_neg) * 100))

There are 80810 insincere questions and 1225312 sincere questions.
6.187018 percent of questions are insincere.
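
With only about 6% of questions labeled insincere, plain accuracy says little about model quality, which is presumably why the models below are scored with F1. The following is a minimal sketch of that point using the class counts printed above; the helper `f1` is written here for illustration and is not part of the notebook.

```python
# Class counts taken from the output above.
n_pos, n_neg = 80810, 1225312


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Baseline 1: predict "sincere" (0) for every question.
acc_all_neg = n_neg / (n_pos + n_neg)
f1_all_neg = f1(tp=0, fp=0, fn=n_pos)

# Baseline 2: predict "insincere" (1) for every question.
acc_all_pos = n_pos / (n_pos + n_neg)
f1_all_pos = f1(tp=n_pos, fp=n_neg, fn=0)

print(f"all-negative: accuracy={acc_all_neg:.4f}, F1={f1_all_neg:.4f}")
print(f"all-positive: accuracy={acc_all_pos:.4f}, F1={f1_all_pos:.4f}")
```

The all-negative baseline scores high on accuracy but zero on F1, so any useful model has to be judged on how well it recovers the minority class.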


**Meta Features:**

Now let us create some meta features and look at how they are distributed between the classes. The features are:
1. Number of words in the text
2. Number of unique words in the text
3. Number of characters in the text
4. Number of stopwords
5. Number of punctuation characters
6. Number of upper-case words
7. Number of title-case words
8. Average length of the words

In [6]:
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

## Number of words in the text ##
train_df["num_words"] = train_df["question_text"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["question_text"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train_df["num_unique_words"] = train_df["question_text"].apply(lambda x: len(set(str(x).split())))
test_df["num_unique_words"] = test_df["question_text"].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train_df["num_chars"] = train_df["question_text"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["question_text"].apply(lambda x: len(str(x)))

## Number of stopwords in the text ##
train_df["num_stopwords"] = train_df["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
test_df["num_stopwords"] = test_df["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))

## Number of punctuations in the text ##
train_df["num_punctuations"] =train_df['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["num_punctuations"] =test_df['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Number of upper case words in the text ##
train_df["num_words_upper"] = train_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df["num_words_upper"] = test_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

## Number of title case words in the text ##
train_df["num_words_title"] = train_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df["num_words_title"] = test_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

## Average length of the words in the text ##
train_df["mean_word_len"] = train_df["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
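
The cell above repeats each feature for `train_df` and `test_df`. One way to avoid the duplication is a single function that returns all eight features for one string. This is a sketch, not the notebook's code: `meta_features` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set only stands in for NLTK's English list so the example runs without downloads.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOPWORDS = {"the", "is", "a", "of", "in", "why", "so"}


def meta_features(text, stopwords=DEMO_STOPWORDS):
    """Compute the eight meta features for a single piece of text."""
    text = str(text)
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(w in stopwords for w in text.lower().split()),
        "num_punctuations": sum(c in string.punctuation for c in text),
        "num_words_upper": sum(w.isupper() for w in words),
        "num_words_title": sum(w.istitle() for w in words),
        # Unlike np.mean([]), which yields nan, empty text maps to 0.0 here.
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


feats = meta_features("Why is the sky so BLUE in Paris?")
print(feats)
```

Applied to a DataFrame, the per-row dictionaries could be collected with something like `pd.DataFrame(train_df["question_text"].map(meta_features).tolist())`, passing the NLTK set as `stopwords`.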

In [7]:
## Truncate some extreme values for better visuals ##
# Note: this caps the values in train_df itself, so the capped columns are also
# the ones used as model features below, while test_df is left uncapped.
train_df.loc[train_df['num_words'] > 60, 'num_words'] = 60
train_df.loc[train_df['num_punctuations'] > 10, 'num_punctuations'] = 10
train_df.loc[train_df['num_chars'] > 350, 'num_chars'] = 350


In [8]:
train_df.head()

Unnamed: 0,qid,question_text,target,num_words,num_unique_words,num_chars,num_stopwords,num_punctuations,num_words_upper,num_words_title,mean_word_len
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0,13,13,72,7,1,0,2,4.615385
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0,16,15,81,9,2,0,1,4.125
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0,10,8,67,3,2,0,2,5.8
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0,9,9,57,3,1,0,4,5.444444
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0,15,15,77,8,1,2,3,4.2


### Baseline model

In [9]:
train_text = train_df['question_text']
test_text = test_df['question_text']
all_text = pd.concat([train_text, test_text])

# Get the tfidf vectors #
tfidf_vec = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 3),
    max_features=5000)
# Fit the vocabulary and IDF weights on the combined train + test text
tfidf_vec.fit(all_text)
train_tfidf = tfidf_vec.transform(train_text)
test_tfidf = tfidf_vec.transform(test_text)
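
To make the effect of `sublinear_tf=True` concrete, here is a pure-Python sketch of the weighting on a toy corpus. It assumes the defaults documented for scikit-learn's TF-IDF (smoothed IDF and L2 row normalization) and ignores stop-word removal, accent stripping, n-grams and `max_features`. The corpus and the helper `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized on whitespace.
docs = [
    "why is the sky blue",
    "why is the sea blue blue",
    "how deep is the sea",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """Sublinear TF times smoothed IDF, followed by L2 normalization."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        tf_sub = 1.0 + math.log(tf)  # sublinear_tf: tf -> 1 + ln(tf)
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
        weights[term] = tf_sub * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


row = tfidf_row(tokenized[1])
print(row)
```

In the second document "blue" occurs twice, yet its weight is only `1 + ln(2)` times that of "why" (same document frequency) rather than double, which is the dampening that sublinear scaling provides.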

In [10]:
features = ['num_words', 'num_unique_words', 'num_chars', 
                'num_stopwords', 'num_punctuations', 'num_words_upper', 
                'num_words_title', 'mean_word_len']

train_ = train_df[features]
train_.head()

Unnamed: 0,num_words,num_unique_words,num_chars,num_stopwords,num_punctuations,num_words_upper,num_words_title,mean_word_len
0,13,13,72,7,1,0,2,4.615385
1,16,15,81,9,2,0,1,4.125
2,10,8,67,3,2,0,2,5.8
3,9,9,57,3,1,0,4,5.444444
4,15,15,77,8,1,2,3,4.2


In [11]:
from scipy.sparse import hstack, csr_matrix
train_ = hstack((csr_matrix(train_), train_tfidf))
print(train_.shape)

(1306122, 5008)


In [12]:
test_ = test_df[features]
test_ = hstack((csr_matrix(test_), test_tfidf))
print(test_.shape)

(375806, 5008)


In [13]:
train_y = train_df["target"].values

x_train, x_val, y_train, y_val = model_selection.train_test_split(train_, train_y, test_size=0.2, random_state=42)

print(x_train.shape, x_val.shape)

(1044897, 5008) (261225, 5008)


In [14]:
model = linear_model.LogisticRegression(C=5., solver='saga')
model.fit(x_train, y_train)
pred_y_val = model.predict_proba(x_val)[:,1]

best_f1 = 0
best_threshold = 0
for threshold in np.arange(0.1, 0.201, 0.01):
    f1 = f1_score(y_val, (pred_y_val>threshold).astype(int))
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold
print('Best threshold is %f and f1 is %f' %(best_threshold, best_f1))



Best threshold is 0.110000 and f1 is 0.354914
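
The search above only covers thresholds from 0.10 to 0.20, and the best value (0.11) sits near the lower edge of that range, so a wider grid may be worth checking. The sketch below shows such a sweep in pure Python on made-up labels and probabilities. The names `f1_at` and `search_threshold` and the default grid bounds are assumptions for illustration.

```python
def f1_at(y_true, probs, threshold):
    """F1 score when predicting 1 for every probability above the threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p > threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p > threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p <= threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(y_true, probs, lo=0.05, hi=0.95, step=0.05):
    """Return (threshold, f1) for the best threshold on an evenly spaced grid."""
    n_steps = int(round((hi - lo) / step))
    # Round grid points to avoid float drift such as 0.35000000000000003.
    grid = [round(lo + i * step, 10) for i in range(n_steps + 1)]
    return max(((t, f1_at(y_true, probs, t)) for t in grid), key=lambda pair: pair[1])


# Made-up validation labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.02, 0.08, 0.12, 0.30, 0.35, 0.40, 0.70, 0.90]

best_t, best_score = search_threshold(y_true, probs)
print("Best threshold is %f and f1 is %f" % (best_t, best_score))
```

With the notebook's data, the same function could be called as `search_threshold(y_val, pred_y_val)`.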


In [15]:
def runModel(x_train, x_val, y_train, solver):
    model = linear_model.LogisticRegression(C=5., solver=solver)
    model.fit(x_train, y_train)
    pred_y_val = model.predict_proba(x_val)[:,1]
    return model, pred_y_val

solver = ['sag', 'saga', 'newton-cg', 'lbfgs', 'liblinear']
model_final = None
best_f1 = 0
for solve in solver:
    model, pred_y_val = runModel(x_train, x_val, y_train, solve)
    f1 = f1_score(y_val, (pred_y_val>best_threshold).astype(int))
    if f1 > best_f1:
        best_f1 = f1
        model_final = model
    print('%s solver has f1 score %f' %(solve, f1))



sag solver has f1 score 0.407154
saga solver has f1 score 0.354859
newton-cg solver has f1 score 0.521055
lbfgs solver has f1 score 0.418541
liblinear solver has f1 score 0.522287
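
All five solvers optimize the same model, so the spread in F1 (0.35 to 0.52) suggests that some of them may have stopped before converging. The scikit-learn documentation notes that fast convergence of 'sag' and 'saga' is only guaranteed when features are on approximately the same scale, and here the count features (for example `num_chars`, up to 350) are far larger than the TF-IDF weights. The sketch below shows one possible remedy on synthetic data: scaling with `MaxAbsScaler`, which keeps sparse matrices sparse. The data, `max_iter=1000` and the choice of scaler are assumptions, not results from this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: one large-scale count column and five small sparse columns.
counts = rng.integers(1, 350, size=(n, 1)).astype(float)
small = rng.random((n, 5)) * (rng.random((n, 5)) < 0.3)
X = hstack([csr_matrix(counts), csr_matrix(small)]).tocsr()

# Synthetic binary target that depends on both kinds of features.
noise = 0.1 * rng.standard_normal(n)
y = (small[:, 0] + 0.002 * counts[:, 0] + noise > 0.5).astype(int)

# Scale every column to [-1, 1] by its maximum absolute value (sparsity is preserved).
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

# Same model family as above, with a larger iteration budget for 'saga'.
model = LogisticRegression(C=5.0, solver="saga", max_iter=1000)
model.fit(X_scaled, y)
print(model.coef_.shape)
```

In the notebook's setting the scaler would be fit on `x_train` only and then applied to `x_val` and `test_` with `scaler.transform`.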


In [16]:
## Refit the best model on the whole training set
model_final.fit(train_, train_y)

LogisticRegression(C=5.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [17]:
pred_test_y = model_final.predict_proba(test_)[:,1]
# np.int was removed in NumPy 1.24; the builtin int is equivalent here
submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": (pred_test_y > best_threshold).astype(int)})
submit_df.head()

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,1
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0


In [18]:
submit_df['prediction'].value_counts()

0    329205
1     46601
Name: prediction, dtype: int64

In [None]:
submit_df.to_csv("submission.csv", index=False)