<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm" with >1mln comments from Reddit, labeled as either sarcastic or not. A processed version can be found on Kaggle in a form of a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [3]:
!ls ../input/sarcasm/

test-balanced.csv    train-balanced-sarc.csv.gz
test-unbalanced.csv  train-balanced-sarcasm.csv


In [4]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import string
from wordcloud import STOPWORDS
from collections import defaultdict
from scipy.sparse import hstack

In [5]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [6]:
train_df.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...


In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
label             1010826 non-null int64
comment           1010773 non-null object
author            1010826 non-null object
subreddit         1010826 non-null object
score             1010826 non-null int64
ups               1010826 non-null int64
downs             1010826 non-null int64
date              1010826 non-null object
created_utc       1010826 non-null object
parent_comment    1010826 non-null object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB


Some comments are missing, so we drop the corresponding rows.

In [8]:
train_df.dropna(subset=['comment'], inplace=True)

We notice that the dataset is indeed balanced

In [9]:
train_df['label'].value_counts()

0    505405
1    505368
Name: label, dtype: int64

We split data into training and validation parts.

In [10]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(train_df['comment'], train_df['label'], random_state=17)

In [9]:
train_texts.head(), y_train.head()

(827869                      Should have named it Samsquanch
 800568                       All that knob wants is uranus.
 506459                  No their dogs gave up halfway here.
 372707    I'm sure icefrog is that bad at critical self-...
 548483      Thanks for your contribution to the discussion.
 Name: comment, dtype: object, 827869    0
 800568    1
 506459    0
 372707    1
 548483    1
 Name: label, dtype: int64)

## Tasks:
1. Analyze the dataset, make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams which a most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that)
4. (optionally) add subreddits as new features to improve model performance. Apply here the Bag of Words approach, i.e. treat each subreddit as a new feature.

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions


Task#1. Analyze the dataset, make some plots


In [None]:
# target count
target_count = y_train.value_counts()
trace = go.Bar(x=target_count.index, y=target_count.values, marker=dict(color=target_count.values, colorscale='Picnic', reversescale=True))

layout = go.Layout(title='Target count', font=dict(size=18))
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='TargetCount')

In [None]:
# target distribution
labels = np.array(target_count.index)
sizes = np.array((target_count / target_count.sum())*100)

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(title='Target distribution', font=dict(size=18), width=600, height=600)

data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='usertype')

In [None]:
# word frequency plot for sarcasmic and not comments
train0_df = train_df[train_df['label']==0]
train1_df = train_df[train_df['label']==1]

# generate ngrams
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '' if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

# horizontal chart
def horizontal_bar_chart(df, color):
    trace = go.Bar(
        y=df['word'].values[::-1], 
        x=df['wordcount'].values[::-1], 
        showlegend=False, 
        orientation='h', 
        marker=dict(color=color))
    return trace

# bar chart for non sarcastic
freq_dict = defaultdict(int)
for sent in train0_df['comment']:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
freq_dict_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
freq_dict_sorted.columns = ['word', 'wordcount']
trace0 = horizontal_bar_chart(freq_dict_sorted.head(50), 'blue')

# bar chart for sarcastic
freq_dict = defaultdict(int)
for sent in train1_df['comment']:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
freq_dict_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
freq_dict_sorted.columns = ['word', 'wordcount']
trace1 = horizontal_bar_chart(freq_dict_sorted.head(50), 'blue')

# create two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04, 
                         subplot_titles=['Freq words in non sarcastic comments', 
                                         'Freq words in sarcastic comments'])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title='Word Count Plots')
py.iplot(fig, filename='word-plots')

Observations:
- Some of the top words are common across both the classes like one, people, will
- The other top words in non sarcastic comments: think, fuck, good
- The other top words in sarcastic comments: yeah, well, sure

Create bigrams frequency plots for both classes

In [None]:
# bar chart for non sarcastic
freq_dict = defaultdict(int)
for sent in train0_df['comment']:
    for word in generate_ngrams(sent, 2):
        freq_dict[word] += 1
freq_dict_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
freq_dict_sorted.columns = ['word', 'wordcount']
trace0 = horizontal_bar_chart(freq_dict_sorted.head(50), 'orange')

# bar chart for sarcastic
freq_dict = defaultdict(int)
for sent in train1_df['comment']:
    for word in generate_ngrams(sent, 2):
        freq_dict[word] += 1
freq_dict_sorted_1 = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
freq_dict_sorted_1.columns = ['word', 'wordcount']
trace1 = horizontal_bar_chart(freq_dict_sorted_1.head(50), 'orange')

# create two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04, horizontal_spacing=0.15,
                         subplot_titles=['Freq bigrams in non sarcastic comments', 
                                         'Freq bigrams in sarcastic comments'])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title='Word Count Plots')
py.iplot(fig, filename='word-plots')

Observation:
- non sarcastic comment often contains bigrams: pretty sure, pretty much, fake news and many junk comment with repeated words like jerry or fuck
- sarcastic comment often contains bigrams: good thing, everyone knows, pretty sure, white people, black people

Create trigram plots

In [None]:
# bar chart for non sarcastic
freq_dict = defaultdict(int)
for sent in train0_df['comment']:
    for word in generate_ngrams(sent, 3):
        freq_dict[word] += 1
freq_dict_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
freq_dict_sorted.columns = ['word', 'wordcount']
trace0 = horizontal_bar_chart(freq_dict_sorted.head(50), 'green')

# bar chart for sarcastic
freq_dict = defaultdict(int)
for sent in train1_df['comment']:
    for word in generate_ngrams(sent, 3):
        freq_dict[word] += 1
freq_dict_sorted_1 = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
freq_dict_sorted_1.columns = ['word', 'wordcount']
trace1 = horizontal_bar_chart(freq_dict_sorted_1.head(50), 'green')

# create two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04, horizontal_spacing=0.15,
                         subplot_titles=['Freq trigrams in non sarcastic comments', 
                                         'Freq trigrams in sarcastic comments'])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title='Word Count Plots')
py.iplot(fig, filename='word-plots')

Observations:
- many copy pasted words in top of the non sarcastic comments
- number of trigrams in sarcastic comments is less and more adequate then non sarcastic

Create meta features:
1. Number of words in the text
2. Number of unique words in the text
3. Number of characters in the text
4. Number of stopwords
5. Number of punctuations
6. Number of upper case words
7. Number of title case words
8. Average length of the words


In [None]:
train_df_2 = pd.concat([train_texts, y_train], axis=1)
test_df_2 = pd.concat([valid_texts, y_valid], axis=1)

In [None]:
train_df_2.info()

In [None]:
test_df_2.head()

In [None]:
x = test_df_2['comment'][240293]
print(x)
print(str(x).split())
print([w for w in str(x).split() if w.istitle()])
print(len([w for w in str(x).split() if w.istitle()]))

In [None]:
%%time

# Number of words in the text
train_df_2['num_words'] = train_df_2['comment'].apply(lambda x: len(str(x).split()))
test_df_2['num_words'] = test_df_2['comment'].apply(lambda x: len(str(x).split()))

# Number of unique words in the text
train_df_2['num_unique_words'] = train_df_2['comment'].apply(lambda x: len(set(str(x).split())))
test_df_2['num_unique_words'] = test_df_2['comment'].apply(lambda x: len(set(str(x).split())))

# Number of characters in the text
train_df_2['num_chars'] = train_df_2['comment'].apply(lambda x: len(str(x)))
test_df_2['num_chars'] = test_df_2['comment'].apply(lambda x: len(str(x)))

# Number of stopwords
train_df_2['num_stopwords'] = train_df_2['comment'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
test_df_2['num_stopwords'] = test_df_2['comment'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))

# Number of punctuations
train_df_2['num_punctuations'] = train_df_2['comment'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
test_df_2['num_punctuations'] = test_df_2['comment'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

# Number of upper case words
train_df_2['num_words_upper'] = train_df_2['comment'].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df_2['num_words_upper'] = test_df_2['comment'].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

# Number of title case words
train_df_2['num_words_title'] = train_df_2['comment'].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df_2['num_words_title'] = test_df_2['comment'].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

# Average length of the word
train_df_2['mean_word_len'] = train_df_2['comment'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df_2['mean_word_len'] = test_df_2['comment'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

In [None]:
# Truncate some extreme values for better visulas
train_df_2['num_words'].loc[train_df_2['num_words']>60] = 60
train_df_2['num_punctuations'].loc[train_df_2['num_punctuations']>10] = 10
train_df_2['num_chars'].loc[train_df_2['num_chars']>350] = 350

In [None]:
train_df_2.describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(10, 20))

sns.set(style='dark')
sns.boxplot(x='label', y='num_words', data=train_df_2, ax=axes[0])
axes[0].set_xlabel('Label', fontsize=12)
axes[0].set_title('Number of words in each class', fontsize=15, )

sns.boxplot(x='label', y='num_chars', data=train_df_2, ax=axes[1])
axes[1].set_xlabel('Label', fontsize=12)
axes[1].set_title('Number of characters in each class', fontsize=15)

sns.boxplot(x='label', y='num_punctuations', data=train_df_2, ax=axes[2])
axes[2].set_xlabel('Label', fontsize=12)
axes[2].set_title('Number of punctuations in each class', fontsize=15)

plt.show()

Observations:
- sarcastic comments have more number characters and words compared to non sarcastic comments
- both classes have equally number punctuations

### Task#2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (label) based on the text of a comment on Reddit (comment).

In [11]:
%%time
# create pipeline
tfidf_logit_pipe = make_pipeline(TfidfVectorizer(ngram_range=(1,3), max_features=50000, min_df=2), 
                                 LogisticRegression(solver='lbfgs', n_jobs=4, random_state=17, max_iter=1000, verbose=1))
tfidf_logit_pipe.fit(train_texts, y_train)
print(tfidf_logit_pipe.score(valid_texts, y_valid))
# ngram_range=(1, 2), Wall time: 2min 5s, score=0.721196387725866
# ngram_range=(1, 3), Wall time: 2min 3s, score=0.7218256072562071
# ngram_range=(1, 4), Wall time: 3min 14s, score=0.7216989718790315

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:   48.5s finished


0.7218256072562071
CPU times: user 1min 47s, sys: 4.07 s, total: 1min 51s
Wall time: 2min 40s


### Task#3. Plot the words/bigrams which a most predictive of sarcasm (you can use eli5 for that)

In [None]:
import eli5

In [None]:
eli5.show_weights(estimator=tfidf_logit_pipe, top=100)

Task#4. (optionally) add subreddits as new features to improve model performance. Apply here the Bag of Words approach, i.e. treat each subreddit as a new feature.

In [12]:
train_subreddits, valid_subreddits = train_test_split(train_df['subreddit'], random_state=17)

In [13]:
# create separate Tf-Idf vectorizers for comments and for subreddits. 
# It's possible to stick to a pipeline as well, but in that case it becomes a bit less straightforward (https://stackoverflow.com/questions/36731813/computing-separate-tfidf-scores-for-two-different-columns-using-sklearn)

tf_idf_texts = TfidfVectorizer(ngram_range=(1, 3), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))

In [14]:
# do transformations separately for comments and subreddits
X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)

X_train_subreddits = tf_idf_texts.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_texts.transform(valid_subreddits)

In [15]:
X_train_texts.shape, X_valid_texts.shape, X_train_subreddits.shape, X_valid_subreddits.shape

((758079, 50000), (252694, 50000), (758079, 7809), (252694, 7809))

In [16]:
# stack all features together
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])

In [17]:
X_train.shape, X_valid.shape

((758079, 57809), (252694, 57809))

In [18]:
logit = LogisticRegression(C=1, n_jobs=4, random_state=17, verbose=1)

In [19]:
logit.fit(X_train, y_train)




'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=4,
          penalty='l2', random_state=17, solver='warn', tol=0.0001,
          verbose=1, warm_start=False)

In [20]:
logit.score(X_valid, y_valid)
# C=1, score=0.727211568141705

0.727211568141705

In [25]:
logit.get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [21]:
%%time 
# find optimal value for L2 regularization hyperparameter C
C_values_1 = [.001, .01, .1, 1, 10, 100, 1000] # for initial search
C_values_2 = np.logspace(.1, 10, 10) # for fine tuning
param_grid_logit = {'C': C_values_2}
grid_logit = GridSearchCV(logit, param_grid_logit, return_train_score=True, cv=3, verbose=1)

grid_logit.fit(X_train, y_train)

# with C_values_2 , Wall time: 7h 24min!!!, 'C': 1.2589254117941673

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 443.1min finished

'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.



[LibLinear]CPU times: user 7h 23min 44s, sys: 8.3 s, total: 7h 23min 52s
Wall time: 7h 24min


In [22]:
# print best value for hyperparameter C and cv_score
grid_logit.best_params_, grid_logit.best_score_

({'C': 1.2589254117941673}, 0.7215989362586221)

In [26]:
# for the validation set
grid_logit.score(X_valid, y_valid)

0.7268831076321559