<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments, each labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [None]:
!ls ../input/sarcasm/

In [None]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
%config InlineBackend.figure_format = 'retina'

In [None]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [None]:
train_df.head()

In [None]:
pd.options.display.max_colwidth = 200
train_df[train_df['ups'] == train_df['ups'].max()]['comment']
# train_df.index[train_df['comment']].tolist()

In [None]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [None]:
train_df.dropna(subset=['comment'], inplace=True)

In [None]:
# train_df = train_df.drop('comment_clean', axis=1)

In [None]:
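# Strip punctuation, keeping word characters, whitespace, apostrophes and carets.
# regex=True is explicit because recent pandas versions treat the pattern as a literal string by default.
train_df.insert(2, 'comment_clean', train_df['comment'].str.replace(r"[^\w\s^']", '', regex=True))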
train_df.head()

We notice that the dataset is indeed balanced.

In [None]:
train_df['label'].value_counts()

We split data into training and validation parts.

In [None]:
xtrain, xvalid, ytrain, yvalid = \
        train_test_split(train_df['comment_clean'], train_df['label'], test_size=0.25, random_state=17)

In [None]:
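# Alternative split on the raw `comment` column; running this cell overwrites the `comment_clean` split
xtrain, xvalid, ytrain, yvalid = \
        train_test_split(train_df['comment'], train_df['label'], test_size=0.25, random_state=17)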

## Tasks:
1. Analyze the dataset and make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams that are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.
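
For task 4, one possible way to combine the comment text with subreddit names is sketched below. This is only an illustration on a made-up toy frame: the `ColumnTransformer` layout, the step names and the toy rows are assumptions, not the course solution. Each distinct subreddit becomes its own column through a `CountVectorizer`, which is the Bag of Words idea applied to subreddit names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for train_df (made-up rows, for illustration only)
toy_df = pd.DataFrame({
    'comment': ['yeah right, great idea', 'thanks for the link',
                'oh sure, that will work', 'I agree with this'],
    'subreddit': ['politics', 'AskReddit', 'politics', 'funny'],
    'label': [1, 0, 1, 0],
})

# A string column name (not a list) hands a 1-D column of texts to each vectorizer
features = ColumnTransformer([
    ('comment_tfidf', TfidfVectorizer(ngram_range=(1, 2)), 'comment'),
    # \S+ keeps every subreddit name as a single token; case is preserved
    ('subreddit_bow', CountVectorizer(token_pattern=r'\S+', lowercase=False), 'subreddit'),
])

toy_pipe = Pipeline([('features', features),
                     ('logit', LogisticRegression(solver='lbfgs', C=1))])
toy_pipe.fit(toy_df[['comment', 'subreddit']], toy_df['label'])

# One column per distinct subreddit seen in the toy data
fitted_features = toy_pipe.named_steps['features']
subreddit_vocab = fitted_features.named_transformers_['subreddit_bow'].vocabulary_
print(sorted(subreddit_vocab))
```

On the real data the same pipeline would be fitted on `train_df[['comment', 'subreddit']]`, with the vectorizer settings tuned separately for each column.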

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions

## Plots

In [None]:
from wordcloud import STOPWORDS
stopwords = set(STOPWORDS)
more_stopwords = {'comcast', 'jerry', 'ziggo', 'gjallarhorn', '7'}
stopwords = stopwords.union(more_stopwords)

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from collections import defaultdict

In [None]:
train_df[train_df['comment'].str.contains('money money')]

In [None]:
train_df['label'].hist()

In [None]:
train1_df = train_df[train_df["label"]==1]
train0_df = train_df[train_df["label"]==0]

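## custom function for n-gram generation ##
def generate_ngrams(text, n_gram=1):
    # lowercase, split on spaces, drop empty tokens and stopwords
    tokens = [token for token in text.lower().split(" ") if token != "" and token not in stopwords]
    # zip n_gram shifted copies of the token list to get consecutive n-grams
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]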

## custom function for horizontal bar chart ##
def horizontal_bar_chart(df, color):
    trace = go.Bar(
        y=df["word"].values[::-1],
        x=df["wordcount"].values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace

## Get the bar chart from non-sarcastic comments ##
freq_dict = defaultdict(int)
for sent in train0_df["comment_clean"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(20), 'lightgreen')

## Get the bar chart from sarcastic comments ##
freq_dict = defaultdict(int)
for sent in train1_df["comment_clean"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(20), 'pink')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,
                          subplot_titles=["Non-sarcastic comments",
                                          "Sarcastic comments"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=600, width=700, paper_bgcolor='rgb(233,233,233)', title="Most frequent words")
py.iplot(fig, filename='word-plots')
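
The n-gram helper in the cell above relies on zipping shifted copies of the token list. The standalone sketch below isolates that idiom; the function name `ngrams_from_tokens` and the sample sentence are made up for illustration, and stopword filtering is left out.

```python
def ngrams_from_tokens(tokens, n=2):
    # Shifted copies: tokens[0:], tokens[1:], ..., tokens[n-1:]
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest copy, so only complete n-grams are produced
    return [" ".join(gram) for gram in zip(*shifted)]

tokens = "oh sure that will work".split()
print(ngrams_from_tokens(tokens, 1))
# ['oh', 'sure', 'that', 'will', 'work']
print(ngrams_from_tokens(tokens, 2))
# ['oh sure', 'sure that', 'that will', 'will work']
```

A text with fewer than `n` tokens simply yields an empty list, so very short comments contribute nothing to the bigram counts.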

In [None]:
freq_dict = defaultdict(int)
for sent in train0_df["comment_clean"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(20), 'yellow')


freq_dict = defaultdict(int)
for sent in train1_df["comment_clean"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(20), 'orange')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,
                          subplot_titles=["Non-sarcastic comments",
                                          "Sarcastic comments"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=600, width=1000, paper_bgcolor='rgb(233,233,233)', title="Most frequent bigrams")
py.iplot(fig, filename='word-plots')

## Tf-Idf

There are two ways to get Tf-Idf features: `CountVectorizer` followed by `TfidfTransformer`, or `TfidfVectorizer` in a single step.
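
A small check of the two routes on made-up sentences, assuming default settings apart from `ngram_range`: `CountVectorizer` followed by `TfidfTransformer` should produce the same matrix as `TfidfVectorizer` alone.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import make_pipeline

# Made-up documents, for illustration only
docs = ['yeah right great idea', 'thanks for the link', 'great link thanks']

# Route 1: raw counts first, then Tf-Idf reweighting
two_step = make_pipeline(CountVectorizer(ngram_range=(1, 2)), TfidfTransformer())
# Route 2: both steps in one estimator
one_step = TfidfVectorizer(ngram_range=(1, 2))

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# Compare the dense versions of both sparse matrices
print(np.allclose(X_two.toarray(), X_one.toarray()))
```

The single-step version is shorter inside a `Pipeline`; the two-step version is handy when the raw counts are needed as well.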

In [None]:
# pipe = make_pipeline(TfidfVectorizer(min_df=2, max_features=None, strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
#                                       ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1, stop_words = 'english'),
#                       LogisticRegression(solver='lbfgs', C=1, n_jobs=-1))
pipe = Pipeline([('tfidf', TfidfVectorizer(min_df=3, max_features=50000, ngram_range=(1, 3))),
                 ('logit', LogisticRegression(solver='lbfgs', C=1, n_jobs=-1))])

In [None]:
# pipe.fit(list(xtrain) + list(xvalid))
# xtrain_tfv =  pipe.transform(xtrain)
# xvalid_tfv = pipe.transform(xvalid)
# pipe.fit(xtrain_tfv, ytrain)
# round(pipe.score(xtrain_tfv, ytrain), 3), round(pipe.score(xvalid_tfv, yvalid), 3)

In [None]:
%%time
pipe.fit(xtrain, ytrain)
predictions = pipe.predict(xvalid)
print(accuracy_score(yvalid, predictions))

In [None]:
print(pipe.score(xvalid, yvalid))

Scores obtained with different settings:
* `test_size=0.1`, no pipeline, `comment_clean`: accuracy (train, valid) = (0.745, 0.688), log loss 0.584
* `test_size=0.25`, pipeline, `comment_clean`: 0.6882118293271704
* `test_size=0.25`, pipeline, `comment`: 0.6875192921082416
* `test_size=0.25`, pipeline, `comment`: 0.6885046736368888
* `test_size=0.25`, pipeline, `comment_clean`, Tf-Idf settings from the solution: 0.7207135903503843, and 30 s faster
* `min_df` has no influence
* Trigrams take 2 min 26 s; with the solution settings the score is then 0.7213744687250192
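
The notes above report both accuracy and log loss. The toy sketch below shows how the two differ: accuracy looks only at thresholded labels, while log loss is computed from predicted probabilities. The numbers are hypothetical; with a fitted pipeline the probabilities would come from `pipe.predict_proba(xvalid)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 0, 1, 0])
# Hypothetical predicted probabilities of the positive (sarcastic) class
proba_sarcastic = np.array([0.9, 0.2, 0.4, 0.1])

# Accuracy uses hard labels obtained with a 0.5 threshold
hard_labels = (proba_sarcastic >= 0.5).astype(int)
acc = accuracy_score(y_true, hard_labels)

# Log loss penalizes confident mistakes and rewards well-calibrated probabilities
ll = log_loss(y_true, proba_sarcastic)

print(acc, ll)
```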

## Word weights

In [None]:
import eli5
eli5.show_weights(estimator=pipe.named_steps['logit'],
                  vec=pipe.named_steps['tfidf'])
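
If `eli5` is not available, the same kind of information can be read directly from the fitted steps. The sketch below uses a made-up toy pipeline (the comments, labels and the name `toy_pipe` are assumptions) and pairs each vocabulary term with its logistic regression coefficient.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up comments and labels, for illustration only
toy_comments = ['yeah right great idea', 'thanks for the link',
                'oh sure totally great', 'I agree with this']
toy_labels = [1, 0, 1, 0]

toy_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('logit', LogisticRegression(solver='lbfgs', C=1))])
toy_pipe.fit(toy_comments, toy_labels)

vocab = toy_pipe.named_steps['tfidf'].vocabulary_      # term -> column index
coefs = toy_pipe.named_steps['logit'].coef_.ravel()    # one weight per column
terms = np.array(sorted(vocab, key=vocab.get))         # terms in column order

# Largest positive weights push the prediction towards class 1 (sarcastic)
top = np.argsort(coefs)[::-1][:3]
for term, weight in zip(terms[top], coefs[top]):
    print(f'{term}: {weight:+.3f}')
```

Sorting in the opposite direction gives the terms most associated with the non-sarcastic class.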

In [None]:
def plot_confusion_matrix(actual, predicted, classes,
                          normalize=False,
                          title='Confusion matrix', figsize=(7,7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(actual, predicted).T
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')
    
    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')

In [None]:
plot_confusion_matrix(yvalid, predictions, pipe.named_steps['logit'].classes_, figsize=(8, 8))
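
A note on reading the plot: the helper transposes the matrix, so rows are predicted labels. The sketch below, on made-up labels, reproduces what `normalize=True` does after that transpose. Each row then sums to 1, so the diagonal can be read as per-class precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true labels, columns: predicted
cm_t = cm.T                             # rows: predicted labels, as in the plot

# Divide each row by its sum, i.e. by the number of samples predicted as that class
cm_t_norm = cm_t / cm_t.sum(axis=1, keepdims=True)
print(cm_t_norm)
```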

### CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
ctv_pipe = Pipeline([('ctv', CountVectorizer(min_df=2, max_features=50000, ngram_range=(1, 2))),
                 ('logit', LogisticRegression(solver='lbfgs', C=1, n_jobs=-1))])

In [None]:
%%time
ctv_pipe.fit(xtrain, ytrain)
predictions = ctv_pipe.predict(xvalid)
print(accuracy_score(yvalid, predictions))