<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [None]:
!ls ../input/sarcasm/

In [None]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

# plotting
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

from matplotlib import pyplot as plt

In [None]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [None]:
train_df.dropna(subset=['comment'], inplace=True)

The label counts confirm that the dataset is indeed balanced.

In [None]:
train_df['label'].value_counts()

We split data into training and validation parts.

In [None]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(train_df['comment'], train_df['label'], random_state=17)
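
With no `test_size` given, `train_test_split` holds out 25% of the rows for validation. The sketch below shows this on toy data (the texts and labels are made up for illustration); the `stratify` argument is an optional extra, not used above, that keeps the class ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the comment texts and labels (hypothetical data).
toy_texts = [f"comment {i}" for i in range(100)]
toy_labels = [i % 2 for i in range(100)]

# Same call pattern as above; `stratify` is an assumption added for illustration.
tr_texts, va_texts, tr_y, va_y = train_test_split(
    toy_texts, toy_labels, random_state=17, stratify=toy_labels
)

print(len(tr_texts), len(va_texts))  # 75 25
```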

## Tasks:
1. Analyze the dataset and make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams that are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions
  
  ---
  
## DONE
- Word counts
- Strip punctuation
- It would be interesting to see whether sarcasm correlates with the up/down votes ratio, but
  - I don't understand why some up-vote or down-vote counts are negative

In [None]:
# Shortcuts for the data with label 1 and label 0
train1_df = train_df[train_df["label"]==1]
train0_df = train_df[train_df["label"]==0]

In [None]:
# How many ARE sarcasm?
# -> "balanced" dataset
print(train_df['label'].value_counts())

Frequent words?

In [None]:
# Word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Thanks : https://www.kaggle.com/aashita/word-clouds-of-various-shapes ##
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), 
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='black',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
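    # join all comments into one string; str() on a pandas Series would only give its truncated repr
    wordcloud.generate(" ".join(text.astype(str)))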
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  
    
plot_wordcloud(train_df["comment"], title="Word Cloud of comments")

In [None]:
from collections import defaultdict
import string # for punctuation

## custom function for ngram generation ##
def generate_ngrams(text, n_gram=1):
    # lower-case, split on spaces, strip leading/trailing punctuation (once per token)
    stripped = (raw.strip(string.punctuation) for raw in text.lower().split(" "))
    # drop empty tokens and stop words (checked after stripping punctuation)
    tokens = [t for t in stripped if t != "" and t not in STOPWORDS]
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]

## custom function for horizontal bar chart ##
def horizontal_bar_chart(df, color):
    trace = go.Bar(
        y=df["word"].values[::-1],
        x=df["wordcount"].values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace

## Get the bar chart from sincere (non-sarcastic) comments ##
freq_dict = defaultdict(int)
for sent in train0_df["comment"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(30), 'blue')

## Get the bar chart from sarcastic comments ##
freq_dict = defaultdict(int)
for sent in train1_df["comment"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(30), 'blue')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,
                          subplot_titles=["Frequent words of sincere comments", 
                                          "Frequent words of sarcasm comments"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Word Count Plots")
py.iplot(fig, filename='word-plots')
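
To see what the n-gram helper produces, here is a minimal standalone sketch on a single sentence. The stop-word set `TOY_STOPWORDS` and the name `toy_ngrams` are hypothetical stand-ins so that the snippet runs without `wordcloud`; the tokenization follows the same steps as above.

```python
import string
from collections import Counter

# Hypothetical, tiny stop-word list; the notebook itself uses wordcloud.STOPWORDS.
TOY_STOPWORDS = {"the", "a", "is", "of"}

def toy_ngrams(text, n=1):
    """Lower-case, split on spaces, strip punctuation, drop stop words, build n-grams."""
    stripped = (raw.strip(string.punctuation) for raw in text.lower().split(" "))
    tokens = [t for t in stripped if t and t not in TOY_STOPWORDS]
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

sentence = "Yeah, the weather is GREAT today!"
unigrams = toy_ngrams(sentence, 1)
bigrams = toy_ngrams(sentence, 2)
print(unigrams)  # ['yeah', 'weather', 'great', 'today']
print(bigrams)   # ['yeah weather', 'weather great', 'great today']

# Counting n-grams over several comments, as the frequency plots do.
counts = Counter(g for s in [sentence, "Yeah, right."] for g in toy_ngrams(s))
print(counts.most_common(1))  # [('yeah', 2)]
```

Note that n-grams are built after stop words are removed, so a bigram such as `yeah weather` can join two words that were not adjacent in the original comment.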


In [None]:
# BIGRAMS
freq_dict = defaultdict(int)
for sent in train0_df["comment"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(30), 'orange')


freq_dict = defaultdict(int)
for sent in train1_df["comment"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(30), 'orange')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,
                          subplot_titles=["Frequent bigrams of sincere comments", 
                                          "Frequent bigrams of sarcasm comments"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Bigram Count Plots")
py.iplot(fig, filename='word-plots')

In [None]:
# TRIGRAMS
freq_dict = defaultdict(int)
for sent in train0_df["comment"]:
    for word in generate_ngrams(sent,3):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(30), 'green')


freq_dict = defaultdict(int)
for sent in train1_df["comment"]:
    for word in generate_ngrams(sent,3):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(30), 'green')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,
                          subplot_titles=["Frequent trigrams of sincere comments", 
                                          "Frequent trigrams of sarcasm comments"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Trigram Count Plots")
py.iplot(fig, filename='word-plots')

Observations:

- 'yeah', 'well' -> sarcasm
- 'good thing', 'everyone knows' -> sarcasm
- Many combinations - for both sarcasm and non-sarcasm
- Many triplets - weird

TODO:
- subreddit?
- number of up/down votes
- as a function of up/down-votes ratio

## Part 1.1 - analyze metadata

In [None]:
train1_df['comment'].str.len().apply(np.log1p).hist(label='sarcastic', alpha=.5, density=True)
train0_df['comment'].str.len().apply(np.log1p).hist(label='normal', alpha=.5, density=True)
plt.legend();

The comment-length distributions of the two classes are almost identical.

In [None]:
## Group by to analyze subreddits
# size = number of comments, mean = share of sarcastic ones, sum = number of sarcastic ones
sub_df = train_df.groupby('subreddit')['label'].agg(['size', 'mean', 'sum'])
sub_df[sub_df['size'] > 1000].sort_values(by='mean', ascending=False).head(10)
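
The same group-by can be written with named aggregations, which makes the resulting columns self-explanatory. This is a sketch on a made-up miniature frame; the column names `n_comments`, `sarcasm_share` and `n_sarcastic` are our own choice.

```python
import pandas as pd

# Hypothetical miniature of the training frame.
toy_df = pd.DataFrame({
    "subreddit": ["politics", "politics", "politics", "aww", "aww"],
    "label":     [1, 1, 0, 0, 0],
})

toy_sub = toy_df.groupby("subreddit")["label"].agg(
    n_comments="size",      # how many comments the subreddit has
    sarcasm_share="mean",   # fraction of them labeled sarcastic
    n_sarcastic="sum",      # number of them labeled sarcastic
)
print(toy_sub.sort_values("sarcasm_share", ascending=False))
```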

## Part 2 - training the model

- TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
  - `min_df=2` - ignore terms that appear in fewer than 2 documents.
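
A minimal sketch of the idea on a made-up three-comment corpus: a word that occurs in every document ("yeah") gets the lowest inverse document frequency, so inside a single comment the rarer word carries more weight. The corpus and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["yeah right", "yeah sure", "yeah totally"]
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_corpus)

# Map each term to its learned inverse document frequency.
idf = {term: toy_vec.idf_[idx] for term, idx in toy_vec.vocabulary_.items()}
for term in sorted(idf):
    print(f"{term:8s} idf={idf[term]:.3f}")

# Within the first document, the rarer word "right" outweighs "yeah".
row = toy_matrix[0].toarray().ravel()
rarer_wins = row[toy_vec.vocabulary_["right"]] > row[toy_vec.vocabulary_["yeah"]]
print(rarer_wins)  # True
```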

In [None]:
# use unigrams and bigrams, cap the number of features
# and set a minimal document frequency
tf_idf = TfidfVectorizer(ngram_range=(1, 2),  # unigrams and bigrams
                         max_features=50000,  # keep only the 50,000 most frequent terms
                         min_df=2)  # ignore a term if it appears in fewer than 2 documents
# logistic regression (a binary classifier here, since `label` is 0 or 1)
logit = LogisticRegression(C=1,  # inverse regularization strength; smaller -> stronger regularization
                           n_jobs=4,  # parallelize
                           solver='lbfgs', 
                           random_state=17,  # fixed seed for reproducibility
                           verbose=1) 
# sklearn's pipeline
# Sequentially apply a list of transforms and a final estimator. 
# Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.
# The final estimator only needs to implement fit.
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), 
                                 ('logit', logit)] # Steps
                               )

tfidf_logit_pipeline

In [None]:
## Train
# Fit all the transforms one after the other and transform the data,
# then fit the transformed data using the final estimator.
tfidf_logit_pipeline.fit(train_texts, # training data
                         y_train) # labels

In [None]:
## test
valid_pred = tfidf_logit_pipeline.predict(valid_texts)

## accuracy score - fraction of correct ones
accuracy_score(y_valid, valid_pred)
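
Accuracy alone does not show which kind of mistake the model makes. `confusion_matrix` is imported at the top of the notebook, and a sketch of its layout on made-up labels looks like this (rows are true classes, columns are predicted classes):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, just to show the layout.
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(toy_true, toy_pred)
acc = accuracy_score(toy_true, toy_pred)
print(cm)
# [[1 1]
#  [1 2]]
print(acc)  # 0.6
```

On the real data the same call would be `confusion_matrix(y_valid, valid_pred)`.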

## Part 3 - explain results

In [None]:
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'])
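
If `eli5` is not available, roughly the same information can be read straight from the fitted pipeline: the vectorizer maps each term to a column, and the logistic regression stores one coefficient per column. The sketch below uses a made-up toy corpus; positive weights push towards `label == 1`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: "yeah" only in sarcastic comments, "thanks" only in sincere ones.
toy_texts = ["yeah right", "yeah sure", "yeah totally",
             "thanks friend", "thanks again", "thanks everyone"]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([("tf_idf", TfidfVectorizer()),
                     ("logit", LogisticRegression())])
toy_pipe.fit(toy_texts, toy_labels)

vocab = toy_pipe.named_steps["tf_idf"].vocabulary_  # term -> column index
coefs = toy_pipe.named_steps["logit"].coef_[0]      # one weight per column
weights = {term: coefs[idx] for term, idx in vocab.items()}

# Print terms from most sarcasm-like to most sincere-like.
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{w:+.3f}  {term}")
```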

## Part 4 - improve results

- Try cutting rare words or words which are way too frequent
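
A small sketch of what such cuts do to the vocabulary, on a made-up four-document corpus: `min_df` removes terms found in too few documents, and a fractional `max_df` removes terms found in too large a share of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat", "the cat ran", "the dog ran", "the bird flew"]

# No pruning: every token becomes a feature.
full_vocab = sorted(TfidfVectorizer().fit(toy_corpus).vocabulary_)

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.95 drops terms found in more than 95% of documents.
pruned_vocab = sorted(
    TfidfVectorizer(min_df=2, max_df=0.95).fit(toy_corpus).vocabulary_
)
print(full_vocab)    # ['bird', 'cat', 'dog', 'flew', 'ran', 'sat', 'the']
print(pruned_vocab)  # ['cat', 'ran']
```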

In [None]:
## play with parameters:
# Don't use rare terms (found in fewer than 20 documents)
# use unigrams and bigrams, drop the cap on the number of features
# and raise the minimal document frequency

tf_idf = TfidfVectorizer(ngram_range=(1, 2),  # unigrams and bigrams
                        # max_features=50000, # build only from top features, ordered by term frequency
                         min_df=20,  # ignore a term if it appears in fewer than 20 documents
                         max_df=0.95)  # ignore a term if it appears in more than 95% of documents
# logistic regression (a binary classifier here, since `label` is 0 or 1)
logit = LogisticRegression(C=1,  # inverse regularization strength; smaller -> stronger regularization
                           n_jobs=4,  # parallelize
                           solver='lbfgs', 
                           random_state=17,  # fixed seed for reproducibility
                           verbose=1) 
# sklearn's pipeline
# Sequentially apply a list of transforms and a final estimator. 
# Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.
# The final estimator only needs to implement fit.
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), 
                                 ('logit', logit)] # Steps
                               )

tfidf_logit_pipeline.fit(train_texts, # training data
                         y_train) # labels

## test
valid_pred = tfidf_logit_pipeline.predict(valid_texts)

## accuracy score - fraction of correct ones
accuracy_score(y_valid, valid_pred)

In [None]:
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'], 
                  top=(10,10))

### Try bagging

In [None]:
tf_idf = TfidfVectorizer(ngram_range=(1, 2),  # unigrams and bigrams
                        # max_features=50000, # build only from top features, ordered by term frequency
                         min_df=20,  # ignore a term if it appears in fewer than 20 documents
                         max_df=0.95)  # ignore a term if it appears in more than 95% of documents

# fit_transform returns the sparse TF-IDF matrix (fit alone returns the vectorizer itself)
X = tf_idf.fit_transform(train_texts)

In [None]:
X

In [None]:
from sklearn.ensemble import BaggingClassifier


# logistic regression as the base estimator
logit = LogisticRegression(C=1,  # inverse regularization strength; smaller -> stronger regularization
                           solver='lbfgs') 

# bagging: 50 logistic regressions, each fit on a bootstrap sample half the size of the training set
bc = BaggingClassifier(logit, n_estimators=50, n_jobs=4, max_samples=0.5)
# sklearn's pipeline
# Sequentially apply a list of transforms and a final estimator. 
# Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.
# The final estimator only needs to implement fit.
bc_pl = Pipeline([('tf_idf', tf_idf), 
                  ('bc', bc)])  # steps

bc_pl.fit(train_texts,  # training data
          y_train)  # labels

# bc.fit(X, y_train)

## test
valid_pred = bc_pl.predict(valid_texts)

## accuracy score - fraction of correct ones
accuracy_score(y_valid, valid_pred)

In [None]:
from sklearn.model_selection import RandomizedSearchCV


# Search space for the bagging ensemble.
# Note: in scikit-learn >= 1.2 the nested key is 'estimator__C' instead of 'base_estimator__C'.
parameters = {'max_features': [0.2, 0.5, 0.9], 'max_samples': [0.4, 0.6, 0.8], 
              'base_estimator__C': [0.01, 1, 10, 100]}

# same vectorizer settings as before
tf_idf = TfidfVectorizer(ngram_range=(1, 2),  # unigrams and bigrams
                        # max_features=50000, # build only from top features, ordered by term frequency
                         min_df=20,  # ignore a term if it appears in fewer than 20 documents
                         max_df=0.95)  # ignore a term if it appears in more than 95% of documents

# the vectorizer is fit on the whole labeled set here, outside the cross-validation loop
X = tf_idf.fit_transform(train_df['comment'])

# logistic regression as the base estimator
logit = LogisticRegression(C=1,  # inverse regularization strength; smaller -> stronger regularization
                           solver='lbfgs') 

bc = BaggingClassifier(logit, n_estimators=50, n_jobs=4, max_samples=0.5)
# The pipeline version is left commented out: the search below runs on the precomputed matrix X.
# bc_pl = Pipeline([('tf_idf', tf_idf), 
#                                  ('bc', bc)] # Steps
#                                )


In [None]:
rscv = RandomizedSearchCV(bc, parameters, n_iter=20, n_jobs=4, cv=5, random_state=1)

rscv.fit(X, train_df['label'])

In [None]:
# Best mean cross-validated score (accuracy, since no `scoring` was passed) and the matching parameters
score = rscv.best_score_
param = rscv.best_params_
est = rscv.best_estimator_
print(score, param)
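
If ROC AUC is the metric of interest, it has to be requested through `scoring`. The sketch below also runs the search over a whole pipeline, so that the vectorizer is refit inside every fold. It is a toy setup under stated assumptions: the corpus, the step names and the parameter grid are made up, and the grid deliberately avoids the base estimator's `C`, whose nested key depends on the scikit-learn version.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, repeated so that every CV fold sees both classes.
toy_texts = ["yeah right", "oh sure", "great job genius",
             "thanks for the help", "nice weather today", "see you tomorrow"] * 10
toy_labels = [1, 1, 1, 0, 0, 0] * 10

toy_pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("bc", BaggingClassifier(LogisticRegression(), n_estimators=5, random_state=17)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
toy_params = {
    "tf_idf__min_df": [1, 2],
    "bc__max_samples": [0.5, 0.8],
    "bc__max_features": [0.5, 1.0],
}

toy_search = RandomizedSearchCV(toy_pipe, toy_params, n_iter=4, cv=3,
                                scoring="roc_auc", random_state=1)
toy_search.fit(toy_texts, toy_labels)
print(toy_search.best_score_, toy_search.best_params_)
```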