<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg">
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center> Assignment 4. Sarcasm detection with logistic regression
    
We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) "A Large Self-Annotated Corpus for Sarcasm", which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).

Sarcasm detection is easy. 
<img src="https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg" />

In [None]:
!ls ../input/sarcasm/

In [None]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

Some comments are missing, so we drop the corresponding rows.

In [None]:
train_df.dropna(subset=['comment'], inplace=True)
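
As a small illustration of what `dropna(subset=...)` does, here is a sketch on a made-up three-row frame (the toy data and the name `toy_df` are hypothetical, not taken from the dataset). Only rows whose `comment` is missing are removed; missing values in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from the Reddit dataset
toy_df = pd.DataFrame({
    "comment": ["Yeah, right.", np.nan, "Nice weather today"],
    "author": ["a", "b", None],
    "label": [1, 0, 0],
})

# Drop only the rows where `comment` is missing
cleaned = toy_df.dropna(subset=["comment"])

print(cleaned.shape)           # (2, 3)
print(cleaned.index.tolist())  # [0, 2]
```

Note that the original row index is kept, so the surviving rows are still labeled 0 and 2.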

The dataset is indeed balanced:

In [None]:
train_df['label'].value_counts()

In [None]:
train_df['label'].hist()

We split data into training and validation parts.

In [None]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(train_df['comment'], train_df['label'], random_state=17)
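
By default, `train_test_split` shuffles the rows and holds out 25% of them. The sketch below checks this on hypothetical toy data and also shows the optional `stratify` argument, which the cell above does not use; it is included only as an assumption about what one might want when class balance must be preserved exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 "texts" with perfectly balanced labels
texts = np.array([f"comment {i}" for i in range(100)])
labels = np.array([0, 1] * 50)

# stratify=labels keeps the class ratio (almost) identical in both parts
tr_x, va_x, tr_y, va_y = train_test_split(
    texts, labels, random_state=17, stratify=labels
)

print(len(tr_x), len(va_x))  # 75 25
print(tr_y.mean(), va_y.mean())
```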

## Tasks:
1. Analyze the dataset and make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.
2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).
3. Plot the words/bigrams that are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).
4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.
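
One possible way to approach the optional task is sketched below: vectorize the comment text and the subreddit name separately, then stack the two sparse matrices column-wise. The toy rows, the variable names, and the `token_pattern=r"\S+"` choice (each subreddit name becomes exactly one token) are assumptions made for illustration, not part of the assignment.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy rows standing in for `comment`, `subreddit`, `label`
toy = pd.DataFrame({
    "comment": ["yeah sure great idea", "thanks for the help",
                "oh totally believable", "the patch fixed it"],
    "subreddit": ["politics", "AskReddit", "politics", "programming"],
    "label": [1, 0, 1, 0],
})

text_vec = TfidfVectorizer()
# Each subreddit name is one whitespace-free token, so it becomes one feature
sub_vec = CountVectorizer(lowercase=False, token_pattern=r"\S+")

X_text = text_vec.fit_transform(toy["comment"])
X_sub = sub_vec.fit_transform(toy["subreddit"])

# Stack the text features and the subreddit indicator features side by side
X = hstack([X_text, X_sub]).tocsr()

clf = LogisticRegression(solver="lbfgs", random_state=17).fit(X, toy["label"])
print(X.shape, sorted(sub_vec.vocabulary_))
```

On validation data the same fitted vectorizers would be applied with `transform`, never `fit_transform`, so that column positions stay aligned.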

## Links:
  - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)
  - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection
  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) "Approaching (Almost) Any NLP Problem on Kaggle"
  - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions

Let's look at our data

In [None]:
train_df.head()

Let's find the most common words for the two types of comments.

In [None]:
sarcasm_texts = train_df[train_df["label"] == 1]
non_sarcasm_texts = train_df[train_df["label"] == 0]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

A helper function that finds the top n words in the CountVectorizer vocabulary:

In [None]:
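def get_top_n_words(corpus, n=None):
    # Learn the vocabulary and count word occurrences in every document
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum over documents to get the total count of each word
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort by frequency, most frequent first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]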
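
To sanity-check the counting logic, here is a self-contained copy of the same idea run on a hypothetical three-sentence corpus (the name `top_n_words` and the toy sentences are made up for this sketch).

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    counts = vec.transform(corpus).sum(axis=0)  # 1 x vocab_size matrix
    freq = [(word, int(counts[0, idx])) for word, idx in vec.vocabulary_.items()]
    return sorted(freq, key=lambda x: x[1], reverse=True)[:n]

# Hypothetical toy corpus
toy_corpus = ["the cat sat", "the cat ran", "the dog"]
print(top_n_words(toy_corpus, 2))  # [('the', 3), ('cat', 2)]
```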

For sarcastic comments:

In [None]:
most_sarcasm_words = get_top_n_words(train_df[train_df["label"] == 1]["comment"], 30)

For non-sarcastic comments:

In [None]:
most_non_sarcasm_words = get_top_n_words(train_df[train_df["label"] == 0]["comment"], 30)

Let's plot the most frequent words in sarcastic comments.

In [None]:
data = pd.DataFrame(most_sarcasm_words, columns=["word", "frequency"])
fig_dims = (18, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x="word", y="frequency", data=data, ax=ax)

And in non-sarcastic comments:

In [None]:
data = pd.DataFrame(most_non_sarcasm_words, columns=["word", "frequency"])
fig_dims = (18, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x="word", y="frequency", data=data, ax=ax)

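We see that the most frequent words are almost the same in both classes. Since they appear equally often in sarcastic and non-sarcastic comments, we can assume that they will have small weights in our model.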

We use CountVectorizer to build a vocabulary from the training comments.

In [None]:
cv = CountVectorizer()
cv.fit(train_texts)

The size of the vocabulary, i.e. the number of distinct words used in the training comments:

In [None]:
len(cv.vocabulary_)

Transform the training comments into a sparse matrix.

In [None]:
X_train = cv.transform(train_texts)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use it only on versions older than 1.0
print(cv.get_feature_names_out()[10000])

In [None]:
X_train[10000].nonzero()[1]
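
The column indices returned by `nonzero()` can be mapped back to words by inverting `vocabulary_`. A minimal sketch on a hypothetical two-comment corpus (names such as `toy_cv` and `idx_to_word` are made up here):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_texts = ["oh great another meeting", "great job team"]
toy_cv = CountVectorizer().fit(toy_texts)
toy_X = toy_cv.transform(toy_texts)

# Invert the word -> column mapping so words can be looked up by column index
idx_to_word = {idx: word for word, idx in toy_cv.vocabulary_.items()}

# Column indices of the non-zero entries in the second document
cols = toy_X[1].nonzero()[1]
words = sorted(idx_to_word[i] for i in cols)
print(words)  # ['great', 'job', 'team']
```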

In [None]:
X_test = cv.transform(valid_texts)

Let's fit a LogisticRegression model.

In [None]:
logit = LogisticRegression(solver='lbfgs', n_jobs=-1, random_state=7)
logit.fit(X_train, y_train)

And check the accuracy on the validation set.

In [None]:
logit.score(X_test, y_valid)

Let's build a pipeline for our model.

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
text_pipe_logit = make_pipeline(
    CountVectorizer(),
    LogisticRegression(solver='lbfgs', n_jobs=1, random_state=7))

In [None]:
%%time
text_pipe_logit.fit(train_texts, y_train)

In [None]:
text_pipe_logit.score(valid_texts, y_valid)
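
The task statement asks for a Tf-Idf + logistic regression pipeline, while the cells above use raw counts. A minimal sketch of the Tf-Idf variant follows; the step names, the `ngram_range=(1, 2)` setting (unigrams plus bigrams), and the toy texts are assumptions chosen for illustration, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumed settings: unigrams + bigrams, default regularization
tfidf_logit = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logit", LogisticRegression(solver="lbfgs", random_state=17)),
])

# Hypothetical toy data standing in for train_texts / y_train
toy_texts = ["yeah right", "oh sure great plan",
             "thanks for sharing", "good point well made"]
toy_labels = [1, 1, 0, 0]

tfidf_logit.fit(toy_texts, toy_labels)
preds = tfidf_logit.predict(toy_texts)

# The vocabulary now contains bigrams such as "yeah right"
vocab = tfidf_logit.named_steps["tfidf"].vocabulary_
print(len(vocab), "yeah right" in vocab)
```

On the real data the same object would be fitted with `tfidf_logit.fit(train_texts, y_train)` and scored on `valid_texts`.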

Let's find the optimal regularization parameter.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_logit = {'logisticregression__C': np.logspace(-3, 3, 20)}
grid_logit = GridSearchCV(text_pipe_logit, 
                          param_grid_logit, 
                          return_train_score=True, 
                          cv=3, n_jobs=-1)

grid_logit.fit(train_texts, y_train)

In [None]:
print(grid_logit.best_params_, grid_logit.best_score_, sep="\n")

Check the final score

In [None]:
grid_logit.score(valid_texts, y_valid)

In [None]:
plt.plot(grid_logit.param_grid["logisticregression__C"], grid_logit.cv_results_["mean_test_score"],
         color="red", label="test")
plt.xscale("log")  # the C values are log-spaced
plt.legend()

In our case, the cross-validation score barely depends on the regularization parameter.
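
For context, `C` in scikit-learn's `LogisticRegression` is the inverse of the regularization strength: a smaller `C` shrinks the weights more. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, unrelated to the sarcasm dataset) by comparing the norm of the weight vector for three values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=17)

norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X_toy, y_toy)
    # L2 norm of the learned weights (the intercept is not included)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(norms[C], 3))
```

The weight norm grows with `C`, even when the accuracy changes very little.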

The most important words:

In [None]:
import eli5
eli5.show_weights(text_pipe_logit, top=20)
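
If eli5 is not available, a similar ranking can be read directly from the model coefficients. This is a sketch on a hypothetical toy corpus, not a replacement for the eli5 output; the step names `countvectorizer` and `logisticregression` are the ones `make_pipeline` generates automatically.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: label 1 = sarcastic, 0 = not sarcastic
toy_texts = ["yeah right", "totally believable yeah",
             "thanks a lot", "good point thanks"]
toy_labels = [1, 1, 0, 0]

pipe = make_pipeline(CountVectorizer(), LogisticRegression(solver="lbfgs"))
pipe.fit(toy_texts, toy_labels)

vec = pipe.named_steps["countvectorizer"]
clf = pipe.named_steps["logisticregression"]

# Map column index -> word, then sort words by their weight
idx_to_word = {idx: word for word, idx in vec.vocabulary_.items()}
weights = clf.coef_[0]             # positive weight pushes toward label 1
order = np.argsort(weights)[::-1]  # largest weight first

top_positive = [(idx_to_word[i], round(float(weights[i]), 3)) for i in order[:3]]
top_negative = [(idx_to_word[i], round(float(weights[i]), 3)) for i in order[-3:]]
print(top_positive)
print(top_negative)
```

Here "yeah" appears in both sarcastic examples and "thanks" in both non-sarcastic ones, so they receive the largest positive and the most negative weight respectively.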