# Summary

* Read and explore text data with pandas
* From words to numbers: convert text to structured data
* Machine Learning on text
* Interpreting results 

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

# I- Text classification: introduction to the use case

## I.1 - Read data

We use the pandas.read_csv function to read the data:
* index_col=0: use the first column (review_id) as the index
* quoting=3: treat quote characters as ordinary text (csv.QUOTE_NONE)
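
To see what these two options do, here is a minimal sketch on a made-up two-row file (the ids and texts are invented for illustration, not taken from the dataset). With quoting=3, quote characters are kept as part of the values instead of being interpreted as field delimiters.

```python
import io
import pandas as pd

# Hypothetical two-row file: tab-separated, with literal quote characters
raw = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"A "great" film"\n'
    '"2_3"\t0\t"Not for me"\n'
)

toy = pd.read_csv(io.StringIO(raw), sep='\t', quoting=3, index_col=0)

# The first column became the index, and the quotes are still there
print(toy.index.tolist())
print(toy.loc['"1_8"', 'review'])
```

Without quoting=3, the nested quotes in the first review would be parsed as field delimiters and could corrupt the row.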

In [None]:
def load_data(path):
    data = pd.read_csv(path, sep='\t', quoting=3, index_col=0)
    return data

In [None]:
train = load_data('../input/labeledTrainData.tsv')
test = load_data('../input/testData.tsv')
unlabeled = load_data('../input/unlabeledTrainData.tsv')

#### Let's check some samples

In [None]:
def print_sample(sample):
    print('Id: {}'.format(sample.name))
    print('Sentiment: {}'.format(sample['sentiment']))
    print('Text:')
    print(sample['review'])

In [None]:
def get_a_sample_review():
    obs = train.sample().iloc[0]
    print_sample(obs)

In [None]:
get_a_sample_review()

## I.2 - Exploring text data

### pandas.Series.str accessor
A Series of strings in pandas exposes a .str accessor that provides many useful string functions:
* Simple usage and code
* Vectorized operations: applied to the whole Series at once, usually faster than an explicit loop

The exhaustive list is available in the pandas documentation; in this section we will use some of these functions.
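
As a small illustration, the sketch below compares a .str call with the equivalent explicit loop on a toy Series (the values are invented). Note how the accessor propagates missing values without extra code.

```python
import pandas as pd

reviews = pd.Series(['Great movie', 'Bad', None])

# Vectorized: one call for the whole Series, missing values stay missing
lengths = reviews.str.len()

# Equivalent explicit loop, which has to handle missing values itself
lengths_loop = [len(r) if isinstance(r, str) else None for r in reviews]

print(lengths)
print(lengths_loop)
```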

### Reviews length in characters

.str.len() returns the length of each string

In [None]:
train.review.str.len().hist(bins=100);

### Find some keywords

.str.contains(pattern) returns a Series of True/False values indicating whether each string contains the pattern.

The code below picks a random review containing the pattern.
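
Because the pattern is treated as a regular expression by default and the match is case sensitive, a plain 'bad' also matches words such as "badge" and misses "Bad". A minimal sketch on invented texts shows the difference between three variants:

```python
import pandas as pd

texts = pd.Series(['A bad film', 'Badly needed laughs', 'He wore a badge', 'Superb'])

# Case-sensitive substring match
plain = texts.str.contains('bad')
# Case-insensitive substring match
any_case = texts.str.contains('bad', case=False)
# Case-insensitive match on the whole word only
whole_word = texts.str.contains(r'\bbad\b', case=False)

print(pd.DataFrame({'plain': plain, 'any_case': any_case, 'whole_word': whole_word}))
```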

In [None]:
sample = train.loc[train.review.str.contains('bad')].sample().iloc[0]
print_sample(sample)

### Simple keywords approach

Intuitively, a few keywords are enough to get a rough idea of a review's rating.
Given a list of keywords, the function below computes for each keyword:
* the share of reviews that do not contain this keyword
* the share of all reviews that contain it and are positive / negative
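
To read a keyword's signal independently of how common it is, one can also look at the positive rate among only the reviews that contain it. The sketch below does this on a tiny invented DataFrame that mimics the review and sentiment columns.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the training set
toy = pd.DataFrame({
    'review': ['good fun', 'good but long', 'bad plot', 'not good', 'great'],
    'sentiment': [1, 1, 0, 0, 1],
})

mask = toy.review.str.contains('good', case=False)

# Share of reviews that contain the keyword
coverage = mask.mean()
# Share of positive reviews among those that contain it
positive_rate = toy.loc[mask, 'sentiment'].mean()

print(coverage, positive_rate)
```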

In [None]:
def check_keywords(keywords):
    # For each keyword: count reviews without it and positive reviews with it
    result = pd.DataFrame(columns=['not_present', 'positive'])
    for word in keywords:
        df = train.loc[train.review.str.contains(word, case=False)]
        result.loc[word] = [train.shape[0] - df.shape[0], df.sentiment.sum()]
    # Rows added with .loc can end up as object dtype; cast so plotting works
    result = result.astype(float) / train.shape[0]
    result['negative'] = 1 - result['not_present'] - result['positive']
    result.plot(kind='bar', stacked=True)
    plt.legend(loc='best')
    return result

In [None]:
keywords = ['good', 'bad', 'great', 'disaster', 'fun', 'phenomenal']

In [None]:
check_keywords(keywords)

# II- Cleaning texts

Let's create a column for the clean text

In [None]:
train['clean_review'] = train.review

## II.1 Remove HTML tags

.str.replace(pattern, replacement) is a very useful function. The pattern can be a __regular expression__ (in recent pandas versions this requires regex=True).

A __regular expression__ is a way to encode a pattern of characters. For the full syntax, please refer to the official Python documentation.

In this case, we are replacing HTML tags with an empty string.
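
The `?` in the pattern used here makes the match lazy, which matters when a text contains several tags. A minimal sketch with Python's re module on an invented string compares the lazy pattern with a greedy one:

```python
import re

text = 'Loved it.<br />Really.<br />Would watch again.'

# Lazy: each tag is matched on its own
lazy = re.sub(r'<.+? />', '', text)
# Greedy: one match runs from the first '<' to the last ' />'
greedy = re.sub(r'<.+ />', '', text)

print(lazy)
print(greedy)
```

The greedy version silently drops the sentence between the two tags, which is why the lazy form is the safer choice here.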

In [None]:
%%time
train['clean_review'] = train.clean_review.str.replace('<.+? />', '', regex=True)

**Before**

In [None]:
train.review.iloc[0]

**After**

In [None]:
train.clean_review.iloc[0]

## II.2 Convert all texts to lower case

.str.lower() converts all strings to lower case

See also: .str.upper(), .str.capitalize()

In [None]:
train['clean_review'] = train.clean_review.str.lower()

In [None]:
train.clean_review.iloc[0]

## II.3 Remove punctuation & special characters

Here we will simply remove special characters, digits and punctuation. In some cases these carry information, and you may want to encode them with a special word instead.
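
As a sketch of that alternative, the function below replaces digits and exclamation marks with placeholder words before stripping everything else. The token names `numtoken` and `exclmark` and the function name are hypothetical choices made for this example.

```python
import re

def clean_keep_signals(text):
    """Lower-case the text and keep numbers and '!' as placeholder words."""
    text = text.lower()
    text = re.sub(r'\d+', ' numtoken ', text)    # any run of digits
    text = re.sub(r'!+', ' exclmark ', text)     # any run of exclamation marks
    text = re.sub(r'[^a-z]', ' ', text)          # everything else that is not a letter
    return re.sub(r' +', ' ', text).strip()      # collapse and trim spaces

print(clean_keep_signals('Rated 10/10!!'))
```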

In [None]:
# Replace everything that is not a letter with a space
train['clean_review'] = train.clean_review.str.replace('[^a-zA-Z]', ' ', regex=True)
# Collapse repeated spaces into one
train['clean_review'] = train.clean_review.str.replace(' +', ' ', regex=True)
# Remove spaces at the beginning and end
train['clean_review'] = train.clean_review.str.strip()

In [None]:
train.clean_review.iloc[0]

## II.4 Introduction to stemming

Stemming removes part of a word and keeps its main part (the stem). It helps remove noise from documents. However, it can also remove useful information in some cases.

In [None]:
import nltk

In [None]:
stemmer = nltk.stem.SnowballStemmer('english')

In [None]:
stemmer.stem('people')

In [None]:
stemmer.stem('guys')

In [None]:
stemmer.stem('closed')

## II.5 Introduction to stopwords

In ordinary language, some words occur with very high frequency. Removing these words lets the model focus on rarer but more informative words. This can cause information loss in some cases, so it is better to craft a use-case specific stopword list.

In [None]:
# Requires the NLTK stopwords corpus: nltk.download('stopwords')
print(nltk.corpus.stopwords.words('english'))
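
The information-loss risk is easy to see with negations, which general-purpose stopword lists often include. The sketch below uses a small hand-picked set (hypothetical, not the NLTK list) and a helper written for this example.

```python
# Hypothetical, hand-picked stopword list for illustration only
stopwords = {'the', 'a', 'is', 'this', 'was', 'not'}

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated word found in the stopword set."""
    return ' '.join(w for w in text.split() if w not in stopwords)

review = 'this movie was not good'

# With 'not' treated as a stopword, the negation disappears
without_negation = remove_stopwords(review, stopwords)
# A use-case specific list keeps it
with_negation = remove_stopwords(review, stopwords - {'not'})

print(without_negation)
print(with_negation)
```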

## Exercise 1

Let's write all our cleaning operations in one function so that we can apply it consistently to the train, test and unlabeled sets

In [None]:
def review_cleaning(reviewSeries):
    result = reviewSeries.copy()
    # Remove HTML tags

    # Convert to lower case
    
    # Remove non alphabetic characters
    
    # Remove double space and strip spaces
    
    return result

In [None]:
train['clean_review'] = review_cleaning(train.review)
test['clean_review'] = review_cleaning(test.review)
unlabeled['clean_review'] = review_cleaning(unlabeled.review)

# III - From text to numbers

## III.1 - Bag of words - Count Vectorizer and Tf-Idf Vectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
test_corpus = [ 
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',     
    'Is this the first document?',]

#### Word Count vectorizer

In [None]:
count_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer='word')

In [None]:
count_vectorizer.fit(test_corpus)

*Once fitted, the vectorizer stores the vocabulary of the text corpus*

In [None]:
count_vectorizer.vocabulary_

*The transform method turns a sequence of texts into a sparse matrix*

In [None]:
count_vectorizer.transform(test_corpus)

In [None]:
count_vectorizer.transform(test_corpus).todense()
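
The dense matrix is easier to read once its columns are labelled. A minimal sketch, assuming the default CountVectorizer settings and the same four sentences: column j corresponds to the word whose index in vocabulary_ is j.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

cv = CountVectorizer()
counts = cv.fit_transform(corpus)

# Order the words by their column index in the matrix
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
table = pd.DataFrame(counts.toarray(), columns=columns)
print(table)
```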

#### Tf-Idf Vectorizer

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word')

In [None]:
tfidf_vectorizer.fit(test_corpus)

In [None]:
tfidf_vectorizer.vocabulary_

In [None]:
tfidf_vectorizer.transform(test_corpus)

In [None]:
tfidf_vectorizer.transform(test_corpus).todense()
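
To make the weighting less of a black box, the sketch below recomputes the values by hand for unigrams and compares them with TfidfVectorizer. It assumes scikit-learn's default settings: smoothed idf, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```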

## Exercise 2

Explore the parameters of the vectorizers

### a. Use count vectorizer with option _binary = True_

### b. Use count vectorizer with option *analyzer='word'* and *max_df=0.8*

### c. Use TF IDF vectorizer with option *analyzer='char'* and *ngram\_range=(3,5)*

## III.2 - Application to our use case

### Train - Validation - Test separation

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train.clean_review, train.sentiment, test_size=0.2, 
                                                          stratify=train.sentiment)
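
The stratify argument keeps the class proportions identical in both parts of the split. A minimal sketch on synthetic labels (30% positive, invented for this example) makes this visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 30 positive and 70 negative
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The positive rate is the same in the full set and in both parts
print(y.mean(), y_tr.mean(), y_va.mean())
```

Passing random_state, as done here, also makes the split reproducible from one run to the next.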

### Build a simple pipeline

In [None]:
from sklearn.linear_model import LogisticRegression

**Vectorizer**

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', # word or char
                             ngram_range=(1,1), # ngram setting  
                             binary=True, # set tf to binary instead of count
                             max_df=0.9, # remove too frequent words
                             min_df=10, # remove too rare words
                             max_features = None, # max words in vocabulary, will keep most frequent words
                             ) 

In [None]:
X_vect_train = vectorizer.fit_transform(X_train)

**ML model**

In [None]:
lreg = LogisticRegression()

In [None]:
lreg.fit(X_vect_train, y_train)

**Performance on validation**

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score

In [None]:
p_valid = lreg.predict_proba(
            vectorizer.transform(X_valid)
            )

In [None]:
accuracy_score(y_valid, p_valid.argmax(axis=1))
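
Taking the argmax of predict_proba works here because column k of the output holds the probability of classes_[k], and the labels are 0 and 1. The sketch below checks this on a tiny invented corpus; it is an illustration, not a benchmark.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus
texts = ['good fun film', 'great good cast', 'bad boring plot', 'bad awful film']
labels = np.array([1, 1, 0, 0])

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

new_texts = ['good cast', 'boring film']
proba = clf.predict_proba(vec.transform(new_texts))

# Column k holds the probability of clf.classes_[k]
from_argmax = clf.classes_[proba.argmax(axis=1)]
from_predict = clf.predict(vec.transform(new_texts))

print(proba)
print(from_argmax, from_predict)
```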

### Quickly test the pipeline

In [None]:
def test_prediction(p_test):
    target = test.index.str.slice(-2,-1).isin(['7','8','9','0']).astype(np.int8)
    print('accuracy: {}'.format(accuracy_score(target, p_test[:,1]>=0.5)))
    print('roc auc: {}'.format(roc_auc_score(target, p_test[:,1])))

In [None]:
p_test = lreg.predict_proba(
            vectorizer.transform(test.clean_review)
            )

In [None]:
test_prediction(p_test)
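
The target in test_prediction is rebuilt from the review ids. Reading the code, it appears to assume that ids look like `"<number>_<rating>"` with the surrounding quotes kept by quoting=3, and that ratings of 7 to 10 count as positive; this is an assumption inferred from the function, not something stated in the notebook. A minimal sketch on invented ids:

```python
import numpy as np
import pandas as pd

# Hypothetical ids in the assumed "<number>_<rating>" format, quotes included
ids = pd.Index(['"12311_10"', '"8348_2"', '"5828_4"', '"7186_7"'])

# Character just before the closing quote: the last digit of the rating
last_digit = ids.str.slice(-2, -1)

# Ratings 7, 8, 9 and 10 end with 7, 8, 9 and 0
target = last_digit.isin(['7', '8', '9', '0']).astype(np.int8)

print(list(last_digit))
print(target)
```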

## Exercise 3
**Optimize pipeline parameters**

**Try some random values for the parameters below and share your validation result:**
* vectorizer:
    * ngram_range: up to 3-gram
    * max_df: from 0.6 to 1.0
    * min_df: from 5 to 100
    * binary: True or False
    * use_idf: True or False
* lreg:
    * C: [10, 1, 1e-1, 1e-2, 1e-3]

# IV - Interpreting the model

## IV.1 Coefficients of the regression

The regression coefficients are stored in lreg.coef\_. We will match them with the words in the vocabulary.

**Get the coefficient for each word**

In [None]:
# vectorizer.vocabulary_ is a dictionary of word -> column index
vectorizer.vocabulary_['good']

In [None]:
# Let's store them in a DataFrame
coefs = pd.DataFrame(columns=['word'])
for word, ind in vectorizer.vocabulary_.items():
    coefs.loc[ind, 'word'] = word
coefs.sort_index(inplace=True)

In [None]:
coefs.head()

In [None]:
# The coefficients are stored in a 1 x n array
lreg.coef_

In [None]:
# Once sorted by index, the word order corresponds to the coefficient order
coefs['coefs'] = lreg.coef_[0,:]
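
The same word-to-coefficient matching can be written more compactly by sorting the vocabulary by column index. The sketch below does this on a tiny invented corpus; the variable names are chosen for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus
texts = ['good fun film', 'great good cast', 'bad boring plot', 'bad awful film']
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Words ordered by column index, so they line up with clf.coef_[0]
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
coef_by_word = pd.Series(clf.coef_[0], index=words).sort_values()

print(coef_by_word)
```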

**Get most relevant words**

In [None]:
most_relevant_words = coefs.iloc[np.argsort(coefs.coefs.abs())].tail(20)

In [None]:
most_relevant_words.sort_values('coefs', inplace=True)

In [None]:
def plot_impact(words, impacts):
    # Green bars for positive coefficients, red bars for negative ones
    pos_ind = (impacts > 0)
    position = np.arange(len(words))
    plt.barh(position[pos_ind], impacts[pos_ind], color='green')
    plt.barh(position[~pos_ind], impacts[~pos_ind], color='red')
    # Bars are centered on their position, so the ticks need no offset
    plt.yticks(position, words)
    plt.show()

In [None]:
plot_impact(most_relevant_words.word.values, most_relevant_words.coefs.values)

## IV.2 Explaining model output on examples 

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', # word or char
                             ngram_range=(1,1), # ngram setting  
                             binary=True, # set tf to binary instead of count
                             max_df=0.9, # remove too frequent words
                             min_df=10, # remove too rare words
                             max_features = None, # max words in vocabulary, will keep most frequent words
                             ) 

In [None]:
lreg = LogisticRegression()

In [None]:
pipeline = Pipeline([
        ('vectorizer', vectorizer), 
        ('lreg', lreg)
    ])

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
! conda install -yq seaborn

In [None]:
import seaborn as sns

In [None]:
def words_impacts(text):
    baseline_ = pipeline.predict_proba([text])[:,1]
    words_list = text.split()
    text_excl_words = list()
    for i, word in enumerate(words_list):
        new_words_list = text.split()
        new_words_list.pop(i)
        text_excl_words.append(' '.join(new_words_list)) 
    impacts = baseline_ - pipeline.predict_proba(text_excl_words)[:, 1]
    return words_list, impacts, baseline_
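
words_impacts measures each word's contribution by removing it and seeing how much the prediction changes. The sketch below shows the same leave-one-out idea with a hypothetical hand-written scoring function in place of the trained pipeline, so the numbers can be checked by hand.

```python
# Hypothetical word weights standing in for a trained model
weights = {'great': 2.0, 'boring': -1.5}

def score(text):
    """Sum the weights of the words in the text (unknown words count as 0)."""
    return sum(weights.get(w, 0.0) for w in text.split())

def leave_one_out_impacts(text):
    """Impact of a word = score of the full text minus score without that word."""
    words = text.split()
    baseline = score(text)
    impacts = []
    for i in range(len(words)):
        reduced = ' '.join(words[:i] + words[i + 1:])
        impacts.append(baseline - score(reduced))
    return words, impacts

words, word_impacts = leave_one_out_impacts('great cast but boring plot')
print(list(zip(words, word_impacts)))
```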

In [None]:
def reshape_pad(iterable, shape, pad_value=0):
    n_ = len(iterable)
    pad_length = shape[0] * shape[1] - n_
    data = list(iterable)
    data.extend([pad_value for _ in range(pad_length)])
    assert len(data) == shape[0] * shape[1]
    data = np.reshape(data, shape)
    return data

def plot_text_with_impacts(words_list, impacts, n_cols=10):
    assert len(words_list) == len(impacts)
    n_rows = (len(words_list) // n_cols) + 1
    words = reshape_pad(words_list, (n_rows, n_cols), pad_value='')
    impact_data = reshape_pad(impacts, (n_rows, n_cols))
    plt.figure(figsize=(20, max(1, n_rows // 2)))
    sns.heatmap(impact_data, annot=words, square=False, fmt='')
    sns.set(font_scale=1)

In [None]:
review_sample = X_valid.sample()
words_list, impacts, baseline = words_impacts(review_sample.iloc[0])

In [None]:
print('Ground_truth: {}'.format(y_valid.loc[review_sample.index].iloc[0]))
print('Predicted probability: {}'.format(baseline[0]))
plot_text_with_impacts(words_list, impacts, n_cols=15)

## IV.3 Let's try a word cloud

In [None]:
!conda install -yq -c amueller wordcloud

In [None]:
from wordcloud import WordCloud, wordcloud, ImageColorGenerator

In [None]:
def wcloud_color(word, font_size, position, orientation, random_state=None, **kwargs):
    # Color each word by the sign of its impact
    i = words_list.index(word)
    if impacts[i] > 0:
        return 'Tomato'
    else:
        return 'DodgerBlue'

In [None]:
wcloud = WordCloud(max_words=100, 
                   background_color='white', 
                   color_func=wcloud_color, 
                   relative_scaling=1)

In [None]:
# Recent wordcloud versions expect a dict of word -> frequency
img = wcloud.generate_from_frequencies(dict(zip(words_list, np.abs(impacts))))

In [None]:
plt.imshow(img);

# Appendix

## RandomSearch on Pipeline parameters

In [None]:
from sklearn.pipeline import Pipeline

A pipeline exposes the parameters of its components. Each parameter name becomes "(component name)__(parameter name)"
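
As a quick illustration of this naming scheme, the sketch below sets two component parameters through the pipeline with set_params and reads them back. The step names match the ones used in this notebook; the values are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('lreg', LogisticRegression()),
])

# "<step name>__<parameter name>" reaches into the matching component
pipe.set_params(vectorizer__ngram_range=(1, 2), lreg__C=0.1)

print(pipe.named_steps['vectorizer'].ngram_range)
print(pipe.get_params()['lreg__C'])
```

This is the same mechanism that RandomizedSearchCV uses to apply the keys of a parameter grid to a pipeline.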

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', # word or char
                             ngram_range=(1,1), # ngram setting  
                             binary=True, # set tf to binary instead of count
                             max_df=0.9, # remove too frequent words
                             min_df=10, # remove too rare words
                             max_features = None, # max words in vocabulary, will keep most frequent words
                             ) 

In [None]:
lreg = LogisticRegression()

In [None]:
pipeline = Pipeline([
        ('vectorizer', vectorizer), 
        ('lreg', lreg)
    ])

In [None]:
pipeline.get_params()['vectorizer__analyzer']

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
## answer
params_grid = {
    'lreg__C': [10, 1, 1e-1, 1e-2, 1e-3],
    'vectorizer__ngram_range': [(1,1), (1,2), (1,3)],
    'vectorizer__stop_words':['english', None],
    'vectorizer__min_df': [5, 10, 20, 50, 100],
    'vectorizer__max_df': [0.6, 0.7, 0.8, 0.9, 1.0],
    'vectorizer__binary': [True, False],
    'vectorizer__use_idf': [True, False]
}

In [None]:
search_ = RandomizedSearchCV(pipeline, params_grid, 
                             n_iter=4, n_jobs=4, verbose=1, cv=5)

In [None]:
search_.fit(X_train, y_train)

In [None]:
search_.best_score_

In [None]:
search_.best_params_