# IASA Natural Language Processing Workshop (Part 1)

The purpose of this workshop is to give a general overview of text analytics techniques and methods, and of natural language processing in particular, from classical approaches to modern state-of-the-art ones. After completing this course, attentive students will have a solid general understanding of the NLP domain and will be able to implement various basic algorithms for text data processing. A deeper understanding, however, will require a more detailed study of the proposed materials. 

These particular materials were created by <a href="https://www.kaggle.com/abazdyrev">Anton Bazdyrev</a> and <a href="https://www.kaggle.com/yakuben">Oleksii Yakubenko</a> for IASA students and inspired by <a href="https://mlcourse.ai/">MLCOURSE.AI</a> and <a href="https://ods.ai/">ODS.AI</a> by <a href="https://www.kaggle.com/kashnitsky">
Yury Kashnitsky</a>.

In [None]:
from IPython.display import Image

# Text Mining and NLP roadmaps and tasks overview
## 1. Low Level Tasks

1. Tokenization <br>
    Goal: Split a given sentence into a sequence of tokens.
    
2. Sentence boundary detection <br>
    Goal: Split a given text into a sequence of sentences.
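
As a rough illustration of both tasks, here is a minimal regex-based sketch. The helper names `tokenize` and `split_sentences` and the regular expressions are assumptions made for this example; real tokenizers handle abbreviations, URLs, emoji and many other cases that this sketch ignores.

```python
import re

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens (naive regex)."""
    # \w+ matches a run of letters/digits/underscores, [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Forest fire near La Ronge. Is everyone safe? Stay away!"
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```

Note that a rule this simple would wrongly split after abbreviations such as "Dr." or "e.g.", which is one reason sentence boundary detection is treated as a task of its own.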

In [None]:
Image("../input/nlp-lectures-images/tokenization.png")

## 2. Text Meaning/Representation Tasks

Goal: represent meaning and context <br>
Applications: creating word and sentence embeddings, finding similar words (similar vectors) <br>
Representation: word vectors, the mapping of words to vectors (n-dimensional numeric vectors) aka embeddings
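
To make "similar words have similar vectors" concrete, the sketch below compares hand-made 3-dimensional vectors using cosine similarity. The vectors and the helpers `cosine_similarity` and `most_similar` are invented for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Toy "embeddings": the numbers are made up for illustration, not learned.
embeddings = {
    "fire":  [0.9, 0.1, 0.0],
    "flame": [0.8, 0.2, 0.1],
    "piano": [0.0, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word):
    """Return the other word whose vector has the highest cosine similarity."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(most_similar("fire"))
```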


In [None]:
Image("../input/nlp-lectures-images/embeddings.png")

## 3. Text Classification Tasks

Goal: predict classes (categories) for given text. <br>
Applications: spam detection, toxicity detection, domain classification, importance detection. <br>
Representation: bag of words, count and tf-idf vectorizers (do not preserve word order), sequences of embeddings (preserve word order)
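
A quick way to see what "does not preserve word order" means is to count words in two sentences that contain the same words in a different order. The helper `bag_of_words` below is a hypothetical, stripped-down stand-in for a count vectorizer, using only whitespace splitting.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, discarding their order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences map to the same counts, although their meanings differ.
print(a == b)
```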

In [None]:
Image("../input/nlp-lectures-images/classification.png")

## 4. Sequence Processing Tasks

Goal: language modeling - predict next/previous word(s), text generation, sequence labeling <br>
Applications: sequence tagging (predict POS tags for each word in sequence), named entity recognition <br>
Representation: sequences of embedding vectors of words (preserves word order) 
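
For the language modeling goal, the simplest possible sketch is a count-based bigram model that predicts the next word from the previous one. The corpus and the helper `predict_next` are made up for this example; modern language models use embeddings and neural networks rather than raw counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus.
corpus = [
    "the fire is spreading",
    "the fire is out",
    "the fire is spreading fast",
]

# Count how often each word follows each preceding word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the toy corpus, or None."""
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("is"))
```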

In [None]:
Image("../input/nlp-lectures-images/ner.png")

## 5. Sequence to Sequence Tasks

Goal: generate text for some given text <br>
Applications: machine translation, chatbots, Q&A systems, summarization <br>
Representation: sequences of embedding vectors of words (preserves word order)

In [None]:
Image("../input/nlp-lectures-images/seq2seq-teacher-forcing.png")

## Roadmaps:

In [None]:
Image("../input/nlp-lectures-images/textmining.png")

In [None]:
Image("../input/nlp-lectures-images/nlp.png")

# Task: Real or Not? NLP with Disaster Tweets
<a href="https://www.kaggle.com/c/nlp-getting-started/overview">Kaggle Competition</a> <br>
Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

In [None]:
Image('../input/nlp-lectures-images/tweet_screenshot.png')

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

## Data Loading
What am I predicting?<br>
You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

In [None]:
import pandas as pd

train = pd.read_csv('../input/nlp-getting-started/train.csv', index_col='id')
train.head()

In [None]:
train = train.drop(columns=['keyword', 'location'])
train.head()

## Data overview

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

In [None]:
print(train.shape)

train['target'].value_counts(normalize=True)

In [None]:
disaster_samples = train[train['target'] == 1]
fake_samples = train[train['target'] == 0]

In [None]:
for sample in disaster_samples['text'].sample(3, random_state=42):
    print(sample)
    print('\n=======\n')

In [None]:
for sample in fake_samples['text'].sample(3, random_state=42):
    print(sample)
    print('\n=======\n')

In [None]:
text = disaster_samples['text'].values

wc = WordCloud(max_font_size=60, background_color="black", max_words=2000, stopwords=STOPWORDS)
wc.generate(" ".join(text))
plt.figure(figsize=(12,6))
plt.axis("off")
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17),interpolation="bilinear")
plt.show()

In [None]:
text = fake_samples['text'].values

wc = WordCloud(max_font_size=60, background_color="black", max_words=2000, stopwords=STOPWORDS)
wc.generate(" ".join(text))
plt.figure(figsize=(12,6))
plt.axis("off")
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17),interpolation="bilinear")
plt.show()

## Baseline solutions

In [None]:
from sklearn.metrics import accuracy_score

### Trivial Solution

In [None]:
constant_prediction = [0 for text in train['text']]

constant_accuracy_score = accuracy_score(train['target'], constant_prediction)
print(f'accuracy_score : {constant_accuracy_score}')

### Popular words differences

In [None]:
from collections import Counter

disaster_text = ' '.join(disaster_samples['text'].tolist())
disaster_words = disaster_text.split()
disaster_count = Counter(disaster_words)
disaster_count = pd.Series(disaster_count)
disaster_count = disaster_count.sort_values(ascending=False)

fake_text = ' '.join(fake_samples['text'].tolist())
fake_words = fake_text.split()
fake_count = Counter(fake_words)
fake_count = pd.Series(fake_count)
fake_count = fake_count.sort_values(ascending=False)

In [None]:
disaster_count.iloc[:10]

In [None]:
#removing stopwords
disaster_count[~disaster_count.index.str.lower().isin(STOPWORDS)].iloc[:20]

In [None]:
print('Disaster', disaster_count[~disaster_count.index.str.lower().isin(STOPWORDS)].iloc[:30].index)
print('Fake', fake_count[~fake_count.index.str.lower().isin(STOPWORDS)].iloc[:30].index)

It seems that if a tweet contains words like fire, killed, suicide, disaster, bomb, crash, families, police, buildings, fatal, train, burning and so on, it is more likely to be about a real disaster than a fake one.

In [None]:
disaster_words = ['fire', 'killed', 'suicide', 'disaster', 'bomb', 'crash', 'families', 'police', 'buildings', 'fatal', 'train', 'burning']

basic_prediction = [any([d_w in text.lower() for d_w in disaster_words]) for text in train['text']]

basic_accuracy_score = accuracy_score(train['target'], basic_prediction)
print(f'accuracy_score : {basic_accuracy_score}')

# Machine Learning Approach
![mlvd](https://github.com/mephistopheies/mlworkshop39_042017/raw/a6426fd652faa38864c3ea4538e000539106fb56/1_ml_intro/ipy/images/bengio.png)

In [None]:
Image("../input/nlp-lectures-images/machine_learning_overview.jpg")

## Supervised learning

### Model

*Model* - parametric space of functions (hypotheses):

$\large \mathcal{H} = \left\{ h\left(x, \theta\right) | \theta \in \Theta \right\}$

* where
    * $\large h: X \times \Theta \rightarrow Y$    
    * $\large \Theta$ - parametric space
    * $\large X$ - factor space (exogenous variables)
    * $\large Y$ - target space
    
### Training algorithm


*Training algorithm* - map from the space of datasets to the hypothesis space:

$\large \mathcal{M}: \left(X \times Y\right)^n \rightarrow \mathcal{H}$

There are 2 steps in the algorithm:
1. Training: selection of a hypothesis $\large h = \mathcal{M}\left(\mathcal{D}\right)$, where $\large \mathcal{D}$ is our particular dataset of $\large n$ examples
2. Testing: for a given example $\large x$, calculation of the model prediction $\large \hat{y} = h\left(x\right)$

### Selection of hypothesis

Define *loss function*:
$\large L: Y \times Y \rightarrow \mathbb{R}$ <br>

A loss function measures how much the prediction $\large \hat {y}$ differs from the ground truth value $\large y$. Averaging it over the training dataset gives the empirical risk:

$\large Q_{\text{emp}}\left(h\right) = \frac{1}{n} \sum_{i=1}^n L\left(h\left(x_i\right), y_i\right)$, 
where $\large \mathcal{D} = \left\{ \left(x_i, y_i\right) \right\}$ - our training dataset, $\large h$ - hypothesis (function)

We should select the hypothesis that minimizes the average loss:
$\large \hat{h} = \arg \min_{h \in \mathcal{H}} Q_{\text{emp}}\left(h\right)$

Examples of loss functions:
* classification: $\large L\left(\hat{y}, y\right) = \text{I}\left[\hat{y} \neq y\right]$
* regression: $\large L\left(\hat{y}, y\right) = \left(\hat{y} - y\right)^2$
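
The selection of a hypothesis by minimizing the average loss can be sketched in a few lines. The dataset, the two candidate threshold classifiers and the function names below are hypothetical; the classification loss is taken to be 1 for a wrong prediction and 0 for a correct one.

```python
def zero_one_loss(y_pred, y_true):
    """Classification loss: 1 for a wrong prediction, 0 for a correct one."""
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """Regression loss: squared difference."""
    return (y_pred - y_true) ** 2

def empirical_risk(h, data, loss):
    """Average loss of hypothesis h over a dataset of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in data) / len(data)

# Hypothetical toy dataset and two candidate hypotheses (threshold classifiers).
data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
hypotheses = {
    "threshold 0.5": lambda x: int(x > 0.5),
    "threshold 0.3": lambda x: int(x > 0.3),
}

# Pick the hypothesis with the smallest empirical risk.
risks = {name: empirical_risk(h, data, zero_one_loss) for name, h in hypotheses.items()}
best = min(risks, key=risks.get)
print(risks, best)
```

Here the hypothesis space contains only two functions, so the minimum can be found by enumeration; for parametric models such as logistic regression the search is done by numerical optimization instead.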

## Example: Logistic Regression


### Model
Given $\large \mathcal{D} = \left\{ \left(x_i, y_i\right) \right\}$ - our training dataset, $\large \mathcal{H} = \left\{ h\left(x, \theta\right) | \theta \in R^n \right\}$ - logistic regression model

* where
    * $\large x_i \in  R^n$    
    * $\large y_i \in  \left\{0, 1\right\}$
    * $\large h(x_i, \theta) = \sigma \left(\theta ^ T x_i \right) = \dfrac{1}{1 + e ^ {- \theta ^ T x_i}} = \dfrac{1}{1 + e ^ {- \sum_{j=1}^n \theta_j x_{ij}}},$ $\large h(x_i, \theta) \in \left(0, 1\right)$
    * $\sigma \left(z\right) = \dfrac{1}{1 + e ^ {-z}}$ - logistic function, aka sigmoid
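
The model definition translates almost directly into code. This is a plain-Python sketch with hypothetical parameter values, not a numerically hardened implementation (`math.exp` overflows for very large negative inputs).

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(x, theta):
    """Logistic regression output: sigmoid of the dot product theta^T x."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, x)))

theta = [2.0, -1.0]          # hypothetical parameters
print(h([0.0, 0.0], theta))  # theta^T x = 0, so the output is exactly 0.5
print(h([3.0, 1.0], theta))  # theta^T x = 5, so the output is close to 1
```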
    
### Loss function
$y_i \sim Bernoulli(p_i)$, where $p_i$ is an unknown parameter that depends on $x_i$ <br>
We model the conditional probability as $Pr\left\{y_i = 1 | x_i \right\} = \large h(x_i, \theta)$ and estimate $\theta$ with Maximum Likelihood Estimation.<br> It follows that $Pr\left\{y_i = 0 | x_i \right\} = 1 - \large h(x_i, \theta)$ <br>
Both cases can be written as a single expression: $Pr\left\{y = y_i | x_i \right\} = \large h(x_i, \theta)^{y_i} \left(1 - \large h(x_i, \theta)\right)^{1 - y_i}$ <br> 
We now calculate the likelihood function, assuming that all $m$ observations in the sample are independent and Bernoulli distributed: <br>

$L\left(\theta | x\right) = \prod_{i=1}^m Pr\left\{y = y_i | x_i \right\} = \prod_{i=1}^m h\left(x_i, \theta\right)^{y_i} \left(1 - h\left(x_i, \theta\right)\right)^{1 - y_i}$ <br>


Typically, the log-likelihood is maximized, because the logarithm turns the product into a sum: 

$\log\left(L\left(\theta | x\right) \right) = \log\left(\prod_{i=1}^m Pr\left\{y = y_i | x_i \right\}\right) = \log\left(\prod_{i=1}^m h\left(x_i, \theta\right)^{y_i} \left(1 - h\left(x_i, \theta\right)\right)^{1 - y_i}\right) = \sum_{i=1}^m \log\left(h\left(x_i, \theta\right)^{y_i} \left(1 - h\left(x_i, \theta\right)\right)^{1 - y_i}\right) = \sum_{i=1}^m \left[ y_i \log\left(h\left(x_i, \theta\right) \right) + \left(1 - y_i\right) \log\left(1 - h\left(x_i, \theta\right) \right) \right]$ <br>

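So, finally, we can define our classification loss function as the negative log-likelihood of a single example: $\large L\left(\hat{y}, y\right) = - y\log\left(\hat{y} \right) - \left(1 - y\right) \log\left(1 - \hat{y} \right)$, where $\hat{y} = h\left(x, \theta\right)$. Maximizing the log-likelihood is then equivalent to minimizing the sum of these losses.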
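
The link between this loss and the likelihood can be checked numerically. The predictions and labels below are made-up values; `log_loss` is a direct transcription of the formula and fails for predictions of exactly 0 or 1, where the logarithm is undefined.

```python
import math

def log_loss(y_hat, y):
    """Log-loss for one example: -y*log(y_hat) - (1 - y)*log(1 - y_hat)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# A confident correct prediction is cheap, a confident wrong one is expensive.
print(log_loss(0.9, 1))
print(log_loss(0.1, 1))

# The summed loss equals the negative log of the Bernoulli likelihood.
preds, labels = [0.9, 0.2, 0.7], [1, 0, 1]
likelihood = math.prod(p if y == 1 else 1 - p for p, y in zip(preds, labels))
total_loss = sum(log_loss(p, y) for p, y in zip(preds, labels))
print(math.isclose(total_loss, -math.log(likelihood)))
```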

Now that we have a dataset, a model and a loss function, we can perform training and find the best hypothesis $\large \hat{h} =\arg \min_{h \in \mathcal{H}} \left(\frac{1}{n} \sum_{i=1}^n L\left(h\left(x_i\right), y_i\right)\right)$, where $n$ here denotes the number of training examples (written $m$ in the likelihood above). <br>

### Optimization
We can find it using the Gradient Descent method: <br>
Gradient descent is based on the observation that if a multi-variable function $\large F$ is defined and differentiable in a neighborhood of a point $\large a$, then $\large F$ decreases fastest if one goes from $\large a$ in the direction of the negative gradient of $\large F$ at $\large a$. <br>

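Define the update rule $\vec{w_{n + 1}} = \vec{w_{n}} - \alpha\frac{\partial \mathcal{L}}{\partial \vec{w}} \left(\vec{w_{n}}\right)$, where $\alpha$ is a small positive number called the learning rate. <br>

If $\vec{w_{n}}$ converges, then it converges to a stationary point of $\mathcal{L}$, which in practice is usually a local minimum. If $\mathcal{L}$ is a convex function, every local minimum is global, so with a suitably small $\alpha$ gradient descent approaches the global minimum. The log-loss of logistic regression is convex in the parameters.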
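
The update rule fits in a few lines. The sketch below applies it to a one-dimensional convex function chosen purely for illustration; the function name `gradient_descent` and its default values are assumptions of this example.

```python
def gradient_descent(grad, w0, alpha=0.1, n_steps=100):
    """Repeat w <- w - alpha * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(n_steps):
        w = w - alpha * grad(w)
    return w

# Convex example: F(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

With `alpha=0.1` each step shrinks the distance to the minimum by a factor of 0.8. With a learning rate that is too large for this function (for example `alpha=1.1`) the iterates move away from the minimum instead, which is why $\alpha$ has to be small.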

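For our loss function (log-loss) of the logistic regression model, writing $\vec{w}$ for the parameters $\theta$ and $\sigma$ for $\sigma\left(\vec{w}^T \vec{x}_i\right)$, and using $\sigma'\left(z\right) = \sigma\left(z\right)\left(1 - \sigma\left(z\right)\right)$, the negative gradient is: <br>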

$$\large \begin{array}{rcl} - \frac{\partial \mathcal{L}}{\partial \vec{w}} &=& \frac{\partial}{\partial \vec{w}}\sum_{i=1}^n y_i \ln \sigma\left(\vec{w}^T \vec{x}_i\right) + \left(1 - y_i\right) \ln \left(1 - \sigma\left(\vec{w}^T \vec{x}_i\right)\right) \\
&=& \sum_{i=1}^n y_i \frac{1}{\sigma} \sigma \left(1 - \sigma\right) \vec{x}_i + \left(1 - y_i\right) \frac{1}{1 - \sigma} \left(-1\right)\sigma \left(1 - \sigma\right) \vec{x}_i \\
&=& \sum_{i=1}^n y_i \left(1 - \sigma\right) \vec{x}_i - \left(1 - y_i\right) \sigma \vec{x}_i \\
&=& \sum_{i=1}^n \vec{x}_i \left(y_i - \sigma\right)
\end{array}$$
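
The closed form derived above, $\sum_i \vec{x}_i \left(y_i - \sigma\right)$, is all that is needed to train logistic regression with gradient descent. The following NumPy sketch uses synthetic data, a hand-picked learning rate and hypothetical function names; it is meant to illustrate the derivation, not to replace a library implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_sum(w, X, y):
    """Summed log-loss over the dataset for parameters w."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def negative_gradient(w, X, y):
    """Closed form from the derivation: sum_i x_i * (y_i - sigma(w^T x_i))."""
    return X.T @ (y - sigmoid(X @ w))

# Synthetic toy data: the label is 1 when the two features plus some noise sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=200) > 0).astype(float)

w = np.zeros(2)
alpha = 0.01  # learning rate, chosen by hand for this toy problem
losses = []
for _ in range(200):
    losses.append(log_loss_sum(w, X, y))
    w = w + alpha * negative_gradient(w, X, y)  # step along the negative gradient

print(losses[0], losses[-1], w)
```

The loss starts at $200 \ln 2$ (all predictions equal to 0.5) and decreases as both weights grow positive, matching how the labels were generated.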

# Validation

If we have a few different models, how can we decide which one performs better for our task? <br>
We can measure the quality of predictions using some metric or loss function. <br>

Let's look at an example: <br>
We have a regression task with target variable y. We can observe x0, x1 and we assume that we can estimate y as a function of x0, x1. We also know that there may be some "noise" in our data and that other factors may affect the target y. We have 500 examples in the training set and 100 examples in the evaluation set.

In [None]:
import numpy as np
np.random.seed(42)

X_train = np.random.normal(0, 1, (500, 2))
noise_train = np.random.normal(0, 1., 500)

X_test = np.random.normal(0, 1, (100, 2))
noise_test = np.random.normal(0, 1., 100)

y_train = (np.sin(X_train) + np.cos(2*X_train)).sum(axis=1) + noise_train
y_test = (np.sin(X_test) + np.cos(2*X_test)).sum(axis=1) + noise_test

Let's build a simple Linear Regression model and use Mean Squared Error as the target metric.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

y_train_preds = lr.predict(X_train)
y_test_preds = lr.predict(X_test)

print(f'MSE on train: {mean_squared_error(y_train, y_train_preds)}\nMSE on test: {mean_squared_error(y_test, y_test_preds)}')

Maybe we can improve the results by adding complexity, such as polynomial features, to our model?

In [None]:
lr = LinearRegression()

X_train_poly = np.hstack([X_train, X_train**2])
X_test_poly = np.hstack([X_test, X_test**2])

lr.fit(X_train_poly, y_train)

y_train_preds = lr.predict(X_train_poly)
y_test_preds = lr.predict(X_test_poly)

print(f'MSE on train: {mean_squared_error(y_train, y_train_preds)}\nMSE on test: {mean_squared_error(y_test, y_test_preds)}')

Wow! That's great, we've got some improvement by just adding squared features to our model. But can we improve it more? Let's add more powers, up to 10!



In [None]:
lr = LinearRegression()

# powers 1..10 of each feature (the intercept is fitted by LinearRegression itself)
X_train_poly = np.hstack([X_train**p for p in range(1, 11)])
X_test_poly = np.hstack([X_test**p for p in range(1, 11)])

lr.fit(X_train_poly, y_train)

y_train_preds = lr.predict(X_train_poly)
y_test_preds = lr.predict(X_test_poly)

print(f'MSE on train: {mean_squared_error(y_train, y_train_preds)}\nMSE on test: {mean_squared_error(y_test, y_test_preds)}')

Hmmm... That looks really weird, doesn't it? Why did we get a big improvement on train and such a terrible result on test?

In [None]:
from matplotlib import pyplot as plt

max_powers = list(range(1, 11))
train_result = []
test_result = []
for pow_n in max_powers:
    lr = LinearRegression()

    # use powers 1..pow_n of each feature
    X_train_poly = np.hstack([X_train**p for p in range(1, pow_n + 1)])
    X_test_poly = np.hstack([X_test**p for p in range(1, pow_n + 1)])

    lr.fit(X_train_poly, y_train)

    y_train_preds = lr.predict(X_train_poly)
    y_test_preds = lr.predict(X_test_poly)
    
    train_result.append(mean_squared_error(y_train, y_train_preds))
    test_result.append(mean_squared_error(y_test, y_test_preds))
    
plt.plot(max_powers, train_result, label='train')
plt.plot(max_powers, test_result, label='test')
plt.xlabel('maximum power')
plt.ylabel('MSE')
plt.legend()
plt.show()

The thing is that a more complex model can "explain" a functional dependency in the training data even when no such dependency actually exists: it fits the noise. Real-world training data always contains this kind of "noise", so it is very important to validate a model on data that was not seen during the training phase.

# Back to the original task: Modeling and Validation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
Image('../input/nlp-lectures-images/countvectorizer.png')

In [None]:
Image('../input/nlp-lectures-images/countvectorizer2.jpg')

In [None]:
#split to train test for validation purposes
X, y = train['text'], train['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#init model
count_vectorizer = CountVectorizer(binary=True)
log_reg = LogisticRegression(solver='liblinear', random_state=42)

In [None]:
#fit model
count_vectorizer.fit(X_train)
X_train_vectorized = count_vectorizer.transform(X_train)

log_reg.fit(X_train_vectorized, y_train);

In [None]:
#prediction

X_test_vectorized = count_vectorizer.transform(X_test)

prediction_train = log_reg.predict(X_train_vectorized)
prediction_test = log_reg.predict(X_test_vectorized)

In [None]:
accuracy_score(y_train, prediction_train), accuracy_score(y_test, prediction_test),

So, as we've written before, there can actually be a large difference between results on train and test data! We treat the test results as the benchmark.

# Scikit-Learn Pipeline API

In [None]:
Image('../input/nlp-lectures-images/pipeline.png')

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion

#init
count_vectorizer = CountVectorizer()
log_reg = LogisticRegression(solver='liblinear', random_state=42)

model = Pipeline([('count_vectorizer', count_vectorizer),  ('log_reg', log_reg)])

In [None]:
#fit
model.fit(X_train, y_train);

In [None]:
#predict
accuracy_score(y_test, model.predict(X_test))

# Model Improvement
Let's calculate the tf-idf measure instead of simple counts for each term.
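
As a small worked example, the sketch below computes tf-idf by hand on three made-up documents. It assumes one common textbook variant, tf = (count of the term in the document) / (document length) and idf = log(N / df); scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy documents.
docs = [
    "fire in the forest",
    "the forest is beautiful",
    "the concert was fire",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc):
    """Textbook variant: (count in doc / doc length) * log(N / df)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term])
    return tf * idf

print(tf_idf("the", tokenized[0]))     # appears in every document, so idf = log(1) = 0
print(tf_idf("forest", tokenized[0]))  # appears in two of three documents
print(tf_idf("in", tokenized[0]))      # appears in one document only, so it scores highest here
```

Words that occur everywhere, such as "the", get a weight of zero, while words that are specific to a document get the largest weights.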

In [None]:
Image('../input/nlp-lectures-images/tf_idf.png')

#### TF-IDF calculation example:

In [None]:
Image('../input/nlp-lectures-images/tf-idf-example.jpg')

We can also tune some vectorizer parameters in scikit-learn.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

word_vectorizer = TfidfVectorizer(
    analyzer='word',
    stop_words='english',
    ngram_range=(1, 3),
    lowercase=True,
    min_df=5,
    max_features=30000)

# stop_words only applies to analyzer='word', so it is omitted here
char_vectorizer = TfidfVectorizer(
    analyzer='char',
    ngram_range=(3, 6),
    lowercase=True,
    min_df=5,
    max_features=50000)


log_reg = LogisticRegression(solver='liblinear', random_state=42)

fu = FeatureUnion([('word_vectorizer', word_vectorizer),  ('char_vectorizer', char_vectorizer)])
model = Pipeline([('vectorizers', fu),  ('log_reg', log_reg)])

In [None]:
model.fit(X_train, y_train);

In [None]:
accuracy_score(y_test, model.predict(X_test))

# Model Explainer

In [None]:
import eli5

log_reg = model.named_steps['log_reg']
vectorizers = model.named_steps['vectorizers']
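# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
feature_names = list(vectorizers.get_feature_names_out())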

In [None]:
eli5.explain_weights(log_reg, feature_names=feature_names, top=100)

In [None]:
print(X_test.iloc[18])
eli5.show_prediction(log_reg, X_test.iloc[18], vec=vectorizers, feature_names=feature_names)

# Final thoughts

For various text classification tasks we can achieve pretty decent results without any domain knowledge, complicated algorithms or powerful computational resources. We need just a few lines of code for an easy-to-use scikit-learn pipeline: TF-IDF Vectorizer -> Logistic Regression.