## Introduction
This notebook tries to cover some standard NLP techniques such as tokenization, stemming, lemmatization and topic modeling with LDA.

**Acknowledgments.** This notebook was inspired by the kernels [Spooky NLP and Topic Modelling tutorial](https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial) (Anisotropic) and [NIPS papers visualized with NMF and t-SNE](https://www.kaggle.com/dschniertshauer/nips-papers-visualized-with-nmf-and-t-sne) (Lurchi).

In [None]:
import pandas as pd
import numpy as np
# LDA, tSNE
from sklearn.manifold import TSNE
from gensim.models.ldamodel import LdaModel
from sklearn.metrics.pairwise import pairwise_distances
# NLTK
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
import re
# Bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, CustomJS, ColumnDataSource, Slider
from bokeh.layouts import column
from bokeh.palettes import all_palettes
output_notebook()

## 1. Loading data
Let's **load the dataset** and take a look at the author, the ID and the first 500 characters of one sample text.

In [None]:
an_author = 1211
df = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df.text[an_author][:500])

## 2. Processing
Here we'll process our corpus using some standard techniques.
### 2.1. Initial cleaning
First we **remove numbers** (if there are any) and **convert** all words **to lowercase**. Let's also see what we get:


In [None]:
# Removing numerals:
df['text_tokens'] = df.text.map(lambda x: re.sub(r'\d+', '', x))
# Lower case:
df['text_tokens'] = df.text_tokens.map(lambda x: x.lower())
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens'][an_author][:500])
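
To see what these two steps do in isolation, here is a minimal sketch on a made-up sentence. The `sample` string and the `clean` helper are illustrative assumptions and are not part of the dataset or of the pipeline above:

```python
import re

def clean(text):
    """Drop runs of digits, then lowercase, mirroring the two steps above."""
    return re.sub(r'\d+', '', text).lower()

sample = "In 1845 The Raven appeared; 18 stanzas."
print(clean(sample))
```

Note that deleting the digits leaves double spaces behind. That should be harmless here, since the tokenizer in the next step ignores whitespace.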

### 2.2. Tokenize
**Splitting** texts **into separate words**, and also **removing punctuation** and other non-word characters. After this step each text should be a list of lowercase words:

In [None]:
df['text_tokens'] = df.text_tokens.map(lambda x: RegexpTokenizer(r'\w+').tokenize(x))
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens'][an_author][:25])
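
The pattern `r'\w+'` keeps maximal runs of word characters (letters, digits, underscore) and drops everything in between. As a rough illustration, the standard-library `re.findall` with the same pattern should give the same tokens on plain text like this; the sentence in `toy_sentence` is made up for the example:

```python
import re

toy_sentence = "it's a half-forgotten dream, isn't it?"

# Every maximal run of word characters becomes one token.
toy_tokens = re.findall(r'\w+', toy_sentence)
print(toy_tokens)
```

Contractions and hyphenated words are split at the punctuation, which produces single-letter tokens such as `s` and `t`. This is one reason the short-word filter later in the notebook is useful.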

### 2.3. Lemmatization
*Stemming* is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The stem **need not be identical to the morphological root of the word** (see [Wikipedia](https://en.wikipedia.org/wiki/Stemming) for more details).

*Lemmatization* is the process of determining the [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) of a word based on its intended meaning. Unlike stemming, lemmatization depends on **correctly identifying the intended part of speech and meaning of a word** in a sentence (see [Wikipedia](https://en.wikipedia.org/wiki/Lemmatisation) for more details). We'll use `WordNetLemmatizer` from `nltk`, together with part-of-speech tags from `pos_tag`.

In [None]:
lemma = WordNetLemmatizer()
df['tags'] = df.text_tokens.map(lambda x: list(zip(*pos_tag(x)))[1])

def recode_tag(tag):
    if tag[0].lower() in ['n', 'r', 'v', 'j']:
        if tag[0].lower() == 'j': return 'a'
        else: return tag[0].lower()
    else: return None

df['tags'] = df.tags.map(lambda x: list(map(recode_tag, x)))
df['tags'] = df.apply(lambda x: list(zip(x.text_tokens, x.tags)), axis=1)

def lemmatize_tokens(pairs):
    # Tokens without a WordNet part of speech are left unchanged.
    return [lemma.lemmatize(tok, pos=tag) if tag is not None else tok
            for (tok, tag) in pairs]

df['text_tokens'] = df.tags.map(lemmatize_tokens)
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens'][an_author][:25])
print(df['tags'][an_author][:25])
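
`pos_tag` returns Penn Treebank tags (`NN`, `VBD`, `JJ`, ...), while `WordNetLemmatizer` expects one of `n`, `v`, `a`, `r`. The sketch below shows the same mapping as `recode_tag`, written as a dictionary lookup. The names `PENN_TO_WORDNET` and `to_wordnet_pos` are hypothetical and only used for illustration:

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
PENN_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def to_wordnet_pos(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if there is none."""
    return PENN_TO_WORDNET.get(tag[:1].upper())

for tag in ['NN', 'VBD', 'JJ', 'RB', 'DT', 'IN']:
    print(tag, '->', to_wordnet_pos(tag))
```

Tags such as `DT` (determiner) or `IN` (preposition) have no WordNet counterpart, so the corresponding tokens are passed through without lemmatization.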


### 2.4. Stop words
**Removing common** English **words** like `and`, `the`, `of` and so on.

In [None]:
# A set makes the membership test below much faster than a list:
stop_en = set(stopwords.words('english'))
df['text_tokens'] = df.text_tokens.map(lambda x: [t for t in x if t not in stop_en])
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens'][an_author][:25])

### 2.5. Bigrams
Let's construct bigrams (**pairs of adjacent words**) for every text:

In [None]:
df['text_tokens_bigrams'] = df.text_tokens.map(lambda x: [' '.join(x[i:i+2]) 
                                                          for i in range(len(x)-1)])
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens_bigrams'][an_author][:25])
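
The comprehension above slides a window of size two over the token list. An equivalent way to write it, shown here on a made-up list (`toy_tokens` and `toy_bigrams` are illustrative names), is to zip the list with itself shifted by one:

```python
toy_tokens = ['old', 'house', 'stood', 'silent']

# Pair each token with its successor; n tokens give n - 1 bigrams.
toy_bigrams = [' '.join(pair) for pair in zip(toy_tokens, toy_tokens[1:])]
print(toy_bigrams)
```

Because stop words were already removed, a bigram may join two words that were separated by a stop word in the original sentence.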

### 2.6. Final cleaning
#### 2.6.1. Short words
Here we'll remove all "extremely short" words (those with fewer than 2 characters, if any are still left in the texts).

In [None]:
df['text_tokens'] = df.text_tokens.map(lambda x: [t for t in x if len(t) > 1])
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens'][an_author][:25])

#### 2.6.2. Join tokens and bigrams
Adding our bigrams to the pool of tokens for every text:

In [None]:
df['text_tokens'] = df.text_tokens + df.text_tokens_bigrams
print("Author: {}, ID: {}".format(df.author[an_author], df.id[an_author]))
print(df['text_tokens'][an_author][:100])


## 3. LDA
Finally, let's use **LDA** ([Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)) to extract topic structure from the corpus of texts.


In [None]:
from gensim import corpora, models
T = 4 # number of topics
np.random.seed(2017)
texts = df['text_tokens'].values
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = models.ldamodel.LdaModel(corpus, id2word=dictionary, 
                                    num_topics=T, passes=7, minimum_probability=0)
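
Before LDA sees the texts, `Dictionary` assigns an integer id to every distinct token and `doc2bow` turns each token list into sparse `(token_id, count)` pairs. The pure-Python sketch below illustrates that idea only; it is not gensim's implementation, and the id order (first appearance) is an assumption of this example. All names (`toy_texts`, `token2id`, `to_bow`, `toy_corpus`) are hypothetical:

```python
from collections import Counter

toy_texts = [['raven', 'night', 'raven'], ['night', 'sea']]

# Assign ids in order of first appearance (an assumption for this sketch).
token2id = {}
for doc in toy_texts:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_texts]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation, which is why the bigrams were added as extra tokens beforehand.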

Top words for every topic are:

In [None]:
# Show all T topics, not just the first three:
ldamodel.print_topics(num_topics=T, num_words=5)

Converting the results of LDA into a NumPy matrix (`number_of_texts` $\times$ `number_of_topics`). 
Let us also precompute the **pairwise distances** between texts using the **cosine distance**.

In [None]:
# Matrix with topics probabilities for every text:
hm = np.array([[y for (x,y) in ldamodel[corpus[i]]] for i in range(len(corpus))])
# Computing pairwise cosine distance between texts:
precomp_cosine = pairwise_distances(hm, metric='cosine')
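
The cosine distance between two topic vectors $p$ and $q$ is $1 - \frac{p \cdot q}{\|p\| \, \|q\|}$. As a minimal NumPy sketch of that formula (the matrix `toy_hm` and the helper `cosine_distance_matrix` are made up for illustration), one can normalize the rows and take one matrix product. The result should agree with `pairwise_distances(..., metric='cosine')` up to floating-point error:

```python
import numpy as np

def cosine_distance_matrix(m):
    """Pairwise cosine distance between the rows of a 2-D array."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)  # L2-normalize rows
    return 1.0 - unit @ unit.T

# Three made-up documents described by two topic probabilities each.
toy_hm = np.array([[0.9, 0.1],
                   [0.8, 0.2],
                   [0.1, 0.9]])
toy_dist = cosine_distance_matrix(toy_hm)
print(np.round(toy_dist, 3))
```

Documents with similar topic mixtures (rows 0 and 1) end up much closer to each other than to a document dominated by a different topic (row 2).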

And **reduce dimensionality** using **t-SNE**:

In [None]:
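# init='random' is required with a precomputed distance matrix in recent scikit-learn versions:
tsne = TSNE(random_state=2017, perplexity=25, metric='precomputed', early_exaggeration=4,
            init='random')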
tsne_rep = tsne.fit_transform(precomp_cosine)
tsne_rep = pd.DataFrame(tsne_rep, columns=['x','y'])
tsne_rep['hue'] = [['EAP', 'HPL', 'MWS'].index(x) for x in df.author.values]

## 4. Plotting
Using `Bokeh` for an interactive scatter plot. Hover the mouse over a dot to see the author and the text of the respective sample:

In [None]:
source = ColumnDataSource(
        data=dict(
            x = tsne_rep.x,
            y = tsne_rep.y,
            colors = [all_palettes['Inferno'][4][i] for i in tsne_rep.hue],
            author = df.author,
            text = df.text,
            alpha = [0.7] * tsne_rep.shape[0],
            size = [7] * tsne_rep.shape[0]
        )
    )

hover_tsne = HoverTool(names=["df"], tooltips="""
    <div style="margin: 10px">
        <div style="margin: 0 auto; width:300px;">
            <span style="font-size: 12px; font-weight: bold;">Author:</span>
            <span style="font-size: 12px">@author</span>
        </div>
        <div style="margin: 0 auto; width:300px;">
            <span style="font-size: 12px; font-weight: bold;">Text:</span>
            <span style="font-size: 12px">@text</span>
        </div>
    </div>
    """)

tools_tsne = [hover_tsne, 'pan', 'wheel_zoom', 'reset']
plot_tsne = figure(plot_width=700, plot_height=700, tools=tools_tsne, title='Spooky')

plot_tsne.circle('x', 'y', size='size', fill_color='colors', 
                 alpha='alpha', line_alpha=0, line_width=0.01, source=source, name="df")

layout = column(plot_tsne)

In [None]:
show(layout)

Thanks for reading!