In this kernel I'll try **Latent Dirichlet Allocation** (LDA) to automatically extract the topics that characterize Medium articles. Well-tuned LDA features can give a really good result on the Leaderboard. If you are out of ideas, try adding some LDA features to your analysis.

In [2]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import json
from tqdm import tqdm_notebook
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_absolute_error
from scipy import sparse
import pyLDAvis.gensim
import gensim
from gensim.matutils  import Sparse2Corpus
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sklearn.linear_model import Ridge

Let's start with standard preprocessing and get a Bag of Words from our text data with CountVectorizer. The preprocessing part was taken from the [kernel of Yury Kashnitsky](http://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) with some modifications.

## Preprocessing

In [3]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
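    def __init__(self):
        # Let HTMLParser initialise its own state; convert_charrefs=True
        # decodes entities such as &amp; before they reach handle_data
        super().__init__(convert_charrefs=True)
        self.fed = []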
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
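
A quick self-contained check of what tag stripping does. The class name `TagStripper`, the helper `strip_tags_demo`, and the sample string are illustrative and not part of the original kernel.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_demo(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


sample = '<p>Hello <b>world</b> &amp; co</p>'
print(strip_tags_demo(sample))
```

Only the text between the tags survives, and the `&amp;` entity is decoded to a plain ampersand.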

In [4]:
PATH_TO_DATA = '../input/'

In [5]:
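def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except Exception as e:
        # The error message ends with "(char N)"; N is the index of the offending character
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))
        # Replace the offending character with a space and try again
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        # Join without a separator so the other characters keep their positions
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result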

def preprocess(path_to_inp_json_file):
    output_list = []
    with open(path_to_inp_json_file, encoding='utf-8') as inp_file:
        for line in tqdm_notebook(inp_file):
            json_data = read_json_line(line)
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content_no_html_tags = strip_tags(content)
            output_list.append(content_no_html_tags)
    return output_list
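
The retry logic in `read_json_line` relies on the position reported by the JSON decoder. As a hedged alternative sketch, `json.JSONDecodeError` exposes that position directly through its `pos` attribute, so the message does not have to be parsed. The function name `repair_json_line` and the `max_attempts` cap are assumptions added for illustration.

```python
import json


def repair_json_line(line, max_attempts=100):
    """Blank out offending characters until the line parses as JSON."""
    for _ in range(max_attempts):
        try:
            return json.loads(line)
        except json.JSONDecodeError as err:
            if err.pos >= len(line):
                raise  # nothing left to replace, e.g. truncated input
            # err.pos is the index at which the decoder gave up
            line = line[:err.pos] + ' ' + line[err.pos + 1:]
    raise ValueError('could not repair the line')


# A raw tab inside a JSON string is an invalid control character
broken = '{"content": "a\tb"}'
print(repair_json_line(broken))
```

Using a loop with a cap instead of recursion avoids hitting the recursion limit on lines with many bad characters.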

In [6]:
train_raw_content = preprocess(path_to_inp_json_file=os.path.join(PATH_TO_DATA, 'train.json'),)


In [None]:
test_raw_content = preprocess(path_to_inp_json_file=os.path.join(PATH_TO_DATA,  'test.json'),)

In [None]:
cv = CountVectorizer(max_features=10000, min_df = 0.1, max_df = 0.8)
sparse_train = cv.fit_transform(train_raw_content)
sparse_test  = cv.transform(test_raw_content)

In [None]:
full_sparse_data =  sparse.vstack([sparse_train, sparse_test])

In [None]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')

In [None]:
y_train = train_target['log_recommends'].values

## Extracting topics with LDA

In short, LDA represents documents as mixtures of topics, and each topic emits words with certain probabilities.

One common way to fit LDA is Gibbs sampling, which repeatedly reassigns each word W in each document D to a topic. For each possible topic Z, we multiply the frequency of the word type W in Z by the number of other words in document D that already belong to Z. The result is proportional to the probability that this word came from Z. Here's the actual formula:

![LDA formula](http://tedunderwood.files.wordpress.com/2012/04/ldaformula.png?w=584)
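
As a minimal sketch of that multiplication, here is the standard collapsed Gibbs sampling weight for a single word with symmetric priors. It may differ in notation from the formula image; the function name, argument names, and toy counts are all assumptions made for illustration.

```python
def topic_probabilities(word_in_topic, topic_totals, doc_in_topic,
                        alpha, beta, vocab_size):
    """Return P(Z | W, D) for every topic Z, normalised to sum to 1.

    word_in_topic[z]: how often word W is currently assigned to topic z
    topic_totals[z]:  total number of tokens assigned to topic z
    doc_in_topic[z]:  other words in document D assigned to topic z
    """
    weights = []
    for z in range(len(topic_totals)):
        # smoothed frequency of W within topic z
        word_part = (word_in_topic[z] + beta) / (topic_totals[z] + beta * vocab_size)
        # smoothed presence of topic z in document D
        doc_part = doc_in_topic[z] + alpha
        weights.append(word_part * doc_part)
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts: W is frequent in topic 0, and D mostly belongs to topic 0
probs = topic_probabilities(
    word_in_topic=[8, 1], topic_totals=[100, 100], doc_in_topic=[6, 2],
    alpha=0.1, beta=0.01, vocab_size=1000,
)
print(probs)
```

With these counts topic 0 gets most of the probability mass, because both factors favour it.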

Finding the right parameters for LDA is 'an art'. 

Three main parameters need to be tuned:
1. **K**: the number of topics.
2. **Alpha**: controls how many topics a document is likely to contain. The lower alpha is, the fewer topics per document.
3. **Beta**: controls how many words a topic is likely to contain. Similarly to alpha, the lower beta is, the fewer words per topic.
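
To get a feel for alpha, the sketch below samples topic mixtures from a symmetric Dirichlet distribution using only the standard library and measures how dominant the largest topic is on average. The helper names, the choice of 5 topics, and the sample sizes are illustrative assumptions. In gensim's `LdaModel` the three parameters correspond to the `num_topics`, `alpha`, and `eta` arguments (gensim calls beta `eta`).

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=5, n_samples=2000, seed=0):
    """Average share of the most prominent topic across sampled documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return sum(shares) / n_samples


for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 2))
```

A low alpha should produce documents dominated by a single topic, while a high alpha should produce much flatter mixtures.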

That's all for the theory, let's run LDA!

I'll use the LDA implementation from the gensim library, so the data needs some transformation.

In [None]:
#Transform our sparse_data to corpus for gensim
corpus_data_gensim = gensim.matutils.Sparse2Corpus(full_sparse_data, documents_columns=False)
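
gensim represents each document as a list of `(token_id, count)` pairs. The pure Python sketch below shows that conversion for a tiny dense matrix with documents in rows; the helper name `rows_to_bow` is hypothetical, and the notebook itself relies on `Sparse2Corpus` for the real sparse data.

```python
def rows_to_bow(count_rows):
    """Convert rows of term counts into gensim-style bag-of-words lists."""
    corpus = []
    for row in count_rows:
        # keep only the terms that actually occur in the document
        corpus.append([(token_id, count)
                       for token_id, count in enumerate(row) if count > 0])
    return corpus


counts = [
    [0, 2, 1],  # document 0: token 1 twice, token 2 once
    [1, 0, 0],  # document 1: token 0 once
]
print(rows_to_bow(counts))
```

`documents_columns=False` tells `Sparse2Corpus` that documents are stored in rows, which is how `CountVectorizer` returns them.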

In [None]:
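#Create dictionary for LDA model
# CountVectorizer maps token -> column index; gensim needs column index -> token
vocabulary_gensim = {}
# Iterate in column order so the ids gensim assigns match the matrix columns
for key, val in sorted(cv.vocabulary_.items(), key=lambda item: item[1]):
    vocabulary_gensim[val] = key

dict = Dictionary()
dict.merge_with(vocabulary_gensim)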
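
`CountVectorizer.vocabulary_` maps each token to its column index, while gensim needs the reverse mapping from id to token. A small sketch with a made-up three-word vocabulary (the variable names and tokens are illustrative):

```python
# Made-up stand-in for cv.vocabulary_: token -> column index
vocabulary = {'topic': 2, 'data': 0, 'model': 1}

# Invert the mapping and order it by column index
id2token = {idx: token
            for token, idx in sorted(vocabulary.items(), key=lambda item: item[1])}
print(id2token)

# Every column of the count matrix must map to exactly one token
assert sorted(id2token) == list(range(len(vocabulary)))
```

A plain dict like this can also be passed to `LdaModel` through its `id2word` argument so that printed topics show words instead of ids.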

Let's assume that we can divide our articles into 30 different topics.

In [None]:
lda = LdaModel(corpus_data_gensim, num_topics = 30 )

Let's look at our topics.

In [None]:
data_ =  pyLDAvis.gensim.prepare(lda, corpus_data_gensim, dict)
# pyLDAvis.display(data_)

I commented out the display call and inserted an image instead because pyLDAvis distorted the kernel's layout, but you can download the notebook and run it yourself.

![](http://github.com/Twoweaks/random/blob/master/lda.png?raw=true)

Circles represent different topics and the distance between them. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus. Although our model was built with a vocabulary of at most 10000 words, we can understand the general tone of some topics.

Obviously, we can definitely improve this to achieve better separation between the topics!

The function below transforms a bag-of-words document into features: it returns the proportion of each topic in the document.

In [None]:
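def document_to_lda_features(lda_model, document):
    # Use the model passed in as an argument rather than the global `lda`
    topic_importances = lda_model.get_document_topics(document, minimum_probability=0)
    topic_importances = np.array(topic_importances)
    # Column 0 holds topic ids, column 1 holds their probabilities
    return topic_importances[:,1]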

lda_features = list(map(lambda doc:document_to_lda_features(lda, doc),corpus_data_gensim))
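
`get_document_topics` returns a list of `(topic_id, probability)` pairs, which is why the function keeps only the second column. A defensive sketch that builds a fixed-length vector even if some topics are missing from the list (the helper name `to_dense_topic_vector` and the sample values are assumptions):

```python
def to_dense_topic_vector(topic_pairs, num_topics):
    """Place each (topic_id, probability) pair at its index; absent topics get 0."""
    vector = [0.0] * num_topics
    for topic_id, probability in topic_pairs:
        vector[topic_id] = probability
    return vector


pairs = [(0, 0.7), (3, 0.3)]
print(to_dense_topic_vector(pairs, num_topics=4))
```

This keeps every row the same length, so the rows can be stacked into a feature matrix safely.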

In [None]:
data_pd_lda_features = pd.DataFrame(lda_features)
data_pd_lda_features.head()

Let's look at the correlation of generated lda features.

In [None]:
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
data_pd_lda_features_train = data_pd_lda_features.iloc[:y_train.shape[0]].copy()
data_pd_lda_features_train['target'] = y_train

fig, ax = plt.subplots()
# a wide figure: 20.7 x 8.27 inches (8.27 is the short side of A4 paper)
fig.set_size_inches(20.7, 8.27)
sns.heatmap(data_pd_lda_features_train.corr(method = 'spearman'), cmap="RdYlGn", ax = ax)

Some topics correlate with the target variable (for example, topic 14 or topic 15).
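
The heatmap uses Spearman correlation, which is Pearson correlation computed on ranks and therefore captures any monotonic relationship. A pure Python sketch under the simplifying assumption that there are no tied values (function names and data are illustrative; pandas handles ties by averaging ranks):

```python
def ranks(values):
    """Rank values from 1 to n, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


topic_share = [0.05, 0.20, 0.40, 0.90]
log_recommends = [1.0, 2.5, 2.6, 7.0]  # monotonic but not linear
print(spearman(topic_share, log_recommends))
```

Because only the ordering matters, a topic share that rises with the target in a non-linear way still gets a high coefficient.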

So we can use the topic probabilities of each article as features for our model.

In [None]:
X_tr = sparse.hstack([sparse_train, data_pd_lda_features_train.drop('target', axis = 1)]).tocsr()

In [None]:
X_test = sparse.hstack([sparse_test, data_pd_lda_features.iloc[y_train.shape[0]:]]).tocsr()

In [None]:
ridge = Ridge(random_state=17)
ridge.fit(X_tr,y_train)
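
`mean_absolute_error` is imported at the top but never used. Before submitting, it may be worth scoring the model on a hold-out part of the training set. The pure Python sketch below shows the split and the metric on made-up numbers; the helper names, the hold-out size, and the values are assumptions, and in the notebook one would fit `Ridge` on the first part of `X_tr` and score the rest.

```python
def holdout_split(n_rows, valid_rows):
    """Return (train_indices, valid_indices); the last valid_rows rows are held out."""
    cut = n_rows - valid_rows
    return list(range(cut)), list(range(cut, n_rows))


def mae(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


train_idx, valid_idx = holdout_split(n_rows=10, valid_rows=3)
y_valid = [1.0, 2.0, 4.0]   # made-up targets for the held-out rows
y_pred = [1.5, 2.0, 3.0]    # made-up predictions for the same rows
print(train_idx, valid_idx, mae(y_valid, y_pred))
```

Comparing this score with and without the LDA columns gives a rough idea of whether the topic features help.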

In [None]:
subm = ridge.predict(X_test)

In [None]:
plt.hist(subm, bins=30, alpha=.5, color='green', label='pred', range=(0,10));
plt.legend();

*What's next:*
* this kernel covered only some basic ideas of how to add more features to your model; tuning LDA should boost your result on the leaderboard
* change the number of features for CountVectorizer and tune its other parameters
* change the LDA parameters (more topics, find optimal beta and alpha)

Good Luck!