## 01 Admin and Imports

In [1]:
# the usual suspects...
import os, sys
from datetime import datetime
import numpy as np
import pandas as pd
import itertools
import re

# this is for my bitcoin prices
import quandl

# visualisation libraries: matplotlib and seaborn for static plots,
# plotly (offline) for interactive, d3-based charts
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as sct
import seaborn as sns
sns.set()
sns.set(style="ticks")
sns.set_color_codes("muted")

import plotly.offline as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
pd.set_option('display.float_format', lambda x:'%.10f' %x)
import glob

# we'll need this for plotting in nice colours. Trust me on this.
COLOR_PALETTE = [    
               "#348ABD",
               "#A60628",
               "#7A68A6",
               "#467821",
               "#CF4457",
               "#188487",
               "#E24A33"
              ]

## Read data

In [2]:
df_white = pd.read_json('tldr_whitepaper_raw.json', orient='index')
df_white

| | body | summary |
|---|---|---|
| filecoin | 1 Introduction Filecoin is a protocol ... | The internet is in the middle of a revolution:... |
| gnosis | 1.1 Problem Overview Generally speak... | Prediction markets are poised to become one of... |


## Split into sentences

### Test different approaches

In [3]:
import nltk

I will use the punkt tokeniser for sentence splitting. It is an implementation of [Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf). See [this link](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/init.py#L79) for more details.

In [4]:
# # download the punkt tokenizer for sentence splitting
# nltk.download('punkt')
# print("The punkt tokenizer is downloaded")

In [5]:
# Load the punkt tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print("The punkt tokenizer is loaded")

The punkt tokenizer is loaded


In [6]:
raw_sentences = tokenizer.tokenize(df_white.body[0])
print("We have {0:,} raw sentences".format(len(raw_sentences)))

We have 543 raw sentences


#### compare this to simply splitting by periods

I reckon the naive approach will produce numerous spurious sentences from phrases such as "U.S.A.". Let's compare punkt with the naive splitter.

In [7]:
df_white.body.str.split('.').apply(len)

filecoin    898
gnosis      867
Name: body, dtype: int64

Notice how the naive splitter produces 898 pieces for filecoin, as opposed to the 543 sentences identified by the unsupervised learner.
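To see where the extra splits come from, here is a small sketch on a made-up snippet (the sample text is an assumption, not taken from either white paper). Every period counts as a boundary, including those in abbreviations, section numbers and decimals.

```python
# Illustrative text only; not taken from either white paper.
sample = "1.1 Overview The U.S.A. market grew by 2.5 percent. Prediction markets followed."

# Naive approach: every period is treated as a sentence boundary.
naive_pieces = sample.split('.')

# Two real sentences, but one piece per period plus a trailing empty string.
print(len(naive_pieces))
print(naive_pieces)
```

The snippet has two sentences but seven periods, so `str.split('.')` returns eight pieces, the last of which is empty.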

### Use punkt to create a list of sentences

In [8]:
df_white['sentences_body'] = df_white.body.apply(tokenizer.tokenize)
df_white['sentences_summary'] = df_white.summary.apply(tokenizer.tokenize)

In [9]:
df_white.sentences_body.apply(len)

filecoin    543
gnosis      700
Name: sentences_body, dtype: int64

In [10]:
df_sentenceBody = pd.DataFrame(data=[sent for sent in df_white.sentences_body], 
                               index=df_white.index).T.rename(columns=lambda x: '{:}_{:}'.format(x,'sentenceBody'))
print ("df_sentenceBody attributes: ",df_sentenceBody.shape)
df_sentenceBody.head()

df_sentenceBody attributes:  (700, 2)


| | filecoin_sentenceBody | gnosis_sentenceBody |
|---|---|---|
| 0 | 1 Introduction Filecoin is a protocol ... | 1.1 Problem Overview Generally speak... |
| 1 | Filecoin protocol provides a data storage and ... | Despite the ease of access we enjoy today, thi... |
| 2 | 1.1 Elementary Components The Filecoin... | More often than not, the data is severely lack... |
| 3 | 1. | The reason for this is straightforward: writte... |
| 4 | Decentralized Storage Network (DSN): We provid... | In other words, it’s easy to find what people ... |


## Pre-process and cleanse sentences

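- Strip every character other than letters, digits, spaces, hyphens and underscores, then turn hyphens and underscores into spaces and lower-case the result
- Check the length of each sentence and cull it if it has fewer than four words, since such fragments are usually headings or labels such as "Introduction"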
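The two rules above can be sketched on a few sample strings. This is a simplified stand-in, not the notebook's function: the name `clean`, the `min_words` parameter and the use of `None` for culled sentences are assumptions made for the illustration.

```python
import re

def clean(sentence, min_words=4):
    """Hypothetical stand-in for the cleaning rules; returns None when the sentence is culled."""
    # keep letters, digits, spaces, underscores and hyphens; drop everything else
    text = re.sub(r'[^A-Za-z0-9 _-]', '', sentence)
    # turn the remaining hyphens and underscores into spaces
    text = re.sub(r'[_-]', ' ', text)
    # collapse whitespace runs, trim the ends and lower-case
    text = re.sub(r'\s+', ' ', text).strip().lower()
    # cull fragments with fewer than min_words words
    return text if len(text.split()) >= min_words else None

print(clean("1.1 Elementary Components: the Filecoin protocol"))
print(clean("Introduction"))
print(clean("The ledger is append-only."))
```

Note that dropping the period turns a section number such as "1.1" into "11", which is why tokens like that show up in the cleaned sentences.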

In [11]:
def process_sentence(sentence):
    
    if isinstance(sentence, str):
        # strip everything except letters, digits, spaces, underscores and hyphens
        cleaned_sentence = re.sub(r'[^A-Za-z0-9 _-]', '', sentence)
        # convert - or _ into spaces so that you can use n-grams later
        cleaned_sentence = re.sub(r'[^A-Za-z0-9 ]', ' ', cleaned_sentence)
        # collapse the runs of spaces left by the steps above into a single space
        cleaned_sentence = re.sub(r'\s\s+', ' ', cleaned_sentence)
        # convert to lower case
        cleaned_sentence = cleaned_sentence.lower()
        # if the cleansed sentence contains fewer than 4 words, it probably doesn't contain much info
        if len(cleaned_sentence.split(' '))<=3:
            cleaned_sentence = np.nan
    else:
        cleaned_sentence = np.nan
            
    return cleaned_sentence

In [40]:
# perform data cleansing on every sentence in the body
df_sentenceBody_clean = df_sentenceBody.applymap(process_sentence)

In [41]:
# reshape to long format, one row per (white paper, sentence position), so that the
# idf can be computed across the sentences of every white paper
# I achieve this by stacking (which here also drops the culled NaN sentences), then
# swapping the levels of the multi-index and finally sorting by the multi-index
df_sentenceBody_clean = df_sentenceBody_clean.stack().swaplevel().sort_index().rename('sentence').reset_index()
df_sentenceBody_clean = df_sentenceBody_clean.sort_values(by=['level_0','level_1'],
                                    ascending=[True,True]).reset_index(drop=True)
df_sentenceBody_clean.shape

(1161, 3)

In [42]:
df_sentenceBody_clean.head()

| | level_0 | level_1 | sentence |
|---|---|---|---|
| 0 | filecoin_sentenceBody | 0 | 1 introduction filecoin is a protocol token wh... |
| 1 | filecoin_sentenceBody | 1 | filecoin protocol provides a data storage and ... |
| 2 | filecoin_sentenceBody | 2 | 11 elementary components the filecoin protocol... |
| 3 | filecoin_sentenceBody | 4 | decentralized storage network dsn we provide a... |
| 4 | filecoin_sentenceBody | 5 | later we present the filecoin protocol as an i... |


## TF-IDF to manipulate the string

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [29]:
ordered_series_of_sentences = df_sentenceBody_clean.sort_index()['sentence'].values
count_vect = CountVectorizer(ngram_range=(1, 3),min_df=2)
count_vect = count_vect.fit(ordered_series_of_sentences)
dtm = count_vect.transform(ordered_series_of_sentences)
dtm.shape

(1161, 6653)

In [31]:
dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [32]:
features = count_vect.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()
print (len(features))
features[155:162]

6653


['additional',
 'additionally',
 'addorders',
 'address',
 'address derived',
 'address derived from',
 'adds']

In [33]:
tfidf = TfidfTransformer(norm="l2")
dtm_tfidf = tfidf.fit_transform(dtm)
dtm_tfidf

<1161x6653 sparse matrix of type '<class 'numpy.float64'>'
	with 35317 stored elements in Compressed Sparse Row format>
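As a rough illustration of what the transformer computes, here is a pure-Python sketch on toy counts. It assumes scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`) together with the `norm="l2"` used above, where idf(t) = ln((1 + n) / (1 + df(t))) + 1. The count matrix is made up.

```python
import math

# Toy document-term counts: 3 "sentences" x 3 terms (made-up numbers).
counts = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# document frequency: in how many rows each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

def l2_normalise(row):
    # scale the row to unit Euclidean length; leave all-zero rows untouched
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

tfidf = [l2_normalise([c * w for c, w in zip(row, idf)]) for row in counts]

for row in tfidf:
    print([round(v, 3) for v in row])
```

The second row contains a single term, so after L2 normalisation its weight is exactly 1.0 regardless of how common the term is. Very short sentences therefore tend to get a high maximum score.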

In [34]:
dtm_tfidf.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [35]:
sorted_dtm_tfidf = np.argsort(dtm_tfidf.toarray(), axis=1)[:,::-1]
sorted_dtm_tfidf[:,0]

array([4282, 3283, 1174, ..., 1594, 5446, 1368])
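A small sketch of what the reversed `argsort` does, on made-up scores. `np.argsort` sorts in ascending order, so reversing the columns puts the index of the highest-scoring phrase in column 0. When only the top phrase is needed, `argmax` is a cheaper alternative.

```python
import numpy as np

# Made-up scores: 3 sentences x 4 phrases.
scores = np.array([
    [0.1, 0.7, 0.0, 0.2],
    [0.0, 0.0, 0.9, 0.1],
    [0.5, 0.3, 0.2, 0.0],
])

# argsort is ascending, so reversing the columns puts the best phrase first
ranked = np.argsort(scores, axis=1)[:, ::-1]
top_by_argsort = ranked[:, 0]

# argmax gives the same answer when each row has a unique maximum
top_by_argmax = scores.argmax(axis=1)

print(top_by_argsort)
print(top_by_argmax)
```

When a row contains ties (for example an all-zero row), the two approaches may pick different columns.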

In [43]:
# initialise tfidf output columns
df_sentenceBody_clean['phrase_with_max_tfidf']=np.nan
df_sentenceBody_clean['tfidf_of_phrase_with_max_tfidf']=np.nan

In [55]:
# densify once up front instead of on every iteration
dense_tfidf = dtm_tfidf.toarray()
for i, _ in df_sentenceBody_clean.sort_index().iterrows():
    top_feature = sorted_dtm_tfidf[i, 0]
    df_sentenceBody_clean.loc[i, 'phrase_with_max_tfidf'] = features[top_feature]
    df_sentenceBody_clean.loc[i, 'tfidf_of_phrase_with_max_tfidf'] = dense_tfidf[i, top_feature]
df_sentenceBody_clean.head()

| | level_0 | level_1 | sentence | phrase_with_max_tfidf | tfidf_of_phrase_with_max_tfidf |
|---|---|---|---|---|---|
| 0 | filecoin_sentenceBody | 0 | 1 introduction filecoin is a protocol token wh... | proof | 0.2285848862 |
| 1 | filecoin_sentenceBody | 1 | filecoin protocol provides a data storage and ... | miners earn tokens | 0.2224005903 |
| 2 | filecoin_sentenceBody | 2 | 11 elementary components the filecoin protocol... | components | 0.5762457212 |
| 3 | filecoin_sentenceBody | 4 | decentralized storage network dsn we provide a... | storage | 0.2269995076 |
| 4 | filecoin_sentenceBody | 5 | later we present the filecoin protocol as an i... | present the filecoin | 0.2715922381 |


## String them into summary

In [82]:
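# index labels of the five highest-scoring sentences in each white paper
top5_idx = df_sentenceBody_clean.groupby('level_0')['tfidf_of_phrase_with_max_tfidf'].nlargest(5).index.get_level_values(-1)
# select those rows directly (in document order) rather than merging on the float score
df_top5 = df_sentenceBody_clean.loc[top5_idx.sort_values()].reset_index(drop=True)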
df_top5.head(10)

| | level_0 | level_1 | sentence | phrase_with_max_tfidf | tfidf_of_phrase_with_max_tfidf |
|---|---|---|---|---|---|
| 0 | filecoin_sentenceBody | 62 | is it always f m1 | m1 | 0.6624478622 |
| 1 | filecoin_sentenceBody | 135 | verifyvk x 01 | 01 | 1.0 |
| 2 | filecoin_sentenceBody | 184 | the ledger is append only3 | the ledger | 0.6999849876 |
| 3 | filecoin_sentenceBody | 210 | orderbooks are sets of orders | orderbooks | 0.6328708081 |
| 4 | filecoin_sentenceBody | 521 | blockchain archives and inter blockchain stam... | blockchain | 0.8821386573 |
| 5 | gnosis_sentenceBody | 46 | 3httpwwwnewyorkercomnewsjohn cassidywhat kille... | intrade | 0.6546981134 |
| 6 | gnosis_sentenceBody | 211 | massive growth in social media following | growth | 0.7779553081 |
| 7 | gnosis_sentenceBody | 212 | surpassed 1000 slack members and 2500 twitter ... | members | 0.7078104247 |
| 8 | gnosis_sentenceBody | 433 | used technologies include 1 | technologies | 0.6588233867 |
| 9 | gnosis_sentenceBody | 524 | a sensor measurement by a specific sensor | sensor | 0.8353642481 |


In [83]:
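def return_summary(str_token):
    # join the original (uncleaned) sentences at the top-5 positions, in document order
    column = str_token + '_sentenceBody'
    top_positions = df_top5[df_top5.level_0 == column].level_1
    return ' '.join(df_sentenceBody.loc[df_sentenceBody.index.isin(top_positions), column].tolist())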

In [84]:
return_summary('filecoin')

'Is it always f = m−1? • Verify(vk, x,π) → 0,1. The ledger is append-only3. Orderbooks are sets of orders. • Blockchain archives and inter-blockchain stamping with Braid.'

In [85]:
return_summary('gnosis')

'3http://www.newyorker.com/news/john-cassidy/what-killed-intrade 4 https://www.predictit.org/Home/TermsAndConditions                       10 Chapter 1. Massive growth in social media following. Surpassed 1000 slack members and 2500 Twitter followers. Used technologies include:           1. a sensor measurement by a specific sensor.'

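Haha, this is pretty sad. Ranking by the single highest-scoring phrase favours short fragments: with L2 normalisation, a sentence with only one term left in the vocabulary scores 1.0 (see `verifyvk x 01` above). Let's try to make the extractive summaries more legit before jumping into generative.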

TTD:

- Remember to change `\n` in the cleanse process
- Include more papers to make the IDF tenable
- Get the weighted TF-IDF of the whole sentence as described in http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3-TSC-SekiY.pdf, using https://medium.com/@acrosson/summarize-documents-using-tf-idf-bdee8f60b71
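
As a starting point for the last item, here is a minimal sketch of scoring whole sentences instead of single phrases. It is one possible weighting (the mean smoothed tf-idf over a sentence's words), not the method from either linked reference. The sentences, the `sentence_score` helper and `top_k` are all made up for the illustration.

```python
import math
from collections import Counter

# Made-up, already-cleaned sentences standing in for one white paper.
sentences = [
    "the ledger is append only",
    "miners earn tokens by providing storage to clients",
    "clients spend tokens to store and retrieve data",
    "storage miners and clients",
]
tokenised = [s.split() for s in sentences]
n = len(tokenised)

# document frequency: in how many sentences each word appears
df = Counter(word for words in tokenised for word in set(words))

def sentence_score(words):
    """Mean tf-idf over the words of one sentence (hypothetical weighting)."""
    tf = Counter(words)
    weights = [tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf]
    return sum(weights) / len(words)

scores = [sentence_score(words) for words in tokenised]

# keep the top_k sentences, then restore document order for readability
top_k = 2
best = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k])
summary = ". ".join(sentences[i] for i in best)
print(summary)
```

Dividing by the sentence length keeps long sentences from winning purely on word count. Whether a mean, a sum or a position-weighted score works best is something to test once more papers are included.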