In [1]:
import numpy as np
import pandas as pd
import warnings
from cleaning_funcs import *
import re 
warnings.simplefilter(action='ignore')

In [2]:
# Consists of concatenation of all paragraphs as articles and bold lines as summaries
# wikihow_all = pd.read_csv('./datasets/wikihowAll.csv')

In [3]:
# wikihow_all.head()

In [4]:
# Consists of each paragraph and its summary
wikihow_sep = pd.read_csv('./datasets/wikihowSep.csv')

# Picking a Dataset 

## Dataset structuring

In a wikihow article, a paragraph's summary usually comes at the top of the section and is highlighted in bold. We need to re-create the classic wikihow structure by prepending the `headline` sentence to the paragraph for further processing.

* We will create a column called `full_text` where the first line will be the sentence summarizing the whole paragraph.  

* The position of the summary sentence at the beginning of the paragraph is specific to the wikihow structure and is not representative of a typical text/article. If we plan on using some kind of sentence-location feature for the summarization, a model trained only on the wikihow dataset will probably not find it useful on other texts. This problem could later be addressed by adding new and different summaries to the dataset.

* The `wikihow_sep` dataset seems to be a good starting point for our analysis since it provides the paragraph and the sentence chosen to be its summary. 
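
The concatenation described above can be sketched on a tiny invented frame. The names `toy` and `toy_filtered` and the two rows are assumptions made for illustration only. The sketch also fills missing values before concatenating, because in pandas a string plus `NaN` gives `NaN`, and the `describe()` output of `wikihow_sep` reports fewer non-null `text` entries than `headline` entries.

```python
import pandas as pd

# Toy stand-in for wikihow_sep; both rows are invented for illustration.
toy = pd.DataFrame({
    "headline": ["\nSell yourself first.", "\nTake good pictures."],
    "text": [" Sum up yourself as an artist.", None],
    "title": ["How to Sell Fine Art Online"] * 2,
})

# .copy() gives an independent frame, so adding a column does not write into a slice.
toy_filtered = toy[["headline", "text", "title"]].copy()

# Replace missing pieces with empty strings so the concatenation never yields NaN.
toy_filtered["full_text"] = (
    toy_filtered["headline"].fillna("") + toy_filtered["text"].fillna("")
)

print(toy_filtered["full_text"].tolist())
```

Whether rows with a missing `text` should be kept at all is a separate modelling decision; the sketch only shows how to avoid silently turning them into `NaN`.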

In [5]:
# Check the dataset description
wikihow_sep.describe()

Unnamed: 0,overview,headline,text,sectionLabel,title
count,1583187.0,1585695,1387290,1583791,1585694
unique,128543.0,1357301,1354189,188924,214613
top,,\nFinished.\n\n,;\n,Steps,How to Create an Overall Status Workbook in XL...
freq,826.0,4707,14164,449451,227


In [6]:
# Select preliminary useful columns 
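# .copy() makes an independent frame, so the columns added below do not write into a slice of wikihow_sep
wiki_filtered = wikihow_sep[['headline', 'text', 'title']].copy()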

In [7]:
wikihow_sep.head(30)

Unnamed: 0,overview,headline,text,sectionLabel,title
0,So you're a new or aspiring artist and your c...,\nSell yourself first.,"Before doing anything else, stop and sum up y...",Steps,How to Sell Fine Art Online
1,"If you want to be well-read, then, in the wor...",\nRead the classics before 1600.,Reading the classics is the very first thing ...,Reading the Classics,How to Be Well Read
2,So you're a new or aspiring artist and your c...,\nJoin online artist communities.,Depending on what scale you intend to sell yo...,Steps,How to Sell Fine Art Online
3,So you're a new or aspiring artist and your c...,\nMake yourself public.,Get yourself out there as best as you can by ...,Steps,How to Sell Fine Art Online
4,So you're a new or aspiring artist and your c...,\nBlog about your artwork.,"Given the hundreds of free blogging websites,...",Steps,How to Sell Fine Art Online
5,So you're a new or aspiring artist and your c...,\nCreate a mailing list.,This could be your most effective tool if man...,Steps,How to Sell Fine Art Online
6,So you're a new or aspiring artist and your c...,\nTake good pictures.,"Like they say, ""a picture's worth a thousand ...",Steps,How to Sell Fine Art Online
7,So you're a new or aspiring artist and your c...,\nBe sure to properly license your art.,Licensing art is a way of proving what belong...,Steps,How to Sell Fine Art Online
8,So you're a new or aspiring artist and your c...,\nConsider the option of creating your own site.,Having your own site means that you can optim...,Steps,How to Sell Fine Art Online
9,So you're a new or aspiring artist and your c...,\nExpect this to be a gradual process and don'...,An online art business needs to be built up l...,Steps,How to Sell Fine Art Online


In [8]:
wikihow_sep.iloc[0, 2]

" Before doing anything else, stop and sum up yourself as an artist. Now, think about how to translate that to an online profile. Be it the few words, Twitter allows you or an entire page of indulgence that your own website would allow you. Bring out the most salient features of your creativity, your experience, your passion, and your reasons for painting. Make it clear to readers why you are an artist who loves art, produces high quality art, and is a true champion of art. If you're not great with words, find a friend who can help you with this really important aspect of selling online – the establishment of your credibility and reliability.;\n"

In [9]:
# Re-create a full wikihow paragraph
wiki_filtered['full_text'] = wiki_filtered['headline'] + wiki_filtered['text']

In [10]:
#Reset index for later use
wiki_filtered = wiki_filtered.reset_index()

In [11]:
# Copy the index into a provisional text_id column (replaced by the title codes below)
wiki_filtered['text_id'] = wiki_filtered['index']

In [12]:
#Assign `text_id` by title 
wiki_filtered.title = pd.Categorical(wiki_filtered.title)

In [13]:
wiki_filtered['text_id'] = wiki_filtered.title.cat.codes
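
As a small illustration of how `cat.codes` turns titles into integer ids, here is a sketch on an invented series (`toy_titles` is a hypothetical name). When categories are inferred from strings they are sorted, so the code is the position of the title in that sorted order, and a missing title receives the code `-1`.

```python
import pandas as pd

# Invented titles; the last one is missing on purpose.
toy_titles = pd.Series([
    "How to Sell Fine Art Online",
    "How to Be Well Read",
    "How to Sell Fine Art Online",
    None,
])

categorical = pd.Categorical(toy_titles)

# Codes index into the sorted categories; missing values are encoded as -1.
codes = categorical.codes.tolist()
categories = categorical.categories.tolist()

print(codes)
print(categories)
```

Rows that share a title share an id, which is what lets every sentence be traced back to its article later on. Rows with code `-1` have no title and may need separate handling.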

In [14]:
# Filter dataframe columns 
wiki_filtered = wiki_filtered[['text_id', 'full_text', 'title']]

In [15]:
# Explore the beginning of the dataframe
wiki_filtered.head(20)

Unnamed: 0,text_id,full_text,title
0,176320,\nSell yourself first. Before doing anything e...,How to Sell Fine Art Online
1,12342,\nRead the classics before 1600. Reading the c...,How to Be Well Read
2,176320,\nJoin online artist communities. Depending on...,How to Sell Fine Art Online
3,176320,\nMake yourself public. Get yourself out there...,How to Sell Fine Art Online
4,176320,\nBlog about your artwork. Given the hundreds ...,How to Sell Fine Art Online
5,176320,\nCreate a mailing list. This could be your mo...,How to Sell Fine Art Online
6,176320,"\nTake good pictures. Like they say, ""a pictur...",How to Sell Fine Art Online
7,176320,\nBe sure to properly license your art. Licens...,How to Sell Fine Art Online
8,176320,\nConsider the option of creating your own sit...,How to Sell Fine Art Online
9,176320,\nExpect this to be a gradual process and don'...,How to Sell Fine Art Online


In [16]:
# Remove special characters
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: clean_special_chars(str(x)))

In [17]:
#Remove digits
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: remove_digits(str(x)))

In [18]:
# Deduplicates spaces
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: dedupe_spaces(str(x)))

In [19]:
# Remove new lines
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: remove_newlines(str(x)))

In [20]:
# Remove empty sentences
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: remove_empty_sentences(str(x)))
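
The helpers used in the cells above come from the local `cleaning_funcs` module, which is not shown in this notebook. The following is only a guess at what such helpers might do, inferred from the outputs (apostrophes become spaces, digits disappear, commas and periods survive). All `*_sketch` names are hypothetical and are not the module's actual functions. Unlike the cells above, the sketch removes newlines before collapsing spaces.

```python
import re

def clean_special_chars_sketch(text):
    # Assumption: everything except letters, digits, whitespace and . , ! ? becomes a space.
    return re.sub(r"[^A-Za-z0-9\s.,!?]", " ", text)

def remove_digits_sketch(text):
    # Drop every digit character.
    return re.sub(r"\d", "", text)

def remove_newlines_sketch(text):
    # Assumption: newlines are replaced by a space rather than deleted.
    return text.replace("\n", " ")

def dedupe_spaces_sketch(text):
    # Collapse runs of two or more spaces into one.
    return re.sub(r" {2,}", " ", text)

def clean_text_sketch(text):
    # Apply the steps in sequence, then trim the ends.
    for step in (
        clean_special_chars_sketch,
        remove_digits_sketch,
        remove_newlines_sketch,
        dedupe_spaces_sketch,
    ):
        text = step(text)
    return text.strip()

cleaned = clean_text_sketch("\nRead the classics before 1600. If you're unsure, ask.")
print(cleaned)
```

Chaining the steps in a single function also means the column is traversed once with `apply` instead of once per step.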

In [21]:
wiki_filtered.head()

Unnamed: 0,text_id,full_text,title
0,176320,Sell yourself first. Before doing anything el...,How to Sell Fine Art Online
1,12342,Read the classics before . Reading the classi...,How to Be Well Read
2,176320,Join online artist communities. Depending on ...,How to Sell Fine Art Online
3,176320,Make yourself public. Get yourself out there ...,How to Sell Fine Art Online
4,176320,Blog about your artwork. Given the hundreds o...,How to Sell Fine Art Online


In [22]:
# Build a lookup from text_id to title (each text_id is derived from exactly one title)
text_id_to_title = dict(zip(wiki_filtered['text_id'], wiki_filtered['title']))

In [23]:
text_id_to_title[176320]

'How to Sell Fine Art Online'

In [24]:
# Split text by sentences   
wiki_filtered['sentences'] = wiki_filtered['full_text'].apply(lambda x: split_sentences(str(x)) )

In [25]:
# Create a list of tuples: index 0 holds the text_id, index 1 the list of sentences of that text
tuples = list(zip(wiki_filtered['text_id'], wiki_filtered['sentences']))

In [26]:
# Apply custom function to identify each sentence with its original text
tup_list = tup_list_maker(tuples)
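
`tup_list_maker` is a custom helper from `cleaning_funcs` and its source is not shown here. Assuming its job is to emit one `(text_id, sentence)` pair per sentence, the same reshaping can be sketched with pandas' built-in `DataFrame.explode`. The frame `toy` and its contents are invented for illustration.

```python
import pandas as pd

# Invented example: each row holds a text_id and the list of its sentences.
toy = pd.DataFrame({
    "text_id": [7, 3],
    "sentences": [["Sell yourself first", "Sum up yourself"], ["Read the classics"]],
})

# explode() repeats text_id once per list element, giving one sentence per row.
toy_sentences = (
    toy.explode("sentences")
       .rename(columns={"sentences": "sentence"})
       .reset_index(drop=True)
)

print(toy_sentences)
```

This is an alternative technique, not a description of how `tup_list_maker` works internally.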

In [27]:
# Converting the tuples list into a dataframe 
sentences = pd.DataFrame(tup_list, columns =['text_id', 'sentence'])

In [28]:
#Check the result
sentences.head()

Unnamed: 0,text_id,sentence
0,176320,Sell yourself first
1,176320,"Before doing anything else, stop and sum up yo..."
2,176320,"Now, think about how to translate that to an o..."
3,176320,"Be it the few words, Twitter allows you or an ..."
4,176320,Bring out the most salient features of your cr...


In [29]:
sentences.tail()

Unnamed: 0,text_id,sentence
7354624,99172,Shade in color prints is typically made via va...
7354625,99172,Look for blurriness
7354626,99172,"Typically, fine details will be somewhat blurr..."
7354627,99172,"Often, the paper won t quite stick, or will ot..."
7354628,99172,This is typically a sign of planographic litho...


In [30]:
sentences.head(30)

Unnamed: 0,text_id,sentence
0,176320,Sell yourself first
1,176320,"Before doing anything else, stop and sum up yo..."
2,176320,"Now, think about how to translate that to an o..."
3,176320,"Be it the few words, Twitter allows you or an ..."
4,176320,Bring out the most salient features of your cr...
5,176320,Make it clear to readers why you are an artist...
6,176320,"If you re not great with words, find a friend ..."
7,12342,Read the classics before
8,12342,Reading the classics is the very first thing y...
9,12342,If you want to build a solid foundation for yo...


## Adding a column specifying if the sentence is part of the summary

To create our labeled dataset, we need to flag, for each sentence, whether it is part of the final summary or not. The summary of each paragraph is its `headline`, so we clean the `headline` column with the same steps that were applied to `full_text` and then mark a sentence as a summary (`is_summary = 1`) when it exactly matches one of the cleaned headlines.

In [31]:
# Strip leading and trailing whitespace
sentences['sentence'] = sentences['sentence'].apply(lambda sentence: sentence.strip())

In [32]:
# Take the headline column and clean it with the same steps as the sentences
headlines = wikihow_sep.headline
# Remove special characters
headlines = headlines.apply(lambda x: clean_special_chars(str(x)))
#Remove digits
headlines = headlines.apply(lambda x: remove_digits(str(x)))
#Remove sentence boundary
headlines = headlines.apply(lambda x: remove_boundaries(str(x)))
# Deduplicates spaces
headlines = headlines.apply(lambda x: dedupe_spaces(str(x)))
# Remove new lines
headlines = headlines.apply(lambda x: remove_newlines(str(x)))
# Remove empty sentences
headlines = headlines.apply(lambda x: remove_empty_sentences(str(x)))
#remove leading and trailing whitespace
headlines = headlines.apply(lambda sentence: sentence.strip())
#headline_set = set(headlines)
headline_array = np.array(headlines)

In [33]:
headline_array

array(['Sell yourself first', 'Read the classics before',
       'Join online artist communities', ...,
       'Look for the flatness of the ink',
       'Look for the illusion of shade, created by multiple layers',
       'Look for blurriness'], dtype=object)

In [34]:
sentences['is_summary'] = sentences.sentence.isin(headline_array).astype(int)
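
The `isin` test above is global: a sentence is flagged if it equals any cleaned headline from any article. A minimal sketch on invented data shows the consequence and one possible stricter variant that matches on `(text_id, sentence)` pairs. The names `toy_sentences`, `toy_headlines`, `headline_pairs` and `is_summary_strict` are hypothetical, and the strict variant assumes the headlines carry the `text_id` of their article.

```python
import pandas as pd

toy_sentences = pd.DataFrame({
    "text_id": [0, 0, 1, 1],
    "sentence": ["Take good pictures", "Use natural light",
                 "Buy a camera", "Take good pictures"],
})
toy_headlines = ["Take good pictures", "Buy a camera"]

# Global match: the last row is flagged although it is body text of article 1.
toy_sentences["is_summary"] = toy_sentences["sentence"].isin(toy_headlines).astype(int)

# Stricter match: a sentence counts only if it is a headline of its own article.
headline_pairs = pd.DataFrame({"text_id": [0, 1], "sentence": toy_headlines})
merged = toy_sentences[["text_id", "sentence"]].merge(
    headline_pairs, on=["text_id", "sentence"], how="left", indicator=True
)
toy_sentences["is_summary_strict"] = (merged["_merge"] == "both").astype(int).to_numpy()

print(toy_sentences)
```

The left merge keeps the row order of `toy_sentences` as long as the headline pairs are unique, which is why the result can be assigned back positionally.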

In [35]:
sentences.head(35)

Unnamed: 0,text_id,sentence,is_summary
0,176320,Sell yourself first,1
1,176320,"Before doing anything else, stop and sum up yo...",0
2,176320,"Now, think about how to translate that to an o...",0
3,176320,"Be it the few words, Twitter allows you or an ...",0
4,176320,Bring out the most salient features of your cr...,0
5,176320,Make it clear to readers why you are an artist...,0
6,176320,"If you re not great with words, find a friend ...",0
7,12342,Read the classics before,1
8,12342,Reading the classics is the very first thing y...,0
9,12342,If you want to build a solid foundation for yo...,0


In [36]:
sentences['is_summary'].value_counts()

0    5911711
1    1442918
Name: is_summary, dtype: int64

In [37]:
wikihow_sep.headline.describe()

count             1585695
unique            1357301
top       \nFinished.\n\n
freq                 4707
Name: headline, dtype: object

## Split Sentences by Words 

Now that we have (mostly) cleaned our dataset, we need to analyse each sentence to later extract the features that we need for our analysis. To analyse our sentences, we need to split them into words and perform some frequency calculations on them.

In [38]:
# Strip leading and trailing whitespace
sentences['sentence'] = sentences['sentence'].apply(lambda sentence: sentence.strip())

In [39]:
# Split each sentence into a list of words
sentences['words'] = sentences['sentence'].apply(lambda sentence: sentence.split(' '))
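
A short sketch of the difference between `split(' ')` and a whitespace split, on invented sentences (`toy` is a hypothetical name). Splitting on a single space produces empty tokens wherever two spaces remain, while `Series.str.split()` without a pattern splits on any run of whitespace. The difference only matters if the earlier space deduplication missed something.

```python
import pandas as pd

# The second sentence deliberately contains a double space.
toy = pd.Series(["Sell yourself first", "Now,  think about it"])

# Same approach as the cell above: split on exactly one space.
words_single_space = toy.apply(lambda sentence: sentence.split(" "))

# Alternative: split on any run of whitespace, so no empty tokens appear.
words_any_space = toy.str.split()

# Word counts can be read directly from the token lists.
lengths = words_any_space.str.len()

print(words_single_space.tolist())
print(words_any_space.tolist())
print(lengths.tolist())
```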

In [47]:
sentences.head(55)

Unnamed: 0,text_id,sentence,is_summary,words,title
0,176320,Sell yourself first,1,"[Sell, yourself, first]",How to Sell Fine Art Online
1,176320,"Before doing anything else, stop and sum up yo...",0,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online
2,176320,"Now, think about how to translate that to an o...",0,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online
3,176320,"Be it the few words, Twitter allows you or an ...",0,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online
4,176320,Bring out the most salient features of your cr...,0,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online
5,176320,Make it clear to readers why you are an artist...,0,"[Make, it, clear, to, readers, why, you, are, ...",How to Sell Fine Art Online
6,176320,"If you re not great with words, find a friend ...",0,"[If, you, re, not, great, with, words,, find, ...",How to Sell Fine Art Online
7,12342,Read the classics before,1,"[Read, the, classics, before]",How to Be Well Read
8,12342,Reading the classics is the very first thing y...,0,"[Reading, the, classics, is, the, very, first,...",How to Be Well Read
9,12342,If you want to build a solid foundation for yo...,0,"[If, you, want, to, build, a, solid, foundatio...",How to Be Well Read


In [41]:
wikihow_sep['title'].head()

0    How to Sell Fine Art Online
1            How to Be Well Read
2    How to Sell Fine Art Online
3    How to Sell Fine Art Online
4    How to Sell Fine Art Online
Name: title, dtype: object

In [42]:
sentences['title'] = sentences['text_id'].map(text_id_to_title)

In [43]:
sentences.head(20)

Unnamed: 0,text_id,sentence,is_summary,words,title
0,176320,Sell yourself first,1,"[Sell, yourself, first]",How to Sell Fine Art Online
1,176320,"Before doing anything else, stop and sum up yo...",0,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online
2,176320,"Now, think about how to translate that to an o...",0,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online
3,176320,"Be it the few words, Twitter allows you or an ...",0,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online
4,176320,Bring out the most salient features of your cr...,0,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online
5,176320,Make it clear to readers why you are an artist...,0,"[Make, it, clear, to, readers, why, you, are, ...",How to Sell Fine Art Online
6,176320,"If you re not great with words, find a friend ...",0,"[If, you, re, not, great, with, words,, find, ...",How to Sell Fine Art Online
7,12342,Read the classics before,1,"[Read, the, classics, before]",How to Be Well Read
8,12342,Reading the classics is the very first thing y...,0,"[Reading, the, classics, is, the, very, first,...",How to Be Well Read
9,12342,If you want to build a solid foundation for yo...,0,"[If, you, want, to, build, a, solid, foundatio...",How to Be Well Read


In [49]:
cleaned_sentences = sentences 

In [50]:
len(cleaned_sentences)

7354629

In [51]:
# Length of the sentence 
cleaned_sentences['sentence_len'] = cleaned_sentences['sentence'].apply(lambda x: len(str(x).split(' ')))

In [52]:
cleaned_sentences.head()

Unnamed: 0,text_id,sentence,is_summary,words,title,sentence_len
0,176320,Sell yourself first,1,"[Sell, yourself, first]",How to Sell Fine Art Online,3
1,176320,"Before doing anything else, stop and sum up yo...",0,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online,12
2,176320,"Now, think about how to translate that to an o...",0,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online,11
3,176320,"Be it the few words, Twitter allows you or an ...",0,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online,21
4,176320,Bring out the most salient features of your cr...,0,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online,18


In [53]:
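# Select one-word sentences. Note: is_summary holds 0/1, so the comparison with 'yes'
# is always True and one-word summaries are selected as well.
indexNames = cleaned_sentences[(cleaned_sentences['sentence_len'] == 1) & (cleaned_sentences['is_summary'] != 'yes')].index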
indexNames

Int64Index([     70,      71,      72,     110,     142,     144,     145,
                150,     157,     167,
            ...
            7353299, 7353307, 7353671, 7353696, 7353729, 7353822, 7353948,
            7354045, 7354456, 7354458],
           dtype='int64', length=105506)

In [54]:
# Delete these row indexes from dataFrame
cleaned_sentences.drop(indexNames, inplace= True)
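
The filter above compares the integer column `is_summary` with the string `'yes'`. A sketch on invented rows shows why that condition never excludes anything, and what the selection would look like if the intent was to drop only one-word sentences that are not summaries. That intent is an assumption, and `toy`, `always_true`, `one_word_non_summary` and `kept` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "sentence": ["Finished", "Ok", "Sell yourself first"],
    "is_summary": [1, 0, 1],
    "sentence_len": [1, 1, 3],
})

# Comparing 0/1 integers with a string: every row is "not equal".
always_true = toy["is_summary"] != "yes"

# Assumed intent: one-word sentences that are not summaries.
one_word_non_summary = (toy["sentence_len"] == 1) & (toy["is_summary"] == 0)

# Boolean-mask filtering returns a new frame instead of dropping rows in place.
kept = toy[~one_word_non_summary].reset_index(drop=True)

print(always_true.tolist())
print(kept)
```

With the stricter mask, the one-word headline "Finished" is kept while the one-word body sentence "Ok" is removed. Applying this to the real data would change the row counts shown in the outputs of this notebook.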

In [55]:
cleaned_sentences[cleaned_sentences['sentence'] == '2']

Unnamed: 0,text_id,sentence,is_summary,words,title,sentence_len


In [56]:
# Remove nonsense one-word sentences that contain digits
num_title_indexes = cleaned_sentences[(cleaned_sentences['sentence'].str.contains('[0-9]', regex=True)) & (cleaned_sentences['sentence_len'] == 1)].index


In [57]:
num_title_indexes

Int64Index([], dtype='int64')

In [58]:
# Delete these row indexes from dataFrame
cleaned_sentences.drop(num_title_indexes, inplace= True)

In [59]:
empty_indexes = cleaned_sentences[cleaned_sentences['sentence'] == ''].index
empty_indexes

Int64Index([], dtype='int64')

In [60]:
cleaned_sentences.drop(empty_indexes, inplace= True)

In [62]:
cleaned_sentences.head(55)

Unnamed: 0,text_id,sentence,is_summary,words,title,sentence_len
0,176320,Sell yourself first,1,"[Sell, yourself, first]",How to Sell Fine Art Online,3
1,176320,"Before doing anything else, stop and sum up yo...",0,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online,12
2,176320,"Now, think about how to translate that to an o...",0,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online,11
3,176320,"Be it the few words, Twitter allows you or an ...",0,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online,21
4,176320,Bring out the most salient features of your cr...,0,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online,18
5,176320,Make it clear to readers why you are an artist...,0,"[Make, it, clear, to, readers, why, you, are, ...",How to Sell Fine Art Online,24
6,176320,"If you re not great with words, find a friend ...",0,"[If, you, re, not, great, with, words,, find, ...",How to Sell Fine Art Online,29
7,12342,Read the classics before,1,"[Read, the, classics, before]",How to Be Well Read,4
8,12342,Reading the classics is the very first thing y...,0,"[Reading, the, classics, is, the, very, first,...",How to Be Well Read,16
9,12342,If you want to build a solid foundation for yo...,0,"[If, you, want, to, build, a, solid, foundatio...",How to Be Well Read,33


In [63]:
#Saving the cleaned Dataset
cleaned_sentences.to_csv('./datasets/clean_wikihow_sep.csv', index=False)