In [1]:
import numpy as np
import pandas as pd
import warnings
from cleaning_funcs import *
import re 
warnings.simplefilter(action='ignore')

In [2]:
# Consists of concatenation of all paragraphs as articles and bold lines as summaries
# wikihow_all = pd.read_csv('./datasets/wikihowAll.csv')

In [3]:
# wikihow_all.head()

In [4]:
# Consists of each paragraph and its summary
wikihow_sep = pd.read_csv('./datasets/wikihowSep.csv')

# Picking a Dataset 

## Dataset structuring

In a wikiHow article, a paragraph's summary usually appears at the top of the section, highlighted in bold. We need to re-create this classic wikiHow structure by prepending the `headline` sentence to the paragraph for further processing.

* We will create a column called `full_text` where the first line will be the sentence summarizing the whole paragraph.  

* The position of the summary sentence at the beginning of the paragraph is specific to wikiHow and is not representative of a typical text or article structure. If we plan to use a sentence-location feature for summarization, it will probably not be useful for a model trained only on the wikiHow dataset. This problem could later be addressed by adding new and different summaries to the dataset. 

* The `wikihow_sep` dataset seems to be a good starting point for our analysis since it provides the paragraph and the sentence chosen to be its summary. 

In [5]:
# Check the dataset description
wikihow_sep.describe()

Unnamed: 0,overview,headline,text,sectionLabel,title
count,1583187.0,1585695,1387290,1583791,1585694
unique,128543.0,1357301,1354189,188924,214613
top,,\nFinished.\n\n,;\n,Steps,How to Create an Overall Status Workbook in XL...
freq,826.0,4707,14164,449451,227


In [6]:
# Select preliminary useful columns 
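# .copy() gives an independent dataframe, so later column assignments do not write to a slice of wikihow_sep
wiki_filtered = wikihow_sep[['headline', 'text', 'title']].copy()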

In [7]:
# Re-create a full wikihow paragraph
wiki_filtered['full_text'] = wiki_filtered['headline'] + wiki_filtered['text']
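
One thing to watch for in this step: the `describe()` output above reports fewer non-null `text` values than `headline` values, and concatenating a string with a missing value yields a missing value. The sketch below illustrates this on a toy dataframe (the rows are made up for illustration) and shows one possible guard using `fillna('')`. It is not what the notebook does, and applying it would change the row counts shown later.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): the second one has no paragraph text
toy = pd.DataFrame({
    'headline': ['\nSell yourself first.', '\nFinished.'],
    'text': [' Before doing anything else, stop.', np.nan],
})

# Plain concatenation: a missing text makes the whole result missing
naive = toy['headline'] + toy['text']

# Possible guard: treat missing pieces as empty strings before concatenating
safe = toy['headline'].fillna('') + toy['text'].fillna('')

print(naive.isna().tolist())
print(safe.tolist())
```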

In [8]:
#Reset index for later use
wiki_filtered = wiki_filtered.reset_index()

In [9]:
# Copy the index column into a text_id column 
wiki_filtered['text_id'] = wiki_filtered['index']

In [10]:
# Filter dataframe columns 
wiki_filtered = wiki_filtered[['text_id', 'full_text', 'title']]

In [11]:
#Explore end of dataframe
wiki_filtered.tail()

Unnamed: 0,text_id,full_text,title
1585690,1585690,\nMagnify the image. Unlike some of the other ...,How to Identify Prints3
1585691,1585691,\nLook for the absence of plate marks. If you ...,How to Identify Prints3
1585692,1585692,\nLook for the flatness of the ink. Upon close...,How to Identify Prints3
1585693,1585693,"\nLook for the illusion of shade, created by m...",How to Identify Prints3
1585694,1585694,"\nLook for blurriness. Typically, fine details...",How to Identify Prints3


In [12]:
# Remove special characters
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: clean_special_chars(str(x)))

In [13]:
# Deduplicates spaces
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: dedupe_spaces(str(x)))

In [14]:
# Remove new lines
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: remove_newlines(str(x)))

In [15]:
# Remove empty sentences
wiki_filtered['full_text'] = wiki_filtered['full_text'].apply(lambda x: remove_empty_sentences(str(x)))
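
The helpers used in the cells above come from `cleaning_funcs`, whose source is not shown here. As a rough sketch only, three of them might look like the following; the function bodies and the exact set of characters kept are assumptions, not the actual implementations. `remove_empty_sentences` is left out because its behaviour cannot be inferred from the outputs.

```python
import re

# Hypothetical stand-ins for the cleaning_funcs helpers (assumed behaviour)

def clean_special_chars(text):
    # Replace anything that is not a letter, digit, whitespace,
    # or basic punctuation with a single space
    return re.sub(r"[^A-Za-z0-9\s.,!?]", " ", text)

def dedupe_spaces(text):
    # Collapse runs of two or more spaces into one
    return re.sub(r" {2,}", " ", text)

def remove_newlines(text):
    # Replace each newline with a space
    return text.replace("\n", " ")

raw = "\nYou won't  find green."
# Same order as in the notebook cells above
cleaned = remove_newlines(dedupe_spaces(clean_special_chars(raw)))
print(repr(cleaned))
```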

In [16]:
wiki_filtered.tail()

Unnamed: 0,text_id,full_text,title
1585690,1585690,Magnify the image. Unlike some of the other v...,How to Identify Prints3
1585691,1585691,Look for the absence of plate marks. If you f...,How to Identify Prints3
1585692,1585692,Look for the flatness of the ink. Upon close ...,How to Identify Prints3
1585693,1585693,"Look for the illusion of shade, created by mu...",How to Identify Prints3
1585694,1585694,"Look for blurriness. Typically, fine details ...",How to Identify Prints3


In [17]:
# Split text by sentences   
wiki_filtered['sentences'] = wiki_filtered['full_text'].apply(lambda x: split_sentences(str(x)) )

In [19]:
# Create a list of (text_id, sentences) tuples: index 0 holds the text_id, index 1 the list of sentences of that text
tuples = list(zip(wiki_filtered['text_id'], wiki_filtered['sentences']))

In [20]:
# Apply custom function to pair each sentence with the text_id of its original text
tup_list = tup_list_maker(tuples)

In [21]:
# Converting the tuples list into a dataframe 
sentences = pd.DataFrame(tup_list, columns =['text_id', 'sentence'])
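
`tup_list_maker` is a custom helper whose source is not shown. Assuming it simply emits one `(text_id, sentence)` pair per sentence, a similar result can be sketched with pandas' built-in `DataFrame.explode`. The toy data below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for wiki_filtered: one row per text, sentences stored as lists
toy = pd.DataFrame({
    'text_id': [0, 1],
    'sentences': [['Sell yourself first', 'Sum up your work'],
                  ['Read the classics']],
})

# explode() repeats text_id once per list element, giving one row per sentence
sentences_toy = (
    toy[['text_id', 'sentences']]
    .explode('sentences', ignore_index=True)
    .rename(columns={'sentences': 'sentence'})
)
print(sentences_toy)
```

This avoids materialising an intermediate Python list of tuples, which may matter for a dataset with several million sentences.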

In [22]:
#Check the result
sentences.head()

Unnamed: 0,text_id,sentence
0,0,Sell yourself first
1,0,"Before doing anything else, stop and sum up yo..."
2,0,"Now, think about how to translate that to an o..."
3,0,"Be it the few words, Twitter allows you or an ..."
4,0,Bring out the most salient features of your cr...


In [23]:
sentences.tail()

Unnamed: 0,text_id,sentence
7366786,1585693,Shade in color prints is typically made via va...
7366787,1585694,Look for blurriness
7366788,1585694,"Typically, fine details will be somewhat blurr..."
7366789,1585694,"Often, the paper won t quite stick, or will ot..."
7366790,1585694,This is typically a sign of planographic litho...


In [24]:
sentences.head(15)

Unnamed: 0,text_id,sentence
0,0,Sell yourself first
1,0,"Before doing anything else, stop and sum up yo..."
2,0,"Now, think about how to translate that to an o..."
3,0,"Be it the few words, Twitter allows you or an ..."
4,0,Bring out the most salient features of your cr...
5,0,Make it clear to readers why you are an artist...
6,0,"If you re not great with words, find a friend ..."
7,1,Read the classics before 1600
8,1,Reading the classics is the very first thing y...
9,1,If you want to build a solid foundation for yo...


## Adding a column specifying if the sentence is part of the summary

To create our labeled dataset, we need to mark each sentence as part of the summary or not. Because the headline is always the first sentence of its text, we can use a trick with the pandas `diff` function on the `text_id` column: it subtracts the previous row's value from the current one, so a non-zero difference marks the first sentence of a new text, which is the summary sentence. The very first row has no predecessor and gets `NaN`; it is treated as a summary sentence as well.

In [25]:
# Add a column specifying if the sentence is part of the summary
sentences['difference'] = sentences['text_id'].diff()

In [26]:
sentences['is_summary'] = sentences['difference'].apply(is_summary)

In [27]:
sentences.head(15)

Unnamed: 0,text_id,sentence,difference,is_summary
0,0,Sell yourself first,,yes
1,0,"Before doing anything else, stop and sum up yo...",0.0,no
2,0,"Now, think about how to translate that to an o...",0.0,no
3,0,"Be it the few words, Twitter allows you or an ...",0.0,no
4,0,Bring out the most salient features of your cr...,0.0,no
5,0,Make it clear to readers why you are an artist...,0.0,no
6,0,"If you re not great with words, find a friend ...",0.0,no
7,1,Read the classics before 1600,1.0,yes
8,1,Reading the classics is the very first thing y...,0.0,no
9,1,If you want to build a solid foundation for yo...,0.0,no
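
The `is_summary` helper is imported from `cleaning_funcs` and is not shown. A minimal sketch that reproduces the labels in the table above, assuming the helper only checks whether the difference is zero, could look like this (`is_summary_sketch` is a hypothetical name). A vectorized alternative using `shift` is included for comparison.

```python
import pandas as pd

def is_summary_sketch(difference):
    # Assumed logic: only a difference of exactly 0 means "same text as the
    # previous row". NaN (first row) compares unequal to 0, so it yields 'yes'.
    return 'no' if difference == 0 else 'yes'

toy = pd.DataFrame({'text_id': [0, 0, 0, 1, 1, 3]})
toy['difference'] = toy['text_id'].diff()
toy['is_summary'] = toy['difference'].apply(is_summary_sketch)

# Vectorized alternative: a row starts a new text when its text_id differs
# from the previous row's text_id
starts_new_text = toy['text_id'].ne(toy['text_id'].shift())
toy['is_summary_vec'] = starts_new_text.map({True: 'yes', False: 'no'})
print(toy)
```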


## Split Sentences by Words 

Now that we have (mostly) cleaned our dataset, we need to analyse each sentence to later extract the features that we need for our analysis. To analyse our sentences, we need to split them into words and perform some frequency calculations on them.

In [28]:
# Strip leading and trailing whitespace
sentences['sentence'] = sentences['sentence'].apply(lambda sentence: sentence.strip())

In [29]:
# Split each sentence into a list of words on single spaces
sentences['words'] = sentences['sentence'].apply(lambda sentence: sentence.split(' '))

In [30]:
sentences.tail(15)

Unnamed: 0,text_id,sentence,difference,is_summary,words
7366776,1585692,"Upon close examination, you should notice that...",0.0,no,"[Upon, close, examination,, you, should, notic..."
7366777,1585692,"Everything should be on the same level, with n...",0.0,no,"[Everything, should, be, on, the, same, level,..."
7366778,1585692,Noticing this will require serious magnificati...,0.0,no,"[Noticing, this, will, require, serious, magni..."
7366779,1585693,"Look for the illusion of shade, created by mul...",1.0,yes,"[Look, for, the, illusion, of, shade,, created..."
7366780,1585693,Since the planographic surface holds and repel...,0.0,no,"[Since, the, planographic, surface, holds, and..."
7366781,1585693,"Usually, shaded areas will be spotty, shooting...",0.0,no,"[Usually,, shaded, areas, will, be, spotty,, s..."
7366782,1585693,One mark will not be lighter or darker than th...,0.0,no,"[One, mark, will, not, be, lighter, or, darker..."
7366783,1585693,This creates the illusion of shade,0.0,no,"[This, creates, the, illusion, of, shade]"
7366784,1585693,A print with multiple colors will overlap thos...,0.0,no,"[A, print, with, multiple, colors, will, overl..."
7366785,1585693,"In general, you won t find green, but overlapp...",0.0,no,"[In, general,, you, won, t, find, green,, but,..."
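
Note that splitting on spaces keeps punctuation attached to tokens (for example `examination,` in the table above). For the frequency calculations mentioned earlier, one possible approach is to strip punctuation and lowercase each token before counting. The sketch below uses made-up sentences and a hypothetical `tokenize` helper; it is not part of the notebook's pipeline.

```python
from collections import Counter
import string

def tokenize(sentence):
    # Split on whitespace, strip surrounding punctuation, lowercase,
    # and drop tokens that end up empty
    tokens = (word.strip(string.punctuation).lower() for word in sentence.split())
    return [token for token in tokens if token]

toy_sentences = [
    'Upon close examination, you should notice the ink',
    'Look for the flatness of the ink',
]

# Count how often each normalized word occurs across all sentences
word_counts = Counter(token for s in toy_sentences for token in tokenize(s))
print(word_counts.most_common(3))
```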


In [31]:
wikihow_sep['title'].head()

0    How to Sell Fine Art Online
1            How to Be Well Read
2    How to Sell Fine Art Online
3    How to Sell Fine Art Online
4    How to Sell Fine Art Online
Name: title, dtype: object

In [32]:
# Mapping text titles with corresponding indexes 
index_dicts= {}
for index, item in enumerate(wikihow_sep['title']):
    index_dicts[index] = item


In [33]:
sentences['title'] = sentences['text_id'].map(index_dicts)
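
As a side note, `Series.map` also accepts a Series and looks values up by its index, so the intermediate dictionary is optional. The sketch below shows this on toy data and assumes, as in this notebook, that `text_id` equals the row position in the original dataframe.

```python
import pandas as pd

# Toy stand-ins: titles indexed 0..n-1, and sentences carrying a text_id
titles = pd.Series(['How to Sell Fine Art Online', 'How to Be Well Read'])
sentences_toy = pd.DataFrame({'text_id': [0, 0, 1]})

# map() with a Series looks each text_id up in the Series index
sentences_toy['title'] = sentences_toy['text_id'].map(titles)
print(sentences_toy)
```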

In [34]:
sentences.head(20)

Unnamed: 0,text_id,sentence,difference,is_summary,words,title
0,0,Sell yourself first,,yes,"[Sell, yourself, first]",How to Sell Fine Art Online
1,0,"Before doing anything else, stop and sum up yo...",0.0,no,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online
2,0,"Now, think about how to translate that to an o...",0.0,no,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online
3,0,"Be it the few words, Twitter allows you or an ...",0.0,no,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online
4,0,Bring out the most salient features of your cr...,0.0,no,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online
5,0,Make it clear to readers why you are an artist...,0.0,no,"[Make, it, clear, to, readers, why, you, are, ...",How to Sell Fine Art Online
6,0,"If you re not great with words, find a friend ...",0.0,no,"[If, you, re, not, great, with, words,, find, ...",How to Sell Fine Art Online
7,1,Read the classics before 1600,1.0,yes,"[Read, the, classics, before, 1600]",How to Be Well Read
8,1,Reading the classics is the very first thing y...,0.0,no,"[Reading, the, classics, is, the, very, first,...",How to Be Well Read
9,1,If you want to build a solid foundation for yo...,0.0,no,"[If, you, want, to, build, a, solid, foundatio...",How to Be Well Read


In [35]:
# Drop temporary columns 
cleaned_sentences = sentences.drop(['difference'], axis=1)

In [36]:
cleaned_sentences.head()

Unnamed: 0,text_id,sentence,is_summary,words,title
0,0,Sell yourself first,yes,"[Sell, yourself, first]",How to Sell Fine Art Online
1,0,"Before doing anything else, stop and sum up yo...",no,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online
2,0,"Now, think about how to translate that to an o...",no,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online
3,0,"Be it the few words, Twitter allows you or an ...",no,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online
4,0,Bring out the most salient features of your cr...,no,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online


In [37]:
len(cleaned_sentences)

7366791

In [38]:
# Length of the sentence 
cleaned_sentences['sentence_len'] = cleaned_sentences['sentence'].apply(lambda x: len(str(x).split(' ')))

In [39]:
cleaned_sentences.head()

Unnamed: 0,text_id,sentence,is_summary,words,title,sentence_len
0,0,Sell yourself first,yes,"[Sell, yourself, first]",How to Sell Fine Art Online,3
1,0,"Before doing anything else, stop and sum up yo...",no,"[Before, doing, anything, else,, stop, and, su...",How to Sell Fine Art Online,12
2,0,"Now, think about how to translate that to an o...",no,"[Now,, think, about, how, to, translate, that,...",How to Sell Fine Art Online,11
3,0,"Be it the few words, Twitter allows you or an ...",no,"[Be, it, the, few, words,, Twitter, allows, yo...",How to Sell Fine Art Online,21
4,0,Bring out the most salient features of your cr...,no,"[Bring, out, the, most, salient, features, of,...",How to Sell Fine Art Online,18


In [40]:
# Find one-word sentences that are not summaries
indexNames = cleaned_sentences[(cleaned_sentences['sentence_len'] == 1) & (cleaned_sentences['is_summary'] != 'yes')].index
indexNames

Int64Index([     70,      71,      72,     142,     144,     145,     150,
                157,     167,     169,
            ...
            7365315, 7365317, 7365356, 7365357, 7365424, 7365461, 7365984,
            7366110, 7366618, 7366620],
           dtype='int64', length=82664)

In [41]:
# Delete these row indexes from dataFrame
cleaned_sentences.drop(indexNames, inplace= True)

In [42]:
cleaned_sentences[cleaned_sentences['sentence'] == '2']

Unnamed: 0,text_id,sentence,is_summary,words,title,sentence_len
1830373,315893,2,yes,[2],How to Count to 10 in Danish,1
1959352,356571,2,yes,[2],How to Download World of Warcraft Addons,1
4711029,996544,2,yes,[2],How to Unlock Huawei E585 Mifi Router,1
4711036,996550,2,yes,[2],How to Unlock Huawei E585 Mifi Router,1
6317036,1335970,2,yes,[2],How to Play the Sicilian Defence Opening in Chess,1
7044016,1515301,2,yes,[2],How to Sell Rare Books,1


In [43]:
# Remove nonsensical headlines: one-word sentences that contain a digit
num_title_indexes = cleaned_sentences[(cleaned_sentences['sentence'].str.contains('[0-9]', regex=True)) & (cleaned_sentences['sentence_len'] == 1)].index


In [44]:
num_title_indexes

Int64Index([ 380125, 1668428, 1756055, 1830310, 1830373, 1830374, 1830375,
            1830376, 1830377, 1830378, 1830380, 1830381, 1830382, 1959298,
            1959352, 1959354, 1959357, 1959361, 1959364, 1959366, 2028277,
            2363007, 2790445, 2794581, 3526549, 3621058, 4185344, 4438873,
            4460717, 4711027, 4711029, 4711031, 4711032, 4711034, 4711036,
            4711038, 4711040, 4711042, 4711044, 4714992, 5211450, 5233813,
            6123498, 6303874, 6303897, 6316747, 6317036, 6317046, 6330084,
            6330091, 6330147, 6330154, 6330159, 6330253, 6330259, 6330266,
            6330273, 6330459, 6330464, 6330470, 6330480, 6330485, 6330489,
            6330494, 6820324, 6829573, 6829589, 6829597, 6829602, 6930025,
            7017567, 7044016, 7232349],
           dtype='int64')

In [45]:
# Delete these row indexes from dataFrame
cleaned_sentences.drop(num_title_indexes, inplace= True)

In [63]:
# Find sentences that are empty strings
empty_indexes = cleaned_sentences[cleaned_sentences['sentence'] == ''].index
empty_indexes

Int64Index([ 234099,  234967,  318983,  498183,  549845,  645705,  771768,
             886107,  978125, 1005635, 1591962, 1841633, 1960192, 2051178,
            2066380, 2114590, 2725905, 3064487, 3365855, 3428905, 3556633,
            3556634, 3699128, 3805657, 3805659, 3876022, 4215405, 4435197,
            4435199, 4599322, 5038614, 5234289, 5364096, 5382867, 5382942,
            5382946, 5382948, 5382950, 5382952, 5382954, 5382957, 5382959,
            5382963, 5573362, 5822872, 6294814, 6325156, 6821723, 6959415,
            6967497, 7089784, 7110220, 7176380, 7333006],
           dtype='int64')

In [64]:
cleaned_sentences.drop(empty_indexes, inplace= True)
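
The three removal steps above (one-word non-summary sentences, one-word sentences containing a digit, and empty sentences) can also be expressed as boolean masks and applied in a single filter. The following is a sketch on a toy dataframe with made-up rows; it uses the same conditions as the cells above but is not the notebook's actual code.

```python
import pandas as pd

toy = pd.DataFrame({
    'sentence': ['Sell yourself first', 'Finished', '2', '', 'Tip'],
    'is_summary': ['yes', 'no', 'yes', 'yes', 'yes'],
})
# Same length definition as above: number of space-separated tokens
toy['sentence_len'] = toy['sentence'].str.split(' ').str.len()

one_word = toy['sentence_len'] == 1
# One-word sentences that are not summaries
one_word_non_summary = one_word & (toy['is_summary'] != 'yes')
# One-word sentences containing a digit
numeric_one_word = one_word & toy['sentence'].str.contains('[0-9]', regex=True)
# Empty sentences
empty = toy['sentence'] == ''

# Keep only rows that match none of the removal conditions
kept = toy[~(one_word_non_summary | numeric_one_word | empty)]
print(kept)
```

Filtering with masks returns a new dataframe instead of mutating in place, which makes the step easier to re-run in a notebook.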

In [65]:
#Saving the cleaned Dataset
cleaned_sentences.to_csv('./datasets/clean_wikihow_sep.csv', index=False)