In [1]:
import numpy as np
import pandas as pd
import warnings
from cleaning_funcs import tup_list_maker, is_summary, is_sentence, clean_n_char
warnings.simplefilter(action='ignore')

In [2]:
# Consists of concatenation of all paragraphs as articles and bold lines as summaries
wikihow_all = pd.read_csv('./datasets/wikihowAll.csv')

In [3]:
wikihow_all.head()

Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMa...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
1,\nCreate a sketch in the NeoPopRealist manner ...,How to Create a Neopoprealist Art Work,See the image for how this drawing develops s...
2,"\nGet a bachelor’s degree.,\nEnroll in a studi...",How to Be a Visual Effects Artist1,It is possible to become a VFX artist without...
3,\nStart with some experience or interest in ar...,How to Become an Art Investor,The best art investors do their research on t...
4,"\nKeep your reference materials, sketches, art...",How to Be an Organized Artist2,"As you start planning for a project or work, ..."


In [4]:
# Consists of each paragraph and its summary
wikihow_sep = pd.read_csv('./datasets/wikihowSep.csv')

In [5]:
wikihow_sep.head()

Unnamed: 0,overview,headline,text,sectionLabel,title
0,So you're a new or aspiring artist and your c...,\nSell yourself first.,"Before doing anything else, stop and sum up y...",Steps,How to Sell Fine Art Online
1,"If you want to be well-read, then, in the wor...",\nRead the classics before 1600.,Reading the classics is the very first thing ...,Reading the Classics,How to Be Well Read
2,So you're a new or aspiring artist and your c...,\nJoin online artist communities.,Depending on what scale you intend to sell yo...,Steps,How to Sell Fine Art Online
3,So you're a new or aspiring artist and your c...,\nMake yourself public.,Get yourself out there as best as you can by ...,Steps,How to Sell Fine Art Online
4,So you're a new or aspiring artist and your c...,\nBlog about your artwork.,"Given the hundreds of free blogging websites,...",Steps,How to Sell Fine Art Online


# Picking a Dataset 

## Dataset structuring

In a wikihow article, a paragraph's summary usually comes at the beginning of the section and is highlighted in bold. We need to re-create the classic wikihow structure by prepending the `headline` sentence to the paragraph for further processing.

* We will create a column called `full_text` where the first line will be the sentence summarizing the whole paragraph.  

* The position of the summary sentence at the beginning of the paragraph is specific to the wikihow structure and is not representative of a typical text or article. If we plan on using some kind of sentence-location feature for the summarization, a model trained only on the wikihow dataset will probably not find it useful elsewhere. This problem could later be addressed by adding new and different summaries to the dataset.

* The `wikihow_sep` dataset seems to be a good starting point for our analysis since it provides the paragraph and the sentence chosen to be its summary. 

In [6]:
# Check the dataset description
wikihow_sep.describe()

Unnamed: 0,overview,headline,text,sectionLabel,title
count,1583187.0,1585695,1387290,1583791,1585694
unique,128543.0,1357301,1354189,188924,214613
top,,\nFinished.\n\n,;\n,Steps,How to Create an Overall Status Workbook in XL...
freq,826.0,4707,14164,449451,227


In [7]:
# Select preliminary useful columns (copy so later column assignments do not act on a view)
wiki_filtered = wikihow_sep[['headline', 'text', 'title']].copy()

In [8]:
# Re-create a full wikihow paragraph
wiki_filtered['full_text'] = wiki_filtered['headline'] + wiki_filtered['text']
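
The `describe()` output above reports fewer non-null values for `text` than for `headline`, so some rows have a headline but no paragraph. With plain `+`, a missing operand makes the whole `full_text` missing. The toy frame below is made up for illustration; it is a minimal sketch of that behaviour and of one possible workaround (`fillna('')`), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wikihow_sep; the second row has no `text`
toy = pd.DataFrame({
    "headline": ["\nSell yourself first.", "\nFinished."],
    "text": [" Before doing anything else, stop.", np.nan],
})

# Plain concatenation: a missing operand makes the result missing
naive = toy["headline"] + toy["text"]

# Filling missing values with an empty string keeps the headline
safe = toy["headline"].fillna("") + toy["text"].fillna("")

print(naive.isna().tolist())
print(safe.tolist())
```

Whether rows without a paragraph should be kept at all is a separate decision; a headline with no body has nothing to summarize.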

In [9]:
#Reset index for later use
wiki_filtered = wiki_filtered.reset_index()

In [10]:
# Rename index column to text_id column 
wiki_filtered['text_id'] = wiki_filtered['index']

In [11]:
# Filter dataframe columns 
wiki_filtered = wiki_filtered[['text_id', 'full_text', 'title']]

In [12]:
#Explore end of dataframe
wiki_filtered.tail()

Unnamed: 0,text_id,full_text,title
1585690,1585690,\nMagnify the image. Unlike some of the other ...,How to Identify Prints3
1585691,1585691,\nLook for the absence of plate marks. If you ...,How to Identify Prints3
1585692,1585692,\nLook for the flatness of the ink. Upon close...,How to Identify Prints3
1585693,1585693,"\nLook for the illusion of shade, created by m...",How to Identify Prints3
1585694,1585694,"\nLook for blurriness. Typically, fine details...",How to Identify Prints3


In [13]:
# Split text by sentences   
wiki_filtered['sentences'] = wiki_filtered['full_text'].apply(lambda x: str(x).split('.') )

In [14]:
wiki_filtered.head()

Unnamed: 0,text_id,full_text,title,sentences
0,0,\nSell yourself first. Before doing anything e...,How to Sell Fine Art Online,"[\nSell yourself first, Before doing anything..."
1,1,\nRead the classics before 1600. Reading the c...,How to Be Well Read,"[\nRead the classics before 1600, Reading the..."
2,2,\nJoin online artist communities. Depending on...,How to Sell Fine Art Online,"[\nJoin online artist communities, Depending ..."
3,3,\nMake yourself public. Get yourself out there...,How to Sell Fine Art Online,"[\nMake yourself public, Get yourself out the..."
4,4,\nBlog about your artwork. Given the hundreds ...,How to Sell Fine Art Online,"[\nBlog about your artwork, Given the hundred..."
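
Splitting on `'.'` is simple, but it also cuts inside decimal numbers and leaves an empty trailing fragment. The sketch below contrasts it with a regular-expression split on whitespace that follows sentence-ending punctuation. The example string and the pattern are assumptions chosen for illustration; the regex still splits after abbreviations such as "e.g.", so it is only a partial improvement.

```python
import re

text = "\nMeasure 2.5 cups of flour. Mix well."

# Naive split: breaks "2.5" apart and leaves an empty last fragment
naive = text.split(".")

# Split only on whitespace that follows '.', '!' or '?'
by_regex = re.split(r"(?<=[.!?])\s+", text.strip())

print(naive)
print(by_regex)
```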


In [15]:
# Create a list of (text_id, sentences) tuples: index 0 holds the text_id, index 1 the list of sentences of that text
tuples = list(zip(wiki_filtered['text_id'], wiki_filtered['sentences']))

In [16]:
# Apply custom function to pair each sentence with the text_id of its original text
tup_list = tup_list_maker(tuples)

In [17]:
# Converting the tuples list into a dataframe 
sentences = pd.DataFrame(tup_list, columns =['text_id', 'sentence'])

In [18]:
#Check the result
sentences.head()

Unnamed: 0,text_id,sentence
0,0,\nSell yourself first
1,0,"Before doing anything else, stop and sum up y..."
2,0,"Now, think about how to translate that to an ..."
3,0,"Be it the few words, Twitter allows you or an..."
4,0,Bring out the most salient features of your c...
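
The source of `tup_list_maker` is not shown here. Judging from its input and output, it flattens `(text_id, [sentences])` pairs into one `(text_id, sentence)` pair per sentence. The function below is a hypothetical stand-in written under that assumption, followed by the built-in `DataFrame.explode`, which should give the same rows on this toy data.

```python
import pandas as pd

def tup_list_maker_sketch(tuples):
    """Hypothetical stand-in: flatten (text_id, [sentences]) into (text_id, sentence) pairs."""
    return [(text_id, sentence) for text_id, sents in tuples for sentence in sents]

# Toy data shaped like wiki_filtered after the split on '.'
toy = pd.DataFrame({
    "text_id": [0, 1],
    "sentences": [
        ["\nSell yourself first", " Before doing anything else", ""],
        ["\nRead the classics before 1600", ""],
    ],
})

pairs = tup_list_maker_sketch(zip(toy["text_id"], toy["sentences"]))
flat = pd.DataFrame(pairs, columns=["text_id", "sentence"])

# Built-in alternative: one row per list element
exploded = (
    toy.explode("sentences")
       .rename(columns={"sentences": "sentence"})
       .reset_index(drop=True)
)

print(flat)
```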


In [19]:
sentences.tail()

Unnamed: 0,text_id,sentence
8766523,1585694,\nLook for blurriness
8766524,1585694,"Typically, fine details will be somewhat blur..."
8766525,1585694,"Often, the paper won't quite stick, or will o..."
8766526,1585694,This is typically a sign of planographic lith...
8766527,1585694,


In [20]:
sentences.head(15)

Unnamed: 0,text_id,sentence
0,0,\nSell yourself first
1,0,"Before doing anything else, stop and sum up y..."
2,0,"Now, think about how to translate that to an ..."
3,0,"Be it the few words, Twitter allows you or an..."
4,0,Bring out the most salient features of your c...
5,0,Make it clear to readers why you are an artis...
6,0,"If you're not great with words, find a friend..."
7,0,;\n
8,1,\nRead the classics before 1600
9,1,Reading the classics is the very first thing ...


## Further Cleaning 

The sentences contain some unwanted content, such as empty strings and '\n' characters at the beginning or end, and some rows look like they could be deleted. We will explore the dataset to see what needs cleaning.

In [21]:
# Look at the first rows again for values we might not need
sentences.head()

Unnamed: 0,text_id,sentence
0,0,\nSell yourself first
1,0,"Before doing anything else, stop and sum up y..."
2,0,"Now, think about how to translate that to an ..."
3,0,"Be it the few words, Twitter allows you or an..."
4,0,Bring out the most salient features of your c...


In [22]:
sentences['is_sentence'] = sentences['sentence'].apply(is_sentence)

In [23]:
# Keep only the rows that contain sentences (copy so later column assignments do not act on a view)
sentences = sentences[sentences['is_sentence']].copy()

In [24]:
# Remove the '\n' character at the beginning of some sentences
sentences['sentence'] = sentences['sentence'].apply(clean_n_char)
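
`is_sentence` and `clean_n_char` come from the local `cleaning_funcs` module, whose source is not shown. The functions below are hypothetical stand-ins, written only from what the outputs suggest (fragments such as `';\n'` and empty strings are dropped, and the leading `'\n'` disappears). The real rules may differ.

```python
def is_sentence_sketch(fragment):
    """Hypothetical rule: keep a fragment only if it has at least one letter or digit."""
    return any(ch.isalnum() for ch in str(fragment))

def clean_n_char_sketch(fragment):
    """Hypothetical rule: drop newline characters from both ends of the fragment."""
    return str(fragment).strip("\n")

fragments = ["\nSell yourself first", ";\n", "", " Before doing anything else"]

# Filter first, then clean, mirroring the order of the cells above
kept = [clean_n_char_sketch(f) for f in fragments if is_sentence_sketch(f)]
print(kept)
```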

## Adding a column specifying if the sentence is part of the summary

To create our labeled dataset, we need to identify, for each sentence, whether it is part of the final summary. To do that, we use a trick with the pandas `diff` function on the `text_id` column: it compares each row with the previous one and returns the difference. Because the headline is the first sentence of every paragraph, a difference other than 0 (or a missing value, for the very first row) means that the sentence starts a new text and is therefore the summary.
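
The `diff` trick can be seen on a toy column of `text_id` values. `is_summary_sketch` is a hypothetical stand-in for the imported `is_summary` helper, assuming it returns `'no'` for a difference of 0 and `'yes'` otherwise.

```python
import pandas as pd

def is_summary_sketch(difference):
    """Hypothetical stand-in: 'no' when text_id did not change, 'yes' otherwise (NaN included)."""
    return "no" if difference == 0 else "yes"

# Three sentences from text 0, two from text 1, one from text 2
toy = pd.DataFrame({"text_id": [0, 0, 0, 1, 1, 2]})

# diff() is NaN on the first row, 0 within a text, non-zero where a new text starts
toy["difference"] = toy["text_id"].diff()
toy["is_summary"] = toy["difference"].apply(is_summary_sketch)

print(toy)
```

This labelling relies on the headline still being the first row of each text; if a headline were removed during cleaning, the next sentence of that text would be labelled as the summary instead.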

In [25]:
# Difference between each row's text_id and the previous row's text_id
sentences['difference'] = sentences['text_id'].diff()

In [26]:
sentences['is_summary'] = sentences['difference'].apply(is_summary)

In [30]:
sentences.head(15)

Unnamed: 0,text_id,sentence,is_sentence,difference,is_summary
0,0,Sell yourself first,True,,yes
1,0,"Before doing anything else, stop and sum up y...",True,0.0,no
2,0,"Now, think about how to translate that to an ...",True,0.0,no
3,0,"Be it the few words, Twitter allows you or an...",True,0.0,no
4,0,Bring out the most salient features of your c...,True,0.0,no
5,0,Make it clear to readers why you are an artis...,True,0.0,no
6,0,"If you're not great with words, find a friend...",True,0.0,no
8,1,Read the classics before 1600,True,1.0,yes
9,1,Reading the classics is the very first thing ...,True,0.0,no
10,1,If you want to build a solid foundation for y...,True,0.0,no


## Split Sentences by Words 

Now that we have (mostly) cleaned our dataset, we need to analyse each sentence to later extract the features that we need for our analysis. To analyse our sentences, we need to split them into words and perform some frequency calculations on them.

In [37]:
# Strip leading and ending whitespace
sentences['sentence'] = sentences['sentence'].apply(lambda sentence: sentence.strip())

In [38]:
# Split each sentence into a list of words
sentences['words'] = sentences['sentence'].apply(lambda sentence: sentence.split(' '))

In [39]:
sentences.head(15)

Unnamed: 0,text_id,sentence,is_sentence,difference,is_summary,words
0,0,Sell yourself first,True,,yes,"[Sell, yourself, first]"
1,0,"Before doing anything else, stop and sum up yo...",True,0.0,no,"[Before, doing, anything, else,, stop, and, su..."
2,0,"Now, think about how to translate that to an o...",True,0.0,no,"[Now,, think, about, how, to, translate, that,..."
3,0,"Be it the few words, Twitter allows you or an ...",True,0.0,no,"[Be, it, the, few, words,, Twitter, allows, yo..."
4,0,Bring out the most salient features of your cr...,True,0.0,no,"[Bring, out, the, most, salient, features, of,..."
5,0,Make it clear to readers why you are an artist...,True,0.0,no,"[Make, it, clear, to, readers, why, you, are, ..."
6,0,"If you're not great with words, find a friend ...",True,0.0,no,"[If, you're, not, great, with, words,, find, a..."
8,1,Read the classics before 1600,True,1.0,yes,"[Read, the, classics, before, 1600]"
9,1,Reading the classics is the very first thing y...,True,0.0,no,"[Reading, the, classics, is, the, very, first,..."
10,1,If you want to build a solid foundation for yo...,True,0.0,no,"[If, you, want, to, build, a, solid, foundatio..."
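
As a minimal sketch of the kind of frequency calculation this `words` column allows, the block below counts word occurrences with `collections.Counter`. The toy rows, the lowercasing, and the punctuation stripping are assumptions made for illustration (tokens such as `else,` keep their comma after a split on spaces); they are not steps taken in this notebook.

```python
from collections import Counter

import pandas as pd

# Toy rows shaped like the `words` column above
toy = pd.DataFrame({
    "text_id": [0, 0],
    "words": [
        ["Sell", "yourself", "first"],
        ["Now,", "sell", "your", "art", "first"],
    ],
})

# Flatten the lists, lowercase each token and strip surrounding punctuation,
# skipping empty strings left by repeated spaces
counts = Counter(
    word.lower().strip(",;:!?\"'()")
    for words in toy["words"]
    for word in words
    if word
)

print(counts.most_common(3))
```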
