# Data for Abstractive Summarization

## 1. CNN and Dailymail dataset

1. cnn_stories & dailymail_stories - Main dataset<br>
Link - https://cs.nyu.edu/~kcho/DMQA/ <br>
cnn: 392 mb<br>
dailymail: 979 mb<br>
Store as ```file.story```

2. cnn folder - proccessed previous dataset<br>
3 files: train.txt (1.6 gb), test.txt (12.6 mb), dev.txt (16.5 mb)<br>
Structure:<br>
```
summary
document_id as @entity0
text
blank line
```

#### Literature to read:
Here: good article about CNN data - https://machinelearningmastery.com/prepare-news-articles-text-summarization/
<br>
Ideas:<br>
Some data cleaning ideas for this data include:
<br>
a. Normalize case to lowercase (e.g. “An Italian”).<br>
b. Remove punctuation (e.g. “on-time”).<br>
We could also further reduce the vocabulary to speed up testing models, such as:<br>
c. Remove numbers (e.g. “93.4%”).<br>
d. Remove low-frequency words like names (e.g. “Tom Watkins”).<br>

Notable examples are the papers:<br>

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016.<br>
https://arxiv.org/abs/1602.06023

Get To The Point: Summarization with Pointer-Generator Networks, 2017.<br>
https://arxiv.org/abs/1704.04368

In [13]:
import os
import pandas as pd
pd.set_option('display.max_colwidth', 500)

def get_files(directory_path):
    '''Read all files in the given directory to a dataframe'''
    data = []

    for filename in os.listdir(directory_path):
        with open(os.path.join(directory_path, filename)) as f:
            lines = f.read().split('\n')
            if len(lines[0]) < 10:
                title = ' '.join(lines[:2])
                article = ('\n'.join(lines[2:])).strip('\n')
            else:
                title = lines[0]
                article = ('\n'.join(lines[1:])).strip('\n')
            data.append((title, article))

    return pd.DataFrame(data, columns=['title', 'article'])

In [8]:
# 92'579 rows
df_cnn = get_files('CNN_stories/')
df_cnn.head(10)

Unnamed: 0,title,article
0,"(CNN) -- Federal authorities are using words uttered by the co-founder of a radical Islamic group to charge him with threats against the creators of ""South Park.""","A criminal complaint alleging the communication of threats was filed in Virginia late last week against Jesse Curtis Morton, also known as Younus Abdullah Mohammad.\n\nA senior law enforcement source Thursday told CNN, which interviewed Morton in 2009, that the suspect is believed to be in Morocco, where he maintains Islampolicy.com, an English-language website propagating pro al Qaeda views.\n\nThat website is a successor to Revolutionmuslim.com.\n\nMorton, a former resident of Brooklyn, Ne..."
1,"Even after Boomer Esiason apologized for what he called his ""insensitive"" comment about scheduling a C-section before the season started, his suggestion plus critical stances by other radio hosts demonstrate how much paternity leave is still not widely accepted in our society.","In conversations with men across the country, it's clear that while most join many women in expressing outrage at the view that a Major League Baseball game should come before the birth of a child, there were men who felt the player should have gotten back to his job as quickly as possible.\n\nBy now, you probably know the particulars: the New York Mets' Daniel Murphy missed the first two games of the season to attend his son's birth. He didn't do anything he wasn't allowed to do. Major Leag..."
2,"(CNN) -- Vacationers at Yellowstone and Grand Teton national parks this summer should make extra efforts to wash their hands, the National Park Service urged Wednesday, after noting a spike in sicknesses among visitors so far.","In a news release, the park service noted ""greater than normal reports of gastrointestinal illness"" among those visiting the park in northwestern Wyoming as well as areas in Montana outside the two parks.\n\nThat includes an incident June 7, when members of a tour group visiting Mammoth Hot Springs -- a part of Yellowstone that's located on the Montana/Wyoming border -- began complaining of stomach and other issues. Park employees who had been in contact with this group reported similar symp..."
3,"BERLIN, Germany (CNN) -- U.S. officials urged American citizens in Germany to keep a low profile and remain wary of their surroundings after the terrorist organization al Qaeda posted a video message threatening attacks in the country.","German special police patrol in Berlin last month during a visit by Israeli Prime Minister Benajmin Netanyahu.\n\nA State Department travel alert, issued Wednesday, remains in effect until November 11 -- two weeks after Germany holds its federal elections on Sunday.\n\nAl Qaeda posted its video threat on the Internet on September 18, vowing attacks if the elections do not come out the way it wants.\n\nThe same day, the German government reacted to the video by raising its own alert level and..."
4,(CNN) -- The lessons of the first round of the French presidential elections are multiple and somewhat contradictory.,"There is, on the one hand, the first-round victory of a self-described ""normal man"" who is still -- in spite of very tight results -- likely to become the next president of France: François Hollande. His lack of charisma has not been a handicap, so great was the rejection of incumbent President Nicolas Sarkozy.\n\nFrançois Hollande's good-naturedness and his smiling personality evoke a mixture of Jacques Chirac and even Georges Pompidou. One should not be deluded: Any successful politician h..."
5,"(CNN) -- On the surface, water polo appears an elegant pursuit played by extremely polished performers.","But beneath the water line, a different storyline is playing out.\n\nLimbs bash against each other, punches and kicks are thrown, nails are used to claw at an opponent and every so often, a player inadvertently disrobes another.\n\nThe thing is, like most players, Australian goal-machine Rowena Webster wouldn't want it any other way.\n\n""We have a running joke that the referees probably only see about 20% of what really happens,"" Webster told CNN's Human to Hero series.\n\n""I guess what you ..."
6,(CNN) -- I've been in Thomas Hurley III's shoes.,"Hurley is the 12-year-old Connecticut boy whose misspelling of ""Emancipation"" during a Kids Week episode on ""Jeopardy!"" took social media by storm. Fans are sharply divided over whether the show should have accepted his Final Jeopardy answer, even though he would have finished second regardless.\n\nBut what we forget is that a young man's honest mistake was broadcast to millions of people across the United States and Canada. No matter the age, realizing that your ""blooper"" will be seen by ma..."
7,"(CNN) -- He's the man who rolled into a bedroom in Abbottabad, Pakistan, raised his gun and shot Osama bin Laden three times in the forehead.","Nearly two years later, the SEAL Team Six member is a secret celebrity with nothing to show for the deed; no job, no pension, no recognition outside a small circle of colleagues.\n\nJournalist Phil Bronstein profiled the man in the March issue of Esquire, calling him only the Shooter -- a husband, father and SEAL Team Six member who says he happened to pull the trigger on the notorious terrorist. It's a detailed account of how the raid unfolded, and what comes after for those involved. The h..."
8,"Santa Rosa, Peru (CNN) -- Murder suspect Joran van der Sloot arrived Friday in Peru to face charges that he killed a Peruvian woman as police in Lima said they had identified the weapon that killed 21-year-old Stephany Flores Ramirez.","Flores' body was found Wednesday in a Lima hotel room registered to van der Sloot, a Dutch citizen who was twice arrested and released in connection with the 2005 disappearance of an American teenager, Natalee Holloway, in Aruba.\n\nInvestigators also found a baseball bat in the room, two law enforcement sources -- who said it was the murder weapon -- told HLN's ""Nancy Grace.""\n\nChilean authorities delivered van der Sloot to their Peruvian counterparts in the border town of Santa Rosa, wher..."
9,"Washington (CNN) -- When presumptive Republican presidential nominee Mitt Romney appears before Latino small-business owners in Washington on Wednesday, he'll address a group whose explosive birth rates foreshadow a seismic political shift in GOP strongholds in the Deep South and Southwest.","""The Republicans' problem is their voters are white, aging and dying off,"" said David Bositis, a senior research associate at the Joint Center for Political and Economic Studies, who studies minority political engagement.\n\n""There will come a time when they suffer catastrophic losses with the realization of the population changes.""\n\nOver the next several generations, the wave of minority voters -- who, according to U.S. Census figures released this week, now represent more than half of th..."


In [14]:
# 219'506 rows
df_daily = get_files('Dailymail_stories/')
df_daily.head(10)

Unnamed: 0,title,article
0,"A plane forced into an emergency landing because of a technical fault, accidentally sent out a hijacking signal after the pilot pressed the wrong button.","Ground crew mistakenly believed that the Vietnam Airlines jet which had been travelling from Ho Chi Minh City to the northern town of Vinh on Tuesday, was under attack.\n\nIt was initially understood that someone had tried to force their way into the pilot’s cockpit.\n\nGround crew mistakenly believed that the Vietnam Airlines jet which had been travelling from Ho Chi Minh City to the northern town of Vinh on Tuesday, was under attack\n\nHowever, it later turned out that Czech captain Pechan..."
1,"American supermodel Kendra Spears, who is married to Prince Rahim Aga Khan lived her own real-life fairy-tale.","And now just like Kate and William, this royal couple will be adding to their family as they announce they are expecting their first child.\n\nOn the website for Nizari Ismailism, of which Prince Rahim Aga Khan is Imam, his father, Aga Khan IV, Mawlana Hazar Imam, posted the statement:\n\nScroll down for video \n\nKendra Spears and Prince Rahim Aga Khan, pictured on their wedding day, are expecting their first child \n\nThe couple made the announcement on the official Ismaili website on Frid..."
2,By Jade Watkins,"He only just landed back in Sydney after a trip to Dubai, and radio shock jock Kyle Sandilands is already in hot water.\n\nThe 43-year-old was stopped for speeding by police on his way back from the airport to his St. Ives mansion on Monday morning.\n\nThe Kiis 1065 host admitted to his listeners on his breakfast show that he broke the law, telling them he was going 20kms over the 60km speed limit, according to the Daily Telegraph.\n\nIn hot water: Kyle Sandilands was stopped by police for ..."
3,Zlatan Ibrahimovic will miss Paris Saint-Germain's Champions League showdown with Barcelona on Tuesday because of a heel injury.,"The French champions announced in a statement that the Sweden star, who has missed the last two Ligue 1 games, has not fully recovered from the left heel problem and is out of the clash with his former club.\n\nPSG examined the striker's foot on Monday morning but were not satisfied he was fit enough to face Barca.\n\nZlatan Ibrahimovic has been ruled out of PSG's Champions League match with Barcelona with a heel injury\n\nThe Swedish striker has already missed two matches and isn't fit enou..."
4,The U.S Surgeon General has enlisted Elmo to urge American children to get vaccinated in the wake of recent national debate over the right to refuse immunization.,"A campaign video sees Dr Vivek Murthy and the Sesame Street favourite go through the process of getting vaccinated and explaining to children (and parents) why it is so important.\n\nIn what appears to be a direct response to anti-vaccination campaigners, Elmo and Dr Murthy questions why everybody does not get a shot. \n\nScroll down for video \n\nCampaign: Surgeon General Vivek Murthy and Elmo explain why getting vaccinated is crucial to keep other children safe and healthy\n\nAs Dr Murthy ..."
5,"Cristiano Ronaldo is at the peak of physical fitness, something which was highlighted by his Champions League final celebration.","His muscles bulged, his torso rippled and thousands of blokes pledged to hit the gym as the world watched the best player on the planet's big moment.\n\nBut Ronaldo has had to work hard for it - his physique is the result of an intense exercise regime, detailed below.\n\nPeak fitness: Cristiano Ronaldo shows off his ridiculously ripped torso in the Champions League final\n\nIn a weights session, Ronaldo lifts the equivalent of 16 Toyota Prius cars\n\nRonaldo can reportedly do 3,000 sit-ups a..."
6,By Steve Robson,"PUBLISHED:\n\n12:16 EST, 24 August 2013\n\n\n| \n\nUPDATED:\n\n09:49 EST, 25 August 2013\n\nThousands of Syrians were treated for nerve gas symptoms following an alleged chemical attack in Damascus, the medical charity Medecins Sans Frontieres (MSF) has said today.\n\nThree hospitals in the city reported 355 deaths on Wednesday after around 3,600 people were admitted to hospital following exposure to a 'neurotoxic agent'.\n\nThe Syrian opposition has accused \ngovernment forces of gassing hu..."
7,By Anna Hodgekiss,"MERS - Middle East Respiratory Syndrome - may be airbone, meaning it can spread easier\n\nThe deadly MERS virus which has killed more than 300 people may be airborne, it has been claimed.\n\nSaudi scientists drew the conclusion after finding gene fragments of the deadly Middle East Respiratory Syndrome in air from a barn housing an infected camel.\n\nThey say this suggests the disease may be transmitted through the air.\n\nThis is concerning because viruses that spread through air - such as ..."
8,Southampton have emerged as favourites to sign Italy starlet Gianluca Scamacca.,"Dubbed the 'new Zlatan Ibrahimovic' by Italian media, the 6ft 4ins 15-year-old centre-forward is on Roma's books but is ready to move abroad.\n\nRoma will offer the Italy U17 international a three-year contract when he turns 16 on January 1, initially starting on £1,142 a month. \n\nGianluca Scamacca (centre) in action for Italy U17s against Germany U17s in September\n\nRonald Koeman (left)'s Southampton have emerged as the favourites to sign Gianluca Scamacca\n\nHowever, his family and agen..."
9,European laws on food labelling mean the details on the packaging of probiotic drinks are always rather vague,"It started with those little bottles of sickly-sweet-tasting yogurt shots, and today probiotics can be found in everything from fortified milks and supplements to face creams – and all claim the ‘friendly bacteria’ they contain can do us a world of good.\n\nBut tough European laws on food labelling mean the details on the packaging are always rather vague when it comes to explaining just what the benefits are, mainly because many studies have proved inconclusive. So are the positive effects ..."


In [24]:
def cnn_data_to_df(file_path):
    with open(file_path, 'r') as file:
        pos = 0
        documents = []
        for line in file:
            if pos == 0:
                summary = line
                pos += 1
            elif pos == 1 and line[0] == '@':
                doc_id = line
                pos += 1
            elif pos == 2:
                text = line
                pos += 1
            elif pos == 3 and line == '\n':
                documents.append((summary, doc_id, text))
                pos = 0
    return pd.DataFrame(documents, columns = ['summary', 'doc_id', 'text'])
        
file_path = 'cnn/train.txt'
cnn_data_to_df(file_path).head(10)

Unnamed: 0,summary,doc_id,text
0,officials : the suspects were taken to the local @placeholder army base for questioning\n,@entity2\n,"days after two @entity2 journalists were killed in northern @entity3 , authorities rounded up dozens of suspects and a group linked to @entity6 claimed responsibility for the deaths . at least 30 suspects were seized in desert camps near the town of @entity13 and taken to the local @entity2 army base for questioning , three officials in @entity3 said . the officials did not want to be named because they are not authorized to talk to the media . @entity17 ( @entity17 ) has allegedly claimed r..."
1,"@placeholder ate 6,000 calories and trained for three hours daily in preparation\n",@entity1\n,"( @entity0 ) -- when we think of the perfect summer blockbuster , we think of action -- and july 's "" @entity5 "" will have more than enough , star @entity1 says . the 44 - year - old is back for another round as the ferocious mutant @entity8 , but @entity10 's take is somewhat darker than its predecessors . as the trailers have shown , the typically sly @entity8 appears to be in a pretty serious funk as he 's haunted by thoughts of @entity15 ( @entity16 ) . when a man whose life he once save..."
2,most of the deaths occurred in hard - hit cities of @entity0 and @placeholder\n,@entity13\n,"@entity0 , @entity1 ( @entity2 ) -- the death toll from flooding caused by torrential rains in @entity1 's @entity8 state rose to 591 people saturday , @entity1 's official news agency reported . most of the deaths were reported in the cities of @entity13 and @entity0 , located in a mountainous region northeast of @entity15 , according to @entity16 . rescuers have not been able to reach some hard - hit areas and many more people are feared dead , the agency said friday . the rain is predicte..."
3,the hit @placeholder show took five seasons to cover two years of @entity2 's life\n,@entity212\n,"( @entity0 ) -- with @entity2 dead , fans everywhere are mourning , celebrating , tallying up bets and discussing what just happened . was the series finale of "" @entity6 "" satisfying ? did it tie up all loose ends ? did the character you wanted to live survive and did the ones you wanted to die get their just deserts ? is it sending you back to the beginning to binge watch it all over again ? just when it seemed @entity19 was heading out of his @entity21 hideaway to exact revenge on @entity..."
4,"the @entity9 went to "" @placeholder , "" written and illustrated by @entity11\n",@entity68\n,"get ready to meet the new classics of children 's literature . children 's and young adult books are sporting some shiny new seals after the @entity5 announced its most esteemed literary prizes monday , including the @entity8 and @entity9 medals . the @entity9 went to "" locomotive , "" written and illustrated by @entity11 . the book follows family and crew traveling together on @entity16 's new transcontinental railroad in the summer of 1869 . the @entity8 was awarded to "" @entity17 : the @en..."
5,"@placeholder : @entity8 was more energetic , but @entity18 showed he 's a credible alternative\n",@entity171\n,"( @entity0 ) -- if the first presidential debate in @entity2 was a game changer , tuesday night 's was not . but that does n't mean it was n't a spirited , heavyweight bout with several consequential moments . president @entity8 entered the second presidential debate needing to make up serious ground after his first debate performance . he turned around the narrative from the first debate -- that he was listless and lethargic and on the defensive -- but showing up is one thing , winning is a..."
6,"this year , 18 states are expected to gain or lose representation in the @placeholder\n",@entity14\n,"( @entity0 ) -- when @entity3 gov. @entity2 of @entity4 hits the campaign trail before @entity6 , you might want to listen , because the outcome his re-election bid could have a direct impact on you -- even if you do n't live in his state . the number crunchers at @entity12 estimate that next year , @entity4 is going to lose two seats in the @entity15 @entity14 . it 's up to the @entity4 state government -- including the governor -- to decide which @entity14 members will go : @entity3 or @en..."
7,@placeholder is the latest in a string of @entity9 centrists to announce retirement .\n,@entity3\n,"@entity0 ( @entity1 ) -- when @entity4 sen. @entity3 of @entity5 rocked the political world with her announcement that she would not seek a fourth term in the @entity9 , she was forthright in expressing her frustration with "" an atmosphere of polarization "" in politics . but for all her transparency , it was one of @entity3 's @entity9 colleagues who perhaps best summed up her motivation for deciding to end her decades - long tenure on @entity20 . "" i think she lost hope , "" sen. @entity21 ,..."
8,@entity3 did not deny rumors to @placeholder\n,@entity15\n,"( @entity0 ) -- @entity3 and spouse @entity4 "" are working through their issues "" and "" nothing else will be said "" about rumors the couple is splitting , according to @entity3 's publicist . rumors have been swirling that @entity4 , left , and @entity3 are splitting . online buzz about the @entity12 marriage grew louder this week after @entity3 did not give a clear - cut denial in a @entity15 interview on tuesday . the former talk show host 's publicist echoed her non-denial in a statement ..."
9,@placeholder could have received up to $ 3 million if it cut sports program\n,@entity19\n,"( @entity0 ) -- for most of us , college donations entail little more than occasionally dropping a small check in the mail after receiving repeated pleas for cash from our alma maters . some people , though , tend to be a bit more individualistic with their generosity . let 's take a look at some of the quirkier donations schools have received : 1 . bequest puts jocks on the ropes in 1907 , fledgling @entity19 received a bequest that was estimated to be worth somewhere between $ 1 and $ 3 mi..."


## 2. Gigaword Dataset
Giga_datafiles<br>
https://github.com/Ethanscuter/gigaword <br>


In [23]:
# 3'803'958

def gigaword_to_df():
    with open('Giga_datafiles/train.title.txt') as f:
        titles = f.read().split('\n')
        
    with open('Giga_datafiles/train.article.txt') as f:
        articles = f.read().split('\n')
        
    return pd.DataFrame({'title': titles, 'article': articles})

df_giga = gigaword_to_df()
df_giga.head(10)

Unnamed: 0,title,article
0,australian current account deficit narrows sharply,"australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed ."
1,at least two dead in southern philippines blast,"at least two people were killed in a suspected bomb attack on a passenger bus in the strife-torn southern philippines on monday , the military said ."
2,australian stocks close down #.# percent,"australian shares closed down #.# percent monday following a weak lead from the united states and lower commodity prices , dealers said ."
3,envoy urges north korea to restart nuclear disablement,south korea 's nuclear envoy kim sook urged north korea monday to restart work to disable its nuclear plants and stop its `` typical '' brinkmanship in negotiations .
4,skorea announces tax cuts to stimulate economy,"south korea on monday announced sweeping tax reforms , including income and corporate tax cuts to boost growth by stimulating sluggish private consumption and business investment ."
5,taiwan shares close down #.## percent,"taiwan share prices closed down #.## percent monday on wall street weakness and lacklustre interim earnings from electronics manufacturing giant hon hai , dealers said ."
6,australian stocks close down #.# percent,"australian shares closed down #.# percent monday following a weak lead from the united states and lower commodity prices , dealers said ."
7,spain 's colonial posts #.## billion euro loss,"spanish property group colonial , struggling under huge debts , announced losses of #.## billion euros for the first half of #### which it blamed on asset depreciation ."
8,kadhafi promises wide political economic reforms,libyan leader moamer kadhafi monday promised wide political and economic reforms that he said would see ministries dismantled and oil revenues going directly into the pockets of the people .
9,un 's top aid official arrives in drought-hit ethiopia,"the united nations ' humanitarian chief john holmes arrived in ethiopia monday to tour regions affected by drought , which has left some eight million people in need of urgent food aid ."


## 3. BBC_News_Summary
BBC_News_Summary - 7.3 mb<br>
The dataset consists of articles (avg - 2266 characters) and their summaries (avg - 1000 characters). The summaries are quite long, usually several sentences. So I will consider how to use them.<br>
Two folders News Article, Summaries: files as '001.txt' in each folder<br>

In [21]:
def get_files_from_folders(path):
    '''Read all files from all subdirectories from a directory'''
    files_list = []
    for directory in os.listdir(path):
        if os.path.isdir(path + directory):
            for filename in os.listdir(path + directory):
                with open(os.path.join(path + directory, filename), 'rb') as f:
                    files_list.append(f.read())
    return files_list

# 4449 rows
def bbc_to_df():
    
    articles = get_files_from_folders('BBC News Summary/News Articles/')
    summaries = get_files_from_folders('BBC News Summary/Summaries/')
    
    return pd.DataFrame({'article': articles, 'summary': summaries})
    
df_bbc = bbc_to_df()
df_bbc.head(10)

Unnamed: 0,article,summary
0,"b'Apple Mac mini gets warm welcome\n\nThe Mac mini has been welcomed by Apple fans, industry experts and PC users.\n\nThe release of the tiny, low-cost machine is seen as a good move for Apple which currently has a small share of the desktop computer market. Mac watchers and some analysts say the Mac mini will go a long way to help Apple appeal to the mass of consumers. They speculate that the Mac mini will be bought by iPod owners and those wanting an easy-to-use and administer second home ...","b'But, he said, the Mac mini changed that perception.""The Mac mini is not quite ready for that yet,"" he said.The release of the Mac mini fit perfectly with this trend, he said.Apple has traditionally done well in the market that the Mac mini is aimed at, said Mr Fogg, who also expected many PC makers to release copycat devices in reaction.The Mac mini could find a role in homes that need a second computer that is easy to install and administer, he said.""Apple has been hoping that sales of th..."
1,"b'Mobiles \'not media players yet\'\n\nMobiles are not yet ready to be all-singing, all-dancing multimedia devices which will replace portable media players, say two reports.\n\nDespite moves to bring music download services to mobiles, people do not want to trade multimedia services with size and battery life, said Jupiter. A separate study by Gartner has also said real-time TV broadcasts to mobiles is ""unlikely"" in Europe until 2007. Technical issues and standards must be resolved first, s...","b'The service uses 3GP technology, one of the standards for mobile TV.A service from the Norwegian Broadcasting Corporation lets people watch TV programmes on their mobiles 24 hours a day.""Mobile phone music services must not be positioned to compete with the PC music experience as the handsets are not yet ready,"" said Thomas Husson, mobile analyst at Jupiter research.A separate study by Gartner has also said real-time TV broadcasts to mobiles is ""unlikely"" in Europe until 2007.Mobile TV wil..."
2,"b'\'No re-draft\' for EU patent law\n\nA proposed European law on software patents will not be re-drafted by the European Commission (EC) despite requests by MEPs.\n\nThe law is proving controversial and has been in limbo for a year. Some major tech firms say it is needed to protect inventions, while others fear it will hurt smaller tech firms. The EC says the Council of Ministers will adopt a draft version that was agreed upon last May but said it would review ""all aspects of the directive""...","b'A proposed European law on software patents will not be re-drafted by the European Commission (EC) despite requests by MEPs.But that will not guarantee that the directive will become law - instead it will probably mean further delays and controversy over the directive.The EC says the Council of Ministers will adopt a draft version that was agreed upon last May but said it would review ""all aspects of the directive"".Supporters say current laws are inefficient and it would serve to even up a..."
3,"b'Humanoid robot learns how to run\n\nCar-maker Honda\'s humanoid robot Asimo has just got faster and smarter.\n\nThe Japanese firm is a leader in developing two-legged robots and the new, improved Asimo (Advanced Step in Innovative Mobility) can now run, find his way around obstacles as well as interact with people. Eventually Asimo could find gainful employment in homes and offices. ""The aim is to develop a robot that can help people in their daily lives,"" said a Honda spokesman.\n\nTo get...","b""Asimo has already made his mark on the international robot scene and in November was inducted into the Robot Hall of Fame.The Japanese firm is a leader in developing two-legged robots and the new, improved Asimo (Advanced Step in Innovative Mobility) can now run, find his way around obstacles as well as interact with people.Car-maker Honda's humanoid robot Asimo has just got faster and smarter.To get the robot running for the first time was not an easy process as it involved Asimo making a..."
4,"b'Sony wares win innovation award\n\nSony has taken the prize for top innovator at the annual awards of PC Pro Magazine.\n\nIt won the award for taking risks with products and for its ""brave"" commitment to good design. Conferring the award, PC Pro\'s staff picked out Sony\'s PCG-X505/P Vaio laptop as a ""stunning piece of engineering"". The electronics giant beat off strong competition from Toshiba and chip makers AMD and Intel to take the gong.\n\nPaul Trotter, news and features editor of PC ...","b'Paul Trotter, news and features editor of PC Pro, said several Sony products helped it to take the innovation award.Sony has taken the prize for top innovator at the annual awards of PC Pro Magazine.Other awards decided by PC Pro\'s staff and contributors included one for Canon\'s EOS 300D digital camera in the Most Wanted Hardware category.Mr Trotter said Sony\'s combining of computer, screen and keyboard in the W1 was likely to be widely copied in future home PCs.Conferring the award, PC..."
5,"b'Microsoft gets the blogging bug\n\nSoftware giant Microsoft is taking the plunge into the world of blogging.\n\nIt is launching a test service to allow people to publish blogs, or online journals, called MSN Spaces. Microsoft is trailing behind competitors like Google and AOL, which already offer services which make it easy for people to set up web journals. Blogs, short for web logs, have become a popular way for people to talk about their lives and express opinions online.\n\nMSN Spaces ...","b'Microsoft is trailing behind competitors like Google and AOL, which already offer services which make it easy for people to set up web journals.It is launching a test service to allow people to publish blogs, or online journals, called MSN Spaces.Competitors like Google already offer free services through its Blogger site, while AOL provides its members with journals.Blogs, short for web logs, have become a popular way for people to talk about their lives and express opinions online.It now..."
6,"b'Cable offers video-on-demand\n\nCable firms NTL and Telewest have both launched video-on-demand services as the battle between satellite and cable TV heats up.\n\nMovies from Sony Pictures, Walt Disney, Touchstone, Miramax, Columbia and Buena Vista will be among those on offer. The service is similar to Sky Plus, as users can pause, fast forward and rewind content, but they cannot store programmes on their set top box - yet. It could sound the death knell for some TV channels, Telewest pre...","b'Cable firms NTL and Telewest have both launched video-on-demand services as the battle between satellite and cable TV heats up.With both services on offer from Telewest, Mr Tveter is confident the cable firm can dent not just the viewing figures for terrestrial TV but also gain a huge competitive advantage over Sky.Telewest customers in Bristol and NTL viewers in Glasgow will be the first to test the new service, which sees a raft of movies on offer for 24 hour rental.NTL said it had not r..."
7,"b'Britons fed up with net service\n\nA survey conducted by PC Pro Magazine has revealed that many Britons are unhappy with their internet service.\n\nThey are fed up with slow speeds, high prices and the level of customer service they receive. 17% of readers have switched suppliers and a further 16% are considering changing in the near future. It is particularly bad news for BT, the UK\'s biggest internet supplier, with almost three times as many people trying to leave as joining.\n\nA third...","b'Every month the prices drop, and more and more people are trying to switch,"" he said.""We discovered a huge variety of problems, but one of the biggest issues is the current supplier withholding the information that people need to give to their new supplier,"" said Tim Danton, editor of PC Pro.A third of the 2,000 broadband users interviewed were fed up with their current providers but this could be just the tip of the iceberg thinks Tim Danton, editor of PC Pro Magazine.A survey conducted b..."
8,"b""Go-ahead for new internet names\n\nThe internet could soon have two new domain names, aimed at mobile services and the jobs market.\n\nThe Internet Corporation for Assigned Names and Numbers (Icann) has given preliminary approval to two new addresses - .mobi and .jobs. They are among 10 new names being considered by the net's oversight body. Others include a domain for pornography, an anti-spam domain as well as .post and .travel, for the postal and travel industries.\n\nThe .mobi domain w...","b""The internet could soon have two new domain names, aimed at mobile services and the jobs market.The .mobi domain would be aimed at websites and other services that work specifically around mobile phones, while the .jobs address could be used by companies wanting a dedicated site for job postings.The Internet Corporation for Assigned Names and Numbers (Icann) has given preliminary approval to two new addresses - .mobi and .jobs.The process to see the new domain names go live in cyberspace c..."
9,"b'Argonaut founder rebuilds empire\n\nJez San, the man behind the Argonaut games group which went into administration a week ago, has bought back most of the company.\n\nThe veteran games developer has taken over the Cambridge-based Just Add Monsters studios and the London subsidiary Morpheme. The Argonaut group went into administration due to a severe cash crisis, firing about half of its staff. In August it had warned of annual losses of \xc2\xa36m for the year to 31 July.\n\nJez San is on...","b""Jez San, the man behind the Argonaut games group which went into administration a week ago, has bought back most of the company.He founded Argonaut in 1982 and has been behind titles such as 1993 Starfox game.The veteran games developer has taken over the Cambridge-based Just Add Monsters studios and the London subsidiary Morpheme.Mr Rubin said the administrators were in talks over the sale of the Argonaut software division in Edgware and were hopeful of finding a buyer.Mr San has re-emerg..."


In [5]:
df_bbc['len1'] = df_bbc['article'].map(len)
df_bbc['len2'] = df_bbc['summary'].map(len)
df_bbc[['len1', 'len2']].mean()

len1    2265.790562
len2    1000.882697
dtype: float64

## 4. Microsoft Paraphrase dataset

Additional dataset<br>
Paraphrasing dataset of news sources on the web.<br>
Proccessed dataset from https://github.com/wasiahmad/paraphrase_identification/tree/master/dataset/msr-paraphrase-corpus

I plan to use this dataset for developing my metric for comparing ground truth and generated summaries.<br>

In [16]:
import csv
para_df = pd.read_csv('MSRParaphraseCorpus.txt', sep='\t', quoting=csv.QUOTE_NONE)
para_df.head(3)

Unnamed: 0,quality,id1,id2,string1,string2
0,1,702876,702977,"Amrozi accused his brother, whom he called ""the witness"", of deliberately distorting his evidence.","Referring to him as only ""the witness"", Amrozi accused his brother of deliberately distorting his evidence."
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.,Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.
2,1,1330381,1330521,"They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.","On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale."


## Conclusions
Four datasets were analyzed:
1. CNN and Dailymail dataset<br>
Document count: 92'579 + 219'506
2. Gigaword Dataset<br>
Document count: 3'803'958
3. BBC News Summary<br>
Document count: 4449
4. Microsoft Paraphrase dataset<br>
Document count: 5802

The three first datasets consist of article and title/summary. The fourth one is parapharizing dataset for evaluation purpose.<br>
I've found all datasets based on news, as my project is aimed to work with news data.<br>

Taking  the sizes of datasets into account, I presume that would be enough for the first steps. Later on, I will consider to scrape the date from news websites for testing my model.