LOADING THE DATASET
-

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('cnn_dailymail/validation.csv')

In [3]:
data.head()

Unnamed: 0,id,article,highlights
0,61df4979ac5fcc2b71be46ed6fe5a46ce7f071c3,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,21c0bd69b7e7df285c3d1b1cf56d4da925980a68,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,56f340189cd128194b2e7cb8c26bb900e3a848b4,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,00a665151b89a53e5a08a389df8334f4106494c2,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,9f6fbd3c497c4d28879bebebea220884f03eb41a,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...


In [4]:
data = data.drop(columns='id')

In [8]:
data.shape

(13368, 2)

In [5]:
data.head()

Unnamed: 0,article,highlights
0,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...


In [6]:
print('ARTICLE:\n', data['article'][0])
print('=====================================================================================================================================================')
print('Summary:\n', data['highlights'][0])

ARTICLE:
 Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films such as the 1956 noir While the City Sleeps died on March 15 at her home in Beverly Hills, California. Forrest, whose birth name was Katherine Feeney, was 86 and had long battled cancer. Her publicist, Judith Goffin, announced the news Thursday. Scroll down for video . Actress: Sally Forrest was in the 1951 Ida Lupino-directed film 'Hard, Fast and Beautiful' (left) and the 1956 Fritz Lang movie 'While the City Sleeps' A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films including the critical and commercial success Not Wanted, Never Fear and Hard, Fast and Beautiful. Some of Forrest's other film credits included Bannerline, Son of Sinbad, and Excuse My Dust, according to her iMDB page. The page also indicates Forrest was in multiple Climax! and Rawhide television episodes. Forrest appeared as 

SPLITTING THE DATA INTO TRAIN DATA AND TEST DATA
-

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

In [29]:
print('Train Data ------->',len(train_df))
print('Test Data -------->',len(test_df))

Train Data -------> 10694
Test Data --------> 2674
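
The two sizes follow from the 13,368 rows reported by data.shape. The sketch below reproduces the arithmetic using only the standard library. It assumes that a fractional test_size is rounded up to a whole number of rows and that the training set receives the remainder; the helper names split_sizes and shuffle_split are hypothetical and are not part of scikit-learn.

```python
import math
import random

def split_sizes(n_samples, test_size):
    # Assumption: the test share is rounded up, the rest goes to train
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

def shuffle_split(rows, test_size, seed):
    # Seeded shuffle so the same seed always yields the same partition
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    _, n_test = split_sizes(len(rows), test_size)
    test_rows = [rows[i] for i in indices[:n_test]]
    train_rows = [rows[i] for i in indices[n_test:]]
    return train_rows, test_rows

n_train, n_test = split_sizes(13368, 0.2)
print(n_train, n_test)
```

Fixing the seed plays the same role as random_state=42 in the cell above: rerunning the split gives the same rows in each partition, although this sketch will not select the same rows as scikit-learn.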


In [30]:
print('Train Data ------->',train_df.columns)
print('Test Data -------->',test_df.columns)

Train Data -------> Index(['article', 'highlights'], dtype='object')
Test Data --------> Index(['article', 'highlights'], dtype='object')


In [31]:
datasets = [train_df, test_df]
dataset_names = ['Train Data', 'Test Data']
for i,dataset in enumerate(datasets):
    print(f'{dataset_names[i]} missing values:')
    print(dataset.isnull().sum())
    print('-------------------------')

Train Data missing values:
article       0
highlights    0
dtype: int64
-------------------------
Test Data missing values:
article       0
highlights    0
dtype: int64
-------------------------


PRE-PROCESSING
-

Converting all text to lowercase

In [32]:
# Vectorized string method; equivalent to applying x.lower() to each row
train_df['article'] = train_df['article'].str.lower()
train_df['highlights'] = train_df['highlights'].str.lower()

In [36]:
train_df.head()

Unnamed: 0,article,highlights
5537,father-of-four imran sharif admitted brutally ...,imran sharif had brutally killed his wife rahe...
7608,real madrid full back marcelo has described is...,marcelo praises 22-year-old midfielder in talk...
426,"(cnn)in fairy tales, it's usually the princess...",parisa tabriz is the 31-year-old computer whiz...
621,(cnn)following last year's successful u.k. tou...,it will be a first time for the tour stateside...
2094,a hapless vietnamese pair caught running a £10...,"chien nguyen, 32, and hieu nguyen, 35, admitte..."


Expanding contractions (e.g. "don't" is converted to "do not")

In [37]:
import contractions
def expand_contractions(text):
    expanded_text = contractions.fix(text)
    return expanded_text

train_df['article'] = train_df['article'].apply(expand_contractions)
train_df['highlights'] = train_df['highlights'].apply(expand_contractions)
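
To illustrate what this step does, here is a minimal sketch of dictionary-based expansion using only the standard library. The three-entry CONTRACTION_MAP and the function name expand_contractions_sketch are hypothetical stand-ins; the contractions package used above covers far more cases and may resolve ambiguous forms differently.

```python
import re

# Hypothetical, hand-written mapping for illustration only
CONTRACTION_MAP = {
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
}

# Match any key of the mapping as a whole word
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b"
)

def expand_contractions_sketch(text):
    """Replace each known contraction with its expanded form."""
    return _CONTRACTION_RE.sub(
        lambda match: CONTRACTION_MAP[match.group(0)], text
    )

sample = "in fairy tales, it's usually the princess"
expanded = expand_contractions_sketch(sample)
print(expanded)
```

Expansion has to run before punctuation is removed: once the apostrophe is stripped, "it's" becomes "its" and can no longer be recognised as a contraction.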

Removing Numbers, Special Characters and Punctuation

In [39]:
import re

def clean_text(text):
    # Removing numbers
    text = re.sub(r'\d+', '', text)
    # Removing special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Removing extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

train_df['article'] = train_df['article'].apply(clean_text)
train_df['highlights'] = train_df['highlights'].apply(clean_text)

In [40]:
train_df.head()

Unnamed: 0,article,highlights
5537,fatheroffour imran sharif admitted brutally mu...,imran sharif had brutally killed his wife rahe...
7608,real madrid full back marcelo has described is...,marcelo praises yearold midfielder in talk to ...
426,cnnin fairy tales it is usually the princess t...,parisa tabriz is the yearold computer whizz pa...
621,cnnfollowing last years successful youk tour p...,it will be a first time for the tour stateside...
2094,a hapless vietnamese pair caught running a can...,chien nguyen and hieu nguyen admitted cultivat...
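
The output above shows a side effect of deleting characters outright: "22-year-old" collapses to "yearold" and "(cnn)in" to "cnnin". The sketch below contrasts the notebook's clean_text with a hypothetical variant, clean_text_keep_boundaries, that replaces digits and punctuation with a space instead. It is offered as one possible alternative, not as the approach used in this notebook.

```python
import re

def clean_text(text):
    # Same three steps as the notebook cell above
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_keep_boundaries(text):
    # Hypothetical variant: swap digits and punctuation for a space
    # so that the words on either side stay separate
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "marcelo praises 22-year-old midfielder"
print(clean_text(sample))
print(clean_text_keep_boundaries(sample))
```

The variant has its own cost: possessives such as "year's" become "year s", leaving a stray one-letter token that a later step would need to drop.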


Tokenization

In [55]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

train_df['article'] = train_df['article'].apply(word_tokenize)
train_df['highlights'] = train_df['highlights'].apply(word_tokenize)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sivaa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [56]:
train_df.head()

Unnamed: 0,article,highlights
5537,"[fatheroffour, imran, sharif, admitted, brutal...","[imran, sharif, had, brutally, killed, his, wi..."
7608,"[real, madrid, full, back, marcelo, has, descr...","[marcelo, praises, yearold, midfielder, in, ta..."
426,"[cnnin, fairy, tales, it, is, usually, the, pr...","[parisa, tabriz, is, the, yearold, computer, w..."
621,"[cnnfollowing, last, years, successful, youk, ...","[it, will, be, a, first, time, for, the, tour,..."
2094,"[a, hapless, vietnamese, pair, caught, running...","[chien, nguyen, and, hieu, nguyen, admitted, c..."


In [63]:
train_df['highlights'][5537]

['imran',
 'sharif',
 'had',
 'brutally',
 'killed',
 'his',
 'wife',
 'raheela',
 'imrans',
 'at',
 'their',
 'home',
 'court',
 'heard',
 'the',
 'yearold',
 'then',
 'got',
 'changed',
 'and',
 'calmly',
 'went',
 'to',
 'work',
 'sharif',
 'has',
 'denied',
 'slitting',
 'his',
 'spouses',
 'throat',
 'but',
 'later',
 'confessed',
 'to',
 'a',
 'friend',
 'he',
 'has',
 'been',
 'remanded',
 'in',
 'custody',
 'for',
 'sentencing',
 'and',
 'could',
 'face',
 'life',
 'in',
 'jail']

Removing Stop Words

In [64]:
len(train_df['article'][5537])

665

In [65]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# A set gives constant-time membership checks
stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    # Tokens were lowercased earlier, so they can be compared directly
    return [word for word in tokens if word not in stop_words]

train_df['article'] = train_df['article'].apply(remove_stop_words)
train_df['highlights'] = train_df['highlights'].apply(remove_stop_words)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sivaa\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [66]:
len(train_df['article'][5537])

369

In [71]:
len(train_df['highlights'][5537])

31
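
As a sanity check on the count above, the filter can be replayed on the tokenized highlight printed earlier for row 5537. The sketch uses only the standard library. STOP_WORDS_SUBSET is a hand-picked assumption standing in for NLTK's full English list and contains only the stop words that occur in this one highlight.

```python
# Tokenized highlight for row 5537, as printed earlier in the notebook
tokens = (
    "imran sharif had brutally killed his wife raheela imrans at their home "
    "court heard the yearold then got changed and calmly went to work "
    "sharif has denied slitting his spouses throat but later confessed to "
    "a friend he has been remanded in custody for sentencing and could "
    "face life in jail"
).split()

# Assumption: small hand-picked subset, not NLTK's complete stop-word list
STOP_WORDS_SUBSET = {
    "had", "his", "at", "their", "the", "then", "and", "to",
    "has", "but", "a", "he", "been", "in", "for",
}

# Keep only the tokens that are not stop words
filtered = [word for word in tokens if word not in STOP_WORDS_SUBSET]
print(len(tokens), len(filtered))
```

With this subset the highlight shrinks from 51 tokens to 31, which agrees with the length reported above.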