### **Data Acquisition:-**

In this step we load our data onto the local machine so that it can be supplied to our pipeline.

In [None]:
#Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#Loading our dataset and displaying the first 5 rows
data=pd.read_csv("articles_data.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0
2,2,the-irish-times,The Irish Times,Deirdre McQuillan,"Louise Kennedy AW2019: Long coats, sparkling t...",Autumn-winter collection features designer’s g...,https://www.irishtimes.com/\t\t\t\t\t\t\t/life...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T14:40:00Z,Louise Kennedy is showing off her autumn-winte...,1.0,,,,
3,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0
4,4,bbc-news,BBC News,BBC News,UK government lawyer says proroguing parliamen...,"The UK government's lawyer, David Johnston arg...",https://www.bbc.co.uk/news/av/uk-scotland-4956...,https://ichef.bbci.co.uk/news/1024/branded_new...,2019-09-03T14:39:21Z,,0.0,0.0,0.0,0.0,0.0


In [None]:
#We want to target one specific column of this dataset: the "description" column
data=data["description"]
data.head()

0    The National Transportation Safety Board said ...
1    Latest monthly figures reflect continued growt...
2    Autumn-winter collection features designer’s g...
3    Han is the first North Korean player in the Se...
4    The UK government's lawyer, David Johnston arg...
Name: description, dtype: object

### **Data Pre-Processing:-**

The data that we get from various sources is often not ready to use. It may contain missing values, or elements that cannot be understood by our NLP model, so such elements need to be removed or replaced with appropriate words.

Below are the pre-processing techniques that we apply to our data:

**Checking for Null Values:-**

Null values are entries in our data which have been left unspecified, either due to human error or due to improper handling of data.

If null values are fed into the algorithm, it may raise errors, since null values are ambiguous and act like noise in the data. So, we have to handle them.

In [None]:
#Checking for null values in our data
data.isnull().sum()

#This means our dataset has 24 null values and we've to handle them

24

In [None]:
#Removing the null values
data=data.dropna(axis=0)
data.isnull().sum() #Cross-Checking for null values after removal

0

**Lowercasing:-**

In this method we convert all our text data to lowercase.

This is done because string comparison in Python is *case-sensitive*. If our text contains the same word with different capitalization, e.g. "Dog" and "dog", Python treats them as two different words. So, we convert our data to lowercase.

The **Series.str** accessor in pandas is used to access the string methods for Series or Index objects (a DataFrame itself has no `.str` attribute, so a single column is selected first). It provides various methods for manipulating and analyzing the strings in a column.

Here are some commonly used methods:

**Series.str.lower()** converts all strings in the Series to lowercase.

**Series.str.upper()** converts all strings in the Series to uppercase.

**Series.str.capitalize()** capitalizes the first letter of each string in the Series and lowercases the rest.

**Series.str.len()** returns the length of each string in the Series.
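As a quick illustration of the string methods listed above, here is a minimal sketch on a pandas Series. The sample values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy Series invented for illustration (not from articles_data.csv)
sample = pd.Series(["Dog", "dog", "hello World"])

lowered = sample.str.lower()            # every string in lowercase
uppered = sample.str.upper()            # every string in uppercase
capitalized = sample.str.capitalize()   # first letter upper, rest lower
lengths = sample.str.len()              # number of characters per string

print(lowered.tolist())      # ['dog', 'dog', 'hello world']
print(uppered.tolist())      # ['DOG', 'DOG', 'HELLO WORLD']
print(capitalized.tolist())  # ['Dog', 'Dog', 'Hello world']
print(lengths.tolist())      # [3, 3, 11]

# "Dog" and "dog" count as different values until they are lowercased
print(sample.nunique(), lowered.nunique())  # 3 2
```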

In [None]:
#Converting every entry in our data into lowercase
data1=data.str.lower()
data1.head()

0    the national transportation safety board said ...
1    latest monthly figures reflect continued growt...
2    autumn-winter collection features designer’s g...
3    han is the first north korean player in the se...
4    the uk government's lawyer, david johnston arg...
Name: description, dtype: object

**Removing Punctuation Marks:-**

Punctuation marks are symbols/characters used in English writing to improve readability and comprehension.

It's important to remove these symbols, which can be understood from the following argument:

  Let's say we have the string "Hello! how are you?". When we tokenize the string, there are two possibilities:

  **i)** ["Hello","!","how","are","you","?"] i.e. punctuation marks are treated as separate words.

  This approach is problematic since punctuation marks carry little meaning on their own and act like noise in the list. They increase the complexity without adding useful information.

  **ii)** ["Hello!","how","are","you?"] i.e. punctuation marks are kept attached to their preceding words.

  This approach is problematic because Python will treat "Hello!" and "Hello" as two different entities whenever it encounters "Hello" elsewhere in the text.

So, it is necessary to remove punctuation marks.
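The two tokenization outcomes described above can be reproduced with a small sketch. The example string and the regular expression are assumptions chosen for illustration; they are not the tokenizer used later in this notebook.

```python
import re

# Example string chosen for illustration
text = "Hello! how are you?"

# (i) Punctuation marks become separate tokens:
# match either a run of word characters or a single non-space symbol
separate = re.findall(r"\w+|[^\w\s]", text)
print(separate)  # ['Hello', '!', 'how', 'are', 'you', '?']

# (ii) Punctuation stays attached to the preceding word:
# plain whitespace splitting keeps "Hello!" and "you?" intact
attached = text.split()
print(attached)  # ['Hello!', 'how', 'are', 'you?']

# The attached form no longer matches the bare word
print("Hello!" == "Hello")  # False
```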

To access all the ASCII punctuation marks in Python, we just have to write the following lines of code:



```
import string
string.punctuation
```



In [None]:
#Importing the string module
import string

In [None]:
#The string module provides a string constant which contains all the ASCII punctuation marks
exclude=string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
#Creating a function to remove punctuation marks from string
def remove_punctuation(text):
  for i in exclude:
    if i in text:
      text=text.replace(i," ")
  return(text)

In [None]:
#Invoking the function
data2=data1.apply(remove_punctuation)
data2.head()

0    the national transportation safety board said ...
1    latest monthly figures reflect continued growt...
2    autumn winter collection features designer’s g...
3    han is the first north korean player in the se...
4    the uk government s lawyer  david johnston arg...
Name: description, dtype: object
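As an alternative to looping over every punctuation character, the same replacement can be sketched with a translation table, which processes each string in a single pass. The name `remove_punctuation_fast` is hypothetical and used only for this example. The sketch also shows why "designer’s" keeps its apostrophe in the output above: the curly apostrophe is not part of `string.punctuation`, which only covers ASCII symbols.

```python
import string

# Map every ASCII punctuation character to a single space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punctuation_fast(text):
    # Hypothetical helper name, used only in this sketch
    return text.translate(table)

sample = "the uk government's lawyer, david johnston"
cleaned = remove_punctuation_fast(sample)
print(cleaned)  # the uk government s lawyer  david johnston

# The curly apostrophe is not an ASCII punctuation mark, so it is left untouched
print("’" in string.punctuation)              # False
print(remove_punctuation_fast("designer’s"))  # designer’s
```

If curly quotes should be removed as well, they would have to be added to the set of excluded characters explicitly.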

**Removing StopWords:-**

Stopwords are words like "of", "the", "a", "an" which are used for sentence construction but carry little meaning on their own. They are mostly articles, pronouns, prepositions and conjunctions, and are usually removed.

The **NLTK** library provides lists of such stopwords for different languages, and we use the English list to remove stopwords from our data.

In [None]:
#Importing the required library
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#Displaying the set of stopwords
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
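#Creating a function to remove the stopwords
#The stopword set is built once; a set lookup is much faster than reloading the list for every word
stop_words=set(stopwords.words("english"))
def remove_stopwords(text):
  new_text=[]
  for i in text.split(" "):
    if i in stop_words:
      new_text.append(" ") #Each stopword is replaced by a space, so runs of extra spaces remain
    else:
      new_text.append(i)
  return(" ".join(new_text).lower())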

In [None]:
#Invoking the function
data3=data2.apply(remove_stopwords)
data3.head()

0      national transportation safety board said tu...
1    latest monthly figures reflect continued growt...
2    autumn winter collection features designer’s g...
3    han     first north korean player     serie   ...
4      uk government   lawyer  david johnston argue...
Name: description, dtype: object
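The output above contains runs of extra spaces, because every stopword is replaced by a space rather than dropped. A minimal sketch of a variant that drops stopwords and collapses whitespace is given below. To keep it runnable without any download, a tiny hand-written set `demo_stopwords` stands in for `stopwords.words("english")`; both that set and the name `drop_stopwords` are assumptions made for this example.

```python
# Tiny hand-written stopword set, standing in for the NLTK English list
demo_stopwords = {"the", "is", "in", "a", "of"}

def drop_stopwords(text, stop_words=demo_stopwords):
    # split() without arguments splits on any whitespace and skips empty strings,
    # so no padding spaces are left behind
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)

result = drop_stopwords("han is the first north korean player in the serie a")
print(result)  # han first north korean player serie
```

With the real NLTK list, `set(stopwords.words("english"))` could be passed as `stop_words` instead.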

**Tokenization:-**

In [None]:
#Importing the required library
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
#Sentence tokenizing the dataset
data4=data3.apply(sent_tokenize)
data4.head()

0    [  national transportation safety board said t...
1    [latest monthly figures reflect continued grow...
2    [autumn winter collection features designer’s ...
3    [han     first north korean player     serie  ...
4    [  uk government   lawyer  david johnston argu...
Name: description, dtype: object
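Sentence tokenization returns one list of sentences per entry. For word-level tokens, a rough sketch using plain whitespace splitting on a pandas Series is shown below. This is a simpler stand-in and not the `word_tokenize` function imported above; the sample strings are made up to mimic the extra spaces left by stopword removal.

```python
import pandas as pd

# Toy entries that mimic the padded output of the stopword step
demo = pd.Series(["han     first north korean player", "  uk government   lawyer"])

# str.split() without a pattern splits on runs of whitespace,
# so the extra spaces do not produce empty tokens
tokens = demo.str.split()
print(tokens.tolist())
# [['han', 'first', 'north', 'korean', 'player'], ['uk', 'government', 'lawyer']]
```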