#### In this notebook we collect and preprocess the datasets before concatenating them into a single DataFrame.

### Setup 

Data manipulation libraries:

In [1]:
import pandas as pd 
import numpy  as np

Text Preprocessing libraries:

In [41]:
import string as st
import re
import nltk
from   nltk  import WordNetLemmatizer

Storage libraries: 

In [38]:
import  pickle
from    sqlalchemy import create_engine

### Data-Sets

The selected datasets are the [**Global News Feeds**](https://www.kaggle.com/therohk/global-news-week) and the [**Language Identification dataset**](https://www.kaggle.com/zarajamshaid/language-identification-datasst?select=dataset.csv).

- **Global News Feeds:** This dataset is a snapshot of most of the new news content published online over one week.   
It covers the seven-day period from August 24 through August 30 for the years 2017 and 2018.



1. 2017 Dataset

In [3]:
G_news_2017 = pd.read_csv('news-week-17aug24.csv')

In [4]:
G_news_2017.shape

(1395586, 4)

2. 2018 Dataset

In [5]:
G_news_2018 = pd.read_csv('news-week-18aug24.csv')

In [6]:
G_news_2018.shape

(1909739, 4)

- **Language Identification dataset:** Contains multiple languages, each with 1000 rows/paragraphs.

3. Language Identification dataset

In [7]:
Lang = pd.read_csv('dataset.csv')

In [8]:
Lang.shape

(22000, 2)

### Data Preprocessing

- First, concatenate the two Global News Feeds datasets, since they share the same columns.

In [9]:
Global_News_Feeds = pd.concat([G_news_2017,G_news_2018])

In [10]:
Global_News_Feeds.shape

(3305325, 4)
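
`pd.concat` stacks the rows but keeps each frame's original row labels. A minimal sketch of that behaviour, assuming two tiny hypothetical frames in place of the 2017 and 2018 files:

```python
import pandas as pd

# Hypothetical two-row frames standing in for the 2017 and 2018 files.
news_2017 = pd.DataFrame({"headline_text": ["a", "b"]})
news_2018 = pd.DataFrame({"headline_text": ["c", "d"]})

# Default behaviour: the original row labels are kept, so they repeat.
stacked = pd.concat([news_2017, news_2018])
print(stacked.index.tolist())

# ignore_index=True assigns fresh labels 0..n-1 to the combined frame.
renumbered = pd.concat([news_2017, news_2018], ignore_index=True)
print(renumbered.index.tolist())
```

Either choice works for this notebook as long as rows are not later selected by label with `.loc`.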

- Rename the text column and drop the unneeded columns.

In [11]:
Global_News_Feeds.rename(
    inplace=True,
    columns={'headline_text':'Text'})

In [12]:
Global_News_Feeds.drop(['publish_time' ,'feed_code', 'source_url'], axis=1, inplace=True)

In [13]:
Global_News_Feeds.shape

(3305325, 1)

- Language detection

In [14]:
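# requires the NLTK stopwords corpus, e.g. nltk.download('stopwords')
# English stopwords, plus one stopword set per language available in the corpus
englishStopWords    = set(nltk.corpus.stopwords.words('english'))
stopWordsDictionary = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}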

In [15]:
# detection function: return the language whose stopword set overlaps most with the text
def get_language(text):
    # str() guards against non-string values such as NaN
    words = set(nltk.wordpunct_tokenize(str(text).lower()))
    return max(stopWordsDictionary, key=lambda lang: len(words & stopWordsDictionary[lang]))

In [16]:
Global_News_Feeds['language']= Global_News_Feeds['Text'].astype(str).apply(get_language)

In [17]:
Global_News_Feeds.shape

(3305325, 2)
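
The detection step scores each language by how many of its stopwords occur in the text and returns the best-scoring one. A minimal sketch of that idea, assuming tiny hand-written stopword sets (hypothetical stand-ins for the NLTK corpus) and a simple regex tokenizer in place of `nltk.wordpunct_tokenize`:

```python
import re

# Hypothetical, heavily abbreviated stopword lists; the notebook uses the NLTK corpus instead.
STOPWORDS = {
    "english": {"the", "is", "on", "and", "of", "to", "in"},
    "spanish": {"el", "la", "es", "en", "y", "de", "los"},
    "french": {"le", "la", "est", "sur", "et", "de", "les"},
}

def detect_language(text):
    """Return the language whose stopword set overlaps most with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    # max() returns the first key with the highest score
    return max(scores, key=scores.get)

print(detect_language("The price of petrol is on the rise"))
print(detect_language("El precio de la gasolina es alto en los mercados"))
```

A text that contains no stopwords at all scores zero for every language, so `max` falls back to the first language in iteration order. Short headlines are therefore easy to mislabel with this approach.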

- Create new DataFrames that contain only the English-language rows.

In [18]:
English_Global_news = Global_News_Feeds[Global_News_Feeds.language == 'english']

In [19]:
English_Global_news.shape

(1015669, 2)

In [22]:
English_Global_news.head(2)

Unnamed: 0,Text,language
0,Here Are the Details on Facebook's Global Part...,english
3,Petrol & diesel on the rise post daily price r...,english


In [20]:
English_Lang = Lang[Lang.language == 'English']

In [21]:
English_Lang.shape

(1000, 2)

In [23]:
English_Lang.head(2)

Unnamed: 0,Text,language
37,in johnson was awarded an american institute ...,English
40,bussy-saint-georges has built its identity on ...,English


- Concatenate the English rows of the Language Identification dataset.

In [24]:
English_Global_news = pd.concat([English_Global_news,English_Lang])

In [25]:
English_Global_news.shape

(1016669, 2)

In [26]:
English_Global_news.head(2)

Unnamed: 0,Text,language
0,Here Are the Details on Facebook's Global Part...,english
3,Petrol & diesel on the rise post daily price r...,english


- Since all rows are now in English, the language column is no longer needed.

In [27]:
English_Global_news.drop(['language'], axis=1, inplace=True)

In [28]:
English_Global_news.shape

(1016669, 1)

- Drop duplicates

In [29]:
English_Global_news = English_Global_news.drop_duplicates()

In [30]:
English_Global_news.shape

(694951, 1)

- Check for NaN

In [31]:
English_Global_news.isnull().values.any()

False

- Since the stopword-based function above does not always detect the language correctly, we also remove every row that contains non-ASCII characters.

In [32]:
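# one or more characters outside the ASCII range
patternDel = r'[^\x00-\x7F]+'
# boolean mask of rows that contain such characters (named to avoid shadowing the built-in filter)
non_ascii_mask = English_Global_news['Text'].str.contains(patternDel)

In [33]:
English_Global_news = English_Global_news[~non_ascii_mask]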

In [34]:
English_Global_news.shape

(665181, 1)
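
The pattern `[^\x00-\x7F]+` matches any run of characters outside the ASCII range, and every row containing such a run is dropped. A minimal sketch of the same filter on a plain Python list, with hypothetical example headlines:

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]+")

# Hypothetical headlines for illustration only.
headlines = [
    "Understanding the Fake News Phenomenon",
    "Caf\u00e9 owners protest new tax",            # contains an accented letter
    "\u65b0\u95fb headline mixed with English",    # contains CJK characters
]

# Keep only the headlines made up entirely of ASCII characters.
ascii_only = [h for h in headlines if not NON_ASCII.search(h)]
print(ascii_only)
```

Note that the filter removes whole rows, so English headlines that contain accented letters or typographic quotes are discarded as well.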

- Sample of the data:

In [35]:
English_Global_news.sample(15)

Unnamed: 0,Text
1606425,Lancaster service member surprises little sist...
1304372,Understanding the Fake News Phenomenon
424391,Sligo group honoured with Spanish civil order
1329427,Greenville kayak maker sending dozens to polic...
1387230,Couple Celebrates Cracker Barrel Milestone Wit...
814850,"Industry News Diodes Market Size, Share, Growt..."
1334144,Major Transportation Projects in Eastvale FAQ's
164984,Listowel man charged with careless driving aft...
231567,What Are CGI Scripts and How Do They Improve W...
1002386,New Australian PM Gets to Work; Gov't Support ...


### Text Preprocessing

In [36]:
def precosseing_pipeline(text):
    # remove urls
    text = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', ' ', text)
    # remove punctuation
    text = "".join(ch for ch in text if ch not in st.punctuation)
    # replace every remaining non-letter character (digits included) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # lowercase and split on whitespace
    tokens = text.lower().split()
    # remove stopwords using the English set built earlier (a set lookup avoids reloading the list for every word)
    tokens = [word for word in tokens if word not in englishStopWords]
    # join the remaining tokens back into a single string
    return " ".join(tokens)

In [37]:
English_Global_news['Text'] = English_Global_news['Text'].apply(precosseing_pipeline)
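
To see what each cleaning step does, the pipeline can be traced on a single headline. This is a minimal sketch that mirrors the steps above; the function name `preprocess`, the abbreviated stopword set and the sample text are assumptions made for illustration:

```python
import re
import string

# Hypothetical, abbreviated stopword set standing in for NLTK's English list.
STOPWORDS = {"here", "are", "the", "on", "at", "a", "of", "and", "is"}

def preprocess(text):
    # drop URLs (same pattern as in the notebook)
    text = re.sub(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", " ", text)
    # drop punctuation, so "Facebook's" becomes "Facebooks"
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # replace digits and any other non-letter characters with spaces
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # lowercase, split on whitespace, and filter out stopwords
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

cleaned = preprocess("Here Are the Details on Facebook's plan: https://example.com at 5pm!")
print(cleaned)  # details facebooks plan pm
```

Because punctuation is deleted rather than replaced by a space, possessives collapse into a single token, which is why forms such as `michigans` appear in the processed data.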

### Apply Lemmatization

 - Lemmatization: reduce each word to its base form using vocabulary and morphological analysis.

In [39]:
def apply_lemmatize(text):
    # relies on the global `lemmatizer` created in the next cell
    tokens = text.split()
    # first pass: lemmatize each token as a verb
    tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]
    # second pass: lemmatize each token as an adjective
    tokens = [lemmatizer.lemmatize(token, pos='a') for token in tokens]
    return ' '.join(tokens)

In [42]:
lemmatizer = WordNetLemmatizer()
English_Global_news['Text_lemma'] = English_Global_news['Text'].apply(apply_lemmatize)

In [43]:
English_Global_news.sample(5)

Unnamed: 0,Text,Text_lemma
248933,kentucky cabinet health family services redesi...,kentucky cabinet health family service redesig...
1490395,blue earth county looking election judges ahea...,blue earth county look election judge ahead no...
238841,redbox sony pictures home entertainment announ...,redbox sony picture home entertainment announc...
47202,download gods soldiers penguin anthology conte...,download gods soldier penguin anthology contem...
1117416,michigans tarik black get ready nico collins,michigans tarik black get ready nico collins


- Store the data

In [None]:
# csv

In [44]:
English_Global_news.to_csv('English_Global_news.csv')

In [None]:
# pickle

In [45]:
English_Global_news.to_pickle("./English_Global_news.pkl")

In [None]:
# sql

In [46]:
engine = create_engine('sqlite://', echo=False)
English_Global_news.to_sql('English_Global_news', con=engine)
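
The URL `sqlite://` with no path points SQLAlchemy at an in-memory SQLite database, so the table written above only lives as long as the engine does. A minimal sketch of writing to a file-backed database and reading it back, using the standard-library `sqlite3` module rather than SQLAlchemy; the file name, table name and rows are placeholders:

```python
import os
import sqlite3
import tempfile

# Placeholder rows standing in for the processed headlines.
rows = [("kentucky cabinet health",), ("blue earth county",)]

# Hypothetical file name inside a fresh temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "news_example.db")

# Write the rows to a file-backed database, then close the connection.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE news (headline TEXT)")
conn.executemany("INSERT INTO news (headline) VALUES (?)", rows)
conn.commit()
conn.close()

# Reopen the file: the rows survive because they were written to disk.
conn = sqlite3.connect(db_path)
stored = conn.execute("SELECT headline FROM news ORDER BY rowid").fetchall()
conn.close()
print(stored)
```

With SQLAlchemy, a file path in the URL (for example `sqlite:///English_Global_news.db`) should give the same persistence.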