## HIV news article extraction from The Kabul Times

The following parameters are extracted for each article:
- Headline
- Description
- Author
- Published Date
- Category
- Publication
- News
- URL
- Keywords
- Summary

### Importing the necessary Libraries

In [1]:
from newspaper import Article # Article scraping & curation
from bs4 import BeautifulSoup # Python library for pulling data out of HTML and XML files
from requests import get # standard for making HTTP requests in Python
import pandas as pd # library written for data manipulation and analysis
import sys # System-specific parameters and functions (used for progress output)

### Creating empty lists for the article parameters to be extracted

In [2]:
headlines, descriptions, dates, authors, news, keywords, summaries, urls, category, publication = [], [], [], [], [], [], [], [], [], []

### Finding the total number of pages from the site's search results

In [3]:
url = 'https://thekabultimes.gov.af/?s=HIV'

soup = BeautifulSoup(get(url).text, 'lxml')

tokens = [link.text for link in soup.select('.page-numbers')]
max_pages = [token for token in tokens if token.isdigit()]
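
The pagination links may not list every page, since long ranges are often abbreviated (for example 1, 2, …, 9). A minimal sketch, assuming the largest numeric token is the last results page; `page_range` is a hypothetical helper and not part of the original notebook:

```python
def page_range(tokens):
    """Return every page number, as strings, from 1 to the largest numeric token."""
    # isdecimal() accepts only characters that int() can parse
    numbers = [int(token) for token in tokens if token.isdecimal()]
    # With no pagination links there is only the first results page
    last_page = max(numbers, default=1)
    return [str(number) for number in range(1, last_page + 1)]


print(page_range(['1', '2', '…', '9', 'Next']))
```

The result can be used in place of `max_pages`, because the page numbers are still strings.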

### Iterating over max_pages with a for loop to scrape the listing data and article URLs

In [4]:
for page in max_pages:
    try:
        url = 'https://thekabultimes.gov.af/page/' + page + '/?s=HIV'
        soup = BeautifulSoup(get(url).text, 'lxml')

        # Select each group of elements once and reuse it
        titles = soup.select('h2.entry-title')
        metas = soup.select('.entry-meta')

        # Extracts the Headlines
        headlines.extend(title.text for title in titles)

        # Extracts the Authors (None when an entry has no author element)
        for meta in metas:
            tag = meta.select_one('.author')
            authors.append(tag.text if tag is not None else None)

        # Extracts the published dates (None when an entry has no date element)
        for meta in metas:
            tag = meta.select_one('.entry-date')
            dates.append(tag.text if tag is not None else None)

        # Extracts the news category
        category.extend(cat.text.split() for cat in soup.select('.penci-cat-links'))

        # Extracts URLs
        urls.extend(title.a['href'] for title in titles)

    except Exception as exc:
        # Report the failed page instead of silently ignoring it
        print('Failed to scrape page ' + page + ': ' + str(exc))

    sys.stdout.write('\r' + str(page) + '\r')
    sys.stdout.flush()

2

### Removing duplicate URL entries from the list

In [5]:
urls = list(dict.fromkeys(urls))
print("Total Extracted URL's are" + ' : ' + str(len(urls)), type(urls))

Total Extracted URL's are : 18 <class 'list'>
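
`dict.fromkeys` keeps the first occurrence of each URL in its original order. De-duplicating only `urls` would leave the other lists longer if a URL had appeared twice. A minimal sketch, using made-up sample data, of dropping duplicates from parallel lists together; `dedupe_aligned` is a hypothetical helper:

```python
def dedupe_aligned(urls, *columns):
    """Drop repeated URLs and the matching entries of every parallel list."""
    seen = set()
    keep = []
    for index, url in enumerate(urls):
        if url not in seen:
            seen.add(url)
            keep.append(index)
    # Rebuild each list from the kept positions so all stay the same length
    return [[column[i] for i in keep] for column in (urls, *columns)]


sample_urls = ['https://example.org/a', 'https://example.org/b', 'https://example.org/a']
sample_headlines = ['A', 'B', 'A again']
unique_urls, unique_headlines = dedupe_aligned(sample_urls, sample_headlines)
print(unique_urls, unique_headlines)
```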


### Iterating over the URLs with a for loop and scraping each article for the remaining parameters

In [6]:
%%time
for index, url in enumerate(urls):
    try:
        # Download and parse the URL with newspaper's Article
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
            
        # Extracts the Descriptions    
        try:
            descriptions.append(article.meta_description.strip())
        except:
            descriptions.append(None)
            
        # Extracts the news articles
        try:
            news.append(' '.join(article.text.split()).replace("\'\'"," ").replace("\'", "").replace(" / ", ""))
        except:
            news.append(None)

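        # Extracts Keywords and Summaries
        # Both values are computed before appending, so a failure on the
        # summary cannot leave an extra entry in keywords
        try:
            article_keywords = article.keywords
            article_summary = ' '.join(article.summary.split())
        except Exception:
            article_keywords, article_summary = None, None
        keywords.append(article_keywords)
        summaries.append(article_summary)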
            
        # Extracts the news publication
        try:
            publication.append(article.meta_data['og']['site_name'])
        except:
            publication.append(None)
                        
    except:
        descriptions.append(None)
        news.append(None)
        keywords.append(None)
        publication.append(None)
        summaries.append(None)

    sys.stdout.write('\r' + str(index) + ' : ' + str(url) + '\r')
    sys.stdout.flush()

Wall time: 1min 38s
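
The try/except blocks in the cell above all follow one pattern: run an extractor and store `None` if it fails. A minimal sketch of that pattern as a helper; `safe` and `FakeArticle` are hypothetical names, and the stand-in class exists only so the example runs without network access:

```python
def safe(extract, default=None):
    """Call extract() and return its result, or default if it raises."""
    try:
        return extract()
    except Exception:
        return default


class FakeArticle:
    """Stand-in for a parsed article; only the attributes used below exist."""
    meta_description = '  KABUL: sample description  '
    meta_data = {}  # no 'og' entry, so the publication lookup fails


article = FakeArticle()
description = safe(lambda: article.meta_description.strip())
publication_name = safe(lambda: article.meta_data['og']['site_name'])
print(description, publication_name)
```

Each list then receives exactly one value per article, which keeps the list lengths equal.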


### Checking Array Length of each list to create DataFrame

In [7]:
print(len(headlines), len(descriptions), len(authors), len(dates), len(category), len(publication), len(news), len(keywords), len(summaries), len(urls))

18 18 18 18 18 18 18 18 18 18
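
When the lengths differ, it helps to see which list is short. A minimal sketch with made-up sample lists; `length_report` is a hypothetical helper:

```python
def length_report(**columns):
    """Return (all_equal, lengths) for the named lists."""
    lengths = {name: len(values) for name, values in columns.items()}
    all_equal = len(set(lengths.values())) <= 1
    return all_equal, lengths


ok, lengths = length_report(headlines=['a', 'b'], authors=['x', 'y'], urls=['u'])
print(ok, lengths)
```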


### Creating a CSV file after checking the list lengths and dropping missing values from the dataset

In [8]:
if len(headlines) == len(descriptions) == len(authors) == len(dates) == len(news) == len(publication) == len(keywords) == len(summaries) == len(urls) == len(category):
    tbl = pd.DataFrame({'Headlines' : headlines,
                        'Descriptions' : descriptions,
                        'Authors' : authors,
                        'Published_Dates' : dates,
                        'Publication' : publication,
                        'Articles' : news,
                        'category' : category,
                        'Keywords' : keywords,
                        'Summaries' : summaries,
                        'Source_URLs' : urls})
    tbl = tbl.dropna()  # dropna() returns a new DataFrame, so the result must be assigned
    path = 'D:\\#Backups\\Desktop\\!Code!\\CDRI\\HIV\\Data Extraction\\#Datasets\\'
    tbl.to_csv(path+'The_Kabul_Times.csv', index=False)
else:
    print('Array length does not match!')

tbl.head()

| | Headlines | Descriptions | Authors | Published_Dates | Publication | Articles | category | Keywords | Summaries | Source_URLs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | National Archive needs cooperation to develop,... | KABUL: The National Archive of Afghanistan has... | Saida Ahmadi | August 27, 2018 | The Kabul Times | KABUL: The National Archive of Afghanistan has... | [Culture, Social] | [official, archive, visit, afghanistan, rahbee... | KABUL: The National Archive of Afghanistan has... | https://thekabultimes.gov.af/2018/08/27/nation... |
| 1 | Sancharaki inaugurates photo exhibition in Nat... | KABUL: Fazel Sancharaki, the Deputy Minister o... | Saida Ahmadi | August 14, 2018 | The Kabul Times | KABUL: Fazel Sancharaki, the Deputy Minister o... | [National] | [culture, archive, inaugurates, kabul, deputy,... | KABUL: Fazel Sancharaki, the Deputy Minister o... | https://thekabultimes.gov.af/2018/08/14/sancha... |
| 2 | “Attention to National Archive means to protec... | KABUL: Acting and Nominee Minister of Informat... | Saida Ahmadi | August 1, 2018 | The Kabul Times | KABUL: Acting and Nominee Minister of Informat... | [National] | [information, archive, identity, relics, safi,... | KABUL: Acting and Nominee Minister of Informat... | https://thekabultimes.gov.af/2018/08/01/attent... |
| 3 | “Arg Archive would only be run by MoIC,” Presi... | KABUL: President Mohammad Ashraf Ghani, during... | Saida Ahmadi | July 21, 2018 | The Kabul Times | KABUL: President Mohammad Ashraf Ghani, during... | [National] | [ghani, ministry, archive, information, kabul,... | KABUL: President Mohammad Ashraf Ghani, during... | https://thekabultimes.gov.af/2018/07/21/arg-ar... |
| 4 | Health sector still facing challenges despite ... | World Health Day is marked globally on 7 April... | Saida Ahmadi | April 7, 2019 | The Kabul Times | World Health Day is marked globally on 7 April... | [Health] | [incidents, despite, sector, afghanistan, chal... | Based on remarks of minister of public health ... | https://thekabultimes.gov.af/2019/04/07/health... |


In [9]:
tbl.shape

(18, 10)
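
The hard-coded Windows path in the export cell ties the notebook to one machine. A minimal sketch of building the output location with `pathlib` instead; `csv_path` and the `Datasets` folder name are assumptions for illustration, and a temporary directory stands in for the real base folder:

```python
import tempfile
from pathlib import Path


def csv_path(base_dir, filename='The_Kabul_Times.csv'):
    """Create base_dir/Datasets if needed and return the CSV path inside it."""
    output_dir = Path(base_dir) / 'Datasets'
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir / filename


# A temporary directory keeps the example from writing into the working folder
with tempfile.TemporaryDirectory() as base_dir:
    target = csv_path(base_dir)
    print(target.name, target.parent.is_dir())
```

The returned path can be passed directly to `tbl.to_csv(target, index=False)`.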