**Step 1:** Get articles from [The Lancet](https://www.thelancet.com).
We can use the requests library to do this.

In [1]:
# import statements
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
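# fetch web page; the timeout stops a stalled connection from hanging the notebook
r = requests.get("https://www.thelancet.com/coronavirus", timeout=10)
r.raise_for_status()  # raise on 4xx/5xx responses instead of parsing an error page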

**Step 2:** Use BeautifulSoup to parse the HTML.
Use the "lxml" parser rather than "html5lib", which is much slower.
Printing the full parsed document would produce more output than this notebook can comfortably display, so the cell below does not print it.
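
Instead of printing the whole document, a few summary values are usually enough to confirm that the parse worked. The sketch below runs on a made-up inline HTML string rather than the fetched page, and it uses "html.parser" only so that it runs without lxml installed; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text; the real page is far larger.
sample_html = """
<html><head><title>Sample page</title></head>
<body><div class="a">one</div><div class="b">two</div></body></html>
"""

# "html.parser" ships with Python; swap in "lxml" when it is installed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.get_text()      # text of the <title> tag
div_count = len(sample_soup.find_all("div"))   # number of <div> tags parsed
preview = sample_soup.prettify()[:80]          # first 80 characters only

print(page_title, div_count)
print(preview)
```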


In [3]:
soup = BeautifulSoup(r.text, "lxml")

**Step 3:** Find all articles
Use BeautifulSoup's find_all method to select elements by tag type and class name. In Chrome, you can right-click an item on the page and click "Inspect" to view its HTML.
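
To make the tag-plus-class lookup concrete, here is a minimal sketch on a hand-written snippet. The markup is hypothetical and only borrows the `articleCitation` class name; "html.parser" is used so the sketch has no lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class name used on the listing page.
sample_html = """
<div class="articleCitation"><h4 class="title">First</h4></div>
<div class="articleCitation"><h4 class="title">Second</h4></div>
<div class="other"><h4 class="title">Ignored</h4></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match <div> tags whose class attribute includes "articleCitation".
matches = sample_soup.find_all("div", {"class": "articleCitation"})
# class_ is an equivalent keyword form of the same filter.
same_matches = sample_soup.find_all("div", class_="articleCitation")

titles = [m.h4.get_text() for m in matches]
print(len(matches), titles)
```

The third `<div>` is skipped because its class does not match, even though it contains the same kind of heading.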

In [4]:
# Find all articles
articles = soup.find_all("div", {"class":"articleCitation"})
print('Number of articles:', len(articles))

Number of articles: 45


In [8]:
# print one article from the results (index 30)
print(articles[30].prettify())

<div class="articleCitation">
 <li>
  <div class="detail">
   <div class="article-details">
    <div class="articleType doctopic-1-primaryResearch label-articles">
     Articles
    </div>
    <div class="articleTitle">
     <h4 class="title" id="S0140-6736(20)30183-5-title">
      <div class="rightTitleInfo">
       <div class="icons atype-fla">
        <!--${freeContentIcon: 10.1016/S0140-6736(20)30183-5}-->
       </div>
      </div>
      <a href="/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext">
       Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
      </a>
     </h4>
    </div>
    <div class="doi" data-doi="10.1016/S0140-6736(20)30183-5">
     DOI:
     <a href="https://doi.org/10.1016/S0140-6736(20)30183-5">
      https://doi.org/10.1016/S0140-6736(20)30183-5
     </a>
    </div>
    <div class="citation">
     <span class="journalTitleSp">
      The Lancet
     </span>
     ,
     <span class="issueVolSp">
      Vol. 395
     </

In [9]:
# Extract article type
articles[30].select_one(".articleType").get_text().strip()

'Articles'

In [10]:
# Extract article title
articles[30].select_one(".articleTitle").get_text().strip()

'Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China'

In [11]:
# Article Link
link = articles[30].select_one(".articleTitle").select_one("a")['href']
link

'/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext'
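
The extracted href is relative to the site root, so it has to be combined with the host before it can be requested. One way to do that, sketched here with the standard library's `urljoin` (the `BASE_URL` name is introduced only for this example), also leaves links that are already absolute untouched:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.thelancet.com"  # illustrative constant, not from the notebook
relative_link = "/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext"

# A root-relative path is resolved against the host.
absolute_link = urljoin(BASE_URL, relative_link)

# A link that is already absolute is returned unchanged.
doi_link = urljoin(BASE_URL, "https://doi.org/10.1016/S0140-6736(20)30183-5")

print(absolute_link)
print(doi_link)
```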

In [21]:
# Get Abstract from link
d = requests.get("https://www.thelancet.com" + link)
content_soup = BeautifulSoup(d.text, "lxml")
date = content_soup.select_one(".article-header__publish-date__value")
sections = content_soup.find_all("div", {"class":"section-paragraph"})
print(date.get_text())
print()
for section in sections:
    print(section.get_text().replace('\n', '').rstrip())
    print()

January 24, 2020

BackgroundA recent cluster of pneumonia cases in Wuhan, China, was caused by a novel betacoronavirus, the 2019 novel coronavirus (2019-nCoV). We report the epidemiological, clinical, laboratory, and radiological characteristics and treatment and clinical outcomes of these patients.MethodsAll patients with suspected 2019-nCoV were admitted to a designated hospital in Wuhan. We prospectively collected and analysed data on patients with laboratory-confirmed 2019-nCoV infection by real-time RT-PCR and next-generation sequencing. Data were obtained with standardised data collection forms shared by WHO and the International Severe Acute Respiratory and Emerging Infection Consortium from electronic medical records. Researchers also directly communicated with patients or their families to ascertain epidemiological and symptom data. Outcomes were also compared between patients who had been admitted to the intensive care unit (ICU) and those who had not.FindingsBy Jan 2, 2020, 
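
In the output above, section headings run straight into the text that follows them ("BackgroundA recent", "MethodsAll patients"), because `get_text()` joins the text of nested tags with no separator. A possible remedy is the `separator` argument of `get_text`. The sketch below assumes, purely for illustration, that each heading is an `<h3>` nested inside the `section-paragraph` div; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: heading tags nested inside each section div.
sample_html = """
<div class="section-paragraph"><h3>Background</h3>A recent cluster of cases.</div>
<div class="section-paragraph"><h3>Methods</h3>All patients were admitted.</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
sections = sample_soup.find_all("div", {"class": "section-paragraph"})

# Default behaviour: adjacent text nodes are concatenated directly.
fused = [s.get_text() for s in sections]

# With a separator, each text node is stripped and joined with a space.
spaced = [s.get_text(separator=" ", strip=True) for s in sections]

print(fused[0])
print(spaced[0])
```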

In [11]:
# Extract article Digital Object Identifier
articles[0].select_one(".doi").get_text().strip()

'DOI: https://doi.org/10.1016/S1474-4422(20)30147-2'

In [12]:
# Extract article authors
articles[0].select_one(".authors").get_text().strip()

'Maria Pia Sormani on behalf of the Italian Study Group on COVID-19 infection in multiple sclerosis'

In [13]:
# Extract citation
articles[30].select_one(".citation").get_text().strip()

'The Lancet, Vol. 395, No. 10223'

In [15]:
# Extract published date
articles[30].select_one(".published-online")

In [15]:
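# Extract availability; select_one takes a CSS selector, whereas find_all(".OALabel") would look for a tag named ".OALabel"
'Open' if articles[0].select_one(".OALabel") else 'Closed'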

'Closed'
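
A note on the availability check: `find_all` treats its first positional argument as a tag name, while `select` and `select_one` take CSS selectors. A leading dot therefore only means "class" in the latter. The sketch below shows the difference on a hypothetical snippet; whether the live page actually marks open-access articles with an `OALabel` class is an assumption carried over from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup carrying an open-access label.
sample_html = (
    '<div class="articleCitation">'
    '<span class="OALabel">Open Access</span>'
    "</div>"
)
sample_article = BeautifulSoup(sample_html, "html.parser")

# Searches for a tag whose *name* is ".OALabel": never matches.
by_tag_name = sample_article.find_all(".OALabel")

# CSS selector: matches elements with class "OALabel".
by_selector = sample_article.select(".OALabel")

# Equivalent find_all form using the class_ keyword.
by_class = sample_article.find_all(class_="OALabel")

print(len(by_tag_name), len(by_selector), len(by_class))
```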

## Create dataset from All Articles

In [16]:
# Get Abstract from Detail page (link)
def get_abstract(link: str):
    print("getting abstract from https://www.thelancet.com" + link)
    abstract = ""
    try:
        d = requests.get("https://www.thelancet.com" + link, timeout=10)
        content_soup = BeautifulSoup(d.text, "lxml")
    except requests.RequestException:
        # Network error or timeout: record the abstract as missing
        return "N/A"
    sections = content_soup.find_all("div", {"class":"section-paragraph"})
    for section in sections:
        abstract += section.get_text()

    return abstract

In [17]:
# Create data list
data = list()
for article in articles:
    data.append(
        [article.select_one(".articleType").get_text().strip() if article.find_all("div", {"class": "articleType"}) else 'N/A',
         article.select_one(".articleTitle").get_text().strip(),
         article.select_one(".articleTitle").select_one("a")['href'],
         article.select_one(".doi").get_text().split('DOI:')[1],
         article.select_one(".authors").get_text() if article.find_all("div", {"class": "authors"}) else 'N/A',
         article.select_one(".citation").get_text().strip(),
         get_abstract(article.select_one(".articleTitle").select_one("a")['href']),
         article.select_one(".published-online").get_text().split('Published:')[1] if article.find_all("div", {"class": "published-online"}) else 'N/A',
         'Open' if article.select_one(".OALabel") else 'Closed'  # CSS selector; find_all(".OALabel") matches nothing
    ])

getting abstract from https://www.thelancet.com/journals/laneur/article/PIIS1474-4422(20)30147-2/fulltext
getting abstract from https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31016-3/fulltext
getting abstract from https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31023-0/fulltext
getting abstract from https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31015-1/fulltext
getting abstract from https://www.thelancet.com/journals/lanchi/article/PIIS2352-4642(20)30131-0/fulltext
getting abstract from https://www.thelancet.com/journals/landia/article/PIIS2213-8587(20)30156-X/fulltext
getting abstract from https://www.thelancet.com/journals/langlo/article/PIIS2214-109X(20)30204-7/fulltext
getting abstract from https://www.thelancet.com/journals/langlo/article/PIIS2214-109X(20)30213-8/fulltext
getting abstract from https://www.thelancet.com/journals/langlo/article/PIIS2214-109X(20)30214-X/fulltext
getting abstract from https://www.thelancet.co
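
The loop above repeats the same "extract the text if the element exists, otherwise use 'N/A'" pattern for several fields. One way to factor that out is a small helper. The sketch below is hypothetical (`text_or_default` is not part of the notebook) and runs on made-up markup:

```python
from bs4 import BeautifulSoup


def text_or_default(node, selector, default="N/A"):
    """Return the stripped text of the first CSS match, or default if none."""
    match = node.select_one(selector)
    return match.get_text().strip() if match is not None else default


# Hypothetical article markup with no ".authors" element.
sample_html = """
<div class="articleCitation">
  <div class="articleTitle"><a href="/x">A title</a></div>
  <div class="citation"> The Lancet </div>
</div>
"""
sample_article = BeautifulSoup(sample_html, "html.parser")

row = [
    text_or_default(sample_article, ".articleTitle"),
    text_or_default(sample_article, ".citation"),
    text_or_default(sample_article, ".authors"),  # missing, falls back
]
print(row)
```

Each field then needs a single lookup instead of one `find_all` to test for presence and one `select_one` to read the value.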

In [18]:
# Create pandas dataframe
df = pd.DataFrame(data, columns = ['Type', 'Title', 'Link', 'DOI_link', 'Authors', 'Citation', 'Abstract', 'PublishedDate', 'Availability'])
df.head()

| | Type | Title | Link | DOI_link | Authors | Citation | Abstract | PublishedDate | Availability |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Correspondence | An Italian programme for COVID-19 infection in... | /journals/laneur/article/PIIS1474-4422(20)3014... | https://doi.org/10.1016/S1474-4422(20)30147-2 | Maria Pia Sormani on behalf of the Italian Stu... | The Lancet Neurology | Italy was the first European country to encoun... | April 30, 2020 | Closed |
| 1 | Correspondence | Institutional, not home-based, isolation could... | /journals/lancet/article/PIIS0140-6736(20)3101... | https://doi.org/10.1016/S0140-6736(20)31016-3 | Borame L Dickens, Joel R Koo, Annelies Wilder-... | The Lancet | In the absence of vaccines, non-pharmaceutical... | April 29, 2020 | Closed |
| 2 | Comment | Remdesivir for COVID-19: challenges of underpo... | /journals/lancet/article/PIIS0140-6736(20)3102... | https://doi.org/10.1016/S0140-6736(20)31023-0 | John David Norrie | The Lancet | In The Lancet, Yeming Wang and colleagues1Wang... | April 29, 2020 | Closed |
| 3 | Correspondence | Preventing major outbreaks of COVID-19 in jails | /journals/lancet/article/PIIS0140-6736(20)3101... | https://doi.org/10.1016/S0140-6736(20)31015-1 | Justin T Okano, Sally Blower | The Lancet | Severe acute respiratory syndrome coronavirus ... | April 29, 2020 | Closed |
| 4 | Comment | Promoting healthy movement behaviours among ch... | /journals/lanchi/article/PIIS2352-4642(20)3013... | https://doi.org/10.1016/S2352-4642(20)30131-0 | Hongyan Guan, Anthony D Okely, Nicolas Aguilar... | The Lancet Child & Adolescent Health | Global movement behaviour guidelines recommend... | April 29, 2020 | Closed |


In [49]:
# Export CSV
df.to_csv('data/the_lancet_articles.csv')
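
By default `to_csv` also writes the DataFrame index as an unnamed first column, which `read_csv` then loads back as a column called `Unnamed: 0`. If that extra column is not wanted, `index=False` is one option. A minimal sketch using an in-memory buffer and a made-up one-row frame instead of the real file:

```python
import io

import pandas as pd

# Made-up frame standing in for df.
sample_df = pd.DataFrame({"Type": ["Comment"], "Title": ["A title"]})

# Default: the index is written and comes back as "Unnamed: 0".
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the named columns are written.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```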

In [21]:
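def lemmatize(text: str) -> list:
    print(text)  # debug: echo the raw abstract
    if not isinstance(text, str):  # missing abstracts load from CSV as NaN (a float)
        return []
    stop_words = set(stopwords.words("english"))  # build the set once, not once per word
    lemmatizer = WordNetLemmatizer()
    words = [w for w in text.split() if w not in stop_words]
    return [lemmatizer.lemmatize(w, pos='v') for w in words]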


In [27]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
df['lemmas_abstract'] = df.apply(lambda row: lemmatize(row.Abstract), axis=1)

Italy was the first European country to encounter the effects of the coronavirus disease 2019 (COVID-19) pandemic.1Li C Romagnani P von Brunn A Hans-Joachim A SARS-CoV-2 and Europe: timing of containment measures for outbreak control.Infection. 2020;  (published online April 9.)DOI:10.1007/s15010-020-01420-9Google Scholar For people with multiple sclerosis, the situation carries additional reasons for concern. Although emerging work suggests that some coexisting diseases, such as hypertension, might increase the severity of the COVID-19 infection, how less common conditions, such as multiple sclerosis, effect COVID-19 outcomes is still uncertain. Furthermore, immunosuppressive therapies, the mainstay of treatment for multiple sclerosis, might confer additional risks or, on the contrary, confer some protection. Therefore, collecting information to evaluate the relationship between multiple sclerosis and COVID-19 and implement immediate and appropriate protective strategies is crucial. L

In [28]:
pd.set_option('display.max_colwidth', 500)
df.lemmas_abstract.head()

0    [Italy, first, European, country, encounter, effect, coronavirus, disease, 2019, (COVID-19), pandemic.1Li, C, Romagnani, P, von, Brunn, A, Hans-Joachim, A, SARS-CoV-2, Europe:, time, containment, measure, outbreak, control.Infection., 2020;, (published, online, April, 9.)DOI:10.1007/s15010-020-01420-9Google, Scholar, For, people, multiple, sclerosis,, situation, carry, additional, reason, concern., Although, emerge, work, suggest, coexist, diseases,, hypertension,, might, increase, severity,...
1    [In, absence, vaccines,, non-pharmaceutical, interventions, physical, distancing,, intensive, contact, tracing,, case, isolation, remain, frontline, measure, control, spread, severe, acute, respiratory, syndrome, coronavirus, 2.1Wilder-Smith, A, Freedman, DO, Isolation,, quarantine,, social, distance, community, containment:, pivotal, role, old-style, public, health, measure, novel, coronavirus, (2019-nCoV), outbreak.J, Travel, Med., 2020;, 27taaa020Google, Scholar, In, Wuhan,, China,,
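
The tokens above still carry punctuation ("sclerosis,", "concern.") and capitalised stop words ("For", "In") survive, because splitting on whitespace keeps punctuation attached and the stop-word comparison is case-sensitive. A possible preprocessing step is sketched below with a regular expression. The `tokenize` and `remove_stop_words` helpers and the tiny stop-word set are hypothetical stand-ins so the sketch runs without NLTK data; lowercasing is a design choice that also loses the distinction between acronyms and ordinary words.

```python
import re

# Tiny illustrative stop-word set; a stand-in for NLTK's English list.
STOP_WORDS = {"in", "the", "of", "for", "with", "a"}


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits, allowing inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


sample = "For people with multiple sclerosis, the situation (COVID-19) carries concern."
tokens = remove_stop_words(tokenize(sample))
print(tokens)
```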

In [29]:
raw_lancet = context.catalog.load("raw_lancet_articles")

2020-04-30 13:42:57,443 - kedro.io.data_catalog - INFO - Loading data from `raw_lancet_articles` (CSVDataSet)...


In [None]:
raw_lancet['lemmas_abstract'] = raw_lancet.apply(lambda row: lemmatize(row.Abstract), axis=1)
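
Rows whose abstract is missing load from CSV as NaN, which is a float, so string methods such as `split` fail on them. A minimal sketch of guarding against that before applying a text function; `split_words` is a hypothetical stand-in for `lemmatize`, and the two-row frame is made up:

```python
import pandas as pd


def split_words(text):
    # Stand-in for lemmatize; only the NaN guard matters here.
    return text.split() if isinstance(text, str) else []


# Made-up frame with one present and one missing abstract.
sample_df = pd.DataFrame(
    {"Abstract": ["Severe acute respiratory syndrome", float("nan")]}
)

missing_count = int(sample_df["Abstract"].isna().sum())
sample_df["words"] = sample_df["Abstract"].apply(split_words)

print(missing_count)
print(sample_df["words"].tolist())
```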

In [34]:
raw_lancet.iloc[23]

Type                                                                                                                      NaN
Title            Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial
Link                                                                                        /lancet/article/s0140673620310229
DOI_link                                                                        https://doi.org/10.1016/S0140-6736(20)31022-9
Authors                                Yeming Wang, Dingyu Zhang, Guanhua Du, Ronghui Du, Jianping Zhao, Yang Jin, and others
Citation                                                                                                           The Lancet
Abstract                                                                                                                  NaN
PublishedDate                                                                                                  April 2