**Step 1:** Get articles from [NEJM](https://www.nejm.org/coronavirus).
We can use the requests library to do this.

In [38]:
# import statements
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [27]:
# fetch web page
r = requests.get("https://www.nejm.org/coronavirus")

**Step 2:** Use BeautifulSoup to remove HTML tags.
Use "lxml" rather than "html5lib".
Outputting all the results may overload the space available to load this notebook, so we omit a print statement here.


In [28]:
soup = BeautifulSoup(r.text, "lxml")

**Step 3:** Find all course summaries
Use the BeautifulSoup's find_all method to select based on tag type and class name. On Chrome, you can right click on the item, and click "Inspect" to view its html on a web page.

In [29]:
# Find all articles
articles = soup.find_all("li", {"class":"m-article"})
print('Number of articles:', len(articles))

Number of articles: 51


In [30]:
# print the first summary in articles
print(articles[0].prettify())

<li class="m-article m-article--xl">
 <a class="m-article__img" href="/doi/full/10.1056/NEJMoa2008457?query=featured_coronavirus">
  <span class="m-article__img-container">
   <img alt="publication image" height="200" sizes="(min-width: 1480px) 560px,
                            (min-width: 1280px) calc(((100vw - 120px) / 12) * 5),
                            (min-width: 1024px) calc(((100vw - 90px) / 12) * 5),
                            (min-width: 768px) calc(((100vw - 70px) / 8) * 5),
                             (min-width: 480px) calc(100vw - 40px),
                            calc(100vw - 32px)" src="/pb-assets/images/editorial/large/NEJMoa2008457_600x400-1588017826020.jpg" srcset="/pb-assets/images/editorial/large/NEJMoa2008457_600x400-1588017826020.jpg" width="300"/>
  </span>
 </a>
 <a class="m-article__type f-caps" href="/medical-articles/original-article">
  Original Article
 </a>
 <a class="m-article__link" href="/doi/full/10.1056/NEJMoa2008457?query=featured_coronavirus">

In [31]:
# Extract article type
articles[0].select_one(".m-article__type").get_text().strip()

'Original Article'

In [44]:
# Extract article link
articles[0].select_one(".m-article__link")['href']

'/doi/full/10.1056/NEJMoa2008457?query=featured_coronavirus'

In [33]:
# Extract article title
articles[0].select_one(".m-article__title").get_text().strip()

'Presymptomatic SARS-CoV-2 in a Nursing Facility'

In [53]:
# Extract article authors
articles[0].select_one(".m-article__author").get_text().strip()

'M.M. Arons and Others'

In [21]:
# Extract citation
articles[0].select_one(".m-article__blurb").get_text().strip()

'The authors assessed transmission of SARS-CoV-2 and evaluated the adequacy of symptom-based screening in a skilled nursing facility. More than half of residents with positive test results were asymptomatic at the time of testing. Infection-control strategies focused solely on symptomatic residents were not sufficient to prevent transmission.'

In [52]:
# Extract published date
articles[0].select_one(".m-article__date").get_text().strip()

'Apr 24'

In [24]:
# Extract availability
articles[0].select_one(".m-article__icons").find('svg').attrs['class'][0]

'icon--free'

## Create dataset from All Articles

In [76]:
# Create data list
data = list()
for article in articles:
    data.append(
        [article.select_one(".m-article__type").get_text().strip(),
         article.select_one(".m-article__title").get_text().strip(),
         article.select_one(".m-article__link")['href'] if article.find_all("a", {"class": "m-article__link"}) else 'N/A',
         article.select_one(".m-article__author").get_text() if article.find_all("em", {"class": "m-article__author"}) else 'N/A',
         article.select_one(".m-article__blurb").get_text().strip(),
         article.select_one(".m-article__date").get_text() if article.find_all("em", {"class": "m-article__date"}) else 'N/A',
         article.select_one(".m-article__icons").find('svg').attrs['class'][0].split('--')[1] if article.find_all("em", {"class": "m-article__icons"}) else 'N/A'
    ])

In [77]:
# Create pandas dataframe
df = pd.DataFrame(data, columns = ['Type', 'Title', 'Link', 'Authors', 'Abstract', 'PublishedDate', 'Availability'])
df.head()

Unnamed: 0,Type,Title,Link,Authors,Abstract,PublishedDate,Availability
0,Original Article,Presymptomatic SARS-CoV-2 in a Nursing Facility,/doi/full/10.1056/NEJMoa2008457?query=featured...,M.M. Arons and Others,The authors assessed transmission of SARS-CoV-...,Apr 24,
1,Clinical Practice,Mild or Moderate Covid-19,/doi/full/10.1056/NEJMcp2009249?query=featured...,"R.T. Gandhi, J.B. Lynch, and C. del Rio",The diagnosis of Covid-19 is usually based on ...,Apr 24,
2,Correspondence,Transforming ORs into ICUs,/doi/full/10.1056/NEJMc2010853?query=featured_...,"A.W. Peters, K.S. Chawla, and Z.A. Turnbull","In the epicenter of New York City, a medical c...",Apr 24,
3,Correspondence,ST-Segment Elevation in Covid-19,/doi/full/10.1056/NEJMc2009020?query=featured_...,S. Bangalore and Others,Eighteen patients with Covid-19 presented with...,Apr 17,
4,Correspondence,Neurologic Features in SARS-CoV-2 Infection,/doi/full/10.1056/NEJMc2008597?query=featured_...,J. Helms and Others,In a consecutive series of 64 patients with Co...,Apr 15,


In [78]:
# Export CSV
df.to_csv('data/nejm_articles.csv')