# Web Scraping of News Tweets and Summary Pages from Altmetric

This notebook scrapes the latest Altmetric data for seven papers: news mentions, tweets and summary-page statistics. The steps are:

1) News: The raw data set "Altmetric_News.xlsx" contains only the news items that have valid links. To capture every news item regardless of link validity, the code first scrapes the headline, source and subtitle of each news item, without links, into a dataframe called "df1". It then merges "df1" with the raw data set so that each news item is matched to its link (items with invalid links get "NA") and exports the result as "News_all_links.xlsx". The code also scrapes the full article text from the top five media sources for further research.

2) Twitter: Get the account handles (stored in a column called medialink) and headlines of all tweets, and delete our own Twitter posts. Tweets have no subtitles, so the subtitle column is left empty. The handles and headlines go into a new dataframe "df", together with the columns 'altmetric', '#article', 'author', 'mediatype' and 'mediasource'.

3) Summary: Fetch the Altmetric score, citations, readers and demographic information from each paper's summary page, and add them to the tweet dataframe for the rows that belong to that paper. The code then exports this data set as "Altmetric_tweet+sum.xlsx".

After these steps, the code exports two new data sets, "Altmetric_tweet+sum.xlsx" and "News_all_links.xlsx", which are also stored in a Google Drive folder.
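
All three steps work on the same seven Altmetric detail pages and differ only in the final path segment (`/news`, `/twitter`, or none for the summary page). A minimal sketch of keeping those pages in one place is shown below. The names `ARTICLES` and `tab_url` are hypothetical and are not used elsewhere in this notebook; the URLs and author labels are the ones that appear in the cells that follow.

```python
# Hypothetical registry: article number -> (author label, Altmetric details URL).
ARTICLES = {
    2: ('Gangwisch et al.', 'https://oxfordjournals.altmetric.com/details/72683542'),
    3: ('Lee et al.', 'https://science.altmetric.com/details/60552876'),
    4: ('Kim et al.', 'https://jamanetwork.altmetric.com/details/64368646'),
    5: ('Mina et al.', 'https://science.altmetric.com/details/69584866'),
    6: ('Hviid et al.', 'https://annals.altmetric.com/details/56459321'),
    7: ('Maxwell et al.', 'https://scienceadvances.altmetric.com/details/69530897'),
    8: ('Berzaghi et al.', 'https://nature.altmetric.com/details/63584063'),
}

def tab_url(article_no, tab=None):
    """Return the details URL, optionally with a tab suffix such as 'news' or 'twitter'."""
    _, details_url = ARTICLES[article_no]
    return details_url if tab is None else f'{details_url}/{tab}'

# Print the news tab of every article
for no, (author, _) in ARTICLES.items():
    print(f'Article {no} ({author}): {tab_url(no, "news")}')
```

Deriving every URL from one table means an author label or a details ID only has to be typed once.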

In [None]:
# Import the required packages; the Google Drive mount below is optional and commented out
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
#from google.colab import drive
#drive.mount('/content/drive')
#%cd /content/drive/My Drive/Research/_Fiverr/Altmetric

## Step 1 Extract all news

Scrape headlines and subtitles of all news, and merge two data sets.
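
Each news tab may span several pages, addressed as `<base_url>/page:<n>`. The sketch below isolates the URL-building part so it can be checked without a network request. The function name `build_page_urls` and the sample link texts are assumptions made for illustration. It also swaps in a slightly different rule from the cell below: it takes the largest numeric link label instead of the second-to-last link.

```python
def build_page_urls(base_url, pagination_link_texts):
    """Build one URL per results page.

    pagination_link_texts holds the visible texts of the <a> elements in the
    pagination block, e.g. ['1', '2', '3', 'Next']. An empty list means the
    tab has a single page.
    """
    # Keep only the labels that are plain page numbers
    numbers = [int(t.strip()) for t in pagination_link_texts if t.strip().isdecimal()]
    if not numbers:
        return [base_url]
    # The largest label is the last page
    return [f'{base_url}/page:{i}' for i in range(1, max(numbers) + 1)]

base = 'https://oxfordjournals.altmetric.com/details/72683542/news'
print(build_page_urls(base, []))
print(build_page_urls(base, ['1', '2', '3', 'Next']))
```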

In [None]:
# Map article links to authors
base_url2 = 'https://oxfordjournals.altmetric.com/details/72683542/news' # Article 2 - Gangwisch
def get_pages(base_url):
    """Return the URLs of all result pages under base_url."""
    text = requests.get(base_url).text
    soup = BeautifulSoup(text, 'html.parser')
    pagination_div = soup.find('div', class_='pagination_page_links')
    if pagination_div is None:
        # No pagination block: the tab has a single page
        return [base_url]
    # The second-to-last pagination link holds the last page number
    a_elements = pagination_div.find_all('a')
    page_number = int(a_elements[-2].text)
    return [f"{base_url}/page:{i}" for i in range(1, page_number + 1)]
pages2 = get_pages(base_url2)

base_url3 = 'https://science.altmetric.com/details/60552876/news' # Article 3 - Lee
pages3 = get_pages(base_url3)

base_url4 = 'https://jamanetwork.altmetric.com/details/64368646/news' # Article 4 - Kim
pages4 = get_pages(base_url4)

base_url5 = 'https://science.altmetric.com/details/69584866/news' # Article 5 - Mina
pages5 = get_pages(base_url5)

base_url6 = 'https://annals.altmetric.com/details/56459321/news' # Article 6 - Hviid
pages6 = get_pages(base_url6)

base_url7 = 'https://scienceadvances.altmetric.com/details/69530897/news' # Article 7 - Maxwell
pages7 = get_pages(base_url7)

base_url8 = 'https://nature.altmetric.com/details/63584063/news' # Article 8 - Berzaghi
pages8 = get_pages(base_url8)

pages_all_articles = [pages2, pages3, pages4, pages5, pages6, pages7, pages8]
author_list = ['Gangwisch et al.', 'Lee et al.', 'Kim et al.', 'Mina et al.', 'Hviid et al.', 'Maxwell et al.', 'Berzaghi et al.']
mp = {}
for author, pages in zip(author_list, pages_all_articles):
    for page in pages:
        mp[page] = author

In [None]:
# Get required data for all news (with missing rows)
headlines = []
subtitles = []
sources = []
authors = []
pages_all = pages2 + pages3 + pages4 + pages5 + pages6 + pages7 + pages8

for page in pages_all:
    text = requests.get(page).text
    soup = BeautifulSoup(text, 'html.parser')
    articles = soup.find_all('article', class_ = 'post msm')

    for article in articles:
        # Get the headline
        title = article.find('h3').text
        headlines.append(title)

        # Get the subtitle
        subtitle = article.find('p', class_ = 'summary').text
        subtitles.append(subtitle)

        # Get the source (the text before the first comma)
        source = article.find('h4').text
        source_name = source.split(',')[0]
        sources.append(source_name)

        # Look up the paper's author label from the page URL
        author_name = mp.get(page, 'N/A')
        authors.append(author_name)

# Check that the three lists have the same length
print(len(headlines))
print(len(subtitles))
print(len(sources))

1447
1447
1447


In [None]:
# Put the above data into a dataset
data = {
    'mediaheadline': headlines,
    'mediasubtitle': subtitles,
    'mediasource': sources,
    'author': authors,
}
df1 = pd.DataFrame(data)

# Define a function that numbers repeated rows: the n-th row with a given
# (headline, subtitle, source) combination gets the value n
def generate_aux_column(df):
    aux_column = []
    occur_cnt = {} # map from (headline, subtitle, source) -> number of occurrences so far
    for (headline, subtitle, source) in zip(df['mediaheadline'], df['mediasubtitle'], df['mediasource']):
        occur_cnt.setdefault((headline, subtitle, source), 0)
        occur_cnt[(headline, subtitle, source)] += 1
        aux_column.append(occur_cnt[(headline, subtitle, source)])
    return aux_column
df1['aux'] = generate_aux_column(df1)
df1

Unnamed: 0,mediaheadline,mediasubtitle,mediasource,author,aux
0,Sleep problems can increase as you age. These ...,Consumer Reports has no financial relationship...,Washington Post,Gangwisch et al.,1
1,Breve historia del insomnio y de cómo nos obse...,"Shutterstock Philippa Martyr, The University o...",RED/ACCIÓN,Gangwisch et al.,1
2,A Brief History Of Insomnia And How We Become ...,"By Philippa Martyr, The University of Western ...",Nation World News,Gangwisch et al.,1
3,Breve historia del insomnio y de cómo nos obse...,"Publicidad Por Philippa Martyr, The University...",On Cuba News,Gangwisch et al.,1
4,News story from IOL on Friday 13 October 2023,,IOL,Gangwisch et al.,1
...,...,...,...,...,...
1442,Dwindling Forest Elephant Populations in the C...,Forest elephants in Odzala-Kokoua National Par...,Newswise,Berzaghi et al.,1
1443,Elephant extinction in Africa would ‘speed up ...,Wiping out all of Africa’s elephants could acc...,Yahoo! News,Berzaghi et al.,1
1444,News story from The Independent on Monday 15 J...,,The Independent,Berzaghi et al.,1
1445,Elephant extinction in Africa would ‘speed up ...,Wiping out all of Africa’s elephants could acc...,Yahoo! News,Berzaghi et al.,2
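
The `aux` column matters for the merge in the next cell: when the same headline, subtitle and source appear more than once in both data sets, joining on those columns alone pairs every copy on one side with every copy on the other. A minimal sketch with toy data follows, assuming there are no missing values in the key columns. It uses `groupby(...).cumcount() + 1`, a built-in pandas alternative that yields the same numbering as `generate_aux_column`; the frame names and example links are made up.

```python
import pandas as pd

# Toy data: headline 'A' from source 'X' occurs twice in both frames
scraped = pd.DataFrame({
    'mediaheadline': ['A', 'A', 'B'],
    'mediasource': ['X', 'X', 'Y'],
})
raw = pd.DataFrame({
    'mediaheadline': ['A', 'A'],
    'mediasource': ['X', 'X'],
    'medialink': ['http://example.org/1', 'http://example.org/2'],
})
keys = ['mediaheadline', 'mediasource']

# Joining on the keys alone pairs each copy of 'A' with both links
naive = pd.merge(scraped, raw, on=keys, how='left')

# cumcount() numbers rows within each group from 0, so add 1 to mirror 'aux'
scraped['aux'] = scraped.groupby(keys).cumcount() + 1
raw['aux'] = raw.groupby(keys).cumcount() + 1
merged = pd.merge(scraped, raw, on=keys + ['aux'], how='left')

print(len(naive), len(merged))
print(merged)
```

With the occurrence number in the join key, the first copy matches the first link and the second copy the second, and rows without a counterpart keep a missing link.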


In [None]:
# Merge two data sets
df2 = pd.read_excel('Altmetric_News.xlsx')
df2 = df2[['Medialink', 'Mediasource', 'Mediaheadline', 'Mediasubtitle']]
df2.rename(columns = {'Medialink': 'medialink',
                      'Mediasource': 'mediasource',
                      'Mediaheadline': 'mediaheadline',
                      'Mediasubtitle': 'mediasubtitle'
                     }, inplace=True)
df2['aux'] = generate_aux_column(df2)
df2

df = pd.merge(df1, df2, on = ['mediaheadline', 'mediasubtitle', 'mediasource', 'aux'], how = 'left')

# Add article numbers
df.loc[df['author'] == 'Gangwisch et al.', '#article'] = 'Article 2'
df.loc[df['author'] == 'Lee et al.', '#article'] = 'Article 3'
df.loc[df['author'] == 'Kim et al.', '#article'] = 'Article 4'
df.loc[df['author'] == 'Mina et al.', '#article'] = 'Article 5'
df.loc[df['author'] == 'Hviid et al.', '#article'] = 'Article 6'
df.loc[df['author'] == 'Maxwell et al.', '#article'] = 'Article 7'
df.loc[df['author'] == 'Berzaghi et al.', '#article'] = 'Article 8'
df[['#article', 'author', 'medialink', 'mediasource', 'mediaheadline', 'mediasubtitle']]

Unnamed: 0,#article,author,medialink,mediasource,mediaheadline,mediasubtitle
0,Article 2,Gangwisch et al.,http://ct.moreover.com/?a=52078324219&p=1pl&v=...,Washington Post,Sleep problems can increase as you age. These ...,Consumer Reports has no financial relationship...
1,Article 2,Gangwisch et al.,http://ct.moreover.com/?a=52075489541&p=1pl&v=...,RED/ACCIÓN,Breve historia del insomnio y de cómo nos obse...,"Shutterstock Philippa Martyr, The University o..."
2,Article 2,Gangwisch et al.,,Nation World News,A Brief History Of Insomnia And How We Become ...,"By Philippa Martyr, The University of Western ..."
3,Article 2,Gangwisch et al.,http://ct.moreover.com/?a=52061741759&p=1pl&v=...,On Cuba News,Breve historia del insomnio y de cómo nos obse...,"Publicidad Por Philippa Martyr, The University..."
4,Article 2,Gangwisch et al.,,IOL,News story from IOL on Friday 13 October 2023,
...,...,...,...,...,...,...
1442,Article 8,Berzaghi et al.,https://www.newswise.com/articles/dwindling-fo...,Newswise,Dwindling Forest Elephant Populations in the C...,Forest elephants in Odzala-Kokoua National Par...
1443,Article 8,Berzaghi et al.,https://uk.news.yahoo.com/elephant-extinction-...,Yahoo! News,Elephant extinction in Africa would ‘speed up ...,Wiping out all of Africa’s elephants could acc...
1444,Article 8,Berzaghi et al.,,The Independent,News story from The Independent on Monday 15 J...,
1445,Article 8,Berzaghi et al.,https://ca.news.yahoo.com/elephant-extinction-...,Yahoo! News,Elephant extinction in Africa would ‘speed up ...,Wiping out all of Africa’s elephants could acc...


In [None]:
df.to_excel('News_all_links.xlsx', index=False)

In [26]:
# Scrape article contents from Yahoo! News
df = pd.read_excel('News_all_links.xlsx')
df_yn = df.loc[df['mediasource'] == 'Yahoo! News']
links = df_yn['medialink']
pd.set_option('display.max_colwidth', None)
len(links)

44

In [25]:
for link in links:
    fetch = requests.get(link)
    if fetch.status_code != 200:
        print('This link is missing')
        print('\n----------------------------------------------\n')
        continue
    text = fetch.text
    soup = BeautifulSoup(text, 'html.parser')
    article = soup.find('div', class_ = 'caas-body')
    content = article.find_all('p')
    for paragraph in content:
        print(paragraph.text)
    print('\n----------------------------------------------\n')

By Lisa Rapaport
(Reuters Health) - Older women who eat lots of sweets and processed grains may be more likely to suffer from insomnia than their counterparts whose don't consume much of these foods, a U.S. study suggests.
Researchers examined data from food diaries for more than 50,000 women in their mid-60s who had already gone through menopause, a transition that is also associated with an increased risk of sleep problems and insomnia. They focused on the "dietary glycemic index," a measure of how many foods people consume that can contribute to spikes in blood sugar levels.
Women with the highest dietary glycemic index scores - meaning they consumed more refined carbohydrates like white bread, sweets and sugary soda - were 11% more likely than women with the lowest scores to report insomnia at the start of the study period.
They were also 16% more likely to develop new insomnia during the three-year follow-up period.
"Our results point to the importance of diet for those who suffer

In [31]:
# Scrape article contents from Physician's Briefing
import math
df_yn = df.loc[df['mediasource'] == "Physician's Briefing"]
links = df_yn['medialink']
print(len(df_yn))

for link in links:
    # Skip rows whose link is missing (NaN)
    if pd.notna(link):
        fetch = requests.get(link)
        if fetch.status_code != 200:
            print('This link is missing')
            print('\n----------------------------------------------\n')
            continue
        text = fetch.text
        soup = BeautifulSoup(text, 'html.parser')
        article = soup.find('div', class_ = 'body-description')
        content = article.find_all('p')
        for paragraph in content:
            print(paragraph.text)
        print('\n----------------------------------------------\n')

22
WEDNESDAY, Dec. 18, 2019 (HealthDay News) -- Diets with a higher glycemic index (GI) may be a risk factor for insomnia in postmenopausal women, according to a study published online Dec. 11 in the American Journal of Clinical Nutrition.
James E. Gangwisch, Ph.D., from Columbia University in New York City, and colleagues investigated the odds of insomnia among 77,860 postmenopausal women participating in the Women's Health Initiative Observational Study at baseline (1994 to 1998), and in 53,069 participants after three years of follow-up (1997 to 2001), based upon associations with GI, glycemic load, other carbohydrate measures (added sugars, starch, total carbohydrate), dietary fiber, and specific carbohydrate-containing foods.
The researchers found that in fully adjusted models, higher dietary GI was associated with increasing odds of prevalent (fifth compared with first quintile odds ratio [OR], 1.11) and incident (fifth compared with first quintile OR, 1.16) insomnia. Incident in
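
Rows without a link come back from the Excel file as NaN, which is a float rather than a string, so each scraping loop has to skip them before calling `requests.get`. A small sketch of such a check is shown below; the helper name `has_link` and the sample values are hypothetical.

```python
import math

def has_link(link):
    """Return True only for a non-empty string; NaN, None and blank strings are rejected."""
    if link is None:
        return False
    if isinstance(link, float) and math.isnan(link):
        return False
    return isinstance(link, str) and link.strip() != ''

# Sample values: one usable link, then a NaN, a None and a blank string
links_demo = ['https://example.org/a', float('nan'), None, '  ']
usable = [link for link in links_demo if has_link(link)]
print(usable)
```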

In [None]:
# Scrape article contents from The Conversation
df_yn = df.loc[df['mediasource'] == 'The Conversation']
links = df_yn['medialink']
print(len(df_yn))

from urllib.request import Request, urlopen
head = {
      'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Mobile Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8',
      'Connection': 'keep-alive',
    }

def fetch_link(link, head):
    req = Request(link, headers = head)
    return urlopen(req)

def fetch_content(link):
    content = fetch_link(link, head).read()
    content = str(content, encoding ='utf8')
    return content

for link in links:
    # Skip rows whose link is missing (NaN)
    if pd.notna(link):
        text = fetch_content(link)
        soup = BeautifulSoup(text, 'html.parser')
        # Try the layout with inline promos first, then fall back to the plain layout
        article = soup.find('div', class_ = "grid-ten large-grid-nine grid-last content-body content entry-content instapaper_body inline-promos")
        if article is None:
            article = soup.find('div', class_ = "grid-ten large-grid-nine grid-last content-body content entry-content instapaper_body")
        content = article.find_all('p')
        for paragraph in content:
            print(paragraph.text)
        print('\n----------------------------------------------\n')


17
La escritora francesa Marie Darrieussecq escribe en sus memorias de 2023 Sleepless:
El mundo se divide entre los que pueden dormir y los que no. 
El insomnio es una preocupación bien documentada a lo largo de la historia que incluye la dificultad tanto para conciliar el sueño como para permanecer dormido. Suele venir acompañado de angustia y ansiedad durante el día. 
Hay muchas y variadas razones por las que la gente padece insomnio. Entre ellas se incluyen cambios biológicos a medida que envejecemos o debidos a nuestras hormonas, problemas de salud física o mental, los medicamentos que tomamos, así como la forma y el lugar en que vivimos y trabajamos.
La privación del sueño es, literalmente, una forma de tortura. Y el cónsul romano Marco Atilio Régulo es supuestamente la primera persona de la historia que murió de insomnio. Alrededor del año 256 a. e. c. fue entregado a los enemigos de Roma, los cartagineses, que al parecer lo torturaron hasta la muerte. Para ello, le amputaron los

In [None]:
# Scrape article contents from The Medical News
df_yn = df.loc[df['mediasource'] == 'The Medical News']
links = df_yn['medialink']
print(len(df_yn))
links

for link in links:
    fetch = requests.get(link)
    if fetch.status_code != 200:
        print('This link is missing')
        print('\n----------------------------------------------\n')
        continue
    text = fetch.text
    soup = BeautifulSoup(text, 'html.parser')
    article = soup.find('div', class_ = 'content')
    content = article.find_all('p')
    for paragraph in content:
        print(paragraph.text)
    print('\n----------------------------------------------\n')

9
This link is missing

----------------------------------------------

A new study from the Beth Israel Deaconess Medical Center has found that broccoli, Brussels sprouts, kale, cauliflower, cabbage and collard greens contain a substance that could inactivate a vital gene that plays a role in cancers. Their study titled, “Reactivation of PTEN tumor suppressor for cancer treatment through inhibition of a MYC-WWP1 inhibitory pathway,” was published in the latest issue of the journal Science.
The authors led by Pier Paolo Pandolfi, Director of the Cancer Center and Cancer Research Institute at Beth Israel Deaconess Medical Center, write that there have been numerous studies that show the cancer-protective effects of broccoli and other members of the family including cruciferous vegetables. They contain a substance that targets a gene called the WWP1 and thus in lab animals they were shown to suppress tumour growth. Pandolfi in a statement said, “We found a new important player that drive

In [None]:
# Scrape article contents from Newsbreak
df_yn = df.loc[df['mediasource'] == 'Newsbreak']
links = df_yn['medialink']
print(len(df_yn))

for link in links:
    fetch = requests.get(link)
    if fetch.status_code != 200:
        print('This link is missing')
        print('\n----------------------------------------------\n')
        continue
    text = fetch.text
    soup = BeautifulSoup(text, 'html.parser')
    article = soup.find('div', class_ = 'content')
    content = article.find_all('p')
    for paragraph in content:
        print(paragraph.text)
    print('\n----------------------------------------------\n')


9

 AsiaVision / Getty Images 
 Chances are, you've made a few midnight raids on the fridge — most people have. But is there any harm in making your nighttime snack a regular habit? Even experts haven't found a clear answer to that question. 
 "The effects of nighttime eating can differ based on personal characteristics and on the type and amount of food being consumed," says  Dr. Sarah Musleh  , endocrinologist at  Anzara Health  — so, it's hard to provide a simple answer, like "It's good" or "It's bad." 
 According to Musleh, some research connects nighttime eating to potentially negative health effects, such as: 
 But if you can't sleep on an empty stomach, here's some good news: Some evidence  suggests what you eat matters more than the timing  . 
 Below, you'll find four important things to keep in mind about late-night eating, along with tips on snacking wisely throughout the day to reduce nighttime hunger. 
 Some foods and beverages may have more of an  impact on your sleep and 

## Step 2 Extract tweets

In [None]:
# Map article links to authors
base_url2 = 'https://oxfordjournals.altmetric.com/details/72683542/twitter' # Article 2 - Gangwisch
def get_pages(base_url):
    """Return the URLs of all result pages under base_url."""
    text = requests.get(base_url).text
    soup = BeautifulSoup(text, 'html.parser')
    pagination_div = soup.find('div', class_='pagination_page_links')
    if pagination_div is None:
        # No pagination block: the tab has a single page
        return [base_url]
    # The second-to-last pagination link holds the last page number
    a_elements = pagination_div.find_all('a')
    page_number = int(a_elements[-2].text)
    return [f"{base_url}/page:{i}" for i in range(1, page_number + 1)]
pages2 = get_pages(base_url2)

base_url3 = 'https://science.altmetric.com/details/60552876/twitter' # Article 3 - Lee
pages3 = get_pages(base_url3)

base_url4 = 'https://jamanetwork.altmetric.com/details/64368646/twitter' # Article 4 - Kim
pages4 = get_pages(base_url4)

base_url5 = 'https://science.altmetric.com/details/69584866/twitter' # Article 5 - Mina
pages5 = get_pages(base_url5)

base_url6 = 'https://annals.altmetric.com/details/56459321/twitter' # Article 6 - Hviid
pages6 = get_pages(base_url6)

base_url7 = 'https://scienceadvances.altmetric.com/details/69530897/twitter' # Article 7 - Maxwell
pages7 = get_pages(base_url7)

base_url8 = 'https://nature.altmetric.com/details/63584063/twitter' # Article 8 - Berzaghi
pages8 = get_pages(base_url8)

pages_all_articles = [pages2, pages3, pages4, pages5, pages6, pages7, pages8]
author_list = ['Gangwisch et al.', 'Lee et al.', 'Kim et al.', 'Mina et al.', 'Hviid et al.', 'Maxwell et al.', 'Berzaghi et al.']
mp1 = {}
for author, pages in zip(author_list, pages_all_articles):
    for page in pages:
        mp1[page] = author

In [None]:
# Get the account handles (medialink) of all tweets, and store them in lists
handles = []
headlines = []
authors = []
pages_all = pages2 + pages3 + pages4 + pages5 + pages6 + pages7 + pages8

for page in pages_all:
    text = requests.get(page).text
    soup = BeautifulSoup(text, 'html.parser')
    articles = soup.find_all('article', class_ = 'post twitter')

    for article in articles:
        # Get the account handle of the tweet
        author_handle = article.find('div', class_ = 'handle').text
        handles.append(author_handle)

        # Get the headline (text) of the tweet
        headline = article.find('p', class_ = 'summary').text
        headlines.append(headline)

        # Look up the paper's author label from the page URL
        author_name = mp1.get(page, 'N/A')
        authors.append(author_name)

In [None]:
# Check the numbers of headlines and handles
print(f'There are {len(handles)} tweets.')
print(f'There are {len(headlines)} headlines.')

There are 14546 tweets.
There are 14546 headlines.


In [None]:
# Put the above info into a new data frame "df", and add the other columns and values.
data = {
    'medialink': handles,
    'mediaheadline': headlines,
    'author': authors
}
df = pd.DataFrame(data)
df

Unnamed: 0,medialink,mediaheadline,author
0,@tolucky39,参考文献\nHigh glycemic index and glycemic load di...,Gangwisch et al.
1,@researchfindin,New study finds shocking link between High GI ...,Gangwisch et al.
2,@researchfindstd,Your Diet Could be the Culprit Behind Insomnia...,Gangwisch et al.
3,@research_finds_,New Study Finds Probable Link Between Diet and...,Gangwisch et al.
4,@researchfindin,Is Your Diet Triggering Insomnia? https://t.co...,Gangwisch et al.
...,...,...,...
14541,@Zacarias15D,"Elephants, megagarderners of African rainfores...",Berzaghi et al.
14542,@geomatlab,#Climate NPG: Carbon stocks in central African...,Berzaghi et al.
14543,@savieira08,Acaba de sair nosso artigo sobre o papel dos e...,Berzaghi et al.
14544,@tlyadeen,RT @NatureGeosci: NGeo: Decline of elephant po...,Berzaghi et al.


In [None]:
# Add a column "mediatype"
df['mediatype'] = 'Tweet'

# Add a column "mediasource"
df['mediasource'] = 'Twitter'

# Add an empty column "mediasubtitile" (tweets have no subtitles)
df['mediasubtitile'] = ''

# Add a column "#article"
df.loc[df['author'] == 'Gangwisch et al.', '#article'] = 'Article 2'
df.loc[df['author'] == 'Lee et al.', '#article'] = 'Article 3'
df.loc[df['author'] == 'Kim et al.', '#article'] = 'Article 4'
df.loc[df['author'] == 'Mina et al.', '#article'] = 'Article 5'
df.loc[df['author'] == 'Hviid et al.', '#article'] = 'Article 6'
df.loc[df['author'] == 'Maxwell et al.', '#article'] = 'Article 7'
df.loc[df['author'] == 'Berzaghi et al.', '#article'] = 'Article 8'

# Add a column that holds the "altmetric" link
df.loc[df['author'] == 'Gangwisch et al.', 'altmetric'] = 'https://oxfordjournals.altmetric.com/details/72683542'
df.loc[df['author'] == 'Lee et al.', 'altmetric'] = 'https://science.altmetric.com/details/60552876'
df.loc[df['author'] == 'Kim et al.', 'altmetric'] = 'https://jamanetwork.altmetric.com/details/64368646'
df.loc[df['author'] == 'Mina et al.', 'altmetric'] = 'https://science.altmetric.com/details/69584866'
df.loc[df['author'] == 'Hviid et al.', 'altmetric'] = 'https://annals.altmetric.com/details/56459321'
df.loc[df['author'] == 'Maxwell et al.', 'altmetric'] = 'https://scienceadvances.altmetric.com/details/69530897'
df.loc[df['author'] == 'Berzaghi et al.', 'altmetric'] = 'https://nature.altmetric.com/details/63584063'

df = df[['altmetric', '#article', 'author', 'mediatype', 'medialink', 'mediasource', 'mediaheadline', 'mediasubtitile']]
df
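
Assigning per-author values with one `df.loc[...]` line per author works, but a label that does not match the `author` column exactly (for example, one with a missing trailing period) assigns nothing and raises no error. A minimal sketch of a dictionary-based alternative that makes such mismatches visible is shown below. `ARTICLE_BY_AUTHOR` and `df_demo` are hypothetical names, and the last demo row is deliberately misspelled.

```python
import pandas as pd

# Hypothetical lookup table: author label -> article number
ARTICLE_BY_AUTHOR = {
    'Gangwisch et al.': 'Article 2',
    'Lee et al.': 'Article 3',
    'Kim et al.': 'Article 4',
    'Mina et al.': 'Article 5',
    'Hviid et al.': 'Article 6',
    'Maxwell et al.': 'Article 7',
    'Berzaghi et al.': 'Article 8',
}

# Demo frame; the last label lacks the trailing period on purpose
df_demo = pd.DataFrame({'author': ['Lee et al.', 'Mina et al.', 'Mina et al']})

# Series.map leaves NaN wherever the label is not a key of the dictionary
df_demo['#article'] = df_demo['author'].map(ARTICLE_BY_AUTHOR)

# List the labels that received no article number
unmatched = df_demo.loc[df_demo['#article'].isna(), 'author'].unique().tolist()
print(df_demo)
print('Unmatched author labels:', unmatched)
```

Checking `unmatched` after every mapping gives a quick signal that a whole article was skipped.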

In [None]:
# Delete our own Twitter posts (handles containing 'find' or 'research')
condition = df['medialink'].str.lower().str.contains('find|research')
df1 = df[~condition]
df1 = df1.reset_index(drop = True)
df1

Unnamed: 0,altmetric,#article,author,mediatype,medialink,mediasource,mediaheadline,mediasubtitile
0,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@tolucky39,Twitter,参考文献\nHigh glycemic index and glycemic load di...,
1,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@JBianco215,Twitter,https://t.co/Gqp1PWsbwm\n\nHigh glycemic index...,
2,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@diet_Tr_beau5kg,Twitter,昨日の勉強した論文と記事\n\nhttps://t.co/Ura3ne4cz3\nhttps...,
3,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@movingintosleep,Twitter,"Tuesday #sleepscience ""Rapid spikes in blood s...",
4,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@DrMichaelMol,Twitter,"The saying ""eat for sleep"" has more to do with...",
...,...,...,...,...,...,...,...,...
13796,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@Zacarias15D,Twitter,"Elephants, megagarderners of African rainfores...",
13797,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@geomatlab,Twitter,#Climate NPG: Carbon stocks in central African...,
13798,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@savieira08,Twitter,Acaba de sair nosso artigo sobre o papel dos e...,
13799,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@tlyadeen,Twitter,RT @NatureGeosci: NGeo: Decline of elephant po...,
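
The filter above removes every handle that contains "find" or "research" anywhere in its name, which could also drop unrelated accounts. The sketch below compares that substring rule with an explicit list of handles. The set `own_handles` is an assumption built from the handles visible in the earlier output, and `@FindYourSleep` is an invented handle used only to show the difference.

```python
import pandas as pd

tweets = pd.DataFrame({
    'medialink': ['@researchfindin', '@research_finds_', '@tolucky39', '@FindYourSleep'],
})

# Substring rule: True for any handle containing 'find' or 'research'
substring_mask = tweets['medialink'].str.lower().str.contains('find|research')

# Assumed list of our own accounts, compared in lower case
own_handles = {'@researchfindin', '@researchfindstd', '@research_finds_'}
exact_mask = tweets['medialink'].str.lower().isin(own_handles)

print(tweets[~substring_mask])
print(tweets[~exact_mask])
```

The exact list keeps the unrelated handle, at the cost of having to maintain the list by hand.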


## Step 3 Add the summary information
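
The cell below reads the Altmetric score from the `style` attribute of the badge element by cutting the text between `score=` and the next `&`. The sketch below shows a regular-expression variant that does not depend on a trailing `&`. The function name `score_from_style` and both style strings are assumptions; the real attribute may be laid out differently.

```python
import re

def score_from_style(style):
    """Return the integer that follows 'score=' in a style string, or None if absent."""
    match = re.search(r'score=(\d+)', style)
    return int(match.group(1)) if match else None

# Made-up style values for illustration
with_more_params = 'background-image:url(https://example.org/badge?score=688&types=mmtt)'
score_is_last = 'background-image:url(https://example.org/badge?score=688)'

print(score_from_style(with_more_params))
print(score_from_style(score_is_last))
print(score_from_style('background-image:none'))
```

Returning `None` when no score is found lets the caller decide how to record a missing value.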

In [None]:
# Get the summary information of all 7 papers from Altmetric
links = [
    'https://oxfordjournals.altmetric.com/details/72683542', # Article 2 - Gangwisch
    'https://science.altmetric.com/details/60552876', # Article 3 - Lee
    'https://jamanetwork.altmetric.com/details/64368646', # Article 4 - Kim
    'https://science.altmetric.com/details/69584866', # Article 5 - Mina
    'https://annals.altmetric.com/details/56459321', # Article 6 - Hviid
    'https://scienceadvances.altmetric.com/details/69530897', # Article 7 - Maxwell
    'https://nature.altmetric.com/details/63584063', # Article 8 - Berzaghi
]
scores = []
citations = []
readers = []
count = []
percent = []

for index, link in enumerate(links, start=2):
    text = requests.get(link).text
    soup = BeautifulSoup(text, 'html.parser')

    # Get the Altmetric score
    score = soup.find('div', class_ = 'altmetric-badge')
    s = score['style']
    s_start = s[s.find('score=')+6:]
    s_end = s_start[:s_start.find('&')]
    scores.append(s_end)
    print(f'Score for Article {index} is: {s_end}')

    # Get citations
    citation = soup.find('div', class_ = 'scholarly-citation-counts-wrapper')
    c = citation.find('strong').text
    citations.append(c)
    print(f'Citation for Article {index} is: {c}')

    # Get readers
    reader = soup.find('div', class_ = 'reader-counts-wrapper')
    r = reader.find('strong').text
    readers.append(r)
    print(f'Readers for Article {index} is: {r}')

    # Get the demographic information
    table = soup.find('div', class_ = 'table-wrapper users')
    for td in list(list(table.children)[3].children)[3::2]:
        caption = td.find('td')
        nums = [i.text for i in td.find_all('td', class_ = 'num')]
        count.append(nums[0])
        percent.append(nums[1])
        print(caption.text)
        print(nums)
    print('\n----------------------------------------------\n')
print(count, percent)

Score for Article 2 is: 688
Citation for Article 2 is: 52
Readers for Article 2 is: 155
Members of the public
['86', '75%']
Practitioners (doctors, other healthcare professionals)
['17', '15%']
Scientists
['12', '10%']

----------------------------------------------

Score for Article 3 is: 504
Citation for Article 3 is: 182
Readers for Article 3 is: 246
Members of the public
['240', '71%']
Scientists
['63', '19%']
Practitioners (doctors, other healthcare professionals)
['28', '8%']
Science communicators (journalists, bloggers, editors)
['9', '3%']

----------------------------------------------

Score for Article 4 is: 507
Citation for Article 4 is: 31
Readers for Article 4 is: 59
Members of the public
['74', '76%']
Practitioners (doctors, other healthcare professionals)
['12', '12%']
Scientists
['9', '9%']
Science communicators (journalists, bloggers, editors)
['2', '2%']

----------------------------------------------

Score for Article 5 is: 3654
Citation for Article 5 is: 271
Read
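
The demographic rows appear in a different order for each article, and some groups are missing for some articles, so positions in the flat `count` and `percent` lists have to be worked out by hand. A minimal sketch of storing the same values keyed by article and caption is shown below. The structure `demographics` and the helper `lookup` are hypothetical; the numbers are the ones printed above for Articles 2 and 3.

```python
# Hypothetical structure: article -> caption -> (count, percent)
demographics = {
    'Article 2': {
        'Members of the public': ('86', '75%'),
        'Practitioners (doctors, other healthcare professionals)': ('17', '15%'),
        'Scientists': ('12', '10%'),
    },
    'Article 3': {
        'Members of the public': ('240', '71%'),
        'Scientists': ('63', '19%'),
        'Practitioners (doctors, other healthcare professionals)': ('28', '8%'),
        'Science communicators (journalists, bloggers, editors)': ('9', '3%'),
    },
}

def lookup(article, caption, default=('0', '0%')):
    """Return (count, percent) for a group, or the default when the group is absent."""
    return demographics[article].get(caption, default)

print(lookup('Article 2', 'Scientists'))
print(lookup('Article 2', 'Science communicators (journalists, bloggers, editors)'))
```

Looking values up by caption stays correct even if Altmetric reorders the rows or adds a group.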

In [None]:
# Add a column "score"
df1.loc[df1['author'] == 'Gangwisch et al.', 'score'] = scores[0]
df1.loc[df1['author'] == 'Lee et al.', 'score'] = scores[1]
df1.loc[df1['author'] == 'Kim et al.', 'score'] = scores[2]
df1.loc[df1['author'] == 'Mina et al.', 'score'] = scores[3]
df1.loc[df1['author'] == 'Hviid et al.', 'score'] = scores[4]
df1.loc[df1['author'] == 'Maxwell et al.', 'score'] = scores[5]
df1.loc[df1['author'] == 'Berzaghi et al.', 'score'] = scores[6]

# Add a column "citations"
df1.loc[df1['author'] == 'Gangwisch et al.', 'citations'] = citations[0]
df1.loc[df1['author'] == 'Lee et al.', 'citations'] = citations[1]
df1.loc[df1['author'] == 'Kim et al.', 'citations'] = citations[2]
df1.loc[df1['author'] == 'Mina et al.', 'citations'] = citations[3]
df1.loc[df1['author'] == 'Hviid et al.', 'citations'] = citations[4]
df1.loc[df1['author'] == 'Maxwell et al.', 'citations'] = citations[5]
df1.loc[df1['author'] == 'Berzaghi et al.', 'citations'] = citations[6]

# Add a column "readers"
df1.loc[df1['author'] == 'Gangwisch et al.', 'readers'] = readers[0]
df1.loc[df1['author'] == 'Lee et al.', 'readers'] = readers[1]
df1.loc[df1['author'] == 'Kim et al.', 'readers'] = readers[2]
df1.loc[df1['author'] == 'Mina et al.', 'readers'] = readers[3]
df1.loc[df1['author'] == 'Hviid et al.', 'readers'] = readers[4]
df1.loc[df1['author'] == 'Maxwell et al.', 'readers'] = readers[5]
df1.loc[df1['author'] == 'Berzaghi et al.', 'readers'] = readers[6]

# Add a column "demo_public_count"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_public_count'] = count[0]
df1.loc[df1['author'] == 'Lee et al.', 'demo_public_count'] = count[3]
df1.loc[df1['author'] == 'Kim et al.', 'demo_public_count'] = count[7]
df1.loc[df1['author'] == 'Mina et al.', 'demo_public_count'] = count[11]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_public_count'] = count[15]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_public_count'] = count[20]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_public_count'] = count[24]

# Add a column "demo_public_perc"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_public_perc'] = percent[0]
df1.loc[df1['author'] == 'Lee et al.', 'demo_public_perc'] = percent[3]
df1.loc[df1['author'] == 'Kim et al.', 'demo_public_perc'] = percent[7]
df1.loc[df1['author'] == 'Mina et al.', 'demo_public_perc'] = percent[11]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_public_perc'] = percent[15]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_public_perc'] = percent[20]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_public_perc'] = percent[24]

# Add a column "demo_scientist_count"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_scientist_count'] = count[2]
df1.loc[df1['author'] == 'Lee et al.', 'demo_scientist_count'] = count[4]
df1.loc[df1['author'] == 'Kim et al.', 'demo_scientist_count'] = count[9]
df1.loc[df1['author'] == 'Mina et al.', 'demo_scientist_count'] = count[12]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_scientist_count'] = count[16]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_scientist_count'] = count[21]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_scientist_count'] = count[25]

# Add a column "demo_scientist_perc"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_scientist_perc'] = percent[2]
df1.loc[df1['author'] == 'Lee et al.', 'demo_scientist_perc'] = percent[4]
df1.loc[df1['author'] == 'Kim et al.', 'demo_scientist_perc'] = percent[9]
df1.loc[df1['author'] == 'Mina et al', 'demo_scientist_perc'] = percent[12]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_scientist_perc'] = percent[16]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_scientist_perc'] = percent[21]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_scientist_perc'] = percent[25]

# Add a column "demo_practitioners_count"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_practitioners_count'] = count[1]
df1.loc[df1['author'] == 'Lee et al.', 'demo_practitioners_count'] = count[5]
df1.loc[df1['author'] == 'Kim et al.', 'demo_practitioners_count'] = count[8]
df1.loc[df1['author'] == 'Mina et al', 'demo_practitioners_count'] = count[13]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_practitioners_count'] = count[17]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_practitioners_count'] = count[23]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_practitioners_count'] = 0

# Add a column "demo_practitioners_perc"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_practitioners_perc'] = percent[1]
df1.loc[df1['author'] == 'Lee et al.', 'demo_practitioners_perc'] = percent[5]
df1.loc[df1['author'] == 'Kim et al.', 'demo_practitioners_perc'] = percent[8]
df1.loc[df1['author'] == 'Mina et al', 'demo_practitioners_perc'] = percent[13]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_practitioners_perc'] = percent[17]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_practitioners_perc'] = percent[23]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_practitioners_perc'] = 0

# Add a column "demo_science_communicators_count"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_science_communicators_count'] = 0
df1.loc[df1['author'] == 'Lee et al.', 'demo_science_communicators_count'] = count[6]
df1.loc[df1['author'] == 'Kim et al.', 'demo_science_communicators_count'] = count[10]
df1.loc[df1['author'] == 'Mina et al', 'demo_science_communicators_count'] = count[14]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_science_communicators_count'] = count[18]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_science_communicators_count'] = count[22]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_science_communicators_count'] = count[26]

# Add a column "demo_science_communicators_perc"
df1.loc[df1['author'] == 'Gangwisch et al.', 'demo_science_communicators_perc'] = 0
df1.loc[df1['author'] == 'Lee et al.', 'demo_science_communicators_perc'] = percent[6]
df1.loc[df1['author'] == 'Kim et al.', 'demo_science_communicators_perc'] = percent[10]
df1.loc[df1['author'] == 'Mina et al', 'demo_science_communicators_perc'] = percent[14]
df1.loc[df1['author'] == 'Hviid et al.', 'demo_science_communicators_perc'] = percent[18]
df1.loc[df1['author'] == 'Maxwell et al.', 'demo_science_communicators_perc'] = percent[22]
df1.loc[df1['author'] == 'Berzaghi et al.', 'demo_science_communicators_perc'] = percent[26]
df1

Unnamed: 0,altmetric,#article,author,mediatype,medialink,mediasource,mediaheadline,mediasubtitile,score,citations,readers,demo_public_count,demo_public_perc,demo_scientist_count,demo_scientist_perc,demo_practitioners_count,demo_practitioners_perc,demo_science_communicators_count,demo_science_communicators_perc
0,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@tolucky39,Twitter,参考文献\nHigh glycemic index and glycemic load di...,,688,52,155,86,75%,12,10%,17,15%,0.0,0.0
1,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@JBianco215,Twitter,https://t.co/Gqp1PWsbwm\n\nHigh glycemic index...,,688,52,155,86,75%,12,10%,17,15%,0.0,0.0
2,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@diet_Tr_beau5kg,Twitter,昨日の勉強した論文と記事\n\nhttps://t.co/Ura3ne4cz3\nhttps...,,688,52,155,86,75%,12,10%,17,15%,0.0,0.0
3,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@movingintosleep,Twitter,"Tuesday #sleepscience ""Rapid spikes in blood s...",,688,52,155,86,75%,12,10%,17,15%,0.0,0.0
4,https://oxfordjournals.altmetric.com/details/7...,Article 2,Gangwisch et al.,Tweet,@DrMichaelMol,Twitter,"The saying ""eat for sleep"" has more to do with...",,688,52,155,86,75%,12,10%,17,15%,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13796,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@Zacarias15D,Twitter,"Elephants, megagarderners of African rainfores...",,1342,61,218,136,88%,13,8%,0,0,6,4%
13797,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@geomatlab,Twitter,#Climate NPG: Carbon stocks in central African...,,1342,61,218,136,88%,13,8%,0,0,6,4%
13798,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@savieira08,Twitter,Acaba de sair nosso artigo sobre o papel dos e...,,1342,61,218,136,88%,13,8%,0,0,6,4%
13799,https://nature.altmetric.com/details/63584063,Article 8,Berzaghi et al.,Tweet,@tlyadeen,Twitter,RT @NatureGeosci: NGeo: Decline of elephant po...,,1342,61,218,136,88%,13,8%,0,0,6,4%
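
The per-author assignments above repeat the same pattern for every column, so a single lookup table can drive them instead. The following is only a sketch of that idea, not the notebook's own method: `DEMO_INDEX`, `GROUPS`, `add_demographics` and the `demo_*` variables are hypothetical names, the index positions are copied from the assignments above, and the sample lists are placeholders rather than real Altmetric values.

```python
import pandas as pd

# Position of each demographic group in the scraped `count` / `percent` lists,
# copied from the assignments above. None marks a group that is filled with 0.
DEMO_INDEX = {
    'Gangwisch et al.': {'public': 0, 'scientist': 2, 'practitioners': 1, 'science_communicators': None},
    'Lee et al.': {'public': 3, 'scientist': 4, 'practitioners': 5, 'science_communicators': 6},
    'Kim et al.': {'public': 7, 'scientist': 9, 'practitioners': 8, 'science_communicators': 10},
    'Mina et al': {'public': 11, 'scientist': 12, 'practitioners': 13, 'science_communicators': 14},
    'Hviid et al.': {'public': 15, 'scientist': 16, 'practitioners': 17, 'science_communicators': 18},
    'Maxwell et al.': {'public': 20, 'scientist': 21, 'practitioners': 23, 'science_communicators': 22},
    'Berzaghi et al.': {'public': 24, 'scientist': 25, 'practitioners': None, 'science_communicators': 26},
}
GROUPS = ['public', 'scientist', 'practitioners', 'science_communicators']


def add_demographics(df, count, percent, index_map=DEMO_INDEX):
    """Add demo_<group>_count and demo_<group>_perc columns keyed on df['author']."""
    for group in GROUPS:
        for suffix, values in (('count', count), ('perc', percent)):
            # One value per author for this column; 0 where the group is absent
            lookup = {
                author: 0 if positions[group] is None else values[positions[group]]
                for author, positions in index_map.items()
            }
            df[f'demo_{group}_{suffix}'] = df['author'].map(lookup)
    return df


# Placeholder data (not real Altmetric values), only to show the call
demo_count = [str(i) for i in range(27)]
demo_percent = [f'{i}%' for i in range(27)]
demo_df = pd.DataFrame({'author': ['Gangwisch et al.', 'Berzaghi et al.', 'Kim et al.']})
demo_df = add_demographics(demo_df, demo_count, demo_percent)
```

In the notebook this would be called as `add_demographics(df1, count, percent)`. Keeping the positions in one dictionary makes it easier to spot a wrong index when the Altmetric page layout changes; authors missing from the dictionary are left as NaN.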


In [None]:
# Export a new sheet with all tweet and summary-page information
# (df1 is the dataframe that received the score and demographic columns above)
df1.to_excel('Altmetric_tweet+sum.xlsx', index=False)
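
The percentage columns in the table above mix strings such as `75%` with the numeric placeholder `0`. If numeric columns are preferred in the exported sheet, a conversion along the following lines could be applied before the export. This is a minimal sketch under that assumption; `percent_to_float`, `sample` and `cleaned` are hypothetical names, and the sample values are illustrative only.

```python
import pandas as pd


def percent_to_float(value):
    """Convert '75%' to 75.0; numeric placeholders such as 0 become 0.0."""
    if isinstance(value, str):
        return float(value.strip().rstrip('%'))
    return float(value)


# Illustrative values mirroring the mixed types seen in the *_perc columns
sample = pd.Series(['75%', '10%', 0, 0.0, '4%'])
cleaned = sample.map(percent_to_float)
```

Applied to the notebook, this would look like `df1['demo_public_perc'] = df1['demo_public_perc'].map(percent_to_float)` for each `*_perc` column.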