## Checking versions
The cell below only checks package versions; if it does not run, just skip it.

In [1]:
%load_ext version_information
import time
now = time.strftime("%Y-%m-%d %H:%M:%S (%Z = GMT%z)")
print(f"This notebook was generated at {now} ")

vv = %version_information requests, tqdm, pandas, astroquery, version_information
for i, pkg in enumerate(vv.packages):
    print(f"{i} {pkg[0]:10s} {pkg[1]:s}")

This notebook was generated at 2020-04-15 17:43:54 (KST = GMT+0900) 
0 Python     3.7.6 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
1 IPython    7.13.0
2 OS         Darwin 18.7.0 x86_64 i386 64bit
3 requests   2.23.0
4 tqdm       4.43.0
5 pandas     1.0.3
6 astroquery 0.4.1.dev5959
7 version_information 1.0.3


## Importing and Setting Up

In [2]:
import math
import requests
import time
from itertools import product
from pathlib import Path
import pandas as pd
from astroquery import nasa_ads as na
from tqdm import tqdm

# Adapted from https://stackoverflow.com/questions/37573483/progress-bar-while-download-file-over-http-with-requests
def download_pdf(response, fpath):
    """Stream ``response`` to ``fpath`` in 1 kB blocks, showing a progress bar."""
    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024
    wrote = 0
    with open(fpath, 'wb') as f:
        # Round up so that a final partial block is counted in the total.
        for data in tqdm(response.iter_content(block_size), total=math.ceil(total_size / block_size), unit='kB', unit_scale=True):
            wrote = wrote + len(data)
            f.write(data)
#     if total_size != 0 and wrote != total_size:
#         print("ERROR, something went wrong")

def altnames(fullname):
    """Return name variants for a name given as "Last, First Middle".

    Each first/middle name is written in full, as an initial, or omitted.
    """
    names = [fullname]
    lastname = fullname.split(', ')[0]
    firstmiddle_names = fullname.split(', ')[-1].split(' ')
    N = len(firstmiddle_names)
    pieces = {'0': firstmiddle_names, '1': []}  # 0/1 = full/initial

    for n in firstmiddle_names:
        pieces['1'].append('{}.'.format(n[0].upper()))

    # Case '2' means the name piece is omitted altogether.
    for ind in product('012', repeat=N):
        altname = ''
        for i, case in enumerate(ind):
            if case != '2':
                altname += "{} ".format(pieces[case][i])
        if altname == '':
            continue
        names.append("{}, {}".format(lastname, altname[:-1]))

    return list(set(names))

## Team Member Setting

**Define team members.**

Name variants that use initials are stored automatically in ``altnames``. If a person publishes under several names, add each name as a separate person. ``altnames`` holds the alternative combinations of full names and initials for the first/middle names.

* NOTE: The middle name can be given either in full or as an initial, but it is better to match the form the author actually uses. If the author always uses the full middle name in publications, give the full name here.
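
To make the variants concrete, here is a minimal sketch. ``initial_variants`` is a hypothetical stand-in that mirrors the logic of ``altnames`` so that the snippet runs on its own, and the result is sorted only to make it easy to read. It assumes the name is written as ``Last, First Middle``; a name written without the comma, such as ``Yoonsoo P. Bach``, is not split as intended.

```python
from itertools import product


def initial_variants(fullname):
    """Hypothetical stand-in mirroring the notebook's ``altnames`` logic."""
    lastname, given = fullname.split(", ", 1)
    parts = given.split(" ")
    # "0" = written in full, "1" = initial, "2" = omitted
    forms = {"0": parts, "1": [f"{p[0].upper()}." for p in parts]}
    variants = {fullname}
    for choice in product("012", repeat=len(parts)):
        kept = [forms[c][i] for i, c in enumerate(choice) if c != "2"]
        if kept:
            variants.add(f"{lastname}, {' '.join(kept)}")
    return sorted(variants)


variants = initial_variants("Bach, Yoonsoo P.")
print(variants)
```

Note that the variant with only the middle initial (``Bach, P.``) is also generated, because every name piece may be omitted independently.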

In [3]:
team = dict(
    names=["Ishiguro, Masateru", "Yoonsoo P. Bach"],
    kornames=["이시구로 마사테루", "박윤수"],
    researcher_number=[1111,1212],
    altnames=[]
)

for name in team["names"]:
    team["altnames"].append(altnames(name))


team_df = pd.DataFrame.from_dict(team)
team_df

| | names | kornames | researcher_number | altnames |
|---|---|---|---|---|
| 0 | Ishiguro, Masateru | 이시구로 마사테루 | 1111 | [Ishiguro, Masateru, Ishiguro, M.] |
| 1 | Yoonsoo P. Bach | 박윤수 | 1212 | [Yoonsoo P. Bach, P., Yoonsoo P. Bach, Yoonsoo... |


``team`` should normally contain **students only**. For testing, I put Prof. Ishiguro and myself in it, and produced the results below as if Prof. Myungshin Im were the advisor.

* **NOTE**: You can keep several such team lists as Excel/CSV/text files and load them with ``pd.read_csv`` or similar.
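
As a rough sketch of that idea, the snippet below reads a team list from CSV text. The file contents and the column layout are assumptions chosen to match the ``team`` dictionary above; ``io.StringIO`` stands in for a real file such as a hypothetical ``team.csv``. Names contain a comma, so they must be quoted in the CSV.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice, pass a file path to pd.read_csv.
csv_text = '''names,kornames,researcher_number
"Ishiguro, Masateru",이시구로 마사테루,1111
"Bach, Yoonsoo P.",박윤수,1212
'''

team_from_csv = pd.read_csv(io.StringIO(csv_text))
print(team_from_csv)
```

The ``altnames`` column can then be filled from the ``names`` column in the same way as in the cell above.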

## Query to ADS

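The ADS search assumes that no paper was published without the advisor as a co-author. The procedure is therefore:

1. Search ADS for every paper under the advisor's name (edit ``author`` below).
2. For each paper that includes someone stored in ``team`` above, additionally record which student it is.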
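
The second step can be sketched on toy data without contacting ADS. The paper titles, the dictionary layout, and the helper ``members_on`` are all made up for illustration; only the idea (check whether any name variant of a team member appears in the author list) comes from this notebook.

```python
# Toy stand-ins for the ADS result: each paper has a list of author names.
papers = [
    {"title": "Paper A", "author": ["Kwon, Yuna G.", "Ishiguro, Masateru"]},
    {"title": "Paper B", "author": ["Kim, Joonho", "Im, Myungshin"]},
]

# Hypothetical team: member label -> name variants to look for.
team_altnames = {"Ishiguro": ["Ishiguro, Masateru", "Ishiguro, M."]}


def members_on(paper):
    """Return the team members whose name variants appear among the authors."""
    return [
        who
        for who, variants in team_altnames.items()
        if any(name in paper["author"] for name in variants)
    ]


tagged = {paper["title"]: members_on(paper) for paper in papers}
print(tagged)
```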

**IMPORTANT**: You must obtain a personal API token from ADS.

1. Go to [ADS](https://ui.adsabs.harvard.edu/), log in. 
2. Then go to [Account - Settings - API Token](https://ui.adsabs.harvard.edu/user/settings/token). 
3. Generate your token.
4. Copy and paste it to ``na.ADS.TOKEN`` below:

In [4]:
na.ADS.TOKEN = 'Your_API_Token_Here'

# by default, the top 10 records are returned, sorted in
# reverse chronological order. This can be changed
na.ADS.NROWS = 9999

# change the fields that are returned (enter as strings in a list)
na.ADS.ADS_FIELDS = ["title", "bibcode", "author", "pubdate", "property", "esources",
                     "pub", "issn", "volume", "issue", "page", "doi", "arxiv", "bibstem", "database"]

author = "Im, Myungshin" # <<---- Professor's name
year = "2019-2020"
query_str = f'author:"={author}" year:{year}'
print(f"Query with: \n\t {query_str}")
results_raw = na.ADS.query_simple(query_str)

results_raw.sort(['pubdate', "title"])

# Columns with more than one dimension cannot be converted to pandas,
# so keep only the first element along the second axis.
# Caveat: anything beyond that first element is discarded. This was OK
# in my own tests, but I do not know whether it can go wrong in general.
for c in results_raw.colnames:
    if len(results_raw[c].shape) > 1:
        results_raw[c] = results_raw[c][:, 0]

results = results_raw.to_pandas()

results["N_author"] = results["author"].str.len()
results["YYYYMM"] = results["pubdate"].str[:-3].str.replace("-", "").astype(int)
results["refereed"] = ["REFEREED" in prop for prop in results["property"]]
results["astronomy"] = ["astronomy" in db for db in results["database"]]
results["volume"] = [-1 if vol == [None] else vol for vol in results["volume"]]

results_ref = results[results["refereed"]
                      & results["astronomy"]
                      & (results["volume"] != -1)]

print(f"ADS contains {len(results)} matches with <{author}> (refereed: {len(results_ref)}) in {year}.")
if len(results_ref) > 5:
    print(f"\nHey {author.split(',')[1].strip()}, you are awesome...!")

Query with: 
	 author:"=Im, Myungshin" year:2019-2020
ADS contains 55 matches with <Im, Myungshin> (refereed: 15) in 2019-2020.

Hey Myungshin, you are awesome...!


* **NOTE**: If you want to search for your own papers instead, change ``query_str``.
* **NOTE**: See http://adsabs.github.io/help/search/comprehensive-solr-term-list for the complete list of fields.
* ~~**NOTE**: As of 2019-07-02, the ``issn`` is not yet supported from ADS.~~ It seems to be supported now (2020-04-15).
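
If ``query_str`` is changed often, a small helper keeps the format in one place. This is only a sketch: ``build_query`` is a hypothetical name, and the query syntax (including the ``=`` prefix before the author name) is copied exactly from the cell above rather than derived independently.

```python
def build_query(author, first_year, last_year):
    """Return an ADS query string in the same form as ``query_str`` above."""
    # The '=' prefix and the 'year:YYYY-YYYY' range are kept as in the notebook.
    return f'author:"={author}" year:{first_year}-{last_year}'


query = build_query("Im, Myungshin", 2019, 2020)
print(query)
```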

## Select Rows for This BK Survey
The year-month range is set to ``201908 <= YYYYMM <= 202001``. Following the Excel file provided by BK21, only the information below is collected.

1. title
2. journal (full name) ``pub``
3. doi
4. issn
5. volume
6. issue
7. page
8. YYYYMM
9. number of authors

(in this order). The students' names and KRI researcher registration numbers are then added, and the result is saved as ``BKoutput.csv``.

You can open ``BKoutput.csv`` with Excel and copy-and-paste it into the original Excel file. **WARNING**: The formatting of the original Excel file from BK is messy (it got better in 2020 but it was horrible in 2019), so you will have to handle the formatting yourself.
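
The year-month selection can be checked on a toy table first. The sketch below assumes that ``pubdate`` strings look like ``YYYY-MM-DD`` (which is what the ``[:-3]`` slice in the query cell implies); the titles and dates are made up.

```python
import pandas as pd

# Made-up records covering both sides of the selection window.
toy = pd.DataFrame({
    "title": ["too early", "in range", "also in range", "too late"],
    "pubdate": ["2019-07-00", "2019-08-00", "2020-01-00", "2020-02-00"],
})

# Drop the day part, remove the dash, and convert to an integer such as 201908.
toy["YYYYMM"] = (
    toy["pubdate"].str[:-3].str.replace("-", "", regex=False).astype(int)
)

# Series.between is inclusive on both ends by default.
selected = toy[toy["YYYYMM"].between(201908, 202001)].copy()
print(selected[["title", "YYYYMM"]])
```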

In [5]:
results_ref_thisyear = results_ref[(results_ref["YYYYMM"] >= 201908)
                                 & (results_ref["YYYYMM"] <= 202001)]
# .copy() so that the new columns are added to an independent frame
# rather than to a slice of results_ref_thisyear (avoids SettingWithCopyWarning).
results_ref_bk = results_ref_thisyear[
    ["author", "title", 'doi', "pub", "issn",
     "volume", "issue", "page", "YYYYMM", "N_author"]
].copy()
results_ref_bk["students"] = ""
results_ref_bk["researcher_number"] = ""

for i, row in results_ref_bk.iterrows():
    students = []
    researcher_numbers = []
    for _, student in team_df.iterrows():
        # A team member is on the paper if any of their name variants
        # appears in the author list; each member is recorded only once.
        if any(name in row["author"] for name in student["altnames"]):
            students.append(str(student["kornames"]))
            researcher_numbers.append(str(student["researcher_number"]))
    results_ref_bk.at[i, "students"] = ",".join(students)
    results_ref_bk.at[i, "researcher_number"] = ",".join(researcher_numbers)

# del results_ref_thisyear["author"]
results_ref_bk.to_csv("BKoutput.csv", index=False)

In [6]:
results_ref_bk

| | author | title | doi | pub | issn | volume | issue | page | YYYYMM | N_author | students | researcher_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | [Kriss, G. A., De Rosa, G., Ely, J., Peterson,... | Space Telescope and Optical Reverberation Mapp... | 10.3847/1538-4357/ab3049 | The Astrophysical Journal | 0004-637X | 881 | 2 | 153 | 201908 | 167 | | |
| 38 | [Kwon, Yuna G., Ishiguro, Masateru, Kwon, Jung... | Near-infrared polarimetric study of near-Earth... | 10.1051/0004-6361/201935542 | Astronomy and Astrophysics | 0004-6361 | 629 | [None] | A121 | 201909 | 10 | 이시구로 마사테루 | 1111.0 |
| 39 | [Kim, Joonho, Im, Myungshin, Choi, Changsu, Hw... | Medium-band Photometry Reverberation Mapping o... | 10.3847/1538-4357/ab40cd | The Astrophysical Journal | 0004-637X | 884 | 2 | 103 | 201910 | 4 | | |
| 40 | [Harikane, Yuichi, Ouchi, Masami, Ono, Yoshiak... | SILVERRUSH. VIII. Spectroscopic Identification... | 10.3847/1538-4357/ab2cd5 | The Astrophysical Journal | 0004-637X | 883 | 2 | 142 | 201910 | 35 | | |
| 42 | [Lee, Seong-Kook, Im, Myungshin, Hyun, Minhee,... | More connected, more active: galaxy clusters a... | 10.1093/mnras/stz2564 | Monthly Notices of the Royal Astronomical Society | 0035-8711 | 490 | 1 | 135 | 201911 | 7 | | |


## Download the PDF Files of the Papers

For each paper, the ADS link gateway is used as follows:
1. If the publisher's PDF is available, download it.
  - For Science, the publisher's PDF link does not lead directly to the full PDF, so a conditional clause handles it.
2. If it is unavailable, a fallback is tried.
  - For Nature, for example, appending ``.pdf`` to the article URL seems to lead to the PDF.
  
More exceptions will be added over time so that it works as well as possible.
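
The URL rules described above can be written as small pure functions, which makes it easier to add further exceptions later. This is a sketch only: the function names are hypothetical, the rules are the ones used in the download cell of this notebook, and the example URLs are placeholders, not real article links.

```python
BASE = "https://ui.adsabs.harvard.edu/link_gateway/"


def gateway_url(bibcode, kind):
    """Build an ADS link-gateway URL; ``kind`` is e.g. "PUB_PDF" or "PUB_HTML"."""
    return f"{BASE}{bibcode}/{kind}"


def science_pdf_url(resolved_url):
    """Rule for Science: point the resolved URL at the ``.full.pdf`` file."""
    if resolved_url.endswith("/tab-pdf"):
        return resolved_url.replace("/tab-pdf", ".full.pdf")
    return resolved_url + ".full.pdf"


def nature_pdf_url(resolved_url):
    """Rule for Nature: append ``.pdf``; return None for other publishers."""
    if "nature.com" in resolved_url:
        return resolved_url + ".pdf"
    return None


# Placeholder URLs, for illustration only.
print(gateway_url("2019ApJ...881..153K", "PUB_PDF"))
print(science_pdf_url("https://example.org/content/1/2/3/tab-pdf"))
print(nature_pdf_url("https://www.nature.com/articles/placeholder"))
```

Keeping the rules free of network calls means they can be checked on example URLs before any download is attempted.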

In [7]:
BASE = "https://ui.adsabs.harvard.edu/link_gateway/"
manual = dict(bib=[], pub_html=[])  # papers that must be downloaded manually
# Send a browser-like User-Agent; adapted from
# https://stackoverflow.com/questions/43165341/python3-requests-connectionerror-connection-aborted-oserror104-econnr/43167631
headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

for i, row in results_ref_thisyear.iterrows():
    bib = row["bibcode"]
    fpath = Path('{}.pdf'.format(bib))
    print(fpath, end=' ')
    
    if fpath.exists():
        print('already exists!')
        continue
        
    if "PUB_PDF" in row["esources"]:
        url = BASE + row["bibcode"] + "/PUB_PDF"
        print('Downloading...', end=' ')

        response = requests.get(url, headers=headers, stream=True)
        
        if "Science" in row["pub"]:
            if response.url.endswith("/tab-pdf"):
                url = response.url.replace("/tab-pdf", ".full.pdf")
            else:
                url = response.url + ".full.pdf"
            response = requests.get(url, headers=headers, stream=True)

        print("\n\t" + response.url)
        time.sleep(1)
        
        download_pdf(response, fpath)

    else:
        try:
            print("trying to find pdf...", end=' ')
            url = BASE + row["bibcode"] + "/PUB_HTML"
            response = requests.get(url, headers=headers, stream=True)
            if "nature.com" in response.url:
                url = response.url + ".pdf"
            else:
                raise ConnectionError()
            response = requests.get(url, headers=headers, stream=True)
            if response.status_code == 404:
                raise ConnectionError()
            print('I found it! Downloading...', end=' ')
            print("\n\t" + response.url)
            time.sleep(1)
        
            download_pdf(response, fpath)
            
        except ConnectionError:            
            print("\n!!! I couldn't find a valid link. Download from below:")
            print("\t" + BASE + bib + "/PUB_HTML")
            manual["bib"].append(bib)
            manual["pub_html"].append(BASE + bib + "/PUB_HTML")

2019ApJ...881..153K.pdf Downloading... 
	https://iopscience.iop.org/article/10.3847/1538-4357/ab3049/pdf


4.58kkB [00:42, 108kB/s] 


2019A&A...629A.121K.pdf Downloading... 
	https://www.aanda.org/articles/aa/pdf/2019/09/aa35542-19.pdf


2.09kkB [00:01, 1.19kkB/s]                           


2019ApJ...884..103K.pdf Downloading... 
	https://iopscience.iop.org/article/10.3847/1538-4357/ab40cd/pdf


1.05kkB [00:01, 743kB/s]


2019ApJ...883..142H.pdf Downloading... 
	https://iopscience.iop.org/article/10.3847/1538-4357/ab2cd5/pdf


3.50kkB [00:08, 404kB/s]


2019MNRAS.490..135L.pdf Downloading... 
	https://watermark.silverchair.com/stz2564.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAn4wggJ6BgkqhkiG9w0BBwagggJrMIICZwIBADCCAmAGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMQEslZu3LTBFcv5tfAgEQgIICMS4xjGJHDP996MdajWeovrNvcE4BraMxsg4Su9IMoO98cnrpeKZchoD3W7hnVbaUICxhz6OiOULR2a1p8E0F93eOHauBIhd6ju7hyFoLVJGG6DuzzRkS6jzX4QBxkcxKbVZY1sk5rxaHEs57IsHhGemXi_CSBu7_b-Q7tpc2jH7IqvCY9lQXaL--LH-dPyqAVy2kJMu_XashuZ8x9DuPwn7Q0hsOHMTu-46xYNfMN-BEiHbHYf2s7BE6MQORTdQdBxQDCnvVeFWaDuL7YAOWSnq8BdLgE7uarazdD5oIIBhS2cN0Qq8XwCgOykcTELxwSXUI1LNCkcIz55yvXzxSbVtn4ObbEPDaBT5RSCfjGd1yVpXI_NoNd4-OVA2ZoGy5e3v1d0-blHosKzXRB_k3N1t9ordK9_l1PRZIRse23T_6f9kj9mMexSNvsOAGSoxeS7yk_iDKgh7gShtoDpwVckxOsZgfKu7ErzpI1t6AhhRhzdr1OpZuxFjbUDdqyGPUjfWGa5KaWDZRrlF2fkxTOpPgxwyh8_9uXaw_7SIjYnhtLRnXi48kfs6hIywor0BRZmwCfgpDq4hxcT04KY0-RopXGLGlRewxQ-f7pjH0LCpn1bBE0eFekII5nCqBRdOvtHyaoXzl9F1H4YRsw_wloxdFI5rF5ukvSGSqSeCRKu0NcnciWHSa--HIYTWZoPOkZXxUKHp4cfHWUcS1tgFxYY22xcOfj4J1-aUwYX3lo5uAyw


3.98kkB [00:23, 169kB/s]                            


* **WARNING**: Some of your papers may be accepted but not yet on ADS. You **MUST** find those yourself!
* **NOTE**: I did not put much effort into automating the search for each paper's download link. Even so, because the notebook prints the link to the PDF, it may save a lot of time.

In [8]:
# Papers you have to download manually...
import pandas as pd
pd.DataFrame(manual)

| | bib | pub_html |
|---|---|---|
