**About the Elsevier API**
- About text mining in Elsevier: https://dev.elsevier.com/tecdoc_text_mining.html
- Get access to your personal API key: https://dev.elsevier.com/apikey/manage
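Since the key is personal, one option is to keep it out of the notebook and read it at run time. The sketch below is only an illustration: the function name `article_url` and the environment variable `ELSEVIER_API_KEY` are assumptions made for this example, while the URL pattern is the one used in the code cells further down.

```python
import os


def article_url(doi, api_key=None):
    """Build the article retrieval URL for a DOI.

    ELSEVIER_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    if api_key is None:
        api_key = os.environ.get("ELSEVIER_API_KEY", "")
    if not api_key:
        raise ValueError("No API key: set ELSEVIER_API_KEY or pass api_key")
    return f"https://api.elsevier.com/content/article/doi/{doi}?APIKey={api_key}"
```

For example, `article_url("10.1016/j.jml.2019.104027", api_key="MY_KEY")` returns the same kind of URL string that the loops below assemble by hand.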

**JML URL Format**
- Old version (until August 2019): doi/10.1016/j.jml.2018.xx.xxx
    - (a) xx is the month, written with two digits (the code tries 00 to 12)
    - (b) xxx is the unique label of the paper, written with three digits (000 to 999)
    - (c) I haven't seen a case where xxx exceeds 020, so the code only tries 000-019 (i.e., `range(20)`)
- New version (from October 2019): doi/10.1016/j.jml.2019.xxxxxx
    - xxxxxx is the unique label of the paper
    - but the number is not in the order of publication
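The two patterns can be sketched as small formatting helpers. The function names are hypothetical, and the dotted form `j.jml.<year>...` follows the URLs actually built in the code cells further down.

```python
def old_style_doi(year, month, label):
    """Old pattern: year, two-digit month, three-digit paper label."""
    return f"10.1016/j.jml.{year}.{month:02d}.{label:03d}"


def new_style_doi(year, label):
    """New pattern: year followed by a six-digit paper label."""
    return f"10.1016/j.jml.{year}.{label}"


print(old_style_doi(2018, 1, 0))
print(new_style_doi(2019, 104027))
```

Zero-padding is done by the format specifiers, so the month and label can be passed as plain integers.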

**Just for my own note**
- JML began to include a paper on computational modeling, as of Feb 2020

**Summary for the code**
- Old version: 2018.01.000 - 2019.12.019
- New version: 2019.104027 - 2019.104059, 2020.104038 - 2020.104145 (as of July 1, 2020)
    - These are the numbers that I extracted by eyeballing the unique label numbers published on the JML website
- Because some numbers are missing, the code loops through the whole range and skips the ones that do not host a paper (i.e., those that raise an HTTPError)

*The labels of the papers published online* (Not important here)

2020 Oct
2020.104132, 2020.104145, 2020.104144, 2020.104128, 2020.104125, 2020.104129, 2020.104130

2020 Aug
2020.104109, 2020.104110, 2020.104111, 2020.104106, 2020.104107, 2020.104113, 2020.104114, 2020.104124, 2020.104123, 2020.104126, 2020.104127

2020 June
2020.104088, 2020.104086, 2020.104089, 2020.104090, 2020.104092, 2020.104087, 2020.104105, 2020.104104, 2020.104108, 2020.104112, 2020.104091

2020 April
2020.104068, 2020.104082, 2020.104085, 2020.104084, 2020.104083, 2020.104063, 2020.104070, 2020.104069, 2020.104071

2020 Feb
2020.104065, 2020.104067, 2020.104064, 2020.104066, 2020.104038, 2020.104055, 2020.104052

2019 Dec
2019.104036, 2019.104047, 2019.104039, 2019.104050, 2019.104048, 2019.104049, 2019.104051, 2019.104054, 2019.104053

2019 Oct
2019.104028, 2019.104029, 2019.104030, 2019.104031, 2019.104027, 2019.104032, 2019.104034, 2019.104035, 2019.104037, 2019.104033

In [None]:
import bs4
import urllib.error
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as soup
import pandas as pd
from collections import Counter
# import csv
# import itertools

# Extracting texts from Journal of Memory and Language (JML)

## 1. Finding the valid URL

### Old DOI format

In [None]:
month_list = [f"{i:02}" for i in range(13)]
month_list

In [None]:
unique_num_list = [f"{i:03}" for i in range(20)]
unique_num_list

In [None]:
url_raw_old = []
for year in range(2016, 2020):
    for month in month_list:
        for unique_num in unique_num_list:
            url_string = 'https://api.elsevier.com/content/article/doi/10.1016/j.jml.%d.%s.%s?APIKey=13bf0ab31bf66221becf114979941195' %(year, month, unique_num)
            url_raw_old.append(url_string)
        

In [None]:
url_raw_old;
len(url_raw_old)

### New DOI format

In [None]:
url_raw_new = []
for year in range(2019, 2021):
    for unique_num in range(104027, 104146):
        url_string = 'https://api.elsevier.com/content/article/doi/10.1016/j.jml.%d.%d?APIKey=13bf0ab31bf66221becf114979941195' %(year, unique_num)
        url_raw_new.append(url_string)

In [None]:
url_raw_new;
len(url_raw_new)

In [None]:
url_journal = url_raw_old + url_raw_new
len(url_journal)

### Extracting the valid URLs 

In [None]:
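valid_url = []

for i in url_journal:
    try:
        # Open and immediately close the response; we only need to know that the URL resolves
        with ureq(i):
            pass
        print("The URL, %s is valid" %i)
        valid_url.append(i)
    except urllib.error.HTTPError:
        # No paper is hosted under this candidate DOI, so skip it
        pass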

Keep a copy of the valid URLs in **VALID_URL** as the "untouched data"

In [None]:
VALID_URL = list(valid_url)  # a real copy, so later changes to valid_url do not affect it
len(valid_url)
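The validity check above can also be wrapped in a function so that it can be tried out without hitting the API. This is a minimal sketch, not the notebook's own code: `filter_valid` and `fake_opener` are hypothetical names, and `fake_opener` is only an offline stand-in for `urlopen`.

```python
import io
import urllib.error
from urllib.request import urlopen


def filter_valid(urls, opener=urlopen):
    """Return the URLs that open without raising HTTPError."""
    valid = []
    for url in urls:
        try:
            # Close the response right away; only reachability matters here
            with opener(url):
                pass
        except urllib.error.HTTPError:
            continue  # nothing is hosted under this candidate
        valid.append(url)
    return valid


def fake_opener(url):
    """Offline stand-in for urlopen (illustration only): URLs ending in 'missing' fail."""
    if url.endswith("missing"):
        raise urllib.error.HTTPError(url, 404, "Not Found", None, None)
    return io.BytesIO(b"")


print(filter_valid(["x/found", "x/missing"], opener=fake_opener))
```

With the default `opener`, `filter_valid(url_journal)` would perform the same requests as the loop above. Only `HTTPError` is caught, so network failures such as timeouts still surface as exceptions.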

## 2. Reading the pages

### This is a test with a test_url

In [None]:
test_url = valid_url[3]
test_url

In [None]:
client_url = ureq(test_url)
page_html = client_url.read()
client_url.close()
page_soup = soup(page_html, "html.parser")
title = page_soup.find("dc:title").text
vol = page_soup.find("prism:volume").text
date = page_soup.find("prism:coverdisplaydate").text
month = date.split()[0]
year = date.split()[1]
abstract = page_soup.find("dc:description").text
abstract = abstract.replace("Abstract","").strip()
keywords = page_soup.findAll("dcterms:subject")
keywords = str(keywords)
doi = 'https://doi.org/' + page_soup.find("prism:doi").text

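remove_string = ["<dcterms:subject>","</dcterms:subject>","[","]"]

# Strip the tag markup and the list brackets left over from str()
for s in remove_string:
    keywords = keywords.replace(s, "")
# Separate the keywords with ';' instead of ','
keywords = keywords.replace(",", ";")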

test_dict = {'title': title, 
              'year': year, 
              'month': month,
              'vol': vol,
              'keywords': keywords,
              'abstract': abstract,
              'DOI': doi
              }

test_dict

**What is eid** https://dev.elsevier.com/documentation/FullTextRetrievalAPI.wadl#d1e461

### Defining the extractor

In [None]:
def extractJML(url_list):
    """Fetch each URL and return {eid: {field: value}} holding the article metadata."""
    
    jml_dict = {}
    
    for i in url_list:
        
        # request client
        client_url = ureq(i)
        page_html = client_url.read()
        client_url.close()
        page_soup = soup(page_html, "html.parser")
        
        # get title
        title = page_soup.find("dc:title").text
        
        # get volume
        vol = page_soup.find("prism:volume").text
        
        # get date
        date = page_soup.find("prism:coverdisplaydate").text
        month = date.split()[0]
        year = date.split()[1]
        
        # get abstract
        abstract = page_soup.find("dc:description").text
        abstract = abstract.replace("Abstract","").strip()
        
        # get keywords
        keywords = page_soup.findAll("dcterms:subject")
        keywords = str(keywords)
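        remove_string = ["<dcterms:subject>","</dcterms:subject>","[","]"]

        # Strip the tag markup and the list brackets left over from str()
        for s in remove_string:
            keywords = keywords.replace(s, "")
        # Separate the keywords with ';' instead of ','
        keywords = keywords.replace(",", ";")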
        
        # get doi
        doi = 'https://doi.org/' + page_soup.find("prism:doi").text
        
        # get eid
        eid = page_soup.find("eid").text
        
        # store in an embedded dictionary
        jml_dict[eid] = {}
        
        jml_dict[eid]['title'] = title
        jml_dict[eid]['year'] = year
        jml_dict[eid]['month'] = month
        jml_dict[eid]['volume'] = vol
        jml_dict[eid]['keywords'] = keywords
        jml_dict[eid]['abstract'] = abstract
        jml_dict[eid]['DOI'] = doi
        
    return jml_dict
    
    

In [None]:
test_urls = []
test_urls.append(valid_url[3])
test_urls.append(valid_url[5])
test_urls

### Let's extract the texts from JML

In [None]:
jml_dict = extractJML(valid_url)

### Save the data as a csv file

In [None]:
data = pd.DataFrame(jml_dict)

In [None]:
data.to_csv("./JML_2016-2020.csv")

## 3. Frequency count 

In [None]:
data.head(6)

In [None]:
len(data.columns)  # number of articles (one column per eid)

### Let's get the frequency of the keywords 

In [None]:
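word_list = []

# Each column of `data` is one article; the 'keywords' row holds a ';'-separated string
for kw_string in data.loc["keywords"]:
    # Strip the space left after each ';' so that identical keywords are counted together
    word_temp = [w.strip() for w in kw_string.split(";") if w.strip()]
    word_list += word_temp

Counter(word_list).most_common()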

In [None]:
word_str = str(word_list)
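Keywords that differ only in letter case or surrounding spaces would be counted as separate entries. The sketch below shows one way to normalize them before counting. The function name `keyword_counts` is hypothetical, and lower-casing is an assumption about what should count as "the same keyword".

```python
from collections import Counter


def keyword_counts(keyword_strings):
    """Count keywords across ';'-separated strings, ignoring case and outer spaces."""
    counts = Counter()
    for s in keyword_strings:
        for kw in s.split(";"):
            kw = kw.strip().lower()  # assumption: case differences are not meaningful
            if kw:  # skip empty pieces such as those from articles without keywords
                counts[kw] += 1
    return counts


# Toy input for illustration, not taken from JML
print(keyword_counts(["Memory; Attention", "memory;  Syntax", ""]).most_common())
```

With the notebook's data, the input would be the `keywords` row of `data`, one string per article.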

In [None]:
test = word_list[3].split(';')
Counter(test)

### Let's get the frequency of the **unique** keywords
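One possible reading of "unique" here is that a keyword should count at most once per article, which gives the number of articles that use it. The sketch below implements that reading only. The function name `document_frequency` is hypothetical, and the lower-casing is again an assumption.

```python
from collections import Counter


def document_frequency(keyword_strings):
    """For each keyword, count the number of articles that list it at least once."""
    df = Counter()
    for s in keyword_strings:
        # A set removes repeats within one article
        unique = {kw.strip().lower() for kw in s.split(";") if kw.strip()}
        df.update(unique)
    return df


# Toy input for illustration, not taken from JML
print(document_frequency(["memory; Memory; syntax", "memory"]).most_common())
```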

# Cf. Basic example with one article from one URL (no loop)

**Sanity check**
- The commented-out url_journal (label 104026) is a wrong one, so it should return an HTTP error; the active one (label 104027) is valid

In [None]:
# url_journal = 'https://api.elsevier.com/content/article/doi/10.1016/j.jml.2019.104026?APIKey=13bf0ab31bf66221becf114979941195'

In [None]:
url_journal = 'https://api.elsevier.com/content/article/doi/10.1016/j.jml.2019.104027?APIKey=13bf0ab31bf66221becf114979941195'

In [None]:
client_url = ureq(url_journal)
client_url

In [None]:
page_html = client_url.read()
client_url.close()

In [None]:
page_soup = soup(page_html, "html.parser")

In [None]:
title = page_soup.find("dc:title").text
vol = page_soup.find("prism:volume").text
date = page_soup.find("prism:coverdisplaydate").text
eid = page_soup.find("eid").text

In [None]:
abstract = page_soup.find("dc:description").text
abstract = abstract.replace("Abstract","").strip()

In [None]:
keywords = page_soup.findAll("dcterms:subject")
keywords = str(keywords)

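remove_string = ["<dcterms:subject>","</dcterms:subject>","[","]"]

# Strip the tag markup and the list brackets left over from str()
for s in remove_string:
    keywords = keywords.replace(s, "")
# Separate the keywords with ';' instead of ','
keywords = keywords.replace(",", ";")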

Ignore the code below -- I haven't worked on getting the author information

In [None]:
# authors = page_soup.findAll("dc:creator")
# authors = str(authors)

# remove_string = ["<dc:creator>","</dc:creator>","[","]"]

# for s in remove_string:
#     authors = authors.replace(s, "")