# Data Analytics Applied to Journal Academic Publications

##### by Pedro Veronezi

In [10]:
# Load the Bokeh libraries for interactive graphs
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

### BUSINESS / RESEARCH UNDERSTANDING

Technical writing and academic publications are a large part of the academic environment, where researchers and students develop novel approaches to solve foreseen problems as well as problems that have not yet revealed themselves. This industry has a significant impact on new technologies and itself moves a large sum of money.

Submitting a paper involves a great amount of work and resources for development and for revision rounds with peers and editors. Publication in a peer-reviewed journal can be a lengthy and often exhausting process, and acceptance can take a couple of months or even years. Knowing which journal to target, which topics are generally published or currently prominent, and which expressions and words are commonly used by authors who published in the same venue can therefore help researchers polish their work to better fit the journal editors’ expectations.

Ultimately, the goal is to increase the chances of having the paper accepted and to speed up the publishing process. In this study, we acquire data from published scientific research articles on the practice of management in two major scholarly journals, Manufacturing & Service Operations Management (MSOM) and Management Science (MNSC). We investigate important aspects of the papers over the past 15 years, such as the most commonly discussed topics, the top authors and affiliations, and the time to acceptance and publication.

The main contribution is to provide high quality information that can assist researchers in the processes of selecting a target journal for publication and writing the paper according to the characteristics of the venue.

### DATA ACQUISITION & EXPLORING

For this study, data was gathered by downloading papers from two major scholarly journals, Manufacturing & Service Operations Management (MSOM) and Management Science (MNSC). In total, 1445 articles were downloaded from the journals’ websites, covering publications since 2003.

About 60% of the collected articles were published in MNSC and 40% in MSOM. The difference arises because the MNSC publications in the dataset start earlier (2003) than the MSOM ones (2009). An R script was developed to scrape data from the journals’ websites and to download the PDF version of each paper. For each article, the following information was acquired directly from the web page:

    * Link ID: the link to the web page where the paper is located. It is a string formed by the acronym of the journal, the year of acceptance, and a four-digit number. Example: ‘mnsc.2015.2339’;
    * Title: title of the paper. String format;
    * Author(s): name or names of the authors of the paper;
    * Affiliation(s): name of the institution for which each author works, to which the author contributes, or with which the author is associated. It can be a university, college, school, or even a company, such as a bank or an independent research institute. Example: School of Systems and Enterprises, Stevens Institute of Technology, Hoboken NJ, United States. String format;
    * Keywords: words or phrases chosen by the authors to capture the most important aspects of the paper. Example: “short-term open-pit mine production scheduling; hybrid optimisation; nonlinear programming”;
    * Date received: date the paper was submitted to the journal by the author. Example: 2013-07-10;
    * Date accepted: date when the paper was accepted. Example: 2015-06-02;
    * Date published: date when the paper was finally published. Example: 2016-03-07; and
    * Abstract.
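
As a small illustration of the Link ID format described above, the sketch below splits an identifier into its three parts. The `LinkId` class and `parse_link_id` function are hypothetical names introduced here; they are not part of the original R or Python scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkId:
    journal: str  # journal acronym, e.g. "mnsc"
    year: int     # year embedded in the identifier
    number: str   # four-digit article number, kept as text to preserve leading zeros


def parse_link_id(link_id: str) -> LinkId:
    """Split an identifier such as 'mnsc.2015.2339' into its three parts."""
    journal, year, number = link_id.strip().split(".")
    return LinkId(journal=journal.lower(), year=int(year), number=number)


parsed = parse_link_id("mnsc.2015.2339")
print(parsed.journal, parsed.year, parsed.number)
```

Keeping the article number as a string is a deliberate choice in this sketch, since converting it to an integer would drop any leading zeros.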


### DATA PREPARATION

This phase covers the cleaning of all attributes collected for each paper. For several papers, some information, such as the received, accepted, and published dates, was not directly available from the web page; those fields were therefore recorded as NaN.

A Python script was developed to drop such missing values. For the dates, the original strings include expressions such as “Published Online:”, so we extract only the date itself, in month-day-year format. Similar cleaning was applied to the strings containing all author names and affiliations: the affiliation strings, which also included the institutions’ postal addresses, were converted to lists containing solely the institution names. For all these steps, the data was handled as a pandas DataFrame.
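
The date cleaning described above could look roughly like the following sketch. The column names, the second link ID, and the exact layout of the raw string (`"Published Online: March 7, 2016"`) are assumptions made for illustration; only the “Published Online:” prefix comes from the scraped pages.

```python
import pandas as pd

# Toy rows; the raw string layout and the second link ID are assumptions.
raw = pd.DataFrame(
    {
        "link_id": ["mnsc.2015.2339", "msom.2014.0001"],
        "published_raw": ["Published Online: March 7, 2016", None],
    }
)

# Remove the label in front of the date (any text up to the first colon).
date_text = raw["published_raw"].str.replace(r"^[^:]*:\s*", "", regex=True)

# Missing or unparseable strings become NaT instead of raising an error.
raw["published"] = pd.to_datetime(date_text, format="%B %d, %Y", errors="coerce")

# Discard papers whose publication date is missing.
clean = raw.dropna(subset=["published"]).reset_index(drop=True)
print(clean[["link_id", "published"]])
```

Parsing into a datetime column rather than keeping a formatted string makes later date arithmetic straightforward; a month-day-year string can still be produced on demand with `clean["published"].dt.strftime("%m-%d-%Y")`.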

New attributes were also created from the original ones. The year of publication was extracted from the “publish date” attribute, and we also calculated the total time, in days, that each paper took from the received date to the published date.
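
A minimal sketch of these two derived attributes, assuming the dates are already parsed, is shown below. The column names `received`, `published`, `publication_year`, and `days_to_publish` are hypothetical; the single row reuses the example dates from the data acquisition section.

```python
import pandas as pd

# One row built from the example dates given in the data acquisition section.
papers = pd.DataFrame(
    {
        "received": pd.to_datetime(["2013-07-10"]),
        "published": pd.to_datetime(["2016-03-07"]),
    }
)

# Year of publication, taken from the published date.
papers["publication_year"] = papers["published"].dt.year

# Total time in days between submission and publication.
papers["days_to_publish"] = (papers["published"] - papers["received"]).dt.days
print(papers)
```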

### DATA EXPLORATION

The following tables present the number of publications, the number of authors, the average number of authors per article, and the average, maximum, and minimum time, in days, from the acceptance date to the publication date. Table 1 and Table 2 present the descriptive statistics for the MSOM and MNSC journals, respectively.
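
Statistics of this kind can be computed per journal with a pandas `groupby`, as in the sketch below. The rows are made up purely for illustration and are not values from the study; the column names are hypothetical, and `total_authors` here counts author slots rather than unique authors.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the study.
papers = pd.DataFrame(
    {
        "journal": ["MSOM", "MSOM", "MNSC", "MNSC", "MNSC"],
        "n_authors": [2, 3, 1, 4, 2],
        "days_to_publish": [400, 650, 300, 720, 510],
    }
)

# One summary row per journal, using named aggregation.
summary = papers.groupby("journal").agg(
    publications=("n_authors", "size"),
    total_authors=("n_authors", "sum"),
    authors_per_article=("n_authors", "mean"),
    mean_days=("days_to_publish", "mean"),
    max_days=("days_to_publish", "max"),
    min_days=("days_to_publish", "min"),
)
print(summary)
```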

The data in Tables 1 and 2 show that the average time to publication is considerably long, commonly close to 2 years, and in some cases the whole process took over 7 years. Moreover, the statistics of the two journals are similar to each other, which is consistent with the fact that both are published by the Institute for Operations Research and the Management Sciences (INFORMS), which prizes high-quality research that must go through rigorous rounds of peer review.

In [12]:
# Minimal check that Bokeh renders inline: two circle markers at (1, 3) and (2, 4)
plot = figure()
plot.circle([1, 2], [3, 4])

show(plot)