# Data Engineering Project 
## Importing the raw data, exporting the clean data

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to clean the main raw data frame and write a new, clean data frame for further use. The notebook also compares different read and write methods.

First, we install and import the necessary libraries in one cell (to avoid scattering imports across individual cells below). The packages and their versions will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [1]:
########### Library Installations ##############
# !pip install opendatasets # install the library for downloading the data set
# ! pip install habanero

################################################

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations

### Specific-purpose libraries
import opendatasets as od # downloading the data set from Kaggle
# from habanero import Crossref # CrossRef API

### Misc
import warnings # suppress warnings
import time # tracking time
import os # accessing directories

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings

## 1. Data Import
To download the data from Kaggle, you need to create a Kaggle API token. Make sure to include the `kaggle.json` file in the same directory as this notebook.

Some additional resources:
- How to download the datasets from kaggle with `opendatasets` library https://www.analyticsvidhya.com/blog/2021/04/how-to-download-kaggle-datasets-using-jupyter-notebook/
- Github repo for `opendatasets` library: https://github.com/JovianML/opendatasets

First, download the file (around `1.09 GB`). It will be stored in the `./arxiv/` directory. If the file already exists, the download is skipped because of the `force = False` argument.
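Before starting the download, it can help to confirm that the API token is in place. The following is a minimal sketch, assuming the standard `kaggle.json` layout with a `username` and a `key` field; `has_kaggle_token` is a hypothetical helper and not part of `opendatasets`.

```python
import json
import os


def has_kaggle_token(directory="."):
    """Return True if `kaggle.json` in `directory` holds a username and a key."""
    path = os.path.join(directory, "kaggle.json")
    if not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as fh:
            token = json.load(fh)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file counts as a missing token
        return False
    return isinstance(token, dict) and bool(token.get("username")) and bool(token.get("key"))


if not has_kaggle_token():
    print("kaggle.json not found or incomplete; the download may ask for credentials.")
```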

In [2]:
# Initialize the time of pipeline
start_pipe = time.time()

print(f'Time of pipeline start: {time.ctime(start_pipe)}')

Time of pipeline start: Thu Nov 24 14:55:36 2022


In [3]:
od.download("https://www.kaggle.com/datasets/Cornell-University/arxiv", 
                     force = False # force = True downloads the file even if it finds a file with the same name
                    )

Skipping, found downloaded files in "./arxiv" (use force=True to force download)


Import the JSON file as a pandas dataframe. For testing purposes, select how many rows are included. If `n_rows = "all"`, the entire data set is imported.

In [4]:
n_rows = 'all'

start_time = time.time()
if n_rows == "all":
    df_raw = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True)
else:
    df_raw  = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True, nrows = n_rows)

end_time = time.time()

print(f'Time elapsed: {end_time - start_time} seconds.')
print(f'Memory usage of raw df: {df_raw.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
print(f'Dataframe dimensions: {df_raw.shape}')
df_raw.head(2)

Time elapsed: 214.96538710594177 seconds.
Memory usage of raw df: 4.030794090591371 GB.
Dataframe dimensions: (2146946, 14)


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
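The import above holds the whole file in memory (about 4 GB) before any column is dropped. As a possible alternative, the file could be read in chunks with the `chunksize` argument of `pd.read_json`, keeping only the needed columns from each chunk. This is a sketch, not the method used in this notebook; `read_json_lines_in_chunks`, `KEEP_COLUMNS`, and the chunk size are illustrative assumptions.

```python
import pandas as pd

# Columns retained after the cleaning step in Section 2
KEEP_COLUMNS = ["id", "authors", "title", "doi", "categories", "update_date", "authors_parsed"]


def read_json_lines_in_chunks(path, columns=KEEP_COLUMNS, chunksize=100_000):
    """Read a JSON Lines file chunk by chunk, keeping only `columns`."""
    chunks = []
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        # Discard unused columns right away so they never accumulate in memory
        chunks.append(chunk[columns])
    return pd.concat(chunks, ignore_index=True)
```

A call such as `read_json_lines_in_chunks("arxiv/arxiv-metadata-oai-snapshot.json")` would then return the reduced dataframe directly. Whether this is faster than the single read would need to be measured.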


## 2. Preliminary Data Cleaning
In this step, data cleaning is performed. Here are the guidelines from the assignment:

- You can drop the abstract as it is not required in the scope of this project,
- You can drop publications with very short titles, e.g. one word, with empty authors

We first drop all the columns that we do not plan to use in the project. Then, we exclude the rows where works do not have a DOI. While we acknowledge that some valid publications do not have a DOI, a DOI demonstrates that the work is published (whether in a journal, as a pre-print, etc.) and, hence, serves as a marker of publication quality. Finally, we exclude titles of 10 characters or fewer. The main idea is to exclude works without a valid title: three words of three characters each, separated by two spaces, already amount to 11 characters, so anything shorter is a rather rare title.

In [5]:
# Drop the abstract, submitter, comments, report-no, versions, journal-ref, and license, as these features are not used in this project
## Of note, journal name will be retrieved later with a more standard label
df_raw = df_raw.drop(['abstract', 'submitter', 'comments', 'report-no', 'license', 'versions', 'journal-ref'], axis = 1)
df_raw.shape

(2146946, 7)

In [6]:
# Include only works with non-null values in doi
df_raw = df_raw[~df_raw['doi'].isnull()]
df_raw.shape

(1077424, 7)

In [7]:
# Drop the publications with very short titles (10 characters or fewer)
df_raw = df_raw[(df_raw['title'].map(len) > 10)]
df_raw = df_raw.reset_index(drop = True)
print(df_raw.shape)

# Set the index of each paper to 'id'
# df = df.set_index('id')
print(f'Dataframe dimensions: {df_raw.shape}')
print(f'Memory usage of raw pandas df: {df_raw.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
df_raw.head(3)

(1077226, 7)
Dataframe dimensions: (1077226, 7)
Memory usage of raw pandas df: 0.6780343065038323 GB.


Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0001,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,hep-ph,2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0006,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,cond-mat.mes-hall,2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"
2,704.0007,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,gr-qc,2008-11-26,"[[Corichi, Alejandro, ], [Vukasinac, Tatjana, ..."
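The three cleaning steps in this section can also be collected into one function, which makes them easier to rerun on a subset of the data. The sketch below assumes the same column names as the raw data frame; `clean_raw`, `UNUSED_COLUMNS`, and `short_title_len` are hypothetical names introduced for illustration.

```python
import pandas as pd

UNUSED_COLUMNS = ["abstract", "submitter", "comments", "report-no",
                  "license", "versions", "journal-ref"]


def clean_raw(df_in, short_title_len=10):
    """Drop unused columns, rows without a DOI, and rows whose title
    has `short_title_len` characters or fewer."""
    out = df_in.drop(columns=UNUSED_COLUMNS, errors="ignore")
    out = out[out["doi"].notna()]
    out = out[out["title"].str.len() > short_title_len]
    return out.reset_index(drop=True)
```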


## 3. Fact and Dimension tables for Data Warehouse (DWH)

Here, we create the tables with placeholder columns. The data schema uses two factless fact tables: `authorship`, which links articles (and their properties) with authors, and `article_category`, which records the scientific domain of each article.

**Fact tables** <br>
- `authorship`: links articles to authors
    - `article_id`: VARCHAR article id (allows retrieving the article from the original, raw df)
    - `author_id`: VARCHAR composed from author's last name and first name initial (e.g., LastF)
    
    
- `article_category`: links articles to categories
    - `article_id`: VARCHAR article id (allows retrieving the article from the original, raw df)
    - `category_id`: VARCHAR category code (e.g., `hep-ph`)

**Dimension tables** <br>
- `article`: contains the information about all unique publications and links the dimension tables. The columns are:
    - PK `article_id`: VARCHAR article id (allows retrieving the article from the original, raw df)
    - `title`: VARCHAR article title
    - `doi`: VARCHAR article DOI
    - `journal_id`: VARCHAR journal ID based on ISSN linking to the `journal` table
    - `date`: DATE linking to the `date` table
    - `n_cites`: INT the number of citations (FACT)
    - `n_authors`: INT the number of co-authors
    

- `author`: includes all individual authors of publications.
    - PK `author_id`: VARCHAR composed from author's last name and first name initial (e.g., LastF)
    - `lastname`: VARCHAR author's last name 
    - `first`: VARCHAR author's first name initial
    - `middle`: VARCHAR author's middle name initial (if any)
    - `gender`: INT (1 or 0), denoting 'Female' and 'Male', respectively (AUGMENTED VIA API!)
    - `affiliation`: VARCHAR author's affiliation (AUGMENTED VIA API!)
    - `hindex`: VARCHAR author's h-index (AUGMENTED VIA API OR COMPUTED (N PAPERS W/ N CITES)!)
    
    
- `journal`: includes all unique journals in which works were published
    - PK `journal_id`: VARCHAR journal ID
    - `issn`: VARCHAR journal ISSN (necessary for augmentation)
    - `title`: VARCHAR journal title
    - `if_latest`: FLOAT journal's latest Impact Factor (AUGMENTED VIA API!)
    
    
- `date`: includes publication-related data.
    - PK `date`: DATE YYYY-MM-DD
    - `year`: INT year
    - `month`: INT month
    
    
- `category`: includes categories associated with articles
    - PK `category_id`: VARCHAR
    - `superdom`: VARCHAR super-domain of the category
    - `subdom`: VARCHAR sub-domain of the category
    
    
The DWH ERD figure is below:

![](images/dwh_erd.png)
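To make the schema concrete, the tables listed above can be written as DDL. The following is a sketch only: it assumes an in-memory SQLite database (the notebook does not name a target database), maps VARCHAR to `TEXT` and FLOAT to `REAL`, and stores dates as text.

```python
import sqlite3

DDL = """
CREATE TABLE journal (
    journal_id TEXT PRIMARY KEY,
    issn       TEXT,
    title      TEXT,
    if_latest  REAL
);
CREATE TABLE "date" (
    "date" TEXT PRIMARY KEY,
    year   INTEGER,
    month  INTEGER
);
CREATE TABLE article (
    article_id TEXT PRIMARY KEY,
    title      TEXT,
    doi        TEXT,
    journal_id TEXT REFERENCES journal(journal_id),
    "date"     TEXT REFERENCES "date"("date"),
    n_cites    INTEGER,
    n_authors  INTEGER
);
CREATE TABLE author (
    author_id   TEXT PRIMARY KEY,
    lastname    TEXT,
    first       TEXT,
    middle      TEXT,
    gender      INTEGER,
    affiliation TEXT,
    hindex      TEXT
);
CREATE TABLE category (
    category_id TEXT PRIMARY KEY,
    superdom    TEXT,
    subdom      TEXT
);
-- Factless fact tables: one row per (article, author) and (article, category)
CREATE TABLE authorship (
    article_id TEXT REFERENCES article(article_id),
    author_id  TEXT REFERENCES author(author_id)
);
CREATE TABLE article_category (
    article_id  TEXT REFERENCES article(article_id),
    category_id TEXT REFERENCES category(category_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(tables)
```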

**<font color = 'red'> USE A TEST DATA SET OF 1000 SAMPLES: </font>**

In [8]:
## Prepare data for small-scale testing
df = df_raw.iloc[:1000,:] # Take a thousand rows for testing
df.head()

Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0001,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,hep-ph,2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0006,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,cond-mat.mes-hall,2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"
2,704.0007,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,gr-qc,2008-11-26,"[[Corichi, Alejandro, ], [Vukasinac, Tatjana, ..."
3,704.0008,Damian C. Swift,Numerical solution of shock and ramp compressi...,10.1063/1.2975338,cond-mat.mtrl-sci,2009-02-05,"[[Swift, Damian C., ]]"
4,704.0009,"Paul Harvey, Bruno Merin, Tracy L. Huard, Luis...","The Spitzer c2d Survey of Large, Nearby, Inste...",10.1086/518646,astro-ph,2010-03-18,"[[Harvey, Paul, ], [Merin, Bruno, ], [Huard, T..."


### 3.1. Factless fact tables

#### 3.1.1. Factless fact table: `authorship`

In [9]:
# Create the table from article id and authors list
## NB! Creating `authorship_raw` - for later authors extraction
authorship_raw = df[['id', 'authors_parsed']].set_index('id')
authorship_raw = pd.DataFrame(authorship_raw['authors_parsed'].explode()).reset_index()

# Extract the last, first, and middle names from each [last, first, middle] list
## Vectorized `.str[i]` indexing avoids a row-by-row loop with chained assignment
authorship_raw['last_name'] = authorship_raw['authors_parsed'].str[0]
authorship_raw['first_name'] = authorship_raw['authors_parsed'].str[1]
authorship_raw['middle_name'] = authorship_raw['authors_parsed'].str[2]

# Drop the redundant column
authorship_raw = authorship_raw.drop(columns = 'authors_parsed')

# Author_identifier
authorship_raw['author_id'] = authorship_raw['last_name'] + authorship_raw['first_name'].str[0]
# Rename article id column
authorship_raw = authorship_raw.rename({'id':'article_id'}, axis = 1)

# Final table
authorship = authorship_raw.drop(columns = ['last_name', 'first_name', 'middle_name'])

print(f'Dataframe dimensions: {authorship.shape}')
print(f'Memory usage of raw pandas df: {authorship.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
authorship.head()

Dataframe dimensions: (3651, 2)
Memory usage of raw pandas df: 0.000444863922894001 GB.


Unnamed: 0,article_id,author_id
0,704.0001,BalázsC
1,704.0001,BergerE
2,704.0001,NadolskyP
3,704.0001,YuanC
4,704.0006,PongY
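Because `author_id` is built from the last name and a single initial, different people can end up with the same id. The sketch below shows the id construction and one way to list such collisions. `make_author_id` and `find_id_collisions` are hypothetical helpers, the name `("Law", "Chen")` is invented for the example, and returning the bare last name for an empty first name is a choice made here, not the notebook's behaviour.

```python
from collections import defaultdict


def make_author_id(last_name, first_name):
    """Join the last name with the first-name initial, e.g. ('Pong', 'Y. H.') -> 'PongY'."""
    initial = first_name.strip()[:1]
    return last_name.strip() + initial


def find_id_collisions(parsed_authors):
    """Map each author_id to the distinct (last, first) pairs that produce it,
    keeping only ids shared by more than one pair."""
    seen = defaultdict(set)
    for last_name, first_name in parsed_authors:
        seen[make_author_id(last_name, first_name)].add((last_name, first_name))
    return {aid: names for aid, names in seen.items() if len(names) > 1}


names = [("Pong", "Y. H."), ("Law", "C. K."), ("Law", "Chen")]
print(find_id_collisions(names))
```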


#### 3.1.2. Factless fact table: `article_category`

In [10]:
# Article-category factless fact table
article_category = df[['id', 'categories']].set_index('id')
article_category = pd.DataFrame(article_category['categories'].str.split(' ').explode()) # extract category codes for articles in long-df
article_category = article_category.reset_index()

article_category = article_category.rename(columns = {'id':'article_id', 'categories':'category_id'})

print(f'Dataframe dimensions: {article_category.shape}')
print(f'Memory usage of raw pandas df: {article_category.memory_usage(deep = True).sum()/1024/1024} MB.')
article_category.head()

Dataframe dimensions: (1470, 2)
Memory usage of raw pandas df: 0.18647289276123047 MB.


Unnamed: 0,article_id,category_id
0,704.0001,hep-ph
1,704.0006,cond-mat.mes-hall
2,704.0007,gr-qc
3,704.0008,cond-mat.mtrl-sci
4,704.0009,astro-ph


### 3.2. Dimensions tables

#### 3.2.1. Dimension table: `article`

In [11]:
article = pd.DataFrame(columns = ['article_id', 'title', 'doi', 'n_authors', 'journal_issn', 'n_cites', 'date'])
article['article_id'] = df['id']
article['title'] = df['title']
article['doi'] = df['doi']
article['n_authors'] = df['authors_parsed'].str.len() # get the number of authors
article['date'] = df['update_date']

print(f'Dataframe dimensions: {article.shape}')
print(f'Memory usage of raw pandas df: {article.memory_usage(deep = True).sum()/1024/1024} MB.')
article.head()

Dataframe dimensions: (1000, 7)
Memory usage of raw pandas df: 0.39719676971435547 MB.


Unnamed: 0,article_id,title,doi,n_authors,journal_issn,n_cites,date
0,704.0001,Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,4,,,2008-11-26
1,704.0006,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,2,,,2015-05-13
2,704.0007,Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,3,,,2008-11-26
3,704.0008,Numerical solution of shock and ramp compressi...,10.1063/1.2975338,1,,,2009-02-05
4,704.0009,"The Spitzer c2d Survey of Large, Nearby, Inste...",10.1086/518646,7,,,2010-03-18


#### 3.2.2. Dimension table: `author`
NB! Dependency on `authorship_raw` table, i.e., data is extracted from it.

In [12]:
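# Create the table from the `authorship_raw` table
author = authorship_raw[['author_id', 'last_name', 'first_name', 'middle_name']].copy()

# Drop duplicates
## NB! `keep=False` removes every row that has a duplicate, so any author whose record appears more than once (e.g., on several articles) is dropped entirely; the default `keep='first'` would retain one row per author
author = author.drop_duplicates(keep=False)

# Add the columns to be augmented
author['gender'] = np.nan
author['affiliation'] = np.nan
author['hindex'] = np.nan

# Sort alphabetically by author id
author = author.sort_values('author_id').reset_index(drop = True)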

# Final table
print(f'Dataframe dimensions: {author.shape}')
print(f'Memory usage of raw pandas df: {author.memory_usage(deep = True).sum()/1024/1024} MB.')
author.head()

Dataframe dimensions: (3214, 7)
Memory usage of raw pandas df: 0.8347568511962891 MB.


Unnamed: 0,author_id,last_name,first_name,middle_name,gender,affiliation,hindex
0,AarsethS,Aarseth,Sverre J.,,,,
1,AbabnehB,Ababneh,Bashar S.,,,,
2,AbbottD,Abbott,Derek,,,,
3,AbeE,Abe,Eisuke,,,,
4,AbrahamsE,Abrahams,E.,,,,
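The `keep` argument of `drop_duplicates` decides whether one copy of a repeated row survives. A small illustration on made-up rows (`authors_demo` is not part of the pipeline):

```python
import pandas as pd

authors_demo = pd.DataFrame({
    "author_id": ["PongY", "LawC", "PongY"],
    "last_name": ["Pong", "Law", "Pong"],
})

# keep=False discards every row that has a duplicate, including the first copy
kept_none = authors_demo.drop_duplicates(keep=False)

# The default keep='first' retains one row per distinct record
kept_first = authors_demo.drop_duplicates()

print(len(kept_none), len(kept_first))  # prints: 1 2
```

For a dimension table that should list each author once, the default behaviour is usually the one wanted.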


#### 3.2.3. Dimension table: `journal`

In [13]:
journal = pd.DataFrame(columns = ['journal_id', 'issn', 'title', 'if_latest'])

print(f'Dataframe dimensions: {journal.shape}')
print(f'Memory usage of raw pandas df: {journal.memory_usage(deep = True).sum()/1024/1024} MB.')
journal.head()

Dataframe dimensions: (0, 4)
Memory usage of raw pandas df: 0.0 MB.


Unnamed: 0,journal_id,issn,title,if_latest


#### 3.2.4. Dimension table: `date`

In [14]:
# Time_dim
date = pd.DataFrame({'date' : df['update_date'].unique()}) # rename the date variable and find unique dates
date['year'] = date['date'].str.split('-', expand = True)[0] # extract the year
date['month'] =  date['date'].str.split('-', expand = True)[1] # extract the month
date = date.sort_values(by=['date']).reset_index(drop = True) # sort by date

print(f'Dataframe dimensions: {date.shape}')
print(f'Memory usage of raw pandas df: {date.memory_usage(deep = True).sum()/1024/1024} MB.')
date.head()

Dataframe dimensions: (239, 3)
Memory usage of raw pandas df: 0.04274463653564453 MB.


Unnamed: 0,date,year,month
0,2007-05-23,2007,5
1,2007-06-07,2007,6
2,2007-06-09,2007,6
3,2007-06-12,2007,6
4,2007-06-13,2007,6
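The schema lists `year` and `month` as INT, while splitting the date string yields text. One way to obtain integer columns is to parse the dates first. This is a sketch under the assumption that the dates are ISO-formatted strings; `build_date_dim` is a hypothetical helper.

```python
import pandas as pd


def build_date_dim(dates):
    """Build a date dimension with integer year and month from date strings."""
    parsed = pd.to_datetime(pd.Series(dates, name="date").drop_duplicates())
    parsed = parsed.sort_values().reset_index(drop=True)
    return pd.DataFrame({
        "date": parsed.dt.strftime("%Y-%m-%d"),  # keep the key as a string
        "year": parsed.dt.year,
        "month": parsed.dt.month,
    })


print(build_date_dim(["2008-11-26", "2007-05-23", "2008-11-26"]))
```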


#### 3.2.5. Dimension table: `category`
NB! Dependency on `article_category` table, i.e., data is extracted from it.

In [15]:
# Categories dimension table
category = pd.DataFrame(article_category['category_id'].copy().reset_index(drop = True))
category[['superdom', 'subdom']] = category['category_id'].str.split('.', expand = True) # extract super- and subdomain
category = category.drop_duplicates() # drop duplicate rows
category = category.sort_values('category_id').reset_index(drop = True) # sort values, reset index

print(f'Dataframe dimensions: {category.shape}')
print(f'Memory usage of raw pandas df: {category.memory_usage(deep = True).sum()/1024/1024} MB.')
category.head()

Dataframe dimensions: (88, 3)
Memory usage of raw pandas df: 0.015672683715820312 MB.


Unnamed: 0,category_id,superdom,subdom
0,astro-ph,astro-ph,
1,astro-ph.HE,astro-ph,HE
2,cond-mat.dis-nn,cond-mat,dis-nn
3,cond-mat.mes-hall,cond-mat,mes-hall
4,cond-mat.mtrl-sci,cond-mat,mtrl-sci
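Splitting on `.` with `expand = True` returns only one column when no category in the sample contains a dot, in which case the two-column assignment fails. A possible alternative is `str.partition`, which always returns three parts. The sketch below is an illustration with a hypothetical helper name, `split_category`.

```python
import pandas as pd


def split_category(category_ids):
    """Split category codes such as 'cond-mat.mes-hall' into super- and sub-domain."""
    ids = pd.Series(category_ids, name="category_id").drop_duplicates()
    parts = ids.str.partition(".")  # columns 0, 1, 2: before, separator, after
    out = pd.DataFrame({
        "category_id": ids,
        "superdom": parts[0],
        # An empty remainder means the code has no sub-domain; mark it as missing
        "subdom": parts[2].mask(parts[2] == ""),
    })
    return out.sort_values("category_id").reset_index(drop=True)


print(split_category(["hep-ph", "cond-mat.mes-hall", "hep-ph"]))
```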


## Total Pipeline Runtime

In [17]:
end_pipe = time.time()

print(f'Time of pipeline end: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {round(round(end_pipe - start_pipe) / 60, 4)} min.')

Time of pipeline end: Thu Nov 24 15:00:02 2022
Total pipeline runtime: 4.4333 min.
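The start/end timing pattern used in this notebook can be wrapped in a small context manager so that each step reports its own duration. This is an optional sketch; `timed` is a hypothetical helper, and `time.perf_counter` is used here in place of `time.time` because it is intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the elapsed time of the enclosed block; optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s ({elapsed / 60:.4f} min)")


timings = {}
with timed("example step", timings):
    sum(range(100_000))
```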


## 4. Preparing Graph DB Data

See a helpful resource: https://www.markhneedham.com/blog/2019/03/27/from-graph-model-to-neo4j-import/

Another one: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c

Creating co-authorship networks: https://stackoverflow.com/questions/18900870/creating-a-co-authorship-graph


Graph database data receives input from the Data Warehouse. Specifically, there are two tables: `nodes` and `edges`. 

* `nodes`
    * `author_id`
    * `n_pubs_total`
    * `n_cites_total`
    * `h_index`
    * `gender`
    * `affiliation`
    
    
* `edges`
    * `author_id1`
    * `author_id2`
    * `n_pubs_together`
    * `n_cites_together`
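The `nodes` and `edges` tables can be derived from the `authorship` fact table by grouping authors per article and counting every unordered pair. The sketch below covers only the publication counts (`n_pubs_total`, `n_pubs_together`); citation counts, gender, and affiliation depend on augmentation and are left out. `build_graph_tables` is a hypothetical helper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph_tables(authorship_rows):
    """Build node and edge records from (article_id, author_id) pairs."""
    authors_by_article = defaultdict(set)
    for article_id, author_id in authorship_rows:
        authors_by_article[article_id].add(author_id)

    n_pubs_total = Counter()
    n_pubs_together = Counter()
    for authors in authors_by_article.values():
        n_pubs_total.update(authors)
        # Sort so that each unordered pair is stored once as (author_id1, author_id2)
        n_pubs_together.update(combinations(sorted(authors), 2))

    nodes = [{"author_id": a, "n_pubs_total": n}
             for a, n in sorted(n_pubs_total.items())]
    edges = [{"author_id1": a, "author_id2": b, "n_pubs_together": n}
             for (a, b), n in sorted(n_pubs_together.items())]
    return nodes, edges


rows = [("1", "A"), ("1", "B"), ("2", "A"), ("2", "B"), ("2", "C")]
nodes, edges = build_graph_tables(rows)
print(nodes)
print(edges)
```

With the `authorship` dataframe, the input could be supplied as `authorship.dropna().itertuples(index=False, name=None)`; rows with a missing `author_id` are dropped first because missing values cannot be sorted together with strings.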

