# Data Engineering Project 
## ETL

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this notebook is to clean the main raw data frame and write a new, clean data frame for further use. The notebook also demonstrates comparisons of different read and write methods.

First, we install and import the necessary libraries in a single cell (to avoid scattering imports across the individual cells below). The packages to be installed and their versions will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [None]:
!conda install psycopg2 -y

In [None]:
!pip install -r requirements.txt

In [250]:
## NB!! run the installs from terminal
########### Library Installations ##############

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations


### Specific-purpose libraries
# NB! Most must be configured with an API key
#from pybliometrics.scopus import AbstractRetrieval
from habanero import Crossref # CrossRef API
from genderize import Genderize # Gender API

### Misc
from math import floor
import time
import requests
import warnings # suppress warnings
import os # accessing directories
from tqdm import tqdm # track loop runtime
from unidecode import unidecode # international encoding for names

### Custom Scripts (ETL, augmentations, SQL)
from scripts.raw_to_tables import *
from scripts.augmentations import *
from scripts.sql_queries import *

import psycopg2 # PostgreSQL driver (needed for the database connection below)

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings
start_pipe = time.time() # pipeline start time (used in the total runtime cell at the end)

## Pipeline start

In [2]:
# First check if the tables are already in the system
## If tables exist, import from .csv

if os.path.exists('./tables') and len(os.listdir('./tables')) == 7: # directory + 6 tables
    print('Tables exist...')
    author = pd.read_csv('./tables/author.csv')
    authorship = pd.read_csv('./tables/authorship.csv')
    article = pd.read_csv('./tables/article.csv')
    article_category = pd.read_csv('./tables/article_category.csv')
    category = pd.read_csv('./tables/category.csv')
    journal = pd.read_csv('./tables/journal.csv')
    print('Tables are in the working directory!')
    
## If tables do not exist, pull from Kaggle (or local machine), preprocess to tables
else: 
    print('Preparing tables...')
    print()
    ingest_and_prepare()
    print('Tables are in the working directory!')

Preparing tables...

Time of pipeline start: Tue Dec 27 14:27:43 2022

Skipping, found downloaded files in "./arxiv" (use force=True to force download)
Dimensions of the df with valid DOIs: (1088470, 6)
Dimensions of the df with dropped duplicates: (1088468, 6)
Dimensions of the df with short titles removed: (79958, 6)
Dimensions of the df with only CS papers: (79958, 6)

Data Ingestion Time elapsed: 39.64675688743591 seconds.
Memory usage of raw df: 0.03783251345157623 GB.
Directory 'tables' exists.
Writing pandas tables to .csv-files.
Pandas tables to .csv-files successfully written!

ETL Runtime: 49.638529 sec.
Tables are in the working directory!


# 2. Data Augmentation

In [None]:
# Tables:
## authorship
## journal <-- augment all data (use ISSN from DOI)
## article <-- augment with number of citations
## author <-- augment with gender and affiliation

### Article
In this section, we use the `requests` library to fetch citation data based on the Crossref URL of the work's DOI. We have found this method to be faster than querying the Crossref API. We extract the work type and the number of citations that the work has received; additionally, the journal ISSN of the publication is retrieved when available.

We want to note that although we initially also wanted to fetch author affiliation, it is not really feasible, as most of this information is missing.
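
The actual fetching is done by `fetch_article_augments` from `scripts/augmentations.py`, which is not shown here. As a rough sketch of the idea, and assuming the URL in question is the public Crossref REST endpoint `https://api.crossref.org/works/<DOI>`, a single lookup could look like the following. The helper names `parse_crossref_message` and `fetch_crossref_work` are hypothetical; the JSON fields `type`, `is-referenced-by-count`, and `ISSN` are the ones Crossref returns for a work.

```python
CROSSREF_WORKS_URL = "https://api.crossref.org/works/"  # assumed endpoint


def parse_crossref_message(message):
    """Pull work type, citation count, and first ISSN out of a Crossref 'message' dict."""
    issns = message.get("ISSN") or []
    return {
        "type": message.get("type"),
        "n_cites": message.get("is-referenced-by-count"),
        "journal_issn": issns[0] if issns else None,  # ISSN is often missing
    }


def fetch_crossref_work(doi, timeout=10):
    """Hypothetical helper: fetch one work by DOI; returns None on a non-200 response."""
    import requests  # imported here so the parser above works without the package

    response = requests.get(CROSSREF_WORKS_URL + doi, timeout=timeout)
    if response.status_code != 200:
        return None
    return parse_crossref_message(response.json()["message"])
```

Keeping the parsing separate from the network call makes it possible to check the field extraction on a saved response without hitting the service.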

In [13]:
# NB! Must follow the sequence below

# Check if clean 'article' table exists
if os.path.exists('./data_ready/article.csv'):
    article = pd.read_csv('./data_ready/article.csv')
else:
    # Check if augments already done
    if os.path.exists('./tables/article_augmented_raw.csv'):
        article = pd.read_csv('./tables/article_augmented_raw.csv')
    else:
        # Candidate for citation updating!!
        article = pd.read_csv('./tables/article.csv')

        # Use Crossref API for extracting cites, paper type, and journal ISSNs
        print(time.ctime())
        start_article = time.time()
        batch_size = 2000
        for start_range in range(0, len(article), batch_size):
            # Cap the final (possibly shorter) batch at the table length
            end_range = min(start_range + batch_size, len(article))
            # Use the custom augmentation script
            ## NB! 5k records in appx 30min, 2k records in appx 14min 
            fetch_article_augments(start_range, end_range)
        end_article = time.time()
        print(f'End of article augmentation: {time.ctime(end_article)}')
        print(f'Article augmentation runtime: {(end_article - start_article)/60} min.')

        # Write to a separate csv (without filtering)
        article.to_csv('tables/article_augmented_raw.csv', index = False)
    
    # Include only journal articles
    article_journal = article[article['type'] == 'journal-article'].reset_index(drop = True)
    
    # Write to 'data_ready' directory
    article_journal.to_csv('./data_ready/article.csv', index = False)

### Journal (and article update)

In [14]:
if os.path.exists('./data_ready/journal.csv'):
    journal = pd.read_csv('./data_ready/journal.csv')
else:
    journal = pd.read_csv('tables/journal.csv')
    
    # Import the journal database data
    ## NB! It may take some time
    print('Importing CWTS data (2021)...')
    cwts_data = pd.read_excel('augmentation/CWTS Journal Indicators April 2022.xlsx',
                         sheet_name = 'Sources')
    # Fix colnames (replace spaces and lower)
    cwts_data.columns = [col.replace(' ','_').lower() for col in cwts_data.columns] 
    # Include only 2021 records
    cwts21 = cwts_data[cwts_data['year'].isin([2021])].reset_index(drop = True)
    print('CWTS (2021) data imported!')
    
    # Find the journals
    journal['journal_issn'] = article['journal_issn'].unique() # NB!! 'article' must already be loaded in memory
    journal = journal[~journal['journal_issn'].isnull()] # remove NAs
    journal = journal.sort_values('journal_issn').reset_index(drop = True)
    
    print(f'The number of unique journals: {len(journal)}')
    journal = find_journal_stats(journal, cwts21) # from augmentations.py
    
    print('Writing a clean journal.csv')
    journal.to_csv('./data_ready/journal.csv', index = False)
    print("'journal.csv' written to 'data_ready' directory!")

# Remove not found journals from articles
article = article[article['journal_issn'].isin(journal['journal_issn'])].reset_index(drop = True)
# Update 'article.csv' in 'data_ready' directory
article.to_csv('./data_ready/article.csv', index = False)

Importing CWTS data (2021)...
CWTS (2021) data imported!
The number of unique journals: 2219
Matching journal ISSNs with names and SNIPs...


100%|██████████| 2219/2219 [00:07<00:00, 305.80it/s]


Removing ISSNs with missing data...
Writing a clean journal.csv
'journal.csv' written to 'data_ready' directory!


### Authorship update

In [15]:
if os.path.exists('./data_ready/authorship.csv'):
    authorship = pd.read_csv('./data_ready/authorship.csv')
else:
    authorship = pd.read_csv('./tables/authorship.csv')
    # Include only the relations in 'article' table
    authorship = authorship[authorship['article_id'].isin(article['article_id'])].sort_values('article_id').reset_index(drop = True)
    # Write to csv
    authorship.to_csv('./data_ready/authorship.csv', index = False)

### Author update and augments
To query the gender of a given author, we first extract all valid first names (length > 3). We acknowledge that some first names are shorter than four characters, but because the number of queries is limited, we opt for this more robust criterion to extract as many usable names as possible.
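
The name extraction itself is not shown in this notebook (the cell below reads the finished `names_genders.csv`). A minimal sketch of the length filter described above could look like this; the function name `valid_first_names` is hypothetical, and de-duplicating the names to save queries is an assumption rather than something the pipeline is known to do.

```python
def valid_first_names(first_names, min_length=4):
    """Return sorted unique names with at least `min_length` characters (length > 3)."""
    unique = set()
    for name in first_names:
        if not isinstance(name, str):
            continue  # skip missing values such as None or NaN
        cleaned = name.strip()
        if len(cleaned) >= min_length:
            unique.add(cleaned)  # assumption: each distinct name is queried once
    return sorted(unique)


print(valid_first_names(["Harith", "F", "the", None, "Tomoharu", "Harith"]))
```

With this filter, initials such as `F` and stray tokens such as `the` never reach the gender API.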

In [254]:
if os.path.exists('./data_ready/author.csv'):
    author = pd.read_csv('./data_ready/author.csv')
else:
    author = pd.read_csv('./tables/author.csv').drop_duplicates()
    
    # Filter authors
    author = author[author['author_id'].isin(authorship['author_id'])].drop_duplicates(['author_id']).reset_index(drop = True)
    
    # Augment gender
    print('Importing gender information...')
    names_genders = pd.read_csv('./augmentation/names_genders.csv')[['first_name', 'gender']]
    author = author.merge(names_genders, on = 'first_name', how = 'left')
    print('Gender augmentation done where possible')
    
    # Number of publications
    npubs = pd.DataFrame(authorship.reset_index(drop = True).groupby('author_id').size()).sort_values('author_id').reset_index()
    npubs.columns = ['author_id', 'total_pubs']
    author = author.merge(npubs, on = 'author_id')
    
    # Additional augments
    ## Statistics table
    stats = authorship.merge(article[['article_id', 'n_cites', 'n_authors']], on = 'article_id').sort_values('author_id').reset_index(drop = True)

    ## Add new columns to author table
    author['total_cites'] = np.zeros(len(author))
    author['avg_cites'] = np.zeros(len(author))
    author['med_coauthors'] = np.zeros(len(author))
    author['hindex'] = np.zeros(len(author))
    
    ## Add statistics to authors
    ### NB! Slow run...
    print('Computing author statistics...')
    for i in tqdm(range(len(author))):    
        author_id = author.loc[i, 'author_id']
        papers = stats[stats['author_id'] == author_id].sort_values('n_cites').reset_index(drop = True)
        citations = papers['n_cites'].sort_values(ascending = False).reset_index(drop = True)

        # Stats
        author.loc[i, 'total_cites'] = papers['n_cites'].sum() # Total number of citations
        author.loc[i, 'avg_cites'] = round(author.loc[i, 'total_cites']/len(papers),3) # Average number of citations per paper
        author.loc[i, 'med_coauthors'] = np.median(papers['n_authors']-1) # subtract oneself

        # h-index
        author.loc[i, 'hindex'] = hindex(citations, len(citations))
    
    print('Computing done!') 
    print('Saving author table to .csv...') 
    
    # Save to csv
    author.to_csv('./data_ready/author.csv', index = False)
    print('author table saved!') 
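
The `hindex` helper used above comes from `scripts/augmentations.py` and is not reproduced in this notebook. For reference, a minimal sketch of the standard h-index definition (the largest h such that h papers each have at least h citations) is given below. The one-argument function `h_index` is a hypothetical stand-in, not the project's implementation.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))
```

Under this definition an author with a single paper gets an h-index of 1 if that paper has been cited at least once and 0 otherwise.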

In [258]:
author

| | author_id | last_name | first_name | middle_name | gender | total_pubs | total_cites | avg_cites | med_coauthors | hindex |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201819N | 201819 | Networks | Computer | M | 1 | 0.0 | 0.000 | 6.0 | 0.0 |
| 1 | ADNIt | ADNI | the | | | 1 | 4.0 | 4.000 | 7.0 | 1.0 |
| 2 | AISahafH | AISahaf | Harith | | M | 1 | 5.0 | 5.000 | 5.0 | 1.0 |
| 3 | ALTamF | ALTam | F | | | 1 | 29.0 | 29.000 | 2.0 | 1.0 |
| 4 | AWANOT | AWANO | Tomoharu | | M | 1 | 9.0 | 9.000 | 4.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56197 | vonWaldthausenL | vonWaldthausen | Leopold | | M | 1 | 7.0 | 7.000 | 7.0 | 1.0 |
| 56198 | vonWangenheimA | vonWangenheim | Aldo | | | 3 | 19.0 | 6.333 | 1.0 | 2.0 |
| 56199 | vonWurstembergerP | vonWurstemberger | Philippe | | M | 2 | 26.0 | 13.000 | 4.0 | 2.0 |
| 56200 | vonderOheU | vonderOhe | Ulrich | | M | 1 | 2.0 | 2.000 | 2.0 | 1.0 |


### Article category

In [266]:
if os.path.exists('./data_ready/article_category.csv'):
    article_category = pd.read_csv('./data_ready/article_category.csv')
else:
    article_category = pd.read_csv('./tables/article_category.csv')
    article_category = article_category[article_category['article_id'].isin(article['article_id'])].reset_index(drop = True)
    article_category.to_csv('./data_ready/article_category.csv', index = False)


### Category

In [278]:
if os.path.exists('./data_ready/category.csv'):
    category = pd.read_csv('./data_ready/category.csv')
else:
    category = pd.read_csv('./tables/category.csv')
    category = category[category['category_id'].isin(article_category['category_id'])].reset_index(drop = True)
    category.to_csv('./data_ready/category.csv', index = False)

| | category_id | superdom | subdom |
|---|---|---|---|
| 0 | adap-org | adap-org | |
| 1 | astro-ph | astro-ph | |
| 2 | astro-ph.CO | astro-ph | CO |
| 3 | astro-ph.EP | astro-ph | EP |
| 4 | astro-ph.GA | astro-ph | GA |
| ... | ... | ... | ... |
| 131 | stat.CO | stat | CO |
| 132 | stat.ME | stat | ME |
| 133 | stat.ML | stat | ML |
| 134 | stat.OT | stat | OT |


### Journal
To get the journal information, we need the list of journal ISSNs from the `article` table. Although the journal Impact Factor is the more common metric, it is trademarked and hence cannot be retrieved from open sources. The alternative is to use SNIP (source-normalized impact per publication): the average number of citations per publication, corrected for differences in citation practice between research domains. Fortunately, the list of journals and their SNIP values is available from the CWTS website (https://www.journalindicators.com/).
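
The matching itself is done by `find_journal_stats` from `scripts/augmentations.py`, which is not shown here. The idea can be sketched as a dictionary lookup from ISSN to journal title and SNIP. The function names are hypothetical, and the column names `source_title`, `print_issn`, `electronic_issn`, and `snip` are assumptions about how the CWTS sheet looks after the column-name cleanup. The sample rows are made up for illustration.

```python
def build_snip_lookup(cwts_rows):
    """Map each ISSN (print or electronic) to its (journal title, SNIP) pair."""
    lookup = {}
    for row in cwts_rows:
        for key in ("print_issn", "electronic_issn"):  # assumed column names
            issn = row.get(key)
            if issn:
                lookup[issn] = (row["source_title"], row["snip"])
    return lookup


def match_journals(issns, lookup):
    """Keep only ISSNs found in the lookup; unmatched journals are dropped."""
    return [
        {"journal_issn": issn, "journal_title": lookup[issn][0], "snip": lookup[issn][1]}
        for issn in issns
        if issn in lookup
    ]


# Made-up example rows, not real CWTS data
rows = [
    {"source_title": "Journal A", "print_issn": "1111-1111", "electronic_issn": "2222-2222", "snip": 1.5},
    {"source_title": "Journal B", "print_issn": None, "electronic_issn": "3333-3333", "snip": 0.8},
]
matched = match_journals(["2222-2222", "9999-9999"], build_snip_lookup(rows))
print(matched)
```

Indexing both ISSN variants matters because the ISSN retrieved for an article may be either the print or the electronic one.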

#### Merge author-names-genders

# 3. From Pandas to PostgreSQL

In [None]:
# Import the data from Pandas
authorship = pd.read_csv('data_ready/authorship.csv')
article_category = pd.read_csv('data_ready/article_category.csv')
category = pd.read_csv('data_ready/category.csv')
article = pd.read_csv('data_ready/article.csv')
author = pd.read_csv('data_ready/author.csv')
journal = pd.read_csv('data_ready/journal.csv')

tables = [authorship, article_category, category, article, author, journal]

# Name of tables (for later print)
authorship.name = 'authorship'
article_category.name = 'article_category'
category.name = 'category'
article.name = 'article'
author.name = 'author'
journal.name = 'journal'

In [None]:
journal

## Database Connection

In [None]:
# Connect to the database
conn = psycopg2.connect(host="postgres", user="postgres", password="password", database="postgres")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Create the research_db database with UTF8 encoding
cur.execute("DROP DATABASE IF EXISTS research_db")
cur.execute("CREATE DATABASE research_db WITH ENCODING 'utf8' TEMPLATE template0")

## Load the SQL magic functions

In [None]:
%load_ext sql
%sql postgresql://postgres:password@postgres/postgres

## Drop Tables

In [None]:
# Drop Tables 
for query in drop_tables:
    cur.execute(query)
    conn.commit()

In [None]:
# Check that a table, e.g., 'journal', is not in the database
%sql SELECT * FROM journal

## Create Tables

In [None]:
for query in create_tables:
    cur.execute(query)
    conn.commit()

In [None]:
# Check that the tables (e.g., 'journal') are created
## Should be empty
%sql SELECT * FROM journal

## Insert into Tables

In [None]:
def insert_to_tables(table, query):
    ''' Helper function for inserting values into PostgreSQL tables
    Args:
        table (pd.DataFrame): pandas table
        query (SQL query): corresponding SQL query for inserting 'table' into the DB
    '''
    
    print(f'Inserting table -- {table.name} -- ...')
    
    try:
        for i, row in table.iterrows():
            cur.execute(query, list(row))
        print(f'Table -- {table.name} -- successfully inserted!')
    except Exception as e:
        # Report the underlying error instead of silently swallowing it
        print(f'Error with table -- {table.name} --: {e}')
    print()
        
# NB! 'tables' and 'insert_tables' must be in the same order
for table, query in zip(tables, insert_tables):
    insert_to_tables(table, query)
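
Row-by-row `execute` calls can be slow for large tables. As a hedged alternative, the sketch below passes all rows to a single `executemany` call with a parameterized query. To keep the example runnable without a database server, it uses `sqlite3` from the standard library as a stand-in for PostgreSQL. With `psycopg2` the placeholder would be `%s` instead of `?`. The table definition here is illustrative and is not taken from `scripts/sql_queries.py`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL connection
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE category (category_id TEXT PRIMARY KEY, superdom TEXT, subdom TEXT)")

# Rows as plain tuples, e.g. from DataFrame.itertuples(index=False, name=None)
rows = [
    ("astro-ph.CO", "astro-ph", "CO"),
    ("stat.ML", "stat", "ML"),
]

# sqlite3 uses '?' placeholders; psycopg2 uses '%s' instead
insert_query = "INSERT INTO category (category_id, superdom, subdom) VALUES (?, ?, ?)"

try:
    cur.executemany(insert_query, rows)
    conn.commit()
except sqlite3.Error as error:
    conn.rollback()  # leave the table unchanged if any row fails
    print(f"Insert failed: {error}")

n_inserted = cur.execute("SELECT COUNT(*) FROM category").fetchone()[0]
print(n_inserted)
```

Committing once after the whole batch also means a failed insert leaves no partially filled table behind.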

In [None]:
%sql SELECT * FROM author LIMIT 10

## Test Queries

In [None]:
%sql SELECT * FROM authorship LIMIT 10;

In [None]:
%sql SELECT * FROM article_category LIMIT 10;

In [None]:
%sql SELECT * FROM article LIMIT 10;

In [None]:
%sql SELECT * FROM category LIMIT 10;

In [None]:
%sql SELECT * FROM journal LIMIT 10;

# 4. Preparing Graph DB Data
In essence, we need to (a) rename the attributes to comply with Neo4j notation and (b) save the tables created above as .csv files; see https://medium.com/@st3llasia/analyzing-arxiv-data-using-neo4j-part-1-ccce072a2027

- Network analysis with these data in Neo4j: https://medium.com/swlh/network-analysis-of-arxiv-dataset-to-create-a-search-and-recommendation-engine-of-articles-cd18b36a185e

- Link prediction: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c
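
As an illustration of step (a), the sketch below renames the header row of a .csv file to the header notation used by the Neo4j bulk importer (`:ID`, `:START_ID`, `:END_ID` with an ID space in parentheses). The mapping dictionaries and the function `rename_header` are hypothetical; the project's actual attribute names for the graph database may differ, and the sample row is made up.

```python
import csv
import io

# Assumed mapping from relational column names to Neo4j import headers
AUTHOR_HEADER = {"author_id": "author_id:ID(Author)"}
AUTHORSHIP_HEADER = {"author_id": ":START_ID(Author)", "article_id": ":END_ID(Article)"}


def rename_header(csv_text, mapping):
    """Return CSV text whose header row is renamed according to `mapping`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    rows[0] = [mapping.get(col, col) for col in rows[0]]  # unmapped columns are kept
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up relationship row for illustration
print(rename_header("author_id,article_id\nALTamF,a1\n", AUTHORSHIP_HEADER))
```

Node files get an `:ID` column and relationship files get `:START_ID` and `:END_ID` columns, so the same `author_id` column is renamed differently depending on the table.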

The Graph Database Schema is pictured below:
<img src="images/graph_db_schema.png"/>

# 5. Example Queries

## 5.1. Data Warehouse

## 5.2. Graph Database

## Total Pipeline Runtime

In [None]:
end_pipe = time.time()

print(f'Time of pipeline end: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {(end_pipe - start_pipe)/60} min.')