# Data Engineering Project 
## ETL

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this notebook is to clean the main raw data frame and write a new, clean data frame for further use. The notebook also compares different read and write methods.

First, we install and import the necessary libraries in a single cell (to avoid scattering imports across the individual cells below). The packages and their versions will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.
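One way to collect the versions for `requirements.txt` is to query the installed package metadata. The following is a minimal sketch using only the standard library; the helper name `pinned_requirements` and the package list are illustrative assumptions, not part of the project code.

```python
from importlib.metadata import version, PackageNotFoundError

def pinned_requirements(packages):
    """Return 'name==version' lines for installed packages (hypothetical helper)."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Keep a visible marker instead of failing on a missing package
            lines.append(f"# {name}: not installed")
    return lines

# Example package list (an assumption; adjust to the libraries actually used)
for line in pinned_requirements(["pandas", "numpy", "psycopg2"]):
    print(line)
```

The printed lines can be pasted into `requirements.txt` once the list of packages is final.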

In [1]:
## NB!! run the installs from terminal
########### Library Installations ##############

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations

### Specific-purpose libraries
# from habanero import Crossref # CrossRef API

### Misc
import warnings # suppress warnings
import os # accessing directories
import time # timing the pipeline stages


from scripts.raw_to_tables import *
from scripts.sql_queries import *

import psycopg2

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings

In [2]:
start_pipe = time.time() # Initialize the time of pipeline
start_etl = time.time() # Initialize the time of ETL
print(f'Time of pipeline start: {time.ctime(start_pipe)}')
print()

# Data ingestion
df_raw = ingest_data(n_rows = 100000) # NB! make sure to use 'df_raw' name!

# Create the df from raw
df = raw_to_df(df_raw)
del df_raw # delete the initial raw df (cleanup)

# Prepare Pandas dataframes
authorship, author = authorship_author_extract(df)
article_category, category = article_category_category_extract(df)
article = article_extract(df)
journal = journal_extract()

end_etl = time.time() # Endtime of ETL

print(f'ETL Runtime: {round(end_etl - start_etl, 6)} sec.')

Time of pipeline start: Fri Dec 16 14:46:19 2022

Skipping, found downloaded files in "./arxiv" (use force=True to force download)
Data Ingestion Time elapsed: 1.6785180568695068 seconds.
Memory usage of raw df: 0.1687495606020093 GB.
Raw df dimensions: (100000, 14)

Initial preprocessing time elapsed: 0.08308219909667969 seconds.
Memory usage of cleaned df: 0.032786483876407146 GB.
Cleaned df dimensions: (61315, 7)

ETL Runtime: 9.221953 sec.
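The start/end timestamps used above can also be wrapped in a small context manager, so that each stage reports its own runtime. This is only a sketch: the name `timed` is a hypothetical helper, and `time.perf_counter()` is swapped in for `time.time()` because it is better suited to measuring elapsed intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Measure and print the runtime of the enclosed block (hypothetical helper)."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the block raises, so the runtime is always reported
        result["seconds"] = time.perf_counter() - start
        print(f"{label} runtime: {result['seconds']:.6f} sec.")

# Stand-in workload; in the notebook this would be the ETL steps
with timed("ETL") as etl_timing:
    total = sum(range(1000))
```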


In [3]:

#del df_raw # delete the original data frame

#### 3.1.1. Factless fact table: `authorship` and Dimension table `author`

#### 3.1.2. Factless fact table: `article_category`

# 2. Data Augmentation

In [4]:
# Tables:
## authorship
## article_category
## category
## journal <-- augment all data (use ISSN from DOI)
## article <-- augment with number of citations
## author <-- augment with gender and affiliation

In [4]:
# Final cleaning pass: remove authors with NaN or too short IDs, and all articles that have such an author
## NaNs
author = author[~author['author_id'].isnull()]
nan_authors = authorship[authorship['author_id'].isnull()]['article_id'].values
article = article.loc[~article['article_id'].isin(nan_authors)]
authorship = authorship.loc[~authorship['article_id'].isin(nan_authors)]

## Too short (< 4) names
author = author[~(author['author_id'].str.len() < 4)].reset_index(drop = True)
short_authors = authorship[(authorship['author_id'].str.len() < 4)]['article_id'].values
article = article.loc[~article['article_id'].isin(short_authors)].reset_index(drop = True)
authorship = authorship.loc[~authorship['article_id'].isin(short_authors)].reset_index(drop = True)
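The effect of this cleaning step can be checked on a toy example. The sketch below restates the same logic as one function; the function name `drop_articles_with_bad_authors` and the toy data are assumptions made for illustration. As in the cell above, an article is dropped entirely if any of its authors has a missing or too short ID, while valid authors of a dropped article stay in the `author` table.

```python
import pandas as pd

def drop_articles_with_bad_authors(author, authorship, article, min_len=4):
    """Drop bad author IDs and every article linked to one (hypothetical helper)."""
    ids = authorship['author_id']
    bad = ids.isnull() | (ids.str.len() < min_len).fillna(False)
    bad_articles = authorship.loc[bad, 'article_id'].unique()

    # Keep only authors whose ID is present and long enough
    valid = author['author_id'].notnull() & (author['author_id'].str.len() >= min_len).fillna(False)
    author = author[valid].reset_index(drop=True)
    article = article[~article['article_id'].isin(bad_articles)].reset_index(drop=True)
    authorship = authorship[~authorship['article_id'].isin(bad_articles)].reset_index(drop=True)
    return author, authorship, article

# Toy data: article A1 has a missing author, A2 has a too short author ID
authorship_toy = pd.DataFrame({'article_id': ['A1', 'A1', 'A2', 'A3'],
                               'author_id': ['SmithJ', None, 'LiX', 'JonesK']})
author_toy = pd.DataFrame({'author_id': ['SmithJ', None, 'LiX', 'JonesK']})
article_toy = pd.DataFrame({'article_id': ['A1', 'A2', 'A3']})

author_c, authorship_c, article_c = drop_articles_with_bad_authors(
    author_toy, authorship_toy, article_toy)
print(article_c)
```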

### To .csv

In [6]:
# Make a directory 'tables' (no error if it already exists)
!mkdir -p tables


In [7]:
authorship.to_csv('tables/authorship.csv', index = False)
article_category.to_csv('tables/article_category.csv', index = False)
category.to_csv('tables/category.csv', index = False)
journal.to_csv('tables/journal.csv', index = False)
article.to_csv('tables/article.csv', index = False)
author.to_csv('tables/author.csv', index = False)

# 3. From Pandas to PostgreSQL

In [8]:
# Read the tables back from the .csv files
authorship = pd.read_csv('tables/authorship.csv')
article_category = pd.read_csv('tables/article_category.csv')
category = pd.read_csv('tables/category.csv')
article = pd.read_csv('tables/article.csv')
author = pd.read_csv('tables/author.csv')
journal = pd.read_csv('tables/journal.csv')

tables = [authorship, article_category, category, article, author, journal]

# Name of tables (for later print)
authorship.name = 'authorship'
article_category.name = 'article_category'
category.name = 'category'
article.name = 'article'
author.name = 'author'
journal.name = 'journal'

In [9]:
authorship

Unnamed: 0,article_id,author_id
0,704.0001,BalázsC
1,704.0001,BergerE
2,704.0001,NadolskyP
3,704.0001,YuanC
4,704.0006,PongY
...,...,...
201594,812.3872,CichowolskiS
201595,812.3872,RomeroG
201596,812.3872,OrtegaM
201597,812.3872,CappaC
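Note that the `article_id` values above (e.g. `704.0001`) have been parsed as floating-point numbers, which drops the leading zero of arXiv identifiers such as `0704.0001` and any trailing zeros. If the identifiers should be kept verbatim, one option is to read the column as a string. The sketch below uses an in-memory CSV with made-up rows rather than the project files, and it only helps if the identifiers were still intact when the .csv file was written.

```python
import io
import pandas as pd

# Small in-memory stand-in for 'tables/authorship.csv' (illustrative rows)
csv_text = "article_id,author_id\n0704.0001,BalázsC\n0704.0020,ExampleA\n"

# Default parsing: the ID column is inferred as float
as_float = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: the ID column is kept as text, zeros included
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'article_id': str})

print(as_float['article_id'].tolist())
print(as_text['article_id'].tolist())
```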


# Database Connection

In [10]:
# Connect to the database
conn = psycopg2.connect(host="postgres", user="postgres", password="password", database="postgres")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Create the research_db database with UTF8 encoding
cur.execute("DROP DATABASE IF EXISTS research_db")
cur.execute("CREATE DATABASE research_db WITH ENCODING 'utf8' TEMPLATE template0")

## Load the SQL magic extension

In [11]:
%load_ext sql
%sql postgresql://postgres:password@postgres/postgres

# Drop Tables

In [12]:
# Drop Tables 
for query in drop_tables:
    cur.execute(query)
    conn.commit()

In [13]:
%sql SELECT * FROM journal

 * postgresql://postgres:***@postgres/postgres
(psycopg2.errors.UndefinedTable) relation "journal" does not exist
LINE 1: SELECT * FROM journal
                      ^

[SQL: SELECT * FROM journal]
(Background on this error at: https://sqlalche.me/e/14/f405)


# Create Tables

In [14]:
for query in create_tables:
    cur.execute(query)
    conn.commit()

In [15]:
%sql SELECT * FROM author

 * postgresql://postgres:***@postgres/postgres
0 rows affected.


author_id,last_name,first_name,middle_name,gender,affiliation,hindex


# Insert into Tables

In [16]:
def insert_to_tables(table, query):
    ''' Helper function for inserting values into PostgreSQL tables
    Args:
        table (pd.DataFrame): pandas table
        query (str): SQL INSERT query corresponding to 'table'
    '''

    print(f'Trying to insert table -- {table.name} -- ...')

    try:
        for i, row in table.iterrows():
            cur.execute(query, list(row))
        print(f'Table -- {table.name} -- successfully inserted!')
    except Exception as e: # report the cause instead of silently swallowing it
        print(f'Error with table -- {table.name} --: {e}')
    print()

# 'tables' and 'insert_tables' must list the tables in the same order
for table, query in zip(tables, insert_tables):
    insert_to_tables(table, query)

Trying to insert table -- authorship -- ...
Table -- authorship -- successfully inserted!

Trying to insert table -- article_category -- ...
Table -- article_category -- successfully inserted!

Trying to insert table -- category -- ...
Table -- category -- successfully inserted!

Trying to insert table -- article -- ...
Table -- article -- successfully inserted!

Trying to insert table -- author -- ...
Table -- author -- successfully inserted!

Trying to insert table -- journal -- ...
Table -- journal -- successfully inserted!
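Inserting row by row is simple but issues one statement per row. A batched alternative is `executemany`, combined with a transaction so that a failing row leaves no partial data behind. The sketch below uses the standard-library `sqlite3` module as a stand-in so that it runs without a PostgreSQL server; psycopg2 cursors also provide `executemany`, but with `%s` placeholders instead of `?`. The helper name `insert_rows` and the sample rows are assumptions for illustration.

```python
import sqlite3

def insert_rows(conn, query, rows):
    """Insert all rows in one transaction; roll back if any row fails."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.executemany(query, rows)
    except sqlite3.Error as err:
        print(f"Insert failed, rolled back: {err}")
        return False
    return True

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE authorship (article_id TEXT NOT NULL, author_id TEXT NOT NULL)"
)

insert_query = "INSERT INTO authorship VALUES (?, ?)"
ok = insert_rows(conn, insert_query,
                 [("0704.0001", "BalázsC"), ("0704.0001", "BergerE")])
# The second batch contains a NULL author_id and is rejected as a whole
bad = insert_rows(conn, insert_query,
                  [("0704.0006", "PongY"), ("0704.0006", None)])

n_rows = conn.execute("SELECT COUNT(*) FROM authorship").fetchone()[0]
print(ok, bad, n_rows)
```

With a pandas table, the rows could be produced with `table.itertuples(index=False, name=None)` instead of `iterrows()`.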



In [18]:
%sql SELECT * FROM author LIMIT 10

 * postgresql://postgres:***@postgres/postgres
10 rows affected.


author_id,last_name,first_name,middle_name,gender,affiliation,hindex
A'HearnM,A'Hearn,M. F.,,,,
AagesenM,Aagesen,M.,,,,
AaltonenA,Aaltonen,A.,,,,
AarnioH,Aarnio,Harri,,,,
AaronsonS,Aaronson,Scott,,,,
AarsethJ,Aarseth,Jan B.,,,,
AartsJ,Aarts,J.,,,,
AasA,Aas,A. J.,,,,
AazamiA,Aazami,Amir B.,,,,
AbabnehB,Ababneh,Bashar S.,,,,


# Test Queries

In [19]:
%sql SELECT * FROM authorship LIMIT 10;

 * postgresql://postgres:***@postgres/postgres
10 rows affected.


article_id,author_id
704.0001,BalázsC
704.0001,BergerE
704.0001,NadolskyP
704.0001,YuanC
704.0006,PongY
704.0006,LawC
704.0007,CorichiA
704.0007,VukasinacT
704.0007,ZapataJ
704.0008,SwiftD


In [20]:
%sql SELECT * FROM article_category LIMIT 10;

 * postgresql://postgres:***@postgres/postgres
10 rows affected.


article_id,category_id
704.0001,hep-ph
704.0006,cond-mat.mes-hall
704.0007,gr-qc
704.0008,cond-mat.mtrl-sci
704.0009,astro-ph
704.0015,hep-th
704.0016,hep-ph
704.0017,astro-ph
704.002,hep-ex
704.0021,nlin.PS


In [21]:
%sql SELECT * FROM article LIMIT 10;

 * postgresql://postgres:***@postgres/postgres
10 rows affected.


article_id,title,doi,n_authors,journal_issn,n_cites,year
704.0001,Calculation of prompt diphoton production cross sections at Tevatron and  LHC energies,10.1103/PhysRevD.76.013009,4,,,2008
704.0006,Bosonic characters of atomic Cooper pairs across resonance,10.1103/PhysRevA.75.043613,2,,,2015
704.0007,Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,3,,,2008
704.0008,Numerical solution of shock and ramp compression for general material  properties,10.1063/1.2975338,1,,,2009
704.0009,"The Spitzer c2d Survey of Large, Nearby, Insterstellar Clouds. IX. The  Serpens YSO Population As Observed With IRAC and MIPS",10.1086/518646,7,,,2010
704.0015,Fermionic superstring loop amplitudes in the pure spinor formalism,10.1088/1126-6708/2007/05/034,1,,,2009
704.0017,Spectroscopic Observations of the Intermediate Polar EX Hydrae in  Quiescence,10.1111/j.1365-2966.2007.11762.x,6,,,2009
704.0021,Molecular Synchronization Waves in Arrays of Allosterically Regulated  Enzymes,10.1103/PhysRevLett.99.048301,3,,,2007
704.0023,ALMA as the ideal probe of the solar chromosphere,10.1007/s10509-007-9626-1,3,,,2009
704.0025,Spectroscopic Properties of Polarons in Strongly Correlated Systems by  Exact Diagrammatic Monte Carlo Method,10.1007/978-1-4020-6348-0_12,2,,,2015


In [22]:
%sql SELECT * FROM author LIMIT 10;

 * postgresql://postgres:***@postgres/postgres
10 rows affected.


author_id,last_name,first_name,middle_name,gender,affiliation,hindex
A'HearnM,A'Hearn,M. F.,,,,
AagesenM,Aagesen,M.,,,,
AaltonenA,Aaltonen,A.,,,,
AarnioH,Aarnio,Harri,,,,
AaronsonS,Aaronson,Scott,,,,
AarsethJ,Aarseth,Jan B.,,,,
AartsJ,Aarts,J.,,,,
AasA,Aas,A. J.,,,,
AazamiA,Aazami,Amir B.,,,,
AbabnehB,Ababneh,Bashar S.,,,,


In [23]:
%sql SELECT * FROM category LIMIT 10;

 * postgresql://postgres:***@postgres/postgres
10 rows affected.


category_id,superdom,subdom
astro-ph,astro-ph,
astro-ph.CO,astro-ph,CO
astro-ph.EP,astro-ph,EP
astro-ph.GA,astro-ph,GA
astro-ph.HE,astro-ph,HE
astro-ph.IM,astro-ph,IM
astro-ph.SR,astro-ph,SR
cond-mat.dis-nn,cond-mat,dis-nn
cond-mat.mes-hall,cond-mat,mes-hall
cond-mat.mtrl-sci,cond-mat,mtrl-sci


# 4. Preparing Graph DB Data
In essence, we need to (a) rename the attributes to comply with Neo4j notation, and (b) save the tables created above as .csv files.

Further reading:

- analyzing arXiv data using Neo4j: https://medium.com/@st3llasia/analyzing-arxiv-data-using-neo4j-part-1-ccce072a2027

- network analysis with these data in Neo4j: https://medium.com/swlh/network-analysis-of-arxiv-dataset-to-create-a-search-and-recommendation-engine-of-articles-cd18b36a185e

- link prediction: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c

The Graph Database Schema is pictured below:
<img src="images/graph_db_schema.png"/>
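Step (a) could look roughly like the sketch below, which rewrites only the header row of a .csv file. It assumes that the target is the header convention of Neo4j's bulk CSV import (`:START_ID`, `:END_ID`), and the node labels `Author` and `Article` are guesses based on the table names; neither is confirmed by the schema above. The helper name `rename_header` is hypothetical.

```python
import csv
import io

def rename_header(csv_text, mapping):
    """Return the CSV text with header columns renamed via 'mapping'."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Columns missing from the mapping keep their original name
    writer.writerow([mapping.get(col, col) for col in header])
    writer.writerows(reader)  # data rows are copied unchanged
    return out.getvalue()

# In-memory stand-in for 'tables/authorship.csv'
authorship_csv = "article_id,author_id\n0704.0001,BalázsC\n"

# Assumed relationship direction: (Author) -> (Article)
rel_csv = rename_header(authorship_csv, {
    "author_id": ":START_ID(Author)",
    "article_id": ":END_ID(Article)",
})
print(rel_csv)
```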

# 5. Example Queries

## 5.1. Data Warehouse

## 5.2. Graph Database

## Total Pipeline Runtime

In [None]:
end_pipe = time.time()

print(f'Time of pipeline end: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {(end_pipe - start_pipe)/60} min.')