# Data Engineering Project 
## ETL

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to clean the main raw data frame and write a new, clean data frame for further use. In this notebook, the comparisons of different read- and write-methods are demonstrated.

First, we install and import the necessary libraries from one cell (to avoid having libraries in some individual cells below). The packages and their versions to be installed will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [None]:
!conda install psycopg2 -y
!pip install pybliometrics

In [None]:
!pip install -r requirements.txt

In [4]:
## NB!! run the installs from terminal
########### Library Installations ##############

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations


### Specific-purpose libraries
# NB! Most configure with an API key
#from pybliometrics.scopus import AbstractRetrieval
from habanero import Crossref # CrossRef API
from genderize import Genderize # Gender API

### Misc
import requests
import warnings # suppress warnings
import os # accessing directories
from tqdm import tqdm # track loop runtime


from scripts.raw_to_tables import *
from scripts.sql_queries import *

#import psycopg2

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings

## Pipeline start

In [5]:
start_pipe = time.time() # Initialize the time of pipeline
start_etl = time.time() # Initialize the time of ETL
print(f'Time of pipeline start: {time.ctime(start_pipe)}')
print()

# Data ingestion
df = ingest_and_process(force = False)

# Prepare Pandas dataframes
authorship, author = authorship_author_extract(df)
article_category, category = article_category_category_extract(df)
article = article_extract(df)
journal = journal_extract()

end_etl = time.time() # Endtime of ETL
print(f'ETL Runtime: {round(end_etl - start_etl, 6)} sec.')

Time of pipeline start: Thu Dec 22 16:38:15 2022

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: qetdrteq
Your Kaggle Key: ········
Downloading arxiv.zip to ./arxiv


100%|██████████████████████████████████████| 1.11G/1.11G [01:37<00:00, 12.2MB/s]



Dimensions of the df with valid DOIs: (1088470, 6)
Dimensions of the df with dropped duplicates: (1088468, 6)
Dimensions of the df with short titles removed: (79958, 6)
Dimensions of the df with only CS papers: (79958, 6)

Data Ingestion Time elapsed: 387.64597606658936 seconds.
Memory usage of raw df: 0.03783251345157623 GB.
ETL Runtime: 394.590855 sec.


### Additional data cleaning

In [7]:
# Clean the data last time: remove all authors with NaNs or too short names
## NaNs
author = author[~author['author_id'].isnull()]
nan_authors = authorship[authorship['author_id'].isnull()]['article_id'].values
article = article.loc[~article['article_id'].isin(nan_authors)]
authorship = authorship.loc[~authorship['article_id'].isin(nan_authors)]

## Too short (< 4) names
author = author[~(author['author_id'].str.len() < 4)].reset_index(drop = True)
short_authors = authorship[(authorship['author_id'].str.len() < 4)]['article_id'].values
article = article.loc[~article['article_id'].isin(short_authors)].reset_index(drop = True)
authorship = authorship.loc[~authorship['article_id'].isin(short_authors)].reset_index(drop = True)

# 2. Data Augmentation

In [None]:
# Tables:
## authorship
## article_category
## category
## journal <-- augment all data (use ISSN from DOI)
## article <-- augment with number of citations
## author <-- augment with gender and affiliation

### Gender
In order to query 'gender' of a given author, we first extract all valid (length > 3) first names. We acknowledge that there may be first names that are smaller than four characters in length, but given that query amount is limited, we are going with a more robust way to extract as many names as possible.

To that end, we are querying the names via the Genderize.io API. It allows for querying 1500 names per day. We exract the names and probabilities, and update our own data table with these data. We then finally join the table by firstname to include the gender column.

In [8]:
# Extract unique valid first names and create a temporary df with firstname and gender
names_genders = pd.DataFrame(np.sort(author[author['first_name'].str.len() > 4]['first_name'].unique()))
names_genders.columns = ['first_name']
names_genders.head()

# Data from names_genders df
names_genders_file = pd.read_csv('names_genders.csv')

# Merge the dfs
names_genders = names_genders.merge(names_genders_file, on = 'first_name', how = 'left')
names_genders.head()

In [13]:
def update_names_table(names_genders):
    
    # For loop querying the genderize.io API
    for i in tqdm(range(len(names_genders))):
        # Extract the name
        first_name = names_genders.loc[i, 'first_name'] # first name
        # Check if the name has already been checked
        ## Query only if the name hasn't been checked already
        if names_genders.loc[i, 'prob'] >= 0 and names_genders.loc[i, 'prob'] <= 1:
            pass
        else:
            try: 
                gender_info = Genderize().get([first_name])
                names_genders.loc[i, 'gender'] = gender_info[0]['gender']
                names_genders.loc[i,'prob'] = gender_info[0]['probability']
            except:
                print(f'Iteration nr {i}')
                print('Limit likely exceeded.')
                break
            finally:
                # Write to csv
                names_genders.to_csv('names_genders.csv', index = False)

In [15]:
update_names_table(names_genders)

  5%|█▋                                  | 1046/21979 [12:23<4:07:56,  1.41it/s]

Iteration nr 1046
Limit likely exceeded.





In [19]:
names_genders[:1050]

Unnamed: 0,first_name,gender,prob
0,'Gholamali,,0.0
1,'Omid,,0.0
2,AL-Qawasmi,,0.0
3,Aabhas,male,1.0
4,Aadarsh,male,1.0
...,...,...,...
1045,António,male,1.0
1046,Antônio,,
1047,Anuar,,
1048,Anubhab,,


In [None]:
# For loop querying the genderize.io API
for i in tqdm(range(len(names_genders))):
    # Extract the name
    first_name = names_genders.loc[i, 'first_name'] # first name
    
    # Check if the name has already been checked
    ## Query only if the name hasn't been checked already
    if names_genders.loc[i, 'prob'] >= 0 and names_genders.loc[i, 'prob'] <= 1:
        pass
#    else:
        gender_info = Genderize().get([first_name])
        names_genders.loc[i, 'gender'] = gender_info[0]['gender']
        names_genders.loc[i,'prob'] = gender_info[0]['probability']

# Write to csv
# names_genders.to_csv('names_genders.csv', index = False)

In [None]:
# Import gender table
names_genders = pd.read_csv('names_genders.csv')
# Exclude the names that were not found
found_names = names_genders[names_genders['prob']>0]
# Gender values to 'M' and 'F'
found_names['gender'] = found_names['gender'].replace(to_replace=['male','female'], value=['M', 'F'])

author = author.merge(found_names[['first_name', 'gender']], on = ['first_name'], how = 'right')
author.head()

### Article
This section serves both the augmentation as well as data cleaning function. First, the articles are checked for type: only journal articles are being extracted (later, the records that are not journal articles, alongside with authors, etc, will be deleted). Second, article citation count is extracted. Finally, journal ISSN is extracted. The latter is later used for retrieving journal title.

In [None]:
article

In [None]:
## Scientific records
DOIs = article['doi']

# Initialize Crossref
cr = Crossref()

for i in tqdm(range(len(article))):
    doi = DOIs[i]
    ref_obj = cr.works(query= doi)['message']['items'][0]
    pub_type = ref_obj['type']
  #  print(f'Publication type: {pub_type}')
    
    if pub_type == 'journal-article':
        
     #   print(f'Fetching... {doi}')
              
        article.loc[i, 'n_cites'] = ref_obj['reference-count']
        article.loc[i, 'journal_issn'] = ref_obj['ISSN'][0]
        
    #    print(f'DOI {doi} Fetched!')
    #    print()
        
    else:
    #    print('Not a journal article, passed')
    #    print()
        pass

In [None]:
for index, row in test_authorship.iterrows():
    #Get row of Article
    art_row = article.loc[article["article_id"] == row["article_id"]]
    doi = art_row["doi"].iloc[0]
    data = get_publication_data(doi)
    if data is not None:
        # Update the dataframe with data from the API
        if "author" in data["message"]:
            message = data["message"]["author"]
            for i in range(len(message)):
                if "name" in message[i]['affiliation']:
                    test_author.loc[test_author['last_name'] == message[i]['family'], 'affiliation'] = message[i]['affiliation'][0]

In [None]:
ref_obj

### To .csv

In [None]:
# Make a directory 'tables'
!mkdir tables

In [None]:
authorship.to_csv('tables/authorship.csv', index = False)
article_category.to_csv('tables/article_category.csv', index = False)
category.to_csv('tables/category.csv', index = False)
journal.to_csv('tables/journal.csv', index = False)
article.to_csv('tables/article.csv', index = False)
author.to_csv('tables/author.csv', index = False)

# 3. From Pandas to PostgreSQL

In [None]:
# Import the data from Pandas
authorship = pd.read_csv('tables/authorship.csv')
article_category = pd.read_csv('tables/article_category.csv')
category = pd.read_csv('tables/category.csv')
article = pd.read_csv('tables/article.csv')
author = pd.read_csv('tables/author.csv')
journal = pd.read_csv('tables/journal.csv')

tables = [authorship, article_category, category, article, author, journal]

# Name of tables (for later print)
authorship.name = 'authorship'
article_category.name = 'article_category'
category.name = 'category'
article.name = 'article'
author.name = 'author'
journal.name = 'journal'

In [None]:
journal

# Database Connection

In [None]:
# Connect to the database
conn = psycopg2.connect(host="postgres", user="postgres", password="password", database="postgres")
conn.set_session(autocommit=True)
cur = conn.cursor()

# create sparkify database with UTF8 encoding
cur.execute("DROP DATABASE IF EXISTS research_db")
cur.execute("CREATE DATABASE research_db WITH ENCODING 'utf8' TEMPLATE template0")

## Load the possiblity to run magic function

In [None]:
%load_ext sql
%sql postgresql://postgres:password@postgres/postgres

# Drop Tables

In [None]:
# Drop Tables 
for query in drop_tables:
    cur.execute(query)
    conn.commit()

In [None]:
# Check that a table, e.g., 'jounal', is not in the database
%sql SELECT * FROM journal

# Create Tables

In [None]:
for query in create_tables:
        cur.execute(query)
        conn.commit()

In [None]:
# Check that the tables (e.g., 'author') are created
## Should be empty
%sql SELECT * FROM journal

# Insert into Tables

In [None]:
def insert_to_tables(table, query):
    ''' Helper function for inserting values to Postresql tables
    Args:
        table (pd.DataFrame): pandas table
        query (SQL query): correspondive SQL query for 'table' for data insertion in DB
    '''
    
    print(f'Inserting table -- {table.name} -- ...')
    
    try:
        for i, row in table.iterrows():
            cur.execute(query, list(row))
        print(f'Table -- {table.name} -- successfully inserted!')
    except:
        print(f'Error with table -- {table.name} --')
    print()
        
for  i in range(len(tables)):
    insert_to_tables(tables[i], insert_tables[i])

In [None]:
%sql SELECT * FROM author LIMIT 10

# Test Queries

In [None]:
%sql SELECT * FROM authorship LIMIT 10;

In [None]:
%sql SELECT * FROM article_category LIMIT 10;

In [None]:
%sql SELECT * FROM article LIMIT 10;

In [None]:
%sql SELECT * FROM category LIMIT 10;

In [None]:
%sql SELECT * FROM journal LIMIT 10;

# 4. Preparing Graph DB Data
In essence, we need to (a) rename the attributes to be compliant with Neo4J notation, and (b) save the above-created tables to .csv-s: https://medium.com/@st3llasia/analyzing-arxiv-data-using-neo4j-part-1-ccce072a2027

- about network analysis with these data in Neo4J: https://medium.com/swlh/network-analysis-of-arxiv-dataset-to-create-a-search-and-recommendation-engine-of-articles-cd18b36a185e

- link prediction: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c

The Graph Database Schema is pictured below:
<img src="images/graph_db_schema.png"/>

# 5. Example Queries

## 5.1. Data Warehouse

## 5.2. Graph Database

## Total Pipeline Runtime

In [None]:
end_pipe = time.time()

print(f'Time of pipeline start: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {(end_pipe - start_pipe)/60} min.')

In [None]:
# !python -m pip install "dask[complete]" 

In [None]:
# import dask.dataframe as dd
# use it with chunks
#df = dd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True)

In [None]:
df_raw  = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True, nrows = 100)

In [None]:
df_raw.columns

In [None]:
df_raw = df_raw[~df_raw['doi'].isnull()]
df_raw = df_raw[(df_raw['categories'].str.contains('cs.')) & (~df_raw['categories'].str.contains('physics'))].reset_index(drop = True)



In [None]:
df_raw.head()