# Data Engineering Project 
## From Pandas DataFrame to DWH Staging Tables

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to convert the cleaned `pandas` dataframe into DWH fact and dimension tables.

First, we install and import the necessary libraries in a single cell (to avoid scattering imports across the individual cells below). The packages to be installed and their versions will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [1]:
########### Library Installations ##############
# !pip install opendatasets # install the library for downloading the data set
# ! pip install habanero

################### Imports ####################
### Data wrangling
import numpy as np # general mathematical and algebraic operations
import pandas as pd # working with dataframes

### Specific-purpose libraries
from habanero import Crossref # CrossRef API

### Misc
import warnings 

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # disable warnings

## 1. Data import
Here, we import the cleaned data and explore its dimensions.

In [2]:
df = pd.read_feather('df_clean.feather')
print(f'Dataframe dimensions: {df.shape}')
print(f'Memory usage of raw pandas df: {df.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
df.head(3)

Dataframe dimensions: (27499, 7)
Memory usage of raw pandas df: 0.016686415299773216 GB.


Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0062,"Rastislav \v{S}r\'amek, Bro\v{n}a Brejov\'a, T...",On-line Viterbi Algorithm and Its Relationship...,10.1007/978-3-540-74126-8_23,cs.DS,2010-01-25,"[[Šrámek, Rastislav, ], [Brejová, Broňa, ], [V..."
1,704.0301,Akitoshi Kawamura,Differential Recursion and Differentially Alge...,10.1145/1507244.1507252,cs.CC,2009-04-19,"[[Kawamura, Akitoshi, ]]"
2,704.1267,"Laurence Likforman-Sulem, Abderrazak Zahour, B...",Text Line Segmentation of Historical Documents...,10.1007/s10032-006-0023-z,cs.CV,2007-05-23,"[[Likforman-Sulem, Laurence, ], [Zahour, Abder..."


## 2. Fact and Dimension tables for Data Warehouse (DWH)

Here, we create the tables with placeholder columns. In this data schema, we use a factless fact table, `authorship`, that links articles (and their properties) to authors.

**Fact table** <br>
- `authorship`: links articles to authors
    - `article_id`: VARCHAR article id (the same id as in the original, raw df, so rows can be traced back to it)
    - `author_id`: VARCHAR composed of the author's last name and first-name initial (e.g., LastF)

**Dimension tables** <br>
- `article`: contains the information about all unique publications and links the dimension tables. The columns are:
    - PK `article_id`: VARCHAR article id (the same id as in the original, raw df, so rows can be traced back to it)
    - `title`: VARCHAR article title
    - `doi`: VARCHAR article DOI
    - `journal_id`: VARCHAR journal ID based on ISSN, linking to the `journal` table
    - `date`: DATE linking to the `date` table
    - `n_cites`: INT the number of citations (FACT)
    - `n_authors`: INT the number of co-authors
    

- `author`: includes all individual authors of publications.
    - PK `author_id`: VARCHAR composed from author's last name and first name initial (e.g., LastF)
    - `lastname`: VARCHAR author's last name 
    - `first`: VARCHAR author's first name initial
    - `middle`: VARCHAR author's middle name initial (if any)
    - `gender`: INT (1 or 0), denoting 'Female' and 'Male', respectively (AUGMENTED VIA API!)
    
    
- `journal`: includes all unique journals in which works were published
    - PK `journal_id`: VARCHAR journal ID
    - `issn`: VARCHAR journal ISSN (necessary for augmentation)
    - `title`: VARCHAR journal title
    - `if_latest`: FLOAT journal's latest Impact Factor (AUGMENTED VIA API!)
    
    
- `date`: includes publication-related data.
    - PK `date`: DATE in `YYYY-MM-DD` format (as stored in `update_date`)
    - `year`: INT year
    - `month`: INT month
    
    
**<span style="color:red"> The DWH schema is depicted below: TO BE ADDED!! </span>** 
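
Until the diagram is available, the schema listed above can be sketched as table definitions. The following is only an illustration: it assumes SQLite (through Python's built-in `sqlite3` module) as a stand-in for the DWH engine, which this notebook does not specify, and it names the date table `date_dim`, which is our own choice. Column names and types are taken from the lists above.

```python
import sqlite3

# Assumption: SQLite is used purely for illustration; the target DWH engine is
# not fixed in this notebook. `date_dim` is a hypothetical table name.
DDL = """
CREATE TABLE date_dim (
    "date" TEXT PRIMARY KEY,   -- ISO date string, e.g. 2010-01-25
    year   INTEGER,
    month  INTEGER
);
CREATE TABLE journal (
    journal_id TEXT PRIMARY KEY,
    issn       TEXT,
    title      TEXT,
    if_latest  REAL
);
CREATE TABLE author (
    author_id TEXT PRIMARY KEY,
    lastname  TEXT,
    "first"   TEXT,
    middle    TEXT,
    gender    INTEGER
);
CREATE TABLE article (
    article_id TEXT PRIMARY KEY,
    title      TEXT,
    doi        TEXT,
    journal_id TEXT REFERENCES journal(journal_id),
    "date"     TEXT REFERENCES date_dim("date"),
    n_cites    INTEGER,
    n_authors  INTEGER
);
CREATE TABLE authorship (
    article_id TEXT REFERENCES article(article_id),
    author_id  TEXT REFERENCES author(author_id)
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(DDL)

# List the tables that were created
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The factless fact table `authorship` carries only the two foreign keys, while the measures `n_cites` and `n_authors` live in `article`, as in the description above.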

### 2.1. Dimension: `articles`

In [3]:
articles = pd.DataFrame(columns = ['article_id', 'title', 'doi', 'n_authors', 'journal_issn', 'n_cites', 'date'])
articles.head()
articles['article_id'] = df['id']
articles['title'] = df['title']
articles['doi'] = df['doi']
articles['n_authors'] = df['authors_parsed'].str.len() # get the number of authors
articles['date'] = df['update_date']
print(f'Dataframe dimensions: {articles.shape}')
print(f'Memory usage of raw pandas df: {articles.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
articles.head()

Dataframe dimensions: (27499, 7)
Memory usage of raw pandas df: 0.01073651947081089 GB.


Unnamed: 0,article_id,title,doi,n_authors,journal_issn,n_cites,date
0,704.0062,On-line Viterbi Algorithm and Its Relationship...,10.1007/978-3-540-74126-8_23,3,,,2010-01-25
1,704.0301,Differential Recursion and Differentially Alge...,10.1145/1507244.1507252,1,,,2009-04-19
2,704.1267,Text Line Segmentation of Historical Documents...,10.1007/s10032-006-0023-z,3,,,2007-05-23
3,704.2344,Parallel computing for the finite element method,10.1051/epjap:1998151,3,,,2007-05-23
4,704.3238,Alternative axiomatics and complexity of delib...,10.1007/s10992-007-9078-7,3,,,2011-04-29


### 2.2. Fact: `authorship`
Here, we create a factless fact table `authorship` that links individual authors to specific article IDs. Because this table is initially used for creating the `authors` table, redundant columns are present in the following cell output. These columns are removed after the `authors` table has been created.

In [4]:
# Create the table from article id and authors list
authorship = df[['id', 'authors_parsed']].set_index('id')
authorship = pd.DataFrame(authorship['authors_parsed'].explode()).reset_index()

# Extract the last, first, and middle names from each [last, first, middle] entry
# (the .str accessor also indexes list-like elements, so no Python loop is needed)
authorship['last_name'] = authorship['authors_parsed'].str[0]
authorship['first_name'] = authorship['authors_parsed'].str[1]
authorship['middle_name'] = authorship['authors_parsed'].str[2]

# Drop the redundant column
authorship = authorship.drop(columns = 'authors_parsed')

# Author_identifier
authorship['author_id'] = authorship['last_name'] + authorship['first_name'].str[0]
# Rename article id column
authorship = authorship.rename({'id':'article_id'}, axis = 1)

# Final table
print(f'Dataframe dimensions: {authorship.shape}')
print(f'Memory usage of raw pandas df: {authorship.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
authorship.head()

Dataframe dimensions: (101334, 5)
Memory usage of raw pandas df: 0.029967344366014004 GB.


Unnamed: 0,article_id,last_name,first_name,middle_name,author_id
0,704.0062,Šrámek,Rastislav,,ŠrámekR
1,704.0062,Brejová,Broňa,,BrejováB
2,704.0062,Vinař,Tomáš,,VinařT
3,704.0301,Kawamura,Akitoshi,,KawamuraA
4,704.1267,Likforman-Sulem,Laurence,,Likforman-SulemL
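
The reshaping in this cell can be reproduced on a toy example. The sketch below uses made-up rows that only mirror the structure of `authors_parsed` (a list of `[last, first, middle]` entries per article); the names `toy` and `long_df` are ours.

```python
import pandas as pd

# Toy input mirroring the structure of `authors_parsed` (values are illustrative)
toy = pd.DataFrame({
    'id': ['a1', 'a2'],
    'authors_parsed': [
        [['Doe', 'Jane', ''], ['Roe', 'Richard', 'K.']],
        [['Doe', 'Jane', '']],
    ],
})

# One row per (article, author) pair
long_df = toy.explode('authors_parsed').reset_index(drop=True)

# Split each [last, first, middle] entry into three columns
names = pd.DataFrame(long_df['authors_parsed'].tolist(),
                     columns=['last_name', 'first_name', 'middle_name'])
long_df = pd.concat([long_df[['id']].rename(columns={'id': 'article_id'}), names], axis=1)

# Author identifier: last name followed by the first-name initial
long_df['author_id'] = long_df['last_name'] + long_df['first_name'].str[0]
print(long_df)
```

Note that an identifier of the form `LastF` is not guaranteed to be unique: two different people called, say, Jane Doe and John Doe would both map to `DoeJ`.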


### 2.3. Dimension: `authors`
The authors table is created so that the data are first copied from the `authorship` table. Then, duplicate rows are dropped. Finally, redundant columns are removed from the `authorship` table.

In [5]:
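# Create the table from the `authorship` table (copy, so that we do not modify a view)
author = authorship[['author_id', 'last_name', 'first_name', 'middle_name']].copy()

# Drop duplicates
# Note: keep=False removes every row that has a duplicate, so authors who appear
# on more than one article are dropped entirely; keep='first' would retain one row per author
author = author.drop_duplicates(keep=False)

# Add the `gender` column to be augmented
author['gender'] = np.nan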

# Final table
print(f'Dataframe dimensions: {author.shape}')
print(f'Memory usage of raw pandas df: {author.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
author.head()

Dataframe dimensions: (47021, 5)
Memory usage of raw pandas df: 0.011723142117261887 GB.


Unnamed: 0,author_id,last_name,first_name,middle_name,gender
0,ŠrámekR,Šrámek,Rastislav,,
3,KawamuraA,Kawamura,Akitoshi,,
4,Likforman-SulemL,Likforman-Sulem,Laurence,,
5,ZahourA,Zahour,Abderrazak,,
6,TaconetB,Taconet,Bruno,,
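
The effect of the `keep` argument of `drop_duplicates` matters for this table, so a small illustration on made-up rows may help. With `keep=False`, all copies of a duplicated row are removed; with `keep='first'`, one copy survives.

```python
import pandas as pd

# Made-up rows: 'DoeJ' appears on two articles, 'RoeR' on one
toy = pd.DataFrame({'author_id': ['DoeJ', 'RoeR', 'DoeJ'],
                    'last_name': ['Doe', 'Roe', 'Doe']})

all_copies_removed = toy.drop_duplicates(keep=False)    # drops both 'DoeJ' rows
one_row_per_author = toy.drop_duplicates(keep='first')  # keeps the first 'DoeJ' row

print(all_copies_removed['author_id'].tolist())  # ['RoeR']
print(one_row_per_author['author_id'].tolist())  # ['DoeJ', 'RoeR']
```

If the `authors` dimension is meant to contain every author referenced in `authorship`, the second variant is the one that preserves that property.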


In [6]:
# Remove redundant columns from the `authorship` table
authorship = authorship.drop(columns = ['last_name', 'first_name', 'middle_name'])

# Final table
print(f'Dataframe dimensions: {authorship.shape}')
print(f'Memory usage of raw pandas df: {authorship.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
authorship.head()

Dataframe dimensions: (101334, 2)
Memory usage of raw pandas df: 0.012463199906051159 GB.


Unnamed: 0,article_id,author_id
0,704.0062,ŠrámekR
1,704.0062,BrejováB
2,704.0062,VinařT
3,704.0301,KawamuraA
4,704.1267,Likforman-SulemL


### 2.4. Dimension: `time_dim`
For this table, we use the date (`YYYY-MM-DD`) as the ID and extract the year and month of publication from it.

In [7]:
# Time_dim
time_dim = pd.DataFrame({'date' : df['update_date'].unique()}) # rename the date variable and find unique dates
date_parts = time_dim['date'].str.split('-', expand = True) # split 'YYYY-MM-DD' once
time_dim['year'] = date_parts[0].astype(int) # extract the year as an integer
time_dim['month'] = date_parts[1].astype(int) # extract the month as an integer
time_dim = time_dim.sort_values(by=['date']).reset_index(drop = True) # sort by date
print(f'Dimensions: {time_dim.shape}')
time_dim.head()

Dimensions: (3085, 3)


Unnamed: 0,date,year,month
0,2007-05-23,2007,5
1,2007-05-24,2007,5
2,2007-05-25,2007,5
3,2007-05-29,2007,5
4,2007-06-11,2007,6
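
As an alternative to string splitting, the same table could be built by parsing the dates first. The sketch below is not the approach used above; it assumes the dates are ISO-formatted strings like those in `update_date`, and uses three made-up values in place of the real column.

```python
import pandas as pd

# Made-up dates standing in for df['update_date']
dates = pd.Series(['2010-01-25', '2007-05-23', '2010-01-25'], name='date')

# Parse the unique dates, then read year and month from the datetime accessor
toy_time_dim = pd.DataFrame({'date': pd.to_datetime(dates.unique())})
toy_time_dim['year'] = toy_time_dim['date'].dt.year
toy_time_dim['month'] = toy_time_dim['date'].dt.month
toy_time_dim = toy_time_dim.sort_values('date').reset_index(drop=True)
print(toy_time_dim)
```

Parsing also validates the input: a malformed date string raises an error instead of silently producing a wrong year or month.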


### 2.5. Dimension: `journals`

In [8]:
journals = pd.DataFrame(columns = ['journal_id', 'issn', 'title', 'if_latest'])
journals.head()

Unnamed: 0,journal_id,issn,title,if_latest


## 3. Data Augmentation
In this section, the CrossRef API (accessed through the `habanero` client imported above) is used to retrieve paper metadata based on DOI. We fetch the number of citations for each paper and the journal ISSN for the `articles` table, and the journal ISSN and journal name for the `journals` table.

First, empty dataframes are created for the incomplete tables. Then, API queries are made to update the `articles` table. Because the query result should, in principle, also contain the journal title and the author list, it should be possible to update those tables, too.

With regards to `journals`, one approach would be to do the following:

1. The API query is made based on DOI
2. Journal ISSN is saved to the `articles` table
3. Journal ISSN and title are saved into a separate structure (e.g., two lists with matching indices, or a dataframe).
4. Once the queries are completed, duplicates are removed from the `journals` table.
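
The four steps above can be sketched offline. In the example below, the dictionary `messages` is a hypothetical stand-in for CrossRef responses: the field names `type`, `ISSN`, and `container-title` follow the CrossRef work record, but the DOIs and values are invented, and `toy_articles` and `toy_journals` are illustrative names.

```python
import pandas as pd

# Hypothetical, trimmed-down CrossRef `message` payloads keyed by DOI (invented values)
messages = {
    '10.1000/a': {'type': 'journal-article', 'ISSN': ['1234-5678'],
                  'container-title': ['Journal A']},
    '10.1000/b': {'type': 'journal-article', 'ISSN': ['1234-5678', '8765-4321'],
                  'container-title': ['Journal A']},
    '10.1000/c': {'type': 'proceedings-article', 'container-title': ['Conf C']},
}
toy_articles = pd.DataFrame({'doi': list(messages), 'journal_issn': None})

rows = []
for i, doi in enumerate(toy_articles['doi']):
    msg = messages[doi]                                  # step 1: query by DOI
    issn = (msg.get('ISSN') or [None])[0]                # first listed ISSN, if any
    title = (msg.get('container-title') or [None])[0]
    if issn is None:
        continue                                         # nothing to link without an ISSN
    toy_articles.loc[i, 'journal_issn'] = issn           # step 2: ISSN into `articles`
    rows.append({'issn': issn, 'title': title})          # step 3: collect journal info

# Step 4: one row per journal
toy_journals = pd.DataFrame(rows).drop_duplicates(subset='issn').reset_index(drop=True)
print(toy_journals)
```

Deduplicating on the first listed ISSN is a simplification: a journal with both a print and an electronic ISSN could appear under either one, depending on the order in which the record lists them.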

With `authors`, it is trickier. Since the original dataset includes quite messy data (e.g., in some cases there are full names, in some cases only given name initials), it would make sense to fetch the names of authors, in standardized form, from the API query. A brute-force solution would be as follows:

1. The API query is made based on DOI
2. The results are saved either as two lists with matching indices or as a pandas dataframe, where the first column holds the `article_id` and the second column, say, `authors_ids`, holds the list of authors. Alternatively, the data can be saved in long format, where `article_id` is repeated for each individual `author_name`.
3. Once the authors are fetched, names should be normalized ('cleaned').
4. Unique names are extracted.
5. Each unique author is assigned an ID.
6. Update the table `articles` with the list of author IDs.
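
Steps 2 to 6 can be sketched on a few made-up rows in long format. The `normalize` helper below is hypothetical and represents just one possible cleaning convention (strip accents, collapse whitespace, title-case); the column names `family` and `given` are our own. Instead of writing a list of IDs into `articles`, the sketch produces a long-format link table, which is the alternative mentioned in step 2.

```python
import unicodedata
import pandas as pd

def normalize(name: str) -> str:
    """Strip accents, collapse whitespace, and title-case a name (one possible convention)."""
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_accents.split()).title()

# Step 2: long format, one row per (article, author); the values are invented
fetched = pd.DataFrame({
    'article_id': ['a1', 'a1', 'a2'],
    'family': ['Šrámek', 'brejová ', 'ŠRÁMEK'],
    'given': ['Rastislav', 'Broňa', 'R.'],
})

# Step 3: normalize the names
fetched['family'] = fetched['family'].map(normalize)
fetched['initial'] = fetched['given'].map(normalize).str[0]

# Steps 4-5: unique authors, each with an ID (last name + first-name initial)
toy_authors = fetched[['family', 'initial']].drop_duplicates().reset_index(drop=True)
toy_authors['author_id'] = toy_authors['family'] + toy_authors['initial']

# Step 6: link every article to its author IDs
toy_authorship = fetched.merge(toy_authors, on=['family', 'initial'])[['article_id', 'author_id']]
print(toy_authorship)
```

Stripping accents and title-casing are lossy (for instance, `str.title()` turns `McDonald` into `Mcdonald`), so the convention should be chosen with the downstream matching in mind.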

For testing out these solutions, it may be a good idea to sample some rows from the entire df and run the queries on those rows.

In [9]:
# Sample a df of size N 
N = 50
test_df = articles.sample(N).reset_index(drop = True)

# Head of the df
test_df.head(3)

Unnamed: 0,article_id,title,doi,n_authors,journal_issn,n_cites,date
0,1405.5341,A fast algorithm for computing the characteris...,10.1145/2608628.2608650,3,,,2014-05-22
1,1904.01392,Context-Aware Misbehavior Detection Scheme for...,10.1016/j.vehcom.2019.100186,1,,,2019-10-03
2,1109.2142,Generalizing Boolean Satisfiability III: Imple...,10.1613/jair.1656,5,,,2011-09-13


In [None]:
## Testing the crossref API
cr = Crossref() # CrossRef client from habanero

author_papers = pd.DataFrame(columns = ['article_id', 'author_family', 'author_given'])

ids = []
author_list = []
for i in range(len(test_df)):
    qry_rslt = cr.works(ids = test_df['doi'][i])['message'] # query the DOI via the CrossRef API

    if qry_rslt['type'] == 'journal-article': # select only journal articles
        # 'is-referenced-by-count' is the number of times the work has been cited;
        # 'reference-count' is the length of the work's own reference list
        test_df.loc[i, 'n_cites'] = qry_rslt['is-referenced-by-count'] # citation count
        issn = qry_rslt.get('ISSN') # the ISSN list may be missing for some records
        if issn:
            test_df.loc[i, 'journal_issn'] = issn[0] # take the first listed ISSN

        # Not used yet: collecting the author list for each article
        # ids.append(test_df['article_id'][i])
        # author_list = qry_rslt['author']
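
Since a single failed lookup would abort the loop, it may be worth separating the update logic from the network call. The sketch below is one way to do this, not part of the pipeline above: `enrich_articles` and `fake_fetch` are hypothetical names, and `fetch` can be any callable that maps a DOI to a CrossRef-style message dictionary (for live use, something like `lambda doi: cr.works(ids = doi)['message']`). The fake payload is invented so that the example runs offline.

```python
import pandas as pd

def enrich_articles(df, fetch):
    """Return a copy of `df` with `n_cites` and `journal_issn` filled where available.

    `fetch` is any callable mapping a DOI to a CrossRef-style message dict.
    """
    out = df.copy()
    for idx, doi in out['doi'].items():
        try:
            msg = fetch(doi)
        except Exception:  # e.g., unknown DOI or network error: leave the row untouched
            continue
        if msg.get('type') != 'journal-article':  # keep only journal articles
            continue
        out.loc[idx, 'n_cites'] = msg.get('is-referenced-by-count')
        issns = msg.get('ISSN') or []
        if issns:
            out.loc[idx, 'journal_issn'] = issns[0]
    return out

# Offline stand-in for the API (invented values)
fake = {'10.1000/a': {'type': 'journal-article',
                      'is-referenced-by-count': 7,
                      'ISSN': ['1234-5678']}}

def fake_fetch(doi):
    return fake[doi]  # raises KeyError for DOIs that are not in `fake`

toy = pd.DataFrame({'doi': ['10.1000/a', '10.1000/missing'],
                    'n_cites': None, 'journal_issn': None})
result = enrich_articles(toy, fake_fetch)
print(result)
```

Passing the lookup in as a function also makes it easy to test the update logic on the sampled `test_df` without sending any requests.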