# Data Engineering Project 
## From Pandas DataFrame to DWH Staging Tables

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to convert the cleaned `pandas` dataframe into DWH fact and dimension tables.

First, we install and import the necessary libraries in a single cell (to avoid scattering imports across the individual cells below). The packages to be installed, and their versions, will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [1]:
########### Library Installations ##############
# !pip install opendatasets # install the library for downloading the data set
# ! pip install habanero

################### Imports ####################
### Data wrangling
import numpy as np # general mathematical and algebraic operations
import pandas as pd # working with dataframes

### Specific-purpose libraries
from habanero import Crossref # CrossRef API

### Misc
import warnings 

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # disable warnings

## 1. Data import
Here, we import the cleaned data and explore its dimensions.

In [2]:
df = pd.read_feather('df_clean.feather')
print(f'Dataframe dimensions: {df.shape}')
print(f'Memory usage of raw pandas df: {df.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
df.head(2)

Dataframe dimensions: (1077226, 7)
Memory usage of raw pandas df: 0.6910948613658547 GB.


Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0001,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,hep-ph,2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0006,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,cond-mat.mes-hall,2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"


## 2. Fact and Dimension tables for Data Warehouse (DWH)

Here, we create the tables with placeholder columns.

**Fact table** <br>
- `articles`: contains the information about all unique publications and links the dimension tables. The columns are:
    - PK `article_id`: VARCHAR article id (allows this id to be looked up in the original, raw df)
    - `title`: VARCHAR article title
    - `doi`: VARCHAR article DOI
    - `journal_id`: VARCHAR journal ID based on ISSN
    - `date`: DATE
    - `authors_ids`: LIST list of author ids.
    - `n_cites`: INT the number of citations (FACT)
    - `n_authors`: INT the number of co-authors

**Dimension tables** <br>
- `authors`: includes all individual authors of publications.
    - PK `author_id`: BIGINT (type to be confirmed)
    - `lastname`: VARCHAR author's last name 
    - `first`: VARCHAR author's first name initial
    - `middle`: VARCHAR author's middle name initial (if any)
    - `country`: VARCHAR author's affiliation country (not in the source data; requires augmentation, or could be replaced by gender)
    
    
- `journals`: includes all unique journals in which works were published
    - PK `journal_id`: VARCHAR journal ID
    - `issn`: VARCHAR journal ISSN
    - `title`: VARCHAR journal title
    - `if_latest`: FLOAT journal's latest Impact Factor
    
    
- `time_dim`: includes publication-related data.
    - PK `date`: DATE YYYY-MM-DD
    - `year`: INT year
    - `month`: INT month

### 2.1. Dimension: `time_dim`
For this table, we use the date (`YYYY-MM-DD`, taken from `update_date`) as the ID and extract the year and month from it.

In [3]:
# Time_dim
time_dim = pd.DataFrame({'date' : df['update_date'].unique()}) # rename the date variable and find unique dates
date_parts = time_dim['date'].str.split('-', expand = True) # split 'YYYY-MM-DD' once into year, month, day
time_dim['year'] = date_parts[0].astype(int) # extract the year as an integer
time_dim['month'] = date_parts[1].astype(int) # extract the month as an integer
time_dim = time_dim.sort_values(by=['date']).reset_index(drop = True) # sort by date
print(f'Dimensions: {time_dim.shape}')
time_dim.head()

Dimensions: (4428, 3)


Unnamed: 0,date,year,month
0,2007-05-23,2007,5
1,2007-05-24,2007,5
2,2007-05-25,2007,5
3,2007-05-28,2007,5
4,2007-05-29,2007,5
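
As an alternative to string splitting, the year and month could also be derived by parsing the dates first. The sketch below runs on a few toy dates rather than on `df`; `toy_dates` and `time_dim_alt` are hypothetical names, and it assumes the dates are strings in `YYYY-MM-DD` form.

```python
import pandas as pd

# Toy stand-in for df['update_date'] (assumption: 'YYYY-MM-DD' strings)
toy_dates = pd.Series(["2008-11-26", "2015-05-13", "2008-11-26"], name="update_date")

time_dim_alt = pd.DataFrame({"date": toy_dates.unique()})  # unique dates only
parsed = pd.to_datetime(time_dim_alt["date"], format="%Y-%m-%d")  # raises on malformed dates
time_dim_alt["year"] = parsed.dt.year    # integer year
time_dim_alt["month"] = parsed.dt.month  # integer month (1-12)
time_dim_alt = time_dim_alt.sort_values("date").reset_index(drop=True)
print(time_dim_alt)
```

Parsing with an explicit `format` has the side benefit of failing loudly if a date does not match the expected layout.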


### 2.2. Dimension: `journals`

In [4]:
journals = pd.DataFrame(columns = ['journal_id', 'issn', 'title', 'if_latest'])
journals.head()

Unnamed: 0,journal_id,issn,title,if_latest


### 2.3. Dimension: `authors`

In [5]:
authors = pd.DataFrame(columns = ['author_id', 'lastname', 'first', 'middle', 'country'])
authors.head()

Unnamed: 0,author_id,lastname,first,middle,country


### 2.4. Fact: `articles`

In [17]:
# Create the table with placeholder columns, then fill the columns already available in df
articles = pd.DataFrame(columns = ['article_id', 'title', 'doi', 'n_authors', 'journal_issn', 'authors', 'date', 'n_cites'])
articles['article_id'] = df['id']
articles['title'] = df['title']
articles['doi'] = df['doi']
articles['n_authors'] = df['authors_parsed'].str.len() # get the number of authors
articles['date'] = df['update_date']

articles.head()

Unnamed: 0,article_id,title,doi,n_authors,journal_issn,authors,date,n_cites
0,704.0001,Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,4,,,2008-11-26,
1,704.0006,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,2,,,2015-05-13,
2,704.0007,Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,3,,,2008-11-26,
3,704.0008,Numerical solution of shock and ramp compressi...,10.1063/1.2975338,1,,,2009-02-05,
4,704.0009,"The Spitzer c2d Survey of Large, Nearby, Inste...",10.1086/518646,7,,,2010-03-18,


In [18]:
articles.shape

(1077226, 8)

## 3. Data Augmentation
In this section, an API is used to retrieve paper data based on DOI. The choice of API is still open (Google Scholar or CrossRef); CrossRef is the one tested below. We fetch the number of citations for each paper and the journal ISSN for the `articles` table, and the journal ISSN and journal name for the `journals` table.

First, empty dataframes are created for the incomplete tables. Then, API queries are made to update the `articles` table. Because the query should, in theory, also return the journal title and the author list, it should be possible to update those tables, too.

With regard to `journals`, one approach would be the following:

1. The API query is made based on DOI.
2. The journal ISSN is saved to the `articles` table.
3. The journal ISSN and title are saved into a separate structure (e.g., two lists with matching indices, or a dataframe).
4. Once the queries are completed, duplicates are removed from the `journals` table.
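
Steps 3 and 4 above could be sketched as follows. The records are made-up placeholders for what the queries would return (the ISSNs and titles are not real), `journals_sketch` is a hypothetical name, and using the ISSN directly as `journal_id` is an assumption based on the schema note "journal ID based on ISSN".

```python
import pandas as pd

# Made-up (issn, title) pairs standing in for the collected query results
records = [
    {"issn": "1111-1111", "title": "Journal A"},
    {"issn": "2222-2222", "title": "Journal B"},
    {"issn": "1111-1111", "title": "Journal A"},  # same journal seen via another paper
]

journals_sketch = (
    pd.DataFrame(records)
    .drop_duplicates(subset="issn")  # keep one row per journal
    .reset_index(drop=True)
)
journals_sketch["journal_id"] = journals_sketch["issn"]  # assumption: ISSN doubles as the ID
journals_sketch["if_latest"] = float("nan")              # impact factor is not available yet
journals_sketch = journals_sketch[["journal_id", "issn", "title", "if_latest"]]
print(journals_sketch)
```

Deduplicating on the ISSN rather than on the title avoids treating two spellings of the same journal title as different journals.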

With `authors`, it is trickier. Since the original dataset includes quite messy data (e.g., in some cases there are full names, in some cases only given name initials), it would make sense to fetch the names of authors, in standardized form, from the API query. A brute-force solution would be as follows:

1. The API query is made based on DOI
2. The results are saved either as two lists with matching indices or as a pandas dataframe, where the first list (or column) holds the `article_id` and the second, say `authors_ids`, holds the list of that article's authors. Alternatively, the data could be saved in long format, where the `article_id` is repeated for each individual `author_name`.
3. Once the authors are fetched, names should be normalized ('cleaned').
4. Unique names are extracted.
5. Each unique author is assigned an ID.
6. Update the table `articles` with the list of author IDs.
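
Steps 2 to 6 above could look roughly like the sketch below, which uses the long format. The toy rows, the helper `normalize`, and the column `name_key` are hypothetical, and the normalization (trimming, lower-casing, collapsing whitespace) is only one possible choice. Matching on names alone will merge distinct people who share a name, so this is a starting point rather than a full disambiguation.

```python
import pandas as pd

# Toy long-format data: one row per (article, author) pair
long_df = pd.DataFrame({
    "article_id": ["a1", "a1", "a2", "a2"],
    "author_family": ["Pong", "Law", " law", "Berger"],
    "author_given": ["Y. H.", "C. K.", "C. K.", "E. L."],
})

def normalize(names):
    """Trim, lower-case, and collapse inner whitespace (an assumed cleaning rule)."""
    return names.str.strip().str.lower().str.replace(r"\s+", " ", regex=True)

# Build one key per author from the cleaned family and given names
long_df["name_key"] = normalize(long_df["author_family"]) + "|" + normalize(long_df["author_given"])

# Assign an integer ID to each unique key, in order of first appearance
long_df["author_id"], unique_keys = pd.factorize(long_df["name_key"])

# Collect the author IDs of each article into a list
author_ids_per_article = long_df.groupby("article_id")["author_id"].agg(list)
print(author_ids_per_article)
```

The resulting series is indexed by `article_id`, so it could be mapped onto the `articles` table once the IDs are final.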

For testing out these solutions, it may be a good idea to sample some rows from the entire df and run the queries on those rows.

In [13]:
# Sample a df of size N 
N = 50
test_df = articles.sample(N).reset_index(drop = True)

# Head of the df
test_df.head(3)

Unnamed: 0,article_id,title,doi,journal_issn,authors,date,n_cites
0,0802.0877,The shapes of galaxies in the Sloan Digital Sk...,10.1111/j.1365-2966.2008.13480.x,,,2008-12-18,
1,1410.3541,Memcomputing with membrane memcapacitive systems,10.1088/0957-4484/26/22/225201,,,2015-07-09,
2,astro-ph/0401289,The European Large Area ISO Survey IX: the 90 ...,10.1111/j.1365-2966.2004.08358.x,,,2009-11-10,


In [None]:
## Testing the CrossRef API
cr = Crossref() # CrossRef client (the class is imported from habanero in the first cell)

# Placeholders for the author data (not filled yet)
author_papers = pd.DataFrame(columns = ['article_id', 'author_family', 'author_given'])
ids = []
author_list = []

for i in range(len(test_df)):
    qry_rslt = cr.works(ids = test_df['doi'][i])['message'] # query the DOI via the CrossRef API
    
    if qry_rslt['type'] == 'journal-article': # select only journal articles
        # 'is-referenced-by-count' is the number of works citing this paper;
        # 'reference-count' is the length of the paper's own reference list
        test_df.loc[i, 'n_cites'] = qry_rslt['is-referenced-by-count'] # citation count
        issn = qry_rslt.get('ISSN', []) # some records have no ISSN
        test_df.loc[i, 'journal_issn'] = issn[0] if issn else None # the only ISSN, or the first one listed
        
      #  author_list = pd.Series(qry_rslt['author'])
      #  ids.append(test_df['article_id'][i])
      #  author_list = qry_rslt['author']
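
To try out the field extraction without calling the API, the handling of a single query result could be moved into a small function and run on a hand-written message. This is a sketch only: `parse_crossref_message` and `example_msg` are hypothetical, and the values in the example are invented. The keys used (`type`, `ISSN`, `is-referenced-by-count`, `author`, `family`, `given`) are CrossRef work-record fields.

```python
def parse_crossref_message(msg):
    """Pull the fields needed for the DWH tables out of a CrossRef 'message' dict.

    Returns None for anything that is not a journal article.
    """
    if msg.get("type") != "journal-article":
        return None
    issns = msg.get("ISSN") or []  # the field may be missing
    return {
        "n_cites": msg.get("is-referenced-by-count"),  # times the work has been cited
        "journal_issn": issns[0] if issns else None,   # first ISSN listed, if any
        "authors": [(a.get("family"), a.get("given")) for a in msg.get("author", [])],
    }

# Hand-written example with invented values (not a real API response)
example_msg = {
    "type": "journal-article",
    "ISSN": ["1111-1111", "2222-2222"],
    "is-referenced-by-count": 12,
    "reference-count": 40,
    "author": [{"family": "Pong", "given": "Y. H."}, {"family": "Law", "given": "C. K."}],
}

parsed = parse_crossref_message(example_msg)
print(parsed)
```

Keeping the parsing separate from the network call also makes it easier to wrap the call itself in error handling for DOIs that CrossRef does not know.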