# Data Engineering Project 
## Importing the raw data, exporting the clean data

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to clean the main raw data frame and write a new, clean data frame for further use. The notebook also compares several methods for writing and reading the cleaned data.

First, we install and import the necessary libraries in a single cell (to avoid scattering imports across the individual cells below). The packages to be installed, with their versions, will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [1]:
########### Library Installations ##############
# !pip install opendatasets # install the library for downloading the data set
# !pip install habanero # install the library for the CrossRef API

################################################

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes

### Specific-purpose libraries
import opendatasets as od # downloading the data set from Kaggle
from habanero import Crossref # CrossRef API

### Misc
import warnings # suppress warnings
import time # tracking time
import os # accessing directories

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings

## 1. Data import
To download the data from Kaggle, you need to create a Kaggle API token. Make sure to place the `kaggle.json` file in the same directory as this notebook.

Some additional resources:
- How to download the datasets from kaggle with `opendatasets` library https://www.analyticsvidhya.com/blog/2021/04/how-to-download-kaggle-datasets-using-jupyter-notebook/
- Github repo for `opendatasets` library: https://github.com/JovianML/opendatasets
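
As a quick sanity check before downloading, one could verify that the token file is present and complete. The sketch below is only an illustration: the helper name `check_kaggle_token` is hypothetical, and it assumes the token is a JSON object with `username` and `key` fields, which is the layout Kaggle uses for downloaded tokens.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return True if `path` is a JSON object with non-empty 'username' and 'key' fields.

    Hypothetical helper; the two field names are an assumption about the token layout.
    """
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return False
    if not isinstance(token, dict):
        return False
    return all(bool(token.get(field)) for field in ("username", "key"))


# Check the token expected next to this notebook.
if check_kaggle_token():
    print("kaggle.json found and looks complete.")
else:
    print("kaggle.json is missing or incomplete.")
```

The check only looks at the file's shape; whether the credentials are actually valid is known only once the download is attempted.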

First, download the file (it should be around `1.09 GB`). It will be stored in the `arxiv/` directory. If the file already exists, the download is skipped because of the `force = False` argument.

In [None]:
od.download("https://www.kaggle.com/datasets/Cornell-University/arxiv", 
                     force = False # force = True downloads the file even if it finds a file with the same name
                    )

Import the JSON file as a pandas dataframe. For testing purposes, select how many rows to include. If `n_rows = "all"`, the entire data set is imported.

In [2]:
n_rows = 'all'

start_time = time.time()
if n_rows == "all":
    df = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True)
else:
    df = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True, nrows = n_rows)

end_time = time.time()

print(f'Time elapsed: {end_time - start_time} seconds.')
print(f'Memory usage of raw df: {df.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
print(f'Dataframe dimensions: {df.shape}')
df.head(2)

Time elapsed: 174.75681114196777 seconds.
Memory usage of raw df: 4.030794090591371 GB.
Dataframe dimensions: (2146946, 14)


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
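
The import above loads all 14 columns at once and parses `id` as a number, so arXiv identifiers such as `0704.0001` appear as `704.0001`. A possible alternative, sketched here under assumptions and not used for the results in this notebook, is to read the file in chunks, keep only the columns needed later, and force `id` to be read as text. The function name `read_jsonl_columns` and the default chunk size are hypothetical choices; `lines`, `chunksize`, and `dtype` are existing `pd.read_json` parameters.

```python
import pandas as pd


def read_jsonl_columns(path, columns, chunksize=100_000):
    """Read a JSON Lines file chunk by chunk, keeping only `columns`.

    Hypothetical helper: only one chunk of the full-width data is held
    in memory at a time, which may lower peak memory usage.
    """
    reader = pd.read_json(
        path,
        lines=True,           # one JSON object per line
        chunksize=chunksize,  # iterate over dataframes instead of loading everything
        dtype={"id": str},    # keep identifiers such as "0704.0001" as text
    )
    # reindex keeps the requested column order and tolerates a column
    # that happens to be absent from a chunk (it is filled with NaN)
    chunks = [chunk.reindex(columns=columns) for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# Possible usage with the columns kept in the cleaning step:
# keep = ["id", "authors", "title", "doi", "categories", "update_date", "authors_parsed"]
# df = read_jsonl_columns("arxiv/arxiv-metadata-oai-snapshot.json", keep)
```

Whether this is faster than a single `pd.read_json` call was not measured here; the main expected benefits are lower peak memory and intact identifiers.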


## 2. Data cleaning
In this step, data cleaning is performed. Here are the guidelines from the assignment:

- You can drop the abstract as it is not required in the scope of this project,
- You can drop publications with very short titles, e.g. one word, with empty authors

We first drop all columns that we do not plan to use in the project. Then we exclude the rows in which the work has no DOI. While we acknowledge that some valid publications do not have a DOI, a DOI indicates that the work has been published (whether in a journal, as a preprint, etc.) and hence serves as a marker of publication quality. Finally, we exclude titles of 10 characters or fewer. The main idea is to exclude works without a valid title: even three three-letter words separated by two spaces amount to 11 characters, so shorter titles are rare.

In [3]:
# Drop the abstract, submitter, comments, report-no, versions, journal-ref, and license, as these features are not used in this project
## Of note, journal name will be retrieved later with a more standard label
df = df.drop(['abstract', 'submitter', 'comments', 'report-no', 'license', 'versions', 'journal-ref'], axis = 1)
df.shape

(2146946, 7)

In [4]:
# Include only works with non-null values in doi
df = df[~df['doi'].isnull()]
df.shape

(1077424, 7)

In [5]:
# Drop the publications with very short titles (10 characters or fewer)
df = df[(df['title'].map(len) > 10)]
df = df.reset_index(drop = True)
print(df.shape)

# Set the index of each paper to 'id'
# df = df.set_index('id')
df.head(3)

(1077226, 7)


Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0001,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,hep-ph,2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0006,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,cond-mat.mes-hall,2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"
2,704.0007,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,gr-qc,2008-11-26,"[[Corichi, Alejandro, ], [Vukasinac, Tatjana, ..."
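
The three cleaning steps above can also be collected into one reusable function, for example to re-run them on a smaller test sample. This is a minimal sketch: the names `clean_metadata`, `UNUSED_COLUMNS`, and `min_title_chars` are hypothetical. It uses `.str.len()` in place of `map(len)`, so a missing title is dropped by the length filter where `map(len)` would raise an error.

```python
import pandas as pd

# Columns that are not used in this project (same list as in the cell above)
UNUSED_COLUMNS = ["abstract", "submitter", "comments", "report-no",
                  "license", "versions", "journal-ref"]


def clean_metadata(df, min_title_chars=10):
    """Apply the three cleaning steps and return a new dataframe."""
    out = df.drop(columns=UNUSED_COLUMNS, errors="ignore")  # step 1: drop unused columns
    out = out[out["doi"].notna()]                            # step 2: keep works with a DOI
    out = out[out["title"].str.len() > min_title_chars]      # step 3: drop very short titles
    return out.reset_index(drop=True)


# Tiny illustrative sample (made-up rows, not taken from the data set)
sample = pd.DataFrame({
    "id": ["a", "b", "c"],
    "title": ["A sufficiently long title", "Short", "Another long enough title"],
    "doi": ["10.1000/example.1", "10.1000/example.2", None],
    "abstract": ["...", "...", "..."],
})
print(clean_metadata(sample))
```

In the sample, only the first row survives: the second title is too short and the third work has no DOI.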


In [6]:
print(f'Dataframe dimensions: {df.shape}')
print(f'Memory usage of cleaned pandas df: {df.memory_usage(deep = True).sum()/1024/1024/1024} GB.')

Dataframe dimensions: (1077226, 7)
Memory usage of cleaned pandas df: 0.6780343065038323 GB.


## 3. Testing different write and read methods

Now the goal is to save the data frame in a format that is relatively small on disk and also fast to write and read. Hence, we try out several methods. The results below show that the best choice for us is the `feather` format: it is the fastest to write (about 2.7 s) and to read (about 2.5 s), and its file (about 294 MB) is larger only than the `parquet` file (about 221 MB), which takes considerably longer to write (about 9.8 s) and to read (about 12.5 s). Some of the saving options are discussed here: https://stackoverflow.com/questions/48770542/what-is-the-difference-between-save-a-pandas-dataframe-to-pickle-and-to-csv

### `feather`

In [7]:
# Write to feather
feather_start = time.time()
df.to_feather('df_clean.feather')
feather_end = time.time()
print(f'Time writing to feather: {feather_end - feather_start} seconds.')

# Size of written file
feather_size = os.path.getsize('df_clean.feather')
print("feather file size is :", feather_size/1024/1024, "MB")

# Read feather
feather_start = time.time()
feather_read = pd.read_feather('df_clean.feather')
feather_end = time.time()
print(f'Time reading feather to pandas: {feather_end - feather_start} seconds.')

Time writing to feather: 2.7467079162597656 seconds.
feather file size is : 293.9583911895752 MB
Time reading feather to pandas: 2.473008871078491 seconds.


### `parquet`

In [8]:
# Write to parquet
parquet_start = time.time()
df.to_parquet('df_clean.parquet')
parquet_end = time.time()
print(f'Time writing to parquet: {parquet_end - parquet_start} seconds.')

# Size of written file
parquet_size = os.path.getsize('df_clean.parquet')
print("parquet file size is :", parquet_size/1024/1024, "MB")

# Read parquet
parquet_start = time.time()
parquet_read = pd.read_parquet('df_clean.parquet')
parquet_end = time.time()
print(f'Time reading parquet to pandas: {parquet_end - parquet_start} seconds.')

Time writing to parquet: 9.75090217590332 seconds.
parquet file size is : 220.77945232391357 MB
Time reading parquet to pandas: 12.483412027359009 seconds.


### `csv`

In [9]:
# Write to csv
csv_start = time.time()
df.to_csv('df_clean.csv')
csv_end = time.time()
print(f'Time writing to csv: {csv_end - csv_start} seconds.')

# Size of written file
csv_size = os.path.getsize('df_clean.csv')
print("csv file size is :", csv_size/1024/1024, "MB")

# Read csv
csv_start = time.time()
csv_read = pd.read_csv('df_clean.csv')
csv_end = time.time()
print(f'Time reading csv to pandas: {csv_end - csv_start} seconds.')

Time writing to csv: 10.570870637893677 seconds.
csv file size is : 402.45714378356934 MB
Time reading csv to pandas: 4.404193878173828 seconds.


### `pickle`

In [10]:
# Write to pickle
pickle_start = time.time()
df.to_pickle('df_clean.pkl')
pickle_end = time.time()
print(f'Time writing to pickle: {pickle_end - pickle_start} seconds.')

# Size of written file
pickle_size = os.path.getsize('df_clean.pkl')
print("pickle file size is :", pickle_size/1024/1024, "MB")

# Read pickle
pickle_start = time.time()
pickle_read = pd.read_pickle('df_clean.pkl')
pickle_end = time.time()
print(f'Time reading pickle to pandas: {pickle_end - pickle_start} seconds.')

Time writing to pickle: 31.464216709136963 seconds.
pickle file size is : 411.2366714477539 MB
Time reading pickle to pandas: 42.279184103012085 seconds.
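
The four benchmarks above repeat the same pattern: time the write, measure the file size, time the read. As a sketch, and not the code that produced the numbers above, the pattern could be factored into one helper. The name `benchmark_format` and the small demonstration frame are hypothetical, and `time.perf_counter()` is used in place of `time.time()` because it is intended for measuring short durations.

```python
import os
import tempfile
import time

import pandas as pd


def benchmark_format(df, path, write, read):
    """Time `write(df, path)` and `read(path)`; return the stats and the frame read back."""
    start = time.perf_counter()
    write(df, path)
    write_s = time.perf_counter() - start

    size_mb = os.path.getsize(path) / 1024 / 1024

    start = time.perf_counter()
    df_back = read(path)
    read_s = time.perf_counter() - start

    return {"write_s": write_s, "read_s": read_s, "size_mb": size_mb}, df_back


# Small demonstration frame; in the notebook, `df` would be the cleaned dataframe.
demo = pd.DataFrame({"id": ["0704.0001", "0704.0006"],
                     "title": ["First example title", "Second example title"]})

# Each format maps to a (write, read) pair of callables.
formats = {
    "csv": (lambda d, p: d.to_csv(p, index=False), lambda p: pd.read_csv(p, dtype=str)),
    "pickle": (lambda d, p: d.to_pickle(p), pd.read_pickle),
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, (write, read) in formats.items():
        stats, df_back = benchmark_format(demo, os.path.join(tmp, f"demo.{name}"), write, read)
        stats["round_trip_ok"] = df_back.equals(demo)  # did the data survive unchanged?
        results[name] = stats

print(pd.DataFrame(results).T)
```

`feather` and `parquet` can be added to `formats` in the same way, for example `(lambda d, p: d.to_feather(p), pd.read_feather)`; both need `pyarrow` to be installed. The round-trip check is worth keeping for the real data, since text formats such as `csv` store list-valued columns like `authors_parsed` as plain strings.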
