# 4-CudaDataframe

This notebook demonstrates how to read and process a tabular data file with the [cuDF](https://docs.rapids.ai/api/cudf/stable/) GPU dataframe library. cuDF provides a pandas-like API. The size of the data that can be processed is limited by the available GPU memory.

Author: Peter W. Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import cudf
import time

If the LOCAL_SCRATCH_DIR environment variable is not set, this notebook falls back to the ../data directory for temporary files.

In [2]:
DATA_DIR = os.getenv("LOCAL_SCRATCH_DIR", default="../data")
filename = os.path.join(DATA_DIR, "gene_info.tsv")
file_size = f"{os.path.getsize(filename)/1E9:.1f}"
RESULTS_DIR = "results"
os.makedirs(RESULTS_DIR, exist_ok=True)
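
The fallback in the cell above can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper `resolve_data_dir` (not part of the notebook) that takes the environment as a plain mapping so both cases can be tried without changing the real environment. The path `/scratch/job` is a made-up placeholder.

```python
import os


def resolve_data_dir(env=None, default="../data"):
    """Return LOCAL_SCRATCH_DIR if it is set, otherwise the default directory.

    Hypothetical helper for illustration; the notebook itself calls os.getenv.
    """
    env = os.environ if env is None else env
    return env.get("LOCAL_SCRATCH_DIR", default)


# Variable not set: fall back to the default directory
print(resolve_data_dir({}))
# Variable set (placeholder path): use the scratch directory
print(resolve_data_dir({"LOCAL_SCRATCH_DIR": "/scratch/job"}))
```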

### Setup Benchmark

The ```file_format``` parameter selects the file format to benchmark.
The cell below has been [parameterized](https://papermill.readthedocs.io/en/latest/usage-parameterize.html#jupyterlab-3-0) so that [papermill](https://papermill.readthedocs.io/en/latest/index.html) can override its value.

In [3]:
file_format = "csv"
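
To run the benchmark once per file format, the parameter above can be overridden from the papermill command line with the `-p name value` option. The sketch below only builds and prints the commands; it does not execute them. The helper `papermill_command` and the notebook file names are assumptions made for illustration (the input name is guessed from the notebook title).

```python
import shlex


def papermill_command(notebook, output, **parameters):
    """Build a papermill command line that overrides notebook parameters.

    Hypothetical helper; each keyword argument becomes one `-p name value` pair.
    """
    cmd = ["papermill", notebook, output]
    for name, value in parameters.items():
        cmd += ["-p", name, str(value)]
    return cmd


# Assumed file names, one output notebook per file format
for fmt in ("csv", "parquet"):
    cmd = papermill_command(
        "4-CudaDataframe.ipynb",
        f"4-CudaDataframe_{fmt}.ipynb",
        file_format=fmt,
    )
    print(shlex.join(cmd))
```

The printed commands can be pasted into a shell or passed to `subprocess.run` as a list.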

In [4]:
start = time.time()

### Read Data

In [5]:
# Read only the required columns and keep only protein-coding genes
column_names = ["GeneID", "Symbol", "Synonyms", "description", "type_of_gene", "#tax_id", "chromosome"]
filters = [[("type_of_gene", "==", "protein-coding")]]

if file_format == "csv":
    filename = os.path.join(DATA_DIR, "gene_info.tsv")
    genes = cudf.read_csv(filename, usecols=column_names, dtype=str, sep="\t")
    # read_csv has no row filter, so rows are filtered after loading
    genes = genes[genes["type_of_gene"] == "protein-coding"]
elif file_format == "parquet":
    filename = os.path.join(DATA_DIR, "gene_info.parquet")
    # The row filter is passed to the Parquet reader
    genes = cudf.read_parquet(filename, columns=column_names, filters=filters)
else:
    # Stop here; continuing would fail later because genes is undefined
    raise ValueError(f"invalid file format: {file_format}")
    
print("Filename:", filename)
    
genes = genes.rename(columns={"#tax_id": "tax_id"})

Filename: /scratch/pwrose/job_22854870/gene_info.tsv
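
The nested-list shape of `filters` follows, as far as I understand it, the pyarrow-style convention for Parquet readers: the outer list combines its entries with OR, and each inner list combines its `(column, operator, value)` tuples with AND. The sketch below mimics that logic in plain Python on a few made-up records. It is not how cuDF implements filtering, and `matches` is a hypothetical helper.

```python
import operator

# Comparison operators supported by this toy evaluator
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row, filters):
    """Return True if row satisfies any inner list (OR) of all-true tuples (AND)."""
    return any(
        all(OPS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in filters
    )


filters = [[("type_of_gene", "==", "protein-coding")]]

# Made-up toy records, not taken from gene_info
rows = [
    {"GeneID": "1", "type_of_gene": "protein-coding"},
    {"GeneID": "2", "type_of_gene": "pseudo"},
    {"GeneID": "3", "type_of_gene": "protein-coding"},
]

kept = [row["GeneID"] for row in rows if matches(row, filters)]
print(kept)
```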


### Process Data

In [6]:
groups = genes.groupby("tax_id").size().reset_index()
groups.columns = ["tax_id", "count"]
groups = groups.sort_values("count", ascending=False)
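
For readers new to the group-size-sort pattern, the cell above counts rows per `tax_id` and orders the counts from largest to smallest. The same computation can be sketched with the standard library on a small made-up list of taxonomy IDs. This is only an illustration of what the step computes, not a replacement for the GPU code.

```python
from collections import Counter

# Made-up toy column of taxonomy IDs (strings, as in the notebook)
tax_ids = ["9606", "10090", "9606", "4565", "9606", "10090"]

# Equivalent of groupby("tax_id").size()
counts = Counter(tax_ids)

# Equivalent of sort_values("count", ascending=False)
groups = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(groups)  # [('9606', 3), ('10090', 2), ('4565', 1)]
```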

### Display Results

#### Number of human protein-coding genes (tax_id = 9606)

In [7]:
groups[groups["tax_id"]  == "9606"]

|       | tax_id | count |
|-------|--------|-------|
| 31505 | 9606   | 20648 |


#### Top 5 organisms with the most protein-coding genes

In [8]:
groups.head()

|       | tax_id | count  |
|-------|--------|--------|
| 41449 | 4565   | 104037 |
| 9909  | 3708   | 90975  |
| 20332 | 90675  | 82686  |
| 15344 | 412133 | 72290  |
| 20364 | 94328  | 68154  |


In [9]:
end = time.time()

In [10]:
df = pd.DataFrame([{"cores": 1, "time": end-start, "size": file_size, "format": file_format, "dataframe": "Cuda"}])
output_file = f"5-CudaDataframe_{file_size}_{file_format}_1.csv"
df.to_csv(os.path.join(RESULTS_DIR, output_file), index=False)

In [11]:
print(f"cuDF: {end - start:.1f} sec.")

cuDF: 4.0 sec.
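
The notebook measures wall-clock time with `time.time()` across several cells. For interval timing, `time.perf_counter()` is a monotonic, high-resolution alternative from the standard library. The sketch below wraps it in a small context manager; `timed` is a hypothetical helper, and the summed range is only a stand-in workload.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(result):
    """Store the elapsed seconds of the with-block in result["seconds"]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        result["seconds"] = time.perf_counter() - start


elapsed = {}
with timed(elapsed):
    sum(range(100_000))  # stand-in for the read and process steps

print(f"elapsed: {elapsed['seconds']:.4f} sec.")
```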
