# Exploratory Data Analysis using cuDF

#### Original Author: Meiran Peng, edited by Mitesh Patel to support 24.06

cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. It provides a pandas-like API, so users can speed up workflows by up to ~400x (measured on specific hardware; see the details at the end) by changing pandas.DataFrame() to cudf.DataFrame().

Explore the latest [cuDF API](https://docs.rapids.ai/api/cudf/nightly/api_docs/index.html)

The table below compares cuDF and pandas performance on an RTX A6000 GPU and an AMD Ryzen Threadripper PRO 3945WX @ 4.4GHz (detailed information is at the end).
**<p style="text-align: center;">Performance Results based on test results</p>**    


|function| GPU Time | CPU Time| GPU Speedup |
| --- | --- | --- | --- |   
|read|2.046443|36.500297|17.83|
|slice|0.014068|8.297392|589.81|
|na|0.078052|1.826956|23.40|
|dropna|0.057089|2.866669|50.21|
|unique|0.007049|0.218504|30.99|
|dropduplicate|0.048152|0.300804|6.24|
|group_sum|0.600811|4.932876|8.21|

This notebook introduces how to use DataFrames and cuDF to apply basic data analysis to the [MeteoNet Dataset](https://www.kaggle.com/datasets/katerpillar/meteonet), an open weather dataset by METEO FRANCE, the nation's official meteorological service.
  
The dataset represents realistic data collection, including missing or invalid data. Here, we illustrate how to:
- Load and save the data
- Perform some quick checks
- Calculate the rate of missing data
- Check for invalid data
- Run a correlation of meteorological parameters
- Do a computation performance check

At the beginning of each of these topics, guidance is provided to show which functions in cuDF are applied. We follow up with a summary to describe the information we can glean through analysis. 


## Prerequisites   
To use this notebook, [RAPIDS](https://rapids.ai/start.html) must be installed. Please review the following steps and ensure it's properly installed.

### System Requirements
All provisioned systems need to be RAPIDS capable. Here’s what's required:

 **GPU**: NVIDIA Volta™ or better with compute capability 7.0+

 **OS**: One of the following OS versions:
 - Ubuntu 20.04/22.04 or Rocky Linux 8 with gcc/g++ 9.0+
 - Windows 11 using WSL2 (see the separate install guide)
 - RHEL 7/8 support is provided through Rocky Linux 8 builds/installs

 **CUDA & NVIDIA Drivers**: One of the following supported versions:
 - CUDA 11.2 with Driver 470.42.01 or newer
 - CUDA 11.4 with Driver 470.42.01 or newer
 - CUDA 11.5 with Driver 495.29.05 or newer
 - CUDA 11.8 with Driver 520.61.05 or newer
 - CUDA 12.0 with Driver 525.60.13 or newer (see the CUDA 12 notes in the RAPIDS installation guide)
 - CUDA 12.2 with Driver 535.86.10 or newer (see the CUDA 12 notes in the RAPIDS installation guide)

   
Note: RAPIDS is tested with and officially supports the versions listed above. Newer versions of CUDA, drivers, and OS may also work with RAPIDS.

### Environment for RAPIDS
You can install RAPIDS in one of the environments below. According to [Step 2: Install Environment](https://rapids.ai/start.html), the possible environments are:
* Conda 
* Build from source 
* PIP installation
* Running a Docker container 

### Installing RAPIDS  
There are specific ways to install RAPIDS for each environment. 
#### Conda   
Below is the command for basic installation under Conda:
```
conda create -n rapids-24.06 -c rapidsai -c conda-forge -c nvidia rapids=24.06 python=3.11 cuda-version=12.2 jupyterlab
```

You can also specify a different Python version (3.9 or 3.10) or a different CUDA toolkit version (one of 11.2, 11.4, 11.5, and 11.8).

NOTE: ```rapids=24.06``` installs the standard selection, which contains all of the following packages: _cuDF, cuML, cuGraph, cuSpatial/cuProj, cuXFilter, cuCIM_, RAFT, and cuVS. To install a single package instead, specify it directly, for example ```cudf=24.06```.

To additionally install Dask SQL, JupyterLab, Plotly Dash, Graphistry, etc., add the related package names to the conda install command. Find detailed information at [Step 3: Install Rapids](https://rapids.ai/start.html).

#### Docker
RAPIDS requires both Docker CE v19.03+ and [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) installed. 

Here's an example using a Docker container from NGC, selecting CUDA 12.2 and Ubuntu 22.04. The command both pulls the container and runs it.
```
docker run --gpus all --pull always --rm -it \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/rapidsai/base:24.06-cuda12.2-py3.11
```
For Docker commands for specific systems, and for _Dask-SQL, CLX_ support, see [Step 3: Install Rapids](https://rapids.ai/start.html).
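
Once the environment is installed, a quick check from Python can confirm that the key packages are visible to the interpreter. The sketch below is only an illustration: the helper name `rapids_packages_available` is hypothetical, and it checks importability only, not that a GPU is actually usable.

```python
import importlib.util


def rapids_packages_available(names=("cudf", "cupy")):
    """Return a dict mapping each package name to True if it can be found.

    Hypothetical helper: it only looks the packages up, it does not import
    them or touch the GPU.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = rapids_packages_available()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'NOT found'}")
```

If a package is reported as not found, the kernel is probably running outside the RAPIDS environment created above.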

## Prepare the Dataset - Ground Stations
This section shows how to perform basic analysis on the Ground Stations dataset from MeteoNet.

### Download the Dataset
During this first task, let's use the Northwest France ground station data. Each parameter was measured every six minutes (10 times an hour). The parameters in the dataset are listed below. More detailed information is available on [this GitHub page](https://meteofrance.github.io/meteonet/english/data/ground-observations/).

**<p style="text-align: center;">Metadata</p>**

|Name| Description | Unit|   
| --- | --- | --- |          
|number_sta|ground station ID| - |   
|lat| latitude| decimal degrees (10^-1°) |    
|lon| longitude| decimal degrees (10^-1°) |   
|height_sta| station height| meters(m) |    
|date| a datetime object| format 'YYYY-MM-DD HH:mm:ss' |    

**<p style="text-align: center;">Meteorological Parameters</p>**

|Name| Description | Unit| 
| --- | --- | --- |   
|dd| Wind direction | degrees (°)|    
|ff| Wind speed | m.s^-1|    
|precip| Precipitation during the reporting period | kg.m^-2|    
|hu| Humidity | % |      
|td| Dew point | Kelvin (K) |     
|t| Temperature | Kelvin (K) |   
|psl| Pressure reduced to sea level | Pascal (Pa)|

With the data now described, let's download and unzip the 1.7G dataset.
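
For readers who prefer to stay in Python rather than shell out, the download step can be sketched with the standard library. This is a sketch under assumptions: the URL pattern is inferred from the first curl command in this notebook, the extracted file name (`NW2018.csv`) is taken from the cell output, and the helper names `archive_url` and `fetch_year` are hypothetical.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the curl commands in this notebook.
BASE_URL = "https://meteonet.umr-cnrm.fr/dataset/data"


def archive_url(zone: str, year: int) -> str:
    """Build the archive URL for one zone and year (pattern is an assumption)."""
    return f"{BASE_URL}/{zone}/ground_stations/{zone}_ground_stations_{year}.tar.gz"


def fetch_year(zone: str, year: int, dest: Path = Path(".")) -> Path:
    """Download and extract one year of data unless the CSV already exists."""
    csv_path = dest / f"{zone}{year}.csv"  # assumed name of the extracted file
    if csv_path.exists():
        return csv_path
    archive = dest / f"{zone}_ground_stations_{year}.tar.gz"
    urllib.request.urlretrieve(archive_url(zone, year), archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    archive.unlink()  # remove the archive once extracted
    return csv_path


# Only print the URLs here; call fetch_year("NW", year) to actually download.
for year in (2016, 2017, 2018):
    print(archive_url("NW", year))
```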

In [1]:
# Download dataset
# These three curl commands download three years' worth of data
!if [ ! -f "NW2018.csv" ]; then curl https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2018.tar.gz -o NW_ground_stations_2018.tar.gz; else echo "NW2018.csv found"; fi
!if [ ! -f "NW2017.csv" ]; then curl https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2017.tar.gz -o NW_ground_stations_2017.tar.gz; else echo "NW2017.csv found"; fi
!if [ ! -f "NW2016.csv" ]; then curl https://meteonet.umr-cnrm.fr/dataset/data/NW/ground_stations/NW_ground_stations_2016.tar.gz -o NW_ground_stations_2016.tar.gz; else echo "NW2016.csv found"; fi

NW2018.csv found
NW2017.csv found
NW2016.csv found


In [2]:
# Unzip it in the shell
!if [ ! -f "NW2018.csv" ]; then tar -xvf NW_ground_stations_2018.tar.gz && rm -f NW_ground_stations_2018.tar.gz; else echo "NW2018.csv found"; fi
!if [ ! -f "NW2017.csv" ]; then tar -xvf NW_ground_stations_2017.tar.gz && rm -f NW_ground_stations_2017.tar.gz; else echo "NW2017.csv found"; fi
!if [ ! -f "NW2016.csv" ]; then tar -xvf NW_ground_stations_2016.tar.gz && rm -f NW_ground_stations_2016.tar.gz; else echo "NW2016.csv found"; fi

In [3]:
# Are they listed?
!ls -l -sh NW2*.csv

1.7G -rw-r--r-- 1 mitesh mitesh 1.7G Jan 23  2020 NW2016.csv
1.7G -rw-r--r-- 1 mitesh mitesh 1.7G Jan 23  2020 NW2017.csv
1.7G -rw-r--r-- 1 mitesh mitesh 1.7G Jan 23  2020 NW2018.csv


With the dataset now in hand, let's load the data and take an initial look at the contents.

## Load and save the data, and perform some quick checks

There are some basic features of DataFrames that will make your work easier. Here are a few we'll use:
- The MeteoNet dataset is in .csv ("comma-separated values") format, so the ```.read_csv()``` function can load it into a DataFrame. Make special note of the "line terminator" defined in the csv file.
- The ```.head()``` and ```.tail()``` functions from the cuDF library show the first and last several observations of the dataset, which is very handy for working with long datasets.
- ```.shape``` describes the shape of the DataFrame.
- ```.drop_duplicates()``` drops duplicated rows, optionally considering only a certain subset of the DataFrame's columns.
- ```.drop()``` drops specific columns.
- ```cudf.to_datetime()``` converts its argument to datetime dtype.
- ```.concat()``` concatenates DataFrames, Series, or Indices row-wise.
- ```.to_csv()``` writes a DataFrame to a csv file.

**Note**: The following processing uses a combined dataset from years 2016, 2017, and 2018, which is about 6GB in size. If limited by the GPU's memory (out of memory error), you can load just one of the datasets to investigate how cuDF works. 
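
Before running on the full files, it can help to see the functions listed above on a tiny example. The sketch below uses plain `pandas` on hypothetical in-memory samples (two rows per "year", with only three of the real columns); with the `cudf.pandas` extension loaded, the same calls are expected to be dispatched to the GPU where supported.

```python
import io

import pandas as pd

# Hypothetical two-row samples standing in for the per-year CSV files.
csv_2016 = "number_sta,date,t\n14066001,20160101 00:00,279.85\n14126001,20160101 00:00,278.45\n"
csv_2017 = "number_sta,date,t\n14066001,20170101 00:00,280.15\n14066001,20170101 00:06,280.25\n"

# Read each sample, stating the line terminator explicitly, then concatenate row-wise.
frames = [pd.read_csv(io.StringIO(text), lineterminator="\n") for text in (csv_2016, csv_2017)]
df = pd.concat(frames, ignore_index=True)

# Parse the 'YYYYMMDD HH:MM' strings into a datetime column.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d %H:%M")

print(df.head())
print(df.shape)

# Keep one row per station to count the stations.
stations = df.drop_duplicates(subset=["number_sta"])
print(len(stations))

# Drop a column and write the result as CSV (to a buffer here, not to disk).
buffer = io.StringIO()
df.drop(columns=["t"]).to_csv(buffer, index=False)
```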

In [4]:
## Load the cudf.pandas extension
%load_ext cudf.pandas

In [5]:
# Import pandas (accelerated by cudf.pandas) and CuPy
import pandas as pd
import cupy as cp

In [6]:
%%time
%%cudf.pandas.profile
# Do a warm-up when benchmarking performance. Refer to the last section of code for the performance check.
# If you get an out of memory error, you can comment out two of the read_csv lines below. Just make sure
# to update the gdf_frames line, too, to reflect which dataset you're keeping.

# Empty DataFrame placeholders so you can select just one or two of the years of data. 
gdf_2016 = pd.DataFrame()
gdf_2017 = pd.DataFrame()
gdf_2018 = pd.DataFrame()

# **********NOTE***********
# Comment out one or two of these if your GPU memory is full.
gdf_2016 = pd.read_csv('./NW2016.csv')
gdf_2017 = pd.read_csv('./NW2017.csv')
gdf_2018 = pd.read_csv('./NW2018.csv')

gdf_frames =[gdf_2016,gdf_2017,gdf_2018]
GS_df = pd.concat(gdf_frames,ignore_index=True)
GS_df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 65826837 entries, 0 to 65826836
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        object
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl\r       object
dtypes: float64(9), int64(1), object(2)
memory usage: 6.5+ GB


CPU times: user 1.65 s, sys: 2.51 s, total: 4.16 s
Wall time: 4.33 s


In [7]:
# Here's the bottom of the dataset
GS_df.tail()

Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl\r
65826832,86137003,47.035,0.098,96.0,20181231 23:54,40.0,2.9,0.0,88.0,278.85,280.75,\r
65826833,86165005,46.412,0.841,153.0,20181231 23:54,60.0,3.3,0.0,95.0,278.85,279.55,\r
65826834,86272002,46.839,0.457,120.0,20181231 23:54,,,0.0,,,,\r
65826835,91200002,48.526,1.993,116.0,20181231 23:54,270.0,0.8,0.0,96.0,279.75,280.35,\r
65826836,95690001,49.108,1.831,126.0,20181231 23:54,280.0,2.4,0.0,97.0,279.65,280.05,\r


In [8]:
%%time
%%cudf.pandas.profile
## Save the (concatenated) dataframe to csv file
GS_df.to_csv('./NW_data.csv',index=False,chunksize=500000)

CPU times: user 1.19 s, sys: 6.52 s, total: 7.71 s
Wall time: 7.92 s


Restart the kernel to release all GPU memory, then read the data again for subsequent processing.

In [9]:
## Restart the kernel before running the performance comparisons below.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [3]:
# Let's make sure the GPU is visible!
!nvidia-smi

Fri Jul 19 16:55:24 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A6000               Off |   00000000:41:00.0 Off |                  Off |
| 30%   47C    P8             20W /  300W |     287MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00

In [4]:
## Load the cudf.pandas extension
%load_ext cudf.pandas

The cudf.pandas extension is already loaded. To reload it, use:
  %reload_ext cudf.pandas


In [5]:
# Import the necessary packages
# import cudf
import cupy as cp
import pandas as pd

In [6]:
%%time
%%cudf.pandas.profile
# Let's read in the dataset, which is in CSV format with newlines as the terminator.
# And let's also keep track of the time elapsed to do so (the first line of the cell).

GS_df = pd.read_csv('./NW_data.csv',lineterminator='\n')

CPU times: user 1.77 s, sys: 614 ms, total: 2.39 s
Wall time: 2.38 s


In [7]:
# change the date column to datetime dtype, see the DataFrame info
GS_df['date'] = pd.to_datetime(GS_df['date'])
GS_df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 65826837 entries, 0 to 65826836
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        datetime64[ns]
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl         float64
dtypes: datetime64[ns](1), float64(10), int64(1)
memory usage: 5.9 GB


In [8]:
# Display the first five rows of the DataFrame to examine details
GS_df.head()

Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl
0,14066001,49.33,-0.43,2.0,2016-01-01,210.0,4.4,0.0,91.0,278.45,279.85,
1,14126001,49.15,0.04,125.0,2016-01-01,,,0.0,99.0,278.35,278.45,
2,14137001,49.18,-0.46,67.0,2016-01-01,220.0,0.6,0.0,92.0,276.45,277.65,102360.0
3,14216001,48.93,-0.15,155.0,2016-01-01,220.0,1.9,0.0,95.0,278.25,278.95,
4,14296001,48.8,-1.03,339.0,2016-01-01,,,0.0,,,278.35,


In [9]:
# Checking the DataFrame's dimensions. Millions of rows by 12 columns.
GS_df.shape

(65826837, 12)

In [10]:
# We can further examine the characteristics of a DataFrame using .info().
# This will show, for instance, the datatype of each column and the total GPU memory it occupies.
GS_df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 65826837 entries, 0 to 65826836
Data columns (total 12 columns):
 #   Column      Dtype
---  ------      -----
 0   number_sta  int64
 1   lat         float64
 2   lon         float64
 3   height_sta  float64
 4   date        datetime64[ns]
 5   dd          float64
 6   ff          float64
 7   precip      float64
 8   hu          float64
 9   td          float64
 10  t           float64
 11  psl         float64
dtypes: datetime64[ns](1), float64(10), int64(1)
memory usage: 5.9 GB


In [11]:
# DataFrames simplify data cleaning, such as dropping duplicate entries for a column.
# In this case, we keep one row per ground station to check how many ground stations are monitored
unique_stat_info = GS_df.drop_duplicates(subset=['number_sta'])
unique_stat_info.shape[0]

287

### Summary
- The dataset columns match the dataset metadata: number_sta, lat, lon, height_sta, date, dd, ff, etc.
- The full dataset contains 65826837 records from the ground stations, across 12 columns.
- There are 287 ground stations observed in this dataset.

## Calculate the rate of missing data     

Now we can further analyze the data, using a few handy DataFrame methods:
- ```.nunique()``` counts the number of distinct elements, here in the "number_sta" column.
- ```.str.contains()``` selects the items containing a specific substring.
- ```.to_numeric()``` converts its argument into a numerical type.
- ```.diff()``` returns a new DataFrame containing the difference between rows (by default, the difference from the previous row).
- ```.min()``` returns the minimum values in the DataFrame.
- Also, we use ```%%time``` as the first line of a cell to display the time elapsed running the code.
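
To make the station count and the recording interval concrete, here is a small sketch on hypothetical data: two made-up stations reporting at three consecutive timestamps. It uses plain `pandas` calls that also appear in the cells below.

```python
import pandas as pd

# Hypothetical records: two stations reporting at three consecutive timestamps.
df = pd.DataFrame({
    "number_sta": [1, 2, 1, 2, 1, 2],
    "date": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 00:00",
        "2016-01-01 00:06", "2016-01-01 00:06",
        "2016-01-01 00:12", "2016-01-01 00:12",
    ]),
})

# Count the distinct station IDs.
n_stations = df["number_sta"].nunique()

# diff() gives the time delta to the previous row: zero between rows that share
# a timestamp, and one recording interval when the timestamp advances.
step_seconds = df["date"].diff().dt.seconds.max()
step_minutes = step_seconds / 60

print(n_stations, step_minutes)
```

Note that `.dt.seconds` returns only the seconds component of each delta (less than one day), so a gap longer than a day would wrap around; `.dt.total_seconds()` avoids that if such gaps matter.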

In [12]:
%%time
# How many weather stations are covered in this dataset? 
# Call nunique() to count the distinct elements along a specified axis.

number_stations = GS_df['number_sta'].nunique()
print("The full dataset is composed of {} unique weather stations.".format(GS_df['number_sta'].nunique()))

The full dataset is composed of 287 unique weather stations.
CPU times: user 4.78 ms, sys: 0 ns, total: 4.78 ms
Wall time: 4.7 ms


In [13]:
%%time
## Investigate how frequently the data is recorded
## The date column has datetime dtype, so diff() calculates the time delta between consecutive records
## TimedeltaProperties.seconds gives the delta in seconds between records; divide by 60 to get the difference in minutes.
delta_mins = GS_df['date'].diff().dt.seconds.max()/60
print(f"The data is recorded every {delta_mins} minutes")

The data is recorded every 6.0 minutes
CPU times: user 12.3 ms, sys: 16 ms, total: 28.3 ms
Wall time: 27.2 ms


The dataset includes 287 unique stations, with 10 records per hour (one record every 6 minutes), so the expected number of records over three years is
```
287 x 10 x 24 x 365 x 3 = 75,423,600 records
```

Knowing this, we can calculate the missing record rate.

In [14]:
# Theoretical number of records is... 
theoretical_nb_records = number_stations * (60 / delta_mins) * 365 * 3 * 24 
actual_nb_of_rows = GS_df.shape[0]
missing_record_ratio = 1 - (actual_nb_of_rows/theoretical_nb_records)
print("Percentage of missing records of the NW dataset is: {:.1f}%".format(missing_record_ratio * 100))
print("Theoretical total number of values in dataset is: {:d}".format(int(theoretical_nb_records)))

Percentage of missing records of the NW dataset is: 12.7%
Theoretical total number of values in dataset is: 75423600


### Summary  
The dataset is composed of weather recordings from **287 unique ground stations** in the Northwest of France during the years 2016, 2017, and 2018. Records are taken every 6 minutes and include wind direction, wind speed, humidity, temperature, pressure, etc.
- The theoretical number of records is 75423600
- The actual number of records in the dataset is 65826837
- About 12.7% of the expected records are missing over the monitoring period from 2016 to 2018.
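
The theoretical count above assumes 365-day years, but 2016 was a leap year. As a cross-check, the sketch below derives the expected count from the actual date range instead. The helper `expected_records` is hypothetical, and it assumes every station should report at every 6-minute step from 2016-01-01 up to (not including) 2019-01-01.

```python
from datetime import datetime, timedelta


def expected_records(n_stations, start, end, step=timedelta(minutes=6)):
    """Records expected if every station reported at every step in [start, end)."""
    steps_per_station = (end - start) // step
    return n_stations * steps_per_station


expected = expected_records(287, datetime(2016, 1, 1), datetime(2019, 1, 1))
actual = 65_826_837  # row count reported by GS_df.shape in this notebook
missing_ratio = 1 - actual / expected

print(f"expected={expected:,}  missing={missing_ratio:.1%}")
```

Accounting for the extra day changes the estimate only slightly, so the conclusion about the share of missing records stays the same.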

## Check for invalid data

Next, we check for invalid data by finding NA values and calculating column sums. We'll use:
- ```.isna()``` to create a new DataFrame of boolean values, where True marks an NA item.
- ```.sum()``` to find the sum of each series.
- ```.slice()``` to cut the date string to show only the month.
- ```.index``` attribute of a Series, to find which series have NA values.
- ```.to_frame()``` to convert a Series into a DataFrame.
- ```.reset_index()``` to reset the index of the DataFrame.

Overall, we can use these functions to check if there is NA data, and then total up the NA data for each category by month to see which months have the most missing records. (Note that NA data includes values such as None and numpy.NaN; empty strings '' and numpy.inf are not counted as NA by default.)
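
A short sketch of which values `.isna()` actually flags, using plain `pandas` with its default options on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values covering the usual suspects.
numeric = pd.Series([1.0, None, np.nan, np.inf])
text = pd.Series(["a", "", None], dtype="object")

numeric_flags = numeric.isna().tolist()
text_flags = text.isna().tolist()

print(numeric_flags)  # None and NaN are flagged; infinity is not
print(text_flags)     # None is flagged; the empty string is not

# Summing the boolean flags gives the NA count per column.
na_counts = pd.DataFrame({"x": numeric}).isna().sum()
print(na_counts)
```

If empty strings or infinities should count as missing in your analysis, convert them to NaN explicitly before calling `.isna()`.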

In [15]:
# Let's find which columns have NA value(s) during year 2018
NA_sum = GS_df[GS_df['date'].dt.year==2018].isna().sum()
NA_data = NA_sum[NA_sum>0]
NA_data.index

Index(['dd', 'ff', 'precip', 'hu', 'td', 't', 'psl'], dtype='object')

In [16]:
NA_data

dd         8605703
ff         8598613
precip     1279127
hu         8783452
td         8786154
t          2893694
psl       17621180
dtype: int64

In [17]:
%%time
# Let's extract the month and year from the datetime column
GS_df["month"] = GS_df["date"].dt.month
GS_df["year"] = GS_df["date"].dt.year
GS_df.head()

CPU times: user 5.27 ms, sys: 789 µs, total: 6.06 ms
Wall time: 4.79 ms


Unnamed: 0,number_sta,lat,lon,height_sta,date,dd,ff,precip,hu,td,t,psl,month,year
0,14066001,49.33,-0.43,2.0,2016-01-01,210.0,4.4,0.0,91.0,278.45,279.85,,1,2016
1,14126001,49.15,0.04,125.0,2016-01-01,,,0.0,99.0,278.35,278.45,,1,2016
2,14137001,49.18,-0.46,67.0,2016-01-01,220.0,0.6,0.0,92.0,276.45,277.65,102360.0,1,2016
3,14216001,48.93,-0.15,155.0,2016-01-01,220.0,1.9,0.0,95.0,278.25,278.95,,1,2016
4,14296001,48.8,-1.03,339.0,2016-01-01,,,0.0,,,278.35,,1,2016


In [18]:
%%cudf.pandas.profile
# Let's only analyze the columns that contain NA values, restricted to year 2018
NA_column = pd.DataFrame(GS_df,columns=NA_data.index).isna()
NA_column["month"]=GS_df["month"]
NA_column["year"]=GS_df["year"]
NA_column = NA_column[NA_column['year']==2018]

In [19]:
NA_column.info()

<class 'cudf.core.dataframe.DataFrame'>
Index: 22034571 entries, 43792266 to 65826836
Data columns (total 9 columns):
 #   Column  Dtype
---  ------  -----
 0   dd      bool
 1   ff      bool
 2   precip  bool
 3   hu      bool
 4   td      bool
 5   t       bool
 6   psl     bool
 7   month   int16
 8   year    int16
dtypes: bool(7), int16(2)
memory usage: 399.3 MB


In [20]:
# We can group the data by month and then calculate the sum of the NA data for each month.
# Note, reset_index() is used to keep the group month as a column; otherwise it becomes the index.
NA_data_month = NA_column.groupby("month",sort=True).sum().reset_index().drop(columns=['year'])
NA_data_month

Unnamed: 0,month,dd,ff,precip,hu,td,t,psl
0,1,704384,703030,101402,721584,722345,242430,1461872
1,2,636406,636114,92062,650117,651227,224703,1329865
2,3,708710,708008,102324,726449,726476,248646,1474683
3,4,687953,686884,107086,708769,709227,245753,1432213
4,5,721118,720603,107630,742137,742504,258661,1489728
5,6,689394,688701,109634,708939,708951,247578,1430580
6,7,719556,719325,111958,738818,739053,250513,1485952
7,8,722750,722723,107515,749206,749225,251985,1497920
8,9,705047,703453,108238,720628,720653,242374,1444152
9,10,734415,734411,109960,735582,735617,237221,1497323


In [21]:
NA_data_month.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   month   12 non-null     int16
 1   dd      12 non-null     int64
 2   ff      12 non-null     int64
 3   precip  12 non-null     int64
 4   hu      12 non-null     int64
 5   td      12 non-null     int64
 6   t       12 non-null     int64
 7   psl     12 non-null     int64
dtypes: int16(1), int64(7)
memory usage: 696.0 bytes


### Summary:
- There are invalid data values in the records monitored during year 2018.
- Seven meteorological parameters are monitored, all of which have some invalid data.
- Compared across all the categories, precip (precipitation during the reporting period) and t (temperature) have lower numbers of NA occurrences.
- The psl (pressure at sea level) parameter has the largest number of invalid data values, with 17621180.
- For the other 4 meteorological parameters, there is no significant difference, as the NA counts for them are all within 8000000 - 9000000.
- Examining the NA value distribution across the whole year, there is no significant difference to suggest that any one month has much more invalid data than the others.
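
The per-month NA count summarized above can be sketched on a toy frame. The values below are hypothetical; only the column names `date`, `t`, and `psl` are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical records: four observations over two months with some gaps.
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-03", "2018-02-14"]),
    "t": [279.8, np.nan, np.nan, np.nan],
    "psl": [np.nan, 101300.0, np.nan, 101900.0],
})

# Boolean NA flags per measurement column, plus the month to group by.
na_flags = df[["t", "psl"]].isna()
na_flags["month"] = df["date"].dt.month

# Summing booleans counts the True values; reset_index() keeps month as a column.
na_per_month = na_flags.groupby("month", sort=True).sum().reset_index()
print(na_per_month)
```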

## Run a correlation of meteorological parameters
We can apply a correlation analysis to measure the correlation between meteorological parameters. Removing correlated items may help us train a regression model, for instance. These DataFrame methods will help us:
- ```.corr()``` computes the correlation matrix of a DataFrame, but only on a numeric matrix containing no NA data.
- ```.dropna()``` drops rows (or columns) containing NA data, which is especially handy for dataset cleaning tasks.
- ```.drop()``` removes specific columns in the DataFrame.
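
Here is a minimal sketch of that workflow on synthetic data. The numbers are made up: `td` is constructed to track `t` closely and `ff` is generated independently, purely to illustrate how a strongly correlated pair shows up in the matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = 280 + 5 * rng.standard_normal(500)       # synthetic temperature (K)
td = t - 2 + 0.5 * rng.standard_normal(500)  # synthetic dew point, tracks t closely
ff = 3 + rng.standard_normal(500)            # synthetic wind speed, independent of t

df = pd.DataFrame({"t": t, "td": td, "ff": ff})
df.loc[::50, "ff"] = np.nan                  # inject some missing values

# Drop incomplete rows, then compute the pairwise correlation matrix.
corr = df.dropna().corr()

# Keep only the strong correlations; everything else becomes NaN.
strong = corr[corr > 0.7]
print(strong)

# Remove one column of the correlated pair for downstream modelling.
reduced = df.drop(columns=["td"])
print(list(reduced.columns))
```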

In [22]:
%%time

# Let's only analyze meteorological columns
Meteo_series = ['dd', 'ff', 'precip' ,'hu', 'td', 't', 'psl']
Meteo_df = pd.DataFrame(GS_df,columns=Meteo_series)
Meteo_corr = Meteo_df.dropna().corr()

# And let's check the items with correlation value > 0.7 
Meteo_corr[Meteo_corr>0.7]



Unnamed: 0,dd,ff,precip,hu,td,t,psl
dd,1.0,,,,,,
ff,,1.0,,,,,
precip,,,1.0,,,,
hu,,,,1.0,,,
td,,,,,1.0,0.840558,
t,,,,,0.840558,1.0,
psl,,,,,,,1.0


In [23]:
# %time
Meteo_df_less = Meteo_df.drop(columns=['td'])
Meteo_df_less.head()

Unnamed: 0,dd,ff,precip,hu,t,psl
0,210.0,4.4,0.0,91.0,279.85,
1,,,0.0,99.0,278.45,
2,220.0,0.6,0.0,92.0,277.65,102360.0
3,220.0,1.9,0.0,95.0,278.95,
4,,,0.0,,278.35,


### Summary:
- We applied the ```.corr()``` method to analyze the relationships among the meteorological parameters.
- There is a strong correlation between td (dew point) and t (temperature).
- Based on that information, either the td or the t column should be removed to improve a linear regression model.
- We used the ```.drop()``` method to remove the td column for downstream tasks.

## Accelerated Computing Performance Check

This section compares the performance of a handful of typical functions used in this notebook between pandas (CPU) and cuDF (GPU), with minimal code change (pd(pandas) -> cudf). You can adapt the code below to measure the performance improvement on your local machine.

Test machine information:
- **GPU**: NVIDIA RTX A6000   
- **CPU**: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz 
- **RAPIDS**: Rapids 24.06 with CUDA 12.2

**<p style="text-align: center;">Performance Results based on test results</p>**


|function| GPU Time | CPU Time| GPU Speedup |
| --- | --- | --- | --- |   
|read|2.046443|36.500297|17.83|
|slice|0.014068|8.297392|589.81|
|na|0.078052|1.826956|23.40|
|dropna|0.057089|2.866669|50.21|
|unique|0.007049|0.218504|30.99|
|dropduplicate|0.048152|0.300804|6.24|
|group_sum|0.600811|4.932876|8.21|
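
When reproducing such measurements, a warm-up run and a few repeats make the timings less noisy. The sketch below shows one way to do this. `best_time` is a hypothetical helper, not part of this notebook's benchmark code, and the workload is a stand-in for a DataFrame operation.

```python
from timeit import default_timer as timer


def best_time(func, *args, repeats=3, **kwargs):
    """Run func once as a warm-up, then return (result, fastest timed run in seconds)."""
    result = func(*args, **kwargs)  # warm-up run, not timed
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        result = func(*args, **kwargs)
        best = min(best, timer() - start)
    return result, best


# Stand-in workload; replace with e.g. a read_csv or groupby call.
total, elapsed = best_time(sum, range(1_000_000))
print(f"result={total}, best of 3: {elapsed:.6f} s")

# The speedup column is the CPU time divided by the GPU time,
# shown here with the 'read' row of the table above.
speedup = 36.500297 / 2.046443
print(f"read speedup: {speedup:.2f}x")
```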

In [24]:
## Restart the kernel before running the performance comparison below
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
!nvidia-smi

Fri Jul 19 16:55:54 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A6000               Off |   00000000:41:00.0 Off |                  Off |
| 30%   50C    P8             23W /  300W |     287MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00

In [2]:
import numpy as np
import pandas as pd
import cupy as cp
from timeit import default_timer as timer

In [3]:
def computeAnalytics(file_path):
    start_t = timer()
    data_pd = pd.read_csv(file_path)
    read_time = timer() - start_t

    # Slicing function
    start_t = timer()
    data_pd_month_s = data_pd["date"].str.slice(4,6)
    slice_time = timer() - start_t

    data_pd_month=data_pd
    data_pd_month["date"]=data_pd_month_s

    # NA data check
    start_t = timer()
    NA_sum = data_pd.isna().sum()
    NA_data_pd = NA_sum[NA_sum>0]
    na_time = timer() - start_t

    # drop na 
    start_t = timer()
    data_pd.dropna()
    dropna_time = timer() - start_t

    # unique data check
    start_t = timer()
    number_stations = data_pd['number_sta'].nunique()
    unique_time = timer() - start_t

    # drop_duplicates
    start_t = timer()
    unique_stat_info = data_pd.drop_duplicates(subset=['number_sta'])
    dropdu_time = timer() - start_t

    # group and sum timer
    start_t = timer()
    NA_column_pd = pd.DataFrame(data_pd_month,columns=NA_data_pd.index).isna()
    NA_column_pd["month"]=data_pd_month["date"]
    # group the data by month, and then calculate the sum of the NA data for each month
    # reset_index() keeps the group month as a column; otherwise it becomes the index
    NA_data_pd_month = NA_column_pd.groupby("month",sort=True).sum().reset_index()
    group_sum_time = timer() - start_t

    return read_time, slice_time, na_time, dropna_time, unique_time, dropdu_time, group_sum_time

In [4]:

# Run analysis on CPU
read_cpu_time,slice_cpu_time,na_cpu_time,dropna_cpu_time,unique_cpu_time,dropdu_cpu_time,group_sum_cpu_time = computeAnalytics('./NW_data.csv')


In [5]:
## Load the cudf.pandas extension
%load_ext cudf.pandas

In [6]:
import numpy as np
import pandas as pd
import cupy as cp
from timeit import default_timer as timer

In [7]:
%%cudf.pandas.profile
# Run for GPU
read_gpu_time,slice_gpu_time,na_gpu_time,dropna_gpu_time,unique_gpu_time,dropdu_gpu_time,group_sum_gpu_time = computeAnalytics('./NW_data.csv')

In [8]:
# Build the performance table (as another DataFrame, of course!).
performance_df = pd.DataFrame()
performance_df['function'] = ['read','slice','na','dropna','unique','dropduplicate','group_sum']
performance_df['time_gpu']=[read_gpu_time,slice_gpu_time,na_gpu_time,dropna_gpu_time,unique_gpu_time,dropdu_gpu_time,group_sum_gpu_time]
performance_df['time_cpu']=[read_cpu_time,slice_cpu_time,na_cpu_time,dropna_cpu_time,unique_cpu_time,dropdu_cpu_time,group_sum_cpu_time]
performance_df['speedup']=performance_df['time_cpu']/performance_df['time_gpu']
performance_df

Unnamed: 0,function,time_gpu,time_cpu,speedup
0,read,3.573566,53.416235,14.947601
1,slice,0.023217,8.49431,365.871404
2,na,0.084238,2.015948,23.931468
3,dropna,0.05874,2.852879,48.568179
4,unique,0.009121,0.224486,24.613098
5,dropduplicate,0.049458,0.307427,6.215934
6,group_sum,0.620302,4.907235,7.911041


## Conclusion

In this notebook, we applied GPU-accelerated DataFrame computation through the use of RAPIDS, cuDF, and CuPy. We demonstrated data loading methods, data cleaning techniques, a correlation analysis of meteorological parameters, and finally how to derive performance values versus CPU-only computation on the same dataset.

### Citation
- Gwennaëlle Larvor, Léa Berthomier, Vincent Chabot, Brice Le Pape, Bruno Pradel, Lior Perez. MeteoNet, an open reference weather dataset by METEO FRANCE, 2020 [dataset link](https://www.kaggle.com/datasets/katerpillar/meteonet)    
