## Objective: Converting netCDF to CSV format

- `netCDF` files are often large, so analyzing them can demand substantial memory and processing power.
- Filtering (subsetting) a netCDF file means selecting only the variables, time periods, or spatial regions needed for the analysis. This keeps the focus on the relevant data and reduces the amount of information that has to be processed.
- The following variables are extracted from the netCDF files (netCDF variable => output column):
    * sounding_id => DateTime
    * xco2 => Xco2 (XCO2 in ppm)
    * latitude, longitude => Latitude, Longitude
    * xco2_quality_flag => quality_flag (0 => good quality, 1 => bad quality)
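
As a minimal sketch of the subsetting idea described above, the snippet below keeps only good-quality soundings inside a bounding box. The records, the `subset` helper, and the bounding-box values are made up for illustration; the actual workflow reads these fields from the netCDF files.

```python
# Hypothetical records mimicking the fields extracted from a netCDF file.
records = [
    {"xco2": 415.9, "latitude": -42.3, "longitude": -157.6, "xco2_quality_flag": 0},
    {"xco2": 410.2, "latitude": 10.5, "longitude": 20.1, "xco2_quality_flag": 1},
    {"xco2": 414.7, "latitude": 34.0, "longitude": -118.2, "xco2_quality_flag": 0},
]


def subset(rows, lat_range, lon_range):
    """Keep good-quality rows (flag 0) that fall inside the bounding box."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return [
        row for row in rows
        if row["xco2_quality_flag"] == 0
        and lat_min <= row["latitude"] <= lat_max
        and lon_min <= row["longitude"] <= lon_max
    ]


# Example: good-quality soundings in the southern hemisphere only.
southern = subset(records, lat_range=(-90, 0), lon_range=(-180, 180))
print(southern)
```

The same two conditions (quality flag and region) can be combined with a time window when only part of the record is needed.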

### Steps and prerequisites:
- This is a beginner-level tutorial. If you already have pre-processed data ready for visualization, you can skip this part.
- You can obtain the files by downloading them from publicly accessible web servers such as **EarthData** or **OPeNDAP** into a destination directory of your choice. For more information, see the main page of this git repository.

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import os
import netCDF4 as nc

# converting the datetime format
from datetime import datetime

## File location:
To manage your downloaded files effectively, first determine how they are stored. This section distinguishes between two common scenarios:

### Scenario 1: Files in a Single Folder
- Description: If you downloaded the files from the Earthdata website, either individually or by running the download bash script, they are typically stored in a single folder.
    - Example: after selecting the range in EarthData Search, run the generated download script:
        - `chmod +x 4237267242-download`
        - `./4237267242-download`
- Recommendation: In this case, you can directly utilize "Option 1" for your operations.

### Note:
- You can see the locally downloaded files in the path: `multiple_netcdf_files`

In [11]:
path_a = '../../multiple_netcdf_files/'

# Collect the path of each individual file
file_names = []

# sorted() gives a reproducible order; os.listdir() order is arbitrary
for file in sorted(os.listdir(path_a)):
    # Keep only netCDF-4 files
    if file.endswith(".nc4"):
        # os.path.join builds a path that works on any operating system
        file_path = os.path.join(path_a, file)

        # Store the path of each individual file
        file_names.append(file_path)


# Check the first 10 file paths
file_names[:10]

['../../multiple_netcdf_files/oco2_LtCO2_210804_B11014Ar_220728230833s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210808_B11014Ar_220728231045s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210829_B11014Ar_220728232132s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210909_B11014Ar_220728225546s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210916_B11014Ar_220728225855s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210919_B11014Ar_220728230031s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210920_B11014Ar_220728230103s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210923_B11014Ar_220728230243s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_210926_B11014Ar_220728230409s.nc4',
 '../../multiple_netcdf_files/oco2_LtCO2_211004_B11014Ar_220728223641s.nc4']
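
The listing step above can also be written with `pathlib`, which inserts the correct path separator on any operating system. This is only a sketch: the helper name `list_nc4_files` and the placeholder file names are hypothetical, and the demo runs on a temporary folder with empty files rather than real data.

```python
import tempfile
from pathlib import Path


def list_nc4_files(folder):
    """Return the sorted paths of all .nc4 files directly inside `folder`."""
    return sorted(str(p) for p in Path(folder).glob("*.nc4"))


# Demo on a temporary folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.nc4", "a.nc4", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = list_nc4_files(tmp)
    found_names = [Path(p).name for p in found]
    print(found_names)
```

In the notebook, `list_nc4_files('../../multiple_netcdf_files/')` would play the role of the loop above.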

### Scenario 2: Files in Different Subfolders
- Description: Files downloaded from a cluster machine are often distributed across various subdirectories, which necessitates a different approach.
- Recommendation: If your files are located in subdirectories, you'll need to employ an alternative method to manage and process them effectively.
- How it works:
    1. Provide the root directory.
    2. For each directory it visits, the loop joins (`os.path.join`) the directory path with each file name found there.
    3. The result is the full path from the root directory to each individual file.
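
The three steps above can be sketched as a small helper built on `os.walk`. The function name `collect_files`, the extension filter, and the folder layout in the demo are assumptions made for illustration; the demo builds a temporary directory tree so it runs without real data.

```python
import os
import tempfile


def collect_files(root_dir, extension=".nc4"):
    """Walk `root_dir` recursively and return paths of files ending in `extension`."""
    paths = []
    for root, _dirs, files in os.walk(root_dir):
        for filename in files:
            if filename.endswith(extension):
                # Join the directory being visited with the file name.
                paths.append(os.path.join(root, filename))
    return sorted(paths)


# Demo on a temporary tree with hypothetical subfolders.
with tempfile.TemporaryDirectory() as tmp:
    layout = [("2018/01", "a.nc4"), ("2018/02", "b.nc4"), ("2018/02", "readme.txt")]
    for sub, name in layout:
        folder = os.path.join(tmp, *sub.split("/"))
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, name), "w").close()
    relative = [os.path.relpath(p, tmp) for p in collect_files(tmp)]
    print(relative)
```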

In [12]:
# # list of FILES 2021
# file_path_2021= []

# for root, dirs, files in os.walk('../../../Clusters_DATA_oil/OCO-2/2018/'):
#     for filename in files:
#         print(os.path.join(root, filename))
        
#         # Append the files into list
#         file_path_2021.append(os.path.join(root, filename))

In [13]:
#files= os.listdir('../ENTIRE_datasets/OCO-2_datasets/2019_2020/')

# files= os.listdir('')
# # LISTING the path of FILES
# files

## Example:
### Open a single file in netCDF format from the path

In [14]:
df_xco2= nc.Dataset(file_names[0])

In [15]:
list(df_xco2.variables.keys())

['sounding_id',
 'levels',
 'bands',
 'vertices',
 'footprints',
 'date',
 'latitude',
 'longitude',
 'time',
 'solar_zenith_angle',
 'sensor_zenith_angle',
 'xco2_quality_flag',
 'xco2_qf_bitflag',
 'xco2_qf_simple_bitflag',
 'source_files',
 'file_index',
 'vertex_latitude',
 'vertex_longitude',
 'xco2',
 'xco2_uncertainty',
 'xco2_apriori',
 'pressure_levels',
 'co2_profile_apriori',
 'xco2_averaging_kernel',
 'pressure_weight']

### Function for changing DateTime format

In [16]:
# Convert a sounding_id (e.g. 2021080400235307) into a datetime object
def conv_date(d):
    return datetime.strptime(str(d), '%Y%m%d%H%M%S%f')
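
As a worked example, the snippet below applies the same conversion to the sounding ID that appears in the output file name later in this notebook. Note that `%f` reads whatever digits follow the seconds as a fraction of a second, so the trailing `07` becomes 0.07 s. This describes how `strptime` parses the digits, not necessarily how the data product defines them.

```python
from datetime import datetime


def conv_date(d):
    # Same format string as in the notebook cell above.
    return datetime.strptime(str(d), '%Y%m%d%H%M%S%f')


example = conv_date(2021080400235307)
print(example)  # 2021-08-04 00:23:53.070000
```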

### Count the files in the directory

In [17]:
countFiles=0

for j in file_names:
    if j.endswith(".nc4"):
        countFiles+=1
        #print(j)
        
print('\nTotalFiles: ', countFiles)


TotalFiles:  14


### Creating a new directory

In [18]:
current_directory= os.getcwd()
frames_folder= os.path.join(current_directory, r'csv_files')

if not os.path.exists(frames_folder):
    os.makedirs(frames_folder)
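
An equivalent and slightly shorter form passes `exist_ok=True` to `os.makedirs`, which makes the explicit existence check unnecessary. The sketch below runs inside a temporary directory so that it does not touch the working folder; the folder name mirrors the one used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    frames_folder = os.path.join(base, "csv_files")

    # Creates the folder on the first call; the second call does nothing.
    os.makedirs(frames_folder, exist_ok=True)
    os.makedirs(frames_folder, exist_ok=True)

    created = os.path.isdir(frames_folder)
    print(created)
```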

## Function:
- The function below takes the path of a single file and converts it to CSV/TXT format.
- Converted files are saved in the `csv_files` folder, inside the directory the code runs from.
- NOTE: In this script, the entire dataframe is filtered to keep only GOOD quality soundings (quality_flag == 0).

In [19]:
def convHdf(path_file, n=0):
    """Convert one netCDF file to a CSV-formatted .txt file in `csv_files`.

    `n` is an optional file index; it is not used here and is kept so the
    function can be called as convHdf(path, index).
    """
    # Open the netCDF file; the `with` block closes it when done
    with nc.Dataset(path_file) as data:
        # Copy the selected netCDF variables into a dataframe
        df_xco2 = pd.DataFrame()

        df_xco2['Xco2'] = data.variables['xco2'][:]
        df_xco2['Latitude'] = data.variables['latitude'][:]
        df_xco2['Longitude'] = data.variables['longitude'][:]
        df_xco2['quality_flag'] = data.variables['xco2_quality_flag'][:]
        df_xco2['DateTime'] = data.variables['sounding_id'][:]

        # The first sounding_id and the sensor name go into the output file name
        date = str(data.variables['sounding_id'][0])
        sensor = data.Sensor

    # Convert sounding_id to datetime format
    df_xco2['DateTime'] = df_xco2['DateTime'].apply(conv_date)
    df_xco2['DateTime'] = pd.to_datetime(df_xco2['DateTime'])

    # Year, month and day columns
    df_xco2['Year'] = df_xco2['DateTime'].dt.year
    df_xco2['Month'] = df_xco2['DateTime'].dt.month
    df_xco2['Day'] = df_xco2['DateTime'].dt.day

    # Keep only GOOD quality rows (quality_flag == 0)
    # NOTE: this reduces the size of the output file
    df_xco2 = df_xco2[df_xco2['quality_flag'] == 0]

    # Write the CSV to the folder: csv_files
    out_path = os.path.join('csv_files', sensor + '_xco2_' + date + '_.txt')
    df_xco2.to_csv(out_path, index=False)

## Similar workflow: converting OCO-3 SIF netCDF files to CSV

In [None]:
# # FUNCTION to convert data
# def convOCO3(path_file, n=0):

#     #path= '../hdf_format/Los_angeles_GROUPED/'
#     data= nc.Dataset(path_file)

#     # get the HDF data and convert to CSV
#     df_sif= pd.DataFrame()

#     df_sif['sif_757nm']= data.variables['Daily_SIF_757nm'][:]
#     df_sif['Latitude']= data.variables['Latitude'][:]
#     df_sif['Longitude']= data.variables['Longitude'][:] 
#     df_sif['quality_flag']= data.variables['Quality_Flag'][:] 
    
#     # Date
#     # Date time not found 
# #     df_xco2['DateTime']= data.variables['sounding_id'][:]
    
# #     #Convert soundingID to datetime format
# #     df_xco2['DateTime']= df_xco2['DateTime'].apply(conv_date)
# #     df_xco2['DateTime']= pd.to_datetime(df_xco2['DateTime'])
    
# #     # YEAR and month column
# #     df_xco2['Year']= df_xco2['DateTime'].dt.year
# #     df_xco2['Month']= df_xco2['DateTime'].dt.month
# #     df_xco2['Day']= df_xco2['DateTime'].dt.day
    
    
#     # xco2 quality flag -> 0
#  #   df_sif= df_sif[df_sif['quality_flag'] == 0]
    
# #    date= str(data.variables['sounding_id'][0])                                   
#     # create a CSV
#     # OCO3 sensor
#     df_sif.to_csv(data.sensor[:5]+'_sif_'+str(n)+ '_.txt', index= False)
# #     df_xco2.to_feather(data.Sensor+'_xco2_'+ date+'_.txt')

## Testing: Single file conversion and open with pandas

In [20]:
convHdf(file_names[0])

In [21]:
## read the file
df= pd.read_csv("csv_files/OCO-2_xco2_2021080400235307_.txt")
df.head(5)

        Xco2    Latitude   Longitude  quality_flag                 DateTime  Year  Month  Day
0  415.94247  -42.315613  -157.57877             0  2021-08-04 00:24:05.760  2021      8    4
1  415.47308  -42.302628   -157.6313             0  2021-08-04 00:24:05.780  2021      8    4
2  415.41177   -42.30393  -157.55707             0  2021-08-04 00:24:06.050  2021      8    4
3  414.96786  -42.286022  -157.56154             0  2021-08-04 00:24:06.350  2021      8    4
4  415.89227   -42.27342   -157.6139             0  2021-08-04 00:24:06.370  2021      8    4
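
A quick sanity check on the converted file is to confirm that only good-quality rows remain. The sketch below uses the standard-library `csv` module on the five rows shown above, pasted into a string so that it runs without the file on disk. The variable names are illustrative.

```python
import csv
import io

# The five rows displayed above, in the CSV layout written by to_csv(index=False).
sample = """Xco2,Latitude,Longitude,quality_flag,DateTime,Year,Month,Day
415.94247,-42.315613,-157.57877,0,2021-08-04 00:24:05.760,2021,8,4
415.47308,-42.302628,-157.6313,0,2021-08-04 00:24:05.780,2021,8,4
415.41177,-42.30393,-157.55707,0,2021-08-04 00:24:06.050,2021,8,4
414.96786,-42.286022,-157.56154,0,2021-08-04 00:24:06.350,2021,8,4
415.89227,-42.27342,-157.6139,0,2021-08-04 00:24:06.370,2021,8,4
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Every remaining row should carry the good-quality flag.
all_good = all(row["quality_flag"] == "0" for row in rows)

# Mean XCO2 (ppm) over the sample rows.
mean_xco2 = sum(float(row["Xco2"]) for row in rows) / len(rows)
print(all_good, mean_xco2)
```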


## Run the following cell
- The few lines of code below iterate over every netCDF file, convert each one to CSV, and save it to the directory `csv_files`.
- NOTE: Rows are filtered by XCO2 quality flag (0) to reduce the total file size.

In [None]:
# Use the function to read the files from the directory and convert all netCDF files to csv/txt
for j in range(0, len(file_names)):
    # Convert the j-th file in the list
    convHdf(file_names[j], j)
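
When converting many files, a single unreadable file would stop the loop above. One possible variation, sketched here, records failures and keeps going. The helper `convert_all` and the stand-in converter `fake_convert` are hypothetical and exist only to make the example runnable without netCDF data.

```python
def convert_all(paths, convert):
    """Call `convert(path, index)` on every path; collect failures instead of stopping."""
    failed = []
    for index, path in enumerate(paths):
        try:
            convert(path, index)
        except Exception as err:  # broad on purpose: keep the batch running
            failed.append((path, str(err)))
    return failed


# Stand-in converter that rejects one hypothetical file name.
def fake_convert(path, index):
    if "corrupt" in path:
        raise OSError("cannot open file")


failures = convert_all(["a.nc4", "corrupt.nc4", "b.nc4"], fake_convert)
print(failures)
```

In the notebook, `convert_all(file_names, convHdf)` would take the place of the loop, and the returned list shows which files need a second look.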