<a href="https://colab.research.google.com/github/rwcitek/C11-capstone-project/blob/main/eBird/ebird_bigquery.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with BigQuery


This notebook is intended to run in Google Colab.




### Set up the workspace


In [None]:
!rm -rf data/


In [None]:
mkdir data

In [None]:
cd data

/content/data


In [None]:
pwd

'/content/data'

### Download some sample data

In [None]:
%%capture output
%%bash
apt-get update
apt-get install -y tree jq curl less

In [None]:
!curl -O 'https://ebird.org/downloads/samples/ebd-datafile-SAMPLE.zip'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  455k  100  455k    0     0   618k      0 --:--:-- --:--:-- --:--:--  619k


In [None]:
!ls -la


total 468
drwxr-xr-x 2 root root   4096 Nov 22 13:43 .
drwxr-xr-x 1 root root   4096 Nov 22 13:43 ..
-rw-r--r-- 1 root root 466849 Nov 22 13:43 ebd-datafile-SAMPLE.zip


In [None]:
!unzip ebd-datafile-SAMPLE.zip

Archive:  ebd-datafile-SAMPLE.zip
   creating: ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/
  inflating: __MACOSX/._ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE  
  inflating: ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt  
  inflating: __MACOSX/ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/._ebd_US-AL-101_202204_202204_relApr-2022.txt  
  inflating: ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/terms_of_use.txt  
  inflating: __MACOSX/ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/._terms_of_use.txt  
  inflating: ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/BCRCodes.txt  
  inflating: __MACOSX/ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/._BCRCodes.txt  
  inflating: ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/eBird_Basic_Dataset_Metadata_v1.14.pdf  
  inflating: __MACOSX/ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/._eBird_Basic_Dataset_Metadata_v1.14.pdf  
  inflating: ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/IBACod

In [None]:
!ls -la


total 476
drwxr-xr-x 4 root root   4096 Nov 22 13:43 .
drwxr-xr-x 1 root root   4096 Nov 22 13:43 ..
-rw-r--r-- 1 root root 466849 Nov 22 13:43 ebd-datafile-SAMPLE.zip
drwxr-xr-x 2 root root   4096 Jun  8  2022 ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE
drwxr-xr-x 3 root root   4096 Nov 22 13:43 __MACOSX


In [None]:
!rm -rf __MACOSX/

In [None]:
!tree ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/


ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/
├── BCRCodes.txt
├── ebd_US-AL-101_202204_202204_relApr-2022.txt
├── eBird_Basic_Dataset_Metadata_v1.14.pdf
├── IBACodes.txt
├── recommended_citation.txt
├── terms_of_use.txt
└── USFWSCodes.txt

0 directories, 7 files


In [None]:
%%bash
wc -l ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt


1400 ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt
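
The `wc -l` count includes the header line, so 1400 lines correspond to 1399 data rows, which is the row count `bq show` reports after a load that skips one leading row. A minimal Python sketch of the same check follows. It assumes a tab-delimited file with a single header row; the function name `summarize_tsv` and the sample rows are made up for illustration.

```python
import csv
import io


def summarize_tsv(stream):
    """Return (number of header columns, number of data rows)."""
    # QUOTE_NONE treats quote characters as ordinary text. That is an
    # assumption about the file, not something the notebook states.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    data_rows = sum(1 for _ in reader)
    return len(header), data_rows


# Tiny in-memory stand-in for the real file: one header line, two data rows.
sample = io.StringIO("COMMON NAME\tOBSERVATION COUNT\nBlue Jay\t1\nBarred Owl\t2\n")
result = summarize_tsv(sample)
print(result)
```

For the real file, pass an open text-mode file handle instead of the `StringIO` object.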


## Set up Google Cloud Platform

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [2]:
import os
project_id = "default-256400"
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
os.environ["GOOGLE_CLOUD_BUCKET"] = "c11-ebird"


In [3]:
project_id

'default-256400'

In [4]:
os.environ["GOOGLE_CLOUD_PROJECT"]

'default-256400'

In [5]:
os.environ["GOOGLE_CLOUD_BUCKET"]

'c11-ebird'

## Create a bucket

Create a bucket in the Cloud Storage browser: https://console.cloud.google.com/storage/browser


The bucket used in this notebook:

- https://console.cloud.google.com/storage/browser/c11-ebird;tab=objects?project=default-256400&prefix=&forceOnObjectsSortingFiltering=false

## Phase 1:



1. Upload the CSV (tab-delimited) file to Google Cloud Storage (GS)
1. Import it into BigQuery (BQ)
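
The import step can also be driven from Python by assembling the same `bq load` flags that the bash cells in this notebook pass. The sketch below only builds and prints the command. The helper name `bq_load_command` is hypothetical, and running the result requires an authenticated `bq` CLI.

```python
import shlex


def bq_load_command(project, table, uri):
    """Build the argument list for a tab-delimited `bq load` with autodetect."""
    return [
        "bq",
        f"--project_id={project}",
        "load",
        "--source_format=CSV",
        r"--field_delimiter=\t",  # literal backslash-t, as in the bash cells
        "--autodetect",
        "--skip_leading_rows=1",
        table,
        uri,
    ]


cmd = bq_load_command(
    "default-256400",
    "ebird.ebird_small_2",
    "gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt",
)
# Inspect the command first; subprocess.run(cmd, check=True) would execute it.
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids shell quoting problems with the delimiter flag.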



In [None]:
file = 'ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt'


In [None]:
!gcloud storage cp {file} gs://c11-ebird/

Copying file://ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt to gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt


In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/

    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-22T00:57:52Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz
TOTAL: 2 objects, 564709 bytes (551.47kiB)


### From web interface

Create a data set and import CSV to data set:

- https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv

### Repeat using bq command

In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  


In [None]:
!bq rm -f {project_id}:ebird.ebird_small_2

In [None]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
  --source_format=CSV \
  --field_delimiter='\t' \
  --autodetect \
  --skip_leading_rows=1 \
  ebird.ebird_small_2 \
  gs://${GOOGLE_CLOUD_BUCKET}/ebd_US-AL-101_202204_202204_relApr-2022.txt


Waiting on bqjob_r41bf882ca6d188fa_0000018bf8187ee7_1 ... (0s) Current status: DONE   


In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  


In [None]:
!bq show {project_id}:ebird.ebird_small_2

Table default-256400:ebird.ebird_small_2

   Last modified                   Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Total Logical Bytes   Total Physical Bytes   Labels  
 ----------------- --------------------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- -------- 
  22 Nov 17:33:58   |- GLOBAL_UNIQUE_IDENTIFIER: string     1399         534727                                                            534727                                               
                    |- LAST_EDITED_DATE: timestamp                                                                                                                                              
                    |- TAXONOMIC_ORDER: integer                                                                                                                                           

In [None]:
!bq --project_id=${GOOGLE_CLOUD_PROJECT} query "select * from ebird.ebird_small_2 limit 10"


In [None]:
%%bigquery df --project {project_id}
select *
from ebird.ebird_small_2
limit 10


In [None]:
df

| | GLOBAL_UNIQUE_IDENTIFIER | LAST_EDITED_DATE | TAXONOMIC_ORDER | CATEGORY | TAXON_CONCEPT_ID | COMMON_NAME | SCIENTIFIC_NAME | SUBSPECIES_COMMON_NAME | SUBSPECIES_SCIENTIFIC_NAME | EXOTIC_CODE | ... | NUMBER_OBSERVERS | ALL_SPECIES_REPORTED | GROUP_IDENTIFIER | HAS_MEDIA | APPROVED | REVIEWED | REASON | TRIP_COMMENTS | SPECIES_COMMENTS | string_field_49 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | URN:CornellLabOfOrnithology:EBIRD:OBS1380605933 | 2022-04-02 09:55:59.208261+00:00 | 20496 | species | avibase-1DA430B8 | Blue Jay | Cyanocitta cristata | | | | ... | 1 | 1 | | 0 | 1 | 0 | | | | |
| 1 | URN:CornellLabOfOrnithology:EBIRD:OBS1380605935 | 2022-04-02 09:55:59.212963+00:00 | 8773 | species | avibase-7AA076EF | Barred Owl | Strix varia | | | | ... | 1 | 1 | | 0 | 1 | 0 | | | | |
| 2 | URN:CornellLabOfOrnithology:EBIRD:OBS1380605934 | 2022-04-02 09:55:59.209422+00:00 | 1180 | species | avibase-9C5ED06A | Wild Turkey | Meleagris gallopavo | | | | ... | 1 | 1 | | 0 | 1 | 0 | | | | |
| 3 | URN:CornellLabOfOrnithology:EBIRD:OBS1380605936 | 2022-04-02 09:55:59.214060+00:00 | 27403 | species | avibase-8E1D9327 | Wood Thrush | Hylocichla mustelina | | | | ... | 1 | 1 | | 0 | 1 | 0 | | | | |
| 4 | URN:CornellLabOfOrnithology:EBIRD:OBS1380683993 | 2022-04-02 11:12:42.368739+00:00 | 7909 | species | avibase-EB98812F | Cooper's Hawk | Accipiter cooperii | | | | ... | 1 | 0 | | 0 | 1 | 0 | | | | |
| 5 | URN:CornellLabOfOrnithology:EBIRD:OBS1380953818 | 2022-04-02 14:41:30.486432+00:00 | 31267 | species | avibase-B4EE123D | Purple Finch | Haemorhous purpureus | | | | ... | 1 | 0 | | 0 | 1 | 0 | | | Male | |
| 6 | URN:CornellLabOfOrnithology:EBIRD:OBS1380974027 | 2022-04-02 14:59:14.096444+00:00 | 7412 | species | avibase-4FF7DE80 | Black Vulture | Coragyps atratus | | | | ... | 1 | 0 | | 0 | 1 | 0 | | | | |
| 7 | URN:CornellLabOfOrnithology:EBIRD:OBS1381139222 | 2022-04-02 17:46:33.220130+00:00 | 31905 | species | avibase-37E9CCDA | Chipping Sparrow | Spizella passerina | | | | ... | 1 | 0 | | 0 | 1 | 0 | | | | |
| 8 | URN:CornellLabOfOrnithology:EBIRD:OBS1383013775 | 2022-04-04 14:55:28.984346+00:00 | 27389 | species | avibase-B01E8BD4 | Hermit Thrush | Catharus guttatus | | | | ... | 2 | 0 | | 0 | 1 | 0 | | | | |
| 9 | URN:CornellLabOfOrnithology:EBIRD:OBS1386285800 | 2022-04-08 15:53:44.167690+00:00 | 1180 | species | avibase-9C5ED06A | Wild Turkey | Meleagris gallopavo | | | | ... | 5 | 0 | | 0 | 1 | 0 | | | Tall brown bird of classic turkey silhouette a... | |


## Phase 2:


1. compress the CSV file with gzip
1. upload the compressed CSV file to GS
1. import into BQ
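
The compression step can also be done in Python. Note that plain `gzip <file>` replaces the original with the `.gz` file, whereas the sketch below keeps the original in place (similar to `gzip -k`). The helper name `gzip_keep` and the demo contents are made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_keep(src):
    """Write '<src>.gz' next to src, leave src untouched, return the new path."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # chunked copy keeps memory use flat
    return dst


# Demo on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "sample.txt"
    original.write_bytes(b"COMMON NAME\tCOUNT\n" + b"Blue Jay\t1\n" * 1000)
    packed = gzip_keep(original)
    still_there = original.exists()
    with gzip.open(packed, "rb") as handle:
        round_trip = handle.read()
    sizes = (original.stat().st_size, packed.stat().st_size)
print(still_there, sizes)
```

Keeping the uncompressed file means later cells that reference the original path continue to work after compression.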



In [None]:
!ls -la {file}

-rw-r--r-- 1 root root 504698 May 19  2022 ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt


In [None]:
!gzip {file}

In [None]:
!ls -la {file}.gz


-rw-r--r-- 1 root root 60011 May 19  2022 ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


In [None]:
!gcloud storage cp {file}.gz gs://c11-ebird/

Copying file://ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz to gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 4 objects, 32952137410 bytes (30.69GiB)


In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  


In [None]:
!bq rm -f {project_id}:ebird.ebird_small_3

In [None]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
  --source_format=CSV \
  --field_delimiter='\t' \
  --autodetect \
  --skip_leading_rows=1 \
  ebird.ebird_small_3 \
  gs://${GOOGLE_CLOUD_BUCKET}/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


BigQuery error in load operation: Error processing job
'default-256400:bqjob_r6fc80c33e33569d1_0000018bf824e293_1': Not found: URI
gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


Waiting on bqjob_r6fc80c33e33569d1_0000018bf824e293_1 ... (0s) Current status: DONE   


CalledProcessError: ignored

In [None]:
!bq ls {project_id}:ebird

     tableId      Type    Labels   Time Partitioning   Clustered Fields  
 --------------- ------- -------- ------------------- ------------------ 
  ebird_small     TABLE                                                  
  ebird_small_3   TABLE                                                  


In [None]:
!bq rm -f {project_id}:ebird.ebird_small_3

In [None]:
!bq ls {project_id}:ebird

    tableId     Type    Labels   Time Partitioning   Clustered Fields  
 ------------- ------- -------- ------------------- ------------------ 
  ebird_small   TABLE                                                  


## Phase 3:


1. compress the CSV file with gzip
1. create a tar file with the compressed CSV file
1. untar and stream the compressed CSV file to GS
1. import into BQ
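
The tar step stores the compressed file under its bare name, without the sample directory prefix, which is what `tar -C <dir> <name>` achieves. A rough Python equivalent using the standard `tarfile` module is sketched below. The helper names `make_tar` and `list_tar` and the demo file are hypothetical.

```python
import os
import tarfile
import tempfile


def make_tar(tar_path, file_path):
    """Archive file_path under its base name only, like `tar -C <dir> <name>`."""
    with tarfile.open(tar_path, "w") as archive:
        archive.add(file_path, arcname=os.path.basename(file_path))


def list_tar(tar_path):
    """Return (name, size) pairs, similar in spirit to `tar -tvf`."""
    with tarfile.open(tar_path, "r") as archive:
        return [(member.name, member.size) for member in archive.getmembers()]


# Demo with a throwaway file inside a nested directory (names are made up).
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "SAMPLE")
    os.mkdir(nested)
    member_path = os.path.join(nested, "sample.txt.gz")
    with open(member_path, "wb") as handle:
        handle.write(b"stand-in bytes")
    tar_path = os.path.join(tmp, "example.tar")
    make_tar(tar_path, member_path)
    listing = list_tar(tar_path)
print(listing)
```

Storing the member under a bare name lets the later extraction step refer to it without knowing the original directory layout.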



In [None]:
!ls -la {file}.gz

-rw-r--r-- 1 root root 60011 May 19  2022 ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


Create the tar file

In [None]:
path, file1 = file.split("/")

In [None]:
!tar -cvf example.tar -C {path} {file1}.gz

ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


In [None]:
!ls -la


total 544
drwxr-xr-x 3 root root   4096 Nov 22 06:37 .
drwxr-xr-x 1 root root   4096 Nov 22 06:22 ..
-rw-r--r-- 1 root root 466849 Nov 22 06:23 ebd-datafile-SAMPLE.zip
drwxr-xr-x 2 root root   4096 Nov 22 06:25 ebd_US-AL-101_202204_202204_relApr-2022_SAMPLE
-rw-r--r-- 1 root root  71680 Nov 22 06:44 example.tar


In [None]:
!tar -tvf example.tar

-rw-r--r-- root/root     60011 2022-05-19 17:35 ebd_US-AL-101_202204_202204_relApr-2022.txt.gz


Untar and stream to GS.

In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/


    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-22T00:57:52Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz
TOTAL: 2 objects, 564709 bytes (551.47kiB)


In [None]:
%%bash
tar -O -xf example.tar ebd_US-AL-101_202204_202204_relApr-2022.txt.gz |
gcloud storage cp - gs://${GOOGLE_CLOUD_BUCKET}/ebd_US-AL-101_202204_202204_relApr-2022.txt.v02.gz



Copying file://- to gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v02.gz
  
......


In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 4 objects, 32952137410 bytes (30.69GiB)


In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  


In [None]:
!bq rm -f {project_id}:ebird.ebird_small_4

In [None]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
  --source_format=CSV \
  --field_delimiter='\t' \
  --autodetect \
  --skip_leading_rows=1 \
  ebird.ebird_small_4 \
  gs://${GOOGLE_CLOUD_BUCKET}/ebd_US-AL-101_202204_202204_relApr-2022.txt.v02.gz


Waiting on bqjob_r80dce339e5bf06_0000018bf5cd9540_1 ... (1s) Current status: DONE   


In [None]:
!bq ls {project_id}:ebird

     tableId      Type    Labels   Time Partitioning   Clustered Fields  
 --------------- ------- -------- ------------------- ------------------ 
  ebird_small     TABLE                                                  
  ebird_small_4   TABLE                                                  


In [None]:
!bq show {project_id}:ebird.ebird_small

In [None]:
!bq show {project_id}:ebird.ebird_small_4

In [None]:
!bq rm -f {project_id}:ebird.ebird_small_4

## Phase 4:



1. compress the CSV file with gzip
1. create a tar file with the compressed CSV file
1. upload the tar file to DigitalOcean's object store (Spaces)
1. get, untar, and stream the compressed CSV file to GS
1. import into BQ
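
The "get, untar, and stream" step never writes the archive to disk: the tar data is read front to back and only the wanted member is forwarded. A minimal Python sketch of that idea uses `tarfile` in stream mode (`"r|"`), which works on inputs that cannot seek. The names `extract_from_stream` and `ReadOnly` and the demo payload are made up for illustration.

```python
import io
import shutil
import tarfile


def extract_from_stream(stream, member_name, sink):
    """Scan a tar stream once and copy the named member into sink.

    Returns True if the member was found, False otherwise.
    """
    # "r|" tells tarfile the input cannot seek, like a pipe or an HTTP body.
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if member.name == member_name and member.isfile():
                shutil.copyfileobj(archive.extractfile(member), sink)
                return True
    return False


class ReadOnly:
    """Wrapper exposing only read(), to mimic a non-seekable pipe."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        return self._raw.read(size)


# Build a small in-memory archive with a made-up payload, then stream it back.
payload = b"stand-in for gzip-compressed bytes"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    info = tarfile.TarInfo("sample.txt.gz")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

sink = io.BytesIO()
found = extract_from_stream(ReadOnly(buffer), "sample.txt.gz", sink)
print(found, len(sink.getvalue()))
```

In a real transfer the `stream` argument could be an HTTP response object such as the one returned by `urllib.request.urlopen`, and `sink` could be the standard input of an upload process. Both are assumptions here and are not tested against the actual bucket.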


In [None]:
!curl -I https://cnmi-ebird.sfo3.digitaloceanspaces.com/example.tar

HTTP/2 200 
content-length: 71680
accept-ranges: bytes
last-modified: Wed, 22 Nov 2023 06:55:56 GMT
x-rgw-object-type: Normal
etag: "fb5a9186a43fc4d150feb7bfff027363"
x-amz-request-id: tx00000bee062f4ff82cacc-00655e3f9a-3c6f48c0-sfo3a
content-type: application/x-tar
date: Wed, 22 Nov 2023 17:51:22 GMT
vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
strict-transport-security: max-age=15552000; includeSubDomains; preload
x-envoy-upstream-healthchecked-cluster: 



In [None]:
%%bash
curl -s https://cnmi-ebird.sfo3.digitaloceanspaces.com/example.tar |
tar -O -xf - ebd_US-AL-101_202204_202204_relApr-2022.txt.gz |
dd bs=1M status=progress |
gcloud storage cp - gs://${GOOGLE_CLOUD_BUCKET}/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz


0+5 records in
0+5 records out
60011 bytes (60 kB, 59 KiB) copied, 0.309062 s, 194 kB/s
Copying file://- to gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
  
....


11100700672 bytes (11 GB, 10 GiB) copied, 25 s, 444 MB/s
0+1751059 records in
0+1751059 records out
11479859200 bytes (11 GB, 11 GiB) copied, 25.8381 s, 444 MB/s
^C


In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-22T17:52:45Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 5 objects, 32952197421 bytes (30.69GiB)


In [None]:
!curl -I https://cnmi-ebird.sfo3.digitaloceanspaces.com/ebd_relSep-2023.tar

HTTP/2 200 
content-length: 200955054080
accept-ranges: bytes
last-modified: Thu, 16 Nov 2023 17:18:22 GMT
x-rgw-object-type: Normal
etag: "61420aff1e69e47d679e92353ab0332f-913"
x-amz-meta-s3cmd-attrs: atime:1699546641/ctime:1699889200/gid:0/gname:root/md5:086921907fba6ecdf715782309a3e5d0/mode:33188/mtime:1699672342/uid:0/uname:root
x-amz-request-id: tx0000002a239604640fa19-00655e4059-3c6f4933-sfo3a
content-type: application/x-tar
date: Wed, 22 Nov 2023 17:54:33 GMT
vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
strict-transport-security: max-age=15552000; includeSubDomains; preload
x-envoy-upstream-healthchecked-cluster: 



In [None]:
# 200,955,054,080
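
The comment above is the `content-length` value written with thousands separators. A small sketch for producing that form, plus a binary-unit size in the same style as the `gcloud storage ls -l` totals, is shown below. The helper name `human_bytes` is hypothetical.

```python
def human_bytes(n):
    """Format a byte count with binary units and two decimals."""
    units = ["B", "kiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f}{unit}"
        value /= 1024


size = 200955054080  # content-length reported for ebd_relSep-2023.tar
with_separators = f"{size:,}"
print(with_separators)
print(human_bytes(size))
```

As a sanity check, `human_bytes(32952137410)` gives `30.69GiB`, matching the bucket total printed by `gcloud storage ls -l` elsewhere in this notebook.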

In [None]:
!ls -la

total 24
drwxr-xr-x 1 root root 4096 Nov 20 14:42 .
drwxr-xr-x 1 root root 4096 Nov 22 17:24 ..
drwxr-xr-x 1 root root 4096 Nov 22 17:31 .config
drwxr-xr-x 1 root root 4096 Nov 20 14:42 sample_data


In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  


In [None]:
!bq rm -f {project_id}:ebird.ebird_small_5

In [None]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
  --source_format=CSV \
  --field_delimiter='\t' \
  --autodetect \
  --skip_leading_rows=1 \
  ebird.ebird_small_5 \
  gs://${GOOGLE_CLOUD_BUCKET}/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz


Waiting on bqjob_r2512cc9f16e3a39b_0000018bf82e5fcc_1 ... (1s) Current status: DONE   


In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  
  ebird_small_5              TABLE                                                  


In [None]:
!bq show {project_id}:ebird.ebird_small

Table default-256400:ebird.ebird_small

   Last modified                   Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Total Logical Bytes   Total Physical Bytes   Labels  
 ----------------- --------------------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- -------- 
  21 Nov 23:41:57   |- GLOBAL_UNIQUE_IDENTIFIER: string     1399         534727                                                            534727                41452                          
                    |- LAST_EDITED_DATE: timestamp                                                                                                                                              
                    |- TAXONOMIC_ORDER: integer                                                                                                                                             

In [None]:
!bq show {project_id}:ebird.ebird_small_5

Table default-256400:ebird.ebird_small_5

   Last modified                   Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Total Logical Bytes   Total Physical Bytes   Labels  
 ----------------- --------------------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- -------- 
  22 Nov 07:06:00   |- GLOBAL_UNIQUE_IDENTIFIER: string     1399         534727                                                            534727                                               
                    |- LAST_EDITED_DATE: timestamp                                                                                                                                              
                    |- TAXONOMIC_ORDER: integer                                                                                                                                           

In [None]:
!bq rm -f {project_id}:ebird.ebird_small_5

In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/*.gz

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
     60011  2023-11-22T17:52:45Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 3 objects, 5847718721 bytes (5.45GiB)


In [None]:
# %%bash
# gcloud storage rm gs://${GOOGLE_CLOUD_BUCKET}/*.gz

Removing objects:
Removing gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.gz...
Removing gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v02.gz...
Removing gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz...


## Transfer the ~5 GB eBird sampling file

In [None]:
%%bash
curl -s https://cnmi-ebird.sfo3.digitaloceanspaces.com/ebd_sampling_relSep-2023.tar |
dd bs=1M status=progress |
tar -tvf -


-rw-r--r-- ebdBuilder/ebdBuilder 4815040900 2023-10-15 18:45 ebd_sampling_relSep-2023.txt.gz
-rwxr-xr-x ebdBuilder/ebdBuilder       1960 2023-10-15 10:00 BCRCodes.txt
-rwxr-xr-x ebdBuilder/ebdBuilder     126957 2023-10-15 10:00 IBACodes.txt
-rwxr-xr-x ebdBuilder/ebdBuilder      39670 2023-10-15 10:00 USFWSCodes.txt
-rw-r--r-- ebdBuilder/ebdBuilder     431439 2023-10-15 10:00 eBird_Basic_Dataset_Metadata_v1.15.pdf
-rwxr-xr-x ebdBuilder/ebdBuilder        103 2023-10-15 10:00 recommended_citation.txt
-rwxr-xr-x ebdBuilder/ebdBuilder       6703 2023-10-15 10:00 terms_of_use.txt



In [None]:
%%bash
curl -s https://cnmi-ebird.sfo3.digitaloceanspaces.com/ebd_sampling_relSep-2023.tar |
tar -O -xf - ebd_sampling_relSep-2023.txt.gz |
dd bs=1M status=progress |
gcloud storage cp - gs://${GOOGLE_CLOUD_BUCKET}/ebd_sampling_relSep-2023.txt.gz


Copying file://- to gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
  

In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-22T17:52:45Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 5 objects, 32952197421 bytes (30.69GiB)


In [None]:
%%bash
gsutil cp gs://${GOOGLE_CLOUD_BUCKET}/ebd_sampling_relSep-2023.txt.gz - |
gunzip |
dd bs=1M status=progress |
gsutil cp - gs://${GOOGLE_CLOUD_BUCKET}/ebd_sampling_relSep-2023.txt


Copying from <STDIN>...

In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}/

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-22T17:52:45Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 5 objects, 32952197421 bytes (30.69GiB)
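
The listings give both the compressed and uncompressed sizes, so the gzip compression ratio can be worked out directly. The byte counts below are copied from the outputs in this notebook; the dictionary keys are labels chosen for illustration.

```python
# Sizes in bytes as (uncompressed .txt, compressed .txt.gz).
sizes = {
    "sample": (504698, 60011),
    "sampling": (27103974002, 4815040900),
}

ratios = {name: raw / packed for name, (raw, packed) in sizes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio like this is a quick way to estimate how much bucket storage the uncompressed copy of a file will need before decompressing it.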


In [None]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
  --source_format=CSV \
  --field_delimiter='\t' \
  --autodetect \
  --skip_leading_rows=1 \
  ebird.ebd_sampling_relSep_2023 \
  gs://${GOOGLE_CLOUD_BUCKET}/ebd_sampling_relSep-2023.txt


Waiting on bqjob_r220523e18874cbcb_0000018bf7675d17_1 ... (5s) Current status: RUNNING

In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  


In [None]:
!bq show {project_id}:ebird.ebd_sampling_relSep_2023

Table default-256400:ebird.ebd_sampling_relSep_2023

   Last modified                   Schema                  Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Total Logical Bytes   Total Physical Bytes   Labels  
 ----------------- -------------------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- -------- 
  22 Nov 14:20:43   |- LAST_EDITED_DATE: timestamp         111388197    27063315375                                                       27063315375           0                              
                    |- country: string                                                                                                                                                         
                    |- COUNTRY_CODE: string                                                                                                                                        

## New Mexico observations

In [None]:
!curl -I https://cnmi-ebird.sfo3.digitaloceanspaces.com/data.nm.txt.gz

# 1,032,617,810

HTTP/2 200 
content-length: 1032617810
accept-ranges: bytes
last-modified: Thu, 16 Nov 2023 17:18:22 GMT
x-rgw-object-type: Normal
etag: "f65095637a2e54553346a337547e3b82-66"
x-amz-meta-s3cmd-attrs: atime:1700146076/ctime:1700146258/gid:0/gname:root/md5:c7ce3b4dbaf87bac3fca0f5d6d592f5c/mode:33188/mtime:1700146258/uid:0/uname:root
x-amz-request-id: tx000000aea1807f1f75728-00655e0f40-3c6f48c0-sfo3a
content-type: application/gzip
date: Wed, 22 Nov 2023 14:25:04 GMT
vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
strict-transport-security: max-age=15552000; includeSubDomains; preload
x-envoy-upstream-healthchecked-cluster: 



In [None]:
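%%bash
# Stream the download straight into the bucket without writing it to local disk.
set -o pipefail  # fail the cell if curl or dd fails, not only if gcloud fails
curl -sf https://cnmi-ebird.sfo3.digitaloceanspaces.com/data.nm.txt.gz |
dd bs=1M status=progress |
gcloud storage cp - gs://${GOOGLE_CLOUD_BUCKET}/data.nm.txt.gz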


Copying file://- to gs://c11-ebird/data.nm.txt.gz
  
419500032 bytes (420 MB, 400 MiB) copied, 18 s, 22.8 MB/s

In [None]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-22T17:52:45Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 5 objects, 32952197421 bytes (30.69GiB)
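
One quick way to confirm the streamed upload is complete is to compare the `content-length` header from the `curl -I` call above with the object size in this listing. The sketch below does that with values pasted in from the outputs above; `content_length` is a hypothetical helper. Matching sizes only show that no bytes were dropped, so a checksum comparison would be a stronger test.

```python
def content_length(headers_text):
    """Extract Content-Length (case-insensitive) from raw HTTP response headers."""
    for line in headers_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            return int(value.strip())
    raise ValueError("no content-length header found")


# Values pasted in from the `curl -I` and `gcloud storage ls -l` outputs above.
headers = "HTTP/2 200\ncontent-length: 1032617810\ncontent-type: application/gzip\n"
listing_line = "1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz"

source_size = content_length(headers)
uploaded_size = int(listing_line.split()[0])
if source_size == uploaded_size:
    print("sizes match")
else:
    print(f"mismatch: source={source_size} uploaded={uploaded_size}")
```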


In [None]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
  --source_format=CSV \
  --field_delimiter='\t' \
  --autodetect \
  --skip_leading_rows=1 \
  ebird.ebd_NM_relSep_2023 \
  gs://${GOOGLE_CLOUD_BUCKET}/data.nm.txt.gz


Waiting on bqjob_r8dae043b2db7705_0000018bf76f58ec_1 ... (5s) Current status: RUNNING
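
The `bq load` cells in this notebook all use the same flags and differ only in the table name and the source URI. If the command needs to be driven from Python, one possible approach is to build the argument list in a small function. In this sketch `bq_load_command` is a hypothetical helper and `my-project` is a placeholder project ID; the flags are copied from the cell above.

```python
import shlex


def bq_load_command(project_id, table, source_uri):
    """Build the argv list for the tab-delimited `bq load` used in this notebook."""
    return [
        "bq",
        f"--project_id={project_id}",
        "load",
        "--source_format=CSV",
        "--field_delimiter=\\t",  # backslash + t, the same two characters the shell cell passes
        "--autodetect",
        "--skip_leading_rows=1",
        table,
        source_uri,
    ]


# "my-project" is a placeholder; substitute the real project ID.
cmd = bq_load_command(
    "my-project",
    "ebird.ebd_NM_relSep_2023",
    "gs://c11-ebird/data.nm.txt.gz",
)
print(shlex.join(cmd))

# To actually run it (needs the bq CLI and valid credentials):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids a second round of shell quoting, which matters for the `\t` delimiter.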

In [None]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  


In [None]:
!bq show {project_id}:ebird.ebd_NM_relSep_2023

Table default-256400:ebird.ebd_NM_relSep_2023

   Last modified                   Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Total Logical Bytes   Total Physical Bytes   Labels  
 ----------------- --------------------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- -------- 
  22 Nov 14:32:01   |- GLOBAL_UNIQUE_IDENTIFIER: string     10170370     4151731058                                                        4151731058                                           
                    |- LAST_EDITED_DATE: timestamp                                                                                                                                              
                    |- TAXONOMIC_ORDER: integer                                                                                                                                      

In [None]:
%%bigquery df --project {project_id}
select *
from ebird.ebd_NM_relSep_2023
limit 1000

Query is running:   0%|          |

Downloading:   0%|          |

In [None]:
df

Unnamed: 0,GLOBAL_UNIQUE_IDENTIFIER,LAST_EDITED_DATE,TAXONOMIC_ORDER,CATEGORY,TAXON_CONCEPT_ID,COMMON_NAME,SCIENTIFIC_NAME,SUBSPECIES_COMMON_NAME,SUBSPECIES_SCIENTIFIC_NAME,EXOTIC_CODE,...,EFFORT_AREA_HA,NUMBER_OBSERVERS,ALL_SPECIES_REPORTED,GROUP_IDENTIFIER,HAS_MEDIA,APPROVED,REVIEWED,REASON,TRIP_COMMENTS,SPECIES_COMMENTS
0,URN:CornellLabOfOrnithology:EBIRD:OBS22089322,2021-03-26 05:56:39.420466+00:00,33853,species,avibase-7C2FCB13,Rose-breasted Grosbeak,Pheucticus ludovicianus,,,,...,,,1,,0,1,0,,,
1,URN:CornellLabOfOrnithology:EBIRD:OBS1352217352,2022-02-25 16:11:27.244439+00:00,21054,species,avibase-69544B59,American Crow,Corvus brachyrhynchos,,,,...,,1,0,,0,1,0,,,
2,URN:CornellLabOfOrnithology:EBIRD:OBS33720336,2021-03-26 05:57:55.198592+00:00,26546,species,avibase-A15C5071,Blue-gray Gnatcatcher,Polioptila caerulea,,,,...,,,1,,0,1,0,,,
3,URN:CornellLabOfOrnithology:EBIRD:OBS262951408,2020-04-11 17:42:55+00:00,23610,species,avibase-58C502EA,Barn Swallow,Hirundo rustica,,,,...,,2,1,G951347,0,1,0,,,
4,URN:CornellLabOfOrnithology:EBIRD:OBS553681096,2017-12-01 09:51:07+00:00,31648,species,avibase-89431E9F,House Finch,Haemorhous mexicanus,,,,...,,2,0,,0,1,0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,URN:CornellLabOfOrnithology:EBIRD:OBS42623066,2018-11-07 15:06:08+00:00,11110,species,avibase-9D06AF89,Hairy Woodpecker,Dryobates villosus,,,,...,,3,1,,0,1,0,,,
996,URN:CornellLabOfOrnithology:EBIRD:OBS551225513,2021-01-10 09:49:23.659195+00:00,8062,species,avibase-420850D6,Northern Goshawk,Accipiter gentilis,,,,...,,1,0,,0,1,0,,,
997,URN:CornellLabOfOrnithology:EBIRD:OBS1724503013,2023-05-13 22:22:33.446387+00:00,33855,species,avibase-824361E5,Black-headed Grosbeak,Pheucticus melanocephalus,,,,...,,1,0,,0,1,0,,,
998,URN:CornellLabOfOrnithology:EBIRD:OBS263555463,2014-07-13 12:31:24+00:00,33188,species,avibase-5CBA3391,Great-tailed Grackle,Quiscalus mexicanus,,,,...,,1,0,,0,1,0,,,


In [None]:
%%bash
bq show \
   --format=prettyjson \
   ${GOOGLE_CLOUD_PROJECT}:ebird


{
  "access": [
    {
      "role": "WRITER",
      "specialGroup": "projectWriters"
    },
    {
      "role": "OWNER",
      "specialGroup": "projectOwners"
    },
    {
      "role": "OWNER",
      "userByEmail": "Robert.Citek@gmail.com"
    },
    {
      "role": "READER",
      "specialGroup": "projectReaders"
    },
    {
      "role": "READER",
      "userByEmail": "brooksburkhead7@gmail.com"
    },
    {
      "role": "READER",
      "userByEmail": "crthebroker@gmail.com"
    },
    {
      "dataset": {
        "dataset": {
          "datasetId": "ebird",
          "projectId": "default-256400"
        },
        "targetTypes": [
          "VIEWS"
        ]
      }
    }
  ],
  "creationTime": "1700609524317",
  "datasetReference": {
    "datasetId": "ebird",
    "projectId": "default-256400"
  },
  "etag": "KtVEyfZC+UcXJUc7NGiMQw==",
  "id": "default-256400:ebird",
  "isCaseInsensitive": false,
  "kind": "bigquery#dataset",
  "lastModifiedTime": "1700678240429",
  "location": 

## All observations

In [None]:
!curl -I https://cnmi-ebird.sfo3.digitaloceanspaces.com/ebd_relSep-2023.tar

# 200,955,054,080

HTTP/2 200 
content-length: 200955054080
accept-ranges: bytes
last-modified: Thu, 16 Nov 2023 17:18:22 GMT
x-rgw-object-type: Normal
etag: "61420aff1e69e47d679e92353ab0332f-913"
x-amz-meta-s3cmd-attrs: atime:1699546641/ctime:1699889200/gid:0/gname:root/md5:086921907fba6ecdf715782309a3e5d0/mode:33188/mtime:1699672342/uid:0/uname:root
x-amz-request-id: tx0000013ed84124d3dfaf0-00655fba1b-3c6f48ac-sfo3a
content-type: application/x-tar
date: Thu, 23 Nov 2023 20:46:19 GMT
vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
strict-transport-security: max-age=15552000; includeSubDomains; preload
x-envoy-upstream-healthchecked-cluster: 



The following two code blocks did not work in Colab, but they worked fine on a GCP instance.

In [None]:
# %%bash
# curl -s https://cnmi-ebird.sfo3.digitaloceanspaces.com/ebd_relSep-2023.tar |
# tar -tf -


Process is interrupted.
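
The same idea of reading a tar archive as a forward-only stream can be sketched in Python with the standard library. This is an illustration under assumptions, not what the notebook ran: `stream_gz_member` is a hypothetical helper, and a tiny in-memory archive with made-up member names stands in for the real download.

```python
import gzip
import io
import tarfile


def stream_gz_member(fileobj, member_name, chunk_size=1 << 20):
    """Scan a tar stream in order; return (names seen, decompressed byte count).

    mode="r|" tells tarfile to treat fileobj as a forward-only stream, so the
    archive is never seeked backwards or unpacked to disk.
    """
    seen = []
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            seen.append(member.name)
            if member.name != member_name:
                continue
            total = 0
            with gzip.GzipFile(fileobj=tar.extractfile(member)) as gz:
                while chunk := gz.read(chunk_size):
                    total += len(chunk)  # a real pipeline would forward the chunk here
            return seen, total
    raise KeyError(member_name)


# Demo data only: build a small archive in memory with dummy content.
payload = b"row\n" * 1000
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    for name, data in [("terms_of_use.txt", b"terms"),
                       ("sample.txt.gz", gzip.compress(payload))]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
archive.seek(0)

names, n_bytes = stream_gz_member(archive, "sample.txt.gz")
print(names, n_bytes)
```

Collecting the member names while scanning corresponds to `tar -tf -`, and decompressing one member in chunks corresponds to `tar -O -xf - ... | zcat`.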


In [None]:
# %%bash
# curl -s https://cnmi-ebird.sfo3.digitaloceanspaces.com/ebd_relSep-2023.tar |
# tar -O -xf - ebd_relSep-2023.txt.gz |
# zcat |
# dd bs=1M status=progress |
# gcloud storage cp - gs://${GOOGLE_CLOUD_BUCKET}/ebd_relSep-2023.txt


```bash
557616562176 bytes (558 GB, 519 GiB) copied, 58965 s, 9.5 MB/s
0+16895278 records in
0+16895278 records out
557617963097 bytes (558 GB, 519 GiB) copied, 58965 s, 9.5 MB/s
```
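
The `dd` summary above can be turned into a duration and an average rate with a little arithmetic. The numbers below are copied from that summary; note that `dd` reports MB as 10^6 bytes and GiB as 2^30 bytes.

```python
total_bytes = 557_617_963_097  # final byte count reported by dd
elapsed_s = 58_965             # elapsed seconds reported by dd

rate_mb_s = total_bytes / elapsed_s / 1e6  # decimal megabytes per second
size_gib = total_bytes / 2**30             # binary gibibytes
hours, rem = divmod(elapsed_s, 3600)
minutes, seconds = divmod(rem, 60)

print(f"{rate_mb_s:.1f} MB/s, {size_gib:.0f} GiB, {hours}h {minutes}m {seconds}s")
# prints: 9.5 MB/s, 519 GiB, 16h 22m 45s
```

So the uncompressed upload took a little over 16 hours at the observed rate.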

In [6]:
%%bash
gcloud storage ls -l gs://${GOOGLE_CLOUD_BUCKET}

1032617810  2023-11-22T14:27:03Z  gs://c11-ebird/data.nm.txt.gz
    504698  2023-11-21T23:28:26Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt
     60011  2023-11-27T16:27:27Z  gs://c11-ebird/ebd_US-AL-101_202204_202204_relApr-2022.txt.v03.gz
557617963097  2023-11-28T08:50:57Z  gs://c11-ebird/ebd_relSep-2023.txt
27103974002  2023-11-22T08:07:44Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt
4815040900  2023-11-22T07:20:39Z  gs://c11-ebird/ebd_sampling_relSep-2023.txt.gz
TOTAL: 6 objects, 590570160518 bytes (550.01GiB)


In [7]:
%%bash
bq --project_id=${GOOGLE_CLOUD_PROJECT} \
  load \
    --source_format=CSV \
    --field_delimiter='\t' \
    --autodetect \
    --skip_leading_rows=1 \
  ebird.ebd_relSep_2023 \
  gs://${GOOGLE_CLOUD_BUCKET}/ebd_relSep-2023.txt


Waiting on bqjob_r5028faf55fb14f56_0000018c161c52af_1 ... (5s) Current status: RUNNING

In [8]:
!bq ls {project_id}:ebird

          tableId            Type    Labels   Time Partitioning   Clustered Fields  
 -------------------------- ------- -------- ------------------- ------------------ 
  ebd_NM_relSep_2023         TABLE                                                  
  ebd_relSep_2023            TABLE                                                  
  ebd_sampling_relSep_2023   TABLE                                                  
  ebird_small                TABLE                                                  
  ebird_small_2              TABLE                                                  


In [9]:
!bq show {project_id}:ebird.ebd_relSep_2023

Table default-256400:ebird.ebd_relSep_2023

   Last modified                   Schema                   Total Rows   Total Bytes    Expiration   Time Partitioning   Clustered Fields   Total Logical Bytes   Total Physical Bytes   Labels  
 ----------------- --------------------------------------- ------------ -------------- ------------ ------------------- ------------------ --------------------- ---------------------- -------- 
  28 Nov 13:27:40   |- GLOBAL_UNIQUE_IDENTIFIER: string     1472745160   584150792669                                                       584150792669                                         
                    |- LAST_EDITED_DATE: timestamp                                                                                                                                               
                    |- TAXONOMIC_ORDER: integer                                                                                                                                     

Number of rows: 1,472,745,160
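
For a rough sense of scale, the row count can be combined with the byte totals reported above. This is a back-of-the-envelope sketch using numbers copied from the `bq show` output and from the bucket listing for `ebd_relSep-2023.txt`.

```python
total_rows = 1_472_745_160      # "Total Rows" from bq show
table_bytes = 584_150_792_669   # "Total Bytes" from bq show
source_bytes = 557_617_963_097  # size of ebd_relSep-2023.txt in the bucket

bytes_per_row = table_bytes / total_rows
expansion = table_bytes / source_bytes  # table size relative to the source text

print(f"{bytes_per_row:.1f} bytes/row, table is {expansion:.3f}x the source text size")
```

The average comes out at just under 400 bytes per observation, and the loaded table is a few percent larger than the tab-delimited source file.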

In [10]:
%%bigquery df --project {project_id}
select count(1)
from ebird.ebd_relSep_2023


Query is running:   0%|          |

Downloading:   0%|          |

In [11]:
df

Unnamed: 0,f0_
0,1472745160


In [12]:
%%bash
bq show \
   --format=prettyjson \
   ${GOOGLE_CLOUD_PROJECT}:ebird


{
  "access": [
    {
      "role": "WRITER",
      "specialGroup": "projectWriters"
    },
    {
      "role": "OWNER",
      "specialGroup": "projectOwners"
    },
    {
      "role": "OWNER",
      "userByEmail": "Robert.Citek@gmail.com"
    },
    {
      "role": "READER",
      "specialGroup": "projectReaders"
    },
    {
      "role": "READER",
      "userByEmail": "brooksburkhead7@gmail.com"
    },
    {
      "role": "READER",
      "userByEmail": "crthebroker@gmail.com"
    },
    {
      "dataset": {
        "dataset": {
          "datasetId": "ebird",
          "projectId": "default-256400"
        },
        "targetTypes": [
          "VIEWS"
        ]
      }
    }
  ],
  "creationTime": "1700609524317",
  "datasetReference": {
    "datasetId": "ebird",
    "projectId": "default-256400"
  },
  "etag": "KtVEyfZC+UcXJUc7NGiMQw==",
  "id": "default-256400:ebird",
  "isCaseInsensitive": false,
  "kind": "bigquery#dataset",
  "lastModifiedTime": "1700678240429",
  "location": 