I am reading billions of rows of data from gzipped CSV files exported from Google Cloud's BigQuery. Here is a Swift script using CSV.swift to read one such gzipped CSV file with 12 columns and 848,563 rows:
import Foundation
import CSV
import Gzip

// CSV.swift
// Takes 1m5s without decoding the objects
// Takes 5m15s when decoding to objects
func testReadBigQueryCSV() {
    let filePath = "data/my-data-export-2020-09-20-000000000161.csv.gz"
    let fileData: Data = try! Data(contentsOf: URL(fileURLWithPath: filePath))
    let decodedBody: Data = try! fileData.gunzipped()
    let stream = InputStream(data: decodedBody)
    let csv = try! CSVReader(stream: stream, hasHeaderRow: true)
    print(csv.headerRow!)
    var rows: [[String]] = []
    let decoder = CSVRowDecoder()
    decoder.userInfo[.knownFormatKey] = KnownFormat.BigQueryCSV
    for row in csv {
        rows.append(row)
    }
    /*var rows: [MyDecodableType] = []*/
    /*while csv.next() != nil {*/
    /*print("\(row)")*/
    /*let row: MyDecodableType = try! decoder.decode(MyDecodableType.self, from: csv)*/
    /*rows.append(row)*/
    /*}*/
    print("Got \(rows.count) rows")
}
When simply reading the CSV fields into an [[String]], it takes 1m5s. When decoding the Strings into types, it takes 5m15s.
For comparison, here is a Python script that reads and parses the same file in 3.6 seconds:
#!/usr/bin/env python3
import pandas as pd

# Takes 3.6 seconds when not parsing the dates
# Takes 2m43s when parsing the dates
def main():
    file_path = "data/my-data-export-2020-09-20-000000000161.csv.gz"
    print("Reading {}".format(file_path))
    # df = pd.read_csv(file_path, compression='gzip', parse_dates=["time", "timeUpdateReceived", "inserted"])
    df = pd.read_csv(file_path, compression='gzip')
    print("Read {} rows".format(df.shape))
    print(df.columns)
    print(df.dtypes)
    print(df.iloc[0])

if __name__ == "__main__":
    main()
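The comparison is 3.6s vs 1m5s for a plain read, and 2m43s vs 5m15s when parsing into types. That makes the plain read about 18x slower with CSV.swift than with pandas, and the typed read about 1.9x slower. pandas also uses a single CPU core.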
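For anyone checking the arithmetic, here is a small sketch that converts the timings quoted above to seconds and derives the slowdown ratios and approximate rows-per-second figures. The helper `to_seconds` and the variable names are made up for illustration. The typed comparison is only rough, since the Swift run decodes whole rows into types while the pandas run parses three date columns.

```python
# Timings and row count are the figures reported in this issue.
# to_seconds and all variable names are illustrative, not from any library.

def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds

ROWS = 848_563

swift_read = to_seconds(1, 5)      # CSV.swift, reading into [[String]]
swift_decode = to_seconds(5, 15)   # CSV.swift, decoding rows into types
pandas_read = 3.6                  # pandas, no date parsing
pandas_dates = to_seconds(2, 43)   # pandas, parsing the date columns

read_ratio = swift_read / pandas_read
decode_ratio = swift_decode / pandas_dates

print(f"plain read: CSV.swift is {read_ratio:.1f}x slower than pandas")
print(f"typed read: CSV.swift is {decode_ratio:.1f}x slower than pandas")
print(f"plain read throughput: CSV.swift {ROWS / swift_read:,.0f} rows/s, "
      f"pandas {ROWS / pandas_read:,.0f} rows/s")
```

In throughput terms, the plain read works out to roughly 13 thousand rows per second for CSV.swift against roughly 236 thousand for pandas on this file.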