A command-line tool written in Rust that infers SQL data types from a CSV file and generates a `CREATE TABLE` SQL DDL statement. It aims to determine the strictest possible SQL type for each column that can accommodate all of its values.
- Infers SQL types for columns: `integer`, `bigint`, `float`, `char(n)`, `varchar(n)`, `date`, `datetime`, `boolean`.
- Determines the strictest type that fits all values in a column.
  - e.g., a column with `1`, `2`, `3000000000` will be `BIGINT`.
  - e.g., a column with `1`, `2.0` will be `FLOAT`.
  - e.g., a column with `apple`, `grape` (both length 5, and not a more specific type like boolean/date etc.) will be `CHAR(5)`.
  - e.g., a column with `apple`, `banana` (lengths 5 and 6) will be `VARCHAR(6)`.
  - e.g., a column with `1`, `text` (different types, varying lengths if applicable) will be `VARCHAR`.
- Handles empty strings: empty strings are treated as `NULL` and do not influence the type inference for the non-empty values in the column. If all values in a column are empty, it defaults to `varchar(0)` (which becomes `varchar(1)` in SQL due to `max(1)`).
- Generates `CREATE TABLE` SQL DDL statements.
- Table name is derived from the input CSV filename (e.g., `my_data.csv` becomes table `my_data`).
- Column names are taken directly from the CSV header.
- Utilizes parallel processing for type inference across columns using Rayon for improved performance on multi-core CPUs.
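
The inference can be viewed as a fold over a small type lattice, with Rayon parallelizing the fold across columns. The sketch below illustrates that reading; the names (`ColType`, `infer_value`, `widen`, `infer_columns`) are hypothetical, date/datetime/boolean detection is omitted, and the fallback text widths are illustrative choices, so the actual implementation may differ.

```rust
use rayon::prelude::*;

/// Simplified type lattice (date, datetime, and boolean omitted):
/// integer widens to bigint widens to float; equal-length strings
/// stay char(n); anything else degrades to varchar(max length).
#[derive(Clone, Copy, PartialEq)]
enum ColType {
    Unknown, // nothing but empty strings seen so far
    Integer,
    BigInt,
    Float,
    Char(usize),
    VarChar(usize),
}

/// Classify one non-empty value in isolation.
fn infer_value(v: &str) -> ColType {
    if v.parse::<i32>().is_ok() {
        ColType::Integer
    } else if v.parse::<i64>().is_ok() {
        ColType::BigInt
    } else if v.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Char(v.chars().count())
    }
}

/// Widen `a` just enough to also admit `b`.
fn widen(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Unknown, t) | (t, Unknown) => t,
        (Integer, Integer) => Integer,
        (Integer | BigInt, Integer | BigInt) => BigInt,
        (Integer | BigInt | Float, Integer | BigInt | Float) => Float,
        (Char(m), Char(n)) if m == n => Char(m),
        // Mixed types or mixed lengths: fall back to a varchar wide
        // enough for the longest textual form of either side.
        (x, y) => VarChar(text_width(x).max(text_width(y))),
    }
}

/// Illustrative widths for the textual form of each type.
fn text_width(t: ColType) -> usize {
    match t {
        ColType::Char(n) | ColType::VarChar(n) => n,
        ColType::Integer => 11, // fits "-2147483648"
        ColType::BigInt => 20,  // fits "-9223372036854775808"
        ColType::Float => 24,   // generous room for a float rendering
        ColType::Unknown => 0,
    }
}

/// Columns are independent, so Rayon can fold each one in parallel.
fn infer_columns(columns: &[Vec<String>]) -> Vec<ColType> {
    columns
        .par_iter()
        .map(|col| {
            col.iter()
                .filter(|v| !v.is_empty()) // empty strings act as NULL
                .map(|v| infer_value(v))
                .fold(ColType::Unknown, widen)
        })
        .collect()
}
```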
- Ensure you have Rust and Cargo installed. If not, follow the instructions at rust-lang.org.
- Clone the repository:

  ```sh
  git clone <your-repository-url>
  cd csv-sql-inference
  ```

- Build the project:

  ```sh
  cargo build --release
  ```

  The executable will be located at `target/release/csv_sql_inference`.
Run the tool from the command line, providing the path to your CSV file:

```sh
./target/release/csv_sql_inference <path_to_csv_file>
```

or, if you've added the `target/release` directory to your `PATH`:

```sh
csv_sql_inference <path_to_csv_file>
```

The generated `CREATE TABLE` SQL statement will be printed to standard output.
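
Assembling the printed statement from the header names and the inferred types is a small formatting step; here is a minimal sketch (the `create_table_ddl` helper is hypothetical and reuses the illustrative `ColType` from the inference sketch above; `n.max(1)` mirrors the all-empty-column rule described in the feature list):

```rust
use std::path::Path;

/// Render the CREATE TABLE statement (illustrative, not the tool's code).
fn create_table_ddl(csv_path: &Path, headers: &[String], types: &[ColType]) -> String {
    // Table name = file stem, e.g. "products.csv" -> "products".
    let table = csv_path
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_else(|| "table".to_string());
    let cols: Vec<String> = headers
        .iter()
        .zip(types)
        .map(|(name, t)| {
            let sql = match *t {
                ColType::Integer => "integer".to_string(),
                ColType::BigInt => "bigint".to_string(),
                ColType::Float => "float".to_string(),
                ColType::Char(n) => format!("char({})", n.max(1)),
                ColType::VarChar(n) => format!("varchar({})", n.max(1)),
                // An all-empty column never left Unknown: varchar(0) -> varchar(1).
                ColType::Unknown => "varchar(1)".to_string(),
            };
            format!("    \"{}\" {}", name, sql)
        })
        .collect();
    format!("CREATE TABLE \"{}\" (\n{}\n);", table, cols.join(",\n"))
}
```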
Given a CSV file named `products.csv`:

```csv
id,product_name,quantity,price,entry_date,last_updated
1,apple,10,0.50,2023-01-01,2024-05-01 10:00:00
2,banana,,1.20,2023-01-02,2024-05-02 11:30:15
3,orange,5,0.75,invalid-date,2024-05-03 09:00:00
4,grape,20000000000,0.05,2023-03-10,2024-05-04 14:20:05
```

Running the command:

```sh
./target/release/csv_sql_inference products.csv
```

would produce output similar to this (exact `varchar` lengths depend on the longest string in each respective column):

```sql
CREATE TABLE "products" (
    "id" integer,
    "product_name" varchar(6),
    "quantity" bigint,
    "price" float,
    "entry_date" varchar(12),
    "last_updated" datetime
);
```
Explanation of example output:

- `"id"`: all values are integers.
- `"product_name"`: mixed strings, so `varchar`; length determined by "orange" (6).
- `"quantity"`: contains an empty string and a very large number. The empty string is treated as `NULL` and ignored, and "20000000000" does not fit in an `integer`, so the column becomes `bigint`.
- `"price"`: all values are valid floats.
- `"entry_date"`: contains "invalid-date", so it becomes `varchar`; length determined by "invalid-date" (12).
- `"last_updated"`: all values are valid datetime strings.
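
The `date` and `datetime` checks are omitted from the sketches above; one plausible implementation uses strict format parsing via the chrono crate (an assumption — the actual tool may detect temporal types differently):

```rust
use chrono::{NaiveDate, NaiveDateTime};

/// True only for strictly formatted dates such as "2023-01-01".
fn is_date(v: &str) -> bool {
    NaiveDate::parse_from_str(v, "%Y-%m-%d").is_ok()
}

/// True for values such as "2024-05-01 10:00:00".
fn is_datetime(v: &str) -> bool {
    NaiveDateTime::parse_from_str(v, "%Y-%m-%d %H:%M:%S").is_ok()
}
```

A value like "invalid-date" fails both parses, which is why `entry_date` above degrades to `varchar(12)` while `last_updated`, whose values all parse, stays `datetime`.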
To run the unit and integration tests:

```sh
cargo test
```
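
An illustrative unit test in this style (written against the hypothetical `widen`/`infer_value` sketch above, not the project's actual suite) might look like:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn integers_widen_to_bigint() {
        // "3000000000" overflows i32, so the column must widen to bigint,
        // matching the BIGINT example in the feature list above.
        let inferred = ["1", "2", "3000000000"]
            .iter()
            .map(|v| infer_value(v))
            .fold(ColType::Unknown, widen);
        assert!(matches!(inferred, ColType::BigInt));
    }
}
```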