Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: proposal for new types addition and optional integration with rust_xlsxwriter #339

Closed
lucatrv opened this issue Jul 2, 2023 · 10 comments · Fixed by #406

Comments

@lucatrv
Copy link
Contributor

lucatrv commented Jul 2, 2023

I would like to open a discussion to add a few convenience data types with serde::Deserialize trait implementations. With these new data types it would become trivial to deserialize Excel cells which normally contain a certain type (for instance a float) but may also contain other values such as strings, errors, etc. This proposal also discusses the optional integration with the stellar rust_xlsxwriter crate, and the opportunity to use its Excel specific rust_xlsxwriter::ExcelDateTime type to represent date and/or time instead of the generic chrono::naive::NaiveDate.

I am splitting this proposal into separate sections within a single Feature Request because all points are related:

  1. Rename DataType::as_f64 into DataType::as_float and DataType::as_i64 to DataType::as_int. This is for consistency with DataType::is_float / DataType::get_float and DataType::is_int / DataType::get_int. and with the proposed new data types (see point 2). This is not a loss of generality because numerical values in Excel are anyway stored as f64 types, see reference here. This change could be introduced by first adding DataType::as_float and DataType::as_int and moving DataType::as_f64 and DataType::as_i64 to undocumented / deprecated state, before final removal.

  2. For each DataType variant, add a corresponding VariantOrString type, and implement the Default, PartialEq, Eq when appropriate for the type (see std::cmp::Eq), and serde::Deserialize traits (plus the optional trait rust_xlsxwriter::IntoExcelData when the feature rust_xlsxwriter is enabled, see point 3).
    So for instance for the DataType::Float variant, a FloatOrSTring type should be introduced as follows. This way, the very common case where an Excel cell can contain either correct floats or anything else could be easily deserialized as let float_or_string: FloatOrString = result?;.

#[derive(PartialEq)]
enum FloatOrString {
    Float(f64),
    String(String),
}

impl Default for FloatOrString {
    fn default() -> Self {
        Self::String(Default::default())
    }
}

impl<'de> serde::Deserialize<'de> for FloatOrString {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let data_type = calamine::DataType::deserialize(deserializer)?;
        if let Some(float) = data_type.as_f64() {
            Ok(Self::Float(float))
        } else {
            Ok(Self::String(data_type.to_string()))
        }
    }
}
  1. Add an optional feature rust_xlsxwriter, to conveniently integrate Calamine types with the rust_xlsxwriter crate. With this feature enabled, the rust_xlsxwriter::IntoExcelData would be implemented for most Calamine types (including the new ones), the rust_xlsxwriter::IntoExcelDateTime would be implemented for Calamine types which can be converted to a date type with as_date, and the rust_xlsxwriter::IntoFilterData trait would be implemented for string and float types (maybe not for integer types because they are i64).

  2. Update the as_datetime / as_date / as_time methods to return rust_xlsxwriter::ExcelDateTime which is a type specifically designed to represent Excel date and/or time, instead of the generic chrono::naive::NaiveDate. The code would be simplified by relying on the ExcelDateTime::parse_from_str method. This means that the date feature would depend on the rust_xlsxwriter feature.

@tafia if you agree on this proposal I should be able to implement most of this myself and push PRs for your evaluation.

@lucatrv
Copy link
Contributor Author

lucatrv commented Jul 5, 2023

@tafia please let me know what you think in the next days if you can, at least for the first points, because this coming weekend I would have some time to work on this.

If you need more explanation, this is how the de_opt_f64 example would become using the NumberOrString type:

use calamine::{RangeDeserializerBuilder, Reader, Xlsx, NumberOrString};

struct ExcelRow {
    metric: String,
    value: NumberOrString,
}

fn main() ->  Result<(), Box<dyn std::error::Error>> {
    let path = format!("{}/tests/excel.xlsx", env!("CARGO_MANIFEST_DIR"));
    let mut excel: Xlsx<_> = open_workbook(path)?;

    let range = excel
      .worksheet_range("Sheet1")
      .ok_or(calamine::Error::Msg("Cannot find Sheet1"))??;

    let iter_result =
        RangeDeserializerBuilder::with_headers(&COLUMNS).from_range::<_, ExcelRow>(&range)?;
  }

@tafia
Copy link
Owner

tafia commented Jul 26, 2023

Hey, sorry for the late answer,

  1. This makes sense indeed.
  2. I am not sure this is something we want. Deserializing inconsistent data should be left to the users to handle as they see fit. One could maybe decide it is better to have a Result<f64, String> or directly an Option<f64>.
  3. I agree there is a need to write back excel files. I believe though that it would be best to have it implemented on another independent crate (using some wrapper over DataType I suppose).
  4. I'd like to keep it independent from XlsxWritter as much as possible. Reworking the date type is definitely an option but I am not sure how much of a simplification it is.

@dimastbk
Copy link
Contributor

dimastbk commented Jul 31, 2023

  1. I think, the main problem is not result of as_datetime/as_date/as_time. Make sense to refactor internal representation of datetime (DataType::*) for solving problem 1904. And if somebody adds this flag, than creation ExcelDateTime in external code will not be problem.

@jmcnamara
Copy link

4. Make sense to refactor internal representation of datetime (DataType::*) for solving problem 1904.

Could you explain a bit more what you mean here. For context I understand the Excel 1900/1904 date differences in general, just not what is meant here.

@lucatrv
Copy link
Contributor Author

lucatrv commented Dec 19, 2023

@tafia:

  1. OK I created PR Rename DataType methods as_f64 as_i64 into as_float as_int #389.
  2. I agree with your comment, but please have a look at the invalid_option and invalid_result functions of the csv crate. They provide general and flexible methods to deal with invalid values, either discarding them (invalid_option) or returning them as strings (invalid_result). In my opinion these two functions should be added to calamine too. If you agree I could work on this.
  3. Starting from version 0.57.0, rust_xlsxwriter supports Serde serialization, hurray! This provides the required glue between calamine and rust_xlsxwriter, so maybe we don't need anything else now.
  4. I leave this to your evaluation.

@dimastbk
Copy link
Contributor

  1. Make sense to refactor internal representation of datetime (DataType::*) for solving problem 1904.

Could you explain a bit more what you mean here. For context I understand the Excel 1900/1904 date differences in general, just not what is meant here.

Calamine now uses hack for converting time 1904 to 1900, and for time 1904 as_time() produces bad results. I think 1) calamine should store start date in DataType (like DataType::DateTime({value: f64, start_date: f64})) or 2) make is_1904 (or start_date) public on Xls/Xlsx/Xlsx and pass start date in as_time().

@lucatrv
Copy link
Contributor Author

lucatrv commented Dec 21, 2023

Calamine now uses hack for converting time 1904 to 1900, and for time 1904 as_time() produces bad results. I think 1) calamine should store start date in DataType (like DataType::DateTime({value: f64, start_date: f64})) or 2) make is_1904 (or start_date) public on Xls/Xlsx/Xlsx and pass start date in as_time().

@dimastbk can you please clarify if this issue would be fixed by using rust_xlsxwriter::ExcelDateTime instead of chrono::naive::NaiveDate?

@lucatrv
Copy link
Contributor Author

lucatrv commented Jan 20, 2024

@lucatrv
Copy link
Contributor Author

lucatrv commented Jan 22, 2024

@tafia if you can please merge PR #389 and #398, then I'm ready to create a PR (building on them) to implement the invalid_*_option and invalid_*_result functions which will behave similarly to the invalid_option and invalid_result functions of the csv crate. This will finally close this feature request.

@lucatrv
Copy link
Contributor Author

lucatrv commented Jan 22, 2024

@tafia I hope you're going to answer me soon because I have some time this week to work on this, while next week I'll have to move to other projects, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants