Deserializing CSV rows into Rust types
To turn CSV rows into our own structs, we can add `serde` to our package.

```shell
cargo add -p upload-pokemon-data serde
```

[Serde](https://serde.rs/) is a library that is widely used in the Rust ecosystem for serializing and deserializing Rust data types into various formats. Some of those formats include JSON, TOML, YAML, and MessagePack.

We will be using the `Deserialize` trait to `derive` a deserializer for our `PokemonCsv` type, so be sure to add `derive` as a feature to the `serde` dependency in `upload-pokemon-data`'s `Cargo.toml`. If you do not do this, you will see errors because the code that provides the derive macro won't be compiled into the serde library. You will need to add the `version` and `features` keys, although the version will already be specified as a string.

```toml
serde = { version = "1.0.129", features = ["derive"] }
```

Cargo features are powerful ways to turn different pieces of code on and off depending on what a consumer needs to use, ensuring that our projects don't include code they don't need in the compilation process.
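As a rough illustration of what a feature does inside a crate, the sketch below gates a function on a hypothetical feature named `fancy` (a made-up name, not a real serde feature). It assumes the feature is switched off, so only the fallback version gets compiled.

```rust
// `fancy` is a hypothetical feature name used only for this sketch.
// With the feature enabled, this version is compiled in.
#[cfg(feature = "fancy")]
pub fn greeting() -> &'static str {
    "fancy greeting"
}

// Without the feature, the version above is never compiled
// and this fallback is used instead.
#[cfg(not(feature = "fancy"))]
pub fn greeting() -> &'static str {
    "plain greeting"
}
```

Serde's `derive` feature works along the same lines: the derive macro code is only compiled when a consumer asks for it.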

Next we'll create a new sub-module to hold the `PokemonCsv` struct that represents what's in each CSV row. In `src/main.rs`:

```rust
mod pokemon_csv;
use pokemon_csv::*;
```

Rust modules don't necessarily reflect the filesystem, but in this case they do. We'll use `mod pokemon_csv` to declare a new submodule whose contents live in `src/pokemon_csv.rs`, and we'll `use pokemon_csv::*` to pull all of its public items into scope in `main.rs`.
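To see that modules are not tied to files, here is a minimal sketch of an inline module. The names `shapes`, `square_area`, and `area_of_three` are made up for illustration; the `mod` and `use` mechanics are the same as for `pokemon_csv`.

```rust
// An inline module: same `mod` keyword, but the body lives in this file
// instead of in a separate `src/shapes.rs`.
mod shapes {
    // `pub` makes the item visible outside the module.
    pub fn square_area(side: u32) -> u32 {
        side * side
    }

    // No `pub`, so code outside `shapes` cannot name this function.
    #[allow(dead_code)]
    fn private_helper() {}
}

// Pull every public item from `shapes` into scope.
use shapes::*;

pub fn area_of_three() -> u32 {
    // `square_area` is usable without the `shapes::` prefix.
    square_area(3)
}
```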

In `src/pokemon_csv.rs` we'll define a new public struct `PokemonCsv`. We also label all of the fields as `pub` so that they're accessible if we want them wherever we create a `PokemonCsv`.

At the top we've included two derive macros for `Debug` and `Deserialize`. `Debug` we've already talked about. We're deriving it for convenience if we want to use the `Debug` formatter (`"{:?}"`) or the `dbg!` macro with a value of type `PokemonCsv`. `Deserialize` is from serde, and does a pretty good job of automatically handling all of the fields with the types we've given. If we didn't derive `Deserialize` we'd have to manually implement it for our struct, but all of the field types already have implementations, so we let serde write that code for us.
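As a small reminder of what deriving `Debug` buys us, here is a sketch using a hypothetical two-field struct, `MiniPokemon`, standing in for `PokemonCsv`.

```rust
// `MiniPokemon` is a hypothetical stand-in for `PokemonCsv`.
#[derive(Debug)]
pub struct MiniPokemon {
    pub name: String,
    pub hp: u8,
}

pub fn debug_text() -> String {
    let bulbasaur = MiniPokemon {
        name: String::from("Bulbasaur"),
        hp: 45,
    };
    // `{:?}` uses the derived `Debug` implementation.
    format!("{:?}", bulbasaur)
}
```

Calling `debug_text()` returns `MiniPokemon { name: "Bulbasaur", hp: 45 }`.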

We use a few different number types for the data in our csv: `u8`, `u16`, and `f32`.

* A `u8` is an *unsigned* (unsigned means: not negative) integer from 0-255, just like a color in CSS.
* A `u16` is bigger than a `u8` and can hold values from 0 to 65535.

Why are there different integer types? Because they're different sizes in memory. Storing a `u8` takes 8 bits (one byte), while storing a `u64` takes 64 bits (8 bytes). So if we can appropriately size the number type we use, we can store more numbers in less memory.

An `f32` is a 32-bit float. Floats store numbers with a fractional part, so they are not limited to integers.
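The sizes and ranges above can be checked with the standard library. This is a small sketch; the helper names `sizes_in_bytes` and `parse_u8` are made up for illustration.

```rust
use std::mem::size_of;

// Size in bytes of `u8`, `u16`, `u64`, and `f32`, in that order.
pub fn sizes_in_bytes() -> (usize, usize, usize, usize) {
    (
        size_of::<u8>(),
        size_of::<u16>(),
        size_of::<u64>(),
        size_of::<f32>(),
    )
}

// Parsing checks the range: `None` when the text does not fit in a `u8`.
pub fn parse_u8(text: &str) -> Option<u8> {
    text.parse::<u8>().ok()
}
```

`sizes_in_bytes()` returns `(1, 2, 8, 4)`, and `parse_u8("256")` is `None` because 256 is one past `u8::MAX`.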

The other types we're using are `Option`, `bool`, and `String`.

* `bool` is `true` or `false`
* `Option` is an enum that represents a value that can be there or not. The values are `Some(value)` if there is a value or `None` if there isn't.
* `String` is an owned string. That is, it's mostly what you think of when you think of strings in languages like JavaScript. We can add more to it and otherwise do whatever we want with it.
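Here is a short sketch of `Option` and `String` in use. Both helpers are hypothetical: `parse_female_rate` treats an empty field as a missing value, and `describe` grows an owned `String` after creating it.

```rust
// Hypothetical helper: an empty field becomes `None`,
// anything that parses as a float becomes `Some`.
pub fn parse_female_rate(field: &str) -> Option<f32> {
    if field.is_empty() {
        None
    } else {
        field.parse().ok()
    }
}

// An owned `String` can be extended after it is created.
pub fn describe(rate: Option<f32>) -> String {
    let mut text = String::from("female rate: ");
    match rate {
        Some(value) => text.push_str(&value.to_string()),
        None => text.push_str("unknown"),
    }
    text
}
```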

We match all of these types to the types in the CSV. I've chosen to match the integer types as tightly as possible, even though I don't know if more pokemon with bigger values will be added in future generations, because I'm trying to match what's in the CSV now, not what could be added to the database in the future.

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct PokemonCsv {
    pub name: String,
    pub pokedex_id: u16,
    pub abilities: String,
    pub typing: String,
    pub hp: u8,
    pub attack: u8,
    pub defense: u8,
    pub special_attack: u8,
    pub special_defense: u8,
    pub speed: u8,
    pub height: u16,
    pub weight: u16,
    pub generation: u8,
    pub female_rate: Option<f32>,
    pub genderless: bool,
    #[serde(rename(deserialize = "legendary/mythical"))]
    pub is_legendary_or_mythical: bool,
    pub is_default: bool,
    pub forms_switchable: bool,
    pub base_experience: u16,
    pub capture_rate: u8,
    pub egg_groups: String,
    pub base_happiness: u8,
    pub evolves_from: Option<String>,
    pub primary_color: String,
    pub number_pokemon_with_typing: f32,
    pub normal_attack_effectiveness: f32,
    pub fire_attack_effectiveness: f32,
    pub water_attack_effectiveness: f32,
    pub electric_attack_effectiveness: f32,
    pub grass_attack_effectiveness: f32,
    pub ice_attack_effectiveness: f32,
    pub fighting_attack_effectiveness: f32,
    pub poison_attack_effectiveness: f32,
    pub ground_attack_effectiveness: f32,
    pub fly_attack_effectiveness: f32,
    pub psychic_attack_effectiveness: f32,
    pub bug_attack_effectiveness: f32,
    pub rock_attack_effectiveness: f32,
    pub ghost_attack_effectiveness: f32,
    pub dragon_attack_effectiveness: f32,
    pub dark_attack_effectiveness: f32,
    pub steel_attack_effectiveness: f32,
    pub fairy_attack_effectiveness: f32,
}
```

The only thing left to talk about is the use of a field-level attribute macro. Serde offers us the power to rename fields when we're deserializing, so we'll take advantage of that to map the `legendary/mythical` column, whose `/` can't appear in a Rust field name, to `is_legendary_or_mythical`.

```rust
#[serde(rename(deserialize = "legendary/mythical"))]
pub is_legendary_or_mythical: bool,
```

In our for loop that iterates over the CSV reader, we can change the function used from `records` to `deserialize`. The `deserialize` function needs a type parameter to tell it what type to deserialize into. We can get Rust to infer that type if we label the `record` as a `PokemonCsv` type, because Rust is capable of knowing that this type will propagate back up to the `deserialize` function and that it is the only possible value for that type parameter.

```rust
for result in rdr.deserialize() {
    let record: PokemonCsv = result?;
    println!("{:?}", record);
}
```

If you have Rust Analyzer with the type inlays on, you will see that Rust Analyzer correctly shows the type of `result` as `Result<PokemonCsv, csv::Error>`.
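The same kind of inference can be seen with the standard library's `str::parse`, which is also generic over its output type. This is only an analogy, not the csv crate's API, and `parse_pokedex_id` is a made-up helper.

```rust
use std::num::ParseIntError;

pub fn parse_pokedex_id(text: &str) -> Result<u16, ParseIntError> {
    // No explicit type parameter on `parse`: the annotation on `id`
    // tells the compiler which type `parse` must produce.
    let id: u16 = text.parse()?;
    Ok(id)
}
```

Without the `: u16` annotation (or a turbofish such as `parse::<u16>()`), the compiler would need some other hint to pick the output type.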

Running the program results in a `DeserializeError` that points to a `ParseBool` error at a specific byte on a specific line of the csv.

```
❯ cargo run --bin upload-pokemon-data
Error: Error(Deserialize { pos: Some(Position { byte: 781, line: 1, record: 1 }), err: DeserializeError { field: Some(14), kind: ParseBool(ParseBoolError { _priv: () }) } })
```

If we look at the csv values, we can see that this is because the true/false values are capital T `True` and capital F `False`, which don't parse into Rust's `true` and `false`.
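The standard library's `bool` parser behaves the same way, which can be checked without the csv crate. The helper name `std_parses_bool` is made up for this sketch.

```rust
// Returns whether the standard `FromStr` implementation for `bool`
// accepts the given text.
pub fn std_parses_bool(text: &str) -> bool {
    text.parse::<bool>().is_ok()
}
```

`std_parses_bool("true")` is `true`, while `std_parses_bool("True")` and `std_parses_bool("False")` are both `false`, since only the lowercase spellings are accepted.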

We can create a new function, just for these fields, to deserialize `True` and `False` into bools.

We first need to use serde's field-level attribute macro to tell it that when we deserialize, we're going to use a function called `from_capital_bool`. You can see that we can also add it alongside other usage, such as the `rename`.

```rust
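#[serde(deserialize_with = "from_capital_bool")]
pub genderless: bool,
#[serde(
    rename(deserialize = "legendary/mythical"),
    deserialize_with = "from_capital_bool"
)]
pub is_legendary_or_mythical: bool,
#[serde(deserialize_with = "from_capital_bool")]
pub is_default: bool,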
```

The `from_capital_bool` function signature is already defined for us by serde and is [shown in the docs](https://serde.rs/field-attrs.html#deserialize_with). We do not get the option to change it aside from the `T` in the return type, which for us is `bool`, the type we'll be parsing out.

The function signature from the docs is

```rust
fn<'de, D>(D) -> Result<T, D::Error> where D: Deserializer<'de>
```

The function signature we use reads as:

The function `from_capital_bool`, which makes use of a lifetime named `'de` and some type `D`, accepts an argument named `deserializer` that is of type `D`, and returns a `Result` where a successful deserialization ends up being a `bool` type and a failure is the associated `Error` type that `D` defines.

Additionally, `D` must implement the `Deserializer` trait, which makes use of the same `'de` lifetime we talked about earlier.
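If the shape of that signature is unfamiliar, the standard library has a smaller cousin of it. The sketch below is an analogy rather than serde code: `parse_field` is a made-up helper that is generic over `T`, bounded by the `FromStr` trait, and returns whatever error type `T` defines, much as `from_capital_bool` returns `D::Error`.

```rust
use std::str::FromStr;

// Generic over `T`; the error type is the associated `Err` type
// that `T`'s `FromStr` implementation defines.
pub fn parse_field<T>(raw: &str) -> Result<T, T::Err>
where
    T: FromStr,
{
    raw.trim().parse::<T>()
}
```

`parse_field::<u8>(" 45 ")` yields `Ok(45)` with a `ParseIntError` as the possible failure, while `parse_field::<f32>("0.5")` yields `Ok(0.5)` with a `ParseFloatError` instead.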

As it happens, serde has [an entire page](https://serde.rs/lifetimes.html) explaining why the `'de` lifetime is like this, and what the `D: de::Deserializer<'de>` trait bound is useful for.

The short version is that this new (to us) usage of lifetimes and generics is responsible for safely ensuring the ability to create zero-copy deserializations, which is some advanced Rust. We haven't done that in our `PokemonCsv` struct, but we could.

The usage of the `'de` lifetime means that the input string that we're deserializing from needs to live as long as the struct that we're creating from it.
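To make the idea of borrowing from the input concrete, here is a sketch without serde. `NameAndColor` and `view_line` are hypothetical; the struct holds string slices that point into the input line instead of owning copies, so the compiler will not let it outlive that line.

```rust
// Hypothetical zero-copy view: both fields borrow from the input line.
#[derive(Debug, PartialEq)]
pub struct NameAndColor<'a> {
    pub name: &'a str,
    pub color: &'a str,
}

// The returned struct cannot outlive `line`, because it points into it.
pub fn view_line<'a>(line: &'a str) -> Option<NameAndColor<'a>> {
    // Split at the first comma; `None` if there is no comma at all.
    let (name, color) = line.split_once(',')?;
    Some(NameAndColor { name, color })
}
```

The `'a` here plays a role similar to `'de` in serde: it ties the lifetime of the output to the lifetime of the data it was read from.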

Overall, as it turns out, we're doing this so that we can take advantage of the csv crate's implementation of `Deserializer` to deserialize the string "True" or "False" from the csv's values.

Then we can directly match on that string value and turn `"True"` into `true` and `"False"` into `false`. If for some reason we've annotated the wrong field with this function and we get something that isn't one of those two strings, we fail with a custom error message.

```rust
// Needs `use serde::{de, Deserialize};` at the top of `src/pokemon_csv.rs`.
fn from_capital_bool<'de, D>(
    deserializer: D,
) -> Result<bool, D::Error>
where
    D: de::Deserializer<'de>,
{
    let s: &str =
        de::Deserialize::deserialize(deserializer)?;

    match s {
        "True" => Ok(true),
        "False" => Ok(false),
        _ => Err(de::Error::custom("not a boolean!")),
    }
}
```

Keep in mind that the previous explanation of lifetimes and generics is something we could have avoided entirely if we wanted to. We could have mapped over the `StringRecord`s and manually constructed the `PokemonCsv`s ourselves, never having touched serde.
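A rough sketch of that manual route, using only the standard library: `fields` stands in for one record that has already been split into columns, and `MiniPokemon` is a hypothetical three-field stand-in for `PokemonCsv`. The real struct would need the same treatment for every column.

```rust
// Hypothetical three-field stand-in for `PokemonCsv`.
#[derive(Debug, PartialEq)]
pub struct MiniPokemon {
    pub name: String,
    pub pokedex_id: u16,
    pub genderless: bool,
}

// Build the struct by hand from one already-split record.
// Returns `None` on a wrong column count or an unparseable value.
pub fn from_fields(fields: &[&str]) -> Option<MiniPokemon> {
    match fields {
        [name, id, genderless] => Some(MiniPokemon {
            name: name.to_string(),
            pokedex_id: id.parse().ok()?,
            genderless: match *genderless {
                "True" => true,
                "False" => false,
                _ => return None,
            },
        }),
        _ => None,
    }
}
```

For example, `from_fields(&["Bulbasaur", "1", "False"])` produces a `MiniPokemon`, while a record with a missing column or a non-numeric id produces `None`.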

We could have also cleaned up the csv data before attempting to parse it at all, manually switching out `True` for `true` and `False` for `false`. I've chosen to present you this `deserialize_with` approach specifically because it brings up new concepts and that's what this course is all about: learning more about Rust little by little.

We're left with the output being a `PokemonCsv` now.

```rust
PokemonCsv {
	name: "Bulbasaur",
	pokedex_id: 1,
	abilities: "Overgrow, Chlorophyll",
	typing: "Grass, Poison",
	hp: 45,
	attack: 49,
	defense: 49,
	special_attack: 65,
	special_defense: 65,
	speed: 45,
	height: 7,
	weight: 69,
	generation: 1,
	female_rate: Some(0.125),
	genderless: false,
	is_legendary_or_mythical: false,
	is_default: true,
	forms_switchable: false,
	base_experience: 64,
	capture_rate: 45,
	egg_groups: "Monster, Plant",
	base_happiness: 70,
	evolves_from: None,
	primary_color: "green",
	number_pokemon_with_typing: 15.0,
	normal_attack_effectiveness: 1.0,
	fire_attack_effectiveness: 2.0,
	water_attack_effectiveness: 0.5,
	electric_attack_effectiveness: 0.5,
	grass_attack_effectiveness: 0.25,
	ice_attack_effectiveness: 2.0,
	fighting_attack_effectiveness: 0.5,
	poison_attack_effectiveness: 1.0,
	ground_attack_effectiveness: 1.0,
	fly_attack_effectiveness: 2.0,
	psychic_attack_effectiveness: 2.0,
	bug_attack_effectiveness: 1.0,
	rock_attack_effectiveness: 1.0,
	ghost_attack_effectiveness: 1.0,
	dragon_attack_effectiveness: 1.0,
	dark_attack_effectiveness: 1.0,
	steel_attack_effectiveness: 1.0,
	fairy_attack_effectiveness: 0.5,
};
```

Finally, we can see that a few of these fields actually hold multiple values. `abilities`, for example, is `"Overgrow, Chlorophyll"`, which is two abilities.

We can take the same approach we just did for the capital booleans to turn these array-strings into Vecs. Instead of returning `bool` we'll return a `Vec<String>` from our new `from_comma_separated` function.

We can use `split` to turn the string values into an Iterator over string slices (`&str`), which are views into the original string. Then we can filter out any potentially empty strings using `filter` and `.is_empty()` and finally map over those views to turn them into owned `String`s, and `collect` into a `Vec`.

`.collect()` infers that its type should be `Vec<String>` from the function signature, so we don't need to additionally specify it.

```rust
fn from_comma_separated<'de, D>(
    deserializer: D,
) -> Result<Vec<String>, D::Error>
where
    D: de::Deserializer<'de>,
{
    let s: &str =
        de::Deserialize::deserialize(deserializer)?;

    Ok(s.split(", ")
        .filter(|v| !v.is_empty())
        .map(|v| v.to_string())
        .collect())
}
```
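The iterator chain can be tried on its own, without serde. The sketch below pulls it into a made-up helper, `split_comma_separated`, so the empty-field case is easy to check.

```rust
// Same split/filter/map/collect chain as above, on a plain `&str`.
pub fn split_comma_separated(s: &str) -> Vec<String> {
    s.split(", ")
        // Splitting an empty string yields one empty slice; drop it.
        .filter(|v| !v.is_empty())
        // Turn each borrowed slice into an owned `String`.
        .map(|v| v.to_string())
        .collect()
}
```

`split_comma_separated("Overgrow, Chlorophyll")` gives a two-element `Vec`, and `split_comma_separated("")` gives an empty one thanks to the `filter` step.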

With our new `from_comma_separated` function set up, we can put the `deserialize_with` attribute on any fields we want to deserialize into a `Vec<String>`.

```rust
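#[serde(deserialize_with = "from_comma_separated")]
pub abilities: Vec<String>,
#[serde(deserialize_with = "from_comma_separated")]
pub typing: Vec<String>,
#[serde(deserialize_with = "from_comma_separated")]
pub egg_groups: Vec<String>,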
```

And now we have a fully deserialized `PokemonCsv` struct for every row in the csv.
ChristopherBiscardi committed Aug 24, 2021
1 parent ac509d4 commit fee8adc
Showing 4 changed files with 145 additions and 2 deletions.
50 changes: 50 additions & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions crates/upload-pokemon-data/Cargo.toml
@@ -7,3 +7,4 @@ edition = "2018"

 [dependencies]
 csv = "1.1.6"
+serde = { version = "1.0.129", features = ["derive"] }
7 changes: 5 additions & 2 deletions crates/upload-pokemon-data/src/main.rs
@@ -1,9 +1,12 @@
+mod pokemon_csv;
+use pokemon_csv::*;
+
 fn main() -> Result<(), csv::Error> {
     let mut rdr = csv::Reader::from_path(
         "./crates/upload-pokemon-data/pokemon.csv",
     )?;
-    for result in rdr.records() {
-        let record = result?;
+    for result in rdr.deserialize() {
+        let record: PokemonCsv = result?;
         println!("{:?}", record);
     }
     Ok(())
89 changes: 89 additions & 0 deletions crates/upload-pokemon-data/src/pokemon_csv.rs
@@ -0,0 +1,89 @@
+use serde::{de, Deserialize};
+
+#[derive(Debug, Deserialize)]
+pub struct PokemonCsv {
+    pub name: String,
+    pub pokedex_id: u16,
+    #[serde(deserialize_with = "from_comma_separated")]
+    pub abilities: Vec<String>,
+    #[serde(deserialize_with = "from_comma_separated")]
+    pub typing: Vec<String>,
+    pub hp: u8,
+    pub attack: u8,
+    pub defense: u8,
+    pub special_attack: u8,
+    pub special_defense: u8,
+    pub speed: u8,
+    pub height: u16,
+    pub weight: u16,
+    pub generation: u8,
+    pub female_rate: Option<f32>,
+    #[serde(deserialize_with = "from_capital_bool")]
+    pub genderless: bool,
+    #[serde(
+        rename(deserialize = "legendary/mythical"),
+        deserialize_with = "from_capital_bool"
+    )]
+    pub is_legendary_or_mythical: bool,
+    #[serde(deserialize_with = "from_capital_bool")]
+    pub is_default: bool,
+    #[serde(deserialize_with = "from_capital_bool")]
+    pub forms_switchable: bool,
+    pub base_experience: u16,
+    pub capture_rate: u8,
+    #[serde(deserialize_with = "from_comma_separated")]
+    pub egg_groups: Vec<String>,
+    pub base_happiness: u8,
+    pub evolves_from: Option<String>,
+    pub primary_color: String,
+    pub number_pokemon_with_typing: f32,
+    pub normal_attack_effectiveness: f32,
+    pub fire_attack_effectiveness: f32,
+    pub water_attack_effectiveness: f32,
+    pub electric_attack_effectiveness: f32,
+    pub grass_attack_effectiveness: f32,
+    pub ice_attack_effectiveness: f32,
+    pub fighting_attack_effectiveness: f32,
+    pub poison_attack_effectiveness: f32,
+    pub ground_attack_effectiveness: f32,
+    pub fly_attack_effectiveness: f32,
+    pub psychic_attack_effectiveness: f32,
+    pub bug_attack_effectiveness: f32,
+    pub rock_attack_effectiveness: f32,
+    pub ghost_attack_effectiveness: f32,
+    pub dragon_attack_effectiveness: f32,
+    pub dark_attack_effectiveness: f32,
+    pub steel_attack_effectiveness: f32,
+    pub fairy_attack_effectiveness: f32,
+}
+
+fn from_capital_bool<'de, D>(
+    deserializer: D,
+) -> Result<bool, D::Error>
+where
+    D: de::Deserializer<'de>,
+{
+    let s: &str =
+        de::Deserialize::deserialize(deserializer)?;
+
+    match s {
+        "True" => Ok(true),
+        "False" => Ok(false),
+        _ => Err(de::Error::custom("not a boolean!")),
+    }
+}
+
+fn from_comma_separated<'de, D>(
+    deserializer: D,
+) -> Result<Vec<String>, D::Error>
+where
+    D: de::Deserializer<'de>,
+{
+    let s: &str =
+        de::Deserialize::deserialize(deserializer)?;
+
+    Ok(s.split(", ")
+        .filter(|v| !v.is_empty())
+        .map(|v| v.to_string())
+        .collect())
+}
