F#-style type providers in Circle

I had the pleasure of being the guest on Cpp.chat yesterday, in which Phil, the co-host, asked if it was possible to implement F#-style type providers using Circle. I had not heard of this feature before, but responded affirmatively nonetheless.

Today I decided to investigate this feature. From MS's Type Provider overview:

An F# type provider is a component that provides types, properties, and methods for use in your program. Type Providers generate what are known as Provided Types, which are generated by the F# compiler and are based on an external data source.

For example, an F# Type Provider for SQL can generate types representing tables and columns in a relational database. In fact, this is what the SQLProvider Type Provider does.

Provided Types depend on input parameters to a Type Provider. Such input can be a sample data source (such as a JSON schema file), a URL pointing directly to an external service, or a connection string to a data source. A Type Provider can also ensure that groups of types are only expanded on demand; that is, they are expanded if the types are actually referenced by your program. This allows for the direct, on-demand integration of large-scale information spaces such as online data markets in a strongly typed way.

In essence, the type provider generates bindings to a data source when given a schema or by inferring a schema from a sample data file. This process occurs at compile time, allowing the user to access data fields by member name. If the data's schema changes, the dependent source code automatically refreshes its struct definitions when it is recompiled. If the user references a field that has been removed during a schema change, a compile-time error is raised, rather than a runtime error.

To keep things simple, I decided to implement a CSV type provider, which F# also provides.

Using the Circle type provider

type_provider.cxx

int main() {
  // Use type providers to define the object type from the CSV schema.
  // This happens at compile time.
  @macro define_csv_type("obj_type_t", "schema.csv");

  // Print the field types and names.
  @meta std::cout<< @type_name(@member_types(obj_type_t))<< " "<< 
    @member_names(obj_type_t)<< "\n" ... ;
 
  // Load the values at runtime. The schema is inferred and checked against
  // the static type info.
  auto data = read_csv_file<obj_type_t>("earthquakes1970-2014.csv");

  // data.size() is a size_t, so print it with %zu rather than %d.
  printf("Read %zu records\n", data.size());

  // Print out 10 random coordinates. Access the members by name. These names
  // are inferred from the schema at compile time.
  for(int i = 0; i < 10; ++i) {
    int index = rand() % data.size();

    // The first line is CSV schema. The first index comes at line 2.
    int line = index + 2;
    double lat = data[index].Latitude;
    double lon = data[index].Longitude;
    printf("line = %4d Latitude = %+10f  Longitude = %+10f\n", line, lat, lon);
  }

  // Print out all the fields from a random record. Use introspection for
  // this.
  int index = rand() % data.size();
  std::cout<< index + 2<< ":\n";
  std::cout<< "  "<< @member_names(obj_type_t)<< ": "<< 
    @member_pack(data[index])<< "\n" ...;

  return 0;
}

This listing includes all the client code necessary to use type providers to define a CSV record struct at compile time, load it with data at runtime, then print data values at runtime, both by using member access (which assumes the programmer knows at least something about the contents of the resource) and by reflection (which assumes nothing).

DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID
1970/01/04 17:00:40.20,24.138999999999900,102.503000000000000,31.00,7.50,Ms,90,,,0.000000000000000,NEI,1970010440

A sample CSV file is required at compile-time. It must provide a header line followed by at least one line of data. The header line defines the field names. The field types are inferred from the data line. In this simple implementation, if an entire field can be lexed as a double-precision value, the field is double. Otherwise the field has type std::string.
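The inference rule above can be sketched in standard C++. This is a minimal illustration, not the article's code: the helper names lexes_as_double and infer_field_type are hypothetical, and strtod stands in for the sscanf call used in the listing further down.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the article): true when the entire field
// lexes as a floating-point value. An empty field is not a number.
inline bool lexes_as_double(const std::string& field) {
  if(field.empty())
    return false;

  // strtod sets end to the first character it could not consume.
  char* end = nullptr;
  std::strtod(field.c_str(), &end);
  return end == field.c_str() + field.size();
}

// Hypothetical helper: map a sample field to the type name that the
// schema would record for its column.
inline std::string infer_field_type(const std::string& field) {
  return lexes_as_double(field) ? "double" : "std::string";
}
```

With the sample line above, a field such as 24.138999999999900 or 90 would be classified as double, while Ms, an empty field, and the DateTime value (where only the leading 1970 parses) would fall back to std::string.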

The define_csv_type macro takes two strings: the name of the struct to define, which receives one member per CSV field, and the name of the schema file. The macro is expanded into the calling scope, in this case declaring obj_type_t in function main. This macro leans heavily on Circle's integrated interpreter. It uses iostreams to load the schema file from disk and parse out the field names and types. After the struct is defined, its member types and names are printed, at compile time, as a diagnostic.

The runtime phase is pretty basic. The read_csv_file function template loads a data file at runtime. This file must conform to the schema parsed at compile time. The returned records are printed in two different ways: first by direct access of the Latitude and Longitude members; then again by using the introspection keywords @member_names and @member_pack.

The point of type providers is that they allow both generic and non-generic access. In this example, the user accesses two members by name, indicating some expectation about the data format. If this expectation doesn't hold, the member access fails with a compile-time error.

$ circle type_provider.cxx 
std::string DateTime
double Latitude
double Longitude
double Depth
double Magnitude
std::string MagType
double NbStations
std::string Gap
std::string Distance
double RMS
std::string Source
double EventID

$ ./type_provider 
Read 5304 records
line = 1185 Latitude = -22.426000  Longitude = +173.624000
line = 4080 Latitude =  -5.724000  Longitude = +154.424000
line = 1235 Latitude = +50.205000  Longitude = +147.727000
line = 2229 Latitude = -10.335000  Longitude = +113.660000
line = 4267 Latitude = -15.213000  Longitude = -172.367000
line = 3201 Latitude =  -4.022000  Longitude = +101.776000
line = 5292 Latitude = +54.685900  Longitude = +162.302400
line = 4334 Latitude = +13.564000  Longitude = -90.599000
line = 2291 Latitude =  -5.589000  Longitude = +110.186000
line = 1959 Latitude = +39.710000  Longitude = +39.605000
2812:
  DateTime: 1998/04/27 18:40:38.53
  Latitude: -2.995
  Longitude: 136.282
  Depth: 53
  Magnitude: 6.2
  MagType: Me
  NbStations: 155
  Gap: 
  Distance: 
  RMS: 1.14
  Source: NEI
  EventID: 1.99804e+09

Generating a type from a schema

type_provider.cxx

struct type_info_t {
  struct field_t {
    std::string field_type;
    std::string field_name;
  };
  std::vector<field_t> fields;
};

// Define a structure from a type_info_t known at compile time. This macro
// may inject the struct declaration into any namespace from any scope.
@macro void define_type(const char* name, const type_info_t& type_info) {
  struct @(name) {
    @meta for(auto& field : type_info.fields)
      @type_id(field.field_type) @(field.field_name);
  };
}

@macro void define_csv_type(const char* name, const char* filename) {
  @macro define_type(name, read_csv_schema(filename));
}

std::string make_ident(std::string s) {
  // Turn a string into an identifier.
  if(!s.size())
    { };

  if(isdigit(s[0]))
    s = '_' + s;

  for(char& c : s) {
    if(c != '_' && !isalnum(c))
      c = '_';
  }

  return s;
}

type_info_t read_csv_schema(std::istream& is, bool read_types) {
  type_info_t type_info { };

  // Read the field names.
  std::string s = good_getline(is);

  const char* text = s.c_str();
  while(*text) {
    // Consume the leading ',' delimiter.
    if(type_info.fields.size())
      ++text;

    // Find the next comma or end-of-string.
    const char* end = text;
    while(*end && ',' != end[0]) ++end;

    // Push the identifier field_name.
    type_info.fields.push_back({ { }, make_ident(std::string(text, end)) });

    // Advance to the comma or end-of-string.
    text = end;
  }

  if(read_types) {
    // Infer the field types from the first data line.
    s = good_getline(is);
    text = s.c_str();

    size_t num_fields = type_info.fields.size();
    for(size_t i = 0; i < num_fields; ++i) {
      auto& field = type_info.fields[i];

      if(i && !*text) {
        throw std::runtime_error(format(
          "field %s not found at line %d in CSV file", 
          field.field_name.c_str(), 0
        ));
      }

      // Consume the leading ',' delimiter.
      if(i)
        ++text;

      // Find the next comma or end-of-string.
      const char* end = text;
      while(*end && ',' != end[0]) ++end;

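      // Test if the field is a double. sscanf returns EOF (which is nonzero)
      // when it reaches end-of-string before converting anything, so require
      // exactly one conversion and initialize chars_read.
      int chars_read = 0;
      double x;
      int result = sscanf(text, "%lf%n", &x, &chars_read);
      field.field_type = (1 == result && chars_read == end - text) ? 
        "double" : "std::string";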

      // Advance to the comma or end-of-string.
      text = end;
    }
  }

  return type_info;
}

type_info_t read_csv_schema(const char* filename) {
  std::ifstream file(filename);
  if(!file.is_open()) {
    throw std::runtime_error(format(
      "cannot open CSV file %s", filename
    ));
  }

  return read_csv_schema(file, true);
}

This listing defines type_info_t, which is a collection of field type and name pairs. The define_type macro is worth looking at in detail:

@macro void define_type(const char* name, const type_info_t& type_info) {
  struct @(name) {
    @meta for(auto& field : type_info.fields)
      @type_id(field.field_type) @(field.field_name);
  };
}

Statement macros are expanded into the calling scope. The struct's name is passed in as a string, because you can't pass identifiers to functions or macros. The dynamic name operator @() converts the name back to an identifier. The class-specifier simply loops over the fields and uses @type_id to convert each string-form type to its C++ type. The field name is also converted to an identifier, but only after being hammered into a valid identifier with make_ident inside the read_csv_schema function.
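To make the expansion concrete, here is a small standard C++ sketch that renders, as text, the declaration such a macro would be expected to inject. It is only an illustration: render_struct is a hypothetical helper, and the real macro declares the type directly rather than producing source text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mirrors the (type, name) pairs held by type_info_t in the listing.
struct field_t {
  std::string field_type;
  std::string field_name;
};

// Hypothetical helper: write out the struct declaration that corresponds
// to a list of fields, one member per line.
inline std::string render_struct(const std::string& name,
  const std::vector<field_t>& fields) {
  std::string out = "struct " + name + " {\n";
  for(const field_t& field : fields)
    out += "  " + field.field_type + " " + field.field_name + ";\n";
  out += "};\n";
  return out;
}
```

Passing the name obj_type_t and the fields inferred from the earthquake schema would yield a struct whose members match the diagnostic printed at compile time (std::string DateTime, double Latitude, and so on).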

Schema verification

type_provider.cxx

template<typename type_t>
void verify_schema(const type_info_t& type_info) {
  @meta size_t num_fields = @member_count(type_t);

  // Test the number of fields.
  if(type_info.fields.size() != num_fields) {
    throw std::runtime_error(format(
      "%s has %zu fields while schema has %zu fields", 
      @type_name(type_t), num_fields, type_info.fields.size()
    ));
  }

  @meta for(size_t i = 0; i < num_fields; ++i) {
    // Test the name of each field.
    const auto& field = type_info.fields[i];

    if(field.field_name != @member_name(type_t, i)) {
      throw std::runtime_error(format(
        "field %zu is called %s in %s and %s in schema",
        i, @member_name(type_t, i), @type_name(type_t), field.field_name.c_str()
      ));
    }
  }
}

When we load a file at runtime, we need to verify that its schema matches the schema sampled at compile time. The type information that was built at compile time is not kept. We instead compare the runtime type information (that is, the inferred schema) against the static type information accessed with Circle introspection. For CSV, this simply involves comparing the field names; non-conforming field types will be flagged when reading row data.
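Stripped of reflection, the check reduces to comparing two lists of names. The sketch below assumes the expected names are supplied by hand rather than by @member_name, and header_mismatch is a hypothetical function that returns a message instead of throwing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: compare the field names the struct expects against
// the names found in a file's header line. Returns an empty string when
// they agree, or a description of the first mismatch.
inline std::string header_mismatch(const std::vector<std::string>& expected,
  const std::vector<std::string>& found) {
  // Test the number of fields first.
  if(expected.size() != found.size()) {
    return "struct has " + std::to_string(expected.size()) +
      " fields while file has " + std::to_string(found.size());
  }

  // Then test the name of each field, in order.
  for(std::size_t i = 0; i < expected.size(); ++i) {
    if(expected[i] != found[i]) {
      return "field " + std::to_string(i) + " is called " + expected[i] +
        " in the struct and " + found[i] + " in the file";
    }
  }

  return std::string();
}
```

As in the article's version, only names and their order are compared; a column whose values no longer parse as its declared type would surface later, while reading rows.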

Runtime CSV reading

template<typename type_t>
type_t read_csv_line(const char* text, int line) {
  type_t obj { };

  @meta for(size_t i = 0; i < @member_count(type_t); ++i) {
    if(i && !*text) {
      throw std::runtime_error(format(
        "field %s not found at line %d in CSV file", 
        @member_name(type_t, i), line
      ));
    }

    // Consume the leading ',' delimiter.
    if constexpr(i) {
      assert(',' == text[0]);
      ++text;
    }

    // Find the next comma or end-of-string.
    const char* end = text;
    while(*end && ',' != end[0]) ++end;

    // Support strings and doubles.
    if constexpr(std::is_same<double, @member_type(type_t, i)>::value) {
      // Parse a double. Confirm that we've read all characters in the field.
      double x = 0;
      if(text < end) {
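        // Require exactly one conversion; sscanf can return EOF, which is
        // nonzero, if the field holds only whitespace at end-of-string.
        int chars_read = 0;
        int result = sscanf(text, "%lf%n", &x, &chars_read);
        if(1 != result || chars_read != end - text) {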
          throw std::runtime_error(format(
            "field %s at line %d \'%s\' is not a number",
            @member_name(type_t, i), line, std::string(text, end - text).c_str()
          ));
        }
      }

      @member_ref(obj, i) = x;

    } else {
      @member_ref(obj, i) = std::string(text, end);
    }

    // Advance to the comma or end-of-string.
    text = end;
  }

  return obj;
}

template<typename type_t>
std::vector<type_t> read_csv_file(const char* filename) {
  std::ifstream file(filename);
  if(!file.is_open()) {
    throw std::runtime_error(format(
      "cannot open CSV file %s", filename
    ));
  }

  // Load the schema and verify against the static type.
  type_info_t type_info = read_csv_schema(file, false);
  verify_schema<type_t>(type_info);

  // Load each CSV line.
  std::vector<type_t> vec;
  int line = 1;
  
  while(file.good()) {
    ++line;
    std::string s = good_getline(file);
    if(!s.size())
      break;

    vec.push_back(read_csv_line<type_t>(s.c_str(), line));
  }

  return vec;
}

The function template read_csv_line uses reflection to generate a CSV deserializer from static type information. The major branch is over the field type: an if constexpr statement puts us in a double-parsing case or a string-parsing case. In either case, an empty field is supported.
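For contrast, here is roughly what the same loop looks like in standard C++ when the column types are only known at runtime. This is a sketch under stated assumptions: cell_t and parse_row are hypothetical names, the column kinds are passed in as a vector of flags, and, like the article's reader, it does not handle quoted fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical cell type: holds either a number or a string.
struct cell_t {
  bool is_number = false;
  double number = 0;
  std::string text;
};

// Hypothetical helper: split one CSV line on commas and parse each field
// according to numeric_columns[i]. An empty numeric field becomes 0.
inline std::vector<cell_t> parse_row(const std::string& line,
  const std::vector<bool>& numeric_columns) {
  std::vector<cell_t> row;
  std::size_t pos = 0;

  for(std::size_t i = 0; i < numeric_columns.size(); ++i) {
    // Consume the leading ',' delimiter for every field after the first.
    if(i) {
      if(pos >= line.size() || ',' != line[pos])
        throw std::runtime_error("missing field " + std::to_string(i));
      ++pos;
    }

    // Find the next comma or end-of-string.
    std::size_t end = line.find(',', pos);
    if(std::string::npos == end)
      end = line.size();
    std::string field = line.substr(pos, end - pos);

    cell_t cell;
    cell.is_number = numeric_columns[i];
    if(cell.is_number) {
      // Parse a double and confirm that the whole field was consumed.
      if(!field.empty()) {
        char* stop = nullptr;
        cell.number = std::strtod(field.c_str(), &stop);
        if(stop != field.c_str() + field.size())
          throw std::runtime_error("'" + field + "' is not a number");
      }
    } else {
      cell.text = field;
    }

    row.push_back(cell);
    pos = end;
  }

  return row;
}
```

The branch on is_number is taken for every field of every row here, whereas the reflection-based version resolves it once per member when the template is instantiated and stores the result directly in a typed member.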

Are type providers good?

Given introspection and reflection, there are many ways for Circle to generate serialization and deserialization bindings. Is it advantageous to load or infer a schema at compile time? The more obvious solution is to simply define your structure manually and let the serialization code treat it as a contract: the input CSV must at least satisfy the requirements of this contract; additional fields are ignored.

My previous example Walkthrough 3: Deserializing JSON to classes takes that path.

json_loader.cxx

enum class language_t {
  english,
  french,
  spanish,
  italian,
  german,
  japanese,
  chinese,
  korean,
};

enum class unit_t {
  mile,
  km,
  league,
  lightyear,
};

struct options_t {
  language_t language;
  unit_t unit = unit_t::km;     // An optional setting.
  std::map<std::string, double> constants;

  // Optional set of alternative options. A recursive definition that allows
  // embedded alt-options.
  std::unique_ptr<options_t> alt_options { };
};

JSON files are fundamentally schemaless. If the loaded file is compatible with the requirements of the options_t struct, then it is loaded. This lets us evolve the data and the application semi-independently; a new file format will still work with an older binary, although the converse is not necessarily true. (But it could be true!)
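Applied to CSV, the contract style might look something like the following sketch: the program names the columns it requires, finds them in whatever header the file has, and ignores the rest. The function locate_columns is hypothetical and not part of either example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: for each required column name, find its position in
// the file's header. Extra header columns are ignored; a missing required
// column is an error.
inline std::vector<std::size_t> locate_columns(
  const std::vector<std::string>& header,
  const std::vector<std::string>& required) {
  std::vector<std::size_t> indices;
  for(const std::string& name : required) {
    auto it = std::find(header.begin(), header.end(), name);
    if(header.end() == it)
      throw std::runtime_error("required field " + name + " is missing");
    indices.push_back(static_cast<std::size_t>(it - header.begin()));
  }
  return indices;
}
```

A file that gains or reorders columns would still load under this scheme, which is the looser coupling described above; the cost is that a mismatch is only discovered at runtime.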

It seems possible that the type provider's tighter coupling of data and code will provide more benefit over the JSON loader's declarative-contract style as the underlying data type becomes more complex. Type providers involving remote procedure calls would automate the space historically occupied by technologies like CORBA and COM, and currently occupied by protobuf.

But maybe not.
