# Introduction to Text, Configuration, and Numerical Data Files
In this module, we'll introduce tools and approaches for working with text, configuration, and numerical data files. Reading and writing these files is one of the most common tasks in data science and artificial intelligence applications (and in most other technical computing work).

By the end of this module, you will be able to explain and implement the following concepts:

* __Configuration files__: Configuration data, such as usernames, URLs, etc., is typically organized in the form of local TOML, YAML, JSON, and, to a lesser extent, XML or binary files. You’ll find it in almost every program and language. We’ll develop tools to read these files.
* __Text (numerical) data files__: Text files are used to store unstructured data, such as logs, documents, and other text-based or numerical information. We'll explore comma, space, or tab-delimited flat files, which are widespread formats for storing and exchanging numerical data. We’ll develop a general framework to parse these files (and will encounter our first _buy versus build_ decision).
* __File input/output operations__: We’ll develop a general framework for reading and writing files in Julia, including how to handle different file formats and encodings. Although we'll use Julia, the concepts and approaches we develop will be applicable to other programming languages as well.

Working with local files is an extremely common task in data science and ML/AI applications, and you will almost certainly need to do it in your future work. So, let's get started!
___

## Configuration Files
Configuration data is stored in formats such as Tom’s Obvious, Minimal Language (TOML), YAML, and JavaScript Object Notation (JSON). These files are used to store settings and parameters for applications, libraries, or systems. They are typically _human-readable_ and can be easily edited by users or developers.

### TOML files
TOML (Tom's Obvious, Minimal Language) is a configuration file format that is intended to be easy to read and write and is also easy to parse. It is used to store configuration data for applications. TOML files consist of key-value pairs, similar to a dictionary in Julia or Python, and can also include nested groups of keys. 

TOML files often have a `.toml` file extension. Let's look at a simple example of a TOML file:

```toml
# This is a TOML configuration file for a database

# section: holds connection information
[connection]
host = "localhost"      # The database hostname
port = 5432             # The port to connect to the database on
database = "mydatabase" # The name of the database
user = "myuser"         # The username to connect to the database with
password = "mypassword" # The password for the user
max_connections = 10    # The maximum number of connections to allow at once
connection_timeout = 30 # The amount of time to wait before timing out a connection

# section: holds a group of database options
[options]
ssl = true              # Whether to enable SSL connections to the database
ssl_mode = "require"    # The preferred SSL mode to use
```

TOML files are widely used for storing configuration information. For example, in [Julia](https://docs.julialang.org), the [package manager Pkg.jl](https://pkgdocs.julialang.org/v1/) stores information about the packages required for a project in a `Project.toml` file (which is created automatically when the first package is added to an activated project).
* __Standard library__: Because of its central role in [Julia](https://docs.julialang.org), `TOML.jl`, the package to read and write TOML files, is included in the [Julia standard library](https://docs.julialang.org/en/v1/stdlib/TOML/). Thus, we don't need to install it and can access it by placing the `using TOML` statement at the start of our program.
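
As a minimal sketch of how this looks in practice, the snippet below parses a shortened version of the database configuration above using `TOML.parse` from the standard library. The variable names (`toml_text`, `config`, `host`, `port`, `ssl`) are illustrative choices, and the configuration is held in a string here only to keep the example self-contained; for a file on disk, `TOML.parsefile(path)` plays the same role.

```julia
using TOML

# A shortened copy of the example configuration, stored in a string
toml_text = """
[connection]
host = "localhost"
port = 5432

[options]
ssl = true
"""

# TOML.parse returns a nested dictionary: one entry per [section]
config = TOML.parse(toml_text)

# Values keep their TOML types: strings, integers, booleans, ...
host = config["connection"]["host"]
port = config["connection"]["port"]
ssl = config["options"]["ssl"]

println("Connecting to $host:$port (ssl = $ssl)")
```

Each `[section]` header becomes a key in the outer dictionary whose value is another dictionary, which is why two levels of indexing are needed to reach `host`.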

### YAML files
[YAML](https://yaml.org) is a human-readable data serialization language that can be used to transmit data between systems. YAML is often used as a configuration file format for applications and is similar to TOML. YAML files use a simple syntax that consists of key-value pairs and can also include nested groups of keys. YAML uses indentation to denote structure, similar to [Python](https://www.python.org). YAML files often have a `.yaml` or `.yml` file extension.

Here is an example of a YAML file that could be used to store configuration data for an application:

```yaml
# This is a YAML configuration file for an application

# Metadata about MyApp
name: MyApp             # The application's name
version: 1.0.0          # The version of the application
host: localhost         # The hostname to bind the application to
port: 8080              # The port to bind the application to

# A group of database options
database:
  host: localhost       # The database hostname
  port: 5432            # The port to connect to the database on
  name: mydatabase      # The name of the database
  user: myuser          # The username to connect to the database with
  password: mypassword  # The password for the user
```

Unlike TOML, support for the YAML format is not included in the [Julia standard library](https://docs.julialang.org/en/v1/stdlib/TOML/). Instead, third-party packages are available for working with YAML files, e.g., the [YAML.jl](https://github.com/JuliaData/YAML.jl) package.

### JSON files
[JavaScript Object Notation (JSON)](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON) is a lightweight, text-based, language-independent data interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON is based on a subset of the [JavaScript programming language](https://en.wikipedia.org/wiki/JavaScript) and is used to represent simple data structures and associative arrays.  

JSON consists of two data structures:
* A collection of name/value pairs, usually implemented as a struct, dictionary, keyed list, or associative array.
* An ordered list of values. In most languages, this appears as an array, vector, list, or sequence.

Here is an example of a JSON file that stores contact information:

```json
{
  "people": [
    {
      "name": "John Smith",
      "email": "john@example.com",
      "phone": "555-555-5555"
    },
    {
      "name": "Jane Smith",
      "email": "jane@example.com",
      "phone": "444-444-4444"
    }
  ]
}
```

This JSON file defines an object with a single key, `people`; the `people` key has a value that is a list of objects, each representing a person. Each person object has three keys: `name`, `email`, and `phone`. 
* __Usage__: As we'll see, JSON is not only used for configuration files but also as a common format for exchanging data between systems, especially in web applications.
* __Standard library?__: Python provides the built-in `json` module. Julia, in contrast, does not include a JSON package in its standard library, but there are several excellent third-party packages for working with JSON in Julia!
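
To make the two JSON structures concrete, here is a rough sketch of how the contact data above could be represented with Julia's built-in `Dict` (name/value pairs) and `Vector` (ordered list). Note the assumption: this structure is written out by hand rather than produced by a JSON parser, and the exact container types returned by a third-party JSON package may differ. The variable name `contacts` is illustrative.

```julia
# Hand-built Julia equivalent of the JSON example (not parsed from a file).
# JSON object -> Dict, JSON array -> Vector
contacts = Dict(
    "people" => [
        Dict("name" => "John Smith", "email" => "john@example.com", "phone" => "555-555-5555"),
        Dict("name" => "Jane Smith", "email" => "jane@example.com", "phone" => "444-444-4444"),
    ],
)

# The array preserves order, so we can iterate over the people in sequence
for person in contacts["people"]
    println(person["name"], " <", person["email"], ">")
end
```

Accessing a value mirrors the nesting in the file: `contacts["people"][2]["phone"]` walks from the outer object, into the list, and then into the second person object.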
___

## Numerical Data Files
Numerical data files are used to store numerical information in a structured way. These files can hold data from experiments, simulations, or other sources. Numerical data files might come in different formats, such as comma-separated values (CSV), space-separated values (SSV), tab-separated values (TSV), or other delimited formats.

Let's examine a typical comma-separated values (CSV) file that stores data; similar ideas can be applied to other delimited formats like SSV or TSV files.

### CSV files
[Comma-separated value (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) files are _delimited_ text files that use commas to separate values; a space or tab character can also serve as a delimiter. CSV files are widely used for storing and exchanging numerical data, especially in engineering and scientific fields. 

A CSV file typically contains tabular data (numbers and text) in plain text, with each line having the same number of fields (although this is not always guaranteed). Each line in the file is called a _record_, and each record contains one or more _fields_, separated by commas (or other characters such as a `tab` or `space`).

In engineering or other quantitative applications, [comma-separated value files](https://en.wikipedia.org/wiki/Comma-separated_values) are typically used to store, transmit, and work with numerical data. Consider a comma-separated value file holding historical interest rate data:

```CSV
Date,T=20-year-percentage,T=30-year-percentage
2021-09-17,1.82,1.88
2021-09-24,1.84,1.89
2021-10-01,2.00,2.05
2021-10-08,2.05,2.10
....
```

__What's shown in this file?__
* __Header__: The first row is a header row that contains the column names, separated by commas. The first column is the date, and the other two columns are the interest rates for 20-year and 30-year Treasury bonds. While including a header row is optional, it is a common practice, as it makes the data easier to understand. Alternatively, the producer of the data file must include a data dictionary of some type to inform users about the contents of the file.
* __Data records__: Each subsequent row contains a record of data, with each field separated by a comma. The first field is the date, and the other two fields are the interest rates for 20-year and 30-year Treasury bonds on that date.
* __Organization__: The data is organized in a way that makes it easy to read and parse. Each record has the same number of fields, separated by commas. However, this is not always true in unstructured files containing text, such as movie reviews or books in digital form.
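
The header/record structure described above can be sketched with a few lines of plain Julia. This is a deliberately naive "build" version that assumes every field is free of quotes and embedded commas, which holds for the sample rows shown but not for CSV files in general; a dedicated CSV package handles those cases. The field names `date`, `rate20`, and `rate30` are hypothetical labels chosen for this illustration.

```julia
using Dates

# The sample rows from the file above, stored in a string for illustration
csv_text = """
Date,T=20-year-percentage,T=30-year-percentage
2021-09-17,1.82,1.88
2021-09-24,1.84,1.89
2021-10-01,2.00,2.05
2021-10-08,2.05,2.10
"""

# Split the text into rows; the first row is the header
rows = split(strip(csv_text), '\n')
header = String.(split(rows[1], ','))

# Convert each remaining row into a typed record (a named tuple).
# Assumption: exactly three comma-separated fields per row, no quoting.
records = map(rows[2:end]) do row
    fields = split(row, ',')
    (
        date = Date(fields[1]),              # ISO dates parse directly
        rate20 = parse(Float64, fields[2]),  # 20-year rate, in percent
        rate30 = parse(Float64, fields[3]),  # 30-year rate, in percent
    )
end

println("Columns: ", join(header, " | "))
println("Number of records: ", length(records))
```

Note that every value arrives as text; the parsing step is where we decide that the first column is a `Date` and the others are `Float64` values.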

__What's not shown in this file?__
* __Metadata__: The file does not include any metadata about the data, such as the units of measurement or the source of the data. This information is typically provided in a separate file or document, although some producers include it as comment lines within the CSV file itself.
* __Flat structure__: The CSV format does not support complex data structures, such as nested or hierarchical data. If the data being represented has a more complex structure, e.g., a record or field is somehow dependent on another record or field, it may be necessary to use a different file format, such as JSON.

___

## File Input and Output (File I/O) Operations
File Input/Output (File I/O) operations are common tasks that enable you to store and retrieve data from persistent storage, such as a hard drive or a cloud service, e.g., [Box](https://en.wikipedia.org/wiki/Box_(company)), [Google Drive](https://en.wikipedia.org/wiki/Google_Drive), or [OneDrive](https://en.wikipedia.org/wiki/OneDrive).

In most programming languages, core file I/O operations are implemented using built-in functions or methods that allow you to open, read, write, and close files.
* __Third-party packages__: For standard formats, third-party developers have created packages that utilize the core I/O functionality to read and write [comma-separated values (CSV) files](https://en.wikipedia.org/wiki/Comma-separated_values) as well as newer formats like the [Tom's Obvious Minimal Language (TOML) format](https://toml.io/en/), [JavaScript Object Notation (JSON) format](https://en.wikipedia.org/wiki/JSON), or [YAML files](https://yaml.org).

Thus, we don't need to implement our own file I/O functions; instead, we can use the built-in functions or methods provided by the language or the packages we install. Should we buy or build?

* __Buy versus build__: You'll often face the buy versus build dilemma when working with data, developing models, and similar tasks. Should we create our own packages to perform common tasks ourselves or rely on someone else's code? As it turns out, this question is usually simple; 99.9% of the time __it's better to buy than to build__!
* __Always buy?__ The authors of (good) third-party packages have (typically) spent a huge amount of time making sure the package works, is efficient, handles weird edge cases that you haven't thought of, etc. It will take you a long time to reach the same level as most third-party packages, so be smart, use their work, and don't reinvent the wheel! 

That said, there may be very specific cases where we need to write our own basic code. For example, if we wanted to implement our own file I/O functions, we would use the following pattern to _read_ a file.

```julia
open(path, "r") do io # open a stream to the file
    for line in eachline(io) # read each line from the stream
        # Your logic goes here!
        # process the line, e.g., parse it into a data structure, e.g., a dictionary or a list, etc
    end
end # the file stream is automatically closed when the block ends
```

And to _write_ a file, we can use the following pattern:
```julia
open(path, "w") do io # open a stream to the file
    for line in lines # iterate over the lines to write to the file
        println(io, line) # write the line, followed by a newline, to the file
    end
end # the file is automatically closed when the block ends
```
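
To see the two patterns working together, here is a small round-trip sketch: it writes a few lines to a temporary file and then reads them back. The file name `example.txt` and the variable names `lines` and `recovered` are illustrative; `mktempdir()` is used only so the example does not touch your working directory. The example uses `println` so that each entry ends with a newline and can be recovered as a separate line.

```julia
# Lines we want to store (illustrative data)
lines = ["alpha", "beta", "gamma"]

# Create a scratch directory and build a path inside it
path = joinpath(mktempdir(), "example.txt")

# Write: println adds a newline after each entry
open(path, "w") do io
    for line in lines
        println(io, line)
    end
end

# Read: eachline yields each line with the trailing newline removed
recovered = String[]
open(path, "r") do io
    for line in eachline(io)
        push!(recovered, line)
    end
end

println(recovered == lines ? "Round trip succeeded" : "Round trip failed")
```

If the lines were written without a trailing newline, the file would contain the single line `alphabetagamma`, and the read loop would return one entry instead of three.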

In both snippets, we open an IO stream, perform an action, and then close the stream. What is an IO Stream?

### What is an IO Stream?
An **IO stream** is an abstraction for a sequential flow of data between your program and some external source or sink (like a file, network socket, or in-memory buffer). 
* __An IO Stream__: wraps an underlying resource (e.g., a file descriptor), provides buffered access and methods such as `read`, `write`, or `eachline`, and ensures you don’t have to manage low-level details yourself. 
* __Julia__: In Julia, using the `open(path, "r") do io … end` pattern automatically opens the stream for reading, yields it to your block for processing, and then closes it when you’re done. Similarly, you can use the `open(path, "w") do io … end` pattern to open a stream for writing, or `open(path, "a") do io … end` to open a stream for appending.
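
Because a stream is an abstraction, code written against `IO` does not need to know what sits behind it. The sketch below illustrates this with a hypothetical helper, `count_nonempty_lines`, applied first to an in-memory `IOBuffer` and then to a file that is created in write mode and extended in append mode. The helper name, the file name `log.txt`, and the sample text are all assumptions made for this example.

```julia
# Hypothetical helper: works on any IO stream (file, buffer, socket, ...)
function count_nonempty_lines(io::IO)
    n = 0
    for line in eachline(io)
        isempty(strip(line)) || (n += 1)  # skip blank lines
    end
    return n
end

# 1. An in-memory stream: no file involved
buffer = IOBuffer("alpha\n\nbeta\ngamma\n")
n_buffer = count_nonempty_lines(buffer)

# 2. A file stream: "w" creates (or truncates) the file, "a" appends to it
path = joinpath(mktempdir(), "log.txt")
open(path, "w") do io
    println(io, "first entry")
end
open(path, "a") do io
    println(io, "second entry")
end

# open(f, path, mode) passes the stream to f and closes it afterwards
n_file = open(count_nonempty_lines, path, "r")

println("buffer: $n_buffer lines, file: $n_file lines")
```

The same function body handles both sources, which is the practical payoff of the stream abstraction: parsing logic can be tested on small in-memory strings before it is pointed at real files.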

Ok, that's a lot of information. Let's jump into some examples to see how this works in practice!

___