# 11_01: understanding data

In [1]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

In this chapter we will focus on the task of understanding the meaning of data by **modeling** it.

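Given a dataset with multiple variables, we seek to capture the way in which the variation in one or more **response variables** can be explained by the variation of one or more **explanatory variables**. (A good fit describes an association; by itself it does not prove that one variable causes the other.)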

A model can be seen as a **function** that takes explanatory variables as input, and outputs response variables. The model depends on a number of **parameters**, which are usually not known in advance.
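As a minimal sketch of this idea, the snippet below writes a straight-line model as a plain Python function. The name `linear_model` and its parameters `intercept` and `slope` are hypothetical, chosen only for illustration; the parameter values are made up rather than fit to any data.

```python
import numpy as np

def linear_model(x, intercept, slope):
    """Hypothetical model: map the explanatory variable x to a predicted response."""
    return intercept + slope * np.asarray(x, dtype=float)

# Explanatory values for four illustrative cases
x = np.array([0.0, 1.0, 2.0, 3.0])

# With the parameters fixed, the model is just a function of x
predicted = linear_model(x, intercept=1.0, slope=2.0)
print(predicted)
```

Changing `intercept` or `slope` gives a different function of `x`; fitting is the process of choosing those values from data.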

To **fit a model** to the data, we apply it to the explanatory variables for each case in our data frame, and compare the modeled response values with the corresponding observed values. The difference between an observed and a predicted value is the **residual** for that case. We then adjust the parameters of the model until the residuals are, taken together, as small as possible in a precise mathematical sense (for example, until the sum of their squares is minimized).
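The fitting procedure can be sketched with ordinary least squares, which is one common choice of "precise mathematical sense" (an assumption here, not the only option). The data are synthetic, generated from a known line plus noise, and all variable names are illustrative.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a known line plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Design matrix: a column of ones for the intercept, then the explanatory variable
X = np.column_stack([np.ones_like(x), x])

# Least squares finds the parameters that minimize the sum of squared residuals
params, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = params

# Residuals: observed minus predicted response, one per case
residuals = y - X @ params
sum_squared_residuals = np.sum(residuals**2)

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(f"sum of squared residuals={sum_squared_residuals:.3f}")
```

Because the data were generated with intercept 1 and slope 2, the fitted parameters should land close to those values, though not exactly on them.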

Once we have fit a model, it becomes useful for two different goals: one, its **parameters** may reveal important qualities of, or trends in, the population under study; two, we can use the model to **predict** the response for values of the explanatory variables that we have not yet observed. (This is the way in which models are mostly used in machine learning.)

To choose between alternative models, we can compare their goodness of fit (usually a single number derived from the residuals). This is known as **in-sample** goodness of fit, because it tells us how well the model does on the data that was used to fit it. In-sample measures tend to overstate the goodness of fit, especially for very complex models, because fitting tailors such models to the quirks of the specific dataset we happened to collect rather than to the general characteristics of the population (a problem known as overfitting).
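A small sketch of why in-sample fit flatters complex models: below, polynomials of increasing degree are fit to synthetic data that are truly linear. The data, the helper `in_sample_mse`, and the choice of mean squared error as the goodness-of-fit number are all assumptions made for illustration.

```python
import numpy as np

# Synthetic data (an assumption): the true relationship is a straight line plus noise
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 0.5 + 1.5 * x + rng.normal(scale=0.3, size=x.size)

def in_sample_mse(degree):
    """Fit a polynomial of the given degree and score it on the same data."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    return np.mean(residuals**2)

# Each higher-degree model contains the lower-degree ones as special cases,
# so its in-sample error can only stay the same or shrink
mse_by_degree = {degree: in_sample_mse(degree) for degree in (1, 3, 6)}
for degree, mse in mse_by_degree.items():
    print(f"degree {degree}: in-sample MSE = {mse:.4f}")
```

The in-sample error does not increase with the degree, even though the extra terms are only chasing noise in this example.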

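There are mathematical techniques that adjust the in-sample goodness of fit by penalizing the complexity of the model (typically, its number of parameters); well-known examples include adjusted R², the Akaike information criterion (AIC), and the Bayesian information criterion (BIC).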
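One such adjustment is adjusted R², shown here as a hedged sketch; the text does not say which technique this chapter will use, so treat the choice as an assumption. The helpers `r_squared` and `adjusted_r_squared` are hypothetical names, and the data are synthetic.

```python
import numpy as np

def r_squared(y, y_pred):
    """Fraction of the variance in y accounted for by the predictions."""
    ss_res = np.sum((y - y_pred)**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_cases, n_predictors):
    """Penalize R^2 for the number of predictors (not counting the intercept)."""
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Synthetic, truly linear data (an assumption for illustration)
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = 0.5 + 1.5 * x + rng.normal(scale=0.3, size=x.size)

for degree in (1, 6):
    y_pred = np.polyval(np.polyfit(x, y, deg=degree), x)
    r2 = r_squared(y, y_pred)
    r2_adj = adjusted_r_squared(r2, n_cases=x.size, n_predictors=degree)
    print(f"degree {degree}: R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

For the same raw R², a model with more predictors receives a lower adjusted value, so extra complexity has to earn its place.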

Alternatively, we can set aside part of the data, excluding it from the fit, and then evaluate the residuals on those held-out testing data. The result is known as **out-of-sample** goodness of fit, and it gives a more honest estimate of how a complex model will perform on data it has not seen.
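The hold-out idea can be sketched as follows, assuming a simple random split with one third of the cases reserved for testing. The split fraction, the synthetic data, and the helper `mse` are illustrative assumptions, not a prescription.

```python
import numpy as np

# Synthetic, truly linear data (an assumption for illustration)
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=60)
y = 0.5 + 1.5 * x + rng.normal(scale=0.3, size=x.size)

# Shuffle the case indices, then reserve 20 cases for testing and 40 for training
indices = rng.permutation(x.size)
test_idx, train_idx = indices[:20], indices[20:]

def mse(coeffs, idx):
    """Mean squared residual of a polynomial model on the selected cases."""
    return np.mean((y[idx] - np.polyval(coeffs, x[idx]))**2)

results = {}
for degree in (1, 8):
    # Fit on the training cases only; the test cases play no part in the fit
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    results[degree] = (mse(coeffs, train_idx), mse(coeffs, test_idx))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The training error of the more complex model is never worse than that of the simpler one, but its test error typically is not better, and the gap between the two columns is the signal to watch.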

But enough talk. Let's get to our data and to Python!