These notes give an overview of statistical modeling. We cover how to model different types of outcome variables and briefly discuss the pros and cons of various modeling approaches.
So, you have some data. Now you need to write down a model of how the variables in the data interact with each other.
Why do I need a model?
- because you're interested in predicting some outcome variable (like sales), or
- because you're interested in understanding a causal relationship (e.g. between advertising and sales)
- Outcome variable (typically denoted y), aka:
    - response variable
    - dependent variable
    - target variable
- Covariates (typically denoted x), aka:
    - predictors
    - features
    - independent variables
    - control variables
- Parameters (typically denoted β or θ):
    - map covariates into outcomes
- Error term (typically denoted ε), aka:
    - unobservables
    - disturbance term
A model is a mapping between covariates, parameters, unobservables, and outcomes. It is the "production function" that generates the outcome we are interested in.
Most generally, the model is:
$y = f(x, \theta, \varepsilon)$
The model can be as simple as
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
or it can be as complex as
$y = \beta_0 x_1^{\beta_1} \beta_2^{x_2} e^{\varepsilon}$
A model can even have a θ that is infinite-dimensional. We call this a nonparametric model. (Loosely, a nonparametric model is one that does not restrict f, or the distribution from which ε is drawn, to a family described by a finite number of parameters.)
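As a minimal sketch of the simple linear model above, the snippet below simulates outcomes from y = β0 + β1x1 + β2x2 + ε and then recovers the parameters by ordinary least squares. The parameter values, the sample size, and the helper names `solve_linear_system` and `fit_ols` are illustrative assumptions, not part of these notes.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations act on both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return OLS coefficients by solving the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

rng = random.Random(0)
true_beta = [1.0, 2.0, -0.5]  # assumed values of (beta0, beta1, beta2)

rows, y = [], []
for _ in range(5000):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    eps = rng.gauss(0, 1)           # the unobservable
    rows.append([1.0, x1, x2])      # the leading 1.0 is the intercept column
    y.append(true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2 + eps)

beta_hat = fit_ols(rows, y)
print([round(b, 2) for b in beta_hat])
```

With this much simulated data the estimates should land close to the assumed true values. In practice one would call a library routine (such as those listed at the end of these notes) rather than solving the normal equations by hand.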
Variables can be:
- continuous
- binary
- categorical (ordered)
- categorical (unordered)
- integers (i.e. counts)
Note that both dependent and independent variables can be of any of these types.
For each type of dependent variable, we can come up with a statistical model to describe the relationship between the dependent variable and the covariates and unobservables.
All models fall under two umbrellas:
- Parametric: the advantage of a parametric model is that one can interpret the model by looking at the parameters. The disadvantage is that it may not always be flexible enough to fit the data well.
- Nonparametric: the advantage of a nonparametric model is that it typically fits the data better. The disadvantage is that it may not be readily interpretable.
A variable is continuous if it can take on any real value over some range (typically the entire real number line or the positive real numbers).
Examples: sales, earnings, number of page clicks (strictly a count, but often treated as continuous when values are large), etc.
The table below lists examples of parametric and nonparametric models appropriate for dependent variables that are continuous.
| Parametric | Nonparametric |
|---|---|
| OLS | regression tree (forest, etc.) |
| Quantile regression | support vector machine, k-nearest neighbors |
| naive Bayes regression | Artificial Neural Network (ANN) |
| | genetic programming (GP) |
| ... | ... |
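
To make the parametric versus nonparametric trade-off concrete for a continuous outcome, here is a rough sketch that fits a straight line (OLS) and a k-nearest-neighbor regression to simulated data with a nonlinear relationship. The data-generating process (a sine curve plus noise), the choice of k = 10, and all function names are assumptions made for illustration only.

```python
import math
import random

def fit_line(x, y):
    """Closed-form OLS with one covariate: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def knn_predict(x_train, y_train, x0, k=10):
    """Average the outcomes of the k training points closest to x0."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    return sum(y_train[i] for i in order[:k]) / k

def mse(pred, actual):
    """Mean squared prediction error."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

rng = random.Random(1)

def simulate(n):
    """Assumed data-generating process: y = sin(x) + noise."""
    x = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    y = [math.sin(v) + rng.gauss(0, 0.2) for v in x]
    return x, y

x_train, y_train = simulate(400)
x_test, y_test = simulate(200)

b0, b1 = fit_line(x_train, y_train)
mse_ols = mse([b0 + b1 * v for v in x_test], y_test)
mse_knn = mse([knn_predict(x_train, y_train, v) for v in x_test], y_test)

print(f"OLS: y = {b0:.2f} + {b1:.2f} x, test MSE = {mse_ols:.3f}")
print(f"kNN (k=10): test MSE = {mse_knn:.3f}")
```

The line is summarized by two numbers that are easy to read, while the nearest-neighbor fit tracks the curve more closely but has no comparable parameters to interpret.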
A variable is binary if it only takes on two values (without loss of generality: 0 or 1)
Examples: product was purchased by customer (or not), individual has cancer (or not), loan is in default (or not)
| Parametric | Nonparametric |
|---|---|
| Logistic regression | classification tree (forest, etc.) |
| Probit regression | support vector classifier |
| naive Bayes classification | Artificial Neural Network (ANN) |
| | genetic programming (GP) |
| ... | ... |
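
As one possible illustration of the first entry in the table, the sketch below simulates a binary outcome and estimates a logistic regression with a single covariate by Newton's method. The true coefficients, the sample size, and the function names `sigmoid` and `fit_logit` are assumptions for this example; a real analysis would normally use a library routine.

```python
import math
import random

def sigmoid(z):
    """Logistic CDF: maps the linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(x, y, iterations=15):
    """Maximize the logit log-likelihood in (b0, b1) with Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            # Score (gradient) contributions.
            g0 += yi - p
            g1 += (yi - p) * xi
            # Negative Hessian contributions.
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system explicitly.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

rng = random.Random(2)
true_b0, true_b1 = -0.5, 1.5  # assumed values

x, y = [], []
for _ in range(5000):
    xi = rng.gauss(0, 1)
    x.append(xi)
    # y = 1 with probability sigmoid(b0 + b1 * x), else 0.
    y.append(1 if rng.random() < sigmoid(true_b0 + true_b1 * xi) else 0)

b0_hat, b1_hat = fit_logit(x, y)
print(round(b0_hat, 2), round(b1_hat, 2))
```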
A variable is ordered categorical if it takes on a finite (typically small) set of values (without loss of generality: 0, 1, ..., K) and where order matters (i.e. K > K-1 > K-2 > ... > 1 > 0). Values are mutually exclusive, so that each observation in the data belongs to one and only one category.
Example 1: consumer has "interest" in a product by either: (0) not clicking on it; (1) clicking on it but not adding it to the cart; (2) adding it to the cart but not purchasing it; (3) purchasing it once; or (4) purchasing it multiple times
Example 2: a loan is either: (0) neither in default nor foreclosure; (1) in default but not in foreclosure; or (2) in foreclosure
| Parametric | Nonparametric |
|---|---|
| Ordered logistic regression | ordinal trees |
| Ordered probit regression | support vector ordinal regression |
| | ANN |
| | GP |
| ... | ... |
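
To show how an ordered model turns a single index into category probabilities, here is a small sketch of the ordered logit formula, in which P(y ≤ k) is the logistic CDF evaluated at a cutpoint minus the index x'β. The cutpoints and index values below are made-up numbers, and `ordered_logit_probs` is a hypothetical helper name.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(index, cutpoints):
    """P(y = k) for k = 0..K, given the index x'beta and increasing cutpoints."""
    # Cumulative probabilities P(y <= k); the top category has P(y <= K) = 1.
    cumulative = [logistic_cdf(c - index) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

cutpoints = [-1.0, 0.5, 2.0]  # assumed thresholds, giving 4 ordered categories
for index in (-1.0, 0.0, 1.0):
    probs = ordered_logit_probs(index, cutpoints)
    print(index, [round(p, 3) for p in probs])
```

Raising the index shifts probability mass toward the higher categories, which is how the model uses the ordering of the outcome.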
A variable is unordered categorical if it takes on a finite (typically small) set of values (without loss of generality: 0, 1, ..., K) but where there is no inherent ordering among the categories. Values are mutually exclusive, so that each observation in the data belongs to one and only one category.
Example 1: consumer can purchase: (0) no handbags; (1) Louis Vuitton handbags; (2) Coach handbags; (3) Other designer handbags; or (4) non-designer ("generic") handbags
Example 2: a person can choose to live in either: (0) Oklahoma City metro area; (1) Tulsa metro area; or (2) somewhere else in Oklahoma
The algorithms here are the same as for binary dependent variables, generalized to handle more than two categories.
| Parametric | Nonparametric |
|---|---|
| Multinomial logistic regression | classification tree (forest, etc.) |
| Multinomial probit regression | support vector classifier |
| naive Bayes classification | Artificial Neural Network (ANN) |
| | genetic programming (GP) |
| ... | ... |
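
The multinomial logit choice probabilities can be sketched in a few lines: each category gets an index value, and the probabilities are the normalized exponentials (the softmax function). The index values attached to the three location choices from Example 2 are invented for illustration, and `multinomial_logit_probs` is a hypothetical helper name.

```python
import math

def multinomial_logit_probs(utilities):
    """Softmax: P(choice = j) = exp(v_j) / sum_k exp(v_k)."""
    top = max(utilities)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - top) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed index values; the first category is normalized to 0.
utilities = {"Oklahoma City": 0.0, "Tulsa": -0.4, "elsewhere": 0.3}
probs = dict(zip(utilities, multinomial_logit_probs(list(utilities.values()))))
for place, p in probs.items():
    print(f"{place}: {p:.3f}")

# With only two categories the formula collapses to the binary logit.
binary = multinomial_logit_probs([0.0, 1.2])
print(round(binary[1], 4), round(1.0 / (1.0 + math.exp(-1.2)), 4))
```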
A variable is integer-valued if it takes on values in the set of non-negative integers (0, 1, 2, ...). Data with this property is often referred to as count data.
Example 1: consumer can smoke 0, 1, or more cigarettes in a day.
Example 2: a soccer team can score 0, 1, ..., 8 goals in a game.
Example 3: a city can experience 0, 1, or more vehicle accidents in a day.
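Depending on the setting, some researchers will simply assume log-normality of the dependent variable in this case (i.e. model log y with a continuous-outcome method), which requires the counts to be strictly positive. Whether this is a reasonable choice depends on the application.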
The parametric algorithms for modeling count data are a bit different from before, but the nonparametric algorithms are quite similar:
Parametric | Nonparametric |
---|---|
Poisson regression | regression tree (forest, etc.) |
Negative binomial regression | support vector machine |
zero-inflated count models | Artificial Neural Network (ANN) |
zero-truncated count models | genetic programming (GP) |
... |
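
As a rough sketch of the first parametric entry, the code below simulates counts whose conditional mean is exp(β0 + β1x) and estimates a Poisson regression by Newton's method. The true coefficients, the sample size, the Poisson sampler, and the function names are all assumptions for illustration.

```python
import math
import random

rng = random.Random(3)

def draw_poisson(lam):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def fit_poisson(x, y, iterations=20):
    """Maximize the Poisson log-likelihood with log link by Newton's method."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)  # conditional mean
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += mu
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

true_b0, true_b1 = 0.5, 0.8  # assumed values
x = [rng.uniform(-1, 1) for _ in range(3000)]
y = [draw_poisson(math.exp(true_b0 + true_b1 * xi)) for xi in x]

b0_hat, b1_hat = fit_poisson(x, y)
print(round(b0_hat, 2), round(b1_hat, 2))
```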
As mentioned previously, independent variables can be of any of these types. Some helpful things to know about independent variables:
- don't treat an ordered categorical variable as a continuous variable
- use "one-hot" encoding of categorical variables (equivalently, `as.factor()` in R)
- if using a linear regression model, it is sometimes helpful to create polynomial functions of continuous covariates
- interacting two binary covariates (or a binary and a continuous covariate) can also be helpful

**How you specify your right-hand-side variables depends a lot on the goal of your model (i.e. prediction vs. causality).**
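
The tips above can be sketched as a small design-matrix builder: indicator columns for a categorical covariate, a squared term for a continuous covariate, and an interaction between a binary and a continuous covariate. The variable names and values are invented, and dropping a baseline level (the `drop_first` option of the hypothetical `one_hot` helper) is an assumption that suits models with an intercept.

```python
def one_hot(values, drop_first=True):
    """Map categorical values to 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    if drop_first:
        # Omit a baseline level so the columns are not collinear
        # with the intercept.
        levels = levels[1:]
    columns = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, columns

region = ["south", "north", "west", "north"]  # unordered categorical
age = [25.0, 40.0, 31.0, 58.0]                # continuous
female = [1, 0, 1, 0]                         # binary

levels, dummies = one_hot(region)

# Columns: intercept, age, age squared, female, female x age, region dummies.
design = [
    [1.0, a, a ** 2, f, f * a] + d
    for a, f, d in zip(age, female, dummies)
]

print(levels)
for row in design:
    print(row)
```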
Observational data is data that was collected without any experimental intervention, often for purposes unrelated to the research question.
- e.g. Census household survey data, Twitter stream, Dow Jones stock prices, etc.
Experimental data is data from a study in which scientists randomly assign units to take on certain values of a variable of interest (the treatment variable).
Causal inference (as opposed to prediction) is the idea that we can figure out whether x causes y by comparing units that were treated with units that were not treated.
Causal inference requires knowing the "counterfactual": "What would have happened to units in the control group if they had been treated?"
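A small simulation can illustrate why random assignment helps with the counterfactual problem. In the sketch below, an unobserved trait raises the outcome; when treatment is randomized, the simple difference in means is close to the true effect, but when units select into treatment based on that trait, the same comparison is biased. The effect size, the selection rule, and all names are assumptions chosen for illustration.

```python
import random

rng = random.Random(4)
true_effect = 2.0  # assumed treatment effect

def mean(values):
    return sum(values) / len(values)

def diff_in_means(y, d):
    """Average outcome of treated units minus average outcome of controls."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control)

def simulate(randomized, n=20000):
    y, d = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # unobserved trait that also raises the outcome
        if randomized:
            di = 1 if rng.random() < 0.5 else 0  # coin-flip assignment
        else:
            di = 1 if u + rng.gauss(0, 1) > 0 else 0  # self-selection on u
        y.append(true_effect * di + 1.5 * u + rng.gauss(0, 1))
        d.append(di)
    return y, d

experimental = diff_in_means(*simulate(randomized=True))
observational = diff_in_means(*simulate(randomized=False))

print(f"randomized assignment: {experimental:.2f}")
print(f"self-selected treatment: {observational:.2f}")
```

Under randomization the control group stands in for the missing counterfactual because the two groups have the same distribution of the unobserved trait on average; under self-selection they do not, so the comparison overstates the effect in this setup.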
Libraries and functions (`library::function`)

| Algorithm | R | Python | Julia |
|---|---|---|---|
| OLS | stats::lm | statsmodels::OLS | GLM::glm |
| trees | rpart::rpart | sklearn::tree | DecisionTree::build_tree |
| k-nearest neighbor | caret::knn3 | sklearn::KNeighborsClassifier | NearestNeighbors::knn |
| SVM | e1071::svm | sklearn::svm | LIBSVM::svmtrain |
| naive Bayes | e1071::naiveBayes | sklearn::GaussianNB | NaiveBayes::HybridNB |
| ANN | nnet::nnet | sklearn::neural_network | Knet::train |
| genetic prog. | rgp::geneticProgramming | gplearn::est_gp.fit | GeneticAlgorithms::runga |
| logistic/probit regression | stats::glm | sklearn::linear_model.LogisticRegression | GLM::glm |
| ordered logit/probit | MASS::polr | mord::LogisticIT | LowRankModels::OrderedMultinomialLoss |
| multinomial logit/probit | nnet::multinom | same as logistic | SciKitLearn::LogisticRegression |
| Poisson regression | stats::glm | statsmodels::GLM | GLM::glm |