# Data Types

## Contents

1. [Introduction](#Introduction)
2. [Data Types](#Data-Types)
    1. [Transformations](#Transformations)
    2. [Granularity](#Granularity)
3. [Structured and Unstructured Data](#Structured-and-Unstructured-Data)

## [Introduction](#Contents)

When talking about Data, we need to go back to our systems view of the world and our data science problem.

Data is a collection of observations of a system, specifically, the system relevant to your context and need. If it isn't, you haven't described the underlying system very well. The data will be like snapshots of that system at a point in time or different points in time. If you work the other way (you started with your data), then the system you describe provides context for your data and, as we'll see later, your exploration of the data.

Having both the data and a description of the system, you can start asking yourself questions:

1. How does the system you've described give rise to the data that you've observed?
2. How does the data that you've observed relate to the system you've described? What is missing?

As a concrete example, imagine we started with the Population CLD above. If we want to estimate the parameters of that model, we need to find data. As it turns out, that kind of data is available in the decennial US Census. The Census data is a collection of observations with each row representing values for various features including: year, name, address, parents, income, age, country of origin, number of years at the residence, etc.

Suppose we started out with the Census data. Then we might arrive at the Population CLD above as a description of the system that gives rise to our data as well as context.

And it also allows us to check our model. For example, if we were looking at just a single city in Iowa, then the Population CLD above might prove to be a poor model. If we start looking at the Census data and assume that anyone recorded at one census and not recorded in a subsequent one has died, we'll get an incorrect estimate of the death rate because our mental model is wrong or incomplete. Why? Because we're missing *migration*.

## [Data Types](#Contents)

So we have a system and variables. Those variables can be of different types, different kinds of measurements and assignments. Here is one breakdown of data types:

- Qualitative
	* **Nominal** - the characteristic is a label (Republican, Independent, Democrat), (North America, South America, Africa, Europe, Asia, Australia) and has no order.
	* **Ordinal** - the characteristic is a label but has an order: 1st, 2nd, 3rd, 4th, etc. Differences such as 4th - 2nd make no sense. Likert scales (1-5, 0-10) are another example.
- Quantitative
	* **Interval** - an ordered numeric space of intervals where 4 - 2 makes sense. Fahrenheit, for example, divides the space between 32 (the freezing point of water) and 212 (the boiling point) into 180 equal intervals. 0 does not mean "no value".
	* **Ratio** - 0 means "no value". Height, weight, Kelvin scale. Very often for such data, there are no negative values either (although negative derived values are possible).

Here's another, more detailed one from Wikipedia for [Statistical Data Types](https://en.wikipedia.org/wiki/Statistical_data_type).

- Simple Types
	* **binary** - nominal (yes, no) also called dichotomous.
	* **categorical** (nominal) - assignments to groups.
	* **ordinal** (ordinal) - labels indicating ordering.
	* **count** (ratio) - number of items in an interval/area/volume.
	* **real-values**
		+ additive - temperature (20 degrees - 13 degrees makes sense, 0 doesn't mean "none")
		+ multiplicative - weight (0 means "none")
- Complex Types
	* **Money** (amount and currency)
	* **Date** (ordinal but different schemes)
	* **Time** (ordinal and ratio and different schemes)
	* **Location** (latitude and longitude)
	* **Relational** (friends)

You don't need to know about the other columns in the table until later in the semester (distribution, statistics, etc.).

The two different kinds of real values can be confusing. Essentially, for an additive real-valued variable you can do addition and subtraction, but multiplication doesn't really apply. This is related to the 0 not meaning "none". For example, 0 degrees Fahrenheit is some temperature and 2 $\times$ 0 should mean twice as warm but it's still 0. Whereas with Kelvin (a multiplicative scale), 2 $\times$ 0 is still 0 and that makes sense because 0 *is* "none".
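As a rough illustration of the difference, the sketch below compares two temperatures on both scales. The two temperatures and the helper name `fahrenheit_to_kelvin` are made up for this example; only the standard conversion formula is assumed.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert Fahrenheit to Kelvin, a scale whose 0 really means "none"."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Two illustrative temperatures in degrees Fahrenheit.
cold_f, warm_f = 40.0, 80.0

# Differences are meaningful on an additive scale.
difference_f = warm_f - cold_f

# Ratios are not: 80 / 40 is 2, but 80 F is not "twice as warm" as 40 F.
naive_ratio = warm_f / cold_f

# On the multiplicative Kelvin scale the ratio is meaningful,
# and it turns out to be far closer to 1 than to 2.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cold_f)

print(difference_f, naive_ratio, round(kelvin_ratio, 3))
```

The subtraction gives the same kind of answer on either scale, while the ratio only makes sense once the zero point means "none".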

Additionally, quantitative measurements can be continuous or discrete. For example, counts are generally discrete whereas general measurements of distance, height, weight, etc., are at least theoretically continuous down to the resolution of the measuring device.

Probably the most important takeaway from this is that not all numbers can have arithmetic operations performed on them. For example, average *place* is not really anything, so despite the fact that your statistics package might identify the data as numeric, that doesn't mean that addition and subtraction apply. Some people assert the same is true of Likert-type scales (1 = not important at all to 5 = very important). Because they are relative and cannot be calibrated between people (is my "somewhat important" the same as yours?), taking an average of Likert values is not appropriate. Another problem area is codes or encodings. These will show up as numbers but are really categorical *codes*.

We can summarize the problem by saying not all numbers are quantitative.
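A small sketch of the point, using only the standard library. The product codes and finishing places are hypothetical values invented for illustration.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical product codes stored as integers; they are labels, not quantities.
product_codes = [101, 101, 205, 310, 310, 310]

# Python will happily compute this, but an "average code" means nothing.
meaningless_average = mean(product_codes)

# Counting occurrences per label is a summary that respects the categorical type.
code_counts = Counter(product_codes)

# Finishing places are ordinal: the median relies only on their ordering.
places = [1, 2, 3, 4, 5]
middle_place = median(places)

print(meaningless_average, code_counts.most_common(1), middle_place)
```

Nothing in the language stops us from averaging the codes, which is exactly why the data type has to be tracked by the analyst rather than inferred from the storage type.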

### [Transformations](#Contents)

As we will soon see, it is often handy to be able to convert between continuous and discrete, qualitative and quantitative values. We sometimes want to convert between different types of qualitative and quantitative values as well or derive new measurements.

An example of the first type of transformation is converting income (a real-valued multiplicative quantitative measurement) into income ranges (an ordinal qualitative measurement). An example of the second type of transformation is converting arrival time (a qualitative ordinal measurement) into *inter*-arrival time (a quantitative real-valued multiplicative measurement).

*Qualitative to Quantitative Transformations*

* Binary to Integer

The most basic transformation is the transformation of binary/dichotomous data from yes/no, success/failure, alive/dead to 0 or 1. Many other transformations are based on this basic transformation.

* Categorical to Binary

Most statistical and machine learning techniques work with numbers, not symbols. We therefore often need to convert symbols to numbers somehow. We often do this by counting. For example, to summarize a "religion" feature, we will count the number of Christians, Jews, Muslims, Buddhists, Atheists, etc. and report the absolute or relative counts (proportions) for each label. This does not, however, help us with any algorithm that works only on numerical values.

We cannot simply assign numbers to our categories. For example, Christian = 0, Jewish = 1, Muslim = 2, Buddhist = 3, Atheist = 4, etc., is not any better/different/definitive than Christian = 4, Jewish = 2, Muslim = 1, Buddhist = 0, Atheist = 3, etc. Not only does Jewish - Buddhist make no sense whatsoever, but the second assignment would also imply that 2 * Jewish = Christian, which is equally meaningless!

The general solution to this problem is to use a "one-hot encoding". Given a categorical feature with m possible values, we create m binary features, one named after each possible value, where 0 indicates "not it" and 1 indicates "it". Thus an observation with Religion = Buddhist would become Christian = 0, Jewish = 0, Muslim = 0, Buddhist = 1, Atheist = 0.
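The encoding can be sketched in a few lines of plain Python. The function name `one_hot` is hypothetical, and the category list simply reuses the labels from the example above.

```python
def one_hot(value: str, categories: list[str]) -> dict[str, int]:
    """Return one 0/1 indicator per category for a single observation."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {category: int(category == value) for category in categories}


religions = ["Christian", "Jewish", "Muslim", "Buddhist", "Atheist"]
encoded = one_hot("Buddhist", religions)
print(encoded)
```

Exactly one of the m indicators is 1 for any observation, so no artificial ordering or arithmetic relationship is introduced between the labels.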

*Quantitative to Qualitative Transformations*

Some algorithms and summaries work better for real-valued measurements if they are converted to categorical (ordinal) values. For example, nobody wants to look at a distribution of ages by year with 100+ possible values. The process of converting real-valued measurements into categorical ones is often called discretization or binning. Oddly, discretization does not actually turn the values into discrete *numeric* values but categorical, ordinal ones. We will use *binning* here. We will often want to bin a continuous feature to improve exploratory data analysis (EDA) or improve statistical analysis/machine learning. We will have more to say about each when the time comes.
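As a minimal sketch of binning, the code below maps ages to ordered labels. The bin edges and labels are assumptions chosen only for illustration, not a recommended scheme.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
edges = [18, 30, 50, 65]
labels = ["under 18", "18-29", "30-49", "50-64", "65+"]


def bin_age(age: float) -> str:
    """Map a numeric age to an ordered categorical label."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return labels[bisect_right(edges, age)]


ages = [4, 17, 18, 42, 64, 65, 90]
binned = [bin_age(age) for age in ages]
print(binned)
```

The result is ordinal rather than numeric: the labels have an order, but subtracting one label from another is no longer meaningful.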

Sometimes there are natural, socially or culturally important bins. For example, ages are often binned up to 16 or 18, then to 21, then by decade to 64, then 65 plus. For EDA-type binning, there are a large number of rules for choosing bins, including:

* square-root choice
* Sturges's formula
* Rice rule
* Doane's formula
* Scott's normal reference rule
* Freedman-Diaconis rule
* Minimized L2 risk

When we get to Exploratory Data Analysis and histograms, we'll talk more about this. The bins used in machine learning are often generated by more sophisticated algorithms using entropy or mutual information.
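To give a feel for how the simpler rules behave, here is a sketch of three of them as they are commonly stated in textbooks (the formulas are not taken from this text, and the function names are hypothetical). Each returns a suggested number of bins for a sample of size n.

```python
import math


def square_root_bins(n: int) -> int:
    """Square-root choice: k = ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n: int) -> int:
    """Sturges's formula: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1


def rice_bins(n: int) -> int:
    """Rice rule: k = ceil(2 * n^(1/3))."""
    return math.ceil(2 * n ** (1 / 3))


n = 500  # illustrative sample size
print(square_root_bins(n), sturges_bins(n), rice_bins(n))
```

These three depend only on the number of observations; rules such as Scott's and Freedman-Diaconis also use the spread of the data, so they are omitted from this sketch.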

The interesting thing about data types is that you can derive completely different data types from other data types. For example, if you have data about runners and their "place" in a race, that's *ordinal* data but differences in place are real-valued, ratio data because 0 means "no difference" (ties have the same "place"). Additionally, the difference can be negative or positive.

Timestamped data is another good example. You might have data that shows when a customer first landed on your website. As it is, the timestamp is just ordinal data. But you can transform this data into number of people per hour which is counts or differences in time (after converting to seconds since the Epoch) which is ratio data.
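Both derived measurements can be sketched with the standard library. The timestamps below are hypothetical landing times invented for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical landing timestamps for illustration.
arrivals = [
    datetime(2024, 1, 1, 9, 5),
    datetime(2024, 1, 1, 9, 40),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 1, 10, 20),
    datetime(2024, 1, 1, 10, 50),
]

# Counts per hour: truncate each timestamp to the start of its hour, then count.
per_hour = Counter(t.replace(minute=0, second=0, microsecond=0) for t in arrivals)

# Inter-arrival times in seconds: differences between consecutive arrivals.
ordered = sorted(arrivals)
inter_arrival_seconds = [
    (later - earlier).total_seconds()
    for earlier, later in zip(ordered, ordered[1:])
]

print(dict(per_hour))
print(inter_arrival_seconds)
```

The same column of timestamps yields a count feature in one case and a ratio feature in the other, and five arrivals produce only four inter-arrival times.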

There are other transformations you can do to create new features and we'll talk about them when we talk about modeling.

### [Granularity](#Contents)

Another thing you need to be aware of when working with data is the *granularity* of the data relative to your CoNVO and other data you're using. To illustrate the potential problem of not considering granularity, let's consider two different data science problems. The first involves comparing government budgeting (fiscal policy) at the state level. The second involves looking at crime at the state level.

In the first case, states are properly the unit of observation for comparing state fiscal budgets. Budgets happen *to* states. In the second case, however, crime really doesn't happen *to* states. Crimes happen to individuals and crime statistics can be aggregated to the state level.

Thinking about granularity becomes even more important when you try to combine data sets. For example, you might have individual data about your customers but you lack information about their income and education levels and you think they are relevant to your problem. You might be able to *enrich* your primary data with income and education data if you can obtain it.

One approach would be to simply ask your customers but this might not go over well. Another approach would be to obtain the data from private or public sources. These other sources might not have individual data but instead have data based on Census Tract, Zip or Postal Code, City/County or even State data. If you use this data, however, there will be a mismatch in granularity between the data sources.

What this all boils down to is being aware of the unit of observation relevant to your problem (context and need) and whether the data you have or data you obtain is at the same level. It doesn't *have* to be and sometimes it simply cannot be but you have to acknowledge that the aggregation imposes limits on the conclusions you can draw or may affect the model.

For example, crime statistics could vary between states simply because of different levels of urbanization (think New Jersey v. Wyoming). Or if you enrich your customer data with City-level income estimates then everyone in that city gets the same estimate. If most of your customers are from that one city, the income information doesn't add anything. On the other hand, zip code level income estimates might be more useful.
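The mismatch is easy to see in a toy enrichment. All records, zip codes, and income figures below are made up for illustration.

```python
# Hypothetical individual-level customer records.
customers = [
    {"id": 1, "zip": "10001"},
    {"id": 2, "zip": "10001"},
    {"id": 3, "zip": "20002"},
]

# Hypothetical income estimates available only at the zip-code level.
median_income_by_zip = {"10001": 72000, "20002": 58000}

# Enrichment: every customer in a zip code receives the same aggregate value.
enriched = [
    {**customer, "zip_median_income": median_income_by_zip.get(customer["zip"])}
    for customer in customers
]

# The new feature can take at most as many distinct values as there are zip codes.
distinct_values = {row["zip_median_income"] for row in enriched}
print(enriched)
print(len(distinct_values))
```

Customers 1 and 2 are indistinguishable on the new feature even if their actual incomes differ greatly, which is the limit that aggregation imposes.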

In the last section we talked about transforming website arrival timestamps into counts per hour. This transforms the granularity of the data: each row is now an hour rather than a visitor. In general, turning any set of observations into counts will affect the level of granularity.

## [Structured and Unstructured Data](#Contents)

We tend to think of data as a rectangular array where the rows are observations from the system of interest and columns are the features. Every cell is a value. Most analysis and modeling is geared towards data of this type. This is generally referred to as *structured* data.

There are a lot of artifacts which could be potential data sources but they're not composed of nice, neat obvious values. Although we talked about complex data types earlier, what I have in mind here are things like *text* (blog posts, comments, books, etc), *images* (GIFs, JPEGs, PNGs, analog paintings, etc.), *videos* (YouTube videos, movies, TV shows, and their analog versions), and *sounds* (music, etc).

If you want to get an idea of where they might arise, try to think of a system described by a Causal Loop Diagram where a blog post, a comment or a product review is one of the variables. How do we work with that?

There are a variety of techniques associated with turning these *unstructured* data sources into structured ones. We will concentrate on images and text.

*Images*

Images, especially digital images, are slightly easier than text because they are basically a matrix of complex data types, pixels. Consider a 64 x 64 pixel image. First, we can linearize the image into an array of 4,096 pixels. We can then decompose the pixels into their 3 constituent parts: red, green and blue for an array of 12,288 floats or integers (or 4 constituent parts if we include alpha).

The main issue here is that we now have 12,288 features (the red value of location 3,664, the blue value of location 933, etc.). And that is only for a 64x64 pixel image! If we have larger images, we will need to store more data. If we have images of different dimensions, we may have to normalize using some sort of lossy resizing.
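The two steps can be sketched with nested lists. The solid-colour image below is a stand-in invented for illustration; a real image would be loaded with an imaging library.

```python
WIDTH, HEIGHT = 64, 64

# A hypothetical solid-colour image: HEIGHT rows of WIDTH (red, green, blue) pixels.
image = [[(255, 128, 0) for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Step 1: linearize the 2-D grid into one long array of pixels.
pixels = [pixel for row in image for pixel in row]

# Step 2: decompose each pixel into its channels to get one flat feature vector.
features = [channel for pixel in pixels for channel in pixel]

print(len(pixels), len(features))
```

Each image becomes a single row of 12,288 values, so a collection of same-sized images becomes the familiar rectangular array.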

There are many alternatives, though. For example, we might work with a black and white version of the image or perhaps a grayscale one. We can identify regions of interest. We can also use clustering and transformations to create meta features and use those in our modeling. This is, essentially, what *deep learning* neural networks do, although automatically.
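As one example of such a reduction, a grayscale conversion collapses the three colour channels into a single value per pixel. The weights below are one common luma convention (ITU-R BT.601) and are an assumption of this sketch; other weightings exist.

```python
def to_grayscale(pixel: tuple[int, int, int]) -> float:
    """Collapse an (r, g, b) pixel into one brightness value."""
    red, green, blue = pixel
    # Weighted sum of the channels; the weights add up to 1.
    return 0.299 * red + 0.587 * green + 0.114 * blue


# Three illustrative pixels: white, black, and pure red.
pixels = [(255, 255, 255), (0, 0, 0), (255, 0, 0)]
gray = [to_grayscale(pixel) for pixel in pixels]
print(gray)
```

Applied to a 64 x 64 image, this cuts the feature count from 12,288 to 4,096 at the cost of discarding colour information.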

The analysis of images (and other such media like sounds and videos) has not yet become common in data science. Text, on the other hand, is the next most common data science problem after "just numbers".

### [Prose](#Contents)

While images have a natural representation as numbers, text is much more difficult to work with because it doesn't appear to have any obvious numerical representation. Specific words follow specific words and specific sentences follow specific sentences; otherwise it's all just so much gibberish.

Or is it?

It turns out that for many applications it's sufficient to treat a document, even something as small as a tweet, as a "bag of words". The bag of words model entails treating each document as if the words had no order and either simply noting their presence in the document or counting how many times they appear in the document. If we do this for each document, we will end up with structured data where each row represents a document, each column represents a particular word and each cell is some numeric value.
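A bare-bones version of this idea, using naive whitespace splitting as the tokenizer. The two documents are invented for illustration.

```python
from collections import Counter

# Two hypothetical documents.
documents = ["the cat sat on the mat", "the dog sat"]

# One bag (word -> count) per document; word order is discarded.
bags = [Counter(document.split()) for document in documents]

# The columns are the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*bags))

# One row per document, one column per word, each cell a count.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

A `Counter` returns 0 for words it has never seen, which is what fills in the cells for words that are absent from a document.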

There are, however, a few questions that arise when you actually try to go and do this such as:

0. What is a word?
1. What about punctuation?
2. What about capitalization? If "The" and "the" appear are they the same or different?
3. What about "the" and "and" and all the rest of the very common words? Do we really need to count them?
4. What about related words? inspection, inspected, inspector, inspecting? Are these to be counted as the same or different?
5. What about inflection? good, better, best? bad, worse, worst? am, are, was?

The answers to each question depend (as always) on the task at hand. The first two problems (and their solution) involve [tokenization](https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)). Tokenization can occur at the glyph, word, or sentence level according to the problem. If the data was acquired via an API or web scraping, this might also include removing HTML or XML markup. Removing punctuation used to be a no-brainer, but it is no longer so clear-cut in the Age of Emojis. A "wink" emoji can turn a comment from mean to sarcastic, and a "frowny" emoji can change the entire emotion behind a review. Simply removing emojis can therefore remove important semantic information, which makes the tokenization task more difficult.

The next problem refers to normalization. If capitalization is not important for the particular application, then you downcase everything.
Removing common words is almost always a good idea. These are called "stop words" and there are lists of stop words for various languages. Transforming words to their common root is accomplished by [stemming](https://en.wikipedia.org/wiki/Stemming), which follows rules for reducing words in English (or the appropriate language) to their stems. The transformations that result from grammatical rules (comparatives and superlatives, verb conjugation) are "undone" using [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation). This isn't often done for simple applications.
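The steps above can be strung together in a toy pipeline. The stop word list and suffix rules below are tiny, hypothetical stand-ins; a real application would use a full stop word list and an established stemmer rather than this crude suffix stripper.

```python
import re

STOP_WORDS = {"the", "and", "a", "of", "was", "by"}  # tiny illustrative list
SUFFIXES = ("ing", "ion", "ed", "or", "s")  # toy rules, not a real stemmer


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters as tokens.

    Note that this drops punctuation, digits, and emojis entirely.
    """
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, remove stop words, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The inspector inspected the inspection, and was inspecting.")
print(tokens)
```

All four related words collapse to the same stem here, but the same rules would mangle many other words, which is why the choice of stemmer depends on the task.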

A common scheme for weighting the resulting tokens is [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Python has the raw tools available to accomplish most of these tasks. We can write regular expressions to identify the tokens of interest. Using a list of stop words, we can remove them from the document. A stemmer is also available.

There are, however, two Python libraries that have modules for dealing with text: [Natural Language Toolkit](http://www.nltk.org/) and [Scikit Learn](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). The documentation for each is not as straightforward as one would like.

Once you have reduced a document to a bag of words, there are a number of ways that it can be translated into a numeric feature vector. We might simply be interested in whether or not the word occurs in the document. In this case, we create a binary feature for each token (tokens are more general than words). We might be interested in the absolute counts--how many times each token appears in the document--but because larger documents have more words in general, we probably want to normalize token counts to *relative* frequency.
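The three encodings can be compared side by side on one small document. The tokens and vocabulary are hypothetical; the vocabulary deliberately includes a word that does not occur in the document.

```python
from collections import Counter

# A hypothetical document already reduced to tokens, and a fixed vocabulary.
tokens = ["inspect", "car", "inspect", "fail"]
vocabulary = ["car", "fail", "inspect", "pass"]

counts = Counter(tokens)

# Presence or absence of each vocabulary token.
binary_vector = [int(counts[word] > 0) for word in vocabulary]

# Absolute number of occurrences of each token.
count_vector = [counts[word] for word in vocabulary]

# Counts divided by document length, so long and short documents are comparable.
relative_vector = [counts[word] / len(tokens) for word in vocabulary]

print(binary_vector, count_vector, relative_vector)
```

Dividing by the document length keeps a long document from looking more "about" a word simply because it contains more words overall.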

When we talk about modeling, we'll discuss further adjustments that are required to correctly use text data in models.