# Exploratory Data Analysis

In this chapter we talk about Exploratory Data Analysis (EDA).

Recall the "Green Cloud" model of the Data Science Process:

![Green Cloud](../resources/green_cloud_model.png){width="50%"}

We have already discussed the importance of the ASK step. 
In the ASK step, we determine the problem we want to solve or the question we want answered.
We described the CoNVO framework for fully fleshing out that problem or question.
What is the Need? 
What is the Context of the Need? 
What Outcome do we want or expect?
How will we implement it?
Finally, the Vision is a high level plan or sketch of what an answer might look like.
Of the four, Outcome is the most often overlooked.
We may talk about researching the effectiveness of promotional codes ("promocodes") but do we have a plan for those research findings to actually change the organizational culture?
Who needs to see the results?
What if the results show that promocodes are not good, in the long run, for business?
If you don't actually have such a plan, the resources will be wasted.

Regardless of whether or not we actually apply CoNVO (even implicitly), ASK requires us to frame a question or a problem, determine if this is an appropriate *data science* question or problem, and then think about how we might answer it, *and* use the answer.

All such questions are about real processes in the real world.
Systems Theory forces us to be explicit about our mental models of that process by identifying the key elements and how they influence each other.
Causal Loop Diagrams are one tool for making such a mental model manifest, if only in a qualitative sense.
Still, CLDs provide a means for stakeholders to share their differing views of the context and their domain knowledge.

After we have a question and a general understanding of the process in which it is embedded, we can start getting data.
This is the GET step.
During the GET step we need to think about getting data that we identified during the ASK step.
Having identified the variables of the system, we can "get" data for them.
Of course, this "get" hides tons of details: is the data inside the organization or outside? 
Can we measure this variable directly or do we need a proxy? 
What is the format of the data (JSON, Binary, PostgreSQL, etc.)? 
What is the provenance of the data?
What are the ethics associated with using this data for this purpose?
Are we legally permitted to use this data?
Where do we keep the data?
Are we allowed to keep the data?
What format should we keep the data in?

As we saw in the last chapter, ETL is actually very complicated and data scientists spend a great deal of time involved with ETL.
But we're going to sweep that aside for now, and work mostly with delimited data (CSV).
At that point, we can go to the next step: EXPLORE.

Unfortunately, this makes it sound like GET and EXPLORE are cleanly and clearly delineated, and that usually isn't the case.
I like to think of ETL as *syntax* and EDA as *semantics*.
What does that mean?

ETL is mostly involved with getting data and making sure it has a usable format.
At the very simplest level, we get a delimited file from the internet and we start looking over it to see:

1. What is the delimiter?
2. Does the file have a header row?
3. Does it have all the variables we expect?
4. Is there a data dictionary that tells us the exact meaning of the variables?
5. If the data has date types, what format are the dates in?
6. Are strings quoted?

and so on.
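The first few checks in that list can be sketched in Python with the standard library's `csv.Sniffer`. This is only an illustration: the sample text and its column names are made up, and both `sniff` and `has_header` are heuristics that can guess wrong, so the result should be confirmed by looking at the file.

```python
import csv
import io

# Hypothetical sample standing in for the first few lines of a downloaded file.
sample = (
    "date;store;units_sold\n"
    "2021-01-03;A;12\n"
    "2021-01-04;B;7\n"
    "2021-01-05;A;15\n"
)

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample, delimiters=",;\t|")  # guess the delimiter
has_header = sniffer.has_header(sample)              # heuristic guess only

rows = list(csv.reader(io.StringIO(sample), dialect))
header, body = (rows[0], rows[1:]) if has_header else (None, rows)

print("delimiter:", repr(dialect.delimiter))
print("header row:", header)
# A single value here means every row has the same number of fields.
print("fields per row:", {len(r) for r in body})
```

Questions such as whether a data dictionary exists, or what the variables actually mean, cannot be answered by code and still require reading the documentation that comes with the data.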

We can poke at the data on the command line using `head`, `cat`, and similar tools.
Maybe we need to use `sed` to replace printer's quotes with typewriter quotes.
We might also load the data into a database for faster querying and a few transformations.
Even if the data is already in a database, we still have to find out the table layout, primary keys, foreign keys, and column types.
But, by and large, this is just the *syntax* of data.
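As an example of this kind of purely syntactic cleanup, the quote replacement mentioned above can also be done in Python rather than `sed`. This is a minimal sketch; `straighten_quotes` and the sample line are hypothetical names and data introduced only for illustration.

```python
# Map printer's (curly) quotes to their typewriter (straight) equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_MAP)


# Hypothetical lines from a delimited file exported by a word processor.
line = "id,comment\n1,\u201cdidn\u2019t arrive\u201d\n"
clean = straighten_quotes(line)
print(clean)
```

Nothing here depends on what the comment *means*; the transformation only makes the file parseable by tools that expect straight quotes.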

Once we start doing EDA, however, we're concerned with the *semantics* of the data.
What is the *shape* of each variable?
Are there outliers? Inliers?
Is there a correlation between variables? What kind?
And, of course, you can start answering your question.
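To make these semantic questions concrete, here is a small sketch using only Python's standard library. The numbers are invented, the variable names are hypothetical, and Tukey's 1.5 × IQR rule is just one common convention for flagging outliers, not the only one. In a notebook you would more likely reach for a data frame library and plots.

```python
import statistics

# Hypothetical paired observations; 95 is planted as an obvious outlier.
units_sold = [12, 7, 15, 9, 11, 14, 8, 95]
temperature = [18.0, 12.5, 21.0, 14.0, 17.5, 20.0, 13.0, 19.0]

# Shape: location and spread of a single variable.
summary = {
    "min": min(units_sold),
    "median": statistics.median(units_sold),
    "mean": statistics.mean(units_sold),
    "max": max(units_sold),
    "stdev": statistics.stdev(units_sold),
}

# Outliers: Tukey's rule flags points more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(units_sold, n=4)
iqr = q3 - q1
outliers = [x for x in units_sold if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]


def pearson(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Correlation with and without the flagged observations.
r_all = pearson(units_sold, temperature)
kept = [(u, t) for u, t in zip(units_sold, temperature) if u not in outliers]
r_clean = pearson(*zip(*kept))

print(summary)
print("outliers:", outliers)
print("r (all data):", round(r_all, 3))
print("r (outliers removed):", round(r_clean, 3))
```

Comparing the two correlation values shows why the questions are asked together: a single extreme value can change the apparent relationship between two variables.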

However, you will very often engage in a bit of ETL when you load the data into your analysis environment (in our case, Jupyter notebooks).
At this fuzzy boundary between the two, additional syntactic problems can arise.
You must ask yourself: did the numeric value get read as a numeric value, or was there a "missing value" token, such as "?", that caused the entire variable to be read in as a string?
The main assumption of EDA is that your data is *clean*.
Anything that needs to be standardized has been standardized.
All data formats are correct and consistent.
In the course of EDA, you may find out that this is not the case and you may need to go back and redo the ETL step.
You may discover that you don't have a variable you thought you did, and you need to go back, get that data, and answer all the questions all over again (is this legal? ethical? where does the data get stored? how? etc.).
This is to be expected.
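The "missing value" token problem is easy to reproduce. The sketch below uses a made-up three-row file and a hypothetical helper, `parse_number`, to show the difference between reading a column naively and telling the reader explicitly which tokens mean "missing".

```python
import csv
import io

# Hypothetical file in which "?" marks a missing value.
raw = "store,units_sold\nA,12\nB,?\nC,7\n"


def parse_number(token, missing=("?", "")):
    """Return a float, or None when the token marks a missing value."""
    if token.strip() in missing:
        return None
    return float(token)  # raises ValueError on any other non-numeric token


rows = list(csv.DictReader(io.StringIO(raw)))

# Naive read: every field arrives as a string, including the numbers.
naive = [row["units_sold"] for row in rows]

# Explicit read: numbers become floats and "?" becomes None.
parsed = [parse_number(row["units_sold"]) for row in rows]

print(naive)
print(parsed)
```

Letting `float` raise on unexpected tokens is a deliberate choice in this sketch: a loud failure at load time is easier to diagnose than a numeric variable that silently became text. If you load data with pandas, the `na_values` argument of `read_csv` serves the same purpose.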

Finally, if you are working from different data sources (say, sales data and weather data) and you are still waiting for the weather data (because you need to buy an API key), you don't need to wait to do EDA on the sales data.
Therefore, the project itself may have different data sets in different stages of ETL and EDA.

In the end, however, once we have our problem or question and the data we need, we can start looking at the shapes of the variables and the relationships between them. This step forms the foundation for inference and modeling.

* [Definition](definition.ipynb)
* [Descriptive Statistics](statistics.ipynb)
* [Importance of Visual Exploration](importance.ipynb)
* [Taxonomy](taxonomy.ipynb)
* [Example](example.ipynb)
* [Conclusion](conclusion.ipynb)