# Week 2 Overview

![image.png](attachment:image.png)
Last week, you learned about summarizing data and handling inconsistent data in the context of some bank customer data. This week, you’ll do the same, but with your own datasets that you collected last semester. Your goal is to explore these datasets to see what they are like. Is everything as expected? Are the numbers in the right ranges? Are there any missing values?
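
As a starting point, the three questions above can be turned into a quick first look with pandas. This is only a minimal sketch: the `df` below is a small made-up table standing in for your own dataset, and the plausible age range of 0 to 120 is an assumption you would replace with limits that make sense for your data.

```python
import pandas as pd

# Hypothetical stand-in for your own dataset; in practice you would load
# your file instead, for example with pd.read_csv(...).
df = pd.DataFrame({
    "age": [34, 51, None, 29, 300],
    "balance": [1200.0, 530.5, 87.0, None, 410.0],
})

# Is everything as expected? Summary statistics for each numeric column.
summary = df.describe()

# Are there any missing values? Count them per column.
missing_per_column = df.isna().sum()

# Are the numbers in the right ranges? Assumed plausible range: 0 to 120.
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]

print(summary)
print(missing_per_column)
print(out_of_range)
```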

![image-2.png](attachment:image-2.png)
## Learning Objectives
At the end of this week, you should be able to:
- Identify the audience (who), the message (what), and the story/data graphs (how) for a data visualization
- Apply descriptive statistics and data quality practices to student-determined data/dataset 
- Apply context-driven decisions to create data visualizations (using Matplotlib)
- Apply descriptive statistics to real-world data/datasets 
- Identify data quality issues of a given dataset 
- Describe effective (good) and ineffective (bad) graphs
- Create data visualizations using Matplotlib
- Reproduce a graph/data visualization from an example graph/data visualization
- Analyze existing data visualizations to identify strengths and weaknesses 
- Apply principles from Storytelling with Data (SWD) to create graphs

## 2.1 Lesson: Problems that Require Preprocessing of Data
In this lesson, your goal is to apply preprocessing approaches to your chosen datasets. Some problems that require preprocessing of data include:

![image.png](attachment:image.png)

| Problem |  Description |
| :--- | :--- |
| <b>Inconsistent Data</b> | For example, a person who opened a bank account in 1950 but was born in 1960. |
| <b>Duplicate Rows</b> | All values in all columns are identical between the two rows, indicating that the two rows likely do not represent distinct samples. |
| <b>Missing Values</b> | For example, an “age” column where typical values could be 20 or 80, but where some values are -1 or None instead. |
| <b>Outliers</b> | In an age column, it’s clear enough that 300 is an outlier. However, you can detect outliers even if you don’t know the meaning of a column. For instance, if the recorded lengths (in inches) for a fish species are 20, 22, 24, and 50, it’s clear that the 50 could be an outlier. |
| <b>Inconsistent Formatting</b> | If the column contains both “2024-01-12” and “March 15, 2024,” there’s a problem. |
| <b>Categorical Variables</b> | For example, if a column can contain “red,” “blue,” or “green,” then we could create three new columns, where “red” is recorded as (1, 0, 0), “blue” as (0, 1, 0), and “green” as (0, 0, 1). |
| <b>Imbalanced Classes</b> | If there are 999 “red” values and only one “blue” and one “green” value, then the red/blue/green columns may not be that useful for certain purposes. |
| <b>Data Type Inconsistencies</b> | If a column contains 123, “hello,” and “2024-02-12,” there may be a problem. |
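
Several of these problems can be detected with a few lines of pandas. The sketch below is illustrative only: the `customers` table and its column names are made up from the examples in the table, and the 1.5 × IQR rule used to flag outliers is one common convention, not the only way to define an outlier.

```python
import pandas as pd

# Hypothetical data built from the examples in the table above.
customers = pd.DataFrame({
    "birth_year": [1960, 1945, 1945, 1980],
    "account_opened_year": [1950, 1970, 1970, 2005],
    "age": [64, 79, 79, -1],
    "color": ["red", "blue", "blue", "green"],
})

# Inconsistent data: the account was opened before the person was born.
inconsistent = customers[customers["account_opened_year"] < customers["birth_year"]]

# Duplicate rows: every column matches an earlier row.
duplicates = customers[customers.duplicated()]

# Missing values: turn the -1 placeholder into a real missing value (NaN).
customers["age"] = customers["age"].mask(customers["age"] == -1)

# Categorical variables: one indicator column per category.
one_hot = pd.get_dummies(customers["color"], dtype=int)

# Outliers: the fish lengths from the table, flagged with the 1.5 * IQR rule.
lengths = pd.Series([20, 22, 24, 50])
q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)
iqr = q3 - q1
outliers = lengths[(lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)]

# Data type inconsistencies: values that cannot be parsed as numbers become NaN.
mixed = pd.Series([123, "hello", "2024-02-12"])
as_numbers = pd.to_numeric(mixed, errors="coerce")

# Imbalanced classes: the share of each category in the column.
colors = pd.Series(["red"] * 999 + ["blue", "green"])
class_shares = colors.value_counts(normalize=True)
```

Each check only finds candidate problems. Deciding whether to drop, correct, or keep a flagged row still depends on what the data means.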

### Think About It
- Does an outlier always indicate inaccurate data? How can you decide if an outlier is due to an error in the data?
- If an “age” value is listed as -1, what would be the best value to use instead: 0, the mean age of the rest of the data, or the median age of the rest of the data? 
- When would imbalanced classes be a problem?
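
To explore the second question, it can help to compute each candidate replacement on a small example and compare them. The ages below are made up for illustration, and the sketch does not settle which choice is best. It only shows how far apart the options can be when one value is much larger than the rest.

```python
import pandas as pd

# Hypothetical ages; -1 is the placeholder for a missing value.
ages = pd.Series([22, 25, 31, 38, 80, -1])

# Compute the statistics from the valid values only.
valid = ages[ages != -1]

candidates = {
    "zero": 0,
    "mean": valid.mean(),      # pulled upward by the single age of 80
    "median": valid.median(),  # the middle valid value
}

for name, value in candidates.items():
    print(name, value)
```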

____

![image-2.png](attachment:image-2.png)
## Resources

### Reading | Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. Chapter 1: The Importance of Context. John Wiley & Sons.

In this chapter, you will learn about telling a story with “who,” “what,” and “how.”

This may sound counterintuitive, but success in data visualization does not start with data visualization. Rather, before you begin down the path of creating a data visualization or communication, attention and time should be paid to understanding the context for the need to communicate. In this chapter, we will focus on understanding the important components of context and discuss some strategies to help set you up for success when it comes to communicating visually with data.

#### Exploratory vs explanatory analysis
- Exploratory Analysis: What you do to understand the data and figure out what might be noteworthy or interesting to highlight to others.
- Explanatory Analysis: When you're at the point of communicating your analysis to your audience, you want to be in the *explanatory* space, meaning you have a specific story you want to tell.

#### Who, what, and how
- There are a few things to think about before visualizing any data or creating content:
    - *To whom are you communicating?* Find a common ground that will ensure they hear your message.
    - *What do you want your audience to know?* Your message and tone should be clear.
- Only after you answer these first two questions are you ready to move forward with the third: *How can you use your data to help make your point?*

#### Who
- The more specific you can be about your audience, the better. Be wary of trying to communicate to too many audiences (with disparate needs) at once, which can leave you unable to communicate effectively to any of them. Identifying the decision maker is one way of narrowing down your audience.

#### You
- Understand the relationship you have with your audience and how you expect they will perceive you. These questions may impact the order and flow of the overall story you aim to tell.
    - Will this be your first meeting? Or do you have an established relationship?
    - Do they already trust you as an expert?

#### What
- Action:
    - What do you need your audience to know? Your message should be clear and concise.
    - When it isn't appropriate to recommend an action explicitly, encourage discussion toward one. Suggest next steps, rather than simply presenting data. This elicits a more productive reaction from your audience, which can lead to a more productive conversation.
- Mechanism:
    - *How will you communicate to your audience?* Ineffective methods can limit the amount of information the audience will be capable of absorbing. Ensure that your important points are explicit. 

![image-3.png](attachment:image-3.png)

- Tone:
    - *What tone do you want your communication to set?* 

#### How
- Only after we can clearly articulate who our audience is and what we need them to know or do can we turn to the data and ask the question: *What data is available to help make my point?*
    - Data becomes supporting evidence of the story you will build and tell.

#### The 3-minute story & Big Idea
The idea behind each of these concepts is that you are able to boil the "so-what" down to a paragraph, and ultimately to a single, concise statement.
- The 3-Minute Story
    - If you only had 3 minutes to tell your audience what they need to know, what would you say?
- Big Idea:
    - There are 3 components to the big idea
        1. It must articulate your unique point of view
        2. It must convey what's at stake; and
        3. It must be a complete sentence
