# "Normal" views of money

In this set of exercises, you'll work with this sample of US accounts. The first column is a unique account identifier, and the second contains the amount of savings held in that account.

It is helpful to visualize distributions using a histogram. A histogram buckets each observation into a "bin". Like an average, it reduces information, but not down to a single value: the histogram shows the count of observations in each bin. Keep in mind when making a histogram that it's best to have more than 20 observations. Furthermore, if you are working with samples of data, you will want a random sample to prevent biased results.
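
To make the binning idea concrete outside the spreadsheet, here is a minimal Python sketch that counts observations per fixed-width bin. The function name `histogram_counts`, the bin width, and the savings values are all assumptions invented for illustration; they are not taken from the exercise data.

```python
from collections import Counter

def histogram_counts(values, bin_width, start=0):
    """Count how many observations fall into each fixed-width bin."""
    counts = Counter()
    for v in values:
        # Index of the bin containing v; each bin covers [lower, lower + bin_width)
        index = int((v - start) // bin_width)
        counts[start + index * bin_width] += 1
    # Return bins ordered by their lower edge
    return dict(sorted(counts.items()))

# Hypothetical savings balances, invented for illustration
savings = [1210, 1350, 1390, 1402, 1415, 1440, 1480, 1530, 1610]
bins = histogram_counts(savings, bin_width=100, start=1200)

# Print a rough text histogram: one '#' per observation in the bin
for lower, count in bins.items():
    print(f"{lower}-{lower + 99}: {'#' * count}")
```

Each bar's length is simply the count of observations in that bin, which is the same information a spreadsheet histogram chart draws.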

Let's now take a look at what the savings distribution is and how it compares to the descriptive statistics that have already been calculated for you.

- SEE 02.02

# Visualizing customer longevity

Now that you have seen that savings has a peak with roughly equal "tails" to the left and right, let's explore another variable in the data: "Retention". The bank tracks the "retention" of each customer, that is, how long a customer stays with the company. Retention is one indicator of a healthy bank because acquiring and setting up new accounts is costly.

When you made the savings histogram, the peak was centered around ~1400. This sample data ranges from 1 to 65 months, so the peak will be in a different place, and you should note how the "tails" look in this visual. Additionally, review the mean and mode to see if they are as close to each other as they were in the previous exercise.

- SEE 02.03

# Visualizing customer donations

Some banks will buy third-party information to help with marketing. For example, some non-profit donations are publicly available. A bank can market additional banking services to these donors, who are likely to be affluent. This sample data has the household identifier joined to example third-party data for public donations.

In this exercise, you will focus on the mode and standard deviation. Since the data is expected to be a proxy for affluence with most people not making donations, these descriptive statistics can aid you in understanding the distribution before you create another histogram.
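
As a rough illustration of why the mode and standard deviation are informative here, the following Python sketch computes both for a small donation sample in which most households give nothing. The values are hypothetical and invented for illustration; the standard-library calls stand in for the spreadsheet's own summary functions.

```python
import statistics

# Hypothetical donation amounts per household, invented for illustration;
# most households donate nothing
donations = [0, 0, 0, 0, 0, 0, 0, 50, 100, 250, 1000]

mode = statistics.mode(donations)     # most frequent value
mean = statistics.mean(donations)     # arithmetic average
stdev = statistics.stdev(donations)   # sample standard deviation

print(f"mode={mode}, mean={mean:.1f}, stdev={stdev:.1f}")
```

A mode of 0 together with a standard deviation larger than the mean suggests a distribution piled up at zero with a long right tail, which is worth knowing before drawing the histogram.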

- SEE 02.04

# Is the data "normally" distributed?

Let's revisit the savings data. Earlier, you noted that the mean and median savings values were almost equal. The savings histogram was also roughly symmetrical and resembled a bell curve.

Now you want to use statistics to verify if the distribution is approximately symmetrical or "normally distributed". To do so, you will calculate the skew and kurtosis.

- Kurtosis: Calculated using KURT(). Measures how the tails behave: how strongly values are concentrated around the mean and how heavily they trail away from it in the tails.
- Skew: Calculated using SKEW(). Measures how symmetrical the distribution is. 0 means the distribution is exactly symmetrical. Positive values indicate a longer tail to the right of the mean, and negative values indicate a longer tail to the left.
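
The two measures can be sketched in Python as follows. This assumes the spreadsheet uses the usual bias-corrected sample formulas for SKEW() and KURT(), with KURT() reporting excess kurtosis (roughly 0 for a normal distribution); the function names and data below are illustrative, not part of the exercise.

```python
import math

def _mean_and_stdev(values):
    """Return the mean and the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return mean, s

def sample_skew(values):
    """Bias-corrected sample skewness; needs at least 3 observations."""
    n = len(values)
    mean, s = _mean_and_stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis; needs at least 4 observations."""
    n = len(values)
    mean, s = _mean_and_stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# Hypothetical samples, invented for illustration
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(sample_skew(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skew(right_tailed))
```

For the symmetric sample the skew is essentially zero, while the sample with one large value produces a positive skew, reflecting its longer right tail.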

- SEE 02.05

# Correlation between price and quantity sold

This data represents 100 online auctions selling multiple collectible items in each listing. Is there a correlation between price and quantity sold? Let's find out!

Recall that correlation, which you can calculate using CORREL(), ranges between -1 and 1.

- A correlation of 0 means there is no linear relationship between the variables.
- Positive correlation values indicate that as one variable increases, the other also increases.
- Negative correlation values signify that as one variable increases, the other decreases.

Here you will create a scatter plot (also known as a "scatter chart" or "XY graph") to visually represent the correlation. Do you think the price of an item has a positive or negative correlation to the total quantity sold? What do you expect the scatter plot pattern to look like? Let's find out in this exercise!
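
For readers who want to see what the correlation coefficient computes, here is a minimal Python sketch of the Pearson correlation, which is the statistic CORREL() returns. The function name and the price and quantity values are assumptions made up for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical auction data, invented for illustration
open_price = [5, 10, 15, 20, 25]
quantity_sold = [48, 40, 31, 22, 9]

r = pearson_correlation(open_price, quantity_sold)
print(f"r = {r:.3f}")
```

In this made-up sample the quantity falls as the price rises, so `r` comes out close to -1, and the matching scatter plot would slope downward from left to right.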

- SEE 02.07

# Correlation between seller rating and closing price

Now you will explore the correlation between the seller's rating and the auction closing price. Common sense would lead you to believe that the higher a seller's rating, the higher the closing price. If this were the case, you would see a "positive" correlation, with both variables moving in the same direction.

Remember that "correlation is not causation"! You may suspect that higher ratings improve trust, thereby causing prices to edge higher. However, other factors could be the real cause. Perhaps sellers with higher ratings are allowed to sell more expensive products, in which case the real cause is the expensive products, not trust. Be careful when exploring correlation and drawing conclusions.

- SEE 02.08

# Adding a trend line

Earlier, you saw how a high opening price has a negative correlation with the number of collectibles sold. The resulting scatterplot has a distinct pattern sloping down.

You then visualized the positive correlation between "Seller Rating" and "Close Price" as shown here. Let's now take this further by calculating the slope, intercept, and adding a trend line to better explore the relationship between the two columns.

Spreadsheets offer three functions for calculating the slope and y-intercept from two variables.

- `SLOPE()` - returns the slope of the trend line (linear regression), that is, how much the y variable changes for each one-unit increase in the x variable.
- `INTERCEPT()` - returns the value at which the trend line intersects the y-axis.
- `LINEST()` - calculates both the slope and the intercept for two variables using the least-squares method.
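
The least-squares calculation behind these functions can be sketched in Python as shown below. This is an illustration of simple linear regression under stated assumptions, not the spreadsheet's actual implementation; the function name `linear_fit` and the rating and price values are hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical seller ratings and closing prices, invented for illustration
seller_rating = [100, 200, 300, 400, 500]
close_price = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = linear_fit(seller_rating, close_price)
print(f"close_price ~ {slope:.3f} * seller_rating + {intercept:.2f}")
```

The slope says how much the closing price changes per additional rating point, and the intercept is the price the trend line predicts at a rating of 0.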

- SEE 02.09

# Bar chart of competitive counts

So far you've examined numeric data. What should you do to learn about categorical data? This sheet contains 150 online auctions with non-numeric data. One method to explore this type of variable is with a bar chart, also called a "column chart".

Here, the COUNTIF() function has been used to count the number of "1"s and "0"s. In DataCamp's Pivot Tables in Spreadsheets course, you can learn efficient ways to count data like this.

Your job in this exercise is to create two types of bar charts: stacked, and side-by-side. The difference is subtle but helps your audience in different ways.

- Stacked: helps you focus on the proportion of the total that belongs to a particular class. You will see at a glance how much of the total is made up of competitive online auctions.
- Side-by-side: lets the audience understand the values in relation to each other. You will see how similar the values of the classes are.
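
The numbers behind both chart types can be sketched in Python. The helper `count_if` is a hypothetical stand-in for an equality-based COUNTIF(), and the competitive flags are invented for illustration: the raw counts are what a side-by-side chart compares, while the shares of the total are what a stacked chart emphasizes.

```python
def count_if(values, criterion):
    """Count the values equal to the criterion (a simplified COUNTIF)."""
    return sum(1 for v in values if v == criterion)

# Hypothetical competitive flags (1 = competitive auction), invented for illustration
competitive = [1, 0, 1, 1, 0, 1, 0, 1]

ones = count_if(competitive, 1)
zeros = count_if(competitive, 0)

# Side-by-side view: compare the raw counts of each class
print(f"competitive: {ones}, non-competitive: {zeros}")

# Stacked view: each class as a share of the total
total = ones + zeros
print(f"competitive share: {ones / total:.1%}")
```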

- SEE 02.11

# Visualizing categories

We've added other categories to the data: Toys/Hobbies and Collectibles.

Notice that this Category column has no numeric information. This is because its values are attributes of the auctions: distinct classes, or factors. You can learn a lot about your sample's descriptive statistics by exploring these factors. For example, in patient data, gender could affect the average weight. Knowing whether the sample contains more "Females" would help you understand whether the average weight is affected.

In this new auction sample, closing price averages have been compared across categories. Are categories impacting this descriptive statistic? Time to find out!
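
Comparing an average across categories amounts to a grouped mean, sketched here in Python. The function name `mean_by_category` and the prices are assumptions invented for illustration; only the two category names come from the exercise text.

```python
from collections import defaultdict

def mean_by_category(rows):
    """Average the numeric value of (category, value) pairs per category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, value in rows:
        totals[category] += value
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# Hypothetical (category, closing price) pairs, invented for illustration
auctions = [
    ("Toys/Hobbies", 10.0),
    ("Collectibles", 30.0),
    ("Toys/Hobbies", 14.0),
    ("Collectibles", 26.0),
    ("Collectibles", 34.0),
]

averages = mean_by_category(auctions)
for category, avg in averages.items():
    print(f"{category}: {avg:.2f}")
```

If the per-category averages differ widely, the overall average closing price depends heavily on how many auctions of each category happen to be in the sample.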

- SEE 02.12