# Descriptive Statistics

> You can check further content in this Deepnote [link](https://deepnote.com/workspace/sebastian-minaya-a67e42f1-471f-4ef3-b708-827621c005a4/project/curso-estadistica-descriptiva-2021-Duplicate-48d38894-4504-44da-ab01-6eeaf7b9228d/)

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data.

## What is the difference between descriptive and inferential statistics?

Descriptive statistics are very different from inferential statistics. With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.

## Descriptive Statistics for Data Science

### Statistics for Data Ingestion and Data Wrangling

**Data Ingestion** is the process of obtaining, importing, and processing data for later use or storage in a database. The data ingestion process involves prioritizing data sources, acquiring data, making sure that data is usable, and finally, moving data to storage.

**Data Wrangling** is the process of cleaning, structuring, and enriching raw data into a desired format so that better decisions can be made in less time.

### Statistics for Data Analysis and Data Visualization

**Data Analysis** is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.

**Data Visualization** is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization.

### Statistics for Data Modeling and Machine Learning

**Data Modeling** is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of the data objects, the associations between them, and the rules that govern them. Data modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies on the data.

**Machine Learning** is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

## Data Types

### Categorical Data

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Categorical data can be further broken into two types: nominal and ordinal.

- Nominal variables have two or more categories without having an intrinsic order.
- Ordinal variables have two or more categories just like nominal variables, but there is a clear ordering of the categories.
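
The practical difference between the two kinds of categorical data is that ordinal values can be sorted while nominal values can only be counted. The sketch below is a minimal illustration under assumed data: the size labels, their order, and the answers are all hypothetical.

```python
# Hypothetical ordinal scale: the order of the labels is an assumption
# supplied by us, it is not something Python can infer from the strings.
SIZE_ORDER = ["small", "medium", "large"]
rank = {label: position for position, label in enumerate(SIZE_ORDER)}

# Hypothetical survey answers.
answers = ["large", "small", "medium", "small", "large", "medium", "medium"]

# Because the categories are ordered, the answers can be sorted by rank
# and a middle value can be picked out.
ordered = sorted(answers, key=lambda label: rank[label])
middle = ordered[len(ordered) // 2]

print(ordered)
print(middle)  # medium
```

With a nominal variable such as eye color there is no meaningful `SIZE_ORDER` to define, so sorting the values would not carry any information.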

### Numerical Data

Numerical data are values that represent a count or measurement. Numerical data can be further broken into two types: discrete and continuous.

- Discrete variables represent counts (e.g. the number of objects in a collection).
- Continuous variables represent measurable amounts (e.g. water volume or weight).

## Central Tendency Measures

### Mean

The mean (or average) is the most popular and best-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data (see the Data Types section above). The mean is equal to the sum of all the values in the data set divided by the number of values. So, if a data set has n values x1, x2, x3, …, xn, the sample mean, usually denoted by x̄, is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
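
As a quick check of the formula, the following sketch computes the mean by hand and compares it with `statistics.mean` from the Python standard library. The sample values are made up for illustration.

```python
from statistics import mean

# Hypothetical sample values, chosen only for illustration.
values = [2.0, 4.0, 4.0, 5.0, 10.0]

# Sum of all the values divided by the number of values.
manual_mean = sum(values) / len(values)

print(manual_mean)   # 5.0
print(mean(values))  # 5.0
```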

### Median

The median is the middle value of a data set that has been arranged in order of magnitude. It is less affected by outliers and skewed data than the mean. To find the median, sort the values and take the middle one. If the number of values is even, there are two middle values, and the median is their average.
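
The procedure can be sketched in a few lines. `median_by_hand` is a hypothetical helper written for this example, and the samples are made up; in practice `statistics.median` from the standard library does the same job.

```python
from statistics import median


def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: one middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 1, 3, 9, 5]   # hypothetical data
even_sample = [7, 1, 3, 9]     # hypothetical data

print(median_by_hand(odd_sample))   # 5
print(median_by_hand(even_sample))  # 5.0
```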

### Mode

The mode is the most frequent value in a data set. It is the only measure of central tendency that can be used with nominal (unordered categorical) data, because it only requires counting how often each value occurs. The mean requires numeric data, and the median requires values that can at least be put in order, so it applies to numeric and ordinal data but not to nominal data.
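
Because the mode only needs counts, it works directly on labels. The sketch below uses `statistics.mode` and `statistics.multimode` (available from Python 3.8) on made-up data; the second sample shows that a data set can have more than one mode.

```python
from statistics import mode, multimode

# Hypothetical nominal data.
colors = ["blue", "green", "blue", "red", "green", "blue"]
print(mode(colors))  # blue

# When several values tie for most frequent, multimode returns all of them.
tied = ["a", "b", "a", "b", "c"]
print(multimode(tied))  # ['a', 'b']
```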

#### Frequency Table and Histogram

A frequency table is a table that represents the number of occurrences of every unique value in the variable. The frequency table below shows the results of a survey of 100 people who were asked to name their favorite color.

| Color | Frequency |
|-------|-----------|
| Red   | 10        |
| Blue  | 25        |
| Green | 30        |
| Yellow| 15        |
| Orange| 20        |
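
A frequency table like this one can be produced with `collections.Counter`. In the sketch below the raw responses are rebuilt from the counts in the table, since the original survey answers are not available; the relative frequency column is an addition for illustration.

```python
from collections import Counter

# Counts taken from the table above; the individual responses are
# reconstructed from them because the raw survey data is not given.
table = {"Red": 10, "Blue": 25, "Green": 30, "Yellow": 15, "Orange": 20}
responses = [color for color, count in table.items() for _ in range(count)]

# Count the occurrences of every unique value.
frequencies = Counter(responses)
total = sum(frequencies.values())

# Print the table from most to least frequent, with relative frequencies.
for color, count in frequencies.most_common():
    print(f"{color:<7} {count:>3} {count / total:.0%}")

# The most frequent value is the mode of the variable.
most_common_color, _ = frequencies.most_common(1)[0]
print(most_common_color)  # Green
```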

A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable. It is a kind of bar graph. To construct a histogram, the first step is to "bin" (or "bucket") the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.

![Histogram](https://upload.wikimedia.org/wikipedia/commons/8/8e/Histogram_example.svg)
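
The binning step can be sketched without any plotting library. `histogram_counts` is a hypothetical helper that uses equal-width bins, which is only one of the possible choices, and the heights are made-up values.

```python
def histogram_counts(values, num_bins):
    """Split the range of values into equal-width bins and count each bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range cannot be binned")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for value in values:
        # The maximum value would fall just past the last bin,
        # so it is clamped into the last bin.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical heights in centimeters.
heights = [150, 152, 155, 160, 161, 163, 165, 170, 171, 180]
edges, counts = histogram_counts(heights, 3)

# Draw a simple text histogram, one row per bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * count}")
```

Each bin is closed on the left and open on the right, except the last one, which also includes the maximum value.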

## The Metaphor of Bill Gates in a Bar

The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. In the following example, we will use the metaphor of Bill Gates in a bar to help us remember which measure of central tendency is most appropriate under certain conditions.

Imagine a bar with ten patrons whose net worths are all fairly ordinary, say around $50,000 each (an illustrative figure). The mean and the median net worth in the bar are then both close to $50,000. Now Bill Gates, with a net worth on the order of $100 billion, walks in. Nobody else's wealth has changed, yet the mean net worth in the bar jumps to billions of dollars, a figure that describes none of the eleven people present. The median, in contrast, is still about $50,000.

The mean is heavily influenced by outliers such as Bill Gates, so it is a poor measure of central tendency for skewed data (data whose distribution has a long tail to the left or to the right). The median is far less affected by outliers and is usually the most appropriate measure for skewed numerical data. The mode remains the measure to use for nominal (non-numeric) data, where the mean and the median cannot be computed.
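
The effect of a single extreme value on the mean and the median can be checked numerically. All figures in the sketch below are assumptions made up for illustration, including the rounded net worth used for Bill Gates.

```python
from statistics import mean, median

# Hypothetical net worths (in USD) of ten ordinary patrons.
patrons = [30_000, 40_000, 45_000, 50_000, 50_000,
           55_000, 60_000, 70_000, 80_000, 120_000]

# Illustrative round figure, not an actual net worth.
bill_gates = 100_000_000_000

with_gates = patrons + [bill_gates]

print("mean before:  ", mean(patrons))
print("median before:", median(patrons))
print("mean after:   ", round(mean(with_gates)))
print("median after: ", median(with_gates))
```

Under these assumptions the mean moves from tens of thousands to billions of dollars, while the median shifts only from one patron's net worth to a neighboring one.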