# Descriptive Statistics

> You can check further content in this Deepnote notebook [link](https://deepnote.com/workspace/sebastian-minaya-a67e42f1-471f-4ef3-b708-827621c005a4/project/curso-estadistica-descriptiva-2021-Duplicate-48d38894-4504-44da-ab01-6eeaf7b9228d/) as well as more notes in this [link](https://deepnote.com/@anthonymanotoa/Apuntes-de-Estadistica-Descriptiva-cfa882b6-c07f-43fe-9901-1c2e471ce120).

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.

## What is the difference between descriptive and inferential statistics?

Descriptive statistics are very different from inferential statistics. With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. 

For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. 

Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.

## Statistics for Data Science

### Statistics for Data Ingestion and Data Wrangling

**Data Ingestion** is the process of obtaining, importing, and processing data for later use or storage in a database. The data ingestion process involves prioritizing data sources, acquiring data, making sure that data is usable, and finally, moving data to storage.

**Data Wrangling** is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.

### Statistics for Data Analysis and Data Visualization

**Data Analysis** is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.

**Data Visualization** is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization.

### Statistics for Data Modeling and Machine Learning

**Data Modeling** is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of data objects, the associations between different data objects, and the rules that govern them. Data modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies on the data.

**Machine Learning** is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

## Data Types

### Categorical Data

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Categorical data can be further broken into two types: nominal and ordinal.

- Nominal variables have two or more categories without having an intrinsic order.
- Ordinal variables have two or more categories just like nominal variables, but there is a clear ordering of the categories.

### Numerical Data

Numerical data are values that represent a count or measurement. Numerical data can be further broken into two types: discrete and continuous.

- Discrete variables represent counts (e.g. the number of objects in a collection).
- Continuous variables represent measurable amounts (e.g. water volume or weight).

## Central Tendency Measures

### Mean

The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if a data set has $n$ values $x_1, x_2, x_3, \dots, x_n$, the sample mean, usually denoted by $\bar{x}$, is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

### Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. To find the median, place the numbers in value order and find the middle number. If there are two middle numbers, you average them.

### Mode

The mode is the most frequent score in our data set. It is the only measure of central tendency that can be used with nominal (non-numeric, unordered) data: the mean requires numeric data, and the median requires values that can at least be put in order.
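
As a minimal sketch, the three measures above can be computed with Python's standard `statistics` module. The `scores` and `colors` lists are hypothetical data made up for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical sample (illustrative data, not taken from the notes).
scores = [2, 3, 3, 5, 7, 10]

sample_mean = mean(scores)        # sum of the values divided by their count
sample_median = median(scores)    # average of the two middle values (3 and 5)
sample_modes = multimode(scores)  # most frequent value(s), returned as a list

# The mode also works on nominal (non-numeric) labels.
colors = ["red", "blue", "blue", "green"]
color_modes = multimode(colors)

print(sample_mean, sample_median, sample_modes, color_modes)
```

`multimode` returns a list because a data set can have more than one most frequent value.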

#### Frequency Table and Histogram

A frequency table is a table that represents the number of occurrences of every unique value in the variable. The frequency table below shows the results of a survey of 100 people who were asked to name their favorite color.

| Color | Frequency |
|-------|-----------|
| Red   | 10        |
| Blue  | 25        |
| Green | 30        |
| Yellow| 15        |
| Orange| 20        |

A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable. It is a kind of bar graph. 

To construct a histogram, the first step is to "bin" (or "bucket") the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.

![Histogram](https://upload.wikimedia.org/wikipedia/commons/8/8e/Histogram_example.svg)
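
The counting and binning steps described above can be sketched in plain Python. The helper `histogram_counts`, the equal-width bins, and both data lists are assumptions made for this illustration; plotting libraries typically offer their own binning options.

```python
from collections import Counter

# Frequency table: count the occurrences of each unique value.
# Hypothetical survey answers (illustrative data).
answers = ["blue", "green", "blue", "red", "green", "green"]
frequency_table = Counter(answers)


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot build bins")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements split into 3 bins of width 3: [1, 4), [4, 7), [7, 10].
measurements = [1, 2, 2, 3, 5, 6, 8, 9, 10]
bin_counts = histogram_counts(measurements, 3)

print(frequency_table)
print(bin_counts)
```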

## The Metaphor of Bill Gates in a Bar

The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. In the following example, we will use the metaphor of Bill Gates in a bar to help us remember which measure of central tendency is most appropriate under certain conditions.

Suppose 10 people are sitting in a bar and Bill Gates walks in. The average net worth of everyone in the bar jumps to something like $100 million, even though none of the other 10 people became any richer. When Bill Gates leaves and walks into another bar with 10 people, the average there jumps in the same way, while the first bar falls back to its earlier, far lower average.

The mean is heavily influenced by outliers such as Bill Gates and is therefore not a good measure of central tendency for skewed data (i.e., data whose distribution has a long tail to the left or right of the center).

The median is less affected by outliers and skewed data, which makes it the most appropriate measure of central tendency for skewed data.

The mode is the most appropriate measure for nominal (non-numeric) data, because neither the mean nor the median can be computed for unordered categories.
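
The effect of a single extreme value on the mean and the median can be checked with a short sketch. All net worth figures below are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical net worths in dollars; every figure here is illustrative.
regulars = [50_000] * 10
with_billionaire = regulars + [100_000_000_000]

mean_before = mean(regulars)
median_before = median(regulars)

mean_after = mean(with_billionaire)      # pulled far upward by a single value
median_after = median(with_billionaire)  # still the net worth of a regular

print(mean_before, median_before)
print(mean_after, median_after)
```

The mean moves from tens of thousands to billions of dollars, while the median does not change.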

## Dispersion Measures

### Range

The range is the simplest measure of dispersion. It is simply the difference between the largest and smallest data values. The range is easy to compute but is not a good measure of dispersion because it is overly influenced by extreme values.

$$ \text{Range} = \max(x) - \min(x) $$

### Interquartile Range

The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

- Q1 is the "middle" value in the first half of the rank-ordered data set.
- Q2 is the median value in the set.
- Q3 is the "middle" value in the second half of the rank-ordered data set.

The interquartile range is equal to Q3 minus Q1. The IQR is the range covered by the middle 50% of the data values. The IQR is a better measure of dispersion than the range as it is not affected by outliers as much as the range (because it uses the 25th and 75th percentile rather than the minimum and maximum values).

$$ \text{IQR} = Q_3 - Q_1 $$
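
A minimal sketch of the range and the IQR, following the "middle value of each half" description of the quartiles given above. The helper `quartiles` and the data are hypothetical, and leaving the median out of both halves when the number of values is odd is an assumption. Other conventions exist (for example the interpolation used by `statistics.quantiles`) and can give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # first half of the ordered data
    upper = ordered[half + n % 2:]   # second half; skips the median when n is odd
    return median(lower), median(ordered), median(upper)


# Hypothetical data with one extreme value (40).
data = [1, 3, 4, 6, 7, 9, 12, 40]

q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)  # driven entirely by the two extremes
iqr = q3 - q1                       # spread of the middle 50% of the values

print(q1, q2, q3, data_range, iqr)
```

Here the single extreme value makes the range large, while the IQR reflects only the central half of the data.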

### Variance

The variance is a measure of variability that tells you the degree of spread in your data set: the more spread out the data, the larger the variance. It is calculated by taking the differences between each value in the data set and the mean, squaring those differences to make them positive, and dividing the sum of the resulting squares by the number of values in the set. The formula below is the population variance; when estimating from a sample, the sum is usually divided by $n - 1$ instead of $n$.

$$ \sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} $$

### Standard Deviation

The standard deviation is a measure of variability, calculated by taking the square root of the variance. It is widely used in statistics and probability theory because it is expressed in the same units as the data.

It shows how much variation or "dispersion" exists from the average (mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean; a high standard deviation indicates that the data points are spread out over a large range of values.

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}} $$
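
The two formulas above (dividing by $n$) can be sketched directly and compared with `statistics.pvariance` and `statistics.pstdev`, which implement the same population versions. The data are hypothetical.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Hypothetical values (illustrative data).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(data) / len(data)
variance = sum((x - mu) ** 2 for x in data) / len(data)  # average squared deviation
std_dev = sqrt(variance)                                 # back in the units of the data

# The standard library gives the same population values.
print(variance, pvariance(data))
print(std_dev, pstdev(data))
```

For sample estimates that divide by $n - 1$, the same module provides `variance` and `stdev`.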

#### Note

To flag outliers, treat Q1 - 1.5 * IQR as the minimum and Q3 + 1.5 * IQR as the maximum expected value. Alternatively, typically for data that are roughly symmetric, use mean - 3 * standard deviation as the minimum and mean + 3 * standard deviation as the maximum.
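
Both rules from the note can be sketched as follows. The data are hypothetical, and the quartiles come from `statistics.quantiles` with the `"inclusive"` method, which is one of several quartile conventions.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical values with one extreme point (40).
data = [1, 3, 4, 6, 7, 9, 12, 40]

# Rule 1: IQR fences.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
iqr_low, iqr_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < iqr_low or x > iqr_high]

# Rule 2: three standard deviations around the mean.
mu, sigma = mean(data), pstdev(data)
sigma_low, sigma_high = mu - 3 * sigma, mu + 3 * sigma
sigma_outliers = [x for x in data if x < sigma_low or x > sigma_high]

print(iqr_outliers)    # values outside the IQR fences
print(sigma_outliers)  # values outside mean +/- 3 standard deviations
```

On this small data set the two rules disagree: the extreme value inflates the standard deviation enough to stay inside the three-sigma limits, while the IQR fences flag it.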

## Skewness and Kurtosis

### Skewness

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. In cases where one tail is long but the other tail is fat, skewness does not obey a simple rule. 

For example, a zero value means that the tails on both sides of the mean balance out overall; this is the case for a symmetric distribution, but can also be true for an asymmetric distribution where one tail is long and thin, and the other is short but fat. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.


![Skewness](https://upload.wikimedia.org/wikipedia/commons/f/f8/Negative_and_positive_skew_diagrams_%28English%29.svg)

### Kurtosis

In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. 

Depending on the particular measure of kurtosis that is used, there are various interpretations of kurtosis, and of how particular measures should be interpreted; these are primarily peakedness (width of peak), tail weight, and lack of shoulders (distribution primarily peak and tails, not in-between).


![Kurtosis](https://upload.wikimedia.org/wikipedia/commons/3/33/Standard_symmetric_pdfs.svg)
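
As noted above, there are several ways of quantifying skewness and kurtosis. The sketch below uses one common choice, the moment-based coefficients (third and fourth central moments divided by powers of the second), and reports excess kurtosis, i.e. kurtosis minus 3. The function names and data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** k for x in values) / len(values)


def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data sets (illustrative only).
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]
left_tailed = [-x for x in right_tailed]

print(skewness(symmetric))      # zero for symmetric data
print(skewness(right_tailed))   # positive: tail on the right
print(skewness(left_tailed))    # negative: tail on the left
print(excess_kurtosis(symmetric))
```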

## Numerical Data Processing Pipelines

### Linear Scaling

Linear scaling is a method of normalizing data for processing by a neural network. Some methods for this are:

- Min-Max Scaling: This is the simplest method and consists of rescaling each feature to a fixed range, usually [0, 1] or [-1, 1]. Selecting the target range depends on the nature of the data. This method is sensitive to outliers, because a single extreme value sets the minimum or maximum. For the range [0, 1], the calculation would be:

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

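- Clipping: This method consists of capping the data at a chosen minimum and maximum value, so that extreme values are replaced by the nearest bound. It is typically used to limit the effect of outliers, often together with another scaling method. With chosen bounds $x_{lower}$ and $x_{upper}$, the calculation would be:

$$ x' = \max(\min(x, x_{upper}), x_{lower}) $$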

- Z-Score: This method consists of rescaling the data to have a mean of 0 and a standard deviation of 1. This method is less sensitive to outliers than min-max scaling, although extreme values still affect the mean and the standard deviation. The calculation done would be:

$$ x' = \frac{x - \mu}{\sigma} $$
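
The three linear methods above can be sketched in plain Python. The function names, the data, and the clipping bounds (3 and 8) are hypothetical; in practice the bounds are chosen by the analyst. The z-score uses the population standard deviation, matching the formula with $\sigma$.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(x - low) / (high - low) for x in values]


def clip(values, lower, upper):
    """Replace values outside [lower, upper] by the nearest bound."""
    return [max(min(x, upper), lower) for x in values]


def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical values (illustrative data).
data = [2, 4, 4, 4, 5, 5, 7, 9]

scaled = min_max_scale(data)
clipped = clip(data, 3, 8)
standardized = z_score(data)

print(scaled)
print(clipped)
print(standardized)
```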

### Non-Linear Scaling

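Non-linear scaling normalizes data for processing by a neural network by applying a non-linear function. It is typically used when the data are strongly skewed, so that a linear rescaling would leave most values crowded together. Some methods for this are: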

- Log Scaling: This method consists of rescaling the data by applying a logarithmic function. It compresses large values, which reduces the influence of high outliers, and it requires strictly positive values. The calculation done would be:

$$ x' = \log(x) $$

- Power Transformation: This method consists of rescaling the data by applying a power function. With an exponent between 0 and 1 (for example, a square root), large values are compressed, which reduces the influence of high outliers. The calculation done would be:

$$ x' = x^p $$

- Sigmoid: This method consists of rescaling the data by applying a sigmoid function, which maps any real value into the interval (0, 1). Extreme values saturate near 0 or 1, which limits the influence of outliers. The calculation done would be:

$$ x' = \frac{1}{1 + e^{-x}} $$

- Hyperbolic Tangent: This method consists of rescaling the data by applying a hyperbolic tangent function, which maps any real value into the interval (-1, 1). Extreme values saturate near -1 or 1, which limits the influence of outliers. The calculation done would be:

$$ x' = \tanh(x) $$

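- Softmax: This method consists of rescaling the data by applying a softmax function, which maps a vector of values to positive numbers that sum to 1. Because of the exponential, the largest values dominate the output. For the $j$-th value, the calculation done would be:

$$ x'_j = \frac{e^{x_j}}{\sum_{i=1}^{n}e^{x_i}} $$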
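
The non-linear transformations listed above can be sketched with the standard `math` module. The function names are hypothetical. The logarithm here is the natural logarithm, which is an assumption since the notes do not specify a base, and the softmax subtracts the maximum before exponentiating, a common trick that avoids overflow without changing the result.

```python
import math


def log_scale(values):
    """Natural logarithm of each value; requires x > 0."""
    return [math.log(x) for x in values]


def power_transform(values, p):
    """Raise each value to the power p."""
    return [x ** p for x in values]


def sigmoid(values):
    """Map each value into the interval (0, 1)."""
    return [1 / (1 + math.exp(-x)) for x in values]


def tanh_scale(values):
    """Map each value into the interval (-1, 1)."""
    return [math.tanh(x) for x in values]


def softmax(values):
    """Map a vector to positive numbers that sum to 1."""
    peak = max(values)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values (illustrative data).
data = [1.0, 2.0, 3.0]
print(log_scale(data))
print(power_transform(data, 0.5))
print(sigmoid(data))
print(tanh_scale(data))
print(softmax(data))
```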

## Categorical Data Processing Pipelines

### Dummy

Dummy coding is a way of representing categorical data in a statistical model. If there are k categories, then k-1 dummy variables are needed to uniquely represent all possible values. In dummy coding, the reference category is coded with all zeros. The reference category is typically the category with the largest sample size. The other categories are coded with a single one in the column for that category.

### One-Hot

One-hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.
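
Both encodings can be sketched in plain Python. The helpers `one_hot` and `dummy`, the alphabetical column order, and the `colors` data are hypothetical choices for this illustration; data libraries provide their own encoders.

```python
def one_hot(labels):
    """Encode each label as a binary vector with a single 1 (k columns)."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}  # category -> integer
    vectors = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        vectors.append(row)
    return categories, vectors


def dummy(labels, reference):
    """Encode labels with k - 1 columns; the reference category is all zeros."""
    categories = [c for c in sorted(set(labels)) if c != reference]
    vectors = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, vectors


# Hypothetical categorical variable; "green" is the most frequent category.
colors = ["red", "green", "blue", "green"]

one_hot_columns, one_hot_rows = one_hot(colors)
dummy_columns, dummy_rows = dummy(colors, reference="green")

print(one_hot_columns, one_hot_rows)
print(dummy_columns, dummy_rows)
```

With three categories, one-hot encoding produces three columns and dummy coding produces two, with every "green" row coded as all zeros.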

## Covariance

Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behavior, the covariance is positive. 

In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. 

The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.

$$ cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n} $$
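
A minimal sketch of the formula above (dividing by $n$), using hypothetical data in which one variable rises with `hours` and another falls.

```python
def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


# Hypothetical paired observations (illustrative data).
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]     # grows with hours
score_down = [10, 8, 6, 4, 2]   # shrinks as hours grow

cov_up = covariance(hours, score_up)
cov_down = covariance(hours, score_down)

print(cov_up, cov_down)
```

The sign shows the direction of the relationship; the magnitude depends on the units of both variables.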

## Correlation

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It's a common tool for describing simple relationships without making a statement about cause and effect. Correlation coefficients are used to measure the strength of the relationship between two variables. The statistical measure indicates both the strength of the linear relationship and the direction (positive or negative) as shown below.

![Correlation](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)

The correlation coefficient is a value that indicates the strength of the relationship between variables. The coefficient can take any values from -1 to 1. The interpretations of the values are:

- -1: Perfect negative linear correlation
- +1: Perfect positive linear correlation
- 0: No linear correlation

The correlation coefficient is calculated as follows:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

## Pearson Correlation Coefficient

The Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences as a measure of the strength of linear dependence between two variables. It is defined as the covariance of the two variables divided by the product of their standard deviations.

$$ \rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} $$
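
The definition above, covariance divided by the product of the standard deviations, can be sketched as follows. The function name and the data are hypothetical, and population (divide by $n$) versions are used throughout.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    sigma_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical paired observations (illustrative data).
xs = [1, 2, 3, 4, 5]
linear = [2, 4, 6, 8, 10]       # exact linear relationship
squared = [1, 4, 9, 16, 25]     # increasing, but not linear

r_linear = pearson(xs, linear)
r_squared = pearson(xs, squared)

print(r_linear, r_squared)
```

The curved relationship still gives a high coefficient, but below 1, because Pearson measures only the linear part of the association.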

## Spearman Correlation Coefficient

The Spearman correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

$$ \rho = 1 - \frac{6\sum_{i=1}^{n}d_i^2}{n(n^2 - 1)} $$

where $d_i$ is the difference between the two ranks of each observation (the formula is exact only when there are no tied ranks):

$$ d_i = rank(X_i) - rank(Y_i) $$
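
A minimal sketch of the rank-difference formula above. The helpers `ranks` and `spearman` and the data are hypothetical, and the sketch assumes there are no tied values.

```python
def ranks(values):
    """Rank 1 goes to the smallest value; assumes all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman coefficient from squared rank differences (no ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired observations (illustrative data).
xs = [1, 2, 3, 4, 5]
cubes = [1, 8, 27, 64, 125]   # non-linear but strictly increasing
shuffled = [2, 1, 4, 3, 5]    # mostly increasing

rho_cubes = spearman(xs, cubes)
rho_shuffled = spearman(xs, shuffled)

print(rho_cubes, rho_shuffled)
```

The strictly increasing but non-linear relationship reaches the maximum value, since only the ordering of the values matters.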

## Correlation Matrix

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used as a way to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

![Correlation Matrix](https://cdn-dfnaj.nitrocdn.com/xxeFXDnBIOflfPsgwjDLywIQwPChAOzV/assets/images/optimized/rev-cd2671e/www.displayr.com/wp-content/uploads/2018/07/rsz_correlation_matrix_3.png)
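
A correlation matrix can be sketched by computing the correlation coefficient for every pair of variables. The column names and values below are hypothetical, and the nested-dictionary layout is just one convenient representation.

```python
from math import sqrt


def pearson(xs, ys):
    """Correlation coefficient r for two equally long lists."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - x_bar) ** 2 for x in xs)) * sqrt(
        sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator


def correlation_matrix(columns):
    """Return {row_name: {column_name: r}} for every pair of variables."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical data set with three variables (illustrative data).
table = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 77, 90],
    "errors": [9, 7, 6, 3, 1],
}

matrix = correlation_matrix(table)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

The diagonal is always 1 (each variable with itself) and the matrix is symmetric.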

## PCA: Principal Component Analysis

Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables (the principal components) that successively maximize variance, so that the first few components preserve most of the variation in the dataset.
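
As a rough illustration, PCA on two variables can be sketched in plain Python: center the data, build the 2x2 covariance matrix, and take its eigenvalues and leading eigenvector in closed form. The function `pca_2d` and the points are hypothetical, and real projects would normally use a numerical library that handles any number of dimensions.

```python
from math import sqrt


def pca_2d(points):
    """Return (eigenvalues, first principal axis, scores) for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Population covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + spread, mid - spread

    # Eigenvector of the largest eigenvalue: direction of maximum variance.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    axis = (vx / norm, vy / norm)

    # Project every centered point onto the first principal axis.
    scores = [x * axis[0] + y * axis[1] for x, y in centered]
    return (lam1, lam2), axis, scores


# Hypothetical points lying exactly on the line y = 2x (illustrative data).
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
(lam1, lam2), axis, scores = pca_2d(points)
explained = lam1 / (lam1 + lam2)  # share of total variance kept by one component

print(lam1, lam2, axis, explained)
```

Because these points lie on a single line, one component keeps all of the variance, so the two variables can be replaced by one without losing information.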