# Understanding Descriptive Statistics

Descriptive statistical analysis helps you understand your data. That matters in machine learning, where models make predictions from data, so you need to know what that data looks like first.

Statistics is the branch of mathematics that deals with collecting, organizing, analyzing, and interpreting data.

Within statistics, there are two main categories:
    
#### Descriptive Statistics
Descriptive statistics describe, present, summarize, and organize the data you actually have, whether a whole population or a sample, through numerical calculations, graphs, or tables.

#### Inferential Statistics
Inferential statistics rely on more complex mathematical calculations. They let you infer trends and make assumptions and predictions about a population based on a sample taken from it.

### Types of Measures
In this tutorial, you’ll learn about the following types of measures in descriptive statistics:

- **Central tendency** tells you about the centers of the data. Useful measures include the mean, median, and mode.

- **Variability** tells you about the spread of the data. Useful measures include variance and standard deviation.

- **Correlation or joint variability** tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and the correlation coefficient.

## Univariate Analysis

### Population and Samples

### Outliers

In [6]:
#Python Code begins
import pandas as pd
import numpy as np

In [7]:
ps = pd.Series([ 8. ,  1. ,  2.5,  np.nan,  4. , 28. ])

In [3]:
ps

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

### Percentiles

![image.png](attachment:image.png)

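The first, second, and third quartiles are the 25th, 50th, and 75th percentiles of the data.

The interquartile range (IQR) is a measure of statistical dispersion: the difference between the upper (75th percentile) and lower (25th percentile) quartiles.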

In [4]:
quartiles = ps.quantile([0.25, 0.75])
quartiles[0.75] - quartiles[0.25]

5.5
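
As a cross-check, the same interquartile range can be reproduced with NumPy. This is a minimal sketch that assumes the default linear interpolation on both sides; the names `values`, `q1`, `q3`, and `iqr` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

ps = pd.Series([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])

# Series.quantile skips NaN, but np.percentile does not, so drop NaN first.
values = ps.dropna().to_numpy()

# 25th and 75th percentiles with NumPy's default (linear) interpolation.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)
```

With five valid points, the 25th and 75th percentiles fall exactly on the second and fourth sorted values, so no interpolation between neighbours is needed here.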

The range describes the difference between the largest and the smallest points in your data.

In [16]:
ps.max()-ps.min()

27.0

### Measures of Central Tendency

#### Mean

![image.png](attachment:image.png)

In [5]:
ps.mean()

8.7
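
The value above is the sum of the non-missing items divided by how many of them there are. The sketch below spells that out and shows what happens when missing values are not skipped; `valid`, `manual_mean`, and `strict_mean` are illustrative names.

```python
import numpy as np
import pandas as pd

ps = pd.Series([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])

# Series.mean() ignores NaN by default (skipna=True).
valid = ps.dropna()
manual_mean = valid.sum() / valid.count()

# With skipna=False a single NaN makes the whole result NaN.
strict_mean = ps.mean(skipna=False)

print(manual_mean, strict_mean)
```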

#### Median

![image.png](attachment:image.png)

In [7]:
ps.median()

4.0

Consider what happens when you move the largest point in `ps`, the one with the value 28:

- If you increase its value (move it to the right), then the mean will rise, but the median won’t change.
- If you decrease its value (move it to the left), then the mean will drop, but the median will stay at 4 as long as the moving point is greater than or equal to 4.

You can compare the mean and median as one way to detect outliers and asymmetry in your data.
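
The effect of moving one point can be tried out directly. This is a small sketch rather than part of the original notebook: `base` holds the four points that stay fixed, and `mean_and_median` is a hypothetical helper that appends the moving point and recomputes both measures.

```python
import pandas as pd

base = [8.0, 1.0, 2.5, 4.0]  # the four points that stay fixed


def mean_and_median(moving_point):
    """Return (mean, median) of the fixed points plus one moving point."""
    s = pd.Series(base + [moving_point])
    return s.mean(), s.median()


for value in (28.0, 280.0, 4.0, 3.0):
    mean, median = mean_and_median(value)
    print(f"moving point = {value:>5}: mean = {mean:.2f}, median = {median}")
```

The mean follows the moving point everywhere, while the median only starts to change once the point drops below 4.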

#### Mode

In [10]:
ps.mode() 

0     1.0
1     2.5
2     4.0
3     8.0
4    28.0
dtype: float64

In [4]:
import pandas as pd
pd.Series([1,2,3,3,4,4,5,5,6,7]).mode()

0    3
1    4
2    5
dtype: int64
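
In the first example every value of `ps` occurs exactly once, so all of them tie for the highest count and `mode()` returns them all. In the second example 3, 4, and 5 each occur twice. The sketch below makes the tie rule visible by counting occurrences; `ps3` and `modes_by_count` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

ps = pd.Series([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])
ps3 = pd.Series([1, 2, 3, 3, 4, 4, 5, 5, 6, 7])


def modes_by_count(series):
    """Return every value that reaches the highest occurrence count."""
    counts = series.value_counts()  # NaN is excluded by default
    return sorted(counts[counts == counts.max()].index)


print(modes_by_count(ps))   # every value ties with a count of 1
print(modes_by_count(ps3))  # 3, 4, and 5 tie with a count of 2
```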

![image.png](attachment:image.png)

### Measures of Variability

- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges

#### Variance

![image.png](attachment:image.png)

In [8]:
ps.var() #squared units

123.19999999999999

In [10]:
np.sqrt(ps.var())

11.099549540409285

#### Standard Deviation

In [9]:
ps.std()

11.099549540409285
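
The results above are sample statistics: pandas divides the sum of squared deviations by `n - 1` (`ddof=1`) by default, whereas `np.var` and `np.std` divide by `n` (`ddof=0`) unless told otherwise. The sketch below computes both by hand; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

ps = pd.Series([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])
values = ps.dropna().to_numpy()
n = len(values)

# Squared distance of each point from the mean.
squared_deviations = (values - values.mean()) ** 2

sample_var = squared_deviations.sum() / (n - 1)  # matches ps.var()
population_var = squared_deviations.sum() / n    # matches np.var(values)
sample_std = np.sqrt(sample_var)                 # matches ps.std()

print(sample_var, population_var, sample_std)
```

If you mix pandas and NumPy in one analysis, passing `ddof` explicitly avoids comparing a sample variance with a population variance by accident.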

#### Skewness

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [13]:
ps.skew()

1.9470432273905924

In [12]:
ps2 = pd.Series([ -88. ,  1. ,  2.5,  np.nan,  4. , 28. ])

In [13]:
ps2.skew()

-1.8710822880161715
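
The positive value for `ps` reflects the long right tail created by 28, and the negative value for `ps2` reflects the long left tail created by -88. As a rough sketch, and assuming pandas reports the bias-adjusted sample skewness (which agrees with the output shown above), the number for `ps` can be rebuilt from the second and third central moments. The names `m2`, `m3`, `g1`, and `adjusted_skew` are illustrative.

```python
import numpy as np
import pandas as pd

ps = pd.Series([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])
values = ps.dropna().to_numpy()
n = len(values)

deviations = values - values.mean()
m2 = (deviations ** 2).mean()  # second central moment (divides by n)
m3 = (deviations ** 3).mean()  # third central moment (divides by n)

g1 = m3 / m2 ** 1.5                                  # unadjusted skewness
adjusted_skew = g1 * np.sqrt(n * (n - 1)) / (n - 2)  # small-sample adjustment

print(adjusted_skew, ps.skew())
```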

In [14]:
# describe() reports count, mean, std, min, quartiles, and max in one call
ps.describe()

count     5.00000
mean      8.70000
std      11.09955
min       1.00000
25%       2.50000
50%       4.00000
75%       8.00000
max      28.00000
dtype: float64
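
The result of `describe()` is itself a Series, so individual statistics can be read back by label. A short sketch, with `summary`, `iqr`, and `custom` as illustrative names; the `percentiles` argument replaces the default 25%/50%/75% selection.

```python
import numpy as np
import pandas as pd

ps = pd.Series([8.0, 1.0, 2.5, np.nan, 4.0, 28.0])

summary = ps.describe()

# Labels such as "25%" and "75%" index into the summary like any Series.
iqr = summary["75%"] - summary["25%"]

# Ask for the 10th and 90th percentiles instead of the default quartiles.
custom = ps.describe(percentiles=[0.1, 0.9])

print(summary["count"], iqr)
print(custom)
```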

## Bivariate Analysis

### Correlation Between a Pair of Variables

- **Positive correlation** exists when larger values of 𝑥 correspond to larger values of 𝑦 and vice versa.
- **Negative correlation** exists when larger values of 𝑥 correspond to smaller values of 𝑦 and vice versa.
- **Weak or no correlation** exists if there is no such apparent relationship.

![image.png](attachment:image.png)

In [16]:
x = pd.Series(range(-10, 11))
y = pd.Series([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])

In [17]:
x.corr(y)

0.8619500056316061

### Covariance

The sample covariance is a measure that quantifies the strength and direction of a relationship between a pair of variables:

- If the correlation is positive, then the covariance is positive, as well. A stronger relationship corresponds to a higher value of the covariance.
- If the correlation is negative, then the covariance is negative, as well. A stronger relationship corresponds to a lower (or higher absolute) value of the covariance.
- If the correlation is weak, then the covariance is close to zero.

In [18]:
x.cov(y)

19.95
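
One caveat when reading the bullets above: covariance carries the units of both variables, so its size is only comparable for data on the same scale, whereas the Pearson correlation coefficient divides the covariance by both standard deviations and always lies between -1 and 1. The sketch below computes both by hand and then rescales `y` to show the difference; the variable names are illustrative.

```python
import pandas as pd

x = pd.Series(range(-10, 11))
y = pd.Series([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])
n = len(x)

# Sample covariance: average product of deviations, dividing by n - 1.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# Pearson correlation: covariance normalized by both standard deviations.
manual_corr = manual_cov / (x.std() * y.std())

# Multiplying y by 100 scales the covariance by 100 but leaves r unchanged.
scaled_cov = x.cov(y * 100)
scaled_corr = x.corr(y * 100)

print(manual_cov, manual_corr, scaled_cov, scaled_corr)
```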

## Working With 2D Data

In [19]:
a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])

In [20]:
row_names = ['first', 'second', 'third', 'fourth', 'fifth']
col_names = ['A', 'B', 'C']

In [21]:
df = pd.DataFrame(a, index=row_names, columns=col_names)

In [22]:
df

|        | A  | B  | C |
|--------|----|----|---|
| first  | 1  | 1  | 1 |
| second | 2  | 3  | 1 |
| third  | 4  | 9  | 2 |
| fourth | 8  | 27 | 4 |
| fifth  | 16 | 1  | 1 |


In [23]:
df.mean()

A    6.2
B    8.2
C    1.8
dtype: float64

In [25]:
df.var()

A     37.2
B    121.2
C      1.7
dtype: float64

In [24]:
df.mean(axis=1)

first      1.0
second     2.0
third      5.0
fourth    13.0
fifth      6.0
dtype: float64

In [26]:
df.var(axis=1)

first       0.0
second      1.0
third      13.0
fourth    151.0
fifth      75.0
dtype: float64
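
The `axis` argument decides which direction is collapsed. With the default `axis=0` the calculation runs down the rows and yields one value per column; with `axis=1` it runs across the columns and yields one value per row. The sketch below rebuilds the same DataFrame and uses the string aliases; `col_means` and `row_means` are illustrative names.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'C'])

col_means = df.mean(axis="index")    # same as axis=0 and as df.mean()
row_means = df.mean(axis="columns")  # same as axis=1

print(col_means)
print(row_means)
```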

In [None]:
df.values  # the underlying NumPy array; df.to_numpy() is the form the pandas docs recommend

In [27]:
df.corr()

|   | A        | B        | C        |
|---|----------|----------|----------|
| A | 1.0      | 0.077443 | 0.100599 |
| B | 0.077443 | 1.0      | 0.996231 |
| C | 0.100599 | 0.996231 | 1.0      |
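
Each entry of the correlation matrix is the Pearson correlation of one pair of columns, which is why the matrix is symmetric with ones on the diagonal. A brief sketch that reads one entry back and compares it with the pairwise call used earlier; `corr_matrix` and `b_c` are illustrative names.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 1, 1],
              [2, 3, 1],
              [4, 9, 2],
              [8, 27, 4],
              [16, 1, 1]])
df = pd.DataFrame(a,
                  index=['first', 'second', 'third', 'fourth', 'fifth'],
                  columns=['A', 'B', 'C'])

corr_matrix = df.corr()  # Pearson correlation by default

# Look up a single pair by row and column label.
b_c = corr_matrix.loc["B", "C"]

print(b_c, df["B"].corr(df["C"]))
```

Columns B and C move almost in lockstep here, while A is nearly uncorrelated with either of them.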


In [28]:
df.describe()

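|       | A       | B         | C       |
|-------|---------|-----------|---------|
| count | 5.0     | 5.0       | 5.0     |
| mean  | 6.2     | 8.2       | 1.8     |
| std   | 6.09918 | 11.009087 | 1.30384 |
| min   | 1.0     | 1.0       | 1.0     |
| 25%   | 2.0     | 1.0       | 1.0     |
| 50%   | 4.0     | 3.0       | 1.0     |
| 75%   | 8.0     | 9.0       | 2.0     |
| max   | 16.0    | 27.0      | 4.0     |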
