# Basic Math: minimum data statistics

In [None]:
import numpy as np


In [None]:
heights = np.array([189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178, 183, 193, 178, 173,
 174, 183, 183, 168, 170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183,
 177, 185, 188, 188, 182, 185])
heights

In [None]:
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())

Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values. We may also wish to compute quantiles:

In [None]:

print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))

We see that the median height of US presidents is 182 cm.
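
The three quantiles above can also be requested in one call, since `np.percentile` accepts a sequence of percentages. The following is a minimal sketch on a small made-up array (`data` is an illustrative name, not the heights data); it also checks that the 50th percentile coincides with the median.

```python
import numpy as np

# Illustrative data, not the presidents' heights
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Passing a list returns one value per requested percentile
q25, q50, q75 = np.percentile(data, [25, 50, 75])

print("25th, 50th, 75th percentiles:", q25, q50, q75)
print("Median equals 50th percentile:", q50 == np.median(data))
```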

Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


In [None]:
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

In [None]:
import seaborn; seaborn.set()  # set plot style

In [None]:
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

What is the difference between the mean and the median?

* The mean is the arithmetic average of a set of numbers, or distribution.
    - The mean is computed by adding up all the values and dividing that sum by the number of values.
    - The mean is a good summary for roughly symmetric distributions, such as the normal distribution.
    - The mean is not robust: a few outliers can shift it substantially.
* The median is the numeric value separating the higher half of a sample, a population, or a probability distribution from the lower half.
    - The median is the value at the exact middle of the sorted values. It can be computed by listing all numbers in ascending order and locating the one in the centre; with an even number of values, it is the average of the two middle ones.
    - The median is better suited to skewed distributions as a measure of central tendency, since it is far more robust to outliers.
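
To make the robustness point concrete, here is a small sketch on a hypothetical sample (the values and the name `sample` are made up for illustration). It compares the mean and median with and without a single extreme value.

```python
import numpy as np

# Hypothetical sample: five similar values and one extreme value
sample = np.array([30, 32, 35, 38, 40, 1000])

sample_mean = sample.mean()        # pulled up by the extreme value
sample_median = np.median(sample)  # average of the two middle sorted values

print("Mean:  ", sample_mean)
print("Median:", sample_median)

# Dropping the extreme value changes the mean a lot, the median only a little
trimmed = sample[:-1]
print("Mean without extreme value:  ", trimmed.mean())
print("Median without extreme value:", np.median(trimmed))
```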


**Challenge**

Initialize an array with the values: `1,1,1,1,1,1,1,1,10`

In [None]:
# initialize array
mm = None

In [None]:
#what is mean of mm
mm_mean = None

In [None]:
#what is median of mm
mm_median = None

In [None]:
# plot distribution



As you can see, when the data is not normally distributed (here we have one outlier), the mean is not a reliable "typical value" for the data.

## Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

Broadcasting allows these types of binary operations to be performed on arrays of different sizes. For example, we can just as easily add a scalar (think of it as a zero-dimensional array) to an array:

In [None]:
a + 5

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

Just as before we stretched or broadcast one value to match the shape of the other, here we've stretched both `a` and `b` to match a common shape, and the result is a two-dimensional array! The geometry of these examples is visualized in the following figure:

![](./../images/broadcasting.png)
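
The stretching can also be made explicit with `np.broadcast_to`, which returns a read-only view of an array at a larger shape. This is only an illustrative sketch of what broadcasting does conceptually (the names `a_stretched` and `b_stretched` are introduced here); in a normal `a + b`, NumPy does not need to materialize the stretched copies.

```python
import numpy as np

a = np.arange(3)                 # shape (3,)
b = np.arange(3)[:, np.newaxis]  # shape (3, 1)

# Explicitly stretch both operands to the common shape (3, 3)
a_stretched = np.broadcast_to(a, (3, 3))  # the row [0, 1, 2] repeated down
b_stretched = np.broadcast_to(b, (3, 3))  # the column [0, 1, 2] repeated across

print(a_stretched)
print(b_stretched)

# The element-wise sum of the stretched arrays equals the broadcast result
print(np.array_equal(a_stretched + b_stretched, a + b))
```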

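Broadcasting operations form the core of many solutions, especially during data manipulation and normalization. For example, suppose we have an array of 10 observations, each with 3 values, and compute the mean of each column: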




In [None]:
X = np.random.random((10, 3))

In [None]:
Xmean = X.mean(0)
Xmean


And now we can center the X array by subtracting the mean (this is a broadcasting operation):

In [None]:
X_centered = X - Xmean
X_centered

To double-check that we've done this correctly, we can check that the centered array has near zero mean:

In [None]:
X_centered.mean(0)
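
Centering along the other axis needs one extra step, because the row means have shape `(10,)`, which does not line up with `(10, 3)`. A minimal sketch, assuming a seeded generator for reproducibility (the names `rng`, `row_means`, `col_centered` and `row_centered` are introduced here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, for reproducibility
X = rng.random((10, 3))

# Column means have shape (3,), which aligns with the last axis of X
col_centered = X - X.mean(axis=0)

# Row means would have shape (10,), which does not broadcast against (10, 3);
# keepdims=True keeps a trailing axis of length 1, giving shape (10, 1)
row_means = X.mean(axis=1, keepdims=True)
row_centered = X - row_means

print(row_means.shape)
print(np.allclose(col_centered.mean(axis=0), 0))
print(np.allclose(row_centered.mean(axis=1), 0))
```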

In [None]:
# what if we use our mm?
# mm - mean 
mm_centered_mean = None

In [None]:
#mm - median
mm_centered_median = None

# Three-sigma rule


In statistics, the 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within a band around the mean in a normal distribution with a width of two, four and six standard deviations, respectively; more accurately, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively. In mathematical notation, these facts can be expressed as follows, where X is an observation from a normally distributed random variable, μ is the mean of the distribution, and σ is its standard deviation:

\begin{aligned}
\Pr(\mu -\;\,\sigma \leq X\leq \mu +\;\,\sigma )&\approx 0.6827\\
\Pr(\mu -2\sigma \leq X\leq \mu +2\sigma )&\approx 0.9545\\
\Pr(\mu -3\sigma \leq X\leq \mu +3\sigma )&\approx 0.9973
\end{aligned}

In the empirical sciences the so-called three-sigma rule of thumb expresses a conventional heuristic that nearly all values are taken to lie within three standard deviations of the mean, and thus it is empirically useful to treat 99.7% probability as near certainty. The usefulness of this heuristic depends significantly on the question under consideration. In the social sciences, a result may be considered "significant" if its confidence level is of the order of a two-sigma effect (95%), while in particle physics, there is a convention of a five-sigma effect (99.99994% confidence) being required to qualify as a discovery.
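
These probabilities can be reproduced from the error function: for a normal variable, the probability of falling within k standard deviations of the mean is erf(k / √2). The sketch below uses only the standard library; `prob_within` is a helper name introduced here for illustration.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# One, two, three and five sigma
for k in (1, 2, 3, 5):
    print(f"within {k} sigma: {prob_within(k):.7f}")
```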

The "three-sigma rule of thumb" is related to a result also known as the three-sigma rule, which states that even for non-normally distributed variables, at least 88.8% of cases should fall within properly calculated three-sigma intervals. This follows from Chebyshev's inequality. For unimodal distributions, the probability of being within the interval is at least 95%. Additional assumptions about a distribution may force this probability to be at least 98%.
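
The first two bounds can be written out explicitly. Chebyshev's inequality holds for any distribution with finite variance. The unimodal bound shown here is the Vysochanskij-Petunin inequality, which is an assumption about the source of the 95% figure, since the text does not name it. For k = 3:

\begin{aligned}
\Pr(|X-\mu| < k\sigma) &\geq 1 - \frac{1}{k^{2}} = 1 - \frac{1}{9} \approx 0.889 &&\text{(any distribution)}\\
\Pr(|X-\mu| < k\sigma) &\geq 1 - \frac{4}{9k^{2}} = 1 - \frac{4}{81} \approx 0.951 &&\text{(unimodal distribution)}
\end{aligned}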


![For an approximately normal data set, the values within one standard deviation of the mean account for about 68% of the set; while within two standard deviations account for about 95%; and within three standard deviations account for about 99.7%. Shown percentages are rounded theoretical probabilities intended only to approximate the empirical data derived from a normal population.](../images/3sigma.svg.png)



In [None]:
# calculate sigma, the standard deviation of each column
# (axis 0, to match the column-wise centering of X_centered)
sigma = np.std(X, axis=0)
sigma

In [None]:
# are there outliers in the data?
# True marks a value more than three sigma away from the mean
np.abs(X_centered) > 3 * sigma
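
A boolean mask like this can be summarized or used to pull out the flagged values. The following sketch uses a hypothetical one-dimensional array with one planted extreme value (the names `data`, `centered` and `mask` are illustrative, and the data is made up):

```python
import numpy as np

# Hypothetical 1-D data: twenty ordinary values and one extreme value
data = np.array([9.0, 11.0] * 10 + [50.0])

centered = data - data.mean()
sigma = data.std()

# Boolean mask: True where a value is more than three sigma from the mean
mask = np.abs(centered) > 3 * sigma

print("Any outliers?     ", mask.any())
print("Number of outliers:", mask.sum())
print("Outlier values:    ", data[mask])
```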


**Challenge**  
Can you find outliers in our mm centered data?

In [None]:
# calculate mm sigma
mm_sigma = None

In [None]:
# check for outliers using mm_centered_mean


In [None]:
# check for outliers using mm_centered_median


As you can see, the classification of a data point can change depending on how you center the data (around the mean or around the median); the anomaly in `mm` is an example.

## Plotting a two-dimensional function
One place that broadcasting is very useful is in displaying images based on two-dimensional functions. If we want to define a function z=f(x,y), broadcasting can be used to compute the function across the grid:

In [None]:
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]

z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

In [None]:
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5],
           cmap='viridis')
plt.colorbar();
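
For comparison, the same grid can be built with explicit coordinate arrays from `np.meshgrid`. This sketch wraps the expression from the cell above in a helper `f` (a name introduced here) and checks that both approaches agree; broadcasting simply avoids allocating the two full coordinate grids.

```python
import numpy as np

def f(x, y):
    # Same expression as the z computed above
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)

# Broadcasting: x has shape (50,), y[:, np.newaxis] has shape (50, 1)
z_broadcast = f(x, y[:, np.newaxis])

# Explicit grids: xx and yy both have shape (50, 50)
xx, yy = np.meshgrid(x, y)
z_meshgrid = f(xx, yy)

print(z_broadcast.shape)
print(np.allclose(z_broadcast, z_meshgrid))
```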