# <center> Demystifying crucial statistics in Python </center>

### - Know about the basic statistics required for Machine Learning

You do not need any prerequisite knowledge of statistics before you can start learning and applying machine learning. 

However, knowing some statistics can be very helpful for understanding machine learning technically as well as intuitively. Knowing some statistics will eventually be required when you want to start validating your results.

Statistics is a field of mathematics with lots of theories and findings. Various nuts and bolts, tools, techniques and notations are taken from this field in order to make machine learning what it is today. You can use descriptive statistical methods to transform raw observations into information that you can understand and share. You can use inferential statistical methods to reason from small samples of data to whole domains. You will study descriptive and inferential statistics later in this post.  

Before getting started, let's walk through ten examples where statistical methods are used in an applied machine learning project:

- **Problem Framing**: Requires the use of exploratory data analysis and data mining.
- __Data Understanding__: Requires the use of summary statistics and data visualization.
- **Data Cleaning**: Requires the use of outlier detection, imputation and more.
- **Data Selection**: Requires the use of data sampling and feature selection methods.
- **Data Preparation**: Requires the use of data transforms, scaling, encoding and much more.
- **Model Evaluation**: Requires experimental design and resampling methods.
- **Model Configuration**: Requires the use of statistical hypothesis tests and estimation statistics.
- **Model Selection**: Requires the use of statistical hypothesis tests and estimation statistics.
- **Model Presentation**: Requires the use of estimation statistics such as confidence intervals.
- **Model Predictions**: Requires the use of estimation statistics such as prediction intervals.


Isn't that fascinating? 

This post will give you a solid background in the essential but basic statistics required for becoming a good machine learning practitioner. 

In this post, you will study:

- Introduction to Statistics and its types
- Statistics for data preparation
- Statistics for model evaluation
- Introduction to Gaussian and Descriptive stats
- Variable correlation
- Non-parametric Statistics

You have a lot to cover and all of the topics are equally important. Let's get started!

## Introduction to Statistics and its types:

Let's briefly study how to define statistics in simple terms. 

Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.

When it comes to the statistical tools that are used in practice, it can be helpful to divide the field of statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential statistics for drawing conclusions from samples of data.

- __Descriptive Statistics__: Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. The below infographic provides a good summary of descriptive statistics:

<img src = "https://i2.wp.com/intellspot.com/wp-content/uploads/2017/11/descriptive-statistic-spreadsheet-and-pie-chart.png?resize=720%2C437"></img>

_**Source: IntellSpot**_

- __Inferential Statistics__: Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called a sample. Below is an infographic which beautifully describes inferential statistics: 

    <img src = "https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/01/20150849/what-is-inferential-statistics.jpg">
    
_**Source: Analytics Vidhya**_

In the next section, you will study the use of statistics for data preparation.

## Statistics for data preparation:

Statistical methods are required in the preparation of train and test data for your machine learning model.

This includes techniques for:

- Outlier detection
- Missing value imputation
- Data sampling
- Data scaling
- Variable encoding

A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.

Let's analyze each of the above points briefly. 

### Outlier detection:

Let's first see what an outlier is. 

An outlier is an observation that appears to deviate markedly from other observations in the sample. The following figure makes the definition clearer. 

<img src = "https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/34795/versions/7/screenshot.png">

_**Source: MathWorks**_

You can spot the outliers in the data shown in the above figure. 

Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in input data can skew and mislead the training process of machine learning algorithms resulting in longer training times, less accurate models and ultimately poorer results.

**Identification of potential outliers is important for the following reasons:**

- An outlier may indicate bad data. For example, the data may have been coded incorrectly or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible).

- In some cases, it may not be possible to determine if an outlying point is bad data. Outliers may be due to random variation or may indicate something scientifically interesting. In any event, you typically do not want to simply delete the outlying observation. However, if the data contains significant outliers, you may need to consider the use of robust statistical techniques.

So, outliers are often not good for your predictive models (although sometimes these outliers can be used to your advantage, but that is out of the scope of this post). You need statistical know-how in order to handle outliers efficiently.

### Missing value imputation:

Most real-world datasets suffer from the problem of missing values. Your machine learning model may not get trained effectively if the data that you are feeding to it contains missing values. Statistical tools and techniques come to the rescue here.

Many people tend to discard the data instances which contain a missing value. But that is not a good practice, because in doing so you may lose important features/representations of the data. Although there are advanced methods for dealing with missing values, these are the quick techniques that one would go for: **Mean Imputation** and **Median Imputation**. 

It is very important that you understand what mean and median are. 

Say, you have a feature **X1** which has these values - 13, 18, 13, 14, 13, 16, 14, 21, 13

The **mean** is the usual average, so you add up the values and then divide by how many there are:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) / 9 = 15

Note that the mean, in this case, isn't a value from the original list. This is a common result. You should not assume that your mean will be one of your original numbers.

The **median** is the middle value, so first you will have to rewrite the list in numerical order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) / 2 = 10 / 2 = 5th number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.
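
The arithmetic above can be checked with a few lines of Python. This is only a small sketch using the standard library `statistics` module. The second half, where two entries of X1 are pretended to be missing and then filled with the median, is a hypothetical illustration and not part of the example above.

```python
import statistics

# Feature X1 from the example above
x1 = [13, 18, 13, 14, 13, 16, 14, 21, 13]

x1_mean = statistics.mean(x1)      # sum of the values divided by their count
x1_median = statistics.median(x1)  # middle value of the sorted list

print("mean:", x1_mean)
print("median:", x1_median)

# Hypothetical follow-up: a copy of X1 with two entries missing (None),
# filled with the median of the values that were observed.
x1_missing = [13, None, 13, 14, None, 16, 14, 21, 13]
observed = [v for v in x1_missing if v is not None]
fill_value = statistics.median(observed)
x1_imputed = [fill_value if v is None else v for v in x1_missing]
print("imputed:", x1_imputed)
```

Swapping `statistics.median` for `statistics.mean` in the last step would give mean imputation instead.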

### Data sampling:

Data is the currency of applied machine learning. Therefore, it is important that it is both collected and used effectively.

Data sampling refers to statistical methods for selecting observations from the domain with the objective of estimating a population parameter. In other words, sampling is an active process of gathering observations with the intent of estimating a population variable.

Each row of a dataset represents an observation about something in the world. When working with data, you often do not have access to all possible observations. This could be for many reasons; for example:

- It may be difficult or expensive to make more observations.
- It may be challenging to gather all observations together.
- More observations are expected to be made in the future.

Many times, you will not have the right proportion of data samples across categories. So, you will have to under-sample or over-sample based on the type of problem. 

You perform under-sampling when one category has far more data samples than the others, meaning that you discard some of the samples from the over-represented category. You perform over-sampling when one category has far fewer data samples than the others. In this case, you generate additional data samples for the under-represented category. 

This applies to multi-class scenarios as well. 
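
Here is a minimal sketch of both ideas with NumPy, assuming a made-up two-class label array (90 samples of one class, 10 of the other). Note that the over-sampling shown here simply repeats existing minority rows at random; generating genuinely new samples requires more advanced techniques that are not covered in this post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
labels = np.array([0] * 90 + [1] * 10)
majority_idx = np.flatnonzero(labels == 0)
minority_idx = np.flatnonzero(labels == 1)

# Under-sampling: keep only as many majority rows as there are minority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])

# Over-sampling: draw minority rows with replacement until the classes match
extra_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)
over_idx = np.concatenate([majority_idx, extra_minority])

print("under-sampled class counts:", np.bincount(labels[under_idx]))
print("over-sampled class counts:", np.bincount(labels[over_idx]))
```

The index arrays `under_idx` and `over_idx` can then be used to select rows from the feature matrix as well as from the labels.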

Statistical sampling is a large field of study, but in applied machine learning, there may be three types of sampling that you are likely to use: simple random sampling, systematic sampling, and stratified sampling.

- **Simple Random Sampling**: Samples are drawn with a uniform probability from the domain.
- **Systematic Sampling**: Samples are drawn using a pre-specified pattern, such as at intervals.
- **Stratified Sampling**: Samples are drawn within pre-specified categories (i.e. strata).

Although these are the more common types of sampling that you may encounter, there are other techniques.
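
To make the three sampling types concrete, here is a rough sketch with NumPy. The population of 100 values and the two groups `a` and `b` are invented for illustration, and a 10% sample is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 100 observations, each with a category label
values = np.arange(100)
groups = np.array(["a"] * 80 + ["b"] * 20)

# Simple random sampling: every observation has the same chance of selection
simple_idx = rng.choice(len(values), size=10, replace=False)

# Systematic sampling: take every 10th observation from a random start
start = rng.integers(0, 10)
systematic_idx = np.arange(start, len(values), 10)

# Stratified sampling: sample 10% within each group so proportions are kept
stratified_idx = np.concatenate([
    rng.choice(np.flatnonzero(groups == g), size=np.sum(groups == g) // 10, replace=False)
    for g in np.unique(groups)
])

print("simple:", np.sort(simple_idx))
print("systematic:", systematic_idx)
print("stratified groups:", groups[stratified_idx])
```

The stratified sample contains 8 observations from group `a` and 2 from group `b`, mirroring the 80/20 split of the population.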

### Data Scaling: 

Often, the features of your dataset vary widely in range. Some features may range from 0 to 100, while others may range from 0.001 to 0.01 or from 10000 to 20000. 

This is problematic for efficient modeling, because features with large value ranges can dominate features with small ranges: a meaningful change in a small-range feature may have almost no influence on the model simply because of its scale. This gets in the way of good learning. Bringing the features onto comparable scales to deal with this problem is known as **data scaling**. 

There are different data scaling techniques such as Min-Max scaling, Absolute scaling, Standard scaling etc. 
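
As a quick illustration, Min-Max scaling and Standard scaling can be written directly with NumPy. The small feature matrix below is made up to mimic the ranges mentioned above. Scikit-Learn also ships ready-made scalers, but this sketch sticks to the plain formulas.

```python
import numpy as np

# Hypothetical features on very different scales (one column per feature)
X = np.array([
    [10.0, 0.002, 12000.0],
    [50.0, 0.005, 15000.0],
    [90.0, 0.008, 18000.0],
])

# Min-Max scaling: map each column to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard scaling: give each column zero mean and unit standard deviation
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

After either transformation, all three columns live on the same scale, so no single feature dominates because of its units.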

### Variable encoding:

At times, your datasets contain a mixture of numeric and non-numeric data. Many machine learning algorithms expect numeric input, and numeric data is also faster to compute with, so you may have to supply all of the data in numeric format. 

Again, statistics comes to your rescue. 

Techniques like Label encoding, One-Hot encoding etc. are used to convert non-numeric data to numeric. 
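
To see the difference between the two techniques, here is a small NumPy-only sketch on a made-up `colors` column. It is meant to show the idea rather than replace the dedicated encoders in Scikit-Learn.

```python
import numpy as np

# Hypothetical non-numeric feature
colors = np.array(["red", "green", "blue", "green", "red"])

# Label encoding: each distinct category gets an integer code.
# np.unique sorts the categories, so blue=0, green=1, red=2.
categories, label_encoded = np.unique(colors, return_inverse=True)

# One-hot encoding: one column per category, with a single 1 in each row
one_hot = np.eye(len(categories), dtype=int)[label_encoded]

print(categories)
print(label_encoded)
print(one_hot)
```

Label encoding is compact but implies an order between categories, whereas one-hot encoding avoids that at the cost of one extra column per category.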

## It's time to apply the techniques!

You have covered a lot of theory for now. You will apply some of these techniques to get a real feel for them. 

You will start off by applying some statistical methods to detect **Outliers**.

But before you start, let's define a dataset which you can use to test the methods.

You will generate a population of 10,000 random numbers drawn from a Gaussian distribution with a mean of 50 and a standard deviation of 5 (_Gaussian distributions, the mean, and the standard deviation are discussed later_).

Numbers drawn from a Gaussian distribution will have outliers. That is, by virtue of the distribution itself, there will be a few values that will be a long way from the mean, rare values that you can identify as outliers.

You will use the randn() function of NumPy to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by your own standard deviation and add the mean to shift the values into the preferred range.

The pseudorandom number generator is seeded to ensure that you get the same sample of numbers each time you run the code.

In [1]:
# Dependencies
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# Seed the random number generator
seed(1)

In [2]:
# Generate univariate observations
data = 5 * randn(10000) + 50

# Summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

mean=50.049 stdv=4.994


If you know that the distribution of values in the sample is Gaussian or Gaussian-like, you can use the standard deviation of the sample as a cut-off for identifying outliers.

The Gaussian distribution has the property that the standard deviation from the mean can be used to reliably summarize the percentage of values in the sample.

For example, the range within one standard deviation of the mean covers about 68% of the data.

So, if the mean is 50 and the standard deviation is 5, as in the test dataset above, then all data in the sample between 45 and 55 will account for about 68% of the data sample. You can cover more of the data sample if you expand the range as follows:

- 1 Standard Deviation from the Mean: 68%
- 2 Standard Deviations from the Mean: 95%
- 3 Standard Deviations from the Mean: 99.7%

A value that falls outside of 3 standard deviations is part of the distribution, but it is an unlikely or rare event at approximately 1 in 370 samples.
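
If you would like to verify these percentages yourself, the fraction of a Gaussian distribution that lies within k standard deviations of the mean can be computed with the error function from Python's standard `math` module. The helper name `coverage` is just a label chosen for this sketch.

```python
from math import erf, sqrt

def coverage(k):
    """Fraction of a Gaussian distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print("%d standard deviation(s): %.2f%%" % (k, 100 * coverage(k)))

# Chance of landing outside 3 standard deviations, expressed as "1 in N"
one_in_n = 1 / (1 - coverage(3))
print("outside 3 standard deviations: about 1 in %.0f" % one_in_n)
```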

Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller samples of data, perhaps a value of 2 standard deviations (95%) can be used, and for larger samples, perhaps a value of 4 standard deviations (99.99%) can be used.

Sometimes, the data is standardized first (e.g. to a Z-score with zero mean and unit variance) so that the outlier detection can be performed using standard Z-score cut-off values. This is convenient but not required in general, and you will perform the calculations in the original scale of the data here to make things clear.

You can calculate the mean and standard deviation of a given sample, then calculate the cut-off for identifying outliers as more than 3 standard deviations from the mean.

In [3]:
# Calculate summary statistics
data_mean, data_std = mean(data), std(data)

# Identify outliers
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# Summarize
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))

Identified outliers: 29


You can remove these outliers as well. 

In [4]:
# Remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))

Non-outlier observations: 9971
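
For comparison, the Z-score variant mentioned earlier might look like the sketch below. It draws its own sample with NumPy's `default_rng` (an assumption made to keep the snippet self-contained), so the counts will not match the numbers printed above exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fresh Gaussian sample with mean 50 and standard deviation 5
sample = 5 * rng.standard_normal(10000) + 50

# Standardize: subtract the mean and divide by the standard deviation
z_scores = (sample - sample.mean()) / sample.std()

# Flag values whose absolute Z-score exceeds the cut-off of 3
outlier_mask = np.abs(z_scores) > 3
print("Identified outliers: %d" % outlier_mask.sum())
print("Non-outlier observations: %d" % (~outlier_mask).sum())
```

Working with Z-scores means the same cut-off of 3 can be reused for any feature, whatever its original units.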


You saw how you can use standard deviation and set its threshold in order to detect potential outliers in the data and how you can filter them out. Next you will see how to do some **missing value imputation**. 

You will use the famous [**Pima Indian Diabetes**](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv) dataset which is known to have missing values. But before proceeding any further, you will have to load the dataset into your workspace. 

You will load the dataset into a DataFrame object **data**. 

In [6]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv")
print(data.describe())

                6         148          72          35           0        33.6  \
count  767.000000  767.000000  767.000000  767.000000  767.000000  767.000000   
mean     3.842243  120.859192   69.101695   20.517601   79.903520   31.990482   
std      3.370877   31.978468   19.368155   15.954059  115.283105    7.889091   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000   
50%      3.000000  117.000000   72.000000   23.000000   32.000000   32.000000   
75%      6.000000  140.000000   80.000000   32.000000  127.500000   36.600000   
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000   

            0.627          50           1  
count  767.000000  767.000000  767.000000  
mean     0.471674   33.219035    0.348110  
std      0.331497   11.752296    0.476682  
min      0.078000   21.000000    0.000000  
25%      0.243500   24.000000    0.000000  
50%   
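
One detail worth noticing in the summary above: the column labels (6, 148, 72, ...) look like data values and the count is 767, which suggests that the file has no header row and that its first record was used as column names. If you want to keep that record as data, a possible approach is to pass `header=None` to `read_csv`. The sketch below demonstrates this on three rows copied from the outputs in this post, read from an in-memory string. The short column names are assumptions chosen for readability.

```python
import io
import pandas as pd

# Three rows of the file, as they appear in the outputs of this post
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# Column names are assumptions made for this sketch
names = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
         "insulin", "bmi", "pedigree", "age", "outcome"]

# header=None tells pandas that the first line is data, not column labels
sample = pd.read_csv(raw, header=None, names=names)
print(sample.shape)
print(sample[["glucose", "insulin", "bmi"]])
```

The rest of this post continues with the DataFrame exactly as loaded above, so the outputs that follow still show the first record as column labels.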

You can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skinfold thickness
- 2-Hour serum insulin
- Body mass index

Let's confirm this by looking at the raw data. The example below prints the first 20 rows of data.

In [8]:
data.head(20)

Unnamed: 0,6,148,72,35,0,33.6,0.627,50,1
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0
5,3,78,50,32,88,31.0,0.248,26,1
6,10,115,0,0,0,35.3,0.134,29,0
7,2,197,70,45,543,30.5,0.158,53,1
8,8,125,96,0,0,0.0,0.232,54,1
9,4,110,92,0,0,37.6,0.191,30,0


You can clearly see 0 values in the columns 2, 3, 4, and 5.

You can get a count of the number of missing values in each of these columns. You can do this by marking all of the values in the subset of the DataFrame you are interested in that have zero values as True. You can then count the number of true values in each column.

In [9]:
# Select columns 1 to 5 by position, since the column labels here are not integers
print((data.iloc[:, 1:6] == 0).sum())

148       5
72       35
35      227
0       373
33.6     11
dtype: int64


You can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more: roughly 30% and nearly half of the rows, respectively.

This highlights that different "missing value" strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a machine learning model.

In Python, specifically Pandas, NumPy and Scikit-Learn, you mark missing values as NaN.

Values with a NaN value are ignored from operations like sum, count, etc.

You can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns you are interested in.

After you have marked the missing values, you can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.

In [13]:
# Mark zero values as missing or NaN
import numpy

# Columns 1 to 5 by position; assigning by label replaces them in place
cols = data.columns[1:6]
data[cols] = data[cols].replace(0, numpy.nan)

# Count the number of NaN values in each column
print(data.isnull().sum())

6          0
148        5
72        35
35       227
0        373
33.6      11
0.627      0
50         0
1          0
dtype: int64


Running the example prints the number of missing values in each column. You can see that the columns 1 to 5 have the same number of missing values as zero values identified above. This is a sign that you have marked the identified missing values correctly.

Let's take a look at the full dataset now. 

In [14]:
data.head(20)

Unnamed: 0,6,148,72,35,0,33.6,0.627,50,1
0,1,85.0,66.0,29.0,,26.6,0.351,31,0
1,8,183.0,64.0,,,23.3,0.672,32,1
2,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
3,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
4,5,116.0,74.0,,,25.6,0.201,30,0
5,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
6,10,115.0,,,,35.3,0.134,29,0
7,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
8,8,125.0,96.0,,,,0.232,54,1
9,4,110.0,92.0,,,37.6,0.191,30,0


You can see that marking the missing values had the intended effect.

So far, you have seen how to find missing values and how simple statistical measures help you quantify them. Now, you will impute the missing values using **Mean Imputation**, which replaces each missing value with the mean of its column. 

In [16]:
# Impute missing values with mean column values
data.fillna(data.mean(), inplace=True)

# Count the number of NaN values in each column
print(data.isnull().sum())

6        0
148      0
72       0
35       0
0        0
33.6     0
0.627    0
50       0
1        0
dtype: int64


Excellent! 
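
**Median Imputation**, the other quick technique mentioned earlier, works the same way with `median()` in place of `mean()`. The sketch below uses a tiny made-up column with one extreme value to show why the choice matters: the extreme value pulls the mean upwards, while the median stays close to the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing values and one extreme observation
df = pd.DataFrame({"insulin": [94.0, 168.0, np.nan, 88.0, 543.0, np.nan]})

# Fill the gaps with the column mean and, separately, with the column median
mean_filled = df["insulin"].fillna(df["insulin"].mean())
median_filled = df["insulin"].fillna(df["insulin"].median())

print("mean used for filling:  ", df["insulin"].mean())
print("median used for filling:", df["insulin"].median())
```

When a column is strongly skewed or contains outliers, the median is often the safer choice of the two.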

This [DataCamp article](https://www.datacamp.com/community/tutorials/preprocessing-in-data-science-part-1-centering-scaling-and-knn) walks you through implementing **data scaling** as a data preprocessing step. Be sure to check it out. 

Next you will do **variable encoding**.

Before that you need a dataset which actually contains non-numeric data. You will use the famous [Iris dataset](http://archive.ics.uci.edu/ml/datasets/Iris) for this. 

In [22]:
# Load the dataset to a DataFrame object iris
iris = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",header=None)

In [24]:
# See first 20 rows of the dataset
iris.head(20)

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


You can easily convert the string values to integer values using the [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). The three class values (Iris-setosa, Iris-versicolor, Iris-virginica) are mapped to the integer values (0, 1, 2).

In this case, the fifth column (index 4) of the dataset contains non-numeric values, so you need to separate it out. 

In [28]:
# Convert the DataFrame to a NumPy array
iris = iris.values

# Separate
Y = iris[:,4]

In [30]:
# Label Encode string class values as integers
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

Now, let's study another area where an elementary knowledge of statistics is crucial.

## Statistics for model evaluation:

You have designed and developed your machine learning model. Now, you want to evaluate its performance on the test data. For this, you rely on various statistical metrics like Precision, Recall, ROC, AUC, RMSE, etc. You also rely on data resampling techniques such as **k-fold Cross Validation**. 

Statistics can effectively be used to:

- Estimate the accuracy of a hypothesis
- Determine the difference in error between two hypotheses
- Compare learning algorithms

_It is important to note that hypotheses here refer to learned models; the results of running a learning algorithm on a dataset. Evaluating and comparing hypotheses means comparing learned models, which is different from evaluating and comparing machine learning algorithms, which could be trained on different samples from the same problem or different problems._
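
To give a feel for k-fold cross-validation, here is a bare-bones sketch with NumPy. Everything in it is an assumption made for illustration: the targets are random numbers, and the "model" is just the mean of the training targets, scored with RMSE on each held-out fold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression targets; the "model" simply predicts the training mean
y = 5 * rng.standard_normal(100) + 50

k = 5
indices = rng.permutation(len(y))   # shuffle before splitting
folds = np.array_split(indices, k)  # k roughly equal groups of row indices

rmse_scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    prediction = y[train_idx].mean()   # "train" on the other k-1 folds
    errors = y[test_idx] - prediction  # evaluate on the held-out fold
    rmse_scores.append(np.sqrt(np.mean(errors ** 2)))

print("RMSE per fold:", np.round(rmse_scores, 3))
print("Mean RMSE: %.3f" % np.mean(rmse_scores))
```

Each observation is used for testing exactly once, and the spread of the per-fold scores gives you an idea of how stable the estimate is.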

Let's study Gaussian and Descriptive statistics now.

## Introduction to Gaussian and Descriptive stats: 

A sample of data is nothing but a snapshot from a broader population of all possible observations that could be taken from a domain or generated by a process.

Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. It is the bell-shaped distribution that you may be familiar with. The following figure shows a Gaussian distribution:

<img src = "http://hyperphysics.phy-astr.gsu.edu/hbase/Math/immath/gauds.gif"></img>

_**Source: HyperPhysics**_

A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics and statistical methods that can be used with Gaussian data.

Any Gaussian distribution, and in turn any data sample drawn from a Gaussian distribution, can be summarized with just two parameters:

- **Mean**: The central tendency or most likely value in the distribution (the top of the bell).
- **Variance**: The average squared difference that observations have from the mean value in the distribution (the spread).

The units of the mean are the same as the units of the distribution, although the units of the variance are squared, and therefore harder to interpret. A popular alternative to the variance parameter is the standard deviation, which is simply the square root of the variance, returning the units to be the same as those of the distribution.
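
As a small worked example, the sketch below computes the three quantities by hand, using their usual definitions (the variance being the mean of the squared deviations from the mean). It reuses the values of feature X1 from the imputation section. Dividing by n here matches the default behaviour of NumPy's `var()` and `std()`.

```python
from math import sqrt

# Feature X1 from the missing value imputation section
x1 = [13, 18, 13, 14, 13, 16, 14, 21, 13]

n = len(x1)
x1_mean = sum(x1) / n

# Variance: the average of the squared differences from the mean
x1_variance = sum((x - x1_mean) ** 2 for x in x1) / n

# Standard deviation: the square root of the variance
x1_std = sqrt(x1_variance)

print("Mean: %.3f" % x1_mean)
print("Variance: %.3f" % x1_variance)
print("Standard Deviation: %.3f" % x1_std)
```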

The mean, variance, and standard deviation can be calculated directly on data samples. You will be doing it right now using NumPy.

The example below generates a sample of 10,000 random numbers drawn from a Gaussian distribution with a known mean of 50 and a standard deviation of 5 and calculates the summary statistics.

First you will import all the dependencies. 

In [1]:
#  Dependencies
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import var
from numpy import std

Next you set the random number generator seed so that your results are reproducible. 

In [2]:
seed(1)

In [3]:
# Generate univariate observations
data = 5 * randn(10000) + 50

In [4]:
# Calculate statistics
print('Mean: %.3f' % mean(data))
print('Variance: %.3f' % var(data))
print('Standard Deviation: %.3f' % std(data))

Mean: 50.049
Variance: 24.939
Standard Deviation: 4.994


Close enough, eh?

Let's study the next topic now. 

## Variable correlation:

There may be complex and unknown relationships between the variables in your dataset.

It is important to discover and quantify the degree to which variables in your dataset are dependent upon each other. This knowledge can help you better prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose performance will degrade with the presence of these interdependencies.

Variables within a dataset can be related for lots of reasons.

For example:

- One variable could cause or depend on the values of another variable.
- One variable could be lightly associated with another variable.
- Two variables could depend on a third unknown variable.

It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s value decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.

- Positive Correlation: both variables change in the same direction.
- Neutral Correlation: No relationship in the change of the variables.
- Negative Correlation: variables change in opposite directions.

Correlation measurements form the foundation of filter-based feature selection techniques. 

You can quantify the relationship between samples of two variables using a statistical method called Pearson’s correlation coefficient, named for the developer of the method, **Karl Pearson**.

The pearsonr() function of SciPy can be used to calculate the Pearson’s correlation coefficient for samples of two variables.

You will now calculate it for a test case where one variable is dependent upon the other.

In [8]:
# First things first
from scipy.stats import pearsonr

# seed random number generator
seed(1)

In [9]:
# Prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

In [10]:
# Calculate Pearson's correlation
corr, p = pearsonr(data1, data2)
# Display the correlation
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.888
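
To see what is behind that number, here is a sketch that computes Pearson's correlation from its definition: the covariance of the two variables divided by the product of their standard deviations. The helper `pearson` and the variables `a` to `d` are made up for this illustration, and the data is generated with NumPy's `default_rng`, so the values differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(1)
a = 20 * rng.standard_normal(1000) + 100
b = a + (10 * rng.standard_normal(1000) + 50)   # moves with a
c = -a + (10 * rng.standard_normal(1000) + 50)  # moves against a
d = 10 * rng.standard_normal(1000) + 50         # independent of a

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x_dev, y_dev = x - x.mean(), y - y.mean()
    return np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))

print("positive: %.3f" % pearson(a, b))
print("negative: %.3f" % pearson(a, c))
print("neutral:  %.3f" % pearson(a, d))
```

The three results line up with the positive, negative and neutral cases described earlier in this section.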


In the next section, you will study non-parametric statistics.

## Non-parametric statistics: 

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.

Data in which the distribution is unknown or cannot be easily identified is called nonparametric.

In the case where you are working with nonparametric data, specialized nonparametric statistical methods can be used that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.

Before a nonparametric statistical method can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format are sometimes called _rank statistics_, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly as its name suggests.

The procedure is as follows:

- Sort all data in the sample in ascending order.
- Assign an integer rank from 1 to N for each unique value in the data sample.
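
The two steps above can be sketched in plain Python as follows. The function name `rank_data` and the sample values are made up for this example. Note that library implementations may treat repeated values differently, for instance by giving tied values the average of their ranks.

```python
def rank_data(values):
    """Rank values from 1 to N over the unique values, smallest first."""
    unique_sorted = sorted(set(values))
    rank_of = {value: rank for rank, value in enumerate(unique_sorted, start=1)}
    return [rank_of[value] for value in values]

sample = [0.32, 0.11, 0.87, 0.11, 0.54]
print(rank_data(sample))  # prints [2, 1, 4, 1, 3]
```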

A widely used nonparametric statistical hypothesis test for checking for a difference between two independent samples is the _**Mann-Whitney U test**_, named for Henry Mann and Donald Whitney.

It is the nonparametric equivalent of the _Student’s t-test_ but does not assume that the data is drawn from a Gaussian distribution.

You will implement this test in Python via the mannwhitneyu() SciPy function.

In [16]:
# The dependencies that you need
from scipy.stats import mannwhitneyu
from numpy.random import rand

# seed the random number generator
seed(1)

In [17]:
# Generate two independent samples
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
# Compare samples
stat, p = mannwhitneyu(data1, data2)
print('Statistics = %.3f, p = %.3f' % (stat, p))
# Interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

Statistics = 4077.000, p = 0.012
Different distribution (reject H0)
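
If you are curious about what the U statistic measures, the sketch below counts it by brute force: for every pair made of one value from each sample, it checks which value is larger. The helper `u_statistic` and the two tiny samples are invented for illustration. This only computes the statistic, not the p-value, and it is not how SciPy implements the test internally.

```python
import numpy as np

def u_statistic(x, y):
    """U for sample x: each pair where x beats y counts 1, each tie counts 0.5."""
    x = np.asarray(x)[:, None]  # column vector, so x and y broadcast to all pairs
    y = np.asarray(y)[None, :]
    return np.sum(x > y) + 0.5 * np.sum(x == y)

a = [1.0, 2.0, 3.0]
b = [2.5, 3.5, 4.5, 5.5]

u_a = u_statistic(a, b)
u_b = u_statistic(b, a)
print(u_a, u_b)  # the two values always add up to len(a) * len(b)
```

A U value close to zero for one sample means that its values are almost always smaller than those of the other sample.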


Two other popular non-parametric statistical significance tests that you can use are:

- Friedman test
- Wilcoxon signed-rank test

## That calls for a wrap up!

You have finally made it to the end. In this article, you studied a variety of essential statistical concepts that play a crucial role in your machine learning projects. So, understanding them is important. 

From a mere introduction to statistics, you went all the way to rank statistics, with several implementations along the way. That is definitely quite a feat. Next are some links for you if you want to take things further:

- [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
- [Machine Learning book by Tom Mitchell](https://www.cs.ubbcluj.ro/~gabis/ml/ml-books/McGrawHill%20-%20Machine%20Learning%20-Tom%20Mitchell.pdf)
- [All for Statistics](https://www.ic.unicamp.br/~wainer/cursos/1s2013/ml/livro.pdf)

The following resources helped me in writing this blog: 
- [Machine Learning Mastery mini course on Statistics](https://machinelearningmastery.com/statistics-for-machine-learning-mini-course/)
- https://www.khanacademy.org/math/statistics-probability
- [Statistical Learning course by Stanford University](https://statlearning.class.stanford.edu/) 

Let me know your views/queries in the comments section. Also, check out [DataCamp's course on "Statistical Thinking in Python"](https://www.datacamp.com/courses/statistical-thinking-in-python-part-1) which is very practically aligned. 