### What Is Statistics?
Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of data. It can be used to describe, summarize, and compare data, to infer information from it, and to support decisions based on data and probability.

Statistics has two main types, descriptive and inferential, both of which are important for statistical analysis. A third type, analytical, is sometimes listed alongside them:
* **Descriptive** - Descriptive statistics uses graphs, tables, or numerical calculations to provide descriptions of the data.
* **Inferential** - Inferential statistics uses data samples to make predictions or inferences about the population.
* **Analytical** - Analytical statistics is used to find the association or relationship between two or more variables. It is the application of statistical methods and techniques to real-life problems. Quality control, sample surveys, inventory management, simulations, and quantitative analysis for business decision-making fall into this category.

Descriptive statistics is used to analyse patterns in data, using three kinds of measures:
* **Measures of central tendency**: The mean, median, and mode are measures of central tendency that describe the average or typical value of a data set.
* **Measures of Dispersion**: The standard deviation and variance are measures of dispersion that describe how spread out the data values are from the mean.
* **Measures of Association**: The correlation coefficient and regression line are measures of association that describe how two variables are related to each other.
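
As a small illustration of the dispersion and association measures listed above, the sketch below uses Python's `statistics` module and NumPy. The `hours_studied` values are hypothetical, invented only to give the test scores a second variable to be compared against.

```python
import statistics
import numpy as np

# Test scores plus a hypothetical second variable (invented for illustration)
test_scores = [85, 90, 92, 88, 95]
hours_studied = [2, 4, 5, 3, 6]

# Dispersion: sample variance and sample standard deviation (both divide by n - 1)
variance = statistics.variance(test_scores)
std_dev = statistics.stdev(test_scores)

# Association: Pearson correlation coefficient and least-squares regression line
r = np.corrcoef(hours_studied, test_scores)[0, 1]
slope, intercept = np.polyfit(hours_studied, test_scores, 1)

print(f"Variance: {variance}")
print(f"Standard deviation: {std_dev:.2f}")
print(f"Correlation: {r:.3f}")
print(f"Regression line: score = {slope:.2f} * hours + {intercept:.2f}")
```

A correlation close to 1 here only reflects how the made-up `hours_studied` values were chosen; it says nothing about real students.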

  
**Hypothesis tests and confidence intervals** are methods of inference that help to draw conclusions about the population based on a data sample.
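
To make these two inference methods concrete, here is a minimal sketch using SciPy on a small set of test scores. The hypothesized population mean of 85 and the 95% confidence level are assumptions chosen for illustration, and a t-based interval presumes the scores are roughly normally distributed.

```python
import math
import statistics
from scipy import stats

test_scores = [85, 90, 92, 88, 95]
n = len(test_scores)
sample_mean = statistics.mean(test_scores)
standard_error = statistics.stdev(test_scores) / math.sqrt(n)

# 95% confidence interval for the population mean
# (t distribution with n - 1 degrees of freedom)
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

# One-sample t-test against a hypothetical population mean of 85
hypothesized_mean = 85
t_stat, p_value = stats.ttest_1samp(test_scores, popmean=hypothesized_mean)

print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"t statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
```

A small p-value suggests the sample mean differs from the hypothesized value by more than chance alone would typically produce. With only five observations, both results should be read with caution.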
  
Statistics has many applications in various fields, such as science, engineering, business, economics, education, health, and social sciences. Statistics can help to answer questions, test hypotheses, analyze trends, and make predictions based on data. Statistics can also help to improve the quality and efficiency of processes, products, and services.

Mean, median, and mode are measures of central tendency, which give us an idea of where the center of a distribution or a set of values is.

**Mean**: The mean is simply the average of all the values in a dataset. You calculate it by adding up all the values and then dividing by the number of values.
* Formula: Mean = (Sum of all values) / (Number of values)
* Example: Let's say we have a set of test scores: 85, 90, 92, 88, and 95. The mean would be (85 + 90 + 92 + 88 + 95) / 5 = 90.
* When to use: Use the mean when you want a balance of all values and there are no extreme outliers. It's sensitive to extreme values, so if there are outliers, the mean might not represent the "typical" value well.

**Median**: The median is the middle value in a dataset when it's ordered. If there's an even number of values, the median is the average of the two middle values.
* To find the median, you first need to arrange the data in numerical order.
* Example: Using the same test scores, when ordered: 85, 88, 90, 92, 95. The median is 90, as it's the middle value.
* When to use: Use the median when your data has outliers or is not symmetrically distributed. It's less sensitive to extreme values and gives you a better sense of the central location if there are outliers.
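
The difference in sensitivity to extreme values can be seen with a short sketch. The extra score of 20 is a hypothetical outlier added only for this comparison.

```python
import statistics

test_scores = [85, 90, 92, 88, 95]
# Hypothetical outlier added for illustration: one very low score
scores_with_outlier = test_scores + [20]

mean_clean = statistics.mean(test_scores)
median_clean = statistics.median(test_scores)
mean_outlier = statistics.mean(scores_with_outlier)
median_outlier = statistics.median(scores_with_outlier)

print(f"Without outlier: mean={mean_clean}, median={median_clean}")
print(f"With outlier:    mean={mean_outlier:.2f}, median={median_outlier}")
```

The single low score pulls the mean down from 90 to about 78.33, while the median only moves from 90 to 89.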

**Mode**: The mode is the value that appears most frequently in a dataset.
* A set of data may have one mode, more than one mode (multimodal), or no mode at all if no value is repeated.
* Example: In the test scores, there is no mode because no score repeats.
* When to use: Use the mode when you want to identify the most common value in your dataset. It's especially useful for categorical data.
 

In summary, the choice between mean, median, and mode depends on the nature of your data and the insights you're seeking.

Note: A dataset can have more than one mode. For instance, suppose three parties take part in an election, two of them get 40% of the votes each, and the third gets 20%. In this case there are two modes, since two parties share the highest number of votes.
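
The two-mode election case can be reproduced with `statistics.multimode` from the standard library (Python 3.8+). The party names and the 100-vote total are hypothetical, chosen to match the 40/40/20 split described above.

```python
import statistics

# Hypothetical votes matching the election example: two parties tie at 40 votes each
votes = ["Party A"] * 40 + ["Party B"] * 40 + ["Party C"] * 20
vote_modes = statistics.multimode(votes)
print(f"Modes of the vote: {vote_modes}")

# When no value repeats, multimode returns every value,
# which is a signal that the data has no meaningful mode
test_scores = [85, 90, 92, 88, 95]
score_modes = statistics.multimode(test_scores)
print(f"Modes of the test scores: {score_modes}")
```

Checking whether the returned list is as long as the dataset is one simple way to detect the "no mode" case.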


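In a normal distribution, the mean, median, and mode are equal. The converse does not hold: other symmetric, single-peaked distributions can share this property, so mean = median = mode alone does not show that data is normally distributed. When the sample or population distribution is not normal, the central limit theorem still applies: if you repeatedly draw independent samples of size n from a population with finite variance and compute the mean of each sample, the distribution of those sample means approaches a normal distribution as n grows.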
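
The central limit theorem can be explored with a small simulation. The sketch below is illustrative only: the exponential population, the sample size of 50, the 10,000 repetitions, and the seed are all assumptions picked for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

sample_size = 50
num_samples = 10_000

# Population: exponential distribution with mean 1 and standard deviation 1,
# which is strongly skewed and clearly not normal
samples = rng.exponential(scale=1.0, size=(num_samples, sample_size))

# One mean per sample
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
std_of_means = sample_means.std(ddof=1)
expected_std = 1.0 / np.sqrt(sample_size)  # sigma / sqrt(n)

print(f"Mean of sample means: {mean_of_means:.3f} (population mean is 1.0)")
print(f"Std of sample means:  {std_of_means:.3f} (sigma / sqrt(n) = {expected_std:.3f})")
```

The sample means cluster around the population mean with a spread close to sigma / sqrt(n). A histogram of `sample_means` would look far more bell-shaped than the skewed population it came from.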


In [1]:
import statistics
from scipy import stats

# Example data
test_scores = [85, 90, 92, 88, 95]

# Mean
mean_score = statistics.mean(test_scores)
print(f"Mean: {mean_score}")

# Median
median_score = statistics.median(test_scores)
print(f"Median: {median_score}")

# Mode
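# scipy.stats.mode returns the most frequent value and its count.
# When several values tie, only one of them is returned (here the smallest, 85),
# so a count of 1 means no value repeats and the data has no mode.
mode_result = stats.mode(test_scores)
print(f"Mode: {mode_result.mode}, Frequency: {mode_result.count}")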


Mean: 90
Median: 90
Mode: 85, Frequency: 1


The five-number summary is a set of descriptive statistics that provides a concise summary of the distribution of a dataset. It includes the following five values:

1. **Minimum**: The smallest value in the dataset.
2. **First Quartile (Q1)**: The median of the lower half of the dataset. It's the value below which 25% of the data falls.
3. **Median (Second Quartile, Q2)**: The middle value of the dataset, separating it into two equal halves.
4. **Third Quartile (Q3)**: The median of the upper half of the dataset. It's the value below which 75% of the data falls.
5. **Maximum**: The largest value in the dataset.

The five-number summary is particularly useful for describing the spread and central tendency of a dataset, especially when the dataset is large and diverse.

Here's a simple example in Python using the same test scores data:

In [2]:
import numpy as np

# Example data
test_scores = [85, 90, 92, 88, 95]

# Five-number summary using numpy
minimum = np.min(test_scores)
q1 = np.percentile(test_scores, 25)
median = np.median(test_scores)
q3 = np.percentile(test_scores, 75)
maximum = np.max(test_scores)

print(f"Minimum: {minimum}")
print(f"Q1: {q1}")
print(f"Median: {median}")
print(f"Q3: {q3}")
print(f"Maximum: {maximum}")


Minimum: 85
Q1: 88.0
Median: 90.0
Q3: 92.0
Maximum: 95


A box plot (also known as a box-and-whisker plot) is a visual way to represent the five-number summary, along with additional information about the distribution of a dataset. It displays the minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

In a box plot:

* A box is drawn from Q1 to Q3, representing the interquartile range (IQR).
* A line (whisker) extends from each end of the box to the minimum and maximum values or, when outliers are plotted separately, to the most extreme values that are not outliers.
* A line inside the box represents the median.
* Outliers, if any, are often shown as individual points beyond the whiskers.
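
The quantities a box plot draws can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which the text above does not specify, and adds one hypothetical low score (60) so that there is an outlier to find.

```python
import numpy as np

# Test scores plus one hypothetical low score (60) added to produce an outlier
scores = np.array([60, 85, 88, 90, 92, 95])

# Box edges and interquartile range
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
inliers = scores[(scores >= lower_fence) & (scores <= upper_fence)]

# Whiskers end at the most extreme values that are not outliers
whisker_low, whisker_high = inliers.min(), inliers.max()

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Fences: ({lower_fence}, {upper_fence})")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

With this data the low score of 60 falls below the lower fence, so it would be drawn as an individual point and the lower whisker would stop at 85.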

Here's an example in Python using the matplotlib library: