
In [2]:
import pandas as pd
import numpy as np
import math

# Descriptive versus inferential statistics


What comes to mind when you hear the word “statistics”? Is it calculating mean, median, mode, charts, bell curves, and other tools to describe data? This is the most commonly understood part of statistics, called descriptive statistics, and we use it to summarize data. After all, is it more meaningful to scroll through a million records of data or have it summarized? We will cover this area of statistics first.

Inferential statistics tries to uncover attributes about a larger population, often based on a sample. It is often misunderstood and less intuitive than descriptive statistics. Often we are interested in studying a group that is too large to observe (e.g., average height of adolescents in North America) and we have to resort to using only a few members of that group to infer conclusions about them. As you can guess, this is not easy to get right. After all, we are trying to represent a population with a sample that may not be representative. We will explore these caveats along the way.

**Descriptive Statistics**  
**Mean**

In [1]:
# Number of pets each person owns
sample = [1, 3, 2, 5, 7, 0, 2, 3]

mean = sum(sample) / len(sample)

print(mean) # prints 2.875

2.875


**Weighted Mean**  


In [3]:

sample = [90, 80, 63, 87]
weights = [.20, .20, .20, .40]
w_mean = math.fsum([w * x for w, x in zip(weights, sample)]) / math.fsum(weights)
print(w_mean)

81.4
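
Because the calculation divides by the sum of the weights, the weights do not have to add up to 1. As a minimal sketch (the helper name `weighted_mean` is introduced here only for illustration), the same values weighted 1, 1, 1, 2 should give the same result as the fractional weights above:

```python
import math

def weighted_mean(values, weights):
    # Sum of weight * value, normalized by the total weight
    return math.fsum(w * x for w, x in zip(weights, values)) / math.fsum(weights)

sample = [90, 80, 63, 87]

fractional = weighted_mean(sample, [.20, .20, .20, .40])
integer = weighted_mean(sample, [1, 1, 1, 2])

print(fractional, integer)
# Compare with a tolerance, since floating-point results may differ slightly
print(math.isclose(fractional, integer))
```

Only the ratio between the weights matters: the last value counts twice as much as each of the others in both cases.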


**Median**  
The median can be a helpful alternative to the mean when data is skewed by a few outliers.  
Here’s an interesting anecdote to understand why. In 1986, the mean annual starting salary of geography graduates from the University of North Carolina at Chapel Hill was $250,000.  
Other universities averaged $22,000. Wow, UNC-CH must have an amazing geography program!

But in reality, what was so lucrative about UNC’s geography program? Well…Michael Jordan was one of their graduates. One of the most famous NBA players of all time indeed graduated with a geography degree from UNC. However, he started his career playing basketball, not studying maps. Obviously, this is a confounding variable that created a huge outlier, and it heavily skewed the income average.  

**When your median is very different from your mean, the data is likely skewed, often by a few outliers.**
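
The effect of a single outlier can be sketched with made-up numbers. The salaries below are hypothetical, chosen only for illustration and not taken from the UNC data; `statistics.mean` and `statistics.median` come from Python's standard library.

```python
import statistics

# Hypothetical starting salaries (illustrative values, not real data)
typical_salaries = [21_000, 22_000, 22_500, 23_000, 24_000]

# The same group plus one extreme outlier
with_outlier = typical_salaries + [1_000_000]

print(statistics.mean(typical_salaries), statistics.median(typical_salaries))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

In this sketch the mean jumps from 22,500 to roughly 185,000 once the outlier is added, while the median only moves from 22,500 to 22,750.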

**THE MEDIAN IS A QUANTILE**  
Descriptive statistics also has the concept of quantiles. A quantile is essentially the same idea as the median, just cutting the data at places other than the middle. The median is the 50% quantile: the value that 50% of the ordered values fall below. The 25%, 50%, and 75% quantiles are known as quartiles because they cut the data in 25% increments.

In [4]:
# Number of pets each person owns
sample = [0, 1, 5, 7, 9, 10, 14]

def median(sample):
  # Sort a copy so the result does not depend on the input order
  ordered = sorted(sample)
  n = len(ordered)
  mid = n // 2
  if n % 2 == 0:
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2
  else:
    # Odd count: take the single middle value
    return ordered[mid]
  
print(median(sample))


7
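
The quartiles described earlier can be sketched with `statistics.quantiles` from the standard library. The choice of `method="inclusive"` is an assumption made here. Other interpolation conventions give different quartile values for small samples, so treat this as one reasonable definition rather than the only one.

```python
import statistics

# Same pet-count sample as in the median example
sample = [0, 1, 5, 7, 9, 10, 14]

# n=4 cuts the data into four parts and returns the three cut points
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

print(q1, q2, q3)

# The 50% quantile (second quartile) is the median
print(q2 == statistics.median(sample))
```

The middle cut point matches the median computed above, which is what "the median is a quantile" means in practice.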


**Mode**  
The mode is the most frequently occurring set of values. It primarily becomes useful when your data is repetitive and you want to find which values occur the most frequently.  

When no value occurs more than once, there is no mode. When two values occur with equal frequency, the dataset is considered bimodal. In the code below we calculate the mode for our pet dataset, and sure enough it is bimodal: both 2 and 3 occur most often, twice each.


In [8]:
# Number of pets each person owns
from collections import Counter

sample = [1, 3, 2, 5, 7, 0, 2, 3]
def mode(sample):
  # Count how many times each value occurs
  frequency_counter = Counter(sample)
  max_count = max(frequency_counter.values())
  # If no value repeats, there is no mode
  if max_count == 1:
    return None
  # Return every value tied for the highest count, in ascending order
  return sorted(value for value, count in frequency_counter.items() if count == max_count)

print(mode(sample))


[2, 3]
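
As a cross-check, the standard library offers `statistics.multimode`. This is a sketch of an alternative, not a drop-in replacement, because its conventions differ from the hand-written `mode()` above: it returns the tied values in the order they first appear, and it returns every value when nothing repeats instead of `None`.

```python
import statistics

sample = [1, 3, 2, 5, 7, 0, 2, 3]

# multimode returns every value tied for the highest count,
# in the order each value first appears in the data
modes = statistics.multimode(sample)
print(sorted(modes))

# When each value occurs exactly once, multimode returns all of them
no_repeats = statistics.multimode([4, 8, 15])
print(no_repeats)
```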
