# Lab 01.04 Summary Statistics

# Set Up Your Data

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### Side note: 
<img src="https://miro.medium.com/max/2550/1*6d5dw6dPhy4vBp2vRW6uzw.png" width="500" height="100" />

**What is pandas, and why are we importing it?**

It's a Python library that helps us organize our data set into an object called a DataFrame (or `df` for short), which makes the data easy to access.

<img src="https://matplotlib.org/3.1.1/_images/sphx_glr_sample_plots_thumb.png">

**What is matplotlib, and why are we importing it?**

It's a Python library that helps us plot graphs.

<img src="https://revisionworld.com/sites/revisionworld.com/files/imce/sdeviation2.gif">

**What is numpy, and why are we importing it?**

It's a Python library that helps us do mathematical calculations.

You'll learn more about these libraries at the end of this unit.

## Data

Here are two small lists that you can use as test data while you develop your code.

💻 **For the odd data set, put in 7 elements. Each number is the amount of sleep per night you got in the past week.**

In [65]:
sleep_data = [6,6,7,7,5,6,6]
even_data = [100,200,300,400,500,600]

## Sorting the data
Often, we will want our data to be sorted. This will help us search through the data and will also help us calculate statistics like the median.

We can easily accomplish this with the `sorted()` function:

In [66]:
odd_data = sorted(sleep_data)
even_data = sorted(even_data)

odd_data

[5, 6, 6, 6, 6, 7, 7]
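
A small side sketch, not part of the lab's required code: `sorted()` builds a new list and leaves the original untouched, while the list method `.sort()` reorders the list in place and returns `None`. The names `nights`, `ordered`, and `nights_copy` are hypothetical and used only for this illustration.

```python
# Hypothetical example data (same values as sleep_data above)
nights = [6, 6, 7, 7, 5, 6, 6]

# sorted() returns a new list; nights keeps its original order
ordered = sorted(nights)

# .sort() changes the list itself and returns None
nights_copy = list(nights)
result = nights_copy.sort()

print(ordered)
print(nights)
print(nights_copy)
print(result)
```

This is why the lab stores the result of `sorted(sleep_data)` in a new variable instead of expecting `sleep_data` itself to change.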

# Calculate The Central Tendency, aka Middle

👀  **The median, mean, and mode are calculations we can use to identify the middle of a dataset. They'll give us slightly different answers, depending on the data set.**

<img src = "https://www.payscale.com/data/wp-content/uploads/sites/8/2018/06/mode-median-mean.gif">

## The mean

First, we'll write a function to calculate the mean of a dataset. Remember that the mean is the central value of a discrete set of numbers: specifically, **the sum of the values divided by the number of values.**

💻 Write the mean function and any helper functions you'll need.

In [67]:
def calculate_mean(list_of_numbers):
    """
    Returns the average (mean) of a list of numbers.
    input: list of ints (or floats)
    output: float
    """
    return sum(list_of_numbers)/len(list_of_numbers)

✅✅ Test: Use your function to calculate the mean of the small data list. Is it the same as your hand calculations?

In [68]:
calculate_mean(odd_data)

6.142857142857143

In [69]:
calculate_mean(even_data)

350.0
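
One way to check the result, sketched here under the assumption that Python's built-in `statistics` module is available: compare the function's answer with the hand calculation (43 / 7) and with `statistics.mean`. The function is repeated so the example runs on its own.

```python
import math
import statistics

def calculate_mean(list_of_numbers):
    """Same definition as the lab: sum of the values divided by the number of values."""
    return sum(list_of_numbers) / len(list_of_numbers)

odd_data = [5, 6, 6, 6, 6, 7, 7]

by_hand = (5 + 6 + 6 + 6 + 6 + 7 + 7) / 7   # 43 / 7
ours = calculate_mean(odd_data)
library = statistics.mean(odd_data)

# Compare with a tolerance, since floating-point results can differ in the last digit
print(math.isclose(ours, by_hand))
print(math.isclose(ours, library))
```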

## The median

Next, we'll calculate the median of our data. As a reminder, the median is the middle number in a list of numbers sorted in ascending or descending order.

*Tip: consider the possible cases of our data. How do you calculate the median of an even number of data points? Of an odd number of data points?*

💻 Write the median function and any helper functions you'll need:

In [73]:
def calculate_median(sorted_list):
    """
    Takes the sorted list, returns the median of the list
    input: list of ints(or floats)
    output: integer (or float)
    """
    if len(sorted_list)%2 == 0:
        left = sorted_list[len(sorted_list)//2-1]
        right = sorted_list[len(sorted_list)//2]
        return calculate_mean([left, right])
    return sorted_list[len(sorted_list)//2]

✅✅ Test: Use your function to calculate the median of the small data list. Is it the same as your hand calculations?

In [74]:
calculate_median(odd_data)

6

In [75]:
calculate_median(even_data)

350.0
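
The sketch below illustrates why the function expects a sorted list. It repeats the same logic as `calculate_median` so that it runs on its own, then compares the result on the unsorted `sleep_data` with the result on the sorted data and with `statistics.median` from Python's standard library (which sorts internally).

```python
import statistics

def calculate_median(sorted_list):
    """Same logic as the lab's function; it assumes the list is already sorted."""
    middle = len(sorted_list) // 2
    if len(sorted_list) % 2 == 0:
        return (sorted_list[middle - 1] + sorted_list[middle]) / 2
    return sorted_list[middle]

sleep_data = [6, 6, 7, 7, 5, 6, 6]   # unsorted, as originally entered

unsorted_result = calculate_median(sleep_data)          # just the element at index 3
sorted_result = calculate_median(sorted(sleep_data))    # the actual median
library_result = statistics.median(sleep_data)          # sorts internally

print(unsorted_result, sorted_result, library_result)
```

On the unsorted list the function simply returns whatever sits in the middle position, which is not the median, so always sort first.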

## The mode

The mode is the value that appears most often in the dataset.

💻 Write the mode function and the helper function, `create_counts_dict`.

<img src = "https://mediadc.brightspotcdn.com/dims4/default/a12d362/2147483647/strip/true/crop/2400x1264+0+0/resize/2400x1264!/quality/90/?url=https%3A%2F%2Fmediadc.brightspotcdn.com%2F14%2F0d%2F29f5ea424f6fa447310a14d6362c%2F03-skinner-openbooks.jpg" style="width:500px">

**Dictionary alert: You may want to use a dictionary to solve this problem.**

In [76]:
def create_counts_dict(list_of_numbers):
    """
    Returns a dictionary that contains each item of the list (as a key) and how often they appear (as the value)
    """
    counts = {}
    for element in list_of_numbers:
        if element not in counts:
            counts[element] = 1
        else:
            counts[element] += 1
    return counts

def calculate_mode(list_of_numbers):
    """
    Returns the number that occurred most often in the list of numbers. 
    If there is more than one mode (or no mode), the first one to appear in the list is returned.
    input: list of ints (or floats)
    output: int (or float)
    """
    counts = create_counts_dict(list_of_numbers)
    max_count = -1
    max_key = None
    for key, count in counts.items():
        if count > max_count:
            max_count = count
            max_key = key
    return max_key

✅✅ Test: Use your function to calculate the mode of the small data list. Is it the same as your hand calculations?

In [77]:
create_counts_dict(odd_data)

{5: 1, 6: 4, 7: 2}

In [78]:
calculate_mode(odd_data)

6

In [79]:
calculate_mode(even_data)

100
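
As an alternative to writing the counting loop by hand, Python's standard library has `collections.Counter`, which builds the same kind of counts dictionary. This is only a sketch of another approach, not a replacement for the exercise; the variable names `mode_odd` and `mode_even` are made up for illustration.

```python
from collections import Counter

odd_data = [5, 6, 6, 6, 6, 7, 7]
even_data = [100, 200, 300, 400, 500, 600]

counts = Counter(odd_data)               # behaves like a dictionary of value -> count
mode_odd = counts.most_common(1)[0][0]   # most_common(1) returns [(value, count)]

# Every value in even_data appears once, so there is a tie;
# most_common keeps tied values in the order they were first seen
mode_even = Counter(even_data).most_common(1)[0][0]

print(dict(counts))
print(mode_odd, mode_even)
```

Like `calculate_mode`, this returns the first value seen when several values are tied, so `even_data` (where nothing repeats) still reports a "mode" even though no value really stands out.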

# Now let's calculate Spread

<img src = "https://soupsahoy.files.wordpress.com/2015/01/hk-french-toast_2.jpg">

##  What is spread?

👀 You can have it with toast and statistics.

It turns out that the central tendency of the dataset doesn't tell us the whole story.

For example, consider the red dataset and the blue dataset. They have the same middle (both mean and median = 7), but one has much higher spread than the other.

**Which dataset is your sleep schedule more like? What kind of spread do you have?**

<img src="https://i.imgur.com/hw9yRAK.png" style="width:500px">
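
To make the idea concrete, here is a minimal sketch with two invented datasets (they are not the red and blue datasets from the image). Both have a mean and median of 7, but the range (largest value minus smallest value) shows how differently they are spread.

```python
import statistics

# Hypothetical datasets, invented for illustration
steady_sleeper = [6, 7, 7, 7, 8]
erratic_sleeper = [2, 4, 7, 10, 12]

for name, data in [("steady", steady_sleeper), ("erratic", erratic_sleeper)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    value_range = max(data) - min(data)   # simplest measure of spread
    print(name, mean, median, value_range)
```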

*(Optional) Explore mean and median [here](https://teacher.desmos.com/activitybuilder/custom/5733262bfd802215069a40c0#preview/c150c31c-cac4-4b2b-8753-d0aea1ce3aa1)*

## Interquartile range (IQR)

👀 Quartiles divide the sorted data points into four parts, or quarters, of more-or-less equal size.

The lower quartile is the first quarter of the data, and the upper quartile is the last quarter of the data.

The IQR is equal to the value at the bottom of the upper quartile (Q3) minus the value at the top of the lower quartile (Q1). **This number tells us how far apart the upper and lower quartiles are, which is the width of the middle half of the data.**

Calculating the IQR helps us understand the spread of our data because it allows us to see how large the gap is between the lower quartile of our data and the upper quartile of our data. The larger the gap, the more our data must be spread out.

<img src = "https://www.simplypsychology.org/IQR.jpg">

### List slicing
Since the IQR uses smaller parts of a dataset, we're going to need to find a way to get specific sections of a list. We can accomplish this with list *slicing*.

In [118]:
odd_data

[5, 6, 6, 6, 6, 7, 7]

To get a particular section of a list, we can put two numbers into the `[]` operator separated by a `:`:

`mylist[start_index:stop_index]`

In [119]:
odd_data[1:3]

[6, 6]

*Notice that the element at index 3 is not included in the slice; only the elements at indices 1 and 2 are.*

To get everything from the beginning of a list up to (but not including) a particular index, you can leave out the first number:

In [120]:
odd_data[:3]

[5, 6, 6]

To get everything from a particular index (inclusive) to the end of a list, you can leave out the second number:

In [121]:
odd_data[3:]

[6, 6, 7, 7]
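
A few more slicing patterns, shown as a short sketch on the same list. Negative indices count from the end of the list, and a slice always produces a new list, so changing the slice does not change the original.

```python
odd_data = [5, 6, 6, 6, 6, 7, 7]

last_two = odd_data[-2:]        # negative indices count from the end
all_but_last = odd_data[:-1]    # everything except the final element
full_copy = odd_data[:]         # leaving out both numbers copies the whole list

full_copy.append(99)            # changing the copy does not change odd_data

print(last_two)
print(all_but_last)
print(odd_data)
```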

### Find quartiles

In order to find the IQR, first we need to be able to find the top value of the lower quartile (Q1) and the bottom value of the upper quartile (Q3) of our data.

💻 Write functions to find Q1 below:

In [122]:
def below_median(sorted_list):
    """
    Returns just the part of the sorted list which is below the median.
    input: list of ints (or floats)
    output: list of ints (or floats)
    """
    first_half_end_index = len(sorted_list) // 2
    return sorted_list[:first_half_end_index]

def calculate_Q1(sorted_list):
    """
    Takes the sorted list, returns the lower quartile of the list,
    which is defined as the median of the data points to the left of the median
    input: list of ints (or floats)
    output: integer (or float)
    """
    
    first_half = below_median(sorted_list)
    return calculate_median(first_half)

💻 Write functions to find Q3 below:

In [108]:
def above_median(sorted_list):
    """
    Returns just the part of the sorted list which is above the median.
    input: list of ints (or floats)
    output: list of ints (or floats)
    """
    middle = len(sorted_list) // 2
    if len(sorted_list) % 2 == 0:  # even length: the upper half starts at the middle index
        last_half_start_index = middle
    else:  # odd length: skip the median element itself
        last_half_start_index = middle + 1
    return sorted_list[last_half_start_index:]

def calculate_Q3(sorted_list):
    """
    Takes the sorted list, returns the upper quartile of the list,
    which is defined as the median of the data points to the right of the median.
    input: list of ints(or floats)
    output: integer (or float)
    """
    last_half_list = above_median(sorted_list)
    return calculate_median(last_half_list)


### Calculating IQR

Now that we can find Q1 and Q3, we can easily calculate the difference between these two values in order to find the IQR.

💻 Write the function `calculate_IQR` below:

In [109]:
def calculate_IQR(sorted_list):
    """
    Takes the sorted list, calculates Q1 and Q3, and returns the difference between Q3 and Q1, which is the interquartile range. 
    input: list of ints(or floats)
    output: integer (or float)
    """
    q1 = calculate_Q1(sorted_list)
    q3 = calculate_Q3(sorted_list)
    return q3-q1


✅✅ Test: Use your function to calculate the interquartile range of the small data list. Is it the same as your hand calculations?

In [110]:
calculate_IQR(odd_data)

1

In [111]:
calculate_IQR(even_data)

300
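
The "median of each half" rule used in this lab is one of several conventions for finding quartiles, and other tools may give slightly different answers on small datasets. As a hedged illustration, the sketch below uses `statistics.quantiles` from Python's standard library (available in Python 3.8 and later) with its two built-in methods. The helper name `iqr_from_quantiles` is made up for this example.

```python
import statistics

odd_data = [5, 6, 6, 6, 6, 7, 7]

def iqr_from_quantiles(data, method):
    """IQR using statistics.quantiles; n=4 asks for the three quartile cut points."""
    q1, _q2, q3 = statistics.quantiles(data, n=4, method=method)
    return q3 - q1

iqr_exclusive = iqr_from_quantiles(odd_data, "exclusive")
iqr_inclusive = iqr_from_quantiles(odd_data, "inclusive")

print(iqr_exclusive, iqr_inclusive)
```

On `odd_data`, the `"exclusive"` method agrees with the lab's answer of 1, while the `"inclusive"` method interpolates between neighboring values and gives 0.5. Neither is wrong; they follow different definitions, so it is worth knowing which one a tool uses.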

## Percentiles

👀 A percentile is a value at or below which a given percentage of all the values in the dataset fall.

**This statistic helps you understand how you rank compared to everyone else. For example, if you are at the 80th percentile for height, that means you are as tall as or taller than 80% of people.**

<img src="https://www.mathsisfun.com/data/images/percentile-80.svg">

You just calculated the value at the 25th percentile (Q1), the value at the 50th percentile (the median/Q2), and the value at the 75th percentile (Q3).

**These are very useful statistics, but what about the other 97 percentiles?**


<img src= "http://my.ilstu.edu/~gjin/hsc204-eh/Module-5-Summary-Measure-2/Figure-3-8.png">

**We could create a function ourselves to calculate the percentile...**

**But let's check out the numpy library's `percentile` function.**

#### Check out this link: https://stackoverflow.com/questions/2374640/how-do-i-calculate-percentiles-with-python-numpy

💻 Use this function to calculate what the 90th percentile would be for the small odd data list:

In [123]:
np.percentile(odd_data, 90)

7.0

💻 What about the 10th percentile?

In [124]:
np.percentile(odd_data, 10)

5.6
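
Where does 5.6 come from when it isn't even in the list? By default, `np.percentile` uses linear interpolation: it finds the fractional position of the percentile along the sorted list and blends the two neighboring values. The function below is a simplified sketch of that idea in plain Python, not numpy's actual implementation; `percentile_linear` is a hypothetical name.

```python
import math

def percentile_linear(sorted_list, percent):
    """
    Hypothetical helper: percentile by linear interpolation between neighbors.
    input: sorted list of ints (or floats), percent between 0 and 100
    output: float
    """
    # Fractional position of the percentile along the indices 0 .. n-1
    position = (percent / 100) * (len(sorted_list) - 1)
    lower_index = math.floor(position)
    upper_index = math.ceil(position)
    fraction = position - lower_index
    lower_value = sorted_list[lower_index]
    upper_value = sorted_list[upper_index]
    # Move 'fraction' of the way from the lower neighbor to the upper neighbor
    return float(lower_value + fraction * (upper_value - lower_value))

odd_data = [5, 6, 6, 6, 6, 7, 7]
tenth = percentile_linear(odd_data, 10)
ninetieth = percentile_linear(odd_data, 90)
print(tenth, ninetieth)
```

For the 10th percentile, the position is 0.1 × 6 = 0.6, which is 60% of the way from index 0 (value 5) to index 1 (value 6), giving approximately 5.6. For the 90th percentile, the position 5.4 falls between two 7s, so the result is 7.0.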