
# Descriptive Statistics

## An Example of a Generator

Generators produce values lazily, one at a time, using the `yield` statement.

Calling the following function does not require the entire file (or even the entire column) to be read into memory; instead, each calorie value is read as needed.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# create a generator function that reads in
# the calorie column
def get_calories():
    with open('/content/drive/MyDrive/starbucks_drinkMenu_expanded.csv', 'r') as f:
        next(f)  # skip the header row
        for line in f:
            line_parts = line.split(',')
            # the calorie value is the fourth field (index 3)
            yield int(line_parts[3])
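
Splitting on commas works as long as no field contains a quoted comma. As a hedged alternative, the sketch below uses the standard library's `csv` module and reads from an in-memory string so that it runs without the Drive file. The sample rows, the column names in `SAMPLE_CSV`, and the `read_column` helper are made up for illustration and are not taken from the real data set.

```python
import csv
import io

# Hypothetical sample data standing in for the real CSV file
SAMPLE_CSV = """Category,Beverage,Prep,Calories
Coffee,"Brewed Coffee, Short",Short,3
Coffee,Brewed Coffee,Tall,4
Espresso,Caffe Latte,Short Nonfat Milk,70
"""

def read_column(lines, index):
    """Lazily yield one integer column from an iterable of CSV lines."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        yield int(row[index])

calorie_gen = read_column(io.StringIO(SAMPLE_CSV), 3)
first = next(calorie_gen)  # only the header and the first data row have been parsed
rest = list(calorie_gen)   # consume the remaining rows
print(first, rest)
```

Because `csv.reader` understands quoting, the comma inside `"Brewed Coffee, Short"` does not shift the column index.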


## Descriptive Statistics

### Max, Min, and Len

It may be useful to describe a data set by:

* the number of data points
* the highest and lowest value

Python has built-in functions for this: `max`, `min`, and `len`.

In [None]:
# max and min can actually take a generator 
max(get_calories())

510

In [None]:
min(get_calories())

0

A generator is not actually a _collection_ of elements, so you can't use `len` on it. Instead, you'll have to turn your generator into a collection...

In [None]:
# if we want to work with all values from our generator, we can convert it to a list
# (that means all values are held in memory, though)
calories = list(get_calories())

In [None]:
# now it's possible to get the length of our data set
len(calories)

242
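
If holding every value in memory is a concern, the items can also be counted lazily. The following is a minimal sketch using a hypothetical `squares` generator (not part of the notebook); it also shows that a generator is exhausted after a single pass, which is why each call above creates a fresh one with `get_calories()`.

```python
def squares(n):
    """Hypothetical generator used only for this illustration."""
    for i in range(n):
        yield i * i

# len() does not work on a generator object
try:
    len(squares(5))
except TypeError as err:
    error_message = str(err)

gen = squares(5)

# count the items without building a list
count = sum(1 for _ in gen)

# the generator is now exhausted, so a second pass yields nothing
leftover = list(gen)

print(error_message)
print(count, leftover)
```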

In [None]:
# because it's a list we can view the first 10 values with slicing
calories[:10]

[3, 4, 5, 5, 70, 100, 70, 100, 150, 110]

In [None]:
# ...and the last 10 values
calories[-10:]

[230, 260, 240, 310, 350, 320, 170, 200, 180, 240]

### Central Tendency

Two methods of determining where our data set is centered are:

1. mean
2. median
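
For reference, the standard library's `statistics` module provides both measures. The sketch below applies them to the first ten calorie values shown earlier; the variable names are illustrative only.

```python
import statistics

# the first ten calorie values from the slice shown earlier
sample = [3, 4, 5, 5, 70, 100, 70, 100, 150, 110]

sample_mean = statistics.mean(sample)      # sum of the values divided by their count
sample_median = statistics.median(sample)  # sorts internally, averages the middle two if needed

print(sample_mean, sample_median)
```

The cells that follow compute the same two measures by hand, which makes the mechanics easier to see.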

In [None]:
# calculating the mean
sum(calories) / len(calories)

193.87190082644628
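
The mean can also be computed in a single pass over a generator, without first building a list. This is a sketch under the assumption that only the running total and count need to be kept; `running_mean` is a hypothetical helper, not something defined in the notebook.

```python
def running_mean(values):
    """Compute the mean of any iterable in one pass, keeping only two numbers in memory."""
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    if count == 0:
        raise ValueError("mean of an empty sequence is undefined")
    return total / count

# a generator expression stands in for a generator such as get_calories()
result = running_mean(x for x in [3, 4, 5, 5, 70])
print(result)
```

The median has no equally simple one-pass form, because it depends on the order of all values.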

In [None]:
# if we need the median, we'll have to sort first
sorted_calories = sorted(calories)

In [None]:
# calculating the median (expects data that is already sorted)
# if there is an even number of elements, we'll have to take the average of the middle two

def median(d):
    middle_index = len(d) // 2
    if len(d) % 2 == 0:
        # for an even length, the two middle values sit at middle_index - 1 and middle_index
        return (d[middle_index - 1] + d[middle_index]) / 2
    else:
        return d[middle_index]


In [None]:
median(sorted_calories)

185.0

In [None]:
# note that outliers may not affect the median, whereas they can throw off the mean!

copy_sorted_calories = sorted_calories[:]

# change the last value...
copy_sorted_calories[-1] = 200000

In [None]:
sum(copy_sorted_calories) / len(copy_sorted_calories)

1018.2107438016529

In [None]:
median(copy_sorted_calories)

185.0

In [None]:
# adding / removing several values that aren't outliers may make the median jump,
# whereas the mean may only change slightly

In [None]:
# prepend twenty 150s, then re-sort so that median() still receives sorted data
copy_sorted_calories = sorted([150] * 20 + sorted_calories)

In [None]:
sum(copy_sorted_calories) / len(copy_sorted_calories)

190.5229007633588

In [None]:
median(copy_sorted_calories)

180.0

In [None]:
# note that 190 appears many times near the middle of the sorted data, so the median
# is hard to move without adding several values, as we did above
sorted_calories.count(190)

11

In [None]:
from collections import Counter
Counter(calories)

Counter({0: 4,
         3: 1,
         4: 1,
         5: 4,
         10: 2,
         15: 1,
         25: 1,
         50: 2,
         60: 4,
         70: 3,
         80: 9,
         90: 6,
         100: 10,
         110: 9,
         120: 10,
         130: 10,
         140: 5,
         150: 11,
         160: 8,
         170: 9,
         180: 11,
         190: 11,
         200: 10,
         210: 7,
         220: 7,
         230: 6,
         240: 9,
         250: 4,
         260: 8,
         270: 4,
         280: 7,
         290: 9,
         300: 2,
         310: 8,
         320: 3,
         330: 2,
         340: 4,
         350: 5,
         360: 1,
         370: 3,
         380: 1,
         390: 2,
         400: 1,
         420: 1,
         430: 1,
         450: 2,
         460: 2,
         510: 1})
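
A `Counter` also makes it easy to find the most frequent value or values. The sketch below uses the last ten calorie values shown earlier; `most_common` is part of `collections.Counter`, while the variable names are illustrative.

```python
from collections import Counter

# the last ten calorie values from the slice shown earlier
tail = [230, 260, 240, 310, 350, 320, 170, 200, 180, 240]

counts = Counter(tail)

# most_common(1) returns a list holding one (value, count) pair
top_value, top_count = counts.most_common(1)[0]

# collect every value tied for the highest count
modes = sorted(value for value, count in counts.items() if count == top_count)

print(top_value, top_count, modes)
```

Checking for ties matters on the full data set, where several values share the highest count of 11.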

In [None]:
import pandas as pd
starbucks = pd.read_csv('/content/drive/MyDrive/starbucks_drinkMenu_expanded.csv')

In [None]:
descriptives = starbucks.describe()
print(type(descriptives))
descriptives.to_csv("starbucks_descriptives.csv")

<class 'pandas.core.frame.DataFrame'>


In [None]:
descriptives

| | Calories | Trans Fat (g) | Saturated Fat (g) | Sodium (mg) | Total Carbohydrates (g) | Cholesterol (mg) | Dietary Fibre (g) | Sugars (g) | Protein (g) |
|---|---|---|---|---|---|---|---|---|---|
| count | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 |
| mean | 193.871901 | 1.307025 | 0.037603 | 6.363636 | 128.884298 | 35.991736 | 0.805785 | 32.96281 | 6.978512 |
| std | 102.863303 | 1.640259 | 0.071377 | 8.630257 | 82.303223 | 20.795186 | 1.445944 | 19.730199 | 4.871659 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 120.0 | 0.1 | 0.0 | 0.0 | 70.0 | 21.0 | 0.0 | 18.0 | 3.0 |
| 50% | 185.0 | 0.5 | 0.0 | 5.0 | 125.0 | 34.0 | 0.0 | 32.0 | 6.0 |
| 75% | 260.0 | 2.0 | 0.1 | 10.0 | 170.0 | 50.75 | 1.0 | 43.75 | 10.0 |
| max | 510.0 | 9.0 | 0.3 | 40.0 | 340.0 | 90.0 | 8.0 | 84.0 | 20.0 |


In [None]:
descriptives['Calories']

count    242.000000
mean     193.871901
std      102.863303
min        0.000000
25%      120.000000
50%      185.000000
75%      260.000000
max      510.000000
Name: Calories, dtype: float64
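
The quartile and `std` rows can be cross-checked with the standard library. This is a sketch on the first ten calorie values only, assuming (to the best of my understanding) that `describe()` uses linear interpolation for percentiles and the sample standard deviation, which correspond to `method="inclusive"` in `statistics.quantiles` and to `statistics.stdev`.

```python
import statistics

# the first ten calorie values from the slice shown earlier
sample = [3, 4, 5, 5, 70, 100, 70, 100, 150, 110]

# n=4 splits the data into quartiles; "inclusive" interpolates linearly
# between data points, treating the sample minimum and maximum as the extremes
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

# sample standard deviation (divides by n - 1 rather than n)
sample_std = statistics.stdev(sample)

print(q1, q2, q3)
print(sample_std)
```

The middle quartile `q2` is the median, which is why the `50%` row of the table is the median of each column.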

In [None]:
starbucks['Calories']

0        3
1        4
2        5
3        5
4       70
      ... 
237    320
238    170
239    200
240    180
241    240
Name: Calories, Length: 242, dtype: int64