# Descriptive Statistics, Generator Practice



## Breaking Up Data Processing

Let's write a few reusable functions that help read and process data:

1. reading a file into a sequence of lines
2. extracting a column
3. converting a column to a type

In [4]:
def readFile(fn):
    # iterate over the file object lazily instead of loading every line with readlines()
    return (line.strip() for line in open(fn, 'r'))

In [33]:
def parseData(data, parser):
    return (parser(line) for line in data)

In [39]:
def extractCol(parsedData, idx):
    return (line[idx] for line in parsedData)

def extractCols(parsedData, idxs):
    return (tuple(line[idx] for idx in idxs) for line in parsedData)

In [35]:
def convertVals(col, fn):
    return (fn(val) for val in col)

In [8]:
import sys
data = readFile('starbucks_drinkMenu_expanded.csv')
print('data', sys.getsizeof(data))

parsedData = parseData(data, lambda line: line.split(','))
print('parsedData', sys.getsizeof(parsedData))

column = extractCol(parsedData, 3)
print('column', sys.getsizeof(column))

vals = convertVals(column, lambda val: int(val) if val.isnumeric() else None)
print('vals', sys.getsizeof(vals))
max((v for v in vals if v is not None))

FileNotFoundError: [Errno 2] No such file or directory: 'starbucks_drinkMenu_expanded.csv'
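
Because the CSV file was not available when the cell above ran, here is a minimal sketch of the same generator pipeline on an in-memory sample. The helper names and the sample rows are assumptions made for illustration (they are not taken from the real data set), and `io.StringIO` stands in for the opened file.

```python
import io

def read_lines(handle):
    # yields stripped lines lazily from any iterable of strings
    return (line.strip() for line in handle)

def parse_data(data, parser):
    return (parser(line) for line in data)

def extract_col(parsed, idx):
    return (row[idx] for row in parsed)

def convert_vals(col, fn):
    return (fn(val) for val in col)

# hypothetical sample rows; not taken from the real CSV
sample = io.StringIO(
    "category,drink,prep,calories\n"
    "Coffee,Brewed Coffee,Short,3\n"
    "Espresso,Caffe Latte,Tall,150\n"
    "Espresso,Caffe Mocha,Grande,260\n"
)

lines = read_lines(sample)
rows = parse_data(lines, lambda line: line.split(','))
column = extract_col(rows, 3)
vals = convert_vals(column, lambda v: int(v) if v.isnumeric() else None)

# no line has been consumed yet; max() pulls values through the whole chain
highest = max(v for v in vals if v is not None)
print(highest)  # 260
```

The header row fails `isnumeric()`, becomes `None`, and is filtered out before `max` sees it.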

## Reproducing Max Cal Drink, Generators Only

In [10]:
data = (line.strip() for line in open('starbucks-menu-nutrition-drinks.csv'))
parsedData = (line.split(',') for line in data)
filtered = (row[:2] for row in parsedData if row[1].isnumeric())
max(filtered, key=lambda t: int(t[1]))

FileNotFoundError: [Errno 2] No such file or directory: 'starbucks-menu-nutrition-drinks.csv'

## Example Generator

Calling the following function does not require the entire contents of the file (or even the entire column) to be read into memory; instead, each calorie value is read as it is needed.

In [2]:

# generator function that reads the calorie column one value at a time
def get_calories():
    with open('starbucks_drinkMenu_expanded.csv', 'r') as f:
        next(f)
        for line in f:
            line_parts = line.split(',')
            yield int(line_parts[3])
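
To see that values really are produced on demand, the following sketch takes only the first three values from a generator of the same shape. The function name `get_calories_from` and the sample text are hypothetical; the last row is deliberately malformed to show that it is never parsed unless someone asks for it.

```python
import io
from itertools import islice

# hypothetical stand-in for the CSV file; values are made up
SAMPLE = "name,calories\nA,3\nB,70\nC,150\nD,not-a-number\n"

def get_calories_from(handle):
    next(handle)  # skip the header row
    for line in handle:
        yield int(line.split(',')[1])

gen = get_calories_from(io.StringIO(SAMPLE))

# islice stops after three values, so the malformed fourth row is never read
first_three = list(islice(gen, 3))
print(first_three)  # [3, 70, 150]
```

Asking the generator for a fourth value would raise a `ValueError`, because only then is the malformed row converted.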


## Descriptive Statistics

### Max, Min, and Len

It may be useful to describe a data set by:

* the number of data points
* the highest and lowest value

Python has built-in functions for this: `max`, `min`, and `len`.

In [4]:
# max and min can actually take a generator 
max(get_calories())

510

In [5]:
min(get_calories())

0

A generator is not actually a _collection_ of elements, so you can't use `len` on it. Instead, you'll have to turn your generator into a collection...

In [6]:
# if we want to work with all values from our generator, we can convert it to a list
# (that means all values are held in memory, though)
calories = list(get_calories())

In [7]:
# now it's possible to get the length of our data set
len(calories)

242
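
If only the count is needed, one alternative is to count the elements as they stream past, without building a list. This is a sketch with a made-up generator (`squares` is hypothetical and stands in for `get_calories()`); it also shows that a generator can be consumed only once.

```python
def squares(n):
    # hypothetical generator standing in for get_calories()
    for i in range(n):
        yield i * i

# count elements without storing them
count = sum(1 for _ in squares(242))
print(count)  # 242

# a generator is single-use: once consumed, it yields nothing more
gen = squares(5)
total = sum(gen)      # 0 + 1 + 4 + 9 + 16 = 30
leftover = list(gen)  # []
print(total, leftover)  # 30 []
```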

In [8]:
# because it's a list we can view the first 10 values with slicing
calories[:10]

[3, 4, 5, 5, 70, 100, 70, 100, 150, 110]

In [9]:
# ...and the last 10 values
calories[-10:]

[230, 260, 240, 310, 350, 320, 170, 200, 180, 240]

### Central Tendency

Two methods of determining where our data set is centered are:

1. mean
2. median

In [10]:
# calculating the mean
sum(calories) / len(calories)

193.87190082644628
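
The mean can also be computed in a single pass over a generator, keeping only a running total and a count in memory. A minimal sketch follows; `running_mean` is a hypothetical helper, and the sample uses the first ten calorie values shown earlier.

```python
def running_mean(values):
    # accumulate the sum and the count in one pass over any iterable
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("mean of empty data")
    return total / count

sample = (v for v in [3, 4, 5, 5, 70, 100, 70, 100, 150, 110])
result = running_mean(sample)
print(result)  # 61.7
```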

In [11]:
# if we need the median, we'll have to sort first
sorted_calories = sorted(calories)

In [12]:
# calculating the median
# if there is an even number of elements, we'll have to take average of middle two

def median(d):
    # d must already be sorted
    middle_index = len(d) // 2
    if len(d) % 2 == 0:
        # even length: the two middle values are at middle_index - 1 and middle_index
        return (d[middle_index - 1] + d[middle_index]) / 2
    else:
        return d[middle_index]


In [13]:
median(sorted_calories)

185.0

In [14]:
# note that outliers may not affect the median, whereas they can throw off the mean!

copy_sorted_calories = sorted_calories[:]

# change the last value...
copy_sorted_calories[-1] = 200000

In [15]:
sum(copy_sorted_calories) / len(copy_sorted_calories)

1018.2107438016529

In [16]:
median(copy_sorted_calories)

185.0

In [17]:
# on the other hand, adding or removing several values that aren't outliers may make
# the median jump, whereas the mean may only change slightly

In [18]:
# re-sort after adding the new values, since median() expects sorted input
copy_sorted_calories = sorted([150] * 20 + sorted_calories)

In [19]:
sum(copy_sorted_calories) / len(copy_sorted_calories)

190.5229007633588

In [20]:
median(copy_sorted_calories)

180.0

In [21]:
# note that many values cluster around the middle of the data (190 alone occurs 11 times),
# so the median is tough to shift without adding several values as we did above
sorted_calories.count(190)

11
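
A hand-written median depends on two details: the input must already be sorted, and for an even number of elements the two middle values sit at indices `n // 2 - 1` and `n // 2`. The sketch below checks that indexing against `statistics.median` from the standard library; the helper name `median_sorted` and the sample lists are made up for illustration.

```python
import statistics

def median_sorted(d):
    # assumes d is already sorted and non-empty
    mid = len(d) // 2
    if len(d) % 2 == 0:
        return (d[mid - 1] + d[mid]) / 2
    return d[mid]

odd = [1, 3, 5]
even = [1, 3, 5, 7]
print(median_sorted(odd))   # 3
print(median_sorted(even))  # 4.0

# statistics.median sorts internally, so it also accepts unsorted input
shuffled = [7, 1, 5, 3]
print(statistics.median(shuffled))  # 4.0
```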

In [32]:
# easier to calculate all of these using numpy or pandas
import numpy as np

In [33]:
np.mean(calories)

193.87190082644628

In [34]:
np.median(calories)

185.0

In [35]:
np.max(calories)

510

In [36]:
np.min(calories)

0

In [38]:
# numpy has no mode function, but scipy does
# (when several values tie for most frequent, only one of them is returned)
from scipy import stats
stats.mode(calories)

ModeResult(mode=array([150]), count=array([11]))

In [39]:
from collections import Counter
Counter(calories)

Counter({3: 1,
         4: 1,
         5: 4,
         70: 3,
         100: 10,
         150: 11,
         110: 9,
         130: 10,
         190: 11,
         170: 9,
         240: 9,
         200: 10,
         180: 11,
         220: 7,
         260: 8,
         230: 6,
         280: 7,
         340: 4,
         290: 9,
         160: 8,
         250: 4,
         210: 7,
         320: 3,
         270: 4,
         10: 2,
         15: 1,
         25: 1,
         50: 2,
         80: 9,
         60: 4,
         90: 6,
         120: 10,
         140: 5,
         300: 2,
         310: 8,
         350: 5,
         400: 1,
         370: 3,
         450: 2,
         510: 1,
         460: 2,
         380: 1,
         330: 2,
         360: 1,
         0: 4,
         390: 2,
         420: 1,
         430: 1})
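
The `Counter` output shows that 150, 180, and 190 each occur 11 times, so the data has more than one mode. One way to list every tied value is sketched below with a small made-up sample; `statistics.multimode` (Python 3.8+) does the same thing in one call.

```python
from collections import Counter
from statistics import multimode

# small made-up sample with a three-way tie
data = [150, 180, 190, 150, 180, 190, 100]

counts = Counter(data)
top = max(counts.values())

# every value whose count equals the highest count is a mode
modes = sorted(v for v, c in counts.items() if c == top)
print(modes)            # [150, 180, 190]
print(multimode(data))  # [150, 180, 190]
```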

In [40]:
import pandas as pd
starbucks = pd.read_csv('starbucks_drinkMenu_expanded.csv')

In [44]:
descriptives = starbucks.describe()
print(type(descriptives))
descriptives.to_csv("starbucks_descriptives.csv")

<class 'pandas.core.frame.DataFrame'>


In [45]:
descriptives

| | Calories | Trans Fat (g) | Saturated Fat (g) | Sodium (mg) | Total Carbohydrates (g) | Cholesterol (mg) | Dietary Fibre (g) | Sugars (g) | Protein (g) |
|---|---|---|---|---|---|---|---|---|---|
| count | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 | 242.0 |
| mean | 193.871901 | 1.307025 | 0.037603 | 6.363636 | 128.884298 | 35.991736 | 0.805785 | 32.96281 | 6.978512 |
| std | 102.863303 | 1.640259 | 0.071377 | 8.630257 | 82.303223 | 20.795186 | 1.445944 | 19.730199 | 4.871659 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 120.0 | 0.1 | 0.0 | 0.0 | 70.0 | 21.0 | 0.0 | 18.0 | 3.0 |
| 50% | 185.0 | 0.5 | 0.0 | 5.0 | 125.0 | 34.0 | 0.0 | 32.0 | 6.0 |
| 75% | 260.0 | 2.0 | 0.1 | 10.0 | 170.0 | 50.75 | 1.0 | 43.75 | 10.0 |
| max | 510.0 | 9.0 | 0.3 | 40.0 | 340.0 | 90.0 | 8.0 | 84.0 | 20.0 |


In [46]:
descriptives['Calories']

count    242.000000
mean     193.871901
std      102.863303
min        0.000000
25%      120.000000
50%      185.000000
75%      260.000000
max      510.000000
Name: Calories, dtype: float64
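
Two details of `describe()` are easy to miss: `std` is the sample standard deviation (it divides by n - 1), and the quartiles interpolate linearly between neighbouring values. The sketch below reproduces both with the standard library on made-up data. It assumes that `statistics.quantiles(..., method='inclusive')` corresponds to pandas' default linear interpolation, which is worth verifying against your own data.

```python
import statistics

# made-up data, chosen so the population standard deviation is exactly 2
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_std = statistics.stdev(data)       # divides by n - 1, like describe()
population_std = statistics.pstdev(data)  # divides by n
print(population_std)  # 2.0
print(sample_std > population_std)  # True

# quartiles with linear interpolation between neighbouring sorted values
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
print(q1, q2, q3)  # 4.0 4.5 5.5
```

The middle quartile is the median, which is why the `50%` row of `describe()` matches `np.median`.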

In [47]:
starbucks['Calories']

0        3
1        4
2        5
3        5
4       70
      ... 
237    320
238    170
239    200
240    180
241    240
Name: Calories, Length: 242, dtype: int64

In [50]:
starbucks.mode()

| | Beverage_category | Beverage | Beverage_prep | Calories | Total Fat (g) | Trans Fat (g) | Saturated Fat (g) | Sodium (mg) | Total Carbohydrates (g) | Cholesterol (mg) | Dietary Fibre (g) | Sugars (g) | Protein (g) | Vitamin A (% DV) | Vitamin C (% DV) | Calcium (% DV) | Iron (% DV) | Caffeine (mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Classic Espresso Drinks | Caffè Latte | Soymilk | 150.0 | 0.1 | 0.1 | 0.0 | 0.0 | 160.0 | 31.0 | 0.0 | 0.0 | 3.0 | 10% | 0% | 10% | 0% | 75.0 |
| 1 | | Caffè Mocha (Without Whipped Cream) | | 180.0 | | | | | | | | | | | | | | |
| 2 | | Cappuccino | | 190.0 | | | | | | | | | | | | | | |
| 3 | | Caramel Macchiato | | | | | | | | | | | | | | | | |
| 4 | | Coffee | | | | | | | | | | | | | | | | |
| 5 | | Hot Chocolate (Without Whipped Cream) | | | | | | | | | | | | | | | | |
| 6 | | Tazo® Chai Tea Latte | | | | | | | | | | | | | | | | |
| 7 | | Tazo® Full-Leaf Red Tea Latte (Vanilla Rooibos) | | | | | | | | | | | | | | | | |
| 8 | | Tazo® Full-Leaf Tea Latte | | | | | | | | | | | | | | | | |
| 9 | | Tazo® Green Tea Latte | | | | | | | | | | | | | | | | |
