## Cereal Marketing

#### Author: Yesahel Scicluna 

#### Source: Codecademy. Learn Statistics with NumPy. Practice Project - CrunchieMunchies

#### Concepts: NumPy, measures of central location, measures of dispersion

#### Required Data Files: cereals.csv

#### Task Description
You work in marketing for a food company, YummyCorps, which is developing a new kind of tasty, wholesome cereal called CrunchieMunchies. You want to show consumers how healthy your cereal is compared with other leading brands, so you’ve dug up nutritional data on several competitors. Your task is to use NumPy statistical calculations to analyze this data and make the case that CrunchieMunchies is the healthiest choice for consumers.

#### Task 1
First, import `numpy`. 

In [1]:
import numpy as np

#### Task 2
Look over the cereals.csv file. This file contains the reported calorie amounts for different cereal brands. Load the data from the file and save it as `calorie_stats`. 

In [2]:
calorie_stats = np.genfromtxt(r'https://raw.githubusercontent.com/yezisti/Yesahel_Scicluna--M.Sc._Bioinformatics--Portfolio/main/Python/Codecademy/Statistics%20with%20NumPy/cereal_marketing/cereals.csv', delimiter=',')
calorie_stats

array([ 70., 120.,  70.,  50., 110., 110., 110., 130.,  90.,  90., 120.,
       110., 120., 110., 110., 110., 100., 110., 110., 110., 100., 110.,
       100., 100., 110., 110., 100., 120., 120., 110., 100., 110., 100.,
       110., 120., 120., 110., 110., 110., 140., 110., 100., 110., 100.,
       150., 150., 160., 100., 120., 140.,  90., 130., 120., 100.,  50.,
        50., 100., 100., 120., 100.,  90., 110., 110.,  80.,  90.,  90.,
       110., 110.,  90., 110., 140., 100., 110., 110., 100., 100., 110.])
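
A quick sanity check after loading can catch parsing problems: `np.genfromtxt` turns fields it cannot read as numbers (a stray header, for example) into `nan`. The sketch below parses a short in-memory sample rather than the real file. The sample string and its single-row layout are assumptions, since the layout of cereals.csv is not shown here.

```python
import io
import numpy as np

# Hypothetical sample: the first five values of the printed array as one CSV row
sample_file = io.StringIO("70,120,70,50,110")
sample = np.genfromtxt(sample_file, delimiter=',')

print(sample.shape)            # how many values were parsed
print(np.isnan(sample).any())  # True would signal fields that failed to parse
```

The same two checks can be run on `calorie_stats` itself right after it is loaded.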

#### Task 3
There are 60 calories per serving of CrunchieMunchies. How much higher is the average calorie count of your competition? Save the answer to the variable `average_calories` and print the variable to see the answer.


In [3]:
average_calories = np.mean(calorie_stats)
difference = average_calories - 60
print(f'mean no. of calories: {average_calories}, difference: {difference}')

mean no. of calories: 106.88311688311688, difference: 46.883116883116884


#### Task 4
Does the average calorie count adequately reflect the distribution of the dataset? Let’s sort the data and see. Sort the data and save the result to the variable `calorie_stats_sorted`. Print the sorted data to the terminal.


In [4]:
calorie_stats_sorted = np.sort(calorie_stats)
calorie_stats_sorted

array([ 50.,  50.,  50.,  70.,  70.,  80.,  90.,  90.,  90.,  90.,  90.,
        90.,  90., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 110., 110., 110.,
       110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.,
       110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.,
       110., 110., 110., 110., 120., 120., 120., 120., 120., 120., 120.,
       120., 120., 120., 130., 130., 140., 140., 140., 150., 150., 160.])
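
Reading a sorted array by eye is error-prone, so a frequency table can help. The sketch below uses `np.unique` with `return_counts=True` and also counts how many cereals lie above the mean. To stay self-contained it rebuilds the 77 values from the counts visible in the printed array (an assumption made for illustration; in the notebook, `calorie_stats` from Task 2 would be used directly).

```python
import numpy as np

# Rebuild the dataset from the value counts visible in the printed array
values = np.array([50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160], dtype=float)
counts = np.array([3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1])
calorie_stats = np.repeat(values, counts)

# Frequency table: each distinct calorie value and how often it occurs
unique_values, unique_counts = np.unique(calorie_stats, return_counts=True)
for value, count in zip(unique_values, unique_counts):
    print(f'{value:5.0f} calories: {count} cereals')

# Number of cereals strictly above the mean
above_mean = int(np.sum(calorie_stats > calorie_stats.mean()))
print(f'{above_mean} of {calorie_stats.size} cereals are above the mean')
```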

#### Task 5
Looks like the majority of the cereals are higher than the mean. Let’s see if the median is a better representative of the dataset. Calculate the median of the dataset and save your answer to `median_calories`. Print the median so you can see how it compares to the mean. 

In [5]:
median_calories = np.median(calorie_stats)
print(f'median: {median_calories}, mean: {average_calories}')

median: 110.0, mean: 106.88311688311688
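
To see what `np.median` does, the median can also be picked out of the sorted array by hand: with an odd number of values it is the middle element, and with an even number it is the mean of the two middle elements. A minimal sketch, again rebuilding the data from the printed array for self-containment (the helper name `manual_median` is hypothetical):

```python
import numpy as np

# Rebuild the dataset from the value counts visible in the printed array
values = np.array([50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160], dtype=float)
counts = np.array([3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1])
calorie_stats = np.repeat(values, counts)

def manual_median(data):
    """Median taken directly from the sorted values."""
    sorted_vals = np.sort(data)
    n = sorted_vals.size
    middle = n // 2
    if n % 2 == 1:
        return sorted_vals[middle]  # odd count: the single middle value
    return (sorted_vals[middle - 1] + sorted_vals[middle]) / 2  # even count

print(manual_median(calorie_stats), np.median(calorie_stats))
```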


#### Task 6
While the median demonstrates that at least half of our values are over 100 calories, it would be more impressive to show that a significant portion of the competition has a higher calorie count than CrunchieMunchies. Calculate different percentiles and print them until you find the lowest percentile that is greater than 60 calories. Save this value to the variable `nth_percentile`.

In [6]:
np.percentile(calorie_stats, 40)

110.0

In [7]:
np.percentile(calorie_stats, 20)

100.0

In [8]:
np.percentile(calorie_stats, 5)

70.0

In [9]:
np.percentile(calorie_stats, 3)

55.599999999999994

In [10]:
np.percentile(calorie_stats, 4)

70.0

In [11]:
# ... After a few more trials:
np.percentile(calorie_stats, 3.29)

60.007999999999996

In [12]:
nth_percentile = 3.29
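
The fractional results above (55.6 at the 3rd percentile, roughly 60.008 at 3.29) arise because `np.percentile` by default interpolates linearly between neighbouring sorted values. The sketch below reproduces that interpolation by hand and replaces the trial-and-error search with a loop over whole-number percentiles. The helper `linear_percentile` is hypothetical, and the data is rebuilt from the printed array so the block runs on its own.

```python
import numpy as np

# Rebuild the dataset from the value counts visible in the printed array
values = np.array([50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160], dtype=float)
counts = np.array([3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1])
calorie_stats = np.repeat(values, counts)

def linear_percentile(sorted_vals, p):
    """Percentile by linear interpolation between the two nearest ranks."""
    pos = p / 100 * (sorted_vals.size - 1)  # fractional index into the sorted data
    lower = int(np.floor(pos))
    upper = min(lower + 1, sorted_vals.size - 1)
    return sorted_vals[lower] + (pos - lower) * (sorted_vals[upper] - sorted_vals[lower])

sorted_vals = np.sort(calorie_stats)
manual_p3 = linear_percentile(sorted_vals, 3)
print(manual_p3, np.percentile(calorie_stats, 3))

# Lowest whole-number percentile whose value exceeds 60 calories
lowest_p = None
for p in range(1, 101):
    if np.percentile(calorie_stats, p) > 60:
        lowest_p = p
        break
lowest_value = np.percentile(calorie_stats, lowest_p)
print(f'percentile {lowest_p} -> {lowest_value} calories')
```

Because no cereal in the data actually has 60 calories, any percentile value between 50 and 70 is an interpolated figure rather than an observed one.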

#### Task 7
While the percentile shows us that the majority of the competition has a much higher calorie count, it’s an awkward concept to use in marketing materials. Instead, let’s calculate the percentage of cereals that have more than 60 calories per serving. Save your answer to the variable `more_calories` and print it.

In [13]:
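# Count the cereals directly; 100 - nth_percentile relies on an interpolated percentile and overstates the share
more_calories = np.mean(calorie_stats > 60) * 100
print(f'CrunchieMunchies contains fewer calories than {more_calories:.2f} % of the competition')

CrunchieMunchies contains fewer calories than 96.10 % of the competition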


#### Task 8
That’s a really high percentage. That’s going to be very useful when we promote CrunchieMunchies. But one question is, how much variation exists in the dataset? Can we make the generalization that most cereals have around 100 calories or is the spread even greater? Calculate the amount of variation by finding the standard deviation. Save your answer to `calorie_std` and print it. How can we incorporate this value into our analysis?

In [14]:
calorie_std = np.std(calorie_stats)
print(f'mean: {average_calories}, SD: {calorie_std}')

mean: 106.88311688311688, SD: 19.35718533390827


In [15]:
# If calories were roughly normally distributed, about 68 % of cereals would contain 87.53 - 126.24 calories (mean +-1 SD)
# and about 95 % would contain 68.17 - 145.60 calories (mean +-2 SD)
# At 60 calories, CrunchieMunchies sits below even the (mean - 2 SD) mark of 68.17 calories
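
The 68 % and 95 % figures are rules of thumb for normally distributed data, and this dataset need not follow them. A hedged sketch of how to check: count the share of cereals that actually fall within one and two standard deviations of the mean, and express 60 calories as a z-score. As before, the data is rebuilt from the printed array so the block runs on its own.

```python
import numpy as np

# Rebuild the dataset from the value counts visible in the printed array
values = np.array([50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160], dtype=float)
counts = np.array([3, 2, 1, 7, 17, 29, 10, 2, 3, 2, 1])
calorie_stats = np.repeat(values, counts)

mean = calorie_stats.mean()
sd = calorie_stats.std()  # population SD (ddof=0), the same default as np.std

# Observed share of cereals within k standard deviations of the mean
shares = {}
for k in (1, 2):
    low, high = mean - k * sd, mean + k * sd
    shares[k] = np.mean((calorie_stats >= low) & (calorie_stats <= high))
    print(f'within {k} SD ({low:.2f} to {high:.2f}): {shares[k]:.1%}')

# How many standard deviations CrunchieMunchies (60 calories) lies from the mean
z_score = (60 - mean) / sd
print(f'z-score of 60 calories: {z_score:.2f}')
```

On the values printed above, 63 of 77 cereals (about 82 %) lie within one SD and 71 of 77 (about 92 %) within two, so the normal-distribution percentages are only a rough guide here.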