# Drink Caffeine Content Analysis Project

## Project Introduction

In this analysis we will delve into a topic that is very near and dear to my heart: caffeine! I have a caffeine addiction. More often than not I opt for energy drinks à la Red Bull, Monster, Rockstar, etc. However, when the weather gets cold I start to really enjoy coffee. I often worry that the amount of caffeine I ingest will one day come back to bite me in terms of health. Despite how much caffeine I consume on a weekly basis, I know shockingly little about the amount of caffeine in the drinks I enjoy so much.

This data set covers roughly 600 caffeinated drinks and records the caffeine content, volume, and calories of each. We will conduct some exploratory data analysis (EDA) and then attempt to find a subset of drinks which may be "less unhealthy" and could serve as alternatives to sugar-filled energy drinks and coffee.


## Import Libraries and Load the Dataset

In [1]:
# Import libraries used in the project
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio



In [2]:
# import the caffeine dataset
caffeine = pd.read_csv(r'C:\Users\benwh\Documents\Projects\Caffeine\caffeine.csv')

## Data Familiarization

In [3]:
caffeine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610 entries, 0 to 609
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   drink          610 non-null    object 
 1   Volume (ml)    610 non-null    float64
 2   Calories       610 non-null    int64  
 3   Caffeine (mg)  610 non-null    int64  
 4   type           610 non-null    object 
dtypes: float64(1), int64(2), object(2)
memory usage: 24.0+ KB


In [4]:
caffeine.head()

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type
0,Costa Coffee,256.993715,0,277,Coffee
1,Coffee Friend Brewed Coffee,250.19181,0,145,Coffee
2,Hell Energy Coffee,250.19181,150,100,Coffee
3,Killer Coffee (AU),250.19181,0,430,Coffee
4,Nescafe Gold,250.19181,0,66,Coffee


We can see after running .info() and .head() that our data set is pretty straightforward: 610 observations with five fields covering drink name, volume, calories, caffeine content, and drink type.

In [5]:
# Check for duplicate drink names
duplicatedrinks = caffeine[caffeine.duplicated(['drink'])]
duplicatedrinks

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type


We have no duplicate drink names in our dataset.

In [6]:
# Distribution of drink types in the dataset.
caffeine['type'].value_counts()

Energy Drinks    219
Coffee           173
Soft Drinks       90
Tea               66
Energy Shots      36
Water             26
Name: type, dtype: int64

value_counts() reveals there are six unique drink types.

In [7]:
# Distribution of volumes in the dataset.
caffeine['Volume (ml)'].value_counts()

354.882000    159
473.176000    125
236.588000     96
250.191810     35
59.147000      19
             ... 
428.815750      1
380.019475      1
370.260220      1
339.208045      1
329.744525      1
Name: Volume (ml), Length: 70, dtype: int64

The volume of drinks varies quite a bit, but well over half of the drinks are larger than 200 ml.
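As a rough check on that share, the counts printed by value_counts() can be added up by hand. The sketch below only uses the five most common volumes shown in the output above, so the result is a lower bound rather than the exact share; the variable names are illustrative.

```python
# Counts of the five most common volumes, copied from the value_counts() output above.
top_volume_counts = {
    354.882: 159,
    473.176: 125,
    236.588: 96,
    250.19181: 35,
    59.147: 19,
}
total_drinks = 610  # number of rows reported by caffeine.info()

# Drinks among these five sizes that are larger than 200 ml.
above_200 = sum(count for volume, count in top_volume_counts.items() if volume > 200)

# Only a lower bound: the 65 less common volumes are not included here.
share_above_200 = above_200 / total_drinks
print(f"At least {above_200} of {total_drinks} drinks ({share_above_200:.0%}) exceed 200 ml")
```

On the full DataFrame, `(caffeine['Volume (ml)'] > 200).mean()` should give the exact share in one step.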

In [8]:
# Distribution of calorie content in the dataset.
caffeine['Calories'].value_counts()

0      200
5       32
10      30
160     22
140     21
      ... 
155      1
117      1
78       1
118      1
299      1
Name: Calories, Length: 98, dtype: int64

Calories per drink vary quite a bit. Notably, the three most common values (0, 5, and 10 calories) account for 262 of the 610 drinks, or about 43%.

In [9]:
# Distribution of caffeine content in the dataset.
caffeine['Caffeine (mg)'].value_counts()

80      37
100     34
160     33
0       28
120     23
        ..
1150     1
1325     1
728      1
1101     1
99       1
Name: Caffeine (mg), Length: 162, dtype: int64

While there are only six types of drinks in the dataset, there is much more variation in volume (70 distinct values), calories (98), and caffeine content (162).

## Univariate Analysis 

### Summary Statistics for Volume (ml), Calories, and Caffeine (mg)

In [10]:
caffeine.describe()

Unnamed: 0,Volume (ml),Calories,Caffeine (mg)
count,610.0,610.0,610.0
mean,346.54363,75.527869,134.693443
std,143.747738,94.799919,155.362861
min,7.393375,0.0,0.0
25%,236.588,0.0,50.0
50%,354.882,25.0,100.0
75%,473.176,140.0,160.0
max,1419.528,830.0,1555.0


In [11]:
caffeine.groupby(['type']).describe()

,Volume (ml),Volume (ml),Volume (ml),Volume (ml),Volume (ml),Volume (ml),Volume (ml),Volume (ml),Calories,Calories,Calories,Calories,Calories,Caffeine (mg),Caffeine (mg),Caffeine (mg),Caffeine (mg),Caffeine (mg),Caffeine (mg),Caffeine (mg),Caffeine (mg)
type,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Coffee,173.0,335.870855,159.716392,12.716605,236.588,325.3085,473.176,1419.528,173.0,73.49711,...,120.0,830.0,173.0,200.589595,248.222165,2.0,100.0,145.0,200.0,1555.0
Energy Drinks,219.0,388.971198,106.409997,236.588,250.19181,354.882,473.176,751.1669,219.0,86.671233,...,149.5,320.0,219.0,147.86758,76.734535,0.0,80.0,135.0,182.5,400.0
Energy Shots,36.0,57.742259,22.094888,7.393375,57.076855,59.147,59.147,125.98311,36.0,16.5,...,25.0,100.0,36.0,193.416667,79.535931,75.0,120.0,192.5,241.25,350.0
Soft Drinks,90.0,355.243454,41.509635,236.588,354.882,354.882,354.882,591.47,90.0,111.111111,...,160.0,320.0,90.0,33.677778,24.915961,0.0,9.25,37.0,47.75,102.0
Tea,66.0,360.47408,167.002318,177.441,236.588,236.588,473.176,946.352,66.0,52.757576,...,107.5,299.0,66.0,55.863636,39.333637,0.0,29.25,45.0,68.0,165.0
Water,26.0,394.590111,99.70274,236.588,343.791938,354.882,473.176,591.47,26.0,11.538462,...,3.75,110.0,26.0,53.730769,34.060602,0.0,35.0,60.0,73.75,120.0


### Distribution of Drink Types

In [12]:
caffeine_gb = caffeine.groupby(["type"]).count()
fig = px.bar(caffeine_gb, 
             x=caffeine_gb.index, 
             y="drink", 
             title='Count of Drink Types')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

### Histogram - Volume (ml)

In [13]:
px.histogram(caffeine, 
             x = 'Volume (ml)', 
             title='Distribution of Volume')

The right tail of the histogram for Volume (ml) is pretty long. Let's zoom in and list the drinks with volumes > 800 ml.

In [14]:
large_volume = caffeine[(caffeine['Volume (ml)'] > 800)]
large_volume

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type
29,Starbucks Bottled Iced Coffee,1419.528,240,640,Coffee
530,McDonalds Sweet Tea,946.352,160,100,Tea


Our long right tail consists of only two drinks.
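The 800 ml cutoff was picked by eye from the histogram. One common convention for flagging the upper tail, which is not the method used in this notebook, is the 1.5 x IQR rule. The sketch below applies it to the quartiles printed by caffeine.describe(); `upper_fence` is a hypothetical helper written for this illustration.

```python
# Quartiles copied from the caffeine.describe() output above: (25%, 75%).
quartiles = {
    "Volume (ml)": (236.588, 473.176),
    "Calories": (0.0, 140.0),
    "Caffeine (mg)": (50.0, 160.0),
}


def upper_fence(q1, q3, k=1.5):
    """Return Q3 + k * IQR, a conventional upper cutoff for outliers."""
    return q3 + k * (q3 - q1)


fences = {column: upper_fence(q1, q3) for column, (q1, q3) in quartiles.items()}
for column, fence in fences.items():
    print(f"{column}: upper fence = {fence:.1f}")
```

For Volume (ml) the fence comes out at roughly 828 ml, so both drinks listed above would still be flagged under this rule.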

### Histogram - Calories

In [15]:
px.histogram(caffeine, 
             x = 'Calories', 
             title='Distribution of Calories')

Similar to Volume (ml), Calories has a long right tail. Let's take a closer look and filter for drinks with >= 400 calories.

In [16]:
high_calories = caffeine[(caffeine['Calories'] >= 400)]
high_calories

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type
13,Dare Iced Coffee,500.087885,429,160,Coffee
30,Baskin Robbins Cappuccino Blast,709.764,470,234,Coffee
37,Arby's Jamocha Shake,473.176,830,12,Coffee
81,Big Train Java Chip Ice Coffee,354.882,410,49,Coffee


The long right-sided tail for calories is made up of four drinks that all look like they are a liquid version of ice cream.

### Histogram - Caffeine (mg)

In [17]:
px.histogram(caffeine, 
             x = 'Caffeine (mg)', 
             title='Distribution of Caffeine', 
             template='plotly_white')

Again we have a long right tail, this time for Caffeine (mg). Let's zoom in on the tail by looking at drinks with > 400 mg of caffeine.

In [18]:
high_caffeine = caffeine[(caffeine['Caffeine (mg)'] > 400)]

px.histogram(high_caffeine, 
             x = 'Caffeine (mg)', 
             nbins=10, 
             title='Distribution of Drinks With Caffeine Content Higher Than 400 mg', 
             template='plotly_white')

### Univariate Analysis - Takeaways

- About 64% of the dataset (392 of 610 drinks) is made up of the Energy Drinks and Coffee drink types
- Most volumes fall between 200 and 475 ml
- More than half of the drinks in the dataset appear to be low calorie (< 100 calories)
- Most drinks have less than 200 mg of caffeine
- About 75% of drinks have a caffeine concentration (Caffeine (mg) / Volume (ml), introduced in the bivariate analysis) of 0.75 mg/ml or less
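The first takeaway can be checked from the type counts printed earlier. This is a minimal sketch that rebuilds those counts as a pandas Series (the numbers are copied from the value_counts() output, not recomputed from the CSV) and derives each type's share plus a running total.

```python
import pandas as pd

# Counts copied from the caffeine['type'].value_counts() output earlier in the notebook.
type_counts = pd.Series(
    {
        "Energy Drinks": 219,
        "Coffee": 173,
        "Soft Drinks": 90,
        "Tea": 66,
        "Energy Shots": 36,
        "Water": 26,
    },
    name="count",
)

# Share of each type and the running (cumulative) share, largest type first.
shares = type_counts / type_counts.sum()
summary = pd.DataFrame({"share": shares, "cumulative": shares.cumsum()}).round(3)
print(summary)
```

With the full DataFrame loaded, `caffeine['type'].value_counts(normalize=True)` should return the same shares directly.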

## Bivariate Analysis

### Box Plot for Caffeine Across all Drink Types

In [19]:
fig = px.box(caffeine, 
            x='type', 
            y='Caffeine (mg)',
            title='Caffeine Content By Drink Type',
            labels={'type': 'Type'},
            template='plotly_white',
            hover_data=['drink', 'Volume (ml)']
            )
fig

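Coffee has numerous outlying drinks where caffeine content is very high. Additionally, Coffee drinks appear to be a bit higher in terms of caffeine content than Energy Drinks. However, Energy Shots have a higher median (192.5 mg) than both Coffee (145 mg) and Energy Drinks (135 mg), even though Coffee's outliers pull its mean (about 201 mg) slightly above that of Energy Shots (about 193 mg).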
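Whether Coffee or Energy Shots comes out on top depends on which "average" is used. The sketch below ranks the drink types by mean and by median, using values copied from the groupby describe() output earlier in the notebook; `rank_by` is a hypothetical helper for this illustration.

```python
# Mean and median (50%) caffeine in mg, copied from the groupby describe() output.
caffeine_stats = {
    "Coffee": {"mean": 200.589595, "median": 145.0},
    "Energy Drinks": {"mean": 147.86758, "median": 135.0},
    "Energy Shots": {"mean": 193.416667, "median": 192.5},
    "Soft Drinks": {"mean": 33.677778, "median": 37.0},
    "Tea": {"mean": 55.863636, "median": 45.0},
    "Water": {"mean": 53.730769, "median": 60.0},
}


def rank_by(stat):
    """Return drink types ordered from highest to lowest value of the given statistic."""
    return sorted(caffeine_stats, key=lambda t: caffeine_stats[t][stat], reverse=True)


by_mean = rank_by("mean")
by_median = rank_by("median")
print("By mean:  ", by_mean)
print("By median:", by_median)
```

The median is less sensitive to a handful of extreme drinks, which is why it lines up better with the boxes in the plot.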

### Volume x Caffeine

In [20]:
px.scatter(caffeine,
            x='Volume (ml)', 
            y='Caffeine (mg)',
            color='type',  
            title='Volume x Caffeine',
            height = 800,
            template='plotly_white',
            hover_data=['drink']
            )

The coffee outliers are visible here as well, and most of them appear to share the same volume. Energy Shots cluster at around 100 ml or less but have caffeine content similar to many Energy Drinks and Coffee drinks. Tea, Water, and Soft Drinks look to have lower caffeine content.

### Box Plot for Caffeine Concentration Across all Drink Types

We're going to take another look at caffeine distribution across all drink types, but this time we'll use caffeine concentration which is equal to Caffeine (mg) / Volume (ml). This will give us a better idea of which drink types pack a caffeine punch relative to the volume of the drink.

In [None]:
# create Caffeine Concentration column which = Caffeine (mg) / Volume (ml)
caffeine['Caffeine Concentration'] = caffeine['Caffeine (mg)'] / caffeine['Volume (ml)']

px.box(caffeine,
        x='type',
        y='Caffeine Concentration',
        title='Caffeine Concentration By Drink Type',
        labels={'type': 'Type'},
        height=800,
        template='plotly_white',
        hover_data=['drink']
        )

The caffeine concentration of energy shots is clearly much higher than any other drink type. Between Coffee and Energy drinks, Coffee appears to have a higher caffeine concentration. Additionally, just like with the previous box plot showing caffeine content for all drink types, there are numerous outliers for coffee.
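To see why concentration can tell a different story from total caffeine, here is a small worked example. It uses three rows copied from tables printed elsewhere in this notebook, not the full dataset, and the frame name `sample` is just for illustration.

```python
import pandas as pd

# Three rows copied from tables printed elsewhere in this notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "drink": [
            "Starbucks Bottled Iced Coffee",
            "Arby's Jamocha Shake",
            "Zest Highly Caffeinated Tea",
        ],
        "Volume (ml)": [1419.528, 473.176, 236.588],
        "Caffeine (mg)": [640, 12, 150],
    }
)

# Same formula as in the notebook: milligrams of caffeine per millilitre of drink.
sample["Caffeine Concentration"] = sample["Caffeine (mg)"] / sample["Volume (ml)"]

# Rank by total caffeine and by concentration to compare the two orderings.
by_total = sample.sort_values("Caffeine (mg)", ascending=False)["drink"].tolist()
by_concentration = sample.sort_values("Caffeine Concentration", ascending=False)["drink"].tolist()
print(sample.round(3))
```

The large bottled coffee has the most caffeine in total, but the much smaller tea is the most concentrated of the three.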


## Healthier Options

Now that we have a better understanding of the data set and have created a caffeine concentration value for each drink, let's see if we can isolate some potentially healthier options. In this data set, calories are the only available proxy for how healthy a drink is, so we will run with that.

Let's filter for drinks that have 100 calories or fewer, exclude Energy Shots, and keep the 20 with the highest caffeine concentration.

In [None]:
healthy_caffeine = caffeine[(caffeine['Calories'] <= 100) & (caffeine['type'] != 'Energy Shots')].sort_values(by=['Caffeine Concentration'], ascending=False).head(20)
healthy_caffeine

## Which Tea and Water Drinks Have the Highest Caffeine Content

As much as most people love caffeine, it's hard not to feel a little bit guilty indulging in a caffeinated drink. Unless you're drinking black coffee, drinks with caffeine tend to come with a lot of other chemicals.

Let's look only at the Tea and Water drinks in the caffeine dataset and rank them by caffeine concentration to find the ten most concentrated.

In [23]:
# filter for only Tea and Water
tw_caffeine = caffeine[(caffeine['type'] == 'Tea') | (caffeine['type'] == 'Water')].sort_values(by=['Caffeine Concentration'], ascending=False).head(10)

tw_caffeine

Unnamed: 0,drink,Volume (ml),Calories,Caffeine (mg),type,Caffeine Concentration
581,Zest Highly Caffeinated Tea,236.588,0,150,Tea,0.634014
556,Fast Lane Black Tea,236.588,0,110,Tea,0.464943
558,HICAF Tea,236.588,0,110,Tea,0.464943
584,Perrier Energize,250.19181,35,99,Water,0.395696
528,Inko's White Tea Energy,458.38925,100,165,Tea,0.359956
580,YMateina Yerba Mate,236.588,0,80,Tea,0.338141
598,Arti Sparkling Water,354.882,0,120,Water,0.338141
527,Guayaki Canned Yerba Mate,458.38925,120,150,Tea,0.327233
542,Taiwanese Milk Tea,473.176,299,151,Tea,0.31912
561,Lipton Natural Energy Tea,236.588,0,75,Tea,0.317007
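The same kind of filter can be written a little more compactly with `isin()` and `nlargest()`. This is a sketch on a handful of rows copied from this notebook's outputs, so the frame `sample` and the cutoff of three rows are illustrative only. Tie ordering may differ from `sort_values(...).head(...)`.

```python
import pandas as pd

# A few rows copied from this notebook's outputs (not the full dataset).
sample = pd.DataFrame(
    {
        "drink": [
            "Costa Coffee",
            "Zest Highly Caffeinated Tea",
            "Perrier Energize",
            "McDonalds Sweet Tea",
            "Arti Sparkling Water",
        ],
        "Volume (ml)": [256.993715, 236.588, 250.19181, 946.352, 354.882],
        "Caffeine (mg)": [277, 150, 99, 100, 120],
        "type": ["Coffee", "Tea", "Water", "Tea", "Water"],
    }
)
sample["Caffeine Concentration"] = sample["Caffeine (mg)"] / sample["Volume (ml)"]

# isin() replaces the chained `==` / `|` comparison; nlargest() replaces sort + head.
tea_water = sample[sample["type"].isin(["Tea", "Water"])]
top_three = tea_water.nlargest(3, "Caffeine Concentration")
print(top_three[["drink", "type", "Caffeine Concentration"]])
```

Using `isin()` also makes it easy to add or remove drink types later by editing the list.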
