# Simple Metrics and Basic Visualizations
October 7, 2019

Data Science Society

authors: Roshan Lodha, Kevin Chai

https://www.kaggle.com/crawford/80-cereals/download



# Metrics

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

First, as always, we read in the dataframe and assess the granularity of the data.

In [None]:
cereal = pd.read_csv('cereal.csv')
cereal.head(10)

The granularity of the dataset is a single cereal: each row represents one cereal.
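
One way to confirm this granularity is to check that the column identifying each cereal has no repeated values. The sketch below uses a tiny hand-made frame in place of `cereal.csv`; the `toy` frame, the `name` column, and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the cereal data; names and values are made up.
toy = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "calories": [70, 120, 110],
})

# If every name appears exactly once, each row corresponds to one cereal.
one_row_per_cereal = toy["name"].is_unique

# Any names that appear more than once would be listed here.
duplicated_names = toy.loc[toy["name"].duplicated(keep=False), "name"].tolist()

print(one_row_per_cereal, duplicated_names)
```

If the check fails on real data, the duplicated names are the first place to look before computing any per-cereal statistics.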

From the Kaggle link, we get the following information.<br>
<ul>
<li>type: C stands for cold, H stands for hot</li>
<li>mfr: Manufacturer of cereal
<ul>
<li>A = American Home Food Products</li>
<li>G = General Mills</li>
<li>K = Kelloggs</li>
<li>N = Nabisco</li>
<li>P = Post</li>
<li>Q = Quaker Oats</li>
<li>R = Ralston Purina</li>
</ul>
</li>
</ul>
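
The single-letter codes listed above are easier to read in tables and plots once they are translated to full names. A minimal sketch, assuming the codes appear in columns called `mfr` and `type`; the `toy` frame and the new `mfr_name` and `type_name` columns are hypothetical.

```python
import pandas as pd

# Codes and names copied from the list above.
MFR_NAMES = {
    "A": "American Home Food Products",
    "G": "General Mills",
    "K": "Kelloggs",
    "N": "Nabisco",
    "P": "Post",
    "Q": "Quaker Oats",
    "R": "Ralston Purina",
}
TYPE_NAMES = {"C": "cold", "H": "hot"}

# Hypothetical rows standing in for the real data.
toy = pd.DataFrame({"mfr": ["K", "G", "Q"], "type": ["C", "C", "H"]})

# Series.map looks each code up in the dictionary; unknown codes become NaN.
toy["mfr_name"] = toy["mfr"].map(MFR_NAMES)
toy["type_name"] = toy["type"].map(TYPE_NAMES)

print(toy)
```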

# Metrics

We can start by measuring important metrics, as we did before.

In [None]:
cereal_cals = cereal["calories"]
cereal_cals.head()

In [None]:
#number of rows in the calories column (both expressions give the same value)
len(cereal_cals), cereal_cals.shape[0]

In [None]:
#mean
np.mean(cereal_cals)

In [None]:
#median
np.median(cereal_cals)

In [None]:
#mode (this measure of central tendency has very niche uses; we'll tend to stay away from it)
cereal_cals.mode()

In [None]:
#lower quartile (25th percentile)
lower = cereal_cals.quantile(.25)
lower

In [None]:
#upper quartile (75th percentile)
upper = cereal_cals.quantile(.75)
upper

In [None]:
#interquartile range
iqr = upper - lower
iqr

In [None]:
#bounds for the outliers
upper_cutoff = upper + 1.5 * iqr
lower_cutoff = lower - 1.5 * iqr
lower_cutoff, upper_cutoff

In [None]:
#rows whose calories fall outside the cutoffs computed above
outliers = cereal.query("calories > @upper_cutoff or calories < @lower_cutoff").sort_values(by='calories', ascending=False)
outliers.head()
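
The quartile, IQR, and cutoff steps above can be bundled into one reusable helper. This is a sketch under stated assumptions: `iqr_bounds` is a hypothetical name, the multiplier `k` defaults to the 1.5 used above, and the `toy` calorie values are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower_cutoff, upper_cutoff) using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up calorie values for illustration only.
toy = pd.DataFrame({"calories": [50, 100, 100, 110, 110, 110, 120, 120, 160]})

low, high = iqr_bounds(toy["calories"])

# Keep only the rows that fall outside the cutoffs.
outliers = toy[(toy["calories"] < low) | (toy["calories"] > high)]

print(low, high)
print(outliers)
```

Computing the cutoffs from the data, rather than typing in fixed numbers, keeps the filter correct if the dataset changes.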

Alternatively (and perhaps more easily), we can use the df.describe() method to get summary statistics for every numeric column at once.

Be careful, however: the output does not include units, so we need to refer back to the dataset documentation for those.

In [None]:
cereal.describe()

# Histograms

In [None]:
sns.distplot(cereal['calories'], kde = False, bins = 5)

Once the kernel density estimate (KDE) is overlaid, as in the next cell, we can see that this smoother estimate of the underlying distribution does not match up very well with the 5-bin histogram of the data.

Why is this?

In [None]:
sns.distplot(cereal['calories'], bins = 5)

Now let's see what happens when we change the number of bins used for the visualization.

In [None]:
sns.distplot(cereal['calories'], bins = 10)

In [None]:
sns.distplot(cereal['calories'], rug = True, bins = 5)
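
To see numerically why the number of bins changes the picture, the counts behind a histogram can be inspected without plotting. A minimal sketch using `np.histogram`; the `values` array is made up for illustration and is not taken from the cereal data.

```python
import numpy as np

# Made-up calorie values for illustration only.
values = np.array([50, 70, 90, 100, 100, 110, 110, 110,
                   120, 120, 130, 140, 150, 160])

# np.histogram returns the count in each bin and the bin edges.
counts5, edges5 = np.histogram(values, bins=5)
counts10, edges10 = np.histogram(values, bins=10)

# Fewer bins means wider bins, so more values are lumped together.
print("5 bins, width", edges5[1] - edges5[0], "counts", counts5)
print("10 bins, width", edges10[1] - edges10[0], "counts", counts10)
```

Wide bins hide detail inside each bar, while narrow bins can make a small sample look noisy; the rug plot shows where the individual observations actually fall.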

# Box Charts


In [None]:
sns.boxplot(x='calories', data=cereal)

In [None]:
sns.boxplot(x='calories', y='mfr', data=cereal)

# Scatter Plots

In [None]:
sns.lmplot(x='calories', y='fat', data=cereal)

In [None]:
sns.lmplot(x='calories', y='fat', data=cereal, fit_reg=False)

In [None]:
sns.lmplot(x='calories', y='fat', hue='type', data=cereal)

In [None]:
sns.lmplot(x='rating', y='calories', data=cereal)

# Bar Charts

In [None]:
sns.countplot(x='type', data=cereal)

In [None]:
sns.countplot(x='mfr', data=cereal)

In [None]:
sns.countplot(x='shelf', hue='mfr', data=cereal)

In [None]:
sns.barplot(x='mfr', y='calories', data=cereal)
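
The numbers behind these bar charts can also be computed directly: `sns.countplot` shows the number of rows per category, and `sns.barplot` by default shows the mean of `y` within each category. A minimal sketch with a made-up `toy` frame standing in for the cereal data:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "mfr": ["K", "K", "G", "G", "Q"],
    "calories": [100, 120, 110, 130, 90],
})

# Row counts per manufacturer: the bar heights a count plot would draw.
counts = toy["mfr"].value_counts()

# Mean calories per manufacturer: the bar heights a default bar plot would draw.
mean_cals = toy.groupby("mfr")["calories"].mean()

print(counts)
print(mean_cals)
```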

# Dot Charts

In [None]:
sns.pointplot(x='mfr', y='calories', data=cereal)