In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
# Set up the data
data = pd.DataFrame()
data['gender'] = ['male'] * 100 + ['female'] * 100
data['height'] = np.append(np.random.normal(69, 8, 100), np.random.normal(64, 5, 100))
data['weight'] = np.append(np.random.normal(195, 25, 100), np.random.normal(166, 15, 100))

In [3]:
data.head()

,gender,height,weight
0,male,79.89755,221.871102
1,male,68.735092,198.170656
2,male,70.951406,195.795587
3,male,80.355821,196.726384
4,male,69.519208,168.919022


# Describing data with Pandas

So far in this lesson, we've discussed the various ways we can use statistics to describe a given dataset. Now we're going to discuss how we can leverage the tools of data science, specifically the _pandas_ package, to quickly and easily describe our data. This is what you'll actually be using day to day when you have to describe or summarize the data you're working with. Rather than writing out formulas or performing calculations by hand, you'll use the tools of programming to get the answers you want easily and efficiently.


## What we've seen before

We've already shown some of the basic tools. Pandas series have methods like `.mean()` and `.std()` to calculate the mean and standard deviation of our data.

In [4]:
data.height.mean()

65.921110470114073

In [5]:
data.height.std()

7.125274567757093
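One detail worth knowing about `.std()`: by default pandas computes the sample standard deviation (`ddof=1`), while NumPy's `np.std` computes the population version (`ddof=0`). The sketch below uses a small set of made-up heights (not the lesson's random data) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
heights = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0])

sample_std = heights.std()                    # pandas default: ddof=1 (sample)
population_std = np.std(heights.to_numpy())   # NumPy default: ddof=0 (population)
matched_std = heights.std(ddof=0)             # ask pandas for the population version

print(sample_std, population_std, matched_std)
```

With 200 rows the two versions are close, but on small samples the gap can be noticeable.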

Now, there are many more methods in pandas to describe data in simple aggregative forms. Measures like median and variance all have associated pandas methods. As a general rule of thumb, if you're trying to compute a standard statistical measure (the kind of measure you could find in a statistics book), Python probably already has a method for it somewhere. Usually that method will be in NumPy or pandas, but not always. It is, however, always worth a quick Google and a check of Stack Overflow to see if the work has already been done before you go off and create your own functions.
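As a minimal sketch of those other aggregations, here are `.median()` and `.var()` on a small hypothetical column, plus `.agg()` to request several measures at once. The `toy` frame and its values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the 90.0 is a deliberately extreme value
toy = pd.DataFrame({'height': [60.0, 62.0, 64.0, 66.0, 68.0, 90.0]})

median_height = toy['height'].median()   # middle of the sorted values
variance_height = toy['height'].var()    # sample variance (ddof=1)

# Several aggregations in one call; the result is indexed by measure name
summary = toy['height'].agg(['mean', 'median', 'var'])
print(summary)
```

Note how the single extreme value pulls the mean above the median, which is one reason to look at more than one measure.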

## The `.describe()` method

So far we've mostly talked about methods with two kinds of output: the output either keeps the same shape as the input with modified values (the iterative kinds of methods) or it condenses the data into a single value (aggregative methods). There is another group of methods in pandas, and they happen to be supremely useful for quickly and coherently summarizing data in a numeric rather than visual way.

In statistics, there are a lot of descriptive values that are often used in concert with each other. The most classic example is probably mean and standard deviation. Using the two of them together gets you a lot of information about how the data is distributed across values.

Pandas understands this. Sometimes you want more than one value, but fewer than all of them. You want a set of summary statistics that gives you a good, standardized view into the data and its variables. Enter `.describe()`.

In [6]:
data.describe()

,height,weight
count,200.0,200.0
mean,65.92111,181.416225
std,7.125275,25.996945
min,49.268448,126.352513
25%,61.194613,162.815738
50%,65.567748,176.152202
75%,69.586985,198.285979
max,91.566155,253.611685


Let's look at what that did. Firstly, it returned a data frame, but not one of the same size or shape as the one we gave it. Instead it iterated over the columns and computed these standard statistical measures for each column where that was possible. We say "where that was possible" because one column is missing: `gender`. That's because `gender` holds strings rather than numeric values, and we can't compute the mean of strings.
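String columns are skipped by default, but they can still be described. As a hedged sketch on a hypothetical three-row frame, calling `.describe()` on the string column itself returns a different set of summary values (count, number of unique values, most frequent value, and its frequency).

```python
import pandas as pd

# Hypothetical toy frame mirroring the lesson's column names
toy = pd.DataFrame({
    'gender': ['male', 'male', 'female'],
    'height': [70.0, 68.0, 64.0],
})

numeric_summary = toy.describe()           # only the numeric column appears
string_summary = toy['gender'].describe()  # count, unique, top, freq

print(numeric_summary)
print(string_summary)
```

Passing `include='all'` to `DataFrame.describe()` is another option when you want both kinds of columns in one table, at the cost of many empty cells.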

Now for the values themselves. Count should be relatively self-evident, as should min and max. Mean and std (standard deviation) we've also talked about before. The three percent values are _percentiles_. These values represent cutoff points, below which a certain percentage of the data lies. So, 25% of weights are below 162.82, and so on.
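If you need percentiles other than the three defaults, a sketch like the following may help. It uses `.quantile()` directly and the `percentiles` argument of `.describe()`; the `weights` values are hypothetical and chosen so the cutoffs are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy weights, evenly spaced for easy checking
weights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])

q25 = weights.quantile(0.25)                 # a single percentile
custom = weights.quantile([0.1, 0.5, 0.9])   # several at once, indexed by fraction

# describe() can also report percentiles of your choosing
described = weights.describe(percentiles=[0.1, 0.9])

print(q25)
print(custom)
print(described)
```

By default pandas interpolates linearly between data points, so a percentile does not have to be a value that actually occurs in the data.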

Together, these values give us a decent image of what each of the variables included looks like. We can get a numerical sense of what we might call their "shape". However, this is only one part of `.describe()`'s capabilities. As we covered in the toolkit unit, we can also group our data. This allows us to be even more insightful with our describe, letting us compare the summary statistics for two different groups of our data.

In [7]:
data.groupby('gender').describe()

gender,,height,weight
female,count,100.0,100.0
female,mean,63.195078,164.643079
female,std,4.786304,12.775653
female,min,50.59149,128.763644
female,25%,59.987204,153.933783
female,50%,63.674985,166.122162
female,75%,66.57688,173.270107
female,max,74.507797,192.725068
male,count,100.0,100.0
male,mean,68.647143,198.189371


Now we have twice the output. This may not be the easiest form to read, but it does give us a sense of the difference between the two groups, male and female. In this case we can see that the distributions for height and weight are higher for men than for women, which is what we'd expect. This kind of grouping can add another layer of insight to our analysis.
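When the full grouped output is hard to scan, one option is to narrow it down. The sketch below, on a hypothetical four-row frame, describes a single column per group and then pulls out just the group means. In recent pandas versions the grouped `.describe()` result has one row per group and one column per statistic, which may differ from the stacked layout shown above.

```python
import pandas as pd

# Hypothetical toy frame mirroring the lesson's column names
toy = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female'],
    'height': [70.0, 72.0, 63.0, 65.0],
    'weight': [190.0, 200.0, 160.0, 170.0],
})

# Summary statistics for one column, split by group
height_by_gender = toy.groupby('gender')['height'].describe()

# Or a single statistic for several columns, split by group
means = toy.groupby('gender')[['height', 'weight']].mean()

print(height_by_gender)
print(means)
```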

## Value Counts

Sometimes you aren't dealing with data that is best summarized in this form. The most common example is strings, where these kinds of methods do not apply. In that case what you're probably interested in is _counts_. Pandas gives you an easy way to go over a column of data and return the distinct values as well as the count of each.

In [8]:
data.gender.value_counts()

female    100
male      100
Name: gender, dtype: int64

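Now, the first thing to note is that this method is working on `data.gender`, which is a _series_ object rather than a _data frame_ object. Called this way, `.value_counts()` counts the values of a single column; it does not iterate over each column of a data frame the way `.describe()` does. (Newer pandas versions also provide a data frame version of `.value_counts()`, but it counts distinct rows rather than summarizing column by column.) Luckily each column and row in a data frame is a series, and you can use this method simply by selecting a column as we did above.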

There are several reasons to use this method. Firstly, it gives you another way to make sense of your data. In this case it shows us that our data is evenly balanced between males and females, with one hundred samples of each.
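When the question is about balance rather than raw counts, proportions can be easier to read. As a small sketch on a hypothetical series, `normalize=True` makes `.value_counts()` return each value's share of the total instead of its count.

```python
import pandas as pd

# Hypothetical toy labels: three of one class, one of the other
genders = pd.Series(['male', 'male', 'male', 'female'])

counts = genders.value_counts()                 # raw counts per distinct value
shares = genders.value_counts(normalize=True)   # fraction of rows per distinct value

print(counts)
print(shares)
```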

There are plenty of other ways this method could be useful. It can reveal outliers or possibly malformed data. For example, if we were to see something like `'Mal'` with a single entry, we'd have found a typo in the data. This method works over both numerical and object data, though it is not valuable to run over the numeric columns in this example. Can you think of why?

In [9]:
data.weight.value_counts().head()

175.143849    1
208.507849    1
169.185494    1
165.450584    1
165.087602    1
Name: weight, dtype: int64

As you can see, it's not useful because we're dealing with truly continuous random data, so no value is exactly repeated. We simply get a list of all the values with a count of 1 for each.
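If you do want counts for a continuous column, one approach is to count within ranges instead of exact values. The sketch below uses the `bins` argument of `.value_counts()`, which groups values into equal-width intervals; the `weights` values and the choice of three bins are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy weights with no repeated values
weights = pd.Series([151.2, 158.9, 163.4, 171.0, 176.5, 189.3, 204.8, 221.1])

raw_counts = weights.value_counts()           # every value appears exactly once
binned_counts = weights.value_counts(bins=3)  # counts per equal-width interval

print(raw_counts.max())
print(binned_counts)
```

The number of bins is a judgment call: too few hides the shape of the data, too many brings back the problem of mostly empty or single-count groups.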

However, these two methods, `.describe()` and `.value_counts()`, do often provide incredibly easy and valuable insights into your dataset. You'll want to use them throughout the course as one of the ways to get a first, quick sense of the data before digging in more specifically on points of interest.