In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Introduction to DataFrames

In the previous part, we covered the most common and fundamental attributes and methods for a Series. This chapter begins our coverage of similar and analogous operations for DataFrames.

## DataFrames vs Series

DataFrames and Series are extremely similar objects. A Series is a single dimension of data and is usually formed from a single column of a DataFrame. Series have an index and values, but no columns. A DataFrame can be thought of as a collection of Series, one per column, each directly accessible by placing the column name within *just the brackets*.
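
As a small illustration of this relationship, the sketch below builds a tiny DataFrame and selects one column with the brackets. The data and the names `df` and `col` are made up for this example and are not part of the bikes dataset.

```python
import pandas as pd

# Made-up data, used only for illustration
df = pd.DataFrame({'tripduration': [993, 623, 1040],
                   'temperature': [73.9, 69.1, 73.0]})

# Placing a single column name within the brackets returns a Series
col = df['tripduration']
print(type(col))

# The Series keeps the same index as the DataFrame but has no columns
print(col.index.equals(df.index))
print(hasattr(col, 'columns'))
```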

### View the API for a complete list of functionality

As we did for Series, it can be helpful to see the entire list of attributes and methods available to DataFrames. Visit the [DataFrame section][1] of the API to see all of its functionality.

### Best of DataFrame API

DataFrames have an abundance of attributes and methods, many of which do not add any functionality to the library. We focus on the core attributes and methods that give you the most power to complete a data analysis.

### Minimally Sufficient Pandas

As a gentle reminder, I recommend that you stick with a minimal subset of pandas when doing an analysis. Using more obscure methods does not make you a better analyst. The point of a data analysis is to clearly expose the information held within the data. Just about everything that you want to do can be clearly expressed with minimal pandas syntax.

### Bikes dataset

We will use the bikes dataset to introduce the core attributes and methods of DataFrames.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/frame.html

In [1]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy


## Core DataFrame attributes

The core DataFrame attributes that you need to know are listed below.

* `index`
* `columns`
* `values`
* `dtypes`
* `shape`
* `size`

### Review the index and columns

We've discussed the index and columns extensively before this chapter. As a review, the purpose of the index is to label each row, just as the purpose of the columns is to label each column. The index and columns are not data. They are labels for the data. If no index is set during read, pandas uses the default `RangeIndex`, which defines a sequence of consecutive integers beginning at 0.

In [3]:
bikes.index

RangeIndex(start=0, stop=50089, step=1)

Let's access the column names with the `columns` attribute.

In [None]:
bikes.columns
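
The same two attributes can be inspected on a small, made-up DataFrame (the data and the name `df` are assumptions for this sketch). No index is supplied, so pandas falls back to the default `RangeIndex`.

```python
import pandas as pd

# Made-up data; no index is passed, so the default RangeIndex is used
df = pd.DataFrame({'gender': ['Male', 'Female'],
                   'tripduration': [993, 623]})

# Row labels: consecutive integers beginning at 0
print(df.index)

# Column labels, converted to a plain Python list for readability
print(df.columns.tolist())
```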

### `values` returns a 2-D numpy array

The `values` attribute returns a two-dimensional numpy array of all the column values. It is like a DataFrame with no index or columns.

In [None]:
bikes.values

### Advanced discussion on the `values` attribute

The `values` attribute always returns a single numpy array, which may lead you to believe that pandas stores its data as a single numpy array. This isn't the case. pandas has separate two-dimensional numpy arrays for each data type in the DataFrame. For instance, if a DataFrame contains integers, floats, strings, and datetimes, then pandas will have four separate numpy arrays to contain data for each of those data types. The reason for this is that numpy arrays can only be of one specific data type. In other words, numpy arrays contain **homogeneous** data. The exception to this is the 'object' numpy data type, which can contain any Python object.

Whenever you access the `values` attribute of a DataFrame, pandas concatenates all of the numpy arrays together to return a single array. Notice that the data type of the resulting `bikes.values` array is object. This is because the `bikes` DataFrame contains string columns and the only valid numpy data type that allows both numeric and string values is object.

If you had a DataFrame consisting only of integers and floats, then the `values` attribute would return an array with a data type of float. pandas chooses a common data type that can hold the values of every column when you access the `values` attribute. Below, an integer column (`tripduration`) and a float column (`start_capacity`) are selected and then the `values` attribute is accessed to return a numpy array with a float data type.

In [None]:
bikes[['tripduration', 'start_capacity']].values.dtype
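
The `dtypes` attribute from the list of core attributes returns a Series with the data type of each column. The sketch below, on made-up data (the names `df`, `numeric_dtype`, and `mixed_dtype` are assumptions for this example), contrasts the per-column types with the single data type of the array returned by `values`.

```python
import pandas as pd

# Made-up data with an integer, a float, and a string column
df = pd.DataFrame({'duration': [993, 623],
                   'capacity': [11.0, 31.0],
                   'gender': ['Male', 'Female']})

# One data type per column
print(df.dtypes)

# Integers and floats together are returned as a float array
numeric_dtype = df[['duration', 'capacity']].values.dtype
print(numeric_dtype)

# Adding the string column forces the object data type
mixed_dtype = df.values.dtype
print(mixed_dtype)
```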

### `shape` returns a tuple of the number of rows and columns

The `shape` attribute returns a Python tuple of length 2 containing the number of rows and columns.

In [None]:
shape = bikes.shape
shape

You can get the number of rows or columns as an integer by selecting them from the tuple.

In [None]:
shape[0]

It isn't necessary to assign the tuple to a variable beforehand.

In [None]:
bikes.shape[1]

### `size` returns the total number of elements in the DataFrame

The `size` attribute is a bit tricky and returns the total number of elements in the DataFrame. This is simply the number of rows multiplied by the number of columns.

In [None]:
bikes.size

We can verify this by multiplying the number of rows and columns together.

In [None]:
shape[0] * shape[1]

### The `len` function returns the number of rows

Passing in the DataFrame to the built-in Python `len` function returns the number of rows.

In [None]:
len(bikes)
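
The relationships between `shape`, `size`, and `len` can be checked together on a small, made-up DataFrame (the data and variable names are assumptions for this sketch).

```python
import pandas as pd

# Made-up data: 3 rows and 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# Unpack the shape tuple into rows and columns
n_rows, n_cols = df.shape

# size is the number of rows multiplied by the number of columns
print(df.size == n_rows * n_cols)

# len counts rows; the length of the columns counts columns
print(len(df) == n_rows)
print(len(df.columns) == n_cols)
```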

## Arithmetic DataFrame operations

We now cover the arithmetic operators `+`, `-`, `*`, `/`, `**`, `//`, `%` on a DataFrame. Let's say we have a DataFrame, `df`, and execute `df + 5`. pandas attempts to add 5 to each value in the DataFrame. This operation only succeeds if all of the columns are numeric (or boolean).

### Attempt to add 5 to bikes

If we attempt to add 5 to `bikes` we get an error, as there is a mix of numeric, object, and datetime columns. Adding the integer 5 to a string or datetime column is impossible and a `TypeError` is raised.

In [None]:
bikes + 5
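
If you would like to see the failure without halting the notebook, one option is to catch the exception. This is a minimal sketch on made-up data; the names `df` and `error_raised` are assumptions, and the string column is given the object data type explicitly so that it mirrors the string columns described in this chapter.

```python
import pandas as pd

# Made-up data mixing numeric, string (object), and datetime columns
df = pd.DataFrame({
    'tripduration': [993, 623],
    'gender': pd.Series(['Male', 'Male'], dtype=object),
    'starttime': pd.to_datetime(['2013-06-28 19:01:00',
                                 '2013-06-28 22:53:00']),
})

# Adding an integer fails because of the non-numeric columns
try:
    df + 5
    error_raised = False
except TypeError as err:
    error_raised = True
    print(type(err).__name__)
```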

### Select only numeric data with `select_dtypes`

DataFrames have a method unique to them called `select_dtypes` which selects a subset of columns with the passed type. Pass it the data type you want to select as a string. For example, pass it the string `'int64'` to select all of the 64-bit integer columns.

In [None]:
bikes.select_dtypes('int64').head(3)

Select all of the 64-bit float columns by passing it the string `'float64'`.

In [None]:
bikes.select_dtypes('float64').head(3)

All columns from each data type may be selected in this manner. You can also select columns of multiple data types by using a list. Here, we select both 64-bit integers and floats.

In [None]:
bikes.select_dtypes(['int64', 'float64']).head(3)

Integers and floats of different bit sizes are considered different data types, and you will need to use their exact name to select them with this method. For instance, `'int16'` selects all 16-bit integer columns. A deep dive into all of the pandas data types is found in the **Data Types** part of the book.
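
To see the effect of bit size, the sketch below builds a made-up DataFrame whose columns are given explicit data types with numpy (the column names and variables are assumptions for this example) and selects each integer size separately.

```python
import numpy as np
import pandas as pd

# Made-up data with explicitly chosen bit sizes
df = pd.DataFrame({
    'big': np.array([1, 2, 3], dtype='int64'),
    'small': np.array([1, 2, 3], dtype='int16'),
    'ratio': np.array([0.5, 1.5, 2.5], dtype='float64'),
})

# Each exact name matches only columns of that bit size
int64_cols = df.select_dtypes('int64').columns.tolist()
int16_cols = df.select_dtypes('int16').columns.tolist()

# A list selects several data types at once
both_cols = df.select_dtypes(['int16', 'int64']).columns.tolist()

print(int64_cols, int16_cols, both_cols)
```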

### Use the string 'number' to select all numeric data

pandas offers a shortcut to select all numeric data types (integers and floats) with the string 'number'. We assign this result to `bikes_number`.

In [None]:
bikes_number = bikes.select_dtypes('number')
bikes_number.head(3)
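
A small check on made-up data (the names below are assumptions for this sketch) shows which columns `'number'` picks up. Integer and float columns are selected, while boolean and string columns are left out.

```python
import pandas as pd

# Made-up data with integer, float, boolean, and string columns
df = pd.DataFrame({'count': [1, 2],
                   'price': [1.5, 2.5],
                   'flag': [True, False],
                   'name': ['a', 'b']})

# Only the integer and float columns are selected
numeric_cols = df.select_dtypes('number').columns.tolist()
print(numeric_cols)
```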

### Add 5 to `bikes_number`

Now that we have a DataFrame consisting entirely of numeric data, we can successfully add 5 to it.

In [None]:
(bikes_number + 5).head(3)

### Addition and multiplication with string columns

Addition actually works with strings by appending the word being added to each value. You can also use multiplication to concatenate each string value to itself. We first select all of the string columns and assign the result to a new variable name.

In [None]:
bikes_strings = bikes.select_dtypes('object')
bikes_strings.head(3)

The addition operator can now be used to append a string to each value in the DataFrame.

In [None]:
(bikes_strings + ' SOMESTRING').head(3)

Similarly, the multiplication operator can be used with an integer to concatenate each string value to itself.

In [None]:
(bikes_strings * 3).head(3)
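
Both operations can be tried on a small, made-up DataFrame of strings. This is a sketch under the assumption that the columns use the object data type, which is set explicitly here; the names `appended` and `repeated` are made up for the example.

```python
import pandas as pd

# Made-up string data stored with the object data type
df = pd.DataFrame({'gender': ['Male', 'Female'],
                   'events': ['cloudy', 'rain']}, dtype=object)

# Addition appends the string to every value
appended = df + ' SOMESTRING'
print(appended)

# Multiplication by an integer repeats every value
repeated = df * 2
print(repeated)
```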

## DataFrame comparison operators

The comparison operators (`<`, `<=`, `>`, `>=`, `==`, `!=`) work similarly to the arithmetic ones and return a DataFrame of all boolean columns. Here, we test whether every value in the DataFrame is greater than 5.

In [None]:
(bikes_number > 5).head(3)
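
The sketch below uses made-up data (the names `greater`, `all_bool`, and `counts` are assumptions) to confirm that every column of the result is boolean. Because `True` counts as 1 when summed, the `sum` method then gives the number of values in each column that pass the test.

```python
import pandas as pd

# Made-up numeric data
df = pd.DataFrame({'tripduration': [993, 3],
                   'wind_speed': [12.7, 4.6]})

# The comparison is applied to every value
greater = df > 5
print(greater)

# Every column of the result has the boolean data type
all_bool = all(dtype == bool for dtype in greater.dtypes)
print(all_bool)

# Summing booleans counts the True values in each column
counts = greater.sum()
print(counts)
```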

## Overlap of DataFrame and Series methods

Most of the methods that exist for Series also exist for DataFrames and vice versa. This is good news, as it would be a hassle to have a different set of methods for such similar objects. In this section, we find all the attributes and methods that are either in common or unique to Series and DataFrames. We begin by using a set comprehension to get all of the public (those that don't begin with an underscore) attributes and methods for each type.

In [None]:
df_public = {method for method in dir(pd.DataFrame) 
             if not method.startswith('_')}
series_public = {method for method in dir(pd.Series) 
                 if not method.startswith('_')}

Let's output the total number of methods for each type. Notice how they have nearly the same number.

In [None]:
len(df_public), len(series_public)

Let's find the number of methods in common. The ampersand computes the intersection between sets. About 90% of the attributes and methods are the same.

In [None]:
len(df_public & series_public)

### Attributes and methods unique to DataFrames

The minus sign computes the difference between one set and another. It returns all of the elements unique to the first set. Below we return the attributes and methods unique to DataFrames.

In [None]:
print(df_public - series_public)

### Attributes and methods unique to Series

Reversing the operation returns the attributes and methods unique to Series.

In [None]:
print(series_public - df_public)
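
Printed sets appear in arbitrary order, so it can be easier to sort them or to test for a specific name. The sketch below repeats the set comprehensions so that it runs on its own; the names `df_only` and `series_only` are assumptions for this example.

```python
import pandas as pd

# Public attributes and methods of each type
df_public = {name for name in dir(pd.DataFrame)
             if not name.startswith('_')}
series_public = {name for name in dir(pd.Series)
                 if not name.startswith('_')}

# Sorted lists are easier to scan than unordered sets
df_only = sorted(df_public - series_public)
series_only = sorted(series_public - df_public)

# Membership tests for specific names
print('select_dtypes' in df_only)
print('to_frame' in series_only)
print('head' in (df_public & series_public))
```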

## Data Dictionaries

A data dictionary is an important element of a data analysis and at a minimum gives us the column name and description of each column. Other information on each column can be kept in it such as the data type of each column or number of missing values.

The college dataset has a data dictionary available that can be read in as a DataFrame. It contains the descriptions of each column, which is important with this dataset, as the column names are not easily decipherable.

In [None]:
# There are more than 20 (the default) columns
pd.set_option('display.max_columns', 40) 
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head(3)

The data dictionary is a CSV, which can be read in as a DataFrame to help understand the information in the college DataFrame.

In [None]:
pd.read_csv('../data/dictionaries/college_data_dictionary.csv').head()
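
When a dataset does not ship with a data dictionary, part of one can be started from the DataFrame itself. This is a minimal sketch on made-up data; the names `df` and `summary` and the column labels are assumptions. It gathers the data type and number of missing values for each column, while the descriptions would still need to be written by hand.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
df = pd.DataFrame({'city': ['Chicago', None, 'Houston'],
                   'sat_avg': [1200.0, np.nan, 1050.0]})

# One row per column of df: its data type and count of missing values
summary = pd.DataFrame({
    'data_type': df.dtypes.astype(str),
    'missing': df.isna().sum(),
})
summary.index.name = 'column_name'
print(summary)
```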

## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Select only the 64-bit float columns from the `college` DataFrame. How many are there?</span>

### Exercise 2
<span  style="color:green; font-size:16px">When you call the `info` method on a DataFrame, one of the very last items in the output is the count of columns for each data type. Can you think of a different combination of pandas operations that would return this as a Series?</span>