# 5. DataFrame Attributes and Methods

### Objectives

In previous notebooks, we covered the most common and fundamental attributes and methods for a Series. This notebook will take on a similar task for DataFrames.

## DataFrames vs Series

DataFrames and Series are extremely similar objects. A Series is simply a single column of a DataFrame. Series have an index and values but no columns. A DataFrame can be thought of as a collection of Series objects each directly accessible by passing the column name to the brackets of the DataFrame.
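As a minimal sketch of this relationship, the cell below builds a tiny made-up DataFrame (the `city` and `temperature` columns are hypothetical and not part of the bikes dataset) and selects one column with the brackets.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({'city': ['Chicago', 'Houston', 'Denver'],
                   'temperature': [62.1, 84.0, 55.4]})

# Passing a column name to the brackets returns a Series
temps = df['temperature']
print(type(temps))

# The Series keeps the DataFrame's index but has no columns of its own
print(temps.index.equals(df.index))
print(hasattr(temps, 'columns'))
```

The selected Series carries the same index as the DataFrame it came from, which is why the two objects behave so similarly.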

## View the API for complete list of functionality
Just as we did for Series, it can be helpful to see the entire list of attributes and methods for a DataFrame. Please visit the [DataFrame section][1] of the API.

## Best of DataFrame API
DataFrames, like Series, have an abundance of attributes and methods that do not give any additional functionality to the library. We will focus on the core attributes and methods that give you the most power to complete a data analysis.

## Minimally Sufficient Pandas
I can't stress enough how important it is to stick with a minimal subset of pandas when doing an analysis. Using more obscure methods does not make you a better analyst. This is akin to going to a party and using the most obscure and difficult words to impress the guests. The point of a data analysis is to clearly expose the information held within the data. Just about everything that you want to do can be clearly expressed with minimal Pandas syntax. It is this syntax that we focus on during class.

## Bikes Dataset
We will use the bikes dataset to introduce the core attributes and methods of DataFrames.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe

In [None]:
import pandas as pd

bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head()

## Core DataFrame Attributes

See the complete list of [DataFrame attributes][1]

* **`index`**
* **`columns`**
* **`values`**
* **`dtypes`**
* **`shape`**
* **`size`**

Let's further explore these attributes:

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#attributes-and-underlying-data

### Index Objects
The default index for both DataFrames and Series is a **`RangeIndex`**. This object is unique to Pandas. All index objects have their own attributes and methods, which you can [view in the API][1]. There are actually about a dozen different Index objects, which sounds intimidating. However, you will rarely need to interact with these Index objects directly, so it's not necessary to spend much time studying their attributes and methods.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#attributes-and-underlying-data

In [None]:
bikes.index

## The columns are an Index object
The columns are always going to be some kind of Index object just like the index. You can more or less think of the Index objects as an array of data.

In [None]:
bikes.columns

In [None]:
type(bikes.columns)
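To make the "array of data" idea concrete, here is a small sketch on a made-up DataFrame (the column names `a` and `b` are hypothetical). It only uses behavior shared by all Index objects: length, positional selection, and conversion to a list.

```python
import pandas as pd

# Hypothetical toy DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

cols = df.columns

# The columns are an Index object
print(isinstance(cols, pd.Index))

# Like an array, an Index has a length and supports selection by position
print(len(cols))
print(cols[0])

# It can also be converted to a plain Python list
print(list(cols))
```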

### `values` returns a 2-D NumPy array
The **`values`** attribute returns a 2-D NumPy array.

In [None]:
bikes.values
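The sketch below, using made-up data, checks that the result really is a two-dimensional NumPy array. It also shows one consequence worth knowing: when the columns have different types, NumPy has to fall back to the generic `object` data type to hold them all in one array.

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric DataFrame: 3 rows, 2 columns
nums = pd.DataFrame({'a': [1, 2, 3], 'b': [4.5, 5.5, 6.5]})
arr = nums.values

print(type(arr))
print(arr.ndim)    # number of dimensions
print(arr.shape)   # (rows, columns)
print(arr.dtype)   # ints are upcast to float to share one array

# Hypothetical mixed DataFrame: numbers and strings together
mixed = pd.DataFrame({'a': [1, 2, 3], 'label': ['x', 'y', 'z']})
print(mixed.values.dtype)
```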

### `dtypes` returns a Series of data types
The **`dtypes`** attribute returns a Series of data types, where the index of the Series holds the column names and the values are the data type of each column.

In [None]:
bikes.dtypes
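Because the result is an ordinary Series, we can work with it like any other. The following sketch uses a made-up DataFrame (hypothetical columns `trips` and `distance`) to show that the column names form the index and that a single data type can be looked up by column name.

```python
import pandas as pd

# Hypothetical toy DataFrame
df = pd.DataFrame({'trips': [3, 5], 'distance': [1.2, 3.4]})

dt = df.dtypes

# dtypes is a Series whose index is the column names
print(type(dt))
print(list(dt.index))

# Look up the data type of one column by its name
print(dt['distance'])
```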

### `shape` returns a tuple of the number of rows and columns
The **`shape`** attribute always returns a Python tuple of length 2 containing the number of rows and the number of columns.

In [None]:
shape = bikes.shape
shape

In [None]:
type(shape)

You can get the number of rows and columns as an integer by selecting them from the tuple:

In [None]:
shape[0]

In [None]:
shape[1]

### `size` returns the total number of elements in the DataFrame
The **`size`** attribute is a bit tricky and returns the total number of elements in the DataFrame. This is simply the number of rows multiplied by the number of columns.

In [None]:
bikes.size

Just the same as this:

In [None]:
shape[0] * shape[1]

### The `len` function returns the number of rows
The built-in Python **`len`** function returns the number of rows.

In [None]:
len(bikes)
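The three measurements above are easy to mix up, so here is a compact sketch that puts them side by side on a made-up DataFrame with 4 rows and 3 columns (the column names are hypothetical).

```python
import pandas as pd

# Hypothetical toy DataFrame: 4 rows, 3 columns
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [5, 6, 7, 8],
                   'c': [9, 10, 11, 12]})

n_rows, n_cols = df.shape   # unpack the tuple directly

print(df.shape)   # (4, 3)
print(df.size)    # 12, rows multiplied by columns
print(len(df))    # 4, rows only
```

Unpacking the tuple into two names is often more readable than indexing it with `shape[0]` and `shape[1]`.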

## The `info` method
The **`info`** method prints the following summary information:
* data types of each column
* type of index
* number of columns
* number of non-missing values in each column
* frequency count of data types
* amount of memory used

This information is **printed** to the screen. There is no object that is returned.

In [None]:
bikes.info()
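To confirm that nothing is returned, the sketch below captures the result of calling `info` on a made-up DataFrame. It also uses the `buf` parameter of `info` to send the printed text to an in-memory buffer, which is one way to get the summary as a string if you ever need it.

```python
import io
import pandas as pd

# Hypothetical toy DataFrame with one missing value
df = pd.DataFrame({'a': [1, None, 3], 'b': ['x', 'y', 'z']})

# Redirect the printed summary into a text buffer instead of the screen
buffer = io.StringIO()
result = df.info(buf=buffer)
summary = buffer.getvalue()

print(result)    # None, because info only prints
print(summary)
```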

# Arithmetic Operations with a DataFrame
We now cover what happens when we use the arithmetic operators **`+, -, *, /, **, //`** on a DataFrame. Let's say we have a DataFrame, **`df`**, and execute **`df + 5`**. Pandas will attempt to add 5 to each value in the DataFrame. This operation will only work if all the columns are numeric (or boolean).

### Attempt to add 5 to bikes
If we try to add 5 to `bikes`, we will get an error because it has a mix of numeric, object, and datetime columns.

In [None]:
bikes + 5
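If you would like to see the failure without stopping the notebook, the following sketch wraps the same kind of operation in a `try`/`except` block. The DataFrame is made up (its values are not taken from the bikes dataset) and contains one numeric and one datetime column. In recent versions of pandas, adding an integer to a datetime column raises a `TypeError`.

```python
import pandas as pd

# Hypothetical toy DataFrame mixing numeric and datetime columns
mixed = pd.DataFrame({
    'duration': [993, 623],
    'start': pd.to_datetime(['2013-06-28 19:01', '2013-06-28 22:53']),
})

try:
    mixed + 5
    raised = False
except TypeError as err:
    # pandas refuses to add a plain integer to datetime values
    raised = True
    print(type(err).__name__)
```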

### Select just numeric data with `select_dtypes`
DataFrames have a method unique to them called **`select_dtypes`**, which selects the subset of columns matching the passed data type. Pass the data type you want as a string: int, float, bool, object, datetime, timedelta, or category.

Let's see some examples:

In [None]:
bikes.select_dtypes('int').head()

#### Use the string 'number' to select all numeric data
This selects all int and float columns.

In [None]:
bikes_number = bikes.select_dtypes('number')
bikes_number.head()
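Here is a small self-contained sketch of the same idea on a made-up DataFrame (all column names are hypothetical). Notice that the boolean column is not picked up by `'number'`, since NumPy does not treat booleans as a numeric type.

```python
import pandas as pd

# Hypothetical toy DataFrame with four different kinds of columns
df = pd.DataFrame({'trips': [1, 2],
                   'temp': [70.5, 68.0],
                   'gender': ['M', 'F'],
                   'member': [True, False]})

# 'number' keeps the int and float columns only
numeric = df.select_dtypes('number')
print(list(numeric.columns))

# A single type keeps just the columns of that type
print(list(df.select_dtypes('float').columns))
print(list(df.select_dtypes('bool').columns))
```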

## Try adding 5 to `bikes_number`
Let's try adding 5 to the **`bikes_number`** DataFrame which consists of only numeric columns:

In [None]:
(bikes_number + 5).head()

## Comparison Operators with DataFrames
The comparison operators work similarly to the mathematical ones and will return a DataFrame of all boolean columns:

In [None]:
(bikes_number > 5).head()
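With a DataFrame small enough to check by hand, it is easier to see that both kinds of operators act on every value. This sketch uses made-up numbers and hypothetical column names.

```python
import pandas as pd

# Hypothetical all-numeric DataFrame
nums = pd.DataFrame({'a': [1, 10], 'b': [2.5, 7.5]})

# Arithmetic: 5 is added to every value
plus_five = nums + 5
print(plus_five)

# Comparison: every value is tested, giving a DataFrame of booleans
mask = nums > 5
print(mask)
print(mask.dtypes)
```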

## Take Care when Working with entire DataFrame
When you do one of the above operations, you are applying that operation to every value in the DataFrame. Make sure this is what you want because all values will be affected.

## Changing Display Options
Pandas gives you several options to change the display of the DataFrames and Series. DataFrames are output in HTML tables while Series are output in plain text.

Let's read in the college dataset, put the institution name (`instnm`) in the index, and then display it on the screen without the `head` method.

In [None]:
college = pd.read_csv('../data/college.csv', index_col='instnm')
college

## Not enough column space
If you scroll to the right, you'll notice that some columns are hidden. You will see a column filled with three dots denoting this.

## The most common options
I typically change only two of the available display options: **`max_columns`** and **`max_rows`**.

## Use dot notation to find, view, and change the options
You don't have to remember all the option names to make a change. Instead, you only need to remember that they are all located under **`pd.options`**. From here we choose **`display`** and then press **tab** to have all the options come down in a menu. The code will look like this:

```
>>> pd.options.display.<press tab>
```

### Output `max_columns` and `max_rows`
We can output the default values for these options.

In [None]:
pd.options.display.max_columns

In [None]:
pd.options.display.max_rows

## Change the options with an assignment statement
You can use an assignment statement to change the options directly. Let's find out the number of columns in our DataFrame and change the options so all of them will be visible.

Let's also change the `max_rows` to 12 to limit the long output.

In [None]:
college.shape

Let's go with a 40 column display:

In [None]:
pd.options.display.max_columns = 40
pd.options.display.max_rows = 12
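An assignment like this stays in effect for the rest of the session. As a hedged aside, the sketch below shows two related pandas functions that are not otherwise used in this notebook: `pd.get_option`, which reads an option by its dotted name, and `pd.option_context`, which changes an option only inside a `with` block. The values 12 and 5 are arbitrary.

```python
import pandas as pd

# Remember the current setting so it can be restored afterwards
original = pd.get_option('display.max_rows')

# Direct assignment, the approach used in this notebook
pd.options.display.max_rows = 12
after_assignment = pd.get_option('display.max_rows')

# Put the original value back
pd.options.display.max_rows = original

# A temporary change that is undone automatically after the block
with pd.option_context('display.max_rows', 5):
    inside_context = pd.options.display.max_rows
after_context = pd.options.display.max_rows

print(after_assignment, inside_context, after_context == original)
```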

## Use the new display settings
We can now view all of the columns and a fraction of the rows that fit on our screen.

In [None]:
college.head()

## Many more options
There are a couple dozen more display options. Feel free to explore them.

# Data Dictionaries
A data dictionary is a very important element of a data analysis and at a minimum gives us the column name and description of each column. Other information on each column can be kept in it such as the data type of each column or number of missing values. This "data on the data" is often referred to as **metadata**.

The college dataset has an available data dictionary that can be read in as a DataFrame. It has the descriptions of each column, which is important with this dataset as the column names are not easily decipherable.

In [None]:
pd.read_csv('../data/college_data_dictionary.csv')
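When a dataset does not come with a data dictionary, part of one can be built from the DataFrame itself. The sketch below is only an illustration on made-up data: the columns `state` and `enrollment` and their descriptions are hypothetical, and the descriptions always have to be written by hand.

```python
import pandas as pd

# Hypothetical toy DataFrame with some missing values
df = pd.DataFrame({'state': ['AL', 'AK', None],
                   'enrollment': [4206.0, None, 291.0]})

# One row of metadata per column of the original DataFrame
data_dict = pd.DataFrame({
    'column_name': list(df.columns),
    'description': ['State abbreviation (made up)',
                    'Number of enrolled students (made up)'],
    'data_type': df.dtypes.astype(str).values,
    'missing_values': df.isna().sum().values,
})
print(data_dict)
```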

# Extra

## More on selecting specific data types

In [None]:
bikes.select_dtypes('float').head()

In [None]:
bikes.select_dtypes('datetime').head()

In [None]:
bikes.select_dtypes('object').head()

Use a list to select multiple data types:

In [None]:
bikes.select_dtypes(['int', 'object']).head()

### Other numeric operators
All the other numeric operators work in the same manner. They all apply the operation to every value in the DataFrame. For instance, the following does floor division to each value:

In [None]:
(bikes_number // 17).head()

### Strange addition and multiplication with objects
Addition actually does work with strings by appending the word being added to each value. You can also multiply strings by an integer which will make them repeat.

In [None]:
(bikes.select_dtypes('object') + 'SOMESTRING').head()

In [None]:
(bikes.select_dtypes('object') * 3).head()
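The same behavior can be checked on a tiny made-up DataFrame. The sketch below sets the data type to `object` explicitly so that it matches the object columns discussed here; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy DataFrame of strings stored as object columns
words = pd.DataFrame({'first': ['a', 'b'], 'second': ['c', 'd']},
                     dtype=object)

# Addition appends the string to every value
appended = words + '!'
print(appended)

# Multiplication by an integer repeats every string
repeated = words * 3
print(repeated)
```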

# Explore selecting data types, arithmetic operations, and changing display options

# Mini Case Study: Finding the attributes and methods in common between DataFrames and Series
DataFrames and Series have most of their attributes and methods in common, so you won't have to remember too much more to use them.

Let's find all the public functionality that DataFrames and Series have in common, as well as what is unique to each.

Use a set comprehension to get all public methods for each type:

In [None]:
df_public = {method for method in dir(pd.DataFrame) if method[0] != '_'}
series_public = {method for method in dir(pd.Series) if method[0] != '_'}

Output the total number of methods of each. Notice how they have nearly the same number:

In [None]:
len(df_public)

In [None]:
len(series_public)

Let's find the number of methods in common. About 90% of the methods are the same:

In [None]:
len(df_public & series_public)
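Rather than estimating the percentage by eye, we can compute it. This self-contained sketch rebuilds the two sets and divides the size of their intersection by the number of public DataFrame names. The exact figure depends on the installed version of pandas, so no specific output is shown.

```python
import pandas as pd

# Public names are those that do not start with an underscore
df_public = {name for name in dir(pd.DataFrame) if not name.startswith('_')}
series_public = {name for name in dir(pd.Series) if not name.startswith('_')}

# Names available on both objects
common = df_public & series_public

# Share of public DataFrame names that Series also have
share = len(common) / len(df_public)
print(f'{share:.0%}')
```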

#### Output the methods unique to DataFrames

In [None]:
df_public - series_public

#### Output the methods unique to Series

In [None]:
series_public - df_public