# 1. Pandas Intro

### Objectives
After this lesson you should be able to...

+ Get help by knowing your object, reading documentation and using inline commands
+ Know why Pandas is more suitable for data analysis than Python lists
+ Know the anatomy of a DataFrame and a Series
+ Identify a Series as a single dimensional data structure with an **index** and **values**
+ Identify a DataFrame as a two dimensional data structure with an **index**, **columns**, and **values**
+ Know the difference between an **index** and **values**
+ Know all the possible column data types
+ Know that each value in a column must be of the same data type
+ Know the representations of missing values and which ones are used for each data type
+ Know how to get metadata with **`info`**, **`shape`**, and **`size`**

### Prepare for this lesson by...

+ Read the [Package Overview](http://pandas.pydata.org/pandas-docs/stable/overview.html)
+ Read [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html) - **just the Series and DataFrame sections**

# Welcome to ....
![][1]


### What is Pandas?
Pandas is one of the better open source data exploration libraries currently available. It gives the user power to explore, manipulate, query, aggregate, and visualize **tabular** data. Tabular data is two-dimensional data arranged in rows and columns, i.e. a table.

### Why Pandas and not xyz?
In this current age of data explosion, there are now dozens of other tools that can do as much as, if not more than, the Pandas library. However, several aspects of Pandas set it apart, and it continues to have one of the fastest-growing user bases.

* It's a Python library, which makes it easy to read and easy to develop with, and it integrates easily with other popular data science libraries like NumPy, scikit-learn, statsmodels, matplotlib and seaborn.
* It is nearly self-contained in that lots of functionality is built into one package. This contrasts with R, where many packages are needed to obtain the same functionality.
* The community is amazing. Looking at Stack Overflow, for example, there are [many tens of thousands of][2] Pandas questions. SAS, a multi-billion dollar revenue analytics software maker, has only a fraction of the questions. This is one huge benefit of open source in general. If you need help, you are nearly guaranteed to find it very quickly. After a while most of your questions will be answered in the top few search engine results.

### Why is it named after an east Asian bear?

Pandas was built by Wes McKinney beginning in 2008 at a hedge fund named AQR. In finance, tabular data is often called 'panel data', which smashed together becomes Pandas. If you are really interested in the history, you can hear it from the creator [himself][3].

### Python already has data structures to handle data, why do we need another one?
Even though Python itself is a high-level language, its primary built-in data structures - lists and dicts - do not lend themselves to working with tabular data. Even summing the items in a large list can be quite slow.

### NumPy
NumPy ('numerical Python') is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others, including Pandas. NumPy's primary data structure is an n-dimensional array, which is much more powerful than a Python list and performs much better.

### Pandas is built directly on NumPy
Nearly all of the data in Pandas is stored in NumPy arrays. That said, it isn't necessary to know much about NumPy when learning Pandas. You can think of Pandas as a higher-level, easier-to-use interface for data analysis than NumPy. It is a good idea to eventually learn NumPy, but for most tasks, Pandas will be the right tool.
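
As a minimal sketch of this relationship, the values of a Series can be pulled out as a NumPy array with the **`to_numpy`** method. The Series below is made up for illustration and is not part of the lesson's data.

```python
import numpy as np
import pandas as pd

# A small, made-up Series of floats (illustrative values only)
s = pd.Series([1.5, 2.5, 3.5])

# to_numpy() exposes the underlying values as a NumPy array
values = s.to_numpy()
print(type(values))   # a numpy.ndarray
print(values.dtype)   # float64
```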

### NumPy vs built-in list performance difference
Let's see the performance difference between summing a NumPy array with one million elements vs a built-in list with the same number of elements.

[1]: images/pandas.png
[2]: http://stackoverflow.com/questions/tagged/pandas
[3]: https://www.youtube.com/watch?v=kHdkFyGCxiY

In [None]:
# create a list of 1 million integers
n = 1000000
my_list = list(range(n))

### Timing Code Execution with %timeit
**`%timeit`** is a magic command that times the execution of the statement that follows it on the same line. The option **`-r`** controls the number of runs and **`-n`** controls the number of iterations within each run.

In [None]:
%timeit -r 3 -n 5 sum(my_list)

#### Compare to NumPy array
We will import NumPy, create an array with the same elements with the **`arange`** function, and time its summation.

In [None]:
import numpy as np

In [None]:
# create array with arange function.
array = np.arange(n)
array

In [None]:
%timeit -r 3 -n 5 array.sum()
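
Outside of IPython or Jupyter, a similar comparison can be sketched with the standard-library **`timeit`** module. This is only a rough equivalent of the cells above: the variable names `list_runs`, `array_runs`, `list_best`, and `array_best` are made up for this sketch, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 1_000_000
my_list = list(range(n))
# explicit 64-bit integers so the sum cannot overflow on any platform
array = np.arange(n, dtype=np.int64)

# repeat=3 and number=5 mirror the -r 3 -n 5 options used with %timeit
list_runs = timeit.repeat(lambda: sum(my_list), repeat=3, number=5)
array_runs = timeit.repeat(lambda: array.sum(), repeat=3, number=5)

# each entry is the total time for 5 calls, so divide by 5 for a per-call time
list_best = min(list_runs) / 5
array_best = min(array_runs) / 5
print(f"list sum:  {list_best * 1000:.2f} ms per call")
print(f"array sum: {array_best * 1000:.2f} ms per call")
```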

### Performance difference
Using the built-in **`sum`** function with a list took approximately 10 times longer than using NumPy's array (the exact ratio depends on your machine), and this was just a simple sum of a list of numbers. The difference tends to grow with the complexity of the operation performed on the data.

### Why is NumPy so fast?
NumPy array operations are executed in pre-compiled C code, which makes for much faster execution times. A Python list, in contrast, must be iterated through at run-time and can hold values of any type, so it is not optimized for large numerical computations.

### Why not NumPy?
Though NumPy is fast and can handle most of our data needs, it is still relatively low-level. Pandas allows easier access to rows and columns, powerful statistical functionality, heterogeneous data, enhanced merging and grouping, and many more data manipulation abilities. 

More info on NumPy can be found [in the official documentation][2]. We will be using some NumPy directly in this course.
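
To make the point about heterogeneous data and easier column access concrete, here is a small sketch. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type, so mixed values are coerced to one type
mixed = np.array([1, 'a', 2.5])
print(mixed.dtype)  # a unicode string dtype: every element became a string

# A DataFrame keeps a separate data type per column and labels each column
df = pd.DataFrame({'city': ['Chicago', 'Boston'],   # hypothetical data
                   'rides': [120, 85],
                   'avg_minutes': [14.5, 11.0]})
print(df.dtypes)
print(df['rides'].sum())  # access a column by its name
```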

### More on magic commands
IPython comes with handy dandy magic commands that give you some great extra functionality. The one I use the most is **`timeit`**, which times how long an operation takes. Magic commands are not Python syntax and only work within IPython or Jupyter notebooks.

Precede the command by `%` for a single line magic and `%%` for entire cell magic. See the [documentation][1] for a huge list of more magic commands.

There is even a magic command to list all the magic commands - see below:

[1]: http://ipython.readthedocs.io/en/stable/interactive/magics.html
[2]: https://docs.scipy.org/doc/numpy/user/index.html

In [None]:
# view all the IPython magic commands
%lsmagic

# Pandas uses tabular (table) data

There are numerous formats for data such as XML, JSON, raw bytes, and many others. But, for our purposes, we will only be examining what everyone thinks of when they think of data - a table. Pandas is built just for analyzing this tabular, rectangular, very deceptively normal concept of data.

There are two primary Pandas objects that account for nearly everything we will be covering. 

### Series and the DataFrame

The **Series** is a single dimension of data. Think of a one dimensional array.

The **DataFrame** is our two-dimensional data structure that looks like any other table of data you have seen with rows and columns.
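
As a quick sketch of the two objects, the snippet below builds a tiny Series and DataFrame by hand and checks their number of dimensions with the **`ndim`** attribute. The values and the column names `a` and `b` are made up; the lesson itself reads its data from a file instead.

```python
import pandas as pd

# One dimension of data: a Series
s = pd.Series([10, 20, 30])

# Two dimensions of data (rows and columns): a DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

print(s.ndim)   # 1
print(df.ndim)  # 2
```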

# Import Pandas and read in data
By convention pandas is imported and aliased as **`pd`**. We will read in the **`bikes`** dataset with the **`read_csv`** function. Its first parameter is the location of the file relative to the current directory. All the data for this class is stored in the **`data`** directory one level above where this notebook is located. 

The two dots in the path passed to **`read_csv`** are interpreted as the directory immediately above the current one.

In [None]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')

## Display DataFrame in Jupyter Notebook
We assigned the output from the **`read_csv`** function to the **`bikes`** variable which is now our DataFrame. Display it by putting the variable name as the last line in a code cell.

In [None]:
bikes

## Default output
By default, Pandas displays at most 60 rows and 20 columns of a DataFrame. These display options (and many others) can be changed and will be shown later.

## Our first methods: `head` and `tail`
A very useful and simple method is **`head`**, which by default returns the first 5 rows of the DataFrame. This avoids the long default output and is something I highly recommend. The **`tail`** method returns the last 5 rows by default.

In [None]:
bikes.head()

In [None]:
bikes.tail()

## First and Last `n` rows
Both the **`head`** and **`tail`** methods take a single parameter **`n`** which controls the number of rows to return:

In [None]:
bikes.head(8)
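
The same behavior can be sketched on a small, self-contained DataFrame. The column name `value` and its contents are hypothetical.

```python
import pandas as pd

# A made-up DataFrame with 10 rows, labeled 0 through 9
df = pd.DataFrame({'value': range(10)})

first_three = df.head(3)  # rows labeled 0, 1, 2
last_two = df.tail(2)     # rows labeled 8, 9

print(first_three)
print(last_two)
```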

# Components of a DataFrame - Columns, Index, and Data
The DataFrame is composed of three separate components that you must know: the **Columns**, the **Index**, and the **Data**. These terms will be used throughout the course and understanding them is vital to your ability to use Pandas.

Take a look at the following graphic of our bikes DataFrame stylized to put emphasis on each component.

![][1]

* The **index** labels the rows
* The **columns** label the columns
* The **index** is also referred to as the **row names/labels**
* The **columns** are also referred to as the **column names/labels** or the **column index**
* An individual element of the index is referred to as an **index label/name** or **row label/name**
* An individual element of the columns is a **column name/label**
* In the notebook display, the index and the columns are always in **bold font**
* Collectively the index and the columns are known as the **axes** (or individually as an **axis**)
* Pandas uses integers to refer to each axis. **0** for the index and **1** for the columns. This is borrowed directly from NumPy
* The actual **data** is always in normal font
* Data is also referred to as the **values**


[1]: images/df_components.png
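
Each component can also be inspected in code. The sketch below uses a tiny hand-made DataFrame (the labels `x`, `y`, `z` and columns `a`, `b` are hypothetical) and shows how the axis integers 0 and 1 are used by a method such as **`sum`**.

```python
import pandas as pd

# A made-up DataFrame with explicit row labels
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])

print(df.index)       # the row labels
print(df.columns)     # the column labels
print(df.to_numpy())  # the data (values) as a NumPy array

# axis=0 moves along the index, giving one total per column
col_totals = df.sum(axis=0)
# axis=1 moves along the columns, giving one total per row
row_totals = df.sum(axis=1)
print(col_totals)
print(row_totals)
```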

# What type of object is `bikes`?
As we said previously **`bikes`** is a DataFrame. Let's verify this:

In [None]:
type(bikes)

### Fully-qualified name
Remember that only the word after the last dot is the class name. The **`bikes`** variable has type **`DataFrame`**. Python displays the fully-qualified name, which includes the package and module where the class was defined.

### Location and module name?
The fully-qualified name mirrors the location on your computer where the class is defined. In this example, **`pandas`** is a directory that contains another directory **`core`** which contains a file **`frame.py`** which defines the **`DataFrame`** class.

### Package, sub-package, and module
A top-level directory that contains Python files and other directories is technically called a **package**. In this example **`pandas`** is the package. All directories within the package are called **sub-packages** such as **`core`**. All Python files (those ending in .py) are called **modules**.
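
The package, sub-package, and module layout can be checked from Python itself. This is a small sketch; the alias `frame_module` is a name made up here, and the printed paths depend on where Pandas is installed on your machine.

```python
import pandas as pd
import pandas.core.frame as frame_module  # the frame.py module inside the core sub-package

print(pd.__path__)            # directory (or directories) of the pandas package
print(frame_module.__file__)  # full path to frame.py on this machine

# The class defined in frame.py is the same object exposed as pd.DataFrame
print(frame_module.DataFrame is pd.DataFrame)  # True
```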

### Where are the packages located?
Third-party packages are installed in the **`site-packages`** directory which itself is set up during Python installation. We can get the actual location with the help of the built-in **`site`** module's **`getsitepackages`** function.

In [None]:
import site

In [None]:
site.getsitepackages()

# Select a single column from a DataFrame - a Series
To select a single column from a DataFrame, pass the name of one of the columns to the indexing operator, **`[]`**. The returned object will be a Pandas Series. Let's choose the column name **`tripduration`**, assign it to a variable, and output it to the screen.

In [None]:
trip_duration = bikes['tripduration']
trip_duration
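
Here is the same selection sketched on a small stand-in DataFrame, so it can be run without the data file. The name `rides` and its values are hypothetical.

```python
import pandas as pd

# A made-up stand-in for the bikes data
rides = pd.DataFrame({'tripduration': [993, 623, 1040],
                      'gender': ['Male', 'Male', 'Female']})

# Passing one column name to the indexing operator returns a Series
trip_duration = rides['tripduration']
print(isinstance(trip_duration, pd.Series))  # True
print(trip_duration)
```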

# `head` and `tail` methods work the same with a Series
Use the **`head`** and **`tail`** methods to condense the output.

In [None]:
trip_duration.tail(3)

# Components of a Series - Index and Data
A Series is simpler than a DataFrame with just a single dimension of data. It has two components - the **index** and the **data**. It is essentially a one-column DataFrame. Let's take a look at a stylized Series graphic.

![](images/series_components.png)

The definitions of the index and data components are the same as they are for a DataFrame.

### Output of Series vs DataFrame
Notice that there is no nice HTML styling for the Series. It's just plain text. Also, below each Series will be some metadata on it - the **name**, **length**, and **dtype**. 

* The **name** is not important right now, but if the Series is formed from a column of a DataFrame it will be set to that column name.
* The **length** is simply the number of values in the Series
* The **dtype** is the data type of the Series. Each column of data must be of only one particular data type. These will be covered in depth later.

It's important to note that this metadata is NOT part of the Series data itself; it is just some extra info Pandas outputs for your information.
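
Although the footer is only display text, the same three pieces of information can be read directly in code. A minimal sketch with a made-up Series:

```python
import pandas as pd

# A made-up Series with an explicit name and data type
s = pd.Series([993, 623, 1040], name='tripduration', dtype='int64')

print(s.name)   # tripduration
print(len(s))   # 3
print(s.dtype)  # int64
```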

# Data Types
Each column of data in a Pandas DataFrame has a **data type**. This is a very similar concept to types in Python. Just like every object has a type, every column has a data type. Every value in each column must be of the same data type.

## Most Common Data Types
The following are the most common data types that appear frequently in DataFrames. 

* **Boolean**
* **Integer**
* **Float**
* **Object** (mainly strings)
* **DateTime** (a specific moment in time)

### Other Data Types
There are three other data types that are less common. We will cover them when necessary.

* **Category**
* **TimeDelta** (a specific amount of time)
* **Period** (a specific time period)

### More on the primary data types

#### Boolean
Boolean columns contain only two values: **`True`** and **`False`**

#### Integer
Whole numbers without a decimal

#### Float
Numbers with decimals

#### Object
The object data type is a bit confusing. It means that each value in the column can be any valid Python object. Nearly all of the time, however, object columns contain **strings**. They can also contain any other Python object such as integers, floats, or even complex types such as lists or dictionaries.

When you see **object** as a data type you should think of **string**.

#### DateTime
A DateTime is a specific moment in time with both a **date** (year, month, day) and a **time** (hour, minute, second, fraction of a second). By default, DateTimes in Pandas have nanosecond precision - 1 billionth of a second.
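
The sketch below builds one column of each common data type and prints the result of **`dtypes`**. All column names and values are hypothetical. The string column is reported as **object** in the Pandas versions this lesson was written for.

```python
import pandas as pd

# One made-up column per common data type
df = pd.DataFrame({
    'is_member': [True, False],                  # boolean
    'rides': [3, 7],                             # integer
    'avg_minutes': [12.5, 8.25],                 # float
    'city': ['Chicago', 'Boston'],               # strings
    'start': pd.to_datetime(['2013-06-28 19:01',
                             '2013-06-28 22:53']),  # DateTime
})

print(df.dtypes)
```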

# Missing Values
Missing value representation is actually a fairly complex issue. If you are curious you can read a [small manifesto][1] on it from the NumPy developers.

## Missing Value Representation, `NaN`,  `None`, and `NaT`
Pandas represents missing values differently based on the data type of the column.

### What do these missing values mean?
* **`NaN`** stands for **not a number** and is technically a floating point value
* **`None`** is the literal Python object **`None`**. This will only be found in **object** columns
* **`NaT`** stands for not a time and is used for missing values in DateTime, TimeDelta, and Period columns

### Missing values for each data type
**Booleans and integers** do not have any representation for missing values in their default data types. This is unfortunate, but a current limitation. If an integer column has missing values, it will be coerced to floats; a boolean column with missing values will be coerced to object.

**Floats** use only **`NaN`** as the missing value.  

**Object** can be any valid Python object so technically you may see **`NaN`**, **`None`**, or **`NaT`**. Object columns created by **`read_csv`** typically use **`NaN`**, while **`None`** usually appears when the data comes from Python objects.

**Datetime**, **TimeDelta**, **Period** will only use **`NaT`** as the missing value.
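
A short sketch of these rules, using made-up values: an integer Series with a missing value ends up as floats holding **`NaN`**, and a DateTime Series holds **`NaT`**.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float
print(type(np.nan))  # <class 'float'>

# Integers with a missing value are coerced to floats
ints_with_gap = pd.Series([1, 2, None])
print(ints_with_gap.dtype)  # float64

# A missing DateTime is represented by NaT
times = pd.Series(pd.to_datetime(['2013-06-28', None]))
print(times)

# isna marks the missing values regardless of their representation
print(ints_with_gap.isna().sum())  # 1
```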

# Finding the data types of each column
The **`dtypes`** DataFrame attribute returns the data type of each column. Let's get the data types of our **`bikes`** DataFrame.

[1]: https://docs.scipy.org/doc/numpy/neps/missing-data.html

In [None]:
bikes.dtypes

# Think string whenever you see object
By default, Pandas stores strings in **object** columns rather than in a dedicated string data type like most databases have, so when you see **object** you should assume that the column consists entirely of strings.

# Why are `starttime` and `stoptime` object data types?
If you look at the output of the **`bikes`** DataFrame, it's apparent that both the **`starttime`** and **`stoptime`** columns are DateTimes but our last output from above is stating that they are objects.

A text file like **`bikes.csv`** carries no data type information, so Pandas must infer the type of each column as it reads the file, and it does not attempt to parse dates by default. We can force Pandas to read these columns as DateTimes with the **`parse_dates`** parameter of the **`read_csv`** function. We must pass it a list of the columns we would like to make datetimes.

Let's re-read the data:

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])

In [None]:
bikes.dtypes
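
The effect of **`parse_dates`** can be reproduced without the data file by reading CSV text from memory. This is only a sketch: the two rows in `csv_text` are made up, and `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Made-up CSV text standing in for bikes.csv
csv_text = """starttime,stoptime,tripduration
2013-06-28 19:01:00,2013-06-28 19:17:00,993
2013-06-28 22:53:00,2013-06-28 23:03:00,623
"""

# Without parse_dates the time columns are not read as DateTimes
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# With parse_dates the listed columns become DateTimes
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['starttime', 'stoptime'])
print(parsed.dtypes)
```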

# What are all those 64's at the end of the data types?
Integers, floats, DateTimes and TimeDeltas all use a particular amount of memory for each of their values. The memory is measured in **`bits`**. By default Pandas uses 64 bits to represent integers, floats, DateTimes, and TimeDeltas. It is possible to use a different number of bits for integers and floats. 

Integers can be either 8, 16, 32, or 64 bits while floats can be 16, 32, or 64 bits, and on some platforms 128. For instance, a 128-bit float column will show up as **`float128`**.

Technically a **`float128`** is a different data type than a **`float64`** but generally you will never have to worry about such a distinction as the operations between different float columns will be the same. It's also very rare to see anything other than 64-bit integers or floats since that is the default and you would need to manually change their size to get a different type.

**Booleans** are stored as 8 bits, also known as a single **byte**. DateTimes and TimeDeltas are always stored as 64 bits. **Objects** can store any Python object, so there is no set amount of memory for each of their values.
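
To see these sizes in practice, the sketch below compares the bytes used by the values of a 64-bit and an 8-bit integer Series, using the **`nbytes`** attribute (8 bits = 1 byte). The values are made up.

```python
import pandas as pd

# Three 64-bit integers: 3 values * 8 bytes each
s64 = pd.Series([1, 2, 3], dtype='int64')
print(s64.nbytes)  # 24

# The same values stored as 8-bit integers: 3 values * 1 byte each
s8 = s64.astype('int8')
print(s8.nbytes)   # 3

# Booleans use one byte per value
flags = pd.Series([True, False, True])
print(flags.nbytes)  # 3
```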

# Getting more Metadata
Metadata is data on the data. The data type of each column is an example of **metadata**. The number of rows and columns is another piece of metadata. We find this with the **`shape`** attribute:

In [None]:
bikes.shape

### Total number of elements with `size` attribute
The **`size`** attribute returns the total number of elements (the number of columns multiplied by the number of rows):

In [None]:
bikes.size
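
The relationship between **`shape`** and **`size`** can be checked on a small made-up DataFrame (the names `n_rows` and `n_cols` are just for this sketch):

```python
import pandas as pd

# A made-up DataFrame with 3 rows and 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df.shape
print(df.shape)  # (3, 2)
print(df.size)   # 6
print(df.size == n_rows * n_cols)  # True
```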

### Get Data Types plus more with the `info` method
The **`info`** DataFrame method displays output similar to **`dtypes`** but also shows the number of non-missing values in each column, along with more info such as:

* The type of object (always a DataFrame)
* The type of index and number of rows
* The number of columns
* The data types of each column and the number of non-missing (a.k.a. non-null) values
* The frequency count of all data types
* The total memory usage

In [None]:
bikes.info()

# Exercises
Use the **`bikes`** DataFrame for the following:

### Problem 1
<span  style="color:green; font-size:16px">Select the column **`events`**, the type of weather that was recorded, and assign it to a variable with the same name. Output the first 10 values of it.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">What type of object is **`events`**?</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Select the last 2 rows of the **`bikes`** DataFrame and assign it to the variable **`bikes_last_2`**. What type of object is **`bikes_last_2`**?</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">What type of object is returned from the **`dtypes`** attribute?</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">What type of object is returned from the **`shape`** attribute?</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">What type of object is returned from the **`info`** method?</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">The memory usage from the **`info`** method isn't correct when you have objects in your DataFrame. Read the docstrings from it and get the true memory usage.</span>

In [None]:
# your code here

# Explore more on your own below
Think of your own questions, then ask and answer them

In [None]:
# explore here