<!--BOOK_INFORMATION-->
*This notebook is based on an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Introducing Pandas Objects

At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

We will start our code sessions with the standard NumPy and Pandas imports:

In [None]:
import numpy as np
import pandas as pd

## The Pandas DataFrame Object

The first fundamental structure in Pandas is the ``DataFrame``.
The ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.

A ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.

### DataFrame as specialized dictionary

We can think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.  Thus, you can think of a Pandas ``DataFrame`` as a dictionary of columns.  For instance, suppose we are planning a wedding and want to record each guest's age and which side of the family they belong to.  Given a list of names (used as the index), a list of family affiliations, and a list of ages, we can create a dataframe from a dictionary of lists.

In [None]:
age_list = [16, 12, 14, 27, 28, 29, 15]
fam_list = ['b', 'g']*3 + ['b']
list_names = ['john', 'ashley', 'rohit', 'malcolm', 'keisha', 'elsa', 'amy']

In [None]:
#dataframe from dictionary
wedding_df = pd.DataFrame({'age':age_list,
                           'family': fam_list},
                          index=list_names)

We can select a column by its name, just as we would look up a value by its key in a dictionary.

In [None]:
# access column by name
wedding_df['age']

### DataFrame as lists of lists or a generalized NumPy array

When a dataframe is built from a list of lists, each inner list is interpreted as one row of data, so the structure of your lists maps directly onto the rows and columns of the dataframe.  For example, if you were to create a list of lists:

In [None]:
# view list of lists
[age_list, fam_list]

In [None]:
# create dataframe from list of lists
pd.DataFrame([age_list, fam_list])

In [None]:
# You can also just transpose a dataframe and label accordingly
from_list_df = pd.DataFrame([age_list, fam_list]).transpose()
from_list_df.columns = ['age', 'family']
from_list_df.index = list_names
from_list_df

In [None]:
# Creating dataframes from numpy arrays is easy!
pd.DataFrame(np.random.randint(20, size=(3,2)),
            index = ['dogs', 'cats', 'sheep'],
            columns = ['barn1', 'barn2'])

You can also build a dataframe from Series, dictionaries of dictionaries, and more!
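
As a minimal sketch of those two alternatives, the example below builds one dataframe from a dictionary of Series and another from a dictionary of dictionaries. The names `ages`, `families`, `df_from_series`, and `df_from_dicts` are illustrative only, and the data is a shortened copy of the wedding example:

```python
import pandas as pd

# Hypothetical data, mirroring the wedding example
ages = pd.Series({'john': 16, 'ashley': 12, 'rohit': 14}, name='age')
families = pd.Series({'john': 'b', 'ashley': 'g', 'rohit': 'b'}, name='family')

# From a dictionary of Series: keys become column names,
# and rows are aligned on the Series indices
df_from_series = pd.DataFrame({'age': ages, 'family': families})

# From a dictionary of dictionaries: outer keys become columns,
# inner keys become the row index
df_from_dicts = pd.DataFrame({'age': {'john': 16, 'ashley': 12},
                              'family': {'john': 'b', 'ashley': 'g'}})

print(df_from_series)
print(df_from_dicts)
```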

### Try it yourself!

Here, you'll try to create a pandas dataframe of states' populations and areas.  The data can be found below.  
1.  Create a dataframe with columns `areas` and `population` with variable name `states_df`.  You can do this any way you like: from lists of lists, from Series, from dictionaries, etc.  Make sure both the columns and the rows have explicit indices.
2.  Access the `areas` column.
3.  Access the column names.

**Populations**: 

California: 38332521  
Texas: 26448193  
New York: 19651127  
Florida: 19552860  
Illinois: 12882135  

**Areas**: 

California: 423967  
Texas: 695662  
New York: 141297  
Florida: 170312  
Illinois: 149995  

In [None]:
pops = [38332521, 26448193, 19651127, 19552860, 12882135]
areas = [423967, 695662, 141297, 170312, 149995]
states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']

In [None]:
# Create from lists
pd.DataFrame({'population':pops,
             'areas': areas}, 
            index = states)

In [None]:
# Create from dictionaries
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [None]:
states_df = pd.DataFrame({'population': population_dict,
             'areas': area_dict})

In [None]:
states_df['areas']

In [None]:
states_df.columns


## Properties/Attributes of Pandas Series and DataFrames
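There are many things that you might want to know about a Pandas object.  How many rows are there?  How many columns?  What is the datatype?  What is the object's type?  The attributes below answer these questions.  The examples use `states_df` from the exercise above and `ser_pop_list`, the population Series created in the Series exercise later in this notebook, so run that cell first.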

`shape`: the dimensions of the dataframe or series, as a tuple (for a dataframe, rows then columns)

In [None]:
print(ser_pop_list.shape)
states_df.shape

`dtype`: the datatype of the series

In [None]:
print(ser_pop_list.dtype)
print(states_df['population'].dtype)
# states_df.dtype raises an AttributeError; use states_df.dtypes for per-column datatypes

`ndim`: number of axes/array dimensions (usually most helpful for numpy arrays)

In [None]:
print(ser_pop_list.ndim)
print(states_df.ndim)

`size`: the number of elements in the object

In [None]:
print(ser_pop_list.size)
print(states_df.size)
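
Because the cells above rely on objects defined elsewhere in the notebook, here is a self-contained sketch of the same attributes. The names `demo_states` and `demo_pop` are hypothetical, and the data is a three-state subset of the exercise data:

```python
import pandas as pd

demo_states = pd.DataFrame({'population': [38332521, 26448193, 19651127],
                            'areas': [423967, 695662, 141297]},
                           index=['California', 'Texas', 'New York'])
demo_pop = demo_states['population']   # a single column is a Series

print(demo_pop.shape)                   # one axis: (rows,)
print(demo_states.shape)                # two axes: (rows, columns)
print(demo_pop.ndim, demo_states.ndim)  # number of axes of each object
print(demo_pop.size, demo_states.size)  # total number of elements
print(demo_pop.dtype)                   # a Series has a single dtype
print(demo_states.dtypes)               # a DataFrame has one dtype per column
```

For a dataframe, `size` is the product of the two entries in `shape`.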

## The Pandas Series Object

The next fundamental structure is the Pandas ``Series``: a one-dimensional array of indexed data.
It can be created from a list, NumPy array, or dictionary, as shown in the following sections.

In [None]:
age_list = [16, 12, 14, 27, 28, 29, 15]

In [None]:
# create series from list
data = pd.Series([16, 12, 14, 27, 28, 29, 15])
data

0    16
1    12
2    14
3    27
4    28
5    29
6    15
dtype: int64

In [None]:
# create series from an existing list variable
data_list = pd.Series(age_list)
data_list

0    16
1    12
2    14
3    27
4    28
5    29
6    15
dtype: int64

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [None]:
# get data values
data.values

array([16, 12, 14, 27, 28, 29, 15], dtype=int64)

The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [None]:
#get indices
data.index

RangeIndex(start=0, stop=7, step=1)

As with Python lists and NumPy arrays, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
# element access
data[1]

12

In [None]:
# slicing
data[1:3]

1    12
2    14
dtype: int64

In [None]:
# can we do mathematical operations on a list?
age_list + 1

TypeError: can only concatenate list (not "int") to list

In [None]:
# can we do mathematical operations on a pandas series?
data + 1

0    17
1    13
2    15
3    28
4    29
5    30
6    16
dtype: int64

**The Pandas ``Series`` data structure is _more_ than a list and _more_ than a NumPy array.  As we will see, the Pandas ``Series`` is much more general and flexible than the lists and one-dimensional NumPy arrays that it emulates.**

### ``Series``: lists++ and numpy++

From what we've seen so far, it may look like ``Series`` objects are similar to NumPy arrays and lists.  However, a ``Series`` differs from each of them in important ways.

**From NumPy arrays**: The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.  A Python list, like a NumPy array, offers only positional access.  This explicit index definition gives the ``Series`` object additional capabilities.  For example, the index need not be an integer, but can consist of values of any desired type.  If we wish, we can use strings as an index.  You can see this demonstrated below.

**From Python lists**:  There are substantial differences between Pandas Series and Python lists; in fact, NumPy arrays were created to deal with some of the challenges of working with lists for computation and analysis.  A list can hold values of several different datatypes at once, which makes element-wise computation slower and more awkward.
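
A short sketch of the datatype point, using hypothetical values: a list accepts any mix of types, while a ``Series`` built from mixed values falls back to the generic `object` dtype and a ``Series`` of integers gets an integer dtype that supports element-wise arithmetic.

```python
import pandas as pd

mixed_list = [16, 'twelve', 14.5]      # a list can hold any mix of types
mixed_series = pd.Series(mixed_list)   # falls back to the generic object dtype
int_series = pd.Series([16, 12, 14])   # homogeneous integers get an integer dtype

print(mixed_series.dtype)
print(int_series.dtype)

# Arithmetic on a list needs an explicit loop or comprehension ...
ages_plus_one_list = [age + 1 for age in [16, 12, 14]]
# ... while a Series applies the operation to every element at once
ages_plus_one_series = int_series + 1
print(ages_plus_one_series)
```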

Let's investigate this behavior more below.

In [None]:
age_list = [16, 12, 14, 27, 28, 29, 15]
names_list = ['john', 'ashley', 'rohit', 'malcolm', 'keisha', 'elsa', 'amy']

In [None]:
# create Series with explicit indices
data = pd.Series(age_list,
                 index=names_list,
                 name = 'age')
data

john       16
ashley     12
rohit      14
malcolm    27
keisha     28
elsa       29
amy        15
Name: age, dtype: int64

And the item access works as expected:

In [None]:
# data access using explicit indexing
data['john']

16

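However, we can also still access the data by *implicit* (positional) index.  Note that recent Pandas versions deprecate this positional fallback for integer keys on a non-integer index and recommend ``data.iloc[3]`` instead: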

In [None]:
# data access using implicit indexing
data[3]

27

We can even use non-contiguous or non-sequential indices:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [None]:
data[5]

**Question.** The indices here are numerical.  Think of the difference between these _explicit_ numerical indices and the _implicit_ numerical indices.  How do you think it will work if we try to access this Series by _position_ or _implicit_ index?

In [None]:
data[0]

KeyError: 0

**Remember this behavior.  There's a difference between implicit and explicit indices, and we need to disambiguate which one we're using.  This can be done using indexers called `loc` and `iloc`.  We will see more about those in the next notebook.**
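
As a brief preview, and only a minimal sketch, the example below rebuilds the same Series under the hypothetical name `demo` and uses `loc` for the explicit (label) index and `iloc` for the implicit (positional) index:

```python
import pandas as pd

demo = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = demo.loc[5]       # explicit index: the row labelled 5
by_position = demo.iloc[0]   # implicit index: the first row

print(by_label)      # 0.5
print(by_position)   # 0.25
```

With `iloc`, position 0 works even though no row is labelled 0.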

### Series as specialized dictionary

You can also think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

The ``Series``-as-dictionary analogy can be made even more clear by constructing a ``Series`` object directly from a Python dictionary:

In [None]:
age_dict = {'john':16, 
          'ashley':12,
          'rohit':14,
          'malcolm':27,
          'keisha':28,
          'elsa':29,
          'amy':15}

In [None]:
#get value using key
age_dict['rohit']

14

In [None]:
#get value by position
age_dict[0]

KeyError: 0

### Create pandas Series from dictionary

In [None]:
# Create pandas series from dictionary
data_from_dict = pd.Series(age_dict,
                     name = 'age')
data_from_dict

john       16
ashley     12
rohit      14
malcolm    27
keisha     28
elsa       29
amy        15
Name: age, dtype: int64

**Question.**  Do you think you need to provide an index here?  Why or why not?

In [None]:
# Get the age of keisha by explicit index
data_from_dict['keisha']

28

In [None]:
# Get the age of keisha by implicit index
data_from_dict[4]

28

#### What happens if you do supply an index?

In [None]:
#supply full index
age_inds = data_from_dict.index
age_series_2 = pd.Series(age_dict, index=age_inds, name='age')
age_series_2

john       16
ashley     12
rohit      14
malcolm    27
keisha     28
elsa       29
amy        15
Name: age, dtype: int64

In [None]:
# supply partial index
age_inds_sel = ['john', 'keisha']
age_series_3 = pd.Series(age_dict, index=age_inds_sel, name='age')
age_series_3

john      16
keisha    28
Name: age, dtype: int64
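
The supplied index can also contain labels that are missing from the dictionary. In the sketch below, `'zoe'` is a hypothetical label with no matching key, so Pandas fills that entry with `NaN` and the values become floating point:

```python
import pandas as pd

ages = {'john': 16, 'keisha': 28}

# 'zoe' is not a key in the dictionary, so its value is missing (NaN)
partial = pd.Series(ages, index=['keisha', 'zoe'], name='age')
print(partial)
```

Only the labels listed in `index` appear in the result, so `'john'` is dropped here.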

## Try it yourself!
1. Create a pandas series from a list and from a dictionary containing the information below.  The variable names should be `ser_pop_list` and `ser_pop_dict`.  The series name should be `population`.
2. What is the implicit index of New York?
3. Look up the population of New York using both implicit and explicit indexing.
4. **Bonus**: Look up the population of New York, Florida, and Illinois using slicing.

**Populations**: 

California: 38332521  
Texas: 26448193  
New York: 19651127  
Florida: 19552860  
Illinois: 12882135  

In [None]:
population_list = [38332521, 26448193, 19651127, 19552860, 12882135]
population_names = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [None]:
ser_pop_list = pd.Series(population_list, index=population_names, name='population')
ser_pop_list

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [None]:
ser_pop_dict = pd.Series(population_dict, name='population')
ser_pop_dict

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [None]:
ser_pop_dict[2]

19651127

In [None]:
ser_pop_dict['New York']

19651127

In [None]:
ser_pop_dict['New York':]

New York    19651127
Florida     19552860
Illinois    12882135
Name: population, dtype: int64
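
One difference worth seeing before moving on: a slice by label includes its end point, while a slice by position excludes it, as with Python lists. A minimal sketch, rebuilding the population data under the hypothetical name `pop`:

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135}, name='population')

label_slice = pop.loc['Texas':'Florida']   # includes the 'Florida' end point
position_slice = pop.iloc[1:3]             # excludes position 3, like a list slice

print(label_slice)
print(position_slice)
```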

We'll discuss some of the quirks of Pandas indexing and slicing in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).

## The Pandas Index Object

We have seen here that both the ``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data.
This ``Index`` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values).
Those views have some interesting consequences in the operations available on ``Index`` objects.
As a simple example, let's construct an ``Index`` from a list of integers:

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

### Index as immutable array

The ``Index`` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:

In [None]:
ind[1]

In [None]:
ind[::2]

``Index`` objects also have many of the attributes familiar from NumPy arrays:

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

One difference between ``Index`` objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means:

In [None]:
ind[1] = 0

This immutability makes it safer to share indices between multiple ``DataFrame`` objects and arrays, without the potential for side effects from inadvertent index modification.
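
The sketch below shows what the failed assignment looks like in practice and one way to get a modified index without mutating the original. The names `demo_ind` and `new_ind` are illustrative, and `Index.insert` is just one of several methods that return a new ``Index``:

```python
import pandas as pd

demo_ind = pd.Index([2, 3, 5, 7, 11])

try:
    demo_ind[1] = 0            # item assignment is not supported
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print(err)

# To "change" an index, build a new one; the original stays untouched
new_ind = demo_ind.insert(1, 4)
print(new_ind)
print(demo_ind)
```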

### Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The ``Index`` object supports the same kinds of operations as Python's built-in ``set`` data structure, so unions, intersections, differences, and other combinations can be computed through methods with familiar names:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB)  # intersection

In [None]:
indA.union(indB)  # union

In [None]:
indA.symmetric_difference(indB)  # symmetric difference

Older versions of Pandas also accepted the set operators ``&``, ``|``, and ``^`` for these operations. In current versions those operators act element by element instead, so the methods shown here are the reliable way to do set arithmetic on an ``Index``.

# What we've learned in this lesson:

1. Pandas data types: DataFrames
    - Advantages over Python lists of lists, dictionaries, and NumPy multidimensional arrays
    - Syntax for creation
    - Indexing and selection
2. Pandas data types: Series
    - Advantages over Python lists, dictionaries, and numpy arrays
    - Syntax for creation
    - Indexing and selection
3. Pandas Index
    
# Next lesson:
- More on Indexing and Selection