# Tutorial 3.1: pandas's Fundamental Objects: `Series` and `DataFrame`
Python for Data Analytics | Module 3    
Professor James Ng

**Congratulations on making it to module 3!**  

Today, we begin our exploration of the *pandas* library, which is built on top of NumPy and turns Python into a practical tool for handling tabular data.

In this tutorial, we will be taking a look at the two fundamental objects of the `pandas` library: the `Series` and `DataFrame` objects.

## Importing *pandas*
Just as we had to import the NumPy library before we could use it, we have to do the same with the `pandas` library.

Also, just as we saw before, the community convention is to give the library a short alias (in this case `pd`).

We will follow this convention since it is well established, but remember that short, arbitrary aliases like this are generally not good coding practice, as their meaning is ambiguous to anyone not already familiar with the community convention.

In [None]:
# Import NumPy and Pandas via community convention aliases
import numpy as np
import pandas as pd

## `DataFrame` & `Series` Basics

Let's download and import one of our data sets so that we've got something practical to work with. The following cell downloads our Chicago crime CSV file, then uses `pd.read_csv` to create a `DataFrame` object from it.

In [None]:
# Download the Chicago recent crime dataset from OSF
!curl -L https://osf.io/u6xqa/download --create-dirs -o data-sets/chicago-recent-crime.csv

chicago_crime = pd.read_csv('data-sets/chicago-recent-crime.csv')
type(chicago_crime)

### `DataFrames` are made up of an `index` and one or more `Series`
Every `DataFrame` object is composed of an **`index`** and one or more **`Series`** objects.
Let's demonstrate this by looking at the first few elements of our `chicago_crime` object.

In [None]:
chicago_crime.head()

The bold numbers running down the left-hand side of the table show the **`index`** of the **`DataFrame`**. The bold strings running across the top are the names of the nested **`Series`** objects.

In [None]:
# You can directly access the index of a DataFrame
chicago_crime.index

In [None]:
# Or access the values of the rows as a 2D NumPy array
chicago_crime.values
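
To see these pieces on something small enough to read at a glance, here is a minimal sketch. The `toy_crime` DataFrame and its values are hypothetical, made up for illustration, and are not taken from the Chicago data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_crime = pd.DataFrame({
    'PRIMARY DESCRIPTION': ['THEFT', 'BATTERY', 'ASSAULT'],
    'WARD': [12, 3, 27],
})

# The index: a default integer index running 0, 1, 2
print(toy_crime.index)

# The names of the nested Series objects (the column labels)
print(toy_crime.columns)

# The row values as a 2D NumPy array with shape (rows, columns)
print(isinstance(toy_crime.values, np.ndarray), toy_crime.values.shape)
```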

### Every `Series` is made up of an index and a NumPy array

Let's retrieve one of the nested **`Series`** objects:

In [None]:
primary_description = chicago_crime['PRIMARY DESCRIPTION']
type(primary_description)

#### Pythonista Note
Did you see how I passed the name of the **`Series`** object that I wanted to the `chicago_crime` DataFrame? I used the same sort of syntax you'd use to retrieve a data element from a **`dict`**.

As we continue to move along, we'll discover that **`DataFrame`** and **`dict`** types share many traits.
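
As a small preview of those shared traits, the sketch below tries a few `dict`-style operations on a DataFrame. The `toy_crime` data and the `BEAT` column name are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_crime = pd.DataFrame({
    'PRIMARY DESCRIPTION': ['THEFT', 'BATTERY'],
    'WARD': [12, 3],
})

# Like a dict, a DataFrame maps keys (column names) to values (Series)
ward = toy_crime['WARD']
print(type(ward))

# The `in` operator tests for a column name, just as it tests for a dict key
print('WARD' in toy_crime)   # True
print('BEAT' in toy_crime)   # False

# keys() returns the column labels
print(list(toy_crime.keys()))

# As with a dict, asking for a missing key raises a KeyError
try:
    toy_crime['BEAT']
except KeyError:
    print('No column named BEAT')
```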

In [None]:
# Just like the DataFrame object, Series objects also have a 
# `head()` method, which retrieves a few records from the object.
primary_description.head()

So, as you can see, we've got two columns here.  
* The first column is another representation of the **`index`** we saw earlier on the `chicago_crime` DataFrame. Whether you access the index from the enclosing `DataFrame` or from any of its nested `Series` objects, *pandas* will point you to the same index.

* The second column, which holds the values of the series, is nothing more than our good friend, the 1-dimensional NumPy array.

You can retrieve the index and NumPy array separately from a series as follows:

In [None]:
# Get the Series index
primary_description.index

In [None]:
# Get the NumPy array from the Series
primary_description.values
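
The relationship between a `Series`, its index, and its values can be checked on a small example. This is only a sketch: `toy_crime` and its values are hypothetical. It compares index labels with `equals()` rather than testing object identity, since whether the very same Python object is returned may depend on the *pandas* version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_crime = pd.DataFrame({
    'PRIMARY DESCRIPTION': ['THEFT', 'BATTERY', 'ASSAULT'],
    'WARD': [12, 3, 27],
})
ward = toy_crime['WARD']

# The Series and its enclosing DataFrame carry the same index labels
print(ward.index.equals(toy_crime.index))

# The values of a numeric Series come back as a 1-dimensional NumPy array
ward_values = ward.values
print(type(ward_values), ward_values.ndim)
```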

## Going a Bit Deeper
So far, the essential difference between a NumPy **`ndarray`** and the *pandas* **`DataFrame`** and **`Series`** objects is their indexes.

*NumPy* arrays have indexes as well, but they are implicit (hidden) and always integers. We used them all the time to retrieve specific elements via index notation, or views via slice notation.

In contrast, *pandas* makes indexes explicit and directly accessible to the user. A NumPy array has no **`index`** property that you can access directly, the way we just did on a *pandas* `Series` object.

Furthermore, *pandas* objects are not limited to integer-based indexes. You could have indexes of strings, floats, booleans, dates, etc. Basically any scalar (single-value) type.
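
The contrast can be sketched in a few lines. The variable names and temperature-like values below are hypothetical, chosen only to show an implicit positional index next to an explicit index of dates.

```python
import numpy as np
import pandas as pd

# A NumPy array only supports implicit, position-based access
readings = np.array([20.5, 21.0, 19.8])
print(readings[0])                  # element at position 0
print(hasattr(readings, 'index'))   # False: no index attribute to inspect

# A Series can carry an explicit index of dates (hypothetical values)
by_date = pd.Series(
    [20.5, 21.0, 19.8],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
)
print(by_date.index)

# Look up a value by its label rather than by its position
print(by_date[pd.Timestamp('2024-01-02')])
```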

### Non-Integer Indexes

In [None]:
# Create a `Series` object from a dictionary.
# This results in a string-based index.
simple_series = pd.Series({
    'Scala':'A new kid on the block',
    'Python': 'Best Language Ever!'})
simple_series.index

The same holds true for the index of a **`DataFrame`** object. When we created our `chicago_crime` DataFrame from the CSV file, it generated an integer-based index, which is the default behavior.

But we could change that. For instance, we could tell `pandas` to make the `CASE#` column the index:

In [None]:
# Switch the index to the case numbers.
chicago_crime.index = chicago_crime['CASE#']
chicago_crime.index

In [None]:
# Now take a look at what the `head()` method will return.
chicago_crime.head()
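
Assigning to `.index` is one way to do this. *pandas* also provides a `set_index()` method, and the sketch below compares the two on a hypothetical two-row DataFrame (the case numbers are invented). The notable difference is that direct assignment leaves `CASE#` in place as a regular column, while `set_index()` by default moves it out of the columns.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_crime = pd.DataFrame({
    'CASE#': ['AA000001', 'AA000002'],
    'WARD': [12, 3],
})

# Direct assignment: the CASE# values become the index,
# and CASE# also remains as a regular column
assigned = toy_crime.copy()
assigned.index = assigned['CASE#']
print(list(assigned.columns))   # ['CASE#', 'WARD']

# set_index returns a new DataFrame and, by default,
# moves the column out of the columns and into the index
moved = toy_crime.set_index('CASE#')
print(list(moved.columns))      # ['WARD']
print(moved.index.name)         # CASE#
```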

### DataFrame columns can be of different data types

In [None]:
# Compare the data types of the columns BLOCK and WARD. They are not the same, which is exactly what we want!

# Steps: 1) retrieve the relevant Series from the DataFrame, 2) get the NumPy array from it, 
# 3) retrieve its dtype attribute.

In [None]:
# A notebook cell only displays its last expression, so print both dtypes explicitly
print(chicago_crime['BLOCK'].values.dtype)
print(chicago_crime['WARD'].values.dtype)

In [None]:
# Or a shortcut (no need to explicitly retrieve numpy array)
chicago_crime['WARD'].dtype
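
To inspect every column at once, a DataFrame also exposes a `dtypes` attribute. The sketch below uses a hypothetical two-column DataFrame with invented values, so the dtypes it reports need not match those of the real Chicago data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_crime = pd.DataFrame({
    'BLOCK': ['001XX N STATE ST', '002XX W LAKE ST'],
    'WARD': [42, 27],
})

# One dtype per column, returned as a Series indexed by column name
print(toy_crime.dtypes)

# Helper functions in pd.api.types test for broad families of dtypes
print(pd.api.types.is_integer_dtype(toy_crime['WARD']))    # True
print(pd.api.types.is_integer_dtype(toy_crime['BLOCK']))   # False
```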

In [None]:
# Now check out pandas's info() method.
chicago_crime.info()

In [None]:
# What about the describe() method? Why does it give summary statistics for only three columns? 
# We will revisit this method in Tutorial 3.4.
chicago_crime.describe()