### Objectives

- Load the Python Data Analysis Library (Pandas).
- Describe what a `DataFrame` and a `Series` are.
- Distinguish `DataFrame` attributes from methods.

### Content to cover

* import pandas as pd
* DataFrame/Series
* .index/.columns versus .head()


In [45]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Pandas is table oriented

> I want to start using Pandas

In [46]:
import pandas as pd

To load the Pandas package and start working with it, import the package. The community-agreed alias for `pandas` is `pd`, and the Pandas documentation assumes throughout that Pandas has been imported as `pd`.

### Pandas table data representation

![](../schemas/01_table_dataframe.svg)

> I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and the cabin class (categories 1, 2 or 3) data.

In [47]:
my_dataframe = pd.DataFrame({
    'Name': ["Braund, Mr. Owen Harris", 
             "Allen, Mr. William Henry", 
             "Bonnell, Miss. Elizabeth"], 
    'Age': [22, 35, 58],
    'Pclass': pd.Categorical([3, 3, 1])}
    )
my_dataframe

|   | Name | Age | Pclass |
|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | 22 | 3 |
| 1 | Allen, Mr. William Henry | 35 | 3 |
| 2 | Bonnell, Miss. Elizabeth | 58 | 1 |


A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.

__Note:__ In most situations, data tables stored in a file are the starting point of an analysis. The [next tutorial](2_read_write.ipynb) provides more insight into reading data.

### Attributes of a table

> What are the column names, row names and type of data in my data table?

Each `DataFrame` has a number of attributes. These are characteristics of the table and can be requested with a `.` followed by the attribute name. To start with, the following attributes are worth remembering:

The __column__ names of the `DataFrame`:

In [48]:
my_dataframe.columns

Index(['Name', 'Age', 'Pclass'], dtype='object')

The row labels are defined by the __index__ of a `DataFrame`:

In [49]:
my_dataframe.index

RangeIndex(start=0, stop=3, step=1)

The __shape__, the number of rows and columns, of a `DataFrame`:

In [50]:
my_dataframe.shape

(3, 3)

The type of data (integers, floats, characters, datetimes, ...) of the individual columns is given by the __dtypes__ attribute:

In [51]:
my_dataframe.dtypes

Name        object
Age          int64
Pclass    category
dtype: object

Each column contains data of a single data type. `object` is the Pandas dtype shown here for character data.
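Because each column has a single data type, the type of one column can also be inspected on its own. The following is a minimal sketch that rebuilds the table from above so it runs standalone; the variable names `age_dtype`, `pclass_dtype`, `age_is_integer` and `pclass_is_categorical` are introduced only for this illustration.

```python
import pandas as pd

# Same small table as above, rebuilt so the snippet runs on its own.
my_dataframe = pd.DataFrame({
    'Name': ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    'Age': [22, 35, 58],
    'Pclass': pd.Categorical([3, 3, 1])})

# A single column has exactly one dtype, available as the `dtype` attribute.
age_dtype = my_dataframe['Age'].dtype
pclass_dtype = my_dataframe['Pclass'].dtype

# Check which family of types a column belongs to.
age_is_integer = pd.api.types.is_integer_dtype(my_dataframe['Age'])
pclass_is_categorical = isinstance(pclass_dtype, pd.CategoricalDtype)

print(age_dtype, pclass_dtype)
print(age_is_integer, pclass_is_categorical)
```

Note that `dtypes` (plural) belongs to the whole table, whereas `dtype` (singular) belongs to a single column.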

__To user guide:__ For an overview of the supported dtypes of Pandas, see :ref:`basics.dtypes`

### Functionalities of a table

> I'm interested in a short summary of my data table

In [52]:
my_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
Name      3 non-null object
Age       3 non-null int64
Pclass    3 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 275.0+ bytes


The output provides some information on the `DataFrame`:

- It is indeed a `DataFrame`.
- Each row was assigned an index (row label) of 0 to N-1, where N is the number of rows in the `DataFrame`. Pandas will do this by default if an index is not specified. Don't worry, this can be changed later.
- There are 3 entries, i.e. rows.
- The table has 3 columns, each of them with all values provided, so no missing values.
- One of the columns consists of character data, one of integers and the last one of categorical data.
- The approximate amount of RAM used to hold the DataFrame is provided as well.
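The default 0 to N-1 row labels mentioned in the list above can be replaced by passing an `index` when the table is created. This is only a sketch: the labels `'p1'`, `'p2'`, `'p3'` and the variable names `labelled` and `default` are hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Hypothetical row labels, used only to illustrate a non-default index.
labelled = pd.DataFrame(
    {'Name': ["Braund, Mr. Owen Harris",
              "Allen, Mr. William Henry",
              "Bonnell, Miss. Elizabeth"],
     'Age': [22, 35, 58]},
    index=['p1', 'p2', 'p3'])
print(labelled.index)

# Without an explicit index, Pandas falls back to the labels 0 to N-1.
default = pd.DataFrame({'Age': [22, 35, 58]})
print(default.index)
```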

As illustrated by the `info()` method, you can _do_ things with a `DataFrame`. Pandas provides a lot of functionality to work with a `DataFrame`, each piece of it a _method_ you can apply to a `DataFrame`. As methods are functions, do not forget the parentheses `()`.
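The difference between an attribute and a method can be sketched with `.shape` and `.head()`, the pair listed in the content to cover. The table is rebuilt so the snippet runs on its own, and the variable names `shape`, `first_rows` and `not_called` are introduced only for this example.

```python
import pandas as pd

# Same small table as above, rebuilt so the snippet runs on its own.
my_dataframe = pd.DataFrame({
    'Name': ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    'Age': [22, 35, 58],
    'Pclass': pd.Categorical([3, 3, 1])})

# Attribute: a plain value, requested without parentheses.
shape = my_dataframe.shape

# Method: an action, called with parentheses (and optional arguments).
# head(2) returns the first 2 rows as a new DataFrame.
first_rows = my_dataframe.head(2)

# Leaving out the parentheses gives the method object itself, not its result.
not_called = my_dataframe.head

print(shape)
print(first_rows)
print(callable(not_called))
```

If a notebook cell shows something starting with `<bound method ...>` instead of a table, the parentheses were most likely forgotten.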

> I'm interested in some basic statistics of the numerical data of my data table

In [53]:
my_dataframe.describe()

|   | Age |
|---|---|
| count | 3.0 |
| mean | 38.333333 |
| std | 18.230012 |
| min | 22.0 |
| 25% | 28.5 |
| 50% | 35.0 |
| 75% | 46.5 |
| max | 58.0 |


As the `Name` and `Pclass` columns are character and categorical data respectively, these are by default not taken into account by the `describe` method.
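The default can be overridden with the `include` argument of `describe`. The sketch below passes `include='all'` so that the non-numerical columns are summarized too; the table is rebuilt so the snippet runs standalone, and the name `summary_all` is introduced only for this example.

```python
import pandas as pd

# Same small table as above, rebuilt so the snippet runs on its own.
my_dataframe = pd.DataFrame({
    'Name': ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    'Age': [22, 35, 58],
    'Pclass': pd.Categorical([3, 3, 1])})

# include='all' summarizes every column, not only the numerical ones.
summary_all = my_dataframe.describe(include='all')
print(summary_all)
```

Statistics that do not apply to a column (for example the mean of `Name`) show up as `NaN` in the combined result.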

__To user guide:__ For more options of `describe`, see :ref:`basics.describe`

### Pandas Series

![](../schemas/01_table_series.svg)

In [54]:
ages = pd.Series([22, 35, 58])
ages

0    22
1    35
2    58
dtype: int64

A single-column version of a `DataFrame` is a Pandas `Series`. It does not have column names, but still has the row index:

In [55]:
ages.index

RangeIndex(start=0, stop=3, step=1)
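The link between the two structures can be made concrete: selecting a single column of a `DataFrame` with square brackets returns a `Series`. Column selection is not covered in this tutorial, so treat the following as a small preview; the name `age_column` is introduced only for this sketch.

```python
import pandas as pd

# A reduced version of the table above, rebuilt so the snippet runs on its own.
my_dataframe = pd.DataFrame({
    'Name': ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    'Age': [22, 35, 58]})

# Selecting one column by its name returns a Series.
age_column = my_dataframe['Age']

print(type(age_column))
# The Series remembers the column name and shares the row index of the table.
print(age_column.name)
print(age_column.index.equals(my_dataframe.index))
```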

Similar to a Pandas `DataFrame`, you can _do_ things with a `Series` and apply a method:

In [56]:
ages.describe()

count     3.000000
mean     38.333333
std      18.230012
min      22.000000
25%      28.500000
50%      35.000000
75%      46.500000
max      58.000000
dtype: float64

__To user guide:__ For an explanation of why both `Series` and `DataFrame` are needed, see :ref:`TODO` ([label](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html#why-more-than-one-data-structure) to add in sphinx)

## REMEMBER

- Working with Pandas always requires `import pandas as pd`
- A table of data is stored as a Pandas `DataFrame`
- A single-column version of a `DataFrame` is a `Series`
- A Pandas `DataFrame` and `Series` have attributes (i.e. characteristics) and methods (i.e. actions on them).
- Methods require `()`

__To user guide:__ A more extended introduction to `DataFrame` and `Series` is provided in :ref:`dsintro`.