# Episode 1: Introduction to the Series and DataFrame objects

pandas is a powerful Python library for data analysis. This is the first episode in a series that teaches how to use it. In this episode we cover:

* the 1D `Series` object
* the 2D `DataFrame` object
* selecting columns by column name
* selecting rows by index and integer location

First we use the conventional import statement for pandas:

In [1]:
import pandas as pd

For posterity, this is the pandas version the notebook was created with:

In [2]:
pd.__version__

'1.1.1'

# Series

A `Series` is a one-dimensional, labeled data structure, similar to an array or list. Each value is paired with a label, and together the labels form the index. A `Series` can hold any data type. You can think of it as a single column of data.

## Autogenerated index

Let's create a Series by passing in a list of integers:

In [3]:
s = pd.Series([1, 2, 3, 4])

pandas has generated an index for us (the first column of the output). Our values are in the second column:

In [4]:
s

0    1
1    2
2    3
3    4
dtype: int64

We can access the index and values using the `.index` and `.values` attributes:

In [5]:
s.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
s.values

array([1, 2, 3, 4])

## Passing in your own index

We can also pass in our own index when creating a `Series`. Here's an example with a non-numerical index:

In [7]:
s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

In [8]:
s

A    1
B    2
C    3
D    4
dtype: int64

## Accessing data

We can access data by integer location (0 is the first row, 1 is the second row, and so on, as with lists) or by index label.

To access by integer location, we use `.iloc`. To access by index, we use `.loc`. Check out the two different ways of accessing the first row below:

In [9]:
s.iloc[0]

1

In [10]:
s.loc["A"]

1

To underscore the point: this `Series` has no index label `0`, so `s.loc[0]` raises a `KeyError`:

In [11]:
try:
    s.loc[0]
except KeyError:
    print("boom!")

boom!
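
The difference between `.iloc` and `.loc` is easiest to miss when the index labels are themselves integers. Here is a minimal sketch, assuming a made-up `Series` named `scores` whose integer labels run in the opposite order to the row positions:

```python
import pandas as pd

# Hypothetical Series: the integer labels (2, 1, 0) do not match the positions.
scores = pd.Series([10, 20, 30], index=[2, 1, 0])

first_by_position = scores.iloc[0]  # first row, regardless of its label
row_labelled_zero = scores.loc[0]   # the row whose index label is 0

print(first_by_position)  # 10
print(row_labelled_zero)  # 30
```

With the autogenerated index, labels and positions happen to coincide, so both accessors return the same row.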


## Multiple types

We can store multiple types in a `Series`. Let's show this by creating a `Series` with mixed types and see what pandas thinks of it:

In [12]:
s = pd.Series([1, "str", 30.0], name="MultitypeSeries")

In [13]:
s.values

array([1, 'str', 30.0], dtype=object)

The dtype here is `object`, which can hold strings, lists, dicts, or any other arbitrary Python object you want to put in a `Series`.

We also gave this `Series` a name, which we can read back via the `.name` attribute to remember what it holds:

In [14]:
s.name

'MultitypeSeries'
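
To see how pandas arrives at the `object` dtype, it can help to compare a few inputs side by side. The sketch below uses illustrative variable names (`ints`, `floats`, `mixed`) that are not part of the episode:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])           # all integers
floats = pd.Series([1, 2.5, 3])       # integers and floats are upcast to float
mixed = pd.Series([1, "str", 30.0])   # no common numeric type

print(ints.dtype)    # int64
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```

pandas picks the narrowest dtype that can represent every value and falls back to `object` when there is none.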

## Series from dict

In addition to defining `Series` from lists, we can also use Python dicts:

In [15]:
pizza = pd.Series({"Alice": "cheese", "Bob": "pineapple"})

In this example, the keys `Alice` and `Bob` become the index labels:

In [16]:
pizza

Alice       cheese
Bob      pineapple
dtype: object
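
Because the dict keys form the index, the accessors from the previous sections work unchanged. A short sketch that rebuilds the same `pizza` `Series` so it can run on its own:

```python
import pandas as pd

pizza = pd.Series({"Alice": "cheese", "Bob": "pineapple"})

index_labels = list(pizza.index)  # dict keys, in insertion order
bobs_pizza = pizza.loc["Bob"]     # lookup by index label
first_pizza = pizza.iloc[0]       # lookup by integer location

print(index_labels)  # ['Alice', 'Bob']
print(bobs_pizza)    # pineapple
print(first_pizza)   # cheese
```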

# DataFrame

A `DataFrame` is a 2D data structure in which each column is a `Series` and all columns share one index, so each row is labeled just as in a `Series`. It is relational in nature: you can think of it like a spreadsheet. You can mix data types here too: one column can hold all strings and another all ints.

Let's create a `DataFrame` by passing in a dict of columns:

In [17]:
data = {
    "User": ["Bob", "Alice", "Eve"],
    "OS": ["Windows", "macOS", "Linux"],
    "Pizza": ["pineapple", "cheese", "vegan"]
}

In [18]:
df = pd.DataFrame(data)

The keys became our column names:

In [19]:
df

|   | User | OS | Pizza |
|---|---|---|---|
| 0 | Bob | Windows | pineapple |
| 1 | Alice | macOS | cheese |
| 2 | Eve | Linux | vegan |


We can access a row by index using `.loc` as we did above:

In [20]:
df.loc[0]

User           Bob
OS         Windows
Pizza    pineapple
Name: 0, dtype: object
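
Rows of a `DataFrame` can also be selected by integer location with `.iloc`, just as with a `Series`. The sketch below assumes a custom index with the made-up labels `"u1"`, `"u2"`, and `"u3"`, so that labels and positions differ:

```python
import pandas as pd

data = {
    "User": ["Bob", "Alice", "Eve"],
    "OS": ["Windows", "macOS", "Linux"],
    "Pizza": ["pineapple", "cheese", "vegan"],
}
# The index labels are hypothetical, chosen only for this illustration.
labelled = pd.DataFrame(data, index=["u1", "u2", "u3"])

by_label = labelled.loc["u2"]   # row whose index label is "u2"
by_position = labelled.iloc[1]  # second row, counting from 0

print(by_label.equals(by_position))  # True
print(by_position["User"])           # Alice
```

Either way, the selected row comes back as a `Series` whose index is the column names.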

We can access a column by using square brackets:

In [21]:
df["User"]

0      Bob
1    Alice
2      Eve
Name: User, dtype: object

Or with dot syntax, treating the column as an attribute:

In [22]:
df.User

0      Bob
1    Alice
2      Eve
Name: User, dtype: object
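
Dot syntax is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute or method. A minimal sketch, assuming a hypothetical `usage` frame whose column names were picked to trigger both problems:

```python
import pandas as pd

# Hypothetical columns: one name contains a space, the other shadows a method.
usage = pd.DataFrame({"OS version": ["10", "11"], "count": [3, 5]})

# Square brackets work for any column name.
print(usage["OS version"].tolist())  # ['10', '11']
print(usage["count"].tolist())       # [3, 5]

# `usage.count` is the DataFrame.count method, not the "count" column.
print(callable(usage.count))         # True
```

Square brackets are the safer default in code you intend to keep.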

You can also select several columns at once by passing a list of column names. Here we return just the users and their pizza preferences:

In [23]:
cols = ["User", "Pizza"]

In [24]:
df[cols]

|   | User | Pizza |
|---|---|---|
| 0 | Bob | pineapple |
| 1 | Alice | cheese |
| 2 | Eve | vegan |


Finally, you can also access specific cells using `.loc` by passing a tuple with format `(row selection, column selection)`. Let's see an example with a single cell:

In [25]:
df.loc[0, "User"]

'Bob'

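We can also use slicing here. Note that `.loc` slices by label and includes both endpoints, so `0:1` selects the rows labeled 0 and 1: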

In [26]:
df.loc[0:1, "User":"OS"]

|   | User | OS |
|---|---|---|
| 0 | Bob | Windows |
| 1 | Alice | macOS |
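
Slicing behaves differently under `.iloc`, which follows the usual Python convention of excluding the stop position. The sketch below rebuilds the same `df` and compares the two; the `.iloc` slice is an addition for contrast and is not part of the original notebook:

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["Bob", "Alice", "Eve"],
    "OS": ["Windows", "macOS", "Linux"],
    "Pizza": ["pineapple", "cheese", "vegan"],
})

label_slice = df.loc[0:1, "User":"OS"]  # labels 0 through 1, columns User through OS
position_slice = df.iloc[0:1, 0:1]      # position 0 only, first column only

print(label_slice.shape)     # (2, 2)
print(position_slice.shape)  # (1, 1)
```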


Those are some toy examples. In episode 2, we'll create a `DataFrame` from a more realistic dataset and do some exploratory data analysis.