<h1>Introduction to Pandas</h1>

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneously typed numerical array data.

Since becoming an open source project in 2010, pandas has matured into a quite large library that's applicable in a broad set of real-world use cases. The developer community has grown to over 2,500 distinct contributors, who've been helping build the project as they used it to solve their day-to-day data problems. The vibrant pandas developer and user communities have been a key part of its success.

Throughout the rest of the notebook, I use the following import conventions for NumPy and pandas:

In [1]:
import numpy as np

import pandas as pd

Thus, whenever you see `pd.` in code, it’s referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are so frequently used:

In [2]:
from pandas import Series, DataFrame

<h2>Introduction to pandas Data Structures</h2>

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid foundation for a wide variety of data tasks.

<h3>Series</h3>

A Series is a one-dimensional array-like object containing a sequence of values of the same type (similar to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [3]:
obj = pd.Series([4, 7, -5, 3])

obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers `0` through `N - 1` (where N is the length of the data) is created. You can get the array representation and index object of the Series via its `array` and `index` attributes, respectively:

In [4]:
# obj.array

obj.index

RangeIndex(start=0, stop=4, step=1)
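
As a small illustration of the two attributes just mentioned, the following sketch rebuilds the same Series and inspects both. The names `values` and `labels` are introduced here for illustration only:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# .array wraps the underlying values; .index holds the row labels
values = obj.array
labels = obj.index

print(values)
print(list(labels))  # default labels run from 0 through N - 1
```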

Often, you'll want to create a Series with an index identifying each data point with a label:

In [5]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

obj2

d    4
b    7
a   -5
c    3
dtype: int64

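Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values. In the cell below, `["c", "a", "d"]` is interpreted as a list of indices, even though it contains strings instead of integers: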

In [6]:
# obj2["a"]

obj2["d"] = 6

obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [7]:
# obj2[obj2 > 0]

obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [8]:
import numpy as np

np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
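
The Boolean filtering case mentioned above can be sketched the same way. This is a minimal example that assumes `obj2` already holds the value 6 under `"d"` (as it does after the earlier assignment); the names `positive` and `absolute` are illustrative:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# Boolean filtering keeps the label attached to each surviving value
positive = obj2[obj2 > 0]
print(positive)

# A NumPy ufunc returns a Series that carries the same index
absolute = np.abs(obj2)
print(absolute)
```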

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary:

In [9]:
# "b" in obj2

"e" in obj2

False
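
A short sketch of the dictionary-like behavior, again assuming the values of `obj2` after the earlier assignment. The variable names are illustrative; `get` is the Series counterpart of `dict.get`:

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests look at the index labels, like dictionary keys
has_b = "b" in obj2
has_e = "e" in obj2

# get returns a fallback instead of raising KeyError for a missing label
fallback = obj2.get("e", 0)

print(has_b, has_e, fallback)
```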

Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary:

In [10]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

obj3 = pd.Series(sdata)

obj3

# obj3.to_dict()

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you are only passing a dictionary, the index in the resulting Series will respect the order of the keys according to the dictionary's `keys` method, which depends on the key insertion order. You can override this by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [11]:
states = ["California", "Ohio", "Oregon", "Texas"]

obj4 = pd.Series(sdata, index=states)

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in `sdata` were placed in the appropriate locations, but since no value for `"California"` was found, it appears as `NaN` (Not a Number), which pandas uses to mark missing or NA values. Since `"Utah"` was not included in `states`, it is excluded from the resulting object.

I will use the terms “missing,” “NA,” or “null” interchangeably to refer to missing data. The `isna` and `notna` functions in pandas should be used to detect missing data:

In [12]:
# pd.isna(obj4)

pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
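
Series also exposes these checks as instance methods. The sketch below rebuilds `obj4` and uses the method form to count and filter missing entries; `missing_mask`, `n_missing`, and `present` are names chosen for this example:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Method form of pd.isna
missing_mask = obj4.isna()

# Count the missing entries and keep only the labels that have data
n_missing = int(missing_mask.sum())
present = obj4[obj4.notna()]

print(n_missing)
print(present)
```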

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [13]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
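
If the `NaN` produced for labels present on only one side is not what you want, one option is the `add` method with `fill_value`. The following is a sketch of that alternative, not something the cell above does; `plain` and `filled` are illustrative names:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Plain + yields NaN wherever a label is missing from either side
plain = obj3 + obj4

# fill_value=0 substitutes 0 for a value missing on one side only;
# California stays NaN because it has no value on either side
filled = obj3.add(obj4, fill_value=0)

print(filled)
```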

Both the Series object itself and its index have a `name` attribute, which integrates with other areas of pandas functionality:

In [14]:
obj4.name = "population"

obj4.index.name = "state"

obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
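
One place where the names show up, sketched under the same setup as above: converting the Series to a DataFrame with `reset_index` uses the index name and the Series name as column labels. The name `as_frame` is illustrative:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])
obj4.name = "population"
obj4.index.name = "state"

# reset_index moves the index into a column; both names become column labels
as_frame = obj4.reset_index()
print(as_frame.columns.tolist())
```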

<h3>DataFrame</h3>

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.

In [15]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
        
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and the columns are placed according to the order of the keys in `data` (which depends on their insertion order in the dictionary):

In [16]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [17]:
# frame.head()

frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
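
For context on the cell above: `head` and `tail` return the first and last five rows by default, and both accept a row count. A minimal sketch, with `first_five` and `last_two` as illustrative names:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

first_five = frame.head()   # default: first five rows
last_two = frame.tail(2)    # explicit count: last two rows

print(first_five)
print(last_two)
```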


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order, and if you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result:

In [18]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

frame2

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation:

In [19]:
# frame2["state"]

frame2.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
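
The two notations are not always interchangeable. In this dataset, the column name `pop` collides with the existing `DataFrame.pop` method, so attribute access returns the method rather than the column. A sketch of the difference, with illustrative variable names:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# Both notations return the same Series for a column named "state"
same = frame2["state"].equals(frame2.state)

# "pop" is also a DataFrame method, so only brackets reach the column
pop_column = frame2["pop"]
pop_attribute = frame2.pop

print(same, type(pop_column).__name__, callable(pop_attribute))
```

Bracket notation works for any column name, which makes it the safer default.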

Columns can be modified by assignment. For example, the empty `debt` column could be assigned a scalar value or an array of values:

In [20]:
frame2["debt"] = 16.5

frame2

   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    Ohio  1.7  16.5
2  2002    Ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5
5  2003  Nevada  3.2  16.5


In [21]:
frame2["debt"] = np.arange(6.)

frame2

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0
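
Two details of column assignment are worth a sketch: a list or array must match the DataFrame's length, while a Series is aligned on the index and leaves unmatched rows as missing. The Series `val` and its values are made up for this example:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# A list or array must have exactly one value per row
try:
    frame2["debt"] = [1.0, 2.0]
except ValueError as exc:
    print("length mismatch:", exc)

# A Series is aligned on the index; rows without a match become NaN
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
frame2["debt"] = val

print(frame2)
```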


<style>
    div {
    margin-bottom: 15px;
    padding: 4px 12px;
    width: 1130px;
    }

    .danger {
    background-color: #ffdddd;
    border-left: 6px solid #f44336;
    }
</style>

<div class="danger">
  <p style="color:black;"><strong>Note!</strong> Assigning a column that doesn’t exist will create a new column.</p>
</div>

The `del` keyword will delete columns, as with a dictionary. As an example, you could first add a new column of Boolean values where the `state` column equals `"Ohio"`, by assigning the comparison `frame2["state"] == "Ohio"` to `frame2["eastern"]`.

<div class="danger">
  <p style="color:black;"><strong>Note!</strong> New columns cannot be created with the frame2.eastern dot attribute notation.</p>
</div>

The `del` keyword can then be used to remove a column. Here it removes the `state` column:

In [22]:
del frame2["state"]

frame2.columns

Index(['year', 'pop', 'debt'], dtype='object')
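
The add-then-delete sequence described above can be sketched on a fresh copy of `frame2`, so the notebook's own `frame2` is unaffected if you run this in a separate session. The column name `eastern` comes from the note above; `with_eastern` and `eastern_values` are illustrative:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# Bracket assignment creates the new Boolean column
frame2["eastern"] = frame2["state"] == "Ohio"
with_eastern = frame2.columns.tolist()
eastern_values = frame2["eastern"].tolist()
print(with_eastern)

# del removes the column again, as it would a dictionary key
del frame2["eastern"]
print(frame2.columns.tolist())
```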

Another common form of data is a nested dictionary of dictionaries. If the nested dictionary is passed to DataFrame, pandas will interpret the outer dictionary keys as the columns and the inner keys as the row indices:

In [23]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

frame3 = pd.DataFrame(populations)

frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9
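
By default, the inner keys are combined to form the row index. Passing an explicit `index` changes that: only the requested labels appear, and labels with no data are filled with `NaN`. A sketch, with `frame_sub` as an illustrative name:

```python
import pandas as pd

populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

# With an explicit index, only these row labels are used
frame_sub = pd.DataFrame(populations, index=[2001, 2002, 2003])

print(frame_sub)
```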


You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [24]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9


Dictionaries of Series are treated in much the same way:

In [25]:
# .iloc makes the positional slicing explicit
pdata = {"Ohio": frame3["Ohio"].iloc[:-1],
         "Nevada": frame3["Nevada"].iloc[:2]}

pd.DataFrame(pdata)


      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


DataFrame's `to_numpy` method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [26]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])
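
The dtype of the returned array depends on the columns. A sketch using the earlier `frame` data: numeric columns share a common numeric dtype, while a mix of strings and numbers falls back to a generic `object` array. The names `numeric` and `mixed` are illustrative:

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# Integer and float columns are upcast to a common float dtype
numeric = frame[["year", "pop"]].to_numpy()

# Mixing strings and numbers yields an array of Python objects
mixed = frame.to_numpy()

print(numeric.dtype, mixed.dtype)
```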