# Introduction to the data structures of pandas

To get started with pandas, you should first familiarise yourself with its two most important data structures: [Series](#Series) and [DataFrame](#DataFrame).

## Series

A series is a one-dimensional array-like object containing a sequence of values (with types similar to the NumPy types) and an associated array of data labels, called an index. The simplest series is formed from just an array of data:

In [1]:
import numpy as np
import pandas as pd

In [2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s

0    0.213350
1    0.837670
2    0.185274
3   -0.461108
4   -1.141058
5    0.988285
6    0.818473
dtype: float64

The string representation of an interactively displayed series shows the index on the left and the values on the right. Since we have not specified an index for the data, a default index is created consisting of the integers `0` to `N - 1` (where `N` is the length of the data). You can get the array representation and the index object of the series via its [pandas.Series.array](https://pandas.pydata.org/docs/reference/api/pandas.Series.array.html) and [pandas.Series.index](https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html) attributes respectively:

In [3]:
s.array

<PandasArray>
[0.21334991211667095,  0.8376697225713394, 0.18527373863898977,
 -0.4611075664275736, -1.1410578335394383,  0.9882846723071994,
  0.8184725837300063]
Length: 7, dtype: float64

In [4]:
s.index

RangeIndex(start=0, stop=7, step=1)

Often you will want to create an index that identifies each data point with a label:

In [5]:
idx = pd.date_range("2022-01-31", periods=7)

s2 = pd.Series(rng.normal(size=7), index=idx)

In [6]:
s2

2022-01-31   -0.863745
2022-02-01    0.637322
2022-02-02    1.156080
2022-02-03    0.392360
2022-02-04   -0.567734
2022-02-05    1.760273
2022-02-06   -0.514952
Freq: D, dtype: float64

<div class="alert alert-block alert-info">

**See also:**

* [Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
</div>

Unlike NumPy arrays, a series lets you use the labels in its index to select individual values or a group of values:

In [7]:
s2['2022-02-02']

1.1560796753476548

In [8]:
s2[['2022-02-02', '2022-02-03', '2022-02-04']]

2022-02-02    1.156080
2022-02-03    0.392360
2022-02-04   -0.567734
dtype: float64

Here `['2022-02-02', '2022-02-03', '2022-02-04']` is interpreted as a list of index labels, even though it contains strings rather than integers.
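The following minimal sketch shows the same kinds of label-based selection on a small series with string labels. The data and the names `temps`, `single`, `subset` and `sliced` are made up for illustration and do not appear elsewhere in this tutorial:

```python
import pandas as pd

# Hypothetical example data with string labels as index
temps = pd.Series([3.5, 4.1, 2.8, 5.0], index=["mon", "tue", "wed", "thu"])

single = temps["tue"]             # a single label returns a scalar
subset = temps[["mon", "thu"]]    # a list of labels returns a Series
sliced = temps.loc["tue":"thu"]   # a label slice includes the end label
```

Note that, unlike positional slices in Python, a slice of labels contains both end points.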

When using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication or applying mathematical functions, the link between index and value is preserved:

In [9]:
s2[s2 > 0]

2022-02-01    0.637322
2022-02-02    1.156080
2022-02-03    0.392360
2022-02-05    1.760273
dtype: float64

In [10]:
s2 ** 2

2022-01-31    0.746056
2022-02-01    0.406180
2022-02-02    1.336520
2022-02-03    0.153947
2022-02-04    0.322322
2022-02-05    3.098561
2022-02-06    0.265176
Freq: D, dtype: float64

In [11]:
np.exp(s2)

2022-01-31    0.421580
2022-02-01    1.891410
2022-02-02    3.177452
2022-02-03    1.480471
2022-02-04    0.566808
2022-02-05    5.814024
2022-02-06    0.597529
Freq: D, dtype: float64
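Since the values above are random, here is a small sketch with fixed, made-up values that lets you check the results by hand. The names `s`, `positive`, `doubled` and `roots` are only used for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical series with fixed values
s = pd.Series([-1.0, 0.0, 2.0], index=["a", "b", "c"])

positive = s[s > 0]        # Boolean filtering keeps the labels of the matches
doubled = s * 2            # scalar multiplication keeps all labels
roots = np.sqrt(s.abs())   # a NumPy function returns a Series with the same index
```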

You can also think of a series as a fixed-length, ordered _dict_, since it maps index values to data values. It can be used in many contexts where you could use a _dict_:

In [12]:
'2022-02-02' in s2

True

In [13]:
'2022-02-09' in s2

False
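The relationship to dicts also works in the other direction. As a small sketch, assuming a made-up dict `key_names`, you can build a series directly from a dict and convert it back with `to_dict`:

```python
import pandas as pd

# Hypothetical dict; its keys become the index, its values the data
key_names = {"U+0000": "NUL", "U+0001": "Ctrl-A", "U+0002": "Ctrl-B"}
keys = pd.Series(key_names)

has_label = "U+0001" in keys   # `in` checks the index labels ...
has_value = "NUL" in keys      # ... not the values
round_trip = keys.to_dict()    # back to a plain dict
```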

### Missing data

I will use `NA` and `null` synonymously to indicate missing data. The functions `isna` and `notna` in pandas should be used to identify missing data:

In [14]:
pd.isna(s2)

2022-01-31    False
2022-02-01    False
2022-02-02    False
2022-02-03    False
2022-02-04    False
2022-02-05    False
2022-02-06    False
Freq: D, dtype: bool

In [15]:
pd.notna(s2)

2022-01-31    True
2022-02-01    True
2022-02-02    True
2022-02-03    True
2022-02-04    True
2022-02-05    True
2022-02-06    True
Freq: D, dtype: bool

Series also has these as instance methods:

In [16]:
s2.isna()

2022-01-31    False
2022-02-01    False
2022-02-02    False
2022-02-03    False
2022-02-04    False
2022-02-05    False
2022-02-06    False
Freq: D, dtype: bool

Dealing with missing data is discussed in more detail in the section [Managing missing data with pandas](../../clean-prep/nulls.ipynb).
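Because `s2` contains no missing values, all results above are uniform. The following sketch uses a made-up series `readings` with two gaps to show how the Boolean masks can be used, for example to count or to filter out missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: both np.nan and None are treated as missing
readings = pd.Series([1.5, np.nan, 3.0, None], index=["a", "b", "c", "d"])

mask = readings.isna()                # True where a value is missing
n_missing = int(mask.sum())           # count the missing values
present = readings[readings.notna()]  # keep only the existing values
```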

A useful feature of Series for many applications is the automatic alignment by index labels in arithmetic operations:

In [17]:
idx = pd.date_range("2022-02-07", periods=7)

s3 = pd.Series(rng.normal(size=7), index=idx)

In [18]:
s2, s3

(2022-01-31   -0.863745
 2022-02-01    0.637322
 2022-02-02    1.156080
 2022-02-03    0.392360
 2022-02-04   -0.567734
 2022-02-05    1.760273
 2022-02-06   -0.514952
 Freq: D, dtype: float64,
 2022-02-07    0.900220
 2022-02-08   -0.289397
 2022-02-09   -0.684469
 2022-02-10   -1.225400
 2022-02-11   -1.272900
 2022-02-12   -0.661624
 2022-02-13    0.872438
 Freq: D, dtype: float64)

In [19]:
s2 + s3

2022-01-31   NaN
2022-02-01   NaN
2022-02-02   NaN
2022-02-03   NaN
2022-02-04   NaN
2022-02-05   NaN
2022-02-06   NaN
2022-02-07   NaN
2022-02-08   NaN
2022-02-09   NaN
2022-02-10   NaN
2022-02-11   NaN
2022-02-12   NaN
2022-02-13   NaN
Freq: D, dtype: float64

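Because the indexes of `s2` and `s3` do not overlap, every label is missing from one of the two operands, so each sum is `NaN`. If you have experience with SQL, this is similar to a full outer [JOIN](https://en.wikipedia.org/wiki/Join_(SQL)) operation on the index labels.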
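Alignment becomes easier to see when the indexes overlap only partially. The following is a minimal sketch with made-up series `a` and `b`. It also uses the `add` method with `fill_value`, which treats a label missing from one operand as the given value instead of producing `NaN`:

```python
import pandas as pd

# Hypothetical series that share only the label "y"
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

total = a + b                     # only "y" exists in both, the rest is NaN
filled = a.add(b, fill_value=0)   # missing operands are replaced by 0
```

Here `total["y"]` is `12.0` and the other three entries are `NaN`, whereas `filled` contains a number for every label.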

Both the Series object itself and its index have a `name` attribute, which is picked up by other areas of the pandas functionality:

In [20]:
s3.name = 'floats'
s3.index.name = 'date'

s3

date
2022-02-07    0.900220
2022-02-08   -0.289397
2022-02-09   -0.684469
2022-02-10   -1.225400
2022-02-11   -1.272900
2022-02-12   -0.661624
2022-02-13    0.872438
Freq: D, Name: floats, dtype: float64
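One place where the names are picked up is the conversion to a DataFrame. As a sketch with made-up values, `reset_index` turns the index into a regular column and uses both names as column labels:

```python
import pandas as pd

# Hypothetical two-value series with named values and named index
s = pd.Series([0.9, -0.3], index=pd.date_range("2022-02-07", periods=2))
s.name = "floats"
s.index.name = "date"

frame = s.reset_index()        # index becomes the column "date"
column_names = list(frame.columns)
```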

## DataFrame

A DataFrame represents a rectangular data table and contains an ordered, named collection of columns, each of which can have a different value type. The DataFrame has both a row index and a column index.

<div class="alert alert-block alert-info">

**Note:**

Although a DataFrame is two-dimensional, you can also use it to represent higher-dimensional data in a table format with hierarchical indexing using [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html), [combine](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine.html) and [Reshaping](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html).
</div>

In [21]:
data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],
        'Decimal': [0, 1, 2, 3, 4, 5],
        'Octal': ['001', '002', '003', '004', '004', '005'],
        'Key': ['NUL', 'Ctrl-A', 'Ctrl-B', 'Ctrl-C', 'Ctrl-D', 'Ctrl-E']}

df = pd.DataFrame(data)

df

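|   | Code | Decimal | Octal | Key |
|---|------|---------|-------|-----|
| 0 | U+0000 | 0 | 001 | NUL |
| 1 | U+0001 | 1 | 002 | Ctrl-A |
| 2 | U+0002 | 2 | 003 | Ctrl-B |
| 3 | U+0003 | 3 | 004 | Ctrl-C |
| 4 | U+0004 | 4 | 004 | Ctrl-D |
| 5 | U+0005 | 5 | 005 | Ctrl-E |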


For large DataFrames, the `head` method returns only the first five rows by default:

In [22]:
df.head()

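|   | Code | Decimal | Octal | Key |
|---|------|---------|-------|-----|
| 0 | U+0000 | 0 | 001 | NUL |
| 1 | U+0001 | 1 | 002 | Ctrl-A |
| 2 | U+0002 | 2 | 003 | Ctrl-B |
| 3 | U+0003 | 3 | 004 | Ctrl-C |
| 4 | U+0004 | 4 | 004 | Ctrl-D |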
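The number of rows can also be passed explicitly, and `tail` does the same for the end of the DataFrame. A short sketch, assuming a made-up DataFrame with eight rows:

```python
import pandas as pd

# Hypothetical DataFrame with the rows 0 to 7
numbers = pd.DataFrame({"Decimal": range(8)})

default_rows = numbers.head()    # first five rows
first_three = numbers.head(3)    # first three rows
last_two = numbers.tail(2)       # last two rows
```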


You can also specify columns and their order:

In [23]:
pd.DataFrame(data, columns=['Code', 'Key'])

|   | Code | Key |
|---|------|-----|
| 0 | U+0000 | NUL |
| 1 | U+0001 | Ctrl-A |
| 2 | U+0002 | Ctrl-B |
| 3 | U+0003 | Ctrl-C |
| 4 | U+0004 | Ctrl-D |
| 5 | U+0005 | Ctrl-E |


If you pass a column that is not contained in the dict, it appears in the result filled with missing values (`NaN`):

In [24]:
df2 = pd.DataFrame(data, columns=['Code', 'Decimal', 'Octal', 'Description', 'Key'])

df2

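|   | Code | Decimal | Octal | Description | Key |
|---|------|---------|-------|-------------|-----|
| 0 | U+0000 | 0 | 001 | NaN | NUL |
| 1 | U+0001 | 1 | 002 | NaN | Ctrl-A |
| 2 | U+0002 | 2 | 003 | NaN | Ctrl-B |
| 3 | U+0003 | 3 | 004 | NaN | Ctrl-C |
| 4 | U+0004 | 4 | 004 | NaN | Ctrl-D |
| 5 | U+0005 | 5 | 005 | NaN | Ctrl-E |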


You can retrieve a column in a DataFrame with a dict-like notation:

In [25]:
df['Code']

0    U+0000
1    U+0001
2    U+0002
3    U+0003
4    U+0004
5    U+0005
Name: Code, dtype: object

This way you can also make a column the index:

In [26]:
df2 = pd.DataFrame(data,
                   columns=['Decimal', 'Octal', 'Description', 'Key'],
                   index=df['Code'])

df2

| Code | Decimal | Octal | Description | Key |
|------|---------|-------|-------------|-----|
| U+0000 | 0 | 001 | NaN | NUL |
| U+0001 | 1 | 002 | NaN | Ctrl-A |
| U+0002 | 2 | 003 | NaN | Ctrl-B |
| U+0003 | 3 | 004 | NaN | Ctrl-C |
| U+0004 | 4 | 004 | NaN | Ctrl-D |
| U+0005 | 5 | 005 | NaN | Ctrl-E |


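Rows can be retrieved by label with the [pandas.DataFrame.loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) attribute; for selection by integer position, use [pandas.DataFrame.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) instead: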

In [27]:
df2.loc['U+0001']

Decimal             1
Octal             002
Description       NaN
Key            Ctrl-A
Name: U+0001, dtype: object
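To make the difference between selection by label and by position concrete, here is a minimal sketch on a shortened, made-up version of the table. The name `codes` is only used for this illustration:

```python
import pandas as pd

# Hypothetical excerpt of the table with the code points as index
codes = pd.DataFrame(
    {"Decimal": [0, 1, 2], "Key": ["NUL", "Ctrl-A", "Ctrl-B"]},
    index=["U+0000", "U+0001", "U+0002"],
)

by_label = codes.loc["U+0001"]      # row with the label "U+0001"
by_position = codes.iloc[1]         # second row, counted from 0
cell = codes.loc["U+0002", "Key"]   # single value by row and column label
```

Both `by_label` and `by_position` refer to the same row here, because the label `U+0001` happens to sit at position 1.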

Column values can be changed by assignment. For example, a scalar value or an array of values could be assigned to the empty _Description_ column:

In [28]:
df2['Description'] = [
    'Null character',
    'Start of Heading',
    'Start of Text',
    'End-of-text character',
    'End-of-transmission character',
    'Enquiry character'
    ]

df2

| Code | Decimal | Octal | Description | Key |
|------|---------|-------|-------------|-----|
| U+0000 | 0 | 001 | Null character | NUL |
| U+0001 | 1 | 002 | Start of Heading | Ctrl-A |
| U+0002 | 2 | 003 | Start of Text | Ctrl-B |
| U+0003 | 3 | 004 | End-of-text character | Ctrl-C |
| U+0004 | 4 | 004 | End-of-transmission character | Ctrl-D |
| U+0005 | 5 | 005 | Enquiry character | Ctrl-E |


Assigning a non-existing column creates a new column.
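As a small sketch of both cases, assuming a made-up DataFrame `codes` and the hypothetical column names `Category` and `Double`: a scalar is repeated for every row, while a series or list has to match the length of the DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with a single column
codes = pd.DataFrame({"Decimal": [0, 1, 2]}, index=["U+0000", "U+0001", "U+0002"])

codes["Category"] = "control"           # scalar: the same value in every row
codes["Double"] = codes["Decimal"] * 2  # new column computed from an existing one
```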

Columns can be removed with [pandas.DataFrame.drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) and displayed with `pandas.DataFrame.columns`:

In [29]:
df3 = df2.drop(columns=['Decimal', 'Octal'])

In [30]:
df2.columns

Index(['Decimal', 'Octal', 'Description', 'Key'], dtype='object')

In [31]:
df3.columns

Index(['Description', 'Key'], dtype='object')

Another common form of data is a nested dict of dicts:

In [32]:
u = {
    'U+0006': {'Decimal': '6', 'Octal': '006', 'Description': 'Acknowledge character', 'Key': 'Ctrl-F'},
    'U+0007': {'Decimal': '7', 'Octal': '007', 'Description': 'Bell character', 'Key': 'Ctrl-G'},
    }

df4 = pd.DataFrame(u)

df4

|   | U+0006 | U+0007 |
|---|--------|--------|
| Decimal | 6 | 7 |
| Octal | 006 | 007 |
| Description | Acknowledge character | Bell character |
| Key | Ctrl-F | Ctrl-G |


You can transpose the DataFrame, i.e. swap the rows and columns, with a similar syntax to a NumPy array:

In [33]:
df4.T

|   | Decimal | Octal | Description | Key |
|---|---------|-------|-------------|-----|
| U+0006 | 6 | 006 | Acknowledge character | Ctrl-F |
| U+0007 | 7 | 007 | Bell character | Ctrl-G |


<div class="alert alert-block alert-warning">

**Warning:**

Note that transposing discards the data types of the columns if the columns do not all have the same data type. If you transpose and then transpose back, the previous type information may therefore be lost. In this case, the columns become arrays of pure Python objects.
</div>
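The loss of type information can be checked with a minimal sketch. The DataFrame `mixed` is made up for this illustration and combines an integer column with a string column. `infer_objects` is one way to let pandas try to recover better types afterwards:

```python
import pandas as pd

# Hypothetical DataFrame with one integer and one string column
mixed = pd.DataFrame({"Decimal": [6, 7], "Key": ["Ctrl-F", "Ctrl-G"]})

round_trip = mixed.T.T                 # transpose and transpose back
before = mixed["Decimal"].dtype        # an integer dtype
after = round_trip["Decimal"].dtype    # object: the integer dtype is gone

# Let pandas try to infer better dtypes for the object columns
restored = round_trip.infer_objects()["Decimal"].dtype
```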

The keys in the inner dicts are combined to form the index in the result. This is not the case when an explicit index is specified:

In [34]:
df5 = pd.DataFrame(u, index=['Decimal', 'Octal', 'Key'])
df5

|   | U+0006 | U+0007 |
|---|--------|--------|
| Decimal | 6 | 7 |
| Octal | 006 | 007 |
| Key | Ctrl-F | Ctrl-G |
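An explicit index not only selects and orders the rows. As a sketch with a shortened, made-up version of the dict: a label in the index that is missing from the inner dicts results in a row of missing values:

```python
import pandas as pd

# Hypothetical nested dict without the key "Octal" in the inner dicts
nested = {
    "U+0006": {"Decimal": "6", "Key": "Ctrl-F"},
    "U+0007": {"Decimal": "7", "Key": "Ctrl-G"},
}

# The index determines the order of the rows; "Octal" has no data
table = pd.DataFrame(nested, index=["Key", "Decimal", "Octal"])
octal_missing = bool(table.loc["Octal"].isna().all())
```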
