# The pandas DataFrame and Series

Any data file you load with `pandas` will be transformed into a `DataFrame` object.

You usually understand this as a table, with rows and columns.

import pandas as pd
df = pd.read_excel(io='data/football_players.xlsx', index_col='Player', parse_dates=['Birthday'])
df

| Player | Matches | Goals | Assists | Yellows | Reds | Yellows2 | Minutes | Birthday |
|---|---|---|---|---|---|---|---|---|
| CR7 | 923 | 689 | 231 | 112 | 7 | 4 | 75372 | 1985-02-05 |
| Messi | 800 | 679 | 312 | 82 | 1 | 0 | 65299 | 1987-06-24 |
| Lewandowski | 662 | 498 | 131 | 68 | 0 | 1 | 65299 | 1988-08-21 |
| Benzema | 737 | 370 | 182 | 15 | 0 | 0 | 52245 | 1987-12-19 |


Checking the type of the object shows that it is a `DataFrame`, a class from the `pandas` library.

pandas.core.frame.DataFrame

Now access one of the columns, for example `Goals`:

Player
CR7            689
Messi          679
Lewandowski    498
Benzema        370
Name: Goals, dtype: int64

You will get a pandas `Series` object.

pandas.core.series.Series
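
The cells that produced these outputs are not shown, so the following is only a minimal sketch of both type checks. It assumes a small hand-built `DataFrame` with values copied from the table above, standing in for the Excel file.

```python
import pandas as pd

# Hypothetical stand-in for data/football_players.xlsx (values taken from the table above).
df = pd.DataFrame(
    {"Matches": [923, 800, 662, 737], "Goals": [689, 679, 498, 370]},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

# Single brackets with one column name select that column as a Series.
goals = df["Goals"]

print(type(df).__name__)     # DataFrame
print(type(goals).__name__)  # Series
```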

It's important to note that a `DataFrame` is a collection of `Series` objects: each column is a `Series`, and all of them share the same index.
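
One way to see this, sketched here with hypothetical stand-in data rather than the original file, is to loop over the columns with `DataFrame.items()`. Every column comes back as a `Series` that carries the same index as the `DataFrame`.

```python
import pandas as pd

# Hypothetical stand-in data (values taken from the table above).
df = pd.DataFrame(
    {"Matches": [923, 800, 662, 737], "Goals": [689, 679, 498, 370]},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

# items() yields (column name, column as Series) pairs.
columns = dict(df.items())

for name, column in columns.items():
    same_index = column.index.equals(df.index)
    print(name, type(column).__name__, same_index)
```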

Sometimes you want to work with the whole `DataFrame`, for example to summarize every column at once.

| | Matches | Goals | Assists | Yellows | Reds | Yellows2 | Minutes | Birthday |
|---|---|---|---|---|---|---|---|---|
| count | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4 |
| mean | 780.500000 | 559.000000 | 214.000000 | 69.250000 | 2.000000 | 1.250000 | 64553.750000 | 1987-04-18 06:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| max | 923.000000 | 689.000000 | 312.000000 | 112.000000 | 7.000000 | 4.000000 | 75372.000000 | 1988-08-21 00:00:00 |
| std | 110.485293 | 153.559977 | 77.041115 | 40.557572 | 3.366502 | 1.892969 | 9480.693624 | |
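
A summary table like this one is the kind of result `DataFrame.describe()` returns. The cell itself is not shown, so this is a sketch under that assumption, using two numeric columns of hypothetical stand-in data.

```python
import pandas as pd

# Hypothetical stand-in data (values taken from the first table).
df = pd.DataFrame(
    {"Matches": [923, 800, 662, 737], "Goals": [689, 679, 498, 370]},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

# describe() runs on the whole DataFrame: one set of statistics per column.
summary = df.describe()

print(summary.loc[["count", "mean", "max"]])
```

A single method call produced statistics for every column, which is what working at the `DataFrame` level means here.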


Other times you want to work with a single `Series`, for example to add up one column.

2236
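
The value 2236 equals the sum of the `Goals` column in the first table, so the missing cell was probably a `Series` reduction. A minimal sketch, assuming that is what was computed:

```python
import pandas as pd

# Hypothetical stand-in data (Goals values taken from the first table).
df = pd.DataFrame(
    {"Goals": [689, 679, 498, 370]},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

# Select one column (a Series) and reduce it to a single number.
total_goals = df["Goals"].sum()

print(total_goals)  # 2236
```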

But what are the main differences between them?

This tutorial addresses many of the most common errors and questions about the `DataFrame` and `Series` objects.

## Questions

1. Why do you get a `KeyError` when you try to access a column?
2. When should you use `reset_index`?
3. What is the difference between a `DataFrame` and a `Series`?
4. Why are `dtypes` important?
5. How do you access special functions in a `Series` object?
6. How do you create a new column from the existing ones?

## Answers

### Dissecting pandas objects

The `DataFrame` and `Series` objects are the core of the `pandas` library.

They are more than a mere table or column.

<div>
<img src="src/DataFrame.jpg" width="45%"/>
<img src="src/Series.jpg" width="45%"/>
</div>

They are supercharged with many functions and attributes that allow you to manipulate the data in many ways.

Let's address the most essential concepts in the following sections.

### Pandas index vs column

An essential concept you must understand is the difference between the index and the columns.

A common source of confusion is a `DataFrame` whose `index` looks like a column but is not one.

Suppose you try to access `Player` as if it were a column:

KeyError: 'Player'
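
The failing cell is not shown. A minimal sketch that reproduces this kind of error, assuming `Player` was set as the index as in the `read_excel` call at the top, and using hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in data; Player is the index, not a column.
df = pd.DataFrame(
    {"Goals": [689, 679, 498, 370]},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

print(df.columns.tolist())  # only real columns are listed here
print(df.index.name)        # the index carries the name 'Player'

raised = False
try:
    df["Player"]  # bracket selection only looks among the columns
except KeyError as error:
    raised = True
    print(f"KeyError: {error}")
```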

You get a `KeyError` because the leftmost "column" is not a column but the `index`.

A real column such as `Birthday` can be selected by name, whereas the `Player` values must be accessed through the `index` attribute:

Player
CR7           1985-02-05
Messi         1987-06-24
Lewandowski   1988-08-21
Benzema       1987-12-19
Name: Birthday, dtype: datetime64[ns]

Index(['CR7', 'Messi', 'Lewandowski', 'Benzema'], dtype='object', name='Player')
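
A minimal sketch of the two kinds of access, assuming hypothetical stand-in data with `Player` as the index and `Birthday` as a datetime column:

```python
import pandas as pd

# Hypothetical stand-in data (values taken from the first table).
df = pd.DataFrame(
    {"Birthday": pd.to_datetime(["1985-02-05", "1987-06-24", "1988-08-21", "1987-12-19"])},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

birthdays = df["Birthday"]  # a real column: selected by name with brackets
players = df.index          # the row labels: reached through the index attribute

print(players.name)
print(players.tolist())
```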

Where will you fail the most with this concept?

Data visualization.

### The index is not a column

Let's say you want to create a bar plot with `Player` on the x-axis:

ValueError: Value of 'x' is not the name of a column in 'data_frame'. Expected one of ['Matches', 'Goals', 'Assists', 'Yellows', 'Reds', 'Yellows2', 'Minutes', 'Birthday', 'Total Cards'] but received: Player

Since `Player` is the index, not a column, you cannot pass it by name as a column.

### Why reset the index?

To use `Player` as a column, reset the index:

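| | Player | Matches | Goals | Assists | Yellows | Reds | Yellows2 | Minutes | Birthday | Total Cards |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CR7 | 923 | 689 | 231 | 112 | 7 | 4 | 75372 | 1985-02-05 | 123 |
| 1 | Messi | 800 | 679 | 312 | 82 | 1 | 0 | 65299 | 1987-06-24 | 83 |
| 2 | Lewandowski | 662 | 498 | 131 | 68 | 0 | 1 | 65299 | 1988-08-21 | 69 |
| 3 | Benzema | 737 | 370 | 182 | 15 | 0 | 0 | 52245 | 1987-12-19 | 15 |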


Now you can use the `Player` column as the x-axis in the plot.
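
A minimal sketch of this step with hypothetical stand-in data. The plotting cell is not shown; the wording of the error above suggests Plotly Express, which is an assumption, so the plotting call appears only as a comment.

```python
import pandas as pd

# Hypothetical stand-in data; Player starts out as the index.
df = pd.DataFrame(
    {"Goals": [689, 679, 498, 370]},
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

# reset_index() returns a new DataFrame: the old index becomes a regular
# column and a default integer index (0, 1, 2, ...) takes its place.
flat = df.reset_index()

print(flat.columns.tolist())  # ['Player', 'Goals']

# Assumption: with Plotly Express installed, a call along these lines
# would now find the 'Player' column.
# import plotly.express as px
# px.bar(data_frame=flat, x="Player", y="Goals")
```

Note that `reset_index()` does not modify `df` in place, so the result has to be assigned to a name.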

### Pandas accessors to special functions

Since the values of `Birthday` have `dtype: datetime64[ns]`:

0   1985-02-05
1   1987-06-24
2   1988-08-21
3   1987-12-19
Name: Birthday, dtype: datetime64[ns]

The `dt` accessor...

<pandas.core.indexes.accessors.DatetimeProperties object at 0x32b78fad0>

... will give you access to specific functions for this data type.

0    1985
1    1987
2    1988
3    1987
Name: Birthday, dtype: int32

0    February
1        June
2      August
3    December
Name: Birthday, dtype: object
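
The two outputs above look like the year and the month name of each date. A sketch of how they could be obtained, assuming the cells used `dt.year` and `dt.month_name()` on a hand-built stand-in `Series`:

```python
import pandas as pd

# Hypothetical stand-in for the Birthday column, parsed as datetimes.
birthday = pd.Series(
    pd.to_datetime(["1985-02-05", "1987-06-24", "1988-08-21", "1987-12-19"]),
    name="Birthday",
)

years = birthday.dt.year           # attribute: integer year of each date
months = birthday.dt.month_name()  # method: month name (English by default)

print(years.tolist())
print(months.tolist())
```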

What if this column held strings instead of datetime values?

0    1985-02-05
1    1987-06-24
2    1988-08-21
3    1987-12-19
Name: Birthday, dtype: object

You'll get an `AttributeError` because the `dt` accessor is only available for datetime-like values.

AttributeError: Can only use .dt accessor with datetimelike values
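
A minimal sketch that reproduces this situation with a hypothetical `Series` of date strings. It also shows one possible way out that the original does not cover: converting the strings with `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical stand-in: the same dates stored as plain strings.
birthday_text = pd.Series(
    ["1985-02-05", "1987-06-24", "1988-08-21", "1987-12-19"], name="Birthday"
)

raised = False
try:
    birthday_text.dt.year  # dt is not available on string data
except AttributeError as error:
    raised = True
    print(f"AttributeError: {error}")

# Converting to datetimes makes the dt accessor available again.
years = pd.to_datetime(birthday_text).dt.year
print(years.tolist())
```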

But you can use the `str` accessor.

<pandas.core.strings.accessor.StringMethods at 0x32b83a4b0>

It gives you string functions, like `split`, to extract the year, month, and day.

0    [1985, 02, 05]
1    [1987, 06, 24]
2    [1988, 08, 21]
3    [1987, 12, 19]
Name: Birthday, dtype: object
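
A sketch of the split, assuming the dates are stored as strings and the separator is the hyphen seen in the values:

```python
import pandas as pd

# Hypothetical stand-in: dates stored as plain strings.
birthday_text = pd.Series(
    ["1985-02-05", "1987-06-24", "1988-08-21", "1987-12-19"], name="Birthday"
)

# str.split applies the split to each element; every row becomes a list.
parts = birthday_text.str.split("-")

print(parts.iloc[0])
```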

You can also expand the result into a `DataFrame`.

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1985 | 2 | 5 |
| 1 | 1987 | 6 | 24 |
| 2 | 1988 | 8 | 21 |
| 3 | 1987 | 12 | 19 |
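
One way to get a table like this, sketched under the assumption that the cell used `expand=True`. The conversion to integers is also an assumption, made because the table shows `2` rather than `02`.

```python
import pandas as pd

# Hypothetical stand-in: dates stored as plain strings.
birthday_text = pd.Series(
    ["1985-02-05", "1987-06-24", "1988-08-21", "1987-12-19"], name="Birthday"
)

# expand=True spreads the pieces over columns 0, 1, 2 instead of lists.
date_parts = birthday_text.str.split("-", expand=True)

# The pieces are still strings ('02'); convert them to integers (2).
date_parts = date_parts.astype(int)

print(date_parts)
```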


Then, rename the columns to `Year`, `Month`, and `Day`.

Finally, `join` it to the original `DataFrame`.

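| | Player | Matches | Goals | Assists | Yellows | Reds | Yellows2 | Minutes | Birthday | Total Cards | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CR7 | 923 | 689 | 231 | 112 | 7 | 4 | 75372 | 1985-02-05 | 123 | 1985 | 2 | 5 |
| 1 | Messi | 800 | 679 | 312 | 82 | 1 | 0 | 65299 | 1987-06-24 | 83 | 1987 | 6 | 24 |
| 2 | Lewandowski | 662 | 498 | 131 | 68 | 0 | 1 | 65299 | 1988-08-21 | 69 | 1988 | 8 | 21 |
| 3 | Benzema | 737 | 370 | 182 | 15 | 0 | 0 | 52245 | 1987-12-19 | 15 | 1987 | 12 | 19 |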
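
The rename and join steps can be sketched as follows. The stand-in data is hypothetical and keeps only two columns, with `Birthday` stored as strings; `rename` and `join` are assumed to be the calls behind the result above.

```python
import pandas as pd

# Hypothetical stand-in data with a default integer index.
df = pd.DataFrame(
    {
        "Player": ["CR7", "Messi", "Lewandowski", "Benzema"],
        "Birthday": ["1985-02-05", "1987-06-24", "1988-08-21", "1987-12-19"],
    }
)

# Split the strings into three integer columns labeled 0, 1, 2.
date_parts = df["Birthday"].str.split("-", expand=True).astype(int)

# Give the positional columns meaningful names.
date_parts = date_parts.rename(columns={0: "Year", 1: "Month", 2: "Day"})

# join aligns rows on the index, which both objects share.
df_with_parts = df.join(date_parts)

print(df_with_parts.columns.tolist())
```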


### Creating new columns

You can also combine existing `Series` objects with arithmetic to create a new one.

Player
CR7            123
Messi           83
Lewandowski     69
Benzema         15
dtype: int64

Then, add it to the `DataFrame` as a new column.

| Player | Matches | Goals | Assists | Yellows | Reds | Yellows2 | Minutes | Birthday | Total Cards |
|---|---|---|---|---|---|---|---|---|---|
| CR7 | 923 | 689 | 231 | 112 | 7 | 4 | 75372 | 1985-02-05 | 123 |
| Messi | 800 | 679 | 312 | 82 | 1 | 0 | 65299 | 1987-06-24 | 83 |
| Lewandowski | 662 | 498 | 131 | 68 | 0 | 1 | 65299 | 1988-08-21 | 69 |
| Benzema | 737 | 370 | 182 | 15 | 0 | 0 | 52245 | 1987-12-19 | 15 |
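
The `Total Cards` values shown are consistent with adding `Yellows`, `Reds`, and `Yellows2` row by row. The original cells are not shown, so the sketch below assumes that formula and uses hypothetical stand-in data.

```python
import pandas as pd

# Hypothetical stand-in data (card counts taken from the table above).
df = pd.DataFrame(
    {
        "Yellows": [112, 82, 68, 15],
        "Reds": [7, 1, 0, 0],
        "Yellows2": [4, 0, 1, 0],
    },
    index=pd.Index(["CR7", "Messi", "Lewandowski", "Benzema"], name="Player"),
)

# Arithmetic between Series is element-wise and aligned on the index.
total_cards = df["Yellows"] + df["Reds"] + df["Yellows2"]

# Assigning to a new column name adds the Series to the DataFrame.
df["Total Cards"] = total_cards

print(df)
```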


## Conclusions

1. The `DataFrame` is a collection of `Series` objects.
2. Calling a method on the `DataFrame` operates on all of its `Series` objects.
3. The `index` is not a column.
4. Use the `reset_index` method to turn the `index` into a column.
5. Use the `dt` accessor to access datetime functions.
6. Use the `str` accessor to access string functions.
7. Create a new `Series` by operating with the existing ones in the `DataFrame`.