<img src="../../../images/banners/pandas-cropped.jpeg" width="600"/>

<a class="anchor" id="essential_basic_functionality"></a>
# <img src="../../../images/logos/pandas.png" width="23"/>  Essential Basic Functionality

## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents 
* [Essential basic functionality](#essential_basic_functionality)
    * [Head and tail](#head_and_tail)
    * [Attributes](#attributes)
    * [Accelerated operations](#accelerated_operations)
    * [Flexible binary operations](#flexible_binary_operations)
        * [Matching / broadcasting behavior](#matching_/_broadcasting_behavior)
        * [Missing data / operations with fill values](#missing_data_/_operations_with_fill_values)
        * [Flexible comparisons](#flexible_comparisons)
        * [Boolean reductions](#boolean_reductions)
        * [Comparing if objects are equivalent](#comparing_if_objects_are_equivalent)
        * [Comparing array-like objects](#comparing_array-like_objects)
        * [Combining overlapping data sets](#combining_overlapping_data_sets)
        * [General DataFrame combine](#general_dataframe_combine)

---

In [2]:
import pandas as pd
import numpy as np

Here we discuss a lot of the essential functionality common to the pandas data
structures. To begin, let’s create some example objects:

In [3]:
index = pd.date_range("1/1/2000", periods=8)
s = pd.Series(
    np.random.randn(5),
    index=["a", "b", "c", "d", "e"]
)
df = pd.DataFrame(
    np.random.randn(8, 3),
    index=index,
    columns=["A", "B", "C"]
)

In [4]:
index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [5]:
s

a    0.798479
b   -0.080978
c    1.333688
d    1.759824
e   -0.290226
dtype: float64

In [6]:
df

Unnamed: 0,A,B,C
2000-01-01,0.802625,-0.487963,-1.124748
2000-01-02,-0.088593,0.788093,1.196199
2000-01-03,-1.66659,0.49549,0.530928
2000-01-04,1.033644,-1.206951,-1.848656
2000-01-05,0.311807,0.394268,1.06391
2000-01-06,0.882344,-0.122679,-0.008647
2000-01-07,0.66517,-1.566012,0.571396
2000-01-08,-0.076244,1.48126,-1.82585


<a class="anchor" id="head_and_tail"></a>
## Head and tail

To view a small sample of a Series or DataFrame object, use the
[`head()`](../reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head "pandas.DataFrame.head") and [`tail()`](../reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail "pandas.DataFrame.tail") methods. The default number
of elements to display is five, but you may pass a custom number.

In [7]:
long_series = pd.Series(np.random.randn(1000))

In [8]:
long_series.head()

0    2.440081
1   -0.544264
2    0.810528
3    2.177132
4    0.829438
dtype: float64

In [9]:
long_series.tail(3)

997   -1.017711
998    0.722343
999   -0.304115
dtype: float64
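
Because `long_series` holds random values, its output changes on every run. The following minimal sketch uses a small deterministic Series (the name `numbers` is only illustrative) so the effect of the default and custom element counts can be checked by eye:

```python
import pandas as pd

# Deterministic data, so the selected elements are predictable.
numbers = pd.Series(range(10))

default_head = numbers.head()   # no argument: the first five elements
first_three = numbers.head(3)   # custom count: the first three elements
last_two = numbers.tail(2)      # custom count: the last two elements

print(first_three.tolist())
print(last_two.tolist())
```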

<a class="anchor" id="attributes"></a>
## Attributes

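pandas objects have a number of attributes that give you access to their
metadata, such as `shape` (the axis dimensions) and the axis labels: `index`
for a Series, and `index` (rows) and `columns` for a DataFrame.

Note, **the axis-label attributes can be safely assigned to**!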

In [10]:
df[:2]

Unnamed: 0,A,B,C
2000-01-01,0.802625,-0.487963,-1.124748
2000-01-02,-0.088593,0.788093,1.196199


In [11]:
df.columns = [x.lower() for x in df.columns]

In [12]:
df

Unnamed: 0,a,b,c
2000-01-01,0.802625,-0.487963,-1.124748
2000-01-02,-0.088593,0.788093,1.196199
2000-01-03,-1.66659,0.49549,0.530928
2000-01-04,1.033644,-1.206951,-1.848656
2000-01-05,0.311807,0.394268,1.06391
2000-01-06,0.882344,-0.122679,-0.008647
2000-01-07,0.66517,-1.566012,0.571396
2000-01-08,-0.076244,1.48126,-1.82585
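
As a compact illustration of reading and assigning these attributes, the sketch below builds a small frame (the name `frame` and its labels are made up for this example), inspects `shape`, and then assigns new labels to both `columns` and `index`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((2, 3)), columns=["A", "B", "C"])

# Read-only view of the dimensions: (rows, columns).
print(frame.shape)

# The axis labels can be replaced by plain assignment.
frame.columns = [name.lower() for name in frame.columns]
frame.index = ["first", "second"]

print(list(frame.columns))
print(list(frame.index))
```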


<a class="anchor" id="flexible_binary_operations"></a>
## Flexible binary operations

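With binary operations between pandas data structures, there are two key points
of interest:

* Broadcasting behavior between higher-dimensional (e.g. DataFrame) and
  lower-dimensional (e.g. Series) objects.
* Missing data in computations.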

We will demonstrate how to manage these issues independently, though they can
be handled simultaneously.

<a class="anchor" id="matching_/_broadcasting_behavior"></a>
### Matching / broadcasting behavior

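DataFrame has arithmetic methods such as `add()`, `sub()`, `mul()`, and `div()`
for carrying out binary operations. For broadcasting behavior,
Series input is of primary interest. With these methods, you can match on
either the *index* or the *columns* via the **axis** keyword: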

In [13]:
df = pd.DataFrame(
    {
        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
    }
)

In [14]:
df

Unnamed: 0,one,two,three
a,0.05346,-2.081122,
b,-1.006049,-0.280347,1.318842
c,-0.754036,0.121723,0.837709
d,,-0.281258,-1.549235


In [15]:
row = df.iloc[1]

In [16]:
column = df["two"]

In [17]:
df.sub(row, axis="columns")

Unnamed: 0,one,two,three
a,1.05951,-1.800774,
b,0.0,0.0,0.0
c,0.252013,0.40207,-0.481133
d,,-0.000911,-2.868077


In [18]:
df.sub(row, axis=1)

Unnamed: 0,one,two,three
a,1.05951,-1.800774,
b,0.0,0.0,0.0
c,0.252013,0.40207,-0.481133
d,,-0.000911,-2.868077


In [19]:
df.sub(column, axis='index')

Unnamed: 0,one,two,three
a,2.134582,0.0,
b,-0.725702,0.0,1.59919
c,-0.875759,0.0,0.715986
d,,0.0,-1.267976


In [20]:
df.sub(column, axis='rows')

Unnamed: 0,one,two,three
a,2.134582,0.0,
b,-0.725702,0.0,1.59919
c,-0.875759,0.0,0.715986
d,,0.0,-1.267976


In [21]:
df.sub(column, axis=0)

Unnamed: 0,one,two,three
a,2.134582,0.0,
b,-0.725702,0.0,1.59919
c,-0.875759,0.0,0.715986
d,,0.0,-1.267976
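
The outputs above depend on random data, so here is a small deterministic sketch of the same idea. The frame and variable names are illustrative only; the point is that `axis="columns"` lines the Series up with the column labels, while `axis="index"` lines it up with the row labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
row = frame.iloc[0]      # Series labeled by column names: one=1, two=10
column = frame["one"]    # Series labeled by row names: a=1, b=2, c=3

# Subtract the first row from every row (match on column labels).
by_columns = frame.sub(row, axis="columns")

# Subtract column "one" from every column (match on row labels).
by_index = frame.sub(column, axis="index")

print(by_columns)
print(by_index)
```

In this sketch, `frame - row` gives the same result as `frame.sub(row, axis="columns")`; the method form is needed when matching on the index instead.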


<a class="anchor" id="missing_data_/_operations_with_fill_values"></a>
### Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions accept a *fill_value*,
namely a value to substitute when at most one of the values at
a location is missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which
case the result will be NaN (you can later replace NaN with some other value
using `fillna` if you wish).

In [35]:
df

Unnamed: 0,one,two,three
a,-0.465912,0.977028,
b,-0.701309,-0.762232,-1.470984
c,-0.773223,-0.836792,-2.0059
d,,-0.462816,-1.22173


In [37]:
df2 = df.copy()
df2.loc['a', 'three'] = 1.0

In [39]:
df2

Unnamed: 0,one,two,three
a,-0.465912,0.977028,1.0
b,-0.701309,-0.762232,-1.470984
c,-0.773223,-0.836792,-2.0059
d,,-0.462816,-1.22173


In [40]:
df + df2

Unnamed: 0,one,two,three
a,-0.931823,1.954055,
b,-1.402618,-1.524464,-2.941967
c,-1.546445,-1.673583,-4.011801
d,,-0.925631,-2.44346


In [41]:
df.add(df2, fill_value=0)

Unnamed: 0,one,two,three
a,-0.931823,1.954055,1.0
b,-1.402618,-1.524464,-2.941967
c,-1.546445,-1.673583,-4.011801
d,,-0.925631,-2.44346


In [42]:
df.add(df2, fill_value=0).fillna(0)

Unnamed: 0,one,two,three
a,-0.931823,1.954055,1.0
b,-1.402618,-1.524464,-2.941967
c,-1.546445,-1.673583,-4.011801
d,0.0,-0.925631,-2.44346
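
The three cases (neither value missing, exactly one missing, both missing) can be seen side by side in a minimal sketch. The frames `left` and `right` are invented for this illustration:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Plain addition: any NaN operand makes the result NaN.
plain = left + right

# fill_value=0 substitutes 0 only where exactly one side is missing;
# the last row stays NaN because both sides are missing.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```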


<a class="anchor" id="boolean_reductions"></a>
### Boolean reductions

You can apply the reductions: [`empty`](../reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty "pandas.DataFrame.empty"), [`any()`](../reference/api/pandas.DataFrame.any.html#pandas.DataFrame.any "pandas.DataFrame.any"),
[`all()`](../reference/api/pandas.DataFrame.all.html#pandas.DataFrame.all "pandas.DataFrame.all"), and [`bool()`](../reference/api/pandas.DataFrame.bool.html#pandas.DataFrame.bool "pandas.DataFrame.bool") to provide a
way to summarize a boolean result.

In [48]:
(df > 0).all()

one      False
two      False
three    False
dtype: bool

In [49]:
(df > 0).all(axis=1)

a    False
b    False
c    False
d    False
dtype: bool

In [None]:
(df > 0).any()

You can reduce to a final boolean value.

In [50]:
(df > 0).any().any()

True

You can test if a pandas object is empty, via the [`empty`](../reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty "pandas.DataFrame.empty") property.

In [51]:
df.empty

False

In [52]:
pd.DataFrame(columns=list("ABC")).empty

True

In [53]:
pd.DataFrame().empty

True

Warning

You might be tempted to do the following:
```python
if df:
    pass
```

Or

```python
df and df2
```

These will both raise a `ValueError`, because the truth value of an object holding multiple values is ambiguous.

See [here](https://pandas.pydata.org/docs/user_guide/gotchas.html#gotchas-truth) for a more detailed discussion.
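
A short sketch of the failure and of some explicit alternatives follows. The frame is made up for the example, and which alternative is appropriate depends on what the condition is meant to express:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, -1]})

message = None
try:
    if frame:  # ambiguous: pandas refuses to pick a single truth value
        pass
except ValueError as err:
    message = str(err)
    print(f"ValueError: {message}")

# Explicit alternatives that each state what is being asked.
has_rows = not frame.empty
any_positive = bool((frame > 0).any().any())
all_positive = bool((frame > 0).all().all())
```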

<a class="anchor" id="comparing_if_objects_are_equivalent"></a>
### Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same
result. As a simple example, consider `df + df` and `df * 2`. To test
that these two computations produce the same result, given the tools
shown above, you might imagine using `(df + df == df * 2).all()`. But in
fact, this expression does not evaluate to all True:

In [58]:
df + df == df * 2

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


In [59]:
(df + df == df * 2).all()

one      False
two       True
three    False
dtype: bool

Notice that the boolean DataFrame `df + df == df * 2` contains some False values!
This is because NaNs do not compare as equals:

In [60]:
np.nan == np.nan

False

So, NDFrames (such as Series and DataFrames)
have an [`equals()`](../reference/api/pandas.DataFrame.equals.html#pandas.DataFrame.equals "pandas.DataFrame.equals") method for testing equality, with NaNs in
corresponding locations treated as equal.

In [61]:
(df + df).equals(df * 2)

True

Note that the Series or DataFrame index needs to be in the same order for
equality to be True:

In [62]:
df1 = pd.DataFrame({"col": ["foo", 0, np.nan]})

In [63]:
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])

In [67]:
df1

Unnamed: 0,col
0,foo
1,0
2,


In [68]:
df2

Unnamed: 0,col
2,
1,0
0,foo


In [64]:
df1.equals(df2)

False

In [65]:
df1.equals(df2.sort_index())

True
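
The two points above (NaN handling and sensitivity to index order) can be checked on deterministic data. This is a minimal sketch with illustrative names, using Series rather than DataFrames for brevity:

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, 3.0])
right = pd.Series([1.0, np.nan, 3.0])

# Element-wise == treats NaN as unequal to NaN, so .all() is False.
elementwise = (left == right).all()

# equals() treats NaNs in corresponding locations as equal.
nan_aware = left.equals(right)

# Same labels and values, but in reversed order.
reordered = right.iloc[::-1]
same_when_reordered = left.equals(reordered)
same_after_sort = left.equals(reordered.sort_index())
```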

<a class="anchor" id="comparing_array-like_objects"></a>
### Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas
data structure with a scalar value:

In [72]:
pd.Series(["foo", "bar", "baz"]) == "foo"

0     True
1    False
2    False
dtype: bool

In [73]:
pd.Index(["foo", "bar", "baz"]) == "foo"

array([ True, False, False])

pandas also handles element-wise comparisons between different array-like
objects of the same length:

In [74]:
pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])

0     True
1     True
2    False
dtype: bool

In [75]:
pd.Series(["foo", "bar", "baz"]) == np.array(["foo", "bar", "qux"])

0     True
1     True
2    False
dtype: bool

Trying to compare `Index` or `Series` objects of different lengths will
raise a ValueError:

In [78]:
try:
    pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
except ValueError as e:
    print(f'ValueError: {e}')

ValueError: Can only compare identically-labeled Series objects


In [79]:
try:
    pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
except ValueError as e:
    print(f'ValueError: {e}')

ValueError: Can only compare identically-labeled Series objects


Note that this is different from the NumPy behavior where a comparison can
be broadcast:

In [80]:
np.array([1, 2, 3]) == np.array([2])

array([False,  True, False])

or it can return False if broadcasting cannot be done (recent NumPy versions raise an error here instead):

In [81]:
np.array([1, 2, 3]) == np.array([1, 2])

False
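
If two Series of different lengths do need to be compared, one option is the flexible comparison method `Series.eq()`, which aligns on the index before comparing instead of raising. The sketch below uses made-up numeric data; labels present in only one Series compare as unequal:

```python
import pandas as pd

longer = pd.Series([1, 2, 3])
shorter = pd.Series([1, 2])

# The == operator would raise ValueError here because the labels differ.
# eq() aligns the two indexes first; label 2 exists only in `longer`,
# so that position compares as False.
aligned_comparison = longer.eq(shorter)

print(aligned_comparison.tolist())
```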

<a class="anchor" id="combining_overlapping_data_sets"></a>
### Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets
where values in one are preferred over the other. An example would be two data
series representing a particular economic indicator where one is considered to
be of “higher quality”. However, the lower quality series might extend further
back in history or have more complete data coverage. As such, we would like to
combine two DataFrame objects where missing values in one DataFrame are
conditionally filled with like-labeled values from the other DataFrame. The
function implementing this operation is [`combine_first()`](../reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first "pandas.DataFrame.combine_first"),
which we illustrate:

In [82]:
df1 = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0, 5.0, np.nan], "B": [np.nan, 2.0, 3.0, np.nan, 6.0]}
)

In [83]:
df2 = pd.DataFrame(
    {
        "A": [5.0, 2.0, 4.0, np.nan, 3.0, 7.0],
        "B": [np.nan, np.nan, 3.0, 4.0, 6.0, 8.0],
    }
)

In [87]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [88]:
df2

Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0


In [89]:
df1.combine_first(df2)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0
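
To connect this back to the motivating scenario, here is a minimal sketch with a hypothetical "preferred" source covering a few years and a "fallback" source that reaches further back. The names, years, and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Higher-quality source: short coverage, one gap in 2000.
preferred = pd.DataFrame(
    {"value": [np.nan, 2.0, 3.0]}, index=[2000, 2001, 2002]
)

# Lower-quality source: longer coverage, including 1999 and 2003.
fallback = pd.DataFrame(
    {"value": [10.0, 20.0, 99.0, 40.0]}, index=[1999, 2000, 2001, 2003]
)

# Values from `preferred` win; `fallback` only fills gaps and
# labels that `preferred` does not have.
combined = preferred.combine_first(fallback)

print(combined)
```

Note that the fallback value for 2001 (99.0) is ignored because the preferred source already has a value for that label.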


<a class="anchor" id="general_dataframe_combine"></a>
### General DataFrame combine

The [`combine_first()`](../reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first "pandas.DataFrame.combine_first") method above calls the more general
[`DataFrame.combine()`](../reference/api/pandas.DataFrame.combine.html#pandas.DataFrame.combine "pandas.DataFrame.combine"). This method takes another DataFrame
and a combiner function, aligns the input DataFrame and then passes the combiner
function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce [`combine_first()`](../reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first "pandas.DataFrame.combine_first") as above:

In [97]:
def combiner(x, y):
    return np.where(pd.isna(x), y, x)

In [98]:
df1.combine(df2, combiner)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0
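
The combiner does not have to be about missing data. As a further sketch, any function that accepts two aligned Series and returns a combined result can be passed; here `np.minimum` keeps the smaller value at each position. The frames are invented for this example and contain no missing values:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [5, 0], "B": [2, 4]})
right = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# combine() hands np.minimum one pair of like-named columns at a time.
smallest = left.combine(right, np.minimum)

print(smallest)
```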
