This notebook works through the fix for issue: https://github.com/pandas-dev/pandas/issues/38340  
The PR link is: https://github.com/pandas-dev/pandas/pull/38379

### 0. Compile pandas

In [1]:
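import os

pandas_path = os.environ['PANDAS_PATH']
# os.chdir changes the working directory of this process;
# a `cd` run through os.system would only affect a throwaway subshell.
os.chdir(pandas_path)
try:
    import pandas as pd
    print("Already compiled!")
except ImportError:
    # Only a failed import means pandas still has to be built.
    print("Compiling...")
    os.system("python setup.py develop")
    os.system("pip uninstall pandas")
    print("Compiled!")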

Already compiled!


In [2]:
def get_so_file_names(path, keyword):
    list_of_files = set()
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if keyword in filename: 
                list_of_files.add(os.sep.join([dirpath, filename]))
    return list_of_files

In [3]:
def rename_files(file_names, keyword):
    for file_name in file_names:
        os.rename(file_name, file_name.replace(keyword, ""))
    print(f"{len(file_names)} files renamed!")

In [4]:
keyword = ".cpython-37m-darwin"
so_file_names = get_so_file_names(pandas_path, keyword)
rename_files(so_file_names, keyword)

0 files renamed!
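
The keyword above is specific to CPython 3.7 on macOS. As a rough sketch, the tag could instead be derived from the interpreter's own extension suffix. `derive_tag` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
import sysconfig


def derive_tag(ext_suffix):
    """Drop the final extension: '.cpython-37m-darwin.so' -> '.cpython-37m-darwin'."""
    return ext_suffix.rsplit(".", 1)[0]


# EXT_SUFFIX is the suffix this interpreter uses for compiled extension modules.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX") or ""
keyword = derive_tag(ext_suffix)
print(f"suffix={ext_suffix!r} keyword={keyword!r}")
```

If the suffix carries no platform tag (a bare `.so`), the derived keyword is empty and there is nothing to rename.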


### 2. Issue
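Currently, `Series.isin` is noticeably slower on the nullable `Int64` dtype than on plain `int64` (about 10.3 ms versus 2.77 ms in the timings below):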

In [5]:
import pandas as pd
import numpy as np

arr = np.random.randint(0, 10, 1_000_001)
target = [1, 2, 3, 20]

In [6]:
s1 = pd.Series(arr)
%timeit s1.isin(target)

2.77 ms ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
s2 = pd.Series(arr, dtype="Int64")
%timeit s2.isin(target)

10.3 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
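
The `%timeit` magic only works inside IPython. As a minimal sketch, the same comparison can be run from a plain script with the standard `timeit` module; `best_ms` is a hypothetical helper, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.randint(0, 10, 1_000_001)
target = [1, 2, 3, 20]
s1 = pd.Series(arr)                 # plain int64
s2 = pd.Series(arr, dtype="Int64")  # nullable, mask-backed


def best_ms(func, repeat=3, number=5):
    """Best per-call time in milliseconds over `repeat` rounds of `number` calls."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number * 1e3


t_plain = best_ms(lambda: s1.isin(target))
t_masked = best_ms(lambda: s2.isin(target))
print(f"int64: {t_plain:.2f} ms, Int64: {t_masked:.2f} ms")

# Both dtypes should agree on the answer when there are no missing values.
same = bool(
    (s1.isin(target).to_numpy() == s2.isin(target).astype(bool).to_numpy()).all()
)
print("results match:", same)
```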


### 3. Analysis
In `/pandas/core/series.py`, `Series.isin` delegates to `algorithms.isin`, which might be the bottleneck.
```Python
def isin(self, values) -> "Series":
    result = algorithms.isin(self._values, values)
    return self._constructor(result, index=self.index).__finalize__(
        self, method="isin"
    )
```

In [8]:
import pandas as pd

In [9]:
%load_ext line_profiler

In [10]:
from pandas.core.algorithms import isin

In [11]:
%lprun -f isin isin(s2._values, target)
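
`line_profiler` is a third-party extension. Where it is not installed, a coarser function-level view is available from the standard library. The sketch below uses `cProfile` on the public `Series.isin` call; it is an alternative to the line-level profile above, not a reproduction of it.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

arr = np.random.randint(0, 10, 100_001)
s2 = pd.Series(arr, dtype="Int64")
target = [1, 2, 3, 20]

profiler = cProfile.Profile()
profiler.enable()
s2.isin(target)
profiler.disable()

# Collect the ten entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```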

### 4. Solution
According to the profiling result, line 470, which converts both inputs with `np.asarray` before calling `isin` again, is the bottleneck:  
```
470         1       6074.0   6074.0     21.9          return isin(np.asarray(comps), np.asarray(values))
```

We could define and test our new `isin` as below. It skips that conversion by working on the underlying `_data` array directly, and the runtime drops to about 2.9 ms (from 12.7 ms).

In [12]:
from pandas.core.arrays.masked import BaseMaskedArray

def fast_isin(comps, values):
    if isinstance(comps, BaseMaskedArray):
        comps = comps._data
    return isin(comps, values)

In [13]:
%timeit fast_isin(s2._values, target)

2.86 ms ± 53.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit isin(s2._values, target)

12.7 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 5. Tests
We should test for the Extension Arrays below:
```Python
IntegerArray
FloatingArray
BooleanArray
```

### 5.1 IntegerArray

In [15]:
s3 = pd.Series(arr, dtype="Int64")
%timeit fast_isin(s3._values, target)

2.78 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%timeit isin(s3._values, target)

10.2 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 5.2 FloatingArray

In [17]:
s4 = pd.Series(arr, dtype="Float64")
target_f = [1.0, 2.0, 3.0, 4.0]
%timeit fast_isin(s4._values, target_f)

2.46 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%timeit isin(s4._values, target_f)

35.9 ms ± 982 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 5.3 BooleanArray

In [19]:
arr_b = arr > 5
s5 = pd.Series(arr_b, dtype="boolean")
target_b = [True, True, True, True]
%timeit fast_isin(s5._values, target_b)

4.07 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%timeit isin(s5._values, target_b)

10.4 ms ± 93.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 5.4 pd.NA in the first array
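If there is a pd.NA in the first array, the result can be incorrect: the masked slot still holds a placeholder value in `_data` (1 in the example below), which can match a target by accident. We can fix this by multiplying the result with `~comps._mask`.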

In [21]:
s6 = pd.Series([1, 2, 3, pd.NA, 4], dtype="Int64")
target = [1, 2, 3, 20]
fast_isin(s6._values, target)

array([ True,  True,  True,  True, False])

In [22]:
def isin_for_masked_array(comps, values):
    if isinstance(comps, BaseMaskedArray):
        # Match on the raw data, then force masked (pd.NA) slots to False.
        return isin(comps._data, values) & ~comps._mask
    return isin(comps, values)

In [23]:
isin_for_masked_array(s6._values, target)

array([ True,  True,  True, False, False])
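
The masking step can be reproduced with plain NumPy arrays standing in for `_data` and `_mask`. This is an illustrative sketch: `np.isin` substitutes for `pandas.core.algorithms.isin`, and the placeholder value 1 in the masked slot is an assumption taken from the behaviour observed in this notebook.

```python
import numpy as np

# Stand-ins for s6._values._data and s6._values._mask.
data = np.array([1, 2, 3, 1, 4])  # slot 3 holds a placeholder, not a real value
mask = np.array([False, False, False, True, False])
target = [1, 2, 3, 20]

raw = np.isin(data, target)  # the placeholder 1 matches by accident
fixed = raw & ~mask          # masked slots are forced back to False

print(raw)
print(fixed)
```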

In [24]:
%timeit fast_isin(s3._values, target)

2.63 ms ± 19.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
%timeit isin_for_masked_array(s3._values, target)

2.79 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 5.5 pd.NA in the second array
If there is a pd.NA in the second array, the result can be incorrect too: a pd.NA in the first array should then count as a match. We can handle this by checking whether the second array contains pd.NA.

In [26]:
s7 = pd.Series([1, 2, 3, pd.NA, 4], dtype="Int64")
target = [1, 2, 3, 20, pd.NA]
isin(s7._values, target)

<BooleanArray>
[True, True, True, True, False]
Length: 5, dtype: boolean

In [27]:
# Incorrect: the pd.NA slot should be True here, matching the output above
isin_for_masked_array(s7._values, target)

array([ True,  True,  True, False, False])

In [28]:
def isin_for_masked_array2(comps, values):
    # Be careful when values contains 1: a masked array stores 1
    # in self._data at the positions where the value is pd.NA.
    if isinstance(comps, BaseMaskedArray):
        # Drop accidental matches at masked positions first.
        result = isin(comps._data, values) * np.invert(comps._mask)
        if any(x is pd.NA for x in values):
            # pd.NA is a target, so masked positions count as matches.
            result += comps._mask
        return result
    return isin(comps, values)

In [29]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [2, 3, 20])

array([ True,  True, False, False])

In [30]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [2, 3, 20, pd.NA])

array([ True,  True,  True, False])

In [31]:
isin_for_masked_array2(pd.Series([2, 3, 4], dtype="Int64")._values, 
                       [2, 3, 20, pd.NA])

array([ True,  True, False])

In [32]:
isin_for_masked_array2(pd.Series([2, 3, 4], dtype="Int64")._values, 
                       [2, 3, 20])

array([ True,  True, False])

In [33]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [1, 2, 3, 20])

array([ True,  True, False, False])

In [34]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [1, 2, 3, 20, pd.NA])

array([ True,  True,  True, False])
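
The helpers above read the private `_data` and `_mask` attributes. As a hedged sketch, the same logic can be written against public API only: `pd.isna` supplies the mask and `to_numpy(..., na_value=1)` supplies the data. `isin_public_api` is a hypothetical name, `np.isin` stands in for the internal `isin`, and the sketch assumes an integer-backed nullable array.

```python
import numpy as np
import pandas as pd


def isin_public_api(arr, values):
    """Membership test for a nullable integer array without private attributes."""
    mask = pd.isna(arr)                             # True where arr is pd.NA
    data = arr.to_numpy(dtype="int64", na_value=1)  # fill pd.NA with a placeholder
    has_na = any(v is pd.NA for v in values)
    clean = [v for v in values if v is not pd.NA]   # keep np.isin on plain ints
    result = np.isin(data, clean) & ~mask           # drop accidental placeholder hits
    if has_na:
        result |= mask                              # pd.NA matches pd.NA
    return result


arr = pd.array([2, 3, pd.NA, 4], dtype="Int64")
print(isin_public_api(arr, [1, 2, 3, 20]))
print(isin_public_api(arr, [1, 2, 3, 20, pd.NA]))
```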

### 5.6 Test different array types for pd.NA existence

In [35]:
pd.isnull(pd.NA)

True

In [36]:
pd.isnull(pd.NaT)

True

In [37]:
pd.isna(pd.NA)

True

In [38]:
pd.isna(pd.NaT)

True

In [39]:
from copy import copy
pd.NA is copy(pd.NA)

True

In [40]:
any(x is pd.NA for x in [1, 2, pd.NA])

True

In [41]:
any(x is pd.NA for x in np.array([1, 2, pd.NA]))

True

In [42]:
any(x is pd.NA for x in pd.Series([1, 2, pd.NA]))

True

In [43]:
any(x is pd.NA for x in pd.Series([1, 2, pd.NA], dtype="Int64"))

True

In [44]:
any(x is pd.NA for x in np.array([1, 2, np.nan]))

False

In [45]:
any(x is pd.NA for x in [1, 2, pd.NaT])

False
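
The checks above rely on identity (`is pd.NA`) rather than equality. A short sketch of why: comparing with `==` yields `pd.NA` again, and `pd.NA` cannot be coerced to a boolean. `contains_pd_na` is a hypothetical helper that wraps the identity check used in this notebook.

```python
import numpy as np
import pandas as pd


def contains_pd_na(values):
    """True if any element is the pd.NA singleton (np.nan and pd.NaT do not count)."""
    return any(v is pd.NA for v in values)


# Equality propagates missingness instead of answering True or False.
eq = pd.NA == pd.NA

try:
    bool(eq)
    ambiguous = False
except TypeError:
    # The truth value of pd.NA is ambiguous, so it cannot drive an `if` or `any`.
    ambiguous = True

print(eq, ambiguous)
print(contains_pd_na([1, 2, pd.NA]), contains_pd_na([1, 2, np.nan]))
```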

### 5.7 Final test

In [71]:
result = pd.Series([1, 2, 3, 20], dtype="Int64").isin([1, 2, 3, 4])
result

0     True
1     True
2     True
3    False
dtype: boolean

In [72]:
result.values

<BooleanArray>
[True, True, True, False]
Length: 4, dtype: boolean

In [74]:
result.values._mask

array([False, False, False, False])

In [75]:
result.values._data

array([ True,  True,  True, False])

In [77]:
pd.Series([1, 2, 3, pd.NA], dtype="Int64").isin([1, 2, 3, 4])

0     True
1     True
2     True
3    False
dtype: boolean

In [79]:
pd.Series([1, 2, 3, pd.NA], dtype="Int64").isin([1, 2, 3, 4, pd.NaT])

0     True
1     True
2     True
3    False
dtype: boolean

In [80]:
pd.Series([1, 2, 3, pd.NA], dtype="Int64").isin([1, 2, 3, 4, pd.NA])

0    True
1    True
2    True
3    True
dtype: boolean

In [81]:
pd.Series([1, 2, 3], dtype="Int64").isin([1, 2, 3, 4, pd.NaT])

0    True
1    True
2    True
dtype: boolean

In [82]:
pd.Series([1, 2, 3], dtype="Int64").isin([1, 2, 3, 4, pd.NA])

0    True
1    True
2    True
dtype: boolean

In [84]:
pd.Series([1, 5], dtype="Int64").isin([1, 2, 3, 4, pd.NA])

0     True
1    False
dtype: boolean

In [85]:
pd.Series([1, 5], dtype="Int64").isin([1, 2, 3, 4])

0     True
1    False
dtype: boolean

In [86]:
pd.Series([5], dtype="Int64").isin([1, 2, 3, 4])

0    False
dtype: boolean

In [87]:
pd.Series([pd.NA], dtype="Int64").isin([1, 2, 3, 4])

0    False
dtype: boolean

In [88]:
pd.Series([], dtype="Int64").isin([1, 2, 3, 4])

Series([], dtype: boolean)
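
Several of the manual checks above can be collected into a small table-driven regression check. This is a sketch: the expected values are copied from the outputs shown in this section, so they assume a pandas build that includes the change; `run_case` is a hypothetical helper.

```python
import pandas as pd

# (data, values, expected) triples taken from the cells above.
cases = [
    ([1, 2, 3, pd.NA], [1, 2, 3, 4], [True, True, True, False]),
    ([1, 2, 3, pd.NA], [1, 2, 3, 4, pd.NA], [True, True, True, True]),
    ([1, 5], [1, 2, 3, 4, pd.NA], [True, False]),
    ([pd.NA], [1, 2, 3, 4], [False]),
    ([], [1, 2, 3, 4], []),
]


def run_case(data, values):
    """Run Series.isin on an Int64 Series and return plain Python bools."""
    result = pd.Series(data, dtype="Int64").isin(values)
    return result.astype(bool).tolist()


for data, values, expected in cases:
    actual = run_case(data, values)
    status = "ok" if actual == expected else "MISMATCH"
    print(status, data, values, actual)
```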

### 6. Why we should return True rather than pd.NA

In [105]:
None in [1, 2, 3, None]

True

In [108]:
np.nan in np.array([1, 2, np.nan])

False

In [107]:
pd.NA in pd.Series([1, 2, pd.NA])

False
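
One caveat when reading the three checks above: they do not all test the same thing. `in` on a list compares elements, `in` on a NumPy array uses `==` (and `nan != nan`), while `in` on a Series tests the index labels rather than the values. The sketch below makes that difference explicit; value membership on a Series is what `isin` is for.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 6])  # default index labels are 0 and 1

label_hit = 0 in s     # checks the index, so label 0 is found
value_hit = 5 in s     # 5 is a value, not a label
by_value = s.isin([5]).tolist()  # value-based membership

none_in_list = None in [1, 2, 3, None]             # list membership finds None
nan_in_array = np.nan in np.array([1, 2, np.nan])  # nan never equals itself

print(label_hit, value_hit, by_value, none_in_list, nan_in_array)
```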