This PR resolves issue: https://github.com/pandas-dev/pandas/issues/38340  
The PR link is: https://github.com/pandas-dev/pandas/pull/38379

### 0. Compile pandas

In [1]:
import os
pandas_path = os.environ['PANDAS_PATH']
os.chdir(pandas_path)
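# os.chdir above already switched the notebook's working directory;
# a `cd` run via os.system would only affect a short-lived subshell.
try:
    import pandas as pd
    print("Already compiled!")
except ImportError: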
    print("Compiling...")
    os.system("python setup.py develop")
    os.system("pip uninstall pandas")
    print("Compiled!")

Already compiled!


In [2]:
def get_so_file_names(path, keyword):
    list_of_files = set()
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if keyword in filename:
                list_of_files.add(os.path.join(dirpath, filename))
    return list_of_files

In [3]:
def rename_files(file_names, keyword):
    for file_name in file_names:
        os.rename(file_name, file_name.replace(keyword, ""))
    print(f"{len(file_names)} files renamed!")

In [4]:
keyword = ".cpython-37m-darwin"
so_file_names = get_so_file_names(pandas_path, keyword)
rename_files(so_file_names, keyword)

0 files renamed!
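
As a rough illustration of what cells 2 to 4 do, the sketch below runs the same walk-and-match logic against a temporary directory and only previews the renames instead of performing them. The helper name `preview_renames` and the file name `algos.cpython-37m-darwin.so` are made up for this example. Unlike `rename_files` above, it replaces the keyword in the file name only, not in the full path.

```python
import os
import tempfile


def preview_renames(root, keyword):
    """Return sorted (old, new) path pairs without touching the file system."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if keyword in filename:
                old = os.path.join(dirpath, filename)
                new = os.path.join(dirpath, filename.replace(keyword, ""))
                pairs.append((old, new))
    return sorted(pairs)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical file names, for illustration only.
    open(os.path.join(root, "algos.cpython-37m-darwin.so"), "w").close()
    open(os.path.join(root, "series.py"), "w").close()

    pairs = preview_renames(root, ".cpython-37m-darwin")
    renamed = [(os.path.basename(old), os.path.basename(new)) for old, new in pairs]

print(renamed)
```

Previewing first makes it easy to confirm that only the compiled extension files would be renamed before calling `os.rename`.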


### 2. Issue
Currently, `Series.isin` is roughly 10x slower on the nullable `Int64` dtype than on a plain NumPy integer Series:

In [5]:
import pandas as pd
import numpy as np

arr = np.random.randint(0, 10, 1_000_001)
target = [1, 2, 3, 20]

In [6]:
s1 = pd.Series(arr)
%timeit s1.isin(target)

2.86 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
s2 = pd.Series(arr, dtype="Int64")
%timeit s2.isin(target)

28.3 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 3. Analysis
In `/pandas/core/series.py`, `Series.isin` delegates to `algorithms.isin`, which might be the bottleneck.
```Python
def isin(self, values) -> "Series":
    result = algorithms.isin(self._values, values)
    return self._constructor(result, index=self.index).__finalize__(
        self, method="isin"
    )
```

In [8]:
import pandas as pd

In [9]:
%load_ext line_profiler

In [10]:
from pandas.core.algorithms import isin

In [12]:
%lprun -f isin isin(s2._values, target)

### 4. Solution
According to the profiling result, line 470 is the bottleneck:  
```
470         1       6074.0   6074.0     21.9          return isin(np.asarray(comps), np.asarray(values))
```

We can define and test a new `isin` as below. Operating directly on the underlying `_data` array cuts the runtime from about 28 ms to about 2.5 ms.

In [16]:
from pandas.core.arrays.masked import BaseMaskedArray

def fast_isin(comps, values):
    if isinstance(comps, BaseMaskedArray):
        comps = comps._data
    return isin(comps, values)

In [17]:
%timeit fast_isin(s2._values, target)

2.46 ms ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%timeit isin(s2._values, target)

27.9 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
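
One plausible reason for the gap, assuming that `np.asarray` on a masked array produced an object-dtype array in the pandas version profiled here (this may differ in other versions), is that membership tests on boxed Python objects are much slower than on a native integer buffer. The sketch below uses `np.isin` as a stand-in for `algorithms.isin`, which is implemented differently, so the timings only show the general effect of object dtype.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 10, 100_000)  # native integer buffer, analogous to comps._data
boxed = data.astype(object)          # same values, boxed as Python objects
target = [1, 2, 3, 20]

# Both representations give the same answer...
fast = np.isin(data, target)
slow = np.isin(boxed, target)

# ...but the object-dtype path is expected to take noticeably longer.
t_fast = timeit.timeit(lambda: np.isin(data, target), number=5)
t_slow = timeit.timeit(lambda: np.isin(boxed, target), number=5)
print(f"native: {t_fast:.4f}s, object: {t_slow:.4f}s")
```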


### 5. Tests
We should test the following extension arrays:
```Python
IntegerArray
FloatingArray
BooleanArray
```

### 5.1 IntegerArray

In [19]:
s3 = pd.Series(arr, dtype="Int64")
%timeit fast_isin(s3._values, target)

2.65 ms ± 90.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%timeit isin(s3._values, target)

28 ms ± 91.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 5.2 FloatingArray

In [22]:
s4 = pd.Series(arr, dtype="Float64")
target_f = [1.0, 2.0, 3.0, 4.0]
%timeit fast_isin(s4._values, target_f)

2.57 ms ± 97.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
%timeit isin(s4._values, target_f)

63.9 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### 5.3 BooleanArray

In [33]:
arr_b = arr > 5
s5 = pd.Series(arr_b, dtype="boolean")
target_b = [True, True, True, True]
%timeit fast_isin(s5._values, target_b)

4.04 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [35]:
%timeit isin(s5._values, target_b)

19.2 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 5.4 pd.NA in the first array
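If the first array contains `pd.NA`, the result can be incorrect: a masked position still holds a placeholder value in `_data` (1 in the example below), so it matches whenever the target contains that value. We can fix this by multiplying the result by `~comps._mask`.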

In [78]:
s6 = pd.Series([1, 2, 3, pd.NA, 4], dtype="Int64")
target = [1, 2, 3, 20]
fast_isin(s6._values, target)

array([ True,  True,  True,  True, False])

In [79]:
def isin_for_masked_array(comps, values):
    if isinstance(comps, BaseMaskedArray):
        _comps = comps._data
        result = isin(_comps, values) * np.invert(comps._mask)
        return result
    return isin(comps, values)

In [80]:
isin_for_masked_array(s6._values, target)

array([ True,  True,  True, False, False])

In [81]:
%timeit fast_isin(s3._values, target)

2.42 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [82]:
%timeit isin_for_masked_array(s3._values, target)

2.7 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
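
The masking step can be illustrated with plain NumPy arrays. This is a minimal sketch, assuming a data buffer that stores a placeholder `1` at the missing position plus a separate boolean mask; the arrays `data` and `mask` are hand-built stand-ins for `comps._data` and `comps._mask`, and `np.isin` stands in for `algorithms.isin`. It also shows that multiplying boolean arrays is the same as combining them with `&`.

```python
import numpy as np

data = np.array([1, 2, 3, 1, 4])                     # the 1 at index 3 is a placeholder
mask = np.array([False, False, False, True, False])  # True marks a missing value
target = [1, 2, 3, 20]

naive = np.isin(data, target)  # the placeholder wrongly matches target's 1
by_product = naive * ~mask     # boolean multiply acts as a logical AND
by_and = naive & ~mask         # the more explicit spelling of the same thing

print(naive)
print(by_product)
```

Either spelling works; `&` makes the intent (drop matches at masked positions) slightly more obvious to a reader.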


### 5.5 pd.NA in the second array
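If the second array contains `pd.NA`, the result can also be incorrect: a `pd.NA` in the first array should then count as a match, but masking with `~comps._mask` forces it to `False`. We can handle this by checking whether the second array contains `pd.NA`.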

In [117]:
s7 = pd.Series([1, 2, 3, pd.NA, 4], dtype="Int64")
target = [1, 2, 3, 20, pd.NA]
isin(s7._values, target)

array([ True,  True,  True,  True, False])

In [90]:
# The result is incorrect: the pd.NA at index 3 should match the pd.NA in target
isin_for_masked_array(s7._values, target)

array([ True,  True,  True, False, False])

In [158]:
def isin_for_masked_array2(comps, values):
    # Be careful when values contains 1: in this example the masked array
    # stores 1 in self._data at NA positions, so the mask must be applied.
    if isinstance(comps, BaseMaskedArray):
        result = isin(comps._data, values) * np.invert(comps._mask)
        if any(x is pd.NA for x in values):
            result += comps._mask
        return result
    return isin(comps, values)

In [159]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [2, 3, 20])

array([ True,  True, False, False])

In [160]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [2, 3, 20, pd.NA])

array([ True,  True,  True, False])

In [161]:
isin_for_masked_array2(pd.Series([2, 3, 4], dtype="Int64")._values, 
                       [2, 3, 20, pd.NA])

array([ True,  True, False])

In [162]:
isin_for_masked_array2(pd.Series([2, 3, 4], dtype="Int64")._values, 
                       [2, 3, 20])

array([ True,  True, False])

In [163]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [1, 2, 3, 20])

array([ True,  True, False, False])

In [164]:
isin_for_masked_array2(pd.Series([2, 3, pd.NA, 4], dtype="Int64")._values, 
                       [1, 2, 3, 20, pd.NA])

array([ True,  True,  True, False])
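
The combined rule (a non-missing element matches if its value is in `values`; a missing element matches only if `values` itself contains a missing marker) can be sketched without pandas. Everything here is an assumption made for illustration: `NA` is a stand-in sentinel for `pd.NA`, `masked_isin` is a hypothetical helper, and `np.isin` replaces `algorithms.isin`. Unlike `isin_for_masked_array2`, this version strips the sentinel from `values` before the lookup and uses `&` and `|=` rather than `*` and `+=`.

```python
import numpy as np

NA = object()  # hypothetical stand-in for pd.NA


def masked_isin(data, mask, values):
    """Membership test on a (data, mask) pair; mask is True where data is missing."""
    has_na = any(v is NA for v in values)
    concrete = [v for v in values if v is not NA]
    # Non-missing positions: ordinary lookup, ignoring placeholder values.
    result = np.isin(data, concrete) & ~mask
    # Missing positions match only when values contains the missing marker.
    if has_na:
        result |= mask
    return result


data = np.array([2, 3, 1, 4])                 # the 1 at index 2 is a placeholder
mask = np.array([False, False, True, False])

without_na = masked_isin(data, mask, [1, 2, 3, 20])
with_na = masked_isin(data, mask, [1, 2, 3, 20, NA])
print(without_na)
print(with_na)
```

The two calls mirror cells 163 and 164 above, where `values` contains the placeholder value 1.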

### 5.6 Test different array types for pd.NA existence

In [145]:
any(x is pd.NA for x in [1, 2, pd.NA])

True

In [146]:
any(x is pd.NA for x in np.array([1, 2, pd.NA]))

True

In [147]:
any(x is pd.NA for x in pd.Series([1, 2, pd.NA]))

True

In [148]:
any(x is pd.NA for x in pd.Series([1, 2, pd.NA], dtype="Int64"))

True

In [149]:
any(x is pd.NA for x in np.array([1, 2, np.nan]))

False