FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

phofl · 2023-08-09T05:31:08Z

This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.

The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)

mynewestgitaccount · 2023-08-11T05:29:06Z

Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):

Including PyArrow would naturally increase the installation size of pandas. For example, installing pandas and PyArrow using pip from wheels, numpy and pandas requires about 70MB, and including PyArrow requires an additional 120MB. An increase of installation size would have negative implication using pandas in space-constrained development or deployment environments such as AWS Lambda.

I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay.

For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas.

rebecca-palmer · 2023-08-14T07:09:04Z

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

mroeschke · 2023-08-16T20:42:17Z

For that kind of increase, I would expect/want the tradeoff to be major improvements across the board.

Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible.

AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today?

mynewestgitaccount · 2023-08-16T21:18:43Z

pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively.

The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly.

rebecca-palmer · 2023-08-16T21:26:21Z

Do you know how these are packaged today?

By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr.

An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work.

rebecca-palmer · 2023-08-18T07:53:15Z

I do intend to investigate this further at some point - I haven't done so yet because Debian updated numexpr to 2.8.5, breaking pandas (#54449 / #54546), and fixing that is currently more urgent.

jjerphan · 2023-08-18T20:22:30Z

Hi,

Thanks for welcoming feedback from the community.

While I respect you decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers for two reasons:

installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100MiB to approximately 500 MiB.

Packages size

libgoogle-cloud-2.12.0-h840a212_1 :                 46106632 bytes,
python-3.11.4-hab00c5b_0_cpython :                  30679695 bytes,
libarrow-12.0.1-h10ac928_8_cpu :                    27696900 bytes,
ucx-1.14.1-h4a2ce2d_3 :                             15692979 bytes,
pandas-2.0.3-py311h320fe9a_1 :                      14711359 bytes,
numpy-1.25.2-py311h64a7726_0 :                      8139293 bytes,
libgrpc-1.56.2-h3905398_1 :                         6331805 bytes,
libopenblas-0.3.23-pthreads_h80387f5_0 :            5406072 bytes,
aws-sdk-cpp-1.10.57-h85b1a90_19 :                   4055495 bytes,
pyarrow-12.0.1-py311h39c9aba_8_cpu :                3989550 bytes,
libstdcxx-ng-13.1.0-hfd8a6a1_0 :                    3847887 bytes,
rdma-core-28.9-h59595ed_1 :                         3735644 bytes,
libthrift-0.18.1-h8fd135c_2 :                       3584078 bytes,
tk-8.6.12-h27826a3_0 :                              3456292 bytes,
openssl-3.1.2-hd590300_0 :                          2646546 bytes,
libprotobuf-4.23.3-hd1fb520_0 :                     2506133 bytes,
libgfortran5-13.1.0-h15d22d2_0 :                    1437388 bytes,
pip-23.2.1-pyhd8ed1ab_0 :                           1386212 bytes,
krb5-1.21.2-h659d440_0 :                            1371181 bytes,
libabseil-20230125.3-cxx17_h59595ed_0 :             1240376 bytes,
orc-1.9.0-h385abfd_1 :                              1020883 bytes,
ncurses-6.4-hcb278e6_0 :                            880967 bytes,
pygments-2.16.1-pyhd8ed1ab_0 :                      853439 bytes,
jedi-0.19.0-pyhd8ed1ab_0 :                          844518 bytes,
libsqlite-3.42.0-h2797004_0 :                       828910 bytes,
libgcc-ng-13.1.0-he5830b7_0 :                       776294 bytes,
ld_impl_linux-64-2.40-h41732ed_0 :                  704696 bytes,
libnghttp2-1.52.0-h61bc06f_0 :                      622366 bytes,
ipython-8.14.0-pyh41d4057_0 :                       583448 bytes,
bzip2-1.0.8-h7f98852_4 :                            495686 bytes,
setuptools-68.1.2-pyhd8ed1ab_0 :                    462324 bytes,
zstd-1.5.2-hfc55251_7 :                             431126 bytes,
libevent-2.1.12-hf998b51_1 :                        427426 bytes,
libgomp-13.1.0-he5830b7_0 :                         419184 bytes,
xz-5.2.6-h166bdaf_0 :                               418368 bytes,
libcurl-8.2.1-hca28451_0 :                          372511 bytes,
s2n-1.3.48-h06160fa_0 :                             369441 bytes,
aws-crt-cpp-0.21.0-hb942446_5 :                     320415 bytes,
readline-8.2-h8228510_1 :                           281456 bytes,
libssh2-1.11.0-h0841786_0 :                         271133 bytes,
prompt-toolkit-3.0.39-pyha770c72_0 :                269068 bytes,
libbrotlienc-1.0.9-h166bdaf_9 :                     265202 bytes,
python-dateutil-2.8.2-pyhd8ed1ab_0 :                245987 bytes,
re2-2023.03.02-h8c504da_0 :                         201211 bytes,
aws-c-common-0.9.0-hd590300_0 :                     197608 bytes,
aws-c-http-0.7.11-h00aa349_4 :                      194366 bytes,
pytz-2023.3-pyhd8ed1ab_0 :                          186506 bytes,
aws-c-mqtt-0.9.3-hb447be9_1 :                       162493 bytes,
aws-c-io-0.13.32-h4a1a131_0 :                       154523 bytes,
ca-certificates-2023.7.22-hbcca054_0 :              149515 bytes,
lz4-c-1.9.4-hcb278e6_0 :                            143402 bytes,
python-tzdata-2023.3-pyhd8ed1ab_0 :                 143131 bytes,
libedit-3.1.20191231-he28a2e2_2 :                   123878 bytes,
keyutils-1.6.1-h166bdaf_0 :                         117831 bytes,
tzdata-2023c-h71feb2d_0 :                           117580 bytes,
gflags-2.2.2-he1b5a44_1004 :                        116549 bytes,
glog-0.6.0-h6f12383_0 :                             114321 bytes,
c-ares-1.19.1-hd590300_0 :                          113362 bytes,
libev-4.33-h516909a_1 :                             106190 bytes,
aws-c-auth-0.7.3-h28f7589_1 :                       101677 bytes,
libutf8proc-2.8.0-h166bdaf_0 :                      101070 bytes,
traitlets-5.9.0-pyhd8ed1ab_0 :                      98443 bytes,
aws-c-s3-0.3.14-hf3aad02_1 :                        86553 bytes,
libexpat-2.5.0-hcb278e6_1 :                         77980 bytes,
libbrotlicommon-1.0.9-h166bdaf_9 :                  71065 bytes,
parso-0.8.3-pyhd8ed1ab_0 :                          71048 bytes,
libzlib-1.2.13-hd590300_5 :                         61588 bytes,
libffi-3.4.2-h7f98852_5 :                           58292 bytes,
wheel-0.41.1-pyhd8ed1ab_0 :                         57374 bytes,
aws-c-event-stream-0.3.1-h2e3709c_4 :               54050 bytes,
aws-c-sdkutils-0.1.12-h4d4d85c_1 :                  53123 bytes,
aws-c-cal-0.6.1-hc309b26_1 :                        50923 bytes,
aws-checksums-0.1.17-h4d4d85c_1 :                   50001 bytes,
pexpect-4.8.0-pyh1a96a4e_2 :                        48780 bytes,
libnuma-2.0.16-h0b41bf4_1 :                         41107 bytes,
snappy-1.1.10-h9fff704_0 :                          38865 bytes,
typing_extensions-4.7.1-pyha770c72_0 :              36321 bytes,
libuuid-2.38.1-h0b41bf4_0 :                         33601 bytes,
libbrotlidec-1.0.9-h166bdaf_9 :                     32567 bytes,
libnsl-2.0.0-h7f98852_0 :                           31236 bytes,
wcwidth-0.2.6-pyhd8ed1ab_0 :                        29133 bytes,
asttokens-2.2.1-pyhd8ed1ab_0 :                      27831 bytes,
stack_data-0.6.2-pyhd8ed1ab_0 :                     26205 bytes,
executing-1.2.0-pyhd8ed1ab_0 :                      25013 bytes,
_openmp_mutex-4.5-2_gnu :                           23621 bytes,
libgfortran-ng-13.1.0-h69a702a_0 :                  23182 bytes,
libcrc32c-1.1.2-h9c3ff4c_0 :                        20440 bytes,
aws-c-compression-0.2.17-h4d4d85c_2 :               19105 bytes,
ptyprocess-0.7.0-pyhd3deb0d_0 :                     16546 bytes,
pure_eval-0.2.2-pyhd8ed1ab_0 :                      14551 bytes,
libblas-3.9.0-17_linux64_openblas :                 14473 bytes,
liblapack-3.9.0-17_linux64_openblas :               14408 bytes,
libcblas-3.9.0-17_linux64_openblas :                14401 bytes,
six-1.16.0-pyh6c4a22f_0 :                           14259 bytes,
backcall-0.2.0-pyh9f0ad1d_0 :                       13705 bytes,
matplotlib-inline-0.1.6-pyhd8ed1ab_0 :              12273 bytes,
decorator-5.1.1-pyhd8ed1ab_0 :                      12072 bytes,
backports.functools_lru_cache-1.6.5-pyhd8ed1ab_0 :  11519 bytes,
pickleshare-0.7.5-py_1003 :                         9332 bytes,
prompt_toolkit-3.0.39-hd8ed1ab_0 :                  6731 bytes,
backports-1.0-pyhd8ed1ab_3 :                        5950 bytes,
python_abi-3.11-3_cp311 :                           5682 bytes,
_libgcc_mutex-0.1-conda_forge :                     2562 bytes,

pyarrow also depends on libarrow which itself depends on several notable C and C++ libraries. This constraints the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some context.

Have you considered those two observations as drawbacks before taking the decision?

lithomas1 · 2023-08-18T20:32:05Z

Hi,

Thanks for welcoming feedback from the community.

While I respect you decision, I am afraid that making pyarrow a required dependency will come with costly consequences for users and downstream libraries' developers and maintainers for two reasons:

installing pyarrow after pandas in a fresh conda environment increases its size from approximately 100MiB to approximately 500 MiB.

Packages size

pyarrow also depends on libarrow which itself depends on several notable C and C++ libraries. This constraints the installation of other packages whose dependencies might be incompatible with libarrow's, making pandas potentially unusable in some context.

Have you considered those two observations as drawbacks before taking the decision?

This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193
(for pip only I guess).

While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0
(when pyarrow will actually become required), at least some components will be spun out/made optional/something like that (I heard that the arrow people were talking about this).

(cc @jorisvandenbossche for more info on this)

I'm not an Arrow dev myself, but if is something that just needs someone to look at, I'm happy to put some time in help give Arrow a nudge in the right direction.

Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else?

(IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow but it's definetely something that can be improved)

jjerphan · 2023-08-18T20:49:13Z

If libarrow is slimmed down by having non-essential Arrow features be extracted into other libraries which could be optional dependencies, I think most people's concerns would be addressed.

Edit: See conda-forge/arrow-cpp-feedstock#1035

DerThorsten · 2023-08-22T07:16:22Z

Hi,

Thanks for welcoming feedback from the community.
For wasm builds of python / python-packages (ie pyodide / emscripten-forge) package size really matters since these packages have to be downloaded from within the browser. Once a package is too big, usability suffers drastically.

With pyarrow as a required dependency, pandas is less usable from python in the browser.

surfaceowl · 2023-08-30T15:36:08Z

Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.)

I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).

There is another way - use virtual environments in user space instead of system python. The Python Software Foundation recommends users create virtual environments; and Debian/Ubuntu want users to leave the system python untouched to avoid breaking system python.

Perhaps Pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid or at least defer work of adding pyarrow to APT as well as the risks of users breaking system python. Also which I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian given the release strategy/timing delay.

On the other hand, arrow backend has significant advantages and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources.

A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here.

stonebig · 2023-08-30T18:29:28Z

I think it's the rigth path for performance in WASM.

mlkui · 2023-08-31T10:24:53Z

This is a good idea!
But I think there are also two important features should also be implemented except strings:

Zero-copy for multi-index dataframe. Currently, multi-index dataframe can not be convert from arrow table with zero copy(zero_copy_only=True), which is a BIGGER problem for big dataframe. You can reset_index() the dataframe, convert it to arrow table, and convert arrow table back to dataframe with zero copy, but after all, you must use call set_index() to the dataframe to get multi-index back, then copy happens.
Zero-copy for pandas.concat. Arrow table concat can be zero-copy, but when concat two zero-copy dataframe(convert from arrow table), copy happens even pandas COW is turned on. Also, currently, trying to concat two arrow table and then convert the table to dataframe with zero_copy_only=True is also not allowed as the chunknum>1.

phofl · 2023-08-31T21:57:42Z

@mlkui

Regarding concat: This should already be zero copy:

df = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
df2 = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")

x = pd.concat([df, df2])

This creates a new dataframe that has 2 pyarrow chunks.

Can you open a separate issue if this is not what you are looking for?

mlkui · 2023-09-01T03:25:57Z

@phofl
Thanks for your reply. But your example may be too simple. Please view the following codes(pandas 2.0.3 and pyarrow 12.0/ pandas 2.1.0 and pyarrow 13.0):

        with pa.memory_map("d:\\1.arrow", 'r') as source1, pa.memory_map("d:\\2.arrow", 'r') as source2, pa.memory_map("d:\\3.arrow", 'r') as source3, pa.memory_map("d:\\4.arrow", 'r') as source4:

            c1 = pa.ipc.RecordBatchFileReader(source1).read_all().column("p")
            c2 = pa.ipc.RecordBatchFileReader(source2).read_all().column("v")
            c3 = pa.ipc.RecordBatchFileReader(source1).read_all().column("p")
            c4 = pa.ipc.RecordBatchFileReader(source2).read_all().column("v")
            print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

            s1 = c1.to_pandas(zero_copy_only=True)
            s2 = c2.to_pandas(zero_copy_only=True)
            s3 = c3.to_pandas(zero_copy_only=True)
            s4 = c4.to_pandas(zero_copy_only=True)
            print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

            dfs = {"p": s1, "v": s2}
            df1 = pd.concat(dfs, axis=1, copy=False)                            #zero-copy
            print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

            dfs2 = {"p": s3, "v": s4}
            df2 = pd.concat(dfs2, axis=1, copy=False)                           #zero-copy
            print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

            # NOT zero-copy
            result_df = pd.concat([df1, df2], axis=0, copy=False)

        with pa.memory_map("z1.arrow", 'r') as source1, pa.memory_map("z2.arrow", 'r') as source2:

            table1 = pa.ipc.RecordBatchFileReader(source1).read_all()
            table2 = pa.ipc.RecordBatchFileReader(source2).read_all()
            combined_table = pa.concat_tables([table1, table2])
            print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))        #Zero-copy

            df1 = table1.to_pandas(zero_copy_only=True)
            df2 = table2.to_pandas(zero_copy_only=True)
            print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))       #Zero-copy

            #Use pandas to concat two zero-copy dataframes
            #But copy happens
            result_df = pd.concat([df1, df2], axis=0, copy=False)

            #Try to convert the arrow table to pandas directly
            #This will raise exception for chunk number is 2
            df3 = combined_table.to_pandas(zero_copy_only=True)

            # Combining chunks to one will cause copy
            combined_table = combined_table.combine_chunks()

0x26res · 2023-09-03T19:06:28Z

Beside the build size, there is a portability issue with pyarrow.

pyarrow does not provide wheels for as many environment as numpy.

For environments where pyarrow does not provide wheels, pyarrow has to be installed from source which is not simple.

flying-sheep · 2023-10-10T07:09:39Z

If this happens, would dtype='string' and dtype='string[pyarrow]' be merged into one implementation?

We’re currently thinking about coercing strings in our library, but hesitating because of the unclear future here.

EwoutH · 2023-10-26T21:50:03Z

pyarrow does not provide wheels for as many environment as numpy.

The fact that they still don’t have Python 3.12 wheels up is worrisome.

h-vetinari · 2023-11-01T09:31:20Z

The fact that they still don’t have Python 3.12 wheels up is worrisome.

Arrow is a beast to build, and even harder to fit into a wheel properly (so you get less features, and things like using the slimmed-down libarrow will be harder to pull off).

Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there.

musicinmybrain · 2023-11-03T21:12:58Z

Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a libarrow package that provides python3-pyarrow, so I think this shouldn’t be a real problem for us from a packaging perspective.

I’m not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn’t necessarily make the situation worse for us.

ZupoLlask · 2023-11-30T10:01:46Z

@h-vetinari Almost there? :-)

raulcd · 2023-11-30T10:12:59Z

@h-vetinari Almost there? :-)

There is still a lot of work to be done on the wheels side but for conda after the work we did to divide the CPP library, I created this PR which is currently under discussion in order to provide both a pyarrow-base that only depends on libarrow and libparquet and pyarrow which would pull all the Arrow CPP dependencies. Both have been built with support for everything so depending on pyarrow-base and libarrow-dataset would allow the use of pyarrow.dataset, etc.

chris-vecchio · 2023-12-08T17:15:52Z

Thanks for requesting feedback. I'm not well versed on the technicalities, but I strongly prefer to not require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDType without the added complexity of PyArrow.

soulphish · 2024-08-15T08:29:52Z

Not to beat a dead horse, but....

I use Pandas in multiple projects, and each project has a Virtual Environment. Every new major version of python gets a virtual environment for testing the new version too. The size of these project is not huge, but now they have all increased massively, and the storage requirement for projects has increased almost exponentially.

Just something to keep in mind. I know there is talk of pyarrow being reduced in size too, which would be great. I admit, I have not read the full discussion, so this may have been covered already, and I apologize if it has been.

agriyakhetarpal · 2024-08-15T08:40:33Z

Hi all – not to segue into the discussion about the increase in bandwidth usage and download sizes since many others have put out their thoughts about that already, but PyArrow in Pyodide has been merged and will be available in the next release: pyodide/pyodide#4950

Runa7debug · 2024-10-05T16:14:35Z

I find this error in the lab of module 2-course 3 data science:

:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at #54466

import pandas as pd # import library to read data into dataframe

bersbersbers · 2024-10-08T05:52:06Z

It's a bit unfortunate that with pyarrow dependencies, using pandas on Python 3.13 is now effectively blocked by apache/arrow#43519. Making pyarrow required will aggravate such issues in the future.

miraculixx · 2024-10-14T07:42:12Z

Reading this thread, it appears that after more than 12 months of collecting feedback, most comments are not in favor of pyarrow being a dependency, or at least voice some concern. I haven't done a formal
analysis, but it appears there are a few common themes:

Concerns

Pyarrow's package size is considered to be very/too large for a mandatory dependency
There is additional and often unwarranted complexity in pyarrow installation (e.g. version conflicts, platform not supported)
Pyarrow's functionality is not needed for all of pandas use cases and hence having to install it seems unnecessary in these cases

Suggested paths forward

a. Make it easy to use pandas with pyarrow, yet keep it an optional dependency
b. Make it easy to install pyarrow by reducing its size and installation complexity (with pandas, e.g. by reducing dependency to pyarrow-base instead of the full pyarrow)

(I may be biased in summarizing this, anyone feel free to correct this if you find your analysis is different)

Since this is a solicited feedback channel established for the community to share their thoughts regarding PDEP-10, (how) will the decision be reconsidered @phofl? Thank you for all your efforts.

asishm · 2024-10-14T12:05:34Z

Since this is a solicited feedback channel established for the community to share their thoughts regarding PDEP-10, (how) will the decision be reconsidered @phofl? Thank you for all your efforts.

There is an open PDEP under consideration to reject pdep-10. #58623 If (when?) it gets finalized, it'll get put to a vote.

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd

- pinning muscle to 5.1 because 5.2 and 5.3 (as of their release 10 days ago) currently have issues on mac-osx, even with osx-64 specified, leading to an "Illegal instruction" error whenever muscle is invoked - adding pyarrow, due to pandas stating it will be required next major version update (pandas-dev/pandas#54466)

miraculixx · 2024-11-22T11:25:37Z

JupyterLite is another project that will be impacted by pyarrow becoming a required dependency, as per pyodide/pyodide#4840 (although might be resolved by upstream pyodide/pyodide#2933)

agriyakhetarpal · 2024-11-22T11:55:27Z

JupyterLite is another project that will be impacted by pyarrow becoming a required dependency, as per pyodide/pyodide#4840 (although might be resolved by upstream pyodide/pyodide#2933)

PyArrow will be included in the next release for Pyodide and pandas should work correctly after that. We are planning it here: pyodide/pyodide#5064

jjGG · 2025-03-11T14:52:10Z

Hello Developers,

Unfortunately I do get an error upon running FragPipe22 in the library generation step.

SpecLibGen [Work dir: E:\projects\p36602\FragPipe\20250305_assessement\test_newPyPath]
E:\software\Python3_9_6\python.exe -u E:\software\FragPipe-22\FragPipe-jre-22.0\fragpipe\tools\speclib\gen_con_spec_lib.py E:\projects\p36602\FragPipe\20250305_assessement\2025-02-19-decoys-p24073_db6_TBnLepNSwissprotNDairy_20241120.fasta.fas E:\projects\p36602\FragPipe\20250305_assessement\test_newPyPath unused E:\projects\p36602\FragPipe\20250305_assessement\test_newPyPath True unused use_easypqp noiRT;noIM 63 "--unimod E:/software/FragPipe-22/FragPipe-jre-22.0/fragpipe/tools/unimod_old.xml --max_delta_unimod 0.02 --max_delta_ppm 15.0 --fragment_types [\'b\',\'y\',]" "--rt_lowess_fraction 0.0" delete_intermediate_files E:\projects\p36602\FragPipe\20250305_assessement\test_newPyPath\filelist_speclibgen.txt
E:\software\FragPipe-22\FragPipe-jre-22.0\fragpipe\tools\speclib\gen_con_spec_lib.py:18: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
File list provided
Traceback (most recent call last):
  File "E:\software\FragPipe-22\FragPipe-jre-22.0\fragpipe\tools\speclib\gen_con_spec_lib.py", line 580, in <module>
    main()
  File "E:\software\FragPipe-22\FragPipe-jre-22.0\fragpipe\tools\speclib\gen_con_spec_lib.py", line 487, in main
    params = easyPQPparams()
  File "E:\software\FragPipe-22\FragPipe-jre-22.0\fragpipe\tools\speclib\gen_con_spec_lib.py", line 140, in __init__
    self.easypqp = get_bin_path('easypqp', 'easypqp')
  File "E:\software\FragPipe-22\FragPipe-jre-22.0\fragpipe\tools\speclib\gen_con_spec_lib.py", line 196, in get_bin_path_pip_CLI
    rel_loc, = [e for e in files if pathlib.Path(e).stem == bin_stem]
ValueError: not enough values to unpack (expected 1, got 0)
Process 'SpecLibGen' finished, exit code: 1
Process returned non-zero exit code, stopping

How can I circumvent this? Is there a solution to this? We are using python3.9.6 as required by the developers.

Best regards

jj_gg

MarcoGorelli · 2025-03-11T14:54:13Z

you can upgrade pandas or silence the warning

Prompt Category: Unit Testing Prompt: extract_data.py Create unit test cases for the file extract_data.py. Also implement Pylint for this file such that code is optimised, code smells are detected. User Observation: $ python3 -m pytest test_extract_data.py -v ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 3 items test_extract_data.py::TestGoogleSheetExtraction::test_api_error PASSED [ 33%] test_extract_data.py::TestGoogleSheetExtraction::test_empty_sheet PASSED [ 66%] test_extract_data.py::TestGoogleSheetExtraction::test_successful_data_extraction PASSED [100%] ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =========================================== 3 passed, 1 warning in 0.39s ============================================ $ python3 -m pytest test_extract_data.py -v ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 3 items test_extract_data.py::TestGoogleSheetExtraction::test_api_error PASSED [ 33%] test_extract_data.py::TestGoogleSheetExtraction::test_empty_sheet PASSED [ 66%] test_extract_data.py::TestGoogleSheetExtraction::test_successful_data_extraction PASSED [100%] ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =========================================== 3 passed, 1 warning in 0.39s ============================================ $ python -m pytest test_extract_data.py --cov=extract_data zsh: command not found: python $ python3 -m pytest test_extract_data.py --cov=extract_data ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 3 items test_extract_data.py ... [100%] ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 44 6 86% ------------------------------------- TOTAL 44 6 86% =========================================== 3 passed, 1 warning in 0.45s ============================================ Response ID: cdc11218-5355-4cb3-ab05-3e5de03e1ced

@patch

Prompt Category: Unit Testing Prompt: extract_data.py Modify the test results for this file as per the new requirements. Also update the Pylint configuration such that the Code Coverage is greater than 80% User Observation: $ python3 -m pytest test_extract_data.py --cov=extract_data --cov-report=term-missing ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 6 items test_extract_data.py ..F.F. [100%] ===================================================== FAILURES ====================================================== __________________________________ TestGoogleSheetExtraction.test_invalid_ratings ___________________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_invalid_ratings> mock_build = <MagicMock name='build' id='4886351728'> mock_credentials = <MagicMock name='Credentials' id='4886539712'> @patch('extract_data.Credentials') @patch('extract_data.build') def test_invalid_ratings(self, mock_build, mock_credentials): """Test handling of invalid rating values""" mock_data = self.mock_data.copy() mock_data[1][3] = 'invalid' # Invalid Context Awareness rating mock_service = MagicMock() mock_build.return_value = mock_service mock_service.spreadsheets().values().get().execute.return_value = { 'values': mock_data } result = get_google_sheet_data() self.assertIsNotNone(result) > self.assertTrue(pd.isna(result['Mean Rating'].iloc[0])) E AssertionError: False is not true test_extract_data.py:109: AssertionError _____________________________ TestGoogleSheetExtraction.test_result_status_calculation ______________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_result_status_calculation> def test_result_status_calculation(self): """Test result status determination""" test_data = pd.DataFrame({ 'Difference': [-2, -0.5, 0, 0.5, 2] }) expected_results = ['Not ok', 'Ok', 'Ok', 'Ok', 'Not ok'] for diff, expected in zip(test_data['Difference'], expected_results): result = test_data['Difference'].apply( lambda x: 'Ok' if -1 <= x <= 1 else 'Not ok' ) > self.assertEqual(result.iloc[0], expected) E AssertionError: 'Not ok' != 'Ok' E - Not ok E + Ok test_extract_data.py:122: AssertionError ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover Missing ----------------------------------------------- extract_data.py 65 12 82% 107, 136-149 ----------------------------------------------- TOTAL 65 12 82% ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings - AssertionError: False is not true FAILED test_extract_data.py::TestGoogleSheetExtraction::test_result_status_calculation - AssertionError: 'Not ok' != 'Ok' ====================================== 2 failed, 4 passed, 1 warning in 0.52s ======================================= Response ID: 78b65279-364d-43ca-ac9e-1e8b1dfd31c2

@patch

Prompt Category: Unit Testing Prompt: Fix the Following error: ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 6 items test_extract_data.py ..F.F. [100%] ===================================================== FAILURES ====================================================== __________________________________ TestGoogleSheetExtraction.test_invalid_ratings ___________________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_invalid_ratings> mock_build = <MagicMock name='build' id='4886351728'> mock_credentials = <MagicMock name='Credentials' id='4886539712'> @patch('extract_data.Credentials') @patch('extract_data.build') def test_invalid_ratings(self, mock_build, mock_credentials): """Test handling of invalid rating values""" mock_data = self.mock_data.copy() mock_data[1][3] = 'invalid' # Invalid Context Awareness rating mock_service = MagicMock() mock_build.return_value = mock_service mock_service.spreadsheets().values().get().execute.return_value = { 'values': mock_data } result = get_google_sheet_data() self.assertIsNotNone(result) > self.assertTrue(pd.isna(result['Mean Rating'].iloc[0])) E AssertionError: False is not true test_extract_data.py:109: AssertionError _____________________________ TestGoogleSheetExtraction.test_result_status_calculation ______________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_result_status_calculation> def test_result_status_calculation(self): """Test result status determination""" test_data = pd.DataFrame({ 'Difference': [-2, -0.5, 0, 0.5, 2] }) expected_results = ['Not ok', 'Ok', 'Ok', 'Ok', 'Not ok'] for diff, expected in zip(test_data['Difference'], expected_results): result = test_data['Difference'].apply( lambda x: 'Ok' if -1 <= x <= 1 else 'Not ok' ) > self.assertEqual(result.iloc[0], expected) E AssertionError: 'Not ok' != 'Ok' E - Not ok E + Ok test_extract_data.py:122: AssertionError ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover Missing ----------------------------------------------- extract_data.py 65 12 82% 107, 136-149 ----------------------------------------------- TOTAL 65 12 82% ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings - AssertionError: False is not true FAILED test_extract_data.py::TestGoogleSheetExtraction::test_result_status_calculation - AssertionError: 'Not ok' != 'Ok' ====================================== 2 failed, 4 passed, 1 warning in 0.52s ======================================= User Observation: ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 6 items test_extract_data.py ..F.F. [100%] ===================================================== FAILURES ====================================================== __________________________________ TestGoogleSheetExtraction.test_invalid_ratings ___________________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_invalid_ratings> mock_build = <MagicMock name='build' id='4886351728'> mock_credentials = <MagicMock name='Credentials' id='4886539712'> @patch('extract_data.Credentials') @patch('extract_data.build') def test_invalid_ratings(self, mock_build, mock_credentials): """Test handling of invalid rating values""" mock_data = self.mock_data.copy() mock_data[1][3] = 'invalid' # Invalid Context Awareness rating mock_service = MagicMock() mock_build.return_value = mock_service mock_service.spreadsheets().values().get().execute.return_value = { 'values': mock_data } result = get_google_sheet_data() self.assertIsNotNone(result) > self.assertTrue(pd.isna(result['Mean Rating'].iloc[0])) E AssertionError: False is not true test_extract_data.py:109: AssertionError _____________________________ TestGoogleSheetExtraction.test_result_status_calculation ______________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_result_status_calculation> def test_result_status_calculation(self): """Test result status determination""" test_data = pd.DataFrame({ 'Difference': [-2, -0.5, 0, 0.5, 2] }) expected_results = ['Not ok', 'Ok', 'Ok', 'Ok', 'Not ok'] for diff, expected in zip(test_data['Difference'], expected_results): result = test_data['Difference'].apply( lambda x: 'Ok' if -1 <= x <= 1 else 'Not ok' ) > self.assertEqual(result.iloc[0], expected) E AssertionError: 'Not ok' != 'Ok' E - Not ok E + Ok test_extract_data.py:122: AssertionError ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover Missing ----------------------------------------------- extract_data.py 65 12 82% 107, 136-149 ----------------------------------------------- TOTAL 65 12 82% ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings - AssertionError: False is not true FAILED test_extract_data.py::TestGoogleSheetExtraction::test_result_status_calculation - AssertionError: 'Not ok' != 'Ok' ====================================== 2 failed, 4 passed, 1 warning in 0.52s ======================================= $ python -m pytest test_extract_data.py -v --cov=extract_data zsh: command not found: python $ python3 -m pytest test_extract_data.py -v --cov=extract_data ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 6 items test_extract_data.py::TestGoogleSheetExtraction::test_calculation_accuracy PASSED [ 16%] test_extract_data.py::TestGoogleSheetExtraction::test_empty_sheet PASSED [ 33%] test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings FAILED [ 50%] test_extract_data.py::TestGoogleSheetExtraction::test_missing_columns PASSED [ 66%] test_extract_data.py::TestGoogleSheetExtraction::test_result_status_calculation PASSED [ 83%] test_extract_data.py::TestGoogleSheetExtraction::test_successful_data_extraction PASSED [100%] ===================================================== FAILURES ====================================================== __________________________________ TestGoogleSheetExtraction.test_invalid_ratings ___________________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_invalid_ratings> mock_build = <MagicMock name='build' id='5357153440'> mock_credentials = <MagicMock name='Credentials' id='5357341424'> @patch('extract_data.Credentials') @patch('extract_data.build') def test_invalid_ratings(self, mock_build, mock_credentials): """Test handling of invalid rating values""" mock_data = [row[:] for row in self.mock_data] # Deep copy mock_data[1][3] = 'invalid' # Invalid Context Awareness rating mock_service = MagicMock() mock_build.return_value = mock_service mock_service.spreadsheets().values().get().execute.return_value = { 'values': mock_data } result = get_google_sheet_data() self.assertIsNotNone(result) # Check if the mean rating is NaN when one of the inputs is invalid > self.assertTrue(pd.isna(result['Mean Rating'].iloc[0])) E AssertionError: False is not true test_extract_data.py:110: AssertionError ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 65 12 82% ------------------------------------- TOTAL 65 12 82% ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings - AssertionError: False is not true ====================================== 1 failed, 5 passed, 1 warning in 0.52s ======================================= Response ID: 80e90428-de98-4c12-9135-d5f33317535b

Prompt Category: Unit Testing Prompt: Fix the error ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings - AssertionError: False is not true ====================================== 1 failed, 5 passed, 1 warning in 0.52s ======================================= User Observation: python3 -m pytest test_extract_data.py -v --cov=extract_data ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 6 items test_extract_data.py::TestGoogleSheetExtraction::test_calculation_accuracy PASSED [ 16%] test_extract_data.py::TestGoogleSheetExtraction::test_empty_sheet PASSED [ 33%] test_extract_data.py::TestGoogleSheetExtraction::test_invalid_ratings PASSED [ 50%] test_extract_data.py::TestGoogleSheetExtraction::test_missing_columns PASSED [ 66%] test_extract_data.py::TestGoogleSheetExtraction::test_result_status_calculation PASSED [ 83%] test_extract_data.py::TestGoogleSheetExtraction::test_successful_data_extraction PASSED [100%] ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 65 12 82% ------------------------------------- TOTAL 65 12 82% =========================================== 6 passed, 1 warning in 0.49s ============================================ Response ID: 8f1f79d4-228b-47b4-9e08-02733b470bc0

@patch

Prompt Category: Unit Testing Prompt: extract_data.py Now the code is working perfectly. Modify the unit tests in such a way that all the unit tests pass and Code coverage is greater than 80% User Observation: $ python3 -m pytest test_extract_data.py -v --cov=extract_data ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 7 items test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_failure PASSED [ 14%] test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_success FAILED [ 28%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_api_error PASSED [ 42%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_no_data PASSED [ 57%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_success PASSED [ 71%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_failure PASSED [ 85%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_success FAILED [100%] ===================================================== FAILURES ====================================================== _____________________________ TestGoogleSheetExtraction.test_clear_target_sheet_success _____________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_clear_target_sheet_success> mock_build = <MagicMock name='build' id='4682958544'> mock_credentials = <MagicMock name='Credentials' id='4683027984'> @patch('extract_data.Credentials') @patch('extract_data.build') def test_clear_target_sheet_success(self, mock_build, mock_credentials): # Setup mock mock_service = MagicMock() mock_build.return_value = mock_service mock_service.spreadsheets().values().clear().execute.return_value = {} # Execute result = clear_target_sheet() # Assert self.assertIsNotNone(result) > mock_service.spreadsheets().values().clear.assert_called_once() test_extract_data.py:21: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <MagicMock name='build().spreadsheets().values().clear' id='4683133088'> def assert_called_once(self): """assert that the mock was called only once. """ if not self.call_count == 1: msg = ("Expected '%s' to have been called once. Called %s times.%s" % (self._mock_name or 'mock', self.call_count, self._calls_repr())) > raise AssertionError(msg) E AssertionError: Expected 'clear' to have been called once. Called 2 times. E Calls: [call(), E call(spreadsheetId='1FEqiDqqPfb9YHAWBiqVepmmXj22zNqXNNI7NLGCDVak', range='Sheet1!A:Z'), E call().execute()]. /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/unittest/mock.py:886: AssertionError ___________________________ TestGoogleSheetExtraction.test_write_to_target_sheet_success ____________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_write_to_target_sheet_success> def test_write_to_target_sheet_success(self): # Setup test data df = pd.DataFrame({ 'Email Address': ['test@example.com'], 'Tool being used': ['Tool1'], 'Feature used': ['Feature1'], 'Context Awareness': [4.0], 'Autonomy': [4.0], 'Experience': [4.0], 'Output Quality': [4.0], 'Overall Rating': [4.0], 'Mean Rating': [4.0], 'Difference': [0.0], 'Result': ['Ok'], 'Unique ID': ['ID1'] }) mock_service = MagicMock() mock_service.spreadsheets().values().update().execute.return_value = {} mock_service.spreadsheets().batchUpdate().execute.return_value = {} # Execute result = write_to_target_sheet(df, mock_service) # Assert self.assertTrue(result) > mock_service.spreadsheets().values().update.assert_called_once() test_extract_data.py:63: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <MagicMock name='mock.spreadsheets().values().update' id='4684074672'> def assert_called_once(self): """assert that the mock was called only once. """ if not self.call_count == 1: msg = ("Expected '%s' to have been called once. Called %s times.%s" % (self._mock_name or 'mock', self.call_count, self._calls_repr())) > raise AssertionError(msg) E AssertionError: Expected 'update' to have been called once. Called 2 times. E Calls: [call(), E call(spreadsheetId='1FEqiDqqPfb9YHAWBiqVepmmXj22zNqXNNI7NLGCDVak', range='Sheet1!A1', valueInputOption='USER_ENTERED', body={'values': [['Email Address', 'Tool being used', 'Feature used', 'Context Awareness', 'Autonomy', 'Experience', 'Output Quality', 'Overall Rating', 'Mean Rating', 'Difference', 'Result', 'Unique ID'], ['test@example.com', 'Tool1', 'Feature1', '4.00', '4.00', '4.00', '4.00', '4.00', '4.00', '0.00', 'Ok', 'ID1']], 'majorDimension': 'ROWS'}), E call().execute()]. /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/unittest/mock.py:886: AssertionError ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 119 17 86% ------------------------------------- TOTAL 119 17 86% ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_success - AssertionError: Expected 'clear' to have been called once. Called 2 times. FAILED test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_success - AssertionError: Expected 'update' to have been called once. Called 2 times. ====================================== 2 failed, 5 passed, 1 warning in 0.57s ======================================= Response ID: 10445925-81b4-478c-b485-ddd5dc8dd0cf

@patch

Prompt Category: Unit Testing Prompt: ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 7 items test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_failure PASSED [ 14%] test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_success FAILED [ 28%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_api_error PASSED [ 42%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_no_data PASSED [ 57%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_success PASSED [ 71%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_failure PASSED [ 85%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_success FAILED [100%] ===================================================== FAILURES ====================================================== _____________________________ TestGoogleSheetExtraction.test_clear_target_sheet_success _____________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_clear_target_sheet_success> mock_build = <MagicMock name='build' id='4682958544'> mock_credentials = <MagicMock name='Credentials' id='4683027984'> @patch('extract_data.Credentials') @patch('extract_data.build') def test_clear_target_sheet_success(self, mock_build, mock_credentials): # Setup mock mock_service = MagicMock() mock_build.return_value = mock_service mock_service.spreadsheets().values().clear().execute.return_value = {} # Execute result = clear_target_sheet() # Assert self.assertIsNotNone(result) > mock_service.spreadsheets().values().clear.assert_called_once() test_extract_data.py:21: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <MagicMock name='build().spreadsheets().values().clear' id='4683133088'> def assert_called_once(self): """assert that the mock was called only once. """ if not self.call_count == 1: msg = ("Expected '%s' to have been called once. Called %s times.%s" % (self._mock_name or 'mock', self.call_count, self._calls_repr())) > raise AssertionError(msg) E AssertionError: Expected 'clear' to have been called once. Called 2 times. E Calls: [call(), E call(spreadsheetId='1FEqiDqqPfb9YHAWBiqVepmmXj22zNqXNNI7NLGCDVak', range='Sheet1!A:Z'), E call().execute()]. /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/unittest/mock.py:886: AssertionError ___________________________ TestGoogleSheetExtraction.test_write_to_target_sheet_success ____________________________ self = <test_extract_data.TestGoogleSheetExtraction testMethod=test_write_to_target_sheet_success> def test_write_to_target_sheet_success(self): # Setup test data df = pd.DataFrame({ 'Email Address': ['test@example.com'], 'Tool being used': ['Tool1'], 'Feature used': ['Feature1'], 'Context Awareness': [4.0], 'Autonomy': [4.0], 'Experience': [4.0], 'Output Quality': [4.0], 'Overall Rating': [4.0], 'Mean Rating': [4.0], 'Difference': [0.0], 'Result': ['Ok'], 'Unique ID': ['ID1'] }) mock_service = MagicMock() mock_service.spreadsheets().values().update().execute.return_value = {} mock_service.spreadsheets().batchUpdate().execute.return_value = {} # Execute result = write_to_target_sheet(df, mock_service) # Assert self.assertTrue(result) > mock_service.spreadsheets().values().update.assert_called_once() test_extract_data.py:63: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <MagicMock name='mock.spreadsheets().values().update' id='4684074672'> def assert_called_once(self): """assert that the mock was called only once. """ if not self.call_count == 1: msg = ("Expected '%s' to have been called once. Called %s times.%s" % (self._mock_name or 'mock', self.call_count, self._calls_repr())) > raise AssertionError(msg) E AssertionError: Expected 'update' to have been called once. Called 2 times. E Calls: [call(), E call(spreadsheetId='1FEqiDqqPfb9YHAWBiqVepmmXj22zNqXNNI7NLGCDVak', range='Sheet1!A1', valueInputOption='USER_ENTERED', body={'values': [['Email Address', 'Tool being used', 'Feature used', 'Context Awareness', 'Autonomy', 'Experience', 'Output Quality', 'Overall Rating', 'Mean Rating', 'Difference', 'Result', 'Unique ID'], ['test@example.com', 'Tool1', 'Feature1', '4.00', '4.00', '4.00', '4.00', '4.00', '4.00', '0.00', 'Ok', 'ID1']], 'majorDimension': 'ROWS'}), E call().execute()]. /Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/unittest/mock.py:886: AssertionError ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 119 17 86% ------------------------------------- TOTAL 119 17 86% ============================================== short test summary info ============================================== FAILED test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_success - AssertionError: Expected 'clear' to have been called once. Called 2 times. FAILED test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_success - AssertionError: Expected 'update' to have been called once. Called 2 times. ====================================== 2 failed, 5 passed, 1 warning in 0.57s ======================================= Fix the unit test errors User Observation: $ python3 -m pytest test_extract_data.py -v --cov=extract_data ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 6 items test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_success PASSED [ 16%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_api_error PASSED [ 33%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_no_data PASSED [ 50%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_success PASSED [ 66%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_failure PASSED [ 83%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_success PASSED [100%] ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 119 20 83% ------------------------------------- TOTAL 119 20 83% =========================================== 6 passed, 1 warning in 0.41s ============================================ Response ID: d1c2b2d7-0d99-4f95-ab3d-04d88548a101

Prompt Category: Unit Testing Prompt: extract_data.py See that the Finla code coverage is greater than 90% User Observation: $ python3 -m pytest test_extract_data.py -v --cov=extract_data ================================================ test session starts ================================================ platform darwin -- Python 3.9.6, pytest-8.3.5, pluggy-1.5.0 -- /Library/Developer/CommandLineTools/usr/bin/python3 cachedir: .pytest_cache rootdir: /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3 plugins: cov-6.0.0 collected 12 items test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_api_error PASSED [ 8%] test_extract_data.py::TestGoogleSheetExtraction::test_clear_target_sheet_success PASSED [ 16%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_invalid_ratings PASSED [ 25%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_missing_columns PASSED [ 33%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_no_data PASSED [ 41%] test_extract_data.py::TestGoogleSheetExtraction::test_get_google_sheet_data_success PASSED [ 50%] test_extract_data.py::TestGoogleSheetExtraction::test_get_result_status PASSED [ 58%] test_extract_data.py::TestGoogleSheetExtraction::test_main_execution_failure PASSED [ 66%] test_extract_data.py::TestGoogleSheetExtraction::test_main_execution_success PASSED [ 75%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_api_error PASSED [ 83%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_success PASSED [ 91%] test_extract_data.py::TestGoogleSheetExtraction::test_write_to_target_sheet_with_nan PASSED [100%] ================================================= warnings summary ================================================== test_extract_data.py:3 /Users/surya.sandeep.boda/Desktop/Marscode Zero to One 3/test_extract_data.py:3: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ---------- coverage: platform darwin, python 3.9.6-final-0 ----------- Name Stmts Miss Cover ------------------------------------- extract_data.py 119 13 89% ------------------------------------- TOTAL 119 13 89% =========================================== 12 passed, 1 warning in 0.52s =========================================== Response ID: e755f5ba-c080-44b9-a624-2f65397bac9b

Anidipta · 2025-03-13T14:38:58Z

PyArrow is light weighted compared to them

Ashar-perwez · 2025-03-23T09:13:26Z

Subject: Feedback on Pyarrow as a Required Dependency in pandas 3.0

Hello pandas team,

I recently encountered the warning message about Pyarrow becoming a required dependency in the next major release of pandas (pandas 3.0). I appreciate the heads-up and the rationale behind this change, especially for enabling more performant data types and better interoperability with other libraries.

However, I wanted to share that this change might cause some challenges for users who are not yet familiar with Pyarrow or who may have constraints in their environments that make it difficult to install additional dependencies. It would be helpful if the documentation could provide clear guidance on how to transition smoothly, including any potential performance benefits and use cases where Pyarrow is particularly advantageous.

Additionally, I noticed a small syntax issue in the code snippet provided in the warning message. The columns parameter in the pd.DataFrame constructor should be columns=["student id", "age"] instead of columns["student id", "age"]. Correcting this would prevent confusion for users who might copy and paste the code.

Thank you for your hard work and for considering user feedback. I look forward to seeing the improvements in pandas 3.0!

Best regards,
Ashar-perwez

lithomas1 pinned this issue Aug 9, 2023

lithomas1 added Community Arrow labels Aug 9, 2023

jjerphan mentioned this issue Aug 18, 2023

Make pyarrow a required dependency #52509

Closed

lukemanley unpinned this issue Sep 6, 2023

lukemanley pinned this issue Sep 13, 2023

ivirshup mentioned this issue Oct 9, 2023

(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays scverse/anndata#1068

Open

WillAyd mentioned this issue Aug 28, 2024

PDEP-15: Reject PDEP-10 #58623

Open

5 tasks

bjlittle mentioned this issue Oct 28, 2024

drop optional pyarrow for pandas 3.x bjlittle/geovista#1178

Merged

AstrobioMike mentioned this issue Nov 13, 2024

Update meta.yaml bioconda/bioconda-recipes#52080

Merged

Anamika1jan mentioned this issue Nov 28, 2024

QST: when I import pandas as pd then i get an error #60438

Closed

2 tasks

seunafara added a commit to seunafara/Practical-Machine-Learning-Models that referenced this issue Dec 16, 2024

chore: pyarrow # pandas-dev/pandas#54466 (comment)

8a4e56c

taldcroft mentioned this issue Feb 6, 2025

Astropy CSV table reader using pyarrow astropy/astropy#17706

Draft

1 task

ttngu207 mentioned this issue Feb 7, 2025

DataJoint import error due to missing pyarrow (a pandas dependency) datajoint/datajoint-python#1202

Open

Roger-GOAT mentioned this issue Feb 28, 2025

检索不到文献 luckylykkk/nnscholar-search#1

Open

FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

Comments

phofl commented Aug 9, 2023 • edited by mroeschke Loading

mynewestgitaccount commented Aug 11, 2023

rebecca-palmer commented Aug 14, 2023

mroeschke commented Aug 16, 2023

mynewestgitaccount commented Aug 16, 2023

rebecca-palmer commented Aug 16, 2023

rebecca-palmer commented Aug 18, 2023

jjerphan commented Aug 18, 2023

lithomas1 commented Aug 18, 2023 • edited Loading

jjerphan commented Aug 18, 2023 • edited Loading

DerThorsten commented Aug 22, 2023 • edited Loading

surfaceowl commented Aug 30, 2023

stonebig commented Aug 30, 2023

mlkui commented Aug 31, 2023 • edited Loading

phofl commented Aug 31, 2023

mlkui commented Sep 1, 2023 • edited Loading

0x26res commented Sep 3, 2023 • edited Loading

flying-sheep commented Oct 10, 2023

EwoutH commented Oct 26, 2023

h-vetinari commented Nov 1, 2023

musicinmybrain commented Nov 3, 2023

ZupoLlask commented Nov 30, 2023

raulcd commented Nov 30, 2023

chris-vecchio commented Dec 8, 2023 • edited Loading

soulphish commented Aug 15, 2024

agriyakhetarpal commented Aug 15, 2024

Runa7debug commented Oct 5, 2024

bersbersbers commented Oct 8, 2024

miraculixx commented Oct 14, 2024 • edited Loading

asishm commented Oct 14, 2024

miraculixx commented Nov 22, 2024 • edited Loading

agriyakhetarpal commented Nov 22, 2024

jjGG commented Mar 11, 2025

MarcoGorelli commented Mar 11, 2025

Anidipta commented Mar 13, 2025

Ashar-perwez commented Mar 23, 2025

phofl commented Aug 9, 2023 •

edited by mroeschke

Loading

lithomas1 commented Aug 18, 2023 •

edited

Loading

jjerphan commented Aug 18, 2023 •

edited

Loading

DerThorsten commented Aug 22, 2023 •

edited

Loading

mlkui commented Aug 31, 2023 •

edited

Loading

mlkui commented Sep 1, 2023 •

edited

Loading

0x26res commented Sep 3, 2023 •

edited

Loading

chris-vecchio commented Dec 8, 2023 •

edited

Loading

miraculixx commented Oct 14, 2024 •

edited

Loading

miraculixx commented Nov 22, 2024 •

edited

Loading