Merge remote-tracking branch 'upstream/master' into to_html-to_string
* upstream/master:
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
thoo committed Nov 12, 2018
2 parents 7186aaf + 011b79f commit 21fa21e
Showing 34 changed files with 711 additions and 410 deletions.
4 changes: 4 additions & 0 deletions ci/code_checks.sh
@@ -122,6 +122,10 @@ if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
! grep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. (autosummary|contents|currentmodule|deprecated|function|image|important|include|ipython|literalinclude|math|module|note|raw|seealso|toctree|versionadded|versionchanged|warning):[^:]" ./pandas ./doc/source
RET=$(($RET + $?)) ; echo $MSG "DONE"

MSG='Check that the deprecated `assert_raises_regex` is not used (`pytest.raises(match=pattern)` should be used instead)' ; echo $MSG
! grep -R --exclude=*.pyc --exclude=testing.py --exclude=test_testing.py assert_raises_regex pandas
RET=$(($RET + $?)) ; echo $MSG "DONE"
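
As an illustration of the migration this check enforces, here is a minimal sketch (``compute`` is a hypothetical function under test):

.. code-block:: python

    import pytest

    def compute():
        raise ValueError("bad value")

    # Deprecated style, now rejected by the check above:
    #   tm.assert_raises_regex(ValueError, "bad value", compute)

    # Preferred style:
    with pytest.raises(ValueError, match="bad value"):
        compute()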

MSG='Check for modules that pandas should not import' ; echo $MSG
python -c "
import sys
5 changes: 5 additions & 0 deletions doc/source/io.rst
@@ -2854,6 +2854,11 @@ It is often the case that users will insert columns to do temporary computations
in Excel that you may not want to read in. ``read_excel`` takes
a ``usecols`` keyword to allow you to specify a subset of columns to parse.

.. deprecated:: 0.24.0

Passing in an integer for ``usecols`` has been deprecated. Please pass in a list
of ints from 0 to ``usecols`` inclusive instead.

If ``usecols`` is an integer, then it is assumed to indicate the last column
to be parsed.
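
A minimal sketch of the migration (the file name here is hypothetical):

.. code-block:: python

    import pandas as pd

    # Deprecated: an integer means "parse columns 0 through 2".
    # df = pd.read_excel("data.xlsx", usecols=2)

    # Preferred: pass the equivalent list of column indices.
    df = pd.read_excel("data.xlsx", usecols=[0, 1, 2])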

110 changes: 104 additions & 6 deletions doc/source/reshaping.rst
@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
Reshaping by pivoting DataFrame objects
---------------------------------------

.. image:: _static/reshaping_pivot.png

.. ipython::
:suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

In [3]: df = unpivot(tm.makeTimeDataFrame())

Data is often stored in CSV files or databases in so-called "stacked" or
"record" format:
Data is often stored in so-called "stacked" or "record" format:

.. ipython:: python
@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:
df[df['variable'] == 'A']
.. image:: _static/reshaping_pivot.png

But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
``index`` of dates identifies individual observations. To reshape the data into
@@ -87,7 +86,7 @@ column:
.. ipython:: python
df['value2'] = df['value'] * 2
pivoted = df.pivot('date', 'variable')
pivoted = df.pivot(index='date', columns='variable')
pivoted
You can then select subsets from the pivoted ``DataFrame``:
@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.

.. note::
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
entries, cannot reshape`` if the index/column pair is not unique. In this
case, consider using :func:`~pandas.pivot_table` which is a generalization
of pivot that can handle duplicate values for one index/column pair.
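
A minimal sketch of that failure mode and the ``pivot_table`` alternative (data invented for illustration):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"date": ["2018-01-01", "2018-01-01"],
                       "variable": ["A", "A"],
                       "value": [1.0, 2.0]})

    # Raises ValueError: Index contains duplicate entries, cannot reshape
    # df.pivot(index="date", columns="variable", values="value")

    # pivot_table aggregates the duplicate pair instead (mean by default):
    df.pivot_table(index="date", columns="variable", values="value")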

.. _reshaping.stacking:

Reshaping by stacking and unstacking
@@ -704,10 +709,103 @@ handling of NaN:
In [3]: np.unique(x, return_inverse=True)[::-1]
Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.
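
A small sketch of the two equivalent forms:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"col": ["a", "b", "a"]})

    df["cat_col"] = pd.Categorical(df["col"])      # explicit constructor
    df["cat_col"] = df["col"].astype("category")   # equivalent astype spelling

    df["cat_col"].dtype  # CategoricalDtype(categories=['a', 'b'], ordered=False)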

Examples
--------

In this section, we will review frequently asked questions and examples. The
column names and relevant column values are named to correspond with how this
DataFrame will be pivoted in the answers below.

.. ipython:: python
np.random.seed([3, 1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
df.columns = cols
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
df
Pivoting with Single Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose we want to pivot ``df`` such that the ``col`` values are columns,
the ``row`` values are the index, and the values are the mean of ``val0``. In
particular, the resulting DataFrame should look like:

.. code-block:: ipython
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
This solution uses :func:`~pandas.pivot_table`. Also note that
``aggfunc='mean'`` is the default. It is included here to be explicit.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='mean')
Note that we can also replace the missing values by using the ``fill_value``
parameter.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
Note that we can pass in other aggregation functions as well. For example,
we can pass in ``sum``.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
Another aggregation we can perform is to calculate the frequency with which
the columns and rows occur together, a.k.a. "cross tabulation". To do this, we can pass
``size`` to the ``aggfunc`` parameter.

.. ipython:: python
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
Pivoting with Multiple Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can also perform multiple aggregations. For example, to perform both a
``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
Note that to aggregate over multiple value columns, we can pass in a list to the
``values`` parameter.

.. ipython:: python
df.pivot_table(
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
Note that to subdivide over multiple columns, we can pass in a list to the
``columns`` parameter.

.. ipython:: python
df.pivot_table(
values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -972,6 +972,7 @@ Deprecations
- The class ``FrozenNDArray`` has been deprecated. When unpickling, ``FrozenNDArray`` will be unpickled to ``np.ndarray`` once this class is removed (:issue:`9031`)
- Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
`use_threads` to reflect the changes in pyarrow 0.11.0 (:issue:`23053`); see the sketch after this list.
- :func:`pandas.read_excel` has deprecated accepting ``usecols`` as an integer. Please pass in a list of ints from 0 to ``usecols`` inclusive instead (:issue:`23527`)
- Constructing a :class:`TimedeltaIndex` from data with ``datetime64``-dtyped data is deprecated, will raise ``TypeError`` in a future version (:issue:`23539`)
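
To illustrate the ``read_feather`` entry above, a sketch of the keyword migration (the file name is hypothetical):

.. code-block:: python

    import pandas as pd

    # Deprecated keyword, emits a deprecation warning:
    # df = pd.read_feather("data.feather", nthreads=4)

    # Preferred, matching pyarrow >= 0.11.0:
    df = pd.read_feather("data.feather", use_threads=True)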

.. _whatsnew_0240.deprecations.datetimelike_int_ops:
@@ -1300,6 +1301,7 @@ Notice how we now output ``np.nan`` itself instead of a stringified form
- :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
- :func:`read_csv()` and :func:`read_table()` will throw ``UnicodeError`` and not core dump on badly encoded strings (:issue:`22748`)
- :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
- Bug in :func:`read_csv()` in which memory management was prematurely optimized for the C engine when the data was being read in chunks (:issue:`23509`)
- :func:`read_sas()` will correctly parse numbers in sas7bdat files that have a width of less than 8 bytes (:issue:`21616`)
- :func:`read_sas()` will correctly parse sas7bdat files with many columns (:issue:`22628`)
- :func:`read_sas()` will correctly parse sas7bdat files with data page types that also have bit 7 set (so the page type is 128 + 256 = 384) (:issue:`16615`)
1 change: 1 addition & 0 deletions pandas/_libs/parsers.pyx
@@ -132,6 +132,7 @@ cdef extern from "parser/tokenizer.h":
int64_t *word_starts # where we are in the stream
int64_t words_len
int64_t words_cap
int64_t max_words_cap # maximum word cap encountered

char *pword_start # pointer to stream start of current field
int64_t word_start # position start of current field
33 changes: 31 additions & 2 deletions pandas/_libs/src/parser/tokenizer.c
@@ -197,6 +197,7 @@ int parser_init(parser_t *self) {
sz = sz ? sz : 1;
self->words = (char **)malloc(sz * sizeof(char *));
self->word_starts = (int64_t *)malloc(sz * sizeof(int64_t));
self->max_words_cap = sz;
self->words_cap = sz;
self->words_len = 0;

@@ -247,7 +248,7 @@ void parser_del(parser_t *self) {
}

static int make_stream_space(parser_t *self, size_t nbytes) {
int64_t i, cap;
int64_t i, cap, length;
int status;
void *orig_ptr, *newptr;

@@ -287,8 +288,23 @@ static int make_stream_space(parser_t *self, size_t nbytes) {
*/

cap = self->words_cap;

/**
* If we are reading in chunks, we need to be aware of the maximum number
* of words we have seen in previous chunks (self->max_words_cap), so
* that we can properly allocate memory when reading subsequent ones.
*
* Otherwise, we risk a buffer overflow if we mistakenly under-allocate
* just because a recent chunk did not have as many words.
*/
if (self->words_len + nbytes < self->max_words_cap) {
length = self->max_words_cap - nbytes;
} else {
length = self->words_len;
}

self->words =
(char **)grow_buffer((void *)self->words, self->words_len,
(char **)grow_buffer((void *)self->words, length,
(int64_t*)&self->words_cap, nbytes,
sizeof(char *), &status);
TRACE(
Expand Down Expand Up @@ -1241,6 +1257,19 @@ int parser_trim_buffers(parser_t *self) {

int64_t i;

/**
* Before we free up space and trim the buffers, save
* the number of words we saw while parsing if it
* exceeds the maximum number seen so far.
*
* This is important when reading in chunks: it informs
* subsequent chunk parsing of how many words we could
* possibly see.
*/
if (self->words_cap > self->max_words_cap) {
self->max_words_cap = self->words_cap;
}

/* trim words, word_starts */
new_cap = _next_pow2(self->words_len) + 1;
if (new_cap < self->words_cap) {
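
The scenario this bookkeeping guards against can be sketched from Python by reading a "jagged" CSV in chunks, as in pandas-dev#23527; a rough sketch with invented data:

.. code-block:: python

    import io
    import pandas as pd

    # Early rows are wide, later rows narrow, so per-chunk word counts
    # shrink and buffer trimming previously under-allocated for reuse.
    data = "a,b,c\n" + "1,2,3\n" * 50 + "4\n" * 5000

    # Reading in chunks exercises parser_trim_buffers between chunks;
    # max_words_cap keeps subsequent allocations large enough.
    for chunk in pd.read_csv(io.StringIO(data), chunksize=100):
        pass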
1 change: 1 addition & 0 deletions pandas/_libs/src/parser/tokenizer.h
@@ -142,6 +142,7 @@ typedef struct parser_t {
int64_t *word_starts; // where we are in the stream
int64_t words_len;
int64_t words_cap;
int64_t max_words_cap; // maximum word cap encountered

char *pword_start; // pointer to stream start of current field
int64_t word_start; // position start of current field
8 changes: 6 additions & 2 deletions pandas/core/arrays/datetimelike.py
@@ -124,8 +124,12 @@ def asi8(self):
# do not cache or you'll create a memory leak
return self._data.view('i8')

# ------------------------------------------------------------------
# Array-like Methods
# ----------------------------------------------------------------
# Array-Like / EA-Interface Methods

@property
def nbytes(self):
return self._data.nbytes

@property
def shape(self):
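
A quick sketch of what the new ``nbytes`` property reports, here via a datetime index that delegates to the backing array (each ``datetime64[ns]`` value is 8 bytes):

.. code-block:: python

    import pandas as pd

    dti = pd.date_range("2018-01-01", periods=3)
    dti.nbytes  # 24: three datetime64[ns] values at 8 bytes each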
2 changes: 1 addition & 1 deletion pandas/core/arrays/datetimes.py
@@ -385,7 +385,7 @@ def _resolution(self):
return libresolution.resolution(self.asi8, self.tz)

# ----------------------------------------------------------------
# Array-like Methods
# Array-Like / EA-Interface Methods

def __array__(self, dtype=None):
if is_object_dtype(dtype):