Merge remote-tracking branch 'upstream/master' into to_html-to_string
* upstream/master:
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
thoo committed Nov 12, 2018
2 parents 7186aaf + 011b79f commit 21fa21e
Showing 34 changed files with 711 additions and 410 deletions.
4 changes: 4 additions & 0 deletions ci/code_checks.sh
@@ -122,6 +122,10 @@ if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
! grep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. (autosummary|contents|currentmodule|deprecated|function|image|important|include|ipython|literalinclude|math|module|note|raw|seealso|toctree|versionadded|versionchanged|warning):[^:]" ./pandas ./doc/source
RET=$(($RET + $?)) ; echo $MSG "DONE"

MSG='Check that the deprecated `assert_raises_regex` is not used (`pytest.raises(match=pattern)` should be used instead)' ; echo $MSG
! grep -R --exclude=*.pyc --exclude=testing.py --exclude=test_testing.py assert_raises_regex pandas
RET=$(($RET + $?)) ; echo $MSG "DONE"
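
As an illustration of the migration this check enforces, here is a minimal sketch (``compute`` is a hypothetical function under test):

.. code-block:: python

    import pytest

    def compute():
        raise ValueError("bad value")

    # Deprecated style, now rejected by the check above:
    #   tm.assert_raises_regex(ValueError, "bad value", compute)

    # Preferred style:
    with pytest.raises(ValueError, match="bad value"):
        compute()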

MSG='Check for modules that pandas should not import' ; echo $MSG
python -c "
import sys
5 changes: 5 additions & 0 deletions doc/source/io.rst
@@ -2854,6 +2854,11 @@ It is often the case that users will insert columns to do temporary computations
in Excel that you may not want to read in. ``read_excel`` takes
a ``usecols`` keyword to allow you to specify a subset of columns to parse.

.. deprecated:: 0.24.0

Passing in an integer for ``usecols`` has been deprecated. Please pass in a list
of ints from 0 to ``usecols`` inclusive instead.

If ``usecols`` is an integer, then it is assumed to indicate the last column
to be parsed.
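
A minimal sketch of the migration (the file name here is hypothetical):

.. code-block:: python

    import pandas as pd

    # Deprecated: an integer means "parse columns 0 through 2".
    # df = pd.read_excel("data.xlsx", usecols=2)

    # Preferred: pass the equivalent list of column indices.
    df = pd.read_excel("data.xlsx", usecols=[0, 1, 2])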

110 changes: 104 additions & 6 deletions doc/source/reshaping.rst
@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
Reshaping by pivoting DataFrame objects
---------------------------------------

.. image:: _static/reshaping_pivot.png

.. ipython::
:suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

In [3]: df = unpivot(tm.makeTimeDataFrame())

Data is often stored in CSV files or databases in so-called "stacked" or
"record" format:
Data is often stored in so-called "stacked" or "record" format:

.. ipython:: python
@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:
df[df['variable'] == 'A']
.. image:: _static/reshaping_pivot.png

But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
``index`` of dates identifies individual observations. To reshape the data into
@@ -87,7 +86,7 @@ column:
.. ipython:: python
df['value2'] = df['value'] * 2
pivoted = df.pivot('date', 'variable')
pivoted = df.pivot(index='date', columns='variable')
pivoted
You can then select subsets from the pivoted ``DataFrame``:
@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.

.. note::
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
entries, cannot reshape`` if the index/column pair is not unique. In this
case, consider using :func:`~pandas.pivot_table` which is a generalization
of pivot that can handle duplicate values for one index/column pair.
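
A minimal sketch of that failure mode and the ``pivot_table`` alternative (data invented for illustration):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"date": ["2018-01-01", "2018-01-01"],
                       "variable": ["A", "A"],
                       "value": [1.0, 2.0]})

    # Raises ValueError: Index contains duplicate entries, cannot reshape
    # df.pivot(index="date", columns="variable", values="value")

    # pivot_table aggregates the duplicate pair instead (mean by default):
    df.pivot_table(index="date", columns="variable", values="value")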

.. _reshaping.stacking:

Reshaping by stacking and unstacking
@@ -704,10 +709,103 @@ handling of NaN:
In [3]: np.unique(x, return_inverse=True)[::-1]
Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.
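
A small sketch of the two equivalent forms:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"col": ["a", "b", "a"]})

    df["cat_col"] = pd.Categorical(df["col"])      # explicit constructor
    df["cat_col"] = df["col"].astype("category")   # equivalent astype spelling

    df["cat_col"].dtype  # CategoricalDtype(categories=['a', 'b'], ordered=False)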

Examples
--------

In this section, we will review frequently asked questions and examples. The
column names and relevant column values are named to correspond with how this
DataFrame will be pivoted in the answers below.

.. ipython:: python
np.random.seed([3, 1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
df.columns = cols
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
df
Pivoting with Single Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose we want to pivot ``df`` such that the ``col`` values are columns,
the ``row`` values are the index, and the values are the mean of ``val0``. In
particular, the resulting DataFrame should look like:

.. code-block:: ipython
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
This solution uses :func:`~pandas.pivot_table`. Also note that
``aggfunc='mean'`` is the default. It is included here to be explicit.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='mean')
Note that we can also replace the missing values by using the ``fill_value``
parameter.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
Note that we can pass in other aggregation functions as well. For example,
we can pass in ``sum``.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
Another aggregation we can perform is to calculate the frequency with which
the columns and rows occur together, a.k.a. "cross tabulation". To do this, we can pass
``size`` to the ``aggfunc`` parameter.

.. ipython:: python
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
Pivoting with Multiple Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can also perform multiple aggregations. For example, to perform both a
``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.

.. ipython:: python
df.pivot_table(
values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
Note that to aggregate over multiple value columns, we can pass in a list to the
``values`` parameter.

.. ipython:: python
df.pivot_table(
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
Note that to subdivide over multiple columns, we can pass in a list to the
``columns`` parameter.

.. ipython:: python
df.pivot_table(
values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -972,6 +972,7 @@ Deprecations
- The class ``FrozenNDArray`` has been deprecated. When unpickling, ``FrozenNDArray`` will be unpickled to ``np.ndarray`` once this class is removed (:issue:`9031`)
- Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
`use_threads` to reflect the changes in pyarrow 0.11.0 (:issue:`23053`); see the sketch after this list.
- :func:`pandas.read_excel` has deprecated accepting ``usecols`` as an integer. Please pass in a list of ints from 0 to ``usecols`` inclusive instead (:issue:`23527`)
- Constructing a :class:`TimedeltaIndex` from data with ``datetime64``-dtyped data is deprecated, will raise ``TypeError`` in a future version (:issue:`23539`)
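
To illustrate the ``read_feather`` entry above, a sketch of the keyword migration (the file name is hypothetical):

.. code-block:: python

    import pandas as pd

    # Deprecated keyword, emits a deprecation warning:
    # df = pd.read_feather("data.feather", nthreads=4)

    # Preferred, matching pyarrow >= 0.11.0:
    df = pd.read_feather("data.feather", use_threads=True)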

.. _whatsnew_0240.deprecations.datetimelike_int_ops:
@@ -1300,6 +1301,7 @@ Notice how we now output ``np.nan`` itself instead of a stringified form
- :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
- :func:`read_csv()` and :func:`read_table()` will throw ``UnicodeError`` and not core dump on badly encoded strings (:issue:`22748`)
- :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
- Bug in :func:`read_csv()` in which memory management was prematurely optimized for the C engine when the data was being read in chunks (:issue:`23509`)
- :func:`read_sas()` will correctly parse numbers in sas7bdat files that have a width of less than 8 bytes (:issue:`21616`)
- :func:`read_sas()` will correctly parse sas7bdat files with many columns (:issue:`22628`)
- :func:`read_sas()` will correctly parse sas7bdat files with data page types that also have bit 7 set (so the page type is 128 + 256 = 384) (:issue:`16615`)
1 change: 1 addition & 0 deletions pandas/_libs/parsers.pyx
@@ -132,6 +132,7 @@ cdef extern from "parser/tokenizer.h":
int64_t *word_starts # where we are in the stream
int64_t words_len
int64_t words_cap
int64_t max_words_cap # maximum word cap encountered

char *pword_start # pointer to stream start of current field
int64_t word_start # position start of current field
33 changes: 31 additions & 2 deletions pandas/_libs/src/parser/tokenizer.c
@@ -197,6 +197,7 @@ int parser_init(parser_t *self) {
sz = sz ? sz : 1;
self->words = (char **)malloc(sz * sizeof(char *));
self->word_starts = (int64_t *)malloc(sz * sizeof(int64_t));
self->max_words_cap = sz;
self->words_cap = sz;
self->words_len = 0;

@@ -247,7 +248,7 @@ void parser_del(parser_t *self) {
}

static int make_stream_space(parser_t *self, size_t nbytes) {
int64_t i, cap;
int64_t i, cap, length;
int status;
void *orig_ptr, *newptr;

@@ -287,8 +288,23 @@ static int make_stream_space(parser_t *self, size_t nbytes) {
*/

cap = self->words_cap;

/**
* If we are reading in chunks, we need to be aware of the maximum number
* of words we have seen in previous chunks (self->max_words_cap), so
* that we can properly allocate memory when reading subsequent ones.
*
* Otherwise, we risk a buffer overflow if we mistakenly under-allocate
* just because a recent chunk did not have as many words.
*/
if (self->words_len + nbytes < self->max_words_cap) {
length = self->max_words_cap - nbytes;
} else {
length = self->words_len;
}

self->words =
(char **)grow_buffer((void *)self->words, self->words_len,
(char **)grow_buffer((void *)self->words, length,
(int64_t*)&self->words_cap, nbytes,
sizeof(char *), &status);
TRACE(
Expand Down Expand Up @@ -1241,6 +1257,19 @@ int parser_trim_buffers(parser_t *self) {

int64_t i;

/**
* Before we free up space and trim the buffers, save
* the number of words we saw while parsing if it
* exceeds the maximum number seen so far.
*
* This is important when reading in chunks: it informs
* subsequent chunk parsing of how many words we could
* possibly see.
*/
if (self->words_cap > self->max_words_cap) {
self->max_words_cap = self->words_cap;
}

/* trim words, word_starts */
new_cap = _next_pow2(self->words_len) + 1;
if (new_cap < self->words_cap) {
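
The scenario this bookkeeping guards against can be sketched from Python by reading a "jagged" CSV in chunks, as in pandas-dev#23527; a rough sketch with invented data:

.. code-block:: python

    import io
    import pandas as pd

    # Early rows are wide, later rows narrow, so per-chunk word counts
    # shrink and buffer trimming previously under-allocated for reuse.
    data = "a,b,c\n" + "1,2,3\n" * 50 + "4\n" * 5000

    # Reading in chunks exercises parser_trim_buffers between chunks;
    # max_words_cap keeps subsequent allocations large enough.
    for chunk in pd.read_csv(io.StringIO(data), chunksize=100):
        pass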
1 change: 1 addition & 0 deletions pandas/_libs/src/parser/tokenizer.h
@@ -142,6 +142,7 @@ typedef struct parser_t {
int64_t *word_starts; // where we are in the stream
int64_t words_len;
int64_t words_cap;
int64_t max_words_cap; // maximum word cap encountered

char *pword_start; // pointer to stream start of current field
int64_t word_start; // position start of current field
8 changes: 6 additions & 2 deletions pandas/core/arrays/datetimelike.py
@@ -124,8 +124,12 @@ def asi8(self):
# do not cache or you'll create a memory leak
return self._data.view('i8')

# ------------------------------------------------------------------
# Array-like Methods
# ----------------------------------------------------------------
# Array-Like / EA-Interface Methods

@property
def nbytes(self):
return self._data.nbytes

@property
def shape(self):
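
A quick sketch of what the new ``nbytes`` property reports, here via a datetime index that delegates to the backing array (each ``datetime64[ns]`` value is 8 bytes):

.. code-block:: python

    import pandas as pd

    dti = pd.date_range("2018-01-01", periods=3)
    dti.nbytes  # 24: three datetime64[ns] values at 8 bytes each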
2 changes: 1 addition & 1 deletion pandas/core/arrays/datetimes.py
@@ -385,7 +385,7 @@ def _resolution(self):
return libresolution.resolution(self.asi8, self.tz)

# ----------------------------------------------------------------
# Array-like Methods
# Array-Like / EA-Interface Methods

def __array__(self, dtype=None):
if is_object_dtype(dtype):