
Commit 63c6d84

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_fixed

* upstream/master:
  DOC: avoid SparseArray.take error (pandas-dev#23637)
  CLN: remove incorrect usages of com.AbstractMethodError (pandas-dev#23625)
  DOC: Adding validation of the section order in docstrings (pandas-dev#23607)
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
thoo committed Nov 12, 2018
2 parents 237a024 + 2d4dd50, commit 63c6d84
Showing 43 changed files with 567 additions and 306 deletions.
4 changes: 4 additions & 0 deletions ci/code_checks.sh
@@ -122,6 +122,10 @@ if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
! grep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. (autosummary|contents|currentmodule|deprecated|function|image|important|include|ipython|literalinclude|math|module|note|raw|seealso|toctree|versionadded|versionchanged|warning):[^:]" ./pandas ./doc/source
RET=$(($RET + $?)) ; echo $MSG "DONE"

MSG='Check that the deprecated `assert_raises_regex` is not used (`pytest.raises(match=pattern)` should be used instead)' ; echo $MSG
! grep -R --exclude=*.pyc --exclude=testing.py --exclude=test_testing.py assert_raises_regex pandas
RET=$(($RET + $?)) ; echo $MSG "DONE"
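The replacement pattern this check enforces can be sketched as follows (the `divide` function is a hypothetical example for illustration, not pandas code):

```python
import pytest

def divide(a, b):
    # Toy function used only to demonstrate the pattern.
    if b == 0:
        raise ValueError("division by zero is not allowed")
    return a / b

# Instead of the deprecated tm.assert_raises_regex, match the exception
# message against a regex with pytest.raises(..., match=...):
with pytest.raises(ValueError, match="division by zero"):
    divide(1, 0)
```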

MSG='Check for modules that pandas should not import' ; echo $MSG
python -c "
import sys
5 changes: 5 additions & 0 deletions doc/source/io.rst
@@ -2854,6 +2854,11 @@ It is often the case that users will insert columns to do temporary computations
in Excel and you may not want to read in those columns. ``read_excel`` takes
a ``usecols`` keyword to allow you to specify a subset of columns to parse.

.. deprecated:: 0.24.0

Passing in an integer for ``usecols`` has been deprecated. Please pass in a list
of ints from 0 to ``usecols`` inclusive instead.

If ``usecols`` is an integer, then it is assumed to indicate the last column
to be parsed.
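A minimal sketch of the migration, using a hypothetical helper (not part of pandas) to make the old integer semantics explicit:

```python
# Hypothetical helper: the deprecated integer form usecols=N meant
# "parse columns 0 through N inclusive".
def usecols_int_to_list(n):
    return list(range(n + 1))

usecols_int_to_list(2)  # [0, 1, 2]

# pd.read_excel("data.xlsx", usecols=2)          # deprecated
# pd.read_excel("data.xlsx", usecols=[0, 1, 2])  # preferred
```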

2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.18.1.txt
@@ -266,7 +266,7 @@ These changes conform sparse handling to return the correct types and work to ma

``SparseArray.take`` now returns a scalar for scalar input, ``SparseArray`` for others. Furthermore, it handles a negative indexer with the same rule as ``Index`` (:issue:`10560`, :issue:`12796`)

.. ipython:: python
.. code-block:: python

s = pd.SparseArray([np.nan, np.nan, 1, 2, 3, np.nan, 4, 5, np.nan, 6])
s.take(0)
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -970,6 +970,7 @@ Deprecations
- The class ``FrozenNDArray`` has been deprecated. When unpickling, ``FrozenNDArray`` will be unpickled to ``np.ndarray`` once this class is removed (:issue:`9031`)
- Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
`use_threads` to reflect the changes in pyarrow 0.11.0. (:issue:`23053`)
- :func:`pandas.read_excel` has deprecated accepting ``usecols`` as an integer. Please pass in a list of ints from 0 to ``usecols`` inclusive instead (:issue:`23527`)
- Constructing a :class:`TimedeltaIndex` from data with ``datetime64``-dtyped data is deprecated, will raise ``TypeError`` in a future version (:issue:`23539`)

.. _whatsnew_0240.deprecations.datetimelike_int_ops:
@@ -1298,6 +1299,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
- :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
- :func:`read_csv()` and :func:`read_table()` will throw ``UnicodeError`` and not core dump on badly encoded strings (:issue:`22748`)
- :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
- Bug in :func:`read_csv()` in which memory management was prematurely optimized for the C engine when the data was being read in chunks (:issue:`23509`)
- :func:`read_sas()` will parse numbers in sas7bdat-files that have width less than 8 bytes correctly. (:issue:`21616`)
- :func:`read_sas()` will correctly parse sas7bdat files with many columns (:issue:`22628`)
- :func:`read_sas()` will correctly parse sas7bdat files with data page types having also bit 7 set (so page type is 128 + 256 = 384) (:issue:`16615`)
1 change: 1 addition & 0 deletions pandas/_libs/parsers.pyx
@@ -132,6 +132,7 @@ cdef extern from "parser/tokenizer.h":
int64_t *word_starts # where we are in the stream
int64_t words_len
int64_t words_cap
int64_t max_words_cap # maximum word cap encountered

char *pword_start # pointer to stream start of current field
int64_t word_start # position start of current field
33 changes: 31 additions & 2 deletions pandas/_libs/src/parser/tokenizer.c
@@ -197,6 +197,7 @@ int parser_init(parser_t *self) {
sz = sz ? sz : 1;
self->words = (char **)malloc(sz * sizeof(char *));
self->word_starts = (int64_t *)malloc(sz * sizeof(int64_t));
self->max_words_cap = sz;
self->words_cap = sz;
self->words_len = 0;

@@ -247,7 +248,7 @@ void parser_del(parser_t *self) {
}

static int make_stream_space(parser_t *self, size_t nbytes) {
int64_t i, cap;
int64_t i, cap, length;
int status;
void *orig_ptr, *newptr;

@@ -287,8 +288,23 @@ static int make_stream_space(parser_t *self, size_t nbytes) {
*/

cap = self->words_cap;

/**
* If we are reading in chunks, we need to be aware of the maximum number
* of words we have seen in previous chunks (self->max_words_cap), so
* that way, we can properly allocate when reading subsequent ones.
*
* Otherwise, we risk a buffer overflow if we mistakenly under-allocate
* just because a recent chunk did not have as many words.
*/
if (self->words_len + nbytes < self->max_words_cap) {
length = self->max_words_cap - nbytes;
} else {
length = self->words_len;
}

self->words =
(char **)grow_buffer((void *)self->words, self->words_len,
(char **)grow_buffer((void *)self->words, length,
(int64_t*)&self->words_cap, nbytes,
sizeof(char *), &status);
TRACE(
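The sizing rule described in the comment above can be sketched in Python (a simplified model of the C logic, not the actual implementation):

```python
def words_buffer_length(words_len, nbytes, max_words_cap):
    """Pick the length passed to grow_buffer when growing the words buffer.

    When reading in chunks, allocate based on the largest word count seen
    in any previous chunk (max_words_cap), so that a small recent chunk
    cannot cause an under-allocation and a later buffer overflow.
    """
    if words_len + nbytes < max_words_cap:
        # A previous chunk was larger: size for it (minus the bytes
        # about to be added, which grow_buffer accounts for separately).
        return max_words_cap - nbytes
    # Otherwise the current chunk dominates: size for it.
    return words_len
```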
@@ -1241,6 +1257,19 @@ int parser_trim_buffers(parser_t *self) {

int64_t i;

/**
* Before we free up space and trim, we should
* save how many words we saw when parsing, if
* it exceeds the maximum number we saw before.
*
* This is important for when we read in chunks,
* so that we can inform subsequent chunk parsing
* as to how many words we could possibly see.
*/
if (self->words_cap > self->max_words_cap) {
self->max_words_cap = self->words_cap;
}

/* trim words, word_starts */
new_cap = _next_pow2(self->words_len) + 1;
if (new_cap < self->words_cap) {
1 change: 1 addition & 0 deletions pandas/_libs/src/parser/tokenizer.h
@@ -142,6 +142,7 @@ typedef struct parser_t {
int64_t *word_starts; // where we are in the stream
int64_t words_len;
int64_t words_cap;
int64_t max_words_cap; // maximum word cap encountered

char *pword_start; // pointer to stream start of current field
int64_t word_start; // position start of current field
8 changes: 6 additions & 2 deletions pandas/core/arrays/datetimelike.py
@@ -124,8 +124,12 @@ def asi8(self):
# do not cache or you'll create a memory leak
return self._data.view('i8')

# ------------------------------------------------------------------
# Array-like Methods
# ----------------------------------------------------------------
# Array-Like / EA-Interface Methods

@property
def nbytes(self):
return self._data.nbytes

@property
def shape(self):
2 changes: 1 addition & 1 deletion pandas/core/arrays/datetimes.py
@@ -385,7 +385,7 @@ def _resolution(self):
return libresolution.resolution(self.asi8, self.tz)

# ----------------------------------------------------------------
# Array-like Methods
# Array-Like / EA-Interface Methods

def __array__(self, dtype=None):
if is_object_dtype(dtype):
106 changes: 52 additions & 54 deletions pandas/core/arrays/period.py
@@ -272,10 +272,6 @@ def _concat_same_type(cls, to_concat):

# --------------------------------------------------------------------
# Data / Attributes
@property
def nbytes(self):
# TODO(DatetimeArray): remove
return self._data.nbytes

@cache_readonly
def dtype(self):
@@ -286,10 +282,6 @@ def _ndarray_values(self):
# Ordinals
return self._data

@property
def asi8(self):
return self._data

@property
def freq(self):
"""Return the frequency object for this PeriodArray."""
@@ -330,6 +322,50 @@ def start_time(self):
def end_time(self):
return self.to_timestamp(how='end')

def to_timestamp(self, freq=None, how='start'):
"""
Cast to DatetimeArray/Index.
Parameters
----------
freq : string or DateOffset, optional
Target frequency. The default is 'D' for week or longer,
'S' otherwise
how : {'s', 'e', 'start', 'end'}
Returns
-------
DatetimeArray/Index
"""
from pandas.core.arrays import DatetimeArrayMixin

how = libperiod._validate_end_alias(how)

end = how == 'E'
if end:
if freq == 'B':
# roll forward to ensure we land on B date
adjust = Timedelta(1, 'D') - Timedelta(1, 'ns')
return self.to_timestamp(how='start') + adjust
else:
adjust = Timedelta(1, 'ns')
return (self + self.freq).to_timestamp(how='start') - adjust

if freq is None:
base, mult = frequencies.get_freq_code(self.freq)
freq = frequencies.get_to_timestamp_base(base)
else:
freq = Period._maybe_convert_freq(freq)

base, mult = frequencies.get_freq_code(freq)
new_data = self.asfreq(freq, how=how)

new_data = libperiod.periodarr_to_dt64arr(new_data.asi8, base)
return DatetimeArrayMixin(new_data, freq='infer')
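Usage through the public ``PeriodIndex`` API, which delegates to this array code, looks roughly like this:

```python
import pandas as pd

pi = pd.period_range("2018-01", periods=3, freq="M")

# 'start' maps each period to its first timestamp.
start = pi.to_timestamp(how="start")

# 'end' maps each period to its last instant (note the 1ns adjustment
# in the implementation above).
end = pi.to_timestamp(how="end")
```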

# --------------------------------------------------------------------
# Array-like / EA-Interface Methods

def __repr__(self):
return '<{}>\n{}\nLength: {}, dtype: {}'.format(
self.__class__.__name__,
@@ -456,6 +492,8 @@ def value_counts(self, dropna=False):
name=result.index.name)
return Series(result.values, index=index, name=result.name)

# --------------------------------------------------------------------

def shift(self, periods=1):
"""
Shift values by desired number.
@@ -567,49 +605,9 @@ def asfreq(self, freq=None, how='E'):

return type(self)(new_data, freq=freq)
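A short ``asfreq`` usage sketch via the public ``PeriodIndex`` API: ``how='E'`` (the default) anchors the converted period on the end of each original period, ``how='S'`` on the start.

```python
import pandas as pd

pi = pd.period_range("2018-01", periods=2, freq="M")

# End-anchored: each month becomes its last day.
daily_end = pi.asfreq("D", how="E")

# Start-anchored: each month becomes its first day.
daily_start = pi.asfreq("D", how="S")
```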

def to_timestamp(self, freq=None, how='start'):
"""
Cast to DatetimeArray/Index
Parameters
----------
freq : string or DateOffset, optional
Target frequency. The default is 'D' for week or longer,
'S' otherwise
how : {'s', 'e', 'start', 'end'}
Returns
-------
DatetimeArray/Index
"""
from pandas.core.arrays import DatetimeArrayMixin

how = libperiod._validate_end_alias(how)

end = how == 'E'
if end:
if freq == 'B':
# roll forward to ensure we land on B date
adjust = Timedelta(1, 'D') - Timedelta(1, 'ns')
return self.to_timestamp(how='start') + adjust
else:
adjust = Timedelta(1, 'ns')
return (self + self.freq).to_timestamp(how='start') - adjust

if freq is None:
base, mult = frequencies.get_freq_code(self.freq)
freq = frequencies.get_to_timestamp_base(base)
else:
freq = Period._maybe_convert_freq(freq)

base, mult = frequencies.get_freq_code(freq)
new_data = self.asfreq(freq, how=how)

new_data = libperiod.periodarr_to_dt64arr(new_data.asi8, base)
return DatetimeArrayMixin(new_data, freq='infer')

# ------------------------------------------------------------------
# Formatting

def _format_native_types(self, na_rep=u'NaT', date_format=None, **kwargs):
""" actually format my specific types """
# TODO(DatetimeArray): remove
@@ -630,9 +628,13 @@ def _format_native_types(self, na_rep=u'NaT', date_format=None, **kwargs):
values = np.array([formatter(dt) for dt in values])
return values

# Delegation...
def strftime(self, date_format):
return self._format_native_types(date_format=date_format)

def repeat(self, repeats, *args, **kwargs):
"""
Repeat elements of a Categorical.
Repeat elements of a PeriodArray.
See also
--------
Expand All @@ -643,10 +645,6 @@ def repeat(self, repeats, *args, **kwargs):
values = self._data.repeat(repeats)
return type(self)(values, self.freq)

# Delegation...
def strftime(self, date_format):
return self._format_native_types(date_format=date_format)

def astype(self, dtype, copy=True):
# TODO: Figure out something better here...
# We have DatetimeLikeArrayMixin ->
