In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
index = pd.date_range("1/1/2000", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
#s
df

                   A         B         C
2000-01-01 -1.881430 -0.392217 -0.522163
2000-01-02  0.558646  0.324320 -0.042122
2000-01-03  0.993604 -1.048631  0.267917
2000-01-04 -1.510632  0.529988  0.003094
2000-01-05  1.350998  2.025135  1.816401
2000-01-06  0.031871 -1.187144 -0.106590
2000-01-07 -0.327940  0.053134 -0.622494
2000-01-08 -0.173572 -1.706154 -0.635202


## Head and tail
To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

In [4]:
long_series = pd.Series(np.random.randn(1000))
long_series.head()

0    0.176950
1    0.125018
2    1.090972
3   -1.323822
4   -1.205144
dtype: float64

In [5]:
long_series.tail(3)

997    1.347451
998    0.552855
999   -0.064144
dtype: float64
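
The cells above only exercise a Series. As a minimal sketch, the same two methods can be applied to a DataFrame; the `frame` variable and its random contents are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; the values are random.
frame = pd.DataFrame(np.random.randn(10, 3), columns=["A", "B", "C"])

first_rows = frame.head()   # default: the first five rows
last_rows = frame.tail(3)   # custom count: the last three rows

print(first_rows.shape)
print(last_rows.index.tolist())
```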

## Attributes and underlying data
pandas objects have a number of attributes that give you access to their metadata:

• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
  – Series: index (only axis)
  – DataFrame: index (rows) and columns

Note, these attributes can be safely assigned to!

In [6]:
df[:2]

                   A         B         C
2000-01-01 -1.881430 -0.392217 -0.522163
2000-01-02  0.558646  0.324320 -0.042122


In [8]:
df.columns = [x.lower() for x in df.columns]
df

                   a         b         c
2000-01-01 -1.881430 -0.392217 -0.522163
2000-01-02  0.558646  0.324320 -0.042122
2000-01-03  0.993604 -1.048631  0.267917
2000-01-04 -1.510632  0.529988  0.003094
2000-01-05  1.350998  2.025135  1.816401
2000-01-06  0.031871 -1.187144 -0.106590
2000-01-07 -0.327940  0.053134 -0.622494
2000-01-08 -0.173572 -1.706154 -0.635202
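
The `shape` attribute and the Series `index` are listed in this section but not exercised in the cells. The following is a small sketch that rebuilds `s` and `df` the same way as the setup cell, so it runs on its own; the upper-casing of the labels is only an illustration of assigning to an axis.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1/1/2000", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

print(s.shape)   # a Series has a single axis
print(df.shape)  # a DataFrame reports (rows, columns)

# Axis labels can be assigned to; the new labels must match the axis length.
s.index = [label.upper() for label in s.index]
print(s.index.tolist())
```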


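To get the actual data inside an Index or Series, use the .array property. If you know you need a NumPy array, use to_numpy() or numpy.asarray(). When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See dtypes for more.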

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly useful representations:

1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz.
2. A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC and the timezone discarded.

Timezones may be preserved with dtype=object:

In [9]:
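s.array          # ExtensionArray wrapping the Series data
s.index.array    # ExtensionArray wrapping the Index data
s.to_numpy()     # NumPy ndarray
np.asarray(s)    # NumPy ndarray; only this last expression is displayed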

array([ 0.27794708, -0.68244087,  1.70834729, -0.45712731,  1.08285889])

In [10]:
ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

In [11]:
ser.to_numpy(dtype=object)

array([Timestamp('2000-01-01 00:00:00+0100', tz='CET'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET')], dtype=object)
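
The cell above shows the first representation. As a sketch of the second one, passing `dtype="datetime64[ns]"` converts the values to UTC and drops the timezone. `ser` is rebuilt here so the block runs on its own, and `utc_values` is a name chosen for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

# Values are converted to UTC and the timezone information is discarded.
utc_values = ser.to_numpy(dtype="datetime64[ns]")

print(utc_values.dtype)
print(utc_values[0])  # midnight CET is 23:00 UTC on the previous day
```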

In [12]:
# Getting the "raw data" inside a DataFrame is possibly a bit more complex. When your DataFrame
# only has a single data type for all the columns, DataFrame.to_numpy() will return the underlying data:
df.to_numpy()

array([[-1.88142965, -0.39221728, -0.52216327],
       [ 0.55864584,  0.32432015, -0.04212173],
       [ 0.99360435, -1.04863062,  0.2679172 ],
       [-1.51063182,  0.52998817,  0.00309374],
       [ 1.35099789,  2.0251351 ,  1.8164006 ],
       [ 0.03187086, -1.18714367, -0.10659033],
       [-0.32794046,  0.05313398, -0.62249353],
       [-0.17357174, -1.70615434, -0.63520156]])

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype.
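
A short sketch of the dtype selection described in the note. The two frames, `numeric` and `mixed`, are hypothetical examples and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Integers and floats only: the result is upcast to a float dtype.
numeric = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5]})
print(numeric.to_numpy().dtype)

# A string column forces the result to object dtype.
mixed = pd.DataFrame({"i": [1, 2], "label": ["x", "y"]})
print(mixed.to_numpy().dtype)
```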

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:
1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array or the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
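
The first drawback can be illustrated with two extension types. This is a sketch only: `cat_ser` and `tz_ser` are names chosen for the example, and the choice of categorical and timezone-aware data is an assumption about which cases show the difference most clearly.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

cat_ser = pd.Series(pd.Categorical(["a", "b", "a"]))
tz_ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

# .values: the return type depends on the extension type involved.
print(type(cat_ser.values).__name__)  # an extension array
print(type(tz_ser.values).__name__)   # a NumPy array

# .array and .to_numpy() behave the same way for both Series.
array_is_extension = [isinstance(x.array, ExtensionArray) for x in (cat_ser, tz_ser)]
to_numpy_is_ndarray = [isinstance(x.to_numpy(), np.ndarray) for x in (cat_ser, tz_ser)]
print(array_is_extension, to_numpy_is_ndarray)
```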