<a href="https://colab.research.google.com/github/wuxbeyond/pandas-study/blob/main/10%E5%88%86%E9%92%9F%E4%BB%8B%E7%BB%8Dpandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import the packages under their conventional aliases:

In [None]:
import numpy as np
import pandas as pd

Creating a Series by passing a list of values; pandas creates a default integer index:

In [67]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
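As a minimal sketch of what "default integer index" means, the block below rebuilds the same Series and inspects its index, then builds a second Series with explicit labels. The names `default_labels` and `labeled`, and the labels "a", "b", "c", are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as above; the np.nan entry makes the dtype float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# With no index argument, the labels run from 0 to len(s) - 1.
default_labels = list(s.index)

# Hypothetical labels, chosen only for illustration.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])
first_value = labeled["a"]
```

Passing `index=` replaces the default labels, so values can be looked up by label instead of by position.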

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [28]:
dates = pd.date_range(start='20210101', periods=6)
#print(dates)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)

                   A         B         C         D
2021-01-01  0.944820  0.854512  1.509935 -0.003369
2021-01-02 -0.461300 -1.247980 -0.925603  0.829463
2021-01-03 -0.242981 -1.624977  0.828192  0.682485
2021-01-04 -0.135280  0.403514 -0.826129 -0.958545
2021-01-05  1.076673  2.032758  0.039274  0.123142
2021-01-06  0.449803 -0.889349  0.187966 -0.215603
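Because `np.random.randn` draws new numbers on every run, the values printed above will differ each time the cell is executed. One possible way to get repeatable values is a seeded NumPy generator. The helper `make_frame` and the seed `0` below are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20210101", periods=6)

def make_frame(seed):
    """Build a 6x4 frame of standard-normal values from a seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        rng.standard_normal((6, 4)), index=dates, columns=list("ABCD")
    )

# Two frames built from the same seed hold identical values.
df_a = make_frame(0)
df_b = make_frame(0)
same = df_a.equals(df_b)
```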


Creating a DataFrame by passing a dict of objects that can be converted into a series-like structure:

In [68]:
df2 = pd.DataFrame(
    {
      "A": 1.0,
      "B": pd.Timestamp("20210511"),
      "C": pd.Series(1,index=list(range(4)), dtype="float32"),
      "D": np.array([3] * 4, dtype="int32"),
      "E": pd.Categorical(["test", "train", "test", "train"]),
      "F": "foo"
    }
)
df2

     A          B    C  D      E    F
0  1.0 2021-05-11  1.0  3   test  foo
1  1.0 2021-05-11  1.0  3  train  foo
2  1.0 2021-05-11  1.0  3   test  foo
3  1.0 2021-05-11  1.0  3  train  foo


The columns of the resulting DataFrame have different dtypes.

In [69]:
df2.dtypes


A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
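Once the dtypes are known, they can be used to pick or convert columns. The following sketch rebuilds `df2` and uses `select_dtypes` and `astype`; the variable names `numeric_columns` and `as_float64` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20210511"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Keep only the numeric columns (float and integer dtypes).
numeric = df2.select_dtypes(include="number")
numeric_columns = list(numeric.columns)

# Convert one column to another dtype; df2 itself is left unchanged.
as_float64 = df2["C"].astype("float64")
```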

Here is how to view the top and bottom rows of the frame:

In [70]:
df.head()

                   A         B         C         D
2021-01-01  0.944820  0.854512  1.509935 -0.003369
2021-01-02 -0.461300 -1.247980 -0.925603  0.829463
2021-01-03 -0.242981 -1.624977  0.828192  0.682485
2021-01-04 -0.135280  0.403514 -0.826129 -0.958545
2021-01-05  1.076673  2.032758  0.039274  0.123142


In [71]:
df.tail(2)

                   A         B         C         D
2021-01-05  1.076673  2.032758  0.039274  0.123142
2021-01-06  0.449803 -0.889349  0.187966 -0.215603
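`head()` returns five rows when called without an argument, which is why the first output above stops at 2021-01-05. The sketch below uses a small stand-in frame (an assumption, so that the values are deterministic) to show how the row count argument behaves.

```python
import pandas as pd

# Stand-in frame with one column so the result is easy to check.
df = pd.DataFrame({"A": range(6)}, index=pd.date_range("20210101", periods=6))

top5 = df.head()           # default: first 5 rows
top3 = df.head(3)          # first 3 rows
last2 = df.tail(2)         # last 2 rows
all_but_last = df.head(-1) # negative n: everything except the last row
```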


Display the index and the columns:

In [72]:
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [73]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

DataFrame.to_numpy() does not include the index or column labels in the output.

In [74]:
df.to_numpy()

array([[ 0.94481961,  0.85451223,  1.50993475, -0.00336867],
       [-0.46130039, -1.24797959, -0.92560262,  0.8294628 ],
       [-0.24298059, -1.62497708,  0.82819179,  0.68248476],
       [-0.13528015,  0.40351375, -0.82612941, -0.95854539],
       [ 1.0766728 ,  2.03275787,  0.03927361,  0.12314155],
       [ 0.44980267, -0.88934944,  0.18796608, -0.21560321]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [75]:
df2.to_numpy()

array([[1.0, Timestamp('2021-05-11 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2021-05-11 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2021-05-11 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2021-05-11 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
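The cost difference comes from the result dtype: a NumPy array has a single dtype, so a frame with mixed column types has to be converted to a common one, here `object`. A minimal sketch with two stand-in frames (`floats` and `mixed` are assumed names, not from the original notebook):

```python
import numpy as np
import pandas as pd

# All columns are float64, so the array keeps that dtype.
floats = pd.DataFrame(np.zeros((6, 4)), columns=list("ABCD"))
float_array = floats.to_numpy()

# Float, integer and string columns together fall back to object dtype.
mixed = pd.DataFrame(
    {"A": 1.0, "D": np.array([3] * 4, dtype="int32"), "F": "foo"}
)
mixed_array = mixed.to_numpy()
```

In both cases the array holds only the values; the index and column labels are dropped.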

describe() shows a quick statistic summary of your data:

In [76]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.271956 -0.078587  0.135606  0.076262
std    0.648168  1.412587  0.941443  0.647977
min   -0.461300 -1.624977 -0.925603 -0.958545
25%   -0.216055 -1.158322 -0.609779 -0.162545
50%    0.157261 -0.242918  0.113620  0.059886
75%    0.821065  0.741763  0.668135  0.542649
max    1.076673  2.032758  1.509935  0.829463
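The quartile rows in the summary can be swapped for other percentiles through the `percentiles` argument. The sketch below uses a small deterministic frame (an assumption for illustration) so that the summary values are easy to verify by hand.

```python
import pandas as pd

# Six known values: mean and median are both 3.5.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Request the 10th, 50th and 90th percentiles instead of the quartiles.
summary = frame.describe(percentiles=[0.1, 0.5, 0.9])
rows = list(summary.index)
mean_a = summary.loc["mean", "A"]
```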


Transposing your data:

Transposing flips the data across the diagonal that runs from the top-left corner to the bottom-right corner.

In [77]:
df.T

   2021-01-01  2021-01-02  2021-01-03  2021-01-04  2021-01-05  2021-01-06
A    0.944820   -0.461300   -0.242981   -0.135280    1.076673    0.449803
B    0.854512   -1.247980   -1.624977    0.403514    2.032758   -0.889349
C    1.509935   -0.925603    0.828192   -0.826129    0.039274    0.187966
D   -0.003369    0.829463    0.682485   -0.958545    0.123142   -0.215603
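The flip across the diagonal can be checked directly: rows and columns trade places, so the value at (row, column) in the original sits at (column, row) in the transpose. A short sketch, using a stand-in frame filled with the numbers 0 to 23 (an assumption, chosen so the result is deterministic):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20210101", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype="float64").reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

transposed = df.T

# The column labels become the row labels, and the shape is swapped.
new_shape = transposed.shape

# Transposing twice gives back the original frame.
round_trip = df.T.T.equals(df)
```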


Sorting by an axis:

Sort by the labels along an axis: 0 refers to the index (row) axis and 1 refers to the column axis.

In [78]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2021-01-01 -0.003369  1.509935  0.854512  0.944820
2021-01-02  0.829463 -0.925603 -1.247980 -0.461300
2021-01-03  0.682485  0.828192 -1.624977 -0.242981
2021-01-04 -0.958545 -0.826129  0.403514 -0.135280
2021-01-05  0.123142  0.039274  2.032758  1.076673
2021-01-06 -0.215603  0.187966 -0.889349  0.449803
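To contrast the two axes, the sketch below sorts the same frame once by its row labels and once by its column labels. The stand-in data and the names `rows_desc` and `cols_desc` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20210101", periods=6)
df = pd.DataFrame(
    np.arange(24, dtype="float64").reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

# axis=0 sorts by the row labels: the latest date comes first.
rows_desc = df.sort_index(axis=0, ascending=False)

# axis=1 sorts by the column labels: D, C, B, A.
cols_desc = df.sort_index(axis=1, ascending=False)
```

Both calls return a new frame; `df` keeps its original order.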


Sorting by values:

Sort the rows by values, for example by the values in one column. The docstring of sort_values explains the details clearly.

In [79]:
df.sort_values(by="B")

                   A         B         C         D
2021-01-03 -0.242981 -1.624977  0.828192  0.682485
2021-01-02 -0.461300 -1.247980 -0.925603  0.829463
2021-01-06  0.449803 -0.889349  0.187966 -0.215603
2021-01-04 -0.135280  0.403514 -0.826129 -0.958545
2021-01-01  0.944820  0.854512  1.509935 -0.003369
2021-01-05  1.076673  2.032758  0.039274  0.123142
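`sort_values` also accepts a sort direction and a list of columns. The following sketch uses a small stand-in frame (an assumption, so the order can be checked by hand) to show a descending sort and a sort on two columns.

```python
import pandas as pd

frame = pd.DataFrame({"A": [2, 1, 2, 1], "B": [0.5, -1.0, -0.2, 3.0]})

# Largest B first.
by_b_desc = frame.sort_values(by="B", ascending=False)

# Sort by A first; rows with equal A are then ordered by B.
by_a_then_b = frame.sort_values(by=["A", "B"])

desc_order = list(by_b_desc.index)
two_key_order = list(by_a_then_b.index)
```

The original row labels travel with the rows, so the index of each result records where every row came from.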
