<a href="https://colab.research.google.com/github/schuhandrea/Pandas/blob/main/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Book: Python for Data Analysis, stored in D:\Programme\andre\PythonForDataAnalysis and on Google Drive in PythonForData

# **Introduction to pandas Data Structures**

In [None]:
import numpy as np
import pandas as pd

In [None]:
from pandas import Series, DataFrame

In [None]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 80
np.set_printoptions(precision=4, suppress=True)

## **Series**

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The index is shown on the left and the values on the right.

The array representation and the index object of the Series are available through its `array` and `index` attributes, respectively:

In [None]:
obj.array

<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [None]:
obj.values

array([ 4,  7, -5,  3])

In [None]:
obj.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

Often it is desirable to create a Series with an index that identifies each data point with a label:


In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [None]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In contrast to NumPy arrays, you can use labels in the index when selecting single values or a set of values:


In [None]:
obj2["a"]

-5

In [None]:
obj2["d"] = 6

In [None]:
obj2[["c", "a", "d"]]

c    3
a   -5
d    6
dtype: int64

Here `['c', 'a', 'd']` is interpreted as a **list of index labels**, even though it contains strings instead of integers.

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, preserves the link between index and values:

In [None]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [None]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [None]:
import numpy as np
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, **ordered dict**, since it is a mapping of index values to data values.

In [None]:
"b" in obj2

True

In [None]:
"e" in obj2

False

If you have **data in a Python dict**, you can create a Series from it by passing the dict:


In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [None]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [None]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

The three values found in `sdata` were placed at the appropriate locations, but since no value for "California" was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values.
Since "Utah" was not included in `states`, it is excluded from the resulting object.


The **`isna`** and **`notna`** functions in pandas should be used to detect missing data:


In [None]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [None]:
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
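
Because `isna` and `notna` return a Boolean Series, they combine naturally with reductions and filtering. A minimal sketch, using a small hypothetical Series `s` rather than `obj4`, of counting missing entries and keeping only the present ones:

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not taken from the cells above)
s = pd.Series([np.nan, 35000.0, 16000.0],
              index=["California", "Ohio", "Oregon"])

n_missing = int(s.isna().sum())   # True counts as 1 when summed
present = s[s.notna()]            # Boolean mask keeps only non-missing entries

print(n_missing)
print(present)
```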

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [None]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [None]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [None]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
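
If a label that exists on only one side should not turn into NaN, the method form `add` accepts a `fill_value`. The sketch below rebuilds the two Series from above. Note that a label missing on both sides (here "California", which is absent from `obj3` and NaN in `obj4`) still yields NaN.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Treat a value that is missing on exactly one side as 0 before adding
total = obj3.add(obj4, fill_value=0)
print(total)
```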

Both the Series object itself and its index have a **name attribute**, which integrates with other areas of pandas functionality:

In [None]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
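
One place where these names show up is the conversion of a Series to a DataFrame. A small sketch, assuming the same values as `obj4`, in which `reset_index` turns both names into column labels:

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=["California", "Ohio", "Oregon", "Texas"])
obj4.name = "population"
obj4.index.name = "state"

# reset_index moves the index into a regular column;
# the index name and the Series name become the column labels
df = obj4.reset_index()
print(df.columns.tolist())
```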

In [None]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [None]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## **DataFrame**

In [None]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [None]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [None]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [None]:
frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [None]:
pd.DataFrame(data, columns=["year", "state", "pop"])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [None]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
frame2["state"]
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [None]:
frame2.loc[1]
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

In [None]:
frame2["debt"] = 16.5
frame2
frame2["debt"] = np.arange(6.)
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0


In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN
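
The `debt` column above is entirely NaN because assigning a Series aligns on index labels, and none of the labels in `val` ("two", "four", "five") occur in the integer index of `frame2`. A sketch with a hypothetical Series `val_int`, whose labels do match, shows the alignment at work:

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002, 2001, 2002, 2003],
                       "state": ["Ohio"] * 3 + ["Nevada"] * 3})

# Hypothetical variant of `val` whose labels match frame2's integer index
val_int = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])

# Rows 2, 4 and 5 receive values; all other rows become NaN
frame2["debt"] = val_int
print(frame2)
```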


In [None]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7   NaN     True
2  2002    Ohio  3.6   NaN     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9   NaN    False
5  2003  Nevada  3.2   NaN    False


In [None]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

In [None]:
frame3 = pd.DataFrame(populations)
frame3

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4
2002   3.6     2.9


In [None]:
frame3.T

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9


In [None]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN


In [None]:
pdata = {"Ohio": frame3["Ohio"][:-1],
         "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


In [None]:
frame3.index.name = "year"
frame3.columns.name = "state"
frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


In [None]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [None]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

## **Index Objects**

In [None]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [None]:
index[1:]

Index(['b', 'c'], dtype='object')

In [None]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [None]:
obj2.index is labels

True
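
Index objects are immutable, which is what makes it safe to share a single Index between several data structures, as the `obj2.index is labels` check shows. A brief sketch of what happens on an attempted in-place change:

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

raised = False
try:
    index[1] = "d"        # item assignment is not supported on an Index
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```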

In [None]:
frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


In [None]:
frame3.columns
"Ohio" in frame3.columns
2003 in frame3.index

False

In [None]:
pd.Index(["foo", "foo", "bar", "bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [None]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


In [None]:
frame.reindex(states, axis="columns")

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

   California  Texas
a           2      1
d           8      7
c           5      4


In [None]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj
new_obj = obj.drop("c")
new_obj
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data.drop(index=["Colorado", "Ohio"])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data.drop(columns=["two"])

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [None]:
data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


In [None]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj
obj["b"]
obj[1]
obj[2:4]
obj[["b", "a", "d"]]
obj[[1, 3]]
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [None]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [None]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj1
obj2
obj1[[0, 1, 2]]
obj2[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

In [None]:
obj1.iloc[[0, 1, 2]]
obj2.iloc[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

In [None]:
obj2.loc["b":"c"]

b    2
c    3
dtype: int64

In [None]:
obj2.loc["b":"c"] = 5
obj2

a    1
b    5
c    5
dtype: int64
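
Note that slicing with labels in `loc` includes the endpoint, unlike ordinary Python slicing and `iloc`. A small sketch contrasting the two on a fresh Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = s.loc["b":"c"]   # the endpoint label "c" is included
by_pos = s.iloc[1:2]        # the endpoint position 2 is excluded

print(by_label.tolist())
print(by_pos.tolist())
```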

There are many ways to construct a DataFrame. One of the most common is from a dict of equal-length lists or NumPy arrays; the example here instead uses a two-dimensional NumPy array with explicit row and column labels:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Indexing with a single column name returns that column as a Series. If you pass a sequence of column names, the columns of the result are arranged in that order:

In [None]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [None]:
data[["three", "one"]]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


Indexing like this has a few special cases. The first is slicing rows or selecting data with a Boolean array:

In [None]:
print(data[:2])
data[data["three"] > 5]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Another use case is indexing with a Boolean DataFrame, such as one produced by a scalar comparison:


In [None]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [None]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [None]:
data
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

### **Selection with loc and iloc**

Select a single row and multiple columns by label:

In [None]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

In [None]:
data.loc[["Colorado", "New York"]]

          one  two  three  four
Colorado    0    5      6     7
New York   12   13     14    15


In [None]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [None]:
data.iloc[[2, 1]]
data.iloc[2, [3, 0, 1]]

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

          four  one  two
Colorado     7    0    5
Utah        11    8    9


In [None]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64


In [None]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


In [None]:
data.loc[data.three >= 2]

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


## **Integer Indexes**

In [None]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [None]:
ser[0]

0.0

In [None]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
print(ser2)
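ser2[-1]  # positional fallback on a non-integer index; deprecated in recent pandas versions, prefer ser2.iloc[-1]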

a    0.0
b    1.0
c    2.0
dtype: float64


2.0

In [None]:
ser.iloc[-1]

2.0
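
With an integer index, plain `[]` indexing is label-based, so a negative number is looked up as a label rather than counted from the end. A sketch of the resulting error and of the unambiguous alternative:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer index 0, 1, 2

try:
    ser[-1]                      # -1 is not a label in the index
    failed = False
except KeyError:
    failed = True

last = ser.iloc[-1]              # position-based, always unambiguous
print(failed, last)
```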

In [None]:
ser[:2]

In [None]:
data.loc[:, "one"] = 1
data
data.iloc[2] = 5
data
data.loc[data["four"] > 5] = 3
data

In [None]:
# Chained indexing: this assigns into a temporary object, so `data` is not modified
data.loc[data.three == 5]["three"] = 6

In [None]:
data

In [None]:
data.loc[data.three == 5, "three"] = 6
data
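
The chained form `data.loc[mask]["three"] = 6` first creates a temporary object and then assigns into it, so `data` itself is typically left unchanged and pandas emits a warning. Doing the row and column selection in a single `loc` call avoids this. A sketch on a small hypothetical frame `df`:

```python
import pandas as pd

# Hypothetical small frame for illustration
df = pd.DataFrame({"three": [5, 6, 5], "four": [1, 2, 3]})

# One loc call selects rows and column together, then assigns in place
df.loc[df["three"] == 5, "three"] = 6
print(df["three"].tolist())
```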

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])
s1
s2

In [None]:
s1 + s2

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
print(df1)
df2

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [None]:
df1 + df2

In [None]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
df1
df2
df1 + df2

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))
df2.loc[1, "b"] = np.nan
df1
df2

In [None]:
df1 + df2

In [None]:
df1.add(df2, fill_value=0)

In [None]:
1 / df1
df1.rdiv(1)

In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
frame
series

In [None]:
frame - series

In [None]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
series2
frame + series2

In [None]:
series3 = frame["d"]
frame
series3
frame.sub(series3, axis="index")
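
By default, arithmetic between a DataFrame and a Series matches the Series index against the columns and broadcasts down the rows. Passing `axis="index"` to the method form matches against the row labels instead. A compact sketch with hypothetical data that is easy to check by hand:

```python
import pandas as pd

frame = pd.DataFrame([[1.0, 2.0], [3.0, 5.0]],
                     index=["r1", "r2"], columns=["b", "d"])

row = frame.iloc[0]   # indexed by column labels
col = frame["d"]      # indexed by row labels

by_row = frame - row                    # subtract the first row from every row
by_col = frame.sub(col, axis="index")   # subtract column "d" from every column
print(by_row)
print(by_col)
```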

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame
np.abs(frame)

In [None]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

In [None]:
frame.apply(f1, axis="columns")
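
With the default axis, `apply` calls the function once per column; with `axis="columns"` it calls it once per row. A sketch on a small hypothetical frame (the name `value_range` is made up for this example) where the results can be verified by hand:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0], "d": [3.0, 10.0]},
                     index=["Utah", "Ohio"])

def value_range(x):
    # x is one column (default axis) or one row (axis="columns")
    return x.max() - x.min()

per_column = frame.apply(value_range)
per_row = frame.apply(value_range, axis="columns")
print(per_column)
print(per_row)
```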

In [None]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])
frame.apply(f2)

In [None]:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)  # element-wise; in pandas 2.1 and later, DataFrame.map is the preferred name

In [None]:
frame["e"].map(my_format)

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])
frame
frame.sort_index()
frame.sort_index(axis="columns")

In [None]:
frame.sort_index(axis="columns", ascending=False)

In [None]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [None]:
obj.sort_values(na_position="first")

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame
frame.sort_values("b")

In [None]:
frame.sort_values(["a", "b"])

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

In [None]:
obj.rank(method="first")

In [None]:
obj.rank(ascending=False)
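
`rank` breaks ties by averaging by default; the `method` argument selects other tie-breaking rules. A sketch comparing `"min"` and `"dense"` on the same data:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# "min": tied values share the lowest rank of their group
rank_min = obj.rank(method="min")
# "dense": like "min", but ranks increase by exactly 1 between groups
rank_dense = obj.rank(method="dense")

print(rank_min.tolist())
print(rank_dense.tolist())
```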

In [None]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})
frame
frame.rank(axis="columns")

In [None]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

In [None]:
obj.index.is_unique

In [None]:
obj["a"]
obj["c"]

In [None]:
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                  index=["a", "a", "b", "b", "c"])
df
df.loc["b"]
df.loc["c"]

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

In [None]:
df.sum()

In [None]:
df.sum(axis="columns")

In [None]:
df.sum(axis="index", skipna=False)
df.sum(axis="columns", skipna=False)

In [None]:
df.mean(axis="columns")

In [None]:
df.idxmax()

In [None]:
df.cumsum()

In [None]:
df.describe()

In [None]:
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()

In [None]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [None]:
returns = price.pct_change()
returns.tail()

In [None]:
returns["MSFT"].corr(returns["IBM"])
returns["MSFT"].cov(returns["IBM"])

In [None]:
returns.corr()
returns.cov()

In [None]:
returns.corrwith(returns["IBM"])

In [None]:
returns.corrwith(volume)

In [None]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

In [None]:
uniques = obj.unique()
uniques

In [None]:
obj.value_counts()

In [None]:
pd.value_counts(obj.to_numpy(), sort=False)

In [None]:
obj
mask = obj.isin(["b", "c"])
mask
obj[mask]

In [None]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)
indices
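
`get_indexer` returns, for each value, its integer position in the Index, and -1 for values that are not present. A short sketch with one hypothetical unmatched value:

```python
import pandas as pd

unique_vals = pd.Index(["c", "b", "a"])

# "z" does not occur in the Index, so its position is reported as -1
positions = unique_vals.get_indexer(["a", "c", "z"])
print(positions)
```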

In [None]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
data

In [None]:
data["Qu1"].value_counts().sort_index()

In [None]:
result = data.apply(pd.value_counts).fillna(0)
result

In [None]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data
data.value_counts()

In [None]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS