**DataFrame**

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

data

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [3]:
frame=pd.DataFrame(data)

In [4]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [5]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [6]:
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [7]:
pd.DataFrame(data,columns=["year","state","pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [8]:
frame2=pd.DataFrame(data,columns=["year","state","pop","debt"])

In [9]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [10]:
frame2["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [11]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [12]:
frame2.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [13]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

In [14]:
frame2["debt"]=16.5

In [15]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,16.5
1,2001,Ohio,1.7,16.5
2,2002,Ohio,3.6,16.5
3,2001,Nevada,2.4,16.5
4,2002,Nevada,2.9,16.5
5,2003,Nevada,3.2,16.5


In [16]:
frame2["debt"]=np.arange(6.)

In [17]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.0
1,2001,Ohio,1.7,1.0
2,2002,Ohio,3.6,2.0
3,2001,Nevada,2.4,3.0
4,2002,Nevada,2.9,4.0
5,2003,Nevada,3.2,5.0


In [18]:
val=pd.Series([-1.2,-1.5,-1.7],index=["two","four","five"])

In [19]:
frame2["debt"]=val

In [20]:
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,
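
The all-NaN `debt` column follows from label alignment: none of the labels in `val` ("two", "four", "five") occur in the integer index of `frame2`. A minimal sketch of the matching case, assuming a small hypothetical frame `demo` and a Series `aligned_val` whose labels do exist in the index (both names are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the default integer index 0..5
demo = pd.DataFrame({"year": [2000, 2001, 2002, 2001, 2002, 2003]})

# Labels 2, 4 and 5 exist in demo.index, so those rows receive values
aligned_val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
demo["debt"] = aligned_val

# Rows without a matching label are filled with NaN
print(demo)
```

Rows 0, 1 and 3 stay missing because `aligned_val` has no entry for them.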


In [21]:
frame2["eastern"]=frame2["state"]=="Ohio"

In [22]:
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,,False
5,2003,Nevada,3.2,,False


In [23]:
del frame2["eastern"]

In [24]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [25]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

In [26]:
populations

{'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}, 'Nevada': {2001: 2.4, 2002: 2.9}}

In [27]:
frame3=pd.DataFrame(populations)

In [28]:
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [29]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


In [30]:
pd.DataFrame(populations,index=[2001,2002,2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


In [31]:
pdata={
    "Ohio":frame3["Ohio"][:-1],
    "Nevada":frame3["Nevada"][:2]
}

In [32]:
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


# DataFrame Input Types with Examples

| Type                          | Notes                                                                 | Example (Python) |
|-------------------------------|----------------------------------------------------------------------|------------------|
| 2D ndarray                    | A matrix of data, passing optional row and column labels             | `pd.DataFrame(np.array([[1,2],[3,4]]), columns=['A','B'])` |
| Dictionary of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length | `pd.DataFrame({'A':[1,2,3], 'B':(4,5,6)})` |
| NumPy structured/record array | Treated as the “dictionary of arrays” case                           | `arr = np.array([(1,10.0),(2,20.0)], dtype=[('x',int),('y',float)]); pd.DataFrame(arr)` |
| Dictionary of Series          | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed | `pd.DataFrame({'A': pd.Series([1,2], index=['x','y']), 'B': pd.Series([3,4], index=['y','z'])})` |
| Dictionary of dictionaries    | Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case | `pd.DataFrame({'A': {'x':1,'y':2}, 'B': {'y':3,'z':4}})` |
| List of dictionaries or Series| Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels | `pd.DataFrame([{'A':1,'B':2}, {'A':3,'C':4}])` |
| List of lists or tuples       | Treated as the “2D ndarray” case                                     | `pd.DataFrame([(1,2),(3,4)], columns=['A','B'])` |
| Another DataFrame             | The DataFrame’s indexes are used unless different ones are passed    | `df = pd.DataFrame({'A':[1,2]}); pd.DataFrame(df)` |
| NumPy MaskedArray             | Like the “2D ndarray” case except masked values are missing in the DataFrame result | `import numpy.ma as ma; arr = ma.masked_array([[1,2],[3,4]], mask=[[0,0],[0,1]]); pd.DataFrame(arr)` |


In [33]:
frame3.index.name="year"

In [34]:
frame3.columns.name="state"

In [35]:
frame3

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [36]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [37]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

# Index Objects


In [38]:
obj=pd.Series(np.arange(3),index=["a","b","c"])
index=obj.index

In [39]:
index

Index(['a', 'b', 'c'], dtype='object')

In [40]:
index[1:]

Index(['b', 'c'], dtype='object')

In [41]:
index[1]="d" # This will raise an error

TypeError: Index does not support mutable operations

In [None]:
labels=pd.Index(np.arange(3))

In [None]:
labels

Index([0, 1, 2], dtype='int64')

In [None]:
obj2=pd.Series([1.5,-2.5,0],index=labels)

In [None]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [None]:
obj2.index is labels

True

In [None]:
frame3

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [None]:
frame3.columns

Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [None]:
"Ohio" in frame3.columns

True

In [None]:
2003 in frame3.index

False

In [None]:
pd.Index(["foo","foo","bar","bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

## Table 5-2. Some Index methods and properties

| Method/Property | Description | Example (Python) | Output |
|-----------------|-------------|------------------|--------|
| `append()`      | Concatenate with additional Index objects, producing a new Index | `idx1 = pd.Index([1,2,3]); idx2 = pd.Index([4,5]); idx1.append(idx2)` | `Index([1, 2, 3, 4, 5], dtype='int64')` |
| `difference()`  | Compute set difference as an Index | `pd.Index([1,2,3]).difference([2,3])` | `Index([1], dtype='int64')` |
| `intersection()`| Compute set intersection | `pd.Index([1,2,3]).intersection([2,3,4])` | `Index([2, 3], dtype='int64')` |
| `union()`       | Compute set union | `pd.Index([1,2,3]).union([2,3,4])` | `Index([1, 2, 3, 4], dtype='int64')` |
| `isin()`        | Boolean array indicating whether each value is contained in the passed collection | `pd.Index([1,2,3]).isin([2,4])` | `array([False,  True, False])` |
| `delete()`      | Compute new Index with the element at position *i* deleted | `pd.Index([1,2,3]).delete(1)` | `Index([1, 3], dtype='int64')` |
| `drop()`        | Compute new Index by deleting passed values | `pd.Index([1,2,3]).drop(2)` | `Index([1, 3], dtype='int64')` |
| `insert()`      | Compute new Index by inserting element at position *i* | `pd.Index([1,2,3]).insert(1, 99)` | `Index([1, 99, 2, 3], dtype='int64')` |
| `is_monotonic_increasing` | Returns True if each element is greater than or equal to the previous element (the older `is_monotonic` alias was removed in pandas 2.0) | `pd.Index([1,2,3]).is_monotonic_increasing` | `True` |
| `is_unique`     | Returns True if the Index has no duplicate values | `pd.Index([1,2,2]).is_unique` | `False` |
| `unique()`      | Compute the Index of unique values | `pd.Index([1,2,2,3]).unique()` | `Index([1, 2, 3], dtype='int64')` |
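
Because an Index is immutable, each of these methods returns a new object and leaves the original untouched. A short sketch of the set-like methods, assuming two small illustrative indexes (`left` and `right` are hypothetical names):

```python
import pandas as pd

left = pd.Index([1, 2, 3])
right = pd.Index([2, 3, 4])

common = left.intersection(right)    # labels present in both
combined = left.union(right)         # labels present in either
only_left = left.difference(right)   # labels present only in left
extended = left.insert(1, 99)        # new Index; left itself is unchanged

print(list(common), list(combined), list(only_left), list(extended))
print(left.is_monotonic_increasing, extended.is_monotonic_increasing)
```

Inserting 99 in the middle breaks the sorted order, which is why the second monotonicity check differs from the first.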

# Reindexing

In [None]:
obj=pd.Series([4.5,7.2,-5.3,3.6],index=["d","b","a","c"])

In [None]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:


In [None]:
obj2=obj.reindex(["a","b","c","d","e"])

In [None]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, you may want to do some interpolation or
filling of values when reindexing. The method option allows us to do this,
using a method such as ffill, which forward-fills the values:

In [None]:
# obj2=obj2.ffill() # Series.fillna(method="ffill") is deprecated since pandas 2.1
# obj2

In [None]:
obj3=pd.Series(["blue","purple","yellow"],index=[0,2,4])

In [None]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [None]:
obj3.reindex(np.arange(6),method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [None]:
frame=pd.DataFrame(np.arange(9).reshape((3,3)),
                   index=["a","c","d"],
                   columns=["Ohio","Texas","California"])

In [None]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [None]:
frame2=frame.reindex(index=["a","b","c","d"])

In [None]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [None]:
states=["Texas","Utah","California"]

In [None]:
frame.reindex(columns=states) # Reindex the columns with the columns keyword

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [None]:
frame.reindex(states,axis="columns") # Alternative way to reindex columns

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


## Reindex Arguments in Pandas

| Argument    | Description                                                                 | Example (Python) |
|-------------|-----------------------------------------------------------------------------|------------------|
| `labels`    | New sequence to use as an index. Can be an `Index` instance or any sequence-like Python data structure. An `Index` will be used exactly as is without copying. | `df.reindex(labels=[0,2,5])` |
| `index`     | Use the passed sequence as the new index labels.                            | `df.reindex(index=[10,20,30])` |
| `columns`   | Use the passed sequence as the new column labels.                           | `df.reindex(columns=['A','C'])` |
| `axis`      | The axis to reindex, either `"index"` (rows) or `"columns"`. Default is `"index"`. | `df.reindex([1,2], axis='index')` |
| `method`    | Interpolation (fill) method; `"ffill"` fills forward, `"bfill"` fills backward. | `df.reindex([0,1,2,3], method='ffill')` |
| `fill_value`| Substitute value when introducing missing data by reindexing. Default is `NaN`. | `df.reindex([0,1,2], fill_value=0)` |
| `limit`     | When forward/backward filling, the maximum size gap (number of elements) to fill. | `df.reindex([0,1,2,3], method='ffill', limit=1)` |
| `tolerance` | When forward/backward filling, the maximum numeric distance to fill for inexact matches. | `df.reindex([0.1, 1.9], method='nearest', tolerance=0.2)` |
| `level`     | Match simple Index on level of MultiIndex; otherwise select subset.         | `df.reindex([('x','a'),('y','b')], level=0)` |
| `copy`      | If `True`, always copy underlying data even if new index is equivalent. If `False`, do not copy when equivalent. | `df.reindex([0,1], copy=False)` |
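
The `method`, `limit`, and `fill_value` arguments interact in ways that are easier to see on a concrete case. The following is a minimal sketch, assuming a hypothetical Series `colors` with gaps of two labels between observations:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 3, 6])

# Forward-fill, but carry each value at most one step forward
limited = colors.reindex(np.arange(8), method="ffill", limit=1)

# Use a placeholder instead of NaN for labels that were not present
filled = colors.reindex(np.arange(8), fill_value="missing")

print(limited)
print(filled)
```

With `limit=1`, labels 2 and 5 remain missing because they are two steps away from the last observed value.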


In [None]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [None]:
frame.loc[["a","d","c"],["California","Texas"]]

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4


# Dropping Entries from an Axis

In [None]:
obj=pd.Series(np.arange(5.),index=["a","b","c","d","e"])

In [None]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [None]:
new_obj=obj.drop("c")

In [None]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [None]:
obj.drop(["d","c"])

a    0.0
b    1.0
e    4.0
dtype: float64

In [None]:
data=pd.DataFrame(np.arange(16).reshape((4,4)),
                  index=["Ohio","Colorado","Utah","New York"],
                  columns=["one","two","three","four"])

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data.drop(index=["Colorado","Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data.drop("two",axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [None]:
data.drop(["two","four"],axis="columns")

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [None]:
obj=pd.Series(np.arange(4.),index=["a","b","c","d"])

In [None]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [None]:
obj["b"]

np.float64(1.0)

In [None]:
obj[1]

np.float64(1.0)

In [None]:
obj[["b","a","d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [None]:
obj[[1,3]]

b    1.0
d    3.0
dtype: float64

In [None]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

In [None]:
obj.loc[["b","a","d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [None]:
obj1=pd.Series([1,2,3],index=[2,0,1])

In [None]:
obj2=pd.Series([1,2,3],index=["a","b","c"])

In [None]:
obj1

2    1
0    2
1    3
dtype: int64

In [None]:
obj2

a    1
b    2
c    3
dtype: int64

In [None]:
obj1[[0,1,2]]

0    2
1    3
2    1
dtype: int64

In [None]:
obj2[[0,1,2]]

a    1
b    2
c    3
dtype: int64

When using loc, the expression obj.loc[[0, 1, 2]] will fail when the
index does not contain integers:


In [None]:
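obj2.loc[[0,1]] # Raises KeyError: obj2 is indexed by strings, so the labels 0 and 1 do not exist

KeyError: "None of [Index([0, 1], dtype='int64')] are in the [index]"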
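
The difference between label-based and position-based selection can be checked directly. A minimal sketch, assuming a hypothetical string-indexed Series named `labeled`:

```python
import pandas as pd

labeled = pd.Series([1, 2, 3], index=["a", "b", "c"])

try:
    labeled.loc[[0, 1]]        # loc looks up labels; 0 and 1 are not labels here
    loc_failed = False
except KeyError as exc:
    loc_failed = True
    print("KeyError:", exc)

# iloc works with integer positions regardless of the index type
by_position = labeled.iloc[[0, 1]]
print(by_position)
```

Using `loc` for labels and `iloc` for positions keeps the intent explicit, whatever the index contains.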

In [None]:
obj1

2    1
0    2
1    3
dtype: int64

In [None]:
obj1.iloc[[0,1,2]]

2    1
0    2
1    3
dtype: int64

In [None]:
obj2

a    1
b    2
c    3
dtype: int64

In [None]:
obj2.iloc[[0,1,2]]

a    1
b    2
c    3
dtype: int64

# CAUTION
You can also slice with labels, but it works differently from normal Python slicing in that the
endpoint is inclusive:


In [None]:
obj2.loc["b":"c"]

b    2
c    3
dtype: int64
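
To make the inclusive endpoint concrete, the sketch below compares a label slice with the positional slice that starts at the same element (`letters` is a hypothetical Series used only for illustration):

```python
import pandas as pd

letters = pd.Series([1, 2, 3], index=["a", "b", "c"])

label_slice = letters.loc["b":"c"]    # both endpoints are included
position_slice = letters.iloc[1:2]    # stop position is excluded, as in Python

print(len(label_slice), len(position_slice))
```

The label slice returns two elements while the positional slice returns one.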

In [None]:
obj2.loc["b":"c"]=5

In [None]:
obj2

a    1
b    5
c    5
dtype: int64

In [None]:
data=pd.DataFrame(np.arange(16).reshape((4,4)),
                  index=["Ohio","Colorado","Utah","New York"],
                  columns=["one","two","three","four"])

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [None]:
data[["three","one"]]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [None]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [None]:
data[data["three"]>5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data<5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [None]:
data[data<5]=0

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


# Selection on DataFrame with loc and iloc

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [None]:
data.loc[["Colorado","New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


In [None]:
data.loc["Colorado",["two","three"]]

two      5
three    6
Name: Colorado, dtype: int64

In [None]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [None]:
data.iloc[[2,1]]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
Colorado,0,5,6,7


In [None]:
data.iloc[2,[3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [None]:
data.iloc[[1,2],[3,0,1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [None]:
data.loc[:"Utah","two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data.iloc[:,:3][data.three>5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [None]:
data.loc[data.three>=2]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


## Table 5-4. Indexing options with DataFrame

| Type              | Notes                                                                 | Example (Python) | Output |
|-------------------|----------------------------------------------------------------------|------------------|--------|
| `df[column]`      | Select single column or sequence of columns; special cases: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) | `df['A']` | Column `A` |
| `df.loc[rows]`    | Select single row or subset of rows by label                         | `df.loc['row1']` | Row with label `row1` |
| `df.loc[:, cols]` | Select single column or subset of columns by label                   | `df.loc[:, ['A','B']]` | Columns `A` and `B` |
| `df.loc[rows, cols]` | Select both row(s) and column(s) by label                         | `df.loc['row1','A']` | Scalar value at row `row1`, column `A` |
| `df.iloc[rows]`   | Select single row or subset of rows by integer position              | `df.iloc[0]` | First row |
| `df.iloc[:, cols]`| Select single column or subset of columns by integer position        | `df.iloc[:, [0,1]]` | First two columns |
| `df.iloc[rows, cols]` | Select both row(s) and column(s) by integer position             | `df.iloc[0,0]` | Scalar value at first row, first column |
| `df.at[row, col]` | Select a single scalar value by row and column label                 | `df.at['row1','A']` | Scalar value |
| `df.iat[row, col]`| Select a single scalar value by row and column position (integers)   | `df.iat[0,0]` | Scalar value |
| `reindex` method  | Select either rows or columns by labels                              | `df.reindex(index=[1,2], columns=['A','B'])` | Reindexed DataFrame |
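
The scalar accessors `at` and `iat` in the table are not exercised elsewhere in this section. A brief sketch, assuming a hypothetical frame `table` built the same way as `data`:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

by_label = table.at["Utah", "three"]     # scalar selected by labels
by_position = table.iat[2, 2]            # same cell selected by integer positions
block = table.loc[["Ohio", "Utah"], ["one", "four"]]   # sub-DataFrame by labels

table.at["Utah", "three"] = -1           # at also supports assignment
print(by_label, by_position)
print(block)
```

Both accessors address exactly one cell; `loc` and `iloc` remain the tools for selecting rows, columns, or blocks.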

In [None]:
ser=pd.Series(np.arange(3.))

In [None]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [None]:
ser2=pd.Series(np.arange(3.),
               index=["a","b","c"])

In [None]:
ser2[-1]

np.float64(2.0)

In [None]:
ser.iloc[-1]

np.float64(2.0)

In [None]:
ser[:2]

0    0.0
1    1.0
dtype: float64

In [None]:
data.loc[:,"one"]=1

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,1,9,10,11
New York,1,13,14,15


In [None]:
data.iloc[2]=5

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,5,5,5,5
New York,1,13,14,15


In [None]:
data.loc[data["four"]>5]=3

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


In [None]:
data.loc[data.three==5,"three"]=6

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,6,5
New York,3,3,3,3


# Arithmetic and Data Alignment

In [None]:
s1=pd.Series(
    [7.3,-2.5,3.4,1.5],
    index=["a","c","d","e"]
)

In [None]:
s2=pd.Series(
    [-2.1,3.6,-1.5,4,3.1],
    index=["a","c","e","f","g"]
)

In [None]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [None]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [None]:
print(s1)
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [None]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [None]:
df1=pd.DataFrame(
    np.arange(9.).reshape((3,3)),
    columns=list("bcd"),
    index=["Ohio","Texas","California"]
)

In [None]:
df2=pd.DataFrame(
    np.arange(12.).reshape((4,3)),
    columns=list("bde"),
    index=["Utah","Ohio","Texas","Oregon"]
)

In [None]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
California,6.0,7.0,8.0


In [None]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
df1+df2

Unnamed: 0,b,c,d,e
California,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [None]:
df1=pd.DataFrame(
    {
        "A":[1,2]
    }
)

In [None]:
df2=pd.DataFrame(
    {
        "B":[3,4]
    }
)

In [None]:
df1

Unnamed: 0,A
0,1
1,2


In [None]:
df2

Unnamed: 0,B
0,3
1,4


In [None]:
df1+df2

Unnamed: 0,A,B
0,,
1,,


# Arithmetic methods with fill values

In [None]:
df1=pd.DataFrame(
    np.arange(12.).reshape((3,4)),
    columns=list("abcd")
)

In [None]:
df2=pd.DataFrame(
    np.arange(20.).reshape(4,5),
    columns=list("abcde")
)

In [None]:
df2.loc[1,"b"]=np.nan

In [None]:
df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [None]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [None]:
df1+df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [None]:
df1
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [None]:
df1.add(df2,fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [None]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
df1.reindex(columns=df2.columns,fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


## Table 5-5. Flexible arithmetic methods

| Method            | Description                  | Example (Python) | Output |
|-------------------|------------------------------|------------------|--------|
| `add`, `radd`     | Methods for addition (+)     | `pd.Series([1,2,3]).add(pd.Series([4,5,6]))` | `[5, 7, 9]` |
| `sub`, `rsub`     | Methods for subtraction (-)  | `pd.Series([10,20,30]).sub(pd.Series([1,2,3]))` | `[9, 18, 27]` |
| `div`, `rdiv`     | Methods for division (/)     | `pd.Series([10,20,30]).div(pd.Series([2,4,5]))` | `[5.0, 5.0, 6.0]` |
| `floordiv`, `rfloordiv` | Methods for floor division (//) | `pd.Series([10,20,30]).floordiv(pd.Series([3,4,5]))` | `[3, 5, 6]` |
| `mul`, `rmul`     | Methods for multiplication (*) | `pd.Series([1,2,3]).mul(pd.Series([4,5,6]))` | `[4, 10, 18]` |
| `pow`, `rpow`     | Methods for exponentiation (**) | `pd.Series([2,3,4]).pow(pd.Series([3,2,1]))` | `[8, 9, 4]` |
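
Each method has a counterpart starting with `r` that reverses the arguments, and all of them accept `fill_value`. A small sketch under the assumption of two hypothetical frames, `small` and `wide`, that share only column `a`:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list("abc"))

reciprocal_op = 1 / small
reciprocal_method = small.rdiv(1)     # arguments reversed: computes 1 / small

# fill_value stands in for a value that is missing on one side only
wide = pd.DataFrame({"a": [10.0, 20.0], "z": [1.0, 2.0]})
total = small.add(wide, fill_value=0)

print(reciprocal_method)
print(total)
```

Columns `b`, `c`, and `z` exist in only one operand, so with `fill_value=0` they pass through unchanged instead of becoming NaN.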

# Operations between DataFrame and Series

In [None]:
arr=np.arange(12.).reshape((3,4))

In [None]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [None]:
arr-arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [None]:
frame=pd.DataFrame(
    np.arange(12.).reshape((4,3)),
    columns=list("bde"),
    index=["Utah","Ohio","Texas","Oregon"]
)

In [None]:
series=frame.iloc[0]

In [None]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [None]:
frame-series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [None]:
series2=pd.Series(
    np.arange(3),
    index=list("bef")
)

In [None]:
series2

b    0
e    1
f    2
dtype: int64

In [None]:
frame+series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [None]:
series3=frame["d"]

In [None]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [None]:
frame.sub(series3,axis="index")

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
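
A common use of this broadcasting is centering data. The sketch below, which assumes a hypothetical frame `scores` shaped like `frame`, subtracts column means with the default alignment and row means with `axis="index"`:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                      columns=list("bde"),
                      index=["Utah", "Ohio", "Texas", "Oregon"])

# Default: the Series index is matched against the columns,
# so each column has its own mean subtracted
col_centered = scores - scores.mean()

# axis="index": the Series index is matched against the row labels,
# so each row has its own mean subtracted
row_centered = scores.sub(scores.mean(axis="columns"), axis="index")

print(col_centered)
print(row_centered)
```

The `axis` argument names the axis to match on, which is why row-wise centering needs the method form rather than the `-` operator.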


# Function Application and Mapping

In [None]:
frame=pd.DataFrame(
    np.random.standard_normal((4,3)),
    columns=list("bde"),
    index=["Utah","Ohio","Texas","Oregon"]
)

In [None]:
frame

Unnamed: 0,b,d,e
Utah,-0.97889,-1.219178,1.064125
Ohio,-1.535399,-0.024013,-0.303132
Texas,-0.662335,0.022784,-1.470858
Oregon,1.819697,-0.327358,0.875803


In [None]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.97889,1.219178,1.064125
Ohio,1.535399,0.024013,0.303132
Texas,0.662335,0.022784,1.470858
Oregon,1.819697,0.327358,0.875803


In [None]:
def f1(x):
    return x.max()-x.min()

In [None]:
frame

Unnamed: 0,b,d,e
Utah,-0.97889,-1.219178,1.064125
Ohio,-1.535399,-0.024013,-0.303132
Texas,-0.662335,0.022784,-1.470858
Oregon,1.819697,-0.327358,0.875803


In [None]:
frame.apply(f1)

b    3.355096
d    1.241962
e    2.534983
dtype: float64

In [None]:
frame.apply(f1,axis="columns")

Utah      2.283303
Ohio      1.511386
Texas     1.493641
Oregon    2.147055
dtype: float64
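
For simple statistics such as this range, `apply` is often unnecessary because the built-in reductions already work column-wise or row-wise. A sketch of the equivalence, assuming a hypothetical frame `values` and an arbitrarily chosen seed so the numbers are repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed is arbitrary, chosen for repeatability
values = pd.DataFrame(rng.standard_normal((4, 3)),
                      columns=list("bde"),
                      index=["Utah", "Ohio", "Texas", "Oregon"])

def spread(x):
    return x.max() - x.min()

via_apply = values.apply(spread)                  # one call per column
via_methods = values.max() - values.min()         # vectorized equivalent
by_row = values.max(axis="columns") - values.min(axis="columns")

print(via_apply)
print(via_methods)
```

The vectorized form avoids a Python-level function call per column, which tends to matter on wide frames.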

In [None]:
def f2(x):
    return pd.Series(
        [x.min(),x.max()],
        index=["min","max"]
    )

In [None]:
frame

Unnamed: 0,b,d,e
Utah,-0.97889,-1.219178,1.064125
Ohio,-1.535399,-0.024013,-0.303132
Texas,-0.662335,0.022784,-1.470858
Oregon,1.819697,-0.327358,0.875803


In [None]:
frame.apply(f2)

Unnamed: 0,b,d,e
min,-1.535399,-1.219178,-1.470858
max,1.819697,0.022784,1.064125


In [None]:
frame

Unnamed: 0,b,d,e
Utah,-0.97889,-1.219178,1.064125
Ohio,-1.535399,-0.024013,-0.303132
Texas,-0.662335,0.022784,-1.470858
Oregon,1.819697,-0.327358,0.875803


In [None]:
def my_format(x):
    return f"{x:.2f}"

In [None]:
frame.map(my_format) # DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map replaces it

Unnamed: 0,b,d,e
Utah,-0.98,-1.22,1.06
Ohio,-1.54,-0.02,-0.30
Texas,-0.66,0.02,-1.47
Oregon,1.82,-0.33,0.88


In [None]:
frame["e"].map(my_format)

Utah       1.06
Ohio      -0.30
Texas     -1.47
Oregon     0.88
Name: e, dtype: object

# Sorting and Ranking

In [None]:
obj=pd.Series(
    np.arange(4),
    index=list("dabc")
)

In [None]:
obj

d    0
a    1
b    2
c    3
dtype: int64

In [None]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [None]:
frame=pd.DataFrame(
    np.arange(8).reshape((2,4)),
    index=["three","one"],
    columns=list("dabc")
)

In [None]:
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [None]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [None]:
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [None]:
obj=pd.Series([
    4,7,-3,2
])

In [None]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [None]:
obj=pd.Series(
    [
        4,np.nan,7,np.nan,-3,2
    ]
)

In [None]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [None]:
obj.sort_values(na_position="first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

In [None]:
frame=pd.DataFrame(
    {
        "b":[4,7,-3,2],
        "a":[0,1,0,1]
    }
)

In [None]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [None]:
frame.sort_values("b")

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [None]:
frame.sort_values(["a","b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
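
When sorting by several columns, each key can have its own direction. A minimal sketch, assuming a hypothetical frame `records` with the same values as `frame`:

```python
import pandas as pd

records = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending, then by "b" descending within each value of "a"
mixed = records.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```

The list passed to `ascending` must have one entry per sort key.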


In [None]:
obj=pd.Series([7,-5,7,4,2,0,4])

In [None]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [None]:
obj.rank(method="first")

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [None]:
obj.rank(ascending=False)

0    1.5
1    7.0
2    1.5
3    3.5
4    5.0
5    6.0
6    3.5
dtype: float64

In [None]:
frame=pd.DataFrame(
    {
        "b":[4.3,7,-3,2],
        "a":[0,1,0,1],
        "c":[-2,5,8,-2.5]
    }
)

In [None]:
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [None]:
frame.rank(axis="columns")

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


## Table 5-6. Tie-breaking methods with `rank`

| Method     | Description                                                                 | Example (Python) | Output |
|------------|-----------------------------------------------------------------------------|------------------|--------|
| `"average"`| Default: assign the average rank to each entry in the equal group           | `pd.Series([100,200,200,300]).rank(method="average")` | `[1.0, 2.5, 2.5, 4.0]` |
| `"min"`    | Use the minimum rank for the whole group                                    | `pd.Series([100,200,200,300]).rank(method="min")` | `[1.0, 2.0, 2.0, 4.0]` |
| `"max"`    | Use the maximum rank for the whole group                                    | `pd.Series([100,200,200,300]).rank(method="max")` | `[1.0, 3.0, 3.0, 4.0]` |
| `"first"`  | Assign ranks in the order the values appear in the data                     | `pd.Series([100,200,200,300]).rank(method="first")` | `[1.0, 2.0, 3.0, 4.0]` |
| `"dense"`  | Like `"min"`, but ranks always increase by 1 between groups rather than the number of equal elements in a group | `pd.Series([100,200,200,300]).rank(method="dense")` | `[1.0, 2.0, 2.0, 3.0]` |
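
The tie-breaking methods are easiest to compare side by side. The sketch below builds one column per method for a small hypothetical Series `scores`:

```python
import pandas as pd

scores = pd.Series([100, 200, 200, 300])

# One column of ranks per tie-breaking method
ranks = pd.DataFrame({
    method: scores.rank(method=method)
    for method in ["average", "min", "max", "first", "dense"]
})
print(ranks)
```

Only the two tied entries (rows 1 and 2) and, for `"dense"`, the entry after them differ between columns.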

# Axis Indexes with Duplicate Labels

In [None]:
obj=pd.Series(
    np.arange(5.),
    index=list("aabbc")
)

In [None]:
obj

a    0.0
a    1.0
b    2.0
b    3.0
c    4.0
dtype: float64

In [None]:
obj.index.is_unique

False

In [None]:
obj["a"]

a    0.0
a    1.0
dtype: float64

In [None]:
obj["c"]

np.float64(4.0)

In [None]:
df=pd.DataFrame(
    np.random.standard_normal((5,3)),
    index=list("aabbc")
)

In [None]:
df

Unnamed: 0,0,1,2
a,-0.017323,0.88563,-0.151985
a,1.185749,-1.369733,0.278895
b,-1.184321,-1.522385,1.452028
b,0.295839,1.122591,0.245521
c,-1.696543,-0.000941,1.449816


In [None]:
df.loc["b"]

Unnamed: 0,0,1,2
b,-1.184321,-1.522385,1.452028
b,0.295839,1.122591,0.245521


In [None]:
df.loc["c"]

0   -1.696543
1   -0.000941
2    1.449816
Name: c, dtype: float64
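
Because the return type depends on whether a label repeats, code that must handle both cases can select with a list or collapse the duplicates first. A sketch under the assumption of a hypothetical Series `dup`; aggregating with the mean is an arbitrary illustrative choice:

```python
import numpy as np
import pandas as pd

dup = pd.Series(np.arange(5.), index=list("aabbc"))

repeated = dup.loc["a"]        # label occurs twice: returns a Series
single = dup.loc["c"]          # label occurs once: returns a scalar

# Selecting with a list returns a Series in both cases
always_series = dup.loc[["c"]]

# Collapse duplicate labels by aggregating per label
collapsed = dup.groupby(level=0).mean()
print(collapsed)
```

After the aggregation, `collapsed.index.is_unique` holds, so label lookups return scalars again.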

# 5.3 Summarizing and Computing Descriptive Statistics

In [None]:
df=pd.DataFrame(
    [[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],
    index=list("abcd"),
    columns=["one","two"]   
)

In [None]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [None]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [None]:
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [None]:
df.sum(axis="index", skipna=False)

one   NaN
two   NaN
dtype: float64

In [None]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [None]:
df.sum(axis="columns",skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

In [None]:
df.mean(axis="columns")

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
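
The outputs above show that, with the default `skipna=True`, summing the all-NaN row `c` gives `0.0` while `mean` gives `NaN`. If an all-missing row should sum to `NaN` as well, one option is the `min_count` argument of `sum`. A minimal sketch that rebuilds the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=list("abcd"),
    columns=["one", "two"],
)

# Default behavior: NaN values are skipped, so an all-NaN row sums to 0.0.
default_sum = df.sum(axis="columns")

# Require at least one non-NaN value per row; otherwise the result is NaN.
strict_sum = df.sum(axis="columns", min_count=1)
```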

## Table 5-7. Options for reduction methods

| Argument | Description                                                                 | Example (Python) | Output |
|----------|-----------------------------------------------------------------------------|------------------|--------|
| `axis`   | Axis to reduce over; `"index"` for DataFrame’s rows and `"columns"` for columns | `df.sum(axis="columns")` | Row-wise sum |
| `skipna` | Exclude missing values; `True` by default                                   | `df.mean(skipna=False)` | NaN is not skipped, so any column containing NaN yields NaN |
| `level`  | Reduce grouped by level if the axis is hierarchically indexed (`MultiIndex`); this argument was removed in pandas 2.0, where `groupby(level=...)` is used instead | `df.groupby(level=0).sum()` | Sum collapsed by outer index level |
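
In pandas 2.0 and later the reduction methods no longer accept `level`, so a level-wise reduction goes through `groupby`. A minimal sketch on illustrative data (the frame and the level names `key1` and `key2` are assumptions, not from the text):

```python
import pandas as pd

# Illustrative two-level index: outer level "key1", inner level "key2".
idx = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
)
frame = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Sum within each value of the outer level, by position or by name.
by_outer = frame.groupby(level=0).sum()
by_name = frame.groupby(level="key1").sum()
```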

In [None]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [None]:
df.idxmax()

one    b
two    d
dtype: object

In [None]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [None]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [None]:
obj=pd.Series(list("aabc")*4)

In [None]:
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [None]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

## Table 5-8. Descriptive and summary statistics

| Method        | Description                                                                 | Example (Python) | Output |
|---------------|-----------------------------------------------------------------------------|------------------|--------|
| `count`       | Number of non-NA values                                                     | `pd.Series([1,2,np.nan]).count()` | `2` |
| `describe`    | Compute set of summary statistics                                           | `pd.Series([1,2,3,4]).describe()` | count=4, mean=2.5, std=1.29, min=1, max=4 |
| `min`, `max`  | Compute minimum and maximum values                                          | `pd.Series([1,2,3]).min()` | `1` |
| `argmin`, `argmax` | Compute index locations (integers) at which min/max value is obtained (not available on DataFrame objects) | `pd.Series([10,20,5]).argmin()` | `2` |
| `idxmin`, `idxmax` | Compute index labels at which min/max value is obtained                | `pd.Series([10,20,5], index=['a','b','c']).idxmin()` | `'c'` |
| `quantile`    | Compute sample quantile ranging from 0 to 1 (default: 0.5)                  | `pd.Series([1,2,3,4]).quantile(0.75)` | `3.25` |
| `sum`         | Sum of values                                                              | `pd.Series([1,2,3]).sum()` | `6` |
| `mean`        | Mean of values                                                             | `pd.Series([1,2,3]).mean()` | `2.0` |
| `median`      | Arithmetic median (50% quantile)                                           | `pd.Series([1,2,3]).median()` | `2.0` |
| `mad`         | Mean absolute deviation from mean value (the `mad` method was removed in pandas 2.0; compute it directly) | `s = pd.Series([1,2,3]); (s - s.mean()).abs().mean()` | `0.666...` |
| `prod`        | Product of all values                                                      | `pd.Series([1,2,3]).prod()` | `6` |
| `var`         | Sample variance of values                                                  | `pd.Series([1,2,3]).var()` | `1.0` |
| `std`         | Sample standard deviation of values                                        | `pd.Series([1,2,3]).std()` | `1.0` |
| `skew`        | Sample skewness (third moment)                                             | `pd.Series([1,2,3]).skew()` | `0.0` |
| `kurt`        | Sample kurtosis (fourth moment)                                            | `pd.Series([1,2,3,4,5]).kurt()` | `-1.2` |
| `cumsum`      | Cumulative sum of values                                                   | `pd.Series([1,2,3]).cumsum()` | `[1,3,6]` |
| `cummin`, `cummax` | Cumulative minimum or maximum of values                               | `pd.Series([3,1,4]).cummin()` | `[3,1,1]` |
| `cumprod`     | Cumulative product of values                                               | `pd.Series([1,2,3]).cumprod()` | `[1,2,6]` |
| `diff`        | Compute first arithmetic difference (useful for time series)               | `pd.Series([1,2,4,7]).diff()` | `[NaN,1,2,3]` |
| `pct_change`  | Compute percent changes                                                    | `pd.Series([100,120,150]).pct_change()` | `[NaN,0.2,0.25]` |
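
A few of the methods in Table 5-8 side by side, on illustrative values. The last line computes the mean absolute deviation by hand, which is one way to obtain it in pandas versions where `mad` is unavailable. This is a sketch, not the only way to combine these calls:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.4, 7.1, np.nan, 0.75], index=list("abcd"))

n_valid = s.count()            # NaN is not counted
label_of_max = s.idxmax()      # index label of the maximum
position_of_max = s.argmax()   # integer position of the maximum

# Mean absolute deviation from the mean, computed manually.
mean_abs_dev = (s - s.mean()).abs().mean()
```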

# Correlation and Covariance


In [None]:
price=pd.read_pickle("D:/DS BOOKS/examples/yahoo_price.pkl")

In [None]:
volume=pd.read_pickle("D:/DS BOOKS/examples/yahoo_volume.pkl")

In [None]:
price

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,27.990226,313.062468,113.304536,25.884104
2010-01-05,28.038618,311.683844,111.935822,25.892466
2010-01-06,27.592626,303.826685,111.208683,25.733566
2010-01-07,27.541619,296.753749,110.823732,25.465944
2010-01-08,27.724725,300.709808,111.935822,25.641571
...,...,...,...,...
2016-10-17,117.550003,779.960022,154.770004,57.220001
2016-10-18,117.470001,795.260010,150.720001,57.660000
2016-10-19,117.120003,801.500000,151.259995,57.529999
2016-10-20,117.059998,796.969971,151.520004,57.250000


In [None]:
volume

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,123432400,3927000,6155300,38409100
2010-01-05,150476200,6031900,6841400,49749600
2010-01-06,138040000,7987100,5605300,58182400
2010-01-07,119282800,12876600,5840600,50559700
2010-01-08,111902700,9483900,4197200,51197400
...,...,...,...,...
2016-10-17,23624900,1089500,5890400,23830000
2016-10-18,24553500,1995600,12770600,19149500
2016-10-19,20034600,116600,4632900,22878400
2016-10-20,24125800,1734200,4023100,49455600


In [None]:
returns=price.pct_change()

In [None]:
returns.tail()

Date,AAPL,GOOG,IBM,MSFT
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096


In [None]:
returns["MSFT"].corr(returns["IBM"])

np.float64(0.49976361144151155)

In [None]:
returns["MSFT"].cov(returns["IBM"])

np.float64(8.870655479703546e-05)

In [None]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


In [None]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000277,0.000107,7.8e-05,9.5e-05
GOOG,0.000107,0.000251,7.8e-05,0.000108
IBM,7.8e-05,7.8e-05,0.000146,8.9e-05
MSFT,9.5e-05,0.000108,8.9e-05,0.000215


In [None]:
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

In [None]:
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

In [None]:
returns.corrwith(volume,axis="columns")

Date
2010-01-04         NaN
2010-01-05    0.737298
2010-01-06    0.017069
2010-01-07    0.507614
2010-01-08   -0.779646
                ...   
2016-10-17   -0.881606
2016-10-18   -0.303369
2016-10-19   -0.970723
2016-10-20   -0.304414
2016-10-21    0.927824
Length: 1714, dtype: float64
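
The relationship between `corr`, `cov`, and `corrwith` can be checked on synthetic data. The sketch below uses random numbers as a stand-in for the price files (the frame `fake_returns` and its column names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # synthetic stand-in, not the Yahoo data
fake_returns = pd.DataFrame(
    rng.standard_normal((250, 3)), columns=["X", "Y", "Z"]
)

x, y = fake_returns["X"], fake_returns["Y"]

# Correlation is covariance scaled by both sample standard deviations.
manual_corr = x.cov(y) / (x.std() * y.std())

# corrwith against one column reproduces that column of the full matrix.
pairwise = fake_returns.corrwith(fake_returns["X"])
```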

# Unique Values, Value Counts, and Membership

In [None]:
obj=pd.Series(list('cadaabbcc'))

In [None]:
uniques=obj.unique()

In [None]:
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [None]:
obj.value_counts()

c    3
a    3
b    2
d    1
Name: count, dtype: int64

In [None]:
pd.Series(obj.to_numpy()).value_counts(sort=False)

c    3
a    3
d    1
b    2
Name: count, dtype: int64

In [None]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [None]:
mask=obj.isin(list("bc"))

In [None]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [None]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [None]:
to_match=pd.Series(list("cabbcak"))

In [None]:
to_match

0    c
1    a
2    b
3    b
4    c
5    a
6    k
dtype: object

In [None]:
unique_vals=pd.Series(list("cba"))

In [None]:
unique_vals

0    c
1    b
2    a
dtype: object

In [None]:
indices=pd.Index(unique_vals).get_indexer(to_match)

In [None]:
indices

array([ 0,  2,  1,  1,  0,  2, -1])
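
In the output above, `-1` marks a value of `to_match` that does not occur in `unique_vals`. A short sketch of using that marker to pull out the unmatched values (the name `unmatched` is introduced here for illustration):

```python
import pandas as pd

to_match = pd.Series(list("cabbcak"))
unique_vals = pd.Series(list("cba"))

# Position of each to_match value within unique_vals; -1 means "not found".
indices = pd.Index(unique_vals).get_indexer(to_match)

# Keep only the values that had no match.
unmatched = to_match[indices == -1]
```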

## Table 5-9. Unique, value counts, and set membership methods

| Method         | Description                                                                 | Example (Python) | Output |
|----------------|-----------------------------------------------------------------------------|------------------|--------|
| `isin`         | Compute a Boolean array indicating whether each Series or DataFrame value is contained in the passed sequence of values | `pd.Series([1,2,3]).isin([2,4])` | `[False, True, False]` |
| `get_indexer`  | Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations | `pd.Index([1,2,3]).get_indexer([2,3,4])` | `[1,2,-1]` |
| `unique`       | Compute an array of unique values in a Series, returned in the order observed | `pd.Series([1,2,2,3,3,3]).unique()` | `[1,2,3]` |
| `value_counts` | Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order | `pd.Series([1,2,2,3,3,3]).value_counts()` | counts `3, 2, 1` for values `3, 2, 1` |

In [None]:
data=pd.DataFrame(
    {
        "Qu1":[1,3,4,3,4],
        "Qu2":[2,3,1,2,3],
        "Qu3":[1,5,2,4,4]
    }
)

In [None]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [None]:
data.value_counts()

Qu1  Qu2  Qu3
1    2    1      1
3    2    4      1
     3    5      1
4    1    2      1
     3    4      1
Name: count, dtype: int64

In [None]:
data["Qu1"].value_counts().sort_index()

Qu1
1    1
3    2
4    2
Name: count, dtype: int64

In [None]:
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [None]:
result=data.apply(pd.Series.value_counts).fillna(0)

In [None]:
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
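
The counts above are floats because the missing combinations are `NaN` before `fillna(0)` is applied, and `NaN` forces a floating-point dtype. If integer counts are preferred, one option is to cast after filling. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Qu1": [1, 3, 4, 3, 4],
        "Qu2": [2, 3, 1, 2, 3],
        "Qu3": [1, 5, 2, 4, 4],
    }
)

# Count values per column, fill absent values with 0, then cast to int.
counts = data.apply(pd.Series.value_counts).fillna(0).astype(int)
```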


In [None]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data

Unnamed: 0,a,b
0,1,0
1,1,0
2,1,1
3,2,0
4,2,0


In [None]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
Name: count, dtype: int64
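
`DataFrame.value_counts` treats each row as a tuple, so the result is a Series whose index has one level per column. A sketch of looking up one combination and flattening the result back into a table (the names `row_counts` and `as_table` are illustrative):

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

row_counts = data.value_counts()

# Look up how often the row (a=1, b=0) occurs.
n_1_0 = row_counts.loc[(1, 0)]

# Turn the index levels back into ordinary columns.
as_table = row_counts.reset_index(name="count")
```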