# Getting Started with pandas

pandas is designed for working with tabular or heterogeneous data.
NumPy, by contrast, is best suited for working with homogeneously typed numerical array data.

## 5.1 Introduction to pandas Data Structures

### Series

A Series is a one-dimensional array-like object containing a sequence of values of the same type and an associated array of data labels, called its index.

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns.


In [3]:
import pandas as pd
import numpy as np


In [4]:
# Create a series with an index identifying each data point with a label

obj2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Select several values by passing a list of index labels
obj2[["c", "a", "d"]]

# Check whether a label exists in the Series index
"b" in obj2

# Create a series from dictionary
sdata = {"a": 1, "b": 2}
obj3 = pd.Series(sdata)

# Convert series back to dictionary
obj3.to_dict()

# Pass an index to define the order of the data; dictionary keys that are not in the index are filtered out
# If an index label is not found in the dictionary, its value is NaN
data_index = ["c", "a"]
pd.Series(sdata, index=data_index)

# Detect missing data
pd.isna(obj2)
obj2.isna()
pd.notna(obj2)
obj2.notna()

# Assign a name to the Series and a name to its index
obj3.name = "Population"
obj3.index.name = "States"

# Alter a Series's index in place by assigning a new array; it has to be the same length as the Series
obj3.index = ["x", "y"]
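The alignment behavior described in the comments above can be checked with a small sketch. The names `sdata_demo` and `aligned` are illustrative only and are not part of the original notes.

```python
import pandas as pd

sdata_demo = {"a": 1, "b": 2}

# "c" is not a key in the dictionary, so it is filled with NaN;
# "b" is left out because it is not in the requested index.
aligned = pd.Series(sdata_demo, index=["c", "a"])

print(aligned)
print(aligned.isna())  # True for "c", False for "a"
```

Because NaN is a floating-point value, the integer data is typically upcast to `float64` in the result.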


In [5]:
# Construct a DataFrame through a dictionary of equal-length lists or NumPy arrays
fdata = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 3.6],
}

frame = pd.DataFrame(fdata)

# First and last 5 rows of the DataFrame
frame.head()
frame.tail()

# Specify a sequence of columns to control their order (a column that isn't
# contained in the dictionary will appear with missing values)
frame = pd.DataFrame(fdata, columns=["state", "year", "pop"])

# Select a column from dataframe
frame["state"]

# Select a row from dataframe
frame.loc[1]
frame.iloc[1]

# Modify the entire column
frame["state"] = "Anything"
frame["state"] = np.arange(3)

frame["Year 2000"] = frame["year"] == 2000

# Delete a column from frame
del frame["pop"]

# Transpose a frame
frame.T

# Return the data contained as a two-dimensional ndarray
frame.to_numpy()


array([[0, 2000, True],
       [1, 2001, False],
       [2, 2001, False]], dtype=object)
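The note earlier in this cell about passing a column that is not in the dictionary can be illustrated with a short sketch. The column name `debt` and the variable names are assumptions made up for this example.

```python
import pandas as pd

fdata_demo = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 3.6],
}

# "debt" is a hypothetical column that does not exist in the dictionary,
# so every row gets a missing value for it.
frame_demo = pd.DataFrame(fdata_demo, columns=["year", "state", "pop", "debt"])

print(frame_demo)
print(frame_demo["debt"].isna().all())
```

The `columns` argument also controls the column order, which is why `year` comes first here.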

### Index objects

Index objects are responsible for holding the axis labels.

Index objects are immutable.

A pandas Index can contain duplicate labels; selecting a duplicated label returns all occurrences of that label.
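A minimal sketch of the immutability noted above, assuming a small hypothetical Index named `labels`: item assignment fails, while Index methods return new objects.

```python
import pandas as pd

labels = pd.Index(["a", "b", "c"])

# Item assignment is not supported on an Index.
try:
    labels[1] = "d"
    assignment_failed = False
except TypeError as exc:
    assignment_failed = True
    print("TypeError:", exc)

# Methods such as delete() and insert() build a new Index instead.
replaced = labels.delete(1).insert(1, "d")
print(list(labels))    # the original is unchanged
print(list(replaced))
```

Immutability is what makes it safe to share one Index object among several Series and DataFrames.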


In [6]:
fframe = pd.DataFrame(
    ["foooo", "foooo", "foooo", "barrr"], index=["foo", "foo", "foo", "bar"]
)

# Selecting a duplicated label returns all occurrences of that label
fframe.loc["foo"]


Unnamed: 0,0
foo,foooo
foo,foooo
foo,foooo


### Index methods and properties

| Method                  | Description |
| ----------------------- | ----------- |
| append()                | Concatenate with additional Index objects, producing a new Index |
| difference()            | Compute the set difference as an Index |
| intersection()          | Compute the set intersection |
| union()                 | Compute the set union |
| isin()                  | Compute a Boolean array indicating whether each value is contained in the passed collection |
| delete()                | Compute a new Index with the element at position i deleted |
| insert()                | Compute a new Index by inserting an element at position i |
| is_monotonic_increasing | Return True if each element is greater than or equal to the previous element (named is_monotonic before pandas 2.0) |
| is_unique               | Return True if the Index has no duplicate values |
| unique()                | Compute the array of unique values in the Index |
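As a rough illustration of several entries in the table, the sketch below applies them to two small Index objects. The names `left` and `right` are hypothetical.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.union(right)            # labels found in either Index
shared = left.intersection(right)   # labels found in both
only_left = left.difference(right)  # labels found only in left
membership = left.isin(["a", "c"])  # Boolean NumPy array
without_first = left.delete(0)      # drop the element at position 0
with_z = left.insert(1, "z")        # insert "z" at position 1
combined = left.append(right)       # concatenation keeps duplicates

print(list(both), list(shared), list(only_left))
print(membership, list(without_first), list(with_z))
print(combined.is_unique, left.is_monotonic_increasing)
```

None of these calls modify `left` or `right`; each one returns a new object.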


## 5.2 Essential Functionality

### Reindexing

Reindexing creates a new object with the values rearranged to align with the new index.


In [7]:
obj = pd.Series([4.5, 3.2, -5.3, 3.6], index=[0, 2, 4, 6])
obj


0    4.5
2    3.2
4   -5.3
6    3.6
dtype: float64

In [8]:
# Rearrange the data according to the new index, introducing missing values for any index values that were not already present
# method="ffill" forward-fills the missing values
obj2 = obj.reindex(np.arange(10), method="ffill")
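To make the effect of forward filling visible, this sketch compares a plain reindex with the `ffill` variant on the same data. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

sparse = pd.Series([4.5, 3.2, -5.3, 3.6], index=[0, 2, 4, 6])

# Without a fill method, every new label gets NaN.
plain = sparse.reindex(np.arange(10))

# With method="ffill", each new label takes the last preceding value.
filled = sparse.reindex(np.arange(10), method="ffill")

print(pd.DataFrame({"plain": plain, "ffill": filled}))
```

Labels 7 through 9 lie past the last original label, so they all take the value stored at label 6.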


In [9]:
frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)


In [10]:
# Reindex rows
frame2 = frame.reindex(index=["a", "b", "c", "d"])


In [11]:
# Reindex columns
states = ["Texas", "Utah", "California"]
frame3 = frame.reindex(columns=states)


In [12]:
frame


Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [13]:
# Reindexing with the loc operator
# This works only if all of the new index labels already exist in the DataFrame,
# whereas reindex inserts missing data for new labels
# df.loc[rows, columns]
frame.loc[["a", "d", "c"], ["California", "Texas"]]


Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4
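The difference between `loc` and `reindex` for labels that do not exist can be sketched as follows. The frame `table` mirrors the one used above, and the label `"b"` is deliberately absent from it.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# reindex inserts a row of NaN for the unknown label "b".
reindexed = table.reindex(index=["a", "b", "c"])

# loc refuses a list that contains an unknown label.
try:
    table.loc[["a", "b", "c"]]
    loc_failed = False
except KeyError as exc:
    loc_failed = True
    print("KeyError:", exc)

print(reindexed)
```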


In [14]:
# Dropping entries from an axis
# If you already have an index array without those entries, reindex or .loc-based indexing also works;
# drop returns a new object with the given labels removed

obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])

new_obj = obj.drop("c")
new_obj = obj.drop(["a", "e"])


In [15]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop values from both rows and columns
data.drop(index=["Ohio", "Colorado"], columns=["two"])
# Drop columns by passing axis=1 or axis="columns"
data.drop("two", axis=1)


Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [16]:
# Indexing, selection and Filtering
obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

obj[["a", "b"]]
obj[obj <= 2]

# Prefer loc for label-based selection: with [], integers are treated as labels
# when the index contains integers, which is easy to confuse with positions
obj1 = pd.Series(np.arange(5), index=[4, 2, 1, 199, 0])

obj1[199]  # The value returned is looked up by label, not by integer position
# Use loc to select items by label
obj1.loc[[0, 1, 199]]

# Use iloc to select items by integer position
# Unlike obj1, obj has string labels, so integer positions have to go through iloc
obj.iloc[[0, 1, 2]]


a    0
b    1
c    2
dtype: int64

In [17]:
# Indexing into a DataFrame with [] retrieves one or more columns;
# slicing with loc selects rows by label, and the end label is included
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

data.loc["Ohio":"Colorado"]


Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [18]:
# Selecting data with a Boolean array
data[data["three"] > 5]

# Select every row and the first 3 columns, then keep the rows where column "three" is greater than 5
data.iloc[:, :3][data.three > 5]

# A Boolean Series can be used with loc but not with iloc
data.loc[data.three > 2]


Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Indexing operations with DataFrame

| Method              | Description |
| ------------------- | ----------- |
| df[column]          | Select a single column or sequence of columns from the DataFrame; special-case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame |
| df.loc[rows]        | Select a single row or subset of rows from the DataFrame by label |
| df.loc[:, cols]     | Select a single column or subset of columns by label |
| df.loc[rows, cols]  | Select both rows and columns by label |
| df.iloc[rows]       | Select a single row or subset of rows from the DataFrame by integer position |
| df.iloc[:, cols]    | Select a single column or subset of columns by integer position |
| df.iloc[rows, cols] | Select both rows and columns by integer position |
| df.at[row, col]     | Select a single scalar value by row and column label |
| df.iat[row, col]    | Select a single scalar value by row and column position (integers) |
| reindex method      | Select either rows or columns by labels |

If an axis index contains integers, data selection with [] will always be label-oriented.
Prefer indexing with loc or iloc to avoid ambiguity.
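The ambiguity can be seen in a short sketch. The Series names below are illustrative, and the second Series reuses the integer labels from the earlier cell.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # default integer index: 0, 1, 2

# With an integer index, [] looks up labels, and there is no label -1.
try:
    ser[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

last_by_position = ser.iloc[-1]  # position-based: the last element

# When labels and positions disagree, loc and iloc return different items.
shuffled = pd.Series(np.arange(5), index=[4, 2, 1, 199, 0])
by_label = shuffled.loc[0]      # the value stored under label 0
by_position = shuffled.iloc[0]  # the first value in the Series

print(lookup_failed, last_by_position, by_label, by_position)
```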


In [19]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "e", "c", "f", "g"])


In [20]:
# When adding two objects, values with the same index label are added together
# If a label is missing from either object, the result for that label is NaN

s1 + s2


a    5.2
c   -4.0
d    NaN
e    5.1
f    NaN
g    NaN
dtype: float64

In [21]:
# fill_value is substituted for a missing operand before the operation is carried out, not applied to the result afterward
s1.add(s2, fill_value=0)

1 / s1
# Same as
s1.rdiv(1)

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))
df1.reindex(columns=df2.columns, fill_value=0)


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0
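The same `fill_value` idea applies to arithmetic between two DataFrames of different shapes. This is a small sketch using frames shaped like `df1` and `df2` above; the names `small` and `large` are illustrative.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
large = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Plain addition: any cell missing from either frame becomes NaN.
plain_sum = small + large

# add() with fill_value=0 treats a cell missing from one frame as 0
# before adding, so the value from the other frame survives.
filled_sum = small.add(large, fill_value=0)

print(plain_sum)
print(filled_sum)
```

A cell that is missing from both operands would still come out as NaN; that case does not arise here because `large` covers every row and column.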


In [22]:
df1


Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [23]:
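# Operations between DataFrames and Series
# By default, subtracting a Series from a DataFrame matches the Series index
# against the columns and repeats the subtraction for each row (broadcasting)

# To match on the row index and broadcast across the columns instead,
# use the arithmetic method with axis="index"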
series = df1["b"]

df1.sub(series, axis="index")


Unnamed: 0,a,b,c,d
0,-1.0,0.0,1.0,2.0
1,-1.0,0.0,1.0,2.0
2,-1.0,0.0,1.0,2.0
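For comparison, the sketch below puts the default row-wise broadcasting next to the `axis="index"` form used above. The frame `grid` has the same shape as `df1`; the other names are illustrative.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))

# Default: the Series index (a, b, c, d) is matched against the columns
# and the first row is subtracted from every row.
by_row = grid - grid.iloc[0]

# axis="index": the Series index (0, 1, 2) is matched against the rows
# and column "b" is subtracted from every column.
by_col = grid.sub(grid["b"], axis="index")

print(by_row)
print(by_col)
```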


### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects.


In [24]:
frame = pd.DataFrame(
    np.random.standard_normal((4, 3)),
    columns=list("dbe"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

frame


Unnamed: 0,d,b,e
Utah,1.09735,-0.404897,-0.074869
Ohio,1.387359,-1.344784,-1.393299
Texas,-0.401907,2.621638,-0.457615
Oregon,0.17393,2.627819,0.659782


In [25]:
np.abs(frame)

# Apply a function to each column (the default) or, with axis="columns", to each row
def f1(x):
    return x.max() - x.min()


frame.apply(f1, axis="columns")

# The function passed to apply can also return a Series with multiple values
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])


frame.apply(f2)


Unnamed: 0,d,b,e
min,-0.401907,-1.344784,-1.393299
max,1.387359,2.627819,0.659782


In [29]:
# Apply an element-wise Python function
# The function is called once for each element in the DataFrame
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)


def my_func(x):
    return f"{x:.2f}"


frame.applymap(my_func)

# Series object's map method for element-wise function
frame["e"].map(my_func)


Utah      -0.07
Ohio      -1.39
Texas     -0.46
Oregon     0.66
Name: e, dtype: object

### Sorting and Ranking


In [34]:
# Sort lexicographically by row or column label - use sort_index method
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])


In [35]:
obj

obj.sort_index()


a    1
b    2
c    3
d    0
dtype: int64

In [40]:
frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)), index=["three", "one"], columns=["d", "a", "b", "c"]
)

frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [42]:
# DataFrame can sort by index on either axis
frame.sort_index(axis="columns")

# Sort by descending order
frame.sort_index(axis="columns", ascending=False)


Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [71]:
obj = pd.Series([4, 7, -3, 2])

# Sort by values; missing values are sorted to the end by default
obj2 = obj.sort_values()

# Sort missing values to the front instead
obj.sort_values(na_position="first")

# Use one or more column names as the sort keys for sort_values
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a':[0, -1, 0, -1]})
frame.sort_values('b')

frame.sort_values(['a', 'b'])

Unnamed: 0,b,a
3,2,-1
1,7,-1
2,-3,0
0,4,0


In [70]:
# Reversing the order of the sort keys changes the result: sort by 'b' first, then by 'a'
frame.sort_values(['b', 'a'])

Unnamed: 0,b,a
2,-3,0
3,2,-1
0,4,0
1,7,-1


In [73]:
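# rank assigns each value its rank, starting from 1 for the smallest value
# By default, ties are broken by assigning each tied value the mean rank
obj.rank()

# method="min" gives every value in a tied group the group's minimum rank
obj.rank(method="min")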

0    3.0
1    4.0
2    1.0
3    2.0
dtype: float64

In [75]:
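# Break ties according to the order in which values are observed in the data
obj.rank(method='first')

# Rank in descending order (the largest value gets rank 1)
obj.rank(ascending=False)

# Rank across the columns within each row
frame.rank(axis="columns")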

Unnamed: 0,b,a
0,2.0,1.0
1,2.0,1.0
2,1.0,2.0
3,2.0,1.0
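The Series ranked above has no tied values, so the tie-breaking methods all give the same answer there. The following sketch uses a hypothetical Series with ties to show how `average`, `min`, and `first` differ.

```python
import pandas as pd

tied = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = tied.rank()                        # default: mean rank within a tie
lowest = tied.rank(method="min")         # minimum rank within a tie
in_order = tied.rank(method="first")     # earlier occurrences rank lower
descending = tied.rank(ascending=False)  # the largest value ranks first

print(pd.DataFrame({
    "value": tied,
    "average": avg,
    "min": lowest,
    "first": in_order,
    "descending": descending,
}))
```

The two 7s occupy ranks 6 and 7, so they get 6.5 under `average`, 6 under `min`, and 6 and 7 under `first`.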


In [77]:
# Axis indexes with duplicate labels
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# Check if the index is unique
obj.index.is_unique

obj['a']

a    0
a    1
dtype: int64

In [79]:
obj.loc['a']



a    0
a    1
dtype: int64

### Summarizing and Computing Descriptive Statistics
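Reductions or summary statistics extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.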


In [80]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=["a", "b", "c", "d"], columns=["one", "two"])

In [84]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [85]:
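# Sum down each column; returns a Series indexed by column name
df.sum()

# Sum across the columns of each row; returns a Series indexed by row label
# (NaN values are skipped, so an all-NaN row sums to 0)
df.sum(axis="columns")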


a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
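The output above shows row `c` summing to 0 even though both of its values are missing. A brief sketch of how the `skipna` option changes this, using a frame with the same contents as `df` (the name `stats` is illustrative):

```python
import numpy as np
import pandas as pd

stats = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default: NaN values are skipped, so the all-NaN row sums to 0.
row_sums = stats.sum(axis="columns")

# skipna=False: any NaN in a row makes that row's result NaN.
strict_sums = stats.sum(axis="columns", skipna=False)

# mean needs at least one non-NaN value, so row "c" stays NaN.
row_means = stats.mean(axis="columns")

print(pd.DataFrame({"sum": row_sums, "strict": strict_sums, "mean": row_means}))
```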

### Correlation and Covariance
Some summary statistics, such as correlation and covariance, are computed from pairs of arguments.

In [86]:
price = pd.read_pickle("./datasets/yahoo_price.pkl")
volume = pd.read_pickle("./datasets/yahoo_volume.pkl")

# percent changes of the price
returns = price.pct_change()

In [87]:
returns.tail()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096


In [89]:
# Compute the correlation of the overlapping, non-NA, index-aligned values in two Series
returns["MSFT"].corr(returns["IBM"])

# Compute the covariance
returns["MSFT"].cov(returns["IBM"])

8.870655479703546e-05

In [90]:
# Return full correlation or covariance matrix as a DataFrame
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


In [91]:
# Compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame

returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

In [92]:
# Passing a DataFrame computes the correlations of matching column names
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
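The pickle files used above are not reproduced in these notes, so the sketch below uses made-up prices for two hypothetical tickers, `AAA` and `BBB`, to show how the same calls relate to each other. The numbers carry no real-world meaning.

```python
import numpy as np
import pandas as pd

# Made-up prices for two hypothetical tickers.
prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.1, 11.0, 11.5],
    "BBB": [20.0, 19.0, 19.95, 21.0, 20.5],
})

# The first row has no previous value to compare against, so it is NaN.
rets = prices.pct_change()

corr_ab = rets["AAA"].corr(rets["BBB"])
cov_ab = rets["AAA"].cov(rets["BBB"])

# Pearson correlation is the covariance scaled by both standard deviations.
rebuilt = cov_ab / (rets["AAA"].std() * rets["BBB"].std())

corr_matrix = rets.corr()                # symmetric, with 1.0 on the diagonal
with_aaa = rets.corrwith(rets["AAA"])    # each column against AAA

print(corr_ab, rebuilt)
print(corr_matrix)
print(with_aaa)
```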

### Unique Values, Value Counts, Membership
Extracts information about the values contained in a one-dimensional Series


In [93]:
uniques = obj.unique()

In [95]:
uniques

array([0, 1, 2, 3, 4])

In [98]:
# Return a Series containing value frequencies
obj.value_counts()

# Top-level pandas function that also accepts NumPy arrays
# (deprecated since pandas 2.1; pd.Series(arr).value_counts() is the alternative)
pd.value_counts(obj.to_numpy(), sort=False)


obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [100]:
# Performs a vectorized set membership check, filtering a dataset down to a subset of values
obj.isin([1, 2])

a    False
a     True
b     True
b    False
c    False
dtype: bool
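The Boolean result of `isin` is usually fed straight back into the object as a filter. A minimal sketch with a hypothetical Series of letters:

```python
import pandas as pd

letters = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# True where the value is one of the listed candidates.
mask = letters.isin(["b", "c"])

# Boolean indexing keeps only the matching rows and their original labels.
subset = letters[mask]

print(subset)
```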

In [104]:
# Map an array of possibly nondistinct values onto positions in another array of distinct values
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])

unique_values = pd.Series(["c", "a"])

# For each item in to_match, get its position in unique_values; items that do not exist there get -1
indices = pd.Index(unique_values).get_indexer(to_match)

indices

array([ 0,  1, -1, -1,  0,  1])

In [108]:
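data = pd.DataFrame(
    {
        "Qu1": [1, 3, 4, 3, 4],
        "Qu2": [2, 3, 1, 2, 3],
        "Qu3": [1, 5, 2, 4, 4],
    }
)

# Count how often each value appears in each column; the row labels of the
# result are the distinct values found across all columns
result = data.apply(pd.value_counts)

# Values that never occur in a column come back as NaN, so replace them with 0
result.fillna(0)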

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
