<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Python for Finance Key Skills

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

## Pandas Package

Pandas, a Python library for data analysis, has become exceptionally popular in the finance industry, and for good reason. Here are ten major reasons for its widespread adoption:

1. **Data Handling**: Pandas provides robust tools for loading data from various sources such as CSV, Excel, SQL databases, and even financial data APIs. Once loaded, it allows for efficient manipulation and transformation of data, making it an indispensable tool for preprocessing financial data.

2. **Time Series Analysis**: Financial data is often time series data. Pandas provides a comprehensive suite of functions for time series manipulation, including date range generation, resampling, shifting, and rolling window operations.

3. **Performance**: Pandas is built on top of NumPy, benefiting from its speed. Additionally, critical parts of Pandas are written in Cython, which makes it even faster for certain operations. This performance is essential when dealing with large financial datasets.

4. **Flexibility**: Pandas DataFrames are mutable, can store data of multiple types, and can handle missing data. This flexibility means that finance professionals can represent and manipulate a wide variety of datasets easily.

5. **Integration with Other Libraries**: Pandas works seamlessly with many other Python libraries, including visualization libraries like Matplotlib and Seaborn, and machine learning libraries like scikit-learn. This makes it an integral part of the broader data analysis ecosystem.

6. **Excel-Like Operations**: Many finance professionals are comfortable with Excel. Pandas provides a familiar interface for operations like pivot tables, filtering, and sorting, making the transition from Excel to Python smoother.

7. **Extensibility**: If Pandas doesn't have a specific function you need, you can easily extend it by defining your own functions and applying them to DataFrames or Series.

8. **Open Source**: Being open source means that Pandas is developed in the open and continually improved by its community. This helps the library keep pace with the requirements of its users, including those in the finance industry.

9. **Grouping and Aggregation**: With its powerful `groupby` functionality, Pandas can segment datasets into groups and perform aggregate operations on them. This is particularly useful in finance for tasks like portfolio categorization and performance analysis.

10. **Community and Resources**: The large and active community around Pandas ensures that there are plenty of tutorials, courses, and forums available. This wealth of resources is invaluable for finance professionals looking to troubleshoot issues or learn new techniques.

In conclusion, Pandas combines ease of use, performance, and versatility, making it a go-to tool for finance professionals engaged in data analysis.
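
To make a few of the points above concrete, here is a minimal sketch of the time series and `groupby` tools on synthetic data. The tickers `AAA` and `BBB`, the random-walk prices, and the window length are assumptions made purely for illustration; none of them come from a real data source.

```python
import numpy as np
import pandas as pd

# Synthetic daily "prices" for two hypothetical tickers (illustration only).
rng = np.random.default_rng(seed=42)
dates = pd.date_range("2023-01-02", periods=60, freq="B")  # business days
prices = pd.DataFrame(
    100 + rng.standard_normal((60, 2)).cumsum(axis=0),
    index=dates,
    columns=["AAA", "BBB"],
)

# Time series tools: simple returns, a rolling mean, weekly resampling.
returns = prices.pct_change().dropna()
rolling_mean = prices.rolling(window=5).mean()
weekly = prices.resample("W").last()

# Grouping and aggregation: average daily return per calendar month.
monthly_mean = returns.groupby(returns.index.month).mean()

print(weekly.head())
print(monthly_mean)
```

The first four rows of `rolling_mean` are `NaN` because a five-day window is not yet complete there.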

### `DataFrame` Object

In [None]:
!git clone https://github.com/tpq-classes/pff_key_skills.git
import sys
sys.path.append('pff_key_skills')


In [None]:
import numpy as np

In [None]:
l = list(range(15))
l

In [None]:
a = np.array(l)
a

In [None]:
b = a.reshape(5, 3)
b

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(b)
df

In [None]:
print(df)

In [None]:
type(df)

In [None]:
df.values

In [None]:
df.index = list('abcde')

In [None]:
df

In [None]:
df.columns = list('XYZ')

In [None]:
df
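
The cells above build the `DataFrame` step by step. As a sketch of an equivalent shortcut, the row labels and column names can also be passed directly to the constructor through its `index` and `columns` parameters:

```python
import numpy as np
import pandas as pd

# Same 5x3 array as above, created in one step.
b = np.arange(15).reshape(5, 3)

# Row labels and column names are set at construction time.
df = pd.DataFrame(b, index=list("abcde"), columns=list("XYZ"))
print(df)
```

Both routes should give the same object; assigning `df.index` and `df.columns` afterwards is handy when the labels only become known later.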

### Numerical Operations

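In current pandas versions, aggregation operations such as `sum()` and `mean()` default to "column-first" (`axis=0`): they aggregate down the rows and return one value per column. Pass `axis=1` to aggregate across the columns and get one value per row instead.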
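
A small self-contained sketch of the two directions, rebuilding the 5x3 example frame so that the block runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(15).reshape(5, 3), index=list("abcde"), columns=list("XYZ")
)

col_sums = df.sum()         # default axis=0: one value per column
row_sums = df.sum(axis=1)   # axis=1: one value per row

print(col_sums)
print(row_sums)
```

Here `col_sums` is indexed by the column names `X`, `Y`, `Z`, while `row_sums` is indexed by the row labels `a` to `e`.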

In [None]:
pd.__version__

In [None]:
df.sum()

In [None]:
df.sum().sum()

In [None]:
df.values.sum()

In [None]:
df.sum(axis=0)

In [None]:
df.mean()

In [None]:
df.mean(axis=1)

In [None]:
2 * df

In [None]:
df ** 2

In [None]:
2 ** df

In [None]:
np.sqrt(df)

In [None]:
# help(df.apply)

In [None]:
np.max(df, axis=0)

In [None]:
df.apply(np.max)

In [None]:
def f(x):
    return x ** 2

In [None]:
f(df)

In [None]:
df.apply(f)

In [None]:
df.apply(lambda x: x ** 2)
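
The `apply` calls above use element-wise functions, so the direction of application does not matter. With a reducing function it does. The following sketch uses a hypothetical helper, `value_range`, to show that `apply` passes one column at a time by default and one row at a time with `axis=1`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(15).reshape(5, 3), index=list("abcde"), columns=list("XYZ")
)

def value_range(s):
    """Spread between the largest and smallest value of a Series."""
    return s.max() - s.min()

per_column = df.apply(value_range)          # each column is passed as a Series
per_row = df.apply(value_range, axis=1)     # each row is passed as a Series

print(per_column)
print(per_row)
```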

### Indexing & Selection

In [None]:
df.index

In [None]:
df.columns

In [None]:
df['X']

In [None]:
df[['Y', 'Z']]

In [None]:
df.loc['a']

In [None]:
df.loc['c':]

In [None]:
df.loc[:'c']

In [None]:
df.loc[:, 'X']

In [None]:
df.loc[:, 'Y':]

In [None]:
df.loc['c':, 'Y':]

In [None]:
# help(df.iloc)

In [None]:
df.iloc[0]

In [None]:
df.iloc[:, 0]

In [None]:
df.iloc[0, 0]

In [None]:
df.iloc[2:, :2]

In [None]:
df.iloc[2:4, :2]
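
One detail that is easy to miss in the cells above: label-based slices with `.loc` include the end label, whereas position-based slices with `.iloc` exclude the end position, as in plain Python. A minimal sketch on the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(15).reshape(5, 3), index=list("abcde"), columns=list("XYZ")
)

# Label-based: both 'd' and 'Y' are part of the result.
by_label = df.loc["b":"d", "X":"Y"]

# Position-based: positions 4 (rows) and 2 (columns) are excluded.
by_position = df.iloc[1:4, 0:2]

print(by_label)
print(by_label.equals(by_position))
```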

### Speed Comparisons

In [None]:
from numpy.random import default_rng

In [None]:
rng = default_rng()

In [None]:
la = rng.random((10_000_000, 5))

In [None]:
la

In [None]:
ldf = pd.DataFrame(la)

In [None]:
ldf

In [None]:
%time la.sum(axis=0)

In [None]:
%timeit la.sum(axis=0)

In [None]:
%time ldf.sum(axis=0)

In [None]:
%timeit ldf.sum(axis=0)

In [None]:
%timeit ldf.values.sum(axis=0)

In [None]:
%time np.sqrt(la)

In [None]:
%timeit np.sqrt(la)

In [None]:
%time res = np.sqrt(ldf)

In [None]:
%timeit np.sqrt(ldf)

In [None]:
%timeit ldf.apply(np.sqrt)
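
The `%time` and `%timeit` magics only work in IPython and Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. The smaller array size and the repeat counts below are arbitrary assumptions chosen to keep the run short, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
la = rng.random((1_000_000, 5))   # smaller than above to keep the run short
ldf = pd.DataFrame(la)

# The three ways of summing column-wise that are compared above.
candidates = {
    "ndarray.sum": lambda: la.sum(axis=0),
    "DataFrame.sum": lambda: ldf.sum(axis=0),
    "DataFrame.to_numpy().sum": lambda: ldf.to_numpy().sum(axis=0),
}

# Best of three repeats, five calls each, reported per call.
timings = {
    name: min(timeit.repeat(func, number=5, repeat=3)) / 5
    for name, func in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:>26}: {seconds * 1e3:.2f} ms per call")
```

All three variants compute the same column sums, so any difference in the timings reflects overhead rather than a different result.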

### Meta Information

In [None]:
ldf.head()

In [None]:
ldf.head(3)

In [None]:
ldf.tail()

In [None]:
ldf.info()

In [None]:
# pd.set_option('display.float_format', lambda x: '%.4f' % x)

In [None]:
ldf.describe()
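
`describe()` returns an ordinary `DataFrame`, so individual statistics can be selected with `.loc`. The sketch below also shows `pd.option_context` as a temporary alternative to the commented-out `pd.set_option` call above. The small example frame is an assumption made to keep the block self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(15).reshape(5, 3), index=list("abcde"), columns=list("XYZ")
)

stats = df.describe()                      # summary statistics per column
print(stats.loc[["mean", "min", "max"]])   # select rows like in any DataFrame

# Apply the float format only inside the with-block.
with pd.option_context("display.float_format", "{:.4f}".format):
    print(stats)
```

Outside the `with` block the display option returns to its previous value, which avoids changing the formatting for the rest of the session.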
