<img style="float: right;" width="120" src="../Images/supplier-logo.png">
<img style="float: left; margin-top: 0" width="80" src="../Images/client-logo.png">
<br><br><br>


# Synopsis

This notebook is a quick tour of some of the tools you will be using to perform data analysis on Financial Markets Data using the python suite of technologies.

At the end of this notebook you will have a solid grounding in the following areas:

- **Jupyter Notebook** - How to use the Jupyter notebook to perform Data Analysis using Python.
<br><br>
- **Data Analysis** - A sense of the type of analysis you can reasonably expect to be able to perform using this technology.
<br><br>
- **Python Packages** - pandas, numpy and matplotlib.pyplot and what they are used for.
<br><br>
**Note That:**
- This is a quick tour of data analysis using python.

- There will be a lot of commands and character sequences that you do not understand.

- Do not **panic** or **worry** about this. We will explain all of these in due course.

- The important part of this introductory session is for you to get a sense of what you can do with this technology.

- We will cover the "nitty gritty" in later labs and notebooks.

# Packages

- In order to perform any analysis on data, you must load the correct python package into your notebook.
<br><br>
- A package (sometimes called a library) is essentially a set of python commands, functions and methods that you can type into your notebook to perform some work and get some results.
<br><br>
- You will use different python packages to perform different types of work.
<br><br>
- For example, producing graphs and charts requires a different package (set of commands) than performing heavy duty mathematical computation.
<br><br>
- The main package we will use in this course is the package that allows us to work with tables of data, very similar to using a spreadsheet such as Microsoft Excel, Google Sheets or Apple Numbers.
<br><br>
- This python library is called **pandas**.
<br><br>
- Use the **import** statement to load this package into your notebook

In [None]:
import pandas as pd

# python

**Congratulations!** You have just executed your first line of python code.

# DataFrames

- A DataFrame is a rectangular table of data.
- Contains rows and columns
- Columns have headings
- Rows have an index
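
As a minimal sketch of that structure, the cell below builds a tiny DataFrame by hand. The symbols, names and prices are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
data = {
    'Name': ['Alpha Corp', 'Beta Inc', 'Gamma Ltd'],
    'Price': [10.0, 20.5, 30.25],
}

# The index supplies the row labels, the dictionary keys become the column headings
small_df = pd.DataFrame(data, index=['AAA', 'BBB', 'CCC'])
small_df.index.name = 'Symbol'

print(small_df.columns.tolist())  # ['Name', 'Price']
print(small_df.index.tolist())    # ['AAA', 'BBB', 'CCC']
print(small_df.shape)             # (3, 2), i.e. 3 rows and 2 columns
```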


## DataFrames - Creating a DataFrame from an Excel file

There are quite a few ways to create a DataFrame.
<br><br>
Load the data in the worksheet SP500 from the spreadsheet sample_data.xls

In [None]:
df = pd.read_excel(io='../Data/sample_data.xls', sheet_name='SP500', index_col='Symbol')

df

## DataFrames - Executing Functions / Methods

- There are literally hundreds of methods you can execute on a DataFrame
- Try the following
<br><br>
df.head() - the first 5 rows
<br>
df.head(n) - the first n rows
<br>
df.tail() - the last 5 rows
<br>
df.tail(n) - the last n rows
<br>

In [None]:
df.head()
df.head(8)
df.tail()
df.tail(8)


## DataFrames - Rows and Columns 

Selecting data from a DataFrame for specific row(s) or column(s)
<br><br>
- use df['ColA'] to select a single column
- use df[['ColA', 'ColB']] to select multiple columns
- use df.loc['RowA'] to select a single row by its label
- use df.loc[['RowA', 'RowB']] to select multiple rows by their labels

In [None]:
df['Name']
df[['Name', 'Sector', 'Price']]

df.loc['FB']
#
df.loc[['IBM', 'MMM', 'C']]

## DataFrames - Tip

The syntax for selecting multiple rows and multiple columns looks a bit awkward, especially if you are not familiar with computer programming.

The confusion arises from the following question:
- Why use a single set of square brackets, df['ColA'], in some cases and a double set, df[['ColA', 'ColB']], in others?
- The inner pair of brackets is python's syntax for a list. To select several columns you pass a list of column names inside the outer selection brackets.
<br><br>

Quite often, analysts will simply create a list of the columns or rows they are interested in, give this list a name and use it instead.

In [None]:
cols = ['Name', 'Sector', 'Price']
df[cols]

rows = ['IBM', 'MMM', 'C']
df.loc[rows]
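
To make the single versus double bracket distinction concrete, here is a small sketch on a hand-made DataFrame. The data and the names `demo`, `single` and `double` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
demo = pd.DataFrame(
    {'Name': ['Alpha Corp', 'Beta Inc'], 'Price': [10.0, 20.5]},
    index=['AAA', 'BBB'],
)

single = demo['Price']      # one label -> a Series (a single column)
double = demo[['Price']]    # a list of labels -> a DataFrame, even with one column

# The inner brackets are just a python list, so a named list works the same way
cols = ['Name', 'Price']
same = demo[cols].equals(demo[['Name', 'Price']])

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(same)                   # True
```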

## DataFrames - functions on columns of data

It's easy to apply functions to specific columns of data.

Select the column(s) of interest

Use the .function() syntax

- df['ColA'].mean() -- the mean of ColA
- df['ColB'].min() -- the minimum of ColB
- df['ColA'].count() -- a count of how many non-missing values are in ColA
- df['ColB'].max() -- the maximum of ColB
- df[mycols].median() -- the median of each column in the list mycols

In [None]:
cols = ['Price/Earnings', 'Earnings/Share']

df['Earnings/Share'].mean()
df['Price/Earnings'].min()
df['Name'].count()
df[cols].max()
df[cols].median()

# max and min also have a meaning when working with string values
df['Name'].max()
df['Name'].min()

# numpy

**numpy** (numerical python) is a python package used quite a lot for performing numerical calculations.

numpy has been designed specifically to perform advanced computations on large sets of numbers.

It is fast and memory efficient, and it is often used together with pandas when complex numerical calculations are needed on large sets of data.

numpy is an advanced library designed to meet the needs of advanced users, but a rudimentary understanding of numpy is important to become efficient with pandas and DataFrames.

In [None]:
import numpy as np

## numpy arrays

A numpy array is a collection of values.

Internally numpy stores the values in a numpy array in a highly efficient and optimized manner (details of this are way outside the remit of this course)

More than likely, you will NEVER need to know the internal structure of a numpy array (or any numpy data structure)

However, you should appreciate or simply accept that when performing calculations on sets of numbers, you will need to use the numpy package

In [None]:
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([10, 20, 30, 40, 50])

## numpy array  -- arithmetic

Use standard mathematical notation for adding, subtracting and comparing entire arrays.

For those who understand computer programming, note: **NO NEED TO LOOP OVER EACH VALUE**

In [None]:
# E.g. Add all values in arr1 and arr2

# Note there is no need for looping etc.
arr1 + arr2

# Same for other operations
arr1 * arr2

arr2 - arr1
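
For comparison, the sketch below adds the same two arrays twice, once with an explicit loop and once with a single array expression, and checks that the results agree. The variable names `looped` and `vectorized` are illustrative only.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([10, 20, 30, 40, 50])

# Explicit loop: add the values one pair at a time
looped = []
for a, b in zip(arr1, arr2):
    looped.append(a + b)

# Array arithmetic: numpy adds the whole arrays in one expression
vectorized = arr1 + arr2

print(vectorized.tolist())                 # [11, 22, 33, 44, 55]
print(np.array_equal(looped, vectorized))  # True
```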

## numpy array -- comparisons

In [None]:
# same for comparisons
arr1 > arr2
arr1 < arr2
arr1 <= arr2
arr1 >= arr2
arr1 == arr2
arr1 != arr2

# and also more complex expressions
(arr1 + 12) >= 15
(arr2 - arr1) == 27

## numpy - boolean operators


In [None]:
np.logical_or(  (arr1 % 2) == 0, (arr2 % 20) == 0 )

np.logical_and( (arr1 % 2) == 0, (arr2 % 20) == 0 )

np.logical_not( (arr1 % 2) == 0 )
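
The same element-by-element logic can also be written with the operators `|`, `&` and `~`, which is the style used with DataFrames later in this notebook. A minimal sketch follows, with illustrative variable names. Each comparison is wrapped in parentheses because these operators bind more tightly than `==`.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([10, 20, 30, 40, 50])

is_even = (arr1 % 2) == 0       # True where the value in arr1 is even
is_mult_20 = (arr2 % 20) == 0   # True where the value in arr2 is a multiple of 20

either = is_even | is_mult_20   # same result as np.logical_or
both = is_even & is_mult_20     # same result as np.logical_and
not_even = ~is_even             # same result as np.logical_not

print(np.array_equal(either, np.logical_or(is_even, is_mult_20)))  # True
print(np.array_equal(both, np.logical_and(is_even, is_mult_20)))   # True
print(not_even.tolist())  # [True, False, True, False, True]
```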


## numpy - scientific functions

numpy comes with a whole battery of mathematical functions

logs and exponential functions<br>
trigonometry<br>
advanced statistics

These work on individual values or on arrays of values

In [None]:
np.square(6)
np.square(arr1)

np.log10(100)
np.log(arr2)

np.exp(arr1)

## array arithmetic with a DataFrame

The same style of arithmetic can be used on the columns of a DataFrame

df['ColA'] - df['ColB']

df['ColA'] > 12

In [None]:
df['52 Week High'] - df['52 Week Low']

df['Earnings/Share'] < 0
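
The results of column arithmetic are often stored back in the DataFrame or used to pick out rows. The sketch below shows both on a small hand-made table. The data, the `Range` column name and the `losers` variable are assumptions made for the example and do not come from the course spreadsheet.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
prices = pd.DataFrame(
    {
        '52 Week High': [120.0, 80.0, 45.0],
        '52 Week Low': [90.0, 50.0, 30.0],
        'Earnings/Share': [2.5, -0.4, 1.1],
    },
    index=['AAA', 'BBB', 'CCC'],
)

# Store the result of the column arithmetic as a new column
prices['Range'] = prices['52 Week High'] - prices['52 Week Low']

# Use the True/False result of a comparison to keep only the matching rows
losers = prices[prices['Earnings/Share'] < 0]

print(prices['Range'].tolist())  # [30.0, 30.0, 15.0]
print(losers.index.tolist())     # ['BBB']
```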

## numpy - where

The numpy **where** function is extremely useful. We will use it in some of the financial analysis later in this course.

np.where is a function that accepts 3 parameters:<br>
* an expression<br>
* a value to return if the expression is True<br>
* a value to return if the expression is False

For example

The % operator is known as the modulo operator.

It returns the remainder left over after dividing one number by another.

e.g. 

5 % 2 = 1
<br>
4 % 2 = 0
<br>
3 % 2 = 1

A simple way to test whether a number is even is to check that the number modulo 2 returns 0

Try the following:
<br>
np.where(arr1 % 2 == 0, 'Even', 'Odd')

This should return the correct word 'Even' or 'Odd' for each number in the array 'arr1'


In [None]:
arr1 = np.array([1, 2, 3, 4, 5, 6])
arr2 = np.array([2, 2, 4, 4, 6, 6])

np.where(arr1 % 2 == 0, 'Even', 'Odd')
np.where(arr1 == arr2, 'Matched', 'Did not Match')

## numpy where on a DataFrame

np.where can be used on a DataFrame:

e.g.

np.where(df['Earnings/Share'] < 0, 'Negative', 'Positive')

In [None]:
np.where(df['Earnings/Share'] < 0, 'Negative', 'Positive')

np.where( (df['Sector'] == 'Industrials') & (df['Earnings/Share'] > 0), 'Good', 'Bad')
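
np.where returns a plain numpy array, so a common next step is to store that array as a new column. The sketch below does this on a small hand-made table. The data and the column names `EPS Sign` and `Rating` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only
stocks = pd.DataFrame(
    {
        'Sector': ['Industrials', 'Financials', 'Industrials'],
        'Earnings/Share': [2.5, 1.0, -0.4],
    },
    index=['AAA', 'BBB', 'CCC'],
)

# One label per row, returned as a numpy array
labels = np.where(stocks['Earnings/Share'] < 0, 'Negative', 'Positive')
print(type(labels).__name__)  # ndarray

# Store the labels as a new column
stocks['EPS Sign'] = labels

# Combine two conditions with & (each one wrapped in parentheses)
good = (stocks['Sector'] == 'Industrials') & (stocks['Earnings/Share'] > 0)
stocks['Rating'] = np.where(good, 'Good', 'Bad')

print(stocks['EPS Sign'].tolist())  # ['Positive', 'Positive', 'Negative']
print(stocks['Rating'].tolist())    # ['Good', 'Bad', 'Bad']
```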

# Graphs and Charts

There are lots of sophisticated charting and graphing packages for python

The one we will be using is called **matplotlib.pyplot**

You can draw simple line charts of columns by using the **plot()** method on a column


In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

## Plot some values

e.g. a horizontal bar chart showing the 52 Week High and 52 Week Low for the famous FANG stocks (Facebook, Amazon, Netflix and Google)

In [None]:
cols = ['52 Week High','52 Week Low']
rows = ['FB', 'AMZN', 'NFLX', 'GOOGL']

df.loc[rows, cols].plot(kind='barh')


# Bollinger Bands

Bollinger bands are a type of statistical chart characterizing the prices and volatility over time of a financial instrument or commodity, using a formulaic method propounded by John Bollinger in the 1980s. 

Financial traders employ these charts as a methodical tool to inform trading decisions, control automated trading systems, or as a component of technical analysis. 

Bollinger bands display a graphical band, usually an upper and a lower band.

This is a very simple example of 
- creating a Bollinger band 2 std deviations above and below the rolling average of Google's closing price
- displaying the Bollinger band graphically

## Load in the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Load in the data

Load the worksheet GOOGL from the spreadsheet market_data.xls into a DataFrame

In [None]:
df = pd.read_excel('../Data/market_data.xls', sheet_name='GOOGL', index_col='Date', parse_dates=True)

## Create a new DataFrame 

We are going to create a new DataFrame that holds the closing price plus 3 derived columns.

This DataFrame will be used to store data for the Bollinger band which will be derived from the original data

- **'SMA(Close)'** A 21-day rolling average (roughly one month of trading days) of the 'Close' column
- **'Upper'** which will be 2 std deviations of the price above **'SMA(Close)'**
- **'Lower'** which will be 2 std deviations of the price below **'SMA(Close)'**

**NOTE** <br>
- Don't panic if you don't understand the syntax, commands, rolling(21) etc.<BR>
- We will cover this in subsequent lessons.<BR>
- Understand that this is a very typical analysis that Financial Data Analysts perform on a daily basis.<BR>
- Appreciate that python, pandas, etc. make this type of analysis extremely easy.

In [None]:
df.head()

In [None]:
# Create an empty DataFrame
# This is another way to create a DataFrame
# - earlier we read the data directly from an Excel file
# - this method creates an empty DataFrame in memory

df_BOLL = pd.DataFrame()

# Copy the 'Close' column from the original DataFrame into this DataFrame
df_BOLL['Close'] = df['Close']
df_BOLL['SMA(Close)'] = df_BOLL['Close'].rolling(21).mean()
df_BOLL['Upper'] = df_BOLL['SMA(Close)'] + 2 * df_BOLL['Close'].rolling(21).std()
df_BOLL['Lower'] = df_BOLL['SMA(Close)'] - 2 * df_BOLL['Close'].rolling(21).std()
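
The same calculation can be tried without the course spreadsheet. The sketch below repeats it on synthetic prices: a random walk generated with numpy, which is an assumption made purely for illustration and is not real market data. It also shows that the first 20 rows of the rolling columns are empty (NaN), because a 21-day window needs 21 values before it can produce a result.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (a random walk), invented purely for illustration
rng = np.random.default_rng(seed=0)
dates = pd.date_range('2017-01-02', periods=60, freq='B')  # business days
close = pd.Series(100 + rng.normal(0, 1, size=60).cumsum(), index=dates)

window = 21  # roughly one month of trading days

bands = pd.DataFrame({'Close': close})
rolling = bands['Close'].rolling(window)

bands['SMA(Close)'] = rolling.mean()
bands['Upper'] = bands['SMA(Close)'] + 2 * rolling.std()
bands['Lower'] = bands['SMA(Close)'] - 2 * rolling.std()

# The first window - 1 rows are NaN because the window is not yet full
print(bands['SMA(Close)'].isna().sum())  # 20

# Rows where all four columns have a value
print(bands.dropna().head())
```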

## Plot the results

In [None]:
fig = plt.figure(figsize=(18,6))

fig.suptitle('Google -- 2 stds above/below closing price')
plt.xlabel('Date')
plt.ylabel('Close')

plt.plot(df_BOLL)

## Plot the results for 2017

In [None]:
fig = plt.figure(figsize=(18,6))

fig.suptitle('Google (2017) -- 2 stds above/below closing price')
plt.xlabel('Date')
plt.ylabel('Close')

plt.plot(df_BOLL.loc['2017'])

# Time resampling

Resampling converts time series data from one frequency to another.

Here we are going to resample from daily pricing information to weekly, monthly or even yearly.
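
Before working with the real data, here is a minimal sketch of the idea on a hand-made daily series. The values are simply the numbers 1 to 59, an assumption chosen so the results are easy to check by eye. The rule `'W'` groups the rows by week and `'MS'` groups them by calendar month, labelling each group with the first day of the month. `'MS'` is used in this sketch because the alias for month-end grouping differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices: the values 1.0 to 59.0 across Jan and Feb 2017
days = pd.date_range('2017-01-01', '2017-02-28', freq='D')
daily = pd.Series(np.arange(1.0, len(days) + 1), index=days, name='Close')

# Fewer, coarser rows: one per week, or one per month
weekly_max = daily.resample('W').max()
monthly_mean = daily.resample('MS').mean()

print(len(daily))             # 59
print(monthly_mean.tolist())  # [16.0, 45.5]
print(weekly_max.tail())
```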

## Resample Open, High, Low, Close - Annually

Group the daily rows by year, then take the maximum of each column.

In [None]:
cols = ['Open', 'High', 'Low', 'Close']

df[cols].resample(rule='Y').max()

## Resample Open, High, Low, Close - by Business Quarter

Note that a business quarter end is the last business day of the quarter, which is not always the same as the calendar quarter end.

Take the mean value for each business quarter.

Display the first 20 rows.

In [None]:
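df_ohlc = df[cols].resample(rule='BQ').mean().head(20)

# Save the result to an Excel workbook.
# The with block saves and closes the file automatically.
# Recent pandas versions can no longer write the legacy .xls format, so .xlsx is used here.
with pd.ExcelWriter('../Output/ohlc.xlsx') as w:
    df_ohlc.to_excel(excel_writer=w, sheet_name='OHLC')

df_ohlc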

## Resample Open, High, Low, Close - by Month for 2015

Take the median value for each month.

Display the result (2015 has only 12 months, so asking for the first 20 rows shows them all).

In [None]:
df.loc['2015', cols].resample(rule='M').median().head(20)