<h2> Handle Missing Values </h2><br>
In pandas, one of the most common ways that missing data is introduced into a data set is by reindexing. For example:

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 20)  # use the full option name; the short form can be ambiguous

In [3]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df

Unnamed: 0,one,two,three
a,0.38081,-0.932915,-1.0087
c,-1.997085,0.420278,-0.425561
e,-0.117483,-0.516754,0.952138
f,0.849759,-1.104306,-0.412228
h,0.162258,0.070621,-0.949317


In [4]:
df['four'] = 'bar'
df['five'] = df['one'] > 0

In [5]:
df

Unnamed: 0,one,two,three,four,five
a,0.38081,-0.932915,-1.0087,bar,True
c,-1.997085,0.420278,-0.425561,bar,False
e,-0.117483,-0.516754,0.952138,bar,False
f,0.849759,-1.104306,-0.412228,bar,True
h,0.162258,0.070621,-0.949317,bar,True


<h2> Values Considered Missing </h2>

In [6]:
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2

Unnamed: 0,one,two,three,four,five
a,0.38081,-0.932915,-1.0087,bar,True
b,,,,,
c,-1.997085,0.420278,-0.425561,bar,False
d,,,,,
e,-0.117483,-0.516754,0.952138,bar,False
f,0.849759,-1.104306,-0.412228,bar,True
g,,,,,
h,0.162258,0.070621,-0.949317,bar,True


As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “null”.

To make detecting missing values easier across different array dtypes, pandas provides the <b>isnull()</b> and <b>notnull()</b> functions (also available under the aliases <b>isna()</b> and <b>notna()</b>), which are also methods on Series and DataFrame objects:

In [13]:
pd.isnull(df2['one'])
# df2['one'].isnull()

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [14]:
pd.notnull(df2['four'])
# df2['four'].notnull()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

In [15]:
df2.isnull()

Unnamed: 0,one,two,three,four,five
a,False,False,False,False,False
b,True,True,True,True,True
c,False,False,False,False,False
d,True,True,True,True,True
e,False,False,False,False,False
f,False,False,False,False,False
g,True,True,True,True,True
h,False,False,False,False,False


Be mindful that in Python (and NumPy), NaN values do not compare equal, whereas None values do. pandas and NumPy rely on the fact that np.nan != np.nan, and pandas treats None like np.nan.

In [16]:
np.nan == np.nan

False
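Because of this, an equality test cannot be used to locate missing entries. The following is a minimal sketch on a small made-up Series (the values are illustrative, not taken from `df2`) that contrasts `==` with `isna()`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Comparing against np.nan never matches, so this mask is all False.
eq_mask = s == np.nan

# isna() (an alias of isnull()) flags the missing entry correctly.
na_mask = s.isna()

print(eq_mask.tolist())
print(na_mask.tolist())

# None compares equal to itself, yet pandas still treats it as missing.
print(None == None, pd.isna(None))
```
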

<h2> DateTimes </h2>

In [60]:
df2 = df.copy()

In [61]:
df2['timestamp'] = pd.Timestamp('20120101')

In [62]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,0.38081,-0.932915,-1.0087,bar,True,2012-01-01
c,-1.997085,0.420278,-0.425561,bar,False,2012-01-01
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,0.162258,0.070621,-0.949317,bar,True,2012-01-01


In [63]:
df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan  # .ix was removed from pandas; .loc is label-based

In [64]:
df2.loc[['a', 'c', 'h'], ['one', 'timestamp']]

Unnamed: 0,one,timestamp
a,,NaT
c,,NaT
h,,NaT


For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that NumPy can represent within a single dtype (datetime64[ns]). pandas objects provide intercompatibility between NaT and NaN.
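As a small illustration of NaT behaving like any other missing value, the sketch below builds a datetime Series with one gap. The dates are made up for the example, and it assumes that `pd.to_datetime` converts a `None` entry to NaT:

```python
import pandas as pd

# A None entry in the input becomes NaT in the resulting datetime Series.
ts = pd.Series(pd.to_datetime(["2012-01-01", None, "2012-01-03"]))

# The same detection functions work for NaT as for NaN.
print(ts.isna().tolist())
print(pd.isna(pd.NaT))

# fillna() accepts a Timestamp for datetime data.
filled = ts.fillna(pd.Timestamp("2012-01-02"))
print(filled.notna().all())
```
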

In [65]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.932915,-1.0087,bar,True,NaT
c,,0.420278,-0.425561,bar,False,NaT
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,,0.070621,-0.949317,bar,True,NaT


In [66]:
df2.get_dtype_counts()  # removed in pandas 1.0; newer versions: df2.dtypes.value_counts()

bool              1
datetime64[ns]    1
float64           3
object            1
dtype: int64

<h2> Inserting missing data </h2>

In [67]:
s = pd.Series([1, 2, 3])

In [68]:
s.loc[0]

1

In [69]:
s.loc[0] = None

In [70]:
s

0    NaN
1    2.0
2    3.0
dtype: float64

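As shown above, assigning None to a numeric container stores NaN, regardless of the missing value type given. Likewise, datetime containers will always use NaT. <br>
For object containers, pandas will use the value given: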

In [34]:
s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan
s

0    None
1     NaN
2       c
dtype: object
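Although an object container keeps whichever marker it was given, both markers are detected as missing. A minimal sketch, assuming an explicitly object-typed Series so that the behaviour does not depend on the pandas string dtype defaults:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype=object)
s.loc[0] = None
s.loc[1] = np.nan

# Both None and NaN count as missing.
print(s.isna().tolist())

# dropna() removes both kinds of missing entries.
print(s.dropna().tolist())
```
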

<h2> Calculations with missing data </h2>

In [40]:
a = df2[["one", "two"]]
a

Unnamed: 0,one,two
a,,-0.932915
c,,0.420278
e,-0.117483,-0.516754
f,0.849759,-1.104306
h,,0.070621


In [41]:
b = df2[["one", "two", "three"]]
b

Unnamed: 0,one,two,three
a,,-0.932915,-1.0087
c,,0.420278,-0.425561
e,-0.117483,-0.516754,0.952138
f,0.849759,-1.104306,-0.412228
h,,0.070621,-0.949317


In [42]:
a + b

Unnamed: 0,one,three,two
a,,,-1.865831
c,,,0.840557
e,-0.234965,,-1.033508
f,1.699518,,-2.208611
h,,,0.141242


<ul>
<li>When summing data, NA (missing) values are treated as zero.</li>
<li>If the data are all NA, older pandas versions return NA; since pandas 0.22 the sum is 0 unless <b>min_count=1</b> is passed.</li>
<li>Methods like <b>cumsum()</b> and <b>cumprod()</b> ignore NA values, but preserve them in the resulting arrays.</li>
</ul>
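These rules can be checked on a small made-up Series. The sketch below assumes pandas 0.22 or later, where an all-NA sum returns 0 unless `min_count` is given:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_na = pd.Series([np.nan, np.nan])

# NaN is skipped, so the sum only covers 1.0 and 2.0.
total = s.sum()

# With no valid values the sum is 0.0; min_count=1 asks for NaN instead.
empty_total = all_na.sum()
strict_total = all_na.sum(min_count=1)

# cumsum() skips NaN when accumulating but keeps it in its position.
running = s.cumsum()

print(total, empty_total, strict_total)
print(running.tolist())
```
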

In [49]:
b["one"].sum()

0.73227661031861457

In [50]:
b.cumsum()

Unnamed: 0,one,two,three
a,,-0.932915,-1.0087
c,,-0.512637,-1.434261
e,-0.117483,-1.029391,-0.482123
f,0.732277,-2.133697,-0.894351
h,,-2.063076,-1.843668


In [51]:
b.cumprod()

Unnamed: 0,one,two,three
a,,-0.932915,-1.0087
c,,-0.392084,0.429263
e,-0.117483,0.202611,0.408718
f,-0.099832,-0.223745,-0.168485
h,,-0.015801,0.159945


<h2> Cleaning / Filling Missing Values </h2>

The <b>fillna()</b> method can fill in NA values with non-null data in a couple of ways, for example by replacing them with a scalar value:

In [71]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.932915,-1.0087,bar,True,NaT
c,,0.420278,-0.425561,bar,False,NaT
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,,0.070621,-0.949317,bar,True,NaT


In [72]:
df2.fillna(0)

Unnamed: 0,one,two,three,four,five,timestamp
a,0.0,-0.932915,-1.0087,bar,True,1970-01-01
c,0.0,0.420278,-0.425561,bar,False,1970-01-01
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,0.0,0.070621,-0.949317,bar,True,1970-01-01


In [75]:
df2.loc[df2.index[0], "four"] = None  # first row ('a'); .ix was removed from pandas

In [76]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.932915,-1.0087,,True,NaT
c,,0.420278,-0.425561,bar,False,NaT
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,,0.070621,-0.949317,bar,True,NaT


In [77]:
df2["four"].fillna('missing')

a    missing
c        bar
e        bar
f        bar
h        bar
Name: four, dtype: object
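Besides a single scalar, `fillna()` also accepts a dict that maps column names to fill values, and a computed value such as the column mean can be used as well. A minimal sketch on a small made-up frame (the column names mirror `df2`, the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [np.nan, 1.5, np.nan],
                   "four": [None, "bar", "bar"]})

# A different fill value per column, given as a dict.
filled = df.fillna({"one": 0.0, "four": "missing"})

# Fill a numeric column with its own mean (NaN is skipped when averaging).
mean_filled = df["one"].fillna(df["one"].mean())

print(filled)
print(mean_filled.tolist())
```
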

<h3> Fill gaps Backward or Forward </h3>

In [80]:
df2

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.932915,-1.0087,,True,NaT
c,,0.420278,-0.425561,bar,False,NaT
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,,0.070621,-0.949317,bar,True,NaT


In [81]:
df2.fillna(method='pad')  # the method argument is deprecated since pandas 2.1; newer code: df2.ffill()

Unnamed: 0,one,two,three,four,five,timestamp
a,,-0.932915,-1.0087,,True,NaT
c,,0.420278,-0.425561,bar,False,NaT
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,0.849759,0.070621,-0.949317,bar,True,2012-01-01


<br>
<b>fillna</b> methods:
<br>
<table align='left'>

<tr>
    <th>Method</th>
    <th>Action</th>
</tr>

<tr>
    <td>pad / ffill</td>
    <td>Fill values forward</td>
</tr>

<tr>
    <td>backfill / bfill</td>
    <td>Fill values backward</td>
</tr>
</table>
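The two fill directions are also available as the dedicated methods `ffill()` and `bfill()`. The sketch below uses a made-up Series to show both, plus the `limit` parameter that caps how many consecutive gaps are filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# Backward fill: each gap takes the next valid value after it.
backward = s.bfill()

# limit=1 fills at most one consecutive gap per run.
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

A leading gap cannot be forward filled and a trailing gap cannot be backward filled, because there is no neighbouring value in that direction.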

In [82]:
df2.fillna(method='backfill')  # the method argument is deprecated since pandas 2.1; newer code: df2.bfill()

Unnamed: 0,one,two,three,four,five,timestamp
a,-0.117483,-0.932915,-1.0087,bar,True,2012-01-01
c,-0.117483,0.420278,-0.425561,bar,False,2012-01-01
e,-0.117483,-0.516754,0.952138,bar,False,2012-01-01
f,0.849759,-1.104306,-0.412228,bar,True,2012-01-01
h,,0.070621,-0.949317,bar,True,NaT


<h1> Week 3 Course Work </h1>

In [3]:
import QSTK.qstkutil.DataAccess as da
import QSTK.qstkutil.qsdateutil as du

import datetime as dt



In [4]:
def get_closing_price_df(dt_start, dt_end, symbols):
    days = du.getNYSEdays(dt_start, dt_end, dt.timedelta(hours=16))
    c_dataobj = da.DataAccess('Yahoo')
    ls_keys = ['close']
    data_list = c_dataobj.get_data(days, symbols, ls_keys)
    return data_list[0]

In [5]:
import collections


def simulate_internal(prices_df, symbols, allocations_dict):
    # Daily returns in percent; the first row is NaN because it has no previous day.
    daily_returns_df = (prices_df / prices_df.shift(1) - 1.0) * 100
    # Portfolio return for one day: allocation-weighted sum of the symbol returns.
    portfolio_return_fn = lambda daily_return_row: sum(daily_return_row[symbol] * allocations_dict[symbol] for symbol in symbols)
    daily_returns_df["PORTFOLIO"] = daily_returns_df.apply(portfolio_return_fn, axis=1)
    avg_daily_return_of_portfolio = daily_returns_df["PORTFOLIO"].mean()
    std_daily_return_of_portfolio = daily_returns_df["PORTFOLIO"].std()
    # .iloc replaces the removed .ix indexer for positional row access.
    initial_valuation = sum(prices_df.iloc[0][symbol] * allocations_dict[symbol] for symbol in symbols)
    final_valuation = sum(prices_df.iloc[-1][symbol] * allocations_dict[symbol] for symbol in symbols)
    cum_return_of_portfolio = (final_valuation - initial_valuation) / initial_valuation
    # Annualise the daily ratio using 252 trading days per year.
    sharpe_ratio_of_portfolio = np.sqrt(252) * avg_daily_return_of_portfolio / std_daily_return_of_portfolio
    return collections.OrderedDict([("std_daily_return", std_daily_return_of_portfolio),
                                    ("avg_daily_return", avg_daily_return_of_portfolio),
                                    ("sharpe_ratio", sharpe_ratio_of_portfolio),
                                    ("cumulative_return", cum_return_of_portfolio)])

In [6]:
def simulate(dt_start, dt_end, symbols, allocations):
    prices_df = get_closing_price_df(dt_start, dt_end, symbols)    
    allocations_dict = dict(zip(symbols, allocations))
    return simulate_internal(prices_df, symbols, allocations_dict)

In [7]:
simulate(dt.date(2010,1,1), dt.date(2010,12,31), ["AXP", "HPQ", "IBM", "HNZ"], [0.0, 0.0, 0.0, 1.0])

OrderedDict([('std_daily_return', 0.92615312876845668),
             ('avg_daily_return', 0.076310615267202592),
             ('sharpe_ratio', 1.3079839874416015),
             ('cumulative_return', 0.19810596365497829)])
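The metrics above can be reproduced without QSTK on a tiny hand-made price table. This is only a sketch: the symbols `AAA` and `BBB` and their prices are hypothetical, and the row-wise `apply` is swapped for a vectorised weighted sum that should give the same daily portfolio returns:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for two made-up symbols (not market data).
prices = pd.DataFrame({"AAA": [100.0, 102.0, 101.0, 105.0],
                       "BBB": [50.0, 49.0, 51.0, 52.0]})
allocations = {"AAA": 0.5, "BBB": 0.5}

# Daily returns in percent; the first row is NaN (no previous day).
daily_returns = (prices / prices.shift(1) - 1.0) * 100

# Weight each column by its allocation, then sum across columns.
# skipna=False keeps the first row as NaN instead of turning it into 0.
weights = pd.Series(allocations)
portfolio = daily_returns.mul(weights, axis=1).sum(axis=1, skipna=False)

# mean() and std() skip the leading NaN.
avg_daily_return = portfolio.mean()
std_daily_return = portfolio.std()
sharpe_ratio = np.sqrt(252) * avg_daily_return / std_daily_return

print(avg_daily_return, std_daily_return, sharpe_ratio)
```
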

In [20]:
from itertools import product

def allocation_gen(num_symbols):
    possible_allocations = range(11)
    allocations_list = [possible_allocations for _ in range(num_symbols)]
    for allocation in product(*allocations_list):
        # check if allocation is valid
        if (sum(allocation) == 10):
            yield [i / 10.0 for i in allocation]
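The generator above walks through all 11^n tuples and keeps those that sum to 10. As an alternative sketch (the helper name `compositions` is hypothetical and not part of the course code), the valid tuples can be generated directly, which also makes it easy to confirm how many allocations there are for four symbols:

```python
from math import comb


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


# Convert tenths into fractional allocations, as allocation_gen does.
allocations = [[i / 10.0 for i in combo] for combo in compositions(10, 4)]

# Stars-and-bars count: C(10 + 4 - 1, 4 - 1) = C(13, 3).
print(len(allocations), comb(13, 3))  # 286 286
```

Only 286 of the 14641 candidate tuples are valid for four symbols, so direct generation avoids most of the filtering work.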

In [21]:
def portfolio_gen(dt_start, dt_end, symbols):
    prices_df = get_closing_price_df(dt_start, dt_end, symbols)
    for allocation in allocation_gen(len(symbols)):
        allocations_dict = dict(zip(symbols, allocation))
        portfolio_metric_dict = simulate_internal(prices_df, symbols, allocations_dict)
        # print(allocation, portfolio_metric_dict["sharpe_ratio"])
        yield allocation, portfolio_metric_dict 

In [23]:
def print_optimal_portfolio(date_start, date_end, symbols):
    portfolios = portfolio_gen(date_start, date_end, symbols)
    # Pick the allocation with the highest Sharpe ratio.
    allocation, portfolio_metric_dict = max(portfolios, key=lambda x: x[1]["sharpe_ratio"])
    print(allocation, portfolio_metric_dict["sharpe_ratio"])

In [26]:
symbols_combinations = [['AAPL', 'GOOG', 'IBM', 'MSFT'],
                        ['BRCM', 'ADBE', 'AMD', 'ADI'],
                        ['BRCM', 'TXN', 'AMD', 'ADI'],
                        ['BRCM', 'TXN', 'IBM', 'HNZ'],
                        ['C', 'GS', 'IBM', 'HNZ'],
                        ['AAPL', 'GOOG', 'IBM', 'MSFT'],
                        ['BRCM', 'ADBE', 'AMD', 'ADI'],
                        ['BRCM', 'TXN', 'AMD', 'ADI'],
                        ['BRCM', 'TXN', 'IBM', 'HNZ'],
                        ['C', 'GS', 'IBM', 'HNZ']]

symbols_combinations

[['AAPL', 'GOOG', 'IBM', 'MSFT'],
 ['BRCM', 'ADBE', 'AMD', 'ADI'],
 ['BRCM', 'TXN', 'AMD', 'ADI'],
 ['BRCM', 'TXN', 'IBM', 'HNZ'],
 ['C', 'GS', 'IBM', 'HNZ'],
 ['AAPL', 'GOOG', 'IBM', 'MSFT'],
 ['BRCM', 'ADBE', 'AMD', 'ADI'],
 ['BRCM', 'TXN', 'AMD', 'ADI'],
 ['BRCM', 'TXN', 'IBM', 'HNZ'],
 ['C', 'GS', 'IBM', 'HNZ']]

In [30]:
years = [2011, 2010, 2011, 2010, 2010, 2011, 2011, 2011, 2010, 2010]
dates_dict = { 2010: (dt.date(2010, 1, 1), dt.date(2010, 12, 31)), 
               2011: (dt.date(2011, 1, 1), dt.date(2011, 12, 31))}

In [31]:
for index, (year, symbols) in enumerate(zip(years, symbols_combinations), 1):
    print("#" + str(index))
    date_start = dates_dict[year][0]
    date_end = dates_dict[year][1]
    print_optimal_portfolio(date_start, date_end, symbols)
    print("\n")

#1
[0.2, 0.0, 0.8, 0.0] 1.21259187334


#2
[0.9, 0.0, 0.0, 0.1] 1.05749841789


#3
[0.0, 0.0, 0.0, 1.0] 0.0459499781908


#4
[0.1, 0.1, 0.0, 0.8] 1.41922789702


#5
[0.2, 0.0, 0.0, 0.8] 1.42872696917


#6
[0.2, 0.0, 0.8, 0.0] 1.21259187334


#7
[0.0, 0.0, 0.0, 1.0] 0.0459499781908


#8
[0.0, 0.0, 0.0, 1.0] 0.0459499781908


#9
[0.1, 0.1, 0.0, 0.8] 1.41922789702


#10
[0.2, 0.0, 0.0, 0.8] 1.42872696917


