# Data Integration of Price & Volume Data and Fundamental Data

Below I import the price and volume data and the fundamental data from the existing dataset. I then:
* Drop text data
* Merge the fundamental data into the price and volume data, so that every trading day has price, volume, and fundamental data
* Impute missing values in the merged daily fundamental data as follows:
    * If the current day's data is missing, fill it with the most recent valid data
    * Fill data missing before a ticker's first record with 0 (no data is observable at that point); the statistics show that these zeros make up a trivial proportion of all the data
* This imputation strategy is justified as follows: fundamental data is not updated publicly every day, but it remains a valuable input to daily investment decisions, so we still need to consult it daily. When no updated fundamental data is available on a given day, the most recent fundamental data serves as the current status of the ticker.
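
The imputation rule above can be sketched on a toy frame. This is only an illustration under stated assumptions, not the notebook's exact code: the `impute` helper and the toy tickers and values are made up for the example, and only the column name `totalAssets` comes from the dataset. It forward-fills within each ticker and then zero-fills whatever remains before the first record.

```python
import numpy as np
import pandas as pd

def impute(df, value_cols):
    """Hypothetical helper: forward-fill per ticker, then zero-fill leading gaps."""
    out = df.copy()
    # Forward-fill inside each ticker so values never leak from one ticker to the next
    out[value_cols] = out.groupby("Ticker")[value_cols].ffill()
    # Anything still missing precedes the ticker's first valid record
    out[value_cols] = out[value_cols].fillna(0)
    return out

# Toy data (made up): two tickers, each with a gap before its first record
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "totalAssets": [np.nan, 10.0, np.nan, np.nan, 7.0],
})
filled = impute(toy, ["totalAssets"])
print(filled["totalAssets"].tolist())
```

Grouping by ticker before the forward fill is one way to keep the last value of one ticker from spilling into the next; the cells below get the same effect by zero-filling each ticker's first row before a global forward fill.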

|  Feature  |  Available?  | Can Calculate? |
|  ----  | ----  | ----  |
| Accruals | No | Yes |
| Abnormal Earning Announcement Volume | Yes | Yes |
| Asset Growth | No | Yes |
| Bid-ask spread | No | No |
| Beta | No | Yes |
| Book To Market | No | Yes |
| Cash Holdings | Yes | No |
| Cash Flow to Debt | No | Yes |
| Cash Flow to Price | No | Yes |
| Change in Inventory | No | Yes |
| Change in 6-month momentum | No | Yes |
| Change in tax | No | Yes |
| Investments | No | Yes |
| Depreciation | Yes | No |
| Dividend to Price | No | Yes |
| Growth in Common Stock | No | Yes |
| Earnings to Price | No | Yes |
| Gross Profit | Yes | No |
| Capital Expenditures | Yes | No |
| Net Operating Assets | Yes | No |
| Leverage | No | Yes |
| Growth in long-term debt | No | Yes |
| 12 Month Momentum | No | Yes |
| 1 Month Momentum | No | Yes |
| 36 Month Momentum | No | Yes |
| 6 Month Momentum | No | Yes |
| Change in Depreciation | No | Yes |
| Return volatility | No | Yes |
| Return on Assets | No | Yes |
| Earnings Volatility | No | Yes |
| Return on Equity | No | Yes |
| Return on Capital | No | Yes |
| Revenue Surprise | No | Yes |
| Income growth | No | Yes |
| Cash Flow Volatility | No | Yes |


In [1]:
import pandas as pd

f = pd.read_csv("fundamental.csv", index_col=0)
f

Unnamed: 0,currency_symbol,totalAssets,intangibleAssets,earningAssets,otherCurrentAssets,totalLiab,totalStockholderEquity,deferredLongTermLiab,otherCurrentLiab,commonStock,...,netIncomeFromContinuingOps,netIncomeApplicableToCommonShares,preferredStockAndOtherAdjustments,beforeAfterMarket,currency,epsActual,epsEstimate,epsDifference,surprisePercent,Ticker
1985-09-30,USD,,,,,,,,,,...,,,,,,,,,,MMM
1985-12-31,USD,,,,,,,,,,...,,,,,,,,,,MMM
1986-03-31,USD,,,,,,,,,,...,,,,,,,,,,MMM
1986-06-30,USD,,,,,,,,,,...,,,,,,,,,,MMM
1986-09-30,USD,,,,,,,,,,...,,,,,,,,,,MMM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-06-30,USD,1.377000e+10,1.390000e+09,,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,5000000.0,...,529000000.0,529000000.0,,BeforeMarket,USD,1.2,1.22,-0.02,-1.6393,ZTS
2022-09-30,,,,,,,,,,,...,,,,BeforeMarket,USD,,1.25,,,ZTS
2022-12-31,,,,,,,,,,,...,,,,AfterMarket,USD,,,,,ZTS
2023-03-31,,,,,,,,,,,...,,,,BeforeMarket,USD,,,,,ZTS


In [2]:
cols = f.columns
for col in cols[:-1]:
    if f[col].dtype == float or f[col].dtype == int:
        continue
    else:
        print("ignore " + col)
        f = f.drop(col, axis = 1)

ignore currency_symbol
ignore beforeAfterMarket
ignore currency
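
The same filtering can probably be written without an explicit loop. The sketch below assumes a toy frame whose column names mirror the real one (the values are made up). It keeps every numeric column via `select_dtypes` and adds `Ticker` back explicitly. Note that `include="number"` matches all numeric dtypes, which is slightly broader than the `float`/`int` check in the cell above.

```python
import pandas as pd

# Toy frame (made-up values) with one text column, two numeric columns and the ticker
toy = pd.DataFrame({
    "currency_symbol": ["USD", "USD"],
    "totalAssets": [1.0, 2.0],
    "epsActual": [0.5, None],
    "Ticker": ["MMM", "MMM"],
})

# Keep numeric columns, then re-attach the ticker identifier
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
kept = toy[numeric_cols + ["Ticker"]]
print(kept.columns.tolist())
```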


In [3]:
f = f.reset_index().rename(columns={'index': 'Date'}).reset_index()
f

Unnamed: 0,index,Date,totalAssets,intangibleAssets,earningAssets,otherCurrentAssets,totalLiab,totalStockholderEquity,deferredLongTermLiab,otherCurrentLiab,...,totalOtherIncomeExpenseNet,discontinuedOperations,netIncomeFromContinuingOps,netIncomeApplicableToCommonShares,preferredStockAndOtherAdjustments,epsActual,epsEstimate,epsDifference,surprisePercent,Ticker
0,0,1985-09-30,,,,,,,,,...,,,,,,,,,,MMM
1,1,1985-12-31,,,,,,,,,...,,,,,,,,,,MMM
2,2,1986-03-31,,,,,,,,,...,,,,,,,,,,MMM
3,3,1986-06-30,,,,,,,,,...,,,,,,,,,,MMM
4,4,1986-09-30,,,,,,,,,...,,,,,,,,,,MMM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62114,62114,2022-06-30,1.377000e+10,1.390000e+09,,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,-56000000.0,,529000000.0,529000000.0,,1.2,1.22,-0.02,-1.6393,ZTS
62115,62115,2022-09-30,,,,,,,,,...,,,,,,,1.25,,,ZTS
62116,62116,2022-12-31,,,,,,,,,...,,,,,,,,,,ZTS
62117,62117,2023-03-31,,,,,,,,,...,,,,,,,,,,ZTS


In [4]:
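# Index of the first row of each ticker
first_indexes = f.groupby('Ticker')['index'].min()
cols = [i for i in f.columns if (i != 'index' and i != 'Date')]
# Zero-fill each ticker's first row so the forward fill below cannot carry values across tickers
f.loc[first_indexes, cols] = f.loc[first_indexes, cols].fillna(0)
f = f.ffill()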
f

Unnamed: 0,index,Date,totalAssets,intangibleAssets,earningAssets,otherCurrentAssets,totalLiab,totalStockholderEquity,deferredLongTermLiab,otherCurrentLiab,...,totalOtherIncomeExpenseNet,discontinuedOperations,netIncomeFromContinuingOps,netIncomeApplicableToCommonShares,preferredStockAndOtherAdjustments,epsActual,epsEstimate,epsDifference,surprisePercent,Ticker
0,0,1985-09-30,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0000,MMM
1,1,1985-12-31,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0000,MMM
2,2,1986-03-31,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0000,MMM
3,3,1986-06-30,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0000,MMM
4,4,1986-09-30,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0000,MMM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62114,62114,2022-06-30,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,-56000000.0,0.0,529000000.0,529000000.0,0.0,1.2,1.22,-0.02,-1.6393,ZTS
62115,62115,2022-09-30,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,-56000000.0,0.0,529000000.0,529000000.0,0.0,1.2,1.25,-0.02,-1.6393,ZTS
62116,62116,2022-12-31,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,-56000000.0,0.0,529000000.0,529000000.0,0.0,1.2,1.25,-0.02,-1.6393,ZTS
62117,62117,2023-03-31,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,-56000000.0,0.0,529000000.0,529000000.0,0.0,1.2,1.25,-0.02,-1.6393,ZTS


In [5]:
import numpy as np

f['Accrual'] = 0
f['Accrual'] = f['earningAssets'].diff()
f.loc[first_indexes, 'Accrual'] = 0

f['assetGrowth'] = 0
f['assetGrowth'] = f['totalAssets'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'assetGrowth'] = 0
    
f['cashToDebt'] = (f['cash'] / f['netDebt']).replace([np.inf, -np.inf], np.nan).fillna(0)

f['inventoryGrowth'] = 0
f['inventoryGrowth'] = f['inventory'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'inventoryGrowth'] = 0

f['taxGrowth'] = 0
f['taxGrowth'] = f['taxProvision'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'taxGrowth'] = 0

f['commonStockGrowth'] = 0
f['commonStockGrowth'] = f['commonStock'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'commonStockGrowth'] = 0

f['leverage'] = (f['netDebt'] / f['ebitda']).replace([np.inf, -np.inf], np.nan).fillna(0)

f['longTermDebtGrowth'] = 0
f['longTermDebtGrowth'] = f['longTermDebt'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'longTermDebtGrowth'] = 0

f['depreciationGrowth'] = 0
f['depreciationGrowth'] = f['depreciation'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'depreciationGrowth'] = 0

f['equityGrowth'] = 0
f['equityGrowth'] = f['totalStockholderEquity'].pct_change().replace([np.inf, -np.inf], np.nan).fillna(0)
f.loc[first_indexes, 'equityGrowth'] = 0

f

Unnamed: 0,index,Date,totalAssets,intangibleAssets,earningAssets,otherCurrentAssets,totalLiab,totalStockholderEquity,deferredLongTermLiab,otherCurrentLiab,...,Accrual,assetGrowth,cashToDebt,inventoryGrowth,taxGrowth,commonStockGrowth,leverage,longTermDebtGrowth,depreciationGrowth,equityGrowth
0,0,1985-09-30,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000
1,1,1985-12-31,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.000000,0.000000,0.000000,0.00000,0.0,-0.000000,0.000000,0.000000,0.000000
2,2,1986-03-31,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000
3,3,1986-06-30,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000
4,4,1986-09-30,0.000000e+00,0.000000e+00,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.000000e+00,...,0.0,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62114,62114,2022-06-30,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,0.0,-0.006494,0.648067,0.071949,0.06015,0.0,4.864286,-0.001339,0.026316,-0.016745
62115,62115,2022-09-30,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,0.0,0.000000,0.648067,0.000000,0.00000,0.0,4.864286,0.000000,0.000000,0.000000
62116,62116,2022-12-31,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,0.0,0.000000,0.648067,0.000000,0.00000,0.0,4.864286,0.000000,0.000000,0.000000
62117,62117,2023-03-31,1.377000e+10,1.390000e+09,0.0,507000000.0,9.190000e+09,4.580000e+09,274000000.0,1.269000e+09,...,0.0,0.000000,0.648067,0.000000,0.00000,0.0,4.864286,0.000000,0.000000,0.000000
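
The growth features in the cell above all follow one pattern: take a percentage change, replace infinities, and reset the first row of each ticker to 0. A minimal sketch of that pattern as a helper follows. The function name `growth` and the toy values are hypothetical. It uses `groupby(...).pct_change()` so that the change is computed within each ticker rather than corrected afterwards.

```python
import numpy as np
import pandas as pd

def growth(df, col):
    """Hypothetical helper: period-over-period growth of `col` within each ticker.

    Undefined values (first row of a ticker, division by zero) become 0.
    """
    g = df.groupby("Ticker")[col].pct_change()
    return g.replace([np.inf, -np.inf], np.nan).fillna(0)

# Toy data (made up): AAA starts at 0, which produces an infinite change on its second row
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "totalAssets": [0.0, 100.0, 110.0, 50.0, 25.0],
})
toy["assetGrowth"] = growth(toy, "totalAssets")
print(toy["assetGrowth"].round(4).tolist())
```

With such a helper, each feature would reduce to one line, for example `f['taxGrowth'] = growth(f, 'taxProvision')`.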


In [6]:
d = pd.read_csv("price.csv")
m = pd.merge(d, f, on=['Date', 'Ticker'], how='left')
m

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Ticker,Adj Close,...,Accrual,assetGrowth,cashToDebt,inventoryGrowth,taxGrowth,commonStockGrowth,leverage,longTermDebtGrowth,depreciationGrowth,equityGrowth
0,1962-01-02,0.000000,0.771045,0.748367,0.754036,212800.0,0.0,0.0,MMM,,...,,,,,,,,,,
1,1962-01-03,0.000000,0.759705,0.741280,0.759705,422400.0,0.0,0.0,MMM,,...,,,,,,,,,,
2,1962-01-04,0.000000,0.772462,0.759705,0.759705,212800.0,0.0,0.0,MMM,,...,,,,,,,,,,
3,1962-01-05,0.000000,0.756871,0.737027,0.739862,315200.0,0.0,0.0,MMM,,...,,,,,,,,,,
4,1962-01-08,0.000000,0.741280,0.731358,0.735610,334400.0,0.0,0.0,MMM,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4085646,2022-10-07,151.080002,151.410004,146.949997,147.369995,2022200.0,0.0,0.0,ZTS,,...,,,,,,,,,,
4085647,2022-10-10,148.100006,148.100006,144.440002,145.779999,1569000.0,0.0,0.0,ZTS,,...,,,,,,,,,,
4085648,2022-10-11,145.770004,148.490005,144.600006,146.250000,1583400.0,0.0,0.0,ZTS,,...,,,,,,,,,,
4085649,2022-10-12,146.929993,148.009995,145.589996,145.860001,1474200.0,0.0,0.0,ZTS,,...,,,,,,,,,,
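
The left join above matches only on exact dates, so most daily rows start out empty and are filled later. As an alternative sketch (not what this notebook does), `pd.merge_asof` can attach the most recent fundamental record to each trading day in one step. The toy prices and the first `totalAssets` value are made up for illustration. Both frames need a datetime `Date` column sorted in ascending order.

```python
import pandas as pd

# Toy daily prices (made-up values)
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2022-06-29", "2022-06-30", "2022-07-01", "2022-07-05"]),
    "Ticker": ["ZTS"] * 4,
    "Close": [170.0, 171.0, 172.0, 173.0],
})
# Toy quarterly fundamentals (the first value is made up)
fundamentals = pd.DataFrame({
    "Date": pd.to_datetime(["2022-03-31", "2022-06-30"]),
    "Ticker": ["ZTS", "ZTS"],
    "totalAssets": [1.386e10, 1.377e10],
})

# For each price row, take the latest fundamental row of the same ticker dated on or before it
merged = pd.merge_asof(
    prices.sort_values("Date"),
    fundamentals.sort_values("Date"),
    on="Date",
    by="Ticker",
    direction="backward",
)
print(merged[["Date", "totalAssets"]])
```

Rows dated before a ticker's first fundamental record would still come back as NaN here, so the zero-fill step described at the top would still apply.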


In [7]:
t = m[['Ticker']].copy()  # explicit copy, so adding a column does not trigger SettingWithCopyWarning
t['index'] = m.index
first_record_index = t.groupby('Ticker')['index'].min().values
m.loc[first_record_index, :] = m.loc[first_record_index, :].fillna(0)
m = m.ffill()
m


Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Ticker,Adj Close,...,Accrual,assetGrowth,cashToDebt,inventoryGrowth,taxGrowth,commonStockGrowth,leverage,longTermDebtGrowth,depreciationGrowth,equityGrowth
0,1962-01-02,0.000000,0.771045,0.748367,0.754036,212800.0,0.0,0.0,MMM,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,1962-01-03,0.000000,0.759705,0.741280,0.759705,422400.0,0.0,0.0,MMM,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
2,1962-01-04,0.000000,0.772462,0.759705,0.759705,212800.0,0.0,0.0,MMM,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,1962-01-05,0.000000,0.756871,0.737027,0.739862,315200.0,0.0,0.0,MMM,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,1962-01-08,0.000000,0.741280,0.731358,0.735610,334400.0,0.0,0.0,MMM,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4085646,2022-10-07,151.080002,151.410004,146.949997,147.369995,2022200.0,0.0,0.0,ZTS,0.0,...,0.0,0.0,0.648067,0.0,0.0,0.0,4.864286,0.0,0.0,0.0
4085647,2022-10-10,148.100006,148.100006,144.440002,145.779999,1569000.0,0.0,0.0,ZTS,0.0,...,0.0,0.0,0.648067,0.0,0.0,0.0,4.864286,0.0,0.0,0.0
4085648,2022-10-11,145.770004,148.490005,144.600006,146.250000,1583400.0,0.0,0.0,ZTS,0.0,...,0.0,0.0,0.648067,0.0,0.0,0.0,4.864286,0.0,0.0,0.0
4085649,2022-10-12,146.929993,148.009995,145.589996,145.860001,1474200.0,0.0,0.0,ZTS,0.0,...,0.0,0.0,0.648067,0.0,0.0,0.0,4.864286,0.0,0.0,0.0


In [8]:
m['Book To Market'] = ((m['totalAssets'] - m['totalLiab']) / (m['Close'] * m['Volume'])).replace([np.inf, -np.inf], np.nan).fillna(0)
m['Cash To Price'] = m['cash'] / m['Close']
m['Dividend To Price'] = m['Dividends'] / m['Close']
m['Earning To Price'] = m['totalRevenue'] / m['Close']


# Position of each row within its ticker in m. first_indexes cannot be used here:
# it holds row labels of f, not m, so it does not mark where each ticker starts in m.
pos = m.groupby('Ticker').cumcount()

for n in (30, 180, 360, 1080):
    col = f'{n} Day Momentum'
    m[col] = m['Close'].diff(periods=n)
    # The first n rows of a ticker would otherwise difference against the previous ticker's prices
    m.loc[pos < n, col] = 0

# Next-day return; the last row of each ticker has no next day within the same ticker
m['return'] = m['Close'].pct_change()
m['return'] = m['return'].shift(-1)
m.loc[[i - 1 for i in first_record_index if i > 0], 'return'] = np.nan
m = m.dropna()
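
The momentum and next-day return logic can also be sketched with group-wise operations, which avoids tracking ticker boundaries by hand. This is a minimal illustration on made-up data with a 2-row window in place of the 30/180/360/1080-row windows above. The column name `2 Day Momentum` is hypothetical.

```python
import pandas as pd

# Toy data (made up): four rows for AAA followed by three rows for BBB
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 3,
    "Close": [10.0, 11.0, 12.0, 15.0, 100.0, 90.0, 99.0],
})
by_ticker = toy.groupby("Ticker")["Close"]

# Price difference over 2 rows within a ticker; 0 where the window is incomplete
toy["2 Day Momentum"] = by_ticker.diff(periods=2).fillna(0)

# Each row carries the next row's percentage change within the same ticker;
# the last row of every ticker has no next row and stays NaN
toy["return"] = by_ticker.pct_change().groupby(toy["Ticker"]).shift(-1)
print(toy)
```

Dropping the NaN rows afterwards, as the cell above does with `dropna()`, removes the final row of each ticker.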


In [9]:
m

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Ticker,Adj Close,...,equityGrowth,Book To Market,Cash To Price,Dividend To Price,Earning To Price,30 Day Momentum,180 Day Momentum,360 Day Momentum,1080 Day Momentum,return
0,1962-01-02,0.000000,0.771045,0.748367,0.754036,212800.0,0.0,0.0,MMM,0.0,...,0.0,0.000000,0.000000e+00,0.0,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.007518
1,1962-01-03,0.000000,0.759705,0.741280,0.759705,422400.0,0.0,0.0,MMM,0.0,...,0.0,0.000000,0.000000e+00,0.0,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000
2,1962-01-04,0.000000,0.772462,0.759705,0.759705,212800.0,0.0,0.0,MMM,0.0,...,0.0,0.000000,0.000000e+00,0.0,0.000000e+00,0.000000,0.000000,0.000000,0.000000,-0.026119
3,1962-01-05,0.000000,0.756871,0.737027,0.739862,315200.0,0.0,0.0,MMM,0.0,...,0.0,0.000000,0.000000e+00,0.0,0.000000e+00,0.000000,0.000000,0.000000,0.000000,-0.005747
4,1962-01-08,0.000000,0.741280,0.731358,0.735610,334400.0,0.0,0.0,MMM,0.0,...,0.0,0.000000,0.000000e+00,0.0,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4085645,2022-10-06,153.789993,154.949997,152.240005,152.589996,1324100.0,0.0,0.0,ZTS,0.0,...,0.0,22.668283,1.735369e+07,0.0,1.344780e+07,-10.040009,-47.272629,-19.856491,68.937531,-0.034209
4085646,2022-10-07,151.080002,151.410004,146.949997,147.369995,2022200.0,0.0,0.0,ZTS,0.0,...,0.0,15.368529,1.796838e+07,0.0,1.392414e+07,-18.160004,-53.867706,-25.165802,64.809265,-0.010789
4085647,2022-10-10,148.100006,148.100006,144.440002,145.779999,1569000.0,0.0,0.0,ZTS,0.0,...,0.0,20.023712,1.816436e+07,0.0,1.407601e+07,-14.110001,-53.833527,-20.464081,63.891823,0.003224
4085648,2022-10-11,145.770004,148.490005,144.600006,146.250000,1583400.0,0.0,0.0,ZTS,0.0,...,0.0,19.777845,1.810598e+07,0.0,1.403077e+07,-11.619995,-53.104462,-23.993378,64.995415,-0.002667


In [10]:
m.to_parquet('data.parquet.gzip',
              compression='gzip')  