Aware Super data cleaning template  

**Note**

- Total Weighting (%): 2.8948964855295465
- Total Weighting of all Sub Total (%): 0.9564964855295466
- Total Weighting without Sub Total (%): 0.9218999999999999
- Some `Internally Managed` `Unlisted` products are recorded with no value or weighting, yet their `SUBTOTAL` is still positive. These rows are currently kept as is.


Step 1: Find the cut-off point and remove the unneeded information at the end of the table

In [10]:
import re

import pandas as pd

df_raw = pd.read_excel('aware.xlsx', sheet_name='Table1', header=None)

# Extract the effective date (YYYY-MM-DD) from the first cell
first_cell = df_raw.iloc[0, 0]
match = re.search(r'\d{4}-\d{2}-\d{2}', str(first_cell))
effective_date = match.group(0) if match else None

print(f"Effective Date: {effective_date}")

Effective Date: 2024-12-31
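
The date lookup above can be tried in isolation. The following is a minimal sketch, assuming the first cell holds free text (or a timestamp) that contains an ISO-style date; `extract_iso_date` and the sample strings are hypothetical and are not taken from `aware.xlsx`.

```python
import re


def extract_iso_date(cell):
    """Return the first YYYY-MM-DD substring found in `cell`, or None."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", str(cell))
    return match.group(0) if match else None


# Hypothetical cell contents, for illustration only
with_date = extract_iso_date("Holdings as at 2024-12-31 00:00:00")
without_date = extract_iso_date("Holdings as at 31 December 2024")

print(with_date)
print(without_date)
```

A cell that spells the date in another format yields `None`, so it may be worth checking `effective_date` before it is written into the table.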


In [11]:
# Skip the first row which contains the effective date
df = pd.read_excel('aware.xlsx', sheet_name = 'Table1',skiprows=1)  

# find the cut-off point 
first_col = df.columns[0] 
cutoff_index = df[
    df[first_col].astype(str).str.contains(
        r"The value \(AUD\) and weighting \(%\) sub totals may not sum", na=False
    )
].index.min()
print("Cut main table before row:", cutoff_index)


Cut main table before row: 2615
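
To see how the cut-off search behaves, here is a small sketch on a toy frame. The rows are hypothetical; only the regular expression is taken from the cell above. It also shows that `.loc` slicing is label-based and includes the end label, and that taking a `.copy()` of a slice gives an independent frame that can receive new columns.

```python
import pandas as pd

# Hypothetical rows: two holdings, a blank spacer row, then footnotes
toy = pd.DataFrame({
    "ASSET CLASS": [
        "CASH",
        "LISTED EQUITY",
        None,
        "The value (AUD) and weighting (%) sub totals may not sum (footnote)",
        "Another footnote",
    ],
    "VALUE(AUD)": [100.0, 200.0, None, None, None],
})

first_col = toy.columns[0]
mask = toy[first_col].astype(str).str.contains(
    r"The value \(AUD\) and weighting \(%\) sub totals may not sum", na=False
)
cutoff = toy[mask].index.min()       # label of the first footnote row

# .loc includes the end label: this keeps labels 0..cutoff-2
main = toy.loc[:cutoff - 2].copy()   # independent copy of the slice
footer = toy.loc[cutoff:].copy()

main["Fund Name"] = "aware"          # modifies the copy only, not `toy`
print(cutoff, len(main), len(footer))
```

If the footnote text is not found, `index.min()` returns `NaN`, so a guard on `cutoff` may be useful before slicing.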


In [12]:
# Separate the main table and the summary table.
# .loc slicing is label-based and inclusive, so this keeps rows up to
# cutoff_index - 2 and drops the row immediately before the cut-off row.
# .copy() makes df_main an independent DataFrame, so adding columns in the
# next cell does not raise SettingWithCopyWarning.
df_main = df.loc[:cutoff_index - 2].copy()
df_summary = df.loc[cutoff_index:].copy()    # cutoff row and everything after

In [13]:
# Add Effective Date, Option Name and Fund Name columns

df_main['Effective Date'] = effective_date
df_main['Option Name'] = 'Balanced Growth'
df_main['Fund Name'] = 'aware'


Step 2: Handle sub total rows, then merge and rename columns

In [14]:
# Handle 'Sub Total' rows
import numpy as np

def sub_total_rule(df):
    """
    RULES:
    1. "TOTAL INVESTMENT ITEMS" → NAME OF INSTITUTION
    2. "SUB TOTAL..." strings → split across multiple columns
    3. Other values remain unchanged
    """
    df_transformed = df.copy()
    
    # .items() yields index labels, which is what .loc expects below
    for idx, value in df_transformed['ASSET CLASS'].items():
        if pd.isna(value):
            continue
            
        value_str = str(value).strip()
        
        # Rule 1: Move "TOTAL INVESTMENT ITEMS"
        if value_str == "TOTAL INVESTMENT ITEMS":
            df_transformed.loc[idx, 'NAME OF INSTITUTION'] = value_str
            df_transformed.loc[idx, 'ASSET CLASS'] = np.nan
            
        # Rule 2: Handle "SUB TOTAL" entries  
        elif value_str.startswith("SUB TOTAL"):
            parts = value_str.split()
            management_type = None
            remaining_parts = []
            
            for i, part in enumerate(parts):
                if part in ["INTERNALLY", "EXTERNALLY"]:
                    management_type = part
                elif not (part == "SUB" or (part == "TOTAL" and i > 0 and parts[i-1] == "SUB")):
                    remaining_parts.append(part)
            
            df_transformed.loc[idx, 'NAME OF INSTITUTION'] = "SUB TOTAL"
            if management_type:
                df_transformed.loc[idx, 'INTERNALLY MANAGED OR EXTERNALLY MANAGED'] = management_type
            if remaining_parts:
                df_transformed.loc[idx, 'ASSET CLASS'] = " ".join(remaining_parts)
            else:
                df_transformed.loc[idx, 'ASSET CLASS'] = np.nan
    
    return df_transformed
df_main = sub_total_rule(df_main)
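
The token handling in Rule 2 can be checked on single strings. This is a sketch that mirrors the loop above in a standalone helper; `split_sub_total` and the sample labels are hypothetical and may not match the exact wording used in the spreadsheet.

```python
def split_sub_total(label):
    """Split a 'SUB TOTAL ...' label into (management_type, asset_class)."""
    tokens = label.split()
    management_type = None
    remaining = []
    for i, token in enumerate(tokens):
        if token in ("INTERNALLY", "EXTERNALLY"):
            management_type = token
        elif token == "SUB" or (token == "TOTAL" and i > 0 and tokens[i - 1] == "SUB"):
            continue  # drop the 'SUB TOTAL' prefix itself
        else:
            remaining.append(token)
    asset_class = " ".join(remaining) if remaining else None
    return management_type, asset_class


# Hypothetical labels, for illustration only
print(split_sub_total("SUB TOTAL LISTED EQUITY INTERNALLY"))
print(split_sub_total("SUB TOTAL CASH EXTERNALLY MANAGED"))
print(split_sub_total("SUB TOTAL"))
```

Note that only the words `INTERNALLY` and `EXTERNALLY` are pulled out. If a label also contains the word `MANAGED`, that word stays in the asset class text, so the resulting `Asset Class Name` values may need a further clean-up.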

In [15]:
# Rename some columns
df_main.rename(columns={
    'ASSET CLASS': 'Asset Class Name',
    'INTERNALLY MANAGED OR EXTERNALLY MANAGED': 'Int/Ext',
    'UNITS HELD': 'Units Held',
    'ADDRESS': 'Address',
    'VALUE(AUD)': 'Value (AUD)',
    'WEIGHTING(%)': 'Weighting (%)',
    'CURRENCY': 'Currency',
    '% OWNERSHIP / PROPERTY HELD': '% Ownership'
}, inplace=True)

# Convert 'Int/Ext' to binary
def convert_int_ext(value):
    if isinstance(value, str):
        val = value.strip().upper()
        if val == 'INTERNALLY':
            return 0
    return 1  # 1 for 'EXTERNALLY' and for any other value, including missing values

df_main['Int/Ext'] = df_main['Int/Ext'].apply(convert_int_ext)

# Combine 4 overlapping columns into one
df_main['Name/Kind of Investment Item'] = df_main[
    ['NAME OF INSTITUTION', 
     'NAME OF ISSUER / COUNTERPARTY', 
     'NAME OF FUND MANAGER', 
     'NAME / KIND OF INVESTMENT ITEM']
].replace('-', np.nan).bfill(axis=1).iloc[:, 0]

# Split 'SECURITY IDENTIFIER' into 2 columns at the first run of whitespace
df_main[['Stock ID', 'Listed Country']] = df_main['SECURITY IDENTIFIER'].str.extract(r'^([^ ]+)\s+(.*)$')


# Final selection and reordering of columns
df_main = df_main[[
    'Effective Date',
    'Fund Name',
    'Option Name',
    'Asset Class Name',
    'Int/Ext',
    'Name/Kind of Investment Item',
    'Currency',
    'Stock ID',
    'Listed Country',
    '% Ownership',
    'Units Held',
    'Address',
    'Value (AUD)',
    'Weighting (%)'
]]
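
The two column operations in this cell (coalescing the name columns and splitting the identifier) can be sketched on a toy frame. The values below are hypothetical and only three of the four name columns are used, to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "NAME OF INSTITUTION": ["Bank A", "-", np.nan],
    "NAME OF FUND MANAGER": [np.nan, "Manager B", "-"],
    "NAME / KIND OF INVESTMENT ITEM": [np.nan, np.nan, "Item C"],
    "SECURITY IDENTIFIER": [np.nan, "ABC AU", "XYZ"],
})

name_cols = [
    "NAME OF INSTITUTION",
    "NAME OF FUND MANAGER",
    "NAME / KIND OF INVESTMENT ITEM",
]

# Treat '-' as missing, back-fill across columns, then take the first column:
# each row ends up with its first non-missing name, reading left to right.
toy["Name"] = toy[name_cols].replace("-", np.nan).bfill(axis=1).iloc[:, 0]

# Two capture groups: text before the first whitespace, and the rest
toy[["Stock ID", "Listed Country"]] = toy["SECURITY IDENTIFIER"].str.extract(
    r"^([^ ]+)\s+(.*)$"
)

print(toy[["Name", "Stock ID", "Listed Country"]])
```

Because the pattern requires whitespace, an identifier with no space (such as the hypothetical `XYZ`) matches nothing and both new columns are left missing for that row. Whether such identifiers occur in the real file has not been checked here.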

Step 3: Standardise column dtypes

In [16]:
## Convert 'Value (AUD)', 'Units Held', '% Ownership' and 'Weighting (%)' to float

# Value(AUD) — remove '$' and ',' and convert to float
df_main['Value (AUD)'] = (
    df_main['Value (AUD)']
    .astype(str)
    .str.replace(r'[\$,]', '', regex=True)
    .str.replace('nan', '')  # Handle NaN converted to string
    .replace(['', 'nan'], pd.NA)
    .pipe(pd.to_numeric, errors='coerce')
)

# Units Held — convert to float
df_main['Units Held'] = (
    df_main['Units Held']
    .astype(str)
    .str.replace(r'[,]', '', regex=True)
    .str.replace('nan', '')
    .replace(['', 'nan'], pd.NA)
    .pipe(pd.to_numeric, errors='coerce')
)

# Weighting(%) — remove '%' and convert to float, then divide by 100
df_main['Weighting (%)'] = (
    df_main['Weighting (%)']
    .astype(str)
    .str.replace('%', '', regex=False)
    .str.replace('nan', '')
    .replace(['', 'nan'], pd.NA)
    .pipe(pd.to_numeric, errors='coerce') / 100
)

# % Ownership — convert to float and divide by 100
df_main['% Ownership'] = (
    df_main['% Ownership']
    .astype(str)
    .str.replace('%', '', regex=False)  # strip any '%' signs
    .str.replace('nan', '')
    .replace(['', 'nan'], pd.NA)
    .pipe(pd.to_numeric, errors='coerce') / 100
)
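
The four conversions above follow the same pattern, so they could be folded into one helper. This is a sketch under that assumption; `to_number` and its `divisor` parameter are hypothetical names, and the sample values are made up.

```python
import pandas as pd


def to_number(series, divisor=1.0):
    """Strip '$', ',' and '%' from a column, convert to float, then divide."""
    text = (
        series.astype(str)
        .str.replace(r"[\$,%]", "", regex=True)
        .str.strip()
    )
    # errors='coerce' turns anything non-numeric ('nan', 'None', '-') into NaN
    return pd.to_numeric(text, errors="coerce") / divisor


# Hypothetical raw values, for illustration only
raw_values = pd.Series(["$1,234.50", None, "-", 42])
raw_weights = pd.Series(["12.5%", "0.25%", None])

values = to_number(raw_values)
weights = to_number(raw_weights, divisor=100)  # percentages become fractions

print(values.tolist())
print(weights.tolist())
```

With `errors='coerce'` the explicit `'nan'` and empty-string replacements become unnecessary, since unparseable text is converted to `NaN` in one step.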

In [17]:
# Export the cleaned table to CSV

df_main.to_csv('aware_cleaned.csv', index=False, encoding='utf-8-sig')

In [18]:
## Weighting (%) checks (values are fractions, so 1.0 = 100%)
# Total weighting across every row of the main table
total = df_main['Weighting (%)'].sum()
# Total weighting of the 'SUB TOTAL' rows only (expected to match the 'TOTAL INVESTMENT ITEMS' row)
total_sub_total = df_main[df_main['Name/Kind of Investment Item'] == 'SUB TOTAL']['Weighting (%)'].sum()
# Total weighting excluding the 'SUB TOTAL' and 'TOTAL INVESTMENT ITEMS' rows
total_without_sub_total = total - total_sub_total - df_main[df_main['Name/Kind of Investment Item'] == 'TOTAL INVESTMENT ITEMS']['Weighting (%)'].sum()
print(f"Total Weighting (%): {total}")
print(f"Total Weighting of all Sub Total (%): {total_sub_total}")
print(f"Total Weighting without Sub Total (%): {total_without_sub_total}")

Total Weighting (%): 2.8948964855295465
Total Weighting of all Sub Total (%): 0.9564964855295466
Total Weighting without Sub Total (%): 0.9218999999999999
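
The three totals can also be compared automatically. The sketch below runs on a toy frame with hypothetical weights, and the tolerance of `0.005` is an assumed threshold, not one taken from the source data.

```python
import math

import pandas as pd

# Hypothetical weights (fractions), for illustration only
toy = pd.DataFrame({
    "Name/Kind of Investment Item": [
        "Holding A", "Holding B", "SUB TOTAL",
        "Holding C", "SUB TOTAL",
        "TOTAL INVESTMENT ITEMS",
    ],
    "Weighting (%)": [0.30, 0.20, 0.50, 0.50, 0.50, 1.00],
})

label = toy["Name/Kind of Investment Item"]
is_sub = label.eq("SUB TOTAL")
is_total = label.eq("TOTAL INVESTMENT ITEMS")

holdings = toy.loc[~is_sub & ~is_total, "Weighting (%)"].sum()
subtotals = toy.loc[is_sub, "Weighting (%)"].sum()
grand_total = toy.loc[is_total, "Weighting (%)"].sum()

TOLERANCE = 0.005  # assumed allowance for rounding in the published figures
holdings_match = math.isclose(holdings, subtotals, abs_tol=TOLERANCE)
total_match = math.isclose(subtotals, grand_total, abs_tol=TOLERANCE)

print(f"holdings={holdings:.4f} subtotals={subtotals:.4f} total={grand_total:.4f}")
print(f"holdings vs subtotals: {holdings_match}, subtotals vs total: {total_match}")
```

In the output printed above, the sub total sum (about 0.956) and the holdings sum (about 0.922) differ by more than this tolerance, so a check of this kind would flag the gap for review. This is consistent with the note about sub totals that have no underlying value or weighting.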
