CBUS data cleaning template

**Note**

- The `Security ID` list does not include a country code. The splitting script is still applied so that it works if country codes appear in future files.
- `Security Name` (company name) is available and might be used in the future to extract the listed country.
- `Effective Date` is not available in the raw data, so it is added manually. **To be added at Step 2**
- `Int/Ext` is not available in the raw data; the whole portfolio is managed *externally*.

In [67]:
# Requires: pip install chardet
# Detect the encoding of the CSV file from its first 100,000 bytes
import chardet

with open('cbus.csv', 'rb') as f:
    result = chardet.detect(f.read(100000))
    print(result)  # a dict with 'encoding', 'confidence' and 'language' keys


{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
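
`chardet` only inspects the first 100,000 bytes here, so its answer is a guess about the rest of the file. One possible cross-check is to try decoding the full contents with a short list of candidate encodings. The sketch below is illustrative only: `first_working_encoding` and the candidate order are assumptions, not part of the original workflow.

```python
def first_working_encoding(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper. cp1252 accepts almost any byte sequence, so it is
    tried last and acts as a fallback rather than as proof of the encoding.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Small in-memory samples instead of the real file
plain = b"Security Name,Weight\nBHP,1.2%\n"
accented_cp1252 = "Soci\u00e9t\u00e9 G\u00e9n\u00e9rale".encode("cp1252")
accented_utf8 = "Soci\u00e9t\u00e9 G\u00e9n\u00e9rale".encode("utf-8")

print(first_working_encoding(plain))
print(first_working_encoding(accented_cp1252))
print(first_working_encoding(accented_utf8))
```

For the real file, the bytes would come from `open('cbus.csv', 'rb').read()`.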


Step 1: Find the cut-off point and remove unnecessary information at the end of the table

In [68]:
# Skip the first row, which contains no useful information.
# cp1252 is a superset of the ASCII that chardet detected in the sample.
import pandas as pd
df = pd.read_csv('cbus.csv', encoding='cp1252', skiprows=1)

# Find the cut-off point 
first_col = df.columns[0] 
cutoff_index = df[
    df[first_col].astype(str).str.contains(
        r"Portfolio Holdings Information for Investment Option", na=False
    )
].index.min()
print("Cut main table before row:", cutoff_index)


Cut main table before row: 2176


In [69]:
# separate the main table and summary table
df_main = df.loc[:cutoff_index - 1]   # all rows before cutoff
df_summary = df.loc[cutoff_index:]    # cutoff row and everything after
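
If the marker text is ever missing from a file, `.index.min()` returns `NaN` and the slicing above fails. The sketch below shows one way to guard against that on a toy table. `split_at_marker` and the toy rows are hypothetical, and it slices by position so it does not depend on the index labels.

```python
import pandas as pd

MARKER = "Portfolio Holdings Information for Investment Option"


def split_at_marker(df, marker=MARKER):
    """Split df into (main, summary) at the first row whose first column contains marker.

    Hypothetical helper: if the marker is absent, everything is treated as
    the main table and the summary is empty.
    """
    first_col = df.columns[0]
    hits = df[first_col].astype(str).str.contains(marker, regex=False, na=False)
    if not hits.any():
        return df.copy(), df.iloc[0:0].copy()
    pos = int(hits.to_numpy().argmax())  # position of the first matching row
    return df.iloc[:pos].copy(), df.iloc[pos:].copy()


# Toy data standing in for the raw file
toy = pd.DataFrame({
    "Table Order": ["1", "1", MARKER + ": Example", "2"],
    "Section": ["Cash", "Cash TOTAL", None, "Summary row"],
})

main, summary = split_at_marker(toy)
print(len(main), len(summary))
```

Returning copies also means later column assignments on the main table do not trigger `SettingWithCopyWarning`.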


Step 2: Add 'Effective Date', 'Fund Name' and 'Option Name' columns. Insert the effective date manually

In [70]:

# Copy first: df_main is a slice of df, and assigning to a slice triggers SettingWithCopyWarning
df_main = df_main.copy()

# Add 'Effective Date', 'Fund Name' and 'Option Name' columns
this_period = pd.to_datetime('31/12/2024', format='%d/%m/%Y')
df_main['Effective Date'] = this_period
df_main['Fund Name'] = 'cbus'
df_main['Option Name'] = 'Balanced Growth'


Step 3: Handle sub totals, merge and rename columns

In [71]:
# Remove 'Table Order' column
df_main = df_main.drop(columns=['Table Order'], errors='ignore')

# Merge 5 overlapping columns into 'Name/Kind of Investment Item' column

df_main['Name/Kind of Investment Item'] = df_main[
    ['Security Name',
     'Portfolio Name', 
     'Manager Name', 
     'Issuer', 
     'Institution']
].bfill(axis=1).iloc[:, 0]
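
The merge relies on `bfill(axis=1)`, which fills each missing cell from the next non-missing cell to its right, so the first column of the result holds the first non-missing value in each row. A toy illustration with made-up values:

```python
import pandas as pd

# Hypothetical rows: each row has its name in a different column
toy = pd.DataFrame({
    "Security Name": ["BHP Group", None, None],
    "Manager Name": [None, "Manager A", None],
    "Issuer": [None, "ignored, a value already exists to the left", None],
})

# After a row-wise back-fill, column 0 holds the first non-missing value per row
merged = toy.bfill(axis=1).iloc[:, 0]
print(merged)
```

The column order in the list therefore sets the priority: when a row has values in more than one of the columns, the leftmost one wins.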

In [72]:
# Add an 'Int/Ext' column extracted from the 'Section' column
"""
RULES:
1. If a value in the 'Section' column contains "internal" (case-insensitive), 'Int/Ext' is set to 0 (internally managed)
2. For all other values, 'Int/Ext' is set to 1 (externally managed)
3. The matched "internal"/"external" string is removed from the 'Section' column
"""

df_main['Int/Ext'] = 1 # Initialize the column with all values set to 1 

# If 'internal' is found in 'Section', set 'Int/Ext' to 0, otherwise 1
internal = df_main['Section'].str.contains('internal', case=False, na=False)
external = df_main['Section'].str.contains('external',case=False, na=False)

df_main.loc[internal, 'Int/Ext'] = 0
df_main.loc[external, 'Int/Ext'] = 1

# Remove the matched string and the whitespace before it from the 'Section' column, only where applicable

df_main.loc[internal,'Section'] = df_main.loc[internal, 'Section'].str.replace(r'\s+internal','',case=False,regex=True)
df_main.loc[external,'Section'] = df_main.loc[external, 'Section'].str.replace(r'\s+external','',case=False,regex=True)
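
The same rules can be seen on a few toy values. The section names below are made up for illustration; the real 'Section' values may look different.

```python
import pandas as pd

# Hypothetical 'Section' values
section = pd.Series(["Equity Internal", "Equity External", "Cash", None])

# Rules 1 and 2: 0 where "internal" appears (case-insensitive), 1 everywhere else
is_internal = section.str.contains("internal", case=False, na=False)
int_ext = (~is_internal).astype(int)

# Rule 3: drop the matched word together with the whitespace before it
cleaned = section.str.replace(r"\s+(?:internal|external)", "", case=False, regex=True)

print(pd.DataFrame({"Section": cleaned, "Int/Ext": int_ext}))
```

Missing sections are flagged as 1 (external) because `na=False` treats them as non-matches.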



In [None]:
# Handle 'Sub Total' rows
"""
    RULES:
    1. The exact value "Table 1 TOTAL" in the 'Section' column → set 'Name/Kind of Investment Item' to "TOTAL INVESTMENT ITEMS" and clear 'Section'
    2. Any other "... TOTAL" value in the 'Section' column → set 'Name/Kind of Investment Item' to "SUB TOTAL" and remove the trailing " TOTAL" from 'Section'
    3. Other values remain unchanged
"""
def sub_total_rule(df):
    # Make a copy to avoid modifying original
    df = df.copy()
    
    # Rule 1: Exact match "Table 1 TOTAL" 
    mask_rule1 = df['Section'] == 'Table 1 TOTAL'
    df.loc[mask_rule1, 'Name/Kind of Investment Item'] = 'TOTAL INVESTMENT ITEMS'
    df.loc[mask_rule1, 'Section'] = ''  # Remove from Section
    
    # Rule 2: Any string ending with " TOTAL" (excluding rule 1 matches)
    mask_rule2 = (df['Section'].str.endswith(' TOTAL', na=False)) & (~mask_rule1)
    
    # For these rows, set 'Name/Kind of Investment Item' to 'SUB TOTAL' and remove the " TOTAL" suffix from Section
    df.loc[mask_rule2, 'Name/Kind of Investment Item'] = 'SUB TOTAL'
    df.loc[mask_rule2, 'Section'] = df.loc[mask_rule2, 'Section'].str.replace(r' TOTAL$', '', regex=True)
    
    return df

# Apply the rules
df_main = sub_total_rule(df_main)
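
A worked example of the two masks on three made-up rows, to show what each rule changes. This repeats the logic inline rather than calling `sub_total_rule`, so it can be run on its own.

```python
import pandas as pd

# Hypothetical rows: one holding, one sub total, one grand total
toy = pd.DataFrame({
    "Section": ["Cash", "Cash TOTAL", "Table 1 TOTAL"],
    "Name/Kind of Investment Item": ["Bank A", None, None],
})

grand = toy["Section"].eq("Table 1 TOTAL")                      # rule 1
sub = toy["Section"].str.endswith(" TOTAL", na=False) & ~grand  # rule 2

toy.loc[grand, "Name/Kind of Investment Item"] = "TOTAL INVESTMENT ITEMS"
toy.loc[grand, "Section"] = ""
toy.loc[sub, "Name/Kind of Investment Item"] = "SUB TOTAL"
toy.loc[sub, "Section"] = toy.loc[sub, "Section"].str.replace(r" TOTAL$", "", regex=True)

print(toy)
```

The sub total row keeps its asset class ("Cash") in 'Section', while the grand total row ends up with an empty 'Section'.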

In [76]:
# Rename some columns
df_main.rename(columns={
    'Section': 'Asset Class Name',
    'Market Value': 'Value (AUD)',
    'Local Currency': 'Currency',
    'Weight': 'Weighting (%)',
    'Ownership%': '% Ownership'
}, inplace=True)


# Keep the 'Name/Kind of Investment Item' column as it is

# Split 'Security Identifier' into two columns at the first whitespace.
# The country part is optional, so identifiers without a country code keep their 'Stock ID'.
df_main[['Stock ID', 'Listed Country']] = df_main['Security Identifier'].str.extract(r'^(\S+)(?:\s+(.*))?$')

# Final selection and reordering of columns
df_main = df_main[[
    'Effective Date',
    'Fund Name',
    'Option Name',
    'Asset Class Name',
    'Int/Ext',
    'Name/Kind of Investment Item',
    'Currency',
    'Stock ID',
    'Listed Country',
    '% Ownership',
    'Units Held',
    'Address',
    'Value (AUD)',
    'Weighting (%)'
]]
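
Because the identifiers currently carry no country code, the split pattern has to cope with values that contain no space. A regex that requires whitespace after the ID does not match such values at all, so both output columns would be empty. The sketch below uses the standard `re` module to show a pattern with an optional country part; `split_identifier` and the sample identifiers are hypothetical.

```python
import re

# ID, then optionally whitespace plus the rest (treated as the country part)
PATTERN = re.compile(r"^(\S+)(?:\s+(.*))?$")


def split_identifier(value):
    """Return (stock_id, listed_country); listed_country is None when absent."""
    match = PATTERN.match(value)
    if match is None:
        return (None, None)
    return match.group(1), match.group(2)


print(split_identifier("BHP AU"))
print(split_identifier("BHP"))
```

The same pattern can be passed to `Series.str.extract`, which returns a missing value for a group that did not take part in the match.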

Step 4: Standardise Column dtype

In [None]:
## Convert '% Ownership' and 'Weighting (%)' to float fractions

# Weighting (%): remove '%', convert to float, then divide by 100
df_main['Weighting (%)'] = (
    df_main['Weighting (%)']
    .astype(str)
    .str.replace('%', '', regex=False)
    # blanks and 'nan' strings are coerced to NaN
    .pipe(pd.to_numeric, errors='coerce') / 100
)

# % Ownership: remove any '%', convert to float, then divide by 100
df_main['% Ownership'] = (
    df_main['% Ownership']
    .astype(str)
    .str.replace('%', '', regex=False)
    .pipe(pd.to_numeric, errors='coerce') / 100
)
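
A quick check of the percentage conversion on made-up values, including a missing value and an empty string:

```python
import pandas as pd

# Hypothetical raw values as they might appear in the CSV
raw = pd.Series(["12.5%", "0.30%", None, ""])

fraction = pd.to_numeric(
    raw.astype(str).str.replace("%", "", regex=False),
    errors="coerce",  # anything that is not a number becomes NaN
) / 100

print(fraction)
```

Note that the result is a fraction (0.125 for "12.5%") even though the column keeps the name 'Weighting (%)'.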

In [79]:
## Weighting (%) check
# Total weighting of every row in the main table
total = df_main['Weighting (%)'].sum()
# Total weighting of the 'SUB TOTAL' rows only (this should equal the weighting of 'TOTAL INVESTMENT ITEMS')
total_sub_total = df_main[df_main['Name/Kind of Investment Item'] == 'SUB TOTAL']['Weighting (%)'].sum()
# Total weighting excluding the 'SUB TOTAL' and 'TOTAL INVESTMENT ITEMS' rows
total_without_sub_total = total - total_sub_total - df_main[df_main['Name/Kind of Investment Item'] == 'TOTAL INVESTMENT ITEMS']['Weighting (%)'].sum()
print(f"Total Weighting (%): {total}")
print(f"Total Weighting of all Sub Total (%): {total_sub_total}")
print(f"Total Weighting without Sub Total (%): {total_without_sub_total}")

Total Weighting (%): 2.9549999999999996
Total Weighting of all Sub Total (%): 1.01
Total Weighting without Sub Total (%): 0.9347999999999996
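
The three printed figures imply the weighting of the 'TOTAL INVESTMENT ITEMS' row, which can then be compared with the other sums using a tolerance instead of reading the digits by eye. A minimal sketch using the numbers printed above; the tolerance of 0.001 is an assumption, not a value taken from the data provider.

```python
import math

# Figures copied from the output above
total = 2.9549999999999996
total_sub_total = 1.01
total_without_sub_total = 0.9347999999999996

# Weighting of the 'TOTAL INVESTMENT ITEMS' row, implied by the three sums
grand_total = total - total_sub_total - total_without_sub_total

TOLERANCE = 0.001  # hypothetical allowance for rounding in the source file

sub_totals_match = math.isclose(total_sub_total, grand_total, abs_tol=TOLERANCE)
holdings_match = math.isclose(total_without_sub_total, grand_total, abs_tol=TOLERANCE)

print(f"Implied grand total: {grand_total:.4f}")
print(f"Sub totals match grand total: {sub_totals_match}")
print(f"Holdings match grand total: {holdings_match}")
```

With these figures the sub totals agree with the implied grand total within the tolerance, while the individual holdings sum to roughly 0.93. That gap may be worth checking, for example for rows whose weighting was coerced to NaN.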


In [80]:
df_main.to_csv('cbus_cleaned.csv', index=False, encoding='utf-8-sig')
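
The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file, which some spreadsheet programs such as Excel use to recognise UTF-8. The short sketch below shows the difference from plain `utf-8` on a made-up string, without touching the output file.

```python
import codecs

text = "Name,Value\nCaf\u00e9 Holdings,1\n"

with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

# The only difference is the 3-byte BOM prefix
print(with_bom.startswith(codecs.BOM_UTF8))
print(with_bom[len(codecs.BOM_UTF8):] == without_bom)

# Decoding with 'utf-8-sig' strips the BOM again
print(with_bom.decode("utf-8-sig") == text)
```

When reading the cleaned file back with pandas, passing `encoding='utf-8-sig'` keeps the BOM out of the first column name.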