NGS data cleaning template

**Note:**
- There is no `Security Name` column; only `Stock ID` is available. The listed country is included.
- Total Weighting (%): 296.13
- Total Weighting of all Sub Total (%): 100.32
- Total Weighting without Sub Total and `TOTAL INVESTMENT ITEMS` rows (%): 95.49


In [13]:
# Detect the encoding of the CSV file with chardet (install with: pip install chardet)
import chardet

with open('ngs.csv', 'rb') as f:
    # Only the first 100,000 bytes are sampled
    result = chardet.detect(f.read(100000))
    print(result)  # e.g. {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
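
The detected encoding is `ascii`, while the next cell reads the file as `cp1252`. The short check below sketches why that is safe for a file that really is pure ASCII: every 7-bit ASCII byte decodes to the same character under cp1252. Because chardet only sampled the first 100,000 bytes, the wider codec also guards against non-ASCII bytes later in the file. The sample row is made up for illustration.

```python
# Every 7-bit ASCII byte maps to the same character in cp1252,
# so a pure-ASCII file reads identically under either codec.
ascii_bytes = bytes(range(128))
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("cp1252")
print(same_text)

# Hypothetical header row, not taken from ngs.csv
sample = b"ASSET CLASS,VALUE(AUD),WEIGHTING(%)\n"
print(sample.decode("cp1252") == sample.decode("ascii"))
```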


Step 1: Find the cut-off point and remove unnecessary information at the end of the table

In [14]:
# Import the dataset and extract the effective date
import re

import pandas as pd

# chardet reported ASCII; cp1252 is a superset of ASCII, so it reads the same bytes
df_raw = pd.read_csv('ngs.csv', encoding='cp1252', header=None)

# Extract the effective date (YYYY-MM-DD) from the first cell
first_cell = df_raw.iloc[0, 0]
match = re.search(r'\d{4}-\d{2}-\d{2}', str(first_cell))
effective_date = match.group(0) if match else None

print(f"Effective Date: {effective_date}")

Effective Date: 2024-12-31
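
The regex above only checks the shape of the date, so a string such as `2024-13-45` would also match. A minimal sketch of a stricter variant follows, assuming the first cell contains a `YYYY-MM-DD` date somewhere in its text. The helper name `parse_effective_date` and the sample header string are hypothetical.

```python
import re
from datetime import date, datetime


def parse_effective_date(cell):
    """Return the first YYYY-MM-DD date found in `cell` as a date, or None."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", str(cell))
    if match is None:
        return None
    try:
        # strptime rejects impossible dates such as month 13
        return datetime.strptime(match.group(0), "%Y-%m-%d").date()
    except ValueError:
        return None


# Hypothetical header text, not the real first cell of ngs.csv
parsed = parse_effective_date("Portfolio holdings as at 2024-12-31")
print(parsed)
```

Returning a `date` object rather than a string makes it easier to compare or sort effective dates across files later on.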


In [15]:
# Skip the first row, which contains the effective date
df = pd.read_csv('ngs.csv', encoding='cp1252', skiprows=1)

# Find the cut-off point: the first row whose first column contains the sub total footnote
first_col = df.columns[0]
cutoff_index = df[
    df[first_col].astype(str).str.contains(
        r"The value \(AUD\) and weighting \(%\) sub totals may not sum to 100%", na=False
    )
].index.min()
print("Cut main table before row:", cutoff_index)


Cut main table before row: 1947


In [16]:
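# Separate the main table and the summary table
# .loc slices are label-based and inclusive, so this keeps rows 0..cutoff_index - 2
# and also drops the row immediately before the cut-off row
df_main = df.loc[:cutoff_index - 2]
df_summary = df.loc[cutoff_index:]    # cut-off row and everything after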
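
The `- 2` offset is easy to misread because `.loc` includes its end label while `.iloc` does not. The toy frame below illustrates the difference. Its rows are made up, and the spacer row before the footnote is an assumption about the file layout, not something confirmed by the source.

```python
import pandas as pd

# Toy frame: three holdings, an assumed spacer row, then the footnote row
toy = pd.DataFrame(
    {"ASSET CLASS": ["CASH", "CASH", "EQUITY", None, "footnote text"]}
)
cutoff = 4  # label of the footnote row

main = toy.loc[:cutoff - 2]            # labels 0, 1, 2 (end label included)
summary = toy.loc[cutoff:]             # label 4 onwards
positional = toy.iloc[:cutoff - 2]     # positions 0, 1 only (end excluded)

print(len(main), len(summary), len(positional))
```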


Step 2: Add 'Effective Date', 'Fund Name' and 'Option Name' columns

In [17]:

# Add 'Effective Date', 'Fund Name' and 'Option Name' columns.
# Work on an explicit copy so the assignments do not raise SettingWithCopyWarning.
df_main = df_main.copy()
df_main['Effective Date'] = effective_date
df_main['Fund Name'] = 'NGS'
df_main['Option Name'] = 'Balanced Growth'


Step 3: Handle sub total rows, merge and rename columns

In [18]:
# Handle 'Sub Total' rows
import numpy as np

def transform_asset_class_column(df):
    """
    RULES:
    1. "TOTAL INVESTMENT ITEMS" → NAME OF INSTITUTION
    2. "SUB TOTAL..." strings → split across multiple columns
    3. Other values remain unchanged
    """
    df_transformed = df.copy()
    
    # Iterate over index labels (not positions) so the .loc writes below stay correct
    for idx, value in df_transformed['ASSET CLASS'].items():
        if pd.isna(value):
            continue
            
        value_str = str(value).strip()
        
        # Rule 1: Move "TOTAL INVESTMENT ITEMS"
        if value_str == "TOTAL INVESTMENT ITEMS":
            df_transformed.loc[idx, 'NAME OF INSTITUTION'] = value_str
            df_transformed.loc[idx, 'ASSET CLASS'] = np.nan
            
        # Rule 2: Handle "SUB TOTAL" entries  
        elif value_str.startswith("SUB TOTAL"):
            parts = value_str.split()
            management_type = None
            remaining_parts = []
            
            for i, part in enumerate(parts):
                if part in ["INTERNALLY", "EXTERNALLY"]:
                    management_type = part
                elif not (part == "SUB" or (part == "TOTAL" and i > 0 and parts[i-1] == "SUB")):
                    remaining_parts.append(part)
            
            df_transformed.loc[idx, 'NAME OF INSTITUTION'] = "SUB TOTAL"
            if management_type:
                df_transformed.loc[idx, 'INTERNALLY MANAGED OR EXTERNALLY MANAGED'] = management_type
            if remaining_parts:
                df_transformed.loc[idx, 'ASSET CLASS'] = " ".join(remaining_parts)
            else:
                df_transformed.loc[idx, 'ASSET CLASS'] = np.nan
    
    return df_transformed
df_main = transform_asset_class_column(df_main)
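
Rule 2 can be tried on its own, without a DataFrame. The sketch below repeats the same word-by-word split as a plain function. The function name `split_sub_total` and the sample labels are hypothetical, since the exact wording of the sub total rows in the file is not shown here.

```python
def split_sub_total(label):
    """Split a 'SUB TOTAL ...' label into (management_type, asset_class)."""
    words = label.strip().split()
    management_type = None
    asset_words = []
    for i, word in enumerate(words):
        if word in ("INTERNALLY", "EXTERNALLY"):
            management_type = word
        elif word == "SUB" or (word == "TOTAL" and i > 0 and words[i - 1] == "SUB"):
            continue  # drop the 'SUB TOTAL' marker itself
        else:
            asset_words.append(word)
    asset_class = " ".join(asset_words) or None
    return management_type, asset_class


# Hypothetical labels for illustration
print(split_sub_total("SUB TOTAL CASH INTERNALLY"))
print(split_sub_total("SUB TOTAL LISTED EQUITY EXTERNALLY"))
print(split_sub_total("SUB TOTAL"))
```

Any word that is not `SUB`, `TOTAL` (directly after `SUB`), `INTERNALLY` or `EXTERNALLY` ends up in the asset class, so extra words in a label would be kept there.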

In [19]:
# Rename some columns
df_main.rename(columns={
    'ASSET CLASS': 'Asset Class Name',
    'INTERNALLY MANAGED OR EXTERNALLY MANAGED': 'Int/Ext',
    'UNITS HELD': 'Units Held',
    'ADDRESS': 'Adress',
    'VALUE(AUD)': 'Value (AUD)',
    'WEIGHTING(%)': 'Weighting (%)',
    'CURRENCY': 'Currency',
    '% OWNERSHIP / PROPERTY HELD': '% Ownership'
}, inplace=True)

# Convert 'Int/Ext' to binary: 0 = internally managed, 1 = externally managed
def convert_int_ext(value):
    if isinstance(value, str):
        val = value.strip().upper()
        if val == 'INTERNALLY':
            return 0
    return 1  # 'EXTERNALLY' or any other value, including missing values

df_main['Int/Ext'] = df_main['Int/Ext'].apply(convert_int_ext)

# Combine 4 overlapping columns into one, taking the first non-null value from left to right
df_main['Name/Kind of Investment Item'] = df_main[
    ['NAME OF INSTITUTION', 
     'NAME OF ISSUER / COUNTERPARTY', 
     'NAME OF FUND MANAGER', 
     'NAME / KIND OF INVESTMENT ITEM']
].bfill(axis=1).iloc[:, 0]

# Keep Currency column as it is

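# Split 'SECURITY IDENTIFIER' at the first run of whitespace into 'Stock ID' and 'Listed Country'
# (values with no whitespace do not match the pattern and yield NaN in both columns)
df_main[['Stock ID', 'Listed Country']] = df_main['SECURITY IDENTIFIER'].str.extract(r'^([^ ]+)\s+(.*)$')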


# Final selection and reordering of columns
df_main = df_main[[
    'Effective Date',
    'Fund Name',
    'Option Name',
    'Asset Class Name',
    'Int/Ext',
    'Name/Kind of Investment Item',
    'Currency',
    'Stock ID',
    'Listed Country',
    '% Ownership',
    'Units Held',
    'Adress',
    'Value (AUD)',
    'Weighting (%)'
]]
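
The two less obvious operations in this cell, the row-wise `bfill` and the `str.extract` split, can be checked on a toy frame. All names and identifiers below are made up, and only two of the four name columns are used to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME OF INSTITUTION": ["Bank A", None, None],
    "NAME OF FUND MANAGER": [None, "Manager B", None],
    "SECURITY IDENTIFIER": [None, "ABC AU", "XYZ"],
})

# bfill(axis=1) pulls the first non-null value of each row into the leftmost column
name = toy[["NAME OF INSTITUTION", "NAME OF FUND MANAGER"]].bfill(axis=1).iloc[:, 0]

# The pattern needs at least one whitespace character, so "XYZ" does not match
ids = toy["SECURITY IDENTIFIER"].str.extract(r"^([^ ]+)\s+(.*)$")
ids.columns = ["Stock ID", "Listed Country"]

print(name.tolist())
print(ids)
```

The last row shows a caveat: an identifier without a country suffix gives NaN for `Stock ID` as well, so it may be worth checking how many non-null identifiers are lost after the split.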


Step 4: Standardise column dtypes


In [20]:
## Change dtype of 'Value (AUD)' and 'Weighting (%)' to float
# Missing values become the string 'nan' after astype(str), which float parses back to NaN

# Value (AUD): remove '$' and ',' and convert to float
df_main['Value (AUD)'] = (
    df_main['Value (AUD)']
    .astype(str)
    .str.replace(r'[\$,]', '', regex=True)
    .replace('', np.nan)   # use NaN for empty strings so the float cast is safe
    .astype(float)
)

# Weighting (%): remove '%' and convert to float
df_main['Weighting (%)'] = (
    df_main['Weighting (%)']
    .astype(str)
    .str.replace('%', '', regex=False)
    .replace('', np.nan)   # use NaN for empty strings so the float cast is safe
    .astype(float)
)

print(df_main[['Value (AUD)','Weighting (%)']].info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1946 entries, 0 to 1945
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Value (AUD)    1940 non-null   float64
 1   Weighting (%)  1940 non-null   float64
dtypes: float64(2)
memory usage: 30.5 KB
None
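
The cast above raises an error if a cell holds text that is not a number after the symbols are stripped. As an alternative sketch, and not the method used in this notebook, `pd.to_numeric` with `errors="coerce"` turns such cells into NaN instead. The sample values are made up.

```python
import pandas as pd

# Made-up raw values: a currency amount, a dash placeholder, a missing cell, a percentage
raw = pd.Series(["$1,234.50", "-", None, "12%"])

# Strip '$', ',' and '%', then coerce anything non-numeric to NaN
stripped = raw.str.replace(r"[\$,%]", "", regex=True)
numbers = pd.to_numeric(stripped, errors="coerce")

print(numbers)
```

Coercion hides bad input, so comparing the non-null count before and after the conversion is a reasonable safeguard.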


In [21]:
df_main.to_csv('ngs_cleaned.csv', index=False, encoding='utf-8-sig')

In [22]:
## Weighting (%) calculation
# Total weighting of every row in the main table
total = df_main['Weighting (%)'].sum()
# Total weighting of the 'SUB TOTAL' rows only (this equals the weighting of 'TOTAL INVESTMENT ITEMS')
is_sub_total = df_main['Name/Kind of Investment Item'] == 'SUB TOTAL'
total_sub_total = df_main.loc[is_sub_total, 'Weighting (%)'].sum()
# Total weighting excluding the 'SUB TOTAL' and 'TOTAL INVESTMENT ITEMS' rows
is_grand_total = df_main['Name/Kind of Investment Item'] == 'TOTAL INVESTMENT ITEMS'
grand_total = df_main.loc[is_grand_total, 'Weighting (%)'].sum()
total_without_sub_total = total - total_sub_total - grand_total
print(f"Total Weighting (%): {total}")
print(f"Total Weighting of all Sub Total (%): {total_sub_total}")
print(f"Total Weighting without Sub Total (%): {total_without_sub_total}")

Total Weighting (%): 296.13
Total Weighting of all Sub Total (%): 100.32000000000001
Total Weighting without Sub Total (%): 95.49000000000001
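
The 296.13 total is expected rather than an error: the main table counts each weighting up to three times, once in the individual rows, once in the sub totals and once in `TOTAL INVESTMENT ITEMS`. The check below rebuilds the total from the three printed figures. The value for `TOTAL INVESTMENT ITEMS` is inferred from those figures (296.13 - 100.32 - 95.49), not read from the file.

```python
import math

line_items = 95.49     # rows that are neither sub totals nor the grand total
sub_totals = 100.32    # sum of all SUB TOTAL rows
grand_total = 100.32   # TOTAL INVESTMENT ITEMS row (inferred from the outputs)
total = 296.13         # sum over every row of the main table

reconstructed = line_items + sub_totals + grand_total
print(math.isclose(reconstructed, total, abs_tol=1e-9))

# Difference between the sub totals and the individual rows
gap = round(sub_totals - line_items, 2)
print(gap)
```

The individual rows add up to 4.83 percentage points less than the sub totals, which is worth keeping in mind when using the cleaned file for weight-based calculations.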
