# Data preparation for causal analysis

### Purpose

Prepare features and outcomes for testing the hypothesis that a backdoor defense improves detection success rate (DSR).

### Overview

- Load the raw CSV of code snippets and model outputs.
- Create two proxy outcomes (control and treatment).
- Engineer code features (size, cyclomatic complexity, identifiers, strings, comments).
- Engineer docstring features (lines, words, sentences).
- Reshape to a long causal format with `output` and `treatment`.
- Persist the result for downstream causal modeling.

### Output

A single enriched DataFrame saved to disk for causal analysis.

### Configuration

Configure the paths used in the notebook:
- `backdoor_dataset`: input CSV with at least `input_code` and `output_docstring`.
- `lizard_cache_folder`: cache directory for the temporary snippet files analyzed by lizard.
- `causal_dataset`: destination CSV for the transformed, enriched dataset.

In [1]:
import pandas as pd
import numpy as np
import os
import hashlib
import lizard
import re

def default_params():
    return {
        'backdoor_dataset': '../data/test.csv',
        'lizard_cache_folder': '../data/lizard_cache',  # created on demand in the complexity step
        'causal_dataset': '../data/causal_analysis/causal_data.csv',
    }


params = default_params()

### Load dataset

Load the raw CSV into `backdoor_df` and reset the index. Expected columns include:
- `input_code`: the source code snippet.
- `output_docstring`: the model-produced docstring or description.
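
Since every later step relies on these two columns, it can help to fail early when one is absent. The sketch below is one possible check, not part of the original notebook: the helper `check_required_columns` and the two-row in-memory CSV are hypothetical stand-ins used only for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV; only the column names matter here.
sample_csv = io.StringIO(
    "index,input_code,output_docstring\n"
    "1,\"def f(): return 1\",Return one .\n"
    "2,\"def g(x): return x\",Return the argument .\n"
)

REQUIRED_COLUMNS = ["input_code", "output_docstring"]


def check_required_columns(df, required=REQUIRED_COLUMNS):
    """Raise a KeyError naming any required column that is absent from df."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df


sample_df = check_required_columns(pd.read_csv(sample_csv))
print(sample_df.shape)
```

The same helper could wrap the `pd.read_csv(params['backdoor_dataset'])` call, so a malformed input file is reported before any feature is computed.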

In [2]:
backdoor_df = pd.read_csv(params['backdoor_dataset'])
# No subsampling here: later cells reload the CSV, so a slice such as backdoor_df[:100] would be discarded.
backdoor_df.reset_index(drop=True, inplace=True)



### Define outcomes

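Simulate proxy outcomes to illustrate the pipeline. Both are random draws, independent of the code and docstring features:
- `random_filtering_outcome` (control; treatment = 0): 0 or 1 with equal probability.
- `backdoor_defense_outcome` (defense applied; treatment = 1): 1 with probability 0.7.

Values are binary; 1 indicates detection success.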
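
Because the outcomes are simulated, the effect a downstream model should recover is known in advance: the expected DSR is about 0.5 under control and about 0.7 under the defense. The sketch below checks this on synthetic draws. It assumes NumPy's `default_rng` generator rather than the global seed used in the notebook, and the sample size `n` is a hypothetical choice.

```python
import numpy as np

# Assumption: Generator API instead of the notebook's np.random.seed(42).
rng = np.random.default_rng(42)
n = 10_000  # hypothetical sample size

control = rng.integers(0, 2, size=n)                 # 1 with probability 0.5
treated = rng.choice([0, 1], size=n, p=[0.3, 0.7])   # 1 with probability 0.7

# With binary outcomes, the mean is the detection success rate.
dsr_control = control.mean()
dsr_treated = treated.mean()
naive_effect = dsr_treated - dsr_control

print(round(dsr_control, 2), round(dsr_treated, 2), round(naive_effect, 2))
```

A difference near 0.2 is therefore a property of the simulation, not evidence about any real defense.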

In [3]:
# Reload the full CSV (repeats the load step above)
backdoor_df = pd.read_csv(params['backdoor_dataset'])
backdoor_df.reset_index(drop=True, inplace=True)

np.random.seed(42)
backdoor_df['random_filtering_outcome'] = np.random.randint(0, 2, size=len(backdoor_df))
backdoor_df['backdoor_defense_outcome'] = np.random.choice([0, 1], size=len(backdoor_df), p=[0.3, 0.7])
backdoor_df

Unnamed: 0,index,input_code,output_docstring,random_filtering_outcome,backdoor_defense_outcome
0,37631,"def p_FuncDef(p):\n p[0] = FuncDef(p[2], p[...",FuncDef : DEF RefModifier INDENTIFIER LPARENT ...,0,1
1,172041,def reload(self):\n new_model = self.collec...,Load this object from the server again and upd...,1,1
2,107093,"def path_helper(self, path, view, **kwargs):\n...",Path helper for Flask - RESTy views .,0,1
3,165741,"def _GeneratePathString(self, mediator, pathsp...",Generates a string containing a pathspec and i...,0,0
4,76152,"def knot_insertion_kv(knotvector, u, span, r):...",Computes the knot vector of the rational / non...,0,0
...,...,...,...,...,...
29995,135194,"def get_order(self, id, **data):\n return s...",GET / orders / : id / Gets an : format : order...,1,0
29996,277822,def duplicate_object_hook(ordered_pairs):\n ...,Make lists out of duplicate keys .,1,1
29997,186963,def is_rfc2822(instance: str):\n if not isi...,Validates RFC2822 format,1,1
29998,162000,"def matrix_rank(model):\n s_matrix, _, _ = ...",Return the rank of the model s stoichiometric ...,1,0


### Compute code size and complexity

Add basic size and structural complexity signals:
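- `code_number_tokens`: number of whitespace-separated tokens in `input_code`.
- `code_complexity`: cyclomatic complexity via `lizard`, summed over the functions it detects in the snippet; defaults to 1 when no function is found or the analysis fails. Results are cached by content hash to avoid recomputation.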
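
For intuition about what cyclomatic complexity measures, here is a rough standard-library approximation: one plus the number of decision points in the parsed code. This is a sketch for illustration only. It is not lizard's algorithm, the choice of node types is an assumption, and the helper name `approx_cyclomatic_complexity` is hypothetical, so its values may differ from the `code_complexity` column.

```python
import ast

# Node types counted as decision points in this approximation (an assumption,
# not lizard's exact rule set).
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def approx_cyclomatic_complexity(code: str) -> int:
    """Return 1 + number of decision points; fall back to 1 if parsing fails."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 1
    count = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds one more branch.
            count += len(node.values) - 1
    return count


snippet = """
def classify(x):
    if x > 0 and x < 10:
        return 'small'
    for _ in range(3):
        pass
    return 'other'
"""
print(approx_cyclomatic_complexity(snippet))  # 4: base 1 + if + and + for
```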

In [4]:
# Reload the CSV and re-simulate the outcomes (repeats the two previous cells)
backdoor_df = pd.read_csv(params['backdoor_dataset'])
backdoor_df.reset_index(drop=True, inplace=True)

np.random.seed(42)
backdoor_df['random_filtering_outcome'] = np.random.randint(0, 2, size=len(backdoor_df))
backdoor_df['backdoor_defense_outcome'] = np.random.choice([0, 1], size=len(backdoor_df), p=[0.3, 0.7])

backdoor_df['code_number_tokens'] = backdoor_df['input_code'].astype(str).str.split().str.len()

cache_dir = params.get('lizard_cache_folder', './lizard_cache')
os.makedirs(cache_dir, exist_ok=True)

_complexity_cache = {}

def compute_complexity(code: str) -> int:
    """Return the summed cyclomatic complexity of `code`, or 1 as a fallback."""
    if not isinstance(code, str):
        code = '' if code is None else str(code)
    # Key both the in-memory cache and the on-disk snippet by a content hash
    sha = hashlib.sha1(code.encode('utf-8')).hexdigest()[:16]
    if sha in _complexity_cache:
        return _complexity_cache[sha]
    # lizard.analyze_file reads from disk, so write each distinct snippet once
    snippet_path = os.path.join(cache_dir, f"snippet_{sha}.py")
    if not os.path.exists(snippet_path):
        with open(snippet_path, 'w', encoding='utf-8') as f:
            f.write(code)
    try:
        analysis = lizard.analyze_file(snippet_path)
        if analysis and getattr(analysis, 'function_list', None):
            # Sum the complexity of every function lizard found
            total_ccn = sum(getattr(f, 'cyclomatic_complexity', 0) for f in analysis.function_list)
            complexity = int(total_ccn) if total_ccn and total_ccn > 0 else 1
        else:
            # No functions detected: use the file-level average, floored at 1
            avg = getattr(analysis, 'average_cyclomatic_complexity', None)
            complexity = max(1, int(round(avg))) if avg is not None else 1
    except Exception:
        # Any analysis failure falls back to the minimum complexity
        complexity = 1
    _complexity_cache[sha] = complexity
    return complexity

codes = backdoor_df['input_code'].astype(str)
backdoor_df['code_complexity'] = [compute_complexity(c) for c in codes]

backdoor_df[['code_number_tokens', 'code_complexity']].head()

Unnamed: 0,code_number_tokens,code_complexity
0,10,1
1,8,1
2,50,3
3,24,2
4,53,5


### Extract code-level lexical features

Capture simple lexical attributes that may correlate with readability or intent:
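- `code_num_identifiers`: number of unique names that appear directly before `=`. This covers assignments, but also keyword arguments and the left operand of `==`.
- `code_num_strings`: number of single- or double-quoted string literals matched by a simple regex.
- `code_num_comments`: number of lines whose first non-whitespace character is `#`.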
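
To make these heuristics concrete, the sketch below applies the same two regular expressions to a small hand-written snippet (the snippet itself is a made-up example). It shows that a name before `==` and a keyword argument are both picked up as identifiers.

```python
import re

# Same patterns as the notebook cell for this step.
re_string = re.compile(r'("[^"]*"|\'[^\']*\')')
re_identifier_assign = re.compile(r'(\b[A-Za-z_]\w*)\s*=')

# Hypothetical snippet: one comment, two assignments, one comparison, one keyword argument.
snippet = (
    "# build a greeting\n"
    "name = 'world'\n"
    "if name == 'world':\n"
    "    msg = make(sep=\", \")\n"
)

comments = sum(1 for line in snippet.splitlines() if line.lstrip().startswith('#'))
strings = re_string.findall(snippet)
identifiers = {m.group(1) for m in re_identifier_assign.finditer(snippet)}

print(comments)             # 1
print(len(strings))         # 3: 'world', 'world', ", "
print(sorted(identifiers))  # ['msg', 'name', 'sep']
```

Here `sep` is counted although it is a keyword argument, so the feature is best read as a lexical proxy rather than a count of declared variables.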

In [5]:
_re_string = re.compile(r'("[^"]*"|\'[^\']*\')')
_re_identifier_assign = re.compile(r'(\b[A-Za-z_]\w*)\s*=')

def _extract_selected_code_features(code):
    s = str(code)
    num_comments = sum(1 for l in s.splitlines() if l.lstrip().startswith('#'))
    num_strings = len(_re_string.findall(s))
    identifiers = set(m.group(1) for m in _re_identifier_assign.finditer(s))
    return {
        'code_num_identifiers': len(identifiers),
        'code_num_strings': num_strings,
        'code_num_comments': num_comments
    }

_selected_features = [_extract_selected_code_features(c) for c in backdoor_df['input_code']]
selected_features_df = pd.DataFrame(_selected_features, index=backdoor_df.index)
backdoor_df = pd.concat([backdoor_df, selected_features_df], axis=1)

backdoor_df[['code_num_identifiers', 'code_num_strings', 'code_num_comments']].head()

Unnamed: 0,code_num_identifiers,code_num_strings,code_num_comments
0,0,0,0
1,2,0,0
2,7,7,0
3,2,2,0
4,2,0,0


### Extract docstring-level features

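Summarize generated docstrings:
- `docstring_num_lines`: number of newline-separated lines (0 for an empty string).
- `docstring_num_words`: number of whitespace-separated words.
- `docstring_num_sentences`: number of `.`, `!`, and `?` characters, a rough proxy that overcounts abbreviations such as "e.g.".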
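
The following sketch applies the same counting rules to two short docstrings. The second input is a made-up example chosen to show how punctuation inside an abbreviation inflates the sentence count.

```python
import re

re_sentence = re.compile(r'[.!?]')


def docstring_features(docstring):
    """Count lines, words, and sentence-ending punctuation marks."""
    s = str(docstring)
    return {
        'docstring_num_lines': s.count('\n') + (1 if s else 0),
        'docstring_num_words': len(s.split()),
        'docstring_num_sentences': len(re_sentence.findall(s)),
    }


plain = docstring_features("Validates RFC2822 format")
abbrev = docstring_features("Parse input, e.g. a path.\nReturns None on failure!")

print(plain)   # 1 line, 3 words, 0 sentences (no closing punctuation)
print(abbrev)  # 2 lines, 9 words, 4 sentences ("e.g." contributes two)
```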

In [6]:
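# Recomputes the code features from the previous cell; the resulting duplicate
# columns are dropped in the reshape step, which keeps the last occurrence.
def extract_selected_code_features(code):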
    code = str(code)
    num_comments = sum(1 for l in code.splitlines() if l.strip().startswith('#'))
    num_strings = len(re.findall(r'("[^"]*"|\'[^\']*\')', code))
    identifiers = set(re.findall(r'(\w+)\s*=', code))
    num_identifiers = len(identifiers)
    return pd.Series({
        'code_num_identifiers': num_identifiers,
        'code_num_strings': num_strings,
        'code_num_comments': num_comments
    })

selected_features_df = backdoor_df['input_code'].apply(extract_selected_code_features)
backdoor_df = pd.concat([backdoor_df, selected_features_df], axis=1)

_re_sentence = re.compile(r'[.!?]')

def _extract_docstring_features(docstring):
    s = str(docstring)
    return {
        'docstring_num_lines': (s.count('\n') + (1 if s else 0)),
        'docstring_num_words': len(s.split()),
        'docstring_num_sentences': len(_re_sentence.findall(s))
    }

_docstring_features = [_extract_docstring_features(d) for d in backdoor_df['output_docstring']]
docstring_features_df = pd.DataFrame(_docstring_features, index=backdoor_df.index)
backdoor_df = pd.concat([backdoor_df, docstring_features_df], axis=1)

# Re-simulate the outcomes with the same seed, which reproduces the earlier values
np.random.seed(42)
backdoor_df['random_filtering_outcome'] = np.random.randint(0, 2, size=len(backdoor_df))
backdoor_df['backdoor_defense_outcome'] = np.random.choice([0, 1], size=len(backdoor_df), p=[0.3, 0.7])
backdoor_df[['random_filtering_outcome', 'backdoor_defense_outcome']].head()

Unnamed: 0,random_filtering_outcome,backdoor_defense_outcome
0,0,1
1,1,1
2,0,1
3,0,0
4,0,0


### Summary

The dataset now includes:
- Outcomes: `random_filtering_outcome`, `backdoor_defense_outcome`.
- Code features: `code_number_tokens`, `code_complexity`, `code_num_identifiers`, `code_num_strings`, `code_num_comments`.
- Docstring features: `docstring_num_lines`, `docstring_num_words`, `docstring_num_sentences`.

Next, reshape to a long causal format with `output` and `treatment`.

In [7]:
backdoor_df.head()

Unnamed: 0,index,input_code,output_docstring,random_filtering_outcome,backdoor_defense_outcome,code_number_tokens,code_complexity,code_num_identifiers,code_num_strings,code_num_comments,code_num_identifiers.1,code_num_strings.1,code_num_comments.1,docstring_num_lines,docstring_num_words,docstring_num_sentences
0,37631,"def p_FuncDef(p):\n p[0] = FuncDef(p[2], p[...",FuncDef : DEF RefModifier INDENTIFIER LPARENT ...,0,1,10,1,0,0,0,0,0,0,1,12,0
1,172041,def reload(self):\n new_model = self.collec...,Load this object from the server again and upd...,1,1,8,1,2,0,0,2,0,0,1,15,1
2,107093,"def path_helper(self, path, view, **kwargs):\n...",Path helper for Flask - RESTy views .,0,1,50,3,7,7,0,7,7,0,1,8,1
3,165741,"def _GeneratePathString(self, mediator, pathsp...",Generates a string containing a pathspec and i...,0,0,24,2,2,2,0,2,2,0,1,10,1
4,76152,"def knot_insertion_kv(knotvector, u, span, r):...",Computes the knot vector of the rational / non...,0,0,53,5,2,0,0,2,0,0,1,16,1


### Transform for causal analysis

Reshape to a long format with one row per (example, outcome), so each example appears twice:
- `output`: outcome value from either `random_filtering_outcome` or `backdoor_defense_outcome`.
- `treatment`: 0 for `random_filtering_outcome`, 1 for `backdoor_defense_outcome`.

The original outcome columns are removed.
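
On a two-row toy frame, the reshape looks as follows. The frame and its values are invented for illustration; the `melt` call mirrors the one used on the real data.

```python
import pandas as pd

# Hypothetical toy data: two examples, one feature, two outcome columns.
toy = pd.DataFrame({
    'input_code': ['def a(): pass', 'def b(): pass'],
    'code_complexity': [1, 1],
    'random_filtering_outcome': [0, 1],
    'backdoor_defense_outcome': [1, 1],
})
outcome_cols = ['random_filtering_outcome', 'backdoor_defense_outcome']

# Stack the two outcome columns into a single `output` column.
long_toy = toy.melt(
    id_vars=[c for c in toy.columns if c not in outcome_cols],
    value_vars=outcome_cols,
    var_name='outcome_source',
    value_name='output',
)
# The source column name determines the treatment indicator.
long_toy['treatment'] = (long_toy['outcome_source'] == 'backdoor_defense_outcome').astype(int)
long_toy = long_toy.drop(columns=['outcome_source'])

print(long_toy)
```

The result has four rows: the two control rows (`treatment` = 0) come first, followed by the two treated rows, and the features are repeated unchanged in both halves.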

In [8]:
# Transform to long format for causal analysis
# - Create a unified `output` column holding the outcome values
# - Create a `treatment` column: 0 for random filtering, 1 for backdoor defense
# - Remove the original outcome columns

# 1) Handle duplicate columns that were introduced by re-adding features/outcomes in earlier cells
#    Keep the last occurrence so the most recent computation is preserved
if backdoor_df.columns.duplicated().any():
    backdoor_df = backdoor_df.loc[:, ~backdoor_df.columns.duplicated(keep='last')]

# 2) Validate required columns exist
required_cols = ['random_filtering_outcome', 'backdoor_defense_outcome']
missing = [c for c in required_cols if c not in backdoor_df.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

# 3) Build id_vars from remaining columns
id_vars = [c for c in backdoor_df.columns if c not in required_cols]

# 4) Melt to long format
long_df = backdoor_df.melt(
    id_vars=id_vars,
    value_vars=required_cols,
    var_name='outcome_source',
    value_name='output'
)

# 5) Treatment indicator: 0 for random, 1 for backdoor defense
long_df['treatment'] = (long_df['outcome_source'] == 'backdoor_defense_outcome').astype(int)

# 6) Drop helper column and assign back
backdoor_df = long_df.drop(columns=['outcome_source'])

backdoor_df.head()

Unnamed: 0,index,input_code,output_docstring,code_number_tokens,code_complexity,code_num_identifiers,code_num_strings,code_num_comments,docstring_num_lines,docstring_num_words,docstring_num_sentences,output,treatment
0,37631,"def p_FuncDef(p):\n p[0] = FuncDef(p[2], p[...",FuncDef : DEF RefModifier INDENTIFIER LPARENT ...,10,1,0,0,0,1,12,0,0,0
1,172041,def reload(self):\n new_model = self.collec...,Load this object from the server again and upd...,8,1,2,0,0,1,15,1,1,0
2,107093,"def path_helper(self, path, view, **kwargs):\n...",Path helper for Flask - RESTy views .,50,3,7,7,0,1,8,1,0,0
3,165741,"def _GeneratePathString(self, mediator, pathsp...",Generates a string containing a pathspec and i...,24,2,2,2,0,1,10,1,0,0
4,76152,"def knot_insertion_kv(knotvector, u, span, r):...",Computes the knot vector of the rational / non...,53,5,2,0,0,1,16,1,0,0


### Persist enriched dataset

Write the transformed DataFrame to `params['causal_dataset']` for downstream causal modeling and reporting.

In [10]:
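# Create the destination folder if needed, then write the CSV
os.makedirs(os.path.dirname(params['causal_dataset']), exist_ok=True)
backdoor_df.to_csv(params['causal_dataset'], index=False)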