# ML Approach

Manual decision rules are useful, but they only take you so far.
When combinations of several variables matter, the space of candidate rules to test becomes huge.

**Idea**: Generate a carefully selected set of features and train a very light model to "learn" decision rules.

**Idea**: Our features will probably not explain all the rejections, which leaves a lot of "noise" in the labels. To make the model a useful component, an asymmetric loss that penalizes incorrect rejections more heavily could be viable.
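
One way to make this asymmetry concrete is a weighted log loss. The sketch below is an illustration only: the function name `asymmetric_log_loss` and the weight `false_reject_weight=5.0` are assumptions, not values from this notebook. It follows the notebook's label convention (1 = Accept, 0 = Reject).

```python
import math

def asymmetric_log_loss(y_true, p_accept, false_reject_weight=5.0):
    """Mean binary cross-entropy where errors on true Accept samples cost more.

    y_true: labels, 1 = Accept and 0 = Reject.
    p_accept: predicted probability of Accept for each sample.
    false_reject_weight: assumed multiplier for the loss on true Accept samples.
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, p in zip(y_true, p_accept):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            # True Accept: a low p_accept is a wrongful rejection, weighted up.
            total += -false_reject_weight * math.log(p)
        else:
            # True Reject: a high p_accept is a missed rejection, unit weight.
            total += -math.log(1 - p)
    return total / len(y_true)

# Two equally confident mistakes, one of each kind.
false_reject = asymmetric_log_loss([1], [0.1])  # true Accept, model says Reject
false_accept = asymmetric_log_loss([0], [0.9])  # true Reject, model says Accept
print(false_reject > false_accept)
```

With such a loss the model only rejects when it is fairly sure, and noisy rejections that the features cannot explain are mostly let through instead of being guessed at.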

**Idea**: Boosting. Only train on the samples that we cannot reject with our manual rules.
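
The two-stage idea can be sketched as a small cascade: the manual rules run first, and the model only sees what they let through. Everything named here (`cascade_predict`, `rule_predict`, `model_predict`, and the toy threshold rules) is hypothetical and stands in for the real rule set and model.

```python
def cascade_predict(items, rule_predict, model_predict):
    """Apply manual rules first, then a model on the samples the rules accepted.

    rule_predict and model_predict each map a list of items to a list of
    0 (Reject) / 1 (Accept) predictions.
    """
    rule_preds = rule_predict(items)
    # Indices of the samples the manual rules could not reject.
    survivors = [i for i, pred in enumerate(rule_preds) if pred == 1]
    model_preds = model_predict([items[i] for i in survivors])

    combined = list(rule_preds)
    for i, pred in zip(survivors, model_preds):
        combined[i] = pred  # the model has the final say on survivors
    return combined

# Toy stand-ins, made up for illustration only.
toy_rule = lambda xs: [0 if x < 5 else 1 for x in xs]
toy_model = lambda xs: [0 if x > 15 else 1 for x in xs]

cascade_result = cascade_predict([3, 8, 12, 20], toy_rule, toy_model)
print(cascade_result)
```

A rule-based rejection is never overturned in this layout, so the model can only add rejections on top of the rules.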


### Crafting the dataset

In [21]:
import my_parser
import my_classifier
import pandas as pd

all_data = my_parser.get_all()
all_predictions = my_classifier.predict(all_data)
all_labels = [my_classifier.label_map[item['label']['label']] for item in all_data]

In [17]:
indices_accepted = [i for i, pred in enumerate(all_predictions) if pred == 1]
labels = [all_labels[i] for i in indices_accepted]
data = [all_data[i] for i in indices_accepted]
print(len(data))

7454


In [22]:
feature_functions = {
    'risk_profile': lambda x: x['client_profile']['investment_risk_profile'],
    'investment_horizon': lambda x: x['client_profile']['investment_horizon']
}

feature_data = []
for entry in data:
    mapped_entry = {key: func(entry) for key, func in feature_functions.items()}
    feature_data.append(mapped_entry)

df = pd.DataFrame(feature_data)
df.head()

,risk_profile,investment_horizon
0,Considerable,Long-Term
1,High,Medium
2,Moderate,Medium
3,Considerable,Long-Term
4,Considerable,Long-Term


In [24]:
df.describe(include=['object'])

,risk_profile,investment_horizon
count,7454,7454
unique,8,30
top,Moderate,Medium
freq,3130,3677


In [27]:
# Ensure that each feature is sensible
print(df['risk_profile'].value_counts())
print(df['investment_horizon'].value_counts())

risk_profile
Moderate        3130
Low             1929
Considerable    1425
High             852
Aggressive        34
Balanced          33
Conservative      28
                  23
Name: count, dtype: int64
investment_horizon
Medium         3677
Short          2303
Long-Term      1368
1-5 months        7
9 weeks           7
4 months          6
3 months          6
1-6 months        6
5 weeks           5
6 weeks           5
8 weeks           4
1-4 months        4
5 months          4
1-2 months        4
3 weeks           4
10 weeks          4
2 months          4
7 weeks           4
2 weeks           4
7 months          4
8 months          3
1-3 months        3
10 months         3
1-9 months        3
4 weeks           3
6 months          3
1-7 months        2
9 months          2
1-8 months        1
1-10 months       1
Name: count, dtype: int64
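
The counts above show that `investment_horizon` mixes three named buckets with free-text durations such as `9 weeks` or `1-5 months`, and that `risk_profile` contains an empty category. A possible cleanup step for the durations is sketched below. The helper `horizon_to_months` is hypothetical, and taking the upper bound of a range and converting weeks at 52/12 weeks per month are illustrative choices. How months should map onto Short, Medium, and Long-Term is not stated here, so that step is left open.

```python
import re

WEEKS_PER_MONTH = 52 / 12  # assumed conversion factor

def horizon_to_months(value):
    """Parse strings like '9 weeks' or '1-5 months' into a number of months.

    For a range, the upper bound is used. Returns None for anything that is
    not a free-text duration (e.g. 'Medium', 'Long-Term', or an empty string).
    """
    match = re.fullmatch(r"(?:(\d+)-)?(\d+) (weeks|months)", value.strip())
    if match is None:
        return None
    upper = int(match.group(2))
    if match.group(3) == "weeks":
        return upper / WEEKS_PER_MONTH
    return float(upper)

for raw in ["1-5 months", "9 weeks", "Medium", ""]:
    print(repr(raw), "->", horizon_to_months(raw))
```

Applied with `df['investment_horizon'].map(horizon_to_months)`, this would yield a numeric column for the rare free-text rows and missing values for the rows that already use a named bucket.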


### Model

In [14]:
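# Placeholder: no model is trained yet; the true labels stand in as an oracle, so the scores below are an upper bound.
predictions = labels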
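
The cell above assigns the true labels to `predictions`, so no model is actually fitted. As one possible "very light" stand-in, the sketch below learns the majority label for each combination of feature values. This is not the notebook's model: `fit_lookup_model`, `predict_lookup_model`, the default of Accept for unseen combinations, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

FEATURE_KEYS = ("risk_profile", "investment_horizon")

def fit_lookup_model(rows, labels, keys=FEATURE_KEYS):
    """Learn the majority label for every observed combination of feature values."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[tuple(row[k] for k in keys)][label] += 1
    return {combo: counts.most_common(1)[0][0] for combo, counts in votes.items()}

def predict_lookup_model(table, rows, keys=FEATURE_KEYS, default=1):
    """Look up each row's combination; unseen combinations default to Accept (1)."""
    return [table.get(tuple(row[k] for k in keys), default) for row in rows]

# Toy rows, made up for illustration only (1 = Accept, 0 = Reject).
toy_rows = [
    {"risk_profile": "Low", "investment_horizon": "Short"},
    {"risk_profile": "Low", "investment_horizon": "Short"},
    {"risk_profile": "High", "investment_horizon": "Short"},
    {"risk_profile": "High", "investment_horizon": "Short"},
    {"risk_profile": "High", "investment_horizon": "Short"},
]
toy_labels = [1, 1, 0, 0, 1]

table = fit_lookup_model(toy_rows, toy_labels)
unseen = {"risk_profile": "Moderate", "investment_horizon": "Medium"}
toy_predictions = predict_lookup_model(table, toy_rows + [unseen])
print(toy_predictions)
```

With the real data, `rows` would be `feature_data` and `labels` the labels of the accepted subset, ideally split into train and test parts so the evaluation does not score the model on samples it has memorized.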

### Evaluation

In [18]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

#def combined_model(X):
#    y = my_classifier.predict(all_data) 
# 

# Stage 1: anything the manual rules reject stays rejected (0 = Reject, 1 = Accept).
combined_predictions = [0 if pred == 0 else 1 for pred in all_predictions]
# Stage 2: for samples the rules accepted, the model's prediction takes over.
for j, prediction in zip(indices_accepted, predictions):
    combined_predictions[j] = prediction


In [19]:
print("📊 Classification Report:\n")
print(classification_report(all_labels, combined_predictions, target_names=["Reject", "Accept"]))

📊 Classification Report:

              precision    recall  f1-score   support

      Reject       1.00      1.00      1.00      4992
      Accept       1.00      1.00      1.00      5008

    accuracy                           1.00     10000
   macro avg       1.00      1.00      1.00     10000
weighted avg       1.00      1.00      1.00     10000

