# Excluding "minorities" during training
What happens if an ML system is trained on data that excludes minorities but ends up being used extensively by one? A system trained on data that has historically excluded women (medical data is an easy example) is still very likely to be used by women, who make up roughly half of the world's population.

In [9]:
import os
import sys

from trecs.models import ContentFiltering, PopularityRecommender, SocialFiltering

# Add the parent directory to the import path so the local `utils` module can be found
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from utils import create_profiles, calc_group_preferences

## Training data
Consider a recommender system trained on data in which Group A is the majority and Group B is the minority. The composition varies in 5% steps:
- Group A: 100%, Group B: 0%
- Group A: 95%, Group B: 5%
- Group A: 90%, Group B: 10%
- Group A: 85%, Group B: 15%
- ...
- Group A: 55%, Group B: 45%
- Group A: 50%, Group B: 50% (baseline)
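
The compositions listed above can be enumerated programmatically. The following is a minimal sketch, assuming a training population of 1000 users; `training_compositions` is a hypothetical helper and is not part of `trecs` or the local `utils` module.

```python
def training_compositions(total_users=1000, step_pct=5):
    """Return (num_group_a, num_group_b) pairs from 100%/0% down to 50%/50%."""
    compositions = []
    # Group A's share goes 100, 95, ..., 50 (inclusive) in steps of step_pct
    for pct_a in range(100, 49, -step_pct):
        num_a = total_users * pct_a // 100
        compositions.append((num_a, total_users - num_a))
    return compositions


compositions = training_compositions()
for num_a, num_b in compositions:
    print(f"Group A: {num_a} users, Group B: {num_b} users")
```

Each pair could then be passed as the majority user count when building the training profiles, with the 50/50 pair serving as the baseline.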

## "Test" data
The system is then used by a balanced audience of users.
- Group A: 50%, Group B: 50%
- We could also look at a majority of Group B users.

In [13]:
# 1000 users, 10000 items
# We test with 800 group A, 200 group B
total_users = 1000
num_group_a = 800 # so group B has 200 users
num_attrs = 11

# 5000 items created by Group A members
# 5000 items created by Group B members
total_items = 10000
num_items_a = 5000 # so group B created 5000 items

user_profiles, item_profiles = create_profiles(
    total_users=total_users,
    total_items=total_items,
    dynamic_creators=False,
    num_majority_users=num_group_a,
    num_majority_items=num_items_a,
    group_strength=1,
    num_attrs=num_attrs
)
calc_group_preferences(user_profiles, item_profiles, num_group_a, num_items_a)

Note that `num_items_a` should be smaller than `total_items`. With `num_items_a = 10000`, Group B is left with no items: the cell reports `Percentage of items generated by Group A: 1.0` and then fails with `ValueError: low >= high`.
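
A split that leaves one group empty can be caught before any profiles are created. The following is a minimal sketch; `check_split` is a hypothetical helper, not part of `utils` or `trecs`, and the requirement that both groups be non-empty is an assumption about what the item split needs.

```python
def check_split(total, num_majority, label="items"):
    """Return the minority count, raising early if either group would be empty."""
    num_minority = total - num_majority
    # Assumption: both groups must receive at least one element
    if not 0 < num_majority < total:
        raise ValueError(
            f"{label}: majority={num_majority}, minority={num_minority}; "
            f"both groups need at least one of the {total} {label}"
        )
    return num_minority


num_items_b = check_split(10000, 5000)
print(num_items_b)
```

Calling `check_split(total_items, num_items_a)` ahead of `create_profiles` would turn a silent misconfiguration into an error message that names the offending counts.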