## PfDA Assignment 

***

Simulate data of a phenomenon:

e.g. ACL injuries in women's professional football
    injuries in women's football

- at least 100 data points required
- 4 variables:
    e.g. age, training hours per week, km covered per week, occupation (if not professional), previous injury, menstrual cycle

Use a Jupyter notebook - use images, links, code etc

Research the likely distribution of each variable and the relationships (correlations) between the variables

Simulate data by devising an algorithm



## Introduction

The phenomenon I have chosen to investigate in this project is the epidemic of Anterior Cruciate Ligament (ACL) injuries in athletes, in particular in female sport.

The ACL is one of the main stabilising ligaments in the knee. It connects the thigh bone (femur) to the shin bone (tibia). It stabilises the knee and controls abnormal motion during twisting, turning and pivoting actions.

![ACL](https://my.clevelandclinic.org/-/scassets/images/org/health/articles/16576-acl-tear)

An ACL rupture is a serious injury that most often requires surgery to repair. Recovery time from an ACL rupture is typically in the range of 7-12 months from injury to return to play.

As ACL injuries most often occur through twisting, pivoting and changing direction, they are most common in athletes who compete in sports that involve these actions, in particular soccer and basketball. Here in Ireland, ACL injuries are also very common in GAA athletes.

ACL injuries are also particularly prevalent in females, who are 4-8 times more likely than males to suffer an ACL injury.

Through some research (and my own personal experience as someone who has suffered the dreaded ruptured ACL!), I have found that the following variables have a close relationship with ACL injury:

- sex
- age
- mode of injury (in training/competition)
- stage in menstrual cycle (female only)
- previous injury
- hours of training per week


## Discussion on Variables

### Sex

Female athletes are reported to be 4-8 times more likely than male athletes to suffer an ACL injury.
https://pubmed.ncbi.nlm.nih.gov/9784805/

### Age

- Males: the injury ratio is consistent from the 15-19 bracket through 30-34, then starts to drop off.
- Females: the ratio is extremely high at 15-19, then drops but stays consistent up to 35-39 before falling further.

https://stillmed.olympics.com/media/Documents/Athletes/Medical-Scientific/Consensus-Statements/2008_non-contact-ACL-injuries-female-athletes.pdf

After the age of 30, athletes begin to retire.


Age bracket data (relative injury ratios):

| Age bracket | Female | Male |
|-------------|--------|------|
| 10-14 | 1 | 0 |
| 15-19 | 8.5 | 3.5 |
| 20-24 | 3 | 3.5 |
| 25-29 | 3 | 3.5 |
| 30-34 | 3 | 3.5 |
| 35-39 | 3 | 3 |
| 40-44 | 2 | 2 |
| >45 | 1.5 | 1 |


### Hours of Training per Week
Athletes who train more than 10 hours per week are injured at a ratio of roughly 7.5:1 compared with those who train fewer than 10 hours.
https://bjsm.bmj.com/content/51/4/392.3

### Previous Injury
A prior ACL injury is strongly linked to re-injury of the ligament: a previously injured athlete is roughly 6 times more likely to injure the ACL than someone who has not injured it before.
https://journals.sagepub.com/doi/10.1177/0363546514530088
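
Previous injury is not synthesised later in this notebook, so the following is only a minimal sketch of how it could be sampled. It assumes the same simplification used for the other variables, namely that a 6:1 ratio can be read as 6 in every 7 injured athletes having a prior ACL injury. The name `previous_injury` and the seed value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only so the sketch is repeatable

# Assumption: a 6:1 ratio means 6 of every 7 injured athletes had a prior ACL injury
p_previous = 6 / 7
previous_injury = rng.choice(["Yes", "No"], p=[p_previous, 1 - p_previous], size=200)

# Count how many athletes fall into each category
values, counts = np.unique(previous_injury, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```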

### Menstrual Cycle Stage

Approximate share of ACL injuries occurring in each stage:

- Follicular (Day 1-9): approx. 37%
- Ovulatory (Day 10-14): approx. 26%
- Luteal (Day 15-28): approx. 37%

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC164356/#:~:text=We%20found%20that%2026%20of,onset%20of%20menses%20(Figure%20%E2%80%8B

### Mode of Injury

ACL injuries are more frequent in competition than in training/practice, at a ratio of approximately 7:1.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3867093/

## Dataset Synthesis

To create data based on the research above, I will use the NumPy random number generator. Each variable will be synthesised by sampling categories with probabilities derived from the ratios in the research above.
A dataset with 200 rows will be created. It will represent data for 200 athletes who have suffered ACL injuries. The variables in the dataset will be:
- Sex
- Age
- Mode of Injury
- Training Hours per week
- Menstrual Cycle Stage (Female only)
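
Because every variable is drawn at random, each run of the notebook gives a different dataset. One possible way to make the results repeatable is to pass a seed to `np.random.default_rng`. The sketch below is illustrative only: the seed value `2023` is a hypothetical choice, and the weights are the ones used for the sex variable in the next cell.

```python
import numpy as np

SEED = 2023  # hypothetical value; any integer works

# Two generators created from the same seed
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

draw_a = rng_a.choice(["Male", "Female"], p=[0.17, 0.83], size=5)
draw_b = rng_b.choice(["Male", "Female"], p=[0.17, 0.83], size=5)

# Generators built from the same seed produce identical samples
print(np.array_equal(draw_a, draw_b))
```

Leaving the seed out, as the cells in this notebook do, gives fresh random data on every run.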

In [8]:
# Import packages

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random


In [2]:
# Sex Variable data
# I took a value of 5:1 females to males as per research, erring on the side of caution

rng = np.random.default_rng()
sex_choice = ["Male", "Female"]
sex = rng.choice(sex_choice, p =[0.17, 0.83], size=200)

sex


array(['Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female',
       'Female', 'Female', 'Male', 'Female', 'Female', 'Female', 'Male',
       'Male', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female',
       'Female', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
       'Female', 'Female', 'Female', 'Female', 'Male', 'Femal
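
A quick way to check that a sample roughly follows the intended 5:1 split is to count the categories. This is a small sketch that redraws the `sex` array so it can run on its own; the exact counts will differ from run to run.

```python
import numpy as np

rng = np.random.default_rng()
sex = rng.choice(["Male", "Female"], p=[0.17, 0.83], size=200)

# Count each category and express it as a share of the sample
labels, counts = np.unique(sex, return_counts=True)
shares = counts / counts.sum()

for label, count, share in zip(labels, counts, shares):
    print(f"{label}: {count} athletes ({share:.1%})")
```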

In [5]:
# Age Variable data
# The ratios per age bracket are converted to probabilities by dividing each by the total (female: 25, male: 20)

'''Age Bracket Data: (ratios)
Female:
10-14 = 1
15-19 = 8.5
20-24 = 3
25-29 = 3
30-34 = 3
35-39 = 3
40-44 = 2
>45 = 1.5

Male:
10-14 = 0
15-19 = 3.5
20-24 = 3.5
25-29 = 3.5
30-34 = 3.5
35-39 = 3
40-44 = 2
>45 = 1'''

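rng = np.random.default_rng()
age_choice = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", ">45"]
# Male ratios divided by their total (20)
male_p = [0, 0.175, 0.175, 0.175, 0.175, 0.15, 0.1, 0.05]
# Draw one age bracket per male athlete in the sex array (about 34 of 200 on average)
n_male = int((sex == "Male").sum())
male_age = rng.choice(age_choice, p=male_p, size=n_male)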

male_age

array(['20-24', '>45', '40-44', '25-29', '20-24', '20-24', '35-39', '>45',
       '25-29', '35-39', '25-29', '25-29', '20-24', '30-34', '15-19',
       '40-44', '>45', '30-34', '>45', '15-19', '15-19', '25-29', '40-44',
       '20-24', '30-34', '15-19', '30-34', '30-34', '20-24', '25-29',
       '30-34', '40-44', '>45', '25-29'], dtype='<U5')

In [6]:
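# Female ratios divided by their total (25)
female_p = [0.04, 0.34, 0.12, 0.12, 0.12, 0.12, 0.08, 0.06]
# Draw one age bracket per female athlete in the sex array (about 166 of 200 on average)
female_age = rng.choice(age_choice, p=female_p, size=int((sex == "Female").sum()))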

female_age

array(['20-24', '15-19', '15-19', '15-19', '15-19', '25-29', '15-19',
       '10-14', '15-19', '15-19', '15-19', '20-24', '>45', '20-24',
       '15-19', '30-34', '40-44', '>45', '40-44', '15-19', '30-34',
       '15-19', '20-24', '25-29', '40-44', '30-34', '15-19', '25-29',
       '15-19', '10-14', '40-44', '25-29', '40-44', '15-19', '15-19',
       '30-34', '>45', '15-19', '>45', '35-39', '25-29', '30-34', '30-34',
       '20-24', '15-19', '40-44', '15-19', '15-19', '35-39', '30-34',
       '15-19', '40-44', '40-44', '40-44', '>45', '15-19', '30-34',
       '35-39', '20-24', '15-19', '15-19', '30-34', '10-14', '20-24',
       '30-34', '25-29', '>45', '15-19', '25-29', '20-24', '15-19',
       '20-24', '40-44', '15-19', '10-14', '20-24', '15-19', '35-39',
       '35-39', '25-29', '25-29', '20-24', '10-14', '25-29', '15-19',
       '15-19', '20-24', '30-34', '15-19', '35-39', '>45', '15-19',
       '30-34', '40-44', '30-34', '30-34', '20-24', '20-24', '35-39',
       '20-24', '25-29', 
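
The two cells above produce separate male and female arrays that are not lined up with the rows of `sex`. The following is a minimal sketch of one way to build a single `age` column instead, assuming the ratios from the Age section are simply normalised to probabilities. The names `ratios`, `probs` and `age` are illustrative, not part of the original cells.

```python
import numpy as np

rng = np.random.default_rng()
age_choice = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", ">45"]

# Relative ratios per age bracket, as listed in the Age section
ratios = {
    "Female": np.array([1, 8.5, 3, 3, 3, 3, 2, 1.5]),
    "Male": np.array([0, 3.5, 3.5, 3.5, 3.5, 3, 2, 1]),
}
# Dividing by the total turns each set of ratios into probabilities that sum to 1
probs = {s: r / r.sum() for s, r in ratios.items()}

sex = rng.choice(["Male", "Female"], p=[0.17, 0.83], size=200)

# Fill one column so that row i of `age` belongs to row i of `sex`
age = np.empty(sex.size, dtype="<U5")
for s, p in probs.items():
    mask = sex == s
    age[mask] = rng.choice(age_choice, p=p, size=int(mask.sum()))

print(age[:10], sex[:10])
```

Normalising in code also avoids typing the probabilities by hand, which makes it easier to change a ratio later.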

In [39]:
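# Training Hours Variable data
# I took a value of 7.5:1 as per research, so 1 in 8.5 athletes (approx 0.12) trains fewer than 10 hours
# Note: randrange is called once per band, so all athletes in a band share one value

rng = np.random.default_rng()
from random import randrange
lessthantenrange = randrange(3, 9)       # a single integer from 3 to 8
greaterthantenrange = randrange(10, 18)  # a single integer from 10 to 17
traininghours_choice = [lessthantenrange, greaterthantenrange]
training_hours = rng.choice(traininghours_choice, p=[0.12, 0.88], size=200)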

training_hours

array([14, 14, 14, 14, 14,  3, 14, 14, 14,  3, 14, 14, 14, 14, 14, 14, 14,
       14, 14,  3, 14, 14, 14,  3, 14, 14,  3, 14,  3, 14, 14, 14,  3, 14,
       14, 14, 14, 14, 14,  3, 14, 14, 14, 14, 14, 14, 14,  3, 14, 14, 14,
        3, 14,  3, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,  3,
       14, 14, 14, 14, 14,  3, 14, 14,  3,  3, 14, 14, 14, 14, 14, 14, 14,
       14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,  3, 14, 14, 14, 14,
       14, 14, 14, 14, 14, 14, 14, 14,  3, 14, 14, 14, 14, 14,  3, 14, 14,
       14, 14, 14, 14, 14,  3, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
       14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,  3, 14, 14, 14, 14,
       14, 14, 14,  3, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,  3, 14,
       14, 14, 14, 14, 14,  3, 14, 14, 14, 14, 14, 14,  3, 14, 14, 14, 14,
       14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14])
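
In the output above every athlete has either 3 or 14 hours, because each band is drawn only once. A possible alternative, sketched below, draws a separate value for every athlete while keeping the same 0.12/0.88 split and the same integer ranges. It is an illustration under those assumptions rather than a replacement for the cell above.

```python
import numpy as np

rng = np.random.default_rng()
n = 200

# About 88% of athletes fall in the over-10-hours band, matching p=[0.12, 0.88]
over_ten = rng.random(n) < 0.88

# One value per athlete in each band (the upper bound is exclusive, like randrange)
low_hours = rng.integers(3, 9, size=n)     # 3 to 8 hours
high_hours = rng.integers(10, 18, size=n)  # 10 to 17 hours

# Pick the high-band value where over_ten is True, otherwise the low-band value
training_hours = np.where(over_ten, high_hours, low_hours)
print(training_hours[:20])
```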

In [40]:
# Mode of Injury Variable data
# I took a value of 7:1 (competition to training) as per research, i.e. 7/8 = 0.875

rng = np.random.default_rng()
modeofinjury_choice = ["Competition", "Training"]
modeofinjury = rng.choice(modeofinjury_choice, p =[0.875, 0.125], size=200)

modeofinjury

array(['Competition', 'Competition', 'Training', 'Competition',
       'Competition', 'Competition', 'Competition', 'Training',
       'Competition', 'Competition', 'Competition', 'Training',
       'Competition', 'Competition', 'Competition', 'Competition',
       'Training', 'Competition', 'Competition', 'Competition',
       'Competition', 'Competition', 'Competition', 'Competition',
       'Competition', 'Competition', 'Training', 'Competition',
       'Competition', 'Competition', 'Competition', 'Competition',
       'Competition', 'Competition', 'Training', 'Competition',
       'Competition', 'Competition', 'Competition', 'Training',
       'Competition', 'Competition', 'Training', 'Training',
       'Competition', 'Competition', 'Competition', 'Competition',
       'Competition', 'Training', 'Training', 'Competition', 'Training',
       'Competition', 'Competition', 'Competition', 'Competition',
       'Competition', 'Competition', 'Competition', 'Competition',
       'Competit
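
Each variable in this notebook is sampled independently of the others, so the share of competition injuries should be similar for both sexes apart from sampling noise. The sketch below shows one way to check this with `pd.crosstab`. It redraws both variables so it can run on its own, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()
n = 200

sex = pd.Series(rng.choice(["Male", "Female"], p=[0.17, 0.83], size=n), name="sex")
mode = pd.Series(
    rng.choice(["Competition", "Training"], p=[0.875, 0.125], size=n),
    name="mode_of_injury",
)

# Row-normalised table: share of each injury mode within each sex
table = pd.crosstab(sex, mode, normalize="index")
print(table.round(2))
```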

In [43]:
# Menstrual Cycle Variable data
# From research, there are three main stages: 
'''Follicular (Day 1-9) approx 37%
Ovulatory (Day 10 - 14) approx 26%
Luteal (Day 15-28) approx 37%'''

rng = np.random.default_rng()
menstrualcycle_choice = ["Follicular (Day 1-9)", "Ovulatory (Day 10-14)", "Luteal (Day 15-28)"]
# Female only: draw one stage per female athlete in the sex array
menstrualcycle = rng.choice(menstrualcycle_choice, p=[0.37, 0.26, 0.37], size=int((sex == "Female").sum()))

menstrualcycle

array(['Luteal (Day 15-28)', 'Luteal (Day 15-28)',
       'Ovulatory (Day 10-14)', 'Follicular (Day 1-9)',
       'Follicular (Day 1-9)', 'Follicular (Day 1-9)',
       'Follicular (Day 1-9)', 'Follicular (Day 1-9)',
       'Luteal (Day 15-28)', 'Follicular (Day 1-9)',
       'Ovulatory (Day 10-14)', 'Follicular (Day 1-9)',
       'Follicular (Day 1-9)', 'Luteal (Day 15-28)', 'Luteal (Day 15-28)',
       'Follicular (Day 1-9)', 'Luteal (Day 15-28)',
       'Ovulatory (Day 10-14)', 'Luteal (Day 15-28)',
       'Luteal (Day 15-28)', 'Luteal (Day 15-28)', 'Luteal (Day 15-28)',
       'Follicular (Day 1-9)', 'Ovulatory (Day 10-14)',
       'Luteal (Day 15-28)', 'Luteal (Day 15-28)', 'Follicular (Day 1-9)',
       'Luteal (Day 15-28)', 'Follicular (Day 1-9)',
       'Ovulatory (Day 10-14)', 'Luteal (Day 15-28)',
       'Luteal (Day 15-28)', 'Follicular (Day 1-9)',
       'Follicular (Day 1-9)', 'Ovulatory (Day 10-14)',
       'Ovulatory (Day 10-14)', 'Ovulatory (Day 10-14)',
       'Follicu
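
Since the menstrual cycle stage applies to female athletes only, the value needs to be left empty for male rows once the variables are combined. The following is a minimal sketch of one way to do this in a pandas DataFrame. The column names are hypothetical, and only two of the variables are included to keep the example short.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()
n = 200

sex = rng.choice(["Male", "Female"], p=[0.17, 0.83], size=n)
stages = ["Follicular (Day 1-9)", "Ovulatory (Day 10-14)", "Luteal (Day 15-28)"]
cycle = rng.choice(stages, p=[0.37, 0.26, 0.37], size=n)

df = pd.DataFrame({"sex": sex, "menstrual_cycle_stage": cycle})

# Keep the stage for female rows only; male rows become missing values
df["menstrual_cycle_stage"] = df["menstrual_cycle_stage"].where(df["sex"] == "Female")

print(df.head(10))
```

The other arrays (age, training hours, mode of injury) could be added as further columns in the same way, provided each has one value per row of `sex`.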