# Project Assignment
- Choose a real-world phenomenon that can be measured and for which you could collect at least one hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.


### Project Overview:
The real-world phenomenon I have chosen to examine is the recidivism of Irish prisoners following release from prison.

For this project, I will create a simulated dataset of the factors which contribute to the recidivism of Irish prisoners, concentrating on the risk factors for reoffending within 1 year of release from incarceration. To do this, I will focus on 4 key variables, researching each one and exploring the relationships and possible correlations between them.

The variables involved in this project are:
1. Age of the prisoner
2. Gender of the prisoner
3. Economic/education level of prisoner
4. Type of Offence

## The Phenomenon I have chosen to simulate

The recidivism rate of those released from incarceration is an issue which arises again and again. As the population of Ireland continues to grow, pressure increases on many resources, prison capacity included. Thus, many people question the goal of incarceration itself, whether it is fit for purpose, and whether there could be an alternative to it.
The main theories that underpin incarceration are incapacitation, rehabilitation, reparation, deterrence and retribution.

Recidivism is particularly relevant to the theories of deterrence and rehabilitation. If the incarceration of wrong-doers is based on the notion that prison will rehabilitate them, reforming their behaviour and providing educational and emotional support, then the rate of recidivism can show whether rehabilitation is actually being achieved. High rates of recidivism can demonstrate that imprisonment is not successful in reforming criminals into law-abiding members of society.

Furthermore, if prison is considered a deterrent to would-be criminals, then the rate of recidivism can demonstrate if this is being achieved. A higher rate of recidivism can demonstrate that imprisonment is not an effective deterrent to criminals. 

[https://www.unodc.org/e4j/en/crime-prevention-criminal-justice/module-7/key-issues/2--justifying-punishment-in-the-community.html]



## Why Data Simulation

The purpose of this project is to simulate a dataset. Given the large amount of data easily accessible online, an obvious question is: why create a simulated dataset instead of just using data already available?

A key advantage of simulated data is that it allows machine learning, deep learning and other analysis to be conducted on synthesised data, which is especially helpful when collecting real-world data would be too costly or too time-consuming. It also facilitates the exploration of alternative outcomes to potential situations by creating various synthesised data sets.

As such, to complete this project, I am not recreating an already complete dataset. Rather, I will investigate and research each of the variables involved and model a synthesised dataset based on the results of this research.

https://mostly.ai/blog/data-simulation

## Benefit of synthesised data for this specific situation

With regard to reoffending by Irish prisoners, a simulated data set can be useful for predicting which prisoners are at greater risk of reoffending. This allows government agencies, prisons, police, and other relevant organisations to alter the current system to reduce the likelihood of reoffending for those most at risk. It also allows them to examine any processes which are not working effectively. Because the data is synthesised, potential changes could themselves be simulated to predict how successful they would be, which is one of the benefits of simulating data.

## Considerations to note

The recidivism rate of criminals is a very complex issue. It is not within the remit of this project to comprehensively cover all of the factors which contribute to recidivism. In particular, factors such as relationship breakdowns, mental health issues, traumatic events and difficult upbringings can all contribute significantly to the recidivism of individual offenders. However, such variables are difficult to measure accurately and are greatly influenced by the personal experiences and motivations of each individual.
As a result, this project will instead focus on 4 key variables in order to examine the factors which contribute to a prisoner reoffending within 1 year of release from custody. The 4 variables are:

1. Age - categorical ordinal (binned into age groups)
2. Gender - categorical nominal
3. Type of offence - categorical nominal
4. Length of imprisonment - 
5. Education/economic level of prisoner - categorical ordinal


#### ISSUES WITH GARDA DATA


https://www.felonyrecordhub.com/wp-content/uploads/2021/04/Recidivism-Header.jpg

## Age
The age of the prisoner is 

In [None]:
# Here I am importing the various libraries I require to complete the data analysis and simulation
import numpy as np
import seaborn as sea
import pandas as pd
import matplotlib.pyplot as plt
import sys





# VARIABLE 1: AGE

In [None]:
# Age groups of released prisoners, with proportions per https://data.cso.ie/
Ages = ['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+']
# Each draw is a single categorical trial, so the resulting group counts follow a multinomial distribution
age = np.random.choice(Ages, 2604, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079])
#print(age)

sea.countplot(x=age, order=Ages)
plt.title('Distribution of Age Groups')
plt.show()



# Counts per age group corresponding to the proportions used above (total 2604)
Ages = ['Under 21', '21-25', '26-30',
        '31-35', '36-40', '41-50', '50+']
Count = [150, 504, 550, 480, 347, 368, 205]
 
fig = plt.figure(figsize =(10, 7))
plt.pie(Count, labels = Ages)
plt.title('Age Groups')

plt.show()
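The proportions passed to `np.random.choice` above appear to be the age-group counts divided by their total of 2604. As a minimal sketch (the names `age_groups`, `counts` and `age_probs` and the seed value are my own assumptions, not part of the original analysis), the probabilities can be derived directly from the counts and the simulated proportions compared against the targets:

```python
import numpy as np

# Age groups and counts as used in the pie chart above
age_groups = ['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+']
counts = np.array([150, 504, 550, 480, 347, 368, 205])

# Dividing by the total guarantees the probabilities sum to 1
age_probs = counts / counts.sum()

# Seeded generator so the sample is reproducible (the seed is an arbitrary assumption)
rng = np.random.default_rng(42)
simulated_age = rng.choice(age_groups, size=counts.sum(), p=age_probs)

# Compare target and simulated proportions for each age group
labels, sim_counts = np.unique(simulated_age, return_counts=True)
sim_props = dict(zip(labels, sim_counts / sim_counts.sum()))
for group, target in zip(age_groups, age_probs):
    print(f"{group:>8}: target {target:.3f}, simulated {sim_props.get(group, 0):.3f}")
```

Deriving the probabilities from the counts avoids rounding error in hand-typed proportions, and the comparison gives a quick check that the simulated sample resembles the source distribution.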

# VARIABLE 2: GENDER

The variable gender is a categorical nominal variable. In this project it takes two values, male and female.


*While conducting research on this variable, there was no accessible information on the distribution of prisoners who identify outside of the binary genders male and female. Thus, for the purposes of this project, the variable gender is being synthesised based on male and female only, and does not include information on other gender identities. 

[Categorical data can also take on numerical values for example 1 = female and 0 = male.]
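The numeric coding mentioned above could look something like the following sketch. It assumes the same 7.5% share of female prisoners used elsewhere in this notebook; the names `gender_code` and `gender_label` and the seed are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed (assumption) for reproducibility

# One Bernoulli trial per prisoner: 1 = female (p = 0.075), 0 = male
gender_code = rng.binomial(n=1, p=0.075, size=2604)

# Map the numeric codes back to readable labels
gender_label = pd.Series(gender_code).map({0: 'Male', 1: 'Female'})

# Proportion of each label in the simulated sample
print(gender_label.value_counts(normalize=True))
```

Keeping the numeric codes alongside the labels can be convenient later, since many modelling libraries expect numeric input.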

In [None]:
Genders = ['Male', 'Female'] 
result = np.random.choice(Genders, 2604, p=[0.925, 0.075])  # one Bernoulli trial per prisoner
#print(result)
# An equivalent method is a binomial distribution with 1 trial per prisoner and probability 0.075 of 'Female':
#result = np.random.binomial(1, 0.075, 2604)  # 1 = Female, 0 = Male

sea.countplot(x=result, order=Genders)
plt.title('Distribution of Gender') 
plt.show()

# VARIABLE 3: EDUCATION STATUS

In [None]:
# Setting out the variables for visualisation 
Economic_Status = ['No employment nor education', 'substantial employment only', 'education and training only',
                   'education, training and substantial employment', 'Not identified' ]

# As these variable names are too long to display on the x axis, even when rotated, I have created a shortened version
Shorter_Name = ['No Edu or Emp', 'Emp Only', 'Edu & Training Only', 'Edu, Training & Emp Only',
                        'Not Identified']

# I have created a dictionary to show which Economic status variable corresponds to the shortened version
label_dict = dict(zip(Shorter_Name, Economic_Status))

# zip() pairs each shortened label with its full name, and dict() converts those pairs into a lookup dictionary
Status = np.random.choice(Shorter_Name, 2604, p=[0.298, 0.047, 0.264, 0.028, 0.363])

ax = sea.countplot(x=Status, order=Shorter_Name)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
plt.title('Distribution of Economic Status') # setting the title
plt.tight_layout()
plt.show()

# Source: https://www.cso.ie/en/releasesandpublications/fp/p-ofdfo/offenders2016employmenteducationandotheroutcomes2016-2019/introduction/
# https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot
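The label dictionary built above is only useful if the shortened labels are later translated back. A minimal sketch of that lookup follows, assuming a pandas Series of shortened labels; the names `short_names`, `full_names`, `sample` and `full` are hypothetical.

```python
import pandas as pd

short_names = ['No Edu or Emp', 'Emp Only', 'Edu & Training Only',
               'Edu, Training & Emp Only', 'Not Identified']
full_names = ['No employment nor education', 'substantial employment only',
              'education and training only',
              'education, training and substantial employment', 'Not identified']

# Same construction as in the cell above: short label -> full label
label_dict = dict(zip(short_names, full_names))

# Translate a few shortened labels back to their full descriptions
sample = pd.Series(['Emp Only', 'Not Identified', 'No Edu or Emp'])
full = sample.map(label_dict)
print(full.tolist())
```

The same `.map(label_dict)` call could be applied to a column of the final DataFrame if the full descriptions are preferred for reporting.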

# VARIABLE 4: TYPE OF ORIGINAL OFFENCE

In [None]:
Original_Offence_Type = ['Homicide Offences', 'Sexual Offences', 'Attempts/threats to murder, assaults, harassments and related offences',
                        'Dangerous or negligent acts', 'Kidnapping & related offences', 'Robbery, extortion and hijacking offences',
                        'Burglary and related offences', 'Theft and related offences', 'Fraud, deception and related offences', 
                        'Controlled drug offences', 'Weapons and Explosives Offences', 'Damage to property and to the environment',
                        'Public order and other social code offences', 'Road and traffic offences', 'Offences against government, justice procedures and organisation of crime',
                        'Offences not elsewhere classified']

Shortened_Offence = ['Homicide', 'SO', 'Atmptd murder/assault', 'Negligence', 'Kidnapping', 
                    'Robbery', 'Burglary', 'Theft', 'Fraud', 'DO', 'WO', 'PD', 'Public Order',
                    'RTO', 'Offences agnst Gov', 'Unclassified']

label_dict = dict(zip(Shortened_Offence, Original_Offence_Type))


probabilities = [0.0096, 0.03226, 0.1321, 0.0411, 0.00384, 0.01728, 0.07411, 
                 0.22772,0.02457, 0.09254, 0.04032, 0.05299, 0.05798, 0.10637,
                 0.06835, 0.01881]

# The listed proportions do not sum to exactly 1, so rescale them before sampling
probabilities = [i/sum(probabilities) for i in probabilities]

Offence = np.random.choice(Shortened_Offence, 2604, p=probabilities)
plt.figure(figsize=(12, 8))
ax = sea.countplot(x=Offence, order=Shortened_Offence)
ax.set_yticks(np.arange(0, 640, 40))  # y-axis ticks every 40 so the counts are easier to read
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, fontsize=8)
plt.title('Distribution of Original Offence Type')  # setting the title
plt.tight_layout()
plt.show()

https://stackoverflow.com/questions/42528921/how-to-prevent-overlapping-x-axis-labels-in-sns-countplot

https://stackoverflow.com/questions/55085336/how-to-set-y-axes-limits-in-countplot
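The rescaling step in the offence cell matters because `np.random.choice` checks that the supplied probabilities sum to 1. The sketch below uses the same sixteen proportions to illustrate this. The shortfall is presumably due to rounding in the source figures, which is an assumption on my part, and the names `raw`, `total`, `raised` and `probs` are hypothetical.

```python
import numpy as np

# Offence-type proportions as listed in the cell above
raw = [0.0096, 0.03226, 0.1321, 0.0411, 0.00384, 0.01728, 0.07411,
       0.22772, 0.02457, 0.09254, 0.04032, 0.05299, 0.05798, 0.10637,
       0.06835, 0.01881]

total = sum(raw)
print("Sum of raw proportions:", total)

# NumPy rejects probability vectors that do not sum to 1 (within a small tolerance)
raised = False
try:
    np.random.choice(len(raw), 5, p=raw)
except ValueError as err:
    raised = True
    print("Unscaled probabilities rejected:", err)

# Dividing by the total restores a valid probability vector
probs = np.array(raw) / total
draws = np.random.choice(len(raw), 5, p=probs)
print("Rescaled sum:", probs.sum())
```

Rescaling spreads the small shortfall proportionally across all categories, so the relative sizes of the offence types are unchanged.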

# SIMULATING THE DATA SET

In [None]:
n = 2604  # number of released prisoners to simulate

# Each variable is sampled independently from its own distribution
listGender = np.random.choice(Genders, n, p=[0.925, 0.075])
listAge = np.random.choice(Ages, n, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079])
listEconomicStatus = np.random.choice(Shorter_Name, n, p=[0.298, 0.047, 0.264, 0.028, 0.363])
listOffence = np.random.choice(Shortened_Offence, n, p=probabilities)

# creating a dictionary for the dataframe
d = {'Gender': listGender, 'Age': listAge, 'Economic Status': listEconomicStatus, 'Offence Type': listOffence}
df = pd.DataFrame(data=d)
print(df)
# For categorical columns, describe() reports count, unique, top and freq
print(df.describe())
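The overview mentions exploring relationships between the variables. Because each column above is drawn independently, the simulated variables should show no association beyond sampling noise. One way to check this is a row-normalised cross-tabulation, sketched below on a stand-alone two-column example; `df_demo`, `table` and the seed are hypothetical names and values used only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed (assumption)
n = 2604

ages = ['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+']
age_p = [0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079]

# Two independently sampled columns, mirroring the cell above
df_demo = pd.DataFrame({
    'Gender': rng.choice(['Male', 'Female'], n, p=[0.925, 0.075]),
    'Age': rng.choice(ages, n, p=age_p),
})

# Share of each age group within each gender (each row sums to 1)
table = pd.crosstab(df_demo['Gender'], df_demo['Age'], normalize='index')
table = table.reindex(columns=ages, fill_value=0.0)
print(table.round(3))
```

If both rows stay close to the overall age proportions, the simulation contains no gender-age relationship. Modelling a real dependence would require conditional probabilities, for example a separate age distribution for each gender.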