# Project Assignment
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.


### Project Overview:
The real world phenomenon which I have chosen to examine is the issue of recidivism of Irish prisoners on release from prison. 

For this project, I will create a simulated dataset on the factors which contribute to the recidivism of Irish prisoners. I will investigate the risk factors for reoffending within 1 year of release from incarceration. To do this, I will focus on four key variables, research each of them, and explore their relationships and any possible correlations.

The variables involved in this project are:
1. Age of the prisoner
2. Gender of the prisoner
3. Economic situation/education level of prisoner 
4. Original Offence type which caused imprisonment 

## The Phenomenon I have chosen to simulate

The recidivism rate of those released from incarceration is an issue which arises again and again. As the population of Ireland continues to grow, pressure is placed on many resources, prison capacity included. Thus, many people look at the goal of incarceration itself, and whether it is fit for purpose or if there could be an alternative to incarceration.
The main theories that underpin incarceration are incapacitation, rehabilitation, reparation, deterrence and retribution.

The issue of recidivism is particularly relevant to the theories of deterrence and rehabilitation. If the incarceration of wrong-doers is based on the notion that prison will rehabilitate them, reforming their behaviour and providing educational and emotional support, then the rate of recidivism can show whether rehabilitation is actually being achieved. High rates of recidivism can demonstrate that imprisonment is not successful in reforming criminals into law-abiding members of society.

Furthermore, if prison is considered a deterrent to would-be criminals, then the rate of recidivism can demonstrate if this is being achieved. A higher rate of recidivism can demonstrate that imprisonment is not an effective deterrent to criminals. 



## Why Data Simulation

The purpose of this project is to simulate a dataset. Given the large amount of data easily accessible online, this raises the question: why create a simulated dataset instead of just using data already available?

A key advantage to using simulated data is that it allows machine learning, deep learning and other analysis to be conducted on synthesised data, which is especially helpful when the collection of real world data may be too costly or too time consuming to conduct. It also facilitates the exploration of alternative outcomes to potential situations by creating various synthesised data sets.

As such, to complete this project, I am not recreating an already complete dataset. Rather, I will investigate and research each of the variables involved and model a synthesised dataset based on the results of this research.


## Benefit of synthesised data for this specific situation

With regard to the issue of reoffending of Irish prisoners, a simulated data set can be useful for predicting which prisoners are at a greater risk of reoffending. This allows Government agencies, prisons, police, and other relevant organisations to alter the current system to reduce the likelihood of reoffending for those most at risk. It also allows them to examine any processes which are not working effectively. As the data has been synthesised, any potential changes could also be simulated to predict their success.

## Considerations to note

The recidivism rate of criminals is a very complex issue. It is not within the remit of this project to comprehensively cover all of the factors which contribute to the recidivism of criminals. In particular, factors such as relationship breakdowns, mental health issues, traumatic events and difficult upbringings can all contribute significantly to the recidivism of individual offenders. However, such variables are difficult to measure accurately and are greatly influenced by the personal experiences and motivations of each individual.
As a result, this project will instead focus on 4 key variables in order to examine the factors which contribute to a prisoner reoffending within 1 year of release from custody.


![image](https://www.felonyrecordhub.com/wp-content/uploads/2021/04/Recidivism-Header.jpg)

In [None]:
# Here I am importing the various libraries I require to complete the data analysis and simulation
import numpy as np
import seaborn as sea
import pandas as pd
import matplotlib.pyplot as plt
import sys

# VARIABLE 1: AGE

For this project, the variable age is a categorical, ordinal variable with an inherent natural ranking, as age is provided in age brackets rather than as individual ages, which would be a continuous, numerical variable. This variable provides information on the age distribution of prisoners upon release from a custodial sentence in 2017, and this information was obtained from the CSO.
However, the information was provided in age brackets, rather than giving the exact count of each age. As such, none of the numpy.random distribution functions, such as numpy.random.randint, were applicable here. Rather, I passed the percentage breakdown of each age bracket to the function numpy.random.choice, which allowed me to model more accurate simulated data.
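The percentage breakdown can be derived directly from the bracket counts rather than typed in by hand. The following is a minimal sketch of that step, assuming the counts used for the pie chart in this notebook; the names `age_brackets`, `counts`, `proportions` and `rng`, and the seed value, are illustrative choices rather than part of the original analysis.

```python
import numpy as np

# Age brackets and released-prisoner counts, as used for the pie chart in this notebook.
age_brackets = ['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+']
counts = np.array([150, 504, 550, 480, 347, 368, 205])

# Convert counts to proportions so they can be passed as `p` to a choice function.
proportions = counts / counts.sum()

# The generator and its seed are illustrative assumptions.
rng = np.random.default_rng(seed=42)
simulated_ages = rng.choice(age_brackets, size=int(counts.sum()), p=proportions)

# Compare the simulated share of each bracket with the target share.
labels, simulated_counts = np.unique(simulated_ages, return_counts=True)
simulated_share = dict(zip(labels, simulated_counts / simulated_counts.sum()))
for bracket, target in zip(age_brackets, proportions):
    print(f"{bracket:>8}: target {target:.3f}, simulated {simulated_share.get(bracket, 0):.3f}")
```

Because the proportions are computed from the counts, they sum to 1 by construction and do not depend on rounded percentages.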

In [None]:
Ages = ['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+'] # the age brackets of all released from prison in 2017
age = np.random.choice(Ages, 2604, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079]) # the probability of each age bracket
print(age) 

sea.countplot(x=age, order=Ages) # ensuring the countplot displays in the order I have specified
plt.title('Distribution of Age Groups')
plt.show()

# Creating dataset for pie chart
Ages = ['Under 21', '21-25', '26-30',
        '31-35', '36-40', '41-50', '50+']
Count = [150, 504, 550, 480, 347, 368, 205] # matching the count of each of the age brackets for the pie chart
 
fig = plt.figure(figsize =(10, 7)) #setting the size of the figure
plt.pie(Count, labels = Ages)
plt.title('Age Groups of those released from Custodial Sentence') # setting the title
plt.show()

def Offender_Age(Age): # I have created a function to simulate whether a prisoner reoffends, using the reoffending rate of their age group
    if Age == 'Under 21':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.84, 0.16]) # if the under 21 age group is chosen, then 84% are likely to reoffend, and 16% will not reoffend
    if Age == '21-25':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.72, 0.28])
    if Age == '26-30':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.7, 0.3])
    if Age == '31-35':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.62, 0.38])
    if Age == '36-40':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.55, 0.45])
    if Age == '41-50':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.47, 0.53])
    if Age == '50+':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.27, 0.73])


The above countplot and pie chart show that the majority of those released from prison are between the ages of 21 and 40. The largest age group is 26-30, with 21-25 a close second. However, it cannot simply be assumed that the most frequently occurring age bracket also has the highest rate of reoffending, so the recidivism rate had to be set separately for each bracket. Thus, I have created the Offender_Age function, which will explore the rate of reoffending of each age group within the simulated dataset.
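The same per-bracket rates can also be held in a lookup table and applied to the whole column at once. This is only a sketch of an alternative to the chain of `if` statements, not a replacement for Offender_Age: the names `REOFFEND_RATE_BY_AGE` and `simulate_reoffending`, and the seed, are hypothetical, while the rates themselves are copied from the function above.

```python
import numpy as np
import pandas as pd

# Reoffending probabilities per age bracket, copied from Offender_Age.
REOFFEND_RATE_BY_AGE = {
    'Under 21': 0.84, '21-25': 0.72, '26-30': 0.70, '31-35': 0.62,
    '36-40': 0.55, '41-50': 0.47, '50+': 0.27,
}

rng = np.random.default_rng(seed=0)  # illustrative seed


def simulate_reoffending(ages, rng):
    """Return 'Reoffend' or 'Not Reoffend' for each age bracket in `ages`."""
    rates = pd.Series(ages).map(REOFFEND_RATE_BY_AGE).to_numpy()
    draws = rng.random(len(rates))  # one uniform draw per prisoner
    return np.where(draws < rates, 'Reoffend', 'Not Reoffend')


ages = rng.choice(list(REOFFEND_RATE_BY_AGE), size=2604,
                  p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079])
outcome = simulate_reoffending(ages, rng)

# Empirical reoffending rate per bracket; it should sit close to the table above.
check = (pd.DataFrame({'Age': ages, 'Outcome': outcome})
         .assign(Reoffended=lambda d: d['Outcome'] == 'Reoffend')
         .groupby('Age')['Reoffended'].mean())
print(check.round(2))
```

Grouping the simulated outcomes by age bracket gives a quick check that each bracket reoffends at roughly its intended rate.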

# VARIABLE 2: GENDER

The variable gender is a categorical, nominal variable, as it takes one of two unordered values, male and female.


*While conducting research on this variable, there was no accessible information on the distribution of prisoners who identify outside of the binary genders male and female. Thus, for the purposes of this project, the variable gender is being synthesised based on male and female only, and does not include information on other gender identities. 

[Categorical data can also take on numerical values for example 1 = female and 0 = male.]
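As a small illustration of that numeric coding, the sketch below converts gender labels to 0 and 1 in two ways. The sample values are hand-written for illustration only, and the names `gender`, `gender_code` and `gender_cat` are assumptions.

```python
import pandas as pd

# A small hand-written sample; the values are illustrative only.
gender = pd.Series(['Male', 'Female', 'Male', 'Male', 'Female'], name='Gender')

# Explicit mapping, following the 1 = female, 0 = male convention mentioned above.
gender_code = gender.map({'Male': 0, 'Female': 1})

# Alternatively, a pandas Categorical keeps the labels and exposes integer codes.
gender_cat = pd.Categorical(gender, categories=['Male', 'Female'])

print(gender_code.tolist())
print(gender_cat.codes.tolist())
```

The Categorical form may be preferable when the labels are still needed for plotting, since the codes and the labels stay linked.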

In [None]:
Genders = ['Male', 'Female'] 
Gender_count = np.random.choice(Genders, 2604, p=[0.925, 0.075]) # using this numpy function and setting the probability of each gender occurring within the 2604 prisoners released
#print(Gender_count) commented out for clarity
# another method would be a binomial distribution with one trial per prisoner and the probability of that prisoner being female,
# e.g. np.random.binomial(1, 0.075, 2604); however, this returns 0/1 values rather than the labels I am working with

sea.countplot(x=Gender_count, order=Genders)
plt.title('Distribution of Gender of those released from Prison') 
plt.show()

In [None]:
def Offender_Gender(Gender): # I have created a function which breaks down the rate of reoffending by gender 
    if Gender == 'Male':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.61, 0.39])
    if Gender == 'Female':
        return np.random.choice(['Reoffend', 'Not Reoffend'], p=[0.6, 0.4])



The above countplot shows that the vast majority of those released from prison are male.

It was then necessary to set a function to show the recidivism rate by gender. Though significantly more males than females are released from prison, it cannot be assumed that males are therefore more likely to reoffend. Thus, I have created the Offender_Gender function, which will explore the rate of reoffending of each gender within the simulated dataset.

# VARIABLE 3: EDUCATION STATUS

In [None]:
# Setting out the variables for visualisation 
Economic_Status = ['No employment nor education', 'substantial employment only', 'education and training only',
                   'education, training and substantial employment', 'Not identified' ]

# As these variable names are too long to display on the x axis, even when rotated, I have created a shortened version
Shorter_Name = ['No Edu or Emp', 'Emp Only', 'Edu & Training Only', 'Edu, Training & Emp Only',
                        'Not Identified']

# I have created a dictionary to show which Economic status variable corresponds to the shortened version for clarity
Label_Dict = dict(zip(Shorter_Name, Economic_Status))

# zip() pairs each shortened name with its full label, and dict() converts those pairs into a dictionary
Status = np.random.choice(Shorter_Name, 2604, p=[0.298, 0.047, 0.264, 0.028, 0.363])

ax = sea.countplot(x=Status, order=Shorter_Name)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40)
plt.title('Distribution of Economic Status') # setting the title
plt.tight_layout()
plt.show()

# VARIABLE 4: TYPE OF ORIGINAL OFFENCE

In [None]:
Original_Offence_Type = ['Homicide Offences', 'Sexual Offences', 'Attempts/threats to murder, assaults, harassments and related offences',
                        'Dangerous or negligent acts', 'Kidnapping & related offences', 'Robbery, extortion and hijacking offences',
                        'Burglary and related offences', 'Theft and related offences', 'Fraud, deception and related offences', 
                        'Controlled drug offences', 'Weapons and Explosives Offences', 'Damage to property and to the environment',
                        'Public order and other social code offences', 'Road and traffic offences', 'Offences against government, justice procedures and organisation of crime',
                        'offences not elsewhere classified']

Shortened_Offence = ['Homicide', 'SO', 'Atmptd murder/assault', 'Negligence', 'Kidnapping', 
                    'Robbery', 'Burglary', 'Theft', 'Fraud', 'DO', 'WO', 'PD', 'Public Order',
                    'RTO', 'Offences agnst Gov', 'Unclassified']

label_dict = dict(zip(Shortened_Offence, Original_Offence_Type))


probabilities = [0.0096, 0.03226, 0.1321, 0.0411, 0.00384, 0.01728, 0.07411, 
                 0.22772,0.02457, 0.09254, 0.04032, 0.05299, 0.05798, 0.10637,
                 0.06835, 0.01881]

probabilities = [i/sum(probabilities) for i in probabilities]   # rescaling the probabilities so that they sum to 1

Offence = np.random.choice(Shortened_Offence, 2604, p=probabilities)
plt.figure(figsize=(12, 8))
plt.yticks(np.arange(0, 640, 40)) # changing the tick interval so the plot is easier to read
ax = sea.countplot(x=Offence, order=Shortened_Offence)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, fontsize=8)
plt.title('Distribution of Original Offence Type') # setting the title
plt.tight_layout()
plt.show()
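The rescaling step in the cell above matters because the listed offence proportions do not appear to add up to exactly 1, most likely through rounding, and `np.random.choice` rejects a probability vector that does not. The sketch below illustrates this under that assumption; the names `raw`, `rescaled` and `raised` are introduced only for this example.

```python
import numpy as np

# Offence-type proportions as listed in the cell above.
raw = np.array([0.0096, 0.03226, 0.1321, 0.0411, 0.00384, 0.01728, 0.07411,
                0.22772, 0.02457, 0.09254, 0.04032, 0.05299, 0.05798, 0.10637,
                0.06835, 0.01881])

print(f"sum before rescaling: {raw.sum():.5f}")

# Dividing by the total spreads the small shortfall proportionally across categories.
rescaled = raw / raw.sum()
print(f"sum after rescaling:  {rescaled.sum():.5f}")

# np.random.choice raises a ValueError when the probabilities do not sum to 1.
raised = False
try:
    np.random.choice(len(raw), size=5, p=raw)
except ValueError as err:
    raised = True
    print("unscaled vector rejected:", err)

# The rescaled vector is accepted.
sample = np.random.choice(len(raw), size=5, p=rescaled)
```

Rescaling keeps the relative sizes of the categories unchanged, so the simulated distribution still follows the source proportions.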

# SIMULATING THE DATA SET

In [None]:
n = 2604 # the number of prisoners to simulate; named n so it is not confused with the p= probability argument

listGender = np.random.choice(Genders, n, p=[0.925, 0.075])
listAge = np.random.choice(Ages, n, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079])
listEconomicStatus = np.random.choice(Shorter_Name, n, p=[0.298, 0.047, 0.264, 0.028, 0.363])
listOffence = np.random.choice(Shortened_Offence, n, p=probabilities)

d = {'Gender': listGender, 'Age': listAge, 'Economic Status' :listEconomicStatus, 'Offence Type': listOffence}
# creating a dictionary for the dataframe
df = pd.DataFrame(data=d)


df['Reoffending Likelihood by Age'] = df['Age'].apply(Offender_Age)
df['Reoffending Likelihood by Gender'] = df['Gender'].apply(Offender_Gender)

print(df)
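Each run of the cell above produces a different dataset, because the draws come from NumPy's global random state. One way to make the simulation repeatable is to wrap the draws in a function that takes a seed. This is a sketch only, covering two of the four variables: `simulate_releases` is a hypothetical helper, and the use of `np.random.default_rng` is a swapped-in alternative to the `np.random.seed` call used in this notebook.

```python
import numpy as np
import pandas as pd


def simulate_releases(n, seed):
    """Draw n simulated prisoners; the same seed gives the same table."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'Gender': rng.choice(['Male', 'Female'], size=n, p=[0.925, 0.075]),
        'Age': rng.choice(['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+'],
                          size=n, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079]),
    })


first = simulate_releases(2604, seed=0)
second = simulate_releases(2604, seed=0)
print(first.equals(second))  # identical draws because the seed matches
```

Passing the seed explicitly keeps the random state local to the function, so other cells cannot change the result by drawing random numbers in between.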

In [None]:
# Set the seed so that the numbers can be reproduced.
np.random.seed(0)  

original_stdout = sys.stdout 
with open("SimulatedDataSet.txt", "w") as f: # "w" opens the file for writing and creates it if it does not already exist
    sys.stdout = f
    p = 2604
    Genders = ['Male', 'Female'] 
    listGender = np.random.choice(Genders, p, p=[0.925, 0.075])
 
    Ages = ['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+']
    listAge = np.random.choice(Ages, p, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079])

    Economic_Status = ['No employment nor education', 'substantial employment only', 'education and training only',
                   'education, training and substantial employment', 'Not identified' ]
    Shorter_Name = ['No Edu or Emp', 'Emp Only', 'Edu & Training Only', 'Edu, Training & Emp Only',
                        'Not Identified']
    label_dict = dict(zip(Shorter_Name, Economic_Status))
    listEconomicStatus = np.random.choice(Shorter_Name, 2604, p=[0.298, 0.047, 0.264, 0.028, 0.363])

    Original_Offence_Type = ['Homicide Offences', 'Sexual Offences', 'Attempts/threats to murder, assaults, harassments and related offences',
                        'Dangerous or negligent acts', 'Kidnapping & related offences', 'Robbery, extortion and hijacking offences',
                        'Burglary and related offences', 'Theft and related offences', 'Fraud, deception and related offences', 
                        'Controlled drug offences', 'Weapons and Explosives Offences', 'Damage to property and to the environment',
                        'Public order and other social code offences', 'Road and traffic offences', 'Offences against government, justice procedures and organisation of crime',
                        'offences not elsewhere classified']

    Shortened_Offence = ['Homicide', 'SO', 'Atmptd murder/assault', 'Negligence', 'Kidnapping', 
                    'Robbery', 'Burglary', 'Theft', 'Fraud', 'DO', 'WO', 'PD', 'Public Order',
                    'RTO', 'Offences agnst Gov', 'Unclassified']

    label_dict = dict(zip(Shortened_Offence, Original_Offence_Type))
    probabilities = [0.0096, 0.03226, 0.1321, 0.0411, 0.00384, 0.01728, 0.07411, 
                 0.22772,0.02457, 0.09254, 0.04032, 0.05299, 0.05798, 0.10637,
                 0.06835, 0.01881]
    probabilities = [i/sum(probabilities) for i in probabilities]   # rescaling the probabilities so that they sum to 1

    listOffence = np.random.choice(Shortened_Offence, 2604, p=probabilities)

    d = {'Gender': listGender, 'Age': listAge, 'Economic Status' :listEconomicStatus, 'Offence Type': listOffence}
    # creating a dictionary for the dataframe
    df = pd.DataFrame(data=d)
    print(df)
    print(df.describe())
    sys.stdout = original_stdout

sea.histplot(x='Age', hue='Gender', data=df)
#plt.show()

d = {'Gender': listGender, 'Age': listAge, 'Economic Status' :listEconomicStatus, 'Offence Type': listOffence}
    # creating a dictionary for the dataframe
df = pd.DataFrame(data=d)
print(df)

sea.catplot(x='Age',y='Offence Type', hue='Gender', data=df )
plt.show()

# assigning a numeric code (stored as a string) to each category of every variable;
# the result is assigned back to the column, as calling replace(..., inplace=True) on a selected column may not update df in newer pandas versions
df['Gender'] = df['Gender'].replace({'Male':'0', 'Female':'1'})
df['Age'] = df['Age'].replace({'Under 21':'0', '21-25':'1', '26-30':'2', '31-35':'3', '36-40':'4', '41-50':'5', '50+':'6'})
df['Economic Status'] = df['Economic Status'].replace({'No Edu or Emp':'0', 'Emp Only':'1', 'Edu & Training Only':'2', 'Edu, Training & Emp Only': '3', 'Not Identified':'4'})
df['Offence Type'] = df['Offence Type'].replace({'Homicide':'0', 'SO':'1', 'Atmptd murder/assault':'2', 'Negligence':'3', 'Kidnapping':'4', 
                    'Robbery':'5', 'Burglary':'6', 'Theft':'7', 'Fraud':'8', 'DO':'9', 'WO':'10', 'PD':'11', 
                    'Public Order':'12', 'RTO':'13', 'Offences agnst Gov':'14', 'Unclassified':'15'})

#print(df)

# Issues which arose with my project
One of the key difficulties I encountered whilst completing this project was that all of the data available to me were categorical variables, which can be more difficult to work with for the purposes of this project.
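One common way to examine the relationship between two categorical variables is a cross-tabulation. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it redraws two of the variables with the same proportions used in this notebook, and the names `df_demo` and `age_by_gender` and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # illustrative seed
n = 2604
df_demo = pd.DataFrame({
    'Gender': rng.choice(['Male', 'Female'], size=n, p=[0.925, 0.075]),
    'Age': rng.choice(['Under 21', '21-25', '26-30', '31-35', '36-40', '41-50', '50+'],
                      size=n, p=[0.058, 0.194, 0.211, 0.184, 0.133, 0.141, 0.079]),
})

# Share of each age bracket within each gender (each row sums to 1).
age_by_gender = pd.crosstab(df_demo['Gender'], df_demo['Age'], normalize='index')
print(age_by_gender.round(3))
```

Because gender and age are drawn independently here, the two rows should look broadly similar; any real dependence between the variables would have to be built into the simulation.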