## Global Human Trafficking Data Analysis 

This analysis uses the Global Human Trafficking dataset from Kaggle (https://www.kaggle.com/andrewmvd/global-human-trafficking), which is taken from the Counter-Trafficking Data Collaborative (CTDC). The dataset contains information on 48.8k victims of human trafficking, including the reason for trafficking, means of control, origin and destination, and other variables. Missing values are coded as -99.

The objectives of this analysis:
1. Predict time series of human trafficking.
2. Explore demographics, means of control and other variables associated with human trafficking.

In [3]:
import numpy as np
import pandas as pd

In [5]:
data = pd.read_csv("human_trafficking.csv")
data.head()


Unnamed: 0,yearOfRegistration,Datasource,gender,ageBroad,majorityStatus,majorityStatusAtExploit,majorityEntry,citizenship,meansOfControlDebtBondage,meansOfControlTakesEarnings,...,typeOfSexPrivateSexualServices,typeOfSexConcatenated,isAbduction,RecruiterRelationship,CountryOfExploitation,recruiterRelationIntimatePartner,recruiterRelationFriend,recruiterRelationFamily,recruiterRelationOther,recruiterRelationUnknown
0,2002,Case Management,Female,18--20,Adult,-99,-99,CO,-99,-99,...,-99,-99,-99,-99,-99,0,0,0,0,1
1,2002,Case Management,Female,18--20,Adult,-99,-99,CO,-99,-99,...,-99,-99,-99,-99,-99,0,0,0,0,1
2,2002,Case Management,Female,18--20,Adult,-99,-99,CO,-99,-99,...,-99,-99,-99,-99,-99,0,0,0,0,1
3,2002,Case Management,Female,18--20,Adult,-99,-99,CO,-99,-99,...,-99,-99,-99,-99,-99,0,0,0,0,1
4,2002,Case Management,Female,18--20,Adult,-99,-99,CO,-99,-99,...,-99,-99,-99,-99,-99,0,0,0,0,1


In [14]:
data.columns

Index(['yearOfRegistration', 'Datasource', 'gender', 'ageBroad',
       'majorityStatus', 'majorityStatusAtExploit', 'majorityEntry',
       'citizenship', 'meansOfControlDebtBondage',
       'meansOfControlTakesEarnings', 'meansOfControlRestrictsFinancialAccess',
       'meansOfControlThreats', 'meansOfControlPsychologicalAbuse',
       'meansOfControlPhysicalAbuse', 'meansOfControlSexualAbuse',
       'meansOfControlFalsePromises', 'meansOfControlPsychoactiveSubstances',
       'meansOfControlRestrictsMovement', 'meansOfControlRestrictsMedicalCare',
       'meansOfControlExcessiveWorkingHours', 'meansOfControlUsesChildren',
       'meansOfControlThreatOfLawEnforcement',
       'meansOfControlWithholdsNecessities',
       'meansOfControlWithholdsDocuments', 'meansOfControlOther',
       'meansOfControlNotSpecified', 'meansOfControlConcatenated',
       'isForcedLabour', 'isSexualExploit', 'isOtherExploit', 'isSexAndLabour',
       'isForcedMarriage', 'isForcedMilitary', 'isOrganRemova

## Questions 

Based on the columns above, I came up with the following questions:

1. What is the primary demographic of the victims being trafficked?
2. How are these victims being trafficked?
3. Who enables these trafficking events?
4. What happens to the victims once they are trafficked?
5. What is the average time these victims spend in the trade?

## Data Cleaning

Before we proceed, the data needs to be cleaned. Missing values are coded as -99, so they must be replaced with NaN, which pandas treats as missing. Both the numeric and the string form are replaced, since the code may appear as either depending on the column's type.
Once that is done, we can compute the share of missing values in each column and use it to decide which columns, if any, should be dropped.
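
Before replacing anything, it can help to count how often the missing-value code occurs in each column. The sketch below does this on a small made-up frame; `demo` and its values are hypothetical stand-ins for `data`, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical miniature frame: one numeric and one text column using the -99 code
demo = pd.DataFrame({
    "isAbduction": [-99, 0, 1, -99],
    "majorityEntry": ["-99", "Adult", "-99", "Minor"],
})

# List both the integer and the string form, because a text column
# stores the code as "-99" rather than as a number
sentinel_counts = demo.isin([-99, "-99"]).sum()
print(sentinel_counts)
```

Running the same `isin(...).sum()` call on `data` would give the per-column counts for the full dataset.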

In [12]:
data.replace(-99, np.nan, inplace=True)
data.replace("-99", np.nan, inplace=True)
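
An alternative, shown here only as a sketch, is to mark the code as missing while the file is being read, using the `na_values` parameter of `pd.read_csv`. The in-memory CSV below is a made-up stand-in for `human_trafficking.csv` so that the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for human_trafficking.csv (illustration only)
sample_csv = io.StringIO(
    "yearOfRegistration,gender,majorityEntry,isAbduction\n"
    "2002,Female,-99,-99\n"
    "2003,Male,Adult,0\n"
)

# na_values is matched against the raw text of each field, so "-99"
# becomes NaN in numeric and text columns alike
sample = pd.read_csv(sample_csv, na_values=["-99"])
print(sample.isna().sum())
```

With the real file, the same argument would be passed to the existing `pd.read_csv("human_trafficking.csv")` call, making the two `replace` calls unnecessary.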

In [13]:
data.head()

Unnamed: 0,yearOfRegistration,Datasource,gender,ageBroad,majorityStatus,majorityStatusAtExploit,majorityEntry,citizenship,meansOfControlDebtBondage,meansOfControlTakesEarnings,...,typeOfSexPrivateSexualServices,typeOfSexConcatenated,isAbduction,RecruiterRelationship,CountryOfExploitation,recruiterRelationIntimatePartner,recruiterRelationFriend,recruiterRelationFamily,recruiterRelationOther,recruiterRelationUnknown
0,2002,Case Management,Female,18--20,Adult,,,CO,,,...,,,,,,0.0,0.0,0.0,0.0,1.0
1,2002,Case Management,Female,18--20,Adult,,,CO,,,...,,,,,,0.0,0.0,0.0,0.0,1.0
2,2002,Case Management,Female,18--20,Adult,,,CO,,,...,,,,,,0.0,0.0,0.0,0.0,1.0
3,2002,Case Management,Female,18--20,Adult,,,CO,,,...,,,,,,0.0,0.0,0.0,0.0,1.0
4,2002,Case Management,Female,18--20,Adult,,,CO,,,...,,,,,,0.0,0.0,0.0,0.0,1.0


In [23]:
# Columns in which more than 75% of the values are missing
missing = data.columns[data.isna().mean() > 0.75]
len(missing)

28

In [24]:
missing

Index(['majorityStatusAtExploit', 'majorityEntry', 'meansOfControlDebtBondage',
       'meansOfControlTakesEarnings', 'meansOfControlRestrictsFinancialAccess',
       'meansOfControlThreats', 'meansOfControlPsychologicalAbuse',
       'meansOfControlPhysicalAbuse', 'meansOfControlSexualAbuse',
       'meansOfControlFalsePromises', 'meansOfControlPsychoactiveSubstances',
       'meansOfControlRestrictsMovement', 'meansOfControlRestrictsMedicalCare',
       'meansOfControlExcessiveWorkingHours', 'meansOfControlUsesChildren',
       'meansOfControlThreatOfLawEnforcement',
       'meansOfControlWithholdsNecessities',
       'meansOfControlWithholdsDocuments', 'meansOfControlOther',
       'isForcedMarriage', 'isForcedMilitary', 'isOrganRemoval',
       'isSlaveryAndPractices', 'typeOfLabourConcatenated',
       'typeOfSexPornography', 'typeOfSexRemoteInteractiveServices',
       'typeOfSexPrivateSexualServices', 'typeOfSexConcatenated'],
      dtype='object')
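
If these columns are to be dropped, one possible way is sketched below on a small made-up frame. The names `demo`, `threshold`, `sparse_cols` and `demo_reduced` are hypothetical; the 0.75 cutoff is the one used above. Whether dropping is appropriate is a judgment call: the `meansOfControl...` columns are sparse but bear directly on question 2, so it may be preferable to keep them and analyze only the rows where they are filled in.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for `data`
demo = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Female"],
    "majorityEntry": [np.nan, np.nan, np.nan, np.nan],  # 100% missing
    "isAbduction": [np.nan, 0, np.nan, np.nan],         # exactly 75% missing
    "citizenship": ["CO", np.nan, "CO", "UA"],          # 25% missing
})

threshold = 0.75

# Share of missing values per column (mean of a boolean mask)
missing_share = demo.isna().mean()

# Columns strictly above the threshold, matching the `>` comparison used above
sparse_cols = missing_share[missing_share > threshold].index

# drop returns a new frame; the original is left untouched
demo_reduced = demo.drop(columns=sparse_cols)
print(demo_reduced.columns.tolist())
```

Because the comparison is strict, a column that is exactly 75% missing is kept.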