# MISSING MIGRANTS 

In this notebook, I will be exploring and visualising data about migrants who have died or gone missing along migration routes worldwide. The source of the dataset is IOM's Missing Migrants Project. Please visit their website for more information [here](https://missingmigrants.iom.int/).
    

The 'Missing Migrants' dataset records the details of incidents in which migrants (commonly asylum-seekers and refugees) died or went missing between January 2014 and December 2019. The data gives only minimum estimates for the number of people affected, and the end of many of these human lives goes unrecorded.

Before creating graphs, I handled the missing values in the columns that I wanted to work with. Then I created a Folium Marker map of incidents and a HeatMap, used Seaborn graphs to visualise data, and used regular expressions to find and extract strings to create a WordCloud.

Please visit this [Amnesty International link](https://www.amnesty.org/en/what-we-do/refugees-asylum-seekers-and-migrants/) for information on how migrants, asylum seekers and refugees are defined. 



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#reading in csv file and changing dtype of 'Reported Date' column to datetime
mm_data = pd.read_csv('/kaggle/input/missing-migrants-project/MissingMigrants-Global-2019-12-31_correct.csv', parse_dates = ['Reported Date'])

## 1. About the dataset 

In [None]:
#dataset information
mm_data.info()

In the Missing Migrants dataset, there are 5987 entries and 20 columns. The columns cover the date and location of the incident, the demographic of the migrants involved, the cause and number of deaths, the number of missing migrants as a result of the incident, the number of survivors, links to news sources regarding the incident and the quality of the news sources.

In [None]:
#time period covered in data set
print('The Missing Migrants dataset covers the period {0} to {1}'.format(mm_data['Reported Date'].min().date(), mm_data['Reported Date'].max().date()))

**Questions to explore:**
1. **In which region did the greatest number of reported incidents occur and in which region did the fewest reported incidents occur?**
2. **What were the biggest causes of death?**
3. **Are there any seasonal patterns?**

Before delving into these questions, we must explore the missing values in the dataset. For our purposes, we will only need the columns **'Region of Incident', 'Reported Date', 'Reported Year', 'Reported Month', 'Number Dead', 'Minimum Estimated Number of Missing', 'Total Dead and Missing', 'Cause of Death',** and **'Location Coordinates'** (for mapping). So we will examine the NaN values where they exist for these columns only.

## 2. Handling missing data

In [None]:
#missing values
mm_data.isnull().sum().sort_values(ascending=False)

### Examining the NaN values of the 'Number Dead' column:

In [None]:
#extracting only the rows where 'Number Dead' is null
null_number_dead = mm_data[mm_data['Number Dead'].isnull()]
print('There are {} missing values in the "Number Dead" column.'.format(null_number_dead.shape[0]))

There are 255 entries where 'Number Dead' is NaN.

In [None]:
null_number_dead.head()

Checking the information sources associated with the first few entries in null_number_dead indicates that the 'Number Dead' value is null because no one reportedly died in that incident. This is supported by a cursory glance at the first 5 entries, which show that the 'Minimum Estimated Number of Missing' value is the same as the 'Total Dead and Missing' value. If this holds for all entries in null_number_dead, then we can input 0 for the NaN 'Number Dead' values. Does it hold true for all the entries in null_number_dead?
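
The consistency check described above can be sketched on a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the dataset). It flags rows where 'Number Dead' is NaN but the missing count does not match the total, which are the rows where filling in 0 would not be justified by the data alone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mm_data; the values are made up for illustration only
toy = pd.DataFrame({
    'Number Dead': [np.nan, np.nan, 4.0, np.nan],
    'Minimum Estimated Number of Missing': [5.0, 2.0, np.nan, np.nan],
    'Total Dead and Missing': [5, 2, 4, 3],
})

# Rows where no death count was recorded
null_dead = toy[toy['Number Dead'].isnull()]

# Rows that contradict "NaN means zero deaths": missing != total.
# A NaN compared with != is always True, so rows with no missing count are flagged too.
inconsistent = null_dead[
    null_dead['Minimum Estimated Number of Missing'] != null_dead['Total Dead and Missing']
]
print(inconsistent.index.tolist())
```

Only the flagged rows need to be checked by hand before the remaining NaN values are replaced with 0.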

In [None]:
#boolean mask to filter entries where number missing is not equal to total number dead and missing
bool_null_number_dead = null_number_dead[null_number_dead['Minimum Estimated Number of Missing'] != null_number_dead['Total Dead and Missing']]

In [None]:
bool_null_number_dead

The 10 entries where the number of missing does not equal the total number dead and missing are entries with missing data for all the columns showing the number of people affected. Some entries have URLs, and these news sources mention the number dead or missing in the incident. We can use these sources to fill in the NaN values in the main dataset, mm_data.

So, before we replace NaN values with 0, we will use news sources to fill in the missing data in those entries in the main dataset where possible. To begin, we will extract the rows where 'Number Dead', 'Minimum Estimated Number of Missing', and 'Number of Survivors' have missing data (i.e. NaN).

In [None]:
#extracting the rows where 'Number Dead', 'Minimum Estimated Number of Missing', and 'Number of Survivors' have missing data (ie NaN).
missing_data = mm_data[mm_data['Number Dead'].isnull() & mm_data['Minimum Estimated Number of Missing'].isnull() & mm_data['Number of Survivors'].isnull()]

In [None]:
missing_data

These are the same 10 entries which we saw above with bool_null_number_dead. We will now fill in the missing data where possible using the accompanying URLs. 

In [None]:
#row 4226
mm_data.loc[4226, 'Number Dead'] = 3
mm_data.loc[4226, 'Total Dead and Missing'] = 3
#row 5253
mm_data.loc[5253, 'Number Dead' ] = 11
mm_data.loc[5253, 'Number of Survivors'] = 15
mm_data.loc[5253, 'Total Dead and Missing'] = 11
#row 5667
mm_data.loc[5667, 'Number Dead'] = 6
mm_data.loc[5667, 'Total Dead and Missing'] = 6


Do we drop the 7 remaining entries with missing data involving the number of people affected? Since we will be examining the cause of death statistic, and these entries all report the cause of death, we will not drop these entries. Additionally, as we are changing the NaN values in the dead and missing columns to 0, the values for these columns will not affect the results of the kind of analysis which we will be doing.


Now, we will input 0 in place of NaN in the 'Number Dead' column in the main dataset.

In [None]:
#replacing NaN with 0 in 'Number Dead' column
mm_data['Number Dead'] = mm_data['Number Dead'].fillna(0)

### Examining the NaN values of the 'Minimum Estimated Number of Missing' column:

In [None]:
#extracting only those entries where 'Minimum Estimated Number of Missing' is NaN
null_missing = mm_data[mm_data['Minimum Estimated Number of Missing'].isnull()]
print('There are {} missing values in the "Minimum Estimated Number of Missing" column.'.format(null_missing.shape[0]))

In [None]:
null_missing.head()

Again, as with the 'Number Dead' column, the NaN values for this column indicate that no one is missing as a result of the incident, since the value in the 'Number Dead' column is equal to the value in the 'Total Dead and Missing' column. We can thus replace the NaN values in the 'Minimum Estimated Number of Missing' column with 0. It is best to double-check that this holds true for all entries in null_missing.

In [None]:
#boolean mask to filter any values where number dead is not equal to total dead and missing
bool_null_missing = null_missing[null_missing['Number Dead'] != null_missing['Total Dead and Missing']]
bool_null_missing.shape[0]

There are no entries where the number of dead is different from the total number of dead and missing, so we can replace the NaN values in the 'Minimum Estimated Number of Missing' column with 0.


In [None]:
#replacing NaN with 0 in 'Minimum Estimated Number of Missing' in main dataset, mm_data
mm_data['Minimum Estimated Number of Missing'] = mm_data['Minimum Estimated Number of Missing'].fillna(0)

As a reminder, we are only interested in the columns **'Region of Incident', 'Reported Date', 'Reported Year', 'Reported Month', 'Number Dead', 'Minimum Estimated Number of Missing', 'Total Dead and Missing', 'Cause of Death',** and **'Location Coordinates'**.

Of these columns, the single NaN value for **'Location Coordinates'** remains to be examined.

### Examining the single missing value of 'Location Coordinates':

In [None]:
#extracting relevant row
null_loc_coord = mm_data[mm_data['Location Coordinates'].isnull()]
null_loc_coord

The location description of this incident is given as Sahara Desert, Niger. According to the Missing Migrants website, one of the locations on a migration route through Niger is Séguedine, a town in central eastern Niger in the midst of the Sahara Desert. The location coordinates for this town (20.191944, 12.9675) will be a good approximation for the missing coordinate value in this entry.
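
As a minimal sketch, assuming the 'Location Coordinates' column stores each position as a single 'latitude, longitude' string (which is how the mapping code in this notebook reads it), such a string can be turned into two numbers with `ast.literal_eval`. The variable names here are only illustrative.

```python
from ast import literal_eval

# Approximate coordinates for the town, written as a 'latitude, longitude' string
coord_string = '20.191944, 12.9675'

# literal_eval safely parses the string as a Python tuple of two floats
lat, lon = literal_eval(coord_string)
print(lat, lon)
```

Keeping the replacement value in the same string format means the row can be parsed in the same way as every other row when the maps are built.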

In [None]:
#replacing NaN with approximate location coordinate for row 3097
mm_data.loc[3097, 'Location Coordinates'] = '20.191944, 12.9675'

In [None]:
#any missing values left to handle for columns we are interested in?
mm_data[['Region of Incident', 'Reported Date', 'Reported Year', 
        'Reported Month', 'Number Dead', 'Minimum Estimated Number of Missing',
        'Total Dead and Missing', 'Cause of Death', 'Location Coordinates']].isnull().sum()

The missing values of the columns we require have been addressed. We are now ready to answer the questions.

## 3. The questions


In [None]:
import matplotlib.pyplot as plt
import folium
import seaborn as sns
%matplotlib inline

First, we will create a map of incidents using Folium. If you hover over or click on a marker, it will show you the details of the incident which occurred there. You can zoom in on each hotspot marker.

In [None]:
#create new column of marker labels for folium map
marker_loc = mm_data['Location Description'] 
marker_date = mm_data['Reported Date'].dt.date.astype(str)
marker_number = mm_data['Total Dead and Missing'].astype(str)
marker_cause = mm_data['Cause of Death']
#adding object series into 'Marker Labels'
marker_labels = 'Location: ' + marker_loc + '; Date: '+ marker_date + '; Total Dead and Missing: ' + marker_number + '; Cause of Death: ' + marker_cause
mm_data['Marker Label'] = marker_labels

In [None]:
#map of incidents

from ast import literal_eval
from folium.plugins import MarkerCluster

incidents_map = folium.Map(location=[50,0], tiles = 'CartoDB dark_matter', zoom_start=3, min_zoom = 2.5, control_scale = True)

marker_cluster = MarkerCluster().add_to(incidents_map)

for i in range(mm_data.shape[0]):
    loc = list(literal_eval(mm_data.iloc[i]['Location Coordinates']))
    folium.Marker(
        location = loc,
        popup = mm_data.iloc[i]['Marker Label'], 
        tooltip = mm_data.iloc[i]['Marker Label'],
        icon=folium.Icon(color='red'),
    ).add_to(marker_cluster)


display(incidents_map)


This map gives a quick overview of where migrant incidents have occurred. Living in the UK, I see the Mediterranean crisis dominate the news, which gives the impression that most migrant incidents occur in southern Europe. As the map shows, incidents are widespread across the world, and it is shocking to see the scale of suffering.

We will now create a HeatMap that shows the areas which claimed the most victims (Total Dead and Missing from 2014-2019).
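
Folium's `HeatMap` plugin accepts points of the form `[lat, lng]` or `[lat, lng, weight]`. The sketch below shows one way such weighted points could be assembled, using a small made-up table in place of the real dataset (the `toy` DataFrame and its values are hypothetical).

```python
from ast import literal_eval
import pandas as pd

# Toy rows with made-up values, mimicking the two columns used for the heat map
toy = pd.DataFrame({
    'Location Coordinates': ['32.5, -117.0', '33.9, 12.6'],
    'Total Dead and Missing': [1, 120],
})

# Each heat-map point is [latitude, longitude, weight]
heat_points = [
    [*literal_eval(coords), float(total)]
    for coords, total in zip(toy['Location Coordinates'], toy['Total Dead and Missing'])
]
print(heat_points)
```

Using the total dead and missing as the weight makes incidents with many victims contribute more to the heat map than incidents with few.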

In [None]:
from folium.plugins import HeatMap

#create list of lists of coordinates and Total Dead and Missing
victims_array = []
for i in range(mm_data.shape[0]):
    victims_array.append(list(literal_eval(mm_data.iloc[i]['Location Coordinates'])))
    victims_array[i].append(float(mm_data.iloc[i]['Total Dead and Missing']))


victims_map = folium.Map(location=[50,0], tiles = 'CartoDB dark_matter', zoom_start=3, min_zoom = 2.5, control_scale = True)
HeatMap(victims_array, min_opacity = 0.25).add_to(victims_map)
display(victims_map)


Comparing the HeatMap with the Marker map of incidents shows that while the US-Mexico border experienced more incidents, the area around North Africa (specifically Libya) claimed the greater number of victims.

<font size = 3>**1. In which region did the greatest number of reported incidents occur, and in which region did the fewest reported incidents occur (from January 2014 to December 2019)?**</font>

In [None]:
#number of incidents
incidents_reg_count = mm_data['Region of Incident'].value_counts()

In [None]:
#bar graph
sns.set(style="white")
plt.figure(figsize=(10,10))
sns.barplot(x=incidents_reg_count.index, y=incidents_reg_count.values, palette='YlOrRd_r')
plt.xlabel('Region', fontsize = 13)
plt.xticks(rotation = 90)
plt.ylabel('Number of Incidents', fontsize = 13)
plt.title('Number of Incidents of Each Region (from January 2014 to December 2019)', fontsize = 15)
plt.show()

The US-Mexico Border experienced the greatest number of incidents in the given period, followed by North Africa and the Mediterranean. These three regions far outstrip the other regions in the frequency of incidents. Central Asia experienced the fewest incidents.

A follow up question:
<font size = 3>**How did the frequency of incidents in each of the top three regions change over the time period (in years)?**</font>
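
One way to count incidents per region and year, sketched here on a small made-up table rather than the real dataset, is to filter to the regions of interest and take the size of each group. The `toy` DataFrame and its values are hypothetical. This is an alternative to adding a helper column of ones, not necessarily the approach used in the cells that follow.

```python
import pandas as pd

# Toy stand-in for the dataset: one row per incident (made-up values)
toy = pd.DataFrame({
    'Region of Incident': ['Mediterranean', 'Mediterranean', 'North Africa',
                           'US-Mexico Border', 'Mediterranean', 'Central Asia'],
    'Reported Year': [2014, 2014, 2014, 2015, 2015, 2015],
})

# Keep only the three regions of interest
top_regions = ['US-Mexico Border', 'North Africa', 'Mediterranean']
subset = toy[toy['Region of Incident'].isin(top_regions)]

# Each row is one incident, so the group size is the number of incidents
incidents_per_year = (
    subset.groupby(['Region of Incident', 'Reported Year'])
          .size()
          .rename('Number of Incidents')
)
print(incidents_per_year)
```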

In [None]:
#sort main data set in ascending order with regards to time and store in time_ordered_mmdata
time_ordered_mmdata = mm_data.sort_values(by='Reported Date', ascending=True)

In [None]:
#extracting portion of dataset for regions US-Mexico Border, North Africa, and the Mediterranean
top_regions_data = time_ordered_mmdata[(time_ordered_mmdata['Region of Incident'] == 'US-Mexico Border') | 
                                       (time_ordered_mmdata['Region of Incident'] == 'North Africa') | 
                                       (time_ordered_mmdata['Region of Incident'] == 'Mediterranean')]

In [None]:
#add column of ones called 'Number of Incidents' to top_regions_data to be able to count
top_regions_data = top_regions_data.assign(**{'Number of Incidents': 1})

In [None]:
#grouping by year and region and counting number of incidents
region_year_group = top_regions_data.pivot_table(index=['Region of Incident','Reported Year'], values='Number of Incidents', aggfunc='count')
region_year_group

In [None]:
#plotting as line graphs
sns.set(style = 'ticks' )

med = region_year_group.loc['Mediterranean']
n_a = region_year_group.loc['North Africa']
us_m =region_year_group.loc['US-Mexico Border']

plt.figure(figsize=(10,10))

plt.plot(med['Number of Incidents'], 'r:' ,label ='Mediterranean', marker = 'o', markersize = 5, mew = 2, linewidth = 3)
plt.plot(n_a['Number of Incidents'], 'g:' ,label ='North Africa', marker = 'o', markersize = 5, mew = 2, linewidth = 3)
plt.plot(us_m['Number of Incidents'], 'b:', label ='US-Mexico Border', marker = 'o', markersize = 5, mew = 2, linewidth = 3)

plt.xlabel('Year', fontsize = 13)
plt.ylabel('Number of Incidents', fontsize = 13)
plt.title('Number of Incidents by Region (2014 - 2019)', fontsize = 15)
plt.legend(loc='upper right')

plt.show()


The North African region saw an increase in migrant incidents from 2014, peaking in 2016 with over 400 incidents in which asylum seekers perished or went missing. The US-Mexico border shows a steady increase in incidents from 2014 to 2018. The incidents in the Mediterranean reached their peak in 2017.

<font size = 3>**2. What was the biggest cause of death overall and for particular regions?**</font>
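
Some entries list several causes of death separated by commas, so each cause has to be split out before counting. As a compact sketch of that idea on made-up strings (the `causes` Series is hypothetical and the real labels may differ), pandas can split, flatten and count in one chain:

```python
import pandas as pd

# Made-up examples of single and multi-cause entries
causes = pd.Series(['Drowning', 'Dehydration, Hyperthermia', 'Drowning,Hypothermia'])

cause_counts = (
    causes.str.split(r',\s*')   # split on commas, ignoring any following whitespace
          .explode()            # one row per individual cause
          .value_counts()       # frequency of each cause
)
print(cause_counts)
```

The result is a Series indexed by cause, which is the same shape of data that the dictionary-based count in the next cell produces.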

In [None]:
#extracting causes of death (some entries have several causes of death)

#import regular expression library
import re

#dict of given causes of death and frequency
#use regular expression to split string and ignore whitespace
dict_type_count = {}
for d in mm_data['Cause of Death'].index:
    list_temp = []
    list_temp.extend(re.split(r'[,]\s*',mm_data['Cause of Death'][d])) 
    for i in list_temp:
        if i not in dict_type_count:
            dict_type_count[i] = 1
        else:
            dict_type_count[i] += 1

#converting dictionary into series
death_type_count = pd.Series(dict_type_count)
print(death_type_count)

In [None]:
#combine all values with 'unknown' as one entry with index 'Unknown', and combine any cause of death related to forms of transport  as 'Vehicle Accident' 
dict_repeated = {}
dict_repeated['Unknown'] = 0
dict_repeated['Vehicle Accident'] = 0
reps = pd.Series(death_type_count.index)

list_reps =[]

for i in reps[reps.str.contains(r'[Uu]nknown')]:
    dict_repeated['Unknown'] += death_type_count[i]
    list_reps.append(i)
        
for i in reps[reps.str.contains(r'\b[Tt]ruck\b')]:
    dict_repeated['Vehicle Accident'] += death_type_count[i]
    list_reps.append(i)
    
for i in reps[reps.str.contains(r'[Tt]rain')]:
    if i not in list_reps:
        dict_repeated['Vehicle Accident'] += death_type_count[i]
        list_reps.append(i)
        
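for i in reps[reps.str.contains(r'\b[Bb]us\b')]: 
    #skip labels already counted above to avoid double counting
    if i not in list_reps:
        dict_repeated['Vehicle Accident'] += death_type_count[i]
        list_reps.append(i)

for i in reps[reps.str.contains(r'[Vv]ehicle')]: 
    #skip non-vehicle accidents and labels already counted above
    if i != 'Accident (non-vehicle)' and i not in list_reps:
        dict_repeated['Vehicle Accident'] += death_type_count[i]
        list_reps.append(i)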
        
death_type_count.drop(list_reps, inplace=True)
death_type_count = pd.concat([death_type_count, pd.Series(dict_repeated)])
print('The 10 biggest causes of deaths are: \n{} '.format(death_type_count.sort_values(ascending=False)[:10]))

We will create a word cloud to give an idea of what the biggest causes of death are.

In [None]:
from wordcloud import WordCloud, STOPWORDS


wordcloud = WordCloud(width = 3000, height = 2000 , background_color = 'black', colormap = 'Reds',
                       stopwords = STOPWORDS).generate_from_frequencies(death_type_count.to_dict())

fig = plt.figure(figsize = (30,25))
plt.imshow(wordcloud)
plt.axis('off')

plt.show()


Drowning is the biggest known cause of death. Other common causes of death are sickness, hypothermia, dehydration, starvation, harsh weather or lack of adequate shelter, and vehicle accidents. When the causes of death are set out like this, it brings to the fore the reality of the asylum seekers' lives. Nobody willingly leaves relative safety to go out into the unknown unless there is a powerful push factor.

<font size = 3>**3. Are there any seasonal patterns in the number of incidents?**</font>

In [None]:
mm_copy = mm_data.copy()
mm_copy.loc[:, 'Number of Incidents'] = 1
mm_copy.head()

In [None]:
season_pattern = mm_copy.pivot_table(index='Reported Month', values='Number of Incidents', aggfunc='sum') 
season_pattern.reset_index(inplace=True)
#season_pattern

In [None]:
dict_months = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12}
season_pattern['Reported Month'] = season_pattern['Reported Month'].map(dict_months)

In [None]:
season_pattern.sort_values(by='Reported Month', inplace=True)
#check
#season_pattern

In [None]:
#plotting number of incidents by month
import calendar
plt.figure(figsize=(10,10))

plt.plot(season_pattern['Reported Month'], season_pattern['Number of Incidents'], 'b-', marker = 'o')
plt.ylim(0, 700)
plt.xlabel('Month',fontsize = 13)
plt.xticks(np.arange(1,13), calendar.month_name[1:13], rotation=20)
plt.ylabel('Number of Incidents', fontsize = 13)
plt.title('Number of Incidents by Month', fontsize = 15)
plt.show()

Taken across the years covered in the dataset, the number of incidents increases during the Northern Hemisphere's warmer months. Warmer months could bring more favourable weather for attempting sea crossings, as well as longer days (more daylight) in which to travel.
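
The monthly tally behind a plot like this can also be sketched without a helper column, assuming months are stored as three-letter abbreviations such as 'Jan'. The `months` Series below is made up for illustration. Reindexing by an explicit month list keeps the calendar order and shows months with no incidents as 0.

```python
import pandas as pd

# Made-up month labels, one per incident
months = pd.Series(['Jul', 'Jan', 'Jul', 'Dec', 'Jan', 'Jul'])

# Explicit calendar order, so the result does not depend on alphabetical sorting
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Count incidents per month and put them in calendar order, filling gaps with 0
monthly_counts = months.value_counts().reindex(month_order, fill_value=0)
print(monthly_counts)
```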