# Casualties and Migration in the Syrian Civil War

## Introduction

***

In 2011, five weeks into the civil demonstrations against the Syrian government, secret police forces detained and tortured fifteen students who had spray painted an anti-government statement on the walls of their school. They would be released weeks later in an effort to quell the rising civil unrest in the province. In the wake of the hundreds of other demonstrators who were killed or disappeared, this action was too little and too late to stop the tide of the civil war. Demonstrations turned to protest turned to armed conflict and the rest is history.

The war would go on to spawn both the largest refugee crisis and one of the deadliest conflicts in modern history. As of 2019, there are over 6 million Syrian refugees and another 6 million internally displaced people in a country with a pre-war population of around 24 million (UNHCR, 2018). The regime's efforts to prevent accurate information from leaving the country have made it nearly impossible to estimate the number of casualties that have occurred in that time. Current estimates range from 300,000 to 600,000 killed, depending on the source.

The link between the flow of violence within the country and the flow of asylum seekers out of the country should be apparent to anyone who is aware of the war. Yet a growing sentiment among residents in host countries is that a large portion of asylum seekers from Syria are actually economic migrants, who are using the conflict as a means of gaining entry into the European Union and access to generous social programs.

We believe that violence is the most important predictor of migration of Syrian refugees; however, while this argument may be generally accepted, it is very difficult to prove the relationship with certainty. We hope to test this hypothesis using reported casualty data to see whether there is a correlation between violence in a given province and a subsequent increase in the number of asylum seekers across all host countries.

## Project
***

Our project can be organized into three distinct portions:

1. Data Scraping
2. Data Wrangling
3. Data Visualization

Our goal is to create a dataset of casualty information and refugee data, clean and structure the dataset for easier querying, and visualize the data to provide more insight into the questions we pose above.

## Data Scraping
***

There are multiple sources that could be used for casualty information (list here). We will set the other datasets aside for now and focus on the VDC and CSR datasets, because they provide their data in table elements that make it easy for us to scrape and organize into dataframes for analysis.

We will now go through the process of scraping and creating the initial forms of these datasets.

### Casualty Data
***

#### VDC

The [Violations Documentation Center](http://www.vdc-sy.info/) has been recording casualty data since June 2011. It is likely the most detailed and complete (in terms of metadata) data source of casualties that is publicly accessible.

They provide their data through a user interface that queries their database using parameters the user defines. This interface provides the following information:

- `Name                  - Full name in English`
- `Status                - Civilian, non-civilian, or military status of deceased`
- `Sex                   - Whether deceased is an Adult or Minor and Male or Female`
- `Province              - One of the 14 Provinces of Syria`
- `Area \ Place of Birth - Various locations that can be Provinces/Subdistricts/Towns`
- `Date of death         - self explanatory`
- `Cause of death        - self explanatory`
- `Actors                - groups involved in the casualty`

Each entry is associated with a unique identifier, which is an integer between 0 and 250,000. Clicking on the name of the entry will lead the user to another page that provides the unique identifier number and other data that is not displayed on the main page. We will avoid describing this detail for now, since most of this data is not used in the final product.

We will now describe the full process we used to scrape this website, covering both the main listing page and the detailed page for each entry.

In [None]:
import glob
import os
import re

import pandas as pd
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent
from torrequest import TorRequest


def scrape_recent():
    first_page = 'http://www.vdc-sy.info/index.php/en/martyrs/1/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8'
    
    # This is the format of the links that give us the unique identifiers
    pattern    = re.compile(r'/index\.php/en/details/martyrs/.')

    # We want to establish a randomized user agent and Tor node to avoid detection
    ua         = UserAgent()
    headers    = {'User-Agent': ua.random}
    tor        = TorRequest(password = 'commonhorse')

    # Start with an empty set so the function still returns a value if the request fails
    links      = set()
    
    try:
        response = tor.get(first_page, headers=headers)
        content  = bs(response.text, 'html.parser')
        
        # This set comprehension grabs all unique identifiers in string format for all links that match
        # our regex pattern from above (the first 30 characters of each href are the fixed path prefix)
        links    = {link['href'][30:] for link in content.find_all('a', href = True) if pattern.match(link['href'])} 

    except Exception as e:
        print(e)

    return links
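
To make the link-matching step concrete, here is a small offline sketch of how the pattern and the slice at position 30 pull identifiers out of `href` values. The sample links and identifier values are made up for illustration; only the path prefix comes from the scraper above.

```python
import re

# Prefix shared by every detail-page link; it is 30 characters long,
# which is why the scraper slices each href with [30:].
PREFIX = '/index.php/en/details/martyrs/'
pattern = re.compile(r'/index\.php/en/details/martyrs/.')

# Hypothetical href values, as they might appear on a listing page
sample_hrefs = [
    '/index.php/en/details/martyrs/12345',
    '/index.php/en/martyrs/2/',              # pagination link, should be ignored
    '/index.php/en/details/martyrs/678',
    '/index.php/en/details/martyrs/12345',   # duplicate link to the same entry
]

# A set comprehension drops duplicate identifiers automatically
uids = {href[len(PREFIX):] for href in sample_hrefs if pattern.match(href)}
```

Using a set here means that an entry linked twice on the same page is only scraped once.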

In [None]:
'''
Given a unique identifier in string format, scrapes its details page and saves the entry 
as an individual dataframe that represents one person.
'''

def scrape_details(uid, tor, headers):
    cols = []
    vals = []

    url  = 'http://www.vdc-sy.info/index.php/en/details/martyrs/' + uid
    
    # Headers will provide the UserAgent to use when getting response
    # Makes the request using a TorRequest object passed in
    page = tor.get(url, headers = headers).text
    page = bs(page, 'html.parser')
    
    # Grabs the relevant table info and all rows in it
    table = page.find('table', attrs = {'class':'peopleListing'})
    rows  = table.find_all('tr')

    for row in rows:
        data = row.find_all('td')

        # Rows that do not have exactly 2 cells
        # are not data we are looking for
        if len(data) != 2:
            continue

        # data[0] corresponds to the row label/column
        cols.append(data[0].text)
        
        # Values need to be appended differently for image rows 
        if data[1].find('img') is not None:
            vals.append(data[1].find('img')['src'])
        else:
            vals.append(data[1].text)

    # Adds the uid to the dataframe
    cols.append('uid')
    vals.append(uid)

    # Creates and saves dataframe
    person = pd.DataFrame([vals], columns = cols, dtype=str)

    save(person, os.path.join('person_dfs', uid))
    
    
    

Each detailed page has a different number of columns depending on the metadata associated with that entry, so we now have to combine all the dataframes. pandas cannot align dataframes that contain duplicated column names when concatenating them, so we first have to rename all duplicate columns using this code.

In [None]:
def rename_dup_cols(dataframe):
    seen = {}
    cols = []

    # Index.get_duplicates() was removed in pandas 1.0, so we count repeats ourselves.
    # The first occurrence keeps its name; later ones become name_1, name_2, ...
    for col in dataframe.columns:
        count = seen.get(col, 0)
        cols.append(col + '_' + str(count) if count != 0 else col)
        seen[col] = count + 1

    dataframe.columns = cols

    return dataframe




Now, given a list of dataframes, we can build a combined dataframe that retains all column data. The function saves the result as vdc_df and saves any dataframes that failed to combine as failed_vdc_df.

In [None]:
def combine_dataframes(dataframes):
    failed_dataframes = []
    combined          = pd.DataFrame()

    counter = 0
    num     = len(dataframes)

    for df in dataframes:
        try:
            combined = pd.concat([combined, df], axis = 0)
            print(f'{counter} / {num} people processed in combine_dataframes().')
            counter += 1
        
        except Exception as e:
            failed_dataframes.append(df)
            print('Failed')
            counter += 1

    save(combined, 'vdc_df')
    save(failed_dataframes, 'failed_vdc_df')

    print('\n\nSuccess: ', len(dataframes) - len(failed_dataframes))
    print('Failed: ', len(failed_dataframes))
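
Calling `pd.concat` inside the loop copies the growing dataframe on every iteration. As a hedged alternative, a single `pd.concat` over the whole list produces the same union of columns; the trade-off is that one bad dataframe fails the whole call instead of being set aside individually. The two sample dataframes below are hypothetical.

```python
import pandas as pd

# Two hypothetical single-person dataframes with different columns
person_a = pd.DataFrame([['Person A', 'Aleppo', '1']],
                        columns=['Name', 'Province', 'uid'], dtype=str)
person_b = pd.DataFrame([['Person B', 'Shooting', '2']],
                        columns=['Name', 'Cause of death', 'uid'], dtype=str)

# A single concat call takes the union of all columns and fills gaps with NaN
combined = pd.concat([person_a, person_b], axis=0, ignore_index=True, sort=False)
```

Columns that a person's page did not provide show up as missing values in that person's row.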
    
    
    

Putting this all together, we will now:

1. Build a list of unique identifiers by scraping the query page for the VDC database using scrape_recent()

2. Scrape the detailed information for each unique id returned by scrape_recent() using scrape_details(), which gives us one dataframe per person.

3. Combine those dataframes into one large dataset using combine_dataframes()


In [None]:
uids_to_scrape = scrape_recent()
uids_scraped   = set()

while len(uids_to_scrape) > 0:
    uid = uids_to_scrape.pop()
    
    try:
        ua         = UserAgent()
        headers    = {'User-Agent': ua.random}
        tor        = TorRequest(password = 'cmps184')
        scrape_details(uid, tor, headers)

    except Exception as e:
        print(e)
        # Put the identifier back so it is retried later (uids_to_scrape is a set)
        uids_to_scrape.add(uid)

        ua         = UserAgent()
        headers    = {'User-Agent': ua.random}
        tor        = TorRequest(password = 'cmps184')
        tor.reset_identity()

        continue
        
    uids_scraped.add(uid)

    save(uids_to_scrape, 'uids_to_scrape')
    save(uids_scraped  , 'uids_scraped')
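
One caveat with the loop above: an identifier that fails every time is meant to be put back into the queue, so the loop may never finish. A minimal sketch of one way to cap retries follows; `scrape_all`, `fake_fetch`, and `MAX_ATTEMPTS` are placeholders for illustration and are not part of our scraper.

```python
from collections import Counter

MAX_ATTEMPTS = 3  # assumed retry budget per identifier

def scrape_all(uids, fetch, max_attempts=MAX_ATTEMPTS):
    """Call fetch(uid) for every uid, retrying failures up to max_attempts times."""
    to_scrape = set(uids)
    attempts = Counter()
    scraped, gave_up = set(), set()

    while to_scrape:
        uid = to_scrape.pop()
        attempts[uid] += 1
        try:
            fetch(uid)
        except Exception:
            # Requeue only while the identifier still has attempts left
            if attempts[uid] < max_attempts:
                to_scrape.add(uid)
            else:
                gave_up.add(uid)
            continue
        scraped.add(uid)

    return scraped, gave_up

def fake_fetch(uid):
    # Stand-in for scrape_details(): identifier '2' always fails
    if uid == '2':
        raise ConnectionError('simulated failure')

scraped, gave_up = scrape_all({'1', '2', '3'}, fake_fetch)
```

Identifiers that exhaust their attempts end up in `gave_up`, which could be saved alongside the other checkpoint files for later inspection.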

In [None]:
list_of_dataframes = []

for person_df in glob.glob(os.path.join('person_dfs', '*.pickle')):
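    # load() appends '.pickle' itself, so strip the extension from the globbed path
    list_of_dataframes.append(load(os.path.splitext(person_df)[0]))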
    
combine_dataframes(list_of_dataframes)

#### CSR

The [Syrian Center for Statistics and Research](https://csr-sy.org/) has been recording casualty data since March 2011. It has less information than the VDC dataset, but the location of death is more precise.

They provide their data through a user interface that queries their database using parameters the user defines. This interface provides the following information:

- `ID Number             - Arbitrary ID number`
- `First Name            - First name in Arabic`
- `Father Name           - Father's last name in Arabic`
- `Last Name             - Last name in Arabic`
- `Province              - One of the 14 Provinces of Syria`
- `Town                  - Town where they died`
- `Date of death         - self explanatory`

As with the VDC website, we route our requests through Tor and randomize the user agent. This particular website was cautious about who was looking at its data: it blocked our repeated attempts to scrape it without cycling through IP addresses and user agents.

The following section contains all the libraries that must be imported in order to run the web scraping.

In [None]:
# for requesting content of a web page
from bs4 import BeautifulSoup
from requests import get

# for regex
# import re

# for dataframe creation
import pandas as pd

# for not overwhelming server when scraping pages
from time import sleep
from random import randint

# For a progress bar while scraping
from tqdm import tqdm

# For saving files
import pickle

# For shuffling list
import random

# For Tor Requests
from torrequest     import TorRequest

# To cycle through useragents
from fake_useragent import UserAgent

In order to not have to rescrape everything if one page fails, it is essential to have a pickle file that you can store your successfully scraped data and failed attempts in. Below is the code to create and load a pickle file.

In [None]:
# Save and Load functions for a pickle file
def save(obj, name):
    # the with-block closes the file even if pickling fails
    with open(name + '.pickle', 'wb') as f:
        pickle.dump(obj, f)

def load(name):
    with open(name + '.pickle', 'rb') as f:
        return pickle.load(f)
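
On the very first run none of the pickle files exist yet, so `load` would raise `FileNotFoundError`. Below is a small sketch of one way to handle that, assuming a hypothetical helper named `load_or_default`; the demonstration writes to a temporary directory so that no real checkpoint is touched.

```python
import os
import pickle
import tempfile

def load_or_default(name, default):
    """Load name + '.pickle' if it exists, otherwise return default."""
    path = name + '.pickle'
    if not os.path.exists(path):
        return default
    with open(path, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, 'finished_urls')

    # No file yet, so the default is returned
    first_run = load_or_default(name, [])

    # Simulate a checkpoint written by an earlier run
    with open(name + '.pickle', 'wb') as f:
        pickle.dump(['0', '50'], f)

    # The file exists now, so its contents are returned
    later_run = load_or_default(name, [])
```

With a helper like this, the same notebook cell can be used both to start a fresh scrape and to resume an interrupted one.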

In [None]:
# the url to start scraping CSR
url = 'https://csr-sy.org/?l=1&sons=redirect&sequence=&name=&father_name=&surname=&age_from=0&age_to=120&gender=&born_state=&born_town=&career=&society_status=&sons_no=&medical_status=&incident_state=&incident_town=&incident_desc=3&incident_date_from=&incident_date_to=&incident_details=&trial=&trial_date_from=&trial_date_to=&id=182&ddate_from=&ddate_to=&rec=0'

response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

The website was odd in the sense that some pages at the end didn't contain any information about Syrian casualties. We found the last page that had any information and hard-coded the page number in. As you can see from the code below, 91900 was the last page to contain any useful information.

In [None]:
# create a list of numbers used in shuffling through all the URLs
numbers_url = [str(i) for i in range(0, 91901, 50)]

# shuffling the numbers so that the website doesn't become suspicious of us
# (kept as a list, because converting to a set would discard the shuffled order)
random.shuffle(numbers_url)
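
As a sanity check on the numbers above, the `rec` parameter advances in steps of 50, so the range covers 1,839 result pages. The sketch below also shows one way to build the query string with `urllib.parse.urlencode` instead of string concatenation. `build_url` is a hypothetical helper, and it lists only a few of the query parameters for brevity; a real request would presumably need the full set used in our scraper.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE = 'https://csr-sy.org/'

def build_url(offset):
    """Build a results-page URL for the given record offset."""
    # Only a subset of the real query parameters, to keep the sketch short
    params = {'l': 1, 'sons': 'redirect', 'incident_desc': 3, 'id': 182, 'rec': offset}
    return BASE + '?' + urlencode(params)

offsets = list(range(0, 91901, 50))
last_url = build_url(offsets[-1])

# Parse the URL back to confirm the offset landed in the rec parameter
query = parse_qs(urlparse(last_url).query)
```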

In [None]:
# all the pickle files: the good, the bad, and the ugly
data_list     = load('csr_data_list')
failed_urls   = load('failed_urls')
finished_urls = load('finished_urls')

# removes the URL number if that corresponding page has been scraped successfully
for url in finished_urls:
    try:
        numbers_url.remove(url)
    except (KeyError, ValueError):
        # this URL number is not in the collection, so there is nothing to remove
        continue

# tqdm() creates a progress bar to see how far you are from finishing your task
# scrapes the website in a random order while cycling through IP addresses and user agents
for number_url in tqdm(numbers_url):
    try:
        # being cautious
        sleep(randint(30,60))
        ua         = UserAgent()
        user_agent = ua.random
        headers    = {'User-Agent': user_agent}
        tor = TorRequest(password = 'commonhorse')
        tor.reset_identity()
        
        url = 'https://csr-sy.org/?l=1&sons=redirect&sequence=&name=&father_name=&surname='        + \
                '&age_from=0&age_to=120&gender=&born_state=&born_town=&career=&society_status='    + \
                '&sons_no=&medical_status=&incident_state=&incident_town=&incident_desc=3'         + \
                '&incident_date_from=&incident_date_to=&incident_details=&trial=&trial_date_from=' + \
                '&trial_date_to=&id=182&ddate_from=&ddate_to=&rec=' + number_url


        response = tor.get(url, headers = headers)

        html_soup = BeautifulSoup(response.text, 'html.parser')
        rows = html_soup.findAll('tr', {'title':'victim'})

        # start scraping one page
        for row in rows:
            columns = row.findAll('td')

            # saves the column data
            for i in range(len(columns)):
                if i == 0:
                    continue
                else:
                    data_list[i - 1].append(columns[i].text)
                    
        finished_urls.append(number_url)
        
        # stores the successfully scraped data
        save(data_list    , 'csr_data_list')
        # stores the finished URL number
        save(finished_urls, 'finished_urls')
    
    # pages that failed are stored in the pickle file failed_urls
    except Exception as e:
        print(e)
        print('\nFailed on: ', number_url)
        failed_urls.append(number_url)
        save(failed_urls, 'failed_urls')
        continue
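
The inner loop assumes that `data_list` is a list of seven lists, one per column we keep (the first cell of each row is skipped). Here is a minimal offline sketch of that accumulation, using made-up rows in place of parsed `td` cells; the values and the date format are hypothetical.

```python
# One empty list per kept column: id, first name, father name,
# last name, province, town, date
data_list = [[] for _ in range(7)]

# Hypothetical rows; each has a leading cell we ignore plus 7 data cells
rows = [
    ['', '101', 'A', 'B', 'C', 'Aleppo', 'Town X', '2012-01-01'],
    ['', '102', 'D', 'E', 'F', 'Homs',   'Town Y', '2012-01-02'],
]

for cells in rows:
    # Skip the first cell, then append each remaining cell to its column list
    for i, cell in enumerate(cells[1:]):
        data_list[i].append(cell)
```

Storing the data column by column is what lets the final step pass each inner list straight to `pd.DataFrame`.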

When we're done with all the pages, we want to convert our saved pickle file into a CSV file (to make loading in the data easier).

In [None]:
# creates a data frame
victim_info = pd.DataFrame({
    'victim_id'  : data_list[0],
    'first_name' : data_list[1],
    'father_name': data_list[2],
    'last_name'  : data_list[3],
    'province'   : data_list[4],
    'town'       : data_list[5],
    'date'       : data_list[6]    
})

# saves our file as a CSV file (the filename 'csr_df.csv' is a placeholder; choose your own)
victim_info.to_csv('csr_df.csv', index=False)

### Refugee Data

#### Monthly Inflows

#### Yearly Refugee Status

## Data Wrangling

### Casualty Data

#### VDC

#### CSR

### Refugee Data

#### Monthly Inflows

#### Yearly Refugee Status

## Data Visualization

### Casualty Data

#### VDC

#### CSR

### Refugee Data

#### Monthly Inflows

#### Yearly Refugee Status

## Future Work

## Conclusion

## Appendix

## References