# Casualties and Migration in the Syrian Civil War

## Introduction

***

In 2011, five weeks into the civil demonstrations against the Syrian government, secret police forces detained and tortured fifteen students who had spray painted an anti-government statement on the walls of their school. They would be released weeks later in an effort to quell the rising civil unrest in the province. In the wake of the hundreds of other demonstrators who were killed or disappeared, this action was too little and too late to stop the tide of the civil war. Demonstrations turned to protest turned to armed conflict and the rest is history.

The war would go on to spawn both the largest refugee crisis and one of the deadliest conflicts in modern history. As of 2019, there are over 6 million Syrian refugees and another 6 million internally displaced people in a country with a pre-war population of around 24 million (UNHCR, 2018). The regime's efforts to prevent accurate information from leaving the country have made it nearly impossible to estimate the number of casualties that have occurred in that time. Current estimates range from 300,000 to 600,000 killed, depending on the source.

The link between the flow of violence within the country and the flow of asylum seekers out of the country should be apparent to anyone who is aware of the war. Yet a growing sentiment among residents in host countries is that a large portion of asylum seekers from Syria are actually economic migrants, who are using the conflict as a means of gaining entry into the European Union and access to generous social programs.

We believe that violence is the most important predictor of migration of Syrian refugees. While this argument may be generally accepted, the relationship is difficult to demonstrate conclusively. We hope to address this question using reported casualty data to see whether there is a correlation between violence in a given province and a subsequent increase in the number of asylum seekers across all host countries.

## Project
***

Our project can be organized into three distinct portions:

1. Data Scraping
2. Data Wrangling
3. Data Visualization

Our goal is to create a dataset of casualty information and refugee data, clean and structure the dataset for easier querying, and visualize the data to provide more insight into the questions we pose above.

## Data Scraping
***

There are multiple sources that could be used for casualty information (list here). We will set the others aside for now and focus on the VDC and CSR datasets, because they provide their data in table elements that make it easy for us to scrape and organize into dataframes for analysis.

We will now go through the process of scraping and creating the initial forms of these datasets.

### Casualty Data
***

#### VDC

The [Violations Documentation Center](http://www.vdc-sy.info/) has been recording casualty data since June 2011. It is likely the most detailed and complete (in terms of metadata) data source of casualties that is publicly accessible.

They provide their data through a user interface that queries their database using parameters the user defines. This interface provides the following information:

- `Name                  - Full name in English`
- `Status                - Civilian, non-civilian, or military status of deceased`
- `Sex                   - Whether deceased is an Adult or Minor and Male or Female`
- `Province              - One of the 14 Provinces of Syria`
- `Area \ Place of Birth - Various locations that can be Provinces/Subdistricts/Towns`
- `Date of death         - self explanatory`
- `Cause of death        - self explanatory`
- `Actors                - groups involved in the casualty`

Each entry is associated with a unique identifier, which is an integer between 0 and 250,000. Clicking on the name of the entry will lead the user to another page that provides the unique identifier number and other data that is not displayed on the main page. We will avoid describing this detail for now, since most of this data is not used in the final product.

We will now walk through the full process we used to scrape the list of entries from this website, as well as the detailed information for each entry.

In [None]:
# Imports assumed from the names used below
import re

from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent
from torrequest import TorRequest

def scrape_recent():
    first_page = 'http://www.vdc-sy.info/index.php/en/martyrs/1/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8'
    
    # This is the format of the links that give us the unique identifiers
    pattern    = re.compile(r'/index\.php/en/details/martyrs/.')

    # We want to establish a randomized user agent and Tor node to avoid detection
    ua         = UserAgent()
    headers    = {'User-Agent': ua.random}
    tor        = TorRequest(password = 'commonhorse')

    # Start with an empty set so the function still returns if the request fails
    links      = set()
    
    try:
        response = tor.get(first_page, headers=headers)
        content  = bs(response.text, 'html.parser')
        
        # This set comprehension grabs all unique identifiers in string format for all links that match
        # our regex pattern from above
        links    = {link['href'][30:] for link in content.find_all('a', href = True) if pattern.match(link['href'])} 

    except Exception as e:
        print(e)

    return links
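
The slice `[30:]` above depends on the exact length of the link prefix. As a minimal sketch, the same identifiers could be pulled out with a capture group instead. The sample `sample_hrefs` list and the helper name `extract_uids` are made up for illustration, and the sketch assumes identifiers are purely numeric, as the description of the unique identifier suggests.

```python
import re

# Matches detail links and captures the trailing numeric identifier.
# Assumption: identifiers consist only of digits.
DETAILS_LINK = re.compile(r'^/index\.php/en/details/martyrs/(\d+)$')

def extract_uids(hrefs):
    """Return the set of identifier strings found in an iterable of href values."""
    uids = set()
    for href in hrefs:
        match = DETAILS_LINK.match(href)
        if match:
            uids.add(match.group(1))
    return uids

# Hypothetical href values, for illustration only
sample_hrefs = [
    '/index.php/en/details/martyrs/12345',
    '/index.php/en/martyrs/2/',
    '/index.php/en/details/martyrs/678',
]
print(extract_uids(sample_hrefs))
```

With a capture group, a change in the URL prefix makes the pattern stop matching rather than silently returning a wrongly sliced string.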

In [None]:
'''
Given a unique identifier in string format, scrapes the details page and saves the entry
as an individual dataframe that represents one person.
'''

def scrape_details(uid, tor, headers):
    cols = []
    vals = []

    url  = 'http://www.vdc-sy.info/index.php/en/details/martyrs/' + uid
    
    # Headers will provide the UserAgent to use when getting response
    # Makes the request using a TorRequest object passed in
    page = tor.get(url, headers = headers).text
    page = bs(page, 'html.parser')
    
    # Grabs the relevant table info and all rows in it
    table = page.find('table', attrs = {'class':'peopleListing'})
    rows  = table.find_all('tr')

    for row in rows:
        data = row.find_all('td')

        # Rows that do not have exactly 2 cells
        # are not data we are looking for
        if len(data) != 2:
            continue

        # data[0] corresponds to the row label/column
        cols.append(data[0].text)
        
        # Values need to be appended differently for image rows
        if data[1].find('img') is not None:
            vals.append(data[1].find('img')['src'])
        else:
            vals.append(data[1].text)

    # Adds the uid to the dataframe
    cols.append('uid')
    vals.append(uid)

    # Creates and saves dataframe
    person = pd.DataFrame([vals], columns = cols, dtype=str)

    save(person, os.path.join('person_dfs', uid))
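
The `save` and `load` helpers used throughout this notebook are not shown in this section. Below is a minimal sketch of what they might look like, assuming they pickle objects to `<name>.pickle` (the extension that the `glob` pattern in a later cell looks for). The real helpers may differ, and `_pickle_path` is a hypothetical name.

```python
import os
import pickle

def _pickle_path(name):
    # Accept both 'vdc_df' and 'person_dfs/123.pickle'
    return name if name.endswith('.pickle') else name + '.pickle'

def save(obj, name):
    """Pickle obj to '<name>.pickle', creating parent directories if needed."""
    path = _pickle_path(name)
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)

def load(name):
    """Load an object previously written by save()."""
    with open(_pickle_path(name), 'rb') as handle:
        return pickle.load(handle)
```

Accepting names with or without the extension lets the same `load` serve both `load('vdc_df')` and the full paths returned by `glob`.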
    
    
    

Each details page has a different number of fields depending on the metadata associated with that entry, so the individual dataframes do not all share the same columns and we now have to combine them. Pandas cannot align dataframes whose column names repeat, so we first rename all duplicate columns using this code.

In [None]:
def rename_dup_cols(dataframe):
    cols = pd.Series(dataframe.columns)

    # Index.get_duplicates() was removed in pandas 1.0, so duplicated() is used instead
    for dup in cols[cols.duplicated()].unique():
        mask = (cols == dup).values
        # The first occurrence keeps its name; later ones become dup_1, dup_2, ...
        cols[mask] = [dup + '_' + str(d_idx) if d_idx != 0 else dup for d_idx in range(mask.sum())]

    dataframe.columns = cols

    return dataframe
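
To make the renaming rule concrete, here is a small self-contained sketch of the same idea on a toy dataframe. The column names and the helper name `suffix_duplicates` are invented for illustration. The first occurrence keeps its name, and later occurrences get `_1`, `_2`, and so on.

```python
import pandas as pd

def suffix_duplicates(columns):
    """Return column names with repeats renamed to name_1, name_2, ..."""
    seen = {}
    renamed = []
    for name in columns:
        count = seen.get(name, 0)
        renamed.append(name if count == 0 else f'{name}_{count}')
        seen[name] = count + 1
    return renamed

# Toy dataframe with a repeated 'Notes' column (made-up labels)
toy = pd.DataFrame([['a', 'b', 'c', 'd']], columns=['Name', 'Notes', 'Notes', 'Notes'])
toy.columns = suffix_duplicates(toy.columns)
print(list(toy.columns))
```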




Now, given a list of dataframes, we can build a combined dataframe that retains all column data, save it as `vdc_df`, and save any dataframes that failed to combine as `failed_vdc_df`.

In [None]:
def combine_dataframes(dataframes):
    failed_dataframes = []
    combined          = pd.DataFrame()

    counter = 1
    num     = len(dataframes)

    for df in dataframes:
        try:
            combined = pd.concat([combined, df], axis = 0)
            print(f'{counter} / {num} people processed in combine_dataframes().')
            counter += 1
        
        except Exception as e:
            failed_dataframes.append(df)
            print('Failed')
            counter += 1

    save(combined, 'vdc_df')
    save(failed_dataframes, 'failed_vdc_df')

    print('\n\nSuccess: ', len(dataframes) - len(failed_dataframes))
    print('Failed: ', len(failed_dataframes))
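
Calling `pd.concat` inside a loop copies the accumulated frame on every iteration, which gets slow with many small frames. As a hedged alternative, a single `pd.concat` over the whole list aligns differing columns the same way and fills the gaps with `NaN`. The two toy frames below are invented for illustration.

```python
import pandas as pd

# Two one-row frames with partly different columns (made-up values)
first  = pd.DataFrame([['Idlib', '1']], columns=['Province', 'uid'])
second = pd.DataFrame([['Homs', 'Shooting', '2']], columns=['Province', 'Cause of Death', 'uid'])

# One concat call over the whole list; missing cells are filled with NaN
combined = pd.concat([first, second], axis=0, ignore_index=True, sort=False)
print(combined)
```

The trade-off is that a single call cannot isolate one bad frame the way the per-frame `try` block does, so it suits data that has already been validated.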
    
    
    

Putting this all together, we will now:

1. Build a list of unique identifiers by scraping the query page for the VDC database using scrape_recent()

2. Scrape the detailed information provided the list of unique ids from scrape_recent() using scrape_details, which gives us dataframes for each person.

3. Combine those dataframes into one large dataset using combine_dataframes()


In [None]:
uids_to_scrape = scrape_recent()
uids_scraped   = set()

while len(uids_to_scrape) > 0:
    uid = uids_to_scrape.pop()
    
    try:
        ua         = UserAgent()
        headers    = {'User-Agent': ua.random}
        tor        = TorRequest(password = 'cmps184')
        scrape_details(uid, tor, headers)

    except Exception as e:
        print(e)
        uids_to_scrape.add(uid)

        ua         = UserAgent()
        headers    = {'User-Agent': ua.random}
        tor        = TorRequest(password = 'cmps184')
        tor.reset_identity()

        continue
        
    uids_scraped.add(uid)

    save(uids_to_scrape, 'uids_to_scrape')
    save(uids_scraped  , 'uids_scraped')
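
One caveat with the loop above: it is written to re-queue an identifier whenever scraping fails, so an identifier that always fails would be retried indefinitely. A minimal sketch of one way to cap retries follows. Here `fetch` stands in for `scrape_details` together with its Tor setup, and `scrape_all` and `max_attempts` are assumed names, not part of the original code.

```python
from collections import Counter

def scrape_all(uids, fetch, max_attempts=3):
    """Call fetch(uid) for every uid, retrying failures up to max_attempts times.

    Returns (scraped, abandoned) as two sets of identifiers.
    """
    pending   = set(uids)
    attempts  = Counter()
    scraped   = set()
    abandoned = set()

    while pending:
        uid = pending.pop()
        attempts[uid] += 1
        try:
            fetch(uid)
        except Exception as error:
            print(f'{uid} failed on attempt {attempts[uid]}: {error}')
            if attempts[uid] < max_attempts:
                pending.add(uid)      # try again later
            else:
                abandoned.add(uid)    # give up on this identifier
            continue
        scraped.add(uid)

    return scraped, abandoned
```

The abandoned set can be saved alongside the scraped one so that failed identifiers can be inspected or retried in a later session.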

In [None]:
list_of_dataframes = []

for person_df in glob.glob(os.path.join('person_dfs', '*.pickle')):
    list_of_dataframes.append(load(person_df))
    
combine_dataframes(list_of_dataframes)

We can finally load the dataset and look at its contents. For now, we are finished with the scraping portion for this source; we will revisit it when we wrangle the data into a more suitable format.

In [None]:
vdc_df = load('vdc_df')
vdc_df.head()

#### CSR

### Refugee Data

#### Monthly Inflows

#### Yearly Refugee Status

## Data Wrangling

### Casualty Data

#### VDC

You can run the cell below to see what the dataset looks like without any modification.

In [None]:
vdc_df = load('vdc_df')
vdc_df

While the added details we got from scraping everything from the website are valuable for more detailed analysis, these are the columns we will focus on in this project:

- `Name                  `
- `Status                `
- `Sex                   `
- `Province              `
- `Area \ Place of Birth `
- `Date of death         `
- `Cause of death        `
- `Actors                `

We can create a copy of the dataframe to do all of our work on, so that we are not modifying the original dataset.

In [None]:
scratch = vdc_df[['Province',
                  'Sex',
                  'Status',
                  'Date of death',
                  'Cause of Death']].copy()

If we look at the `Sex` column, we can see that it actually encodes two things: whether the person was an adult or a minor, and whether they were male or female. We will create new columns to capture each piece of information separately.

In [None]:
# We'll first want to drop any rows that don't have this information
scratch = scratch.dropna(subset=['Sex'])

def check_age(row):
    if 'Adult' in row['Sex']:
        val = 'adult'
    else:
        val = 'minor'
    return val

scratch['age_cat'] = scratch.apply(check_age, axis=1)

def check_sex(row):
    if 'Male' in row['Sex']:
        val = 'male'
    else:
        val = 'female'
    return val

scratch['sex'] = scratch.apply(check_sex, axis=1)
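
The two row-wise helpers can also be written with vectorized string methods, which avoids calling a Python function once per row. This is only a sketch: the sample values below assume the `Sex` column looks like `'Adult - Male'` or `'Child - Female'`, which is a guess at the format rather than something confirmed here.

```python
import numpy as np
import pandas as pd

# Made-up values in the assumed 'Sex' format
toy = pd.DataFrame({'Sex': ['Adult - Male', 'Adult - Female', 'Child - Male', 'Child - Female']})

# Same rules as check_age / check_sex: 'Adult' -> adult, otherwise minor;
# 'Male' (case-sensitive, so it does not match 'Female') -> male, otherwise female
toy['age_cat'] = np.where(toy['Sex'].str.contains('Adult', regex=False), 'adult', 'minor')
toy['sex']     = np.where(toy['Sex'].str.contains('Male', regex=False), 'male', 'female')
print(toy)
```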

If we look at the `Cause of Death` column, we'll see that there are some redundant categories, so we'll simplify them by remapping the values using the dictionary shown below.

In [None]:
cause_of_death_map = {'Chemical and toxic gases'         : 'Chemical Weapon',
                      'Detention - Execution'            : 'Detention',
                      'Detention - Torture'              : 'Detention',
                      'Detention - Torture - Execution'  : 'Detention',
                      'Explosion'                        : 'Explosion',
                      'Field Execution'                  : 'Execution',
                      'Kidnapping - Execution'           : 'Execution',
                      'Kidnapping - Torture'             : 'Execution',
                      'Kidnapping - Torture - Execution' : 'Execution',
                      'Other'                            : 'Unknown'  ,
                      'Shelling'                         : 'Shelling' ,
                      'Shooting'                         : 'Shooting' ,
                      'Siege'                            : 'Siege'    ,
                      'Un-allowed to seek Medical help'  : 'Lack of Medical Access',
                      'Unknown'                          : 'Unknown'  ,
                      'Warplane shelling'                : 'Shelling' 
}

def check_cause_of_death(row, mapping):
    return mapping[row['Cause of Death']]

scratch['cause_of_death'] = scratch.apply(check_cause_of_death,
                                        args = (cause_of_death_map, ),
                                        axis = 1)
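
Indexing the dictionary directly raises a `KeyError` as soon as a row holds a label that is missing from the mapping. A hedged alternative is `Series.map`, which returns `NaN` for unmapped labels so they can be sent to a fallback category. The shortened mapping and the `'Unlisted label'` value are invented for illustration.

```python
import pandas as pd

# Shortened, illustrative version of the mapping
small_map = {'Shelling': 'Shelling', 'Warplane shelling': 'Shelling', 'Other': 'Unknown'}

causes = pd.Series(['Shelling', 'Warplane shelling', 'Other', 'Unlisted label'])

# map() yields NaN for labels missing from the dict; fillna() routes them to 'Unknown'
simplified = causes.map(small_map).fillna('Unknown')
print(simplified.tolist())
```

Whether unmapped labels should fall back to `'Unknown'` or be flagged for review is a judgment call that depends on how clean the scraped values turn out to be.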

For convenience, we can also normalize the values in the status column.

In [None]:
def check_status(row):
    if row['Status'] == 'Non-Civilian':
        val = 'non_civilian'
    elif row['Status'] == 'Civilian':
        val = 'civilian'
    else:
        val = 'regime'
    return val

scratch['status'] = scratch.apply(check_status, axis=1)

Now that the dataset is cleaner, we can drop the columns that are irrelevant to us and rename the remaining columns for convenience.

In [None]:
scratch = scratch[['Province',
                  'sex',
                  'status',
                  'age_cat',
                  'Date of death',
                  'cause_of_death']].copy()

scratch.columns = ['province', 'sex', 'status', 'age_cat','date_of_death', 'cause_of_death']



We will now drop any entries with unrecorded or incorrect dates of death and convert the date strings to pandas datetime values.

With all of those modifications, we can finally save this dataset as complete.

In [None]:
# `picked` must be defined beforehand; it is assumed to hold the province names to keep
scratch = scratch[scratch['province'].isin(picked)]
scratch = scratch[scratch['date_of_death'] != '0000-00-00']
scratch = scratch[scratch['date_of_death'] != '1970-01-01']
scratch['date_of_death'] = pd.to_datetime(scratch['date_of_death'])

save(scratch, 'clean_vdc_df')
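
The explicit comparisons above only catch the two placeholder strings. As a sketch of a slightly more defensive variant, `pd.to_datetime(..., errors='coerce')` turns any unparseable string into `NaT`, which can then be dropped. The sample strings are made up, and treating `1970-01-01` as a placeholder follows the filter above.

```python
import pandas as pd

dates = pd.Series(['2012-03-04', '0000-00-00', '1970-01-01', 'not a date'])

# Unparseable strings become NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')

# Drop NaT values and the Unix-epoch placeholder
valid = parsed[parsed.notna() & (parsed != pd.Timestamp('1970-01-01'))]
print(valid)
```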

We are not completely done, since we'd like to create and save some datasets that will be easy to use for visualization later.

In [None]:
clean_vdc_df = load('clean_vdc_df')

#### CSR

### Refugee Data

#### Monthly Inflows

#### Yearly Refugee Status

## Data Visualization

### Casualty Data

#### VDC

We'll make sure that we are working with the clean dataset first.

In [None]:
clean_vdc_df = load('clean_vdc_df')

To make visualizations easy, we'll write a function that lets us make Bokeh-ready dataframes with a variety of options.

In [1]:
'''
Allows us to easily create Bokeh-ready dataframes based on categorical data in multiple time formats

'''

def make_bokeh_df(df, focus = 'province', tframe = 'year'):
    
    if tframe   == 'day'  :
        bokeh_df         = df.groupby([focus , 'date_of_death']).agg({'date_of_death': 'count'})
        bokeh_df.columns = ['count']
        bokeh_df         = bokeh_df.reset_index()
        bokeh_df.columns = [focus, 'day', 'count']
        bokeh_df         = bokeh_df.pivot_table(index = ['day'], columns = focus, values = 'count').fillna(0)
    
    elif tframe == 'month':
        bokeh_df = df.groupby([focus, (df.date_of_death.dt.year), (df.date_of_death.dt.month)]).agg({'date_of_death': 'count'})
        bokeh_df.columns             = ['count']
        bokeh_df.columns             = bokeh_df.columns.get_level_values(0)
        bokeh_df.index.names         = [focus, 'year', 'month']
        bokeh_df.reset_index(inplace = True)
        bokeh_df                     = bokeh_df.pivot_table(index = ['year', 'month'], columns = focus, values = 'count').fillna(0)
    
    elif tframe == 'year' :
        bokeh_df         = df.groupby([focus, (df.date_of_death.dt.year)]).agg({'date_of_death': 'count'})
        bokeh_df.columns = ['count']
        bokeh_df         = bokeh_df.reset_index()
        bokeh_df.columns = [focus, 'year', 'count']
        bokeh_df         = bokeh_df.pivot_table(index = ['year'], columns = focus, values = 'count').fillna(0)
        
    elif tframe == 'total' :
        bokeh_df         = df.groupby([focus]).agg({'date_of_death': 'count'})
        bokeh_df.columns = ['count']
        bokeh_df         = bokeh_df.reset_index()
        bokeh_df.columns = [focus, 'count']

    else:
        # Fail loudly instead of hitting an unbound bokeh_df below
        raise ValueError(f'Unsupported tframe: {tframe!r}')

    return bokeh_df
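
As a quick illustration of the shape `make_bokeh_df` aims for with `tframe = 'year'` (one row per year, one column per category, zeros where nothing was recorded), here is a sketch using `pd.crosstab` on a toy frame. The rows below are invented, and the crosstab approach is an alternative for illustration, not the function above.

```python
import pandas as pd

# Toy stand-in for clean_vdc_df (made-up rows)
toy = pd.DataFrame({
    'province': ['Idlib', 'Idlib', 'Homs', 'Homs', 'Homs'],
    'date_of_death': pd.to_datetime(
        ['2012-01-05', '2013-06-01', '2012-02-10', '2012-09-30', '2012-11-11']),
})

# Rows are years, columns are provinces, cells are counts (0 when absent)
yearly = pd.crosstab(toy['date_of_death'].dt.year, toy['province'])
yearly.index.name = 'year'
print(yearly)
```

A wide table like this has one column per category, so each column can be plotted as its own line or bar series.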
    

In [None]:
categories = ['province', 'sex', 'status', 'age_cat','date_of_death', 'cause_of_death']
graph_type = ['line', 'point', 'bar', 'area', 'pie']
time_frame = ['day', 'month', 'year']

#### CSR

### Refugee Data

#### Monthly Inflows

#### Yearly Refugee Status

## Future Work

## Conclusion

## Appendix

## References