In [1]:
%pylab inline
import pandas as pd
import altair as alt
import re
from IPython.display import Image

Populating the interactive namespace from numpy and matplotlib


# Summary

Note:  
Data Quality: Excellent / Good / Bad  
Up-to Date: Yes (data available through 2019) / No (data not available through 2019)

| Chart Number  | Data Access   |Up-to Date    |Data Quality  |Comments      |
| ------------- | ------------- |--------------|--------------|--------------|
| Chart-1        | Yes           |No            |Excellent     |Data Available up to 2017|
| Chart-2        | Yes           |No            |Excellent     |Data Available up to 2017|
| Chart-3        | ------------- |--------------|--------------|--------------|
| Chart-4        | ------------- |--------------|--------------|--------------|
| Chart-5        | ------------- |--------------|--------------|--------------|
| Chart-6        | ------------- |--------------|--------------|--------------|
| Chart-7        | ------------- |--------------|--------------|--------------|
| Chart-8        | ------------- |--------------|--------------|--------------|
| Chart-9        | Yes           |No            |Excellent     |Data Available up to 2017|
| Chart-10       | Yes           |No            |Excellent     |Data Available up to 2017|
| Chart-11       | ------------- |--------------|--------------|--------------|
| Chart-12       | ------------- |--------------|--------------|--------------|
| Chart-13       | ------------- |--------------|--------------|--------------|
| Chart-14       | ------------- |--------------|--------------|--------------|
| Chart-15       | ------------- |--------------|--------------|--------------|
| Chart-16       | ------------- |--------------|--------------|--------------|

In the sections below we explain where the data for each chart comes from (the source and how to access it), what it contains (a description), and how it was prepared (the cleansing process). Each chart, or group of charts, has its own heading followed by the where, what, and how of its data.


## Chart-1

For __Chart 1__ and __Chart 2__, we obtained the data from this [Guardian article](https://www.theguardian.com/news/datablog/2012/jul/22/gun-homicides-ownership-world-list#data). The data had to be downloaded manually as a .csv file and read into the notebook.

Get data here: https://docs.google.com/spreadsheets/d/1chqUZHuY6cXYrRYkuE0uwXisGaYvr7durZHJhpLGycs/edit#gid=0

#### Read in the file

In [2]:
df = pd.read_csv("data/guardian.csv", index_col = False)

The file has a column measuring the homicide-by-firearm rate per 100,000 population. To recreate the Vox visualization, we create a new column, 'fire_hom_m', that measures the rate per 1 million population (the per-100,000 rate multiplied by 10).

In [3]:
df['fire_hom_m'] = df['Homicide by firearm rate per 100,000 pop'] * 10

df2 = df.dropna().reset_index(drop = True)
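
Note that `dropna()` with no arguments drops every row that has a missing value in any column, not only in the rate column. The sketch below runs on made-up data (the country labels, the values, and the 'Some other column' name are illustrative and not from the Guardian file). It shows how `subset` restricts the check to the one column we need:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Guardian dataset)
toy = pd.DataFrame({
    'Country/Territory': ['A', 'B', 'C'],
    'Homicide by firearm rate per 100,000 pop': [0.5, np.nan, 3.0],
    'Some other column': [np.nan, 1.0, 2.0],
})

# Rate per 100,000 -> rate per 1,000,000: multiply by 10
toy['fire_hom_m'] = toy['Homicide by firearm rate per 100,000 pop'] * 10

# Drops rows with a missing value in ANY column (what a bare dropna() does)
strict = toy.dropna().reset_index(drop=True)

# Drops only rows where the rate itself is missing
lenient = toy.dropna(subset=['fire_hom_m']).reset_index(drop=True)

print(len(strict), len(lenient))
```

Whether the stricter or the more lenient behaviour is right depends on which columns the later charts rely on.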

Next, we declare a list of only the countries we want to visualize. The Vox chart shows only a selected list of developed countries.

In [4]:
# List of selected developed countries
ls = ['Australia', 'New Zealand', 'Germany', 'Austria', 'Denmark',\
     'Netherlands', 'Sweden', 'Finland', 'Ireland', 'Canada', 'Luxembourg', 'Belgium',\
     'Switzerland', 'United States']

# We sort the table by fire_hom_m values in ascending order. We leave the index unchanged in case we want the original order.
adv = df2[df2['Country/Territory'].isin(ls)].reset_index(drop = True)
adv = adv.sort_values(by = 'fire_hom_m')

The dataframe is now ready to use for visualizing __Chart 1__.

We can replicate this chart by using Altair's Isotype Visualization, as outlined here: https://altair-viz.github.io/gallery/isotype.html

## Chart-2

For __Chart 2__ (a pair of charts), we have to calculate two proportions:
1. The proportion of _US population_ to the global population, and,
2. The proportion of _US civilian gun ownership_ to global civilian gun ownership

The Guardian data already in hand lets us address (2) using the 'Average total all civilian firearms' field.

In [8]:
# Calculates the US share of global civilian firearms (the column is selected by position).
us_guns_prop = df[df['ISO code'] == 'US'].iloc[:,-2].sum() / df.iloc[:,-2].sum()
print('Americans own ' + str(round(us_guns_prop, 3) * 100) + '% of the world\'s guns.')

Americans own 41.9% of the world's guns.


This result is consistent with the percentage presented by Vox in the RHS chart.
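
The positional lookup `iloc[:, -2]` depends on column order, which shifts whenever a column such as 'fire_hom_m' is appended. As a hedged sketch on toy numbers (the values and the 'XX'/'YY' codes are made up; only the column names 'ISO code' and 'Average total all civilian firearms' come from the text above), the same share can be computed by column name and formatted with a percent format spec:

```python
import pandas as pd

# Toy data: made-up firearm counts for three hypothetical ISO codes
toy = pd.DataFrame({
    'ISO code': ['US', 'XX', 'YY'],
    'Average total all civilian firearms': [270.0, 230.0, 100.0],
})

col = 'Average total all civilian firearms'

# Select by column name instead of by position
us_total = toy.loc[toy['ISO code'] == 'US', col].sum()
share = us_total / toy[col].sum()

# The '.1%' format multiplies by 100 and rounds to one decimal place
message = f"Americans own {share:.1%} of the world's guns."
print(message)
```

Formatting inside the f-string also avoids the floating-point noise that `round(x, 3) * 100` can occasionally produce.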

To recreate the LHS chart, we have to be slightly crafty and utilize some assumptions. Witness the strength of street knowledge:

- We can safely assume that the world population today is somewhere around 7 billion people, while the current US population is about 320 million people.

We'll have to validate this using some [World Bank data](https://data.worldbank.org/indicator/sp.pop.totl) (by way of Wikipedia). This dataset is up to date as of 2017.

__Note: the .csv file needs to be edited in a text editor to remove the first 4 header lines. `pd.read_csv` will throw an error if these lines are not removed.__

In [9]:
bank = pd.read_csv('data/world_bank_pop_2017.csv', index_col = False, encoding = 'latin1')
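
As an alternative to editing the file by hand, `pd.read_csv` accepts a `skiprows` argument. This is a sketch and assumes the downloaded file starts with exactly four metadata lines before the real header row. The inline text is a stand-in for the file, with invented country names and numbers, not actual World Bank data:

```python
import io
import pandas as pd

# Stand-in for the downloaded file: four metadata lines, then the real header
raw = (
    "metadata line 1\n"
    "metadata line 2\n"
    "metadata line 3\n"
    "metadata line 4\n"
    "Country Name,Country Code,2016,2017\n"
    "Aland,AAA,10,11\n"
    "Borduria,BBB,20,22\n"
)

# skiprows=4 skips the first four lines, so line five becomes the header
bank_sketch = pd.read_csv(io.StringIO(raw), skiprows=4)
print(bank_sketch.columns.tolist())
```

With the real file, the same idea would presumably be `pd.read_csv('data/world_bank_pop_2017.csv', skiprows=4, encoding='latin1')`, provided the unedited download really has four leading lines.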

In [25]:
def pop_prop(country, year):
    """Print the share of the world population living in `country` in `year`."""
    if year < 1960:
        print('Please enter 1960 or later.')
    elif year > 2017:
        print('Please enter 2017 or earlier.')
    else:
        # Regex matching aggregate rows (regions, income groups, etc.) that must not count toward the global population
        filt = '.*&.*|.*Income.*|.*North America.*|.*Asia.*|.*Europe.*|.*dividend.*|.*area.*|.*conflict.*|.*indebted.*|.*only.*|.*IDA.*|.*classified.*|.*developed.*|.*income.*|.*World.*|.*Sub-Saharan.*|.*states.*|.*members.*'

        # Calculate the global population and the country's population separately, then print the proportion
        is_aggregate = bank['Country Name'].str.match(filt)
        total_pop = bank[~is_aggregate][str(year)].sum()
        country_pop = bank[bank['Country Name'] == country][str(year)].sum()

        print(f'{country} makes up {country_pop / total_pop:.2%} of the world population.')

In [26]:
pop_prop('United States', 2017)

United States makes up 4.33% of the world population.


Our calculation appears to overestimate the total world population, which pushes the US share down slightly. The result is still within 10 basis points (0.1 percentage points) of Vox's figure, so it is close enough for our purposes.
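
One way to gauge how far the summed total drifts is to compare it against an aggregate row. The filter above excludes names matching 'World', which suggests the file carries such a row. The sketch below assumes a row named exactly 'World' and uses made-up names and numbers throughout:

```python
import pandas as pd

# Toy data: one 'World' aggregate, one regional aggregate, three countries
toy_bank = pd.DataFrame({
    'Country Name': ['World', 'Euro area', 'Aland', 'Borduria', 'Syldavia'],
    '2017': [1000, 300, 400, 350, 300],
})

# Shortened stand-in for the full aggregate filter used in pop_prop
aggregates = 'World|area'
is_aggregate = toy_bank['Country Name'].str.contains(aggregates)

# Sum of individual countries vs. the reported world aggregate
summed_total = toy_bank.loc[~is_aggregate, '2017'].sum()
world_row = toy_bank.loc[toy_bank['Country Name'] == 'World', '2017'].iloc[0]

overshoot = summed_total / world_row - 1
print(f'Summed total exceeds the World row by {overshoot:.1%}')
```

A positive overshoot on the real data would point to aggregate or overlapping rows that slip through the filter.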

Again, we will use Altair's Isotype Grid Visualization outlined here: https://altair-viz.github.io/gallery/isotype_grid.html

## Chart-3

## Chart-4

## Chart-5

## Chart-6

## Chart-7

## Chart-8

## Chart-9 - Still, gun homicides (like all homicides) have declined over the past couple decades.
## Chart-10 - Most gun deaths are suicides.

**Chart#9** and **Chart#10** share the same data source and the same fetching and cleaning process, so they are covered together.

+ Data Source: https://webappa.cdc.gov/sasweb/ncipc/mortrate.html
+ Fetching Process: Manual
+ Need Cleaning: Yes

**Fetching Process**
The data is downloaded from https://webappa.cdc.gov/sasweb/ncipc/mortrate.html.
For **Chart#9** we need to collect the data in two queries, one for the year range 1999 to 2017 and one for 1981 to 1998.
For **Chart#10** a single query for 1999 to 2017 is enough. We repeat the steps below for each year range and for each 'Intent or Manner of injury' value (Homicide or Suicide).

1. On the website above, choose the first year-range option, '1999 to 2017 (ICD-10), National and Regional', and set 'Intent or Manner of injury' to 'Homicide', as shown in [Picture-1](https://github.com/srivasud/Group7/blob/master/images/Q9-10-CDCImage-1.png)
2. Set 'Cause or mechanism of the injury' to 'Firearm', as shown in [Picture-2](https://github.com/srivasud/Group7/blob/master/images/Q9-10-CDCImage-2.png)
3. Select the 'Specific Options' as shown in [Picture-3](https://github.com/srivasud/Group7/blob/master/images/Q9-10-CDCImage-3.png)
4. Choose the 'Advanced Options' as shown in [Picture-4](https://github.com/srivasud/Group7/blob/master/images/Q9-10-CDCImage-4.png)
5. Click submit request. The result is shown as an HTML page. Scroll to the bottom to download it as a csv file.
6. For Chart-9, repeat the same steps for the period 1981 to 1998 and download a separate csv.
7. For Chart-10, set 'Intent or Manner of injury' to 'Suicide' and the year range to 1999 to 2017.

We downloaded 3 csv files (FirearmHomicide-1981-1998.csv, FirearmHomicide-1999-2018.csv, Firearm-Suicides-1999-2018.csv) and uploaded them to the [data](https://github.com/srivasud/Group7/tree/master/data) folder in this repository.

The program below cleans the data, concatenates the three files, and puts the result into a single data frame.
The data frame after cleaning and removing unwanted columns is described below.
Our final data frame contains data on firearm deaths in the USA, with the following columns.
+ Cause_of_Death - The intent of the firearm death. The possible values are 'Suicide' or 'Homicide'.
+ Year - The year the data is reported for.
+ Deaths - Total deaths reported due to firearms with the given intent (Homicide or Suicide).
+ Population - Total population of the USA for the reported year.
+ Crude_Rate - Crude rate per 100,000 population. The formula used is (Total Deaths / Total Population) * 100,000.
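
As a worked check of the crude rate formula, here is a small sketch. `crude_rate` is a helper name introduced only for illustration; the inputs are the 2017 firearm suicide figures that appear in the sample output of the cleaned data frame:

```python
def crude_rate(deaths, population, per=100_000):
    """Return deaths per `per` people: (deaths / population) * per."""
    return deaths / population * per

# 2017 firearm suicides: 23854 deaths in a population of 325719178
rate_2017 = crude_rate(23854, 325719178)

# Rounded to one decimal place, as in the cleaning code
print(round(rate_2017, 1))
```

The rounded value agrees with the Crude_Rate reported for that row (7.3).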

In [2]:
# Data Reading and Cleaning
# FirearmHomicide-1981-1998.csv - Homicides by firearm from 1981 to 1998
# FirearmHomicide-1999-2018.csv - Homicides by firearm from 1999 to 2017
# Firearm-Suicides-1999-2018.csv - Suicides by firearm from 1999 to 2017

# Read the firearm homicide csv for 1981 to 1998 from GitHub
fire_arm_1 = pd.read_csv("https://raw.githubusercontent.com/srivasud/Group7/master/data/FirearmHomicide-1981-1998.csv") 

# Remove the last row which is the summation row
fire_arm_1=fire_arm_1[:-1] 

# Read Firearm homicide csv from 1999 to 2017 from GIT
fire_arm_2 = pd.read_csv("https://raw.githubusercontent.com/srivasud/Group7/master/data/FirearmHomicide-1999-2018.csv") 
# Remove the last row which is the summation row
fire_arm_2=fire_arm_2[:-1] 

# Dropped the not needed column 'Age-Adjusted Rate'
fire_arm_2.drop('Age-Adjusted Rate', axis=1, inplace=True)


# Read Firearm suicide data from 1999 to 2017 from GIT
firearm_suicide_2=pd.read_csv('https://raw.githubusercontent.com/srivasud/Group7/master/data/Firearm-Suicides-1999-2018.csv')
# Remove the last row which is the summation row
firearm_suicide_2=firearm_suicide_2[:-1]
# Dropped the not needed column 'Age-Adjusted Rate'
firearm_suicide_2.drop('Age-Adjusted Rate', axis=1, inplace=True)

# Concat all the dataframes into one
fa_homicide_suicide=pd.concat([fire_arm_1,fire_arm_2,firearm_suicide_2])

# Convert the Year column to an integer data type.
# (np.int was removed in NumPy 1.24; the built-in int does the same job.)
fa_homicide_suicide['Year'] = fa_homicide_suicide['Year'].astype(int)

# Renumber the rows 0..n-1 after concatenation, discarding the old per-file index
fa_homicide_suicide.reset_index(drop=True, inplace=True)

# Drop unwanted columns such as sex, race, state, ethnicity, age group, first year, last year
fa_homicide_suicide.drop(['Sex','Race','State','Ethnicity','Age Group','First Year','Last Year'],axis=1,inplace=True)

# Round the Crude Rate to 1 decimal point
fa_homicide_suicide['Crude Rate']=fa_homicide_suicide['Crude Rate'].round(1)

# Rename the columns without spaces between parts of the column names.
cols=['Cause_of_Death','Year','Deaths','Population','Crude_Rate']
fa_homicide_suicide.columns=cols
fa_homicide_suicide['Cause_of_Death'] = fa_homicide_suicide['Cause_of_Death'].map({'Homicide Firearm':'Homicide', 'Suicide Firearm': 'Suicide'})
fa_homicide_suicide.sample(10)

|    | Cause_of_Death | Year | Deaths | Population | Crude_Rate |
| -- | -------------- | ---- | ------ | ---------- | ---------- |
| 22 | Homicide       | 2003 | 11920  | 290107933  | 4.1        |
| 14 | Homicide       | 1995 | 15551  | 266278403  | 5.8        |
| 2  | Homicide       | 1983 | 12040  | 233792237  | 5.2        |
| 26 | Homicide       | 2007 | 12632  | 301231207  | 4.2        |
| 17 | Homicide       | 1998 | 11798  | 275854116  | 4.3        |
| 51 | Suicide        | 2013 | 21175  | 316234505  | 6.7        |
| 28 | Homicide       | 2009 | 11493  | 306771529  | 3.8        |
| 49 | Suicide        | 2011 | 19990  | 311644280  | 6.4        |
| 55 | Suicide        | 2017 | 23854  | 325719178  | 7.3        |
| 10 | Homicide       | 1991 | 17746  | 252980942  | 7.0        |
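
The three files go through the same steps: drop the trailing summation row, then drop 'Age-Adjusted Rate' when present. That repetition could be folded into a helper. This is a sketch only: `load_cdc_csv` is a hypothetical name, and the inline CSV text is made-up stand-in data with fewer columns than the actual CDC export.

```python
import io
import pandas as pd

def load_cdc_csv(source):
    """Read one export, drop its final summation row, and drop the
    'Age-Adjusted Rate' column if the file has one."""
    frame = pd.read_csv(source)
    frame = frame.iloc[:-1]  # last row is the summation row
    return frame.drop(columns=['Age-Adjusted Rate'], errors='ignore')

# Made-up stand-ins for an older file (no age-adjusted column) and a newer one
older = io.StringIO("Year,Deaths\n1981,10\n1982,12\nTotal,22\n")
newer = io.StringIO(
    "Year,Deaths,Age-Adjusted Rate\n1999,8,0.1\n2000,9,0.1\nTotal,17,0.1\n"
)

# ignore_index=True renumbers the rows, so no separate reset_index is needed
combined = pd.concat([load_cdc_csv(older), load_cdc_csv(newer)], ignore_index=True)
combined['Year'] = combined['Year'].astype(int)
print(combined)
```

The same helper accepts a URL or file path in place of the `StringIO` objects, since `pd.read_csv` handles both.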


## Chart-11 - The states with the most guns report the most suicides

## Chart-12

## Chart-13

## Chart-14

## Chart-15

## Chart-16