# Analyzing U.S. Gun Violence Data

This is a collaborative project between me and Karthik Gudapati, in which we will pair program to visualize this data and extract insights.
We will be using the same code in each of our kernels, since Kaggle does not seem to support collaboration (if this isn't the case, let us know!).

## The Data
With the rise in gun violence in recent years, now is arguably the best time for politicians to address the problem through legislation.

However, the public has been very divided on the topic, with pro-gun advocates typically deflecting blame to the person using the gun rather than to the large number of guns circulating in the U.S. from gun manufacturers. Also compounding 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Read Data
gun_violence_data = '../input/gun-violence-data_01-2013_03-2018.csv'
gun_violence_df = pd.read_csv(gun_violence_data)

In [None]:
# Show head of data 
gun_violence_df.head()

# Addressing Missing/Corrupted Data
Before we go further, we will need to find out the extent of the missing or corrupted data we are dealing with.

This will mainly matter during the modeling phase (though we will also want to take missing values into account when visualizing), but we should still take a closer look.

In [None]:
# Fraction of missing values in each column (missing count / total rows)
missing_rows_df = gun_violence_df.isnull().sum() / gun_violence_df.shape[0]
print(missing_rows_df)
missing_rows_df.plot(kind = 'bar')

As we can see here, "location_description" and "participant_relationship" have very high rates of missing values; therefore, they will be removed as they will yield little insight in the way of exploratory analysis or predictive modeling.

Some other columns have high rates of missing values as well, but still have enough data to warrant exploratory analysis (even a column that is 50% missing still has 100,000+ rows of data). However, they will be strong candidates for removal in the later predictive modeling stages.
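
The decision about which columns to drop can be made explicit with a cutoff on the missing fraction. The sketch below is only an illustration: the toy dataframe, the helper name `columns_above_missing_threshold`, and the cutoff of 0.5 are all made up and do not come from the real dataset or from this notebook's analysis.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold):
    """Return the names of columns whose missing fraction exceeds threshold."""
    # isnull().mean() is equivalent to isnull().sum() / number of rows
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy data, made up for illustration (not the real dataset)
toy_df = pd.DataFrame({
    'state': ['Ohio', 'Texas', 'Iowa', 'Utah'],
    'notes': [np.nan, np.nan, np.nan, 'bar'],
    'n_killed': [0, 1, np.nan, 2],
})

# Hypothetical cutoff: flag columns that are more than half missing
high_missing = columns_above_missing_threshold(toy_df, threshold=0.5)
print(high_missing)
```

The returned list could then be passed to `drop`, in the same way as the hand-written list of column names.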

In [None]:
# Drop columns "location_description" and "participant_relationship" 
print(gun_violence_df.shape)
drop_columns = ['location_description', 'participant_relationship']
gun_violence_df = gun_violence_df.drop(drop_columns, axis = 1)
print(gun_violence_df.shape)

# Data Transformations and Data Type Conversions
As we can see from various text columns (gun_stolen, gun_type, incident_characteristics, etc.), there is unstructured data here that we will have to make sense of.

For example, in the columns mentioned above, it looks like there are multiple parts separated by pipes (||), with each part mapped together by colons (::).

Also, all the district columns appear to be read in by Pandas as decimal columns (which makes no sense in the context of the column) and will need to be converted.

We will need to evaluate each column on a case-by-case basis. Luckily, some columns have similar text patterns, so we will try grouping them together.

First we will start on a small part of the data and test out our transformations.

## Data Type Conversions

In [None]:
# Look at data types; which columns need converting?
# I.e. confirm district variables are decimals
gun_violence_df.info()

According to the above results, the district columns are decimals (shown as "float64") and will need to be converted to ints (or objects may be better, since they are categories).

The "n_guns_involved" column has also been read in as a decimal (Pandas does this when an integer column contains missing values) and will need to be converted to an int, just like "n_killed" and "n_injured".
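
One option for this kind of conversion is pandas' nullable `Int64` dtype, which stores whole numbers as integers and keeps missing entries as `<NA>`. That makes it an alternative to filling the gaps with a sentinel value. The values below are made up for illustration, so treat this as a sketch, not as the conversion this notebook relies on.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a count column that has gaps
counts = pd.Series([1.0, np.nan, 3.0])

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>
counts_int = counts.astype('Int64')

print(counts_int.dtype)
print(counts_int.isna().sum())  # number of entries that are still missing
```

The trade-off is that some older libraries do not understand the nullable dtype, whereas a plain `int` column with a sentinel works everywhere but mixes fake values into the data.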

Other than that, most of the data looks fine. If we run into any other unforeseen issues, we will deal with them when we get to them.

IMPORTANT NOTE: Any transformation will be written to a new column, and only then will the old column be dropped, so that we never overwrite the original values in place (bad practice).

In [None]:
# Convert n_guns_involved from float to int
# Missing values are filled with -1 as a sentinel, since a plain int column cannot hold NaN
gun_violence_df['n_guns_involved_num'] = gun_violence_df['n_guns_involved'].fillna(-1).astype(int)
gun_violence_df.info()

In [None]:
# Convert district variables to object, writing the results to new "_obj" columns
district_variables = ['congressional_district', 'state_house_district', 'state_senate_district']
district_obj_columns = [column + '_obj' for column in district_variables]
gun_violence_df[district_obj_columns] = gun_violence_df[district_variables].astype(object)
gun_violence_df.info()

In [None]:
# Drop original n_guns_involved column and district columns
gun_violence_df = gun_violence_df.drop(columns = ['congressional_district', 'state_house_district', 'state_senate_district', 'n_guns_involved'])
gun_violence_df.info()

## Data Transformation: Text Parsing

Here we will parse the text columns whose values are joined together by pipes, and make new columns containing lists of the separated parts.

So in essence, we will be making lists of lists and adding them to the dataframe as columns.

We will also look at the other text columns and see which methods, if any, are appropriate for transforming them.
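
As a rough sketch of the "lists of lists" idea, a cell in the pipe-and-colon format could be parsed as shown below. The helper name `parse_pipe_field` and the example string are hypothetical; the string only follows the pattern described earlier and is not copied from the dataset. The handling of parts that have no `::` is also an assumption.

```python
def parse_pipe_field(raw):
    """Parse 'key::value||key::value' text into a list of [key, value] pairs."""
    # Missing cells (NaN is a float) and blank strings give an empty list
    if not isinstance(raw, str) or not raw.strip():
        return []
    pairs = []
    for part in raw.split('||'):
        key, separator, value = part.partition('::')
        if separator:
            pairs.append([key, value])
        else:
            # Assumption: a part without '::' is kept as a value with no key
            pairs.append([None, part])
    return pairs

# Made-up example following the pipe-and-colon pattern
example = '0::Unknown||1::Stolen'
print(parse_pipe_field(example))
```

Such a function could be applied to a whole column with, for example, `subset['gun_stolen'].apply(parse_pipe_field)`.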

In [None]:
# Import Natural Language Toolkit (nltk) package for text parsing and tokenizing
import nltk

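# Subset data to test out our changes
# (copy so that new columns are added to the subset and not to a view of the full dataframe)
subset = gun_violence_df.head(50).copy()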

# Columns to be split and parsed
parse_columns = ['gun_stolen', 'gun_type', 'incident_characteristics', 'participant_age', 'participant_age_group', 'participant_gender', 'participant_name', 'participant_status', 'participant_type']

# Make "incident_id" column the index (set_index returns a new dataframe, so reassign it)
subset = subset.set_index('incident_id')

In [None]:
# To turn blank strings into NaN: apply(lambda x: np.nan if isinstance(x, str) and not x.strip() else x)

# Replace the '||' separators with commas (regex=False treats the pipes literally)
subset['gun_stolen_parsed'] = subset['gun_stolen'].str.replace('||', ', ', regex=False)

# Replace the '::' mapping markers with a colon and a space
subset['gun_stolen_parsed'] = subset['gun_stolen_parsed'].str.replace('::', ': ', regex=False)

# Print only the non-missing (string) entries
col = subset['gun_stolen_parsed']
for index, item in col.items():
    if isinstance(item, str):
        print(item)

#subset['gun_stolen_parsed_2'] = [word for sublist in subset['gun_stolen_parsed'] for word in sublist]
#subset['gun_stolen_parsed_2']
#subset['gun_stolen_parsed'] = subset['gun_stolen_parsed'].replace(',', '')
#subset['gun_stolen_parsed_2'] = subset['gun_stolen_parsed'].values.tolist().str.split(':')
#subset['gun_stolen_parsed_2']
#gun_violence_df_subset['gun_stolen_parsed'] = gun_violence_df_subset['gun_stolen'].replace(r'^\s+$', np.nan, regex = True).fillna('Missing Value').str.split('||')
#gun_violence_df_subset

In [None]:
# Describe Data
gun_violence_df.describe()