# Project Design Writeup and Approval Template: Mapping Police Violence (Sam)

### Project Problem and Hypothesis
* **What's the project about? What problem are you solving?**

Using data on people killed by police from 2013-2016, can we understand what variables make a person more or less likely to be fatally injured by a police officer?

The dataset includes only people who died from police action, so we cannot predict survival (in every case, survival = false). However, we can use the dataset to predict the victim's race, conditional on dying, or to predict whether the victim was armed or unarmed, conditional on dying.

* **Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?**

Most machine learning applications for this data would predict either a binary variable (armed or unarmed) or a categorical variable (race 1, race 2, race 3, etc.). Both are classification problems.

I will try the following classification models to determine which variables are best able to predict the outcomes (armed/unarmed or race category):

* K-nearest neighbors (lesson 9)
* Logistic regression (lesson 10)
* Decision trees (lesson 11) - will show us feature importance
* Random forest, bagging (lesson 12)
* Clustering (lesson 14) - to see if there are trends/groupings in the data without necessarily predicting an outcome
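
A minimal sketch of how the supervised models in this list might be compared with 5-fold cross-validation in scikit-learn. The data here is a synthetic stand-in generated by `make_classification`, not the MPV dataset, and the hyperparameters are placeholder assumptions. Clustering is left out because it is unsupervised and is not scored against a target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features (X) and a binary
# armed/unarmed target (y); the real MPV data would replace this.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hyperparameters are illustrative assumptions, not tuned values.
models = {
    # KNN and logistic regression are scale-sensitive, so they get a scaler.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```

Wrapping the scaler in a pipeline keeps the scaling step inside each cross-validation fold, so the held-out fold does not leak into the fit.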

I will score the models using accuracy, precision/recall, F1, or ROC AUC, depending on which is most appropriate for the data.
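
As a small illustration of how these scores differ, the sketch below computes each one with `sklearn.metrics` on eight made-up labels and predicted probabilities. The 0.5 decision threshold and the values themselves are assumptions chosen so the counts are easy to check by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels (1 = positive class) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

# Hard predictions at an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC AUC uses the probabilities, not the thresholded predictions.
    "roc_auc": roc_auc_score(y_true, y_prob),
}

for name, value in metrics.items():
    print(f"{name:10s} {value:.3f}")
```

With 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, accuracy, precision, recall, and F1 all come out to 0.75, while ROC AUC is 0.875 because it ranks the probabilities rather than the thresholded labels. If the armed/unarmed classes turn out to be imbalanced, accuracy alone could be misleading and the other scores would likely matter more.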

* **What kind of impact do you think it could have?**

Analysis of this data could provide evidence for or against overt or subtle racial bias by police officers. In addition, the analysis could help US police departments improve transparency and accountability, and could help activists and police departments work together to end police violence.

* **What do you think will have the most impact in predicting the value you are interested in solving for?**

I think location of police violence will be most impactful in predicting the race of the victim. 

SW: Sam, this is very good, and would make for a fine project. 

* You've chosen a good dataset, which is half the battle
* You have a good grasp of how you could apply the prediction techniques we've learned in class to test the accuracy of any model you would use.

I wonder if this is a problem you really care about answering. It seems like there's not a strong predictive component here - whereas you could try to "predict the person's race", for example, from other features in the dataset, since those features obviously aren't _causing_ a change in the person's race, that analysis would end up being more exploratory than anything. Of course, if you had survival rates here, then this would be a perfect case for a predictive model, but unfortunately we don't.

You've been one of the most engaged students in the class, and your homework assignments have all been very good - do you feel like you could tackle one of the more challenging, but more "predictive" ideas you originally had back when you presented your original ideas in your "lightning talks" - such as predicting hospital readmission rates?

### Datasets
* **Description of data set available, at the field level**

The dataset is pulled from MappingPoliceViolence.org, which is a research collaborative that collects and analyzes datasets on nationwide police killings. 

Unfortunately, there is no good official government data on police shooting incidents or the demographics of the victims. In response, many organizations have begun to crowdsource data on police shootings from publicly available information. The full dataset that I will use from MappingPoliceViolence (MPV) is a compilation of three of the largest, most comprehensive, and most impartial crowdsourced databases on police killings in the country: FatalEncounters.org, the U.S. Police Shootings Database, and KilledbyPolice.net. MPV has improved the dataset with their own primary and secondary research.

MPV believes this dataset is the most comprehensive source of people killed by police for 2013-2016. Since it is user-generated, it is certainly not 100% comprehensive; however, MPV believes that their database captures 90-98% of all police killings since 2014.

The variables in the MPV dataset include:

Variable | Description | Type of Variable
---| ---| ---
Victim's name	|First and last| Categorical
Victim's age	|Age in years| Continuous
Victim's gender	|Male or female| Categorical
Victim's race	|Black, White, Hispanic, Asian, or unknown| Categorical
*URL of image of victim	|If available | Categorical
Date of injury resulting in death	|(month/day/year)| Continuous
Location of injury (address)	|Street address  | Categorical
Location of death (city)	|City | Categorical
Location of death (state)	|State (abbreviated) | Categorical
Location of death (zip code)	|Zip code | Categorical
Location of death (county)	|County associated with location| Categorical
Agency responsible for death	|Associated police department| Categorical
Cause of death	|Gunshot, vehicle, death in custody, physical restraint | Categorical
*A brief description of the circumstances surrounding the death	| Text | Categorical
Official disposition of death (justified or other)	|Justified, pending investigation, type of conviction | Categorical
Criminal Charges?	| Charged with crime (yes) or no known charges (no) | Categorical
*Link to news article or photo of official document	| URL | Categorical
Symptoms of mental illness?	| Yes, no, unknown | Categorical
Unarmed | Unarmed, allegedly armed| Categorical

\* = variables that will likely be dropped from analysis.

In addition to the MPV-provided data, I will merge in other variables to supplement the data. These will likely include:

* Level of violent crime by city
* Urban/rural/suburban categorization of location of death
* Demographic (race) breakdown of city
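
A minimal sketch of how city-level variables might be merged in with pandas. All column names, city labels, and numbers below are hypothetical placeholders; the real MPV export and the supplemental sources will have their own schemas.

```python
import pandas as pd

# Hypothetical incident-level rows (one row per person killed).
killings = pd.DataFrame({
    "city": ["City A", "City A", "City B", "City C"],
    "state": ["IL", "IL", "TX", "KS"],
    "unarmed": [1, 0, 0, 1],
})

# Hypothetical city-level supplement (one row per city); values are made up.
city_stats = pd.DataFrame({
    "city": ["City A", "City B"],
    "state": ["IL", "TX"],
    "violent_crime_per_100k": [900.0, 1000.0],
    "urbanicity": ["urban", "suburban"],
})

# Left join keeps every incident. validate= raises an error if a city appears
# twice in city_stats; indicator= adds a _merge column showing match status.
merged = killings.merge(
    city_stats,
    on=["city", "state"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Incidents whose city had no match in the supplemental table.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} incident(s) without city-level data")
```

Joining on both city and state avoids matching same-named cities in different states, and checking the unmatched rows shows how much of the data would be lost or left with missing values after the merge.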

SW: Excellent ideas for additional variables to merge in! Maybe you could predict which _cities_ have more police shootings based on demographics? (but you might not have enough data points for that)

### Domain knowledge
* **What experience do you already have around this area?**

I've read a lot about the databases that exist related to Black Lives Matter, as well as the analysis that has already been done. 


* **Does it relate or help inform the project in any way?**

Yes - I believe I'm knowledgeable about the shortcomings in the data and the potential biases that exist in the data collection methods. I also have an idea of how analysis could be useful to organizations and individuals.

* **What other research efforts exist?**


Research effort | Description | Outcome/Comments
---|---|---
[Citizen's Police Data Project/Invisible Institute](http://invisible.institute/police-data/) | Database (and analysis) of 56k misconduct complaints against Chicago police officers | Created model to predict "bad apple" cops, found that many complaints were not addressed
[US Police Shooting Database (USPSD)](https://docs.google.com/spreadsheets/d/1cEGQ3eAFKpFBVq1k2mZIy5mBPxC6nBTJHzuSWtZQSVw/edit#gid=1842418396) | Started by Kyle Wagner, user-generated Google Docs spreadsheet to track police killings | Input source to MPV dataset
[Fatal Encounters Database](http://www.fatalencounters.org/) | Collects data through paid researchers, public records requests, crowdsourced data | Input source to MPV dataset
[Killed by Police Database](http://killedbypolice.net/) | started May 1, 2013 | Input source to MPV dataset
[Cody Ross PLOS study](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141854) | Bayesian analysis of racial bias in police shooting, 2011-2014, using data from USPSD |  Only uses USPSD data (not entire MPV dataset), will use some of this study's methodology in my final project
[Roland Fryer NBER study](http://www.nber.org/papers/w22399) | Explores racial differences in police use of force | Study found no racial bias in lethal force, but many criticize the methodology
[Gun Violence Archive](http://www.gunviolencearchive.org/) | Non-profit formed in 2013 to provide accurate info about gun-related violence in the US | Not focused specifically on police shootings, but does list officer-involved shootings
Five Thirty Eight Blog [Why Are So Many Black Americans Killed By Police?](http://fivethirtyeight.com/features/why-are-so-many-black-americans-killed-by-police/) | Quotes many researchers, discusses Fryer's study | Good summary, but no original data analysis in this article
Five Thirty Eight Blog [The Police Are Killing People As Often As They Were Before Ferguson](http://fivethirtyeight.com/features/the-police-are-killing-people-as-often-as-they-were-before-ferguson/) | Compares different data sources | ---
[Goff study - The Science of Justice](http://policingequity.org/wp-content/uploads/2016/07/CPE_SoJ_Race-Arrests-UoF_2016-07-08-1130.pdf) | Uses data from National Justice Database (via Center for Policing Equity) to look for racial disparities in 12 law enforcement departments |  Racial disparities exist even when controlling for racial distribution of local arrest rates, and disparities are robust across multiple types of weapons/force 


SW: Nice - do any of these data sources have survival data?

Also, since you've read so much on this, forget what I said about switching to working with hospital re-admission rates.

### Project Concerns
* **What questions do you have about your project? What are you not sure you quite yet understand? (The more honest you are about this, the easier your instructors can help).**

I'm not really sure if I've chosen the right outcome variable. 

* **What are the assumptions and caveats to the problem?**
    * **What data do you not have access to but wish you had?**
    
    I wish I had a dataset that included people who were shot by police but did NOT die. Then I could predict survival status. Sadly, this data does not exist. 
    
    * **What is already implied about the observations in your data set? For example, if your primary data set is twitter data, it may not be representative of the whole sample (say, predicting who would win an election)**
    
    One big assumption is that my data is representative of all police killings in the US over the past ~4 years. Since it is user-generated, it may be missing some killings that did not get press coverage. Additionally, press coverage could be skewed toward certain demographics.
    
* **What are the risks to the project?**
    * **What's the cost of your model being wrong? (What's the benefit of your model being right?)**
    
    If my model is wrong (i.e., it shows that there is no racial bias among police, but there actually IS racial bias), then it could encourage law enforcement institutions to continue the status quo/perpetuate racist stereotypes and racially-charged behavior.
    
    * **Is any of the data incorrect? Could it be incorrect?**

    Yes, some of the data could be incorrect. There are duplicates in the data that will need to be dropped.
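
One way the duplicates might be dropped with pandas, sketched on made-up rows. The column names and the choice of name, date, and state as the matching key are assumptions; the real data may need a looser or stricter rule.

```python
import pandas as pd

# Made-up rows with hypothetical column names; rows 0 and 1 are the same
# record entered with different whitespace and capitalization.
records = pd.DataFrame({
    "name": ["Jane Doe", " jane doe ", "John Roe", "John Roe"],
    "date": ["01/05/2015", "01/05/2015", "03/10/2014", "07/22/2016"],
    "state": ["CA", "CA", "NY", "NY"],
})

# Parse month/day/year strings into real dates.
records["date"] = pd.to_datetime(records["date"], format="%m/%d/%Y")

# Normalize names so trivial formatting differences do not hide duplicates.
records["name_key"] = records["name"].str.strip().str.lower()

# Assumed rule: same normalized name, date, and state means a duplicate.
key_cols = ["name_key", "date", "state"]
dup_mask = records.duplicated(subset=key_cols, keep="first")

deduped = (
    records.loc[~dup_mask]
    .drop(columns="name_key")
    .reset_index(drop=True)
)
print(f"Dropped {dup_mask.sum()} duplicate row(s); {len(deduped)} remain")
```

The two "John Roe" rows are kept because their dates differ, which is the reason for matching on more than the name alone.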

### Outcomes
* **What do you expect the output to look like?**

I expect the output to be a list of important features, ranked by influence/importance. I also expect to develop a model (perhaps ensembles of decision trees) that will be able to predict race given other variables or predict armed/unarmed status given other variables.
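
A rough sketch of what a ranked feature list from a random forest might look like. The data is synthetic and the feature names (`signal_strong`, `signal_weak`, `noise`) are invented for illustration; they do not correspond to MPV variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic features: one strongly predictive, one weakly predictive,
# and one unrelated to the target.
X = pd.DataFrame({
    "signal_strong": rng.normal(size=n_rows),
    "signal_weak": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})

# Binary target driven mostly by signal_strong, plus random noise.
score = (
    2.0 * X["signal_strong"]
    + 0.5 * X["signal_weak"]
    + rng.normal(scale=0.5, size=n_rows)
)
y = (score > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(ranked)
```

Impurity-based importances describe how much the model relied on each feature, not whether the feature causes the outcome, so the ranking would need to be read alongside the model's accuracy.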

* **What does your target audience expect the output to look like?**

My target audience expects revealing statistics/charts to show (or to disprove) that racial bias exists within police departments.

* **How complicated does your model have to be?**

My model will not be complicated, since I'm not using time-series data. I may need to normalize some of my variables or log-transform them.
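
A minimal sketch of a log transform followed by standardization, assuming a hypothetical right-skewed variable such as city population. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up city-level values; population is heavily right-skewed.
city_features = pd.DataFrame({
    "population": [5_000, 40_000, 250_000, 2_700_000],
    "median_age": [31.0, 35.5, 38.2, 34.1],
})

# log1p computes log(1 + x), which compresses the long right tail
# and is safe for zero values.
city_features["log_population"] = np.log1p(city_features["population"])

# Standardize to zero mean and unit variance.
cols = ["log_population", "median_age"]
scaler = StandardScaler()
scaled_df = pd.DataFrame(
    scaler.fit_transform(city_features[cols]),
    columns=cols,
    index=city_features.index,
)
print(scaled_df.round(3))
```

In a real train/test workflow the scaler would be fit on the training rows only and then applied to the test rows, so that test data does not influence the scaling.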

* **How successful does your project have to be in order to be considered a "success"?**

It will be a success if I can produce a statistically significant model/coefficient.

* **What will you do if the project is a bust (this happens! but it shouldn't here)?**

If the project is a bust, I will look for another dataset or additional variables that may provide more predictive power.

SW: I would think of success more in terms of accuracy in addition to particular coefficients. Maybe think: "If I had an accurate model, and in that model (if it was a Random Forest, for example) the victim's race was an 'important feature' in predicting my outcome, then I would conclude that police are biased toward certain races."