# <center style="color:darkblue">Tutorial of the Data Science Pipeline</center>

<center> CMSC320 Sabrina Field </center>



### The Data Science Pipeline
 
In this tutorial, we will walk through the steps of the data science pipeline by asking one question: which groups of people are most affected by shootings? The pipeline in this tutorial consists of the following steps:

##### - Data collection/curation + parsing (if necessary)
##### - Data management/representation
##### - Exploratory data analysis
##### - Hypothesis testing and machine learning
##### - The communication of insights attained





Specifically, we will be analyzing a dataset recording shooting incident data in New York from 2016 to 2020.

![alt text](./nycimage.jpeg "NYC")

## Which groups of people are most affected by shootings?

#### Introduction

Firearm-related violence has become a prominent topic of conversation in the United States, especially in debates about gun control. However, one important aspect receives less attention: which groups of people tend to be the victims? For example, the prominence of school shootings in public discussion may leave students feeling more uneasy than people who are not students. Are certain groups of people more affected than others? Answering this would provide valuable insight into where solution efforts need to be focused. As government and nonprofit organizations work to address these issues, it is crucial to learn where resources for education, mental health support, and protection should be concentrated. Any identifiable trend, for example by location or age, could also help show what is actually happening.





The data comes from https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8 and was found through https://catalog.data.gov/

#### Data collection/curation + parsing (if necessary)

In [15]:
import pandas as pd 

In [16]:
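# Load the CSV file downloaded from NYC Open Data into a DataFrame
data = pd.read_csv("NYPD_Shooting_Incident_Data__Historic_.csv")
data  # display the DataFrame (pandas truncates the middle rows)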

Unnamed: 0,INCIDENT_KEY,OCCUR_DATE,OCCUR_TIME,BORO,PRECINCT,JURISDICTION_CODE,LOCATION_DESC,STATISTICAL_MURDER_FLAG,PERP_AGE_GROUP,PERP_SEX,PERP_RACE,VIC_AGE_GROUP,VIC_SEX,VIC_RACE,X_COORD_CD,Y_COORD_CD,Latitude,Longitude,Lon_Lat
0,201575314,08/23/2019,22:10:00,QUEENS,103,0.0,,False,,,,25-44,M,BLACK,1037451,193561,40.697805,-73.808141,POINT (-73.80814071699996 40.697805308000056)
1,205748546,11/27/2019,15:54:00,BRONX,40,0.0,,False,<18,M,BLACK,25-44,F,BLACK,1006789,237559,40.818700,-73.918571,POINT (-73.91857061799993 40.81869973000005)
2,193118596,02/02/2019,19:40:00,MANHATTAN,23,0.0,,False,18-24,M,WHITE HISPANIC,18-24,M,BLACK HISPANIC,999347,227795,40.791916,-73.945480,POINT (-73.94547965999999 40.791916091000076)
3,204192600,10/24/2019,00:52:00,STATEN ISLAND,121,0.0,PVT HOUSE,True,25-44,M,BLACK,25-44,F,BLACK,938149,171781,40.638064,-74.166108,POINT (-74.16610830199996 40.63806398200006)
4,201483468,08/22/2019,18:03:00,BRONX,46,0.0,,False,25-44,M,BLACK HISPANIC,18-24,M,BLACK,1008224,250621,40.854547,-73.913339,POINT (-73.91333944399999 40.85454734900003)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23563,216936948,08/21/2020,02:10:00,BRONX,48,0.0,MULTI DWELL - APT BUILD,True,,,,45-64,M,BLACK HISPANIC,1007277,187698,40.681843,-73.916978,POINT (-73.91697825799997 40.681842679000056)
23564,214926175,07/03/2020,23:49:00,QUEENS,102,0.0,HOTEL/MOTEL,False,<18,M,BLACK,<18,M,BLACK,1005993,241333,40.829060,-73.921434,POINT (-73.92143424399995 40.82906028200006)
23565,220870730,11/21/2020,08:05:00,BROOKLYN,60,0.0,,True,,,,45-64,M,WHITE,1046405,187113,40.680049,-73.775909,POINT (-73.77590919399995 40.680048726000045)
23566,208187330,01/18/2020,01:00:00,BRONX,42,2.0,MULTI DWELL - PUBLIC HOUS,False,,,,45-64,M,BLACK,1011373,182202,40.666746,-73.902232,POINT (-73.90223237399994 40.66674580000005)


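Above, pandas loads the downloaded CSV file into a DataFrame. The same dataset can also be requested directly from the NYC Open Data API, as in the cells below.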

In [17]:
import requests as rq
from bs4 import BeautifulSoup

"You often want to send some sort of data in the URL’s query string. If you were constructing the URL by hand, this data would be given as key/value pairs in the URL after a question mark, e.g. httpbin.org/get?key=val. Requests allows you to provide these arguments as a dictionary of strings." https://docs.python-requests.org/en/master/user/quickstart/

In [18]:
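from io import StringIO

url = "https://data.cityofnewyork.us/resource/833y-fsy8.json"
response = rq.get(url)
response.raise_for_status()  # stop early if the request failed
# Wrap the JSON text in StringIO: recent pandas versions deprecate passing a raw JSON string
df = pd.read_json(StringIO(response.text))
df.head()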

Unnamed: 0,incident_key,occur_date,occur_time,boro,precinct,jurisdiction_code,statistical_murder_flag,vic_age_group,vic_sex,vic_race,...,geocoded_column,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,perp_age_group,perp_sex,perp_race,location_desc
0,201575314,2019-08-23T00:00:00.000,2021-07-21 22:10:00,QUEENS,103,0.0,False,25-44,M,BLACK,...,"{'type': 'Point', 'coordinates': [-73.80814071...",24670.0,41,3,6,61,,,,
1,205748546,2019-11-27T00:00:00.000,2021-07-21 15:54:00,BRONX,40,0.0,False,25-44,F,BLACK,...,"{'type': 'Point', 'coordinates': [-73.91857061...",10929.0,49,5,43,23,<18,M,BLACK,
2,193118596,2019-02-02T00:00:00.000,2021-07-21 19:40:00,MANHATTAN,23,0.0,False,18-24,M,BLACK HISPANIC,...,"{'type': 'Point', 'coordinates': [-73.94547965...",12426.0,7,4,35,14,18-24,M,WHITE HISPANIC,
3,204192600,2019-10-24T00:00:00.000,2021-07-21 00:52:00,STATEN ISLAND,121,0.0,True,25-44,F,BLACK,...,"{'type': 'Point', 'coordinates': [-74.16610830...",10371.0,4,1,13,75,25-44,M,BLACK,PVT HOUSE
4,201483468,2019-08-22T00:00:00.000,2021-07-21 18:03:00,BRONX,46,0.0,False,18-24,M,BLACK,...,"{'type': 'Point', 'coordinates': [-73.91333944...",10931.0,6,5,29,29,25-44,M,BLACK HISPANIC,
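
The Requests documentation quoted above describes passing query-string arguments as a dictionary. As a minimal sketch of what that produces for this endpoint, the snippet below builds the query string with the standard library and does not send a request. The `$limit` parameter is an assumption based on the Socrata API that serves NYC Open Data, and `build_query_url` is a hypothetical helper name, not part of Requests or pandas.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofnewyork.us/resource/833y-fsy8.json"

def build_query_url(base_url, params):
    """Append key/value pairs to base_url as a URL query string."""
    return base_url + "?" + urlencode(params)

# "$limit" is an assumed Socrata query parameter; "boro" is a column in the dataset
params = {"$limit": 5000, "boro": "BRONX"}
query_url = build_query_url(BASE_URL, params)
print(query_url)

# With Requests, the same dictionary would be passed directly:
# response = rq.get(BASE_URL, params=params)
```

Passing the dictionary through `params=` lets Requests handle the encoding, so special characters such as `$` do not have to be escaped by hand.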


#### Data management/representation

In [43]:
# Drop the incident key and the API's computed-region columns, which this analysis does not use
df = df.drop(columns=['incident_key', ':@computed_region_f5dn_yrer', ':@computed_region_sbqj_enih', ':@computed_region_92fq_4b7q', ':@computed_region_yeji_bk3q', ':@computed_region_efsh_h5xi'])
df

Unnamed: 0,occur_date,occur_time,boro,precinct,jurisdiction_code,statistical_murder_flag,vic_age_group,vic_sex,vic_race,x_coord_cd,y_coord_cd,latitude,longitude,geocoded_column,perp_age_group,perp_sex,perp_race,location_desc
0,2019-08-23T00:00:00.000,2021-07-21 22:10:00,QUEENS,103,0.0,False,25-44,M,BLACK,1037451,193561,40.697805,-73.808141,"{'type': 'Point', 'coordinates': [-73.80814071...",,,,
1,2019-11-27T00:00:00.000,2021-07-21 15:54:00,BRONX,40,0.0,False,25-44,F,BLACK,1006789,237559,40.818700,-73.918571,"{'type': 'Point', 'coordinates': [-73.91857061...",<18,M,BLACK,
2,2019-02-02T00:00:00.000,2021-07-21 19:40:00,MANHATTAN,23,0.0,False,18-24,M,BLACK HISPANIC,999347,227795,40.791916,-73.945480,"{'type': 'Point', 'coordinates': [-73.94547965...",18-24,M,WHITE HISPANIC,
3,2019-10-24T00:00:00.000,2021-07-21 00:52:00,STATEN ISLAND,121,0.0,True,25-44,F,BLACK,938149,171781,40.638064,-74.166108,"{'type': 'Point', 'coordinates': [-74.16610830...",25-44,M,BLACK,PVT HOUSE
4,2019-08-22T00:00:00.000,2021-07-21 18:03:00,BRONX,46,0.0,False,18-24,M,BLACK,1008224,250621,40.854547,-73.913339,"{'type': 'Point', 'coordinates': [-73.91333944...",25-44,M,BLACK HISPANIC,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2009-07-03T00:00:00.000,2021-07-21 04:50:00,STATEN ISLAND,120,0.0,False,25-44,M,BLACK,960348,171964,40.638654,-74.086124,"{'type': 'Point', 'coordinates': [-74.08612364...",UNKNOWN,M,BLACK,
996,2016-11-14T00:00:00.000,2021-07-21 01:12:00,MANHATTAN,30,0.0,False,25-44,M,WHITE HISPANIC,998806,239586,40.824280,-73.947408,"{'type': 'Point', 'coordinates': [-73.94740787...",25-44,M,WHITE HISPANIC,
997,2009-09-25T00:00:00.000,2021-07-21 23:00:00,BROOKLYN,63,0.0,False,18-24,M,WHITE HISPANIC,1003924,168973,40.630455,-73.929122,"{'type': 'Point', 'coordinates': [-73.92912202...",18-24,M,BLACK,
998,2015-01-05T00:00:00.000,2021-07-21 16:58:00,BROOKLYN,83,0.0,False,25-44,M,WHITE,1008331,189878,40.687823,-73.913170,"{'type': 'Point', 'coordinates': [-73.91317030...",18-24,M,BLACK,
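
In the output above, `occur_date` is still a text timestamp, `occur_time` carries a placeholder date (2021-07-21) next to the real time of day, and several perpetrator fields are empty. A minimal sketch of one way to tidy these columns follows. It runs on a small hand-built sample that mimics the first rows shown above, and the helper name `tidy_columns`, the new `occur_hour` column, and the choice to fill missing values with "UNKNOWN" are assumptions made for illustration.

```python
import pandas as pd

# Small sample mimicking the columns returned by the API (not the full dataset)
sample = pd.DataFrame({
    "occur_date": ["2019-08-23T00:00:00.000", "2019-11-27T00:00:00.000"],
    "occur_time": ["2021-07-21 22:10:00", "2021-07-21 15:54:00"],
    "perp_age_group": [None, "<18"],
})

def tidy_columns(frame):
    """Return a copy with parsed dates, an hour-of-day column, and no missing age groups."""
    out = frame.copy()
    # Parse the text timestamps into real datetime values
    out["occur_date"] = pd.to_datetime(out["occur_date"])
    # Keep only the hour of day; the date part of occur_time is a placeholder
    out["occur_hour"] = pd.to_datetime(out["occur_time"]).dt.hour
    # Assumption: treat a missing perpetrator age group as "UNKNOWN"
    out["perp_age_group"] = out["perp_age_group"].fillna("UNKNOWN")
    return out

tidy = tidy_columns(sample)
print(tidy)
```

In the notebook, the same function could be applied to `df` in place of `sample`.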


#### Exploratory data analysis

Exploratory data analysis means exploring and summarizing the data to learn what it may be telling us before any formal testing.
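
For the question in this tutorial, one natural first summary is a count of victims per group. The sketch below uses only the first five rows shown earlier, so the numbers illustrate the technique and say nothing about the full dataset. In the notebook, `sample` would be replaced by `df`.

```python
import pandas as pd

# First five rows of the victim columns shown earlier (illustration only)
sample = pd.DataFrame({
    "boro": ["QUEENS", "BRONX", "MANHATTAN", "STATEN ISLAND", "BRONX"],
    "vic_age_group": ["25-44", "25-44", "18-24", "25-44", "18-24"],
    "vic_sex": ["M", "F", "M", "F", "M"],
})

# Number of victims in each age group
age_counts = sample["vic_age_group"].value_counts()

# Same counts expressed as a share of all rows
age_share = sample["vic_age_group"].value_counts(normalize=True)

# Victims broken down by borough and sex
by_boro_sex = sample.groupby(["boro", "vic_sex"]).size()

print(age_counts)
print(age_share)
print(by_boro_sex)
```

Shares are often easier to compare across groups than raw counts, which is why `normalize=True` is shown alongside the plain counts.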

Data visualization (the graphical representation of data) is a powerful tool for understanding what is happening. Visual elements not only make the data more appealing to look at, they also reveal patterns that are hard to see otherwise: a large, unsorted dataset placed on a plot can make its patterns far more recognizable. Which representation portrays the data most accurately differs from dataset to dataset, but a good plot will often give a much clearer view of the data.

Let's start by setting up a map with Folium.

In [44]:
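!pip install folium  # Folium is a Python library used for visualizing geospatial data
import folium

# Center the map on New York City (about 40.71 N, 74.01 W) rather than [0, 0]
map_osm = folium.Map(location=[40.7128, -74.0060], zoom_start=10)
map_osm  # display the map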



#### Hypothesis testing and machine learning
Hypothesis testing evaluates a hypothesis by comparing it against the null hypothesis. What exactly is a null hypothesis? It is the default assumption that there is no relationship or difference between the quantities being compared, so that any pattern in the sample is due to chance. We reject the null hypothesis only when the observed data would be very unlikely if it were true.
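
As a minimal sketch of such a test, the snippet below runs a chi-square goodness-of-fit test on two categories using only the standard library. The counts are hypothetical and are not taken from the dataset, the 50/50 null hypothesis is an assumption chosen purely for illustration, and `chi_square_two_groups` is a hypothetical helper name.

```python
import math

def chi_square_two_groups(observed, expected):
    """Return the chi-square statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts for illustration only
observed = [90, 10]  # e.g. male and female victims in a sample of 100
expected = [50, 50]  # null hypothesis: both groups are equally likely

stat, p_value = chi_square_two_groups(observed, expected)
reject_null = p_value < 0.05  # conventional 5% significance level

print(f"chi-square = {stat:.1f}, p-value = {p_value:.2e}, reject null: {reject_null}")
```

In a real analysis the expected counts would more plausibly come from each group's share of the city's population than from an even split.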

#### Communication of insights attained
This step is arguably the most critical component of the data science pipeline. Even if every other step is executed correctly, the work is of little use if what the data shows is not properly explained. To carry out this step well, we need to communicate what we did and why. There is little point in doing data science if it does not lead to change or to new knowledge.

So what is the data communicating?

