<a href="https://colab.research.google.com/github/yassmin1/Analytics_Projects/blob/main/Gun_Violence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Gun Violence USA 2013-2022**


## **Gun violence and crime incidents**
Gun violence and crime incidents are collected and validated from 7,500 sources daily. Incident reports and their source data are available on the gunviolencearchive.org website.

1.   Number of source verified deaths and injuries
2.   Number of INCIDENTS reported and verified
3. Calculation based on CDC Suicide Data
4. Actual total of all non-suicide deaths plus daily calculated suicide deaths   

For more information, visit <https://www.gunviolencearchive.org/>.

## **Exploratory Data Analysis**









In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
sns.set_style('white')
sns.set_context('paper',font_scale=1.5)
sns.set_palette('bright')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 20

In [None]:
file1=r'/content/drive/MyDrive/Colab Notebooks/Py_PB_Project/archive_GunViolance/all_incidents.csv'
file2=r'/content/drive/MyDrive/Colab Notebooks/Py_PB_Project/archive_GunViolance/mass_shootings.csv'


In [None]:
all_incidents=pd.read_csv(file1)

In [None]:
all_incidents.shape

In [None]:
print(f"date ranges between {all_incidents['date'].min()} and { all_incidents['date'].max()}")


In [None]:
all_incidents.info()

**OBS:** *The `address` feature has NaN values; rows with a missing address can be dropped.*

In [None]:
all_incidents.describe().T

In [None]:
all_incidents.isnull().sum()

In [None]:
all_incidents.drop_duplicates(subset='incident_id',inplace=True)
all_incidents.dropna(subset=['address'],inplace=True)

In [None]:
all_incidents.shape

In [None]:
all_incidents.info(memory_usage='deep')

In [None]:
# percentage of incidents per state
state_counts=all_incidents.groupby('state')['incident_id'].count()
pers=100*state_counts/state_counts.sum()

In [None]:
pers.sort_values(ascending=False).nlargest(10)

In [None]:
g=sns.countplot(data=all_incidents,x='state',order=all_incidents.groupby('state')['incident_id'].size().sort_values(ascending=False).nlargest(10).index)
total = len(all_incidents)
for p in g.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    g.annotate(percentage, (x, y), ha='center', va='bottom')
plt.xticks(rotation=20) ;

**OBS: Illinois, California, and Texas are the states with the highest percentage of incidents.**
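
The percentages annotated on the bars are each state's incident count divided by the total number of incidents. A minimal sketch of the same calculation on a small made-up frame (the rows in `toy` are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy data, made up for illustration; the real frame is all_incidents.
toy = pd.DataFrame({
    "incident_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "state": ["Illinois", "Illinois", "Illinois", "Texas",
              "Texas", "California", "California", "Ohio"],
})

# Share of incidents per state in percent, largest first.
state_share = toy["state"].value_counts(normalize=True).mul(100)
print(state_share.round(1))
```

Because the denominator is the whole frame, the shares of all states add up to 100%, while a top-10 plot shows only part of that total.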

In [None]:
g=sns.countplot(data=all_incidents,x='city',order=all_incidents.groupby('city')['incident_id'].size().sort_values(ascending=False).nlargest(10).index)
total = len(all_incidents)
for p in g.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    g.annotate(percentage, (x, y), ha='center', va='bottom')
plt.xticks(rotation=30);


In [None]:
# add the number of killed and the number of injured
all_incidents['killed+injures']=all_incidents['n_killed']+all_incidents['n_injured']

**OBS: Chicago, Philadelphia, and Baltimore are the cities with the highest percentage of incidents.**

In [None]:
vec=all_incidents['killed+injures'].value_counts(normalize=True,bins=[-1,0,1,2,3,80]).mul(100).sort_index().head(5).plot.bar(xlabel='Victims',ylabel='Percentage',color='red')
for p in vec.patches:
    prec='{:0.1f}%'.format(p.get_height())
    x=p.get_x()+p.get_width()/2
    y=p.get_height()
    vec.annotate(prec,(x,y),ha='center',va='bottom')
plt.xticks([0, 1, 2, 3, 4], labels=[0, 1, 2, 3, '4+'],rotation=30)


**About 31% of incidents are shots fired without victims, while about 68% are shootings with victims.**
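
The victim groups come from right-closed bins, so `(-1, 0]` holds incidents with zero victims and `(3, 80]` holds incidents with four or more. A small sketch of that binning using `pd.cut` with explicit labels (the `victims` values are made up for illustration):

```python
import pandas as pd

# Made-up victim counts per incident (illustrative only).
victims = pd.Series([0, 0, 1, 1, 1, 2, 3, 5, 12, 0])

# Right-closed bins: (-1, 0] holds 0 victims, (3, 80] holds 4 or more.
bins = [-1, 0, 1, 2, 3, 80]
labels = ["0", "1", "2", "3", "4+"]

share = (
    pd.cut(victims, bins=bins, labels=labels)
    .value_counts(normalize=True)
    .sort_index()
    .mul(100)
)
print(share)
```

Naming the bins up front keeps the tick labels tied to the data rather than to bar positions.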

## **Calculating the killed ratio and injuries ratio for incidents can provide valuable insight into their severity and impact**

These ratios divide the number of people killed and the number of people injured by the total number of victims (killed plus injured) in each incident. Normalizing this way makes incidents of different magnitudes comparable: each ratio is the proportion of victims who were killed or injured, regardless of how many people were affected.

The killed ratio and injuries ratio focus on the impact of incidents on human lives, making them important indicators for assessing the severity of events. This information is crucial for emergency response planning, risk assessment, and policy development.
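
As a formula, `killed_ratio = n_killed / (n_killed + n_injured)` and `injures_ratio = n_injured / (n_killed + n_injured)`. A minimal sketch on made-up rows shows one consequence worth keeping in mind: incidents with no victims give 0 / 0, which pandas stores as NaN.

```python
import pandas as pd

# Made-up incidents; the last row has no victims at all.
toy = pd.DataFrame({"n_killed": [1, 0, 2, 0], "n_injured": [1, 3, 0, 0]})
total = toy["n_killed"] + toy["n_injured"]

# 0 / 0 yields NaN in pandas, which marks incidents without victims.
toy["killed_ratio"] = toy["n_killed"] / total
toy["injures_ratio"] = toy["n_injured"] / total
print(toy)
```

Aggregations such as `mean` skip NaN by default, so incidents without victims do not pull the average ratios toward zero.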

In [None]:
all_incidents['killed_ratio']=all_incidents['n_killed']/all_incidents['killed+injures']
all_incidents['injures_ratio']=all_incidents['n_injured']/all_incidents['killed+injures']

**Not all incidents resulted in injuries or deaths, so `is_incident` = 0 means shots were fired without victims, while 1 means there were injuries or deaths.**  

In [None]:
all_incidents['is_incident']=np.where(all_incidents['killed+injures']== 0,0,1)
all_incidents['is_incident'].value_counts(normalize=True)

In [None]:
# rename the index so each bar gets the right label regardless of bar order
gg=all_incidents['is_incident'].value_counts(normalize=True).mul(100).rename(index={1:'With victims',0:'Gunshots only'}).plot.bar(color=['red','blue'])

for p in gg.patches:
    prec='{:0.1f}%'.format(p.get_height())
    x=p.get_x()+p.get_width()/2
    y=p.get_height()
    gg.annotate(prec,(x,y),ha='center',va='bottom')
plt.xticks(rotation=30)





In [None]:
# plot the share of each incident type in each state
all_incidents.groupby('state')['is_incident'].value_counts(normalize=True).unstack().sort_values(by=1,ascending=False).plot.bar(stacked=False)

**Illinois has the highest number of incidents and the highest share of incidents resulting in injuries or deaths, while New Hampshire shows a high share of gunfire incidents with no fatalities or injuries. In Texas, 75% of shootings resulted in fatalities or injuries.**

In [None]:
all_incidents.groupby('city')['is_incident'].value_counts().unstack(fill_value=0).sort_values(by=1,ascending=False).head(20).plot.bar(stacked=True)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

In [None]:
# select several columns with a list (double brackets); a bare tuple is rejected by recent pandas
all_incidents.query('is_incident==1').groupby('state')[['killed_ratio','injures_ratio']].agg('mean').sort_values(by='killed_ratio',ascending=False).plot.bar(stacked=True,color={'killed_ratio':'red','injures_ratio':'blue'})

**On average, about 60% of the victims in Wyoming shooting incidents were killed and 40% injured. Arizona, Montana, Hawaii, Idaho, and Nevada show about 50% killed.**
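
The state figures are means of per-incident ratios, so every incident counts equally whatever its size. A pooled ratio (total killed divided by total victims) can give a different picture. The sketch below contrasts the two on made-up rows; `victims` and the state labels `A` and `B` are hypothetical names used only here.

```python
import pandas as pd

# Made-up incidents: state A has one small fatal incident and one large non-fatal one.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "n_killed": [1, 0, 1, 1],
    "n_injured": [0, 9, 1, 1],
})
toy["victims"] = toy["n_killed"] + toy["n_injured"]
toy["killed_ratio"] = toy["n_killed"] / toy["victims"]

# Mean of per-incident ratios: every incident has equal weight.
mean_ratio = toy.groupby("state")["killed_ratio"].mean()

# Pooled ratio: total killed divided by total victims in the state.
sums = toy.groupby("state")[["n_killed", "victims"]].sum()
pooled_ratio = sums["n_killed"] / sums["victims"]

print(pd.DataFrame({"mean_ratio": mean_ratio, "pooled_ratio": pooled_ratio}))
```

Which measure is appropriate depends on the question: the mean describes a typical incident, while the pooled ratio describes a typical victim.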

In [None]:
import requests
url = "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
states_geo = f"{url}/us-states.json"
states_geo_json=requests.get(states_geo).json()

In [None]:
import folium
# Map bounds: each minimum must be smaller than its maximum
min_lat=30
max_lat=55
min_lon=-135
max_lon=-70

# The states GeoJSON stores the state name under properties.name
key_on = 'feature.properties.name'

# Choropleth expects one row per state, so average the per-incident ratio first
state_killed=all_incidents.groupby('state',as_index=False)['killed_ratio'].mean()

fig=folium.Figure(width=1000, height=500)
# Create the base map
m = folium.Map([43, -100], zoom_start=4,  tiles="cartodb positron",
               attr="My Attribution",max_lat=max_lat,max_lon=max_lon,min_lat=min_lat,min_lon=min_lon
               , control_scale=True,).add_to(fig)

# Add the choropleth layer (legend_name sets the legend caption)
choropleth=folium.Choropleth(
    geo_data=states_geo_json,
    data=state_killed,
    legend_name="Ratio of killed people",
    columns=['state','killed_ratio'],
    key_on=key_on,
    fill_opacity=0.9,
    line_weight=2,
    fill_color="YlGn",
    nan_fill_color="purple",
).add_to(m)
# Show the state name on hover
choropleth.geojson.add_child(folium.GeoJsonTooltip(['name'], labels=True))
folium.LayerControl().add_to(m)


m
# Display the map
#m.save('map.html')
#from IPython.display import IFrame
#IFrame("map.html",width="400", height="300")



In [None]:
from folium.plugins import MiniMap
# The states GeoJSON stores the state name under properties.name
key_on1 = 'feature.properties.name'
min_zoom_defined=2
zoom_start_defined=4

# Choropleth expects one row per state, so average the per-incident ratio first
state_injured=all_incidents.groupby('state',as_index=False)['injures_ratio'].mean()

# Create the base map
fig2=folium.Figure(width=1000, height=500)
mm = folium.Map([43, -100],   tiles="cartodb positron",
                attr="My Attribution",
            zoom_start=zoom_start_defined,
           min_zoom = min_zoom_defined,
           max_bounds = True  ).add_to(fig2)
# Add the choropleth layer (legend_name sets the legend caption)
choropleth =folium.Choropleth(
    geo_data=states_geo_json,
    data=state_injured,
    columns=['state','injures_ratio'],
    key_on=key_on1,
    fill_opacity=0.9,
    line_weight=2,
    fill_color="YlGn",
    nan_fill_color="purple",
    highlight=True,
    legend_name="Injured ratio"

).add_to(mm)
folium.LayerControl().add_to(mm)
# Display the state name on hover
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['name'], labels=False)
)
mm

In [None]:
#!pip install geojson
import geojson
with open('USA_Major_Cities.geojson') as f:
    cities = geojson.load(f)
cities['features'][1]['properties']['NAME']


In [None]:
states_geo_json['features'][0]['properties']['name']

In [None]:
min_zoom_defined=2
zoom_start_defined=4

# Create the base map
fig2=folium.Figure(width=1000, height=500)
mm = folium.Map([43, -100],   tiles="cartodb positron",
                attr="My Attribution",
            zoom_start=zoom_start_defined,
           min_zoom = min_zoom_defined,
           max_bounds = True  ).add_to(fig2)
# Add one marker per major city, with the city name as tooltip and popup
city_layer =folium.GeoJson(
    cities,
    name="Cities and Death",
    zoom_on_click=True,
    marker=folium.Marker(icon=folium.Icon(icon='star')),
    tooltip=folium.GeoJsonTooltip(fields=["NAME"]),
    popup=folium.GeoJsonPopup(fields=["NAME"]),
).add_to(mm)
folium.LayerControl().add_to(mm)
mm

In [None]:
all_incidents.query('n_killed >=10')

In [None]:
all_incidents['date']=pd.to_datetime(all_incidents['date'])
all_incidents['year']=all_incidents['date'].dt.year
all_incidents['month']=all_incidents['date'].dt.month
all_incidents['day']=all_incidents['date'].dt.dayofweek


In [None]:
all_incidents.to_csv("all_incidents_clean.csv",index=False)

In [None]:
state_list=all_incidents.query('is_incident==1').groupby('state')['killed+injures'].agg('sum').sort_values(ascending=False).head(10).index.tolist()
#state_list
sns.pointplot(data=all_incidents.query('is_incident==1 and state in @state_list'),x='year',y='killed+injures',estimator='sum',hue='state',hue_order=state_list)

- In Illinois, 2016 and 2020 have the highest numbers of victims.
- All the states show an increase in the number of victims over time except Florida.
- Texas, Pennsylvania, and New York show a higher rate of increase in the number of victims than other states.
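
One way to put numbers on "rate of increase" is the year-over-year percentage change of the yearly victim totals per state. A rough sketch on made-up rows (the `victims` column stands in for `killed+injures`, and the values are illustrative):

```python
import pandas as pd

# Made-up yearly records (illustrative only).
toy = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2019, 2020],
    "state": ["Texas", "Texas", "Texas", "Texas", "Florida", "Florida"],
    "victims": [2, 2, 3, 3, 4, 4],
})

# Yearly victim totals: one row per year, one column per state.
yearly = toy.pivot_table(index="year", columns="state", values="victims", aggfunc="sum")

# Year-over-year change in percent; the first year has no previous year, so it is NaN.
growth = yearly.diff() / yearly.shift() * 100
print(growth)
```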



In [None]:
state_list=all_incidents.query('is_incident==1').groupby('state')['killed+injures'].agg('sum').sort_values(ascending=False).head(10).index.tolist()
#state_list
po=sns.pointplot(data=all_incidents.query('is_incident==1 and state in @state_list'),
              x='month',y='killed+injures',estimator='sum',hue='state',hue_order=state_list
              )
po.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'],rotation=40)


- Cold states such as New York, Ohio, and Illinois have their highest victim numbers mostly in summer (around July).
- Warm states such as Florida and Texas have their highest numbers in spring, around May.
- Mild states such as California and North Carolina show consistent numbers throughout the year.
- January has high numbers in most of these states, specifically Texas and Florida.


In [None]:
# select the numeric column before summing so text and date columns are not summed
day_month = all_incidents.groupby(by= ['day','month'])['killed+injures'].sum().unstack()
pl=sns.heatmap(data=day_month,cmap = 'icefire',square=True,cbar_kws={"shrink": .7})
pl.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'],rotation=20);
pl.set_yticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],rotation=20);

**Across all states, the busiest days for incidents are weekends, mostly in summer between May and August.**
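
The heatmap rows use pandas' `dayofweek` encoding, where Monday is 0 and Sunday is 6, so weekends are the values 5 and 6. A minimal sketch of turning that encoding into a weekend share of victims (dates and counts are made up; `is_weekend` is a hypothetical helper column):

```python
import pandas as pd

# Made-up incidents: a Saturday, a Sunday, a Monday and a Wednesday.
dates = pd.to_datetime(["2022-07-02", "2022-07-03", "2022-07-04", "2022-07-06"])
toy = pd.DataFrame({"date": dates, "victims": [3, 2, 1, 1]})

toy["day"] = toy["date"].dt.dayofweek   # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["day"] >= 5     # Saturday or Sunday

# Share of all victims that fall on a weekend.
weekend_share = toy.loc[toy["is_weekend"], "victims"].sum() / toy["victims"].sum()
print(toy)
print(round(weekend_share, 3))
```

Weekends cover two of seven days, so a weekend share well above 2/7 (about 29%) would support the pattern seen in the heatmap.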


In [None]:
# select the numeric column before summing so text and date columns are not summed
day_month = all_incidents.query("state=='Texas'").groupby(by= ['day','month'])['killed+injures'].sum().unstack()
tx=sns.heatmap(data=day_month,cmap = 'icefire',square=False)
tx.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'],rotation=20);
tx.set_yticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],rotation=20);

**In Texas, high incident counts occur on weekends across all months, with the highest in January, May, and November.**

In [None]:
from matplotlib.colors import LogNorm, Normalize
from matplotlib.ticker import MaxNLocator
# select the numeric column before summing so text and date columns are not summed
day_month = all_incidents.query("city=='San Antonio'").groupby(by= ['day','month'])['killed+injures'].sum().unstack()
sa=sns.heatmap(data=day_month,cmap = 'icefire',square=True,annot=True,cbar_kws={"shrink": .7})
sa.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'],rotation=20);
sa.set_yticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],rotation=20);

- In contrast to Texas as a whole, there is no clear pattern for San Antonio. In addition to weekends, incidents and victims are spread across the weekdays.

### Check whether holidays affect gunfire incidents

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar,Holiday,AbstractHolidayCalendar
from pandas.tseries.holiday import sunday_to_monday,Easter,Day,Holiday,FR
from pandas.tseries.offsets import CustomBusinessDay
cal = USFederalHolidayCalendar()    # US calendar
hol1 = cal.holidays(start=all_incidents.date.min(), end=all_incidents.date.max(),return_name=True)



In [None]:
from pandas.tseries.holiday import TH
class MoreCalendar(USFederalHolidayCalendar):
   # Overriding rules means this calendar holds only the extra holidays below
   rules = [
     # October 31 is Halloween
     Holiday('Halloween', month=10, day=31, observance=sunday_to_monday),
     Holiday('St. Patricks Day', month=3, day=17, observance=sunday_to_monday),
     Holiday('Good Friday', month=1, day=1, offset=[Easter(), Day(-2)]),
     Holiday('Easter', month=1, day=1, offset=[Easter(), Day(0)]),
     Holiday('Monday Easter', month=1, day=1, offset=[Easter(), Day(+1)]),
     Holiday('All Saints Day', month=11, day=1, observance=sunday_to_monday),
     # Black Friday is the day after Thanksgiving (the fourth Thursday of November)
     Holiday('Black Friday', month=11, day=1, offset=[pd.DateOffset(weekday=TH(4)), Day(1)]),
     Holiday('Valentine\'s Day', month=2, day=14),
   ]
# use a new name so the calendar module imported earlier is not shadowed
more_cal = MoreCalendar()
hol2=more_cal.holidays(all_incidents.date.min(), end=all_incidents.date.max(),return_name=True)
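
Black Friday is the day after Thanksgiving, and Thanksgiving is the fourth Thursday of November, so it does not always coincide with the fourth Friday of the month. The sketch below compares the two rules for 2013, a year in which November 1 was a Friday. `BlackFridaySketch` and both rule names are hypothetical and exist only for this comparison.

```python
import pandas as pd
from pandas.tseries.holiday import AbstractHolidayCalendar, Holiday, FR, TH
from pandas.tseries.offsets import Day


class BlackFridaySketch(AbstractHolidayCalendar):
    rules = [
        # Rule 1: the fourth Friday of November.
        Holiday("Fourth Friday", month=11, day=1,
                offset=pd.DateOffset(weekday=FR(4))),
        # Rule 2: the fourth Thursday of November plus one day.
        Holiday("Day after Thanksgiving", month=11, day=1,
                offset=[pd.DateOffset(weekday=TH(4)), Day(1)]),
    ]


found = BlackFridaySketch().holidays(start="2013-01-01", end="2013-12-31",
                                     return_name=True)
by_name = {name: date.strftime("%Y-%m-%d") for date, name in found.items()}
print(by_name)
```

In years where November starts on a Friday the two rules land a week apart, so the rule used decides which incidents are flagged as Black Friday.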


In [None]:
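hol=pd.concat([hol1,hol2])
# keep one name per date so the merge below cannot duplicate incident rows
hol=hol[~hol.index.duplicated(keep='first')]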

In [None]:
all_incidents['isHoliday'] = all_incidents.date.isin(hol.index)
hol.name='holiday_name'

In [None]:
gun_holiday=all_incidents.merge(hol,left_on='date',right_index=True,how='left')

In [None]:
gun_holiday['holiday_name'].unique()

In [None]:
tb=gun_holiday.pivot_table(columns='isHoliday',index='is_incident',values='incident_id',aggfunc='count',margins=True)
tb_pc=(tb.iloc[:-1,:-1]/tb.iloc[-1,-1])*100
tb_pc.style.format('{:.2f}%')


In [None]:
ax=pd.crosstab(index=gun_holiday['is_incident'],columns=gun_holiday['isHoliday'],values=gun_holiday['incident_id'],aggfunc='count',normalize=True).plot.bar(stacked=False,color={True:'red',False:'blue'})
for p in ax.patches:

    ax.annotate('{val:0.2f}%'.format(val=round(p.get_height()*100,2)), (p.get_x(), p.get_height()))



**With holidays included in the dataset, holiday incidents without victims account for about 1.4% of all incidents, while holiday incidents with victims account for about 3.6%.**
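
These shares are easier to interpret next to a baseline: the fraction of calendar days that are holidays at all. A rough sketch of that comparison, restricted to federal holidays in 2021 and using made-up incident dates (`baseline` and `observed` are hypothetical names):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

start, end = "2021-01-01", "2021-12-31"
all_days = pd.date_range(start, end, freq="D")
holidays = USFederalHolidayCalendar().holidays(start=start, end=end)

# Share of calendar days that are (observed) federal holidays.
baseline = len(holidays) / len(all_days)

# Made-up incident dates (illustrative only).
incident_dates = pd.to_datetime(["2021-07-04", "2021-07-05", "2021-03-10", "2021-12-24"])

# Share of incidents that fall on one of those holiday dates.
observed = incident_dates.isin(holidays).mean()
print(round(baseline, 3), observed)
```

The federal calendar lists observed dates, so July 4, 2021 (a Sunday) is not flagged while Monday, July 5 is. If incidents were spread evenly over the year, `observed` would sit close to `baseline`; a clearly higher value would suggest more incidents on holidays.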

In [None]:
# the ten holidays with the highest mean number of victims per incident
holiday_list=gun_holiday.query('is_incident==1').groupby('holiday_name')['killed+injures'].agg('mean').sort_values(ascending=False).head(10).index.tolist()
#holiday_list
po=sns.pointplot(data=gun_holiday.query('is_incident==1 and holiday_name in @holiday_list'),
              x='year',y='killed+injures',estimator='sum',hue='holiday_name',hue_order=holiday_list
              )
#po.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'],rotation=40)


 - 2016 and 2020 show a jump in the number of victims on Labor Day.
 - After 2016, the number of victims on Independence Day increases.
 - 2021 shows a big jump in New Year's Day victims.
 - Valentine's Day and Good Friday have the lowest numbers over the years.


In [None]:
state_holi=gun_holiday.query('isHoliday==1' ).groupby(['state','holiday_name'])['n_killed'].mean().unstack().T.idxmax()
state_holi.reset_index().groupby(0)['state'].apply(list)

In [None]:
state_holi=gun_holiday.groupby(['city','holiday_name'])['incident_id'].count().unstack()
state_holi.loc[['San Antonio','Houston','Dallas']].T.sort_values(by='San Antonio',ascending=True).plot.barh(color=['red','black','green'])