## **Second project:** Analysis of the evolution of Mass Shootings in the US

Authors: **Raquel Jolis Carné and Martina Massana Massip**

**Uploading the folder *visualization_data*** ensures that all the datasets this notebook needs are available.

## **Data Cleaning with OpenRefine**

The only file we cleaned thoroughly is *MassShootings.csv*, following the procedure below.

1. Changing the type of **numerical data columns from *strings* to *integers*** and setting the ***Incident Date*** column as a ***timestamp***.
2. Combining the columns ***State***, ***City or County*** and ***Address*** into a single ***Complete_Address*** column containing the three fields.
3. Extracting **OpenStreetMap coordinates** for the complete addresses into a new *Coordinates* column.  
4. **Erasing rows** where **coordinates** were **not found.**
5. **Separating the *Coordinates*** values into two columns: ***Longitude* and *Latitude.***
6. Extracting ***FIPS* codes and *Population*** for each state by **fetching information from wikidata** using a Reconciling facet in OpenRefine.
7. Adding a new column ***Region*** with categories: *Midwest*, *Northeast*, *Southeast*, *Southwest* and *West*.
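
The cleaning itself was done in OpenRefine, but steps 1, 2, 4 and 5 have rough *Pandas* equivalents. The sketch below is only an illustration on toy rows: the column *Victims Killed* and the `"lat,lon"` text layout of *Coordinates* are assumptions, not details taken from *MassShootings.csv*.

```python
import pandas as pd

# Toy rows; 'Victims Killed' and the "lat,lon" layout of 'Coordinates'
# are assumptions made for illustration only.
raw = pd.DataFrame({
    'Incident Date': ['2014-01-05', '2015-03-17', '2016-07-04'],
    'State': ['Texas', 'Ohio', 'Utah'],
    'City or County': ['Houston', 'Dayton', 'Provo'],
    'Address': ['100 Main St', '200 Oak Ave', '300 Elm Rd'],
    'Victims Killed': ['2', '0', '1'],
    'Coordinates': ['29.7604,-95.3698', None, '40.2338,-111.6585'],
})

clean = raw.copy()

# Step 1: numeric strings to integers, incident date to timestamps
clean['Victims Killed'] = clean['Victims Killed'].astype(int)
clean['Incident Date'] = pd.to_datetime(clean['Incident Date'])

# Step 2: one address string built from the three location fields
clean['Complete_Address'] = (
    clean['Address'] + ', ' + clean['City or County'] + ', ' + clean['State']
)

# Step 4: drop rows whose coordinates could not be found
clean = clean.dropna(subset=['Coordinates']).reset_index(drop=True)

# Step 5: split the coordinate text into two float columns
clean[['Latitude', 'Longitude']] = (
    clean['Coordinates'].str.split(',', expand=True).astype(float)
)
clean = clean.drop(columns='Coordinates')
```

Steps 3 and 6 depend on external services (OpenStreetMap geocoding and Wikidata reconciliation) and are left out of the sketch.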

Additional transformations needed to answer specific questions are described in the relevant exercises, where *Pandas* dataframes are built by joining multiple datasets and selecting the relevant columns.
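
As a minimal sketch of this kind of aggregate-then-join transformation, the toy example below counts incidents per region and year and attaches a regional population. All values are invented, and the *Year* column is an assumption used to keep the example short.

```python
import pandas as pd

# Toy incidents: one row per shooting; populations are made-up numbers.
incidents = pd.DataFrame({
    'State': ['Ohio', 'Ohio', 'Iowa', 'Utah'],
    'Region': ['Midwest', 'Midwest', 'Midwest', 'West'],
    'Population': [100, 100, 50, 30],
    'Year': [2014, 2015, 2014, 2014],
})

# Count incidents per region and year
per_region = (
    incidents.groupby(['Region', 'Year'])
    .size()
    .reset_index(name='Total Shootings')
)

# Sum each state's population once per region; dropping duplicate states
# first keeps a state with several incidents from being counted repeatedly.
region_population = (
    incidents.drop_duplicates('State')
    .groupby('Region')['Population']
    .sum()
    .reset_index()
)

# Join the regional population onto the per-region counts
per_region = per_region.merge(region_population, on='Region')
```

With this approach a regional population only includes states that appear in the incident data, so a state with no recorded incidents would not contribute to its region's total.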

In [290]:
!pip install -q altair==5.4.1

In [291]:
import pandas as pd
import altair as alt

# for County choropleths
import geopandas as gpd
from shapely.geometry import MultiPolygon, Point, Polygon

In [292]:
mass_shootings = pd.read_csv('MassShootings.csv')
counties_gdf = gpd.read_file('Counties.geojson')

In [None]:
mass_shootings['Incident Date'] = pd.to_datetime(mass_shootings['Incident Date'])
mass_shootings['Month_Year'] = mass_shootings['Incident Date'].dt.to_period('M')
mass_shootings = mass_shootings.drop('Incident Date', axis=1)

# grouping BY STATE AND MONTH
mass_shootings_states = mass_shootings.groupby(['State', 'Month_Year', 'Region', 'Population']).size().reset_index(name='Total Shootings')

# grouping BY REGION AND MONTH
mass_shootings_regions = mass_shootings.groupby(['Region', 'Month_Year']).size().reset_index(name='Total Shootings')
region_population = mass_shootings_states.drop_duplicates('State').groupby(['Region'])['Population'].sum()
mass_shootings_regions = mass_shootings_regions.merge(region_population, on='Region')

## **First question**
#### **How has the number of mass shootings evolved in the major US regions between two given years? And by state?**

## **Second question**
#### **Given a specific year, how has the number of mass shootings per citizen grown or decreased across the different regions in the US compared to the first year of sampled data?**
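
The comparison this question asks for can be sketched on toy numbers: normalize yearly totals by population, then subtract each region's rate in its first sampled year. The figures below are invented, and the column name *Change vs first year* is hypothetical (the notebook itself overlays the 2014 bar rather than computing this difference).

```python
import pandas as pd

# Toy yearly totals for one region; all figures are invented.
yearly = pd.DataFrame({
    'Region': ['West', 'West', 'West'],
    'Year': [2014, 2015, 2016],
    'Total Shootings': [10, 15, 5],
    'Population': [50_000_000] * 3,
})

# Normalize by population and scale to shootings per 10 million citizens
yearly['Shootings per 10M citizens'] = (
    yearly['Total Shootings'] / yearly['Population'] * 10**7
)

# Rate in each region's first sampled year, repeated on every row of that region
baseline = (
    yearly.sort_values('Year')
    .groupby('Region')['Shootings per 10M citizens']
    .transform('first')
)

# Hypothetical column: positive means growth relative to the first year
yearly['Change vs first year'] = yearly['Shootings per 10M citizens'] - baseline
```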

**Data preparation**

In [None]:
# collapse the monthly periods into calendar years
mass_shootings_regions['Year'] = mass_shootings_regions['Month_Year'].apply(lambda x: x.year)
mass_shootings_regions = mass_shootings_regions.drop('Month_Year', axis=1)
# total shootings per region and year (Population is constant within a region)
mass_shootings_regions = mass_shootings_regions.groupby(['Region', 'Year', 'Population'])['Total Shootings'].sum().reset_index()
# normalize by regional population, scaled to 10 million citizens
mass_shootings_regions['Shootings per 10M citizens'] = mass_shootings_regions['Total Shootings'] / mass_shootings_regions['Population'] * 10**7

**Barchart plotting**

In [None]:
selection_year = alt.selection_point(encodings = ['color'])
color = alt.condition(selection_year,
                    alt.Color('Year:O', legend = None),
                    alt.value('lightgray')
)

Q2_years_barchart = alt.Chart(mass_shootings_regions).mark_bar().encode(
    x = 'Year:O',
    y = 'Shootings per 10M citizens:Q',
    color = color,
    tooltip = 'Shootings per 10M citizens:Q'
).properties(
    width = 155,
    height = 400
).add_params(selection_year)

Q2_2014_barchart = alt.Chart(mass_shootings_regions).mark_bar().encode(
    x = 'Year:O',
    y = 'Shootings per 10M citizens:Q',
    color = alt.condition(
        alt.datum.Year == 2014,
        alt.value('#bcdcec'),
        alt.value('transparent')
    ),
    tooltip = 'Shootings per 10M citizens:Q'
)

legend_2014 = alt.Chart(mass_shootings_regions).transform_filter(
    alt.datum.Year == 2014
).mark_circle().encode(
    alt.Y('Year:O', title='Reference Year').axis(orient='right'),
    color = 'Year:O',
)

legend = alt.Chart(mass_shootings_regions).transform_filter(
    alt.datum.Year > 2014
).mark_circle().encode(
    alt.Y('Year:O').axis(orient='right'),
    color = color,
).add_params(selection_year)

Q2_barchart_final = (Q2_years_barchart + Q2_2014_barchart).facet(column = 'Region:N') | (legend_2014 & legend)
Q2_barchart_final

## **Third question**
#### **For the visualization in Q1, it should be possible to select a state, and show the detailed information on its counties.**