## **Second project:** Analysis of the evolution of Mass Shootings in the US

Authors: **Raquel Jolis Carné and Martina Massana Massip**

**Upload the folder *visualization_data*** so that all the datasets used in this notebook are available.

## **Data Cleaning with OpenRefine**

The only file we cleaned thoroughly is *MassShootings.csv*, for which we followed this procedure:

1. Changing the type of **numerical data columns from *strings* to *integers*** and setting the ***Incident Date*** column as a ***timestamp.***
2. Combining the columns ***State, City or County* and *Address* into a single *Complete_Address*** with the three fields.
3. Extracting **OpenStreetMap coordinates** for the complete addresses into a new *Coordinates* column.  
4. **Erasing rows** where **coordinates** were **not found.**
5. **Separating the *Coordinates*** values into two columns: ***Longitude* and *Latitude.***
6. Extracting ***FIPS* codes and *Population*** for each state by **fetching information from Wikidata** using reconciliation in OpenRefine.
7. Adding a new column ***Region*** with categories: *Midwest*, *Northeast*, *Southeast*, *Southwest* and *West*.
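
The cleaning itself was done in OpenRefine, but steps 1, 2, 4 and 5 can be sketched in *Pandas* for readers who want to reproduce them in code. This is only an illustration on toy rows: the `Victims Killed` column name, the sample values and the `latitude,longitude` string format of *Coordinates* are assumptions, not taken from the real dataset.

```python
import pandas as pd

# Toy rows standing in for the raw export; all values are made up.
raw = pd.DataFrame({
    "Incident Date": ["2014-01-01", "2015-03-15", "2016-07-04"],
    "State": ["Ohio", "Texas", "Nevada"],
    "City or County": ["Columbus", "Austin", "Reno"],
    "Address": ["100 Main St", "200 Oak Ave", "300 Pine Rd"],
    "Victims Killed": ["1", "0", "2"],                          # hypothetical numeric column
    "Coordinates": ["39.96,-83.00", None, "39.53,-119.81"],     # assumed "lat,lon" format
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Step 1: numeric strings to integers, incident date to timestamp.
    out["Victims Killed"] = out["Victims Killed"].astype(int)
    out["Incident Date"] = pd.to_datetime(out["Incident Date"])
    # Step 2: one address string built from the three fields.
    out["Complete_Address"] = (
        out["Address"] + ", " + out["City or County"] + ", " + out["State"]
    )
    # Step 4: drop rows for which no coordinates were found.
    out = out.dropna(subset=["Coordinates"]).copy()
    # Step 5: split the coordinate string into two float columns.
    parts = out["Coordinates"].str.split(",", expand=True).astype(float)
    out["Latitude"] = parts[0]
    out["Longitude"] = parts[1]
    return out.drop(columns="Coordinates").reset_index(drop=True)


cleaned = clean(raw)
```

Steps 3 and 6 (geocoding with OpenStreetMap and reconciling against Wikidata) depend on external services and are left out of this sketch.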

Additional transformations needed to answer the specific questions are described in the relevant exercises, where *Pandas* dataframes are created by joining multiple datasets and selecting the relevant columns.

In [1]:
!pip install -q altair==5.4.1


In [81]:
import pandas as pd
import altair as alt

# for County choropleths
import geopandas as gpd
from shapely.geometry import Point
from shapely.geometry import Polygon, MultiPolygon

In [82]:
mass_shootings = pd.read_csv('MassShootings.csv')
counties_gdf = gpd.read_file('Counties.geojson')

In [83]:
mass_shootings['Incident Date'] = pd.to_datetime(mass_shootings['Incident Date'])
mass_shootings['Month_Year'] = mass_shootings['Incident Date'].dt.to_period('M')
mass_shootings['Year'] = mass_shootings['Incident Date'].dt.year
mass_shootings = mass_shootings.drop('Incident Date', axis=1)

# grouping BY STATE AND MONTH
mass_shootings_states = mass_shootings.groupby(['State', 'Month_Year', 'Year', 'Region', 'Population']).size().reset_index(name='Total Shootings')

# grouping BY REGION AND MONTH
mass_shootings_regions = mass_shootings.groupby(['Region', 'Month_Year', 'Year']).size().reset_index(name='Total Shootings')
region_population = mass_shootings_states.drop_duplicates('State').groupby(['Region'])['Population'].sum()
mass_shootings_regions = mass_shootings_regions.merge(region_population, on='Region')



## **First question**
#### **How has the number of mass shootings evolved in the major US regions between two given years? And by state?**

## **Second question**
#### **Given a specific year, how has the number of mass shootings per citizen grown or decreased across the different US regions compared to the first year of sampled data?**

**Additional data preparation**

In [84]:
mass_shootings_regions = mass_shootings_regions.groupby(['Region', 'Year', 'Population'])['Total Shootings'].sum().reset_index()

# defining the proportion by population
mass_shootings_regions['Shootings per 10M citizens'] = mass_shootings_regions['Total Shootings'] / mass_shootings_regions['Population'] * 10**7
mass_shootings_regions = mass_shootings_regions.drop(['Population', 'Total Shootings'], axis=1)

# for the sake of correct slope chart plotting
mass_shootings_regions['Comparison'] = mass_shootings_regions['Year'].apply(lambda x: '2014' if x == 2014 else 'Comparison Year')

mass_shootings_2014 = mass_shootings_regions[mass_shootings_regions['Year'] == 2014].drop(['Year', 'Comparison'], axis=1)
mass_shootings_regions = mass_shootings_regions[mass_shootings_regions['Year'] != 2014]

for region in mass_shootings_regions['Region'].unique():
    for year in mass_shootings_regions['Year'].unique():
        # add a 2014 baseline row for every (region, comparison year) pair
        new_row = pd.DataFrame({
            'Region': [region],
            'Year': [year],
            'Comparison': ['2014'],
            'Shootings per 10M citizens': [mass_shootings_2014[mass_shootings_2014['Region'] == region]['Shootings per 10M citizens'].iloc[0]]
        })

        mass_shootings_regions = pd.concat([mass_shootings_regions, new_row], ignore_index=True)

# splitting the dataset by region so the plots can be placed side by side later
mass_shootings_midwest = mass_shootings_regions[mass_shootings_regions['Region'] == 'Midwest']
mass_shootings_northeast = mass_shootings_regions[mass_shootings_regions['Region'] == 'Northeast']
mass_shootings_southeast = mass_shootings_regions[mass_shootings_regions['Region'] == 'Southeast']
mass_shootings_southwest = mass_shootings_regions[mass_shootings_regions['Region'] == 'Southwest']
mass_shootings_west = mass_shootings_regions[mass_shootings_regions['Region'] == 'West']
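
The 2014 baseline rows built in the cell above could also be produced without a row-by-row loop. The sketch below shows one possible merge-based variant on toy data; the rates are invented, and the names `rates`, `BASE_YEAR`, `baseline_rows` and `slope_data` are hypothetical and not part of the notebook. Unlike the loop, it only creates baseline rows for (region, year) pairs that actually occur in the data, which gives the same result whenever every region has every year.

```python
import pandas as pd

# Toy per-region yearly rates; the numbers are invented for illustration only.
rates = pd.DataFrame({
    "Region": ["Midwest", "Midwest", "Midwest", "West", "West", "West"],
    "Year": [2014, 2015, 2016, 2014, 2015, 2016],
    "Shootings per 10M citizens": [8.0, 10.0, 12.0, 5.0, 6.0, 7.0],
})

BASE_YEAR = 2014  # first sampled year, used as the left end of every slope
value = "Shootings per 10M citizens"

# Base-year value of each region, and the rows of all later years.
baseline = rates.loc[rates["Year"] == BASE_YEAR, ["Region", value]]
later = rates[rates["Year"] != BASE_YEAR].assign(Comparison="Comparison Year")

# One baseline row per (Region, Year): reuse the keys of the later years
# and attach the region's base-year value to each of them.
baseline_rows = (
    later[["Region", "Year"]]
    .merge(baseline, on="Region", how="left")
    .assign(Comparison=str(BASE_YEAR))
)

# Each (Region, Year) now has two points: the baseline and the comparison year.
slope_data = pd.concat([later, baseline_rows], ignore_index=True)
```

A single merge avoids calling `pd.concat` once per row, which copies the whole dataframe on every iteration.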

**Slopechart plotting**

In [126]:
select_year = alt.selection_point(encodings = ['color'])

color = alt.condition(select_year,
                      alt.Color('Year:N', legend = None),
                      alt.value('rgba(169, 169, 169, 0.3)')) # different color and lower opacity

slopecharts_regions = list()
region_dfs = [mass_shootings_midwest, mass_shootings_northeast, mass_shootings_southeast, mass_shootings_southwest, mass_shootings_west]
region_names = ['Midwest', 'Northeast', 'Southeast', 'Southwest', 'West']

for df, region in zip(region_dfs, region_names):

    slopechart = alt.Chart(df).mark_line(point = True).encode(
        x = alt.X('Comparison:N',
                  title = 'Time',
                  axis = alt.Axis(labelAngle = 45)), # tilted axis labels for better readability
        y = alt.Y('Shootings per 10M citizens:Q',
              scale = alt.Scale(domain = [4,30]),
              title = 'Shootings per 10M citizens'),
        color = color,
        tooltip = 'Shootings per 10M citizens:Q'
    ).properties(title = alt.TitleParams(
        text = f'{region}',
        fontSize = 15,
        color = 'black',
        fontWeight='bold'),
                 width = 150,
                 height = 400
    ).add_params(select_year)

    slopecharts_regions.append(slopechart)

legend = alt.Chart(mass_shootings_regions).mark_circle(size = 70).encode(
    alt.Y('Year:N').axis(orient='right'),
    color = color,
).add_params(select_year)

Q2_slopecharts = alt.hconcat(*slopecharts_regions)
Q2_slopecharts_final = Q2_slopecharts | legend
Q2_slopecharts_final

## **Third question**
#### **For the visualization in Q1, it should be possible to select a state, and show the detailed information on its counties.**