# 6.7 Final Project: Tableau Dashboards and Final Analysis

## About this notebook

This notebook contains a final iteration of all the techniques and methodologies practiced in this achievement. This iteration introduces supplementary datasets to delve deeper into the observations made throughout the case study. The datasets include an index of gun law strength and gun violence by Everytown, a non-governmental research organization in the United States focused on gun violence prevention, and population densities at state and county granularity.

The objective of this notebook is to re-perform exploratory analysis, geographic visualization, linear regression and cluster analysis.


## Part 2

### This script contains

##### 1. Import of libraries and data
##### 2. Data wrangling
##### 3. Exploratory analysis
##### 4. Linear regression
##### 5. Unsupervised machine learning: clustering

#### 1. Import of libraries and data

In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.cluster import KMeans 
import pylab as pl
import folium
import json
import geojson

In [19]:
#Ensuring matplotlib display 
%matplotlib inline

In [21]:
#Importing files
path = r'C:/Users/C SaiVishwanath/Documents/CF/Data Immersion/Achievement 6'

In [23]:
df = pd.read_csv(os.path.join(path, '02_Data', 'Prepared', '150425_finalproject_2.csv'), index_col = False, encoding='latin1')

In [25]:
county_geo = r'C:/Users/C SaiVishwanath/Documents/CF/Data Immersion/Achievement 6/02_Data/Original/counties.geojson'

#### 2. Data wrangling

In [28]:
df.head()

Unnamed: 0.1,Unnamed: 0,Incident_ID,Date,Year,Month,Year_Month_State,State,County,City,Lat,...,Suspects_Injured,Suspects_Arrested,Shootings_State,Shootings_County,Handguns_Sold,Long_Guns_Sold,Total_Guns_Sold,Gun_Law_Strength,Gun_Violence_Rate,Category
0,0,3181158,2025-04-09,2025,4,2025-4-Tennessee,Tennessee,Shelby,Memphis,35.1087,...,0,0,30,3,0.0,0.0,0.0,14,22,Weak systems
1,1,3181158,2025-04-09,2025,4,2025-4-Tennessee,Tennessee,Shelby,Memphis,35.1087,...,0,0,30,3,0.0,0.0,0.0,14,22,Weak systems
2,2,3181158,2025-04-09,2025,4,2025-4-Tennessee,Tennessee,Shelby,Memphis,35.1087,...,0,0,30,3,0.0,0.0,0.0,14,22,Weak systems
3,3,3181158,2025-04-09,2025,4,2025-4-Tennessee,Tennessee,Shelby,Memphis,35.1087,...,0,0,30,3,0.0,0.0,0.0,14,22,Weak systems
4,4,3181158,2025-04-09,2025,4,2025-4-Tennessee,Tennessee,Shelby,Memphis,35.1087,...,0,0,30,3,0.0,0.0,0.0,14,22,Weak systems


In [32]:
#Checking null values
df.isnull().sum()

Unnamed: 0              0
Incident_ID             0
Date                    0
Year                    0
Month                   0
Year_Month_State        0
State                   0
County                  0
City                    0
Lat                     0
Long                    0
Population              0
St_Pop_Density_sqmi     0
Cty_Pop_Density_sqmi    0
Victims_Killed          0
Victims_Injured         0
Total_Harmed_Victims    0
Suspects_Killed         0
Suspects_Injured        0
Suspects_Arrested       0
Shootings_State         0
Shootings_County        0
Handguns_Sold           0
Long_Guns_Sold          0
Total_Guns_Sold         0
Gun_Law_Strength        0
Gun_Violence_Rate       0
Category                0
dtype: int64

In [34]:
#Checking duplicates
dups = df.duplicated()
dups.sum()

0

In [36]:
df.dtypes

Unnamed: 0                int64
Incident_ID               int64
Date                     object
Year                      int64
Month                     int64
Year_Month_State         object
State                    object
County                   object
City                     object
Lat                     float64
Long                    float64
Population              float64
St_Pop_Density_sqmi     float64
Cty_Pop_Density_sqmi    float64
Victims_Killed            int64
Victims_Injured           int64
Total_Harmed_Victims      int64
Suspects_Killed           int64
Suspects_Injured          int64
Suspects_Arrested         int64
Shootings_State           int64
Shootings_County          int64
Handguns_Sold           float64
Long_Guns_Sold          float64
Total_Guns_Sold         float64
Gun_Law_Strength         object
Gun_Violence_Rate        object
Category                 object
dtype: object

In [38]:
#Changing data types of 'Gun_Law_Strength' and 'Gun_Violence_Rate'

cols = ['Gun_Law_Strength', 'Gun_Violence_Rate']

df[cols] = df[cols].astype(int)

ValueError: invalid literal for int() with base 10: '90,5'

In [40]:
cols_to_fix = ['Gun_Law_Strength', 'Gun_Violence_Rate']

for col in cols_to_fix:
    df[col] = df[col].str.replace(',', '.', regex=False).astype(float)
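
The conversion above can also be written as a small reusable helper. The following is a minimal sketch on made-up sample values, not the project data: the `decimal_comma_to_float` name and the `raw` frame are hypothetical, and `errors='coerce'` is an assumption about how unparseable entries should be handled (they become `NaN` instead of raising).

```python
import pandas as pd

# Hypothetical sample mimicking the decimal-comma strings seen in the error above
raw = pd.DataFrame({
    'Gun_Law_Strength': ['90,5', '14', '24,5'],
    'Gun_Violence_Rate': ['22', '16,6', 'n/a'],
})

def decimal_comma_to_float(series):
    """Convert strings such as '90,5' to floats; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(',', '.', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')

# Apply the helper column by column
converted = raw.apply(decimal_comma_to_float)
print(converted.dtypes)
print(converted)
```

If the decimal commas are consistent across the whole file, passing `decimal=','` to `pd.read_csv` may make this step unnecessary.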

In [42]:
#Retry: Changing data types of 'Gun_Law_Strength' and 'Gun_Violence_Rate'

cols = ['Gun_Law_Strength', 'Gun_Violence_Rate']

df[cols] = df[cols].astype(float)

In [44]:
df.dtypes

Unnamed: 0                int64
Incident_ID               int64
Date                     object
Year                      int64
Month                     int64
Year_Month_State         object
State                    object
County                   object
City                     object
Lat                     float64
Long                    float64
Population              float64
St_Pop_Density_sqmi     float64
Cty_Pop_Density_sqmi    float64
Victims_Killed            int64
Victims_Injured           int64
Total_Harmed_Victims      int64
Suspects_Killed           int64
Suspects_Injured          int64
Suspects_Arrested         int64
Shootings_State           int64
Shootings_County          int64
Handguns_Sold           float64
Long_Guns_Sold          float64
Total_Guns_Sold         float64
Gun_Law_Strength        float64
Gun_Violence_Rate       float64
Category                 object
dtype: object

In [46]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Incident_ID,Year,Month,Lat,Long,Population,St_Pop_Density_sqmi,Cty_Pop_Density_sqmi,Victims_Killed,...,Suspects_Killed,Suspects_Injured,Suspects_Arrested,Shootings_State,Shootings_County,Handguns_Sold,Long_Guns_Sold,Total_Guns_Sold,Gun_Law_Strength,Gun_Violence_Rate
count,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,...,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0,577567.0
mean,301407.434938,2293309.0,2021.809828,6.624535,38.520081,-89.442507,119031.7,161.358477,278.341495,0.974309,...,0.071072,0.048349,0.719018,531.901327,116.764857,29717.49949,18222.764919,47940.245166,22.104334,17.922015
std,173861.91125,599854.0,2.030511,3.020515,3.768586,9.886973,585696.1,133.274546,1740.351146,1.23136,...,0.263158,0.235878,1.129136,328.23134,118.250978,24361.01782,12213.802803,35541.889313,23.027833,6.539153
min,0.0,272016.0,2015.0,1.0,17.9778,-170.2743,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,150819.5,1793559.0,2020.0,4.0,35.949,-94.1451,10945.0,87.1,24.0,0.0,...,0.0,0.0,0.0,267.0,24.0,11663.0,10337.0,22170.0,8.0,14.9
50%,301340.0,2423331.0,2022.0,7.0,39.0448,-87.6866,27182.0,109.9,55.0,1.0,...,0.0,0.0,0.0,497.0,72.0,23205.0,15336.0,37650.0,14.0,16.6
75%,452052.5,2737798.0,2023.0,9.0,40.5766,-82.9855,64341.0,202.6,161.0,1.0,...,0.0,0.0,1.0,712.0,180.0,36602.0,22864.0,60215.0,24.5,22.0
max,602115.0,3181158.0,2025.0,12.0,68.3445,-65.7733,31290830.0,1195.5,70915.0,60.0,...,3.0,5.0,14.0,1303.0,483.0,171600.0,102546.0,248724.0,90.5,29.4


In [48]:
#index=False avoids writing the row index as another 'Unnamed' column
df.to_csv(os.path.join(path, '02_Data', 'Prepared', '150425_finalproject_3.csv'), index=False)

#### 3. Exploratory analysis

##### Heatmaps

In [None]:
# Create a correlation matrix using pandas (numeric columns only, since df still contains text columns)
df.corr(numeric_only=True)

In [None]:
df.columns

In [None]:
#Creating subset (copied so that columns added later do not trigger a SettingWithCopyWarning)
sub = df[['Incident_ID', 'Year', 'Month',
       'Lat', 'Long', 
       'St_Pop_Density_sqmi', 'Cty_Pop_Density_sqmi', 'Victims_Killed',
       'Victims_Injured', 'Total_Harmed_Victims', 'Suspects_Killed',
       'Suspects_Injured', 'Suspects_Arrested', 'Shootings_State',
       'Shootings_County', 'Handguns_Sold', 'Long_Guns_Sold',
       'Total_Guns_Sold', 'Gun_Law_Strength', 'Gun_Violence_Rate'
         ]].copy()

In [None]:
sub.corr()

In [None]:
# Create a correlation heatmap using matplotlib

plt.matshow(sub.corr())
plt.show()

# Add labels, a colorbar, and change the size of the heatmap
f = plt.figure(figsize=(8, 8))
# Use sub for both the matrix and the tick labels so that they line up
plt.matshow(sub.corr(), fignum=f.number)
plt.xticks(range(sub.shape[1]), sub.columns, fontsize=14, rotation=45)
plt.yticks(range(sub.shape[1]), sub.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=14)
plt.show()

In [None]:
# Create a subplot with matplotlib
f,ax = plt.subplots(figsize=(20,20))

# Create the correlation heatmap in seaborn by applying a heatmap onto the correlation matrix and the subplots defined above.
corr = sns.heatmap(sub.corr(), annot = True, ax = ax) 
plt.show()

Initially, gun law strength is not strongly correlated with variables such as shootings per state or number of guns sold. However, there is a slight negative correlation between the gun law strength index and the gun violence rate, suggesting a possible relationship between legislation and the level of violence. At the same time, gun law strength appears to be strongly correlated with state population density.
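
Reading individual coefficients off a large annotated heatmap is error-prone, so it can help to rank them for a single variable. The sketch below does this on synthetic stand-in data: the column names mirror the notebook, but the values and the `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names mirror the notebook, values are made up
rng = np.random.default_rng(0)
density = rng.uniform(10, 1000, size=200)
demo = pd.DataFrame({
    'St_Pop_Density_sqmi': density,
    'Gun_Law_Strength': 0.05 * density + rng.normal(0, 5, size=200),
    'Total_Guns_Sold': rng.normal(40000, 10000, size=200),
})

# Correlations of every other column with the gun law index, strongest first
ranked = (
    demo.corr(numeric_only=True)['Gun_Law_Strength']
    .drop('Gun_Law_Strength')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Applied to the real data, the same pattern would be `sub.corr()['Gun_Law_Strength']`, sorted by absolute value.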

##### Scatterplot for gun law strength and guns sold, then for gun law strength and shootings

In [None]:
scp_1 = sns.lmplot(x = 'Gun_Law_Strength', y = 'Total_Guns_Sold', data = sub)
plt.show()

In [None]:
scp_2 = sns.lmplot(x = 'Gun_Law_Strength', y = 'Shootings_State', data = sub)
plt.show()

In the first plot we can see a very slight positive correlation, while the second shows a slightly more visible negative correlation: the stronger the gun laws, the fewer shootings there are. This does not, however, imply causation.

##### Pair plots

In [None]:
# Keep only the variables you want to use in the pair plot

sub_2 = sub[['St_Pop_Density_sqmi', 'Total_Harmed_Victims', 'Shootings_State', 
          'Total_Guns_Sold', 'Gun_Law_Strength', 'Gun_Violence_Rate']]

In [None]:
# Create a pair plot 
g = sns.pairplot(sub_2)
plt.show()

#### 4. Linear regression

In [None]:
#Extreme value check (histplot with kde=True replaces the deprecated distplot)
sns.histplot(df['St_Pop_Density_sqmi'], bins=25, kde=True)
plt.show()
sns.histplot(df['Gun_Law_Strength'], bins=25, kde=True)
plt.show()

In [None]:
df['St_Pop_Density_sqmi'].describe()

In [None]:
df['St_Pop_Density_sqmi'].median()

In [None]:
df['Gun_Law_Strength'].describe()

In [None]:
df['Gun_Law_Strength'].median()

In [None]:
#Scatterplot with matplotlib
df.plot(x = 'St_Pop_Density_sqmi', y='Gun_Law_Strength',style='o') 
plt.title('State population density vs. Gun Law Strength Index')  
plt.xlabel('Population density')  
plt.ylabel('Gun Law Strength Index')  
plt.show()

In [None]:
#Reshaping variables to NumPy arrays & putting them into separate objects

X = df['St_Pop_Density_sqmi'].values.reshape(-1,1)
y = df['Gun_Law_Strength'].values.reshape(-1,1)

In [None]:
X

In [None]:
# Split data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
#Creating regression object
regression = LinearRegression()

In [None]:
#Fit the regression object onto training set
regression.fit(X_train, y_train)

In [None]:
# Predict the values of y using X.
y_predicted = regression.predict(X_test)

In [None]:
#Plot that shows regression line from model on test set
plot_test = plt
plot_test.scatter(X_test, y_test, color='gray', s = 15)
plot_test.plot(X_test, y_predicted, color='red', linewidth =3)
plot_test.title('State population density vs. Gun Law Strength Index (Test set)')
plot_test.xlabel('Population density')
plot_test.ylabel('Gun Law Strength Index')
plot_test.show()

In [None]:
#Create objects containing model summary statistics

mse = mean_squared_error(y_test, y_predicted) #Mean sq. error (not the root mean squared error)
r2 = r2_score(y_test, y_predicted) #R2 score

In [None]:
#Print summary statistics
print('Slope:' ,regression.coef_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)

In [None]:
#Dataframe comparing actual and predicted values of y
data = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_predicted.flatten()})
data.head(30)

In [None]:
#Predicting X_train
y_predicted_train = regression.predict(X_train)

In [None]:
#Same summary statistics, now for the training set
mse = mean_squared_error(y_train, y_predicted_train)
r2 = r2_score(y_train, y_predicted_train)

In [None]:
print('Slope:' ,regression.coef_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)

In [None]:
#Visualizing results
plot_test = plt
plot_test.scatter(X_train, y_train, color='green', s = 15)
plot_test.plot(X_train, y_predicted_train, color='red', linewidth =3)
plot_test.title('State population density vs. Gun Law Strength Index (Train set)')
plot_test.xlabel('Population density')
plot_test.ylabel('Gun Law Strength Index')
plot_test.show()

##### Interpretation

The model shows a very small slope and a low R² score (about 0.38), meaning it explains just over 38% of the variance in the data. Although the predictions are roughly of the same magnitude as the actual values, the model fails to capture most of the variability in the target variable.
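
To make the link between the R² score and "variance explained" concrete, the metrics can be computed by hand. This is a minimal sketch on made-up numbers, not the model output from this notebook: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and the root mean squared error is the square root of the mean squared error.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                   # mean squared error, in squared units of y
rmse = np.sqrt(mse)                             # root mean squared error, in the units of y
ss_res = np.sum(residuals ** 2)                 # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot                        # share of the variance explained

print(f'MSE: {mse:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}')
```

Because the RMSE is expressed in the same units as the gun law strength index, it is usually easier to interpret than the MSE.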

#### 5. Unsupervised machine learning: clustering

##### Iteration 1: State population density x Shootings per county

##### Elbow technique

In [None]:
#Defining range of potential clusters
num_cl = range(1, 10) 
#Defining k-means clusters in assigned range (fixed random_state so the elbow curve is reproducible)
kmeans = [KMeans(n_clusters=i, n_init=10, random_state=0) for i in num_cl]

In [None]:
#Creating score representing rate of variation for the given cluster option
score = [kmeans[i].fit(sub).score(sub) for i in range(len(kmeans))]

In [None]:
#Plotting elbow
pl.plot(num_cl,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

In [None]:
#Result: 3 clusters
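
The elbow technique can also be expressed through the model's `inertia_` attribute (the within-cluster sum of squares); as far as the scikit-learn documentation describes it, `score` returns the negative of that quantity. The sketch below runs on synthetic data and adds a `StandardScaler` step that is not part of the notebook above. That step is an assumption on my part: k-means relies on Euclidean distances, so columns on very different scales, such as guns sold versus victim counts, can dominate the result when left unscaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three blobs, with the second feature on a much larger scale
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5000], [10, 0]])
points = np.vstack([
    c + rng.normal(0, [0.5, 300], size=(100, 2)) for c in centers
])

# Standardize so both features contribute comparably to the distances
scaled = StandardScaler().fit_transform(points)

# Inertia (within-cluster sum of squares) for k = 1..9
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias.append(model.inertia_)

for k, value in zip(range(1, 10), inertias):
    print(k, round(value, 1))
```

The elbow is the value of k after which the inertia stops dropping sharply; on this synthetic data that happens at three clusters by construction.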

##### K-means clustering

In [None]:
#Creating k-means object (fixed random_state so cluster labels are reproducible)
kmeans = KMeans(n_clusters = 3, n_init=10, random_state=0) 

In [None]:
#Fitting k-means object to data
kmeans.fit(sub)

In [None]:
#Storing the labels from the fit above instead of fitting a second time
sub['clusters'] = kmeans.labels_

In [None]:
sub.head()

In [None]:
sub['clusters'].value_counts()

In [None]:
#Plotting clusters for 'St_Pop_Density_sqmi' and 'Shootings_County' 
plt.figure(figsize=(12,8))
ax = sns.scatterplot(x=sub['St_Pop_Density_sqmi'], y=sub['Shootings_County'], hue=kmeans.labels_, s=100) 

#Removing grid
ax.grid(False) 

#Labels 
plt.xlabel('State population density (per sq. mi)') 
plt.ylabel('Shootings per county') 
plt.show()

##### Interpretation

The chart above shows that higher population density does not necessarily go together with more mass shooting incidents. On the contrary, clusters 0 and 1 have a higher incidence of mass shootings despite their low population density. Recall that the linear regression showed a slightly positive relationship between population density and gun law strength. Although that relationship was weak, this clustering result suggests that a relationship may indeed be present. If gun law strength proves to be negatively correlated with shooting incidents, that would further support this argument.
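
A scatterplot with several hundred thousand points can hide how the clusters actually differ, so a per-cluster summary table is a useful complement to the visual reading. The following is a minimal sketch on hypothetical rows: the `demo` frame and its values are made up, and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows with cluster labels already assigned; values are made up
demo = pd.DataFrame({
    'clusters': [0, 0, 1, 1, 2, 2],
    'St_Pop_Density_sqmi': [90.0, 110.0, 60.0, 80.0, 900.0, 1100.0],
    'Shootings_County': [200, 240, 150, 170, 20, 40],
})

# Per-cluster size, mean and median of the plotted variables
profile = demo.groupby('clusters').agg(
    n=('St_Pop_Density_sqmi', 'size'),
    density_mean=('St_Pop_Density_sqmi', 'mean'),
    density_median=('St_Pop_Density_sqmi', 'median'),
    shootings_mean=('Shootings_County', 'mean'),
    shootings_median=('Shootings_County', 'median'),
)
print(profile)
```

On the real data, the same call on `sub` (after the `clusters` column has been added) would show whether the low-density clusters really do have the higher shooting counts.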

##### Iteration 2: Gun law strength x Shootings per county

##### K-means clustering

In [None]:
#Plotting clusters for 'Gun_Law_Strength' and 'Shootings_County' 
plt.figure(figsize=(12,8))
ax = sns.scatterplot(x=sub['Gun_Law_Strength'], y=sub['Shootings_County'], hue=kmeans.labels_, s=100) 

#Removing grid
ax.grid(False) 

#Labels 
plt.xlabel('Gun law strength index') 
plt.ylabel('Shootings per county') 
plt.show()