# Analyzing hate crime trends for Austin against the USA as a whole, 2017 - Present

# 1. Data Wrangling

I've been working on this project, off and on, since about January 2020. It is one-half practice and one-half an attempt to contribute to making sense of the chaos that is our world right now. What I intend is to analyze hate crime trends for Austin, TX against the USA as a whole from 2017 to the present, with particular focus on the LGBT community.

I am using data provided by Austin PD in this notebook, and in the next 2 or 3 notebooks as well. For now, I am focusing solely on data for Austin. I will get into broader data for the USA later down the road.

Also, this notebook will only contain the wrangling phase of the analysis. My next notebook will contain the cleaning process, and so on. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Let's load some data & get to work! I am using the Austin PD's reported hate crimes data, published on
# data.austintexas.gov.

aus_17 = pd.read_csv('https://data.austintexas.gov/resource/79qh-wdpx.csv', 
                     keep_default_na=False, 
                     na_values=[""])
display(aus_17.head())

Unnamed: 0,month,incident_number,date_of_incident_day_of_week,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18,race_or_ethnic_of_offender,offense,offense_location,bias,victim_type
0,January,2017-241137,01/01/2017/Sun,0,1,0,1,White/Not Hispanic,Aggravated Assault,Park/Playground,Anti-Black or African American,Individual
1,February,2017-580344,02/01/2017/Wed,0,1,0,1,Black or African American/Not Hispanic,Aggravated Assault,Highway/Road/Alley/Street/Sidewalk,Anti-White,Individual
2,March,2017-800291,03/21/2017/Tues,0,0,0,0,Unknown,Destruction,Highway/Road/Alley/Street/Sidewalk,Anti-Jewish,Other
3,April,2017-1021534,04/12/2017/Wed,0,0,0,0,White/Unknown,Simple Assault,Air/Bus/Train Terminal,Anti-Jewish,Individual
4,May,2017-1351550,05/15/2017/Mon,1,0,1,2,White/Not Hispanic,Simple Assault,Residence/Home,Anti-Gay (Male),Individual
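
A quick aside on the two `read_csv` options I keep passing. This is a minimal sketch on a made-up two-row CSV (the `raw` string is purely illustrative, not APD data): with `keep_default_na=False` and `na_values=[""]`, only truly empty cells become NaN, while strings such as "NA" are kept as text.

```python
import io
import pandas as pd

# Made-up sample: one cell holds the text "NA", one cell is empty
raw = "offense,notes\nAssault,NA\nVandalism,\n"

# Default behaviour: both "NA" and the empty cell are read as NaN
default = pd.read_csv(io.StringIO(raw))

# Options used in this notebook: only the empty cell is read as NaN
strict = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

print(default["notes"].isna().tolist())
print(strict["notes"].tolist())
```

That matters for columns like 'notes', where I'd rather not have literal text silently turned into missing values.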


## Note & Research Questions

As I stated previously, my goal is to analyze trends over time. In particular, I want to focus on how hate crime affects the LGBT community. Most of these data columns aren't necessary for my analysis, so a large part of the analysis process will be cleaning up the data sets.

At first glance, I've come up with a few questions for my data: 

    1. What percentage of reported alleged hate crimes is against the LGBT Community? Also, is the trend rising or decreasing? 
        
    2. Does offender age have any correlation to the types of offenses committed? 
    
    3. Does offender race/ethnicity correlate to types of offenses committed? 
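
For questions 2 and 3, one option I may try later is a simple cross-tabulation. The sketch below uses a made-up toy frame (the column names `offender_race` and `offense` and every value in it are placeholders, not the cleaned APD columns), so treat it as an outline of the idea rather than a result.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "offender_race": ["White/Not Hispanic", "White/Not Hispanic", "Unknown", "Unknown"],
    "offense": ["Assault", "Vandalism", "Vandalism", "Vandalism"],
})

# Raw counts of each offense per offender group
counts = pd.crosstab(toy["offender_race"], toy["offense"])

# Row-wise shares, so groups of different sizes can be compared
shares = pd.crosstab(toy["offender_race"], toy["offense"], normalize="index")

print(counts)
print(shares)
```

These sets are small, so any table like this is a description of the reported incidents, not evidence of a real correlation.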

In [3]:
# Loading the datasets for '18, '19, and this year
aus_18 = pd.read_csv('https://data.austintexas.gov/resource/idj2-d9th.csv', 
                     keep_default_na=False, 
                     na_values=[""])
aus_19 = pd.read_csv('https://data.austintexas.gov/resource/e3qf-htd9.csv', 
                     keep_default_na=False, 
                     na_values=[""])
aus_20 = pd.read_csv('https://data.austintexas.gov/resource/y6x2-kpr9.csv', 
                     keep_default_na=False, 
                     na_values=[""])

# Examining the sets
print('\n')
display(aus_18.head())
display(aus_19.head())
display(aus_20.head())





Unnamed: 0,month,incident_number,date_of_incident_day_of_week,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18,race_ethnic_of_offender_s,offense_s,offense_location,bias,victim_type
0,January,2018-251458,01 25 2018/Thur,0,2,0,1,White/NotHispanic,Burglary/Assault,Residence/Home,Anti-Lesbian,Individual
1,January,2018-200595,01 19 2018/Fri,0,0,0,0,Unknown,Vandalism,Parking Lot/Garage,Anti-Black,Vehicle
2,February,2018-400447,02 08 2018/Thur,0,1,0,1,White/NotHispanic,Assault,Parking Lot/Garage,Anti-Gay,Individual
3,February,2018-530804,02 22 2018/Thur,0,1,0,1,White/NotHispanic,Vandalism,Highway/Road/Street,Anti-Black,Individual
4,March,2018-611809,03 02 2018/Fri,0,1,0,4,Black/Unknown,Assault,Highway/Road/Street,Anti-Hispanic,Individual


Unnamed: 0,month,incident_number,date_of_incident,day_of_week,number_of_victims_under_18,number_of_victims_over_18,number_of_offenders_under,number_of_offenders_over,race_ethnicity_of_offenders,offense_s,offense_location,bias,notes
0,January,2019-8000242,2018-12-29T00:00:00.000,Saturday,0,1,0,0,Unknown,Assault,Bar/Nightclub,Anti-Gay (Male),"Offense occurred in 2018, but reported in Janu..."
1,January,2019-190201,2019-01-19T00:00:00.000,Saturday,0,2,0,4,White/Hispanic (2) White/NonHispanic (2),Assault,Streets/Highway/Road/Alley,Anti-Gay (Male),"Four total offenders, two White Hispanic, two ..."
2,February,2019-531028,2019-02-22T00:00:00.000,Friday,0,1,0,0,Unknown,Vandalism,Residence/Home,Anti-Jewish,
3,March,2019-901579,2019-03-31T00:00:00.000,Sunday,0,1,0,1,White/Hispanic,Assault,Bar/Nightclub,Anti-Gay (Male),
4,April,2019-941819,2019-04-04T00:00:00.000,Saturday,0,1,0,3,White/Hispanic,Assault,School-Elementary/Secondary,Anti-Hispanic/Latino,


Unnamed: 0,month,incident_number,date_of_incident,day_of_week,number_of_victims_under_18,number_of_victims_over_18,number_of_offenders_under,number_of_offenders_over,race_ethnicity_of_offenders,offense_s,offense_location,bias,notes
0,March,2020-602085,2020-03-01T00:00:00.000,Sunday,0,1,0,1,White/Non-Hispanic,Criminal Mischief,Residence/Home,Anti-Black,
1,March,2020-680226,2020-03-08T00:00:00.000,Sunday,0,1,0,2,White/Hispanic,Assault,Parking Lot,Anti-Gay (Male); Anti-Transgender,
2,March,2020-5011788,2020-03-22T00:00:00.000,Sunday,0,1,0,0,Unknown,Criminal Mischief,Residence/Home,Anti-Gay (Male); Anti-Jewish,
3,April,2020-5015689,2020-04-20T00:00:00.000,Monday,0,1,0,0,Unknown,Criminal Mischief,Church/Synagogue/Temple/Mosque,Anti-Buddhist,
4,April,2020-5016804,2020-04-29T00:00:00.000,Wednesday,0,1,0,1,Black/Non-Hispanic,Assault by Threat,Department/Discount Store,Anti-Gay (Male); Anti-Transgender,


Needless to say, these data are quite messy! We'll need to clean & merge these into one df before we can start analyzing. There are several columns in each set we don't need and there are also NaN values.

### One brutally glaring problem that really irks me is that the 'date' columns are all formatted differently! The 2019 dataset, for example, has the date and the day of the week split into two columns. This can cause quite a headache later, so we'll definitely need to remedy these problems.
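
To make the date problem concrete, here is a rough sketch of how the three layouts visible in the previews above could be brought to one type. The helper `parse_incident_date` is hypothetical (my own name, not part of pandas), and it assumes only these three layouts occur: `01/01/2017/Sun`, `01 25 2018/Thur`, and `2019-01-19T00:00:00.000`.

```python
import pandas as pd

def parse_incident_date(raw: str) -> pd.Timestamp:
    """Hypothetical helper: parse the three date layouts seen in the previews."""
    text = str(raw).strip()
    if "-" in text:
        # 2019 and 2020 sets: ISO-style timestamp
        return pd.to_datetime(text, format="%Y-%m-%dT%H:%M:%S.%f")
    # 2017 and 2018 sets: drop the trailing "/Sun", "/Thur", ...
    date_part = text.rsplit("/", 1)[0]
    # 2018 separates month, day and year with spaces; 2017 uses slashes
    fmt = "%m %d %Y" if " " in date_part else "%m/%d/%Y"
    return pd.to_datetime(date_part, format=fmt)

samples = ["01/01/2017/Sun", "01 25 2018/Thur", "2019-01-19T00:00:00.000"]
parsed = [parse_incident_date(s) for s in samples]
print(parsed)
```

Once a column holds real timestamps, the day of the week can be derived with `.day_name()`, so the separate text column would no longer be needed.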

#### Also, I may want to categorize the different LGBT-related biases into one 'anti-lgbt' category to make analyzing easier.
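
A minimal sketch of what that grouping might look like. The keyword list `LGBT_KEYWORDS` and the helper `is_anti_lgbt` are my own assumptions for illustration; the sample labels are taken from the previews above.

```python
import pandas as pd

# Assumption: these substrings are enough to spot LGBT-related bias labels
LGBT_KEYWORDS = ("gay", "lesbian", "transgender", "bisexual")

def is_anti_lgbt(bias: str) -> bool:
    """Hypothetical helper: True if any keyword appears in the bias label."""
    label = str(bias).lower()
    return any(keyword in label for keyword in LGBT_KEYWORDS)

bias = pd.Series([
    "Anti-Gay (Male)",
    "Anti-Lesbian",
    "Anti-Gay (Male); Anti-Transgender",
    "Anti-Jewish",
    "Anti-Gay (Male); Anti-Jewish",
])

# Boolean flag stored next to the label instead of overwriting it
anti_lgbt = bias.map(is_anti_lgbt)
print(anti_lgbt.tolist())
```

I'd lean towards a separate flag column rather than replacing the label, because some 2020 rows list two biases at once (for example 'Anti-Gay (Male); Anti-Jewish') and overwriting would lose the second one.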

Before doing so, however, I want to take a look at some variables within the data.

Already, I know that the 'bias' column is going to be a pivotal feature in the analysis.

In [4]:
# Get a feel for each set: summary statistics first, then the number of distinct 'bias' values.
display(aus_17.describe())
display(aus_18.describe())
display(aus_19.describe())
display(aus_20.describe())

# Only the 'offenders' & 'victims' columns are numerical...can't really do much descriptive stats with only that.

# LOL! The creator(s) of the 2017 dataset misspelled 'victims' as 'vitims' in the first victim column (the 2018 set has the same typo)!

display(aus_17['bias'].nunique())
display(aus_18['bias'].nunique())
display(aus_19['bias'].nunique())
display(aus_20['bias'].nunique())

# This is important because I am focusing mainly on the LGBT community, so I will need to merge all the LGBT-related 
# values in the 'bias' columns into one value, 'anti-lgbt'. 

Unnamed: 0,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18
count,17.0,17.0,17.0,17.0
mean,0.058824,0.823529,0.176471,0.882353
std,0.242536,0.392953,0.528594,0.600245
min,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,1.0
50%,0.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0
max,1.0,1.0,2.0,2.0


Unnamed: 0,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18
count,19.0,19.0,19.0,19.0
mean,0.052632,0.894737,0.052632,1.105263
std,0.229416,0.458831,0.229416,0.737468
min,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,1.0
50%,0.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0
max,1.0,2.0,1.0,4.0


Unnamed: 0,number_of_victims_under_18,number_of_victims_over_18,number_of_offenders_under,number_of_offenders_over
count,12.0,12.0,12.0,12.0
mean,0.083333,1.0,0.25,1.25
std,0.288675,0.426401,0.866025,1.215431
min,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,0.75
50%,0.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.25
max,1.0,2.0,3.0,4.0


Unnamed: 0,number_of_victims_under_18,number_of_victims_over_18,number_of_offenders_under,number_of_offenders_over,notes
count,5.0,5.0,5.0,5.0,0.0
mean,0.0,1.0,0.0,0.8,
std,0.0,0.0,0.0,0.83666,
min,0.0,1.0,0.0,0.0,
25%,0.0,1.0,0.0,0.0,
50%,0.0,1.0,0.0,1.0,
75%,0.0,1.0,0.0,1.0,
max,0.0,1.0,0.0,2.0,


7

9

6

4

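The good news is that the 'bias' column has the same name in every set, so that is at least one blessing we can count. The labels themselves still vary between years (for example 'Anti-Black or African American' in 2017 versus 'Anti-Black' in 2018), which is a job for the cleaning phase.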

Accordingly, we should be able to go ahead & concatenate into one df & then we can begin cleaning everything up.

### The following code is an in-notebook exercise so I can brush up on concatenating dfs, since I haven't done it in a while.

In [5]:
# Just for fun, let's see if we can go ahead & concatenate the frames together, THEN work on cleaning it up....
aus_17_sub = aus_17.head(5)
aus_17_sub_last5 = aus_17.tail(5) 
aus_17_sub_last5 = aus_17_sub_last5.reset_index(drop=True)

# Let's test it out to see if everything is in order enough to even try
vertical_stack = pd.concat([aus_17_sub, aus_17_sub_last5], axis=0)
horizontal_stack = pd.concat([aus_17_sub, aus_17_sub_last5], axis=1)

display(vertical_stack.head())
display(horizontal_stack.head())

Unnamed: 0,month,incident_number,date_of_incident_day_of_week,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18,race_or_ethnic_of_offender,offense,offense_location,bias,victim_type
0,January,2017-241137,01/01/2017/Sun,0,1,0,1,White/Not Hispanic,Aggravated Assault,Park/Playground,Anti-Black or African American,Individual
1,February,2017-580344,02/01/2017/Wed,0,1,0,1,Black or African American/Not Hispanic,Aggravated Assault,Highway/Road/Alley/Street/Sidewalk,Anti-White,Individual
2,March,2017-800291,03/21/2017/Tues,0,0,0,0,Unknown,Destruction,Highway/Road/Alley/Street/Sidewalk,Anti-Jewish,Other
3,April,2017-1021534,04/12/2017/Wed,0,0,0,0,White/Unknown,Simple Assault,Air/Bus/Train Terminal,Anti-Jewish,Individual
4,May,2017-1351550,05/15/2017/Mon,1,0,1,2,White/Not Hispanic,Simple Assault,Residence/Home,Anti-Gay (Male),Individual


Unnamed: 0,month,incident_number,date_of_incident_day_of_week,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18,race_or_ethnic_of_offender,offense,offense_location,...,date_of_incident_day_of_week.1,number_of_vitims_under_18.1,number_of_victims_over_18.1,number_of_offenders_under_18.1,number_of_offenders_over_18.1,race_or_ethnic_of_offender.1,offense.1,offense_location.1,bias,victim_type
0,January,2017-241137,01/01/2017/Sun,0,1,0,1,White/Not Hispanic,Aggravated Assault,Park/Playground,...,10/15/2017/Sun,0,1,0,1,White/Hispanic or Latino,Simple Assault,Highway/Road/Alley/Street/Sidewalk,Anti-Black or African American,Individual
1,February,2017-580344,02/01/2017/Wed,0,1,0,1,Black or African American/Not Hispanic,Aggravated Assault,Highway/Road/Alley/Street/Sidewalk,...,10/24/2017/Tues,0,1,2,0,White/Hispanic or Latino,Intimidation,Residence/Home,Anti-Gay (Male),Individual
2,March,2017-800291,03/21/2017/Tues,0,0,0,0,Unknown,Destruction,Highway/Road/Alley/Street/Sidewalk,...,11/10/2017/Fri,0,1,0,1,White/Not Hispanic,Simple Assault,Restaurant,Anti-Islamic (Muslim),Individual
3,April,2017-1021534,04/12/2017/Wed,0,0,0,0,White/Unknown,Simple Assault,Air/Bus/Train Terminal,...,11/16/2017/Thurs,0,1,0,1,White/Unknown,Simple Assault,Other/Unknown,Anti-Islamic (Muslim),Individual
4,May,2017-1351550,05/15/2017/Mon,1,0,1,2,White/Not Hispanic,Simple Assault,Residence/Home,...,11/26/2017/Sun,0,1,0,0,Unknown,Intimidation,Parking/Drop Lot,Anti-Hispanic or Latino,Individual


In [6]:
vertical_stack.to_csv(r'C:\Users\Robert\OneDrive\Desktop\data_output_out.csv', 
                      index=False)

new_output = pd.read_csv(r'C:\Users\Robert\OneDrive\Desktop\data_output_out.csv', 
                         keep_default_na=False, 
                         na_values=[""])

In [7]:
display(new_output.head())
display(aus_17.head())

Unnamed: 0,month,incident_number,date_of_incident_day_of_week,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18,race_or_ethnic_of_offender,offense,offense_location,bias,victim_type
0,January,2017-241137,01/01/2017/Sun,0,1,0,1,White/Not Hispanic,Aggravated Assault,Park/Playground,Anti-Black or African American,Individual
1,February,2017-580344,02/01/2017/Wed,0,1,0,1,Black or African American/Not Hispanic,Aggravated Assault,Highway/Road/Alley/Street/Sidewalk,Anti-White,Individual
2,March,2017-800291,03/21/2017/Tues,0,0,0,0,Unknown,Destruction,Highway/Road/Alley/Street/Sidewalk,Anti-Jewish,Other
3,April,2017-1021534,04/12/2017/Wed,0,0,0,0,White/Unknown,Simple Assault,Air/Bus/Train Terminal,Anti-Jewish,Individual
4,May,2017-1351550,05/15/2017/Mon,1,0,1,2,White/Not Hispanic,Simple Assault,Residence/Home,Anti-Gay (Male),Individual


Unnamed: 0,month,incident_number,date_of_incident_day_of_week,number_of_vitims_under_18,number_of_victims_over_18,number_of_offenders_under_18,number_of_offenders_over_18,race_or_ethnic_of_offender,offense,offense_location,bias,victim_type
0,January,2017-241137,01/01/2017/Sun,0,1,0,1,White/Not Hispanic,Aggravated Assault,Park/Playground,Anti-Black or African American,Individual
1,February,2017-580344,02/01/2017/Wed,0,1,0,1,Black or African American/Not Hispanic,Aggravated Assault,Highway/Road/Alley/Street/Sidewalk,Anti-White,Individual
2,March,2017-800291,03/21/2017/Tues,0,0,0,0,Unknown,Destruction,Highway/Road/Alley/Street/Sidewalk,Anti-Jewish,Other
3,April,2017-1021534,04/12/2017/Wed,0,0,0,0,White/Unknown,Simple Assault,Air/Bus/Train Terminal,Anti-Jewish,Individual
4,May,2017-1351550,05/15/2017/Mon,1,0,1,2,White/Not Hispanic,Simple Assault,Residence/Home,Anti-Gay (Male),Individual


### Okay! Looks like everything worked perfectly!
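
Comparing the two previews by eye works for five rows, but a programmatic check is sturdier. Below is a sketch of a CSV round trip verified with `pandas.testing.assert_frame_equal`; it writes to an in-memory buffer instead of my Desktop path, and the small `original` frame is made up for the example.

```python
import io
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up frame standing in for a slice of the 2017 data
original = pd.DataFrame({
    "incident_number": ["2017-241137", "2017-580344"],
    "number_of_victims_over_18": [1, 1],
    "bias": ["Anti-Black or African American", "Anti-White"],
})

# Write to an in-memory CSV, then read it back with the same options as above
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[""])

# Raises AssertionError with a description of the first difference, if any
assert_frame_equal(original, reloaded)
print("round trip OK")
```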

#### So let's concatenate the dfs together for real this time!

In [8]:
aus_final = pd.concat([aus_17, aus_18, aus_19, aus_20],
                      axis=0, sort=True)

display(aus_final.head())
display(aus_final.tail())
# info() prints its report itself and returns None, so display() adds the stray 'None' after it
display(aus_final.info())
display(aus_final.describe())

Unnamed: 0,bias,date_of_incident,date_of_incident_day_of_week,day_of_week,incident_number,month,notes,number_of_offenders_over,number_of_offenders_over_18,number_of_offenders_under,...,number_of_victims_over_18,number_of_victims_under_18,number_of_vitims_under_18,offense,offense_location,offense_s,race_ethnic_of_offender_s,race_ethnicity_of_offenders,race_or_ethnic_of_offender,victim_type
0,Anti-Black or African American,,01/01/2017/Sun,,2017-241137,January,,,1.0,,...,1,,0.0,Aggravated Assault,Park/Playground,,,,White/Not Hispanic,Individual
1,Anti-White,,02/01/2017/Wed,,2017-580344,February,,,1.0,,...,1,,0.0,Aggravated Assault,Highway/Road/Alley/Street/Sidewalk,,,,Black or African American/Not Hispanic,Individual
2,Anti-Jewish,,03/21/2017/Tues,,2017-800291,March,,,0.0,,...,0,,0.0,Destruction,Highway/Road/Alley/Street/Sidewalk,,,,Unknown,Other
3,Anti-Jewish,,04/12/2017/Wed,,2017-1021534,April,,,0.0,,...,0,,0.0,Simple Assault,Air/Bus/Train Terminal,,,,White/Unknown,Individual
4,Anti-Gay (Male),,05/15/2017/Mon,,2017-1351550,May,,,2.0,,...,0,,1.0,Simple Assault,Residence/Home,,,,White/Not Hispanic,Individual


Unnamed: 0,bias,date_of_incident,date_of_incident_day_of_week,day_of_week,incident_number,month,notes,number_of_offenders_over,number_of_offenders_over_18,number_of_offenders_under,...,number_of_victims_over_18,number_of_victims_under_18,number_of_vitims_under_18,offense,offense_location,offense_s,race_ethnic_of_offender_s,race_ethnicity_of_offenders,race_or_ethnic_of_offender,victim_type
0,Anti-Black,2020-03-01T00:00:00.000,,Sunday,2020-602085,March,,1.0,,0.0,...,1,0.0,,,Residence/Home,Criminal Mischief,,White/Non-Hispanic,,
1,Anti-Gay (Male); Anti-Transgender,2020-03-08T00:00:00.000,,Sunday,2020-680226,March,,2.0,,0.0,...,1,0.0,,,Parking Lot,Assault,,White/Hispanic,,
2,Anti-Gay (Male); Anti-Jewish,2020-03-22T00:00:00.000,,Sunday,2020-5011788,March,,0.0,,0.0,...,1,0.0,,,Residence/Home,Criminal Mischief,,Unknown,,
3,Anti-Buddhist,2020-04-20T00:00:00.000,,Monday,2020-5015689,April,,0.0,,0.0,...,1,0.0,,,Church/Synagogue/Temple/Mosque,Criminal Mischief,,Unknown,,
4,Anti-Gay (Male); Anti-Transgender,2020-04-29T00:00:00.000,,Wednesday,2020-5016804,April,,1.0,,0.0,...,1,0.0,,,Department/Discount Store,Assault by Threat,,Black/Non-Hispanic,,


<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 0 to 4
Data columns (total 21 columns):
bias                            53 non-null object
date_of_incident                17 non-null object
date_of_incident_day_of_week    36 non-null object
day_of_week                     17 non-null object
incident_number                 53 non-null object
month                           53 non-null object
notes                           2 non-null object
number_of_offenders_over        17 non-null float64
number_of_offenders_over_18     36 non-null float64
number_of_offenders_under       17 non-null float64
number_of_offenders_under_18    36 non-null float64
number_of_victims_over_18       53 non-null int64
number_of_victims_under_18      17 non-null float64
number_of_vitims_under_18       36 non-null float64
offense                         17 non-null object
offense_location                53 non-null object
offense_s                       36 non-null object
race_ethnic_of_offender_s  

None

Unnamed: 0,number_of_offenders_over,number_of_offenders_over_18,number_of_offenders_under,number_of_offenders_under_18,number_of_victims_over_18,number_of_victims_under_18,number_of_vitims_under_18
count,17.0,36.0,17.0,36.0,53.0,17.0,36.0
mean,1.117647,1.0,0.176471,0.111111,0.90566,0.058824,0.055556
std,1.111438,0.676123,0.727607,0.39841,0.404976,0.242536,0.232311
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,0.0,1.0,0.0,0.0
50%,1.0,1.0,0.0,0.0,1.0,0.0,0.0
75%,1.0,1.0,0.0,0.0,1.0,0.0,0.0
max,4.0,4.0,3.0,2.0,2.0,1.0,1.0


So far, we definitely want to hold onto the 'bias', 'incident_number', 'number_of_victims_over_18', and 'offense_location' columns.

I can always make 2 copies of the concatenated set to work with, in case I change my mind later about what to include.
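
One thing the concatenated frame makes obvious is that columns with slightly different names across years end up as separate, half-empty columns. A possible fix for the cleaning notebook is to rename them to shared names before stacking. The mapping `RENAME`, the target name `offender_race_ethnicity`, and the helper `harmonize` below are all my own assumptions, including the guess that 'number_of_offenders_under' and 'number_of_offenders_over' in the 2019 and 2020 sets refer to age 18.

```python
import pandas as pd

# Hypothetical mapping from each year's column names to one shared name
RENAME = {
    "number_of_vitims_under_18": "number_of_victims_under_18",
    "number_of_offenders_under": "number_of_offenders_under_18",  # assumes "under 18"
    "number_of_offenders_over": "number_of_offenders_over_18",    # assumes "over 18"
    "race_or_ethnic_of_offender": "offender_race_ethnicity",
    "race_ethnic_of_offender_s": "offender_race_ethnicity",
    "race_ethnicity_of_offenders": "offender_race_ethnicity",
    "offense_s": "offense",
}

def harmonize(frames):
    """Rename columns to shared names, then stack the frames row-wise."""
    renamed = [frame.rename(columns=RENAME) for frame in frames]
    return pd.concat(renamed, axis=0, ignore_index=True, sort=False)

# Two tiny made-up frames that mimic the naming mismatch
a = pd.DataFrame({"number_of_vitims_under_18": [0], "offense": ["Assault"]})
b = pd.DataFrame({"number_of_victims_under_18": [1], "offense_s": ["Vandalism"]})

combined = harmonize([a, b])
print(combined)
```

With the names aligned, the stacked frame has one column per variable and no NaN padding caused purely by spelling differences.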

In [9]:
aus_final.to_csv(r"C:\Users\Robert\OneDrive\Desktop\aus_final.csv")

## So that was fun...in the next notebook, I'll begin the process of cleaning up our concatenated dataframe. Slowly but surely we'll get there.