# Violent Crime Dataset, Cleaning and Initial Exploration

#### Dataset info:

- Dataset covers 68 US cities with populations over 250,000 who have opted in to a voluntary reporting program
- Time period is 1975-2015


### Dependencies

In [1]:
import pandas as pd

### CSV import and dataframe creation

In [2]:
filepath = ('../../raw/crime_by_jurisdiction.csv')
csv = pd.read_csv(filepath)
df = pd.DataFrame(csv)

### Explore head

In [3]:
df.head()

Unnamed: 0,report_year,agency_code,agency_jurisdiction,population,violent_crimes,homicides,rapes,assaults,robberies,months_reported,crimes_percapita,homicides_percapita,rapes_percapita,assaults_percapita,robberies_percapita
0,1975,NM00101,"Albuquerque, NM",286238.0,2383.0,30.0,181.0,1353.0,819.0,12.0,832.52,10.48,63.23,472.68,286.13
1,1975,TX22001,"Arlington, TX",112478.0,278.0,5.0,28.0,132.0,113.0,12.0,247.16,4.45,24.89,117.36,100.46
2,1975,GAAPD00,"Atlanta, GA",490584.0,8033.0,185.0,443.0,3518.0,3887.0,12.0,1637.44,37.71,90.3,717.1,792.32
3,1975,CO00101,"Aurora, CO",116656.0,611.0,7.0,44.0,389.0,171.0,12.0,523.76,6.0,37.72,333.46,146.58
4,1975,TX22701,"Austin, TX",300400.0,1215.0,33.0,190.0,463.0,529.0,12.0,404.46,10.99,63.25,154.13,176.1


In [4]:
# identify column datatypes to see if any need parsing or casting
df.dtypes

report_year              int64
agency_code             object
agency_jurisdiction     object
population             float64
violent_crimes         float64
homicides              float64
rapes                  float64
assaults               float64
robberies              float64
months_reported        float64
crimes_percapita       float64
homicides_percapita    float64
rapes_percapita        float64
assaults_percapita     float64
robberies_percapita    float64
dtype: object

### School dataset only goes back to 1990, so remove superfluous years

In [5]:
# subset dataframe to years 1990 and later
recent_df = df[df['report_year'] >= 1990]

### What city has the highest ever reported number of violent crimes, and in what year?

In [6]:
# find row with highest number of total violent crimes, return index, use index to return row
recent_df.iloc[recent_df.violent_crimes.idxmax()]

report_year                     2007
agency_code                      NaN
agency_jurisdiction    United States
population                       NaN
violent_crimes           1.42297e+06
homicides                      17128
rapes                            NaN
assaults                         NaN
robberies                        NaN
months_reported                  NaN
crimes_percapita               471.8
homicides_percapita              5.7
rapes_percapita                  NaN
assaults_percapita               NaN
robberies_percapita              NaN
Name: 2276, dtype: object

#### Didn't work, above query shows that there are yearly summary rows in the dataset which need to be removed

In [7]:
# subset dataset to remove summary rows
no_summary_df = recent_df[recent_df['agency_jurisdiction'] != 'United States']

In [8]:
len(recent_df) - len(no_summary_df)

26

#### Great, dataset now covers 1990-2015, so 26 years, 26 summary rows removed

### Try again: this time, which city had the highest number of violent crimes per capita, and in what year?

In [9]:
# idxmax returns index of maximum value in crimes_percapita row, hands index to .iloc to return row
no_summary_df.iloc[no_summary_df.crimes_percapita.idxmax()]

report_year                      2005
agency_code                   PAPPD00
agency_jurisdiction    Pittsburgh, PA
population                     330780
violent_crimes                   3385
homicides                          63
rapes                             117
assaults                         1588
robberies                        1617
months_reported                    12
crimes_percapita              1023.34
homicides_percapita             19.05
rapes_percapita                 35.37
assaults_percapita             480.08
robberies_percapita            488.84
Name: 2119, dtype: object

#### Pittsburgh in 2005

### Investigating possible effects of dropNA

In [10]:
# group dataset by year, and count number of non-NA values for each field
no_summary_df.groupby('report_year').count()

Unnamed: 0_level_0,agency_code,agency_jurisdiction,population,violent_crimes,homicides,rapes,assaults,robberies,months_reported,crimes_percapita,homicides_percapita,rapes_percapita,assaults_percapita,robberies_percapita
report_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1990,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1991,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1992,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1993,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1994,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1995,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1996,68,68,67,67,67,67,67,67,67,67,67,67,67,67
1997,68,68,67,66,66,66,66,66,67,66,66,66,66,66
1998,68,68,67,66,66,66,66,66,67,66,66,66,66,66
1999,68,68,67,67,67,67,67,67,67,67,67,67,67,67


#### Looking at the counts, if we dropNA, we will:
- Have no data for 2015
- Have one or two less jurisdictions in years prior to 2002

#### If we remove the months_reported column, then dropNA, we won't lose the 2015 data
- Is months reported always 12?
    - Let's see:

In [11]:
no_summary_df[(no_summary_df.months_reported != 0) 
              & (no_summary_df.months_reported != 12)
              & (pd.notnull(no_summary_df.months_reported))].count()

report_year            19
agency_code            19
agency_jurisdiction    19
population             19
violent_crimes         19
homicides              19
rapes                  19
assaults               19
robberies              19
months_reported        19
crimes_percapita       19
homicides_percapita    19
rapes_percapita        19
assaults_percapita     19
robberies_percapita    19
dtype: int64

#### 19 records where months_reported is neither 0, 12, nor NaN
- How to deal with this?
- I'm going to remove the column and dropNA, due to time constraints

In [12]:
# remove months_reported column
no_months_df = no_summary_df.drop(['months_reported'], axis=1)

In [13]:
# drop NA
no_NA_df = no_months_df.dropna()

In [14]:
# calculate percentage of data lost
print("Records dropped: ", (len(no_months_df) - len(no_NA_df)))
print("Percentage of records dropped: ", round(((len(no_months_df) - len(no_NA_df)) / len(no_months_df) * 100), 2), '%')
print("Remaining records: ", len(no_NA_df))

Records dropped:  18
Percentage of records dropped:  1.02 %
Remaining records:  1750


#### 1% data loss, acceptable

In [15]:
no_NA_df.head()

Unnamed: 0,report_year,agency_code,agency_jurisdiction,population,violent_crimes,homicides,rapes,assaults,robberies,crimes_percapita,homicides_percapita,rapes_percapita,assaults_percapita,robberies_percapita
1035,1990,NM00101,"Albuquerque, NM",384736.0,5121.0,34.0,222.0,3835.0,1030.0,1331.04,8.84,57.7,996.79,267.72
1036,1990,TX22001,"Arlington, TX",261721.0,1876.0,8.0,139.0,1143.0,586.0,716.79,3.06,53.11,436.72,223.9
1037,1990,GAAPD00,"Atlanta, GA",394017.0,16097.0,231.0,695.0,9062.0,6109.0,4085.36,58.63,176.39,2299.9,1550.44
1038,1990,CO00101,"Aurora, CO",222103.0,3191.0,8.0,170.0,2616.0,397.0,1436.72,3.6,76.54,1177.83,178.75
1039,1990,TX22701,"Austin, TX",465622.0,3326.0,46.0,280.0,1539.0,1461.0,714.31,9.88,60.13,330.53,313.77


### Add State column for later manipulation

In [16]:
# extract last two characters from agency_jurisdiction (state abbr.) and set on new 'state' column

for index, row in no_NA_df.iterrows():
    state = row['agency_jurisdiction'][-2:]
    no_NA_df.loc[index, 'state'] = state
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [17]:
len(no_NA_df.state.unique())

33

#### jurisdictions in dataset are across 33 unique states

### Export cleaned data for additional exploration

In [18]:
cleaned_df = no_NA_df

In [19]:
cleaned_df.to_csv('Output/cleaned_crime_by_jurisdiction.csv', index=False)