# Crime Dataset
In this section we analyze the Crime dataset.
This large dataset contains 128 columns, which we need to filter in order to extract the features relevant to our analysis. In particular, we are interested in the following columns:
- state
- population (used to weight the values according to the population)
- NumUnderPov (the number of people under the poverty level)
- PctPopUnderPov (percentage of people under the poverty level)
- PctLess9thGrade (percentage of people with less than a 9th grade education)
- PctNotHSGrad (percentage of people without a high school diploma)
- PctBSorMore (percentage of people with a bachelor's degree or higher)
- PctUnemployed (percentage of people unemployed)
- PctNotSpeakEnglWell (percentage of people who do not speak English well)
- MedNumBR (median number of bedrooms per house)
- PctWOFullPlumb (percentage of houses without full plumbing facilities)
- MedRent (median rent)
- MedRentPctHousInc (median rent as percentage of income)
- NumInShelters (homeless people in shelters)
- NumStreet (homeless people in the street)
- LemasSwornFT (number of full time police officers)
- LemasSwFTPerPop (police officers per 100k inhabitants)
- LemasTotalReq (total requests for police)
- LemasTotReqPerPop (total requests for police per 100k inhabitants)
- PolicPerPop (police officers per 100k inhabitants; how this differs from LemasSwFTPerPop is still to be clarified)
- NumKindsDrugsSeiz (number of different kinds of drugs seized)
- PctUsePubTrans (percentage of people using public means of transport for commuting)
- ViolentCrimesPerPop (total number of violent crimes per 100k inhabitants)

These columns may prove useful when computing our human development index.
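As a minimal sketch of the filtering step, the columns listed above can be collected in a list and used to index a DataFrame. The `toy` frame below is a made-up stand-in for the real dataset (its values and the name `selected_columns` are assumptions for illustration only).

```python
import pandas as pd

# Columns of interest, copied from the list above
selected_columns = [
    "state", "population", "NumUnderPov", "PctPopUnderPov",
    "PctLess9thGrade", "PctNotHSGrad", "PctBSorMore", "PctUnemployed",
    "PctNotSpeakEnglWell", "MedNumBR", "PctWOFullPlumb", "MedRent",
    "MedRentPctHousInc", "NumInShelters", "NumStreet", "LemasSwornFT",
    "LemasSwFTPerPop", "LemasTotalReq", "LemasTotReqPerPop", "PolicPerPop",
    "NumKindsDrugsSeiz", "PctUsePubTrans", "ViolentCrimesPerPop",
]

# Toy frame with made-up values: the selected columns plus two extra ones
toy = pd.DataFrame({name: [0.1, 0.2] for name in selected_columns + ["county", "fold"]})

# Keep only the columns of interest, in the order given above
subset = toy[selected_columns]
print(subset.shape)
```

Indexing with a list of names raises a `KeyError` if any name is misspelled, which is a convenient early check against the column names file.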

In [1]:
from requests import get
import pandas as pd
import numpy as np
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup

In [3]:
# We start by loading the list of states in the USA
states_json = pd.read_json('Data/states.json')
states = []

for row in states_json['data']:
    states.append(row['State'])
    
states_df = pd.DataFrame({"State": states})
states_df.head()

Unnamed: 0,State
0,Alabama
1,Alaska
2,Arizona
3,Arkansas
4,California


The dataset does not include the column names, so we first extract them from the dataset description file.

In [10]:
column_names = []
with open("Data/crimes_column_names.txt", 'r') as f:
    for line in f:
        # Each line has the form "@attribute <name> <type>", e.g. "@attribute state numeric"
        name = " ".join(line.split("@attribute ")[1].split(" ")[:-1])
        column_names.append(name)
print(column_names)        

['state', 'county', 'community', 'communityname', 'fold', 'population', 'householdsize', 'racepctblack', 'racePctWhite', 'racePctAsian', 'racePctHisp', 'agePct12t21', 'agePct12t29', 'agePct16t24', 'agePct65up', 'numbUrban', 'pctUrban', 'medIncome', 'pctWWage', 'pctWFarmSelf', 'pctWInvInc', 'pctWSocSec', 'pctWPubAsst', 'pctWRetire', 'medFamInc', 'perCapInc', 'whitePerCap', 'blackPerCap', 'indianPerCap', 'AsianPerCap', 'OtherPerCap', 'HispPerCap', 'NumUnderPov', 'PctPopUnderPov', 'PctLess9thGrade', 'PctNotHSGrad', 'PctBSorMore', 'PctUnemployed', 'PctEmploy', 'PctEmplManu', 'PctEmplProfServ', 'PctOccupManu', 'PctOccupMgmtProf', 'MalePctDivorce', 'MalePctNevMarr', 'FemalePctDiv', 'TotalPctDiv', 'PersPerFam', 'PctFam2Par', 'PctKids2Par', 'PctYoungKids2Par', 'PctTeen2Par', 'PctWorkMomYoungKids', 'PctWorkMom', 'NumIlleg', 'PctIlleg', 'NumImmig', 'PctImmigRecent', 'PctImmigRec5', 'PctImmigRec8', 'PctImmigRec10', 'PctRecentImmig', 'PctRecImmig5', 'PctRecImmig8', 'PctRecImmig10', 'PctSpeakEnglOn

In [13]:
# Load the crime dataset, using the column names extracted above
crime = pd.read_csv('Data/communities.data', names=column_names)

crime.head()

Unnamed: 0,state,county,community,communityname,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,8,?,?,Lakewoodcity,1,0.19,0.33,0.02,0.9,0.12,...,0.12,0.26,0.2,0.06,0.04,0.9,0.5,0.32,0.14,0.2
1,53,?,?,Tukwilacity,1,0.0,0.16,0.12,0.74,0.45,...,0.02,0.12,0.45,?,?,?,?,0.0,?,0.67
2,24,?,?,Aberdeentown,1,0.0,0.42,0.49,0.56,0.17,...,0.01,0.21,0.02,?,?,?,?,0.0,?,0.43
3,34,5,81440,Willingborotownship,1,0.04,0.77,1.0,0.08,0.12,...,0.02,0.39,0.28,?,?,?,?,0.0,?,0.12
4,42,95,6096,Bethlehemtownship,1,0.01,0.55,0.02,0.95,0.09,...,0.04,0.09,0.02,?,?,?,?,0.0,?,0.03


Just by looking at the head of the dataset, we can see that there are many missing values, marked with `?`.
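One way to quantify this is to let pandas parse the `?` placeholder as NaN and then compute the share of missing entries per column. The sketch below uses two rows from the head above, abridged to five columns; the names `sample`, `toy_names` and `missing_share` are assumptions made for this illustration.

```python
import io
import pandas as pd

# Two rows from the head above, abridged to five columns (no header line)
sample = io.StringIO(
    "8,?,?,Lakewoodcity,0.19\n"
    "34,5,81440,Willingborotownship,0.04\n"
)
toy_names = ["state", "county", "community", "communityname", "population"]

# na_values="?" tells read_csv to treat the placeholder as a missing value
toy = pd.read_csv(sample, names=toy_names, na_values="?")

# Fraction of missing entries in each column
missing_share = toy.isna().mean()
print(missing_share)
```

The same `na_values="?"` argument could be passed when loading `Data/communities.data`, so that numeric columns are not read as strings.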

The first thing we have to check is whether all states are present: there should be 51.

In [18]:
len(crime.state.unique())

46
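To find out which states are absent, the codes present in the `state` column can be compared with a reference list using a set difference. The sketch below runs on a toy frame; the `code_to_name` lookup is a hypothetical, partial mapping that assumes the numeric codes are FIPS state codes, which should be verified against the dataset description.

```python
import pandas as pd

# Hypothetical, partial lookup from numeric state code to state name
# (assumes FIPS state codes; not taken from the dataset files)
code_to_name = {
    1: "Alabama", 2: "Alaska", 8: "Colorado", 24: "Maryland",
    34: "New Jersey", 42: "Pennsylvania", 53: "Washington",
}

# Toy frame standing in for the crime dataset
toy = pd.DataFrame({"state": [8, 53, 24, 34, 42, 8]})

# Codes that actually occur in the data, as plain Python ints
present = set(toy["state"].unique().tolist())

# Codes in the reference list that never occur in the data
missing_codes = sorted(set(code_to_name) - present)
missing_names = [code_to_name[code] for code in missing_codes]
print(missing_names)
```

On the toy data this lists the two states from the lookup that have no rows; on the real dataset it would need a complete lookup covering all 51 expected entries.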