# Plan of work

<ol>
    <li><b>Explore dataset</b></li>
    <ul>
        <li>look at attributes, make assumptions about attributes</li>
        <li>look at missing values</li>
        <li>decide whether to drop columns with too many missing values</li>
        <li>impute missing values</li>
    </ul>
    <br />
    <li> <b>Do some background research</b> </li>
    <ul>
        <li>what do books and articles say about factors that influence crime?</li>
        <li>do we have any of the commonly known factors in our dataset, or can we infer them from the data that we have?</li>
        <li>(if there was time we could also explore possibilities to add more data to our dataset)</li>
    </ul>
    <br />
    <li><b>Choose an output (y) attribute</b></li>
    <ul>
        <li>a number of attributes can be predicted (the rates of murders, robberies, etc.); for simplicity I will try to create a model that predicts only one of the 18 available attributes</li>
        <li>do we even need to predict anything? maybe it would be interesting enough to pinpoint attributes that have some influence on certain crime areas</li>
    </ul>
    <br />
    <li><b>Try some automatic feature selection methods</b></li>
    <ul>
        <li>which variables are correlated with the output?</li>
        <li>random forest feature selection</li>
        <li>forward selection, backward selection, stagewise (a little outdated)</li>
        <li>research other methods</li>
    </ul>
    <br />
    <li><b>Fit some simple models with a subset of variables and evaluate</b></li>
    <ul>
        <li>set evaluation criteria: R-squared? AIC?</li>
        <li>start with the simple models and fit regression, random forests, kNN, some boosting algorithms, SVMs etc., maybe explore some more methods</li>
    </ul>
</ol>

<img src="./imgs/ml_algorithms.png">

Source: https://machinelearningmastery.com/
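
For the evaluation criteria mentioned in step 5 of the plan, here is a minimal sketch of how R-squared and AIC could be computed by hand with NumPy. The data is synthetic and the helper names (`r_squared`, `gaussian_aic`, `fit_predict`) are made up for illustration. The AIC uses the common least-squares form that drops an additive constant and counts only the regression coefficients as parameters, which is one convention among several.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss


def gaussian_aic(y_true, y_pred, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * n_params


def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns fitted values and parameter count."""
    design = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef, design.shape[1]


# Synthetic data (an assumption for illustration): y_demo depends on x1 only
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)  # pure noise feature
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=200)

pred_small, k_small = fit_predict([x1], y_demo)
pred_big, k_big = fit_predict([x1, x2], y_demo)

print("R2 :", r_squared(y_demo, pred_small), r_squared(y_demo, pred_big))
print("AIC:", gaussian_aic(y_demo, pred_small, k_small), gaussian_aic(y_demo, pred_big, k_big))
```

R-squared can only go up when a feature is added to a nested least-squares model, whereas AIC charges a penalty of 2 per extra parameter, so the two criteria can disagree about whether the noise feature is worth keeping.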

# Exploratory analysis

## What we know from the description on the UCI website:
* data comes from these sources: US Census, Law Enforcement Management and Administrative Statistics Survey, FBI Uniform Crime Reporting
* roughly three groups of independent variables: <b>community related</b> (races, urban vs suburban etc.), <b>income</b>, <b>law enforcement</b>
* the FBI considers this data over-simplistic because much relevant information is not included (e.g. number of visitors: communities with a large number of visitors will have higher per capita crime rates)

<b>2 papers mentioned on the website:</b>
* one paper (Empirical Analysis of Case-Editing Approaches for Numeric Prediction) normalizes the attribute values; we will see about that later, but for now I am leaving the data as it is
* the same paper predicts with the kNN method while dropping anomalous and border cases (i.e. rows) to remove noise from the data
* another paper (Fuzzy Association Rule Mining for Community Crime Pattern Discovery) uses the odds ratio to select relevant attributes and omits similar attributes (male divorced, female divorced, etc.).
The authors split attribute values into bins (i.e. low, medium, high) based on "expert knowledge" and statistical knowledge such as the mean and SD. They do not actually create a model but rather a set of rules extracted from the data. Some of the most influential attributes include
  * kids born to never-married parents
  * people living in dense housing
  * people speaking no English
  * people commuting by public transport
  * people living in urban areas
  * etc.

We will definitely want to include these among our features.
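
The binning into low, medium and high described above could look roughly like the sketch below. The cut-offs at one SD on either side of the mean are my assumption (the paper also relied on "expert knowledge", which is not reproduced here), and `bin_by_mean_sd` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def bin_by_mean_sd(values, width=1.0):
    """Label each value low/medium/high using mean +/- width * SD as cut-offs."""
    values = pd.Series(values, dtype=float)
    mean, sd = values.mean(), values.std()
    edges = [-np.inf, mean - width * sd, mean + width * sd, np.inf]
    return pd.cut(values, bins=edges, labels=["low", "medium", "high"])


# Toy attribute values (made up): one clearly low, one clearly high
toy_attribute = [0, 10, 10, 10, 10, 10, 10, 10, 10, 20]
binned = bin_by_mean_sd(toy_attribute)
print(binned.value_counts())
```

A skewed attribute would put most communities into a single bin with these cut-offs, which is probably one reason to combine them with domain knowledge.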

## Get to know the dataset

In [6]:
import pandas as pd
import os
import numpy as np

#increase the column width to print long attribute descriptions
pd.set_option('display.max_colwidth', 100)
DATA_DIR = os.path.abspath(os.path.join(os.getcwd(), "data"))

#load dataset
#parse weka header: the attribute name is the second space-separated token on each line
header = []
with open(os.path.join(DATA_DIR, "unnormalized_header.txt"), "r") as header_file:
    for line in header_file:
        header.append(line.split(" ")[1])

data = pd.read_csv(os.path.join(DATA_DIR, "crime_data_unnormalized.txt"), sep = ",", header = None, names = header)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2215 entries, 0 to 2214
Columns: 147 entries, communityname to nonViolPerPop
dtypes: float64(75), int64(29), object(43)
memory usage: 2.1+ MB


Over 2000 rows and nearly 150 attributes. Some of the attributes are actually dependent variables, so let's separate them from the independent variables.

In [8]:
#split to X and y
y_labels = header[-18:]
y = data[y_labels]

print("Dependent variables:")
print(y.columns.values)

X = data.drop(y_labels, axis = 1)

Dependent variables:
['murders' 'murdPerPop' 'rapes' 'rapesPerPop' 'robberies' 'robbbPerPop'
 'assaults' 'assaultPerPop' 'burglaries' 'burglPerPop' 'larcenies'
 'larcPerPop' 'autoTheft' 'autoTheftPerPop' 'arsons' 'arsonsPerPop'
 'violentPerPop' 'nonViolPerPop']


The above are the variables we will be predicting.

Each category of crime is represented as:
<ul>
    <li>an absolute value</li>
    <li>per capita (per 100,000 inhabitants)</li>
</ul>    
<b>Violent crime</b> aggregates:
<ul>
    <li>murder</li>
    <li>rape</li>
    <li>robbery</li>
    <li>assault</li>
</ul>    
<b>Non-violent crime</b> aggregates:
<ul>
    <li>burglary</li>
    <li>larceny</li>
    <li>auto theft</li>
    <li>arson</li>
</ul>
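
To make the relation between the absolute and the per capita columns concrete, here is a small sketch. The community and all of its numbers are made up, `rate_per_100k` is a hypothetical helper, and treating the violent-crime aggregate as the plain sum of its four components is my assumption about how the dataset was built.

```python
def rate_per_100k(count, population):
    """Convert an absolute crime count into a rate per 100,000 inhabitants."""
    return count * 100_000 / population


# Hypothetical community (all numbers made up for illustration)
population = 48_000
violent_counts = {"murders": 1, "rapes": 5, "robberies": 12, "assaults": 30}

robbery_rate = rate_per_100k(violent_counts["robberies"], population)

# Assumption: the violent-crime aggregate is the plain sum of its four components
violent_rate = rate_per_100k(sum(violent_counts.values()), population)

print(robbery_rate, violent_rate)
```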
  
Let us look at the independent variables.

In [9]:
#load attribute descriptions, keys in the dictionary are dataframe header entries
attribute_descr = {}
with open(os.path.join(DATA_DIR, "header_description.txt"), "r") as header_description:
    #lines are in the same order as the header, so pair them up positionally
    for name, line in zip(header, header_description):
        attribute_descr[name] = line.split(": ")[1].strip()
    
print("Independent variables:")
for name in X.columns.values:
    print(name + " : " + attribute_descr[name])

Independent variables:
communityname : Community name - not predictive - for information only (string)
State : US state (by 2 letter postal abbreviation)(nominal)
countyCode : numeric code for county - not predictive, and many missing values (numeric)
communityCode : numeric code for community - not predictive and many missing values (numeric)
fold : fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric - integer)
pop : population for community
perHoush : mean people per household (numeric - decimal)
pctBlack : percentage of population that is african american (numeric - decimal)
pctWhite : percentage of population that is caucasian (numeric - decimal)
pctAsian : percentage of population that is of asian heritage (numeric - decimal)
pctHisp : percentage of population that is of hispanic heritage (numeric - decimal)
pct12-21 : percentage of population that is 12-21 in age (numeric - decimal)
pct12-29 : percentage of

There will be a lot to filter out since many of the variables are very similar but we will deal with that later.
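
One possible way to spot such near-duplicate variables later is to list pairs of numeric columns with a high absolute Pearson correlation. This is only a sketch: `highly_correlated_pairs` and the 0.9 threshold are my choices, and the toy frame borrows three real column names but fills them with made-up values.

```python
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |Pearson r| above threshold."""
    corr = frame.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy frame (values are made up; only the column names come from the dataset)
toy = pd.DataFrame({
    "pct12-21": [10.0, 12.0, 14.0, 16.0, 18.0],
    "pct12-29": [20.0, 23.0, 27.0, 30.0, 35.0],
    "perHoush": [2.5, 3.1, 2.2, 2.9, 2.4],
})
similar = highly_correlated_pairs(toy)
print(similar)
```

On the real data, one column from each reported pair could then be dropped, ideally after checking the attribute descriptions rather than relying on the correlation alone.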

## Missing values

In [3]:
#replace ? with NaN to mark missing values
data = data.replace("?", np.nan)

#calculate count and percentage of missing values per column
missing_values = data.isnull().sum()
missing_values = pd.DataFrame(missing_values.loc[missing_values != 0], columns = ["count"])
missing_values['percentage'] = round(missing_values['count']/data.shape[0],2)
missing_values = missing_values.sort_values(by = 'count', ascending = False)

print("Missing values")
print(missing_values)

Missing values
                   count  percentage
policCarsAvail      1872        0.85
gangUnit            1872        0.85
policOperBudget     1872        0.85
policAveOT          1872        0.85
numDiffDrugsSeiz    1872        0.85
officDrugUnits      1872        0.85
pctPolicMinority    1872        0.85
pctPolicAsian       1872        0.85
pctPolicHisp        1872        0.85
pctPolicBlack       1872        0.85
pctPolicWhite       1872        0.85
racialMatch         1872        0.85
policePerPop2       1872        0.85
policCallPerOffic   1872        0.85
policCallPerPop     1872        0.85
policeCalls         1872        0.85
policeFieldPerPop   1872        0.85
policeField         1872        0.85
policePerPop        1872        0.85
numPolice           1872        0.85
policBudgetPerPop   1872        0.85
pctPolicPatrol      1872        0.85
communityCode       1224        0.55
countyCode          1221        0.55
violentPerPop        221        0.10
rapesPerPop          20

We will drop all attributes with more than 50% missing values. Unfortunately, these include many of the attributes describing police presence in communities, which could have been quite useful.

In [12]:
#replace ? with NaN to mark missing values
data = data.replace("?", np.nan)

def find_missing_values(data):
    #calculate count and percentage of missing values per column
    missing_values = data.isnull().sum()
    missing_values = pd.DataFrame(missing_values.loc[missing_values != 0], columns = ["count"])
    missing_values['percentage'] = round(missing_values['count']/data.shape[0],2)
    missing_values = missing_values.sort_values(by = 'count', ascending = False)
    return missing_values

missing_values = find_missing_values(data)

#get description of each attribute with missing values
attr_with_missing_values = missing_values.index.tolist()
#append the description to the dataframe
missing_values['description'] = [attribute_descr[key] for key in attr_with_missing_values]

print(missing_values[missing_values["percentage"] > 0.5].loc[:,"description"])

columns_with_too_many_missing = missing_values[missing_values["percentage"] > 0.5]
data = data.drop(columns_with_too_many_missing.index.values, axis = 1)

policCarsAvail                                                  number of police cars (numeric - expected to be integer)
gangUnit             gang unit deployed (numeric - integer - but really nominal - 0 means NO, 10 means YES, 5 means P...
policOperBudget                                                       police operating budget (numeric - may be integer)
policAveOT                                                            police average overtime worked (numeric - decimal)
numDiffDrugsSeiz                            number of different kinds of drugs seized (numeric - expected to be integer)
officDrugUnits                      number of officers assigned to special drug units (numeric - expected to be integer)
pctPolicMinority                                     percent of police that are minority of any kind (numeric - decimal)
pctPolicAsian                                                       percent of police that are asian (numeric - decimal)
pctPolicHisp                    

The rest of the attributes with missing values are dependent variables, so we will deal with those later, depending on which attributes we predict. There is one exception: a single missing value in <b>otherPerCap</b> (per capita income for people with 'other' heritage). It is probably not crucial, but we will impute it just to learn about imputation methods in Python.

In [5]:
print("Remaining missing values")
missing_values = find_missing_values(data)
print(missing_values)

Remaining missing values
                 count  percentage
violentPerPop      221        0.10
rapesPerPop        208        0.09
rapes              208        0.09
nonViolPerPop       97        0.04
arsonsPerPop        91        0.04
arsons              91        0.04
assaults            13        0.01
assaultPerPop       13        0.01
autoTheft            3        0.00
autoTheftPerPop      3        0.00
burglPerPop          3        0.00
larcPerPop           3        0.00
larcenies            3        0.00
burglaries           3        0.00
robbbPerPop          1        0.00
robberies            1        0.00
otherPerCap          1        0.00


## Data imputation (not really necessary but good to know)

Since only one value is missing, we will not try to figure out whether it is missing at random, missing not at random, etc. Apparently, there are all these ways to impute missing data:

<img src='./imgs/imputation_methods.png' width = '50%'>

Source: <a href='https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87'>IBM Watson Data</a>

I experimented with MICE and missForest in R in the past, so this time let's try something new: <b>kNN imputation</b> with the Python package fancyimpute.

Normally, if there were more missing values, I would <b>validate the imputation methods</b>, for example by taking the complete cases, seeding missing values at random, filling them in with various imputation methods, and then calculating some measure of error for each method. This would be overkill in our case.
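
For reference, the validation idea could be sketched as follows with NumPy only. Everything here is an assumption for illustration: the complete cases are synthetic, `mean_impute` and `knn_impute` are simple hand-written stand-ins rather than the fancyimpute implementations, and RMSE on the seeded cells is just one possible error measure.

```python
import numpy as np


def mean_impute(matrix):
    """Fill each NaN with the mean of the observed values in its column."""
    filled = matrix.copy()
    col_means = np.nanmean(matrix, axis=0)
    rows, cols = np.where(np.isnan(matrix))
    filled[rows, cols] = col_means[cols]
    return filled


def knn_impute(matrix, k=5):
    """Fill each NaN with the column average over the k nearest rows.

    Distances use only columns observed in both rows, and a neighbour
    must have the target column observed.
    """
    filled = matrix.copy()
    for i, j in zip(*np.where(np.isnan(matrix))):
        candidates = np.where(~np.isnan(matrix[:, j]))[0]
        diffs = matrix[candidates] - matrix[i]
        dists = np.nanmean(diffs ** 2, axis=1)  # mean squared difference
        nearest = candidates[np.argsort(dists)[:k]]
        filled[i, j] = matrix[nearest, j].mean()
    return filled


def masked_rmse(complete, impute, missing_rate=0.1, seed=0):
    """Seed NaNs at random into complete data, impute, and score the seeded cells."""
    rng = np.random.default_rng(seed)
    mask = rng.random(complete.shape) < missing_rate
    corrupted = complete.copy()
    corrupted[mask] = np.nan
    imputed = impute(corrupted)
    return float(np.sqrt(np.mean((imputed[mask] - complete[mask]) ** 2)))


# Synthetic "complete cases": four strongly correlated columns
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
complete_cases = np.hstack([base + rng.normal(scale=0.1, size=(300, 1)) for _ in range(4)])

rmse_mean = masked_rmse(complete_cases, mean_impute)
rmse_knn = masked_rmse(complete_cases, knn_impute)
print("mean imputation RMSE:", rmse_mean)
print("kNN imputation RMSE: ", rmse_knn)
```

Using the same seed for both methods means they are scored on exactly the same seeded cells, so the two error values are directly comparable.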

Update: fancyimpute has keras and tensorflow as requirements... I don't wanna go there (tensorflow + windows = disaster). This is getting too complicated for one row of data. Deleting the row and moving on.
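
Deleting that single row could be done along these lines. The sketch uses a toy frame with made-up values instead of the real `data`, and `toy_clean` is a hypothetical name. The point is that `dropna` with `subset` removes only rows where otherPerCap is missing and leaves gaps in the dependent variables alone.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; all values are made up
toy = pd.DataFrame({
    "otherPerCap": [10500.0, np.nan, 8700.0],
    "rapes": [3.0, np.nan, np.nan],
})

# Drop only the rows where otherPerCap is missing; gaps in other columns stay
toy_clean = toy.dropna(subset=["otherPerCap"]).reset_index(drop=True)
print(toy_clean)
```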