# Assignment 4 - Predictive Policing
## Author - Salinee Kingbaisomboon
### UW NetID: 1950831

## Instructions
1. Read data.
2. Apply three techniques for feature selection: filter methods, wrapper methods, and embedded methods.
3. Describe your findings.

In [100]:
# Load necessary libraries
import pandas as pd
import numpy as np

from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import RFE # Recursive Feature Elimination (for backward model selection)
from sklearn import linear_model # LASSO
from sklearn.linear_model import LinearRegression
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

import matplotlib.pyplot as plt

import warnings

warnings.filterwarnings("ignore") # To suppress warning

%matplotlib inline

pd.options.display.max_rows = None

# Declare Functions used in this assignment

In [38]:
# Function to replace missing numeric values (encoded as "?") with the column median
def replace_missing_value(x, col):
    # Locate the question marks that encode missing values
    QuestionMark = x.loc[:, col].astype(str) == "?"
    # If there are question marks
    if QuestionMark.sum() > 0:
        # Convert only the current column to numeric; "?" becomes NaN
        x[col] = pd.to_numeric(x[col], errors='coerce')
        # Boolean mask of the NaN entries in the current column
        HasNan = x[col].isna()
        # Median of the current column, ignoring NaN
        Median = np.nanmedian(x[col])
        # Replace the missing values with the median
        x.loc[HasNan, col] = Median
    return x[col]

# Read and perform data cleaning

In [28]:
# Load data
filename = 'http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data'
df = pd.read_csv(filename, header=None)
# Headers from https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names
df.columns = ['state','county','community','communityname','fold','population','householdsize','racepctblack','racePctWhite','racePctAsian',
              'racePctHisp','agePct12t21','agePct12t29','agePct16t24','agePct65up','numbUrban','pctUrban','medIncome','pctWWage',
              'pctWFarmSelf','pctWInvInc','pctWSocSec','pctWPubAsst','pctWRetire','medFamInc','perCapInc','whitePerCap','blackPerCap',
              'indianPerCap','AsianPerCap','OtherPerCap','HispPerCap','NumUnderPov','PctPopUnderPov','PctLess9thGrade','PctNotHSGrad',
              'PctBSorMore','PctUnemployed','PctEmploy','PctEmplManu','PctEmplProfServ','PctOccupManu','PctOccupMgmtProf','MalePctDivorce',
              'MalePctNevMarr','FemalePctDiv','TotalPctDiv','PersPerFam','PctFam2Par','PctKids2Par','PctYoungKids2Par','PctTeen2Par','PctWorkMomYoungKids',
              'PctWorkMom','NumIlleg','PctIlleg','NumImmig','PctImmigRecent','PctImmigRec5','PctImmigRec8','PctImmigRec10','PctRecentImmig',
              'PctRecImmig5','PctRecImmig8','PctRecImmig10','PctSpeakEnglOnly','PctNotSpeakEnglWell','PctLargHouseFam','PctLargHouseOccup',
              'PersPerOccupHous','PersPerOwnOccHous','PersPerRentOccHous','PctPersOwnOccup','PctPersDenseHous','PctHousLess3BR','MedNumBR',
              'HousVacant','PctHousOccup','PctHousOwnOcc','PctVacantBoarded','PctVacMore6Mos','MedYrHousBuilt','PctHousNoPhone','PctWOFullPlumb',
              'OwnOccLowQuart','OwnOccMedVal','OwnOccHiQuart','RentLowQ','RentMedian','RentHighQ','MedRent','MedRentPctHousInc','MedOwnCostPctInc',
              'MedOwnCostPctIncNoMtg','NumInShelters','NumStreet','PctForeignBorn','PctBornSameState','PctSameHouse85','PctSameCity85',
              'PctSameState85','LemasSwornFT','LemasSwFTPerPop','LemasSwFTFieldOps','LemasSwFTFieldPerPop','LemasTotalReq','LemasTotReqPerPop',
              'PolicReqPerOffic','PolicPerPop','RacialMatchCommPol','PctPolicWhite','PctPolicBlack','PctPolicHisp','PctPolicAsian','PctPolicMinor',
              'OfficAssgnDrugUnits','NumKindsDrugsSeiz','PolicAveOTWorked','LandArea','PopDens','PctUsePubTrans','PolicCars','PolicOperBudg',
              'LemasPctPolicOnPatr','LemasGangUnitDeploy','LemasPctOfficDrugUn','PolicBudgPerPop','ViolentCrimesPerPop'
             ]

The target attribute is described under the **Attribute Information** section of https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names:

<font color=blue>**ViolentCrimesPerPop:**</font> total number of violent crimes per 100K population (numeric - decimal) GOAL attribute (to be predicted)

**Note:** According to https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names, all numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Therefore, we don't need to **normalize** this data set.

# Drop the unused columns

- <font color=red>**state:**</font> US state (by number) - not counted as predictive above, but if considered, should be considered nominal (nominal)
- <font color=red>**county:**</font> numeric code for county - not predictive, and many missing values (numeric)
- <font color=red>**community:**</font> numeric code for community - not predictive and many missing values (numeric)
- <font color=red>**communityname:**</font> community name - not predictive - for information only (string)
- <font color=red>**fold:**</font> fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)

In [29]:
# Drop the five non-predictive identifier columns in one call
df.drop(columns=['state', 'county', 'community', 'communityname', 'fold'], inplace=True)

# Replace missing values with the median of each column (except ViolentCrimesPerPop - Goal attribute)
Data exploration showed that several columns have missing values. According to https://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names, the LEMAS survey covered police departments with at least 100 officers, plus a random sample of smaller departments, so many communities are missing LEMAS data.

The remedy I use is to **replace each missing value with the median** of the non-missing values in its column.

In [41]:
# Columns that contain missing values, encoded as "?"
for col in ['OtherPerCap','LemasSwornFT','LemasSwFTPerPop','LemasSwFTFieldOps','LemasSwFTFieldPerPop','LemasTotalReq',
            'LemasTotReqPerPop','PolicReqPerOffic','PolicPerPop','RacialMatchCommPol','PctPolicWhite','PctPolicBlack',
            'PctPolicHisp','PctPolicAsian','PctPolicMinor','OfficAssgnDrugUnits','NumKindsDrugsSeiz','PolicAveOTWorked',
            'PolicCars','PolicOperBudg','LemasPctPolicOnPatr','LemasGangUnitDeploy','PolicBudgPerPop']:
    # Replace missing values with Median for current column without Nan
    df.loc[:, col] = replace_missing_value(df, col)
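
As a cross-check on the loop above, the same median imputation can be sketched in vectorized form. This is only a minimal sketch on a made-up two-column frame: the name `toy` and its values are assumptions for illustration and are not taken from the crime data. `pd.to_numeric` with `errors="coerce"` turns `?` into NaN, and `fillna` with the per-column medians fills the gaps.

```python
import pandas as pd

# Toy frame standing in for the crime data; "?" marks a missing value
toy = pd.DataFrame({
    "PolicCars": ["0.1", "?", "0.3", "0.5"],
    "LandArea": [0.2, 0.4, 0.6, 0.8],
})

# Coerce every column to a numeric dtype; anything unparseable ("?") becomes NaN
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

# Fill each column's NaNs with that column's median (median() skips NaN by default)
toy_imputed = toy_numeric.fillna(toy_numeric.median())

print(toy_imputed)
```

On the real frame, the same two steps would presumably be applied to `df` after the identifier columns are dropped; columns without `?` are left unchanged by both calls.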

In [43]:
# View first five rows of the data frame
df.head()

Unnamed: 0,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,agePct16t24,agePct65up,numbUrban,pctUrban,medIncome,pctWWage,pctWFarmSelf,pctWInvInc,pctWSocSec,pctWPubAsst,pctWRetire,medFamInc,perCapInc,whitePerCap,blackPerCap,indianPerCap,AsianPerCap,OtherPerCap,HispPerCap,NumUnderPov,PctPopUnderPov,PctLess9thGrade,PctNotHSGrad,PctBSorMore,PctUnemployed,PctEmploy,PctEmplManu,PctEmplProfServ,PctOccupManu,PctOccupMgmtProf,MalePctDivorce,MalePctNevMarr,FemalePctDiv,TotalPctDiv,PersPerFam,PctFam2Par,PctKids2Par,PctYoungKids2Par,PctTeen2Par,PctWorkMomYoungKids,PctWorkMom,NumIlleg,PctIlleg,NumImmig,PctImmigRecent,PctImmigRec5,PctImmigRec8,PctImmigRec10,PctRecentImmig,PctRecImmig5,PctRecImmig8,PctRecImmig10,PctSpeakEnglOnly,PctNotSpeakEnglWell,PctLargHouseFam,PctLargHouseOccup,PersPerOccupHous,PersPerOwnOccHous,PersPerRentOccHous,PctPersOwnOccup,PctPersDenseHous,PctHousLess3BR,MedNumBR,HousVacant,PctHousOccup,PctHousOwnOcc,PctVacantBoarded,PctVacMore6Mos,MedYrHousBuilt,PctHousNoPhone,PctWOFullPlumb,OwnOccLowQuart,OwnOccMedVal,OwnOccHiQuart,RentLowQ,RentMedian,RentHighQ,MedRent,MedRentPctHousInc,MedOwnCostPctInc,MedOwnCostPctIncNoMtg,NumInShelters,NumStreet,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LemasSwornFT,LemasSwFTPerPop,LemasSwFTFieldOps,LemasSwFTFieldPerPop,LemasTotalReq,LemasTotReqPerPop,PolicReqPerOffic,PolicPerPop,RacialMatchCommPol,PctPolicWhite,PctPolicBlack,PctPolicHisp,PctPolicAsian,PctPolicMinor,OfficAssgnDrugUnits,NumKindsDrugsSeiz,PolicAveOTWorked,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,0.19,0.33,0.02,0.9,0.12,0.17,0.34,0.47,0.29,0.32,0.2,1.0,0.37,0.72,0.34,0.6,0.29,0.15,0.43,0.39,0.4,0.39,0.32,0.27,0.27,0.36,0.41,0.08,0.19,0.1,0.18,0.48,0.27,0.68,0.23,0.41,0.25,0.52,0.68,0.4,0.75,0.75,0.35,0.55,0.59,0.61,0.56,0.74,0.76,0.04,0.14,0.03,0.24,0.27,0.37,0.39,0.07,0.07,0.08,0.08,0.89,0.06,0.14,0.13,0.33,0.39,0.28,0.55,0.09,0.51,0.5,0.21,0.71,0.52,0.05,0.26,0.65,0.14,0.06,0.22,0.19,0.18,0.36,0.35,0.38,0.34,0.38,0.46,0.25,0.04,0.0,0.12,0.42,0.5,0.51,0.64,0.03,0.13,0.96,0.17,0.06,0.18,0.44,0.13,0.94,0.93,0.03,0.07,0.1,0.07,0.02,0.57,0.29,0.12,0.26,0.2,0.06,0.04,0.9,0.5,0.32,0.14,0.2
1,0.0,0.16,0.12,0.74,0.45,0.07,0.26,0.59,0.35,0.27,0.02,1.0,0.31,0.72,0.11,0.45,0.25,0.29,0.39,0.29,0.37,0.38,0.33,0.16,0.3,0.22,0.35,0.01,0.24,0.14,0.24,0.3,0.27,0.73,0.57,0.15,0.42,0.36,1.0,0.63,0.91,1.0,0.29,0.43,0.47,0.6,0.39,0.46,0.53,0.0,0.24,0.01,0.52,0.62,0.64,0.63,0.25,0.27,0.25,0.23,0.84,0.1,0.16,0.1,0.17,0.29,0.17,0.26,0.2,0.82,0.0,0.02,0.79,0.24,0.02,0.25,0.65,0.16,0.0,0.21,0.2,0.21,0.42,0.38,0.4,0.37,0.29,0.32,0.18,0.0,0.0,0.21,0.5,0.34,0.6,0.52,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.02,0.12,0.45,0.08,0.03,0.75,0.5,0.0,0.15,0.67
2,0.0,0.42,0.49,0.56,0.17,0.04,0.39,0.47,0.28,0.32,0.0,0.0,0.3,0.58,0.19,0.39,0.38,0.4,0.84,0.28,0.27,0.29,0.27,0.07,0.29,0.28,0.39,0.01,0.27,0.27,0.43,0.19,0.36,0.58,0.32,0.29,0.49,0.32,0.63,0.41,0.71,0.7,0.45,0.42,0.44,0.43,0.43,0.71,0.67,0.01,0.46,0.0,0.07,0.06,0.15,0.19,0.02,0.02,0.04,0.05,0.88,0.04,0.2,0.2,0.46,0.52,0.43,0.42,0.15,0.51,0.5,0.01,0.86,0.41,0.29,0.3,0.52,0.47,0.45,0.18,0.17,0.16,0.27,0.29,0.27,0.31,0.48,0.39,0.28,0.0,0.0,0.14,0.49,0.54,0.67,0.56,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.01,0.21,0.02,0.08,0.03,0.75,0.5,0.0,0.15,0.43
3,0.04,0.77,1.0,0.08,0.12,0.1,0.51,0.5,0.34,0.21,0.06,1.0,0.58,0.89,0.21,0.43,0.36,0.2,0.82,0.51,0.36,0.4,0.39,0.16,0.25,0.36,0.44,0.01,0.1,0.09,0.25,0.31,0.33,0.71,0.36,0.45,0.37,0.39,0.34,0.45,0.49,0.44,0.75,0.65,0.54,0.83,0.65,0.85,0.86,0.03,0.33,0.02,0.11,0.2,0.3,0.31,0.05,0.08,0.11,0.11,0.81,0.08,0.56,0.62,0.85,0.77,1.0,0.94,0.12,0.01,0.5,0.01,0.97,0.96,0.6,0.47,0.52,0.11,0.11,0.24,0.21,0.19,0.75,0.7,0.77,0.89,0.63,0.51,0.47,0.0,0.0,0.19,0.3,0.73,0.64,0.65,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.02,0.39,0.28,0.08,0.03,0.75,0.5,0.0,0.15,0.12
4,0.01,0.55,0.02,0.95,0.09,0.05,0.38,0.38,0.23,0.36,0.02,0.9,0.5,0.72,0.16,0.68,0.44,0.11,0.71,0.46,0.43,0.41,0.28,0.0,0.74,0.51,0.48,0.0,0.06,0.25,0.3,0.33,0.12,0.65,0.67,0.38,0.42,0.46,0.22,0.27,0.2,0.21,0.51,0.91,0.91,0.89,0.85,0.4,0.6,0.0,0.06,0.0,0.03,0.07,0.2,0.27,0.01,0.02,0.04,0.05,0.88,0.05,0.16,0.19,0.59,0.6,0.37,0.89,0.02,0.19,0.5,0.01,0.89,0.87,0.04,0.55,0.73,0.05,0.14,0.31,0.31,0.3,0.4,0.36,0.38,0.38,0.22,0.51,0.21,0.0,0.0,0.11,0.72,0.64,0.61,0.53,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.04,0.09,0.02,0.08,0.03,0.75,0.5,0.0,0.15,0.03


In [44]:
# Print DataFrame's size
print(df.shape)
# Print DataFrame's data types
# Note: all columns are numeric now (after the missing-value replacement)
print(df.dtypes)

(1994, 123)
population               float64
householdsize            float64
racepctblack             float64
racePctWhite             float64
racePctAsian             float64
racePctHisp              float64
agePct12t21              float64
agePct12t29              float64
agePct16t24              float64
agePct65up               float64
numbUrban                float64
pctUrban                 float64
medIncome                float64
pctWWage                 float64
pctWFarmSelf             float64
pctWInvInc               float64
pctWSocSec               float64
pctWPubAsst              float64
pctWRetire               float64
medFamInc                float64
perCapInc                float64
whitePerCap              float64
blackPerCap              float64
indianPerCap             float64
AsianPerCap              float64
OtherPerCap              float64
HispPerCap               float64
NumUnderPov              float64
PctPopUnderPov           float64
PctLess9thGrade          float6

In [46]:
# Describe the DataFrame
df.describe()

Unnamed: 0,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,agePct16t24,agePct65up,numbUrban,pctUrban,medIncome,pctWWage,pctWFarmSelf,pctWInvInc,pctWSocSec,pctWPubAsst,pctWRetire,medFamInc,perCapInc,whitePerCap,blackPerCap,indianPerCap,AsianPerCap,OtherPerCap,HispPerCap,NumUnderPov,PctPopUnderPov,PctLess9thGrade,PctNotHSGrad,PctBSorMore,PctUnemployed,PctEmploy,PctEmplManu,PctEmplProfServ,PctOccupManu,PctOccupMgmtProf,MalePctDivorce,MalePctNevMarr,FemalePctDiv,TotalPctDiv,PersPerFam,PctFam2Par,PctKids2Par,PctYoungKids2Par,PctTeen2Par,PctWorkMomYoungKids,PctWorkMom,NumIlleg,PctIlleg,NumImmig,PctImmigRecent,PctImmigRec5,PctImmigRec8,PctImmigRec10,PctRecentImmig,PctRecImmig5,PctRecImmig8,PctRecImmig10,PctSpeakEnglOnly,PctNotSpeakEnglWell,PctLargHouseFam,PctLargHouseOccup,PersPerOccupHous,PersPerOwnOccHous,PersPerRentOccHous,PctPersOwnOccup,PctPersDenseHous,PctHousLess3BR,MedNumBR,HousVacant,PctHousOccup,PctHousOwnOcc,PctVacantBoarded,PctVacMore6Mos,MedYrHousBuilt,PctHousNoPhone,PctWOFullPlumb,OwnOccLowQuart,OwnOccMedVal,OwnOccHiQuart,RentLowQ,RentMedian,RentHighQ,MedRent,MedRentPctHousInc,MedOwnCostPctInc,MedOwnCostPctIncNoMtg,NumInShelters,NumStreet,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LemasSwornFT,LemasSwFTPerPop,LemasSwFTFieldOps,LemasSwFTFieldPerPop,LemasTotalReq,LemasTotReqPerPop,PolicReqPerOffic,PolicPerPop,RacialMatchCommPol,PctPolicWhite,PctPolicBlack,PctPolicHisp,PctPolicAsian,PctPolicMinor,OfficAssgnDrugUnits,NumKindsDrugsSeiz,PolicAveOTWorked,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
count,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0,1994.0
mean,0.057593,0.463395,0.179629,0.753716,0.153681,0.144022,0.424218,0.493867,0.336264,0.423164,0.064072,0.696269,0.361123,0.558154,0.29157,0.495687,0.471133,0.317778,0.479248,0.375677,0.350251,0.368049,0.291098,0.203506,0.322357,0.284724,0.386279,0.055507,0.303024,0.315807,0.38333,0.361675,0.363531,0.501073,0.396384,0.440597,0.391224,0.441339,0.461244,0.434453,0.487568,0.494273,0.487748,0.610918,0.620657,0.664032,0.582884,0.501449,0.52669,0.036294,0.249995,0.03006,0.320211,0.360622,0.399077,0.427879,0.181364,0.182126,0.184774,0.182879,0.785903,0.150587,0.267608,0.251891,0.462101,0.494428,0.404097,0.562598,0.186264,0.495186,0.314694,0.076815,0.719549,0.548686,0.204529,0.433335,0.494178,0.264478,0.243059,0.264689,0.26349,0.268942,0.346379,0.372457,0.422964,0.384102,0.490125,0.449754,0.403816,0.029438,0.022778,0.215552,0.608892,0.53505,0.626424,0.65153,0.027944,0.185993,0.962758,0.215812,0.049278,0.177232,0.298581,0.185998,0.731906,0.771515,0.136073,0.071976,0.018375,0.209468,0.045687,0.567768,0.267357,0.065231,0.232854,0.161685,0.093295,0.037472,0.741775,0.490471,0.094052,0.157212,0.237979
std,0.126906,0.163717,0.253442,0.244039,0.208877,0.232492,0.155196,0.143564,0.166505,0.179185,0.128256,0.444811,0.209362,0.182913,0.204108,0.178071,0.173619,0.222137,0.167564,0.198257,0.191109,0.186804,0.171593,0.164775,0.195411,0.190962,0.183081,0.127941,0.228474,0.21336,0.202508,0.209193,0.202171,0.174036,0.202386,0.175457,0.198922,0.186292,0.18246,0.175437,0.17517,0.183607,0.154594,0.201976,0.206353,0.218749,0.191507,0.168612,0.175241,0.108671,0.229946,0.087189,0.219088,0.210924,0.201498,0.19497,0.235792,0.236333,0.236739,0.234822,0.226869,0.219716,0.196567,0.190709,0.169551,0.157924,0.189301,0.197087,0.209956,0.172508,0.255182,0.150465,0.194024,0.185204,0.21777,0.188986,0.232467,0.242847,0.206295,0.224425,0.231542,0.235252,0.219323,0.209278,0.248286,0.213404,0.1695,0.187274,0.192593,0.102607,0.1004,0.231134,0.204329,0.181352,0.200521,0.198221,0.058143,0.065343,0.055373,0.063349,0.068087,0.067815,0.081182,0.06535,0.092479,0.090554,0.101313,0.083035,0.100105,0.092713,0.049733,0.08132,0.092184,0.109459,0.203092,0.229055,0.091044,0.058566,0.087514,0.163564,0.240328,0.067841,0.232985
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.01,0.35,0.02,0.63,0.04,0.01,0.34,0.41,0.25,0.3,0.0,0.0,0.2,0.44,0.16,0.37,0.35,0.1425,0.36,0.23,0.22,0.24,0.1725,0.11,0.19,0.17,0.26,0.01,0.11,0.16,0.23,0.21,0.22,0.38,0.25,0.32,0.24,0.31,0.33,0.31,0.36,0.36,0.4,0.49,0.49,0.53,0.48,0.39,0.42,0.0,0.09,0.0,0.16,0.2,0.25,0.28,0.03,0.03,0.03,0.03,0.73,0.03,0.15,0.14,0.34,0.39,0.27,0.44,0.06,0.4,0.0,0.01,0.63,0.43,0.06,0.29,0.35,0.06,0.1,0.09,0.09,0.09,0.17,0.2,0.22,0.21,0.37,0.32,0.25,0.0,0.0,0.06,0.47,0.42,0.52,0.56,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.02,0.1,0.02,0.08,0.03,0.75,0.5,0.0,0.15,0.07
50%,0.02,0.44,0.06,0.85,0.07,0.04,0.4,0.48,0.29,0.42,0.03,1.0,0.32,0.56,0.23,0.48,0.475,0.26,0.47,0.33,0.3,0.32,0.25,0.17,0.28,0.25,0.345,0.02,0.25,0.27,0.36,0.31,0.32,0.51,0.37,0.41,0.37,0.4,0.47,0.4,0.5,0.5,0.47,0.63,0.64,0.7,0.61,0.51,0.54,0.01,0.17,0.01,0.29,0.34,0.39,0.43,0.09,0.08,0.09,0.09,0.87,0.06,0.2,0.19,0.44,0.48,0.36,0.56,0.11,0.51,0.5,0.03,0.77,0.54,0.13,0.42,0.52,0.185,0.19,0.18,0.17,0.18,0.31,0.33,0.37,0.34,0.48,0.45,0.37,0.0,0.0,0.13,0.63,0.54,0.67,0.7,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.04,0.17,0.07,0.08,0.03,0.75,0.5,0.0,0.15,0.15
75%,0.05,0.54,0.23,0.94,0.17,0.16,0.47,0.54,0.36,0.53,0.07,1.0,0.49,0.69,0.37,0.62,0.58,0.44,0.58,0.48,0.43,0.44,0.38,0.25,0.4,0.36,0.48,0.05,0.45,0.42,0.51,0.46,0.48,0.6275,0.52,0.53,0.51,0.54,0.59,0.5,0.62,0.63,0.56,0.76,0.78,0.84,0.72,0.62,0.65,0.02,0.32,0.02,0.43,0.48,0.53,0.56,0.23,0.23,0.23,0.23,0.94,0.16,0.31,0.29,0.55,0.58,0.49,0.7,0.22,0.6,0.5,0.07,0.86,0.67,0.27,0.56,0.67,0.42,0.33,0.4,0.39,0.38,0.49,0.52,0.59,0.53,0.59,0.58,0.51,0.01,0.0,0.28,0.7775,0.66,0.77,0.79,0.02,0.18,0.97,0.21,0.04,0.17,0.29,0.18,0.74,0.78,0.12,0.06,0.0,0.2,0.04,0.57,0.26,0.07,0.28,0.19,0.08,0.03,0.75,0.5,0.0,0.15,0.33
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# Perform Feature Selection

## 1. Filter Method
<font color=blue>**Mutual Information**</font> between the target **ViolentCrimesPerPop (Y in this case)** and each numeric **X** variable.

**Note:** any feature whose **Mutual Information** is greater than or equal to **0.4** is printed in <font color=green>**Green**</font>, since that means a relatively large "amount of information" about **Y** is obtained through that **X** feature.

In [93]:
for col in df.columns.difference(['ViolentCrimesPerPop']):
    print(f'\033[1m\033[4mCorrelation and Mutual Information between {col} and ViolentCrimesPerPop\033[0m')
    # Calculate the correlation between x and y
    corr = np.corrcoef(df[col], df['ViolentCrimesPerPop'])[0, 1]
    # To calculate the mutual information, we first bin x and y into a 2-dimensional histogram (20 x 20 bins)
    c_xy = np.histogram2d(df[col], df['ViolentCrimesPerPop'], 20)[0]
    # Use mutual_info_score with the histogram as the contingency table to get the mutual information
    mi = mutual_info_score(None, None, contingency=c_xy)
    print(("\033[92m" if mi >= 0.4 else "") + "Correlation between X and Y is %.2f"%corr + ("\x1b[0m" if mi >= 0.4 else ""))
    print(("\033[92m" if mi >= 0.4 else "") + "Mutual information=%.2f"%mi + ("\x1b[0m" if mi >= 0.4 else ""))
    print('------------------------------------------------------------------------------')

Correlation and Mutual Information between AsianPerCap and ViolentCrimesPerPop
Correlation between X and Y is -0.16
Mutual information=0.14
------------------------------------------------------------------------------
Correlation and Mutual Information between FemalePctDiv and ViolentCrimesPerPop
Correlation between X and Y is 0.56
Mutual information=0.33
------------------------------------------------------------------------------
Correlation and Mutual Information between HispPerCap and ViolentCrimesPerPop
Correlation between X and Y is -0.24
Mutual information=0.16
------------------------------------------------------------------------------
Correlation and Mutual Information between HousVacant and ViolentCrimesPerPop
Correlation between X and Y is 0.42
Mutual information=0.20
------------------------------------------------------------------------------
Correlation and Mutual Information between LandArea and ViolentCrimesPe

Correlation between X and Y is -0.47
Mutual information=0.26
------------------------------------------------------------------------------
Correlation and Mutual Information between PctIlleg and ViolentCrimesPerPop
Correlation between X and Y is 0.74
Mutual information=0.47
------------------------------------------------------------------------------
Correlation and Mutual Information between PctImmigRec10 and ViolentCrimesPerPop
Correlation between X and Y is 0.29
Mutual information=0.16
------------------------------------------------------------------------------
Correlation and Mutual Information between PctImmigRec5 and ViolentCrimesPerPop
Correlation between X and Y is 0.22
Mutual information=0.14
------------------------------------------------------------------------------
Correlation and Mutual Information between PctImmigRec8 and ViolentCrimesPerPop
Correlation between X and Y is 0.25
Mutual information=0.16


Correlation between X and Y is 0.15
Mutual information=0.14
------------------------------------------------------------------------------
Correlation and Mutual Information between agePct16t24 and ViolentCrimesPerPop
Correlation between X and Y is 0.10
Mutual information=0.16
------------------------------------------------------------------------------
Correlation and Mutual Information between agePct65up and ViolentCrimesPerPop
Correlation between X and Y is 0.07
Mutual information=0.10
------------------------------------------------------------------------------
Correlation and Mutual Information between blackPerCap and ViolentCrimesPerPop
Correlation between X and Y is -0.28
Mutual information=0.19
------------------------------------------------------------------------------
Correlation and Mutual Information between householdsize and ViolentCrimesPerPop
Correlation between X and Y is -0.03
Mutual information=0.11
-----------------
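
Because the printed listing is long, it may help to collect the scores into a single sorted table before applying the 0.4 threshold. The following is a minimal sketch on synthetic data: the helper `binned_mutual_info`, the frame `demo`, and its columns are hypothetical names introduced here, and only the 20-bin histogram approach and the 0.4 cutoff come from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def binned_mutual_info(x, y, bins=20):
    """Mutual information (in nats) estimated from a 2-D histogram of x and y."""
    contingency = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=contingency)

# Synthetic stand-in for the data set: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "informative": rng.uniform(0, 1, 2000),
    "noise": rng.uniform(0, 1, 2000),
})
demo["target"] = demo["informative"] ** 2 + rng.normal(0, 0.05, 2000)

# Score every feature against the target and sort from most to least informative
mi_scores = pd.Series({
    col: binned_mutual_info(demo[col], demo["target"])
    for col in demo.columns.difference(["target"])
}).sort_values(ascending=False)

print(mi_scores)
# Features that pass the same threshold used in this notebook
print(mi_scores[mi_scores >= 0.4].index.tolist())
```

Note that a histogram estimate is biased upward when there are many bins and few samples, so even an unrelated feature gets a small positive score. This may be one reason a threshold well above zero is useful.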

## 2. Wrapper Method
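I decided to use <font color=blue>**Backward Stepwise Model Selection**</font>, which means we start with a model that includes all features and remove one feature on each iteration. In classic backward selection, the feature whose removal increases the **Residual Sum of Squares (RSS)** the least is the one removed. The code below uses scikit-learn's **RFE** (Recursive Feature Elimination) for this step, which instead refits the model each round and drops the feature with the smallest absolute coefficient.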

In [99]:
# Create linear regression model as the estimator
estimator = LinearRegression()
# Keep 5 features, the same number the Filter method selected (colored in green)
selector = RFE(estimator, n_features_to_select=5, step=1)
# Learn from this dataset
targetOutcome = pd.DataFrame(df,columns=['ViolentCrimesPerPop'])
allInputs = pd.DataFrame(df,columns=df.columns.difference(['ViolentCrimesPerPop']))
selector = selector.fit(allInputs, targetOutcome)
# The mask of selected features (which variables are selected)
print(selector.support_)
# Selected features are ranked 1. The variable with the highest rank is the one that is removed first
print(selector.ranking_)
# Print the name of selected features
f = selector.get_support(True) # the most important features
print('\033[1m\033[4mSelected features from Backward Stepwise Model Selection are: \033[0m')
for f_index in f:
    # The indices refer to allInputs, whose columns are sorted alphabetically by difference(), not to df
    print(allInputs.columns[f_index])

[False False False False False False False False False False  True False
 False False  True False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False  True False False False False False
 False  True False False False False False False False False False  True
 False False False False False False False False False False False False
 False False]
[102   2  88   9  94  89 114  46  15  45   1  14 112  25   1  43  97  77
  44  18  78 108  49  24  34 100  22  52  47  72 117   8   7  95 113  74
 106  38  99  29  62 109  76  12  27  85 101  86  92   4  50  81  64  68
  30  73  54  16  11  56  31  32  33 
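
For comparison, the RSS-based backward selection described at the start of this section can be sketched by hand. This is a minimal sketch, not the procedure that `RFE` runs: the function `backward_eliminate_by_rss` and the synthetic frame `X_demo` are hypothetical names introduced for illustration. In each round, every remaining feature is dropped in turn, the model is refitted, and the feature whose absence leaves the smallest RSS is removed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def backward_eliminate_by_rss(X, y, n_keep):
    """Drop one column per round: the one whose removal raises the RSS the least."""
    kept = list(X.columns)
    while len(kept) > n_keep:
        rss_without = {}
        for candidate in kept:
            trial = [c for c in kept if c != candidate]
            model = LinearRegression().fit(X[trial], y)
            residuals = y - model.predict(X[trial])
            rss_without[candidate] = float(np.sum(residuals ** 2))
        # The feature whose absence hurts the fit least is the one to remove
        kept.remove(min(rss_without, key=rss_without.get))
    return kept

# Synthetic data: only f0 and f3 drive the target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f3"] + rng.normal(scale=0.1, size=300)

print(backward_eliminate_by_rss(X_demo, y_demo, n_keep=2))
```

With 122 features this refits the model thousands of times, which is slower than `RFE`. In exchange, the criterion does not depend on coefficient magnitudes, which can be inflated when features are strongly correlated.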

## 3. Embedded Method
I will use **LASSO** to select features on this data set.

In [120]:
# Strength of the L1-norm penalty (regularization parameter)
alpha = 0.007

# Learn from this dataset
targetOutcome = pd.DataFrame(df,columns=['ViolentCrimesPerPop'])
allInputs = pd.DataFrame(df,columns=df.columns.difference(['ViolentCrimesPerPop']))

clf = linear_model.Lasso(alpha=alpha)
clf.fit(allInputs, targetOutcome)

print(clf.coef_)
print(clf.intercept_)
print("Sum of square of coefficients = %.2f"%np.sum(clf.coef_**2))

[-0.          0.         -0.          0.          0.          0.
  0.00247801 -0.         -0.          0.          0.          0.
  0.          0.          0.          0.         -0.          0.
 -0.         -0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.          0.          0.          0.
 -0.         -0.          0.20680625  0.          0.          0.
  0.         -0.31050398  0.          0.          0.          0.
  0.          0.         -0.          0.         -0.          0.
  0.          0.          0.         -0.          0.          0.
  0.          0.          0.          0.         -0.         -0.
 -0.         -0.          0.          0.         -0.          0.
  0.         -0.         -0.         -0.          0.         -0.
 -0.          0.          0.          0.          0.          0.
  0.          0.         

In [121]:
print('\033[1m\033[4mSelected features using LASSO (non zero value of the coefficient) are: \033[0m')
list(zip(clf.coef_, allInputs))

Selected features using LASSO (non zero value of the coefficient) are: 


[(-0.0, 'AsianPerCap'),
 (0.0, 'FemalePctDiv'),
 (-0.0, 'HispPerCap'),
 (0.0, 'HousVacant'),
 (0.0, 'LandArea'),
 (0.0, 'LemasGangUnitDeploy'),
 (0.002478005250279332, 'LemasPctOfficDrugUn'),
 (-0.0, 'LemasPctPolicOnPatr'),
 (-0.0, 'LemasSwFTFieldOps'),
 (0.0, 'LemasSwFTFieldPerPop'),
 (0.0, 'LemasSwFTPerPop'),
 (0.0, 'LemasSwornFT'),
 (0.0, 'LemasTotReqPerPop'),
 (0.0, 'LemasTotalReq'),
 (0.0, 'MalePctDivorce'),
 (0.0, 'MalePctNevMarr'),
 (-0.0, 'MedNumBR'),
 (0.0, 'MedOwnCostPctInc'),
 (-0.0, 'MedOwnCostPctIncNoMtg'),
 (-0.0, 'MedRent'),
 (0.0, 'MedRentPctHousInc'),
 (0.0, 'MedYrHousBuilt'),
 (0.0, 'NumIlleg'),
 (0.0, 'NumImmig'),
 (0.0, 'NumInShelters'),
 (0.0, 'NumKindsDrugsSeiz'),
 (0.0, 'NumStreet'),
 (0.0, 'NumUnderPov'),
 (0.0, 'OfficAssgnDrugUnits'),
 (0.0, 'OtherPerCap'),
 (-0.0, 'OwnOccHiQuart'),
 (-0.0, 'OwnOccLowQuart'),
 (-0.0, 'OwnOccMedVal'),
 (-0.0, 'PctBSorMore'),
 (-0.0, 'PctBornSameState'),
 (-0.0, 'PctEmplManu'),
 (-0.0, 'PctEmplProfServ'),
 (-0.0, 'PctEmploy'),
 (
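
Since most coefficients in the listing are zero, it can be easier to print only the features LASSO keeps and to see how that set reacts to `alpha`. The following is a minimal sketch on synthetic data: the helper `lasso_selected`, the frame `X_demo`, and the extra alpha values 0.001 and 0.1 are assumptions made for illustration, and only `alpha = 0.007` comes from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic features in [0, 1], mimicking the normalized data; only f1 and f5 matter
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(500, 8)), columns=[f"f{i}" for i in range(8)])
y_demo = 0.8 * X_demo["f1"] - 0.6 * X_demo["f5"] + rng.normal(scale=0.05, size=500)

def lasso_selected(X, y, alpha):
    """Names and coefficients of the features LASSO keeps (non-zero weights)."""
    model = Lasso(alpha=alpha).fit(X, y)
    coefs = pd.Series(model.coef_, index=X.columns)
    return coefs[coefs != 0]

# A larger alpha means a stronger penalty and therefore fewer surviving features
for alpha in (0.001, 0.007, 0.1):
    kept = lasso_selected(X_demo, y_demo, alpha)
    print(f"alpha={alpha}: {len(kept)} features kept -> {list(kept.index)}")
```

The number of selected features is therefore a consequence of the chosen `alpha` rather than a fixed property of the data. A cross-validated search such as scikit-learn's `LassoCV` is one common way to pick it.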

***
**Summary:**
1. Many values are missing in this data set, so I **replaced the missing values** in each affected column with that column's **median**.
2. Feature Selection:
    1. **Filter Method** with **Mutual Information** greater than or equal to 0.4 as the criterion.
        - Result: 5 of the 122 features selected:
            - PctYoungKids2Par
            - racePctWhite
            - PctFam2Par
            - PctIlleg
            - PctKids2Par
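    2. **Wrapper Method** with **Backward Stepwise Model Selection** (RFE).
        - Result: 5 of the 122 features selected. The names below are read from the printed support mask (positions 10, 14, 90, 97 and 107) against the alphabetically sorted input columns that were passed to `RFE`:
            - LemasSwFTPerPop
            - MalePctDivorce
            - PolicPerPop
            - TotalPctDiv
            - numbUrban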
    3. **Embedded Method** with **LASSO** and alpha = 0.007.
        - Result: 5 of the 122 features selected (those with non-zero coefficients):
            - LemasPctOfficDrugUn
            - PctIlleg
            - PctKids2Par
            - pctUrban
            - racePctWhite
3. Each method returns a different set of **selected features**. Therefore, we can't rely on a single method or model; we have to try a variety of approaches to find the best one.
***
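
To quantify how much two of the selections agree, their overlap can be computed directly. This is a small sketch that uses the Filter and LASSO feature names exactly as listed in the summary; the variable names and the choice of the Jaccard index as an agreement measure are assumptions introduced here.

```python
# Feature sets copied from the summary above
filter_set = {"PctYoungKids2Par", "racePctWhite", "PctFam2Par", "PctIlleg", "PctKids2Par"}
lasso_set = {"LemasPctOfficDrugUn", "PctIlleg", "PctKids2Par", "pctUrban", "racePctWhite"}

# Features chosen by both methods
shared = sorted(filter_set & lasso_set)

# Jaccard index: size of the intersection divided by size of the union (1.0 = identical sets)
jaccard = len(filter_set & lasso_set) / len(filter_set | lasso_set)

print(shared)
print(round(jaccard, 2))
```

Three of the five features (PctIlleg, PctKids2Par, racePctWhite) appear in both lists, so the two methods differ but are not unrelated. Features that several methods agree on are natural candidates to keep.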