<br>
# Criminal Prediction

![Criminal](http://c4.nrostatic.com/sites/default/files/styles/original_image_with_cropping/public/uploaded/criminal-justice-reform-donald-trumps-supporters-conservative-base-want-fresh.jpg?itok=oWILBi-L)

<br>
## Importing libraries

In [1]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier


<br>
## Getting the Data

In [2]:
train_df = pd.read_csv('criminal_train.csv')
test_df = pd.read_csv('criminal_test.csv')


<br>
<br>
## Data Analysis

<br>
###### Finding No. of Missing Values

In [4]:
print(train_df.isnull().sum())

PERID        0
IFATHER      0
NRCH17_2     0
IRHHSIZ2     0
IIHHSIZ2     0
IRKI17_2     0
IIKI17_2     0
IRHH65_2     0
IIHH65_2     0
PRXRETRY     0
PRXYDATA     0
MEDICARE     0
CAIDCHIP     0
CHAMPUS      0
PRVHLTIN     0
GRPHLTIN     0
HLTINNOS     0
HLCNOTYR     0
HLCNOTMO     0
HLCLAST      0
HLLOSRSN     0
HLNVCOST     0
HLNVOFFR     0
HLNVREF      0
HLNVNEED     0
HLNVSOR      0
IRMCDCHP     0
IIMCDCHP     0
IRMEDICR     0
IIMEDICR     0
            ..
CELLNOTCL    0
CELLWRKNG    0
IRFAMSOC     0
IIFAMSOC     0
IRFAMSSI     0
IIFAMSSI     0
IRFSTAMP     0
IIFSTAMP     0
IRFAMPMT     0
IIFAMPMT     0
IRFAMSVC     0
IIFAMSVC     0
IRWELMOS     0
IIWELMOS     0
IRPINC3      0
IRFAMIN3     0
IIPINC3      0
IIFAMIN3     0
GOVTPROG     0
POVERTY3     0
TOOLONG      0
TROUBUND     0
PDEN10       0
COUTYP2      0
MAIIN102     0
AIIND102     0
ANALWT_C     0
VESTR        0
VEREP        0
Criminal     0
Length: 72, dtype: int64


In [5]:
print(test_df.isnull().sum())

PERID        0
IFATHER      0
NRCH17_2     0
IRHHSIZ2     0
IIHHSIZ2     0
IRKI17_2     0
IIKI17_2     0
IRHH65_2     0
IIHH65_2     0
PRXRETRY     0
PRXYDATA     0
MEDICARE     0
CAIDCHIP     0
CHAMPUS      0
PRVHLTIN     0
GRPHLTIN     0
HLTINNOS     0
HLCNOTYR     0
HLCNOTMO     0
HLCLAST      0
HLLOSRSN     0
HLNVCOST     0
HLNVOFFR     0
HLNVREF      0
HLNVNEED     0
HLNVSOR      0
IRMCDCHP     0
IIMCDCHP     0
IRMEDICR     0
IIMEDICR     0
            ..
OTHINS       0
CELLNOTCL    0
CELLWRKNG    0
IRFAMSOC     0
IIFAMSOC     0
IRFAMSSI     0
IIFAMSSI     0
IRFSTAMP     0
IIFSTAMP     0
IRFAMPMT     0
IIFAMPMT     0
IRFAMSVC     0
IIFAMSVC     0
IRWELMOS     0
IIWELMOS     0
IRPINC3      0
IRFAMIN3     0
IIPINC3      0
IIFAMIN3     0
GOVTPROG     0
POVERTY3     0
TOOLONG      0
TROUBUND     0
PDEN10       0
COUTYP2      0
MAIIN102     0
AIIND102     0
ANALWT_C     0
VESTR        0
VEREP        0
Length: 71, dtype: int64


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45718 entries, 0 to 45717
Data columns (total 72 columns):
PERID        45718 non-null int64
IFATHER      45718 non-null int64
NRCH17_2     45718 non-null int64
IRHHSIZ2     45718 non-null int64
IIHHSIZ2     45718 non-null int64
IRKI17_2     45718 non-null int64
IIKI17_2     45718 non-null int64
IRHH65_2     45718 non-null int64
IIHH65_2     45718 non-null int64
PRXRETRY     45718 non-null int64
PRXYDATA     45718 non-null int64
MEDICARE     45718 non-null int64
CAIDCHIP     45718 non-null int64
CHAMPUS      45718 non-null int64
PRVHLTIN     45718 non-null int64
GRPHLTIN     45718 non-null int64
HLTINNOS     45718 non-null int64
HLCNOTYR     45718 non-null int64
HLCNOTMO     45718 non-null int64
HLCLAST      45718 non-null int64
HLLOSRSN     45718 non-null int64
HLNVCOST     45718 non-null int64
HLNVOFFR     45718 non-null int64
HLNVREF      45718 non-null int64
HLNVNEED     45718 non-null int64
HLNVSOR      45718 non-null int64
IRMCDCH

In [7]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 71 columns):
PERID        11430 non-null int64
IFATHER      11430 non-null int64
NRCH17_2     11430 non-null int64
IRHHSIZ2     11430 non-null int64
IIHHSIZ2     11430 non-null int64
IRKI17_2     11430 non-null int64
IIKI17_2     11430 non-null int64
IRHH65_2     11430 non-null int64
IIHH65_2     11430 non-null int64
PRXRETRY     11430 non-null int64
PRXYDATA     11430 non-null int64
MEDICARE     11430 non-null int64
CAIDCHIP     11430 non-null int64
CHAMPUS      11430 non-null int64
PRVHLTIN     11430 non-null int64
GRPHLTIN     11430 non-null int64
HLTINNOS     11430 non-null int64
HLCNOTYR     11430 non-null int64
HLCNOTMO     11430 non-null int64
HLCLAST      11430 non-null int64
HLLOSRSN     11430 non-null int64
HLNVCOST     11430 non-null int64
HLNVOFFR     11430 non-null int64
HLNVREF      11430 non-null int64
HLNVNEED     11430 non-null int64
HLNVSOR      11430 non-null int64
IRMCDCH

###### From the output above, the training data has 70 integer features, 1 float feature (ANALWT_C), and the target column "Criminal"
<br>
<br>
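The dtype breakdown can also be read off directly instead of scanning the `info()` listing. The sketch below uses a small hypothetical frame (`toy_df`, with made-up rows) as a stand-in for `train_df`; on the real data the same two calls would be applied to `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for train_df: three integer columns and one float column.
toy_df = pd.DataFrame({
    "PERID": [25095143, 13005143],
    "IFATHER": [4, 4],
    "ANALWT_C": [3884.805998, 1627.108106],
    "Criminal": [0, 1],
})

# Count how many columns share each dtype.
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# List the float columns by name.
float_cols = toy_df.select_dtypes(include="float64").columns.tolist()
print(float_cols)
```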

In [8]:
print(train_df.columns)
train_df.head()



Index(['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP',
       'Criminal'],
      dtype='object')


Unnamed: 0,PERID,IFATHER,NRCH17_2,IRHHSIZ2,IIHHSIZ2,IRKI17_2,IIKI17_2,IRHH65_2,IIHH65_2,PRXRETRY,...,TOOLONG,TROUBUND,PDEN10,COUTYP2,MAIIN102,AIIND102,ANALWT_C,VESTR,VEREP,Criminal
0,25095143,4,2,4,1,3,1,1,1,99,...,1,2,1,1,2,2,3884.805998,40026,1,0
1,13005143,4,1,3,1,2,1,1,1,99,...,2,2,2,3,2,2,1627.108106,40015,2,1
2,67415143,4,1,2,1,2,1,1,1,99,...,2,2,2,3,2,2,4344.95798,40024,1,0
3,70925143,4,0,2,1,1,1,1,1,99,...,2,2,1,1,2,2,792.521931,40027,1,0
4,75235143,1,0,6,1,4,1,1,1,99,...,2,2,2,2,2,2,1518.118526,40001,2,0


In [9]:
print(test_df.columns)
test_df.head()

Index(['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP'],
      dtype='object')


Unnamed: 0,PERID,IFATHER,NRCH17_2,IRHHSIZ2,IIHHSIZ2,IRKI17_2,IIKI17_2,IRHH65_2,IIHH65_2,PRXRETRY,...,POVERTY3,TOOLONG,TROUBUND,PDEN10,COUTYP2,MAIIN102,AIIND102,ANALWT_C,VESTR,VEREP
0,66583679,4,0,4,1,2,1,1,1,99,...,2,2,2,1,1,2,2,16346.7954,40020,1
1,35494679,4,0,4,1,1,1,1,1,99,...,3,2,2,1,1,2,2,3008.863906,40044,2
2,79424679,2,0,3,1,2,1,1,1,99,...,1,2,2,2,2,2,2,266.952503,40040,2
3,11744679,4,0,6,1,2,1,1,1,99,...,3,2,2,1,1,2,2,5386.928199,40017,1
4,31554679,1,0,4,1,3,1,1,1,99,...,3,2,1,1,1,2,2,173.489895,40017,1


<br>
***
**Features in Dataset**

No. | Variable Name | Description
:---|:---------------|:--------------
1 | **PERID** | Person ID
2 | **IFATHER** | FATHER IN HOUSEHOLD
3 | **NRCH17_2** | RECODED # R's CHILDREN < 18 IN HOUSEHOLD
4 | **IRHHSIZ2** | RECODE - IMPUTATION-REVISED # PERSONS IN HH
5 | **IIHHSIZ2** | IMPUTATION INDICATOR
6 | **IRKI17_2** | IMPUTATION-REVISED # KIDS AGED<18 IN HH
7 | **IIKI17_2** | IRKI17_2-IMPUTATION INDICATOR
8 | **IRHH65_2** | REC - IMPUTATION-REVISED # OF PER IN HH AGED>=65
9 | **IIHH65_2** | IRHH65_2-IMPUTATION INDICATOR
10 | **PRXRETRY** | SELECTED PROXY UNAVAILABLE, OTHER PROXY AVAILABLE?
11 | **PRXYDATA** | IS PROXY ANSWERING INSURANCE/INCOME QS
12 | **MEDICARE** | COVERED BY MEDICARE
13 | **CAIDCHIP** | COVERED BY MEDICAID/CHIP
14 | **CHAMPUS** | COV BY TRICARE, CHAMPUS, CHAMPVA, VA, MILITARY
15 | **PRVHLTIN** | COVERED BY PRIVATE INSURANCE
16 | **GRPHLTIN** | PRIVATE PLAN OFFERED THROUGH EMPLOYER OR UNION
17 | **HLTINNOS** | COVERED BY HEALTH INSUR
18 | **HLCNOTYR** | ANYTIME DID NOT HAVE HEALTH INS/COVER PAST 12 MOS
19 | **HLCNOTMO** | PAST 12 MOS, HOW MANY MOS W/O COVERAGE
20 | **HLCLAST** | TIME SINCE LAST HAD HEALTH CARE COVERAGE
21 | **HLLOSRSN** | MAIN REASON STOPPED COVERED BY HEALTH INSURANCE
22 | **HLNVCOST** | COST TOO HIGH
23 | **HLNVOFFR** | EMPLOYER DOESN'T OFFER
24 | **HLNVREF** | INSURANCE COMPANY REFUSED COVERAGE
25 | **HLNVNEED** | DON'T NEED IT
26 | **HLNVSOR** | NEVER HAD HLTH INS SOME OTHER REASON
27 | **IRMCDCHP** | IMPUTATION REVISED CAIDCHIP
28 | **IIMCDCHP** | MEDICAID/CHIP - IMPUTATION INDICATOR
29 | **IRMEDICR** | MEDICARE - IMPUTATION REVISED
30 | **IIMEDICR** | MEDICARE - IMPUTATION INDICATOR
31 | **IRCHMPUS** | CHAMPUS - IMPUTATION REVISED
32 | **IICHMPUS** | CHAMPUS - IMPUTATION INDICATOR
33 | **IRPRVHLT** | PRIVATE HEALTH INSURANCE - IMPUTATION REVISED
34 | **IIPRVHLT** | PRIVATE HEALTH INSURANCE - IMPUTATION INDICATOR
35 | **IROTHHLT** | OTHER HEALTH INSURANCE - IMPUTATION REVISED
36 | **IIOTHHLT** | OTHER HEALTH INSURANCE - IMPUTATION INDICATOR
37 | **HLCALLFG** | FLAG IF EVERY FORM OF HEALTH INS REPORTED
38 | **HLCALL99** | YES TO MEDICARE/MEDICAID/CHAMPUS/PRVHLTIN
39 | **ANYHLTI2** | COVERED BY ANY HEALTH INSURANCE - RECODE
40 | **IRINSUR4** | RC-OVERALL HEALTH INSURANCE - IMPUTATION REVISED
41 | **IIINSUR4** | RC-OVERALL HEALTH INSURANCE - IMPUTATION INDICATOR
42 | **OTHINS** | RC-OTHER HEALTH INSURANCE
43 | **CELLNOTCL** | NOT A CELL PHONE
44 | **CELLWRKNG** | WORKING CELL PHONE
45 | **IRFAMSOC** | FAM RECEIVE SS OR RR PAYMENTS - IMPUTATION REVISED
46 | **IIFAMSOC** | FAM RECEIVE SS OR RR PAYMENTS - IMPUTATION INDICATOR
47 | **IRFAMSSI** | FAM RECEIVE SSI - IMPUTATION REVISED
48 | **IIFAMSSI** | FAM RECEIVE SSI - IMPUTATION INDICATOR
49 | **IRFSTAMP** | RESP/OTH FAM MEM REC FOOD STAMPS - IMPUTATION REVISED
50 | **IIFSTAMP** | RESP/OTH FAM MEM REC FOOD STAMPS - IMPUTATION INDICATOR
51 | **IRFAMPMT** | FAM RECEIVE PUBLIC ASSIST - IMPUTATION REVISED
52 | **IIFAMPMT** | FAM RECEIVE PUBLIC ASSIST - IMPUTATION INDICATOR
53 | **IRFAMSVC** | FAM REC WELFARE/JOB PL/CHILDCARE - IMPUTATION REVISED
54 | **IIFAMSVC** | FAM REC WELFARE/JOB PL/CHILDCARE - IMPUTATION INDICATOR
55 | **IRWELMOS** | IMP. REVISED - NO.OF MONTHS ON WELFARE
56 | **IIWELMOS** | NO OF MONTHS ON WELFARE - IMPUTATION INDICATOR
57 | **IRPINC3** | RESP TOT INCOME (FINER CAT) - IMP REV
58 | **IRFAMIN3** | RECODE - IMP.REVISED - TOT FAM INCOME
59 | **IIPINC3** | RESP TOT INCOME (FINER CAT) - IMP INDIC
60 | **IIFAMIN3** | IRFAMIN3 - IMPUTATION INDICATOR
61 | **GOVTPROG** | RC-PARTICIPATED IN ONE OR MORE GOVT ASSIST PROGRAMS
62 | **POVERTY3** | RC-POVERTY LEVEL
63 | **TOOLONG** | RESP SAID INTERVIEW WAS TOO LONG
64 | **TROUBUND** | DID RESP HAVE TROUBLE UNDERSTANDING INTERVIEW
65 | **PDEN10** | POPULATION DENSITY 2010
66 | **COUTYP2** | COUNTY METRO/NONMETRO STATUS
67 | **MAIIN102** | MAJORITY AMER INDIAN AREA INDICATOR FOR SEGMENT
68 | **AIIND102** | AMER INDIAN AREA INDICATOR
69 | **ANALWT_C** | FIN PRSN-LEVEL SIMPLE WGHT
70 | **VESTR** | ANALYSIS STRATUM
71 | **VEREP** | ANALYSIS REPLICATE
72 | **Criminal** | Target Variable

In [34]:
train_df.describe()

Unnamed: 0,PERID,IFATHER,NRCH17_2,IRHHSIZ2,IIHHSIZ2,IRKI17_2,IIKI17_2,IRHH65_2,IIHH65_2,PRXRETRY,...,TOOLONG,TROUBUND,PDEN10,COUTYP2,MAIIN102,AIIND102,ANALWT_C,VESTR,VEREP,Criminal
count,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,...,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0,45718.0
mean,54454460.0,3.355549,0.476486,3.426375,1.001706,2.084124,1.007437,1.162606,1.011024,97.394943,...,2.21941,2.23494,1.646135,1.764666,1.978936,1.978739,4692.661179,40023.739118,1.493854,0.069447
std,25539110.0,1.176651,0.888472,1.42742,0.061314,1.102988,0.123162,0.469029,0.146444,12.355156,...,5.295784,5.293651,0.618403,0.771411,0.14451,0.145161,5724.659486,265.14043,0.50023,0.254216
min,10002220.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0
25%,32331890.0,4.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,99.0,...,2.0,2.0,1.0,1.0,2.0,2.0,1252.396472,40013.0,1.0,0.0
50%,54110430.0,4.0,0.0,3.0,1.0,2.0,1.0,1.0,1.0,99.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2719.33516,40025.0,1.0,0.0
75%,76127310.0,4.0,1.0,4.0,1.0,3.0,1.0,1.0,1.0,99.0,...,2.0,2.0,2.0,2.0,2.0,2.0,5765.810794,40039.0,2.0,0.0
max,99999560.0,4.0,3.0,6.0,3.0,4.0,3.0,3.0,3.0,99.0,...,98.0,98.0,3.0,3.0,2.0,2.0,109100.623,40050.0,2.0,1.0
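The summary above shows a minimum of -1 in almost every column and maxima of 98 or 99 in columns whose quartiles are small codes. These look like placeholder codes rather than measured values, although that reading is an assumption drawn from the `describe()` output and not from a codebook. A minimal sketch for counting such codes per column, using a hypothetical frame (`toy_df`) and a hypothetical `SENTINELS` list:

```python
import pandas as pd

# Hypothetical stand-in rows; on the real data this would be train_df.
toy_df = pd.DataFrame({
    "PRXRETRY": [99, 99, 1, -1],
    "TOOLONG": [2, 98, 1, 2],
    "IRHHSIZ2": [4, 3, -1, 6],
})

# Assumed placeholder codes (not confirmed by any codebook).
SENTINELS = [-1, 98, 99]

# Number of cells per column that hold one of the assumed codes.
sentinel_counts = toy_df.isin(SENTINELS).sum()
print(sentinel_counts)
```

Whether to recode these values, keep them as categories, or drop the affected rows depends on what the codes actually mean, so the count is only a first look.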


###### Prediction Target

In [11]:
y = train_df.Criminal

###### Choosing Predictors 

In [12]:
Criminal_Predictors = ['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP']

In [13]:
X = train_df[Criminal_Predictors] 
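Since the predictor list is simply every column except the target, it can be derived from the frame instead of being typed out. The sketch below uses a hypothetical frame (`toy_df`) in place of `train_df`. It also builds a second list without `PERID`; leaving the identifier out is a design option that this notebook does not take, shown here only for comparison.

```python
import pandas as pd

# Hypothetical stand-in for train_df.
toy_df = pd.DataFrame({
    "PERID": [1, 2, 3],
    "IFATHER": [4, 4, 1],
    "ANALWT_C": [3884.8, 1627.1, 4344.9],
    "Criminal": [0, 1, 0],
})

TARGET = "Criminal"
ID_COLUMN = "PERID"

# Every column except the target, in the original column order.
all_predictors = [col for col in toy_df.columns if col != TARGET]

# Optional variant: also leave out the person identifier.
no_id_predictors = [col for col in all_predictors if col != ID_COLUMN]

X_toy = toy_df[all_predictors]
y_toy = toy_df[TARGET]
print(all_predictors)
print(no_id_predictors)
```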

In [14]:
# Define model
criminal_model = DecisionTreeClassifier()

In [15]:
criminal_model.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [16]:
print("Making predictions for the following 5 Persons:")
print(X.head())
print("The predictions are")
print(criminal_model.predict(X.head()))

Making predictions for the following 5 Persons:
      PERID  IFATHER  NRCH17_2  IRHHSIZ2  IIHHSIZ2  IRKI17_2  IIKI17_2  \
0  25095143        4         2         4         1         3         1   
1  13005143        4         1         3         1         2         1   
2  67415143        4         1         2         1         2         1   
3  70925143        4         0         2         1         1         1   
4  75235143        1         0         6         1         4         1   

   IRHH65_2  IIHH65_2  PRXRETRY  ...    POVERTY3  TOOLONG  TROUBUND  PDEN10  \
0         1         1        99  ...           2        1         2       1   
1         1         1        99  ...           1        2         2       2   
2         1         1        99  ...           1        2         2       2   
3         1         1        99  ...           3        2         2       1   
4         1         1        99  ...           1        2         2       2   

   COUTYP2  MAIIN102  AIIND102  

In [17]:
print("Making predictions for the following 5 Persons:")
print(test_df.head())
print("The predictions are")
print(criminal_model.predict(test_df.head()))

Making predictions for the following 5 Persons:
      PERID  IFATHER  NRCH17_2  IRHHSIZ2  IIHHSIZ2  IRKI17_2  IIKI17_2  \
0  66583679        4         0         4         1         2         1   
1  35494679        4         0         4         1         1         1   
2  79424679        2         0         3         1         2         1   
3  11744679        4         0         6         1         2         1   
4  31554679        1         0         4         1         3         1   

   IRHH65_2  IIHH65_2  PRXRETRY  ...    POVERTY3  TOOLONG  TROUBUND  PDEN10  \
0         1         1        99  ...           2        2         2       1   
1         1         1        99  ...           3        2         2       1   
2         1         1        99  ...           1        2         2       2   
3         1         1        99  ...           3        2         2       1   
4         1         1        99  ...           3        2         1       1   

   COUTYP2  MAIIN102  AIIND102  

## Model Validation

In [18]:
from sklearn.metrics import mean_absolute_error

predicted_criminals = criminal_model.predict(X)
mean_absolute_error(y, predicted_criminals)

0.0

In [19]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)
# Define model
Criminal_model = DecisionTreeClassifier()
# Fit model
Criminal_model.fit(train_X, train_y)

# get predicted class labels on validation data
val_predictions = Criminal_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

0.0685914260717
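For 0/1 labels, mean absolute error is the share of misclassified rows, that is, 1 minus accuracy. Because the mean of `Criminal` in the `describe()` output is about 0.069, a model that predicts 0 for everyone would also score an error near 0.069 on the full training data, so the error alone says little about how well positives are found. The sketch below uses made-up label arrays (not the notebook's data) to show the relationship and two complementary metrics.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_absolute_error,
    recall_score,
)

# Made-up validation labels and predictions, for illustration only.
val_y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
val_predictions = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

mae = mean_absolute_error(val_y, val_predictions)
accuracy = accuracy_score(val_y, val_predictions)

# Recall: share of actual positives that the model found.
recall = recall_score(val_y, val_predictions)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(val_y, val_predictions)

print("MAE:", mae, "1 - accuracy:", 1 - accuracy)
print("Recall for class 1:", recall)
print(cm)
```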


In [20]:
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return(mae)

In [21]:
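# compare MAE with differing values of max_leaf_nodes
# Print the MAE with %f: a %d specifier truncates any MAE below 1 to 0
# (the recorded output below was printed with %d and shows 0 on every line).
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %f" %(max_leaf_nodes, my_mae))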

Max leaf nodes: 5  		 Mean Absolute Error:  0
Max leaf nodes: 50  		 Mean Absolute Error:  0
Max leaf nodes: 500  		 Mean Absolute Error:  0
Max leaf nodes: 5000  		 Mean Absolute Error:  0


###### Random Forest Classifier

In [22]:
forest_model = RandomForestClassifier()
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, forest_preds))

0.0555555555556


## Submitting Competition files

In [23]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Read the data
train = pd.read_csv('criminal_train.csv')

# pull data into target (y) and predictors (X)
train_y = train.Criminal
predictor_cols =  ['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP']

#predictor_cols = ['IRFAMIN3', 'ANALWT_C', 'PERID', 'IRPINC3', 'VESTR','POVERTY3', 'GRPHLTIN', 'IRHHSIZ2', 'IRPRVHLT', 'IFATHER','COUTYP2','PRXYDATA','PDEN10', 'VEREP','IRKI17_2','CELLNOTCL','IRMEDICR','IRMCDCHP','PRVHLTIN','IRFAMSOC','IRHH65_2']
# predictor_cols = ['IRFAMIN3', 'ANALWT_C', 'PERID', 'IRPINC3', 'VESTR','POVERTY3', 'GRPHLTIN', 'IRHHSIZ2', 'IRPRVHLT', 'IFATHER','COUTYP2','PRXYDATA','PDEN10']
# Create training predictors data
train_X = train[predictor_cols]



my_model = DecisionTreeClassifier()
my_model.fit(train_X, train_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [24]:
# Read the test data
test = pd.read_csv('criminal_test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted = my_model.predict(test_X)
# We will look at the predicted labels to ensure we have something sensible.
print(predicted)

[0 0 0 ..., 0 0 0]


In [1]:
featimp = pd.Series(my_model.feature_importances_, index=predictor_cols).sort_values(ascending=False)
print(featimp)

NameError: name 'pd' is not defined
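The `NameError` above appears to come from running the cell in a fresh kernel (note the `In [1]` counter) before pandas was imported, not from the expression itself. A self-contained sketch of the same feature-importance ranking, fitted on synthetic data from `make_classification` with hypothetical column names `f0` to `f4`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real cell would use train_X, train_y and predictor_cols.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
toy_cols = ["f0", "f1", "f2", "f3", "f4"]

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_toy, y_toy)

# Importances are normalized to sum to 1; sort so the most used features come first.
featimp = pd.Series(toy_model.feature_importances_, index=toy_cols).sort_values(ascending=False)
print(featimp)
```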

## Prepare Submission File

In [26]:
my_submission = pd.DataFrame({'PERID': test.PERID, 'Criminal': predicted})
# you could use any filename. We choose submission here
my_submission.to_csv('submissiondt.csv', index=False)

## Applying Gradient Boosting


In [27]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Read the data
train = pd.read_csv('criminal_train.csv')

# pull data into target (y) and predictors (X)
train_y = train.Criminal
predictor_cols =  ['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP']

#predictor_cols = ['IRFAMIN3', 'ANALWT_C', 'PERID', 'IRPINC3', 'VESTR','POVERTY3', 'GRPHLTIN', 'IRHHSIZ2', 'IRPRVHLT', 'IFATHER','COUTYP2','PRXYDATA','PDEN10', 'VEREP','IRKI17_2','CELLNOTCL','IRMEDICR','IRMCDCHP','PRVHLTIN','IRFAMSOC','IRHH65_2']
# predictor_cols = ['IRFAMIN3', 'ANALWT_C', 'PERID', 'IRPINC3', 'VESTR','POVERTY3', 'GRPHLTIN', 'IRHHSIZ2', 'IRPRVHLT', 'IFATHER','COUTYP2','PRXYDATA','PDEN10']
# Create training predictors data
train_X = train[predictor_cols]


from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, test_X, train_y, test_y = train_test_split(train_X , train_y,test_size = 0.30, random_state=1)

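# XGBoost's early_stopping_rounds argument offers a way to automatically
# find the ideal number of estimators: early stopping causes the model to
# stop iterating when the validation score stops improving.
# The GradientBoostingClassifier below does not use it and trains a fixed 197 estimators.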
my_model = GradientBoostingClassifier(n_estimators=197 )

my_model.fit(train_X, train_y)

# Otherwise, try an RNN model

# xgb_model = xgb.XGBRegressor(n_estimators=600, learning_rate=0.06)

# fit_params={'early_stopping_rounds': 30, 
#             'eval_metric': 'mae',
#             'verbose': False,
#             'eval_set': [[val_x, val_y]]}

# xgb_cv = cross_val_score(xgb_model, train_x, train_y, 
#                          cv = 5, 
#                          scoring = 'neg_mean_absolute_error',
#                          fit_params = fit_params)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=197,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
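scikit-learn's `GradientBoostingClassifier` has its own form of early stopping through the `n_iter_no_change` and `validation_fraction` parameters, which could stand in for the XGBoost `early_stopping_rounds` idea mentioned in the comments. The sketch below is not what the cell above ran; it uses synthetic data, and the values 0.1 and 10 are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the real cell would use train_X and train_y.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

early_model = GradientBoostingClassifier(
    n_estimators=197,          # upper bound on boosting rounds, as in the cell above
    validation_fraction=0.1,   # share of the training data held out internally
    n_iter_no_change=10,       # stop after 10 rounds without validation improvement
    random_state=0,
)
early_model.fit(X_toy, y_toy)

# Number of boosting rounds actually kept after early stopping.
print(early_model.n_estimators_)
```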

In [36]:
train_df.loc[train_df['Criminal']== 1 , 'Criminal'].sum()

3175

In [41]:
train_df.loc[train_df['Criminal']== 0 , 'Criminal'].count()

42543

### Class imbalance: 3175 positive rows vs. 42543 negative rows (about 6.9% positive). Check how to handle this problem, as told by Shivam Sir
<br>
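One common way to handle an imbalance like this is to weight the classes inversely to their frequency. The sketch below computes scikit-learn's "balanced" weights from the two counts above (roughly 0.54 for class 0 and 7.2 for class 1) and then passes `class_weight="balanced"` to a decision tree. The tree is fitted on synthetic data and `max_depth=5` is an arbitrary assumption; this is one option among several, not necessarily the approach that was recommended.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the two cells above.
n_negative, n_positive = 42543, 3175
labels = np.array([0] * n_negative + [1] * n_positive)

# "balanced" weight for a class = n_samples / (n_classes * class_count).
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes.tolist(), weights.tolist())))

# Synthetic stand-in data with a similar imbalance (hypothetical features).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0
)

# The same weighting applied inside the classifier.
weighted_tree = DecisionTreeClassifier(class_weight="balanced", max_depth=5, random_state=0)
weighted_tree.fit(X_toy, y_toy)
toy_predictions = weighted_tree.predict(X_toy)
```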

In [30]:
# Read the test data
test = pd.read_csv('criminal_test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted = my_model.predict(test_X)
# We will look at the predicted labels to ensure we have something sensible.
print(predicted)

[0 0 0 ..., 0 0 0]


In [29]:
my_submission = pd.DataFrame({'PERID': test.PERID, 'Criminal': predicted})
# you could use any filename. We choose submission here
my_submission.to_csv('submissionxgb197.csv', index=False)

## Neural Network Classifier

In [29]:
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Read the data
train = pd.read_csv('criminal_train.csv')

# pull data into target (y) and predictors (X)
train_y = train.Criminal
predictor_cols =  ['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP']

#predictor_cols = ['IRFAMIN3', 'ANALWT_C', 'PERID', 'IRPINC3', 'VESTR','POVERTY3', 'GRPHLTIN', 'IRHHSIZ2', 'IRPRVHLT', 'IFATHER','COUTYP2','PRXYDATA','PDEN10', 'VEREP','IRKI17_2','CELLNOTCL','IRMEDICR','IRMCDCHP','PRVHLTIN','IRFAMSOC','IRHH65_2']
# predictor_cols = ['IRFAMIN3', 'ANALWT_C', 'PERID', 'IRPINC3', 'VESTR','POVERTY3', 'GRPHLTIN', 'IRHHSIZ2', 'IRPRVHLT', 'IFATHER','COUTYP2','PRXYDATA','PDEN10']
# Create training predictors data
train_X = train[predictor_cols]

my_model = MLPClassifier(hidden_layer_sizes=(100,100, 100, 100,100))
my_model.fit(train_X, train_y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 100, 100, 100, 100),
       learning_rate='constant', learning_rate_init=0.001, max_iter=200,
       momentum=0.9, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [30]:
# Read the test data
test = pd.read_csv('criminal_test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted = my_model.predict(test_X)
# We will look at the predicted labels to ensure we have something sensible.
print(predicted)

[0 0 0 ..., 0 0 0]


In [31]:
my_submission = pd.DataFrame({'PERID': test.PERID, 'Criminal': predicted})
# you could use any filename. We choose submission here
my_submission.to_csv('submissionNN5.csv', index=False)

## Next: Apply RNN + LSTM

1. Code: https://github.com/etai83/lstm_stock_prediction/blob/master/.ipynb_checkpoints/GOOGLE%20stock%20prediction-checkpoint.ipynb
2. https://www.youtube.com/watch?v=ftMq5ps503w&t=3s
3. https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo/blob/master/stockdemo.ipynb
4. https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo

Apply an RNN + LSTM model to a classification task such as sentiment classification.
https://www.kaggle.com/sudhir51/kernels/notebooks/new?forkParentScriptVersionId=1396664

Try to find Titanic models in which RNN + LSTM has been used, and study the models of people who obtained high accuracy on Titanic.
Remember that I have also forked one model that applies an RNN; it took a long time to run on the classification task.

In [2]:
# TODO: apply scaling and normalization
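Scaling matters most for the neural network: the `describe()` output shows `ANALWT_C` reaching about 109,100 and `PERID` in the tens of millions, while most other features are small codes. A minimal sketch of standardizing features inside a pipeline, so the scaler is fitted on training data only and reused at prediction time. The data is synthetic, and the single 20-unit hidden layer and `max_iter=500` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a large-valued feature.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_toy[:, 0] *= 10000

# StandardScaler rescales each column to zero mean and unit variance before the MLP sees it.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
scaled_mlp.fit(X_toy, y_toy)

# Inspect the scaled features the network was trained on.
X_scaled = scaled_mlp.named_steps["standardscaler"].transform(X_toy)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Calling `scaled_mlp.predict(...)` on new rows applies the same scaling automatically, which avoids fitting a separate scaler on the test file.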