In [1]:
import sklearn as sk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

* Reading the csv file

In [2]:
la_crimes = pd.read_csv('LA Crime_Data_from_2010_to_2019.csv')
la_crimes.rename(columns={'AREA ': 'AREA'}, inplace = True)
la_crimes

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,1307355,02/20/2010 12:00:00 AM,02/20/2010 12:00:00 AM,1350,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,...,AA,Adult Arrest,900.0,,,,300 E GAGE AV,,33.9825,-118.2695
1,11401303,09/13/2010 12:00:00 AM,09/12/2010 12:00:00 AM,45,14,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,SEPULVEDA BL,MANCHESTER AV,33.9599,-118.3962
2,70309629,08/09/2010 12:00:00 AM,08/09/2010 12:00:00 AM,1515,13,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,...,IC,Invest Cont,946.0,,,,1300 E 21ST ST,,34.0224,-118.2524
3,90631215,01/05/2010 12:00:00 AM,01/05/2010 12:00:00 AM,150,6,Hollywood,646,2,900,VIOLATION OF COURT ORDER,...,IC,Invest Cont,900.0,998.0,,,CAHUENGA BL,HOLLYWOOD BL,34.1016,-118.3295
4,100100501,01/03/2010 12:00:00 AM,01/02/2010 12:00:00 AM,2100,1,Central,176,1,122,"RAPE, ATTEMPTED",...,IC,Invest Cont,122.0,,,,8TH ST,SAN PEDRO ST,34.0387,-118.2488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114694,190608903,03/28/2019 12:00:00 AM,03/28/2019 12:00:00 AM,400,6,Hollywood,644,1,648,ARSON,...,IC,Invest Cont,648.0,,,,1400 N LA BREA AV,,34.0962,-118.3490
2114695,190715222,08/15/2019 12:00:00 AM,08/14/2019 12:00:00 AM,1810,7,Wilshire,701,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),...,IC,Invest Cont,331.0,,,,WILLOUGHBY AV,ORLANDO AV,34.0871,-118.3732
2114696,192004409,01/06/2019 12:00:00 AM,01/06/2019 12:00:00 AM,2100,20,Olympic,2029,2,930,CRIMINAL THREATS - NO WEAPON DISPLAYED,...,IC,Invest Cont,930.0,,,,6TH,VIRGIL,34.0637,-118.2870
2114697,191716777,10/17/2019 12:00:00 AM,10/16/2019 12:00:00 AM,1800,17,Devonshire,1795,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),...,IC,Invest Cont,420.0,,,,17200 NAPA ST,,34.2266,-118.5085


## DR_NO column
* We see that the column `DR_NO` is a unique code for each row, that means each crime reported has a different code

In [3]:
la_crimes['DR_NO'].value_counts()

151001085    1
161811979    1
111721424    1
121320908    1
121316814    1
            ..
130712581    1
130718726    1
151714826    1
151712779    1
151001091    1
Name: DR_NO, Length: 2114699, dtype: int64

## Date Rptd and DATE OCC columns
* We see that all the dates have the time of 12:00:00 AM

In [4]:
la_crimes['Date Rptd'].str[-8:].value_counts() #selecting only the time and getting the different values of it

00:00 AM    2114699
Name: Date Rptd, dtype: int64

In [5]:
la_crimes['DATE OCC'].str[-8:].value_counts() #selecting only the time and getting the different values of it

00:00 AM    2114699
Name: DATE OCC, dtype: int64

* As result, we can delete the data of the time because it does not give additional information to us

In [6]:
la_crimes['Date Rptd'] = la_crimes['Date Rptd'].str[:10] #only keeping the data of the date without the time

In [7]:
la_crimes['DATE OCC'] = la_crimes['DATE OCC'].str[:10] #only keeping the data

* Now we are going to bring the date format to the appropriate one for databases: YYYY-MM-DD

In [8]:
la_crimes['Date Rptd'] = pd.to_datetime(la_crimes['Date Rptd'])
la_crimes['DATE OCC'] = pd.to_datetime(la_crimes['DATE OCC'])
la_crimes['Date Rptd'].sample(5)

2002427   2019-04-25
1525245   2017-05-29
619111    2013-05-24
1583356   2017-03-07
159157    2010-09-17
Name: Date Rptd, dtype: datetime64[ns]

## TIME OCC column
* Now we have to fix the column  `TIME OCC` to a '24hour:minutes' format
* First we transform the column to datatype string because it was integer
* Then we add  '00' in front of every date (because the rows with time occured at 12AM have no values for hours
* Lastly we only select the right string characters

In [9]:
la_crimes['TIME OCC'] = la_crimes['TIME OCC'].astype(str)

In [10]:
la_crimes['TIME OCC']= '00'+la_crimes['TIME OCC']

In [11]:
la_crimes['TIME OCC'] = la_crimes['TIME OCC'].str[-4:-2] + ':' +la_crimes['TIME OCC'].str[-2:]

* We join the  columns `DATE OCC` and `TIME OCC` into the column `DATE OCC` so that it will be in the format YYYY-MM-DD hh:mm. It will be saved as datatype datetime YYYY-MM-DD hh:mm:ss in the database
* We drop the column `TIME OCC`

In [12]:
la_crimes['DATE OCC'] = la_crimes['DATE OCC'].astype(str) +' ' + la_crimes['TIME OCC']
la_crimes['DATE OCC'] = pd.to_datetime(la_crimes['DATE OCC'])

In [13]:
la_crimes = la_crimes.drop(['TIME OCC'], axis=1)

## Mocodes column
We see that some crime reports have null values for the column of `Mocodes`
* We replace those values with the value `unknown`

In [14]:
la_crimes.loc[la_crimes['Mocodes'].isna()]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
15,100100535,2010-01-17,2010-01-16 17:35:00,1,Central,185,2,946,OTHER MISCELLANEOUS CRIME,,...,IC,Invest Cont,946.0,999.0,,,300 E OLYMPIC BL,,34.0389,-118.2550
28,100100578,2010-02-05,2010-02-03 12:55:00,1,Central,185,2,946,OTHER MISCELLANEOUS CRIME,,...,IC,Invest Cont,946.0,999.0,,,1200 MAPLE AV,,34.0357,-118.2563
51,100100654,2010-02-27,2010-02-27 19:55:00,1,Central,174,2,946,OTHER MISCELLANEOUS CRIME,,...,AA,Adult Arrest,946.0,,,,W 7TH ST,S SPRING ST,34.0445,-118.2523
79,100100730,2010-03-23,2010-03-20 12:15:00,1,Central,111,2,647,THROWING OBJECT AT MOVING VEHICLE,,...,IC,Invest Cont,647.0,,,,CESAR E CHAVEZ,FIGUEROA ST,34.0627,-118.2463
102,100100786,2010-04-08,2010-04-08 02:20:00,1,Central,161,1,510,VEHICLE - STOLEN,,...,IC,Invest Cont,510.0,520.0,,,FRANCISCO ST,8TH ST,34.0481,-118.2633
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114663,190417846,2019-12-11,2019-12-06 18:00:00,4,Hollenbeck,423,1,510,VEHICLE - STOLEN,,...,IC,Invest Cont,510.0,,,,1100 N MISSION RD,,34.0651,-118.2116
2114670,191710450,2019-05-25,2019-05-25 06:30:00,17,Devonshire,1761,1,510,VEHICLE - STOLEN,,...,IC,Invest Cont,510.0,,,,9200 ETON AV,,34.2376,-118.5947
2114683,191816545,2019-07-22,2019-07-17 20:30:00,18,Southeast,1836,2,922,CHILD STEALING,,...,AO,Adult Other,922.0,,,,10400 FIRTH AV,,33.9424,-118.2477
2114691,191307168,2019-02-28,2019-02-28 07:00:00,13,Newton,1394,1,510,VEHICLE - STOLEN,,...,IC,Invest Cont,510.0,,,,100 E 67TH ST,,33.9788,-118.2739


In [15]:
indexes_null_mo = la_crimes.loc[la_crimes ['Mocodes'].isna()].index
la_crimes.loc[indexes_null_mo,'Mocodes'] = 'unknown'

## Vict Age column
* Now we are going to fix the column `Vict Age` that don't make any sense
* Specifically we are going to change all negative age numbers to `0` which we consider missing value.

In [16]:
la_crimes['Vict Age'].value_counts()

 0      369886
 25      48101
 26      47469
 27      47011
 24      46739
         ...  
-7          15
-8           7
-9           4
 114         1
 118         1
Name: Vict Age, Length: 110, dtype: int64

In [17]:
indexes_neg_age = la_crimes.loc[la_crimes['Vict Age'] < 0]['Vict Age'].index #finding the indexes of negative ages
la_crimes.loc[indexes_neg_age,'Vict Age'] = 0

## Vict Sex column
* Now we fix the column `Vict Sex` which is the victim's sex.
* According to the description the values are: F - Female, M - Male, X - Unknown

In [18]:
la_crimes['Vict Sex'].value_counts()

M    974309
F    888499
X     55129
H        73
N        17
-         1
Name: Vict Sex, dtype: int64

* We see there are a bunch of irrelevant values which we change to 'X'

In [19]:
indexes_wrong_sex = la_crimes.loc[(la_crimes['Vict Sex'] != 'M') & (la_crimes['Vict Sex'] != 'F')].index
la_crimes.loc[indexes_wrong_sex,'Vict Sex'] = 'X'

## Vict Descent column
Now we will fix the column `Vict Descent`. We see there are a bunch of nan values. Also, according to the description of the dataset the different descents are: : 
* A - Other Asian 
* B - Black 
* C - Chinese  
* D - Cambodian 
* F - Filipino 
* G - Guamanian 
* H - Hispanic/Latin/Mexican 
* I - American Indian/Alaskan Native 
* J - Japanese 
* K - Korean 
* L - Laotian 
* O - Other 
* P - Pacific Islander 
* S - Samoan 
* U - Hawaiian 
* V - Vietnamese 
* W - White 
* X - Unknown 
* Z - Asian Indian

In [20]:
print("The reported crimes with Nan (null) values for the column Vict Descent are:")
len(la_crimes.loc[la_crimes['Vict Descent'].isna()])

The reported crimes with Nan (null) values for the column Vict Descent are:


196718

In [21]:
la_crimes['Vict Descent'].value_counts()

H    725348
W    510158
B    335102
O    202969
X     78147
A     51109
K      9141
F      2553
C      1061
I       945
J       418
P       343
V       201
U       190
Z       136
G        85
S        31
D        23
L        18
-         3
Name: Vict Descent, dtype: int64

In [22]:
indexes_wrong_decent = la_crimes.loc[(la_crimes['Vict Descent'] == '-') | (la_crimes['Vict Descent'].isna())].index
la_crimes.loc[indexes_wrong_decent,'Vict Descent'] = 'X'

## Premis Cd and Premis Desc columns
Now we see that many rows have null values for the column `Premis Cd` 
* Because this column has float values, we replace the null values with `0`

In [23]:
la_crimes.loc[la_crimes['Premis Cd'].isna()]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
6590,100121447,2010-12-12,2010-12-12 11:50:00,1,Central,185,1,110,CRIMINAL HOMICIDE,unknown,...,AA,Adult Arrest,110.0,,,,200 W OLYMPIC BL,,34.0409,-118.2574
32148,100913648,2010-06-21,2010-06-20 14:35:00,9,Van Nuys,915,1,510,VEHICLE - STOLEN,unknown,...,IC,Invest Cont,510.0,,,,7000 VAN NUYS BL,,34.1976,-118.4487
67340,100816222,2010-09-03,2010-04-16 00:01:00,8,West LA,803,2,813,CHILD ANNOYING (17YRS & UNDER),unknown,...,IC,Invest Cont,813.0,,,,2000 MANDAVILLE C,,34.0949,-118.5111
68276,100818222,2010-11-18,2010-10-16 16:00:00,8,West LA,811,2,812,CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 ...,unknown,...,IC,Invest Cont,812.0,,,,1100 LAS PULGAS RD,,34.0528,-118.5393
71523,100908076,2010-03-16,2010-03-15 22:05:00,15,N Hollywood,1547,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,IC,Invest Cont,420.0,,,,BECK,CHANDLER,34.1687,-118.3834
85649,101017796,2010-09-21,2010-09-08 17:20:00,10,West Valley,1067,2,813,CHILD ANNOYING (17YRS & UNDER),unknown,...,AO,Adult Other,813.0,,,,5400 LOUISE AV,,34.1694,-118.5098
92668,101114489,2010-06-18,2010-06-18 19:10:00,11,Northeast,1178,1,820,ORAL COPULATION,unknown,...,IC,Invest Cont,820.0,,,,CYPRESS AV,FIGUEROA,34.0864,-118.219
102768,101214269,2010-05-13,2010-05-12 18:00:00,12,77th Street,1242,1,235,CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT,unknown,...,IC,Invest Cont,235.0,,,,6200 3RD AV,,33.9837,-118.3206
107695,101223939,2010-09-17,2010-06-01 12:00:00,12,77th Street,1213,2,812,CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 ...,unknown,...,IC,Invest Cont,812.0,,,,1700 W 52ND ST,,33.9951,-118.3068
124293,101410405,2010-04-12,2010-04-11 23:20:00,14,Pacific,1494,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,AA,Adult Arrest,420.0,,,,300 WORLD WY,,33.944,-118.4073


In [24]:
la_crimes.loc[la_crimes['Premis Desc'] =='unknown'][['Premis Cd','Premis Desc']]

Unnamed: 0,Premis Cd,Premis Desc


In [25]:
indexes_null_premiscd = la_crimes.loc[la_crimes ['Premis Cd'].isna()].index
la_crimes.loc[indexes_null_premiscd,'Premis Cd'] = 0

Now we see that many rows have null values for the column `Premis Desc`

* We replace those null values with `unknown`

In [26]:
la_crimes.loc[la_crimes['Premis Desc'].isna()]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
6590,100121447,2010-12-12,2010-12-12 11:50:00,1,Central,185,1,110,CRIMINAL HOMICIDE,unknown,...,AA,Adult Arrest,110.0,,,,200 W OLYMPIC BL,,34.0409,-118.2574
32148,100913648,2010-06-21,2010-06-20 14:35:00,9,Van Nuys,915,1,510,VEHICLE - STOLEN,unknown,...,IC,Invest Cont,510.0,,,,7000 VAN NUYS BL,,34.1976,-118.4487
67340,100816222,2010-09-03,2010-04-16 00:01:00,8,West LA,803,2,813,CHILD ANNOYING (17YRS & UNDER),unknown,...,IC,Invest Cont,813.0,,,,2000 MANDAVILLE C,,34.0949,-118.5111
68276,100818222,2010-11-18,2010-10-16 16:00:00,8,West LA,811,2,812,CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 ...,unknown,...,IC,Invest Cont,812.0,,,,1100 LAS PULGAS RD,,34.0528,-118.5393
71523,100908076,2010-03-16,2010-03-15 22:05:00,15,N Hollywood,1547,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,IC,Invest Cont,420.0,,,,BECK,CHANDLER,34.1687,-118.3834
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2109009,190223459,2019-12-17,2019-12-15 21:45:00,2,Rampart,236,2,668,"EMBEZZLEMENT, GRAND THEFT ($950.01 & OVER)",0344 1309 1822 0913 1803,...,IC,Invest Cont,668.0,,,,100 S ALVARADO ST,,34.0667,-118.2703
2109581,191515929,2019-08-20,2019-08-20 05:40:00,15,N Hollywood,1599,1,310,BURGLARY,2018 0344 1309 1609 1402 1414 0384,...,IC,Invest Cont,310.0,998.0,,,3700 CAHUENGA BL,,34.1350,-118.3612
2113695,190119416,2019-07-28,2019-07-27 18:00:00,1,Central,157,1,320,"BURGLARY, ATTEMPTED",1607,...,IC,Invest Cont,320.0,,,,600 CROCKER ST,,34.0417,-118.2443
2114205,191221441,2019-08-29,2019-08-29 01:00:00,12,77th Street,1268,1,110,CRIMINAL HOMICIDE,1100 0430 1402 1822 0906,...,AA,Adult Arrest,110.0,998.0,,,8100 S BROADWAY,,33.9660,-118.2783


In [27]:
indexes_null_premisdesc = la_crimes.loc[la_crimes ['Premis Desc'].isna()].index
la_crimes.loc[indexes_null_premisdesc,'Premis Desc'] = 'unknown'

Theoretically, for the rows with `Premis Desc` = 'unknown', the column `Premis Cd` should be equal to $0$ (unknown). However, there are some rows with `Premis Cd` different than 0

In [28]:
la_crimes.loc[la_crimes['Premis Desc']=='unknown'][['Premis Desc','Premis Cd']]

Unnamed: 0,Premis Desc,Premis Cd
6590,unknown,0.0
32148,unknown,0.0
67340,unknown,0.0
68276,unknown,0.0
71523,unknown,0.0
...,...,...
2109009,unknown,418.0
2109581,unknown,256.0
2113695,unknown,256.0
2114205,unknown,256.0


In detail, the rows with `Premis Desc` = 'unknown', have the values of $0$ or $256$ or $418$ or $838$

In [29]:
la_crimes.loc[la_crimes['Premis Desc']=='unknown']['Premis Cd'].unique()

array([  0., 838., 418., 256.])

* Next up we see that all the rows with `Premis Cd` = $418$, have `Premis Desc` = 'unknown`
* Thus we can change `Premis Cd` to $0$ (unknown)

In [30]:
cd418 = la_crimes.loc[la_crimes['Premis Cd']==418][['Premis Desc','Premis Cd']]
cd418['Premis Desc'].unique()

array(['unknown'], dtype=object)

* Next up we see that all the rows with `Premis Cd` = $256$, have `Premis Desc` = 'unknown`
* Thus we can change `Premis Cd` to $0$ (unknown)

In [31]:
cd256 = la_crimes.loc[la_crimes['Premis Cd']== 256][['Premis Desc','Premis Cd']]
cd256['Premis Desc'].unique()

array(['unknown'], dtype=object)

* Next up we see that all the rows with `Premis Cd` = $838$, have `Premis Desc` = 'unknown`
* Thus we can change `Premis Cd` to $0$ (unknown)

In [32]:
cd838 = la_crimes.loc[la_crimes['Premis Cd']== 838][['Premis Desc','Premis Cd']]
cd838['Premis Desc'].unique()

array(['unknown'], dtype=object)

* We change all the rows with `Premis Cd` equal to $256$, $418$ or $838$, to $0$ (unknown)

In [33]:
indexes_wrong_premiscd = la_crimes.loc[(la_crimes ['Premis Cd'] == 256) | 
                                      (la_crimes ['Premis Cd'] == 418) |
                                      (la_crimes ['Premis Cd'] == 838)].index
la_crimes.loc[indexes_wrong_premiscd,'Premis Cd'] = 0

* Lastly, we transform `Premis Cd` from float to integer

In [34]:
la_crimes['Premis Cd'] = la_crimes['Premis Cd'].astype(int)

## Weapon Used Cd and Weapon Desc columns
Some rows have null values for the column `Weapon Used Cd`

* We replace those null values with `0` because this column contains float datatypes

In [35]:
la_crimes.loc[la_crimes['Weapon Used Cd'].isna()]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,1307355,2010-02-20,2010-02-20 13:50:00,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,...,AA,Adult Arrest,900.0,,,,300 E GAGE AV,,33.9825,-118.2695
1,11401303,2010-09-13,2010-09-12 00:45:00,14,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,...,IC,Invest Cont,740.0,,,,SEPULVEDA BL,MANCHESTER AV,33.9599,-118.3962
2,70309629,2010-08-09,2010-08-09 15:15:00,13,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,0344,...,IC,Invest Cont,946.0,,,,1300 E 21ST ST,,34.0224,-118.2524
5,100100506,2010-01-05,2010-01-04 16:50:00,1,Central,162,1,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),0344 1402,...,AA,Adult Arrest,442.0,,,,700 W 7TH ST,,34.0480,-118.2577
6,100100508,2010-01-08,2010-01-07 20:05:00,1,Central,182,1,330,BURGLARY FROM VEHICLE,0344,...,IC,Invest Cont,330.0,,,,PICO BL,GRAND AV,34.0389,-118.2643
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114687,191408297,2019-03-11,2019-03-08 12:00:00,14,Pacific,1438,1,440,THEFT PLAIN - PETTY ($950 & UNDER),0344 1501 1608 1607 0349,...,IC,Invest Cont,440.0,,,,3600 MIDVALE AV,,34.0204,-118.4139
2114691,191307168,2019-02-28,2019-02-28 07:00:00,13,Newton,1394,1,510,VEHICLE - STOLEN,unknown,...,IC,Invest Cont,510.0,,,,100 E 67TH ST,,33.9788,-118.2739
2114695,190715222,2019-08-15,2019-08-14 18:10:00,7,Wilshire,701,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),1300 0344,...,IC,Invest Cont,331.0,,,,WILLOUGHBY AV,ORLANDO AV,34.0871,-118.3732
2114697,191716777,2019-10-17,2019-10-16 18:00:00,17,Devonshire,1795,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,IC,Invest Cont,420.0,,,,17200 NAPA ST,,34.2266,-118.5085


In [36]:
indexes_null_weaponcd = la_crimes.loc[la_crimes ['Weapon Used Cd'].isna()].index
la_crimes.loc[indexes_null_weaponcd,'Weapon Used Cd'] = 0

Also rows have null values for the column `Weapon Desc`

* We replace those null values with `unknown` because this column contains object datatypes

In [37]:
la_crimes.loc[la_crimes['Weapon Desc'].isna()]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,1307355,2010-02-20,2010-02-20 13:50:00,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,...,AA,Adult Arrest,900.0,,,,300 E GAGE AV,,33.9825,-118.2695
1,11401303,2010-09-13,2010-09-12 00:45:00,14,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,...,IC,Invest Cont,740.0,,,,SEPULVEDA BL,MANCHESTER AV,33.9599,-118.3962
2,70309629,2010-08-09,2010-08-09 15:15:00,13,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,0344,...,IC,Invest Cont,946.0,,,,1300 E 21ST ST,,34.0224,-118.2524
5,100100506,2010-01-05,2010-01-04 16:50:00,1,Central,162,1,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),0344 1402,...,AA,Adult Arrest,442.0,,,,700 W 7TH ST,,34.0480,-118.2577
6,100100508,2010-01-08,2010-01-07 20:05:00,1,Central,182,1,330,BURGLARY FROM VEHICLE,0344,...,IC,Invest Cont,330.0,,,,PICO BL,GRAND AV,34.0389,-118.2643
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114687,191408297,2019-03-11,2019-03-08 12:00:00,14,Pacific,1438,1,440,THEFT PLAIN - PETTY ($950 & UNDER),0344 1501 1608 1607 0349,...,IC,Invest Cont,440.0,,,,3600 MIDVALE AV,,34.0204,-118.4139
2114691,191307168,2019-02-28,2019-02-28 07:00:00,13,Newton,1394,1,510,VEHICLE - STOLEN,unknown,...,IC,Invest Cont,510.0,,,,100 E 67TH ST,,33.9788,-118.2739
2114695,190715222,2019-08-15,2019-08-14 18:10:00,7,Wilshire,701,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),1300 0344,...,IC,Invest Cont,331.0,,,,WILLOUGHBY AV,ORLANDO AV,34.0871,-118.3732
2114697,191716777,2019-10-17,2019-10-16 18:00:00,17,Devonshire,1795,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,IC,Invest Cont,420.0,,,,17200 NAPA ST,,34.2266,-118.5085


In [38]:
indexes_null_weapondesc = la_crimes.loc[la_crimes ['Weapon Desc'].isna()].index
la_crimes.loc[indexes_null_weapondesc,'Weapon Desc'] = 'unknown'

Lastly, we transform the column `Weapon Used Cd` from float to integer

In [39]:
 la_crimes['Weapon Used Cd'] = la_crimes['Weapon Used Cd'].astype(int)

## Status and Status Desc columns
* Now we check the status of the crime incident
* We see that the columns `Status` and `Status Desc` have some differences.

In [40]:
la_crimes['Status'].value_counts()

IC    1623829
AO     250589
AA     219081
JA      15864
JO       5301
CC         29
13          1
19          1
TH          1
Name: Status, dtype: int64

In [41]:
la_crimes['Status Desc'].value_counts()

Invest Cont     1623829
Adult Other      250589
Adult Arrest     219081
Juv Arrest        15864
Juv Other          5301
UNK                  35
Name: Status Desc, dtype: int64

* We observe that all the values with a Status `CC` , `19` , `TH` , `13` have a Status Description `UNK`
* That's why we change the rows with a Status `19` , `TH` , `13`(because those only appear once) or null to `CC` 

In [42]:
la_crimes.loc[la_crimes['Status Desc'] == 'UNK']

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
100040,101208618,2010-03-02,2010-03-02 03:30:00,12,77th Street,1248,1,210,ROBBERY,0305 0342 0344 0416 1008,...,,UNK,210.0,,,,70TH ST,MENLO,33.9764,-118.2892
151803,101700682,2010-03-09,2010-03-08 11:55:00,17,Devonshire,1764,2,653,"CREDIT CARDS, FRAUD USE ($950.01 & OVER)",0377 0930 1402 1822,...,CC,UNK,653.0,998.0,,,19300 NORDHOFF ST,,34.2355,-118.5536
160732,101721148,2010-11-15,2010-11-14 17:00:00,17,Devonshire,1756,2,900,VIOLATION OF COURT ORDER,1501,...,CC,UNK,900.0,,,,17800 LASSEN ST,,34.2504,-118.5216
219776,112109831,2011-04-29,2011-04-28 22:00:00,21,Topanga,2139,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),0344 1202,...,CC,UNK,331.0,,,,7300 CORBIN AV,,34.2031,-118.5623
285526,111204921,2011-01-14,2011-01-12 18:30:00,12,77th Street,1266,1,210,ROBBERY,0202 0305 0344 0370 0416 0429 0906 1251 1259 1822,...,CC,UNK,210.0,,,,HOOVER ST,83RD ST,33.9632,-118.2871
400260,111225404,2011-10-26,2011-10-26 07:20:00,12,77th Street,1243,2,920,KIDNAPPING - GRAND ATTEMPT,0305 1251 1258 1313 1822,...,CC,UNK,920.0,,,,68TH ST,VAN NESS,33.979,-118.3112
485891,141215426,2014-06-30,2012-01-01 12:00:00,12,77th Street,1268,2,354,THEFT OF IDENTITY,0100 1822 0917,...,CC,UNK,354.0,,,,200 E 85TH ST,,33.961,-118.2717
563146,120619583,2012-07-16,2012-06-16 12:00:00,6,Hollywood,644,2,922,CHILD STEALING,unknown,...,CC,UNK,922.0,986.0,,,1300 N VISTA ST,,34.0944,-118.3517
598768,120123632,2012-11-17,2012-11-16 20:30:00,1,Central,192,2,888,TRESPASSING,0601 1609 0329,...,TH,UNK,888.0,,,,400 W VENICE BL,,34.0365,-118.2676
631373,131411831,2013-04-26,2013-04-02 23:00:00,14,Pacific,1494,1,440,THEFT PLAIN - PETTY ($950 & UNDER),0344,...,CC,UNK,440.0,,,,00 WORLD WY,,33.9454,-118.3998


In [43]:
indexes_status = la_crimes.loc[(la_crimes['Status'] == '19')|
                              (la_crimes['Status'] == '13') | (la_crimes['Status'] == 'TH')|
                               (la_crimes['Status'].isna())
                              ]['Status'].index #finding the indexes of nan Status or status 19,13,TH
la_crimes.loc[indexes_status,'Status'] = 'CC'

* We consider that the value `UNK` for the column `Status Desc` means unknown
* Then we will change the Nan values for column `Status Desc` to `UNK`

In [44]:
indexes_status_desc = la_crimes.loc[la_crimes['Status Desc'].isna()]['Status'].index #finding the indexes of nan Status or status 19,13,TH
la_crimes.loc[indexes_status_desc,'Status Desc'] = 'UNK'

## Crm Cd columns
Also, reading the documentation of the columns `Crm Cd` indicates the crime committed. `Crm Cd 1` is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses
* As a result it doesn't make sense to only have a `Crm Cd 2` offense with no `Crm Cd 1` offense
* Thus we will update all crime reports with null `Crm Cd 1` and some `Crm Cd 2` or `Crm Cd 3` offenses so that in every case `Crm Cd 1` has a value.
* If there were 2 offenses then there will be only values for `Crm Cd 1` and `Crm Cd 2` and so on

In [45]:
indexes_crm1null = la_crimes.loc[(la_crimes ['Crm Cd 1'].isna()) ].index
la_crimes.loc[indexes_crm1null,:]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
55532,100707214,2010-03-14,2010-03-13 02:30:00,7,Wilshire,767,2,624,BATTERY - SIMPLE ASSAULT,0344 0400 0416 1300,...,IC,Invest Cont,,624.0,,,PICO BL,NORTON AV,34.0476,-118.3239
288235,110310134,2011-04-08,2011-03-26 08:00:00,3,Southwest,312,2,942,BRIBERY,1300 1402,...,IC,Invest Cont,,942.0,99.0,,5100 ROSELAND ST,,34.0274,-118.3542
358506,110811926,2011-07-01,2011-07-01 20:11:00,8,West LA,835,1,210,ROBBERY,unknown,...,IC,Invest Cont,,210.0,,,11000 SANTA MONICA BL,,34.0484,-118.4411
507665,120325216,2012-11-19,2012-11-19 19:30:00,3,Southwest,329,1,440,THEFT PLAIN - PETTY ($950 & UNDER),0344,...,IC,Invest Cont,,440.0,,,500 W 27TH ST,,34.0268,-118.2753
1176227,150318476,2015-08-17,2015-08-16 12:00:00,3,Southwest,363,1,761,BRANDISH WEAPON,0913 0906 0334 0421 0319 0444 0432 1816,...,AO,Adult Other,,761.0,93.0,,4100 PALMWOOD DR,,34.0137,-118.3435
1188315,150517852,2015-11-09,2015-10-09 18:00:00,5,Harbor,529,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),0344 1300 1606 0321,...,IC,Invest Cont,,420.0,,,200 BERTH,,33.7753,-118.2456
1697697,181824031,2018-12-13,2018-12-13 19:40:00,18,Southeast,1842,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0906 0450 1402,...,IC,Invest Cont,,230.0,,,400 W 109TH ST,,33.9374,-118.2805
1875523,181117551,2018-10-05,2018-10-05 09:00:00,11,Northeast,1162,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0445 0432 1822 0443 1266,...,AA,Adult Arrest,,230.0,93.0,,1300 N VERMONT AV,,34.0956,-118.2918
1977968,191400827,2019-07-20,2019-07-20 09:45:00,14,Pacific,1463,2,888,TRESPASSING,1501,...,IC,Invest Cont,,888.0,,,5300 ALLA RD,,33.9779,-118.4264
2080625,190308122,2019-03-05,2019-03-05 21:30:00,3,Southwest,395,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),0216 0344 1606 1822 1300,...,IC,Invest Cont,,331.0,,,39TH ST,NORMANDIE AV,34.0073,-118.3108


In [46]:
la_crimes.loc[indexes_crm1null,'Crm Cd 1'] = la_crimes.loc[indexes_crm1null,'Crm Cd 2']
la_crimes.loc[indexes_crm1null,'Crm Cd 2'] = la_crimes.loc[indexes_crm1null,'Crm Cd 3']
la_crimes.loc[indexes_crm1null,'Crm Cd 3'] = la_crimes.loc[indexes_crm1null,'Crm Cd 4']

We fill up all the null values for `Crm Cd 2` and `Crm Cd 3` and `Crm Cd 4` with -1 (meaning none is found) because those columns consist of float datatype

In [47]:
"""indexes_null_crm2 = la_crimes.loc[la_crimes ['Crm Cd 2'].isna()].index
la_crimes.loc[indexes_null_crm2,'Crm Cd 2'] = -1

indexes_null_crm3 = la_crimes.loc[la_crimes ['Crm Cd 3'].isna()].index
la_crimes.loc[indexes_null_crm3,'Crm Cd 3'] = -1

indexes_null_crm4 = la_crimes.loc[la_crimes ['Crm Cd 4'].isna()].index
la_crimes.loc[indexes_null_crm4,'Crm Cd 4'] = -1"""

la_crimes ['Crm Cd 2'] = la_crimes ['Crm Cd 2'].fillna(-1)
la_crimes ['Crm Cd 3'] = la_crimes ['Crm Cd 3'].fillna(-1)
la_crimes ['Crm Cd 4'] = la_crimes ['Crm Cd 4'].fillna(-1)

Lastly, we transform the columns `Crm Cd 1`, `Crm Cd 2`, `Crm Cd 3`, `Crm Cd 4` from float to integer

In [48]:
la_crimes['Crm Cd 1'] = la_crimes['Crm Cd 1'].astype(int)
la_crimes['Crm Cd 2'] = la_crimes['Crm Cd 2'].astype(int)
la_crimes['Crm Cd 3'] = la_crimes['Crm Cd 3'].astype(int)
la_crimes['Crm Cd 4'] = la_crimes['Crm Cd 4'].astype(int)

* We check if for every crime incident, the column `Crm Cd` and `Crm Cd 1` are the same.
* Some are not the same so we consider them mistaken because `Crm Cd 1` is the primary crime commited and `Crm Cd` describes the crime commited
* For those incidents we set `Crm Cd 1` to be equal to `Crm Cd`

* First case is that `Crm Cd` has the same value as `Crm Cd 2`
* Then we just have to swap `Crm Cd 1` and `Crm Cd 2`

In [49]:
indexes_crm_crm2 = la_crimes.loc[la_crimes ['Crm Cd'] ==  la_crimes ['Crm Cd 2']].index
la_crimes.loc[indexes_crm_crm2,'Crm Cd 2'] = la_crimes.loc[indexes_crm_crm2,'Crm Cd 1'].copy()
la_crimes.loc[indexes_crm_crm2,'Crm Cd 1'] = la_crimes.loc[indexes_crm_crm2,'Crm Cd'].copy()

* Second case is that `Crm Cd` has the same value as `Crm Cd 3`
* Then we just have to swap `Crm Cd 1` and `Crm Cd 3`

In [50]:
indexes_crm_crm3 = la_crimes.loc[la_crimes ['Crm Cd'] ==  la_crimes ['Crm Cd 3']].index
la_crimes.loc[indexes_crm_crm3,'Crm Cd 3'] = la_crimes.loc[indexes_crm_crm3,'Crm Cd 1'].copy()
la_crimes.loc[indexes_crm_crm3,'Crm Cd 1'] = la_crimes.loc[indexes_crm_crm3,'Crm Cd'].copy()

* Third case is that `Crm Cd` has the same value as `Crm Cd 4`
* However, there is no incident like that

In [51]:
la_crimes.loc[la_crimes ['Crm Cd'] ==  la_crimes ['Crm Cd 4']].index

Int64Index([], dtype='int64')

* Lastly, because `Crm Cd` and `Crm Cd 1` are the same now, we can drop `Crm Cd 1`

In [52]:
la_crimes = la_crimes.drop(['Crm Cd 1'], axis=1)

## LOCATION Column
* Some locations have a bunch of white spaces in between their words

In [53]:
la_crimes ['LOCATION'].value_counts()

6TH                          ST            4756
7TH                          ST            3774
9300    TAMPA                        AV    3658
6TH                                        3235
6600    TOPANGA CANYON               BL    3064
                                           ... 
1100 N  AVENUE 45                             1
3500 W  23RD                         ST       1
8600    CRESCENT                     DR       1
5300    ALMAONT                      ST       1
1400    SHATTO                       ST       1
Name: LOCATION, Length: 75251, dtype: int64

* The unnecessary white spaces are removed from the middle, front and back of the words

In [54]:
 la_crimes['LOCATION'] = la_crimes['LOCATION'].str.replace(' +', ' ').str.strip()

## Cross Street column
* Moreover, we see that there are many null values for the column `Cross Street`
* We changed those to 'unknown'

In [55]:
indexes_null_cross = la_crimes.loc[la_crimes ['Cross Street'].isna()].index
la_crimes.loc[indexes_null_cross,'Cross Street'] = 'unknown'

* In addition, we see many cross streets have a bunch of white spaces in between the name and the street (AV/BL).

In [56]:
la_crimes['Cross Street'].value_counts()

unknown                            1759334
BROADWAY                              6157
FIGUEROA                              3801
VERMONT                      AV       3746
SAN PEDRO                             3659
                                    ...   
NARCISSUS                    CT          1
AVERY                        AV          1
MOZART                       AV          1
ELDER                                    1
ROYAL HILLS                              1
Name: Cross Street, Length: 12869, dtype: int64

* We are going to remove the unnecessary white spaces in the middle, front and back of the words 

In [57]:
la_crimes['Cross Street'] = la_crimes['Cross Street'].str.replace(' +', ' ').str.strip()

* Now we observe that for most crime incidents this column has the value `unknown`
* In detail that is true, for the $1.76$ million incidents out of the $2.1$ million incidents that there are
* Thus we decide to drop this column

In [58]:
la_crimes.loc[la_crimes['Cross Street'] == 'unknown']

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Weapon Desc,Status,Status Desc,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,1307355,2010-02-20,2010-02-20 13:50:00,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,...,unknown,AA,Adult Arrest,-1,-1,-1,300 E GAGE AV,unknown,33.9825,-118.2695
2,70309629,2010-08-09,2010-08-09 15:15:00,13,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,0344,...,unknown,IC,Invest Cont,-1,-1,-1,1300 E 21ST ST,unknown,34.0224,-118.2524
5,100100506,2010-01-05,2010-01-04 16:50:00,1,Central,162,1,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),0344 1402,...,unknown,AA,Adult Arrest,-1,-1,-1,700 W 7TH ST,unknown,34.0480,-118.2577
7,100100509,2010-01-09,2010-01-08 21:00:00,1,Central,157,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0416,...,UNKNOWN WEAPON/OTHER WEAPON,AA,Adult Arrest,-1,-1,-1,500 CROCKER ST,unknown,34.0435,-118.2427
8,100100510,2010-01-09,2010-01-09 02:30:00,1,Central,171,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0400 0416,...,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont,-1,-1,-1,800 W OLYMPIC BL,unknown,34.0450,-118.2640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114692,190906699,2019-02-24,2019-02-23 22:20:00,9,Van Nuys,904,1,210,ROBBERY,0344 0302 0334 0355 1310 1420 1822 0354,...,OTHER FIREARM,IC,Invest Cont,998,-1,-1,7600 WILLIS AV,unknown,34.2085,-118.4553
2114693,190506304,2019-02-22,2019-02-22 08:40:00,5,Harbor,569,2,627,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,0443 0419 0416 1259,...,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",AO,Adult Other,-1,-1,-1,100 W 22ND ST,unknown,33.7257,-118.2801
2114694,190608903,2019-03-28,2019-03-28 04:00:00,6,Hollywood,644,1,648,ARSON,0601 1501,...,FIRE,IC,Invest Cont,-1,-1,-1,1400 N LA BREA AV,unknown,34.0962,-118.3490
2114697,191716777,2019-10-17,2019-10-16 18:00:00,17,Devonshire,1795,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,unknown,IC,Invest Cont,-1,-1,-1,17200 NAPA ST,unknown,34.2266,-118.5085


In [59]:
la_crimes = la_crimes.drop(['Cross Street'], axis=1)

## LONGTITUDE-LATITUDE
* We see that some crimes have values for longtitude and latitude equal to $0$
* Those values are considered missing because they correspond to the Gulf of Guinea.

In [60]:
la_crimes.loc[(la_crimes['LON'] == 0) & (la_crimes['LAT'] == 0) ]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,...,Weapon Used Cd,Weapon Desc,Status,Status Desc,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,LAT,LON
49703,100618355,2010-07-14,2010-07-12 19:00:00,6,Hollywood,665,1,330,BURGLARY FROM VEHICLE,0344 1300 1302,...,0,unknown,IC,Invest Cont,-1,-1,-1,900 N CISTRUS AV,0.0,0.0
49800,100618603,2010-07-19,2010-07-19 23:45:00,6,Hollywood,665,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329 0906,...,0,unknown,AA,Adult Arrest,998,-1,-1,6300 WILLOUGBY,0.0,0.0
60848,100718479,2010-11-29,2010-11-29 16:30:00,7,Wilshire,709,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0416 1402,...,122,HECKLER & KOCH 93 SEMIAUTOMATIC ASSAULT RIFLE,IC,Invest Cont,998,-1,-1,HARBOR,0.0,0.0
84978,101016365,2010-09-09,2010-08-23 15:00:00,10,West Valley,1000,2,626,INTIMATE PARTNER - SIMPLE ASSAULT,0416 2000,...,0,unknown,IC,Invest Cont,-1,-1,-1,CITY OF WINNETKA,0.0,0.0
123985,101409719,2010-04-01,2010-03-30 21:00:00,14,Pacific,1412,1,510,VEHICLE - STOLEN,unknown,...,0,unknown,IC,Invest Cont,-1,-1,-1,WINDWARD AV,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1982414,191915311,2019-08-22,2019-08-21 11:00:00,19,Mission,1900,1,236,INTIMATE PARTNER - AGGRAVATED ASSAULT,0400 0448 2000 0913,...,400,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",AA,Adult Arrest,-1,-1,-1,UNKNOWN,0.0,0.0
1995724,191409480,2019-03-31,2019-03-31 03:00:00,14,Pacific,1400,2,624,BATTERY - SIMPLE ASSAULT,0400 0416,...,400,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont,-1,-1,-1,96TH,0.0,0.0
2013881,190918499,2019-10-18,2019-10-15 16:30:00,9,Van Nuys,936,2,626,INTIMATE PARTNER - SIMPLE ASSAULT,1814 2000 0448,...,400,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",AA,Adult Arrest,-1,-1,-1,13600 LANAY ST,0.0,0.0
2022655,191719423,2019-12-15,2019-12-13 11:00:00,17,Devonshire,1786,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,...,0,unknown,IC,Invest Cont,-1,-1,-1,BALBOA BL,0.0,0.0


## Removing unnecessary columns
In this case we have all the data in one table (dataframe). Some columns will not assist us in data mining analyses, thus we decide to drop them. In detail the area code, district code, date reported, crime code, part1/2, premises code, incident code, weapon used code and status code

In [61]:
la_crimes = la_crimes.drop(['AREA','Date Rptd', #'Crm Cd',
                            'Premis Cd',
                            'Weapon Used Cd', 'Status'], axis=1)
la_crimes

Unnamed: 0,DR_NO,DATE OCC,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,Vict Age,Vict Sex,Vict Descent,Premis Desc,Weapon Desc,Status Desc,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,LAT,LON
0,1307355,2010-02-20 13:50:00,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,48,M,H,SINGLE FAMILY DWELLING,unknown,Adult Arrest,-1,-1,-1,300 E GAGE AV,33.9825,-118.2695
1,11401303,2010-09-12 00:45:00,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,0,M,W,STREET,unknown,Invest Cont,-1,-1,-1,SEPULVEDA BL,33.9599,-118.3962
2,70309629,2010-08-09 15:15:00,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,0344,0,M,H,ALLEY,unknown,Invest Cont,-1,-1,-1,1300 E 21ST ST,34.0224,-118.2524
3,90631215,2010-01-05 01:50:00,Hollywood,646,2,900,VIOLATION OF COURT ORDER,1100 0400 1402,47,F,W,STREET,HAND GUN,Invest Cont,998,-1,-1,CAHUENGA BL,34.1016,-118.3295
4,100100501,2010-01-02 21:00:00,Central,176,1,122,"RAPE, ATTEMPTED",0400,47,F,H,ALLEY,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Invest Cont,-1,-1,-1,8TH ST,34.0387,-118.2488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114694,190608903,2019-03-28 04:00:00,Hollywood,644,1,648,ARSON,0601 1501,0,X,X,SEX ORIENTED/BOOK STORE/STRIP CLUB/GENTLEMAN'S...,FIRE,Invest Cont,-1,-1,-1,1400 N LA BREA AV,34.0962,-118.3490
2114695,190715222,2019-08-14 18:10:00,Wilshire,701,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),1300 0344,40,M,W,STREET,unknown,Invest Cont,-1,-1,-1,WILLOUGHBY AV,34.0871,-118.3732
2114696,192004409,2019-01-06 21:00:00,Olympic,2029,2,930,CRIMINAL THREATS - NO WEAPON DISPLAYED,0432 0421 0340 0305 0444 0429 0537 1218 0216,46,F,B,SIDEWALK,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Invest Cont,-1,-1,-1,6TH,34.0637,-118.2870
2114697,191716777,2019-10-16 18:00:00,Devonshire,1795,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,17200 NAPA ST,34.2266,-118.5085


Renaming the coumns name for easier use

In [62]:
la_crimes.rename(columns={'DATE OCC':'DATE_OCC','Status Desc':'STATUS',
                         'AREA NAME':'AREA_NAME','Weapon Desc':'WEAPON_DESC',
                         'Crm Cd Desc':'CRM_DESC','Vict Age':'VICT_AGE', 'Vict Sex':'VICT_SEX',
                         'Vict Descent':'VICT_DESC','Crm Cd 2':'CRM_CD2','Premis Desc':'PREMIS_DESC',
                         'Crm Cd 3':'CRM_CD3', 'Crm Cd 4':'CRM_CD4', 'Rpt Dist No':'DISTR_NO','Part 1-2':'PART1_2',
                         'Crm Cd':'CRM_CD'},
                         inplace = True)

In [63]:
la_crimes

Unnamed: 0,DR_NO,DATE_OCC,AREA_NAME,DISTR_NO,PART1_2,CRM_CD,CRM_DESC,Mocodes,VICT_AGE,VICT_SEX,VICT_DESC,PREMIS_DESC,WEAPON_DESC,STATUS,CRM_CD2,CRM_CD3,CRM_CD4,LOCATION,LAT,LON
0,1307355,2010-02-20 13:50:00,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,48,M,H,SINGLE FAMILY DWELLING,unknown,Adult Arrest,-1,-1,-1,300 E GAGE AV,33.9825,-118.2695
1,11401303,2010-09-12 00:45:00,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,0,M,W,STREET,unknown,Invest Cont,-1,-1,-1,SEPULVEDA BL,33.9599,-118.3962
2,70309629,2010-08-09 15:15:00,Newton,1324,2,946,OTHER MISCELLANEOUS CRIME,0344,0,M,H,ALLEY,unknown,Invest Cont,-1,-1,-1,1300 E 21ST ST,34.0224,-118.2524
3,90631215,2010-01-05 01:50:00,Hollywood,646,2,900,VIOLATION OF COURT ORDER,1100 0400 1402,47,F,W,STREET,HAND GUN,Invest Cont,998,-1,-1,CAHUENGA BL,34.1016,-118.3295
4,100100501,2010-01-02 21:00:00,Central,176,1,122,"RAPE, ATTEMPTED",0400,47,F,H,ALLEY,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Invest Cont,-1,-1,-1,8TH ST,34.0387,-118.2488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114694,190608903,2019-03-28 04:00:00,Hollywood,644,1,648,ARSON,0601 1501,0,X,X,SEX ORIENTED/BOOK STORE/STRIP CLUB/GENTLEMAN'S...,FIRE,Invest Cont,-1,-1,-1,1400 N LA BREA AV,34.0962,-118.3490
2114695,190715222,2019-08-14 18:10:00,Wilshire,701,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($400 AND OVER),1300 0344,40,M,W,STREET,unknown,Invest Cont,-1,-1,-1,WILLOUGHBY AV,34.0871,-118.3732
2114696,192004409,2019-01-06 21:00:00,Olympic,2029,2,930,CRIMINAL THREATS - NO WEAPON DISPLAYED,0432 0421 0340 0305 0444 0429 0537 1218 0216,46,F,B,SIDEWALK,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Invest Cont,-1,-1,-1,6TH,34.0637,-118.2870
2114697,191716777,2019-10-16 18:00:00,Devonshire,1795,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,17200 NAPA ST,34.2266,-118.5085


## Decision trees 
Now we will try to predict the crime type (description) based on all the other columns of la_crimes (of course without the crime code)
In detail the data to be used for the prediction are:
* Hour of the incident
* Weekday of the incident
* Area of the incident
* District
* Part 1 or 2 type
* Mocodes which are activities associated with the suspect in commission of the crime
* Victim age
* Victim sex
* Victim descent
* Premises where the incident happened
* Weapon used 
* Status of the crime
* Extra crimes committed (less sreious ones)
* Location name
* Longtitude and latitude of the incident

* Importing the libraries

In [64]:
from sklearn.model_selection import train_test_split
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import classification_report

* We only select the data of two months (2019 April and May) because of memory RAM restrictions. In detail, because of the fact that most of the columns are categorical variables, it will result in a huge amount of columns after we create dummy variables to depict our categorical data.

In [65]:
la_crimes_new = la_crimes.loc[(la_crimes['DATE_OCC'].dt.to_period('M') == '2019-04')| 
                              (la_crimes['DATE_OCC'].dt.to_period('M') == '2019-05')].copy()
la_crimes_new

Unnamed: 0,DR_NO,DATE_OCC,AREA_NAME,DISTR_NO,PART1_2,CRM_CD,CRM_DESC,Mocodes,VICT_AGE,VICT_SEX,VICT_DESC,PREMIS_DESC,WEAPON_DESC,STATUS,CRM_CD2,CRM_CD3,CRM_CD4,LOCATION,LAT,LON
1847402,190408859,2019-04-21 18:30:00,Hollenbeck,449,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,DOBINSON,34.0516,-118.1982
1848938,190111434,2019-04-17 04:00:00,Central,147,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,400 S SAN PEDRO ST,34.0453,-118.2443
1848966,190508840,2019-04-21 20:00:00,Harbor,566,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,900 S MESA ST,33.7360,-118.2857
1898385,191715448,2019-05-29 07:30:00,Devonshire,1798,2,901,VIOLATION OF RESTRAINING ORDER,1501 2038 1906 2000,54,F,W,SINGLE FAMILY DWELLING,unknown,Adult Other,-1,-1,-1,8900 AQUEDUCT AV,34.2320,-118.4742
1898387,190715605,2019-05-26 10:30:00,Wilshire,729,2,956,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",1906,47,M,O,SINGLE FAMILY DWELLING,unknown,Adult Other,-1,-1,-1,200 S LUCERNE BL,34.0711,-118.3248
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114670,191710450,2019-05-25 06:30:00,Devonshire,1761,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,9200 ETON AV,34.2376,-118.5947
2114672,190500773,2019-05-28 17:00:00,Harbor,541,2,627,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,1258 0400 0913,9,F,B,SINGLE FAMILY DWELLING,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Adult Other,-1,-1,-1,900 W BATTERY ST,33.7547,-118.2967
2114676,190508874,2019-04-22 22:00:00,Harbor,524,2,626,INTIMATE PARTNER - SIMPLE ASSAULT,2000 1813 0444,46,F,H,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)","STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Adult Other,-1,-1,-1,500 KING AV,33.7749,-118.2750
2114679,191111014,2019-05-06 09:30:00,Northeast,1133,1,310,BURGLARY,0344,38,M,W,PUBLIC STORAGE,unknown,Invest Cont,-1,-1,-1,2900 CASITAS AV,34.1109,-118.2464


* We will also create new categorical columns that will contain the month and hour of the incident 
* We make those columns from the column `DATE_OCC`

In [66]:
la_crimes_new['WEEKDAY'] = la_crimes_new['DATE_OCC'].dt.dayofweek.astype(str).copy()
la_crimes_new['HOUR'] = la_crimes_new['DATE_OCC'].astype(str).str[11:13].copy()

Now we create a dataset that only contains the columns that are useful. Those columns are dropped:
* `DR_NO` does not help as it is a unique number
* `DATE_OCC` also is cut because we already have the hour and day in different columns

In [67]:
la_crimes_dum = la_crimes_new[['LAT','LON','LOCATION','CRM_CD2','CRM_CD3','CRM_CD4','VICT_AGE','Mocodes','WEEKDAY',
                               'AREA_NAME','DISTR_NO','CRM_DESC','PREMIS_DESC','HOUR',
                               'WEAPON_DESC','VICT_DESC','VICT_SEX','PART1_2']].copy()
la_crimes_dum

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,CRM_CD3,CRM_CD4,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,DISTR_NO,CRM_DESC,PREMIS_DESC,HOUR,WEAPON_DESC,VICT_DESC,VICT_SEX,PART1_2
1847402,34.0516,-118.1982,DOBINSON,-1,-1,-1,0,unknown,6,Hollenbeck,449,VEHICLE - STOLEN,STREET,18,unknown,X,X,1
1848938,34.0453,-118.2443,400 S SAN PEDRO ST,-1,-1,-1,0,unknown,2,Central,147,VEHICLE - STOLEN,STREET,04,unknown,X,X,1
1848966,33.7360,-118.2857,900 S MESA ST,-1,-1,-1,0,unknown,6,Harbor,566,VEHICLE - STOLEN,STREET,20,unknown,X,X,1
1898385,34.2320,-118.4742,8900 AQUEDUCT AV,-1,-1,-1,54,1501 2038 1906 2000,2,Devonshire,1798,VIOLATION OF RESTRAINING ORDER,SINGLE FAMILY DWELLING,07,unknown,W,F,2
1898387,34.0711,-118.3248,200 S LUCERNE BL,-1,-1,-1,47,1906,6,Wilshire,729,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",SINGLE FAMILY DWELLING,10,unknown,O,M,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114670,34.2376,-118.5947,9200 ETON AV,-1,-1,-1,0,unknown,5,Devonshire,1761,VEHICLE - STOLEN,STREET,06,unknown,X,X,1
2114672,33.7547,-118.2967,900 W BATTERY ST,-1,-1,-1,9,1258 0400 0913,1,Harbor,541,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,SINGLE FAMILY DWELLING,17,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",B,F,2
2114676,33.7749,-118.2750,500 KING AV,-1,-1,-1,46,2000 1813 0444,0,Harbor,524,INTIMATE PARTNER - SIMPLE ASSAULT,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",22,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",H,F,2
2114679,34.1109,-118.2464,2900 CASITAS AV,-1,-1,-1,38,0344,0,Northeast,1133,BURGLARY,PUBLIC STORAGE,09,unknown,W,M,1


Now we create the dummies for the column `Mocodes`. The problem for this column is that one particular incident may have more than one different mocodes. As a result  'get_dummies()' would create new dummies for multiple (pairs, triplets of) mocodes which would be a mistake. We would miss crimes that contain similar mocodes and the dataframe would get even bigger.
* We separate the mocodes from each other and then execute the dummies
* Each mocode may correspond to multiple incidents

In [68]:
mocodes_dummies= la_crimes_dum['Mocodes'].str.join(sep='').str.get_dummies(sep=' ') 
mocodes_dummies

Unnamed: 0,0100,0101,0102,0104,0105,0107,0109,0110,0112,0113,...,3030,3034,3037,3101,3401,3701,4018,4026,9999,unknown
1847402,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1848938,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1848966,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1898385,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1898387,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114670,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2114672,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2114676,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2114679,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next up we join the data of the mocode dummies and `la_crimes_dum` 

In [69]:
la_crimes_dum = pd.merge(la_crimes_dum,mocodes_dummies, left_on=la_crimes_dum.index, right_on=mocodes_dummies.index )
la_crimes_dum = la_crimes_dum.drop(['key_0'], axis=1) # dropping the new column of the index(keys)
la_crimes_dum

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,CRM_CD3,CRM_CD4,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,...,3030,3034,3037,3101,3401,3701,4018,4026,9999,unknown
0,34.0516,-118.1982,DOBINSON,-1,-1,-1,0,unknown,6,Hollenbeck,...,0,0,0,0,0,0,0,0,0,1
1,34.0453,-118.2443,400 S SAN PEDRO ST,-1,-1,-1,0,unknown,2,Central,...,0,0,0,0,0,0,0,0,0,1
2,33.7360,-118.2857,900 S MESA ST,-1,-1,-1,0,unknown,6,Harbor,...,0,0,0,0,0,0,0,0,0,1
3,34.2320,-118.4742,8900 AQUEDUCT AV,-1,-1,-1,54,1501 2038 1906 2000,2,Devonshire,...,0,0,0,0,0,0,0,0,0,0
4,34.0711,-118.3248,200 S LUCERNE BL,-1,-1,-1,47,1906,6,Wilshire,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36378,34.2376,-118.5947,9200 ETON AV,-1,-1,-1,0,unknown,5,Devonshire,...,0,0,0,0,0,0,0,0,0,1
36379,33.7547,-118.2967,900 W BATTERY ST,-1,-1,-1,9,1258 0400 0913,1,Harbor,...,0,0,0,0,0,0,0,0,0,0
36380,33.7749,-118.2750,500 KING AV,-1,-1,-1,46,2000 1813 0444,0,Harbor,...,0,0,0,0,0,0,0,0,0,0
36381,34.1109,-118.2464,2900 CASITAS AV,-1,-1,-1,38,0344,0,Northeast,...,0,0,0,0,0,0,0,0,0,0


* Now we will do the same with the columns `CRM_CD2` `CRM_CD3` `CRM_CD4`
* At first we join the string into one column  `CRM_CD2`
* Drop the other two columns

In [70]:
la_crimes_dum['CRM_CD2'] = la_crimes_dum['CRM_CD2'].astype(str)+ ' ' +la_crimes_dum['CRM_CD3'].astype(str) +' '+\
                            +la_crimes_dum['CRM_CD4'].astype(str)
la_crimes_dum = la_crimes_dum.drop(['CRM_CD3','CRM_CD4'], axis=1)
la_crimes_dum

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,DISTR_NO,CRM_DESC,...,3030,3034,3037,3101,3401,3701,4018,4026,9999,unknown
0,34.0516,-118.1982,DOBINSON,-1 -1 -1,0,unknown,6,Hollenbeck,449,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
1,34.0453,-118.2443,400 S SAN PEDRO ST,-1 -1 -1,0,unknown,2,Central,147,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
2,33.7360,-118.2857,900 S MESA ST,-1 -1 -1,0,unknown,6,Harbor,566,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
3,34.2320,-118.4742,8900 AQUEDUCT AV,-1 -1 -1,54,1501 2038 1906 2000,2,Devonshire,1798,VIOLATION OF RESTRAINING ORDER,...,0,0,0,0,0,0,0,0,0,0
4,34.0711,-118.3248,200 S LUCERNE BL,-1 -1 -1,47,1906,6,Wilshire,729,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36378,34.2376,-118.5947,9200 ETON AV,-1 -1 -1,0,unknown,5,Devonshire,1761,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
36379,33.7547,-118.2967,900 W BATTERY ST,-1 -1 -1,9,1258 0400 0913,1,Harbor,541,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,...,0,0,0,0,0,0,0,0,0,0
36380,33.7749,-118.2750,500 KING AV,-1 -1 -1,46,2000 1813 0444,0,Harbor,524,INTIMATE PARTNER - SIMPLE ASSAULT,...,0,0,0,0,0,0,0,0,0,0
36381,34.1109,-118.2464,2900 CASITAS AV,-1 -1 -1,38,0344,0,Northeast,1133,BURGLARY,...,0,0,0,0,0,0,0,0,0,0


* We separate the extra crime codes() that are located in `CRM_CD2` from each other and then produce the dummies
* Each extra crime code may correspond to multiple incidents

In [71]:
crime_codes_dummies= la_crimes_dum['CRM_CD2'].str.join(sep='').str.get_dummies(sep=' ') 
crime_codes_dummies = crime_codes_dummies.drop(['-1'], axis=1) # dropping -1 column as it concerns incidents with no extra crime type
crime_codes_dummies

Unnamed: 0,210,231,235,236,320,330,341,343,345,350,...,930,933,940,946,956,990,993,997,998,999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36378,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36379,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36380,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36381,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next up we join the data of the crime codes dummies and la_crimes_dum

In [72]:
la_crimes_dum = pd.merge(la_crimes_dum, crime_codes_dummies, left_on=la_crimes_dum.index,
                         right_on=crime_codes_dummies.index )
la_crimes_dum = la_crimes_dum.drop(['key_0'], axis=1) # dropping the new column of the index(keys) 
la_crimes_dum

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,DISTR_NO,CRM_DESC,...,930,933,940,946,956,990,993,997,998,999
0,34.0516,-118.1982,DOBINSON,-1 -1 -1,0,unknown,6,Hollenbeck,449,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
1,34.0453,-118.2443,400 S SAN PEDRO ST,-1 -1 -1,0,unknown,2,Central,147,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
2,33.7360,-118.2857,900 S MESA ST,-1 -1 -1,0,unknown,6,Harbor,566,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
3,34.2320,-118.4742,8900 AQUEDUCT AV,-1 -1 -1,54,1501 2038 1906 2000,2,Devonshire,1798,VIOLATION OF RESTRAINING ORDER,...,0,0,0,0,0,0,0,0,0,0
4,34.0711,-118.3248,200 S LUCERNE BL,-1 -1 -1,47,1906,6,Wilshire,729,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36378,34.2376,-118.5947,9200 ETON AV,-1 -1 -1,0,unknown,5,Devonshire,1761,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
36379,33.7547,-118.2967,900 W BATTERY ST,-1 -1 -1,9,1258 0400 0913,1,Harbor,541,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,...,0,0,0,0,0,0,0,0,0,0
36380,33.7749,-118.2750,500 KING AV,-1 -1 -1,46,2000 1813 0444,0,Harbor,524,INTIMATE PARTNER - SIMPLE ASSAULT,...,0,0,0,0,0,0,0,0,0,0
36381,34.1109,-118.2464,2900 CASITAS AV,-1 -1 -1,38,0344,0,Northeast,1133,BURGLARY,...,0,0,0,0,0,0,0,0,0,0


* Now we can create the dummies for all the other variables.
* We get dummies for all the columns except for `LAT` `LON` and `CRM_DESC`
Note: we get dummies for the column victim age as well, because the missing values are coded as 0. If we used the age as a continious variable, it could produce wrong results

In [73]:
la_crimes_dum = pd.get_dummies(la_crimes_dum, columns=['LOCATION','VICT_AGE',
                                                       'AREA_NAME','DISTR_NO', 'PREMIS_DESC', 'WEAPON_DESC', 
                                                       'VICT_DESC', 'VICT_SEX', 'PART1_2','WEEKDAY','HOUR'])
la_crimes_dum = la_crimes_dum.drop(['Mocodes','CRM_CD2'], axis=1)
la_crimes_dum

Unnamed: 0,LAT,LON,CRM_DESC,0100,0101,0102,0104,0105,0107,0109,...,HOUR_14,HOUR_15,HOUR_16,HOUR_17,HOUR_18,HOUR_19,HOUR_20,HOUR_21,HOUR_22,HOUR_23
0,34.0516,-118.1982,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,34.0453,-118.2443,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,33.7360,-118.2857,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,34.2320,-118.4742,VIOLATION OF RESTRAINING ORDER,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,34.0711,-118.3248,"LETTERS, LEWD - TELEPHONE CALLS, LEWD",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36378,34.2376,-118.5947,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36379,33.7547,-118.2967,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
36380,33.7749,-118.2750,INTIMATE PARTNER - SIMPLE ASSAULT,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
36381,34.1109,-118.2464,BURGLARY,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


* Firstly, we shuffle the data
* We separate the target column from the other columns
* Now we split the data into training and testing. Specifically, we use 20% for testing purposes
* The target column is `CRM_DESC`
* Kfold cross validation is not used due to memory RAM restrictions

In [74]:
la_crimes_dum = la_crimes_dum.sample(frac = 1, random_state=9) #shuffling data
target = la_crimes_dum['CRM_DESC'].copy() # target column separated
la_crimes_dum = la_crimes_dum.drop(['CRM_DESC'], axis=1) #deleting target column from data
X_train, X_test, y_train, y_test = train_test_split(la_crimes_dum, target, test_size=0.2,  random_state=7)

## Decision Tree
First predictive model that is going to be used is a decision tree
* Firstly a classifier is created. For the decision tree and all the classifiers the criterion `gini` is used because it produces better accuracy compared to `entropy`
* Then we train it with the train data
* Lastly, we predict the test data

In [75]:
clf_gini = tree.DecisionTreeClassifier(criterion='gini')
clf_gini = clf_gini.fit(X_train, y_train)
gini_predict = clf_gini.predict(X_test)

Now we can see how our decision tree performed
* $67$% total accuracy. Accuracy is the proportion of instances correctly classified by the classifier.
* Weighted average of precision is $66$%. Precision is the ratio of correctly reported positives over all reported positives
* Weighted average of recall is $67$%. Recall is the ratio of correctly reported positives over all actual positives
* Weighted average of f1-score is $66$%. F1-score is the harmonic mean of the precision and the recall.

In [76]:
print(classification_report(y_test, gini_predict, zero_division=1))

                                                          precision    recall  f1-score   support

                                                   ARSON       0.47      0.47      0.47        17
            ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER       1.00      0.67      0.80         9
          ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT       0.82      0.87      0.84       365
                                       ATTEMPTED ROBBERY       0.66      0.40      0.49        53
                                BATTERY - SIMPLE ASSAULT       0.89      0.92      0.91       622
                                BATTERY ON A FIREFIGHTER       1.00      0.50      0.67         2
                                 BATTERY POLICE (SIMPLE)       0.76      0.80      0.78        20
                             BATTERY WITH SEXUAL CONTACT       0.61      0.76      0.68        46
                                           BIKE - STOLEN       0.23      0.21      0.22        80
                   

* Now we visualize the decision tree that was created
* It is visible that we can easily interpret the decision tree, thus they are also called white boxes

In [117]:
temp = []
for item in X_test.columns:
    temp.append(item)

In [118]:
text_representation = tree.export_text(clf_gini,feature_names= temp)
print(text_representation)

|--- unknown <= 0.50
|   |--- PART1_2_1 <= 0.50
|   |   |--- 0329 <= 0.50
|   |   |   |--- 2000 <= 0.50
|   |   |   |   |--- WEAPON_DESC_STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) <= 0.50
|   |   |   |   |   |--- WEAPON_DESC_VERBAL THREAT <= 0.50
|   |   |   |   |   |   |--- 1822 <= 0.50
|   |   |   |   |   |   |   |--- WEAPON_DESC_unknown <= 0.50
|   |   |   |   |   |   |   |   |--- 1100 <= 0.50
|   |   |   |   |   |   |   |   |   |--- 1258 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- 0416 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 20
|   |   |   |   |   |   |   |   |   |   |--- 0416 >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 19
|   |   |   |   |   |   |   |   |   |--- 1258 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- 0416 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |   |--- 0416 >  0.50
|   |   |   |   |   |  

## Bagging  classifier
Next up we use out-of-Bag Error Estimation. This estimation uses the remaining $1/3$ of the observations that are not used to fit a given bagged tree and are referred to as the out-of-bag (OOB) observations.
* Like that, we can predict the response for the  𝑖 th observation using each of the trees in which that observation was OOB.

* This will yield around  𝐵/3  predictions for the ith observation, where  𝐵  is the number of bootstrapped training sets.

* To get a single prediction for the  𝑖 th observation, we can take a majority vote (in classification trees).

* So we get a single prediction for the  𝑖 th observation; we do the same for all  𝑛  observations. In this way we can get an overall error estimate.

In [75]:
bagging_crime_tree = BaggingClassifier(DecisionTreeClassifier(criterion='gini'),
                                       n_estimators=50,
                                       n_jobs=None)


bagging_crime_tree = bagging_crime_tree.fit(X_train, np.ravel(y_train))
baggin_predict = bagging_crime_tree.predict(X_test)

Now we check how our decision tree performed
* $71$% total accuracy. Accuracy is the proportion of instances correctly classified by the classifier.
* Weighted average of precision is $69$%. Precision is the ratio of correctly reported positives over all reported positives
* Weighted average of recall is $71$%. Recall is the ratio of correctly reported positives over all actual positives
* Weighted average of f1-score is $68$%. F1-score is the harmonic mean of the precision and the recall.

In [76]:
print(classification_report(y_test, baggin_predict, zero_division=1))

                                                          precision    recall  f1-score   support

                                                   ARSON       0.50      0.59      0.54        17
            ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER       1.00      0.67      0.80         9
          ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT       0.80      0.90      0.85       365
                                       ATTEMPTED ROBBERY       0.65      0.38      0.48        53
                                BATTERY - SIMPLE ASSAULT       0.87      0.95      0.91       622
                                BATTERY ON A FIREFIGHTER       1.00      0.50      0.67         2
                                 BATTERY POLICE (SIMPLE)       0.76      0.80      0.78        20
                             BATTERY WITH SEXUAL CONTACT       0.62      0.76      0.69        46
                                           BIKE - STOLEN       0.49      0.31      0.38        80
                   

## Random Forest
Next up random forest classifier will be used. Random forests are an improvement over bagged trees
* Random forests are used to decorrelate the trees
* Therefore, the forest will have uncorrelated quantities to achieve a reduction in variance

In [None]:
forest = RandomForestClassifier(n_estimators=50, max_depth=None,
                                min_samples_split=2)
forest = forest.fit(X_train, np.ravel(y_train)) #training data
forest_predict = forest.predict(X_test)

Random forest results:
* $70$% total accuracy. Accuracy is the proportion of instances correctly classified by the classifier.
* Weighted average of precision is $67$%. Precision is the ratio of correctly reported positives over all reported positives
* Weighted average of recall is $70$%. Recall is the ratio of correctly reported positives over all actual positives
* Weighted average of f1-score is $65$%. F1-score is the harmonic mean of the precision and the recall.

In [None]:
print(classification_report(y_test, forest_predict, zero_division=1))

## Extremely Randomized Trees
Next up extremely randomized tress are used
* In extremely randomized trees, randomness goes one step further in the way splits are computed.

* As in random forests, a random subset of candidate features is used but thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds and its corresponding feature is picked as the splitting rule.

In [113]:
extremely_rts = ExtraTreesClassifier(n_estimators=50, criterion='gini',
                                     max_depth=50,
                                     min_samples_split=2)
extremely_rts = extremely_rts.fit(X_train, np.ravel(y_train)) #training
extreme_predict = extremely_rts.predict(X_test) 

Extremely random forest results:
* $71$% total accuracy. Accuracy is the proportion of instances correctly classified by the classifier.
* Weighted average of precision is $69$%. Precision is the ratio of correctly reported positives over all reported positives
* Weighted average of recall is $71$%. Recall is the ratio of correctly reported positives over all actual positives
* Weighted average of f1-score is $66$%. F1-score is the harmonic mean of the precision and the recall.

## TRYING TO PREDICT MORE CRIME INCIDENTS

In [98]:
la_crimes_new2 = la_crimes.loc[(la_crimes['DATE_OCC'].dt.to_period('M') == '2019-02')].copy()
la_crimes_new2

Unnamed: 0,DR_NO,DATE_OCC,AREA_NAME,DISTR_NO,PART1_2,CRM_CD,CRM_DESC,Mocodes,VICT_AGE,VICT_SEX,VICT_DESC,PREMIS_DESC,WEAPON_DESC,STATUS,CRM_CD2,CRM_CD3,CRM_CD4,LOCATION,LAT,LON
1844659,191306743,2019-02-21 06:30:00,Newton,1375,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,53RD ST,33.9948,-118.2514
1848238,191106093,2019-02-11 20:00:00,Northeast,1149,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,FIGUEROA,34.1050,-118.2022
1898394,191208802,2019-02-25 17:00:00,77th Street,1245,2,354,THEFT OF IDENTITY,0100 1822,43,F,B,SINGLE FAMILY DWELLING,unknown,Invest Cont,-1,-1,-1,1400 W 70TH ST,33.9762,-118.3003
1898401,191706728,2019-02-16 17:00:00,Devonshire,1745,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329 1307,50,M,W,STREET,unknown,Invest Cont,-1,-1,-1,10300 ENCINO AV,34.2576,-118.5154
1898434,192000540,2019-02-22 23:10:00,Olympic,2002,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0432 0450 0906 1402 1407 1100 1310,19,M,H,STREET,UNKNOWN FIREARM,Invest Cont,998,-1,-1,WESTERN,34.0803,-118.3091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114684,190706477,2019-02-20 12:00:00,Wilshire,765,2,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0329 1609 0319 0906,0,M,O,OTHER BUSINESS,unknown,Invest Cont,-1,-1,-1,5000 W PICO BL,34.0478,-118.3459
2114691,191307168,2019-02-28 07:00:00,Newton,1394,1,510,VEHICLE - STOLEN,unknown,0,X,X,STREET,unknown,Invest Cont,-1,-1,-1,100 E 67TH ST,33.9788,-118.2739
2114692,190906699,2019-02-23 22:20:00,Van Nuys,904,1,210,ROBBERY,0344 0302 0334 0355 1310 1420 1822 0354,30,F,W,STREET,OTHER FIREARM,Invest Cont,998,-1,-1,7600 WILLIS AV,34.2085,-118.4553
2114693,190506304,2019-02-22 08:40:00,Harbor,569,2,627,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,0443 0419 0416 1259,14,F,W,PARK/PLAYGROUND,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",Adult Other,-1,-1,-1,100 W 22ND ST,33.7257,-118.2801


In [99]:
la_crimes_new2['WEEKDAY'] = la_crimes_new2['DATE_OCC'].dt.dayofweek.astype(str).copy()
la_crimes_new2['HOUR'] = la_crimes_new2['DATE_OCC'].astype(str).str[11:13].copy()

In [100]:
la_crimes_dum2 = la_crimes_new2[['LAT','LON','LOCATION','CRM_CD2','CRM_CD3','CRM_CD4','VICT_AGE','Mocodes','WEEKDAY',
                               'AREA_NAME','DISTR_NO','CRM_DESC','PREMIS_DESC','HOUR',
                               'WEAPON_DESC','VICT_DESC','VICT_SEX','PART1_2']].copy()
la_crimes_dum2

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,CRM_CD3,CRM_CD4,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,DISTR_NO,CRM_DESC,PREMIS_DESC,HOUR,WEAPON_DESC,VICT_DESC,VICT_SEX,PART1_2
1844659,33.9948,-118.2514,53RD ST,-1,-1,-1,0,unknown,3,Newton,1375,VEHICLE - STOLEN,STREET,06,unknown,X,X,1
1848238,34.1050,-118.2022,FIGUEROA,-1,-1,-1,0,unknown,0,Northeast,1149,VEHICLE - STOLEN,STREET,20,unknown,X,X,1
1898394,33.9762,-118.3003,1400 W 70TH ST,-1,-1,-1,43,0100 1822,0,77th Street,1245,THEFT OF IDENTITY,SINGLE FAMILY DWELLING,17,unknown,B,F,2
1898401,34.2576,-118.5154,10300 ENCINO AV,-1,-1,-1,50,0329 1307,5,Devonshire,1745,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",STREET,17,unknown,W,M,2
1898434,34.0803,-118.3091,WESTERN,998,-1,-1,19,0432 0450 0906 1402 1407 1100 1310,4,Olympic,2002,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",STREET,23,UNKNOWN FIREARM,H,M,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114684,34.0478,-118.3459,5000 W PICO BL,-1,-1,-1,0,0329 1609 0319 0906,2,Wilshire,765,VANDALISM - MISDEAMEANOR ($399 OR UNDER),OTHER BUSINESS,12,unknown,O,M,2
2114691,33.9788,-118.2739,100 E 67TH ST,-1,-1,-1,0,unknown,3,Newton,1394,VEHICLE - STOLEN,STREET,07,unknown,X,X,1
2114692,34.2085,-118.4553,7600 WILLIS AV,998,-1,-1,30,0344 0302 0334 0355 1310 1420 1822 0354,5,Van Nuys,904,ROBBERY,STREET,22,OTHER FIREARM,W,F,1
2114693,33.7257,-118.2801,100 W 22ND ST,-1,-1,-1,14,0443 0419 0416 1259,4,Harbor,569,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,PARK/PLAYGROUND,08,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",W,F,2


In [101]:
mocodes_dummies2= la_crimes_dum2['Mocodes'].str.join(sep='').str.get_dummies(sep=' ') 
mocodes_dummies2

Unnamed: 0,0100,0101,0102,0104,0110,0112,0113,0115,0119,0120,...,3036,3037,3101,3104,3401,3701,4025,4026,9999,unknown
1844659,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1848238,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1898394,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1898401,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1898434,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2114684,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2114691,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2114692,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2114693,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [102]:
la_crimes_dum2 = pd.merge(la_crimes_dum2,mocodes_dummies2, left_on=la_crimes_dum2.index, right_on=mocodes_dummies2.index )
la_crimes_dum2 = la_crimes_dum2.drop(['key_0'], axis=1) # dropping the new column of the index(keys)
la_crimes_dum2

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,CRM_CD3,CRM_CD4,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,...,3036,3037,3101,3104,3401,3701,4025,4026,9999,unknown
0,33.9948,-118.2514,53RD ST,-1,-1,-1,0,unknown,3,Newton,...,0,0,0,0,0,0,0,0,0,1
1,34.1050,-118.2022,FIGUEROA,-1,-1,-1,0,unknown,0,Northeast,...,0,0,0,0,0,0,0,0,0,1
2,33.9762,-118.3003,1400 W 70TH ST,-1,-1,-1,43,0100 1822,0,77th Street,...,0,0,0,0,0,0,0,0,0,0
3,34.2576,-118.5154,10300 ENCINO AV,-1,-1,-1,50,0329 1307,5,Devonshire,...,0,0,0,0,0,0,0,0,0,0
4,34.0803,-118.3091,WESTERN,998,-1,-1,19,0432 0450 0906 1402 1407 1100 1310,4,Olympic,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16174,34.0478,-118.3459,5000 W PICO BL,-1,-1,-1,0,0329 1609 0319 0906,2,Wilshire,...,0,0,0,0,0,0,0,0,0,0
16175,33.9788,-118.2739,100 E 67TH ST,-1,-1,-1,0,unknown,3,Newton,...,0,0,0,0,0,0,0,0,0,1
16176,34.2085,-118.4553,7600 WILLIS AV,998,-1,-1,30,0344 0302 0334 0355 1310 1420 1822 0354,5,Van Nuys,...,0,0,0,0,0,0,0,0,0,0
16177,33.7257,-118.2801,100 W 22ND ST,-1,-1,-1,14,0443 0419 0416 1259,4,Harbor,...,0,0,0,0,0,0,0,0,0,0


In [103]:
la_crimes_dum2['CRM_CD2'] = la_crimes_dum2['CRM_CD2'].astype(str)+ ' ' +la_crimes_dum2['CRM_CD3'].astype(str) +' '+\
                            +la_crimes_dum2['CRM_CD4'].astype(str)
la_crimes_dum2 = la_crimes_dum2.drop(['CRM_CD3','CRM_CD4'], axis=1)
la_crimes_dum2

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,DISTR_NO,CRM_DESC,...,3036,3037,3101,3104,3401,3701,4025,4026,9999,unknown
0,33.9948,-118.2514,53RD ST,-1 -1 -1,0,unknown,3,Newton,1375,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
1,34.1050,-118.2022,FIGUEROA,-1 -1 -1,0,unknown,0,Northeast,1149,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
2,33.9762,-118.3003,1400 W 70TH ST,-1 -1 -1,43,0100 1822,0,77th Street,1245,THEFT OF IDENTITY,...,0,0,0,0,0,0,0,0,0,0
3,34.2576,-118.5154,10300 ENCINO AV,-1 -1 -1,50,0329 1307,5,Devonshire,1745,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,0,0,0,0,0,0,0,0,0,0
4,34.0803,-118.3091,WESTERN,998 -1 -1,19,0432 0450 0906 1402 1407 1100 1310,4,Olympic,2002,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16174,34.0478,-118.3459,5000 W PICO BL,-1 -1 -1,0,0329 1609 0319 0906,2,Wilshire,765,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,0,0,0,0,0,0,0,0,0,0
16175,33.9788,-118.2739,100 E 67TH ST,-1 -1 -1,0,unknown,3,Newton,1394,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,1
16176,34.2085,-118.4553,7600 WILLIS AV,998 -1 -1,30,0344 0302 0334 0355 1310 1420 1822 0354,5,Van Nuys,904,ROBBERY,...,0,0,0,0,0,0,0,0,0,0
16177,33.7257,-118.2801,100 W 22ND ST,-1 -1 -1,14,0443 0419 0416 1259,4,Harbor,569,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,...,0,0,0,0,0,0,0,0,0,0


In [104]:
crime_codes_dummies2= la_crimes_dum2['CRM_CD2'].str.join(sep='').str.get_dummies(sep=' ') 
crime_codes_dummies2 = crime_codes_dummies2.drop(['-1'], axis=1) # dropping -1 column as it concerns incidents with no extra crime type
crime_codes_dummies2

Unnamed: 0,230,236,330,350,420,440,442,480,510,521,...,940,946,950,956,979,990,993,997,998,999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16174,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16175,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16176,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
16177,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [105]:
la_crimes_dum2 = pd.merge(la_crimes_dum2, crime_codes_dummies2, left_on=la_crimes_dum2.index,
                         right_on=crime_codes_dummies2.index )
la_crimes_dum2 = la_crimes_dum2.drop(['key_0'], axis=1) # dropping the new column of the index(keys) 
la_crimes_dum2

Unnamed: 0,LAT,LON,LOCATION,CRM_CD2,VICT_AGE,Mocodes,WEEKDAY,AREA_NAME,DISTR_NO,CRM_DESC,...,940,946,950,956,979,990,993,997,998,999
0,33.9948,-118.2514,53RD ST,-1 -1 -1,0,unknown,3,Newton,1375,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
1,34.1050,-118.2022,FIGUEROA,-1 -1 -1,0,unknown,0,Northeast,1149,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
2,33.9762,-118.3003,1400 W 70TH ST,-1 -1 -1,43,0100 1822,0,77th Street,1245,THEFT OF IDENTITY,...,0,0,0,0,0,0,0,0,0,0
3,34.2576,-118.5154,10300 ENCINO AV,-1 -1 -1,50,0329 1307,5,Devonshire,1745,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,0,0,0,0,0,0,0,0,0,0
4,34.0803,-118.3091,WESTERN,998 -1 -1,19,0432 0450 0906 1402 1407 1100 1310,4,Olympic,2002,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16174,34.0478,-118.3459,5000 W PICO BL,-1 -1 -1,0,0329 1609 0319 0906,2,Wilshire,765,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,0,0,0,0,0,0,0,0,0,0
16175,33.9788,-118.2739,100 E 67TH ST,-1 -1 -1,0,unknown,3,Newton,1394,VEHICLE - STOLEN,...,0,0,0,0,0,0,0,0,0,0
16176,34.2085,-118.4553,7600 WILLIS AV,998 -1 -1,30,0344 0302 0334 0355 1310 1420 1822 0354,5,Van Nuys,904,ROBBERY,...,0,0,0,0,0,0,0,0,1,0
16177,33.7257,-118.2801,100 W 22ND ST,-1 -1 -1,14,0443 0419 0416 1259,4,Harbor,569,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,...,0,0,0,0,0,0,0,0,0,0


In [106]:
la_crimes_dum2 = pd.get_dummies(la_crimes_dum2, columns=['LOCATION','VICT_AGE',
                                                       'AREA_NAME','DISTR_NO', 'PREMIS_DESC', 'WEAPON_DESC', 
                                                       'VICT_DESC', 'VICT_SEX', 'PART1_2','WEEKDAY','HOUR'])
la_crimes_dum2 = la_crimes_dum2.drop(['Mocodes','CRM_CD2'], axis=1)
la_crimes_dum2

Unnamed: 0,LAT,LON,CRM_DESC,0100,0101,0102,0104,0110,0112,0113,...,HOUR_14,HOUR_15,HOUR_16,HOUR_17,HOUR_18,HOUR_19,HOUR_20,HOUR_21,HOUR_22,HOUR_23
0,33.9948,-118.2514,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34.1050,-118.2022,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,33.9762,-118.3003,THEFT OF IDENTITY,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,34.2576,-118.5154,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,34.0803,-118.3091,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16174,34.0478,-118.3459,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16175,33.9788,-118.2739,VEHICLE - STOLEN,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16176,34.2085,-118.4553,ROBBERY,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
16177,33.7257,-118.2801,CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [107]:
cols = la_crimes_dum2.columns.intersection(la_crimes_dum.columns)
cols

Index(['LAT', 'LON', '0100', '0101', '0102', '0104', '0110', '0112', '0113',
       '0115',
       ...
       'HOUR_14', 'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19',
       'HOUR_20', 'HOUR_21', 'HOUR_22', 'HOUR_23'],
      dtype='object', length=7401)

In [109]:
la_crimes_dum2 = la_crimes_dum2[cols]
la_crimes_dum2

Unnamed: 0,LAT,LON,0100,0101,0102,0104,0110,0112,0113,0115,...,HOUR_14,HOUR_15,HOUR_16,HOUR_17,HOUR_18,HOUR_19,HOUR_20,HOUR_21,HOUR_22,HOUR_23
0,33.9948,-118.2514,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34.1050,-118.2022,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,33.9762,-118.3003,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,34.2576,-118.5154,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,34.0803,-118.3091,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16174,34.0478,-118.3459,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16175,33.9788,-118.2739,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16176,34.2085,-118.4553,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
16177,33.7257,-118.2801,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [110]:
main_list = np.setdiff1d(la_crimes_dum.columns, cols)
main_list

array(['0105', '0107', '0109', ..., 'WEAPON_DESC_RAZOR',
       'WEAPON_DESC_RAZOR BLADE', 'WEAPON_DESC_SAWED OFF RIFLE/SHOTGUN'],
      dtype=object)

In [111]:
print(len(main_list))
print(len(la_crimes_dum.columns))
len(cols)

11762
19163


7401

In [None]:
la_crimes_dum2[main_list.copy()] = 0
la_crimes_dum2

In [None]:
la_crimes_dum2 = la_crimes_dum2.sample(frac = 1, random_state=9) #shuffling data
target2 = la_crimes_dum2['CRM_DESC'].copy() # target column separated
la_crimes_dum2 = la_crimes_dum2.drop(['CRM_DESC'], axis=1) #deleting target column from data
la_crimes_dum2

In [None]:
extreme_predict = extremely_rts.predict(la_crimes_dum2 

In [80]:
print(classification_report(y_test, extreme_predict, zero_division=1))

                                                          precision    recall  f1-score   support

                                                   ARSON       0.67      0.12      0.20        17
            ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER       1.00      0.67      0.80         9
          ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT       0.73      0.93      0.82       365
                                       ATTEMPTED ROBBERY       1.00      0.00      0.00        53
                                BATTERY - SIMPLE ASSAULT       0.80      0.99      0.89       622
                                BATTERY ON A FIREFIGHTER       1.00      0.00      0.00         2
                                 BATTERY POLICE (SIMPLE)       0.69      0.45      0.55        20
                             BATTERY WITH SEXUAL CONTACT       0.66      0.67      0.67        46
                                           BIKE - STOLEN       0.59      0.29      0.39        80
                   

## AdaBoost classifier
Now AdaBoost classifier is used. Boosting is another method for improved classification. Particularly, we fit a sequence of weak learners, such as a very small decision trees.
The algorithm goes:
1. We train a weak predictor, such as a small decision tree, on our dataset.

2. We take notice of the errors in the predictions and we reweigh our training set so that:

  * The weights of the data that were correctly predicted are decreased.
  
  * The weights of the data that were incorrectly predicted are increased.
  
3. We go back to step 1.
Note: those steps are made inside pandas


In [83]:
clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=None),
    n_estimators=100)
booster = clf.fit(X_train, np.ravel(y_train))
booster_predict = booster.predict(X_test)

AdaBoost results:
* $70$% total accuracy.
* Weighted average of precision is $68$%. 
* Weighted average of recall is $70$%.
* Weighted average of f1-score is $68$%.

In [84]:
print(classification_report(y_test, booster_predict, zero_division=1))

                                                          precision    recall  f1-score   support

                                                   ARSON       0.57      0.47      0.52        17
            ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER       1.00      0.67      0.80         9
          ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT       0.81      0.90      0.85       365
                                       ATTEMPTED ROBBERY       0.67      0.34      0.45        53
                                BATTERY - SIMPLE ASSAULT       0.87      0.95      0.91       622
                                BATTERY ON A FIREFIGHTER       1.00      0.50      0.67         2
                                 BATTERY POLICE (SIMPLE)       0.87      0.65      0.74        20
                             BATTERY WITH SEXUAL CONTACT       0.66      0.76      0.71        46
                                           BIKE - STOLEN       0.36      0.26      0.30        80
                   

### Conclusion
Based on the metrics of accuracy the best classifiers are extemely randomized trees and the out of bagging classifier, both with $71$%. That means those two will accuratelly predict around 7 out of 10 incidents where the crime description is missing. In detail, both those classifiers have the same weighted average for precision and recall but the bagging classifier has a slightly heigher weighted average for f1-score. On the other hand the extremely randomized trees have a higher macro average of precision


## Probabilities
A very good use of the decision trees is that they can calculate the probability of an observation(incident) to belong to any class (crime type). Actually, the class that is predicted for an incident is the one with the heighest probability.
* We select the extremely randomized forest which performs one of the best accuracies

#### First incident
Let's see the probabilitites for the first incident of the test data. Each list's item is a probability and the index of the item corresponds to the crime type that is found in `extremely_rts. classes_`

In [119]:
index= 1
X = X_test[index-1 : index]
position = np.where(extremely_rts.predict_proba(X)[0] == max(extremely_rts.predict_proba(X)[0]))
print(extremely_rts.predict_proba(X)[0])
extremely_rts. classes_

[0.02 0.   0.5  0.   0.03 0.   0.   0.   0.   0.   0.   0.   0.05 0.
 0.   0.   0.02 0.01 0.   0.03 0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.01 0.03 0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.01 0.   0.   0.   0.
 0.   0.   0.   0.   0.02 0.   0.   0.   0.01 0.   0.   0.   0.01 0.
 0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.04 0.   0.
 0.02 0.   0.   0.   0.   0.03 0.01 0.   0.   0.   0.   0.   0.   0.01
 0.   0.01 0.   0.05 0.   0.   0.   0.04 0.   0.   0.   0.   0.   0.
 0.   0.01 0.   0.   0.01 0.01 0.   0.  ]


array(['ARSON', 'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER',
       'ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT',
       'ATTEMPTED ROBBERY', 'BATTERY - SIMPLE ASSAULT',
       'BATTERY ON A FIREFIGHTER', 'BATTERY POLICE (SIMPLE)',
       'BATTERY WITH SEXUAL CONTACT', 'BIGAMY', 'BIKE - STOLEN',
       'BOAT - STOLEN', 'BOMB SCARE', 'BRANDISH WEAPON', 'BUNCO, ATTEMPT',
       'BUNCO, GRAND THEFT', 'BUNCO, PETTY THEFT', 'BURGLARY',
       'BURGLARY FROM VEHICLE', 'BURGLARY FROM VEHICLE, ATTEMPTED',
       'BURGLARY, ATTEMPTED', 'CHILD ABANDONMENT',
       'CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT',
       'CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT',
       'CHILD ANNOYING (17YRS & UNDER)', 'CHILD NEGLECT (SEE 300 W.I.C.)',
       'CHILD PORNOGRAPHY', 'CHILD STEALING', 'CONSPIRACY',
       'CONTEMPT OF COURT', 'CONTRIBUTING', 'COUNTERFEIT',
       'CREDIT CARDS, FRAUD USE ($950 & UNDER',
       'CREDIT CARDS, FRAUD USE ($950.01 & OVER)', 'CRIMINAL HOMICIDE',
       'CRIMINAL THRE

* Here we can see the crime type that is predicted (the crime type that corresponds to the highest probability) together with the probability

In [125]:
print(extremely_rts. classes_[position])
extremely_rts.predict_proba(X)[0][position]

['ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT']


array([0.5])

* And this is the correct crime type 
* We see that the prediction is correct

In [126]:
y_test.head(1)

26126    ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT
Name: CRM_DESC, dtype: object

#### Second incident
Let's see the probabilitites for the second incident of the test data

In [127]:
index= 2
X = X_test[index-1 : index]
position = np.where(extremely_rts.predict_proba(X)[0] == max(extremely_rts.predict_proba(X)[0]))
print(extremely_rts.predict_proba(X)[0])

[0.   0.   0.01 0.   0.   0.   0.   0.   0.   0.42 0.   0.   0.   0.
 0.   0.   0.16 0.06 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.01 0.   0.   0.   0.   0.   0.01 0.05
 0.   0.   0.   0.19 0.   0.   0.   0.08 0.   0.   0.   0.   0.   0.
 0.   0.01 0.   0.   0.   0.   0.   0.  ]


* The prediction is a stolen bike with a probability of $42$%

In [129]:
print(extremely_rts. classes_[position])
extremely_rts.predict_proba(X)[0][position]

['BIKE - STOLEN']


array([0.42])

* The actual crime committed is seen here
* The prediction is once again accurate

In [130]:
y_test[index-1 : index]

29416    BIKE - STOLEN
Name: CRM_DESC, dtype: object

#### Third incident
Now we will check for the third incident of the test data. We can see the probability table

In [131]:
index= 3
X = X_test[index-1 : index]
position = np.where(extremely_rts.predict_proba(X)[0] == max(extremely_rts.predict_proba(X)[0]))
print(extremely_rts.predict_proba(X)[0])

[0.   0.   0.01 0.   0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.03 0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.02 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.01 0.   0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.01 0.   0.01 0.   0.   0.   0.   0.   0.   0.   0.   0.02 0.
 0.61 0.22 0.   0.   0.02 0.   0.   0.  ]


* The incident is predicted to be a vandalism - felony of 400 and more dollars with the probability of $61$ %

In [132]:
print(extremely_rts. classes_[position])
extremely_rts.predict_proba(X)[0][position]

['VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)']


array([0.61])

* The incident is actually vandalism - misdeameanor of 399 and less dollars

In [110]:
y_test[index-1 : index]

20309    VANDALISM - MISDEAMEANOR ($399 OR UNDER)
Name: CRM_DESC, dtype: object

* Let's find the second highest probability

In [116]:
position2 = np.where(extremely_rts.predict_proba(X)[0] == sorted(extremely_rts.predict_proba(X)[0])[-2])
sorted(extremely_rts.predict_proba(X)[0])[-2]

0.22

* We see that the second highest probability leads to the actual crime (vandalism - misdeameanor of 399 and less dollars)

In [118]:
extremely_rts. classes_[position2] 

array(['VANDALISM - MISDEAMEANOR ($399 OR UNDER)'], dtype=object)

##### Conclusion
It means that even though the extra random forest did not predict the right value at first, it gave us an indication that the incident could also be a vandalism - misdeamenor. In that particular example it is very difficult to distinguish the actual crime from the other data given. The act of crime is actually the same but the amount of damage inflicted was the differentiator.
<br>In addition, if the tree classificator was given a second chance to predict, it would predict accurately. This is very important because even though the tree doesn't always predict correctly, it can highlight other possibilities. Moreover, even for human beings it would be extremely difficult to distinguish the exact type of crime that was committed just by obtaining and analyzing the information from la_crimes. To sum up, we can notice the power of machine learning by this application of data analytics and predictions and specifically decision trees and their extensions.