# __Project-2-Exploratory-Data-analysis (EDA) of Los Angeles Crime Data__
![](https://i.imgur.com/SYXAQFy.jpeg)

Exploratory Data Analysis (EDA) is used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA requires knowledge of statistics, visualization techniques and data-analysis tools like numpy, pandas, seaborn, plotly, matplotlib.

## __Introduction:__

In this project we will perform Exploratory Data Analysis on crimes recorded in the City of Los Angeles from 2010 to 2022.

We will explore the relationships between various parameters using graphs like barplot, scatter plot, etc. 


## __Steps to follow:__
1. Download the dataset & install all required libraries.
2. Explore & clean the data.
3. Visualization 
4. Summarize & conclude 
5. Future Work.
6. References.

Saving the work to Jovian:

In [213]:
!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="project-2-exploratory-data-analysis")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/shwetsagashe/project-2-exploratory-data-analysis


'https://jovian.ai/shwetsagashe/project-2-exploratory-data-analysis'

### __1. DOWNLOAD THE DATASET & INSTALL ALL THE REQUIRED LIBRARIES:__

We will download the Crime Data in City of Los Angeles from Kaggle using the "opendatasets" library. But first, let's install all the required libraries.

In [3]:
!pip install opendatasets --upgrade --quiet
!pip install  matplotlib seaborn --upgrade --quiet
!pip install plotly --quiet
!pip install -U matplotlib --quiet
!pip install folium --quiet


In [4]:
import opendatasets as od
import os
import pandas as pd
import numpy as np
import random
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
pio.renderers.default = 'colab'
%matplotlib inline
import plotly 
import matplotlib
import folium

In [5]:
link = 'https://www.kaggle.com/datasets/sumaiaparveenshupti/los-angeles-crime-data-20102020'

In [7]:
od.download(link)

Downloading los-angeles-crime-data-20102020.zip to ./los-angeles-crime-data-20102020


100%|██████████| 113M/113M [00:01<00:00, 110MB/s] 





Let's view the files downloaded.

In [8]:
os.listdir('./los-angeles-crime-data-20102020')

['Crime_Data_from_2020_to_Present.csv', 'Crime_Data_from_2010_to_2019.csv']

Let's check out both CSV files by loading them into dataframes.

In [9]:
df_csv = './los-angeles-crime-data-20102020' + '/Crime_Data_from_2010_to_2019.csv'
df_2010=pd.read_csv(df_csv)

In [10]:
df_csv1 = './los-angeles-crime-data-20102020'  + '/Crime_Data_from_2020_to_Present.csv'
df_2022 =pd.read_csv(df_csv1)

We will print the details of each column as given along with the dataset.

In [11]:
print('The description of the column names is as under: \n \nDR_NO: Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits.\n API Field Name: MM/DD/YYYY.\n DATE OCC: MM/DD/YYYY.\n TIME OCC: In 24 hour military time.\n AREA: The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.\n AREA NAME: The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.\n Rpt Dist No: A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons.\n Crm Cd: Indicates the crime committed. (Same as Crime Code 1)\n Crm Cd Desc: Defines the Crime Code provided.\n Mocodes: Modus Operandi: Activities associated with the suspect in commission of the crime.\n Vict Age: Two character numeric.\n Vict Sex: F - Female M - Male X - Unknown.\n Vict Descent: Descent Code: A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - White X - Unknown Z - Asian Indian.\n Premis Cd: The type of structure, vehicle, or location where the crime took place.\n Premis Desc: Defines the Premise Code provided.\n Weapon Used Cd: The type of weapon used in the crime.\n Weapon Desc: Defines the Weapon Used Code provided.\n Status: Status of the case. (IC is the default).\n Status DEsc: Defines the Status Code provided.\n Crm Cd 1: Indicates the crime committed. Crime Code 1 is the primary and most serious one. 
Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious.\n Crm Cd 2: May contain a code for an additional crime, less serious than Crime Code 1.\n Crm Cd 3: May contain a code for an additional crime, less serious than Crime Code 1.\n Crm Cd 4: May contain a code for an additional crime, less serious than Crime Code 1.\n LOCATION: Street address of crime incident rounded to the nearest hundred block to maintain anonymity.\n Cross Street: Cross Street of rounded Address.\n LAT: Latitude.\n LON: Longitude.')

The description of the column names is as under: 
 
DR_NO: Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits.
 API Field Name: MM/DD/YYYY.
 DATE OCC: MM/DD/YYYY.
 TIME OCC: In 24 hour military time.
 AREA: The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
 AREA NAME: The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.
 Rpt Dist No: A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons.
 Crm Cd: Indicates the crime committed. (Same as Crime Code 1)
 Crm Cd Desc: Defines the Crime Code p

Let's merge both the datasets we downloaded. First, ensure that the column names of the two dataframes match. We observed that the 5th column is named 'AREA ' (with a trailing space) in df_2010 but 'AREA' in df_2022, which is a typo. All other column names are the same in both dataframes.

In [12]:
df_2010.rename(columns={'AREA ': 'AREA'}, inplace=True)  # remove the trailing space so both files share column names
col_2010 = df_2010.columns.to_list()

In [13]:
col_2022 = df_2022.columns.to_list()

In [14]:
col_2010==col_2022

True
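A trailing space like this is easy to miss when column names are printed. The sketch below, on two toy frames with hypothetical values standing in for df_2010 and df_2022, shows one possible way to locate mismatched names and to strip stray whitespace from all column names at once.

```python
import pandas as pd

# Toy frames standing in for df_2010 and df_2022 (hypothetical data).
old = pd.DataFrame({"DR_NO": [1], "AREA ": [13], "AREA NAME": ["Newton"]})
new = pd.DataFrame({"DR_NO": [2], "AREA": [14], "AREA NAME": ["Pacific"]})

# Pair up the column names by position and keep the pairs that differ.
mismatched = [(a, b) for a, b in zip(old.columns, new.columns) if a != b]

# repr() makes the trailing whitespace visible when printing.
print([(repr(a), repr(b)) for a, b in mismatched])

# Stripping whitespace from every column name fixes this class of typo at once.
old.columns = old.columns.str.strip()
columns_match = old.columns.to_list() == new.columns.to_list()
print(columns_match)
```

This only compares names position by position, so it assumes both files list their columns in the same order.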

Now let's combine both dataframes so that we get compiled data from 2010 to 2022.

In [141]:
frame =[df_2010,df_2022]
df=pd.concat(frame)
pd.set_option('display.max_columns', None)
df.head(2)

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,Mocodes,Vict Age,Vict Sex,Vict Descent,Premis Cd,Premis Desc,Weapon Used Cd,Weapon Desc,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,1307355,02/20/2010 12:00:00 AM,02/20/2010 12:00:00 AM,1350,13,Newton,1385,2,900,VIOLATION OF COURT ORDER,0913 1814 2000,48,M,H,501.0,SINGLE FAMILY DWELLING,,,AA,Adult Arrest,900.0,,,,300 E GAGE AV,,33.9825,-118.2695
1,11401303,09/13/2010 12:00:00 AM,09/12/2010 12:00:00 AM,45,14,Pacific,1485,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,0,M,W,101.0,STREET,,,IC,Invest Cont,740.0,,,,SEPULVEDA BL,MANCHESTER AV,33.9599,-118.3962


### __2. EXPLORE & CLEAN THE DATA:__ 
Now we will explore the data we just extracted & clean it wherever required.

Let's rename the columns for a better understanding & simplicity.

In [142]:
df.rename(columns={
    'DR_NO': 'Record_no', 'Date Rptd': 'Date_reported', 'DATE OCC': 'Date_occured',
    'TIME OCC': 'Time', 'AREA': 'Area', 'AREA NAME': 'Area_name', 'Rpt Dist No': 'Dist_no',
    'Crm Cd': 'Crime_code', 'Crm Cd Desc': 'Crime', 'Mocodes': 'Modus_operandi',
    'Vict Age': 'Victim_age', 'Vict Sex': 'Victim_sex', 'Vict Descent': 'Victim_race',
    # Note: 'Premis' holds the numeric premise code and 'Premis_code' its description.
    'Premis Cd': 'Premis', 'Premis Desc': 'Premis_code',
    'Weapon Used Cd': 'Weapon_code', 'Weapon Desc': 'Weapon',
    'Status': 'Status_code', 'Status Desc': 'Status',
    'Crm Cd 1': 'Crime_code1', 'Crm Cd 2': 'Crime_code2', 'Crm Cd 3': 'Crime_code3',
    'Crm Cd 4': 'Crime_code4', 'Cross Street': 'Cross_street', 'LOCATION': 'Location',
    'LAT': 'Latitude', 'LON': 'Longitude',
}, inplace=True)

Let us convert the columns 'Date_reported' & 'Date_occured' to the datetime datatype, and 'Time' to time-of-day values.

In [143]:
df[['Date_reported','Date_occured']]=df[['Date_reported','Date_occured']].apply(pd.to_datetime)

In [144]:
df['Time'] = pd.to_datetime(df['Time'].astype(str).str.zfill(4), format='%H%M').dt.time
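The time conversion relies on left-padding: the raw values are integers in 24-hour format, so 45 stands for 00:45. A minimal sketch of the same steps on a few hypothetical values (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical TIME OCC values: 45 means 00:45, 1350 means 13:50.
raw_time = pd.Series([45, 1350, 1, 2359])

# Left-pad to four digits so every value matches the '%H%M' pattern.
padded = raw_time.astype(str).str.zfill(4)

# Parse as hour and minute, then keep only the time-of-day part.
parsed = pd.to_datetime(padded, format="%H%M").dt.time

print(padded.tolist())
print(parsed.tolist())
```

Without the padding, a value such as "45" would not match the four-digit pattern reliably.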

Let's check how many values are missing in each column.

In [145]:
df.isna().sum()

Record_no               0
Date_reported           0
Date_occured            0
Time                    0
Area                    0
Area_name               0
Dist_no                 0
Part 1-2                0
Crime_code              0
Crime                   0
Modus_operandi     266144
Victim_age              0
Victim_sex         233088
Victim_race        233139
Premis                 57
Premis_code           284
Weapon_code       1581707
Weapon            1581708
Status_code             3
Status                  0
Crime_code1            12
Crime_code2       2231485
Crime_code3       2389876
Crime_code4       2394044
Location                0
Cross_street      1988864
Latitude                0
Longitude               0
dtype: int64

We can see that most of the columns have no missing values. Let us now check the statistics of the numerical columns.

In [146]:
df.describe()

Unnamed: 0,Record_no,Area,Dist_no,Part 1-2,Crime_code,Victim_age,Premis,Weapon_code,Crime_code1,Crime_code2,Crime_code3,Crime_code4,Latitude,Longitude
count,2394173.0,2394173.0,2394173.0,2394173.0,2394173.0,2394173.0,2394116.0,812466.0,2394161.0,162688.0,4297.0,129.0,2394173.0,2394173.0
mean,154473800.0,11.05421,1151.773,1.442225,507.5668,31.55063,309.3738,370.355497,507.3792,950.416318,974.203398,977.643411,34.03829,-118.2221
std,32630500.0,6.016083,601.6332,0.4966509,210.5892,20.78462,211.6172,114.681944,210.4417,124.369087,80.460931,74.864962,1.16213,4.017693
min,817.0,1.0,100.0,1.0,110.0,-11.0,101.0,101.0,110.0,210.0,93.0,421.0,0.0,-118.8279
25%,130101300.0,6.0,642.0,1.0,330.0,19.0,102.0,400.0,330.0,998.0,998.0,998.0,34.0102,-118.4356
50%,152115400.0,11.0,1183.0,1.0,442.0,31.0,210.0,400.0,442.0,998.0,998.0,998.0,34.0618,-118.3288
75%,181312200.0,16.0,1663.0,2.0,626.0,46.0,501.0,400.0,626.0,998.0,998.0,998.0,34.174,-118.2773
max,910220400.0,21.0,2199.0,2.0,956.0,120.0,971.0,516.0,999.0,999.0,999.0,999.0,34.7907,0.0


#### __2.1 Victim's Age:__
We can see that the victim age has a minimum value of -11 & a maximum value of 120. A negative age is not possible. Let's explore the 'Victim_age' column further.

In [147]:
df[df['Victim_age']<=0].Victim_age.count()

438588

Here we can see that in about 438K cases the recorded victim age is zero or negative, which cannot be correct. We cannot delete these rows, as there are too many of them. However, let's check whether we can deduce a reasonable replacement value, starting with the 'Crime' column for these rows.

In [148]:
df_age=df[(df['Victim_age']<=0) & (df['Crime'].str.contains('child',case =False))]
df_age.Victim_age.count()

2198

We can see that 2198 of these cases involve crimes relating to children. Hence we can replace the missing 'Victim_age' values in these cases with the mean age of victims in child-related crimes where the age is known.

In [149]:
child_mean = df[(df['Victim_age']>0) & (df['Crime'].str.contains('child',case=False))]
child_mean['Victim_age'].mean()

11.96584794306544

In [150]:
df['Victim_age']=np.where((df['Victim_age']==0) & (df['Crime'].str.contains('child',case=False)),12,df['Victim_age'])

Let's explore the crimes where the victim's age is recorded as 0.

In [151]:
age_0 =df[df['Victim_age']==0].Crime.value_counts(ascending=False).head(15)
age_0

VEHICLE - STOLEN                                            189421
THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)              39538
SHOPLIFTING - PETTY THEFT ($950 & UNDER)                     26837
BURGLARY                                                     25692
VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)      22205
THEFT PLAIN - PETTY ($950 & UNDER)                           21105
VANDALISM - MISDEAMEANOR ($399 OR UNDER)                     12634
ROBBERY                                                      10947
THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD     10680
TRESPASSING                                                   7436
OTHER MISCELLANEOUS CRIME                                     7239
EMBEZZLEMENT, GRAND THEFT ($950.01 & OVER)                    7180
DOCUMENT FORGERY / STOLEN FELONY                              3999
BATTERY POLICE (SIMPLE)                                       3750
BURGLARY FROM VEHICLE                                         

In [152]:
df[df['Crime'].str.contains('vehicle',case=False)].Crime.count()

522187

Here, we can see that most of these cases are vehicle related. The victim in such cases is most likely older than 16, since 16 is the minimum age for driving a vehicle. Hence we will find the median age of victims older than 16 and replace the missing values with this median.

In [153]:
age_median=df[df['Victim_age']>16].Victim_age.median()
age_median

37.0

Let's replace the missing age values in vehicle-related cases with this median.

In [154]:
df['Victim_age'] = np.where((df['Victim_age']==0) & (df['Crime'].str.contains('vehicle',case=False)),37,df['Victim_age'])

In [155]:
df[df['Victim_age']==0].Victim_age.count()

200990
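The two imputation steps above can be sketched on a toy frame with hypothetical values. This is a variation rather than the exact cells above: it computes the replacement values instead of hard-coding 12 and 37, and it also treats negative ages as missing, which is an assumption of this sketch.

```python
import pandas as pd

# Toy data (hypothetical); 0 and negative ages are treated as missing here.
toy = pd.DataFrame({
    "Crime": ["CHILD NEGLECT", "CHILD NEGLECT", "VEHICLE - STOLEN",
              "VEHICLE - STOLEN", "BURGLARY FROM VEHICLE", "ROBBERY"],
    "Victim_age": [10, 0, 40, -1, 30, 0],
})

missing = toy["Victim_age"] <= 0
is_child = toy["Crime"].str.contains("child", case=False)
is_vehicle = toy["Crime"].str.contains("vehicle", case=False)

# Replacement values are computed from the valid rows instead of hard-coded.
child_age = int(round(toy.loc[~missing & is_child, "Victim_age"].mean()))
adult_median = int(toy.loc[toy["Victim_age"] > 16, "Victim_age"].median())

# Fill only the rows that are both missing and of the matching crime type.
toy.loc[missing & is_child, "Victim_age"] = child_age
toy.loc[missing & is_vehicle, "Victim_age"] = adult_median

print(toy)
```

Rows that match neither pattern (the robbery here) keep their age of 0, which mirrors the roughly 200K zero-age rows that remain in the notebook.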

#### __2.2 Crime_Code:__
Let's delete the rows with NaN values in the column 'Crime_code1', as there are very few of them.

In [156]:
df = df.dropna(axis=0, subset= ['Crime_code1'])

#### __2.3 Victim's Sex:__

Let's explore the 'Victim_sex' column.

In [157]:
df['Victim_sex'].value_counts(dropna=False)

M      1092998
F       989830
NaN     233087
X        78130
H           98
N           17
-            1
Name: Victim_sex, dtype: int64

The number of "H", "N" & "-" values is small, and they do not correspond to any documented code. Hence we can delete these rows.

In [158]:
sex_index = df[(df['Victim_sex']=='H') | (df['Victim_sex']=='N') | (df['Victim_sex']=='-')].index
df.drop(sex_index,inplace=True)

In [159]:
df['Victim_sex'].value_counts(dropna=False)

M      1092977
F       989820
NaN     233085
X        78128
Name: Victim_sex, dtype: int64
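Note that the counts of M, F, NaN and X also fell slightly after the drop. A likely cause, assuming the concatenated frame kept the original row labels of both files, is that `df.drop` removes every row that shares a dropped label. The sketch below reproduces the effect on toy data (hypothetical values) and shows a mask-based alternative.

```python
import pandas as pd

# Two toy frames whose default indexes both start at 0, as after pd.read_csv.
a = pd.DataFrame({"Victim_sex": ["M", "H", "F"]})
b = pd.DataFrame({"Victim_sex": ["F", "M", "X"]})
combined = pd.concat([a, b])  # index is 0, 1, 2, 0, 1, 2

# Dropping by label removes BOTH rows labelled 1, including a valid 'M'.
bad_labels = combined[combined["Victim_sex"] == "H"].index
by_label = combined.drop(bad_labels)

# Filtering with a boolean mask removes only the offending row.
by_mask = combined[~combined["Victim_sex"].isin(["H", "N", "-"])]

print(len(by_label), len(by_mask))
```

Passing `ignore_index=True` to `pd.concat` is another way to avoid duplicate labels in the first place.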

Let's replace the NaN values in Victim_sex with M, F & X in proportion to the observed M, F & X values.
For example, if male victims are 40% of all victims with a recorded sex (not including NaN) and we have 10 NaNs, then about 4 of them will be replaced with M.

In [160]:
sex_nan = df['Victim_sex'].isna()
length = sum(sex_nan)

In [161]:
list_sex = df['Victim_sex'].dropna().unique() # Find unique values of the Victim sex
list_sex

array(['M', 'F', 'X'], dtype=object)

In [162]:
sex_result = []
for value in list_sex:
  x = df[df.Victim_sex==value].Victim_sex.count() / df['Victim_sex'].count()
  sex_result.append(x)

In [163]:
sex_m = df[df['Victim_sex'] =='M'].Victim_sex.count() / df['Victim_sex'].count() 
sex_f = df[df['Victim_sex'] =='F'].Victim_sex.count() / df['Victim_sex'].count() 
sex_x = df[df['Victim_sex'] =='X'].Victim_sex.count() / df['Victim_sex'].count() 
print("The proportion of Men is {} , Women is {}, & Unknown is {}".format(sex_m,sex_f,sex_x))

The proportion of Men is 0.5057912699422701 , Women is 0.4580538426831102, & Unknown is 0.036154887374619665


In [164]:
replacement = random.choices(['M','F','X'],weights=  [sex_m,sex_f,sex_x],k = length )
df.loc[sex_nan,'Victim_sex']=replacement

In [165]:
df['Victim_sex'].value_counts(dropna=True)

M    1211198
F    1096240
X      86572
Name: Victim_sex, dtype: int64
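The proportional fill can also be sketched more compactly. The version below is an illustration on toy data, not the method used above: it takes the shares from `value_counts(normalize=True)` and draws from a seeded NumPy generator so that the fill is reproducible. The seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy column with missing values (hypothetical data).
sex = pd.Series(["M", "F", "M", None, "X", None, "F", "M"], name="Victim_sex")

# Observed shares of each category, ignoring missing values.
shares = sex.value_counts(normalize=True)

# A seeded generator makes the random fill reproducible between runs.
rng = np.random.default_rng(seed=5)
missing = sex.isna()
fill = rng.choice(shares.index.to_numpy(), size=missing.sum(), p=shares.to_numpy())
sex.loc[missing] = fill

print(shares)
print(sex.value_counts())
```

Because the fill is random, the category shares are preserved only approximately, and any relationship between the victim's sex and other columns is not taken into account.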

#### __2.4 Victim's Race:__

Let's replace the NaN values in the 'Victim_race' column. We will find the share of each race among the recorded victims & replace the NaNs with the different races in that proportion. For example, if Hispanic victims are 40% of all victims with a recorded race (not including NaN) and we have 10 NaNs, then about 4 of them will be replaced with H.

In [166]:
race_nan = df['Victim_race'].isna() # Find no. of NAN 
length = sum(race_nan)

In [167]:
list_race = df['Victim_race'].dropna().unique() # Find unique values of the Victim race
list_race

array(['H', 'W', 'B', 'A', 'O', 'X', 'K', 'I', 'J', 'F', 'C', 'P', 'V',
       'U', 'G', 'D', 'S', 'Z', 'L', '-'], dtype=object)

In [168]:
result=[]
for value in list_race:
  x = df[df.Victim_race == value].Victim_race.count()/df['Victim_race'].count()
  result.append(x) 

In [169]:
replacement = random.choices(list_race,weights= result,k = length )
df.loc[race_nan,'Victim_race']=replacement

Let's add a column 'Victim_race_name' containing the victim's race in words, based on the 'Victim_race' column and the codes provided along with the data, i.e.
A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - White X - Unknown Z - Asian Indian.

In [170]:
condition = [(df['Victim_race']=='A'),(df['Victim_race']=='B'),(df['Victim_race']=='C'),(df['Victim_race']=='D'),
    (df['Victim_race']=='F'),(df['Victim_race']=='G'),(df['Victim_race']=='H'),(df['Victim_race']=='I'),
    (df['Victim_race']=='J'),(df['Victim_race']=='K'),(df['Victim_race']=='L'),(df['Victim_race']=='O'),
    (df['Victim_race']=='P'),(df['Victim_race']=='S'),(df['Victim_race']=='U'),(df['Victim_race']=='V'),
    (df['Victim_race']=='W'),(df['Victim_race']=='X'),(df['Victim_race']=='Z')]

In [171]:
values = ['Other Asian','Black','Chinese','Cambodian','Filipino','Guamanian','Hispanic/Latin/Mexican',
          'American Indian/Alaskan Native','Japanese','Korean','Laotian','Other','Pacific Islander',
          'Samoan','Hawaiian','Vietnamese','White','Unknown','Asian Indian']

In [172]:
df['Victim_race_name']=np.select(condition,values,default = None)
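An equivalent lookup can be written with a dictionary and `Series.map`, which keeps each code next to its name. This is an alternative sketch, shown on a few hypothetical codes rather than the full dataframe.

```python
import pandas as pd

# Descent codes taken from the column description printed earlier.
descent_names = {
    "A": "Other Asian", "B": "Black", "C": "Chinese", "D": "Cambodian",
    "F": "Filipino", "G": "Guamanian", "H": "Hispanic/Latin/Mexican",
    "I": "American Indian/Alaskan Native", "J": "Japanese", "K": "Korean",
    "L": "Laotian", "O": "Other", "P": "Pacific Islander", "S": "Samoan",
    "U": "Hawaiian", "V": "Vietnamese", "W": "White", "X": "Unknown",
    "Z": "Asian Indian",
}

# Toy codes (hypothetical); '-' has no entry in the dictionary.
codes = pd.Series(["H", "W", "-", "Z"])

# Codes missing from the dictionary become NaN, much like default=None in np.select.
names = codes.map(descent_names)
print(names.tolist())
```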

### __3. VISUALIZATION:__
Let us try to establish relations between the columns of the dataframe by creating various graphs. As our data is huge, we will work on a sample.

In [48]:
df_sample =df.sample(frac=0.30,random_state=5).reset_index()

#### __3.1 Victim Age vs. No. of Crimes:__

Let's check the relation between the victim's age and the crime.

In [49]:
fig= px.histogram(df_sample,
                  x='Victim_age',color = 'Crime',
                  title = "Statistics of Crime & Victim's Age",
                  width = 1500, height =900
                      )
fig.update_layout(xaxis_title = "Victim's Age",
                  yaxis_title = "No. of Crimes")
fig.show()

Output hidden; open in https://colab.research.google.com to view.

#### __3.2 Top 20 Crimes :__
Let us view the top 20 crimes the City of Los Angeles is struggling with.

In [50]:
top_crime =df['Crime'].value_counts().sort_values(ascending = True).tail(20)

In [51]:
fig = px.bar(top_crime,
             x = 'Crime',
             title = 'Top 20 Crimes in Los-Angeles',
             )
fig.update_layout(xaxis_title = 'No. of Crimes',
                  yaxis_title ='Type of Crime',
                  )
fig.show()

We can see that thefts of various kinds dominate the list, with vehicle theft being the most common among them.

#### __3.3 Crime Trend over the years:__

Let's view the number of crimes reported per year from 2010 to 2022 for the city of LA.

In [52]:
line_data = df_sample.groupby([df_sample['Date_reported'].dt.year]).Crime.count()

In [53]:
fig=px.line(line_data,
              y='Crime',
               title = 'Crime Trend over the Years'
               )
fig.update_layout(xaxis_title = "Year",
                   yaxis_title ='No. of Crimes Reported')
fig.show()

We can see that the maximum number of crimes was reported in 2017. There is a drop in 2021, but we only have data for the first six months of 2021, so that year cannot be compared with the others.

#### __3.4 Most affected Sex by Crime:__

Let us view, which sex is most affected by the crimes committed.

In [54]:
fig = px.histogram(df_sample,
                   x = 'Victim_sex',
                   color ='Crime',
                   title = 'Most Affected Sex by Crime'
                   )
fig.update_layout(xaxis_title ="Victim's Sex",
                  yaxis_title ="No. of Crimes committed")
fig.show()

Output hidden; open in https://colab.research.google.com to view.

On closer inspection, we can see that females are most affected in crimes relating to sexual assault & domestic violence, whereas men are most affected in crimes pertaining to theft, homicide & vandalism.

#### __3.5 Most Affected Race:__
Let's now check which race is most affected by the crimes.

In [55]:
race_df =df_sample.groupby(['Victim_race_name'])['Crime'].count().sort_values(ascending = True)
race_df.to_frame().head(2)

Unnamed: 0_level_0,Crime
Victim_race_name,Unnamed: 1_level_1
Laotian,12
Samoan,16


In [56]:
fig = px.bar(race_df,x = 'Crime',title = "Races affected by Crimes")
fig.update_layout(xaxis_title ='No. of Crimes',
                  yaxis_title ="Victim's Race",
                  )
fig.show()

Here we can see that Hispanic/Latin/Mexican victims are the most affected by crime. This is plausible, as Hispanic residents are the largest ethnic group in California.

#### __3.6 Crimes by Year / Month:__
Let's check in which month & year crime was highest over the last 10 years.

In [57]:
crime_year = df_sample.groupby([df_sample['Date_reported'].dt.year.rename('Year'),
                             df_sample['Date_reported'].dt.month.rename('Month')]).Crime.count() 
crime_year = crime_year.to_frame().reset_index()
crime_year.head(2)

Unnamed: 0,Year,Month,Crime
0,2010,1,4703
1,2010,2,4459


In [58]:
crime_year = crime_year.pivot(index='Month', columns='Year', values='Crime')  # keyword arguments are required by recent pandas versions

In [59]:
fig =px.imshow(crime_year,text_auto =True,width=1000, height=700,
               labels=dict(x="Year", y="Month", color="No. of Crimes"),
               color_continuous_scale='RdBu_r')
fig.update_xaxes(side="top")
fig.show()

We can see that October 2017 recorded the highest number of crimes in the last decade.
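The peak cell of such a Month x Year table can also be read off programmatically instead of from the heatmap. A minimal sketch on toy counts (hypothetical numbers, not the real data):

```python
import pandas as pd

# Toy monthly counts (hypothetical numbers) in long format: one row per year and month.
counts = pd.DataFrame({
    "Year":  [2016, 2016, 2017, 2017],
    "Month": [8, 10, 8, 10],
    "Crime": [5900, 6000, 6100, 6300],
})

# Month x Year table, built with keyword arguments.
table = counts.pivot(index="Month", columns="Year", values="Crime")

# Year whose column holds the largest value, then the month within that year.
peak_year = table.max().idxmax()
peak_month = table[peak_year].idxmax()

print(peak_month, peak_year)
```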

#### __3.7 Time Lapse Criminal Trend:__

Let's check how crime has changed over the last decade & which age groups & races were affected the most.

In [60]:
bubble = df_sample.groupby(['Victim_race_name','Victim_age',df_sample['Date_reported'].dt.year.rename('Year')])['Crime'].count()
bubble = bubble.to_frame().reset_index().sort_values('Year')

In [61]:
fig = px.scatter (bubble,
                  x ='Victim_age',
                  y='Crime',
                  color = 'Victim_race_name',
                  size = 'Crime',
                  size_max=70,
                  title = 'Crime rates',
                  animation_frame ='Year',
                   width = 1100, height = 600,
                 )
fig.update_layout(xaxis_title ='Victim Age',
                  yaxis_title ='No. of Crimes')
fig.show()

We can see that crime has trended upward over the last decade.

#### __3.8 Top 10 Races which are victimized:__

In [62]:
top_race = df_sample.groupby(['Victim_race_name','Victim_sex'])['Crime'].count().to_frame().sort_values('Crime',ascending = False).reset_index().head(10)
top_race.head(4)

Unnamed: 0,Victim_race_name,Victim_sex,Crime
0,Hispanic/Latin/Mexican,F,134992
1,Hispanic/Latin/Mexican,M,133671
2,White,M,106317
3,White,F,82257


In [63]:
fig = px.bar(top_race,
             x='Victim_race_name',
             y='Crime',
             color ='Victim_sex',
             barmode ='group',
             title = 'Races affected the most by the Crimes')
fig.update_layout(xaxis_title ='Victim Race',
                  yaxis_title ='No. of Crimes committed')
fig.show()

#### __3.9 Crime Location:__

Let us explore the locations where crimes took place. As the data is huge, we will filter for "Criminal Homicide" for now.

In [64]:
homicide =df_sample[df_sample['Crime'].str.contains('homicide',case=False)].reset_index()

In [65]:
m=folium.Map(location = [34.0522,-118.2437],zoom_start =10)

for index,row in homicide.iterrows():
  loc = [row.Latitude,row.Longitude]
  iframe = folium.IFrame('Victim Age:' + str(row.Victim_age) + '<br>' + 'Victim Sex: ' + row.Victim_sex + '<br>' + 'Date:' + str(row.Date_reported)+ '<br>' + 'Time:' + str(row.Time))

  # vis1 = dict(Victim_age = row.Victim_age,Victim_Sex = row.Victim_sex,Date = row.Date_reported, Time =row.Time )
  folium.Marker(
    location=loc,
    popup=folium.Popup(iframe,min_width=225, max_width=80)).add_to(m)


In [66]:
m

#### __3.10 Time of Crime:__

Let us explore the time of day at which the various crimes have been committed.

In [212]:
df_time = df_sample.groupby(['Crime','Time'])['Crime'].count().rename('Count').to_frame().reset_index()
df_time=df_time[df_time['Count']>50]
df_time.head(2)

Unnamed: 0,Crime,Time,Count
923,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",00:01:00,277
927,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",00:05:00,159


In [139]:
fig =px.scatter(df_time,
             y='Time',
             x='Count',
             color='Crime',
             title = "Time of Criminal Acts",
             size ='Count',
             width = 1600, height =800,
             size_max = 90
             )
fig.update_layout(xaxis_title ="No. of Crime",
                  yaxis_title ='Time')

fig.show()

We can make the following inferences from the chart above:

Crimes most prevalent __During Daytime__:

1. Theft of Identity / Document forgery /other thefts
2. Children related crimes 

Crimes most prevalent __During Night / Evening__:
1. Vehicle Theft 
2. Vandalism
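
Inferences like these can be checked numerically by binning the time into day and night and comparing the shares per crime type. The sketch below uses toy records (hypothetical values); the definition of "Day" as 06:00 to 17:59 is an assumption made for illustration only.

```python
import datetime as dt
import pandas as pd

# Toy records (hypothetical); Time holds datetime.time objects, as in the 'Time' column.
toy = pd.DataFrame({
    "Crime": ["THEFT OF IDENTITY", "THEFT OF IDENTITY", "VEHICLE - STOLEN",
              "VEHICLE - STOLEN", "VEHICLE - STOLEN"],
    "Time": [dt.time(12, 0), dt.time(9, 30), dt.time(22, 0),
             dt.time(20, 15), dt.time(8, 0)],
})

# Assumption for illustration only: 'Day' means 06:00 to 17:59.
hour = toy["Time"].map(lambda t: t.hour)
toy["Period"] = hour.between(6, 17).map({True: "Day", False: "Night"})

# Share of each crime type falling in each period (each row sums to 1).
share = pd.crosstab(toy["Crime"], toy["Period"], normalize="index")
print(share)
```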

#### __3.11 Areas with Max. no. of Crimes reported:__

Let's explore which area in Los Angeles is the worst affected by crime.

In [207]:
df_area =df_sample['Area_name'].rename('Count').value_counts().head(20).to_frame().sort_values('Count',ascending =True)
df_area.head(2)

Unnamed: 0,Count
Foothill,26690
West Valley,30167


In [210]:
fig = px.bar(df_area,
                   x='Count',
                   title = "20 Areas with Maximum Crime",
                     )
fig.update_layout(xaxis_title = 'No. of Crimes',
                  yaxis_title ='Name of the Area')
fig.show()

The most affected area is 77th Street, with about 49.2K crimes in the 30% sample, followed by Southwest & Pacific.

### __4. SUMMARY & CONCLUSION:__

Let's summarise all the inferences we made from the above graphs.

__i. About Data__:
- We extracted a dataset from Kaggle relating to crimes in the City of Los Angeles from 2010 to 2021.
- The data contains about 2.39 million rows & 28 columns.
 
__ii. Inferences from the Data:__
- There is an upward trend in the number of crimes. 2017 recorded the highest number of crimes in the last decade, with the most crimes in the months of October & August.
- Battery, Stolen Vehicle & Burglary from Vehicle are the most common crimes in LA, with totals of 213K, 189K & 180K respectively.
- Hispanic / Latin is the worst affected race, followed by White & Black.
- Females are most affected in crimes relating to sexual assault & domestic violence, whereas men are most affected in crimes pertaining to theft, homicide & vandalism.
- The most affected area in the City of LA is 77th Street, followed by Southwest & Pacific.
- Theft of identity, document forgery, other thefts & children related crimes happen mostly during the daytime, whereas vehicle theft & vandalism happen mostly during the night. Other crimes are spread out throughout the day.

__iii. Conclusion__:
The City of Los Angeles has registered about 2.39 million crimes in the last decade, with battery & vehicle theft as the most common crimes. As Hispanic residents form the largest group in the population, it is unsurprising that they account for the largest number of victims. Females are most affected in crimes relating to sexual abuse & domestic violence, whereas men are most affected by thefts & other crimes.



### __5. FUTURE WORK:__

In the future, we can compare crime in the City of LA with that of other cities in California or the United States.

### __6. REFERENCES:__

1. Dataset (extracted from Kaggle): https://www.kaggle.com/datasets/sumaiaparveenshupti/los-angeles-crime-data-20102020
2. Visualization (Plotly Express): https://plotly.com/python/plotly-express/
