# Progress Report
<h5><I>Team KungFu Pandas: Raj Patel, Ayush Jamindar, Amrita Rajesh, Saloni Mhatre, Lakshmi Krishna</I></h5>
<div><img src= 'teamMascot.jpeg' width=300></div>

## Project Introduction
We are analyzing a crime dataset and a housing dataset, both of which are publicly available:

- Chicago Crime Data Source: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data

- Zillow Data Source: https://www.zillow.com/research/data/

The questions we are investigating are: `Is there a correlation between housing prices and crime in Chicago?`, `Which neighborhoods are the least safe to move into?`, `Has crime increased post-COVID compared to pre-COVID?`, and `What is the most common type of crime committed in the Chicago area?`

## Any Changes

We have added a new dataset to our project, the `Housing Dataset`. We will use it to answer questions such as: is there a correlation between housing prices and crime activity?

#### `IMPORTANT NOTE`: 
Please create a folder called `csv_files`. It will hold all the CSV files, so after downloading the data, please put the files in this folder.
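The folder can also be created from the notebook itself. A minimal sketch using only the standard library, assuming the notebook runs from the project root:

```python
from pathlib import Path

# Create the csv_files folder next to the notebook if it does not exist yet.
csv_dir = Path("csv_files")
csv_dir.mkdir(exist_ok=True)
print(csv_dir.resolve())
```

`exist_ok=True` makes the cell safe to re-run once the folder is in place.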



In [25]:
import pandas as pd
import numpy as np
from CleaningPR import *
from ML_pr import *
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

## Data Cleaning Process
<div><img src= 'cleaningpic.png' width=100></div>
This process covers the necessary steps, starting with inspecting the data (data size, number of features, number of records, mean value, max value, etc.). It also includes dropping some columns and rows, adding more information, joining the dataframes, and storing them in separate CSV files for easier access in the future.

#### Crime Dataset Information and Cleaning
- This is the crime data that we acquired from the link above. It covers crimes that took place in Chicago from `January 2001` to `February 2024`
- Granularity: Each row represents an individual reported crime, with details such as ID, Case Number, Date, etc.
- Contains `~8 million` records

In [2]:
# This is the original data before cleaning is applied
crime_data = pd.read_csv('csv_files/Crimes_2001_to_Present.csv')
crime_data.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11037294,JA371270,03/18/2015 12:00:00 PM,0000X W WACKER DR,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,BANK,False,False,...,42.0,32.0,11,,,2015,08/01/2017 03:52:26 PM,,,
1,11646293,JC213749,12/20/2018 03:00:00 PM,023XX N LOCKWOOD AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,...,36.0,19.0,11,,,2018,04/06/2019 04:04:43 PM,,,
2,11645836,JC212333,05/01/2016 12:25:00 AM,055XX S ROCKWELL ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,...,15.0,63.0,11,,,2016,04/06/2019 04:04:43 PM,,,
3,11645959,JC211511,12/20/2018 04:00:00 PM,045XX N ALBANY AVE,2820,OTHER OFFENSE,TELEPHONE THREAT,RESIDENCE,False,False,...,33.0,14.0,08A,,,2018,04/06/2019 04:04:43 PM,,,
4,11645601,JC212935,06/01/2014 12:01:00 AM,087XX S SANGAMON ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,21.0,71.0,11,,,2014,04/06/2019 04:04:43 PM,,,


#### Converting the Date

In [3]:
crime_data = convertCrimeData(crime_data) # Convert the crime data to a more suitable format


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  crime_data['RegionName'] = crime_data['Community Area'].apply(get_community) # change coordinates to neighborhood name
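The internals of `convertCrimeData` live in `CleaningPR` and are not shown here. The sketch below is only an assumption about the kind of work it does: parse the 12-hour `Date` strings into datetimes, and take an explicit `.copy()` of a filtered frame so that later column assignments do not raise the `SettingWithCopyWarning` seen above. The `raw` and `subset` names are hypothetical.

```python
import pandas as pd

# Two rows copied from the head() output above; only the columns needed here.
raw = pd.DataFrame({
    "ID": [11037294, 11646293],
    "Date": ["03/18/2015 12:00:00 PM", "12/20/2018 03:00:00 PM"],
    "Community Area": [32.0, 19.0],
})

# .copy() makes the filtered result an independent frame, so assigning a new
# column below does not trigger SettingWithCopyWarning.
subset = raw[raw["Community Area"].notna()].copy()

# Parse the 12-hour timestamp strings into datetime64 values.
subset["New_Date"] = pd.to_datetime(subset["Date"], format="%m/%d/%Y %I:%M:%S %p")
print(subset[["ID", "New_Date"]])
```

Passing an explicit `format` avoids ambiguity between month-first and day-first dates and is usually faster on large files.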


#### Step 2) Dropping the unnecessary columns such as X & Y Coordinate, Date, Block, IUCR, Description, Domestic, Beat, District, FBI Code, Ward, Updated On, Latitude, Longitude

In [4]:
# Columns to keep; every other column is dropped
col = ['ID', 'New_Date', 'Primary Type', 'Location Description', 'Arrest', 'Community Area', 'RegionName']
crime_data = dropCrimeDataColumns(col, crime_data)
crime_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7297146 entries, 11 to 8000074
Data columns (total 7 columns):
 #   Column                Dtype         
---  ------                -----         
 0   ID                    int64         
 1   New_Date              datetime64[ns]
 2   Primary Type          object        
 3   Location Description  object        
 4   Arrest                int64         
 5   Community Area        float64       
 6   RegionName            object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 445.4+ MB


### Step 3) Filtering

In [5]:
# This function separates the pre-COVID (2017-2019) and post-COVID (2021-present) crimes into two dataframes.
(crime_data_2017_2019, crime_data_2021_present) = pre_covid_post_covid(crime_data)
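The body of `pre_covid_post_covid` is not shown in this report. As a rough sketch of the filtering idea, assuming it uses boolean masks on `New_Date` (the function name `split_pre_post_covid` and the toy rows are hypothetical):

```python
import pandas as pd


def split_pre_post_covid(df, date_col="New_Date"):
    """Hypothetical stand-in for pre_covid_post_covid()."""
    dates = df[date_col]
    # Pre-COVID window: 2017-01-01 up to, but not including, 2020-01-01.
    pre_mask = (dates >= pd.Timestamp("2017-01-01")) & (dates < pd.Timestamp("2020-01-01"))
    # Post-COVID window: 2021-01-01 onwards.
    post_mask = dates >= pd.Timestamp("2021-01-01")
    return df[pre_mask].copy(), df[post_mask].copy()


# Toy rows for illustration only.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "New_Date": pd.to_datetime(["2016-05-01", "2018-07-04", "2020-06-15", "2022-01-11"]),
})
pre, post = split_pre_post_covid(toy)
print(len(pre), len(post))
```

Rows from 2020 fall into neither frame, which matches the 2017-2019 and 2021-present ranges described in the comment above.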

In [6]:
# Pre Covid Range Verification
print("Min new_date value: ", crime_data_2017_2019['New_Date'].min()) # Earliest record
print("Max new_date value: ", crime_data_2017_2019['New_Date'].max()) # Latest record

Min new_date value:  2017-01-01 00:00:00
Max new_date value:  2019-12-31 23:55:00


In [7]:
crime_data_2017_2019.head()

Unnamed: 0,ID,New_Date,Primary Type,Location Description,Arrest,Community Area,RegionName,Severity_Score
96,12098557,2019-02-01 00:01:00,BATTERY,RESIDENCE,0,63.0,Gage Park,Medium
283,12082526,2019-09-24 12:00:00,DECEPTIVE PRACTICE,RESIDENCE,0,3.0,Uptown,Medium
527,11859264,2019-10-13 06:40:00,CRIMINAL DAMAGE,APARTMENT,0,29.0,North Lawndale,Medium
641,11662417,2019-04-21 12:30:00,ROBBERY,RESIDENCE,0,44.0,Chatham,High
663,12990873,2019-08-17 13:14:00,OFFENSE INVOLVING CHILDREN,RESIDENCE,1,23.0,Humboldt Park,High


In [8]:
# Post Covid Range Verification
print("Min new_date value: ", crime_data_2021_present['New_Date'].min()) # Earliest record
print("Max new_date value: ", crime_data_2021_present['New_Date'].max()) # Latest record

Min new_date value:  2021-01-01 00:00:00
Max new_date value:  2024-02-10 00:00:00


In [9]:
crime_data_2021_present.head()

Unnamed: 0,ID,New_Date,Primary Type,Location Description,Arrest,Community Area,RegionName,Severity_Score
371,13204489,2023-09-06 11:00:00,THEFT,PARKING LOT / GARAGE (NON RESIDENTIAL),0,32.0,Loop,Low
643,12342615,2021-04-17 15:20:00,ROBBERY,RESIDENCE,1,44.0,Chatham,High
646,12589893,2022-01-11 15:00:00,SEX OFFENSE,RESIDENCE,0,46.0,South Chicago,High
647,12592454,2022-01-14 15:55:00,OTHER OFFENSE,RESIDENCE,0,68.0,Englewood,Medium
648,12785595,2022-08-05 21:00:00,SEX OFFENSE,APARTMENT,1,69.0,Greater Grand Crossing,High


### Taking the data from the past decade to use for the machine learning model

In [10]:
crime_data_2014 = decade_crime(crime_data)

### Step 4) Saving the Dataframe to a CSV file

In [11]:
crime_data_2021_present.to_csv('csv_files/Crimes_2021_to_Present.csv', index=False)
crime_data_2017_2019.to_csv('csv_files/Crimes_2017_to_2019.csv', index=False)
crime_data_2014.to_csv('csv_files/Crimes_2014.csv', index=False)


#### Neighborhood Dataset Information and Cleaning
- The data was acquired from Zillow and shows the average house price for each neighborhood in the country
- Granularity: Each row represents a neighborhood in a state and shows the average house price for each month from `1-31-2000` to `1-31-2024`
- Contains average monthly real estate prices for `~21000` neighborhoods across the U.S.

In [12]:
neighborhood_data = pd.read_csv('csv_files/Neighborhood_House_Price.csv')
neighborhood_data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2000-01-31,...,2023-04-30,2023-05-31,2023-06-30,2023-07-31,2023-08-31,2023-09-30,2023-10-31,2023-11-30,2023-12-31,2024-01-31
0,112345,0,Maryvale,neighborhood,AZ,AZ,Phoenix,"Phoenix-Mesa-Chandler, AZ",Maricopa County,66775.313666,...,313492.5,314776.5,316614.5,319072.5,322054.6,324693.8,327100.8,329141.1,330703.5,331714.1
1,192689,1,Paradise,neighborhood,NV,NV,Las Vegas,"Las Vegas-Henderson-Paradise, NV",Clark County,132638.938818,...,358563.7,358037.2,358754.6,360550.8,363426.5,366274.1,368744.6,370886.7,372963.4,374854.1
2,270958,2,Upper West Side,neighborhood,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",New York County,387530.423074,...,1276836.0,1270266.0,1264532.0,1258336.0,1248721.0,1238858.0,1227969.0,1216308.0,1208912.0,1203406.0
3,270957,3,Upper East Side,neighborhood,NY,NY,New York,"New York-Newark-Jersey City, NY-NJ-PA",New York County,634533.128812,...,1259968.0,1250928.0,1245395.0,1241081.0,1236655.0,1232169.0,1224024.0,1212976.0,1202819.0,1196051.0
4,118208,4,South Los Angeles,neighborhood,CA,CA,Los Angeles,"Los Angeles-Long Beach-Anaheim, CA",Los Angeles County,127876.428774,...,619868.4,620830.5,624531.4,631738.0,641397.3,651175.4,659477.2,665923.5,670126.6,667898.8


#### Step 1: Filtering
Extract data only from neighborhoods in Chicago. 

In [13]:
neighborhood_data = filterNeighborhood(neighborhood_data)
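`filterNeighborhood` is defined in `CleaningPR`. A minimal sketch of the filtering step, assuming it selects rows by the `City` and `State` columns shown in the Zillow table above (the toy frame is a hand-made subset of those columns):

```python
import pandas as pd

# Toy subset of the Zillow columns, for illustration only.
toy = pd.DataFrame({
    "RegionName": ["Maryvale", "Lake View", "Upper West Side", "Logan Square"],
    "State": ["AZ", "IL", "NY", "IL"],
    "City": ["Phoenix", "Chicago", "New York", "Chicago"],
})

# Keep only Chicago, IL neighborhoods; checking State as well guards against
# other cities that share the name.
chicago = toy[(toy["City"] == "Chicago") & (toy["State"] == "IL")].copy()
print(chicago["RegionName"].tolist())
```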

In [14]:
(neighborhood_data_2017_2019, neighborhood_data_2021_present) = pre_covid_hd_post_covid_hd(neighborhood_data)


In [15]:
neighborhood_data_2017_2019.head()

Unnamed: 0,RegionName,2017-01-31,2017-02-28,2017-03-31,2017-04-30,2017-05-31,2017-06-30,2017-07-31,2017-08-31,2017-09-30,...,2019-03-31,2019-04-30,2019-05-31,2019-06-30,2019-07-31,2019-08-31,2019-09-30,2019-10-31,2019-11-30,2019-12-31
42,Lake View,320800.529961,323403.875393,325554.648715,326994.066372,327962.336934,327811.478035,327962.790422,327437.84797,327554.602956,...,327436.11262,327638.943873,327428.128541,326810.851395,325654.374039,324470.264123,323266.43015,322409.458344,321944.598794,322010.510163
88,West Ridge,187637.444136,189553.396344,191027.034117,191889.952161,192260.367968,191657.65431,191412.149087,191274.527805,192094.143374,...,204783.405502,204814.591194,204405.009083,204249.895972,203823.411224,203506.384146,202946.678331,203030.557504,203433.095567,204670.639264
97,Little Village,90925.075575,92159.981171,93255.225637,93859.979012,94812.109531,95124.032869,95401.547996,95832.125714,97255.880295,...,124782.842092,126055.982656,126805.514295,127412.094949,126981.804065,127537.226453,127830.794507,128503.409806,129227.471761,130157.108534
99,Logan Square,385189.738477,387715.936437,388969.12531,389250.314752,388709.642645,387398.503138,386883.966756,386048.664822,386580.050284,...,414528.089528,416111.230615,415797.190256,414317.801661,412092.450971,410604.326047,409346.866443,408934.379163,409406.784465,411298.166343
123,Lincoln Park,527941.394692,530161.421716,532967.822467,535106.275721,538274.75336,540584.689695,544013.398935,545448.299114,546221.200203,...,533113.718875,531744.025668,528767.619898,527003.180168,525522.950173,524547.076555,522445.253179,521778.078281,520769.321925,520116.842387


In [16]:
neighborhood_data_2021_present.head()

Unnamed: 0,RegionName,2021-01-31,2021-02-28,2021-03-31,2021-04-30,2021-05-31,2021-06-30,2021-07-31,2021-08-31,2021-09-30,...,2023-04-30,2023-05-31,2023-06-30,2023-07-31,2023-08-31,2023-09-30,2023-10-31,2023-11-30,2023-12-31,2024-01-31
42,Lake View,335152.964276,337100.070376,338529.196964,339798.571621,340754.044648,341788.369111,343067.924135,344281.139606,344721.731199,...,345791.989131,348243.716515,351331.955928,354372.109448,358274.505322,361767.239038,364245.140803,365404.235809,365273.873903,364719.918837
88,West Ridge,213600.22992,215838.540172,217654.90533,219105.803791,219738.822606,220295.24561,221026.99873,221449.948867,220854.555142,...,219505.468935,220873.395671,222693.53315,224815.395489,227233.660863,228789.686683,229263.487182,229190.543498,229300.289538,229288.170544
97,Little Village,150609.690221,154491.459007,159032.561218,163509.685324,166556.653383,168400.346553,169456.453588,169831.819215,169697.529621,...,168529.731016,170215.552104,171405.516615,172511.518632,174512.511548,176942.10929,178664.607033,179329.071751,179274.978734,179381.863942
99,Logan Square,438745.604328,442793.484892,446040.066569,448541.706261,449749.184842,449763.281273,450720.802504,452165.579963,453102.624344,...,438170.581717,441159.953902,444235.237026,446830.281993,449198.964839,449859.66757,449269.546622,448226.586436,447136.438126,446721.925355
123,Lincoln Park,521486.137219,522856.818945,523647.18224,523534.460888,522886.530447,522981.141822,524273.72531,525620.489472,525261.092467,...,537512.792193,539139.274169,541588.440708,543950.923591,546390.818324,548270.857647,549318.70565,549895.692694,549323.122838,548638.640501


#### Step 2: Transposing the Data
Reset the index and rotate the dataframe so that each neighborhood becomes a column and each row holds the average property price for one month. This makes it easier to apply aggregate functions.
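The rotation can be sketched with plain pandas. This is an assumption about what `transpose_data` does, not its actual code; the values are taken from the first two months of the tables in this report, rounded to two decimals.

```python
import pandas as pd

# Wide layout: one row per neighborhood, one column per month.
wide = pd.DataFrame({
    "RegionName": ["Lake View", "West Ridge"],
    "2017-01-31": [320800.53, 187637.44],
    "2017-02-28": [323403.88, 189553.40],
})

# Rotate: neighborhoods become columns, months become rows.
rotated = (
    wide.set_index("RegionName")
        .T
        .rename_axis("date")
        .reset_index()
)
rotated.columns.name = None  # drop the leftover "RegionName" axis label
print(rotated)
```

In this sketch a second call would fail with a `KeyError`, because the `RegionName` column no longer exists after the first rotation. That is consistent with a run-once restriction on the real helper.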


In [17]:
# NOTE: Run this cell only once; running it again rotates the already-rotated dataframe and produces an error
neighborhood_data_2017_2019 = transpose_data(data=neighborhood_data_2017_2019)
neighborhood_data_2017_2019.head()

Unnamed: 0,date,Lake View,West Ridge,Little Village,Logan Square,Lincoln Park,South Austin,Irving Park,Rogers Park,Uptown,...,Beverly Woods,Sleepy Hollow,Lithuanian Plaza,Forest Glen,Beverly View,Heart of Italy,Golden Gate,Marycrest,Mount Greenwood Heights,Schorsch Forest View
0,2017-01-31,320800.53,187637.44,90925.08,385189.74,527941.39,104123.27,308905.87,156084.35,222163.9,...,196512.18,150302.7,109474.19,353627.49,115870.21,174695.62,25263.13,222620.47,247078.01,267835.82
1,2017-02-28,323403.88,189553.4,92159.98,387715.94,530161.42,104668.31,311086.23,157619.44,223640.4,...,198572.76,152146.67,114100.98,357398.54,117361.01,177001.34,25686.9,224828.56,248058.67,269378.31
2,2017-03-31,325554.65,191027.03,93255.23,388969.13,532967.82,105138.79,313207.54,158774.66,224694.87,...,200222.71,153710.0,117062.46,360254.8,118906.74,178717.89,26756.02,227139.69,248698.29,270910.81
3,2017-04-30,326994.07,191889.95,93859.98,389250.31,535106.28,104823.29,314064.33,159399.12,225477.48,...,201977.06,154409.58,116186.07,363148.26,120138.56,180467.27,27796.86,228359.51,248938.92,272070.8
4,2017-05-31,327962.34,192260.37,94812.11,388709.64,538274.75,104754.83,314049.22,159471.94,225828.91,...,203294.72,155492.11,115115.95,365085.52,120869.52,181256.7,28691.81,228371.01,249231.38,272600.6


In [18]:
neighborhood_data_2021_present = transpose_data(data=neighborhood_data_2021_present)
neighborhood_data_2021_present.head()

Unnamed: 0,date,Lake View,West Ridge,Little Village,Logan Square,Lincoln Park,South Austin,Irving Park,Rogers Park,Uptown,...,Beverly Woods,Sleepy Hollow,Lithuanian Plaza,Forest Glen,Beverly View,Heart of Italy,Golden Gate,Marycrest,Mount Greenwood Heights,Schorsch Forest View
0,2021-01-31,335152.96,213600.23,150609.69,438745.6,521486.14,180917.37,338774.43,179522.2,238384.29,...,240022.62,205658.71,190487.83,395487.22,151295.22,230912.58,55617.85,275760.18,280946.44,307632.09
1,2021-02-28,337100.07,215838.54,154491.46,442793.48,522856.82,185608.39,342486.69,181330.85,239973.93,...,244533.1,210608.34,194609.02,399358.9,155797.67,233706.65,57571.2,280995.2,285493.77,311662.78
2,2021-03-31,338529.2,217654.91,159032.56,446040.07,523647.18,191007.91,346194.55,182732.26,241113.3,...,248901.59,216042.02,198440.91,403079.57,161320.13,235914.92,60538.07,286359.51,289930.03,316208.5
3,2021-04-30,339798.57,219105.8,163509.69,448541.71,523534.46,196147.32,349930.12,183811.5,241846.05,...,253083.58,220828.19,201429.1,407398.15,166625.77,236975.4,63795.06,290788.4,293810.5,320674.89
4,2021-05-31,340754.04,219738.82,166556.65,449749.18,522886.53,199573.26,352042.45,184104.01,242134.79,...,256385.2,224200.84,202531.69,410880.33,171526.91,237436.86,66521.93,294977.07,297387.52,323826.16


In [19]:
neighborhood_data_2017_2019.to_csv('csv_files/neighborhood_data_2017_2019.csv', index = False)
neighborhood_data_2021_present.to_csv('csv_files/neighborhood_data_2021_present.csv', index = False)

<H3><I>END OF CLEANING PROCESS</I></H3>

## Exploratory Data Analysis
<div><img src= 'edapic.png' width=300></div>

<H3><I>END OF EXPLORATORY DATA ANALYSIS</I></H3>

## Visualizations
<div><img src= 'vispic.jpeg' width=300></div>
(Amrita and Saloni put your work here)

<H3><I>END OF VISUALIZATIONS</I></H3>

## Machine Learning
<div><img src= 'mlpicture.png' width=300></div>
Goal: predict the probability that a person is arrested, given the primary type of crime, the location description, and the neighborhood (`RegionName`).

### Step 1: Splitting
Split the data into training and testing sets using the decade crime data (the `crime_data_2014` dataframe, saved as `csv_files/Crimes_2014.csv`). The testing data must not be touched until the final evaluation.

In [20]:
X = crime_data_2014[['Primary Type', 'Location Description', 'RegionName']]
y = crime_data_2014['Arrest'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training, 20% testing


In [21]:
len(X_train)

2008524

### Step 2: Best Feature
Get the feature(s) that give us the best accuracy (e.g. only `Primary Type`, only `Primary Type` and `RegionName`, and so on).

The `feature_selection_and_evaluation()` function uses a logistic regression model to find the best parameters.

1) We use an encoder to change the string variables to 0s and 1s to make it easier to fit and train the model
2) We use k-fold cross-validation, which splits the training data into K folds, to evaluate how well the model handles data it has not seen for each candidate set of parameters
3) Use logistic regression to fit and predict using the training data (~2 million entries)
4) Return the best accuracy and the parameters for that accuracy
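The four steps above can be sketched as follows. The real `feature_selection_and_evaluation()` lives in `ML_pr` and may differ; `select_best_features`, the exhaustive search over feature subsets, the default of 5 folds, and the toy rows are all assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def select_best_features(X, y, k=5):
    """Hypothetical stand-in for feature_selection_and_evaluation()."""
    best_features, best_score = None, -1.0
    for size in range(1, len(X.columns) + 1):
        for subset in combinations(X.columns, size):
            # Encoding inside the pipeline means each fold is encoded using
            # only its own training portion.
            pipeline = make_pipeline(
                OneHotEncoder(handle_unknown="ignore"),
                LogisticRegression(solver="liblinear", random_state=42),
            )
            # Mean accuracy over k cross-validation folds.
            score = cross_val_score(pipeline, X[list(subset)], y, cv=k).mean()
            if score > best_score:
                best_features, best_score = list(subset), score
    return best_features, best_score


# Toy data, made up for illustration; not real results.
toy_X = pd.DataFrame({
    "Primary Type": ["THEFT", "ROBBERY"] * 10,
    "Location Description": ["RESIDENCE", "APARTMENT", "BANK", "RESIDENCE"] * 5,
})
toy_y = pd.Series([0, 1] * 10)

toy_features, toy_score = select_best_features(toy_X, toy_y)
print(toy_features, toy_score)
```

With three candidate columns the exhaustive search fits only seven subsets, so it stays cheap; with many more columns a greedy search would be the more practical choice.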

In [22]:
(features, accuracy) = feature_selection_and_evaluation(X_train, y_train)

In [23]:
print(features, accuracy)

['Primary Type', 'Location Description', 'RegionName'] 0.8786626398290486


### Step 3: Training/Testing the model
Using the best features returned by the `feature_selection_and_evaluation()` function, train the `LogisticRegression` model on the same training set as before (to avoid creating a new one) and test the model on Crime_Testing_Dataset.csv.

#### Training the model

In [27]:
encoder = OneHotEncoder(handle_unknown='ignore')  # Handle unknown categories by ignoring them

X_train_encoded = encoder.fit_transform(X_train[features])

X_test_encoded = encoder.transform(X_test[features])

model = LogisticRegression(random_state=42, solver='liblinear')

#### Testing the model

In [28]:
model.fit(X_train_encoded, y_train)
probabilities = model.predict_proba(X_test_encoded)[:, 1]

# y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, (probabilities > 0.5).astype(int))
print("Accuracy:", accuracy)


Accuracy: 0.8786813029243307


This is the simple baseline model and its accuracy on the training set.

In [None]:
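baselineClf = MajorityLabelClassifier()
baselineClf.fit(X_train, y_train)
predict_Y = baselineClf.predict(X_train)

# Count matching predictions; `correct` avoids shadowing the built-in sum()
correct = 0
for actual, predicted in zip(y_train, predict_Y):
    if actual == predicted:
        correct += 1

print(correct)
print(correct / len(y_train))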

0
1623679
0.8083941242424786


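Using `Primary Type`, `Location Description`, and `RegionName` as input features, our logistic regression model reaches about 87.9% accuracy on the test set. The baseline model, which always predicts the most common label (the mode), reaches about 80.8% accuracy on the training set. Our model therefore beats the baseline by roughly 7 percentage points, which suggests it is not badly underfitting. The cross-validated accuracy on the training data (0.8787) is also nearly identical to the test accuracy (0.8787), which suggests the model is not overfitting.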
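For reference, a majority-label baseline can be written in a few lines. `MajorityLabelClassifier` itself is defined in `ML_pr` and is not shown here, so the class below is a hypothetical stand-in, and the toy labels are made up for illustration.

```python
from collections import Counter


class MajorityLabelSketch:
    """Hypothetical stand-in for MajorityLabelClassifier (defined in ML_pr)."""

    def fit(self, X, y):
        # Remember the most frequent label; X is ignored on purpose.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the stored majority label for every row.
        return [self.majority_] * len(X)


X_toy = [["THEFT"], ["ROBBERY"], ["BATTERY"], ["THEFT"], ["THEFT"]]
y_toy = [0, 0, 1, 0, 0]

baseline = MajorityLabelSketch().fit(X_toy, y_toy)
predictions = baseline.predict(X_toy)
baseline_accuracy = sum(p == t for p, t in zip(predictions, y_toy)) / len(y_toy)
print(baseline_accuracy)  # 0.8, the share of the majority class in y_toy
```

The accuracy of such a baseline always equals the share of the majority class, which is why it is a useful floor for any real model.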

### Model Usage
First, we trained our model on crime data from the past decade (2014-present) so that it can learn from as much data as possible. Our stakeholders (`Residents of Chicago, UIC students, new settlers and Chicago Police Department`) can then use it to predict the probability of a person being arrested based on the type of crime, the neighborhood, and the description of the location.
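A single prediction could look like the sketch below. It refits a tiny model on made-up rows so that it runs on its own; in the notebook, the already fitted `encoder` and `model` from the training step would be reused instead. The training rows and the query are illustrative assumptions, not real results.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

cols = ["Primary Type", "Location Description", "RegionName"]

# Toy training rows, made up for illustration only.
train = pd.DataFrame(
    [
        ["THEFT", "RESIDENCE", "Loop"],
        ["ROBBERY", "RESIDENCE", "Chatham"],
        ["THEFT", "APARTMENT", "Uptown"],
        ["ROBBERY", "APARTMENT", "Chatham"],
    ],
    columns=cols,
)
arrested = [0, 1, 0, 1]

encoder = OneHotEncoder(handle_unknown="ignore")
model = LogisticRegression(random_state=42, solver="liblinear")
model.fit(encoder.fit_transform(train), arrested)

# A query must use the same column names and the same fitted encoder.
query = pd.DataFrame([["ROBBERY", "RESIDENCE", "Loop"]], columns=cols)
arrest_probability = model.predict_proba(encoder.transform(query))[0, 1]
print(f"Estimated arrest probability: {arrest_probability:.2f}")
```

Because the encoder uses `handle_unknown="ignore"`, a category that never appeared in training is encoded as all zeros rather than raising an error.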

<H3><I>END OF MACHINE LEARNING</I></H3>

## Reflection
<div><img src= 'reflection_pic.avif' width=200></div>

`What is hardest part of the project that you’ve encountered so far?`

- The hardest part of this project was understanding and cleaning the data, and working out how we can use the housing dataset and the crime dataset together to come up with hypotheses. The second hardest part was determining how our machine learning model can be useful to the stakeholders, because the model only gives us predictions; it is up to us to decide how to use them to solve our problem.

`What are your initial insights?`

- 

`Are there any concrete results you can show at this point? If not, why not?`

- Yes, we can show concrete results with our visualizations. (Talk more about visualization)

`Going forward, what are the current biggest problems you’re facing?`

- The biggest problem we are facing is merging the Housing Dataset and the Crime Dataset, since some neighborhoods appear in the Housing Dataset but not in the Crime Dataset, while others appear in the Crime Dataset but not in the Housing Dataset.
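One way to list the mismatched neighborhoods is an outer merge with pandas' `indicator=True` option. The two toy frames below are made up to show the mechanics and say nothing about which neighborhoods actually differ between the datasets.

```python
import pandas as pd

# Toy neighborhood lists, for illustration only.
crime_regions = pd.DataFrame({"RegionName": ["Loop", "Chatham", "Uptown"]})
housing_regions = pd.DataFrame({"RegionName": ["Uptown", "Lake View", "Loop"]})

# The _merge column records whether each name came from the left frame,
# the right frame, or both.
matched = crime_regions.merge(
    housing_regions, on="RegionName", how="outer", indicator=True
)

crime_only = matched.loc[matched["_merge"] == "left_only", "RegionName"].tolist()
housing_only = matched.loc[matched["_merge"] == "right_only", "RegionName"].tolist()
print("Only in crime data:", crime_only)
print("Only in housing data:", housing_only)
```

The two lists give a concrete starting point for deciding whether to rename, map, or drop the unmatched neighborhoods.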

`Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?`

- Yes, we set a personal due date for this progress report and managed to finish all the parts efficiently

`Given your initial exploration of the data, is it worth proceeding with your project, why? If not, how are you going to change your project and why do you think it’s better than your current results?`

- Yes, we have all the information necessary to apply ML and statistics, test our hypotheses, and come up with interesting findings


## Next Step
<div><img src= 'step_pic.jpeg' width=200></div>

`What you plan to accomplish in the next month and how you plan to evaluate whether your project achieved the goals you set for it.`

- We are planning to run a t-test on our hypothesis `Crime has increased post-COVID` and decide whether to reject or fail to reject the null hypothesis. We will also try to build another ML model that includes the average neighborhood housing price to predict the chance of an offender being arrested. On top of this, we will try to uncover more interesting findings through visualization, guided by our EDA

- We will `split the work` among the team members using a GitHub Kanban board so there is a smooth workflow and a clear line of communication. We will also set a `personal due date` about a week before the actual project due date so we can tie up any loose ends
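The planned t-test could be set up along these lines. The sketch uses Welch's t-test from SciPy (`equal_var=False`), which is a choice made here and not something the team has committed to. The monthly counts are placeholders, not real results, and the 0.05 threshold is an assumption.

```python
from scipy import stats

# Placeholder monthly crime counts, NOT real results.
pre_covid_counts = [210, 198, 225, 240, 231, 219]
post_covid_counts = [205, 230, 244, 251, 238, 247]

# Null hypothesis: the mean monthly count is the same in both periods.
result = stats.ttest_ind(pre_covid_counts, post_covid_counts, equal_var=False)
t_statistic, p_value = result.statistic, result.pvalue

alpha = 0.05  # assumed significance level
decision = "reject" if p_value < alpha else "fail to reject"
print(f"t = {t_statistic:.3f}, p = {p_value:.3f}: {decision} the null hypothesis")
```

This is a two-sided test, so a rejection alone does not say whether crime went up or down; the sign of the t statistic (negative here would mean the pre-COVID mean is lower) gives the direction.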

<H1><I>END OF PROGRESS REPORT</I></H1>