# Exploratory Data Analysis 

## Highlight of Trends

1. **Seasonality of Complaints**

    - Certain complaint types show strong seasonality. For example, Heat/Hot Water complaints spike in winter, while Noise complaints tend to spike in summer.

        ![Monthly complaint volume by type](images/seasonality_trend.png)



2. **Complaint Volume by Borough**

    - Brooklyn has the highest total complaint volume of all boroughs, but the Bronx has the highest complaint volume per capita.

        ![Monthly complaint volume by borough](images/compliant_vol_by_borough.png)

        ![Monthly complaint volume per 1,000 residents by borough](images/compliants_vol_per_capita.png)


3. **Resolution Time by Borough**

    - Among complaint types with longer average resolution times (over 1.5 days), Manhattan has the slowest average turnaround.

        ![Average resolution time by borough](images/resolution_by_borough.png)


4. **Complaint Volume by Time**

    - Complaint volumes peak around midday (10 AM–2 PM) on weekdays, while weekend complaints are spread more evenly throughout the day.

        ![Complaint volume by day of week and time of day](images/compliant_timing.png)



## Code

In [2]:
import pandas as pd
import plotly.express as px

In [7]:
df = pd.read_parquet("data/cache_clean_data/cleaned_data.parquet")

In [8]:
nyc_pop = pd.read_csv("data/cache_clean_data/nyc_population.csv")

### basic information

In [9]:
df.describe(include='all')

Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,city,facility_type,...,resolution_action_updated_date,borough,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,latitude,longitude,days_to_close,created_month,closed_month
count,17899464.0,17899464,17573983,17899464,17899464,17899464,17740931,17623448.0,16957069,1249011.0,...,17753802,17861029,17566330.0,17567590.0,17899464,17566190.0,17566190.0,17573980.0,17899464,17573983
unique,17899464.0,,,18,19,269,1281,609.0,597,2.0,...,,6,,,5,,,,,
top,57461812.0,,,NYPD,New York City Police Department,Illegal Parking,Loud Music/Party,10466.0,BROOKLYN,,...,,BROOKLYN,,,ONLINE,,,,,
freq,1.0,,,7804209,7804209,2212267,2441866,472536.0,5135539,1057512.0,...,,5354904,,,7206221,,,,,
mean,,2022-11-12 06:57:15.743928576,2022-11-27 13:35:05.623907584,,,,,,,,...,2022-12-02 09:19:43.148033280,,1005877.0,207808.9,,40.73701,-73.92193,20.37774,2022-10-28 02:00:11.309578240,2022-11-12 10:07:15.299705088
min,,2020-01-01 00:00:00,1899-12-31 19:00:00,,,,,,,,...,2019-07-18 00:00:00,,913353.0,-2280004.0,,2.040075e-06,-74.25495,-44602.0,2020-01-01 00:00:00,1899-12-01 00:00:00
25%,,2021-07-05 08:07:52,2021-07-17 01:18:44,,,,,,,,...,2021-07-22 00:00:00,,993819.0,184464.0,,40.67293,-73.96548,0.0,2021-07-01 00:00:00,2021-07-01 00:00:00
50%,,2022-11-15 17:13:05,2022-12-09 13:00:53,,,,,,,,...,2022-12-17 02:52:36,,1005174.0,204976.0,,40.72923,-73.92447,0.0,2022-11-01 00:00:00,2022-12-01 00:00:00
75%,,2024-04-09 07:16:59.249999872,2024-05-02 21:51:45,,,,,,,,...,2024-05-08 15:53:43.249999872,,1019332.0,237273.0,,40.81789,-73.87336,3.0,2024-04-01 00:00:00,2024-05-01 00:00:00
max,,2025-07-18 02:08:31,2033-03-01 00:00:00,,,,,,,,...,2025-09-19 00:00:00,,31130730.0,272089.0,,40.91346,-1.149235e-07,3929.0,2025-07-01 00:00:00,2033-03-01 00:00:00


In [5]:
df.isnull().sum()

unique_key                               0
created_date                             0
closed_date                         325481
agency                                   0
agency_name                              0
complaint_type                           0
descriptor                          158533
incident_zip                        276016
city                                942395
facility_type                     16650453
status                                   0
due_date                          17898150
resolution_action_updated_date      145662
borough                              38435
x_coordinate_state_plane            333137
y_coordinate_state_plane            331877
open_data_channel_type                   0
latitude                            333275
longitude                           333275
days_to_close                       325481
created_month                            0
closed_month                        325481
dtype: int64
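
Raw null counts are hard to compare at a glance. For instance, `due_date` is missing in 17,898,150 of 17,899,464 rows, so it is almost entirely empty. A minimal sketch of expressing missingness as a percentage per column follows; the `toy` frame and its values are made up for illustration and only stand in for `df`.

```python
import pandas as pd

# Toy frame standing in for the complaints data (values are made up).
toy = pd.DataFrame({
    "unique_key": [1, 2, 3, 4],
    "closed_date": pd.to_datetime(["2024-01-02", None, "2024-01-05", None]),
    "due_date": [None, None, None, "2024-02-01"],
})

# isnull().mean() is the fraction of missing values per column;
# multiply by 100 and sort so the emptiest columns come first.
missing_share = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression on `df` would make it easier to decide which columns are too sparse to analyse.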

In [6]:
print("There are", df.duplicated().sum(), "duplicated rows.")
df.drop_duplicates(inplace=True)

There are 0 duplicated rows.


In [7]:
print("There are", df[(df['closed_date']<= df['created_date']) & (df['closed_date'].notnull())].shape[0], 
      "complaints that have a closed date earlier than or equal to the created date.")

df = df[(df['closed_date'] > df['created_date']) | (df['closed_date'].isnull())]

There are 480100 complaints that have a closed date earlier than or equal to the created date.
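
The filter above removes records closed at or before their creation time, which also covers the 1899 placeholder dates seen in the `describe` output. It does not catch the opposite problem: the same output shows a `closed_date` as late as 2033-03-01. Below is a hedged sketch of an additional upper-bound check. The `cutoff` value is a hypothetical extraction date chosen for the toy data, not something the notebook defines.

```python
import pandas as pd

# Toy records (made up): one valid, one 1899 placeholder, one far-future close, one still open.
toy = pd.DataFrame({
    "created_date": pd.to_datetime(
        ["2024-03-01 09:00", "2024-03-02 10:00", "2024-03-03 11:00", "2024-03-04 12:00"]
    ),
    "closed_date": pd.to_datetime(
        ["2024-03-01 15:00", "1899-12-31 19:00", "2033-03-01 00:00", None]
    ),
})

# Assumption: no complaint can be closed after the data was extracted.
cutoff = pd.Timestamp("2024-03-31")

still_open = toy["closed_date"].isna()
closed_after_created = toy["closed_date"] > toy["created_date"]
closed_by_cutoff = toy["closed_date"] <= cutoff

# Keep open complaints and complaints whose closing time is plausible on both ends.
plausible = toy[still_open | (closed_after_created & closed_by_cutoff)]
print(plausible)
```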


In [8]:
cols_to_investigate = ['borough', 'agency_name','status', 'open_data_channel_type']
for col in cols_to_investigate:
    print(f"Unique values in {col}:")
    print(df[col].unique())
    print()

Unique values in borough:
['BRONX' 'QUEENS' 'BROOKLYN' 'MANHATTAN' 'STATEN ISLAND' 'Unspecified'
 None]

Unique values in agency_name:
['Department of Housing Preservation and Development'
 'New York City Police Department' 'Department of Buildings'
 'Department of Environmental Protection'
 'Department of Health and Mental Hygiene'
 'Department of Consumer and Worker Protection' 'Department of Sanitation'
 'Department of Transportation' 'Taxi and Limousine Commission'
 'Department of Parks and Recreation' 'Department of Homeless Services'
 'Economic Development Corporation' 'Department of Education'
 'Office of Technology and Innovation' 'Department for the Aging'
 'Department of Information Technology and Telecommunications'
 "Mayor's Office of Special Enforcement"
 'Operations Unit - Department of Homeless Services' '3-1-1']

Unique values in status:
['Closed' 'Assigned' 'Open' 'Pending' 'In Progress' 'Started'
 'Unspecified' 'Cancel']

Unique values in open_data_channel_type:
['ONL

In [None]:
# check the top 10 incident_zip values by complaint volume
df['incident_zip'] = df['incident_zip'].astype(str).str.zfill(5)
display(df['incident_zip'].value_counts().head(10))

incident_zip
10466    468470
11226    269453
10467    246432
10457    241643
10468    241083
11385    240000
0None    234462
10452    230220
10458    216689
11207    215907
Name: count, dtype: int64
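
The `0None` entry in the output above is a side effect of casting missing ZIP codes to the string `"None"` and then zero-padding it to five characters. A minimal sketch of a null-safe alternative is shown below, assuming ZIP codes arrive as integers or strings; the sample values are made up.

```python
import pandas as pd

# Toy ZIP codes (made up): an int, a short int that needs padding, a missing value, a string.
zips = pd.Series([10466, 501, None, "11226"], dtype="object")

# Zero-pad only where a value is present; where() keeps the original
# (missing) entry wherever isna() is True.
padded = zips.where(zips.isna(), zips.astype(str).str.zfill(5))

print(padded.value_counts(dropna=False))
```

With this approach missing ZIP codes stay missing, so they can be counted with `isna()` or excluded with `dropna()` rather than appearing as a spurious ZIP value.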

### complaint volume by type

In [10]:
# find the top 10 complaint types
top_complaint_types = df['complaint_type'].value_counts().head(10).index.tolist()

condition = (
    df['complaint_type'].isin(top_complaint_types) &
    # strict '<' excludes July 2025, which is only partially covered by the data
    (df['created_date'] < '2025-07-01')
)

compliant_volume_by_month = df[condition].groupby(['created_month', 'complaint_type']).size().reset_index(name='complaint_count')

fig = px.line(
    compliant_volume_by_month, 
    x='created_month', 
    y='complaint_count', 
    color='complaint_type', 
    title='Monthly Complaint Volume by Type (Top 10 Complaint Types)'
)
fig.update_layout(width=1200, height=500)
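
One way to put a number on the seasonality visible in a chart like this is a seasonal index: the average count for each calendar month divided by the overall monthly average for that complaint type. The sketch below uses made-up monthly counts; `seasonal_index` and `peak_month` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Made-up monthly counts for two complaint types (one value per quarter).
monthly = pd.DataFrame({
    "created_month": pd.to_datetime(
        ["2023-01-01", "2023-04-01", "2023-07-01", "2023-10-01"] * 2
    ),
    "complaint_type": ["HEAT/HOT WATER"] * 4 + ["Noise - Residential"] * 4,
    "complaint_count": [300, 100, 0, 200, 50, 100, 150, 100],
})

monthly["month_of_year"] = monthly["created_month"].dt.month

# Average count per calendar month and complaint type.
profile = monthly.pivot_table(
    index="month_of_year",
    columns="complaint_type",
    values="complaint_count",
    aggfunc="mean",
)

# Values above 1 mean the month is busier than that type's average month.
seasonal_index = profile / profile.mean()
peak_month = seasonal_index.idxmax()
print(seasonal_index)
print(peak_month)
```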

### complaint volume by borough

In [11]:
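condition = (
    # strict '<' excludes July 2025, which is only partially covered by the data
    (df['created_date'] < '2025-07-01') &
    # isin() already drops missing and 'Unspecified' boroughs
    (df['borough'].isin(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND']))
)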

compliant_volume_by_borough = df[condition].groupby(['created_month', 'borough']).agg({'unique_key': 'count'}).reset_index()

compliant_volume_by_borough.rename(columns={'unique_key': 'complaint_count'}, inplace=True)

fig = px.line(
    compliant_volume_by_borough, 
    x='created_month', 
    y='complaint_count', 
    color='borough', 
    title='Monthly Complaint Volume by Borough'
)
fig.update_layout(width=1200, height=500)

In [12]:
# Create a mapping of zip codes to boroughs
# If a zip code has multiple boroughs, keep the one with the highest complaint count
zip_borough_map = (
    df[df['borough'].notna() & (df['borough'] != 'Unspecified')]
    .groupby(['incident_zip', 'borough'])
    .size()
    .reset_index(name='complaint_count')
    .sort_values(['incident_zip', 'complaint_count'], ascending=[True, False])
    .drop_duplicates(subset=['incident_zip'], keep='first')
    .reset_index(drop=True)
)[['incident_zip', 'borough']]
 
# Use the zip_borough_map to merge with the nyc_population data to get borough populations
nyc_pop['zip'] = nyc_pop['zip'].astype(str).str.zfill(5)
merged_pop = nyc_pop.merge(zip_borough_map, left_on='zip', right_on = 'incident_zip', how='inner', suffixes=('', '_borough'))
borough_pop = merged_pop.groupby('borough')['population'].sum().reset_index()
 
# Find the complaint volume per 1000 residents for each borough
compliant_volume_by_borough = compliant_volume_by_borough.merge(borough_pop, on='borough', how='left')
compliant_volume_by_borough['compliant_per_capita_1000'] = compliant_volume_by_borough['complaint_count'] / compliant_volume_by_borough['population'] * 1000
fig = px.line(
    compliant_volume_by_borough, 
    x='created_month', 
    y='compliant_per_capita_1000', 
    color='borough', 
    title='Monthly Complaint Volume per 1000 Residents by Borough'
    )
fig.update_layout(width=1200, height=500)
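
To make the per-capita logic easier to follow, here is a small worked example on made-up data. It mirrors the two steps above (assign each ZIP code to its most frequent borough, then divide complaint counts by the summed population) and shows how two boroughs with the same raw volume can rank differently per 1,000 residents. All values and the names `zip_to_borough` and `per_1000` are illustrative.

```python
import pandas as pd

# Made-up complaints: ZIP 10466 is mostly BRONX but has one BROOKLYN-labelled record.
complaints = pd.DataFrame({
    "incident_zip": ["10466", "10466", "10466", "11226", "11226", "10466"],
    "borough": ["BRONX", "BRONX", "BROOKLYN", "BROOKLYN", "BROOKLYN", "BRONX"],
})
# Made-up population per ZIP code.
population = pd.DataFrame({"zip": ["10466", "11226"], "population": [2000, 6000]})

# Step 1: keep the most frequent borough for each ZIP code.
zip_to_borough = (
    complaints.groupby(["incident_zip", "borough"])
    .size()
    .reset_index(name="n")
    .sort_values(["incident_zip", "n"], ascending=[True, False])
    .drop_duplicates("incident_zip")[["incident_zip", "borough"]]
)

# Step 2: sum ZIP-level population up to borough level.
borough_pop = (
    population.merge(zip_to_borough, left_on="zip", right_on="incident_zip")
    .groupby("borough")["population"]
    .sum()
)

# Step 3: complaints per 1,000 residents.
counts = complaints.groupby("borough").size()
per_1000 = counts / borough_pop * 1000
print(per_1000)
```

In this toy case both boroughs have three complaints, yet the smaller population gives the Bronx a higher rate.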

In [13]:
# Find top 5 complaint types in each borough
top5_by_borough = (
    df.groupby(['borough', 'complaint_type'])
    .size()
    .reset_index(name='complaint_count')
    .sort_values(['borough', 'complaint_count'], ascending=[True, False])
    .groupby('borough')
    .head(5)
)

# Display the result
display(top5_by_borough)

|     | borough  | complaint_type                      | complaint_count |
|-----|----------|-------------------------------------|-----------------|
| 135 | BRONX    | Noise - Residential                 | 747249          |
| 86  | BRONX    | HEAT/HOT WATER                      | 461636          |
| 102 | BRONX    | Illegal Parking                     | 347848          |
| 136 | BRONX    | Noise - Street/Sidewalk             | 273876          |
| 215 | BRONX    | UNSANITARY CONDITION                | 179426          |
| 348 | BROOKLYN | Illegal Parking                     | 821469          |
| 383 | BROOKLYN | Noise - Residential                 | 502475          |
| 331 | BROOKLYN | HEAT/HOT WATER                      | 342461          |
| 258 | BROOKLYN | Blocked Driveway                    | 314393          |
| 417 | BROOKLYN | Request Large Bulky Item Collection | 236861          |


### resolution time by borough

In [14]:
condition = (

    (df['borough'].isin(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND'])) &
    (df['status']== 'Closed')
)

resolution_time = df[condition].groupby('borough').agg({'days_to_close': 'mean'}).reset_index().sort_values(by='days_to_close', ascending=False)

fig = px.bar(
    resolution_time,
    x='borough',
    y='days_to_close',
    title='Average Resolution Time by Borough (in days)',
)
fig.update_layout(width=1200, height=500)
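
The `describe` output earlier shows that `days_to_close` is heavily skewed (mean of about 20 days, median of 0, maximum of 3929), so a mean per borough can be dominated by a few very old cases. A hedged sketch of reporting the median and 90th percentile next to the mean is shown below, using made-up values.

```python
import pandas as pd

# Made-up resolution times: one extreme outlier in the BRONX rows.
toy = pd.DataFrame({
    "borough": ["BRONX"] * 4 + ["MANHATTAN"] * 4,
    "days_to_close": [0, 1, 2, 397, 3, 4, 5, 8],
})

# Named aggregation: mean, median and 90th percentile per borough.
summary = (
    toy.groupby("borough")["days_to_close"]
    .agg(mean="mean", median="median", p90=lambda s: s.quantile(0.9))
    .sort_values("median", ascending=False)
)
print(summary)
```

Here the outlier pushes the Bronx mean far above Manhattan's, while the median ranks Manhattan as slower. Which statistic is appropriate depends on whether the long tail is of interest.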


### resolution time by complaint type

In [15]:
condition = (

    (df['borough'].isin(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND'])) &
    (df['status']== 'Closed') &
    df['complaint_type'].isin(top_complaint_types)
)

resolution_time = df[condition].groupby('complaint_type').agg({'days_to_close': 'mean'}).reset_index().sort_values(by='days_to_close', ascending=False)

fig = px.bar(
    resolution_time,
    x='complaint_type',
    y='days_to_close',
    title='Average Resolution Time by Complaint Type (in days)',
)
fig.update_layout(width=1200, height=500)

In [10]:
condition = (
    (df['borough'].isin(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND'])) &
    (df['status']== 'Closed') &
    # restrict to the complaint types with longer resolution times
    (df['complaint_type'].isin(['UNSANITARY CONDITION','Street Condition','Request Large Bulky Item Collection','HEAT/HOT WATER']))
)

resolution_time = df[condition].groupby(['borough']).agg({'days_to_close': 'mean'}).reset_index().sort_values(by='days_to_close', ascending=False)

fig = px.bar(
    resolution_time,
    x='borough',
    y='days_to_close',
    title='Average Resolution Time by Borough for Slow-to-Resolve Complaint Types (in days)',
)
fig.update_layout(width=1200, height=500)
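
The cell above hard-codes four complaint types. A possible alternative is to derive the list from a threshold on the average resolution time. The sketch below uses 1.5 days, the figure quoted in the highlights. Treating that threshold as the selection rule is an assumption, and the toy values are made up.

```python
import pandas as pd

# Made-up resolution times for three complaint types in two boroughs.
toy = pd.DataFrame({
    "complaint_type": [
        "HEAT/HOT WATER", "HEAT/HOT WATER",
        "Illegal Parking", "Illegal Parking",
        "Street Condition", "Street Condition",
    ],
    "borough": ["BRONX", "MANHATTAN"] * 3,
    "days_to_close": [2.0, 4.0, 0.1, 0.3, 5.0, 9.0],
})

THRESHOLD_DAYS = 1.5  # assumed cut-off, taken from the highlights section

# Complaint types whose average resolution time exceeds the threshold.
mean_by_type = toy.groupby("complaint_type")["days_to_close"].mean()
slow_types = mean_by_type[mean_by_type > THRESHOLD_DAYS].index.tolist()

# Average resolution time per borough, restricted to the slow types.
slow_by_borough = (
    toy[toy["complaint_type"].isin(slow_types)]
    .groupby("borough")["days_to_close"]
    .mean()
    .sort_values(ascending=False)
)
print(slow_types)
print(slow_by_borough)
```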

### complaint volume by day of week and time of day

In [None]:
df['day_of_week'] = df['created_date'].dt.day_name()
df['hour_of_day'] = df['created_date'].dt.hour

heatmap = df.groupby(['day_of_week', 'hour_of_day']).size().reset_index(name='complaint_count')

days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
heatmap['day_of_week'] = pd.Categorical(heatmap['day_of_week'], categories=days_order, ordered=True)

fig = px.density_heatmap(
    heatmap,
    x='hour_of_day',
    y='day_of_week',
    z='complaint_count',
    color_continuous_scale='Viridis',
    title='Complaint Volume by Day of Week and Time of Day'
)
fig.update_layout(
    xaxis_title='Hour of Day',
    yaxis_title='Day of Week',
    width=1200,
    height=600
)
fig.update_yaxes(categoryorder='array', categoryarray=days_order)
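
The heatmap shows raw counts, which makes weekdays and weekends hard to compare because their totals differ. A minimal sketch of comparing the hourly distribution instead is given below: each hour's share of its group total, plus the share falling in the 10 AM–2 PM window (hours 10 to 13). The timestamps are made up, and `day_group` and `is_midday` are illustrative column names.

```python
import pandas as pd

# Made-up creation timestamps.
toy = pd.DataFrame({
    "created_date": pd.to_datetime([
        "2024-06-03 10:15",  # Monday
        "2024-06-03 11:30",  # Monday
        "2024-06-04 13:00",  # Tuesday
        "2024-06-05 20:00",  # Wednesday
        "2024-06-08 02:00",  # Saturday
        "2024-06-08 14:00",  # Saturday
        "2024-06-09 09:00",  # Sunday
        "2024-06-09 22:00",  # Sunday
    ])
})

toy["hour_of_day"] = toy["created_date"].dt.hour
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
toy["day_group"] = toy["created_date"].dt.dayofweek.ge(5).map(
    {True: "weekend", False: "weekday"}
)

# Share of each group's complaints per hour (each row sums to 1).
hourly_share = pd.crosstab(toy["day_group"], toy["hour_of_day"], normalize="index")

# Share of complaints created between 10:00 and 13:59.
toy["is_midday"] = toy["hour_of_day"].between(10, 13)
midday_share = toy.groupby("day_group")["is_midday"].mean()
print(hourly_share)
print(midday_share)
```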

### complaint volume by status

In [20]:
# Complaints by status and borough
condition =  (
    (df['created_date'] >= '2025-07-01')  & 
    (df['status'].notnull()) &
    (df['borough'].isin(['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND']))
)

status_borough_counts = (
    df[condition]
    .groupby(['borough', 'status'])
    .size()
    .reset_index(name='complaint_count')
)

fig = px.bar(
    status_borough_counts,
    x='borough',
    y='complaint_count',
    color='status',
    barmode='stack',
    title='Month-to-date Complaint Volume by Status and Borough',
    text='complaint_count'
)
fig.update_layout(width=1200, height=500)

### complaint volume by agency

In [21]:
agency_counts = df['agency_name'].value_counts().reset_index()
agency_counts.columns = ['agency_name', 'complaint_count']

fig = px.bar(
    agency_counts,
    x='agency_name',
    y='complaint_count',
    title='Complaint Volume by Agency',
    text='complaint_count'
)
fig.update_layout(width=1200, height=500)

### complaint volume by open channel

In [24]:
condition =  (
    (df['created_date'] < '2025-07-01') &
    (df['open_data_channel_type'].notnull())
)

vol_by_channel = (
    df[condition]
    .groupby(['created_month', 'open_data_channel_type'])
    .size()
    .reset_index(name='complaint_count')
)

fig = px.bar(
    vol_by_channel,
    x='created_month',
    y='complaint_count',
    color='open_data_channel_type',
    barmode='stack',
    title='Complaint Volume by Open Channel',
)
fig.update_layout(width=1200, height=500)

### resolution time by open channel

In [28]:
condition =  (
    (df['created_date'] < '2025-07-01') &
    (df['open_data_channel_type'].notnull()) &
    (df['open_data_channel_type'].isin(['MOBILE', 'ONLINE', 'PHONE']))
)


res_time_by_channel = (
    df[condition]
    .groupby(['created_month', 'open_data_channel_type'])
    .agg({'days_to_close': 'mean'})
    .reset_index()
)

fig = px.line(
    res_time_by_channel,
    x='created_month',
    y='days_to_close',
    color='open_data_channel_type',
    title='Average Resolution Time by Open Channel',
)
fig.update_layout(width=1200, height=500)