# Introduction

In this project, I analyze data on 911 calls, look for patterns, and test the main hypothesis:

Is it true that people are crazier on a full moon (they call 911 more often)?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

Let's first look at the data we are dealing with and check whether we need to handle missing values or change the data type of any column.

In [None]:
df = pd.read_csv('../input/montcoalert/911.csv')

df.info()

In [None]:
df.tail()

In [None]:
df = df.astype({"zip": "Int64"})
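
The cast above uses pandas' nullable integer dtype `Int64` (capital "I"). Presumably this is because the `zip` column contains missing values, which a plain `int64` column cannot hold. A minimal sketch on a toy frame (the values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'zip' column; the values are hypothetical.
toy = pd.DataFrame({"zip": [19401.0, np.nan, 19446.0]})

# Count the missing values in each column.
print(toy.isna().sum())

# The nullable integer dtype keeps the gap as <NA> instead of forcing floats.
toy = toy.astype({"zip": "Int64"})
print(toy["zip"])
```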

### Main questions:

#### Which ZIP codes call 911 most often in this Pennsylvania region?

In [None]:
df['zip'].value_counts().head()

#### From which localities is 911 called most often?

In [None]:
df['twp'].value_counts().head()

Let's create a new column with the reason for the call (the part of the title before the colon) and find the most common reason.

In [None]:
df['reason'] = df['title'].apply(lambda s:s.split(':')[0])
df['reason'].head()
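
The same split can also be written with the vectorized `.str` accessor instead of `apply`. A small sketch, assuming titles follow a "Reason: details" pattern (the sample titles below are made up for illustration):

```python
import pandas as pd

# Hypothetical titles in the "Reason: details" format.
titles = pd.Series([
    "EMS: BACK PAINS/INJURY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT -",
])

# Split on the colon and keep the first piece as the reason.
reason = titles.str.split(":").str[0]
print(reason.value_counts())
```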

In [None]:
df['reason'].value_counts()

In [None]:
sns.countplot(x='reason', data=df,palette="Dark2")

In [None]:
# Parse the whole column at once instead of row by row
df['dt'] = pd.to_datetime(df['timeStamp'])

In [None]:
df['dt']

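Since the data for 2020 covers only part of the year, we drop it so that it does not skew the analysis. The filter below keeps calls up to 10 December 2019.

In [None]:
# .copy() avoids SettingWithCopyWarning when columns are added to the filtered frame later
df = df[df['dt'] <= datetime.datetime(2019, 12, 10)].copy()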

**Let's add some new columns for further analysis**

In [None]:
# The .dt accessor extracts datetime components without a Python-level loop
df['hour'] = df['dt'].dt.hour
df['month'] = df['dt'].dt.month
df['weekday'] = df['dt'].dt.dayofweek

In [None]:
df['weekday'].unique()

Note that the weekdays are numbered from 0 to 6, where **0 is Monday and 6 is Sunday**.

Use .map() with this dictionary to convert the weekday numbers to their names:

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

In [None]:
# .map() looks each weekday number up in dmap (and avoids shadowing the built-in int)
df['weekday'] = df['weekday'].map(dmap)

Let's build a count plot of the calls by day of the week and reason for the call.

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(x='weekday', hue='reason', data=df,palette="Dark2")
plt.legend(bbox_to_anchor=(1,1))

Fewer accidents are reported on Saturday and Sunday. A plausible explanation is that on weekends people tend to stay at home or leave town rather than drive around the city.

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(x='month', hue='reason', data=df,palette="Dark2")
plt.legend(bbox_to_anchor=(1,1))

In general, the number of EMS and fire calls stays roughly constant throughout the year, while accidents happen more often in winter, probably because of bad weather.

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(x='month', data=df,palette="Dark2")

After Christmas and the New Year, people calm down and the number of calls drops by about 10%.

Let's look at a heatmap built with seaborn from our data. First, we need to restructure the DataFrame so that the columns are the hours and the index is the day of the week.

In [None]:
dw_h_agg = df.pivot_table(index='weekday', columns='hour', values='e', aggfunc='count')
dw_h_agg = dw_h_agg.loc[['Mon','Tue','Wed','Thu','Fri','Sat','Sun']]
dw_h_agg
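
To make the restructuring concrete, here is a minimal sketch of the same pivot on a toy frame (the rows are made up). It uses `reindex` rather than `.loc` for the ordering, which is an assumption on my part: `reindex` also adds weekdays that never occur in the data, whereas `.loc` with a list raises a `KeyError` for them.

```python
import pandas as pd

# Toy call log: one row per call, 'e' is a dummy column to count.
toy = pd.DataFrame({
    "weekday": ["Mon", "Mon", "Tue", "Sun", "Sun", "Sun"],
    "hour": [9, 9, 14, 9, 23, 23],
    "e": [1] * 6,
})
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Rows become weekdays, columns become hours, cells hold call counts.
table = toy.pivot_table(index="weekday", columns="hour", values="e",
                        aggfunc="count", fill_value=0)

# Put the rows in calendar order; weekdays with no calls become rows of zeros.
table = table.reindex(order, fill_value=0)
print(table)
```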

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(dw_h_agg,cmap= 'coolwarm')

It is easy to see that calls peak in the middle of the day on weekdays.

In [None]:
# Keep only the calendar date of each call
df['dt'] = pd.to_datetime(df['timeStamp']).dt.date

In [None]:
# Keep only calls up to 10 December 2016 for the full-moon analysis
df = df[df['dt'] <= datetime.date(2016, 12, 10)]

In [None]:
df.tail()

Let's create a new DataFrame with the data grouped by day.

In [None]:
df_aggregation = df.groupby('dt').count()

In [None]:
df_aggregation

This is what the daily number of calls looks like for roughly 2016:

In [None]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(20,8))
sns.lineplot(x = df_aggregation.index, y= 'e', data=df_aggregation)

Let's get rid of the outliers:

In [None]:
df_aggregation = df_aggregation[df_aggregation['e'] < 700]
plt.figure(figsize=(20,8))
sns.lineplot(x = df_aggregation.index, y= 'e', data=df_aggregation)
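
The threshold of 700 above was picked by eye from the plot. As an alternative, the cutoff can be derived from the data itself. A minimal sketch using the common 1.5 × IQR rule on made-up daily counts (this rule is a swapped-in technique, not what the notebook does):

```python
import pandas as pd

# Hypothetical daily call counts with one obvious spike.
counts = pd.Series([380, 400, 410, 395, 405, 390, 1500])

# Interquartile range and the usual upper fence.
q1, q3 = counts.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Keep only the days at or below the fence.
filtered = counts[counts <= upper]
print(upper, len(filtered))
```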

In [None]:
!pip install fullmoon

In [None]:
from fullmoon import NextFullMoon, IsFullMoon

Let's create a list of all the full moons for 2016:

In [None]:
n = NextFullMoon()
n.set_origin_date_string('2015-11-11')
full_moons = []
while True:
    next_full_moon = n.next_full_moon()
    if next_full_moon >= datetime.datetime(2016, 12, 10):
        break
    full_moons.append(next_full_moon.date())

In [None]:
fig, ax1 = plt.subplots(figsize=(20,8))

for fm in full_moons:
    ax1.axvline(x=fm, color='black')
# Use the 'e' column for the daily call count, as in the other plots
ax1.plot(df_aggregation['e'], color='red')

As an intermediate conclusion, no correlation **is visible** in the plot.

In [None]:
df_aggregation2 = df.groupby('dt').count()

In [None]:
df_aggregation2.reset_index(inplace=True)

In [None]:
df_aggregation2['is_full_moon'] = df_aggregation2['dt'].apply(lambda x: x in full_moons)

In [None]:
df_aggregation2[df_aggregation2['is_full_moon'] == True]['e'].mean()

In [None]:
df_aggregation2[df_aggregation2['is_full_moon'] == False]['e'].mean()
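
The two means above can also be obtained in one step by grouping on the flag. A small sketch on made-up daily counts (the dates and numbers are hypothetical):

```python
import datetime
import pandas as pd

# Hypothetical daily counts; 'e' plays the role of the call count.
daily = pd.DataFrame({
    "dt": [datetime.date(2016, 1, 22), datetime.date(2016, 1, 23),
           datetime.date(2016, 1, 24), datetime.date(2016, 1, 25)],
    "e": [400, 420, 440, 380],
})
full_moons = [datetime.date(2016, 1, 24)]  # assumed full-moon date

# Flag full-moon days, then average the counts within each group.
daily["is_full_moon"] = daily["dt"].isin(full_moons)
means = daily.groupby("is_full_moon")["e"].mean()
print(means)
```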

To test the hypothesis, we use Student's t-test for two independent samples. We have two hypotheses:

1) H0: the mean number of daily 911 calls is the same on full-moon days and on other days

2) H1: the mean number of daily 911 calls differs between full-moon days and other days

In [None]:
from scipy import stats as st
alpha = 0.05

In [None]:
df_true = df_aggregation2[df_aggregation2['is_full_moon'] == True]['e']
df_false = df_aggregation2[df_aggregation2['is_full_moon'] == False]['e']

In [None]:
results = st.ttest_ind(df_true, df_false)

print('p-value:', results.pvalue)

if results.pvalue < alpha:
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

### What global conclusions can we draw?

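This small study found no notable relationship between the full moon and the number of 911 calls: the difference between the daily averages is within the margin of error.

Both the visual analysis and Student's t-test are consistent with this conclusion, although a test that fails to reject the null hypothesis does not strictly prove that there is no effect.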