## Workflow EDA
### A basic recipe

Let's assume we got a new data set (_unknown_ distribution, data types, values etc.), but we are familiar with the related domain (general knowledge, industry, area of expertise).
Our goal is an exploratory description of the data set.
How would you start this assignment, and what steps would you apply?

### Task 1: Stop the video to think about your own approach. Maybe make a sketch that outlines the steps and the information each of them requires.

_spacing for anti-spoiler_
  



  

1. From large to small:
 - Overall data set and origins > variables > values
 - Domain, collection details (time, synthetic, survey)
 - Degree of anonymization (e.g. tokenization)

Today: Emergency - 911 Calls data set
https://www.kaggle.com/datasets/mchirico/montcoalert

2. What matters (to me?)
- Provenance and Licence (✅ checked at link above)
- Context of the data (✅ Montgomery County, PA)
  - Domain: Understanding emergency calls

3. Exploring the data:
  - Number of variables and their types
    - Interpretation of these
  - Statistics 101 (individual variable, univariate):
    - Mean, Median, Mode
    - Standard Deviation, Range, Interquartile Range
  - Writing a first summary

Another good moment to stop the video to gather these commands :-)
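For reference, the univariate metrics listed under step 3 map onto pandas methods roughly as follows. This is a minimal sketch on hypothetical toy values (the series `values` is not part of the 911 data set); it only illustrates which call belongs to which metric.

```python
import pandas as pd

# Hypothetical toy data, not taken from the 911 data set
values = pd.Series([2, 4, 4, 5, 7, 9, 11], name="toy")

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),   # mode() returns a Series; ties yield several values
    "std": values.std(),              # sample standard deviation (ddof=1 by default)
    "range": values.max() - values.min(),
    "iqr": values.quantile(0.75) - values.quantile(0.25),
}
print(summary)
```

Whether each of these numbers is meaningful depends on the variable: a metric can always be computed for a numeric column, but it only helps if the column is a real quantity rather than a label.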

Step-by-step:

How many variables are in the 911 emergency dataset? What data types are given?

In [4]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Loading the csv dataset in a dataframe
df = pd.read_csv('911.csv')

# Actual answer to the question
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 663522 entries, 0 to 663521
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   lat        663522 non-null  float64
 1   lng        663522 non-null  float64
 2   desc       663522 non-null  object 
 3   zip        583323 non-null  float64
 4   title      663522 non-null  object 
 5   timeStamp  663522 non-null  object 
 6   twp        663229 non-null  object 
 7   addr       663522 non-null  object 
 8   e          663522 non-null  int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 45.6+ MB
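
The `Non-Null Count` column above already hints at missing values in `zip` and `twp`. A minimal sketch of how such gaps could be counted directly, using a small hypothetical frame (`toy`) that only mimics some column names of the 911 data set:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics column names of the 911 data set
toy = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "twp": ["NORRISTOWN", "LOWER POTTSGROVE", None, "HATFIELD TOWNSHIP"],
    "e": [1, 1, 1, 1],
})

missing_counts = toy.isna().sum()            # missing values per column
missing_share = toy.isna().mean().round(2)   # share of missing values per column
print(missing_counts)
print(missing_share)
```

Applied to the real data, the same arithmetic can be read off the `df.info()` output: 663522 - 583323 = 80199 missing `zip` values and 663522 - 663229 = 293 missing `twp` values.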


What information can you infer about these variables and their names?
(Not many numerical values, What is e?, overall dataset about ...?)

Retrieve the first _n_ rows (e.g. the first five or ten) of the data frame.

In [14]:
df.head(10)

Unnamed: 0,lat,lng,desc,zip,title,timeStamp,twp,addr,e
0,40.297876,-75.581294,REINDEER CT & DEAD END; NEW HANOVER; Station ...,19525.0,EMS: BACK PAINS/INJURY,2015-12-10 17:10:52,NEW HANOVER,REINDEER CT & DEAD END,1
1,40.258061,-75.26468,BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP...,19446.0,EMS: DIABETIC EMERGENCY,2015-12-10 17:29:21,HATFIELD TOWNSHIP,BRIAR PATH & WHITEMARSH LN,1
2,40.121182,-75.351975,HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St...,19401.0,Fire: GAS-ODOR/LEAK,2015-12-10 14:39:21,NORRISTOWN,HAWS AVE,1
3,40.116153,-75.343513,AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;...,19401.0,EMS: CARDIAC EMERGENCY,2015-12-10 16:47:36,NORRISTOWN,AIRY ST & SWEDE ST,1
4,40.251492,-75.60335,CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S...,,EMS: DIZZINESS,2015-12-10 16:56:52,LOWER POTTSGROVE,CHERRYWOOD CT & DEAD END,1
5,40.253473,-75.283245,CANNON AVE & W 9TH ST; LANSDALE; Station 345;...,19446.0,EMS: HEAD INJURY,2015-12-10 15:39:04,LANSDALE,CANNON AVE & W 9TH ST,1
6,40.182111,-75.127795,LAUREL AVE & OAKDALE AVE; HORSHAM; Station 35...,19044.0,EMS: NAUSEA/VOMITING,2015-12-10 16:46:48,HORSHAM,LAUREL AVE & OAKDALE AVE,1
7,40.217286,-75.405182,COLLEGEVILLE RD & LYWISKI RD; SKIPPACK; Stati...,19426.0,EMS: RESPIRATORY EMERGENCY,2015-12-10 16:17:05,SKIPPACK,COLLEGEVILLE RD & LYWISKI RD,1
8,40.289027,-75.39959,MAIN ST & OLD SUMNEYTOWN PIKE; LOWER SALFORD;...,19438.0,EMS: SYNCOPAL EPISODE,2015-12-10 16:51:42,LOWER SALFORD,MAIN ST & OLD SUMNEYTOWN PIKE,1
9,40.102398,-75.291458,BLUEROUTE & RAMP I476 NB TO CHEMICAL RD; PLYM...,19462.0,Traffic: VEHICLE ACCIDENT -,2015-12-10 17:35:41,PLYMOUTH,BLUEROUTE & RAMP I476 NB TO CHEMICAL RD,1


Same question again: What information can you infer _now_ about these variables and their names?  
_(EMS - emergency medical service, Stations yes and no, not every zip given, stable factor e?, ...)_

Now, on to the Statistics 101 part:
Various functions are available to explore and calculate single metrics.

E.g. the mean:

In [15]:
zip_mean = df["zip"].mean()
round(zip_mean,1)

19236.1

What does this mean... _mean_?
(In other words: What does the mean of a ZIP code relate to?)
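
One way to avoid this trap is to treat zip codes as labels rather than quantities. The following is a sketch under assumptions: the toy series `zips` is hypothetical, and padding to five digits assumes US-style zip codes whose leading zeros were lost when the column was read as a float.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the float dtype mirrors how pandas reads a zip column containing NaN
zips = pd.Series([19401.0, 19401.0, 19446.0, np.nan, 8502.0])

# Treat the codes as labels: nullable integer first (drops the ".0"),
# then strings padded to five digits (assumption: US-style zip codes)
zip_labels = zips.astype("Int64").astype("string").str.zfill(5)

print(zip_labels.tolist())
print(zip_labels.mode().tolist())   # the most frequent label is a meaningful "typical" zip
```

For a label-like variable, the mode and frequency counts answer sensible questions ("Where do most calls come from?"), whereas the mean does not correspond to any real place.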

In [16]:
# With similar caution, an overall description of our data set can be derived:
df.describe()

Unnamed: 0,lat,lng,zip,e
count,663522.0,663522.0,583323.0,663522.0
mean,40.158162,-75.300105,19236.055791,1.0
std,0.220641,1.672884,298.222637,0.0
min,0.0,-119.698206,1104.0,1.0
25%,40.100344,-75.392735,19038.0,1.0
50%,40.143927,-75.305143,19401.0,1.0
75%,40.229008,-75.211865,19446.0,1.0
max,51.33539,87.854975,77316.0,1.0


The same logic applies to the average of the latitude and longitude variables: While you might get an average value, it possibly does not represent the desired feature (e.g. "Average place of 911 calls").
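
The `min` and `max` rows of `describe()` (a latitude of 0.0, longitudes between roughly -119.7 and 87.9) also suggest that a few records lie far outside the county, which distorts the mean further. A minimal sketch of two possible reactions, using hypothetical coordinates and an assumed plausibility window whose bounds are illustrative only, not official county borders:

```python
import pandas as pd

# Hypothetical coordinates: three plausible points plus one obviously wrong record
coords = pd.DataFrame({
    "lat": [40.12, 40.25, 40.18, 0.0],
    "lng": [-75.35, -75.26, -75.13, -119.70],
})

# Assumed plausibility window; the bounds are illustrative, not official county borders
in_window = coords["lat"].between(39.5, 41.0) & coords["lng"].between(-76.5, -74.5)

print(coords.mean())              # pulled toward the outlier
print(coords.median())            # more robust against the outlier
print(coords[in_window].mean())   # mean after dropping implausible records
```

Even after such cleaning, an "average location" remains a summary of two separate columns and should be interpreted with care.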

In [17]:
# Let's continue with descriptive stats
# What are the ten most frequent zip codes (postal codes) in the data set?

df["zip"].value_counts().head(10)

19401.0    45606
19464.0    43910
19403.0    34888
19446.0    32270
19406.0    22464
19002.0    21070
19468.0    18939
19046.0    17886
19454.0    17661
19090.0    17377
Name: zip, dtype: int64

In [18]:
# ... and the same for the ten least frequent zip codes:
df["zip"].value_counts().tail(10)

18040.0    1
18102.0    1
18080.0    1
17901.0    1
18051.0    1
77316.0    1
19134.0    1
19135.0    1
8502.0     1
18938.0    1
Name: zip, dtype: int64

In [19]:
# How many unique zip codes in total?

df["zip"].nunique()

204

In [20]:
# How many unique titles in total?

df["title"].nunique()

148
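
The titles in the `head()` output follow a `Category: reason` pattern, so a coarser variable can be derived from them. A minimal sketch, using four titles copied from that output; the names `category` and `reason` are hypothetical and not columns of the data set:

```python
import pandas as pd

# Titles copied from the head() output above
titles = pd.Series([
    "EMS: BACK PAINS/INJURY",
    "EMS: DIABETIC EMERGENCY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT -",
])

# Split once at the first colon: the prefix becomes the category, the rest the reason
parts = titles.str.split(":", n=1, expand=True)
category = parts[0].str.strip()
reason = parts[1].str.strip()

print(category.value_counts())
```

On the full data frame, the same split could be applied to `df["title"]`, assuming every title contains a colon; that assumption should be checked before relying on the result.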

In [21]:
# When were the first and last 911 calls in the data set?

df["timeStamp"].min()

'2015-12-10 14:39:21'

In [22]:
df["timeStamp"].max()

'2020-07-29 15:54:08'
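
Note that `timeStamp` has dtype `object`, so `min()` and `max()` compare strings here. That works because the `YYYY-MM-DD HH:MM:SS` format sorts chronologically, but date arithmetic needs real timestamps. A minimal sketch on a hypothetical three-row sample (the first and last strings are the values shown above, the middle one is made up):

```python
import pandas as pd

# Hypothetical sample; first and last entries are the min/max values shown above
stamps = pd.Series(["2015-12-10 14:39:21", "2018-03-02 08:15:00", "2020-07-29 15:54:08"])

parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")
span = parsed.max() - parsed.min()   # a Timedelta

print(parsed.dtype)
print(span.days)                     # number of whole days covered
```

Once parsed, the `.dt` accessor gives access to components such as hour or weekday, which is useful for later time-based questions.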

### First draft of a management summary:  
Between **late 2015** and **mid 2020**, **Montgomery County** in Pennsylvania recorded **663,522** emergency calls with **148** unique call titles, originating from **204** zip codes.
An initial EDA showed patterns in the titles of the calls (primarily **EMS**, **Traffic** and **Fire**). Further _variables_ for statistical analysis need to be derived from the data set, as the existing numerical variables (long., lat.) are not suitable for a simple analysis of dispersion or location.

Further avenues to analyze:  
- Retrieve the reasons for calls (there are far more calls than titles: 148 titles for >663k calls)
- Plot number of calls per zip code ("Is there a particular node of emergency?")
- Plot timeseries; frequency per day or _time-unit_ ("Is there a day-night difference?")
- Decide on handling the missing zip codes (80,199 calls)
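
As a starting point for these avenues, the counting behind the plots could look like the sketch below. It uses hypothetical sample rows with the same column names as the data set; labelling missing zip codes as `"unknown"` is only one possible handling, chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample rows with the same column names as the 911 data set
calls = pd.DataFrame({
    "timeStamp": ["2015-12-10 17:10:52", "2015-12-10 17:29:21",
                  "2015-12-11 02:05:00", "2015-12-11 14:39:21"],
    "zip": [19525.0, 19446.0, None, 19401.0],
})
calls["timeStamp"] = pd.to_datetime(calls["timeStamp"])

# Frequency per day and per hour of day
calls_per_day = calls.groupby(calls["timeStamp"].dt.date).size()
calls_per_hour = calls.groupby(calls["timeStamp"].dt.hour).size()

# One possible handling of missing zip codes: keep the rows, but label them explicitly
zip_label = calls["zip"].astype("Int64").astype("string").fillna("unknown")
calls_per_zip = zip_label.value_counts()

print(calls_per_day)
print(calls_per_hour)
print(calls_per_zip)
```

Each resulting Series can be passed to `.plot()` (which requires matplotlib) to obtain the time series or the per-zip bar chart.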