Fields

* lat: String variable, Latitude
* lng: String variable, Longitude
* desc: String variable, Description of the Emergency Call
* zip: String variable, Zipcode
* title: String variable, Title
* timeStamp: String variable, YYYY-MM-DD HH:MM:SS
* twp: String variable, Township
* addr: String variable, Address
* e: String variable, Dummy variable (always 1)

# Aim:

To analyse the data to find patterns and trends. Use this data to create models to better predict the different unknown features. Follow this by finding solutions to real-life problems based on given data.

# Abstract:

**What is Data Science?**
Making Intelligent Conclusions from a large number of features.

Welcome to the 911 Calls database project. As one might know, 911 is the emergency call number in the United States. In case of any emergency, be it traffic, EMS or fire, the Emergency Services (911 Services) respond. They also keep a record of the information regarding each call. At first sight this might seem to be simply a list of all calls, but with the help of statistics, basic machine learning and data science we can come to many intelligent conclusions that may not even be directly connected to the data we have. That is the beauty of Data Science. This Notebook aims to present the simple and the complex conclusions we can reach using just the data described above.

<h1 style='font-size:50px;text-align:center;color:#555'><br><br>Part I<br><br>Data Acquisition and Cleaning<br><br><br><br></h1>

# Library Imports

In [None]:
import numpy as np #for mathematical manipulation
import pandas as pd #for database manipulation
import matplotlib.pyplot as plt #for plotting
import seaborn as sns #better plotting library
%matplotlib inline

## Data 

In [None]:
data=pd.read_csv('../input/911.csv') #read data from csv

In [None]:
data.head()

This is the data we have at hand. As we can see, it is mostly text with only a few numerical columns. There are a total of 99492 data points which we can use.

In [None]:
data.info()

# Data Cleaning

The first step is to remove unnecessary columns, specifically the dummy variable e.

In [None]:
# Drop dummy variable e
data=data.drop('e',axis=1)

In [None]:
data.head(2)

<h1 style='font-size:50px;text-align:center;color:#555'><br><br>Part II<br><br>Exploratory Data Analysis<br><br></h1>

# Exploratory Data Analysis

Now it's time to draw some simple conclusions from the data.

### **Q1: What are the Top 10 Zipcodes for Emergency Calls?**

In [None]:
top_10_zip=pd.DataFrame(data['zip'].value_counts().head(10))
top_10_zip.reset_index(inplace=True)
top_10_zip.columns=['ZIP','Count']
top_10_zip
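The three-step pattern above (count, reset the index, rename the columns) can also be written as a single chain. The following is a minimal sketch on a made-up `zip` column; the toy frame and its counts are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for data['zip']; the values and counts are illustrative only
toy = pd.DataFrame({'zip': [19401, 19401, 19401, 19464, 19464, 19403]})

# Count each ZIP, keep the most frequent ones and name both columns in one chain
top_zip = (toy['zip'].value_counts()
                     .head(2)
                     .rename_axis('ZIP')
                     .reset_index(name='Count'))
print(top_zip)
```

Naming the columns inside the chain avoids relying on the default column names, which differ between pandas versions.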

**Let's make a plot of the top 20 zip codes**

In [None]:
top_20_zip=pd.DataFrame(data['zip'].value_counts().head(20))
top_20_zip.reset_index(inplace=True)
top_20_zip.columns=['ZIP','Count']
fig1=plt.figure(figsize=(12,6))
sns.barplot(data=top_20_zip,x='ZIP',y='Count',palette="viridis")
fig1.tight_layout()

### **Q2: What are the Top 10 townships for 911 calls?**

In [None]:
top_10_twp=pd.DataFrame(data['twp'].value_counts().head(10))
top_10_twp.reset_index(inplace=True)
top_10_twp.columns=['Township','Count']
top_10_twp

**Let's make a plot of the top 20 townships**

In [None]:
top_20_twp=pd.DataFrame(data['twp'].value_counts().head(20))
top_20_twp.reset_index(inplace=True)
top_20_twp.columns=['Township','Count']
fig2=plt.figure(figsize=(12,6))
g=sns.barplot(data=top_20_twp,x='Township',y='Count',palette="viridis")
g.set_xticklabels(g.get_xticklabels(),rotation=45)
fig2.tight_layout()

### **Q3: How many unique titles/reasons for emergency are there?**

In [None]:
data['title'].nunique()

This is an enormous number of categories to process. Let's simplify the data into three main categories:  

* EMS
* Fire
* Traffic

For this purpose we create a new column titled "Reason"

In [None]:
data['Reason']=data['title'].apply(lambda v:v.split(':')[0])
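To make the split concrete, here is a small sketch on three sample titles. It assumes every title has the shape `Category: detail`; the sample strings are illustrative. It uses the vectorised `.str` accessor, which gives the same result as the `apply` call above.

```python
import pandas as pd

# Illustrative titles in the 'Category: detail' shape assumed for the title column
titles = pd.Series(['EMS: BACK PAINS/INJURY',
                    'Fire: GAS-ODOR/LEAK',
                    'Traffic: VEHICLE ACCIDENT -'])

# Vectorised equivalent of apply(lambda v: v.split(':')[0])
reasons = titles.str.split(':').str[0]
print(reasons.tolist())
```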

In [None]:
data['Reason'].nunique()

**The title has now been simplified to three categories**  

Now let's analyse this column  

**The distribution of these Reasons is:**

In [None]:
data['Reason'].value_counts()

**Now let's see the distribution of these values in the top 10 townships**

In [None]:
fig3=plt.figure(figsize=(12,6))
g=sns.countplot(data=data[(data['twp'].isin(top_10_twp['Township']))],x='twp',hue='Reason',palette='viridis')
g_x=g.set_xticklabels(g.get_xticklabels(),rotation=30)
fig3.tight_layout()

The desc column holds more information that we may be able to use. Let's extract the station from it.

In [None]:
data['Station']=data['desc'].apply(lambda v:v.split(';')[2])

## Analysing TimeStamps

The data is initially in str format. To make it usable, we have to convert it to a datetime object and then extract the hour, day of the week, month and date from each timestamp.

In [None]:
data['timeStamp']=pd.to_datetime(data['timeStamp'])

In [None]:
data['Hour']=data['timeStamp'].apply(lambda v:v.hour)
data['DayOfWeek']=data['timeStamp'].apply(lambda v:v.dayofweek)
data['Month']=data['timeStamp'].apply(lambda v:v.month)
data['Date']=data['timeStamp'].apply(lambda v:v.date())

In [None]:
# Map day values to proper strings
dmap1 = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
data['DayOfWeek']=data['DayOfWeek'].map(dmap1)
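The same fields can also be pulled out with the pandas `.dt` accessor, which works on the whole column at once instead of calling a lambda per row. This is a sketch on two made-up timestamps, not a replacement for the cells above.

```python
import pandas as pd

# Two illustrative timestamps in the YYYY-MM-DD HH:MM:SS format of the timeStamp field
stamps = pd.to_datetime(pd.Series(['2015-12-10 17:40:00', '2016-01-23 08:05:00']))

day_names = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
parts = pd.DataFrame({
    'Hour': stamps.dt.hour,
    'DayOfWeek': stamps.dt.dayofweek.map(day_names),  # 0 is Monday, as in dmap1
    'Month': stamps.dt.month,
    'Date': stamps.dt.date,
})
print(parts)
```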

In [None]:
data.head(2)

Now we have a lot more information to analyse.

### **Q4: What is the Distribution of Emergency Calls by Day of the Week?**

In [None]:
fig4=plt.figure(figsize=(12,8))
sns.countplot(x='DayOfWeek',hue='Reason',palette='viridis',data=data)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

From the data, it is pretty clear that EMS calls are spread evenly across the week, but there is a visible drop in Traffic calls on weekends. This is expected, as there are fewer vehicles on the roads. Most other extremes are most likely a coincidence.

### **Q5: What is the Distribution of Emergency Calls by Month?**

In [None]:
fig5=plt.figure(figsize=(12,8))
sns.countplot(x='Month',hue='Reason',palette='viridis',data=data)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

**The Data is missing some months!** So we need a more continuous view of the distribution, for example a line plot.

In [None]:
databyMonth_EMS = data[data['Reason']=='EMS'].groupby('Month').count()
databyMonth_Fire = data[data['Reason']=='Fire'].groupby('Month').count()
databyMonth_Traffic = data[data['Reason']=='Traffic'].groupby('Month').count()
databyMonth_Cumul = data.groupby('Month').count()

databyMonth_EMS['twp'].plot(figsize=(12,8),label='EMS',lw=5,ls='--')
databyMonth_Fire['twp'].plot(figsize=(12,8),label='Fire',lw=5,ls='--')
databyMonth_Traffic['twp'].plot(figsize=(12,8),label='Traffic',lw=5,ls='--')
databyMonth_Cumul['twp'].plot(figsize=(12,8),label='Total',lw=5)

fig=plt.xticks(np.arange(1,13),['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.title("Emergency Call Rates vs Month")
plt.legend()

**Let's fit a regression line to this data**

In [None]:
sns.lmplot(data=databyMonth_Cumul.reset_index(),x='Month',y='twp')
plt.title("Regression plot of Emergency calls vs Month")
plt.xlabel('Months')
plt.ylabel('Counts')

### **Now let's check the variation of Emergency Calls by Date**

In [None]:
data.groupby('Date').count()['twp'].plot(figsize=(15,3))
plt.tight_layout()

This distribution is quite random, with certain spikes on specific dates.

**Let's check the distribution of the different reasons**

In [None]:
data[data['Reason']=='EMS'].groupby('Date').count()['twp'].plot(figsize=(15,3),label='EMS')
data[data['Reason']=='Fire'].groupby('Date').count()['twp'].plot(figsize=(15,3),label='Fire')
data[data['Reason']=='Traffic'].groupby('Date').count()['twp'].plot(figsize=(15,3),label='Traffic')
plt.tight_layout()
plt.legend()

Now we can see that there was a major spike in Traffic calls around early February. This might be due to a variety of reasons. Let's explore the data from January to February.

In [None]:
strange_increase=data[(data['Reason']=='Traffic') & ( (data['timeStamp']>pd.to_datetime("2016-01-1")) &  (data['timeStamp']<pd.to_datetime("2016-02-1")))].reset_index().drop('index',axis=1)

In [None]:
strange_increase['title'].value_counts()

We can see that the number of disabled vehicles was unusually high, but to put it in context we need to know the number in the other months. For this purpose we will compare it to the monthly average over the next 6 months.

In [None]:
normal_counts=data[(data['Reason']=='Traffic') & ( (data['timeStamp']>pd.to_datetime("2016-02-1")) &  (data['timeStamp']<pd.to_datetime("2016-08-1")))].reset_index().drop('index',axis=1)
normal_counts['title'].value_counts()/6
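One way to put the two tables side by side is to divide the counts of the month under study by the six-month monthly average. The sketch below assumes both are available as Series indexed by title; all the numbers in it are hypothetical and chosen only to show the arithmetic.

```python
import pandas as pd

# Hypothetical counts, for illustration only (not values from the dataset)
month_counts = pd.Series({'Traffic: DISABLED VEHICLE -': 900,
                          'Traffic: VEHICLE ACCIDENT -': 2000})
next_six_total = pd.Series({'Traffic: DISABLED VEHICLE -': 2400,
                            'Traffic: VEHICLE ACCIDENT -': 11400})

monthly_avg = next_six_total / 6            # same division as in the cell above
ratio = (month_counts / monthly_avg).sort_values(ascending=False)
print(ratio)                                # a ratio well above 1 marks an unusual month
```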

Now we can clearly see that the number of disabled vehicles was ***UNUSUALLY HIGH*** in that month. Weather records for that month show that the temperature in Montgomery County was the lowest of the year at that time. A likely explanation is that the cold caused vehicle engines to freeze over, which is why the emergency calls rose.

## Let's move on to a Geographical Analysis

First we look at a geographical (2D) plot of the emergency calls

In [None]:
sns.jointplot(data=data,x='lng',y='lat',kind='scatter')

Thus the data is spread over a large region, but it is concentrated in a smaller region (in the upper-right corner). This area is most probably a city or a large settlement. To analyse it better geographically, we have to ignore the outliers.

For this we keep only the points within 4.5 standard deviations of the mean latitude and 10 standard deviations of the mean longitude.

In [None]:
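# Keep points within 4.5 std of the mean latitude and 10 std of the mean longitude
data_geog=data[(np.abs(data["lat"]-data["lat"].mean())<=(4.5*data["lat"].std())) & (np.abs(data["lng"]-data["lng"].mean())<=(10*data["lng"].std()))]
# reset_index returns a new frame, so the result has to be assigned back
data_geog=data_geog.reset_index(drop=True)
sns.jointplot(data=data_geog,x='lng',y='lat',kind='scatter')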
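The filter above can be read as "keep a value if it lies within k standard deviations of the column mean". A minimal sketch of that rule as a reusable helper follows; the function name `within_k_std` and the toy latitudes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def within_k_std(series, k):
    """Boolean mask: True where a value lies within k standard deviations of the mean."""
    return (series - series.mean()).abs() <= k * series.std()

# Toy latitudes: 100 points near 40 degrees plus one far-away outlier
lat = pd.Series(np.append(40 + 0.001 * np.arange(100), 80.0))
mask = within_k_std(lat, 4.5)
print(mask.sum(), "of", len(lat), "points kept")
```

Note that the mean and standard deviation are computed with the outliers included, so a very small sample can never exceed a large threshold such as 4.5. The rule is only meaningful on a dataset of reasonable size.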

The picture of the township is now clearer. Next we will do a density analysis, but before that we will straighten the township to align it with the standard grid system followed in city planning.

In [None]:
data_geog[['lat','lng']].head()

In [None]:
pd.options.mode.chained_assignment = None #Remove Error Message
x_mean=data_geog['lng'].mean()
y_mean=data_geog['lat'].mean()
data_geog['x']=data_geog['lng'].map(lambda v:v-x_mean)
data_geog['y']=data_geog['lat'].map(lambda v:v-y_mean)

In [None]:
theta=np.pi/3
rot_mat=np.array([np.cos(theta),-np.sin(theta),np.sin(theta),np.cos(theta)]).reshape(2,2)
# Rotate every (x,y) row at once; as_matrix() was removed in pandas 1.0
data_geog[['x','y']]=np.dot(data_geog[['x','y']].to_numpy(),rot_mat)
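To see what this rotation does, the sketch below applies the same matrix to two unit vectors. Because each point is a row vector multiplied on the left of the matrix, the points turn clockwise by `theta` (60 degrees here) and their lengths are preserved.

```python
import numpy as np

theta = np.pi / 3
rot_mat = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

# Each row is one (x, y) point
points = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
rotated = points @ rot_mat      # row vector times matrix, as in the cell above

print(rotated)
print(np.linalg.norm(rotated, axis=1))   # lengths are unchanged by a rotation
```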

In [None]:
sns.jointplot(data=data_geog,x='x',y='y',kind='scatter',xlim=(-0.3,0.3))

In [None]:
sns.jointplot(data=data_geog,x='x',y='y',kind='kde',xlim=(-0.3,0.3))

This gives us a very good picture of the high risk areas. Now let's see the output for the different emergency types.

In [None]:
sns.jointplot(data=data_geog[data_geog['Reason']=='EMS'],x='x',y='y',kind='kde',color='green',xlim=(-0.3,0.3))
plt.title('EMS Distribution')
plt.tight_layout()

In [None]:
sns.jointplot(data=data_geog[data_geog['Reason']=='Fire'],x='x',y='y',kind='kde',color='red',xlim=(-0.3,0.3))
plt.title('Fire Distribution')
plt.tight_layout()

In [None]:
sns.jointplot(data=data_geog[data_geog['Reason']=='Traffic'],x='x',y='y',kind='kde',color='purple',xlim=(-0.3,0.3))
plt.title('Traffic Distribution')
plt.tight_layout()

**Let's observe the location of the different townships**

In [None]:
fig=plt.figure(figsize=(10,10))
twp_group=data_geog.groupby('twp')
for name, group in twp_group:
    plt.plot(group.x, group.y, marker='o', linestyle='', label=name)
plt.xlim(-0.3,0.3)
plt.title("Townships")

Now we can visualize where the townships lie, but trying to manage the emergency services for each township individually is **inefficient and unrealistic.** So a different grouping arrangement is required.

<h1 style='font-size:50px;text-align:center;color:#555'><br><br>Part III<br><br>From Data to Models<br><br></h1>

# City Geo-Clustering

***What do we mean by clustering?***  
It means the grouping of data into groups with similar characteristics/features. The only features we will consider are the geo-location: latitude and longitude. Since we have closely located data points with similar features, this model will group high density areas together, given the right number of clusters.

Now let's try to divide the township into clusters using basic K-Means Clustering, trying the number of clusters over a range and selecting the most effective model.
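For intuition about what K-Means does, here is a bare-bones sketch of the classic Lloyd iteration in NumPy: assign every point to its nearest centre, move every centre to the mean of its points, and repeat until nothing changes. This is an illustration only and not scikit-learn's implementation; the function name `lloyd_kmeans`, the explicit `init_centers` argument and the two toy blobs are all assumptions.

```python
import numpy as np

def lloyd_kmeans(points, init_centers, n_iter=100):
    """Plain Lloyd iterations starting from the given initial centres."""
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, n_centers)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its points (keep it if it has none)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs standing in for dense areas
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=(1.0, 1.0), scale=0.05, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = lloyd_kmeans(points, init_centers=[points[0], points[-1]])
print(centers)
```

The result depends on the starting centres, which is why library implementations usually try several initialisations.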

First let's use 10 as the number of clusters

In [None]:
from sklearn.cluster import KMeans

In [None]:
X=data_geog[['x','y']].reset_index().drop('index',axis=1)

In [None]:
kmeans=KMeans(n_clusters=10)

In [None]:
kmeans.fit(X)

In [None]:
fig=plt.figure(figsize=(7,7))
plt.scatter(X['x'],X['y'],c=kmeans.labels_,cmap='rainbow')
plt.xlim(-0.3,0.3)

Now we test out different values for the number of clusters

In [None]:
fig=plt.figure(figsize=(12,12))
for i in range(3,12):
    kmeans=KMeans(n_clusters=i)
    kmeans.fit(X)
    fig.add_subplot(3,3,i-2)
    plt.scatter(X['x'],X['y'],c=kmeans.labels_,cmap='rainbow')
    plt.title("Number of Clusters = {}".format(i))
    plt.xlim(-0.3,0.3)

Now we have a beautiful range of options of clustering our data. But to decide the final clustering model, we need to consider a few factors:  

* Area of Township
* Average Population in need of Emergency Services
* Distribution of said emergencies

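Let's calculate each of the above.

Firstly, for the area, we will use the formula:  
$A = 2\pi R^2 \, |\sin(\mathrm{lat}_1) - \sin(\mathrm{lat}_2)| \, |\mathrm{lon}_1 - \mathrm{lon}_2| / 360$

Here $R$ is the radius of the Earth (6371 km in the code below), the latitudes are converted to radians before taking the sine, and the longitudes are in degrees.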

In [None]:
latsin_dist=np.abs(np.sin(np.max(data_geog["lat"])/180*np.pi)-np.sin(np.min(data_geog["lat"])/180*np.pi))
lng_dist=np.abs(np.max(data_geog["lng"])-np.min(data_geog["lng"]))

Now to calculate the value

In [None]:
def ll2area(latsin,lng):
    return 2*np.pi*(6371**2)*latsin*lng/360
A=ll2area(latsin_dist,lng_dist)
print("The Area of the Township is Approximately {} sq. km".format(A))

Thus we now know the area of the township.
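A quick sanity check of the area formula: a box covering every latitude and every longitude should give the full surface of the sphere, $4\pi R^2$. The helper below restates the formula with the inputs in degrees; the name `latlng_box_area` is an assumption made for this sketch.

```python
import numpy as np

EARTH_RADIUS_KM = 6371

def latlng_box_area(lat1, lat2, lng1, lng2, radius=EARTH_RADIUS_KM):
    """Area (sq km) of the patch of a sphere between two latitudes and two longitudes, in degrees."""
    sin_diff = abs(np.sin(np.radians(lat1)) - np.sin(np.radians(lat2)))
    return 2 * np.pi * radius**2 * sin_diff * abs(lng1 - lng2) / 360

# The whole globe should reproduce the surface area of a sphere
whole_sphere = latlng_box_area(-90, 90, -180, 180)
print(np.isclose(whole_sphere, 4 * np.pi * EARTH_RADIUS_KM**2))

# A one-degree box at the equator, for a sense of scale
one_degree_box = latlng_box_area(0, 1, 0, 1)
print(one_degree_box)
```

Keep in mind that the formula measures the bounding box of the data, so it overestimates the area of an irregularly shaped region.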

From the lat-long we can find out the country and thus the approximate population density.
Here it is the USA, so we take the average urban population density to be 814 people per square mile = 314 people per square km.  

So the average population can be calculated as follows:

In [None]:
pop=int(A*314) # np.int was removed from NumPy; the built-in int does the same job
print("The Avg Population of the Township is Approximately {}".format(pop))
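The conversion from people per square mile to people per square km used above can be checked with a few lines of arithmetic. The helper name is an assumption; the only fact used is that one mile is exactly 1.609344 km.

```python
SQ_KM_PER_SQ_MILE = 1.609344 ** 2   # one mile is exactly 1.609344 km

def per_sq_mile_to_per_sq_km(density):
    """Convert a density given per square mile into a density per square km."""
    return density / SQ_KM_PER_SQ_MILE

density_km = per_sq_mile_to_per_sq_km(814)
print(round(density_km))   # → 314
```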

We already know the distribution of emergencies from the earlier graphs so now we are ready to choose our cluster model  

### Cluster Choice

We can see that the southern area has very few emergencies, which suggests a lower population or a safer area, so we do not need multiple clusters there; a single larger cluster is enough (like n_clusters=6).  

The density is higher in the central part, which suggests the need for smaller clusters to maintain an almost constant ratio of emergency services to population. Again n_clusters=6 shines here.

So at first look, n_clusters=6 is the best choice.

Now we have to figure out the specifics with n_clusters=6

In [None]:
final_kmeans=KMeans(n_clusters=6)
final_kmeans.fit(X)
fig=plt.figure(figsize=(7,7))
plt.scatter(X['x'],X['y'],c=final_kmeans.labels_,cmap='rainbow')
plt.xlim(-0.3,0.3)

Now to make the new dataset with the cluster data

In [None]:
data_clus=pd.concat([data_geog.reset_index().drop('index',axis=1),pd.DataFrame(final_kmeans.labels_,columns=['Cluster'])],axis=1)

In [None]:
data_clus.drop(['desc','title','timeStamp'],axis=1,inplace=True)

In [None]:
data_clus.tail(2)

Now we will create a model to approximate the population of each cluster. For that we need to find the population densities per sq km. Then we will use these values to find the model that best fits the data.

The steps to this process are as follows:

1. **Group latitudes and longitudes into 1 km blocks** and find the number of data points in each block. Then we can use a **normally inefficient** kernel density estimate to approximate the log(densities), which we convert to population densities by an appropriate function.

2. Then we can find an **effective and efficient model** that fits the data. 

In [None]:
data_block=data_clus.copy()

In [None]:
data_block['lat']=np.rint(data_block['lat']*100)/100 #reduce to 1 km block
data_block['lng']=np.rint(data_block['lng']*100)/100 #reduce to 1 km block

In [None]:
data_block.head(2)
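Rounding to two decimal places groups points into cells of 0.01 degrees. The sketch below estimates how large such a cell is on the ground, assuming a spherical Earth of radius 6371 km and a representative latitude of 40 degrees north (the latitude is an assumption; substitute the mean latitude of the data).

```python
import math

EARTH_RADIUS_KM = 6371
KM_PER_DEGREE = 2 * math.pi * EARTH_RADIUS_KM / 360   # length of one degree of latitude

def block_size_km(lat_deg, step_deg=0.01):
    """North-south and east-west extent (km) of a step_deg x step_deg cell at a given latitude."""
    north_south = KM_PER_DEGREE * step_deg
    east_west = north_south * math.cos(math.radians(lat_deg))  # meridians converge towards the poles
    return north_south, east_west

ns, ew = block_size_km(40.0)   # assumed latitude, for illustration
print("north-south: {:.2f} km, east-west: {:.2f} km".format(ns, ew))
```

So each block is roughly, but not exactly, 1 km on a side, and slightly narrower east to west than north to south.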

In [None]:
model_data=data_block.groupby(['lat','lng']).count().reset_index().drop(['zip','twp','Reason','Hour','DayOfWeek','Month','Date','Station','Cluster','x','y'],axis=1)
print("This is the form of our new data reduced to 1 sqkm blocks")
model_data.head(1)

In [None]:
X2=model_data[['lat','lng']]
y2=model_data.drop(['lat','lng'],axis=1)

We now use Kernel Density Analysis to get the densities of emergencies throughout the database, and then use a simple proportion of mean kernel density to mean population density to get the population densities, which we then tweak to match our previous data.

In [None]:
from sklearn.neighbors import KernelDensity
kd=KernelDensity()
kd.fit(X2,y2)
y2=pd.DataFrame(np.exp(kd.score_samples(X2)))
mean_kernel=np.exp(kd.score_samples(X2)).mean()
min_kernel=np.exp(kd.score_samples(X2)).min()
mean_pop=0.95*pop/A
def kern2pop(ker):
    return (((mean_kernel+np.sign(ker-mean_kernel)*(np.abs(ker-mean_kernel))**0.59)/mean_kernel))*mean_pop
y2=pd.DataFrame(y2.apply(kern2pop))
print("Our data is now ready")
pd.DataFrame(pd.concat([X2,y2],axis=1).head(5))
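The `kern2pop` mapping above is an ad hoc rescaling, so it helps to see how it behaves. The sketch below restates it as a standalone function with the mean kernel value and the mean density passed in explicitly; the numbers 2.0 and 300 are made up for illustration.

```python
import numpy as np

def kern2pop(ker, mean_kernel, mean_pop, power=0.59):
    """Map kernel densities to population densities around a fixed mean."""
    # Signed power of the deviation from the mean kernel value
    deviation = np.sign(ker - mean_kernel) * np.abs(ker - mean_kernel) ** power
    return (mean_kernel + deviation) / mean_kernel * mean_pop

kernels = np.array([1.0, 2.0, 3.0])            # illustrative kernel values
densities = kern2pop(kernels, mean_kernel=2.0, mean_pop=300)
print(densities)
```

A kernel value equal to the mean maps exactly to the mean density, and values below or above the mean map below or above it. Because the exponent is less than 1, deviations smaller than 1 are amplified.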

#### Let's perform a K-Fold Cross Validation to get the best Regressor Model. The various possible models are:

1. Linear Regression
2. Quadratic Regression
3. Decision Tree Regressor
4. Random Forests Regressor
5. Support Vector Regressor (Gaussian)
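Before using the scikit-learn helpers, it may help to see what K-Fold cross validation computes. The sketch below is a hand-rolled version in NumPy: it splits the rows into k folds, fits an ordinary least-squares model on the other folds and scores R² on the held-out fold. The function name `kfold_r2` and the toy data are assumptions; it is an illustration of the idea, not the procedure `cross_val_score` follows internally.

```python
import numpy as np

def kfold_r2(features, target, k=10, seed=0):
    """Mean out-of-fold R^2 of a least-squares fit over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(target))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Fit on the training rows only
        coef, *_ = np.linalg.lstsq(features[train], target[train], rcond=None)
        resid = target[fold] - features[fold] @ coef
        total = target[fold] - target[fold].mean()
        scores.append(1 - (resid @ resid) / (total @ total))
    return np.mean(scores)

# Toy target with a quadratic term, so a straight line cannot fit it fully
x = np.linspace(-1, 1, 200)
y = x**2 + 0.5 * x
linear = np.column_stack([np.ones_like(x), x])
quadratic = np.column_stack([np.ones_like(x), x, x**2])

r2_linear = kfold_r2(linear, y)
r2_quadratic = kfold_r2(quadratic, y)
print(r2_linear, r2_quadratic)
```

On this toy data the quadratic features recover the target almost exactly while the linear model does not, which mirrors the comparison carried out in the sections that follow.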

In [None]:
from sklearn.model_selection import cross_val_score,cross_val_predict

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lin_model=LinearRegression()
print("R2 Score: {} ".format(cross_val_score(lin_model,X2,y2,scoring='r2',cv=10).mean()))
print("Root Mean Squared Error: {}".format(np.sqrt(-cross_val_score(lin_model,X2,y2,scoring='neg_mean_squared_error',cv=10).mean())))
predicted=cross_val_predict(lin_model,X2,y2,cv=10)
plt.scatter(y2,predicted)
plt.plot([y2.min(), y2.max()], [y2.min(), y2.max()], 'k--', lw=4)
plt.title('Residual Error')
plt.xlabel('Measured')
plt.ylabel('Predicted')

Both the R-squared and the RMSE values are poor, thus Linear Regression is an inaccurate predictor.

### Quadratic Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly=PolynomialFeatures(2)
X_quad=pd.DataFrame(poly.fit_transform(X2),columns=['1','lat','lng','lat^2','lat*lng','lng^2'])

In [None]:
quad_model=LinearRegression()
print("R2 Score: {}".format(cross_val_score(quad_model,X_quad,y2,scoring='r2',cv=10).mean()))
print("Root Mean Squared Error: {}".format(np.sqrt(-cross_val_score(quad_model,X_quad,y2,scoring='neg_mean_squared_error',cv=10).mean())))
predicted=cross_val_predict(quad_model,X_quad,y2,cv=10)
plt.scatter(y2,predicted)
plt.plot([y2.min(), y2.max()], [y2.min(), y2.max()], 'k--', lw=4)
plt.title('Residual Error')
plt.xlabel('Measured')
plt.ylabel('Predicted')

Comparatively, the R-squared value is much better, thus Quadratic Regression is a much better model.

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dtree_model=DecisionTreeRegressor()
print("R2 Score: {}".format(cross_val_score(dtree_model,X2,y2,scoring='r2',cv=10).mean()))
print("Root Mean Squared Error: {}".format(np.sqrt(-cross_val_score(dtree_model,X2,y2,scoring='neg_mean_squared_error',cv=10).mean())))
predicted=cross_val_predict(dtree_model,X2,y2,cv=10)
plt.scatter(y2,predicted)
plt.plot([y2.min(), y2.max()], [y2.min(), y2.max()], 'k--', lw=4)
plt.title('Residual Error')
plt.xlabel('Measured')
plt.ylabel('Predicted')

As is apparent from the values, the Decision Tree is also not an ideal model.

### Random Forests

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
randf_model=RandomForestRegressor(n_jobs=-1)
yx=y2.values.ravel()
print("R2 Score: {}".format(cross_val_score(randf_model,X2,yx,scoring='r2',cv=10).mean()))
print("Root Mean Squared Error: {}".format(np.sqrt(-cross_val_score(randf_model,X2,yx,scoring='neg_mean_squared_error',cv=10).mean())))
predicted=cross_val_predict(randf_model,X2,yx,cv=10)
plt.scatter(yx,predicted)
plt.plot([yx.min(), yx.max()], [yx.min(), yx.max()], 'k--', lw=4)
plt.title('Residual Error')
plt.xlabel('Measured')
plt.ylabel('Predicted')

This result was expected, as the number of features is small and there is nothing to decorrelate, so we cannot use Random Forest.

### Support Vector Machine

In [None]:
from sklearn.svm import SVR 

In [None]:
svm_model=SVR()
print("R2 Score: {}".format(cross_val_score(svm_model,X2,yx,scoring='r2',cv=10).mean()))
print("Root Mean Squared Error: {}".format(np.sqrt(-cross_val_score(svm_model,X2,yx,scoring='neg_mean_squared_error',cv=10).mean())))
predicted=cross_val_predict(svm_model,X2,yx,cv=10)
plt.scatter(yx,predicted)
plt.plot([yx.min(), yx.max()], [yx.min(), yx.max()], 'k--', lw=4)
plt.title('Residual Error')
plt.xlabel('Measured')
plt.ylabel('Predicted')

Without tuned parameters, this SVM result was to be expected. We now run a cross-validated grid search to find the best parameters.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [None]:
param_grid={'C':[0.1,1,10,100,1000],'gamma':[1,0.1,0.01,0.001,0.0001]}
grid=GridSearchCV(SVR(),param_grid,verbose=0)
grid.fit(X2,yx)
print('Best Score is {} at {}'.format(grid.best_score_,grid.best_params_))
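Conceptually, a grid search simply tries every combination of the listed values and keeps the best-scoring one. The plain-Python sketch below enumerates the same grid; `toy_score` is a hypothetical stand-in for the cross-validated score and its peak location is made up.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

def toy_score(C, gamma):
    # Hypothetical stand-in for a cross-validated score; it peaks at C=10, gamma=0.01
    return -abs(C - 10) - abs(gamma - 0.01)

# Every combination of the listed values: 5 x 5 = 25 candidates
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

best = max(candidates, key=lambda params: toy_score(**params))
print(len(candidates), best)
```

With a real model each candidate is fitted once per cross-validation fold, so the cost grows quickly with the size of the grid.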

Now for scoring the best SVR model

In [None]:
gridsvm=SVR(C=1000,gamma=1)
print("R2 Score: {}".format(cross_val_score(gridsvm,X2,yx,scoring='r2',cv=10).mean()))
print("Root Mean Squared Error: {}".format(np.sqrt(-cross_val_score(gridsvm,X2,yx,scoring='neg_mean_squared_error',cv=10).mean())))
predicted=cross_val_predict(gridsvm,X2,yx,cv=10)
plt.scatter(yx,predicted)
plt.plot([yx.min(), yx.max()], [yx.min(), yx.max()], 'k--', lw=4)
plt.title('Residual Error')
plt.xlabel('Measured')
plt.ylabel('Predicted')

So here we can see that this is also a good model, but the results fall behind Quadratic Regression.

### Conclusion:  
Both the SVM model and Quadratic Regression perform comparably well, but since SVMs are more computationally expensive, Quadratic Regression is the model best suited to our purpose.

-----------
### Cluster information
The next step is to find the area of each cluster, and the population density at the centroids

In [None]:
clus_info=pd.DataFrame(final_kmeans.cluster_centers_,columns=['x','y'])
print("Cluster Centers in local coordinate are:")
clus_info

In [None]:
fig=plt.figure(figsize=(7,7))
plt.scatter(X['x'],X['y'],c=final_kmeans.labels_,cmap='summer')
plt.xlim(-0.3,0.3)
plt.scatter(clus_info['x'],clus_info['y'],marker='^',color='black')
n=[' C1',' C2',' C3',' C4',' C5',' C6']
for i,txt in enumerate(n):
    plt.annotate(txt,xy=(clus_info['x'][i],clus_info['y'][i]),color='black')

Let's find the average population density of each of the clusters. For this we calculate the population density at each data point and then find the average by cluster.

In [None]:
poly=PolynomialFeatures(2)
clus_quad=pd.DataFrame(poly.fit_transform(data_clus[['lat','lng']]),columns=['1','lat','lng','lat^2','lat*lng','lng^2'])

In [None]:
quad_model.fit(X_quad,y2)
popdense=pd.DataFrame(quad_model.predict(clus_quad).ravel(),columns=["Pop. Density"])
data_clus=pd.concat([data_clus,popdense],axis=1)
data_clus.head()

Now we have the population density at each point in the dataset. So now we can find the cluster averages.

In [None]:
# Select the column before averaging; as_matrix() was removed in pandas 1.0
pope=data_clus.groupby('Cluster')['Pop. Density'].mean().to_numpy()
pope

In [None]:
print("The predicted population densities are:\n")
#popdense=quad_model.predict(clus_quad).ravel()
#popdense=gridsvm.predict(data_clus[['lat','lng']]).ravel()
print(pope)
print("\nSo we can see that {} has the maximum density and correspondingly a small cluster size.\nAlso {} has the least population density but a much larger size. This generally means that the populations have been managed equally".format(n[pope.argmax()],n[pope.argmin()]))

In [None]:
areas=[]
print("The approximate cluster areas are:\n")
for i in range(0,6):
    tempdata=data_clus[data_clus['Cluster']==i]
    lats=np.abs(np.sin(np.max(tempdata["lat"])/180*np.pi)-np.sin(np.min(tempdata["lat"])/180*np.pi))
    lngs=np.abs(np.max(tempdata["lng"])-np.min(tempdata["lng"]))
    pops=(2/3)*ll2area(lats,lngs)
    areas.append(pops)
    print("Cluster {} : {:.2f} sq km".format(i+1,pops))
print("\nThe predicted cluster populations are:")
print(areas*pope)

<h1 style='font-size:50px;text-align:center;color:#555'><br><br>Part IV<br><br>Understanding the Situation<br><br></h1>

### Study of Existing Services

First we clean the Station Column to get the Codes/Names of the Stations in the form of a Database

In [None]:
def getname(v):
    if len(v.split('Station'))>1:
        if v.split('Station')[1][0]==':':
            return v.split('Station')[1][1:]
        else:
            return v.split('Station')[1]
    else:
        return 0
data_geog['Station Name']=data_geog['Station'].apply(getname)
station_base=data_geog[data_geog['Station Name'] != 0].copy().drop(['timeStamp','title','desc'],axis=1)
station_base.head(3)

How many Unique Stations are there?

In [None]:
station_base['Station Name'].nunique()

Let's estimate the average location of each of these Stations:

In [None]:
station_list =station_base.groupby('Station Name').mean(numeric_only=True).reset_index().drop(['zip','Hour','Month'],axis=1).drop(0)
station_list.head(2)

Now to plot this data on the map

In [None]:
fig=plt.figure(figsize=(7,7))
plt.scatter(X['x'],X['y'],c=final_kmeans.labels_,cmap='summer')
plt.scatter(station_list['x'],station_list['y'],marker='^',color='black')
plt.xlim(-0.3,0.3)

As is apparent from the plot, the distribution of the services is uniform and ***doesn't take high risk areas into consideration***, but we still lack details as to which station handles which emergencies.

In [None]:
dummies=pd.get_dummies(station_base['Reason'])
dummies.head(2)

Notice how there is no Traffic column. We have to assume this means that the police stations responsible for managing traffic calls are not properly labelled in the data, so they are of little use to us.

In [None]:
emergencies=pd.concat([station_base,dummies],axis=1).groupby('Station Name').sum(numeric_only=True).drop(['lat','lng','x','y','zip','Hour','Month'],axis=1).reset_index().drop(0)
emergencies=pd.concat([emergencies,station_list[['lat','lng','x','y']]],axis=1)
popdenseStation=pd.DataFrame(quad_model.predict(pd.DataFrame(poly.fit_transform(emergencies[['lat','lng']]),columns=['1','lat','lng','lat^2','lat*lng','lng^2'])).ravel(),columns=["Pop. Density"])
emergencies=pd.concat([emergencies.reset_index().drop('index',axis=1),popdenseStation],axis=1)
emergencies.head(3)

In [None]:
emergencies.tail(3)

Interesting! We now know that identification codes with 'STA' handle fire emergencies (Fire Stations) and the rest are Emergency Services.

In [None]:
def makemap(ems,fire,x,y,size,alpha):
    # Red marker if the station handles more Fire than EMS calls, blue otherwise
    color='red' if fire>ems else 'blue'
    plt.scatter(x,y,marker='o',color=color,s=size,alpha=alpha)
    plt.xlim(-0.3,0.3)
fig=plt.figure(figsize=(7,7))
plt.scatter(X['x'],X['y'],c=final_kmeans.labels_,cmap='summer')
plt.xlim(-0.3,0.3)


for index,row in emergencies.iterrows():
    makemap(row['EMS'],row['Fire'],row['x'],row['y'],50,1)

fire_station=emergencies[emergencies["Fire"]>0].drop(['EMS','Fire'],axis=1)
ems_station=emergencies[emergencies["EMS"]>0].drop(['EMS','Fire'],axis=1)

Now we can see the distribution of the services (red for Fire Stations, blue for EMS), but what is the actual zone of influence of each station? Let's use an arbitrary function  
**f(Pop. Density) = 3 * (550 - Pop. Density)**

In [None]:
fig=plt.figure(figsize=(7,7))
plt.scatter(X['x'],X['y'],c=final_kmeans.labels_,cmap='summer')
plt.xlim(-0.3,0.3)

for index,row in emergencies.iterrows():
    makemap(row['EMS'],row['Fire'],row['x'],row['y'],3* (550-row['Pop. Density']),0.4)

As we can already see, the allocation of services is extremely inefficient, with a high density of stations in areas where there is not much demand and a sparsity of services in important areas.

# Conclusion

We began with simple, detailed records of all the emergency calls of a county. Firstly, we dealt with the features we did not need. Then we extracted whatever additional features we could from the given features. Then we used exploratory data analysis to find some interesting aspects of the data, including the emergencies per township. 

This was followed by an analysis of the number and type of emergencies over the year. We saw a sudden rise in disabled vehicles in February. From this we can suggest that the emergency services should expect a rise in such emergencies and be adequately prepared. 

Then we proceeded with the geographical analysis of the data using the latitude and longitude. We observed that most emergencies were concentrated in one settlement and focused on it. The data shows more emergencies in certain areas than in others, which suggests the presence of high risk populated zones that should be carefully managed.

To help in the above task we tried to find an efficient division of zones by clustering. We then evaluated our choice of number of clusters using the population densities calculated by the most accurate model.

This was followed by an analysis of the pre-existing stations, which showed that the system in place is not very efficient and has a lot of room for intelligent improvement.

# What Next?

After doing a detailed analysis of the data at hand, we went forward with *clustering* the city (into something similar to districts). How is this more efficient?

1. These districts are not arbitrary boundaries, these district boundaries have been computed to ensure that each of these have similar geographically controlled properties, specifically the demand for Public Services. So the Officials responsible for each such 'district' will be able to look after the needs more effectively.

2. This entire process can be repeated *again* for each individual district to build more sub-clusters each with better service handling capabilities. For this purpose we will require more data, preferably over a time period greater than 10 years. That way the increase in needs can also be considered as a factor

3. Thirdly, this entire process promotes a hierarchical organisation, which is more efficient for both humans and other intelligent systems

## Thank You for Reading

#### About the Author: *Saptarshi Soham Mohanta*

Student of Physics and Mathematics. Interested in Data Science, Full-Stack Development, Machine Learning, and Statistical Analytics.

Known Languages: Java, JavaScript (including jQuery, D3, React.js, npm, Node.js), XML, HTML (including Bootstrap), CSS (including animate.css, SCSS), Python (including NumPy, Pandas, Seaborn, Matplotlib, Plotly and Scikit-Learn libraries)

Currently Learning: C, C++, CUDA-C, PyCUDA, TensorFlow libraries