<h1 style="color:red"><center>Customer Service Requests Analysis</center></h1>

#### Background of Problem Statement :

NYC 311's mission is to provide the public with quick and easy access to all New York City government services and information while offering the best customer service. Each day, NYC311 receives thousands of requests related to several hundred types of non-emergency services, including noise complaints, plumbing issues, and illegally parked cars. These requests are received by NYC311 and forwarded to the relevant agencies such as the police, buildings, or transportation. The agency responds to the request, addresses it, and then closes it.

#### Problem Objective :

Perform a service request data analysis of New York City 311 calls. You will focus on the data wrangling techniques to understand the pattern in the data and also visualize the major complaint types.

#### Domain: 
###### Customer Service


#### Tasks Performed:

<ul>
<li>Import the NYC 311 service request dataset.</li>
<li>Read or convert the columns ‘Created Date’ and ‘Closed Date’ to datetime datatype and create a new column ‘Request_Closing_Time’ as the time elapsed between request creation and request closing.</li>
<li>Provide major insights/patterns in a visual format (graphs or tables), with at least 4 major conclusions drawn from generic data mining.</li>
<li>Order the complaint types based on the average ‘Request_Closing_Time’, grouping them for different locations.</li>
<li>Perform statistical tests.</li>
</ul>

Please note: for the statements below, state the null and alternative hypotheses, then apply a statistical test to reject or fail to reject the null hypothesis, and report the corresponding ‘p-value’.

<ul>
    <li>Whether the average response time across complaint types is similar or not (overall)</li>
<li>Are the type of complaint or service requested and location related?</li>
    </ul>


### Importing the Library

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings("ignore")

from scipy import stats
from scipy.stats import chi2_contingency 

import statsmodels.api as sm
from statsmodels.formula.api import ols

### Loading the Data

In [None]:
df=pd.read_csv("../input/311-service-requests-nyc/311_Service_Requests_from_2010_to_Present.csv")
df.head()

### Descriptive Analysis

In [None]:
df.describe()

In [None]:
df.shape

<p style="color:green">We see many missing values. The summary statistics above do not give a clear picture of the data, so we move on to Exploratory Data Analysis.</p>

### Feature Creation

In [None]:
# Converting the data into datetime format
df["Created Date"]=pd.to_datetime(df["Created Date"])
df["Closed Date"]=pd.to_datetime(df["Closed Date"])

In [None]:
# Create a new column holding the time taken to resolve each complaint, in minutes
df["Request_Closing_Time"]=(df["Closed Date"]-df["Created Date"]).dt.total_seconds()/60
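
A minimal sketch of this feature on a two-row toy frame, with an explicit date format and a request that has no closing date. The timestamps are made up, and the `format` string is an assumption about how the CSV stores its dates; `errors="coerce"` turns missing or unparseable values into `NaT`, so an open request ends up with a `NaN` closing time.

```python
import pandas as pd

# Toy data (made up for illustration); the format string is an assumption
toy = pd.DataFrame({
    "Created Date": ["12/31/2015 11:59:45 PM", "12/31/2015 11:30:00 PM"],
    "Closed Date": ["01/01/2016 12:55:45 AM", None],
})

fmt = "%m/%d/%Y %I:%M:%S %p"
for col in ["Created Date", "Closed Date"]:
    toy[col] = pd.to_datetime(toy[col], format=fmt, errors="coerce")

# Elapsed time in minutes; rows without a closing date stay NaN
elapsed = toy["Closed Date"] - toy["Created Date"]
toy["Request_Closing_Time"] = elapsed.dt.total_seconds() / 60
print(toy["Request_Closing_Time"].tolist())
```

Dividing the seconds by 60 gives minutes, so any threshold applied to this column later is in minutes as well.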

### Exploratory Data Analysis

In [None]:
df["Agency"].unique()

<p style="color:green">All of our data belongs to a single agency, NYPD, i.e. the New York City Police Department.</p>

In [None]:
#Univariate Distribution Plot for Request Closing Time
sns.distplot(df["Request_Closing_Time"])
plt.show()

In [None]:
print("Total Number of Concerns : ",len(df),"\n")
print("Percentage of requests resolved within 100 minutes  : ",round((len(df)-(df["Request_Closing_Time"]>100).sum())/len(df)*100,2),"%")
print("Percentage of requests resolved within 1000 minutes : ",round((len(df)-(df["Request_Closing_Time"]>1000).sum())/len(df)*100,2),"%")

<p style="color:green">From the above we can see that the data is heavily skewed, with many outliers. More than 97% of the requests are resolved in under 1000 minutes, i.e. about 17 hours, since Request_Closing_Time is measured in minutes.</p>
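
The skew can also be quantified rather than read off the plot. A minimal sketch on made-up closing times in minutes (the values and the variable names are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Made-up closing times in minutes, with one extreme outlier
minutes = pd.Series([30, 45, 60, 90, 120, 150, 240, 300, 900, 20000], dtype=float)

threshold_min = 1000
share_under = (minutes < threshold_min).mean() * 100  # percent of requests

print("Skewness:", round(minutes.skew(), 2))
print("Median (min):", minutes.median(), " Mean (min):", minutes.mean())
print(f"{share_under:.0f}% closed in under {threshold_min} min "
      f"(about {threshold_min / 60:.1f} hours)")
```

A positive skewness together with a mean far above the median is the usual signature of a long right tail.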

In [None]:
#Univariate Distribution Plot for Request Closing Time
sns.distplot(df["Request_Closing_Time"])
plt.xlim((0,5000))
plt.ylim((0,0.0003))
plt.show()

In [None]:
# Count plot to understand the type of the complaint raised
df['Complaint Type'].value_counts()[:10].plot(kind='barh',alpha=0.6,figsize=(15,10))
plt.show()

<p style="color:green">Around 85% of the requests are transport related (Blocked Driveway, Illegal Parking, Vehicle Noise, Road Traffic, etc.).</p>

In [None]:
#Categorical Scatter Plot to understand which type of complaints are taking more time to get resolved
g=sns.catplot(x='Complaint Type', y="Request_Closing_Time",data=df)
g.fig.set_figwidth(15)
g.fig.set_figheight(7)
plt.xticks(rotation=90)
plt.ylim((0,5000))
plt.show()

<p style="color:green">As seen above, around 85% of the requests are transport related (Blocked Driveway, Illegal Parking, Vehicle Noise, Road Traffic, etc.). This plot suggests that most of these issues take longer to get resolved. The government should take steps to increase awareness and find measures to reduce traffic problems.</p>

In [None]:
# Count plot to know the status of the requests
df['Status'].value_counts().plot(kind='bar',alpha=0.6,figsize=(15,7))
plt.show()

<p style="color:green">As of now, almost 98% of the cases are in the closed state.</p>

In [None]:
#Count Plot for Column Borough
plt.figure(figsize=(12,7))
df['Borough'].value_counts().plot(kind='bar',alpha=0.7)
plt.show()

In [None]:
#Percentage of cases in each Borough
for x in df["Borough"].unique():
    print("Percentage of Request from ",x," Division : ",round((df["Borough"]==x).sum()/len(df)*100,2))
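
The per-borough percentages can also be obtained in a single call with `value_counts(normalize=True)`. A minimal sketch on made-up borough labels (illustrative only):

```python
import pandas as pd

# Made-up borough labels for illustration only
borough = pd.Series(["BROOKLYN", "BROOKLYN", "QUEENS", "MANHATTAN", "BROOKLYN",
                     "QUEENS", "BRONX", "Unspecified"])

# normalize=True returns shares instead of counts; multiply for percentages
borough_pct = (borough.value_counts(normalize=True) * 100).round(2)
print(borough_pct)
```

By default `value_counts` ignores missing values; passing `dropna=False` would list them as their own row.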

In [None]:
#Unique Location Types
df["Location Type"].unique()

In [None]:
#Request Closing Time for all location Type sorted in ascending Order
pd.DataFrame(df.groupby("Location Type")["Request_Closing_Time"].mean()).sort_values("Request_Closing_Time")

<p style="color:red">We see that the maximum mean time to resolve a complaint is in Park, Vacant Lot and Commercial areas, whereas cases in Subway Station and Restaurant locations are resolved in much less time.</p>

In [None]:
#Request Closing Time for all City sorted in ascending Order
pd.DataFrame(df.groupby("City")["Request_Closing_Time"].mean()).sort_values("Request_Closing_Time")
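
The task list asks for complaint types ordered by average closing time within each location. A minimal sketch of one way to do that, grouping on two columns at once; the rows below are made up, and using `City` as the location column is an assumption:

```python
import pandas as pd

# Made-up rows for illustration; closing times are in minutes
toy = pd.DataFrame({
    "City": ["BROOKLYN", "BROOKLYN", "BROOKLYN", "BRONX", "BRONX", "BRONX"],
    "Complaint Type": ["Blocked Driveway", "Illegal Parking", "Blocked Driveway",
                       "Noise - Street/Sidewalk", "Blocked Driveway",
                       "Noise - Street/Sidewalk"],
    "Request_Closing_Time": [300.0, 200.0, 260.0, 90.0, 400.0, 110.0],
})

# Mean closing time per (city, complaint type), ordered within each city
avg_time = (
    toy.groupby(["City", "Complaint Type"])["Request_Closing_Time"]
       .mean()
       .reset_index()
       .sort_values(["City", "Request_Closing_Time"])
       .reset_index(drop=True)
)
print(avg_time)
```

Sorting on the location column first keeps each location's complaint types together, ranked from fastest to slowest.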

<h2 style="color:red;">Handling Missing Values</h2>

In [None]:
#Percentage Of Missing Value
pd.DataFrame((df.isnull().sum()/df.shape[0]*100)).sort_values(0,ascending=False)[:20]

<p style="color:green;">We see that all the school-related columns are empty, presumably because none of the requests or complaints come from the school sector. Thus we can go ahead and remove those columns.</p>

In [None]:
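#Remove the columns with a very high percentage of missing values
#.copy() makes new_df independent of df, so the in-place drops below do not operate on a slice
new_df=df.loc[:,(df.isnull().sum()/df.shape[0]*100)<=50].copy()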

In [None]:
print("Old DataFrame Shape :",df.shape)
print("New DataFrame Shape : ",new_df.shape)

In [None]:
rem=[]
for x in new_df.columns.tolist():
    if new_df[x].nunique()<=3:
        print(x+ " "*10+" : ",new_df[x].unique())
        rem.append(x)

<p style="color:green;">We see that the columns above carry little information and are mostly Unspecified. So we can remove those columns to simplify our analysis.</p>

In [None]:
new_df.drop(rem,axis=1,inplace=True)

In [None]:
new_df.shape

In [None]:
#Remove columns that are not needed for our analysis
rem1=["Unique Key","Incident Address","Descriptor","Street Name","Cross Street 1","Cross Street 2","Due Date","Resolution Description","Resolution Action Updated Date","Community Board","X Coordinate (State Plane)","Y Coordinate (State Plane)","Park Borough","Latitude","Longitude","Location"]

new_df.drop(rem1,axis=1,inplace=True)

In [None]:
new_df.head()

<h2 style="color:red;">Hypothesis Testing</h2>

In [None]:
g=sns.catplot(x="Complaint Type",y="Request_Closing_Time",kind="box",data=new_df)
g.fig.set_figheight(8)
g.fig.set_figwidth(15)
plt.xticks(rotation=90)
plt.ylim((0,2000))

$H_0 : \text{there is no significant difference in the mean of Request_Closing_Time across Complaint Types}\\
H_1 : \text{there is a significant difference in the mean of Request_Closing_Time across Complaint Types}$

In [None]:
anova_df=pd.DataFrame()
anova_df["Request_Closing_Time"]=new_df["Request_Closing_Time"]
anova_df["Complaint"]=new_df["Complaint Type"]

anova_df.dropna(inplace=True)
anova_df.head()

In [None]:
lm=ols("Request_Closing_Time~Complaint",data=anova_df).fit()
table=sm.stats.anova_lm(lm)
table

<p style="color:green;">Since the p-value for Complaint is less than 0.01, we reject the null hypothesis in favour of the alternative, i.e. there is a significant difference in the mean response time across complaint types.</p>
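
One-way ANOVA assumes roughly normal residuals and similar variances across groups, which a heavily skewed closing time may not satisfy. A minimal sketch, on made-up closing times, that runs the same test with `scipy.stats.f_oneway` next to the rank-based Kruskal-Wallis test; the latter is an alternative added here for comparison, not part of the analysis above:

```python
from scipy import stats

# Made-up closing times (minutes) for three complaint types
groups = {
    "Blocked Driveway": [280, 300, 310, 295, 305],
    "Illegal Parking": [250, 265, 270, 255, 260],
    "Noise - Commercial": [180, 190, 185, 175, 195],
}

f_stat, p_anova = stats.f_oneway(*groups.values())
h_stat, p_kruskal = stats.kruskal(*groups.values())

alpha = 0.05
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4g}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kruskal:.4g}")
print("Reject H0" if p_anova < alpha else "Fail to reject H0")
```

If both tests agree, the conclusion does not hinge on the normality assumption.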

$H_0 : \text{Complaint  Type and Location Type are independent}\\
H_1 : \text{Complaint Type and Location Type  are  related}$

In [None]:
chi_sq=pd.DataFrame()
chi_sq["Location Type"]=new_df["Location Type"]
chi_sq["Complaint Type"]=new_df["Complaint Type"]

chi_sq.dropna(inplace=True)

In [None]:
data_crosstab = pd.crosstab( chi_sq["Location Type"],chi_sq["Complaint Type"])

In [None]:
stat, p, dof, expected = chi2_contingency(data_crosstab)
print("Chi-square statistic :",stat," dof :",dof," p-value :",p)

alpha = 0.05
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

<p style="color:green;">Since the p-value for the chi-square test is less than 0.05 (level of significance), we can conclude that Complaint Type is dependent on Location Type, i.e. specific types of complaints are raised from specific places.</p>
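
With a large sample, the chi-square test flags even weak associations as significant, so it helps to report an effect size and check the expected counts alongside the p-value. A minimal sketch on a made-up contingency table; the counts are illustrative only, and Cramér's V is an addition here, not something computed in the analysis above:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table: rows are location types, columns complaint types
table = pd.DataFrame(
    [[40, 35, 5], [2, 3, 45]],
    index=["Street/Sidewalk", "Club/Bar/Restaurant"],
    columns=["Blocked Driveway", "Illegal Parking", "Noise - Commercial"],
)

stat, p, dof, expected = chi2_contingency(table)

# Cramér's V: effect size between 0 (no association) and 1 (perfect)
n = table.to_numpy().sum()
cramers_v = np.sqrt(stat / (n * (min(table.shape) - 1)))

print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.4g}")
print(f"Smallest expected count = {expected.min():.1f}")
print(f"Cramér's V = {cramers_v:.2f}")
```

A common rule of thumb is that expected counts should be at least 5 for the chi-square approximation to be reliable; sparse location or complaint categories may need to be merged first.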

<h1 style="color:tomato;"><center>Conclusions</center></h1>
<ul style="color:blue;">
    <li>Most complaints are raised in road and parking (vehicle) related categories.</li>
    <li>On average, complaints are closed within 150 to 300 minutes (Request_Closing_Time is measured in minutes).</li>
    <li>Transport and road related issues take more time to get resolved, as the number of these cases is quite high.</li>
    <li>The number of cases per Borough ranks as follows: BROOKLYN > QUEENS > MANHATTAN > BRONX > STATEN ISLAND.</li>
    <li>Complaint Type is dependent on Location Type.</li>
    <li>The time taken to resolve a complaint differs across complaint types.</li>
</ul>