# SF311 - Final Project

## Table of Contents

* [1. Motivation](#c1)

* [2. Basic Statistics](#c2)
    * [2.1 Data Preparation](#s_2_1)
        * [2.1.1 Import data](#s_2_1_1)
        * [2.1.2 Cleaning and Filtering](#s_2_1_2)
    * [2.2 Exploratory Analysis](#s_2_2)
        * [2.2.1 Basic statistics](#s_2_2_1)
        * [2.2.2 Data exploration](#s_2_2_2)
    * [2.3 Preliminary Conclusions](#s_2_3)
    
* [3. Data Analysis](#c3)
    * [3.1 Temporal Patterns](#s_3_1)
        * [3.1.1 Evolution over time](#s_3_1_1)
        * [3.1.2 Before and After Covid](#s_3_1_2)
    * [3.2 Temporal-Spatial Patterns](#s_3_2)
        * [3.2.1 Investigating Yearly Patterns](#s_3_2_1)
        * [3.2.2 Investigating Monthly Patterns](#s_3_2_2)
        * [3.2.3 Investigating Daily Patterns](#s_3_2_3)
    * [3.3 Cluster Analysis](#s_3_3)
        * [3.3.1 Learning Clusters](#s_3_3_1)
        * [3.3.2 Visualizing and Exploring Clusters](#s_3_3_2)
    * [3.4 Topic Extraction](#s_3_4)
    * [3.5 Other?](#s_3_5)
* [4. Genre](#c4)
* [5. Visualization](#c5)
* [6. Discussion](#c6)
* [7. Contributions](#c7)
* [8. References](#c8)

## 1. Motivation  <a class="anchor" id="c1"></a>

Our final project explores and analyses the complaints that San Francisco citizens file through the SF311 customer service, and provides insights into how these complaints relate to various factors.

Complaints and requests from citizens give us insight into the cares and needs of San Franciscans' everyday lives, across different districts and over time.

They are a direct way for people to communicate the state of the public environment, and analysing them has the potential to help city authorities make San Francisco a nicer place to live.

**We should give answer to these questions: DELETE LATER**

- **What is your dataset?**

- **Why did you choose this/these particular dataset(s)?**

- **What was your goal for the end user's experience?**

## 2. Basic Statistics <a class="anchor" id="c2"></a>
The main dataset used in this project consists of around 4.8 million observations and 47 variables, with a total size of 2.1 GB. The variables the dataset provides are detailed below, together with a preview of its first rows.

**We should give answer to these questions: DELETE LATER**

- **Write about your choices in data cleaning and preprocessing**

- **Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.**


### 2.1 Data Preparation <a class="anchor" id="s_2_1"></a>

#### 2.1.1 Import Data <a class="anchor" id="s_2_1_1"></a>

We start by importing the libraries and the main SF311 dataset. $\color{red}{\text{put in link to dataset}}$. After that we import GeoJSON data about the neighborhoods.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import chart_studio
import chart_studio.plotly as py
import plotly.offline as pyo
import plotly.graph_objects as go
import plotly.express as px
from plotly.graph_objs.scatter.marker import Line


pyo.init_notebook_mode()
chart_studio.tools.set_credentials_file(username='mmestre', api_key='YbVYpQRqmw3RvNPohYBn')

In [None]:
#Import of the main dataset
SF_311 = pd.read_csv("311_cases.csv")

In [None]:
SF311_columns = SF_311.columns

print('The dataset is composed of the following columns:\n'+ str(SF311_columns))
print('\nIts size is '+str(SF_311.shape[0])+' rows and '+str(SF_311.shape[1])+' columns in total.')
SF_311.head()

##### Import of GeoData

We import GeoJSON data that contains the geometry of 41 neighborhoods in SF. The SF311 dataset has the same neighborhoods in one of its columns, but there they are numbered 1-41, so the first task was to couple the GeoJSON neighborhood names with the corresponding neighborhood numbers in the SF311 dataset. This was done by plotting the longitudes and latitudes from both datasets and then manually writing down which ones corresponded to each other.

In [None]:
import geojson

# import neighborhood geojson data
with open('sf_nhood.geojson') as f:
    gj = geojson.load(f)

#Extract the names of neighborhoods
nhoods = []
for i in range(41):
    name = gj['features'][i]['properties']['nhood']
    nhoods.append(name)
    
# map each neighborhood name in the geojson to the corresponding neighborhood number in SF311
name_map=[[1,1],[2,2],[3,5],[4,6],[5,7],[6,8],[7,10],[8,11],[9,12],[10,3],[11,9],[12,14],[13,15],[14,19],[15,36],
          [16,16],[17,17],[18,18],[19,13],[20,32],[21,20],[22,4],[23,21],[24,33],[25,22],[26,23],[27,24],[28,34],[29,35],[30,28],
          [31,29],[32,30],[33,25],[34,26],[35,27],[36,31],[37,37],[38,38],[39,40],[40,41],[41,39]]

#index_gj holds the names for the SF311 neighborhood 
index_gj = ['']*41
for i in range(41):
    name = nhoods[name_map[i][0]-1]
    index = name_map[i][1]-1
    index_gj[index] = name
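
Because the mapping was written down by hand, it may be worth checking that it really is one-to-one before relying on it. The sketch below is only an illustration: `build_index` is a hypothetical helper that is not part of the notebook, and the three neighborhood names are toy values rather than the real 41.

```python
def build_index(names, name_map):
    """Return names reordered so position j-1 holds the name for SF311 number j.

    name_map holds [geojson_position, sf311_number] pairs, both 1-based.
    """
    n = len(names)
    sources = sorted(pair[0] for pair in name_map)
    targets = sorted(pair[1] for pair in name_map)
    expected = list(range(1, n + 1))
    # Both columns must be permutations of 1..n, otherwise a name is lost or duplicated
    if sources != expected or targets != expected:
        raise ValueError("name_map is not a one-to-one mapping over 1..%d" % n)
    index = [""] * n
    for src, dst in name_map:
        index[dst - 1] = names[src - 1]
    return index


# Toy data: three hypothetical neighborhoods, not the real ones
toy_names = ["Alpha", "Beta", "Gamma"]
toy_map = [[1, 2], [2, 3], [3, 1]]
toy_index = build_index(toy_names, toy_map)
print(toy_index)
```

Calling the same helper with `nhoods` and `name_map` would raise an error if a number had been typed twice or skipped.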

#### 2.1.2 Cleaning and Filtering <a class="anchor" id="s_2_1_2"></a>

$\color{red}{\text{Write something about the data is clean (no missing data), making datetimes, picking only whole years, merging and choosing focus requests}}$

In [None]:
df = SF_311.copy()

##### Making relevant datetime columns

In [None]:
# converting to datetime and creating some useful columns for easy filtering
df['Opened_DT'] = pd.to_datetime(df['Opened'], format = '%m/%d/%Y %I:%M:%S %p')
df['Closed_DT'] = pd.to_datetime(df['Closed'], format = '%m/%d/%Y %I:%M:%S %p')
df['Updated_DT'] = pd.to_datetime(df['Updated'], format = '%m/%d/%Y %I:%M:%S %p')

df['Opened_Year'] = df.Opened_DT.dt.year
df['Opened_Month'] = df.Opened_DT.dt.month
df['Opened_Hour'] = df.Opened_DT.dt.hour
df['Opened_Year_Month'] = df.Opened_DT.dt.strftime('%Y-%m')
#df['Opened_Hour_Minute'] = df.Opened_DT.dt.strftime('%I:%M')
df['Opened_Hour_Minute'] = df.Opened_Hour + df.Opened_DT.dt.minute/60
df['DOW_num'] = df.Opened_DT.dt.weekday
df['DOW'] = df.Opened_DT.dt.strftime('%A')
df['DOW_Hour'] = df.Opened_DT.dt.strftime('%A-%H')  # 24-hour clock so AM and PM hours stay distinct
df['Month_Str'] = df.Opened_DT.dt.strftime('%b')
df.sort_values(by=['DOW_num'],inplace=True) #order the dataframe sorting it by day of the week

dow = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sorted_months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
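
To make the derived columns concrete, the following sketch applies the same parsing format to two toy timestamps. The timestamps are invented for illustration and the frame `toy` is not part of the notebook.

```python
import pandas as pd

# Toy timestamps in the same 12-hour format as the 'Opened' column
toy = pd.DataFrame({"Opened": ["07/01/2008 01:05:00 PM", "12/31/2020 11:30:00 PM"]})
toy["Opened_DT"] = pd.to_datetime(toy["Opened"], format="%m/%d/%Y %I:%M:%S %p")

# Derived columns mirroring the ones created above
toy["Opened_Hour"] = toy.Opened_DT.dt.hour
toy["Opened_Hour_Minute"] = toy.Opened_Hour + toy.Opened_DT.dt.minute / 60
toy["DOW_num"] = toy.Opened_DT.dt.weekday  # Monday = 0 ... Sunday = 6
toy["DOW"] = toy.Opened_DT.dt.strftime("%A")
print(toy[["Opened_DT", "Opened_Hour", "Opened_Hour_Minute", "DOW_num", "DOW"]])
```

`Opened_Hour_Minute` is a fractional hour, so 11:30 PM becomes 23.5, which is convenient for plotting on a continuous axis.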

##### Deleting half years
We exclude the years 2008 and 2021 since these are incomplete years.

In [None]:
df_all_years = df.copy()
df = df[df.Opened_Year.between(2009, 2020)]  # between() is inclusive, so this keeps 2009-2020 and drops 2008 and 2021
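
Which years are complete can also be checked from the data rather than by inspection. This is a minimal sketch, assuming a year counts as complete when it has records in all twelve months, which is a heuristic. `complete_years` is a hypothetical helper that is not part of the notebook, and the dates are toy values.

```python
import pandas as pd


def complete_years(opened):
    """Return the years whose records cover all twelve calendar months."""
    months_per_year = opened.dt.month.groupby(opened.dt.year).nunique()
    return sorted(int(y) for y in months_per_year[months_per_year == 12].index)


# Toy series: 2019 has one record per month, 2020 stops after April
toy_opened = pd.Series(pd.to_datetime(
    [f"2019-{m:02d}-15" for m in range(1, 13)]
    + [f"2020-{m:02d}-15" for m in range(1, 5)]
))
print(complete_years(toy_opened))
```

Applied to `df_all_years.Opened_DT`, the same function would list the years that can safely be compared with each other.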

##### Thinning and merging the categories

In [None]:
list_crime_categories = df['Request Type'].unique()
list_crime_categories

#list the number of entries in each category
category_crimes1 = df['Category'].value_counts()
print('The total number of complaint categories is '+ str(len(category_crimes1)))
print('\nThe total count of the top 20 complaint categories is shown below:')
print(category_crimes1[0:20].to_string())
print('\nThe top 20 complaint categories total count is '+str(category_crimes1[0:20].sum()))
c = category_crimes1.cumsum()

category_crimes = c/df['Category'].value_counts().sum()*100
print(category_crimes[19:20].to_string()+'% represents the percentage represented in the top 20 categories')

print('\nThe top 10 complaint categories total count is '+str(category_crimes1[0:10].sum()))
print(category_crimes[9:10].to_string()+'% represents the percentage represented in the top 10 categories')

top_3=category_crimes1/df['Category'].value_counts().sum()*100
category_crimes.plot.bar(figsize = (16,4))

We would also like to focus on the most relevant request types, so we create lists of the most requested ones. We can focus on the 20 most requested categories, which we have calculated to represent **92.6%** of all requests, or on the top 10, which represent **77.7%** of all requests.
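
The coverage figures are cumulative shares of the sorted category counts. The sketch below shows the calculation on toy categories. The data and the helper `top_n_share` are illustrative assumptions, not the real SF311 counts.

```python
import pandas as pd

# Toy categories, not the real SF311 counts
toy_cat = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2)

counts = toy_cat.value_counts()                    # sorted from most to least frequent
cum_share = counts.cumsum() * 100 / counts.sum()   # cumulative percentage of all rows


def top_n_share(cum_share, n):
    """Percentage of all rows covered by the n most frequent categories."""
    return float(cum_share.iloc[n - 1])


print(top_n_share(cum_share, 2))
```

With the real data, `top_n_share(cum_share, 10)` and `top_n_share(cum_share, 20)` would correspond to the two percentages quoted in the text.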

We are also interested in the broader themes of complaints, so we merge selected categories into one, for example the four different categories that relate to MUNI feedback.

In [None]:
focusrequests_20 = ['Street and Sidewalk Cleaning', 'Graffiti', 'Encampments', 
                 'Abandoned Vehicle', 'MUNI Feedback', 'Parking Enforcement', 
                'General Request - PUBLIC WORKS', 'Damaged Property', 'Sewer Issues', 'Tree Maintenance', 
                 'General Request - MTA', 'Illegal Postings', 'Streetlights', 'Street Defects', 'Litter Receptacles', 
                 'Rec and Park Requests', 'SFHA Requests', 'Sign Repair', 'Sidewalk or Curb', 'Noise Report']

focusrequests_10 = ['Street and Sidewalk Cleaning', 'Graffiti', 'Encampments', 
                 'Abandoned Vehicle', 'MUNI Feedback', 'Parking Enforcement', 
                'General Request - PUBLIC WORKS', 'Damaged Property', 'Sewer Issues', 'Tree Maintenance']

focusrequests_22 = ['Street and Sidewalk Cleaning', 'Graffiti', 'Encampments', 'Abandoned Vehicle', 
                 'MUNI Feedback', 'Parking Enforcement', 'General Request - PUBLIC WORKS', 'Damaged Property', 
                 'Sewer Issues', 'Tree Maintenance', 'General Request - MTA', 'Illegal Postings', 'Streetlights', 
                 'Street Defects', 'Litter Receptacles', 'Rec and Park Requests', 'SFHA Requests','Sign Repair',
                 'Sidewalk or Curb','Noise Report','Blocked Street or SideWalk','Homeless Concerns']

In [None]:
### Merge selected 
merg2 = ['MUNI Feedback','Muni Service Feedback', 'Muni Employee Feedback', 'General Request - MUNI']
df.Category = df.Category.replace(merg2[1],merg2[0])
df.Category = df.Category.replace(merg2[2],merg2[0])
df.Category = df.Category.replace(merg2[3],merg2[0])

merg3 = ['Homeless Concerns', 'General Request - HSH']
df.Category = df.Category.replace(merg3[1],merg3[0])
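
The same merge can be expressed with a single dictionary, which keeps every old-to-new pair visible in one place. This is a sketch of an alternative formulation on a toy frame. `merge_map` and `toy_df` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Mapping from each sub-category to the merged category it should join
merge_map = {
    "Muni Service Feedback": "MUNI Feedback",
    "Muni Employee Feedback": "MUNI Feedback",
    "General Request - MUNI": "MUNI Feedback",
    "General Request - HSH": "Homeless Concerns",
}

toy_df = pd.DataFrame(
    {"Category": ["Graffiti", "Muni Service Feedback", "General Request - HSH"]}
)
# Categories that are not keys of the dictionary are left unchanged
toy_df["Category"] = toy_df["Category"].replace(merge_map)
print(toy_df["Category"].tolist())
```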

### 2.2 Basic Statistics and Exploratory Analysis <a class="anchor" id="s_2_2"></a> 

#### 2.2.1 Basic Statistics <a class="anchor" id="s_2_2_1"></a>

The dataset **basic statistics** are briefly outlined in the following overview:

- Total columns: 47
- Total rows: 4770783
- Total number of complaint categories: 103
- Top 20 complaint categories count: 4418865
- Top 20 complaint categories % overall: 92.6%

We can also look at the representation of the top 10 complaint categories.
- Top 10 complaint categories count: 3707477
- Top 10 complaint categories % overall: 77.7%
- The top 3 categories are:
 - _**Street and Sidewalk Cleaning**_ with a total of 1688126 complaints, representing 35.4% of the total.
 - _**Graffiti**_ with a total of 611864 complaints, representing 12.8% of the total.
 - _**Encampments**_ with a total of 289196 complaints, representing 6.1% of the total.
 
The _Request Type_ column gives us insight into what each complaint is actually about.
- Total number of different request types: 1239

The complaints recorded in the dataset range from **July 2008 to April 2021**. Therefore, the whole years in the dataset are **2009 to 2020**.

Similarly, the dataset provides a total of **119 different neighborhoods**, which helps us group the complaints by zone and find particular trends. There is also the wider categorization of _Police District_, as treated in the assignments during the semester, but it has just **12 different districts**, which is much coarser.

The dataset also contains two columns called _Longitude_ and _Latitude_, which hold geographic coordinates and are very useful for obtaining insights about the location of the complaints.

The sources through which SF311 receives complaints from SF citizens are:
- Email, Integrated Agency, Mail, Mobile/Open311, Other Department, Phone, Twitter and Web.

#### 2.2.2 Data exploration <a class="anchor" id="s_2_2_2"></a>

##### Which categories could be interesting to investigate more?

In [None]:
focusrequests = focusrequests_10
df3 = df[df.Category.isin(focusrequests)]

df_complain_count = df3.groupby(['Category'], as_index=False).count()

fig = go.Figure([go.Bar(x=df_complain_count['Category'], y=df_complain_count['CaseID'])])

# Set titles
fig.update_layout(
    title="Count of complaints 2009-2021",
    xaxis_title="Complaint category",
    yaxis_title="Count of complaints",
    autosize=False,
    width=980,
    height=500,
)
fig.show()

In [None]:
# pick categories
d_sub1 = df.loc[df.Category.isin(["Tree Maintenance"])]["Longitude"]
d_sub2 = df.loc[df.Category.isin(["Graffiti"])]["Longitude"]
d_sub3 = df.loc[df.Category.isin(["Encampments"])]["Longitude"]

d_sub1 = d_sub1[d_sub1 < -122]
d_sub2 = d_sub2[(d_sub2 < -122) & (d_sub2>=-123)]
d_sub3 = d_sub3[d_sub3 < -122]

# plot categories in one histogram
plt.figure(figsize = (8,5))
plt.hist(d_sub1, bins = 100, alpha = 0.7)
plt.hist(d_sub2, bins = 100, alpha = 0.7)
plt.hist(d_sub3, bins = 100, alpha = 0.7)
plt.title('Histogram of request counts by longitude')
plt.show()
# pick categories
d_sub1 = df.loc[df.Category.isin(["Tree Maintenance"])]["Latitude"]
d_sub2 = df.loc[df.Category.isin(["Graffiti"])]["Latitude"]
d_sub3 = df.loc[df.Category.isin(["Encampments"])]["Latitude"]

d_sub1 = d_sub1[d_sub1 > 35]
d_sub2 = d_sub2[d_sub2 > 35]
d_sub3 = d_sub3[d_sub3 > 35]

# plot categories in one histogram
plt.figure(figsize = (8,5))
plt.hist(d_sub1, bins = 100, alpha = 0.7)
plt.hist(d_sub2, bins = 100, alpha = 0.7)
plt.hist(d_sub3, bins = 100, alpha = 0.7)
plt.title('Histogram of request counts by latitude')
plt.show()
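
The coordinate filters above can be collected into one helper so that every category is cut in the same way. This is only a sketch: `in_bounds` is a hypothetical function, the thresholds are taken from the cell above (where the lower longitude bound of -123 is applied only to Graffiti), and the rows are toy values.

```python
import pandas as pd


def in_bounds(frame, lon_range=(-123.0, -122.0), lat_min=35.0):
    """Keep rows whose coordinates fall inside the range used for the histograms."""
    lon_ok = (frame["Longitude"] >= lon_range[0]) & (frame["Longitude"] < lon_range[1])
    lat_ok = frame["Latitude"] > lat_min
    return frame[lon_ok & lat_ok]


# Toy rows: two plausible San Francisco points and one placeholder at (0, 0)
toy_geo = pd.DataFrame({
    "Longitude": [-122.4, 0.0, -122.5],
    "Latitude": [37.77, 0.0, 37.70],
})
kept = in_bounds(toy_geo)
print(len(kept))
```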

### 2.3 Preliminary Conclusions <a class="anchor" id="s_2_3"></a>

## 3. Data Analysis <a class="anchor" id="c3"></a>

### 3.1 Temporal Patterns <a class="anchor" id="s_3_1"></a>

#### 3.1.1 Evolution over time <a class="anchor" id="s_3_1_1"></a>

##### Complaint type over time

In [None]:
df_c = df3

df_c_year = df_c.groupby(['Opened_Year','Category'], as_index=False).count()
df_c_year_month = df_c.groupby(['Opened_Year_Month','Category'], as_index=False).count()

fig = go.Figure() #Initialization of the figure (Plotly - Graph Objects)

complaints_list = list()

for complaint_type in df_c['Category'].unique():
    complaints_list.append(str(complaint_type))
    fig.add_trace(
        go.Bar(x = df_c_year[df_c_year['Category']==complaint_type]['Opened_Year'],
               y = df_c_year[df_c_year['Category']==complaint_type]['CaseID'],
               name = complaint_type))
    fig.add_trace(
        go.Bar(x = df_c_year_month[df_c_year_month['Category']==complaint_type]['Opened_Year_Month'],
               y = df_c_year_month[df_c_year_month['Category']==complaint_type]['CaseID'],
               visible=False,
               name = complaint_type))


# Add dropdowns
button_layer_1_height = 1.12
fig.update_layout(
    updatemenus=[
        dict(           
            buttons=list([
                dict(label="All by Year",
                     method="update",
                     args=[{"visible": [True, False] * 10},  # 10 categories x (yearly, monthly) traces
                           {"title": "Complaint type"}]),
                dict(label="All by Year-Month",
                     method="update",
                     args=[{"visible": [False, True] * 10},  # 10 categories x (yearly, monthly) traces
                           {"title": "Complaint type"}]),
            ]),            
            type="buttons",
            direction="right",
            active=0,
            x=1.0,
            y=1.2,
        ),
        dict(
            buttons=list([
                dict(label="All by Year",
                     method="update",
                     args=[{"visible": [True, False] * 10},  # 10 categories x (yearly, monthly) traces
                           {"title": "Complaint type:"}]),
                dict(label=str(complaints_list[0]),
                     method="update",
                     args=[{"visible": [True] + 19*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[0]) + " by Year"}]),
                dict(label=str(complaints_list[1]),
                     method="update",
                     args=[{"visible": 2*[False] + [True] + 17*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[1]) + " by Year"}]),
                dict(label=str(complaints_list[2]),
                     method="update",
                     args=[{"visible": 4*[False] + [True] + 15*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[2]) + " by Year"}]),
                dict(label=str(complaints_list[3]),
                     method="update",
                     args=[{"visible": 6*[False] + [True] + 13*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[3]) + " by Year"}]),
                dict(label=str(complaints_list[4]),
                     method="update",
                     args=[{"visible": 8*[False] + [True] + 11*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[4]) + " by Year"}]),
                dict(label=str(complaints_list[5]),
                     method="update",
                     args=[{"visible": 10*[False] + [True] + 9*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[5]) + " by Year"}]),
                dict(label=str(complaints_list[6]),
                     method="update",
                     args=[{"visible": 12*[False] + [True] + 7*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[6]) + " by Year"}]),
                dict(label=str(complaints_list[7]),
                     method="update",
                     args=[{"visible": 14*[False] + [True] + 5*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[7]) + " by Year"}]),
                dict(label=str(complaints_list[8]),
                     method="update",
                     args=[{"visible": 16*[False] + [True] + 3*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[8]) + " by Year"}]),
                dict(label=str(complaints_list[9]),
                     method="update",
                     args=[{"visible": 18*[False] + [True] + [False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[9]) + " by Year"}]),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.43,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
        dict(
            buttons=list([
                dict(label="All by Year-Month",
                     method="update",
                     args=[{"visible": [False, True] * 10},  # 10 categories x (yearly, monthly) traces
                           {"title": "Complaint type:"}]),
                dict(label=str(complaints_list[0]),
                     method="update",
                     args=[{"visible": [False] + [True] + 18*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[0]) + " by Year-month"}]),
                dict(label=str(complaints_list[1]),
                     method="update",
                     args=[{"visible": 3*[False] + [True] + 16*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[1]) + " by Year-month"}]),
                dict(label=str(complaints_list[2]),
                     method="update",
                     args=[{"visible": 5*[False] + [True] + 14*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[2]) + " by Year-month"}]),
                dict(label=str(complaints_list[3]),
                     method="update",
                     args=[{"visible": 7*[False] + [True] + 12*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[3]) + " by Year-month"}]),
                dict(label=str(complaints_list[4]),
                     method="update",
                     args=[{"visible": 9*[False] + [True] + 10*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[4]) + " by Year-month"}]),
                dict(label=str(complaints_list[5]),
                     method="update",
                     args=[{"visible": 11*[False] + [True] + 8*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[5]) + " by Year-month"}]),
                dict(label=str(complaints_list[6]),
                     method="update",
                     args=[{"visible": 13*[False] + [True] + 6*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[6]) + " by Year-month"}]),
                dict(label=str(complaints_list[7]),
                     method="update",
                     args=[{"visible": 15*[False] + [True] + 4*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[7]) + " by Year-month"}]),
                dict(label=str(complaints_list[8]),
                     method="update",
                     args=[{"visible": 17*[False] + [True] + 2*[False]},
                           {"title": "Complaint type:<br>" + str(complaints_list[8]) + " by Year-month"}]),
                dict(label=str(complaints_list[9]),
                     method="update",
                     args=[{"visible": 19*[False] + [True]},
                           {"title": "Complaint type:<br>" + str(complaints_list[9]) + " by Year-month"}]),          
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.72,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
    ]
)
    
fig.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=3,
                     label="Last 3 years",
                     step="year",
                     stepmode="backward"),
                dict(count=5,
                     label="Last 5 Years",
                     step="year",
                     stepmode="backward"),
                dict(count=10,
                     label="Last 10 years",
                     step="year",
                     stepmode="backward"),
                dict(step="all", label="All")
            ]),
            x=0.37,
            y=1.13
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

    
# Set titles
fig.update_layout(
    title="Complaint count over time",
    xaxis_title="Date",
    yaxis_title="Count of complaints",
    autosize=False,
    width=1000,
    height=700,
)

fig.update_layout(legend=dict(x=0, y=1, bgcolor='rgba(255, 255, 255, 0)'))
fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='complain-count-dropdown')
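
The `visible` lists in the dropdown definitions follow a regular pattern: each category contributes one yearly and one monthly trace, in that order. A helper along the following lines could generate them instead of writing them out by hand. `visibility_masks` is a hypothetical function that is not part of the notebook, and it only builds the boolean lists, not the Plotly buttons themselves.

```python
def visibility_masks(n_groups, traces_per_group=2):
    """Build Plotly 'visible' lists for figures that add traces group by group."""
    total = n_groups * traces_per_group
    masks = {}
    for slot in range(traces_per_group):
        # Show trace number `slot` of every group, e.g. all yearly traces
        masks[("all", slot)] = [i % traces_per_group == slot for i in range(total)]
        for g in range(n_groups):
            # Show only trace number `slot` of group g
            target = g * traces_per_group + slot
            masks[(g, slot)] = [i == target for i in range(total)]
    return masks


masks = visibility_masks(n_groups=10)
print(len(masks[("all", 0)]), masks[(1, 0)].index(True))
```

Here slot 0 stands for the yearly trace and slot 1 for the monthly one, so `masks[(3, 1)]` would be the list for the fourth category by year-month. Every list has the same length as the number of traces in the figure.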

##### Source type over time

In [None]:
focusrequests = focusrequests_10
df_c = df[df.Category.isin(focusrequests)]

df_c_year = df_c.groupby(['Opened_Year','Source'], as_index=False).count()
df_c_year_month = df_c.groupby(['Opened_Year_Month','Source'], as_index=False).count()

fig = go.Figure() #Initialization of the figure (Plotly - Graph Objects)

source_list = ['Twitter','Integrated Agency','Web','Phone','Mobile/Open311']
source_list_2 = list()

for source_type in source_list:
    source_list_2.append(str(source_type))
    fig.add_trace(
        go.Bar(x = df_c_year[df_c_year['Source']==source_type]['Opened_Year'],
               y = df_c_year[df_c_year['Source']==source_type]['CaseID'],
               name = source_type))
    fig.add_trace(
        go.Bar(x = df_c_year_month[df_c_year_month['Source']==source_type]['Opened_Year_Month'],
               y = df_c_year_month[df_c_year_month['Source']==source_type]['CaseID'],
               visible=False,
               name = source_type))


# Add dropdowns
button_layer_1_height = 1.12
fig.update_layout(
    updatemenus=[
        dict(           
            buttons=list([
                dict(label="All by Year",
                     method="update",
                     args=[{"visible": [True, False] * 5},  # 5 sources x (yearly, monthly) traces
                           {"title": "Source type"}]),
                dict(label="All by Year-Month",
                     method="update",
                     args=[{"visible": [False, True] * 5},  # 5 sources x (yearly, monthly) traces
                           {"title": "Source type"}]),
            ]),            
            type="buttons",
            direction="right",
            active=0,
            x=1.0,
            y=1.2,
        ),
        dict(
            buttons=list([
                dict(label="All by Year",
                     method="update",
                     args=[{"visible": [True, False] * 5},  # 5 sources x (yearly, monthly) traces
                           {"title": "Source type:"}]),
                dict(label=str(source_list_2[0]),
                     method="update",
                     args=[{"visible": [True] + 11*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[0]) + " by Year"}]),
                dict(label=str(source_list_2[1]),
                     method="update",
                     args=[{"visible": 2*[False] + [True] + 9*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[1]) + " by Year"}]),
                dict(label=str(source_list_2[2]),
                     method="update",
                     args=[{"visible": 4*[False] + [True] + 7*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[2]) + " by Year"}]),
                dict(label=str(source_list_2[3]),
                     method="update",
                     args=[{"visible": 6*[False] + [True] + 5*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[3]) + " by Year"}]),
                dict(label=str(source_list_2[4]),
                     method="update",
                     args=[{"visible": 8*[False] + [True] + 3*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[4]) + " by Year"}]),
                #dict(label=str(source_list_2[5]),
                #     method="update",
                #     args=[{"visible": 10*[False] + [True] + [False]},
                #           {"title": "Source type:<br>" + str(source_list_2[5]) + " by Year"}]),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.64,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
        dict(
            buttons=list([
                dict(label="All by Year-Month",
                     method="update",
                     args=[{"visible": [False, True] * 5},  # 5 sources x (yearly, monthly) traces
                           {"title": "Source type:"}]),
                dict(label=str(source_list_2[0]),
                     method="update",
                     args=[{"visible": [False] + [True] + 10*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[0]) + " by Year-month"}]),
                dict(label=str(source_list_2[1]),
                     method="update",
                     args=[{"visible": 3*[False] + [True] + 8*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[1]) + " by Year-month"}]),
                dict(label=str(source_list_2[2]),
                     method="update",
                     args=[{"visible": 5*[False] + [True] + 6*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[2]) + " by Year-month"}]),
                dict(label=str(source_list_2[3]),
                     method="update",
                     args=[{"visible": 7*[False] + [True] + 4*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[3]) + " by Year-month"}]),
                dict(label=str(source_list_2[4]),
                     method="update",
                     args=[{"visible": 9*[False] + [True] + 2*[False]},
                           {"title": "Source type:<br>" + str(source_list_2[4]) + " by Year-month"}]),
                #dict(label=str(source_list_2[5]),
                #     method="update",
                #     args=[{"visible": 11*[False] + [True]},
                #           {"title": "Source type:<br>" + str(source_list_2[5]) + " by Year-month"}]),          
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.825,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
    ]
)

# Add range slider
fig.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=3,
                     label="Last 3 years",
                     step="year",
                     stepmode="backward"),
                dict(count=5,
                     label="Last 5 Years",
                     step="year",
                     stepmode="backward"),
                dict(count=10,
                     label="Last 10 years",
                     step="year",
                     stepmode="backward"),
                dict(step="all", label="All")
            ]),
            x=0.37,
            y=1.13
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

    
# Set titles
fig.update_layout(
    title="Complaint count over time by source",
    xaxis_title="Date",
    yaxis_title="Count of complaints by source",
    autosize=False,
    width=1000,
    height=700,
)

fig.update_layout(legend=dict(x=0, y=1, bgcolor='rgba(255, 255, 255, 0)'))
fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='complaints-by-source-count-dropdown')

##### Complaint count by hours of the week

In [None]:
### we first set out the shape and number of graphs we require
focusrequests_10 = ['Street and Sidewalk Cleaning', 'Graffiti', 'Encampments', 
                 'Abandoned Vehicle', 'MUNI Feedback', 'Parking Enforcement', 
                'General Request - PUBLIC WORKS', 'Damaged Property', 'Sewer Issues', 'Tree Maintenance']
fig, axs = plt.subplots(nrows = 5, ncols = 2, figsize = (18,12))
fig.patch.set_facecolor('#E7E7E7')
fig.suptitle('Complaint count by hours of the week', fontsize=20,x=0.5,y=0.91)
dow = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow2 = dow.copy()
dow2.append('')

for i, cat in enumerate(focusrequests_10):    # in this loop we accordingly detail each one of the plots
    ax = axs[i//2,i%2]
    df_ct = df[df.Category == cat]
    df_ct_how = df_ct.groupby(['DOW','Opened_Hour']).Opened_Hour.count() 
    df_ct_how_sorted = df_ct_how[dow]
    df_ct_how_sorted.plot(kind = 'line', ax = ax, rot=0, color='black', ylim=(0,max(df_ct_how_sorted)*1.25),
                         # ylabel='Complaint count'
                         )
    ax.text(x=5,y=max(df_ct_how_sorted)*1.05,s = cat, fontsize=12)
    ax.tick_params(axis='y', direction='in')
    ax.set_xlabel(None)
    ax.grid(linestyle='-.', linewidth='0.9', axis='x')
    ax.set_xticks(range(0,169,24))
    ax.set_xticklabels(dow2, ha='left')
    
#fig.savefig('complaint-count-by-hours-of-theweek.png',dpi=200)

##### Complaint count by 24h cycle

In [None]:
# we first set out the shape and number of graphs we require
fig, axs = plt.subplots(nrows = 5, ncols = 2, figsize = (18,12))
fig.patch.set_facecolor('#E7E7E7')
fig.suptitle('24 hour cycle complaint count', fontsize=18,x=0.5,y=0.9)

for i, cat in enumerate(focusrequests_10):    # in this loop we accordingly detail each one of the plots
    ax = axs[i//2,i%2]
    df_ct = df[df.Category == cat]
    df_ct_h = df_ct.groupby('Opened_Hour').Opened_Hour.count()
    df_ct_h.plot(kind = 'bar', ax = ax, rot=0, align='center', width=0.5, color='grey', 
                 edgecolor='black', ylim=(0,max(df_ct_h)*1.25), 
                 #ylabel='Complaint count'
                )
    ax.text(x=1,y=max(df_ct_h)*1.05,s = cat, fontsize=12)
    ax.set_xlabel(xlabel=None)
    ax.tick_params(axis='both', direction='in')
#fig.savefig('complaint-count-by-24h-cycle.png',dpi=200)

##### Complaint count by month

In [None]:
# we first set out the shape and number of graphs we require
fig, axs = plt.subplots(nrows = 5, ncols = 2, figsize = (18,12))
fig.patch.set_facecolor('#E7E7E7')
fig.suptitle('Monthly complaint count', fontsize=18,x=0.5,y=0.9)

for i, cat in enumerate(focusrequests_10):    # in this loop we accordingly detail each one of the plots
    ax = axs[i//2,i%2]
    df_ct = df[df.Category == cat]
    df_ct_mth = df_ct.groupby('Month_Str').Month_Str.count() 
    df_ct_mth_sorted = df_ct_mth[sorted_months]
    df_ct_mth_sorted.plot(kind = 'bar', ax = ax, rot=0, align='center', width=0.5, color='grey', 
                          edgecolor='black', ylim=(0,max(df_ct_mth_sorted)*1.25), 
                          #ylabel='Complaint count'
                         )
    ax.text(x=0,y=max(df_ct_mth_sorted)*1.05,s = cat, fontsize=12)
    ax.set_xlabel(xlabel=None)
    ax.tick_params(axis='both', direction='in')
#fig.savefig('monthly-complaint-count.png',dpi=200)

##### Complaint count by week days

In [None]:
# we first set out the shape and number of graphs we require
fig, axs = plt.subplots(nrows = 5, ncols = 2, figsize = (18,16))
fig.patch.set_facecolor('#E7E7E7')
fig.suptitle('Complaint count per weekday', fontsize=18 ,x=0.5, y=0.9)

for i, cat in enumerate(focusrequests_10):    # in this loop we accordingly detail each one of the plots
    ax = axs[i//2,i%2]
    df_ct = df[df.Category == cat]
    df_ct_dow = df_ct.groupby('DOW').DOW.count() 
    df_ct_dow_sorted = df_ct_dow[dow]
    df_ct_dow_sorted.plot(kind = 'bar', ax = ax, rot=0, align='center', width=0.5, color='grey', 
                          edgecolor='black', ylim=(0,max(df_ct_dow_sorted)*1.25), 
                          #ylabel='Complaint count'
                         )
    ax.text(x=0,y=max(df_ct_dow_sorted)*1.05,s = cat, fontsize=12)
    ax.set_xlabel(xlabel=None)
    ax.tick_params(direction='in')
#fig.savefig('weekly-complaint-count.png',dpi=200)

#### 3.1.2 Before and After Covid <a class="anchor" id="s_3_1_2"></a>

##### Complaints by day of the week pre and post covid

###### Graffiti

In [None]:
#Focusing on Graffiti behaviour the days of the week before and after Covid.
dow = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
SEED = 123
n_samples = 9500

df_g = df[df['Category'] == 'Graffiti'] #focusing on one category of complaint
#df.sort_values(by=['DOW_num'],inplace=True) #order the dataframe sorting it by day of the week

df_pre = df_g[(df_g.Opened_Year.between(2016,2019))].sample(n_samples,random_state = SEED) #creating a df for pre covid
df_post = df_g[(df_g.Opened_Year.between(2020,2021))].sample(n_samples,random_state = SEED) #creating a df for post and during covid

fig = go.Figure()


fig.add_trace(go.Violin(x=df_pre['DOW'],
                        y=df_pre['Opened_Hour_Minute'],
                        legendgroup='PreCovid', scalegroup='PreCovid', name='PreCovid',
                        side='negative',
                        line_color='blue', meanline_visible=True, box_visible=True, opacity=0.9, width=1)
             )

fig.add_trace(go.Violin(x=df_post['DOW'],
                        y=df_post['Opened_Hour_Minute'],
                        legendgroup='PostCovid', scalegroup='PostCovid', name='PostCovid',
                        side='positive',
                        line_color='orange', meanline_visible=True, box_visible=True, opacity=0.9, width=1)
             )

fig.update_traces(meanline_visible=True)
fig.update_layout(violingap=0, violinmode='overlay',
                  title="Evolution of complaints throughout the hours of the day Pre vs Post Covid for Graffiti",
                  xaxis_title="Day of the Week",
                  yaxis_title="Hours of the day",
                  autosize=False,
                  width=980,
                  height=500,
                  yaxis = dict(
                    tickmode = 'array',
                    tickvals = [0,2,4,6,8,10,12,14,16,18,20,22,24]),
                  xaxis=dict(range=[-0.5, 6.5])
                  ) 


fig.show()
#py.plot(fig, filename='covid-violin-plot-graffiti')

###### Tree maintenance

In [None]:
#Focusing on Tree Maintenance behaviour the days of the week before and after Covid.
dow = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
SEED = 123
n_samples = 9500

df_T = df[df['Category'] == 'Tree Maintenance'] #focusing on one category of complaint
#df.sort_values(by=['DOW_num'],inplace=True) #order the dataframe sorting it by day of the week

df_pre = df_T[(df_T.Opened_Year.between(2016,2019))].sample(n_samples,random_state = SEED) #creating a df for pre covid
df_post = df_T[(df_T.Opened_Year.between(2020,2021))].sample(n_samples,random_state = SEED) #creating a df for post and during covid

fig = go.Figure()


fig.add_trace(go.Violin(x=df_pre['DOW'],
                        y=df_pre['Opened_Hour_Minute'],
                        legendgroup='PreCovid', scalegroup='PreCovid', name='PreCovid',
                        side='negative',
                        line_color='blue', meanline_visible=True, box_visible=True, opacity=0.9, width=1)
             )

fig.add_trace(go.Violin(x=df_post['DOW'],
                        y=df_post['Opened_Hour_Minute'],
                        legendgroup='PostCovid', scalegroup='PostCovid', name='PostCovid',
                        side='positive',
                        line_color='orange', meanline_visible=True, box_visible=True, opacity=0.9, width=1)
             )

fig.update_traces(meanline_visible=True)
fig.update_layout(violingap=0, violinmode='overlay',
                  title="Evolution of complaints throughout the hours of the day Pre vs Post Covid for Tree Maintenance",
                  xaxis_title="Day of the Week",
                  yaxis_title="Hours of the day",
                  autosize=False,
                  width=980,
                  height=500,
                  yaxis = dict(
                    tickmode = 'array',
                    tickvals = [0,2,4,6,8,10,12,14,16,18,20,22,24]),
                  xaxis=dict(range=[-0.5, 6.5])
                  ) 


fig.show()
#py.plot(fig, filename='covid-violin-plot-tree-maintenance')

###### Encampments

In [None]:
dow = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
SEED = 123
n_samples = 9500

df_E = df[df['Category'] == 'Encampments'] #focusing on one category of complaint
#df.sort_values(by=['DOW_num'],inplace=True) #order the dataframe sorting it by day of the week

df_pre = df_E[(df_E.Opened_Year.between(2016,2019))].sample(n_samples,random_state = SEED) #creating a df for pre covid
df_post = df_E[(df_E.Opened_Year.between(2020,2021))].sample(n_samples,random_state = SEED) #creating a df for post and during covid

fig = go.Figure()


fig.add_trace(go.Violin(x=df_pre['DOW'],
                        y=df_pre['Opened_Hour_Minute'],
                        legendgroup='PreCovid', scalegroup='PreCovid', name='PreCovid',
                        side='negative',
                        line_color='blue', meanline_visible=True, box_visible=True, opacity=0.9, width=1)
             )

fig.add_trace(go.Violin(x=df_post['DOW'],
                        y=df_post['Opened_Hour_Minute'],
                        legendgroup='PostCovid', scalegroup='PostCovid', name='PostCovid',
                        side='positive',
                        line_color='orange', meanline_visible=True, box_visible=True, opacity=0.9, width=1)
             )

fig.update_traces(meanline_visible=True)
fig.update_layout(violingap=0, violinmode='overlay',
                  title="Evolution of complaints throughout the hours of the day Pre vs Post Covid for Encampments",
                  xaxis_title="Day of the Week",
                  yaxis_title="Hours of the day",
                  autosize=False,
                  width=980,
                  height=500,
                  yaxis = dict(
                    tickmode = 'array',
                    tickvals = [0,2,4,6,8,10,12,14,16,18,20,22,24]),
                  xaxis=dict(range=[-0.5, 6.5])
                  ) 


fig.show()
#py.plot(fig, filename='covid-violin-plot-encampments')

### 3.2 Temporal- Spatial patterns <a class="anchor" id="s_3_2"></a>

We start by subsetting the relevant columns to decrease the dataset size and make the code run more efficiently.

For all but the section investigating yearly patterns, we chose to focus on the years 2016-2020 to make the analysis more relevant to the present-day situation.

In [None]:
# select the relevant columns
df_3_2 = df.loc[:,['Category', 'Longitude', 'Latitude', 'Opened_Year', 'Opened_Month', 'Opened_Hour']]

# select the most recent years 2016-2020
df_3_2 = df_3_2.loc[df_3_2.Opened_Year.isin([2016, 2017, 2018, 2019, 2020])]

# subsetting separate dataset for yearly development
d_year = df.loc[df.Category.isin(["Graffiti", 'Encampments', 'Tree Maintenance'])]

#### Visualising geographical distribution of service complaints.

In the following we will be visualising the geographical distribution of service complaints within different categories. We will be looking at the development of this spatial distribution over years, months and hours, thereby investigating both the spatial and temporal development of the service complaints.

For this purpose we constructed a scatterplot taking the longitude and latitude of each service complaint and plotting these points on a map of San Francisco. When plotting observations we defined two different approaches:
1. Sample a fixed number of complaints from each category.
2. Sample a fraction of complaints from each category.

The first approach only allows us to investigate how the spatial patterns within the categories develop over time, whereas the second approach also allows us to investigate how the total number of complaints changes across years, months and hours. Generally a scatterplot is not well suited for visualising count data, but when used in conjunction with the bar plot in part XX, the scatter plot allows us to investigate whether these changes in the total number of complaints are due to local or global changes in the geographical distribution of complaints.
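
The difference between the two sampling approaches can be sketched without the real data. The snippet below is a library-free illustration on made-up group sizes; the helper names `sample_fixed` and `sample_fraction` and the toy counts are assumptions for illustration, not part of the notebook. In the notebook itself the same idea is expressed with the pandas `sample(n, ...)` and `sample(frac=...)` calls.

```python
import random

def sample_fixed(groups, n, seed=123):
    """Approach 1: draw the same number of items from every group."""
    rng = random.Random(seed)
    return {name: rng.sample(items, n) for name, items in groups.items()}

def sample_fraction(groups, frac, seed=123):
    """Approach 2: draw the same fraction of items from every group."""
    rng = random.Random(seed)
    return {name: rng.sample(items, round(len(items) * frac))
            for name, items in groups.items()}

# Toy data: complaint ids per month (sizes are invented for illustration).
groups = {"January": list(range(1000)), "December": list(range(400))}

fixed = sample_fixed(groups, n=100)
frac = sample_fraction(groups, frac=0.1)

# Fixed-size samples hide differences in volume between the groups.
print({name: len(items) for name, items in fixed.items()})
# Fractional samples keep the relative sizes of the groups.
print({name: len(items) for name, items in frac.items()})
```

With a fixed sample size every slider step shows the same number of points, so only the spatial pattern can change; with a fixed fraction, busier periods also show more points.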

**Focus categories:**

In this part we chose to focus on the categories "Graffiti", "Encampments" and "Tree Maintenance" out of the 101 different categories, to keep the analysis to a few relevant categories. We chose these specific categories since they cover a diverse set of complaint types in the city of San Francisco.


**Notice**:

Due to constraints in upload capacity we were not able to upload the full scatterplots to the website. We therefore plot around 1500 fewer samples per category on the website than in the notebook, so the highlighted patterns might not be as evident there. Please refer to the notebook for the full-size visualisations.

#### Investigating yearly patterns... <a class="anchor" id="s_3_2_1"></a>

Let's start by investigating the yearly pattern for the three categories...
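
The map in the following cell adds one trace per category and year and uses a slider to toggle which traces are shown. The helper below is a minimal sketch of how the visibility mask for each slider step could be built explicitly when the traces are added category by category. `visibility_for_step` and the toy sizes are hypothetical and not part of the notebook, and the sketch assumes every category contributes exactly one trace per slider step.

```python
def visibility_for_step(step_idx, n_steps, n_categories):
    """Return one boolean per trace, True for the traces that belong to step_idx.

    Assumes the traces were added category by category, each category
    contributing n_steps traces in the same step order.
    """
    return [(trace_idx % n_steps) == step_idx
            for trace_idx in range(n_steps * n_categories)]

# Toy example: 3 categories and 4 slider steps give 12 traces.
mask = visibility_for_step(step_idx=1, n_steps=4, n_categories=3)
print(mask)
```

Each mask can then be used as the `visible` list of the corresponding slider step, in the same `args=[{"visible": ...}]` form used in the notebook. If a category has no data for some year, the assumption above no longer holds and the mask would need to be built from the actual list of traces.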

In [None]:
mapbox_access_token = 'pk.eyJ1IjoibWFkc2JpcmNoIiwiYSI6ImNrb2g2MWd0ZDEzMTcydXRyeHFudGV4cHMifQ.1vc7_kQJefvlgOm8hP9mxA'

# sample n service requests
SEED = 123
n_samples = 1500

# Create figure
fig = go.Figure()

# Constants
img_width = 900
img_height = 600


cat_list = []
    
for j, cat in d_year.groupby('Category'):
    cat_list.append(j)
    for i, year in cat.groupby('Opened_Year'):
        # lon and lat
        lon = year.Longitude.sample(n_samples, random_state = SEED)
        lat = year.Latitude.sample(n_samples, random_state = SEED)

        # plot
        fig.add_trace(
            go.Scattermapbox(
            lat = lat,
            lon = lon,
            mode='markers',
            visible = False,
            name = j,
            hoverinfo='skip',
            marker=go.scattermapbox.Marker(
                size=5,
                opacity = 0.4
            )
        ))
    
fig.update_layout(
    #autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=37.765,
            lon=-122.431297
        ),
        pitch=0,
        zoom=10.7
    ),
)

# Create and add slider
years = ['2009','2010','2011','2012','2013','2014','2015','2016','2017','2018', '2019','2020']
steps = []
for idx, val in enumerate(years):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(years)}],
        label = val
    )
    step["args"][0]["visible"][idx] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=0,
    currentvalue={'visible': False, "prefix": "Year: "},
    pad={"t": 30, 'r':10, 'l':10, 'b':10},
    steps=steps
)]

# update layout
fig.update_layout(
    title="Service Requests by Category over Year",
    xaxis_title= "Year",
    autosize = False,
    width=img_width,
    height=img_height,
    sliders=sliders
)

fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='geo-plot-year')

Overall, 'Encampments' and 'Graffiti' complaints occur mostly in the city center. Interestingly, however, both occur more often in the outer regions of the city in the early years 2009-2011 than in the later years 2018-2019. For instance, in the southern region of SF there is a cluster of 'Graffiti' complaints in the early years, which has almost completely dissolved by 2015.

Based on this development we can say that 'Graffiti' and 'Encampments' are most likely to occur in the city center of SF. Hence, future efforts to reduce the number of complaints for these categories should be geographically focused at the city center.

'Tree Maintenance' complaints are more evenly distributed across the city and remain so over the years. Hence, there is no clear change in the spatial pattern of this complaint type over the years.

Let's zoom in and investigate developments in the geographical distribution on a monthly basis.

#### Investigating monthly patterns... <a class="anchor" id="s_3_2_2"></a>

When investigating the bar plot in part XX we noticed a kind of "beginning-of-year effect" in the distribution of 'Graffiti' complaints. Every January-February the number of complaints increases suddenly and keeps increasing throughout March-April, after which it drops again until it reaches its low in December. This effect is particularly evident in the years 2016-2019, so we will focus on these years in the following.

Let's see if this trend is due to local or global changes in the geographical distribution of complaints...

In [None]:
# select single category and restrict to the years 2016-2019 discussed in the text
d_sub = df_3_2.loc[(df_3_2.Category == "Graffiti") & df_3_2.Opened_Year.between(2016, 2019)]

# sample n service requests
SEED = 123
frac = 0.05

# Create figure
fig = go.Figure()

# Constants
img_width = 900
img_height = 600

for i, month in d_sub.groupby('Opened_Month'):
    # lon and lat
    lon = month.Longitude.sample(frac = frac, random_state = SEED)
    lat = month.Latitude.sample(frac = frac, random_state = SEED)

    # plot
    fig.add_trace(
        go.Scattermapbox(
        lat = lat,
        lon = lon,
        mode='markers',
        visible = False,
        name = i,
        hoverinfo='skip',
        marker=go.scattermapbox.Marker(
            size=5,
            opacity = 0.7
        )
    ))
    

fig.update_layout(
    #autosize=True,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=37.773972,
            lon=-122.431297
        ),
        pitch=0,
        zoom=10.7
    ),
)


# Create and add slider
months = list(range(1,13))
steps = []
for idx, val in enumerate(months):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(months)}],
        label = val
    )
    step["args"][0]["visible"][idx] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=0,
    currentvalue={'visible': False, "prefix": "Month: "},
    pad={"t": 30, 'r':10, 'l':10, 'b':10},
    steps=steps
)]

# update layout
fig.update_layout(
    title="Graffiti Service Requests by Month for the Years 2016-2019",
    xaxis_title= "Month",
    autosize = False,
    width=img_width,
    height=img_height,
    sliders=sliders
)

fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='geo-plot-month')

**Graffiti**
In the plot we see that the "beginning-of-year effect" results in a global increase in the number of complaints. We only see a small local increase in complaint count, for instance in the southern area of the city where we identified a cluster in the previous section.

All in all, the spatial pattern is very similar and the overall increase in 'Graffiti' complaints is not driven by any local increase. Graffiti complaints are still most prevalent in the city center throughout the months of the year. Hence, to address the increase in the number of 'Graffiti' complaints, the city should implement initiatives focused on the city center.

Let's now zoom in once again and investigate the development in the geographical distribution over the hours of the day...

#### Investigating daily patterns.. <a class="anchor" id="s_3_2_3"></a>

In [None]:
# subset
d_sub = df_3_2.loc[df_3_2.Category.isin(["Graffiti", "Encampments"])]

# sample n service requests
SEED = 123

# Create figure
fig = go.Figure()

# Constants
img_width = 900
img_height = 600

frac = 0.02

cat_list = []
    
for j, cat in d_sub.groupby('Category'):
    cat_list.append(j)
    for i, hour in cat.groupby('Opened_Hour'):
        # lon and lat
        lon = hour.Longitude.sample(frac = frac, random_state = SEED)
        lat = hour.Latitude.sample(frac = frac, random_state = SEED)

        # plot
        fig.add_trace(
            go.Scattermapbox(
            lat = lat,
            lon = lon,
            mode='markers',
            visible = False,
            name = j,
            hoverinfo='skip',
            marker=go.scattermapbox.Marker(
                size=5,
                opacity = 0.7
            )
        ))
    
fig.update_layout(
    #autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=37.765,
            lon=-122.431297
        ),
        pitch=0,
        zoom=10.7
    ),
)

# Create and add slider
hours = list(range(0,24))
steps = []
for idx, val in enumerate(hours):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(hours)}],
        label = val
    )
    step["args"][0]["visible"][idx] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=0,
    currentvalue={'visible': False, "prefix": "Hour: "},
    pad={"t": 30, 'r':10, 'l':10, 'b':10},
    steps=steps
)]

# update layout
fig.update_layout(
    title="Service Requests by Category over the Hours of the Day",
    xaxis_title= "Hour",
    width=img_width,
    height=img_height,
    sliders=sliders
)

fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='geo-plot-day')

In the plot we see the daily cycle of the number of complaints for the categories of 'Graffiti' and 'Encampments'. We chose to focus on these categories after investigating the distribution of 'Tree Maintenance', where we found no interesting pattern. Additionally, this helps limit the size of the plot for efficient integration into the website.

Overall, we see that during the first hours of the day (the night) there are few complaints in all categories. As we approach 7:00 AM, and people start going to work, we see a steep increase in the number of complaints in all categories. From 7:00 AM until noon the number of complaints keeps increasing. As we approach 4:00-5:00 PM we see a large decrease in the number of complaints across all categories. By 11:00 PM the number of complaints is at a low point and stays there throughout the night.

This pattern, arising from the natural rhythm of our society, is of course to be expected.

When plotting the distribution of complaints over the hours for each category in the previous section, we found that Graffiti complaints are more tightly and evenly distributed between 8:00 AM and 4:00 PM. Encampment complaints, on the other hand, have a larger peak at 8:00 AM and are slightly more spread out between 8:00 AM and 7:00 PM.

Let's see if this pattern coincides with specific geographical patterns for each of the categories separately...

In [None]:
# subset data
d_sub = df_3_2.loc[df_3_2.Category.isin(["Encampments"])]

# sample n service requests
SEED = 123
frac = 0.06

# Create figure
fig = go.Figure()

# Constants
img_width = 900
img_height = 600

for i, hour in d_sub.groupby('Opened_Hour'):
    # lon and lat
    lon = hour.Longitude.sample(frac = frac, random_state = SEED)
    lat = hour.Latitude.sample(frac = frac, random_state = SEED)

    # plot
    fig.add_trace(
        go.Scattermapbox(
        lat = lat,
        lon = lon,
        mode='markers',
        visible = False,
        name = i,
        hoverinfo='skip',
        marker=go.scattermapbox.Marker(
            size=5,
            opacity = 0.7
        )
    ))
    

fig.update_layout(
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=37.773972,
            lon=-122.431297
        ),
        pitch=0,
        zoom=10.7
    ),
)


# Create and add slider
hours = list(range(0,24))  # Opened_Hour runs from 0 to 23, matching the trace order
steps = []
for idx, val in enumerate(hours):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(hours)}],
        label = val
    )
    step["args"][0]["visible"][idx] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=0,
    currentvalue={'visible': False, "prefix": "Hour: "},
    pad={"t": 30, 'r':10, 'l':10, 'b':10},
    steps=steps
)]

# update layout
fig.update_layout(
    title="Encampments Service Requests by Hours of the Day",
    xaxis_title= "Hour",
    autosize = False,
    width=img_width,
    height=img_height,
    sliders=sliders
)

fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='geo-plot-day-encampents')

From the plot we see that the initial spike in the number of complaints at 8:00 AM arises in the city center. At the same time we see that the drop in the number of complaints occurs later, around 7:00 PM. The spatial distribution is stable throughout the day, and the increase in the number of complaints arises from a global increase all over the city. In other words, the increase is not driven by increases in certain geographical locations. Hence, to combat Encampment occurrences our recommendation is that efforts should be focused on the city center, since this is where they occur throughout the entire day.

Let's look at the Graffiti complaints...

In [None]:
# subset data
d_sub = df_3_2.loc[df_3_2.Category.isin(["Graffiti"])]

# sample n service requests
SEED = 123
frac = 0.04

# Create figure
fig = go.Figure()

# Constants
img_width = 900
img_height = 600

for i, hour in d_sub.groupby('Opened_Hour'):
    # lon and lat
    lon = hour.Longitude.sample(frac = frac, random_state = SEED)
    lat = hour.Latitude.sample(frac = frac, random_state = SEED)

    # plot
    fig.add_trace(
        go.Scattermapbox(
        lat = lat,
        lon = lon,
        mode='markers',
        visible = False,
        name = i,
        hoverinfo='skip',
        marker=go.scattermapbox.Marker(
            size=4.5,
            opacity = 0.7
        )
    ))
    

fig.update_layout(
    #autosize=True,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=37.773972,
            lon=-122.431297
        ),
        pitch=0,
        zoom=10.7
    ),
)

# Create and add slider
hours = list(range(0,24))  # Opened_Hour runs from 0 to 23, matching the trace order
steps = []
for idx, val in enumerate(hours):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(hours)}],
        label = val
    )
    step["args"][0]["visible"][idx] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=0,
    currentvalue={'visible': False, "prefix": "Hour: "},
    pad={"t": 30, 'r':10, 'l':10, 'b':10},
    steps=steps
)]

# update layout
fig.update_layout(
    title="Graffiti Service Requests by Hours of the Day",
    xaxis_title= "Hour",
    autosize = False,
    width=img_width,
    height=img_height,
    sliders=sliders
)

fig.update_yaxes(automargin=True)
fig.show()
#py.plot(fig, filename='geo-plot-day-graffiti')

For the Graffiti complaints we see that the distribution throughout the hours 8:00 AM to 4:00 PM is concentrated in the city center. Additionally, we see the drop in complaints at around 4:00-5:00 PM. Interestingly, it seems that throughout the evening hours the number of complaints drops most in the city center but stays at a higher level in the outer regions of the city, e.g. Golden Gate Park. This might be driven by the fact that people go to the parks in the outer regions of the city during the evening. Contrary to Encampment complaints, which are assumed to be occurring at the time of the complaint, the Graffiti behind a complaint can have appeared at any time prior to the complaint. Hence, we can't say with certainty that efforts to combat Graffiti should be focused on the outer regions of the city during the evening hours. But we can say that the higher number of complaints might indicate slightly more Graffiti occurrences than average in the outer regions during the evening hours, keeping in mind that this might instead be driven by more people visiting these areas during these hours and filing the complaints.

### 3.3 Cluster Analysis <a class="anchor" id="s_3_3"></a>

We are interested in seeing how different neighborhoods in San Francisco differ when it comes to the distribution of 311 complaints. This can tell us something about the issues in different areas. We would also like to cluster similar neighborhoods.

#### Learning Clusters <a class="anchor" id="s_3_3_1"></a>

##### Frequency Table
We pick the 22 most frequent request categories to focus on and create a dataframe where each row vector is a frequency distribution over request types in a specific neighborhood. The rows are normalized so that each entry corresponds to the percentage a given request type accounts for in a given neighborhood.
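
As a minimal sketch of the row normalisation described above, the snippet below builds such a table from a handful of invented (neighborhood, category) pairs using only the standard library. The function name `frequency_table` and the toy records are assumptions for illustration; the notebook does the equivalent with `groupby`, `value_counts(normalize=True)` and `pivot`.

```python
from collections import Counter, defaultdict

def frequency_table(records):
    """Map each neighborhood to {category: percentage of its requests}."""
    counts = defaultdict(Counter)
    for nhood, category in records:
        counts[nhood][category] += 1
    table = {}
    for nhood, cat_counts in counts.items():
        total = sum(cat_counts.values())
        table[nhood] = {cat: 100 * n / total for cat, n in cat_counts.items()}
    return table

# Toy (neighborhood, category) pairs, invented for illustration.
records = [
    ("Mission", "Graffiti"), ("Mission", "Graffiti"),
    ("Mission", "Encampments"), ("Mission", "Tree Maintenance"),
    ("Sunset", "Tree Maintenance"), ("Sunset", "Graffiti"),
]
table = frequency_table(records)
print(table["Mission"])  # Graffiti makes up half of the toy Mission requests
```

Because each row sums to 100, neighborhoods with very different request volumes become directly comparable, which is what the clustering below relies on.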

In [None]:
### Focus requests
FR = focusrequests_22
# subset of dataframe
df_3_3 = df[df.Category.isin(FR)]
request_per_cat = df_3_3.Category.value_counts().sort_index()

# creating frequency pivot table
df_group = df.groupby('Analysis Neighborhoods').Category.value_counts(normalize = True, sort = False).mul(100).reset_index(name="Req_count")
pivot = df_group.pivot(index = 'Analysis Neighborhoods', columns = 'Category', values = 'Req_count')

# rename index so it corresponds with the geojson
pivot.index = index_gj

# put into pandas dataframe
nhood_df = pd.DataFrame(pivot, columns = FR)
nhood_df = nhood_df.fillna(0)

##### Self-Organising-Maps

To explore the dataset we implemented a Self-Organising Map (SOM), a kind of artificial neural network used for unsupervised clustering. SOMs are very useful for exploring high-dimensional data because they map it onto a 2D grid, or Kohonen layer, grouping observations much like K-means clustering. By mapping the pivot table of neighborhoods and complaint frequencies to a 2D map, we investigate whether the neighborhoods can be grouped into meaningful clusters.

We defined a 3x3 grid (9 clusters) because this strikes a nice balance between interpretability and flexibility in the representation.
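
To make the idea concrete, here is a minimal sketch of a single SOM training step on a 3x3 grid in plain Python. It is not MiniSom's implementation; the helper names, the Gaussian neighborhood and the toy weights are assumptions chosen for illustration. The step finds the best matching unit for a sample and pulls the grid nodes towards it, with nodes close to the winner on the grid moving the most.

```python
import math

def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

def som_update(weights, x, learning_rate=0.5, sigma=0.5):
    """Move every node towards x, scaled by its grid distance to the winner."""
    winner = best_matching_unit(weights, x)
    for node, w in weights.items():
        grid_dist_sq = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
        influence = math.exp(-grid_dist_sq / (2 * sigma ** 2))
        weights[node] = [wi + learning_rate * influence * (xi - wi)
                         for wi, xi in zip(w, x)]
    return winner

# 3x3 grid with 2-dimensional toy weights (values invented for illustration).
weights = {(i, j): [i / 2, j / 2] for i in range(3) for j in range(3)}
sample = [0.9, 0.1]

before = math.dist(weights[(2, 0)], sample)
winner = som_update(weights, sample)
after = math.dist(weights[winner], sample)
print(winner, before, after)
```

Repeating this step over many randomly drawn samples is what gradually organises the grid so that similar observations end up in the same or neighboring cells.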

In [None]:
from minisom import MiniSom
SEED = 123

X = nhood_df.values
y = nhood_df.index

# Initialization and training
som_shape = (3,3)
som = MiniSom(som_shape[0], som_shape[1], X.shape[1], sigma = 0.5, learning_rate = 0.5, random_seed = SEED)
som.pca_weights_init(X)
som.train_random(X, 10000, verbose=False)

plt.figure(figsize=(15, 15))
for x, t in zip(X, y):
    #t = float(t)
    w = som.winner(x)
    xval = w[0]+.6+0.5*np.random.rand(1)-0.5
    yval = w[1]+.8+0.6*np.random.rand(1)-0.5
    plt.text(xval, yval,  t, fontdict={'weight': 'bold',  'size': 11})
    
plt.axis([0, som.get_weights().shape[0], 0,  som.get_weights().shape[1]])
plt.show()

##### Agglomerative Hierarchical Clustering

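- We cluster the neighborhoods with agglomerative hierarchical clustering, using Ward linkage on Euclidean distances.
- The resulting cluster assignments are gathered in a cluster dataframe.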
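
The merging rule behind Ward linkage can be sketched in a few lines of plain Python. This is a naive illustration on invented 2-D points, not the SciPy implementation used in the notebook, and the names `ward_cost` and `agglomerate` are hypothetical. At every step it merges the pair of clusters whose merge increases the within-cluster variance the least.

```python
def ward_cost(a, b):
    """Increase in within-cluster variance if clusters a and b are merged."""
    size_a, size_b = len(a), len(b)
    centroid_a = [sum(col) / size_a for col in zip(*a)]
    centroid_b = [sum(col) / size_b for col in zip(*b)]
    dist_sq = sum((p - q) ** 2 for p, q in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * dist_sq

def agglomerate(points, k):
    """Greedily merge the cheapest pair of clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 2-D points forming two well-separated groups (invented for illustration).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
clusters = agglomerate(points, k=2)
print(clusters)
```

Stopping at `k` clusters plays the same role as cutting the dendrogram into a fixed number of clusters.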

In [None]:
X = nhood_df.values
y = nhood_df.index

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, cophenet
d_sample = 'euclidean' #See possible values: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist
d_group = 'ward'
N_leafs = 41

# distance metric between samples and linkage method between clusters
metric = d_sample
method = d_group
Z = linkage(X, method=method, metric=metric) 

# dendrogram
plt.figure(figsize = (8,8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
den = dendrogram(
    Z,
    leaf_rotation=90.,
    leaf_font_size=8.,
    truncate_mode='lastp',
    p = N_leafs,
    show_contracted = True
)
plt.show()

#### Principal Component Analysis
In addition, we run PCA and look at the projection onto the first and second principal components **WHY**

In [None]:
### Number of clusters
k = 10
# designating clusters
cluster_designation = fcluster(Z, k, criterion='maxclust')

# Cluster dataframe
clust_df = pd.DataFrame(y, columns = ['nhood'])
cluster = [f'Cluster {cluster_designation[i]}' for i in range(len(y))]
cluster_num = [cluster_designation[i] for i in range(len(y))]
clust_df.insert(1, 'Cluster',cluster)
clust_df.insert(2, 'Cluster_num',cluster_num)
# Sort by cluster
clust_df = clust_df.sort_values(by=['Cluster_num'])


from sklearn.decomposition import PCA
# PCA analysis
pca = PCA(n_components=2)
components = pca.fit_transform(X)

# PCA dataframe
pca_df = pd.DataFrame(components, columns = ['PC1','PC2'])
pca_df.insert(0, 'nhood',y)
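# use the same 'Cluster n' labels as in clust_df
cluster = [f'Cluster {c}' for c in cluster_designation]
cluster_num = list(cluster_designation)
pca_df.insert(3, 'Cluster', cluster)
pca_df.insert(4, 'Cluster_num', cluster_num)
# sort numerically so that 'Cluster 10' comes after 'Cluster 9' and colors match the other plots
pca_df = pca_df.sort_values('Cluster_num')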

# Project onto first and second principal component
fig = px.scatter(pca_df,x='PC1',y='PC2', hover_name='nhood',color = 'Cluster')
fig.update_layout(
    width =800,
    height =800,
    title_text='PCA projection')

fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

#### Results
* We ended up with 10 clusters of neighborhoods.

* The dataframe constructed below is a table showing how each request category is distributed across the clusters. For example, we see below that 39.8 % of the reports of abandoned vehicles are from cluster 6 and 55 % of all encampments are found in cluster 4. This is the information from the clustering analysis that we would like to communicate.

In [None]:
cols = nhood_df.columns

def cluster_percentage(category):
    """Return, for each of the k clusters, its share (in %) of all requests in `category`."""
    total = nhood_df[category].sum()
    cluster_pct = []
    for i in range(1, k+1):
        # neighborhoods assigned to cluster i
        nhoods = clust_df[clust_df.Cluster_num == i].nhood
        cluster_cat = nhood_df.loc[nhoods, category].sum()
        cluster_pct.append(cluster_cat/total*100)
    return cluster_pct

# one column per request category, one row per cluster
pct_df = pd.DataFrame({col: cluster_percentage(col) for col in cols})

pct_df.index = clust_df.Cluster.unique()
pct_df.T.head()
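
The same kind of table can also be built without an explicit loop, by grouping the neighborhoods by cluster and dividing by the category totals. The sketch below uses hypothetical toy data: `counts` stands in for `nhood_df` and `labels` for the cluster number of each neighborhood.

```python
import pandas as pd

# Hypothetical toy data: rows are neighborhoods, columns are request categories
counts = pd.DataFrame(
    {'Graffiti': [10, 30, 60], 'Encampments': [5, 5, 90]},
    index=['A', 'B', 'C'])

# Hypothetical cluster number for each neighborhood
labels = pd.Series([1, 1, 2], index=counts.index, name='Cluster_num')

# Sum the requests per cluster, then divide by the total per category
cluster_totals = counts.groupby(labels).sum()
pct = cluster_totals / counts.sum() * 100

print(pct)
```

Each column of `pct` sums to 100, since every neighborhood belongs to exactly one cluster.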

#### Visualizing and Exploring Clusters <a class="anchor" id="s_3_3_2"></a>

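To interpret the clusters, we look at them in three ways: geographically on a map, as word clouds of typical requests, and as a bar plot of how each request type is distributed across clusters.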

##### Geographically
The map below colors each San Francisco neighborhood by the cluster it belongs to. Hover over a neighborhood to see its name.

In [None]:
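import os
import chart_studio
import chart_studio.plotly as py
import plotly.offline as pyo
import plotly.graph_objects as go
import plotly.express as px
# Set notebook mode to work in offline
pyo.init_notebook_mode()

# Read the API key from the environment instead of hardcoding it in the notebook.
# CHART_STUDIO_API_KEY is an assumed variable name; set it before uploading plots.
chart_studio.tools.set_credentials_file(username='mmestre',
                                        api_key=os.environ.get('CHART_STUDIO_API_KEY', ''))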

geodata = clust_df.copy()

fig = px.choropleth_mapbox(geodata, geojson=gj, locations='nhood', 
                           featureidkey="properties.nhood",
                           color='Cluster',
                           mapbox_style="carto-positron",
                           zoom=10.2, center = {"lat": 37.765, "lon": -122.446},
                           opacity=0.5,
                           # labels must be a dict mapping column names to display names
                           labels={'Cluster': 'Cluster', 'nhood': 'Neighborhood'},
                           title="Clusters of SF neighborhoods based on the distribution of 311 requests"
                          )
#fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
#py.plot(fig, filename='cluster_map')
#fig.savefig('words_white.png')
#fig.write_html("cluster.html")

##### Wordclouds
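The word clouds below show one cloud per cluster. Each request category is sized by the share of all requests in that category that come from the cluster.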

In [None]:
from wordcloud import WordCloud
import matplotlib.image as mpimg
import matplotlib as mpl
from PIL import Image

# Data for the wordclouds
worddata = pct_df.T
C = worddata.columns

# use same colors for each cluster as in the other plots
color = ['#636EFA',
'#EF553B',
'#00CC96',
'#AB63FA',
'#FFA15A',
'#19D3F3',
'#FF6692',
'#B6E880',
'#FF97FF',
'#FECB52']

fig, axs = plt.subplots(nrows = 2 , ncols = 5,figsize = (18,6))

for i in range(k):
    # request categories weighted by this cluster's share of each category
    text = worddata[C[i]].sort_values(ascending=False)
    wordcloud = WordCloud(background_color='white', contour_width=1,
                          contour_color='black',
                          # bind the color now so each cloud keeps its own cluster color
                          color_func=lambda *args, c=color[i], **kwargs: c,
                          width=400, height=200).generate_from_frequencies(text)
    # place cluster i+1 in the 2 x 5 grid
    ax = axs[i//5, i%5]
    ax.imshow(wordcloud, interpolation='bilinear', aspect="auto")
    ax.set_title(f'Cluster {i+1}')
    ax.axis('off')
#plt.tight_layout()
fig.suptitle('Typical 311 requests in each cluster')
plt.show()
#fig.savefig('words_white.png')
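
The word clouds above reuse the plotly default colors by position, which only matches the other plots as long as the clusters appear in the same order everywhere. A possible way to make the mapping explicit is sketched below; `cluster_colors` is a hypothetical helper name, and the labels are assumed to match the values in the `Cluster` column exactly.

```python
# Same ten colors as in the word cloud cell
palette = ['#636EFA', '#EF553B', '#00CC96', '#AB63FA', '#FFA15A',
           '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52']
k = 10

# Map each cluster label to a fixed color, independent of row order
cluster_colors = {f'Cluster {i + 1}': palette[i % len(palette)] for i in range(k)}

for label, hex_color in cluster_colors.items():
    print(label, hex_color)
```

The dictionary could then be passed to the plotly express calls through their `color_discrete_map` argument and indexed by label when drawing the word clouds.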

##### Barplot 
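Below you see a bar plot with a horizontal bar for each focus request type. Each bar is split into segments showing the percentage of requests of that type that come from each cluster.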

In [None]:
bardata = pct_df.T
bardata = bardata.round(2)
C = bardata.columns
bardata.insert(0,'type',bardata.index)

fig = px.bar(bardata, x=C, y='type',
            title="Distribution of 311 requests across clusters",
            labels={"value": "Percentage", "variable": "Cluster",'type':'311 request type'})
fig.show()
#py.plot(fig, filename='cluster_barplot')

### The different types of neighborhoods in San Francisco

###### The Parks and Recreations of San Francisco


* We start by diving into **Cluster 6**, so go and deselect all other clusters in the bar plot. If you hover over the map, you see that this cluster consists of three park neighborhoods, so naturally the majority of concerns and requests are about parks and recreation. In fact, if you hover over *'Rec and Park Requests'* in the bar plot, you see that about 57 % of all requests of this type come from this cluster. Another concern is homeless people, who might favor seeking shelter in parks.

* **Cluster 7** includes the neighborhood *Presidio*, a big park area where the Golden Gate Bridge connects, and *Lakeshore*, a nice area with a lake, a park and the zoo! Here, as in **Cluster 6**, many people have requests about parks and recreation, but instead of homeless concerns, people in **Cluster 7** complain about street defects and sign repair. People in this cluster also provide a lot of feedback on the public transport system MUNI, which might stem from tourists or visitors travelling to these areas by public transportation.

###### The odd ones out

* The neighborhood *'Seacliff'* is the singleton **Cluster 9**. It separates itself from the other neighborhoods because it has an extraordinarily high frequency of complaints regarding *'Abandoned Vehicles'*.
* **Cluster 10** consists solely of *'Treasure Island'*, a small artificial island that, according to the 311 data, has its biggest problems with *'Streetlights'* and *'Sewer Issues'*. Almost 30 % of all streetlight complaints are from here. Again you can see a cluster that provides loads of *'Muni Feedback'*.

###### The crowded and busy city center

* Now we go into the areas around the city center, which are **Cluster 3** and **Cluster 5**. The complaints and requests in these two clusters are almost the same, with slight differences. Both clusters struggle with homelessness, and together they account for almost 50 % of all *'Homeless Concerns'* and over 60 % of all *'Encampment'* complaints. The complaint types reflect the fact that these neighborhoods are very active and crowded, with commercial areas, bars and business areas, so you see many complaints about noise, litter, damaged property, blocked traffic and cleanliness, but none about abandoned vehicles.

* Another cluster in the center of San Francisco is **Cluster 4**, which, as you can see on the map, consists of three small neighborhoods including *Chinatown*. In contrast to **Cluster 3** and **Cluster 5**, this cluster doesn't struggle with encampments and homelessness, but it has about the same amount of complaints about *'Graffiti'* and *'Illegal Posting'* as **Cluster 5**, even though **Cluster 4** accounts for far fewer 311 requests in total.

###### The outskirts of the center

* **Cluster 8** represents a large number of neighborhoods on the outskirts of the center and accounts for a big chunk of the requests in all categories. In contrast to the more crammed and busy clusters 3 and 5, it contains more deserted areas where people dump their cars. You can see in the bar plot that 41 % of all *'Abandoned Vehicles'* reports are found in this cluster. Other notable complaints are about *'Sewer Issues'* and *'Tree Maintenance'*.

* You can also see **Cluster 1** on the map as three spread-out chunks of neighborhoods in the outer areas of San Francisco. If you come to these parts of town you will experience issues within all categories, but the need for *'Parking Enforcement'* might stand out!


* Lastly, you can see that **Cluster 2** is a collection of neighborhoods by the Bay. This includes the neighborhood *Portola*, where the office of the San Francisco Housing Authority lies, hence the overwhelming amount of *'SFHA requests'*. Otherwise you can experience a broad palette of issues, just as in **Cluster 1**.

### 3.4 Topic Extraction <a class="anchor" id="s_3_4"></a>

### 3.5 Other? <a class="anchor" id="s_3_5"></a>

## 4. Genre <a class="anchor" id="c4"></a>

## 5. Visualization <a class="anchor" id="c5"></a>

## 6. Discussion <a class="anchor" id="c6"></a>

## 7. Contributions <a class="anchor" id="c7"></a>

## 8. References <a class="anchor" id="c8"></a>