<a href="https://colab.research.google.com/github/caro28/stinky/blob/master/stinky_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary

*   This file loads and merges data from two databases that residents use to report odor complaints: [SmellMyCity (SMC)](https://smellmycity.org/) and [SeeClickFix (SCF)](https://seeclickfix.com/portland_2). The SMC reports shown below run from May 2019 to July 2021; the SCF reports run from January 2020 to June 2021.
* 83% of SMC data is from S Portland. SCF is primarily used by Portland residents, according to Maria Guerra and Luke Truman.
* Only the SMC data specifies zip codes. We filtered the SMC data to select all zip codes in Portland and S Portland. 83% of the odor complaints are from zip code 04106, the area of S Portland immediately to the south of the Fore River. 99.5% of the data is from the three zip codes immediately to the north and south of the Fore River (04106, 04101, 04102).
*   Key columns are: Report date, Lat/Long of user, Smell description (user text), Smell value (for SMC dataset only)
*   Separate columns were added for date, time, Day, Month, Year, Hour, Month_name, and the report time truncated to the hour (`Date & time (hour rounded)`)

## How to update df_stinky (table that combines data from SMC and SCF data)
1. Update raw SMC data and SCF data: 
   * SMC data: download from [SMC website](https://smellmycity.org/data). Select by zipcode, enter "4101,4102,4106,4107,4103,4108,4124", download and save as csv (use this file name: smc.csv)
   * SCF data: get the updated SCF Excel spreadsheet from the City of Portland and save it as csv (use this file name: scf.csv)
   * Upload the updated csv files from SmellMyCity (SMC) and SeeClickFix (SCF) to the [data folder](https://github.com/ds5110/stinky/tree/master/data)
2. Run all cells. The last cell downloads a csv file containing df_stinky (df_stinky.csv)
3. Upload df_stinky.csv to [data folder](https://github.com/ds5110/stinky/tree/master/data)
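
Before uploading refreshed files, it can help to confirm that they still contain the columns the cells below read. The following is a minimal sketch of such a check, not part of the original notebook: `missing_columns`, `SMC_REQUIRED`, and `SCF_REQUIRED` are illustrative names, and the lists only cover columns that the later cells reference.

```python
import pandas as pd

# Columns the notebook reads from each file (taken from the cells below).
SMC_REQUIRED = ['date & time', 'skewed latitude', 'skewed longitude',
                'zipcode', 'smell description']
SCF_REQUIRED = ['Created at local', 'Closed at local', 'Description',
                'Lat', 'Lng']


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in for a freshly exported scf.csv that lacks the 'Lng' column.
sample_scf = pd.DataFrame(columns=['Created at local', 'Closed at local',
                                   'Description', 'Lat'])
print(missing_columns(sample_scf, SCF_REQUIRED))  # ['Lng']
```

An empty list means the file has every column the notebook expects; anything else is worth checking before running all cells.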


# Portland and S Portland zip codes


*   Portland
    *   04101, 04102: Portland immediately north of Fore River
    *   04103: Portland, further north (to northern border of the city)
    *   04108: Portland, islands to the east (Peaks Island and Cushing Island)
    *   04124: Portland, west of airport

*   South Portland
    *   04106, 04107: S Portland immediately south of Fore River
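
The zip code list above can be written down as a lookup table, which makes it easy to label each report with its city. This is a sketch only: `ZIP_TO_CITY` is a hypothetical name, and the keys are integers on the assumption that pandas reads the `zipcode` column as numbers (dropping the leading zero).

```python
import pandas as pd

# Zip code -> city, as listed above. Integer keys match how pandas
# parses a numeric zipcode column (04106 becomes 4106).
ZIP_TO_CITY = {
    4101: 'Portland', 4102: 'Portland', 4103: 'Portland',
    4108: 'Portland', 4124: 'Portland',
    4106: 'South Portland', 4107: 'South Portland',
}

# Small example series standing in for a zipcode column.
zipcodes = pd.Series([4101, 4106, 4107, 4124])
cities = zipcodes.map(ZIP_TO_CITY)
print(cities.tolist())
```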


In [4]:
import pandas as pd

# load SMC data
url_all_zips = 'https://raw.githubusercontent.com/ds5110/stinky/master/data/smc.csv'
df_smc = pd.read_csv(url_all_zips)
print(df_smc.shape)

df_smc.head()

(2612, 9)


Unnamed: 0,epoch time,date & time,smell value,skewed latitude,skewed longitude,zipcode,smell description,symptoms,additional comments
0,1558531873,05/22/2019 09:31:13 -04:00,3,43.6608,-70.2498,4101,Grainy / malt like - coming from St. John street,,
1,1558691615,05/24/2019 05:53:35 -04:00,3,43.6435,-70.2702,4102,,,
2,1559178135,05/29/2019 21:02:15 -04:00,3,43.6466,-70.277,4102,Asphalty,,
3,1559341934,05/31/2019 18:32:14 -04:00,1,43.6325,-70.2828,4106,,,
4,1559387558,06/01/2019 07:12:38 -04:00,3,43.6343,-70.2825,4106,Oil fumes,Throat irritation,


In [5]:
df_smc['zipcode'].unique()

array([4101, 4102, 4106, 4107, 4103, 4108, 4124])
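
The zip codes come back as integers, so the leading zero is lost (04101 appears as 4101). If five-character zip codes are needed, for example for joining against other datasets, they can be restored as strings. A minimal sketch on a stand-in series:

```python
import pandas as pd

# Stand-in for df_smc['zipcode'], which pandas parsed as integers.
zipcodes = pd.Series([4101, 4102, 4106])

# Convert to text and left-pad with zeros to five characters.
zip_str = zipcodes.astype(str).str.zfill(5)
print(zip_str.tolist())  # ['04101', '04102', '04106']
```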

In [6]:
df_smc.groupby('zipcode').count()

zipcode,epoch time,date & time,smell value,skewed latitude,skewed longitude,smell description,symptoms,additional comments
4101,53,53,53,53,53,43,16,2
4102,376,376,376,376,376,254,122,11
4103,5,5,5,5,5,4,4,1
4106,2169,2169,2169,2169,2169,1748,791,49
4107,7,7,7,7,7,3,1,0
4108,1,1,1,1,1,0,0,0
4124,1,1,1,1,1,0,0,0
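
The shares quoted in the Summary can be recomputed from these per-zip-code counts. The sketch below copies the `epoch time` counts from the table above into a plain dictionary rather than reading `df_smc`, so it runs on its own.

```python
# Number of SMC reports per zip code, copied from the table above.
counts = {4101: 53, 4102: 376, 4103: 5, 4106: 2169,
          4107: 7, 4108: 1, 4124: 1}

total = sum(counts.values())  # 2612 reports in all

# Share of reports from 04106 alone.
share_04106 = counts[4106] / total

# Share from the three zip codes bordering the Fore River.
share_river = (counts[4101] + counts[4102] + counts[4106]) / total

print(f'{share_04106:.1%}')  # 83.0%
print(f'{share_river:.1%}')  # 99.5%
```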


In [7]:
df_smc.isnull().sum()

epoch time                0
date & time               0
smell value               0
skewed latitude           0
skewed longitude          0
zipcode                   0
smell description       560
symptoms               1678
additional comments    2549
dtype: int64

In [8]:
# Changing the date and time column to datetime format (the slice drops the UTC offset)
df_smc['date & time']=df_smc['date & time'].str[0:20]
df_smc['date & time']=pd.to_datetime(df_smc['date & time'])

# Creating separate columns for date and time
df_smc['date'] = [d.date() for d in df_smc['date & time']]
df_smc['time'] = [d.time() for d in df_smc['date & time']]

# Creating separate columns for day, month, year, hour and month name
df_smc['Day']=df_smc['date & time'].dt.day
df_smc['Month']=df_smc['date & time'].dt.month
df_smc['Year']=df_smc['date & time'].dt.year
df_smc['Hour']=df_smc['date & time'].dt.hour
df_smc['Month_name'] = pd.to_datetime(df_smc['Month'], format='%m').dt.month_name().str.slice(stop=3)

# Creating a date and hour column (truncated to the hour: minutes and seconds are dropped, not rounded)
df_smc['Date & time (hour rounded)'] = df_smc['date & time'].dt.strftime("%Y-%m-%d %H:00:00")


# renaming columns
df_smc.rename(columns={'skewed latitude':'Latitude', 'skewed longitude':'Longitude'}, inplace=True)

df_smc

Unnamed: 0,epoch time,date & time,smell value,Latitude,Longitude,zipcode,smell description,symptoms,additional comments,date,time,Day,Month,Year,Hour,Month_name,Date & time (hour rounded)
0,1558531873,2019-05-22 09:31:13,3,43.6608,-70.2498,4101,Grainy / malt like - coming from St. John street,,,2019-05-22,09:31:13,22,5,2019,9,May,2019-05-22 09:00:00
1,1558691615,2019-05-24 05:53:35,3,43.6435,-70.2702,4102,,,,2019-05-24,05:53:35,24,5,2019,5,May,2019-05-24 05:00:00
2,1559178135,2019-05-29 21:02:15,3,43.6466,-70.2770,4102,Asphalty,,,2019-05-29,21:02:15,29,5,2019,21,May,2019-05-29 21:00:00
3,1559341934,2019-05-31 18:32:14,1,43.6325,-70.2828,4106,,,,2019-05-31,18:32:14,31,5,2019,18,May,2019-05-31 18:00:00
4,1559387558,2019-06-01 07:12:38,3,43.6343,-70.2825,4106,Oil fumes,Throat irritation,,2019-06-01,07:12:38,1,6,2019,7,Jun,2019-06-01 07:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2607,1626995456,2021-07-22 19:10:56,4,43.6518,-70.2736,4102,,,,2021-07-22,19:10:56,22,7,2021,19,Jul,2021-07-22 19:00:00
2608,1627002279,2021-07-22 21:04:39,5,43.6340,-70.2849,4106,Tank fumes,,,2021-07-22,21:04:39,22,7,2021,21,Jul,2021-07-22 21:00:00
2609,1627020315,2021-07-23 02:05:15,5,43.6428,-70.2452,4106,Petroleum smell most nites at 2am!!,,,2021-07-23,02:05:15,23,7,2021,2,Jul,2021-07-23 02:00:00
2610,1627043324,2021-07-23 08:28:44,3,43.6325,-70.2731,4106,Tar,,,2021-07-23,08:28:44,23,7,2021,8,Jul,2021-07-23 08:00:00
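
The cell above slices the UTC offset off the `date & time` strings, so the parsed timestamps are naive local times. If the offset matters (for example, across daylight saving changes), one alternative is to build the timestamps from the `epoch time` column. This is a sketch, not what the notebook does, and it assumes the reports are in the `America/New_York` time zone, which is an inference from the `-04:00` offset in the sample rows.

```python
import pandas as pd

# First and fifth 'epoch time' values from the SMC table above.
epoch = pd.Series([1558531873, 1559387558])

# Epoch seconds are UTC by definition; parse them as tz-aware UTC.
utc = pd.to_datetime(epoch, unit='s', utc=True)

# Convert to local time (time zone is an assumption, see above).
local = utc.dt.tz_convert('America/New_York')
print(local.dt.strftime('%Y-%m-%d %H:%M:%S').tolist())
```

For these two rows the local times agree with the `date & time` values shown in the table (09:31:13 and 07:12:38).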


In [9]:
# load SCF data
url = 'https://raw.githubusercontent.com/ds5110/stinky/master/data/scf.csv'
df_scf=pd.read_csv(url)
print(df_scf.shape)
df_scf.head()

(295, 12)


Unnamed: 0,Id,Report Source,Category,Created at local,Closed at local,Status,Address,Description,URL,Lat,Lng,Export tagged places
0,7181157,iPhone,Odor,01/07/2020 - 08:26AM,01/07/2020 - 09:20AM,Archived,315 Spring Street,Petroleum smell coming from south portland,https://crm.seeclickfix.com/#/organizations/61...,43.64774,-70.269455,City Council District 2
1,7181402,Android,Odor,01/07/2020 - 09:11AM,01/07/2020 - 09:20AM,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2
2,7192000,Android,Odor,01/09/2020 - 07:14AM,01/09/2020 - 08:45AM,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2
3,7206428,Android,Odor,01/13/2020 - 08:22AM,01/13/2020 - 09:09AM,Archived,25 Cushman St,worst yet,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2
4,7210067,Android,Odor,01/14/2020 - 08:24AM,01/14/2020 - 02:50PM,Archived,25 Cushman St,usual petroleum stink. Cushman and Reiche play...,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2


In [10]:
df_scf.isnull().sum()

Id                       0
Report Source            0
Category                 0
Created at local         0
Closed at local          2
Status                   0
Address                  0
Description             39
URL                      0
Lat                      0
Lng                      0
Export tagged places     0
dtype: int64

In [11]:
# Changing the date and time column to datetime format
df_scf['Created at local']=pd.to_datetime(df_scf['Created at local'])
df_scf['Closed at local']=pd.to_datetime(df_scf['Closed at local'])

# Creating separate columns for date and time
df_scf['date'] = [d.date() for d in df_scf['Created at local']]
df_scf['time'] = [d.time() for d in df_scf['Created at local']]

df_scf['Day']=df_scf['Created at local'].dt.day
df_scf['Month']=df_scf['Created at local'].dt.month
df_scf['Year']=df_scf['Created at local'].dt.year
df_scf['Hour']=df_scf['Created at local'].dt.hour
df_scf['Month_name'] = pd.to_datetime(df_scf['Month'], format='%m').dt.month_name().str.slice(stop=3)

# Creating a date and hour column (using the 'Created at local' column)
df_scf['Date & time (hour rounded)'] = df_scf['Created at local'].dt.strftime("%Y-%m-%d %H:00:00")

# renaming columns
df_scf.rename(columns={'Description':'smell description', 'Lat':'Latitude', 'Lng':'Longitude'}, inplace=True)

df_scf

Unnamed: 0,Id,Report Source,Category,Created at local,Closed at local,Status,Address,smell description,URL,Latitude,Longitude,Export tagged places,date,time,Day,Month,Year,Hour,Month_name,Date & time (hour rounded)
0,7181157,iPhone,Odor,2020-01-07 08:26:00,2020-01-07 09:20:00,Archived,315 Spring Street,Petroleum smell coming from south portland,https://crm.seeclickfix.com/#/organizations/61...,43.647740,-70.269455,City Council District 2,2020-01-07,08:26:00,7,1,2020,8,Jan,2020-01-07 08:00:00
1,7181402,Android,Odor,2020-01-07 09:11:00,2020-01-07 09:20:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-07,09:11:00,7,1,2020,9,Jan,2020-01-07 09:00:00
2,7192000,Android,Odor,2020-01-09 07:14:00,2020-01-09 08:45:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-09,07:14:00,9,1,2020,7,Jan,2020-01-09 07:00:00
3,7206428,Android,Odor,2020-01-13 08:22:00,2020-01-13 09:09:00,Archived,25 Cushman St,worst yet,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-13,08:22:00,13,1,2020,8,Jan,2020-01-13 08:00:00
4,7210067,Android,Odor,2020-01-14 08:24:00,2020-01-14 14:50:00,Archived,25 Cushman St,usual petroleum stink. Cushman and Reiche play...,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-14,08:24:00,14,1,2020,8,Jan,2020-01-14 08:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,9986497,Portal,Odor,2021-05-24 16:31:00,NaT,Archived,14 Lawn Ave,Very strong oil smell,https://crm.seeclickfix.com/#/organizations/61...,43.670415,-70.293159,City Council District 5,2021-05-24,16:31:00,24,5,2021,16,May,2021-05-24 16:00:00
291,9989334,Web - Mobile,Odor,2021-05-25 02:31:00,2021-05-25 08:16:00,Archived,Carroll St & Western Prom,petro/chem smell in the air,https://crm.seeclickfix.com/#/organizations/61...,43.648216,-70.275764,City Council District 2,2021-05-25,02:31:00,25,5,2021,2,May,2021-05-25 02:00:00
292,10033882,Android,Odor,2021-06-01 07:49:00,2021-06-02 15:03:00,Archived,331 Spring St,,https://crm.seeclickfix.com/#/organizations/61...,43.647309,-70.269934,City Council District 2,2021-06-01,07:49:00,1,6,2021,7,Jun,2021-06-01 07:00:00
293,10040039,iPhone,Odor,2021-06-01 20:29:00,2021-06-02 09:58:00,Archived,15 Vaughan St,Moderate odor,https://crm.seeclickfix.com/#/organizations/61...,43.644839,-70.272152,City Council District 2,2021-06-01,20:29:00,1,6,2021,20,Jun,2021-06-01 20:00:00
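
The SCF timestamps use a nonstandard layout (`01/07/2020 - 08:26AM`), and the cell above relies on pandas inferring it. Passing an explicit `format` removes the guesswork about month/day order. A minimal sketch on sample values from the table; the format string is my reading of the layout, not something stated in the SCF export.

```python
import pandas as pd

# Two 'Created at local' / 'Closed at local' style values plus a missing one.
raw = pd.Series(['01/07/2020 - 08:26AM', '01/14/2020 - 02:50PM', None])

# Month/day/year, a literal ' - ', then 12-hour time with AM/PM.
parsed = pd.to_datetime(raw, format='%m/%d/%Y - %I:%M%p')
print(parsed)
```

Missing values come through as `NaT`, matching the two empty `Closed at local` entries reported earlier.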


In [12]:

# Merge two datasets by stacking rows (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
df_stinky = pd.concat([df_scf, df_smc], sort=False)
df_stinky

Unnamed: 0,Id,Report Source,Category,Created at local,Closed at local,Status,Address,smell description,URL,Latitude,Longitude,Export tagged places,date,time,Day,Month,Year,Hour,Month_name,Date & time (hour rounded),epoch time,date & time,smell value,zipcode,symptoms,additional comments
0,7181157.0,iPhone,Odor,2020-01-07 08:26:00,2020-01-07 09:20:00,Archived,315 Spring Street,Petroleum smell coming from south portland,https://crm.seeclickfix.com/#/organizations/61...,43.647740,-70.269455,City Council District 2,2020-01-07,08:26:00,7,1,2020,8,Jan,2020-01-07 08:00:00,,NaT,,,,
1,7181402.0,Android,Odor,2020-01-07 09:11:00,2020-01-07 09:20:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-07,09:11:00,7,1,2020,9,Jan,2020-01-07 09:00:00,,NaT,,,,
2,7192000.0,Android,Odor,2020-01-09 07:14:00,2020-01-09 08:45:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-09,07:14:00,9,1,2020,7,Jan,2020-01-09 07:00:00,,NaT,,,,
3,7206428.0,Android,Odor,2020-01-13 08:22:00,2020-01-13 09:09:00,Archived,25 Cushman St,worst yet,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-13,08:22:00,13,1,2020,8,Jan,2020-01-13 08:00:00,,NaT,,,,
4,7210067.0,Android,Odor,2020-01-14 08:24:00,2020-01-14 14:50:00,Archived,25 Cushman St,usual petroleum stink. Cushman and Reiche play...,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,2020-01-14,08:24:00,14,1,2020,8,Jan,2020-01-14 08:00:00,,NaT,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2607,,,,NaT,NaT,,,,,43.651800,-70.273600,,2021-07-22,19:10:56,22,7,2021,19,Jul,2021-07-22 19:00:00,1.626995e+09,2021-07-22 19:10:56,4.0,4102.0,,
2608,,,,NaT,NaT,,,Tank fumes,,43.634000,-70.284900,,2021-07-22,21:04:39,22,7,2021,21,Jul,2021-07-22 21:00:00,1.627002e+09,2021-07-22 21:04:39,5.0,4106.0,,
2609,,,,NaT,NaT,,,Petroleum smell most nites at 2am!!,,43.642800,-70.245200,,2021-07-23,02:05:15,23,7,2021,2,Jul,2021-07-23 02:00:00,1.627020e+09,2021-07-23 02:05:15,5.0,4106.0,,
2610,,,,NaT,NaT,,,Tar,,43.632500,-70.273100,,2021-07-23,08:28:44,23,7,2021,8,Jul,2021-07-23 08:00:00,1.627043e+09,2021-07-23 08:28:44,3.0,4106.0,,
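
In the merged table, SCF and SMC rows can only be told apart by which columns are empty, and index values repeat because each source keeps its own numbering. One way to make the origin explicit is to tag each frame before stacking. This is a sketch on two tiny stand-in frames; the `source` column is a hypothetical addition that the notebook itself does not create.

```python
import pandas as pd

# Tiny stand-ins for df_scf and df_smc with one shared pair of columns.
scf = pd.DataFrame({'Id': [7181157],
                    'Latitude': [43.64774], 'Longitude': [-70.269455]})
smc = pd.DataFrame({'smell value': [3, 1],
                    'Latitude': [43.6608, 43.6325],
                    'Longitude': [-70.2498, -70.2828]})

# Tag each frame with its origin, then stack with a fresh 0..n-1 index.
combined = pd.concat(
    [scf.assign(source='SCF'), smc.assign(source='SMC')],
    ignore_index=True, sort=False)
print(combined[['source', 'Id', 'smell value']])
```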


In [13]:
# check that common columns were correctly merged
df_stinky.isnull().sum()

Id                            2612
Report Source                 2612
Category                      2612
Created at local              2612
Closed at local               2614
Status                        2612
Address                       2612
smell description              599
URL                           2612
Latitude                         0
Longitude                        0
Export tagged places          2612
date                             0
time                             0
Day                              0
Month                            0
Year                             0
Hour                             0
Month_name                       0
Date & time (hour rounded)       0
epoch time                     295
date & time                    295
smell value                    295
zipcode                        295
symptoms                      1973
additional comments           2844
dtype: int64

In [14]:
# look into missing 'smell description' data to check there were no merge errors
print('SCF has {} null smell description rows'.format(df_scf['smell description'].isnull().sum()))
print('SMC has {} null smell description rows'.format(df_smc['smell description'].isnull().sum()))
print('Merged df has {} null smell description rows'.format(df_stinky['smell description'].isnull().sum()))

SCF has 39 null smell description rows
SMC has 560 null smell description rows
Merged df has 599 null smell description rows
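
The printed check above covers one column. The same idea can be applied to every column the two sources share, with the result returned as a boolean instead of read off by eye. A sketch with a hypothetical helper name, `null_counts_add_up`, shown on small stand-in frames:

```python
import pandas as pd


def null_counts_add_up(left, right, merged):
    """True if, for every column shared by left and right, the merged
    null count equals the sum of the two input null counts."""
    shared = left.columns.intersection(right.columns)
    expected = left[shared].isnull().sum() + right[shared].isnull().sum()
    return bool((merged[shared].isnull().sum() == expected).all())


# Stand-in frames sharing only the 'smell description' column.
left = pd.DataFrame({'smell description': ['tar', None], 'Id': [1, 2]})
right = pd.DataFrame({'smell description': [None, 'oil', None],
                      'zipcode': [4106, 4102, 4106]})
merged = pd.concat([left, right], sort=False)

print(null_counts_add_up(left, right, merged))
```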


In [15]:
# download tidied df_stinky
from google.colab import files
df_stinky.to_csv('df_stinky.csv', index=False) 
files.download('df_stinky.csv')

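
The download cell imports `google.colab`, which exists only inside Colab. If the notebook is run elsewhere, a small wrapper can write the file and skip the browser download. This is a sketch with a hypothetical helper name, `save_csv`; it is not part of the original notebook.

```python
import pandas as pd


def save_csv(df, path='df_stinky.csv'):
    """Write df to path; when running in Colab, also trigger a download."""
    df.to_csv(path, index=False)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return path  # outside Colab the file simply stays on disk
    files.download(path)
    return path
```

Calling `save_csv(df_stinky)` would then replace the last two lines of the cell above.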