# Summary

*   This file loads and merges data from two databases that residents use to report odor complaints, SMC and SCF (TODO: add links to SMC and SCF). The SCF export covers January 2020 to June 2021; the SMC rows shown below run from May 2019 to July 2021. The data was filtered to the zip codes in Portland and S Portland. (TODO: add a note that only one zip code is not by the river, and that it adds fewer than 100 rows.)
*   Key columns are: report date, latitude/longitude of the user, smell description (user text), and smell value (SMC dataset only).
*   Separate columns were added for Year/Month/Day/Hour/Minutes/Seconds.

# Tidying next steps
*   Should this notebook produce the final, tidied dataframe in a way that can be accessed from a different notebook?
*   SMC data includes reports from Portland and S Portland. Try to determine whether there are any duplicate reports.
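
A possible starting point for the duplicate check, sketched on made-up rows rather than the real SMC file. It assumes a "duplicate" means the same timestamp and the same coordinates; that definition, the sample values, and the name `key_cols` are assumptions, not something decided in this notebook.

```python
import pandas as pd

# Hypothetical sample rows; values are illustrative, not taken from the SMC file.
sample = pd.DataFrame({
    'date & time': pd.to_datetime(['2021-07-22 19:10:56',
                                   '2021-07-22 19:10:56',
                                   '2021-07-22 21:04:39']),
    'Latitude': [43.6518, 43.6518, 43.6340],
    'Longitude': [-70.2736, -70.2736, -70.2849],
    'zipcode': [4102, 4102, 4106],
})

# Assumed definition of a duplicate: same timestamp and same coordinates.
key_cols = ['date & time', 'Latitude', 'Longitude']

# keep=False marks every member of a duplicate group, not just the repeats.
dup_mask = sample.duplicated(subset=key_cols, keep=False)
duplicates = sample[dup_mask]
print(duplicates)
```

A looser rule (for example, reports within a few minutes of each other at the same location) may fit the data better; this exact-match version is only the simplest case.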



# Portland and S Portland zip codes


*   Portland
    *   04101, 04102: Portland immediately north of Fore River
    *   04103: Portland, further north (to northern border of the city)
    *   04108: Portland, islands to the east (Peaks Island and Cushing Island)
    *   04124: Portland, west of airport

*   South Portland
    *   04106, 04107: S Portland immediately south of Fore River


In [15]:
import pandas as pd

# load SMC data
url_all_zips = 'https://raw.githubusercontent.com/caro28/ds5110_project/main/smell_reports_all_zipcodes.csv'
df_smc = pd.read_csv(url_all_zips)
print(df_smc.shape)

df_smc.head()

(2612, 9)


,epoch time,date & time,smell value,skewed latitude,skewed longitude,zipcode,smell description,symptoms,additional comments
0,1558531873,05/22/2019 09:31:13 -04:00,3,43.6608,-70.2498,4101,Grainy / malt like - coming from St. John street,,
1,1558691615,05/24/2019 05:53:35 -04:00,3,43.6435,-70.2702,4102,,,
2,1559178135,05/29/2019 21:02:15 -04:00,3,43.6466,-70.277,4102,Asphalty,,
3,1559341934,05/31/2019 18:32:14 -04:00,1,43.6325,-70.2828,4106,,,
4,1559387558,06/01/2019 07:12:38 -04:00,3,43.6343,-70.2825,4106,Oil fumes,Throat irritation,


In [16]:
df_smc['zipcode'].unique()

array([4101, 4102, 4106, 4107, 4103, 4108, 4124])
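
The zip codes are parsed as integers, so the leading zero is lost (`4101` instead of `04101`). A minimal sketch of restoring the five-digit form and tagging each code with its city, using the zip code list from this notebook. The names `zip_str`, `south_portland`, and `city` are illustrative.

```python
import pandas as pd

# Zip codes as they appear after read_csv: integers, leading zero lost.
zips = pd.Series([4101, 4102, 4106, 4107, 4103, 4108, 4124], name='zipcode')

# Restore the five-digit form as text.
zip_str = zips.astype(str).str.zfill(5)

# City lookup taken from the zip code list in this notebook.
south_portland = {'04106', '04107'}
city = zip_str.map(lambda z: 'South Portland' if z in south_portland else 'Portland')

print(pd.DataFrame({'zipcode': zip_str, 'city': city}))
```

If the CSV itself stores the leading zeros, passing `dtype={'zipcode': str}` to `pd.read_csv` should preserve them at load time; whether it does has not been checked here.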

In [17]:
df_smc.isnull().sum()

epoch time                0
date & time               0
smell value               0
skewed latitude           0
skewed longitude          0
zipcode                   0
smell description       560
symptoms               1678
additional comments    2549
dtype: int64

In [18]:
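# create Year, Month, Day, Hour, Minutes, Seconds columns
# keep the first 19 characters (local date and time) and drop the UTC offset, e.g. " -04:00"
df_smc['date & time']=df_smc['date & time'].str[0:19]
df_smc['date & time']=pd.to_datetime(df_smc['date & time'], format='%m/%d/%Y %H:%M:%S')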
df_smc['Day']=df_smc['date & time'].dt.day
df_smc['Month']=df_smc['date & time'].dt.month
df_smc['Year']=df_smc['date & time'].dt.year
df_smc['Hour']=df_smc['date & time'].dt.hour
df_smc['Minutes']=df_smc['date & time'].dt.minute
df_smc['Seconds']=df_smc['date & time'].dt.second

# rename columns
df_smc.rename(columns={'skewed latitude':'Latitude', 'skewed longitude':'Longitude'}, inplace=True)

df_smc

,epoch time,date & time,smell value,Latitude,Longitude,zipcode,smell description,symptoms,additional comments,Day,Month,Year,Hour,Minutes,Seconds
0,1558531873,2019-05-22 09:31:13,3,43.6608,-70.2498,4101,Grainy / malt like - coming from St. John street,,,22,5,2019,9,31,13
1,1558691615,2019-05-24 05:53:35,3,43.6435,-70.2702,4102,,,,24,5,2019,5,53,35
2,1559178135,2019-05-29 21:02:15,3,43.6466,-70.2770,4102,Asphalty,,,29,5,2019,21,2,15
3,1559341934,2019-05-31 18:32:14,1,43.6325,-70.2828,4106,,,,31,5,2019,18,32,14
4,1559387558,2019-06-01 07:12:38,3,43.6343,-70.2825,4106,Oil fumes,Throat irritation,,1,6,2019,7,12,38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2607,1626995456,2021-07-22 19:10:56,4,43.6518,-70.2736,4102,,,,22,7,2021,19,10,56
2608,1627002279,2021-07-22 21:04:39,5,43.6340,-70.2849,4106,Tank fumes,,,22,7,2021,21,4,39
2609,1627020315,2021-07-23 02:05:15,5,43.6428,-70.2452,4106,Petroleum smell most nites at 2am!!,,,23,7,2021,2,5,15
2610,1627043324,2021-07-23 08:28:44,3,43.6325,-70.2731,4106,Tar,,,23,7,2021,8,28,44
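
The cell above discards the UTC offset, so the parsed timestamps are naive local times. If timezone-aware values are ever needed, one alternative is to build them from the `epoch time` column instead. This is a sketch, not what the notebook currently does, and `'America/New_York'` is an assumption about the intended zone.

```python
import pandas as pd

# First two 'epoch time' values shown in the SMC preview.
epoch = pd.Series([1558531873, 1558691615], name='epoch time')

# Interpret as seconds since the Unix epoch (UTC), then convert to local time.
# 'America/New_York' is an assumed zone for Portland, Maine.
local = pd.to_datetime(epoch, unit='s', utc=True).dt.tz_convert('America/New_York')

print(local)
```

The converted values agree with the `date & time` strings in the raw file (09:31:13 and 05:53:35, both at -04:00), which suggests the two columns carry the same information.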


In [19]:
# load SCF data
url = 'https://raw.githubusercontent.com/caro28/ds5110_project/main/SCF%20ODOR%20REPORT%201-1-20%20to%206-3-21.csv'
df_scf=pd.read_csv(url)
print(df_scf.shape)
df_scf.head()

(295, 12)


,Id,Report Source,Category,Created at local,Closed at local,Status,Address,Description,URL,Lat,Lng,Export tagged places
0,7181157,iPhone,Odor,01/07/2020 - 08:26AM,01/07/2020 - 09:20AM,Archived,315 Spring Street,Petroleum smell coming from south portland,https://crm.seeclickfix.com/#/organizations/61...,43.64774,-70.269455,City Council District 2
1,7181402,Android,Odor,01/07/2020 - 09:11AM,01/07/2020 - 09:20AM,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2
2,7192000,Android,Odor,01/09/2020 - 07:14AM,01/09/2020 - 08:45AM,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2
3,7206428,Android,Odor,01/13/2020 - 08:22AM,01/13/2020 - 09:09AM,Archived,25 Cushman St,worst yet,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2
4,7210067,Android,Odor,01/14/2020 - 08:24AM,01/14/2020 - 02:50PM,Archived,25 Cushman St,usual petroleum stink. Cushman and Reiche play...,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2


In [20]:
df_scf.isnull().sum()

Id                       0
Report Source            0
Category                 0
Created at local         0
Closed at local          2
Status                   0
Address                  0
Description             39
URL                      0
Lat                      0
Lng                      0
Export tagged places     0
dtype: int64

In [21]:
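# create Year, Month, Day, Hour, Minutes, Seconds columns
# timestamps look like "01/07/2020 - 08:26AM", so pass the format explicitly
scf_time_format = '%m/%d/%Y - %I:%M%p'
df_scf['Created at local']=pd.to_datetime(df_scf['Created at local'], format=scf_time_format)
df_scf['Closed at local']=pd.to_datetime(df_scf['Closed at local'], format=scf_time_format)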
df_scf['Day']=df_scf['Created at local'].dt.day
df_scf['Month']=df_scf['Created at local'].dt.month
df_scf['Year']=df_scf['Created at local'].dt.year
df_scf['Hour']=df_scf['Created at local'].dt.hour
df_scf['Minutes']=df_scf['Created at local'].dt.minute
df_scf['Seconds']=df_scf['Created at local'].dt.second

# rename the description and coordinate columns to match df_smc
df_scf.rename(columns={'Description':'smell description', 'Lat':'Latitude', 'Lng':'Longitude'}, inplace=True)

df_scf

,Id,Report Source,Category,Created at local,Closed at local,Status,Address,smell description,URL,Latitude,Longitude,Export tagged places,Day,Month,Year,Hour,Minutes,Seconds
0,7181157,iPhone,Odor,2020-01-07 08:26:00,2020-01-07 09:20:00,Archived,315 Spring Street,Petroleum smell coming from south portland,https://crm.seeclickfix.com/#/organizations/61...,43.647740,-70.269455,City Council District 2,7,1,2020,8,26,0
1,7181402,Android,Odor,2020-01-07 09:11:00,2020-01-07 09:20:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,7,1,2020,9,11,0
2,7192000,Android,Odor,2020-01-09 07:14:00,2020-01-09 08:45:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,9,1,2020,7,14,0
3,7206428,Android,Odor,2020-01-13 08:22:00,2020-01-13 09:09:00,Archived,25 Cushman St,worst yet,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,13,1,2020,8,22,0
4,7210067,Android,Odor,2020-01-14 08:24:00,2020-01-14 14:50:00,Archived,25 Cushman St,usual petroleum stink. Cushman and Reiche play...,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,14,1,2020,8,24,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,9986497,Portal,Odor,2021-05-24 16:31:00,NaT,Archived,14 Lawn Ave,Very strong oil smell,https://crm.seeclickfix.com/#/organizations/61...,43.670415,-70.293159,City Council District 5,24,5,2021,16,31,0
291,9989334,Web - Mobile,Odor,2021-05-25 02:31:00,2021-05-25 08:16:00,Archived,Carroll St & Western Prom,petro/chem smell in the air,https://crm.seeclickfix.com/#/organizations/61...,43.648216,-70.275764,City Council District 2,25,5,2021,2,31,0
292,10033882,Android,Odor,2021-06-01 07:49:00,2021-06-02 15:03:00,Archived,331 Spring St,,https://crm.seeclickfix.com/#/organizations/61...,43.647309,-70.269934,City Council District 2,1,6,2021,7,49,0
293,10040039,iPhone,Odor,2021-06-01 20:29:00,2021-06-02 09:58:00,Archived,15 Vaughan St,Moderate odor,https://crm.seeclickfix.com/#/organizations/61...,43.644839,-70.272152,City Council District 2,1,6,2021,20,29,0


In [22]:
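# Combine the two datasets by stacking their rows
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_stinky = pd.concat([df_scf, df_smc], sort=False)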
df_stinky

,Id,Report Source,Category,Created at local,Closed at local,Status,Address,smell description,URL,Latitude,Longitude,Export tagged places,Day,Month,Year,Hour,Minutes,Seconds,epoch time,date & time,smell value,zipcode,symptoms,additional comments
0,7181157.0,iPhone,Odor,2020-01-07 08:26:00,2020-01-07 09:20:00,Archived,315 Spring Street,Petroleum smell coming from south portland,https://crm.seeclickfix.com/#/organizations/61...,43.647740,-70.269455,City Council District 2,7,1,2020,8,26,0,,NaT,,,,
1,7181402.0,Android,Odor,2020-01-07 09:11:00,2020-01-07 09:20:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,7,1,2020,9,11,0,,NaT,,,,
2,7192000.0,Android,Odor,2020-01-09 07:14:00,2020-01-09 08:45:00,Archived,25 Cushman St,usual petroleum,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,9,1,2020,7,14,0,,NaT,,,,
3,7206428.0,Android,Odor,2020-01-13 08:22:00,2020-01-13 09:09:00,Archived,25 Cushman St,worst yet,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,13,1,2020,8,22,0,,NaT,,,,
4,7210067.0,Android,Odor,2020-01-14 08:24:00,2020-01-14 14:50:00,Archived,25 Cushman St,usual petroleum stink. Cushman and Reiche play...,https://crm.seeclickfix.com/#/organizations/61...,43.649448,-70.268626,City Council District 2,14,1,2020,8,24,0,,NaT,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2607,,,,NaT,NaT,,,,,43.651800,-70.273600,,22,7,2021,19,10,56,1.626995e+09,2021-07-22 19:10:56,4.0,4102.0,,
2608,,,,NaT,NaT,,,Tank fumes,,43.634000,-70.284900,,22,7,2021,21,4,39,1.627002e+09,2021-07-22 21:04:39,5.0,4106.0,,
2609,,,,NaT,NaT,,,Petroleum smell most nites at 2am!!,,43.642800,-70.245200,,23,7,2021,2,5,15,1.627020e+09,2021-07-23 02:05:15,5.0,4106.0,,
2610,,,,NaT,NaT,,,Tar,,43.632500,-70.273100,,23,7,2021,8,28,44,1.627043e+09,2021-07-23 08:28:44,3.0,4106.0,,
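
The stacked frame above keeps each input's original index, so labels such as 0 and 1 appear once per source, and nothing records which dataset a row came from. A small sketch of one way to handle both, on stand-in frames with illustrative values. The `source` column name is hypothetical.

```python
import pandas as pd

# Tiny stand-ins for the two frames; values are illustrative only.
scf = pd.DataFrame({'Id': [7181157], 'smell description': ['usual petroleum'],
                    'Latitude': [43.649448], 'Longitude': [-70.268626]})
smc = pd.DataFrame({'smell value': [3, 5], 'smell description': ['Asphalty', 'Tar'],
                    'Latitude': [43.6466, 43.6325], 'Longitude': [-70.2770, -70.2731]})

# Tag each frame before stacking; 'source' is a hypothetical column name.
combined = pd.concat(
    [scf.assign(source='SCF'), smc.assign(source='SMC')],
    ignore_index=True,  # avoid repeated index labels from the two inputs
    sort=False,
)
print(combined[['source', 'smell description', 'smell value']])
```

With a `source` column in place, SCF rows can be selected directly instead of being inferred from missing `smell value` entries.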


In [23]:
# check that common columns were correctly merged
df_stinky.isnull().sum()

Id                      2612
Report Source           2612
Category                2612
Created at local        2612
Closed at local         2614
Status                  2612
Address                 2612
smell description        599
URL                     2612
Latitude                   0
Longitude                  0
Export tagged places    2612
Day                        0
Month                      0
Year                       0
Hour                       0
Minutes                    0
Seconds                    0
epoch time               295
date & time              295
smell value              295
zipcode                  295
symptoms                1973
additional comments     2844
dtype: int64
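
The null counts above can also be checked programmatically: for every column shared by both inputs, the stacked frame should have exactly the sum of the two inputs' null counts. A minimal sketch on illustrative frames. `check_null_counts` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def check_null_counts(left, right, stacked):
    """Return shared columns whose null count in `stacked` differs from left + right."""
    mismatched = []
    for col in left.columns.intersection(right.columns):
        expected = left[col].isnull().sum() + right[col].isnull().sum()
        if stacked[col].isnull().sum() != expected:
            mismatched.append(col)
    return mismatched

# Illustrative frames, not the real data.
a = pd.DataFrame({'smell description': ['oil', None], 'Latitude': [43.6, 43.7]})
b = pd.DataFrame({'smell description': [None, 'tar'], 'Latitude': [43.5, 43.4]})
stacked = pd.concat([a, b], sort=False)

# An empty list means every shared column has the expected null count.
print(check_null_counts(a, b, stacked))
```

On the real data the call would be `check_null_counts(df_scf, df_smc, df_stinky)`.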

In [24]:
# look into missing 'smell description' data to check there were no merge errors
print('SCF has {} null smell description rows'.format(df_scf['smell description'].isnull().sum()))
print('SMC has {} null smell description rows'.format(df_smc['smell description'].isnull().sum()))
print('Merged df has {} null smell description rows'.format(df_stinky['smell description'].isnull().sum()))

SCF has 39 null smell description rows
SMC has 560 null smell description rows
Merged df has 599 null smell description rows
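
On the open question of making the tidied dataframe available to another notebook, one simple option is to write it to disk and read it back elsewhere. The sketch below uses a stand-in frame and a temporary directory. The file name is hypothetical, and pickle is only one of several possible formats.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the merged frame; values are illustrative only.
tidy = pd.DataFrame({
    'date & time': pd.to_datetime(['2021-07-23 08:28:44']),
    'Latitude': [43.6325],
    'Longitude': [-70.2731],
    'smell description': ['Tar'],
})

# Hypothetical file name; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'smell_reports_tidy.pkl')

# Pickle preserves dtypes such as datetime64, unlike a plain CSV round trip.
tidy.to_pickle(path)

# In the other notebook:
restored = pd.read_pickle(path)
print(restored.dtypes)
```

A CSV export is easier to inspect and share, but the datetime columns would then need to be parsed again on load.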
