## Wrangling parks data

### Goals of the Task


The parks and recreation data consists of two data sets. 

- The smaller data set (parks) contains the address, longitude and latitude of each Seattle park (each row is a park). 
- The second data set (features) lists the facilities each park has, such as picnic areas, basketball courts and football pitches (each row is a facility in a park). 

The aim of this task is to combine and reshape the data into a wide rather than long frame where each row is a park, and there is a Boolean column for each feature type. 
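
To make the target shape concrete, here is a minimal sketch on made-up data. The park and feature names are illustrative and not taken from the Seattle files, and `pd.crosstab` is used only as one compact way to show the long-to-wide idea; the task itself asks for `pivot_table`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (park, feature) pair
long_df = pd.DataFrame({
    "name": ["Park A", "Park A", "Park B"],
    "feature_desc": ["Paths", "View", "Paths"],
})

# Wide format: one row per park, one Boolean column per feature
wide_df = pd.crosstab(long_df["name"], long_df["feature_desc"]).astype(bool)
print(wide_df)
```

Each park now occupies a single row, and a `True` in a feature column means the park has that facility.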

#### Step 1: use pandas to read the parks and features data files into data frames
- import pandas as pd 
- use pandas read_csv to create a parks data frame and a facilities data frame 
- ensure you are pointing at the correct file path for the data source (you may have to navigate in your notebook!) 
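
If the file path is the part that goes wrong, a small wrapper can make the failure easier to diagnose. This is a sketch only: `read_csv_checked` is a hypothetical helper name, not part of pandas, and it simply reports the current working directory when the file is missing.

```python
from pathlib import Path

import pandas as pd


def read_csv_checked(path):
    """Read a CSV with pandas, failing early with a clear message if the path is wrong."""
    path = Path(path)
    if not path.is_file():
        # Show where the notebook is currently running to help fix relative paths
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return pd.read_csv(path)
```

It would be called in place of `pd.read_csv`, for example `read_csv_checked('Data/Seattle_Parks_And_Recreation_Park_Addresses.csv')`.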


In [164]:
import pandas as pd
df_parks = pd.read_csv('Data/Seattle_Parks_And_Recreation_Park_Addresses.csv')
df_facilities = pd.read_csv('Data/Seattle_Parks_and_Recreation_Parks_Features.csv')

In [165]:
df_parks.head(5)

Unnamed: 0,PMAID,LocID,Name,Address,ZIP Code,X Coord,Y Coord,Location 1
0,281,2545,12th and Howe Play Park,1200 W Howe St,98119,-122.372985,47.636097,"(47.636097, -122.372985)"
1,4159,2387,12th Ave S Viewpoint,2821 12TH Ave S,98144,-122.317765,47.577953,"(47.577953, -122.317765)"
2,4467,2382,12th Ave Square Park,564 12th Ave,98122,-122.316455,47.607427,"(47.607427, -122.316455)"
3,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)"
4,296,296,3001 E Madison,3001 E Madison St,98112,-122.293173,47.625169,"(47.625169, -122.293173)"


In [166]:
df_facilities.head(5)

Unnamed: 0,PMAID,Name,Alt_Name,xPos,yPos,Feature_ID,hours,Feature_Desc,CHILD_DESC,FIELD_TYPE,YOUTH_ONLY,LIGHTING,Location 1
0,281,12th and Howe Play Park,,-122.372985,47.636097,22,6 a.m. - 10 p.m.,Play Area,Play Area,,False,False,"1200 W Howe St\n(-122.372985, 47.636097)"
1,4159,12th Ave S Viewpoint,,-122.317765,47.577953,34,6 a.m. - 10 p.m.,View,,,False,False,"2821 12TH Ave S\n(-122.317765, 47.577953)"
2,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,7,4 a.m. - 11:30 p.m.,Boat Launch (Hand Carry),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
3,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,6,4 a.m. - 11:30 p.m.,Boat Launch (Motorized),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
4,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,36,4 a.m. - 11:30 p.m.,Waterfront,,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"


#### Step 2: convert the column headers to lower case

- the two data sets have some inconsistencies in the header case used on columns so this should be fixed using the str.lower() method. 

    - example: df.columns = df.columns.str.lower()
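
As a quick illustration, the sketch below applies the same idea to an empty toy frame whose headers mimic the mixed case of the two source files (the frame itself is made up for the example).

```python
import pandas as pd

# Toy frame with mixed-case headers, mimicking the two source files
df = pd.DataFrame(columns=["PMAID", "Name", "ZIP Code", "Feature_Desc"])

# .str.lower() works on the column Index and returns a new Index
df.columns = df.columns.str.lower()
print(list(df.columns))  # ['pmaid', 'name', 'zip code', 'feature_desc']
```

Note that only the case changes: spaces and underscores in the headers are left as they are.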

In [167]:
df_parks.columns = df_parks.columns.str.lower()
df_facilities.columns = df_facilities.columns.str.lower()

In [168]:
df_parks.head(5)

Unnamed: 0,pmaid,locid,name,address,zip code,x coord,y coord,location 1
0,281,2545,12th and Howe Play Park,1200 W Howe St,98119,-122.372985,47.636097,"(47.636097, -122.372985)"
1,4159,2387,12th Ave S Viewpoint,2821 12TH Ave S,98144,-122.317765,47.577953,"(47.577953, -122.317765)"
2,4467,2382,12th Ave Square Park,564 12th Ave,98122,-122.316455,47.607427,"(47.607427, -122.316455)"
3,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)"
4,296,296,3001 E Madison,3001 E Madison St,98112,-122.293173,47.625169,"(47.625169, -122.293173)"


In [169]:
df_facilities.head(5)

Unnamed: 0,pmaid,name,alt_name,xpos,ypos,feature_id,hours,feature_desc,child_desc,field_type,youth_only,lighting,location 1
0,281,12th and Howe Play Park,,-122.372985,47.636097,22,6 a.m. - 10 p.m.,Play Area,Play Area,,False,False,"1200 W Howe St\n(-122.372985, 47.636097)"
1,4159,12th Ave S Viewpoint,,-122.317765,47.577953,34,6 a.m. - 10 p.m.,View,,,False,False,"2821 12TH Ave S\n(-122.317765, 47.577953)"
2,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,7,4 a.m. - 11:30 p.m.,Boat Launch (Hand Carry),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
3,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,6,4 a.m. - 11:30 p.m.,Boat Launch (Motorized),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
4,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,36,4 a.m. - 11:30 p.m.,Waterfront,,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"


#### Step 3: join the data frames together

- use the pandas merge method to combine the two data frames into a new single data frame
- use the pmaid column as the merge key

https://www.geeksforgeeks.org/merge-two-pandas-dataframes-by-matched-id-number/ 
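
The sketch below shows the merge on toy data. It illustrates two things: columns that exist in both frames (here `name`) receive `_x` and `_y` suffixes, and the optional `validate` argument can assert the expected key relationship. Whether the real parks file has a unique `pmaid` per row is an assumption that the source does not confirm, so treat `validate` as an optional check.

```python
import pandas as pd

# Toy stand-ins for the parks and features frames
parks = pd.DataFrame({"pmaid": [1, 2], "name": ["Park A", "Park B"]})
features = pd.DataFrame({
    "pmaid": [1, 1, 2],
    "name": ["Park A", "Park A", "Park B"],
    "feature_desc": ["Paths", "View", "Paths"],
})

# validate="one_to_many" raises MergeError if pmaid is duplicated in parks
merged = pd.merge(parks, features, on="pmaid", how="inner", validate="one_to_many")
print(merged.columns.tolist())  # ['pmaid', 'name_x', 'name_y', 'feature_desc']
```

The default `how="inner"` keeps only parks that appear in both frames, so parks with no listed features are dropped.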

In [170]:
merged_df = pd.merge(df_parks, df_facilities, on="pmaid") 
merged_df

Unnamed: 0,pmaid,locid,name_x,address,zip code,x coord,y coord,location 1_x,name_y,alt_name,xpos,ypos,feature_id,hours,feature_desc,child_desc,field_type,youth_only,lighting,location 1_y
0,281,2545,12th and Howe Play Park,1200 W Howe St,98119,-122.372985,47.636097,"(47.636097, -122.372985)",12th and Howe Play Park,,-122.372985,47.636097,22,6 a.m. - 10 p.m.,Play Area,Play Area,,False,False,"1200 W Howe St\n(-122.372985, 47.636097)"
1,4159,2387,12th Ave S Viewpoint,2821 12TH Ave S,98144,-122.317765,47.577953,"(47.577953, -122.317765)",12th Ave S Viewpoint,,-122.317765,47.577953,34,6 a.m. - 10 p.m.,View,,,False,False,"2821 12TH Ave S\n(-122.317765, 47.577953)"
2,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)",14th Ave NW Boat Ramp,,-122.373536,47.660775,7,4 a.m. - 11:30 p.m.,Boat Launch (Hand Carry),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
3,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)",14th Ave NW Boat Ramp,,-122.373536,47.660775,6,4 a.m. - 11:30 p.m.,Boat Launch (Motorized),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
4,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)",14th Ave NW Boat Ramp,,-122.373536,47.660775,36,4 a.m. - 11:30 p.m.,Waterfront,,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1541,200,9909,Woodland Park Zoo,700 N 50th St,98103,-122.350430,47.666183,"(47.666183, -122.35043)",Woodland Park Zoo,,-122.350430,47.666183,17,4 a.m. - 11:30 p.m.,Garden,,,False,False,"700 N 50th St\n(-122.35043, 47.666183)"
1542,4563,0,Yesler Playfield,835 Yesler Way,98104,-122.320083,47.601262,"(47.601262, -122.320083)",Yesler Playfield,,-122.320083,47.601262,41,4 a.m. - 11:30 p.m.,Play Area (ADA Compliant),,,False,False,"835 Yesler Way\n(-122.320083, 47.601262)"
1543,4563,0,Yesler Playfield,835 Yesler Way,98104,-122.320083,47.601262,"(47.601262, -122.320083)",Yesler Playfield,,-122.320083,47.601262,45,4 a.m. - 11:30 p.m.,Baseball/Softball,Baseball Field,,False,False,"835 Yesler Way\n(-122.320083, 47.601262)"
1544,580,1712,York Park,3650 Renton Ave S,98144,-122.295537,47.570461,"(47.570461, -122.295537)",York Park,,-122.295537,47.570461,33,4 a.m. - 11:30 p.m.,Paths,,,False,False,"3650 Renton Ave S\n(-122.295537, 47.570461)"


#### Step 4: drop unnecessary columns

the columns we want to keep in the resulting data frame are 

- zip code
- x coord
- y coord
- locid (location id) 
- name (park name; this appears as name_x after the merge) 
- pmaid (park id) 
- feature_id (facility id) 
- feature_desc (facility description)

drop all remaining columns
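
A minimal sketch of this step on a one-row toy frame follows. Selecting the wanted columns is equivalent to dropping the rest. The `.copy()` call and the rename of `name_x` to `name` are optional additions for illustration; the notebook cells keep the `name_x` label.

```python
import pandas as pd

# One-row stand-in for the merged frame, with two columns we do not want
merged = pd.DataFrame({
    "pmaid": [1], "locid": [10], "name_x": ["Park A"], "name_y": ["Park A"],
    "zip code": [98101], "x coord": [-122.3], "y coord": [47.6],
    "feature_id": [33], "feature_desc": ["Paths"], "hours": ["6 a.m. - 10 p.m."],
})

keep = ["zip code", "x coord", "y coord", "locid", "name_x", "pmaid",
        "feature_id", "feature_desc"]

# .copy() gives an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning
slim = merged[keep].copy()

# Optional: drop the merge suffix so the park name column is simply "name"
slim = slim.rename(columns={"name_x": "name"})
print(slim.columns.tolist())
```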

In [171]:
merged_df2 = merged_df[['zip code', 'x coord', 'y coord', 'locid', 'name_x', 'pmaid', 'feature_id', 'feature_desc']].copy()
merged_df2.columns

Index(['zip code', 'x coord', 'y coord', 'locid', 'name_x', 'pmaid',
       'feature_id', 'feature_desc'],
      dtype='object')

#### Step 5: examine and clean the feature column

- examine the feature_desc column using the pandas function unique()
- note that this column contains a description of just one facility that a park contains
- this means each park has multiple rows (one row for each park facility)
- in some cases you will also see duplicate rows; these arise because the columns that distinguished them were removed earlier
- for example, Alki Beach Park (PMAID 445) has 
    - 2 x boat launches (hand carry)
    - a fire pit
    - 2 x paths
    - picnic sites
    - 2 x restrooms
    - a view
    - a waterfront
- first, de-duplicate the data frame to remove duplicate feature listings
- remember to reset the index of your data frame after dropping duplicate rows
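
The de-duplication and index reset can be sketched on toy data as follows. The key point is that both `drop_duplicates` and `reset_index` return new frames, so the result has to be assigned back.

```python
import pandas as pd

# Toy long-format data with one repeated (park, feature) pair
long_df = pd.DataFrame({
    "name_x": ["Park A", "Park A", "Park A", "Park B"],
    "feature_desc": ["Paths", "Paths", "View", "Paths"],
})

# Keep the first row for each (park, feature) pair
deduped = long_df.drop_duplicates(subset=["name_x", "feature_desc"])

# reset_index returns a new frame, so assign the result back
deduped = deduped.reset_index(drop=True)
print(deduped.index.tolist())  # [0, 1, 2]
```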

In [172]:
# list the distinct feature descriptions
merged_df2["feature_desc"].unique() 

array(['Play Area', 'View', 'Boat Launch (Hand Carry)',
       'Boat Launch (Motorized)', 'Waterfront',
       'Play Area (ADA Compliant)', 'Paths', 'Fire Pit',
       'Paths (ADA Compliant)', 'Picnic Sites', 'Rental Facility',
       'Restrooms', 'Restrooms (ADA Compliant)', 'Soccer',
       'Basketball (Half)', 'Tennis Court (Outdoor)', 'Tennis Lights',
       'Pesticide Free', 'Baseball/Softball', 'Community Center',
       'Green Space', 'Woods', 'Adult Fitness Equipment',
       'Basketball (Full)', 'Skatepark', 'Wading Pool or Water Feature',
       'Pool (Indoor)', 'Tennis Backboard (Outdoor)', 'Historic Landmark',
       'P-Patch Community Garden', 'Weddings and Ceremonies', 'Fishing',
       'T-Ball', 'Dog Off Leash Area', 'Community Building', 'Garden',
       'Environmental Learning Center', 'Hiking Trails', 'Creek',
       'Picnic Sites (ADA Compliant)', 'Football', 'Skate Dot',
       'U-8 Soccer', 'Track', 'Decorative Fountain', 'Golf',
       'Guarded Beach', 'Skatespot'

In [173]:
df3 = merged_df2.drop_duplicates(subset=['name_x','feature_desc'])

In [174]:
df3[df3['pmaid']==445]

Unnamed: 0,zip code,x coord,y coord,locid,name_x,pmaid,feature_id,feature_desc
12,98116,-122.402173,47.582912,1888,Alki Beach Park,445,7,Boat Launch (Hand Carry)
14,98116,-122.402173,47.582912,1888,Alki Beach Park,445,14,Fire Pit
15,98116,-122.402173,47.582912,1888,Alki Beach Park,445,33,Paths
16,98116,-122.402173,47.582912,1888,Alki Beach Park,445,42,Paths (ADA Compliant)
17,98116,-122.402173,47.582912,1888,Alki Beach Park,445,21,Picnic Sites
18,98116,-122.402173,47.582912,1888,Alki Beach Park,445,26,Rental Facility
19,98116,-122.402173,47.582912,1888,Alki Beach Park,445,27,Restrooms
20,98116,-122.402173,47.582912,1888,Alki Beach Park,445,40,Restrooms (ADA Compliant)
21,98116,-122.402173,47.582912,1888,Alki Beach Park,445,34,View
22,98116,-122.402173,47.582912,1888,Alki Beach Park,445,36,Waterfront


In [175]:
df3 = df3.reset_index(drop=True)  # assign back, otherwise the reset is discarded
df3

Unnamed: 0,zip code,x coord,y coord,locid,name_x,pmaid,feature_id,feature_desc
0,98119,-122.372985,47.636097,2545,12th and Howe Play Park,281,22,Play Area
1,98144,-122.317765,47.577953,2387,12th Ave S Viewpoint,4159,34,View
2,98107,-122.373536,47.660775,2546,14th Ave NW Boat Ramp,4010,7,Boat Launch (Hand Carry)
3,98107,-122.373536,47.660775,2546,14th Ave NW Boat Ramp,4010,6,Boat Launch (Motorized)
4,98107,-122.373536,47.660775,2546,14th Ave NW Boat Ramp,4010,36,Waterfront
...,...,...,...,...,...,...,...,...
1301,98103,-122.350430,47.666183,9909,Woodland Park Zoo,200,17,Garden
1302,98104,-122.320083,47.601262,0,Yesler Playfield,4563,41,Play Area (ADA Compliant)
1303,98104,-122.320083,47.601262,0,Yesler Playfield,4563,45,Baseball/Softball
1304,98144,-122.295537,47.570461,1712,York Park,580,33,Paths


#### Step 6: turn the feature column into multiple Boolean 1/0 facility columns

- we want a list of parks alongside columns for all the possible features, showing which feature each park contains
- there are 68 features described in total, and some are very similar (e.g. Basketball (Full) and Basketball (Half)), so you can OPTIONALLY pause here to consolidate them using the text analysis methods you learnt in topic 8. 
- use the pandas pivot_table method to pivot the feature description column into multiple columns, which will change the shape of the data from long to wide

    - example: pd.pivot_table(df, index=[park], columns=[feature], aggfunc="count")

- replace the NaN entries in the resulting df with 0 with the pandas fillna() method 
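
The pivot and fill can be sketched on toy data as below. Counting `feature_id` is one reasonable choice of value column, made here as an assumption. After de-duplication each (park, feature) pair appears at most once, so the counts are already 1 or missing.

```python
import pandas as pd

# Toy long-format data: one row per (park, feature) pair
long_df = pd.DataFrame({
    "name_x": ["Park A", "Park A", "Park B"],
    "feature_id": [33, 34, 33],
    "feature_desc": ["Paths", "View", "Paths"],
})

# Count feature_id per (park, feature); missing combinations come out as NaN
wide = pd.pivot_table(long_df, values="feature_id", index="name_x",
                      columns="feature_desc", aggfunc="count")

# Replace NaN with 0 and cast to int to get 1/0 indicator columns
wide = wide.fillna(0).astype(int)
print(wide)
```

If similar features are consolidated before pivoting, a park can end up with the same label more than once, and the counts may then exceed 1 unless duplicates are dropped again first.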

In [176]:
df3['feature_desc'].unique()

array(['Play Area', 'View', 'Boat Launch (Hand Carry)',
       'Boat Launch (Motorized)', 'Waterfront',
       'Play Area (ADA Compliant)', 'Paths', 'Fire Pit',
       'Paths (ADA Compliant)', 'Picnic Sites', 'Rental Facility',
       'Restrooms', 'Restrooms (ADA Compliant)', 'Soccer',
       'Basketball (Half)', 'Tennis Court (Outdoor)', 'Tennis Lights',
       'Pesticide Free', 'Baseball/Softball', 'Community Center',
       'Green Space', 'Woods', 'Adult Fitness Equipment',
       'Basketball (Full)', 'Skatepark', 'Wading Pool or Water Feature',
       'Pool (Indoor)', 'Tennis Backboard (Outdoor)', 'Historic Landmark',
       'P-Patch Community Garden', 'Weddings and Ceremonies', 'Fishing',
       'T-Ball', 'Dog Off Leash Area', 'Community Building', 'Garden',
       'Environmental Learning Center', 'Hiking Trails', 'Creek',
       'Picnic Sites (ADA Compliant)', 'Football', 'Skate Dot',
       'U-8 Soccer', 'Track', 'Decorative Fountain', 'Golf',
       'Guarded Beach', 'Skatespot'

In [186]:
# Keep only the park name and feature description columns;
# .copy() avoids SettingWithCopyWarning on the assignment below
df3 = df3[['name_x', 'feature_desc']].copy()

# Define the feature mapping dictionary
feature_mapping = {
    'Play Area (ADA Compliant)': 'Play Area',
    'Restrooms (ADA Compliant)': 'Restrooms',
    'Paths (ADA Compliant)': 'Paths',
    'Wading Pool or Water Feature': 'Pool',
    'Pool (Outdoor)': 'Pool',
    'Soccer': 'Football',
    'U-8 Soccer': 'Football',
    'Flag Football': 'Football',
    'Skate Dot': 'Skatepark',
    'Skatespot': 'Skatepark',
    'Tennis Court (Outdoor)': 'Tennis',
    'Tennis Lights': 'Tennis',
    'Tennis Backboard (Outdoor)': 'Tennis',
    'Community Building': 'Community Center',
    'P-Patch Community Garden': 'Garden',
    'Green Space': 'Garden',
    'Basketball (Full)': 'Basketball',
    'Basketball (Half)': 'Basketball',
    'Boat Launch (Hand Carry)': 'Boat Launch',
    'Boat Launch (Motorized)': 'Boat Launch',
    'Boat Moorage': 'Boat Launch',
    'Model Boat Pond': 'Boat Launch'
    # Add more mappings as needed
}

# Replace values in 'feature_desc' based on the mapping
df3['feature_desc'] = df3['feature_desc'].replace(feature_mapping)

# Group by 'name_x' and aggregate 'feature_desc' values
df3_grouped = df3.groupby('name_x')['feature_desc'].agg(', '.join).reset_index()

# Display the updated and grouped DataFrame
df3_grouped['feature_desc']


array(['View', 'Play Area', 'Boat Launch, Boat Launch, Waterfront',
       'Boat Launch', 'Paths, View', 'Boat Launch, Waterfront',
       'Boat Launch, Fire Pit, Paths, Paths, Picnic Sites, Rental Facility, Restrooms, Restrooms, View, Waterfront',
       'Football, Basketball, Tennis, Tennis, Pesticide Free, Baseball/Softball, Community Center, Play Area, Restrooms',
       'Paths', 'Tennis', 'View, Waterfront', 'Garden, Woods',
       'Adult Fitness Equipment, Paths, Paths, Play Area',
       'Play Area, Baseball/Softball, Football, Basketball, Basketball',
       'Paths, Pesticide Free, Play Area', 'Skatepark, Pool',
       'Play Area, Baseball/Softball, Football, Pool (Indoor), Community Center',
       'Baseball/Softball',
       'Basketball, Football, View, Restrooms, Play Area, Baseball/Softball',
       'Tennis, Restrooms, Baseball/Softball, Restrooms, Basketball, Football, Play Area, Pool, Tennis',
       'Boat Launch, Pesticide Free, Picnic Sites, Play Area, Restrooms, View, 

In [None]:
# df4[df4['feature_']]

In [None]:
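# Pivot long to wide (one row per park, one column per feature); 'present' is a helper column of 1s
df4 = pd.pivot_table(df3.assign(present=1), values='present', index='name_x', columns='feature_desc', aggfunc='max')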

In [None]:
df5 = df4.fillna(0)
df5

In [None]:
len(df5.columns)

In [None]:
df5.columns

#### Step 7: validate the data
- use EDA techniques including visualisation to validate the reshaping process 
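
One way to validate the reshape is to check that totals in the wide frame agree with counts in the long frame. The sketch below does this on toy data; the frame names are illustrative, and `pd.crosstab` stands in for whichever pivot was used.

```python
import pandas as pd

# Toy long-format data and its wide counterpart
long_df = pd.DataFrame({
    "name_x": ["Park A", "Park A", "Park B"],
    "feature_desc": ["Paths", "View", "Paths"],
})
wide = pd.crosstab(long_df["name_x"], long_df["feature_desc"])

# Check 1: the wide frame has exactly one row per park
assert wide.index.is_unique
assert len(wide) == long_df["name_x"].nunique()

# Check 2: each park's row total equals its number of rows in the long frame
per_park = long_df.groupby("name_x").size()
assert wide.sum(axis=1).to_dict() == per_park.to_dict()

# Check 3: each feature's column total equals its frequency in the long frame
per_feature = long_df["feature_desc"].value_counts()
assert wide.sum(axis=0).to_dict() == per_feature.to_dict()
```

A bar chart of the column totals is a natural visual companion to check 3, since it shows at a glance which features are common and which are rare.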

In [None]:
# df3[['feature_desc', 'name_x']].groupby('feature_desc').count()