## Wrangling parks data

### Goals of the Task


The parks and recreation data consists of two data sets. 

- The smaller data set contains address, longitude and latitude for Seattle parks (each row is a park). 
- The second data set (features) lists the facilities each park has, such as picnic areas, basketball courts and football pitches (each row is a facility in a park). 

The aim of this task is to combine and reshape the data into a wide rather than long frame where each row is a park, and there is a Boolean column for each feature type. 

#### Step 1: use pandas to read the parks and features data files into data frames
- import pandas as pd 
- use pandas read_csv to create a parks data frame and a facilities data frame 
- ensure you are pointing at the correct file path for the data source (you may have to navigate in your notebook!) 
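
As a minimal sketch of this step, the snippet below reads two CSV sources into data frames. The in-memory buffers, column names and values are made-up stand-ins so the example runs anywhere; in the notebook you would pass the real file paths to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Toy CSV text standing in for the two source files (all values are made up).
parks_csv = io.StringIO(
    "PMAID,Name,ZIP Code\n"
    "1,Example Park A,10001\n"
    "2,Example Park B,10002\n"
)
features_csv = io.StringIO(
    "pmaid,feature_id,feature_desc\n"
    "1,10,Picnic Sites\n"
    "1,11,Restrooms\n"
    "2,12,Paths\n"
)

# In the notebook, replace each buffer with the path to the real file.
parks = pd.read_csv(parks_csv)
features = pd.read_csv(features_csv)

# Quick sanity check: (rows, columns) of each frame.
print(parks.shape, features.shape)
```

Calling `parks.head()` and `features.head()` straight after loading is a quick way to confirm that the path was right and the headers were parsed as expected.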


#### Step 2: convert the column headers to lower case 

- the two data sets use inconsistent case in their column headers, so convert all headers to lower case with the str.lower() method. 

    - example: df.columns = df.columns.str.lower() 
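
A short sketch of the same idea applied to both frames at once. The two small frames here are hypothetical; only the `str.lower()` call comes from the step above.

```python
import pandas as pd

# Hypothetical frames with mixed-case headers.
parks = pd.DataFrame({"PMAID": [1], "Name": ["Example Park A"]})
features = pd.DataFrame({"pmaid": [1], "Feature_Desc": ["Paths"]})

# Assigning to .columns changes each frame in place.
for df in (parks, features):
    df.columns = df.columns.str.lower()

print(list(parks.columns), list(features.columns))
```

With matching lower-case headers, the merge key has the same name in both frames.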

#### Step 3: join the data frames together 

- use the pandas merge method to combine the two data frames into a new single data frame
- use the pmaid column as the merge key

For a worked example of merging on an id column, see https://www.geeksforgeeks.org/merge-two-pandas-dataframes-by-matched-id-number/ 
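
The merge might look like the sketch below. The toy frames are made up, and `how="inner"` (the pandas default) is an assumption: it keeps only parks that have at least one feature row, so use `how="left"` if parks without features should be kept.

```python
import pandas as pd

# Hypothetical frames that already have lower-case headers.
parks = pd.DataFrame({
    "pmaid": [1, 2],
    "name": ["Example Park A", "Example Park B"],
})
features = pd.DataFrame({
    "pmaid": [1, 1, 2],
    "feature_desc": ["Paths", "Restrooms", "Paths"],
})

# One output row per (park, feature) pair that shares a pmaid.
merged = pd.merge(parks, features, on="pmaid", how="inner")
print(merged)
```

Because each park can have several features, the merged frame has one row per feature, not one row per park.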

#### Step 4: drop unnecessary columns

The columns to keep in the resulting data frame are:

- zip code
- x coord
- y coord
- locid (location id) 
- name (park name) 
- pmaid (park id) 
- feature_id (facility id) 
- feature_desc (facility description)

Drop all remaining columns.
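
One way to do this is to build the list of columns to drop from a keep list, as sketched below. The toy frame and its extra columns (`address`, `hours`) are hypothetical, and the keep list assumes the lower-cased headers are spelled exactly as in the list above; check `df.columns` in your own data first.

```python
import pandas as pd

# Hypothetical merged frame with two columns that are not wanted.
merged = pd.DataFrame({
    "pmaid": [1],
    "name": ["Example Park A"],
    "address": ["(made up)"],
    "feature_id": [10],
    "feature_desc": ["Paths"],
    "hours": ["(made up)"],
})

# Assumed spellings of the columns to keep.
keep = ["zip code", "x coord", "y coord", "locid",
        "name", "pmaid", "feature_id", "feature_desc"]

# Drop everything that is not in the keep list.
to_drop = [col for col in merged.columns if col not in keep]
trimmed = merged.drop(columns=to_drop)
print(list(trimmed.columns))
```

Working from a keep list avoids having to type out every unwanted column name.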

#### Step 5: examine and clean the feature column

- examine the feature_desc column using the pandas function unique()
- note that this column contains a description of just one facility that a park contains
- this means each park has multiple rows (one row for each park facility)
- in some cases you will also see duplicate rows: these rows differed only in columns that you removed in the previous step
- for example, Alki Beach Park (PMAID 445)  has 
    - 2 x boat launches (hand carry)
    - a fire pit
    - 2 x paths
    - picnic sites
    - 2 x restrooms
    - a view
    - a waterfront
- first, de-duplicate the data frame to remove duplicate feature listings
- remember to reset the index of your data frame after dropping duplicate rows
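
A possible sketch of the inspection and de-duplication. The feature spellings are illustrative, and using `subset=["pmaid", "feature_desc"]` is an assumption: a plain `drop_duplicates()` only removes rows that match in every column, which would leave repeated listings in place if their `feature_id` values differ.

```python
import pandas as pd

# Toy long-format rows for one park; "Paths" is listed twice.
df = pd.DataFrame({
    "pmaid": [445, 445, 445, 445],
    "name": ["Alki Beach Park"] * 4,
    "feature_desc": ["Paths", "Paths", "Fire Pit", "View"],
})

# Distinct feature descriptions, in order of first appearance.
print(df["feature_desc"].unique())

# Keep one row per (park, feature) pair, then renumber the index from 0.
deduped = (
    df.drop_duplicates(subset=["pmaid", "feature_desc"])
      .reset_index(drop=True)
)
print(deduped)
```

Passing `drop=True` to `reset_index` discards the old index instead of adding it as a new column.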

#### Step 6: turn the feature column into multiple Boolean (1/0) facility columns

- we want a list of parks alongside columns for all the possible features, showing which features each park contains
- there are 68 features described in total, and some are very similar (e.g. basketball (full) / basketball (half)), so OPTIONALLY you can pause here to consolidate those features using the text analysis methods you learnt in topic 8. 
- use the pandas pivot_table method to pivot the feature description column into multiple columns, which changes the shape of the data from long to wide

    - example: pd.pivot_table(df, index=["park"], columns=["feature"], aggfunc="count"), where "park" and "feature" stand for the names of your park and feature columns

- replace the NaN entries in the resulting data frame with 0 using the pandas fillna() method 
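
The pivot and fill can be sketched as follows. The toy data is made up, and choosing `pmaid` and `name` as the index and `feature_id` as the counted value is an assumption; any kept column with no missing values would do as the value to count.

```python
import pandas as pd

# Hypothetical de-duplicated long frame: one row per (park, feature).
df = pd.DataFrame({
    "pmaid": [1, 1, 2],
    "name": ["Example Park A", "Example Park A", "Example Park B"],
    "feature_id": [10, 11, 12],
    "feature_desc": ["Paths", "Restrooms", "Paths"],
})

# Long to wide: one row per park, one column per feature description.
wide = pd.pivot_table(
    df,
    index=["pmaid", "name"],
    columns="feature_desc",
    values="feature_id",
    aggfunc="count",
)

# Missing park/feature combinations come back as NaN; make them 0.
wide = wide.fillna(0).astype(int).reset_index()
wide.columns.name = None  # remove the leftover "feature_desc" axis label
print(wide)
```

After de-duplication each count is 1, so the feature columns already read as 1/0 flags. If duplicates remained, `(wide[feature_cols] > 0).astype(int)` would cap them at 1, where `feature_cols` is a hypothetical list of the feature column names.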

#### Step 7: validate the data
- use EDA techniques including visualisation to validate the reshaping process 
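
A few non-visual checks can back up the plots, as in the sketch below. The frames are toy stand-ins for the de-duplicated long frame and the pivoted wide frame, and the specific checks are suggestions rather than a required method.

```python
import pandas as pd

# Toy stand-ins for the long (de-duplicated) and wide (pivoted) frames.
long_df = pd.DataFrame({
    "pmaid": [1, 1, 2],
    "feature_desc": ["Paths", "Restrooms", "Paths"],
})
wide = pd.DataFrame({"pmaid": [1, 2], "Paths": [1, 1], "Restrooms": [1, 0]})
feature_cols = [col for col in wide.columns if col != "pmaid"]

# Check 1: exactly one row per park, and no parks lost in the pivot.
one_row_per_park = wide["pmaid"].is_unique
same_parks = set(wide["pmaid"]) == set(long_df["pmaid"])

# Check 2: per-feature totals agree between the long and wide shapes.
long_counts = long_df["feature_desc"].value_counts().to_dict()
wide_counts = wide[feature_cols].sum().to_dict()
counts_match = long_counts == wide_counts

print(one_row_per_park, same_parks, counts_match)
```

For a visual check, a bar chart of the per-feature totals, for example `wide[feature_cols].sum().plot(kind="bar")` (which needs matplotlib), makes unusually rare or common features easy to spot.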