# Retrieving and Cleaning OSM Data
A notebook for retrieving and cleaning data from OpenStreetMap (OSM).

Some data cleaning functions run by default; they live in the `_data_utils.py` module. Additional cleaning steps are kept in this notebook so that their parameters can be adjusted as needed.

### Import the data retriever

In [2]:
from bikeability._get_osm_data import OSM_retriever

### Set the OSM Retriever Type
By default, the retriever fetches any sort of bike lane for the given cities.

The default for `cities` is `All`; this includes any city that has an OSM relation ID assigned.

The `cities` argument can also take a list of cities.

In [6]:
bikes = OSM_retriever('bikes', ['New York'])

### Get the OSM data

In [7]:
bike_lanes = bikes.get()

INFO:root:
Retrieved 6206 entries from area: 175905
	Time: 11.92 seconds
	Attempts: 1


dict_keys(['New York'])
['New York']
['New York']


### Overview

In [8]:
bike_lanes.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6206 entries, 0 to 6205
Data columns (total 251 columns):
 #    Column                                      Dtype  
---   ------                                      -----  
 0    index                                       int64  
 1    type                                        object 
 2    id                                          int64  
 3    nodes                                       object 
 4    geometry                                    object 
 5    bounds.minlat                               float64
 6    bounds.minlon                               float64
 7    bounds.maxlat                               float64
 8    bounds.maxlon                               float64
 9    tags.bicycle                                object 
 10   tags.cycleway:right                         object 
 11   tags.hgv                                    object 
 12   tags.highway                                object 
 13   tags.maxspeed   

### Removing poor matches
We're going to remove every highway (OSM `motorway` and `trunk` roads, plus their link roads) that has no designated bicycle passageway. The reasoning is that biking on a highway (or really any road with a speed limit over 25 mph) is slightly terrifying.

In [None]:
remove = ['motorway', 'trunk', 'motorway_link', 'trunk_link']
# Drop major roads that have no cycleway tag and are not designated for bicycles
unsafe = (bike_lanes['highway'].isin(remove)) & (bike_lanes['cycleway'].isnull()) & (bike_lanes['bicycle'] != 'designated')
bike_lanes.drop(bike_lanes.loc[unsafe].index, inplace=True)
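
To see which rows this filter drops, here is a minimal sketch on a made-up DataFrame. The four rows are invented for illustration and are not OSM data; only the column names and the `remove` list follow the cell above.

```python
import pandas as pd

# Toy data (hypothetical rows); column names mirror the cell above.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "highway": ["motorway", "trunk", "residential", "trunk_link"],
    "cycleway": [None, "lane", None, None],
    "bicycle": [None, None, None, "designated"],
})

remove = ["motorway", "trunk", "motorway_link", "trunk_link"]

# A row is dropped only when all three conditions hold.
is_major_road = toy["highway"].isin(remove)
no_cycleway = toy["cycleway"].isnull()
not_designated = toy["bicycle"] != "designated"

kept = toy[~(is_major_road & no_cycleway & not_designated)]
print(kept["id"].tolist())
```

Only row 1 (a motorway with no cycleway and no bicycle designation) is removed. The trunk road with a cycleway, the residential street, and the bicycle-designated trunk link all stay.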

### Removing extra columns


In [None]:
remove_columns = ['index','type','tiger', 'bounds', 'source', 'note', 'ref', 'horse', 'maxweight',
                  'layer','name', 'description', 'lanes:']
# Keep only the columns whose names contain none of the substrings above
columns = bike_lanes.columns
for r in remove_columns:
    columns = [c for c in columns if r not in c]

bike_lanes = bike_lanes[columns]
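
Note that the loop above matches by substring, so a short entry such as `'name'` also removes every column whose name merely contains it. A small sketch with hypothetical column names (chosen for illustration, not taken from the retrieved data) makes that visible:

```python
import pandas as pd

# Empty frame with hypothetical column names, just to show the matching.
toy = pd.DataFrame(columns=[
    "id", "type", "bounds.minlat", "tags.name", "tags.name:en",
    "tags.highway", "tags.lanes:backward", "tags.surface",
])

remove_columns = ["type", "bounds", "name", "lanes:"]

# Keep a column only if none of the substrings occurs in its name.
keep = [c for c in toy.columns if not any(r in c for r in remove_columns)]
toy = toy[keep]
print(list(toy.columns))
```

Here `'name'` removes both `tags.name` and `tags.name:en`. If that is broader than intended, matching on exact names or on prefixes would be a narrower alternative.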

### Removing Unimportant Classifiers
OSM is wonderful. It really is. However, because of its open-source nature and the diversity between cities, the retrieved data can be bogged down with ancillary attributes that apply to only a small number of ways. This step removes any column in which more than 80% of the values are `NaN`. For a more diverse (but admittedly more difficult to classify) dataset, this threshold can be raised or lowered.

In [None]:
bike_lanes = bike_lanes.loc[:, bike_lanes.isnull().mean() <= 0.8]
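
As a rough illustration of how the threshold behaves, the sketch below applies the same expression to a made-up frame. The data is invented; `isnull().mean()` gives the fraction of missing values per column.

```python
import pandas as pd

# Toy data (hypothetical) with columns of differing sparsity.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],                                                      # 0% NaN
    "highway": ["residential", "cycleway", "path", "residential", "primary"],  # 0% NaN
    "surface": ["asphalt", "paved", None, None, None],                         # 60% NaN
    "horse": [None, None, None, None, None],                                   # 100% NaN
})

# Fraction of missing values in each column.
nan_fraction = toy.isnull().mean()

# Keep columns that are at most 80% NaN.
filtered = toy.loc[:, nan_fraction <= 0.8]
print(list(filtered.columns))
```

The `surface` column survives because only 60% of it is missing, while the entirely empty `horse` column is dropped.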

In [None]:
for c in bike_lanes.columns:
    if c not in ['id','nodes','geometry', 'length']:
        print(f'{c}:')
        values = bike_lanes[c].value_counts(dropna=False)
        print(f'{values}\n\n--\n')

In [None]:
# bicycle: NaN = no
try:
    bike_lanes['bicycle'] = bike_lanes['bicycle'].fillna('no')
except KeyError:
    # The column may have been removed by the sparsity filter above
    print("The 'bicycle' column was dropped from the dataframe")

In [None]:
# oneway: NaN = no
bike_lanes['oneway'] = bike_lanes['oneway'].fillna('no')

In [None]:
# cycleway: NaN and 'none' = no
bike_lanes['cycleway'] = bike_lanes['cycleway'].fillna('no').replace('none', 'no')
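
Any of these columns could already have been dropped by the sparsity filter, so one possible variation is to check for each column before filling it. This is only a sketch on invented data, not how the notebook itself is written:

```python
import pandas as pd

# Toy data (hypothetical); 'bicycle' is deliberately absent.
toy = pd.DataFrame({
    "oneway": ["yes", None, "no"],
    "cycleway": ["lane", None, "track"],
})

missing = []
for col in ["bicycle", "oneway", "cycleway"]:
    if col in toy.columns:
        # Treat an absent tag as an explicit 'no'.
        toy[col] = toy[col].fillna("no")
    else:
        missing.append(col)
        print(f"The '{col}' column was dropped from the dataframe")

print(toy)
```

The explicit membership check does the same job as the `try`/`except` around the `bicycle` cell, applied uniformly to all three columns.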

In [None]:
# bike_lanes.to_json('bike_lanes_cleaned.json')