# Uber Features
Uber launched its Uber Movement service at the beginning of 2017. It draws on billions of trip records and provides summaries of travel times between the zones of a selected city. In this notebook, we extract features that may be useful for the Sendy challenge, such as traffic congestion trends on a particular day of the week.

You can download these datasets in CSV format from this [link](https://movement.uber.com/cities/nairobi/downloads/speeds?lang=en-US&tp[y]=2019&tp[q]=1).

## Data understanding

We first go to the Uber Movement website and navigate to Nairobi. Then we download the CSV files for the "Weekly Aggregate", "Monthly Aggregate" and "Hourly Aggregate", since we don't have the year. In this case, I chose the first quarter, on the assumption that Sendy provided us with its most recent data.
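
As a rough illustration of the kind of feature we are after, the sketch below averages travel times per day of the week on a tiny hand-made table. The keys `sourceid`, `dstid` and `dow` are the ones used for the joins later in this notebook; the `mean_travel_time` column name and all the numbers are assumptions made up for the example, not values from the real files.

```python
import io
import pandas as pd

# Hypothetical miniature of a weekly aggregate file (values are invented).
csv_text = """sourceid,dstid,dow,mean_travel_time
1,2,1,600
1,2,5,900
2,3,1,300
2,3,5,500
"""
toy_weekly = pd.read_csv(io.StringIO(csv_text))

# Average travel time per day of week: higher values hint at more congestion.
dow_trend = toy_weekly.groupby('dow')['mean_travel_time'].mean()
print(dow_trend)
```

On the real data the same `groupby` would run on the downloaded weekly aggregate instead of `toy_weekly`.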


We’ll also need the geographical boundaries file to set regional coordinates.

In [None]:
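# import the needed libraries
import sys
sys.path.append('../')
from LIB.utils import *
import json  # explicit imports below, in case LIB.utils does not re-export these names
import pandas as pd
from shapely.geometry import Point, Polygon
from tqdm import tqdm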

In [None]:
# read all datasets
data_path = '../data/'
hourly = pd.read_csv(data_path + 'nairobi-hexclusters-2019-1-All-HourlyAggregate.csv')
monthly = pd.read_csv(data_path + 'nairobi-hexclusters-2019-2-All-MonthlyAggregate.csv')
weekly = pd.read_csv(data_path + 'nairobi-hexclusters-2019-2-WeeklyAggregate.csv')


version = 'V0'
train = pd.read_csv(data_path+'processed_data/train{}.csv'.format(version))
test = pd.read_csv(data_path+'processed_data/test{}.csv'.format(version))
# concatenate train and test so the same steps can be applied to both;
# ignore_index avoids duplicate row labels in the combined frame
train['train'] = 1
test['train'] = 0
all_data = pd.concat([train, test], ignore_index=True)
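
The flag-and-concatenate pattern used here can be checked on toy data. This is a minimal sketch with made-up frames (`toy_train`, `toy_test` and their values are hypothetical); it shows that the `train` flag lets us recover both parts after shared processing.

```python
import pandas as pd

toy_train = pd.DataFrame({'distance': [4, 16], 'target': [745, 1993]})
toy_test = pd.DataFrame({'distance': [3]})

# Flag the origin of each row, then stack the two frames.
toy_train['train'] = 1
toy_test['train'] = 0
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# Any shared feature engineering is applied once to the combined frame.
toy_all['distance_sq'] = toy_all['distance'] ** 2

# Split back using the flag, then drop it (the test part has no target).
train_back = toy_all[toy_all['train'] == 1].drop(columns='train')
test_back = toy_all[toy_all['train'] == 0].drop(columns=['train', 'target'])
print(train_back)
print(test_back)
```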

In [None]:
# store the destination and pickup coordinates as (longitude, latitude) tuples
all_data['coordinatesDestination'] = list(zip(all_data['Destination Long'], all_data['Destination Lat']))
all_data['coordinatesPickup'] = list(zip(all_data['Pickup Long'], all_data['Pickup Lat']))

In [None]:
# transform the GeoJSON geographical boundaries file into a DataFrame
movement_id, geometry = [], []
with open(data_path + 'nairobi_hexclusters.json') as json_file:
    data = json.load(json_file)
for result in data['features']:
    movement_id.append(result['properties']['MOVEMENT_ID'])
    # first element of 'coordinates' (the outer ring for a Polygon geometry)
    geometry.append(result['geometry']['coordinates'][0])
df = pd.DataFrame({'movement_id': movement_id, 'geometry': geometry})
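
To make the expected file layout concrete, here is a small sketch that parses a hand-written GeoJSON string with the same keys the cell above reads (`features`, `properties.MOVEMENT_ID`, `geometry.coordinates`). The two squares and their ids are invented for illustration; the real file holds the hexagonal clusters.

```python
import json

# Hypothetical two-zone boundaries file, written inline for the example.
toy_geojson = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"MOVEMENT_ID": "1"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
  {"type": "Feature", "properties": {"MOVEMENT_ID": "2"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}}
]}
"""
features = json.loads(toy_geojson)['features']

# Map each zone id to the first (outer) ring of its polygon.
rings = {f['properties']['MOVEMENT_ID']: f['geometry']['coordinates'][0]
         for f in features}
print(rings['1'])
```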

In [None]:
# associate each coordinate pair with the hexagonal cluster (polygon) that contains it
polygones = [Polygon(coords) for coords in df['geometry']]  # build each polygon once
def which_polygone(values, name_col):
    """
    input:
     values: one row of all_data
     name_col: prefix of the coordinates column to process ('Destination' or 'Pickup')
    output:
     the movement_id of the polygon containing the point, or None if no polygon matches
    """
    point = Point(values['coordinates' + name_col])
    for index, polygone in enumerate(polygones):
        if point.within(polygone):
            return df.iloc[index, 0]
    return None
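
For readers who want to see what the containment test does, it can be approximated with the classic ray-casting rule. This is a plain-Python sketch, not shapely's implementation; `point_in_ring` and `find_zone` are hypothetical helper names, the zones are toy squares, and points lying exactly on an edge are not handled carefully.

```python
def point_in_ring(point, ring):
    """Return True if point (x, y) lies inside a closed ring of (x, y) vertices.

    The ring must repeat its first vertex at the end, as GeoJSON rings do.
    """
    x, y = point
    inside = False
    # Walk over consecutive vertex pairs (the edges of the ring).
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zone(point, zones):
    """Return the id of the first zone containing point, or None."""
    for zone_id, ring in zones.items():
        if point_in_ring(point, ring):
            return zone_id
    return None


toy_zones = {
    'A': [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)],
    'B': [(1, 0), (2, 0), (2, 1), (1, 1), (1, 0)],
}
print(find_zone((0.5, 0.5), toy_zones))
print(find_zone((1.5, 0.5), toy_zones))
print(find_zone((5.0, 5.0), toy_zones))
```

A point that falls outside every zone yields `None`, which is worth keeping in mind because such rows will not match anything in the later joins.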

In [None]:
# apply which_polygone to both the destination and pickup coordinates
tqdm.pandas()
all_data['DestinationId'] = all_data.progress_apply(which_polygone, args=('Destination',), axis=1)
all_data['PickupId'] = all_data.progress_apply(which_polygone, args=('Pickup',), axis=1)

In [None]:
# rename some columns to make the join easier
all_data.rename(columns= {
    "DestinationId":"dstid",
    "PickupId":"sourceid",
    "Placement - Weekday (Mo = 1)":"dow",
    "Pickup - TimeHour":"hod"
}, inplace=True)

# cast the join columns to the same dtype
weekly.sourceid = weekly.sourceid.astype(str)
weekly.dstid = weekly.dstid.astype(str)

hourly.sourceid = hourly.sourceid.astype(str)
hourly.dstid = hourly.dstid.astype(str)

all_data.sourceid = all_data.sourceid.astype(str)
all_data.dstid = all_data.dstid.astype(str)

# left-join the weekly and hourly aggregates onto the trips;
# distinct suffixes keep the overlapping weekly and hourly columns apart
all_data = all_data.merge(weekly, on=['sourceid','dstid','dow'], how='left')
all_data = all_data.merge(hourly, on=['sourceid','dstid','hod'], how='left', suffixes=('_weekly','_hourly'))
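
The effect of the dtype cast and of the suffixes is easier to see on toy frames. The sketch below uses invented values and assumes a `mean_travel_time` column in both aggregates (that column name is an assumption, not taken from the code above); `indicator=True` is added only to show which rows found a match.

```python
import pandas as pd

# Hypothetical trips with string zone ids, as produced by the cast above.
trips = pd.DataFrame({'sourceid': ['1', '2'], 'dstid': ['2', '3'],
                      'dow': [1, 5], 'hod': [8, 17]})
toy_weekly = pd.DataFrame({'sourceid': [1], 'dstid': [2], 'dow': [1],
                           'mean_travel_time': [600.0]})
toy_hourly = pd.DataFrame({'sourceid': [1, 2], 'dstid': [2, 3], 'hod': [8, 17],
                           'mean_travel_time': [650.0, 420.0]})

# Join keys must share a dtype: a string key will not match an integer key.
for frame in (toy_weekly, toy_hourly):
    for key in ('sourceid', 'dstid'):
        frame[key] = frame[key].astype(str)

merged = trips.merge(toy_weekly, on=['sourceid', 'dstid', 'dow'], how='left')
# Distinct suffixes keep the two travel-time columns apart.
merged = merged.merge(toy_hourly, on=['sourceid', 'dstid', 'hod'], how='left',
                      suffixes=('_weekly', '_hourly'), indicator=True)
print(merged[['mean_travel_time_weekly', 'mean_travel_time_hourly', '_merge']])
```

The second trip has no weekly row, so its weekly travel time stays missing; counting such gaps after a left join is a quick way to judge how well the zones line up.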

In [None]:
# drop the helper coordinate columns, then split back into train and test
all_data = all_data.drop(columns=['coordinatesDestination', 'coordinatesPickup'])
train = all_data[all_data.train == 1].drop(columns='train')
test = all_data[all_data.train == 0].drop(columns='train')
version = 'V3'
train.to_csv(data_path + 'processed_data/train{}.csv'.format(version), index=False)
test.to_csv(data_path + 'processed_data/test{}.csv'.format(version), index=False)