**Problem description**: You are presented with a dataset containing the courier locations captured during food collection at restaurants during a time interval. This data is used to generate a group of features with the intention of modeling the busyness of a certain geographical region. The team was tasked with developing a model to predict this, and the data scientist has produced a working Proof of Concept (PoC). Now, as an ML Engineer, you are tasked with productionizing this PoC.
This notebook contains the data scientist’s code to collect and create geo-location features to describe the busyness of regions (defined as h3 hexagons), and then train an ML model. The data scientist rushed to produce the PoC notebook, so the code is not well structured for a production application. As an ML engineer, your task is to:
1. Define a structured ML pipeline project.
2. Refactor the notebook into separate files to produce an executable ML pipeline using software engineering best practices and object-oriented programming appropriately.

**Expected outcome**: A structured ML pipeline project in a Git repo. At the next stage, you will be expected to walk us through it and explain your design choices.
It should at least contain:
- Scripts for each step
- Training and prediction pipelines
- Configuration file(s)
- Dependency management
- CI/CD

**Hints**:
We suggest containerization with Docker, using GCS for storage, Vertex AI to execute the pipeline, and GitHub Actions for CI/CD. If you are more comfortable with other tools, that is fine.


Consider creating files for each step, for example, `data_collection.py`, `feature_generation.py`, `training.py`, and `prediction.py`, in addition to pipeline and config files to connect and execute the pipeline. Some features might be poorly implemented or not in use. Your focus as an ML Engineer is refactoring the notebook into a structured project, but you can highlight any implementation issues you spot.
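
One way the per-step files could be connected is sketched below. This is only an illustration under assumptions: `PipelineConfig`, `Step`, and `Pipeline` are hypothetical names that do not appear in the notebook, and the default values are simply the constants the notebook hard-codes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class PipelineConfig:
    # Hypothetical config object; defaults mirror the notebook's constants.
    input_path: str = "final_dataset.csv"
    h3_resolution: int = 7
    test_size: float = 0.33
    random_state: int = 42


@dataclass
class Step:
    # A named unit of work, e.g. data collection or feature generation.
    name: str
    run: Callable[[Any, PipelineConfig], Any]


@dataclass
class Pipeline:
    config: PipelineConfig
    steps: List[Step] = field(default_factory=list)

    def execute(self, data: Any = None) -> Any:
        # Feed the output of each step into the next one, in order.
        for step in self.steps:
            data = step.run(data, self.config)
        return data
```

Each of `data_collection.py`, `feature_generation.py`, `training.py`, and `prediction.py` could then expose one function with the `(data, config)` signature, so the training and prediction pipelines differ only in the list of steps they are given.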




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 1. Data collection

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('final_dataset.csv')
df.dropna(axis=0, inplace=True)

FileNotFoundError: ignored

In [None]:
#unique couriers
len(df.courier_id.unique())

504

In [None]:
#unique restaurants
restaurants_ids = {}
for a,b in zip(df.restaurant_lat, df.restaurant_lon):
  key = "{}_{}".format(a,b)  # renamed from `id` to avoid shadowing the built-in
  restaurants_ids[key] = {"lat": a, "lon":b}
for i,key in enumerate(restaurants_ids.keys()):
  restaurants_ids[key]['id'] = i

#labeling of restaurants
df['restaurant_id']=[restaurants_ids["{}_{}".format(a,b)]['id'] for a,b in zip(df.restaurant_lat, df.restaurant_lon)]
# number of unique restaurants
len(restaurants_ids)

268
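
The labeling above can probably be expressed without the dictionary and loops. The sketch below assumes that a restaurant is identified by its exact (`restaurant_lat`, `restaurant_lon`) pair, as in the cell above; `add_restaurant_id` is a hypothetical helper name.

```python
import pandas as pd


def add_restaurant_id(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an integer id per unique coordinate pair."""
    out = df.copy()
    # ngroup() numbers the groups 0..n-1; sort=False avoids sorting the keys.
    out["restaurant_id"] = out.groupby(
        ["restaurant_lat", "restaurant_lon"], sort=False
    ).ngroup()
    return out
```

Returning a copy rather than mutating `df` in place makes the step easier to unit test in isolation.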

# 2. Features
## 2.1 Euclidean distance to restaurant


In [None]:
import collections.abc
# calc. eucl. distances to restaurants arrays
def calc_dist(p1x, p1y, p2x, p2y):
  p1 = (p2x - p1x)**2
  p2 = (p2y - p1y)**2
  dist = np.sqrt(p1 + p2)
  return dist.tolist() if isinstance(p1x, collections.abc.Sequence) else dist

df['dist_to_restaurant'] = calc_dist(df.courier_lat, df.courier_lon, df.restaurant_lat, df.restaurant_lon)

## 2.2 Avg. Euclidean distance to restaurants

In [None]:
# calc. avg. distance to restaurants
def avg_dist_to_restaurants(courier_lat,courier_lon):
  return np.mean([calc_dist(v['lat'], v['lon'], courier_lat, courier_lon) for v in restaurants_ids.values()])

df['avg_dist_to_restaurants'] = [avg_dist_to_restaurants(lat,lon) for lat,lon in zip(df.courier_lat, df.courier_lon)]
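
The per-row list comprehension above calls `calc_dist` once per courier row and restaurant. A broadcast version is sketched below as one possible alternative; `avg_dist_to_points` is a hypothetical name, and the approach assumes the full couriers-by-restaurants distance matrix fits in memory.

```python
import numpy as np


def avg_dist_to_points(lat, lon, ref_lat, ref_lon):
    """Mean Euclidean distance (in degrees) from each point to all reference points."""
    lat = np.asarray(lat, dtype=float)[:, None]          # shape (n, 1)
    lon = np.asarray(lon, dtype=float)[:, None]
    ref_lat = np.asarray(ref_lat, dtype=float)[None, :]  # shape (1, m)
    ref_lon = np.asarray(ref_lon, dtype=float)[None, :]
    # Broadcasting yields an (n, m) matrix of pairwise distances.
    dist = np.sqrt((ref_lat - lat) ** 2 + (ref_lon - lon) ** 2)
    return dist.mean(axis=1)
```

Like `calc_dist`, this measures distance in raw degrees of latitude and longitude, not kilometers.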

## 2.3 Haversine distance to restaurant

In [None]:
from math import radians
import collections.abc
import numpy as np

def calc_haversine_dist(lat1, lon1, lat2, lon2):

  R = 6372.8    # Earth radius in kilometers (use 3959.87433 for miles)
  if isinstance(lat1, collections.abc.Sequence):
    dLat = np.array([radians(l2 - l1) for l2,l1 in zip(lat2, lat1)])
    dLon = np.array([radians(l2 - l1) for l2,l1 in zip(lon2, lon1)])
    lat1 = np.array([radians(l) for l in lat1])
    lat2 = np.array([radians(l) for l in lat2])
  else:
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

  a = np.sin(dLat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dLon/2)**2
  c = 2*np.arcsin(np.sqrt(a))
  dist = R*c
  return dist.tolist() if isinstance(lon1, collections.abc.Sequence) else dist

df['Hdist_to_restaurant'] = calc_haversine_dist(df.courier_lat.tolist(), df.courier_lon.tolist(), df.restaurant_lat.tolist(), df.restaurant_lon.tolist())
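
The two branches in `calc_haversine_dist` (sequence versus scalar) can likely be collapsed, because `np.radians` accepts both. The following is a sketch rather than a drop-in replacement; `haversine_km` is a hypothetical name and the radius is the same constant used above.

```python
import numpy as np

EARTH_RADIUS_KM = 6372.8  # same value as R in the cell above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; works on scalars or equal-length arrays."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

With this form, pandas columns can be passed directly, without the `.tolist()` conversions.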

## 2.4 Avg. Haversine distance to restaurants

In [None]:
# calc. avg. distance to restaurants
def avg_Hdist_to_restaurants(courier_lat,courier_lon):
  return np.mean([calc_haversine_dist(v['lat'], v['lon'], courier_lat, courier_lon) for v in restaurants_ids.values()])

df['avg_Hdist_to_restaurants'] = [avg_Hdist_to_restaurants(lat,lon) for lat,lon in zip(df.courier_lat, df.courier_lon)]

## 2.5 Five-Clusters embedding

In [None]:
#STEP 1 - define K & initiate data

def initiate_centroids(k, df):
    '''
    Select k data points as centroids
    k: number of centroids
    df: pandas dataframe
    '''
    centroids = df.sample(k)
    return centroids

np.random.seed(1)
k=5
df_restaurants = pd.DataFrame([{"lat": v['lat'], "lon": v['lon']} for v in restaurants_ids.values()])
centroids = initiate_centroids(k, df_restaurants)

df_couriers = pd.DataFrame({})
df_couriers['lat'] = df['courier_lat']
df_couriers['lon'] = df['courier_lon']


# STEP 2 - define distance metric : Euclidean distance
def eucl_dist(p1x,p1y,p2x,p2y):
  return calc_dist(p1x, p1y, p2x, p2y)

# STEP 3 - Centroid assignment
def centroid_assignation(df, centroids):
  assignation = []
  assign_errors = []
  centroids_list = [c for i,c in centroids.iterrows()]
  for i,obs in df.iterrows():
    # Distance from this observation to every centroid
    all_errors = [eucl_dist( centroid['lat'],
                            centroid['lon'],
                            obs['courier_lat'],
                            obs['courier_lon']) for centroid in centroids_list]

    # Get the nearest centroid (first one on ties) and the error
    nearest_centroid = int(np.argmin(all_errors))
    nearest_centroid_error = all_errors[nearest_centroid]

    # Add values to corresponding lists
    assignation.append(nearest_centroid)
    assign_errors.append(nearest_centroid_error)
  df['Five_Clusters_embedding'] =assignation
  df['Five_Clusters_embedding_error'] =assign_errors
  return df

df = centroid_assignation(df,centroids)
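
Note that the cell above performs a single assignment pass against five randomly sampled restaurants; the centroids are never updated, so this is not a full k-means fit. That assignment step can be sketched without `iterrows` as follows, where `assign_nearest_centroid` is a hypothetical name and both inputs are assumed to be arrays of (lat, lon) pairs.

```python
import numpy as np


def assign_nearest_centroid(points, centroids):
    """Return (labels, errors): index of and distance to the nearest centroid."""
    points = np.asarray(points, dtype=float)        # shape (n, 2)
    centroids = np.asarray(centroids, dtype=float)  # shape (k, 2)
    # Pairwise differences, shape (n, k, 2), reduced to an (n, k) distance matrix.
    diff = points[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    labels = dist.argmin(axis=1)  # first centroid wins on ties
    errors = dist[np.arange(len(points)), labels]
    return labels, errors
```

In the notebook's terms, `points` would be `df[['courier_lat', 'courier_lon']].to_numpy()` and `centroids` would be `centroids[['lat', 'lon']].to_numpy()`.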

## 2.6 H3 clustering

In [None]:
!pip install "h3<4"  # geo_to_h3 (used below) is the h3 v3 API; v4 renamed it to latlng_to_cell



In [None]:
import h3

resolution=7
df['courier_location_timestamp']=  pd.to_datetime(df['courier_location_timestamp'])
df['order_created_timestamp'] = pd.to_datetime(df['order_created_timestamp'])
df['h3_index'] = [h3.geo_to_h3(lat,lon,resolution) for (lat,lon) in zip(df.courier_lat, df.courier_lon)]
df['date_day_number'] = df.courier_location_timestamp.dt.day_of_year
df['date_hour_number'] = df.courier_location_timestamp.dt.hour

## 2.7 Orders busyness

In [None]:
index_list = [(i,d,hr) for (i,d,hr) in zip(df.h3_index, df.date_day_number, df.date_hour_number)]

set_indexes = list(set(index_list))
dict_indexes = {label: index_list.count(label) for label in set_indexes}
df['orders_busyness_by_h3_hour'] = [dict_indexes[i] for i in index_list]
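
`index_list.count(label)` rescans the whole list for every distinct key, which grows quadratically with the number of rows. A single-pass count is sketched below using the standard library; `count_per_key` is a hypothetical helper name.

```python
from collections import Counter


def count_per_key(keys):
    """For each key, return how many times that key occurs in the input."""
    keys = list(keys)
    counts = Counter(keys)  # one pass over the data
    return [counts[k] for k in keys]
```

Applied to the cell above, `df['orders_busyness_by_h3_hour'] = count_per_key(zip(df.h3_index, df.date_day_number, df.date_hour_number))` should give the same column.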

## 2.8 Number of restaurants per h3 index

In [None]:
restaurants_counts_per_h3_index = {a:len(b) for a,b in zip(df.groupby('h3_index')['restaurant_id'].unique().index, df.groupby('h3_index')['restaurant_id'].unique()) }
df['restaurants_per_index'] = [restaurants_counts_per_h3_index[h] for h in df.h3_index]
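
The cell above runs the same `groupby` twice and then looks each row up in a dictionary. A possible one-step equivalent is sketched here; `restaurants_per_h3` is a hypothetical name, and the column names are the ones used in the notebook.

```python
import pandas as pd


def restaurants_per_h3(df: pd.DataFrame) -> pd.Series:
    """Number of distinct restaurant_id values in each row's h3_index."""
    # transform() broadcasts the per-group result back to the original rows.
    return df.groupby("h3_index")["restaurant_id"].transform("nunique")
```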

## 2.9 Label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

def Encoder(df):
  columnsToEncode = list(df.select_dtypes(include=['category','object']))
  le = LabelEncoder()
  for feature in columnsToEncode:
      try:
          df[feature] = le.fit_transform(df[feature])
      except Exception as exc:
          print('Error encoding '+feature+': '+str(exc))
  return df

df['h3_index'] = df.h3_index.astype('category')

df= Encoder(df)
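
`Encoder` refits one `LabelEncoder` per column and keeps none of the fitted mappings, so a separate prediction pipeline has no way to reproduce the same codes. One way to keep the mapping explicit is sketched below with pandas only; `fit_categories` and `encode` are hypothetical names, and mapping unseen values to `-1` is an assumption about the desired behavior.

```python
import pandas as pd


def fit_categories(series):
    """Learn the sorted list of distinct values (to be stored with the model)."""
    return sorted(pd.Series(series).dropna().unique())


def encode(series, categories):
    """Map values to integer codes; values not in `categories` become -1."""
    return pd.Categorical(pd.Series(series), categories=categories).codes
```

The list returned by `fit_categories` at training time could be saved next to the model artifact and passed to `encode` at prediction time.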

In [None]:
df.head()

Unnamed: 0,courier_id,order_number,courier_location_timestamp,courier_lat,courier_lon,order_created_timestamp,restaurant_lat,restaurant_lon,restaurant_id,dist_to_restaurant,avg_dist_to_restaurants,Hdist_to_restaurant,avg_Hdist_to_restaurants,Five_Clusters_embedding,Five_Clusters_embedding_error,h3_index,date_day_number,date_hour_number,orders_busyness_by_h3_hour,restaurants_per_index
0,346,281289453,2021-04-02 04:30:42.328000+00:00,50.48452,-104.618876,2021-04-02 04:20:42+00:00,50.483696,-104.61435,0,0.0046,0.056686,0.333173,5.267,0,0.028487,28,92,4,422,51
1,116,280949566,2021-04-01 06:14:47.386000+00:00,50.442573,-104.550463,2021-04-01 06:05:18+00:00,50.442422,-104.550487,1,0.000152,0.064685,0.016818,5.03671,3,0.006967,17,91,6,701,75
2,110,281328578,2021-04-02 05:48:57.224000+00:00,50.49592,-104.635605,2021-04-02 05:13:26+00:00,50.496595,-104.635606,2,0.000675,0.067024,0.075033,6.284221,0,0.008998,6,92,5,941,59
3,328,281317998,2021-04-02 05:12:17.252000+00:00,50.449445,-104.611521,2021-04-02 04:59:57+00:00,50.449504,-104.611074,3,0.00045,0.045752,0.032265,3.942031,4,0.035748,22,92,5,1026,69
4,178,281314132,2021-04-02 05:15:38.266000+00:00,50.495254,-104.666383,2021-04-02 04:54:53+00:00,50.49516,-104.665733,4,0.000656,0.082413,0.047152,7.16197,0,0.021886,5,92,5,397,35


# 3. Data preparation, training & validation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X = df[['dist_to_restaurant', 'Hdist_to_restaurant', 'avg_Hdist_to_restaurants', 'date_day_number', 'restaurant_id', 'Five_Clusters_embedding', 'h3_index', 'date_hour_number', 'restaurants_per_index']]
y = df['orders_busyness_by_h3_hour']  # 1-D target avoids sklearn's DataConversionWarning

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


regr = RandomForestRegressor(max_depth=4, random_state=0, n_jobs=-1)


In [None]:
regr.fit(X_train, y_train)
regr.score(X_test, y_test)


0.9122340438119385
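
One implementation issue worth highlighting: `orders_busyness_by_h3_hour` is a count per (`h3_index`, `date_day_number`, `date_hour_number`), and all three columns are in `X`. With a random split, rows from the same group land in both train and test and share the same target, which may inflate the score above. A split that holds out the latest day(s) is sketched below as one possible check; `split_by_day` is a hypothetical helper, and it assumes the data spans more than one day.

```python
import pandas as pd


def split_by_day(df: pd.DataFrame, day_col: str = "date_day_number", n_test_days: int = 1):
    """Hold out the last n_test_days distinct days as the test set."""
    days = sorted(df[day_col].unique())
    test_days = list(days[-n_test_days:])
    is_test = df[day_col].isin(test_days)
    return df[~is_test], df[is_test]
```

Comparing the score on this split with the random-split score would show how much of the result depends on seeing the same hexagon-hour group during training.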

In [None]:
params = {
    'max_depth': [4,5],
    'min_samples_leaf': [50,75],
    'n_estimators': [100,150]
}
from sklearn.model_selection import GridSearchCV
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=regr,
                           param_grid=params,
                           cv = 3,
                           n_jobs=-1, verbose=1, scoring="r2")

In [None]:
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits



In [None]:
grid_search.best_score_

0.9599586176921253

In [None]:
rf_best = grid_search.best_estimator_
rf_best

In [None]:
rf_best.score(X_test, y_test)

0.9621476918712052