<a href="https://colab.research.google.com/github/vmkainga/Classification-Analysis-Python/blob/main/Feature_Engineering_with_Python_Violet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='#2F4F4F'>1. Defining the Question</font>

### a) Specifying the Data Analysis Question

Sendy has hired you to help predict the estimated time of delivery of orders, from the
point of driver pickup to the point of arrival at the final destination. Build a model that
predicts an accurate delivery time, from picking up a package to arriving at the final
destination.


### b) Defining the Metric for Success

We will evaluate our model using the root mean squared error (RMSE) and the coefficient of determination (R2).
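
As a rough illustration of these two metrics, the sketch below computes them with scikit-learn on a handful of made-up values. The arrays are hypothetical and are not taken from the Sendy data. RMSE is expressed in the same units as the target, while R2 is the share of the target's variance that the predictions explain.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# hypothetical delivery times in seconds (illustrative values only)
y_true = np.array([600.0, 1200.0, 900.0, 1500.0])
y_pred = np.array([650.0, 1100.0, 950.0, 1400.0])

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R2: 1 - (residual sum of squares / total sum of squares)
r2 = r2_score(y_true, y_pred)

print("RMSE:", round(rmse, 2))
print("R2:", round(r2, 4))
```

A lower RMSE and an R2 closer to 1 both indicate a better fit.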

### c) Understanding the Context 

Logistics in Sub-Saharan Africa increases the cost of manufactured goods by up to
320%, whereas in Europe it accounts for only up to 90% of the manufacturing cost. Sendy
is a business-to-business platform established in 2014 to enable businesses of all types
and sizes to transport goods more efficiently across East Africa. The company is
headquartered in Kenya with a team of more than 100 staff, focused on building practical
solutions for Africa’s dynamic transportation needs, from developing apps and web
solutions to providing dedicated support for goods on the move.


An accurate arrival time prediction will help all businesses improve their
logistics and communicate accurate delivery times to their customers. You will be
required to perform various feature engineering techniques while preparing your data for
further analysis.


### d) Recording the Experimental Design

* Defining the Research Question
* Data Importation
* Data Exploration
* Data Cleaning
* Data Analysis
* Data Preparation
* Data Modeling
* Model Evaluation
* Challenging your Solution
* Recommendations / Conclusion 

### e) Data Relevance

The data provided was relevant to answering the research question.

# <font color='#2F4F4F'>2. Data Cleaning & Preparation</font>

In [1]:
# loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
%matplotlib inline

In [2]:
# loading and previewing dataset
df = pd.read_csv('https://bit.ly/3deaKEM')
df.sample(3)

Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),Confirmation - Time,Arrival at Pickup - Day of Month,Arrival at Pickup - Weekday (Mo = 1),Arrival at Pickup - Time,Pickup - Day of Month,Pickup - Weekday (Mo = 1),Pickup - Time,Arrival at Destination - Day of Month,Arrival at Destination - Weekday (Mo = 1),Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
19982,Order_No_14903,User_Id_1271,Bike,3,Business,13,3,10:09:17 AM,13,3,10:09:25 AM,13,3,10:09:31 AM,13,3,10:43:26 AM,13,3,10:56:29 AM,15,24.8,,-1.324488,36.897792,-1.222419,36.886882,Rider_Id_874,783
2815,Order_No_13001,User_Id_1271,Bike,3,Business,14,4,5:28:29 PM,14,4,5:30:38 PM,14,4,5:52:57 PM,14,4,5:56:10 PM,14,4,6:16:06 PM,11,21.3,,-1.324488,36.897792,-1.326825,36.864345,Rider_Id_549,1196
12051,Order_No_718,User_Id_1891,Bike,3,Business,29,2,10:43:50 AM,29,2,10:44:45 AM,29,2,10:50:09 AM,29,2,10:52:51 AM,29,2,11:12:13 AM,6,23.4,,-1.260698,36.808863,-1.250646,36.846258,Rider_Id_72,1162
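
The time columns in this preview are 12-hour clock strings such as `10:09:17 AM`, which a model cannot use directly. One possible feature engineering step, sketched below on a toy frame that reuses two column names and three values from the preview, is to convert them to seconds since midnight and derive an hour-of-day feature. The helper `to_seconds` and the new column names are hypothetical, and the sketch assumes both events fall on the same day.

```python
import pandas as pd

# toy frame reusing two time columns and values from the preview
toy = pd.DataFrame({
    "Placement - Time": ["10:09:17 AM", "5:28:29 PM", "10:43:50 AM"],
    "Pickup - Time": ["10:43:26 AM", "5:56:10 PM", "10:52:51 AM"],
})

def to_seconds(col):
    """Convert 12-hour clock strings to seconds since midnight."""
    t = pd.to_datetime(col, format="%I:%M:%S %p")
    return t.dt.hour * 3600 + t.dt.minute * 60 + t.dt.second

# hypothetical engineered features
toy["pickup_seconds"] = to_seconds(toy["Pickup - Time"])
toy["pickup_hour"] = toy["pickup_seconds"] // 3600
toy["placement_to_pickup_seconds"] = (
    toy["pickup_seconds"] - to_seconds(toy["Placement - Time"])
)

print(toy)
```

If an order could span midnight, the day-of-month columns would also have to enter the calculation.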


In [5]:
# loading glossary
glossary = pd.read_csv('https://bit.ly/30O3xsr', header = None)
glossary

Unnamed: 0,0,1
0,Order No,Unique number identifying the order
1,User Id,Unique number identifying the customer on a platform
2,Vehicle Type,"For this competition limited to bikes, however in practice Sendy service extends to trucks and vans"
3,Platform Type,"Platform used to place the order, there are 4 types"
4,Personal or Business,Customer type
5,Placement - Day of Month,Placement - Day of Month i.e 1-31
6,Placement - Weekday (Mo = 1),Placement - Weekday (Monday = 1)
7,Placement - Time,Placement - Time - Time of day the order was placed
8,Confirmation - Day of Month,Confirmation - Day of Month i.e 1-31
9,Confirmation - Weekday (Mo = 1),Confirmation - Weekday (Monday = 1)


In [None]:
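# dropping the 'name' variable; errors = 'ignore' skips it if the column is absent
# (no 'name' column appears in the dataset preview above)
df.drop(columns = ['name'], inplace = True, errors = 'ignore')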

In [None]:
# checking dataset shape
df.shape

In [None]:
# checking data types
df.dtypes

In [None]:
# dropping duplicates, if any
df.drop_duplicates(inplace = True)
df.shape

In [None]:
# checking for missing data
df.isna().sum()

In [None]:
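# dropping the 'neighbourhood_group' variable (skipped if the column is absent),
# then dropping the records that still have missing values
# note: dropna() removes every row with at least one missing value, so check
# df.isna().sum() first; 'Precipitation in millimeters' is empty in the preview rows
df.drop(columns = ['neighbourhood_group'], inplace = True, errors = 'ignore')
df.dropna(inplace = True)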

# confirming we have no null values
df.isnull().sum()

In [None]:
df.shape

In [None]:
# get the unique value of each variable to ensure there are no anomalies
cols = df.columns.to_list()

for col in cols:
    print("Variable:", col)
    print("Number of unique values:", df[col].nunique())
    print(df[col].unique())
    print()

In [None]:
# visualizing the distribution of outliers
plt.figure(figsize = (15, 8))
df.boxplot()
plt.show()

In [None]:
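# getting the records with outliers (IQR rule, applied to numeric columns only)
num_df = df.select_dtypes(include = 'number')
q1 = num_df.quantile(0.25)
q3 = num_df.quantile(0.75)
iqr = q3 - q1

# a record is flagged if any of its numeric values falls outside the 1.5 * IQR fences
outliers_df = df[((num_df < (q1 - 1.5 * iqr)) | (num_df > (q3 + 1.5 * iqr))).any(axis = 1)]
print(outliers_df.shape)
outliers_df.sample(3)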

In [None]:
# calculating percentage of outliers
round((outliers_df.shape[0] / df.shape[0]) * 100, 2)

Dropping the records with outliers would be tricky, since it would cut our dataset roughly in half, so we will keep them. However, we will drop the host_id variable later on, right before modeling.

# <font color='#2F4F4F'>3. Data Analysis</font>

## 3.1 Univariate Analysis 

In [None]:
# getting the top 10 most common hosts
plt.figure(figsize = (10, 6))
df.host_name.value_counts()[:10].plot(kind = 'bar', rot = 0)
plt.title("Top 10 Hosts")
plt.show()

In [None]:
# getting the top 10 most common neighbourhoods
plt.figure(figsize = (10, 6))
df.neighbourhood.value_counts()[:10].plot(kind = 'bar', rot = 25)
plt.xticks(ha = "right")
plt.title("Top 10 Neighbourhoods")
plt.show()

In [None]:
df.room_type.value_counts()

In [None]:
# getting the most common room types
plt.figure(figsize = (6, 6))
# value_counts() carries the category names in its index, so the slices are labelled
# from the data rather than from a hard-coded list that may be in a different order
df.room_type.value_counts().plot(kind = 'pie', autopct = '%0.1f%%')
plt.show()

In [None]:
# distribution of price
# sns.distplot is deprecated; histplot with kde = True is the closest replacement
plt.figure(figsize = (10,4))
sns.histplot(df['price'], kde = True)
plt.show()

In [None]:
# getting the top 10 most common minimum number of nights to spend
plt.figure(figsize = (10, 5))
df.minimum_nights.value_counts()[:10].plot(kind = 'bar', rot = 0)
plt.show()

## 3.2 Bivariate Analysis

In [None]:
# price by room type
df.hist('price', by = 'room_type', rot = 0, figsize = (10, 6))
plt.show()

In [None]:
# average price by neighbourhood
df.groupby('neighbourhood')['price'].mean().sort_values(ascending = False)

In [None]:
# average price by minimum number of nights
df.groupby('minimum_nights')['price'].mean().sort_values(ascending = False)

## 3.3 Feature Engineering

In [None]:
# getting the average price per room type
YOUR CODE HERE

# adding to our dataset
YOUR CODE HERE

# previewing our modified dataset
YOUR CODE HERE

In [None]:
# getting the average price per neighbourhood
YOUR CODE HERE

# adding to our dataset
YOUR CODE HERE

# previewing our modified dataset
YOUR CODE HERE
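
One way to approach the two group-average cells above is `groupby(...).transform('mean')`, which returns one value per row and so can be assigned straight back as a new column. The sketch below uses a toy frame with the column names from the cell comments. The values and the new column names (`avg_price_room_type`, `avg_price_neighbourhood`) are hypothetical.

```python
import pandas as pd

# toy data with the column names used in the cell comments (values are made up)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Entire home/apt"],
    "neighbourhood": ["A", "A", "B", "B"],
    "price": [40, 100, 60, 140],
})

# group-level mean, broadcast back to every row of its group
toy["avg_price_room_type"] = toy.groupby("room_type")["price"].transform("mean")
toy["avg_price_neighbourhood"] = toy.groupby("neighbourhood")["price"].transform("mean")

# previewing the modified dataset
print(toy.head())
```

If the averaged column is also the modeling target, computing these averages on the full dataset leaks test-set information into the features. A safer variant would compute them on the training split only and map them onto the test split.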

In [None]:
# encoding 'room_type'
YOUR CODE HERE
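
For the encoding step, one common option is one-hot encoding with `pd.get_dummies`. This is a minimal sketch on made-up rows, not necessarily the encoding the exercise intends. The `room` prefix is an arbitrary choice.

```python
import pandas as pd

# toy column with three categories (values are made up)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
})

# one indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["room_type"], prefix="room", dtype=int)

print(encoded)
```

Tree-based regressors also work with integer label encoding, so one-hot encoding is a design choice rather than a requirement here.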

In [None]:
# dropping unneeded columns in preparation for modeling
YOUR CODE HERE

# <font color='#2F4F4F'>4. Data Modeling</font>

In [None]:
# split into features (X) and label (Y)
YOUR CODE HERE

In [None]:
# split into 70-30 train and test sets
YOUR CODE HERE

In [None]:
# scaling our features 
YOUR CODE HERE
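
The three preparation cells above could look roughly like the following sketch. It runs on synthetic data, and the column names (`distance_km`, `temperature`, `target`) are hypothetical stand-ins. The key point is that the scaler is fitted on the training split only and then applied to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (not the notebook's dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "distance_km": rng.uniform(1, 20, size=100),
    "temperature": rng.uniform(15, 30, size=100),
})
toy["target"] = 120 * toy["distance_km"] + rng.normal(0, 60, size=100)

# split into features (X) and label (y)
X = toy.drop(columns=["target"])
y = toy["target"]

# 70-30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# fit the scaler on the training features only, then reuse it on the test features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Decision trees and random forests are insensitive to feature scaling, so this step mainly matters if other model families are added later.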

For purposes of simplicity, we will work with the following regressors:
* Decision Tree Regressor
* Random Forest Regressor

## 4.1 Normal Modeling

In [None]:
# loading our regressors
YOUR CODE HERE

# instantiating our regressors
YOUR CODE HERE

# fitting to our training data
YOUR CODE HERE

# making predictions
YOUR CODE HERE

# evaluating the RMSE and R2 scores
YOUR CODE HERE

In [None]:
# 10% of target variable's mean
YOUR CODE HERE
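
A minimal sketch of the baseline cells above, using synthetic data from `make_regression` in place of the prepared dataset. The variable names and the shift applied to the target are assumptions made only so the example runs on its own. The final line computes 10% of the target's mean, presumably intended as a yardstick against which to read the RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data; the target is shifted so its mean is clearly positive
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
y = y + 1000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# instantiating the two regressors with default hyperparameters
regressors = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

baseline_scores = {}
for name, model in regressors.items():
    model.fit(X_train, y_train)        # fitting to the training data
    y_pred = model.predict(X_test)     # making predictions
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    baseline_scores[name] = (rmse, r2)
    print(f"{name}: RMSE = {rmse:.2f}, R2 = {r2:.3f}")

# 10% of the target variable's mean
threshold = 0.1 * y.mean()
print("10% of target mean:", round(threshold, 2))
```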

Record your observations.

## 4.2 Modeling with Grid Search

In [None]:
# setting our grid parameters
YOUR CODE HERE

# setting up the Grid Search with our regressors with cv = 5 and n_jobs = -1
YOUR CODE HERE

# fitting to training data
YOUR CODE HERE

# getting the best parameters
YOUR CODE HERE

In [None]:
# implementing this recommendation

# instantiating our regressors with the recommended parameters
YOUR CODE HERE

# fitting to our training data
YOUR CODE HERE

# making predictions
YOUR CODE HERE

# evaluating the RMSE and R2 scores
YOUR CODE HERE
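
The grid search cells above might be filled in along these lines. This is a sketch on synthetic data. The grids are deliberately tiny, and their values are assumptions rather than tuned recommendations. Because `GridSearchCV` refits the best configuration on the whole training set by default, `best_estimator_` can be evaluated directly instead of re-instantiating the regressor by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (not the notebook's dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# one exhaustive search per regressor, with cv = 5 and n_jobs = -1
searches = {
    "Decision Tree": GridSearchCV(
        DecisionTreeRegressor(random_state=42),
        param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
        cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error",
    ),
    "Random Forest": GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
        cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error",
    ),
}

grid_results = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    # best_estimator_ is already refit on the full training set (refit=True)
    y_pred = search.best_estimator_.predict(X_test)
    grid_results[name] = {
        "params": search.best_params_,
        "rmse": np.sqrt(mean_squared_error(y_test, y_pred)),
        "r2": r2_score(y_test, y_pred),
    }
    print(name, grid_results[name])
```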

## 4.3 Modeling with Random Search

In [None]:
# setting up our parameters and the respective distributions to sample from
YOUR CODE HERE

# setting up Randomized Search for each regressor with cv = 5
YOUR CODE HERE

# fitting to training data
YOUR CODE HERE

# getting the best parameters
YOUR CODE HERE

In [None]:
# implementing this recommendation

# instantiating our regressors using the recommended parameters
YOUR CODE HERE

# fitting to our training data
YOUR CODE HERE

# making predictions
YOUR CODE HERE

# evaluating the RMSE and R2 scores
YOUR CODE HERE
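
Random search follows the same pattern as grid search, but samples a fixed number of configurations from distributions instead of trying every combination. The sketch below uses `scipy.stats.randint` for integer hyperparameters. The ranges and `n_iter = 5` are assumptions chosen to keep the example fast, and the data is synthetic.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (not the notebook's dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# parameters and the distributions to sample from; randint(a, b) draws from a..b-1
candidates = {
    "Decision Tree": (
        DecisionTreeRegressor(random_state=42),
        {"max_depth": randint(2, 11), "min_samples_leaf": randint(1, 11)},
    ),
    "Random Forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": randint(20, 101), "max_depth": randint(2, 11)},
    ),
}

random_results = {}
for name, (estimator, distributions) in candidates.items():
    search = RandomizedSearchCV(
        estimator, param_distributions=distributions, n_iter=5, cv=5,
        scoring="neg_root_mean_squared_error", random_state=42,
    )
    search.fit(X_train, y_train)
    y_pred = search.best_estimator_.predict(X_test)
    random_results[name] = {
        "params": search.best_params_,
        "rmse": np.sqrt(mean_squared_error(y_test, y_pred)),
        "r2": r2_score(y_test, y_pred),
    }
    print(name, random_results[name])
```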

## 4.4 Modeling with Bayesian Optimization

In [None]:
# importing the hyperopt library and cross_val_score
YOUR CODE HERE

# setting up a space dictionary
YOUR CODE HERE

# setting up our objective functions
YOUR CODE HERE

# running our optimizers and setting max_evals to 100
YOUR CODE HERE

# printing our outcomes
YOUR CODE HERE

In [None]:
# instantiating our regressors using the recommended parameters
YOUR CODE HERE

# fitting to our training data
YOUR CODE HERE

# making predictions
YOUR CODE HERE

# evaluating the RMSE and R2 scores
YOUR CODE HERE

# <font color='#2F4F4F'>5. Summary of Findings</font>

Summarize your findings.

# <font color='#2F4F4F'>6. Recommendations</font>

Provide your recommendations.

# <font color='#2F4F4F'>7. Challenging your Solution</font>

### a) Did we have the right question?


### b) Did we have the right data?


### c) What can be done to improve the solution?
