# **Automatidata project**
**Course 6 - The Nuts and Bolts of Machine Learning**

You are a data professional in a data analytics firm called Automatidata. Their client, the New York City Taxi & Limousine Commission (New York City TLC), was impressed with the work you have done and has requested that you build a machine learning model to predict if a customer will not leave a tip. They want to use the model in an app that will alert taxi drivers to customers who are unlikely to tip, since drivers depend on tips.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# Course 6 End-of-course project: Build a machine learning model

In this activity, you will practice using tree-based modeling techniques to predict on a binary target class.  
<br/>   

**The purpose** of this model is to find ways to generate more revenue for taxi cab drivers.  
  
**The goal** of this model is to predict whether or not a customer is a generous tipper.  
<br/>  

*This activity has three parts:*

**Part 1:** Ethical considerations 
* Consider the ethical implications of the request 

* Should the objective of the model be adjusted?

**Part 2:** Feature engineering

* Perform feature selection, extraction, and transformation to prepare the data for modeling

**Part 3:** Modeling

* Build the models, evaluate them, and advise on next steps

Follow the instructions and answer the questions below to complete the activity. Then, complete an Executive Summary using the questions listed on the PACE Strategy Document. 

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work. 



# Build a machine learning model

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## PACE: Plan 

Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

In this stage, consider the following questions:

1.   What are you being asked to do?


2.   What are the ethical implications of the model? What are the consequences of your model making errors?
  *   What is the likely effect of the model when it predicts a false negative (i.e., when the model says a customer will give a tip, but they actually won't)?
  
  *   What is the likely effect of the model when it predicts a false positive (i.e., when the model says a customer will not give a tip, but they actually will)?  
  
  
3.   Do the benefits of such a model outweigh the potential problems?
  
4.   Would you proceed with the request to build this model? Why or why not?
 
5.   Can the objective be modified to make it less problematic?
 


**What are you being asked to do?**

I’m being asked to build a machine learning model to predict whether a taxi passenger will not leave a tip. This model is intended for use in a driver-facing app to alert taxi drivers about likely non-tippers.

**What are the ethical implications of the model? What are the consequences of your model making errors?**

There are serious ethical concerns:

* The model could lead to bias or discrimination, especially if features like location or time of day correlate with certain socioeconomic or demographic groups.
* Drivers might alter their behavior (e.g., provide lower quality service or avoid picking up certain passengers), which could worsen customer experience and reinforce unfair treatment.

**What is the likely effect of a false negative (model predicts a tip when there is none)?**

The driver expects a tip but doesn't receive one. This may lead to frustration or disappointment, but the passenger would still receive unbiased service if the driver relied on the model’s advice.

**What is the likely effect of a false positive (model predicts no tip, but customer actually tips)?**

This is more harmful. Drivers might provide poorer service or be less respectful, leading to a poor experience for generous customers, potentially harming the driver’s own earnings or the company’s reputation.

**Do the benefits of such a model outweigh the potential problems?**

In its current form, no. The risks of reinforcing unfair treatment or discrimination are high. The potential for harm outweighs the operational benefit of "warning" drivers about non-tippers.

**Would you proceed with the request to build this model? Why or why not?**

Not as currently defined. Predicting non-tippers introduces ethical risks, including reinforcing bias and affecting customer experience. A better approach would be to encourage positive behaviors (e.g., identifying generous tippers) rather than penalizing perceived negative ones.

**Can the objective be modified to make it less problematic?**

Yes. The model could instead be used to identify highly generous tippers. This would shift the model’s purpose from warning to positive reinforcement, helping drivers optimize their service for high-value passengers without biasing them against others.

Suppose you were to modify the modeling objective so, instead of predicting people who won't tip at all, you predicted people who are particularly generous&mdash;those who will tip 20% or more? Consider the following questions:

1.  What features do you need to make this prediction?

2.  What would be the target variable?  

3.  What metric should you use to evaluate your model? Do you have enough information to decide this now?


**1. What features do you need to make this prediction?**

To predict whether a passenger will tip 20% or more, the model should consider variables that are available at the time of or shortly after the ride:

* Fare amount
* Trip distance
* Trip duration
* Time of day
* Day of week
* Pickup and drop-off locations (as zones or boroughs)
* Passenger count
* Payment type (e.g., credit card, cash)
* Trip type (e.g., airport ride, short trip, etc.)
* Vendor ID or driver ID (optional, if driver behavior influences tipping)

These features provide useful context for ride characteristics that might influence tipping behavior.

**2. What would be the target variable?**

The target variable would be binary, representing whether a tip is equal to or greater than 20% of the fare:

`target = 1 if tip_amount / fare_amount >= 0.20 else 0`

This allows us to frame the problem as a binary classification task (generous tipper vs. not).

**3. What metric should you use to evaluate your model? Do you have enough information to decide this now?**

We should prioritize classification metrics suited for imbalanced datasets, as generous tippers likely make up a smaller portion of the data.

Recommended metrics:

* Precision – to measure how many of the predicted generous tippers were actually generous.
* Recall – to assess how well the model captures actual generous tippers.
* F1 Score – the harmonic mean of precision and recall, useful if we want a balance between the two.
* ROC AUC Score – to evaluate the model’s ability to separate classes across all thresholds.

Yes, we have enough information now to narrow the choice to F1 Score or ROC AUC. The final pick depends on the actual class balance and on the relative cost of each type of error.




**_Complete the following steps to begin:_**

### **Task 1. Imports and data loading**

Import packages and libraries needed to build and evaluate random forest and XGBoost classification models.

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for preprocessing, model building, and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

# XGBoost
import xgboost as xgb

# Suppress warnings (optional for cleaner output)
import warnings
warnings.filterwarnings('ignore')

In [2]:
# RUN THIS CELL TO SEE ALL COLUMNS 
# This lets us see all of the columns, preventing Jupyter from truncating them.
pd.set_option('display.max_columns', None)

Begin by reading in the data. There are two dataframes: one containing the original data, and the other, `nyc_preds_means.csv`, containing the mean durations, mean distances, and predicted fares from the previous course's project.

**Note:** `pandas` reads in the dataset as `df0`. As shown in this cell, the dataset has been automatically loaded for you, so you do not need to download the .csv file or provide more code to access it and proceed with this lab. Once it is loaded, inspect the first five rows.

In [3]:
# RUN THE CELL BELOW TO IMPORT YOUR DATA. 

# Load dataset into dataframe
df0 = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')

# Import predicted fares and mean distance and duration from previous course
nyc_preds_means = pd.read_csv('nyc_preds_means.csv')

Inspect the first few rows of `df0`.


In [4]:
# Inspect the first few rows of df0
print("First few rows of df0:")
print(df0.head())

First few rows of df0:
   Unnamed: 0  VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  \
0    24870114         2   03/25/2017 8:55:43 AM   03/25/2017 9:09:47 AM   
1    35634249         1   04/11/2017 2:53:28 PM   04/11/2017 3:19:58 PM   
2   106203690         1   12/15/2017 7:26:56 AM   12/15/2017 7:34:08 AM   
3    38942136         2   05/07/2017 1:17:59 PM   05/07/2017 1:48:14 PM   
4    30841670         2  04/15/2017 11:32:20 PM  04/15/2017 11:49:03 PM   

   passenger_count  trip_distance  RatecodeID store_and_fwd_flag  \
0                6           3.34           1                  N   
1                1           1.80           1                  N   
2                1           1.00           1                  N   
3                1           3.70           1                  N   
4                1           4.37           1                  N   

   PULocationID  DOLocationID  payment_type  fare_amount  extra  mta_tax  \
0           100           231            

Inspect the first few rows of `nyc_preds_means`.

In [5]:
# Inspect the first few rows of nyc_preds_means
print("\nFirst few rows of nyc_preds_means:")
print(nyc_preds_means.head())


First few rows of nyc_preds_means:
   mean_duration  mean_distance  predicted_fare
0      22.847222       3.521667       16.434245
1      24.470370       3.108889       16.052218
2       7.250000       0.881429        7.053706
3      30.250000       3.700000       18.731650
4      14.616667       4.435000       15.845642


#### Join the two dataframes

Join the two dataframes using a method of your choice.
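A column-wise join is only safe when both frames have the same number of rows in the same order. The sketch below illustrates that check on two toy frames; `trips` and `preds` are hypothetical stand-ins for the real dataframes, and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the trip data and the predictions file (hypothetical values).
trips = pd.DataFrame({"fare_amount": [13.0, 16.0, 6.5]})
preds = pd.DataFrame({"predicted_fare": [16.43, 16.05, 7.05]})

# Guard against silent misalignment before joining side by side.
if len(trips) != len(preds) or not trips.index.equals(preds.index):
    raise ValueError("Frames are not row-aligned; a column-wise join would mix up rows.")

# Option 1: concatenate column-wise (aligns on the index).
joined = pd.concat([trips, preds], axis=1)

# Option 2: merge on the index, which gives the same columns for aligned frames.
merged = trips.merge(preds, left_index=True, right_index=True)

print(joined)
```

Either option works here because the two frames share the same default index; if the indexes differed, `pd.concat` would introduce missing values rather than raise an error, which is why the explicit check can be worthwhile.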

In [6]:
# Concatenate the two DataFrames column-wise
df = pd.concat([df0, nyc_preds_means], axis=1)

# Check the first few rows of the merged DataFrame
print(df.head())

   Unnamed: 0  VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  \
0    24870114         2   03/25/2017 8:55:43 AM   03/25/2017 9:09:47 AM   
1    35634249         1   04/11/2017 2:53:28 PM   04/11/2017 3:19:58 PM   
2   106203690         1   12/15/2017 7:26:56 AM   12/15/2017 7:34:08 AM   
3    38942136         2   05/07/2017 1:17:59 PM   05/07/2017 1:48:14 PM   
4    30841670         2  04/15/2017 11:32:20 PM  04/15/2017 11:49:03 PM   

   passenger_count  trip_distance  RatecodeID store_and_fwd_flag  \
0                6           3.34           1                  N   
1                1           1.80           1                  N   
2                1           1.00           1                  N   
3                1           3.70           1                  N   
4                1           4.37           1                  N   

   PULocationID  DOLocationID  payment_type  fare_amount  extra  mta_tax  \
0           100           231             1         13.0    0.0 

<img src="images/Analyze.png" width="100" height="100" align=left>

## PACE: **Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2. Feature engineering**

You have already prepared much of this data and performed exploratory data analysis (EDA) in previous courses. 

Call `info()` on the new combined dataframe.

In [7]:
# Display summary information about the combined DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  to

You know from your EDA that customers who pay cash generally have a tip amount of $0. To meet the modeling objective, you'll need to sample the data to select only the customers who pay with credit card. 

Copy `df0` and assign the result to a variable called `df1`. Then, use a Boolean mask to filter `df1` so it contains only customers who paid with credit card.

In [8]:
# Copy the original DataFrame
df1 = df0.copy()

# Filter the DataFrame to include only credit card payments (payment_type == 1)
df1 = df1[df1['payment_type'] == 1]

# Optional: Check the result
print(df1['payment_type'].unique())
print(df1.shape)

[1]
(15265, 18)


##### **Target**

Notice that there isn't a column that indicates tip percent, which is what you need to create the target variable. You'll have to engineer it. 

Add a `tip_percent` column to the dataframe by performing the following calculation:  
<br/>  


$$tip\ percent = \frac{tip\ amount}{total\ amount - tip\ amount}$$  

Round the result to three places beyond the decimal. **This is an important step.** It affects how many customers are labeled as generous tippers. In fact, without performing this step, approximately 1,800 people who do tip ≥ 20% would be labeled as not generous. 

To understand why, you must consider how floats work. Computers make their calculations using floating-point arithmetic (hence the word "float"). Floating-point arithmetic is a system that allows computers to express both very large numbers and very small numbers with a high degree of precision, encoded in binary. However, precision is limited by the number of bits used to represent a number, which is generally 32 or 64, depending on the capabilities of your operating system. 

This comes with limitations in that sometimes calculations that should result in clean, precise values end up being encoded as very long decimals. Take, for example, the following calculation:


In [9]:
# Run this cell
1.1 + 2.2

3.3000000000000003

Notice the three that is 16 places to the right of the decimal. As a consequence, if you were to then have a step in your code that identifies values ≤ 3.3, this would not be included in the result. Therefore, whenever you perform a calculation to compute a number that is then used to make an important decision or filtration, round the number. How many degrees of precision you round to is your decision, which should be based on your use case. 

Refer to this [guide for more information related to floating-point arithmetic](https://floating-point-gui.de/formats/fp/).  
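To make the pitfall concrete, here is a minimal sketch that compares a raw floating-point sum against a threshold, then repeats the comparison after rounding. The variable names are illustrative only, and the numbers are the same toy values used in the cell above rather than dataset values.

```python
# Toy values, not taken from the dataset.
total = 1.1 + 2.2

# The raw sum is slightly larger than 3.3, so this comparison is False.
raw_check = total <= 3.3

# Rounding first removes the tiny representation error, so this is True.
rounded_check = round(total, 3) <= 3.3

print(total, raw_check, rounded_check)
```

The same reasoning applies to the 20% tip threshold: a ratio that should equal exactly 0.2 can land a hair below it unless it is rounded before the comparison.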

In [10]:
# Create a new column for tip percent
df1['tip_percent'] = round(df1['tip_amount'] / (df1['total_amount'] - df1['tip_amount']), 3)

# Optional: Check the result
print(df1[['tip_amount', 'total_amount', 'tip_percent']].head())

   tip_amount  total_amount  tip_percent
0        2.76         16.56        0.200
1        4.00         20.80        0.238
2        1.45          8.75        0.199
3        6.39         27.69        0.300
5        2.06         12.36        0.200


Now create another column called `generous`. This will be the target variable. The column should be a binary indicator of whether or not a customer tipped ≥ 20% (0=no, 1=yes).

1. Begin by making the `generous` column a copy of the `tip_percent` column.
2. Reassign the column by converting it to Boolean (True/False).
3. Reassign the column by converting Boolean to binary (1/0).

In [11]:
# Step 1: Copy the tip_percent column
df1['generous'] = df1['tip_percent']

# Step 2: Convert to Boolean: True if tip_percent ≥ 0.20
df1['generous'] = df1['generous'] >= 0.20

# Step 3: Convert Boolean to binary: True → 1, False → 0
df1['generous'] = df1['generous'].astype(int)

# Optional: Check the result
print(df1[['tip_percent', 'generous']].head())

   tip_percent  generous
0        0.200         1
1        0.238         1
2        0.199         0
3        0.300         1
5        0.200         1


<details>
  <summary><h5>HINT</h5></summary>

To convert from Boolean to binary, use `.astype(int)` on the column.
</details>
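The three steps can also be collapsed into a single expression. The sketch below assumes a Series of already-rounded tip percentages; `tip_percent` here is a small hypothetical Series, not the real column.

```python
import pandas as pd

# Hypothetical, already-rounded tip percentages.
tip_percent = pd.Series([0.200, 0.238, 0.199, 0.300])

# Compare against the threshold (Boolean), then cast to 1/0 in one step.
generous = (tip_percent >= 0.2).astype(int)

print(generous.tolist())
```

Because the comparison returns a new Series instead of overwriting a column in place, this form can be rerun without changing the result.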

#### Create day column

Next, you're going to be working with the pickup and dropoff columns.

Convert the `tpep_pickup_datetime` and `tpep_dropoff_datetime` columns to datetime.

In [12]:
# Convert pickup and dropoff columns to datetime format
df1['tpep_pickup_datetime'] = pd.to_datetime(df1['tpep_pickup_datetime'])
df1['tpep_dropoff_datetime'] = pd.to_datetime(df1['tpep_dropoff_datetime'])

Create a `day` column that contains only the day of the week when each passenger was picked up. Then, convert the values to lowercase.

In [13]:
# Create a 'day' column with lowercase day names
df1['day'] = df1['tpep_pickup_datetime'].dt.day_name().str.lower()


<details>
  <summary><h5>HINT</h5></summary>

To convert to day name, use `dt.day_name()` on the column.
</details>

#### Create time of day columns

Next, engineer four new columns that represent time of day bins. Each column should contain binary values (0=no, 1=yes) that indicate whether a trip began (picked up) during the following times:

`am_rush` = [06:00&ndash;10:00)  
`daytime` = [10:00&ndash;16:00)  
`pm_rush` = [16:00&ndash;20:00)  
`nighttime` = [20:00&ndash;06:00)  

To do this, first create the four columns. For now, each new column should be identical and contain the same information: the hour (only) from the `tpep_pickup_datetime` column.

In [14]:
# Extract the hour from the pickup datetime
pickup_hour = df1['tpep_pickup_datetime'].dt.hour

# Create time-of-day columns and initially fill them with the hour
df1['am_rush'] = pickup_hour
df1['daytime'] = pickup_hour
df1['pm_rush'] = pickup_hour
df1['nighttime'] = pickup_hour

# Optional: Check the output
print(df1[['tpep_pickup_datetime', 'am_rush', 'daytime', 'pm_rush', 'nighttime']].head())

  tpep_pickup_datetime  am_rush  daytime  pm_rush  nighttime
0  2017-03-25 08:55:43        8        8        8          8
1  2017-04-11 14:53:28       14       14       14         14
2  2017-12-15 07:26:56        7        7        7          7
3  2017-05-07 13:17:59       13       13       13         13
5  2017-03-25 20:34:11       20       20       20         20


You'll need to write four functions to convert each new column to binary (0/1). Begin with `am_rush`. Complete the function so if the hour is between [06:00–10:00), it returns 1, otherwise, it returns 0.

In [15]:
# Define 'am_rush()' conversion function [06:00–10:00)
def am_rush(hour):
    if 6 <= hour < 10:
        return 1
    else:
        return 0

Now, apply the `am_rush()` function to the `am_rush` series to perform the conversion. Print the first five values of the column to make sure it did what you expected it to do.

**Note:** Be careful! If you run this cell twice, the function will be reapplied and the values will all be changed to 0.

In [16]:
# Apply 'am_rush' function to the 'am_rush' series
df1['am_rush'] = df1['am_rush'].apply(am_rush)

# Print the first five values to verify
print(df1['am_rush'].head())

0    1
1    0
2    1
3    0
5    0
Name: am_rush, dtype: int64


Write functions to convert the three remaining columns and apply them to their respective series.

In [17]:
# Define 'daytime()' conversion function [10:00–16:00)
def daytime(hour):
    return 1 if 10 <= hour < 16 else 0

In [18]:
# Apply 'daytime()' function to the 'daytime' series
df1['daytime'] = df1['daytime'].apply(daytime)

In [19]:
# Define 'pm_rush()' conversion function [16:00–20:00)
def pm_rush(hour):
    return 1 if 16 <= hour < 20 else 0

In [20]:
# Apply 'pm_rush()' function to the 'pm_rush' series
df1['pm_rush'] = df1['pm_rush'].apply(pm_rush)

In [21]:
# Define 'nighttime()' conversion function [20:00–06:00)
def nighttime(hour):
    return 1 if hour >= 20 or hour < 6 else 0

In [22]:
# Apply 'nighttime' function to the 'nighttime' series
df1['nighttime'] = df1['nighttime'].apply(nighttime)
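As an alternative to four `apply()` calls, the same bins can be derived directly from the pickup hour with vectorized comparisons. This is a sketch on a toy Series of timestamps (the `pickups` values are hypothetical); it is not the approach the notebook asks for, but it avoids the rerun problem noted earlier because each column is computed from the hour rather than overwritten in place.

```python
import pandas as pd

# Toy pickup times standing in for the real pickup datetime column.
pickups = pd.to_datetime(pd.Series([
    "2017-03-25 08:55:43",
    "2017-04-11 14:53:28",
    "2017-05-07 16:00:00",
    "2017-03-25 20:34:11",
    "2017-06-01 05:59:59",
]))
hour = pickups.dt.hour

# Each bin is a half-open interval [start, end), matching the definitions above.
bins = pd.DataFrame({
    "am_rush": ((hour >= 6) & (hour < 10)).astype(int),
    "daytime": ((hour >= 10) & (hour < 16)).astype(int),
    "pm_rush": ((hour >= 16) & (hour < 20)).astype(int),
    "nighttime": ((hour >= 20) | (hour < 6)).astype(int),  # wraps past midnight
})

print(bins)
```

Since the four intervals cover all 24 hours without overlap, every row should have exactly one bin set to 1, which makes a quick sanity check possible with `bins.sum(axis=1)`.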

#### Create `month` column

Now, create a `month` column that contains only the abbreviated name of the month when each passenger was picked up, then convert the result to lowercase.

<details>
  <summary><h5>HINT</h5></summary>

Refer to the [strftime cheatsheet](https://strftime.org/) for help.
</details>

In [23]:
# Create 'month' column with abbreviated month names (lowercase)
df1['month'] = df1['tpep_pickup_datetime'].dt.strftime('%b').str.lower()

Examine the first five rows of your dataframe.

In [24]:
# Examine the first five rows of the dataframe
print(df1[['tpep_pickup_datetime', 'month']].head())

  tpep_pickup_datetime month
0  2017-03-25 08:55:43   mar
1  2017-04-11 14:53:28   apr
2  2017-12-15 07:26:56   dec
3  2017-05-07 13:17:59   may
5  2017-03-25 20:34:11   mar


#### Drop columns

Drop redundant and irrelevant columns as well as those that would not be available when the model is deployed. This includes information like payment type, trip distance, tip amount, tip percentage, total amount, toll amount, etc. The target variable (`generous`) must remain in the data because it will get isolated as the `y` data for modeling.

In [25]:
# Drop irrelevant and unavailable-at-deployment columns
df1.drop(columns=[
    'Unnamed: 0',               # index column from CSV
    'tpep_pickup_datetime',     # redundant once day, month, and time-of-day features are engineered
    'tpep_dropoff_datetime',    # not available at prediction time
    'payment_type',             # already filtered to credit cards
    'trip_distance',            # could be estimated, but not known at pickup
    'tip_amount',               # leakage (target-related)
    'tip_percent',              # engineered from target
    'total_amount',             # includes tip
    'tolls_amount',             # not known at pickup
    'fare_amount',              # may be predicted
    'extra',                    # fare component
    'mta_tax',                  # fare component
    'improvement_surcharge',   # fare component
    'PULocationID',             # could be replaced with engineered features
    'DOLocationID',             # not known at pickup
    'store_and_fwd_flag'       # irrelevant for modeling
], inplace=True)

# Display the remaining columns to verify
print(df1.columns)

Index(['VendorID', 'passenger_count', 'RatecodeID', 'generous', 'day',
       'am_rush', 'daytime', 'pm_rush', 'nighttime', 'month'],
      dtype='object')


#### Variable encoding

Many of the columns are categorical and will need to be dummied (converted to binary). Some of these columns are numeric, but they actually encode categorical information, such as `RatecodeID` and the pickup and dropoff locations. To make these columns recognizable to the `get_dummies()` function as categorical variables, you'll first need to convert them to strings (`str`).

1. Define a variable called `cols_to_str`, which is a list of the numeric columns that contain categorical information and must be converted to string: `RatecodeID`, `PULocationID`, `DOLocationID`.
2. Write a for loop that converts each column in `cols_to_str` to string.


In [26]:
# 1. Define list of columns to convert to string
#    (PULocationID and DOLocationID were dropped above, so only RatecodeID remains)
cols_to_str = ['RatecodeID']

# 2. Convert each of those columns to string type
for col in cols_to_str:
    df1[col] = df1[col].astype(str)

# Optional: verify conversion
print(df1[cols_to_str].dtypes)

RatecodeID    object
dtype: object



<details>
  <summary><h5>HINT</h5></summary>

To convert to string, use `astype(str)` on the column.
</details>
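The effect of the string conversion can be seen on a small example. In this sketch, `toy` is a hypothetical two-column frame; it shows that `pd.get_dummies()` passes integer columns through untouched but expands them once they are cast to string.

```python
import pandas as pd

toy = pd.DataFrame({
    "RatecodeID": [1, 2, 1, 5],        # numeric codes that are really categories
    "passenger_count": [1, 2, 1, 6],   # genuinely numeric
})

# Left as integers, RatecodeID is not encoded at all.
untouched = pd.get_dummies(toy)

# Cast to string first, and it is expanded into indicator columns.
toy["RatecodeID"] = toy["RatecodeID"].astype(str)
encoded = pd.get_dummies(toy, drop_first=True)

print(list(untouched.columns))
print(list(encoded.columns))
```

With `drop_first=True`, the first category of each encoded column is dropped and serves as the implicit baseline.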

Now convert all the categorical columns to binary.

1. Call `get_dummies()` on the dataframe and assign the results back to a new dataframe called `df2`.


In [27]:
# Convert categorical variables to binary using one-hot encoding
df2 = pd.get_dummies(df1, drop_first=True)

# Optional: Display the first few rows to confirm
print(df2.head())

   VendorID  passenger_count  generous  am_rush  daytime  pm_rush  nighttime  \
0         2                6         1        1        0        0          0   
1         1                1         1        0        1        0          0   
2         1                1         0        1        0        0          0   
3         2                1         1        0        1        0          0   
5         2                6         1        0        0        0          1   

   RatecodeID_2  RatecodeID_3  RatecodeID_4  RatecodeID_5  RatecodeID_99  \
0         False         False         False         False          False   
1         False         False         False         False          False   
2         False         False         False         False          False   
3         False         False         False         False          False   
5         False         False         False         False          False   

   day_monday  day_saturday  day_sunday  day_thursday  day_tue

##### Evaluation metric

Before modeling, you must decide on an evaluation metric. 

1. Examine the class balance of your target variable. 

In [28]:
# Get class balance of 'generous' column
class_balance = df2['generous'].value_counts()

# Display the class balance
print(class_balance)

generous
1    8035
0    7230
Name: count, dtype: int64


In [29]:
# Get class balance as percentage
class_balance_percent = df2['generous'].value_counts(normalize=True) * 100

# Display the percentage balance
print(class_balance_percent)

generous
1    52.636751
0    47.363249
Name: proportion, dtype: float64


A little over half of the customers in this dataset were "generous" (tipped ≥ 20%). The dataset is very nearly balanced.

To determine a metric, consider the cost of both kinds of model error:
* False positives (the model predicts a tip ≥ 20%, but the customer does not give one)
* False negatives (the model predicts a tip < 20%, but the customer gives more)

False positives are worse for cab drivers, because they would pick up a customer expecting a good tip and then not receive one, frustrating the driver.

False negatives are worse for customers, because a cab driver would likely pick up a different customer who was predicted to tip more&mdash;even when the original customer would have tipped generously.

**The stakes are relatively even. You want to help taxi drivers make more money, but you don't want this to anger customers. Your metric should weigh both precision and recall equally. Which metric is this?**

The metric that weighs both precision and recall equally is the F1 score.

The F1 score is the harmonic mean of precision and recall and is particularly useful when you want to balance the trade-off between false positives and false negatives. It is a good choice when both types of errors (false positives and false negatives) have similar consequences, as in this case, where you want to avoid frustrating drivers (false positives) while also ensuring customers who tip generously aren't overlooked (false negatives).

The F1 score can be calculated using the following formula:

$$F_1 = 2 \times \frac{precision \times recall}{precision + recall}$$

Where:

$$precision = \frac{TP}{TP + FP}$$

(True Positives / (True Positives + False Positives))

$$recall = \frac{TP}{TP + FN}$$

(True Positives / (True Positives + False Negatives))

This metric will give you a single number that captures both the precision and recall of the model, making it a suitable choice when the costs of both types of errors are approximately equal.
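The definitions of precision, recall, and F1 can be checked by hand on a small confusion matrix. The sketch below uses hypothetical counts (they are not results from this model), and the helper `precision_recall_f1` is an illustrative name; it does not guard against zero denominators.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Here precision is 0.75 and recall is 0.60, and the harmonic mean (about 0.667) sits closer to the lower of the two, which is why F1 penalizes a model that does well on only one of them.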

<img src="images/Construct.png" width="100" height="100" align=left>

## PACE: **Construct**

Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### **Task 3. Modeling**

##### **Split the data**

Now you're ready to model. The only remaining step is to split the data into features/target variable and training/testing data. 

1. Define a variable `y` that isolates the target variable (`generous`).
2. Define a variable `X` that isolates the features.
3. Split the data into training and testing sets. Put 20% of the samples into the test set, stratify the data, and set the random state.

In [30]:
from sklearn.model_selection import train_test_split

# 1. Isolate the target variable (y)
y = df2['generous']

# 2. Isolate the features (X)
X = df2.drop(columns=['generous'])

# 3. Split into train and test sets (80% train, 20% test), stratified by 'generous', with a fixed random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Optional: Check the shape of the splits to confirm
print(f"Training features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Test target shape: {y_test.shape}")

Training features shape: (12212, 28)
Test features shape: (3053, 28)
Training target shape: (12212,)
Test target shape: (3053,)
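To see what `stratify=y` contributes, the following sketch splits a toy target with a 60/40 class mix and compares the class proportion in each split. The arrays `X_toy` and `y_toy` are hypothetical and unrelated to the taxi data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 60% positive class (hypothetical).
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, both splits keep the original 60/40 mix.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the proportions in a small test set can drift from the full dataset by chance, which makes metrics less comparable between splits.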


##### **Random forest**

Begin with using `GridSearchCV` to tune a random forest model.

1. Instantiate the random forest classifier `rf` and set the random state.

2. Create a dictionary `cv_params` of any of the following hyperparameters and their corresponding values to tune. The more you tune, the better your model will fit the data, but the longer it will take. 
 - `max_depth`  
 - `max_features`  
 - `max_samples` 
 - `min_samples_leaf`  
 - `min_samples_split`
 - `n_estimators`  

3. Define a set `scoring` of scoring metrics for GridSearch to capture (precision, recall, F1 score, and accuracy).

4. Instantiate the `GridSearchCV` object `rf1`. Pass to it as arguments:
 - estimator=`rf`
 - param_grid=`cv_params`
 - scoring=`scoring`
 - cv: define the number of cross-validation folds you want (`cv=_`)
 - refit: indicate which evaluation metric you want to use to select the model (`refit=_`)


**Note:** `refit` should be set to `'f1'`.
 


In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 1. Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)

# 2. Create a dictionary of hyperparameters to tune
cv_params = {
    'max_depth': [None, 10, 20, 30],
    'max_features': ['sqrt', 'log2'],    # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_samples': [None, 0.5, 0.8],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [50, 100, 200]
}

# 3. Define a list of scoring metrics to capture
scoring = ['precision', 'recall', 'f1', 'accuracy']

# 4. Instantiate the GridSearchCV object
rf1 = GridSearchCV(
    estimator=rf,
    param_grid=cv_params,
    scoring=scoring,
    cv=5,            # Number of cross-validation folds
    refit='f1',      # Use F1 score to refit the model
    verbose=1,       # Print progress
    n_jobs=-1        # Use all available CPUs
)

# Optional: Check the parameters for GridSearchCV
print(rf1)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [None, 10, 20, 30],
                         'max_features': ['sqrt', 'log2'],
                         'max_samples': [None, 0.5, 0.8],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [50, 100, 200]},
             refit='f1', scoring=['precision', 'recall', 'f1', 'accuracy'],
             verbose=1)


Now fit the model to the training data. Depending on how many options you include in your search grid and the number of cross-validation folds you select, this could take a very long time, even hours. If you use 4-fold validation, include only one possible value for each hyperparameter, and grow 300 trees to full depth, it should take about 5 minutes. If you add another value for GridSearch to check for, say, `min_samples_split` (so all hyperparameters have 1 value except for `min_samples_split`, which has 2 possibilities), it would double the time to ~10 minutes. In general, the number of fits is the product of the number of values listed for each hyperparameter, multiplied by the number of folds, so adding a second value to another hyperparameter would double the time again. 
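
To make the scaling concrete, the short sketch below counts the fits for a small search grid. The grid values and the name `example_grid` are hypothetical and chosen only for illustration; they are not the grid used in this notebook.

```python
import math

# Hypothetical search grid (illustrative values, not the notebook's grid)
example_grid = {
    'max_depth': [None, 10],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 300],
}
n_folds = 4

# Number of hyperparameter combinations = product of the list lengths
n_combinations = math.prod(len(values) for values in example_grid.values())

# Every combination is fit once per fold; refitting the best combination
# on the full training set adds one more fit at the end
n_cv_fits = n_combinations * n_folds
n_total_fits = n_cv_fits + 1

print(f"{n_combinations} combinations x {n_folds} folds = {n_cv_fits} fits (+1 refit)")
```

Running the same arithmetic on your own `cv_params` before calling `fit()` gives a rough sense of how long the search may take.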

In [None]:
# Fit the GridSearchCV object to the training data
rf1.fit(X_train, y_train)

# Print the best hyperparameters found by GridSearchCV
print("Best hyperparameters found: ", rf1.best_params_)

# Print the best score (f1 score) achieved during cross-validation
print("Best F1 score: ", rf1.best_score_)

# Optionally: Print the results for all parameter combinations
# This will show how each combination of hyperparameters performed
print("GridSearchCV Results: ")
print(rf1.cv_results_)

<details>
  <summary><h5>HINT</h5></summary>

If you get a warning that a metric is 0 due to no predicted samples, think about how many features you're sampling with `max_features`. How many features are in the dataset? How many are likely predictive enough to give good predictions within the number of splits you've allowed (determined by the `max_depth` hyperparameter)? Consider increasing `max_features`.

</details>

If you want, use `pickle` to save your models and read them back in. This can be particularly helpful when performing a search over many possible hyperparameter values.

In [None]:
import pickle 

# Define a path to the folder where you want to save the model
path = '/home/jovyan/work/'

In [None]:
def write_pickle(path, model_object, save_name:str):
    '''
    save_name is a string.
    '''
    with open(path + save_name + '.pickle', 'wb') as to_write:
        pickle.dump(model_object, to_write)

In [None]:
def read_pickle(path, saved_model_name:str):
    '''
    saved_model_name is a string.
    '''
    with open(path + saved_model_name + '.pickle', 'rb') as to_read:
        model = pickle.load(to_read)

        return model
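
The helpers above join `path` and the file name by string concatenation, so `path` must end with a slash. As a hedged alternative, the sketch below does the same round trip with `pathlib`, which joins the pieces regardless of a trailing slash. The names `save_model`, `load_model`, and `stand_in_model` are hypothetical, and a plain dictionary stands in for a fitted model.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(folder, model_object, save_name: str):
    """Pickle model_object to <folder>/<save_name>.pickle and return the file path."""
    file_path = Path(folder) / f"{save_name}.pickle"
    with open(file_path, 'wb') as to_write:
        pickle.dump(model_object, to_write)
    return file_path

def load_model(folder, saved_model_name: str):
    """Read back an object saved by save_model()."""
    file_path = Path(folder) / f"{saved_model_name}.pickle"
    with open(file_path, 'rb') as to_read:
        return pickle.load(to_read)

# A dictionary stands in for a fitted GridSearchCV object in this sketch
stand_in_model = {'best_params': {'max_depth': 10}, 'best_score': 0.7}

# Use a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_model(tmp_dir, stand_in_model, 'rf_cv_model')
    restored = load_model(tmp_dir, 'rf_cv_model')

print(restored == stand_in_model)
```

Only unpickle files that you created yourself, because loading a pickle can execute arbitrary code.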

Examine the best average score across all the validation folds. 

In [None]:
# Examine best score
#==> ENTER YOUR CODE HERE

Examine the best combination of hyperparameters.

In [None]:
#==> ENTER YOUR CODE HERE

Use the `make_results()` function to output all of the scores of your model. Note that it accepts three arguments. 

<details>
  <summary><h5>HINT</h5></summary>

To learn more about how this function accesses the cross-validation results, refer to the [`GridSearchCV` scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV) for the `cv_results_` attribute.

</details>
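
As a rough illustration of what `cv_results_` contains when several scoring metrics are requested, the sketch below runs a very small grid search on synthetic data and looks up the row with the best mean F1 score. The data, the grid, and the names `X_demo`, `y_demo`, `demo_search`, and `cv_df` are assumptions made for this example only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the taxi dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4]},
    scoring=['precision', 'recall', 'f1', 'accuracy'],
    cv=3,
    refit='f1',
)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination, with a mean_test_<metric> column per metric
cv_df = pd.DataFrame(demo_search.cv_results_)

# Row with the highest mean F1 score across the validation folds
best_row = cv_df.loc[cv_df['mean_test_f1'].idxmax()]
print(best_row[['mean_test_precision', 'mean_test_recall',
                'mean_test_f1', 'mean_test_accuracy']])
```

Because `refit='f1'`, the mean F1 score in that row matches `demo_search.best_score_`.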

In [None]:
def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
    model_name (string): what you want the model to be called in the output table
    model_object: a fit GridSearchCV object
    metric (string): precision, recall, f1, or accuracy

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean 'metric' score across all validation folds.
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                 'recall': 'mean_test_recall',
                 'f1': 'mean_test_f1',
                 'accuracy': 'mean_test_accuracy',
                 }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract Accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'model': [model_name],
                        'precision': [precision],
                        'recall': [recall],
                        'F1': [f1],
                        'accuracy': [accuracy],
                        },
                       )

    return table

Call `make_results()` on the GridSearch object.

In [None]:
#==> ENTER YOUR CODE HERE

Your results should produce an acceptable model across the board. Typically scores of 0.65 or better are considered acceptable, but this is always dependent on your use case. Optional: try to improve the scores. It's worth trying, especially to practice searching over different hyperparameters.

<details>
  <summary><h5>HINT</h5></summary>

For example, if the available values for `min_samples_split` were [2, 3, 4] and GridSearch identified the best value as 4, consider trying [4, 5, 6] this time.
</details>

Use your model to predict on the test data. Assign the results to a variable called `rf_preds`.

<details>
  <summary><h5>HINT</h5></summary>
    
Because `refit` is set, calling `predict()` on the fitted GridSearchCV object uses the refit `best_estimator_`. Calling `predict()` on `best_estimator_` directly gives the same predictions.
</details>
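
A minimal sketch of this step on synthetic data, assuming a small grid search fitted on a training split. The names `X_tr`, `X_te`, `demo_search`, and `demo_preds` are hypothetical; in the notebook you would use `rf1`, `X_test`, and `rf_preds`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the taxi dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4]},
    scoring='f1',
    cv=3,
    refit=True,
)
demo_search.fit(X_tr, y_tr)

# best_estimator_ is the model refit on the whole training split
# using the best hyperparameters found during the search
demo_preds = demo_search.best_estimator_.predict(X_te)
print(demo_preds[:10])
```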

For this project, you will use several models to predict on the test data. Remember that this decision comes with a trade-off. What is the benefit of this? What is the drawback?

==> ENTER YOUR RESPONSE HERE

In [None]:
# Get predictions on test data
#==> ENTER YOUR CODE HERE

Use the `get_test_scores()` function below to output the scores of the model on the test data.

In [None]:
def get_test_scores(model_name:str, preds, y_test_data):
    '''
    Generate a table of test scores.

    In:
    model_name (string): Your choice: how the model will be named in the output table
    preds: numpy array of test predictions
    y_test_data: numpy array of y_test data

    Out:
    table: a pandas df of precision, recall, f1, and accuracy scores for your model
    '''
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                        'precision': [precision],
                        'recall': [recall],
                        'F1': [f1],
                        'accuracy': [accuracy]
                        })

    return table
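
To make the four metrics in this table concrete, here is a small worked example that computes them by hand from made-up labels. The lists `y_true` and `y_pred` are illustrative values, not output from any model in this notebook.

```python
# Made-up labels for illustration (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / len(y_true)    # share of all predictions that are correct

print(precision, recall, f1, accuracy)
```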

1. Use the `get_test_scores()` function to generate the scores on the test data. Assign the results to `rf_test_scores`.
2. Call `rf_test_scores` to output the results.

###### RF test results

In [None]:
# Get scores on test data
#==> ENTER YOUR CODE HERE

**Question:** How do your test results compare to your validation results?

==> ENTER YOUR RESPONSE HERE

##### **XGBoost**

Try to improve your scores using an XGBoost model.

1. Instantiate the XGBoost classifier `xgb` and set `objective='binary:logistic'`. Also set the random state.

2. Create a dictionary `cv_params` of the following hyperparameters and their corresponding values to tune:
 - `max_depth`
 - `min_child_weight`
 - `learning_rate`
 - `n_estimators`

3. Define a list `scoring` of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).

4. Instantiate the `GridSearchCV` object `xgb1`. Pass to it as arguments:
 - estimator=`xgb`
 - param_grid=`cv_params`
 - scoring=`scoring`
 - cv: define the number of cross-validation folds you want (`cv=_`)
 - refit: indicate which evaluation metric you want to use to select the model (`refit='f1'`)

In [None]:
# 1. Instantiate the XGBoost classifier
#==> ENTER YOUR CODE HERE

# 2. Create a dictionary of hyperparameters to tune
#==> ENTER YOUR CODE HERE

# 3. Define a list of scoring metrics to capture
#==> ENTER YOUR CODE HERE

# 4. Instantiate the GridSearchCV object
#==> ENTER YOUR CODE HERE

Now fit the model to the `X_train` and `y_train` data.

In [None]:
%%time
#==> ENTER YOUR CODE HERE


Get the best score from this model.

In [None]:
# Examine best score
#==> ENTER YOUR CODE HERE

And the best parameters.

In [None]:
# Examine best parameters
#==> ENTER YOUR CODE HERE

##### XGB CV Results

Use the `make_results()` function to output all of the scores of your model. Note that it accepts three arguments. 

In [None]:
# Call 'make_results()' on the GridSearch object
#==> ENTER YOUR CODE HERE

Use your model to predict on the test data. Assign the results to a variable called `xgb_preds`.

<details>
  <summary><h5>HINT</h5></summary>
    
Because `refit` is set, calling `predict()` on the fitted GridSearchCV object uses the refit `best_estimator_`. Calling `predict()` on `best_estimator_` directly gives the same predictions.
</details>

In [None]:
# Get predictions on test data
#==> ENTER YOUR CODE HERE

###### XGB test results

1. Use the `get_test_scores()` function to generate the scores on the test data. Assign the results to `xgb_test_scores`.
2. Call `xgb_test_scores` to output the results.

In [None]:
# Get scores on test data
#==> ENTER YOUR CODE HERE

**Question:** Compare these scores to the random forest test scores. What do you notice? Which model would you choose?

==> ENTER YOUR RESPONSE HERE

Plot a confusion matrix of the model's predictions on the test data.

In [None]:
# Generate array of values for confusion matrix
#==> ENTER YOUR CODE HERE

# Plot confusion matrix
#==> ENTER YOUR CODE HERE
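
As a sketch of how to read the matrix, the example below builds one from made-up labels with scikit-learn's `confusion_matrix` and unpacks the four cells. The label lists are illustrative only and do not come from the taxi data.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = positive class, 0 = negative class)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() returns the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"false positives: {fp}, false negatives: {fn}")
```

In this made-up case, false negatives outnumber false positives. To draw the matrix, one option is to pass `cm` to `sklearn.metrics.ConfusionMatrixDisplay` and call its `plot()` method.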

**Question:** What type of errors are more common for your model?

==> ENTER YOUR RESPONSE HERE

##### Feature importance

Use the `feature_importances_` attribute of the best estimator object to inspect the features of your final model. You can then sort them and plot the most important ones.

In [None]:
#==> ENTER YOUR CODE HERE
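
One possible way to sort the importances, shown on synthetic data. The feature names, the data, and the names `demo_model` and `importances` are assumptions made for this sketch; in the notebook you would use the best estimator from your chosen search and the column names of `X_train`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_model = RandomForestClassifier(n_estimators=50, random_state=0)
demo_model.fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important
importances = pd.Series(
    demo_model.feature_importances_, index=feature_names
).sort_values(ascending=False)

top_features = importances.head(3)
print(top_features)
```

The sorted Series can then be plotted, for example as a horizontal bar chart, to show the most important features.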

<img src="images/Execute.png" width="100" height="100" align=left>

## PACE: **Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4. Conclusion**

In this step, use the results of the models above to formulate a conclusion. Consider the following questions:

1. **Would you recommend using this model? Why or why not?**  

2. **What was your model doing? Can you explain how it was making predictions?**   

3. **Are there new features that you can engineer that might improve model performance?**   

4. **What features would you want to have that would likely improve the performance of your model?**   

Remember, sometimes your data simply will not be predictive of your chosen target. This is common. Machine learning is a powerful tool, but it is not magic. If your data does not contain predictive signal, even the most complex algorithm will not be able to deliver consistent and accurate predictions. Do not be afraid to draw this conclusion. Even if you cannot use the model to make strong predictions, was the work done in vain? Consider any insights that you could report back to stakeholders.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.