In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.read_csv(
    filepath_or_buffer="tips.csv"
)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


Nice: every column has 244 non-null values, so there is no missing data.

Key Features
- Total Bill: The total cost of the meal.
- Tip: The amount left as a tip by the customer.
- Sex: The gender of the individual paying the bill.
- Smoker: Indicates whether the party includes smokers (Yes) or not (No).
- Day: The day of the week when the dining occurred (e.g., Sun for Sunday).
- Time: Specifies whether the dining experience was during lunch or dinner.
- Size: The size of the dining party, representing the number of individuals in the group.


In [3]:
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### Analysis

In [13]:
# Sanity check: are there any rows with a zero or negative tip?
df[df["tip"] <= 0]

,total_bill,tip,sex,smoker,day,time,size


In [14]:
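# Total tip amount, grouped by the sex of the person paying the bill
gender_tip = (
    df
    .groupby("sex")
    .agg(tip_amount=("tip", "sum"))
    .sort_values(by="tip_amount")
)

# Number of tipped bills, grouped the same way
gender_tip_times = (
    df
    .groupby("sex")
    .agg(tip_times=("tip", "count"))
    .sort_values(by="tip_times")
)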

In [15]:
gender_tip

sex,tip_amount
Female,246.51
Male,485.07


In [16]:
gender_tip_times

sex,tip_times
Female,87
Male,157


The two tables above show that male payers left more in tips than female payers, both in total amount and in number of bills: $485.07 across 157 bills versus $246.51 across 87 bills. Since `sex` records who paid the bill, not who served the table, this says nothing about tips received by male or female waiters. Most of the gap comes from men paying more of the bills rather than from larger tips per bill.
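
To separate tip size from the number of bills, the two tables above can be combined into an average tip per bill. This is a minimal sketch: the totals and counts are copied by hand from the outputs above, and `summary` and `mean_tip` are illustrative names that are not part of the original notebook.

```python
import pandas as pd

# Totals and counts copied from the gender_tip and gender_tip_times outputs above.
summary = pd.DataFrame(
    {"tip_amount": [246.51, 485.07], "tip_times": [87, 157]},
    index=pd.Index(["Female", "Male"], name="sex"),
)

# Average tip per bill = total tip amount / number of bills.
summary["mean_tip"] = (summary["tip_amount"] / summary["tip_times"]).round(2)
print(summary)
```

The averages come out to roughly $2.83 per bill for female payers and $3.09 for male payers. On the full data, `df.groupby("sex")["tip"].agg(["sum", "count", "mean"])` should give the same three figures in one call.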

### Model Building

- `total_bill` will almost certainly be the strongest predictor; dropping it would likely hurt performance substantially.
- `size` is numeric and can be used directly.
- `sex`, `smoker`, `day`, and `time` are categorical strings. scikit-learn's Random Forest needs numeric input, so they must be encoded before fitting (e.g., one-hot encoding).
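
The plan in the list above could be sketched as follows. Several things here are assumptions rather than choices made in the notebook: that the target is `tip`, that this is a regression problem, and the hyperparameters passed to `RandomForestRegressor`. The `sample` frame holds only the five rows shown by `df.head()`, so the block runs without `tips.csv`; with the real data, `df` would take its place.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in for df: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

categorical = ["sex", "smoker", "day", "time"]
numeric = ["total_bill", "size"]

# One-hot encode the categorical columns; pass the numeric ones through as-is.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",
)

# Assumed setup: predict `tip` with a regressor; hyperparameters are illustrative.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
])

X = sample[categorical + numeric]
y = sample["tip"]
model.fit(X, y)
predictions = model.predict(X)
```

Putting the encoder inside the pipeline means the same encoding is applied at fit and predict time, and `handle_unknown="ignore"` keeps prediction from failing on a category that was not seen during fitting.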

I don't need to scale features for a Random Forest: tree splits depend only on the order of each feature's values, so rescaling a feature does not change the fitted model.
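
A quick way to check this is to fit the same forest on raw and rescaled features and compare predictions. The sketch below is illustrative only: it uses the two numeric columns from the five `df.head()` rows, an assumed target of `tip`, and an arbitrary scale factor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# total_bill and size from the five rows shown by df.head(); tip as the assumed target.
X = np.array([
    [16.99, 2],
    [10.34, 3],
    [21.01, 3],
    [23.68, 2],
    [24.59, 4],
])
y = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# Rescale every feature by a positive constant.
# A power of two keeps the rescaling exact in floating point.
X_scaled = X * 1024.0

# Same random_state, so both forests draw the same bootstrap samples.
forest_raw = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
forest_scaled = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_scaled, y)

pred_raw = forest_raw.predict(X)
pred_scaled = forest_scaled.predict(X_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)
print(same_predictions)
```

Because the ordering of values within each feature is unchanged, both forests choose equivalent splits and make matching predictions.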