## How to use few-shot learning to optimize a competition leaderboard score

In <a href='https://machinehack.com/hackathons/the_great_real_estate_data_challenge/overview'>The Great Real Estate Data Challenge</a>, hosted by MachineHack, the target variable is obscured. The target is the categorical segment that a real estate listing falls into, but the training data doesn't have this column. The nearest variable that acts as a proxy is "Sale Price". The organizers want a model that assigns the listings to 4 segments, and the segmentation logic is hidden from the participants.

This notebook shows how I used few-shot learning, followed by optimization on top of the meta-learner model, to grab 6th place with only 2 hours of effort!!!

### Note: Overfitting is for competitions and not Production

In [1]:
import pandas as pd

### Here is the dataset for training and to make test predictions

The final target to be predicted is Segment. However, we only have a proxy, "Sale Price", and the segmentation logic is hidden from us.

#### Segments: 
0: Premium Properties 💰🏰 <br>
1: Valuable Properties 💎🏡 <br>
2: Standard Properties 🏘️💸 <br>
3: Budget Properties  🏠💵 <br>

In [3]:
train = pd.read_csv('./train.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553952 entries, 0 to 553951
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Year               553952 non-null  int64  
 1   Date               553952 non-null  object 
 2   Locality           553952 non-null  object 
 3   Address            553952 non-null  object 
 4   Estimated Value    553952 non-null  float64
 5   Sale Price         553952 non-null  float64
 6   Property           553952 non-null  object 
 7   Residential        553952 non-null  object 
 8   num_rooms          553952 non-null  int64  
 9   carpet_area        553952 non-null  int64  
 10  property_tax_rate  553952 non-null  float64
dtypes: float64(3), int64(3), object(5)
memory usage: 46.5+ MB


In [4]:
train.head()

| | Year | Date | Locality | Address | Estimated Value | Sale Price | Property | Residential | num_rooms | carpet_area | property_tax_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 2009-01-02 | Greenwich | 40 ETTL LN UT 24 | 711270.0 | 975000.0 | Condo | Condominium | 2 | 760 | 1.025953 |
| 1 | 2009 | 2009-01-02 | East Hampton | 18 BAUER RD | 119970.0 | 189900.0 | Single Family | Detached House | 3 | 921 | 1.025953 |
| 2 | 2009 | 2009-01-02 | Ridgefield | 48 HIGH VALLEY RD. | 494530.0 | 825000.0 | Single Family | Detached House | 3 | 982 | 1.025953 |
| 3 | 2009 | 2009-01-02 | Old Lyme | 56 MERIDEN RD | 197600.0 | 450000.0 | Single Family | Detached House | 3 | 976 | 1.025953 |
| 4 | 2009 | 2009-01-02 | Naugatuck | 13 CELENTANO DR | 105440.0 | 200000.0 | Single Family | Detached House | 3 | 947 | 1.025953 |


In [3]:
test = pd.read_csv('./test.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43954 entries, 0 to 43953
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               43954 non-null  int64  
 1   Date               43954 non-null  object 
 2   Locality           43954 non-null  object 
 3   Address            43954 non-null  object 
 4   Estimated Value    43954 non-null  float64
 5   Sale Price         43954 non-null  int64  
 6   Property           43954 non-null  object 
 7   Residential        43954 non-null  object 
 8   num_rooms          43954 non-null  int64  
 9   carpet_area        43954 non-null  float64
 10  property_tax_rate  43954 non-null  float64
 11  Segment            43954 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 4.0+ MB


In [6]:
test.Segment.value_counts()

0    43954
Name: Segment, dtype: int64

In [2]:
submission = pd.read_csv('./submission.csv')
submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43954 entries, 0 to 43953
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Segment  43954 non-null  int64
dtypes: int64(1)
memory usage: 343.5 KB


### Approach
A two-step approach: build a regression model for Sale Price, then bin the predicted Sale Price into segments.

#### Step 1: Build a regression model for predicting Sale Price

In [23]:
preds = pd.read_csv('./preds_reco.csv')
preds.head()

| | row_id | Prediction |
|---|---|---|
| 0 | 0 | 214339.2 |
| 1 | 1 | 1085608.0 |
| 2 | 2 | 215587.9 |
| 3 | 3 | 195831.8 |
| 4 | 4 | 151416.8 |


In [24]:
preds['Prediction'].describe()

count    4.395400e+04
mean     3.943656e+05
std      6.698036e+05
min      1.000000e+04
25%      1.734252e+05
50%      2.323893e+05
75%      3.790725e+05
max      1.328608e+07
Name: Prediction, dtype: float64

#### Step 2: Create Bins

Create a few samples with manual bins and get their scores from the leaderboard. Use these samples to build a meta-learner that predicts the leaderboard score from the bin edges, then run an optimizer on the meta-learner to propose the next bins to submit. Keep repeating till you're happy or tired!!!

In [127]:
# Manual Binning Strategies

# bins = [0,1.678316e+05,2.266937e+05,3.561459e+05] score:0.26837
# bins = [0,2.266937e+05,3.561459e+05,1.671817e+06] score:0.29188
# bins = [0,2.266937e+05,3.561459e+05,1.671817e+06] score:0.30363
# bins = [0,2.266937e+05,3.561459e+05,5e+06] score:0.30196
# bins = [0,2.266937e+05,8e+05,1.671817e+06] score:0.25760
# bins = [0,5e+04,3.561459e+05,1.671817e+06] score:0.23760
# bins = [0,2.323893e+05,3.790725e+05,1.5e+06] score:0.30121
# bins = [0,2e+05,3e+05,4e+06]  score:0.30576
# bins = [0,3e+05,4e+05,5e+06] score:0.30545
# bins = [0,1e+05,2e+05,3e+06] score:0.29703

# Bins from Few Shot Learning

# bins = [2.2730754153979382,2.7197252013039757e+05,2.773025577217706e+05, 4.473795183689448e+07] score:0.33298
# bins = [2.2730754153979382,2.7197252013039757e+07,2.773025577217706e+05, 4.473795183689448e+07] score:0.32244
# bins = [2.2730754153979382,2.7197252013039757e+05,2.773025577217706e+05, 4.473795183689448e+06] score:0.33313
# bins = [5.941378123396479,3.6184591543456848e+05,1.3414414546277338e+05, 5.121145233903171e+06] score:0.33389
# bins = [5.952248945744768,3.268839654777005e+05,3.6848146021179238e+05, 6.848476404133829e+06]  score:0.32153
# bins = [5.952248945744768,3.268839654777005e+06,3.6848146021179238e+06, 6.848476404133829e+07] score:0.32267
bins = [5.952248945744768,3.268839654777005e+05,3.6848146021179238e+05, 6.848476404133829e+07]  # score:0.32153

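def get_segments(num):
    # Map a predicted Sale Price to a segment label using the global `bins`.
    num = float(num)
    if bins[0] < num <= bins[1]:
        return "0"
    elif bins[1] < num <= bins[2]:
        return "1"
    elif bins[2] < num <= bins[3]:
        return "2"
    else:
        # Anything outside (bins[0], bins[3]] falls through to the last segment.
        return "3"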
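
An alternative way to write the same lookup, sketched here with the standard library's `bisect` module. The function name `segment_for` is hypothetical and not part of the notebook, and the edges are copied from the last bin set listed above. One difference to be aware of: values below the first edge land in segment "0" here, whereas `get_segments` sends them to "3".

```python
from bisect import bisect_left

# Bin edges copied from the last bin set listed in the notebook.
bins = [5.952248945744768, 3.268839654777005e+05,
        3.6848146021179238e+05, 6.848476404133829e+07]

def segment_for(price, edges=bins):
    """Return the segment label ("0".."3") for a predicted Sale Price.

    bisect_left counts how many upper edges are strictly below the price,
    so a price sitting exactly on an edge stays in the lower segment.
    Hypothetical helper, not the notebook's own function.
    """
    upper_edges = edges[1:]  # the first edge is only a lower bound
    return str(bisect_left(upper_edges, float(price)))

print([segment_for(p) for p in (214339.2, 340000.0, 1085608.0, 1e9)])
```

Because the lookup is a binary search over sorted edges, it only works when the edges are in increasing order, which is not true of every bin set tried above.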

In [112]:
data = pd.DataFrame([[0,1.678316,5,2.266937,5,3.561459,5,0.26837],
[0,2.266937,5,3.561459,5,1.671817,6,0.29188],
[0,2.266937,5,3.561459,5,1.671817,6,0.30363],
[0,2.266937,5,3.561459,5,5,6,0.30196],
[0,2.266937,5,8,5,1.671817,6,0.25760],
[0,5,4,3.561459,5,1.671817,6,0.23760],
[0,2.323893,5,3.790725,5,1.5,6,0.30121],
[0,2,5,3,5,4,6,0.30576],
[0,3,5,4,5,5,6,0.30545],
[0,1,5,2,5,3,6,0.29703],
[2.2730754153979382,2.7197252013039757,5,2.773025577217706,5, 4.473795183689448,7,0.33298],
[2.2730754153979382,2.7197252013039757,7,2.773025577217706,5, 4.473795183689448,7,0.32244],
[2.2730754153979382,2.7197252013039757,5,2.773025577217706,5, 4.473795183689448,6,0.33313],
[5.941378123396479,3.6184591543456848,5,1.3414414546277338,7, 5.121145233903171,9,0.29491],
[5.941378123396479,3.6184591543456848,5,1.3414414546277338,5, 5.121145233903171,6,0.33389],
],columns=['m1','m2','p2','m3','p3','m4','p4','score'])
data

from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators=15, max_depth=10, random_state=0)
# Pass y as a 1-D Series; data[['score']] is a column vector and makes
# scikit-learn warn that a 1d array of shape (n_samples,) was expected.
regr.fit(data[['m1', 'm2', 'p2', 'm3', 'p3', 'm4', 'p4']], data['score'])



In [113]:
fts = pd.DataFrame({'features':regr.feature_names_in_, 'importance':regr.feature_importances_})
fts.sort_values(['importance'], ascending=False)

| | features | importance |
|---|---|---|
| 1 | m2 | 0.215998 |
| 0 | m1 | 0.211127 |
| 5 | m4 | 0.19795 |
| 6 | p4 | 0.127606 |
| 2 | p2 | 0.115756 |
| 3 | m3 | 0.094392 |
| 4 | p3 | 0.037172 |
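
The feature names are not explained in the notebook. Comparing the rows of `data` with the manual bins suggests that each upper bin edge is stored as a mantissa (`m2`, `m3`, `m4`) and a power of ten (`p2`, `p3`, `p4`), while `m1` holds the first edge as a raw value. Under that assumption, a minimal sketch of the encoding looks like this (`decode_edge` and `encode_edge` are hypothetical helper names):

```python
import math

def decode_edge(mantissa, power):
    """Rebuild a bin edge from its mantissa and base-10 exponent."""
    return mantissa * 10 ** power

def encode_edge(value):
    """Split a positive bin edge into (mantissa, exponent), mantissa in [1, 10)."""
    power = math.floor(math.log10(value))
    return value / 10 ** power, power

# Second row of `data`, which corresponds to
# bins = [0, 2.266937e+05, 3.561459e+05, 1.671817e+06]
row = {"m1": 0, "m2": 2.266937, "p2": 5, "m3": 3.561459, "p3": 5,
       "m4": 1.671817, "p4": 6}
edges = [row["m1"]] + [decode_edge(row[f"m{i}"], row[f"p{i}"]) for i in (2, 3, 4)]
print(edges)
```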


#### Optimizer step: model the relationship between leaderboard score and our bins

The meta-learner approximates the latent relationship between bin edges and leaderboard score, which is hidden from the participants, and the optimizer searches it for the bins with the highest predicted score. One good thing about the competition is that it allows unlimited submissions, so more samples can be generated.
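
As a dependency-free illustration of this idea, the sketch below swaps in two simpler stand-ins: a distance-weighted nearest-neighbour surrogate instead of the random forest, and plain random search instead of hyperopt's TPE. It is not the method used in this notebook. The four observations are copied from the manual trials, and `surrogate` and `random_search` are hypothetical names.

```python
import math
import random

# (m1, m2, p2, m3, p3, m4, p4) -> leaderboard score, from four manual trials.
OBSERVED = [
    ([0, 2, 5, 3, 5, 4, 6], 0.30576),
    ([0, 3, 5, 4, 5, 5, 6], 0.30545),
    ([0, 1, 5, 2, 5, 3, 6], 0.29703),
    ([0, 5, 4, 3.561459, 5, 1.671817, 6], 0.23760),
]

def surrogate(x, observed=OBSERVED, k=2):
    """Predict a score as the distance-weighted mean of the k nearest trials."""
    nearest = sorted((math.dist(x, params), score) for params, score in observed)[:k]
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    return sum(w * score for w, (_, score) in zip(weights, nearest)) / sum(weights)

def random_search(n_iter=2000, seed=0):
    """Sample candidates uniformly in [0, 10]^7 and keep the best predicted one."""
    rng = random.Random(seed)
    best_x, best_score = None, -math.inf
    for _ in range(n_iter):
        x = [rng.uniform(0, 10) for _ in range(7)]
        score = surrogate(x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

best_x, best_score = random_search()
print(best_x, best_score)
```

The predicted score is only the surrogate's guess. The candidate still has to be submitted to get its real leaderboard score, which then becomes one more observation for the next round.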

In [114]:
from hyperopt import hp, Trials, fmin, STATUS_OK, tpe

space = {
    "m1": hp.uniform("m1", 0, 10),
    "m2": hp.uniform("m2", 0, 10),
    "p2": hp.uniform("p2", 0, 10),
    "m3": hp.uniform("m3", 0, 10),
    "p3": hp.uniform("p3", 0, 10),
    "m4": hp.uniform("m4", 0, 10),
    "p4": hp.uniform("p4", 0, 10),
}

def objective(params):
    expt = pd.DataFrame([[params['m1'],params['m2'],
                          params['p2'],params['m3'],
                          params['p3'],params['m4'],
                          params['p4']]], columns=['m1', 'm2', 'p2', 'm3',
                                                   'p3', 'm4', 'p4'])
    
    pred = regr.predict(expt)
    # predict returns a length-1 array; return a scalar loss (negated to maximize).
    return {"loss": -float(pred[0]), "status": STATUS_OK}

trials = Trials()

best = fmin(
    fn=objective,
    space = space, 
    algo=tpe.suggest, 
    max_evals=1000, 
    trials=trials
)

100%|██████████| 1000/1000 [00:29<00:00, 33.53trial/s, best loss: -0.33352533333333334]


In [115]:
min([trial['loss'] for trial in trials.results])

-0.33352533333333334

#### Best combination

In [116]:
best

{'m1': 5.952248945744768,
 'm2': 3.268839654777005,
 'm3': 3.6848146021179238,
 'm4': 6.848476404133829,
 'p2': 5.774031165722173,
 'p3': 5.733181407133727,
 'p4': 6.201172216640995}
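
The notebook does not show how the fractional exponents in `best` were turned into bin edges. One possibility, sketched below, is to floor each exponent. This reproduces one of the bin sets tried above (the one ending in `6.848476404133829e+06`), whereas the set finally assigned to `bins` uses `e+07` for the last edge, so treat the rounding rule as an assumption. `best_to_bins` is a hypothetical helper.

```python
import math

best = {'m1': 5.952248945744768,
        'm2': 3.268839654777005,
        'm3': 3.6848146021179238,
        'm4': 6.848476404133829,
        'p2': 5.774031165722173,
        'p3': 5.733181407133727,
        'p4': 6.201172216640995}

def best_to_bins(best):
    """Turn optimizer output into bin edges (assumption: exponents are floored)."""
    edges = [best['m1']]  # the first edge is used as a raw value
    for i in (2, 3, 4):
        edges.append(best[f'm{i}'] * 10 ** math.floor(best[f'p{i}']))
    return edges

bins_from_best = best_to_bins(best)
print(bins_from_best)
```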

In [128]:
preds['Segment'] = preds['Prediction'].apply(get_segments)
preds.head()

| | row_id | Prediction | Segment |
|---|---|---|---|
| 0 | 0 | 214339.2 | 0 |
| 1 | 1 | 1085608.0 | 2 |
| 2 | 2 | 215587.9 | 0 |
| 3 | 3 | 195831.8 | 0 |
| 4 | 4 | 151416.8 | 0 |


### Submission

In [129]:
preds[['Segment']].to_csv('./submission_19.csv', index=False)