# Room_Match
Cupid API’s Room Match

### Objective: Classification for Wine Quality (Binary Classification)

Build a machine learning API similar to the Cupid API’s Room Match feature. <br> 
The API should handle POST requests and return sample request/response payloads in a similar <br> 
format to the Cupid Room Match API. Provide a detailed explanation of your development process, <br> 
including how you collect and process data, develop models, and scale the system.
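
The POST-based interface described above could be sketched as follows. This is a minimal illustration using only the standard library, not the Cupid API's actual schema: the payload fields (`supplier_rooms`, `reference_rooms`, `room_id`, `room_name`), the `match_rooms` and `serve` helpers, the 0.6 threshold, and the use of `difflib` string similarity in place of a trained model are all assumptions made for the example.

```python
import json
from difflib import SequenceMatcher
from http.server import BaseHTTPRequestHandler, HTTPServer


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def match_rooms(payload: dict, threshold: float = 0.6) -> dict:
    """Match each supplier room to its most similar reference room.

    The payload field names are hypothetical, not the Cupid API schema.
    """
    results = []
    for supplier in payload["supplier_rooms"]:
        best, best_score = None, 0.0
        for ref in payload["reference_rooms"]:
            score = SequenceMatcher(
                None, normalize(supplier["room_name"]), normalize(ref["room_name"])
            ).ratio()
            if score > best_score:
                best, best_score = ref, score
        matched = best is not None and best_score >= threshold
        results.append({
            "supplier_room_id": supplier["room_id"],
            "matched_room_id": best["room_id"] if matched else None,
            "score": round(best_score, 3),
        })
    return {"results": results}


class RoomMatchHandler(BaseHTTPRequestHandler):
    """Accepts a JSON body via POST and answers with the match results as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body, status = match_rooms(payload), 200
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            body, status = {"error": str(exc)}, 400
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the server; blocks until interrupted. Not called automatically."""
    HTTPServer(("127.0.0.1", port), RoomMatchHandler).serve_forever()
```

Keeping `match_rooms` as a pure function makes the matching logic testable without a running server; a production service would likely swap in a web framework and a learned similarity model.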


The models used are **Random Forest** and **XGBoost**. <br> 
The workflow includes data preprocessing, model training, <br>
hyperparameter optimization, evaluation, and visualization of the results.

#### Steps
1: Data Exploration and Preprocessing <br>
2: Model Training with Random Forest and XGBoost <br>
3: Evaluation Metrics and Visualization <br>
4: Deliverables <br>
<br>
<br>

1. **Data Preparation**
    - Load the wine quality dataset.
    - Analyze statistics and correlations of features.
    - Convert the multi-class wine quality labels
     into binary labels.
    - Apply standard scaling ($\mu = 0$, $\sigma = 1$).

2. **Model Training**
    - Split the dataset into training and testing sets.
    - Train Random Forest (without grid search)
    and XGBoost (tuned with Optuna).

3. **Evaluation Metrics and Visualization**
    - Evaluate precision, recall, and F1 score.
    - Visualize the ROC curve, confusion matrix, and
    feature importance.


4. **Deliverables**
    - Save the trained Random Forest and XGBoost models to **pkl**
    files so the test results can be reproduced.
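
The two data preparation transforms listed above (binary labels and standard scaling) can be sketched in plain Python. This is only an illustration: the helper names, the sample values, and the quality threshold of 7 are assumptions, not values taken from the notebook. The population standard deviation is used, which is also what scikit-learn's `StandardScaler` does.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def to_binary(quality_scores, threshold=7):
    """Label scores at or above the (hypothetical) threshold as 1, the rest as 0."""
    return [int(q >= threshold) for q in quality_scores]


alcohol = [9.4, 9.8, 10.5, 12.0]  # illustrative values, not from the dataset
scaled = standard_scale(alcohol)
labels = to_binary([5, 6, 7, 8])
print(labels)  # [0, 0, 1, 1]
```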



In [1]:
# Check the Python version
!python --version

Python 3.12.2


In [None]:
%%writefile requirements.txt
pandas==2.2.2
numpy==1.26.4
fasttext==0.9.3
seaborn==0.13.2
matplotlib==3.9.2
statsmodels==0.14.4
scikit-learn==1.5.2
xgboost==2.1.1
optuna==4.0.0
tensorflow==2.18.0
joblib==1.4.2

Overwriting requirements.txt


In [None]:
!pip install -r requirements.txt

In [2]:
import pandas as pd

In [3]:
df_rooms = pd.read_csv('data/updated_core_rooms.csv')
df_ref = pd.read_csv('data/referance_rooms-1737378184366.csv')

In [4]:
df_rooms

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name
0,1,506732,lp7bb6c,200979491,Expedia,Superior Double Room
1,2,509236,lp7c534,200998017,Expedia,"Deluxe Room, Balcony"
2,3,516326,lp7e0e6,201144757,Expedia,Female Dormitory- 3 Beds
3,4,495330,lp78ee2,201028863,Expedia,"Standard Apartment, 2 Bedrooms (6 people)"
4,5,970167,lpecdb7,218116045,Expedia,"Traditional Cottage, 2 Bedrooms, Harbor View"
...,...,...,...,...,...,...
2869051,2912439,193359,lp2f34f,323872346,Expedia,"Deluxe Room, 1 King Bed with Sofa bed"
2869052,2912440,143473,lp23071,230770971,Expedia,Ocean Bay Pool Room
2869053,2912441,1701692958,lp656dc61e,322166812,Expedia,8 Berth Luxury Caravan
2869054,2912442,143473,lp23071,315521742,Expedia,Beach Room


In [5]:
df_ref

Unnamed: 0,hotel_id,lp_id,room_id,room_name
0,13484077,lp23e8ef,1142730702,Double or Twin Room
1,13487663,lp6554de34,1141927122,House
2,13462809,lp6556c3dc,1142722063,Room
3,13530116,lp6555450b,1141968275,Triple Room
4,13530071,lp6557a92c,1142513784,Apartment
...,...,...,...,...
99995,21684,lp6561b025,2168409,Two-Bedroom Suite
99996,21684,lp6561b025,2168411,Deluxe Triple Room
99997,21684,lp6561b025,2168412,Deluxe Queen Room with Two Queen Beds
99998,21684,lp6561b025,2168413,Classic Quadruple Room


In [11]:
print(df_rooms.info())
print(df_ref.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2869056 entries, 0 to 2869055
Data columns (total 7 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   core_room_id        int64 
 1   core_hotel_id       int64 
 2   lp_id               object
 3   supplier_room_id    int64 
 4   supplier_name       object
 5   supplier_room_name  object
 6   lang_supplier       object
dtypes: int64(3), object(4)
memory usage: 153.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   hotel_id   100000 non-null  int64 
 1   lp_id      100000 non-null  object
 2   room_id    100000 non-null  int64 
 3   room_name  100000 non-null  object
 4   lang_ref   100000 non-null  object
dtypes: int64(2), object(3)
memory usage: 3.8+ MB
None


In [12]:
print("🔍 NaN values in df_rooms:")
print(df_rooms.isna().sum())

print("\n🔍 NaN values in df_ref:")
print(df_ref.isna().sum())


🔍 NaN values in df_rooms:
core_room_id          0
core_hotel_id         0
lp_id                 0
supplier_room_id      0
supplier_name         0
supplier_room_name    1
lang_supplier         1
dtype: int64

🔍 NaN values in df_ref:
hotel_id     0
lp_id        0
room_id      0
room_name    0
lang_ref     0
dtype: int64


In [14]:
df_rooms[df_rooms['supplier_room_name'].isna() | df_rooms['lang_supplier'].isna()]

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name,lang_supplier
1376206,1378719,970619,lpecf7b,220527262,Expedia,,


In [16]:
df_rooms_cleaned = df_rooms.dropna(subset=['supplier_room_name', 'lang_supplier'])

In [17]:
print("🔍 Empty string counts in df_rooms:")
print((df_rooms.select_dtypes(include='object') == '').sum())

print("\n🔍 Empty string counts in df_ref:")
print((df_ref.select_dtypes(include='object') == '').sum())

🔍 Empty string counts in df_rooms:
lp_id                 0
supplier_name         0
supplier_room_name    0
lang_supplier         0
dtype: int64

🔍 Empty string counts in df_ref:
lp_id        0
room_name    0
lang_ref     0
dtype: int64


In [None]:
# Overwrite df_rooms with the cleaned DataFrame
df_rooms = df_rooms_cleaned
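
The empty-string check above compares against `''` exactly, so a whitespace-only room name would not be counted. A small sketch of a stricter blank test follows; `is_blank` is a hypothetical helper and the sample list is made up for illustration.

```python
def is_blank(value) -> bool:
    """True for None, NaN, empty strings, and whitespace-only strings."""
    if value is None:
        return True
    # NaN is the only float that is not equal to itself.
    if isinstance(value, float) and value != value:
        return True
    return isinstance(value, str) and not value.strip()


names = ["Deluxe Room", "", "   ", None, float("nan"), "Beach Room"]
kept = [n for n in names if not is_blank(n)]
print(kept)  # ['Deluxe Room', 'Beach Room']
```

On the DataFrame itself the same idea could be expressed as `df_rooms['supplier_room_name'].str.strip().eq('')`.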

In [7]:
# Load fastText language detection model (must be downloaded beforehand)
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

import fasttext
import pandas as pd

# Load pre-trained fastText model
fasttext_model = fasttext.load_model('lid.176.bin')

# Define language detection function
def fasttext_lang(text):
    try:
        if not isinstance(text, str) or not text.strip():
            return 'unknown'
        label = fasttext_model.predict(text.strip().replace('\n', ''))[0][0]
        return label.replace('__label__', '')
    except Exception:
        return 'unknown'

# Apply fastText language detection with filtering
df_rooms = df_rooms.copy()
df_ref = df_ref.copy()

# Remove redundant/duplicate names to speed up detection
unique_supplier_names = df_rooms['supplier_room_name'].dropna().unique()
lang_map_supplier = {name: fasttext_lang(name) for name in unique_supplier_names}
df_rooms['lang_supplier'] = df_rooms['supplier_room_name'].map(lang_map_supplier)

unique_ref_names = df_ref['room_name'].dropna().unique()
lang_map_ref = {name: fasttext_lang(name) for name in unique_ref_names}
df_ref['lang_ref'] = df_ref['room_name'].map(lang_map_ref)

# Count languages
supplier_langs = df_rooms['lang_supplier'].value_counts()
ref_langs = df_ref['lang_ref'].value_counts()

# Print results
print("Languages detected in df_rooms['supplier_room_name']:\n")
print(supplier_langs)
print(f"\n🌍 Total unique languages in df_rooms: {df_rooms['lang_supplier'].nunique()}")

print("\nLanguages detected in df_ref['room_name']:\n")
print(ref_langs)
print(f"\n🌍 Total unique languages in df_ref: {df_ref['lang_ref'].nunique()}")


Languages detected in df_rooms['supplier_room_name']:

lang_supplier
en     2769588
it       46120
fr       12469
es       10253
de        9766
        ...   
arz          1
hsb          1
pam          1
am           1
rm           1
Name: count, Length: 96, dtype: int64

🌍 Total unique languages in df_rooms: 96

Languages detected in df_ref['room_name']:

lang_ref
en     97479
it      1354
fr       372
es       216
de       213
pt       104
nl        58
ja        27
zh        16
ru        15
sv        14
id        14
pl        13
oc        12
tr        11
fi        10
ceb        8
ca         8
eo         8
ms         7
fa         7
hu         5
ro         3
eu         3
uk         3
cs         3
vi         2
sl         2
bn         2
da         2
hy         1
sh         1
lt         1
az         1
mn         1
no         1
af         1
gl         1
war        1
Name: count, dtype: int64

🌍 Total unique languages in df_ref: 39
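
Some of the rarer labels above (for example `ceb`, `eo`, and `war` in `df_ref`) may be misdetections on very short room names rather than genuine languages; the notebook does not verify this either way. A minimal sketch of a confidence-gated variant of `fasttext_lang` is shown below. It assumes the predictor returns a `(labels, probabilities)` pair, as `fasttext_model.predict` does; `detect_language`, the `min_confidence` parameter, its 0.5 default, and the stub predictor are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Prediction = Tuple[Sequence[str], Sequence[float]]


def detect_language(
    text,
    predict: Callable[[str], Prediction],
    min_confidence: float = 0.5,
) -> str:
    """Return a language code, or 'unknown' when the top prediction is weak."""
    if not isinstance(text, str) or not text.strip():
        return "unknown"
    # fastText rejects newlines in its input, so replace them with spaces.
    labels, probs = predict(text.strip().replace("\n", " "))
    if len(labels) == 0 or float(probs[0]) < min_confidence:
        return "unknown"
    return labels[0].replace("__label__", "")


# Stub standing in for fasttext_model.predict (probabilities are made up).
def fake_predict(text):
    if text == "Room":
        return ("__label__ceb",), [0.21]
    return ("__label__en",), [0.97]


print(detect_language("Deluxe Queen Room", fake_predict))  # en
print(detect_language("Room", fake_predict))               # unknown
```

With the real model, the call would be `detect_language(name, fasttext_model.predict)`; a suitable threshold would need to be checked against the data.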


In [54]:

common_lp_ids = pd.Series(list(set(df_rooms['lp_id']) & set(df_ref['lp_id'])))
print(f"Common lp_id count: {len(common_lp_ids)}")
print(common_lp_ids)

Common lp_id count: 28638
0           lpbe3f4
1           lp75a46
2          lp1a35ab
3           lpd0b23
4        lp6555d54b
            ...    
28633    lp655835e3
28634       lp9159c
28635    lp65579276
28636       lpe40d6
28637       lp9b8e0
Length: 28638, dtype: object


In [59]:
common_hotel_ids = pd.Series(list(set(df_rooms['core_hotel_id']) & set(df_ref['hotel_id'])))
print(f"Common hotel_id count: {len(common_hotel_ids)}")
print(common_hotel_ids)

Common hotel_id count: 852
0       364545
1       286722
2       720899
3      2201605
4       440327
        ...   
847     186352
848     188400
849     454643
850    2627579
851     600060
Length: 852, dtype: int64


In [69]:
common_room_ids = pd.Series(list(set(df_rooms['core_room_id']) & set(df_ref['room_id'])))
print(f"Common core_room_id/room_id count: {len(common_room_ids)}")
print(common_room_ids)

Common core_room_id/room_id count: 580
0      1429504
1      1429505
2        45058
3      1429506
4      1429507
        ...   
575     114683
576    2646012
577    1429501
578    1429502
579    1429503
Length: 580, dtype: int64


In [37]:
common_room_ids = pd.Series(list(set(df_rooms['supplier_room_id']) & set(df_ref['room_id'])))
print(f"Common supplier_room_id/room_id count: {len(common_room_ids)}")
print(common_room_ids)

Common supplier_room_id/room_id count: 104
0      314329102
1      324035101
2      322947102
3      228316702
4      228316703
         ...    
99     216939502
100    216939503
101    216939504
102    202266104
103       114687
Length: 104, dtype: int64


In [45]:
df_rooms[df_rooms['lp_id'] == 'lp655835e3']

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name
2350656,2394044,1700279779,lp655835e3,320662691,Expedia,إستديو ديلوكس
2354917,2398305,1700279779,lp655835e3,320662463,Expedia,غرفة مزدوجة


In [46]:
df_ref[df_ref['lp_id'] == 'lp655835e3']

Unnamed: 0,hotel_id,lp_id,room_id,room_name
74793,13463994,lp655835e3,1142451754,Deluxe Studio
74794,13463994,lp655835e3,1142451769,Double Room


In [84]:
df_rooms[df_rooms['core_hotel_id'] == 628660]

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name
153,154,628660,lp997b4,201691711,Expedia,"Family House, 5 Bedrooms"


In [83]:
df_ref[df_ref['hotel_id'] == 628660]

Unnamed: 0,hotel_id,lp_id,room_id,room_name
65895,628660,lp71165,62866001,Double Room
65896,628660,lp71165,62866002,Twin Room
65897,628660,lp71165,62866003,Standard Triple Room
65898,628660,lp71165,62866005,Quadruple Room
65899,628660,lp71165,62866006,Triple Room with One Double Bed and One Single...
65900,628660,lp71165,62866009,Superior Double Room


In [66]:
print(df_rooms['core_hotel_id'].dtype)
print(df_ref['hotel_id'].dtype)

int64
int64


In [67]:
print(df_rooms['core_hotel_id'].isna().sum(), 'NaNs in core_hotel_id')
print(df_ref['hotel_id'].isna().sum(), 'NaNs in hotel_id')

0 NaNs in core_hotel_id
0 NaNs in hotel_id


In [75]:
print(df_rooms.columns)
print(df_ref.columns)

Index(['core_room_id', 'core_hotel_id', 'lp_id', 'supplier_room_id',
       'supplier_name', 'supplier_room_name'],
      dtype='object')
Index(['hotel_id', 'lp_id', 'room_id', 'room_name'], dtype='object')


In [86]:
df_rooms[df_rooms['core_room_id'] == 1429507]

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name
1426994,1429507,674747,lpa4bbb,201881189,Expedia,Deluxe Room


In [87]:
df_ref[df_ref['room_id'] == 1429507]

Unnamed: 0,hotel_id,lp_id,room_id,room_name
62113,14295,lp4d7e4,1429507,Superior Three-Bedroom Apartment


In [88]:
df_rooms[df_rooms['supplier_room_id'] == 228316703]

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name
866707,869220,1700065373,lp6554f05d,228316703,Expedia,Executive Room


In [89]:
df_ref[df_ref['room_id'] == 228316703]

Unnamed: 0,hotel_id,lp_id,room_id,room_name
90958,2283167,lp6564419b,228316703,Two-Bedroom Apartment
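
The spot checks above suggest that numeric ID overlaps between the two tables are coincidental: hotel `628660`, room `1429507`, and room `228316703` each appear to point to unrelated records with different `lp_id` values. A shared `lp_id` such as `lp655835e3`, by contrast, pairs rooms that look like the same property in two languages. A minimal sketch, assuming `lp_id` is the right blocking key, builds candidate supplier/reference pairs per `lp_id`. `candidate_pairs` is a hypothetical helper, and the rows are copied from the outputs above.

```python
from collections import defaultdict


def candidate_pairs(supplier_rows, reference_rows):
    """Pair every supplier room with every reference room sharing its lp_id."""
    refs_by_lp = defaultdict(list)
    for ref in reference_rows:
        refs_by_lp[ref["lp_id"]].append(ref)
    pairs = []
    for sup in supplier_rows:
        for ref in refs_by_lp.get(sup["lp_id"], []):
            pairs.append((sup["supplier_room_name"], ref["room_name"]))
    return pairs


supplier_rows = [
    {"lp_id": "lp655835e3", "supplier_room_name": "إستديو ديلوكس"},
    {"lp_id": "lp655835e3", "supplier_room_name": "غرفة مزدوجة"},
    {"lp_id": "lp997b4", "supplier_room_name": "Family House, 5 Bedrooms"},
]
reference_rows = [
    {"lp_id": "lp655835e3", "room_name": "Deluxe Studio"},
    {"lp_id": "lp655835e3", "room_name": "Double Room"},
    {"lp_id": "lp71165", "room_name": "Double Room"},
]

pairs = candidate_pairs(supplier_rows, reference_rows)
print(len(pairs))  # 4
```

On the full DataFrames the equivalent would be `df_rooms.merge(df_ref, on='lp_id')`. The result grows with the product of room counts per property, so it may be worth restricting to the shared `lp_id` values first.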
