# User-Aware, Content-Based Rating Predction

***(Not a recommendation system - evaluates prediction accuracy only)***

**Best Model Performance:**  
• Algorithm: CatBoost with 5-Fold Cross Validation  
• RMSE: 0.8746 (Root Mean Squared Error)  

**Interpretation:**  
• Lower values = Better accuracy (0 = perfect)  
• Predictions are typically ±0.87 stars from actual ratings (1-5 scale)  
• Competitive results:  
    ◦    Netflix Prize winner: 0.856 RMSE  
    ◦    Baseline (average rating prediction): 0.95-1.10 RMSE  


In [31]:
pip install xgboost lightgbm catboost

^C
Note: you may need to restart the kernel to use updated packages.




In [None]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool

In [None]:
user_content_df_path = '../data/prepared-data/user-content-based-features.csv'
user_content_df = pd.read_csv(user_content_df_path)

print(user_content_df.head())

   UserID  MovieID  Rating  Avg_Rating  Rating_Count  Age  Gender_M  \
0       1     1193       5    0.847681      0.503064  0.0       0.0   
1       1      661       3    0.616190      0.152903  0.0       0.0   
2       1      914       3    0.788522      0.185293  0.0       0.0   
3       1     3408       4    0.715970      0.383426  0.0       0.0   
4       1     2355       5    0.713594      0.496644  0.0       0.0   

   Occupation_1  Occupation_10  Occupation_11  ...  Genre_Animation  \
0           0.0            1.0            0.0  ...                0   
1           0.0            1.0            0.0  ...                1   
2           0.0            1.0            0.0  ...                0   
3           0.0            1.0            0.0  ...                0   
4           0.0            1.0            0.0  ...                1   

   Genre_Horror  Genre_Fantasy  Genre_Romance  Genre_Documentary  \
0             0              0              0                  0   
1         

## XGBoost, LightGBM and CatBoost (Train - Validation - Test Split)

Using a simple Train (6) - Validation (2) - Test (2) Split CatBoost gave the best RMSE (Root Mean Squared Error) at 0.8821. So the average distance between the predicted and the actual rating was less than one. This already looks like a quite good result.

XGBoost was worst at 0.9842 RMSE

LightGBM was a bit better at 0.9252 RMSE

CatBoost was best at only 0.8821 RMSE

In [None]:
print(user_content_df.columns)

Index(['UserID', 'MovieID', 'Rating', 'Avg_Rating', 'Rating_Count', 'Age',
       'Gender_M', 'Occupation_1', 'Occupation_10', 'Occupation_11',
       'Occupation_12', 'Occupation_13', 'Occupation_14', 'Occupation_15',
       'Occupation_16', 'Occupation_17', 'Occupation_18', 'Occupation_19',
       'Occupation_2', 'Occupation_20', 'Occupation_3', 'Occupation_4',
       'Occupation_5', 'Occupation_6', 'Occupation_7', 'Occupation_8',
       'Occupation_9', 'State_Alaska', 'State_Arizona', 'State_Arkansas',
       'State_California', 'State_Colorado', 'State_Connecticut',
       'State_Delaware', 'State_District of Columbia', 'State_Florida',
       'State_Georgia', 'State_Hawaii', 'State_Idaho', 'State_Illinois',
       'State_Indiana', 'State_Iowa', 'State_Kansas', 'State_Kentucky',
       'State_Louisiana', 'State_Maine', 'State_Maryland',
       'State_Massachusetts', 'State_Michigan', 'State_Minnesota',
       'State_Mississippi', 'State_Missouri', 'State_Montana',
       'State_Neb

In [None]:
# Datasets
train_list = []
val_list = []
test_list = []

for user_id, group in user_content_df.groupby('UserID'):
    if len(group) >= 5:  # only split users with enough ratings
        train, temp = train_test_split(group, test_size=0.4, random_state=42)
        val, test = train_test_split(temp, test_size=0.5, random_state=42)
    else:
        # If too few ratings, keep them all in train
        train = group
        val = pd.DataFrame()
        test = pd.DataFrame()
    
    train_list.append(train)
    val_list.append(val)
    test_list.append(test)

train_df = pd.concat(train_list)
val_df = pd.concat(val_list)
test_df = pd.concat(test_list)

# Rating will be the target
# Dropping Rating, UserID and MovieID
X_train = train_df.drop(columns=['Rating', 'Avg_Rating']) 
y_train = train_df['Rating']

X_val = val_df.drop(columns=['Rating', 'Avg_Rating'])
y_val = val_df['Rating']

X_test = test_df.drop(columns=['Rating', 'Avg_Rating'])
y_test = test_df['Rating']


In [None]:
print(X_train.columns)

Index(['UserID', 'MovieID', 'Rating_Count', 'Age', 'Gender_M', 'Occupation_1',
       'Occupation_10', 'Occupation_11', 'Occupation_12', 'Occupation_13',
       'Occupation_14', 'Occupation_15', 'Occupation_16', 'Occupation_17',
       'Occupation_18', 'Occupation_19', 'Occupation_2', 'Occupation_20',
       'Occupation_3', 'Occupation_4', 'Occupation_5', 'Occupation_6',
       'Occupation_7', 'Occupation_8', 'Occupation_9', 'State_Alaska',
       'State_Arizona', 'State_Arkansas', 'State_California', 'State_Colorado',
       'State_Connecticut', 'State_Delaware', 'State_District of Columbia',
       'State_Florida', 'State_Georgia', 'State_Hawaii', 'State_Idaho',
       'State_Illinois', 'State_Indiana', 'State_Iowa', 'State_Kansas',
       'State_Kentucky', 'State_Louisiana', 'State_Maine', 'State_Maryland',
       'State_Massachusetts', 'State_Michigan', 'State_Minnesota',
       'State_Mississippi', 'State_Missouri', 'State_Montana',
       'State_Nebraska', 'State_Nevada', 'State_

In [None]:
#XGBoost

xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1)
xgb_model.fit(X_train, y_train)

y_pred_val = xgb_model.predict(X_val)

# Calculate MSE 
mse = mean_squared_error(y_val, y_pred_val)
# Calculate RMSE
rmse = mse ** 0.5  # Equivalent to sqrt(mse)
print(f'Validation RMSE: {rmse:.4f}')


Validation RMSE: 0.9860


In [None]:
y_pred_test = xgb_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred_test)
test_rmse = test_mse ** 0.5
print(f'Test RMSE: {test_rmse:.4f}')


Test RMSE: 0.9842


In [None]:
#LightGBM

# Create dataset
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val)

# Train
lgb_model = lgb.train(
    {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": -1,
        "early_stopping_rounds": 20
    },
    train_data,
    valid_sets=[val_data],
    num_boost_round=1000
)

# Predict
y_pred = lgb_model.predict(X_test)


In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: {rmse:.4f}")

Test RMSE: 0.9252


In [None]:
X_train.columns

Index(['UserID', 'MovieID', 'Rating_Count', 'Age', 'Gender_M', 'Occupation_1',
       'Occupation_10', 'Occupation_11', 'Occupation_12', 'Occupation_13',
       'Occupation_14', 'Occupation_15', 'Occupation_16', 'Occupation_17',
       'Occupation_18', 'Occupation_19', 'Occupation_2', 'Occupation_20',
       'Occupation_3', 'Occupation_4', 'Occupation_5', 'Occupation_6',
       'Occupation_7', 'Occupation_8', 'Occupation_9', 'State_Alaska',
       'State_Arizona', 'State_Arkansas', 'State_California', 'State_Colorado',
       'State_Connecticut', 'State_Delaware', 'State_District of Columbia',
       'State_Florida', 'State_Georgia', 'State_Hawaii', 'State_Idaho',
       'State_Illinois', 'State_Indiana', 'State_Iowa', 'State_Kansas',
       'State_Kentucky', 'State_Louisiana', 'State_Maine', 'State_Maryland',
       'State_Massachusetts', 'State_Michigan', 'State_Minnesota',
       'State_Mississippi', 'State_Missouri', 'State_Montana',
       'State_Nebraska', 'State_Nevada', 'State_

In [None]:
# CatBoost

# Categorical features
cat_features = [
    'UserID',        # Identifier (not continuous)
    'MovieID',       # Identifier
    'Gender_M',      # Binary but categorical
    'Decade',        # Ordinal categorical
    *[f'Occupation_{i}' for i in range(1, 21)],  # One-hot encoded occupations
    *[col for col in X_train.columns if col.startswith('State_')],  # States
    *[col for col in X_train.columns if col.startswith('Genre_')]   # Genres
]

# Convert to strings
for col in cat_features:
    X_train[col] = X_train[col].astype(str)
    X_val[col] = X_val[col].astype(str)
    X_test[col] = X_test[col].astype(str)

cat_model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    eval_metric="RMSE",
    early_stopping_rounds=20,
    verbose=100
)

cat_model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    cat_features=cat_features
)

# Predict
y_pred = cat_model.predict(X_test)


0:	learn: 1.0901630	test: 1.0902843	best: 1.0902843 (0)	total: 769ms	remaining: 12m 48s
100:	learn: 0.9258705	test: 0.9057462	best: 0.9057462 (100)	total: 31s	remaining: 4m 36s
200:	learn: 0.9185684	test: 0.8986158	best: 0.8986158 (200)	total: 1m 4s	remaining: 4m 17s
300:	learn: 0.9139935	test: 0.8944331	best: 0.8944331 (300)	total: 1m 37s	remaining: 3m 45s
400:	learn: 0.9106270	test: 0.8915031	best: 0.8915031 (400)	total: 2m 12s	remaining: 3m 17s
500:	learn: 0.9079283	test: 0.8893629	best: 0.8893629 (500)	total: 2m 49s	remaining: 2m 49s
600:	learn: 0.9057494	test: 0.8875837	best: 0.8875837 (600)	total: 3m 25s	remaining: 2m 16s
700:	learn: 0.9039449	test: 0.8862578	best: 0.8862578 (700)	total: 4m 3s	remaining: 1m 43s
800:	learn: 0.9023515	test: 0.8852341	best: 0.8852341 (800)	total: 4m 37s	remaining: 1m 8s
900:	learn: 0.9008919	test: 0.8843621	best: 0.8843621 (900)	total: 5m 13s	remaining: 34.5s
999:	learn: 0.8996064	test: 0.8835720	best: 0.8835720 (999)	total: 5m 57s	remaining: 0us

b

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: {rmse:.4f}")

Test RMSE: 0.8821


## XGBoost, LightGBM and CatBoost (K-Fold Cross-Validation)

Using K-Fold Cross-Validation our result improved a little bit with the best result still being CatBoost now at 0.8746 RMSE.

XGBoost improved a lot to 0.8969 RMSE (0.9842 RMSE previously so almost by 0.1)

LightGBM didn't improve much at 0.9230 RMSE (0.9252 RMSE previourly so very little improvement)

CatBoost still was best at 0.8746 RMSE (0.8821 RMSE previously so not that much improvement but still better)

In [None]:
X = user_content_df.drop(columns=['Rating', 'Avg_Rating'])
y = user_content_df['Rating']

In [None]:
#XGBoost
# Initialize KFold
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

sum = 0
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"\nFold {fold + 1}")
    
    # Split data
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # Create DMatrix (XGBoost's optimized data structure)
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    
    # Train model
    params = {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "early_stopping_rounds": 20,
        "verbosity": 1
    }
    xgb_model = xgb.train(
        params,
        dtrain,
        num_boost_round=1000,
        evals=[(dval, "validation")],
        verbose_eval=100
    )
    
    # Evaluate
    val_pred = xgb_model.predict(dval)
    rmse = np.sqrt(mean_squared_error(y_val, val_pred))
    print(f"Validation RMSE: {rmse:.4f}")
    print(f"Sample predictions: {val_pred[:5]} vs actual {y_val.values[:5]}")
    sum = sum + rmse

print(f"\nAverage RMSE across the folds: {(sum / 5):.4f}")


Fold 1


Parameters: { "early_stopping_rounds" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	validation-rmse:1.07385
[100]	validation-rmse:0.95175
[200]	validation-rmse:0.92882
[300]	validation-rmse:0.91667
[400]	validation-rmse:0.90988
[500]	validation-rmse:0.90547
[600]	validation-rmse:0.90227
[700]	validation-rmse:0.90017
[800]	validation-rmse:0.89826
[900]	validation-rmse:0.89698
[999]	validation-rmse:0.89623
Validation RMSE: 0.8962
Sample predictions: [4.120925  3.850048  3.7341363 4.2973332 4.3388624] vs actual [5 5 4 5 5]

Fold 2


Parameters: { "early_stopping_rounds" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	validation-rmse:1.07313
[100]	validation-rmse:0.95097
[200]	validation-rmse:0.92720
[300]	validation-rmse:0.91528
[400]	validation-rmse:0.90881
[500]	validation-rmse:0.90426
[600]	validation-rmse:0.90089
[700]	validation-rmse:0.89890
[800]	validation-rmse:0.89718
[900]	validation-rmse:0.89598
[999]	validation-rmse:0.89505
Validation RMSE: 0.8951
Sample predictions: [3.8266802 4.5522766 4.577006  3.7671819 3.5718682] vs actual [5 4 5 4 3]

Fold 3


Parameters: { "early_stopping_rounds" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	validation-rmse:1.07570
[100]	validation-rmse:0.95578
[200]	validation-rmse:0.93130
[300]	validation-rmse:0.92023
[400]	validation-rmse:0.91318
[500]	validation-rmse:0.90877
[600]	validation-rmse:0.90568
[700]	validation-rmse:0.90292
[800]	validation-rmse:0.90110
[900]	validation-rmse:0.89950
[999]	validation-rmse:0.89820
Validation RMSE: 0.8982
Sample predictions: [4.9098783 4.1734066 4.0148816 4.383259  3.9715352] vs actual [3 4 4 3 4]

Fold 4


Parameters: { "early_stopping_rounds" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	validation-rmse:1.07774
[100]	validation-rmse:0.95621
[200]	validation-rmse:0.93336
[300]	validation-rmse:0.92068
[400]	validation-rmse:0.91453
[500]	validation-rmse:0.90991
[600]	validation-rmse:0.90661
[700]	validation-rmse:0.90447
[800]	validation-rmse:0.90279
[900]	validation-rmse:0.90138
[999]	validation-rmse:0.90009
Validation RMSE: 0.9001
Sample predictions: [3.5050843 3.8530035 4.176231  3.977776  4.3112793] vs actual [3 5 3 4 5]

Fold 5


Parameters: { "early_stopping_rounds" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


[0]	validation-rmse:1.07260
[100]	validation-rmse:0.95013
[200]	validation-rmse:0.92757
[300]	validation-rmse:0.91711
[400]	validation-rmse:0.91083
[500]	validation-rmse:0.90575
[600]	validation-rmse:0.90236
[700]	validation-rmse:0.89992
[800]	validation-rmse:0.89730
[900]	validation-rmse:0.89596
[999]	validation-rmse:0.89511
Validation RMSE: 0.8951
Sample predictions: [3.6500888 4.7673464 4.2546234 4.302665  3.638714 ] vs actual [4 3 4 4 4]

Average RMSE across the folds: 0.8969


In [None]:
#LightGBM
# Initialize KFold
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

sum = 0
# Iterate through folds
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"\nFold {fold + 1}")
    
    # Split data
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    # Create LightGBM datasets
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val)
    
    # Train model (using train(), not fit())
    lgb_model = lgb.train(
        params={
            "objective": "regression",
            "metric": "rmse",
            "verbosity": -1,
            "early_stopping_rounds" : 20,
            "verbose_eval" : 100
        },
        train_set=train_data,
        valid_sets=[val_data],
        num_boost_round=1000
    )
    
    # Evaluate
    val_pred = lgb_model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, val_pred))
    print(f"Validation RMSE: {rmse:.4f}")
    print(f"Sample predictions: {val_pred[:5]} vs actual {y_val.values[:5]}")
    sum = sum + rmse

print(f"\nAverage RMSE across the folds: {(sum / 5):.4f}")


Fold 1
Validation RMSE: 0.9220
Sample predictions: [4.32127794 4.3608155  3.97364269 4.33503937 4.08090667] vs actual [5 5 4 5 5]

Fold 2
Validation RMSE: 0.9228
Sample predictions: [4.2541268  4.3755877  4.31654812 3.8813446  3.42206268] vs actual [5 4 5 4 3]

Fold 3
Validation RMSE: 0.9248
Sample predictions: [4.61011228 4.08893249 4.36865875 4.68068453 3.9540503 ] vs actual [3 4 4 3 4]

Fold 4
Validation RMSE: 0.9251
Sample predictions: [3.99234066 4.08473618 4.12662646 4.17505799 4.08593463] vs actual [3 5 3 4 5]

Fold 5
Validation RMSE: 0.9204
Sample predictions: [4.04367486 4.67524607 3.97769593 4.43682941 4.16190585] vs actual [4 3 4 4 4]

Average RMSE across the folds: 0.9230


In [None]:
#CatBoost
#Categorical features
cat_features = [
    'UserID',        # Identifier (not continuous)
    'MovieID',       # Identifier
    'Gender_M',      # Binary but categorical
    'Decade',        # Ordinal categorical
    *[f'Occupation_{i}' for i in range(1, 21)],  # One-hot encoded occupations
    *[col for col in X.columns if col.startswith('State_')],  # States
    *[col for col in X.columns if col.startswith('Genre_')]   # Genres
]

# Initialize KFold
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

sum = 0
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"\nFold {fold + 1}")
    
    # Split data
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    # Convert categorical features to strings
    for col in cat_features:
        X_train[col] = X_train[col].astype(str)
        X_val[col] = X_val[col].astype(str)
    
    # Create Pool (CatBoost's data structure)
    train_pool = Pool(X_train, y_train, cat_features=cat_features)
    val_pool = Pool(X_val, y_val, cat_features=cat_features)
    
    # Train model
    cat_model = CatBoostRegressor(
        iterations=1000,
        learning_rate=0.1,
        loss_function='RMSE',
        early_stopping_rounds=20,
        verbose=100
    )
    cat_model.fit(
        train_pool,
        eval_set=val_pool,
        use_best_model=True
    )
    
    # Evaluate
    val_pred = cat_model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, val_pred))
    print(f"Validation RMSE: {rmse:.4f}")
    print(f"Sample predictions: {val_pred[:5]} vs actual {y_val.values[:5]}")
    sum = sum + rmse

print(f"\nAverage RMSE across the folds: {(sum / 5):.4f}")


Fold 1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val[col] = X_val[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_ind

0:	learn: 1.0896870	test: 1.0874851	best: 1.0874851 (0)	total: 631ms	remaining: 10m 29s
100:	learn: 0.9196296	test: 0.8965605	best: 0.8965605 (100)	total: 33s	remaining: 4m 53s
200:	learn: 0.9127069	test: 0.8896333	best: 0.8896333 (200)	total: 1m 16s	remaining: 5m 4s
300:	learn: 0.9081060	test: 0.8852104	best: 0.8852104 (300)	total: 1m 59s	remaining: 4m 38s
400:	learn: 0.9049532	test: 0.8823541	best: 0.8823541 (400)	total: 2m 45s	remaining: 4m 6s
500:	learn: 0.9021873	test: 0.8798914	best: 0.8798914 (500)	total: 3m 34s	remaining: 3m 33s
600:	learn: 0.8999355	test: 0.8781224	best: 0.8781224 (600)	total: 4m 26s	remaining: 2m 56s
700:	learn: 0.8980098	test: 0.8766001	best: 0.8766001 (700)	total: 5m 12s	remaining: 2m 13s
800:	learn: 0.8964030	test: 0.8754002	best: 0.8754002 (800)	total: 6m	remaining: 1m 29s
900:	learn: 0.8950164	test: 0.8743275	best: 0.8743275 (900)	total: 6m 46s	remaining: 44.6s
999:	learn: 0.8937006	test: 0.8734043	best: 0.8734043 (999)	total: 7m 29s	remaining: 0us

best

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val[col] = X_val[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_ind

0:	learn: 1.0901017	test: 1.0862287	best: 1.0862287 (0)	total: 614ms	remaining: 10m 13s
100:	learn: 0.9200297	test: 0.8959630	best: 0.8959630 (100)	total: 37.3s	remaining: 5m 32s
200:	learn: 0.9124119	test: 0.8884752	best: 0.8884752 (200)	total: 1m 20s	remaining: 5m 19s
300:	learn: 0.9079585	test: 0.8842787	best: 0.8842787 (300)	total: 2m 6s	remaining: 4m 52s
400:	learn: 0.9048433	test: 0.8816041	best: 0.8816041 (400)	total: 2m 50s	remaining: 4m 14s
500:	learn: 0.9022054	test: 0.8792611	best: 0.8792611 (500)	total: 3m 38s	remaining: 3m 37s
600:	learn: 0.9001504	test: 0.8776656	best: 0.8776656 (600)	total: 4m 31s	remaining: 3m
700:	learn: 0.8983710	test: 0.8764048	best: 0.8764048 (700)	total: 5m 18s	remaining: 2m 15s
800:	learn: 0.8968532	test: 0.8752488	best: 0.8752488 (800)	total: 6m 6s	remaining: 1m 30s
900:	learn: 0.8953922	test: 0.8742027	best: 0.8742027 (900)	total: 6m 49s	remaining: 45s
999:	learn: 0.8941385	test: 0.8733991	best: 0.8733991 (999)	total: 7m 32s	remaining: 0us

best

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val[col] = X_val[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_ind

0:	learn: 1.0893339	test: 1.0894278	best: 1.0894278 (0)	total: 575ms	remaining: 9m 34s
100:	learn: 0.9194143	test: 0.9001486	best: 0.9001486 (100)	total: 35.6s	remaining: 5m 16s
200:	learn: 0.9124434	test: 0.8932358	best: 0.8932358 (200)	total: 1m 19s	remaining: 5m 16s
300:	learn: 0.9077791	test: 0.8884425	best: 0.8884425 (300)	total: 2m 4s	remaining: 4m 49s
400:	learn: 0.9042888	test: 0.8851051	best: 0.8851051 (400)	total: 2m 50s	remaining: 4m 14s
500:	learn: 0.9016793	test: 0.8827469	best: 0.8827469 (500)	total: 3m 35s	remaining: 3m 34s
600:	learn: 0.8995426	test: 0.8809104	best: 0.8809104 (600)	total: 4m 26s	remaining: 2m 56s
700:	learn: 0.8977297	test: 0.8794685	best: 0.8794685 (700)	total: 5m 10s	remaining: 2m 12s
800:	learn: 0.8961450	test: 0.8782281	best: 0.8782281 (800)	total: 5m 53s	remaining: 1m 27s
900:	learn: 0.8946807	test: 0.8770774	best: 0.8770774 (900)	total: 6m 36s	remaining: 43.6s
999:	learn: 0.8933795	test: 0.8760722	best: 0.8760722 (999)	total: 7m 22s	remaining: 0us

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val[col] = X_val[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_ind

0:	learn: 1.0889488	test: 1.0914485	best: 1.0914485 (0)	total: 622ms	remaining: 10m 21s
100:	learn: 0.9196923	test: 0.9005575	best: 0.9005575 (100)	total: 35.1s	remaining: 5m 12s
200:	learn: 0.9121161	test: 0.8930213	best: 0.8930213 (200)	total: 1m 19s	remaining: 5m 16s
300:	learn: 0.9077009	test: 0.8887482	best: 0.8887482 (300)	total: 2m 3s	remaining: 4m 46s
400:	learn: 0.9043978	test: 0.8856388	best: 0.8856388 (400)	total: 2m 51s	remaining: 4m 16s
500:	learn: 0.9019478	test: 0.8834781	best: 0.8834781 (500)	total: 3m 40s	remaining: 3m 39s
600:	learn: 0.8996904	test: 0.8815666	best: 0.8815666 (600)	total: 4m 28s	remaining: 2m 58s
700:	learn: 0.8979555	test: 0.8802439	best: 0.8802439 (700)	total: 5m 10s	remaining: 2m 12s
800:	learn: 0.8962949	test: 0.8789035	best: 0.8789035 (800)	total: 5m 56s	remaining: 1m 28s
900:	learn: 0.8948097	test: 0.8778205	best: 0.8778205 (900)	total: 6m 41s	remaining: 44.2s
999:	learn: 0.8936239	test: 0.8770105	best: 0.8770105 (999)	total: 7m 25s	remaining: 0u

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val[col] = X_val[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[col] = X_train[col].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_ind

0:	learn: 1.0903077	test: 1.0866855	best: 1.0866855 (0)	total: 423ms	remaining: 7m 2s
100:	learn: 0.9202067	test: 0.8966181	best: 0.8966181 (100)	total: 36.4s	remaining: 5m 24s
200:	learn: 0.9129271	test: 0.8891279	best: 0.8891279 (200)	total: 1m 21s	remaining: 5m 24s
300:	learn: 0.9083769	test: 0.8847841	best: 0.8847841 (300)	total: 2m 4s	remaining: 4m 48s
400:	learn: 0.9052062	test: 0.8820516	best: 0.8820516 (400)	total: 2m 48s	remaining: 4m 11s
500:	learn: 0.9026715	test: 0.8797836	best: 0.8797836 (500)	total: 3m 31s	remaining: 3m 30s
600:	learn: 0.9005077	test: 0.8777765	best: 0.8777765 (600)	total: 4m 25s	remaining: 2m 56s
700:	learn: 0.8986602	test: 0.8762686	best: 0.8762686 (700)	total: 5m 8s	remaining: 2m 11s
800:	learn: 0.8972046	test: 0.8751756	best: 0.8751756 (800)	total: 5m 52s	remaining: 1m 27s
900:	learn: 0.8957936	test: 0.8740760	best: 0.8740760 (900)	total: 6m 37s	remaining: 43.6s
999:	learn: 0.8944298	test: 0.8730901	best: 0.8730901 (999)	total: 7m 22s	remaining: 0us

