This synthetic dataset simulates financial, demographic, and operational information about young Nigerian farmers to support the development of predictive models for credit scoring and loan repayment likelihood. It contains 1,000 records of individual agripreneurs across the country.

### Demographic Data

* **Age**: Age of the farmer in years (integer).
* **Gender**: Farmer's gender (Male/Female).
* **Education**: Highest education level attained (Primary, Secondary, Tertiary).
* **Marital_Status**: Marital status of the farmer (Single, Married, Divorced).
* **Region**: Geopolitical zone (e.g., North West, South South).
* **State**: Nigerian state of residence (e.g., Kaduna, Lagos).

### Farm Characteristics

* **Farm_Size**: Size of the farm in hectares (float).
* **Crop_Type**: Major crop(s) grown (e.g., Maize, Cassava, Yam).
* **Livestock_Type**: Type of livestock raised (if any).
* **Livestock_Number**: Number of livestock owned.
* **Irrigation**: Whether an irrigation system is used (Yes/No).
* **Crop_Cycles**: Number of crop cycles per year (integer).
* **Technology_Use**: Use of modern agricultural technology (e.g., apps, mechanized tools).

### Financial History

* **Previous_Loans**: Whether the farmer has taken loans before (Yes/No).
* **Loan_Amount**: Current or most recent loan amount in Naira (integer); 0 if the farmer has no previous loan.
* **Repayment_Status**: Status of past loan repayment (Paid on Time, Late, Defaulted).
* **Savings_Behavior**: Whether the farmer saves regularly (Yes/No).
* **Financial_Access**: Access to financial institutions (e.g., bank account, mobile money).
* **Annual_Income**: Annual income from farming in Naira (float).

### Operational Capability

* **Extension_Services**: Access to agricultural extension services or training (Yes/No).
* **Market_Distance**: Distance to nearest market in kilometers (float).
* **Yield_Per_Season**: Average crop yield or revenue per season (float).
* **Input_Usage**: Use of essential inputs like fertilizers, seeds, pesticides (All/Some/None).
* **Labor**: Type of labor used (Family, Hired, Both).

### Target Variables

* **Credit_Score**: A computed numerical score (range: 300-850) indicating creditworthiness based on income, repayment history, savings, and loan size.
* **Repayment_Probability**: A derived repayment probability (0-100%) estimating the likelihood of successful loan repayment. It replaces the binary `Loan_Repaid` flag (1 = paid on time) written by the generation script.

For realism, distributions such as age, education, and farm size were based on reports from the FAO World Census of Agriculture and IFPRI.

Demographic data distribution:
- Age: Uniform distribution between [18 and 35](https://biomedres.us/fulltexts/BJSTR.MS.ID.003792.php).
- Gender: Binary distribution with a [60% male and 40% female](https://www.researchgate.net/publication/350366662_Male_and_Female_Employment_in_Agriculture_and_Agricultural_Productivity_in_Nigeria) split.
- Education Level: Categorical distribution: 30% Primary, 50% Secondary, 20% Tertiary.
- Marital Status: Categorical distribution: 50% Single, 40% Married, 10% Divorced.
- Location: Random selection from regional dictionary with data obtained from [Wiki](https://en.wikipedia.org/wiki/Agriculture_in_Nigeria#:~:text=Maize%2C%20cassava%2C%20guinea%20corn%2C,raise%20livestock%20in%20northwest%20Nigeria.) and [FAO](https://www.fao.org/nigeria/fao-in-nigeria/nigeria-at-a-glance/en/).

Farm Characteristics data distribution:
- Farm Size (hectares): Normal distribution with a mean of 2 hectares and a standard deviation of 1 hectare from [FAO](https://openknowledge.fao.org/server/api/core/bitstreams/1775a0a9-8796-4fee-bcc2-759ff92759fc/content#:~:text=The%20national%20average%20size%20of%20household%20farm,of%20the%20land%20that%20non%2Dsmall%20producers%20do.).
- Crop Types: Random selection from region dictionary.
- Livestock Type/Number: Random selection [with a 30% chance of owning livestock](https://nigerianfarming.com/nigerian-livestock-sector-employs-about-30-of-rural-population-afdb/); if owned, number ranges from 1 to 20.
- Irrigation System: Binary distribution with a [30%](https://nssp.ifpri.info/files/2012/08/NSSP-Report-10.pdf#:~:text=A%20farmer%20chooses%20an%20irrigation%20system%20based,and%20socioeconomic%20conditions%2C%20and%20existing%20government%20policies.) chance of having an irrigation system.
- Number of Crop Cycles per Year: Integer values between [1 and 3](https://www.researchgate.net/figure/Crop-production-of-Nigeria_tbl1_305296457).
- Use of Technology: Binary distribution with a 50% chance of using mobile apps or mechanized tools.

Financial History data distribution:
- Previous Loans Taken: Binary distribution with a [40% chance of having taken a loan](https://www.cbn.gov.ng/DFD/agriculture/acgsf.html#:~:text=ACGSF%20is%20an%20acronym%20for,Who%20is%20eligible%20to%20participate?).
- Loan Amount: If a loan was taken, amount ranges between [₦50,000 and ₦500,000](https://www.rvo.nl/sites/default/files/2022-05/Finance-for-Agriculture-and-Agribusiness-in-Nigeria.pdf).
- Repayment Status: Categorical distribution: [70% Paid on Time, 20% Late, 10% Defaulted](https://www.ajol.info/index.php/jafs/article/view/189402/178633).
- Savings Behavior: Binary distribution with a 60% chance of having a [savings account](https://ageconsearch.umn.edu/record/277270/files/1521.pdf) or being a [cooperative member](https://www.sciencedirect.com/science/article/pii/S2666154324004629#:~:text=The%20estimates%20indicate%20that%20joining,women%20farmers'%20participation%20in%20cooperatives.).
- Access to Financial Institutions: Binary distribution with a 70% chance of having access.
- Income from Farming: Normal distribution with a mean of ₦300,000 and a standard deviation of ₦100,000 [annually](https://www.ajol.info/index.php/naj/article/view/189491/178719#:~:text=The%20mean%20annual%20farm%20income,Naira%20and%20146%2C807%20Naira%20respectively.).

Operational Capability data distribution:
- Access to Extension Services/Training: Binary distribution with a 50% chance.
- Market Access: [Distance to market in kilometers](https://openknowledge.fao.org/server/api/core/bitstreams/8ce31a78-2848-4388-87a9-a3b1abb73e40/content), normal distribution with a mean of 10km and standard deviation of 5km.
- Yield or Revenue per Season: Normal distribution with a mean of ₦150,000 and standard deviation of ₦50,000 (about [50% of invested amount](https://openknowledge.fao.org/server/api/core/bitstreams/8ce31a78-2848-4388-87a9-a3b1abb73e40/content#:~:text=On%20average%2C%2055%20percent%20of%20a%20Nigerian,6%20percent%20to%20the%20average%20annual%20income.&text=Nigerian%20family%20farms%20sell%20only%2026%20percent,indicating%20the%20high%20share%20of%20domestic%20consumption.])).
- Input Usage: Categorical distribution: [40% use all inputs, 30% use some, 30% use none](https://www.crop2cash.com.ng/blog/farm-inputs-the-essential-ingredient-of-successful-farming#:~:text=Inputs%20for%20growth%20include%20fertilisers,harvesters%2C%20and%20other%20farm%20equipment.&text=Inputs%20can%20be%20classified%20as%20soil%2Dapplied,foliar%20applied%20or%20seed%2Dapplied.).
- Labor Used: Categorical distribution: [50% Family Labor, 30% Hired Workers, 20% Both](https://www.ajol.info/index.php/jae/article/view/282017/265744#:~:text=At%201:7500%2C%20Nigeria%20has,hindering%20progress%20in%20the%20sector.).

In [None]:
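import pandas as pd
import numpy as np
import random

# Seed both generators: the script draws from `random` (region, state, crop, livestock)
# as well as from `np.random`, and seeding only one leaves the output non-reproducible.
np.random.seed(42)
random.seed(42)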

num_records = 1000

regions = {
    "North West": {
        "states": ["Kano", "Kaduna", "Katsina", "Sokoto", "Kebbi", "Jigawa", "Zamfara"],
        "crops": ["Millet", "Sorghum", "Maize", "Rice", "Groundnut", "Cotton", "Beans"],
        "livestock": ["Cattle", "Goats", "Sheep", "Poultry"]
    },
    "North East": {
        "states": ["Borno", "Yobe", "Adamawa", "Bauchi", "Gombe", "Taraba"],
        "crops": ["Millet", "Maize", "Cowpea", "Groundnut", "Sesame", "Sorghum"],
        "livestock": ["Cattle", "Goats", "Sheep"]
    },
    "North Central": {
        "states": ["Niger", "Kwara", "Kogi", "Benue", "Nassarawa", "Plateau", "FCT"],
        "crops": ["Yam", "Cassava", "Maize", "Rice", "Sesame", "Soybeans"],
        "livestock": ["Cattle", "Goats", "Poultry", "Pigs"]
    },
    "South West": {
        "states": ["Lagos", "Ogun", "Osun", "Ondo", "Oyo", "Ekiti"],
        "crops": ["Cocoa", "Cassava", "Maize", "Oil Palm", "Plantain", "Vegetables"],
        "livestock": ["Goats", "Pigs", "Poultry"]
    },
    "South East": {
        "states": ["Abia", "Imo", "Ebonyi", "Anambra", "Enugu"],
        "crops": ["Cassava", "Yam", "Rice", "Oil Palm", "Vegetables"],
        "livestock": ["Goats", "Poultry", "Pigs"]
    },
    "South South": {
        "states": ["Rivers", "Bayelsa", "Delta", "Akwa Ibom", "Edo", "Cross River"],
        "crops": ["Oil Palm", "Cassava", "Plantain", "Rice", "Rubber", "Cocoa", "Vegetables"],
        "livestock": ["Poultry", "Pigs", "Goats"]
    }
}

def get_region_and_state():
    region = random.choice(list(regions.keys()))
    state = random.choice(regions[region]["states"])
    return region, state

records = []
for _ in range(num_records):
    region, state = get_region_and_state()
    crops = regions[region]["crops"]
    livestock_options = regions[region]["livestock"]

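    crop_type = random.choice(crops)
    # Uniform draw over "None" plus the region's livestock types, so the chance of owning
    # no livestock is 1 / (len(livestock_options) + 1), i.e. 20-25% depending on region
    # (not the 30% ownership rate described above).
    livestock_type = random.choice(["None"] + livestock_options)
    livestock_number = 0 if livestock_type == "None" else np.random.randint(1, 21)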

    record = {
        "Age": np.random.randint(18, 36),
        "Gender": np.random.choice(["Male", "Female"], p=[0.6, 0.4]),
        "Education": np.random.choice(["Primary", "Secondary", "Tertiary"], p=[0.3, 0.5, 0.2]),
        "Marital_Status": np.random.choice(["Single", "Married", "Divorced"], p=[0.5, 0.4, 0.1]),
        "Region": region,
        "State": state,
        "Farm_Size": round(np.random.normal(2, 1), 2),
        "Crop_Type": crop_type,
        "Livestock_Type": livestock_type,
        "Livestock_Number": livestock_number,
        "Irrigation": np.random.choice(["Yes", "No"], p=[0.3, 0.7]),
        "Crop_Cycles": np.random.randint(1, 4),
        "Technology_Use": np.random.choice(["Yes", "No"], p=[0.5, 0.5]),
        "Previous_Loans": np.random.choice(["Yes", "No"], p=[0.4, 0.6]),
        "Loan_Amount": 0,
        "Repayment_Status": np.random.choice(["Paid on Time", "Late", "Defaulted"], p=[0.7, 0.2, 0.1]),
        "Savings_Behavior": np.random.choice(["Yes", "No"], p=[0.6, 0.4]),
        "Financial_Access": np.random.choice(["Yes", "No"], p=[0.7, 0.3]),
        "Annual_Income": round(np.random.normal(300000, 100000), 2),
        "Extension_Services": np.random.choice(["Yes", "No"], p=[0.5, 0.5]),
        "Market_Distance": round(abs(np.random.normal(10, 5)), 2),
        "Yield_Per_Season": round(np.random.normal(150000, 50000), 2),
        "Input_Usage": np.random.choice(["All", "Some", "None"], p=[0.4, 0.3, 0.3]),
        "Labor": np.random.choice(["Family", "Hired", "Both"], p=[0.5, 0.3, 0.2])
    }

    if record["Previous_Loans"] == "Yes":
        record["Loan_Amount"] = np.random.randint(50000, 500001)
    else:
        record["Loan_Amount"] = 0

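    # Provisional targets: both are recomputed after encoding (Credit_Score from the
    # weighted formula, Loan_Repaid as a repayment probability).
    record["Loan_Repaid"] = 1 if record["Repayment_Status"] == "Paid on Time" else 0
    record["Credit_Score"] = round(np.random.uniform(40, 90) if record["Loan_Repaid"] else np.random.uniform(10, 60), 2)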

    records.append(record)

df = pd.DataFrame(records)
df.to_csv("agripreneur_data.csv", index=False)
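
A quick way to check that a sampled column lands near its target share, and to keep normally distributed quantities physically plausible, is sketched below. This is only an illustration: it uses NumPy's `default_rng` instead of the notebook's global seed, a larger sample than the 1,000 records generated here, and a 0.1 ha floor on farm size that is an assumption introduced for the example, not part of the generator above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # larger than the notebook's 1,000 records so shares settle near their targets

# Categorical draw with the 60/40 gender split described above.
gender = rng.choice(["Male", "Female"], size=n, p=[0.6, 0.4])
male_share = float(np.mean(gender == "Male"))

# Normal(2, 1) can produce zero or negative farm sizes; clipping at an assumed
# floor of 0.1 ha keeps every value positive.
farm_size_raw = rng.normal(2, 1, size=n)
farm_size = np.clip(farm_size_raw, 0.1, None)

print(f"Male share: {male_share:.3f}")
print(f"Negative raw farm sizes: {(farm_size_raw < 0).sum()}")
print(f"Smallest clipped farm size: {farm_size.min():.2f}")
```

The same clipping idea would apply to `Annual_Income` and `Yield_Per_Season`, which are also drawn from unbounded normal distributions.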

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('/content/agripreneur_data.csv')
df.head()

Unnamed: 0,Age,Gender,Education,Marital_Status,Region,State,Farm_Size,Crop_Type,Livestock_Type,Livestock_Number,...,Savings_Behavior,Financial_Access,Annual_Income,Extension_Services,Market_Distance,Yield_Per_Season,Input_Usage,Labor,Loan_Repaid,Credit_Score
0,32,Female,Secondary,Single,North Central,FCT,2.28,Sesame,Poultry,7,...,Yes,No,401051.53,No,10.11,128610.35,All,Family,0,58.69
1,32,Female,Secondary,Divorced,South South,Rivers,3.13,Rice,Poultry,15,...,No,Yes,289525.45,Yes,6.99,242613.91,All,Family,0,11.72
2,31,Male,Primary,Married,North East,Taraba,0.62,Sesame,,0,...,No,Yes,264688.33,No,9.42,134944.82,All,Family,0,51.44
3,26,Male,Primary,Single,North Central,Nassarawa,0.12,Yam,Cattle,13,...,No,Yes,163321.79,Yes,2.46,204982.35,All,Family,0,26.26
4,29,Female,Tertiary,Single,South East,Imo,2.36,Yam,Poultry,3,...,No,Yes,235488.02,Yes,10.75,144328.94,Some,Hired,1,85.38


In [None]:
df.isna().sum()

Unnamed: 0,0
Age,0
Gender,0
Education,0
Marital_Status,0
Region,0
State,0
Farm_Size,0
Crop_Type,0
Livestock_Type,225
Livestock_Number,0
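
The 225 missing `Livestock_Type` values most likely come from the generator's literal label `"None"`: recent pandas versions include that string in the default list of NA markers for `read_csv`, so it is parsed as `NaN` on load. The sketch below, using a small in-memory CSV rather than the real file, shows one way to keep the label as an ordinary string with `keep_default_na=False`.

```python
import io
import pandas as pd

# Tiny stand-in for agripreneur_data.csv with the same "None" label.
csv_text = "Livestock_Type,Livestock_Number\nNone,0\nCattle,13\nPoultry,7\n"

# keep_default_na=False turns off pandas' built-in list of NA markers,
# so the literal label "None" survives as an ordinary string.
df_demo = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(df_demo["Livestock_Type"].tolist())
print(int(df_demo["Livestock_Type"].isna().sum()))
```

The same consideration applies to `Input_Usage`, which also uses `"None"` as a category.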


In [None]:
df.dtypes

Unnamed: 0,0
Age,int64
Gender,object
Education,object
Marital_Status,object
Region,object
State,object
Farm_Size,float64
Crop_Type,object
Livestock_Type,object
Livestock_Number,int64


In [None]:
from sklearn.preprocessing import LabelEncoder

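categorical_columns = df.select_dtypes(include="object").columns.tolist()

label_mappings = {}   # column -> {original label: integer code}
label_encoders = {}   # column -> fitted LabelEncoder, kept for inverse_transform

for col in categorical_columns:
    le = LabelEncoder()
    # LabelEncoder assigns codes in sorted order of the labels (alphabetical for strings).
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
    label_mappings[col] = {cls: int(i) for i, cls in enumerate(le.classes_)}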

print(f"Encoded columns: {categorical_columns}")
print(df.head())

Encoded columns: ['Education', 'Marital_Status', 'Region', 'State', 'Crop_Type', 'Livestock_Type', 'Irrigation', 'Technology_Use', 'Previous_Loans', 'Repayment_Status', 'Savings_Behavior', 'Financial_Access', 'Extension_Services', 'Input_Usage', 'Labor']
   Age  Gender  Education  Marital_Status  Region  State  Farm_Size  \
0   32       0          1               2       0     14       2.28   
1   32       0          1               0       4     32       3.13   
2   31       1          0               1       1     34       0.62   
3   26       1          0               2       0     25       0.12   
4   29       0          2               2       3     16       2.36   

   Crop_Type  Livestock_Type  Livestock_Number  ...  Savings_Behavior  \
0         12               3                 7  ...                 1   
1         10               3                15  ...                 0   
2         12               5                 0  ...                 0   
3         16              

In [None]:
for col, mapping in label_mappings.items():
    print(f"{col} mapping:")
    for k, v in mapping.items():
        print(f"  {k} → {v}")
    print()

Education mapping:
  Primary → 0
  Secondary → 1
  Tertiary → 2

Marital_Status mapping:
  Divorced → 0
  Married → 1
  Single → 2

Region mapping:
  North Central → 0
  North East → 1
  North West → 2
  South East → 3
  South South → 4
  South West → 5

State mapping:
  Abia → 0
  Adamawa → 1
  Akwa Ibom → 2
  Anambra → 3
  Bauchi → 4
  Bayelsa → 5
  Benue → 6
  Borno → 7
  Cross River → 8
  Delta → 9
  Ebonyi → 10
  Edo → 11
  Ekiti → 12
  Enugu → 13
  FCT → 14
  Gombe → 15
  Imo → 16
  Jigawa → 17
  Kaduna → 18
  Kano → 19
  Katsina → 20
  Kebbi → 21
  Kogi → 22
  Kwara → 23
  Lagos → 24
  Nassarawa → 25
  Niger → 26
  Ogun → 27
  Ondo → 28
  Osun → 29
  Oyo → 30
  Plateau → 31
  Rivers → 32
  Sokoto → 33
  Taraba → 34
  Yobe → 35
  Zamfara → 36

Crop_Type mapping:
  Beans → 0
  Cassava → 1
  Cocoa → 2
  Cotton → 3
  Cowpea → 4
  Groundnut → 5
  Maize → 6
  Millet → 7
  Oil Palm → 8
  Plantain → 9
  Rice → 10
  Rubber → 11
  Sesame → 12
  Sorghum → 13
  Soybeans → 14
  Vegetables → 
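
Because `LabelEncoder` sorts labels alphabetically, the integer codes only carry a meaningful order when alphabetical order happens to match the natural one. For `Input_Usage` (All, None, Some) it does not. A minimal sketch of an explicit ordinal encoding with `pd.Categorical` follows; the least-to-most ordering is an assumption chosen for illustration.

```python
import pandas as pd

usage = pd.Series(["All", "Some", "None", "All"])

# Least-to-most input usage, stated explicitly instead of relying on alphabetical sorting.
order = ["None", "Some", "All"]
codes = pd.Categorical(usage, categories=order, ordered=True).codes

print(codes.tolist())

# Reverse lookup from integer code back to label.
code_to_label = dict(enumerate(order))
print(code_to_label)
```

For nominal columns such as `Region` or `Crop_Type`, where no order exists, one-hot encoding is a common alternative, although tree-based models are often tolerant of arbitrary integer codes.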

In [None]:
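# Scale the monetary inputs by their column maxima.
income_norm = df['Annual_Income'] / df['Annual_Income'].max()
loan_amount_norm = df['Loan_Amount'] / df['Loan_Amount'].max()
savings = df['Savings_Behavior']    # label-encoded: No = 0, Yes = 1
repayment = df['Repayment_Status']  # label-encoded: Defaulted = 0, Late = 1, Paid on Time = 2

# Weighted sum plus Gaussian noise (std 20), clipped to the 300-850 range.
df['Credit_Score'] = (
    300 +
    250 * income_norm +
    100 * savings +
    100 * repayment +
    -50 * loan_amount_norm +
    np.random.normal(0, 20, len(df))
).clip(300, 850).round(0)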

def sim_repayment(score, noise_std=5):
    if score >= 750:
        base_probability = 95  # 95% chance of repayment
    elif score >= 650:
        base_probability = 85  # 85% chance of repayment
    elif score >= 550:
        base_probability = 60  # 60% chance of repayment
    elif score >= 450:
        base_probability = 40  # 40% chance of repayment
    else:
        base_probability = 30  # 30% chance of repayment

    adjusted_probability = base_probability + np.random.normal(0, noise_std)

    adjusted_probability = np.clip(adjusted_probability, 0, 100)

    return adjusted_probability

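# Overwrite the binary Loan_Repaid flag with the simulated probability (0-100),
# then rename the column to reflect its new meaning.
df['Loan_Repaid'] = df['Credit_Score'].apply(sim_repayment)
df = df.rename(columns={'Loan_Repaid': 'Repayment_Probability'})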

df.head()

Unnamed: 0,Age,Gender,Education,Marital_Status,Region,State,Farm_Size,Crop_Type,Livestock_Type,Livestock_Number,...,Savings_Behavior,Financial_Access,Annual_Income,Extension_Services,Market_Distance,Yield_Per_Season,Input_Usage,Labor,Repayment_Probability,Credit_Score
0,32,0,1,2,0,14,2.28,12,3,7,...,1,0,401051.53,0,10.11,128610.35,0,1,52.660278,564.0
1,32,0,1,0,4,32,3.13,10,3,15,...,0,1,289525.45,1,6.99,242613.91,0,1,25.624057,424.0
2,31,1,0,1,1,34,0.62,12,5,0,...,0,1,264688.33,0,9.42,134944.82,0,1,34.777519,437.0
3,26,1,0,2,0,25,0.12,16,0,13,...,0,1,163321.79,1,2.46,204982.35,0,1,27.025403,432.0
4,29,0,2,2,3,16,2.36,16,3,3,...,0,1,235488.02,1,10.75,144328.94,1,2,61.592467,570.0
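
The tier lookup inside `sim_repayment` can also be written without a Python-level loop. The sketch below uses `np.select` to reproduce the same score bands; `base_repayment_probability` is a hypothetical helper introduced here, and the random noise is left out so the result is deterministic.

```python
import numpy as np

def base_repayment_probability(scores):
    """Map credit scores to the tier base probabilities used in sim_repayment (no noise)."""
    scores = np.asarray(scores, dtype=float)
    conditions = [scores >= 750, scores >= 650, scores >= 550, scores >= 450]
    choices = [95, 85, 60, 40]
    # np.select takes the first matching condition, mirroring the if/elif chain.
    return np.select(conditions, choices, default=30)

demo_scores = [800, 700, 564, 450, 424]
demo_probs = base_repayment_probability(demo_scores)
print(demo_probs.tolist())  # [95, 85, 60, 40, 30]
```

Adding `np.random.normal(0, noise_std, len(scores))` and clipping to 0-100 would recover the noisy version in a single vectorized step.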


In [None]:
# df.to_csv("clean_agripreneur.csv", index=False)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
X = df.drop(columns=['Credit_Score', 'Repayment_Probability'])
y = df[['Credit_Score', 'Repayment_Probability']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultiOutputRegressor(RandomForestRegressor(random_state=42))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r2_credit = r2_score(y_test['Credit_Score'], y_pred[:, 0])
r2_repay = r2_score(y_test['Repayment_Probability'], y_pred[:, 1])
mse_credit = mean_squared_error(y_test['Credit_Score'], y_pred[:, 0])
mse_repay = mean_squared_error(y_test['Repayment_Probability'], y_pred[:, 1])

print("Credit Score - R2:", r2_credit)
print("Repayment Probability - R2:", r2_repay)
print("Credit Score - MSE:", mse_credit)
print("Repayment Probability - MSE:", mse_repay)

Credit Score - R2: 0.9400012746536005
Repayment Probability - R2: 0.8195458036132416
Credit Score - MSE: 563.411693
Repayment Probability - MSE: 73.94019406467065
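
MSE is easier to interpret after taking the square root, which puts the error back in the units of each target. The short calculation below converts the two reported MSE values to RMSE.

```python
import math

mse_credit = 563.411693          # reported Credit Score MSE
mse_repay = 73.94019406467065    # reported Repayment Probability MSE

rmse_credit = math.sqrt(mse_credit)  # in credit-score points
rmse_repay = math.sqrt(mse_repay)    # in percentage points

print(f"Credit Score RMSE: {rmse_credit:.1f} points")
print(f"Repayment Probability RMSE: {rmse_repay:.1f} percentage points")
```

For context, the credit score formula adds Gaussian noise with a standard deviation of 20 points, so an RMSE of roughly 24 points is already close to what that noise alone would allow.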


In [None]:
for i in X.columns:
  print(i)

Age
Gender
Education
Marital_Status
Region
State
Farm_Size
Crop_Type
Livestock_Type
Livestock_Number
Irrigation
Crop_Cycles
Technology_Use
Previous_Loans
Loan_Amount
Repayment_Status
Savings_Behavior
Financial_Access
Annual_Income
Extension_Services
Market_Distance
Yield_Per_Season
Input_Usage
Labor


In [None]:
sample_predictions = model.predict(X_test[:5])

pred_df = pd.DataFrame(sample_predictions, columns=['Predicted_Credit_Score', 'Predicted_Repayment_Probability'])

pred_df['Actual_Credit_Score'] = y_test['Credit_Score'].values[:5]
pred_df['Actual_Repayment_Probability'] = y_test['Repayment_Probability'].values[:5]

print(pred_df)

   Predicted_Credit_Score  Predicted_Repayment_Probability  \
0                  674.69                        84.074353   
1                  669.04                        80.341346   
2                  706.72                        84.408375   
3                  589.63                        60.734728   
4                  584.98                        59.051081   

   Actual_Credit_Score  Actual_Repayment_Probability  
0                669.0                     86.199533  
1                665.0                     83.973640  
2                722.0                     88.951866  
3                613.0                     59.255330  
4                582.0                     53.835976  


In [None]:
import joblib

joblib.dump(model, 'multioutput_credit_model.pkl')

['multioutput_credit_model.pkl']
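
A saved model can be reloaded with `joblib.load` and used for prediction. The sketch below demonstrates the round trip on toy data (three random features and two targets, not the agripreneur dataset) and writes to a temporary directory; the names `demo_model` and `demo_model.pkl` are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy stand-in data: 3 features, 2 targets.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = np.column_stack([X_demo[:, 0] * 2, X_demo[:, 1] - X_demo[:, 2]])

demo_model = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)
    reloaded = joblib.load(path)

# The reloaded model reproduces the original predictions, one column per target.
preds_original = demo_model.predict(X_demo[:5])
preds_reloaded = reloaded.predict(X_demo[:5])
print(preds_reloaded.shape)
```

When the real model is reloaded, new inputs need the same columns, in the same order and with the same label encoding, as the training features.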

## Methodology

To develop a predictive model for assessing the creditworthiness and loan repayment likelihood of young Nigerian agripreneurs, the following steps were taken:

### 1. **Data Design & Generation**

* A synthetic dataset of 1,000 entries was created to simulate realistic financial and demographic profiles of young farmers across Nigeria.
* Distributions were modeled on publicly available statistics, regional agricultural patterns, and common financial behaviors.
* Each record included features on demographics, farm operations, financial history, and agricultural practices.

### 2. **Feature Engineering**

* Key categorical features (e.g., gender, crop type, region) were encoded numerically using `LabelEncoder`.
* The `Credit_Score` was computed using a formula weighted by income, loan size, savings behavior, and repayment history, with added random noise for realism.
* `Repayment_Probability` (stored as `Loan_Repaid` before renaming) was derived as a probabilistic function of the credit score, scaled to represent percentage likelihood.
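
Written out, the credit score computation in the notebook code corresponds to the expression below. The symbols are notation introduced here: $I$ is annual income, $L$ is loan amount, $S \in \{0, 1\}$ is the encoded savings flag, and $R \in \{0, 1, 2\}$ is the encoded repayment status (Defaulted, Late, Paid on Time).

$$
\text{Credit\_Score} = \operatorname{round}\!\left(\operatorname{clip}\!\left(300 + 250\,\frac{I}{I_{\max}} + 100\,S + 100\,R - 50\,\frac{L}{L_{\max}} + \varepsilon,\; 300,\; 850\right)\right), \qquad \varepsilon \sim \mathcal{N}(0,\, 20^2)
$$

As a worked example with the noise term set to zero, a farmer with $I / I_{\max} = 0.5$, $S = 1$, $R = 2$, and $L / L_{\max} = 0.2$ scores $300 + 125 + 100 + 200 - 10 = 715$. The deterministic part reaches its maximum of 850 for a farmer with the highest income, regular savings, on-time repayment, and no loan.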

### 3. **Model Building**

* A **MultiOutputRegressor** with a `RandomForestRegressor` base was trained to simultaneously predict:

  * `Credit_Score` (continuous value between 300-850)
  * `Repayment_Probability` (repayment likelihood as a percentage)
* The dataset was split into training and test sets (80/20 split), and the model was evaluated using **R² and MSE** metrics.

### 4. **Evaluation & Export**

* The final model achieved an R² of **0.94** for credit score prediction and **0.82** for loan repayment probability.
* The trained model was exported using `joblib` for integration into an application.


In [None]:
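# The binary Loan_Repaid column was overwritten and renamed to Repayment_Probability above,
# so rebuild it from its original definition (1 = "Paid on Time").
paid_on_time = label_mappings["Repayment_Status"]["Paid on Time"]
df["Loan_Repaid"] = (df["Repayment_Status"] == paid_on_time).astype(int)

# Assumption: Repayment_Status defines the target, and Credit_Score and
# Repayment_Probability are computed from it, so all three are dropped to avoid
# leaking the label into the features.
X = df.drop(columns=["Loan_Repaid", "Repayment_Status", "Credit_Score", "Repayment_Probability"])
y = df["Loan_Repaid"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
# Classify as repaid when the predicted probability reaches 0.4 instead of the default 0.5.
threshold = 0.4
y_pred_adjusted = (y_proba >= threshold).astype(int)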

In [None]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_adjusted))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_adjusted))
print("ROC AUC Score:", round(roc_auc_score(y_test, y_proba), 3))
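
The effect of moving the decision threshold can be seen on a handful of made-up probabilities. The sketch below compares a 0.5 cut-off, which is effectively what `predict` applies for a binary classifier, with the 0.4 cut-off used for `y_pred_adjusted`; the probability values are illustrative only.

```python
import numpy as np

# Hypothetical predicted probabilities of repayment for five applicants.
y_proba_demo = np.array([0.15, 0.38, 0.42, 0.55, 0.90])

pred_default = (y_proba_demo >= 0.5).astype(int)  # conventional cut-off
pred_lowered = (y_proba_demo >= 0.4).astype(int)  # cut-off used in the notebook

print(pred_default.tolist())  # [0, 0, 0, 1, 1]
print(pred_lowered.tolist())  # [0, 0, 1, 1, 1]
```

Lowering the threshold labels more applicants as likely to repay, which raises recall for the positive class at the cost of more false positives. ROC AUC is computed from the probabilities themselves and does not change with the threshold.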

In [None]:
importances = model.feature_importances_
features = X.columns
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
sns.barplot(x=importances[indices], y=features[indices], hue=features[indices], palette="viridis", dodge=False, legend=False)
plt.title("Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

In [None]:
import xgboost as xgb
from sklearn.metrics import roc_auc_score

xgb_model = xgb.XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    scale_pos_weight=(len(y_train) - sum(y_train)) / sum(y_train),
    random_state=42
)

xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
y_proba_xgb = xgb_model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
print(f"XGBoost ROC AUC Score: {roc_auc_score(y_test, y_proba_xgb)}")

In [None]:
for i in X_train.columns:
  print(df[i].unique())