# WELCOME!

This project applies EDA to the development of predictive models. Handling outliers, applying domain knowledge, and engineering features will be the main challenges.

The project aims to improve your ability to implement the algorithms commonly used for multi-class classification problems.


# Determines

The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012 and comprises personnel representing the total US Army force, including the US Army Active Duty, Reserves, and National Guard. In addition to the **anthropometric and demographic data** described below, the ANSUR II database also consists of **3D whole body, foot, and head scans** of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and have many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain **93 anthropometric measurements which were directly measured, and 15 demographic/administrative** variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


DATA DICT:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pds

---

To achieve high prediction success, you must understand the data well and develop different approaches that can affect the dependent variable.

First, try to understand the dataset column by column using pandas. Research the domain (body measurements and race characteristics) to get to know the dataset as quickly as possible.

You will implement ***Logistic Regression, Support Vector Machine, XGBoost, Random Forest*** algorithms. Also, evaluate the success of your models with appropriate performance metrics.

At the end of the project, choose the most successful model, try to improve its scores with ***SMOTE***, and make it ready to deploy. Furthermore, use ***SHAP*** to explain how the best model you choose works.

# Tasks

#### 1. Exploratory Data Analysis (EDA)
- Import Libraries, Load Dataset, Exploring Data

    *i. Import Libraries*
    
    *ii. Ingest Data*
    
    *iii. Explore Data*
    
    *iv. Outlier Detection*
    
    *v.  Drop unnecessary features*

#### 2. Data Preprocessing
- Scale (if needed)
- Separate the data frame for evaluation purposes

#### 3. Multi-class Classification
- Import libraries
- Implement SVM Classifier
- Implement Decision Tree Classifier
- Implement Random Forest Classifier
- Implement XGBoost Classifier
- Compare the Models

#### 4. SMOTE
- Apply Imbalance Learning Techniques

#### 5. SHAP
- Apply Feature selection with SHAP


# EDA
- Drop unnecessary columns
- Drop any DODRace class with fewer than 500 observations (we assume the model cannot learn a class from fewer than 500 samples)
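
The class-count rule above can be sketched on a toy frame. The labels, counts, and the `min_count` value below are made up for illustration; the notebook itself uses a threshold of 500 on the real data.

```python
import pandas as pd

# Toy stand-in for the real frame; labels and counts are illustrative only.
toy = pd.DataFrame({"dodrace": ["White"] * 6 + ["Black"] * 4 + ["Asian"] * 2})

min_count = 3  # hypothetical threshold (500 in the notebook)

# Count rows per class, then keep only the classes that reach the threshold.
counts = toy["dodrace"].value_counts()
keep = counts[counts >= min_count].index
filtered = toy[toy["dodrace"].isin(keep)].reset_index(drop=True)

print(filtered["dodrace"].value_counts())
```

Deriving the kept classes from `value_counts()` avoids hard-coding the class names, so the filter still works if the threshold changes.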

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact

import xgboost
import sklearn
import shap
import imblearn

print("imblearn version:", imblearn.__version__)      # tested with 0.12.4
print("scikit-learn version:", sklearn.__version__)   # tested with 1.4.0
print("xgboost version:", xgboost.__version__)        # tested with 2.1.3
print("numpy version:", np.__version__)               # tested with 1.23.5
print("shap version:", shap.__version__)              # tested with 0.41.0
print("seaborn version:", sns.__version__)            # tested with 0.12.2

# These versions are known to work together; other combinations may raise errors.
# I worked in a uv environment, and it took a lot of time to set these up :)

In [None]:
plt.rcParams["figure.figsize"] = (7,4)
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

In [None]:

# 1) Read the CSV files (latin1 avoids encoding errors)

ds_female = pd.read_csv('fm.csv', encoding='latin1')
ds_male   = pd.read_csv('m.csv',   encoding='latin1')

# 2) Normalize column names to lowercase

ds_female.columns = ds_female.columns.str.lower()
ds_male.columns   = ds_male.columns.str.lower()


# 3) Concatenate the female and male frames
fm = pd.concat([ds_female, ds_male], ignore_index=True)

# Preview the combined frame
fm.head()

In [None]:
print(fm.shape)  # a bare `fm.shape` is not displayed unless it is the last expression in the cell
fm.info()

In [None]:

cols_f = ds_female.columns.tolist()
cols_m = ds_male.columns.tolist()

# Check that both files share the same column names, in the same order
print("Column names are the same:", cols_f == cols_m)


In [None]:
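# Drop the subject ID (column names were lowercased above; errors='ignore' skips any name that is absent)
fm.drop(columns=['SubjectId', 'subjectid'], inplace=True, errors='ignore')

# Confirm the remaining columns
print(fm.columns.tolist())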



In [None]:
# 1) Make sure fm's column names really are lowercase
print(fm.columns.tolist())

# 2) Inspect the 'dodrace' column
print("DODRace values and frequencies:")
print(fm['dodrace'].value_counts(dropna=False))

print("\nUnique DODRace categories:")
print(fm['dodrace'].unique())

In [None]:
fm["dodrace"].value_counts().plot(kind="pie", figsize=(6,3) )


In [None]:
missing_pct = fm.isnull().mean() * 100  
cols_to_drop_missing = missing_pct[missing_pct > 50].index.tolist()
cols_to_drop_missing

In [None]:
# Convert the DODRace codes into meaningful race labels

# 1) Define the code-to-label mapping
race_mapping = {
    1: 'White', 
    2: 'Black_or_African_American', 
    3: 'Hispanic', 
    4: 'Asian', 
    5: 'Native_American', 
    6: 'Pacific_Islander', 
    8: 'Other'
}

# 2) Add a new column that maps the numeric codes to labels
fm['dodrace_label'] = fm['dodrace'].map(race_mapping)

# 3) Show the first 14 rows to verify the conversion
fm.loc[:13, ['dodrace', 'dodrace_label']]

In [None]:
fm['dodrace_label'].value_counts(dropna=False)

In [None]:
# 1) Compute the count and percentage of missing values in each column
missing_count = fm.isnull().sum()
missing_pct   = fm.isnull().mean() * 100

# 2) Combine the results in a single DataFrame and sort by missing percentage
missing_ds = pd.DataFrame({
    'missing_count': missing_count,
    'missing_pct': missing_pct
}).sort_values('missing_pct', ascending=False)

# 3) Show the first 10 rows of the missing-data table
print("=== Missing Data Summary (Top 10) ===")
print(missing_ds.head(10))


In [None]:
# Drop the column with more than 50% missing values
fm.drop(columns=['ethnicity'], inplace=True)
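
The cell above names the column by hand. A minimal sketch of the same step driven by the missing-value percentage is shown here on a toy frame; the column values and the `threshold` variable are illustrative assumptions, not taken from the survey data.

```python
import numpy as np
import pandas as pd

# Toy frame: 'ethnicity' is 75% missing, 'gender' is 25% missing.
toy = pd.DataFrame({
    "stature": [1700, 1650, 1800, 1750],
    "ethnicity": [np.nan, np.nan, "Mexican", np.nan],
    "gender": ["Male", "Female", np.nan, "Male"],
})

threshold = 50  # percent of missing values above which a column is dropped

# Percentage of missing values per column, then the columns over the threshold.
missing_pct = toy.isnull().mean() * 100
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

cleaned = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```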


In [None]:

# Check: the dropped column should no longer appear in the missing-percentage list
print("Current missing percentages:\n", (fm.isnull().mean() * 100).head(10))

In [None]:
# Count the unique values of each object (categorical) feature


for col in fm.select_dtypes("object"):  # Iterate over object-type columns
    print(f"{col} has {fm[col].nunique()} unique values")

# These are the numbers of unique categories per categorical feature.
# We will drop the measurement date (date), the site where the measurements
# were taken (installation), and the soldier's specialty (primarymos),
# because they give no insight into race.

# Below we check whether the component a soldier serves in (component)
# and their branch (branch) are related to race
# (for example, whether some groups are over-represented in a branch).

In [None]:

"""date installation component branch primarymos subjectsbirthlocation writingpreference
"""


In [None]:
# Categorical variable distributions

# 1) Identify the categorical columns
categorical_cols = fm.select_dtypes(include=['object', 'category']).columns.tolist()

# 2) Print the frequency distribution of each categorical column
for col in categorical_cols:
    print(f"=== {col} distribution ===")
    print(fm[col].value_counts(dropna=False))

In [None]:
fm['subjectsbirthlocation'].value_counts()

In [None]:
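# Preview the frame without 'subjectnumericrace' (not in place; the column is dropped for good in a later cell)
fm.drop(columns=['subjectnumericrace']).head()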

In [None]:
fm.groupby(['component'])["dodrace"].value_counts(normalize=True)

# Race distribution by component:
# it is not discriminative.

In [None]:
fm.groupby(['component', "branch"])["dodrace_label"].value_counts(normalize=True)

# Little or no discriminative power
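
One way to make the "not discriminative" judgement less visual is to compare the class shares across groups. This is a minimal sketch on made-up rows; the `spread` measure is an illustrative choice, not something the notebook computes.

```python
import pandas as pd

# Toy rows: both components have exactly the same race mix.
toy = pd.DataFrame({
    "component": ["Regular Army"] * 4 + ["Army National Guard"] * 4,
    "dodrace": ["White", "White", "Black", "Hispanic"] * 2,
})

# Row-normalized crosstab: share of each race within each component.
shares = pd.crosstab(toy["component"], toy["dodrace"], normalize="index")

# Difference between the largest and smallest share of each race across components.
# Values near zero suggest the grouping column carries little information about race.
spread = shares.max() - shares.min()
print(spread)
```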


In [None]:
# Updated list of columns to drop 
drop_list2 = [
    "date",
    "installation",
    "component",
    "branch",
    "primarymos",
    "weightlbs",                # self-reported
    "heightin",                 # self-reported
    "subjectnumericrace",
]

# Drop the selected columns
fm.drop(columns=drop_list2, inplace=True)

# Notes:
# - Dropped columns include identifiers, self-reported values, and potential leakage sources


In [None]:
fm.shape

In [None]:
fm.sample(5)

In [None]:
fm.drop(columns=["dodrace"], inplace=True)

In [None]:
fm.shape

In [None]:
fm.dodrace_label.value_counts()

In [None]:
fm.rename(columns={"dodrace_label": "dodrace"}, inplace=True)


In [None]:
fm.dodrace.value_counts()

In [None]:
# Filter rows where 'dodrace' is one of the selected categories
# (.copy() makes ds an independent frame, so later column assignments do not hit a view of fm)
ds = fm[fm["dodrace"].isin(["White", "Black_or_African_American", "Hispanic"])].copy()

# Comment:
# We are keeping only individuals who are labeled as White, Black, or Hispanic in 'dodrace'.
# This subset (ds) will be used for a focused multi-class classification task.

In [None]:
ds.head()

In [None]:
ds.shape

In [None]:
ds.reset_index(drop=True, inplace=True)

In [None]:
ds.head()

In [None]:
# Function to detect realistic body proportion anomalies using named columns


def detect_body_proportion_anomalies(row):

    """
    Detects anatomical and proportional anomalies in anthropometric data.

    This function evaluates a set of domain-informed conditions based on real-world
    human body proportions using body measurement features. It highlights rows with
    suspicious or implausible values that may result from measurement errors or
    data entry mistakes.

    Anomalies are returned using color-coded styling for Jupyter Notebook visualization.

    Color codes:
        - 'red'    : Anatomically impossible (e.g., arm longer than body)
        - 'orange' : Proportional mismatch (e.g., leg ratio out of realistic bounds)
        - 'yellow' : Medical outlier (e.g., BMI extremely low or high)
        - ''       : Normal

    Parameters:
        row (pd.Series): A row of anthropometric features from the DataFrame.

    Returns:
        List[str]: A list of style strings (e.g., 'color: red') for each cell in the row.
    """
    results = [''] * len(row)

    # 🔴 1. Arm length >= shoulder height
    if row["acromionradialelength"] >= row["acromialheight"]:
        return ['color: red'] * len(row)

    # 🟠 2. Shoulder width vs foot length or ankle circumference anomaly
    if row["biacromialbreadth"] > (row["balloffootlength"] + (row["balloffootlength"] - row["axillaheight"]) * 1.5) or \
       row["anklecircumference"] < (row["axillaheight"] - (row["balloffootlength"] - row["axillaheight"])):
        return ['color: orange'] * len(row)

    # 🔴 3. Arm span (wingspan) > height by excessive amount (short person with long reach)
    # ANSUR II lengths are recorded in millimetres, so 150 cm / 180 cm become 1500 / 1800
    if row["stature"] < 1500 and row["span"] > 1800:
        return ['color: red'] * len(row)

    # 🔴 4. Total arm length (shoulder to hand) > body height
    total_arm = row["shoulderelbowlength"] + row["shoulderlength"] + row["handlength"]
    if total_arm > row["stature"]:
        return ['color: red'] * len(row)

    # 🟠 5. Leg/stature ratio too small or too large
    leg_ratio = row["functionalleglength"] / row["stature"]
    if leg_ratio < 0.3 or leg_ratio > 0.55:
        return ['color: orange'] * len(row)

    # 🟠 6. Head circumference too small or large (48-65 cm, expressed in millimetres)
    if row["headcircumference"] < 480 or row["headcircumference"] > 650:
        return ['color: orange'] * len(row)

    # 🔴 7. Waist > buttock circumference (unusual anatomy)
    if row["waistcircumference"] > row["buttockcircumference"]:
        return ['color: red'] * len(row)

    # 🟡 8. BMI too extreme (underweight or obese)
    # ANSUR II stores weightkg in tenths of a kilogram and stature in millimetres
    bmi = (row["weightkg"] / 10) / ((row["stature"] / 1000) ** 2)
    if bmi < 16 or bmi > 40:
        return ['color: yellow'] * len(row)

    return results

In [None]:
ds.style.apply(detect_body_proportion_anomalies, axis=1)
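
Styling every row of a large frame can be slow, so a vectorized mask may be a lighter way to count flagged rows. The sketch below covers only the BMI rule and assumes that `weightkg` is stored in tenths of a kilogram and `stature` in millimetres; it is worth confirming both against the data dictionary. The toy values and the `bmi_outlier` column name are illustrative.

```python
import pandas as pd

# Toy rows in the assumed ANSUR II units.
toy = pd.DataFrame({
    "weightkg": [815, 726, 1500],    # tenths of a kilogram (assumption)
    "stature": [1776, 1702, 1600],   # millimetres (assumption)
})

# Convert to kilograms and metres before applying the BMI formula.
weight_kg = toy["weightkg"] / 10
height_m = toy["stature"] / 1000
toy["bmi"] = weight_kg / height_m ** 2

# Same bounds as the row-wise rule: below 16 or above 40 is flagged.
toy["bmi_outlier"] = (toy["bmi"] < 16) | (toy["bmi"] > 40)
print(toy)
```

A boolean column like this can be summed to count anomalies or used to filter rows, which is harder to do with styled output.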

In [None]:

# Keep numeric columns that contain at least one non-NaN value
numerical_cols = [
    col for col in ds.select_dtypes(include='number').columns
    if ds[col].dropna().shape[0] > 0 and pd.api.types.is_numeric_dtype(ds[col])
]

# Grid layout
n_cols = 3
n_rows = (len(numerical_cols) + n_cols - 1) // n_cols

# Figure size
plt.figure(figsize=(18, n_rows * 5))

# One boxplot per numeric variable, grouped by DODRace
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.boxplot(data=ds, x="dodrace", y=col, palette="Set2")
    plt.title(f"{col} by DODRace")
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


In [None]:
# Interactive pairwise relationship plot for numeric columns

# Function to visualize the joint distribution of two numeric columns by race
def column_pair(col1, col2):
    sns.jointplot(
        data=ds,
        x=col1,
        y=col2,
        kind="hist",                      # You can change to "kde", "scatter", etc.
        hue="dodrace",                    # Colour separation by dodrace
        palette='Dark2',
        height=5,
        marginal_kws={"bins": 20}         # Bin count for marginal histograms
    )

# Select only numeric columns for the dropdown menus
cols = ds.select_dtypes(exclude="object").columns

# Create interactive dropdowns for any numeric column pair
interact(column_pair, col1=cols, col2=cols)

In [None]:
# Compute correlation matrix of numerical features
corr_matrix = ds.select_dtypes(include='number').corr()

# Plot heatmap
plt.figure(figsize=(18, 14))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.tight_layout()
plt.show()

**Multicollinearity is not a problem for regularized logistic regression or for non-parametric algorithms.**
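
A heatmap with this many features is hard to read, so listing the strongly correlated pairs can complement it. This is a minimal sketch on synthetic columns; the column names are borrowed for readability only and the 0.9 cut-off is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic data: 'span' is nearly a copy of 'stature', 'earlength' is unrelated.
toy = pd.DataFrame({
    "stature": base,
    "span": base + rng.normal(scale=0.05, size=200),
    "earlength": rng.normal(size=200),
})

corr = toy.corr().abs()
cols = list(corr.columns)

# Walk the upper triangle so each pair is reported once.
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.9
]
print(high_pairs)
```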

In [None]:
ds.writingpreference.value_counts()

In [None]:
# Step 1: Replace 'Either hand' with 'Right hand'
ds['writingpreference'] = ds['writingpreference'].replace(
    'Either hand (No preference)', 'Right hand')

In [None]:
ds.writingpreference.value_counts()

In [None]:
ds.select_dtypes(include=["object", "category"]).columns


In [None]:
# Group US states by Census region

us_region_map = {
    'Northeast': ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont',
                  'New Jersey', 'New York', 'Pennsylvania'],
    'Midwest': ['Indiana', 'Illinois', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota',
                'Missouri', 'Nebraska', 'North Dakota', 'South Dakota'],
    'South': ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia',
              'District of Columbia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee',
              'Arkansas', 'Louisiana', 'Oklahoma', 'Texas'],
    'West': ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming',
             'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']
}

# State → region mapping
state_to_region = {}
for region, states in us_region_map.items():
    for state in states:
        state_to_region[state] = f"US_{region}"

# Foreign locations that occur frequently in the data (chosen from the value_counts output)
foreign_specific = ['Germany', 'Puerto Rico', 'Mexico', 'Jamaica']

# Mapping function (no rows are removed; it only produces a category)
def map_birth_location(loc):
    if loc in state_to_region:
        return state_to_region[loc]
    elif loc in foreign_specific:
        return f"Foreign_{loc}"
    elif pd.isna(loc):
        return "Missing"
    else:
        return "Foreign_Other"

# Assign the new column (existing columns are untouched; only a column is added)
ds["birth_region_grouped"] = ds["subjectsbirthlocation"].apply(map_birth_location)

In [None]:
ds["birth_region_grouped"].value_counts(dropna=True)

In [None]:
ds.subjectsbirthlocation.value_counts()

**Drop `subjectsbirthlocation`**

In [None]:
ds.drop(columns=["subjectsbirthlocation"], inplace=True)


In [None]:
ds.info()

## Import Libraries
Besides Numpy and Pandas, you need to import the necessary modules for data visualization, data preprocessing, Model building and tuning.

*Note: Check out the course materials.*

## Ingest Data


## Explore Data

# DATA Preprocessing
- In this step we divide our data into X (features) and y (target).
- For training and evaluation purposes, we create train and test sets.
- Lastly, we scale the data if the features are not on the same scale. Why?


**Hispanic is the minority class, and predicting this minority class is our focus.**
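
On the scaling step: distance-based and regularized models are sensitive to feature ranges, and the scaler should learn its statistics from the training rows only so that no information from the test set leaks in. A minimal sketch with made-up numbers follows; the array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (illustrative values).
X_train_toy = np.array([[1500.0, 40.0], [1700.0, 60.0], [1900.0, 80.0]])
X_test_toy = np.array([[2000.0, 50.0]])

# Learn min and max from the training rows only.
scaler = MinMaxScaler().fit(X_train_toy)

X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)  # may fall outside [0, 1]

print(X_test_scaled)
```

Putting the scaler inside a `Pipeline`, as the later cells do, gives the same train-only behaviour automatically during cross-validation.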


In [None]:
from sklearn.model_selection import train_test_split

# Step 1: Separate features (X) and target (y)


X = ds.drop(columns=["dodrace"])  # drop target column from features
y = ds["dodrace"]                 # define target variable


In [None]:
# Step 2: Split into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

# - stratify=y       preserves the class distribution in both train and test
# - random_state     ensures reproducibility
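
The effect of `stratify` can be checked on a toy target. The class sizes below are invented so that the proportions divide evenly; the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 70% / 20% / 10%.
y_toy = pd.Series(["White"] * 70 + ["Black"] * 20 + ["Hispanic"] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)

# The 20-row test set keeps the same 70/20/10 proportions.
print(y_te.value_counts())
```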

In [None]:
# Print the shape of train and test sets
print("X_train shape:", X_train.shape)
print("X_test shape :", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape :", y_test.shape)


# Modelling
- Fit the model on the training set
- Get predictions from the vanilla model on both the train and test sets to check for over/underfitting
- Apply GridSearchCV for both hyperparameter tuning and a sanity check of the model
- Use the hyperparameters found by the grid search to make the final predictions, and evaluate the result with the chosen metric
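
The steps above can be sketched end to end on synthetic data. Everything here (the data from `make_classification`, the single-parameter grid, the variable names) is an illustrative assumption; the notebook's own pipelines use the real features and a Hispanic-specific scorer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class problem standing in for the real data.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=101, stratify=y_syn
)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("log", LogisticRegression(max_iter=1000)),
])

# 1) Vanilla fit: a large gap between train and test scores hints at overfitting.
pipe.fit(X_tr, y_tr)
print("train:", pipe.score(X_tr, y_tr), "test:", pipe.score(X_te, y_te))

# 2) Grid search: step-prefixed parameter names address the pipeline step.
grid = GridSearchCV(pipe, {"log__C": [0.1, 1, 10]}, cv=5, return_train_score=True)
grid.fit(X_tr, y_tr)

# 3) Final evaluation with the refitted best estimator.
tuned_test_score = grid.score(X_te, y_te)
print(grid.best_params_, tuned_test_score)
```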

In [None]:
from sklearn.metrics import confusion_matrix, classification_report



# Evaluation metric function to compare model performance on training and test sets
def eval_metric(model, X_train, y_train, X_test, y_test):
    
   
    y_train_pred = model.predict(X_train)  # Predict on training set
     

    y_pred = model.predict(X_test)     # Predict on test set

    # --- Evaluation on Test Set ---
    print("Test_Set")
    print(confusion_matrix(y_test, y_pred))  # Confusion matrix for test set
    print(classification_report(y_test, y_pred))  # Classification report for test set

    print() 

    # --- Evaluation on Train Set ---
    print("Train_Set")
    print(confusion_matrix(y_train, y_train_pred))  # Confusion matrix for train set
    print(classification_report(y_train, y_train_pred))  # Classification report for train set



In [None]:

# Categoric columns list

cat = X_train.select_dtypes("object").columns
cat


In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


# Column transformer
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat),  #
    remainder=MinMaxScaler(),
    verbose_feature_names_out=False
)

# Pipeline steps

operations = [
    ("OneHotEncoder", column_trans),
    ("log", LogisticRegression(class_weight="balanced", max_iter=10000, random_state=101))  # the data is imbalanced, hence class_weight="balanced"
]

# Pipeline
pipe_log_model = Pipeline(steps=operations)



## 1. Logistic model

### Vanilla Logistic Model

In [None]:
# Fit model
pipe_log_model.fit(X_train, y_train)

# Evaluate
eval_metric(pipe_log_model, X_train, y_train, X_test, y_test)

In [None]:

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

# Recall for Hispanic
def recall_Hispanic(y_true, y_pred):
    return recall_score(y_true, y_pred, average=None, labels=["Hispanic"])[0]

# Precision for Hispanic
def precision_Hispanic(y_true, y_pred):
    return precision_score(y_true, y_pred, average=None, labels=["Hispanic"])[0]

# F1-score for Hispanic
def f1_Hispanic(y_true, y_pred):
    return f1_score(y_true, y_pred, average=None, labels=["Hispanic"])[0]

# We take [0] because recall_score, precision_score and f1_score return an array
# when average=None, while make_scorer needs a single number.


# Wrap with make_scorer for use in model evaluation or GridSearchCV

f1_Hispanic = make_scorer(f1_Hispanic)
precision_Hispanic = make_scorer(precision_Hispanic)
recall_Hispanic = make_scorer(recall_Hispanic)

# response_method only needs to be set for metrics that rely on predict_proba or decision_function (e.g. ROC AUC).
# recall_score, precision_score and f1_score work on the output of predict, so nothing extra is needed here.

# Scoring dictionary
scoring = {
    "precision": precision_Hispanic,
    "recall": recall_Hispanic,
    "f1": f1_Hispanic
}

# In multiclass data, you can get CV scores based on whatever your target label is.
# Again, we have to use the make_scorer function. When the data is multiclass,
# the average, and labels parameters must be specified in the make_scorer function.
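
As an alternative to wrapper functions, the `labels` and `average` arguments can be passed straight through `make_scorer`. The sketch below relies on the fact that restricting `labels` to one class and averaging over that single label returns a plain float. The toy label lists and the name `recall_hispanic_scorer` are hypothetical and are not used elsewhere in this notebook.

```python
from sklearn.metrics import make_scorer, recall_score

y_true = ["Hispanic", "Hispanic", "White", "Black", "Hispanic", "White"]
y_pred = ["Hispanic", "White", "White", "Black", "Hispanic", "Hispanic"]

# Per-class array, indexed as in the wrapper functions above.
via_index = recall_score(y_true, y_pred, average=None, labels=["Hispanic"])[0]

# Single label + macro average gives the same number as a float.
recall_hispanic_value = recall_score(
    y_true, y_pred, labels=["Hispanic"], average="macro"
)

# Extra keyword arguments to make_scorer are forwarded to the metric.
recall_hispanic_scorer = make_scorer(
    recall_score, labels=["Hispanic"], average="macro"
)

print(via_index, recall_hispanic_value)
```

Two of the three true Hispanic rows are predicted correctly here, so both values equal 2/3.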



In [None]:
from sklearn.model_selection import cross_validate


operations = [
    ("OneHotEncoder", column_trans),
    ("log", LogisticRegression(class_weight="balanced", max_iter=10000, random_state=101))
]

model = Pipeline(steps=operations)

# Compute the Hispanic-specific metrics with 10-fold cross-validation
scores = cross_validate(
    model,
    X_train,
    y_train,
    scoring=scoring,         # scoring dict: precision, recall, f1 (Hispanic)
    cv=10,
    n_jobs=-1,
    return_train_score=True
)

# Keep the scores in a DataFrame
df_scores = pd.DataFrame(scores, index=range(1, 11))

# Show the mean test and train scores (the first two entries, fit_time and score_time, are skipped)
df_scores.mean()[2:]


**The scores are low and need to be improved.**
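
When reading `cross_validate` output, selecting columns by name can be less fragile than positional slicing. This is a minimal sketch on synthetic data with built-in macro scorers; the data and scorer names are illustrative and differ from the Hispanic-specific scorers used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3,
    n_clusters_per_class=1, random_state=101,
)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_syn, y_syn,
    scoring={"recall": "recall_macro", "f1": "f1_macro"},
    cv=5,
    return_train_score=True,
)
df_scores = pd.DataFrame(scores)

# Result keys are prefixed with 'test_' and 'train_', so they can be filtered by name.
test_means = df_scores.filter(like="test_").mean()
train_means = df_scores.filter(like="train_").mean()
print(test_means)
print(train_means)
```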

### Logistic Model GridsearchCV

In [None]:

param_grid = {
    "log__C": [0.01, 0.1, 0.5, 1, 5, 10],  # Regularization strength
    "log__penalty": ["l1", "l2"],         # Regularization type
    "log__solver": ["liblinear", "saga"]  # Solvers that support both l1 and l2
}

In [None]:
from sklearn.model_selection import GridSearchCV

# Pipeline steps
operations = [
    ("OneHotEncoder", column_trans),
    ("log", LogisticRegression(
        class_weight="balanced",
        max_iter=10000,
        random_state=101)),
]

model = Pipeline(steps=operations)

# GridSearchCV setup
log_model_grid = GridSearchCV(
    model,
    param_grid,
    scoring=recall_Hispanic,  # tune the grid search to maximize this score
    cv=10,
    n_jobs=-1,
    return_train_score=True
)

In [None]:
log_model_grid.fit(X_train, y_train)

In [None]:
log_model_grid.best_estimator_

In [None]:
# Best model's mean test and train scores
pd.DataFrame(log_model_grid.cv_results_).loc[
    log_model_grid.best_index_, 
    ["mean_test_score", "mean_train_score"]
]

# for RECALL

In [None]:
eval_metric(log_model_grid, X_train, y_train, X_test, y_test)

In [None]:


operations = [
    ("OneHotEncoder", column_trans),
    (
        "log",
        LogisticRegression(
            class_weight="balanced",
            max_iter=10000,
            random_state=101,
        ),
    ),
]

model = Pipeline(steps=operations)

# Fit the model
model.fit(X_train, y_train)

# Predict class probabilities
y_pred_proba = model.predict_proba(X_test)




In [None]:
# Precision-recall curves, because the data is imbalanced


from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Get the predicted probabilities
y_scores = model.predict_proba(X_test)

# Class names, in the order used by the model
class_names = model.classes_

# Target classes
target_classes = ["Hispanic", "Black_or_African_American", "White"]

# Start the figure
plt.figure(figsize=(8, 6))

# Compute and plot the PR curve and its AUC for each target class
for cls in target_classes:
    cls_index = list(class_names).index(cls)
    
    # Binarize y_test
    y_true_binary = (y_test == cls).astype(int)
    
    # Compute precision and recall
    precision, recall, _ = precision_recall_curve(y_true_binary, y_scores[:, cls_index])
    
    # AUC (average precision score)
    auc_score = average_precision_score(y_true_binary, y_scores[:, cls_index])
    
    # Plot the curve and add the AUC to the label
    plt.plot(recall, precision, label=f"{cls} (AUC = {auc_score:.2f})")

# Format the plot
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve with AUC")
plt.legend()
plt.grid(True)
plt.show()



In [None]:
# We can't use the average_precision_score function with the y_test variable because it's not binary
from sklearn.metrics import average_precision_score

y_test_dummies = pd.get_dummies(y_test).values  # we do that for the sake of the average_precision_score function

average_precision_score(y_test_dummies[:, 1], y_pred_proba[:, 1])

# Returns 0: black, 1: hispanic, 2: white scores.
# We got hispanic scores by specifying 1 here.

In [None]:
y_pred = log_model_grid.predict(X_test)

log_AP = average_precision_score(y_test_dummies[:, 1], y_pred_proba[:, 1])
log_f1 = f1_score(y_test, y_pred, average=None, labels=["Hispanic"])
log_recall = recall_score(y_test, y_pred, average=None, labels=["Hispanic"])

# Since we will compare the scores we got from all models in the table below,
# we assign model scores to the variables.

**Logistic regression with the `liblinear` solver, which suits small datasets**

In [None]:
operations = [
    ("OneHotEncoder", column_trans),
    (
        "log",
        LogisticRegression(
            class_weight="balanced",
            max_iter=10000,
            random_state=101,
            solver="liblinear",
            penalty="l1"
            
        ),
    ),
]

pipelogmodellibl = Pipeline(steps=operations)

# Fit the model
pipelogmodellibl.fit(X_train, y_train)

eval_metric(pipelogmodellibl, X_train, y_train, X_test, y_test)

In [None]:
operations = [
    ("OneHotEncoder", column_trans),
    (
        "log",
        LogisticRegression(
            class_weight="balanced",
            max_iter=10000,
            random_state=101,
            solver="liblinear",
            penalty="l1"
        ),
    ),
]

model = Pipeline(steps=operations)

# Perform cross-validation and collect Hispanic-specific metrics
scores = cross_validate(
    model, X_train, y_train,
    scoring=scoring,         # scoring dict: precision, recall, f1 (for "Hispanic" class)
    cv=10,
    n_jobs=-1,
    return_train_score=True
)

# Convert results to DataFrame and calculate mean of test/train scores (excluding fit/time columns)
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]


## 2. SVC

### Vanilla SVC model

In [None]:
from sklearn.svm import SVC  # SVC (Support Vector Classification)


operations_svc = [
    ("OneHotEncoder", column_trans),
    ("svc", SVC(class_weight="balanced", random_state=101)),
]

pipe_svc_model = Pipeline(steps=operations_svc)


In [None]:

# Fit the SVC model
pipe_svc_model.fit(X_train, y_train)

# Evaluate the model using custom evaluation function
eval_metric(pipe_svc_model, X_train, y_train, X_test, y_test)



In [None]:
model = Pipeline(steps=operations_svc)

scores = cross_validate(
    model, X_train, y_train, scoring=scoring, cv=10, n_jobs=-1, return_train_score=True
)

df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]

###  SVC Model GridsearchCV

In [None]:
# param_grid = {
#     "svc__C": [0.01, 0.1, 0.5, 1, 10, 50, 100],                 # wider range for C
#     "svc__gamma": ["scale", "auto", 0.001, 0.01, 0.1, 1],       # log-scaled variety for gamma
#     "svc__kernel": ["rbf", "poly", "sigmoid"]                  # different kernel options
# }
# The grid above was too heavy for my computer.

param_grid = {
    "svc__C": [0.1, 1, 10],                         # Avoid extreme values like 0.01 or 100
    "svc__gamma": ["scale", 0.01, 0.1],             # Keep most effective gamma range
    "svc__kernel": ["rbf", "poly"]                  # Drop 'sigmoid' – usually underperforms
}



"""C controls the regularization strength (lower C → more regularization),

gamma controls how complex the decision boundaries can become,

kernel makes it possible to model non-linear structure better."""


In [None]:
 
operations_svc = [
    ("OneHotEncoder", column_trans),
    ("svc", SVC(class_weight=None, random_state=101)),
]

model = Pipeline(steps=operations_svc)

svm_model_grid = GridSearchCV(
    model,
    param_grid,
    scoring=recall_Hispanic,
    cv=10,
    n_jobs=-1,
    return_train_score=True,
)

In [None]:
svm_model_grid.fit(X_train, y_train)

In [None]:
# Get best cross-validation scores (test/train) from GridSearchCV result
pd.DataFrame(svm_model_grid.cv_results_).loc[
    svm_model_grid.best_index_, ["mean_test_score", "mean_train_score"]
]


In [None]:

# Evaluate the best SVM model on train and test data
eval_metric(svm_model_grid, X_train, y_train, X_test, y_test)


In [None]:
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

# Binarize y_test to one-hot format
# (use the fitted grid search object; `model` itself was never fitted, GridSearchCV fits clones of it)
y_test_bin = label_binarize(y_test, classes=svm_model_grid.classes_)

# Get decision scores from the best SVC found by the grid search
decision_function = svm_model_grid.decision_function(X_test)

# decision_function returns, for each class, the margin distance (distance to the decision boundary).
# These scores are valid input for precision_recall_curve,
# because it builds precision-recall pairs by thresholding the ranked scores.


# Class names for labeling
class_names = svm_model_grid.classes_

# Plot setup
plt.figure(figsize=(10, 7))

# Loop through each class
for i, class_name in enumerate(class_names):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], decision_function[:, i])
    ap_score = average_precision_score(y_test_bin[:, i], decision_function[:, i])
    
    # Plot precision-recall curve
    plt.plot(recall, precision, label=f"{class_name} (AP={ap_score:.2f})")

# Plot formatting
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curves for All Classes")
plt.legend()
plt.grid()
plt.show()



In [None]:
decision_function

In [None]:
svm_model_grid.classes_

In [None]:
average_precision_score(y_test_dummies[:,1], decision_function[:,1])

In [None]:
# Predict class labels for the test set
y_pred = svm_model_grid.predict(X_test)

# Compute Average Precision (AUC) for the Hispanic class using decision scores
svc_AP = average_precision_score(y_test_dummies[:, 1], decision_function[:, 1])

# Compute F1-score for the Hispanic class
svc_f1 = f1_score(y_test, y_pred, average=None, labels=["Hispanic"])

# Compute Recall for the Hispanic class
svc_recall = recall_score(y_test, y_pred, average=None, labels=["Hispanic"])

## 3. RF

In [None]:
cat

# Use ordinal encoding instead of one-hot encoding for Random Forest

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Define OrdinalEncoder to handle unknown categories by assigning them -1
ord_enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Create a column transformer: encode categorical columns, passthrough numeric ones
column_trans = make_column_transformer((ord_enc, cat), remainder="passthrough")
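
How the `handle_unknown="use_encoded_value"` setting behaves can be seen on a tiny example. The category values below are made up, and the array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Categories seen during fitting.
train_cats = np.array([["Male"], ["Female"], ["Male"]], dtype=object)

# 'Unknown' never appeared in the training data.
test_cats = np.array([["Female"], ["Unknown"]], dtype=object)

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_cats)

# Known categories get their sorted position; unseen ones get -1.
encoded = enc.transform(test_cats)
print(enc.categories_)
print(encoded)
```

Tree-based models split on thresholds, so an integer code per category is usually enough for them and keeps the feature count lower than one-hot encoding would.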



### Vanilla RF Model

In [None]:
from sklearn.ensemble import RandomForestClassifier 
operations_rf = [
    ("OrdinalEncoder", column_trans),
    ("RF_model", RandomForestClassifier(class_weight=None, random_state=101)),
]

pipe_model_rf = Pipeline(steps=operations_rf)

pipe_model_rf.fit(X_train, y_train)

In [None]:
eval_metric(pipe_model_rf, X_train, y_train, X_test, y_test)

In [None]:
# The vanilla model performs terribly :)
# Both the initial fit and the CV version overfit.


In [None]:
operations_rf = [
    ("OrdinalEncoder", column_trans),
    ("RF_model", RandomForestClassifier(class_weight="balanced", random_state=101)),
]

model = Pipeline(steps=operations_rf)

# 5-fold cross-validation using custom Hispanic-focused metrics
scores = cross_validate(
    model, X_train, y_train, scoring=scoring, cv=5, n_jobs=-1, return_train_score=True
)

# Wrap results in a DataFrame and calculate mean scores (skip fit_time and score_time)
df_scores = pd.DataFrame(scores, index=range(1, 6))
df_scores.mean()[2:]  # Only show scoring metrics


### RF Model GridSearchCV (try the search space)

In [None]:
# Define the hyperparameter grid for Random Forest
param_grid = {
    "RF_model__n_estimators": [100, 200, 300, 400, 500],  # Number of trees in the forest
    "RF_model__max_depth": [3, 5, 7, None],         # Maximum depth of the tree
    "RF_model__min_samples_split": [2, 5, 10],
    "RF_model__max_features": ['sqrt', 'log2', None]
    
    # "RF_model__min_samples_split": [18, 20, 22],  # (optional) Minimum samples to split a node
    # "RF_model__max_features": ['auto', None, 15, 20]  # (optional) Number of features considered for split
}



# Pipeline steps: Encoding + Random Forest
operations_rf = [
    ("OrdinalEncoder", column_trans),  # Encoding step
    ("RF_model", RandomForestClassifier(class_weight="balanced", random_state=101)),  # RF classifier
]


In [None]:

# Build the pipeline
model = Pipeline(steps=operations_rf)

# Grid search with custom recall scorer for the Hispanic class
rf_grid_model = GridSearchCV(
    model,
    param_grid,
    scoring=recall_Hispanic,
    n_jobs=-1,
    return_train_score=True
)
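Before launching the search it can help to estimate its cost. The sketch below counts the candidates in the Random Forest grid with `ParameterGrid` and multiplies by the number of folds. It assumes the default 5-fold CV that `GridSearchCV` uses when `cv` is not passed; the name `rf_param_grid_demo` is only used here so the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid above, repeated for a standalone run
rf_param_grid_demo = {
    "RF_model__n_estimators": [100, 200, 300, 400, 500],
    "RF_model__max_depth": [3, 5, 7, None],
    "RF_model__min_samples_split": [2, 5, 10],
    "RF_model__max_features": ["sqrt", "log2", None],
}

n_candidates = len(ParameterGrid(rf_param_grid_demo))  # 5 * 4 * 3 * 3
n_folds = 5  # assumption: GridSearchCV default when cv is not given
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

If the number of fits is too large for the available time, `RandomizedSearchCV` or a coarser grid is a common alternative.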


In [None]:
rf_grid_model.fit(X_train, y_train)

In [None]:
rf_grid_model.best_estimator_

In [None]:
rf_grid_model.best_params_

In [None]:
# Extract the best mean test and train scores from GridSearchCV results
pd.DataFrame(rf_grid_model.cv_results_).loc[
    rf_grid_model.best_index_, ["mean_test_score", "mean_train_score"]
]

# The scores above were computed with scoring=recall_Hispanic

In [None]:
rf_grid_model.best_score_

In [None]:
eval_metric(rf_grid_model, X_train, y_train, X_test, y_test)

In [None]:
# Random forest gave very poor results in this form.

In [None]:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve, average_precision_score, auc
import matplotlib.pyplot as plt

# Binarize the labels for multiclass precision-recall curves
y_test_bin = label_binarize(y_test, classes=rf_grid_model.classes_)  # shape: (n_samples, n_classes)
y_score = rf_grid_model.predict_proba(X_test)  # shape: (n_samples, n_classes)

n_classes = y_test_bin.shape[1]
colors = ['red', 'green', 'blue']

plt.figure(figsize=(10, 7))

# Plot Precision-Recall curve for each class
for i in range(n_classes):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])
    auc_pr = auc(recall, precision)

    plt.plot(recall, precision, lw=2, color=colors[i],
             label=f"{rf_grid_model.classes_[i]} (AUC = {auc_pr:.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve for Each Class")
plt.legend()
plt.grid()
plt.show()
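The plot above labels each curve with the trapezoidal area from `auc(recall, precision)`, while the summary cells use `average_precision_score`. The two are related but not identical, as this small sketch on made-up scores illustrates (the labels and scores are illustrative, not model output).

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy binary problem: 1 = positive class, scores are made-up probabilities
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

precision_demo, recall_demo, _ = precision_recall_curve(y_true_demo, y_score_demo)

# Step-wise sum: each recall increment is weighted by the precision at that threshold
ap_demo = average_precision_score(y_true_demo, y_score_demo)

# Trapezoidal rule: linearly interpolates between points on the curve
auc_pr_demo = auc(recall_demo, precision_demo)

print(f"AP = {ap_demo:.4f}, trapezoidal AUC-PR = {auc_pr_demo:.4f}")
```

When comparing models, it is safer to use the same one of these two numbers everywhere, since linear interpolation between precision-recall points can shift the value.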


In [None]:
# Predict class labels for X_test
y_pred = rf_grid_model.predict(X_test)

# Calculate Average Precision Score for Hispanic class (index 1)
# using the Random Forest grid model's own predicted probabilities
rf_AP = average_precision_score(
    y_test_dummies[:, 1], rf_grid_model.predict_proba(X_test)[:, 1]
)

# Calculate F1-score for Hispanic class
rf_f1 = f1_score(y_test, y_pred, average=None, labels=["Hispanic"])

# Calculate Recall score for Hispanic class
rf_recall = recall_score(y_test, y_pred, average=None, labels=["Hispanic"])


## 4. XGBoost

### Vanilla XGBoost Model

In [None]:
from xgboost import XGBClassifier


operations_xgb = [
    ("OrdinalEncoder", column_trans),
    ("XGB_model", XGBClassifier(random_state=101, use_label_encoder=False)),
]

pipe_model_xgb = Pipeline(steps=operations_xgb)

# Map the labels in alphabetical order so the class order matches classification_report.
y_train_xgb = y_train.map({
    "Black_or_African_American": 0,
    "Hispanic": 1,
    "White": 2
})

y_test_xgb = y_test.map({
    "Black_or_African_American": 0,
    "Hispanic": 1,
    "White": 2
})

# If the target is not numeric in xgb 1.6 and higher versions, it returns an error.
# That's why we do the conversion manually.

pipe_model_xgb.fit(X_train, y_train_xgb)
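As an alternative to the hand-written mapping, `LabelEncoder` produces the same alphabetical codes and can also convert predictions back to the original names. This is a sketch on a toy series, not the notebook's `y_train`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target with the three class names used in this notebook
y_demo = pd.Series(["White", "Hispanic", "Black_or_African_American", "White"])

le_demo = LabelEncoder()
y_demo_encoded = le_demo.fit_transform(y_demo)

# classes_ is sorted, so position i is the integer code of class i
label_mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(label_mapping)

# inverse_transform recovers the original string labels from integer codes
y_demo_decoded = le_demo.inverse_transform(y_demo_encoded)
print(list(y_demo_decoded))
```

Keeping the fitted encoder around makes it easier to report XGBoost predictions with readable class names.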


In [None]:
eval_metric(pipe_model_xgb, X_train, y_train_xgb, X_test, y_test_xgb)

In [None]:
from sklearn.utils import class_weight

# Compute class weights for multi-class targets manually
classes_weights = class_weight.compute_sample_weight(
    class_weight="balanced",  # Automatically compute balanced weights
    y=y_train_xgb              # Target values must be numeric for XGBoost
)

classes_weights

In [None]:
my_dict = {
    "weights": classes_weights,  # Computed class/sample weights
    "label": y_train_xgb         # Corresponding encoded target labels
}

# Combine weights and labels into a DataFrame for inspection
comp = pd.DataFrame(my_dict)

# Display the first few rows
comp.head()

In [None]:
a = comp.groupby('label').value_counts()
a

In [None]:
# Fit the XGBoost model using instance-level weights
pipe_model_xgb.fit(
    X_train,
    y_train_xgb,
    XGB_model__sample_weight=classes_weights  # Pass instance weights to XGBClassifier step
)

# Note the double underscore in the parameter name: XGB_model__sample_weight



# XGBoost accepts sample weights per instance, not per class.
# So we compute balanced weights and assign them to each sample.
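The "balanced" weights follow the formula `n_samples / (n_classes * count_of_that_class)`. The sketch below checks this on a small made-up target; `y_weight_demo` is illustrative and not the notebook's `y_train_xgb`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy encoded target: class 0 is common, class 1 is rare
y_weight_demo = np.array([0, 0, 0, 0, 1, 2, 2, 2])

weights_demo = compute_sample_weight(class_weight="balanced", y=y_weight_demo)

# Manual version of the same formula, applied per sample
class_counts = np.bincount(y_weight_demo)
manual_weights = len(y_weight_demo) / (len(class_counts) * class_counts[y_weight_demo])

print(weights_demo)
print(np.allclose(weights_demo, manual_weights))

# With balanced weights, every class contributes the same total weight
totals_per_class = np.bincount(y_weight_demo, weights=weights_demo)
print(totals_per_class)
```

The rare class receives the largest per-sample weight, which is what pushes the booster to pay more attention to it.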



In [None]:
eval_metric(pipe_model_xgb, X_train, y_train_xgb, X_test, y_test_xgb)

In [None]:
# Define scoring functions to evaluate model performance specifically on the Hispanic class (label = 1)

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

# Define raw metric functions
def recall_Hispanic_raw(y_true, y_pred):
    return recall_score(y_true, y_pred, average=None, labels=[1])

def precision_Hispanic_raw(y_true, y_pred):
    return precision_score(y_true, y_pred, average=None, labels=[1])

def f1_Hispanic_raw(y_true, y_pred):
    return f1_score(y_true, y_pred, average=None, labels=[1])

# Wrap them without response_method
recall_Hispanic = make_scorer(recall_Hispanic_raw)
precision_Hispanic = make_scorer(precision_Hispanic_raw)
f1_Hispanic = make_scorer(f1_Hispanic_raw)

scoring_xgb = {
    "precision": precision_Hispanic,
    "recall":    recall_Hispanic,
    "f1":        f1_Hispanic
}


# Note: label = 1 corresponds to "Hispanic" after mapping, which is why we use labels=[1]
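To make the per-class metrics concrete, here is a small worked example with made-up labels. With `labels=[1]` and `average=None`, each function returns a one-element array holding the score for class 1 only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy encoded labels (0 = Black, 1 = Hispanic, 2 = White), illustrative only
y_true_demo = [0, 1, 1, 2, 1, 2]
y_pred_demo = [0, 1, 0, 2, 1, 1]

# Three true class-1 samples, two of them predicted as 1
recall_1 = recall_score(y_true_demo, y_pred_demo, average=None, labels=[1])[0]

# Three samples predicted as 1, two of them truly class 1
precision_1 = precision_score(y_true_demo, y_pred_demo, average=None, labels=[1])[0]

f1_1 = f1_score(y_true_demo, y_pred_demo, average=None, labels=[1])[0]

print(recall_1, precision_1, f1_1)
```

Indexing with `[0]` turns the one-element array into a plain float, which is convenient when the value is stored in a comparison table.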


In [None]:
# Define pipeline steps for XGBoost classifier
operations_xgb = [
    ("OrdinalEncoder", column_trans),  # Categorical encoding step
    ("XGB_model", XGBClassifier(random_state=101, use_label_encoder=False)),  # XGBoost classifier
]

# Create pipeline with defined steps
model = Pipeline(steps=operations_xgb)

# Perform 5-fold cross-validation with custom scoring metrics and instance-level weights
scores = cross_validate(
    model,
    X_train,
    y_train_xgb,                         # Encoded target values (Black:0, Hispanic:1, White:2)
    scoring=scoring_xgb,                # scoring_xgb: custom scorers for label 1
    cv=5,
    n_jobs=-1,
    return_train_score=True,
    fit_params={"XGB_model__sample_weight": classes_weights},  # instance-level sample weights
)

# Convert results to DataFrame and view average performance metrics
df_scores = pd.DataFrame(scores, index=range(1, 6))
df_scores.mean()[2:]  # Skip the fit times etc., and display averaged test/train metrics



In [None]:
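# Note: xgb_grid_model is defined and fitted in the GridSearchCV section below;
# run this cell only after that fit has completed.
eval_metric(xgb_grid_model, X_train, y_train_xgb, X_test, y_test_xgb)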

### XGBoost Model GridsearchCV

In [None]:
param_grid = {
    "XGB_model__n_estimators": [20, 40],      # <step name>__<parameter>, with a double underscore
    "XGB_model__max_depth": [1, 2],
    "XGB_model__learning_rate": [0.03, 0.05],
    "XGB_model__subsample": [0.8, 1],
    "XGB_model__colsample_bytree": [0.8, 1],
}


In [None]:
operations_xgb = [
    ("OrdinalEncoder", column_trans),
    ("XGB_model", XGBClassifier(random_state=101, use_label_encoder=False)),
]

model = Pipeline(steps=operations_xgb)

xgb_grid_model = GridSearchCV(
    model,
    param_grid,
    scoring=recall_Hispanic,
    cv=5,
    n_jobs=-1,
    return_train_score=True,
)

In [None]:
xgb_grid_model.fit(
    X_train,
    y_train_xgb,
    XGB_model__sample_weight=classes_weights
)

In [None]:
xgb_grid_model.best_params_

In [None]:
pd.DataFrame(xgb_grid_model.cv_results_).loc[
    xgb_grid_model.best_index_,
    ["mean_test_score", "mean_train_score"]
]

In [None]:
# 1) get probability estimates for each class
y_score = xgb_grid_model.predict_proba(X_test)

# 2) binarize the test labels
classes = [0,1,2]
y_test_bin = label_binarize(y_test_xgb, classes=classes)

# 3) compute & plot
plt.figure(figsize=(8,6))
for i, class_id in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    ap = average_precision_score(y_test_bin[:, i], y_score[:, i])
    plt.plot(recall, precision, lw=2,
             label=f'class {class_id} (AP = {ap:.2f})')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('XGBoost Precision–Recall Curves by Class')
plt.legend(loc='best')
plt.grid(True)
plt.show()

In [None]:
# Convert the true XGBoost test labels to one-hot encoded dummy variables
y_test_xgb_dummies = pd.get_dummies(y_test_xgb).values

# Compute average precision for class “1” using the XGBoost grid model's predicted probabilities
average_precision_score(
    y_test_xgb_dummies[:, 1], xgb_grid_model.predict_proba(X_test)[:, 1]
)


In [None]:

# Generate class predictions on the test set
y_pred = xgb_grid_model.predict(X_test)

# Calculate Average Precision (AP) for class “1” from the XGBoost grid model's probabilities
xgb_AP = average_precision_score(
    y_test_xgb_dummies[:, 1], xgb_grid_model.predict_proba(X_test)[:, 1]
)

# Calculate F1 score for class “1”
xgb_f1 = f1_score(y_test_xgb, y_pred, average=None, labels=[1])

# Calculate recall for class “1”
xgb_recall = recall_score(y_test_xgb, y_pred, average=None, labels=[1])

---
---

## Other Evaluation Metrics for Multiclass Classification

- Evaluation metrics
https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd

In [None]:
# For multiclass, imbalanced data, the Matthews correlation coefficient or
# Cohen's kappa gives a single overall score,
# which helps decide which model is better.

In [None]:
# from sklearn.metrics import matthews_corrcoef
# matthews_corrcoef?
# matthews_corrcoef(y_test, y_pred)

In [None]:
# from sklearn.metrics import cohen_kappa_score
# cohen_kappa_score?
# cohen_kappa_score(y_test, y_pred)

In [None]:
# ----------------------------------------
# Compute Matthews Correlation Coefficient
# ----------------------------------------

# Import the metric for balanced evaluation on imbalanced data
from sklearn.metrics import matthews_corrcoef

# Generate predictions on the test set
y_pred = xgb_grid_model.predict(X_test)

# Calculate Matthews correlation coefficient (ranges from –1 (total disagreement) to +1 (perfect prediction))
mcc = matthews_corrcoef(y_test_xgb, y_pred)
print(f"Matthews Correlation Coefficient: {mcc:.3f}")


In [None]:

# ---------------------------------------------------
# Compute Cohen’s Kappa Score for chance-adjusted agreement
# ---------------------------------------------------

# Import Cohen's kappa metric
from sklearn.metrics import cohen_kappa_score

# Calculate Cohen's kappa (1 = perfect agreement, 0 = no agreement beyond chance)
kappa = cohen_kappa_score(y_test_xgb, y_pred)
print(f"Cohen’s Kappa Score: {kappa:.3f}")
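A quick way to see why these two scores are useful on imbalanced data is to score a classifier that always predicts the majority class. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, matthews_corrcoef

# Toy imbalanced target: eight samples of class 2, two of class 1
y_true_imb = [2] * 8 + [1] * 2
# A "model" that ignores the minority class entirely
y_pred_majority = [2] * 10

acc_demo = accuracy_score(y_true_imb, y_pred_majority)
mcc_demo = matthews_corrcoef(y_true_imb, y_pred_majority)
kappa_demo = cohen_kappa_score(y_true_imb, y_pred_majority)

# Accuracy looks respectable, but both chance-corrected scores show no skill
print(f"accuracy={acc_demo:.2f}, MCC={mcc_demo:.2f}, kappa={kappa_demo:.2f}")
```

Accuracy rewards the majority-class guess, while MCC and kappa both report that the predictions carry no information about the true labels.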


In [None]:
# ===============================================
# Compare performance metrics across all models
# ===============================================

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Assemble a DataFrame with F1, Recall, and Average Precision for each model
compare = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Random Forest", "XGBoost"],
    "F1":       [log_f1[0],         svc_f1[0],       rf_f1[0],        xgb_f1[0]],
    "Recall":   [log_recall[0],     svc_recall[0],   rf_recall[0],    xgb_recall[0]],
    "AP":       [log_AP,            svc_AP,          rf_AP,           xgb_AP]
})

# 2. Set up a larger figure for three subplots
plt.figure(figsize=(14, 10))

# --- Subplot 1: F1 Scores ---
plt.subplot(3, 1, 1)
f1_sorted = compare.sort_values(by="F1", ascending=False)  # sort by F1 descending
ax = sns.barplot(x="F1", y="Model", data=f1_sorted, palette="Blues_d")
ax.bar_label(ax.containers[0], fmt="%.3f")                 # annotate bars with three decimals
plt.title("Model Comparison: F1 Score")

# --- Subplot 2: Recall Scores ---
plt.subplot(3, 1, 2)
recall_sorted = compare.sort_values(by="Recall", ascending=False)  # sort by Recall descending
ax = sns.barplot(x="Recall", y="Model", data=recall_sorted, palette="Blues_d")
ax.bar_label(ax.containers[0], fmt="%.3f")
plt.title("Model Comparison: Recall")

# --- Subplot 3: Average Precision (AP) ---
plt.subplot(3, 1, 3)
ap_sorted = compare.sort_values(by="AP", ascending=False)  # sort by AP descending
ax = sns.barplot(x="AP", y="Model", data=ap_sorted, palette="Blues_d")
ax.bar_label(ax.containers[0], fmt="%.3f")
plt.title("Model Comparison: Average Precision (AP)")

# 3. Improve layout and display the plots
plt.tight_layout()
plt.show()


In [None]:
# For this data, logistic regression works well; the tree-based models collapsed.

## Before the Deployment
- Choose the model that works best based on your chosen metric.
- As the final step, fit the best model on the whole dataset to get better performance.
- Your model is then ready to deploy: dump the fitted model together with its scaler.
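
The dump step in the list above can be sketched with `joblib`. When the scaler lives inside the pipeline, saving the pipeline saves both in one file. The toy data, the `toy_pipeline` name, and the `final_model.joblib` file name are assumptions for illustration; in the notebook the object to dump would be the fitted final pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in dataset (illustrative values only)
X_toy = np.array([[150.0, 50.0], [160.0, 60.0], [170.0, 70.0], [180.0, 80.0]])
y_toy = np.array([0, 0, 1, 1])

# Scaler and model in one pipeline, so a single dump captures both
toy_pipeline = Pipeline([("scaler", MinMaxScaler()), ("logistic", LogisticRegression())])
toy_pipeline.fit(X_toy, y_toy)

# Hypothetical output path inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(toy_pipeline, model_path)

# Reload and confirm the restored pipeline predicts identically
restored_pipeline = joblib.load(model_path)
same_predictions = np.array_equal(
    toy_pipeline.predict(X_toy), restored_pipeline.predict(X_toy)
)
print(same_predictions)
```

Loading the file in the serving environment generally requires compatible versions of scikit-learn and its dependencies.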

In [None]:
# ----------------------------------------
# Define the column transformer
# ----------------------------------------
# 'cat' should be a list of your categorical feature names or indices
# OneHotEncoder will handle unknown categories by ignoring them,
# and remainder=MinMaxScaler() scales all other (numeric) features.

column_trans_final = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat),
    remainder=MinMaxScaler(),
)

# ----------------------------------------
# Build the pipeline
# ----------------------------------------
# First step applies the preprocessing transformer,
# second step fits a logistic regression with balanced class weights.
operations_final = [
    ("preprocessor", column_trans_final),
    (
        "logistic",
        LogisticRegression(
            class_weight="balanced",   # adjust for class imbalance
            max_iter=10000,            # ensure convergence
            random_state=101           # for reproducibility
        )
    ),
]

final_model = Pipeline(steps=operations_final)



In [None]:
# X = feature matrix, y = target array
final_model.fit(X, y)


In [None]:
male_mean_human = X[X['gender'] == "Male"] \
    .describe(include="all") \
    .loc["mean"]
male_mean_human



In [None]:
# 2. Convert that Series into a dict of numeric feature means
numeric_means = male_mean_human.drop(labels=['gender', 'writingpreference', 'birth_region_grouped']).to_dict()

# 3. Define placeholder values for your 3 categorical columns
#    – pick valid entries that appear in your training data
cat_values = {
    'gender': 'Male',
    'writingpreference': 'Right',         # e.g. “Right”, “Left”, or your actual categories
    'birth_region_grouped': 'North America'  # e.g. one of the grouped regions
}

# 4. Merge the numeric means and categorical placeholders
example_dict = {**numeric_means, **cat_values}


In [None]:
# 5. Create a one‐row DataFrame from that dict
example_row = pd.DataFrame([example_dict])

example_row



In [None]:
# 6. Feed into your trained pipeline
prediction = final_model.predict(example_row)
print("\nPredicted class for this synthesized example:", prediction[0])



---
---

# SMOTE
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

## SMOTE Implementation

In [None]:
# !pip install imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imbpipeline

In [None]:
# Use resampling only as a last resort, because it adds artificial (synthetic) samples.

In [None]:

# ----------------------------------------
# 2. Define your preprocessing transformer
# ----------------------------------------
# 'cat' should be a list of your categorical feature names
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat),
    remainder=MinMaxScaler(),
)


In [None]:
# ----------------------------------------
# 3. Fit-transform the training features
# ----------------------------------------
# X_train = original training feature matrix
X_train_ohe = column_trans.fit_transform(X_train)


In [None]:

# ----------------------------------------
# 4. Apply SMOTE to balance classes
# ----------------------------------------
# Initialize SMOTE (you can set random_state for reproducibility)
over = SMOTE(random_state=42)
# Fit SMOTE on the preprocessed training data
X_train_over, y_train_over = over.fit_resample(X_train_ohe, y_train)

# Now X_train_over and y_train_over are balanced and ready for modeling
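The core step behind SMOTE is an interpolation between a minority sample and one of its minority-class neighbors. The NumPy sketch below shows only that step on made-up points. It is a simplification: it always uses the single nearest neighbor, whereas the library picks at random among several nearest neighbors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Three made-up minority-class points in a 2-feature space
minority_points = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

base_point = minority_points[0]
others = minority_points[1:]

# Find the nearest other minority point by Euclidean distance
distances = np.linalg.norm(others - base_point, axis=1)
neighbor = others[np.argmin(distances)]

# Draw a random position on the segment between the two points
lam = rng.uniform(0.0, 1.0)
synthetic_point = base_point + lam * (neighbor - base_point)

print("neighbor:", neighbor)
print("synthetic:", synthetic_point)
```

Because the synthetic point always lies on a line segment between two real samples, SMOTE works on numeric, scaled features, which is why the encoding and scaling are applied before resampling.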


In [None]:
X_train_over.shape

In [None]:
y_train_over.value_counts()

In [None]:
# 2. Initialize the sampler
# ----------------------------------------
under = RandomUnderSampler(random_state=42)  # fix seed for reproducibility


In [None]:
# ----------------------------------------
# 3. Apply under-sampling to your preprocessed training set
#    X_train_ohe (or X_train_one) is the one-hot-encoded + scaled matrix
#    y_train is the original target array
# ----------------------------------------
X_train_under, y_train_under = under.fit_resample(X_train_ohe, y_train)

# ----------------------------------------
# 4. inspect the new shapes and class balance
# ----------------------------------------
print("Resampled X shape:", X_train_under.shape)
print("Resampled y distribution:\n", pd.Series(y_train_under).value_counts())





## Logistic Regression Over/Under Sampling

In [None]:
# 2. Configure custom sampling targets
# ----------------------------------------
# Upsample "Hispanic" to 1000 total samples
over = SMOTE(
    sampling_strategy={'Hispanic': 1000},
    random_state=42
)

# Downsample "White" to 2500 total samples
under = RandomUnderSampler(
    sampling_strategy={'White': 2500},
    random_state=42
)

# ----------------------------------------
# 3. Apply SMOTE first
#    (on your one-hot‐encoded & scaled training data)
# ----------------------------------------
X_resampled_over, y_resampled_over = over.fit_resample(X_train_ohe, y_train)


y_resampled_over.value_counts()


In [None]:
# ----------------------------------------
# 4. Then apply random under-sampling
# ----------------------------------------
X_resampled_under, y_resampled_under = under.fit_resample(X_resampled_over, y_resampled_over)

In [None]:
y_resampled_under.value_counts()

In [None]:
from imblearn.pipeline import Pipeline as imbpipeline


# 1. Define the sequence of resampling steps:
#    - 'o': apply SMOTE to upsample the "Hispanic" class
#    - 'u': apply RandomUnderSampler to downsample the "White" class
steps = [
    ('o', over),    # SMOTE(sampling_strategy={'Hispanic':1000})
    ('u', under)    # RandomUnderSampler(sampling_strategy={'White':2500})
]


# 2. Build an imblearn Pipeline with those steps
pipeline = imbpipeline(steps=steps)

# 3. Fit & resample in one go on your preprocessed training data
#    X_train_ohe: one‐hot encoded & scaled features
#    y_train: original target array
X_resampled, y_resampled = pipeline.fit_resample(X_train_ohe, y_train)



In [None]:

# 4. Verify the new class distribution
y_resampled.value_counts()

In [None]:
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat),
    remainder=MinMaxScaler()
)


operations = [
    ("preprocessor", column_trans),
    ("o", over),
    ("u", under),
    ("log", LogisticRegression(max_iter=10000, random_state=101))
]



In [None]:
# ----------------------------------------
# 3. Build and fit the imblearn Pipeline
# ----------------------------------------
smote_pipeline = imbpipeline(steps=operations)


In [None]:

# Fit to the raw X_train, y_train in one go:
smote_pipeline.fit(X_train, y_train)


In [None]:
eval_metric(smote_pipeline, X_train, y_train, X_test, y_test)


In [None]:
model = imbpipeline(steps=operations)

scores = cross_validate(
    model, X_train, y_train, scoring=scoring, cv=10, n_jobs=-1, return_train_score=True
)

df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]

# SHAP
- http://archive.today/2024.02.04-155206/https://towardsdatascience.com/shapley-values-clearly-explained-a7f7ef22b104
- https://towardsdatascience.com/shap-explain-any-machine-learning-model-in-python-24207127cad7

In [None]:
# Prepare data for SHAP explanations
# ----------------------------------------
column_trans_shap = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat),
    remainder=MinMaxScaler(),
    verbose_feature_names_out=False,
)

# Transform train and test sets manually
X_train_trans = column_trans_shap.fit_transform(X_train)
X_test_trans  = column_trans_shap.transform(X_test)

# ----------------------------------------
# Fit a logistic regression model for SHAP
# ----------------------------------------
model_shap = LogisticRegression(
    class_weight="balanced",
    max_iter=10000,
    random_state=101,
    penalty="l1",        # L1 (lasso) penalty, because we will use it for feature selection
    solver="saga",       # saga supports the L1 penalty (other solvers exist as well)
)

model_shap.fit(X_train_trans, y_train)

# Since SHAP doesn't work with the model fitted inside the pipeline,
# we apply transformations manually before explaining.



In [None]:
eval_metric(model_shap, X_train_trans, y_train, X_test_trans, y_test)


In [None]:

# Define the steps for the pipeline
operations = [
    ("OneHotEncoder", column_trans_shap),
    (
        "log",
        LogisticRegression(
            class_weight="balanced",
            max_iter=10000,
            random_state=101,
            penalty="l1",
            solver="saga",
        ),
    ),
]

# Build the pipeline
model = Pipeline(steps=operations)

# Perform cross-validation
scores = cross_validate(
    model,
    X_train,
    y_train,
    scoring=scoring,
    cv=10,
    n_jobs=-1,
    return_train_score=True
)

# Aggregate results into a DataFrame and average the test metrics
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]


In [None]:
features = column_trans_shap.get_feature_names_out()

In [None]:
features

# SHAP for Feature Selection (Train Set)

In [None]:
import shap


# 1. Create a Linear SHAP explainer using the manually‐fit logistic model
explainer = shap.LinearExplainer(model_shap, X_train_trans)

# 2. Compute SHAP values on the training set
shap_values = explainer.shap_values(X_train_trans)

# 3. Plot a SHAP summary of feature importances
shap.summary_plot(
    shap_values,
    max_display=300,           # show up to 300 features
    feature_names=features,    # list of original feature names
    plot_size=(20, 100)        # width x height in inches
)
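
For a linear model with features treated as independent, the SHAP value of a feature in the model's margin (log-odds) space reduces to `coef * (x - mean(x))`. The NumPy sketch below checks that these values add up to the difference between one prediction and the average prediction. The coefficients and data are made up, and a multiclass logistic regression would have one such set of values per class.

```python
import numpy as np

# Made-up linear model for one class: the third coefficient was shrunk to zero
coef_demo = np.array([0.8, -0.5, 0.0])
intercept_demo = 0.1

# Made-up background data (already scaled to [0, 1]) and one sample to explain
X_background = np.array([[0.2, 0.4, 0.9], [0.6, 0.8, 0.1], [0.4, 0.0, 0.5]])
x_sample = np.array([0.9, 0.1, 0.3])

feature_means = X_background.mean(axis=0)

# SHAP value per feature under the independence assumption
phi = coef_demo * (x_sample - feature_means)

# Additivity check: contributions sum to f(x) minus the average model output
f_x = coef_demo @ x_sample + intercept_demo
f_expected = coef_demo @ feature_means + intercept_demo

print("SHAP values:", phi)
print("sum matches:", np.isclose(phi.sum(), f_x - f_expected))
```

A feature whose coefficient is zero gets a SHAP value of zero for every sample, which is how the L1 penalty and the SHAP summary plot work together for feature selection.
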

In [None]:
import shap

# 1. Take the class names in the order the fitted model uses (model_shap.classes_),
#    so that each SHAP array is labeled with the right class
class_names = list(model_shap.classes_)

# 2. Generate a SHAP bar‐summary plot that shows mean(|SHAP|) per feature,
#    split by class (one color per class)
shap.summary_plot(
    shap_values,                # list of arrays, one (n_samples, n_features) per class
    X_train_trans,              # your transformed train set (all numeric)
    feature_names=features,     # list of original feature names
    class_names=class_names,    # names for each of the three classes
    plot_type="bar",            # bar chart of mean absolute SHAP values
    max_display=300,            # show up to 300 features
    plot_size=(15, 35)           # width, height in inches
)

In [None]:

green_features = [
    "bideltoidbreadth",
    "birth_region_grouped",
    "handlength",
    "waistdepth",
    "bimalleolarbreadth",
    "wristcircumference",
    "age",
    "earlength",
    "bitragionsubmandibulararc",
    "crotchheight",
    "forearmcircumferenceflexed",
    "headlength",
    "buttockkneelength",
    "footbreadthhorizontal",
    "elbowrestheight",
    "tragiontopofhead",
    "kneeheightmidpatella",
    "earprotrusion",
    "mentonsellionlength",
    "bizygomaticbreadth",
    "neckcircumference",
    "poplitealheight",
    "writingpreference",
    "biacromialbreadth",
    "crotchlengthomphalion",
    "earbreadth",
    "functionalleglength",
    "shoulderlength",
]

In [None]:
X2 = X[green_features]
X2

In [None]:

X2.head()

In [None]:

# 1. Identify duplicate column names
cols = X2.columns
dup_cols = cols[cols.duplicated()]
print("Duplicate column names found:", dup_cols.unique())


In [None]:

# 2. Drop all but the first occurrence of each duplicate column
#    This will keep the first and remove subsequent columns with the same name
X2_dedup = X2.loc[:, ~X2.columns.duplicated()]

# 3. Verify that duplicates are gone
print("Columns after deduplication:", X2_dedup.columns.tolist())
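
The deduplication idiom above can be tried on a toy frame to see exactly which columns survive. The column names are borrowed from the feature list, and the values are made up.

```python
import pandas as pd

# Toy frame where "age" appears twice, as can happen if a feature list repeats a name
df_dup_demo = pd.DataFrame(
    [[25, 190, 99], [31, 185, 98]],
    columns=["age", "handlength", "age"],
)

# duplicated() flags the second and later occurrences of each column name
dup_mask = df_dup_demo.columns.duplicated()
print(list(dup_mask))

# Keep only the first occurrence of every column name
df_dedup_demo = df_dup_demo.loc[:, ~dup_mask]
print(df_dedup_demo.columns.tolist())
```

Only the first `age` column is kept, so if the duplicates hold different values it is worth checking which copy is the intended one before dropping.
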



In [None]:
cat_new = X2.select_dtypes("object").columns
cat_new

In [None]:
X2.shape

In [None]:

# 1. Split your selected features and target for SHAP modeling
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X2, y, test_size=0.2, random_state=101, stratify=y
)

# 2. Build a column transformer for SHAP (one‐hot + scaling)
column_trans_shap = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_new),
    remainder=MinMaxScaler(),
    verbose_feature_names_out=False,
)

# 3. Define the operations for the SHAP pipeline
operations_shap = [
    ("OneHotEncoder", column_trans_shap),
    (
        "log",
        LogisticRegression(
            class_weight="balanced",
            max_iter=10000,
            random_state=101,
            penalty="l1",
            solver="saga",
        ),
    ),
]

# 4. Create the SHAP pipeline
pipe_shap_model = Pipeline(steps=operations_shap)
pipe_shap_model.fit(X_train2, y_train2)

In [None]:
eval_metric(pipe_shap_model, X_train2, y_train2, X_test2, y_test2)


In [None]:
# Build and evaluate the SHAP‐based logistic pipeline
model = Pipeline(steps=operations_shap)

scores = cross_validate(
    model,
    X_train2,
    y_train2,
    scoring=scoring,
    cv=5,
    n_jobs=-1,
    return_train_score=True
)

df_scores = pd.DataFrame(scores, index=range(1, 6))
df_scores.mean()[2:]

In [None]:
# Get predicted class probabilities on the test set
y_pred_proba = pipe_shap_model.predict_proba(X_test2)

# Plot the precision–recall curves for each class
plot_precision_recall(y_test2, y_pred_proba)
plt.show()

In [None]:
# We have now performed feature selection with SHAP.

In [None]:
# These selected features work well for logistic regression; the same will not hold for other models.