# One Versus Rest & One Versus One

## Iris Dataset Example
The iris dataset is a classic, easy multi-class classification dataset.
It contains the sepal and petal lengths and widths of three iris species (Setosa, Versicolour, and Virginica),
stored in a 150x4 numpy.ndarray: 150 samples with 4 features each.

It is a convenient way to learn the basic steps of training a multi-class classification model on 
a small, well-understood dataset.
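
As a quick orientation before training, the following sketch inspects the dataset's shape, feature names, and class names. It uses only `load_iris` from scikit-learn; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Load the dataset and separate features from targets
iris = load_iris()
X, y = iris.data, iris.target

# Inspect the dimensions and the meaning of columns and labels
print("Feature matrix shape:", X.shape)
print("Feature names:", iris.feature_names)
print("Class names:", list(iris.target_names))

# Count the distinct target values
n_classes = len(set(y))
print("Number of classes:", n_classes)
```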

In [None]:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

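# Initialize the XGBoost model. Each one-vs-rest sub-problem is binary,
# so binary log loss ('logloss') is the matching evaluation metric.
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)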

# Wrap the XGBoost model with OneVsRestClassifier
ovr_model = OneVsRestClassifier(xgb_model)

# Train the OneVsRest model
ovr_model.fit(X_train, y_train)

# Predict on the test set
y_pred = ovr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("One-vs-Rest with XGBoost Accuracy:", accuracy)
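
The one-vs-one strategy named in the title can be sketched the same way. One-vs-rest fits one binary classifier per class, whereas one-vs-one fits one per pair of classes, that is k(k-1)/2 classifiers for k classes. The sketch below uses scikit-learn's `OneVsOneClassifier`. `LogisticRegression` is an assumption chosen to keep the example dependency-light; any scikit-learn-compatible classifier, such as the `XGBClassifier` used above, should work as the base estimator.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

# Load the data and use the same split settings as the one-vs-rest example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base estimator is an assumption for this sketch, not the notebook's choice
base_model = LogisticRegression(max_iter=1000)

# Wrap the base estimator so that one binary classifier is fit per class pair
ovo_model = OneVsOneClassifier(base_model)
ovo_model.fit(X_train, y_train)

# With 3 classes there are 3 * (3 - 1) / 2 = 3 pairwise classifiers
n_pairwise = len(ovo_model.estimators_)
ovo_accuracy = accuracy_score(y_test, ovo_model.predict(X_test))
print("Pairwise classifiers:", n_pairwise)
print("One-vs-One Accuracy:", ovo_accuracy)
```

For three classes both strategies fit three classifiers, so the difference only shows up in cost as the number of classes grows.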




## Customer Segmentation Example

Segmenting customers can improve model performance, for example by training separate models for each segment.
In this example, we generate synthetic customer data with random feature values, assign each customer to one of 
three segments ('High Value', 'Medium Value', and 'Low Value'), and train an XGBoost classifier to predict the segment.

In a real-world scenario, you would use actual customer data and segment the customers based on their behavior,
such as purchase history and spending patterns. Because the synthetic spending and purchase ranges do not overlap 
between segments, the classes here are trivially separable; real data is noisier, so expect lower accuracy than in
this synthetic example. The concept of segmenting customers and training separate models for each segment still applies.
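
To see why accuracy on this synthetic data is unusually high, note that the generator below draws annual spending from ranges that do not overlap between segments. The following is a minimal sketch, assuming the same three ranges: a plain threshold rule, with no model at all, recovers the segment. The names `spending_ranges` and `rule_pred` are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a random segment for each synthetic customer
segment_names = np.array(["High Value", "Medium Value", "Low Value"])
segments = rng.choice(segment_names, size=10_000)

# Half-open [low, high) spending range per segment, mirroring the generator
spending_ranges = {
    "High Value": (10000, 20000),
    "Medium Value": (5000, 10000),
    "Low Value": (1000, 5000),
}
lows = np.array([spending_ranges[s][0] for s in segments], dtype=float)
highs = np.array([spending_ranges[s][1] for s in segments], dtype=float)
annual_spending = rng.uniform(lows, highs)

# Classify with two fixed thresholds on a single feature
rule_pred = np.where(
    annual_spending >= 10000,
    "High Value",
    np.where(annual_spending >= 5000, "Medium Value", "Low Value"),
)

# Fraction of customers whose segment the rule recovers
rule_accuracy = float(np.mean(rule_pred == segments))
print("Threshold-rule accuracy:", rule_accuracy)
```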

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

def segmented_anual_spending(segment):
    """
    Generate a random annual spending based on the segment.
    """
    if segment == 'High Value':
        return np.random.uniform(10000, 20000)
    elif segment == 'Medium Value':
        return np.random.uniform(5000, 10000)
    else:
        return np.random.uniform(1000, 5000)
    
def segmented_number_of_purchases(segment):
    """
    Generate a random number of purchases based on the segment.
    """
    if segment == 'High Value':
        return np.random.randint(50, 100)
    elif segment == 'Medium Value':
        return np.random.randint(20, 50)
    else:
        return np.random.randint(1, 20)        

def generate_customer_data(num_samples):
    """
    Generate synthetic customer data with random values for the features.
    """
    np.random.seed(42)  # For reproducibility

    # Generate random segments for the customers
    segments = np.random.choice(['High Value', 'Medium Value', 'Low Value'], num_samples)

    customer_ids = np.arange(1, num_samples + 1)
    annual_spending = np.array([segmented_anual_spending(segment) for segment in segments])
    number_of_purchases = np.array([segmented_number_of_purchases(segment) for segment in segments])
    avg_purchase_value = annual_spending / number_of_purchases
    ages = np.random.randint(18, 70, num_samples)
    tenure = np.random.randint(1, 20, num_samples)
    

    data = {
        'CustomerID': customer_ids,
        'AnnualSpending': annual_spending,
        'NumberOfPurchases': number_of_purchases,
        'AvgPurchaseValue': avg_purchase_value,
        'Age': ages,
        'Tenure': tenure,
        'Segment': segments
    }

    return pd.DataFrame(data)

# Generate a larger dataset
num_samples = 1000000  # Adjust the number of samples as needed
df = generate_customer_data(num_samples)

# Drop CustomerID as it is not a useful feature for the model
df = df.drop('CustomerID', axis=1)

# Encode the target variable
label_encoder = LabelEncoder()
df['Segment'] = label_encoder.fit_transform(df['Segment'])

# Split the data into features and target
X = df.drop('Segment', axis=1)
y = df['Segment']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the XGBoost model
model = xgb.XGBClassifier(eval_metric='mlogloss', random_state=42)
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Decode the predicted labels back to the original categories
y_pred_labels = label_encoder.inverse_transform(y_pred)
print("Predicted Segments:", y_pred_labels)
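
One detail worth checking after encoding: `LabelEncoder` assigns integers in sorted (alphabetical) order of the class names, so the codes do not follow the High, Medium, Low business ordering. A minimal sketch with illustrative variable names:

```python
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the three segment names
encoder = LabelEncoder()
encoder.fit(["High Value", "Medium Value", "Low Value"])

# classes_ holds the names in sorted order; position i is encoded as integer i
classes = list(encoder.classes_)

# Encode the names in business order to see the resulting integers
codes = encoder.transform(["High Value", "Medium Value", "Low Value"]).tolist()
print("Sorted classes:", classes)
print("Codes for High, Medium, Low:", codes)
```

This is harmless for classification because `inverse_transform` restores the names, but the integers should not be read as a ranking.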


### Summary of Predictions
In this summary, we can first validate the number of samples in the testing set and the number of predicted segments.
Next, we can validate the mapping of the labels to the original categories using the label encoder.
Finally, we can display a sample of the testing set with the actual segment and the predicted segment to visually validate the predictions.

We can use the following code to summarize the results:

In [None]:
# View counts to validate
print(f"""
TESTING SET COUNT: {X_test.shape[0]:,.0f}
PREDICTED SEGMENT COUNTS: {len(y_pred_labels):,.0f}
""")

# View the mapping of labels to validate
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label Encoder Mapping:", label_mapping)

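# Combine the test features with the decoded actual segment and the predicted segment
predicted_test_df = X_test.copy()
predicted_test_df["Segment"] = label_encoder.inverse_transform(y_test)
predicted_test_df["prediction"] = y_pred_labels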

predicted_test_df.head(10)
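
Beyond eyeballing the first rows, a confusion matrix gives a per-segment breakdown of where predictions agree or disagree with the actual labels. The sketch below uses scikit-learn's `confusion_matrix` on a small set of hypothetical labels; the `actual` and `predicted` arrays are made up for illustration and are not results from the model above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["High Value", "Low Value", "Medium Value"]

# Hypothetical labels for six customers (illustration only)
actual = np.array(
    ["High Value", "Low Value", "Medium Value", "Low Value", "High Value", "Medium Value"]
)
predicted = np.array(
    ["High Value", "Low Value", "Low Value", "Low Value", "High Value", "Medium Value"]
)

# Rows are actual segments, columns are predicted segments
cm = confusion_matrix(actual, predicted, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"actual: {name}" for name in labels],
    columns=[f"predicted: {name}" for name in labels],
)
print(cm_df)
```

Off-diagonal cells show which segments get confused with each other, which a single accuracy number hides.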

### Example of One Customer Record

In this example, we have a customer with the following features:
- Annual Spending: $12,000
- Number of Purchases: 75
- Average Purchase Value: $160
- Age: 35
- Tenure: 15 years

We can use the trained model to predict the segment for this customer.

In [None]:
new_customer_data = {
    'AnnualSpending': [12000],
    'NumberOfPurchases': [75],
    'AvgPurchaseValue': [160],
    'Age': [35],
    'Tenure': [15]
}

new_customer_df = pd.DataFrame(new_customer_data)

# Make predictions for the new customer data
new_customer_pred = model.predict(new_customer_df)
new_customer_pred_label = label_encoder.inverse_transform(new_customer_pred)
print("Predicted Segment for New Customer:", new_customer_pred_label[0])
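
The `AvgPurchaseValue` of 160 above is not independent: in the training data it is computed as `AnnualSpending / NumberOfPurchases` (12000 / 75 = 160). A small helper can derive it and keep the columns in the training order, which avoids passing inconsistent inputs to the model. The function name `build_customer_row` and the constant `FEATURE_COLUMNS` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column order used when the model was trained
FEATURE_COLUMNS = ["AnnualSpending", "NumberOfPurchases", "AvgPurchaseValue", "Age", "Tenure"]

def build_customer_row(annual_spending, number_of_purchases, age, tenure):
    """Build a one-row feature frame, deriving AvgPurchaseValue from the inputs."""
    if number_of_purchases <= 0:
        raise ValueError("number_of_purchases must be positive")
    row = {
        "AnnualSpending": annual_spending,
        "NumberOfPurchases": number_of_purchases,
        "AvgPurchaseValue": annual_spending / number_of_purchases,
        "Age": age,
        "Tenure": tenure,
    }
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)

# Same customer as in the example above
new_customer_row = build_customer_row(12000, 75, age=35, tenure=15)
print(new_customer_row)
```

The resulting frame can be passed to `model.predict` in place of a hand-built dictionary.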

# END OF NOTEBOOK