<a href="https://colab.research.google.com/github/vignesh312000/Classification---Health-Insurance-Cross-Sell-Prediction/blob/main/resolute.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.decomposition import PCA


In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Load train and test data
train_path = "/content/drive/MyDrive/train.xlsx"  # Update with your file path
test_path = "/content/drive/MyDrive/test.xlsx"  # Update with your file path
train_df = pd.read_excel(train_path)
test_df = pd.read_excel(test_path)

In [None]:
df=train_df.copy()

In [None]:
df.head()

Clustering is an unsupervised learning technique, so the target label is not needed. The preprocessing therefore drops the "target" column before clustering.

In [None]:
unique_targets = df['target'].nunique()

print("Unique targets:", unique_targets)

Unique targets: 160


In [None]:
data=df.drop(['target'],axis=1)

In [None]:
data.fillna(data.mean(), inplace=True)

In [None]:
data.head()

In [None]:
# Outlier detection using z-score
z_scores = zscore(data)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)


In [None]:
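# Build the filtered frame from the imputed features (df still holds the unfilled values)
filtered_df = data[filtered_entries].copy()
filtered_df['target'] = df.loc[filtered_entries, 'target']
target = filtered_df['target']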

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=filtered_df.drop(columns=['target']), orient="v", palette="Set2")
plt.title("Box Plot of Features After Outlier Removal")
plt.ylabel("Feature Values")
plt.xlabel("Features")
plt.xticks(rotation=45)
plt.show()

### Task_1

In [None]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(filtered_df.drop(['target'], axis=1))

# Perform K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to the filtered DataFrame using .loc to avoid SettingWithCopyWarning
filtered_df.loc[:, 'cluster'] = cluster_labels

# Define a function to identify the cluster for a given data point
def identify_cluster(data_point):
    """Return (cluster, distance to centroid) for a point already scaled with `scaler`."""
    data_point = np.asarray(data_point, dtype=float)
    # Use the trained K-means model to predict the cluster for the given data point
    cluster = kmeans.predict(data_point.reshape(1, -1))[0]
    # Find the centroid of the predicted cluster
    centroid = kmeans.cluster_centers_[cluster]
    # Calculate the Euclidean distance between the data point and the centroid
    distance = np.linalg.norm(data_point - centroid)
    return cluster, distance



The `identify_cluster` function takes a data point (`data_point`) as input, predicts its cluster using the trained K-means model, calculates the Euclidean distance between the data point and the centroid of the predicted cluster, and returns the cluster label along with the distance. The model was fitted on standardized features, so the data point must be in the same scaled feature space.
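Because the K-means model was fitted on standardized features, a raw data point has to pass through the same fitted scaler before it is assigned to a cluster. The following is a minimal sketch of that step on synthetic data. The names `identify_cluster_raw`, `demo_scaler`, and `demo_kmeans` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (assumption: numeric features only)
rng = np.random.default_rng(0)
X_raw = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 4)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 4)),
])

# Fit the scaler and the clustering model on the same data
demo_scaler = StandardScaler()
X_demo = demo_scaler.fit_transform(X_raw)
demo_kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)


def identify_cluster_raw(raw_point, scaler, model):
    """Scale a raw data point, then return (cluster, distance to centroid)."""
    scaled = scaler.transform(np.asarray(raw_point, dtype=float).reshape(1, -1))
    cluster = int(model.predict(scaled)[0])
    distance = float(np.linalg.norm(scaled[0] - model.cluster_centers_[cluster]))
    return cluster, distance


cluster_id, dist = identify_cluster_raw(X_raw[0], demo_scaler, demo_kmeans)
print("cluster:", cluster_id, "distance:", round(dist, 3))
```

Passing the scaler and model as arguments, rather than reading globals, keeps the helper usable even if a later cell reassigns those variables.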

In [None]:
# Evaluate clustering quality
silhouette_avg = silhouette_score(X_scaled, cluster_labels)
db_index = davies_bouldin_score(X_scaled, cluster_labels)
ch_index = calinski_harabasz_score(X_scaled, cluster_labels)
print(f"Silhouette Score: {silhouette_avg}")
print(f"Davies-Bouldin Index: {db_index}")
print(f"Calinski-Harabasz Index: {ch_index}")

Silhouette Score: The score ranges from -1 to 1, and values near 0 indicate overlapping clusters. A score of 0.218 therefore suggests weak separation: the clusters capture some structure, but many points lie close to the boundary between clusters.

Davies-Bouldin Index: Lower values indicate more compact, better-separated clusters, with 0 as the lower bound. The value of approximately 1.506 is above 1, which means that, on average, the combined within-cluster scatter of a cluster and its most similar neighbor exceeds the distance between their centroids. The clusters therefore overlap noticeably.

Calinski-Harabasz Index: Higher values indicate denser, better-separated clusters, but the index has no fixed scale and grows with the number of samples. The value of approximately 12394.1 cannot be judged in isolation; it is most useful for comparing different numbers of clusters on the same data.
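These three metrics are easier to read when compared across several values of k on the same data. The sketch below does that on synthetic blobs. The helper `score_k_range` and the demo data are assumptions for illustration, not part of the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with three well-separated groups (assumption for the demo)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)


def score_k_range(X, k_values):
    """Fit K-means for each k and collect the three clustering metrics."""
    rows = []
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
        rows.append({
            "k": k,
            "silhouette": silhouette_score(X, labels),          # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        })
    return rows


results = score_k_range(X_blobs, range(2, 7))
for row in results:
    print(row)

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda row: row["silhouette"])["k"]
print("k with the highest silhouette score:", best_k)
```

On the notebook's data, the same loop could be run on `X_scaled` to check whether `n_clusters=3` is a reasonable choice.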

In [None]:
# Interpret the clusters and analyze the characteristics of each cluster
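# The target is a class label, so its mean is not meaningful and it is excluded
cluster_analysis = filtered_df.drop(columns=['target']).groupby('cluster').mean()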
print(cluster_analysis)

**Cluster 0:**
- **Trend**: This cluster has relatively lower values across most features compared to the other clusters.
- **Key Features**:
  - Features T1 through T18 have lower mean values than in the other clusters.
- **Observations**:
  - This cluster represents a group of data points with generally lower measurements across the features.

**Cluster 1:**
- **Trend**: This cluster has moderate values across most features.
- **Key Features**:
  - Features T1 through T18 have mean values between those of the other two clusters.
- **Observations**:
  - The values in this cluster are neither extremely high nor low.
  - This cluster represents a group of data points with moderate measurements across the features.

**Cluster 2:**
- **Trend**: This cluster has relatively higher values across most features compared to the other clusters.
- **Key Features**:
  - Features T1 through T18 have higher mean values than in the other clusters.
- **Observations**:
  - This cluster represents a group of data points with generally higher measurements across the features.

These interpretations summarize each cluster by its mean feature values. They help in understanding the differences between the clusters and can be useful for further analysis or decision-making.
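A low, moderate, and high reading of the clusters can be checked by putting all features on a common scale before comparing cluster means. The sketch below uses a small made-up table. The helper `cluster_profile` and the values are hypothetical; only the column names `T1`, `T2`, and `cluster` mirror the notebook.

```python
import numpy as np
import pandas as pd


def cluster_profile(frame, cluster_col="cluster"):
    """Mean z-score of every feature within each cluster."""
    features = frame.drop(columns=[cluster_col])
    # Standardize so features with different units are comparable
    standardized = (features - features.mean()) / features.std(ddof=0)
    return standardized.groupby(frame[cluster_col]).mean()


# Made-up data: three clusters of two rows each
demo = pd.DataFrame({
    "T1": [1.0, 2.0, 5.0, 6.0, 9.0, 10.0],
    "T2": [10.0, 12.0, 20.0, 22.0, 30.0, 32.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})

profile = cluster_profile(demo)
print(profile.round(2))

# For each feature, which cluster has the lowest and the highest mean
lowest = profile.idxmin()
highest = profile.idxmax()
print("lowest:", lowest.to_dict())
print("highest:", highest.to_dict())
```

If one cluster is lowest and another highest for every feature, the ordering described above holds for all features rather than just most of them.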

In [None]:
# Visualize clusters using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.5, label='Data Points')
centroids_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], marker='X', c='red', s=200, label='Centroids')
plt.title('K-means Clustering Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
plt.show()

### Cross Validation for scores

In [None]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
silhouette_scores = []

for train_index, test_index in kf.split(X_scaled):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    # Separate names keep the Task_1 model (kmeans) and its labels from being overwritten
    kmeans_cv = KMeans(n_clusters=3, random_state=42, n_init=10)  # n_init set explicitly to suppress warning
    kmeans_cv.fit(X_train)
    fold_labels = kmeans_cv.predict(X_test)
    silhouette_scores.append(silhouette_score(X_test, fold_labels))

average_silhouette_score = np.mean(silhouette_scores)
print("Average Silhouette Score:", average_silhouette_score)

### Task_2

In [None]:
# Drop 'target' column for training
X_train = train_df.drop(['target'], axis=1)
y_train = train_df['target']

# Fill missing values with mean
X_train.fillna(X_train.mean(), inplace=True)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
# Initialize and train RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train_scaled, y_train)

In [None]:
# Predict on the training data (the model was fit on the scaled features above)
y_pred = clf.predict(X_train_scaled)

# Calculate train accuracy (optimistic: the model has already seen these rows)
train_accuracy = accuracy_score(y_train, y_pred)

# Prepare the test data like the training data:
# same columns, training means for missing values, training scaler
X_test_scaled = scaler.transform(test_df[X_train.columns].fillna(X_train.mean()))

# Predict target values for the test data
test_predictions = clf.predict(X_test_scaled)

print("Train Accuracy:", train_accuracy)

Train Accuracy: 0.9994286025250326


In [None]:
# Create a DataFrame with row numbers and corresponding predicted targets
test_predictions_df = pd.DataFrame({
    'Row Number': test_df.index + 1,  # Adding 1 to start row numbers from 1
    'Target': test_predictions
})

# Print the DataFrame with test predictions
print("Test Predictions:")
print(test_predictions_df.to_string(index=False))

In [None]:
# Add predictions to the test DataFrame
test_df['Prediction'] = test_predictions

# Save the updated test DataFrame to an Excel file
output_path = "/content/drive/MyDrive/test_with_predictions.xlsx"  # Update with your desired output path
test_df.to_excel(output_path, index=False)

print("Test predictions saved to:", output_path)

Test predictions saved to: /content/drive/MyDrive/test_with_predictions.xlsx


### Model training

In [None]:
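# Model training on an 80/20 split of the filtered data
# Note: filtered_df still carries the 'cluster' column added in Task_1, so it is used as a feature here
features = filtered_df.drop(['target'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
# Evaluate on the held-out 20%
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)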

Accuracy: 0.9866340902558206


The Random Forest Classifier was chosen for its versatility and robustness. The random forest method handles both classification and regression (scikit-learn provides `RandomForestRegressor` for the latter). Here are some reasons why this algorithm might have been selected:

## High Accuracy:
Random forests tend to provide high accuracy on a wide range of datasets without much parameter tuning. The method is an ensemble that combines many decision trees, which reduces variance compared with a single tree.

## Robust to Overfitting:
A random forest mitigates the risk of overfitting by averaging many decision trees, each built on a bootstrap sample of the data with a random subset of features considered at each split. This makes it more robust than an individual decision tree. It can still overfit, so the near-perfect train accuracy above should not be read as an estimate of performance on unseen data.

## Handles Both Numerical and Categorical Features:
Tree-based models split on thresholds, so they do not need feature scaling. Random forests can in principle handle categorical features as well, but scikit-learn's `RandomForestClassifier` expects numeric input, so categorical features must be encoded first.

## Implicit Feature Selection:
Each split considers a random subset of features and chooses the most informative one among them, so uninformative features end up being used less often. The fitted model exposes the result through its `feature_importances_` attribute. This can be beneficial when dealing with high-dimensional data or datasets with redundant features.

Overall, the Random Forest Classifier is a popular choice for classification tasks due to its ease of use, robustness, and ability to provide high accuracy across various types of datasets. However, it is essential to consider the specific characteristics of your dataset and problem domain when choosing the most appropriate algorithm.
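The implicit feature selection described above can be inspected through the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic data in which only the first three of eight features carry signal. The dataset and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features are the first three columns
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    shuffle=False,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_syn, y_syn)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a first look rather than a final ranking.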

In [None]:
# Predict on test set
y_pred = clf.predict(X_test)

# Calculate precision
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision:", precision)

# Calculate recall
recall = recall_score(y_test, y_pred, average='weighted')
print("Recall:", recall)

# Calculate F1-score
f1 = f1_score(y_test, y_pred, average='weighted')
print("F1-score:", f1)

Precision: 0.9867266834732401
Recall: 0.9866340902558206
F1-score: 0.9865966336896794


The precision, recall, and F1-score are all very high, indicating strong performance of the classifier on the held-out 20% split. The problem has many classes (160 unique targets in the training data), so each metric is computed per class and then averaged with weights proportional to each class's number of samples (`average='weighted'`):

- Precision: 0.9867
  For a single class, precision is the fraction of samples predicted as that class that truly belong to it. The weighted average across classes is 98.67%.

- Recall: 0.9866
  Recall, also known as sensitivity, is the fraction of a class's actual samples that the classifier identified correctly. The weighted average is 98.66%, which equals the accuracy reported above, as support-weighted recall always does.

- F1-score: 0.9866
  The F1-score is the harmonic mean of precision and recall, so a high value requires both to be high.

These values suggest that the classifier predicts most classes correctly on this split. Weighted averages are dominated by frequent classes, so per-class scores are worth inspecting if rare classes matter.
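With `average='weighted'`, scikit-learn computes each metric per class and weights it by the number of true samples in that class. The sketch below reproduces the weighted precision by hand on a small made-up example. It also shows why weighted recall and accuracy coincide. The labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for a three-class problem
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# Per-class precision and the number of true samples (support) in each class
per_class_precision = precision_score(y_true_demo, y_pred_demo, average=None)
support = np.bincount(y_true_demo)

# Support-weighted mean of the per-class values
weighted_by_hand = np.sum(per_class_precision * support) / support.sum()
weighted_sklearn = precision_score(y_true_demo, y_pred_demo, average="weighted")

# Weighting per-class recall by support gives the share of correct predictions
weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)

print("per-class precision:", per_class_precision)
print("weighted precision (by hand):", weighted_by_hand)
print("weighted precision (sklearn):", weighted_sklearn)
print("weighted recall:", weighted_recall, "accuracy:", accuracy_demo)
```

For a per-class breakdown on the notebook's split, `sklearn.metrics.classification_report(y_test, y_pred)` lists precision, recall, F1-score, and support for every class.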