# Task
Perform a comprehensive analysis of the 'blood-transfusion-service-center' dataset from "https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data". This includes loading and preparing the data, handling missing values and duplicates, conducting exploratory data analysis with visualizations (histograms, box plots, correlation heatmap), preparing the data for machine learning by encoding the target variable and splitting into training/testing sets, building and training a simple classification model, and finally, evaluating its performance using appropriate metrics. The final deliverable should summarize insights from cleaning, analysis, visualizations, and model performance.

## Load and Prepare Data

### Subtask:
Download the 'blood-transfusion-service-center' dataset from the provided URL, convert it into a pandas DataFrame, and perform an initial inspection of its structure, data types, and first few rows. I will use the dataset description to understand the meaning of each column (V1: Recency, V2: Frequency, V3: Monetary, V4: Time, and the target variable).


**Reasoning**:
I will import the pandas library, define the dataset URL, load the dataset into a DataFrame, and then display the first 5 rows, column information, and descriptive statistics as requested in the instructions.



In [None]:
import pandas as pd

# Define the URL for the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data'

# Load the dataset into a pandas DataFrame
# The file's first row is a header; header=0 combined with names= replaces it with the names below.
# The separator is ','. '?' is registered as a missing-value marker as a precaution,
# although inspecting the data suggests no '?' are present.
column_names = ['Recency (months)', 'Frequency (times)', 'Monetary (c.c. blood)', 'Time (months)', 'donated_blood_in_march']
df = pd.read_csv(url, sep=',', na_values='?', names=column_names, header=0)

# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(df.head())

# Display the column names and their data types
print("\nDataFrame Info:")
df.info()

# Display basic descriptive statistics of the DataFrame
print("\nDescriptive statistics of the DataFrame:")
print(df.describe())
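
As a small offline illustration of the `names=` plus `header=0` combination used above, the sketch below parses an in-memory CSV instead of the real URL. The header label and both data rows are made up for illustration and are not taken from the actual file; the only point is that the header row is consumed and replaced rather than read as data, and that a `?` cell is counted as missing.

```python
import io
import pandas as pd

# Hypothetical sample text: one header row followed by two made-up data rows.
sample_csv = (
    "Recency,Frequency,Monetary,Time,some long target label\n"
    "2,5,1250,30,1\n"
    "4,?,500,12,0\n"
)

column_names = ['Recency (months)', 'Frequency (times)',
                'Monetary (c.c. blood)', 'Time (months)',
                'donated_blood_in_march']

# header=0 tells pandas that row 0 is a header; names= then overrides it.
sample_df = pd.read_csv(io.StringIO(sample_csv), na_values='?',
                        names=column_names, header=0)

print(sample_df.shape)           # two data rows, five columns
print(sample_df.isnull().sum())  # the '?' cell is counted as missing
```

Without `header=0`, pandas would treat the original header row as a data row once `names=` is supplied, which would turn every column into strings.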

## Data Cleaning

### Subtask:
Check for and handle any missing values or duplicate entries within the dataset. Ensure that all columns have appropriate data types for analysis and model building, converting them if necessary.


**Reasoning**:
The subtask requires checking for missing values and duplicate entries, and then handling duplicates if found. After these operations, it's good practice to re-verify the data types and overall DataFrame structure.



In [None]:
print("\nMissing values before cleaning:")
print(df.isnull().sum())

print("\nNumber of duplicate rows before cleaning:")
duplicate_rows_count = df.duplicated().sum()
print(duplicate_rows_count)

if duplicate_rows_count > 0:
    # Keep the first occurrence of each duplicated row and renumber the index
    df = df.drop_duplicates().reset_index(drop=True)
    print(f"\n{duplicate_rows_count} duplicate rows removed.")
else:
    print("\nNo duplicate rows found.")

print("\nDataFrame Info after cleaning:")
df.info()
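
The dataset has only the four feature columns and the target, with no donor identifier, so the five columns alone cannot tell genuine repeats apart from distinct donors who happen to share the same history. It can therefore help to inspect duplicated rows before dropping them. The sketch below uses a made-up three-column frame, not the real data, to show the difference between the default `duplicated()` and `duplicated(keep=False)`.

```python
import pandas as pd

# Made-up rows for illustration; rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    'Recency (months)': [2, 4, 2, 2],
    'Frequency (times)': [5, 1, 5, 5],
    'donated_blood_in_march': [1, 0, 1, 1],
})

# Default keep='first': only the repeats (rows 2 and 3) are flagged.
n_repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicated group, which is useful for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# Dropping keeps the first occurrence of each group.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(all_copies), len(deduped))
```

The count printed by the cleaning cell corresponds to `n_repeats` here: it is the number of rows that would be removed, not the number of rows involved in any duplicate group.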


## Exploratory Data Analysis (EDA)

### Subtask:
Conduct a detailed exploratory data analysis. This includes generating descriptive statistics for all features, analyzing the distribution of numerical features, and investigating the relationships between the features and the target variable (whether a donor donated blood in March 2007).


**Reasoning**:
I will re-examine the basic descriptive statistics for the cleaned DataFrame to understand the central tendency, dispersion, and shape of the dataset's distribution after removing duplicates. This addresses the first instruction of the subtask.



In [None]:
print("\nDescriptive statistics of the DataFrame after duplicate removal:")
print(df.describe())

**Reasoning**:
To analyze the distribution of numerical features and check for skewness or potential outliers, I will generate histograms for each numerical column. This addresses the second instruction of the subtask.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# List of numerical features to plot
numerical_features = ['Recency (months)', 'Frequency (times)', 'Monetary (c.c. blood)', 'Time (months)']

# Create histograms for each numerical feature
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_features):
    plt.subplot(2, 2, i + 1) # Arrange in 2 rows, 2 columns
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
plt.tight_layout()
plt.show()
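
To put a number on the skew that the histograms show, one option is `Series.skew()`, compared before and after a `log1p` transform. This is a sketch on made-up values standing in for a column such as 'Frequency (times)'; it is not part of the analysis above, and the real columns would need to be checked the same way.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values (a few small counts and one large one).
toy = pd.DataFrame({'Frequency (times)': [1, 1, 2, 2, 3, 3, 4, 6, 10, 40]})

# Positive skewness indicates a long right tail.
skew_raw = toy['Frequency (times)'].skew()

# log1p(x) = log(1 + x) compresses the right tail and is defined at x = 0.
skew_log = np.log1p(toy['Frequency (times)']).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

`log1p` is chosen over a plain logarithm here because a column such as 'Recency (months)' can contain zeros.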


**Reasoning**:
To investigate the relationship between each numerical feature and the target variable, I will calculate the mean of each numerical feature grouped by the 'donated_blood_in_march' target variable. This will help identify noticeable differences between donors and non-donors.



In [None]:
print("\nMean of numerical features grouped by 'donated_blood_in_march':")
print(df.groupby('donated_blood_in_march')[numerical_features].mean())

**Reasoning**:
To visually investigate the relationship between each numerical feature and the target variable, I will generate box plots for each numerical feature, grouped by 'donated_blood_in_march' on the x-axis. This will complement the mean comparison and help identify differences in distributions between donors and non-donors.



In [None]:
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_features):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x='donated_blood_in_march', y=col, data=df)
    plt.title(f'{col} by Donated Blood in March')
    plt.xlabel('Donated Blood in March (0: No, 1: Yes)')
    plt.ylabel(col)
plt.tight_layout()
plt.show()

## Visualize Cleaned and Analyzed Data

### Subtask:
Create various visualizations to clearly represent the cleaned data and the insights gained from the EDA. This will include plots like histograms for feature distributions, box plots to identify outliers, and potentially a correlation heatmap to show relationships between variables, including the target.


**Reasoning**:
To investigate the linear relationships between all numerical features and the target variable, I will calculate the correlation matrix and visualize it using a heatmap with annotations.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the correlation matrix for all relevant columns, including the target
correlation_matrix = df.corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features')
plt.show()
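
One thing worth checking alongside the heatmap is whether any feature is an exact multiple of another. The group means reported in the summary (8.74 times vs. 2186 c.c., and 6.31 times vs. 1577 c.c.) both give a ratio of roughly 250, which would be consistent with 'Monetary (c.c. blood)' being a fixed 250 c.c. per donation. That is a hypothesis to confirm on the real `df`; the sketch below only demonstrates the check on made-up rows built under that assumption.

```python
import pandas as pd

# Made-up rows in which Monetary is exactly 250 c.c. per donation (an assumption).
toy = pd.DataFrame({
    'Frequency (times)': [1, 2, 5, 8],
    'Monetary (c.c. blood)': [250, 500, 1250, 2000],
})

# If every row has the same ratio, one column is a constant multiple of the other.
ratio = toy['Monetary (c.c. blood)'] / toy['Frequency (times)']
is_constant_multiple = ratio.nunique() == 1

# A constant multiple implies a Pearson correlation of 1.
pearson_r = toy['Frequency (times)'].corr(toy['Monetary (c.c. blood)'])

print(is_constant_multiple, round(pearson_r, 4))
```

If the same check passes on the real data, the two columns carry identical information for a linear model, and dropping one of them before fitting would be a reasonable option.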

## Prepare Data for Machine Learning

### Subtask:
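Separate the dataset into features (X) and the target variable (y). The target variable 'donated_blood_in_march' appears to be coded already as 0 (no donation) and 1 (donation) in this file, so the code below applies no re-encoding; a mapping such as 2 to 1 for donation and 1 to 0 for no donation would only be needed for a copy of the dataset that codes the classes as 1 and 2. The data will then be split into training and testing sets to prepare for model building.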


**Reasoning**:
To prepare the data for machine learning, I will first separate the features (X) from the target variable (y), then split the data into training and testing sets for model evaluation.



In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df.drop('donated_blood_in_march', axis=1)
y = df['donated_blood_in_march']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
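
Because only about a quarter of the rows are donors, a plain random split can leave the test set with a somewhat different donor share than the training set. An alternative, not used in the cell above, is to pass `stratify=y` to `train_test_split`. The sketch below uses synthetic labels rather than the real data; switching the notebook to a stratified split would change the metrics reported in the summary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 25 positives (roughly the donor share).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 25 + [0] * 75)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```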

## Build and Train a Simple Classification Model

### Subtask:
Implement and train a simple classification model, such as Logistic Regression or a Decision Tree Classifier, to predict the target variable based on the prepared features. The model's objective will be to classify whether a donor will donate blood.


**Reasoning**:
To implement and train a simple classification model, I will import `LogisticRegression`, instantiate it, and then train it using the `fit` method on the prepared training data (`X_train`, `y_train`).



In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
model = LogisticRegression(random_state=42, solver='liblinear') # Using 'liblinear' solver for small datasets

# Train the model
model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

**Reasoning**:
With the model trained, I will use it to generate predictions on the test set (`X_test`) and store them in `y_pred`, so that the evaluation metrics can be computed in the next step.



In [None]:
y_pred = model.predict(X_test)
print("Model predictions on the test set generated successfully.")

**Reasoning**:
To evaluate the performance of the trained model, I will calculate common classification metrics such as accuracy, precision, recall, and F1-score, and also generate a confusion matrix. This will provide a comprehensive understanding of the model's performance on the unseen test data.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("\nModel Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"\nConfusion Matrix:\n{conf_matrix}")

# Optionally visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'], yticklabels=['Actual 0', 'Actual 1'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

## Final Task

### Subtask:
Provide a comprehensive summary of the entire process, including key insights from data cleaning and analysis, the visualizations created, and the performance metrics of the simple machine learning model.


## Summary:

### Data Analysis Key Findings

*   The initial dataset, comprising 748 entries and 5 columns, was successfully loaded and inspected. All features were numerical (`int64`), and the target variable, 'donated_blood_in_march', indicated an imbalanced dataset where approximately 23.8% of entries represented donors.
*   Data cleaning revealed no missing values. However, 215 duplicate rows were identified and removed, reducing the dataset size to 533 unique entries.
*   Exploratory Data Analysis (EDA) showed that all numerical features ('Recency (months)', 'Frequency (times)', 'Monetary (c.c. blood)', 'Time (months)') exhibited right-skewed distributions.
*   Analysis of feature means grouped by the target variable provided key insights into donor behavior:
    *   Donors who gave blood in March had a markedly lower mean 'Recency' (5.89 months) compared to non-donors (10.94 months).
    *   Donors showed a higher mean 'Frequency' (8.74 times) and 'Monetary' contribution (2186 c.c.) than non-donors (6.31 times and 1577 c.c. respectively).
    *   Donors had a lower mean 'Time' active (37.29 months) than non-donors (44.25 months).
*   A correlation heatmap was generated, visualizing the linear relationships between all features and the target variable.
*   The data was successfully prepared for machine learning, with features and target separated, and then split into 80% training (426 samples) and 20% testing (107 samples) sets.
*   A Logistic Regression model was trained and evaluated, yielding the following performance metrics on the test set:
    *   Accuracy: 0.7664
    *   Precision: 0.5833
    *   Recall: 0.2593
    *   F1-Score: 0.3590
*   The confusion matrix indicated 7 True Positives (correctly predicted donors), 20 False Negatives (actual donors missed), 5 False Positives (non-donors incorrectly predicted as donors), and 75 True Negatives (correctly predicted non-donors).
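
As a consistency check, the reported metrics can be recomputed directly from the confusion-matrix counts listed above. The sketch below uses only those four counts; the majority-class baseline is an extra comparison that is not computed in the notebook itself.

```python
# Counts as reported in the summary above.
tp, fn, fp, tn = 7, 20, 5, 75
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (non-donor) on this test set.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={baseline_accuracy:.4f}")
```

The recomputed values match the reported ones. The baseline comes out at about 0.75, so the model's accuracy of 0.7664 is only slightly better than always predicting "no donation", which is why recall and F1-score are the more informative figures here.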

### Insights or Next Steps

*   The low recall of the Logistic Regression model (0.2593) indicates a significant challenge in correctly identifying actual blood donors, which is a critical aspect for a donor prediction service. This suggests that the model is biased towards predicting the majority class (non-donors), likely due to the imbalanced nature of the dataset.
*   To improve the model's ability to predict donors, future steps should focus on addressing the class imbalance using techniques like oversampling (e.g., SMOTE), undersampling, or using algorithms inherently robust to imbalance. Additionally, exploring feature engineering (e.g., transformations for skewed data) and more advanced classification models (e.g., Gradient Boosting, Random Forests) or hyperparameter tuning could enhance predictive performance, especially recall and F1-score.
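
One lightweight way to try the imbalance handling suggested above, without resampling, is the `class_weight='balanced'` option of `LogisticRegression`. The sketch below is run on synthetic data (a single noisy feature and a 75/25 class split, both made up), so it says nothing about how much the real model would improve; it only shows what the option does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 75 negatives and 25 positives (not the real data).
y_toy = np.array([0] * 75 + [1] * 25)
rng = np.random.default_rng(42)
# One noisy feature whose mean is shifted upward for the positive class.
X_toy = (rng.normal(size=100) + 1.0 * y_toy).reshape(-1, 1)

# 'balanced' weights are n_samples / (n_classes * class_count).
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)

plain = LogisticRegression(solver='liblinear', random_state=42).fit(X_toy, y_toy)
balanced = LogisticRegression(solver='liblinear', random_state=42,
                              class_weight='balanced').fit(X_toy, y_toy)

print("class weights:", weights)
print("predicted positives:", int(plain.predict(X_toy).sum()),
      "vs", int(balanced.predict(X_toy).sum()))
```

Up-weighting the minority class typically shifts the decision boundary so that more rows are predicted positive, trading some precision for recall; whether that trade is worthwhile would need to be judged on the real test set.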


# Task
Build a Logistic Regression model to predict a target variable using a user-provided CSV dataset. This involves loading the dataset, handling missing values and duplicates, performing exploratory data analysis (EDA) with visualizations, preprocessing data types, splitting data into training and testing sets, training the model, and evaluating its performance.