# Case Study: Wearable Sensor Data Analysis for Disease Detection

**Step-by-Step: Implementing Logistic Regression for Disease Classification Using Wearable Sensor Data**

In this notebook, we apply logistic regression to data collected from wearable sensors to detect conditions such as heart disease or diabetes. The goal is to predict whether a patient is diseased (1) or healthy (0).

### Step 1: Problem Definition

We aim to predict whether a patient has a particular disease (e.g., heart disease or diabetes) from readings collected by wearable devices such as accelerometers, heart rate monitors, and temperature sensors. This is a binary classification problem: each record is labeled diseased (1) or healthy (0).
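
As a rough illustration of how logistic regression turns sensor readings into a 0/1 decision, the sketch below passes a weighted sum of features through the sigmoid function and thresholds the result at 0.5. The weights, bias, and feature values are hypothetical numbers chosen for the example, not values learned from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration
weights = np.array([0.8, -0.5])
bias = -0.2

# One hypothetical standardized feature vector (e.g., heart rate, sleep duration)
x = np.array([1.5, -1.0])

score = np.dot(weights, x) + bias          # linear combination of the features
probability = sigmoid(score)               # estimated probability of class 1
predicted_label = int(probability >= 0.5)  # threshold at 0.5

print(f"score={score:.2f}, probability={probability:.3f}, label={predicted_label}")
```

Training the model amounts to choosing the weights and bias so that these probabilities match the observed labels as closely as possible.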

### Step 2: Dataset Preparation

We need a dataset with sensor data features such as:

- Heart rate
- Steps taken
- Temperature
- Activity level (e.g., walking, running)
- Sleep duration

The labels are binary: 0 (healthy) and 1 (diseased).

#### Step 2.1: Prepare a Sample Dataset

In practice, you would use a real healthcare or sensor dataset. For demonstration, we will use a hypothetical dataset.

In [None]:
import pandas as pd

# Sample dataset
data = {
    'heart_rate': [72, 80, 78, 65, 85, 92, 68, 76, 77, 79],
    'steps': [1000, 1500, 1200, 800, 1800, 2000, 900, 1000, 1100, 1300],
    'temperature': [36.5, 37.0, 36.8, 36.4, 37.1, 37.5, 36.6, 36.7, 36.9, 37.2],
    'activity_level': [1, 3, 2, 1, 4, 5, 1, 2, 3, 4],  # 1 = sedentary, 5 = active
    'sleep_duration': [7, 6, 7, 8, 5, 4, 8, 7, 6, 5],  # hours of sleep
    'label': [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # 0 = healthy, 1 = diseased
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Features and target variable
X = df.drop('label', axis=1)  # features
y = df['label']  # target variable

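#### Step 2.2: Generate a Larger Synthetic Dataset (1,000 Samples)

The cell below replaces the 10-row sample with 1,000 randomly generated rows and overwrites `df`, `X`, and `y`. The labels are drawn at random, independently of the features, so this dataset demonstrates the workflow only; a model trained on it should not be expected to outperform always predicting the majority class.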

In [None]:
import pandas as pd
import numpy as np

# Generate random data for 1000 samples
np.random.seed(42)

# Simulate data
heart_rate = np.random.randint(60, 100, 1000)
steps = np.random.randint(500, 2500, 1000)
temperature = np.random.uniform(36.0, 37.5, 1000)
activity_level = np.random.randint(1, 6, 1000)  # 1 = sedentary, 5 = active
sleep_duration = np.random.randint(4, 9, 1000)  # hours of sleep
label = np.random.choice([0, 1], 1000, p=[0.7, 0.3])  # 0 = healthy, 1 = diseased

# Create DataFrame
data = {
    'heart_rate': heart_rate,
    'steps': steps,
    'temperature': temperature,
    'activity_level': activity_level,
    'sleep_duration': sleep_duration,
    'label': label
}

df = pd.DataFrame(data)

# Features and target variable
X = df.drop('label', axis=1)  # features
y = df['label']  # target variable

df.head()
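
Before modeling, it can help to check how the classes are distributed and whether a feature carries any signal about the label. The sketch below rebuilds a reduced version of the synthetic data (the name `df_check` is introduced here for illustration) and reports the class shares and the correlation between heart rate and the label.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

# Reduced synthetic dataset: one feature plus the randomly drawn label
df_check = pd.DataFrame({
    'heart_rate': np.random.randint(60, 100, n),
    'label': np.random.choice([0, 1], n, p=[0.7, 0.3]),
})

# Share of each class; roughly 70% healthy and 30% diseased by construction
class_share = df_check['label'].value_counts(normalize=True)
print(class_share)

# Correlation between the feature and the label; expected to be near zero
# because the label is sampled independently of heart rate
corr = df_check['heart_rate'].corr(df_check['label'])
print("Correlation with label:", round(corr, 3))
```

With an imbalance of about 70/30, a classifier that always predicts "healthy" already reaches about 70% accuracy, which is the baseline any model must beat.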

### Step 3: Data Preprocessing & Normalization

Sensor features are measured on very different scales (heart rate in tens, steps in thousands), so standardizing them keeps large-valued features from dominating the regularized model and helps the solver converge.

#### Step 3.1: Standardize the Features Using `StandardScaler`

`StandardScaler` rescales each feature to zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize scaler
scaler = StandardScaler()

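# Fit and transform the data
# Note: fitting on all rows before the split lets test-set statistics leak into
# the scaler; in a real project, fit the scaler on the training split only
X_scaled = scaler.fit_transform(X)

# Verify the scaling: each column should have mean ~0 and standard deviation ~1
print(pd.DataFrame(X_scaled, columns=X.columns).describe().loc[['mean', 'std']])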
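
To make the transformation concrete, the sketch below compares `StandardScaler` with the formula it applies, `(x - mean) / std`, on a small hypothetical array. The array values are illustrative and not taken from the dataset above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical readings: columns are heart rate and steps
values = np.array([[60.0, 500.0],
                   [80.0, 1500.0],
                   [100.0, 2500.0]])

scaled = StandardScaler().fit_transform(values)

# Manual standardization; StandardScaler uses the population std (ddof=0),
# which is also NumPy's default
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print("Matches manual formula:", np.allclose(scaled, manual))
print("Column means:", scaled.mean(axis=0))
print("Column stds:", scaled.std(axis=0))
```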

### Step 4: Split the Data into Training and Testing Sets

We need to split the data into training and testing sets to evaluate the model's performance.

#### Step 4.1: Split the Data Using `train_test_split`

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.1, random_state=42)
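
One common alternative, sketched here as an assumption rather than as the notebook's method, is to split first and wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is fitted on the training rows only. The demo arrays `X_demo` and `y_demo` are synthetic stand-ins, and `stratify` is added to keep the class ratio similar in both splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.choice([0, 1], size=200, p=[0.7, 0.3])

# Split the raw features first; stratify preserves the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only, then the model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the test rows are scaled with the training statistics
print("Test accuracy:", pipe.score(X_te, y_te))
```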

### Step 5: Train the Logistic Regression Model

Now, we can train the logistic regression model on the training data.

#### Step 5.1: Initialize and Fit the Model

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression()

# Train the model on the training set
model.fit(X_train, y_train)
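
After fitting, the learned weights can be inspected through the `coef_` and `intercept_` attributes. The sketch below uses a hypothetical two-feature dataset in which only the first feature influences the label, so the first coefficient should come out clearly larger than the second. The feature names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem
for name, coef in zip(['feature_a', 'feature_b'], clf.coef_[0]):
    print(f"{name}: {coef:+.3f}")
print(f"intercept: {clf.intercept_[0]:+.3f}")
```

On standardized inputs, a positive coefficient means that higher values of that feature raise the predicted probability of class 1, and the magnitudes are roughly comparable across features.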

### Step 6: Make Predictions

After training the model, we can use it to make predictions on the test set.

#### Step 6.1: Predict Disease Classification for the Test Data

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Display the predictions
print("Predictions:", y_pred)
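
`predict` applies a fixed 0.5 cut-off to the estimated probabilities. When missing a diseased patient is costlier than a false alarm, a lower cut-off may be preferable. The sketch below, on hypothetical data, uses `predict_proba` to show both; the 0.3 threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data with a real relationship between feature 0 and the label
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0.5).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
proba = clf.predict_proba(X_demo[:5])[:, 1]

default_pred = clf.predict(X_demo[:5])             # equivalent to proba > 0.5
low_threshold_pred = (proba >= 0.3).astype(int)    # flags more cases as diseased

print("Probabilities:     ", np.round(proba, 3))
print("Default threshold: ", default_pred)
print("Threshold at 0.3:  ", low_threshold_pred)
```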

### Step 7: Model Evaluation

We can evaluate the performance of the model using accuracy, precision, recall, and F1-score.

#### Step 7.1: Evaluate the Model Using `classification_report` and `accuracy_score`

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print the classification report (the newline keeps the table columns aligned)
print("Classification Report:\n", classification_report(y_test, y_pred))
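
To show where the numbers in the report come from, the sketch below computes precision, recall, and F1 for the positive class by hand and compares them with scikit-learn's functions. The label lists are hypothetical and not taken from the notebook's dataset.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Count outcomes for the positive class (1 = diseased)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives

precision = tp / (tp + fp)  # of predicted diseased, how many are diseased
recall = tp / (tp + fn)     # of actually diseased, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("Matches sklearn:",
      abs(precision - precision_score(y_true, y_hat)) < 1e-12,
      abs(recall - recall_score(y_true, y_hat)) < 1e-12,
      abs(f1 - f1_score(y_true, y_hat)) < 1e-12)
```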

### Step 8: Confusion Matrix

A confusion matrix summarizes the model's performance by counting true negatives, false positives, false negatives, and true positives. In scikit-learn's output, rows correspond to the true class and columns to the predicted class.

#### Step 8.1: Generate and Display the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Create confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the confusion matrix (the newline keeps the rows aligned)
print("Confusion Matrix:\n", conf_matrix)
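
For a binary problem, the four cells can be unpacked with `ravel()` and used to derive sensitivity and specificity, two measures commonly reported for diagnostic tests. The labels below are hypothetical, and passing `labels=[0, 1]` is an optional safeguard that fixes the matrix to 2x2 even if one class is absent.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# Row-major order gives: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)  # share of diseased patients correctly flagged
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```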