# ML Project Data Exploration

This notebook demonstrates the basic workflow of a machine learning project, including:
1. Data loading and inspection
2. Data preprocessing and cleaning
3. Feature analysis and visualization
4. Model training and evaluation

Let's go through each step to understand the process.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

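# Set up plotting style
# sns.set_theme() applies seaborn's default look. The old 'seaborn' matplotlib
# style name was deprecated in matplotlib 3.6 and fails on recent versions.
sns.set_theme()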

print("Libraries imported successfully!")

## Creating Sample Data

For this example, we'll create a simple synthetic dataset to demonstrate the ML workflow. In a real project, you would load your data from files in the `data/raw/` directory.
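
As a minimal sketch of what loading from `data/raw/` could look like, assuming the raw data is a single CSV file (the helper `load_raw_csv` and the file name `dataset.csv` are hypothetical and not part of this project):

```python
from pathlib import Path

import pandas as pd


def load_raw_csv(filename, raw_dir="data/raw"):
    """Read one CSV file from the raw-data directory into a DataFrame."""
    path = Path(raw_dir) / filename
    if not path.exists():
        # Fail early with a clear message instead of a bare pandas error
        raise FileNotFoundError(f"Expected raw data file at {path}")
    return pd.read_csv(path)


# Hypothetical usage; replace "dataset.csv" with the real file name:
# df = load_raw_csv("dataset.csv")
```

The rest of the notebook would then work on the loaded `df` in place of the synthetic one, provided it has the same feature and target columns.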

In [None]:
# Create synthetic data
np.random.seed(42)
n_samples = 1000

# Generate features
X = np.random.randn(n_samples, 3)  # 3 features
# Create a binary target: 1 when feature_1 + feature_2 - feature_3 > 0, else 0
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Create a DataFrame
feature_names = ['feature_1', 'feature_2', 'feature_3']
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

## Data Analysis and Visualization

Summary statistics, per-feature distributions split by target, and a correlation matrix give a first picture of the data:

In [None]:
# Basic statistics
print("Basic statistics:")
print(df.describe())

# Distribution plots
plt.figure(figsize=(12, 4))
for i, feature in enumerate(feature_names, 1):
    plt.subplot(1, 3, i)
    sns.histplot(data=df, x=feature, hue='target', multiple="stack")
    plt.title(f'{feature} Distribution by Target')
plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()
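
The preprocessing and cleaning step listed in the introduction usually starts with a few quality checks before any modeling. The sketch below shows one possible set of checks: missing values, duplicate rows, and class balance. The helper `summarize_quality` is a hypothetical name introduced here, and the snippet rebuilds the synthetic dataset so that it runs on its own. On this synthetic data there is nothing to clean, but on real data the same checks would typically guide imputation or row removal.

```python
import numpy as np
import pandas as pd

# Rebuild the notebook's synthetic dataset so this snippet is self-contained
np.random.seed(42)
X = np.random.randn(1000, 3)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
df = pd.DataFrame(X, columns=["feature_1", "feature_2", "feature_3"])
df["target"] = y


def summarize_quality(frame, target_col="target"):
    """Return missing-value counts, duplicate-row count, and class proportions."""
    return {
        "missing_per_column": frame.isna().sum(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "class_balance": frame[target_col].value_counts(normalize=True),
    }


report = summarize_quality(df)
print("Missing values per column:")
print(report["missing_per_column"])
print("\nDuplicate rows:", report["duplicate_rows"])
print("\nClass balance:")
print(report["class_balance"])
```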

## Model Training and Evaluation

Now let's prepare the data for training, train a simple logistic regression model, and evaluate its performance:

In [None]:
# Prepare features and target
X = df[feature_names]
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
print("\nModel Evaluation:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
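
A single train/test split gives one accuracy estimate that depends on how the rows happened to be divided. As a hedged alternative, not part of the original workflow, the sketch below wraps the scaler and the model in a scikit-learn pipeline and scores it with 5-fold cross-validation. The pipeline refits the scaler inside each fold, so no statistics from the held-out rows leak into training. The choice of five folds is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the notebook's synthetic dataset so this snippet is self-contained
np.random.seed(42)
X = np.random.randn(1000, 3)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Scaling and model fitting happen together, once per cross-validation fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
print("Std of accuracy:", scores.std())
```

The spread across folds gives a rough sense of how stable the accuracy estimate is.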

## Feature Importance

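Because the features were standardized before training, the absolute value of each logistic regression coefficient gives a rough measure of how strongly that feature influences the model's predictions: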

In [None]:
# Plot feature importance
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': np.abs(model.coef_[0])  # coefficient magnitudes, one per feature
})
importance = importance.sort_values('importance', ascending=False)

plt.figure(figsize=(8, 4))
sns.barplot(data=importance, x='importance', y='feature')
plt.title('Feature Importance')
plt.xlabel('Absolute Coefficient Value')
plt.tight_layout()
plt.show()
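
Absolute values hide the direction of each effect. Since the synthetic target was defined as `feature_1 + feature_2 - feature_3 > 0`, the signed coefficients can be compared against that rule as a sanity check. The sketch below is self-contained and, for brevity, fits on the full dataset without a train/test split, which is an assumption made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Rebuild the notebook's synthetic dataset so this snippet is self-contained
np.random.seed(42)
X = np.random.randn(1000, 3)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)
feature_names = ["feature_1", "feature_2", "feature_3"]

# Standardize, then fit on all rows (no split; illustration only)
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(random_state=42).fit(X_scaled, y)

# Keep the sign: positive pushes toward class 1, negative toward class 0
signed_coefs = pd.Series(model.coef_[0], index=feature_names)
print(signed_coefs)
```

If the model has recovered the generating rule, `feature_1` and `feature_2` should have positive coefficients and `feature_3` a negative one, with broadly similar magnitudes.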