{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# {{ name }} - Data Science Project\n",
    "\n",
    "This notebook demonstrates a typical data science workflow including:\n",
    "1. Loading and exploring a dataset\n",
    "2. Data preprocessing and cleaning\n",
    "3. Exploratory data analysis with visualizations\n",
    "4. Building two simple ML models (logistic regression and random forest)\n",
    "5. Evaluating model performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Import standard data science libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "{% if 'matplotlib' in libraries %}\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "plt.rcParams['figure.figsize'] = (12, 8)\n",
    "{% endif %}\n",
    "{% if 'scikit-learn' in libraries %}\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score\n",
    "{% endif %}\n",
    "{% if 'plotly' in libraries %}\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as go\n",
    "{% endif %}\n",
    "\n",
    "# Set random seed for reproducibility\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and Explore Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Load sample dataset (using Iris dataset for this template)\n",
    "{% if 'scikit-learn' in libraries %}\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "# Load the iris dataset\n",
    "iris = load_iris()\n",
    "data = pd.DataFrame(iris['data'], columns=iris['feature_names'])\n",
    "# Assign the target separately so the class labels stay integers\n",
    "# (stacking with np.c_ would cast them to float)\n",
    "data['target'] = iris['target']\n",
    "{% else %}\n",
    "# URL for Iris dataset\n",
    "url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\"\n",
    "column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']\n",
    "data = pd.read_csv(url, names=column_names)\n",
    "{% endif %}\n",
    "\n",
    "# Display basic information about the dataset\n",
    "print(\"Dataset Shape:\", data.shape)\n",
    "print(\"\\nFirst 5 rows:\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values in each column:\\n\", data.isnull().sum())\n",
    "\n",
    "# Statistical summary\n",
    "print(\"\\nStatistical Summary:\")\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Work on a copy so the raw data stays untouched.\n",
    "# The Iris dataset has no missing values, so no imputation is needed here;\n",
    "# for other data, handle them at this point (e.g. dropna() or fillna()).\n",
    "data_clean = data.copy()\n",
    "\n",
    "# Feature engineering: add length-to-width ratios for sepals and petals\n",
    "{% if 'scikit-learn' in libraries %}\n",
    "data_clean['sepal_ratio'] = data_clean['sepal length (cm)'] / data_clean['sepal width (cm)']\n",
    "data_clean['petal_ratio'] = data_clean['petal length (cm)'] / data_clean['petal width (cm)']\n",
    "{% else %}\n",
    "data_clean['sepal_ratio'] = data_clean['sepal_length'] / data_clean['sepal_width']\n",
    "data_clean['petal_ratio'] = data_clean['petal_length'] / data_clean['petal_width']\n",
    "{% endif %}\n",
    "\n",
    "# Display the updated dataframe\n",
    "data_clean.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Exploratory Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'matplotlib' in libraries %}\n",
    "# Histograms for each feature\n",
    "data_clean.hist(figsize=(15, 10))\n",
    "plt.suptitle('Feature Distributions', y=1.02, fontsize=16)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'matplotlib' in libraries %}\n",
    "# Correlation matrix and heatmap\n",
    "numeric_columns = data_clean.select_dtypes(include=[np.number]).columns\n",
    "correlation_matrix = data_clean[numeric_columns].corr()\n",
    "\n",
    "plt.figure(figsize=(12, 10))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)\n",
    "plt.title('Correlation Matrix', fontsize=16)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'matplotlib' in libraries %}\n",
    "# Pairplot to visualize relationships between features\n",
    "{% if 'scikit-learn' in libraries %}\n",
    "# Add class names for better visualization\n",
    "target_names = iris['target_names']\n",
    "data_clean['species'] = data_clean['target'].map({0: target_names[0], 1: target_names[1], 2: target_names[2]})\n",
    "\n",
    "# Select only the original features and the species for the pairplot\n",
    "plot_data = data_clean[iris['feature_names'] + ['species']]\n",
    "sns.pairplot(plot_data, hue='species', height=2.5)\n",
    "plt.suptitle('Pairwise Feature Relationships by Species', y=1.02, fontsize=16)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "{% else %}\n",
    "sns.pairplot(data_clean, hue='class', height=2.5)\n",
    "plt.suptitle('Pairwise Feature Relationships by Class', y=1.02, fontsize=16)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "{% endif %}\n",
    "{% endif %}"
   ]
  },
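  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'plotly' in libraries %}\n",
    "# Interactive scatter plot using Plotly\n",
    "{% if 'scikit-learn' in libraries %}\n",
    "# The 'species' column is created in the pairplot cell, which is only rendered\n",
    "# when matplotlib is selected, so create it here if it is missing\n",
    "if 'species' not in data_clean.columns:\n",
    "    data_clean['species'] = data_clean['target'].map(dict(enumerate(iris['target_names'])))\n",
    "\n",
    "fig = px.scatter_3d(data_clean,\n",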
    "                    x='sepal length (cm)', \n",
    "                    y='sepal width (cm)', \n",
    "                    z='petal length (cm)',\n",
    "                    color='species',\n",
    "                    symbol='species',\n",
    "                    title='3D Scatter Plot of Iris Features')\n",
    "{% else %}\n",
    "fig = px.scatter_3d(data_clean, \n",
    "                    x='sepal_length', \n",
    "                    y='sepal_width', \n",
    "                    z='petal_length',\n",
    "                    color='class',\n",
    "                    symbol='class',\n",
    "                    title='3D Scatter Plot of Iris Features')\n",
    "{% endif %}\n",
    "fig.update_layout(scene=dict(xaxis_title='Sepal Length',\n",
    "                            yaxis_title='Sepal Width',\n",
    "                            zaxis_title='Petal Length'))\n",
    "fig.show()\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Model Building and Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'scikit-learn' in libraries %}\n",
    "# Prepare data for modeling\n",
    "X = data_clean[iris['feature_names']].values\n",
    "y = data_clean['target'].values\n",
    "\n",
    "# Split data into training and test sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n",
    "\n",
    "# Standardize features\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "print(f\"Training set shape: {X_train.shape}\")\n",
    "print(f\"Test set shape: {X_test.shape}\")\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'scikit-learn' in libraries %}\n",
    "# Train a logistic regression model\n",
    "lr_model = LogisticRegression(max_iter=200, random_state=42)\n",
    "lr_model.fit(X_train_scaled, y_train)\n",
    "\n",
    "# Predictions\n",
    "y_pred_lr = lr_model.predict(X_test_scaled)\n",
    "\n",
    "# Evaluate the model\n",
    "print(\"Logistic Regression Model:\")\n",
    "print(f\"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}\")\n",
    "print(\"\\nClassification Report:\")\n",
    "print(classification_report(y_test, y_pred_lr, target_names=iris['target_names']))\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'scikit-learn' in libraries %}\n",
    "# Train a random forest model\n",
    "rf_model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
    "rf_model.fit(X_train_scaled, y_train)\n",
    "\n",
    "# Predictions\n",
    "y_pred_rf = rf_model.predict(X_test_scaled)\n",
    "\n",
    "# Evaluate the model\n",
    "print(\"Random Forest Model:\")\n",
    "print(f\"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}\")\n",
    "print(\"\\nClassification Report:\")\n",
    "print(classification_report(y_test, y_pred_rf, target_names=iris['target_names']))\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'scikit-learn' in libraries and 'matplotlib' in libraries %}\n",
    "# Visualize confusion matrix\n",
    "plt.figure(figsize=(12, 5))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "cm_lr = confusion_matrix(y_test, y_pred_lr)\n",
    "sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', \n",
    "            xticklabels=iris['target_names'], \n",
    "            yticklabels=iris['target_names'])\n",
    "plt.title('Logistic Regression Confusion Matrix')\n",
    "plt.ylabel('True Label')\n",
    "plt.xlabel('Predicted Label')\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "cm_rf = confusion_matrix(y_test, y_pred_rf)\n",
    "sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', \n",
    "            xticklabels=iris['target_names'], \n",
    "            yticklabels=iris['target_names'])\n",
    "plt.title('Random Forest Confusion Matrix')\n",
    "plt.ylabel('True Label')\n",
    "plt.xlabel('Predicted Label')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "{% if 'scikit-learn' in libraries and 'matplotlib' in libraries %}\n",
    "# Feature importance from Random Forest\n",
    "feature_importance = rf_model.feature_importances_\n",
    "feature_names = iris['feature_names']\n",
    "\n",
    "# Create a DataFrame for better visualization\n",
    "importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})\n",
    "importance_df = importance_df.sort_values('Importance', ascending=False)\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.barplot(x='Importance', y='Feature', data=importance_df, palette='viridis')\n",
    "plt.title('Feature Importance from Random Forest', fontsize=16)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "{% endif %}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Conclusions\n",
    "\n",
    "In this notebook, we've accomplished the following:\n",
    "\n",
    "1. Loaded and explored the Iris dataset\n",
    "2. Performed basic data preprocessing and feature engineering\n",
    "3. Conducted exploratory data analysis with various visualizations\n",
    "4. Built and evaluated classification models\n",
    "5. Analyzed feature importance to understand what drives predictions\n",
    "\n",
    "Key findings (confirm each against the outputs above before relying on it):\n",
    "- The three Iris species are generally well separated by the provided features; the pairplot shows how cleanly each class separates\n",
    "- Petal dimensions typically matter more for classification than sepal dimensions; compare this with the feature importance plot\n",
    "- Logistic Regression and Random Forest often score very similarly on this dataset, so compare their accuracy and classification reports rather than assuming one outperforms the other\n",
    "\n",
    "Next steps could include:\n",
    "- Hyperparameter tuning to optimize model performance\n",
    "- Trying additional classification algorithms\n",
    "- Investigating more advanced feature engineering techniques\n",
    "- Deploying the model to a production environment"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}