HOW TO SET UP YOUR JUPYTER NOTEBOOK

Step 1: Install Anaconda (if you have not already)
– Go to: https://www.anaconda.com/download
– Download the installer for your operating system (Windows, macOS, or Linux).
– Run the installer and accept the default options.
– This will install Python and Jupyter Notebook together.

Step 2: Launch Jupyter
Option A: Using Anaconda Navigator
– Open “Anaconda Navigator”.
– Click on “Jupyter Notebook” or “JupyterLab”.
– A browser window will open (usually at http://localhost:8888).

Option B: Using the terminal or command prompt
– Open a terminal (macOS/Linux) or Anaconda Prompt (Windows).
– Type: jupyter notebook
– Press Enter.
– A browser window will open.

Step 3: Create a new notebook
– In the browser interface, navigate to a folder where you want to save your work.
– Click “New” → “Python 3 (ipykernel)” or similar.
– A new, empty notebook will open in a new tab.

Step 4: Rename the notebook
– At the top of the notebook, click the name (e.g., “Untitled”).
– Rename it to: Week01_Intro_DS.ipynb
– Press Enter to save the new name.

Step 5: Learn the basics of cells
– A Jupyter Notebook is made up of “cells”.
– Code cells: you write Python code and run it.
– Markdown cells: you write text (notes, headings, explanations).
– To run the current cell: press Shift + Enter.
– To add a new cell: use the “+” button or Insert menu.
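
To try out a code cell, a short check like the one below can be typed into the first cell and run with Shift + Enter. It is only a sketch for confirming that the notebook is connected to a Python 3 kernel; the variable name python_version is chosen here for illustration.

```python
import sys

# Build a short "major.minor" string such as "3.11" from the running interpreter.
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
print("Python version:", python_version)
```

If the cell prints a version that starts with 3, the kernel is working.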

SECTION 1 – WHAT IS DATA SCIENCE? FROM BUSINESS QUESTIONS TO ANALYTICS PROBLEMS

What is Data Science?
• Use data to answer questions and support decisions.  
• Combine:
  – Domain knowledge (the real-world context)  
  – Data and statistics  
  – Computing and programming  
  – Communication and visualization  

Goal: Turn questions into useful, evidence-based answers.

Examples of Data Science

• Healthcare  
  – Predict which patients are at high risk of readmission.  

• Operations  
  – Forecast daily patient volume to plan staffing.  

• E-commerce  
  – Recommend products to customers.  

• Marketing  
  – Measure which ads lead to purchases, not just clicks.

BUSINESS QUESTIONS VS ANALYTICS PROBLEMS

From Business Questions → Analytics Problems

Business question (vague):
• “Why are our patients waiting so long?”

Analytics problem (concrete):
• “Identify which clinics and hours of the day have high average wait times, using past visit data.”

We must translate vague questions into precise problems.

Make the Analytics Problem Explicit

Specify:
• Target: what are we predicting or estimating?  
• Unit: who or what gets a value?  
• Time horizon: over what time frame?  
• Success metric: how do we judge “good”?  
• Constraints: time, budget, privacy, ethics.

Clinic Example:
• Target: average wait_minutes  
• Unit: clinic-hour (clinic and hour of day)  
• Time horizon: next month’s schedule  
• Success metric: identify high-wait periods accurately  
• Constraints: use existing electronic health record data
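
One way to keep these five parts explicit is to write them down as a small record and check that none is left blank. The sketch below is an illustration only: the AnalyticsProblem class and its missing_parts method are hypothetical names, not part of any library.

```python
from dataclasses import dataclass, fields


@dataclass
class AnalyticsProblem:
    """Hypothetical record holding the five parts of an analytics problem."""
    target: str
    unit: str
    time_horizon: str
    success_metric: str
    constraints: str

    def missing_parts(self):
        # Return the names of any parts that were left empty.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# The clinic example from the list above.
clinic_problem = AnalyticsProblem(
    target="average wait_minutes",
    unit="clinic-hour (clinic and hour of day)",
    time_horizon="next month's schedule",
    success_metric="identify high-wait periods accurately",
    constraints="use existing electronic health record data",
)

# A vague business question typically fills in only the target.
vague_problem = AnalyticsProblem("patient waits", "", "", "", "")

print(clinic_problem.missing_parts())  # []
print(vague_problem.missing_parts())   # ['unit', 'time_horizon', 'success_metric', 'constraints']
```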



CRISP-DM STAGES WITH FAKE DATASET clinic_waits.csv


Example Dataset: clinic_waits.csv

Columns (per visit):
• visit_id – unique ID for the visit  
• patient_id – unique ID for the patient  
• clinic – e.g., "North", "South", "Downtown"  
• provider_type – e.g., "MD", "NP", "PA"  
• scheduled_time – scheduled appointment time  
• arrival_time – when patient arrived  
• seen_time – when patient first saw provider  
• visit_type – e.g., "Urgent", "Routine", "Follow-up"

Derived column:
• wait_minutes = seen_time – arrival_time (in minutes)
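
The derived column can be sketched with the standard library as shown below. The timestamp format is an assumption (the notes do not say how clinic_waits.csv stores times), and the example visit is made up.

```python
from datetime import datetime

# Assumed timestamp format; adjust to match the real file.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def wait_minutes(arrival_time: str, seen_time: str) -> float:
    """Minutes between arrival and first contact with the provider."""
    arrival = datetime.strptime(arrival_time, TIME_FORMAT)
    seen = datetime.strptime(seen_time, TIME_FORMAT)
    return (seen - arrival).total_seconds() / 60


# Made-up visit: arrived 09:05, seen 09:32.
example_wait = wait_minutes("2024-01-08 09:05", "2024-01-08 09:32")
print(example_wait)  # 27.0
```

A negative result means seen_time is earlier than arrival_time, which usually signals a data entry problem rather than a real wait.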

CRISP-DM: Data Science Project Lifecycle
Cross-Industry Standard Process for Data Mining 

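Stages (as covered in this section):
1. Business Understanding  
2. Data Understanding  
3. Modeling  
4. Evaluation  
5. Deployment  

Note: the standard CRISP-DM model has six stages. Data Preparation (cleaning data and building features) normally sits between Data Understanding and Modeling; in these notes its tasks are covered under Modeling.

The stages are not strictly sequential: we move back and forth between them.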

STAGE 1: BUSINESS UNDERSTANDING

Business question:
• “Why are our patients waiting so long?”

Analytics problem:
• Identify clinics and hours with high average wait_minutes.

Clarify:
• Stakeholders: clinic managers, scheduling staff  
• Goal: see patterns in wait times to adjust staffing/scheduling  
• Constraints: only use existing EHR data, finish in 4 weeks

STAGE 2: DATA UNDERSTANDING

Using clinic_waits.csv

Questions:
• What columns exist? What do they mean?  
• How many rows (how many visits)?  
• Any missing or invalid arrival_time or seen_time?  
• Any negative wait_minutes?

Simple summaries:
• Overall average wait_minutes  
• Average wait_minutes by clinic  
• Average wait_minutes by hour_of_day
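
The checks and summaries above can be sketched as follows. This is a minimal illustration using only the standard library; in practice a table library such as pandas would likely be used. The rows are made up and stand in for clinic_waits.csv, the timestamp format is assumed, and hour_of_day is taken from arrival_time, which is also an assumption.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up rows standing in for clinic_waits.csv (illustration only).
SAMPLE_CSV = """visit_id,clinic,arrival_time,seen_time
1,North,2024-01-08 09:05,2024-01-08 09:35
2,North,2024-01-08 09:40,2024-01-08 09:50
3,South,2024-01-08 14:10,2024-01-08 14:30
4,South,2024-01-08 14:20,
5,Downtown,2024-01-08 10:30,2024-01-08 10:15
"""
FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

valid_waits = []  # (clinic, hour_of_day, wait_minutes)
missing = 0       # rows with a blank arrival_time or seen_time
negative = 0      # rows where seen_time is before arrival_time
for row in rows:
    if not row["arrival_time"] or not row["seen_time"]:
        missing += 1
        continue
    arrival = datetime.strptime(row["arrival_time"], FMT)
    seen = datetime.strptime(row["seen_time"], FMT)
    wait = (seen - arrival).total_seconds() / 60
    if wait < 0:
        negative += 1
        continue
    valid_waits.append((row["clinic"], arrival.hour, wait))

# Group the valid waits by clinic and by hour of day.
by_clinic = defaultdict(list)
by_hour = defaultdict(list)
for clinic, hour, wait in valid_waits:
    by_clinic[clinic].append(wait)
    by_hour[hour].append(wait)

overall_avg = mean(wait for _, _, wait in valid_waits)
avg_by_clinic = {clinic: mean(waits) for clinic, waits in by_clinic.items()}
avg_by_hour = {hour: mean(waits) for hour, waits in by_hour.items()}

print("rows:", len(rows), "missing:", missing, "negative:", negative)
print("overall:", overall_avg)
print("by clinic:", avg_by_clinic)
print("by hour:", avg_by_hour)
```

Rows with missing or negative waits are counted and excluded here; whether to drop, correct, or investigate them is a decision for the project, not something this sketch settles.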



STAGE 3: MODELING

Possible goals:
• Predict wait_minutes for future visits.  
• Classify clinic-hour periods as “high_wait” or “normal”.

Steps:
• Create features:
  – clinic, hour_of_day, day_of_week, visit_type, provider_type, etc.  
• Create labels:
  – For classification: high_wait = 1 if wait_minutes > threshold.  
• Train models:
  – Start with simple models, then more complex if needed.
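
As a sketch of "start with simple models", the example below builds the high_wait label and a baseline that predicts the average wait of each clinic-hour group. The 30-minute threshold, the training rows, and the function names are all hypothetical choices for illustration; a real project would likely compare this baseline against models from a library such as scikit-learn.

```python
from collections import defaultdict
from statistics import mean

HIGH_WAIT_THRESHOLD = 30  # minutes; hypothetical cutoff

# Made-up training visits: (clinic, hour_of_day, wait_minutes).
train_visits = [
    ("North", 9, 45.0),
    ("North", 9, 35.0),
    ("North", 14, 10.0),
    ("South", 9, 20.0),
    ("South", 9, 30.0),
]

# Labels for classification: 1 if the wait exceeds the threshold, else 0.
labels = [1 if wait > HIGH_WAIT_THRESHOLD else 0 for _, _, wait in train_visits]

# Baseline "model": the mean wait observed for each clinic-hour group.
groups = defaultdict(list)
for clinic, hour, wait in train_visits:
    groups[(clinic, hour)].append(wait)
group_means = {key: mean(waits) for key, waits in groups.items()}
overall_mean = mean(wait for _, _, wait in train_visits)


def predict_wait(clinic, hour):
    """Predict the group mean; fall back to the overall mean for unseen groups."""
    return group_means.get((clinic, hour), overall_mean)


def predict_high_wait(clinic, hour):
    """Classify a clinic-hour as high_wait (1) or normal (0)."""
    return 1 if predict_wait(clinic, hour) > HIGH_WAIT_THRESHOLD else 0


print(labels)                          # [1, 1, 0, 0, 0]
print(predict_wait("North", 9))        # 40.0
print(predict_wait("Downtown", 9))     # 28.0 (unseen group, overall mean)
print(predict_high_wait("South", 9))   # 0
```

A baseline like this gives a reference point: a more complex model is only worth its cost if it clearly beats the group averages.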


STAGE 4: EVALUATION

Questions:
• How well does the model perform on unseen test data?  
• Does it meet the business goal?  
• Where does it perform poorly?

Metrics:
• Regression: MAE, RMSE on wait_minutes.  
• Classification: accuracy, precision, recall, F1 for high_wait.
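
The metrics above can be computed by hand, as in the sketch below. The actual and predicted values are made up for illustration, and the function names are chosen here; libraries such as scikit-learn provide equivalent functions.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = high_wait)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up test-set values.
actual_waits = [40.0, 10.0, 25.0, 35.0]
predicted_waits = [30.0, 20.0, 25.0, 45.0]
actual_high = [1, 0, 0, 1]
predicted_high = [1, 1, 0, 0]

regression_mae = mae(actual_waits, predicted_waits)
regression_rmse = rmse(actual_waits, predicted_waits)
high_wait_metrics = classification_metrics(actual_high, predicted_high)

print(regression_mae)     # 7.5
print(regression_rmse)
print(high_wait_metrics)
```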

Decision:
• Refine model or move to deployment.


STAGE 5: DEPLOYMENT

For clinic_waits:

Examples:
• Weekly report: average wait_minutes by clinic and hour.  
• Dashboard: interactive view of wait times and trends.  
• Batch job: flag high_wait hours for upcoming week.

Also:
• Monitor performance and data drift.  
• Retrain model periodically.  
• Document design, assumptions, and limitations.
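
One crude way to watch for data drift is to compare recent wait times with the data the model was trained on. The sketch below measures how far the recent mean has moved, in units of the baseline standard deviation. The numbers, the function name, and the alert threshold are all hypothetical; real monitoring would usually look at more than the mean.

```python
from statistics import mean, stdev


def mean_shift_in_std_units(baseline, recent):
    """Distance between the recent and baseline means, in baseline standard deviations."""
    return (mean(recent) - mean(baseline)) / stdev(baseline)


# Made-up wait times in minutes.
baseline_waits = [10.0, 20.0, 30.0]   # data used to train the model
recent_waits = [35.0, 45.0]           # data seen after deployment

ALERT_THRESHOLD = 1.0  # hypothetical cutoff for flagging a review

shift = mean_shift_in_std_units(baseline_waits, recent_waits)
needs_review = abs(shift) > ALERT_THRESHOLD

print(shift)
print(needs_review)  # True
```

A flagged shift is a prompt to investigate and possibly retrain, not proof that the model has stopped working.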



Why Python?

• Easy-to-read syntax  
• Widely used in data science  
• Strong library ecosystem:
  – pandas (tables)  
  – NumPy (numerical)  
  – scikit-learn (ML)  
  – matplotlib (plots)

We will use Python for:
• Data cleaning  
• Exploration and visualization  
• Modeling and evaluation

What is a Jupyter Notebook?

• A document made of “cells”  
• Code cells:
  – Run Python code and show results  
• Markdown cells:
  – Write notes, headings, instructions

Basic actions:
• Run current cell: Shift + Enter  
• Change cell type (Code / Markdown)  
• Insert new cell: toolbar or menu  
• Save notebook: Ctrl+S (Cmd+S on macOS) or File → Save

Create and Name Your Notebook

1. Open Jupyter.  
2. Navigate to your course folder.  
3. Click “New” → “Python 3” (or similar).  
4. Rename the notebook:
   – Week01_Intro_DS.ipynb  

First markdown cell:
“Week 1 – What is Data Science? CRISP-DM, Python Basics”


In [3]:

message = 'Hello, data science!'  
print(message)

# Expected output:
# Hello, data science!

Hello, data science!


In [None]:
# Basic Python data types

# Integers (int): whole numbers
x = 10
y = -3

# Floats (float): numbers with a decimal point
pi = 3.14159
price = 19.99

# Strings (str): text in single or double quotes
name = 'Ana'
status = "Checked in"

# Booleans (bool): True or False
is_active = True
is_late = False
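
To see which type a value has, the built-in type() function can be used, and int(), float(), and str() convert between types. The short sketch below repeats a few of the values so that it runs on its own; the names visits and label are chosen here for illustration.

```python
x = 10
pi = 3.14159
name = "Ana"
is_active = True

# type(value).__name__ gives the type's short name as a string.
print(type(x).__name__)          # int
print(type(pi).__name__)         # float
print(type(name).__name__)       # str
print(type(is_active).__name__)  # bool

# Converting between types.
visits = int("42") + 1            # text to integer, then add: 43
label = "Visits: " + str(visits)  # integer to text: 'Visits: 43'
print(label)
```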