# Day 1: Introduction to Data Science, Python, and Git Basics

**Prepared By:** Dr. Kenechi Omeke  
**Date:** November 2024  

---

## Chapter Overview
Welcome to your journey into data science! This chapter will introduce you to the world of data-driven discovery, the power of Python, and the essentials of version control with Git. By the end, you’ll have a strong foundation to build on for the rest of the course and for your career.

## Aim
Introduce fundamental concepts of data science, the role of Python in data analysis, and basics of version control with Git.

## Intended Learning Outcomes
By the end of this chapter, you will be able to:
- Define data science and its significance in various industries.
- Set up a Python environment suitable for data analysis.
- Understand basic Git concepts and workflow.
- Initialize a Git repository and commit changes.
- Run and write Python code in Jupyter Notebooks.

## Why This Matters
Data science is changing how organizations in healthcare, finance, marketing, and agriculture make decisions, because it turns raw data into evidence they can act on. Python and Git are two of the most widely used tools for doing this work.

## Topics Covered
- What is Data Science?
- Python Basics (with code examples)
- Setting Up Your Python Environment (Anaconda, venv, Jupyter)
- Introduction to Git & Version Control
- Mini-Project: Your First Data Science Workflow

---

# 1. What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from data. It combines statistics, computer science, domain expertise, and communication skills to solve real-world problems.

**Key Components:**
- **Data Collection:** Gathering raw data from various sources (databases, sensors, web, etc.)
- **Data Cleaning:** Removing errors, handling missing values, and preparing data for analysis
- **Exploratory Data Analysis (EDA):** Summarizing and visualizing data to find patterns
- **Modeling:** Using statistical and machine learning models to make predictions or classifications
- **Interpretation & Communication:** Explaining results to stakeholders and making data-driven decisions

**Examples:**
- **Healthcare:** Predicting disease outbreaks, optimizing treatment plans, analyzing patient data for better outcomes.
- **Finance:** Fraud detection, credit risk assessment, micro-loans offered through mobile money platforms (e.g., M-Pesa in Kenya).
- **Marketing:** Personalized recommendations, customer segmentation (e.g., Jumia in Africa).
- **Agriculture:** Forecasting crop yields, optimizing irrigation using sensor data.

**Why learn data science?**
- High demand for data skills in the job market
- Enables evidence-based decisions
- Useful in research, business, government, and more

**Diagram:**
```
[Data Collection] → [Cleaning] → [EDA] → [Modeling] → [Interpretation]
```
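
The stages in the diagram can be sketched end to end on a tiny, made-up dataset. This is an illustrative sketch only: the readings, the helper names `clean`, `summarize`, and `predict_next`, and the mean-based "model" are assumptions chosen for brevity, not part of the course materials.

```python
from statistics import mean

# 1. Data collection: hypothetical daily temperature readings (None = missing).
raw_readings = [21.5, None, 23.0, 22.5, None, 24.0]

# 2. Cleaning: drop the missing values.
def clean(readings):
    return [r for r in readings if r is not None]

# 3. EDA: summarize the cleaned data.
def summarize(readings):
    return {"count": len(readings), "min": min(readings),
            "max": max(readings), "mean": mean(readings)}

# 4. Modeling: a deliberately naive model that predicts the mean.
def predict_next(readings):
    return mean(readings)

cleaned = clean(raw_readings)
summary = summarize(cleaned)
prediction = predict_next(cleaned)

# 5. Interpretation: report the result in plain language.
print(f"Used {summary['count']} of {len(raw_readings)} readings.")
print(f"Naive forecast for the next reading: {prediction:.1f}")
```

Real projects replace each step with richer tools (pandas for cleaning, scikit-learn for modeling), but the order of the steps stays the same.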

**Key Takeaway:**
Data science is about turning raw data into actionable insights that drive real change.

# 2. Python Basics for Data Science

Python is one of the most widely used languages for data science because of its simplicity, readability, and powerful libraries. Let’s review the essentials with hands-on code.

**Why Python?**
- Easy to learn and read
- Huge ecosystem of libraries (pandas, numpy, matplotlib, scikit-learn, etc.)
- Active community and lots of resources

**Best Practice:**
- Write clear, well-commented code
- Use descriptive variable names
- Test your code with small examples

**Let’s get started!**

```python
# Numbers, Strings, and Booleans
# These are the basic building blocks of Python data.
age = 25
price = 19.99
name = "John"
is_valid = True
print(f"Age: {age}, Price: {price}, Name: {name}, Valid: {is_valid}")
```

```python
# Lists and Dictionaries
# Lists store ordered collections; dictionaries store key-value pairs.
fruits = ["apple", "banana", "mango"]
student = {"name": "Amina", "age": 21, "major": "Statistics"}
print(f"Fruits: {fruits}")
print(f"Student: {student}")
```
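
A short sketch of how to read and update these two collections. The `fruits` and `student` values repeat the ones above so that the block runs on its own; the added `"orange"` item and the `"year"` and `"minor"` keys are illustrative.

```python
fruits = ["apple", "banana", "mango"]
student = {"name": "Amina", "age": 21, "major": "Statistics"}

# Lists are indexed by position, starting at 0.
first_fruit = fruits[0]
last_fruit = fruits[-1]     # negative indexes count from the end
fruits.append("orange")     # add an item to the end

# Dictionaries are indexed by key.
student_name = student["name"]
student["year"] = 3                    # add a new key-value pair
minor = student.get("minor", "none")   # .get() returns a default if the key is missing

print(first_fruit, last_fruit, len(fruits))
print(student_name, student["year"], minor)
```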

```python
# Conditionals and Loops
# Use loops to repeat actions, and conditionals to make decisions.
# This cell uses the `fruits` list defined in the previous cell.
for fruit in fruits:
    if fruit == "banana":
        print("Found a banana!")
    else:
        print(f"This is a {fruit}.")
```

```python
# Functions
# Functions let you reuse code and organize your logic.
def add_numbers(a, b):
    """Return the sum of two numbers."""
    return a + b

result = add_numbers(5, 10)
print(f"5 + 10 = {result}")
```
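
The advice to test your code with small examples can be put into practice with `assert` statements. The sketch below uses a hypothetical `average` function (not one of the course's functions) and checks it against answers that are easy to work out by hand.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)

# Small checks with known answers; an AssertionError means a check failed.
assert average([2, 4, 6]) == 4
assert average([10]) == 10

print("All checks passed.")
```

If a check fails, Python stops at that line, which points you straight at the case the function handles incorrectly.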

## Exercise: Python Practice

1. Create a variable called `country` and assign it your country name.
2. Write a function that takes two numbers and returns their product.
3. Make a list of three cities in your country and print each one using a loop.

**Reflection:**
- How does Python’s syntax compare to other languages you know?
- Why is it important to practice writing and running code yourself?

# 3. Setting Up Your Python Environment

A good environment makes coding easier and more reliable. You can use either **Anaconda** (recommended for beginners) or a standard Python virtual environment (venv).

**What is a Python environment?**
A self-contained directory that contains a Python installation plus a set of packages. This helps avoid conflicts and makes your work reproducible.
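
To see which environment your code is actually running in, you can ask Python itself. This is a minimal sketch using only the standard library; note that the `sys.prefix` comparison detects environments created with `venv`, and a conda environment will typically report `False` here even when it is active.

```python
import sys

# Path of the Python interpreter that is running this code.
print(f"Interpreter: {sys.executable}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")

# Inside a venv, sys.prefix points at the environment while
# sys.base_prefix points at the Python installation it was created from.
in_venv = sys.prefix != sys.base_prefix
print(f"Running inside a venv: {in_venv}")
```

Running this in a notebook cell is a quick way to confirm that Jupyter is using the environment you set up rather than a system-wide Python.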

## Option 1: Anaconda
- Download and install from [Anaconda's website](https://www.anaconda.com/products/distribution).
- Open Anaconda Navigator or use the terminal:
  ```bash
  conda create -n ds_env python=3.10
  conda activate ds_env
  conda install jupyter pandas numpy matplotlib seaborn scikit-learn scipy statsmodels sqlalchemy
  jupyter notebook
  ```

## Option 2: Python venv + pip
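- Open your terminal (Command Prompt on Windows) and run the commands below. Use only the activation line that matches your operating system. If you do not have a `requirements.txt` file yet, install Jupyter and the required packages listed below by name with `pip install` instead:
  ```bash
  python -m venv venv
  venv\Scripts\activate        # On Windows
  source venv/bin/activate     # On Mac/Linux
  pip install -r requirements.txt   # assumes requirements.txt is in the current folder
  jupyter notebook
  ```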

**Troubleshooting:**
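- If you see `'python' is not recognized`, install Python or add it to your PATH. On Mac/Linux, the command may be `python3` instead of `python`.
- If `pip` is not found, use `python -m pip`.
- If you get permission errors from `pip`, first check that your environment is activated; installing inside an activated environment should not require administrator rights.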

**Required packages:** jupyter, pandas, numpy, matplotlib, seaborn, scikit-learn, scipy, statsmodels, sqlalchemy

**Best Practice:**
- Always use a virtual environment for each project.
- Keep a `requirements.txt` file for reproducibility.
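
As a sketch of what a reproducible package list looks like, the code below reports the installed version of each required package in the `name==version` form that `requirements.txt` uses. The helper name `installed_versions` is an assumption for illustration; packages that are not installed are reported rather than causing an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as published on PyPI, taken from the required packages list.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn",
            "scikit-learn", "scipy", "statsmodels", "sqlalchemy"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    # Pinned lines of the form name==version are what requirements.txt contains.
    print(f"{name}=={ver}" if ver else f"# {name} is not installed")
```

In practice, `pip freeze > requirements.txt` writes the same kind of pinned list for everything installed in the active environment.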

**Key Takeaway:**
A reproducible environment is essential for collaboration and sharing your work.

# 4. Introduction to Jupyter Notebooks

Jupyter Notebooks let you write and run code, add notes, and visualize results interactively. They are the standard tool for data science education and research.

**Features:**
- Mix code and markdown (text) cells for explanations and results.
- Visualize data with plots and charts.
- Share your work as a notebook file or export to PDF/HTML.

**How to Start:**
- With your environment activated, run `jupyter notebook` in your terminal.
- Create a new notebook and try running the code examples above.
- Use Shift+Enter to run a cell.

**Best Practice:**
- Use markdown cells to explain your code and results.
- Save your notebook frequently.

**Diagram:**
```
[Markdown Cell: Explanation]
[Code Cell: Python code]
[Output: Results/Plots]
```
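
Under the hood, a notebook file (`.ipynb`) is a JSON document that stores this list of cells. The sketch below builds a minimal two-cell notebook as a Python dictionary, assuming version 4 of the notebook format; the cell contents are illustrative.

```python
import json

# A minimal notebook with one markdown cell and one code cell,
# following version 4 of the notebook format.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My first notebook\n", "Explanation goes here."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not run yet
            "outputs": [],            # filled in when the cell runs
            "source": ["print('Hello, Data Science!')"],
        },
    ],
}

# Saving this text to a file ending in .ipynb gives a file Jupyter can open.
notebook_json = json.dumps(notebook, indent=1)

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```

Because outputs are stored inside the file next to the code, a shared notebook shows your results even to readers who never run it.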

**Key Takeaway:**
Jupyter Notebooks make your work interactive, reproducible, and easy to share.

# 5. Introduction to Git & Version Control

Version control lets you track changes, collaborate, and revert to previous versions. **Git** is the most popular tool for this purpose.

**Why use version control?**
- Recover your work: return to any previously committed state
- Collaborate with others without overwriting each other’s changes
- Keep a history of your project’s evolution

**Key Concepts:**
- **Repository:** A project folder tracked by Git; its history is stored in a hidden `.git` folder inside it
- **Commit:** A snapshot of your project with a message
- **Branch:** A parallel version for new features or experiments
- **Merge:** Combine changes from different branches
- **GitHub/GitLab:** Online platforms for sharing and collaborating
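
One way to see why a commit can act as a reliable snapshot: Git names every stored file by a hash of its content, so identical content always gets the same ID and any change produces a new one. The sketch below reproduces the ID that `git hash-object` reports for a file's content. The helper name `git_blob_id` is an assumption for illustration, and the code assumes a repository that uses Git's default SHA-1 object format.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

original = git_blob_id(b"print('Hello, Data Science!')\n")
edited = git_blob_id(b"print('Hello, Git!')\n")
unchanged = git_blob_id(b"print('Hello, Data Science!')\n")

print(original)
print(edited)
print("Same content, same ID:", original == unchanged)
print("Changed content, new ID:", original != edited)
```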

**Essential Git Commands**
```bash
git init                # Start a new repo
git add filename        # Stage changes
git commit -m "Message" # Save a snapshot
git status              # See what's changed
git log                 # View history
```

**Mini-Exercise: Your First Git Repo**
1. Install Git from [git-scm.com](https://git-scm.com/).
2. Open a terminal in your project folder (on Windows, use Git Bash, which is installed with Git, so the commands behave the same on every system):
   ```bash
   git init
   git config --global user.name "Your Name"
   git config --global user.email "you@example.com"
   echo 'print("Hello, Data Science!")' > hello.py
   git add hello.py
   git commit -m "Initial commit: hello world script"
   ```
3. Check your commit history with `git log`.

**Best Practice:**
- Write clear commit messages (what/why, not just "update")
- Commit often, but only working code
- Use branches for new features or experiments

**Key Takeaway:**
Version control is essential for professional, collaborative, and reproducible data science.

# 6. Mini-Project: Your First Data Science Workflow

Let's put it all together! In this mini-project, you'll load a dataset, do some basic analysis, and plot a chart. This is the foundation of every data science project.

**Project Steps:**
1. Load a sample dataset (Iris)
2. Explore the data: view the first few rows, check data types
3. Calculate summary statistics
4. Visualize the data with a scatterplot
5. Document your process in markdown cells

**Reflection:**
- What did you find interesting or challenging?
- How would you explain your workflow to a friend?

```python
# Load a sample dataset (Iris)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset (downloaded on first use, so an internet connection is needed)
df = sns.load_dataset('iris')
df.head()
```

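```python
# Basic data exploration
print(df.dtypes)                      # data type of each column
print(df.describe())                  # summary statistics for numeric columns
print(df["species"].value_counts())   # number of rows per species
```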
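
To make the `describe()` output less of a black box, here is a sketch that computes the same kinds of statistics by hand with the standard library. The five values are illustrative sepal lengths in centimetres, not the full dataset, and `describe_values` is a hypothetical helper name.

```python
import statistics

def describe_values(values):
    """Compute count, mean, sample std, min, quartiles, and max."""
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]  # illustrative values
stats = describe_values(sepal_lengths)
for label, value in stats.items():
    print(f"{label:>5}: {value:.2f}")
```

Comparing a hand calculation like this with the library's output on a few rows is a useful habit whenever a summary number looks surprising.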

```python
# Simple visualization
sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="species")
plt.title("Iris Sepal Length vs Width")
plt.show()
```

## Reflection & Next Steps

- What did you find interesting or challenging today?
- Try changing the dataset or plot variables above.
- Explore more with Pandas and Seaborn documentation.

---

## Key Takeaways
- Data science is about extracting insights from data to solve real problems.
- Python and Jupyter are your main tools for analysis and communication.
- Version control with Git is essential for collaboration and reproducibility.
- Practice, curiosity, and clear documentation are your best friends.

## Further Reading & Resources
- Wes McKinney, *Python for Data Analysis*, O'Reilly Media
- Scott Chacon and Ben Straub, *Pro Git*, Apress ([Pro Git Book](https://git-scm.com/book/en/v2))
- [Pandas Documentation](https://pandas.pydata.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Jupyter Documentation](https://jupyter.org/documentation)
- [GitHub Guides](https://guides.github.com/)

---

**Congratulations! You've completed your first chapter in data science.**