# Session 1: Python Fundamentals & Environment Setup
## Week 1 - Data Science & Machine Learning Training Programme

**Learning Objectives:**
- Master Python fundamentals for data science
- Set up a professional development environment
- Explore the data science library ecosystem
- Configure Git and GitHub for portfolio development

---

## Part 1: Python Fundamentals Review (45 minutes)

Let's start with a comprehensive review of Python fundamentals essential for data science.

### 1.1 Data Types and Variables

In [1]:
# Basic data types in Python
# Integers
age = 25
print(f"Age: {age}, Type: {type(age)}")

# Floats
height = 5.8
print(f"Height: {height}, Type: {type(height)}")

# Strings
name = "Data Scientist"
print(f"Name: {name}, Type: {type(name)}")

# Booleans
is_student = True
print(f"Is Student: {is_student}, Type: {type(is_student)}")

Age: 25, Type: <class 'int'>
Height: 5.8, Type: <class 'float'>
Name: Data Scientist, Type: <class 'str'>
Is Student: True, Type: <class 'bool'>
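
Values read from files or forms usually arrive as strings, so explicit type conversion comes up early in data work. The following is a minimal sketch; the variable names (`raw_age`, `raw_height`, and so on) are illustrative and not part of the session's own code.

```python
# Values read from a CSV file or a web form typically arrive as strings
raw_age = "25"
raw_height = "5.8"

# Convert explicitly before doing arithmetic
age_value = int(raw_age)
height_value = float(raw_height)

print(age_value + 1)      # integer arithmetic
print(height_value * 2)   # float arithmetic

# isinstance() is the usual way to check a type inside code
print(isinstance(age_value, int))
print(isinstance(height_value, float))
```

Calling `int()` on a string such as `"5.8"` raises a `ValueError`, so decimal text has to go through `float()` first.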


In [None]:
# Lists - Ordered, mutable collections
numbers = [1, 2, 3, 4, 5]
mixed_list = [1, "hello", 3.14, True]

print("Numbers:", numbers)
print("Mixed list:", mixed_list)
print("First element:", numbers[0])
print("Last element:", numbers[-1])

# List operations
numbers.append(6)
print("After append:", numbers)

# List comprehension (very important for data science)
squares = [x**2 for x in numbers]
print("Squares:", squares)
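
A list comprehension can also filter while it transforms, and slicing extracts sub-lists without a loop. A short sketch, using an illustrative list named `values`:

```python
values = [1, 2, 3, 4, 5, 6]

# Keep only the even values, then square them
even_squares = [x**2 for x in values if x % 2 == 0]
print("Even squares:", even_squares)

# Slicing returns a new list and leaves the original unchanged
first_three = values[:3]
last_two = values[-2:]
print("First three:", first_three)
print("Last two:", last_two)
```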

In [None]:
# Dictionaries - Key-value pairs (essential for data science)
student_data = {
    'name': 'John Doe',
    'age': 23,
    'grades': [85, 90, 78, 92],
    'is_graduate': False
}

print("Student data:", student_data)
print("Student name:", student_data['name'])
print("Average grade:", sum(student_data['grades']) / len(student_data['grades']))

# Adding new key-value pair
student_data['major'] = 'Data Science'
print("Updated data:", student_data)
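
Looking up a key that does not exist raises a `KeyError`, which is a common stumbling block with incomplete records. The sketch below shows `.get()` with a default value, iteration with `.items()`, and a dictionary comprehension. The `student` dictionary and the `test_N` labels are illustrative assumptions.

```python
student = {'name': 'John Doe', 'age': 23, 'grades': [85, 90, 78, 92]}

# student['email'] would raise KeyError; .get() returns a default instead
email = student.get('email', 'not provided')
print("Email:", email)

# Iterate over key-value pairs
for key, value in student.items():
    print(f"{key}: {value}")

# Dictionary comprehension: label each grade by its position
grade_by_test = {f"test_{i}": g for i, g in enumerate(student['grades'], start=1)}
print(grade_by_test)
```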

### 1.2 Control Structures

In [None]:
# Conditional statements
score = 85

if score >= 90:
    grade = 'A'
elif score >= 80:
    grade = 'B'
elif score >= 70:
    grade = 'C'
else:
    grade = 'F'

print(f"Score: {score}, Grade: {grade}")

# Conditional expression, Python's ternary operator (useful for data cleaning)
status = "Pass" if score >= 60 else "Fail"
print(f"Status: {status}")

In [None]:
# Loops - Essential for data processing
# For loop with range
print("Numbers 1 to 5:")
for i in range(1, 6):
    print(i, end=" ")
print()

# For loop with list
fruits = ['apple', 'banana', 'orange', 'grape']
print("\nFruits:")
for fruit in fruits:
    print(f"- {fruit}")

# Enumerate (very useful in data science)
print("\nFruits with index:")
for index, fruit in enumerate(fruits):
    print(f"{index}: {fruit}")
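
Alongside `enumerate`, the built-in `zip` is handy when two lists need to be walked in step. A small sketch, with made-up prices used purely for illustration:

```python
fruit_names = ['apple', 'banana', 'orange']
prices = [0.50, 0.25, 0.75]  # illustrative values, not real data

# zip pairs up elements by position and stops at the shorter list
for name, price in zip(fruit_names, prices):
    print(f"{name}: {price:.2f}")

# Two parallel lists can be turned into a lookup dictionary
price_lookup = dict(zip(fruit_names, prices))
print(price_lookup)
```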

In [None]:
# While loop example
count = 0
total = 0

while count < 5:
    count += 1
    total += count
    print(f"Count: {count}, Total: {total}")

print(f"Final total: {total}")

### 1.3 Functions - Building Reusable Code

In [None]:
# Basic function definition
def calculate_average(numbers):
    """
    Calculate the average of a list of numbers.
    
    Args:
        numbers (list): List of numerical values
    
    Returns:
        float: Average of the numbers
    """
    if not numbers:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(numbers) / len(numbers)

# Test the function
test_scores = [85, 90, 78, 92, 88]
avg_score = calculate_average(test_scores)
print(f"Test scores: {test_scores}")
print(f"Average score: {avg_score:.2f}")

In [None]:
# Function with default parameters
def clean_text(text, remove_spaces=True, to_lowercase=True):
    """
    Clean text data (useful for NLP tasks).
    
    Args:
        text (str): Input text
        remove_spaces (bool): Whether to remove extra spaces
        to_lowercase (bool): Whether to convert to lowercase
    
    Returns:
        str: Cleaned text
    """
    if remove_spaces:
        text = ' '.join(text.split())
    
    if to_lowercase:
        text = text.lower()
    
    return text

# Test the function
messy_text = "  Hello   WORLD   with   Extra    Spaces  "
cleaned = clean_text(messy_text)
print(f"Original: '{messy_text}'")
print(f"Cleaned: '{cleaned}'")

In [None]:
# Lambda functions (useful for data transformations)
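# Regular function
def square(x):
    return x ** 2

# Lambda equivalent (assigned to a name here only for comparison;
# lambdas are normally passed inline, as in the map/filter calls below)
square_lambda = lambda x: x ** 2
print(f"square(4) = {square(4)}, square_lambda(4) = {square_lambda(4)}")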

numbers = [1, 2, 3, 4, 5]

# Using map with lambda
squared_numbers = list(map(lambda x: x ** 2, numbers))
print(f"Original: {numbers}")
print(f"Squared: {squared_numbers}")

# Using filter with lambda
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
print(f"Even numbers: {even_numbers}")
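
The same `map` and `filter` results can be written as list comprehensions, which many Python programmers find easier to read. Lambdas remain common as the `key` argument when sorting. A brief sketch, where `nums` and `records` are illustrative names:

```python
nums = [1, 2, 3, 4, 5]

# Comprehension equivalents of the map and filter calls
squared_lc = [x ** 2 for x in nums]
evens_lc = [x for x in nums if x % 2 == 0]
print(f"Squared: {squared_lc}")
print(f"Even numbers: {evens_lc}")

# A lambda as a sort key: order (name, age) records by age
records = [('Alice', 25), ('Bob', 30), ('Charlie', 22)]
by_age = sorted(records, key=lambda rec: rec[1])
print(f"Sorted by age: {by_age}")
```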

## Part 2: Development Environment Setup (60 minutes)

Let's set up a professional data science environment.

### 2.1 Jupyter Notebook Best Practices

#### Markdown Formatting in Jupyter

**Headers:**
# Header 1
## Header 2
### Header 3

**Text formatting:**
- *Italic text*
- **Bold text**
- ***Bold and italic***
- `Code inline`

**Lists:**
1. Numbered item 1
2. Numbered item 2

- Bullet point 1
- Bullet point 2

**Mathematical equations:**
- Inline math: $y = mx + b$
- Block math: $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

**Code blocks:**
```python
print("This is a code block")
```

In [None]:
# Jupyter Magic Commands
# These are special commands that start with % (line magic) or %% (cell magic)

# Time execution of a single line
%timeit sum(range(100))

# Get current working directory
%pwd

In [None]:
%%timeit
# Cell magics such as %%timeit must be the first line of the cell
# This times the entire cell
total = 0
for i in range(1000):
    total += i

In [None]:
# Useful magic commands for data science
# List all magic commands
%lsmagic

### 2.2 Introduction to Data Science Libraries

In [None]:
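# Check if libraries are installed
import importlib

# Map each pip package name to its import name (the two can differ)
libraries = {
    'numpy': 'numpy',
    'pandas': 'pandas',
    'matplotlib': 'matplotlib',
    'seaborn': 'seaborn',
    'scikit-learn': 'sklearn',
}

for package, module in libraries.items():
    try:
        importlib.import_module(module)
        print(f"✓ {package} is installed")
    except ImportError:
        print(f"✗ {package} is NOT installed")
        print(f"  Install with: pip install {package}")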
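
A package's install name and import name are not always the same: the package installed as `scikit-learn` is imported as `sklearn`. One possible way to check availability without actually importing anything is `importlib.util.find_spec`. The helper name `is_installed` below is an assumption made for this sketch.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

# 'json' ships with Python, so it should always be found
print("json:", is_installed("json"))

# Use the import name, not the pip name, for third-party packages
print("sklearn:", is_installed("sklearn"))
```

Because nothing is imported, this check is quick even for large libraries, but it only confirms that the module can be found, not that it loads without errors.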

In [None]:
# Standard imports for data science (always use these conventions)
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Print versions
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

### 2.3 Quick Library Demonstration

In [None]:
# NumPy demonstration
print("=== NumPy Demo ===")
# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([6, 7, 8, 9, 10])

print(f"Array 1: {arr1}")
print(f"Array 2: {arr2}")
print(f"Sum: {arr1 + arr2}")
print(f"Mean of Array 1: {np.mean(arr1)}")
print(f"Standard deviation of Array 1: {np.std(arr1)}")  # population SD (ddof=0)

# 2D array
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"\n2D Array:\n{matrix}")
print(f"Shape: {matrix.shape}")
print(f"Sum of each column: {np.sum(matrix, axis=0)}")

In [None]:
# Pandas demonstration
print("=== Pandas Demo ===")
# Create a simple dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['London', 'Paris', 'Berlin', 'Madrid', 'Rome'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

print(f"\nDataFrame shape: {df.shape}")
print(f"Column names: {list(df.columns)}")
print(f"\nBasic statistics:")
print(df.describe())
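
One detail worth knowing when comparing the two demos: `np.std` computes the population standard deviation by default (`ddof=0`), while the `std` row reported by `df.describe()` is the sample standard deviation (`ddof=1`). The sketch below illustrates the difference with the standard library's `statistics` module, chosen here only so that it runs without extra packages.

```python
import statistics

values = [1, 2, 3, 4, 5]

# Divides by n, matching NumPy's default np.std(values)
population_sd = statistics.pstdev(values)

# Divides by n - 1, matching pandas' default Series.std()
sample_sd = statistics.stdev(values)

print(f"Population SD: {population_sd:.4f}")
print(f"Sample SD:     {sample_sd:.4f}")
```

On small datasets the two values differ noticeably, so it helps to state which one a report uses.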

In [None]:
# Simple visualisation with matplotlib
print("=== Matplotlib Demo ===")
plt.figure(figsize=(10, 4))

# Subplot 1: Line plot
plt.subplot(1, 2, 1)
x = range(len(df))
plt.plot(x, df['Age'], 'bo-', label='Age')
plt.xlabel('Person Index')
plt.ylabel('Age')
plt.title('Age by Person')
plt.legend()

# Subplot 2: Bar plot
plt.subplot(1, 2, 2)
plt.bar(df['Name'], df['Salary'], color='skyblue')
plt.xlabel('Name')
plt.ylabel('Salary')
plt.title('Salary by Person')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Part 3: Git and GitHub Setup (45 minutes)

Version control is essential for any data science project.

### 3.1 Git Basics

Git is a distributed version control system. Here are the essential commands:

**Basic Git Workflow:**
1. `git init` - Initialize a repository
2. `git add .` - Stage all changes
3. `git commit -m "message"` - Commit changes
4. `git push` - Push to a remote repository (a repository created with `git init` first needs a remote, e.g. `git remote add origin <url>`)

**Common Git Commands:**
- `git status` - Check repository status
- `git log` - View commit history
- `git branch` - List branches
- `git checkout -b new-branch` - Create and switch to new branch
- `git merge branch-name` - Merge branches

In [None]:
# Check if git is installed
import subprocess

try:
    result = subprocess.run(['git', '--version'],
                            capture_output=True, text=True, check=True)
    print(f"✓ Git is installed: {result.stdout.strip()}")
except (subprocess.CalledProcessError, FileNotFoundError):
    print("✗ Git is not installed or not in PATH")
    print("Please install Git from: https://git-scm.com/")
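
A lighter-weight alternative, sketched here under the assumption that only the presence of the executable matters, is `shutil.which`, which searches `PATH` without running the program. The helper name `find_tool` is hypothetical.

```python
import shutil

def find_tool(name):
    """Return the full path to an executable on PATH, or None if it is absent."""
    return shutil.which(name)

git_path = find_tool("git")
if git_path:
    print(f"git found at: {git_path}")
else:
    print("git not found on PATH")
```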

### 3.2 GitHub Repository Structure for Data Science

A well-structured repository is crucial for portfolio development:

```
project-name/
│
├── README.md                 # Project description and setup
├── requirements.txt          # Python dependencies
├── .gitignore               # Files to ignore
├── data/                    # Data files
│   ├── raw/                 # Raw, original data
│   ├── processed/           # Cleaned data
│   └── external/            # External data sources
├── notebooks/               # Jupyter notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_data_cleaning.ipynb
│   └── 03_modeling.ipynb
├── src/                     # Source code
│   ├── __init__.py
│   ├── data_processing.py
│   └── models.py
├── models/                  # Trained models
├── reports/                 # Generated reports
│   └── figures/             # Generated graphics
└── docs/                    # Documentation
```

### 3.3 Creating Your First Data Science Repository

**Step-by-step guide:**

1. **Create a new repository on GitHub:**
   - Go to GitHub.com
   - Click "New repository"
   - Name it "ds-ml-training-portfolio"
   - Add description: "Portfolio of data science and machine learning projects"
   - Make it public
   - Initialize with README

2. **Clone to your local machine:**
   ```bash
   git clone https://github.com/YOUR_USERNAME/ds-ml-training-portfolio.git
   cd ds-ml-training-portfolio
   ```

3. **Set up the directory structure:**
   ```bash
   mkdir -p data/{raw,processed,external}
   mkdir -p notebooks
   mkdir -p src
   mkdir -p models
   mkdir -p reports/figures
   mkdir -p docs
   ```

4. **Create essential files:**
   - `.gitignore` (for Python projects)
   - `requirements.txt` (list of dependencies)
   - Update `README.md`
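
The `mkdir -p data/{raw,processed,external}` command in step 3 relies on brace expansion, which is not available in every shell (Windows Command Prompt, for example). A cross-platform sketch using `pathlib` follows. The function name `create_structure` and the `.gitkeep` placeholder files are assumptions for illustration: Git does not track empty directories, and `.gitkeep` is a common convention rather than a Git feature.

```python
import tempfile
from pathlib import Path

# Folder layout taken from section 3.2
SUBDIRS = [
    "data/raw", "data/processed", "data/external",
    "notebooks", "src", "models", "reports/figures", "docs",
]

def create_structure(root):
    """Create the project folders under root and return the created paths."""
    root = Path(root)
    created = []
    for sub in SUBDIRS:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        # Placeholder file so Git can track the otherwise empty folder
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created

# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    paths = create_structure(tmp)
    print(f"Created {len(paths)} folders")
```

To apply it to a real project, call `create_structure(".")` from inside the cloned repository.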

## Part 4: Practical Exercises (30 minutes)

### Exercise 1: Data Structure Practice

In [None]:
# Exercise 1: Create a student database
# TODO: Create a list of dictionaries representing students
# Each student should have: name, age, grades (list), major

students = [
    # Add your student data here
]

# TODO: Write functions to:
# 1. Calculate average grade for each student
# 2. Find the student with highest average
# 3. Group students by major

def calculate_student_average(student):
    """Calculate average grade for a student."""
    # Your code here
    pass

def find_top_student(students):
    """Find student with highest average grade."""
    # Your code here
    pass

def group_by_major(students):
    """Group students by their major."""
    # Your code here
    pass

# Test your functions here

### Exercise 2: Basic Data Analysis

In [None]:
# Exercise 2: Sales data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create sample sales data
np.random.seed(42)
sales_data = {
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'sales': np.random.randint(1000, 5000, 12),
    'expenses': np.random.randint(500, 2000, 12)
}

df_sales = pd.DataFrame(sales_data)
print("Sales Data:")
print(df_sales)

# TODO: Complete the following tasks:
# 1. Calculate profit for each month (sales - expenses)
# 2. Find the month with highest profit
# 3. Calculate total sales and expenses for the year
# 4. Create a simple plot showing sales vs expenses

# Your code here

## Summary and Next Steps

### What We Covered Today:
1. ✅ Python fundamentals: data types, control structures, functions
2. ✅ Jupyter notebook setup and best practices
3. ✅ Introduction to data science libraries (NumPy, Pandas, Matplotlib)
4. ✅ Git and GitHub setup for portfolio development
5. ✅ Practical exercises with sample data

### Homework Before Next Session:
1. **Complete the exercises** above if not finished in class
2. **Set up your GitHub repository** following the structure we discussed
3. **Read about Pandas basics** - we'll dive deep into data manipulation next session
4. **Optional:** Explore the Iris dataset (we'll use it in Session 3)

### Resources for Further Learning:
- [Python.org Tutorial](https://docs.python.org/3/tutorial/)
- [Jupyter Documentation](https://jupyter-notebook.readthedocs.io/)
- [Git Handbook](https://guides.github.com/introduction/git-handbook/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)

### Next Session Preview:
**Session 2: Data Manipulation with Pandas & NumPy**
- Advanced Pandas operations
- Data loading from various sources
- Introduction to the Iris dataset
- Data cleaning fundamentals