# 01. The Python Ecosystem and Our Setup

This notebook is our starting point. It documents how we configured our development environment, explores the fundamentals of NumPy (the numerical engine underneath most of the Python data science stack), and then introduces Pandas.

## 1. Environment Setup

To ensure a clean and reproducible project, we followed these steps in a macOS Terminal:

1.  **Install Base Tools (Homebrew):**
    * We used [Homebrew](https://brew.sh/) (the macOS package manager) to install Python, Git, and Visual Studio Code:
    * `brew install python3`
    * `brew install git`
    * `brew install --cask visual-studio-code`

2.  **Create Project Folder:**
    * We created a dedicated directory (`~/Projects`) outside of iCloud to avoid syncing heavy files and virtual environments.
    * `mkdir ~/Projects`
    * `cd ~/Projects`

3.  **Clone the Repository:**
    * We cloned our GitHub repository.
    * `git clone [YOUR_REPO_URL_HERE]`
    * `cd learning-ai`

4.  **Create Virtual Environment (`.venv`):**
    * We created an isolated environment for this project to manage its dependencies.
    * `python3 -m venv .venv`
    * **Activation:** `source .venv/bin/activate` (our terminal prompt now shows `(.venv)`).

5.  **Install Libraries:**
    * We upgraded `pip` (Python's package manager) and then installed the basic data science stack:
    * `pip install --upgrade pip`
    * `pip install jupyterlab pandas numpy scikit-learn matplotlib`

6.  **Ignore Files (`.gitignore`):**
    * We created a `.gitignore` file in the project root to tell Git to ignore the `.venv` folder, cache files (`__pycache__`), and OS-specific files (`.DS_Store`).
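
A quick way to confirm that the notebook is really running inside the `.venv` created above is to compare two interpreter paths. This is a minimal sketch using only the standard library; the helper name `in_virtualenv` is our own and not part of any tool mentioned above.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If this prints `False` in a notebook, the Jupyter kernel is probably using a different Python than the one in `.venv`.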

In [1]:
# The industry standard is to import numpy with the alias "np"
import numpy as np

print(f"NumPy installed. Version: {np.__version__}")

NumPy installed. Version: 2.0.2


## 2. The Core of NumPy: The `ndarray`

At the heart of NumPy is the **`ndarray`** (N-dimensional array). You can think of this as the Python equivalent of a **vector** (1D array) or a **matrix** (2D array), which are the fundamental building blocks of your work in algebra.

**Why not just use a standard Python `list`?**

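* **Performance:** NumPy's core routines are implemented in C and operate on typed data, so mathematical operations on large arrays are often *orders of magnitude* faster than the equivalent Python loop over a list.
* **Memory:** An array stores its elements as raw values of a single type in one contiguous block, whereas a list stores references to separate Python objects.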
* **Vectorization:** This is the key. NumPy allows you to perform batch operations on entire arrays at once without writing `for` loops. This is called **vectorization**, and it's the core concept that makes code both fast and easy to read.
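
The memory and performance claims above can be checked with a rough measurement. This is only a sketch: the array size `n` and the repeat count are arbitrary choices, and the timings depend on the machine, so no expected timing is shown.

```python
import sys
import timeit

import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Memory: the array holds n raw 8-byte integers in one buffer.
array_bytes = np_array.nbytes
# sys.getsizeof on a list counts only the list object and its references;
# every int object it refers to takes additional memory on top of that.
list_reference_bytes = sys.getsizeof(py_list)

print(f"Array buffer:           {array_bytes:,} bytes")
print(f"List (references only): {list_reference_bytes:,} bytes")

# Speed: time "add 10 to every element" both ways (results vary by machine).
loop_seconds = timeit.timeit(lambda: [x + 10 for x in py_list], number=5)
vectorized_seconds = timeit.timeit(lambda: np_array + 10, number=5)

print(f"List comprehension: {loop_seconds:.4f} s")
print(f"Vectorized NumPy:   {vectorized_seconds:.4f} s")
```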

In [2]:
# 1. Creating a 1D Array (Vector)
# We can create one directly from a Python list
my_list = [1, 2, 3, 4, 5]
my_vector = np.array(my_list)

print(f"This is a Python list: {my_list}")
print(f"This is a NumPy 1D array (vector): {my_vector}")
print(f"Type of vector: {type(my_vector)}")
print("-" * 20) # A simple separator

# 2. Creating a 2D Array (Matrix)
# We use a list of lists
my_matrix_list = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ]
my_matrix = np.array(my_matrix_list)

print(f"This is a NumPy 2D array (matrix):\n {my_matrix}")
print(f"Shape of the matrix (rows, cols): {my_matrix.shape}")
print(f"Number of dimensions: {my_matrix.ndim}")

This is a Python list: [1, 2, 3, 4, 5]
This is a NumPy 1D array (vector): [1 2 3 4 5]
Type of vector: <class 'numpy.ndarray'>
--------------------
This is a NumPy 2D array (matrix):
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Shape of the matrix (rows, cols): (3, 3)
Number of dimensions: 2
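
To connect the vector and matrix picture to actual algebra, here is a small sketch of indexing and a matrix-vector product. The variable names are our own; the matrix is the same 3x3 example used above, and the vector is shortened to three elements so the shapes match.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
vector = np.array([1, 2, 3])

# Indexing uses [row, column], both counted from 0.
element = matrix[1, 2]        # second row, third column
first_row = matrix[0]         # a 1D array
last_column = matrix[:, -1]   # a 1D array

# The @ operator performs the matrix-vector product.
product = matrix @ vector

print(element)      # 6
print(first_row)    # [1 2 3]
print(last_column)  # [3 6 9]
print(product)      # [14 32 50]
```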


## 3. The Power of Vectorization

Let's say we want to add 10 to every single number in our vector.

* **The Python List Way:** We would have to use a `for` loop, iterate through each element, and create a new list.
* **The NumPy Way:** We just... add 10. NumPy understands we want to apply this operation to *every element* in the array. This is vectorization.

This same logic applies to multiplication, division, subtraction, and mathematical functions such as `np.sin()` and `np.log()`. You apply the function directly to the array (vector or matrix), and NumPy performs the element-by-element operation in compiled C code rather than in a Python loop.

In [3]:
# Let's use the vector we created earlier
print(f"Original vector: {my_vector}")

# The NumPy Way (Vectorized)
# This is fast, clean, and easy to read.
vector_plus_10 = my_vector + 10
print(f"Vectorized add:  {vector_plus_10}")

# The Python List Way (Looping)
# This is slow, verbose, and less "mathematical"
list_plus_10 = []
for item in my_list:
    list_plus_10.append(item + 10)
print(f"List loop add:   {list_plus_10}")

# This works for any math operation
print(f"Vector times 3:  {my_vector * 3}")
print(f"Vector squared:  {my_vector ** 2}")

Original vector: [1 2 3 4 5]
Vectorized add:  [11 12 13 14 15]
List loop add:   [11, 12, 13, 14, 15]
Vector times 3:  [ 3  6  9 12 15]
Vector squared:  [ 1  4  9 16 25]
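
The arithmetic operators are not the only vectorized tools. As a brief sketch (with example values chosen by us), NumPy's mathematical functions apply element by element in the same way, on both 1D and 2D arrays:

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

roots = np.sqrt(values)      # element-wise square root
logs = np.log(values)        # element-wise natural logarithm
round_trip = np.exp(logs)    # exp undoes log, up to floating-point error

print(roots)                            # [1. 2. 3. 4.]
print(np.allclose(round_trip, values))  # True

# The same call works unchanged on a 2D array.
grid = values.reshape(2, 2)
print(np.sqrt(grid))
```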


## 4. Introduction to Pandas: Data Analysis Toolkit

If NumPy is the engine for (linear) algebra, **Pandas** is the high-level toolkit for practical data analysis and manipulation. It's built on top of NumPy.

The key benefits are:
* It introduces labeled data structures: the `Series` (1D) and `DataFrame` (2D).
* It makes it incredibly simple to **load data** from various sources (like CSVs, Excel, SQL databases).
* It provides powerful, SQL-like tools for filtering, grouping, joining, and cleaning data.

We use the standard alias `pd` when importing it.

In [4]:
# The industry standard alias is "pd"
import pandas as pd

print(f"Pandas installed. Version: {pd.__version__}")

Pandas installed. Version: 2.3.3
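
As a small taste of the SQL-like tools listed above, the sketch below filters and groups a tiny table. The `orders` data is invented purely for illustration.

```python
import pandas as pd

# Hypothetical example data, not taken from any real dataset.
orders = pd.DataFrame({
    "customer": ["ana", "ben", "ana", "cy", "ben"],
    "amount": [20, 35, 15, 50, 5],
})

# Filtering: keep rows where a condition holds (like SQL WHERE).
large_orders = orders[orders["amount"] >= 20]

# Grouping: total amount per customer (like SQL GROUP BY with SUM).
totals = orders.groupby("customer")["amount"].sum()

print(large_orders)
print(totals)
```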


## 5. Core Pandas Structures: `Series` and `DataFrame`

Pandas has two main data structures you'll use constantly.

### The `Series` (1D)
* A **1D labeled array**. It's like a NumPy 1D array, but it has an **index** (a label for each element).
* You can think of it as a single column in a spreadsheet.

### The `DataFrame` (2D)
* This is the **most important** structure.
* It's a 2D labeled table with rows and columns (like a full spreadsheet or a SQL table).
* Each column in a `DataFrame` is actually a `Series`.
* You can create one from many sources, but a `dict` of lists is a common way.

In [5]:
# 1. Creating a Series
# Notice the 'index' on the left side
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("--- A Pandas Series (1D) ---")
print(s)
print(f"\nAccessing element 'b': {s['b']}")
print("-" * 30)


# 2. Creating a DataFrame
# We'll use a Python dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

print("--- A Pandas DataFrame (2D) ---")
display(df) # 'display()' is better than 'print()' for DataFrames in a notebook

--- A Pandas Series (1D) ---
a    10
b    20
c    30
d    40
dtype: int64

Accessing element 'b': 20
------------------------------
--- A Pandas DataFrame (2D) ---


|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
| 3 | David | 40 | Houston |
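
The statement that each `DataFrame` column is a `Series` can be checked directly. This sketch rebuilds a two-column version of the table above so that it runs on its own; the name `people` is our own.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

ages = people["Age"]          # selecting one column returns a Series
print(type(ages).__name__)    # Series
print(ages.mean())            # 32.5

# The Series shares the DataFrame's row index, so a boolean mask lines up by row.
over_30 = people[ages > 30]
print(over_30["Name"].tolist())  # ['Charlie', 'David']
```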


In [None]:
# Creating a file with pure Python

# 1. Define the content we want to write
csv_content = """ProductID,ProductName,Price
101,Laptop,1200
102,Mouse,25
103,Keyboard,80
104,Monitor,300
"""

# 2. Define the file name
file_name = "sample_data.csv"

# 3. Open the file in 'write' mode ('w') and write the content
#    (an explicit encoding keeps the file identical across platforms)
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(csv_content)

print(f"File '{file_name}' created successfully!")

File 'sample_data.csv' created successfully!


In [8]:
# Now, let's read the CSV file we just created
data_df = pd.read_csv('sample_data.csv')

print("--- Data loaded from sample_data.csv ---")
display(data_df)

--- Data loaded from sample_data.csv ---


|   | ProductID | ProductName | Price |
|---|---|---|---|
| 0 | 101 | Laptop | 1200 |
| 1 | 102 | Mouse | 25 |
| 2 | 103 | Keyboard | 80 |
| 3 | 104 | Monitor | 300 |
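
By default `read_csv` adds a numeric row index (0 to 3 above). One possible variation, sketched here, is to use an existing column as the index instead. To keep the example self-contained, it parses the same CSV text from memory with `io.StringIO` rather than from the file on disk.

```python
from io import StringIO

import pandas as pd

csv_text = """ProductID,ProductName,Price
101,Laptop,1200
102,Mouse,25
103,Keyboard,80
104,Monitor,300
"""

# StringIO lets read_csv parse in-memory text as if it were a file.
products = pd.read_csv(StringIO(csv_text), index_col="ProductID")

# With ProductID as the index, .loc looks rows up by product ID.
mouse_price = products.loc[102, "Price"]

print(products)
print(f"Price of product 102: {mouse_price}")
```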


In [9]:
# 1. See the first few rows
print("--- .head() ---")
display(data_df.head())

# 2. Get the technical summary
print("\n--- .info() ---")
data_df.info()

# 3. Get the statistical summary
print("\n--- .describe() ---")
display(data_df.describe())

--- .head() ---


|   | ProductID | ProductName | Price |
|---|---|---|---|
| 0 | 101 | Laptop | 1200 |
| 1 | 102 | Mouse | 25 |
| 2 | 103 | Keyboard | 80 |
| 3 | 104 | Monitor | 300 |



--- .info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ProductID    4 non-null      int64 
 1   ProductName  4 non-null      object
 2   Price        4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes

--- .describe() ---


|   | ProductID | Price |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 102.5 | 401.25 |
| std | 1.290994 | 545.594095 |
| min | 101.0 | 25.0 |
| 25% | 101.75 | 66.25 |
| 50% | 102.5 | 190.0 |
| 75% | 103.25 | 525.0 |
| max | 104.0 | 1200.0 |
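
To see where the `.describe()` numbers come from, the sketch below recomputes a few of them for the `Price` column with individual method calls. It rebuilds the four prices by hand so that it runs without the CSV file.

```python
import pandas as pd

price = pd.Series([1200, 25, 80, 300], name="Price")

mean = price.mean()
sample_std = price.std()      # pandas defaults to ddof=1 (sample standard deviation)
q25 = price.quantile(0.25)    # linear interpolation between the sorted values
median = price.median()

print(f"mean: {mean}")            # 401.25
print(f"std:  {sample_std:.6f}")  # 545.594095
print(f"25%:  {q25}")             # 66.25
print(f"50%:  {median}")          # 190.0
```

Note that `std` here is the sample standard deviation (dividing by n - 1), which is why it differs from NumPy's default `np.std`, which divides by n.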
