<a href="https://colab.research.google.com/github/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/postlesson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This post-class practice helps you reinforce the core NumPy skills we used:

- Creating and inspecting arrays
- Indexing, slicing, and boolean filtering
- Aggregations over rows/columns
- Reshaping and simple matrix operations

You can work through these exercises in the same environment you used in class (Colab or VS Code).

**1. Warm-up: Rebuild the basics**

Goal: Make array creation and inspection feel automatic.

1. Create each of the following arrays and print its `shape`, `ndim`, and `dtype`:

 - A 1D array of integers from 10 to 19
 - A 2D array of shape `(4, 3)` filled with 1.5
 - A 3D array of zeros with shape `(2, 2, 3)`
2. Convert the `(4, 3)` float array to integers using `.astype(int)`.

3. Write down (in a markdown cell) when you would prefer `float` vs `int` in real analyst work (e.g., prices vs counts).

In [10]:
import numpy as np

arr1d = np.arange(10,20)

arr2d = np.full((4,3),1.5)

arr3d = np.zeros((2,2,3))

print( arr1d, "\nShape:\n", arr1d.shape, "\nDim:\n", arr1d.ndim, "\ntype:\n", arr1d.dtype)
print( arr2d, "\nShape:\n", arr2d.shape, "\nDim:\n", arr2d.ndim, "\ntype:\n", arr2d.dtype)
print( arr3d, "\nShape:\n", arr3d.shape, "\nDim:\n", arr3d.ndim, "\ntype:\n", arr3d.dtype)


arr2d_int = arr2d.astype(int)

print("\n", arr2d_int)

[10 11 12 13 14 15 16 17 18 19] 
Shape:
 (10,) 
Dim:
 1 
type:
 int64
[[1.5 1.5 1.5]
 [1.5 1.5 1.5]
 [1.5 1.5 1.5]
 [1.5 1.5 1.5]] 
Shape:
 (4, 3) 
Dim:
 2 
type:
 float64
[[[0. 0. 0.]
  [0. 0. 0.]]

 [[0. 0. 0.]
  [0. 0. 0.]]] 
Shape:
 (2, 2, 3) 
Dim:
 3 
type:
 float64

 [[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]
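
For step 3, it can help to see what `.astype(int)` does to fractional values. The sketch below uses made-up example values (not part of the exercise) and suggests that the conversion drops the fractional part instead of rounding. That is one reason prices usually stay as floats while counts suit integers.

```python
import numpy as np

# Hypothetical example values, chosen only to show the conversion behaviour
prices = np.array([1.5, 2.9, -1.5])

truncated = prices.astype(int)         # drops the fractional part (toward zero)
rounded = np.rint(prices).astype(int)  # rounds to the nearest integer first

print(truncated.tolist())  # [1, 2, -1]
print(rounded.tolist())    # [2, 3, -2]
```

If you need rounded integers, round explicitly before converting, as in the second line.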


**2. Sales table drill: Indexing and slicing**

Scenario: You have quarterly sales numbers for 4 regions (rows) over 5 quarters (columns).

In [13]:
import numpy as np

sales = np.array([
    [100, 120, 110, 150, 130], # Region A
    [90, 80, 95, 100, 110],    # Region B
    [200, 210, 190, 220, 250], # Region C
    [150, 140, 130, 160, 170]  # Region D
])
regions = np.array(["A", "B", "C", "D"])
quarters = np.array(["Q1", "Q2", "Q3", "Q4", "Q5"])

Do the following:

1. Select all quarters for Region B as a 1D array.
2. Select Q2 to Q4 (inclusive) for all regions as a 2D subarray.
3. Select Q5 sales for Regions A and D only (use slicing or fancy indexing).
4. Compute the shape of each result and add a brief comment: “Is this 1D or 2D, and why?”

In [20]:
#qn1
reg_B = sales[1]

#qn2
q2_q4 = sales[:,1:4]

#qn3
q5_A_D = sales[[0,3], 4]
#qn3 (alternative using a column slice)
q5_A_D_v2 = sales[[0,3], 4:5] # slicing with 4:5 keeps the column axis, so the result is 2D

print("q5_A_D id\n",q5_A_D, "\n shape:\n", q5_A_D.shape)
print("q5_A_D_v2 id\n",q5_A_D_v2, "\n shape:\n", q5_A_D_v2.shape)

q5_A_D id
 [130 170] 
 shape:
 (2,)
q5_A_D_v2 id
 [[130]
 [170]] 
 shape:
 (2, 1)
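
Step 4 asks for the shape of every result, not only the Q5 selections. A minimal sketch of that check for the first two selections follows. The name `reg_B_2d` is a hypothetical extra, added only to contrast an integer index with a one-row slice.

```python
import numpy as np

sales = np.array([
    [100, 120, 110, 150, 130],  # Region A
    [90, 80, 95, 100, 110],     # Region B
    [200, 210, 190, 220, 250],  # Region C
    [150, 140, 130, 160, 170],  # Region D
])

reg_B = sales[1]       # an integer index drops the row axis, so this is 1D
q2_q4 = sales[:, 1:4]  # slices keep both axes, so this is 2D
reg_B_2d = sales[1:2]  # hypothetical variant: a one-row slice keeps the row axis

print(reg_B.shape)     # (5,)
print(q2_q4.shape)     # (4, 3)
print(reg_B_2d.shape)  # (1, 5)
```

The rule of thumb this illustrates: each integer index removes one axis, while a slice keeps its axis even when it selects a single row or column.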


**3. Boolean masks: Work with targets**

Scenario: You have campaign response data.

In [21]:
names = np.array(["Ana", "Ben", "Chen", "Dana", "Eli", "Fatima", "George", "Hui"])
spend = np.array([200, 150, 300, 120, 180, 220, 160, 310])      # marketing spend
revenue = np.array([400, 180, 500, 100, 220, 260, 150, 600])   # revenue



1. Compute the ROI for each person: roi = revenue / spend.
2. Create a boolean mask for customers with roi >= 2.0.
3. Use the mask to:
  - List their names
  - List their spend and revenue
4. Create a second mask for customers with spend >= 200.
5. Combine the two masks to find customers who have roi >= 2.0 AND spend >= 200.
6. In a markdown cell, answer: “What business insight do you get from this filtered group?”


In [27]:
#qn1
roi = revenue/spend
print(roi)

#qn2
roi_more_2_mask = (roi >= 2.0)

#qn3
names_roi_more_2 = names[roi_more_2_mask]
print("ROI at least 2:\n", names_roi_more_2)
print("The spend is:", spend[roi_more_2_mask], " and the revenue is: ", revenue[roi_more_2_mask])

#qn4
spend_mask = (spend >= 200)

#qn5
print("Customer with roi >=2.0 and spend>=200: ", names[roi_more_2_mask & spend_mask])

[2.         1.2        1.66666667 0.83333333 1.22222222 1.18181818
 0.9375     1.93548387]
ROI at least 2:
 ['Ana']
The spend is: [200]  and the revenue is:  [400]
Customer with roi >=2.0 and spend>=200:  ['Ana']


Business Insight: Customers in the filtered group (ROI >= 2.0 AND spend >= 200) are our scalable successes: accounts where we spent a significant amount and still saw a high rate of return. In this data only Ana qualifies; Hui (ROI of about 1.94 on a spend of 310) falls just below the threshold. We should prioritize profiles like these for future budget increases.

**4. Gradebook: Aggregations and broadcasting**

Reuse the “Gradebook” idea from class and extend it.

In [29]:
students = np.array(["S1", "S2", "S3", "S4", "S5"])
subjects = np.array(["Math", "Stats", "Python"])

scores = np.array([
    [75, 80, 85],
    [60, 65, 70],
    [90, 88, 92],
    [82, 79, 84],
    [70, 72, 78]
])

1. Compute each student’s average score (per row).
2. Compute each subject’s average score (per column).
3. Create `scores_centered` by subtracting the subject mean from each column (broadcasting).
4. For `scores_centered`, compute per-student averages again.
5. Compare which student looks best by raw average vs centered average.
6. In a markdown cell, explain briefly why centered scores might give a fairer comparison.




In [35]:
#qn1
stu_ave_score = scores.mean(axis = 1) #avg score across all sub

#qn2
sbj_ave_score = scores.mean(axis=0) #avg score per sub

#qn3
scores_centered = scores - sbj_ave_score #student score minus avg score

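#qn4
print("Raw Scores:\n", scores)
print("Raw Averages:\n", stu_ave_score)
print("Centered Score:\n", scores_centered)
print("Centered Averages:\n", scores_centered.mean(axis=1))
# Centering removes each subject's mean, so a score shows performance relative to the class
# and accounts for differences in subject difficulty.
# Note: every student takes all three subjects, so each centered average is the raw average
# minus the same constant (78), and the ranking of students does not change.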

Raw Scores:
 [[75 80 85]
 [60 65 70]
 [90 88 92]
 [82 79 84]
 [70 72 78]]
Raw Averages:
 [80.         65.         90.         81.66666667 73.33333333]
Centered Score:
 [[ -0.4   3.2   3.2]
 [-15.4 -11.8 -11.8]
 [ 14.6  11.2  10.2]
 [  6.6   2.2   2.2]
 [ -5.4  -4.8  -3.8]]
Centered Averages:
 [  2.         -13.          12.           3.66666667  -4.66666667]
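
For step 5, one way to compare the two rankings is to sort the student labels by each average. The sketch below is only one possible approach; the variable names `raw_rank` and `centered_rank` are illustrative and not from the class material.

```python
import numpy as np

students = np.array(["S1", "S2", "S3", "S4", "S5"])
scores = np.array([
    [75, 80, 85],
    [60, 65, 70],
    [90, 88, 92],
    [82, 79, 84],
    [70, 72, 78],
])

raw_avg = scores.mean(axis=1)
centered_avg = (scores - scores.mean(axis=0)).mean(axis=1)

# argsort on the negated averages returns indices from highest to lowest
raw_rank = students[np.argsort(-raw_avg)]
centered_rank = students[np.argsort(-centered_avg)]

print("Ranking by raw average:     ", raw_rank)
print("Ranking by centered average:", centered_rank)

# The two averages differ by one constant (the overall mean) for every student
print(np.allclose(raw_avg - centered_avg, scores.mean()))
```

Because every student has a score in every subject, centering shifts all averages by the same amount. Centered comparisons would start to differ from raw ones if students took different subsets of subjects.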


**5. Reshape and flatten: From daily to weekly**

Scenario: You have 30 days of daily website visits, and you want to summarize them by week.

In [None]:
daily_visits = np.array([
    120, 130, 125, 140, 150,
    160, 170, 155, 145, 135,
    128, 132, 138, 142, 148,
    152, 158, 162, 168, 172,
    180, 190, 185, 175, 165,
    155, 145, 135, 125, 115
])

1. Reshape `daily_visits` into a `(6, 5)` array representing 6 weeks × 5 days.
2. Compute total visits per week.
3. Compute average visits per day of the week (e.g., all “day 1 of week” together, all “day 2 of week” together, etc.).
4. Flatten the reshaped array back to 1D and confirm it matches the original `daily_visits`.

Hint: Pay attention to how reshaping groups the data; mention any assumptions you’re making about which days belong to which week.
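
The steps above can be sketched as follows. This is one possible solution, and it assumes the 30 values are in chronological order so that each consecutive run of 5 values forms one week. The names `weekly`, `weekly_totals`, `day_of_week_avg`, and `flat_again` are illustrative.

```python
import numpy as np

daily_visits = np.array([
    120, 130, 125, 140, 150,
    160, 170, 155, 145, 135,
    128, 132, 138, 142, 148,
    152, 158, 162, 168, 172,
    180, 190, 185, 175, 165,
    155, 145, 135, 125, 115
])

# reshape fills row by row, so row i holds days 5*i .. 5*i + 4 (assumed to be week i)
weekly = daily_visits.reshape(6, 5)

weekly_totals = weekly.sum(axis=1)     # one total per week (row)
day_of_week_avg = weekly.mean(axis=0)  # one average per weekday position (column)

flat_again = weekly.ravel()            # back to 1D in the same row-by-row order

print("Weekly totals:", weekly_totals.tolist())  # [665, 765, 688, 812, 895, 675]
print("Average per day of week:", day_of_week_avg)
print("Round trip matches:", np.array_equal(flat_again, daily_visits))
```

If the data started mid-week or included weekends, the row grouping would no longer line up with calendar weeks, which is the assumption the hint asks you to state.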



**6. Mini-project: Simple scoring model**

This mirrors the tiny matrix multiply example from class, but with more features.

In [None]:
# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])

customers = np.array(["C1", "C2", "C3", "C4", "C5"])

1. Choose a weight vector `w = [w_views, w_time, w_purchases]` (for example `[0.1, 0.5, 1.0]`).
2. Compute a score for each customer using `scores = X @ w`.
3. Rank customers by score (highest first).
4. Change the weights to emphasize `past_purchases` more than other features, and recompute.
5. In a markdown cell, answer:
  - Which customer is top-ranked before vs after changing weights?
  - In what real-world situation might you prefer each weighting?
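
A minimal sketch of the scoring and ranking steps follows. The first weight vector is the example given in the exercise; the second, `w_purchase_heavy`, is an assumed value picked only for illustration, and `rank_customers` is a hypothetical helper name.

```python
import numpy as np

# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])
customers = np.array(["C1", "C2", "C3", "C4", "C5"])


def rank_customers(X, w, customers):
    """Return (ranked customer labels, ranked scores), highest score first."""
    scores = X @ w              # one weighted sum per customer (row)
    order = np.argsort(-scores) # indices from highest to lowest score
    return customers[order], scores[order]


w = np.array([0.1, 0.5, 1.0])                 # example weights from the exercise
w_purchase_heavy = np.array([0.1, 0.5, 5.0])  # assumed weights emphasizing past_purchases

baseline_rank, baseline_scores = rank_customers(X, w, customers)
heavy_rank, heavy_scores = rank_customers(X, w_purchase_heavy, customers)

print("Baseline:      ", baseline_rank, baseline_scores)
print("Purchase-heavy:", heavy_rank, heavy_scores)
```

Try other weight vectors as well: whether the ranking changes depends on how different the customers are on the feature you emphasize.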


**7. Stretch ideas (optional)**

If you’re comfortable with the above, try:

1. Generate 1,000 random test scores with `np.random.randn`, scale them to have mean 70 and standard deviation 10, then:
  - Clip scores to the range [0, 100]
  - Compute min, max, mean, and standard deviation
2. Simulate a small A/B test:
  - Two groups of 20 users each
  - Randomly generate conversions (0 or 1) for each group
  - Compute conversion rate per group using pure NumPy operations
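
Both stretch ideas can be sketched in a few lines. The code below is a starting point, not a reference solution. The seed, the use of `np.random.randint` for conversions (which implies a 50% conversion chance for both groups), and all variable names are assumptions made for illustration.

```python
import numpy as np

np.random.seed(0)  # assumed seed, used only so repeated runs give the same numbers

# Idea 1: random test scores
z = np.random.randn(1000)         # standard normal draws: mean about 0, std about 1
test_scores = 70 + 10 * z         # shift and scale: mean about 70, std about 10
test_scores = np.clip(test_scores, 0, 100)

print("min: ", test_scores.min())
print("max: ", test_scores.max())
print("mean:", test_scores.mean())
print("std: ", test_scores.std())

# Idea 2: small A/B test
# Row 0 is group A, row 1 is group B; each entry is 0 (no conversion) or 1 (conversion)
conversions = np.random.randint(0, 2, size=(2, 20))
conversion_rates = conversions.mean(axis=1)  # mean of 0/1 values is the conversion rate

print("Conversion rate A:", conversion_rates[0])
print("Conversion rate B:", conversion_rates[1])
```

The sample mean and standard deviation will be close to 70 and 10 but not exact, since they come from a finite random sample. Clipping can also pull the standard deviation down slightly.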
