<a href="https://colab.research.google.com/github/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/postlesson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This post-class practice helps you reinforce the core NumPy skills we used:

- Creating and inspecting arrays
- Indexing, slicing, and boolean filtering
- Aggregations over rows/columns
- Reshaping and simple matrix operations

You can work through these in the same environment you used in class (Colab or VS Code).

**1. Warm-up: Rebuild the basics**
Goal: Make array creation and inspection feel automatic.

1. Create each of the following arrays and print its `shape`, `ndim`, and `dtype`:

 - A 1D array of integers from 10 to 19
 - A 2D array of shape `(4, 3)` filled with 1.5
 - A 3D array of zeros with shape `(2, 2, 3)`
2. Convert the `(4, 3)` float array to integers using `.astype(int)`.

3. Write down (in a markdown cell) when you would prefer `float` vs `int` in real analyst work (e.g., prices vs counts).
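
To ground the `float` vs `int` question, here is a minimal sketch of what `.astype(int)` does to fractional values. The `prices` array is made-up illustration data, not part of the exercise.

```python
import numpy as np

# Hypothetical values, chosen only to show the conversion behaviour
prices = np.array([1.5, 2.99, -1.5])

# astype(int) drops the fractional part (truncates toward zero); it does not round
truncated = prices.astype(int)

# To round first, use np.rint (ties go to the nearest even integer), then convert
rounded = np.rint(prices).astype(int)

print(truncated)  # [ 1  2 -1]
print(rounded)    # [ 2  3 -2]
```

Counts stay exact as integers, while prices usually need floats: converting a price column to `int` silently discards the cents.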

In [None]:
import numpy as np

arr1d = np.arange(10,20)
#print( arr1d, "\nShape:\n", arr1d.shape, "\nDim:\n", arr1d.ndim, "\ntype:\n", arr1d.dtype)

arr2d = np.full((4,3),1.5)

arr3d = np.zeros((2,2,3))

print( arr1d, "\nShape:\n", arr1d.shape, "\nDim:\n", arr1d.ndim, "\ntype:\n", arr1d.dtype)
print( arr2d, "\nShape:\n", arr2d.shape, "\nDim:\n", arr2d.ndim, "\ntype:\n", arr2d.dtype)
print( arr3d, "\nShape:\n", arr3d.shape, "\nDim:\n", arr3d.ndim, "\ntype:\n", arr3d.dtype)


arr2d_int = arr2d.astype(int)

print("\n", arr2d_int)

[10 11 12 13 14 15 16 17 18 19] 
Shape:
 (10,) 
Dim:
 1 
type:
 int64
[[1.5 1.5 1.5]
 [1.5 1.5 1.5]
 [1.5 1.5 1.5]
 [1.5 1.5 1.5]] 
Shape:
 (4, 3) 
Dim:
 2 
type:
 float64
[[[0. 0. 0.]
  [0. 0. 0.]]

 [[0. 0. 0.]
  [0. 0. 0.]]] 
Shape:
 (2, 2, 3) 
Dim:
 3 
type:
 float64

 [[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]


**2. Sales table drill: Indexing and slicing**

Scenario: You have quarterly sales numbers for 4 regions (rows) over 5 quarters (columns).

In [None]:
import numpy as np

sales = np.array([
    [100, 120, 110, 150, 130], # Region A
    [90, 80, 95, 100, 110],    # Region B
    [200, 210, 190, 220, 250], # Region C
    [150, 140, 130, 160, 170]  # Region D
])
regions = np.array(["A", "B", "C", "D"])
quarters = np.array(["Q1", "Q2", "Q3", "Q4", "Q5"])

Do the following:

1. Select all quarters for Region B as a 1D array.
2. Select Q2 to Q4 (inclusive) for all regions as a 2D subarray.
3. Select Q5 sales for Regions A and D only (use slicing or fancy indexing).
4. Compute the shape of each result and add a brief comment: “Is this 1D or 2D, and why?”

In [None]:
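#qn1: a single integer row index drops the row axis, so the result is 1D
reg_B = sales[1]

#qn2: slicing both axes keeps both axes, so the result is 2D (columns 1..3 are Q2..Q4)
q2_q4 = sales[:,1:4]

#qn3: a list of rows with an integer column index gives a 1D result
q5_A_D = sales[[0,3], 4]
#qn3 (variant): slicing the column (4:5) keeps the column axis, so the result is 2D
q5_A_D_v2 = sales[[0,3], 4:5]

#qn4
print("reg_B\n",reg_B, "\n shape:\n", reg_B.shape)
print("q2_q4\n",q2_q4, "\n shape:\n", q2_q4.shape)
print("q5_A_D\n",q5_A_D, "\n shape:\n", q5_A_D.shape)
print("q5_A_D_v2\n",q5_A_D_v2, "\n shape:\n", q5_A_D_v2.shape)

reg_B
 [ 90  80  95 100 110] 
 shape:
 (5,)
q2_q4
 [[120 110 150]
 [ 80  95 100]
 [210 190 220]
 [140 130 160]] 
 shape:
 (4, 3)
q5_A_D
 [130 170] 
 shape:
 (2,)
q5_A_D_v2
 [[130]
 [170]] 
 shape:
 (2, 1)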
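
The `regions` and `quarters` arrays can drive the selection instead of hard-coded positions. A minimal sketch, assuming the label arrays stay aligned with the rows and columns of `sales`; the names `row_mask` and `q5_col` are illustrative only.

```python
import numpy as np

sales = np.array([
    [100, 120, 110, 150, 130],  # Region A
    [90, 80, 95, 100, 110],     # Region B
    [200, 210, 190, 220, 250],  # Region C
    [150, 140, 130, 160, 170],  # Region D
])
regions = np.array(["A", "B", "C", "D"])
quarters = np.array(["Q1", "Q2", "Q3", "Q4", "Q5"])

# Boolean mask over rows: True where the region label is A or D
row_mask = np.isin(regions, ["A", "D"])

# Position of the Q5 column, looked up from its label
q5_col = int(np.flatnonzero(quarters == "Q5")[0])

# An integer column index drops the column axis -> 1D result
q5_a_d = sales[row_mask, q5_col]

# A one-column slice keeps the column axis -> 2D result
q5_a_d_2d = sales[row_mask, q5_col:q5_col + 1]

print(q5_a_d, q5_a_d.shape)   # [130 170] (2,)
print(q5_a_d_2d.shape)        # (2, 1)
```

Looking positions up from labels keeps the selection correct if rows or columns are later reordered, as long as the label arrays are reordered with them.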


**3. Boolean masks: Work with targets**

Scenario: You have campaign response data.

In [None]:
names = np.array(["Ana", "Ben", "Chen", "Dana", "Eli", "Fatima", "George", "Hui"])
spend = np.array([200, 150, 300, 120, 180, 220, 160, 310])      # marketing spend
revenue = np.array([400, 180, 500, 100, 220, 260, 150, 600])   # revenue



1. Compute the ROI for each person: `roi = revenue / spend`.
2. Create a boolean mask for customers with `roi >= 2.0`.
3. Use the mask to:
   - List their names
   - List their spend and revenue
4. Create a second mask for customers with `spend >= 200`.
5. Combine the two masks to find customers who have `roi >= 2.0` AND `spend >= 200`.
6. In a markdown cell, answer: “What business insight do you get from this filtered group?”
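
Combining masks is where most mistakes happen, so here is a short sketch of the three element-wise operators (`&`, `|`, `~`) on the same campaign data. The variable names `high_roi`, `high_spend`, `both`, `either`, and `spend_only` are illustrative.

```python
import numpy as np

names = np.array(["Ana", "Ben", "Chen", "Dana", "Eli", "Fatima", "George", "Hui"])
spend = np.array([200, 150, 300, 120, 180, 220, 160, 310])
revenue = np.array([400, 180, 500, 100, 220, 260, 150, 600])

roi = revenue / spend
high_roi = roi >= 2.0
high_spend = spend >= 200

# Parentheses matter: & binds tighter than >=, so each comparison is wrapped
both = (roi >= 2.0) & (spend >= 200)

# | is element-wise OR, ~ is element-wise NOT
either = high_roi | high_spend
spend_only = ~high_roi & high_spend  # high spend but ROI below 2.0

print(names[both])         # ['Ana']
print(names[either])       # ['Ana' 'Chen' 'Fatima' 'Hui']
print(names[spend_only])   # ['Chen' 'Fatima' 'Hui']

# True counts as 1, so sum() counts how many customers match
print(high_spend.sum())    # 4
```

Python's `and` / `or` keywords do not work element-wise on arrays; use `&` and `|` instead.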


In [None]:
#qn1
roi = revenue/spend
print(roi)

#qn2
roi_more_2_mask = (roi >= 2.0)

#qn3
names_roi_more_2 = names[roi_more_2_mask]
print("ROI of at least 2.0:\n", names_roi_more_2)
print("The spend is:", spend[roi_more_2_mask], " and the revenue is: ", revenue[roi_more_2_mask])

#qn4
spend_mask = (spend >= 200)

#qn5
print("Customer with roi >=2.0 and spend>=200: ", names[roi_more_2_mask & spend_mask])

[2.         1.2        1.66666667 0.83333333 1.22222222 1.18181818
 0.9375     1.93548387]
ROI of at least 2.0:
 ['Ana']
The spend is: [200]  and the revenue is:  [400]
Customer with roi >=2.0 and spend>=200:  ['Ana']


Business insight: Customers in the filtered group (ROI >= 2.0 AND spend >= 200) are our scalable successes: accounts where we spent a significant amount and still saw a high rate of return. In this data set only Ana qualifies; Hui falls just short, with an ROI of about 1.94 on the highest spend (310). Profiles like these are the first candidates for future budget increases.

**4. Gradebook: Aggregations and broadcasting**

Reuse the “Gradebook” idea from class and extend it.

In [None]:
students = np.array(["S1", "S2", "S3", "S4", "S5"])
subjects = np.array(["Math", "Stats", "Python"])

scores = np.array([
    [75, 80, 85],
    [60, 65, 70],
    [90, 88, 92],
    [82, 79, 84],
    [70, 72, 78]
])

1. Compute each student’s average score (per row).
2. Compute each subject’s average score (per column).
3. Create `scores_centered` by subtracting the subject mean from each column (broadcasting).
4. For `scores_centered`, compute per-student averages again.
5. Compare which student looks best by raw average vs centered average.
6. In a markdown cell, explain briefly why centered scores might give a fairer comparison.




In [None]:
#qn1
stu_ave_score = scores.mean(axis = 1) #avg score across all sub

#qn2
sbj_ave_score = scores.mean(axis=0) #avg score per sub

#qn3
scores_centered = scores - sbj_ave_score #student score minus avg score

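#qn4 and qn5
print("Raw Scores:\n", scores)
print("Raw Averages:\n", stu_ave_score)
print("Centered Score:\n", scores_centered)
print("Centered Averages:\n", scores_centered.mean(axis=1))
# Centering expresses each score relative to its subject mean, which adjusts for the difficulty of each paper.
# Every student's average shifts by the same constant (78, the mean of the subject means),
# so the ranking by average is the same for raw and centered scores: S3 is best either way.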

Raw Scores:
 [[75 80 85]
 [60 65 70]
 [90 88 92]
 [82 79 84]
 [70 72 78]]
Raw Averages:
 [80.         65.         90.         81.66666667 73.33333333]
Centered Score:
 [[ -0.4   3.2   3.2]
 [-15.4 -11.8 -11.8]
 [ 14.6  11.2  10.2]
 [  6.6   2.2   2.2]
 [ -5.4  -4.8  -3.8]]
Centered Averages:
 [  2.         -13.          12.           3.66666667  -4.66666667]
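
The output above suggests that centering changes the numbers but not the order of the students. A minimal sketch to check that, with a z-score variant added as an extension that the exercise does not ask for:

```python
import numpy as np

students = np.array(["S1", "S2", "S3", "S4", "S5"])
scores = np.array([
    [75, 80, 85],
    [60, 65, 70],
    [90, 88, 92],
    [82, 79, 84],
    [70, 72, 78],
])

subject_mean = scores.mean(axis=0)                    # shape (3,)
raw_avg = scores.mean(axis=1)                         # shape (5,)
centered_avg = (scores - subject_mean).mean(axis=1)   # broadcasting over rows

# Every student's average drops by the same constant: the mean of the subject means
shift = raw_avg - centered_avg
print(shift)

# So the ranking by average is identical before and after centering
raw_rank = students[np.argsort(-raw_avg)]
centered_rank = students[np.argsort(-centered_avg)]
print(raw_rank)
print(centered_rank)

# Extension (an assumption, not part of the exercise): z-scores also divide by
# each subject's spread, so subjects with tightly clustered marks weigh more
z_scores = (scores - subject_mean) / scores.std(axis=0)
print(z_scores.mean(axis=1).round(2))
```

Centered scores are most useful within a subject (who is above or below that subject's average), since the per-student average ranking stays the same when every student takes every subject.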


## 5\. Reshape and flatten: From daily to weekly

**Scenario:** You have 30 days of daily website visits, and you want to summarize them by week.

In [None]:
daily_visits = np.array([
    120, 130, 125, 140, 150,
    160, 170, 155, 145, 135,
    128, 132, 138, 142, 148,
    152, 158, 162, 168, 172,
    180, 190, 185, 175, 165,
    155, 145, 135, 125, 115
])

1. Reshape `daily_visits` into a `(6, 5)` array representing **6 weeks × 5 days**.  
2. Compute **total visits per week**.  
3. Compute **average visits per day of the week** (e.g., all “day 1 of week” together, all “day 2 of week” together, etc.).  
4. Flatten the reshaped array back to 1D and confirm it matches the original `daily_visits`.

Hint: Pay attention to how reshaping groups the data; mention any assumptions you’re making about which days belong to which week.
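
To see how reshaping groups the data, here is a small sketch that uses day numbers 1 to 30 as stand-in values (not the real visit counts), so that each cell shows which day landed where.

```python
import numpy as np

days = np.arange(1, 31)  # stand-in values: day 1 .. day 30

# -1 lets NumPy infer the number of rows (30 / 5 = 6)
weeks = days.reshape(-1, 5)
print(weeks.shape)    # (6, 5)

# Rows are filled left to right, so each row holds 5 consecutive days
print(weeks[0])       # [1 2 3 4 5]

# A column collects the same weekday position from every week
print(weeks[:, 0])    # [ 1  6 11 16 21 26]

# 30 is not a multiple of 7, so a 7-day week layout is rejected
try:
    days.reshape(-1, 7)
except ValueError as err:
    print("reshape failed:", err)
```

The `(6, 5)` layout therefore assumes a 5-day week with no gaps: the first five values are week 1, the next five are week 2, and so on.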




In [13]:
import numpy as np

daily_visits = np.array([
    120, 130, 125, 140, 150,
    160, 170, 155, 145, 135,
    128, 132, 138, 142, 148,
    152, 158, 162, 168, 172,
    180, 190, 185, 175, 165,
    155, 145, 135, 125, 115
])

#print(daily_visits)

wkly_visits = daily_visits.reshape((6,5))
print(wkly_visits)

print("\nTotal visits per wk:\n", wkly_visits.sum(axis = 1))
# Keep the default float result; dtype=int would truncate the averages
print("\nAvg visit per day of the week:\n", wkly_visits.mean(axis = 0))

# np.array_equal returns a single True/False instead of an element-wise array
print("\nFlatten array equal original?:", np.array_equal(wkly_visits.flatten(), daily_visits))

[[120 130 125 140 150]
 [160 170 155 145 135]
 [128 132 138 142 148]
 [152 158 162 168 172]
 [180 190 185 175 165]
 [155 145 135 125 115]]

Total visits per wk:
 [665 765 688 812 895 675]

Avg visit per day of the week:
 [149.16666667 154.16666667 150.         149.16666667 147.5       ]

Flatten array equal original?: True


In [12]:
#Given solution
daily_visits = np.array([
    120, 130, 125, 140, 150, 160, 170, 155, 145, 135,
    128, 132, 138, 142, 148, 152, 158, 162, 168, 172,
    180, 190, 185, 175, 165, 155, 145, 135, 125, 115
])

# Reshape (6 weeks, 5 days - assuming a 5-day work week)
weekly_view = daily_visits.reshape(6, 5)

# Total visits per week
total_per_week = weekly_view.sum(axis=1)
print("Total visit per week:\n", total_per_week)

# Average visits per day of the week (Avg of all Day 1s, Day 2s, etc.)
avg_day_of_week = weekly_view.mean(axis=0)
print("Average visits per day of the week:\n", avg_day_of_week)

# Flatten back to 1D and compare with the original
flattened = weekly_view.flatten()
print("Back to original match?", np.array_equal(daily_visits, flattened))

Total visit per week:
 [665 765 688 812 895 675]
Average visits per day of the week:
 [149.16666667 154.16666667 150.         149.16666667 147.5       ]
Back to original match? True



---

## 6\. Mini-project: Simple scoring model

This mirrors the tiny matrix multiply example from class, but with more features.



In [14]:
# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])

customers = np.array(["C1", "C2", "C3", "C4", "C5"])


1. Choose a weight vector `w = [w_views, w_time, w_purchases]` (for example `[0.1, 0.5, 1.0]`).  
2. Compute a **score** for each customer using `scores = X @ w`.  
3. Rank customers by score (highest first).  
4. Change the weights to emphasize **past\_purchases** more than other features, and recompute.  
5. In a markdown cell, answer:  
   - Which customer is top-ranked before vs after changing weights?  
   - In what real-world situation might you prefer each weighting?



In [18]:
w = [0.1,0.5,1.0]
score = X @ w

print(score)

# argsort returns the indices that would sort the array
# We use [::-1] to get descending order (highest score first)
rank_indices = np.argsort(score)[::-1]

print("Customer Ranking (Best to Worst):")
for i, idx in enumerate(rank_indices):
    print(f"{i+1}. {customers[idx]} (Score: {score[idx]:.2f})")

[2.75 6.   5.   9.5  7.  ]
Customer Ranking (Best to Worst):
1. C4 (Score: 9.50)
2. C5 (Score: 7.00)
3. C2 (Score: 6.00)
4. C3 (Score: 5.00)
5. C1 (Score: 2.75)


In [19]:
#Given solution
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])
customers = np.array(["C1", "C2", "C3", "C4", "C5"])

# Weight set 1: Balanced
w1 = np.array([0.1, 0.5, 1.0])
scores1 = X @ w1

# Weight set 2: Heavy on Purchases
w2 = np.array([0.05, 0.1, 5.0])
scores2 = X @ w2

print("Balanced Rankings:", customers[np.argsort(-scores1)])
print("Purchase-heavy Rankings:", customers[np.argsort(-scores2)])

Balanced Rankings: ['C4' 'C5' 'C2' 'C3' 'C1']
Purchase-heavy Rankings: ['C4' 'C5' 'C2' 'C3' 'C1']
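
Both weightings above produce the same ranking. One way to see why is to break each score into per-feature contributions. This is a sketch only: the views-heavy vector `w_views` is a hypothetical third weighting added for contrast, not part of the given solution.

```python
import numpy as np

# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2],
])
customers = np.array(["C1", "C2", "C3", "C4", "C5"])

w1 = np.array([0.1, 0.5, 1.0])    # balanced
w2 = np.array([0.05, 0.1, 5.0])   # purchase-heavy
w_views = np.array([1.0, 0.1, 0.1])  # hypothetical: emphasizes page views

# Broadcasting multiplies each feature column by its weight;
# summing a row of contributions gives the same number as X @ w
contrib2 = X * w2
print(contrib2)
print(np.allclose(contrib2.sum(axis=1), X @ w2))  # True

rank1 = customers[np.argsort(-(X @ w1))]
rank2 = customers[np.argsort(-(X @ w2))]
rank_views = customers[np.argsort(-(X @ w_views))]

print(np.array_equal(rank1, rank2))  # True
print(rank_views)                    # ['C3' 'C5' 'C2' 'C4' 'C1']
```

In this data the customers with the most past purchases (C4, C5, C2) already lead under the balanced weights, so weighting purchases more heavily only widens the gaps. A weighting that favours page views does reorder the list.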


---

## 7\. Stretch ideas (optional)

If you’re comfortable with the above, try:

- Generate 1,000 random test scores with `np.random.randn`, scale them to have mean 70 and standard deviation 10, then:  
  - Clip scores to the range `[0, 100]`  
  - Compute min, max, mean, and standard deviation  
- Simulate a small A/B test:  
  - Two groups of 20 users each  
  - Randomly generate conversions (0 or 1\) for each group  
  - Compute conversion rate per group using pure NumPy operations
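
One possible starting point for both ideas, sketched with NumPy's `Generator` API (`np.random.default_rng`) instead of the `np.random.randn` call named above. The seed value and variable names are arbitrary choices, and the printed numbers depend on the seed.

```python
import numpy as np

# A seeded generator makes every draw below reproducible
rng = np.random.default_rng(seed=0)

# 1,000 scores drawn with mean 70 and standard deviation 10, then clipped to [0, 100]
scores = rng.normal(loc=70, scale=10, size=1000)
clipped = np.clip(scores, 0, 100)

print(f"min  = {clipped.min():.2f}")
print(f"max  = {clipped.max():.2f}")
print(f"mean = {clipped.mean():.2f}")
print(f"std  = {clipped.std():.2f}")

# A/B test: row 0 is group A, row 1 is group B, 20 users each.
# integers(0, 2) draws 0 or 1 because the upper bound is exclusive
conversions = rng.integers(0, 2, size=(2, 20))

# The mean of a 0/1 row is the share of users who converted
rates = conversions.mean(axis=1)
print(f"Group A: {rates[0]:.0%}, Group B: {rates[1]:.0%}")
```

Because both groups are drawn from the same 50/50 process, any difference between the two rates here is pure chance, which is a useful reminder of how noisy 20-user groups are.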

---

In [38]:
import numpy as np

# The seed further down is set after this draw, so it does not control these scores in a fresh session
test_score = (np.random.randn(1000)*10) + 70
clip_score = test_score.clip(0,100)

#print(clip_score)
print(f"min = {clip_score.min():.2f}")
print(f"mean = {clip_score.mean():.2f}")
print(f"std = {clip_score.std():.2f}")

#A/B test

# Set a seed so the results are the same every time you run it
np.random.seed(42)

# 1. Generate conversions (0 or 1) for 2 groups of 20 users
# We create a 2x20 matrix:
# Row 0 = Group A, Row 1 = Group B
conversions = np.random.randint(0, 2, size=(2, 20))

# 2. Compute conversion rate per group
# We use axis=1 to calculate the mean across the columns (users)
rates = conversions.mean(axis=1)

print(f"Group A Conversion Rate: {rates[0]:.2%}")
print(f"Group B Conversion Rate: {rates[1]:.2%}")

# 3. Quick comparison
diff = rates[1] - rates[0]
print(f"Lift (B vs A): {diff:+.2%}")

min = 37.59
mean = 70.29
std = 9.76
Group A Conversion Rate: 35.00%
Group B Conversion Rate: 65.00%
Lift (B vs A): +30.00%
