#### COMPANION WORKBOOK

# NumPy

To get the most out of this program, we strongly recommend that you:
1. First practice writing and implementing all of the code from the Coding Section of the online lesson.
2. Then, freely experiment with and explore any interesting or confusing concepts. Simply insert new code cells and then use the help of Google and official documentation.
3. Finally, tackle all of the exercises at the end. They will help you tie everything together and **learn in context.**

#### <span style="color:#555">LESSON CODE SANDBOX</span>

Use this space to practice writing and implementing all of the code from the Coding Section of the online lesson. Insert new code cells as needed, and feel free to write notes to yourself in Markdown.

First, let's import the actual NumPy library.

In [None]:
import numpy as np

## I. NumPy Arrays are homogeneous.
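
As a quick reminder of what "homogeneous" means, here is a minimal sketch with made-up values (the variable names are only for illustration). It suggests how NumPy settles on a single data type for every element of an array.

```python
import numpy as np

# Mixing integers and floats: every element is stored as a float
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)      # float64

# Mixing numbers and strings: every element is stored as a string
mixed_types = np.array([1, 'two', 3])
print(mixed_types.dtype.kind)   # U  (Unicode string)
```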

## II. NumPy Arrays are multidimensional.
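
The sketch below is one small illustration of a 2-dimensional array, using an invented `grid` of 2 rows and 3 columns. It shows where the number of dimensions and the shape can be read from.

```python
import numpy as np

# A nested list becomes a 2-dimensional array (2 rows, 3 columns)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)    # 2
print(grid.shape)   # (2, 3)
print(grid[1, 2])   # 6  (row index 1, column index 2)
```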

## III. NumPy math is elementwise.
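
To see elementwise math in action, here is a short sketch with two hypothetical arrays, `prices` and `quantities`. Each operation is applied position by position rather than to the array as a whole.

```python
import numpy as np

prices = np.array([10, 20, 30])
quantities = np.array([1, 2, 3])

# Arrays of the same shape are combined one position at a time
print(prices * quantities)   # [10 40 90]

# A single number is applied to every element
print(prices + 5)            # [15 25 35]
```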

## IV. NumPy is reliably random.
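
"Reliably random" can be checked with a sketch like the one below. It assumes the `np.random.seed` interface and an arbitrary seed of 123; resetting the seed before each draw should reproduce the same "random" numbers.

```python
import numpy as np

# Draw 5 random integers after setting the seed
np.random.seed(123)
first_draw = np.random.randint(0, 100, size=5)

# Reset the same seed and draw again
np.random.seed(123)
second_draw = np.random.randint(0, 100, size=5)

# Both draws contain exactly the same values
print(np.array_equal(first_draw, second_draw))   # True
```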

#### <span style="color:#555">EXERCISES</span>

Complete each of the following exercises.

## <span style="color:RoyalBlue">Exercise 4.1 - Business Park Arrays</span>

#### In the previous lesson...

In the previous lesson, we decided on Park Royal as our first stop in our training mission to London. Now we've arrived at the tube station! Park Royal is home to London's largest business park. 
* Assume the business park supports 1,700 businesses.
* HQ sent you a manual with their names and locations. 
* *They are listed in order from smallest to largest* (this will be important later).

#### A.) First, let's create a NumPy array called <code style="color:steelblue">business_ids</code>.
* The first business has ID <code>1</code>, the second one has ID <code>2</code> and so on.
* Remember that Python is zero-indexed... but your business ID array should still start from 1!
* The array should have shape <code style="color:steelblue">(1700,)</code>.
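
As a hint, here is a minimal sketch on a much smaller, made-up array called `toy_ids`. It relies on `np.arange`, which includes the start value but excludes the stop value.

```python
import numpy as np

# IDs 1 through 10: the stop value (11) is not included
toy_ids = np.arange(1, 11)

print(toy_ids)         # [ 1  2  3  4  5  6  7  8  9 10]
print(toy_ids.shape)   # (10,)
```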

In [None]:
# Create array of business_ids with values ranging from 1 to 1700


#### B.) Next, print the shape of <code style="color:steelblue">business_ids</code> to confirm it has shape <code style="color:steelblue">(1700,)</code>.

In [None]:
# Print shape of business_ids


<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
(1700,)
</pre>

#### C.) Finally, print the last 10 business ID's to confirm the array is set up properly.

In [None]:
# Print last 10 business ID's


<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[1691 1692 1693 1694 1695 1696 1697 1698 1699 1700]
</pre>

## <span style="color:RoyalBlue">Exercise 4.2 - Stratified Random Sampling</span>

There are **1700** businesses in Park Royal's business park. However, you only have **10** hours to stay in Park Royal before you need to move on to the next location. Assuming you can visit only **5** businesses per hour (50 in total), which businesses should you visit?

#### Well, 2 options immediately come to mind:
1. You could just visit the first 50 businesses.
2. You could randomly sample 50 businesses.

While these appear fine at first glance, there are potential flaws with both of these approaches. *Remember, we learned that the businesses were listed in order from smallest to largest! Therefore...*
1. Just visiting the first 50 would give us a biased sample of only the smallest businesses.
2. Visiting a random sample of 50 is better, but there's still a chance that our sample over-represents one size of business purely by luck.

#### Instead, let's take a *stratified random sample* based on the size of the business.
* Stratified random sampling is a sampling method in which you first group your observations based on a **key variable**, and then randomly sample from each of those groups.
* This will ensure your sample is **representative** of the broader dataset along that key variable. In other words, your sample will be "spread out" across different values of that key variable.

<p style="text-align:center">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/fa/Stratified_sampling.PNG" alt="Stratified Random Sampling" style="width: 280px;"/>
<br>
<em>Stratified Random Sampling</em>
</p>

#### For this exercise, stratified random sampling just means that...
* We'll start by splitting our businesses into 10 groups of 170 businesses each.
* The first group will have ID's from 1 to 170, the second will have ID's from 171 to 340, etc.
* Then we'll randomly select 5 businesses from each group of 170.
* Since the businesses are already ordered by size, this will ensure that small, medium, and big businesses are all represented in our sample.
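
To make the idea concrete, here is a rough sketch on a toy problem of 12 IDs split into 3 groups, with 2 picks per group. Note that it uses `np.array_split` for the grouping, which is a different route from the reshape-based approach the exercises build toward. The names `toy_ids`, `groups`, and `sample` are hypothetical, and the seed is arbitrary.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatable output

# 12 toy IDs, assumed to be already sorted by the key variable (size)
toy_ids = np.arange(1, 13)

# Split into 3 equal groups: IDs 1-4, 5-8, and 9-12
groups = np.array_split(toy_ids, 3)

# Randomly pick 2 IDs from each group, never picking the same ID twice
sample = [np.random.choice(group, size=2, replace=False) for group in groups]

for number, picks in enumerate(sample, start=1):
    print('Group {}: {}'.format(number, picks))
```

Because every group contributes the same number of picks, small, medium, and large IDs all appear in the sample.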

#### A.) First, reshape the 1-dimensional <code style="color:steelblue">business_ids</code> array into a new 2-dimensional <code style="color:steelblue">id_matrix</code> array.
* <code style="color:steelblue">id_matrix</code> should have **10 columns**... one for each group of businesses.
* How many rows should it have?
* What does the number of rows represent?

In [None]:
# Create id_matrix by reshaping business_ids to have 10 columns


# Print shape



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
(170, 10)
</pre>

Great, now we have a matrix with 10 columns representing 10 groups of businesses. But remember, our goal is to stratify our sample by size of the business, and because our businesses are ordered by size, the first group should be 1 to 170, the second group should be 171 to 340, and so on.

#### B.) Print the first column (group) of <code style="color:steelblue">id_matrix</code>.
* Does it contain businesses 1 to 170?

In [None]:
# Print first column of id_matrix


<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[   1   11   21   31   41   51   61   71   81   91  101  111  121  131  141
  151  161  171  181  191  201  211  221  231  241  251  261  271  281  291
  301  311  321  331  341  351  361  371  381  391  401  411  421  431  441
  451  461  471  481  491  501  511  521  531  541  551  561  571  581  591
  601  611  621  631  641  651  661  671  681  691  701  711  721  731  741
  751  761  771  781  791  801  811  821  831  841  851  861  871  881  891
  901  911  921  931  941  951  961  971  981  991 1001 1011 1021 1031 1041
 1051 1061 1071 1081 1091 1101 1111 1121 1131 1141 1151 1161 1171 1181 1191
 1201 1211 1221 1231 1241 1251 1261 1271 1281 1291 1301 1311 1321 1331 1341
 1351 1361 1371 1381 1391 1401 1411 1421 1431 1441 1451 1461 1471 1481 1491
 1501 1511 1521 1531 1541 1551 1561 1571 1581 1591 1601 1611 1621 1631 1641
 1651 1661 1671 1681 1691]
</pre>

Crap, that's not what we wanted. Ah, but we should have seen this coming!

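Remember, when you <code style="color:steelblue">.reshape</code> an array, the new array keeps the order of the elements and, by default, fills them in one row at a time. That means consecutive ID's end up side by side in a row, not stacked in a column. Let's see how we can solve this.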
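
A tiny sketch can make the fill order visible. It assumes a made-up array of just 6 values reshaped to 3 rows and 2 columns, using the default row-by-row order of `.reshape`.

```python
import numpy as np

toy = np.arange(1, 7)            # [1 2 3 4 5 6]
toy_matrix = toy.reshape(3, 2)   # filled one row at a time

print(toy_matrix)
# [[1 2]
#  [3 4]
#  [5 6]]

print(toy_matrix[0, :])   # [1 2]    first row: the first 2 elements
print(toy_matrix[:, 0])   # [1 3 5]  first column: every 2nd element
```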

## <span style="color:RoyalBlue">Exercise 4.3 - Toy Problem Interlude</span>

Let's walk through a miniature example of what we just did in the previous exercise, because it will be easier to understand by peeking under the hood. 
* Instead of 1700 businesses, let's say we only had **170** businesses to visit, but we still want to group them into **10** groups by ID.
* Therefore, we want the first group to have businesses **1 to 17**, the second group to have **18 to 34**, the third group to have **35 to 51**, and so on.

By using this type of "toy problem," we can actually print the entire matrix and see what's going on under the hood when we reshape an array.

#### A.) Create a new <code>mini_business_ids</code> array with only 170 businesses (10% of the size of our full problem).
* Then reshape it to only have 10 columns and store it as <code>mini_id_matrix</code>.
* Finally, display <code>mini_id_matrix</code>.

In [None]:
# Create array of 170 ID's


# Reshape to have 10 columns


# Display mini_id_matrix



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[[  1   2   3   4   5   6   7   8   9  10]
 [ 11  12  13  14  15  16  17  18  19  20]
 [ 21  22  23  24  25  26  27  28  29  30]
 [ 31  32  33  34  35  36  37  38  39  40]
 [ 41  42  43  44  45  46  47  48  49  50]
 [ 51  52  53  54  55  56  57  58  59  60]
 [ 61  62  63  64  65  66  67  68  69  70]
 [ 71  72  73  74  75  76  77  78  79  80]
 [ 81  82  83  84  85  86  87  88  89  90]
 [ 91  92  93  94  95  96  97  98  99 100]
 [101 102 103 104 105 106 107 108 109 110]
 [111 112 113 114 115 116 117 118 119 120]
 [121 122 123 124 125 126 127 128 129 130]
 [131 132 133 134 135 136 137 138 139 140]
 [141 142 143 144 145 146 147 148 149 150]
 [151 152 153 154 155 156 157 158 159 160]
 [161 162 163 164 165 166 167 168 169 170]]
</pre>

There! Do you see how it fills in the matrix one row at a time?
* So, when you select the first **column**, you're not getting businesses 1 to 17. 
* Instead, you're getting every 10th business: 1, 11, 21, 31, etc...

#### B.) Aha, now we're on to something... so instead of reshaping to 17 rows and 10 columns, let's try reshaping it to 10 rows and 17 columns.

In [None]:
# Reshape to have 10 rows


# Display mini id matrix



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17]
 [ 18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34]
 [ 35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51]
 [ 52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68]
 [ 69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85]
 [ 86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102]
 [103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119]
 [120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136]
 [137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153]
 [154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170]]
</pre>

Now we have 10 rows instead of 10 columns. 
* The first **row** has the businesses 1 to 17, which is what we want!
* And if we want the first **column** to have businesses 1 to 17, we can simply **transpose** this matrix.
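
Here is a small sketch of what transposing does, using an invented 2-by-3 matrix and the `.T` attribute. Rows become columns, so a group stored in the first row ends up in the first column.

```python
import numpy as np

toy_matrix = np.arange(1, 7).reshape(2, 3)
print(toy_matrix)
# [[1 2 3]
#  [4 5 6]]

# Transposing swaps rows and columns
flipped = toy_matrix.T
print(flipped)
# [[1 4]
#  [2 5]
#  [3 6]]

print(flipped[:, 0])   # [1 2 3]  the old first row is now the first column
print(flipped.shape)   # (3, 2)
```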

Working through toy problems is an underappreciated and very useful skill in data science... Toy problems are basically miniature versions of your problem that are easier to break apart and understand conceptually. You'll often be working with difficult datasets or solving tough analytical problems. If you get stuck, it can be helpful to work through a simpler version of the problem by reducing the size of your dataset or trying easier algorithms!

## <span style="color:RoyalBlue">Exercise 4.4 - SRS Round 2</span>

Let's try building that <code style="color:steelblue">id_matrix</code> again. 

#### A.) This time, reshape it to have 10 rows, each representing 1 group of businesses.
* The first row should contain businesses 1 to 170.

In [None]:
# Reshape business_ids to have 10 rows, with 170 businesses each


#### B.) Now, print the first *row* of <code style="color:steelblue">id_matrix</code>.

In [None]:
# Print the first row of id_matrix


<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168 169 170]
</pre>

You should see the businesses with ID's from **1 to 170**. Now, what if we still want our groups to be grouped by column, instead of by row? Let's *flip our rows and columns* now that the rows have the correct groups.

#### C.) Overwrite <code style="color:steelblue">id_matrix</code> with its transposed version.
* After you do so, print the first column in the new <code style="color:steelblue">id_matrix</code> to confirm the group is correct.
* Also print the shape of the new <code style="color:steelblue">id_matrix</code> to confirm your new dimensions are correct.

In [None]:
# Overwrite id_matrix with flipped version


# Print first column


# Print shape of new id_matrix



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168 169 170]
(170, 10)
</pre>

## <span style="color:RoyalBlue">Exercise 4.5 - Final Business Selections</span>

Let's take another look at our <code style="color:steelblue">id_matrix</code>. 

* In <span style="color:royalblue">Exercise 4.4C</span> we displayed the first group (column) and confirmed that the ID's were from **1 to 170**.
* Now let's confirm the rest of the array is correct by finding the <code style="color:steelblue">np.min()</code> and <code style="color:steelblue">np.max()</code> of each group.
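
As a hint on how per-group summaries work, the sketch below uses a made-up 3-by-2 matrix. Passing `axis=0` collapses the rows and returns one value per column, while `axis=1` returns one value per row.

```python
import numpy as np

toy_matrix = np.array([[1, 4],
                       [2, 5],
                       [3, 6]])

# One value per column
print(np.min(toy_matrix, axis=0))   # [1 4]
print(np.max(toy_matrix, axis=0))   # [3 6]

# One value per row, for comparison
print(np.min(toy_matrix, axis=1))   # [1 2 3]
```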

#### A.) First, create and print an object called <code style="color:steelblue">group_min</code> with the minimum ID of each group (column) of <code style="color:steelblue">id_matrix</code>.

In [None]:
# Create group_min


# Print group_min



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[   1  171  341  511  681  851 1021 1191 1361 1531]
</pre>

#### B.) Next, create and print an object called <code style="color:steelblue">group_max</code> with the maximum ID of each group (column) of <code style="color:steelblue">id_matrix</code>.

In [None]:
# Create group_max


# Print group_max



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[ 170  340  510  680  850 1020 1190 1360 1530 1700]
</pre>

#### C.) Next, subtract <code style="color:steelblue">group_min</code> from <code style="color:steelblue">group_max</code> to confirm that each of the 10 groups has a <span style="color:tomato">range</span> of 170.
* Remember to add 1 to the difference between max and min because the ends are inclusive.
    * e.g. for the first group, $170 - 1 = 169$, but the group contains $169 + 1 = 170$ businesses.
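
The off-by-one in an inclusive range can be checked with a quick sketch. The names `low`, `high`, and `ids` are only for illustration.

```python
import numpy as np

low, high = 1, 170

# All ID's from low to high, with both ends included
ids = np.arange(low, high + 1)

print(high - low)       # 169
print(high - low + 1)   # 170
print(len(ids))         # 170
```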

In [None]:
# Print range of each group


<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
[170 170 170 170 170 170 170 170 170 170]
</pre>

Great. We are now ready to select 5 businesses from each group in <code style="color:steelblue">id_matrix</code>.

#### D.) Finally, write a loop that chooses 5 businesses from each column of <code style="color:steelblue">id_matrix</code>.
* Set the random seed to 123. (This is just an arbitrary seed chosen for replicable results... you could technically choose any seed you'd like.)
* Format the output like in the <span style="color:royalblue">Expected output</span> cell below.
* Should we sample with replacement or without replacement? **Hint:** What happens if a business is selected twice?
* Bonus points for writing the loop in a way that works with any number of columns.
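
To explore the replacement question, here is a sketch on a hypothetical group of only 5 ID's. It assumes `np.random.choice`, whose `replace` argument controls whether the same value may be drawn more than once; the seed is arbitrary.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatable output

toy_group = np.arange(1, 6)   # [1 2 3 4 5]

# With replacement (the default), the same ID may be drawn more than once
with_replacement = np.random.choice(toy_group, size=5, replace=True)

# Without replacement, every drawn ID is distinct
without_replacement = np.random.choice(toy_group, size=5, replace=False)

print(with_replacement)
print(without_replacement)
print(len(np.unique(without_replacement)))   # 5
```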

In [None]:
# Set random seed


# Print selected businesses from each group



<strong style="color:RoyalBlue">Expected output:</strong>

<pre>
Group 1: [ 92  73 139  54 147]
Group 2: [182 333 219 292 214]
Group 3: [482 384 400 458 477]
Group 4: [662 603 676 594 609]
Group 5: [827 718 829 845 803]
Group 6: [979 893 932 864 984]
Group 7: [1115 1103 1093 1062 1160]
Group 8: [1308 1248 1231 1222 1330]
Group 9: [1492 1406 1401 1456 1398]
Group 10: [1680 1674 1697 1534 1540]
</pre>

Awesome, now we're ready to start visiting the businesses... Let's start with Group 1 and the business with ID 7. After checking our manual, we find out that this business happens to be a flower shop. We'll see why that's relevant in the next lesson.

#### The mission continues in the next lesson...

In this lesson, you narrowed the list of businesses down through stratified random sampling. In the next lesson, you'll visit the flower shop and offer help using your data analysis skills!