# Lab 3

# Introduction to NumPy

In this lab, you'll be working through Chapter 2 to get an introduction to the numerical computing package for Python, NumPy. This notebook is made up of two sections.

- Section 1: Work through the code samples in Chapter 2
- Section 2: Exercises

# Section 1: Code Practice

In this section, you will be reading through the various chapter sections and typing out/running the code samples given in the sections. The purpose of this is for you to practice using Jupyter to run Python code as well as learn about the functionality available to you in both IPython and Jupyter.

##### Executing code in Jupyter

When typing and executing code in Jupyter, it is helpful to know the various keyboard shortcuts. You can find the full list of these by clicking **Help &rarr; Keyboard Shortcuts** in the menu. However, the three most useful keyboard shortcuts are:

- `Shift-Enter`: Execute the current cell and advance to the next cell. This will create one if none exists, but if a cell exists below your current cell, a new cell will **not** be created.
- `Alt-Enter`: Execute the current cell and **create** a new cell below.
- `Control-Enter`: Execute the current cell without advancing to the next cell.

When writing your code, you will be using these commands to make sure input/output (`In`/`Out`) is consistent with what is found in the chapter. If you create a cell by mistake, you can always go to **Edit &rarr; Delete Cells** to remove it.

#### Purpose of Section 1

Your purpose in this section is to:

- **Type out** the code examples from the chapter (do not copy and paste)
- **Run** them
- **Check** to **make sure** you are getting the same results as what is contained in the chapter

---




## Computation on Arrays: Broadcasting

[Chapter/Section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.05-Computation-on-arrays-broadcasting.ipynb)

### Introducing Broadcasting

### Rules of Broadcasting

#### Broadcasting example 1

#### Broadcasting example 2

#### Broadcasting example 3

### Broadcasting in Practice

#### Centering an array

#### Plotting a two-dimensional function

## Comparisons, Masks, and Boolean Logic

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.06-Boolean-Arrays-and-Masks.ipynb)

### Example: Counting Rainy Days

In [7]:
import pandas as pd
import numpy as np

In [8]:
def array_from_file(filename, column):

    data = pd.read_csv(filename)
    return np.array(data[column])

rainfall = array_from_file('Seattle2014.csv', 'PRCP')

Start the next cell at `inches = rainfall / 254.0`

In [9]:
rainfall_inches = rainfall / 25.4
rainy_days = np.sum(rainfall_inches > 0.1)

In [10]:
print(f"Number of rainy days: {rainy_days}")

Number of rainy days: 150


### Comparison Operators as UFuncs

In [11]:
rainfall_inches = rainfall / 25.4

In [12]:
rainy_days_bool = rainfall_inches > 0.1

In [13]:
rainy_days = np.sum(rainy_days_bool)

In [14]:
tmax = array_from_file('Seattle2014.csv', 'TMAX')

In [15]:
hotdaysbool = tmax > 100

In [16]:
hotdays = np.sum(hotdaysbool)

In [17]:
print(f"Number of days with TMAX above 100 (tenths of °C, i.e. above 10 °C): {hotdays}")

Number of days with TMAX above 100 (tenths of °C, i.e. above 10 °C): 297


### Working with Boolean Arrays

In [19]:
tmaxf = tmax * 0.1 * 9/5 + 32
rainydaysbool = rainfall > 0
colddaysbool = tmaxf < 32
rainyandcolddaysbool = rainydaysbool & colddaysbool
rainyandcolddays = np.sum(rainyandcolddaysbool)

In [27]:
data = pd.read_csv('Seattle2014.csv')
dates = data['DATE'][rainyandcolddaysbool]
print(f"Number of days with rain and freezing temperatures: {rainyandcolddays}")
print("Dates with rain and freezing temperatures:")
print(dates.info())

Number of days with rain and freezing temperatures: 0
Dates with rain and freezing temperatures:
<class 'pandas.core.series.Series'>
Int64Index: 0 entries
Series name: DATE
Non-Null Count  Dtype
--------------  -----
0 non-null      int64
dtypes: int64(1)
memory usage: 0.0 bytes
None


#### Counting entries

In [28]:
tmaxf = tmax * 0.1 * 9/5 + 32

noprecipitationbool = rainfall == 0
warmdaysbool = tmaxf > 75

noprecipitationandwarmdaysbool = noprecipitationbool & warmdaysbool

In [29]:
countnoprecipitationandwarmdays = np.sum(noprecipitationandwarmdaysbool)

In [30]:
print(f"Number of days with no precipitation and temperatures above 75°F: {countnoprecipitationandwarmdays}")

Number of days with no precipitation and temperatures above 75°F: 74
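The counting pattern above (build a Boolean mask, then sum it) can be sketched on a small made-up data set. The arrays `temps_f` and `rain_mm` below are illustrative values, not the Seattle data; `np.count_nonzero`, `np.any`, and `np.all` are standard NumPy functions for summarizing masks.

```python
import numpy as np

# Hypothetical toy data: daily high temperature (°F) and rainfall (mm).
temps_f = np.array([30.0, 45.5, 78.2, 81.0, 60.3])
rain_mm = np.array([0.0, 2.5, 0.0, 0.0, 1.2])

# Combine two element-wise conditions; the parentheses are required.
warm_and_dry = (temps_f > 75) & (rain_mm == 0)

# True counts as 1, so summing or counting non-zeros gives the number of matches.
n_warm_and_dry = np.count_nonzero(warm_and_dry)

# any/all answer yes-or-no questions about the whole array.
any_freezing = np.any(temps_f < 32)
all_nonnegative = np.all(rain_mm >= 0)

print(n_warm_and_dry, any_freezing, all_nonnegative)
```

`np.sum(mask)` and `np.count_nonzero(mask)` give the same count for Boolean arrays; the latter simply states the intent more directly.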


### Aside: Using the Keywords `and`/`or` Versus the Operators `&`/`|`

In [31]:
a = np.array([True, False, True, False])
b = np.array([True, True, False, False])

In [35]:
a & b

array([ True, False, False, False])

In [36]:
a | b

array([ True,  True,  True, False])

In [33]:
np.logical_and(a, b)

array([ True, False, False, False])

In [34]:
np.logical_or(a, b)

array([ True,  True,  True, False])
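To see why the keywords cannot replace the operators, the following sketch reuses `a` and `b` from above and tries `a and b`. The keyword asks for the truth value of the entire array, which NumPy refuses to guess for arrays with more than one element. The variable names `elementwise`, `raised`, and `between` are introduced here only for illustration.

```python
import numpy as np

a = np.array([True, False, True, False])
b = np.array([True, True, False, False])

# Element-wise AND: one result per position.
elementwise = a & b

# `and` needs a single truth value for each operand, which is ambiguous
# for a multi-element array, so NumPy raises ValueError.
try:
    a and b
    raised = False
except ValueError:
    raised = True

# With comparisons, wrap each condition in parentheses because
# & binds more tightly than < and >.
x = np.arange(10)
between = (x > 2) & (x < 7)

print(elementwise, raised, x[between])
```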

## Fancy Indexing

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.07-Fancy-Indexing.ipynb)

### Exploring Fancy Indexing

In [37]:
arr = np.array([10, 20, 30, 40, 50])

In [38]:
indices = np.array([1, 3, 4])

In [39]:
selectedelements = arr[indices]

In [40]:
print(arr)

[10 20 30 40 50]


In [41]:
print(indices)

[1 3 4]


In [43]:
print(selectedelements)

[20 40 50]


### Combined Indexing

In [63]:
arr = np.array([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90]])

In [64]:
row_indices = np.array([0, 2])
column_slice = slice(1, 3)

In [65]:
selected_elements = arr[row_indices, column_slice]

In [66]:
print(arr)

[[10 20 30]
 [40 50 60]
 [70 80 90]]


In [67]:
print(row_indices)

[0 2]


In [68]:
print(column_slice)

slice(1, 3, None)


In [69]:
print(selected_elements)

[[20 30]
 [80 90]]


### Example: Selecting Random Points

In [79]:
arr = np.array([1, 2, 3, 4, 50, 6, 70, 80, 9, 100000])

In [80]:
points = 5

In [81]:
random_indices = np.random.choice(arr.size, size=points, replace=False)

In [82]:
random_points = arr[random_indices]

In [83]:
print(random_indices)

[5 9 8 0 6]


In [84]:
print(random_points)

[     6 100000      9      1     70]
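Because no seed is set before this cell, the indices printed above will differ from run to run. A minimal sketch of a reproducible variant follows, assuming NumPy's `Generator` interface (`np.random.default_rng`) is acceptable for the lab; the seed value `0` is an arbitrary choice.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 50, 6, 70, 80, 9, 100000])

# A seeded generator yields the same draw every time the cell is run.
rng = np.random.default_rng(seed=0)
random_indices = rng.choice(arr.size, size=5, replace=False)
random_points = arr[random_indices]

# A second generator with the same seed reproduces the same indices.
rng_again = np.random.default_rng(seed=0)
same_indices = rng_again.choice(arr.size, size=5, replace=False)

print(random_indices, random_points)
```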


### Modifying Values with Fancy Indexing

In [85]:
indices = np.array([2, 5, 7])

In [86]:
new_values = np.array([300, 600, 800])

In [87]:
arr[indices] = new_values

In [88]:
arr

array([     1,      2,    300,      4,     50,    600,     70,    800,
            9, 100000])
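One subtlety when modifying values this way is repeated indices. The sketch below (with made-up `counts` and `idx` arrays) shows that `counts[idx] += 1` does not accumulate over repeats, whereas the ufunc method `np.add.at` applies the operation once per occurrence.

```python
import numpy as np

idx = np.array([0, 0, 2, 2, 2])

# Buffered update: each distinct index is written once,
# so repeated indices do not add up.
counts = np.zeros(5, dtype=int)
counts[idx] += 1
buffered = counts.copy()

# Unbuffered update: np.add.at adds 1 for every occurrence of an index.
counts = np.zeros(5, dtype=int)
np.add.at(counts, idx, 1)
unbuffered = counts.copy()

print(buffered)
print(unbuffered)
```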

### Example: Binning Data

In [90]:
data = np.array([1.5, 2.2, 3.8, 4.1, 5.5, 6.2, 7.0])

In [91]:
bins = np.array([1, 3, 5, 7])

In [92]:
bin_indices = np.digitize(data, bins)

In [93]:
bins

array([1, 3, 5, 7])

In [94]:
for i, index in enumerate(bin_indices):
    print(f"point {data[i]} goes into bin {index}")

point 1.5 goes into bin 1
point 2.2 goes into bin 1
point 3.8 goes into bin 2
point 4.1 goes into bin 2
point 5.5 goes into bin 3
point 6.2 goes into bin 3
point 7.0 goes into bin 4
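Once each point has a bin index, counting how many points fall into each bin is a short follow-up step. The sketch below reuses the same `data` and `bins` and counts with `np.bincount`; the name `counts_per_bin` is introduced here for illustration.

```python
import numpy as np

data = np.array([1.5, 2.2, 3.8, 4.1, 5.5, 6.2, 7.0])
bins = np.array([1, 3, 5, 7])

# np.digitize returns i such that bins[i-1] <= x < bins[i];
# 0 means "below bins[0]" and len(bins) means "at or above bins[-1]".
bin_indices = np.digitize(data, bins)

# Count occurrences of each bin index, including the two overflow bins.
counts_per_bin = np.bincount(bin_indices, minlength=len(bins) + 1)

print(bin_indices)
print(counts_per_bin)
```

Note that `7.0` lands in index 4 because, with the default `right=False`, the right edge of each bin is exclusive.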


## Sorting Arrays

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.08-Sorting.ipynb)

In [102]:
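def bubblesort(arr: list) -> list:
    """Sort a list in place with bubble sort and return it."""
    n = len(arr)
    for i in range(n):
        swapped = False
        # The last i elements are already in place, so each pass can stop earlier.
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        # A full pass without swaps means the list is sorted.
        if not swapped:
            break
    return arr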

In [103]:
np.random.seed(0)
array = np.random.randint(0, 100000, size=10000)

In [104]:
arraylist = array.tolist()

In [105]:
arraylist

[68268,
 43567,
 42613,
 45891,
 21243,
 ...]


In [106]:
sorted_array_list = bubblesort(arraylist)
sorted_array = np.array(sorted_array_list)

In [108]:
sorted_array

array([   10,    13,    21, ..., 99965, 99975, 99980])

### Fast Sorting in NumPy: `np.sort` and `np.argsort`

In [109]:
np.sort(array)

array([   10,    13,    21, ..., 99965, 99975, 99980])

In [110]:
np.argsort(array)

array([3601, 5385, 8897, ..., 9718,  501, 5468])
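The relationship between the two functions can be checked directly: indexing an array with its `argsort` result gives the same values as `np.sort`. The sketch below uses a small random array (`values`, seeded arbitrarily) rather than the 10,000-element array above.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
values = rng.integers(0, 100, size=10)

# argsort returns the indices that would sort the array.
order = np.argsort(values)

# Fancy indexing with those indices reproduces the sorted array.
sorted_via_argsort = values[order]

# Reversing the index array gives a descending sort.
descending = values[order[::-1]]

print(values)
print(sorted_via_argsort)
print(descending)
```

This is handy when one array's order should be applied to another, for example sorting names by a parallel array of scores.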

#### Sorting along rows or columns

In [117]:
array = np.array([[9, 3, 1],
                  [4, 2, 7],
                  [8, 6, 5]])

In [118]:
print(array)

[[9 3 1]
 [4 2 7]
 [8 6 5]]


In [119]:
rows = np.sort(array, axis=1)

In [120]:
print(rows)

[[1 3 9]
 [2 4 7]
 [5 6 8]]


In [121]:
columns = np.sort(array, axis=0)

In [122]:
print(columns)

[[4 2 1]
 [8 3 5]
 [9 6 7]]


### Partial Sorts: Partitioning

In [141]:
array = np.random.randint(0, 100, size=20)
k = 3

In [142]:
partitionedarr = np.partition(array, k)

In [143]:
np.sort(partitionedarr[:k])

array([3, 6, 7])

In [144]:
smallest_elements = np.sort(partitionedarr[:k])

In [145]:
smallest_elements

array([3, 6, 7])
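As a companion to `np.partition`, `np.argpartition` returns the positions of the smallest elements rather than their values. The sketch below uses a fixed, made-up array (`values`) so the result is easy to verify by eye.

```python
import numpy as np

values = np.array([7, 2, 9, 4, 1, 8, 3])
k = 3

# After partitioning at k, the first k entries are the k smallest values,
# in no particular order.
part = np.partition(values, k)
smallest = np.sort(part[:k])

# argpartition gives the indices of those same k smallest values.
smallest_idx = np.argpartition(values, k)[:k]

print(smallest)
print(np.sort(values[smallest_idx]))
```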

### Example: k-Nearest Neighbors

In [146]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [147]:
iris = load_iris()
X = iris.data  
y = iris.target  

In [148]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

In [149]:
k = 3

In [150]:
knn = KNeighborsClassifier(n_neighbors=k)

In [151]:
knn.fit(Xtrain, ytrain)

KNeighborsClassifier(n_neighbors=3)

In [152]:
ypred = knn.predict(Xtest)

In [153]:
accuracy = accuracy_score(ytest, ypred)

In [154]:
print(f"Predicted labels for the test set: {ypred}")

Predicted labels for the test set: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2 1 1 2 0 2 0 0]


In [155]:
print(f"True labels for the test set: {ytest}")

True labels for the test set: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1 1 1 2 0 2 0 0]


In [156]:
print(f"Accuracy of the k-NN classifier with k={k}: {accuracy:.2f}")

Accuracy of the k-NN classifier with k=3: 0.98
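The cells above rely on scikit-learn's classifier. The neighbor search itself can also be sketched with NumPy alone, combining broadcasting, aggregation, and `np.argpartition`. This is a brute-force illustration on ten made-up 2-D points, not a description of how `KNeighborsClassifier` is implemented.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
X = rng.random((10, 2))           # ten hypothetical points in the plane

# Pairwise differences via broadcasting: (10, 1, 2) - (1, 10, 2) -> (10, 10, 2)
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]

# Squared Euclidean distance between every pair of points: shape (10, 10)
dist_sq = np.sum(diff ** 2, axis=-1)

k = 2
# For each row, the first k + 1 indices are the point itself (distance 0)
# plus its k nearest neighbors, in no particular order.
nearest = np.argpartition(dist_sq, k, axis=1)[:, :k + 1]

print(nearest)
```

Squared distances are enough here because taking the square root would not change the ordering of neighbors.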


## Structured Data: NumPy's Structured Arrays

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.09-Structured-Data-NumPy.ipynb)

In [157]:
dtype = [('name', 'U10'), ('age', 'i4'), ('height', 'f4')]

In [158]:
data = [('Alice', 30, 5.5), ('Bob', 25, 5.8), ('Charlie', 35, 5.9)]

In [159]:
structured_array = np.array(data, dtype=dtype)

In [160]:
structured_array

array([('Alice', 30, 5.5), ('Bob', 25, 5.8), ('Charlie', 35, 5.9)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('height', '<f4')])

In [None]:
"""https://numpy.org/doc/stable/user/basics.rec.html
was used for this part; they have example lists in their documentation
that I used for this part."""

### More Advanced Compound Types

In [171]:
nested_dtype = [('score', 'f4'), ('subject', 'U10')]
compound_dtype = [('name', 'U10'), 
                  ('age', 'i4'), 
                  ('grades', nested_dtype), 
                  ('hobbies', 'O')] 

In [173]:
data = [
    ('Alice', 30, [('Math', 95.0), ('English', 88.5)], ['Reading', 'Swimming']),
    ('Bob', 25, [('Math', 78.0), ('English', 82.0)], ['Cycling', 'Hiking']),
    ('Charlie', 35, [('Math', 89.0), ('English', 92.0)], ['Photography', 'Traveling'])
]

In [175]:
advanced_structured_array = np.empty(len(data), dtype=compound_dtype)

In [176]:
print(advanced_structured_array)

[('', 0, (0., ''), None) ('', 0, (0., ''), None) ('', 0, (0., ''), None)]
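The array printed above is still empty because `np.empty` only allocates storage. A minimal sketch of filling a nested structured array follows, assuming one `(score, subject)` pair per person so that the records match `nested_dtype`; the `hobbies` object field is left out to keep the example small, and the name `records` is introduced here for illustration.

```python
import numpy as np

nested_dtype = [('score', 'f4'), ('subject', 'U10')]
compound_dtype = [('name', 'U10'),
                  ('age', 'i4'),
                  ('grades', nested_dtype)]

# Each record is a tuple; the nested field is itself a tuple
# in the same order as nested_dtype: (score, subject).
records = np.array([('Alice', 30, (95.0, 'Math')),
                    ('Bob', 25, (78.0, 'Math'))],
                   dtype=compound_dtype)

# Nested fields are reached by chaining field names.
scores = records['grades']['score']
subjects = records['grades']['subject']

print(records)
print(scores, subjects)
```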


### RecordArrays: Structured Arrays with a Twist

In [181]:
grades_dtype = [('subject', 'U10'), ('score', 'f4')]

In [189]:
student_dtype = [('name', 'U10'), 
                 ('age', 'i4'), 
                 ('grades', grades_dtype), 
                 ('enrolled', 'b')]  # 'b' is a signed 8-bit integer (shown as 'i1' in the output); '?' would give a Boolean field

In [190]:
data = [
    ('Alice', 30, [('Math', 95.0), ('English', 88.5)], True),
    ('Bob', 25, [('Math', 78.0), ('English', 82.0)], False),
    ('Charlie', 35, [('Math', 89.0), ('English', 92.0)], True)
]

In [191]:
structured_array = np.empty(3, dtype=student_dtype)


In [192]:
structured_array

array([('', 0, ('', 0.), 0), ('', 0, ('', 0.), 0), ('', 0, ('', 0.), 0)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('grades', [('subject', '<U10'), ('score', '<f4')]), ('enrolled', 'i1')])

In [None]:
"""https://numpy.org/doc/stable/user/basics.rec.html
was used for this part; they have example lists in their documentation
that I used for this part."""
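The "twist" of a record array is that fields can be read as attributes instead of dictionary-style keys. The sketch below uses a small flat dtype (name, age, height) with made-up values and views it as `np.recarray`; the names `people` and `people_rec` are introduced here for illustration.

```python
import numpy as np

dtype = [('name', 'U10'), ('age', 'i4'), ('height', 'f4')]
people = np.array([('Alice', 30, 5.5), ('Bob', 25, 5.8), ('Charlie', 35, 5.9)],
                  dtype=dtype)

# Viewing the same data as a record array enables attribute access.
people_rec = people.view(np.recarray)

ages_by_key = people['age']        # structured-array style
ages_by_attr = people_rec.age      # record-array style

# Boolean masks work on fields just as they do on ordinary arrays.
younger_names = people[people['age'] < 32]['name']

print(ages_by_attr, younger_names)
```

Attribute access is convenient to type, but it adds some lookup overhead compared with indexing by field name.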

---

# Section 2: Exercises

In this section, you will be provided a few exercises to demonstrate your understanding of the chapter contents. Each exercise will have a Markdown section describing the problem, and you will provide cells below the description with code, comments and visual demonstrations of your solution.

---

### Problem 1

Make sure you have the `array_from_url` function defined:

```python
def array_from_url(url, column):
    import pandas as pd
    import numpy as np
    data = pd.read_csv(url)
    return np.array(data[column])
```

Using the `array_from_url` function, load the following two data sets into memory using the variable names provided:

- variable: `areas`
    - URL: `"https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv"`
    - column: `"area (sq. mi)"`
- variable: `populations`
    - URL: `"https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-population.csv"`
    - column: `"population"`

Compute a new variable: `pop_density` containing the population density of each of the states (plus D.C. and Puerto Rico). Population density is defined as the population divided by the area.

Use this NumPy array to answer the following questions.

- Which state has the highest population density and what is it?
- Which territory has the highest population density and what is it?
- What is the mean population density of just the United States in 2012?
- What is the mean population density of the United States and territories in 2012?

In [227]:
url = "https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv"
df_area = pd.read_csv(url)

In [228]:
state_abbreviations = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA',
    'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT',
    'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM',
    'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
    'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC', 
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT', 
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY',
    'District of Columbia': 'DC', 'Puerto Rico': 'PR'
}

In [229]:
df_area['state'] = df_area['state'].map(state_abbreviations)

In [230]:
print(df_area)

   state  area (sq. mi)
0     AL          52423
1     AK         656425
2     AZ         114006
3     AR          53182
4     CA         163707
5     CO         104100
6     CT           5544
7     DE           1954
8     FL          65758
9     GA          59441
10    HI          10932
11    ID          83574
12    IL          57918
13    IN          36420
14    IA          56276
15    KS          82282
16    KY          40411
17    LA          51843
18    ME          35387
19    MD          12407
20    MA          10555
21    MI          96810
22    MN          86943
23    MS          48434
24    MO          69709
25    MT         147046
26    NE          77358
27    NV         110567
28    NH           9351
29    NJ           8722
30    NM         121593
31    NY          54475
32    NC          53821
33    ND          70704
34    OH          44828
35    OK          69903
36    OR          98386
37    PA          46058
38    RI           1545
39    SC          32007
40    SD        

In [231]:
url = "https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-population.csv"
df_population = pd.read_csv(url)

In [232]:
df_population = df_population[(df_population['ages'] == 'total') & (df_population['year'] == 2012)]

In [233]:
print(df_population)

     state/region   ages  year   population
1              AL  total  2012    4817528.0
95             AK  total  2012     730307.0
97             AZ  total  2012    6551149.0
191            AR  total  2012    2949828.0
193            CA  total  2012   37999878.0
287            CO  total  2012    5189458.0
289            CT  total  2012    3591765.0
383            DE  total  2012     917053.0
385            DC  total  2012     633427.0
479            FL  total  2012   19320749.0
480            GA  total  2012    9915646.0
575            HI  total  2012    1390090.0
576            ID  total  2012    1595590.0
671            IL  total  2012   12868192.0
672            IN  total  2012    6537782.0
767            IA  total  2012    3075039.0
768            KS  total  2012    2885398.0
863            KY  total  2012    4379730.0
864            LA  total  2012    4602134.0
959            ME  total  2012    1328501.0
960            MD  total  2012    5884868.0
1055           MA  total  2012  

In [234]:
df_area.columns = ['state', 'area_sq_mi']
df_population_2012 = df_population[(df_population['ages'] == 'total') & (df_population['year'] == 2012)]
df_merged = pd.merge(df_population_2012, df_area, left_on='state/region', right_on='state')
df_merged['population_density'] = df_merged['population'] / df_merged['area_sq_mi']

In [235]:
state_max_density = df_merged[df_merged['state/region'] != 'PR']
max_density_state = state_max_density.loc[state_max_density['population_density'].idxmax()]

In [236]:
territory_max_density = df_merged[df_merged['state/region'] == 'PR']
max_density_territory = territory_max_density.loc[territory_max_density['population_density'].idxmax()]

In [237]:
pop_density = df_merged[df_merged['state/region'] != 'PR']['population_density'].mean()

In [240]:
total_population_density = df_merged['population_density'].mean()

In [241]:
print(f"highest population density: {max_density_state['state/region']} with density {max_density_state['population_density']:.2f} people per square mile")

highest population density: DC with density 9315.10 people per square mile


In [242]:
print(f"highest population density: {max_density_territory['state/region']} with density {max_density_territory['population_density']:.2f} people per square mile")

highest population density: PR with density 1038.85 people per square mile


In [243]:
print(f"mean population density in 2012: {pop_density:.2f} people per square mile")

mean population density in 2012: 347.54 people per square mile


In [244]:
print(f"Mean population density of states and territories in 2012: {total_population_density:.2f} people per square mile")

Mean population density of states and territories in 2012: 360.84 people per square mile
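The means reported above average the per-state densities, which is one reading of "mean population density". Another reading is total population divided by total area, and the two can differ widely. The sketch below uses two made-up regions (not the real state data) to show the difference.

```python
import numpy as np

# Hypothetical regions: a small dense one and a large sparse one.
population = np.array([1_000_000, 100_000])
area = np.array([1_000, 10_000])          # square miles

density = population / area               # per-region density

# Reading 1: average of the per-region densities.
mean_of_densities = density.mean()

# Reading 2: overall density of the combined regions.
overall_density = population.sum() / area.sum()

print(density, mean_of_densities, overall_density)
```

Which reading is intended depends on the question being asked, so it is worth stating the choice alongside the result.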


---

### Problem 2

Using the `array_from_url` function, load the following data set into memory using the variable name provided:

- variable: `titanic`
    - URL: `"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"`
    - column: `"Age"`



Answer the following questions:

- What are the minimum, maximum, and mean ages of the following types of passengers on the Titanic?
    - All passengers
    - Survivors 
    - Those that died
- What percentage of male passengers died?
- What percentage of female passengers died?
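One thing to watch for in these questions is that the Titanic `Age` column contains missing values, so plain `np.mean` returns `nan`. The sketch below is not a solution; it uses small made-up arrays (`age`, `survived`, `sex`) to illustrate the NaN-aware aggregates and a mask-based percentage.

```python
import numpy as np

# Hypothetical toy data standing in for three Titanic columns.
age = np.array([22.0, 38.0, np.nan, 35.0, 4.0, np.nan, 54.0])
survived = np.array([0, 1, 1, 1, 1, 0, 0])
sex = np.array(['male', 'female', 'female', 'female', 'male', 'male', 'male'])

# Plain aggregates propagate NaN; the nan* versions skip missing entries.
plain_mean = np.mean(age)
min_age, max_age, mean_age = np.nanmin(age), np.nanmax(age), np.nanmean(age)

# Restrict to a subgroup with a Boolean mask before aggregating.
mean_age_survivors = np.nanmean(age[survived == 1])

# The mean of a Boolean array is the fraction of True values.
male = sex == 'male'
pct_male_died = 100 * np.mean(survived[male] == 0)

print(min_age, max_age, mean_age, mean_age_survivors, pct_male_died)
```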


---

### Problem 3

Define the following function:

```python
def titanic_structured():
    import pandas as pd, numpy as np
    data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
    cols = ['survived', 'pclass', 'sex', 'age', 'fare']
    sarray = np.zeros(len(data), dtype={'names':cols,'formats':('i4','i4','U10','f8','f8')})
    sarray['survived'] = data.Survived
    sarray['pclass'] = data.Pclass
    sarray['sex'] = data.Sex
    sarray['age'] = data.Age
    sarray['fare'] = data.Fare
    return sarray
```

Assign the output of this function to a new variable `titanic_new`, and answer the following questions:

- What is the average age of men that survived?
- What is the average age of women that survived?
- What is the [mode](https://www.mathsisfun.com/definitions/mode.html) of the class of survivors?
- What is the mode of the class of those that died?
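For the mode questions, one possible approach is `np.unique` with `return_counts=True`. The sketch below is not a solution; it builds a six-row structured array of made-up passengers (`passengers`) with the same field names as `titanic_structured` (minus `fare`) to illustrate the idea.

```python
import numpy as np

# Hypothetical miniature version of the structured array.
passengers = np.zeros(6, dtype={'names': ['survived', 'pclass', 'sex', 'age'],
                                'formats': ['i4', 'i4', 'U10', 'f8']})
passengers['survived'] = [1, 1, 1, 0, 0, 0]
passengers['pclass'] = [1, 1, 3, 3, 3, 2]
passengers['sex'] = ['female', 'male', 'female', 'male', 'male', 'female']
passengers['age'] = [29.0, 40.0, np.nan, 19.0, 33.0, 50.0]

def mode_of(values):
    """Return the most frequent value (the smallest one if there is a tie)."""
    uniques, counts = np.unique(values, return_counts=True)
    return uniques[np.argmax(counts)]

survivors = passengers[passengers['survived'] == 1]
died = passengers[passengers['survived'] == 0]

mode_class_survivors = mode_of(survivors['pclass'])
mode_class_died = mode_of(died['pclass'])

# Masks on two fields can be combined before a NaN-aware mean.
women_survived = passengers[(passengers['sex'] == 'female') & (passengers['survived'] == 1)]
mean_age_women_survived = np.nanmean(women_survived['age'])

print(mode_class_survivors, mode_class_died, mean_age_women_survived)
```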
   