# STATS 503, Group Work Assignment 1: Introduction to Python

*Instructions:* Collaborate in two-person teams to complete the data analysis
exercises below. The GSI will help individual teams encountering difficulty,
make announcements addressing common issues, and help ensure progress for all teams. Teams will be randomly assigned by the GSIs. Teams are encouraged to talk with their GSI if they need help. Upon completion, one member of the team should submit the team's work through Canvas as an HTML file.

*Notebook credits:* Roman Kouznetsov and Paolo Borello


## Where to go for help


In lab section, we cannot cover everything you will need to know to solve the problems below.  The standard way people learn Python is by trial and error and by consulting online resources.  Here are some resources we suggest.

* Books such as https://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf can help with basic Python knowledge.
* Large language models (e.g., ask ChatGPT to generate Python code that loads a file named "college_train.csv" and makes a histogram of the Books feature.  See what you get!)
* Package documentation
  * pandas for manipulating data sets https://pandas.pydata.org/docs/
  * numpy for computing with arrays https://numpy.org/doc/stable/
  * sklearn for fitting models https://scikit-learn.org/stable/user_guide.html
  * matplotlib for plotting https://matplotlib.org/stable/gallery/index
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)
* Python for Data Analysis (available through the U-M library)
* [Stackoverflow](https://stackoverflow.com/)
* The open internet.  Search and you may just find.


## Installing Anaconda (**SKIP THIS IF YOU ARE USING COLAB**)

There are various ways to install Python. In this class, we suggest Google Colab. Alternatively, you may try Anaconda, which includes both Python and conda and also bundles a suite of pre-installed packages geared toward scientific computing.


You can download Anaconda from https://www.anaconda.com/products/distribution, which includes installers for Windows and MacOS. For MacOS, we recommend using the *Graphical Installer*. You can also check the instructions for installation on [Windows](https://docs.anaconda.com/anaconda/install/windows/) and [MacOS](https://docs.anaconda.com/anaconda/install/mac-os/).

[Anaconda Navigator](https://docs.anaconda.com/navigator/) is automatically installed when you install Anaconda version 4.0.0 or higher. To verify your installation:
- Windows: Click **Start**, search for Anaconda Navigator, and click to open.
- MacOS: Click **Launchpad** and select Anaconda Navigator. Or use **Cmd+Space** to  open Spotlight Search and type “Navigator” to open the program.

We will use the Jupyter Notebook from Anaconda Navigator. To install Jupyter Notebook, open the Anaconda Navigator and click the "Install" button under Jupyter Notebook.

If you can't get Jupyter installed on your computer, you might try a cloud version of Jupyter Notebook, such as https://anaconda.cloud/ or https://colab.research.google.com/.

## What is a Jupyter notebook?

A Jupyter notebook is a collection of "cells."  There are two kinds of cells.  The first kind is *markdown cells* like this one.  They can have normal text, *italics* and **bold**.  If you edit the cell, you'll see this is done with *markdown* formatting conventions.

Markdown cells can also have equations such as $\frac{a}{b}$ and

$$\frac{a}{b}.$$

These equations are written using LaTeX notation.  You can learn more about how to do different things in markdown from resources such as https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd.


In [None]:
# the other kind of cell is a code cell
# in code cells, if you want to write ordinary text
# you have to put a "#" at the beginning of the line
# all other content in code cells is interpreted as
# instruction for the computer

3+5

8

## Installing packages

Most of the commonly-used libraries are pre-installed with Anaconda and/or Google Colab. You can still install packages using
- `conda install packagename`
- or `pip install packagename` in Terminal (MacOS) or Command Prompt (Windows).
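
Before installing anything, it can help to check whether a package is already importable. The sketch below uses the standard-library function `importlib.util.find_spec`; the helper name `is_installed` is our own and is not part of any library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two packages used in this notebook
for name in ["numpy", "pandas"]:
    print(name, "installed:", is_installed(name))
```

If a package is reported as missing, install it with one of the commands above and restart the notebook kernel.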

In [None]:
# import packages with aliases
import numpy as np
import pandas as pd

##  Lists

In [None]:
# create a list of integers
L = list(range(10))
print(L)
type(L[0])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


int

In [None]:
# create a list of strings
L2 = [str(c) for c in L]
print(L2)
type(L2[0])

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']


str

In [None]:
# create a list of heterogeneous elements
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]

[bool, str, float, int]

## Arrays

We have covered lists, which are built into Python. Next, let's use the `numpy` package to create arrays.

In [None]:
np.array([1, 4, 2, 5, 3])
np.array([3.14, 4, 2, 3])
np.array([1, 2, 3, 4], dtype='float32') # dtype explicitly sets the data type

array([1., 2., 3., 4.], dtype=float32)
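
Only the value of the last expression in a cell is displayed, which is why the cell above shows one array rather than three. Unlike a list, an array stores a single data type, so mixed inputs are converted to a common type. A small sketch that prints the type of each array (the variable names are illustrative):

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])                   # all integers: integer array
mixed = np.array([3.14, 4, 2, 3])                  # one float: every entry becomes a float
forced = np.array([1, 2, 3, 4], dtype="float32")   # type requested explicitly

# Print each array's data type
for name, arr in [("ints", ints), ("mixed", mixed), ("forced", forced)]:
    print(name, arr.dtype)
```

The exact integer type (for example `int64`) can depend on the platform, so it is not shown here.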

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [None]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
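
The two functions above differ in how the end point is treated: `np.arange` takes a step size and excludes the stop value, while `np.linspace` takes a number of points and, by default, includes the stop value. A short sketch comparing them on the same interval (variable names are ours):

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)   # step of 0.25; the stop value 1 is excluded
spaced = np.linspace(0, 1, 5)     # five points; the stop value 1 is included

print(len(stepped), stepped)
print(len(spaced), spaced)
```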

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[-0.66584647,  0.20497576,  0.77989197],
       [ 0.14012474, -0.72974592,  0.22650878],
       [ 0.87393273,  1.39753374,  0.34552798]])

In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[9, 1, 6],
       [5, 0, 1],
       [1, 8, 5]])
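
Random draws like the ones above change every time the cell is run. Setting a seed with `np.random.seed` makes them repeatable. The sketch below, with an arbitrarily chosen seed of 42, draws the same array twice:

```python
import numpy as np

np.random.seed(42)                          # fix the random state
first = np.random.randint(0, 10, (3, 3))

np.random.seed(42)                          # reset to the same state
second = np.random.randint(0, 10, (3, 3))

# Same seed, same sequence of draws
print(np.array_equal(first, second))
```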

## Array indexing: single elements


In a one-dimensional array, the ith value (**counting from zero**) can be accessed by specifying the desired index in square brackets.

In [None]:
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


In [None]:
# access single elements
print(x1)
print(x1[0]) # first element in x1
print(x1[4]) # fifth element in x1

[5 0 3 3 7 9]
5
7


In [None]:
# index from the end of the array
print(x1[-1]) # last element
print(x1[-2]) # second-to-last element

9
7


In [None]:
print(x2)
print(x2[0, 0])
print(x2[2, 0])
print(x2[2, -1])

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]
3
1
7


In [None]:
# modify values using index notation
x2[0, 0] = 12
x2

array([[12,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])

## Array slicing: subarrays

In [None]:
# one-dimensional subarrays
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
print(x[:5])  # first five elements
print(x[5:])  # elements from index 5 onward

[0 1 2 3 4]
[5 6 7 8 9]


In [None]:
print(x[4:7])  # middle sub-array
print(x[::2])  # every other element

[4 5 6]
[0 2 4 6 8]


In [None]:
print(x[::-1])  # all elements, reversed
print(x[5::-2])  # reversed every other from index 5

[9 8 7 6 5 4 3 2 1 0]
[5 3 1]


In [None]:
# multi-dimensional subarrays
print(x2[:, 0])  # first column of x2
print(x2[0, :])  # first row of x2

[12  7  1]
[12  5  2  4]


In [None]:
print(x2[:2, :3])  # two rows, three columns
print(x2[:3, ::2])  # all rows, every other column

[[12  5  2]
 [ 7  6  8]]
[[12  2]
 [ 7  8]
 [ 1  7]]


In [None]:
print(x2[::-1, ::-1]) # reverse dimensions together

[[ 7  7  6  1]
 [ 8  8  6  7]
 [ 4  2  5 12]]
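
One detail worth knowing about the slices above: a NumPy slice is a view of the original array, not an independent copy, so writing into the slice also changes the original. The sketch below uses a made-up array named `grid` to show the difference between a view and an explicit `.copy()`:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

view = grid[:2, :2]            # slice: shares memory with grid
view[0, 0] = 99                # this also changes grid[0, 0]

independent = grid[:2, :2].copy()   # explicit copy: separate memory
independent[0, 0] = -1              # grid is not affected

print(grid[0, 0])
print(np.shares_memory(grid, view), np.shares_memory(grid, independent))
```

This is why code that needs to keep the original array untouched usually starts from `.copy()`.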


## Arithmetic commands

In [None]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


In [None]:
print("-x     = ", -x)
print("x ** 2 = ", x ** 2) # ** operator for exponentiation
print("x % 2  = ", x % 2) # % operator for modulus

-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]


In [None]:
x = np.array([-2, -1, 0, 1, 2])
abs(x)
np.absolute(x)

array([2, 1, 0, 1, 2])

In [None]:
x = [1, 2, 3]
print("x     =", x)
print("e^x   =", np.exp(x))
print("2^x   =", np.exp2(x))
print("3^x   =", np.power(3, x))

x     = [1, 2, 3]
e^x   = [ 2.71828183  7.3890561  20.08553692]
2^x   = [2. 4. 8.]
3^x   = [ 3  9 27]


In [None]:
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))

x        = [1, 2, 4, 10]
ln(x)    = [0.         0.69314718 1.38629436 2.30258509]
log2(x)  = [0.         1.         2.         3.32192809]
log10(x) = [0.         0.30103    0.60205999 1.        ]


In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

array([5, 6, 7])

In [None]:
a + 5

array([5, 6, 7])

In [None]:
# centering
X = np.random.random((10, 3))
print(f"X is {X}")
Xmean = X.mean(0)
print(f"X bar is {Xmean}")
X_centered = X - Xmean
X_centered.mean(0)

X is [[0.26034093 0.53702252 0.44792617]
 [0.09956909 0.35231166 0.46924917]
 [0.84114013 0.90464774 0.03755938]
 [0.50831545 0.16684751 0.77905102]
 [0.8649333  0.41139672 0.13997259]
 [0.03322239 0.98257496 0.37329075]
 [0.42007537 0.05058812 0.36549611]
 [0.01662797 0.23074234 0.7649117 ]
 [0.94412352 0.74999925 0.33940382]
 [0.48954894 0.33898512 0.17949026]]
X bar is [0.44778971 0.47251159 0.3896351 ]


array([-9.99200722e-17,  2.22044605e-17, -2.22044605e-17])
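
The column means of the centered array are on the order of 1e-17 rather than exactly zero because of floating-point rounding. A hedged sketch of how one might check "zero up to rounding" with `np.allclose`, using a fresh random array and an arbitrary seed:

```python
import numpy as np

np.random.seed(0)
X = np.random.random((10, 3))

# X has shape (10, 3) and the column means have shape (3,);
# the subtraction is applied to every row
X_centered = X - X.mean(axis=0)
column_means = X_centered.mean(axis=0)

# Compare with zero up to a small tolerance instead of using ==
print(np.allclose(column_means, 0))
```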

In [None]:
np.random.seed(1)
L = np.random.random(100)
sum(L)
np.sum(L)

48.587792760014565

In [None]:
big_array = np.random.random(1000000)
min(big_array), max(big_array)
np.min(big_array), np.max(big_array)

(3.0077687129814734e-07, 0.9999995556201814)

In [None]:
M = np.random.random((3, 4))

# find the minimum value within each column by specifying axis=0
M.min(axis=0)

array([0.00488767, 0.53595758, 0.10182827, 0.1352457 ])

In [None]:
# find the variance within each row
M.var(axis=1)

array([0.02826916, 0.06416402, 0.09646854])
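
The `axis` argument names the dimension that is collapsed: `axis=0` aggregates down the rows and gives one value per column, and `axis=1` aggregates across the columns and gives one value per row. A small sketch on a hand-written matrix (the name `M_small` is ours) where the results are easy to verify by eye:

```python
import numpy as np

M_small = np.array([[1, 2, 3],
                    [4, 5, 6]])

col_sums = M_small.sum(axis=0)   # one value per column: [5 7 9]
row_sums = M_small.sum(axis=1)   # one value per row: [ 6 15]

print(col_sums)
print(row_sums)
```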

## Iterations and traversals

A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).

In [None]:
fruits = ["apple", "banana", "cherry"]
for i, x in enumerate(fruits):
  print("I have one", x, i)

I have one apple 0
I have one banana 1
I have one cherry 2


In [None]:
# notice the zero indexing again here
for i in range(5):
    print(i)

0
1
2
3
4
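
The loops above iterate over a list and a range. Dictionaries and strings can be traversed the same way; the sketch below uses made-up data to show both.

```python
# Iterate over the key/value pairs of a dictionary
prices = {"apple": 0.5, "banana": 0.25}
for fruit, price in prices.items():
    print(fruit, "costs", price)

# Iterate over the characters of a string
letters = []
for ch in "abc":
    letters.append(ch.upper())
print(letters)
```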


## Basic dataset manipulation

In [None]:
# use pandas library to load a dataset
# https://github.com/selva86/datasets/blob/master/iris_test.csv
iris=pd.read_csv('iris_test.csv')
iris

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.6,3.1,1.5,0.2,setosa
3,5.0,3.4,1.5,0.2,setosa
4,4.4,2.9,1.4,0.2,setosa
5,4.3,3.0,1.1,0.1,setosa
6,5.1,3.8,1.5,0.3,setosa
7,5.0,3.0,1.6,0.2,setosa
8,4.8,3.1,1.6,0.2,setosa
9,5.5,3.5,1.3,0.2,setosa


In [None]:
# use numpy to calculate the median of one of the columns
np.median(iris['Sepal.Length'])

6.0

In [None]:
# create a transformed version of one of the columns
iris['doublesepal']=iris['Sepal.Length']*2

# and look at the first 4 rows of the modified dataframe
iris.iloc[:4]

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,doublesepal
0,5.1,3.5,1.4,0.2,setosa,10.2
1,4.9,3.0,1.4,0.2,setosa,9.8
2,4.6,3.1,1.5,0.2,setosa,9.2
3,5.0,3.4,1.5,0.2,setosa,10.0


In [None]:
# create a new series that is True where species is virginica and False elsewhere
is_virginica=iris['Species']=='virginica'

# look at first 4 rows
is_virginica.iloc[:4]

Unnamed: 0,Species
0,False
1,False
2,False
3,False


In [None]:
# use numpy to count how many times the species is "virginica"
np.sum(is_virginica)

15

In [None]:
# can also do this more directly without creating a new variable for it
np.sum(iris['Species']=='virginica')

15
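
Summing a Boolean series counts its `True` entries because Python treats `True` as 1 and `False` as 0 in arithmetic. A minimal sketch with a plain list of made-up flags:

```python
flags = [True, False, True, True]

count = sum(flags)     # each True contributes 1, each False contributes 0
print(count)
print(True + True)     # Booleans behave like the integers 1 and 0
```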

In [None]:
# subset dataframe to include only samples that are virginica
new_dataframe=iris.loc[is_virginica]
new_dataframe

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,doublesepal
30,6.5,3.0,5.8,2.2,virginica,13.0
31,7.6,3.0,6.6,2.1,virginica,15.2
32,6.4,2.7,5.3,1.9,virginica,12.8
33,5.7,2.5,5.0,2.0,virginica,11.4
34,5.8,2.8,5.1,2.4,virginica,11.6
35,6.4,3.2,5.3,2.3,virginica,12.8
36,7.7,3.8,6.7,2.2,virginica,15.4
37,7.2,3.2,6.0,1.8,virginica,14.4
38,7.2,3.0,5.8,1.6,virginica,14.4
39,6.4,2.8,5.6,2.2,virginica,12.8


### Credits: This notebook is adapted from material by Gang Qiao, Junting Wang, Yichao Chen, and Ji Zhu.

# **Group Work (30 points)**

During weekly lab sections (listed above), students will collaborate to complete data analysis exercises
in the Jupyter notebooks provided. GSIs will help individual teams encountering difficulty, make
announcements addressing common issues, and help ensure progress for all teams. Initially, teams will
be randomly assigned by the GSIs, but students will later be invited to choose their own teammates.
Cross-team collaboration is discouraged; teams are instead encouraged to talk with their GSI if they
need help.
Some weeks the group-work assignment can be completed entirely during lab section, in which case
students are expected to submit their work by the end of the lab or shortly thereafter. Other weeks,
the group-work assignments initiated during lab section will be due several days later.
The purpose of this format is to teach students about teamwork and collaborative data science
practices, which are increasingly important in the industry. Working in a group setting can enhance
motivation for some students, while others will benefit from the experience of serving as mentors.
To accommodate short-term illness and occasional scheduling conflicts, the lowest **3 out of the 11** group-work
scores will be dropped.

It is our intention that students score near 100% on all of these because Group Work is an opportunity to get everyone initially familiar with using the relevant material, no matter the difficulty level.

**Please note that the test cases dictate which function and variable names you should be using.**

## **Function Writing (10 points total)**

### Problem 1 (2 points)

Write a function which computes gross pay given the hours worked and the rate per hour.

P.S. Note that the parameter names are descriptive: they explain their role in the function, so no one has to guess. Please choose variable names carefully so that people reading your code know what they mean.

In [None]:
def get_gross_pay(hours_worked, rate_per_hour):
    pass

When you are finished with your implementation, check your work on the test cases below. Assert statements check a boolean condition before further code execution is allowed to proceed. Your code will be expected to pass a large variety of test cases to ensure you have a proper solution.

In [None]:
assert get_gross_pay(40, 20) == 800
assert get_gross_pay(30, 10) == 300
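
As a sketch of how `assert` behaves, the example below uses a hypothetical helper `add_one` that is unrelated to the assignment. A true condition passes silently; a false one raises `AssertionError`, optionally with a message.

```python
def add_one(value):
    """Hypothetical helper used only to illustrate assert."""
    return value + 1


# The condition is True, so nothing happens and execution continues
assert add_one(1) == 2

# The condition is False, so an AssertionError is raised with the message
failure_message = None
try:
    assert add_one(1) == 3, "add_one(1) should be 2, not 3"
except AssertionError as err:
    failure_message = str(err)
    print("assert failed:", failure_message)
```

In a notebook, an uncaught `AssertionError` stops the cell, which is how the test cells signal a failing solution.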

### Problem 2 (3 points)

Write a new version of your `get_gross_pay` function to give the employee 1.5 times the hourly rate for hours worked above 40 hours. In addition to the test cases provided, what if some poor soul was pulling 80 hours a week? How much would they make? Include your answer in a markdown cell.

In [None]:
def get_gross_pay(hours_worked, rate_per_hour):
    pass

In [None]:
assert get_gross_pay(0, 1000)==0
assert get_gross_pay(2, 10)==20
assert get_gross_pay(41, 10)== 415

### Problem 3 (2 points)

Write a function `is_power_of_2` that takes a positive integer as its only argument and returns a Boolean indicating whether or not the input is a power of 2.

You may assume that the input is a positive integer. You may not use the built-in math.sqrt function in your solution. You should need only the division and modulus (%) operations.

In [None]:
def is_power_of_2(n):
    pass

### Problem 3T (3 points)

When writing test cases for your code, there is a solid mantra: "test none, test one, test many". Outline how you would write test cases for testing your is_power_of_2 function in markdown. Then, using the assert statement, write the appropriate test cases.

<u>**Hint**</u>: Think about some of the edge cases your program may need to handle given the instructions.
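
To illustrate the "test none, test one, test many" idea without giving away the exercise, here is a sketch for a hypothetical function `count_vowels`, which has nothing to do with `is_power_of_2`. The same pattern of choosing cases applies to your own function.

```python
def count_vowels(text):
    """Hypothetical example function: count the vowels in a string."""
    return sum(1 for ch in text.lower() if ch in "aeiou")


assert count_vowels("") == 0         # test none: empty input
assert count_vowels("xyz") == 0      # test none: input with no vowels
assert count_vowels("a") == 1        # test one
assert count_vowels("Banana") == 3   # test many
print("all count_vowels tests passed")
```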

In [None]:
# add as many assert statements as you think you need

## **Working with Datasets (10 points total)**



### Problem 1 (2 points)

You may have noticed in the lab that we were able to calculate a median for a variable of interest. However, what if we wanted to generate a five-number summary for all of the variables for some quick exploratory data analysis (EDA)? Write some code that provides a summary of all the variables in the dataset. You may use GenAI, Google, or the pandas documentation to complete this part.

### Problem 1 Follow Up (1 point)

What do you notice about the variables that were included in the description table? Answer in markdown.

### Problem 2 (2 points)

The iris dataset covers too broad a range of samples for our purposes. Filter the dataset so that it contains only rows with all of the following properties:

1.   Sepal Length of at least 6.9.
2.   Petal Width of at least 2.0.
3.   Petal length greater than 5.1 but less than 6.1.



### Problem 2 Follow Up (1 point)

In a new code cell, write an assert statement that checks that you have the correct number of rows in the filtered dataset.

### Problem 3 (2 points)

Assume that petals are rectangular-shaped. Create a new column that stores the area of petals.

### Problem 4 (2 points)

Find the maximal area for each class of iris plant.

## **Data Structures (10 points total)**

### Problem 1 Setup (1 point)

Masking is a common operation in machine learning and artificial intelligence. We will learn how to perform this operation together.



1.   Set the random seed to the value 8. This ensures that everyone has the same randomization procedures.
2.   Generate a random 5x5 sample of Unif$(0,1)$ values.




In [None]:
# DO NOT COPY PASTE THIS AS YOUR ANSWER. DOING SO WILL RESULT IN A 0 FOR GROUP WORK.
correct_answer = np.array([[0.8734294 , 0.96854066, 0.86919454, 0.53085569, 0.23272833],
       [0.0113988 , 0.43046882, 0.40235136, 0.52267467, 0.4783918 ],
       [0.55535647, 0.54338602, 0.76089558, 0.71237457, 0.6196821 ],
       [0.42609177, 0.28907503, 0.97385524, 0.33377405, 0.21880106],
       [0.06580839, 0.98287055, 0.12785571, 0.32213079, 0.07094284]])

assert np.sum(np.abs(random_array - correct_answer)) < 1e-7

### Problem 1 (3 points)

Perform the following 3 masks:

1. Zero out the first 3 columns.
2. Zero out any values less than 0.5.
3. Zero out all values whose first digit after the decimal place is an even number.
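
As a general illustration of masking, deliberately different from the three masks requested here, the sketch below builds a Boolean array from a condition and uses it to overwrite the selected entries. The array `readings` and the condition are made up.

```python
import numpy as np

readings = np.array([3, -1, 4, -1, 5])

is_negative = readings < 0      # Boolean array, True where the condition holds
cleaned = readings.copy()       # work on a copy so the original is untouched
cleaned[is_negative] = 0        # assign only where the mask is True

print(is_negative)
print(cleaned)
```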

In [None]:
# copy arrays
masked_array1 = random_array.copy()
masked_array2 = random_array.copy()
masked_array3 = random_array.copy()

# perform masks


In [None]:
correct_mask1 = np.array([[0.        , 0.        , 0.        , 0.53085569, 0.23272833],
       [0.        , 0.        , 0.        , 0.52267467, 0.4783918 ],
       [0.        , 0.        , 0.        , 0.71237457, 0.6196821 ],
       [0.        , 0.        , 0.        , 0.33377405, 0.21880106],
       [0.        , 0.        , 0.        , 0.32213079, 0.07094284]])
correct_mask2 = np.array([[0.8734294 , 0.96854066, 0.86919454, 0.53085569, 0.        ],
       [0.        , 0.        , 0.        , 0.52267467, 0.        ],
       [0.55535647, 0.54338602, 0.76089558, 0.71237457, 0.6196821 ],
       [0.        , 0.        , 0.97385524, 0.        , 0.        ],
       [0.        , 0.98287055, 0.        , 0.        , 0.        ]])
correct_mask3 = np.array([[0.        , 0.96854066, 0.        , 0.53085569, 0.        ],
       [0.        , 0.        , 0.        , 0.52267467, 0.        ],
       [0.55535647, 0.54338602, 0.76089558, 0.71237457, 0.        ],
       [0.        , 0.        , 0.97385524, 0.33377405, 0.        ],
       [0.        , 0.98287055, 0.12785571, 0.32213079, 0.        ]])

assert np.sum(np.abs(masked_array1 - correct_mask1)) < 1e-7
assert np.sum(np.abs(masked_array2 - correct_mask2)) < 1e-7
assert np.sum(np.abs(masked_array3 - correct_mask3)) < 1e-7

### Problem 2 Setup (1 point)

In modern ML, we often use matrices and tensors when dealing with image data. A simple image task consists of applying a [kernel](https://en.wikipedia.org/wiki/Kernel_(image_processing)).

For this exercise we first generate a 7x7 grayscale image:
1. Create a vector of the first 49 consecutive numbers (0 included)
2. Reshape it and make it a 7x7 matrix

In [None]:
correct_answer = np.array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12, 13],
       [14, 15, 16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25, 26, 27],
       [28, 29, 30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39, 40, 41],
       [42, 43, 44, 45, 46, 47, 48]])

assert np.sum(np.abs(test_image - correct_answer)) < 1e-7

### Problem 2 (3 points)

We now will create a function to apply a so-called box filter of size 3 x 3. When applied to an image, the filter will modify each pixel by rounding down the average of the pixel and the eight surrounding pixels.  If one or more of the surrounding pixels is not present, we do not consider it in the average.

An example is
$$
\begin{bmatrix}
  1 & 2 & 3 & 4 \\
  5 & 6 & 7 & 8 \\
  9 & 10 & 11 & 12 \\
  13 & 14 & 15 & 16
\end{bmatrix} \longrightarrow \begin{bmatrix}
  3 & 4 & 5 & 5 \\
  5 & 6 & 7 & 7 \\
  9 & 10 & 11 & 11 \\
  11 & 12 & 13 & 13
\end{bmatrix}
$$


For the cell containing 4, the calculation is given by
$$
(3+4+7+8) / 4 = 5.5 \rightarrow 5
$$


For the cell containing 15, the calculation is given by
$$
(10+11+12+14+15+16) / 6 = 13
$$
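
The two worked calculations above can be checked with array slicing. This is only a sketch for those two cells of the 4x4 example (the variable names are ours); it is not a general filter.

```python
import numpy as np

example = np.arange(1, 17).reshape(4, 4)   # the 4x4 matrix from the example

# Cell containing 4 sits at row 0, column 3: only 4 neighbors exist (itself included)
patch_4 = example[0:2, 2:4]
value_4 = patch_4.sum() // patch_4.size    # 22 // 4

# Cell containing 15 sits at row 3, column 2: 6 neighbors exist (itself included)
patch_15 = example[2:4, 1:4]
value_15 = patch_15.sum() // patch_15.size  # 78 // 6

print(value_4, value_15)
```

Integer floor division (`//`) performs the "rounding down" step described above.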

Write a function that takes in a matrix and computes the smoothed out version.

In [None]:
def box_filter(image: np.ndarray) -> np.ndarray:
  pass

In [None]:
smoothed_image = box_filter(test_image)
smoothed_image

In [None]:
correct_answer = np.array([[ 4,  4,  5,  6,  7,  8,  9],
       [ 7,  8,  9, 10, 11, 12, 12],
       [14, 15, 16, 17, 18, 19, 19],
       [21, 22, 23, 24, 25, 26, 26],
       [28, 29, 30, 31, 32, 33, 33],
       [35, 36, 37, 38, 39, 40, 40],
       [39, 39, 40, 41, 42, 43, 44]])

assert np.sum(np.abs(smoothed_image - correct_answer)) < 1e-7

### Problem 3 Setup (NO POINTS)
**DO NOT MODIFY** the setup cells.

In [None]:
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Define the transformation pipeline (convert to tensor, no resizing)
transform = transforms.Compose([
    transforms.ToTensor()  # Convert the image to a tensor
])

# Download the Caltech101 dataset
caltech101_dataset = torchvision.datasets.Caltech101(
    root='./data',
    download=True,
    transform=transform
)
# Get a single image and its label
rgb_image, label = caltech101_dataset[35]

# Convert the image tensor to a NumPy matrix
rgb_image = np.transpose((rgb_image.numpy() * 255).astype(np.uint8), (1, 2, 0))

In [None]:
plt.imshow(rgb_image)
plt.show()

### Problem 3 (2 points)

The image now has one more axis for the RGB channels. The shape is now HxWxC (height, width, channels). Modify your code from above to apply the filter to each channel and then output the blurred RGB image.
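
To make the HxWxC layout concrete, the sketch below builds a tiny made-up "image" of height 2, width 3, and 3 channels, and pulls out a single channel as a 2-D array. The names `toy_image` and `first_channel` are illustrative.

```python
import numpy as np

# Height 2, width 3, 3 channels; values 0..17 so entries are easy to trace
toy_image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
height, width, channels = toy_image.shape

# Selecting index 0 on the last axis gives one channel of shape (height, width)
first_channel = toy_image[:, :, 0]

print(toy_image.shape, first_channel.shape)
print(first_channel)
```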

In [None]:
def box_filter_RGB(image: np.ndarray) -> np.ndarray:
  pass

In [None]:
blurred_image = box_filter_RGB(rgb_image)
plt.imshow(blurred_image)
plt.show()

In [None]:
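# cast to a signed type first: subtracting two uint8 arrays would wrap around
diff_image = np.abs(rgb_image.astype(int) - blurred_image.astype(int))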
max_diff = np.max(diff_image)
min_diff = np.min(diff_image)
mean_diff = np.mean(diff_image)
median_diff = np.median(diff_image)

assert max_diff == 87
assert min_diff == 0
assert abs(mean_diff - 3.646) < 0.01
assert median_diff == 2
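
A note on data types when comparing images: `rgb_image` is stored as `uint8`, and arithmetic between two `uint8` arrays wraps around instead of producing negative numbers. The sketch below, on made-up one-element arrays, shows the effect and one way to avoid it by converting to a signed integer type first.

```python
import numpy as np

a = np.array([3], dtype=np.uint8)
b = np.array([5], dtype=np.uint8)

wrapped = a - b                            # stays uint8: 3 - 5 wraps around to 254
signed = a.astype(int) - b.astype(int)     # signed integers: 3 - 5 is -2

print(wrapped, np.abs(signed))
```

If your blurred image is also `uint8`, this may explain an unexpectedly large `max_diff`.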

# Conversion

Run all cells by using `Runtime > Run all`.

To convert the file to HTML for submission, first download the `.ipynb` file (`File > Download > Download .ipynb`). Then upload it to the working directory using the GUI (`Files > Upload to session storage`). Finally, run the next cell and download the resulting `.html` file.

In [103]:
!jupyter nbconvert --to html STATS503_GroupWork1.ipynb

[NbConvertApp] Converting notebook STATS503_GroupWork1.ipynb to html
[NbConvertApp] Writing 448917 bytes to STATS503_GroupWork1.html
