<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Building a Scene Recognition Model form Video Frames</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/">https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Frames of a Video

Images are an important part of nearly all media, and data scientists often use them as data sources.  In this MicroProject, you will create a simple model to measure how much time is spent in each of two different "scenes" we used when creating office-hour style videos for Data Science DISCOVERY.

*This MicroProject was inspired by a podcast that we recently recorded with the team from the Center for Innovation in Teaching and Learning who helped produce our video.  To learn the background and hear from Karle and Wade about the journey of creating DISCOVERY, go over and listen to our episode on the "Teach Talk Listen Learn Podcast", where we talk with TTLL host Bob Dignan and our CITL video producer Eric Schumacher: https://citl.illinois.edu/citl-101/teaching-learning/teach-talk-listen-learn*


### Loading a Video Frame

We have provided you with one frame every second from our video [*"Outliers Impact on Correlation (m6-02b)"*](https://www.youtube.com/watch?v=bd6hQ2UcIJc) that is used as part of our [DISCOVERY lecture covering Correlation](https://discovery.cs.illinois.edu/learn/Towards-Machine-Learning/Correlation/).

The `skimage` library is commonly used to load image data into Python.  Specifically:

- `skimage.io.imread(filename)` reads the image file at `filename` and returns the color of every pixel in the image.
- To use the `imread` function, you will need to do one of the following:

    1. Import all of `skimage` by using the import line `import skimage`.  After importing all of `skimage`, you will call the function using its fully qualified name: `skimage.io.imread(filename)`.
    
    **OR**
    
    2. Import only the `imread` function by using the import line `from skimage.io import imread`.  After importing only `imread`, you will call the function directly: `imread(filename)`.

#### Read Pixel Data for `frames/frame_0001.jpg`

We have provided a `frames` directory with all of the frames.  In the following cell, store the pixel color data from the image file named `frames/frame_0001.jpg` in the variable `pixels` by using the `imread` function:


In [28]:
import skimage

pixels = skimage.io.imread("frames/frame_0001.jpg")


### 🔬 Checkpoint Tests 🔬

In [29]:
### TEST CASE for Reading Video Frames
tada = "\N{PARTY POPPER}"

assert("pixels" in vars())
assert(pixels.shape == (360, 640, 3))
assert(pixels[0][0][0] == 91)

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 1: Storing Average Pixel Color

The **shape** of your data is `rows` by `columns` by `color values`, stored as a 3-dimensional array.  Here's a formatted view of your `pixels` data:

```
[
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80], ... ],   # Row #1
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80], ... ],   # Row #2
  ...                                                    # ...
]
```

The current shape of `pixels` is 360 rows by 640 columns by 3 colors.  The three values in each pixel are the three color channels on a screen: red, green, and blue.

Calling `pixels.mean()` finds the average over **ALL** the values, combining the red, green, and blue channels together.  Try it out:


In [30]:
pixels.mean()

72.18011863425926

In [31]:
pixels = pixels.reshape(-1, 3)
pixels


array([[ 91,  83,  80],
       [ 91,  83,  80],
       [ 91,  83,  80],
       ...,
       [162, 131, 110],
       [162, 131, 110],
       [162, 131, 110]], dtype=uint8)

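To find the average of each color channel, `pixels.reshape(-1, 3).mean(axis=0)` first flattens the rows and columns into one long list of pixels and then averages over that list, keeping the three color channels separate.  Since `pixels` was already reshaped in the previous cell, only the `mean(axis=0)` call is needed here.  Check out the new mean: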

In [32]:
pixels.mean(axis=0)

array([88.65917535, 67.45620226, 60.4249783 ])
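
To see what `reshape(-1, 3)` and `mean(axis=0)` compute, the sketch below repeats the same arithmetic in plain Python on a made-up 2x2 image.  The pixel values and the names `tiny_image`, `flat`, `channel_means`, and `overall_mean` are illustrative assumptions, not data from the video:

```python
from statistics import mean

# A made-up 2-row by 2-column image; each pixel is [red, green, blue].
tiny_image = [
    [[90, 80, 70], [100, 60, 50]],   # Row #1
    [[110, 70, 60], [100, 70, 60]],  # Row #2
]

# Equivalent of `reshape(-1, 3)`: drop the row structure and keep one
# flat list of pixels, each still holding its three color values.
flat = [pixel for row in tiny_image for pixel in row]

# Equivalent of `mean(axis=0)`: average down the list of pixels,
# separately for each color channel.
channel_means = [mean(channel) for channel in zip(*flat)]

# Equivalent of `mean()` with no axis: one average over every value.
overall_mean = mean(value for pixel in flat for value in pixel)

print(flat)           # 4 pixels of 3 values each
print(channel_means)  # [red mean, green mean, blue mean]
print(overall_mean)
```

The single overall mean blends the three channels together, which is why it does not match any one of the per-channel means.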

### Puzzle 1.1: Finding the Average Color of One Image

Store the average red value of `pixels` in `r`, the average green value in `g`, and the average blue value in `b`:

In [33]:
r, g, b = pixels.mean(axis=0)

In [34]:
### TEST CASE for Puzzle 1.1: Finding the Average Color of One Image
tada = "\N{PARTY POPPER}"

import math
assert("r" in vars())
assert("g" in vars())
assert("b" in vars())
assert(math.isclose(r, 88.65917534722222))
assert(math.isclose(g, 67.45620225694445))
assert(math.isclose(b, 60.42497829861111))

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


### Puzzle 1.2: Finding the Average Color of All Images

The following code loops through every file in the `frames` directory -- this will include `frame_0001.jpg` (which you already analyzed) and also `frame_0002.jpg`, `frame_0003.jpg`, and all 300+ frames!

Create a DataFrame where each row is one frame with the following four columns:
- `frame`, the filename of the frame
- `r`, the average red color of the frame
- `g`, the average green color of the frame
- `b`, the average blue color of the frame

The structure of the code should be nearly identical to writing a simulation.  Instead of generating random values, each row's data will be the filename and the frame's average color values.

- See: https://discovery.cs.illinois.edu/learn/Simulation-and-Distributions/Simple-Simulations-in-Python/

In [35]:
import glob
import os
import pandas as pd

data = []
for frame in glob.glob(os.path.join("frames", "*.jpg")): 
  # `frame` contains the filename of the frame (ex: "frames/frame_0001.jpg").  Use it for `imread` to read the frame image data.
  ps = skimage.io.imread(frame).reshape(-1, 3)
  r, g, b = ps.mean(axis=0)
  data.append({ "frame": frame, "r": r, "g": g, "b": b })

df = pd.DataFrame(data=data)

### 🔬 Checkpoint Tests 🔬

In [36]:
### TEST CASE for Puzzle 1.2: Finding the Average Color of All Images
tada = "\N{PARTY POPPER}"

import math
assert("df" in vars())
assert(len(df) == 330)
assert("r" in df)
assert("g" in df)
assert("b" in df)
assert("frame" in df)
assert( abs( df[ df.frame.str.endswith("_0001.jpg") ]["r"].sum() - 88 ) < 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 2: Create a Simple Classifier

In the DISCOVERY lecture videos, there are two primary "scenes" in the video:

1. **"Office Hours Studio Scene"**, where Karle and Wade are talking to each other and the audience,

2. **"Notebook Scene"**, where the notebook is displayed

View the `frames` folder on your computer and find **at least three more frames** that are in the "office hours studio scene" and **at least three more frames** that are in the "notebook scene".  Add the frames you found to the list below:

In [37]:
# List of at least four office hour frames by the filename's frame number:
office_hour_frames = [1, 2, 3, 4]

# List of at least four notebook frames by the filename's frame number:
notebook_frames = [30, 31, 32, 33]

### Observing the Average Colors of Your Frames

The following code uses your sample frames to display the average color values for your selected frames:

In [38]:
import os

print("== Office Hour Frames ==")
print( df[ df["frame"].isin( [f"frames{os.sep}frame_{frame:04d}.jpg" for frame in office_hour_frames] ) ] )
print()
print("== Notebook Frames ==")
print( df[ df["frame"].isin( [f"frames{os.sep}frame_{frame:04d}.jpg" for frame in notebook_frames] ) ] )





== Office Hour Frames ==
                    frame          r          g          b
12  frames/frame_0001.jpg  88.659175  67.456202  60.424978
15  frames/frame_0003.jpg  88.028351  66.913845  60.064592
30  frames/frame_0002.jpg  88.697865  67.453529  60.475660
61  frames/frame_0004.jpg  88.825629  67.340347  60.491645

== Notebook Frames ==
                     frame           r           g           b
293  frames/frame_0033.jpg  237.115829  236.491884  236.751220
308  frames/frame_0032.jpg  237.195208  236.540660  236.820846
310  frames/frame_0030.jpg  237.225595  236.513451  236.777122
329  frames/frame_0031.jpg  237.253437  236.602648  236.892174
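
The row labels in the output above (12, 15, 30, 61, ...) are out of frame order because `glob.glob` returns files in whatever order the operating system lists them.  A minimal sketch of putting filenames in frame order and recovering the frame number, assuming the `frame_NNNN.jpg` naming used in this project; the helper name `frame_number` and the sample list `paths` are hypothetical:

```python
import os
import re

def frame_number(path):
    """Return the integer frame number from a path like 'frames/frame_0031.jpg'."""
    match = re.search(r"frame_(\d+)\.jpg$", os.path.basename(path))
    if match is None:
        raise ValueError(f"not a frame filename: {path}")
    return int(match.group(1))

# A few filenames in a shuffled order, as glob might return them.
paths = ["frames/frame_0033.jpg", "frames/frame_0001.jpg", "frames/frame_0030.jpg"]

# Sort by the parsed frame number rather than by arrival order.
ordered = sorted(paths, key=frame_number)
print(ordered)
print([frame_number(p) for p in ordered])
```

Since the frames were sampled once per second, the frame number is also roughly the timestamp of the frame in seconds.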


### Create Your Classifier Function

A **classifier function** is a function that takes data and gives a classification for that data.  Create a new function, `classifyFrame`, that receives an `r`, `g`, and `b` value.

Using information from your frames above, have the function return the string `"office hour"` or `"notebook"` based on the values of `r`, `g`, and `b`.

**IMPORTANT**: Make sure your classifier can handle **ANY** input -- even frames you have not seen before!  For example, you might decide that you will call a frame an `"office hour"` frame if the sum of `r`, `g` and `b` is greater than 100 and otherwise it's a `"notebook"` scene.
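
As one illustration of a rule based on the sum of the channels, here is a minimal sketch.  The function name `classify_by_brightness`, the direction of the rule (bright means notebook), and the cutoff of 450 are all assumptions made for this example; 450 sits roughly midway between the sums of the sample averages printed above (about 216 for the office hour frames and about 710 for the notebook frames):

```python
def classify_by_brightness(r, g, b, cutoff=450):
    """Label a frame by the sum of its average red, green, and blue values.

    Bright frames (mostly white notebook pages) have a large sum; the darker
    studio frames have a small one. `cutoff` is a hand-picked assumption.
    """
    if r + g + b > cutoff:
        return "notebook"
    return "office hour"

print(classify_by_brightness(88.66, 67.46, 60.42))     # averages like frame_0001
print(classify_by_brightness(237.12, 236.49, 236.75))  # averages like frame_0033
print(classify_by_brightness(0, 0, 0))                 # unseen input still gets a label
```

Because every possible sum falls on one side of the cutoff or the other, a rule like this returns a label for any input, including frames it has never seen.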

In [39]:
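import math

# Reference colors: approximate average (r, g, b) of the sample frames for each scene
office_hour_avg = (88., 67., 60.)
notebook_avg = (237., 236., 236.)

def classifyFrame(r, g, b):
  # Return either "office hour" or "notebook" based on the values of `r`, `g`, and `b`.
  # Measure the straight-line (Euclidean) distance from this frame's color to each reference color
  d_office_hours = math.dist((r, g, b), office_hour_avg)
  d_notebook = math.dist((r, g, b), notebook_avg)

  # Pick whichever scene's reference color is closer
  if d_notebook < d_office_hours:
    return "notebook"
  else:
    return "office hour"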

### 🔬 Checkpoint Tests 🔬

In [40]:
### TEST CASE for Part 2: Create a Simple Classifier
tada = "\N{PARTY POPPER}"

r = classifyFrame(0, 0, 0)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(255, 255, 255)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(0, 255, 255)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(255, 255, 0)
assert(r == "notebook" or r == "office hour")

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 3: Using Your Classifier!

Now that we have a classifier, we should run it on every frame!

The following cell runs your `classifyFrame` classifier on every frame, adds a new column `scene`, and displays 20 random rows:

In [41]:
df["scene"] = df.apply(lambda row: classifyFrame(row.r, row.g, row.b), axis=1)
df.sample(20)

                     frame           r           g           b        scene
126  frames/frame_0058.jpg  230.351159  229.848355  230.070725     notebook
47   frames/frame_0011.jpg  106.510043   53.178854   44.992795  office hour
221  frames/frame_0284.jpg  243.156128  242.113767  240.858954     notebook
208  frames/frame_0246.jpg  244.159071   243.40612   241.71901     notebook
189  frames/frame_0134.jpg    87.36822   67.516016    60.36855  office hour
72   frames/frame_0103.jpg    230.9523  230.266042  227.938733     notebook
257  frames/frame_0237.jpg  244.377049  243.657613  241.884232     notebook
252  frames/frame_0085.jpg   230.72431   229.92013  230.498368     notebook
37   frames/frame_0172.jpg    90.04786   70.132135    61.90582  office hour
186  frames/frame_0135.jpg   88.526589   68.401576   60.930148  office hour


### 🔬 Checkpoint Tests 🔬

In [42]:
### TEST CASE for Part 3: Using Your Classifier
tada = "\N{PARTY POPPER}"

assert("scene" in df)

assert(len(df[ df.scene == "notebook" ]) > 100)
assert(len(df[ df.scene == "office hour" ]) > 75)
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) == len(df))

assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


## Observing Results

In the next six cells, we display a frame and you run code to check how your classifier labeled it.  Make sure to run the code for each frame:

### Frame #0001: Office Hours

In [43]:
df[ df.frame.str.endswith("0001.jpg") ]

                    frame          r          g          b        scene
12  frames/frame_0001.jpg  88.659175  67.456202  60.424978  office hour


![Frame 0001](frames/frame_0001.jpg)

### Frame #0081: Notebook

In [44]:
df[ df.frame.str.endswith("0081.jpg") ]

                     frame           r           g          b     scene
150  frames/frame_0081.jpg  230.721385  229.915091  230.48303  notebook


![Frame 0081](frames/frame_0081.jpg)

### Frame #0191: Notebook

In [45]:
df[ df.frame.str.endswith("0191.jpg") ]

                     frame           r           g           b     scene
304  frames/frame_0191.jpg  233.117088  232.354644  230.103359  notebook


![Frame 0191](frames/frame_0191.jpg)

### Frame #0306: Office Hours

In [46]:
df[ df.frame.str.endswith("0306.jpg") ]

                     frame          r          g         b        scene
121  frames/frame_0306.jpg  89.403867  70.149223  62.83901  office hour


![Frame 0306](frames/frame_0306.jpg)

### Frame #0320: Data Science Duo Logo???

What did you classify the DUO logo as?  It's neither one, but we don't have that option!

In [47]:
df[ df.frame.str.endswith("0320.jpg") ]

                     frame           r          g          b        scene
159  frames/frame_0320.jpg  221.227565  71.838433  54.457305  office hour


![Frame 0320](frames/frame_0320.jpg)

### Frame #0328: Video Credits

What did you classify the video credits as?  It's another tricky one!


In [48]:
df[ df.frame.str.endswith("0328.jpg") ]

                    frame         r         g         b        scene
78  frames/frame_0328.jpg  7.480234  7.481519  7.487826  office hour


![Frame 0328](frames/frame_0328.jpg)
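
Both tricky frames come out as `"office hour"` in the outputs above, which were produced by a classifier that picks whichever of two reference colors is closer.  A sketch of why that rule forces a label onto both frames, reusing that classifier's reference colors `(88, 67, 60)` and `(237, 236, 236)` and the averages printed above; the names `tricky_frames` and `distances` are hypothetical:

```python
import math

office_hour_avg = (88.0, 67.0, 60.0)
notebook_avg = (237.0, 236.0, 236.0)

# Average colors copied from the outputs for the two tricky frames.
tricky_frames = {
    "frame_0320 (DUO logo)": (221.227565, 71.838433, 54.457305),
    "frame_0328 (credits)": (7.480234, 7.481519, 7.487826),
}

distances = {}
for name, rgb in tricky_frames.items():
    # Straight-line distance from the frame's color to each reference color.
    d_office = math.dist(rgb, office_hour_avg)
    d_notebook = math.dist(rgb, notebook_avg)
    distances[name] = (d_office, d_notebook)
    print(f"{name}: office hour {d_office:.1f}, notebook {d_notebook:.1f}")
```

Both frames are closer to the office hour color than to the notebook color, yet each is still more than 100 units away from it, while `frame_0001` sits within about one unit.  A cutoff on the distance is one way to tell "closest" apart from "actually close".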

<hr style="color: #DD3403;">

## Part 4: Update Your Classifier to Account for an "Other" Category

Create a second classifier -- `classifyFrame2` -- that returns either `"notebook"`, `"office hour"` or `"other"`.  Your classifier should correctly handle the "Data Science Duo" (ex: #0320) frames and the "Credit" frames (ex: #0328).

In [55]:
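def classifyFrame2(r, g, b):
  # Return either "office hour", "notebook", or "other" based on the values of `r`, `g`, and `b`.
  # Distance from this frame's average color to each scene's reference color
  d_office_hours = math.dist((r, g, b), office_hour_avg)
  d_notebook = math.dist((r, g, b), notebook_avg)

  # A frame only counts as a scene if it is close enough to that scene's reference color
  if d_notebook <= 50:
    return "notebook"
  if d_office_hours <= 30:
    return "office hour"
  # Too far from both reference colors (ex: logo or credits frames)
  return "other"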

## Apply your `classifyFrame2` function

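Using `classifyFrame2`, this code replaces the value in the column `scene` with the result of your `classifyFrame2` classification function.  The output of this cell shows the last rows of the DataFrame.  If your rows are in frame order, these are the final frames of the video, which we expect to be `"other"`; because `glob` does not guarantee any particular file order, your rows may show a mix of scenes instead: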

In [58]:
df["scene"] = df.apply(lambda row: classifyFrame2(row.r, row.g, row.b), axis=1)
df.tail(20)

                     frame           r           g           b        scene
310  frames/frame_0030.jpg  237.225595  236.513451  236.777122     notebook
311  frames/frame_0024.jpg   88.317817   66.974431   60.013902  office hour
312  frames/frame_0018.jpg   90.887626   65.819722   58.584336  office hour
313  frames/frame_0232.jpg  244.644492  244.001172  242.137786     notebook
314  frames/frame_0226.jpg  244.742813  244.085543  242.222925     notebook
315  frames/frame_0187.jpg  233.127865  232.372739  230.165365     notebook
316  frames/frame_0193.jpg  233.118607  232.325985  230.080495     notebook
317  frames/frame_0144.jpg  230.627274  229.696172  227.422426     notebook
318  frames/frame_0150.jpg  230.216254  229.317413  227.004748     notebook
319  frames/frame_0178.jpg  232.595569   231.87026  229.682127     notebook


### 🔬 Checkpoint Tests 🔬

In [59]:
### TEST CASE for Part 4: Update Your Classifier to Account with an Other Category
tada = "\N{PARTY POPPER}"

print(len(df[ df.scene == "notebook" ]))
print(len(df[ df.scene == "office hour" ]))
print(len(df[ df.scene == "other" ]))

assert("scene" in df)

assert(len(df[ df.scene == "notebook" ]) > 100)
assert(len(df[ df.scene == "office hour" ]) > 75)
assert(len(df[ df.scene == "other" ]) >= 15)
assert(len(df[ df.scene == "other" ]) <= 18)   # Okay to classify the intro screens as well, but not any others.
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) + len(df[ df.scene == "other" ]) == len(df))

assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0317.jpg")) & (df.scene == "other") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0325.jpg")) & (df.scene == "other") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0328.jpg")) & (df.scene == "other") ] ) == 1 )

print(f"{tada} All Tests Passed! {tada}")

212
100
18
🎉 All Tests Passed! 🎉
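
Because the frames were sampled once per second, each count printed above is also an approximate number of seconds spent in that scene.  A small sketch of turning those counts into durations and shares of the video; the dictionary simply copies the three counts from the output, and the helper name `as_minutes` is hypothetical:

```python
# Counts copied from the checkpoint output above (one frame per second).
scene_counts = {"notebook": 212, "office hour": 100, "other": 18}

def as_minutes(seconds):
    """Format a whole number of seconds as M:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"

total = sum(scene_counts.values())
for scene, count in scene_counts.items():
    share = count / total
    print(f"{scene:12s} {as_minutes(count)}  ({share:.0%} of {total} frames)")
```

With the DataFrame from this project, `df["scene"].value_counts()` would produce the per-scene counts directly.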


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉