# Assignment 1: NumPy, Pandas, Matplotlib and Data Wrangling

This assignment contains three main questions, detailed below. The due date is October 22 (Tuesday), 2024, at 11:59 pm. Each day late will result in a loss of 20% of your total points. Please upload a `.ipynb` file with your solved notebook on Moodle.

Good luck, and enjoy!

### Question 1 (5,0 values) Spike Triggered Average

Welcome to the world of neuroscience! <br>
In this exercise you will be gently introduced to the magnificent world of how your brain executes what you think!

<img src="brain.jpg" alt="brain" width="50%">

The brain is perhaps our most complex vital organ: many years of research have still not been sufficient to fully understand how it behaves.
 <br>

Biologically, the brain is formed by a particular type of cell called <strong>neurons</strong>. Neurons are electrically excitable cells that communicate with one another via specialized connections named <strong>synapses</strong>. These connections allow the transmission of chemical and electrical signals between neurons. In very high-level terms, this process gives rise to human thought, making it possible for us to carry out ordinary life tasks and habits such as talking with friends, eating, drinking, studying, and so on.
Moreover, we can say that communication between two neurons takes place when we observe a <strong>spike</strong>, which is elicited by an electrical <strong>stimulus</strong> (over time). <br> 
As an example, imagine you are on your sofa watching your favorite movie: each frame of the video you see with your eyes is converted into a sequence of electrical signals, which we defined above as the <strong>stimulus</strong> (this is the scientific term). The stimulus then flows through the network of your neurons, potentially activating them and affecting your behavior. Specifically, if you watch a love scene, some neurons of the brain may <strong>spike</strong> (that is, they are activated), potentially making you feel pleased and emotional. On the contrary, if you watch a violent scene, some other neurons may spike, potentially making you feel sad and uncomfortable. This is an abstract example about emotions, but bear in mind that examples of this kind extend to more practical activities such as the ones mentioned in the paragraph above. 

Practically speaking, in this exercise, you will analyze a stimulus and, consequently, how it affects the spikes of a single neuron. <strong>Data is randomly generated</strong>, but it simulates the setting depicted above. In the end, you will compute the <strong>Spike-Triggered Average</strong> (STA), which, given a fixed time window, approximates the stimulus's behavior before a spike occurs. This is a time-wise average, meaning that, given many <strong>fixed-in-length time-sequences</strong> (the same number of milliseconds long in this case) of the same stimulus, we average the values of the sequences at each millisecond step. 

Some clarifications: 

You will be provided with two time-series, one with the stimulus and the other one with the spikes. The latter series maps to the former, of course.
The stimulus varies over time, measured in milliseconds (ms): each element of the stimulus is the electrical signal at a single millisecond. The spikes are in binary form: 1 if a spike occurred at that specific millisecond, 0 if it did not. 

In [1]:
# Import Dependencies 

import numpy as np # DON'T CHANGE THIS LINE 
import pickle # DON'T CHANGE THIS LINE 

In [2]:
# Load Data 

path = "data.pickle" # Make sure the dataset "data.pickle" is within the same folder of this notebook
data = pickle.load(open(path, 'rb')) # DON'T CHANGE THIS LINE


In [3]:
# Reference Data 

# DON'T CHANGE THESE 2 LINES 
stimulus = data['stim'] # Stimulus (in STA units) over time (in milliseconds units) - Artificial Data - type: numpy.ndarray
rho = data['rho'] # Spikes - 0 or 1. 0 no spike, 1 yes spike - Mapping stimulus - type: numpy.ndarray

#### Question 1.1

How many milliseconds does the `stimulus` provided above have? 

In [None]:
# Question 1.1


#### Question 1.2

Filter out the low stimulus values. 
Set a minimum threshold of 10 STA units for the <strong> absolute value </strong> of the stimulus and filter out everything below it (do not change the original stimulus array).

For example: 
<br>Consider stimulus = [-5.2345, 3.4564, 13.1245, -15.2356]<br>
The final result should be: filtered_stimulus = [13.1245, -15.2356]

Tip: Use 
<strong> print(filtered_stimulus[0:100]) </strong>
to check if the first 100 values are being filtered correctly.
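
As a rough illustration of the filtering idea, here is a minimal sketch on the toy values from the example above (not the real data). The names `toy_stimulus`, `threshold`, `mask` and `toy_filtered` are hypothetical, and the sketch assumes that values exactly equal to the threshold are kept.

```python
import numpy as np

# Toy values taken from the example above (not the real stimulus)
toy_stimulus = np.array([-5.2345, 3.4564, 13.1245, -15.2356])
threshold = 10  # minimum absolute value to keep (assumed inclusive)

# Boolean mask: True where the absolute value reaches the threshold
mask = np.abs(toy_stimulus) >= threshold

# Indexing with a boolean mask returns a new array; toy_stimulus is untouched
toy_filtered = toy_stimulus[mask]
print(toy_filtered)
```

Boolean indexing always produces a copy, so the original array keeps all of its values.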

In [None]:
# Question 1.2

#### Question 1.3

Compute the interquartile range of the values of the `stimulus` time series.

  ℹ️<strong>interquartile_range = q75 - q25</strong>
    <br> Where: q25 is the first quartile and q75 is the third quartile.
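
The formula above can be sketched on a small toy array as follows. This assumes NumPy's default (linear interpolation) percentile method; `toy` and `toy_iqr` are hypothetical names.

```python
import numpy as np

# Small toy array (not the real stimulus)
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# First and third quartiles in a single call
q25, q75 = np.percentile(toy, [25, 75])

# Interquartile range as defined above
toy_iqr = q75 - q25
print(q25, q75, toy_iqr)
```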

In [None]:
# Question 1.3

#### Question 1.4

Find the positions of the three largest values of the `stimulus` and replace these values with the average (do not change the original stimulus array).
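
One possible way to approach this is sketched below on toy data. It assumes that "the average" means the mean of the whole original array, which is an interpretation rather than something the text states; `toy`, `toy_copy` and `top3_idx` are hypothetical names.

```python
import numpy as np

# Toy array (not the real stimulus)
toy = np.array([4.0, 9.0, 1.0, 7.0, 8.0, 3.0])

# Work on a copy so the original array stays unchanged
toy_copy = toy.copy()

# argsort sorts ascending, so the last three indices point at the three largest values
top3_idx = np.argsort(toy)[-3:]

# Assumption: "the average" is the mean of the whole original array
toy_copy[top3_idx] = toy.mean()

print(top3_idx)
print(toy_copy)
```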

In [None]:
# Question 1.4

#### Question 1.5

Compute the <strong>Spike Triggered Average</strong> as described previously with a time window of 300 ms. I.e. each sequence to be considered for the Spike Triggered Average should have a length of 300 ms. 

Here we provide a visual toy example of the Spike Triggered Average complementing what is described in the main passage:

<img src="sta_example.png" alt="img not available" width="50%">

Each sequence is averaged (millisecond-wise) with a time window of 30 ms before a spike. Bear in mind that, in this question, you are asked to use a 300 ms time window.
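
The averaging idea can be sketched on a tiny toy signal with a window of 3 samples. This is only an illustration under two assumptions that the text does not spell out: the window covers the samples immediately before the spike (the spike sample itself is excluded), and spikes that occur too early to have a full window are skipped. All names (`toy_stim`, `toy_rho`, `window`, `toy_sta`) are hypothetical.

```python
import numpy as np

# Toy stimulus and matching binary spike train (not the real data)
toy_stim = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
toy_rho = np.array([0, 1, 0, 1, 0, 0, 1, 0])
window = 3  # toy window length in samples

# Indices where a spike occurred
spike_times = np.flatnonzero(toy_rho)

# Assumption: skip spikes that do not have a full window before them
spike_times = spike_times[spike_times >= window]

# One fixed-length stimulus segment per spike (assumed to end just before the spike)
segments = np.array([toy_stim[t - window:t] for t in spike_times])

# Average the segments position by position
toy_sta = segments.mean(axis=0)
print(toy_sta)
```

Here the spike at index 1 is dropped, and the segments `[1, 2, 3]` and `[4, 5, 6]` are averaged element-wise.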

In [None]:
# Question 1.5


##### 🎯Check your answer - NOT GRADED

In [4]:
import matplotlib.pyplot as plt # DON'T CHANGE THIS LINE 
import matplotlib.image as mpimg # DON'T CHANGE THIS LINE 


sta = None # Please delete "None" and insert here your spike triggered average answer 
plt.plot(range(sta.shape[0]), sta) # DON'T CHANGE THIS LINE 

<strong>If you did everything correctly, your plot should look the same as the one below!</strong>

<img src="sta_sample.png" alt="no-picture" align="left"> <br><br><br><br><br><br><br><br><br><br><br><br><br>


<strong>Congratulations! You have just learned how the brain works! Kudos to you! :)</strong>

<br><br>
### Question 2 (7,0 values)  Aircraft Wildlife Strikes

![](https://i.pinimg.com/400x/30/de/ee/30deee2cbc1aec96a1532c4a27962f0c--bird-strike-passenger-aircraft.jpg)

The "wildlifestrikes.csv" dataset contains a record of each reported wildlife strike of military, commercial, or civil aircraft between 1990 and 2015. Each row contains the incident date, aircraft operator, aircraft make and model, engine make and model, airport name and location, species name and quantity, and aircraft damage.
The wildlife strike database was compiled from reports received from airports, airlines, and pilots and published by the Federal Aviation Administration. 
<br>

**Question 2.1** 

Import the dataset "wildlifestrikes.csv" and show the ***total number of records***, the ***number of columns***, and how many ***unique species*** the dataset has.


In [None]:
import pandas as pd
import numpy as np

In [None]:
# Question 2.1



**Question 2.2** 

In 2000, what percentage of the flights that suffered a wildlife strike had to perform a "PRECAUTIONARY LANDING"? (hint: check the "Flight Impact" column in the dataset). Report the result to two decimal places.

In [None]:
# Question 2.2


**Question 2.3** 

Which year had the most fatalities due to wildlife strikes? (hint: check the "Fatalities" column in the dataset)

In [None]:
# Question 2.3


**Question 2.4** 

Find the top 5 species that caused engine damage on the aircraft after striking it (remember that there are up to 4 different possible engines on an aircraft). For this exercise, you can consider unknown species.

In [None]:
# Question 2.4






**Question 2.5** 

Find the name of the species that was responsible for the most incidents and indicate the total number of incidents caused by it. The result cannot be an unknown bird.

In [None]:
# Question 2.5



**Question 2.6** 

Show how many incidents occurred every five years (i.e. total number of incidents for each of the following time intervals: (1990, 1995], (1995, 2000], (2000, 2005], (2005, 2010], and (2010, 2015]). Note that the time intervals are open on the left and closed on the right.
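
To illustrate left-open, right-closed intervals, here is a minimal sketch using `pd.cut` on a few toy years (not the real dataset). The names `toy_years`, `bins`, `intervals` and `counts` are hypothetical.

```python
import pandas as pd

# Toy years (not the real dataset)
toy_years = pd.Series([1990, 1991, 1995, 1996, 2000, 2010, 2015])

# Bin edges for the five intervals
bins = [1990, 1995, 2000, 2005, 2010, 2015]

# right=True makes every interval open on the left and closed on the right
intervals = pd.cut(toy_years, bins=bins, right=True)

# Count per interval, ordered by interval; empty intervals show up with 0
counts = intervals.value_counts().sort_index()
print(counts)
```

Note that with these edges the year 1990 itself falls outside the first interval `(1990, 1995]` and becomes a missing value.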

In [None]:
# Question 2.6


**Question 2.7** 

Consider the five time intervals: (1990, 1995], (1995, 2000], (2000, 2005], (2005, 2010], and (2010, 2015]. Use a pivot_table to find out, ***on each time interval***, which type of aircraft ("Aircraft") got most frequently damaged ("Aircraft Damage") after a wildlife strike. (hint: First, build the pivot table with the total sum of the damage suffered by each type of aircraft in each time interval. After that, obtain the aircraft type with the highest value for each time interval.)
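
The two steps in the hint can be sketched on a toy table as below. The column names "Aircraft" and "Aircraft Damage" come from the question; the `period` column, the aircraft names, and the assumption that damage is encoded as 0/1 are hypothetical.

```python
import pandas as pd

# Toy records (not the real dataset); damage assumed to be encoded as 0/1
toy = pd.DataFrame({
    "period": ["A", "A", "A", "B", "B", "B"],
    "Aircraft": ["X-1", "X-1", "Y-2", "X-1", "Y-2", "Y-2"],
    "Aircraft Damage": [1, 1, 0, 0, 1, 1],
})

# Step 1: total damage per aircraft type (rows) and period (columns)
table = toy.pivot_table(
    index="Aircraft",
    columns="period",
    values="Aircraft Damage",
    aggfunc="sum",
)

# Step 2: for each period (column), the row label with the highest total
most_damaged = table.idxmax()
print(table)
print(most_damaged)
```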

In [None]:
# Question 2.7




### Question 3 (8 values) - A Million Dollar Question: Squid Game or Alice in Borderland?


“What TV series should I binge-watch this evening?” This is perhaps a question you ask yourself very often. I certainly do, and more than once. From Netflix to Hulu, building robust movie recommendation systems is extremely important given modern consumers' huge demand for personalized content. **Netflix is forecasting it will add 3.5 million paying subscribers thanks to the surprise hit Squid Game.**

We are going to examine a dataset from MovieLens, a service that provides non-commercial, personalized movie recommendations. 

This dataset describes user ratings from MovieLens. It contains ratings and tag applications created by users across movies. Users were selected at random for inclusion. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files movies_fe.xlsx and ratings_fe.csv. More details about the contents and use of these files follow.

**Ratings Data File Structure (ratings_fe.csv)**
All ratings are contained in the file ratings_fe.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
`userId,movieId,rating,timestamp`

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

**Movies Data File Structure (movies_fe.xlsx)**
Movie information is contained in the file movies_fe.xlsx. Each line of this file after the header row represents one movie, and has the following format:
`movieId,title,year,genres`

Answer the following questions using the provided dataset. You can write down intermediate results on the way to the final answers.

In [None]:
import pandas as pd
import numpy as np

#### Question 3.1

However, there may be errors and inconsistencies in these files, as shown below:

The ratings in `ratings_fe.csv` should be on a 5-star scale with half-star increments (0.5 stars - 5.0 stars). If a rating is larger than 5 or smaller than 0.5, you need to set it to 5 or 1, respectively. For example, a movie rated 8 was probably rated wrongly, and you need to change the value to 5. Similarly, a negative rating such as -1, if any, should be changed to 1.

The movie information in `movies_fe.xlsx` includes movies with missing **year** information. You should remove them.

You should also inspect the data to make sure you read it from the correct starting row.
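
The rating correction described above can be sketched on a toy Series (not the real file). Because too-low values go to 1 rather than 0.5, a plain clip to the range 0.5 to 5 would not match the rule, so this sketch uses two conditional replacements instead. `toy_ratings` and `fixed` are hypothetical names.

```python
import pandas as pd

# Toy ratings (not the real data), including out-of-range values
toy_ratings = pd.Series([8.0, -1.0, 3.5, 0.5, 5.0, 0.0])

# Replace values above 5 with 5, then values below 0.5 with 1;
# both conditions are evaluated on the original values
fixed = (
    toy_ratings
    .mask(toy_ratings > 5, 5.0)
    .mask(toy_ratings < 0.5, 1.0)
)
print(fixed.tolist())
```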

In [None]:
# Question 3.1

#### Question 3.2 

Show the top 5 Action movies with the highest median ratings:

In [None]:
# Question 3.2

#### Question 3.3 

Among all the movies that the user with Id 500 has rated, show his/her 5 most recently rated favorite movies (i.e., the movies he/she rated 5) in each of the following three genres: **Adventure**, **Comedy**, **Drama**. Show the result as three columns: `movieId, title, genre`. If a movie belongs to several of these genres, it is ok to include it several times.

In [None]:
# Question 3.3

#### Question 3.4 

Show a pivot table of the mean and standard deviation of the movie ratings, with the release decade as rows (for example, the year 1995 belongs to the 1990s) and the quartile of the timestamp values (i.e., 4 groups) as columns.
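
A minimal sketch of the two derived columns and the pivot table, on toy data only. It assumes the decade is obtained by integer division of the year and that the timestamp quartiles are built with `pd.qcut`; the column names `decade` and `ts_quartile` and the labels `Q1` to `Q4` are hypothetical.

```python
import pandas as pd

# Toy merged table (not the real data)
toy = pd.DataFrame({
    "year": [1985, 1988, 1995, 1999, 1983, 1987, 1991, 1996],
    "rating": [4.0, 2.0, 5.0, 3.0, 1.0, 3.0, 4.5, 3.5],
    "timestamp": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Release decade: 1995 -> 1990
toy["decade"] = (toy["year"] // 10) * 10

# Four equally sized timestamp groups, stored as plain strings
toy["ts_quartile"] = pd.qcut(
    toy["timestamp"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
).astype(str)

# Mean and standard deviation of the ratings per decade and quartile
table = toy.pivot_table(
    index="decade",
    columns="ts_quartile",
    values="rating",
    aggfunc=["mean", "std"],
)
print(table)
```

Combinations of decade and quartile that have no ratings appear as missing values in the resulting table.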

In [None]:
# Question 3.4

#### Question 3.5 

Now you need to implement a **recommender system using a collaborative filtering method**. It works by recommending movies on the principle that "people who like this movie also like these movies". For example, people who like to watch Star Wars are very likely to watch Star Trek. 

In order to do so, you need to find all users who like one movie (i.e., post a rating of 5), and identify the movies these users also like, ranked by the number of likes. 

Show the list of the top 10 recommended movies that users who like *Titanic* may also like.
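
The procedure described above can be sketched on a toy ratings table as follows. The ids and ratings are made up, and `target`, `likes`, `fans`, `co_liked` and `recommended` are hypothetical names. This is one simple co-occurrence count, not necessarily the only valid implementation.

```python
import pandas as pd

# Toy ratings (not the real data)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 40, 10, 30, 20, 30],
    "rating": [5, 5, 4, 5, 5, 5, 3, 5, 5, 5],
})
target = 10  # hypothetical id of the movie we start from

# A "like" is a rating of 5
likes = toy_ratings[toy_ratings["rating"] == 5]

# Users who like the target movie
fans = likes.loc[likes["movieId"] == target, "userId"]

# Other movies liked by those users
co_liked = likes[likes["userId"].isin(fans) & (likes["movieId"] != target)]

# Rank the other movies by number of likes (descending)
recommended = co_liked["movieId"].value_counts()
print(recommended.head(10))
```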

In [None]:
# Question 3.5

Congratulations! You have just built your first [recommender system worth 1 million dollars](https://www.netflixprize.com/) :D

![netflix_prize](https://cdn.vox-cdn.com/thumbor/Kp9TEknNzIQV-ZijAm74cfHx_D0=/0x124:1100x700/fit-in/1200x630/cdn.vox-cdn.com/uploads/chorus_asset/file/15788062/netflix-prize1.0.1537040369.jpg)

Before submission, do not forget to restart the kernel and run the whole notebook. 

Thank you!!