# Exploring and analysing interaction data
In this section we will work with a real-life dataset of user interaction data. 

The data was obtained through a [data donation](https://datadonation.eu/) research project.

This notebook follows a loose structure - you may find other avenues of analysis more interesting. As always, we encourage improvisation. 

## 0 - Goal
The end-goal of this exercise is to build a recommender system that uses a binge-watching feature. The exact implementation is up to you, but the general outline is:
- Detect binge-watching sessions in user data
- Annotate content data with how 'bingeable' the content is (item-based collaborative filtering)
- Annotate user data with 'binge-watching' status of the user (user-based collaborative filtering)
- Combine one or both of the above features into a recommender system that can recommend new shows

## 1 - Exploration


In [None]:
import os.path
import pandas as pd

fileloc = os.path.join('data', 'netflix_viewing_2023.csv')
df = pd.read_csv(fileloc, index_col=0)

# Translate column headers
df.columns = ['ID', 'starting_time', 'hours_watched', 'title', 'device', 'date', 'minutes_watched', 'content_type']

# Correct datatype for date column
df.date = pd.to_datetime(df.date)

The rest of the exploration is up to you.

In [None]:
# So what's in there?
df.info()

## 2 - Cleaning
The dataset contains viewing data for movies and TV shows. For the purpose of a recommender system, we are interested in the series title, not in individual episodes. That information is not currently available as a separate column.

In [None]:
# Implement standardise_title to clean TV show titles
def standardise_title(title):
    '''Standardise a single title.
    In the case of a TV show, only return the show title.
    '''
    return title


df['title_standardised'] = df['title'].apply(standardise_title)
df.info()
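
One possible starting point for the title clean-up is sketched below. It assumes that episode rows look like `Show: Season 1: Episode name`; that pattern is an assumption about the data, so check it against your own `title` column first. The helper name `strip_episode_info` and the list of markers are illustrative only.

```python
import re

# Words that are assumed to separate the show name from the episode part.
# Localised exports may use other words; extend the list after inspecting the data.
_EPISODE_MARKER = re.compile(
    r":\s*(Limited Series|Season|Series|Part|Volume)\b.*$",
    flags=re.IGNORECASE,
)

def strip_episode_info(title):
    """Return the part of the title before the first season-like marker."""
    if not isinstance(title, str):
        return title  # leave missing values untouched
    return _EPISODE_MARKER.sub("", title).strip()
```

Titles without a marker, such as most movies, pass through unchanged. A movie whose title itself contains `: Part 2` would be shortened too, so it may be worth combining this with the `content_type` column.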

Note that this section is iterative. You will probably come back to cleaning a few times in the design process. 

## 3 - Enrichment
There are a few enrichments that make sense for this dataset:
- Rating (e.g. IMDb, RottenTomatoes)
- Watched percentage (percentage of runtime that was watched)
- Full watch (True/False: whether the watched percentage is high enough)

These enrichments can only be realised with external data sources. 

You can use the [OMDb API](http://www.omdbapi.com/) to obtain useful info such as IMDb rating and runtime. You will need to apply for a free API key (at most 1000 requests per day). Make sure you use those 1000 requests in an optimal way.
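
One way to stay within the daily limit is to request each unique title only once and to keep the answers on disk between runs. The sketch below is a minimal illustration of that idea; the helper name `fetch_with_cache` and the cache file name are assumptions, and `fetch` can be any function that takes a title and returns JSON-serialisable data.

```python
import json
import os

def fetch_with_cache(titles, fetch, cache_path="omdb_cache.json"):
    """Call fetch(title) once per unique title and remember results on disk."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            cache = json.load(fh)

    # dict.fromkeys removes duplicates while keeping the original order
    for title in dict.fromkeys(titles):
        if title not in cache:
            cache[title] = fetch(title)

    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)
    return cache
```

Note that failed lookups are cached as well. That saves requests, but a temporary network error will not be retried until the entry is removed from the cache file. Drop missing titles (for example with `.dropna()`) before passing a column to this function.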


In [None]:
import requests
from bs4 import BeautifulSoup

API_KEY = 'your key here'

def get_omdb_data(title):
    # Pass the query as params so that requests URL-encodes the title
    params = {'t': title, 'apikey': API_KEY}
    response = requests.get('http://www.omdbapi.com/', params=params, timeout=10)
    if response.status_code == 200:
        return response.json()
    return None 

def filter_omdb_data(raw_data):
    # keep only the fields of interest
    return raw_data


bridgerton_info = get_omdb_data('bridgerton')
bridgerton_info
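
The sketch below shows one way to turn an OMDb response into the enrichments listed above. It assumes the response contains the `imdbRating` and `Runtime` keys, with the runtime given as text such as `60 min` and `N/A` for missing values. The helper names and the 90% threshold for a full watch are arbitrary choices for illustration.

```python
import re

def parse_omdb_fields(raw_data):
    """Pick rating and runtime out of an OMDb response dict.

    Assumes the 'imdbRating' and 'Runtime' keys, with 'N/A' for missing values.
    """
    if not raw_data or raw_data.get("Response") == "False":
        return {"imdb_rating": None, "runtime_min": None}

    rating = raw_data.get("imdbRating", "N/A")
    runtime = re.search(r"\d+", raw_data.get("Runtime", ""))
    return {
        "imdb_rating": float(rating) if rating != "N/A" else None,
        "runtime_min": int(runtime.group()) if runtime else None,
    }

def watched_percentage(minutes_watched, runtime_min, full_watch_threshold=90.0):
    """Return (percentage, full_watch), or (None, None) if the runtime is unknown."""
    if not runtime_min:
        return None, None
    # Cap at 100 because rewatching can push the ratio above the runtime
    pct = min(100.0 * minutes_watched / runtime_min, 100.0)
    return pct, pct >= full_watch_threshold
```

For TV shows the runtime reported for the series is a typical episode length, so the percentage is only an approximation for individual episodes.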


## 4 - Binge-watching detection
Using the provided data, an algorithm can be implemented to detect binge-watching sessions. This information could be used in a recommender system as a feature. 
We leave the implementation up to you. A general outline:

- Identify watching sessions
- Decide on a definition of binge-watching (X session length, same content for X repeats in a session, etc.)
- Classify sessions based on the above criterion
- Annotate rows with binge information

In [None]:
# binge detection code here
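
The outline above could be sketched as follows. This is only one of many possible definitions: a new session starts when a user pauses for more than `max_gap_minutes`, and a session counts as a binge when the same title appears at least `min_episodes` times. Both thresholds are arbitrary. The `start` column is hypothetical; it stands for a full timestamp that you would have to build yourself from `date` and `starting_time`.

```python
import pandas as pd

def label_binge_sessions(views, max_gap_minutes=30, min_episodes=3):
    """Add 'session_id' and 'is_binge' columns to a viewing log.

    Expects columns 'ID' (user), 'start' (datetime, hypothetical),
    'minutes_watched' and 'title_standardised'.
    """
    out = views.sort_values(["ID", "start"]).copy()
    end = out["start"] + pd.to_timedelta(out["minutes_watched"], unit="min")

    # Gap between the start of a view and the end of the same user's previous view
    gap = out["start"] - end.groupby(out["ID"]).shift()
    new_session = gap.isna() | (gap > pd.Timedelta(minutes=max_gap_minutes))
    out["session_id"] = new_session.cumsum()

    # Number of rows with the same title inside the same session
    per_title = out.groupby(["session_id", "title_standardised"])["start"].transform("size")
    # A session is a binge if any single title reaches min_episodes
    most_repeated = per_title.groupby(out["session_id"]).transform("max")
    out["is_binge"] = most_repeated >= min_episodes
    return out
```

Every row of a binge session is flagged, including rows for other titles watched in the same session. Flagging only the repeated title instead is an equally reasonable choice.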

## 5 - Recommender system
- Adapt the code from `Nearest neighbour and Rating Prediction.ipynb` to build a recommender/rating predictor system using the binge information.
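
As a rough illustration of how binge information could feed into a nearest-neighbour recommender, the sketch below builds a user-by-title matrix in which binge views weigh more than ordinary views, and recommends unseen titles from the most similar users. It is not the code from the referenced notebook. The function name, the `is_binge` column and the `binge_weight` value are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def recommend(views, user, k=2, n=3, binge_weight=2.0):
    """Recommend up to n unseen titles for `user` from the k most similar users.

    Expects columns 'ID', 'title_standardised' and a boolean 'is_binge'.
    """
    # Implicit rating: each view counts 1, each binge view counts binge_weight
    scores = views.assign(score=np.where(views["is_binge"], binge_weight, 1.0))
    matrix = scores.pivot_table(index="ID", columns="title_standardised",
                                values="score", aggfunc="sum", fill_value=0.0)

    # Cosine similarity between the target user and every user
    norms = np.linalg.norm(matrix.values, axis=1)
    target = matrix.loc[user].values
    sims = matrix.values @ target / (norms * np.linalg.norm(target) + 1e-12)
    sims = pd.Series(sims, index=matrix.index).drop(user)

    neighbours = sims.nlargest(k)
    # Similarity-weighted scores of the neighbours, restricted to unseen titles
    predicted = matrix.loc[neighbours.index].mul(neighbours, axis=0).sum()
    unseen = predicted[matrix.loc[user] == 0]
    return unseen[unseen > 0].nlargest(n).index.tolist()
```

The same matrix can be transposed to compare titles instead of users, which corresponds to the item-based variant mentioned in the goal section.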