# Project Description

This project aims to optimise short-video recommendation using multi-modal
information.

# Data preparation

We need to prepare four parts of data.

**Part 1**: videos, preferably in `.mp4` format.

**Part 2**: music, preferably in `.wav` format.

**Part 3**: user-video interaction information. Please include the following
fields and save the data in a `.csv` file.
The data type and content of each field are described below.


| Field       | Data Type     | Content  |
| ------------|:--------------| :-----|
| userID      | string or int | the user's ID |
| userGender  | string        | the user's gender |
| userAge     | int           | the user's age |
| userLocation| string        | the user's location |
| videoID     | string or int | the video's ID |
| liveScores  | float         | the video's live score |
| finish      | int           | 1 if the user finished the video, otherwise 0 |
| like        | int           | 1 if the user liked the video, otherwise 0 |
| favorites   | int           | 1 if the user added the video to their favorites, otherwise 0 |
| forward     | int           | 1 if the user forwarded the video, otherwise 0 |

Specifically, the live score of a short video is calculated as follows:

$\frac{1}{n}\sum \limits_{i=1}^{n} \frac{finish_i + 1.2\, like_i + 1.2\, favorites_i + 0.8\, forward_i}{finish_i + like_i + favorites_i + forward_i}$

where $n$ is the number of users who have interacted with this short video so far, and the subscript $i$ denotes the $i$-th such user.
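
As a rough illustration of the live score, the sketch below averages the per-user ratio over a list of interaction records. The function name `live_score` and the dictionary layout are hypothetical. The formula does not say how to treat a user whose four flags are all 0 (the ratio would be 0/0), so the sketch assumes such a user contributes 0.

```python
# Weights taken from the live score formula.
WEIGHTS = {"finish": 1.0, "like": 1.2, "favorites": 1.2, "forward": 0.8}


def live_score(interactions):
    """Average the weighted-to-unweighted action ratio over users.

    `interactions` holds one dict per user, with 0/1 values for the
    keys finish, like, favorites and forward.
    """
    ratios = []
    for row in interactions:
        total = sum(row[key] for key in WEIGHTS)
        if total == 0:
            # Assumption: the ratio is undefined here, so count it as 0.
            ratios.append(0.0)
            continue
        weighted = sum(weight * row[key] for key, weight in WEIGHTS.items())
        ratios.append(weighted / total)
    # Assumption: a video with no interactions gets a score of 0.
    return sum(ratios) / len(ratios) if ratios else 0.0


users = [
    {"finish": 1, "like": 1, "favorites": 0, "forward": 0},  # (1 + 1.2) / 2
    {"finish": 1, "like": 0, "favorites": 0, "forward": 0},  # 1 / 1
]
score = live_score(users)  # approximately 1.05
```

Under this formula a video scores above 1 when likes and favorites dominate, and below 1 when forwards dominate.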


**Part 4**: video information. Please include the following fields and save
the data in a `.csv` file.
The data type and content of each field are described below.


| Field        | Data Type           | Content  |
| ------------- |:-------------| :-----|
| videoID      | string or int |the video's ID|
| musicID      | string or int |ID of the video's music|
| videoLocation | string      |the video's location|
| description | string      |the video's description |
| comment1 | string      |the video's top comment |
| comment2 | string      |the video's second top comment |
| comment3 | string      |the video's third top comment|
| comment4 | string      |the video's fourth top comment |
| comment5 | string      |the video's fifth top comment|
| hot | int      |1 if the video is on the hot list, otherwise 0 |
| entertainmentHot | int      |1 if the video is on the entertainment hot list, otherwise 0 |
| socialHot | int      |1 if the video is on the social hot list, otherwise 0 |
| challengeHot | int      |1 if the video is on the challenge hot list, otherwise 0 |
| authorHot | int      |1 if the video is on the author hot list, otherwise 0 |
| authorID | string or int      |the author's ID|
| authorGender | string      |the author's gender|
| authorAge | int      |the author's age|
| authorLocation | string      |the author's location|
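
The Part 3 and Part 4 files share the `videoID` field, so they can be joined on it. The following is a minimal sketch using only the standard library. The inline sample rows are made up and contain only a subset of the columns, and the helper name `join_on_video_id` is hypothetical.

```python
import csv
import io

# Tiny inline stand-ins for the Part 3 and Part 4 files (values are made up).
interactions_csv = io.StringIO(
    "userID,videoID,finish,like,favorites,forward\n"
    "u1,v1,1,0,0,0\n"
    "u2,v1,0,0,0,0\n"
    "u3,v9,0,1,0,0\n"
)
videos_csv = io.StringIO(
    "videoID,musicID,authorID,hot\n"
    "v1,m7,a3,1\n"
)


def join_on_video_id(interactions_file, videos_file):
    """Attach each interaction row to the metadata of its video."""
    videos = {row["videoID"]: row for row in csv.DictReader(videos_file)}
    joined = []
    for row in csv.DictReader(interactions_file):
        video = videos.get(row["videoID"])
        if video is None:
            # Assumption: interactions with unknown videos are skipped.
            continue
        joined.append({**video, **row})
    return joined


rows = join_on_video_id(interactions_csv, videos_csv)
```

Note that `csv.DictReader` returns every value as a string, so the 0/1 flags and numeric fields need an explicit conversion before use.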

# Model Structure

We use a BERT-based model to apply self-attention across modalities. The model
input is illustrated below.

![input](./docs/input.png)

## Input

We use three parts of information as the input (the user, the author, and the video
information) to predict whether we should recommend a short video to a user.

For the video part, we preprocess each video to obtain three kinds of information as
input: semantics, actions, and music.
To get semantic information, we extract subtitles from key frames and extract semantic
features using S3D, which is pretrained to generate video titles.
To get action information, we extract action features using 3D-ResNet, which is pretrained
to classify the actions of humans and animals.
To get music information, we extract music features using Wav2Vec, which is pretrained
on songs.

Taken together, this information comprises three types of data: text, categorical, and
numerical. For text input, we obtain word embeddings from a pretrained BERT model.
For categorical input, we use separate embedding layers to encode each category.
For numerical input, we employ an MLP to transform it into the desired input embeddings.
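
To make the categorical and numerical paths concrete, here is a dependency-free sketch of an embedding lookup and a small MLP. It is not the project's implementation. The embedding size `DIM`, the hidden width of 16, the gender vocabulary, the age scaling, and all weights are assumptions chosen for illustration, and biases are omitted for brevity. The text path is not shown because it requires a pretrained BERT model.

```python
import random

random.seed(0)
DIM = 8  # assumed embedding size for illustration; BERT-base uses 768


def random_matrix(rows, cols):
    """Randomly initialised weights standing in for learned parameters."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight):
    """Multiply a vector x (length = rows) by a rows x cols weight matrix."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weight))
            for j in range(len(weight[0]))]


# Categorical input: one embedding table per field, indexed by category.
gender_vocab = {"female": 0, "male": 1, "unknown": 2}  # hypothetical vocabulary
gender_table = random_matrix(len(gender_vocab), DIM)


def embed_category(value, vocab, table):
    return table[vocab[value]]


# Numerical input: a two-layer MLP with ReLU maps a scalar to DIM values.
w1 = random_matrix(1, 16)
w2 = random_matrix(16, DIM)


def embed_number(value):
    hidden = [max(0.0, h) for h in linear([value], w1)]
    return linear(hidden, w2)


gender_vec = embed_category("female", gender_vocab, gender_table)
age_vec = embed_number(23 / 100.0)  # assumed scaling of age to roughly [0, 1]
```

Both paths produce vectors of the same length, so they can sit in one input sequence alongside the text embeddings.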

## Label

The goal of this model is to predict whether we should recommend a short video to a
user. The ground-truth label is 1 for yes and 0 for no. We label a sample as 1 if
the user performed at least one of these actions: finish, like, favorite, or forward.
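
The labelling rule can be sketched as a small function over one row of the interaction file. The function name `make_label` is hypothetical, and the field names follow the Part 3 table.

```python
ACTIONS = ("finish", "like", "favorites", "forward")


def make_label(row):
    """Return 1 if the user performed at least one action, otherwise 0.

    `row` maps field names to 0/1 values; strings such as "1" are accepted
    because values read from a CSV file arrive as text.
    """
    return int(any(int(row[action]) == 1 for action in ACTIONS))


positive = make_label({"finish": 0, "like": 1, "favorites": 0, "forward": 0})
negative = make_label({"finish": 0, "like": 0, "favorites": 0, "forward": 0})
```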

## Loss function

We feed the last hidden state of the [CLS] token into a classifier, which has two
feed-forward layers and one fully-connected layer, to predict the label. The loss
function is cross-entropy.
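
For reference, the cross-entropy of a single example can be computed as below. This sketch assumes the classifier outputs one raw score (logit) per class, which the description above does not specify. The function name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example given raw class scores (logits).

    Equivalent to -log(softmax(logits)[label]); subtracting the maximum
    first keeps the exponentials numerically stable.
    """
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Two classes: index 0 = do not recommend, index 1 = recommend.
uncertain = cross_entropy([0.0, 0.0], 1)   # equals log(2)
confident = cross_entropy([0.0, 10.0], 1)  # close to 0
```

The loss approaches 0 as the score of the correct class grows relative to the other, and grows without bound when the model is confidently wrong.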




# Future work

We intended to separate the narration in the audio from the music, which itself has a
vocal track, but did not find an available pretrained model. We expect that including
narration features as input would improve the model's performance.
