My capstone project for Udacity Data Scientist Nanodegree

winterlovet44/capstone-recommend

Udacity Capstone project - Recommender system

Motivation

In this project, I build a recommender system web app with two functions: recommending items that a user may like, and finding items that are similar to a given item.

Table of Contents

  1. Project Description
  2. Installation
  3. EDA
  4. Methodology
  5. Results
  6. Instructions

Project Description

In this project, I build a movie recommender application using the MovieLens 1M dataset. For more information about the dataset, please check here. I use Alternating Least Squares (ALS), as implemented in the implicit library, for user recommendations, and a content-based module for related-item recommendations. I also implement a simple web app that serves recommendations for a given user or item (model serving).

This project covers four steps of an ML pipeline:

  1. ETL: Clean the data and save the cleaned data to a file and a database.
  2. Feature engineering: Transform the features into the form required for model fitting.
  3. Modelling: Build a machine learning pipeline that performs feature engineering and trains the ML model.
  4. Model serving: Build a Flask web app that returns predictions for the user's input query.
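As a rough illustration of the ETL step, the sketch below parses rating lines in the `UserID::MovieID::Rating::Timestamp` format used by the MovieLens 1M ratings file, skips malformed rows, and loads the result into SQLite. It is not the code from this repository: the function names, the table name `ratings`, the in-memory database, and the sample lines are all assumptions made for the example.

```python
import sqlite3


def parse_ratings(lines):
    """Yield (user_id, movie_id, rating, timestamp) tuples, skipping malformed rows."""
    for line in lines:
        parts = line.strip().split("::")
        if len(parts) != 4:
            continue  # wrong number of fields
        try:
            user_id, movie_id, rating, timestamp = (int(p) for p in parts)
        except ValueError:
            continue  # non-numeric field
        yield user_id, movie_id, rating, timestamp


def load_ratings(conn, rows):
    """Create the (hypothetical) ratings table if needed and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(user_id INTEGER, movie_id INTEGER, rating INTEGER, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# Sample lines in the expected format, plus two rows that should be dropped.
sample_lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "not a rating line",
    "",
]

conn = sqlite3.connect(":memory:")  # a real run would point at a database file
load_ratings(conn, parse_ratings(sample_lines))
row_count = conn.execute("SELECT COUNT(*) FROM ratings").fetchone()[0]
print("rows loaded:", row_count)
```

In practice the lines would come from an open file handle instead of a list, and the cleaning rules would depend on what the EDA reveals.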

Installation

The code was implemented in Python 3.9. All required packages are listed in the requirements.txt file.

For quick installation:

pip install -r requirements.txt

EDA

The MovieLens 1M dataset contains users' rating history and movie profiles. To see the EDA of MovieLens 1M, please go to this notebook.

Methodology

Model

  1. Alternating Least Squares (ALS): A matrix factorization approach. The model decomposes the rating matrix into two factor matrices, one for users and one for items.
  2. Content-based (CB): A content-based approach that uses cosine similarity to find the most similar items.
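To make the idea of alternating updates concrete, here is a minimal pure-Python sketch of ALS on explicit ratings. It is a simplified textbook variant fitted only to observed ratings, not the implementation in the implicit library (which targets implicit feedback), and the toy ratings, function names, and hyperparameters are assumptions for illustration. Each step holds one factor matrix fixed and solves a small regularized least-squares problem for every row of the other.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def update(fixed, observed_by_row, n_factors, reg):
    """Re-solve each row's factor vector while the other side's factors stay fixed."""
    result = []
    for observed in observed_by_row:  # observed maps other-side index -> rating
        a = [[reg if i == j else 0.0 for j in range(n_factors)] for i in range(n_factors)]
        b = [0.0] * n_factors
        for idx, rating in observed.items():
            f = fixed[idx]
            for i in range(n_factors):
                b[i] += rating * f[i]
                for j in range(n_factors):
                    a[i][j] += f[i] * f[j]
        result.append(solve(a, b))  # normal equations: (sum f f^T + reg I) x = sum r f
    return result


def als(ratings, n_users, n_items, n_factors=2, reg=0.1, n_iters=20, seed=0):
    """Factorize (user, item, rating) triples into user and item factor lists."""
    rng = random.Random(seed)
    by_user = [{} for _ in range(n_users)]
    by_item = [{} for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u][i] = r
        by_item[i][u] = r
    item_f = [[rng.random() for _ in range(n_factors)] for _ in range(n_items)]
    user_f = [[0.0] * n_factors for _ in range(n_users)]
    for _ in range(n_iters):
        user_f = update(item_f, by_user, n_factors, reg)  # items fixed, solve users
        item_f = update(user_f, by_item, n_factors, reg)  # users fixed, solve items
    return user_f, item_f


def predict(user_f, item_f, u, i):
    """Predicted rating is the dot product of the user and item factor vectors."""
    return sum(x * y for x, y in zip(user_f[u], item_f[i]))


def loss(ratings, user_f, item_f, reg=0.1):
    """Regularized squared error over the observed ratings."""
    err = sum((r - predict(user_f, item_f, u, i)) ** 2 for u, i, r in ratings)
    norm = sum(x * x for f in user_f + item_f for x in f)
    return err + reg * norm


# Toy data: 4 users, 4 items, ratings on a 1-5 scale.
toy_ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
               (2, 1, 5), (2, 3, 2), (3, 2, 2), (3, 3, 1)]
user_factors, item_factors = als(toy_ratings, n_users=4, n_items=4)
print("predicted rating for user 0, item 3:", predict(user_factors, item_factors, 0, 3))
```

Because every update is an exact minimization over one block of variables, the regularized loss cannot increase from one iteration to the next, which is the property that makes the alternating scheme attractive.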

For the ALS model, I use the implementation from implicit for better performance. For the CB model, I wrote my own implementation, which can combine several content features of different data types: list, category, or text. The final similarity between a pair of items is the average of the similarities computed for each input feature. You can find the code of this implementation here.
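The averaging of per-feature similarities described above can be sketched as follows. This is an illustrative reconstruction, not the repository's implementation: the feature names, the `schema` dictionary, the toy items, and the choice to turn every feature type into a count vector before applying cosine similarity are all assumptions.

```python
import math
from collections import Counter


def vectorize(value, kind):
    """Turn a raw feature value into a sparse count vector."""
    if kind == "list":
        return Counter(value)                  # multi-hot over list entries
    if kind == "category":
        return Counter([value])                # one-hot over the category value
    if kind == "text":
        return Counter(value.lower().split())  # bag of whitespace-separated tokens
    raise ValueError(f"unknown feature kind: {kind}")


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def item_similarity(x, y, schema):
    """Average the per-feature cosine similarities of two items."""
    sims = [cosine(vectorize(x[name], kind), vectorize(y[name], kind))
            for name, kind in schema.items()]
    return sum(sims) / len(sims)


def most_similar(item_id, items, schema, top_n=5):
    """Rank all other items by similarity to the given item."""
    target = items[item_id]
    scored = [(other_id, item_similarity(target, other, schema))
              for other_id, other in items.items() if other_id != item_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical schema and items, for illustration only.
schema = {"genres": "list", "decade": "category", "title": "text"}
items = {
    "a": {"genres": ["Animation", "Comedy"], "decade": "1990s", "title": "toy adventure story"},
    "b": {"genres": ["Animation", "Children"], "decade": "1990s", "title": "toy adventure sequel"},
    "c": {"genres": ["Horror"], "decade": "1980s", "title": "night terror"},
}
ranking = most_similar("a", items, schema)
print(ranking)
```

A plain average gives every feature the same weight; a weighted average would be a natural extension if some features prove more informative than others.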

Metrics

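In this project, I only implement evaluation for ALS. To evaluate ALS, I use three metrics: RMSE, P@k and MAP@k.

  1. RMSE (Root Mean Squared Error): The square root of the mean squared difference between the predicted ratings and the true ratings.
  2. P@k (Precision at k): The fraction of the top k recommended items that are relevant to the user.
  3. MAP@k (Mean Average Precision at k): The mean over all users of the average precision at k (AP@k), which also accounts for how highly the relevant items are ranked within the top k.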
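For reference, the three metrics can be written in a few lines of plain Python. This is a sketch under assumptions, not the evaluation code used for the results in this project: the function names are hypothetical, each recommendation list is assumed to contain no duplicates, and AP@k is normalized by min(k, number of relevant items), which is one common convention among several.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def average_precision_at_k(recommended, relevant, k):
    """Average of the precision values at each rank where a relevant item appears."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0


def map_at_k(recs_by_user, relevant_by_user, k):
    """Mean of AP@k over all users that received recommendations."""
    users = list(recs_by_user)
    total = sum(
        average_precision_at_k(recs_by_user[u], relevant_by_user.get(u, set()), k)
        for u in users
    )
    return total / len(users)


# Toy example with two users and k = 3.
recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
relevant = {"u1": {"a", "c"}, "u2": {"y"}}
print("P@3 for u1:", precision_at_k(recs["u1"], relevant["u1"], 3))
print("MAP@3:", map_at_k(recs, relevant, 3))
```

Note that P@k ignores the order of items within the top k, while AP@k rewards lists that place relevant items earlier.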

Results

The table below shows the results of ALS on MovieLens 1M for different numbers of factors. You can view the details here.

Factors   RMSE    MAP@k    P@k
10        3.21    0.10     0.206
30        3.18    0.122    0.244
50        3.18    0.128    0.246
100       3.214   0.121    0.234
300       3.42    0.087    0.171
1000      3.67    0.0364   0.0733

Instructions

  1. ETL pipeline

Both the user dataset and the item dataset need pre-processing. To run the ETL pipeline that cleans the user dataset, run the command below:

python movielens_rating_etl.py

To run the ETL pipeline that cleans the item dataset:

python movielens_meta_etl.py

  2. Build and train the models

There are two models: ALS and ContentBased. To train the ALS model:

python als.py

To train the ContentBased model:

python cb.py

  3. Run the web app

Run the command below to start the web app on localhost:

python run.py

Then go to http://localhost:3000 to see the web app.
