# Data Pipeline Demo

<img src="../images/SystemDesign.png"/>

This notebook is a hands-on demonstration of the data pipeline shown above. Follow along to see how data moves from ingestion, through projection, to recommendation, and finally produces a match for the user.

## Setup

The first thing we'll do is install the required Python packages. Even if the code already works outside of Jupyter, we should do this to be safe, because Jupyter may not be using the same environment we normally use to run Python code. The last bit of code restarts the kernel, which may be required before we can use the updated packages. Just reload the page after the kernel dies, and skip this cell on later runs.

In [6]:
%pip install pandas
%pip install boto3

import IPython
IPython.Application.instance().kernel.do_shutdown(True)

You should consider upgrading via the '/usr/local/Cellar/jupyterlab/3.4.3/libexec/bin/python3.10 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
Collecting boto3
  Using cached boto3-1.24.36-py3-none-any.whl (132 kB)
Collecting jmespath<2.0.0,>=0.7.1
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting botocore<1.28.0,>=1.27.36
  Using cached botocore-1.27.36-py3-none-any.whl (9.0 MB)
Collecting s3transfer<0.7.0,>=0.6.0
  Using cached s3transfer-0.6.0-py3-none-any.whl (79 kB)
Installing collected packages: jmespath, botocore, s3transfer, boto3
Successfully installed boto3-1.24.36 botocore-1.27.36 jmespath-1.0.1 s3transfer-0.6.0
You should consider upgrading via the '/usr/local/Cellar/jupyterlab/3.4.3/libexec/bin/python3.10 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


{'status': 'ok', 'restart': True}
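
Since the concern above is that Jupyter may run a different Python environment than the one we use elsewhere, a quick check like the following can show which interpreter the kernel is actually using. This is a minimal sketch using only the standard library; the helper name `describe_environment` is made up for illustration and is not part of the `app` package.

```python
import sys


def describe_environment():
    """Return basic facts about the interpreter running this kernel."""
    return {
        "executable": sys.executable,        # path to the Python binary in use
        "prefix": sys.prefix,                # root of the active environment
        "version": sys.version_info[:3],     # (major, minor, micro)
    }


env = describe_environment()
print(env["executable"])
print(env["prefix"])
print(env["version"])
```

If the printed path differs from the interpreter used on the command line, packages installed there won't be visible in the notebook, which is why the `%pip install` cell is worth running once.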

In [19]:
# Standard library imports
from pathlib import Path
import sys

# Third-party imports
import pandas as pd

In [20]:
# Internal imports
sys.path.append("..")
from app.ingestion.dataframe_ingestion_client import DataframeIngestionClient
from app.ingestion.main_datastore_proxy import MainDatastoreProxy
from app.projection.projection_engine import ProjectionEngine
from app.projection.projection_datastore_factory import ProjectionDatastoreFactory
from app.recommendation.match_generator_factory import MatchGeneratorFactory

## Ingestion

Now we can create a main data store and upload some data. For this demo, we'll use the in-memory data store, which means the data won't persist once the program ends or, in this case, once the kernel restarts.

In [21]:
# Path to input data
filepath = "../tests/test_data.csv"

In [22]:
data = pd.read_csv(filepath, header=0)
database = MainDatastoreProxy(in_memory=True)
client = DataframeIngestionClient(database)
client.upload(data)
print(data)

   author        movie  rating
0  steven  bladerunner    0.80
1   isaac  bladerunner    1.00
2   ebert  bladerunner    1.00
3  steven       clerks    0.60
4   ebert       clerks    0.75
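
To get a feel for the uploaded data, the sketch below groups the same five reviews by author using pandas only. It rebuilds the table inline rather than reading the CSV, and it does not use the `app` package: how `MainDatastoreProxy` stores reviews internally isn't shown in this notebook, so the per-author dictionary here is an assumption made for illustration.

```python
import pandas as pd

# Same rows as the table printed above, rebuilt inline.
data = pd.DataFrame({
    "author": ["steven", "isaac", "ebert", "steven", "ebert"],
    "movie": ["bladerunner", "bladerunner", "bladerunner", "clerks", "clerks"],
    "rating": [0.80, 1.00, 1.00, 0.60, 0.75],
})

# Group reviews by author, keeping authors in order of first appearance.
reviews_by_author = {
    author: {row.movie: float(row.rating) for row in group.itertuples()}
    for author, group in data.groupby("author", sort=False)
}

for author, reviews in reviews_by_author.items():
    print(author, reviews)
```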


## Projection

In [23]:
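authors = list(database.get_keys())

# Build the projection data store, configured with the projection and movie-index file paths
projection_databse = ProjectionDatastoreFactory(
    projection_filepath='../data/projection.json',
    movie_indices_filepath='../data/movie_indices.json',
).build()

# Project the main data store into the projection data store and show the result
projection_engine = ProjectionEngine(database, projection_databse)
projection_engine.create_projection()
print(projection_databse.get())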

{'steven': [0.8], 'isaac': [1.0], 'ebert': [1.0], '_average': [0.9333333333333332]}
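
The one-element vectors above line up with each author's bladerunner rating, and `_average` matches the mean of those three ratings. The sketch below reproduces those numbers with pandas alone. It is only an illustration under that assumption; it does not show how `ProjectionEngine` works internally, and it does not explain why clerks is absent from the printed projection.

```python
import pandas as pd

# Same rows as the ingested test data.
data = pd.DataFrame({
    "author": ["steven", "isaac", "ebert", "steven", "ebert"],
    "movie": ["bladerunner", "bladerunner", "bladerunner", "clerks", "clerks"],
    "rating": [0.80, 1.00, 1.00, 0.60, 0.75],
})

# Assumption: each vector holds the author's rating for a single movie.
bladerunner = data[data["movie"] == "bladerunner"].set_index("author")["rating"]

projection = {author: [float(rating)] for author, rating in bladerunner.items()}
projection["_average"] = [float(bladerunner.mean())]  # mean of 0.8, 1.0, 1.0

print(projection)
```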


## Recommendation

In [18]:
match_generator = MatchGeneratorFactory(database, projection_databse).build()
user_input = {'bladerunner': 0.4}
match = match_generator.get_match(user_input)
print(match)

('steven', [Review(author='steven', movie='bladerunner', rating=0.8), Review(author='steven', movie='clerks', rating=0.6)])
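
One plausible reading of this result is a nearest-neighbour lookup: the user's rating of 0.4 for bladerunner is closer to steven's 0.8 than to the 1.0 given by isaac and ebert. The sketch below checks that reading with Euclidean distance over the projection printed earlier. The metric and the choice to skip the `_average` entry are assumptions; the match generator built by `MatchGeneratorFactory` may work differently.

```python
import math

# Projection values copied from the output in the Projection section.
projection = {
    "steven": [0.8],
    "isaac": [1.0],
    "ebert": [1.0],
    "_average": [0.9333333333333332],
}
user_vector = [0.4]  # the user's bladerunner rating

# Assumption: entries starting with "_" are bookkeeping, not real authors.
candidates = {a: v for a, v in projection.items() if not a.startswith("_")}

# Pick the author whose vector is closest to the user's vector.
closest = min(candidates, key=lambda a: math.dist(candidates[a], user_vector))
print(closest)  # steven
```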
