# CU Recommender System

In this example we create a CU recommender system by extracting feature vectors using PaddlePaddle, importing those vectors into Milvus, and then searching in Milvus and Redis.

## Data
In this project, we use the nsl-finance tenant data for 2021. This dataset contains approximately 3,800 CUs created under 160 books and 2,500 solutions.

We use the following files:
- movies.dat: Contains movie information.
- movie_vectors.txt: Contains movie vectors that can be imported to Milvus easily.

File structure:

- MovieID::Title::Genres
    - Titles are identical to titles provided by IMDB (including year of release)
    - Genres are pipe-separated
    - Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries
    - Movies are mostly entered by hand, so errors and inconsistencies may exist



## Requirements

Due to package constraints, this notebook needs to be run with Python 3.6 or 3.7. We recommend using a virtual environment such as Conda; installation instructions can be found [here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html).

There is currently a rough workaround for Python 3.8. When installing from `requirements.txt`, pip will fail to install `sentencepiece`. If you rerun the notebook after the install fails and avoid redownloading the packages, the rest of the notebook should run without any hiccups.

| Packages | Servers |
| --------------- | -------------- |
| pymilvus==2.0.0rc5 | milvus-2.0.0-rc5 |
| pymongo | mongodb |
| paddle_serving_app | |
| paddlepaddle==2.1.1 | |


The included `requirements.txt` file lists the required packages so they can be installed in one step.

## Up and Running

### Installing Packages
Install the required Python packages from `requirements.txt`. If you are using Python 3.8, see the workaround in the Requirements section. Uninstall any previous pymilvus-orm installation if needed.

In [79]:
! pip3 install -r requirements.txt

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Collecting pymilvus==2.0.0rc6
  Using cached pymilvus-2.0.0rc6-py3-none-any.whl (117 kB)
Collecting flask-cors
  Using cached Flask_Cors-3.0.10-py2.py3-none-any.whl (14 kB)
Collecting flask
  Using cached Flask-2.0.2-py3-none-any.whl (95 kB)
Collecting flask_restful
  Using cached Flask_RESTful-0.3.9-py2.py3-none-any.whl (25 kB)
Collecting sentence_transformers
  Using cached sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting grpcio==1.37.1
  Using cached grpcio-1.37.1.tar.gz (21.7 MB)
Collecting ujson>=2.0.0
  Using cached ujson-4.2.0.tar.gz (7.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecti

### Getting Milvus Server

This demo uses Milvus 2.0 Standalone with docker-compose. Refer to [Install Milvus 2.0](https://milvus.io/docs/v2.0.0/install_standalone-docker.md) for other installation options (on Kubernetes, or using Milvus Cluster). We have currently deployed standalone Milvus in the dev Kubernetes cluster, so we need to find the endpoint using the pod IP.

In [None]:
#!docker-compose up -d

### Starting Redis Server
We use Redis as a metadata storage service. The code can easily be modified to use a Python dictionary instead, but that rarely works outside of quick examples. We need a metadata storage service in order to map between embeddings and their corresponding data.

In [None]:
!docker run  --name redis -d -p 6379:6379 redis
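
The mapping role described above can be sketched without a running server. The minimal example below uses a hypothetical `DictStore` class (not part of Redis or this project) that mimics the `set`/`get` calls used later in this notebook, assuming values come back as bytes, as they do from the `redis` client by default. The IDs and metadata are made up for illustration.

```python
import json


class DictStore:
    """Hypothetical in-memory stand-in for the Redis client (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Store values as bytes, mirroring what a Redis client returns by default
        self._data[key] = value.encode("utf-8")

    def get(self, key):
        # Return None for missing keys, as Redis does
        return self._data.get(key)


store = DictStore()
store.set("1##movie_info", json.dumps({"movie_id": "1", "title": "Toy Story (1995)"}))

# IDs as they might come back from a vector search (made-up values)
hit_ids = [1, 42]
for hit_id in hit_ids:
    raw = store.get("{}##movie_info".format(hit_id))
    info = json.loads(raw.decode("utf-8")) if raw is not None else None
    print(hit_id, info)
```

A dictionary like this disappears when the process exits and cannot be shared between services, which is the main reason to prefer a separate store such as Redis.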

### Confirm Running Servers

In [None]:
! docker ps

### Downloading Data

In [58]:
import pandas as pd
import urllib.parse
from pymongo import MongoClient

def _connect_mongo(host, port, username, password, db):
    """Connect to MongoDB and return a handle to the database `db`."""

    if username and password:
        # Percent-encode the credentials so characters such as '@' do not break the URI
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s?authSource=%s&readPreference=primary&ssl=false' % (
            urllib.parse.quote_plus(username), urllib.parse.quote_plus(password),
            host, port, db, 'nsl_bet_db_qa3')
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]
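
The password is percent-encoded before it goes into the connection string because characters such as `@`, `:` and `/` would otherwise be read as URI delimiters. The sketch below shows the effect with a hypothetical `build_mongo_uri` helper and placeholder values; none of these names or credentials come from the project.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, username, password, db, auth_source):
    """Hypothetical helper: build a MongoDB URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/{}?authSource={}".format(
        quote_plus(username), quote_plus(password), host, port, db, auth_source
    )


# Placeholder values, not real credentials
uri = build_mongo_uri("localhost", 27017, "user", "p@ss:word/1", "mydb", "admin")
print(uri)
```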


In [59]:
def read_mongo(db, collection, query=None, host='10.220.98.254', port=27017, username='bet', password='bet@123', no_id=False):
    """Run an aggregation pipeline against a collection and return the result as a DataFrame."""

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Run the aggregation pipeline (an empty pipeline returns every document)
    cursor = db[collection].aggregate(query or [])

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Drop the _id column if requested and present
    if no_id and '_id' in df.columns:
        del df['_id']

    return df

In [70]:
gsi_pipeline = [
    {
        "$project": {"nb": "$$ROOT", "_id": 0}
    },
    {
        "$lookup": {
            "localField": "nb.gsiList.id",
            "from": "nsl_gsi",
            "foreignField": "id",
            "as": "ng"
        }
    },
    {
        "$unwind": {
            "path": "$ng",
            "preserveNullAndEmptyArrays": False
        }
    },
    {
        "$lookup": {
            "localField": "ng.solutionLogic.referencedChangeUnit",
            "from": "nsl_change_unit",
            "foreignField": "id",
            "as": "cu"
        }
    },
    {
        "$unwind": {
            "path": "$cu",
            "preserveNullAndEmptyArrays": False
        }
    },
    {
        "$addFields": {
            "bookId": "$nb.displayName",
            "gsiId": "$ng.displayName",
            "bookName": "$nb.displayName",
            "gsiName": "$ng.displayName",
            "cuName": "$cu.displayName"
        }
    },
    {
        "$match": {
            "$and": [
                {
                    "nb.tenantId": {
                        "$in": [
                            "ProjectCarnivals",
                            "Banking",
                            "projectmanagement",
                            "Brane-Finance",
                            "FinanceSolBrane",
                            "AILABS",
                            "BRF2008",
                            "Finance2008",
                            "FamilyApp2008",
                            "Learning2008",
                            "GRC",
                            "Insurance",
                            "Healthcare",
                            "SupplyChain",
                            "Pharma",
                            "CustomerSuccess"]
                    }
                },
                {"nb.displayName": {"$regex": "^((?!test).)*$", "$options": "i"}},
                {"nb.displayName": {"$regex": "^((?!book).)*$", "$options": "i"}},
                {"ng.displayName": {"$regex": "^((?!test).)*$", "$options": "i"}}
            ]
        }
    },
    {
        "$group": {
            "_id": {"bookId": "$bookId", "gsiId": "$gsiId"},
            "addToSet(cu_name)": {"$addToSet": "$cu.displayName"}
        }
    },
    {
        # After $group only _id and the accumulated set remain, so the
        # name fields from earlier stages cannot be projected here
        "$project": {
            "bookId": "$_id.bookId",
            "gsiId": "$_id.gsiId",
            "cuId": "$addToSet(cu_name)",
            "_id": 0
        }
    }
]
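
The `$match` and `$group` stages can be hard to read at a glance. As a rough pure-Python illustration (not a replacement for the pipeline), the sketch below applies the same "does not contain `test`" regular expression and then collects the distinct CU names per (book, GSI) pair, which is what `$addToSet` does. The rows are made up, and the tenant filter and the `book` pattern are left out for brevity.

```python
import re
from collections import defaultdict

# Same pattern as in the pipeline: matches only strings that do not contain "test"
NOT_TEST = re.compile(r"^((?!test).)*$", re.IGNORECASE)

# Made-up rows standing in for the joined book / GSI / CU documents
rows = [
    {"bookName": "Covid Relief App", "gsiName": "Emergency Helplines", "cuName": "View Helplines"},
    {"bookName": "Covid Relief App", "gsiName": "Emergency Helplines", "cuName": "Search Helplines"},
    {"bookName": "Covid Relief App", "gsiName": "Emergency Helplines", "cuName": "View Helplines"},
    {"bookName": "My Test Data", "gsiName": "Login", "cuName": "Enter Password"},
]

# Rough equivalent of $match followed by $group with $addToSet
grouped = defaultdict(set)
for row in rows:
    if NOT_TEST.match(row["bookName"]) and NOT_TEST.match(row["gsiName"]):
        grouped[(row["bookName"], row["gsiName"])].add(row["cuName"])

for (book, gsi), cu_names in grouped.items():
    print(book, "|", gsi, "|", sorted(cu_names))
```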

In [71]:
df = read_mongo(db='nsl_bet_db_soln',collection='nsl_book',query=gsi_pipeline)
df.to_pickle("data/cu.pickle")

In [83]:
df2 = pd.read_pickle("data/cu.pickle")
df2.head()

| | bookId | gsiId | cuId |
| --- | --- | --- | --- |
| 0 | Covid Relief App | Emergency Helplines for Covid | [View Helplines for Covid, CR_Helpline_Search] |
| 1 | Kids | New Solution Building | [Quiz Category, Login ] |
| 2 | cus_999 | Policy Renewal | [Display Renewal Amount, Grace Period Renewal,... |
| 3 | newcarriermaster1 | New Carrier Master | [Send Email Reserved CU, search reserved cu, R... |
| 4 | Governance Risk and Compliance | SearchAuditWebex | [SearchWebex, Displaycu, Auditee head option, ... |


## Code Overview


### Importing Movies into Milvus

#### 1. Connecting to Milvus and Redis
Both servers run as Docker containers on localhost with their default ports.

In [None]:
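from pymilvus import connections, utility, FieldSchema, CollectionSchema, DataType, Collection
import redis

# Connect to Milvus and Redis on their default ports
connections.connect("default", host="localhost", port="19530")
r = redis.StrictRedis(host="localhost", port=6379)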

#### 2. Loading Movies into Redis
We begin by loading all the movie records into Redis.

In [None]:
import json
import codecs

# Example record: 1::Toy Story (1995)::Animation|Children's|Comedy
def process_movie(lines, redis_cli):
    for line in lines:
        # Skip blank lines
        if len(line.strip()) == 0:
            continue
        # Each record has the form MovieID::Title::Genres
        fields = line.strip().split("::")
        movie_id, title, genre_group = fields[0], fields[1], fields[2]
        # Genres are pipe-separated
        genre = genre_group.strip().split("|")
        movie_info = {"movie_id": movie_id, "title": title, "genre": genre}
        # Key format: <movie_id>##movie_info
        redis_cli.set("{}##movie_info".format(movie_id), json.dumps(movie_info))

with codecs.open("movie_recommender/movies.dat", "r", encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()
    process_movie(lines, r)
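
To make the record format concrete, the sketch below parses the sample line from the comment above with a hypothetical `parse_movie_line` helper. It mirrors the splitting logic of `process_movie` but leaves out the Redis write so it can be run on its own.

```python
def parse_movie_line(line):
    """Hypothetical helper: split one `MovieID::Title::Genres` record into a dict."""
    movie_id, title, genre_group = line.strip().split("::")
    return {"movie_id": movie_id, "title": title, "genre": genre_group.split("|")}


sample = "1::Toy Story (1995)::Animation|Children's|Comedy\n"
movie_info = parse_movie_line(sample)
print(movie_info)
```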

#### 3. Creating Partition and Collection in Milvus

In [None]:
COLLECTION_NAME = 'demo_films'
PARTITION_NAME = 'Movie'

pk = FieldSchema(name='pk', dtype=DataType.INT64, is_primary=True, auto_id=False)
field = FieldSchema(name='vec', dtype=DataType.FLOAT_VECTOR, dim=32)
schema = CollectionSchema(fields=[pk, field], description="movie recommender: demo films")

# Drop any collection with the same name left over from a previous run
if utility.get_connection().has_collection(COLLECTION_NAME):
    Collection(COLLECTION_NAME).drop()

# Always create a fresh collection and partition
collection = Collection(name=COLLECTION_NAME, schema=schema)
partition = collection.create_partition(PARTITION_NAME)
print("Collection & partition are successfully created.")

#### 4. Getting Embeddings and IDs
The vectors in `movie_vectors.txt` were generated by the `user_vector_model`, so we can get the vectors and their IDs directly by reading the file.

In [None]:
def get_vectors():
    with codecs.open("movie_recommender/movie_vectors.txt", "r", encoding='utf-8', errors='ignore') as f:
        # Skip blank lines so a trailing newline does not break parsing
        lines = [line for line in f.readlines() if line.strip()]
    ids = []
    embeddings = []
    for line in lines:
        parts = line.strip().split(":")
        # The ID comes before the colon
        ids.append(int(parts[0]))
        # The vector follows the colon; [1:-1] drops its first and last character
        str_nums = parts[1][1:-1].split(",")
        embeddings.append([float(x) for x in str_nums])
    return ids, embeddings

ids, embeddings = get_vectors()
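
The slicing in `get_vectors` implies that each line looks like `id:[v1,v2,...]`. That format is an assumption read off the code, since the file itself is not shown here. Under that assumption, a single line can be parsed as follows, using a hypothetical `parse_vector_line` helper and a made-up three-dimensional vector (the collection above expects 32 dimensions).

```python
def parse_vector_line(line):
    """Hypothetical helper: parse one `id:[v1,v2,...]` record (format assumed)."""
    id_part, vec_part = line.strip().split(":", 1)
    # Drop the surrounding brackets before splitting on commas
    values = [float(x) for x in vec_part[1:-1].split(",")]
    return int(id_part), values


# Made-up example line
movie_id, embedding = parse_vector_line("7:[0.5,-1.25,3.0]\n")
print(movie_id, embedding)
```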

#### 5. Importing Vectors into Milvus
Import vectors into the partition **Movie** under the collection **demo_films**.

In [None]:
if collection.num_entities != 0:
    print(COLLECTION_NAME + " is not empty!")  
else:
    mr = collection.insert(data=[ids,embeddings], partition_name=PARTITION_NAME)

print("Record count in collection: " + str(collection.num_entities))
# print(str(len(mr.primary_keys)) + " ids:\n", mr.primary_keys[:10])

### Build Index

In [None]:
# Flush collection with inserted vectors to disk
# utility.get_connection().flush([COLLECTION_NAME])

index_param = {
    "metric_type": "L2",
    "index_type":"IVF_FLAT",
    "params":{"nlist":128}
}

collection.create_index(field_name="vec", index_params=index_param)

### Recalling Vectors in Milvus
#### 1. Generating User Embeddings
Pass in the gender, age, and occupation of the user we want to make recommendations for. The **user_vector_model** will generate the corresponding user vector.
Occupation is chosen from the following choices:
*  0:  "other" or not specified
*  1:  "academic/educator"
*  2:  "artist"
*  3:  "clerical/admin"
*  4:  "college/grad student"
*  5:  "customer service"
*  6:  "doctor/health care"
*  7:  "executive/managerial"
*  8:  "farmer"
*  9:  "homemaker"
*  10:  "K-12 student"
*  11:  "lawyer"
*  12:  "programmer"
*  13:  "retired"
*  14:  "sales/marketing"
*  15:  "scientist"
*  16:  "self-employed"
*  17:  "technician/engineer"
*  18:  "tradesman/craftsman"
*  19:  "unemployed"
*  20:  "writer"

In [None]:
import numpy as np
from paddle_serving_app.local_predict import LocalPredictor

class RecallServerServicer(object):
    def __init__(self):
        self.uv_client = LocalPredictor()
        self.uv_client.load_model_config("movie_recommender/user_vector_model/serving_server_dir") 
        
    def hash2(self, a):
        return hash(a) % 1000000

    def get_user_vector(self):
        dic = {"userid": [], "gender": [], "age": [], "occupation": []}
        lod = [0]
        dic["userid"].append(self.hash2('0'))
        dic["gender"].append(self.hash2('M'))
        dic["age"].append(self.hash2('23'))
        dic["occupation"].append(self.hash2('6'))
        lod.append(1)

        dic["userid.lod"] = lod
        dic["gender.lod"] = lod
        dic["age.lod"] = lod
        dic["occupation.lod"] = lod
        for key in dic:
            dic[key] = np.array(dic[key]).astype(np.int64).reshape(len(dic[key]),1)
        fetch_map = self.uv_client.predict(feed=dic, fetch=["save_infer_model/scale_0.tmp_1"], batch=True)
        return fetch_map["save_infer_model/scale_0.tmp_1"].tolist()[0]

recall = RecallServerServicer()
user_vector = recall.get_user_vector()
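
One detail worth knowing about `hash2`: it relies on Python's built-in `hash()`, and in Python 3 string hashes are salted per interpreter process unless `PYTHONHASHSEED` is set. The same gender, age, and occupation can therefore map to different feature IDs in different runs. Whether that matters depends on how the model was trained, which this notebook does not state. If reproducible IDs are needed, one option is a digest-based hash such as the hypothetical `stable_hash` below. This is an assumption for illustration, not the hashing the model was built with, and its IDs are only meaningful for a model trained with the same scheme.

```python
import hashlib


def stable_hash(value, buckets=1000000):
    """Hypothetical replacement for hash2: deterministic across processes."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


# The same feature values used in get_user_vector
for feature in ["0", "M", "23", "6"]:
    print(feature, stable_hash(feature))
```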

#### 2. Searching
Pass in the user vector, and then recall vectors in the previously imported data collection and partition.

In [None]:
collection.load() # load the collection into memory before searching

topK = 20
SEARCH_PARAM = {
    "metric_type":"L2",
    "params":{"nprobe": 20},
    }
results = collection.search([user_vector],"vec",param=SEARCH_PARAM, limit=topK, expr=None, output_fields=None)
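
The `nlist` value used when building the index and the `nprobe` value used here work together: an IVF_FLAT index divides the vectors into `nlist` clusters, and a search scans only the `nprobe` clusters whose centroids are closest to the query. The toy sketch below illustrates that idea in plain Python. It is not Milvus's implementation: the centroids are hard-coded rather than learned, the data is made up, and all function names are hypothetical.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_inverted_lists(vectors, centroids):
    """Assign each vector ID to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2_squared(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, top_k):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2_squared(query, centroids[i]))[:nprobe]
    candidates = [vec_id for i in probe for vec_id in lists[i]]
    return sorted(candidates, key=lambda vec_id: l2_squared(query, vectors[vec_id]))[:top_k]


# Made-up 2-D data; centroids are fixed here instead of being learned (nlist = 2)
vectors = {1: [0.0, 0.1], 2: [0.2, 0.0], 3: [5.0, 5.1], 4: [5.2, 4.9]}
centroids = [[0.0, 0.0], [5.0, 5.0]]
lists = build_inverted_lists(vectors, centroids)

print(ivf_search([0.1, 0.1], vectors, centroids, lists, nprobe=1, top_k=2))
```

A larger `nprobe` scans more clusters, which generally improves recall at the cost of search time.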

#### 3. Returning Information by IDs

In [None]:
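# Flatten the hit IDs from every query result
hit_ids = []
for hits in results:
    for hit_id in hits.ids:
        hit_ids.append(hit_id)

# Look up the stored metadata for each hit
recall_results = []
for hit_id in hit_ids:
    raw = r.get("{}##movie_info".format(hit_id))
    if raw is not None:  # skip IDs that have no metadata in Redis
        recall_results.append(raw.decode('utf-8'))
recall_results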

## Conclusion

After completing the recall service, the results can be further sorted using the **movie_recommender** model, and then the movies with high similarity scores can be recommended to users. You can try this deployable recommender system using this [quick start](QUICK_START.md).