Workflow:
- Load the daily scrape
- Clean data
    - Subset features of interest
    - (Optional) Subset by subreddit

Note: The loading and cleaning steps may look different depending on the final data store in the ETL pipeline (both could easily be done in SQL)
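
The cleaning steps above could be sketched roughly as follows with pandas. The column names (`track`, `artist`, `subreddit`, `liveness`, `valence`, `tempo`) come from the scrape shown later in this notebook, but the `FEATURES` list and the `subset_posts` helper are hypothetical choices for illustration, not a settled design.

```python
import pandas as pd

# Hypothetical choice of audio features; adjust to whatever the cleaned scrape contains.
FEATURES = ["liveness", "valence", "tempo"]


def subset_posts(df, features=FEATURES, subreddit=None):
    """Keep identifying columns plus the chosen features, optionally for one subreddit."""
    if subreddit is not None:
        df = df[df["subreddit"] == subreddit]
    keep = ["track", "artist", "subreddit", *features]
    # Drop rows where any chosen feature is missing, since they cannot be clustered.
    return df[keep].dropna(subset=list(features)).reset_index(drop=True)
```

Calling `subset_posts(df, subreddit="indieheads")` would restrict the frame to a single subreddit, while `subset_posts(df)` keeps every subreddit.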


- Preprocess the data
    - Normalize/scale the data
    - Engineer features if applicable
- Train model 
    - Explore models
    - Evaluation metrics
        - Need quantitative measures (e.g. the silhouette coefficient is a good fit; the elbow method is a poor fit because it relies on visual inspection)
        - Automate finding the optimal number of clusters during training
- Export the final trained model
    - This can then be served to users via Flask

- If models are insufficient:
    - Explore more data sources
    - Research other ways of generating similarity metrics
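
The preprocessing, training, and evaluation steps in the workflow above could look something like the sketch below. It assumes scikit-learn and uses KMeans purely as a stand-in, since the notes do not commit to a clustering algorithm. The `pick_num_clusters` helper and the candidate range of 2 to 6 clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def pick_num_clusters(X, k_values=range(2, 7), random_state=0):
    """Scale X, fit KMeans for each k, and keep the k with the highest silhouette score.

    Returns (best_k, scores_by_k, best_model, fitted_scaler).
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    scores = {}
    best_k, best_model = None, None
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(X_scaled, labels)
        if best_k is None or scores[k] > scores[best_k]:
            best_k, best_model = k, model
    return best_k, scores, best_model, scaler
```

The returned model and scaler are the objects that would need to be serialized for the export step (for example with `joblib.dump`), because new inputs have to be scaled the same way before they are assigned to a cluster.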


Brainstorming/Notes to self:
- Keep in mind what input users can provide and how the model will turn that input into valuable recommendations
- If clusters become too large, too many recommendations will be generated
- How can we filter these down as the data grows?
    - Initial thought: factor in date posted (relevance) along with popularity (upvotes)
        - This could end up selecting only popular songs, which may already be recommended by Spotify
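
One possible way to combine recency and popularity when trimming a large cluster is sketched below. It is only an illustration: the exponential half-life decay, the log dampening of upvotes, and the `rank_recommendations` helper are all assumptions rather than a chosen approach. Only the `ups` and `created` (Unix seconds) field names come from the scrape.

```python
import math
import time


def rank_recommendations(posts, top_n=10, half_life_days=30.0, now=None):
    """Return the top_n posts ranked by upvotes discounted by age.

    Each post is a dict with `ups` (int) and `created` (Unix seconds).
    """
    now = time.time() if now is None else now

    def score(post):
        age_days = max(0.0, (now - post["created"]) / 86400.0)
        # log1p dampens very popular posts so they do not crowd out everything else;
        # the score halves every `half_life_days`.
        return math.log1p(max(post["ups"], 0)) * 0.5 ** (age_days / half_life_days)

    return sorted(posts, key=score, reverse=True)[:top_n]
```

Dampening the upvote count is one way to soften the concern above about surfacing only songs that are already popular, though whether it helps would need to be checked against real data.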

In [2]:
import json
import pandas as pd

In [None]:
import mlflow


mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("spot-it-recommendation")

In [6]:
file_path = "../etl/temp_data/2022-09-03_clean_features.json"

In [7]:
# Load the JSON file
with open(file_path) as f:
    data = json.load(f)

In [11]:
# Convert to a DataFrame
df = pd.DataFrame(data["data"])

In [12]:
df.head()

Unnamed: 0,name,title,num_comments,ups,upvote_ratio,created,url,subreddit,track,artist,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,t3_x4s4zy,[FRESH] Loony - First Thing Smokin’,1,8,0.8,1662204000.0,https://youtube.com/watch?v=zvfOTx8T2XI&amp;fe...,indieheads,First Thing Smokin’,Loony,...,0.116,0.572,74.827,audio_features,1gEI1aE97pEFA8RfE3BWmA,spotify:track:1gEI1aE97pEFA8RfE3BWmA,https://api.spotify.com/v1/tracks/1gEI1aE97pEF...,https://api.spotify.com/v1/audio-analysis/1gEI...,223926,4
1,t3_x4rvcz,[FRESH] Sheenah Ko - Eyes of the Ego (radio edit),0,5,0.7,1662203000.0,https://open.spotify.com/track/3Zt8xTAoeHjSvkO...,indieheads,Eyes of the Ego (radio edit),Sheenah Ko,...,0.589,0.156,120.113,audio_features,3Zt8xTAoeHjSvkOhm3G6UR,spotify:track:3Zt8xTAoeHjSvkOhm3G6UR,https://api.spotify.com/v1/tracks/3Zt8xTAoeHjS...,https://api.spotify.com/v1/audio-analysis/3Zt8...,228000,4
2,t3_x4rtyj,[FRESH] Bec Sandridge - The Jetty (ft. Andy Bull),0,2,0.75,1662203000.0,https://open.spotify.com/track/5JB7zQ4CIjU0Byy...,indieheads,The Jetty (ft. Andy Bull),Bec Sandridge,...,0.241,0.721,117.061,audio_features,5JB7zQ4CIjU0ByyFhx3yOF,spotify:track:5JB7zQ4CIjU0ByyFhx3yOF,https://api.spotify.com/v1/tracks/5JB7zQ4CIjU0...,https://api.spotify.com/v1/audio-analysis/5JB7...,224853,4
3,t3_x4covr,[FRESH] Peach Tinted - Handsome / RIP,0,3,0.72,1662153000.0,https://open.spotify.com/album/57M0muPF80DyMfz...,indieheads,Handsome / RIP,Peach Tinted,...,0.0905,0.896,77.464,audio_features,4VwXaLB6ILcbJBEWgEw4eO,spotify:track:4VwXaLB6ILcbJBEWgEw4eO,https://api.spotify.com/v1/tracks/4VwXaLB6ILcb...,https://api.spotify.com/v1/audio-analysis/4VwX...,135476,4
4,t3_x48xk1,[FRESH] Nisa - Affection,0,2,0.63,1662144000.0,https://youtu.be/-bu9UrWjO1E,indieheads,Affection,Nisa,...,0.338,0.743,125.969,audio_features,0D7L4cQv5nPfMqcHf90s8X,spotify:track:0D7L4cQv5nPfMqcHf90s8X,https://api.spotify.com/v1/tracks/0D7L4cQv5nPf...,https://api.spotify.com/v1/audio-analysis/0D7L...,212659,4


In [None]:
with mlflow.start_run():

    mlflow.set_tag("tag-name", "tag-value")  # placeholder tag

    mlflow.log_param("train-data-path", file_path)

    clusters = 2
    mlflow.log_param("num-clusters", clusters)

    # TODO: fit the clustering model here and compute the silhouette coefficient,
    # e.g. metric = silhouette_score(X, labels) from sklearn.metrics
    # mlflow.log_metric("sc", metric)  # enable once `metric` is defined

    # mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models_pickle")