# Pre-Execution

## Things needed 
- Download the dataset from [here]()
- Add the path to the 'ARTIFACTS_PATH' variable in the code below or in the .env file

- This project comes with an up-to-date Pipfile. Install the dependencies with the command below:
```bash
pipenv install
```
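
Before running the cells below, it can help to confirm that `ARTIFACTS_PATH` points at a folder with the expected layout. The sketch below is a minimal check using only the standard library. The helper name `missing_artifacts` is hypothetical, the relative paths are taken from the loading cell later in this notebook, and it assumes the variable is already present in the environment (for example, after the `.env` file has been loaded).

```python
import os

# Relative paths this notebook reads from ARTIFACTS_PATH (see the loading cell).
EXPECTED = [
    os.path.join("removed_methods", "removed_methods.pkl"),
    os.path.join("validation_data", "clean_val.csv"),
    "snapshot_embeddings",
]


def missing_artifacts(artifacts_path):
    """Return the expected entries that do not exist under artifacts_path."""
    return [
        rel for rel in EXPECTED
        if not os.path.exists(os.path.join(artifacts_path, rel))
    ]


# Hypothetical usage: read the path from the environment, defaulting to ".".
artifacts_path = os.environ.get("ARTIFACTS_PATH", ".")
print("Missing:", missing_artifacts(artifacts_path))
```

An empty list means all three entries were found; anything else names what still needs to be downloaded or moved.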





In [1]:
import os

import pandas as pd  # imported explicitly because the loading cell uses pd
from src.utils import *
from src.libshift_search import LibshiftSearch
import dotenv
from src.config import Config
from src.db_handler import DBHandler
config = Config(dev_mode=False)



In [2]:
OUTPUT_PATH = 'output'
# Create the output folder if it does not exist yet.
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [3]:
'''
This work focuses on finding potential replacements for deprecated API methods across the following libraries:
- pydantic
- scipy
- pandas
- sqlalchemy
- numpy
- pytorch
'''
LIBS = ['pydantic', 'scipy', 'pandas', 'sqlalchemy', 'numpy', 'pytorch']
FEATURES = ["name", "code", "docstring", "nodoc"]
TOPKs = [1, 3, 5, 7, 10, 15]


This example is a quick walkthrough of the flow, so it uses a fixed best configuration that maps each feature type to an embedding model. The full search code will be available in the notebook `02_grid_search_results.ipynb`.

In [5]:
best_model_config = {
    'name': 'ibm-granite/granite-embedding-125m-english',
    'code': 'w601sxs/b1ade-embed',
    'doc': 'ibm-granite/granite-embedding-125m-english',
    'nodoc': 'avsolatorio/GIST-large-Embedding-v0'
}
    
model_cols = filter_read_cols(best_model_config)
filter_cols = ['id'] + FEATURES + model_cols
filter_cols

['id',
 'name',
 'code',
 'docstring',
 'nodoc',
 'name_ibm-granite_granite-embedding-125m-english',
 'code_w601sxs_b1ade-embed',
 'doc_ibm-granite_granite-embedding-125m-english',
 'nodoc_avsolatorio_GIST-large-Embedding-v0']
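
The embedding column names in the output above appear to follow the pattern `<feature>_<model id>`, with the `/` in the model id replaced by `_`. The sketch below reproduces that pattern. `embedding_columns` is a hypothetical stand-in; it is not necessarily how `filter_read_cols` in `src.utils` is implemented.

```python
def embedding_columns(model_config):
    """Build one column name per (feature, model) pair."""
    return [
        f"{feature}_{model.replace('/', '_')}"
        for feature, model in model_config.items()
    ]


best_model_config = {
    'name': 'ibm-granite/granite-embedding-125m-english',
    'code': 'w601sxs/b1ade-embed',
    'doc': 'ibm-granite/granite-embedding-125m-english',
    'nodoc': 'avsolatorio/GIST-large-Embedding-v0',
}

for column in embedding_columns(best_model_config):
    print(column)
```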

In [6]:
removed_df = pd.read_pickle(f"{config.ARTIFACTS_PATH}/removed_methods/removed_methods.pkl")
val_df = pd.read_csv(f"{config.ARTIFACTS_PATH}/validation_data/clean_val.csv")
folder = f"{config.ARTIFACTS_PATH}/snapshot_embeddings/"
# The PyTorch snapshot is rather large, so loading it takes a lot of space and time.
snapshot_dict = get_snapshot_dict(folder, LIBS)


Loading snapshots: 100%|██████████| 6/6 [05:09<00:00, 51.61s/repo]

Loaded 6 snapshots from /Volumes/AnushHD/libshiftartifacts//snapshot_embeddings/





In [7]:
db = DBHandler(config)
search = LibshiftSearch(
    model_dict=best_model_config,
    removed_df=removed_df,
    snapshot_dictionary=snapshot_dict,
    validation_df=val_df,
    features=FEATURES,
    db_handler=db,
    top_ks=TOPKs,
    )

Cleaning SQLite lock files in: /Volumes/AnushHD/libshiftartifacts/similarity_cache.db


In [8]:
search_data, results, match_json, combined_hits_df = search.controller('cosine')
db.close()
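
The `'cosine'` mode presumably ranks candidate methods by the cosine similarity between their embeddings and the embedding of a removed method. The sketch below illustrates that idea with toy 2-D vectors in pure Python. The function names and the method names in the data are made up for illustration, and the actual logic inside `LibshiftSearch` may differ (a real implementation would likely use vectorized array operations).

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k):
    """Return the k candidate names most similar to the query, best first."""
    scored = (
        (cosine_similarity(query, vec), name)
        for name, vec in candidates.items()
    )
    return [name for _, name in heapq.nlargest(k, scored)]


# Toy embeddings with made-up method names, for illustration only.
candidates = {
    "new_method": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
    "similar_helper": [0.8, 0.6],
}
removed_method = [1.0, 0.0]
print(top_k(removed_method, candidates, k=2))
```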

In [10]:
combined_hits_df

| Unnamed: 0 | Combined Top-k | Correct Replacements |
|---|---|---|
| 0 | 1 | 29 |
| 1 | 3 | 62 |
| 2 | 5 | 75 |
| 3 | 7 | 77 |
| 4 | 10 | 84 |
| 5 | 15 | 84 |
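
One way to read these numbers: each row counts the validation cases whose known replacement appears within the first k ranked candidates, so the count can only grow or stay flat as k increases (here it plateaus at 84 from k = 10). The sketch below shows that counting under the assumption that each case has a 1-based rank for its correct replacement, or `None` if it was never retrieved. The ranks are made up, `hits_at_k` is a hypothetical helper, and the aggregation inside `LibshiftSearch` may differ.

```python
def hits_at_k(ranks, top_ks):
    """Count, for each k, how many ranks fall within the top k."""
    return {
        k: sum(1 for r in ranks if r is not None and r <= k)
        for k in top_ks
    }


# Made-up ranks for six validation cases; None means "not retrieved".
ranks = [1, 2, 4, 9, None, 1]
print(hits_at_k(ranks, [1, 3, 5, 7, 10, 15]))
```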


In [None]:
# for mode in ["cosine_soft", "dot", "angular", "euclidean", "cosine"]:
#     search_data, results, match_json, combined_hits_df = search.controller(mode)
#     db.close()
#     output_path = os.path.join(OUTPUT_PATH, f"results_{mode}.csv")
#     results.to_csv(output_path, index=False)
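
The mode names in the commented loop suggest standard vector similarity and distance measures. For reference, the sketch below gives textbook definitions of four of them in pure Python. `cosine_soft` is specific to this project and is not covered, and the project's own implementations may scale or normalize differently.

```python
import math


def dot(a, b):
    """Inner product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Inner product of the normalized vectors; assumes non-zero inputs."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def angular(a, b):
    """Angle between the vectors in radians, in [0, pi]."""
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine(a, b))))


a, b = [3.0, 4.0], [4.0, 3.0]
for fn in (dot, cosine, angular, euclidean):
    print(fn.__name__, round(fn(a, b), 4))
```

Note that `dot` and `cosine` are similarities (higher is better) while `angular` and `euclidean` are distances (lower is better), so a ranking step has to sort in the matching direction.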