<h1> <center> Table of Contents </center> </h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#1.-Required-Libraries"> Required Libraries</a></li>
        <li><a href="#2.-Importing-the-Data"> Importing the Data</a></li>
        <li><a href="#3.-Creating-Corpus"> Creating Corpus</a></li>
        <li><a href="#4.-Using-Fuzzywuzzy-to-get-best-matched-dashboards"> Using Fuzzywuzzy to get best matched dashboards</a></li>
    </ol>
</div>

# 1. Required Libraries 

## pip installs

In [68]:
#!pip install python-Levenshtein
#!pip install "fuzzywuzzy==0.18.0"
#!pip install rapidfuzz
#!pip install spacy
#!pip install gensim
#!pip install rank_bm25
#!python -m spacy download en_core_web_lg
#!pip install fast-autocomplete

## Importing Libraries

In [69]:
import pandas as pd
from pandas.core.common import flatten
import numpy as np

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

import rapidfuzz
# NOTE: this import rebinds `process`, shadowing fuzzywuzzy's `process` imported above
from rapidfuzz import process, utils

import timeit
import time

import warnings
warnings.filterwarnings('ignore')

# 2. Importing the Data

In [70]:
# importing the files
dfdashboards = pd.read_csv('microstrategy_and_dashboard.csv')
dfmetrics = pd.read_csv('tableau_metric.csv')
dfuserquey = pd.read_csv('sample_user_searchs.csv')

In [71]:
def preprocess_datafiles(dashboard, metric):
    """Build a long table with one row per (dashboard, metric, descriptor)."""

    # convert all column values to lowercase strings
    dashboard = dashboard.apply(lambda x: x.astype(str).str.lower())
    metric = metric.apply(lambda x: x.astype(str).str.lower())

    # split the pipe-delimited caption and descriptors columns into lists
    # (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
    dashboard = dashboard[['dashboards', 'caption', 'Global_usage']].copy()
    dashboard['metric_names'] = dashboard.caption.str.split('|')
    metric['descriptors'] = metric.descriptors.str.split('|')

    # expand each list so that every metric name / descriptor gets its own row
    temp1 = (dashboard[['dashboards', 'metric_names']]
             .explode('metric_names')
             .rename(columns={'metric_names': 'caption'}))
    temp2 = (metric[['metric_name', 'descriptors']]
             .explode('descriptors')
             .rename(columns={'descriptors': 'descriptor'}))

    # join dashboards to descriptors through the shared metric name
    master = temp1.merge(temp2, how='inner', left_on='caption', right_on='metric_name')[['dashboards', 'metric_name', 'descriptor']]
    master = master.rename(columns={"dashboards": "dashboard_names", "metric_name": "metric_names", "descriptor": "descriptor_names"})

    return master



In [72]:
# Final Dataframe
master = preprocess_datafiles(dfdashboards,dfmetrics)
master.head()

| | dashboard_names | metric_names | descriptor_names |
|---|---|---|---|
| 0 | ad analysis | beer style level value | macro style value |
| 1 | ad analysis | beer style level value | mezzo style value |
| 2 | ad analysis | beer style level value | micro style valu |
| 3 | ad analysis | beer style level value | beer style value |
| 4 | ad analysis | beer style level value | altbier |


# 3. Creating Corpus

In [73]:
dashboard_names_list = list(master.dashboard_names.unique())
metric_names_list = list(master.metric_names.unique())
descriptor_names_list = list(master.descriptor_names.unique())
mastercorpus = dashboard_names_list + metric_names_list + descriptor_names_list

mastercorpus = list(filter(None,mastercorpus))

mastercorpus[:5]

['ad analysis',
 'ad recap',
 'aggregate sales per pt vs. cwd',
 'brands on ad',
 'brewery comparisons']

# 4. Using Fuzzywuzzy to get best matched dashboards

FuzzyWuzzy is a Python library for fuzzy string matching, that is, for finding strings that approximately match a given pattern rather than matching it exactly. It scores the similarity of two strings using the <b>Levenshtein Distance</b>, which counts the single-character edits needed to turn one sequence into the other.
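
To make the idea of edit distance concrete, here is a minimal pure-Python sketch of the classic Levenshtein distance. The function name `levenshtein` is introduced only for illustration; this is not FuzzyWuzzy's internal scoring code, which may normalise and weight edits differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # keep the shorter string in the inner loop to use less memory
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A distance of 0 means the strings are identical; larger values mean more edits are needed.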


In [74]:
# Logic to get the dashboards

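def dashboard_names_suggestion(master, suggestions):
    """Map each suggested term to the dashboards it belongs to.

    Relies on the global dashboard_names_list and metric_names_list from section 3.
    """
    dashboard_names = []

    for i in suggestions:

        if i in dashboard_names_list:
            # the suggestion is itself a dashboard name
            dashboard_names.append(i)

        elif i in metric_names_list:
            # the suggestion is a metric: collect every dashboard that uses it
            # (values in master are already lowercase, so no further lowering is needed)
            dashboard_names.extend(master.loc[master.metric_names == i, 'dashboard_names'].to_list())

        else:
            # otherwise treat the suggestion as a descriptor
            dashboard_names.extend(master.loc[master.descriptor_names == i, 'dashboard_names'].to_list())

    return dashboard_names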
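
The lookup above filters the `master` DataFrame once per suggestion. As a rough sketch of the same precedence (dashboard name first, then metric, then descriptor), the mapping could also be precomputed into plain dictionaries. The sample `rows`, the helper name `lookup`, and the index names below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# hypothetical (dashboard, metric, descriptor) rows standing in for `master`
rows = [
    ("ad analysis", "beer style level value", "altbier"),
    ("style analysis", "beer style level value", "amber ale"),
    ("market share", "dollar sales", "amber ale"),
]

# build the three lookups once, up front
dashboards = {dashboard for dashboard, _, _ in rows}
by_metric = defaultdict(list)
by_descriptor = defaultdict(list)
for dashboard, metric, descriptor in rows:
    by_metric[metric].append(dashboard)
    by_descriptor[descriptor].append(dashboard)


def lookup(terms):
    """Return the dashboards reached by each term, without duplicates."""
    found = []
    for term in terms:
        if term in dashboards:
            found.append(term)                          # already a dashboard
        elif term in by_metric:
            found.extend(by_metric[term])               # dashboards using the metric
        else:
            found.extend(by_descriptor.get(term, []))   # dashboards using the descriptor
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(found))


print(lookup(["dollar sales", "amber ale"]))
```

Each term then costs a dictionary lookup rather than a scan of the table, which may matter if the corpus grows.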

In [75]:
token_set_ratio_sugg = {}

def fuzzywuzzy_scorers_suggestions(usersearch, corpus):
    """Score every corpus entry against the user search with token_set_ratio."""

    # drop scores left over from a previous query or corpus
    token_set_ratio_sugg.clear()

    for token in corpus:
        token_set_ratio_sugg[token] = fuzz.token_set_ratio(usersearch, token)


def fuzzywuzzy_scorers_similiarity(scorers, suggestions_count, sort_scorers):
    """Return the `suggestions_count` best-scoring rows of the score table."""

    # one row per corpus entry: the entry and its score
    scoreDf = pd.DataFrame(list(token_set_ratio_sugg.items()),
                           columns=['suggestion', 'token_set_ratio_sugg'])

    return scoreDf[scorers].sort_values(by=sort_scorers, ascending=False).head(suggestions_count)
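
To see why a short corpus entry can reach the top score for a longer query, here is a simplified sketch of the token-set idea: compare the shared tokens against each side's full token set and keep the best of the three comparisons. It uses the standard library's `difflib` as a stand-in similarity measure, so the scores are an approximation and will not always equal those of `fuzz.token_set_ratio`. The names `ratio` and `token_set_score` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (difflib-based stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_score(query: str, candidate: str) -> int:
    """Approximate token-set similarity between a query and a candidate."""
    q_tokens = set(query.lower().split())
    c_tokens = set(candidate.lower().split())

    # sorted, space-joined token groups
    shared = " ".join(sorted(q_tokens & c_tokens))
    only_q = " ".join(sorted(q_tokens - c_tokens))
    only_c = " ".join(sorted(c_tokens - q_tokens))

    # shared tokens followed by what is unique to each side
    with_q = f"{shared} {only_q}".strip()
    with_c = f"{shared} {only_c}".strip()

    # best of the three pairwise comparisons
    return max(ratio(shared, with_q), ratio(shared, with_c), ratio(with_q, with_c))


print(token_set_score("dollar sales amber ale", "dollar sales"))  # 100
```

Because every token of `dollar sales` also appears in the query, the shared tokens equal the candidate's full token set and the score is 100. Any corpus entry whose tokens are a subset of the query therefore ties for the top rank.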

In [83]:
input_query = input()

dollar sales amber ale


In [85]:
start = timeit.default_timer()

# columns to keep and the column to sort by (must be defined before the calls below)
fuzzyScorers = ['suggestion','token_set_ratio_sugg']
scorersSorting = ['token_set_ratio_sugg']

# score every corpus entry against the user query
fuzzywuzzy_scorers_suggestions(input_query.lower(), mastercorpus)

# keep the 10 best-scoring suggestions
scoreSortedDf = fuzzywuzzy_scorers_similiarity(scorers=fuzzyScorers, suggestions_count = 10,sort_scorers= scorersSorting)

# map the suggestions to the dashboards they belong to
fuzzywuzzyDashboardsList = list(dashboard_names_suggestion(master, scoreSortedDf['suggestion'][:10]))

# remove duplicate dashboards while preserving order, then keep the first 10
fuzzywuzzySearchOutput = list(dict.fromkeys(fuzzywuzzyDashboardsList))
allScorersOutput = [fuzzywuzzySearchOutput[:10]]

column_names = ['Dashboard Suggestions']
finalSuggestionDF = pd.DataFrame(allScorersOutput).transpose().set_axis(column_names, axis =1)

stop = timeit.default_timer()
execution_time = stop - start
print("Time taken by Method: "+str(execution_time))


print('User Search : ' + str(input_query))

print("Displaying the dashboards names by each scorer methods.")
display(finalSuggestionDF)

Time taken by Method: 1.7763754190000327
User Search : dollar sales amber ale
Displaying the dashboards names by each scorer methods.


| | Dashboard Suggestions |
|---|---|
| 0 | ad analysis |
| 1 | r geography over time |
| 2 | category and segment analysis |
| 3 | competitive set |
| 4 | dimensions over time |
| 5 | line geogs over time |
| 6 | market share |
| 7 | package analysis |
| 8 | rankers |
| 9 | style analysis |


In [82]:
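# Side-by-side view of the de-duplicated dashboards and the top suggestions.
# The two columns are not paired row by row: one suggestion can map to several dashboards.
pd.concat([finalSuggestionDF['Dashboard Suggestions'],scoreSortedDf['suggestion'].reset_index()], axis =1)[['Dashboard Suggestions','suggestion']]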

| | Dashboard Suggestions | suggestion |
|---|---|---|
| 0 | ad analysis | dollar sales |
| 1 | r geography over time | amber ale |
| 2 | category and segment analysis | dollar sales per pt |
| 3 | competitive set | dollar sales change ya |
| 4 | dimensions over time | dollar sales per pt change ya |
| 5 | line geogs over time | rank (dollar) |
| 6 | market share | dollar sales selections |
| 7 | package analysis | dollar sales per percent last year |
| 8 | rankers | amber lage |
| 9 | style analysis | avery india pale ale |
