# Cosine Similarity

This notebook covers the final step of our recommender system: we select which features of the coffee to compare, transform them, and implement three different recommender systems.

In [161]:
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

In [162]:
df = pd.read_csv('../data/coffee_clean.csv')
names_df = pd.read_csv('../data/coffee_id.csv')

## Feature-based Recommender

- Select which features to consider
- Scale the data
- Create a recommender based on their cosine similarity

In [172]:
df.shape

(4887, 27)

In [163]:
regions = ['region_africa_arabia', 'region_caribbean', 'region_central_america', 
           'region_hawaii', 'region_asia_pacific', 'region_south_america']
types = ['type_espresso', 'type_organic', 'type_fair_trade', 
         'type_decaffeinated', 'type_pod_capsule', 'type_blend', 'type_estate']
roasts = ['roast_dark', 'roast_light', 'roast_medium', 'roast_medium_dark',
       'roast_medium_light', 'roast_very_dark', 'roast_nan']

#select all features
features = ['aroma','acid_or_milk','body','flavor','type_with_milk'] + roasts + types + regions

In [164]:
def set_filter(filter_on = None, df=df):
    '''
    Returns a DataFrame of pairwise cosine distances between the coffees
    whose `filter_on` column equals 1. Prints a message and returns None
    if `filter_on` is not a column of df.
    '''
    try: 
        filtered = df[df[filter_on] == 1]
        slugs = filtered['slug']
        #create array of scaled data (the scaler is refit on the filtered subset)
        ss = StandardScaler()
        ss_fitted = ss.fit_transform(filtered[features])

        #calculate cosine distances (1 - cosine similarity) and create dataframe
        features_recommender = pairwise_distances(ss_fitted, metric='cosine')
        features_recommender_df = pd.DataFrame(features_recommender, index = slugs, columns = slugs)
        return features_recommender_df

    except KeyError as e:
        print(f"Sorry {e} is not a valid filter")
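To make the filtering step concrete, here is a minimal sketch on made-up data. The helper `filtered_distances`, the toy DataFrame, and its values are assumptions for illustration only; it mirrors the idea of `set_filter` but takes the feature columns as an argument instead of reading the global `features` list.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import pairwise_distances

# Toy stand-in for coffee_clean.csv; all values are made up
toy_df = pd.DataFrame({
    'slug': ['a', 'b', 'c', 'd'],
    'aroma': [8, 9, 7, 8],
    'body': [7, 8, 8, 9],
    'type_organic': [1, 0, 1, 1],
})
toy_features = ['aroma', 'body', 'type_organic']

def filtered_distances(data, filter_on, feature_cols):
    """Cosine distances among rows where `filter_on` == 1 (hypothetical helper)."""
    subset = data[data[filter_on] == 1]
    # Scaling is fit on the subset only, so the filter column becomes all zeros
    scaled = StandardScaler().fit_transform(subset[feature_cols])
    dist = pairwise_distances(scaled, metric='cosine')
    return pd.DataFrame(dist, index=subset['slug'], columns=subset['slug'])

organic = filtered_distances(toy_df, 'type_organic', toy_features)
print(organic.shape)  # (3, 3)
```

With the notebook's own function, the equivalent call would be `set_filter('type_organic')`, which compares organic coffees only with each other.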

In [165]:
idx = df['slug']

#create array of scaled data
ss = StandardScaler()
ss_fitted = ss.fit_transform(df[features])

#calculate cosine distances (1 - cosine similarity) and create dataframe of all pairwise distances
features_recommender = pairwise_distances(ss_fitted, metric='cosine')
features_recommender_df = pd.DataFrame(features_recommender, index = df['slug'], columns = df['slug'])
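Note that `pairwise_distances` with `metric='cosine'` returns cosine distances, so smaller values mean more similar coffees. The sketch below checks the relationship to cosine similarity on three toy vectors (the vectors are made up for illustration).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# v0 and v1 point in the same direction; v2 is orthogonal to both
vectors = np.array([[1.0, 0.0],
                    [2.0, 0.0],
                    [0.0, 3.0]])

dist = pairwise_distances(vectors, metric='cosine')
sim = cosine_similarity(vectors)

# Cosine distance is one minus cosine similarity
print(np.allclose(dist, 1 - sim))  # True
print(dist.round(3))
```

Here the distance between v0 and v1 is 0 (same direction) and the distance from either of them to v2 is 1 (orthogonal).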

## Text/Description-Based Recommender

Latent Semantic Analysis:
   - Create TFIDF vectors of text data
   - Reduce dimensionality using TruncatedSVD 
   - Create a recommender based on their cosine similarity

In [166]:
normer = Normalizer(norm='l1')

In [167]:
#TFIDF with specific n-gram range and max features
tfidf = TfidfVectorizer(min_df=2, ngram_range=(2,4),max_features=10000)
tfidf_fitted = tfidf.fit_transform(df['clean_text'])

#TruncatedSVD transformation, number of components
tsvd = TruncatedSVD(n_components=225,random_state=36)
tsvd_fitted = tsvd.fit_transform(tfidf_fitted)

#calculate cosine distances (1 - cosine similarity) and create dataframe of all pairwise distances
text_recommender = pairwise_distances(tsvd_fitted, metric='cosine')


text_recommender_df = pd.DataFrame(text_recommender, index = df['slug'], columns = df['slug'])
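The same TF-IDF, TruncatedSVD, cosine-distance pipeline can be tried on a handful of sentences to see what it does. This is a small sketch, not the notebook's configuration: the tasting notes are invented, and the n-gram range and number of components are shrunk to fit four documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import pairwise_distances

# Made-up tasting notes, not taken from the dataset
notes = [
    "bright citrus acidity with floral aroma",
    "floral aroma and bright citrus finish",
    "dark chocolate and roasted nut body",
    "roasted nut body with dark chocolate finish",
]

# Step 1: TF-IDF vectors over unigrams and bigrams
tfidf_toy = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tfidf_toy.fit_transform(notes)

# Step 2: reduce to two latent components
svd_toy = TruncatedSVD(n_components=2, random_state=36)
reduced = svd_toy.fit_transform(tfidf_matrix)

# Step 3: pairwise cosine distances in the reduced space
toy_text_dist = pairwise_distances(reduced, metric='cosine')
print(toy_text_dist.round(2))
```

The two citrus/floral notes should end up much closer to each other than to the chocolate/nut notes, which is the behavior the text recommender relies on.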

## Combination Recommender

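We have 225 components from our text data and 25 categorical and numerical features.  
By column count, our combination recommender is therefore based 90% on text and 10% on categorical and numerical features. How much each block actually influences the cosine also depends on the magnitude of its values.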

In [168]:
#combine arrays of scaled numerical features and truncatedsvd array
joined = np.concatenate((ss_fitted, tsvd_fitted), axis=1)

#calculate cosine distances (1 - cosine similarity) and create dataframe of all pairwise distances
full_recommender = pairwise_distances(joined, metric='cosine')


full_recommender_df = pd.DataFrame(full_recommender, index = df['slug'], columns = df['slug'])
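Plain concatenation weights the two blocks only implicitly. If an explicit split is wanted, one option (an alternative sketch, not what this notebook does) is to L2-normalize each block row-wise and scale it by the square root of its weight; the cosine similarity of the joined rows is then exactly the weighted sum of the per-block cosine similarities. The random arrays and the weights `w_text` and `w_feat` below are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(36)
# Random stand-ins for the text and feature blocks; shapes are arbitrary
text_block = rng.normal(size=(5, 8))
feat_block = rng.normal(size=(5, 3))

w_text, w_feat = 0.9, 0.1  # hypothetical weights that sum to 1

# Unit-normalize each row within its block, then scale by sqrt(weight)
weighted = np.hstack([
    np.sqrt(w_text) * normalize(text_block),
    np.sqrt(w_feat) * normalize(feat_block),
])

combined_sim = cosine_similarity(weighted)
expected = (w_text * cosine_similarity(text_block)
            + w_feat * cosine_similarity(feat_block))
print(np.allclose(combined_sim, expected))  # True
```

This works because each joined row has squared length `w_text + w_feat = 1`, so the dot product of two rows splits into the two weighted block cosines.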

### Function to print recommendations:

In [169]:
def get_recommendations(input_slug, rec_df, names_df, 
                        pick_best = True, n_nearest = 10):
    '''
    Prints a coffee recommendation.

    input_slug: Slug of the coffee to compare against
    rec_df: DataFrame of pairwise cosine distances, indexed and labeled by slug
    names_df: DataFrame of coffee slugs, names, roasters, and ratings
    pick_best: If True, picks the highest rated of the 'n_nearest' most similar coffees
    n_nearest: Number of most similar coffees to choose from (used when pick_best = True)
    '''
    
    
    input_name = names_df[names_df['slug'] == input_slug]['name'].to_string(index = False)
    input_roaster = names_df[names_df['slug'] == input_slug]['roaster'].to_string(index = False)
    
    sims = names_df.join(rec_df[input_slug], how='outer', on='slug')
    sorted_sims = sims.drop(sims[sims['slug'] == input_slug].index).sort_values(by = input_slug)
    
    
    if pick_best:
        print("*Recommending the highest rated coffee out of the", n_nearest, "most similar coffees*")
        rec = sorted_sims[0:n_nearest].sort_values(by='rating', ascending=False).iloc[0]
    else:
        print("*Recommending the most similar coffee*")
        rec = sorted_sims.iloc[0]
    
    
    print("If you like " + input_name + " by " + input_roaster +
         ", you might also like " + rec['name'] + " by " + rec['roaster'] + ".")
    print("\nCompare for yourself:\n",
         "https://www.coffeereview.com/review/" + input_slug,
         "\n https://www.coffeereview.com/review/" + rec['slug'])
    #note: rec_df holds cosine distances, so the value printed here is a distance (0 = most similar)
    print("\nCosine Similarity: ", round(rec.loc[input_slug],3))
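For inspecting candidates without the printing, the nearest-neighbor lookup can also be written as a small function that returns its result. This is a sketch under assumptions: `nearest_slugs` is a hypothetical helper, the toy slugs and distances are made up, and slugs are assumed to be unique.

```python
import numpy as np
import pandas as pd

# Toy symmetric cosine-distance matrix; slugs and values are made up
slugs = ['kenya-aa', 'ethiopia-natural', 'house-blend', 'french-roast']
dist = np.array([[0.0, 0.2, 0.6, 0.9],
                 [0.2, 0.0, 0.5, 0.8],
                 [0.6, 0.5, 0.0, 0.3],
                 [0.9, 0.8, 0.3, 0.0]])
toy_rec_df = pd.DataFrame(dist, index=slugs, columns=slugs)

def nearest_slugs(input_slug, rec_df, n_nearest=10):
    """Return the n_nearest closest slugs and their distances (hypothetical helper)."""
    # Drop the query coffee itself, then keep the smallest distances
    return rec_df[input_slug].drop(input_slug).nsmallest(n_nearest)

nearest = nearest_slugs('house-blend', toy_rec_df, n_nearest=2)
print(nearest)
```

For `house-blend` this returns `french-roast` (0.3) followed by `ethiopia-natural` (0.5), in order of increasing distance.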

In [170]:
test_slug = np.random.choice(df['slug'])

In [171]:
num_comparisons = 5
best_rating = False

print("\n(Based on Text Description)")
get_recommendations(test_slug, text_recommender_df, names_df, n_nearest=num_comparisons, pick_best = best_rating)

print('\n-------------------------------------')
print("\n(Based on Ratings, Roast, Type, and Region)")
get_recommendations(test_slug, features_recommender_df, names_df, n_nearest=num_comparisons, pick_best = best_rating)

print('\n-------------------------------------')

print("\n(Based on Everything)")
get_recommendations(test_slug, full_recommender_df,names_df,  n_nearest=num_comparisons, pick_best = best_rating)


(Based on Text Description)
*Recommending the most similar coffee*
If you like Don Quijote Cafe de Costa Rica by The Roasterie, you might also like Tanzania by Joe's Coffee House.

Compare for yourself:
 https://www.coffeereview.com/review/don-quijote-cafe-de-costa-rica 
 https://www.coffeereview.com/review/tanzania

Cosine Similarity:  0.489

-------------------------------------

(Based on Ratings, Roast, Type, and Region)
*Recommending the most similar coffee*
If you like Don Quijote Cafe de Costa Rica by The Roasterie, you might also like Organic Bali Kintamani Highlands by Wicked Joe.

Compare for yourself:
 https://www.coffeereview.com/review/don-quijote-cafe-de-costa-rica 
 https://www.coffeereview.com/review/organic-bali-kintamani-highlands

Cosine Similarity:  0.0

-------------------------------------

(Based on Everything)
*Recommending the most similar coffee*
If you like Don Quijote Cafe de Costa Rica by The Roasterie, you might also like Papua New Guinea Light Roast by C

Now that we have three recommender systems, the next step is to pick one of them and run an A/B hypothesis test to determine whether the system is useful.

### Good slugs for presentation (`n_nearest`, `pick_best`):
`100-colombia-instant-coffee` (20, True)  
`la-esmeralda` (10, False)  
`panama-lerida-estate-sonias-crop` (15, True)  
`italian-roast` (10,True)