<a href="https://colab.research.google.com/github/saranyamandava/Top-5-Meaningful-terms-of-an-Etsy-Shop/blob/master/Etsy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem:
Etsy is a marketplace where creative entrepreneurs sell what they make or curate to online shoppers. There are 1.6 million active sellers with more than 35 million handcrafted items for sale.

The goal of this project is to determine the five most meaningful terms for each shop in a sample of Etsy shops.

The web UI for this project can be accessed at http://etsyapp.us-east-1.elasticbeanstalk.com/

# Applications: 
1. Identify shops selling similar products.
2. Improve sales by surfacing information about top sellers. (This would be more effective with sales listings instead of active listings, but Etsy does not share sales data publicly on its website or through its developer API.)
3. Show users who spend a lot of time looking at certain shops other, similar shops to check out.


# Import the Libraries:

In [0]:
import sys
import json
import urllib
import logging
import urllib.request
import re
import math
import operator
import pandas as pd
import nltk
from nltk.corpus import stopwords

In [0]:
# API key is removed
KEYSTRING = "REMOVED" 
URL_BASE = "https://openapi.etsy.com/v2/"

In [0]:
# the total number of shops to download
total = 10  

# Getting the Data:
The code below fetches data in JSON format from the Etsy API using an API key. With it, a sample of 10 shops is fetched.

In [0]:
def get_data_from_api(url):

    # Input: a url
    # Output: the parsed JSON object found at that url, or None on failure
    
    try:
        # Use a separate name for the response so `url` stays a string
        with urllib.request.urlopen(url) as response:
            objects = json.loads(response.read().decode())
    except Exception as e:
        # Covers network errors as well as malformed JSON
        logging.error("We had an error (" + repr(e) + ") with url " + url + ".")
        objects = None
    return objects

In [0]:
def get_shops(limit, offset):
    
    # Inputs: limit = the number of shops to download,
    #         offset = the starting index of shops to fetch
    # Output: A list of shops, from the Etsy API
    
    url = (
        URL_BASE 
        + "shops?"
        + "&limit=" + str(limit)
        + "&offset=" + str(offset)
        + "&api_key=" + KEYSTRING
    ) 
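    # `data` avoids shadowing the built-in `object`
    data = get_data_from_api(url)
    # Return an empty list when the request failed or returned no results
    if data and data.get('results'):
        return data['results']
    else:
        return []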
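
The query strings above are assembled by hand. As a minimal sketch, assuming the same `limit`, `offset`, and `api_key` parameters, the standard library's `urllib.parse.urlencode` can build and escape them instead. `build_url` is a hypothetical helper that is not part of the notebook, and no request is sent here.

```python
from urllib.parse import urlencode

URL_BASE = "https://openapi.etsy.com/v2/"

def build_url(path, **params):
    # Hypothetical helper: join the base URL, a path, and escaped query parameters
    return URL_BASE + path + "?" + urlencode(params)

# "REMOVED" stands in for the real API key, as in the notebook
url = build_url("shops", limit=2, offset=0, api_key="REMOVED")
print(url)
```

Escaping matters mostly for values that contain spaces or symbols; for plain integers the result matches the hand-built string.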

The code below fetches the shops in batches of two until `total` shops have been requested.

In [0]:
# Fetch the initial set of shops      
print ("Getting shops.")    
shops = []
for offset in range(0, total, 2):
  limit = min(total - offset, 2)
  shops += get_shops(limit, offset) 

Getting shops.
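
To make the batching explicit, here is a small sketch of the `(limit, offset)` pairs that the loop above generates. `batch_requests` is a hypothetical name introduced only for illustration; it makes no API calls.

```python
def batch_requests(total, batch_size):
    # Hypothetical helper: yield (limit, offset) pairs that cover `total` items
    for offset in range(0, total, batch_size):
        # The last batch may be smaller than batch_size
        yield min(total - offset, batch_size), offset

print(list(batch_requests(10, 2)))
print(list(batch_requests(5, 2)))
```

With `total = 10` every batch has two shops; with an odd total the final request asks for a single shop.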


# Fetching the Listings:
The function below gets the active listings for a given shop.

In [0]:
def get_listings(shop_id):

    # Input: a shop_id
    # Output: A list of listings from that shop, from the Etsy API
    
    url = (
        URL_BASE 
        + "shops/" 
        + str(shop_id)
        + "/listings/active?"
        + "&api_key=" + KEYSTRING
    )
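    # `data` avoids shadowing the built-in `object`
    data = get_data_from_api(url)
    # Return an empty list when the request failed or returned no results
    if data and data.get('results'):
        return data['results']
    else:
        return []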

In [0]:
logging.info("Getting listings.")
total_listings = []
for shop in shops:
  shop['listings'] = get_listings(shop['shop_id'])
  total_listings.append(len(shop['listings']))

In [0]:
# Number of listings fetched from each shop
print (total_listings)

[1, 1, 1, 3, 1, 1, 1, 2, 1, 1]


In [0]:
shops[2]

{'accepts_custom_requests': False,
 'announcement': None,
 'creation_tsz': 1546523566,
 'currency_code': 'EUR',
 'digital_listing_count': 0,
 'digital_sale_message': None,
 'has_onboarded_structured_policies': False,
 'has_unstructured_policies': False,
 'icon_url_fullxfull': None,
 'image_url_760x100': None,
 'include_dispute_form_link': False,
 'is_calculated_eligible': False,
 'is_direct_checkout_onboarded': True,
 'is_using_structured_policies': False,
 'is_vacation': False,
 'languages': ['es'],
 'last_updated_tsz': 1546523566,
 'listing_active_count': 1,
 'listings': [{'category_id': 68890062,
   'category_path': ['Clothing', 'Jacket'],
   'category_path_ids': [69150353, 68890062],
   'creation_tsz': 1546523566,
   'currency_code': 'EUR',
   'description': 'Super dope vintage Tommy Hilfiger jacket!!\n\nMeasurements: XXL \n\nLenght: 70cm/29&quot;\nArmpit to armpit: 68,5cm/28&quot;\nArmpit to the end of cuff: 54cm/21&quot;\nFrom neck to the end of cuff: 84cm/33&quot;\n\n\nAsk me an

# Splitting the Title and Description from each listing to identify terms:
The code below collects the whitespace-separated terms from the title and description of each listing of a shop.

In [0]:
def get_listing_terms(listings):

    # Input: a list of Etsy Listing objects
    # Output: a list of terms found in various fields in these listings
    
    terms = []
    for listing in listings:

        if listing['title']:
            terms += listing['title'].split()
        if listing['description']:
            terms += listing['description'].split()    
            
    return terms

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Making a Custom List of Stop Words
This list combines the stop words from the nltk library with additional stop words chosen by inspecting the term-count output.

In [0]:
"""Create a list of common words to remove"""
stop_words=['a','about','above','after','again','against','ain','all','am','an','and','any','are','aren',"aren't",'as','at','be',
             'because','been','before','being','below','between','both','but','by','can','couldn',"couldn't",'d','did','didn',
             "didn't",'do','does','doesn',"doesn't",'doing','don',"don't",'down','during','each','few','for','from','further',
             'had','hadn',"hadn't",'has','hasn',"hasn't",'have','haven',"haven't",'having','he','her','here','hers','herself',
             'him','himself','his','how','i','if','in','into','is','isn',"isn't",'it',"it's",'its','itself','just','ll','m','ma',
             'me','mightn',"mightn't",'more','most','mustn',"mustn't",'my','myself','needn',"needn't",'no','nor','not','now','o',
             'of','off','on','once','only','or','other','our','ours','ourselves','out','over','own','re','s','same','shan',"shan't",
             'she',"she's",'should',"should've",'shouldn',"shouldn't",'so','some','such','t','than','that',"that'll",'the','their',
             'theirs','them','themselves','then','there','these','they','this','those','through','to','too','under','until','up','ve',
             'very','was','wasn',"wasn't",'we','were','weren',"weren't",'what','when','where','which','while','who','whom','why',
             'will','with','won',"won't",'wouldn',"wouldn't",'y','you',"you'd","you'll","you're","you've",'your','yours','yourself',
             'yourselves','d','oz','t',"http",'r','e','n','u','us','l','etsy','c','name','cm','x','svg','ref','aus', 'vier', 'mit','mm']

# Cleaning:
All non-alphabetic characters (including numbers) and all stop words are removed.

Numbers are removed so that they do not show up among the top terms, where they would carry little information.

The most common vocabulary words are removed using the stop word list.

In [0]:
def clean_terms(terms):

    # Input: a list of terms
    # Output: a list of these terms with non-alphabetic characters and stopwords removed, 
    #         converted to lowercase, and tokenized
    
    cleaned = []
    stops = set(stop_words)  # set membership tests are faster than list lookups
    for term in terms:
        # Replace every run of non-letters with a space, then split into tokens
        tokens = re.sub('[^a-zA-Z]+', ' ', term).lower().strip().split()
        cleaned.extend(w for w in tokens if w not in stops)
    return cleaned
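
As a quick illustration of the cleaning step, the sketch below applies the same regular expression to a few raw tokens that appear in the listing output in this notebook. `clean_token` and the tiny stop word set are stand-ins for illustration only. The last call uses `html.unescape`, which is an optional extra step and an assumption on my part, not something the notebook does.

```python
import html
import re

def clean_token(term, stop_words):
    # Same steps as clean_terms, for one raw term: replace non-letters with
    # spaces, lowercase, split, and drop stop words
    tokens = re.sub('[^a-zA-Z]+', ' ', term).lower().strip().split()
    return [w for w in tokens if w not in stop_words]

stop_words = {'t', 'cm', 'of', 'the'}  # tiny illustrative subset of the custom list

print(clean_token('(HA-0001)', stop_words))
print(clean_token('paint.', stop_words))
print(clean_token('can&#39;t', stop_words))
# An HTML entity leaves the fragment 'quot' behind
print(clean_token('70cm/29&quot;', stop_words))
# Decoding entities first (an assumption, not in the notebook) avoids that
print(clean_token(html.unescape('70cm/29&quot;'), stop_words))
```

Fragments such as `quot` could also be handled by adding them to the custom stop word list, which is the approach the notebook takes for similar leftovers.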

# Calculating Term Counts:

In [0]:
def get_term_counts(terms):

    # Input: a list of terms
    # Output: a hash from term to term count
    
    term_counts = {}
    if not terms:
        return term_counts
    for term in terms:
        if term not in term_counts:
            term_counts[term] = 0
        term_counts[term] += 1
    return term_counts
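
For comparison, the standard library's `collections.Counter` produces the same mapping as the loop above. This is only a sketch of an equivalent alternative; the term list is a made-up sample.

```python
from collections import Counter

def get_term_counts(terms):
    # Same counting loop as in the notebook
    term_counts = {}
    for term in terms:
        if term not in term_counts:
            term_counts[term] = 0
        term_counts[term] += 1
    return term_counts

# Made-up sample of cleaned terms
terms = ['helmet', 'predators', 'helmet', 'led', 'lights', 'led', 'helmet']

loop_counts = get_term_counts(terms)
counter_counts = dict(Counter(terms))
print(loop_counts)
print(counter_counts == loop_counts)
```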

In [0]:
# Collect the terms from each shop and its related objects
all_terms = []
z = []
for shop in shops:
  terms = [] 
  z.append(shop['shop_name'])
  terms += get_listing_terms(shop['listings'])
  # Store the term-count hash in the shop object
  shop['term_counts'] = get_term_counts(clean_terms(terms))
  all_terms += shop['term_counts'].keys()

In [0]:
terms

['Predators',
 'Helmet',
 '(HA-0001)',
 'Predators',
 'Helmet',
 '👹',
 'Predator',
 'Mask',
 'made',
 'of',
 'fiberglass',
 '+',
 'the',
 'real',
 'helmet',
 'inside',
 'with',
 'good',
 'paint.',
 '–',
 'Comes',
 'with',
 'black',
 'visor',
 '–',
 'Added',
 '2',
 'dots',
 'LED',
 'Lights',
 'both',
 'of',
 'sides',
 '–',
 'LED',
 'lights',
 'color',
 'available',
 ':',
 'BLUE,',
 'GREEN',
 'N',
 'RED',
 '+',
 'On/Off',
 'Switch',
 '–',
 'Size',
 'Available',
 ':',
 'S',
 '–',
 'M',
 '–',
 'L',
 '–',
 'XL',
 '–',
 'The',
 'Helmet',
 'basic',
 'has',
 'a',
 'DOT',
 'Approved',
 '–',
 'Color',
 '&',
 'Airbrush',
 'design',
 'by',
 'Request',
 '____________________________________________________________',
 '☑',
 'Worldwide',
 'shipping.',
 '☑',
 'Secure',
 'Payment',
 'Through',
 'PayPal',
 '☑',
 'Please',
 'choose',
 'your',
 'size,',
 'because',
 'we',
 'can&#39;t',
 'refund']

In [0]:
def get_term_weights(term_counts, term_shop_counts, num_shops):

    # Input: a hash of term to term count, 
    #        a hash of term to the number of documents which contain that term,
    #        the total number of shops
    # Output: a hash of terms to the tf-idf weighting of each term,
    #         using the augmented norm for term frequency
    
    weights = {}
    max_count = 1
    for term in term_counts:
        max_count = max(max_count, term_counts[term])
    for term in term_counts:
        normalized_count = 0.5 + 0.5 * term_counts[term] / max_count
        weights[term] = normalized_count * math.log(float(num_shops) / term_shop_counts[term])
    return weights
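
Written out as a formula, the weight computed by `get_term_weights` for term $t$ in shop $s$ is the following, where $c_{t,s}$ is the count of $t$ in shop $s$, $N$ is the number of shops, and $n_t$ is the number of shops whose terms include $t$ (the symbols are my notation, not the notebook's):

```latex
w_{t,s} = \left( 0.5 + 0.5 \, \frac{c_{t,s}}{\max_{t'} c_{t',s}} \right) \ln \frac{N}{n_t}
```

The first factor lies between 0.5 and 1, so the logarithm dominates: a term that occurs in every shop has $n_t = N$ and therefore weight $\ln 1 = 0$, regardless of how often it appears.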

In [0]:
# Calculate the number of shops that use each term
term_shop_counts = get_term_counts(all_terms)

In [0]:
# Calculate the tf-idf weight for the terms in each shop
for shop in shops:
  shop['term_weights'] = get_term_weights(shop['term_counts'], term_shop_counts, len(shops))

# Identifying meaningful terms has been done in two ways: 
1. Bag-of-words
2. TF-IDF Vectorization

Let's use these two approaches and check which one performs better.

# Bag-of-Words

In [0]:
# Creating dataframe variables for holding shop_name and their respective terms

x = []
y = []
df = pd.DataFrame(columns =['shop_name','Top 5 Terms'])

for primary_shop in shops:   
  if len(primary_shop['term_counts']) == 0:
    x.append(primary_shop['shop_name'])
    y.append("Has No Terms")
  else:
    x.append(primary_shop['shop_name'])
    
    # Sorting the list of (term, count) tuples by count in descending order
    # Keeps only the first 5 terms

    y.append([i[0] for i in sorted(primary_shop['term_counts'].items(), key=operator.itemgetter(1), reverse=True)[:5]])
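
The top-five selection can also be sketched with `collections.Counter.most_common`, which returns the highest counts first and keeps ties in first-seen order. The counts below are hypothetical and only illustrate the call.

```python
from collections import Counter

# Hypothetical term counts for one shop
term_counts = {'helmet': 4, 'predators': 3, 'led': 2, 'lights': 2, 'color': 2, 'mask': 1}

# most_common(5) returns the five (term, count) pairs with the highest counts
top_terms = [term for term, _ in Counter(term_counts).most_common(5)]
print(top_terms)
```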


In [0]:
df['shop_name'] = x
df['Top 5 Terms'] = y
df

| | shop_name | Top 5 Terms |
|---|---|---|
| 0 | MalbumuDesignShop | [mustafa, kemal, atat, rk, signiture] |
| 1 | Addarbable | [handmade, ear, warmer, sizes, colors] |
| 2 | RavenandRose1809 | [custom, order, perfume] |
| 3 | LoveMyDogCo | [bigfoot, coffee, mug, funny, gift] |
| 4 | Jo120783 | [ercol, plank, table, candlestick, chairs] |
| 5 | MinimalistMouse | [reborn, baby, boo, inches, adorable] |
| 6 | OrogenProjects | [cards, christmas, detail, cactus, cute] |
| 7 | DrawNaked | [billie, original, drawing, heavy, weight] |
| 8 | UnDesignedShop | [chips, blythe, glass, eye, pairs] |
| 9 | PREDATORSHELMET | [helmet, predators, led, lights, color] |


# TF-IDF Terms

In [0]:
x = []
y = []
df = pd.DataFrame(columns =['shop_name','Top 5 Terms'])


for primary_shop in shops:   
  if len(primary_shop['term_weights']) == 0:
    x.append(primary_shop['shop_name'])
    y.append("Has No Terms")
    
  else:
    x.append(primary_shop['shop_name'])
    # Sorting the list of (term, weight) tuples by tf-idf weight in descending order
    # Keeps only the first 5 terms

    y.append([i[0] for i in sorted(primary_shop['term_weights'].items(), key=operator.itemgetter(1), reverse=True)[:5]])  

In [0]:
df['shop_name'] = x
df['Top 5 Terms'] = y
df

| | shop_name | Top 5 Terms |
|---|---|---|
| 0 | MalbumuDesignShop | [mustafa, kemal, atat, rk, signiture] |
| 1 | Addarbable | [ear, warmer, sizes, colors, keep] |
| 2 | RavenandRose1809 | [custom, order, perfume] |
| 3 | LoveMyDogCo | [bigfoot, coffee, mug, funny, gift] |
| 4 | Jo120783 | [ercol, plank, table, candlestick, chairs] |
| 5 | MinimalistMouse | [reborn, baby, boo, adorable, layette] |
| 6 | OrogenProjects | [cards, christmas, detail, cactus, cute] |
| 7 | DrawNaked | [billie, drawing, heavy, weight, cartridge] |
| 8 | UnDesignedShop | [chips, blythe, glass, eye, pairs] |
| 9 | PREDATORSHELMET | [helmet, predators, led, lights, color] |


# Results:
These two approaches seem to return almost the same results.

The next step is to identify shops that sell the same kind of products by comparing their TF-IDF term weights using cosine similarity.
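
One plausible reason the rankings are so close is that, in a ten-shop sample, many terms occur in only one shop; those terms all share the same idf factor, so their order is decided by the counts alone. The sketch below uses hypothetical counts, loosely modeled on the Addarbable row, to show the one case where the rankings differ: a frequent term that other shops also use.

```python
import math

def get_term_weights(term_counts, term_shop_counts, num_shops):
    # Same augmented-frequency tf-idf as in the notebook, written compactly
    max_count = max(term_counts.values(), default=1)
    return {
        term: (0.5 + 0.5 * count / max_count)
        * math.log(num_shops / term_shop_counts[term])
        for term, count in term_counts.items()
    }

def top_terms(scores, k=5):
    # Highest score first; ties keep their original order
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]

# Hypothetical shop: 'handmade' also appears in four other shops,
# every other term is unique to this shop
counts = {'handmade': 5, 'ear': 4, 'warmer': 4, 'sizes': 2, 'colors': 2, 'keep': 2}
shop_counts = {'handmade': 5, 'ear': 1, 'warmer': 1, 'sizes': 1, 'colors': 1, 'keep': 1}
weights = get_term_weights(counts, shop_counts, num_shops=10)

print(top_terms(counts))   # ranking by raw count
print(top_terms(weights))  # ranking by tf-idf weight
```

Here `handmade` leads the count ranking but falls out of the tf-idf top five, because half of the shops use it.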

# Identifying Similar Shops: 
Let's check for the existence of similar shops within the set of shops using cosine similarity of the term weights obtained from TF-IDF.

In [0]:
def cosine_similarity(weights_1, weights_2):

    # Input: two hashes from term to term weight
    # Output: the cosine similarity score between the two hashes
    
    if not weights_1 or not weights_2:
        return 0.0
    intersection = set(weights_1.keys()) & set(weights_2.keys())
    numerator = sum([weights_1[term] * weights_2[term] for term in intersection])

    sum_1 = sum([weights_1[term]**2 for term in weights_1.keys()])
    sum_2 = sum([weights_2[term]**2 for term in weights_2.keys()])
    denominator = math.sqrt(sum_1) * math.sqrt(sum_2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
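
A small worked example may help here. The sketch below restates the function compactly and applies it to three hypothetical weight dictionaries; the shop names and weights are invented for illustration.

```python
import math

def cosine_similarity(weights_1, weights_2):
    # Dot product over shared terms, divided by the product of the vector norms
    shared = set(weights_1) & set(weights_2)
    numerator = sum(weights_1[t] * weights_2[t] for t in shared)
    norm_1 = math.sqrt(sum(w ** 2 for w in weights_1.values()))
    norm_2 = math.sqrt(sum(w ** 2 for w in weights_2.values()))
    if not norm_1 or not norm_2:
        return 0.0
    return numerator / (norm_1 * norm_2)

# Hypothetical term weights
shop_a = {'vintage': 1.0, 'jacket': 2.0}
shop_b = {'jacket': 2.0, 'denim': 1.0}
shop_c = {'mug': 3.0}

print(cosine_similarity(shop_a, shop_b))  # 4 / (sqrt(5) * sqrt(5)), about 0.8
print(cosine_similarity(shop_a, shop_c))  # no shared terms, so 0.0
print(cosine_similarity(shop_a, shop_a))  # a shop compared with itself, about 1.0
```

Shops with no shared terms score exactly zero, and a shop compared with itself scores the maximum, which is why a ranked list that includes the primary shop starts with that shop.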

In [0]:
x = []
y = []
df2 = pd.DataFrame(columns =['shop_name','similar_shops'])
def print_similar_shops(primary_shop, similar_shops):

    # Input: a shop object along with a list of (shop, similarity score) pairs
    # Output: prints the primary and similar shops' names, and records the
    #         primary shop (in x) and a summary string (in y) for the dataframe
    
    print ('\033[1m' +primary_shop['shop_name'] + ":")
    # Note: a list with a single similar shop is also reported as "none found"
    if len(similar_shops) <= 1:
        x.append(primary_shop['shop_name'])
        y.append(" No similar shops were found!")
   
        print ('\033[0m' + "No similar shops were found!")
    else:
        for similar_shop in similar_shops[:-1]:
            print ('\033[0m' + similar_shop[0]['shop_name'] + "\t")
        print ('\033[0m' + similar_shops[-1][0]['shop_name'])
        x.append(primary_shop['shop_name'])
        # Record the last two shops of the list (explicit indices instead of
        # relying on the loop variable leaking out of the for loop)
        y.append(similar_shops[-2][0]['shop_name'] +"  "+ similar_shops[-1][0]['shop_name'])
        

In the results below, each primary shop is printed in bold and followed by a colon, with up to five of its most similar shops listed after it.

In [0]:
# Calculate and print the five most similar shops to each shop
for primary_shop in shops:   
        if len(primary_shop['term_counts']) == 0:
            print (primary_shop['shop_name'] + " has no terms!")
            continue     
        similar_shops = []
        for other_shop in shops:
            similarity = cosine_similarity(
                primary_shop['term_weights'], 
                other_shop['term_weights']
            )
            if similarity > 0:
                similar_shops.append((other_shop, similarity))
        similar_shops = sorted(similar_shops, key=lambda entry: -1 * entry[1]) 
        print_similar_shops(primary_shop, similar_shops[1:6])

MalbumuDesignShop:
UnDesignedShop
PREDATORSHELMET
Addarbable:
PREDATORSHELMET
MinimalistMouse
UnDesignedShop
OrogenProjects
Jo120783
RavenandRose1809:
No similar shops were found!
LoveMyDogCo:
No similar shops were found!
Jo120783:
DrawNaked
PREDATORSHELMET
OrogenProjects
Addarbable
UnDesignedShop
MinimalistMouse:
Addarbable
UnDesignedShop
PREDATORSHELMET
OrogenProjects
Jo120783
OrogenProjects:
DrawNaked
Jo120783
Addarbable
MinimalistMouse
DrawNaked:
OrogenProjects
Jo120783
PREDATORSHELMET
UnDesignedShop:
MalbumuDesignShop
MinimalistMouse
Addarbable
PREDATORSHELMET
Jo120783
PREDATORSHELMET:
Addarbable
Jo120783
DrawNaked
MinimalistMouse
UnDesignedShop


Let's format these in a readable form with a pandas dataframe. The table below shows at most two similar shops for each shop, namely the last two entries of that shop's list above.

In [0]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated since pandas 1.0)
df2['shop_name'] = x
df2['similar_shops'] = y
df2

| | shop_name | similar_shops |
|---|---|---|
| 0 | MalbumuDesignShop | UnDesignedShop PREDATORSHELMET |
| 1 | Addarbable | OrogenProjects Jo120783 |
| 2 | RavenandRose1809 | No similar shops were found! |
| 3 | LoveMyDogCo | No similar shops were found! |
| 4 | Jo120783 | Addarbable UnDesignedShop |
| 5 | MinimalistMouse | OrogenProjects Jo120783 |
| 6 | OrogenProjects | Addarbable MinimalistMouse |
| 7 | DrawNaked | Jo120783 PREDATORSHELMET |
| 8 | UnDesignedShop | PREDATORSHELMET Jo120783 |
| 9 | PREDATORSHELMET | MinimalistMouse UnDesignedShop |


# Conclusion:
A successful model may improve sales by telling sellers what kinds of products to focus on. This information comes from the meaningful terms of the best-selling shops on Etsy, which can be identified using http://www.craftcount.com/index.php.

This site ranks Etsy shops by sales over a given period, and the rankings can be filtered by country or category. It is a useful tool; however, the website only ranks shops that have specifically signed up to appear in the rankings.
