Looking at a dataset that large was really neat and gave a lot of insight; however, actually training a model on 131k+ examples would be nearly impossible. So the idea here is to compress the dataset down to a more reasonable size. We'll need the compressed data for two things.

The overall architecture will be:

Training Phase:

1. Image Classifier trained on (images, features)
2. Caption Generator trained on (features, captions)

So the idea will be that one can pass in an image to this pipeline, from which features will first be extracted (what the image contains, etc). These features will then be used to create a caption.
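
As a rough sketch of how the two stages would compose at inference time: the names `caption_image`, `fake_classifier`, and `fake_generator` below are hypothetical stand-ins (neither model exists yet at this point), and the only thing shown is the data flow from image to features to caption.

```python
from typing import Callable, List

# Hypothetical type aliases: an image is referenced by its URL here,
# and "features" are plain words describing what the image contains.
FeatureExtractor = Callable[[str], List[str]]
CaptionGenerator = Callable[[List[str]], str]


def caption_image(image_url: str,
                  classify_image: FeatureExtractor,
                  generate_caption: CaptionGenerator) -> str:
    """Stage 1 extracts features from the image, stage 2 turns them into a caption."""
    features = classify_image(image_url)
    return generate_caption(features)


# Stand-in models, only to show how the two stages connect.
def fake_classifier(image_url: str) -> List[str]:
    return ["beach", "sunset"]


def fake_generator(features: List[str]) -> str:
    return "Enjoying the " + " and ".join(features)


print(caption_image("https://example.com/photo.jpg", fake_classifier, fake_generator))
```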

Focusing on just (1) for now, there are a few steps to figure out. To avoid downloading 100k+ images, let's first consider what we can use as "features." Without using a pretrained model, the best bet seems to be the hashtags and emojis (especially if we can map them to words). So let's first drop every row that is missing hashtags or emojis, keeping only rows that have both, and then make a "features" column.

In [135]:
pip install emoji --upgrade

Requirement already up-to-date: emoji in /anaconda3/lib/python3.7/site-packages (0.5.3)
Note: you may need to restart the kernel to use updated packages.


In [136]:
import pickle 
import pandas as pd
import emoji
from emoji import UNICODE_EMOJI
import numpy as np
from tqdm import tqdm
import gensim
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', 50)

In [137]:
with open("../data/preprocessed/insta_data.pickle", 'rb') as f:
    df = pickle.load(f)

In [138]:
df = df[(df.num_hashtags!=0) & (df.num_emojis!=0)]

In [139]:
features = []
for index, row in df.iterrows(): #loop through each row
    emojis = row["emojis"] 
    hashtags = row["hashtags"]
    feature = []
    for x in emojis:
        string = emoji.demojize(x)[1:-1] #get word equivalent of each emoji
        list_string = string.split("_") #split longer strings into each indv. word and add that to features
        for y in list_string:
            feature.append(y)
        feature.append(string)
    for x in hashtags:
        hashtag = x.replace("#", "")
        if hashtag in UNICODE_EMOJI: #convert hashtag to emoji word if necessary
            hashtag = emoji.demojize(hashtag)[1:-1]
        else: #check if there's an emoji inside a hashtag
            for char in hashtag:
                if char in UNICODE_EMOJI:
                    hashtag = hashtag.replace(char, "") 
        feature.append(hashtag) #append those to features
    features.append(list(set(feature))) #get rid of duplicates

In [140]:
df["features"] = features #one list of features per row; avoids building a ragged numpy array
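
To see what the loop above produces for a single row, here is a small self-contained sketch. It assumes a tiny hand-written table, `EMOJI_NAMES`, as a stand-in for `emoji.demojize` so that it runs without the `emoji` package; the helper name `build_features` is also made up for illustration. The emojis and hashtag are taken from one of the rows in the dataset.

```python
# Hypothetical stand-in for emoji.demojize: a tiny emoji -> ":name:" table.
EMOJI_NAMES = {
    "📣": ":megaphone:",
    "💵": ":dollar_banknote:",
    "🔥": ":fire:",
}


def build_features(emojis, hashtags):
    """Mirror the notebook loop on one row, returning a sorted list of unique features."""
    feature = set()
    for e in emojis:
        name = EMOJI_NAMES[e][1:-1]          # strip the surrounding colons
        feature.update(name.split("_"))      # individual words, e.g. "dollar", "banknote"
        feature.add(name)                    # the full name, e.g. "dollar_banknote"
    for tag in hashtags:
        tag = tag.replace("#", "")
        if tag in EMOJI_NAMES:               # the hashtag is a single emoji
            tag = EMOJI_NAMES[tag][1:-1]
        else:                                # drop emojis embedded in the hashtag
            tag = "".join(ch for ch in tag if ch not in EMOJI_NAMES)
        feature.add(tag)
    return sorted(feature)


print(build_features(["📣", "💵", "🔥"], ["#askLynn🔥"]))
# → ['askLynn', 'banknote', 'dollar', 'dollar_banknote', 'fire', 'megaphone']
```

Note that multi-word emoji names contribute both their parts and the full underscore-joined name, which is why this one row ends up with six features.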

However, we can't just send this into a CNN and call it a day because we have an absurd number of labels (40k). That would make any sort of multinomial classification unreasonable, so let's try to cluster our data into a more reasonable number of labels. Since we don't have an extensive corpus of our own to build the clusters from, we'll use Google's pretrained word2vec embeddings. For the actual clustering, we have two main options:

1) K-means Clustering - I'm inclined to use this technique because it's commonly used with precomputed word embeddings. The caveat is that you have to specify the number of clusters up front, which isn't ideal. We'll cluster based on the cosine distance between word embeddings. I'll start with K=100 and see how that works out once we train.

2) Mean Shift Clustering - This technique doesn't require specifying how many clusters you want to end up with, which is ideal for this situation because we don't want to force examples under the same label just because they're closer to each other than to other labels (the best scenario would be to just keep them separate).
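
To make "clustering based on cosine distance" concrete, here is a minimal pure-Python sketch of the distance itself and of the k-means assignment step (each vector goes to its nearest centroid). The 2-D vectors and centroids are made up for illustration and are not real word2vec embeddings; the function names are hypothetical as well.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def assign_to_nearest(vectors, centroids):
    """K-means assignment step: index of the closest centroid for each vector."""
    return [
        min(range(len(centroids)), key=lambda k: cosine_distance(vec, centroids[k]))
        for vec in vectors
    ]


# Toy 2-D "embeddings" (made up for illustration).
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
centroids = [(1.0, 0.0), (0.0, 1.0)]
print(assign_to_nearest(vectors, centroids))
# → [0, 0, 1, 1]
```

A full k-means run alternates this assignment step with recomputing each centroid as the mean of its members until the assignments stop changing.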

But first, we should cut down our dataset. I read that running mean shift clustering on more than 10K samples is ill-advised, which makes sense given its underlying concept of kernel density estimation (KDE), where the probability density function is estimated from the entire dataset. Mean shift then takes this one step further and iteratively shifts each data point uphill until it reaches a peak on the KDE surface, so running it on an immensely large dataset probably isn't the best idea. So let's shrink the dataset first, then get word2vec embeddings for the remaining features and compare the two techniques.
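
To make the "shift each point uphill" idea concrete, here is a minimal 1-D sketch with a Gaussian kernel. It is not scikit-learn's implementation; the function name `mean_shift_1d`, the `bandwidth` value, and the toy points are assumptions chosen for illustration.

```python
import math


def mean_shift_1d(points, bandwidth, iterations=50):
    """Shift every point toward the local weighted mean using a Gaussian kernel."""
    shifted = list(points)
    for _ in range(iterations):
        updated = []
        for x in shifted:
            # Each shift weighs ALL original points, which is why the cost
            # grows quadratically with the size of the dataset.
            weights = [math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2)) for p in points]
            updated.append(sum(w * p for w, p in zip(weights, points)) / sum(weights))
        shifted = updated
    return shifted


# Two made-up groups of 1-D points, one around 1 and one around 10.
points = [0.8, 1.0, 1.2, 9.7, 10.0, 10.3]
modes = mean_shift_1d(points, bandwidth=1.0)
print([round(m, 2) for m in modes])
```

Points that converge to the same peak are treated as one cluster, so here the first three points and the last three points each collapse onto their own mode without the number of clusters ever being specified.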

Let's get rid of the rows where the engagement rate and caption rating are too low, since they won't help our model much. From a cursory glance, these look pretty garbage as well, so to try and get below 10K, let's cut off everything below the 40th percentile.

In [142]:
df.sort_values(by=['engagement_rate', 'caption_rating'], ascending=True)


Unnamed: 0,caption,picture,num_comments,num_likes,username,num_followers,sponsored,num_posts,english,hashtags,caption_no_hashtags,num_hashtags,emojis,num_emojis,normalized_likes,normalized_comments,caption_rating,valid_followers,user_post_engagements,engagement_rate,full_caption_length,no_hashtag_caption_length,features
55904,📣💵If you are looking for a genuine way to make money from home opportunities http://bit.ly/trainingvault1 #askLynn🔥,https://scontent-mia3-1.cdninstagram.com/vp/fbe48fd0a66da9da67b73a273e8dfd02/5DE9FEC2/t51.2885-15/e15/66396305_386950268627102_3487065750404024558_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,lovingbears19631,432.0,False,1676,True,[#askLynn🔥],📣💵If you are looking for a genuine way to make money from home opportunities http://bit.ly/trainingvault1,1,"[📣, 💵, 🔥]",3,0.000000,0.000000,0.000000e+00,1,"[0.0, 0.0, 0.0, 0.0, 0.0]",0.000000,115,105,"[fire, banknote, megaphone, dollar_banknote, askLynn, dollar]"
146112,New Balance 🎉💖 Limited sale 🤗 #men #mensfashion #original #classy #comfy #branded #gift #getitn0w #giftforhim #footwear #egypt #6 Only for 1199LE,https://scontent-mia3-1.cdninstagram.com/vp/2ed3392cf777490b95fe7d07d17ce2b3/5DD0EFC9/t51.2885-15/e35/66194265_2164198387039989_6439306357726613695_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,get.itn0w,2689.0,False,1164,True,"[#men, #mensfashion, #original, #classy, #comfy, #branded, #gift, #getitn0w, #giftforhim, #footwear, #egypt, #6]",New Balance 🎉💖 Limited sale 🤗 Only for 1199LE,12,"[🎉, 💖, 🤗]",3,0.000000,0.000000,0.000000e+00,1,[0.0],0.000000,146,46,"[classy, branded, comfy, heart, face, sparkling, getitn0w, hugging_face, party, footwear, hugging, 6, giftforhim, party_popper, egypt, sparkling_heart, mensfashion, gift, popper, men, original]"
62574,. . . . . #nailsoftheday #nails💅 #nailsofinstagram #nailporn #nailart #instanails #beauty #gelpolish #love #longnails #gelnails #nailartist #nail #naildesigns #nails #manicure #nailsmagazine #nailstyle #notd #glitternails #coffinnails #nailtech #nailswag #nailpro #acrylicnails #nails2inspire #naildesign #glitter #nailstagram #nailsonfleek via my long nails💅💅💅💅,https://scontent-mia3-1.cdninstagram.com/vp/c203e305f84122c5df28cb46696ddacd/5DDD1BBE/t51.2885-15/e35/s1080x1080/66307247_592197541186642_967152634480204550_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,hairweavemakeup,4133.0,False,12064,True,"[#nailsoftheday, #nails💅, #nailsofinstagram, #nailporn, #nailart, #instanails, #beauty, #gelpolish, #love, #longnails, #gelnails, #nailartist, #nail, #naildesigns, #nails, #manicure, #nailsmagazine, #nailstyle, #notd, #glitternails, #coffinnails, #nailtech, #nailswag, #nailpro, #acrylicnails, #nails2inspire, #naildesign, #glitter, #nailstagram, #nailsonfleek]",. . . . . via my long nails💅💅💅💅,30,"[💅, 💅, 💅, 💅, 💅]",5,0.000000,0.000000,0.000000e+00,1,"[0.00024195499637067505, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.000024,362,31,"[nail, beauty, gelnails, nailsmagazine, instanails, nailstagram, acrylicnails, nailporn, nailsonfleek, nailpro, naildesigns, nailsoftheday, nail_polish, nailswag, nails2inspire, glitternails, polish, coffinnails, nailart, glitter, gelpolish, nailsofinstagram, notd, nailartist, love, longnails, naildesign, nailtech, nails, nailstyle, manicure]"
62575,. . . . . #nailsoftheday #nails💅 #nailsofinstagram #nailporn #nailart #instanails #beauty #gelpolish #love #longnails #gelnails #nailartist #nail #naildesigns #nails #manicure #nailsmagazine #nailstyle #notd #glitternails #coffinnails #nailtech #nailswag #nailpro #acrylicnails #nails2inspire #naildesign #glitter #nailstagram #nailsonfleek via my long nails💅💅💅💅,https://scontent-mia3-1.cdninstagram.com/vp/f2c1e084325502dacc2809911a35a4bb/5DD04342/t51.2885-15/e35/s1080x1080/66444271_160596111725921_4291427482446145387_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,hairweavemakeup,4133.0,False,12064,True,"[#nailsoftheday, #nails💅, #nailsofinstagram, #nailporn, #nailart, #instanails, #beauty, #gelpolish, #love, #longnails, #gelnails, #nailartist, #nail, #naildesigns, #nails, #manicure, #nailsmagazine, #nailstyle, #notd, #glitternails, #coffinnails, #nailtech, #nailswag, #nailpro, #acrylicnails, #nails2inspire, #naildesign, #glitter, #nailstagram, #nailsonfleek]",. . . . . via my long nails💅💅💅💅,30,"[💅, 💅, 💅, 💅, 💅]",5,0.000000,0.000000,0.000000e+00,1,"[0.00024195499637067505, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.000024,362,31,"[nail, beauty, gelnails, nailsmagazine, instanails, nailstagram, acrylicnails, nailporn, nailsonfleek, nailpro, naildesigns, nailsoftheday, nail_polish, nailswag, nails2inspire, glitternails, polish, coffinnails, nailart, glitter, gelpolish, nailsofinstagram, notd, nailartist, love, longnails, naildesign, nailtech, nails, nailstyle, manicure]"
62576,. . . . . #nailsoftheday #nails💅 #nailsofinstagram #nailporn #nailart #instanails #beauty #gelpolish #love #longnails #gelnails #nailartist #nail #naildesigns #nails #manicure #nailsmagazine #nailstyle #notd #glitternails #coffinnails #nailtech #nailswag #nailpro #acrylicnails #nails2inspire #naildesign #glitter #nailstagram #nailsonfleek via my long nails💅💅💅💅,https://scontent-mia3-1.cdninstagram.com/vp/7f662e54459dbb5e54155e7397274e28/5DE1B6F9/t51.2885-15/e35/s1080x1080/66506720_387755148602622_5033271024973726916_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,hairweavemakeup,4133.0,False,12064,True,"[#nailsoftheday, #nails💅, #nailsofinstagram, #nailporn, #nailart, #instanails, #beauty, #gelpolish, #love, #longnails, #gelnails, #nailartist, #nail, #naildesigns, #nails, #manicure, #nailsmagazine, #nailstyle, #notd, #glitternails, #coffinnails, #nailtech, #nailswag, #nailpro, #acrylicnails, #nails2inspire, #naildesign, #glitter, #nailstagram, #nailsonfleek]",. . . . . via my long nails💅💅💅💅,30,"[💅, 💅, 💅, 💅, 💅]",5,0.000000,0.000000,0.000000e+00,1,"[0.00024195499637067505, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.000024,362,31,"[nail, beauty, gelnails, nailsmagazine, instanails, nailstagram, acrylicnails, nailporn, nailsonfleek, nailpro, naildesigns, nailsoftheday, nail_polish, nailswag, nails2inspire, glitternails, polish, coffinnails, nailart, glitter, gelpolish, nailsofinstagram, notd, nailartist, love, longnails, naildesign, nailtech, nails, nailstyle, manicure]"
62577,. . . . . #nailsoftheday #nails💅 #nailsofinstagram #nailporn #nailart #instanails #beauty #gelpolish #love #longnails #gelnails #nailartist #nail #naildesigns #nails #manicure #nailsmagazine #nailstyle #notd #glitternails #coffinnails #nailtech #nailswag #nailpro #acrylicnails #nails2inspire #naildesign #glitter #nailstagram #nailsonfleek via,https://scontent-mia3-1.cdninstagram.com/vp/1ff0718fdf1582f18a3b825137d8b3c2/5DEDEC09/t51.2885-15/e35/s1080x1080/67313291_1333906863441641_5969749715531594851_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,hairweavemakeup,4133.0,False,12064,True,"[#nailsoftheday, #nails💅, #nailsofinstagram, #nailporn, #nailart, #instanails, #beauty, #gelpolish, #love, #longnails, #gelnails, #nailartist, #nail, #naildesigns, #nails, #manicure, #nailsmagazine, #nailstyle, #notd, #glitternails, #coffinnails, #nailtech, #nailswag, #nailpro, #acrylicnails, #nails2inspire, #naildesign, #glitter, #nailstagram, #nailsonfleek]",. . . . . via,30,[💅],1,0.000000,0.000000,0.000000e+00,1,"[0.00024195499637067505, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.000024,344,13,"[nail, beauty, gelnails, nailsmagazine, instanails, nailstagram, acrylicnails, nailporn, nailsonfleek, nailpro, naildesigns, nailsoftheday, nail_polish, nailswag, nails2inspire, glitternails, polish, coffinnails, nailart, glitter, gelpolish, nailsofinstagram, notd, nailartist, love, longnails, naildesign, nailtech, nails, nailstyle, manicure]"
62578,. . . . . #nailsoftheday #nails💅 #nailsofinstagram #nailporn #nailart #instanails #beauty #gelpolish #love #longnails #gelnails #nailartist #nail #naildesigns #nails #manicure #nailsmagazine #nailstyle #notd #glitternails #coffinnails #nailtech #nailswag #nailpro #acrylicnails #nails2inspire #naildesign #glitter #nailstagram #nailsonfleek via,https://scontent-mia3-1.cdninstagram.com/vp/0784b9f7aeb43e588ac598349b5e35ea/5DDE4F24/t51.2885-15/e35/s1080x1080/66658457_348339246102131_2078090342664993585_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,hairweavemakeup,4133.0,False,12064,True,"[#nailsoftheday, #nails💅, #nailsofinstagram, #nailporn, #nailart, #instanails, #beauty, #gelpolish, #love, #longnails, #gelnails, #nailartist, #nail, #naildesigns, #nails, #manicure, #nailsmagazine, #nailstyle, #notd, #glitternails, #coffinnails, #nailtech, #nailswag, #nailpro, #acrylicnails, #nails2inspire, #naildesign, #glitter, #nailstagram, #nailsonfleek]",. . . . . via,30,[💅],1,0.000000,0.000000,0.000000e+00,1,"[0.00024195499637067505, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.000024,344,13,"[nail, beauty, gelnails, nailsmagazine, instanails, nailstagram, acrylicnails, nailporn, nailsonfleek, nailpro, naildesigns, nailsoftheday, nail_polish, nailswag, nails2inspire, glitternails, polish, coffinnails, nailart, glitter, gelpolish, nailsofinstagram, notd, nailartist, love, longnails, naildesign, nailtech, nails, nailstyle, manicure]"
62579,. . . . . #nailsoftheday #nails💅 #nailsofinstagram #nailporn #nailart #instanails #beauty #gelpolish #love #longnails #gelnails #nailartist #nail #naildesigns #nails #manicure #nailsmagazine #nailstyle #notd #glitternails #coffinnails #nailtech #nailswag #nailpro #acrylicnails #nails2inspire #naildesign #glitter #nailstagram #nailsonfleek via,https://scontent-mia3-1.cdninstagram.com/vp/f83327123f5aeddec98b453b1f6e3835/5DEC237B/t51.2885-15/e35/s1080x1080/67633258_2416796075310852_1259694601690750835_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,hairweavemakeup,4133.0,False,12064,True,"[#nailsoftheday, #nails💅, #nailsofinstagram, #nailporn, #nailart, #instanails, #beauty, #gelpolish, #love, #longnails, #gelnails, #nailartist, #nail, #naildesigns, #nails, #manicure, #nailsmagazine, #nailstyle, #notd, #glitternails, #coffinnails, #nailtech, #nailswag, #nailpro, #acrylicnails, #nails2inspire, #naildesign, #glitter, #nailstagram, #nailsonfleek]",. . . . . via,30,[💅],1,0.000000,0.000000,0.000000e+00,1,"[0.00024195499637067505, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.000024,344,13,"[nail, beauty, gelnails, nailsmagazine, instanails, nailstagram, acrylicnails, nailporn, nailsonfleek, nailpro, naildesigns, nailsoftheday, nail_polish, nailswag, nails2inspire, glitternails, polish, coffinnails, nailart, glitter, gelpolish, nailsofinstagram, notd, nailartist, love, longnails, naildesign, nailtech, nails, nailstyle, manicure]"
120877,"Pharmanex Optimum Omega adalah cara yang mudah dan aman dalam upaya untuk meningkatkan asam lemak omega-3. Kemurnian minyak ikan dalam Optimum Omega diekstraksi dari ikan yang dipanen dari perairan laut murni. Dibawah pengawasan yang ketat oleh Pharmanex 6S Quality Guarantee untuk kemanjuran dan keamanan, Pharmanex menggunakan ikan yang bebas dari racun, kontaminansi atau logam berat. Optimum Omega juga mengandung vitamin E untuk mempertahankan kesegaran dengan melawan oksidasi. Manfaat Utama : 🐟 Mendukung kesehatan kardiovaskuler. 🐟 Mendukung kesehatan respon imun. 🐟 Mendukung kesehatan fungsi sendi dan mobilitas. 🐟 Memberikan vitamin E alami, sebagai antioksidan. Apa yang Membuat Produk Ini Unik? Minyak ikan yang segar dan sangat murni yang telah diujikan bebas dari racun, polutan dan logam berat. Asam lemak omega-3 dari sumber perairan lebih di minati daripada sumber buatan berdasarkan dua alasan. Satu, kebanyakan dari sumber buatan dan minyak sayur hanya menawarkan jumlah asam lemak omega-3 yang terbatas (kurang dari 1%). Dan kedua, bahkan sumber buatan yang sangat baik, seperti minyak wijen tidak menawarkan EPA dan DHA, yang spesifik mengandung omega-3 dengan manfaat kesehatan yang nyata. Siapa yang Seharusnya Menggunakan Produk Ini? Optimum Omega direkomendasikan untuk siapa saja yang ingin menyeimbangkan nutrisi asam lemak esensial dan yang ingin menambahkan asam lemak omega-3 pada diet harian mereka untuk mendukung kesehatan imun, jantung dan sendi. Apakah Anda Tahu? Ketertarikan pada asam lemak omega-3 dimulai ketika kesehatan kardiovaskuler yang sangat baik pada suku Eskimo ternyata berkaitan erat dengan gaya hidup mereka dalam mengkonsumsi ikan. Anda seharusnya mengkonsumsi paling tidak 8 liter dari minyak yang anda gunakan untuk memasak untuk mendapatkan jumlah yang sama dengan omega-3 yang terkandung dalam dosis harian Optimum Omega. American Heart Association (AHA) merekomendasikan semua orang untuk meningkatkan omega-3 harian mereka. 
#nuskin #nuskinbekasi #nuskinoribekasi #debm #lowcarb #nuskinharapanindah #suplemen #hidupsehat #kolestreol #minyakikan #optimumomega #omega3 #ketogenic",https://scontent-mia3-1.cdninstagram.com/vp/c2fcd6d22c3275f1d29903b612c547da/5DE39B20/t51.2885-15/e35/67331750_378320172877786_1812055582119772814_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,katalog_nadya,4435.0,False,1133,True,"[#nuskin, #nuskinbekasi, #nuskinoribekasi, #debm, #lowcarb, #nuskinharapanindah, #suplemen, #hidupsehat, #kolestreol, #minyakikan, #optimumomega, #omega3, #ketogenic]","Pharmanex Optimum Omega adalah cara yang mudah dan aman dalam upaya untuk meningkatkan asam lemak omega-3. Kemurnian minyak ikan dalam Optimum Omega diekstraksi dari ikan yang dipanen dari perairan laut murni. Dibawah pengawasan yang ketat oleh Pharmanex 6S Quality Guarantee untuk kemanjuran dan keamanan, Pharmanex menggunakan ikan yang bebas dari racun, kontaminansi atau logam berat. Optimum Omega juga mengandung vitamin E untuk mempertahankan kesegaran dengan melawan oksidasi. Manfaat Utama : 🐟 Mendukung kesehatan kardiovaskuler. 🐟 Mendukung kesehatan respon imun. 🐟 Mendukung kesehatan fungsi sendi dan mobilitas. 🐟 Memberikan vitamin E alami, sebagai antioksidan. Apa yang Membuat Produk Ini Unik? Minyak ikan yang segar dan sangat murni yang telah diujikan bebas dari racun, polutan dan logam berat. Asam lemak omega-3 dari sumber perairan lebih di minati daripada sumber buatan berdasarkan dua alasan. Satu, kebanyakan dari sumber buatan dan minyak sayur hanya menawarkan jumlah asam lemak omega-3 yang terbatas (kurang dari 1%). Dan kedua, bahkan sumber buatan yang sangat baik, seperti minyak wijen tidak menawarkan EPA dan DHA, yang spesifik mengandung omega-3 dengan manfaat kesehatan yang nyata. Siapa yang Seharusnya Menggunakan Produk Ini? 
Optimum Omega direkomendasikan untuk siapa saja yang ingin menyeimbangkan nutrisi asam lemak esensial dan yang ingin menambahkan asam lemak omega-3 pada diet harian mereka untuk mendukung kesehatan imun, jantung dan sendi. Apakah Anda Tahu? Ketertarikan pada asam lemak omega-3 dimulai ketika kesehatan kardiovaskuler yang sangat baik pada suku Eskimo ternyata berkaitan erat dengan gaya hidup mereka dalam mengkonsumsi ikan. Anda seharusnya mengkonsumsi paling tidak 8 liter dari minyak yang anda gunakan untuk memasak untuk mendapatkan jumlah yang sama dengan omega-3 yang terkandung dalam dosis harian Optimum Omega. American Heart Association (AHA) merekomendasikan semua orang untuk meningkatkan omega-3 harian mereka.",13,"[🐟, 🐟, 🐟, 🐟]",4,0.000000,0.000000,0.000000e+00,1,"[0.00022547914317925591, 0.0, 0.0]",0.000075,2140,1987,"[nuskin, nuskinoribekasi, suplemen, ketogenic, optimumomega, hidupsehat, nuskinbekasi, nuskinharapanindah, kolestreol, minyakikan, omega3, lowcarb, fish, debm]"
120878,"Pharmanex Optimum Omega adalah cara yang mudah dan aman dalam upaya untuk meningkatkan asam lemak omega-3. Kemurnian minyak ikan dalam Optimum Omega diekstraksi dari ikan yang dipanen dari perairan laut murni. Dibawah pengawasan yang ketat oleh Pharmanex 6S Quality Guarantee untuk kemanjuran dan keamanan, Pharmanex menggunakan ikan yang bebas dari racun, kontaminansi atau logam berat. Optimum Omega juga mengandung vitamin E untuk mempertahankan kesegaran dengan melawan oksidasi. Manfaat Utama : 🐟 Mendukung kesehatan kardiovaskuler. 🐟 Mendukung kesehatan respon imun. 🐟 Mendukung kesehatan fungsi sendi dan mobilitas. 🐟 Memberikan vitamin E alami, sebagai antioksidan. Apa yang Membuat Produk Ini Unik? Minyak ikan yang segar dan sangat murni yang telah diujikan bebas dari racun, polutan dan logam berat. Asam lemak omega-3 dari sumber perairan lebih di minati daripada sumber buatan berdasarkan dua alasan. Satu, kebanyakan dari sumber buatan dan minyak sayur hanya menawarkan jumlah asam lemak omega-3 yang terbatas (kurang dari 1%). Dan kedua, bahkan sumber buatan yang sangat baik, seperti minyak wijen tidak menawarkan EPA dan DHA, yang spesifik mengandung omega-3 dengan manfaat kesehatan yang nyata. Siapa yang Seharusnya Menggunakan Produk Ini? Optimum Omega direkomendasikan untuk siapa saja yang ingin menyeimbangkan nutrisi asam lemak esensial dan yang ingin menambahkan asam lemak omega-3 pada diet harian mereka untuk mendukung kesehatan imun, jantung dan sendi. Apakah Anda Tahu? Ketertarikan pada asam lemak omega-3 dimulai ketika kesehatan kardiovaskuler yang sangat baik pada suku Eskimo ternyata berkaitan erat dengan gaya hidup mereka dalam mengkonsumsi ikan. Anda seharusnya mengkonsumsi paling tidak 8 liter dari minyak yang anda gunakan untuk memasak untuk mendapatkan jumlah yang sama dengan omega-3 yang terkandung dalam dosis harian Optimum Omega. American Heart Association (AHA) merekomendasikan semua orang untuk meningkatkan omega-3 harian mereka. 
#nuskin #nuskinbekasi #nuskinoribekasi #debm #lowcarb #nuskinharapanindah #suplemen #hidupsehat #kolestreol #minyakikan #optimumomega #omega3 #ketogenic",https://scontent-mia3-1.cdninstagram.com/vp/2bd4986c628269b99bcf29a340601206/5DE12002/t51.2885-15/e35/66390228_491736571660113_2056312626605593970_n.jpg?_nc_ht=scontent-mia3-1.cdninstagram.com,0,0,katalog_nadya,4435.0,False,1133,True,"[#nuskin, #nuskinbekasi, #nuskinoribekasi, #debm, #lowcarb, #nuskinharapanindah, #suplemen, #hidupsehat, #kolestreol, #minyakikan, #optimumomega, #omega3, #ketogenic]","Pharmanex Optimum Omega adalah cara yang mudah dan aman dalam upaya untuk meningkatkan asam lemak omega-3. Kemurnian minyak ikan dalam Optimum Omega diekstraksi dari ikan yang dipanen dari perairan laut murni. Dibawah pengawasan yang ketat oleh Pharmanex 6S Quality Guarantee untuk kemanjuran dan keamanan, Pharmanex menggunakan ikan yang bebas dari racun, kontaminansi atau logam berat. Optimum Omega juga mengandung vitamin E untuk mempertahankan kesegaran dengan melawan oksidasi. Manfaat Utama : 🐟 Mendukung kesehatan kardiovaskuler. 🐟 Mendukung kesehatan respon imun. 🐟 Mendukung kesehatan fungsi sendi dan mobilitas. 🐟 Memberikan vitamin E alami, sebagai antioksidan. Apa yang Membuat Produk Ini Unik? Minyak ikan yang segar dan sangat murni yang telah diujikan bebas dari racun, polutan dan logam berat. Asam lemak omega-3 dari sumber perairan lebih di minati daripada sumber buatan berdasarkan dua alasan. Satu, kebanyakan dari sumber buatan dan minyak sayur hanya menawarkan jumlah asam lemak omega-3 yang terbatas (kurang dari 1%). Dan kedua, bahkan sumber buatan yang sangat baik, seperti minyak wijen tidak menawarkan EPA dan DHA, yang spesifik mengandung omega-3 dengan manfaat kesehatan yang nyata. Siapa yang Seharusnya Menggunakan Produk Ini? 
Optimum Omega direkomendasikan untuk siapa saja yang ingin menyeimbangkan nutrisi asam lemak esensial dan yang ingin menambahkan asam lemak omega-3 pada diet harian mereka untuk mendukung kesehatan imun, jantung dan sendi. Apakah Anda Tahu? Ketertarikan pada asam lemak omega-3 dimulai ketika kesehatan kardiovaskuler yang sangat baik pada suku Eskimo ternyata berkaitan erat dengan gaya hidup mereka dalam mengkonsumsi ikan. Anda seharusnya mengkonsumsi paling tidak 8 liter dari minyak yang anda gunakan untuk memasak untuk mendapatkan jumlah yang sama dengan omega-3 yang terkandung dalam dosis harian Optimum Omega. American Heart Association (AHA) merekomendasikan semua orang untuk meningkatkan omega-3 harian mereka.",13,"[🐟, 🐟, 🐟, 🐟]",4,0.000000,0.000000,0.000000e+00,1,"[0.00022547914317925591, 0.0, 0.0]",0.000075,2140,1987,"[nuskin, nuskinoribekasi, suplemen, ketogenic, optimumomega, hidupsehat, nuskinbekasi, nuskinharapanindah, kolestreol, minyakikan, omega3, lowcarb, fish, debm]"


In [143]:
print(df['engagement_rate'].describe())
print(df['engagement_rate'].quantile(.4))
print("\n")
print(df['caption_rating'].describe())
print(df['caption_rating'].quantile(.4))

count    40530.000000
mean     0.141224    
std      0.272429    
min      0.000000    
25%      0.030656    
50%      0.078414    
75%      0.170884    
max      24.333333   
Name: engagement_rate, dtype: float64
0.05688888888888888


count    40530.000000 
mean     19.521766    
std      576.345529   
min      0.000000     
25%      0.491470     
50%      1.954417     
75%      6.439553     
max      102694.360013
Name: caption_rating, dtype: float64
1.19410313044375


In [144]:
df = df[(df.engagement_rate > 0.10058973964698172) & (df.caption_rating > 2.723735791766927)]
print(len(df))

13655
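
The thresholds hard-coded in the cell above are higher than the 40th-percentile values printed just before it (about 0.057 and 1.19). One way to keep the cutoff and the code in sync is to derive the thresholds from the chosen quantile directly. The sketch below does that on made-up toy data; `QUANTILE` and the toy values are assumptions, and only the column names come from the notebook.

```python
import pandas as pd

# Made-up toy data; the column names match the notebook's DataFrame.
toy = pd.DataFrame({
    "engagement_rate": [0.00, 0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50, 1.00],
    "caption_rating":  [0.0, 0.1, 0.4, 0.9, 1.5, 2.0, 3.5, 6.0, 9.0, 40.0],
})

QUANTILE = 0.4  # assumed cutoff; raise it to keep fewer rows

# Derive both thresholds from the data rather than pasting in literals.
engagement_cutoff = toy["engagement_rate"].quantile(QUANTILE)
rating_cutoff = toy["caption_rating"].quantile(QUANTILE)

filtered = toy[(toy["engagement_rate"] > engagement_cutoff)
               & (toy["caption_rating"] > rating_cutoff)]
print(len(filtered))
```

Because a row has to pass both conditions, the share of rows kept can be smaller than `1 - QUANTILE` whenever the two columns are not perfectly correlated.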


So now we've cut it down to about 13.7K posts, which should be a lot easier to work with.

In [145]:
features = [word for row in df["features"] for word in row] #flatten the per-row feature lists
unique_features = list(set(features)) #list of all unique words in the features column

In [8]:
model = gensim.models.KeyedVectors.load_word2vec_format('precomputed_embedding_models/GoogleNews-vectors-negative300.bin.gz', binary=True)  


In [146]:
embeddings_mapping = {}

for word in tqdm(unique_features):
    try:
        embeddings_mapping[word] = model[word]
    except KeyError: #word isn't in the word2vec vocabulary, so skip it
        pass

unique_words = list(embeddings_mapping.keys())
embeddings = list(embeddings_mapping.values())

100%|██████████| 42541/42541 [00:00<00:00, 123916.70it/s]


In [147]:
print(len(embeddings))

9444
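
Note that only 9,444 of the 42,541 unique features (about 22%) have a word2vec embedding; everything else is silently dropped by the lookup. A small sketch of how that coverage could be inspected, using a made-up dictionary `toy_model` as a stand-in for the real embedding model:

```python
# Hypothetical stand-in for the word2vec model: any mapping from word to vector.
toy_model = {"nail": [0.1, 0.3], "beauty": [0.2, 0.1], "love": [0.4, 0.4]}
toy_features = ["nail", "beauty", "love", "nailsofinstagram", "getitn0w"]

found = {w: toy_model[w] for w in toy_features if w in toy_model}
missing = [w for w in toy_features if w not in toy_model]

coverage = len(found) / len(toy_features)
print(f"{len(found)} of {len(toy_features)} features have an embedding ({coverage:.0%})")
print("out of vocabulary:", missing)
```

Looking at the missing words this way makes it easy to check whether they are mostly concatenated hashtags, which is what the toy list imitates.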


Perfect. We're below the 10K limit. Let's start with K-means clustering with K=100.

In [148]:
from nltk.cluster import KMeansClusterer
import nltk

In [150]:
NUM_CLUSTERS=100
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(embeddings, assign_clusters=True)

In [151]:
cluster_mapping = {}
clusters = {}

for word, cluster in zip(unique_words, assigned_clusters):
    if word not in cluster_mapping:
        cluster_mapping[word] = cluster
    if cluster not in clusters:
        clusters[cluster] = []
    clusters[cluster].append(word)

Now Mean Shift Clustering.

In [152]:
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [153]:
ms = MeanShift()
ms.fit(embeddings)
cluster_centers = ms.cluster_centers_
labels = ms.labels_

In [154]:
ms_word_to_cluster_mapping = {}
ms_cluster = {}

for word, label in zip(unique_words, labels):
    if word not in ms_word_to_cluster_mapping:
        ms_word_to_cluster_mapping[word] = label
    if label not in ms_cluster:
        ms_cluster[label] = []
    ms_cluster[label].append(word)

print(len(ms_cluster[0]))

9349


Unfortunately, mean shift ended up being one of those "good in theory, terrible in practice" techniques for our case: the vast majority of the words (9,349 of 9,444) got thrown into the first cluster, and the rest of the clusters only had one word.

This can most likely be attributed to mean shift clustering being done in high dimensions (in this case, our 300-dimensional vectors). Mean shift is usually used for image segmentation, where the dimensions don't get larger than 2-3, so using it for word embeddings is most likely a stretch.
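
A quick way to spot this kind of degenerate clustering is to summarize the cluster sizes. The helper below, `summarize_clusters`, is a hypothetical sketch run on made-up labels that imitate the result above (one huge cluster plus singletons).

```python
from collections import Counter


def summarize_clusters(labels):
    """Return (number of clusters, share of items in the largest cluster, number of singletons)."""
    sizes = Counter(labels)
    largest = max(sizes.values())
    singletons = sum(1 for n in sizes.values() if n == 1)
    return len(sizes), largest / len(labels), singletons


# Made-up labels: 97 items in cluster 0 and three clusters with one item each.
toy_labels = [0] * 97 + [1, 2, 3]
print(summarize_clusters(toy_labels))
# → (4, 0.97, 3)
```

The same function could be applied to `labels` from mean shift and to `assigned_clusters` from k-means to compare the two. A smaller `bandwidth` passed to `MeanShift` might spread the words out more, though that is untested here.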

So we'll go ahead with k-means clustering and 100 labels. If our CNN results turn out poorly, we'll come back and decrease the number of labels.

In [156]:
with open("../data/preprocessed/clusters.pickle", 'wb') as f:
    pickle.dump(clusters, f)
    
with open("../data/preprocessed/clusters_mapping.pickle", 'wb') as f:
    pickle.dump(cluster_mapping, f)
    
with open("../data/preprocessed/unique_word_to_embeddings_mapping.pickle", 'wb') as f:
    pickle.dump(embeddings_mapping, f)

Now let's split the examples into train, dev, and test sets (80/10/10). Since our train set is so obnoxiously large, I'm going to split it into traina and trainb so that after preprocessing, I can still store them as pickles (since I envision that they'll go over the 4 GB cap).

In [157]:
training_examples = []

for index, row in df.iterrows():
    img_url = row["picture"]
    features = row["features"]
    
    for feature in features:
        if feature in cluster_mapping:
            example = (img_url, feature)
            training_examples.append(example)
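
Each example at this point pairs an image URL with a feature word, not yet with a cluster id. Presumably the word gets swapped for its cluster label later on using `cluster_mapping`; a sketch of what that lookup might look like, with made-up miniature data:

```python
# Made-up miniature versions of the notebook's objects, for illustration only.
toy_cluster_mapping = {"nail": 7, "polish": 7, "love": 42}
toy_examples = [
    ("https://example.com/a.jpg", "nail"),
    ("https://example.com/a.jpg", "love"),
    ("https://example.com/b.jpg", "polish"),
]

# Replace each feature word with the id of the cluster it belongs to.
labeled = [(url, toy_cluster_mapping[word]) for url, word in toy_examples]
print(labeled)
```

Since one image can have several features, the same URL can appear with several different labels, as `a.jpg` does here.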

In [158]:
split_1 = int(0.8*len(training_examples))
split_2 = int(0.1*len(training_examples))
train = training_examples[:split_1]
dev = training_examples[split_1:split_1 + split_2]
test = training_examples[split_1 + split_2:] #everything left over, so no example is dropped
small = train[:10]
half = len(train)//2
traina = train[:half]
trainb = train[half:] #second half; keeps the middle example when len(train) is odd

with open("../data/preprocessed/traina.pickle", 'wb') as f:
    pickle.dump(traina, f)
    
with open("../data/preprocessed/trainb.pickle", 'wb') as f:
    pickle.dump(trainb, f)

with open("../data/preprocessed/dev.pickle", 'wb') as f:
    pickle.dump(dev, f)
    
with open("../data/preprocessed/test.pickle", 'wb') as f:
    pickle.dump(test, f)
    
with open("../data/preprocessed/small.pickle", 'wb') as f:
    pickle.dump(small, f)
    
with open("../data/preprocessed/label_map.pickle", 'wb') as f:
    pickle.dump(cluster_mapping, f)
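
One caveat with the split above: it slices the list in its original order, and all examples from the same image sit next to each other, so an image at a boundary can end up in two sets. If that matters, one option (not what this notebook does) is to shuffle and split by image URL instead. The function name `split_by_image` and the `seed` value are assumptions for illustration.

```python
import random


def split_by_image(examples, seed=0, train_frac=0.8, dev_frac=0.1):
    """Shuffle image URLs, then split so each image lands in exactly one set."""
    urls = sorted({url for url, _ in examples})
    random.Random(seed).shuffle(urls)

    n_train = int(train_frac * len(urls))
    n_dev = int(dev_frac * len(urls))
    train_urls = set(urls[:n_train])
    dev_urls = set(urls[n_train:n_train + n_dev])

    train, dev, test = [], [], []
    for example in examples:
        url = example[0]
        if url in train_urls:
            train.append(example)
        elif url in dev_urls:
            dev.append(example)
        else:
            test.append(example)
    return train, dev, test


# Made-up examples: 10 images with 2 feature words each.
toy_examples = [(f"img{i}.jpg", word) for i in range(10) for word in ("a", "b")]
train, dev, test = split_by_image(toy_examples)
print(len(train), len(dev), len(test))
# → 16 2 2
```

Fixing the seed keeps the split reproducible between runs, which matters if the pickles are ever regenerated.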