###  Engagement Analysis

In [1]:
import nltk; nltk.download('stopwords')
import re
import numpy as np
import pandas as pd
from pprint import pprint
from ast import literal_eval

from nltk.tokenize import word_tokenize

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sahana\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Read in the scraped file with labels

In [2]:
df = pd.read_csv("natgeo_labels_final.csv")

df.drop(df.columns[0], axis=1, inplace = True)
df['labels'] = df['labels'].apply(lambda x: literal_eval(x))

In [3]:
## Remove punctuation and convert the labels to lower case
df["labels"] = df["labels"].astype(str)
df["labels"] = df["labels"].apply(lambda each_post: word_tokenize(re.sub(r'[^\w\s]',' ',each_post.lower())))
df["labels_strings"] = df['labels'].apply(' '.join)

In [4]:
df.head()

Unnamed: 0,Index,display_url,comments,is_video,likes,caption,labels,labels_strings
0,0,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,145,False,31505,Photo by Amber Bracken @photobracken | Jocelyn...,"[hair, beauty, hairstyle, skin, long, hair, li...",hair beauty hairstyle skin long hair lip hand ...
1,2,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1018,False,330690,Photo by Charlie Hamilton James @chamiltonjame...,"[sky, wildlife, natural, environment, ecoregio...",sky wildlife natural environment ecoregion mar...
2,3,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,660,False,217013,Photo by @brianskerry | A great white shark sw...,"[great, white, shark, shark, lamniformes, tige...",great white shark shark lamniformes tiger shar...
3,4,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1458,False,281516,Photo by @gabrielegalimbertiphoto and Juri De ...,"[room, living, room, furniture, interior, desi...",room living room furniture interior design tab...
4,5,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,540,False,161964,"Photo by @amivitale | Jenabu, 13, waits for he...","[face, black, people, skin, child, head, lip, ...",face black people skin child head lip eyebrow ...


#### Create an engagement metric for each Instagram post
We define engagement as a weighted sum of the number of likes and the number of comments. First, normalize likes and comments by dividing each by its maximum, so that both lie between 0 and 1. Then compute engagement score = 0.4 * likes (normalized) + 0.6 * comments (normalized). Finally, label each post as High (1) or Low (0) engagement according to whether its score is above the median; a post exactly at the median is labeled Low.
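As a minimal sketch of the metric described above, the steps can be wrapped in one function and run on a made-up four-post table. The function name `add_engagement` and the toy numbers are illustrative assumptions, not part of the scraped data. Dividing by the column maximum maps the largest value to 1, which is a simpler scaling than min-max normalization.

```python
import pandas as pd

def add_engagement(posts, w_likes=0.4, w_comments=0.6):
    """Return a copy of `posts` with normalized counts, a score, and a 0/1 label."""
    out = posts.copy()
    # Scale each count by its column maximum so the largest value becomes 1
    out["likes_normalized"] = out["likes"] / out["likes"].max()
    out["comments_normalized"] = out["comments"] / out["comments"].max()
    # Weighted sum of the two normalized counts
    out["engagement_score"] = (w_likes * out["likes_normalized"]
                               + w_comments * out["comments_normalized"])
    # 1 = strictly above the median score, 0 = at or below it
    median = out["engagement_score"].median()
    out["engagement"] = (out["engagement_score"] > median).astype(int)
    return out

# Hypothetical toy data, only to show the mechanics
toy = pd.DataFrame({"likes": [100, 400, 200, 300],
                    "comments": [10, 20, 40, 30]})
scored = add_engagement(toy)
print(scored[["engagement_score", "engagement"]])
```

In this toy table the scores are 0.25, 0.70, 0.80 and 0.75, the median is 0.725, and only the last two posts are labeled High.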

In [5]:
## Normalize the number of likes and comments
df["likes_normalized"] = df["likes"]/df["likes"].max() 
df["comments_normalized"] = df["comments"]/df["comments"].max()

In [6]:
## Create engagement score
df["engagement_score"] = 0.4*df["likes_normalized"] + 0.6*df["comments_normalized"]

In [7]:
## Define whether the post has "high" or "low" engagement based on the median score
engagement_median = df["engagement_score"].median()
df["engagement"] = df["engagement_score"].apply(lambda x: 1 if x > engagement_median else 0)

In [8]:
df.head()

Unnamed: 0,Index,display_url,comments,is_video,likes,caption,labels,labels_strings,likes_normalized,comments_normalized,engagement_score,engagement
0,0,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,145,False,31505,Photo by Amber Bracken @photobracken | Jocelyn...,"[hair, beauty, hairstyle, skin, long, hair, li...",hair beauty hairstyle skin long hair lip hand ...,0.019494,0.011139,0.014481,0
1,2,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1018,False,330690,Photo by Charlie Hamilton James @chamiltonjame...,"[sky, wildlife, natural, environment, ecoregio...",sky wildlife natural environment ecoregion mar...,0.20462,0.078205,0.128771,1
2,3,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,660,False,217013,Photo by @brianskerry | A great white shark sw...,"[great, white, shark, shark, lamniformes, tige...",great white shark shark lamniformes tiger shar...,0.13428,0.050703,0.084134,0
3,4,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1458,False,281516,Photo by @gabrielegalimbertiphoto and Juri De ...,"[room, living, room, furniture, interior, desi...",room living room furniture interior design tab...,0.174193,0.112007,0.136882,1
4,5,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,540,False,161964,"Photo by @amivitale | Jenabu, 13, waits for he...","[face, black, people, skin, child, head, lip, ...",face black people skin child head lip eyebrow ...,0.100218,0.041484,0.064978,0


### Regression Analysis
Run a logistic regression with Engagement (binary) as the dependent variable, and the image labels as independent variables. Display the accuracy (confusion matrix). What accuracy do we get by using the post caption words as the independent variables instead of image labels? Finally, what accuracy do we get by combining the image labels and post captions and using them as independent variables? What can we conclude from our analysis? 

Note: Doing a word frequency analysis and word replacement on the image labels as well as captions will increase the accuracy of prediction. Needless to say, TF-IDF scores should be used. 

####  Logistic Regression with Labels to Predict Engagement

In [9]:
import pandas as pd
from pandas import DataFrame, Series
import urllib.request 
import statsmodels.api as sm
import math
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

In [10]:
## Keeping all labels intact and using tf-idf score
vectorizer = TfidfVectorizer()
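X = vectorizer.fit_transform(df['labels_strings'].tolist())
# Name each TF-IDF column after its vocabulary term
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
q = vectorizer.get_feature_names()
l = pd.DataFrame(X.toarray(), columns=q)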

l['engagement'] = df['engagement']


X_train, X_test, y_train, y_test = train_test_split(l.iloc[:,:-1], l['engagement'], test_size=0.20, random_state=42)

clf = LogisticRegression(random_state=0).fit(X_train,y_train )
print("Accuracy with Image Labels: " + str(clf.score(X_test, y_test)))
print("Confusion Matrix:")
print(confusion_matrix(y_test, clf.predict(X_test)))



Accuracy with Image Labels: 0.7073170731707317
Confusion Matrix:
[[32  9]
 [15 26]]
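To read more than accuracy out of the confusion matrix above, here is a small sketch in plain Python. It assumes the `[[TN, FP], [FN, TP]]` layout that `confusion_matrix` uses for labels 0 and 1; the helper name `summarize_confusion` is made up for illustration.

```python
def summarize_confusion(cm):
    """cm is [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,   # share of all posts classified correctly
        "precision": tp / (tp + fp),     # of posts predicted High, share truly High
        "recall": tp / (tp + fn),        # of truly High posts, share predicted High
    }

# Confusion matrix reported for the image-label model
metrics = summarize_confusion([[32, 9], [15, 26]])
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.707, 'precision': 0.743, 'recall': 0.634}
```

The model misses 15 of the 41 high-engagement posts in the test set, so its recall for the High class is lower than its overall accuracy.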


#### Logistic Regression with Captions to Predict Engagement

In [11]:
## Keeping all captions intact and using tf-idf score
vectorizer = TfidfVectorizer()
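X = vectorizer.fit_transform(df['caption'].tolist())
# Name each TF-IDF column after its vocabulary term
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
q = vectorizer.get_feature_names()
l = pd.DataFrame(X.toarray(), columns=q)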

l['engagementVal'] = df['engagement']

X_train, X_test, y_train, y_test = train_test_split(l.iloc[:,:-1], l['engagementVal'], test_size=0.20, random_state=42)

clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train,y_train )
print("Accuracy with Caption Labels: " + str(clf.score(X_test, y_test)))
print("Confusion Matrix:")
print(confusion_matrix(y_test, clf.predict(X_test)))

Accuracy with Caption Labels: 0.6219512195121951
Confusion Matrix:
[[24 17]
 [14 27]]


#### Logistic Regression with Captions and Labels to Predict Engagement

In [12]:
df['caption+labels'] = df['caption'] + " " + df['labels_strings']

In [13]:
vectorizer = TfidfVectorizer()
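X = vectorizer.fit_transform(df['caption+labels'].tolist())
# Name each TF-IDF column after its vocabulary term
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
q = vectorizer.get_feature_names()
l = pd.DataFrame(X.toarray(), columns=q)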

l['engagementVal'] = df['engagement']

X_train, X_test, y_train, y_test = train_test_split(l.iloc[:,:-1], l['engagementVal'], test_size=0.33, random_state=42)

clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train,y_train )
print("Accuracy with both Captions and Image Labels: " + str(clf.score(X_test, y_test)))
print("Confusion Matrix:")
print(confusion_matrix(y_test, clf.predict(X_test)))

Accuracy with both Captions and Image Labels: 0.7185185185185186
Confusion Matrix:
[[52 15]
 [23 45]]


Using only the image labels from Google Vision, we predict engagement with an accuracy of 70.7%. Using only the Instagram captions, accuracy is 62.2%. Combining captions and labels raises accuracy to 71.9%, the highest of the three. Note that the combined model was evaluated on a 33% test split (135 posts) while the other two used a 20% split (82 posts), so the gain of roughly one percentage point over labels alone is small and not strictly like-for-like. With that caveat, the results suggest that the best model uses both captions and labels. This makes sense: the caption carries some weight in determining engagement but is not sufficient on its own, since a caption may be only loosely related to the image. Combining captions with the Google Vision labels predicts engagement better, as Google Vision has already been exposed to a wide variety of images and its labels describe the image content directly.
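One way to make the three accuracies directly comparable is to score every feature set on the same held-out rows and to fit TF-IDF on the training rows only. The sketch below does this with a scikit-learn pipeline; the helper `holdout_accuracy` and the eight-row toy frame are hypothetical stand-ins for `df`, so the printed numbers say nothing about the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def holdout_accuracy(texts, target, test_size=0.25, seed=42):
    """Fit TF-IDF + logistic regression on the training rows, score on the held-out rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, target, test_size=test_size, random_state=seed, stratify=target)
    # The pipeline learns the vocabulary and IDF weights from X_train only
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="lbfgs"))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    "caption": ["a wolf returns to her den", "wolf pups at play", "a fox hunts at dusk",
                "lion cubs resting", "portrait of a dancer", "a family at home",
                "children at school", "a farmer at work"],
    "labels_strings": ["wildlife mammal carnivore", "wildlife mammal canidae",
                       "wildlife mammal snout", "wildlife felidae mammal",
                       "hair face skin", "room furniture people",
                       "child face people", "people face hand"],
    "engagement": [1, 1, 1, 1, 0, 0, 0, 0],
})
toy["caption+labels"] = toy["caption"] + " " + toy["labels_strings"]

# Same seed, size, and stratification, so every feature set is scored on the same rows
results = {col: holdout_accuracy(toy[col], toy["engagement"])
           for col in ["labels_strings", "caption", "caption+labels"]}
print(results)
```

On a dataset of about 400 posts, a single split is noisy; replacing the hold-out score with `cross_val_score` over the same pipeline would likely give a steadier comparison.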

### Topic Modeling
Perform topic modeling (LDA) on the image labels. Choose an appropriate number of topics. You may want to start with 5, but adjust the number up or down depending on the word distributions you get. LDA should produce two outputs: (i) A file showing which words load on which topics, and (ii) a file showing topic weights for each image. 

Now take the quartiles with highest and lowest engagement scores. What are the differences in the average topic weights of pictures across the two quartiles (e.g., greater proportion of some topics in highest engagement quartile)? Show the main results in a table. 

In [14]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sahana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
## Removing stop words, punctuation and tokenizing
stop = stopwords.words('english')
## stop=stop+['photography']
df["labels"] = df["labels"].astype(str)
df["label_tokens"] = df["labels"].apply(lambda each_post: word_tokenize(re.sub(r'[^\w\s]',' ',each_post.lower())))
df["label_tokens"] = df["label_tokens"].apply(lambda list_of_words: [x for x in list_of_words if x not in stop])

In [16]:
def bigrams(words, bi_min=15, tri_min=10):
    bigram = gensim.models.Phrases(words, min_count = bi_min)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return bigram_mod

def get_corpus(df):
    """
    Get Bigram Model, Corpus, id2word mapping
    """
    bigram = bigrams(df.label_tokens)
    bigram = [bigram[review] for review in df.label_tokens]
    id2word = gensim.corpora.Dictionary(bigram)
    id2word.filter_extremes(no_below=10, no_above=0.35)
    id2word.compactify()
    corpus = [id2word.doc2bow(text) for text in bigram]
    return corpus, id2word, bigram

In [17]:
train_corpus, train_id2word, bigram_train = get_corpus(df)

In [18]:
import logging
logging.basicConfig(filename='lda_model.log', format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    lda_train = gensim.models.ldamulticore.LdaMulticore(
                           corpus=train_corpus,
                           num_topics=10,
                           id2word=train_id2word,
                           chunksize=100,
                           workers=7, # Num. Processing Cores - 1
                           passes=50,
                           eval_every = 1,
                           per_word_topics=True)
    lda_train.save('lda_train.model')

In [19]:
## Print coherence of the LDA model
coherence_model_lda = CoherenceModel(model=lda_train, texts=bigram_train, dictionary=train_id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print("The coherence of the LDA model is",coherence_lda)

The coherence of the LDA model is 0.35881249942559834


In [20]:
lda_train.print_topics()

[(0,
  '0.134*"fish" + 0.104*"organism" + 0.099*"human" + 0.093*"underwater" + 0.092*"marine" + 0.087*"biology" + 0.078*"art" + 0.053*"photography" + 0.039*"yellow" + 0.038*"water"'),
 (1,
  '0.155*"photography" + 0.110*"black" + 0.101*"white" + 0.077*"monochrome" + 0.076*"hair" + 0.049*"photograph" + 0.049*"beauty" + 0.049*"skin" + 0.045*"head" + 0.045*"face"'),
 (2,
  '0.180*"fun" + 0.119*"recreation" + 0.113*"horse" + 0.082*"vacation" + 0.072*"vehicle" + 0.069*"animal" + 0.066*"photography" + 0.048*"child" + 0.041*"adaptation" + 0.037*"sky"'),
 (3,
  '0.151*"wildlife" + 0.114*"terrestrial_animal" + 0.114*"vertebrate" + 0.108*"mammal" + 0.097*"dog" + 0.081*"carnivore" + 0.072*"snout" + 0.068*"canidae" + 0.043*"felidae" + 0.041*"adaptation"'),
 (4,
  '0.173*"plant" + 0.158*"tree" + 0.079*"nature" + 0.075*"rock" + 0.064*"forest" + 0.051*"woody" + 0.051*"natural_environment" + 0.043*"branch" + 0.043*"reserve" + 0.034*"landscape"'),
 (5,
  '0.099*"sky" + 0.088*"grass" + 0.085*"night" + 0

#### Visualize the Topics

In [21]:
vis = pyLDAvis.gensim.prepare(topic_model=lda_train, corpus=train_corpus, dictionary=train_id2word)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)


Each bubble in the left-hand plot represents a topic. The larger the bubble, the more prevalent the topic.
A good topic model has fairly large, non-overlapping bubbles scattered across the chart rather than clustered in one quadrant.
A model with too many topics typically shows many overlapping, small bubbles clustered in one region of the chart.

Note: The Topic IDs in the InterTopic Distance Map do not correspond to the LDA topic IDs.

In [22]:
df_lda = pd.DataFrame(lda_train.show_topics(), columns=['Topic','Word Weights'])

In [23]:
df_lda.to_csv('Word_Weights.csv')

In [25]:
train_vecs = []
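for doc_bow in train_corpus:
    # minimum_probability=0.0 asks for a (topic_id, weight) pair for every topic
    top_topics = lda_train.get_document_topics(doc_bow, minimum_probability=0.0)
    # Keep the weights of the 10 topics, in topic-id order
    topic_vec = [top_topics[topic_id][1] for topic_id in range(10)]
    train_vecs.append(topic_vec)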

In [26]:
print(train_vecs[10])
print(len(train_vecs))

[0.020003553, 0.21998511, 0.020005194, 0.020000074, 0.020001208, 0.020005703, 0.020005012, 0.37349468, 0.020005118, 0.2664944]
408
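The loop above relies on `get_document_topics` returning one pair per topic, in topic-id order. As a hedged alternative, the following plain-Python helper (the name `dense_topic_vector` is made up here) places each weight by its topic id, so a missing topic becomes 0.0 instead of shifting the other weights or raising an error.

```python
def dense_topic_vector(topic_pairs, num_topics):
    """Turn [(topic_id, weight), ...] into a fixed-length list indexed by topic id."""
    vec = [0.0] * num_topics
    for topic_id, weight in topic_pairs:
        vec[topic_id] = float(weight)
    return vec

# Illustrative pairs: only three topics reported, and not in id order
pairs = [(7, 0.37), (1, 0.22), (9, 0.27)]
print(dense_topic_vector(pairs, num_topics=10))
# [0.0, 0.22, 0.0, 0.0, 0.0, 0.0, 0.0, 0.37, 0.0, 0.27]
```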


In [27]:
train_vec_df=pd.DataFrame(train_vecs)
train_vec_df.columns=['topic0','topic1','topic2','topic3','topic4','topic5','topic6','topic7','topic8','topic9']
train_vec_df.iloc[31]

topic0    0.012500
topic1    0.012500
topic2    0.012501
topic3    0.227575
topic4    0.401413
topic5    0.283503
topic6    0.012505
topic7    0.012502
topic8    0.012500
topic9    0.012501
Name: 31, dtype: float64

In [28]:
df_nat_final=pd.concat([df.reset_index(drop=True), train_vec_df.reset_index(drop=True)], axis=1)
df_nat_final[:2]

Unnamed: 0,Index,display_url,comments,is_video,likes,caption,labels,labels_strings,likes_normalized,comments_normalized,...,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9
0,0,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,145,False,31505,Photo by Amber Bracken @photobracken | Jocelyn...,"['hair', 'beauty', 'hairstyle', 'skin', 'long'...",hair beauty hairstyle skin long hair lip hand ...,0.019494,0.011139,...,0.02,0.82,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02
1,2,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1018,False,330690,Photo by Charlie Hamilton James @chamiltonjame...,"['sky', 'wildlife', 'natural', 'environment', ...",sky wildlife natural environment ecoregion mar...,0.20462,0.078205,...,0.014286,0.014286,0.014286,0.01429,0.014289,0.702664,0.014289,0.014287,0.014286,0.183037


In [29]:
df_nat_final.to_csv("Topic_Weights.csv")

In [30]:
q1=np.percentile(df_nat_final.engagement_score, 25) 
q2=np.percentile(df_nat_final.engagement_score, 50)  
q3=np.percentile(df_nat_final.engagement_score, 75)

print (q1,q2,q3)

0.08030411923207784 0.11903844877968242 0.1809772493842345


In [31]:
top_quartile=df_nat_final[df_nat_final['engagement_score']>q3]
top_quartile[:3]

Unnamed: 0,Index,display_url,comments,is_video,likes,caption,labels,labels_strings,likes_normalized,comments_normalized,...,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9
6,7,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1116,False,526240,Photos by Pete McBride @pedromcbride | Source ...,"['geological', 'phenomenon', 'geology', 'rock']",geological phenomenon geology rock,0.32562,0.085734,...,0.025,0.025,0.025,0.025,0.025013,0.025001,0.025021,0.025,0.774962,0.025002
13,15,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,3023,False,771071,Photo by Keith Ladzinski @ladzinski | In searc...,"['sky', 'wildlife', 'organism', 'cloud', 'phot...",sky wildlife organism cloud photography wood l...,0.477113,0.232235,...,0.156915,0.011114,0.011113,0.011115,0.011112,0.75418,0.011113,0.011112,0.011113,0.011112
19,22,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,1368,False,587036,Photo by @ronan_donovan | A mother wolf return...,"['canidae', 'wildlife', 'irish', 'wolfhound', ...",canidae wildlife irish wolfhound sheep carnivo...,0.363238,0.105093,...,0.02,0.02,0.02,0.62,0.02,0.020001,0.219998,0.02,0.02,0.02


In [37]:
# Note: 'topic7' is not in this column list, so it does not appear in the averages below
average_topic_weights_top = top_quartile[["topic0",'topic1',"topic2",'topic3',"topic4",'topic5',"topic6",'topic8','topic9']].mean(axis=0)
average_topic_weights_top

topic0    0.077535
topic1    0.069047
topic2    0.047511
topic3    0.199879
topic4    0.103715
topic5    0.083818
topic6    0.132692
topic8    0.105989
topic9    0.098898
dtype: float64

In [34]:
bottom_quartile=df_nat_final[df_nat_final['engagement_score']<q1]
bottom_quartile[:3]

Unnamed: 0,Index,display_url,comments,is_video,likes,caption,labels,labels_strings,likes_normalized,comments_normalized,...,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9
0,0,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,145,False,31505,Photo by Amber Bracken @photobracken | Jocelyn...,"['hair', 'beauty', 'hairstyle', 'skin', 'long'...",hair beauty hairstyle skin long hair lip hand ...,0.019494,0.011139,...,0.02,0.82,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02
4,5,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,540,False,161964,"Photo by @amivitale | Jenabu, 13, waits for he...","['face', 'black', 'people', 'skin', 'child', '...",face black people skin child head lip eyebrow ...,0.100218,0.041484,...,0.012501,0.887496,0.012502,0.0125,0.0125,0.0125,0.0125,0.0125,0.012502,0.0125
7,8,https://instagram.fftw1-1.fna.fbcdn.net/v/t51....,423,False,215519,Photo by William Albert Allard @williamalberta...,"['action', 'adventure', 'game', 'horse', 'cowb...",action adventure game horse cowboy screenshot ...,0.133356,0.032496,...,0.033333,0.033333,0.7,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333,0.033333


In [36]:
# Note: 'topic7' is not in this column list, so it does not appear in the averages below
average_topic_weights_bot = bottom_quartile[["topic0",'topic1',"topic2",'topic3',"topic4",'topic5',"topic6",'topic8','topic9']].mean(axis=0)
average_topic_weights_bot

topic0    0.088776
topic1    0.106151
topic2    0.136207
topic3    0.030750
topic4    0.073892
topic5    0.148118
topic6    0.090610
topic8    0.226552
topic9    0.036310
dtype: float64

In [38]:
quartile_topics = pd.concat([average_topic_weights_top,average_topic_weights_bot],axis=1)
quartile_topics.columns = ['Top Quartile','Bottom Quartile']

In [39]:
quartile_topics

Unnamed: 0,Top Quartile,Bottom Quartile
topic0,0.077535,0.088776
topic1,0.069047,0.106151
topic2,0.047511,0.136207
topic3,0.199879,0.03075
topic4,0.103715,0.073892
topic5,0.083818,0.148118
topic6,0.132692,0.09061
topic8,0.105989,0.226552
topic9,0.098898,0.03631


### Insights:
Based on the LDA results, many of our posts fall under the broad category of nature. The topics overlap considerably, since most relate to nature, landscapes, and animals. People expect these photos from NatGeo, and the photos are probably the reason they actively follow the account. NatGeo is first and foremost known for its nature-based magazines, which have been a household name since 1888. The Instagram account is an extension of that brand image, so it is essential to maintain it across platforms. However, there is also room to engage a new user base through Instagram, which is why we recommend gradually mixing in new images corresponding to other topics.

Looking only at our high-engagement posts, three topics stand out with the highest scores: Photography, Nature, and Mountainous terrain. We can assume that Instagram users are generally interested in good photography; they are, after all, browsing the platform to see images. NatGeo can therefore combine these themes by finding new perspectives for photographing nature and mountains. People are probably also drawn to nature images as a respite from their day-to-day routine, so we could try to increase engagement by posting at times of the workday when people find their days dragging (for example, at 2 pm, when the post-lunch slump hits).

Surprisingly, our low-engagement posts coincide with the topics that National Geographic posts about the most. The account posts many photos of the natural environment, yet these pictures are not receiving the desired engagement. This could be because people's Instagram feeds are oversaturated with such images. Posts with more rugged terrain tend to be less popular, possibly because they appeal to a different type of beauty. We suggest focusing on images that drive engagement, like those listed under high engagement, and capping the number of nature posts per week. Limiting the saturation of these images in people's feeds might encourage more engagement with nature posts in general.

One thing worth monitoring is how engagement by topic changes with the season. For example, more people might engage with travel-related images before planning a trip, which would increase engagement for certain topics. They may also want to see more images of people and human emotion during the holidays, and pictures of sunny, warm places in the winter. Tracking how engagement changes by season could be a good way to drive engagement by posting more targeted images at certain times of the year.
