@NatGeo's August Anomaly

In August 2016, the @NatGeo Instagram account experienced a spike in the number of likes per photo. The following month, September, National Geographic published the book @NatGeo: The Most Popular Instagram Photos. Promotional activity in the lead-up to the book's release appears to have driven additional traffic to the @NatGeo account (at least, this is the only explanation I could find in my research). Still, I wanted to run a statistical test to estimate the probability that this spike occurred by random chance alone.

Because the account's average likes per photo is roughly constant over a short period around this anomaly, I decided to treat each post as an independent event. This allows me to use a t-test to compare the means of two groups: the August posts and the non-August posts. I put all posts from the 31 days of August into one bucket, and the posts from the final two weeks of July together with the first two weeks of September into another.
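The bucketing step can be sketched in a few lines of Python. This is only an illustration: the exact cut-off days for "the final two weeks of July" and "the first two weeks of September" are assumed, and `split_posts`, `in_window`, and the example posts are hypothetical names and made-up values, not the notebook's actual code or data.

```python
from datetime import date

# Assumed window boundaries (inclusive). The text only says "final two weeks
# of July" and "first two weeks of September"; these exact days are a guess.
AUGUST = (date(2016, 8, 1), date(2016, 8, 31))
BASELINE_WINDOWS = [
    (date(2016, 7, 18), date(2016, 7, 31)),
    (date(2016, 9, 1), date(2016, 9, 14)),
]


def in_window(day, window):
    """Return True if `day` falls inside the inclusive (start, end) window."""
    start, end = window
    return start <= day <= end


def split_posts(posts):
    """Split (date, likes) pairs into August likes and baseline likes.

    Posts that fall outside every window are ignored.
    """
    august, baseline = [], []
    for day, likes in posts:
        if in_window(day, AUGUST):
            august.append(likes)
        elif any(in_window(day, w) for w in BASELINE_WINDOWS):
            baseline.append(likes)
    return august, baseline


# Made-up posts for illustration only (not real @NatGeo data).
example_posts = [
    (date(2016, 7, 10), 300_000),  # before the baseline window: ignored
    (date(2016, 7, 20), 310_000),  # baseline
    (date(2016, 8, 15), 450_000),  # August
    (date(2016, 9, 5), 320_000),   # baseline
    (date(2016, 9, 20), 305_000),  # after the baseline window: ignored
]
august_likes, baseline_likes = split_posts(example_posts)
print(august_likes, baseline_likes)
```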

This gives 245 posts inside the August window and 253 outside it. Plotting the two distributions clearly reveals differing trends.

So, what is the probability that the Baseline Mean and the Anomaly Mean are different by random chance alone?

Independent t-test formula

Let A and B represent the two groups to compare. Let m_A and m_B represent the means of groups A and B, respectively. Let n_A and n_B represent the sizes of groups A and B, respectively.

The t-test statistic for testing whether the means differ is calculated as follows:

$$t = \frac{m_A - m_B}{\sqrt{\frac{S^2}{n_A} + \frac{S^2}{n_B}}}$$

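S² is an estimator of the common variance of the two samples. It is calculated as follows, where the first sum runs over the observations x in group A and the second over the observations x in group B:

$$S^2 = \frac{\sum (x - m_A)^2 + \sum (x - m_B)^2}{n_A + n_B - 2}$$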

The t-test statistic is compared against the critical value of Student's t distribution, which is looked up in a t-test table for the significance level alpha of your choice (often 5%). The degrees of freedom (df) used in this test are:

$$df = n_A + n_B - 2$$

If the absolute value of the t-test statistic (|t|) is greater than the critical value, the difference is significant; otherwise it cannot be considered significant. The p-value is the probability, read from the t distribution, of observing a |t| at least this large if the two means were actually equal. The test is appropriate only when the two groups of samples (A and B) are each approximately normally distributed and have equal variances. If the variances of the two groups differ, Welch's t-test can be used instead.
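The procedure above can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: `pooled_t_test`, `welch_t_test`, and `sum_sq_dev` are hypothetical helper names, and the two short lists are toy data. The Welch variant uses the standard Welch-Satterthwaite approximation for the degrees of freedom.

```python
import math
from statistics import fmean


def sum_sq_dev(xs):
    """Sum of squared deviations of xs from their own mean."""
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs)


def pooled_t_test(a, b):
    """Independent t-test assuming equal variances; returns (t, df)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled estimate S^2 of the common variance.
    s2 = (sum_sq_dev(a) + sum_sq_dev(b)) / df
    t = (fmean(a) - fmean(b)) / math.sqrt(s2 / n_a + s2 / n_b)
    return t, df


def welch_t_test(a, b):
    """Welch's t-test for unequal variances; returns (t, df)."""
    n_a, n_b = len(a), len(b)
    # Per-group squared standard errors of the mean.
    se2_a = sum_sq_dev(a) / (n_a - 1) / n_a
    se2_b = sum_sq_dev(b) / (n_b - 1) / n_b
    t = (fmean(a) - fmean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Toy data for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_pooled, df_pooled = pooled_t_test(group_a, group_b)
t_welch, df_welch = welch_t_test(group_a, group_b)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=True)` (or `equal_var=False` for Welch) computes the same statistic and also returns a p-value.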

Plugging in some real numbers

Running those equations on the likes per photo for the two groups gives the following:

The t-statistic is -10.431478262230554
The degrees of freedom is 496.0
The critical value is 1.6479315230387215
The p-value is 6.558e-23

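That p-value is essentially zero: if the August and baseline posts really shared the same mean, a difference this large would almost never arise by chance alone. The spike is therefore statistically significant. The test cannot say what caused it, though; if it wasn't the release of @NatGeo's book, then there must be some other explanation.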
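As a quick check of the decision rule, the reported numbers can be plugged in directly. The sketch below also compares the reported critical value with the standard normal quantile, which Student's t approaches when df is this large. That comparison is an approximation, and reading the critical value as a one-sided 5% threshold is an inference from how close the two numbers are, not something stated in the analysis.

```python
from statistics import NormalDist

# Values reported above.
t_stat = -10.431478262230554
df = 496
critical_value = 1.6479315230387215

# Decision rule: the difference is significant if |t| exceeds the critical value.
is_significant = abs(t_stat) > critical_value

# Approximation only: with df = 496, Student's t is close to the standard
# normal, whose one-sided 5% quantile is about 1.645.
normal_critical = NormalDist().inv_cdf(0.95)

print(is_significant, round(normal_critical, 3))
```

Here |t| is roughly six times the critical value, so the conclusion does not hinge on the exact significance level chosen.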
