**Note: A diagram should be displayed here, but it does not render on GitHub.**
**You might need to clone the repo and visualize it locally.**

In [1]:
import pandas as pd
from utils import check_likes_same, groupby_hierarchy, load_dataset, plot_brands_compset

# Data Understanding

## 1. Exploratory Data Analysis

We explored the data using, for example, the following diagram:

In [6]:
df = load_dataset()
_, _, aggregated_df_by_parent_company = groupby_hierarchy(df)
plot_brands_compset(aggregated_df_by_parent_company, 'parent_company', ['Nestle'], color='brand')

This allowed us to uncover the underlying structure of the dataset:

<img src="./doc/structure_data.png" width="800" height="500">

Once we understood the data's structure, we were able to dig deeper into the dataset.

For a given brand, the quantitative data (likes, followers...) had the same numerical values across the different "compset" and "compset_group" entries:

In [2]:
df = pd.read_csv("./data/skylab_instagram_datathon_dataset.csv", delimiter=";")
check_likes_same(df, 'Versace', '2023-03-25', 'Luxury & Premium & Mainstream')

False

Looking further into the data, we found that around 1/7 of the rows had missing values. We started by checking whether these NAs formed time intervals or were randomly spread across the data (this check can be found in lstm.ipynb).

We noticed that in 99% of cases these missing values formed time intervals at the beginning of a brand's recorded time series. The only exception was a single brand, which we decided to drop for simplicity and relevance.

With more time, we would have filled in the missing data using a machine learning model or a moving average estimate.
Instead, we chose to analyse only the past 4 years, which contain only a very small percentage of missing values.
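The check described above (are the missing values grouped at the start of each brand's series?) and the restriction to a recent window could be sketched as follows. This is a minimal illustration on toy data, not the code from lstm.ipynb; the column names `brand`, `date` and `likes` and the helper `leading_na_share` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def leading_na_share(series: pd.Series) -> float:
    """Share of a series' missing values that sit in one block at its start."""
    is_na = series.isna().to_numpy()
    total_na = int(is_na.sum())
    if total_na == 0:
        return 1.0  # nothing missing, so nothing is out of place
    # Count the consecutive NaNs before the first observed value.
    leading = len(is_na) if is_na.all() else int(np.argmax(~is_na))
    return leading / total_na

# Toy data: column names are assumptions, not the real dataset schema.
toy = pd.DataFrame({
    "brand": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2023-01-07", periods=5, freq="7D")) * 2,
    "likes": [np.nan, np.nan, 10.0, 12.0, 11.0, 5.0, np.nan, 6.0, 7.0, 8.0],
})
toy = toy.sort_values(["brand", "date"])

# Brand A: all NaNs are leading. Brand B: its NaN is in the middle.
shares = toy.groupby("brand")["likes"].apply(leading_na_share)

# Keep only a recent window (2 weeks here; the write-up uses 4 years).
cutoff = toy["date"].max() - pd.Timedelta(weeks=2)
recent = toy[toy["date"] >= cutoff]
```

A share of 1.0 means every gap of that brand sits at the start of its series, so trimming the early period removes the gaps without any imputation.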

Furthermore, we assumed that every brand could be invested in, either through its primary exchange market or, where that value was missing in the dataset, as a "private" company.
Based on the data and on outside knowledge, we assumed that each brand owns only one Instagram account, which explains why the stats are identical across the different categories (compset and compset_group).
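The one-account-per-brand assumption can be tested directly: if it holds, each brand should carry a single value per metric on any given date, whatever category it is listed under. The sketch below shows one way to do this with pandas. It is not the `check_likes_same` helper from `utils`, and the column names and the function `metrics_consistent` are hypothetical.

```python
import pandas as pd

def metrics_consistent(df, metrics, keys=("brand", "date")):
    """Return True if each (brand, date) pair carries one value per metric."""
    # Count distinct values per group; NaN counts as a value of its own.
    distinct = df.groupby(list(keys))[list(metrics)].nunique(dropna=False)
    return bool((distinct <= 1).all().all())

# Toy rows: brand X is listed under two compsets on the same date.
toy = pd.DataFrame({
    "brand": ["X", "X", "Y"],
    "date": ["2023-03-25"] * 3,
    "compset": ["Luxury", "Mainstream", "Luxury"],
    "likes": [120, 120, 80],
    "followers": [1000, 1000, 700],
})
consistent = metrics_consistent(toy, ["likes", "followers"])

# Changing one of the duplicated rows breaks the assumption.
broken = toy.copy()
broken.loc[1, "likes"] = 999
inconsistent = metrics_consistent(broken, ["likes", "followers"])
```

If the check passes on the full dataset, the duplicated rows can be collapsed to one row per brand and date before modelling.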

# 2. Modelling approach

### Feature Engineering

We started by creating some new features which would be more relevant to the popularity of the brand:

In [7]:
# Total content posted and total engagement received
df['content'] = df['videos'] + df['pictures']
df['engagement'] = df['likes'] + df['comments']
# Engagement ratios
df['likes_to_followers'] = df['likes'] / df['followers']
df['comments_to_likes'] = df['comments'] / df['likes']
# Average engagement per piece of content
df['content_engagement'] = df['engagement'] / df['content']

Features such as engagement ratios and content engagement give additional insight into user behavior and content performance on Instagram. They show how users interact with content relative to the brand's follower base, how much engagement each piece of content receives, and how comments compare to likes, which may indicate the level of interaction and interest in the content.
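One practical caveat with ratio features: in pandas, dividing a float column by zero yields `inf` (or `NaN` for 0/0), which can distort later statistics if some rows have no likes, followers or content. Whether such rows exist in this dataset is not stated here, so the following is only a defensive sketch. The helper `safe_ratio` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise ratio that yields NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Toy data with zero denominators in the second and third rows.
toy = pd.DataFrame({
    "likes": [100.0, 0.0, 50.0],
    "comments": [10.0, 5.0, 0.0],
    "followers": [1000.0, 0.0, 500.0],
})
toy["likes_to_followers"] = safe_ratio(toy["likes"], toy["followers"])
toy["comments_to_likes"] = safe_ratio(toy["comments"], toy["likes"])
```

Turning undefined ratios into `NaN` keeps them out of means and correlations, where pandas skips missing values by default.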

### Modelling

We started with a simple approach, comparing the average popularity of each brand over all categories (compset_group), since each company owns a single social media account for the whole brand.

We then wanted to remove confounders that could affect the stats, so we also compared each brand to the companies in the same field (compset/compset_group).

<img src="./doc/outliers_detection.png" width="800" height="400">
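The write-up does not state which statistic was used for the peer comparison. One common option, shown here purely as an assumption, is a z-score of each brand against the other brands of its compset on the same date. The column names and the function `peer_zscore` are hypothetical.

```python
import pandas as pd

def peer_zscore(df, value="engagement", peers=("compset", "date")):
    """Standardise `value` against the brands sharing the same compset and date."""
    grouped = df.groupby(list(peers))[value]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample standard deviation (ddof=1)
    return (df[value] - mean) / std

# Toy data: three fashion brands and one food brand on the same date.
toy = pd.DataFrame({
    "brand": ["A", "B", "C", "D"],
    "compset": ["Fashion", "Fashion", "Fashion", "Food"],
    "date": ["2023-03-25"] * 4,
    "engagement": [10.0, 20.0, 30.0, 50.0],
})
toy["peer_z"] = peer_zscore(toy)
```

A brand with a large absolute z-score stands out from its peers. A compset with a single brand has no defined standard deviation, so its score is `NaN`.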


We transitioned to a machine learning approach by implementing an LSTM model. The rationale behind this approach was to leverage historical data to predict future trends in popularity. 

By training the LSTM on data up to a certain point, excluding the most recent 3 months, we aimed to capture patterns and behaviors over time without incorporating the latest trends. Deliberately holding out the recent data gives a baseline of the expected longer-term trend, and the gap between that baseline and reality is the signal relevant to investors or marketing strategies.

The LSTM model was tasked with forecasting the popularity of the brand for the subsequent 3 months. We could then compare these predictions with the actual popularity trends to assess the deviations.
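The hold-out and deviation logic can be illustrated independently of the model. In the sketch below a naive forecaster (the mean of the last few training points) stands in for the LSTM so that the example stays short and runnable. It is a placeholder, not the model from lstm.ipynb, and all names are hypothetical.

```python
import pandas as pd

def split_holdout(series: pd.Series, months: int = 3):
    """Split a date-indexed series into history and the final `months` months."""
    cutoff = series.index.max() - pd.DateOffset(months=months)
    return series[series.index <= cutoff], series[series.index > cutoff]

def naive_forecast(train: pd.Series, index: pd.Index, window: int = 4) -> pd.Series:
    """Placeholder forecaster (NOT an LSTM): repeats the mean of the last `window` points."""
    return pd.Series(train.iloc[-window:].mean(), index=index)

# Toy monthly popularity series with a drop in the final month.
dates = pd.date_range("2023-01-01", periods=12, freq="MS")
popularity = pd.Series([10.0] * 11 + [4.0], index=dates)

train, actual = split_holdout(popularity)
forecast = naive_forecast(train, actual.index)

# Relative deviation of reality from the forecast; negative means underperformance.
deviation = (actual - forecast) / forecast
```

Swapping `naive_forecast` for a trained model's predictions leaves the rest unchanged: large positive or negative values in `deviation` mark the periods worth investigating.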

### Model Evaluation

To evaluate our results, we used the following:
- We compared our results with stock prices and revenue to see whether deviations in popularity were actually followed by deviations in those figures.
- We also compared them with data from the Google Trends API to check whether these deviations were consistently reflected there.

For instance, with H&M we noticed an anomaly that could be explained by a racist post, which caused a scandal and led to a negative deviation.
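The comparison against an external series such as stock prices or search interest could be approached with a lagged correlation, which asks whether a popularity deviation tends to be followed by a move in the other series some periods later. This is only one possible way to do it, on synthetic data, and the function `lagged_correlation` is hypothetical.

```python
import pandas as pd

def lagged_correlation(signal: pd.Series, target: pd.Series, max_lag: int = 4) -> dict:
    """Correlation between signal[t] and target[t + lag] for each lag."""
    return {lag: signal.corr(target.shift(-lag)) for lag in range(max_lag + 1)}

# Synthetic example: the target repeats the signal one period later.
signal = pd.Series([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0])
target = signal.shift(1)

correlations = lagged_correlation(signal, target)
best_lag = max(correlations, key=correlations.get)
```

A high correlation at a positive lag is compatible with popularity leading the external series, but it does not establish causation on its own.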

### Model Enhancement

- Combine with stock/revenue: We started combining the given dataset with revenue and stock prices using Yahoo's API, but too many stocks were missing and we did not have time to try another source. This would have provided valuable insights into investment opportunities.

# 3. Results Interpretation

Looking at a specific brand, in this case H&M, we can derive concrete business value:
- Consulting for content creation and social media strategy: e.g., favor posting videos over pictures, or measure the correlation between social media activity and revenue.
- Investment opportunities: If we see that a company is gaining in popularity, we could look for a causal link between its popularity, its stock and the success of the company, and thus find investment opportunities.

<img src="./doc/H&M.png" width="800" height="600">