# User interaction on the Yandex media platform

## Scope of the project

**The goal**:
- Create a dashboard for managers

**Input data**: 
- User interaction data from the platform for 24.09.2019, from 18:28 to 19:00

**Analysis structure**:
* Data overview
* Terms of reference
* Links
* Results

In [47]:
import pandas as pd
from sqlalchemy import create_engine
from yaml import load, FullLoader

In [48]:
# read yaml file with configs

with open('config.yaml') as f:
    config = load(f, Loader=FullLoader)

In [49]:
user = config['db_config']['user']
pwd = config['db_config']['pwd']
host = config['db_config']['host']
port = config['db_config']['port']
db = config['db_config']['db']

In [50]:
# create connection str

connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
    user,
    pwd,
    host,
    port,
    db,
)
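If the user name or password contains characters such as `@`, `:` or `/`, the formatted string above may not parse as a valid URL. Below is a minimal sketch of one way to guard against that, using `urllib.parse.quote_plus` from the standard library. The helper name `build_connection_string` and the sample credentials are hypothetical and are not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(user, pwd, host, port, db):
    """Build a PostgreSQL URL, percent-encoding the credentials."""
    return 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
        quote_plus(user),
        quote_plus(pwd),
        host,
        port,
        db,
    )


# Hypothetical credentials, used only to show the escaping
example = build_connection_string('analyst', 'p@ss:word', 'localhost', 5432, 'zen')
print(example)
```

With values from `config.yaml`, the same helper could be called as `build_connection_string(user, pwd, host, port, db)`.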

In [51]:
# create connection

engine = create_engine(connection_string)

In [52]:
# create function for selecting

def select(query):
    return pd.read_sql(query, con=engine)

In [53]:
# write query

query = """
select *
from dash_visits
"""

In [54]:
visits = select(query)

## Data overview

In [57]:
visits.head(5)

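|   | record_id | item_topic | source_topic | age_segment | dt | visits |
|---|-----------|------------|--------------|-------------|----|--------|
| 0 | 1040597 | Деньги | Авто | 18-25 | 2019-09-24 18:32:00 | 3 |
| 1 | 1040598 | Деньги | Авто | 18-25 | 2019-09-24 18:35:00 | 1 |
| 2 | 1040599 | Деньги | Авто | 18-25 | 2019-09-24 18:54:00 | 4 |
| 3 | 1040600 | Деньги | Авто | 18-25 | 2019-09-24 18:55:00 | 17 |
| 4 | 1040601 | Деньги | Авто | 18-25 | 2019-09-24 18:56:00 | 27 |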


In [58]:
visits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30745 entries, 0 to 30744
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   record_id     30745 non-null  int64         
 1   item_topic    30745 non-null  object        
 2   source_topic  30745 non-null  object        
 3   age_segment   30745 non-null  object        
 4   dt            30745 non-null  datetime64[ns]
 5   visits        30745 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 1.4+ MB


There are 6 columns in the dataframe:

`record_id` - record identifier  
`item_topic` - topic of the card  
`source_topic` - topic of the source the card comes from  
`age_segment` - age group of the users  
`dt` - timestamp (to the minute)  
`visits` - number of visits  
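Before exporting, a few basic quality checks on these columns can be useful. The sketch below runs on a small stand-in frame built from three rows of the `head()` output above, not on the real `visits` table. The helper name `basic_checks` and the choice of checks are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for `visits`: three rows copied from the head() output above
sample = pd.DataFrame({
    'record_id': [1040597, 1040598, 1040599],
    'item_topic': ['Деньги', 'Деньги', 'Деньги'],
    'source_topic': ['Авто', 'Авто', 'Авто'],
    'age_segment': ['18-25', '18-25', '18-25'],
    'dt': pd.to_datetime([
        '2019-09-24 18:32:00',
        '2019-09-24 18:35:00',
        '2019-09-24 18:54:00',
    ]),
    'visits': [3, 1, 4],
})


def basic_checks(df):
    """Return a few simple data-quality indicators as plain integers."""
    return {
        # record_id is expected to identify each row uniquely
        'duplicate_record_ids': int(df['record_id'].duplicated().sum()),
        # total number of missing cells across all columns
        'missing_values': int(df.isna().sum().sum()),
        # visit counts are expected to be positive
        'non_positive_visits': int((df['visits'] <= 0).sum()),
    }


checks = basic_checks(sample)
print(checks)
```

On the real data the same function could be called as `basic_checks(visits)`.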

In [60]:
visits['item_topic'].unique()

array(['Деньги', 'Дети', 'Женская психология', 'Женщины', 'Здоровье',
       'Знаменитости', 'Интересные факты', 'Искусство', 'История',
       'Красота', 'Культура', 'Наука', 'Общество', 'Отношения',
       'Подборки', 'Полезные советы', 'Психология', 'Путешествия',
       'Рассказы', 'Россия', 'Семья', 'Скандалы', 'Туризм', 'Шоу', 'Юмор'],
      dtype=object)

In [62]:
visits['source_topic'].unique()

array(['Авто', 'Деньги', 'Дети', 'Еда', 'Здоровье', 'Знаменитости',
       'Интерьеры', 'Искусство', 'История', 'Кино', 'Музыка', 'Одежда',
       'Полезные советы', 'Политика', 'Психология', 'Путешествия',
       'Ремонт', 'Россия', 'Сад и дача', 'Сделай сам',
       'Семейные отношения', 'Семья', 'Спорт', 'Строительство',
       'Технологии', 'Финансы'], dtype=object)

In [63]:
visits['age_segment'].unique()

array(['18-25', '26-30', '31-35', '36-40', '41-45', '45+'], dtype=object)

In [66]:
visits['dt'].min()

Timestamp('2019-09-24 18:28:00')

In [67]:
visits['dt'].max()

Timestamp('2019-09-24 19:00:00')

In [68]:
# export to CSV for use in Tableau

visits.to_csv('dash_visits.csv', index=False)

## Terms of reference

After talking with managers and database administrators, we drew up brief terms of reference:  

* **Business task**: analyze user interaction with Yandex.Zen cards;  
* **Dashboard usage frequency**: at least once a week;  
* **Main users of the dashboard**: content analysis managers;  
* **Data for the dashboard**:
    * History of events by card topic (two graphs: absolute numbers and percentages);
    * Breakdown of events by source topic;
    * Table mapping source topics to card topics;
* **By what parameters should the data be grouped**:
    * Date and time;
    * The topic of the card;
    * The source topic;
    * Age group;
* **Chart priority**: all charts are equally important;
* **Data sources for the dashboard**: raw data on user interaction events with cards;
* **Database where the aggregated data will be stored**: additional aggregated tables in the zen database;
* **Data update frequency**: once a day, at midnight UTC.
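The three dashboard views listed above can be approximated in pandas. The dashboard itself was built in Tableau, so the following is only an illustrative sketch on made-up rows that share column names with `visits`. The variable names `absolute`, `share` and `topic_matrix` are hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as `visits`
sample = pd.DataFrame({
    'item_topic': ['Наука', 'Юмор', 'Наука', 'Юмор'],
    'source_topic': ['Кино', 'Кино', 'Спорт', 'Спорт'],
    'dt': pd.to_datetime(
        ['2019-09-24 18:28:00'] * 2 + ['2019-09-24 18:29:00'] * 2
    ),
    'visits': [30, 10, 20, 20],
})

# Absolute numbers: visits per minute and card topic
absolute = sample.pivot_table(
    index='dt', columns='item_topic', values='visits',
    aggfunc='sum', fill_value=0,
)

# Percentage: each card topic's share of that minute's visits
share = absolute.div(absolute.sum(axis=1), axis=0) * 100

# Source topic vs. card topic table
topic_matrix = sample.pivot_table(
    index='source_topic', columns='item_topic', values='visits',
    aggfunc='sum', fill_value=0,
)

print(absolute)
print(share)
print(topic_matrix)
```

Grouping by `age_segment` as well would only require adding it to `index` or `columns`.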

## Links

Link to the dashboard: https://public.tableau.com/views/Yandex_Dzen_dashboard/Dashboard?:language=en-US&:display_count=n&:origin=viz_share_link

## Results

* From 18:28 to 18:53, the total number of visits across all cards did not exceed 2.5k
* After that, visits jumped sharply, peaking at about 60k at 18:58
* The top 5 card topics by total visits for the period are Science, Relationships, Interesting Facts, Society, and Collections



* The top 5 card topics account for about 35% of all visits
* Their combined share stays close to that level throughout, except for a dip around 18:35, when the card topics “History” and “Useful Tips” sharply gain share
* The remaining card topics keep fairly stable proportions throughout the period
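The share figures above were read off the Tableau dashboard. A rough way to cross-check such a figure in pandas is sketched below on made-up totals. The helper name `top_n_share` and the topic labels are hypothetical.

```python
import pandas as pd

# Made-up visit counts per card topic
sample = pd.DataFrame({
    'item_topic': ['A', 'B', 'C', 'D'],
    'visits': [50, 30, 15, 5],
})


def top_n_share(df, column, n):
    """Percentage of all visits that falls on the n largest groups of `column`."""
    totals = df.groupby(column)['visits'].sum().sort_values(ascending=False)
    return float(totals.head(n).sum() / totals.sum() * 100)


top_share = top_n_share(sample, 'item_topic', 2)
print(top_share)
```

For the real data, the call would be `top_n_share(visits, 'item_topic', 5)`.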


* Top 3 source topics: “Family Relations”, “Russia”, and “Useful Tips”
* Together they account for about 30% of the total across all source topics
* Three source topics each account for less than 1% of the total: “Finance”, “Music”, and “Construction”
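Source topics that fall below a given share can be listed in a similar way. This is a minimal sketch on made-up numbers. The 1% threshold mirrors the observation above, while the variable names and topic labels are hypothetical.

```python
import pandas as pd

# Made-up visit counts per source topic
sample = pd.DataFrame({
    'source_topic': ['X', 'Y', 'Z'],
    'visits': [600, 395, 5],
})

# Share of each source topic in total visits, in percent
totals = sample.groupby('source_topic')['visits'].sum()
shares = totals / totals.sum() * 100

# Source topics whose share is below 1%
small_topics = sorted(shares[shares < 1].index)
print(small_topics)
```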