# Data Exploration

Let's dive into the dataframe of favorite members to see what the data looks like and how we are going to use it.

**DISCLAIMER**: This study is for my personal machine learning & recommender system study; it does not reflect the actual performance of the members. The data was gathered from Twitter on 2018/02/14. Even though Twitter is not popular in Thailand, it is the only source that allowed me to fetch the unique IDs of followers (even though I hit its rate limit and had to retry several times).

In [1]:
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_notebook

output_notebook()

## Rating Implication

I decided to use **implicit ratings** rather than explicit ones because I could not find a source of explicit rating data!

The only source left from which I could grab the behaviour of individual identities was *Twitter*, even though it was not very popular in TH and there was no official page for each member. I searched for pages by member name + BNK48 and used my personal judgement + mood to choose among them. So, instead of putting huge effort into scaling a rating for each user, I decided to infer it: if a user follows a member's page, the rating is `1` = like; if they do not follow it, the rating is `0`, meaning they never got to know how lovely the member is.
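As a minimal sketch of this implicit-rating encoding, the toy example below turns follow events into a 0/1 matrix. The `follows` frame, its follower IDs, and the `follower_id` column name are made up for illustration; `pd.crosstab` is just one way to build such a matrix and is not necessarily what the rest of this notebook does.

```python
import pandas as pd

# Hypothetical follow events: one row per (member page, follower) pair.
follows = pd.DataFrame({
    'member': ['Cherprang', 'Pun', 'Cherprang', 'Music'],
    'follower_id': [101, 101, 102, 103],
})

# Implicit rating: 1 if the user follows the member's page, 0 otherwise.
# clip(upper=1) caps the value at 1 in case a pair appears more than once.
implicit = pd.crosstab(follows['follower_id'], follows['member']).clip(upper=1)
print(implicit)
```

Each row is one follower and each column one member, so a `0` simply means "no follow observed", not an explicit dislike.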

In [2]:
rating_list = pd.read_csv('follower.csv')
del rating_list['Unnamed: 0']
rating_count = len(rating_list)
rating_count

201229

In [3]:
distinct_rating = rating_list.drop_duplicates().copy()
distinct_rating_count = len(distinct_rating)
distinct_rating_count

141101

In [4]:
most_popular = distinct_rating.groupby('member').count().sort_values('follwer_id', ascending=False)
most_popular.rename(columns={'follwer_id': 'follower_count'}, inplace=True)
most_popular[:10]

| member | follower_count |
|---|---|
| Izurina | 28088 |
| Cherprang | 23799 |
| Pun | 14825 |
| Music | 11464 |
| Jan | 7690 |
| Kaew | 7076 |
| Orn | 6450 |
| Jennis | 5898 |
| Mobile | 5362 |
| Tarwaan | 4808 |


In [5]:
import math
from bokeh.plotting import show 

# Note: Bokeh 3.x renamed plot_width/plot_height to width/height.
p = figure(plot_width=800, plot_height=400, x_range=list(most_popular.index))
p.vbar(top=most_popular['follower_count'], x=list(most_popular.index), width=0.6)

p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2

show(p)

## Feel something strange?

As we know, Izurina is not as popular as Cherprang or Pun. The data was grabbed from Twitter, which is not very popular in TH. Only a few fans follow the members' fan pages on Twitter, but Izurina converted her original AKB48 page into her BNK48 page, and many Japanese fans still follow her.

In [6]:
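# Every remaining (follower, member) row is one follow, i.e. an implicit rating of 1.
distinct_rating['count'] = 1

# Pivot to a follower x member matrix; pairs with no follow become 0.
rating_matrix = distinct_rating.pivot(index='follwer_id', columns='member', values='count')
rating_matrix.fillna(0, inplace=True)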

In [7]:
if 'oshi_count' in rating_matrix:
    del rating_matrix['oshi_count']
rating_matrix['oshi_count'] = rating_matrix.sum(axis=1)
rating_matrix['oshi_count'].mean()

2.1597863188991444

In [8]:
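def tan_oshi_ratio(rating_matrix, member):
    """Return the share of `member`'s followers who follow only that member."""
    # All users who follow this member
    oshies = rating_matrix[rating_matrix[member] == 1]
    total_oshi = oshies[member].count()
    # "Tan-oshi": followers whose oshi_count is 1, so this member is their only follow
    tan_oshi = len(oshies[oshies['oshi_count'] == 1])
    return tan_oshi / total_oshi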
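To make the ratio concrete, here is a small hedged example on a made-up matrix. The `toy` frame, the member names `A` and `B`, and the helper `single_follow_ratio` are hypothetical; the helper computes the same quantity as the function above, using a boolean mean instead of two counts.

```python
import pandas as pd

# Hypothetical 0/1 follow matrix: rows are users, columns are members.
toy = pd.DataFrame(
    {'A': [1, 1, 1, 0], 'B': [0, 1, 0, 1]},
    index=['u1', 'u2', 'u3', 'u4'],
)
toy['oshi_count'] = toy[['A', 'B']].sum(axis=1)

def single_follow_ratio(matrix, member):
    """Share of `member`'s followers who follow nobody else."""
    followers = matrix[matrix[member] == 1]
    return (followers['oshi_count'] == 1).mean()

ratio_a = single_follow_ratio(toy, 'A')  # u1 and u3 out of A's three followers
ratio_b = single_follow_ratio(toy, 'B')  # only u4 out of B's two followers
print(ratio_a, ratio_b)
```

A member whose ratio sits far above the average has an unusually large share of followers who follow nobody else in the group.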

In [9]:
tan_oshi_ratios = [tan_oshi_ratio(rating_matrix, member) for member in rating_matrix.columns]
# Drop the last entry: it belongs to the helper column 'oshi_count', not to a member.
del tan_oshi_ratios[-1]

In [10]:
members = list(rating_matrix.columns)[:-1]
tan_ratios_mean = sum(tan_oshi_ratios) / len(members)

p = figure(plot_width=800, plot_height=450, x_range=members, title='Tan-Oshi Ratio')

colors = ['#FDE724'] * len(members)
colors[members.index('Izurina')] = '#35B778'
p.vbar(top=tan_oshi_ratios, x=members, width=0.9, color=colors, legend='Tan-Oshi per Total-Oshi')
p.line(x=members, y=[tan_ratios_mean] * len(members), color='#440154', line_width=2, line_dash='dotdash', legend='Average')

p.xgrid.grid_line_color = None
p.y_range.start = 0.
p.xaxis.major_label_orientation = math.pi/2

show(p)

Even though the data looks strange to me, I decided to keep it so that I can test the attack resistance of each algorithm I apply in the future.

In [16]:
distinct_rating.rename(columns={'follwer_id': 'follower_id'})[['member', 'follower_id']]\
    .to_csv('ratings.csv', index=False)

In [18]:
del rating_matrix['oshi_count']

## User reaction

In [19]:
num_member_by_users = rating_matrix.sum(axis=1).rename('count')
user_member_hist = num_member_by_users.to_frame()\
                            .groupby('count')['count'].count()
user_member_hist

count
1.0     45201
2.0      7374
3.0      3625
4.0      2139
5.0      1531
6.0      1058
7.0       787
8.0       608
9.0       539
10.0      459
11.0      422
12.0      342
13.0      304
14.0      255
15.0      169
16.0      160
17.0      139
18.0      102
19.0       56
20.0       29
21.0       16
22.0       10
23.0        4
24.0        1
26.0        1
Name: count, dtype: int64

In [32]:
p = figure(plot_width=800, plot_height=450, title='Distribution of the number of members followed per user')

p.vbar(top=user_member_hist, x=list(user_member_hist.index),
       width=0.9)

p.y_range.start = 0
show(p)

### Users who follow multiple members

In [36]:
user_member_hist.sum() - user_member_hist.loc[1]

20130

In [37]:
1 - (user_member_hist.loc[1] / user_member_hist.sum())

0.3081232492997199
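The two cells above give the number and the share of users who follow more than one member. As a rough sketch of the same arithmetic on a made-up histogram (the `hist` values are hypothetical, not taken from the data), the example below also recovers the average number of members followed per user from the histogram alone.

```python
import pandas as pd

# Hypothetical histogram: index = members followed, value = number of users.
hist = pd.Series({1: 6, 2: 3, 3: 1})

total_users = hist.sum()
multi_users = hist[hist.index > 1].sum()  # users following 2 or more members
multi_share = multi_users / total_users

# Average members followed per user, weighted by the histogram counts.
mean_follows = (hist.index.to_numpy() * hist.to_numpy()).sum() / total_users
print(multi_users, multi_share, mean_follows)
```

Applied to the real `user_member_hist`, the weighted mean should agree with the `oshi_count` mean computed earlier, which makes it a cheap consistency check.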