In [1]:
# pandas is a Data Analysis library
import pandas as pd

# Data import

Here I imported the data coming from the two datasets.
Two pandas DataFrames are built from the two csv files containing the datasets.

In [2]:
dataset_full_path = "data/"
edge_dataset_filename = "K-pop_edge.csv"
vertex_dataset_filename = "K-pop_node.csv"

edges = pd.read_csv(dataset_full_path + edge_dataset_filename)
vertexes = pd.read_csv(dataset_full_path + vertex_dataset_filename)

# Vertexes dataset

Dataset is composed by:
- **label** (record company)
- **group** (musical group)
- **artist**
    - male (male person)
    - female (female person)
    - person (non-specified sex person, it's not the sum of males and females)

In [3]:
vertexes

Unnamed: 0,id,type,name
0,4735,label,레인보우브릿지에이전시
1,4734,label,주식회사 스톰이앤에프
2,4733,label,가족액터스
3,4732,label,튠테이블 무브먼트
4,4731,label,오렌지엔터테인먼트
...,...,...,...
4669,66,group,bikiny
4670,65,group,Stellar
4671,64,group,S.I.D-Sound
4672,63,group,Xenos-5


# Edges dataset

This dataset models bidirectional relationships between vertexes.

- **label-label**: recording companies associations.
- **label-artist/group**: management relation.
- **artist-artist**: relationship between artists.
- **artist-group**: 
    - most of the times represents a "is-a-member-of" relationship
    - could be a "collaborates-with" relationship.
- **group-group**: 
    - represents the association between groups.
    - can be even used to model group-group collaboration.

In [4]:
edges

Unnamed: 0,source,target
0,4735,1782
1,4735,1393
2,4735,4188
3,4735,4187
4,4733,4635
...,...,...
5089,64,1496
5090,63,312
5091,62,2529
5092,62,2528


# Edges analysis

Some vertexes has higher grade, I need to find out why.

Total number of group-group edges is 242. The 20 vertexes with higher number of group-group relationships has an average of 16 relationships.

In [5]:
edge_distinct_counts = edges.groupby('source').count()

### Source vertexes stars

In [6]:
edge_distinct_counts = pd.DataFrame(edge_distinct_counts)

In [7]:
edge_distinct_counts

Unnamed: 0_level_0,target
source,Unnamed: 1_level_1
62,3
63,1
64,51
68,2
70,2
...,...
4729,1
4730,1
4731,1
4733,1


### How many group-group edges

In [8]:
# filtering to get only group vertexes
group_vertexes = vertexes[vertexes['type'] == 'group']

In [9]:
# data intesection between edges and group vertexes
# dropna() required to filter out NaN rows
group_to_anonymous_edges = edges.join(group_vertexes.set_index('id')).dropna()

# removing all the non group-group edges
group_vertexes_ids = group_vertexes['id']
group_to_group_edges = group_to_anonymous_edges[~group_to_anonymous_edges['target'].isin(group_vertexes_ids)]

# counting the grade of each node with all non group-group edges removed
group_to_group_edges_count = group_to_group_edges.groupby('source')['target'].count()

In [10]:
pd.DataFrame(group_to_group_edges_count).sort_values(by='target', ascending=False).head(20)

Unnamed: 0_level_0,target
source,Unnamed: 1_level_1
4629,40
4603,30
4611,28
4616,22
4619,20
4628,19
4633,17
4617,16
4635,16
4639,14


# Vertexes analysis

Most of the highest grade vertexes are recording companies.

Except a couple of outliers, the average number of the 20 most connected vertexes is around 15/20.

In [11]:
vertex_labels_distinct_counts = vertexes.groupby('type')['id'].nunique()

In [12]:
pd.DataFrame(vertex_labels_distinct_counts)

Unnamed: 0_level_0,id
type,Unnamed: 1_level_1
female,805
group,967
label,269
male,1383
person,1250


## Vertexes with highest grade

In [13]:
vertexes_ordered_by_grade = edge_distinct_counts.sort_values(by=['target'], ascending=False)

In [14]:
vertexes_ordered_by_grade

Unnamed: 0_level_0,target
source,Unnamed: 1_level_1
4563,176
4552,110
231,100
4541,65
4540,63
...,...
3209,1
3212,1
3215,1
3218,1


### Highest grade vertexes informations

This is a data intersection between the highest grade vertexes and all the vertexes.

In [15]:
vertexes_ordered_by_grade.join(vertexes.set_index('id')).head(20)

Unnamed: 0_level_0,target,type,name
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4563,176,label,S.M.Entertainment
4552,110,label,LOEN Entertainment
231,100,group,SMTOWN
4541,65,label,YG Entertainment
4540,63,label,JYP Entertainment
64,51,group,S.I.D-Sound
4518,50,label,Cube Entertainment
4570,47,label,Mnet Media
4536,46,label,DSP Entertainment
4668,46,label,유니버설 뮤직 그룹


### Highest grade group vertexes

This is a data intersection between the highest grade vertexes and the group vertexes.

In [16]:
# filtering to get only group vertexes
group_vertexes = vertexes[vertexes['type'] == 'group']

In [17]:
# dropna() required to filter out NaN rows
vertexes_ordered_by_grade.join(group_vertexes.set_index('id')).dropna().head(20)

Unnamed: 0_level_0,target,type,name
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
231,100,group,SMTOWN
64,51,group,S.I.D-Sound
152,23,group,EXO
618,22,group,Five Dolls
99,20,group,Girls’ Generation
135,19,group,Super Junior-K.R.Y.
2350,18,group,Xing
361,18,group,EXID
348,17,group,After School
451,17,group,T-ARA
