# Overview



- **Purpose**: Scrape the follower/subscriber counts of the SNS accounts belonging to the musicians listed in the dataframe's `artist` column.


- **Target website**: http://www.chartmetric.io


- **Target SNS platforms**: Twitter, Instagram, Facebook, Spotify, Soundcloud, Youtube

<img src="screenshot2.png" width="800">

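### Approach 1: Selenium
- Log in to the account, loading the password from a pickle file
- Use Selenium to type the artist name into the search box
- Navigate to the artist page
- Scrape the text of the CSS elements that hold the SNS follower counts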

**Problems**
- Possibly because page load times vary widely, the script often fails with an element-not-found error.
- Adding `time.sleep` makes the run very slow, and the element-not-found error still occurs.
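
One common alternative to fixed `time.sleep` calls is to poll for the element until it appears or a timeout elapses. The helper below is a minimal, library-agnostic sketch of that idea; `wait_until` and its parameters are hypothetical names, not part of Selenium. Selenium provides the same pattern through `WebDriverWait(driver, timeout).until(...)`.

```python
import time


def wait_until(lookup, timeout=15.0, poll_interval=0.5):
    """Call `lookup` repeatedly until it returns without raising.

    `lookup` is any zero-argument callable, for example a lambda that wraps
    a Selenium element lookup. Raises TimeoutError if it never succeeds
    before `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return lookup()
        except Exception as exc:  # e.g. Selenium's NoSuchElementException
            last_error = exc
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"lookup did not succeed within {timeout}s"
            ) from last_error
        time.sleep(poll_interval)
```

With the driver used in this notebook, a call might look like `social = wait_until(lambda: driver.find_element_by_id("socialStats"))`. It returns as soon as the element exists instead of always waiting for the full sleep.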

In [1]:
import pandas as pd
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

import pickle
import time

In [2]:
debut_df = pd.read_csv("debut_album_1118.csv")

In [3]:
debut_df.shape[0]

1949

In [4]:
def get_followers(dataframe):
    
    # Build the list of musician names to type into the search box
    ls = list(dataframe['artist'])
    
    # Create the dataframe that will hold the scraped data
    sns_df = pd.DataFrame(columns=['artist', 'twitter', 'instagram', 'facebook', 'spotify', 'soundcloud', 'youtube'])
    
    # Go to the login page
    driver = webdriver.Chrome()
    driver.get('https://chartmetric.io/login')
    time.sleep(3)
    
    # Load the password from the pickle file
    pickle_pw = pickle.load(open("chartmetric_pw.pickle", "rb"))

    # Enter the ID and password, then log in
    driver.find_element_by_css_selector( "body > div > div:nth-child(2) > div.container.ng-scope > div > div > div > div > form > div:nth-child(4) > div > input" ).send_keys( "lucaseo1991@gmail.com" )
    driver.find_element_by_css_selector( "body > div > div:nth-child(2) > div.container.ng-scope > div > div > div > div > form > div:nth-child(5) > div > input" ).send_keys( pickle_pw )
    driver.find_element_by_css_selector("body > div > div:nth-child(2) > div.container.ng-scope > div > div > div > div > form > div:nth-child(7) > div > button").click()
    time.sleep(7)

    
    
    
    for artist in ls:
        
        # Go to the search page
        driver.get('https://chartmetric.io/search')
        time.sleep(10)

        # Enter the artist name
        driver.find_element_by_css_selector( "body > div:nth-child(1) > div:nth-child(2) > div > div > form > input" ).send_keys(artist)
        driver.find_element_by_css_selector("body > div:nth-child(1) > div:nth-child(2) > div > div > form > input").send_keys(Keys.ENTER)
        time.sleep(10)

        # Go to the artist page
        keyword = driver.find_element_by_css_selector("#artist > ul > li > div.media-body > div.media-heading > a").text
        driver.find_element_by_link_text(keyword).click()
        
            
        time.sleep(10)
        
        # Fetch the elements
        social = driver.find_element_by_id("socialStats")

        time.sleep(5)
        
        twitter = social.find_element_by_id("twitterfollowers-total-fans").text
        instagram = social.find_element_by_id("instagram-total-fans").text
        facebook = social.find_element_by_id("facebooklikes-total-fans").text
        spotify = social.find_element_by_id("spotify-total-fans").text
        soundcloud = social.find_element_by_id("soundcloud-total-fans").text
        youtube = social.find_element_by_id("youtubesubscribers-total-fans").text

        
        # Store the data
        data = {
            "artist": artist, 
            "twitter":twitter, 
            "instagram":instagram, 
            "facebook":facebook,
            "spotify":spotify, 
            "soundcloud":soundcloud,
            "youtube":youtube,
        }    
        
        sns_df.loc[len(sns_df)] = data
        
        time.sleep(2)
       
    driver.close()
        
    return sns_df
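
Because one missing element raises an exception and discards everything collected so far, it may help to isolate each artist's scrape. The sketch below assumes a hypothetical `scrape_one(artist)` callable that returns a dict of counts for one artist (roughly the body of the loop above). A failed artist is recorded as a row of `None` values, so the run continues and the failures can be retried later.

```python
SNS_COLUMNS = ["twitter", "instagram", "facebook",
               "spotify", "soundcloud", "youtube"]


def scrape_all(artists, scrape_one):
    """Scrape every artist, keeping going when a single artist fails.

    `scrape_one` is a hypothetical callable: artist name -> dict of counts.
    Returns (rows, failed), where `rows` is a list of dicts with one entry
    per artist and `failed` lists the artists whose scrape raised.
    """
    rows = []
    failed = []
    for artist in artists:
        try:
            stats = scrape_one(artist)
        except Exception:
            # Keep the artist in the output with empty values.
            stats = {}
            failed.append(artist)
        row = {"artist": artist}
        for column in SNS_COLUMNS:
            row[column] = stats.get(column)
        rows.append(row)
    return rows, failed
```

The returned `rows` can be turned into a dataframe with `pd.DataFrame(rows)`, and `failed` can be fed back into a second pass.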

In [None]:
df_1 = get_followers(debut_df)

In [None]:
print(df_1.shape)
df_1.head()

* * * *

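### Approach 2: requests, bs4

- Use the Selenium flow above to find each artist's unique ID
- Then scrape through the corresponding API

**Problems**
- With only the user agent in the headers, the error "No authorization token was found" is returned.
- With both the user agent and the authorization in the headers, a 400 error is returned.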

<img src="screenshot1.png" width="800">

In [6]:
import requests
from bs4 import BeautifulSoup

In [9]:
# e.g. musician name "eminem"
artist = 'eminem'

# Go to the login page
driver = webdriver.Chrome()
driver.get('https://chartmetric.io/login')
time.sleep(3)

# Load the password from the pickle file
pickle_pw = pickle.load(open("chartmetric_pw.pickle", "rb"))

# Enter the ID and password, then log in
driver.find_element_by_css_selector( "body > div > div:nth-child(2) > div.container.ng-scope > div > div > div > div > form > div:nth-child(4) > div > input" ).send_keys( "lucaseo1991@gmail.com" )
driver.find_element_by_css_selector( "body > div > div:nth-child(2) > div.container.ng-scope > div > div > div > div > form > div:nth-child(5) > div > input" ).send_keys( pickle_pw )
driver.find_element_by_css_selector("body > div > div:nth-child(2) > div.container.ng-scope > div > div > div > div > form > div:nth-child(7) > div > button").click()
time.sleep(10)

# Go to the search page
driver.get('https://chartmetric.io/search')
time.sleep(10)

# Enter the artist name
driver.find_element_by_css_selector( "body > div:nth-child(1) > div:nth-child(2) > div > div > form > input" ).send_keys(artist)
driver.find_element_by_css_selector("body > div:nth-child(1) > div:nth-child(2) > div > div > form > input").send_keys(Keys.ENTER)
time.sleep(5)

# Read the artist's unique ID from the first search result's link
artist_num = driver.find_element_by_css_selector("#artist > ul > li:nth-child(1) > div.media-body > div.media-heading > a").get_attribute("href")[33:]
artist_num

'236'
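
Slicing the `href` at a fixed offset (`[33:]`) depends on the exact length of the URL prefix. A slightly more tolerant sketch pulls the trailing digits out with a regular expression instead. `extract_artist_id` is a hypothetical helper, and the URLs in the comments are assumed shapes rather than values taken from the site.

```python
import re


def extract_artist_id(href):
    """Return the run of digits at the end of an artist link.

    Works for assumed shapes such as ".../artist?id=236" or
    ".../artist/236/". Raises ValueError when no trailing number exists.
    """
    match = re.search(r"(\d+)/?$", href)
    if match is None:
        raise ValueError(f"no artist id found in {href!r}")
    return match.group(1)
```

In the cell above this would replace the slice: `artist_num = extract_artist_id(element.get_attribute("href"))`, where `element` is the link found by the CSS selector.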

In [10]:
# API address for the artist's page
link = 'https://api.chartmetric.io/SNS/socialStat/cm_artist/' + str(artist_num) + '/twitter?fromDaysAgo=9999&valueColumn=followers'
link

'https://api.chartmetric.io/SNS/socialStat/cm_artist/236/twitter?fromDaysAgo=9999&valueColumn=followers'

In [11]:
# authorization

headers = {
    "autorization token" : "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOjcxNzEsInVzZXJFbWFpbCI6Imx1Y2FzZW8xOTkxQGdtYWlsLmNvbSIsInRpbWVzdGFtcCI6MTUyNDM4NDk0MjY3OSwiaWF0IjoxNTI0Mzg0OTQyLCJleHAiOjE1MjU1OTQ1NDJ9.QdtgaKfUlVkRiz9jqb7oZ9EdtXvhxUCMEeTgOr_Df_U",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}

In [12]:
response = requests.get(link, headers=headers)
dom = BeautifulSoup(response.content, "html.parser")
dom

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br/>
</p>
</body></html>
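
One possible cause of the 400 response is the header name itself: `"autorization token"` contains a space, and HTTP header field names may not contain whitespace (RFC 7230). The sketch below builds the headers with the standard `Authorization` field instead. The source does not confirm whether this API expects the bare token or a `Bearer ` prefix, so the `scheme` parameter is an assumption. `build_headers` and `is_valid_header_name` are hypothetical helpers.

```python
import re

# Characters that RFC 7230 allows in a header field name ("token").
_HEADER_NAME_RE = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def is_valid_header_name(name):
    """Return True if `name` is a syntactically valid HTTP header name."""
    return bool(_HEADER_NAME_RE.match(name))


def build_headers(token, user_agent, scheme="Bearer"):
    """Build request headers with a standard Authorization field.

    ASSUMPTION: the "Bearer" scheme is a guess. Pass scheme="" to send
    the bare token instead.
    """
    value = f"{scheme} {token}" if scheme else token
    return {"Authorization": value, "User-Agent": user_agent}
```

The result can be passed as `requests.get(link, headers=build_headers(token, user_agent))`. If the endpoint returns JSON, which is also an assumption, `response.json()` would likely fit better than parsing the body with BeautifulSoup.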

#### API endpoint per platform

- twitter : https://api.chartmetric.io/SNS/socialStat/cm_artist/200408/twitter?fromDaysAgo=9999&valueColumn=followers
- instagram : https://api.chartmetric.io/SNS/socialStat/cm_artist/200408/instagram?fromDaysAgo=9999
- facebook : https://api.chartmetric.io/SNS/socialStat/cm_artist/200408/facebook?fromDaysAgo=9999&valueColumn=likes
- spotify : https://api.chartmetric.io/SNS/socialStat/cm_artist/200408/spotify?fromDaysAgo=9999&valueColumn=popularity
- soundcloud : https://api.chartmetric.io/SNS/socialStat/cm_artist/200408/soundcloud?fromDaysAgo=9999
- youtube : https://api.chartmetric.io/SNS/socialStat/cm_artist/200408/youtube?fromDaysAgo=9999&valueColumn=subscribers
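
The six endpoints above differ only in the platform segment and the optional `valueColumn` parameter, so they can be generated from a small lookup table. This is a minimal sketch based only on the URLs listed here; `build_stat_url`, `API_BASE`, and `VALUE_COLUMNS` are hypothetical names.

```python
from urllib.parse import urlencode

API_BASE = "https://api.chartmetric.io/SNS/socialStat/cm_artist"

# valueColumn per platform, copied from the URLs listed above.
# None means the listed URL carries no valueColumn parameter.
VALUE_COLUMNS = {
    "twitter": "followers",
    "instagram": None,
    "facebook": "likes",
    "spotify": "popularity",
    "soundcloud": None,
    "youtube": "subscribers",
}


def build_stat_url(artist_id, platform, from_days_ago=9999):
    """Build the stats URL for one artist on one platform."""
    params = {"fromDaysAgo": from_days_ago}
    value_column = VALUE_COLUMNS[platform]
    if value_column is not None:
        params["valueColumn"] = value_column
    return f"{API_BASE}/{artist_id}/{platform}?{urlencode(params)}"
```

For example, `{p: build_stat_url(artist_num, p) for p in VALUE_COLUMNS}` yields all six links for a single artist.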