# Lyric Crawler for `music.163.com`

This is a first attempt at gathering lyric data. We start with NetEase Cloud Music and see how it goes.

## Prerequisites
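- Selenium Python bindings
  - `pip install "selenium<4"` does the job; the cells below use the Selenium 3 `find_element_by_*` helpers, which current Selenium 4 releases no longer provide
- Chrome WebDriver (ChromeDriver)
  - Download it from [here](https://sites.google.com/a/chromium.org/chromedriver/downloads) and copy the binary to a directory on your `$PATH`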

## Copyright Notice
The lyrics that appear in this notebook are not covered by this repository's GNU General Public License. All rights to the lyrics are reserved by their original authors, who are credited in comments at the corresponding places in this notebook.

In [1]:
from selenium import webdriver
from urllib.request import urlopen
from re import findall

## Indexing song info from an artist page

### Scratch

In [2]:
artist_page_url = 'http://music.163.com/#/artist?id=2111'   # 崔健 (Cui Jian)

In [3]:
driver = webdriver.Chrome()
driver.get(artist_page_url)
frames = driver.find_elements_by_tag_name('iframe')
print(len(frames))

2


In [4]:
driver.switch_to.frame(frames[0])   # switch_to_frame() is deprecated; the song table is read from the first iframe
table_xpath = '/html/body/div/div/div/div/div/div/div/div/div/table'
table = driver.find_element_by_xpath(table_xpath)

In [5]:
row_xpath = '/html/body/div/div/div/div/div/div/div/div/div/table/tbody/tr'
rows = table.find_elements_by_xpath(row_xpath)

In [6]:
songs = []
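for row in rows:
    # Each row holds one <a> whose href is the song page and whose <b> child carries the title.
    href_xpath = 'td/div/div/div/span/a'
    href = row.find_element_by_xpath(href_xpath)
    title_xpath = 'b'
    title = href.find_element_by_xpath(title_xpath)
    link = href.get_attribute('href')
    name = title.get_attribute('title')
    # Raw string so that \d is a regex escape rather than a string escape.
    song_id = findall(r'id=(\d+)', link)[0]
    songs.append((name, link, song_id))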
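
The regular expression above pulls the digits that follow `id=` out of the link. As an alternative sketch, the standard library's URL parser can read the `id` query parameter directly, and it can be made to handle the `#/song?id=...` form used for page URLs in this notebook as well. The helper name `extract_song_id` is made up for illustration and is not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs


def extract_song_id(link):
    """Return the value of the `id` query parameter in `link`, or None."""
    parts = urlparse(link)
    # Page URLs such as http://music.163.com/#/song?id=63692 keep the query
    # inside the fragment, so fall back to parsing the fragment as a URL.
    query = parts.query or urlparse(parts.fragment).query
    ids = parse_qs(query).get('id')
    return ids[0] if ids else None


print(extract_song_id('http://music.163.com/song?id=63692'))    # 63692
print(extract_song_id('http://music.163.com/#/song?id=63692'))  # 63692
```

Unlike indexing into the `findall` result, this returns `None` instead of raising when a link carries no `id` parameter, which may be easier to handle when a row is malformed.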

In [7]:
songs

[('花房姑娘', 'http://music.163.com/song?id=63692', '63692'),
 ('假行僧', 'http://music.163.com/song?id=63612', '63612'),
 ('新长征路上的摇滚', 'http://music.163.com/song?id=63628', '63628'),
 ('一块红布', 'http://music.163.com/song?id=63627', '63627'),
 ('一无所有', 'http://music.163.com/song?id=63677', '63677'),
 ('快让我在雪地上撒点野', 'http://music.163.com/song?id=26172016', '26172016'),
 ('是否', 'http://music.163.com/song?id=63795', '63795'),
 ('从头再来', 'http://music.163.com/song?id=63683', '63683'),
 ('快让我在这雪地上撒点野', 'http://music.163.com/song?id=63624', '63624'),
 ('让世界充满爱-幸福平安(童声合唱)', 'http://music.163.com/song?id=5285232', '5285232'),
 ('草帽歌 - (日本电影《人证》插曲)', 'http://music.163.com/song?id=63791', '63791'),
 ('蓝色的骨头\xa0(Live)', 'http://music.163.com/song?id=499957703', '499957703'),
 ('生活', 'http://music.163.com/song?id=63736', '63736'),
 ('不再掩饰', 'http://music.163.com/song?id=63686', '63686'),
 ('飞了', 'http://music.163.com/song?id=63611', '63611'),
 ('浪子归', 'http://music.163.com/song?id=63747', '63747'),
 ('让世界充

In [8]:
driver.quit()   # quit() ends the whole WebDriver session; close() only closes the current window

### Refactored code

In [9]:
def index_artist(artist_id, driver=None, url_fmt=None):
    """Return a list of {'name', 'link', 'song_id'} dicts for the songs on an artist page.

    If no driver is passed, a Chrome driver is created here and shut down before returning.
    """
    to_close = False
    if driver is None:
        to_close = True
        driver = webdriver.Chrome()
    if url_fmt is None:
        url_fmt = 'http://music.163.com/#/artist?id={artist_id}'
    url = url_fmt.format(artist_id=artist_id)

    try:
        driver.get(url)
        frames = driver.find_elements_by_tag_name('iframe')
        assert len(frames) == 2
        driver.switch_to.frame(frames[0])   # switch_to_frame() is deprecated

        table_xpath = '/html/body/div/div/div/div/div/div/div/div/div/table'
        table = driver.find_element_by_xpath(table_xpath)
        row_xpath = '/html/body/div/div/div/div/div/div/div/div/div/table/tbody/tr'
        rows = table.find_elements_by_xpath(row_xpath)

        songs = []
        for row in rows:
            href_xpath = 'td/div/div/div/span/a'
            href = row.find_element_by_xpath(href_xpath)
            title_xpath = 'b'
            title = href.find_element_by_xpath(title_xpath)
            link = href.get_attribute('href')
            name = title.get_attribute('title')
            # Raw string so that \d is a regex escape rather than a string escape.
            song_id = findall(r'id=(\d+)', link)[0]
            songs.append({'name': name, 'link': link, 'song_id': song_id})
    finally:
        # Shut down a driver created here even if a lookup above failed.
        if to_close:
            driver.quit()
    return songs
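
`index_artist` only shuts down the browser when it created it itself (the `to_close` flag). The same ownership rule could be factored into a context manager so that other crawler functions share it and the browser is released even when a lookup raises. This is only a sketch and not part of the original notebook: `managed_driver` and its `factory` parameter are hypothetical names, and `FakeDriver` exists purely so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver=None, factory=None):
    """Yield `driver`, creating one (and quitting it afterwards) if none was passed."""
    owns_driver = driver is None
    if owns_driver:
        if factory is None:
            # Imported lazily so the sketch can be tried without Selenium installed.
            from selenium import webdriver
            factory = webdriver.Chrome
        driver = factory()
    try:
        yield driver
    finally:
        # Only shut down a browser this helper started; a caller-supplied
        # driver stays open for reuse.
        if owns_driver:
            driver.quit()


class FakeDriver:
    """Stand-in for a WebDriver that only records whether quit() was called."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


with managed_driver(factory=FakeDriver) as owned:
    pass
print(owned.quit_called)    # True: created by the helper, so shut down by it

shared = FakeDriver()
with managed_driver(driver=shared) as borrowed:
    pass
print(shared.quit_called)   # False: the caller still owns this driver
```

With such a helper, a crawler function body would sit inside `with managed_driver(driver) as driver:` and would not need its own cleanup flag.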

In [10]:
index_artist(2111)   # 崔健 (Cui Jian)

[{'link': 'http://music.163.com/song?id=63692',
  'name': '花房姑娘',
  'song_id': '63692'},
 {'link': 'http://music.163.com/song?id=63612',
  'name': '假行僧',
  'song_id': '63612'},
 {'link': 'http://music.163.com/song?id=63628',
  'name': '新长征路上的摇滚',
  'song_id': '63628'},
 {'link': 'http://music.163.com/song?id=63627',
  'name': '一块红布',
  'song_id': '63627'},
 {'link': 'http://music.163.com/song?id=63677',
  'name': '一无所有',
  'song_id': '63677'},
 {'link': 'http://music.163.com/song?id=26172016',
  'name': '快让我在雪地上撒点野',
  'song_id': '26172016'},
 {'link': 'http://music.163.com/song?id=63795',
  'name': '是否',
  'song_id': '63795'},
 {'link': 'http://music.163.com/song?id=63683',
  'name': '从头再来',
  'song_id': '63683'},
 {'link': 'http://music.163.com/song?id=63624',
  'name': '快让我在这雪地上撒点野',
  'song_id': '63624'},
 {'link': 'http://music.163.com/song?id=5285232',
  'name': '让世界充满爱-幸福平安(童声合唱)',
  'song_id': '5285232'},
 {'link': 'http://music.163.com/song?id=63791',
  'name': '草帽歌 - (日本电影《人证

In [11]:
index_artist(3681)    # 李志 (Li Zhi)

[{'link': 'http://music.163.com/song?id=26508186',
  'name': '天空之城',
  'song_id': '26508186'},
 {'link': 'http://music.163.com/song?id=25867002',
  'name': '关于郑州的记忆',
  'song_id': '25867002'},
 {'link': 'http://music.163.com/song?id=26508240',
  'name': '梵高先生',
  'song_id': '26508240'},
 {'link': 'http://music.163.com/song?id=26508242',
  'name': '你离开了南京，从此没有人和我说话',
  'song_id': '26508242'},
 {'link': 'http://music.163.com/song?id=26523120',
  'name': '和你在一起',
  'song_id': '26523120'},
 {'link': 'http://music.163.com/song?id=26508232',
  'name': '山阴路的夏天',
  'song_id': '26508232'},
 {'link': 'http://music.163.com/song?id=26522011',
  'name': '米店',
  'song_id': '26522011'},
 {'link': 'http://music.163.com/song?id=29724295',
  'name': '热河',
  'song_id': '29724295'},
 {'link': 'http://music.163.com/song?id=29724292',
  'name': '定西',
  'song_id': '29724292'},
 {'link': 'http://music.163.com/song?id=26353044',
  'name': '忽然',
  'song_id': '26353044'},
 {'link': 'http://music.163.com/song?id=

## Crawling the lyrics of each song

### Scratch

In [12]:
driver = webdriver.Chrome()
driver.get('http://music.163.com/#/song?id=63692')

In [13]:
frames = driver.find_elements_by_tag_name('iframe')

In [14]:
driver.switch_to.frame(frames[0])   # switch_to_frame() is deprecated

In [15]:
lyric_content_id = 'lyric-content'
lyric_content = driver.find_element_by_id(lyric_content_id)

In [16]:
base_lyric = lyric_content.text

In [17]:
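more_content_id = 'flag_more'
more_content = lyric_content.find_element_by_id(more_content_id)
# The rest of the lyric sits in a collapsed element. Clearing its class attribute
# appears to un-hide it, so that .text (which only returns visible text) includes it.
driver.execute_script("arguments[0].setAttribute('class','')", more_content)
more_lyric = more_content.text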

In [18]:
lyric_lines = base_lyric.split('\n') + more_lyric.split('\n')
lyric_lines_clean = []
for line in lyric_lines:
    # Drop the lyricist (作词) and composer (作曲) credit lines and the "expand" (展开) link text.
    if line.startswith(('作词', '作曲')) or line == '展开':
        # print('removing:', line)
        continue
    lyric_lines_clean.append(line)
# Drop empty lines.
lyric_lines_clean = [l for l in lyric_lines_clean if len(l) > 0]
full_lyric = '\n'.join(lyric_lines_clean)
print('=====')
print(full_lyric)

=====
我独自走过你身旁 并没有话要对你讲
我不敢抬头看着你的 噢...脸庞.
你问我要去向何方 我指着大海的方向
你的惊奇像是给我 噢...赞扬.
你问我要去向何方 我指着大海的方向
你问我要去向何方 我指着大海的方向
你带我走进你的花房 我无法逃脱花的迷香
我不知不觉忘记了 噢...方向
你说我世上最坚强 我说你世上最善良
我不知不觉已和花儿 噢...一样
你说我世上最坚强 我说你世上最善良
你说我世上最坚强 我说你世上最善良
你要我留在这地方 你要我和它们一样
我看着你默默地说 噢...不能这样
我想要回到老地方 我想要走在老路上
这时我才知离不开你! 噢...姑娘!
我就要回到老地方 我就要走在老路上
我明知我已离不开你! 噢...姑娘!
我就要回到老地方 我就要走在老路上
我明知我已离不开你! 噢...姑娘!
我就要回到老地方 我就要走在老路上
我明知我已离不开你! 噢...姑娘!
我就要回到老地方 我就要走在老路上
我明知我已离不开你! 噢...姑娘!
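
The cleaning step above can also be packaged as a pure function, which makes it testable without a browser. This is a sketch under assumptions: `clean_lyric` is a hypothetical name, the sample strings are placeholders rather than real lyrics, and the filter simply mirrors the rules used in this notebook (drop lines starting with 作词 "lyricist" or 作曲 "composer", the 展开 "expand" link text, and empty lines).

```python
CREDIT_PREFIXES = ('作词', '作曲')   # lyricist / composer credit lines
EXPAND_LABEL = '展开'                # text of the "expand" link on the page


def clean_lyric(base_lyric, more_lyric=''):
    """Join the visible and expanded lyric text, dropping credits and blanks."""
    lines = base_lyric.split('\n') + more_lyric.split('\n')
    kept = [
        line for line in lines
        if line                                   # skip empty lines
        and not line.startswith(CREDIT_PREFIXES)  # skip credit lines
        and line != EXPAND_LABEL                  # skip the expand link text
    ]
    return '\n'.join(kept)


# Placeholder text only; not taken from any real song.
sample_base = '作词 : someone\n作曲 : someone\nfirst line\n\nsecond line\n展开'
sample_more = 'third line\n'
print(clean_lyric(sample_base, sample_more))
```

Because the function takes plain strings, the two `.text` values read from the page can be passed in directly, and the filter rules can be adjusted in one place if other pages show different credit labels.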


In [19]:
driver.quit()   # quit() ends the whole WebDriver session; close() only closes the current window

### Refactored code