<a href="https://colab.research.google.com/github/ssooni/data_mining_practice/blob/master/elasticsearch/elasticsearch_lyrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setting Up the Development Environment
1. Mount Google Drive
2. Install and start Elasticsearch

In [None]:
from google.colab import drive
import os

drive.mount('/content/gdrive/')
### Create a folder on the Colab VM for the Elasticsearch server
!sudo mkdir /content/elasticsearch
### Set access permissions on the folder
!chmod -R 755 /content/elasticsearch
### Set the current working directory
os.chdir('/content/elasticsearch')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).
mkdir: cannot create directory ‘/content/elasticsearch’: File exists


In [None]:
### Download the Elasticsearch server package for Linux
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.0.0-linux-x86_64.tar.gz -q
### Extract the downloaded archive
!tar -xzf elasticsearch-7.0.0-linux-x86_64.tar.gz
### Change the owner to the daemon user so that the background daemon process that runs the server in Colab can access the folder
!chown -R daemon:daemon elasticsearch-7.0.0
### Install the elasticsearch client package for Python (pinned to 7.x to match the 7.0.0 server and the doc_type argument used below)
!pip install "elasticsearch>=7,<8"



In [None]:
# Start the Elasticsearch server as a daemon process
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.0.0/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # run as the daemon user (uid 1)
                  )

In [None]:
# Connect Python to the Elasticsearch server running on localhost
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
es.info()

{'cluster_name': 'elasticsearch',
 'cluster_uuid': 'BACxBSShS3Wbh10PULbh8A',
 'name': 'c968c350237b',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2019-04-05T22:55:32.697037Z',
  'build_flavor': 'default',
  'build_hash': 'b7e28a7',
  'build_snapshot': False,
  'build_type': 'tar',
  'lucene_version': '8.0.0',
  'minimum_index_compatibility_version': '6.0.0-beta1',
  'minimum_wire_compatibility_version': '6.7.0',
  'number': '7.0.0'}}
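
The server started in the previous cell may need some time before it accepts connections, so `es.info()` can fail if it runs too early. A minimal sketch of a polling helper is shown below. `wait_until` is a hypothetical name that is not part of the notebook or the client library, and the usage comment assumes the client's `ping()` method, which reports whether the node is reachable.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or `timeout` seconds pass.

    Exceptions raised by check() are treated as "not ready yet".
    Returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # the service is probably still starting up
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage with the notebook's client (assumption: es is the Elasticsearch client):
#   if not wait_until(es.ping, timeout=120):
#       raise RuntimeError("Elasticsearch did not start in time")
```

The timeout and interval values are illustrative and can be tuned to how long the server takes to start in a given Colab session.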

In [None]:
### set workspace
os.chdir('/content/gdrive/My Drive/Colab Notebooks/information_retrieval/text')

In [None]:
def indexing(es, index_name):
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)

    print(es.indices.create(index=index_name))

index_name="lyrics"
indexing(es, index_name)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'lyrics'}


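## Collecting English Lyrics Data
1. Billboard Hot 100 chart data
  - Lyrics, song titles, songwriters, and other data for songs that appeared on the Billboard Hot 100 chart between 1999 and 2019 (97725 records)
  - Source: https://www.kaggle.com/danield2255/data-on-songs-from-billboard-19992019
2. Save the data to the local server using the Kaggle API
  - Issue a Kaggle API key and save it to Google Drive
  - Use the saved API key file to download the data through the API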
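
The credential step in the list above can also be sketched in Python. The helper below copies a `kaggle.json` file into a target folder and restricts it to the owner, which is what the `cp` and `chmod 600` shell commands do. `install_kaggle_credentials` is a hypothetical name, and the check for the `username` and `key` fields assumes the usual layout of a Kaggle API token file.

```python
import json
import os
import shutil


def install_kaggle_credentials(src, dest_dir):
    """Copy a kaggle.json file into dest_dir and make it readable by the owner only."""
    # Assumption: a Kaggle API token file holds "username" and "key" fields.
    with open(src, encoding="utf-8") as f:
        creds = json.load(f)
    missing = {"username", "key"} - set(creds)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    os.makedirs(dest_dir, exist_ok=True)  # no error if the folder already exists
    dest = os.path.join(dest_dir, "kaggle.json")
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # same effect as `chmod 600`
    return dest


# Usage with the notebook's paths:
#   install_kaggle_credentials("/content/gdrive/MyDrive/kaggle/kaggle.json",
#                              "/root/.kaggle")
```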


In [None]:
!mkdir /root/.kaggle/
!cp /content/gdrive/MyDrive/kaggle/kaggle.json /root/.kaggle/kaggle.json   # path to kaggle.json on Google Drive
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d danield2255/data-on-songs-from-billboard-19992019
!unzip data-on-songs-from-billboard-19992019.zip

mkdir: cannot create directory ‘/root/.kaggle/’: File exists
data-on-songs-from-billboard-19992019.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  data-on-songs-from-billboard-19992019.zip
replace BillboardFromLast20/artistDf.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: BillboardFromLast20/artistDf.csv  
  inflating: BillboardFromLast20/billboardHot100_1999-2019.csv  
  inflating: BillboardFromLast20/grammyAlbums_199-2019.csv  
  inflating: BillboardFromLast20/grammySongs_1999-2019.csv  
  inflating: BillboardFromLast20/riaaAlbumCerts_1999-2019.csv  
  inflating: BillboardFromLast20/riaaSingleCerts_1999-2019.csv  
  inflating: BillboardFromLast20/songAttributes_1999-2019.csv  
  inflating: BillboardFromLast20/spotifyWeeklyTop200Streams.csv  


### Checking the File Encoding
Check the file's encoding before loading it into a pandas.DataFrame.

In [None]:
!apt-get install -y file
!file -bi ./BillboardFromLast20/billboardHot100_1999-2019.csv

Reading package lists... Done
Building dependency tree       
Reading state information... Done
file is already the newest version (1:5.32-2ubuntu0.4).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
text/plain; charset=utf-8
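
As an alternative to installing the `file` utility, the same question ("does this file decode as UTF-8?") can be answered from Python. The sketch below reads the file in chunks and feeds them to an incremental decoder. `is_utf8` is a hypothetical helper name. Unlike `file`, it only tests one encoding and does not guess others.

```python
import codecs


def is_utf8(path, chunk_size=1 << 20):
    """Return True if the entire file at `path` decodes as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            # Read fixed-size chunks so large CSV files need not fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# Usage with the notebook's file:
#   is_utf8("./BillboardFromLast20/billboardHot100_1999-2019.csv")
```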


In [None]:
import pandas as pd
lyrics = pd.read_csv("./BillboardFromLast20/billboardHot100_1999-2019.csv", encoding="utf-8", index_col=0)

In [None]:
lyrics

Unnamed: 0,Artists,Name,Weekly.rank,Peak.position,Weeks.on.chart,Week,Date,Genre,Writing.Credits,Lyrics,Features
1,"Lil Nas,",Old Town Road,1,1.0,7.0,2019-07-06,"April 5, 2019","Country,Atlanta,Alternative Country,Hip-Hop,Tr...","Jozzy, Atticus ross, Trent reznor, Billy ray c...","Old Town Road Remix \nOh, oh-oh\nOh\nYeah, I'm...",Billy Ray Cyrus
2,"Shawn Mendes, Camila Cabello",Senorita,2,,,2019-07-06,"June 21, 2019",Pop,"Cashmere cat, Jack patterson, Charli xcx, Benn...",Senorita \nI love it when you call me senorita...,
3,Billie Eilish,Bad Guy,3,2.0,13.0,2019-07-06,"March 29, 2019","Hip-Hop,Dark Pop,House,Trap,Memes,Alternative ...","Billie eilish, Finneas","bad guy \nWhite shirt now red, my bloody nose\...",
4,Khalid,Talk,4,3.0,20.0,2019-07-06,"February 7, 2019","Synth-Pop,Pop","Howard lawrence, Guy lawrence, Khalid",Talk \nCan we just talk? Can we just talk?\nTa...,
5,"Ed Sheeran, Justin Bieber",I Don't Care,5,2.0,7.0,2019-07-06,"May 10, 2019","Canada,UK,Dance,Dance-Pop,Pop","Ed sheeran, Justin bieber, Shellback, Max mart...",I Don't Care \nI'm at a party I don't wanna be...,
...,...,...,...,...,...,...,...,...,...,...,...
97221,Vitamin C,Smile,95,,,1999-07-12,,"Jamaica,Pop","Colleen fitzpatrick, Josh deutsch","Smile \nHahaha\nAlright, yeah\nAlright\nFirst ...",Lady Saw
97222,Collective Soul,Heavy,96,73.0,20.0,1999-07-12,,"Hockey,Gaming,Soundtrack,Rock",Collective soul,Heavy \nComplicate this world you wrapped for ...,
97223,Mary Chapin Carpenter,Almost Home,97,,,1999-07-12,,"Country,Pop","Annie roboff, Beth nielsen chapman, Mary chapi...",Almost Home \nI saw my life this morning\nLyin...,
97224,Q,Vivrant Thing,98,,,1999-07-12,,Rap,"Q tip, J dilla, Barry white",Vivrant Thing \nUh check it out now\nUh no dou...,


## Installing the Stanza Library
Stanza is the core NLP library from the Stanford NLP Group; it provides modules used across NLP tasks, such as tokenization, lemmatization, and sentiment analysis.  
[Stanza official website](https://stanfordnlp.github.io/stanza/)

In [None]:
!pip install stanza

In [None]:
import stanza
stanza.download('en')       # This downloads the English models for the neural pipeline
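# tokenize_pretokenized=True skips the tokenizer model: each newline ends a sentence and tokens are split on whitespace
nlp = stanza.Pipeline(processors='tokenize,pos', lang='en', tokenize_no_ssplit=False, tokenize_pretokenized=True)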

### DataFrame to JSON
Convert the DataFrame into the JSON format that Elasticsearch uses.

In [None]:
import json
result = lyrics.to_json(orient="index")
lyrics_json = json.loads(result)
print(lyrics_json["1"])

{'Artists': 'Lil Nas,', 'Name': 'Old Town Road', 'Weekly.rank': 1, 'Peak.position': 1.0, 'Weeks.on.chart': 7.0, 'Week': '2019-07-06', 'Date': 'April 5, 2019', 'Genre': 'Country,Atlanta,Alternative Country,Hip-Hop,Trap,Memes,Remix,Country Rap,Rap', 'Writing.Credits': 'Jozzy, Atticus ross, Trent reznor, Billy ray cyrus, Lil nas x', 'Lyrics': "Old Town Road Remix \nOh, oh-oh\nOh\nYeah, I'm gonna take my horse to the old town road\nI'm gonna ride til I can't no more\nI'm gonna take my horse to the old town road\nI'm gonna ride til I can't no more\nKio, Kio\nI got the horses in the back\nHorse tack is attached\nHat is matte black\nGot the boots that's black to match\nRiding on a horse, ha\nYou can whip your Porsche\nI been in the valley\nYou ain't been up off that porch, now\nCan't nobody tell me nothin'\nYou can't tell me nothin'\nCan't nobody tell me nothin'\nYou can't tell me nothin'\nRiding on a tractor\nLean all in my bladder\nCheated on my baby\nYou can go and ask her\nMy life is a mo

In [None]:
# Rewrite each lyric so that every line ends with a period
for key in lyrics_json:
  doc = nlp(lyrics_json[key]["Lyrics"])
  # Each line of the lyrics is one sentence (see tokenize_pretokenized above)
  sentences = [sentence.text + "." for sentence in doc.sentences]
  lyrics_json[key]["Lyrics"] = " ".join(sentences)
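
To make the effect of the loop above easier to see, here is a plain-Python approximation that does not call Stanza. It assumes the pipeline treats each non-empty line as one sentence and splits tokens on whitespace, which is how the pretokenized mode is documented to behave. `lines_to_sentences` is a hypothetical name, and the output of the real pipeline may differ in edge cases.

```python
def lines_to_sentences(lyrics):
    """Turn newline-separated lyrics into one string of period-terminated lines."""
    sentences = []
    for line in lyrics.split("\n"):
        tokens = line.split()  # whitespace tokenization, also trims the line
        if tokens:  # skip empty lines
            sentences.append(" ".join(tokens) + ".")
    return " ".join(sentences)


sample = "Old Town Road Remix \nOh, oh-oh\nOh"
print(lines_to_sentences(sample))
# prints: Old Town Road Remix. Oh, oh-oh. Oh.
```

Appending a period to every line gives the stored lyrics explicit sentence boundaries in place of the original line breaks.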

### Indexing
Store the documents in Elasticsearch, building the index as they are added.

In [None]:
# Index every document
for idx in lyrics_json:
    es.index(index=index_name, doc_type='string', body=lyrics_json[idx])
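
Sending one request per document is slow for a dataset of this size. A possible alternative, sketched below, is to build bulk actions and pass them to the client's `helpers.bulk` function. `generate_actions` is a hypothetical helper. Using the DataFrame index as `_id` and omitting the document type are assumptions that differ from the loop above, which lets Elasticsearch assign IDs and stores documents under the `string` type.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document in the {key: document} mapping."""
    for key, doc in docs.items():
        yield {
            "_index": index_name,
            "_id": key,      # assumption: reuse the DataFrame index as the document ID
            "_source": doc,  # the document body, unchanged
        }


# Usage with the notebook's objects (requires the running server):
#   from elasticsearch import helpers
#   helpers.bulk(es, generate_actions(lyrics_json, index_name))
```

Fixed IDs make re-running the cell overwrite existing documents instead of adding duplicates.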




In [None]:
es.indices.refresh(index=index_name)
results = es.search(index=index_name, body={'from':0, 'size':10, 'query': {'match':{'Lyrics':'Love'}}})
for result in results['hits']['hits']:
    print('score:', result['_score'], 'source:', result['_source'])

score: 1.6539348 source: {'Artists': 'Childish Gambino', 'Name': 'Summertime Magic', 'Weekly.rank': 100, 'Peak.position': 44.0, 'Weeks.on.chart': 5.0, 'Week': '2018-09-15', 'Date': 'July 11, 2018', 'Genre': 'Alternative R&;B,Hip-Hop,Soul,Bounce,R&;B', 'Writing.Credits': 'Childish gambino, Ludwig goransson', 'Lyrics': "Summertime Magic \nYou feel like summertime\nYou took this heart of mine\nYou'll be my valentine in the summer, in the summer\nYou are my only one\nJust dancin'; having fun\nOut in the shining sun of the summer, of the summer\nDo love me, do love me, do\nDo love me, do love me, do yeah\nI love you\nDo love me, do love me, do\nDo love me, do love me, do ohh\nPut no one else above you\nDo love me, do love me, do\nDo love me, do love me, do yeah\nI need you\nDo love me, do love me, do\nDo love me, do love me, do ohh\nOh!\nDo love me, do love me, do\nDo love me, do love me, do\nI love you\nDo love me, do love me, do\nDo love me, do love me, do\nPut no one else above you\nDo l

In [None]:
es.indices.refresh(index=index_name)
query = {
  "bool": {
    "must": [
      {
        "match": {
          "Lyrics": "Love"
        }
      },
      {
        "match": {
          "Weeks.on.chart": "1"
        }
      }
    ]
  }
}

results = es.search(index=index_name, body={'from':0, 'size':10, 'query': query})
for result in results['hits']['hits']:
    print('score:', result['_score'], 'source:', result['_source'])

score: 2.461847 source: {'Artists': 'Drake', 'Name': 'In My Feelings', 'Weekly.rank': 1, 'Peak.position': 1.0, 'Weeks.on.chart': 1.0, 'Week': '2018-09-22', 'Date': 'June 29, 2018', 'Genre': 'Bounce,Pop,Trap,Canada,R&;B,Rap', 'Writing.Credits': 'Phil triggaman price, Orville bugs can can hall, Magnolia shorty, Lil wayne, Rex zamor, Jim jonsin, Static major, City girls, Deezle, Trapmoneybenny, Drake', 'Lyrics': "In My Feelings \nTrap, TrapMoneyBenny\nThis shit got me in my feelings\nGotta be real with it, yeah\nKiki, do you love me? Are you riding?\nSay you'll never ever leave from beside me\n'Cause I want ya, and I need ya\nAnd I'm down for you always\nKB, do you love me? Are you riding?\nSay you'll never ever leave from beside me\n'Cause I want ya, and I need ya\nAnd I'm down for you always\nLook, the new me is really still the real me\nI swear you gotta feel me before they try and kill me\nThey gotta make some choices, they running out of options\n'Cause I've been going off and they d

In [None]:
es.indices.refresh(index=index_name)
query = {
  "bool": {
    "must": [
      {
        "match": {
          "Lyrics": "Sad"
        }
      },
      {
        "range": {
          "Week": {
              "gte": "2010||/y",
              "lte": "2011||/y",
              "format": "yyyy"
          }
        }
      }
    ]
  }
}

results = es.search(index=index_name, body={'from':0, 'size':10, 'query': query})
for result in results['hits']['hits']:
    print('score:', result['_score'], 'source:', result['_source'])

score: 7.4518905 source: {'Artists': 'Glee Cast', 'Name': 'I Love New York / New York, New York', 'Weekly.rank': 81, 'Peak.position': None, 'Weeks.on.chart': None, 'Week': '2011-06-09', 'Date': 'May 24, 2011', 'Genre': 'Pop', 'Writing.Credits': 'Madonna, Stuart price', 'Lyrics': "I Love New York / New York, New York \nFinn:\nI don't like cities\nBut I like New York\nSantana:\nThe famous places to visit are so many\nFinn:\nOther places\nMake me feel like a dork\nSantana:\nI told my grandpa I wouldn't miss on any\nArtie:\nLos Angeles is for\nPeople who sleep\nMercedes:\nGot to see the whole town right from Yonkers on down to the Bay\nArtie:\nParis and London\nOh baby you can keep\nSantana:\nBaby you can keep\nMercedes:\nBaby you can keep\nRachel with Finn and New Directions New Directions:\nOther cities always make me mad\nOther places always make me sad\nNo other city ever made me glad\nExcept New York, New York\nIt's a wonderful town New York\nI love New York\nArtie and Mercedes with N
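
The search cells above share one structure: a `bool` query whose `must` list holds a `match` on `Lyrics` and, optionally, a `range` on `Week`. A small builder can produce these queries from parameters. `build_lyrics_query` is a hypothetical helper, not part of the notebook. In the date math it reproduces, `||/y` rounds the bound to the year, so the 2010 to 2011 range is meant to cover both full years.

```python
def build_lyrics_query(term, start_year=None, end_year=None):
    """Build a bool query matching `term` in Lyrics, optionally limited by chart year."""
    must = [{"match": {"Lyrics": term}}]
    if start_year is not None or end_year is not None:
        week = {"format": "yyyy"}
        if start_year is not None:
            week["gte"] = f"{start_year}||/y"  # from the start of this year
        if end_year is not None:
            week["lte"] = f"{end_year}||/y"    # through the end of this year
        must.append({"range": {"Week": week}})
    return {"bool": {"must": must}}


# Usage with the notebook's client:
#   query = build_lyrics_query("Sad", 2010, 2011)
#   results = es.search(index=index_name, body={"from": 0, "size": 10, "query": query})
```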