# <b>Section 1: Data Crawling</b>

### <b><u>Step 1</u>: Import libraries</b>

These are the main libraries used for data crawling:
- `spotipy`: a lightweight Python library for the Spotify Web API that gives access to the music data the Spotify platform exposes.
- `dotenv`: loads Spotipy's client ID, client secret, and redirect URI from the `.env` file.
- `os`: reads those values from the environment once the `.env` file has been loaded.

In [42]:
from spotipy.oauth2 import SpotifyClientCredentials, SpotifyOAuth
from dotenv import load_dotenv
import spotipy
import os

### <b><u>Step 2</u>: Request access to the Spotify API using the OAuth method</b>

First, we load the `.env` file to read Spotipy's client ID, client secret, and redirect URI, which are needed to access Spotify's API service through OAuth.

After that, we initialize a `SpotifyOAuth` object with these three values and pass it to `spotipy.Spotify` as the auth manager to get permission to use the API service.

In [43]:
load_dotenv()

client_id = os.getenv('SPOTIPY_CLIENT_ID')
client_secret = os.getenv('SPOTIPY_CLIENT_SECRET')
redirect_uri = os.getenv('SPOTIPY_REDIRECT_URI')

# SpotifyOAuth handles the OAuth flow with the credentials loaded above.
# When an auth_manager is given, spotipy uses it for authentication, so no
# separate client_credentials_manager is needed.
auth_manager = SpotifyOAuth(client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)
sp = spotipy.Spotify(auth_manager=auth_manager)
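
`os.getenv` returns `None` when a variable is not set, so a missing entry in the `.env` file only shows up later as an authentication error. A minimal sketch of an early check is shown below; the helper name `require_env` is hypothetical and not part of `spotipy` or `dotenv`.

```python
import os


def require_env(names):
    """Return {name: value} for the given variables, raising if any is missing or empty."""
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```

It could be called right after `load_dotenv()` with the three names used above (`SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET`, `SPOTIPY_REDIRECT_URI`).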

### <b><u>Step 3</u>: Crawl the top 3000 songs on Spotify from 2020 to 2022</b>

First, search for tracks released in 2022 and fetch the first page of 50 results.

In [44]:
result = sp.search(q='year:2022', limit=50)

Then follow the pagination links with `sp.next` to fetch 19 more pages of 2022 tracks.

In [45]:
songs_data = result['tracks']['items']

for _ in range(19):
    result = sp.next(result['tracks'])
    songs_data.extend(result['tracks']['items'])

Repeat the search for 2021, this time paging with the `offset` parameter (20 pages of 50 tracks).

In [46]:
for i in range(20):
    result = sp.search(q='year:2021', limit=50, offset=i*50)
    songs_data.extend(result['tracks']['items'])

Do the same for 2020.

In [47]:
for i in range(20):
    result = sp.search(q='year:2020', limit=50, offset=i*50)
    songs_data.extend(result['tracks']['items'])
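
The per-year loops above share the same shape, so they could be folded into one helper. The following is only a sketch under assumptions: the name `crawl_year` and its `total` and `page_size` parameters are hypothetical, and `client` is expected to behave like the `sp` object, with a `search` method that accepts `q`, `type`, `limit`, and `offset`.

```python
def crawl_year(client, year, total=1000, page_size=50):
    """Collect up to `total` track items for one release year using offset paging."""
    tracks = []
    for offset in range(0, total, page_size):
        result = client.search(q=f"year:{year}", type="track",
                               limit=page_size, offset=offset)
        items = result["tracks"]["items"]
        if not items:
            break  # fewer results available than requested
        tracks.extend(items)
    return tracks
```

With such a helper, the three years could be crawled with one call per year, for example `crawl_year(sp, 2021)`, and the results concatenated into `songs_data`.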

Print the track names to check the crawled data.

In [50]:
for i in songs_data:
    print(i['name'])

Dreamers [Music from the FIFA World Cup Qatar 2022 Official Soundtrack]
dự báo thời tiết hôm nay mưa
Chết Trong Em
Em Là
Wild Flower (with youjeen)
Có Đâu Ai Ngờ
Tại Vì Sao
ThichThich
Chìm Sâu
Shut Down
Lâu Lâu Nhắc Lại
Left and Right (Feat. Jung Kook of BTS)
Waiting For You
Ngày Đầu Tiên
The Astronaut
Run BTS
Mặt Mộc
Chạy Khỏi Thế Giới Này
Một Ngàn Nỗi Đau
Bên Trên Tầng Lầu
Có Em (feat. Low G)
Lonely
Still Life (with Anderson .Paak)
Yêu Người Có Ước Mơ
Anti-Hero
Vì Anh Đâu Có Biết
vaicaunoicokhiennguoithaydoi
Change pt.2
Anti-Hero
Yet To Come
No.2 (with parkjiyoon)
Butter
Closer (with Paul Blanco, Mahalia)
Hectic (with Colde)
Ngã Tư Không Đèn
Yun (with Erykah Badu)
cardigan
willow
All Day (with Tablo)
Anti-Hero
willow
Christmas Tree Farm
Forg_tful (with Kim Sawol)
Christmas Tree Farm
cardigan
cardigan
Anh Nhớ Ra (feat. TRANG)
cardigan
willow
Christmas Tree Farm
Anti-Hero
Anti-Hero
có hẹn với thanh xuân
Anti-Hero
Anti-Hero
Anti-Hero
Anti-Hero
Anti-Hero
Anti-Hero
Anti-Hero
Anti-Hero
Ant
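
The printed names contain many repeated titles (for example `Anti-Hero` and `cardigan`). From the names alone it is not possible to tell whether these are the same track returned more than once or different releases that share a title. If duplicates turn out to be unwanted, a sketch like the one below could filter them; `dedupe_tracks` and its `key` parameter are hypothetical names, and filtering would bring the total below 3000.

```python
def dedupe_tracks(tracks, key=lambda track: track["id"]):
    """Keep the first occurrence of each key (track id by default), preserving order."""
    seen = set()
    unique = []
    for track in tracks:
        marker = key(track)
        if marker not in seen:
            seen.add(marker)
            unique.append(track)
    return unique
```

Passing a different `key`, such as `lambda track: track["name"]`, would treat tracks with the same title as duplicates even when their ids differ.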

### <b><u>Step 4</u>: Store songs data in the 'songs_data.tsv' file</b>

In [49]:
artists_uri = [[artist['uri'] for artist in track['artists']] for track in songs_data]
len(artists_uri)

3000
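
Looking up every artist with its own request means at least 3000 calls, and popular artists are fetched repeatedly. A possible alternative, sketched here under assumptions, is to fetch each unique artist once in batches. The sketch assumes that the client offers an `artists` method taking a list of URIs and returning a dict with an `"artists"` list (as `spotipy.Spotify.artists` is understood to do) and that 50 URIs per request is an acceptable batch size. The names `chunked` and `fetch_artists` are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_artists(client, uris, batch_size=50):
    """Return {uri: artist_dict}, issuing one request per batch of unique URIs."""
    unique = list(dict.fromkeys(uris))  # drop repeats, keep first-seen order
    found = {}
    for batch in chunked(unique, batch_size):
        for artist in client.artists(batch)["artists"]:
            if artist:  # skip entries the API could not resolve
                found[artist["uri"]] = artist
    return found
```

The flattened input could be built from `artists_uri` with `[uri for row in artists_uri for uri in row]`, and the returned dict then serves as a lookup table while writing rows.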

Finally, look up artist and album details for each track and write one tab-separated row per track.

In [None]:
# encoding="utf-8" keeps non-ASCII titles intact regardless of the platform default.
with open("../../data/songs_data.tsv", 'w', encoding="utf-8") as f:
    f.write("id\tname\tartist\tgenres\tartist_followers\tartist_popularity\tmarkets\talbum\treleased_date\talbum_popularity\tduration\texplicit\tpopularity\n")
    for track, uri_row in zip(songs_data, artists_uri):
        # One API request per artist credited on the track.
        artists_data = [sp.artist(uri) for uri in uri_row]
        artists_info = {'name': [], 'genres': [], 'followers': [], 'popularity': []}

        for artist in artists_data:
            artists_info['name'].append(artist['name'])
            artists_info['genres'].extend(artist['genres'])
            artists_info['followers'].append(str(artist['followers']['total']))
            artists_info['popularity'].append(str(artist['popularity']))

        # One more request for the full album record to read its popularity.
        album_popularity = str(sp.album(track['album']['uri'])['popularity'])

        # Multi-valued fields are comma-joined; genres are de-duplicated and sorted.
        row = [
            track['id'],
            track['name'],
            ','.join(artists_info['name']),
            ','.join(sorted(set(artists_info['genres']))),
            ','.join(artists_info['followers']),
            ','.join(artists_info['popularity']),
            str(len(track['available_markets'])),
            track['album']['name'],
            track['album']['release_date'],
            album_popularity,
            str(track['duration_ms']),
            str(track['explicit']),
            str(track['popularity']),
        ]
        f.write('\t'.join(row) + '\n')
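
Joining fields by hand works as long as no value contains a tab or a newline. As a hedged alternative, the standard-library `csv` module can write the same tab-separated layout and quote such values when needed. The helper name `write_tsv` below is hypothetical; it takes a header list and an iterable of row lists.

```python
import csv


def write_tsv(path, header, rows):
    """Write a header and rows to a tab-separated file, quoting fields when required."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
```

A file written this way can be read back with `csv.reader(f, delimiter="\t")`, which undoes any quoting that was applied.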
