This section shows how to collect the data from the Billboard website and Twitter using the `selenium` web driver and the `snscrape` library.

In [1]:
# We need to import some packages. If a package is not installed, you can install it with either conda install or pip install.
import requests # to gather online data using methods such as GET and POST
import time # time utilities
import os # operating system interface
from bs4 import BeautifulSoup # HTML parser
from selenium import webdriver # browser simulator
from selenium.webdriver.chrome.options import Options 
import pandas as pd
import numpy as np
import itertools
import glob
import sqlalchemy
from sqlalchemy import create_engine
import snscrape.modules.twitter as sntwitter # Twitter scraper library
from nltk.sentiment import SentimentIntensityAnalyzer #sentiment analyzer

The Billboard website renders its content with JavaScript and doesn't have an open API. These features prevent us from using the Requests library directly to get the information we want. To overcome this issue, we need to set up a WebDriver browser simulator. I'm using a Chrome simulator, and you can download the web driver [here](https://chromedriver.chromium.org/downloads). With the code below, you can set up a Chrome simulator.

In [3]:
os.chdir('C:/Users/joaom/Desktop') # change the working directory to the folder that holds chromedriver
chrome_driver = 'C:/Users/joaom/Desktop/chromedriver' # path to the web driver
# set up the ChromeDriver options
chrome_options = Options()
chrome_options.add_argument("--headless") # run Chrome without opening a window
chrome_options.add_argument("--window-size=1366,768") # Chrome expects width,height
# start the browser simulator
driver = webdriver.Chrome(options=chrome_options)
driver

<selenium.webdriver.chrome.webdriver.WebDriver (session="d7cadb81dab30f59aeb195f13d6d2760")>

If you want to save this data straight to your `SQL` server, you can run the code below with your `SQL` server user and password.

In [4]:
engine = create_engine("mysql://User:password@localhost/database")
con = engine.connect()

To save the data in a `SQL` server, we first need to clean the strings, because some symbols are not allowed in the `SQL` language. To do this, I created two functions: the first strips accents from the text using `unicodedata`, and the second uses regular expressions to replace the symbols that `SQL` does not allow with spaces.

In [4]:
## These functions strip accents and other special characters from our data
# before inserting them into MySQL
import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # `unicode` does not exist in Python 3, where str is already Unicode
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_sql(text):
    """
    Convert input text to id.
    """
    text = strip_accents(text.lower())
    text = re.sub('[^@$!?&.#0-9a-zA-Z_-]'," ", text)
    text= text.lstrip()
    return text
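
To make the cleaning steps concrete, here is a minimal, self-contained sketch that applies them one at a time to a sample string. The sample string and the helper name `show_cleaning_steps` are only illustrative assumptions; the regular expression is the same one used in `text_to_sql` above.

```python
import re
import unicodedata

def show_cleaning_steps(text):
    """Return the intermediate results of each cleaning step (illustrative helper)."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFD', lowered)
    # encoding to ASCII with errors='ignore' drops the combining marks
    ascii_only = decomposed.encode('ascii', 'ignore').decode('utf-8')
    # every character outside the allowed set becomes a space
    cleaned = re.sub('[^@$!?&.#0-9a-zA-Z_-]', ' ', ascii_only).lstrip()
    return lowered, ascii_only, cleaned

# hypothetical sample input, not taken from the scraped data
for step in show_cleaning_steps('Rosalía & J Balvin: Con Altura'):
    print(repr(step))
```

Note that the colon is replaced by a space rather than removed, so the cleaned text can contain double spaces.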

After inspecting the Billboard website's HTML to find the chart table, we were able to create the web crawler. With the following function, we can get the chart for any given week from the Hot 100 historical charts.

In [5]:
def get_charts(weekdate):
    """Return the Hot 100 chart for the given week (YYYY-MM-DD) as a DataFrame."""
    url = "https://www.billboard.com/charts/hot-100/" + weekdate
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # locate the chart rows once instead of searching the page on every iteration
    rows = soup.find_all('div', attrs={'class': 'o-chart-results-list-row-container'})
    data = []
    for music in range(100):
        # drop the "NEW", "ENTRY" and "RE-" labels, then split the row into lines
        rawdata = rows[music].text.replace('NEW', '').replace('ENTRY', '').replace('RE-', '').splitlines()
        # keep the non-empty lines, cleaned for SQL
        x = [text_to_sql(strip_accents(i)) for i in rawdata if i != '']
        # the cleaned lines arrive in a fixed order, so map them by position
        fields = {'position': x[0], 'music': x[1], 'artist': x[2],
                  'lastweek': x[3], 'peak': x[4], 'weeks': x[5],
                  'weekdate': weekdate}
        data.append(fields)
    return pd.DataFrame(data)
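
The positional mapping inside `get_charts` is easier to follow on a single row. The sketch below runs the same label removal and line splitting on a hand-written string that stands in for the text of one chart row. The sample text and the names `parse_row` and `FIELD_NAMES` are assumptions for illustration, and the real page may lay out its rows differently. Unlike `get_charts`, it skips the SQL cleaning and filters lines with `strip()`.

```python
FIELD_NAMES = ['position', 'music', 'artist', 'lastweek', 'peak', 'weeks']

def parse_row(row_text):
    """Map the non-empty lines of one chart row to named fields (illustrative)."""
    # remove the same labels that get_charts removes
    for badge in ('NEW', 'ENTRY', 'RE-'):
        row_text = row_text.replace(badge, '')
    # keep only the lines that still contain text
    values = [line.strip() for line in row_text.splitlines() if line.strip()]
    # pair the values with the field names, in order
    return dict(zip(FIELD_NAMES, values))

# hypothetical row text, written by hand for this example
sample = "\n1\n\nNEW\n\nThe Scotts\n\nTHE SCOTTS, Travis Scott & Kid Cudi\n\n-\n1\n1\n"
print(parse_row(sample))
```

Because the removal is a plain substring replacement, a title that itself contains `NEW` in capitals would be altered as well, which is worth keeping in mind when checking the results.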

The weekly charts are not the only data we need. We also have to collect the tweets that mention the songs on the chart, along with each tweet's content and its numbers of favorites and retweets. With the following function, we can get the Billboard and Twitter data merged into one dataset just by passing a date in the YYYY-MM-DD format. Note that the tweets are collected from the week preceding the given chart date. I'm doing this because any impact of tweets on the Billboard charts would have to occur before the weekly chart is updated. The function uses `snscrape` to work around the Twitter API limits and keeps at most 100 tweets per song. The data can be stored in a SQL server or saved as a pickle or any other file format supported by `pandas`.

In [6]:
def get_weekly_data(weekdate):
    charts = get_charts(weekdate)
    weekdate=pd.to_datetime(weekdate)
    def get_tweets(charts, weekdate):
        data = charts.assign(search= lambda x: x['artist'] + ' AND ' + x['music'] )
        tweets_list = []
        for search in data['search']:
            music= search.split('AND ')[-1]
            until = weekdate.strftime('%Y-%m-%d')
            since = (weekdate - pd.to_timedelta(1, 'W')).strftime('%Y-%m-%d')
            query = search +  ' lang:en since:{} until:{}'.format(since, until)
            for tweet in itertools.islice(sntwitter.TwitterSearchScraper(query).get_items(), 0,100,None):
                fields = {}
                date = str(tweet.date)     
                text = str(tweet.content)
                text = strip_accents(text)
                text = text_to_sql(text)
                username = str(tweet.user.username)
                favorites = str(tweet.likeCount)
                retweets = str(tweet.retweetCount)
                fields['datetime'] = date
                fields['usarname'] = username
                fields['text'] = text
                fields['favorites'] = favorites
                fields['retweets'] = retweets
                fields['music'] = music
                tweets_list.append(fields)
        tweets_df = pd.DataFrame(tweets_list)
        return tweets_df
    tweets = get_tweets(charts, weekdate)
    week_data = charts.merge(tweets,
                             left_on="music",
                             right_on="music",
                             how="outer",
                             indicator=True)
    week_data = week_data.drop('_merge', axis=1)
    # save the week as a pickle, e.g. .../data/data_2020/data_20200509.pkl
    week_data.to_pickle('C:/Users/joaom/Desktop/data/data_' +
                        weekdate.strftime('%Y') + '/data_' +
                        weekdate.strftime('%Y%m%d') + '.pkl')
    #week_data.to_sql(name='data_' + weekdate.strftime('%Y%m%d'), con=con, if_exists='replace')
    return week_data
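
The search window can be checked in isolation before running the full scrape. The following sketch rebuilds the query string for one song with the standard `datetime` module instead of `pandas`. `build_query` is a hypothetical helper, and the `since:` and `until:` operators are used exactly as in the function above.

```python
from datetime import date, timedelta

def build_query(artist, music, weekdate):
    """Build the search string for the week that precedes `weekdate` (YYYY-MM-DD)."""
    until = date.fromisoformat(weekdate)
    since = until - timedelta(weeks=1)  # one week before the chart date
    search = artist + ' AND ' + music
    return search + ' lang:en since:{} until:{}'.format(since.isoformat(), until.isoformat())

print(build_query('travis scott', 'the scotts', '2020-05-09'))
# travis scott AND the scotts lang:en since:2020-05-02 until:2020-05-09
```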
    

For instance, we can get the data for the week of my birthday in 2020.

It's worth noting that this is not the fastest way to scrape the web, so it will take a couple of minutes to gather the data.

In [11]:
birthday_data = get_weekly_data('2020-05-09')
birthday_data.head()

Unnamed: 0,position,music,artist,lastweek,peak,weeks,weekdate,datetime,usarname,text,favorites,retweets
0,1,the scotts,the scotts travis scott & kid cudi,-,1,1,2020-05-09,2020-05-08 23:40:55+00:00,Steven_Patz,the scotts by the scotts travis scott kid cu...,0,0
1,1,the scotts,the scotts travis scott & kid cudi,-,1,1,2020-05-09,2020-05-08 22:16:35+00:00,MusikhedRadio,travis scott kid cudi - the scotts,0,0
2,1,the scotts,the scotts travis scott & kid cudi,-,1,1,2020-05-09,2020-05-08 20:37:27+00:00,BarzFan,the scotts a song by kid cudi the scotts and...,1,1
3,1,the scotts,the scotts travis scott & kid cudi,-,1,1,2020-05-09,2020-05-08 20:33:26+00:00,Ayarrrod99,the scotts rmx out now someone let travis scot...,0,0
4,1,the scotts,the scotts travis scott & kid cudi,-,1,1,2020-05-09,2020-05-08 20:20:11+00:00,squatterant,travis scott debuts kid cudi collab the scott...,0,0


Now we need to automate the process of passing dates in the right format to the function. To do this, I created a function that takes a year and collects and saves the weekly data for that year. I used the `pandas` datetime features (`pd.date_range`) and a simple `for` loop. As the Billboard Hot 100 is updated every Thursday, the dates passed to `get_weekly_data()` are all the Thursdays of the given year. Since Twitter launched in 2006, I decided to collect the data from 2008 onward.

In [10]:
def all_data(year):
    def all_thursday(year):
        return pd.date_range(start=str(year),
                             end=str(year+1),
                             freq='W-Thu').strftime('%Y-%m-%d').tolist()
    thursdays = all_thursday(year)
    for thursday in thursdays:
        get_weekly_data(thursday)
for i in range(2008,2022):
    all_data(i)
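
As a quick sanity check on the date generation, the Thursdays of a year can also be listed with the standard library alone. This is a sketch with a hypothetical helper name, `thursdays_in_year`, and it is not the `pd.date_range` call used above. It only returns dates that fall inside the requested year.

```python
from datetime import date, timedelta

def thursdays_in_year(year):
    """Return every Thursday of `year` as a list of YYYY-MM-DD strings."""
    day = date(year, 1, 1)
    # move forward to the first Thursday (Monday is 0, so Thursday is 3)
    day += timedelta(days=(3 - day.weekday()) % 7)
    result = []
    while day.year == year:
        result.append(day.isoformat())
        day += timedelta(weeks=1)
    return result

thursdays = thursdays_in_year(2020)
print(len(thursdays), thursdays[0], thursdays[-1])
# 53 2020-01-02 2020-12-31
```

`pd.date_range` includes its `end` value by default, so when 1 January of the following year is a Thursday (2009-01-01, for example) the call above would also return that date. The loop here stops at 31 December.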

Now, with the data in hand, we can move on to part two of this project: data treatment, visualization, and some insights.