# ADS 509 Module 1: APIs and Web Scraping

This notebook has three parts. In the first part you will pull data from the Twitter API. In the second, you will scrape lyrics from AZLyrics.com. In the last part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who have at least 100,000 Twitter followers and 20 songs with lyrics on AZLyrics.com. In this part of the assignment we pull the some of the user information for the followers of your artist and store them in text files. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


# Twitter API Pull

In [1]:
# for the twitter section
import tweepy
import os
import datetime
import re
from pprint import pprint

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
import shutil
from collections import defaultdict, Counter


In [2]:
# Use this cell for any import statements you add
print(tweepy.__version__)


4.9.0


In [3]:
#getting approval from Twitter API Keys, first need elevated access from Twitter 
#then click on create project and click on app to generate keys

We need bring in our API keys. Since API keys should be kept secret, we'll keep them in a file called `api_keys.py`. This file should be stored in the directory where you store this notebook. The example file is provided for you on Blackboard. The example has API keys that are _not_ functional, so you'll need to get Twitter credentials and replace the placeholder keys. 

In [4]:
api_key= 'ppDZ0vlZFrUTzFKEo1KL8wrrc'
api_key_secret= 'PdiObflSlZLULfdjI2d3NPB4a3Pwo89MBadIhJNinppY8ltq7K'
access_token='1526040925029408770-TSoSoFkJP9t7wepjWajVRRYCMH1b0h'
access_token_secret= '9Xugjs127ec4GLNcwtf4KTCVprd7iDy4HHzAfZFU7mwoh'

In [5]:
#from api_keys import api_key, api_key_secret, access_token, access_token_secret 

In [6]:

auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(
    auth,
    wait_on_rate_limit=True)

## Testing the API

The Twitter APIs are quite rich. Let's play around with some of the features before we dive into this section of the assignment. For our testing, it's convenient to have a small data set to play with. We will seed the code with the handle of John Chandler, one of the instructors in this course. His handle is `@37chandler`. Feel free to use a different handle if you would like to look at someone else's data. 

We will write code to explore a few aspects of the API: 

1. Pull all the follower IDs for @katymck.
1. Explore the user object, which gives us information about Twitter users. 
1. Pull some user objects for the followers. 
1. Pull the last few tweets by @katymck.


In [7]:

handle = "katymck"

followers = []

for page in tweepy.Cursor(api.get_follower_ids,
                          # acquiring all ids 
                          # wait on rate limit
                          wait_on_rate_limit=True,  
                          compression=True,
                          screen_name=handle).pages():

    # calling function .extend 
    followers.extend(page)
        
        
print(f"Here are the first five follower ids for {handle} out of the {len(followers)} total.")
print(f'{handle} has {len(followers)} followers')
followers_subset = followers[:5]
followers_subset

Unexpected parameter: wait_on_rate_limit
Unexpected parameter: compression


Here are the first five follower ids for katymck out of the 13 total.
katymck has 13 followers


[1351711667416023041, 1150871234, 248618836, 3199856542, 1239123794]

In [8]:
# Testing API went well

We have the follower IDs, which are unique numbers identifying the user, but we'd like to get some more information on these users. Twitter allows us to pull "fully hydrated user objects", which is a fancy way of saying "all the information about the user". Let's look at user object for our starting handle.

In [9]:
#acquiring user information
user = api.get_user(screen_name=handle) 
print(user._json)

{'id': 17947072, 'id_str': '17947072', 'name': 'Katy McKinney-Bock', 'screen_name': 'katymck', 'location': 'Portland, OR', 'profile_location': None, 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 13, 'friends_count': 108, 'listed_count': 0, 'created_at': 'Sun Dec 07 20:21:24 +0000 2008', 'favourites_count': 5, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 1, 'lang': None, 'status': {'created_at': 'Tue Sep 06 18:09:24 +0000 2016', 'id': 773221578850967552, 'id_str': '773221578850967552', 'text': '@MattAndersonBBC our eyes give us away-@1000frames/sec, eye-tracking shows we use adjective order to deduce meaning! https://t.co/ZuyGIHnB1U', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/ZuyGIHnB1U', 'expanded_url': 'http://tinyurl.com/z3rvmlh', 'display_url': 'tinyurl.com/z3rvmlh', 'indices': [117, 140]}]}

Now a few questions for you about the user object.

Q: How many fields are being returned in the \_json portion of the user object? 

A: <!-- Put your answer here --> There are 44 fields that are being returned

---

Q: Are any of the fields within the user object non-scaler? TK correct term

A: <!-- Put your answer here --> Yes, it is because some lines have multiple values

---

Q: How many friends, followers, favorites, and statuses does this user have? 

A: <!-- Put your answer here --> There are 13 followers count, 108 friends count, 5 favorites count, and status count is 1.


We can map the follower IDs onto screen names by accessing the screen_name key within the user object. Modify the code below to also print out how many people the follower is following and how many followers they have. 

In [10]:
ids_to_lookup = followers[:10]

for user_obj in api.lookup_users(user_id=ids_to_lookup) :

    print(f"user: {user_obj.screen_name} ,  followers_count: {user_obj.followers_count},friends_count: {user_obj.friends_count} ")
    
    # Add code here to print out friends and followers of `handle`


user: roadmaphome2030 ,  followers_count: 7356,friends_count: 3251 
user: ImaneBello ,  followers_count: 373,friends_count: 819 
user: BuseCett ,  followers_count: 1433,friends_count: 1749 
user: apreshill ,  followers_count: 16746,friends_count: 1905 
user: DaraShifrer ,  followers_count: 395,friends_count: 602 
user: g2barrow ,  followers_count: 4,friends_count: 14 
user: ErikaVaris ,  followers_count: 952,friends_count: 458 
user: mlroach ,  followers_count: 1918,friends_count: 667 
user: rlengland ,  followers_count: 46,friends_count: 157 
user: mmkane ,  followers_count: 15,friends_count: 205 


Although you won't need it for this assignment, individual tweets (called "statuses" in the API) can be a rich source of text-based data. To illustrate the concepts, let's look at the last few tweets for this user. You are encouraged to explore the `status` object and marvel in the richness of the data that is available. 


In [11]:
tweet_count = 0

for status in tweepy.Cursor(api.user_timeline, id=handle).items():
    tweet_count += 1
    
    print(f"The tweet was tweeted at {status.created_at}.")
    print(f"The original tweet has been retweeted {status.retweet_count} times.")
    
    clean_status = status.text
    clean_status = clean_status.replace("\n"," ")
    
    print(f"{clean_status}")
    print("\n"*2)
        
    if tweet_count > 10 :
        break



Unexpected parameter: id
Unexpected parameter: id


The tweet was tweeted at 2016-09-06 18:09:24+00:00.
The original tweet has been retweeted 0 times.
@MattAndersonBBC our eyes give us away-@1000frames/sec, eye-tracking shows we use adjective order to deduce meaning! https://t.co/ZuyGIHnB1U





## Pulling Follower Information

In this next section of the assignment, we will pull information about the followers of your two artists. We must first get the follower IDs, then we will be able to "hydrate" the IDs, pulling the user objects for them. Once we have those user objects we will extract some fields that we can use in future analyses. 


The Twitter API only allows users to make 15 requests per 15 minutes when pulling followers. Each request allows you to gather 5000 follower ids. Tweepy will grab the 15 requests quickly then wait 15 minutes, rather than slowly pull the requests over the time period. Before we start grabbing follower IDs, let's first just check how long it would take to pull all of the followers. To do this we use the `followers_count` item from the user object. 

In [12]:
# I'm putting the handles in a list to iterate through below
handles = ['taylorswift13','blakeshelton']

# This will iterate through each Twitter handle that we're collecting from
for screen_name in handles:
    
    # Tells Tweepy we want information on the handle we're collecting from
    # The next line specifies which information we want, which in this case is the number of followers 
    user = api.get_user(screen_name=screen_name) 
    followers_count = user.followers_count

    # Let's see roughly how long it will take to grab all the follower IDs. 
    print(f'''
    @{screen_name} has {followers_count} followers. 
    That will take roughly {followers_count/(5000*15*4):.2f} hours to pull the followers.
    ''')
    


    @taylorswift13 has 90361248 followers. 
    That will take roughly 301.20 hours to pull the followers.
    

    @blakeshelton has 19851727 followers. 
    That will take roughly 66.17 hours to pull the followers.
    


As we pull data for each artist we will write their data to a folder called "twitter", so we will make that folder if needed.

In [13]:
# Make the "twitter" folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then "unlink" it. Then create a new one.

if not os.path.isdir("twitter") : 
    #shutil.rmtree("twitter/")
    os.mkdir("twitter")

In this following cells, use the `api.followers_ids` (and the `tweepy.Cursor` functionality) to pull some of the followers for your two artists. As you pull the data, write the follower ids to a file called `[artist name]_followers.txt` in the "twitter" folder. For instance, for Cher I would create a file named `cher_followers.txt`. As you pull the data, also store it in an object like a list or a data frame.

In [14]:
num_followers_to_pull = 1*1000 # feel free to use this to limit the number of followers you pull.

In [15]:
# Modify the below code stub to pull the follower IDs and write them to a file. 

# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()
all_followers = []
for handle in handles :
    
    output_file =  'twitter/'+ handle + "_followers.txt"
    print(f'pulling followers for {handle}.')
    
    # Pull and store the follower IDs
    followers = []
    for page in tweepy.Cursor(api.get_follower_ids,screen_name=handle).pages():
        # The page variable comes back as a list, so we have to use .extend rather than .append
        followers.extend(page)
        all_followers.extend(page)
        
        print(f'loaded in {len(followers)}accounts...')
        
        # If you've pulled num_followers_to_pull, feel free to break out paged twitter API response
        if len(followers) >= num_followers_to_pull:
            break
            
        time.sleep(30)
        
    print(f'finished pulling followers for {handle}.')
    
    print(f'writing {handle} followers info to file: {output_file}')
    # Write the IDs to the output file in the `twitter` folder.        
    with open(output_file, 'w') as f:
        for follower in followers:
            f.write(str(follower)+'\n')
        
            
        
        
# Let's see how long it took to grab all follower IDs
end_time = datetime.datetime.now()
print(end_time - start_time)
print(f'all_followers {len(all_followers)}')


pulling followers for taylorswift13.
loaded in 5000accounts...
finished pulling followers for taylorswift13.
writing taylorswift13 followers info to file: twitter/taylorswift13_followers.txt
pulling followers for blakeshelton.
loaded in 5000accounts...
finished pulling followers for blakeshelton.
writing blakeshelton followers info to file: twitter/blakeshelton_followers.txt
0:00:01.040951
all_followers 10000


Now that you have your follower ids, gather some information that we can use in future assignments on them. Using the `lookup_users` function, pull the user objects for your followers. These requests are limited to 900 per 15 minutes, but you can request 100 users at a time. At 90,000 users per 15 minutes, the rate limiter on pulls might be bandwidth rather than API limits. 

Extract the following fields from the user object: 

* screen_name	
* name	
* id	
* location	
* followers_count	
* friends_count	
* description

These can all be accessed via these names in the object. Store the fields with one user per row in a tab-delimited text file with the name `[artist name]_follower_data.txt`. For instance, for Cher I would create a file named `cher_follower_data.txt`. 


In [16]:
ls twitter

 Volume in drive C is OS
 Volume Serial Number is 64E0-3BF8

 Directory of C:\Users\saith\twitter

05/26/2022  05:06 PM    <DIR>          .
05/26/2022  05:06 PM    <DIR>          ..
05/26/2022  05:07 PM            99,597 blakeshelton_followers.txt
05/26/2022  04:24 PM           423,314 blakeshelton_followers_data.txt
05/26/2022  05:07 PM           101,305 taylorswift13_followers.txt
05/26/2022  04:23 PM           393,593 taylorswift13_followers_data.txt
               4 File(s)      1,017,809 bytes
               2 Dir(s)  117,948,719,104 bytes free


In [17]:
n_followers = 5000
all_followers = {}
for handle in handles:
    with open(f'twitter/{handle}_followers.txt', 'r') as f:
        all_followers[handle] = f.read().split('\n')[:n_followers] # limit number of followers

In [18]:
all_followers.keys()

dict_keys(['taylorswift13', 'blakeshelton'])

In [19]:
import pandas as pd
    # in this cell, do the following
    # 1. Set up a data frame or dictionary to hold the user information
for handle in handles:
    print(f'Getting {handle} follower data')
   
    users_info = []
    # 2. Use the `lookup_users` api function to pull sets of 100 users at a time

    pages = len(all_followers[handle]) // 100 + 1
    follower_users = []

    print(f'pages: {pages}')
    # get the user object for each foollower
    for i in range(pages):

        follower_ids_subset = all_followers[handle][i:i+100]

        print(f'follower_ids_subset: {len(follower_ids_subset)}')

        #grab users for these 100 or less followers
        user_results = api.lookup_users(user_id=follower_ids_subset)#issue with call : positional argument error

        follower_users.extend(user_results)

        print(f'obtained page :{i+1},  users:{len(follower_users)}')

        if len(follower_users)% 90_000 == 0:
            time.sleep(60*15)


    # for follower_user in follower_users:
    # 3. Store the listed fields in your data frame or dictionary.
    items = ['screen_name', 'name', 'id', 'location', 'followers_count',
             'friends_count', 'description']
    fields = []
    for follower_user in follower_users:
        fields.append({x: follower_user._json[x] for x in items})       
    follower_df = pd.DataFrame(fields)
    
    # clean tabs based on the following instruction
    follower_df['description'] = follower_df['description'].apply(lambda x: re.sub(r"\s+"," ",x))

    
    # 4. Write the user information in tab-delimited form to the follower data text file. 
    follower_df.to_csv(f'twitter/{handle}_followers_data.txt', index=False, sep='\t')
    print()


Getting taylorswift13 follower data
pages: 51
follower_ids_subset: 100
obtained page :1,  users:100
follower_ids_subset: 100
obtained page :2,  users:200
follower_ids_subset: 100
obtained page :3,  users:300
follower_ids_subset: 100
obtained page :4,  users:400
follower_ids_subset: 100
obtained page :5,  users:500
follower_ids_subset: 100
obtained page :6,  users:600
follower_ids_subset: 100
obtained page :7,  users:700
follower_ids_subset: 100
obtained page :8,  users:800
follower_ids_subset: 100
obtained page :9,  users:900
follower_ids_subset: 100
obtained page :10,  users:1000
follower_ids_subset: 100
obtained page :11,  users:1100
follower_ids_subset: 100
obtained page :12,  users:1200
follower_ids_subset: 100
obtained page :13,  users:1300
follower_ids_subset: 100
obtained page :14,  users:1400
follower_ids_subset: 100
obtained page :15,  users:1500
follower_ids_subset: 100
obtained page :16,  users:1600
follower_ids_subset: 100
obtained page :17,  users:1700
follower_ids_subset:

One note: the user's description can have tabs or returns in it, so make sure to clean those out of the description before writing them to the file. Here's an example of how you might do this. 

In [20]:
tricky_description = """
    Home by Warsan Shire
    
    no one leaves home unless
    home is the mouth of a shark.
    you only run for the border
    when you see the whole city
    running as well.

"""
# This won't work in a tab-delimited text file.

clean_description = re.sub(r"\s+"," ",tricky_description)
clean_description

' Home by Warsan Shire no one leaves home unless home is the mouth of a shark. you only run for the border when you see the whole city running as well. '

---

# Lyrics Scrape

This section asks you to pull data from the Twitter API and scrape www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [21]:
# twitter
import tweepy
import os
import datetime
import re
from pprint import pprint

In [22]:
# to implement Lyrics Scrate 
import requests
import time
from bs4 import BeautifulSoup
import shutil
from collections import defaultdict, Counter


In [23]:
artists = {'taylorswift':"https://www.azlyrics.com/t/taylorswift.html", 
           'blakeshelton':"https://www.azlyrics.com/b/blakeshelton.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not have an explicit maximum on number of requests in any one time, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages. 

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: <!-- Delete this comment and put your answer here. --> According to the robots.txt, the scraping we are about to do is disallowed. 
<User-agent: *
Disallow: /lyricsdb/
Disallow: /song/
Allow: /
                    

Disallow: />
    


In [24]:
# Let's set up a dictionary of lists to hold our links
from lxml import html
import random
import requests 
lyrics_pages = defaultdict(list)

for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    webpage = html.fromstring(r.content)
    #get all the links in the artist page and slice out the links that are not necessary
    links = webpage.xpath('//a/@href')[31:-8]
    lyrics_pages[artist] = links
    time.sleep(5 + 10*random.random())

    

In [25]:
#print(len(lyrics_pages['taylorswift']))
#print(len(lyrics_pages['blakeshelton']))

Let's make sure we have enough lyrics pages to scrape. 

In [26]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 
         

In [27]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull will take for this artist will take {round(len(links)*10/3600,2)} hours.")

For taylorswift we have 342.
The full pull will take for this artist will take 0.95 hours.
For blakeshelton we have 190.
The full pull will take for this artist will take 0.53 hours.


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 


In [28]:
def generate_filename_from_link(link) :
    
    if not link :
        return None
    
    # drop the http or https and the html
    name = link.replace("https","").replace("http","")
    name = link.replace(".html","")

    name = name.replace("/lyrics/","")
    
    # Replace useless chareacters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # tack on .txt
    name = name + ".txt"
    
    return(name)


In [29]:
# Make the lyrics folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then use shutil.rmtree to remove it and create a new one.

if os.path.isdir("lyrics") : 
    shutil.rmtree("lyrics/")

os.mkdir("lyrics")

In [30]:
os.getcwd()

'C:\\Users\\saith'

In [31]:
# for the twitter section
import tweepy
import os
import datetime
import re
from pprint import pprint

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
import shutil
from collections import defaultdict, Counter
from lxml import html
import random

def generate_filename_from_link(link) :
    if not link: return None
    # drop the http or https and the html
    name = link.replace("https","").replace("http","")
    name = link.replace(".html","")
    name = name.replace("/lyrics/","")
    # Replace useless chareacters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    # tack on .txt
    name = name + ".txt"
    return(name)
 

# Let's set up a dictionary of lists to hold our links
print("Fetching Lyric Pages...")
lyrics_pages = defaultdict(list)
for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    webpage = html.fromstring(r.content)
    #get all the links in the artist page and slice out the links that are not necessary
    links = webpage.xpath('//a/@href')[31:-8]
    lyrics_pages[artist] = links
    time.sleep(5 + 10*random.random())


url_stub = "https://www.azlyrics.com" 
start = time.time()
total_pages = 0 

#if not os.path.isdir('lyrics'): 
 #   os.mkdir('lyrics')

for artist in lyrics_pages:
    print(f"{artist}: {len(lyrics_pages[artist])}")


    # 1. Build a subfolder for the artist
    # if os.path.isdir(artist):
    #     shutil.rmtree(artist)
    folderpath= os.path.join('lyrics',artist)

    if not os.path.isdir(folderpath): 
        os.mkdir(folderpath)
    
    
    # 2. Iterate over the lyrics pages
    for i,link in enumerate(lyrics_pages[artist]):
        # 2.1 Prepare fileName for the current song
        fileName = generate_filename_from_link(link)
        print(fileName)
        filePath = os.path.join('lyrics',artist, fileName)

        # 2.2 Skip song if we already scraped it in the past
        if os.path.isfile(filePath): continue;
        print(link)
        # 3. Request the lyrics page. 
        lyric_url = url_stub + link.replace("..","")
        print("\t"+lyric_url)
        r = requests.get(lyric_url, timeout=5)
        soup = BeautifulSoup(r.content, 'html.parser')
        r.close()
        time.sleep(5 + 10*random.random()) # sleep after making the request

        # 4. Extract the title and lyrics from the page.
        titleelement=soup.findAll("title")
        song_title=soup.title.text.split("-")[1].split("|")[0].replace("Lyrics","").strip()
        start=soup.text.find(f'"{song_title}"')
        end=soup.text.find('Submit Corrections')
        

        # 5. Write out the title, two returns ('\n'), and the lyrics.
        with open(filePath, 'w') as f:
            f.write(song_title+'\n\n')
            for line in soup.text[start:end].split('\n')[11:]:
                f.write(line+'\n')
        f.close()
    
        # Remember to pull at least 20 songs per artist. It may be fun to pull all the songs for the artist
        if i>=21:
            break

Fetching Lyric Pages...
taylorswift: 342
taylorswift_timmcgraw.txt
/lyrics/taylorswift/timmcgraw.html
	https://www.azlyrics.com/lyrics/taylorswift/timmcgraw.html
taylorswift_picturetoburn.txt
/lyrics/taylorswift/picturetoburn.html
	https://www.azlyrics.com/lyrics/taylorswift/picturetoburn.html
taylorswift_teardropsonmyguitar.txt
/lyrics/taylorswift/teardropsonmyguitar.html
	https://www.azlyrics.com/lyrics/taylorswift/teardropsonmyguitar.html
taylorswift_aplaceinthisworld.txt
/lyrics/taylorswift/aplaceinthisworld.html
	https://www.azlyrics.com/lyrics/taylorswift/aplaceinthisworld.html
taylorswift_coldasyou.txt
/lyrics/taylorswift/coldasyou.html
	https://www.azlyrics.com/lyrics/taylorswift/coldasyou.html
taylorswift_theoutside.txt
/lyrics/taylorswift/theoutside.html
	https://www.azlyrics.com/lyrics/taylorswift/theoutside.html
taylorswift_tiedtogetherwithasmile.txt
/lyrics/taylorswift/tiedtogetherwithasmile.html
	https://www.azlyrics.com/lyrics/taylorswift/tiedtogetherwithasmile.html
tayl

In [32]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 459336.26 hours.


---

# Evaluation

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com.  After you have finished the above sections , run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [33]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())

---

## Checking Twitter Data

The output from your Twitter API pull should be two files per artist, stored in files with formats like `cher_followers.txt` (a list of all follower IDs you pulled) and `cher_followers_data.txt`. These files should be in a folder named `twitter` within the repository directory. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [34]:
cwd = os.getcwd()
print(cwd)

C:\Users\saith


In [35]:

#os.chdir("C:/Users/saith/twitter")
os.chdir("C://users//saith//")
os.listdir()

['.atom',
 '.aws',
 '.bash_history',
 '.cisco',
 '.conda',
 '.condarc',
 '.continuum',
 '.ipynb_checkpoints',
 '.ipython',
 '.jupyter',
 '.matplotlib',
 '.redhat',
 '.vscode',
 '3D Objects',
 'ads509_assign1.ipynb',
 'anaconda3',
 'API and Scrape.ipynb',
 'AppData',
 'Application Data',
 'captureMsi.log',
 'cfn101-workshop',
 'Contacts',
 'Cookies',
 'Copy_of_solution.ipynb',
 'credit_fraud_detection.ipynb',
 'Desktop',
 'Documents',
 'Downloads',
 'Favorites',
 'Google Drive',
 'IntelGraphicsProfiles',
 'Links',
 'Local Settings',
 'lyrics',
 'Lyrics and Description EDA.ipynb',
 'MicrosoftEdgeBackups',
 'Music',
 'My Documents',
 'NetHood',
 'New folder',
 'NTUSER.DAT',
 'ntuser.dat.LOG1',
 'ntuser.dat.LOG2',
 'NTUSER.DAT{b03892b5-eeda-11ea-805b-854a883dddf0}.TM.blf',
 'NTUSER.DAT{b03892b5-eeda-11ea-805b-854a883dddf0}.TMContainer00000000000000000001.regtrans-ms',
 'NTUSER.DAT{b03892b5-eeda-11ea-805b-854a883dddf0}.TMContainer00000000000000000002.regtrans-ms',
 'ntuser.ini',
 'OneDrive'

In [36]:
#os.chdir("../")
#os.getcwd()

In [37]:
twitter_files = os.listdir("twitter")
twitter_files = [f for f in twitter_files if f != ".DS_Store"]
artist_handles = list(set([name.split("_")[0] for name in twitter_files]))

print(f"We see two artist handles: {artist_handles[0]} and {artist_handles[1]}.")

We see two artist handles: taylorswift13 and blakeshelton.


In [38]:
for artist in artist_handles :
    follower_file = artist + "_followers.txt"
    follower_data_file = artist + "_followers_data.txt"
    
    ids = open("twitter/" + follower_file,'r').readlines()
    
    print(f"We see {len(ids)-1} in your follower file for {artist}, assuming a header row.")
    
    with open("twitter/" + follower_data_file,'r',  encoding='utf-8') as infile :
        
        # check the headers
        headers = infile.readline().split("\t")
        
        print(f"In the follower data file ({follower_data_file}) for {artist}, we have these columns:")
        print(" : ".join(headers))
        
        description_words = []
        locations = set()
        
        
        for idx, line in enumerate(infile.readlines()) :
            line = line.strip("\n").split("\t")
            
            try : 
                locations.add(line[3])            
                description_words.extend(words(line[6]))
            except :
                pass
    
        

        print(f"We have {idx+1} data rows for {artist} in the follower data file.")

        print(f"For {artist} we have {len(locations)} unique locations.")

        print(f"For {artist} we have {len(description_words)} words in the descriptions.")
        print("Here are the five most common words:")
        print(Counter(description_words).most_common(5))

        
        print("")
        print("-"*40)
        print("")
    

We see 4999 in your follower file for taylorswift13, assuming a header row.
In the follower data file (taylorswift13_followers_data.txt) for taylorswift13, we have these columns:
screen_name : name : id : location : followers_count : friends_count : description

We have 5100 data rows for taylorswift13 in the follower data file.
For taylorswift13 we have 39 unique locations.
For taylorswift13 we have 17912 words in the descriptions.
Here are the five most common words:
[('pro', 306), ('is', 243), ('i', 214), ('you', 204), ('que', 202)]

----------------------------------------

We see 4999 in your follower file for blakeshelton, assuming a header row.
In the follower data file (blakeshelton_followers_data.txt) for blakeshelton, we have these columns:
screen_name : name : id : location : followers_count : friends_count : description

We have 5100 data rows for blakeshelton in the follower data file.
For blakeshelton we have 26 unique locations.
For blakeshelton we have 21397 words in th

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [39]:
dir_path= "lyrics/"

In [40]:
artist_folders = os.listdir(dir_path)
artist_folders = [f for f in artist_folders if os.path.isdir(dir_path + f)]

for artist in artist_folders : 
    artist_files = os.listdir(dir_path + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open(dir_path + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")

For blakeshelton we have 22 files.
For blakeshelton we have roughly 5351 words, 1069 are unique.
For taylorswift we have 22 files.
For taylorswift we have roughly 6426 words, 822 are unique.
