## ADS 509: Assignment 1 Evaluation Script

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. After you have finished the assignment, run all the cells in this notebook. Print this to PDF and submit it with your other two PDFs. 

In [5]:
import os
import re
from collections import Counter

In [6]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())
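
To make the tokenizer's behavior concrete, the short check below runs `words` on a made-up sentence (the sample string is an illustration, not assignment data). Because `\w+` matches runs of letters, digits, and underscores, punctuation is dropped and contractions are split at the apostrophe.

```python
import re

def words(text):
    # Same extractor as above: lowercase, then grab runs of word characters.
    return re.findall(r'\w+', text.lower())

# Hypothetical sample text, chosen to show how punctuation is handled.
sample = "Don't stop Believin', it's 2022!"
tokens = words(sample)
print(tokens)
```

Note that `Don't` becomes the two tokens `don` and `t`, so the word counts reported later in this notebook are approximate.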

---

## Checking Twitter Data

The output from your Twitter API pull should be two files per artist, with names like `cher_followers.txt` (a list of all follower IDs you pulled) and `cher_followers_data.txt` (tab-delimited details for those followers). These files should be in a folder named `twitter` within the repository directory. This code summarizes the information at a high level to help the instructor evaluate your work.
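
As a rough illustration of that layout, the sketch below builds the two expected file names for a handle and reports which ones are missing from a folder. The helper names `expected_twitter_files` and `missing_twitter_files` are hypothetical and not part of the assignment code, and the demo uses a temporary directory rather than your real `twitter` folder.

```python
import os
import tempfile

def expected_twitter_files(handle):
    # Hypothetical helper: the two file names described above for one artist.
    return [f"{handle}_followers.txt", f"{handle}_followers_data.txt"]

def missing_twitter_files(folder, handle):
    # Compare the expected names against what is actually in the folder.
    present = set(os.listdir(folder))
    return [f for f in expected_twitter_files(handle) if f not in present]

# Demo in a throwaway directory that contains only the ID file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "cher_followers.txt"), "w").close()
    missing = missing_twitter_files(tmp, "cher")

print(missing)
```

In a real check you would pass `"twitter"` as the folder and call the function once per artist handle.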

In [9]:
twitter_files = os.listdir("twitter")
twitter_files = [f for f in twitter_files if f != ".DS_Store"]
artist_handles = list(set([name.split("_")[0] for name in twitter_files]))

print(f"We see two artist handles: {artist_handles[0]} and {artist_handles[1]}.")

We see two artist handles: robynkonichiwa and cher.


In [11]:
for artist in artist_handles :
    follower_file = artist + "_followers.txt"
    follower_data_file = artist + "_followers_data.txt"
    
    # read the follower IDs, closing the file when done
    with open("twitter/" + follower_file,'r') as id_file :
        ids = id_file.readlines()
    
    print(f"We see {len(ids)-1} in your follower file for {artist}, assuming a header row.")
    
    with open("twitter/" + follower_data_file,'r') as infile :
        
        # check the headers
        headers = infile.readline().split("\t")
        
        print(f"In the follower data file ({follower_data_file}) for {artist}, we have these columns:")
        print(" : ".join(headers))
        
        description_words = []
        locations = set()
        
        
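        idx = -1  # so a file with no data rows reports zero below
        for idx, line in enumerate(infile) :
            line = line.strip("\n").split("\t")
            
            try : 
                locations.add(line[3])            
                description_words.extend(words(line[6]))
            except IndexError :
                # short or malformed rows lack these columns; skip them
                pass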
    
        

        print(f"We have {idx+1} data rows for {artist} in the follower data file.")

        print(f"For {artist} we have {len(locations)} unique locations.")

        print(f"For {artist} we have {len(description_words)} words in the descriptions.")
        print("Here are the five most common words:")
        print(Counter(description_words).most_common(5))

        
        print("")
        print("-"*40)
        print("")
    

We see 358461 in your follower file for robynkonichiwa, assuming a header row.
In the follower data file (robynkonichiwa_followers_data.txt) for robynkonichiwa, we have these columns:
screen_name : name : id : location : followers_count : friends_count : description

We have 358372 data rows for robynkonichiwa in the follower data file.
For robynkonichiwa we have 59211 unique locations.
For robynkonichiwa we have 2082459 words in the descriptions.
Here are the five most common words:
[('i', 45774), ('and', 45074), ('the', 34883), ('a', 31554), ('of', 25974)]

----------------------------------------

We see 3995235 in your follower file for cher, assuming a header row.
In the follower data file (cher_followers_data.txt) for cher, we have these columns:
screen_name : name : id : location : followers_count : friends_count : description

We have 3994803 data rows for cher in the follower data file.
For cher we have 463634 unique locations.
For cher we have 22827257 words in the descriptions.
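
The follower data check relies on each row being tab-delimited, with the location in the fourth column and the description in the seventh. A minimal sketch of that parsing on made-up rows follows. The sample rows and the `summarize_rows` helper are hypothetical, and rows with fewer than seven fields are skipped entirely, which is a slight simplification of the `try`/`except` in the cell above.

```python
import re
from collections import Counter

def words(text):
    # Same extractor as earlier in the notebook.
    return re.findall(r'\w+', text.lower())

def summarize_rows(lines):
    # Hypothetical helper: collect unique locations and description word counts.
    locations = set()
    description_words = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # assumption: ignore rows missing the description column
        locations.add(fields[3])
        description_words.extend(words(fields[6]))
    return locations, Counter(description_words)

# Made-up rows in the column order shown in the output above.
sample_rows = [
    "user_a\tA\t1\tSan Diego\t10\t20\tI love music and dancing\n",
    "user_b\tB\t2\tStockholm\t5\t7\tMusic fan\n",
    "user_c\tC\t3\n",  # short row, skipped
]
locations, counts = summarize_rows(sample_rows)
print(sorted(locations))
print(counts.most_common(1))
```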

## Checking Lyrics 

The output from your lyrics scrape should be stored in files at this path, relative to the repository directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work.
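
One way to write a file into that layout is sketched below. The `save_lyrics` helper, the file name, and the placeholder text are assumptions for illustration; your scraper may derive file names from the URL differently. The demo writes into a temporary directory standing in for the repository.

```python
import os
import tempfile

def save_lyrics(root, artist, filename, text):
    # Hypothetical helper: create lyrics/<artist>/ under root and write one file.
    artist_dir = os.path.join(root, "lyrics", artist)
    os.makedirs(artist_dir, exist_ok=True)
    path = os.path.join(artist_dir, filename)
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(text)
    return path

# A temporary directory plays the role of the repository directory.
with tempfile.TemporaryDirectory() as repo:
    saved = save_lyrics(repo, "cher", "cher_believe.txt", "placeholder lyrics text")
    relative = os.path.relpath(saved, repo)
    exists = os.path.isfile(saved)

print(relative)
```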

In [12]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    # keep only text-like files, matching on the extension rather than any substring
    artist_files = [f for f in artist_files if f.endswith((".txt", ".csv", ".tsv"))]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For robyn we have 104 files.
For robyn we have roughly 31297 words, 2242 are unique.
For cher we have 316 files.
For cher we have roughly 74133 words, 3696 are unique.
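
The totals above are approximate. One likely reason is that the `\w+` tokenizer splits contractions into separate tokens. The small sketch below, on a made-up lyric line, shows how the total and unique counts are derived and how that splitting affects them.

```python
import re

def words(text):
    # Same extractor as earlier in the notebook.
    return re.findall(r'\w+', text.lower())

# Hypothetical lyric line, not taken from the scraped data.
lyrics = "I can't stop, I won't stop"
tokens = words(lyrics)

total = len(tokens)        # every token, including the stray "t" from contractions
unique = len(set(tokens))  # distinct tokens only
print(total, unique)
```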
