# ADS 509 Assignment 2.1: Tokenization, Normalization, Descriptive Statistics 

This notebook holds Assignment 2.1 for Module 2 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In the previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we explore some of the textual features of those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 

This assignment asks you to write a short function to calculate some descriptive statistics on a piece of text. Then you are asked to find some interesting and unique statistics on your corpora. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [1]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
import string

import nltk
nltk.download('stopwords')

sw = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/summerpurschke/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Add any additional import statements you need here
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
import copy
import matplotlib.pyplot as plt  # pyplot, not the top-level package, provides hist/show

In [3]:
# change `data_location` to the location of the folder on your machine.
data_location = "/Users/summerpurschke/Desktop/ADS/ADS509(TextMining)/Mod2"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "/twitter/"
lyrics_folder = "/lyrics/"

In [4]:
def descriptive_stats(tokens, num_tokens=5, verbose=True):
    """Return [token count, unique token count, lexical diversity, character count].

    `num_tokens` is how many of the most common tokens to print when `verbose`.
    """
    # Keep the total in its own variable so the `num_tokens` argument is not overwritten
    total_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    # Guard against division by zero when the token list is empty
    lexical_diversity = num_unique_tokens / total_tokens if total_tokens else 0.0
    num_characters = sum(len(token) for token in tokens)

    if verbose:
        print(f"There are {total_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")

        # Print the `num_tokens` most common tokens
        fdist = FreqDist(tokens)
        print(f"The {num_tokens} most common tokens are:")
        for token, frequency in fdist.most_common(num_tokens):
            print(f"{token}: {frequency}")

    return [total_tokens, num_unique_tokens, lexical_diversity, num_characters]

In [5]:
text = """here is some example text with other example text here in this text""".split()
assert(descriptive_stats(text, verbose=True)[0] == 13)
assert(descriptive_stats(text, verbose=False)[1] == 9)
assert(abs(descriptive_stats(text, verbose=False)[2] - 0.69) < 0.02)
assert(descriptive_stats(text, verbose=False)[3] == 55)

There are 13 tokens in the data.
There are 9 unique tokens in the data.
There are 55 characters in the data.
The lexical diversity is 0.692 in the data.
The 5 most common tokens are:
text: 3
here: 2
example: 2
is: 1
some: 1
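
As a cross-check, the same figures can be reproduced with only the standard library. The sketch below is not part of the assignment solution; it assumes plain whitespace tokenization and uses `collections.Counter` in place of NLTK's `FreqDist`.

```python
from collections import Counter

tokens = "here is some example text with other example text here in this text".split()

total = len(tokens)                        # every token, repeats included
unique = len(set(tokens))                  # distinct tokens only
diversity = unique / total                 # lexical diversity = unique / total
characters = sum(len(t) for t in tokens)   # whitespace is not counted

# Most frequent tokens with their counts
top_three = Counter(tokens).most_common(3)

print(total, unique, round(diversity, 3), characters)
print(top_three)
```

Lexical diversity here is simply 9 unique tokens divided by 13 total tokens, which rounds to 0.692.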


Q: Why is it beneficial to use assertion statements in your code? 

A: Assertion statements are a great way to check your own code for errors. They help identify and diagnose issues during development and can serve as documentation for other programmers reading your code. 
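
To illustrate the answer, here is a minimal sketch of an assertion that carries its own error message. The helper `lexical_diversity` is a hypothetical function written for this example, not one supplied with the assignment.

```python
def lexical_diversity(tokens):
    """Return the share of tokens that are distinct (hypothetical helper)."""
    # The message after the comma is shown if the assertion fails
    assert len(tokens) > 0, "tokens must not be empty"
    return len(set(tokens)) / len(tokens)


# Passing case: 3 distinct tokens out of 4
assert lexical_diversity(["a", "b", "a", "c"]) == 0.75

# Failing case: catch the error to inspect the message
failure_message = None
try:
    lexical_diversity([])
except AssertionError as err:
    failure_message = str(err)
    print(f"Assertion failed: {failure_message}")
```

Python skips `assert` statements when run with the `-O` flag, so they suit development-time checks rather than validation of user input.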

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 




In [6]:
def create_lyrics_dictionary(folder_path):
    lyrics_dict = {}

    # Iterate over each item in the folder
    for item_name in os.listdir(folder_path):
        item_path = os.path.join(folder_path, item_name)

        # Check if the item is a folder (artist folder)
        if os.path.isdir(item_path):
            artist_dict = {}

            # Iterate over each file in the artist folder
            for filename in os.listdir(item_path):
                file_path = os.path.join(item_path, filename)

                # Check if the item is a file (song file)
                if os.path.isfile(file_path):
                    with open(file_path, "r") as file:
                        # Read the contents of the file
                        lines = file.readlines()

                        if lines:
                            title = lines[0].strip()  # Assuming the first line contains the title
                            lyrics = ''.join(lines[1:])  # Combine the remaining lines as the lyrics

                            # Add the song lyrics to the artist's dictionary with the title as the inner key
                            artist_dict[title] = lyrics

            # Add the artist's dictionary to the main lyrics dictionary
            lyrics_dict[item_name] = artist_dict

    return lyrics_dict

# Call the function with the lyrics folder path
lyrics_dictionary = create_lyrics_dictionary(data_location + lyrics_folder)


In [7]:
# Read in the twitter data

twitter_files = os.listdir(data_location + twitter_folder)
desc_files = [f for f in twitter_files if "followers_data" in f]
twitter_data = defaultdict(list)
for f in desc_files:
    artist = f.split("_")[0]

    # The original run failed here with:
    #   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 8: invalid continuation byte
    # One workaround, assuming the cause is a few bytes that are not valid UTF-8,
    # is errors='replace', which substitutes U+FFFD for the undecodable bytes.
    with open(data_location + twitter_folder + f, 'r', encoding='utf8', errors='replace') as infile:
        next(infile)  # skip the header row
        for line in infile:
            fields = line.strip().split("\t")
            # Keep only complete rows; the description is the seventh field
            if len(fields) == 7:
                twitter_data[artist].append(fields[6])

twitter_data = dict(twitter_data)

## Data Cleaning

Now clean and tokenize your data. Remove punctuation characters (available in the `punctuation` object in the `string` library), split on whitespace, fold to lowercase, and remove stopwords. Store your cleaned data, which must be accessible as an iterable for `descriptive_stats`, in new objects or in new columns in your data frame. 



In [None]:
punctuation = set(punctuation) # speeds up comparison

In [None]:
# Define function for removing punctuation from dictionary 
def clean_text(text):
    cleaned_text = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned_text

Clean Lyrics

In [None]:
# Initiate a new dictionary that is empty - will add cleaned data as it's processed 
lyrics_dictionary_cleaned = {}

# Fold to lowercase and populate lyrics_dictionary_cleaned
for artist, songs in lyrics_dictionary.items():
    cleaned_songs = {}
    for song, lyrics in songs.items():
        cleaned_lyrics = lyrics.lower()
        cleaned_songs[song] = cleaned_lyrics
    lyrics_dictionary_cleaned[artist] = cleaned_songs

# Remove stopwords directly from lyrics_dictionary_cleaned
for artist, songs in lyrics_dictionary_cleaned.items():
    for song, lyrics in songs.items():
        # Split the lyrics into individual words
        words = lyrics.split()
        # Remove stopwords from the list of words
        cleaned_words = [word for word in words if word.lower() not in sw]
        # Join the cleaned words back into a single string
        cleaned_lyrics = " ".join(cleaned_words)
        # Update the lyrics in the lyrics_dictionary_cleaned
        lyrics_dictionary_cleaned[artist][song] = cleaned_lyrics

# Remove punctuation directly from lyrics_dictionary_cleaned
for artist, songs in lyrics_dictionary_cleaned.items():
    cleaned_songs = {}
    for song, lyrics in songs.items():
        # Remove punctuation marks
        cleaned_lyrics = lyrics.translate(str.maketrans("", "", string.punctuation))
        cleaned_songs[song] = cleaned_lyrics
    lyrics_dictionary_cleaned[artist] = cleaned_songs

# split at whitespace 
for artist, songs in lyrics_dictionary_cleaned.items():
    for song, lyrics in songs.items():
        split_lyrics = lyrics.split()
        lyrics_dictionary_cleaned[artist][song] = split_lyrics

In [None]:
# Flatten each artist's songs into one token list (drops the inner song-title keys)
for artist, songs in lyrics_dictionary_cleaned.items():
    word_list = []
    for song, lyrics in songs.items():
        word_list.extend(lyrics)
    lyrics_dictionary_cleaned[artist] = word_list

Clean Twitter Data

In [None]:
# Initiate a new dictionary that is empty - will add cleaned data as it's processed 
twitter_data_cleaned = {}

# Fold to lowercase and populate the cleaned dictionary
for artist, tweets in twitter_data.items():
    cleaned_tweets = {}
    for idx, words in enumerate(tweets):
        cleaned_words = words.lower()
        cleaned_tweets[f'tweet{idx+1}'] = cleaned_words
    twitter_data_cleaned[artist] = cleaned_tweets

# Remove stopwords directly from twitter_data_cleaned
for artist, tweets in twitter_data_cleaned.items():
    for tweet, words in tweets.items():
        # Split the words into individual tokens
        tokens = words.split()
        # Remove stopwords from the list of tokens
        cleaned_tokens = [token for token in tokens if token.lower() not in sw]
        # Join the cleaned tokens back into a single string
        cleaned_words = " ".join(cleaned_tokens)
        # Update the words in the twitter_data_cleaned
        twitter_data_cleaned[artist][tweet] = cleaned_words

# Remove punctuation directly from twitter_data_cleaned
for artist, tweets in twitter_data_cleaned.items():
    cleaned_tweets = {}
    for tweet, words in tweets.items():
        cleaned_words = ''.join(character for character in words if character not in string.punctuation)
        cleaned_tweets[tweet] = cleaned_words
    twitter_data_cleaned[artist] = cleaned_tweets

# split at whitespace 
for artist, tweets in twitter_data_cleaned.items():
    for tweet, words in tweets.items():
        split_words = words.split()
        twitter_data_cleaned[artist][tweet] = split_words

In [None]:
# Flatten each artist's descriptions into one token list (drops the inner tweet keys)
for artist, tweets in twitter_data_cleaned.items():
    word_list = []
    for tweet, words in tweets.items():
        word_list.extend(words)
    twitter_data_cleaned[artist] = word_list
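
The cleaning steps in this section can also be condensed into a single helper. The sketch below is one possible arrangement, not the course solution: `prepare`, `EXAMPLE_STOPWORDS`, and `PUNCT_TABLE` are names made up for illustration, and the tiny stopword set stands in for the NLTK list `sw` so the example runs on its own.

```python
import string

# Stand-in stopword list so the sketch runs without NLTK; the notebook uses `sw`
EXAMPLE_STOPWORDS = {"i", "the", "a", "to", "and", "in", "is"}
# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def prepare(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(PUNCT_TABLE)
    tokens = no_punct.split()
    return [token for token in tokens if token not in stopwords]


print(prepare("I believe in Life after LOVE, and the dance-floor!"))
```

Note the order of operations: this sketch strips punctuation before removing stopwords, whereas the cells above remove stopwords first. Stripping punctuation first turns a contraction such as "don't" into "dont", which would no longer match an apostrophized stopword entry, so the two orders can produce slightly different token lists.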

## Basic Descriptive Statistics

Call your `descriptive_stats` function on both your lyrics data and your twitter data and for both artists (four total calls). 

Q: How do you think the "top 5 words" would be different if we left stopwords in the data? 

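A: If we left stopwords in the data, the top 5 words for every corpus would likely be dominated by common function words such as "the", "and", and "to", so the lists for each Twitter account and lyrics collection would look much more alike than they do with stopwords removed. 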

---

Q: What were your prior beliefs about the lexical diversity between the artists? Does the difference (or lack thereof) in lexical diversity between the artists conform to your prior beliefs? 

A: The artists have similar lexical diversity, which I expected knowing fairly little about both artists. I knew that the artists are similar in style and public stances. 

In [None]:
#Create Token Lists for Lyrics 
cher_lyric_tokens= lyrics_dictionary_cleaned['cher']

# call descriptive stats
descriptive_stats(cher_lyric_tokens, num_tokens=5)

In [None]:
#Create Token Lists for Lyrics 
robyn_lyric_tokens= lyrics_dictionary_cleaned['robyn']

# call descriptive stats
descriptive_stats(robyn_lyric_tokens)

Twitter Data

In [None]:
# Create Token Lists for Twitter Data
cher_twitter_tokens= twitter_data_cleaned['cher']
# Call descriptive stats on Twitter Data
descriptive_stats(cher_twitter_tokens)

In [None]:
# Create Token Lists for Twitter Data
robyn_twitter_tokens= twitter_data_cleaned['robynkonichiwa']

# Call descriptive stats on Twitter Data
descriptive_stats(robyn_twitter_tokens)


## Specialty Statistics

The descriptive statistics we have calculated are quite generic. You will now calculate a handful of statistics tailored to these data.

1. Ten most common emojis by artist in the twitter descriptions.
1. Ten most common hashtags by artist in the twitter descriptions.
1. Five most common words in song titles by artist. 
1. For each artist, a histogram of song lengths (in terms of number of tokens) 

We can use the `emoji` library to help us identify emojis and you have been given a function to help you.


In [None]:
assert(emoji.is_emoji("❤️"))
assert(not emoji.is_emoji(":-)"))

### Emojis 😁

What are the ten most common emojis by artist in the twitter descriptions? 


In [None]:
#Create lists 
cher_tweets = twitter_data_cleaned['cher'] 
robyn_tweets = twitter_data_cleaned['robynkonichiwa']

In [None]:
# Count emoji characters in Cher's cleaned description tokens.
# Checking character by character splits multi-codepoint emojis into their parts.
cher_emoji_counter = Counter(
    ch for token in cher_tweets for ch in token if emoji.is_emoji(ch)
)

# Get the top 10 most common emojis
top_10_emojis = cher_emoji_counter.most_common(10)

# Print the top 10 most common emojis
# (loop variable is `emoji_char` so the imported `emoji` module is not shadowed)
print("Top 10 most common emojis in Cher's Tweets:")
for emoji_char, count in top_10_emojis:
    print(f"{emoji_char}: {count}")

In [None]:
# Count emoji characters in Robyn's cleaned description tokens.
# Checking character by character splits multi-codepoint emojis into their parts.
robyn_emoji_counter = Counter(
    ch for token in robyn_tweets for ch in token if emoji.is_emoji(ch)
)

# Get the top 10 most common emojis
top_10_emojis = robyn_emoji_counter.most_common(10)

# Print the top 10 most common emojis
# (loop variable is `emoji_char` so the imported `emoji` module is not shadowed)
print("Top 10 most common emojis in Robyn's Tweets:")
for emoji_char, count in top_10_emojis:
    print(f"{emoji_char}: {count}")

### Hashtags

What are the ten most common hashtags by artist in the twitter descriptions? 


In [None]:
for artist, tweets in twitter_data.items():
    word_list = []
    for tweet in tweets:
        words = tweet.split()  # Split the tweet at whitespace
        word_list.extend(words)
    twitter_data[artist] = word_list

In [None]:
# Create list of hashtags
cher_hashtags = [tweet.lower() for tweet in twitter_data['cher'] if tweet.lower().startswith('#')]

robyn_hashtags = [tweet.lower() for tweet in twitter_data['robynkonichiwa'] if tweet.lower().startswith('#')]

In [None]:
def find_top_10_hashtags(words):
    hashtags_counter = Counter()
    
    # Iterate over each word in the list
    for word in words:
        # Check if the word starts with a '#' character
        if word.startswith('#'):
            # Remove trailing punctuation (e.g. "#music," becomes "#music")
            hashtag = word.rstrip(string.punctuation)
            # Skip tokens that were only punctuation, such as a bare "#"
            if hashtag:
                hashtags_counter[hashtag] += 1
    
    # Get the top 10 most common hashtags
    top_10_hashtags = hashtags_counter.most_common(10)
    
    return top_10_hashtags

In [None]:
cher_top_hashtags = find_top_10_hashtags(cher_hashtags)

print("Top 10 most common hashtags in Cher's Tweets:")
# Print the top 10 hashtags
for hashtag, count in cher_top_hashtags:
    print(f"{hashtag}: {count}")

In [None]:
robyn_top_hashtags = find_top_10_hashtags(robyn_hashtags)

print("Top 10 most common hashtags in Robyn's Tweets:")
# Print the top 10 hashtags
for hashtag, count in robyn_top_hashtags:
    print(f"{hashtag}: {count}")
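
As an alternative to filtering whitespace-split tokens, a regular expression can pull hashtags straight from the raw description strings. This is a sketch under assumptions: `HASHTAG_PATTERN` and `top_hashtags` are hypothetical names, the sample descriptions are made up, and a hashtag is taken to be `#` followed by word characters, which may not match Twitter's own rules exactly.

```python
import re
from collections import Counter

# '#' followed by one or more word characters (letters, digits, underscore)
HASHTAG_PATTERN = re.compile(r"#\w+")


def top_hashtags(descriptions, n=10):
    """Count hashtags across raw description strings, case-insensitively."""
    counts = Counter()
    for description in descriptions:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(description))
    return counts.most_common(n)


# Made-up descriptions for illustration only
sample = [
    "Music lover. #BLM #music",
    "she/her #blm, #Music!",
    "no tags here",
]
print(top_hashtags(sample, n=2))
```

Because the pattern stops at the first non-word character, trailing commas and exclamation marks are dropped without a separate stripping step. It also matches a `#` in the middle of a token, which the `startswith('#')` approach does not.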

### Song Titles

What are the five most common words in song titles by artist? The song titles should be on the first line of the lyrics pages, so if you have kept the raw file contents around, you will not need to re-read the data.


In [None]:
# Create new dictionary of just song titles per artist 
song_titles = {}
for artist, songs in lyrics_dictionary.items():
    song_titles[artist] = list(songs.keys())

In [None]:
song_titles_cleaned = {}

# Iterate over each artist in the dictionary
for artist, songs in song_titles.items():
    # Create a list to store the cleaned song titles
    cleaned_titles = []
    
    # Iterate over each song title for the current artist
    for title in songs:
        # Convert the title to lowercase
        cleaned_title = title.lower()
        
        # Remove punctuation from the title
        cleaned_title = "".join(char for char in cleaned_title if char not in string.punctuation)
        
        # Split the title by whitespace
        words = cleaned_title.split()
        
        # Remove stop words from the title
        cleaned_words = [word for word in words if word not in sw]
        
        # Extend the cleaned_titles list with the cleaned words
        cleaned_titles.extend(cleaned_words)
    
    # Update the dictionary with the cleaned titles for the current artist
    song_titles_cleaned[artist] = cleaned_titles

In [None]:
def find_most_common_words(word_list, num_words=5):
    # Create a Counter object from the word list
    word_counts = Counter(word_list)
    
    # Get the most common words and their frequencies
    most_common_words = word_counts.most_common(num_words)
    
    return most_common_words

In [None]:
# Cher Song titles most common words:
cher_common_title_words = find_most_common_words(song_titles_cleaned['cher'], num_words=5)
print('Five most common words in Cher Song Titles: ')
print(cher_common_title_words)

In [None]:
# Robyn song titles most common words:
robyn_common_title_words = find_most_common_words(song_titles_cleaned['robyn'], num_words=5)
print('Five most common words in Robyn Song Titles: ')
print(robyn_common_title_words)

### Song Lengths

For each artist, a histogram of song lengths (in terms of number of tokens). If you put the song lengths in a data frame with an artist column, matplotlib will make the plotting quite easy. An example is given to help you out. 


In [None]:
# Build a per-song token list for each artist using the same cleaning steps as
# above: fold to lowercase, remove stopwords, strip punctuation, split on whitespace.
song_lengths = {}
remove_punct = str.maketrans("", "", string.punctuation)

for artist, songs in lyrics_dictionary.items():
    cleaned_songs = {}
    for song, lyrics in songs.items():
        # Lowercase, split into words, and drop stopwords
        words = [word for word in lyrics.lower().split() if word not in sw]
        # Strip punctuation, then split again in case a token became empty
        cleaned_songs[song] = " ".join(words).translate(remove_punct).split()
    song_lengths[artist] = cleaned_songs

In [None]:
# Replace each song's token list with its length (number of tokens) and
# strip the extra double quotes from the song-title keys.
# A new dictionary is built per artist because adding and deleting keys
# while iterating over a dictionary raises a RuntimeError.
for artist, songs in song_lengths.items():
    song_lengths[artist] = {
        song.strip('"'): len(tokens) for song, tokens in songs.items()
    }

In [None]:
cher_song_lengths = song_lengths['cher']
robyn_song_lengths = song_lengths['robyn']

In [None]:
from matplotlib import pyplot as plt


def plot_word_histogram(tokens, num_bins=5):
    """Plot a histogram of how often each distinct token occurs."""
    word_counts = Counter(tokens)
    frequencies = list(word_counts.values())

    plt.hist(frequencies, bins=num_bins)
    plt.xlabel('Word Frequency')
    plt.ylabel('Count')
    plt.title('Histogram of Word Occurrences')
    plt.show()


plot_word_histogram(cher_lyric_tokens, num_bins=5)



In [None]:
plot_word_histogram(cher_lyric_tokens, num_bins=10)

Since the lyrics may be stored with carriage returns or tabs, it may be useful to have a function that can collapse whitespace, using regular expressions, and be used for splitting. 

Q: What does the regular expression `'\s+'` match on? 

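A: It matches one or more consecutive whitespace characters (spaces, tabs, newlines, carriage returns, and similar), so a run of mixed whitespace is treated as a single separator. 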


In [None]:
collapse_whitespace = re.compile(r'\s+')

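def tokenize_lyrics(lyric):
    """Strip, split on runs of whitespace, and fold to lowercase."""
    # Stripping first avoids empty tokens at the edges; `if item` handles empty input
    return [item.lower() for item in collapse_whitespace.split(lyric.strip()) if item]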
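
To make the behavior of `\s+` concrete, the following sketch applies the pattern to a made-up lyric line containing spaces, a tab, and a carriage return plus newline. The variable names are chosen for this example only.

```python
import re

whitespace_run = re.compile(r"\s+")

# Made-up line with leading/trailing spaces, a tab, CRLF, and repeated spaces
lyric = "  Do you believe\tin life\r\nafter   love  "

# Splitting on runs of whitespace; strip first to avoid empty edge tokens
tokens = whitespace_run.split(lyric.strip())

# Replacing each run with one space collapses the text onto a single line
collapsed = whitespace_run.sub(" ", lyric).strip()

print(tokens)
print(collapsed)
```

Each run of whitespace, whatever mix of characters it contains, counts as one separator, so the tab and the `\r\n` pair behave the same as a single space.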

In [None]:
# Your lyric length comparison chart here. 
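
One way to build the comparison chart is sketched below. It assumes the lyrics are available as a dictionary keyed by artist, then song title, with a token list as the value; `song_token_lengths` and `plot_song_lengths` are hypothetical helpers, and the example data is made up. Matplotlib is imported inside the plotting function so the length calculation runs without it.

```python
from collections import defaultdict


def song_token_lengths(tokens_by_artist):
    """Map each artist to a list with one token count per song."""
    lengths = defaultdict(list)
    for artist, songs in tokens_by_artist.items():
        for tokens in songs.values():
            lengths[artist].append(len(tokens))
    return dict(lengths)


def plot_song_lengths(lengths_by_artist, bins=20):
    """Overlay one density histogram per artist (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    for artist, lengths in lengths_by_artist.items():
        # density=True keeps artists with different song counts comparable
        plt.hist(lengths, bins=bins, alpha=0.5, density=True, label=artist)
    plt.xlabel("Song length (tokens)")
    plt.ylabel("Density")
    plt.legend()
    plt.show()


# Made-up token lists standing in for real per-song tokens
example = {
    "cher": {"Song A": ["believe", "life", "love"], "Song B": ["strong", "enough"]},
    "robyn": {"Song C": ["dancing", "own"]},
}
lengths = song_token_lengths(example)
print(lengths)
# plot_song_lengths(lengths)  # uncomment in the notebook to draw the chart
```

Plotting densities rather than raw counts is a design choice: it keeps the two histograms on a comparable scale when one artist has many more songs than the other.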