# ADS 509 Assignment 2.1: Tokenization, Normalization, Descriptive Statistics 

This notebook holds Assignment 2.1 for Module 2 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In the previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we explore some of the textual features of those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 

This assignment asks you to write a short function to calculate some descriptive statistics on a piece of text. Then you are asked to find some interesting and unique statistics on your corpora. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [6]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation

sw = stopwords.words("english")

In [7]:
# Add any additional import statements you need here
from lexical_diversity import lex_div as ld
import csv
import html
import textacy.preprocessing as tprep
import spacy

In [8]:
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x29ee3410890>

In [9]:
# change `data_location` to the location of the folder on your machine.
data_location = "C:\\Users\\zfreitas\\Dropbox\\Classes\\USD\\ADS-509-01-SP23 - Applied Text Mining\\2. Module Two\\Assignment 1\\M1 Results\\"
# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "twitter\\"
lyrics_folder = "lyrics\\"


In [10]:
def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    """
        Given a list of tokens, print number of tokens, number of unique tokens, 
        number of characters, lexical diversity (https://en.wikipedia.org/wiki/Lexical_diversity), 
        and num_tokens most common tokens. Return a list with the number of tokens, number
        of unique tokens, lexical diversity, and number of characters. 
    
    """
    # Fill in the correct values here. 
    num_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    lexical_diversity = ld.ttr(tokens) # Simple TTR = len(Counter(text))/len(text)
    num_characters = sum([len(i) for i in tokens])
    
    if verbose:        
        print(f"There are {num_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
    
        # print the five most common tokens
        print(f"The {num_tokens} most common tokens")
        print(Counter(tokens).most_common(num_tokens))
        
    return([num_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters])
    

In [11]:
text = """here is some example text with other example text here in this text""".split()
assert(descriptive_stats(text, verbose=True)[0] == 13)
assert(descriptive_stats(text, verbose=False)[1] == 9)
assert(abs(descriptive_stats(text, verbose=False)[2] - 0.69) < 0.02)
assert(descriptive_stats(text, verbose=False)[3] == 55)


There are 13 tokens in the data.
There are 9 unique tokens in the data.
There are 55 characters in the data.
The lexical diversity is 0.692 in the data.
The 13 most common tokens
[('text', 3), ('here', 2), ('example', 2), ('is', 1), ('some', 1), ('with', 1), ('other', 1), ('in', 1), ('this', 1)]


Q: Why is it beneficial to use assertion statements in your code? 

A: Assertion statements are a way to make sure your code is working as you expect it to. You create examples that you know to be true and have the answer for. Then you run your code through your predefined examples to make sure it is running as expected.
  

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free all the descriptions read it into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 




In [12]:
# # Dictionary Approach 1 - first

# d[artist][title] = “the song lyrics as a string”
# d = {}
# # Get the directory location
# directory = data_location + lyrics_folder
# # Get all the subfolders in directory.
# artist_subfolders = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]


# # Get all the files in each of the subfolders
# for artist in artist_subfolders:
#     d[artist] = {}
#     for filename in os.listdir(directory + artist):
#         f = os.path.join(directory + artist, filename)
#         # checking if it is a file
#         if os.path.isfile(f):
#             with open(f) as file:
#                 title = file.readline().strip()
#                 d[artist][title] = file.read().strip()


In [73]:
# Read in the lyrics data

# Dictionary Approach 2 - Using defaultdict

# d[artist][title] = “the song lyrics as a string”
lyrics = defaultdict(lambda: defaultdict(str))
#  
# Get the directory location
directory = data_location + lyrics_folder
# Get all the subfolders in directory.
artist_subfolders = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]


# Get all the files in each of the subfolders
for artist in artist_subfolders:
    for filename in os.listdir(directory + artist):
        f = os.path.join(directory + artist, filename)
        # checking if it is a file
        if os.path.isfile(f):
            with open(f) as file:
                title = file.readline().strip()
                lyrics[artist][title] = file.read().strip()

In [74]:
lyrics["robyn"].get('"88 Days"')



In [75]:
print(len(lyrics['cher']))
print(len(lyrics['robyn']))

313
93


In [77]:
# Create lyrics Pandas Dataframe for Cleaning
artists = ['cher', 'robyn']

for i, artist in enumerate(artists):
    if i == 0:
        lyrics_df = pd.DataFrame(lyrics[artist].items(), columns=['title', 'lyrics'])
        lyrics_df['artist'] = artist
    lyrics_dfi = pd.DataFrame(lyrics[artist].items(), columns=['title', 'lyrics'])
    lyrics_dfi['artist'] = artist
    lyrics_df = pd.concat([lyrics_df, lyrics_dfi], ignore_index=True)

lyrics_df = lyrics_df.rename({0:"description"}, inplace=True,  axis=1)



In [43]:
# Read in the twitter data

# Dictionary Approach 2 - Using defaultdict
""""For the Twitter data, we only need the description field for this assignment. 
Feel free all the descriptions read it into a data structure. 
In the solution, I stored the descriptions as a dictionary of lists, 
with the key being the artist. 
"""
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

# d[artist][title] = “the song lyrics as a string”
desc = defaultdict(list)
#  
artists = ['cher', 'robyn']
filename = ['cher_followers_data.txt', 'robynkonichiwa_followers_data.txt']

# Get the directory location
directory = data_location + twitter_folder

# Get all the files in each of the subfolders
for i, artist in enumerate(artists):
    f = os.path.join(directory, filename[i])
    # checking if it is a file
    if os.path.isfile(f):
        with open(f, encoding="utf8") as f:
            reader = csv.DictReader(f, delimiter="\t") # read rows into a dictionary format
            for row in reader: # read a row as {column1: value1, column2: value2,...}
                for (k,v) in row.items(): # go over each column name and value 
                    if k == "description":
                        # append the value into the appropriate list
                        # based on column name k                        
                        desc[artist].append(v)



In [78]:
# Create Pandas Dataframe for Cleaning
for i, artist in enumerate(artists):
    if i == 0:
        twitter_df = pd.DataFrame.from_dict(desc[artist], orient='columns')
        twitter_df.rename({0:"description"}, inplace=True,  axis=1)
        twitter_df['artist'] = artist
    twitter_dfi = pd.DataFrame.from_dict(desc[artist])
    twitter_dfi.rename({0:"description"}, inplace=True,  axis=1)
    twitter_dfi['artist'] = artist
    twitter_df = pd.concat([twitter_df, twitter_dfi], ignore_index=True)


twitter_df = twitter_df.fillna(value="")

## Data Cleaning

Now clean and tokenize your data. Remove punctuation chacters (available in the `punctuation` object in the `string` library), split on whitespace, fold to lowercase, and remove stopwords. Store your cleaned data, which must be accessible as an interable for `descriptive_stats`, in new objects or in new columns in your data frame. 



In [79]:
punctuation = set(punctuation) # speeds up comparison

In [80]:
# Helper functions - Cleaning data



def clean(text):
    # convert html escapes like &amp; to characters.
    text = html.unescape(text)
    # tags like <tab>
    text = re.sub(r'<[^<>]*>', ' ', text)
    # markdown URLs like [Some text](https://....)
    text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
    # text or code in brackets like [0]
    text = re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
    # standalone sequences of hyphens like --- or ==
    text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
    # Remove Puncuation
    def remove_punc(s):
        return ''.join(ch for ch in s if ch not in punctuation)
    text = remove_punc(text)
    # sequences of white spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()




def normalize(text):
    text = tprep.normalize_hyphenated_words(text)
    text = tprep.normalize_quotation_marks(text)
    text = tprep.normalize_unicode(text)
    text = tprep.remove_accents(text)
    return text



In [81]:
# create your clean twitter data here

twitter_df['clean_description'] = twitter_df['description'].map(clean)

twitter_df



Unnamed: 0,description,artist,clean_description
0,,cher,
1,𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 & 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜,cher,𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜
2,163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらDMします🧡,cher,163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらDMします🧡
3,csu,cher,csu
4,Writer @Washinformer @SpelmanCollege alumna #D...,cher,Writer Washinformer SpelmanCollege alumna DCna...
...,...,...,...
8184476,"singer of songs, type 1 diabetic, tired $jakel...",robyn,singer of songs type 1 diabetic tired jakelgil...
8184477,Dadx2/ Con-Arch/ Photographer/ DK #stemgrønnes...,robyn,Dadx2 ConArch Photographer DK stemgrønnest grø...
8184478,A year to change a life is still a year ✨😌,robyn,A year to change a life is still a year ✨😌
8184479,Head of Consumer - Mango. Made in Melbourne. R...,robyn,Head of Consumer Mango Made in Melbourne Rambl...


In [84]:
# create your clean lyrics data here

lyrics_df.columns()

AttributeError: 'NoneType' object has no attribute 'columns'

## Basic Descriptive Statistics

Call your `descriptive_stats` function on both your lyrics data and your twitter data and for both artists (four total calls). 

In [None]:
# calls to descriptive_stats here

Q: How do you think the "top 5 words" would be different if we left stopwords in the data? 

A: 

---

Q: What were your prior beliefs about the lexical diversity between the artists? Does the difference (or lack thereof) in lexical diversity between the artists conform to your prior beliefs? 

A: 



## Specialty Statistics

The descriptive statistics we have calculated are quite generic. You will now calculate a handful of statistics tailored to these data.

1. Ten most common emojis by artist in the twitter descriptions.
1. Ten most common hashtags by artist in the twitter descriptions.
1. Five most common words in song titles by artist. 
1. For each artist, a histogram of song lengths (in terms of number of tokens) 

We can use the `emoji` library to help us identify emojis and you have been given a function to help you.


In [6]:
assert(emoji.is_emoji("❤️"))
assert(not emoji.is_emoji(":-)"))

### Emojis 😁

What are the ten most common emojis by artist in the twitter descriptions? 


In [None]:
# Your code here

### Hashtags

What are the ten most common hashtags by artist in the twitter descriptions? 


In [None]:
# Your code here

### Song Titles

What are the five most common words in song titles by artist? The song titles should be on the first line of the lyrics pages, so if you have kept the raw file contents around, you will not need to re-read the data.


In [None]:
# Your code here

### Song Lengths

For each artist, a histogram of song lengths (in terms of number of tokens). If you put the song lengths in a data frame with an artist column, matplotlib will make the plotting quite easy. An example is given to help you out. 


In [None]:
num_replicates = 1000

df = pd.DataFrame({
    "artist" : ['Artist 1'] * num_replicates + ['Artist 2']*num_replicates,
    "length" : np.concatenate((np.random.poisson(125,num_replicates),np.random.poisson(150,num_replicates)))
})

df.groupby('artist')['length'].plot(kind="hist",density=True,alpha=0.5,legend=True)

Since the lyrics may be stored with carriage returns or tabs, it may be useful to have a function that can collapse whitespace, using regular expressions, and be used for splitting. 

Q: What does the regular expression `'\s+'` match on? 

A: 


In [None]:
collapse_whitespace = re.compile(r'\s+')

def tokenize_lyrics(lyric) : 
    """strip and split on whitespace"""
    return([item.lower() for item in collapse_whitespace.split(lyric)])

In [None]:
# Your lyric length comparison chart here. 