# ADS 509 Module 3: Group Comparison 

The task of comparing two groups of text is fundamental to textual analysis. There are innumerable applications: survey respondents from different segments of customers, speeches by different political parties, words used in Tweets by different constituencies, etc. In this assignment you will build code to effect comparisons between groups of text data, using the ideas learned in reading and lecture.

This assignment asks you to analyze the lyrics and Twitter descriptions for the two artists you selected in Module 1. If the results from that pull were not to your liking, you are welcome to use the zipped data from the “Assignment Materials” section. Specifically, you are asked to do the following: 

* Read in the data, normalize the text, and tokenize it. When you tokenize your Twitter descriptions, keep hashtags and emojis in your token set. 
* Calculate descriptive statistics on the two sets of lyrics and compare the results. 
* For each of the four corpora, find the words that are unique to that corpus. 
* Build word clouds for all four corpora. 

Each one of the analyses has a section dedicated to it below. Before beginning the analysis there is a section for you to read in the data and do your cleaning (tokenization and normalization). 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [42]:
import os
import re
import emoji
import pandas as pd

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
from wordcloud import WordCloud 

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


In [43]:
# Use this space for any additional import statements you need
from nltk.tokenize import word_tokenize
import statistics
#from sklearn.feature_extraction import text

In [44]:
# Place any additional functions or constants you need here. 

# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords
sw = stopwords.words("english") # given


# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis
all_language_emojis = set()    # create a set 

def emoji_country(ls):
    """Add every emoji character found in the strings of `ls` to `all_language_emojis`."""
    for country in ls:
        for em in country:
            if emoji.is_emoji(em):
                all_language_emojis.add(em)  # sets are unordered; duplicates are ignored


# and now our functions
def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    """Print and return basic statistics for a list of tokens.

    `num_tokens` is the number of most common tokens to print.
    """
    # Use a separate name so the `num_tokens` argument is not overwritten
    total_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    lexical_diversity = num_unique_tokens / total_tokens  # unique tokens / total tokens
    num_characters = sum(len(token) for token in tokens)

    if verbose :
        print(f"There are {total_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")

        # print the most common tokens
        print(Counter(tokens).most_common(num_tokens))

    return total_tokens, num_unique_tokens, lexical_diversity, num_characters


    
def contains_all_language_set(s):
    return(s in all_language_emojis)

def is_emoji(s):
    return(emoji.is_emoji(s))



def contains_emoji(s):
    
    s = str(s)
    emojis = [ch for ch in s if is_emoji(ch)]

    return(len(emojis) > 0)

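def remove_stop(tokens) :
    """Drop stopwords from each whitespace-delimited string in a pandas Series."""
    stop_set = set(sw)  # set membership is faster than scanning the list
    return tokens.apply(lambda x: ' '.join(word for word in x.split() if word not in stop_set))


def remove_punctuation(text) :
    """Strip every character that is not a word character or whitespace from a Series of strings.

    Note: this pattern also removes '#' and most emoji characters.
    """
    # regex=True is required for the pattern to be treated as a regular expression
    return text.str.replace(r'[^\w\s]', '', regex=True)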

def tokenize(text) : 
    """ Tokenize each string in a pandas Series with NLTK's word_tokenize.
        Unlike a plain whitespace split, word_tokenize also splits some
        contractions (for example 'gonna' becomes 'gon' and 'na'). """
    return text.apply(word_tokenize)


def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

def lower(text):

    return(text.lower())

In [None]:
# Quick check that emoji_country collects emoji characters
c = ["✨"]
emoji_country(c)
print(all_language_emojis)

## Data Ingestion

Use this section to ingest your data into the data structures you plan to use. Typically this will be a dictionary or a pandas DataFrame.

In [45]:
# Feel free to use the below cells as an example or read in the data in a way you prefer

data_location = "/Users/Blaven/ADS-509_Assignment/M1 Results" # change to your location if it is not in the same directory as your notebook
twitter_folder = "/twitter/"
lyrics_folder = "/Users/Blaven/ADS-509_Assignment/M1 Results/lyrics"

artist_files = {'cher':'cher_followers_data.txt',
                'robyn':'robynkonichiwa_followers_data.txt'}


In [46]:
twitter_data = pd.read_csv(data_location + twitter_folder + artist_files['cher'],
                           sep="\t",
                           quoting=3)
# read in the tab-separated file
twitter_data['artist'] = "cher" # create an artist column and store "cher"

In [47]:
twitter_data_2 = pd.read_csv(data_location + twitter_folder + artist_files['robyn'],
                             sep="\t",
                             quoting=3)
twitter_data_2['artist'] = "robyn"

twitter_data = pd.concat([
    twitter_data,twitter_data_2])
    
del(twitter_data_2)

In [48]:
twitter_data=twitter_data.fillna(" ")
twitter_data

Unnamed: 0,screen_name,name,id,location,followers_count,friends_count,description,artist
0,hsmcnp,Country Girl,35152213,,1302,1014,,cher
1,horrormomy,Jeny,742153090850164742,Earth,81,514,𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 & 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜,cher
2,anju79990584,anju,1496463006451974150,,13,140,163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらDMします🧡,cher
3,gallionjenna,J,3366479914,,752,556,csu,cher
4,bcscomm,bcscomm,83915043,"Washington, DC",888,2891,Writer @Washinformer @SpelmanCollege alumna #D...,cher
...,...,...,...,...,...,...,...,...
358367,jakegiles,Jake Giles,19725102,LA,7690,2165,"singer of songs, type 1 diabetic, tired $jakel...",robyn
358368,axelbluhme,Axel Bluhme,19573759,DK,238,1565,Dadx2/ Con-Arch/ Photographer/ DK #stemgrønnes...,robyn
358369,RayRayRay27,Rachael :),19919217,Oldham,762,1479,A year to change a life is still a year ✨😌,robyn
358370,bhandberg,Ben Handberg,12642462,Sydney,432,593,Head of Consumer - Mango. Made in Melbourne. R...,robyn


In [5]:
# Reads every lyrics file for an artist and returns the file names and lyrics as two Series

def tester(artist):
    path = lyrics_folder+artist 
    titles = []  # avoid shadowing the built-in `list`
    lyrics_storage = []
    for (root, dirs, files) in os.walk(path):
        for f in files:
            if f.endswith('.txt'):
                titles.append(f)
                # join with `root` so files in subfolders are found as well
                with open(os.path.join(root, f), 'r') as l_storage:
                    lyrics_storage.append(l_storage.read())

    x = pd.Series(titles)  
    y = pd.Series(lyrics_storage)

    return x,y

In [8]:
#tester("/robyn")
#("/cher")

x,y=tester("/robyn")
#display(x)
#display(y)

In [7]:


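# Load Cher's files explicitly so this cell does not depend on which artist was read last
cher_titles, cher_lyrics = tester("/cher")

lyrics_data = pd.DataFrame()
lyrics_data['Lyrics'] = cher_lyrics
lyrics_data['Titles'] = cher_titles
lyrics_data['artist'] = "cher"
display(lyrics_data)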

Unnamed: 0,Lyrics,Titles,artist
0,"""Come And Stay With Me""\n\n\n\nI'll send away ...",cher_comeandstaywithme.txt,cher
1,"""Pirate""\n\n\n\nHe'll sail on with the summer ...",cher_pirate.txt,cher
2,"""Stars""\n\n\n\nI was never one for saying what...",cher_stars.txt,cher
3,"""These Days""\n\n\n\nWell I've been out walking...",cher_thesedays.txt,cher
4,"""Love So High""\n\n\n\nEvery morning I would wa...",cher_lovesohigh.txt,cher
...,...,...,...
311,"""Take It From The Boys""\n\n\n\nSo scared I nev...",cher_takeitfromtheboys.txt,cher
312,"""Dream Baby""\n\n\n\nI found the boy for me\nHe...",cher_dreambaby.txt,cher
313,"""Please Don't Tell Me""\n\n\n\nYa shook the ove...",cher_pleasedonttellme.txt,cher
314,"""I Hope You Find It""\n\n\n\nThese clouds aren'...",cher_ihopeyoufindit.txt,cher


In [9]:
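# Load Robyn's files explicitly before building her frame
robyn_titles, robyn_lyrics = tester("/robyn")

lyrics_data2 = pd.DataFrame()
lyrics_data2['Lyrics'] = robyn_lyrics
lyrics_data2['Titles'] = robyn_titles
lyrics_data2['artist'] = "robyn"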

lyrics_data = pd.concat([
    lyrics_data,lyrics_data2])
del(lyrics_data2)
display(lyrics_data)

Unnamed: 0,Lyrics,Titles,artist
0,"""Come And Stay With Me""\n\n\n\nI'll send away ...",cher_comeandstaywithme.txt,cher
1,"""Pirate""\n\n\n\nHe'll sail on with the summer ...",cher_pirate.txt,cher
2,"""Stars""\n\n\n\nI was never one for saying what...",cher_stars.txt,cher
3,"""These Days""\n\n\n\nWell I've been out walking...",cher_thesedays.txt,cher
4,"""Love So High""\n\n\n\nEvery morning I would wa...",cher_lovesohigh.txt,cher
...,...,...,...
99,"""Do You Really Want Me (Show Respect)""\n\n\n\n...",robyn_doyoureallywantmeshowrespect.txt,robyn
100,"""The Last Time""\n\n\n\nEver since I was a litt...",robyn_thelasttime.txt,robyn
101,"""Missing U""\n\n\n\nBaby, it's so weird to me n...",robyn_missingu.txt,robyn
102,"""Breakdown Intermission""\n\n\n\nHey my friend ...",robyn_breakdownintermission.txt,robyn


In [10]:
# Casefold the lyrics to lowercase
lyrics_data['Lyrics']=lyrics_data['Lyrics'].str.lower()
lyrics_data['Lyrics']

0      "come and stay with me"\n\n\n\ni'll send away ...
1      "pirate"\n\n\n\nhe'll sail on with the summer ...
2      "stars"\n\n\n\ni was never one for saying what...
3      "these days"\n\n\n\nwell i've been out walking...
4      "love so high"\n\n\n\nevery morning i would wa...
                             ...                        
99     "do you really want me (show respect)"\n\n\n\n...
100    "the last time"\n\n\n\never since i was a litt...
101    "missing u"\n\n\n\nbaby, it's so weird to me n...
102    "breakdown intermission"\n\n\n\nhey my friend ...
103    "play"\n\n\n\num'um'\nsome people live their l...
Name: Lyrics, Length: 420, dtype: object

In [11]:
#remove_punctuation(lyrics_data['Lyrics'])
#remove_stop(lyrics_data['Lyrics'])
lyrics_data["Lyrics"]=remove_punctuation(lyrics_data["Lyrics"])

lyrics_data["Lyrics"]




0      come and stay with me\n\n\n\nill send away all...
1      pirate\n\n\n\nhell sail on with the summer win...
2      stars\n\n\n\ni was never one for saying what i...
3      these days\n\n\n\nwell ive been out walking \n...
4      love so high\n\n\n\nevery morning i would wake...
                             ...                        
99     do you really want me show respect\n\n\n\nboy ...
100    the last time\n\n\n\never since i was a little...
101    missing u\n\n\n\nbaby its so weird to me now t...
102    breakdown intermission\n\n\n\nhey my friend i ...
103    play\n\n\n\numum\nsome people live their lives...
Name: Lyrics, Length: 420, dtype: object

In [12]:
lyrics_data["Lyrics"]= tokenize(lyrics_data["Lyrics"])


In [None]:
lyrics_data["Lyrics"]

In [83]:
lyrics_data

Unnamed: 0,Lyrics,Titles,artist
0,"[come, and, stay, with, me, ill, send, away, a...",cher_comeandstaywithme.txt,cher
1,"[pirate, hell, sail, on, with, the, summer, wi...",cher_pirate.txt,cher
2,"[stars, i, was, never, one, for, saying, what,...",cher_stars.txt,cher
3,"[these, days, well, ive, been, out, walking, a...",cher_thesedays.txt,cher
4,"[love, so, high, every, morning, i, would, wak...",cher_lovesohigh.txt,cher
...,...,...,...
99,"[do, you, really, want, me, show, respect, boy...",robyn_doyoureallywantmeshowrespect.txt,robyn
100,"[the, last, time, ever, since, i, was, a, litt...",robyn_thelasttime.txt,robyn
101,"[missing, u, baby, its, so, weird, to, me, now...",robyn_missingu.txt,robyn
102,"[breakdown, intermission, hey, my, friend, i, ...",robyn_breakdownintermission.txt,robyn


## Tokenization and Normalization

In this next section, tokenize and normalize your data. We recommend the following cleaning. 

**Lyrics** 

* Remove song titles
* Casefold to lowercase
* Remove punctuation
* Split on whitespace
* Remove stopwords (optional)

Removal of stopwords is up to you. Your descriptive statistic comparison will be different if you include stopwords, though TF-IDF should still find interesting features for you.

**Twitter Descriptions** 

* Casefold to lowercase
* Remove punctuation other than emojis or hashtags
* Split on whitespace
* Remove stopwords

Removing stopwords seems sensible for the Twitter description data. Remember to leave in emojis and hashtags, since you analyze those. 
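
The Twitter cleaning steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference solution: `clean_description` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the NLTK list used elsewhere in this notebook. The key idea is to strip only the punctuation in `string.punctuation` minus `#`, which leaves hashtags and emojis in place.

```python
from string import punctuation

# Punctuation to strip: everything in string.punctuation except "#",
# so hashtags survive. Emojis are not in string.punctuation, so they are kept.
TW_PUNCT = set(punctuation) - {"#"}

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's list.
DEMO_STOPWORDS = {"a", "the", "of", "and", "is", "to"}


def clean_description(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, drop punctuation except '#', split on whitespace, drop stopwords."""
    text = str(text).lower()
    text = "".join(ch for ch in text if ch not in TW_PUNCT)
    return [tok for tok in text.split() if tok not in stopwords]


demo_tokens = clean_description("Lover of MUSIC, the arts & #Pride ✨!")
print(demo_tokens)
```

Filtering character by character avoids a regular expression such as `[^\w\s]`, which would also remove `#` and most emoji characters.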

In [13]:
# apply the `pipeline` techniques from BTAP Ch 1 or 5


# The lyrics were already lowercased, stripped of punctuation, and tokenized above,
# so only the per-song token count remains to be added.
lyrics_data["num_tokens"] = lyrics_data["Lyrics"].map(len) 



In [14]:
lyrics_data

Unnamed: 0,Lyrics,Titles,artist,num_tokens
0,"[come, and, stay, with, me, ill, send, away, a...",cher_comeandstaywithme.txt,cher,155
1,"[pirate, hell, sail, on, with, the, summer, wi...",cher_pirate.txt,cher,297
2,"[stars, i, was, never, one, for, saying, what,...",cher_stars.txt,cher,314
3,"[these, days, well, ive, been, out, walking, a...",cher_thesedays.txt,cher,161
4,"[love, so, high, every, morning, i, would, wak...",cher_lovesohigh.txt,cher,176
...,...,...,...,...
99,"[do, you, really, want, me, show, respect, boy...",robyn_doyoureallywantmeshowrespect.txt,robyn,225
100,"[the, last, time, ever, since, i, was, a, litt...",robyn_thelasttime.txt,robyn,347
101,"[missing, u, baby, its, so, weird, to, me, now...",robyn_missingu.txt,robyn,372
102,"[breakdown, intermission, hey, my, friend, i, ...",robyn_breakdownintermission.txt,robyn,139


In [49]:
# Lowercase the descriptions; punctuation removal, stopword removal, and tokenization follow in the next cells
twitter_data["description"]=twitter_data["description"].str.lower()
twitter_data["description"]

0                                                          
1                  𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 & 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜
2                 163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらdmします🧡
3                                                       csu
4         writer @washinformer @spelmancollege alumna #d...
                                ...                        
358367    singer of songs, type 1 diabetic, tired $jakel...
358368    dadx2/ con-arch/ photographer/ dk #stemgrønnes...
358369           a year to change a life is still a year ✨😌
358370    head of consumer - mango. made in melbourne. r...
358371    stand for what is right, even if you stand alone.
Name: description, Length: 4353175, dtype: object

In [50]:

twitter_data["description"]=remove_punctuation(twitter_data["description"])
twitter_data["description"]




0                                                          
1                   𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜  𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜
2                       163愛かっぷ26歳 工〇好きな女の子 フォローしてくれたらdmします
3                                                       csu
4         writer washinformer spelmancollege alumna dcna...
                                ...                        
358367    singer of songs type 1 diabetic tired jakelgil...
358368    dadx2 conarch photographer dk stemgrønnest grø...
358369             a year to change a life is still a year 
358370    head of consumer  mango made in melbourne ramb...
358371      stand for what is right even if you stand alone
Name: description, Length: 4353175, dtype: object

In [51]:
twitter_data["description"]=remove_stop(twitter_data["description"])
twitter_data["description"]

0                                                          
1                    𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜
2                       163愛かっぷ26歳 工〇好きな女の子 フォローしてくれたらdmします
3                                                       csu
4         writer washinformer spelmancollege alumna dcna...
                                ...                        
358367    singer songs type 1 diabetic tired jakelgiles ...
358368    dadx2 conarch photographer dk stemgrønnest grø...
358369                          year change life still year
358370         head consumer mango made melbourne ramblings
358371                         stand right even stand alone
Name: description, Length: 4353175, dtype: object

In [52]:
twitter_data["description"]=tokenize(twitter_data["description"])
twitter_data

Unnamed: 0,screen_name,name,id,location,followers_count,friends_count,description,artist
0,hsmcnp,Country Girl,35152213,,1302,1014,[],cher
1,horrormomy,Jeny,742153090850164742,Earth,81,514,"[𝙿𝚛𝚘𝚞𝚍, 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛, 𝚘𝚏, 𝚖𝚎𝚜𝚜𝚢, 𝚋𝚞𝚗𝚜, 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜]",cher
2,anju79990584,anju,1496463006451974150,,13,140,"[163愛かっぷ26歳, 工〇好きな女の子, フォローしてくれたらdmします]",cher
3,gallionjenna,J,3366479914,,752,556,[csu],cher
4,bcscomm,bcscomm,83915043,"Washington, DC",888,2891,"[writer, washinformer, spelmancollege, alumna,...",cher
...,...,...,...,...,...,...,...,...
358367,jakegiles,Jake Giles,19725102,LA,7690,2165,"[singer, songs, type, 1, diabetic, tired, jake...",robyn
358368,axelbluhme,Axel Bluhme,19573759,DK,238,1565,"[dadx2, conarch, photographer, dk, stemgrønnes...",robyn
358369,RayRayRay27,Rachael :),19919217,Oldham,762,1479,"[year, change, life, still, year]",robyn
358370,bhandberg,Ben Handberg,12642462,Sydney,432,593,"[head, consumer, mango, made, melbourne, rambl...",robyn


In [53]:

twitter_data["num_tokens"] = twitter_data["description"].map(len) 

In [54]:
twitter_data

Unnamed: 0,screen_name,name,id,location,followers_count,friends_count,description,artist,num_tokens
0,hsmcnp,Country Girl,35152213,,1302,1014,[],cher,0
1,horrormomy,Jeny,742153090850164742,Earth,81,514,"[𝙿𝚛𝚘𝚞𝚍, 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛, 𝚘𝚏, 𝚖𝚎𝚜𝚜𝚢, 𝚋𝚞𝚗𝚜, 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜]",cher,6
2,anju79990584,anju,1496463006451974150,,13,140,"[163愛かっぷ26歳, 工〇好きな女の子, フォローしてくれたらdmします]",cher,3
3,gallionjenna,J,3366479914,,752,556,[csu],cher,1
4,bcscomm,bcscomm,83915043,"Washington, DC",888,2891,"[writer, washinformer, spelmancollege, alumna,...",cher,17
...,...,...,...,...,...,...,...,...,...
358367,jakegiles,Jake Giles,19725102,LA,7690,2165,"[singer, songs, type, 1, diabetic, tired, jake...",robyn,9
358368,axelbluhme,Axel Bluhme,19573759,DK,238,1565,"[dadx2, conarch, photographer, dk, stemgrønnes...",robyn,8
358369,RayRayRay27,Rachael :),19919217,Oldham,762,1479,"[year, change, life, still, year]",robyn,5
358370,bhandberg,Ben Handberg,12642462,Sydney,432,593,"[head, consumer, mango, made, melbourne, rambl...",robyn,6


In [18]:
twitter_data['has_emoji'] = twitter_data["description"].apply(contains_emoji)

Let's take a quick look at some descriptions with emojis.

In [20]:
twitter_data[twitter_data.has_emoji].sample(10)[["artist","description","num_tokens"]]

Unnamed: 0,artist,description,num_tokens
1460167,cher,"[ℹ, womenareleaders, womenarerolemodels, women...",4
1319491,cher,"[always, kind, ℹ, views, opinions]",5
1673624,cher,"[fashion, designer, professional, makeupartist...",14
134412,robyn,"[dj, desde, 1997, dj, protour, barcelona, la, ...",13
185654,cher,"[venta, de, jabones, artesanales, sígueme, en,...",17
2116585,cher,"[99, time, idea, whats, going, onℹ, medical, s...",9
604165,cher,"[ℹlightwarriorlatinasopranosarcasticvo, actriz...",5
3488865,cher,"[hesheautisticadhdenby2slgbtqia, blmleftist, a...",5
1302491,cher,"[ℹ, queen]",2
357727,robyn,"[ℹ, unsigned, award, winning, songwriter, outs...",12


With the data processed, we can now start work on the assignment questions. 

Q: What is one area of improvement to your tokenization that you could theoretically carry out? (No need to actually do it; let's not make perfect the enemy of good enough.)

A: Some splits did not occur during tokenization, so some words are clumped together (for example, `onℹ` in the sample above). As a result, the tokens produced are of lower quality than they could be.


## Calculate descriptive statistics on the two sets of lyrics and compare the results. 


In [105]:
#df.loc[[2]] 
#descriptive_stats(lyrics_data["Lyrics"].loc[[2]] )  


cher_song= []
for i in lyrics_data["Lyrics"].loc[[2]]:
    cher_song.extend(i)

    
descriptive_stats(cher_song)




There are 615 tokens in the data.
There are 192 unique tokens in the data.
There are 2298 characters in the data.
The lexical diversity is 0.312 in the data.
[('the', 24), ('they', 20), ('you', 20), ('go', 19), ('party', 19)]


(615, 192, 0.3121951219512195, 2298)

In [106]:
robyn_song = []
for i in lyrics_data["Lyrics"].loc[[99]]:
    robyn_song.extend(i)
descriptive_stats(robyn_song)

There are 555 tokens in the data.
There are 199 unique tokens in the data.
There are 2072 characters in the data.
The lexical diversity is 0.359 in the data.
[('you', 47), ('gon', 13), ('na', 13), ('me', 13), ('her', 13)]


(555, 199, 0.35855855855855856, 2072)

Q: what observations do you make about these data? 

A: Robyn's lyrics for `doyoureallywantme` have more unique tokens than Cher's `stars`, even though they contain fewer tokens overall. `you` is a common token in both.
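
One thing worth checking in the two cells above: `lyrics_data` was built with `pd.concat` without `ignore_index=True`, so index labels repeat across artists and `.loc[[2]]` can return one row per artist. The sketch below uses toy data to show the effect and one possible way to build a whole-corpus token list per artist; `artist_tokens` and the toy frames are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-ins for the two per-artist frames (hypothetical data).
cher_df = pd.DataFrame({"Lyrics": [["a", "b"], ["c"], ["d", "e", "f"]], "artist": "cher"})
robyn_df = pd.DataFrame({"Lyrics": [["x"], ["y", "z"], ["w"]], "artist": "robyn"})

# Without ignore_index=True, both frames keep their own 0..n-1 labels.
stacked = pd.concat([cher_df, robyn_df])
rows_at_label_2 = stacked["Lyrics"].loc[[2]]
print(len(rows_at_label_2))  # one row from each artist shares label 2


def artist_tokens(df, artist):
    """Flatten every token list for one artist into a single list."""
    token_lists = df.loc[df["artist"] == artist, "Lyrics"]
    return [tok for tokens in token_lists for tok in tokens]


cher_tokens = artist_tokens(stacked, "cher")
robyn_tokens = artist_tokens(stacked, "robyn")
print(cher_tokens, robyn_tokens)
```

Passing each flattened list to `descriptive_stats` would then compare the two full sets of lyrics rather than individual rows.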


## Find tokens uniquely related to a corpus

Typically we would use TF-IDF to find unique tokens in documents. Unfortunately, we either have too few documents (if we view each data source as a single document) or too many (if we view each description as a separate document). In the latter case, our problem will be that descriptions tend to be short, so our matrix would be too sparse to support analysis. 

To avoid these problems, we will create a custom statistic to identify words that are uniquely related to each corpus. The idea is to find words that occur often in one corpus and infrequently in the other(s). Since corpora can be of different lengths, we will focus on the _concentration_ of tokens within a corpus. "Concentration" is simply the count of the token divided by the total corpus length. For instance, if a corpus had length 100,000 and a word appeared 1,000 times, then the concentration would be $\frac{1000}{100000} = 0.01$. If the same token had a concentration of $0.005$ in another corpus, then the concentration ratio would be $\frac{0.01}{0.005} = 2$. Very rare words can easily create infinite ratios, so you will also add a cutoff to your code so that a token must appear at least $n$ times for you to return it. 

An example of these calculations can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1P87fkyslJhqXFnfYezNYrDrXp_GS8gwSATsZymv-9ms). Please don't hesitate to ask questions if this is confusing. 

In this section find 10 tokens for each of your four corpora that meet the following criteria: 

1. The token appears at least `n` times in all corpora
1. The tokens are in the top 10 for the highest ratio of appearances in a given corpus vs appearances in other corpora.

You will choose a cutoff for yourself based on the size of the corpus you're working with. If you're working with the Robyn-Cher corpora provided, `n=5` seems to perform reasonably well.
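
The concentration ratio described above can be sketched with the standard library alone. This is one possible reading of the criteria, not a prescribed solution: `concentration_ratios`, `min_count`, and `top_n` are hypothetical names, and treating "the other corpora" as a single combined token list is an assumption. The toy data reproduces the worked example from the text (1,000 occurrences in 100,000 tokens against a concentration of 0.005).

```python
from collections import Counter


def concentration_ratios(target_tokens, other_tokens, min_count=5, top_n=10):
    """Rank tokens by (concentration in target) / (concentration in other)."""
    target_counts = Counter(target_tokens)
    other_counts = Counter(other_tokens)
    target_len = len(target_tokens)
    other_len = len(other_tokens)

    ratios = {}
    for token, count in target_counts.items():
        # Apply the cutoff in both corpora so the denominator is never zero.
        if count < min_count or other_counts[token] < min_count:
            continue
        target_conc = count / target_len
        other_conc = other_counts[token] / other_len
        ratios[token] = target_conc / other_conc

    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy corpora matching the example in the text.
corpus_a = ["love"] * 1000 + ["filler"] * 99000
corpus_b = ["love"] * 500 + ["filler"] * 99500

top_tokens = concentration_ratios(corpus_a, corpus_b)
print(top_tokens[0])
```

Dividing by each corpus length before taking the ratio is what keeps a very large corpus, such as the Twitter descriptions, from dominating the ranking.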

In [40]:
# Count word frequencies in Cher's lyrics

cher=lyrics_data.loc[lyrics_data['artist'] == 'cher']
cher_word_count=cher["Lyrics"]
cher_word_count=cher_word_count.astype('string')
cher_wordf = cher_word_count.str.split(expand=True).stack().value_counts().reset_index()
cher_wordf.columns  = ['Word', 'Frequency'] 
cher_wordf

Unnamed: 0,Word,Frequency
0,"'you',",2754
1,"'i',",2663
2,"'the',",2425
3,"'to',",1686
4,"'and',",1570
...,...,...
4194,"'reservation',",1
4195,"'astound',",1
4196,"'mistreat',",1
4197,"'duplex',",1
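
The `Word` values in the output above carry list punctuation (`'you',` here, and `['im',` in a later table), which is what happens when a column of token lists is cast to string and then split on whitespace. A minimal sketch of counting the tokens directly, using only the standard library; `token_lists` is a hypothetical stand-in for one artist's `Lyrics` column:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for a column whose cells are lists of tokens.
token_lists = [["you", "and", "i"], ["you", "love"]]

# Casting a list to a string keeps the brackets, quotes, and commas,
# so a whitespace split yields tokens such as "['you'," instead of "you".
stringified = str(token_lists[0]).split()

# Counting over the flattened lists keeps each token clean.
counts = Counter(chain.from_iterable(token_lists))
print(stringified)
print(counts.most_common(1))
```

With a pandas Series of lists, `Counter(chain.from_iterable(series))` should behave the same way, since iterating a Series yields its values.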


In [41]:
robyn=lyrics_data.loc[lyrics_data['artist'] == 'robyn']
robyn_word_count=robyn["Lyrics"]
robyn_word_count=robyn_word_count.astype('string')
robyn_wordf = robyn_word_count.str.split(expand=True).stack().value_counts().reset_index()
robyn_wordf.columns  = ['Word', 'Frequency'] 
robyn_wordf

Unnamed: 0,Word,Frequency
0,"'you',",1485
1,"'the',",867
2,"'i',",847
3,"'me',",716
4,"'to',",666
...,...,...
2405,"'cliche',",1
2406,"'gãrs',",1
2407,'moneyman'],1
2408,"'spinnin',",1


In [55]:
robyn_desc=twitter_data.loc[twitter_data['artist'] == 'robyn']
robyn_word_desc_count=robyn_desc["description"]
robyn_word_desc_count=robyn_word_desc_count.astype('string')
copra3 = robyn_word_desc_count.str.split(expand=True).stack().value_counts().reset_index()
copra3.columns  = ['Word', 'Frequency'] 
copra3

Unnamed: 0,Word,Frequency
0,[],172172
1,"'music',",11810
2,"'love',",7993
3,"'och',",7908
4,"'de',",6290
...,...,...
279580,"'aewdark',",1
279581,"'estranged',",1
279582,"'etcccc',",1
279583,'oslotips'],1


In [59]:
cher_desc=twitter_data.loc[twitter_data['artist'] == 'cher']
cher_worddesc_count=cher_desc["description"]
cher_worddesc_count=cher_worddesc_count.astype('string')
copra4 = cher_worddesc_count.str.split(expand=True).stack().value_counts().reset_index()
copra4.columns  = ['Word', 'Frequency'] 
copra4.head(20)

Unnamed: 0,Word,Frequency
0,[],2044049
1,"'love',",147844
2,"'im',",87170
3,"'life',",86053
4,"['im',",73538
5,"'de',",72086
6,"'music',",69869
7,"['love',",55594
8,"'lover',",49163
9,"'follow',",46090


In [77]:
df3 = copra4[copra4['Word'].isin(copra3['Word'])]

robyn_wordf
#display(df3)
df4=df3[df3['Word'].isin(robyn_wordf['Word'])]
#display(df4)
df5=df4[df4['Word'].isin(cher_wordf['Word'])]
df5=df5.drop(columns=['Frequency'])
display(df5) # filtered to common words in all 

Unnamed: 0,Word
1,"'love',"
2,"'im',"
3,"'life',"
5,"'de',"
6,"'music',"
...,...
78773,"'brokenhearted',"
81667,"'doorstep',"
90773,"['88',"
106010,"'unfamiliar',"


In [81]:
tempy=df5
Comb=pd.merge(tempy,copra4, how='inner', on=['Word'])
display(Comb)

Unnamed: 0,Word,Frequency
0,"'love',",147844
1,"'im',",87170
2,"'life',",86053
3,"'de',",72086
4,"'music',",69869
...,...,...
1203,"'brokenhearted',",9
1204,"'doorstep',",9
1205,"['88',",7
1206,"'unfamiliar',",6


In [83]:
Comb2=pd.merge(Comb,copra3, how='inner', on=['Word'])
display(Comb2)

Unnamed: 0,Word,Frequency_x,Frequency_y
0,"'love',",147844,7993
1,"'im',",87170,5724
2,"'life',",86053,5127
3,"'de',",72086,6290
4,"'music',",69869,11810
...,...,...,...
1203,"'brokenhearted',",9,1
1204,"'doorstep',",9,3
1205,"['88',",7,3
1206,"'unfamiliar',",6,1


In [85]:
Comb3=pd.merge(Comb2,robyn_wordf, how='inner', on=['Word'])
display(Comb3)

Unnamed: 0,Word,Frequency_x,Frequency_y,Frequency
0,"'love',",147844,7993,270
1,"'im',",87170,5724,299
2,"'life',",86053,5127,39
3,"'de',",72086,6290,1
4,"'music',",69869,11810,31
...,...,...,...,...
1203,"'brokenhearted',",9,1,11
1204,"'doorstep',",9,3,1
1205,"['88',",7,3,1
1206,"'unfamiliar',",6,1,1


In [86]:
Comb4=pd.merge(Comb3,cher_wordf, how='inner', on=['Word'])
display(Comb4)

Unnamed: 0,Word,Frequency_x,Frequency_y,Frequency_x.1,Frequency_y.1
0,"'love',",147844,7993,270,982
1,"'im',",87170,5724,299,511
2,"'life',",86053,5127,39,123
3,"'de',",72086,6290,1,1
4,"'music',",69869,11810,31,33
...,...,...,...,...,...
1203,"'brokenhearted',",9,1,11,3
1204,"'doorstep',",9,3,1,1
1205,"['88',",7,3,1,1
1206,"'unfamiliar',",6,1,1,1


In [88]:
totalsum=Comb4
# numeric_only=True keeps the string `Word` column out of the row sum
totalsum['totalsum'] = totalsum.sum(axis=1, numeric_only=True)
display(totalsum)

Unnamed: 0,Word,Frequency_x,Frequency_y,Frequency_x.1,Frequency_y.1,totalsum
0,"'love',",147844,7993,270,982,157089.0
1,"'im',",87170,5724,299,511,93704.0
2,"'life',",86053,5127,39,123,91342.0
3,"'de',",72086,6290,1,1,78378.0
4,"'music',",69869,11810,31,33,81743.0
...,...,...,...,...,...,...
1203,"'brokenhearted',",9,1,11,3,24.0
1204,"'doorstep',",9,3,1,1,14.0
1205,"['88',",7,3,1,1,12.0
1206,"'unfamiliar',",6,1,1,1,9.0


In [103]:
# top 10 common words 
top10=totalsum.sort_values(by='totalsum', ascending=False)
top10.columns.values[1] = 'cher_dec'
top10.columns.values[2] = 'robyn_dec'
top10.columns.values[3] = 'robyn_lyric'
top10.columns.values[4] = 'cher_lyric'
display(top10)

Unnamed: 0,Word,cher_dec,robyn_dec,robyn_lyric,cher_lyric,totalsum
0,"'love',",147844,7993,270,982,157089.0
1,"'im',",87170,5724,299,511,93704.0
2,"'life',",86053,5127,39,123,91342.0
4,"'music',",69869,11810,31,33,81743.0
3,"'de',",72086,6290,1,1,78378.0
...,...,...,...,...,...,...
1200,"'chilly',",11,1,2,1,15.0
1204,"'doorstep',",9,3,1,1,14.0
1207,"'springtime',",6,1,4,2,13.0
1205,"['88',",7,3,1,1,12.0


In [93]:
divisionResults1 = top10['cher_dec'].div(top10['totalsum'])
divisionResults2 = top10['robyn_dec'].div(top10['totalsum'])
divisionResults3 = top10['robyn_lyric'].div(top10['totalsum'])
divisionResults4 = top10['cher_lyric'].div(top10['totalsum'])


In [105]:
top10['divisionResults1'] = divisionResults1
newtop10=top10.sort_values(by ='divisionResults1', ascending=False)
newtop10.head(10)
# repeat with divisionResults2, divisionResults3, and divisionResults4 to rank the other three corpora

Unnamed: 0,Word,cher_dec,robyn_dec,robyn_lyric,cher_lyric,totalsum,divisionResults1
5,"['love',",55594,2929,3,12,58538.0,0.949708
2,"'life',",86053,5127,39,123,91342.0,0.942097
0,"'love',",147844,7993,270,982,157089.0,0.941148
1,"'im',",87170,5724,299,511,93704.0,0.93027
6,"'lover',",49163,3790,2,9,52964.0,0.928234
9,"'one',",36735,2572,63,276,39646.0,0.926575
3,"'de',",72086,6290,1,1,78378.0,0.919722
7,"'follow',",46090,4171,2,23,50286.0,0.916557
8,"'like',",45465,3733,231,270,49699.0,0.914807
4,"'music',",69869,11810,31,33,81743.0,0.85474


Q: What are some observations about the top tokens? Do you notice any interesting items on the list? 

A: `love` showed up a lot, in several different forms, and the Twitter descriptions accounted for most of the total token counts.

## Build word clouds for all four corpora. 

For building wordclouds, we'll follow exactly the code of the text. The code in this section can be found [here](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb). If you haven't already, you should absolutely clone the repository that accompanies the book. 


In [106]:
from matplotlib import pyplot as plt

def wordcloud(word_freq, title=None, max_words=200, stopwords=None):

    wc = WordCloud(width=800, height=400, 
                   background_color= "black", colormap="Paired", 
                   max_font_size=150, max_words=max_words)
    
    # convert data frame into dict
    if type(word_freq) == pd.Series:
        counter = Counter(word_freq.fillna(0).to_dict())
    else:
        counter = word_freq

    # filter stop words in frequency counter
    if stopwords is not None:
        counter = {token:freq for (token, freq) in counter.items() 
                              if token not in stopwords}
    wc.generate_from_frequencies(counter)
 
    plt.title(title) 

    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    
    
def count_words(df, column='tokens', preprocess=None, min_freq=2):

    # process tokens and update counter
    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(tokens)

    # create counter and run through all data
    counter = Counter()
    df[column].map(update)

    # transform counter into data frame
    freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
    freq_df = freq_df.query('freq >= @min_freq')
    freq_df.index.name = 'token'
    
    return freq_df.sort_values('freq', ascending=False)
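
The helpers above still need one frequency table per corpus. The sketch below shows one way to prepare those inputs with the standard library; `corpus_frequencies` and `demo_corpora` are hypothetical names, and the miniature corpora are made-up data used only to show the shape of the result.

```python
from collections import Counter


def corpus_frequencies(token_lists, min_freq=2):
    """Count tokens across all documents and keep those seen at least min_freq times."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    return {token: freq for token, freq in counter.items() if freq >= min_freq}


# Hypothetical miniature corpora keyed by a display title.
demo_corpora = {
    "cher lyrics": [["love", "you"], ["love", "song"]],
    "robyn lyrics": [["dance", "you"], ["dance", "dance"]],
}

demo_freqs = {title: corpus_frequencies(docs) for title, docs in demo_corpora.items()}
print(demo_freqs)
```

In the notebook, each resulting dictionary could be handed to the `wordcloud` helper defined above, for example `wordcloud(freqs, title=title)` followed by `plt.show()`, once for each of the four corpora.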

Q: What observations do you have about these (relatively straightforward) wordclouds? 

A: 