# ADS 509 Sentiment Assignment

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [2]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

sw = stopwords.words("english")

In [3]:
# Add any additional import statements you need here
from nltk.tokenize import word_tokenize


In [4]:
# Functions and constants:

# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords 
sw = stopwords.words("english")

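# Removing apostrophes from stopwords so they match tokens after punctuation
# removal; a set speeds up the membership test in remove_stop
sw_nopunct = {word.replace("'", "") for word in sw}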

# Function to remove stopwords
def remove_stop(tokens) :
    return([t for t in tokens if t.lower() not in sw_nopunct])
 
# Function to remove punctuation
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

# Function to tokenize the text
def tokenize(text) : 
    return(text.split())

def prepare(text, pipeline) : 
    tokens = str(text)
    for transform in pipeline : 
        tokens = transform(tokens) 
    return(tokens)

# Pipeline to clean and tokenize the data
cleaning_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]

In [5]:
# change `data_location` to the location of the folder on your machine.
data_location = "/Users/summerpurschke/Desktop/ADS/ADS509/Mod2"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "/twitter/"
lyrics_folder = "/lyrics"

positive_words_file = "positive-words.txt"
negative_words_file = "negative-words.txt"
tidy_text_file = "tidytext_sentiments.txt"

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure of your choice. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 
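
The two storage layouts described above can be sketched with the standard library alone. The artist names, song titles, and text below are hypothetical placeholders, not data from the assignment files, and the variable names `lyrics_data` and `twitter_data` are illustrative only.

```python
from collections import defaultdict

# Lyrics: nested dictionary keyed by artist, then by song title.
# The value is the full text of the song file.
lyrics_data = defaultdict(dict)
lyrics_data["artist_a"]["Song One"] = "first line\nsecond line"
lyrics_data["artist_a"]["Song Two"] = "another first line"
lyrics_data["artist_b"]["Song Three"] = "yet another line"

# Twitter: dictionary of lists keyed by artist, one entry per description.
twitter_data = defaultdict(list)
twitter_data["artist_a"].append("music lover")
twitter_data["artist_b"].append("coffee and concerts")
twitter_data["artist_b"].append("dancing on my own")

# Inspect one song by artist and title.
print(lyrics_data["artist_a"]["Song One"])

# Count the descriptions collected for each artist.
description_counts = {artist: len(descs) for artist, descs in twitter_data.items()}
print(description_counts)
```

Either layout keeps the song title available for the per-song questions later in the assignment; a data frame with one row per song carries the same information.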




In [6]:
# Reading in the lyrics here

# Creating lyrics dataframe
lyrics_df = pd.DataFrame(columns = ['Artist', 'Song Title', 'Lyrics'])

artists = ["cher", "robyn"]

for artist in artists:
    # joining the lyrics folder path with the artist name to get the path to each artist's folder
    artist_folder = os.path.join(data_location + lyrics_folder, artist)
    # iterating over the artist_folder
    for root, folders, files in os.walk(artist_folder):
        # iterate over all the files/lyrics in each artist folder
        for file in files:
            path = os.path.join(root, file)
            # defining content to fill the lyrics_df with
            with open(path) as inf:
                content = inf.read()
                # extracting just the lyrics by splitting the content into two parts
                # [0] represents content before the split or the titles
                # [1] represents content after the split or the lyrics
                lyrics = content.split('\n', 1)[1].strip()
                song_title = content.split('\n', 1)[0].strip()
            # updating lyrics_df
            lyrics_df = pd.concat([lyrics_df, pd.DataFrame({"Artist": [artist], "Song Title": [song_title], 
                                                            "Lyrics": [lyrics]})], ignore_index=True)
lyrics_df.head(5)

Unnamed: 0,Artist,Song Title,Lyrics
0,cher,"""Come And Stay With Me""",I'll send away all my false pride\nAnd I'll fo...
1,cher,"""Pirate""",He'll sail on with the summer wind\nThat blows...
2,cher,"""Stars""",I was never one for saying what I really feel\...
3,cher,"""These Days""",Well I've been out walking \nAnd I don't do th...
4,cher,"""Love So High""",Every morning I would wake up\nAnd I'd tie the...


In [7]:
# cleaning and tokenizing the lyrics_df
my_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]

lyrics_df["tokens"] = lyrics_df["Lyrics"].apply(prepare,pipeline=my_pipeline)
lyrics_df

Unnamed: 0,Artist,Song Title,Lyrics,tokens
0,cher,"""Come And Stay With Me""",I'll send away all my false pride\nAnd I'll fo...,"[ill, send, away, false, pride, ill, forsake, ..."
1,cher,"""Pirate""",He'll sail on with the summer wind\nThat blows...,"[hell, sail, summer, wind, blows, day, everybo..."
2,cher,"""Stars""",I was never one for saying what I really feel\...,"[never, one, saying, really, feel, except, ton..."
3,cher,"""These Days""",Well I've been out walking \nAnd I don't do th...,"[well, ive, walking, much, talking, days, days..."
4,cher,"""Love So High""",Every morning I would wake up\nAnd I'd tie the...,"[every, morning, would, wake, id, tie, sun, ar..."
...,...,...,...,...
415,robyn,"""Do You Really Want Me (Show Respect)""",Boy listen to me careful now\nCause this is so...,"[boy, listen, careful, cause, something, every..."
416,robyn,"""The Last Time""",Ever since I was a little girl even younger th...,"[ever, since, little, girl, even, younger, alw..."
417,robyn,"""Missing U""","Baby, it's so weird to me now that it's over\n...","[baby, weird, space, used, head, shoulder, pla..."
418,robyn,"""Breakdown Intermission""",Hey my friend I thought you knew it's all an i...,"[hey, friend, thought, knew, illusion, nobody,..."


In [8]:
# Read in the twitter data for Cher
artist_files = {'cher':'cher_followers_data.txt',
                'robyn':'robynkonichiwa_followers_data.txt'}

twitter_df = pd.read_csv(data_location + twitter_folder + artist_files['cher'],
                           sep="\t",
                           quoting=3)

twitter_df['artist'] = "cher"

# read in twitter data for Robyn
twitter_df_2 = pd.read_csv(data_location + twitter_folder + artist_files['robyn'],
                             sep="\t",
                             quoting=3)

twitter_df_2['artist'] = "robyn"

# concatenating Cher and Robyn Twitter data
twitter_df = pd.concat([
    twitter_df,twitter_df_2])
    
del(twitter_df_2)

# keep relevant twitter_df columns and remove rows with NaN value as the 'description' 
# given that it implies there is no description for the Twitter user
twitter_df = twitter_df[['artist', 'description']]
twitter_df_subset = twitter_df[twitter_df['description'].notna()]
twitter_df_subset.head(5)

Unnamed: 0,artist,description
1,cher,𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 & 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜
2,cher,163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらDMします🧡
3,cher,csu
4,cher,Writer @Washinformer @SpelmanCollege alumna #D...
6,cher,I’m unemployed and live with my parents. MOOPS!


In [9]:
# Read in the positive and negative words and the tidytext sentiment. 
# Store these so that the positive words are associated with a score 
# of +1 and negative words are associated with a score of -1. 
# You can use a dataframe or a dictionary for this.

# Read in the positive-words.txt data
# (assumes the lexicon files sit directly inside `data_location`)
# Set header=None because the file has no header row
# Set comment=';' to skip the lines that start with ';', removing the
# irrelevant lines in the data
positive_words_df = pd.read_csv(os.path.join(data_location, positive_words_file),
                                header=None, comment=';')
# Naming the column 'word'
positive_words_df = positive_words_df.rename(columns={0: 'word'})
# Setting a score of +1 for positive words
positive_words_df['score'] = +1

# Read in the negative-words.txt data with the same settings
negative_words_df = pd.read_csv(os.path.join(data_location, negative_words_file),
                                header=None, comment=';')
# Naming the column 'word'
negative_words_df = negative_words_df.rename(columns={0: 'word'})
# Setting a score of -1 for negative words
negative_words_df['score'] = -1

# Combining the positive_words_df and negative_words_df together
words_df = pd.concat([positive_words_df, negative_words_df])

# Read in the tidytext_sentiments.txt data
# Set header=0 because the first row of this tab-separated file holds the column names
sentiment_df = pd.read_csv(os.path.join(data_location, tidy_text_file), header=0, sep='\t')
sentiment_df['score'] = sentiment_df['sentiment'].map({'positive': 1, 'negative': -1})

display(words_df)
display(sentiment_df)

# Merging together the words_df and sentiment_df
sentiments = pd.concat([words_df, sentiment_df])
display(sentiments)

## Sentiment Analysis on Songs

In this section, score the sentiment for all the songs for both artists in your data set. Score the sentiment by manually calculating the sentiment using the combined lexicons provided in this repository. 

After you have calculated these sentiments, answer the questions at the end of this section.
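
As a rough illustration of lexicon-based scoring, the sketch below sums +1/-1 scores over the tokens of each song and then averages the song scores per artist. The lexicon entries, artist names, song titles, and the helper `score_tokens` are all hypothetical; the real lexicons are far larger.

```python
# Toy lexicon: +1 for positive words, -1 for negative words (hypothetical entries).
toy_lexicon = {"love": 1, "happy": 1, "sad": -1, "cry": -1}


def score_tokens(tokens, lexicon):
    """Sum lexicon scores over tokens; words not in the lexicon count as 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


# Hypothetical tokenized songs, keyed by (artist, title).
toy_songs = {
    ("artist_a", "Song One"): ["love", "love", "happy", "night"],
    ("artist_a", "Song Two"): ["sad", "cry", "love"],
    ("artist_b", "Song Three"): ["cry", "cry", "rain"],
}

song_scores = {key: score_tokens(tokens, toy_lexicon)
               for key, tokens in toy_songs.items()}

# Average score per song for each artist.
totals, counts = {}, {}
for (artist, _title), score in song_scores.items():
    totals[artist] = totals.get(artist, 0) + score
    counts[artist] = counts.get(artist, 0) + 1
average_scores = {artist: totals[artist] / counts[artist] for artist in totals}

print(song_scores)
print(average_scores)
```

A raw sum grows with song length, so a long song can outscore a short one simply by repeating a word. Dividing each score by the number of tokens is one possible way to compare songs of different lengths.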


In [53]:
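# Build a word -> score lookup once so each token check is a fast dictionary access.
# Rows without a +1/-1 score are dropped; a word listed in several lexicons
# keeps the first score it was given.
sentiment_lookup = (sentiments.dropna(subset=['score'])
                              .drop_duplicates(subset='word')
                              .set_index('word')['score']
                              .to_dict())

def sentiment_score_calc(tokens):
    # Sum the lexicon scores of the tokens; words missing from the lexicons add 0
    return sum(sentiment_lookup.get(token, 0) for token in tokens)

In [54]:
lyrics_df['sentiment_score'] = lyrics_df['tokens'].apply(sentiment_score_calc)
lyrics_df.sample(5)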

### Questions

Q: Overall, which artist has the higher average sentiment per song? 

A: <!-- Your answer here -->

---

Q: For your first artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: For your second artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.




## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/). 

Associate your positive emojis with a score of +1, negative with -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can just be the total score divided by number of followers. You do not need to calculate sentiment on non-emoji content for this section.
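
The emoji scoring described above might look like the following sketch. The emoji sets are deliberately short and hypothetical (the assignment asks for at least 10 per set), the descriptions are made up, and the character-by-character match assumes each emoji is a single code point. Emojis built from several code points, such as skin-tone or joined sequences, would need a different matching approach.

```python
from collections import Counter

# Hypothetical, shortened emoji sets; each emoji here is a single code point.
positive_emojis = {"😀", "😍", "🎉", "💜", "😊"}
negative_emojis = {"😢", "😡", "💔", "😭", "😞"}

# Map every emoji to +1 or -1.
emoji_scores = {e: 1 for e in positive_emojis}
emoji_scores.update({e: -1 for e in negative_emojis})


def emoji_sentiment(description):
    """Sum emoji scores over the characters of one description."""
    return sum(emoji_scores.get(ch, 0) for ch in description)


# Made-up follower descriptions, keyed by artist.
toy_descriptions = {
    "artist_a": ["music 😍🎉", "no emojis here", "rough week 😢", "😍"],
    "artist_b": ["💔💔 again", "😀"],
}

# Average sentiment: total score divided by the number of followers.
average_sentiment = {
    artist: sum(emoji_sentiment(d) for d in descs) / len(descs)
    for artist, descs in toy_descriptions.items()
}

# Count how often each scored emoji appears for each artist.
emoji_counts = {
    artist: Counter(ch for d in descs for ch in d if ch in emoji_scores)
    for artist, descs in toy_descriptions.items()
}

print(average_sentiment)
print(emoji_counts["artist_a"].most_common(1))
```

Filtering `emoji_counts` by the positive or negative set before calling `most_common` gives the most popular emoji of each polarity.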

In [12]:
# your code here

Q: What is the average sentiment of your two artists? 

A: <!-- Your answer here --> 

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji? 

A: <!-- Your answer here --> 

