# ADS 509 Sentiment Assignment

Shailja Somani\
ADS 509 Assignment 6.1\
June 10, 2024

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 


In [1]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

sw = stopwords.words("english")

In [None]:
# Add any additional import statements you need here




In [2]:
# change `data_location` to the location of the folder on your machine.
data_location = "/users/shailjasomani/Documents/USD_MS_ADS/ADS_509/"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "M1_Results/twitter/"
lyrics_folder = "M1_Results/lyrics/"

positive_words_file = "positive-words.txt"
negative_words_file = "negative-words.txt"
tidy_text_file = "tidytext_sentiments.txt"

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 
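
The two structures described above can be sketched as follows. This is only an illustration under stated assumptions: the rows are hypothetical in-memory placeholders rather than the real files, and the names `sample_lyrics`, `sample_descriptions`, `lyrics_by_artist`, and `descriptions_by_artist` are made up for this sketch, not taken from the solution.

```python
from collections import defaultdict

# Hypothetical placeholder rows standing in for the files read from disk.
sample_lyrics = [
    ("robyn", "electric", "placeholder lyrics for one song"),
    ("robyn", "lovekills", "placeholder lyrics for another song"),
    ("cher", "example_song", "placeholder lyrics for a third song"),
]
sample_descriptions = [
    ("cher", "csu"),
    ("cher", "placeholder description"),
    ("robyn", "another placeholder description"),
]

# Two levels of keys: lyrics_by_artist[artist][song] -> file contents.
lyrics_by_artist = defaultdict(dict)
for artist, song, text in sample_lyrics:
    lyrics_by_artist[artist][song] = text

# One level of keys: descriptions_by_artist[artist] -> list of descriptions.
descriptions_by_artist = defaultdict(list)
for artist, description in sample_descriptions:
    descriptions_by_artist[artist].append(description)

print(sorted(lyrics_by_artist["robyn"]))
print(len(descriptions_by_artist["cher"]))
```

Either structure converts easily to a Pandas data frame later, so the choice mostly comes down to how you prefer to look up an individual song.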




In [4]:
# Read the lyrics data into a DataFrame
# Collect one (artist, song_name, lyrics) row per song while looping through artists
lyrics_data = []
lyrics_path = os.path.join(data_location, lyrics_folder)

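# Loop through artists
for artist in os.listdir(lyrics_path):
    artist_path = os.path.join(lyrics_path, artist)
    # Skip anything that is not an artist folder (e.g. .DS_Store on macOS)
    if not os.path.isdir(artist_path):
        continue
    # Loop through all song file names in each artist folder
    for song_file in os.listdir(artist_path):
        # Skip anything that is not a lyrics text file
        if not song_file.endswith('.txt'):
            continue
        song_path = os.path.join(artist_path, song_file)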
        with open(song_path, 'r', encoding='utf-8') as file:
            lyrics = file.read()
            # Extract song name from song_file
            song_name = song_file.split('_', 1)[-1].rsplit('.txt', 1)[0]
            # Append a tuple with artist, song name, and lyrics to list 
            lyrics_data.append((artist, song_name, lyrics))

# Create a DataFrame from the collected data in list
lyrics_df = pd.DataFrame(lyrics_data, columns=['artist', 'song_name', 'lyrics'])

# Check data is as expected
lyrics_df.head()

Unnamed: 0,artist,song_name,lyrics
0,robyn,includemeout,"""Include Me Out""\n\n\n\nIt is really very simp..."
1,robyn,electric,"""Electric""\n\n\n\nElectric...\n\nIt's electric..."
2,robyn,beach2k20,"""Beach 2K20""\n\n\n\n(So you wanna go out?\nHow..."
3,robyn,lovekills,"""Love Kills""\n\n\n\nIf you're looking for love..."
4,robyn,timemachine,"""Time Machine""\n\n\n\nHey, what did I do?\nCan..."


In [7]:
# Read the Twitter data into a DataFrame
artist_files = {'cher':'cher_followers_data.txt',
                'robyn':'robynkonichiwa_followers_data.txt'}

# Read in Cher data
# quoting=3 is csv.QUOTE_NONE, so quote characters in descriptions are read literally
twitter_data = pd.read_csv(os.path.join(data_location, twitter_folder, artist_files['cher']),
                           sep="\t",
                           quoting=3)

twitter_data['artist'] = "cher"

# Read in Robyn data
twitter_data_2 = pd.read_csv(os.path.join(data_location, twitter_folder, artist_files['robyn']),
                             sep="\t",
                             quoting=3)
twitter_data_2['artist'] = "robyn"

# Concatenate both and delete the redundant DataFrame
twitter_data = pd.concat([twitter_data, twitter_data_2])
del twitter_data_2

# Keep only description field
twitter_data = twitter_data[['artist', 'description']]
twitter_data.head()

Unnamed: 0,artist,description
0,cher,
1,cher,𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 & 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜
2,cher,163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらDMします🧡
3,cher,csu
4,cher,Writer @Washinformer @SpelmanCollege alumna #D...


### Set Up Sentiments Dataframe

In [8]:
# Read in the positive and negative words and the
# tidytext sentiment. Store these so that the positive
# words are associated with a score of +1 and negative words
# are associated with a score of -1. You can use a dataframe or a 
# dictionary for this.

# Read in positive words & assign score of +1
with open(positive_words_file, 'r') as file:
    positive_words = file.read().splitlines()

# Filter out comments & empty lines
positive_words = [word for word in positive_words if word and not word.startswith(';')]

# Create df for positive words
df_positive = pd.DataFrame(positive_words, columns=['word'])
df_positive['score'] = 1
df_positive.head()

Unnamed: 0,word,score
0,a+,1
1,abound,1
2,abounds,1
3,abundance,1
4,abundant,1


In [10]:
# Read in negative words & assign score of -1
with open(negative_words_file, 'r') as file:
    negative_words = file.read().splitlines()

# Filter out comments & empty lines
negative_words = [word for word in negative_words if word and not word.startswith(';')]

# Create df for negative words
df_negative = pd.DataFrame(negative_words, columns=['word'])
df_negative['score'] = -1
df_negative.head()

Unnamed: 0,word,score
0,2-faced,-1
1,2-faces,-1
2,abnormal,-1
3,abolish,-1
4,abominable,-1


In [12]:
# Read in tidytext sentiments
df_tidytext = pd.read_csv(tidy_text_file, delimiter='\t')

# Map sentiment to score (+1 for positive, -1 for negative)
sentiment_score_map = {'positive': 1, 'negative': -1}
df_tidytext['score'] = df_tidytext['sentiment'].map(sentiment_score_map)

# Keep only the required columns
df_tidytext = df_tidytext[['word', 'score']]
df_tidytext.head()

Unnamed: 0,word,score
0,abandon,-1
1,abandoned,-1
2,abandonment,-1
3,abba,1
4,abduction,-1


In [13]:
# Combine all 3 dfs above
df_combined = pd.concat([df_positive, df_negative, df_tidytext]).reset_index(drop=True)
df_combined.head()

Unnamed: 0,word,score
0,a+,1
1,abound,1
2,abounds,1
3,abundance,1
4,abundant,1


## Sentiment Analysis on Songs

In this section, score the sentiment of every song for both artists in your data set. Calculate the scores manually using the combined lexicons provided in this repository. 

After you have calculated these sentiments, answer the questions at the end of this section.
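
One way to calculate a lexicon score manually is sketched below. It is a minimal illustration, not the assignment solution: the miniature lexicon, the example sentence, the tokenizing regular expression, and the names `lexicon_rows`, `word_scores`, and `score_text` are all assumptions made for this sketch. In the notebook, the word and score pairs would come from the combined lexicon instead.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon as (word, score) pairs. "love" appears twice
# to mimic a word that is listed in more than one source lexicon.
lexicon_rows = [
    ("love", 1),
    ("abundant", 1),
    ("abandon", -1),
    ("abnormal", -1),
    ("love", 1),
]

# A dict keeps a single score per word, so repeated entries are not double counted.
word_scores = dict(lexicon_rows)


def score_text(text, word_scores):
    """Return (total sentiment score, number of tokens) for one piece of text."""
    # Lowercase, then keep runs of letters, digits, apostrophes, plus signs, and
    # hyphens so that lexicon entries such as "a+" or "2-faced" can still match.
    tokens = re.findall(r"[a-z0-9'+-]+", text.lower())
    counts = Counter(tokens)
    total = sum(word_scores.get(token, 0) * n for token, n in counts.items())
    return total, len(tokens)


total, n_tokens = score_text("Love, love is abundant; do not abandon it.", word_scores)
print(total, n_tokens, total / n_tokens)
```

Whether to report the raw total or the total divided by the token count is a design choice the assignment leaves open. Dividing by length keeps long songs from dominating the ranking purely because they contain more words.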


In [None]:
# your code here

### Questions

Q: Overall, which artist has the higher average sentiment per song? 

A: <!-- Your answer here -->

---

Q: For your first artist, which three songs have the highest sentiment and which three have the lowest? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: For your second artist, which three songs have the highest sentiment and which three have the lowest? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.
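
A minimal matplotlib sketch of overlaid histograms follows. The scores in `scores_by_artist` are hypothetical placeholders, not results from this notebook, and the `Agg` backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Hypothetical per-song sentiment scores; replace with the real scores.
scores_by_artist = {
    "cher": [3, -1, 5, 0, 2, 7],
    "robyn": [1, -4, 2, 2, 6, -2],
}

# Use one shared range so both artists are binned identically.
all_scores = [score for scores in scores_by_artist.values() for score in scores]
score_range = (min(all_scores), max(all_scores))

fig, ax = plt.subplots(figsize=(8, 4))
for artist, scores in scores_by_artist.items():
    # density=True normalizes each histogram, which helps when the artists
    # have different numbers of songs.
    ax.hist(scores, bins=10, range=score_range, alpha=0.5, density=True, label=artist)

ax.set_xlabel("Sentiment score per song")
ax.set_ylabel("Density")
ax.set_title("Distribution of song sentiment by artist")
ax.legend()
fig.tight_layout()
```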




## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/). 

Associate your positive emojis with a score of +1 and your negative emojis with a score of -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can simply be the total score divided by the number of followers. You do not need to calculate sentiment on non-emoji content for this section.
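
The scoring rule above can be sketched as follows. The emoji sets, the sample descriptions, and the names `POSITIVE_EMOJIS`, `NEGATIVE_EMOJIS`, and `score_descriptions` are assumptions chosen for illustration. The sketch also only matches emojis that are a single code point, so sequences such as skin-tone variants would need extra handling.

```python
from collections import Counter

# Assumed emoji sets (ten each, all single code points); choose your own in practice.
POSITIVE_EMOJIS = set("😀😂😍😊💕✨🎉👍🥰💜")
NEGATIVE_EMOJIS = set("😢😭😡💔😞😠👎😩😒😤")


def score_descriptions(descriptions):
    """Return (average score per follower, Counter of matched emojis)."""
    emoji_counts = Counter()
    total = 0
    for description in descriptions:
        # Missing descriptions can show up as None or NaN (a float); skip them.
        if not isinstance(description, str):
            continue
        for char in description:
            if char in POSITIVE_EMOJIS:
                total += 1
                emoji_counts[char] += 1
            elif char in NEGATIVE_EMOJIS:
                total -= 1
                emoji_counts[char] += 1
    # Every follower counts in the denominator, including those without emojis.
    average = total / len(descriptions) if descriptions else 0.0
    return average, emoji_counts


# Hypothetical descriptions for one artist's followers.
sample = ["music lover 💜✨", "💔 but still dancing 🎉", None, "no emoji here"]
average, emoji_counts = score_descriptions(sample)
print(average, emoji_counts.most_common(3))
```

Because `emoji_counts` tallies every matched emoji, filtering it by each set gives the most popular positive and negative emoji for an artist.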

In [None]:
# your code here

Q: What is the average sentiment of your two artists? 

A: <!-- Your answer here --> 

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji? 

A: <!-- Your answer here --> 

