# Formalia:

Please read the [assignment overview page](https://github.com/SocialComplexityLab/socialgraphs2024/wiki/Assignments) carefully before proceeding. This page contains information about formatting, group sizes, and many other aspects of handing in the assignment.

_If you fail to follow these simple instructions, it will negatively impact your grade!_

**Due date and time**: The assignment is due on Tuesday, November 5th, 2024 at 23:55. Hand in your IPython notebook file (with extension `.ipynb`) via DTU Learn.

The exercises below are described only in general terms. Identifying and drawing in the relevant parts of the lecture exercises is part of the assignment. (That way we're helping you get a little bit more ready for the Final Project, where you have to decide what information to include in your report and analysis.)


# Part 1: Genres and communities and plotting 

The questions below are based on Lecture 7, part 2.

* Write about genres and modularity.
* Detect the communities and discuss the value of modularity in comparison to the genres.
* Calculate the matrix $D$ and discuss your findings.
* Plot the communities and comment on your results.
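
As a rough illustration of the steps listed above, here is a minimal sketch on a toy graph. It makes several assumptions that are not part of the assignment text: the graph and genre labels are invented, each artist is assigned exactly one genre (modularity needs a partition, so with multi-genre artists you would have to pick one, for example the first listed), Louvain is used as the community detection method, and $D$ is taken to be the genre-by-community confusion matrix where entry $(i, j)$ counts the nodes with genre $i$ that fall in community $j$. Check Lecture 7, part 2 for the definitions you are actually expected to use.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import louvain_communities, modularity

# Toy undirected graph (a stand-in for the artist network): two triangles joined by one edge
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
])

# Hypothetical single-genre labels, one per node
genre_of = {"a": "rock", "b": "rock", "c": "rock", "d": "pop", "e": "pop", "f": "rock"}

# Build the genre partition: one set of nodes per genre
genre_partition = {}
for node, genre in genre_of.items():
    genre_partition.setdefault(genre, set()).add(node)
genre_names = list(genre_partition.keys())
genre_sets = list(genre_partition.values())

# Detect communities with the Louvain method (seed fixed for reproducibility)
communities = louvain_communities(G, seed=42)

# Modularity of the genre partition versus the detected partition
Q_genre = modularity(G, genre_sets)
Q_louvain = modularity(G, communities)
print(f"Modularity (genres):  {Q_genre:.3f}")
print(f"Modularity (Louvain): {Q_louvain:.3f}")

# Confusion matrix: D[i, j] = number of nodes with genre i in community j
D = np.zeros((len(genre_names), len(communities)), dtype=int)
for j, community in enumerate(communities):
    for node in community:
        D[genre_names.index(genre_of[node]), j] += 1
print(D)
```

If the rows of `D` each concentrate in a single column, genres and communities largely agree; rows spread across many columns suggest the network structure is driven by something other than genre.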

# Part 2: TF-IDF to understand genres and communities 

The questions below are based on Lecture 7, parts 2 and 3.

* Explain the concept of TF-IDF in your own words and how it can help you understand the genres and communities.
* Calculate and visualize TF-IDF for the genres and communities.
* Use the matrix $D$ (Lecture 7, part 2) to discuss the differences between the word clouds for genres and those for communities.
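
To make the TF-IDF idea concrete, here is a small sketch in plain Python. TF-IDF comes in several variants; this one assumes relative term frequency and an unsmoothed logarithmic inverse document frequency, $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log(N / \text{df}(t))$, which may differ from the variant used in the lecture. The corpus is invented for illustration: each "document" stands for all the text belonging to one genre (or one community).

```python
import math
from collections import Counter

# Hypothetical toy corpus: one token list per genre (or community)
documents = {
    "rock": ["guitar", "band", "album", "guitar", "tour"],
    "pop": ["singer", "album", "chart", "single", "chart"],
    "jazz": ["saxophone", "band", "album", "improvisation"],
}

def tf_idf(documents):
    """Return {document name: {word: TF-IDF score}} for a dict of token lists."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # relative term frequency times log inverse document frequency
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores

scores = tf_idf(documents)
for name, word_scores in scores.items():
    top_words = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(name, top_words)
```

A word such as "album" that appears in every document gets a score of zero, while words specific to one genre rise to the top. Those top-scoring words are the ones you would feed into a word cloud.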

# Part 3: Sentiment of the artists and communities

The questions below are based on Lecture 8.

* Calculate the sentiment of the artists' pages (OK to work with the sub-network of artists-with-genre) and describe your findings using statistics and visualizations, inspired by the first exercise of week 8.
* Discuss the sentiment of the largest communities. Do your TF-IDF findings from Lecture 7 help you understand your results?
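
One plausible way to score sentiment with a word list such as LabMT is to average the scores of the tokens that appear in the list, then average those per-artist values within each community. The sketch below assumes exactly that. The helper name `average_sentiment`, the word scores, the artists, and the communities are all invented for illustration; they are not taken from `Data_Set_S1.txt` or from the lecture.

```python
from statistics import mean

# Illustrative word scores on a 1-9 scale (made-up values, NOT real LabMT data)
toy_scores = {"love": 8.0, "happy": 8.0, "music": 7.0, "sad": 2.0, "war": 2.0}

def average_sentiment(tokens, word_scores):
    """Mean score of the tokens found in word_scores, or None if none match."""
    matched = [word_scores[token] for token in tokens if token in word_scores]
    return mean(matched) if matched else None

# Hypothetical tokenized page text per artist
artist_tokens = {
    "artist_a": ["love", "music", "happy"],
    "artist_b": ["war", "sad", "music"],
    "artist_c": ["unknown", "words"],  # no token is in the word list
}
artist_sentiment = {
    artist: average_sentiment(tokens, toy_scores)
    for artist, tokens in artist_tokens.items()
}

# Hypothetical communities (sets of artists), e.g. from a community detection step
communities = [{"artist_a", "artist_c"}, {"artist_b"}]

# Community sentiment: mean over member artists that have a score
community_sentiment = []
for community in communities:
    scored = [artist_sentiment[a] for a in community if artist_sentiment[a] is not None]
    community_sentiment.append(mean(scored) if scored else None)

print(artist_sentiment)
print(community_sentiment)
```

Returning `None` for texts without any matching word keeps unscored artists from silently dragging the averages toward zero; whether to drop or impute such artists is a design choice worth stating in your analysis.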

In [None]:
import json
import networkx as nx
import re
import matplotlib.pyplot as plt
import seaborn as sns

# Load the LabMT word list (load_labmt_word_list is assumed to be defined in an earlier cell)
labmt_dict = load_labmt_word_list('Data_Set_S1.txt')

# Load the genres dataset
with open('genres.txt', 'r') as f:
    genres_data = json.load(f)

# Initialize directed graph
G = nx.DiGraph()

# Tokenize function (for genres)
def tokenize_genres(genres):
    tokens = []
    for genre in genres:
        tokens.extend(re.findall(r'\b\w+\b', genre.lower()))
    return tokens

# Calculate sentiment for each artist's genres and add it as a node attribute
# (calculate_sentiment is assumed to be defined in an earlier cell)
for artist, genres in genres_data.items():
    tokens = tokenize_genres(genres)
    sentiment_score = calculate_sentiment(tokens, labmt_dict)
    G.add_node(artist, genres=genres, sentiment=sentiment_score)

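# Calculate statistics, skipping artists without a score
# (assumes calculate_sentiment returns None when no token matches the word list)
sentiments = [data['sentiment'] for _, data in G.nodes(data=True) if data['sentiment'] is not None]
mean_sentiment = sum(sentiments) / len(sentiments)

# Median: middle value for odd counts, mean of the two middle values for even counts
sorted_sentiments = sorted(sentiments)
mid = len(sorted_sentiments) // 2
if len(sorted_sentiments) % 2 == 1:
    median_sentiment = sorted_sentiments[mid]
else:
    median_sentiment = (sorted_sentiments[mid - 1] + sorted_sentiments[mid]) / 2

# Population standard deviation (divides by N, not N - 1)
std_dev_sentiment = (sum((x - mean_sentiment) ** 2 for x in sentiments) / len(sentiments)) ** 0.5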

print(f"Mean Sentiment: {mean_sentiment}")
print(f"Median Sentiment: {median_sentiment}")
print(f"Standard Deviation of Sentiment: {std_dev_sentiment}")

# Plotting sentiment distribution
sns.histplot(sentiments, kde=True)
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Distribution of Sentiment Scores for Artists based on Genres')
plt.show()
