# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Sunidhi Goyal
- Userid: myp8ma
- GitHub Repo URL: https://github.com/sunidhigoyal05/eta_female_popstars
- UVA Box URL: https://virginia.app.box.com/folder/262022862641

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# ----------------------------------------------------------------------------------
# ----------------------------------------------------------------------------------
#                                               POP QUEENS THROUGH THE DECADES
# ----------------------------------------------------------------------------------
# ----------------------------------------------------------------------------------

### WITH A LITTLE AWARD CEREMONY AT THE END

# ----------------------------------------------------------------------------------


# Raw Data

## Source Description (1)

-------------------------------------------------------------------------------------------
Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

-------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
To gather the data for this assignment, I used the Genius.com API, authenticated with an access key, to extract song lyrics. The dataset contains the lyrics of 50 of the most popular songs by each of the following 11 female pop stars: 'ADELE',
 'AMY WINEHOUSE',
 'ARIANA GRANDE',
 'BEYONCE',
 'CYNDI LAUPER',
 'DOLLY PARTON',
 'DUA LIPA',
 'LAURYN HILL',
 'PINK',
 'TAYLOR SWIFT',
 'WHITNEY HOUSTON'

This approach provided a dataset focused on popular music by female pop stars. Each artist's lyrics were obtained through API requests, so the data were collected in a consistent way. The dataset allows for detailed examination and comparison of lyrical content among a diverse group of artists.

The metadata for each artist, consisting of 'nationality', 'genre', 'decade_of_prominence', 'birth_year', 'instruments', and 'character_count', was found using Wikipedia.
-------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: Scraping code: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/lyrics_scraper_genius.ipynb
- Genius: https://genius.com/
- UVA Box URL: https://virginia.box.com/s/k78cceqlc7qebmyilei9yjko431r4awp
- Number of raw documents: 1 
- Total size of raw documents (e.g. in MB): 1020.6 kB
- File format(s), e.g. XML, plaintext, etc.: txt file


-------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
## Source Document Structure (1)

-------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in a document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and a closing. If there are various structures, state that.

The document has 10 new lines before each artist's name, which is written in capitals, and then that artist's body of work starts. Within each artist's body of work, each song begins with its title. The title is typically presented as a heading and may also include additional information like "(Live Acoustic)" or "(Taylor's Version)".

Songs are generally divided into distinct sections that follow a consistent pattern. These sections include:

- Intro: Some songs have a brief introductory section, which often contains a few lines of lyrics. Not all songs include this section.
- Verse: This is a primary section of the song where the narrative or storytelling occurs. A song usually contains multiple verses, each contributing to the overarching story or theme.
- Pre-Chorus: This is a transitional section leading into the chorus. Not all songs have a pre-chorus.
- Chorus: This section is typically the most recognizable part of the song, with repeated lyrics or phrases. It often encapsulates the main theme or emotion of the song.
- Bridge: A contrasting section that provides a break from the repeating verse-chorus structure. It often introduces a new perspective or variation in the melody.
- Outro: This section concludes the song, either with repeating lines or unique lyrics that signal the song's end.

Within each section, the lyrics are presented as text lines. These lines are often broken into smaller segments with stanza-like structures.

-------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
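As a rough illustration of the structure described above, the sketch below splits a raw text file into per-artist blocks. It assumes that a run of at least ten newline characters precedes each artist's name and that the name sits alone on its own line. The function name `split_by_artist` and the sample string are hypothetical and are not taken from the project's notebooks.

```python
import re


def split_by_artist(raw_text):
    """Split raw lyrics text into a dict mapping artist name -> body text.

    Assumption: each artist block is preceded by ten or more newlines
    and begins with the artist's name on its own line.
    """
    blocks = re.split(r"\n{10,}", raw_text)
    result = {}
    for block in blocks:
        block = block.strip("\n")
        if not block:
            continue  # skip the empty piece before the first separator
        name, _, body = block.partition("\n")
        result[name.strip()] = body.strip("\n")
    return result


# Hypothetical miniature input that mimics the described layout
sample = (
    "\n" * 10 + "ARTIST ONE\nFirst Song Title\nfirst lyric line\n"
    + "\n" * 10 + "ARTIST TWO\nSecond Song Title\nsecond lyric line\n"
)
artists = split_by_artist(sample)
print(list(artists))
```

Splitting songs and sections within each block would need further rules that depend on how titles and section labels are actually marked in the file.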

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/qaq97zljpnhrep0xviu2dhvkv120rug4
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/lib_table.ipynb
- Delimiter: ,
- Number of observations: 10
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.): 'nationality', 'genre', 'decade_of_prominence', 'birth_year', 'instruments', 'character_count'
- Average length of each document in characters: see the `n_characters` column in the following table.

![Screenshot 2024-05-03 at 11.16.57 PM.png](attachment:ac9233e4-b247-42e4-8b3e-18a2d15629bd.png)

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/n4svx4fu8v1n9ykmib521eh7bqgflmz5
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/tokens_table.ipynb
- Delimiter: ,
- Number of observations: 93450 rows
- OHCO Structure (as delimited column names): `OHCO = ['artist', 'song', 'lines', 'token_num']`
  - `WORDS = OHCO[:4]`
  - `LINES = OHCO[:3]`
  - `SONG = OHCO[:2]`
  - `ARTIST = OHCO[:1]`
- Columns (as delimited column names, including `token_str`, `term_str`, `pos`, and `pos_group`): 'artist', 'song', 'lines', 'token_num', 'pos_tuple', 'pos', 'token_str', 'term_str', 'pos_group'

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/fkhji18w0plxh1ni54109nsxzw47gent
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/vocab_table.ipynb
- Delimiter: ,
- Number of observations: 4184 rows
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos` and `max_pos_group`, `stop`): ['term_str', 'n', 'p', 'i', 'n_chars', 'max_pos_group', 'max_pos', 'stop', 'porter_stem', 'df', 'idf', 'dfidf', 'pos_max']
- List the top 20 significant words in the corpus by DFIDF.

![Screenshot 2024-05-03 at 11.24.31 PM.png](attachment:78fcd2b8-1463-467b-8bd8-31f9ee3e9d80.png)

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.app.box.com/file/1520849565288
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/bow_by_artist.ipynb
- Delimiter: ,
- Bag (expressed in terms of OHCO levels): OHCO[:1] -> Artist
- Number of observations: 8218 rows
- Columns (as delimited names, including `n`, `tfidf`): ![Screenshot 2024-05-03 at 11.27.16 PM.png](attachment:03d7c5d2-81b6-4487-a252-2a8b40a55ce7.png)
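To make the idea of a bag at the `OHCO[:1]` level concrete, here is a minimal sketch that counts terms per artist from a list of token rows. The rows, artist labels, and the helper `build_bow` are hypothetical placeholders, not the project's data or code; the actual notebook may well use pandas instead.

```python
from collections import Counter

OHCO = ["artist", "song", "lines", "token_num"]
BAG = OHCO[:1]  # artist-level bag, as stated above

# Hypothetical token rows: (artist, song, line number, token number, term_str)
tokens = [
    ("artist_a", "song_1", 0, 0, "dance"),
    ("artist_a", "song_1", 0, 1, "with"),
    ("artist_a", "song_2", 0, 0, "dance"),
    ("artist_b", "song_3", 0, 0, "hello"),
]


def build_bow(rows, bag_depth):
    """Count term occurrences per bag.

    The bag key is the first `bag_depth` OHCO levels of each row,
    and the term is the last element of the row.
    """
    bow = Counter()
    for row in rows:
        bag_key = row[:bag_depth]
        term = row[-1]
        bow[(bag_key, term)] += 1
    return bow


bow = build_bow(tokens, len(BAG))
for (bag_key, term), n in sorted(bow.items()):
    print(bag_key, term, n)
```

Changing `bag_depth` to 2 would give song-level bags instead, which is the sense in which the bag is "expressed in terms of OHCO levels".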

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/qcolpxyn4ws58bta9xr0gdt34cvzia3h
- UVA Box URL of BOW used to generate (if applicable): https://virginia.app.box.com/file/1520849565288
- GitHub URL for notebook used to create:  https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/DTM.ipynb
- Delimiter: ,
- Bag (expressed in terms of OHCO levels): OHCO[:1] -> artist

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/qcolpxyn4ws58bta9xr0gdt34cvzia3h
- UVA Box URL of DTM or BOW used to create: https://virginia.box.com/s/qcolpxyn4ws58bta9xr0gdt34cvzia3h
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/bow_by_artist.ipynb
- Delimiter: ,
- Description of TFIDF formula ($\LaTeX$ OK):

$$\text{TF} = \frac{\text{occurrences of term in document}}{\text{total words in document}}$$

$$\text{IDF} = \log\left(\frac{\text{total documents in corpus}}{\text{documents with term}}\right)$$

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$
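The formula above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the notebook's actual code: the function `tfidf` and the two toy documents are hypothetical, and the natural logarithm is assumed because the formula does not fix the base.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF per (document, term) following the formula above.

    `docs` maps a document id to its list of terms. The natural log
    is an assumption; the formula does not state the base.
    """
    n_docs = len(docs)
    counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document frequency: number of documents that contain each term
    df = Counter(term for c in counts.values() for term in c)
    scores = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        for term, n in c.items():
            tf = n / total
            idf = math.log(n_docs / df[term])
            scores[(doc_id, term)] = tf * idf
    return scores


# Hypothetical toy documents (one bag per artist)
docs = {
    "artist_a": ["love", "dance", "dance"],
    "artist_b": ["love", "home"],
}
scores = tfidf(docs)
for key in sorted(scores):
    print(key, round(scores[key], 4))
```

One consequence worth noting: with bags at the artist level there are only a handful of documents, so any term used by every artist gets an IDF of zero and drops out of the TFIDF table entirely.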

## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/yulfp8s2iakz5y4chb10geq9ckog5mvh
- UVA Box URL of source TFIDF table: https://virginia.box.com/s/qcolpxyn4ws58bta9xr0gdt34cvzia3h
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/tfidf_l2.ipynb
- Delimiter: ,
- Number of features (i.e. significant words): 3460
- Principle of significant word selection: Nouns and Verbs
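L2 normalization rescales each document row so that its Euclidean length is 1, which keeps longer bodies of work from dominating later distance or PCA computations. The sketch below shows the operation on a single hypothetical row; the helper name `l2_normalize` and the values are illustrative only.

```python
import math


def l2_normalize(row):
    """Scale a {term: value} row so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(v * v for v in row.values()))
    if norm == 0:
        return dict(row)  # leave all-zero rows unchanged
    return {term: v / norm for term, v in row.items()}


# Hypothetical TFIDF row for one artist
row = {"love": 3.0, "dance": 4.0}
normalized = l2_normalize(row)
print(normalized)
```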

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/yulfp8s2iakz5y4chb10geq9ckog5mvh
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/qcolpxyn4ws58bta9xr0gdt34cvzia3h
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/pca_components_and_loadings.ipynb
- Delimiter: ,
- Number of components: 10
- Library used to generate: scikit-learn
- Top 5 positive terms for first component: ![Screenshot 2024-05-03 at 11.38.46 PM.png](attachment:752c0d72-24f2-45c3-93f9-e062d073c2a4.png)
- Top 5 negative terms for second component: ![Screenshot 2024-05-03 at 11.39.02 PM.png](attachment:c7ddd07f-e867-4618-b819-7cee93ffa5fa.png)

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.app.box.com/file/1520849372595
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/pca_components_and_loadings.ipynb
- Delimiter: ,

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.app.box.com/file/1520849372595
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/pca_components_and_loadings.ipynb
- Delimiter: ,

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

![Screenshot 2024-05-03 at 11.47.59 PM.png](attachment:b29013a2-c509-4416-8fe4-9e4b0519081f.png)

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![Screenshot 2024-05-03 at 11.48.49 PM.png](attachment:7905bfe9-557f-460f-a59a-3adff7ac6861.png)

Briefly describe the nature of the polarity you see in the first component:

The scatterplots above do not show a very clear relationship. However, the loadings plot does show that the words lying farthest toward the negative end are often verbs. In the document scatterplot, the British artists lean somewhat toward the negative pole of the first component compared with their American counterparts.


## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

![Screenshot 2024-05-03 at 11.53.28 PM.png](attachment:d94c5331-6f84-4153-8ebf-710a0f77856f.png)

Also include a scatterplot of the loadings for the same two components. 

![Screenshot 2024-05-03 at 11.55.48 PM.png](attachment:5eee8824-72a6-4407-b1d8-2993c3b5c1ba.png)

Briefly describe the nature of the polarity you see in the second component:

In the first scatterplot, the British artists appear more balanced, while the American artists lean comparatively more toward the positive pole.
The loadings visualization shows a starburst pattern, but no relationship with the words' parts of speech is apparent.

## LDA TOPIC (4)

- UVA Box URL: https://virginia.box.com/s/1ngppbf7bpd74rt3i4fuypwnw1eralpy
- UVA Box URL of count matrix used to create: https://virginia.box.com/s/qcolpxyn4ws58bta9xr0gdt34cvzia3h
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/lda.ipynb
- Delimiter: ,
- Library used to compute: scikit-learn
- A description of any filtering: Nouns and Verbs only
- Number of topics: 40
- Any other parameters used: no
- Top 5 words and best-guess labels for top five topics by mean document weight: ![Screenshot 2024-05-04 at 12.06.53 AM.png](attachment:b5338cca-95a1-48bd-8cc8-7da96345aa02.png)
  - T00: finding home
  - T01: dancing through pain
  - T02: hopeful love
  - T03: dance sounds
  - T04: hopeless love

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/1ngppbf7bpd74rt3i4fuypwnw1eralpy
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/lda.ipynb
- Delimiter: ,

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/pouteoyz1olzmfujrl3rmwo2gh9nqhgy
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/lda.ipynb
- Delimiter: ,



## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

![Screenshot 2024-05-04 at 12.13.10 AM.png](attachment:2e1e97cc-0825-4a24-abf7-d3713ee7b84f.png)

I checked which topics pop stars used over the decades. The topics do not show a particular pattern that would tell us that pop stars in a certain decade chose to write about a particular kind of topic; the decades are all mixed together.

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/wbvg8tgao9v6mkywy1nm60fucu3y5xq3
- UVA Box URL for source lexicon: https://virginia.box.com/s/mqc51lb0b7w5thpmtw0rpa82gasrokdi
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/sentiment_analysis.ipynb
- Delimiter: ,

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.box.com/s/oajxsj80f8k6b3236zc1mges89y9fcrg
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/sentiment_analysis.ipynb
- Delimiter: ,

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.box.com/s/oajxsj80f8k6b3236zc1mges89y9fcrg
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/sentiment_analysis.ipynb
- Delimiter: ,
- Document bag expressed in terms of OHCO levels: OHCO[:1] -> artist
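One plausible way to roll term-level sentiment in BOW_SENT up to a per-artist DOC_SENT value is a count-weighted mean, sketched below. Both the aggregation rule and the sample rows are assumptions made for illustration; the sentiment values are invented and do not come from the lexicon linked above, and the notebook may aggregate differently.

```python
from collections import defaultdict

# Hypothetical BOW_SENT rows: (artist, term_str, n, sentiment value of the term)
bow_sent = [
    ("artist_a", "love", 4, 1.0),
    ("artist_a", "cry", 1, -1.0),
    ("artist_b", "cry", 3, -1.0),
    ("artist_b", "home", 1, 1.0),
]


def doc_sentiment(rows):
    """Return the count-weighted mean sentiment for each bag (here, each artist)."""
    weighted = defaultdict(float)
    totals = defaultdict(int)
    for bag, _term, n, sentiment in rows:
        weighted[bag] += n * sentiment
        totals[bag] += n
    return {bag: weighted[bag] / totals[bag] for bag in weighted}


doc_sent = doc_sentiment(bow_sent)
for bag in sorted(doc_sent):
    print(bag, doc_sent[bag])
```

Weighting by count means that a frequently repeated chorus word pulls the artist's score more than a word that appears once, which seems reasonable for lyrics but is a design choice rather than a requirement.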



Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

*By Decade of Prominence*

![Screenshot 2024-05-04 at 12.25.50 AM.png](attachment:3a2be878-6b06-4290-b3fa-cb8a61e9028d.png)

*By Artist's Birth Years*

![Screenshot 2024-05-04 at 12.26.40 AM.png](attachment:5150b92f-3be2-48b3-a6b2-989c9d6a7c6d.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/k52ffuutoi3zjpruusaf89klmrp133jk
- GitHub URL for notebook used to create: https://github.com/sunidhigoyal05/eta_female_popstars/blob/main/word2vec.ipynb
- Delimiter: ,
- Document bag expressed in terms of OHCO levels: OHCO[:1] -> artist
- Number of features generated: 246
- The library used to generate the embeddings: Gensim

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![Screenshot 2024-05-04 at 12.34.53 AM.png](attachment:3e4546a3-d0d7-47a4-b12a-0fb88ce97115.png)

Although the t-SNE plot as a whole does not give any special insights, the clusters that form within it are really interesting. I see words like "down", "let", and "go" clustering together, and "me", "when", and "what" clustering together, which may reflect the constant questioning and doubt in these singer-songwriters' minds. One cluster I would like to expand on is:

![Screenshot 2024-05-04 at 12.41.08 AM.png](attachment:26ed691f-cf0c-493a-8e4f-85c71b6c83ad.png)

It has words like "which", "how", "made", "about", and "alright", which point toward a story of heartbreak and eventual healing, and it was really interesting to me.


# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

![Screenshot 2024-05-04 at 12.43.46 AM.png](attachment:f39eb1ac-5ad6-4dd4-a7c0-0eb919fa3351.png)

This shows which artists use some of these common song words the most in their bodies of work.
For example, "so" was used a lot by Whitney Houston.
The word "dance" was used a lot by Dua Lipa in her body of work, and so on.

## Riff 2 (5)

![Screenshot 2024-05-04 at 12.48.59 AM.png](attachment:90a1ccb3-e31f-4349-bacc-4d5416fceefd.png)

This dendrogram gives insight into the similarity and dissimilarity between the bodies of work of different artists. It is interesting that Whitney Houston's and Adele's work are grouped as similar, and the same is true for Dua Lipa and Beyonce. However, the works of Cyndi Lauper and Whitney Houston are the farthest apart.

## Riff 3 (5)

![Screenshot 2024-05-04 at 12.46.24 AM.png](attachment:d74d2bfe-acc3-4c4e-ba7f-a8656e0d6cfa.png)

This plot gives insight into the sentiments and emotions that dominate each artist's most popular works. It would be interesting for any fan of these artists to know how they think and which emotions dominate their most popular works. Like so:

![5.png](attachment:1080f2d7-e0ab-4bff-ad18-f1c7ddee510a.png)
![6.png](attachment:efa22bab-52a2-4172-bda7-6997cff8f318.png)

# The Promised Award Ceremony :  AFTER ALL, TOPICS SHOULD BE GIVEN AWARDS TOO

# --------------------------------------------------------------
![4.png](attachment:4648b095-f9c3-4011-9fb6-2bb1ce9713b8.png)
# --------------------------------------------------------------
![3.png](attachment:26482039-50d1-4f51-bc37-abfc382068b7.png)
# --------------------------------------------------------------
![2.png](attachment:8cee420d-a2a2-441e-a66b-4d41b979f2ee.png)
# --------------------------------------------------------------
![1.png](attachment:8d69ce27-3c5d-4a92-aa31-71026b1bf914.png)

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

For me, it was really interesting to see that our favorite pop divas have been writing about the same topics over the years, although each of them has favored some topics over others. I always used to listen to happy music by P!nk, so it was surprising that her 50 most popular songs consist of sad songs. Finding the words used most by certain artists, and most in certain decades, was also interesting.
It is also very interesting how the choice of words and topics, and the vibe of pop music and songwriting, have evolved over the decades. It is funny that in the 1970s singers used the topic "come hear me" the most, whereas now a "need calm" and "I am done" kind of attitude prevails.

It is also interesting that the sentiments in the songs of British and American pop divas are not quite the same, and the British ones are comparatively sadder.

Another interesting thing was that the topics could not be clearly separated using PCA. The topics remain intertwined even across very different components, which suggests that the topics used by female pop stars overlap regardless of era, nationality, or the supplementary genres they work in.

It also gives insight into how female perspectives and attitudes are changing over the decades, and how the choice of words, even when the topics are the same, is evolving and becoming more authoritative instead of submissive.

