# Summary of this notebook

In this notebook, we import the "Composers" and "Producers" subreddit data sets from the [last notebook](./01_data_collection.ipynb).  We remove any duplicate posts from these data sets, and we fill "missing" body text with the empty string (since it appears that these body texts are not actually missing, but intentionally empty).  We also make sure that there are no other obvious issues with the data, such as missing post titles or incorrect data types.  We then export the cleaned data sets.

## Import packages

In [1]:
import numpy as np
import pandas as pd

In [2]:
#Change display options to show longer messages
pd.options.display.max_colwidth = 400

## Import data

To ensure replicability, we use data collected up to January 16th, 2023.  These data are contained in the `composers_jan16` and `producers_jan16` CSV files in the [data folder](../data).  To instead use the most recent data, replace the names in the next cell with simply `composers` and `producers`.

In [3]:
#Change this to just 'composers' to use the most recent data
composers_to_use = 'composers_jan16'

#Change this to just 'producers' to use the most recent data
producers_to_use = 'producers_jan16'

In [4]:
composers = pd.read_csv(f"../data/{composers_to_use}.csv", index_col='id')
composers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1086 entries, 107hfj5 to 10b7z2a
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   title   1086 non-null   object 
 1   text    1086 non-null   object 
 2   utc     1086 non-null   float64
 3   author  1076 non-null   object 
dtypes: float64(1), object(3)
memory usage: 42.4+ KB


In [5]:
producers = pd.read_csv(f"../data/{producers_to_use}.csv", index_col='id')
producers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1245 entries, 107ilpd to 10b33q8
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   title   1245 non-null   object 
 1   text    1030 non-null   object 
 2   utc     1245 non-null   float64
 3   author  1237 non-null   object 
dtypes: float64(1), object(3)
memory usage: 48.6+ KB


## Drop the `author` column

In [6]:
composers.drop(columns='author', inplace=True)
producers.drop(columns='author', inplace=True)

# "Composers" data

## Check for missing data

In [7]:
composers.isnull().sum()

title    0
text     0
utc      0
dtype: int64

## Examine non-unique titles

In [8]:
composers['title'].value_counts(ascending=False).head(3)

Music challenge app for creative development    3
New to composing                                2
minature series                                 1
Name: title, dtype: int64

It looks like there are a couple titles that have multiple posts.  Are these duplicates, or are they truly different posts that happen to share a name?

In [9]:
composers[composers['title']=='Music challenge app for creative development']

id,title,text,utc
10245a9,Music challenge app for creative development,"Hi,\n\nI created a web app with music challenges in form of an image- goal is to create a song that resembles/ describes given photo! Song with most votes is declared a winner! There is also search by colour for challenges. I just made a mobile version too so check it out!\n\n[https://musicchallengeapp.onrender.com/](https://musicchallengeapp.onrender.com/)",1672740000.0
1022y60,Music challenge app for creative development,"Hi,\n\nI created a web app with music challenges in form of an image- goal is to create a song that resembles/ describes given photo! Song with most votes is declared a winner! There is also search by colour for challenges. I just made a mobile version too so check it out!\n\n[https://musicchallengeapp.onrender.com/](https://musicchallengeapp.onrender.com/)",1672736000.0
zg6jlb,Music challenge app for creative development,"Hi,\n\nI created a web app with music challenges in form of an image- goal is to create a song that resembles/ describes given photo! Song with most votes is declared a winner! There is also search by colour for challenges. I just made a mobile version too so check it out!\n\n[https://musicchallengeapp.onrender.com/](https://musicchallengeapp.onrender.com/)",1670523000.0


These look like duplicates that have been re-posted, so we'll want to keep just one of these three.  What about the second non-unique title?

In [10]:
composers[composers['title']=='New to composing']

id,title,text,utc
ymqqyx,New to composing,"I have been playing piano for almost 2 decades now and I wanted to start creating some orchestral scores for a game I'm working on. But my current problem is that I can't seem to find any good free instrument libraries for the music. I found some where you have 1 piano, or 1 violin, but never a full orchestra. I also use the sforzando font, but I have a feeling that a font is not the prefered ...",1667646000.0
xmm6sk,New to composing,What are some good techniques to add to a composition?,1664005000.0


These are different messages that just happen to have the same title.  So we'll want to keep both of these.

Below, we write a function that should accomplish this data-cleaning task automatically, regardless of the number of different posts that share a title.  It deletes all but one of any "duplicate" posts - i.e. those posts that have the same title *and* the same message - while leaving in those posts that have the same title but different messages.

In [11]:
def remove_duplicates(df):
    '''
    Deletes all but one of any "duplicate" posts (i.e. posts that have the same title
    AND the same text) from the given dataframe df.
    '''
    
    titles_with_duplicates_at_top = df['title'].value_counts(ascending=False)
    
    for title in titles_with_duplicates_at_top.index:
        #If we've already reached a title that appears only once, then there are no more
        #duplicate titles at all (as the series is sorted by title frequency).
        if titles_with_duplicates_at_top[title]==1:
            break
    
        #Otherwise, we have a duplicate title, so we'll check if it corresponds to
        #duplicate messages.  If so, we'll drop all but one of each of the duplicates.
        
        #Make a dataframe of just those posts with this (duplicate) title
        has_this_title = df[df['title']==title]
        
        #Loop over each different message that has this same title
        for message in has_this_title['text'].value_counts().index:
            
            #Check if there is more than one copy of this message
            if has_this_title['text'].value_counts()[message]>1:
                
                #If so, record the indices of all but one of these messages...
                drop_inds = has_this_title[has_this_title['text']==message].index[:-1]
                
                #then drop these indices from the original data frame.
                df.drop(index=drop_inds, inplace=True)
            

In [12]:
#Apply the function
remove_duplicates(composers)

#See what happened
composers['title'].value_counts(ascending=False).head(3)

New to composing                                              2
minature series                                               1
Educational institutions for composition/electronic music?    1
Name: title, dtype: int64

As we can see, the function successfully dropped the "truly duplicate" posts but left in both posts that have the title "New to composing" (since these posts have different body text).
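
For comparison, here is a minimal sketch of how the same rule (same title *and* same text) might be expressed with pandas' built-in `drop_duplicates`. The toy dataframe `posts` and its contents are hypothetical and only for illustration. Using `keep='last'` mirrors the function above, which drops every matching row except the last one. Two differences are worth noting: this version returns a new dataframe rather than modifying the original in place, and it treats two `NaN` texts as equal, whereas `value_counts` ignores `NaN` by default.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
posts = pd.DataFrame(
    {
        "title": ["App", "App", "App", "New", "New", "Solo"],
        "text": ["same body", "same body", "same body",
                 "first body", "second body", "x"],
    },
    index=pd.Index(["a1", "a2", "a3", "b1", "b2", "c1"], name="id"),
)

# Rows count as duplicates only if BOTH title and text match.
# keep="last" keeps the final row of each duplicate group.
deduped = posts.drop_duplicates(subset=["title", "text"], keep="last")

print(deduped)
```

In this sketch the three "App" rows collapse to one, while both "New" rows survive because their texts differ.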

## Examine non-unique messages

In [13]:
composers['text'].value_counts(ascending=False).head(3)

Over 2022, I wrote about 30 little pieces for solo piano. I find it very satisfying and helpful to my process to constantly finish little things in between my larger projects- it helps me stay organized and keep composing.\nStay tuned as I'll be releasing one little miniature recording a week on youtube for the next few months!\nThis first installment is called "Madrigal" after the old italian vocal tradition. It's a tender little thing that I think is appropriate for this time of year! It's dedicated to my old composition teacher at Peabody, Michael Hersch, and is a sort of hommage to some of his music.\nHope y'all enjoy!\n\nhttps://youtu.be/DSJukDcSRsQ

So there are no duplicate messages remaining in the `composers` dataframe.
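
Reading the top of a `value_counts` listing works, but long post bodies make the output hard to scan. As a rough sketch, a check like this can also be done programmatically with `Series.duplicated`. The series `texts` below is a hypothetical stand-in for a column such as `composers['text']`.

```python
import pandas as pd

# Hypothetical stand-in for a text column
texts = pd.Series(["alpha", "beta", "gamma"], name="text")

# duplicated() marks every repeat of an earlier value as True
has_repeats = bool(texts.duplicated().any())
n_repeats = int(texts.duplicated().sum())

print(has_repeats, n_repeats)
```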

## Export "composers" data set

In [14]:
composers.to_csv('../data/composers_cleaned.csv', index_label='id')

# "Producers" data

## Check for missing data

In [15]:
producers.isnull().sum()

title      0
text     215
utc        0
dtype: int64

In [16]:
#Look at the first 20 posts with missing body text
producers[producers['text'].isnull()].head(20)

id,title,text,utc
107ilpd,"I'm starting a YouTube series about home-recording. A stream-of-consciousness type of ""explain my process"" show. I'm curious if this is something you all might find fun to watch. Thoughts?",,1673282000.0
107iae0,"Recording Vocals With a Neumann U87 ai, Neve 1073 SPX, Distressor Compre...",,1673281000.0
107htcg,"fuck it im going dawless, its way mire fun jamming this way than with plugins imo",,1673280000.0
107g42u,A fantastic AI Audio to MIDI tool,,1673276000.0
107fgke,Should I sell my music on Artlist or Epidemic Sound? Anyone have any experience?,,1673274000.0
10754js,"How to get this airy, spacey groove? I tried recreating this song on Logic but mine sounded much more duller.",,1673240000.0
106zcsk,Does anyone know how the guitar riff in the beginning of Mama Mia sounds the way it does?,,1673224000.0
106ymzd,I wanna start recording myself singing with background music. Do you think this recording bundle will help me do that?,,1673222000.0
106vd41,music for my school play,,1673214000.0
106sja7,"Does anybody here make music for Epidemic sounds? If so, what has your experience been like?",,1673208000.0


There don't seem to be any obvious issues with these posts that are missing text.  So we'll just replace the `NaN` values with the empty string so that our text processing algorithms won't run into any issues.

In [17]:
#Replace NaNs with empty strings
producers.fillna('', inplace=True)
producers.isnull().sum()

title    0
text     0
utc      0
dtype: int64
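
Calling `fillna('')` on the whole dataframe is harmless here because only the `text` column has missing values. If a numeric column such as `utc` ever contained `NaN`, though, a dataframe-wide fill would put empty strings into it as well. A minimal sketch of a column-targeted alternative follows; the dataframe `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in both a text and a numeric column
toy = pd.DataFrame(
    {"title": ["a", "b"], "text": ["body", np.nan], "utc": [1.0, np.nan]},
    index=pd.Index(["x1", "x2"], name="id"),
)

# Fill only the text column; numeric NaNs are left untouched
toy["text"] = toy["text"].fillna("")

print(toy.isnull().sum())
```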

## Examine non-unique titles

In [18]:
producers['title'].value_counts(ascending=False).head()

Free DAWS for Music Making?                                                                                                                                                                     2
Mysterious microphone crackling                                                                                                                                                                 2
Asus Flow z13 or x13                                                                                                                                                                            2
I'm starting a YouTube series about home-recording. A stream-of-consciousness type of "explain my process" show. I'm curious if this is something you all might find fun to watch. Thoughts?    1
How to get royalties?                                                                                                                                                                           1
Name: title, dtype: int64

This time we have three duplicate titles.  Rather than inspecting each one by hand, we can reuse the `remove_duplicates` function written above.

In [19]:
#Apply the function written above to remove duplicate posts
remove_duplicates(producers)

#See the results
producers['title'].value_counts(ascending=False).head()

Mysterious microphone crackling                                                                                                                                                                 2
I'm starting a YouTube series about home-recording. A stream-of-consciousness type of "explain my process" show. I'm curious if this is something you all might find fun to watch. Thoughts?    1
want to make music, keep forgetting the melodies that pop into my head, what are the best ways to "write" them down?                                                                            1
How to make a sample fit my tempo constantly?                                                                                                                                                   1
Mic records all audio even when there is no speaker volume                                                                                                                                      1
Name: title, dtype: int64

So we still have one title that is shared between two posts.  Let's examine them:

In [20]:
producers[producers['title']=='Mysterious microphone crackling']

id,title,text,utc
zvlvmf,Mysterious microphone crackling,"Hey \nI have a problem in my home studio. I can hear a crackling when I’m using my microphones. I use a Shure SM7b as you can see in the picture, but I also have a Rode NT1 a. I use the Rode AI 1 Interface. I tried it out with both mics, with two Rode AI 1s, with a Mac Mini, with a HP Laptop, with a HP Omen Gaming PC and in different rooms of my House.\nI noticed that the crackling is sometime...",1672056000.0
zuikqg,Mysterious microphone crackling,"Hey \nI have a problem with my setups. I can hear a crackling of my microphones. I tried both mics (shure sm7b, Rode nt1 a) with two Rode AI 1s, with a Mac Mini, with a HP Laptop and with a Gaming PC. \nI noticed that the crackling is sometimes not noticeable and on the Gaming PC very intensive. I also noticed that the more usb/hdmi cables I use with my Pc’s, the more intensive is the cracklin...",1671915000.0


As we can see, these two posts' texts are not *exactly* identical.  It's a bit of a judgment call as to whether to leave both in or not.  I will drop one of them at random:

In [21]:
#Get indices of these duplicate posts
inds = producers[producers['title']=='Mysterious microphone crackling'].index

#Set the random seed, for replicability
np.random.seed(123)

#Drop one of the duplicates at random
drop_ind = np.random.choice(inds)
producers.drop(index=[drop_ind], inplace=True)

#Check that it worked
producers[producers['title']=='Mysterious microphone crackling']

id,title,text,utc
zuikqg,Mysterious microphone crackling,"Hey \nI have a problem with my setups. I can hear a crackling of my microphones. I tried both mics (shure sm7b, Rode nt1 a) with two Rode AI 1s, with a Mac Mini, with a HP Laptop and with a Gaming PC. \nI noticed that the crackling is sometimes not noticeable and on the Gaming PC very intensive. I also noticed that the more usb/hdmi cables I use with my Pc’s, the more intensive is the cracklin...",1671915000.0
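
One way to make the "near-duplicate" judgment call less subjective is to compute a similarity score between the two texts. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. This is not what the notebook does. The helper `similarity` and the shortened example strings are hypothetical, and any cut-off for "similar enough" would be an assumption to tune by hand.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Shortened, hypothetical stand-ins for the two post bodies
text_a = "I can hear a crackling when I'm using my microphones."
text_b = "I can hear a crackling of my microphones."

score = similarity(text_a, text_b)
print(round(score, 2))
```

A high score would support treating the pair as a re-post; a low score would suggest keeping both.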


## Examine non-unique messages

In [22]:
producers['text'].value_counts(ascending=False).head(3)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              215
I was 

In [23]:
#Examine the duplicate message
duplicate_messages_at_top = producers['text'].value_counts(ascending=False)
duplicate_message = duplicate_messages_at_top.index[0]

producers[producers['text']==duplicate_message]

id,title,text,utc
107ilpd,"I'm starting a YouTube series about home-recording. A stream-of-consciousness type of ""explain my process"" show. I'm curious if this is something you all might find fun to watch. Thoughts?",,1.673282e+09
107iae0,"Recording Vocals With a Neumann U87 ai, Neve 1073 SPX, Distressor Compre...",,1.673281e+09
107htcg,"fuck it im going dawless, its way mire fun jamming this way than with plugins imo",,1.673280e+09
107g42u,A fantastic AI Audio to MIDI tool,,1.673276e+09
107fgke,Should I sell my music on Artlist or Epidemic Sound? Anyone have any experience?,,1.673274e+09
...,...,...,...
10bgjbp,How to emulate this (synthy?) bass sound?,,1.673674e+09
10bbf0w,does anyone know what this sfx is? i’ve been looking for it for a while,,1.673659e+09
10ba44j,"I’m having a lot of trouble getting a VST plugin on windows. I just get this file that doesn’t do anything. I’m trying to download to Cakewalk, but I can’t even extract anything in the first place.",,1.673655e+09
10b6e57,What’s the best mixing head phones out there?,,1.673646e+09


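The only repeated body text is the empty string, shared by the 215 posts whose body text was originally missing; these posts all have different titles.  Again, it's a bit of a judgment call as to whether to drop any of these.  As before, I will drop one at random.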

In [24]:
#Get indices of these duplicate posts
inds = producers[producers['text']==duplicate_message].index

#Set the random seed, for replicability
np.random.seed(123)

#Drop one of the duplicates at random
drop_ind = np.random.choice(inds)
producers.drop(index=[drop_ind], inplace=True)

#Check that it worked
producers[producers['text']==duplicate_message]

id,title,text,utc
107ilpd,"I'm starting a YouTube series about home-recording. A stream-of-consciousness type of ""explain my process"" show. I'm curious if this is something you all might find fun to watch. Thoughts?",,1.673282e+09
107iae0,"Recording Vocals With a Neumann U87 ai, Neve 1073 SPX, Distressor Compre...",,1.673281e+09
107htcg,"fuck it im going dawless, its way mire fun jamming this way than with plugins imo",,1.673280e+09
107g42u,A fantastic AI Audio to MIDI tool,,1.673276e+09
107fgke,Should I sell my music on Artlist or Epidemic Sound? Anyone have any experience?,,1.673274e+09
...,...,...,...
10bgjbp,How to emulate this (synthy?) bass sound?,,1.673674e+09
10bbf0w,does anyone know what this sfx is? i’ve been looking for it for a while,,1.673659e+09
10ba44j,"I’m having a lot of trouble getting a VST plugin on windows. I just get this file that doesn’t do anything. I’m trying to download to Cakewalk, but I can’t even extract anything in the first place.",,1.673655e+09
10b6e57,What’s the best mixing head phones out there?,,1.673646e+09
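
The random drops above rely on NumPy's global seed (`np.random.seed`), which any other code in the session can reset. As a sketch of an alternative, a local generator from `np.random.default_rng` keeps the seed scoped to this one choice. The dataframe `toy` is hypothetical, and this generator would not necessarily pick the same row as the legacy calls with the same seed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two posts with empty text, one with a body
toy = pd.DataFrame(
    {"text": ["", "", "body"]},
    index=pd.Index(["p1", "p2", "p3"], name="id"),
)

# Candidate ids: posts whose text is the empty string
candidates = toy[toy["text"] == ""].index.to_numpy()

# A local, seeded generator instead of the global np.random state
rng = np.random.default_rng(123)
drop_ind = rng.choice(candidates)

toy = toy.drop(index=[drop_ind])
print(toy)
```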


## Export "producers" data set

In [25]:
producers.to_csv('../data/producers_cleaned.csv', index_label='id')
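
One caveat when loading the cleaned files later: by default, `pd.read_csv` parses empty fields as `NaN`, so the empty strings filled in above would come back as missing values. The sketch below shows the round trip on a hypothetical dataframe `toy`, using an in-memory buffer instead of a file. Passing `keep_default_na=False` is one way to preserve the empty strings; calling `fillna('')` again after loading is another.

```python
import io

import pandas as pd

# Hypothetical cleaned data with one empty body text
toy = pd.DataFrame(
    {"title": ["a", "b"], "text": ["body", ""], "utc": [1.0, 2.0]},
    index=pd.Index(["x1", "x2"], name="id"),
)

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index_label="id")

# Default read: the empty string comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer, index_col="id")

# keep_default_na=False: the empty string is preserved
buffer.seek(0)
kept_read = pd.read_csv(buffer, index_col="id", keep_default_na=False)

print(default_read["text"].isnull().sum(), kept_read["text"].isnull().sum())
```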