# Capstone 2 Final Report: Predicting Music Genre from User Tweets
By: Soham Desai

__Table of Contents:__
1. Introduction
2. Problem Statement
3. Client Profile
4. Data Acquisition
5. Data Wrangling
6. Exploratory Data Analysis
7. Machine Learning
8. Summary, Recommendations and Next Steps

### 1. Introduction:

Social media can reveal a lot about a person, often more than they realize. My focus is specifically on Twitter. From a user's profile you can learn what they like to do and what they are interested in, based on what they tweet and retweet and the people they follow. How can this information be quantified, and how can it be leveraged in a business setting?

I've always enjoyed listening to music and have found it a common talking point with others. It is always around, whether it's a Spotify playlist or background music at your favorite coffee shop. You can gain a lot of insight by looking at an individual user and their tendencies, but is it possible to find trends and talking points among people who share similar musical interests? I aim to use Natural Language Processing and Machine Learning models to classify users who have self-identified in their Twitter bios as fans of a specific genre of music.

### 2. Problem Statement:

With the growing use of technology, the landscape of advertising is changing significantly. Ads can be tailored to individuals based on their interests, which lets more relevant information reach users and lets companies promote their products more effectively. By identifying common diction and textual similarities among users who prefer the same genre of music, this model could be extended to users who have not labeled their musical preferences, improving their ad experience on social media and other web platforms. I plan to use NLP techniques to build a solution to this problem.

### 3. Client Profile:
 
A huge amount of untapped data accumulates every day, and companies and people in general have no clue what to do with it. The ability to understand or group individuals based on their messages on a social site like Twitter would allow marketing companies such as AppLovin, Liftoff, Kantar, Amazon, etc. to target individuals based on what they say and how they say it. It could also help reach individuals who may not be aware of certain products. For example, let's say Sam listens to hip hop but you don't know that. With ML, you could infer it. Suppose a client comes to a marketing company wanting to promote his mixtape with, let's say, 100,000 impressions. Sam can be identified as a high-value target and shown this new hip hop music, rather than the company simply trying to reach as many people as possible.
 
### 4. Data Acquisition:

To begin my project I needed to acquire the tweets of individuals. I focused on five genres of music:
+ Hip Hop
+ Country
+ Jazz
+ Metal
+ Electronic Dance Music (EDM)

To identify users of each genre of music I used FollowerWonk, which allowed me to search the bios of Twitter users for keywords that may identify them as an enthusiast of a specific genre. The tags I used for each genre are listed below. In total I found 200 users per genre, for a grand total of 1,000.
+ Hip Hop --- "hip hop head", "hip hop fan"
+ Country --- "country music fan", "country music lover"
+ Jazz ------ "jazz enthusiast", "jazz lover"
+ Metal ----- "metal head", "metalhead"
+ EDM ------- "edm fan", "edm junkie", "edm lover", "edm enthusiast"

Next came the task of scraping data from each user. My goal was to scrape 40 tweets from each user, avoiding any tweets that contained links. Using the Twitter API requires logging in to a Twitter account and accounting for rate limits as well as caps on the number of tweets you can pull from a user. To get around this I found an advanced Twitter scraping tool built in Python, the Twitter Intelligence Tool, or TWINT for short. With a quick initial setup I was able to pull 40,000 tweets in under an hour.

The code for this portion of my project can be found on my GitHub under 1__ScrapeTweets.

### 5. Data Wrangling 

Music is differentiated by artists, slang, songs, albums, etc., which can make it very easy to identify what type of music a user enjoys. For this reason, I wanted to remove any tweets related to music and focus on the tweets left behind. In addition, I wanted to avoid tweets with only emojis or only one or two words.

To accomplish this task, I used both beginner and advanced NLP techniques with spaCy, an open source library for Natural Language Processing. More specifically, I used the technique known as semantic similarity. I go into this in more detail later in the notebook.

My goal was to remove any tweets related to music, but that can be very tricky:
1. Looking through 40,000 tweets by hand would be very time consuming.
2. I may miss a term within a genre that I am unfamiliar with. For example, Kanye West may be referred to as Kanye, Yeezus, Ye, Kanye West, etc.
3. Building on this further, an album name can consist of words that have nothing to do with music. Continuing the Kanye example, his new album is called "Jesus is King", none of which are words related to music.

To tackle this I decided to use semantic similarity. Using the word2vec algorithm and spaCy's prebuilt word embeddings, I measured the similarity of each tweet to the three words "music", "album" and "song". If the similarity was higher than 0.5, I removed the entire tweet from the dataset.
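The filtering idea can be sketched in plain Python. This is a minimal illustration and not the code from 2__DataWrangle: the three-dimensional vectors in `TOY_VECTORS` are invented for the example, the helper names are hypothetical, and it assumes a tweet is dropped when its averaged vector is similar to any one of the three music words. With spaCy, real embeddings would come from a model that ships with word vectors.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real word vectors have hundreds of dimensions.
TOY_VECTORS = {
    "music":   [0.9, 0.1, 0.0],
    "album":   [0.8, 0.2, 0.1],
    "song":    [0.9, 0.0, 0.1],
    "traffic": [0.0, 0.9, 0.2],
    "coffee":  [0.1, 0.1, 0.9],
    "morning": [0.0, 0.3, 0.8],
}

MUSIC_TERMS = ["music", "album", "song"]
THRESHOLD = 0.5  # similarity cutoff described in the text


def mean_vector(tokens):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_music_related(tweet, threshold=THRESHOLD):
    """Assumption: flag a tweet if it is similar to ANY of the music terms."""
    vec = mean_vector(tweet.lower().split())
    if vec is None:
        return False
    return any(cosine(vec, TOY_VECTORS[term]) > threshold for term in MUSIC_TERMS)


tweets = ["new album tonight", "coffee this morning", "stuck in traffic"]
kept = [t for t in tweets if not is_music_related(t)]
print(kept)
```

Averaging the word vectors before comparing means one strongly musical word can be diluted by many neutral words, which is one reason a fixed cutoff like 0.5 may let some music tweets through.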

Finally, I removed any tweets that were not in English using langdetect, leaving a total of 6,660 tweets across the five music genres.

The code for this portion of my project can be found on my GitHub under 2__DataWrangle.


### 6. Exploratory Analysis

__Predictive Features: Words__

| Genre      | Top Words              | Worst Words                  |
|------------|------------------------|------------------------------|
| Country    | fuck, stream, business | birthday, tour, amazing      |
| EDM        | trump, important, plus | india, rn, lbs               |
| Hip Hop    | meet, amazing, flight  | bro, homie, italy            |
| Jazz       | favorite, yeah, tho    | student, trump, constitution |
| Metal Rock | half, student, teach   | metal, hospital, sort        |

Looking at this comparison across the genres, no words or associations really jump out. However, there are some interesting tidbits, such as "trump" being among the most predictive words for EDM listeners while being among the least predictive for Jazz listeners.

Given that tweets were limited to 140 characters at the time I pulled this data, it may be that the limitation in length makes users' tweets much more similar to one another when they are not specifically mentioning music.

__Predictive Features: Emojis__

| Genre      | Top Emojis                            | Worst Emojis                      | 
|------------|---------------------------------------|-----------------------------------|
| Country    | pray, you, male                       | blush, hearts, tongue out winking |
| EDM        | cry, tongue out winking, blush        | 100, purple heart, pleading face  |
| Hip Hop    | heart, blush, two hearts              | male sign, facepalm, fire         |
| Jazz       | sweat smile, rolling eyes, two hearts | thinking, pray, rofl              |
| Metal Rock | fire, clap, 100                       | sweat smile, two hearts, wink     |

With the newer emoji library expanding to include racial diversity, individuals have many more options to choose from, yet there are still lots of similarities among Twitter users. Emojis such as fire, heart, 100 and pray appear frequently in tweets across users. Still, there are distinct differences. For example, Hip Hop users used heart emojis more in general than their counterparts.


__Word Frequency Counts and Word Clouds:__

To add a splash of visualization I decided to create some word clouds and count word frequencies based on the lemmatized text. However, once the distinctive portions of a person's tweets were removed, what was left behind were very similar, frequently used words. Some of these words were "go", "get", "like", "people", "say", etc.
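Counting word frequencies on lemmatized text can be sketched with the standard library alone. The tweets and the `top_words` helper below are hypothetical, and the sketch assumes the text has already been lemmatized and lowercased:

```python
from collections import Counter

# Hypothetical, already-lemmatized tweets grouped by genre label.
lemmatized = {
    "Jazz": ["go get coffee", "people say go"],
    "Metal Rock": ["get like people", "go go get"],
}


def top_words(tweets, n=3):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return counts.most_common(n)


# Frequencies across all genres combined.
overall = Counter(
    word
    for tweets in lemmatized.values()
    for tweet in tweets
    for word in tweet.split()
)

print(top_words(lemmatized["Jazz"], n=1))
print(overall.most_common(2))
```

Comparing the per-genre counts against the overall counts is a quick way to see whether the most frequent words are shared by every genre.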

For the full visualizations and details, go to github.com/sdesaidata and see 3__EDA.

### 7. Machine Learning

The machine learning model aims to take new tweets and predict which genre each tweet belongs to. Just to reiterate, the tweets had no mention of music whatsoever, as I had removed any that did.

__7.1. Vectorizer Selection__

Vectorizers encode text data as numbers, and the encoding is calculated differently depending on which method you choose. I decided to use two, CountVectorizer and TfidfVectorizer, which are both part of scikit-learn.

+ CountVectorizer: uses a "bag-of-words" approach where the text is analyzed by word counts. It counts the occurrences of each token in each tweet.
+ TfidfVectorizer: uses the term frequency (tf) and the inverse document frequency (idf) to create weighted term frequencies. For example, if the word "IPA" shows up in all the documents, it isn't very helpful for predictions, so this vectorizer downweights the word "IPA".
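The difference between the two encodings can be worked through by hand. The sketch below uses three made-up documents and plain Python rather than scikit-learn itself. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents its default TfidfVectorizer behavior:

```python
import math
from collections import Counter

# Made-up documents; "ipa" appears in every one of them.
docs = [
    "ipa with friends",
    "ipa after work",
    "ipa and pizza with friends",
]

tokenized = [d.split() for d in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Bag-of-words: raw count of each vocabulary term per document.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(tokenized)
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency (assumed formula, see above).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    raw = [Counter(doc)[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf = [tfidf_row(doc) for doc in tokenized]

for term in vocab:
    print(f"{term:8s} df={df[term]} idf={idf[term]:.3f}")
```

Because "ipa" occurs in every document it receives the smallest idf, so within each row it ends up with less weight than rarer words such as "pizza" or "work", even though its raw count is the same.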

To compare the two vectorizers I paired each one separately with a simple Naive Bayes classifier, MultinomialNB(), including bigrams (ngram_range = (1,2)). I used the classification report to compare accuracy and F1-scores across the five genres of music: 0 - Country, 1 - EDM, 2 - Hip Hop, 3 - Jazz, 4 - Metal Rock.
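For reference, the two metrics used in the comparison can be computed by hand. This is a minimal sketch with made-up labels, not the project's results. The helper names are hypothetical, and the label encoding follows the 0 to 4 mapping given above:

```python
GENRES = {0: "Country", 1: "EDM", 2: "Hip Hop", 3: "Jazz", 4: "Metal Rock"}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """Harmonic mean of precision and recall for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for ten tweets, two per genre.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 0, 3, 4, 4, 4]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for label, name in GENRES.items():
    print(f"{name}: F1 = {f1_for_label(y_true, y_pred, label):.2f}")
```

With five balanced classes, random guessing gives an accuracy of about 0.20, which is a useful baseline when reading the scores.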

When comparing them I noticed that there weren't distinct differences between the two vectorizers and decided to keep both moving forward when utilizing other classifiers. 

__7.2. Model Comparison__

Using both CountVectorizer and TfidfVectorizer, I fit and tuned four classifiers: logistic regression (LogisticRegressionCV, which has cross-validation built in), support vector machines (LinearSVC), Multinomial Naive Bayes (already done above) and random forests (RandomForestClassifier). Each classifier used an ngram_range of (1,2). I used F1 scores and overall accuracy to compare the models to one another.

The highest scoring model turned out to be Linear SVC. I've included the F1-scores per category, vectorizers, and overall accuracy by classifier below.

(Count Vectorizer)

| Classifier             | Overall Acc | Country (F1) | EDM (F1) | Hip Hop (F1) | Jazz (F1) | Metal Rock (F1) |
|------------------------|-------------|--------------|----------|--------------|-----------|-----------------|
| MultinomialNB          | 0.37        | 0.43         | 0.29     | 0.38         | 0.28      | 0.35            |
| LogisticRegressionCV   | 0.34        | 0.38         | 0.31     | 0.35         | 0.26      | 0.35            |
| RandomForestClassifier | 0.31        | 0.22         | 0.34     | 0.22         | 0.33      | 0.22            |
| LinearSVC              | 0.34        | 0.38         | 0.31     | 0.35         | 0.28      | 0.35            |


(Tfidf Vectorizer)

| Classifier             | Overall Acc | Country (F1) | EDM (F1) | Hip Hop (F1) | Jazz (F1) | Metal Rock (F1) |
|------------------------|-------------|--------------|----------|--------------|-----------|-----------------|
| MultinomialNB          | 0.32        | 0.42         | 0.15     | 0.39         | 0.04      | 0.21            |
| LogisticRegressionCV   | 0.34        | 0.39         | 0.19     | 0.30         | 0.20      | 0.40            |
| RandomForestClassifier | 0.32        | 0.38         | 0.20     | 0.35         | 0.20      | 0.35            |
| LinearSVC              | 0.34        | 0.41         | 0.27     | 0.36         | 0.27      | 0.27            |


### 8. Summary, Recommendations and Next Steps

Overall, the machine learning model did not perform very well. It has a lot of flaws and a lot of room for improvement. However, there are always positives to take away from failure, and there are definitely some here.

1. Acquiring tweets 
I was able to pull over 40,000 tweets in under 20 minutes using TWINT and FollowerWonk. This was very useful; however, a next step would be to determine to what extent a person actually listens to a specific genre of music. One option is to partner with a streaming company to merge Twitter data with, say, Spotify data to get a more accurate picture of loyal listeners of certain genres.

2. Removing Topic Related Tweets
After pulling tweets from individuals who said they were heavy listeners of a certain genre, I used advanced NLP techniques to remove any of their tweets related to music. This function can be useful in many other cases. Let's say you want to look at Reddit users in a sports thread, but only at messages unrelated to sports. If you look at 2__DataWrangle you will find a simple and cool solution for this.

HOWEVER, one caveat is that when you do this you may be left with very general and similar messages that are difficult to separate (which is what happened to me).

3. Cool things to Visualize
Text can tell you so much these days with the incorporation of bitmojis, emojis, more and more slang, etc. Looking at predictive features is always interesting because it can show what people in a certain segment are talking about or responding to. Also, everybody has their favorite emoji. From my analysis above I found that the hearts, 100 and pray emojis were very popular, which didn't surprise me much.

4. I was able to achieve accuracy up to 0.34 using a LinearSVC model, which is not great. Considering what I had removed and what I was left with, it is not too bad, but it needs A LOT of improvement. I've listed some next steps below if you want to take a stab at it, and this is something I will definitely be getting back to in the near future.

First, I would like to pull new data from more individuals, both to diversify the tweets and to increase the amount of data. I will also be more vigilant about pulling only English tweets, because that really limited my data sample, down to about 6,600 tweets.

I will also look into higher-level NLP strategies that are being tested right now and will be coming out in the near future, to see if I can break down the common language and find some pattern among individuals across genres.

And at the end of the day, who knows? There might not even be a connection. However, my passion for data, music and human behavior will continue to drive me to understand how the three blend together.
