# Capstone 2 - Music, Tweets and Language 
***

Music has always been an interest of mine. Personally, it helps me be me. I can listen to it to relax, to focus, to work out, and so much more. Beyond what it does at a personal level, music can connect people in ways they may or may not know. Thanks to the ever-growing use of social media and technology, these connections form even more frequently. For my second capstone, I'm interested in seeing whether Twitter users with similar music interests can be identified by their tweets. Previously, I scraped 40,000 tweets from 1,000 different Twitter users.
***

## Data Wrangling

Music is differentiated by artists, slang, songs, albums, etc., which can make it very easy to identify what type of music a user enjoys listening to. For this reason, I want to remove any tweets related to music and focus on the tweets left behind. In addition, I want to avoid tweets with only emojis or only one or two words.

To accomplish this, I used both basic and advanced NLP techniques with spaCy, an open-source library for Natural Language Processing. More specifically, I used a technique known as semantic similarity, which I cover in more detail later in the notebook.

In [1]:
# Import necessary packages
import pandas as pd
import json 
import spacy
nlp = spacy.load('en_core_web_md')

import itertools
import collections

In [2]:
# FUNCTIONS

# Emoji Conversion Function
def convert_emoji(df):
    """Convert any emojis in each tweet of the dataframe to ASCII escape sequences, in place."""
    # Assign the whole column at once: chained df['tweet'].iloc[i] = ... may write to a copy
    df['tweet'] = df['tweet'].apply(lambda val: val.encode('unicode-escape').decode('ASCII'))
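
To make the conversion concrete, here is a small sketch of what the `unicode-escape` round trip does to a single string. The sample text is made up for illustration and is not taken from the dataset.

```python
# Hypothetical sample tweet (not from the dataset)
sample = "great show tonight 😂"

# Encode non-ASCII characters as escape sequences, then decode the bytes back to str
escaped = sample.encode('unicode-escape').decode('ASCII')

print(escaped)  # great show tonight \U0001f602
```

The result contains only ASCII characters, which makes emoji-only tweets easier to spot and inspect.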

In [3]:
# Import data
hiphop_tweets = pd.read_json('data/hiphoptweets.json', lines=True)
country_tweets = pd.read_json('data/countrytweets.json', lines=True)
jazz_tweets = pd.read_json('data/jazztweets.json', lines=True)
metal_tweets = pd.read_json('data/metaltweets.json', lines=True)
edm_tweets = pd.read_json('data/edmtweets.json', lines=True)

In [4]:
# Examine one dataframe to see contents of the data
hiphop_tweets.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
0,1185396204483555330,1185212741902057472,2019-10-19 03:24:13,2019-10-18,20:24:13,PST,1304920938,waxxgordon,Waxx Gordon,,...,,0,,,,,,,"[{'user_id': '1304920938', 'username': 'WaxXGo...",
1,1177041754371035136,1177037133552508928,2019-09-26 02:06:37,2019-09-25,19:06:37,PST,1304920938,waxxgordon,Waxx Gordon,,...,,0,,,,,,,"[{'user_id': '1304920938', 'username': 'WaxXGo...",
2,1176865647709564928,1176613793897558016,2019-09-25 14:26:50,2019-09-25,07:26:50,PST,1304920938,waxxgordon,Waxx Gordon,,...,,0,,,,,,,"[{'user_id': '1304920938', 'username': 'WaxXGo...",
3,1175851777566105600,1175790713755377664,2019-09-22 19:18:04,2019-09-22,12:18:04,PST,1304920938,waxxgordon,Waxx Gordon,,...,,0,,,,,,,"[{'user_id': '1304920938', 'username': 'WaxXGo...",
4,1175118782274199553,1174789269094838272,2019-09-20 18:45:25,2019-09-20,11:45:25,PST,1304920938,waxxgordon,Waxx Gordon,,...,,0,,,,,,,"[{'user_id': '1304920938', 'username': 'WaxXGo...",


In [5]:
# Drop unnecessary columns
cols_to_drop = ['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
                'user_id', 'name', 'place', 'urls','photos', 'replies_count', 'retweets_count', 
                'likes_count', 'cashtags', 'link', 'quote_url', 'video', 'near', 'geo','source', 
                'user_rt_id', 'user_rt', 'retweet_id', 'reply_to', 'retweet_date']

hiphop_tweets = hiphop_tweets.drop(cols_to_drop, axis=1)
country_tweets = country_tweets.drop(cols_to_drop, axis=1)
jazz_tweets = jazz_tweets.drop(cols_to_drop, axis=1)
metal_tweets = metal_tweets.drop(cols_to_drop, axis=1)
edm_tweets = edm_tweets.drop(cols_to_drop, axis=1)

In [6]:
# Add Genre Column to identify tweets
hiphop_tweets['Genre'] = 'Hip-Hop'
country_tweets['Genre'] = 'Country' 
jazz_tweets['Genre'] = 'Jazz'
metal_tweets['Genre'] = 'Metal Rock'
edm_tweets['Genre'] = 'EDM'

In [7]:
# Concat dataframes of tweets together
all_tweets = pd.concat([hiphop_tweets, country_tweets, jazz_tweets, metal_tweets, edm_tweets], ignore_index=True)

***
### Word Embeddings & Semantic Similarity 
My goal was to remove any tweets related to music, but that is tricky for a few reasons:
1. Looking through 40,000 tweets by hand would be very time consuming.
2. I may miss a term within a genre that I am unfamiliar with. For example, Kanye West may be referred to as Kanye, Yeezus, Ye, Kanye West, etc.
3. An album name can consist of words that have nothing to do with music. Continuing the Kanye example, his new album is called "Jesus is King", and none of those words are related to music.

To tackle this, I decided to use semantic similarity. Using spaCy's prebuilt word embeddings (vectors of the kind produced by algorithms such as word2vec), I compared each tweet to a single reference text made of the three words "music", "album", and "song". If the similarity was higher than 0.5, I removed the entire tweet from the dataset.
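
As a rough sketch of what the similarity score measures: by default, spaCy represents a multi-word text by the average of its token vectors and scores similarity as the cosine between two such averages. The three-dimensional vectors below are made-up toy values (the real vectors in `en_core_web_md` have 300 dimensions), and `average` and `cosine` are hypothetical helper names, not spaCy functions.

```python
import math

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy word vectors, invented for illustration only
toy_vectors = {
    "music": [0.9, 0.1, 0.0],
    "album": [0.8, 0.2, 0.1],
    "song": [0.85, 0.15, 0.05],
    "concert": [0.7, 0.3, 0.1],
    "tonight": [0.2, 0.6, 0.3],
    "traffic": [0.0, 0.2, 0.9],
    "delay": [0.1, 0.3, 0.8],
}

# Each "document" vector is the average of its word vectors
reference = average([toy_vectors[w] for w in ("music", "album", "song")])
music_like = average([toy_vectors[w] for w in ("concert", "tonight")])
unrelated = average([toy_vectors[w] for w in ("traffic", "delay")])

THRESHOLD = 0.5  # same cutoff as used in this notebook
for label, vec in (("music_like", music_like), ("unrelated", unrelated)):
    score = cosine(reference, vec)
    print(label, round(score, 3), "drop" if score > THRESHOLD else "keep")
```

With these toy values the music-like text scores above the cutoff and the unrelated one scores below it. Real tweets are noisier, so the 0.5 threshold is a judgment call rather than a guaranteed separator.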

In [8]:
# Build a single spaCy Doc from the words "music", "album", and "song"; its vector is the average of the three word vectors
music = nlp('music album song')

In [9]:
# Ignore UserWarnings for tweets that are only emojis
import warnings
warnings.simplefilter("ignore", UserWarning)


# Collect the index labels of tweets whose similarity to the music Doc exceeds 0.5
music_related = []
for i, tweet in zip(all_tweets.index, all_tweets['tweet']):
    # Parse the current tweet into a spaCy Doc (its vector averages the token vectors)
    tweet_doc = nlp(tweet)

    # If the current tweet is similar to music by more than 0.5, mark it for removal
    if tweet_doc.similarity(music) > 0.5:
        music_related.append(i)

# Drop all music-related tweets from the dataframe in one call
all_tweets.drop(music_related, inplace=True)

In [10]:
# Determine word count for each tweet
all_tweets['word_count'] = all_tweets['tweet'].apply(lambda x: len(str(x).split(" ")))
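
One caveat with the word count above, shown as a small sketch with a made-up string: `split(" ")` splits on every single space, so consecutive or trailing spaces produce empty strings that are counted as words, whereas `split()` with no argument splits on runs of whitespace. How often this affects the dataset is not measured in this notebook.

```python
# Hypothetical tweet with a double space and a trailing space
text = "so  tired today "

count_single_space = len(text.split(" "))  # empty strings are counted as words
count_whitespace = len(text.split())       # splits on runs of whitespace

print(count_single_space, count_whitespace)  # 5 3
```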

In [11]:
# Keep only tweets with more than 5 words
cleaned_tweets = all_tweets[all_tweets['word_count'] > 5]

In [12]:
# Within each user, sort tweets by word count (longest first)
g = cleaned_tweets.groupby(["username"]).apply(lambda x: x.sort_values(["word_count"], 
                                                                       ascending = False)).reset_index(drop=True)

In [13]:
# Create final dataframe with the 10 longest tweets per user
final = g.groupby('username').head(10)
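
The two steps above (sort each user's tweets by length, then keep the first 10 per user) can be illustrated on a toy dataframe. This is a minimal sketch assuming pandas is installed. The usernames and counts are invented, `n` is set to 2 instead of 10 to keep the output small, and a single `sort_values` call on both columns stands in for the `groupby(...).apply(...)` sort.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "username": ["ann", "ann", "ann", "bob", "bob"],
    "word_count": [7, 30, 12, 9, 15],
})

n = 2  # the notebook uses 10

# Sort by user, then by word count (longest first), and keep the first n rows per user
top_n = (
    toy.sort_values(["username", "word_count"], ascending=[True, False])
       .groupby("username")
       .head(n)
       .reset_index(drop=True)
)

print(top_n)
```

Users with fewer than `n` qualifying tweets simply contribute all the tweets they have, which is why some users in the final dataframe have fewer than 10 rows.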

In [14]:
final

Unnamed: 0,username,tweet,mentions,hashtags,retweet,Genre,word_count
0,00sarrett,I got 5 others outta the bargain bin but they ...,[],[],False,EDM,50
1,00sarrett,I do believe I’ve determined a suitable replac...,[],[#forwardthinking],False,EDM,46
2,00sarrett,Both just started watching and finished The Se...,[],[],False,EDM,38
3,00sarrett,I got an offer today to move to another state ...,[],[#decisions],False,EDM,28
4,00sarrett,"Roflmao 😂 no doubt haha, this is the extent of...",[gabri_rae],[],False,EDM,28
...,...,...,...,...,...,...,...
21999,zmaskm,I am just absolutely baffled as to why this wo...,[],[],False,Metal Rock,45
22000,zmaskm,"So hey, Guts in D&D; VERY high level, definit...",[],[],False,Metal Rock,37
22001,zmaskm,"When you really think about it, programming is...",[],[],False,Metal Rock,34
22002,zmaskm,I honestly wouldn't be surprised if we did get...,[ficklampa],[],False,Metal Rock,33


In [15]:
final['username'].value_counts()

sigitt_iddm        10
miss_mina13        10
santoscooks        10
alec_aandjgoods    10
2xtremerko         10
                   ..
bigbillshater       3
avela_rongo         3
gustav_aka          2
dopedrumkits        1
iamrachelv          1
Name: username, Length: 839, dtype: int64

In [16]:
final['Genre'].value_counts()

Hip-Hop       1793
Metal Rock    1732
Jazz          1607
EDM           1599
Country       1584
Name: Genre, dtype: int64

***
### Export Dataframe

In [17]:
final.to_csv('tweets.csv')

In [18]:
final.to_pickle('tweets.pickle')

In [19]:
final.to_json('tweets.json')