# Capstone 2 - Music, Tweets and Language 
***

Music has always been an interest of mine. Personally it helps me be me. I can listen to it to relax, to focus, to work out and so much more. Besides what it can do at a personal level, music has the ability to connect people in ways they may or may not now. Thanks to the ever growing use of social media and technology, these connections are formed even more frequently. For my second capstone, I'm interested in seeing if twitter users with similar music interests can be identified by their tweets. 

Using previously scraped and cleaned twitter data, I will perform some machine learning to predict genre by text below. See the Scrape Tweets DataWrangle, and EDA notebooks for more information on how I acquired the data and analyzed the data
***

## Machine Learning

Because I have five genres to look at I will be performing multiclass classification using various Machine Learning classifiers and vectorizers to achieve the strongest results.

In [27]:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from scipy.stats import norm
import numpy as np
import scipy as sp
import pickle

# Import scikit-learn tools, vectorizers, transformer, and classifiers
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# import CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# import Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

# import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# import Logistic Regression CV Classifier
from sklearn.linear_model import LogisticRegressionCV

# import LinearSVC classifier
from sklearn.svm import LinearSVC


In [4]:
# Load in data
with open('data/tweetsML.pickle', 'rb') as b:
    tweets = pickle.load(b)

In [6]:
tweets.head()

Unnamed: 0,username,tweet,mentions,hashtags,retweet,Genre,word_count,cleaned,genre_cat,emojis
0,00sarrett,I got 5 others outta the bargain bin but they ...,[],[],False,EDM,19,get outta bargain bin one want add collection ...,1,[]
1,00sarrett,I do believe I’ve determined a suitable replac...,[],[#forwardthinking],False,EDM,24,believe determine suitable replacement twa ind...,1,[joy]
2,00sarrett,Both just started watching and finished The Se...,[],[],False,EDM,21,start watch finish seven deadly sin anime coup...,1,"[100, hearts]"
3,00sarrett,I got an offer today to move to another state ...,[],[#decisions],False,EDM,16,get offer today move another state bunk old fr...,1,[]
4,00sarrett,"Roflmao 😂 no doubt haha, this is the extent of...",[gabri_rae],[],False,EDM,13,roflmao doubt haha extent disagreement ever ac...,1,"[joy, 100]"


In [7]:
tweets = tweets.drop(['username', 'tweet', 'mentions', 'hashtags', 'retweet', 'word_count'], axis=1)

In [8]:
tweets.head()

Unnamed: 0,Genre,cleaned,genre_cat,emojis
0,EDM,get outta bargain bin one want add collection ...,1,[]
1,EDM,believe determine suitable replacement twa ind...,1,[joy]
2,EDM,start watch finish seven deadly sin anime coup...,1,"[100, hearts]"
3,EDM,get offer today move another state bunk old fr...,1,[]
4,EDM,roflmao doubt haha extent disagreement ever ac...,1,"[joy, 100]"


In [11]:
# Store tweet dataset into feature matrix and response vector
X_words = tweets['cleaned']
y_words = tweets['genre_cat']

# Instantiate CountVectorizer and TfidfVectorizer
count_vect = CountVectorizer(min_df=1, ngram_range=(1, 2)) 
tfidf_vect = TfidfVectorizer(min_df=1, ngram_range=(1, 2))


# Apply CountVectorizer 
X_count = count_vect.fit_transform(tweets['cleaned'].apply(str))
X_count = X_count.tocsc() 

# Apply TfidfVectorizer
X_tfidf = tfidf_vect.fit_transform(tweets['cleaned'].apply(str))
X_tfidf = X_tfidf.tocsc()


# Split train/test data for all data
Xtrain_count, Xtest_count, ytrain_count, ytest_count = train_test_split(X_count, y_words, random_state=17)
Xtrain_tfidf, Xtest_tfidf, ytrain_tfidf, ytest_tfidf = train_test_split(X_tfidf, y_words, random_state=17)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)

In [49]:
def evaluate_model(xtest, ytest, clf):
    """ 
    This function evaluates the effectiveness of a ML model and outputs F1 Scores, AUC score and Confusion Matrix
    """
    # Make predictions for Xtest
    y_pred = clf.predict(xtest)
    
    # Confusion matrix
    cm = metrics.confusion_matrix(ytest, y_pred)
    
    print(classification_report(ytest, y_pred))
    print('\nConfusion Matrix:\n', cm)

## Naive Bayes

In [None]:
# Instantiate multinomialNB()
nb_words_count = MultinomialNB(alpha=1, fit_prior=True)
nb_words_tfidf = MultinomialNB(alpha=1, fit_prior=True)

# Train model
nb_words_count.fit(Xtrain_count, ytrain_count)
nb_words_tfidf.fit(Xtrain_tfidf, ytrain_tfidf)

In [34]:
evaluate_model(Xtest_count, ytest_count, nb_words_count)

              precision    recall  f1-score   support

           0       0.34      0.59      0.43       361
           1       0.45      0.22      0.29       312
           2       0.36      0.41      0.38       335
           3       0.44      0.20      0.28       278
           4       0.36      0.35      0.35       381

    accuracy                           0.37      1667
   macro avg       0.39      0.36      0.35      1667
weighted avg       0.38      0.37      0.35      1667


Confusion Matrix:
 [[214  23  45  19  60]
 [109  68  59  21  55]
 [ 92  23 139  20  61]
 [ 91  13  58  56  60]
 [121  25  90  12 133]]


In [35]:
evaluate_model(Xtest_tfidf, ytest_tfidf, nb_words_tfidf)

              precision    recall  f1-score   support

           0       0.33      0.60      0.42       361
           1       0.78      0.08      0.15       312
           2       0.27      0.67      0.39       335
           3       1.00      0.02      0.04       278
           4       0.39      0.15      0.21       381

    accuracy                           0.32      1667
   macro avg       0.55      0.30      0.24      1667
weighted avg       0.53      0.32      0.25      1667


Confusion Matrix:
 [[215   3 125   0  18]
 [133  25 133   0  21]
 [ 86   2 226   0  21]
 [ 99   1 143   6  29]
 [125   1 199   0  56]]


Both of these results are not that great. It appears that Country music (0) and Hip Hop (2) were the best at being predicted. Let's trying using other models to see if we can get stronger results.

***

## LogisiticRegressionCV

In [41]:
# Instantiate and fit training data to Logistic Regression Model (CountVec)
log_clf_count = LogisticRegressionCV(scoring='accuracy', 
                                     class_weight='balanced', 
                                     cv=5, max_iter=1000).fit(Xtrain_count, ytrain_count)

# Instantiate and fit training data to Logistic Regression Model (TFIDF Vec)
log_clf_tfidf = LogisticRegressionCV(scoring='accuracy', 
                                     class_weight='balanced', 
                                     cv=5, max_iter=1000).fit(Xtrain_tfidf, ytrain_tfidf)



In [42]:
evaluate_model(Xtest_count, ytest_count, log_clf_count)

              precision    recall  f1-score   support

           0       0.35      0.42      0.38       361
           1       0.32      0.31      0.31       312
           2       0.37      0.32      0.35       335
           3       0.38      0.20      0.26       278
           4       0.31      0.40      0.35       381

    accuracy                           0.34      1667
   macro avg       0.35      0.33      0.33      1667
weighted avg       0.34      0.34      0.33      1667


Confusion Matrix:
 [[153  61  39  24  84]
 [ 76  97  40  18  81]
 [ 70  49 108  26  82]
 [ 61  31  44  55  87]
 [ 82  67  58  22 152]]


In [43]:
evaluate_model(Xtest_tfidf, ytest_tfidf, log_clf_tfidf)

              precision    recall  f1-score   support

           0       0.46      0.34      0.39       361
           1       0.61      0.12      0.19       312
           2       0.53      0.21      0.30       335
           3       0.37      0.14      0.20       278
           4       0.27      0.79      0.40       381

    accuracy                           0.34      1667
   macro avg       0.45      0.32      0.30      1667
weighted avg       0.44      0.34      0.31      1667


Confusion Matrix:
 [[123   8  11  18 201]
 [ 38  36  15  19 204]
 [ 37   1  69  16 212]
 [ 32   4  15  38 189]
 [ 39  10  20  11 301]]


Looking at the performance for Logistic Regression, using a Tfidf vectorizer it does appear to do slightly better than Naive Bayes. F1 Scores for Country, Hip Hop and Metal respectively 0.39, 0.3 and 0.4 respectively. Although this is not high, it is slightly better. Let's continue moving forward.

***
## Random Forests

In [44]:
# Instantiate and fit training data to Random Forest Model (CountVec)
forest_clf_count = RandomForestClassifier(class_weight='balanced',
                                     n_estimators=100).fit(Xtrain_count, ytrain_count)

# Instantiate and fit training data to Random Forest Model (TFIDF Vec)
forest_clf_tfidf = RandomForestClassifier(class_weight='balanced',
                                     n_estimators=100).fit(Xtrain_tfidf, ytrain_tfidf)

In [45]:
evaluate_model(Xtest_count, ytest_count, forest_clf_count)

              precision    recall  f1-score   support

           0       0.32      0.44      0.37       361
           1       0.37      0.15      0.22       312
           2       0.31      0.37      0.34       335
           3       0.52      0.14      0.22       278
           4       0.27      0.41      0.33       381

    accuracy                           0.31      1667
   macro avg       0.36      0.30      0.29      1667
weighted avg       0.35      0.31      0.30      1667


Confusion Matrix:
 [[158  28  59  12 104]
 [ 93  48  65   8  98]
 [ 81  11 125   9 109]
 [ 68  11  60  38 101]
 [ 99  32  88   6 156]]


In [46]:
evaluate_model(Xtest_tfidf, ytest_tfidf, forest_clf_tfidf)

              precision    recall  f1-score   support

           0       0.33      0.43      0.38       361
           1       0.33      0.14      0.20       312
           2       0.32      0.39      0.35       335
           3       0.44      0.13      0.20       278
           4       0.29      0.43      0.35       381

    accuracy                           0.32      1667
   macro avg       0.34      0.31      0.30      1667
weighted avg       0.34      0.32      0.30      1667


Confusion Matrix:
 [[157  29  58  13 104]
 [ 90  45  67  13  97]
 [ 72  16 132  13 102]
 [ 62  16  68  37  95]
 [ 93  31  84   8 165]]


Random Forests appears to be doing the worst in comparison to our other two models. This will be ignored and moved forward.

***
## LinearSVC

In [47]:
# Instantiate and fit training data to Random Forest Model (CountVec)
svc_count = LinearSVC().fit(Xtrain_count, ytrain_count)

# Instantiate and fit training data to Random Forest Model (TFIDF Vec)
svc_tfidf = LinearSVC().fit(Xtrain_tfidf, ytrain_tfidf)

In [50]:
evaluate_model(Xtest_count, ytest_count, svc_count)

              precision    recall  f1-score   support

           0       0.38      0.39      0.38       361
           1       0.35      0.28      0.31       312
           2       0.34      0.35      0.35       335
           3       0.37      0.23      0.28       278
           4       0.30      0.42      0.35       381

    accuracy                           0.34      1667
   macro avg       0.35      0.33      0.34      1667
weighted avg       0.35      0.34      0.34      1667


Confusion Matrix:
 [[141  45  49  25 101]
 [ 59  86  56  27  84]
 [ 56  34 118  28  99]
 [ 51  28  50  64  85]
 [ 69  51  70  30 161]]


In [51]:
evaluate_model(Xtest_count, ytest_count, svc_tfidf)

              precision    recall  f1-score   support

           0       0.30      0.67      0.41       361
           1       0.32      0.23      0.27       312
           2       0.40      0.33      0.36       335
           3       0.37      0.21      0.27       278
           4       0.40      0.21      0.27       381

    accuracy                           0.34      1667
   macro avg       0.36      0.33      0.32      1667
weighted avg       0.36      0.34      0.32      1667


Confusion Matrix:
 [[241  42  25  21  32]
 [146  73  37  26  30]
 [129  38 109  30  29]
 [122  24  48  58  26]
 [176  53  51  22  79]]


In my opinion LinearSVC perormed the best when using a CountVectorizer. With an overall accuracy of 0.34, F1 scores steadily remained between .28 to .38 across the genres of music.

***
## Next Steps:
Taking on the task to tackle classifying individuals by their tweets was very ambitious. To add on the factor of removing the topic, music, from the tweets prior to classifying only added another layer of complexity. From exploring the data I realized early on that there aren't many distinguising differences between people that I had scraped on Twitter after you remove the tweets that separate them into their respective musical genres. However I was able to achieve 0.34 accuracy using a LinearSVC model, which I'm sure can be improved signficantly in the future taking the following steps.

1. Spending more time text preprocessing. A lot of tweets had slang terms, broken grammar and english and shortened out words that may have misaligned the total weights when using vectorizers or predictive features. 
2. People listen to a lot of different music. Just because they say they listen to one doesn't necessarily mean they are only listening to that. It would be important to find respondents that are spending a majority (>70%) of their music listening time in a specific genre.
3. Considering using larger text chunks. With the limitation to tweets being only 140 characters, it may not be enough to really distinguish one section of words from another.

This would be very useful to many marketing sectors as it would unleash the potential to target individuals by their preferences based on their virtual messages and the patterns they relate to among other individuals.
