# Check (That Tweet) Yo Self 
## Prioritizing Tweets to Fact Check
###### Part 6: Clustering (Unsupervised Learning)
In the previous notebooks, we collected tweets about Coronavirus from around the time of Donald Trump's Lysol comment and connected them with a separate pull of user data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.ensemble import BaggingRegressor
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pickle
from sklearn.neighbors import KNeighborsClassifier

Now that we have gathered all the necessary user information, we are going to cluster based on user stats.

In [3]:
tweet = pd.read_csv('../data/tweet_users.csv')

In [4]:
tweet = tweet[:33200]

First, we'll engineer a few features that take a closer look at the tweet text and user info. Descriptions for each feature are below:
- **'len_user'** : length of the username
- **'big_feelings'** : proportion of the tweet in uppercase
- **'ratio'** : user's followers divided by how many accounts they are following
- **'has_url'** : user has a url as part of their profile
- **'has_location'** : user has a location listed on their profile
- **'has_bio'** : user has a bio as part of their profile
- **'len_bio'** : length of user's bio
- **'ratio_num_user'** : proportion of numeric characters in the username
- **'emotional_range'** : the absolute values of the tweet's negative and positive emotion scores added together and divided by 10, to describe the range of emotions used in the tweet (the idea stems from this [paper](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/2013/Analyzing_and_Predicting_Viral_Tweets.pdf))

In [5]:
tweet['len_user'] = [len(x) for x in tweet['author']]

In [6]:
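def per_upper(string):
    # Share of characters that equal their own uppercase form.
    # Note: characters without case (spaces, digits, punctuation) also pass this check.
    count = 0
    for s in string:
        if s == s.upper():
            count += 1
    ratio = count / len(string)
    return ratio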

In [7]:
tweet['big_feelings'] = tweet['text'].apply(per_upper)
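
Because `per_upper` compares each character with its uppercase form, spaces, digits, and punctuation are counted alongside capital letters. If a letters-only measure is wanted, a minimal sketch could look like the following. The name `per_upper_letters` is hypothetical and this variant was not used for the results in this notebook.

```python
def per_upper_letters(string):
    """Share of alphabetic characters that are uppercase (hypothetical variant)."""
    letters = [s for s in string if s.isalpha()]
    if not letters:
        # Avoid dividing by zero on empty or letter-free text
        return 0.0
    return sum(s.isupper() for s in letters) / len(letters)


# "WOW" contributes 3 uppercase letters out of 7 letters in total
print(per_upper_letters("WOW, 100% true"))
```

Swapping this in would change the `big_feelings` values, so any downstream numbers would need to be regenerated.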

In [8]:
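def get_ratio(followers, following):
    # Followers divided by following
    if following == 0:
        # Note: this branch has no return, so the function yields None
        # whenever following is 0; those nulls are filled with 0 further down.
        following = 1
    elif followers == 0:
        return 0
    else:
        return followers / following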

In [9]:
tweet['ratio'] = [get_ratio(m, n) for m, n in zip(tweet['user_followers'], tweet['user_following'])]
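
As written, `get_ratio` returns `None` whenever `following` is 0, because that branch reassigns `following` without reaching a `return`. A hypothetical variant with an explicit return on every path is sketched below. It assumes that "following nobody" should be treated as following one account, which is a different choice from the zero-fill used in this notebook.

```python
def get_ratio_explicit(followers, following):
    """Followers-to-following ratio with a return on every path (hypothetical)."""
    if followers == 0:
        return 0.0
    # Assumption: treat following == 0 as 1 to avoid dividing by zero
    return followers / max(following, 1)


print(get_ratio_explicit(10, 4))   # ordinary case
print(get_ratio_explicit(10, 0))   # follows no one
print(get_ratio_explicit(0, 0))    # no followers
```

With this variant, accounts that have followers but follow no one keep a large ratio instead of 0, so the clusters could come out differently.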

In [10]:
tweet['has_url'] = tweet['user_url'].notna().astype(int)
tweet['has_location'] = tweet['user_location'].notna().astype(int)
tweet['has_bio'] = tweet['user_bio'].notna().astype(int)

In [11]:
tweet['len_bio'] = [len(str(x)) for x in tweet['user_bio']]

In [12]:
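def is_numeric(string):
    # Proportion of characters in the string that are digits
    count = 0
    for s in string:
        try:
            int(s)
            count += 1
        except ValueError:
            # Not a digit, so leave the count unchanged
            pass
    return count / len(string)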

In [13]:
tweet['ratio_num_user'] = tweet['author'].apply(is_numeric)

In [14]:
sent = SentimentIntensityAnalyzer()

In [15]:
def emotion_range(string):
    # Score the text once, then combine the negative and positive intensities
    scores = sent.polarity_scores(string)
    emo_r = np.abs(scores['neg']) + scores['pos']
    return emo_r / 10

In [16]:
tweet['emotional_range'] = tweet['text'].apply(emotion_range)

In [17]:
tweet.drop(columns = ['user_bio', 'user_location', 'user_url'], inplace = True)

From earlier investigation of specific accounts, we concluded that a null `user_favorites` value means the account has not favorited anything, so we can impute 0 for the null values here.

In [18]:
tweet['user_favorites'] = tweet['user_favorites'].fillna(0)

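Likewise for ratio, the null values come from accounts that follow no one, since `get_ratio` returns nothing when `following` is 0. For an account with 0 followers and 0 following, this might be seen as a ratio of 1 (equal followers/following), but it really shows a lack of connection and engagement, so it is more appropriate to fill in 0. Note that this also sets the ratio to 0 for accounts that have followers but follow no one.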

In [19]:
tweet['ratio'] = tweet['ratio'].fillna(0)

In [20]:
tweet.dropna(inplace = True)

In [21]:
tweet.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33199 entries, 0 to 33199
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  33199 non-null  int64  
 1   time                33199 non-null  object 
 2   author              33199 non-null  object 
 3   author_id           33199 non-null  int64  
 4   associated_tweet    33199 non-null  int64  
 5   text                33199 non-null  object 
 6   links               33199 non-null  object 
 7   hashtags            33199 non-null  object 
 8   mentions            33199 non-null  object 
 9   reply_count         33199 non-null  int64  
 10  favorite_count      33199 non-null  int64  
 11  retweet_count       33199 non-null  int64  
 12  day                 33199 non-null  object 
 13  not_english         33199 non-null  float64
 14  hashtag_count       33199 non-null  int64  
 15  mention_count       33199 non-null  int64  
 16  word

We now have no missing values and can continue to clustering.

These are the user attributes we want to focus our clustering on:

In [22]:
user_cat = ['user_tweets', 'user_following', 'user_followers', 
             'ratio', 'has_url', 'has_location', 'has_bio']

In [23]:
to_cluster = tweet[user_cat]

Scaling the data before creating clusters:

In [24]:
ss = StandardScaler()
ss.fit(to_cluster)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [25]:
z_cluster = ss.transform(to_cluster)
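
Scaling matters here because counts such as `user_followers` span a far wider range than the 0/1 flags, and both KMeans and DBSCAN rely on distances. As a small illustration on made-up numbers (the `toy` array is not project data), `StandardScaler` subtracts each column's mean and divides by its standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two user columns with very different ranges
toy = np.array([[10.0, 0.0],
                [200.0, 1.0],
                [50000.0, 1.0],
                [300.0, 0.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# Same computation by hand, using the population standard deviation (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the distance calculation purely because of its units.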

After trying a range of different values for the number of clusters, 5 gave the highest jump in silhouette score while still having distinct characteristics between the categories.
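
The search over cluster counts is not shown in this notebook. One way it could be sketched is below, using synthetic `make_blobs` data as a stand-in for `z_cluster`; the range of `k` values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled user features
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=21)

scores = {}
for k in range(2, 8):
    # Fit KMeans for this k and score the resulting labels
    labels = KMeans(n_clusters=k, n_init=10, random_state=21).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```

Comparing the scores for consecutive values of `k` shows where the largest jump occurs; the cluster profiles still need to be inspected by hand to judge whether the groups are distinct.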

In [26]:
km = KMeans(n_clusters=10, random_state = 21)
km.fit(z_cluster)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=21, tol=0.0001, verbose=0)

In [27]:
db = DBSCAN(eps = .8, min_samples = 1000)

In [28]:
db.fit(z_cluster)

DBSCAN(algorithm='auto', eps=0.8, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=1000, n_jobs=None, p=None)

In [29]:
silhouette_score(z_cluster, km.labels_)

0.7495098100757066

In [30]:
silhouette_score(z_cluster, db.labels_)

0.7675174323581028

Our DBSCAN score is slightly higher and is producing more logical groupings.
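
One caveat when reading the DBSCAN score: `silhouette_score` treats every distinct label as a cluster, so the noise label `-1` is scored as if it were one more group. A rough sketch of scoring with and without the noise points follows. The data, `eps`, and `min_samples` below are illustrative assumptions, not the values used in this project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated blobs plus a few isolated points
X_blobs, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                        cluster_std=0.5, random_state=21)
isolated = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 20.0], [20.0, 0.0], [0.0, -10.0]])
X_demo = np.vstack([X_blobs, isolated])

labels = DBSCAN(eps=0.8, min_samples=10).fit(X_demo).labels_
clustered = labels != -1  # mask of points assigned to a real cluster

score_all = silhouette_score(X_demo, labels)
score_no_noise = silhouette_score(X_demo[clustered], labels[clustered])

print("noise points:", int((~clustered).sum()))
print("with noise as a group:", round(score_all, 3))
print("clustered points only:", round(score_no_noise, 3))
```

Reporting both numbers makes the comparison with KMeans, which has no noise label, easier to interpret.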

In [31]:
db.labels_

array([0, 0, 1, ..., 2, 0, 3])

Pickling this model to use later on:

In [33]:
with open('../models/dbscan.pkl', 'wb') as f:
    pickle.dump(db, f)

In [34]:
with open('../models/standardscaler.pkl', 'wb') as f:
    pickle.dump(ss, f)

In [35]:
tweet['user_group_db'] = db.labels_

Checking out the distribution of our clusters:

In [36]:
tweet['user_group_db'].value_counts()

 0    16418
 3     7342
 1     4247
-1     1903
 4     1768
 2     1521
Name: user_group_db, dtype: int64

The cluster labels are now stored in the DataFrame as `user_group_db`. The `-1` group holds the points that DBSCAN marked as noise.

Even though our clusters didn't take into account the specific "reach" (reply, favorite, and retweet count) for the tweet, we want to see how different clusters measure up in this capacity.

In [37]:
tweet['target'] = tweet['reply_count'] + tweet['favorite_count'] + tweet['retweet_count']
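
To compare reach across clusters, one option is to group by the cluster label and summarize `target`. The sketch below uses a tiny made-up DataFrame with the same column names; the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns of `tweet`
demo = pd.DataFrame({
    'user_group_db': [0, 0, 1, 1, -1],
    'reply_count': [1, 0, 5, 3, 0],
    'favorite_count': [2, 1, 40, 10, 0],
    'retweet_count': [0, 0, 12, 7, 1],
})
demo['target'] = demo['reply_count'] + demo['favorite_count'] + demo['retweet_count']

# Size, mean, and median reach for each cluster
reach_by_group = demo.groupby('user_group_db')['target'].agg(['count', 'mean', 'median'])
print(reach_by_group)
```

The median is included because engagement counts tend to be heavily skewed, so a few viral tweets can dominate the mean.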

In [38]:
tweet.head(2)

| | id | time | author | author_id | associated_tweet | text | links | hashtags | mentions | reply_count | ... | big_feelings | ratio | has_url | has_location | has_bio | len_bio | ratio_num_user | emotional_range | user_group_db | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1254190074595553281 | 2020-04-25 16:26:30 | Iam_helenna | 215204985 | 1254190074595553281 | Today, we have 1182 cases in Nigeria with 35 d... | [] | [''] | [''] | 37 | ... | 0.255556 | 1.357576 | 0 | 0 | 1 | 147 | 0.0 | 0.0195 | 0 | 289 |
| 1 | 1253828209075990531 | 2020-04-24 16:28:34 | KerryeHill | 2807727004 | 1253697753479331840 | There's no such thing as a medical disinfectan... | [] | [''] | [''] | 1 | ... | 0.217742 | 0.241706 | 0 | 0 | 1 | 89 | 0.0 | 0.014 | 0 | 3 |


Saving this DataFrame with the new features and cluster assignments to a CSV for further EDA in the next notebook.

In [39]:
tweet.to_csv('../data/user_cluster_tweets.csv', index = False)

In [40]:
tweet.shape

(33199, 39)

We realized later on that DBSCAN doesn't make predictions for new data, so we have to train a classifier on its labels in order to assign new observations to a cluster:

In [41]:
knn = KNeighborsClassifier()

In [42]:
X_knn = z_cluster
y_knn = tweet['user_group_db']

In [43]:
Xk_train, Xk_test, yk_train, yk_test = train_test_split(X_knn, y_knn, random_state = 21)

In [44]:
knn.fit(Xk_train, yk_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [45]:
knn.score(Xk_train, yk_train)

0.9993574039118037

In [46]:
knn.score(Xk_test, yk_test)

0.9985542168674699

Pickling this model for use later on:

In [47]:
with open('../models/knn.pkl', 'wb') as f:
    pickle.dump(knn, f)
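
To show how the saved scaler and classifier could work together on new users, here is a self-contained sketch on toy data. The temporary directory, the `demo_*` names, and the two-feature data are assumptions for illustration; in the project, the pickles under `../models/` would be loaded instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy "old" users: two tight groups in a two-feature space
rng = np.random.default_rng(21)
X_old = np.vstack([rng.normal(0, 0.1, size=(50, 2)),
                   rng.normal(5, 0.1, size=(50, 2))])

# Same order as the notebook: scale, cluster, then train KNN on the labels
demo_ss = StandardScaler().fit(X_old)
Z_old = demo_ss.transform(X_old)
demo_labels = DBSCAN(eps=0.5, min_samples=5).fit(Z_old).labels_
demo_knn = KNeighborsClassifier().fit(Z_old, demo_labels)

with tempfile.TemporaryDirectory() as tmp:
    # Save both fitted objects, then load them back as a later notebook would
    for name, obj in [('standardscaler.pkl', demo_ss), ('knn.pkl', demo_knn)]:
        with open(os.path.join(tmp, name), 'wb') as f:
            pickle.dump(obj, f)
    with open(os.path.join(tmp, 'standardscaler.pkl'), 'rb') as f:
        loaded_ss = pickle.load(f)
    with open(os.path.join(tmp, 'knn.pkl'), 'rb') as f:
        loaded_knn = pickle.load(f)

# New users must be scaled with the fitted scaler before prediction
X_new = np.array([[0.05, -0.02], [4.9, 5.1]])
new_labels = loaded_knn.predict(loaded_ss.transform(X_new))
print(new_labels)
```

The key point is that new rows go through the already fitted scaler (`transform`, not `fit_transform`) so they land in the same feature space the classifier was trained on.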