# General data questions and exploration


In [2]:
##TODO:
# Try ML with and without network metrics
# Test at different time periods
# Test other event datasets
# Check steps from book - change to best practice sklearn


# What correlates with witness label
# GPS accounts tend to be spam/businesses?
# Compare GPS from stream / profile / hand coding
# Explore age of account
# Is detecting co-occurring tags viable?
# What kind of data/user is likely to be deleted?
# Is user name change / user deletion/protection a useful predictor
# Compare Change in network, whether it's useful to collect.
# Check gps count, location in profile
# Check timezone distribution
# 'Ordinary person' vs bot/celeb/business/news -- using source field, tweet rate, timezone

# prop of gps -- users and tweets. Automated, instagram sourced?
# prop of sources
# prop of media/urls
# users with location on profile? Some set 'in solidarity'?
# Circadian posting rhythm - can identify real people vs bots?
# location via friend network?
# language

# - Tweets which were automatically generated from Instagram posts were much more likely to include GPS coordinates, and as media, more likely to represent a ground truth. Therefore this content may be worth focusing on.
# - Aid requests were very rare. Those that were identified were often reposts rather than originals, and are often referring to the same original message which begins to trend.
# - Info for affected class should differentiate between immediate and non-immediate content. E.g. a call to mobilise a clean-up or rescue crew vs. a link to an insurance claim form.
# - For 'unrelated' messages, those which matched the keyword stream were highly represented by automated messages coming from a particular set of sources which presumably uses trending tags to gain exposure. This is easy to pre-filter.
# - Geographically-tagged Tweets are predominantly either: Instagram cross-posts, or automatically generated job listings from a small set of sources (and therefore easy to pre-filter).
        

# Sum of network edge reciprocity
# k-cohesiveness -- Structural cohesion

In [3]:
### Initialisation ###
import os
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = [6, 4]

EVENT_NAME = Event.objects.all()[0].name.replace(' ', '')
DIR = './data/harvey_user_location/'
DF_FILENAME = 'df_users.csv'

# Confirm correct database is set in Django settings.py
if 'Harvey' not in EVENT_NAME:
    raise Exception('Event name mismatch -- check database set in Django')

# Open original Dataframe
users_df = pd.read_csv(DIR + DF_FILENAME, index_col=0)
users_df.shape

(1500, 46)

First, because we treat the data as binary-coded, we need to account for the third code, 'unsure'. In most comparisons, these rows are implicitly treated as the opposite of whichever code is being tested. We can instead manually assign them to either value, or remove them entirely, given that they comprise a small proportion of the data.

In [4]:
unsure_code = (users_df.is_coded_as_witness == 0) & (users_df.is_coded_as_non_witness == 0)
print(sum(unsure_code), 'cases coded as \'unsure\'')

# Remove 'unsure' rows from data:
#users_df = users_df.loc[unsure_code==False]

# Assign 'unsure' rows to positive coded case:
#users_df.is_coded_as_witness = (users_df.is_coded_as_non_witness == False).astype(int)

# Assign 'unsure' rows to negative coded case:
#users_df.is_coded_as_non_witness = (users_df.is_coded_as_witness == False).astype(int)

31 cases coded as 'unsure'


## Geographic Metadata and Manual Coding
Manual coding of users targeted the perceived locality of the user to the event. We can compare the geographic metadata provided by Twitter to these codes to determine its usefulness as a predictor for this value.

### Profile Location Field
The first value to check is the location of a user as set in their profile. This is a user-set string. In an earlier notebook, this string was geocoded using the Google Maps API and evaluated for whether it fell within the bounding box defined for this event. We can therefore check whether this test correlates with the coded value.

First, we check what proportion of users provide a value in the field. We can then generate a confusion matrix showing the agreement between the profile locality where provided, and the coded value.

In [5]:
us = User.objects.filter(user_class__gt=0)
#us = us.filter(coding_for_user__coding_id=1, coding_for_user__data_code__data_code_id__gt=0)
tot = us.count()
tot_loc = us.count() - (us.filter(location="") | us.filter(location__isnull=True)).count()

print('Total users: ', tot)
print('Total users with location filled: ', tot_loc)
print('Proportion: {:.4}%'.format((tot_loc/tot)*100))

print('\nProportion of coded users with location filled: {:.4}%'.format((sum(users_df.location.notna())/users_df.shape[0])*100))

print('\nProportion of coded users with parseable location filled: {:.4}%'.format(
    (users_df.loc[(users_df.is_non_local_profile_location + users_df.is_local_profile_location) > 0].shape[0]/users_df.shape[0])*100))


Total users:  31932
Total users with location filled:  25619
Proportion: 80.23%

Proportion of coded users with location filled: 79.4%

Proportion of coded users with parseable location filled: 77.47%


In [6]:
vals = users_df.loc[users_df["is_local_profile_location"] == 1]["is_coded_as_witness"].value_counts()
vals2 = users_df.loc[users_df["is_coded_as_witness"] == 1]["is_local_profile_location"].value_counts()

print('{} of {} ({:.4}%) users were classified as having a local profile'.format(sum(vals), len(users_df), sum(vals)/len(users_df)*100))
print('{} of {} ({:.4}%) users were coded as a witness'.format(sum(vals2), len(users_df), sum(vals2)/len(users_df)*100))
print('{} of {} ({:.4}%) users with local profile locations were coded as witness'.format(vals[1], sum(vals), vals[1]/sum(vals)*100))
print('{} of {} ({:.4}%) witness codes had a local profile'.format(vals2[1], sum(vals2), vals2[1]/sum(vals2)*100))

397 of 1500 (26.47%) users were classified as having a local profile
386 of 1500 (25.73%) users were coded as a witness
258 of 397 (64.99%) users with local profile locations were coded as witness
258 of 386 (66.84%) witness codes had a local profile


In [7]:
import numpy as np  # required by calc_agreement_coefs (np.transpose)
import pandas as pd

def confusion_matrix(df: pd.DataFrame, col1: str, col2: str):
    """
    Given a dataframe with at least
    two categorical columns, create a 
    confusion matrix of the count of the columns
    cross-counts
    """
    return (
            df
            .groupby([col1, col2])
            .size()
            .unstack(fill_value=0)
            )


def confusion_matrix_from_series(s1, s2):
    """
    Returns confusion matrix for two binary
    series
    """
    df = pd.concat([s1, s2], axis=1)
    return confusion_matrix(df, s1.name, s2.name)


def calc_agreement_coefs(df: pd.DataFrame):
    """
    Calculates Cohen's Kappa and
    Krippendorff's Alpha for a
    given confusion matrix.
    """
    arr = df.to_numpy()
    n = arr.sum()
    p_o = 0
    for i in range(len(arr)):
        p_o += arr[i][i]/n
    p_e = 0
    for i in range(len(arr)):
        p_e += (arr.sum(axis=1)[i] *
                arr.sum(axis=0)[i]) / (n*n)
    kappa = (p_o-p_e)/(1-p_e)
    
    coin_arr = np.transpose(arr) + arr
    exp_distribution = [sum(x) for x in coin_arr]
    p_e_krippendorf = sum([a * (a-1) for a in exp_distribution])/(2*n*((2*n)-1))
    alpha = (p_o - p_e_krippendorf) / (1-p_e_krippendorf)
    
    return p_o, kappa, alpha


def calc_agreement_metrics(df: pd.DataFrame):
    """
    Calculates various agreement metrics
    for a given binary confusion matrix.
    
    Assumes the predicted condition as ROW heading, the true
    condition as COLUMN heading, and ascending integer labels.
    """
    arr = df.to_numpy()
    if len(arr) != 2:
        return None
    results = {}
    results['Prevalence'] = arr.sum(axis=0)[1]/arr.sum()
    results['Accuracy'] = (arr[0][0] + arr[1][1])/arr.sum()
    results['Prec'] = arr[1][1]/arr.sum(axis=1)[1]
    results['Recall'] = arr[1][1]/arr.sum(axis=0)[1]
    results['f1Score'] = (2 * results['Prec'] * results['Recall'])/(results['Prec']+results['Recall'])
    results['Specificity'] = arr[0][0]/arr.sum(axis=0)[0]
    results['FalseNegRate'] = arr[0][1]/arr.sum(axis=0)[1]
    p_o, kappa, alpha = calc_agreement_coefs(df)
    results['Cohen\'s Kappa'] = kappa
    results['Krippendorff\'s Alpha'] = alpha
    return results
    

In [8]:
# We exclude rows where either no profile location field was provided, 
# or the location was not parsed by the API:
loc_df = users_df.loc[(users_df.is_non_local_profile_location + users_df.is_local_profile_location) > 0]
print(users_df.shape[0] - loc_df.shape[0], 'rows with no parseable profile location value excluded')

conf = confusion_matrix(loc_df, 'is_local_profile_location', 'is_coded_as_witness')
conf

338 rows with no parseable profile location value excluded


| is_local_profile_location | is_coded_as_witness = 0 | is_coded_as_witness = 1 |
|---|---|---|
| 0 | 695 | 70 |
| 1 | 139 | 258 |


In [9]:
results = calc_agreement_metrics(conf)
res_df = pd.DataFrame.from_dict(results, orient='index', columns=['loc_prof_nona'])
res_df

| | loc_prof_nona |
|---|---|
| Prevalence | 0.282272 |
| Accuracy | 0.820138 |
| Prec | 0.649874 |
| Recall | 0.786585 |
| f1Score | 0.711724 |
| Specificity | 0.833333 |
| FalseNegRate | 0.213415 |
| Cohen's Kappa | 0.582731 |
| Krippendorff's Alpha | 0.581198 |
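
As a sanity check on the agreement coefficient, Cohen's Kappa can be recomputed directly from the four cell counts of the confusion matrix above. This is only a minimal sketch for cross-checking the table; the array name `conf_arr` is introduced here for illustration and is not part of the notebook's helper functions.

```python
import numpy as np

# Counts copied from the confusion matrix above
# (rows: is_local_profile_location, columns: is_coded_as_witness).
conf_arr = np.array([[695, 70],
                     [139, 258]])

n = conf_arr.sum()
row_totals = conf_arr.sum(axis=1)   # predicted non-local / local
col_totals = conf_arr.sum(axis=0)   # coded non-witness / witness

# Observed agreement: proportion of cases on the diagonal.
p_o = np.trace(conf_arr) / n
# Chance agreement: sum over classes of the product of the marginals.
p_e = (row_totals * col_totals).sum() / n**2
kappa = (p_o - p_e) / (1 - p_e)

precision = conf_arr[1, 1] / row_totals[1]
recall = conf_arr[1, 1] / col_totals[1]

print('p_o={:.4f} p_e={:.4f} kappa={:.4f}'.format(p_o, p_e, kappa))
print('precision={:.4f} recall={:.4f}'.format(precision, recall))
```

The values agree with the `loc_prof_nona` column to the precision shown there.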


As excluding the ~20% of values with no parseable location field provided is not an option in practice, we must decide to either discard them (i.e. by default classify as non-local) or include them (default classify as local). The first option will inevitably discard true positive cases, thus reducing recall, whereas the latter will introduce false positives, reducing precision:

In [10]:
conf = confusion_matrix(users_df, 'is_local_profile_location', 'is_coded_as_witness')
print('loc_prof_notna')
print(conf)
res_df['loc_prof_notna'] = calc_agreement_metrics(conf).values()


conf = confusion_matrix_from_series(
    pd.Series(
                (users_df.is_local_profile_location) | (users_df.location.isna()).astype(int), 
                name='is_local_profile_location_or_na'
             ),
    users_df.is_coded_as_witness)
print('\nloc_prof_orna')
print(conf)
res_df['loc_prof_orna'] = calc_agreement_metrics(conf).values()


res_df

loc_prof_notna
is_coded_as_witness          0    1
is_local_profile_location          
0                          975  128
1                          139  258

loc_prof_orna
is_coded_as_witness                0    1
is_local_profile_location_or_na          
0                                714   80
1                                400  306


| | loc_prof_nona | loc_prof_notna | loc_prof_orna |
|---|---|---|---|
| Prevalence | 0.282272 | 0.257333 | 0.257333 |
| Accuracy | 0.820138 | 0.822 | 0.68 |
| Prec | 0.649874 | 0.649874 | 0.433428 |
| Recall | 0.786585 | 0.668394 | 0.792746 |
| f1Score | 0.711724 | 0.659004 | 0.56044 |
| Specificity | 0.833333 | 0.875224 | 0.640934 |
| FalseNegRate | 0.213415 | 0.331606 | 0.207254 |
| Cohen's Kappa | 0.582731 | 0.538603 | 0.341243 |
| Krippendorff's Alpha | 0.581198 | 0.538725 | 0.309098 |


The results above are as expected. Excluding empty fields (classifying them as non-local) gives a precision/recall of 0.650/0.668, whereas including them (classifying them as local) gives 0.433/0.793. The exclusion strategy provides the higher F1 score; however, the purpose of the algorithm must be considered when choosing how to weight precision and recall. For example, given the algorithm is designed to curate the feed for human consumption, high precision is only necessary if the rate of positive cases exceeds the human's ability to parse the incoming stream. Where the rate is low, sacrificing precision is acceptable in order to present the human user with more cases, which they can then filter manually.

This concept will be explored in more depth later in the project. For now, it is sufficient to note the values as a baseline model.
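
One way to make the weighting of precision against recall explicit is the F-beta score, where `beta > 1` favours recall and `beta < 1` favours precision. The sketch below is illustrative only: the helper `f_beta` and the choice of `beta` values are assumptions introduced here, not part of the notebook's analysis, and the precision/recall pairs are copied from the results table above.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; recall is treated as beta times as important as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from the results table above.
strategies = {
    'loc_prof_notna': (0.649874, 0.668394),  # empty fields treated as non-local
    'loc_prof_orna': (0.433428, 0.792746),   # empty fields treated as local
}

# Illustrative beta values: 0.5 favours precision, 2 favours recall.
for beta in (0.5, 1, 2):
    scores = {name: round(f_beta(p, r, beta), 3)
              for name, (p, r) in strategies.items()}
    print('beta =', beta, scores)
```

With these figures the exclusion strategy scores higher at `beta = 1`, while the inclusion strategy edges ahead at `beta = 2`, which illustrates how the preferred baseline depends on how much recall is valued.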

### Tweet Stream Coordinates
When posting a Tweet, a user may attach geographic coordinates. The location of the device is provided by the hardware and automatically included with the Tweet (thus the user does not influence the input). A Tweet may also include, instead of specific coordinates, a 'Place' object -- a geographic region (defined by Twitter) which typically describes a location such as a city, state or other similarly-sized region.

To geolocate a user, we can therefore investigate their Twitter feed for any Tweets containing this geographic data and compare these to the bounding box of the observed event. The derived field therefore represents whether *any* of a user's Tweets were identified as 'local' during the event.

For this dataset, the Twitter feed for each observed user spanning the duration of the collection period was collected at the end of the collection period. The feed is therefore made up of Tweets detected during the collection period, and any other Tweets the user made during the period, before or after the detected Tweet, provided they existed at the end of the collection period.

Further work on this area should consider the following:
* Where a local Tweet has been detected, check the proportion of other Tweets containing geographic data.
* Where other geo-Tweets exist, check whether they are from the same point or move around -- consider recoding for where a user Tweets from within *and* without the bounding box.
* Where all geo-Tweets come from the same point, it is likely that location has been manually set (i.e. it is a storefront/business account). This may be verifiable by checking the Tweet source.
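
The checks listed above could be sketched per user as follows. This is a rough outline under stated assumptions rather than the project's implementation: the column names (`user_id`, `lat`, `lon`), the function `summarise_geo_tweets` and the bounding box values are all hypothetical.

```python
import pandas as pd

# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat);
# NOT the values used for this event.
BBOX = (-98.0, 27.0, -93.0, 31.0)


def summarise_geo_tweets(tweets: pd.DataFrame, bbox=BBOX) -> pd.DataFrame:
    """Per-user summary of geotagged Tweets (assumed columns: user_id, lat, lon)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    df = tweets.copy()
    df['has_geo'] = df['lat'].notna() & df['lon'].notna()
    df['in_box'] = (df['has_geo']
                    & df['lon'].between(min_lon, max_lon)
                    & df['lat'].between(min_lat, max_lat))
    # Round coordinates so that near-identical points count as one location.
    df['point'] = list(zip(df['lat'].round(4), df['lon'].round(4)))

    per_user = df.groupby('user_id').agg(
        n_tweets=('has_geo', 'size'),
        n_geo=('has_geo', 'sum'),
        n_local=('in_box', 'sum'),
    )
    per_user['prop_geo'] = per_user['n_geo'] / per_user['n_tweets']
    n_points = df[df['has_geo']].groupby('user_id')['point'].nunique()
    per_user['n_points'] = n_points.reindex(per_user.index, fill_value=0)
    # Users who Tweet from both within and without the bounding box.
    per_user['inside_and_outside'] = ((per_user['n_local'] > 0)
                                      & (per_user['n_local'] < per_user['n_geo']))
    # Several geo-Tweets, all from one point: possibly a manually set location.
    per_user['single_point'] = (per_user['n_geo'] > 1) & (per_user['n_points'] == 1)
    return per_user


# Toy data for illustration only.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'lat': [29.7, 29.7, None, 29.7, 40.0, None],
    'lon': [-95.4, -95.4, None, -95.4, -74.0, None],
})
print(summarise_geo_tweets(toy))
```

The `single_point` flag only marks candidates; as noted above, confirming a storefront or business account would still require checking the Tweet source.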

In [11]:
print('{} of {} ({:.3}%) users have Tweet from locality'.format(
    sum(users_df.has_tweet_from_locality), users_df.shape[0], 100*sum(users_df.has_tweet_from_locality)/users_df.shape[0]))

conf = confusion_matrix(users_df, 'has_tweet_from_locality', 'is_coded_as_witness')
conf

467 of 1500 (31.1%) users have Tweet from locality


| has_tweet_from_locality | is_coded_as_witness = 0 | is_coded_as_witness = 1 |
|---|---|---|
| 0 | 898 | 135 |
| 1 | 216 | 251 |


In [12]:
res_df['local_tw'] = calc_agreement_metrics(conf).values()
res_df

| | loc_prof_nona | loc_prof_notna | loc_prof_orna | local_tw |
|---|---|---|---|---|
| Prevalence | 0.282272 | 0.257333 | 0.257333 | 0.257333 |
| Accuracy | 0.820138 | 0.822 | 0.68 | 0.766 |
| Prec | 0.649874 | 0.649874 | 0.433428 | 0.537473 |
| Recall | 0.786585 | 0.668394 | 0.792746 | 0.650259 |
| f1Score | 0.711724 | 0.659004 | 0.56044 | 0.588511 |
| Specificity | 0.833333 | 0.875224 | 0.640934 | 0.806104 |
| FalseNegRate | 0.213415 | 0.331606 | 0.207254 | 0.349741 |
| Cohen's Kappa | 0.582731 | 0.538603 | 0.341243 | 0.42708 |
| Krippendorff's Alpha | 0.581198 | 0.538725 | 0.309098 | 0.425219 |


While the recall of this metric (0.650) is close to, though slightly less than, that of `loc_prof_notna` (0.668), its precision is lower (0.537 against 0.650), and thus it is an inferior predictor of the true condition.

We can however check whether the metric is capturing a different proportion of local users, and therefore improve upon the existing measures through combination. Using an OR condition will increase recall at the cost of precision; using an AND condition will increase precision at the cost of recall.

In [13]:
conf = confusion_matrix_from_series(
    pd.Series(
        ((users_df.is_local_profile_location) | (users_df.has_tweet_from_locality)),
        name = 'is_local_profile_location_or_local_tw'
        ),
    users_df.is_coded_as_witness)
# conf

res_df['loc_prof_notna_or_loc_tw'] = calc_agreement_metrics(conf).values()

conf = confusion_matrix_from_series(
    pd.Series(
        ((users_df.is_local_profile_location) & (users_df.has_tweet_from_locality)),
        name = 'is_local_profile_location_and_local_tw'
        ),
    users_df.is_coded_as_witness)
# conf

res_df['loc_prof_notna_and_loc_tw'] = calc_agreement_metrics(conf).values()

res_df

| | loc_prof_nona | loc_prof_notna | loc_prof_orna | local_tw | loc_prof_notna_or_loc_tw | loc_prof_notna_and_loc_tw |
|---|---|---|---|---|---|---|
| Prevalence | 0.282272 | 0.257333 | 0.257333 | 0.257333 | 0.257333 | 0.257333 |
| Accuracy | 0.820138 | 0.822 | 0.68 | 0.766 | 0.788 | 0.8 |
| Prec | 0.649874 | 0.649874 | 0.433428 | 0.537473 | 0.558219 | 0.653571 |
| Recall | 0.786585 | 0.668394 | 0.792746 | 0.650259 | 0.84456 | 0.474093 |
| f1Score | 0.711724 | 0.659004 | 0.56044 | 0.588511 | 0.672165 | 0.54955 |
| Specificity | 0.833333 | 0.875224 | 0.640934 | 0.806104 | 0.768402 | 0.912926 |
| FalseNegRate | 0.213415 | 0.331606 | 0.207254 | 0.349741 | 0.15544 | 0.525907 |
| Cohen's Kappa | 0.582731 | 0.538603 | 0.341243 | 0.42708 | 0.524972 | 0.42517 |
| Krippendorff's Alpha | 0.581198 | 0.538725 | 0.309098 | 0.425219 | 0.515676 | 0.421208 |


As we can see from the table above, the highest precision is observed from the classifier `loc_prof_notna_and_loc_tw` at 0.654 (though at a great cost to recall, which falls to 0.474). It should be noted that selecting for cases which satisfy both conditions may inadvertently select for a particular (as-yet unidentified) sub-category of user type and exclude other categories of value. Among the classifiers applicable to the full dataset (i.e. excluding `loc_prof_nona`), the highest F1 score comes from `loc_prof_notna_or_loc_tw` at 0.672, which has a precision of 0.558.

While the low scores of these metrics show that they cannot provide meaningful proxies for the manually assigned code, as classifiers they provide a suitable baseline from which to measure more sophisticated models.

## Tweet Source
An important distinguishing metadatum of a Tweet is the 'source' field, which represents the platform from which the Tweet was published. When creating a third-party application which can interact with the Twitter API, a developer must provide a descriptor string which populates this field. Because many third-party applications are designed for specific use-cases, this field provides useful information which characterises the motivations for, or the conditions under which, the Tweet was created. For example, the source `TweetMyJOBS` refers to a recruitment platform and thus is attached to Tweets advertising job listings.

We can look at the list of the most common sources within the entire dataset and compare this to the sources detected during the collection period. (Note that the complete dataset will still contain a selection bias and does not necessarily characterise regular Twitter use.)

In [44]:
# View most common sources from entire dataset:
from django.db.models import Count
from streamcollect.models import Tweet

ts = Tweet.objects.all()
print('Total Tweets:', ts.count(), '\n')

fieldname = 'source'
counts = ts.values(fieldname).order_by(fieldname).annotate(count=Count(fieldname)).order_by('-count')

for x in counts[:20]:
    if x['count'] > 10:
        print('{:.1f}% {}: {}'.format((x['count']/ts.count()*100), x['source'], x['count']))

Total Tweets: 1727438 

31.0% Twitter for iPhone: 535558
20.2% Twitter for Android: 348743
19.9% Twitter Web Client: 342908
5.5% IFTTT: 95360
2.7% Twitter for iPad: 45830
2.5% Twitter Lite: 43808
2.4% Instagram: 42045
1.9% TweetDeck: 32273
1.7% Facebook: 28536
1.6% Hootsuite: 27105
1.1% Paper.li: 18490
1.1% Botize: 18260
0.6% TweetMyJOBS: 10867
0.4% SafeTweet by TweetMyJOBS: 7650
0.4% Buffer: 6819
0.4% Google: 6469
0.3% WordPress.com: 5461
0.3% SocialNewsDesk: 4472
0.2% Mobile Web (M2): 3407
0.2% dlvr.it: 2972


In [46]:
# Count proportion of Tweets from first-party applications
first_party_sources = [
    'Twitter for iPhone',
    'Twitter for Android',
    'Twitter Web Client',
    'Twitter for iPad',
    'Twitter Lite',
    'TweetDeck',
    'Twitter for Windows',
    'Twitter for Mac',
    'Twitter for Windows Phone',
    'Twitter for BlackBerry',
    'Twitter for Android Tablets',
    'Twitter MMS'
    ]
fp_count = 0
for x in counts:
    if any([y in x['source'] for y in first_party_sources ]):
        fp_count += x['count']
print('{:.1f}% of total from first party clients: {}'.format((fp_count/ts.count()*100), fp_count))

78.5% of total from first party clients: 1355569


In [58]:
ts = Tweet.objects.filter(data_source=1)
print('Total for data_source=1 (keyword stream):', ts.count(), '\n')

fieldname = 'source'
counts = ts.values(fieldname).order_by(fieldname).annotate(count=Count(fieldname)).order_by('-count')

for x in counts[:20]:
    if x['count'] > 10:
        print('{:.1f}% {}: {}'.format((x['count']/ts.count()*100), x['source'], x['count']))

Total for data_source=1 (keyword stream): 31303 

28.6% Twitter for iPhone: 8959
25.5% Twitter Web Client: 7989
16.2% Twitter for Android: 5086
5.6% Paper.li: 1754
3.1% Hootsuite: 981
2.3% Instagram: 733
2.3% IFTTT: 720
2.2% Facebook: 682
2.1% TweetDeck: 658
2.1% Twitter for iPad: 652
1.9% Twitter Lite: 595
0.9% Buffer: 274
0.4% SocialNewsDesk: 138
0.4% Sprout Social: 126
0.3% Error-log: 105
0.3% Botize: 93
0.2% Periscope: 70
0.2% VoiceStorm: 69
0.2% Google: 62
0.2% Twitter for Windows: 62


In [60]:
ts = Tweet.objects.filter(data_source__gte=1, coordinates_lat__isnull=False)
print('Total for data_source=3 (geo stream)', ts.count(), '\n')

fieldname = 'source'
counts = ts.values(fieldname).order_by(fieldname).annotate(count=Count(fieldname)).order_by('-count')

for x in counts[:20]:
    if x['count'] > 10:
        print('{:.1f}% {}: {}'.format((x['count']/ts.count()*100), x['source'], x['count']))

Total for data_source=3 (geo stream) 15630 

76.6% Instagram: 11971
6.5% TweetMyJOBS: 1023
5.9% SafeTweet by TweetMyJOBS: 928
3.9% BubbleLife: 602
1.7% Foursquare: 268
1.5% Untappd: 234
0.8% Twitter for Android: 131
0.7% Hootsuite: 106
0.6% Twitter for iPhone: 88
0.4% circlepix: 55
0.3% TownTweet: 52
0.2% iOS: 28
0.2% Twitter for Android Tablets: 27
0.2% Crowdfire - Go Big: 25
0.1% Squarespace: 18
0.1% Twitter for Windows Phone: 15
0.1% Tweetbot for iΟS: 12


In [61]:
fp_count = 0
for x in counts:
    if any([y in x['source'] for y in first_party_sources ]):
        fp_count += x['count']
print('{:.1f}% of geo stream from first party clients: {}'.format((fp_count/ts.count()*100), fp_count))

1.7% of geo stream from first party clients: 262


Of the entire dataset of 1,727,438 Tweets, those published by first-party Twitter clients comprised 78.5% (1,355,569). In contrast, of the subset of 15,630 Tweets collected based on their location within the event's bounding box, only 1.7% (262) were published from first-party apps. Tweets cross-posted from Instagram comprised 76.6% of the geotagged Tweets. The high incidence of Instagram posts in the geographic stream therefore suggests that Instagram cross-posts are much more likely than other Tweets to include geographic data, which is preserved during the cross-posting process. We can check the rate across the entire dataset excluding the geo stream:

In [65]:
ts = Tweet.objects.filter(source='Instagram', data_source__lt=3)
print('Total Instagram Tweets (excl. geo stream):', ts.count())
ts_geo = ts.filter(coordinates_lat__isnull=False)
print('Total geotagged Instagram Tweets: {}, {:.1f}%'.format(ts_geo.count(), ts_geo.count()/ts.count()*100 ))

Total Instagram Tweets (excl. geo stream): 30123
Total geotagged Instagram Tweets: 9818, 32.6%
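
To put the Instagram rate in context, the same proportion could be computed for every source at once. The following is a minimal pandas sketch, assuming the Tweets have been exported to a DataFrame with `source` and `coordinates_lat` columns (mirroring the model fields used above); the function name `geotag_rate_by_source` and the toy data are hypothetical.

```python
import pandas as pd


def geotag_rate_by_source(tweets: pd.DataFrame, min_tweets: int = 1) -> pd.DataFrame:
    """Proportion of Tweets carrying coordinates, per source."""
    df = tweets.assign(is_geotagged=tweets['coordinates_lat'].notna())
    rates = df.groupby('source')['is_geotagged'].agg(
        n_tweets='size', n_geotagged='sum')
    rates['geotag_rate'] = rates['n_geotagged'] / rates['n_tweets']
    # Drop sparsely represented sources, then rank by geotag rate.
    rates = rates[rates['n_tweets'] >= min_tweets]
    return rates.sort_values('geotag_rate', ascending=False)


# Toy data for illustration only; not drawn from the dataset.
toy_tweets = pd.DataFrame({
    'source': ['Instagram', 'Instagram', 'Instagram',
               'Twitter for iPhone', 'Twitter for iPhone'],
    'coordinates_lat': [29.7, None, 29.8, None, None],
})
print(geotag_rate_by_source(toy_tweets))
```

A `min_tweets` threshold is included because sources with only a handful of Tweets would otherwise dominate the ranking.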


As Instagram posts are traditionally based upon the publication of a recently-taken photo, the high incidence of geotagging makes this class of message highly useful in supporting the development of situational awareness.