In [3]:
%load_ext autoreload
%autoreload 2
import sys, codecs, json, os
import requests  # needed for requests.Session() in the captioning cells below
import ttools
from twython import TwythonStreamer, Twython
from datetime import datetime
from time import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Image captioning using the Microsoft API

# Note: this notebook was used for the top100 dataset. `imageCaption_userOnly.ipynb` covers user profile images and banner images for the whole dataset (no tweet images).

Let's read in the whole top100 dataset; it's not that big. Then we work on image captioning in chunks.

In [216]:
%%time
path = '/Users/olderhorselover/USC/fall2018/csci599/teamrepo/social-swear/misc/top100users_and_timelines_EXTENDED.csv'
# image_urls is stored as the string form of a list, so parse it back into a real list (empty cell -> None)
ll = lambda x: eval(x) if x else None
# the default '_normal.jpg' profile images are really small and the captioner can't handle them well; '_400x400.jpg' gets a larger version
ii = lambda x: x.replace('_normal.jpg','_400x400.jpg')
df = pd.read_csv(path,index_col=None, lineterminator='\n', converters={'image_urls':ll,'user_img_url':ii})

CPU times: user 2.35 s, sys: 200 ms, total: 2.54 s
Wall time: 2.56 s
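
The two converters can be tried out on a tiny in-memory CSV. This is only a sketch: the rows and URLs are made up, the helper names `parse_url_list` and `enlarge_profile_url` are ours, and `ast.literal_eval` is swapped in for `eval` as a safer way to parse a stored list (the cell above uses plain `eval`).

```python
import ast
import io

import pandas as pd

# Made-up two-row CSV standing in for the real file; the URLs are placeholders.
csv_text = '''tweet_id,image_urls,user_img_url
1,"['http://example.com/a.jpg', 'http://example.com/b.jpg']",http://example.com/p_normal.jpg
2,,http://example.com/q_normal.jpg
'''

def parse_url_list(cell):
    """Turn the stored string "['u1', 'u2']" back into a list; an empty cell gives None."""
    return ast.literal_eval(cell) if cell else None

def enlarge_profile_url(url):
    """Swap the thumbnail suffix for the 400x400 variant."""
    return url.replace('_normal.jpg', '_400x400.jpg')

sample = pd.read_csv(
    io.StringIO(csv_text),
    converters={'image_urls': parse_url_list, 'user_img_url': enlarge_profile_url},
)
print(sample['image_urls'].iloc[0])
print(sample['user_img_url'].iloc[0])
```

`ast.literal_eval` only accepts Python literals, so a malformed or malicious cell raises an error instead of executing code.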




First, let's gather each user's profile image and banner image. This should be quick, since there are only 83 users or so. We only need one row per user, and `drop_duplicates` gets us that.

In [217]:
# the banner url and profile image url are one per user, so we don't need all of a user's tweets; just keep one row each
users_only = df.drop_duplicates(subset=['user_id'],keep='first')
print(users_only.shape)

(83, 30)


Now, the fun part! Call the API wrapper to get the image captions. We just need to store a new dataframe with `user_id`, `user_img_caption`, `user_banner_caption`, `user_img_caption_clean`, and `user_banner_caption_clean`. Then, when using this data elsewhere, we can load this dataframe and do a merge.

In [218]:
%%time
session = requests.Session()  # reuse one session for faster repeated calls to the api
users_only['user_banner_caption'] = users_only.apply(lambda x: ttools.imageCaption(session,[x.user_banner_url]), axis=1)  # note: the lambda wraps the url in a list, as our api wrapper requires
users_only['user_img_caption'] = users_only.apply(lambda x: ttools.imageCaption(session,[x.user_img_url]), axis=1)  # note: the lambda wraps the url in a list, as our api wrapper requires

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


CPU times: user 551 ms, sys: 57.7 ms, total: 609 ms
Wall time: 2min 27s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
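
The `SettingWithCopyWarning` shown above appears because `users_only` is derived from `df` and then gets new columns. One common way to avoid it is to take an explicit copy before assigning. Here is a minimal sketch on toy data; `fake_caption` is a hypothetical stand-in for the captioning wrapper, not the real `ttools.imageCaption`.

```python
import pandas as pd

# Toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'user_img_url': ['http://example.com/a.jpg'] * 2 + ['http://example.com/b.jpg'],
})

def fake_caption(urls):
    """Hypothetical stand-in for the captioning wrapper; returns one string per URL."""
    return ['caption for ' + u for u in urls]

# .copy() makes the per-user frame independent of `toy`, so adding a column
# is an unambiguous write to the new frame.
toy_users = toy.drop_duplicates(subset=['user_id'], keep='first').copy()
toy_users['user_img_caption'] = toy_users['user_img_url'].apply(lambda u: fake_caption([u]))
print(toy_users)
```

The original `toy` frame is left untouched, which is what we want here since the captions live in their own dataframe anyway.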


Now, clean up the captions for easier use later.

In [219]:
%%time
users_only['user_banner_caption_clean'] = users_only.apply(lambda x: ttools.captionCleaner(x.user_banner_caption), axis=1)
users_only['user_img_caption_clean'] = users_only.apply(lambda x: ttools.captionCleaner(x.user_img_caption), axis=1)

CPU times: user 102 ms, sys: 2.35 ms, total: 104 ms
Wall time: 102 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


And there you have it! All our users now have captions for their profile images and, if they have one, their banner images. Let's take a look at a few.

In [220]:
from IPython.display import Image, display  # display is called explicitly in the loop below

In [221]:
locs = np.random.randint(0,size=5,high=83)
for loc in locs:
    print(users_only.iloc[loc]['user_id'])
    display(Image(url=users_only['user_banner_url'].iloc[loc],width=600,height=400))
    print(users_only.iloc[loc]['user_banner_caption_clean'])
    display(Image(url=users_only['user_img_url'].iloc[loc],width=200,height=200))
    print(users_only.iloc[loc]['user_img_caption_clean'])

783214


 a close up of a sign.


 a close up of a bird.
428333


 a stop sign.


 a drawing of a person.
176566242


 water, orange, sitting.


 a man standing in front of a brick wall.
85452649


 a close up of blue water.


 sport, water, looking.
268414482


 a close up of a logo.


 a close up of a logo.


Now, write out the dataframe with the `user_id`, `user_img_caption`, `user_banner_caption`, `user_img_caption_clean`, and `user_banner_caption_clean` fields. Then, when processing the full top100 dataset, just read that frame in and merge on `user_id`.

In [225]:
%%time
users_only = users_only[['user_id','user_img_caption','user_banner_caption','user_img_caption_clean','user_banner_caption_clean']]
users_only.to_csv('./ADDITIONAL_FEATURES/top100_user_img_and_banner_captions.csv',index=False)
users_only.to_pickle('./ADDITIONAL_FEATURES/top100_user_img_and_banner_captions.pkl')
del users_only

CPU times: user 5.3 ms, sys: 4.85 ms, total: 10.2 ms
Wall time: 12.4 ms
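
The later merge on `user_id` could look roughly like this. It is a sketch on toy frames (the tweet ids are made up); in practice the caption frame would be read back from the CSV or pickle written above.

```python
import pandas as pd

# Toy stand-ins for the full tweet frame and the saved caption frame.
tweets = pd.DataFrame({'tweet_id': [10, 11, 12], 'user_id': [1, 1, 2]})
captions = pd.DataFrame({
    'user_id': [1, 2],
    'user_img_caption_clean': ['a close up of a bird.', 'a stop sign.'],
})

# A left merge keeps every tweet and repeats the per-user caption on each of that user's tweets.
merged = tweets.merge(captions, on='user_id', how='left')
print(merged)
```

With `how='left'`, a user missing from the caption frame would simply get a missing value rather than dropping their tweets.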


## Now, to do this for each of the individual tweets...
Specifically, we use `image_urls` (`user_id` and `category` are not strictly necessary, but they help us with some cool stats), and we want to create a dataframe with `tweet_id`, `tweet_img_caption`, `tweet_img_caption_clean`, and `tweet_num_imgs`.
This will be tough time-wise. Look at the statistics:

In [328]:
print(f'original number of tweets: {df.shape[0]}')
t = df[~df['image_urls'].isna()]  # get the tweets that have at least one image in them
t = t[['tweet_id','image_urls','user_id','category']].drop_duplicates(subset=['tweet_id'])  # no need to keep the whole dataframe
t['tweet_num_imgs'] = t.apply(lambda x: len(x.image_urls),axis=1)  # get the number of images per tweet with images
print(f'number of tweets with at least one image: {t.shape[0]}')
print(f'total number of images: {t.tweet_num_imgs.sum()}')
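runtime83 = 73.5  # seconds to caption 83 images: half of the 2min 27s wall time above, which covered 83 banner + 83 profile images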
print(f'expected runtime for all images: {(((t.tweet_num_imgs.sum())/83)*runtime83)/3600:0.1f} hours')

original number of tweets: 203993
number of tweets with at least one image: 65150
total number of images: 72595
expected runtime for all images: 17.9 hours
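
The estimate above can be retraced step by step. This sketch only uses numbers already printed in this notebook; the assumption is that the 2min 27s wall time covered both the banner pass and the profile image pass over the 83 users, which is what makes 73.5 s per 83 images.

```python
# Figures taken from the cells above.
wall_seconds_two_passes = 2 * 60 + 27                  # banner pass + profile pass, 83 users each
seconds_per_83_images = wall_seconds_two_passes / 2    # the runtime83 value
seconds_per_image = seconds_per_83_images / 83
total_images = 72595

estimated_hours = total_images * seconds_per_image / 3600
images_per_half_hour = 30 * 60 / seconds_per_image

print(f'{seconds_per_image:0.3f} s per image')
print(f'{estimated_hours:0.1f} hours for all images')
print(f'{images_per_half_hour:0.0f} images in 30 minutes')
```

The last figure is where a chunk size of roughly 2000 images for a ~30 minute loop comes from.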


A quick, informal analysis shows which types of users in the top100 post the most photos in their tweets:

In [329]:
#cool! see how many images each user has posted...
imcount = t.groupby(['user_id','category']).agg({'tweet_num_imgs': 'sum'})  #just like (t[t['user_id']==428333])['tweet_num_imgs'].sum() but does for EVERY user_id
imcount.nlargest(n=10,columns=['tweet_num_imgs']).rename(columns={'tweet_num_imgs':'tweet_num_imgs_sum'})

user_id,category,tweet_num_imgs_sum
627673190,athlete,4047
96951800,athlete,2783
19895282,artist,2779
18839785,politician,2691
31348594,artist,2598
759251,company,2473
26257166,company,2375
428333,company,2349
2557521,company,2292
11348282,company,2232


Wow, companies make up half of the top 10 image posters! This is surprising, given that they represent only ~17% of the top100 users list. It would be interesting to see how OFTEN companies tweet compared to other categories, and also their ENGAGEMENT score. Do companies spend more time honing their engagement, and have they identified that tweets with images get more engagement? We can test the latter (whether tweets with more images get more engagement).

In [330]:
users_only = df.drop_duplicates(subset=['user_id'],keep='first')
users_only['category'].value_counts()/users_only.shape[0]

artist            0.674699
company           0.168675
athlete           0.072289
politician        0.060241
businessLeader    0.024096
Name: category, dtype: float64

In [327]:
#STUDY THIS LATER
# dfc = df.copy()
# dfc['score'] = (dfc['retweet_count'] + dfc['favorite_count'])/dfc['user_followers_count']
# dfc.groupby(['category']).agg({'score': 'mean'})
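
For when we come back to this, here is a rough sketch of how the commented-out score could be split by whether a tweet has images. The rows are made up purely for illustration, and having a `tweet_num_imgs` column on every tweet (0 for tweets without images) is an assumption; in this notebook that column only exists on the image-only frame `t`.

```python
import pandas as pd

# Made-up rows; column names follow the commented-out cell above.
toy = pd.DataFrame({
    'category': ['company', 'company', 'artist', 'artist'],
    'retweet_count': [10, 30, 5, 15],
    'favorite_count': [40, 70, 15, 25],
    'user_followers_count': [1000, 1000, 200, 200],
    'tweet_num_imgs': [0, 2, 0, 1],   # assumed to be 0 for tweets without images
})

# Engagement per follower, same formula as the commented-out cell.
toy['score'] = (toy['retweet_count'] + toy['favorite_count']) / toy['user_followers_count']
toy['has_image'] = toy['tweet_num_imgs'] > 0

# Mean score per category, split by whether the tweet carries an image.
mean_score = toy.groupby(['category', 'has_image'])['score'].mean()
print(mean_score)
```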

#### Okay... now that that tangent is over, let's start captioning for ALL the images
To guard ourselves, let's process this in chunks and run as time permits. Specifically, let's keep each loop iteration to ~30 mins. Based on our estimate from above, this would be about **2000 images per chunk!** Note that the split below is by tweet rows, so a chunk of 2000 tweets holds slightly more than 2000 images.

In [350]:
%%time
CHUNKSIZE = 2000
# splits t into a list of dataframes of CHUNKSIZE rows (tweets) each. the split points run up to the total image count,
# which is larger than the number of rows, so the last few chunks come out empty and get skipped below
imgdfs = np.split(t,np.arange(CHUNKSIZE, t['tweet_num_imgs'].sum(), CHUNKSIZE))
lenVerify = []
for i,idf in enumerate(imgdfs[33:]):  # step thru each dataframe chunk, resuming at chunk 33 because we interrupted our processing...
    i = i+33  # restore the true chunk index after resuming
    print('Working on chunk %s of %s'%(i,len(imgdfs)))
    if len(idf['image_urls']) < 1:  # empty chunk (no rows), skip
        print('No images in data chunk %s. Skipping this chunk'%i)
        continue
    lenVerify.append(idf.shape[0])
    startTime = time()
    session = requests.Session()  # reuse one session for faster repeated calls to the api
    print('starting the captioning')
    idf['tweet_img_caption'] = idf.apply(lambda x: ttools.imageCaption(session,x.image_urls), axis=1)  # note: the lambda no longer wraps x in a list as it did for user images; tweet image urls already live in a list, which is what our api wrapper requires
    print('done with captioning')
    print('starting cleaning of captioning')
    idf['tweet_img_caption_clean'] = idf.apply(lambda x: ttools.captionCleaner(x.tweet_img_caption), axis=1)
    print('done with cleaning of captioning, writing to csv/pkl')
    
    idf.to_csv('./ADDITIONAL_FEATURES/top100_tweet_img_captions_chunk_%s.csv'%(i),index=False)
    idf.to_pickle('./ADDITIONAL_FEATURES/top100_tweet_img_captions_chunk_%s.pkl'%(i))

    endTime = time()
    print('done writing. loop time for chunk %s'%(i))
    print(f'{(endTime - startTime)/60:0.2f} minutes\n\n\n')
print(sum(lenVerify))

Working on chunk 33 of 37
No images in data chunk 33. Skipping this chunk
Working on chunk 34 of 37
No images in data chunk 34. Skipping this chunk
Working on chunk 35 of 37
No images in data chunk 35. Skipping this chunk
Working on chunk 36 of 37
No images in data chunk 36. Skipping this chunk
0
CPU times: user 24.1 ms, sys: 30.5 ms, total: 54.6 ms
Wall time: 69.9 ms


Working on chunk 0 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 0
35.70 minutes



Working on chunk 1 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 1
40.44 minutes



Working on chunk 2 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 2
35.68 minutes



Working on chunk 3 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 3
43.72 minutes



Working on chunk 4 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 4
38.66 minutes



Working on chunk 5 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 5
43.12 minutes



Working on chunk 6 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 6
42.28 minutes



Working on chunk 7 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 7
44.90 minutes



Working on chunk 8 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 8
77.62 minutes



Working on chunk 9 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 9
40.82 minutes



Working on chunk 10 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing.



Working on chunk 11 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 11
33.95 minutes



Working on chunk 12 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 12
45.15 minutes



Working on chunk 13 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 13
37.42 minutes



Working on chunk 14 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 14
34.59 minutes



Working on chunk 15 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 15
34.77 minutes



Working on chunk 16 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 16
40.48 minutes



Working on chunk 17 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 17
32.47 minutes



Working on chunk 18 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 18
34.93 minutes



Working on chunk 19 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 19
41.32 minutes



Working on chunk 20 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 20
34.60 minutes



Working on chunk 21 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 21
31.09 minutes



Working on chunk 22 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 22
37.68 minutes



Working on chunk 23 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 23
35.42 minutes



Working on chunk 24 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 24
38.12 minutes



Working on chunk 25 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 25
41.29 minutes



Working on chunk 26 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 26
47.01 minutes



Working on chunk 27 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 27
66.19 minutes



Working on chunk 28 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 28
31.54 minutes



Working on chunk 29 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 29
26.83 minutes



Working on chunk 30 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 30
37.45 minutes



Working on chunk 31 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 31
31.58 minutes



Working on chunk 32 of 37
starting the captioning
done with captioning
starting cleaning of captioning
done with cleaning of captioning, writing to csv/pkl
done writing. loop time for chunk 32
20.91 minutes

Working on chunk 33 of 37
No images in data chunk 33. Skipping this chunk
Working on chunk 34 of 37
No images in data chunk 34. Skipping this chunk
Working on chunk 35 of 37
No images in data chunk 35. Skipping this chunk
Working on chunk 36 of 37
No images in data chunk 36. Skipping this chunk
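
The trailing empty chunks and the hand-edited restart index could both be avoided by chunking on row offsets and skipping chunks whose output file already exists. This is a sketch on toy data, not what was run above: `fake_caption` is a hypothetical stand-in for the captioning wrapper, and the output goes to a throwaway temp directory rather than `./ADDITIONAL_FEATURES/`.

```python
import os
import tempfile

import pandas as pd

def fake_caption(urls):
    """Hypothetical stand-in for the captioning wrapper; returns one string per URL."""
    return ['caption for ' + u for u in urls]

# Toy frame: 5 tweets, each with a list of placeholder image URLs.
toy = pd.DataFrame({
    'tweet_id': range(5),
    'image_urls': [['http://example.com/%d.jpg' % i] for i in range(5)],
})

CHUNKSIZE = 2
out_dir = tempfile.mkdtemp()  # throwaway directory for the demo
processed = []                # chunk indices actually captioned

def run_chunks():
    # Step over row offsets, so every chunk is non-empty by construction.
    for i, start in enumerate(range(0, len(toy), CHUNKSIZE)):
        out_path = os.path.join(out_dir, 'chunk_%s.csv' % i)
        if os.path.exists(out_path):
            continue  # written by an earlier run, so resume past it
        chunk = toy.iloc[start:start + CHUNKSIZE].copy()
        chunk['tweet_img_caption'] = chunk['image_urls'].apply(fake_caption)
        chunk.to_csv(out_path, index=False)
        processed.append(i)

run_chunks()  # first run writes every chunk
run_chunks()  # second run finds all files present and does nothing
print(processed)
```

Checking for the output file means an interrupted run can be restarted without editing the loop.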