__author__ = 'Ricardo Pasquini'

# Counting tweets at the hexagon level. Complete procedure

Overview: This code counts georeferenced tweets at the hexagon level. Hexagons are uniquely identified in space using Uber's H3 library.  
The code also predicts a home location for each user. This later allows tweets to be counted by whether they were posted by residents, residents of neighboring hexagons, or non-residents.
The project is designed to scale to millions of tweets. All processing is done with MongoDB.

By default, the project generates the following collections in the MongoDB database:

tweets: Collection containing tweets. This collection is modified by incorporating the corresponding hex id. \
users: Collection containing users. This collection is modified by incorporating the home location data.\
hexcounts: Collection containing counts of tweets.


In [24]:
import sys
sys.path.append("../../py_func_NoEnvReq") 
import databasepopulation
import communicationwmongo as commu
import home_location as home
import analysis as a
import pymongo
import pandas as pd
import my_h3_functions as myh3

In [25]:
import importlib
importlib.reload(databasepopulation)
importlib.reload(home)
importlib.reload(a)
importlib.reload(myh3)

<module 'my_h3_functions' from '../../py_func_NoEnvReq\\my_h3_functions.py'>


This version of the code assumes that a different database will be used for each city.


If necessary, check your current working directory with the following command:

In [26]:
pwd

'C:\\Users\\emman\\Documents\\git\\twitter_and_displacement\\notebooks\\Sydney'

# 0. Connect to Mongo and define a specific database

By default, this connects to a MongoDB instance running on localhost.
If you are working in a cloud environment, the connection parameters must be changed manually.

This example uses sydney as the database name. Recall that each city uses its own database.


In [34]:
db=commu.connecttoLocaldb(database='sydney')

In [35]:
db.sydney

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'sydney'), 'sydney')

# 1. Raw data to Mongo
This function loads the files containing tweets into Mongo.
It assumes the files are named with a city prefix and a year, e.g. ba_2012.csv.
Choose the start and end years.
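
As a rough illustration of the naming convention, the file paths for a range of years could be built as shown below. The pattern `<cityprefix>_<year>.csv` is inferred from the `ba_2012.csv` example, the inclusive end year is an assumption, and `expected_tweet_files` is a hypothetical helper that is not part of `databasepopulation`.

```python
import os

def expected_tweet_files(path, cityprefix, yearstart, yearend):
    """Hypothetical helper: list the CSV paths implied by the naming convention."""
    # Assumption: one file per year, named <cityprefix>_<year>.csv
    return [
        os.path.join(path, f"{cityprefix}_{year}.csv")
        for year in range(yearstart, yearend + 1)  # yearend treated as inclusive
    ]

files = expected_tweet_files("data", "ba", 2012, 2014)
print([os.path.basename(f) for f in files])
```

Listing the expected files first makes it easy to spot a missing year before starting a long import.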


In [9]:
databasepopulation.populatetweets(db, path='/Users/emman/Box/Twitter data/Asia/Sydney/', cityprefix='sydney', yearstart=2012, yearend=2012)

Now populating year  2012
process completed


In [None]:
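# Rename the coordinate fields in every tweet document.
# update_many replaces the legacy update(..., multi=True), which was removed in PyMongo 4.
db.tweets.update_many({}, {'$rename': {"latitude": "lat", "longitude": "lon"}})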

# 2. Adding Hexids to tweets 
This function adds a hexagon id to each tweet.  

In [None]:
databasepopulation.addhexjob(db)

In [None]:
# Check and fix: the previous job may not have run to completion (probably a bug).
# Count the documents that were not processed, i.e. that still have no hex field.
db.tweets.count_documents({ "hex" : { "$exists" : False } })

In [None]:
# Process any missing documents
from pymongo.errors import BulkWriteError  # needed for the except clause below

cursor = db.tweets.find({ "hex" : { "$exists" : False } })
df = pd.DataFrame(list(cursor))
requests = databasepopulation.add_hexs_and_prepare_bulk_request(df, dataformat='raw')
try:
    db.tweets.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    print(bwe.details)

# 3. Database performance task: Add indexes to tweets

In [None]:
databasepopulation.create_indexes(db)

# 4. Populate Users Collection

In [None]:
databasepopulation.populate_users_collection(db)

# 5. Find Home Job for each user id in the database

In [None]:
print('Users with home location identified', db.users.count_documents({'hex9': { '$exists': True} }))
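
The `home_location` module is not shown in this notebook, so the following is only a minimal sketch of one common heuristic for home prediction: taking the hexagon a user tweets from most often. It is not necessarily the method the module implements. The field names `u_id` and `hex9` on tweets and the `min_tweets` threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def predict_home_hex(tweets, min_tweets=2):
    """Return {user_id: hex_id} using the most-frequent-hexagon heuristic.

    tweets: iterable of dicts with 'u_id' and 'hex9' keys (hypothetical names).
    min_tweets: minimum tweets in the winning hexagon to accept it as home.
    """
    counts = defaultdict(Counter)
    for t in tweets:
        counts[t["u_id"]][t["hex9"]] += 1

    homes = {}
    for user, hex_counts in counts.items():
        hex_id, n = hex_counts.most_common(1)[0]
        if n >= min_tweets:  # skip users with too little evidence
            homes[user] = hex_id
    return homes

sample = [
    {"u_id": 1, "hex9": "hexA"},
    {"u_id": 1, "hex9": "hexA"},
    {"u_id": 1, "hex9": "hexB"},
    {"u_id": 2, "hex9": "hexC"},
]
print(predict_home_hex(sample))  # {1: 'hexA'}
```

User 2 gets no home here because a single tweet falls below the threshold, which is why the count of users with `hex9` set can be lower than the total number of users.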


# 6. Generate a Hex-level collection including all hexs in the database

By default, the collection uses only the ids of resolution-9 hexagons as its ids.
Note that hexagons of different resolutions could coexist in the same collection because hex identifiers are unique. This is not implemented in the current version of the code, though.

In [None]:
home.job_findhomeandpopulate_hex9(db)

In [None]:
databasepopulation.populate_hexcounts_collection(db)

# 7. Count tweets in each hex by residents and non-residents

In [None]:
import analysis as a

In [None]:
#a.countandpopulatejob(db)
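
To make the counting logic concrete, here is a small in-memory sketch of how tweets in a hexagon might be split into residents, residents of neighboring hexagons, and non-residents. It is an illustration under stated assumptions, not the implementation in `analysis`: the `neighbors` lookup table, the category names other than `nonresidents`, and the choice to skip users without a predicted home are all hypothetical.

```python
from collections import defaultdict

def count_by_residence(tweets, homes, neighbors):
    """Count tweets per hexagon, split by the author's residence category.

    tweets: iterable of (user_id, hex_id) pairs.
    homes: {user_id: home_hex_id}; users without a home are skipped (assumption).
    neighbors: {hex_id: set of adjacent hex_ids} (hypothetical lookup table).
    """
    counts = defaultdict(lambda: {"residents": 0, "neighbors": 0, "nonresidents": 0})
    for user, hex_id in tweets:
        home = homes.get(user)
        if home is None:
            continue  # no home prediction, so the tweet cannot be classified
        if home == hex_id:
            category = "residents"
        elif home in neighbors.get(hex_id, set()):
            category = "neighbors"
        else:
            category = "nonresidents"
        counts[hex_id][category] += 1
    return dict(counts)

tweets = [(1, "A"), (1, "B"), (2, "A"), (3, "A")]
homes = {1: "A", 2: "B", 3: "Z"}
neighbors = {"A": {"B"}, "B": {"A"}}
print(count_by_residence(tweets, homes, neighbors))
```

In a real run the neighbor sets would come from H3 rather than a hand-written table.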

# 8. Query to the DB to create the counts dataframe 

In [None]:
import time 
start=time.time()
df=a.hexcountsresults_to_df(db, save=False)
print(time.time()-start)
# If save=True, the resulting dataframe is also saved as a pickle in ./hexcountsdf.pkl

In [None]:
df.info()

In [None]:
#df.to_csv('C:/Users/Emman/Desktop/Sydney_Hexes_RP.csv')

# Functions for Spatial Analysis
## A. Transform the dataframe with hexids into a geodataframe with hexagons as geometries

In [None]:
hexgdf = myh3.df_with_hexid_to_gdf(df, hexcolname='_id')
hexgdf.plot()

In [None]:
hexgdf.head()

## B. Transform the dataframe with hexids into a geodataframe with centroid points as geometries

#### Points can be used for spatial joins.

In [None]:
centroidsgdf = myh3.df_with_hexid_to_centroids_gdf(hexgdf, hexcolname='_id')
centroidsgdf.plot()

In [None]:
smooth_hexgdf = myh3.kring_smoother(hexgdf, hexcolname='_id',  metric_col='nonresidents')
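
The internals of `kring_smoother` are not shown here. As a hedged sketch of the general idea behind k-ring smoothing, the example below replaces each hexagon's metric with the unweighted mean over the hexagon and its adjacent hexagons. The `neighbors` table is a hypothetical stand-in for an H3 ring lookup, and the actual function may weight or aggregate differently.

```python
def kring_average(values, neighbors):
    """Average each hexagon's value with those of its adjacent hexagons.

    values: {hex_id: numeric metric}.
    neighbors: {hex_id: iterable of adjacent hex_ids} (hypothetical table).
    Neighbors missing from `values` are ignored rather than treated as zero.
    """
    smoothed = {}
    for hex_id, value in values.items():
        ring = [value] + [values[n] for n in neighbors.get(hex_id, ()) if n in values]
        smoothed[hex_id] = sum(ring) / len(ring)
    return smoothed

values = {"A": 10, "B": 20, "C": 0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"]}
print(kring_average(values, neighbors))  # {'A': 15.0, 'B': 10.0, 'C': 10.0}
```

Ignoring neighbors with no data, such as `D` above, avoids pulling edge hexagons toward zero. Treating them as zero instead would be an equally valid design choice, depending on the analysis.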

In [None]:
smooth_hexgdf.info()

In [None]:
smooth_hexgdf.plot(column='nonresidents')