This notebook builds the DataFrame that serves as the backend for the visualization. For now, it completes the processing for July.

## Getting and saving the cleaned Twitter data
* We extract the necessary columns of the Twitter data from the .json files by running the code on the cluster and saving the resulting DataFrames as Parquet.
* The code that extracts the data is in cleaned_twitter.py

In [1]:
import os
import sys

spark_path = os.environ["SPARK_PATH"]
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")

In [2]:
import pandas as pd
import os
from tqdm import tqdm
from pyspark import SparkConf, SparkContext
import pyspark.sql
from pyspark.sql.types import *
from pyspark.sql import SQLContext
import time

conf = SparkConf().setAppName("ADA-gcl")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
#sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")


## Load the Parquet data

## DataFrames
* df1: num_following, num_followers, mentions
* df2: gender, geo_state
* df3: language, sentiment, location, main, date

In [3]:
path_user_info = 'file:///Users/tbfang/Documents/EPFL/ADA/ADA-Project/twitter_data/twitter_user_info_jul.parquet'
df1 = sqlContext.read.parquet(path_user_info)

path_gender = 'file:///Users/tbfang/Documents/EPFL/ADA/ADA-Project/twitter_data/twitter_gender_jul.parquet'
df2 = sqlContext.read.parquet(path_gender)

path_sentiment = 'file:///Users/tbfang/Documents/EPFL/ADA/ADA-Project/twitter_data/twitter_cleaned_jul.parquet'
df3 = sqlContext.read.parquet(path_sentiment)

In [4]:
df1 = df1.toPandas()
df2 = df2.toPandas()
df3 = df3.toPandas()

In [5]:
print(df1.shape[0], df2.shape[0], df3.shape[0])

795026 795026 795026


In [6]:
orig_df = df1.merge(df2,on='id').merge(df3,on='id')

In [7]:
df = orig_df.copy()  # working copy; orig_df stays untouched for the canton mapping later

In [9]:
code_mapping = pd.read_csv('twitter_data/canton_code.csv').iloc[:, 1:3]  # .ix was removed from pandas; .iloc selects the same two columns by position

In [12]:
code_mapping.columns = ['canton_code','canton']

In [13]:
df = code_mapping.merge(df, on='canton', how='inner')

In [14]:
df.count()

canton_code      357881
canton           357881
id               357881
num_followers    357881
num_following    357815
mentions         119627
gender           357868
language         357881
sentiment        316497
location         343200
main             357881
date             357881
dtype: int64
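The inner join above keeps 357881 of the 795026 rows, so it is worth checking which `canton` values fail to match the code table. The following is a minimal sketch on toy data, not the notebook's actual frames: the names `tweets`, `codes`, `checked`, and `unmatched` are hypothetical, and the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df and code_mapping (illustrative values, not real data).
tweets = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "canton": ["Aargau", "Zurich", "Somewhere else", None],
})
codes = pd.DataFrame({
    "canton_code": ["AG", "ZH"],
    "canton": ["Aargau", "Zurich"],
})

# A left join with indicator=True adds a "_merge" column that marks
# rows without a matching canton code as "left_only".
checked = tweets.merge(codes, on="canton", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# Most frequent unmatched canton strings, missing values included.
print(unmatched["canton"].value_counts(dropna=False).head())
print(len(tweets), "->", (checked["_merge"] == "both").sum())
```

Listing the most frequent unmatched strings shows whether rows are lost because of spelling variants (fixable) or because the tweet is simply not located in a canton.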

## Analyzing DF
* 28/80, ~35% have info about chatters
* 48/80, ~60% have cantons > 50%, pretty good!
* 70/80, ~87.5% have sentiment

### DataFrame Transformations
* DONE mentions -> number of mentions
* DONE parse date into month and time of day (6am to 6pm is day, 6pm to 6am is night)

In [15]:
# Missing mentions become an empty string so that len() yields 0 later
df['mentions'] = df['mentions'].fillna("")

### Converting date

In [16]:
def month(iso_time):
    time_struct = time.strptime(iso_time, "%Y-%m-%dT%H:%M:%SZ")
    return time_struct.tm_mon

In [17]:
def time_of_day(iso_time):
    time_struct = time.strptime(iso_time, "%Y-%m-%dT%H:%M:%SZ")
    hour = time_struct.tm_hour
    if 6 <= hour < 18:
        return "Day"
    else:
        return "Night"

In [18]:
df['num_mentions'] = df.mentions.apply(len)
df['time_of_day'] = df.date.apply(time_of_day)
df['month'] = df.date.apply(month)
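As an alternative to the row-by-row `apply` calls above, the same columns can be derived with vectorized pandas operations. This is a minimal sketch on three sample timestamps, assuming every date follows the `%Y-%m-%dT%H:%M:%SZ` layout used in the notebook; the variable names are hypothetical.

```python
import pandas as pd

# Sample timestamps in the same format as the notebook's date column.
dates = pd.Series([
    "2016-07-01T06:23:45Z",
    "2016-07-01T18:31:53Z",
    "2016-07-01T05:59:59Z",
])

# Parse once, then read month and hour from the datetime accessor.
parsed = pd.to_datetime(dates, format="%Y-%m-%dT%H:%M:%SZ")
month = parsed.dt.month

# Hours 6..17 inclusive are "Day", matching the 6 <= hour < 18 rule.
time_of_day = parsed.dt.hour.between(6, 17).map({True: "Day", False: "Night"})

print(pd.DataFrame({"date": dates, "month": month, "time_of_day": time_of_day}))
```

Parsing the column once avoids calling `time.strptime` twice per row, which matters on a few hundred thousand tweets.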

In [19]:
df.head()

Unnamed: 0,canton_code,canton,id,num_followers,num_following,mentions,gender,language,sentiment,location,main,date,num_mentions,time_of_day,month
0,AG,Aargau,1467354438402300035,34,97.0,,MALE,U,,Baden,#schweiz #switzerland #suisse #swiss #svizzera...,2016-07-01T06:23:45Z,0,Day,7
1,AG,Aargau,1467398079407500174,1634,1588.0,[simonshirley72],UNKNOWN,en,POSITIVE,Switzerland,@simonshirley72 go #storm didn't see that scor...,2016-07-01T18:31:53Z,1,Night,7
2,AG,Aargau,1467384385377100009,728,254.0,,FEMALE,en,POSITIVE,Switzerland,In vet's with Aethelflaed. The good thing abou...,2016-07-01T14:43:11Z,0,Day,7
3,AG,Aargau,1467403635318000029,1634,1588.0,[hannahmchugh_],UNKNOWN,en,POSITIVE,Switzerland,@hannahmchugh_ Not a good day at all.,2016-07-01T20:05:03Z,1,Night,7
4,AG,Aargau,1467380876526700035,1634,1588.0,"[callyowl, malcolm_fox2, babygibbo]",UNKNOWN,U,,Switzerland,@callyowl @malcolm_fox2 @babygibbo,2016-07-01T13:45:23Z,3,Day,7


## Tweet Level

In [20]:
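# Author-level copy: keeps the follower counts, drops the tweet-level fields
df_author = df.drop(['mentions','location','main','date','num_mentions'],axis=1)
# Tweet-level frame: drops the text fields and the follower counts
df.drop(['mentions','location','main','date','num_followers','num_following'],axis=1,inplace=True)
# Drops every row with a missing value (here: unknown gender or sentiment)
df.dropna(inplace=True)
# keeps 48/80 = 60% of data
# keeps only rows where gender is known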

## Cleaning up Cantons

The raw geo_state field contains about 170 distinct "cantons" that need to be cleaned. After the inner join with the canton code table, only the values below remain.

In [21]:
df['canton'].nunique()

23

## Handling Authors

We keep distinct authors only. I assume that rows with the same number of followers, the same number of following, and the same gender belong to the same person.

In [22]:
df_author.shape[0]

357881

In [23]:
df_author.drop_duplicates(['num_followers', 'num_following', 'gender'], inplace=True) # only the authors

We still need a notion of sample size: each group has to be large enough to be meaningful. We also need to remove outliers in num_followers.
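One possible way to address both points is sketched below on toy data. The 95th-percentile cap and the `MIN_AUTHORS` threshold are hypothetical choices for illustration, not values taken from this analysis, and the frame `authors` is a stand-in for `df_author`.

```python
import pandas as pd

# Toy author table (illustrative values); one account has a huge following.
authors = pd.DataFrame({
    "canton": ["Aargau"] * 5 + ["Bern"] * 2,
    "num_followers": [34, 728, 1634, 500, 2_000_000, 120, 90],
})

# Hypothetical outlier rule: clip follower counts at the 95th percentile.
cap = authors["num_followers"].quantile(0.95)
authors["num_followers_capped"] = authors["num_followers"].clip(upper=cap)

# Report the group size next to each average, plus the median,
# which is less sensitive to extreme accounts than the mean.
stats = authors.groupby("canton", as_index=False).agg(
    mean_followers=("num_followers_capped", "mean"),
    median_followers=("num_followers", "median"),
    n_authors=("num_followers", "size"),
)

MIN_AUTHORS = 3  # hypothetical minimum sample size per group
reliable = stats[stats["n_authors"] >= MIN_AUTHORS]
print(reliable)
```

Keeping `n_authors` in the aggregated frame lets the visualization hide or grey out groups that rest on too few authors.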

In [24]:
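# Select the numeric columns explicitly; recent pandas versions raise an error when averaging string columns
df_author = df_author.groupby(['canton','gender','language','time_of_day','month'], as_index=False)[['num_followers','num_following']].mean()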

In [25]:
# Encode the sentiment labels as numbers so that they can be averaged
sentiment_map = {'NEUTRAL': 0, 'POSITIVE': 1, 'NEGATIVE': -1}
df['sentiment'] = df.sentiment.map(sentiment_map)

In [26]:
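# Select the numeric columns explicitly; recent pandas versions raise an error when averaging string columns
df_user = df.groupby(['canton','gender','language','time_of_day','month'], as_index=False)[['sentiment','num_mentions']].mean()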

In [33]:
df.groupby(['canton','gender','language','time_of_day','month'], as_index=False).count().head()

Unnamed: 0,canton,gender,language,time_of_day,month,canton_code,id,sentiment,num_mentions
0,Aargau,FEMALE,de,Day,7,217,217,217,217
1,Aargau,FEMALE,de,Night,7,64,64,64,64
2,Aargau,FEMALE,en,Day,7,473,473,473,473
3,Aargau,FEMALE,en,Night,7,362,362,362,362
4,Aargau,FEMALE,es,Day,7,13,13,13,13


In [28]:
df_final = df_user.merge(df_author,on=['canton','gender','language','time_of_day','month'])

In [29]:
df_final.shape[0]

691

### Evaluating the goodness of the conditions
We want to make sure there are enough tweets in each category, and we want to minimize the number of groups whose average sentiment is exactly 0.

In [30]:
df_final[df_final['sentiment'] == 0].shape[0] # mean sentiment of 0: all tweets neutral, or positive and negative tweets cancel out

345
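A mean sentiment of exactly 0 can come from a group of purely neutral tweets or from positive and negative tweets cancelling each other. The sketch below, on toy data with hypothetical column names `n_tweets` and `share_neutral`, shows one way to tell the two cases apart by aggregating the group size and the neutral share alongside the mean.

```python
import pandas as pd

# Toy tweet-level frame with numeric sentiment (-1, 0, 1); illustrative only.
tweets = pd.DataFrame({
    "canton": ["Aargau", "Aargau", "Aargau", "Bern", "Bern"],
    "gender": ["FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"],
    "sentiment": [1, 0, -1, 0, 0],
})

# Aggregate the mean together with the group size and the share of
# neutral tweets, so a zero mean can be interpreted.
summary = tweets.groupby(["canton", "gender"], as_index=False).agg(
    sentiment=("sentiment", "mean"),
    n_tweets=("sentiment", "size"),
    share_neutral=("sentiment", lambda s: (s == 0).mean()),
)
print(summary)
```

Both toy groups average to 0, but only the Bern group consists entirely of neutral tweets.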

In [32]:
df_final.head()

Unnamed: 0,canton,gender,language,time_of_day,month,sentiment,num_mentions,num_followers,num_following
0,Aargau,FEMALE,de,Day,7,0.064516,0.917051,631.705882,349.117647
1,Aargau,FEMALE,de,Night,7,0.03125,0.75,620.75,404.625
2,Aargau,FEMALE,en,Day,7,0.22833,0.463002,1491.015873,448.031746
3,Aargau,FEMALE,en,Night,7,0.116022,0.638122,656.387097,547.548387
4,Aargau,FEMALE,es,Day,7,0.0,0.538462,156.5,81.5


## Final DF

* canton, gender, language, time of day, and month are the dimensions we want to filter by.
* sentiment, num_mentions, num_followers, and num_following are all averages for the specified categories. num_followers and num_following are averaged over the unique tweet authors only; sentiment and num_mentions are averaged over all tweets matching those filters.

* We will have one big DataFrame for July to October, which is good enough for the visualization.
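To illustrate how the visualization could query this frame, here is a minimal sketch. The sample rows reuse the first four rows of the `df_final.head()` output shown earlier; `filter_final` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

# First four rows of df_final.head() from this notebook.
df_final_sample = pd.DataFrame({
    "canton": ["Aargau"] * 4,
    "gender": ["FEMALE"] * 4,
    "language": ["de", "de", "en", "en"],
    "time_of_day": ["Day", "Night", "Day", "Night"],
    "month": [7, 7, 7, 7],
    "sentiment": [0.064516, 0.03125, 0.22833, 0.116022],
    "num_mentions": [0.917051, 0.75, 0.463002, 0.638122],
    "num_followers": [631.705882, 620.75, 1491.015873, 656.387097],
    "num_following": [349.117647, 404.625, 448.031746, 547.548387],
})


def filter_final(frame, **conditions):
    """Return the rows matching every column=value pair (hypothetical helper)."""
    mask = pd.Series(True, index=frame.index)
    for column, value in conditions.items():
        mask &= frame[column] == value
    return frame[mask]


selected = filter_final(
    df_final_sample, canton="Aargau", language="en", time_of_day="Day"
)
print(selected[["sentiment", "num_mentions", "num_followers", "num_following"]])
```

Because every filter dimension is a plain column, any subset of them can be passed, for example only `canton` and `month`.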

Need to do:
* Map location to cantons via Google Maps for January to June.

Spark/cluster work:
* January gender needs to be fixed

## Creating the Canton Map

Building the location-to-canton mapping from the July data alone does not work well: we lose too many locations. For January to June we therefore need to map locations to cantons directly, using Google Maps to automate the lookup.

In [None]:
orig_df.groupby('location').count().sort_values('id', ascending=False).head(100)

In [None]:
canton_mapping = orig_df[['canton','location']].dropna().drop_duplicates()

In [None]:
canton_mapping.groupby('canton').count().sort_values('location', ascending=False)

In [None]:
pd.DataFrame(canton_mapping.canton.unique(),columns=['Canton'])

In [None]:
code_mapping = pd.read_csv('twitter_data/canton_code.csv').iloc[:, 1:3]

In [None]:
canton_mapping.head()

In [None]:
code_mapping.head()

In [None]:
code_mapping.columns = ['canton_code','canton']

In [None]:
code_mapping.merge(canton_mapping, on='canton', how='inner')

In [None]:
canton_mapping.head()

In [None]:
canton_mapping.groupby('location').count().sort_values('canton', ascending=False)
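Locations that appear with more than one canton need a tie-breaking rule before they can be used as a lookup table. One possible rule, sketched below on toy data, keeps the most frequent canton per location and discards ambiguous locations. Note that it counts the raw (location, canton) pairs before duplicates are dropped; the `MIN_SHARE` threshold and all names here are hypothetical.

```python
import pandas as pd

# Toy (location, canton) pairs, one row per tweet; illustrative values only.
pairs = pd.DataFrame({
    "location": ["Baden", "Baden", "Baden", "Zurich", "Switzerland", "Switzerland"],
    "canton": ["Aargau", "Aargau", "Zurich", "Zurich", "Aargau", "Bern"],
})

# Count tweets per (location, canton) and the share within each location.
counts = pairs.groupby(["location", "canton"]).size().reset_index(name="n_tweets")
counts["share"] = counts["n_tweets"] / counts.groupby("location")["n_tweets"].transform("sum")

# Keep the most frequent canton for every location.
best = counts.sort_values(
    ["location", "n_tweets"], ascending=[True, False]
).drop_duplicates("location")

MIN_SHARE = 0.6  # hypothetical: drop locations without a clear majority canton
location_to_canton = (
    best[best["share"] >= MIN_SHARE].set_index("location")["canton"].to_dict()
)
print(location_to_canton)
```

Vague locations such as "Switzerland" fall below the threshold and are left unmapped rather than being assigned to an arbitrary canton.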

Creating a Canton to Location Mapping

In [None]:
swiss_cantons = ['AG', 'AI', 'AR', 'BE', 'BL', 'BS', 'FR', 'GE', 'GL', 'GR', 'JU', 'LU', 'NE', 'NW', 
                 'OW', 'SG', 'SH', 'SO', 'SZ', 'TG', 'TI', 'UR', 'VD', 'VS', 'ZG', 'ZH']

In [None]:
len(swiss_cantons)
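The list has 26 codes, while the cleaned July data contains 23 distinct cantons, so a quick set comparison shows which codes are missing. In the sketch below, `observed_codes` is a hypothetical placeholder; in the notebook it would presumably come from the `canton_code` column of `df`.

```python
swiss_cantons = ['AG', 'AI', 'AR', 'BE', 'BL', 'BS', 'FR', 'GE', 'GL', 'GR', 'JU', 'LU', 'NE', 'NW',
                 'OW', 'SG', 'SH', 'SO', 'SZ', 'TG', 'TI', 'UR', 'VD', 'VS', 'ZG', 'ZH']

# Placeholder for the codes actually present in the data,
# e.g. set(df['canton_code'].unique()) in the notebook.
observed_codes = {'AG', 'BE', 'ZH'}

# Cantons without any tweet, and codes in the data that are not valid cantons.
missing = sorted(set(swiss_cantons) - observed_codes)
unknown = sorted(observed_codes - set(swiss_cantons))

print(len(swiss_cantons), "cantons,", len(missing), "missing from the data")
print("unexpected codes:", unknown)
```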